
We have AI at home...

Due to a recent set of interesting developments earlier this year, $dayjob now revolves around energy efficient machine learning on embedded systems, neural architecture search and sometimes a bit of neuromorphic computing. The only problem is: I don't know anything about machine learning. To be fair: I know just enough to know what I don't know. Fortunately for me, I don't need to know all the mathematical details for now. While I would like to look into those at some point, what matters for now are ML frameworks that I can use and best practices that I can stick to. That will probably get me 80% of the way to where I want to be for now.

However, I also don't have any experience with those frameworks and the tooling that goes with them. In addition to that, a lot of the stuff I work with happens to run on a GPU cluster. This means I get to mess around with trying to keep NVIDIA drivers, the CUDA runtime, and whatever ML framework I'm using compatible with each other, while also cursing at conda and python virtual environments. Also, I don't have root access, so I can't just yeet everything and do a cleanish installation like all those stackoverflow posts tell me to do. What I need is a better understanding of the various components involved in that setup. Which component has to be compatible with which other component, and how do they all interact, so that I can make my model identify all the cats in the image?

Of course, I did what any good hacker would do and tried to RTFM first. Unfortunately this is one of those occasions where that will only get you so far. In reality any sufficiently deep tech stack will need some duct tape in the form of "if x happens you can work around it by doing y". That is arcane knowledge, which you can only obtain by messing around and finding out. Traditionally, this means one of two things for me: Either obtaining a working setup and repeatedly breaking it and fixing it until I understand what makes it tick, or alternatively obtaining the raw materials and building a working setup from scratch.

I decided to attempt the latter, at home in my free time, because that meant I could mess around inefficiently and aimlessly until I found out, and also explore any weird fun tangents that I came across.

It's probably also worth pointing out that this post is not a tutorial; it's just the written account of what I did and some of the issues I encountered along the way. You can try to follow along, but be prepared to do your own research and read through the prior art linked below.

Hard(ware) Choices

So I've decided to build a GPU compute server for my home lab as a learning experience. It should at least be able to run some basic ML workloads, e.g. use easyocr to sort some photos from a mountain bike race by plate numbers, or use SentenceTransformers to build a semantic search index over some documents (there is an ongoing side project that you may or may not know about already, if you follow me on mastodon). Also, it would be interesting if it could run some decently sized LLMs locally, so that I can mess around with those without getting my API access revoked for trying weird stuff. (There is another blog post in the making, stay tuned.) If I'm going through the effort of building a dedicated system for ML stuff, I might as well go all the way.

Running LLMs is probably the more challenging one of those use cases. In my experience there are two major limiting factors for doing inference with larger models: Firstly, you need enough VRAM to load the model. Some runtimes let you get away with running only some layers of the LLM on the GPU and running the rest on your CPU, but the performance is usually not that great. Even a slower, older GPU will give you more tokens per second, if you can fit the model into its VRAM. Secondly, there is the number and speed of your compute cores. Those numbers pretty much determine how many tokens per second you can get out of the LLM. I don't care as much about speed. If it spits out tokens a bit faster than I can read them, it's probably fine for my experiments.
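To get a feeling for the VRAM side of things, here is a quick back-of-the-envelope sketch of how much memory just the weights of a model take. The parameter counts and bytes-per-parameter figures are illustrative assumptions; KV cache and runtime overhead come on top:

def weights_gib(params_billion: float, bytes_per_param: float) -> float:
    """Memory for the model weights alone, in GiB."""
    return params_billion * 1e9 * bytes_per_param / 1024**3

# 2 bytes per parameter for FP16, roughly 0.6 for a ~4.5-bit quantisation
for label, params, bpp in [("7B FP16", 7, 2.0), ("13B FP16", 13, 2.0), ("32B ~4.5-bit", 32, 0.6)]:
    print(f"{label}: ~{weights_gib(params, bpp):.0f} GiB")

Even a heavily quantised 32B model wants well over 10GB just for its weights, which is why the amount of VRAM matters more to me than raw speed.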

With those constraints in mind I went looking for GPUs. Of course, I could have gone with some older consumer GPU. At the time of writing you can get a used GTX1080ti with 11GB of VRAM and about 11 TFLOP/s of FP32 for around 200€.

The question is: can we do better than that for our application if we look at old datacentre GPUs? Well... you can get a used Tesla K80 with 24GB and about 8 TFLOP/s for less than 100€. It's slower, but it has a lot of VRAM and you could get two of them for the price of one GTX1080ti. Also, it is no longer supported by the latest NVIDIA drivers, so it will only get cheaper, as you can't use them in any serious production setups any more.

On top of that, making full use of the card can be a bit tricky, because it's actually a dual GPU setup built into a single card: You have two GPUs with 12GB of VRAM each. So whatever workload you intend to run has to have multi GPU support to make full use of the card. In practice this does not matter too much; most modern machine learning frameworks support multi GPU setups anyway. You could even consider it an advantage, as you can run two smaller workloads independently, even if the workloads require exclusive access to a GPU.

Another factor why they are so cheap is that a datacentre GPU does not come with cooling fans. The server it lives in is supposed to provide a LOT of airflow to keep it cool. There is a bit of prior art on how to do that. There are even better solutions out there using 3D printed parts. You'll also need an adapter that takes the two more common 12V 8-pin connections for normal GPUs on your power supply and adapts them to the EPS-12V 8-pin connection NVIDIA uses for its workstation and datacentre GPUs. Remember what I said about messing around? Tangents are fun. Super janky setups are fun.

There are also newer, faster cards available that also have 24GB of RAM, e.g. the Tesla P40 or the Tesla M40. However those tend to be a lot more pricey, selling for over 300€, and that's not exactly within my budget.
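Putting the rough used-market numbers from above side by side (the prices and TFLOP/s figures are the ballpark values mentioned here, nothing official):

# (price €, VRAM GB, FP32 TFLOP/s) -- rough eBay figures from above
cards = {"GTX 1080 Ti": (200, 11, 11), "Tesla K80": (100, 24, 8)}

for name, (price, vram, tflops) in cards.items():
    print(f"{name}: {price / vram:.1f} €/GB VRAM, {price / tflops:.1f} €/TFLOP/s")

Per gigabyte of VRAM the K80 is hard to beat, which is exactly the metric I care about.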

So after not enough thought I bought a K80 for 80€.

NVIDIA Tesla K80

It's a reassuringly hefty piece of hardware. Also, it's a lot longer than it looks on photos. I didn't add a banana for scale, but just look at the PCIe connector.

Minor Setbacks

Before buying more e-waste to build a system around the card, I wanted to quickly verify that it actually works. After all I had just bought a piece of $5000 high-end hardware for less money than what I spent on my keyboard. Even though the hardware came out over 10 years ago and has lost software support, it would still have felt like getting scammed if the card turned out to be dead. So I dug through my parts bin and found an old B85M-G motherboard with its i5-4440 still in the socket. I don't do much PC hardware tinkering any more, so that was the only board with PCIe that I had.

Test setup for the card using left-overs on my bench

Excitedly I installed Debian 12. Then I installed Debian 11 over the Debian 12, for reasons that I'll explain in a bit. (See, I already learned something there.) After I got the legacy NVIDIA driver installed and rebooted the system... nothing happened. Well... not nothing. nvtop told me I didn't have any compatible GPUs. A quick check of dmesg revealed a whole bunch of issues:

[    0.155860] pnp 00:00: disabling [mem 0xfed40000-0xfed44fff] because it overlaps 0000:03:00.0 BAR 1 [mem 0x00000000-0x3ffffffff 64bit pref]
[    0.155864] pnp 00:00: disabling [mem 0xfed40000-0xfed44fff disabled] because it overlaps 0000:04:00.0 BAR 1 [mem 0x00000000-0x3ffffffff 64bit pref]
[    0.156822] pnp 00:08: disabling [mem 0xfed1c000-0xfed1ffff] because it overlaps 0000:03:00.0 BAR 1 [mem 0x00000000-0x3ffffffff 64bit pref]
[    0.156825] pnp 00:08: disabling [mem 0xfed10000-0xfed17fff] because it overlaps 0000:03:00.0 BAR 1 [mem 0x00000000-0x3ffffffff 64bit pref]
[    0.156826] pnp 00:08: disabling [mem 0xfed18000-0xfed18fff] because it overlaps 0000:03:00.0 BAR 1 [mem 0x00000000-0x3ffffffff 64bit pref]
[    0.156828] pnp 00:08: disabling [mem 0xfed19000-0xfed19fff] because it overlaps 0000:03:00.0 BAR 1 [mem 0x00000000-0x3ffffffff 64bit pref]
[    0.156829] pnp 00:08: disabling [mem 0xf8000000-0xfbffffff] because it overlaps 0000:03:00.0 BAR 1 [mem 0x00000000-0x3ffffffff 64bit pref]
[    0.156831] pnp 00:08: disabling [mem 0xfed20000-0xfed3ffff] because it overlaps 0000:03:00.0 BAR 1 [mem 0x00000000-0x3ffffffff 64bit pref]
[    0.156832] pnp 00:08: disabling [mem 0xfed90000-0xfed93fff] because it overlaps 0000:03:00.0 BAR 1 [mem 0x00000000-0x3ffffffff 64bit pref]
[    0.156834] pnp 00:08: disabling [mem 0xfed45000-0xfed8ffff] because it overlaps 0000:03:00.0 BAR 1 [mem 0x00000000-0x3ffffffff 64bit pref]
[    0.156835] pnp 00:08: disabling [mem 0xff000000-0xffffffff] because it overlaps 0000:03:00.0 BAR 1 [mem 0x00000000-0x3ffffffff 64bit pref]
[    0.156837] pnp 00:08: disabling [mem 0xfee00000-0xfeefffff] because it overlaps 0000:03:00.0 BAR 1 [mem 0x00000000-0x3ffffffff 64bit pref]
[    0.156838] pnp 00:08: disabling [mem 0xf7fdf000-0xf7fdffff] because it overlaps 0000:03:00.0 BAR 1 [mem 0x00000000-0x3ffffffff 64bit pref]
[    0.156840] pnp 00:08: disabling [mem 0xf7fe0000-0xf7feffff] because it overlaps 0000:03:00.0 BAR 1 [mem 0x00000000-0x3ffffffff 64bit pref]
[    0.156842] pnp 00:08: disabling [mem 0xfed1c000-0xfed1ffff disabled] because it overlaps 0000:04:00.0 BAR 1 [mem 0x00000000-0x3ffffffff 64bit pref]
[    0.156843] pnp 00:08: disabling [mem 0xfed10000-0xfed17fff disabled] because it overlaps 0000:04:00.0 BAR 1 [mem 0x00000000-0x3ffffffff 64bit pref]
[    0.156845] pnp 00:08: disabling [mem 0xfed18000-0xfed18fff disabled] because it overlaps 0000:04:00.0 BAR 1 [mem 0x00000000-0x3ffffffff 64bit pref]
[    0.156846] pnp 00:08: disabling [mem 0xfed19000-0xfed19fff disabled] because it overlaps 0000:04:00.0 BAR 1 [mem 0x00000000-0x3ffffffff 64bit pref]
[    0.156848] pnp 00:08: disabling [mem 0xf8000000-0xfbffffff disabled] because it overlaps 0000:04:00.0 BAR 1 [mem 0x00000000-0x3ffffffff 64bit pref]
[    0.156850] pnp 00:08: disabling [mem 0xfed20000-0xfed3ffff disabled] because it overlaps 0000:04:00.0 BAR 1 [mem 0x00000000-0x3ffffffff 64bit pref]
[    0.156851] pnp 00:08: disabling [mem 0xfed90000-0xfed93fff disabled] because it overlaps 0000:04:00.0 BAR 1 [mem 0x00000000-0x3ffffffff 64bit pref]
[    0.156853] pnp 00:08: disabling [mem 0xfed45000-0xfed8ffff disabled] because it overlaps 0000:04:00.0 BAR 1 [mem 0x00000000-0x3ffffffff 64bit pref]
[    0.156854] pnp 00:08: disabling [mem 0xff000000-0xffffffff disabled] because it overlaps 0000:04:00.0 BAR 1 [mem 0x00000000-0x3ffffffff 64bit pref]
[    0.156856] pnp 00:08: disabling [mem 0xfee00000-0xfeefffff disabled] because it overlaps 0000:04:00.0 BAR 1 [mem 0x00000000-0x3ffffffff 64bit pref]
[    0.156857] pnp 00:08: disabling [mem 0xf7fdf000-0xf7fdffff disabled] because it overlaps 0000:04:00.0 BAR 1 [mem 0x00000000-0x3ffffffff 64bit pref]
[    0.156859] pnp 00:08: disabling [mem 0xf7fe0000-0xf7feffff disabled] because it overlaps 0000:04:00.0 BAR 1 [mem 0x00000000-0x3ffffffff 64bit pref]
...
[    0.163732] pci 0000:00:01.0: BAR 15: no space for [mem size 0xc00000000 64bit pref]
[    0.163734] pci 0000:00:01.0: BAR 15: failed to assign [mem size 0xc00000000 64bit pref]
[    0.163737] pci 0000:00:01.0: BAR 14: assigned [mem 0xf1000000-0xf2ffffff]
[    0.163740] pci 0000:00:1c.0: BAR 14: assigned [mem 0xdf200000-0xdf3fffff]
[    0.163745] pci 0000:00:1c.0: BAR 15: assigned [mem 0xdf400000-0xdf5fffff 64bit pref]
[    0.163748] pci 0000:00:1c.0: BAR 13: assigned [io  0x2000-0x2fff]
[    0.163754] pci 0000:00:01.0: BAR 15: no space for [mem size 0xc00000000 64bit pref]
[    0.163755] pci 0000:00:01.0: BAR 15: failed to assign [mem size 0xc00000000 64bit pref]
[    0.163758] pci 0000:00:01.0: BAR 14: assigned [mem 0xf1000000-0xf2ffffff]
[    0.163761] pci 0000:00:1c.0: BAR 14: assigned [mem 0xdf200000-0xdf3fffff]
[    0.163766] pci 0000:00:1c.0: BAR 15: assigned [mem 0xdf400000-0xdf5fffff 64bit pref]
[    0.163768] pci 0000:01:00.0: BAR 15: no space for [mem size 0xc00000000 64bit pref]
[    0.163769] pci 0000:01:00.0: BAR 15: failed to assign [mem size 0xc00000000 64bit pref]
[    0.163771] pci 0000:01:00.0: BAR 14: assigned [mem 0xf1000000-0xf2ffffff]
[    0.163773] pci 0000:02:08.0: BAR 15: no space for [mem size 0x600000000 64bit pref]
[    0.163774] pci 0000:02:08.0: BAR 15: failed to assign [mem size 0x600000000 64bit pref]
[    0.163776] pci 0000:02:10.0: BAR 15: no space for [mem size 0x600000000 64bit pref]
[    0.163777] pci 0000:02:10.0: BAR 15: failed to assign [mem size 0x600000000 64bit pref]
[    0.163779] pci 0000:02:08.0: BAR 14: assigned [mem 0xf1000000-0xf1ffffff]
[    0.163780] pci 0000:02:10.0: BAR 14: assigned [mem 0xf2000000-0xf2ffffff]
[    0.163782] pci 0000:03:00.0: BAR 1: no space for [mem size 0x400000000 64bit pref]
[    0.163784] pci 0000:03:00.0: BAR 1: failed to assign [mem size 0x400000000 64bit pref]
[    0.163785] pci 0000:03:00.0: BAR 3: no space for [mem size 0x02000000 64bit pref]
[    0.163786] pci 0000:03:00.0: BAR 3: failed to assign [mem size 0x02000000 64bit pref]
[    0.163788] pci 0000:03:00.0: BAR 0: assigned [mem 0xf1000000-0xf1ffffff]
[    0.163800] pci 0000:04:00.0: BAR 1: no space for [mem size 0x400000000 64bit pref]
[    0.163802] pci 0000:04:00.0: BAR 1: failed to assign [mem size 0x400000000 64bit pref]
[    0.163803] pci 0000:04:00.0: BAR 3: no space for [mem size 0x02000000 64bit pref]
[    0.163804] pci 0000:04:00.0: BAR 3: failed to assign [mem size 0x02000000 64bit pref]
[    0.163806] pci 0000:04:00.0: BAR 0: assigned [mem 0xf2000000-0xf2ffffff]

...

[ 1089.302325] nvidia-nvlink: Nvlink Core is being initialized, major device number 244
[ 1089.302330] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  470.256.02  Thu May  2 14:37:44 UTC 2024
[ 1089.537364] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  470.256.02  Thu May  2 14:50:40 UTC 2024
[ 1089.538951] [drm] [nvidia-drm] [GPU ID 0x00000300] Loading driver
[ 1089.538952] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:03:00.0 on minor 1
[ 1089.539059] [drm] [nvidia-drm] [GPU ID 0x00000400] Loading driver
[ 1089.539060] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:04:00.0 on minor 2
[ 1103.288634] NVRM: GPU 0000:03:00.0: RmInitAdapter failed! (0x22:0xffff:667)
[ 1103.288669] NVRM: GPU 0000:03:00.0: rm_init_adapter failed, device minor number 0
[ 1103.406608] NVRM: GPU 0000:03:00.0: RmInitAdapter failed! (0x22:0xffff:667)
[ 1103.406629] NVRM: GPU 0000:03:00.0: rm_init_adapter failed, device minor number 0
[ 1103.524304] NVRM: GPU 0000:04:00.0: RmInitAdapter failed! (0x22:0xffff:667)
[ 1103.524322] NVRM: GPU 0000:04:00.0: rm_init_adapter failed, device minor number 1
[ 1103.641921] NVRM: GPU 0000:04:00.0: RmInitAdapter failed! (0x22:0xffff:667)
[ 1103.642002] NVRM: GPU 0000:04:00.0: rm_init_adapter failed, device minor number 1
...

Apparently my GPU decided to go for a drink and there wasn't enough space at the bar... I can relate to that. Joking aside: RmInitAdapter failed! (0x22:0xffff:667) is the most ungooglable error message I've encountered in a while. Sure, you find a lot of folks asking for help in the usual places. And the diagnoses range from "you need a BIOS update", via "your kernel is b0rken", to "you probably damaged the PCIe slot while installing the card". Similarly, the solutions are a wild mix of "reinstalled the OS", "downgraded the driver", "got a new card", "got a new motherboard" and my all-time favourite: "disabled the integrated NIC on the motherboard". Ironically, this also means you can't just ask ChatGPT, Claude, Gemini and friends. They'll just latch onto some of those wrong answers and keep repeating them back to you. Not that I had especially high hopes for that approach in the first place. So I had no choice but to apply an ancient skill, passed down to me by previous generations of hackers: actually reading error messages and deducing what went wrong.

First something called pnp complains about overlapping BARs. pnp in this context means Plug aNd Play. Back before we had PCIe, we had PCI. Before that we had ISA. If you've been touching computers long enough, you'll remember setting up interrupts and IO space addresses using jumpers on your cards. That wasn't fun at all and also quite easy to mess up. So Intel and Microsoft came up with what we now call Legacy Plug and Play. The TL;DR is that it was a standard for the BIOS and the OS to automatically negotiate those resource allocations for you. Win95 was the first "Plug and Play OS" and Linux famously had problems with plug and play for quite a while. To be fair to Linux: It's not like Windows did such a great job with autoconfiguring devices either. Today that original standard isn't used any more, but the name stuck around in various places.

When PCI was standardised, somebody was smart enough to include the PCI configuration space, such that the OS or the BIOS could just assign resources to the devices. Since most communication with those devices happens via memory mapped IO (basically the device pretends to be a piece of memory at a certain location), one of the most important bits of configuration is the base address and size of the memory region they can use. That can be configured through their Base Address Registers. (I know that's a gross oversimplification. Read the Wikipedia article for all the details.) So when the kernel complains about overlapping BARs it basically says that there is not enough address space to map all the memory regions needed to communicate with the device.
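If you want to look at those BAR sizes yourself without wading through dmesg, the kernel exposes them per device in sysfs. A small sketch; the PCI address is the first half of my K80 from the dmesg output above, substitute whatever lspci shows on your system:

from pathlib import Path

# The first six lines of the resource file are the BARs: start, end, flags (hex).
resource = Path("/sys/bus/pci/devices/0000:03:00.0/resource")

for i, line in enumerate(resource.read_text().splitlines()[:6]):
    start, end, flags = (int(field, 16) for field in line.split())
    if end:  # unused or unassigned BARs show up as all zeros
        size = end - start + 1
        print(f"BAR {i}: {size / 2**30:.2f} GiB (flags {flags:#x})")

On this card BAR 1 alone wants 16 GiB of address space per GPU, which is exactly what the kernel is struggling to place.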

Now you might be thinking: Wait a second! Shouldn't those addresses be 64bit? That card can't be using that much address space now, can it? The answer is, surprisingly, no and no again. To stay compatible with 32bit OSes, each BAR can describe a region that is between 16 bytes and 2 gigabytes in size, located below the 4 gigabyte address space limit. There are also 64bit BARs, but your BIOS has to support them explicitly. Since that breaks support for 32bit OSes, it's usually an option called Above 4G or Above 4G decoding that you have to enable. Unless your motherboard is so new that it simply does not care about 32bit compatibility any more. On the other hand, if you are very unlucky, your BIOS was written with the assumption that going above 4G is not necessary for consumer level hardware. Similar issues were not uncommon for some more exotic dual GPU setups. While I know all that now, I did not know it back then. (Learned something again... see, it's working.)

Reconsidering my Hardware Choices

After messing around some more with a BIOS update and some changes to the kernel command line, I decided to ask around on the fediverse whether anyone had any suggestions. Clemens offered some pointers.

While following those leads I discovered a helpful post in the NVIDIA developer forums:

The K80 presents a large BAR to the system (for each device). Not all system BIOSes are capable of mapping (two of these) large BAR regions.

So okay, pretty much exactly what I had been suspecting at this point. I took another look at the blog post by Khalid Mammadov and the other one by Thomas Jungblut, who both got their cards working in the end. The first one used an ASROCK B550 motherboard, while the latter was using an MSI AX99A. At that point I wrongly suspected that I'd need support for resizable BAR to make the card work. All things considered that didn't sound quite right to me, as resizable BAR, as a BIOS feature, is much newer than the K80 itself.

When I shared my thoughts on the fediverse, Clemens pointed out that he is running a modern GPU in an old ThinkStation S30 without any problems. Following my line of reasoning, that should not just work. The S30 is of a similar vintage as the K80, and it was certified by NVIDIA to run dual 12GB Quadro K6000 cards... a setup that should be really similar to a single K80.

The S30 looked like a good deal. I needed a case for the setup anyway, and if I could buy a decent case that fits the card and also comes with a power supply, a CPU (Xeon E5-1620), some RAM (16GB), a small SSD (128GB) and a motherboard that actually works with the card, that would be ideal. I bought the cheapest Thinkstation S30 for 80€ off eBay. The chassis is scratched and dented, and it wouldn't boot right out of the package. Nothing reseating the RAM and CPU wouldn't fix though. Slight downside: It came with an NV310 that needs one of those weird y-splitter cables for DVI. Nothing another eBay excursion can't fix, but while I waited for the adapter to arrive, I installed an ancient GeForce card from my parts bin to have some form of display output.

The K80 inside its new home: The Thinkstation S30

When I booted the system for the first time, I got all the angry beeping imaginable during POST. This is what experts commonly refer to as "not a good sign". However, since the Thinkstation is a serious workstation for doing serious workstation-grade workstation work, it also has a proper BIOS that will actually tell you what's wrong.

Error message complaining about: 'Insufficient PCI Resources detected!!!'

That's pretty much the same problem I had with my old setup. However, the BIOS version installed on my S30 was from 2012. The latest version on Lenovo's website is from 2017. While there is a newer version from 2020, it is for the type 435x; I've got a 0606. Make sure to check your exact model before attempting an update. According to the changelog in the zip file, some intermediate version between the two had added Above 4G decoding.

As is usually the case, the BIOS update process was less than smooth: I couldn't use the Windows tool for obvious reasons. The update ISO provided for download wouldn't boot when dd'ed onto a USB stick. While I do have a USB CD drive, I didn't have any CD-Rs to write the ISO to. So I ended up making a FreeDOS USB drive and adding the contents of the DOS BIOS updater zip file that Lenovo provides. After that I just had to boot into FreeDOS, run the batch file, decline to change the machine's serial, and wait for the progress bar to fill up. One reboot later I could just activate the Above 4G decoding option in the BIOS and the angry beeping stopped.

After that I duct-taped (it's an air duct, so using duct tape is fine) the radial fan back to the card. I also propped the fan up on some random workbench detritus that I hotglued into the chassis, so it's not like I used duct tape in a load-bearing situation. A conveniently placed spare fan connector for front chassis fans on the motherboard can be used to supply it with power.

It's just software from here

You might have already noticed that the word just is doing some very heavy lifting in this heading.

I started out trying to install Debian 12, as my reasoning was that the stuff included with Debian is old enough that I could just install the 470.256.02 NVIDIA driver (the last driver to still support the card). Since the driver was not in the repos, I had to use the NVIDIA-Linux-x86_64-470.256.02.run installer, which you can download from here. Similarly, to get a compatible CUDA runtime you need CUDA 11.4.0 470.42.01, which is just new enough to allow things like __device__ variables to be constexpr. I don't know exactly what that means, but if you want to compile llama.cpp or ollama, you need it.
This version is also just old enough to work with the driver that still supports the K80. The CUDA installer file is called cuda_11.4.0_470.42.01_linux.run and you can download it from here.

There's only one slight problem with that: This CUDA version does not like any GCC newer than GCC 10. You can convince the installer to do its thing regardless if you use the --override parameter. Unfortunately the resulting installation won't work:

$ cmake -B build -DCMAKE_CUDA_COMPILER:PATH=/usr/local/cuda/bin/nvcc

[...]

-- Using CUDA architectures: native
CMake Error at /usr/share/cmake-3.25/Modules/CMakeDetermineCompilerId.cmake:739 (message):
  Compiling the CUDA compiler identification source file
  "CMakeCUDACompilerId.cu" failed.

  Compiler: /usr/local/cuda/bin/nvcc

[...]

  /usr/local/cuda/bin/../targets/x86_64-linux/include/crt/host_config.h:139:2:
  error: #error -- unsupported GNU version! gcc versions later than 10 are
  not supported! The nvcc flag '-allow-unsupported-compiler' can be used to
  override this version check; however, using an unsupported host compiler
  may cause compilation failure or incorrect run time execution.  Use at your
  own risk.

    139 | #error -- unsupported GNU version! gcc versions later than 10 are not supported! 
            The nvcc flag '-allow-unsupported-compiler' can be used to override this version check; 
            however, using an unsupported host compiler may cause compilation failure or 
            incorrect run time execution. Use at your own risk.
        |  ^~~~~

  # --error 0x1 --

Injecting the -allow-unsupported-compiler flag into all the nvcc calls through several layers of build systems is really messy and ultimately results in binaries that just crash as soon as they try to do anything with the GPU. Of course, I tried just building GCC 10 and using it in the build process. That resulted in an even bigger mess.

It's just old software from here

Since Debian 12 wouldn't work, I decided to try Debian 11. It actually comes with gcc 10.2.1. Furthermore, Debian 11 comes with nvidia-tesla-470-driver, which is exactly the 470.256.02 I need. This is not an entirely future-proof plan though: It's 2025 and Debian 11 won't be EOL until the end of August 2026. So while I can get a bit over a year of use out of it, I need to come up with a different solution long term. I have some ideas involving nix for that, but those will have to wait for now.

The driver is in the non-free repos, so those need to be enabled. After that it should just install using apt update && apt install nvidia-tesla-470-driver nvtop. If you look into dmesg, the driver will complain about the NV310 being unsupported. That's not a problem for me, as I intend to run the system as a headless server anyway. If you want to actually run a graphical user interface on the machine itself, you'll have to figure out how to get the original NVIDIA driver and nouveau to coexist. That's an entirely different can of worms that I would like to avoid for now.

The CUDA runtime in the Debian repos is 11.2.2, which is a few minor versions too old to do anything useful with. As I've already explained, 11.4 is the minimum version needed for most things these days. The good thing is that the cuda_11.4.0_470.42.01_linux.run installer will just install without complaints on Debian 11.

Once that is done you can use nvtop or nvidia-smi to marvel at your datacentre GPU setup.

The K80 inside the S30. nvtop is running on the display in the background to prove the card works.

Also pay no attention to the bench top power supply. I took this photo before I found the front chassis fan connector.

torch me a yolo

The heading is unfortunately not some new slang the kids use on TikTok. Yolo (you only look once) is a computer vision model by a company called Ultralytics (they really shouldn't be allowed to name stuff). There are specialized models for various tasks, e.g. for detection, which is computer-vision-speak for identifying objects of certain classes (e.g. dog or car) in an image as well as providing bounding box coordinates for each identified object. Yolo is interesting from an architectural point of view, as it was one of the first models for vision tasks that functioned without any recurrent connections. This means it is purely feed forward from one layer to the next and will be done after a single forward pass, making it not only fast but also a lot easier to train. The model is open source and can be found in the ultralytics GitHub repo. In addition to the model itself, the repo also contains some high level tooling to train it on different datasets. E.g. you could easily train it to spot trucks and planes in aerial imagery using the DOTA dataset. Since this is an unambiguously purely civilian application for counting vehicles, I'd rather stick to the COCO dataset for this example.

To start we need a python virtual environment with the correct version of pytorch and the ultralytics tools installed:

$ mkdir yolo
$ cd yolo/
$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip install --upgrade pip
$ pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
$ pip install ultralytics

It's important to use the special URL https://download.pytorch.org/whl/cu118. The prebuilt wheel that pip downloads if you just use the defaults is built for CUDA 12.something and won't work with our more vintage environment.
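To double check that the right wheel ended up in the virtual environment, you can ask torch itself which CUDA version it was built against and whether it sees both halves of the K80. This is just a sanity check; the yolo tool below reports much the same information:

import torch

print(torch.__version__)         # should end in +cu118
print(torch.version.cuda)        # "11.8"
print(torch.cuda.is_available())

for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))  # expect two "Tesla K80" entries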

Once everything is installed (downloading will probably take a moment, go get some snacks), we can use the yolo tool to verify our installation:

$ yolo checks
Ultralytics 8.3.146 🚀 Python-3.9.2 torch-2.6.0+cu118 CUDA:0 (Tesla K80, 11441MiB)
Setup complete ✅ (8 CPUs, 15.6 GB RAM, 128.9/232.7 GB disk)

OS                  Linux-5.10.0-34-amd64-x86_64-with-glibc2.31
Environment         Linux
Python              3.9.2
Install             pip
Path                /home/sebastian/yolo/.venv/lib/python3.9/site-packages/ultralytics
RAM                 15.57 GB
Disk                128.9/232.7 GB
CPU                 Intel Xeon E5-1620 0 3.60GHz
CPU count           8
GPU                 Tesla K80, 11441MiB
GPU count           2
CUDA                11.8

numpy               ✅ 2.0.2>=1.23.0
matplotlib          ✅ 3.9.4>=3.3.0
opencv-python       ✅ 4.11.0.86>=4.6.0
pillow              ✅ 11.2.1>=7.1.2
pyyaml              ✅ 6.0.2>=5.3.1
requests            ✅ 2.32.3>=2.23.0
scipy               ✅ 1.13.1>=1.4.1
torch               ✅ 2.6.0+cu118>=1.8.0
torch               ✅ 2.6.0+cu118!=2.4.0,>=1.8.0; sys_platform == "win32"
torchvision         ✅ 0.21.0+cu118>=0.9.0
tqdm                ✅ 4.67.1>=4.64.0
psutil              ✅ 7.0.0
py-cpuinfo          ✅ 9.0.0
pandas              ✅ 2.2.3>=1.1.4
ultralytics-thop    ✅ 2.0.14>=2.0

It even found the GPUs first try.

Now we can simply use the yolo tool to train a model from scratch:

$ yolo train data=coco.yaml model=yolo11n.yaml

[...]
  
  File "/home/sebastian/yolo/.venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sebastian/yolo/.venv/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 554, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/sebastian/yolo/.venv/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 549, in _conv_forward
    return F.conv2d(
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_SUPPORTED_ARCH_MISMATCH

Unless we can't. That's not-good™. cuDNN is a library of primitives for deep neural networks, which is used by pytorch, which is used by the yolo tooling. The support matrix for cuDNN does not list anything below the Maxwell GPU architecture with compute capability 5.0. Our card has a Kepler chip with compute capability 3.7 ... which is a bit older than 5.0. The good news is that this table inexplicably only shows versions 9.something. So after doing some digging, I found out that cuDNN version 8.7.0 still had support for Kepler cards.
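Before rebuilding the environment, it's worth checking which cuDNN build torch actually loads and what compute capability the card reports. A small sketch; the values in the comments are what I'd expect on this setup, not guarantees:

import torch

# The stock cu118 wheel should report a cuDNN 9.x build here; after pinning
# nvidia-cudnn-cu11==8.7.0.84 (and assuming torch picks up that wheel) it
# should read 8700 instead.
print(torch.backends.cudnn.version())

# The Kepler GK210 chips on the K80 report compute capability (3, 7),
# below the 5.0 minimum of current cuDNN releases.
print(torch.cuda.get_device_capability(0))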

Let's try again:

$ mkdir yolo
$ cd yolo/
$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip install --upgrade pip
$ pip install torch==2.3.1 torchvision torchaudio nvidia-cudnn-cu11==8.7.0.84 \
    --index-url https://download.pytorch.org/whl/cu118
$ pip install ultralytics

Let's see if it will run a DDP (distributed data parallel) training using both GPU devices with a slightly larger batch size:

$ yolo train data=coco.yaml model=yolo11n.yaml device=0,1 batch=64
[W NNPACK.cpp:61] Could not initialize NNPACK! Reason: Unsupported hardware.
Ultralytics 8.3.148 🚀 Python-3.9.2 torch-2.3.1+cu118 CUDA:0 (Tesla K80, 11441MiB)
                                                      CUDA:1 (Tesla K80, 11441MiB)
engine/trainer: agnostic_nms=False, amp=True, augment=False, auto_augment=randaugment, batch=64, 
[...]

                   from  n    params  module                                       arguments
  0                  -1  1       464  ultralytics.nn.modules.conv.Conv             [3, 16, 3, 2]
  1                  -1  1      4672  ultralytics.nn.modules.conv.Conv             [16, 32, 3, 2]
  2                  -1  1      6640  ultralytics.nn.modules.block.C3k2            [32, 64, 1, False, 0.25]
  3                  -1  1     36992  ultralytics.nn.modules.conv.Conv             [64, 64, 3, 2]
  4                  -1  1     26080  ultralytics.nn.modules.block.C3k2            [64, 128, 1, False, 0.25]
  5                  -1  1    147712  ultralytics.nn.modules.conv.Conv             [128, 128, 3, 2]
  6                  -1  1     87040  ultralytics.nn.modules.block.C3k2            [128, 128, 1, True]
  7                  -1  1    295424  ultralytics.nn.modules.conv.Conv             [128, 256, 3, 2]
  8                  -1  1    346112  ultralytics.nn.modules.block.C3k2            [256, 256, 1, True]
  9                  -1  1    164608  ultralytics.nn.modules.block.SPPF            [256, 256, 5]
 10                  -1  1    249728  ultralytics.nn.modules.block.C2PSA           [256, 256, 1]
 11                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']
 12             [-1, 6]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 13                  -1  1    111296  ultralytics.nn.modules.block.C3k2            [384, 128, 1, False]
 14                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']
 15             [-1, 4]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 16                  -1  1     32096  ultralytics.nn.modules.block.C3k2            [256, 64, 1, False]
 17                  -1  1     36992  ultralytics.nn.modules.conv.Conv             [64, 64, 3, 2]
 18            [-1, 13]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 19                  -1  1     86720  ultralytics.nn.modules.block.C3k2            [192, 128, 1, False]
 20                  -1  1    147712  ultralytics.nn.modules.conv.Conv             [128, 128, 3, 2]
 21            [-1, 10]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 22                  -1  1    378880  ultralytics.nn.modules.block.C3k2            [384, 256, 1, True]
 23        [16, 19, 22]  1    464912  ultralytics.nn.modules.head.Detect           [80, [64, 128, 256]]
YOLO11n summary: 181 layers, 2,624,080 parameters, 2,624,064 gradients, 6.6 GFLOPs

DDP: debug command /home/sebastian/yolo/.venv-cudnn/bin/python3 -m torch.distributed.run --nproc_per_node 2 --master_port 49269 /home/sebastian/.config/Ultralytics/DDP/_temp_mlfldkse139723384360912.py
Ultralytics 8.3.148 🚀 Python-3.9.2 torch-2.3.1+cu118 CUDA:0 (Tesla K80, 11441MiB)
                                                      CUDA:1 (Tesla K80, 11441MiB)
Freezing layer 'model.23.dfl.conv.weight'
AMP: running Automatic Mixed Precision (AMP) checks...
AMP: checks passed ✅
train: Fast image access ✅ (ping: 0.0±0.0 ms, read: 3248.7±450.6 MB/s, size: 148.8 KB)
train: Scanning /home/sebastian/yolo/datasets/coco/labels/train2017.cache... 117266 images, 1021 backgrounds, 0 corrupt: 100%|██████████| 118287/118287 [00:00<?, ?it/s]
val: Fast image access ✅ (ping: 0.0±0.0 ms, read: 3231.9±818.0 MB/s, size: 210.6 KB)
val: Scanning /home/sebastian/yolo/datasets/coco/labels/val2017.cache... 4952 images, 48 backgrounds, 0 corrupt: 100%|██████████| 5000/5000 [00:00<?, ?it/s]
Plotting labels to runs/detect/train5/labels.jpg...
optimizer: 'optimizer=auto' found, ignoring 'lr0=0.01' and 'momentum=0.937' and determining best 'optimizer', 'lr0' and 'momentum' automatically...
optimizer: SGD(lr=0.01, momentum=0.9) with parameter groups 81 weight(decay=0.0), 88 weight(decay=0.0005), 87 bias(decay=0.0)
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/detect/train5
Starting training for 100 epochs...

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      1/100      5.54G      3.606      5.751      4.234        481        640:  
      27%|██▋       | 496/1849 [15:33<42:12,  1.87s/it]

Success! ... The yolo tool has downloaded about 25GB of training data and started training on both GPU devices. It is even reasonably fast: 15:33<42:12, 1.87s/it. About one hour per epoch. Training for a little over 4 days on a single card is not actually that bad.

Let's do a very rough ballpark comparison to renting a GPU in the cloud: Running at full blast, the K80 sips back about 300W. Add another 100W for the rest of the system, and you end up with 0.4kW for about 100h. That's 40kWh. Assuming about 0.26€ per kWh, that's about 10€. That's equivalent to about 7h on an NVIDIA RTX™ 6000 card with about 91.06 TFLOP/s for FP32. Assuming our initial 8 TFLOP/s figure, we can guesstimate that we will be doing \(100\,\mathrm{h} \cdot 3600\,\tfrac{\mathrm{s}}{\mathrm{h}} \cdot 8\,\tfrac{\mathrm{TFLOP}}{\mathrm{s}} = 2.88\,\mathrm{EFLOP}\) (or \(2.88 \cdot 10^{18}\) operations) during those 100h. The RTX6000 should be able to do that in about 9h. Of course this comparison is grossly oversimplified and glosses over a lot of details, but this is probably why those K80s end up on eBay for less than 100€.
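The same back-of-the-envelope numbers as a few lines of Python, in case you want to plug in your own electricity price or GPU specs (all figures are the rough assumptions from above):

power_kw = 0.3 + 0.1           # K80 at full blast plus the rest of the system
hours = 100                    # ~1h per epoch, 100 epochs
price_per_kwh = 0.26           # €

energy_kwh = power_kw * hours
print(f"energy: {energy_kwh:.0f} kWh, cost: {energy_kwh * price_per_kwh:.2f} €")

k80_tflops, rtx6000_tflops = 8, 91.06
total_eflop = hours * 3600 * k80_tflops * 1e12 / 1e18
print(f"work done: {total_eflop:.2f} EFLOP")
print(f"RTX 6000 time for the same work: {total_eflop * 1e18 / (rtx6000_tflops * 1e12) / 3600:.1f} h")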

On a rainy Saturday morning, I was trying to work on this before I had my coffee... At least that's my excuse for why I took NVIDIA's table at face value and forgot that cuDNN versions earlier than 9 existed. The question then becomes: Can we run pytorch models on the GPU without cuDNN? Of course we can. If we just set torch.backends.cudnn.enabled = False, pytorch will fall back to its own CUDA kernels and stop looking for cuDNN. It will be slower, but it will work. This means we can't use the yolo utility any more, since there is no way to set that value from outside the python interpreter. (Being able to control the backend from an environment variable would have been nice for this.)

Writing the script is not that hard:

import torch
torch.backends.cudnn.enabled = False

from ultralytics import YOLO

model = YOLO("yolo11n.yaml")

results = model.train(data="coco.yaml", epochs=10)

results = model.val()
print(results)

However, this approach has one big downside. It works, but it cannot just run on multiple GPUs any more. Since our K80 is two GPUs in a trench coat, half of the card will sit idle. I can call train like this: model.train(data="coco.yaml", epochs=10, device=[0,1]) and start the script, but suddenly we are back at CUDNN_STATUS_NOT_SUPPORTED_ARCH_MISMATCH. This was very confusing to say the least, but ultimately it all comes down to janky software design, a common problem in machine learning.

Looking closely at the DDP run log output you can spot this line: python3 -m torch.distributed.run --nproc_per_node 2 --master_port 49269 ~/.config/Ultralytics/DDP/_temp_mlfldkse139723384360912.py. So ultralytics creates a temporary script for DDP and uses that in place of my script. Then it starts one instance of that script for each GPU device. Sure enough looking at ultralytics/utils/dist.py#L29 there is a python file generator. It only captures some parameters and the trainer class from my original script. So for the new python interpreters spawned as part of the DDP process torch.backends.cudnn.enabled is never set to False.

There is a workaround though. We can add our own trainer class in its own trainer.py that sets up torch for us:

import torch

from ultralytics.models.yolo.detect import DetectionTrainer
from ultralytics.utils import DEFAULT_CFG

class CustomTrainer(DetectionTrainer):
    def __init__(self, cfg=DEFAULT_CFG, overrides=None, _callbacks=None):
        torch.backends.cudnn.enabled = False
        super().__init__(cfg=cfg, overrides=overrides, _callbacks=_callbacks)

Then we can just use that trainer and add some more glue code in the main script:

import os
from pathlib import Path

# Add the directory containing this script (and trainer.py) to PYTHONPATH,
# so the interpreters spawned for DDP can import the custom trainer.
current_dir = Path(__file__).resolve().parent
os.environ["PYTHONPATH"] = os.environ.get("PYTHONPATH", "") + ":" + str(current_dir)

from trainer import CustomTrainer

def main():
    trainer = CustomTrainer(overrides={
        'data':'coco.yaml', 
        'model': 'yolo11n.yaml', 
        'batch': 128, 
        'epochs': 10, 
        'device':[0,1]
    })
    trainer.train()

if __name__ == "__main__":
    main()

The separate trainer.py and the messing with os.environ["PYTHONPATH"] are necessary, because that code generator has some unfortunate assumptions built in. It tries to generate the import for the trainer class using: module, name = f"{trainer.__class__.__module__}.{trainer.__class__.__name__}".rsplit(".", 1) If CustomTrainer is defined in your main script, module will just become __main__. So from {module} import {name} turns into from __main__ import CustomTrainer, which obviously won't work. Also, even if module contains the correct module name or module path, it still assumes that the newly spawned python interpreter can find the module. That's also not trivially true, hence the messing around with os.environ["PYTHONPATH"]. Not the best helper around pytorch's DDP feature I've run into, but also not the worst.

In the end, if you can ignore all that jank, you can train your yolo model without using cuDNN like this.
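Once a training run has finished, actually using the model doesn't need any of the DDP gymnastics. A minimal inference sketch; the weights path depends on which runs/detect/train* directory your run ended up in (train5 in my case), and the image name is obviously a placeholder:

import torch
torch.backends.cudnn.enabled = False  # inference goes through the same cuDNN code paths

from ultralytics import YOLO

# Adjust to the run directory ultralytics created for you
model = YOLO("runs/detect/train5/weights/best.pt")

results = model("mountain_bike_race.jpg", device=0)
for box in results[0].boxes:
    print(results[0].names[int(box.cls)], float(box.conf))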

Can we have LLMs at home now? Please?

Since I teased it initially, we finally have to talk about running LLMs locally. I'm rather opposed to using anything GenAI for obvious reasons (the ethics and consent issues with training data, the click-workers slaving away cleaning up the training data, the energy consumption, the fact that most use cases revolve around lowering the wages for artists, writers, translators and software developers, while CEOs get bonuses for deploying AI). However, the technology exists and has real implications for our daily lives. Therefore, it can be worthwhile to invest the time to understand it and learn its failure modes.

There is also the security research aspect: Since everyone seems to be using retrieval augmented generation on all their files and a model context protocol server for all their software, there is quite a bit of new attack surface on a typical developer's machine. To make matters worse, that attack surface can potentially be reached by a clever prompt injection. Lots of room for fun activities, lots of fun new ways to accidentally leak your private keys.

In any case it's a good idea to experiment locally. Not only can you control the entire stack and inspect the internal state of all components, there is also no one who can kick you out for poking around. I've already had my API access revoked from one model provider for trying interesting stuff.

While you could just run LLMs from a python script using TensorFlow or pytorch, that would be a lot of fiddly work which ultimately would not get you very far. To do anything useful with the model, e.g. hooking it up to something like continue or tmuxai, you want an OpenAI-compatible or ollama-compatible API. Even though those APIs are not too hard to implement, it's much more convenient to just use one of the common LLM runtimes.

The most common choice is ollama. It's basically a wrapper around llama.cpp written in Go. The reason it became much more popular than the original llama.cpp is that it comes with a docker-like interface. You can simply type ollama run deepseek-r1:70b and it will download the model for you, figure out if it can use your GPU and finally start a chat session with the model in your terminal.

Unfortunately ollama does not support older GPUs. There is a fork, but it is very outdated by now. It will not run some of the newer models, especially those based on the qwen architecture. Since many of the smaller distilled models use qwen as their base, your choice is limited to mostly older models.

Fortunately for us, the development of llama.cpp didn't stop in the meantime. It now can download models from the same repository used by ollama, and it can still run on a Tesla K80 as pointed out by this really helpful GitHub issue.

Using the magic incantations provided by ChunkyPanda03 we can build a working version of llama.cpp:

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp/
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda-11.4/bin/nvcc \
    -DCMAKE_CUDA_ARCHITECTURES='37'
cd build/
make -j 16

To verify it works we can just download and run deepseek-r1.

./bin/llama-run -ngl 80 deepseek-r1:32b
Two terminals. One showing deepseek's output and one showing nvtop

The -ngl part specifies the number of layers to run on the GPU. You'll want to tune that for each model you run. Basically you can increase it until your model just about fills the memory on your GPU. The remainder of the model is then run on the CPU, which fortunately is just about fast enough in my machine that it does not become too much of a bottleneck.

To start an OpenAI compatible API server you can simply use llama-server which takes similar arguments to llama-run.
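Anything that speaks the OpenAI API can then be pointed at that server. A small sketch using the openai python package; I'm assuming the default listen address of 127.0.0.1:8080 here, so adjust it to however you started llama-server, and note that the API key just has to be non-empty:

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="unused")

response = client.chat.completions.create(
    model="deepseek-r1:32b",  # with a single loaded model the name is mostly informational
    messages=[{"role": "user", "content": "Say hello from the home lab K80."}],
)
print(response.choices[0].message.content)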

There you have it: AI at home.



For comments you can use your fediverse account to reply to this toot.


Published: 29.06.2025 17:25 Updated: 29.06.2025 17:25