Setting up PCI passthrough on openstack

Microstack is a stripped-down version of OpenStack distributed by Canonical, less customizable and with a lot of Sane Defaults. Normally I’m not a fan of Canonical, but since I’m not familiar with OpenStack, this seems like a good starting place.

And honestly, microk8s wasn’t such a bad thing, and this is the same deal, right?

On Ubuntu, install microstack from the beta channel:

sudo snap install microstack --devmode --beta

Now configure openstack:

sudo microstack init --auto --control

That seems to have worked out.

You might also need the client programs? I don’t think I’ve actually used them.

sudo snap install openstackclients

Finally, if you want to look at a GUI, you can grab the password for dashboard access:

sudo snap get microstack config.credentials.keystone-password

Running Ubuntu on Ubuntu

Go grab an Ubuntu image. The kvm image is ⧉here, but it did not work for me, printing “GRUB_FORCE_PARTUUID set, attempting initrdless boot” and locking up. ⧉This forum suggested that I instead use the non-kvm one, and it worked.

wget https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img
sudo cp jammy-server-cloudimg-amd64.img /var/snap/microstack/common/images/

Because snap-installed programs can only access certain directories, you have to place the file at /var/snap/microstack/common/images/jammy-server-cloudimg-amd64.img.

Then create the image like this:

microstack.openstack image create ubuntu-jammy-nokvm \
  --disk-format qcow2 \
  --file /var/snap/microstack/common/images/jammy-server-cloudimg-amd64.img \
  --public
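
To double-check that the upload worked, you can list the registered images:

microstack.openstack image list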

Great, now we can run it:

microstack.openstack server create --flavor m1.medium \
  --image ubuntu-jammy-nokvm \
  --nic net-id=831a8a23-4a6c-40a1-8435-620412144195 \
  --key-name own \
  --security-group 0d57b7c3-2e6c-4eab-be2c-319ecba62c7f \
  test99

You will need to substitute the correct net and security-group IDs for your own setup, and create and attach a floating IP.
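
Roughly, that looks like the following; external is the name of the provider network in my microstack install, and the floating IP is whatever the create command hands back:

# find the IDs to plug into --nic and --security-group
microstack.openstack network list
microstack.openstack security group list

# allocate a floating IP on the external network and attach it to the server
microstack.openstack floating ip create external
microstack.openstack server add floating ip test99 <floating-ip>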

Alternatively, microstack has a nice shorthand:

microstack launch --flavor m1.medium ubuntu-jammy-nokvm

Excellent!

A brief debugging session

This is where things get ugly. I’m used to being able to hack on Python code, which the microstack repository is replete with. Unfortunately, ⧉most of the content of snaps is read-only. So I’m stuck reconfiguring and restarting things. At least I can read the code though. That’s OpenSourceTM.

I repeatedly got permission denied when trying to access hosts. I also briefly got no route to host… Hm. I started trying to apply versions of

#cloud-config
users:
  - default
  - name: ubuntu
    groups: sudo
    shell: /bin/bash
    sudo: ['ALL=(ALL) NOPASSWD:ALL']
    ssh-authorized-keys:
      - ssh-rsa AAA

to the --user-data parameter, and in the UI. Nothing was taking. I could not get into the image via the SPICE console because ⧉there is no default user/pass for the ubuntu cloud image. The prepackaged cirrOS image would not allow me to SSH either; something was broken in there.
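
For reference, this is how the cloud-config gets attached on the CLI; cloud-init.yaml is a hypothetical file holding the YAML above, and the IDs are elided:

microstack.openstack server create --flavor m1.medium \
  --image ubuntu-jammy-nokvm \
  --nic net-id=<net-id> --security-group <sg-id> \
  --user-data cloud-init.yaml \
  --key-name own test99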

Eventually, I found

[   26.831426] cloud-init[603]: 2023-04-06 04:39:05,756 - url_helper.py[ERROR]: Timed out, no response from urls: ['http://169.254.169.254/openstack']
[   26.832426] cloud-init[603]: 2023-04-06 04:39:05,763 - util.py[WARNING]: No active metadata service found

⧉169.254.169.254 is commonly used as a metadata server on cloud computing platforms. It’s classed under “Link Local” in ⧉this RFC on special-use addresses. As far as I could find, that’s what serves the cloud-config, so that explains why it’s not working. Some variant of the above is probably used to configure the default SSH keys I added through the CLI.
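
If you can get a shell inside a guest, you can poke that endpoint directly; these are the standard OpenStack metadata paths, and if they time out like in the cloud-init log above, the metadata service simply isn’t reachable from the guest:

# list available metadata versions, then fetch the instance metadata
curl -s http://169.254.169.254/openstack
curl -s http://169.254.169.254/openstack/latest/meta_data.json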

So I went fishing on search engines, and I found nothing. ⧉This person had the same issue, but resolved it by just restarting the VM :/ And the person right below says I should simply sudo snap restart microstack… So that’s no help either.

Well, the docs said this was tested on 20.04, but I’m on a fresh 22.04. So I guess we’ll switch to 20.04 and we’ll see how that goes…

That worked out. On a fresh 20.04 the thing works as advertised. Yikes. All of the previous section worked fine. Not the most satisfying conclusion, but at least now we can move forward. Maybe I can revisit this in a 22.04 VM and try to figure out what’s going on :)


Making sure machines start on boot

After rebooting I noticed the machines did not start. Microstack keeps its nova.conf in /var/snap/microstack/common/etc/nova/nova.conf.d/nova-snap.conf. I edited it to include resume_guests_state_on_host_boot = True in the [DEFAULT] section and restarted the service with sudo systemctl restart snap.microstack.nova-compute.service, though since I need to reboot to verify this anyway, the restart wasn’t really necessary.
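
For reference, the relevant lines of nova-snap.conf end up looking like this:

[DEFAULT]
resume_guests_state_on_host_boot = True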

Getting an outside internet connection

I realized I couldn’t get an internet connection on the VMs, which is kind of critical for the use case I’m thinking of. I found ⧉this answer, which then pointed to ⧉this answer. I did not apply the iptables rules. Instead I applied

   (openstack) subnet set --dhcp external-subnet
   (openstack) subnet set --dhcp test-subnet
   (openstack) subnet set --dns-nameserver 8.8.8.8 external-subnet
   (openstack) subnet set --dns-nameserver 8.8.8.8 test-subnet
   (openstack) network set --share external
   (openstack) network set --share test

followed by

sudo sysctl net.ipv4.ip_forward=1

I suspect I will find out whether the former was crucial if/when I set up another network/subnet. But for now, one of those changes did the trick.
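
Note that sysctl set that flag only for the running session; the usual way to make it survive a reboot (nothing microstack-specific, and the file name is arbitrary) is:

echo 'net.ipv4.ip_forward=1' | sudo tee /etc/sysctl.d/99-ip-forward.conf
sudo sysctl --system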

update: I did a little bit more work on networking

Capturing PCI devices with the correct driver

Let’s perform the ⧉regular grub PCI-passthrough steps to get our card captured by the vfio driver.

Basically that consists of editing your /etc/default/grub to add the following:

GRUB_CMDLINE_LINUX_DEFAULT="splash amd_iommu=on kvm.ignore_msrs=1 vfio-pci.ids=10de:2231,10de:1aef"

YMMV for the particulars of vfio-pci.ids, not to mention whether you’re on an AMD platform (amd_iommu=on) or need to ignore_msrs.
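
After editing the file, regenerate the grub config and reboot, then confirm the devices actually got bound to vfio-pci; the device IDs here are mine, substitute your own:

sudo update-grub
sudo reboot

# each of these should now report "Kernel driver in use: vfio-pci"
lspci -nnk -d 10de:2231
lspci -nnk -d 10de:1aef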

Whitelisting PCI devices

This is basically just adding, e.g.,

[pci]
passthrough_whitelist = [{ "vendor_id": "10de", "product_id": "2231" },{ "vendor_id": "10de", "product_id": "1aef" },{ "address": "0000:02:00.0" },{ "vendor_id": "1022", "product_id": "14da" }]
alias = { "vendor_id":"10de", "product_id":"2231", "device_type":"type-PCI", "name":"a5" }
alias = { "vendor_id":"10de", "product_id":"1aef", "device_type":"type-PCI", "name":"a5audio" }
alias = { "vendor_id":"c0a9", "product_id":"540a", "device_type":"type-PCI", "name":"nvme" }
alias = { "vendor_id":"1022", "product_id":"14da", "device_type":"type-PCI", "name":"bridge" }

to your nova-snap.conf. Be careful: if you need to pass through one of several identical devices, use the address form of the whitelist entry rather than the vendor_id/product_id form. For example, above I use 0000:02:00.0 to whitelist my NVMe controller on that specific bus; since I have two NVMe controllers in the same machine with the same vendor and product IDs, both would be whitelisted if I used the ID form instead.
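
The vendor/product pairs and bus addresses for these entries come straight out of lspci; the bracketed IDs sit at the end of each line and the bus address is the prefix (the grep pattern is just an example for my hardware):

lspci -nn | grep -iE 'nvidia|non-volatile'
# 01:00.0 VGA compatible controller [0300]: NVIDIA ... [10de:2231] (rev a1)
# 02:00.0 Non-Volatile memory controller [0108]: Micron/Crucial ... [c0a9:540a] (rev 01)
# the "address" form of the whitelist wants the domain prefix too, e.g. 0000:02:00.0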

Attaching PCI devices

This consists of taking a flavor and attaching the relevant PCI aliases from your nova config:

openstack flavor set m1.large --property "pci_passthrough:alias"="a5:1,a5audio:1,nvme:1"
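
Booting with that flavor is then the same server create as before; gpu-test is just a placeholder name and the IDs are elided:

microstack.openstack server create --flavor m1.large \
  --image ubuntu-jammy-nokvm \
  --nic net-id=<net-id> --security-group <sg-id> \
  --key-name own gpu-test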

After trying to piece together the correct configs for PCI passthrough using ⧉docs for the full version of openstack, I was left with No valid host was found. There are not enough hosts available., so I began debugging that.

systemctl status snap.microstack.nova-scheduler.service tells me Filter PciPassthroughFilter returned 0 hosts. OK, I must have configured something wrong. ⧉Here’s some documentation for pci_passthrough in the properties section of flavor.

After reading the ⧉newer docs a little more carefully, I realized a lot of the configuration I was doing was for SR-IOV, which I am not going to use here. Eventually I was able to get Please ensure all devices within the iommu_group are bound to their vfio bus driver in the nova logs! This error is comprehensible to me. Turns out my IOMMU group is not isolated.
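
A listing like the one below can be produced with the usual shell loop over /sys/kernel/iommu_groups (a standard trick, nothing specific to this setup):

# print every PCI device, prefixed by its IOMMU group
for g in /sys/kernel/iommu_groups/*; do
  for d in "$g"/devices/*; do
    echo "IOMMU Group ${g##*/} $(lspci -nns "${d##*/}")"
  done
done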

IOMMU Group 0 00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:14da]
IOMMU Group 0 00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:14db]
IOMMU Group 0 00:01.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:14db]
IOMMU Group 0 01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2231] (rev a1)
IOMMU Group 0 01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:1aef] (rev a1)
IOMMU Group 0 02:00.0 Non-Volatile memory controller [0108]: Micron/Crucial Technology Device [c0a9:540a] (rev 01)

The Non-Volatile memory controller is my NVMe drive. Turns out it shares an IOMMU group with the GPU. I guess that’s what I get for getting a compact motherboard. I managed to get a GPU machine running after a while though; here are some notes on the steps I took:

I think swapping the drives was definitely critical, and this seems to be confirmed since, when my GPU is passed through, I can no longer see the NVMe drive in the same group on the controller. I also suspect that installing virt-manager may have unstuck something, since I (re)installed a bunch of random dependencies including libvirt, and I wasn’t really reading the output of the apt-get commands at that point.

Passing through the nvme drive

I should be able to pass through the NVMe drive, and it would be a shame if I couldn’t, since it has data and models on it and is otherwise useless and inaccessible as things stand now.

And actually, it was a very simple matter once I read this in the ⧉openstack documentation:

If using vendor and product IDs, all PCI devices matching the vendor_id and product_id are added to the pool of PCI devices available for passthrough to VMs.

Basically, the passthrough_whitelist had to reference the address and not the vendor/product IDs, since those are not unique across the multiple NVMe controllers in my system.
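
Concretely, the pieces that matter are the address-form whitelist entry together with the nvme alias, trimmed out of the config above (aliases, unlike whitelist entries, still match on vendor/product IDs):

[pci]
passthrough_whitelist = [{ "address": "0000:02:00.0" }]
alias = { "vendor_id":"c0a9", "product_id":"540a", "device_type":"type-PCI", "name":"nvme" }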

Conclusion

Great! I have completed the basic setup of microstack on my server and can now use it for running GPU workloads. I have internet access and a large disk to use for models and data. I have learned that the second PCI slot is in an IOMMU group with my Ethernet controller, among others, which means this tower will be limited to a single GPU unless and until I get a different motherboard.