Setting up PCI passthrough on openstack

Microstack is a stripped-down version of OpenStack distributed by Canonical, less customizable and with a lot of Sane Defaults. Normally I’m not a fan of Canonical, but since I’m not familiar with OpenStack, this seems like a good starting place.

And honestly, microk8s wasn’t such a bad thing, and this is the same deal, right?

On Ubuntu, install microstack from the beta channel:

sudo snap install microstack --devmode --beta

Now configure openstack:

sudo microstack init --auto --control

That seems to have worked out.

You might also need the client programs? I don’t think I’ve actually used them.

sudo snap install openstackclients

Finally, if you want to look at a GUI, you can grab the password for dashboard access:

sudo snap get microstack config.credentials.keystone-password

Running Ubuntu on Ubuntu

Go grab an Ubuntu image. The kvm image is ⧉here, but it did not work for me, printing “GRUB_FORCE_PARTUUID set, attempting initrdless boot” and locking up. ⧉This forum suggested that I instead use the non-kvm one, and it worked.

wget https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img
sudo cp jammy-server-cloudimg-amd64.img /var/snap/microstack/common/images/

Because snap-installed programs can only access certain directories, you have to place the file at /var/snap/microstack/common/images/jammy-server-cloudimg-amd64.img.

Then create the image like this:

microstack.openstack image create ubuntu-jammy-nokvm \
  --disk-format qcow2 \
  --file /var/snap/microstack/common/images/jammy-server-cloudimg-amd64.img \
  --public
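
To double-check that the upload worked, you can list the registered images:

microstack.openstack image list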

Great, now we can run it:

microstack.openstack server create --flavor m1.medium \
  --image ubuntu-jammy-nokvm \
  --nic net-id=831a8a23-4a6c-40a1-8435-620412144195 \
  --key-name own \
  --security-group 0d57b7c3-2e6c-4eab-be2c-319ecba62c7f \
  test99

You will need to substitute the correct net and security-group IDs for your own setup, and create and attach a floating IP.
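
Roughly, that looks like the following; external is the name of the provider network in my microstack install, and the floating IP is whatever the create command hands back:

# find the IDs to plug into --nic and --security-group
microstack.openstack network list
microstack.openstack security group list

# allocate a floating IP on the external network and attach it to the server
microstack.openstack floating ip create external
microstack.openstack server add floating ip test99 <floating-ip>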

Alternatively, microstack has a nice shorthand:

microstack launch --flavor m1.medium ubuntu-jammy-nokvm

Excellent!

A brief debugging session

This is where things get ugly. I’m used to being able to hack on Python code, which the microstack repository is replete with. Unfortunately, ⧉most of the content of snaps is read-only. So I’m stuck reconfiguring and restarting things. At least I can read the code though. That’s OpenSourceTM.

I repeatedly got permission denied when trying to access hosts. I also briefly got no route to host… Hm. I started trying to apply versions of

#cloud-config
users:
  - default
  - name: ubuntu
    groups: sudo
    shell: /bin/bash
    sudo: ['ALL=(ALL) NOPASSWD:ALL']
    ssh-authorized-keys:
      - ssh-rsa AAA

to the --user-data parameter, and in the UI. Nothing was taking. I could not get into the image via the SPICE console because ⧉there is no default user/pass for the ubuntu cloud image. The prepackaged cirrOS image would not allow me to SSH either; something was broken in there.
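
For reference, this is how the cloud-config gets attached on the CLI; cloud-init.yaml is a hypothetical file holding the YAML above, and the IDs are elided:

microstack.openstack server create --flavor m1.medium \
  --image ubuntu-jammy-nokvm \
  --nic net-id=<net-id> --security-group <sg-id> \
  --user-data cloud-init.yaml \
  --key-name own test99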

Eventually, I found

[   26.831426] cloud-init[603]: 2023-04-06 04:39:05,756 - url_helper.py[ERROR]: Timed out, no response from urls: ['http://169.254.169.254/openstack']
[   26.832426] cloud-init[603]: 2023-04-06 04:39:05,763 - util.py[WARNING]: No active metadata service found

⧉169.254.169.254 is commonly used as a metadata server on cloud computing platforms. It’s classed under “Link Local” in ⧉this RFC on special-use addresses. As far as I could find, that’s what serves the cloud-config, so that explains why it’s not working. Some variant of the above is probably used to configure the default SSH keys I added through the CLI.
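
If you can get a shell inside a guest, you can poke that endpoint directly; these are the standard OpenStack metadata paths, and if they time out like in the cloud-init log above, the metadata service simply isn’t reachable from the guest:

# list available metadata versions, then fetch the instance metadata
curl -s http://169.254.169.254/openstack
curl -s http://169.254.169.254/openstack/latest/meta_data.json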

So I went fishing on search engines, and I found nothing. ⧉This person had the same issue, but resolved it by just restarting the VM :/ And the person right below says I should simply sudo snap restart microstack… So that’s no help either.

Well, the docs said this was tested on 20.04, but I’m on a fresh 22.04. So I guess we’ll switch to 20.04 and we’ll see how that goes…

That worked out. On a fresh 20.04 the thing works as advertised. Yikes. All of the previous section worked fine. Not the most satisfying conclusion, but at least now we can move forward. Maybe I can revisit this in a 22.04 VM and try to figure out what’s going on :)


Making sure machines start on boot

After rebooting I noticed the machines did not start. Microstack keeps its nova.conf in /var/snap/microstack/common/etc/nova/nova.conf.d/nova-snap.conf. I edited it to include resume_guests_state_on_host_boot = True in the [DEFAULT] section and restarted the service with sudo systemctl restart snap.microstack.nova-compute.service, though since I need to reboot to verify this anyway, the restart wasn’t really necessary.
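
For reference, the relevant lines of nova-snap.conf end up looking like this:

[DEFAULT]
resume_guests_state_on_host_boot = True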

Getting an outside internet connection

I realized I couldn’t get an internet connection on the VMs, which is kind of critical for the use case I’m thinking of. I found ⧉this answer, which then pointed to ⧉this answer. I did not apply the iptables rules. Instead I applied

   (openstack) subnet set --dhcp external-subnet
   (openstack) subnet set --dhcp test-subnet
   (openstack) subnet set --dns-nameserver 8.8.8.8 external-subnet
   (openstack) subnet set --dns-nameserver 8.8.8.8 test-subnet
   (openstack) network set --share external
   (openstack) network set --share test

followed by

sudo sysctl net.ipv4.ip_forward=1

I suspect I will find out whether the former was crucial if/when I set up another network/subnet. But for now, one of those changes did the trick.
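
Note that sysctl set that flag only for the running session; the usual way to make it survive a reboot (nothing microstack-specific, and the file name is arbitrary) is:

echo 'net.ipv4.ip_forward=1' | sudo tee /etc/sysctl.d/99-ip-forward.conf
sudo sysctl --system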

update: I did a little bit more work on networking

Capturing PCI devices with the correct driver

Let’s perform the ⧉regular grub PCI-passthrough steps to get our card captured by the vfio driver.

Basically that consists of editing your /etc/default/grub to add the following:

GRUB_CMDLINE_LINUX_DEFAULT="splash amd_iommu=on kvm.ignore_msrs=1 vfio-pci.ids=10de:2231,10de:1aef"

YMMV for the particulars of vfio-pci.ids, not to mention whether you’re on an AMD platform (amd_iommu=on) or need to ignore_msrs.
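
After editing the file, regenerate the grub config and reboot, then confirm the devices actually got bound to vfio-pci; the device IDs here are mine, substitute your own:

sudo update-grub
sudo reboot

# each of these should now report "Kernel driver in use: vfio-pci"
lspci -nnk -d 10de:2231
lspci -nnk -d 10de:1aef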

Whitelisting PCI devices

This is basically just adding, e.g.,

[pci]
passthrough_whitelist = [{ "vendor_id": "10de", "product_id": "2231" },{ "vendor_id": "10de", "product_id": "1aef" },{ "address": "0000:02:00.0" },{ "vendor_id": "1022", "product_id": "14da" }]
alias = { "vendor_id":"10de", "product_id":"2231", "device_type":"type-PCI", "name":"a5" }
alias = { "vendor_id":"10de", "product_id":"1aef", "device_type":"type-PCI", "name":"a5audio" }
alias = { "vendor_id":"c0a9", "product_id":"540a", "device_type":"type-PCI", "name":"nvme" }
alias = { "vendor_id":"1022", "product_id":"14da", "device_type":"type-PCI", "name":"bridge" }

to your nova-snap.conf. Be careful: if you need to pass through one of several identical devices, use the address form of the whitelist entry rather than the vendor_id/product_id form. For example, above I use 0000:02:00.0 to whitelist my NVMe controller on that specific bus; since I have two NVMe controllers in the same machine with the same vendor and product IDs, both would be whitelisted if I used the ID form instead.
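
The vendor/product pairs and bus addresses for these entries come straight out of lspci; the bracketed IDs sit at the end of each line and the bus address is the prefix (the grep pattern is just an example for my hardware):

lspci -nn | grep -iE 'nvidia|non-volatile'
# 01:00.0 VGA compatible controller [0300]: NVIDIA ... [10de:2231] (rev a1)
# 02:00.0 Non-Volatile memory controller [0108]: Micron/Crucial ... [c0a9:540a] (rev 01)
# the "address" form of the whitelist wants the domain prefix too, e.g. 0000:02:00.0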

Attaching PCI devices

This consists of taking a flavor and attaching the relevant PCI aliases from your nova config:

openstack flavor set m1.large --property "pci_passthrough:alias"="a5:1,a5audio:1,nvme:1"
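
Booting with that flavor is then the same server create as before; gpu-test is just a placeholder name and the IDs are elided:

microstack.openstack server create --flavor m1.large \
  --image ubuntu-jammy-nokvm \
  --nic net-id=<net-id> --security-group <sg-id> \
  --key-name own gpu-test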

After trying to piece together the correct configs for PCI passthrough using ⧉docs for the full version of openstack, I was left with No valid host was found. There are not enough hosts available., so I began debugging that.

systemctl status snap.microstack.nova-scheduler.service tells me Filter PciPassthroughFilter returned 0 hosts. OK, I must have configured something wrong. ⧉Here’s some documentation for pci_passthrough in the properties section of flavor.

After reading the ⧉newer docs a little more carefully, I realized a lot of the configuration I was doing was for SR-IOV, which I am not going to use here. Eventually I was able to get Please ensure all devices within the iommu_group are bound to their vfio bus driver in the nova logs! This error is comprehensible to me. Turns out my IOMMU group is not isolated.
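
A listing like the one below can be produced with the usual shell loop over /sys/kernel/iommu_groups (a standard trick, nothing specific to this setup):

# print every PCI device, prefixed by its IOMMU group
for g in /sys/kernel/iommu_groups/*; do
  for d in "$g"/devices/*; do
    echo "IOMMU Group ${g##*/} $(lspci -nns "${d##*/}")"
  done
done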

IOMMU Group 0 00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:14da]
IOMMU Group 0 00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:14db]
IOMMU Group 0 00:01.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:14db]
IOMMU Group 0 01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2231] (rev a1)
IOMMU Group 0 01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:1aef] (rev a1)
IOMMU Group 0 02:00.0 Non-Volatile memory controller [0108]: Micron/Crucial Technology Device [c0a9:540a] (rev 01)

The Non-Volatile memory controller is my NVMe drive. Turns out it shares an IOMMU group with the GPU. I guess that’s what I get for getting a compact motherboard. I managed to get a GPU machine running after a while though; here are some notes on the steps I took:

I think swapping the drives was definitely critical, and this seems to be confirmed since, when my GPU is passed through, I can no longer see the NVMe drive in the same group on the controller. I also suspect that installing virt-manager may have unstuck something, since I (re)installed a bunch of random dependencies including libvirt, and I wasn’t really reading the output of the apt-get commands at that point.

Passing through the nvme drive

I should be able to pass through the NVMe drive, and it would be a shame if I couldn’t, since it has data and models on it and is otherwise useless and inaccessible as things stand now.

And actually, it was a very simple matter once I read this in the ⧉openstack documentation:

If using vendor and product IDs, all PCI devices matching the vendor_id and product_id are added to the pool of PCI devices available for passthrough to VMs.

Basically, the passthrough_whitelist had to reference the address and not the vendor/product IDs, since those are not unique across the multiple NVMe controllers in my system.
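
Concretely, the pieces that matter are the address-form whitelist entry together with the nvme alias, trimmed out of the config above (aliases, unlike whitelist entries, still match on vendor/product IDs):

[pci]
passthrough_whitelist = [{ "address": "0000:02:00.0" }]
alias = { "vendor_id":"c0a9", "product_id":"540a", "device_type":"type-PCI", "name":"nvme" }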

Conclusion

Great! I have completed the basic setup of microstack on my server and can now use it for running GPU workloads. I have internet access and a large disk to use for models and data. I have learned that the second PCI slot is in an IOMMU group with my Ethernet controller, among others, which means this tower will be limited to a single GPU unless and until I get a different motherboard.