Hot-swapping PCI devices on running machines

I’d like to share GPUs between virtual machines. I do not want to use the same GPU from multiple machines at once, just one at a time, but I’d like to swap the GPUs between VMs as quickly as possible.

I found some info about vGPUs and even a video on YouTube, but they seem to require some sort of licensing that I think is too much for my simple needs right now.

I found gophercloud, a Go interface for OpenStack. I’m thinking of doing a timed round-robin for each VM. I could apply weights to the VMs…

No matter how I organize it, I have to do some research into how I might move the hardware between machines.

Naively using virsh

I thought I could simply use virsh attach-device and detach-device. I detached my 3 PCI devices (GPU, GPU audio, and NVMe drive) successfully.
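For the record, the detach commands looked roughly like this; the XML file names here are placeholders for the device definitions (one is shown later in this post):

    # detach each passthrough device from the running guest
    microstack.virsh detach-device instance-0000000b /root/pci-gpu.xml --live
    microstack.virsh detach-device instance-0000000b /root/pci-gpu-audio.xml --live
    microstack.virsh detach-device instance-0000000b /root/pci-nvme.xml --live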

This seems to work (the VM can no longer use the hardware), but OpenStack does not appear to be able to schedule another machine with the same PCI devices; sudo systemctl status snap.microstack.nova-scheduler.service shows errors related to all 3 PCI devices.

Exploring the OpenStack databases

I thought maybe I’d poke around in the databases a little bit.

I found a block of credentials using sudo snap get microstack config.credentials, including the MySQL server root password. Using ps aux I found the mysqld process running with the argument --defaults-file=/var/snap/microstack/common/etc/mysql/my.cnf, so I used that file and the password I’d found along with --user=root. I installed the mariadb client and libmariadb3 (needed for a password hashing algorithm) and we were in.

There are a TON of tables in the nova database. After checking instances and finding nothing, I found that instance_extra contained JSON blobs indicating that "pci_passthrough:alias"="a5:1,a5audio:1,nvme:1" was still set on the VM. That led me to the obvious idea of simply removing that property in microstack, which also doesn’t work :(
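In case it’s useful, the rough sequence looked something like this; the instance_extra column name is from memory, so treat it as a sketch rather than gospel:

    # find the DB credentials and the mysqld config file in use
    sudo snap get microstack config.credentials
    ps aux | grep mysqld

    # connect as root using the snap's my.cnf (prompts for the password found above)
    mysql --defaults-file=/var/snap/microstack/common/etc/mysql/my.cnf --user=root -p

    -- inside the client: look for the pci_passthrough alias in the serialized flavor
    USE nova;
    SELECT instance_uuid, flavor FROM instance_extra \G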

Feasibility of shelving and unshelving instances

It’s not the best solution, but I know that I can shelve and unshelve instances, and I think that will allow me to re-use the same hardware. This is not as good as being able to detach and re-attach the hardware on the fly, but until I can get OpenStack to recognize that the hardware has been freed up for another machine, there is little alternative.
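If I remember the CLI correctly, that looks something like this (the server name is a placeholder):

    # free up the host (and its PCI devices) by shelving the instance...
    microstack.openstack server shelve gpu-vm-1

    # ...and bring it back when it's that VM's turn again
    microstack.openstack server unshelve gpu-vm-1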

The problem with shelving is that the machine stays in the shelving_image_uploading state for altogether too long, which I would guess is due to the 50 GB disk I used.

Correctly using virsh

Rather than using microstack to attach the PCI devices, we can create VMs without the PCI devices, and attach and detach the PCI devices using virsh. This appears to work fine, with some tweaking of the target PCI slots to fit into the new machine.

Here’s an example PCI device definition:

    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </hostdev>

and the command to attach it is microstack.virsh attach-device instance-0000000b /root/pci-a.xml.
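Putting it together, a swap between two guests looks roughly like this; the second instance name is a placeholder, and as mentioned above the guest-side <address> in the XML may need its slot tweaked to fit the target machine:

    # detach the GPU from the first guest
    microstack.virsh detach-device instance-0000000b /root/pci-a.xml --live

    # edit the guest-side <address> slot in pci-a.xml if it collides, then
    # attach the device to the second guest
    microstack.virsh attach-device instance-0000000c /root/pci-a.xml --live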

Something to watch out for

Because of the way my IOMMU groups ended up, I am passing through an entire NVMe device. I will need to be careful not to detach this device while I am writing to it. The risk might be alleviated a bit by mounting the device read-only.
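Inside the guest that would be something like the following; the device node and mount point are just examples:

    # mount the passed-through NVMe drive read-only inside the guest
    sudo mount -o ro /dev/nvme0n1p1 /mnt/scratch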