KVM on Storage OS
I guess the time has come to share with the world some of our technical experiences while bringing new features to Storage OS. I'll focus on our latest addition to Storage OS – local virtualization.
What does this actually mean?
Well, you can now run virtual machines on the storage appliance itself, without any additional infrastructure. This was a real need for most of our small to medium customers: get rid of the 2-3 servers that provide file sharing, Active Directory or some other services, and virtualize them. But how could we bring this into our Storage OS?
Well, this is obviously not a new concept. Oracle has VirtualBox and it's quite nice. We had a few tries with it, but we were not really satisfied with the overall performance and its management capabilities. So we looked in other directions. The best and most stable choice was KVM (Kernel-based Virtual Machine). KVM has been running on Linux for a really long time and it's one of the most stable and performant hypervisors there is. We really have to give credit to Joyent for doing the initial porting of KVM to Illumos; they did an amazing job, hats off. They use it too, for their smart machines. What really struck me was that during the port the guys mentioned they did not find a single bug or problem in the entire KVM. That's something, considering that during a port you touch every single line of code there is.
The Storage Question
Having the kernel part ported, we had to build up the infrastructure to manage virtual machines on Storage OS. Our first dilemma was: what would be the backing store of a VM? We initially decided to use a Vdisk (a ZFS volume) as the backing store. At first it sounded like a good idea. You can set it up as thick or thin provisioned, whatever you feel like. What happens if you need more space? Well, you can easily expand a Vdisk and then allocate the new space within the guest OS itself. We used the /dev/zvol/rdsk path as the raw device path for the disk image.
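To give an idea of what that setup looks like, here is a minimal sketch; the pool and volume names are hypothetical, not the actual ones we use:

```python
import subprocess

POOL = "tank"                        # hypothetical pool name
VOL = f"{POOL}/vm/web01-disk0"       # hypothetical Vdisk backing one guest disk

# Create a thin-provisioned (sparse) 20G ZFS volume as the backing store.
subprocess.run(["zfs", "create", "-s", "-V", "20G", VOL], check=True)

# The raw device path that gets handed to qemu-kvm as the disk image.
raw_device = f"/dev/zvol/rdsk/{VOL}"

# Need more space later? Grow the volume, then claim the new space
# from inside the guest OS (partition / filesystem resize).
subprocess.run(["zfs", "set", "volsize=40G", VOL], check=True)
```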
Controlling and configuring local VMs
Our next question was: now that we have the backing store in place, how would we control the state and the configuration of a virtual machine? Having worked with Illumos (as Storage OS uses the Illumos kernel) for a few years now, I understood there is only one true way of handling long-running processes, and that is using SMF (Service Management Facility). SMF is one of the best things there is regarding service management in Solaris-based distributions. For every virtual machine we built an SMF service. We kept the configuration in a config property group. The configured RAM, the mounted ISO images, the boot order and the networking configuration were all service properties. The service start method would parse the configuration and generate the command to start up qemu-kvm for that virtual machine. One thing to note is that none of the tools to create, deploy or manage virtual machines found on Linux KVM distributions are present on Illumos, as libvirt is not ported. This was another reason why we had to build our own tools.
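To illustrate the idea, here is a rough sketch of what such a start-method helper could look like. The service FMRI, the property group and the property names are all made up for illustration, and the real start method handles a lot more options:

```python
#!/usr/bin/python
# Hypothetical start-method helper: read the VM configuration from the
# service's property group and build the qemu-kvm command line from it.
import subprocess

FMRI = "svc:/site/vm/web01:default"     # hypothetical per-VM service

def svcprop(prop):
    """Read one property from the (hypothetical) 'vm' property group."""
    out = subprocess.check_output(["svcprop", "-p", f"vm/{prop}", FMRI])
    return out.decode().strip()

ram_mb   = svcprop("ram_mb")            # e.g. "2048"
vcpus    = svcprop("vcpus")             # e.g. "2"
disk     = svcprop("disk_path")         # e.g. "/dev/zvol/rdsk/tank/vm/web01-disk0"
vnc_disp = svcprop("vnc_display")       # e.g. "3"

cmd = [
    "qemu-kvm",                         # binary name/path depends on the build
    "-m", ram_mb,
    "-smp", vcpus,
    "-drive", f"file={disk},if=virtio,format=raw",
    "-vnc", f"127.0.0.1:{vnc_disp}",
    "-daemonize",
]
subprocess.Popen(cmd)
```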
The communication challenge
Our next big challenge was how to communicate with a virtual machine while it's running. The obvious choice was to use QEMU's QMP. This is a JSON-based protocol that allows applications to communicate with QEMU instances (virtual machines). We wrote a new module that uses UNIX sockets to communicate with the virtual machines using QMP. We had a lot of challenges here, as QMP is not finished and a few commands are missing from the protocol itself, but we managed to overcome them. Using JSON-generated commands we were now able to control the virtual machines: adding new devices, issuing ACPI commands and so on. An interesting problem we faced here was when two different processes were connecting to the same socket to operate on the same virtual machine. We had to be careful to make each process's access to the socket exclusive.
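For a taste of what talking QMP looks like, here is a minimal sketch. The socket path is hypothetical, but the greeting, the qmp_capabilities handshake and commands like query-status and system_powerdown come from the protocol itself:

```python
import json
import socket

# Hypothetical per-VM QMP socket; qemu-kvm would be started with something
# like "-qmp unix:/var/run/vm/web01.qmp,server,nowait" to create it.
QMP_SOCKET = "/var/run/vm/web01.qmp"

sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
sock.connect(QMP_SOCKET)
stream = sock.makefile("rw")

def qmp(command, **arguments):
    """Send one QMP command as a JSON line and return the parsed reply."""
    msg = {"execute": command}
    if arguments:
        msg["arguments"] = arguments
    stream.write(json.dumps(msg) + "\n")
    stream.flush()
    # Skip asynchronous event messages until our command gets an answer.
    while True:
        reply = json.loads(stream.readline())
        if "return" in reply or "error" in reply:
            return reply

json.loads(stream.readline())        # QMP sends a greeting banner first
qmp("qmp_capabilities")              # ...and requires capability negotiation

print(qmp("query-status"))           # e.g. {"return": {"status": "running", ...}}
qmp("system_powerdown")              # ask the guest for an ACPI shutdown
```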
The memory management dilemma
OK, now that we have all these beautiful running virtual machines, how would we manage the memory of the host storage? What happens when a user goes on a frenzy and powers on just enough virtual machines to consume all the host memory? Another interesting problem on a storage system using ZFS is the management of the memory used by the ZFS ARC. ZFS is very aggressive in using RAM. Normally you would not limit the size of the ARC on a storage box, as ZFS releases the memory as other applications demand it. But when you know beforehand that you will need that memory, it makes sense to instruct ZFS not to use it anymore. So we built a dynamic memory provisioner for the virtual machines. As soon as you power on a virtual machine, the memory provisioner checks that enough memory stays reserved for the operating system itself and that at least 20% of the total memory remains available for the ARC after the virtual machine is powered on. If the conditions are met, it automatically limits the ARC to the closest possible value to the total memory minus the RAM used by the virtual machines.
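In rough Python, the power-on check boils down to something like this; the OS reservation constant is a made-up value for illustration, only the 20% ARC floor is the rule described above:

```python
# Rough sketch of the power-on check, all values in bytes.
OS_RESERVED = 4 << 30            # hypothetical: keep 4 GiB for the OS itself
ARC_FLOOR = 0.2                  # the ARC must keep at least 20% of total RAM

def arc_limit_after_power_on(total_mem, running_vms_ram, new_vm_ram):
    """Return the new ARC cap if the VM may be powered on, or None if not."""
    used_by_vms = running_vms_ram + new_vm_ram
    arc_left = total_mem - OS_RESERVED - used_by_vms
    if arc_left < ARC_FLOOR * total_mem:
        return None              # refuse to power on: not enough memory left
    # Shrink the ARC cap to roughly what remains after the OS and the VMs.
    return arc_left
```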
As you power on virtual machines, the ARC acts like a balloon and shrinks in order to free up memory for the running VMs. We wrote a tool that uses kstat and MDB (the modular debugger) to dynamically adjust the size of the ZFS ARC. Memory management: done.
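A rough sketch of that tool's two halves: reading the ARC numbers through kstat, and poking the cap through mdb in kernel read/write mode. The zfs_arc_max write is the usual live-tuning trick, but the exact symbol and whether the ARC reacts immediately depend on the Illumos build, so treat this as an illustration rather than the code we actually ship:

```python
import subprocess

def arc_kstat(stat):
    """Read one arcstats value (bytes) via kstat, e.g. 'size' or 'c_max'."""
    out = subprocess.check_output(["kstat", "-p", f"zfs:0:arcstats:{stat}"])
    return int(out.split()[-1])

def set_arc_max(new_max):
    """Write the ARC cap through mdb -kw. The symbol name and whether the
    change applies immediately can differ between Illumos builds;
    zfs_arc_max is the common tunable."""
    subprocess.run(["mdb", "-kw"],
                   input=f"zfs_arc_max/Z {new_max:#x}\n".encode(),
                   check=True)

print("ARC size:", arc_kstat("size"))
print("ARC cap :", arc_kstat("c_max"))
```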
Disaster Recovery for local VMs
Next we thought about what disaster recovery would mean for a virtual machine. One of the most used features in Storage OS is the dataset backup replication. Let's say you replicate, onsite or remotely, the Vdisk (ZFS volume) that represents your virtual machine: how would you easily recreate the virtual machine you had before on top of it? Not that easy, because the SMF service configuration you had before is not replicated, and recreating it by hand would be a real pain. Attaching the configuration as a ZFS property of the volume came to mind. But then we had a demo with one of our clients and he really insisted on having two separate disks for the virtual machine. Then it was clear to us that using a Vdisk as a backing store would just not cut it anymore. We did not have any flexibility. What to do? Buy a new big whiteboard at least 2 meters wide, a lot of markers, and let the design and idea dumping process begin :).
What we ended up with was using a ZFS filesystem to represent a virtual machine, some sort of a bundle. Inside the filesystem mountpoint we save the configuration of the virtual machine in JSON format (it's quite easy to transform any object into a JSON representation in any programming language nowadays). Someone who wants to convert the machine configuration into a Joyent smart machine one, or transform it into XML for libvirt on Linux, could do this really easily. Besides the configuration, we create the disk images there. This time we used the native QEMU format – qcow2 – and did the resizing using qemu-img. This kind of virtual machine representation allowed us to easily send/receive a filesystem that encapsulates a virtual machine bundle. Inside the SMF service attached to the VM we don't keep any configuration anymore, just the path of the ZFS filesystem that contains the bundle. Importing a virtual machine on your backup storage just means parsing the configuration and generating all the attached SMF services to control it. This turned out to be a great and natural solution that gave us great flexibility.
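Here is a sketch of what such a bundle could look like; the mountpoint and the JSON fields are hypothetical and only meant to show the shape of it:

```python
import json
import os
import subprocess

# Hypothetical bundle: the mountpoint of the per-VM ZFS filesystem.
BUNDLE = "/volumes/tank/vm/web01"
CONFIG = os.path.join(BUNDLE, "vm.json")

# The JSON fields below are illustrative, not our actual schema.
config = {
    "name": "web01",
    "ram_mb": 2048,
    "vcpus": 2,
    "boot_order": "cd",
    "nics": [{"model": "virtio", "vlan": 10}],
    "disks": ["disk0.qcow2", "disk1.qcow2"],   # any number of disks now
}
with open(CONFIG, "w") as f:
    json.dump(config, f, indent=2)

# Disk images live next to the config, in native qcow2 format.
disk0 = os.path.join(BUNDLE, "disk0.qcow2")
subprocess.run(["qemu-img", "create", "-f", "qcow2", disk0, "20G"], check=True)

# Resizing later is a qemu-img operation (then grow it inside the guest).
subprocess.run(["qemu-img", "resize", disk0, "+10G"], check=True)
```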
The connectivity venture
We decided to provide a way to connect directly to a virtual machine console from within the browser, as an alternative to a traditional VNC client. We created websocket proxies that forward each VNC display to a websocket. VNC displays are automatically allocated when virtual machines start up; we needed that in order to make sure each running virtual machine gets a unique display. Having the VNC displays forwarded to websockets, we used an in-browser JavaScript VNC client that we completely ported, modified and restyled to be usable within our web management application. We communicate with the virtual machine, take a screenshot of the current display state, and as soon as you click on that image we open up an in-browser VNC client to the virtual machine. Pretty neat, to be honest.
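A stripped-down sketch of such a websocket-to-VNC proxy, using the third-party Python websockets package (not the actual code we run); display and port numbers are made up, and the display-to-port mapping is the standard 5900 + display convention:

```python
# One proxy per VNC display: pump raw RFB bytes between a websocket client
# (the in-browser VNC client) and the local VNC server exposed by qemu-kvm.
import asyncio
import websockets

VNC_HOST = "127.0.0.1"
VNC_DISPLAY = 3                      # hypothetical display -> TCP port 5903
VNC_PORT = 5900 + VNC_DISPLAY        # standard VNC display-to-port mapping
WS_PORT = 6080 + VNC_DISPLAY         # hypothetical websocket port

async def proxy(websocket):
    reader, writer = await asyncio.open_connection(VNC_HOST, VNC_PORT)

    async def ws_to_vnc():
        async for frame in websocket:        # binary RFB data from the browser
            writer.write(frame)
            await writer.drain()

    async def vnc_to_ws():
        while data := await reader.read(4096):
            await websocket.send(data)       # RFB data back to the browser

    await asyncio.gather(ws_to_vnc(), vnc_to_ws())

async def main():
    async with websockets.serve(proxy, "0.0.0.0", WS_PORT):
        await asyncio.Future()               # serve forever

asyncio.run(main())
```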
That’s pretty much the story of our last iteration. A lot of fun, a whole lot of new technology. Hope that people will enjoy what we did. We are always open to questions about everything so feel free to ask :). Thanks.