VMware VimBroker Removal

Introduction

Recently the ManageIQ team completed a major milestone for the VMware Provider: removal of the VimBroker worker. This was a large undertaking, as it significantly alters how the basic functions of the VMware Provider are carried out.

Before we cover what was done to remove the broker worker, we need to understand what problems the broker was solving and how it solved them.

What does the VimBroker do?

The VMware provider interfaces with the VMware vSphere API, also known as the VIM API (Virtual Infrastructure Manager) or, more recently, the “vSphere Web Services API” since it includes more than just the VIM endpoint (e.g. SPBM). It is a SOAP API with a WSDL file and can be downloaded [here] (free download, but a VMware account is required).

When working with the VIM SDK at scale a number of challenges arise:

  1. Sessions are expensive and limited
  2. Individual API calls are slow

Sessions

In order to perform most API requests on vSphere you have to Login to the SessionManager, which assigns you a UserSession. Depending on the version, a vCenter can only have a few hundred sessions (newer versions support 2,000), but in practice the overall performance of the vCenter degrades well before that limit.

It is also possible for an application to “starve” other users to the point where an administrator has to either delete other sessions from an existing session or reboot the vCenter server. Not ideal.

In addition to the limited number of sessions, it is relatively slow to Login and Logout, so from a purely performance point of view it is advantageous to share sessions across operations.

The VimBroker solves this by acting as a connection broker, allowing multiple client processes to execute VIM API calls remotely while sharing a single VIM session.
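
To make that concrete, here is a minimal sketch (not ManageIQ code) using the RbVmomi gem: log in once and reuse the resulting UserSession for many calls instead of paying the Login/Logout cost per operation. The host and credentials are placeholders.

    require 'rbvmomi'

    # Log in once; RbVmomi's connect performs the SessionManager Login for us.
    # Host and credentials below are placeholders, not real values.
    vim = RbVmomi::VIM.connect(
      :host     => 'vcenter.example.com',
      :user     => 'user',
      :password => 'pass',
      :insecure => true
    )

    # Many operations can now share this single UserSession...
    about = vim.serviceContent.about
    puts "#{about.fullName} (apiVersion #{about.apiVersion})"

    # ...and Logout happens only once, when we are completely done.
    vim.close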

API Calls

Once you have your session you will find that the inventory is organized in a very nice tree structure. You’ll be tempted to navigate this tree structure recursively, going from the rootFolder to the datacenters to the host to the VMs that you’re looking for. Don’t do this.

If you’re lucky you will very quickly discover that this does not scale. If you’re unlucky you won’t discover this until you have a customer with a vCenter much larger than what you are used to dealing with (ask me how I know).

Never do anything one object at a time

As a general rule you want to work with the VIM API in batches. When dealing with inventory the answer is to use a PropertyCollector; for other operations, like metrics collection, the answer is to pass multiple VMs to a single QueryPerf call.
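
As a hedged illustration of the batching idea (not the provider's actual code), the following uses RbVmomi's PropertyCollector helper to fetch properties for every VM in one round trip; it assumes a vim connection like the one in the earlier sketch, and the property names are just examples.

    # Collect all VirtualMachine references with a single container view, then
    # fetch their properties in one PropertyCollector call instead of one API
    # call per VM.
    view = vim.serviceContent.viewManager.CreateContainerView(
      :container => vim.serviceContent.rootFolder,
      :type      => ['VirtualMachine'],
      :recursive => true
    )

    props = vim.propertyCollector.collectMultiple(view.view, 'name', 'runtime.powerState')
    props.each { |vm, p| puts "#{p['name']}: #{p['runtime.powerState']}" }

Metrics work the same way: QueryPerf accepts an array of query specs, so one call can cover many VMs rather than issuing a call per VM.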

The VimBroker maintains a cache of the inventory, using WaitForUpdatesEx in a thread to efficiently keep that inventory up to date and make it available to clients. This means that methods that use inventory data from the broker can access it quickly, without waiting on individual API calls to get information about the VMs they are acting on.

Why did we want to change it?

So if the broker solved all of these issues for us why did we want to change it?

The VimBroker worker was notoriously a memory hog: in addition to caching all of that inventory, it was a DRb server and thus kept around every object that clients had open. Because it didn’t have any information about who was calling it, it had to keep everything in memory just in case.

The other reason is that inventory refreshes can be dramatically faster: the current refresh process requires that the entire inventory be cached.

In order for something like a new VM to be picked up, the following has to occur:

  1. The MiqVimBrokerWorker's updateThread gets a WaitForUpdatesEx ObjectUpdate describing the new VM and adds it to the cache.
  2. At the same time, in another worker process, the EventCatcher's WaitForUpdatesEx loop catches the VmCreatedEvent and puts it on the queue.
  3. The MiqEventHandler picks the event off of the queue and adds it to the EventStreams table.
  4. The MiqEventHandler then invokes automate, which processes the event.
  5. Automate runs the event handler for this event, which queues a targeted refresh of the host the new VM is on.
  6. The RefreshWorker dequeues this targeted refresh request and asks the MiqVimBrokerWorker cache about all of the VMs on the host.
  7. SaveInventory will then create the new VM, in addition to updating all of the other VMs on the host.

With the new “streaming refresh” method this is what happens (a rough code sketch follows the list):

  1. The RefreshWorker gets a WaitForUpdatesEx ObjectUpdate directly about the new VM.
  2. It parses the payload describing the new VM.
  3. SaveInventory then creates only the new VM record.
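
To give a feel for the streaming loop, here is a simplified sketch using RbVmomi. It is illustrative only, not the actual RefreshWorker code; the watched properties are arbitrary and the save_inventory_for helper at the end is hypothetical.

    # Watch all VirtualMachines through a container view and stream updates.
    view = vim.serviceContent.viewManager.CreateContainerView(
      :container => vim.serviceContent.rootFolder,
      :type      => ['VirtualMachine'],
      :recursive => true
    )

    filter_spec = RbVmomi::VIM.PropertyFilterSpec(
      :objectSet => [
        RbVmomi::VIM.ObjectSpec(
          :obj       => view,
          :skip      => true,
          :selectSet => [
            RbVmomi::VIM.TraversalSpec(:name => 'tvs', :type => 'ContainerView', :path => 'view', :skip => false)
          ]
        )
      ],
      :propSet => [
        RbVmomi::VIM.PropertySpec(:type => 'VirtualMachine', :pathSet => ['name', 'runtime.powerState'])
      ]
    )

    property_collector = vim.propertyCollector
    property_collector.CreateFilter(:spec => filter_spec, :partialUpdates => true)

    version = '' # an empty version means the first call returns the full inventory
    loop do
      update_set = property_collector.WaitForUpdatesEx(
        :version => version,
        :options => RbVmomi::VIM.WaitOptions(:maxWaitSeconds => 60)
      )
      next if update_set.nil? # timed out with nothing new; poll again

      version = update_set.version
      update_set.filterSet.to_a.each do |filter_update|
        filter_update.objectSet.to_a.each do |object_update|
          # object_update.kind is "enter", "modify", or "leave", and
          # object_update.changeSet holds just the changed properties,
          # so only the affected records need to be saved.
          save_inventory_for(object_update) # hypothetical persistence step
        end
      end
    end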

This is drastically simpler and faster, and it allows us to reduce the dependence on caching in the RefreshWorker. For a video demo of this new Streaming Refresh method, see the providers section of Sprint Review 127.

Why did we have to change it?

The VimBroker worker is fundamentally a DRb server, serving the VIM SDK to clients over the Distributed Ruby protocol. Since DRb stores the URI of the sender in order to respond to it, it doesn’t work behind a load balancer: it ends up sending the response to the load balancer instead of back to the original caller. This is fine when there is a VimBroker per appliance serving processes over localhost, but when it is running as a service in OpenShift serving other pods, the VimBroker sits behind a Kubernetes load balancer.
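
For context, this is roughly what serving an object over DRb looks like. It is a toy sketch, not the MiqVimBroker code; the URI, port, and class are made up.

    require 'drb/drb'

    # Broker pattern in miniature: one process owns the expensive VIM session
    # and serves it to other local processes over DRb.
    class BrokerFront
      def power_state(vm_name)
        # ...the real broker would consult its shared cache / VIM session here...
        'poweredOn'
      end
    end

    # Broker process (URI and port are placeholders):
    DRb.start_service('druby://localhost:8787', BrokerFront.new)

    # Client process: each worker talks to the shared broker instead of opening
    # its own VIM session. Note that DRb needs to reach the client back at its
    # own URI, which is exactly what breaks behind a load balancer.
    DRb.start_service
    broker = DRbObject.new_with_uri('druby://localhost:8787')
    puts broker.power_state('my-vm')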

This meant that when MIQ workers were running in their own container deployments it wasn’t possible to use the VMware provider.

So what did we do?

We had to solve the same problems that the VimBroker solves but without DRb for all functions of the VMware Provider.

Inventory was simple because we had been moving towards streaming refresh with RbVmomi and not the broker for a long time.

Even though the EventCatcher used the VMwareWebService gem which also contains the broker, the EventCatcher never actually used DRb.

This just left Operations and Metrics.

What we decided on was to add a new OperationsWorker which could maintain a vSphere session and cache shared by the whole process. There was already a MiqVim “direct” connection, which the VimBroker served by wrapping it in DRb as the DMiqVim class. This gave us a way of running the existing methods in a way that is completely compatible with how they were run over the broker.

Then we just had to update all of the ems_operations roles to run on the OperationsWorker. Previously all ems_operations were run by the Generic or Priority workers, which meant the queue_name for all of these was nil.

We had to find all of these and allow them to be processed by this new OperationsWorker.
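
As a hedged illustration of what that routing change means (not the actual diff; the queue_name helper and the method name below are assumptions, and vm/ems are assumed to be the relevant records), queueing an operation now targets a provider-specific queue instead of the nil queue picked up by the Generic or Priority workers:

    # Before: :queue_name was nil, so a Generic or Priority worker ran this.
    # After: the message is routed to a queue the OperationsWorker listens on.
    MiqQueue.put(
      :class_name  => 'ManageIQ::Providers::Vmware::InfraManager::Vm',
      :instance_id => vm.id,
      :method_name => 'start',
      :role        => 'ems_operations',
      :zone        => ems.my_zone,
      :queue_name  => ems.queue_name_for_ems_operations # assumed helper; was nil before
    )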

Now we have a VMware provider which can work in a podified environment, in addition to gaining the performance benefits that streaming refresh brings.

https://github.com/ManageIQ/manageiq-providers-vmware/issues/484 details the problems and has a checklist of things which had to change if you’re interested in more details.

What’s next?

We plan on continuing to replace the code that uses our unique VMwareWebService handsoap implementation with the standard VMware Ruby gem, RbVmomi.

What we really need is increased testing focus on the VMware provider. This will help us flush out bugs, as well as any significant differences in the new behavior.

If you find anything please open an issue at https://github.com/ManageIQ/manageiq-providers-vmware/issues/new or post in the Gitter room https://gitter.im/ManageIQ/manageiq-providers-vmware

Thanks!
ManageIQ Providers Team


A question has come up around automate scripts that hit the VimBroker directly by using vm.object_send('instance_eval', 'with_provider_object { | vimVm | return vimVm }').

Usage of instance_eval isn’t supported; the proper way to interact with the VM is through the service models, which will queue the action for the appropriate worker.

For example, to add a disk to a VM you would use the MiqAeServiceManageIQ_Providers_Vmware_InfraManager_Vm#add_disk method instead of hitting the broker’s add-disk method directly.
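
For illustration only (the argument names and units below are assumptions; check the service model for the exact add_disk signature), an automate method would look roughly like this:

    # Go through the service model, which queues the work for the right worker,
    # instead of reaching into the broker/VIM object with instance_eval.
    vm = $evm.root['vm']
    raise 'VM not found' if vm.nil?

    vm.add_disk('new-disk', 10 * 1024) # disk name and size in MB are assumed arguments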
