Recently the ManageIQ team completed a major milestone for the VMware Provider: removal of the VimBroker worker. This was a large undertaking, as it significantly altered how the basic functions of the VMware Provider are carried out.
Before we cover what was done to remove the broker worker, we need to understand what problems the broker was solving and how it solved them.
## What does the VimBroker do?
The VMware provider interfaces with the VMware vSphere API, also known as the VIM (Virtual Infrastructure Manager) API or, more recently, the "vSphere Web Services API", since it includes more than just the VIM endpoint (e.g. SPBM). It is a SOAP API with a WSDL file and can be downloaded [here] (free download, but a VMware account is required).
When working with the VIM SDK at scale a number of challenges arise:
- Sessions are expensive and limited
- Individual API calls are slow
In order to perform most API requests on vSphere you have to `Login` to the `SessionManager`. When you do this you are assigned a `UserSession`. Depending on the version, a vCenter is only able to have a few hundred sessions (newer versions support 2,000), but generally the overall performance of the vCenter degrades rapidly well before this limit.
It is also possible for an application to "starve" other users to the point where an administrator has to either use an existing session to delete the other sessions or reboot the vCenter server. Not ideal.

In addition to being limited in number, sessions are relatively slow to `Login` and `Logout`, so from a purely performance point of view it is advantageous to share sessions across operations.
The VimBroker solves this by acting as a connection broker, allowing multiple client processes to execute VIM API calls remotely while sharing a single VIM session.
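In plain Ruby, the session-sharing idea can be sketched roughly like this. The `SessionBroker` class and the login block are invented for illustration; the real broker deals in actual `SessionManager` logins:

```ruby
require "monitor"

# Hypothetical sketch of session sharing: many callers, one login.
# The block passed to SessionBroker.new stands in for the expensive
# SessionManager#Login call.
class SessionBroker
  include MonitorMixin

  def initialize(&login)
    super()
    @login = login
    @session = nil
  end

  # Returns the shared session, performing the expensive login only once.
  def session
    synchronize { @session ||= @login.call }
  end
end

login_count = 0
broker = SessionBroker.new { login_count += 1; "session-#{login_count}" }

# Ten "worker" threads all end up sharing the single session.
sessions = 10.times.map { Thread.new { broker.session } }.map(&:value)
puts login_count      # only one expensive login happened
puts sessions.uniq
```

The mutex (via `MonitorMixin`) matters: without it, two threads could race past the `nil` check and each pay for a login.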
Once you have your session you will find that the inventory is organized in a very nice tree structure. You'll be tempted to navigate this tree recursively, going from the `rootFolder` to the datacenters to the hosts to the VMs that you're looking for. Don't do this.
If you're lucky you will very quickly discover that this does not scale. If you're unlucky you won't discover it until you have a customer with a vCenter much larger than what you are used to dealing with (ask me how I know).
### Never do anything one object at a time
As a general rule you want to work with the VIM API in batches. When dealing with inventory the answer is to use a `PropertyCollector`; for other operations, such as metrics collection, the answer is to pass multiple VMs to a single call (the vSphere `PerformanceManager`'s `QueryPerf`, for example, accepts an array of query specs).
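To see why batching matters, here is a toy model where each tree navigation stands in for one slow SOAP round trip. The tree contents and counters are invented for illustration; a real `PropertyCollector` query would return every matching VM in one round trip:

```ruby
# Toy inventory tree: rootFolder -> datacenters -> hosts -> vms.
# Each visit in the recursive walk stands in for one slow round trip.
TREE = {
  "rootFolder" => ["dc1", "dc2"],
  "dc1" => ["host1"], "dc2" => ["host2"],
  "host1" => ["vm1", "vm2"], "host2" => ["vm3"]
}.freeze

# Recursive walk: one "API call" per object visited.
def walk(node, calls)
  calls[:count] += 1
  children = TREE.fetch(node, [])
  children.empty? ? [node] : children.flat_map { |c| walk(c, calls) }
end

recursive_calls = { count: 0 }
vms = walk("rootFolder", recursive_calls).select { |n| n.start_with?("vm") }

# PropertyCollector-style batch query: every VM in a single round trip.
batched_calls = 1
batched_vms = ["vm1", "vm2", "vm3"]

puts "recursive: #{recursive_calls[:count]} calls, batched: #{batched_calls} call"
```

Even in this tiny tree the recursive walk costs eight round trips for three VMs; at tens of thousands of objects the difference is the whole ballgame.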
The VimBroker maintains a cache of the inventory, using `WaitForUpdatesEx` in a background thread to efficiently keep that inventory up to date and make it available to clients. This means that methods which use inventory data from the broker can access it quickly, without waiting on individual API calls to look up the VM they are acting on.
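The cache-plus-update-loop idea looks roughly like this in plain Ruby. Here `fake_wait_for_updates` is an invented stand-in for the real `WaitForUpdatesEx` call, which returns only the changes that occurred since the version token you pass in:

```ruby
# Hypothetical stand-in for WaitForUpdatesEx: given the last version
# token seen, return [new_version, updates_since_then], or nil when
# there is nothing newer.
UPDATE_SETS = [
  [1, [{ obj: "vm-101", kind: :enter,  props: { "name" => "db01" } }]],
  [2, [{ obj: "vm-101", kind: :modify, props: { "name" => "db01-renamed" } }]],
  [3, [{ obj: "vm-101", kind: :leave,  props: {} }]]
].freeze

def fake_wait_for_updates(version)
  UPDATE_SETS.find { |v, _| v > version }
end

cache = {}
version = 0

# The broker ran a loop like this in a background thread, so clients
# could read inventory from the cache without making any API calls.
while (update_set = fake_wait_for_updates(version))
  version, updates = update_set
  updates.each do |u|
    case u[:kind]
    when :enter  then cache[u[:obj]] = u[:props]
    when :modify then cache[u[:obj]] = cache.fetch(u[:obj], {}).merge(u[:props])
    when :leave  then cache.delete(u[:obj])
    end
  end
end

puts version       # 3
puts cache.empty?  # the VM entered, was modified, then left
```

The version token is what makes the loop cheap: each iteration receives only the delta since the last one, never the full inventory.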
## Why did we want to change it?
So if the broker solved all of these issues for us why did we want to change it?
The VimBroker worker was notoriously a memory hog: in addition to caching all of that inventory, it was a `DRb` server and thus kept around any objects that clients had open. Because it had no information about who was calling it, it had to keep everything in memory just in case.
The other reason is that inventory refreshes can be dramatically faster: the way refresh currently works is what requires everything to be cached in the first place.
In order for something like a new VM to be picked up the following has to occur:
- The `WaitForUpdatesEx` loop in the broker worker receives an `ObjectUpdate` describing the new VM, and it is added to the cache
- At the same time, in another worker process, the `WaitForUpdatesEx` loop catches the `VmCreatedEvent` and puts it on the queue
- The `MiqEventHandler` picks the event off of the queue and adds it to the `event_streams` table
- Then the `MiqEventHandler` invokes automate, which processes the event
- Automate runs the event handler for this event, which queues a targeted refresh of the host the new VM is on
- The `RefreshWorker` dequeues this targeted refresh request and asks the `MiqVimBrokerWorker` cache about all of the VMs on the host
- `SaveInventory` then creates the new VM, in addition to updating all of the other VMs
With the new “streaming refresh” method this is what happens:
- The `RefreshWorker`'s own `WaitForUpdatesEx` loop receives an `ObjectUpdate` directly about the new VM
- It parses the payload describing the new VM
- `SaveInventory` then creates only the new VM record
This is drastically simpler and faster, and it reduces the dependence on caching in the RefreshWorker. For a video demo of this new streaming refresh method, see the providers section of Sprint Review 127.
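The targeted-save step above can be sketched in plain Ruby; `save_inventory` here is an invented stand-in for the real SaveInventory code, and the records are made up:

```ruby
require "set"

# Fake database of existing VM records, keyed by ref.
db = {
  "vm-1" => { name: "web01" },
  "vm-2" => { name: "web02" }
}

saved_refs = Set.new

# Hypothetical targeted save: upsert only the records named in the
# update, instead of re-saving every VM on the host.
def save_inventory(db, updates, saved_refs)
  updates.each do |ref, props|
    db[ref] = (db[ref] || {}).merge(props)
    saved_refs << ref
  end
end

# An ObjectUpdate payload describing one newly created VM.
object_update = { "vm-3" => { name: "db01" } }
save_inventory(db, object_update, saved_refs)

puts db.keys.sort.inspect
puts saved_refs.to_a.inspect  # only the new VM was written
```

The point of the sketch: the two pre-existing records are never touched, so the cost of picking up one new VM no longer scales with the size of the host it landed on.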
## Why did we have to change it?
The VimBroker worker is fundamentally a `DRb` server, serving the VIM SDK to clients over the Distributed Ruby protocol. Since `DRb` stores the URI of the sender in order to respond to it, it doesn't work behind a load balancer: it ends up sending the response to the load balancer instead of back to the original caller. This is fine when there is a VimBroker per appliance serving processes over localhost, but when it runs as a service in OpenShift serving other pods, the VimBroker sits behind a Kubernetes load balancer.
This meant that when ManageIQ workers were running in their own container deployments it wasn't possible to use the VimBroker.
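For context, serving an object over DRb takes only a few lines of Ruby's standard library; the reply routing that breaks behind a load balancer happens inside DRb itself. The front object and its inventory below are invented stand-ins for the broker's real `DMiqVim`:

```ruby
require "drb/drb"

# Toy front object standing in for the broker's DMiqVim class; the
# inventory data is invented for illustration.
class FakeBrokerFront
  def vms_on_host(host)
    { "host1" => ["vm1", "vm2"] }.fetch(host, [])
  end
end

# Serve the object over DRb, as the VimBroker worker did.
DRb.start_service("druby://127.0.0.1:0", FakeBrokerFront.new)

# Clients connect by URI; DRb tracks the peer in order to route the
# reply, which is what goes wrong when a load balancer sits in between.
client = DRb::DRbObject.new_with_uri(DRb.uri)
result = client.vms_on_host("host1")
puts result.inspect

DRb.stop_service
```

Over localhost this works transparently; the method call looks local but crosses a process boundary. That transparency is exactly what could not be preserved once workers moved into separate pods.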
## So what did we do?
We had to solve the same problems that the VimBroker solves, but without `DRb`, for all functions of the VMware Provider.
Inventory was simple, because we had been moving towards streaming refresh with RbVmomi, rather than the broker, for a long time.
Even though the EventCatcher used the `VMwareWebService` gem, which also contains the broker, the EventCatcher never actually used the broker itself.
This just left Operations and Metrics.
What we decided on was to add a new OperationsWorker which could maintain a vSphere session and cache shared by the whole process. There was already a `MiqVim` "direct" connection, which was wrapped in `DRb` as the `DMiqVim` class and served over the VimBroker. This gave us a way of running the existing methods in a manner completely compatible with how they were run over the broker.
Then we just had to update all of the ems_operations roles to run on the OperationsWorker. Previously, all ems_operations were run by the Generic or Priority workers, meaning the `queue_name` for all of these was `generic`. We had to find all of these and allow them to be processed by the new OperationsWorker instead.
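The routing change can be pictured as follows; the class, method, and queue names here are invented for illustration, not the actual ManageIQ API:

```ruby
# Hypothetical sketch: picking a queue for an ems_operations request.
# Before, everything went to the shared "generic" queue; with the new
# worker, VMware operations are routed to a per-EMS operations queue.
Ems = Struct.new(:id, :has_operations_worker) do
  def operations_queue_name
    has_operations_worker ? "operations_#{id}" : "generic"
  end
end

vmware = Ems.new(42, true)   # provider with the new OperationsWorker
legacy = Ems.new(7, false)   # provider still using the generic queue

puts vmware.operations_queue_name
puts legacy.operations_queue_name
```

Routing by queue name is what lets the OperationsWorker own the shared session: every operation for a given EMS lands in the one process holding that EMS's connection and cache.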
Now we have a VMware provider which can work in a podified environment, in addition to gaining the performance benefits that streaming refresh brings.
If you're interested in more detail, https://github.com/ManageIQ/manageiq-providers-vmware/issues/484 describes the problems and has a checklist of the things that had to change.
We plan to continue replacing code that uses our custom handsoap-based `VMwareWebService` implementation with the standard VMware Ruby gem, RbVmomi.
What we really need now is an increased testing focus on the VMware provider. This will help us flush out bugs, or simply significant differences, in the new behavior.
If you find anything, please open an issue at https://github.com/ManageIQ/manageiq-providers-vmware/issues/new or post in the Gitter room at https://gitter.im/ManageIQ/manageiq-providers-vmware.
ManageIQ Providers Team