Introduction
Recently the ManageIQ team completed a major milestone for the VMware Provider: removal of the `VimBroker` worker. This was a large undertaking, as it significantly alters how the basic functions of the VMware Provider are carried out.
Before we cover what was done to remove the broker worker we need to understand what problems the broker was solving and how it accomplished them.
What does the VimBroker do?
The VMware provider interfaces with the VMware vSphere API, also known as the VIM (Virtual Infrastructure Manager) API or, more recently, the “vSphere Web Services API” since it includes more than just the VIM endpoint (e.g. SPBM). It is a SOAP API with a WSDL file, which can be downloaded from VMware (a free download, but a VMware account is required).
When working with the VIM SDK at scale a number of challenges arise:
- Sessions are expensive and limited
- Individual API calls are slow
Sessions
In order to perform most API requests on vSphere you have to `Login` to the `SessionManager`. When you do this you are assigned a `UserSession`. Depending on the version, a vCenter is only able to have a few hundred sessions (newer versions support 2,000), but in practice the overall performance of the vCenter degrades rapidly well before this limit.
It is also possible for an application to “starve” other users to the point where an administrator has to either delete other sessions from an existing session or reboot the vCenter server. Not ideal.
In addition to the limited number of sessions, it is relatively slow to `Login` and `Logout`, so from a purely performance point of view it is advantageous to share sessions across operations.
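For illustration, here is roughly what a session's lifecycle looks like with the RbVmomi gem (the hostname and credentials are placeholders):

```ruby
require 'rbvmomi'

# Each connect issues a Login against the SessionManager and consumes
# one of the vCenter's limited UserSession slots.
vim = RbVmomi::VIM.connect(
  host:     'vcenter.example.com', # placeholder host
  user:     'administrator@vsphere.local',
  password: 'password',
  insecure: true
)

# The UserSession we were assigned is visible on the SessionManager.
session = vim.serviceContent.sessionManager.currentSession
puts "Logged in as #{session.userName} (session key #{session.key})"

# ... ideally perform many API calls per session ...

# Log out and release the session; leaking sessions is exactly how an
# application ends up starving other users of the vCenter.
vim.close
```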
The `VimBroker` solves this by acting as a connection broker, allowing multiple client processes to execute VIM API calls remotely while sharing a single VIM session.
API Calls
Once you have your session you will find that the inventory is organized in a very nice tree structure. You'll be tempted to navigate this tree recursively, going from the `rootFolder` to the datacenters to the hosts to the VMs that you're looking for. Don't do this.
If you're lucky you will very quickly discover that this does not scale. If you're unlucky you won't discover it until you have a customer with a vCenter much larger than what you are used to dealing with (ask me how I know).
Never do anything one object at a time
As a general rule you want to work with the VIM API in batches. When dealing with inventory the answer is to use a `PropertyCollector`; for other operations, like metrics collection, the answer is to pass multiple VMs to the `QueryPerf` method.
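As a sketch of what batched inventory collection looks like, here is how you might pull a couple of properties for every VM in a single round trip using RbVmomi's `PropertyCollector` (the host, credentials, and the specific properties requested are placeholders):

```ruby
require 'rbvmomi'

vim = RbVmomi::VIM.connect(host: 'vcenter.example.com', user: 'user',
                           password: 'pass', insecure: true)

# A ContainerView covers every VirtualMachine under the rootFolder
# without walking the folder tree one object at a time.
view = vim.serviceContent.viewManager.CreateContainerView(
  container: vim.serviceContent.rootFolder,
  type:      ['VirtualMachine'],
  recursive: true
)

filter_spec = RbVmomi::VIM.PropertyFilterSpec(
  objectSet: [
    RbVmomi::VIM.ObjectSpec(
      obj:  view,
      skip: true,
      selectSet: [
        RbVmomi::VIM.TraversalSpec(
          name: 'traverseEntities',
          type: 'ContainerView',
          path: 'view',
          skip: false
        )
      ]
    )
  ],
  propSet: [
    RbVmomi::VIM.PropertySpec(
      type:    'VirtualMachine',
      pathSet: ['name', 'runtime.powerState'] # example properties
    )
  ]
)

# One API call returns the requested properties for every VM. Very large
# inventories are paginated: result.token would be passed to
# ContinueRetrievePropertiesEx to fetch the next page.
result = vim.serviceContent.propertyCollector.RetrievePropertiesEx(
  specSet: [filter_spec], options: {}
)
(result ? result.objects : []).each do |object_content|
  props = object_content.propSet.map { |p| [p.name, p.val] }.to_h
  puts "#{props['name']}: #{props['runtime.powerState']}"
end

view.DestroyView
```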
The `VimBroker` maintains a cache of the inventory, using `WaitForUpdatesEx` in a thread to efficiently keep the inventory up to date and make it available to clients. This means that methods which use inventory data from the broker can access it quickly, without waiting for individual API calls to get information about the VM being acted on.
Why did we want to change it?
So if the broker solved all of these issues for us, why did we want to change it?
The `VimBroker` worker was notoriously a memory hog: in addition to caching all of that inventory, it was a `DRb` server and thus kept around any objects that clients had open. Because it had no information about who was calling it, it had to keep everything in memory just in case.
The other reason is that inventory refreshes can be dramatically faster; the current refresh process requires that everything be cached.
In order for something like a new VM to be picked up, the following has to occur:
- The `MiqVimBrokerWorker`'s `updateThread` gets a `WaitForUpdatesEx` `ObjectUpdate` describing the new VM, and it is added to the cache
- At the same time, in another worker process, the `EventCatcher`'s `WaitForUpdatesEx` loop catches the `VmCreatedEvent` and puts it on the queue
- The `MiqEventHandler` picks the event off of the queue and adds it to the `EventStreams` table
- The `MiqEventHandler` then invokes automate, which processes the event
- Automate runs the event handler for this event, which queues a targeted refresh of the host the new VM is on
- The `RefreshWorker` dequeues this targeted refresh request and asks the `MiqVimBrokerWorker` cache about all of the VMs on the host
- `SaveInventory` then creates the new VM, in addition to updating all of the other VMs
With the new “streaming refresh” method this is what happens:
- The `RefreshWorker` gets a `WaitForUpdatesEx` `ObjectUpdate` directly about the new VM
- It parses the payload describing the new VM
- `SaveInventory` then creates only the new VM record
This is drastically simpler and faster, and it allows us to reduce the dependence on caching in the `RefreshWorker`. For a video demo of this new Streaming Refresh method see the providers section of Sprint Review 127.
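At its core, streaming refresh is a `WaitForUpdatesEx` loop. Here is a minimal sketch with RbVmomi, assuming a logged-in `vim` connection and a `filter_spec` built as in the earlier `PropertyCollector` example; the real `RefreshWorker` is considerably more involved:

```ruby
property_collector = vim.serviceContent.propertyCollector
filter  = property_collector.CreateFilter(spec: filter_spec, partialUpdates: true)
version = ''

begin
  loop do
    # Blocks for up to maxWaitSeconds and returns only what changed since
    # `version` -- there is no full re-read of the inventory.
    update_set = property_collector.WaitForUpdatesEx(
      version: version,
      options: RbVmomi::VIM.WaitOptions(maxWaitSeconds: 60)
    )
    next if update_set.nil? # timed out with no changes; poll again

    version = update_set.version

    Array(update_set.filterSet).each do |filter_update|
      Array(filter_update.objectSet).each do |object_update|
        # kind is 'enter' (new object, e.g. a freshly created VM),
        # 'modify', or 'leave'
        puts "#{object_update.kind}: #{object_update.obj._ref}"
        # Streaming refresh parses object_update.changeSet here and hands
        # the delta to SaveInventory, touching only that one record.
      end
    end
  end
ensure
  filter.DestroyPropertyFilter
end
```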
Why did we have to change it?
The `VimBroker` worker is fundamentally a `DRb` server, serving the VIM SDK to clients over the Distributed Ruby protocol. Since `DRb` stores the URI of the sender in order to respond to it, it doesn't work behind a load balancer: it ends up sending the response to the load balancer instead of back to the original caller. This is fine when there is a `VimBroker` per appliance serving processes over localhost, but when it is running as a service in OpenShift serving other pods, the `VimBroker` sits behind a Kubernetes load balancer.
This meant that when ManageIQ workers were running in their own container deployments it wasn't possible to use the VMware provider.
So what did we do?
We had to solve the same problems that the `VimBroker` solves, but without `DRb`, for all functions of the VMware Provider.
Inventory was simple because we had long been moving towards streaming refresh built on RbVmomi rather than the broker.
Even though the `EventCatcher` used the `VMwareWebService` gem, which also contains the broker, the `EventCatcher` itself never actually used `DRb`.
This just left Operations and Metrics.
What we decided on was to add a new `OperationsWorker` which could maintain a vSphere session and cache shared by the whole process. There was already a `MiqVim` “direct” connection, which was wrapped up in `DRb` as the `DMiqVim` class and served over the `VimBroker`. This gave us a way of running the existing methods in a way that is completely compatible with how they were run over the broker.
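For reference, a direct (broker-less) `MiqVim` connection looks roughly like this; the constructor's argument style has changed across `VMwareWebService` gem versions, so treat the details as assumptions:

```ruby
require 'VMwareWebService/MiqVim'

# One shared MiqVim connection per vCenter, reused by every operation in
# the process (keyword arguments reflect newer gem versions; older
# versions took positional server/username/password arguments).
vim = MiqVim.new(server: 'vcenter.example.com', username: 'user', password: 'pass')

# The cached inventory accessors that DMiqVim used to serve over DRb are
# now plain in-process method calls, e.g. looking up a VM by its managed
# object reference ('vm-123' is a hypothetical MOR):
vm_data = vim.virtualMachinesByMor['vm-123']

vim.disconnect
```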
Then we just had to update all of the ems_operations roles to run on the `OperationsWorker`. Before this change, all ems_operations were run by the Generic or Priority workers, which meant the `queue_name` for all of them was `nil`. We had to find all of these and allow them to be processed by the new `OperationsWorker`.
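To illustrate the kind of change involved, compare how such an operation might be queued before and after. `MiqQueue.put` is the standard ManageIQ queueing interface, while the specific method name and the `queue_name_for_ems_operations` helper shown here are assumptions for the sake of the example:

```ruby
# Before: with no :queue_name, the message lands on the generic queue and
# is picked up by a Generic or Priority worker.
MiqQueue.put(
  class_name:  'ManageIQ::Providers::Vmware::InfraManager::Vm',
  instance_id: vm.id,
  method_name: 'start', # illustrative operation
  role:        'ems_operations'
)

# After: the message is routed to the queue watched by this EMS's
# OperationsWorker, which holds the shared vSphere session and cache.
MiqQueue.put(
  class_name:  'ManageIQ::Providers::Vmware::InfraManager::Vm',
  instance_id: vm.id,
  method_name: 'start', # illustrative operation
  role:        'ems_operations',
  queue_name:  vm.ext_management_system.queue_name_for_ems_operations # assumed helper
)
```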
Now we have a VMware provider which can work in a podified environment, in addition to gaining the performance benefits that streaming refresh brings.
If you're interested in more detail, https://github.com/ManageIQ/manageiq-providers-vmware/issues/484 describes the problems and has a checklist of the things which had to change.
What’s next?
We plan to continue replacing code that uses our unique `VMwareWebService` handsoap implementation with the standard VMware Ruby gem, RbVmomi.
What we really need is an increased testing focus on the VMware provider. This will help us flush out bugs, or just significant differences in behavior, in the new implementation.
If you find anything, please open an issue at https://github.com/ManageIQ/manageiq-providers-vmware/issues/new or post in the Gitter room at https://gitter.im/ManageIQ/manageiq-providers-vmware
Thanks!
ManageIQ Providers Team