Help needed - MIQ ruby server CPU issue after upgrade

Hi all, we’ve recently upgraded from hammer-4 to ivanchuk-3 and since then strange system issue occurs sometimes (for instance twice today)

Suddenly during automation (it seems that always is there a vmware provisioning running in such moment) some ruby proccesses eat server CPU and system is stuck for hours in that state (till is restarted)

all I can see in top are proccesses like this (with whole cpu core taken…):

14842 root      30  10  260068  83888   3776 R 100,5  0,3  99:26.68 ruby
 3086 root      30  10  260200  84176   3776 R  99,7  0,3  99:32.17 ruby
 3123 root      30  10  260068  83912   3776 R  99,7  0,3  99:34.29 ruby

neither ps aux | grep 14842 gives me more info…

we have ivanchuk on production for 3 weeks now. We didn’t see such behaviour on hammer before.

I am used to java world with heap and java thread dumps. How can I find problem in ruby application I have no idea :frowning:

Any advice or tips would be appreciated…

Thank you…

@johny5v This is very strange. Do you have any context from the logs about what happens right before those processes start getting stuck? If it helps there’s a file in the log directory called top_output.txt, which are dumps from top over time. It should help you narrow down a timestamp for when this starts happening, and then using that timestamp you can look into the logs to see what was happening around it.

Specifically, can you see if the UI or API workers restarted at that point? We know there is a deadlock when multiple clients simultaneously hit the API at worker at worker startup, but usually that shows up as 502 errors in the UI, which I’m not seeing here.

@jrafanie Does this issues sound familiar to you, or do you think it’s related to the constant deadlock problem?

@tinaafitz Does this sound familiar to you too?

Hi @Fryguy, No, it doesn’t sound familiar at all.

Thank you for pointing me to top_output.txt. I looked into it, these are snaps for PID 14842 (with several other top proccesses)

18:53:34 everything was quiet

next minute CPU for 14842 goes up

18:55:34 now it went haywire

I found this in evm.log (there were no errors prior this one for several hours)

[----] I, [2020-04-27T18:53:37.954068 #2904:3305f4] INFO – : Q-task_id([job_dispatcher]) MIQ(MiqQueue#delivered) Message id: [46178507], State: [ok], Delivered in [0.010502774] seconds
#<Thread:0x00000000131de9e0@/usr/local/lib/ruby/gems/2.5.0/bundler/gems/manageiq-automation_engine-9b2153128ba5/lib/miq_automation_engine/engine/miq_ae_engine/miq_ae_method.rb:218 run> terminated with exception (report_on_exception is true):
/usr/local/lib/ruby/gems/2.5.0/bundler/gems/manageiq-automation_engine-9b2153128ba5/lib/miq_automation_engine/engine/miq_ae_engine/miq_ae_method.rb:219:in each_line': stream closed in another thread (IOError) from /usr/local/lib/ruby/gems/2.5.0/bundler/gems/manageiq-automation_engine-9b2153128ba5/lib/miq_automation_engine/engine/miq_ae_engine/miq_ae_method.rb:219:in block (2 levels) in run_method’
#<Thread:0x00000000131ddbd0@/usr/local/lib/ruby/gems/2.5.0/bundler/gems/manageiq-automation_engine-9b2153128ba5/lib/miq_automation_engine/engine/miq_ae_engine/miq_ae_method.rb:221 run> terminated with exception (report_on_exception is true):
/usr/local/lib/ruby/gems/2.5.0/bundler/gems/manageiq-automation_engine-9b2153128ba5/lib/miq_automation_engine/engine/miq_ae_engine/miq_ae_method.rb:222:in each_line': stream closed in another thread (IOError) from /usr/local/lib/ruby/gems/2.5.0/bundler/gems/manageiq-automation_engine-9b2153128ba5/lib/miq_automation_engine/engine/miq_ae_engine/miq_ae_method.rb:222:in block (2 levels) in run_method’
[----] I, [2020-04-27T18:53:41.922011 #13591:3305f4] INFO – : MIQ(ManageIQ::Providers::Vmware::InfraManager::EventCatcher::Runner#filtered?) EMS [] as [xx_hpautomation] Skipping caught event [UserLogoutSessionEvent] chainId [17313312]
[----] E, [2020-04-27T18:53:44.440369 #2666:3305f4] ERROR – : MIQ(MiqServer#validate_worker) Worker [MiqGenericWorker] with ID: [16958], PID: [23845], GUID: [25841359-34ba-41f8-ab39-f2e7f021ce1a] has not responded in 3603.115479276 seconds, restarting worker

I don’t know if the server issue could have something with that killed worker. I did

zgrep “MiqGenericWorker.*23845” evm.log-20200428.gz > worker_23845.txt
and attached the result here just for case…worker_23845.txt (23.6 KB)

It sounds like the automation that’s running some ruby code is using up a lot of CPU. From your top output snippet, it’s using 99+% cpu but only 83 MB of memory which indicates to me something in the automation model ruby method is never completing. Do you know what is running and can add some debug logging to your automation methods so you can see where it’s hitting the high CPU usage?

We have installed telegraf on manageiq appliance. Hopefully we will be able to determine exact moment in case of future issue. Should I attach my old evm.log (few minutes around the issue)?
Meanwhile I am going investigate/simulate other (but maybe releated) issue with object walker - when we call object walker with existing VM object in $evm (like from button on VM), CPU rise to 100% at one point as well - but this time eats whole memory (to out of memory error). If I excluded evm_clusters (and one other record) from object walker whitelist, it runs ok…

Our CPU problem rised again today, when 6 vmware windows servers were ordered in 10 minutes span. 2 were completed ok (r1792, r1795), 4 failed.
I managed to track the automation flow to this code ( in our / Infrastructure / VM / Provisioning / StateMachines / Methods / vmware_PostProvision):

unless disks.empty?
  disks.each do |dsk|
    $evm.log(:info, "Adding #{dsk['size_gb']}GB disk to VM #{}")
    vm.add_disk( "[#{vm.storage_name}]", dsk['size_mb'],  {:sync => true, :thin_provisioned => true,  :dependent => true})
$evm.log(:info, "Additional disks: #{disks}")

(User can request 1 to 3 optional disks in the service catalog form)

if I grep these messages, results are these:

miq server1:

[----] I, [2020-05-04T10:54:24.719293 #30713:d6f5c04] INFO – : Q-task_id([r1793_miq_provision_4342]) Adding 30GB disk to VM SAPLAB691
[----] I, [2020-05-04T10:54:25.074380 #2336:bedfca0] INFO – : Q-task_id([r1794_miq_provision_4333]) Adding 30GB disk to VM SAPLAB692

miq server2:

[----] I, [2020-05-04T10:54:19.868512 #28140:120311c0] INFO – : Q-task_id([r1791_miq_provision_4328]) Adding 40GB disk to VM SAPLAB689
[----] I, [2020-05-04T10:54:21.774805 #25223:ef797a8] INFO – : Q-task_id([r1796_miq_provision_4340]) Adding 20GB disk to VM SAPLAB688

No " Additional disks: " message in output here. So vm.add_disk was last called method in every one of that hunged requests.
for me is suspicious that all vm.add_disk methods were called almost simultaneously. Maybe some concurrency problem?

2 other vmware windows servers (with 1 optional disk each as well) were requested in that time but both completed successfully (much slower then usually due to CPU load). And log from that:

[----] I, [2020-05-04T11:42:37.359249 #18649:8171940] INFO – : Q-task_id([r1792_miq_provision_4330]) Adding 40GB disk to VM SAPLAB690
[----] I, [2020-05-04T11:44:12.487391 #20901:cdf9948] INFO – : Q-task_id([r1795_miq_provision_4336]) Adding 30GB disk to VM SAPLAB693
[----] I, [2020-05-04T11:47:43.720788 #18649:8171940] INFO – : Q-task_id([r1792_miq_provision_4330]) Additional disks: [{“drive_letter”=>“D”, “size_gb”=>40, “size_kb”=>41943040, “size_mb”=>40960, “drive_label”=>“DATA”, “id”=>0}]
[----] I, [2020-05-04T11:48:24.706817 #20901:cdf9948] INFO – : Q-task_id([r1795_miq_provision_4336]) Additional disks: [{“drive_letter”=>“D”, “size_gb”=>30, “size_kb”=>31457280, “size_mb”=>30720, “drive_label”=>“DATA”, “id”=>0}]

This was working fine in hammer-4
Do you have any suggestion?
Thank you

The add_disk hasn’t been changed since hammer-4 going to ivanchuk. Perhaps something on the vmware provider itself has changed. The :sync => true makes me think the high CPU is because it might be spinning, waiting for a lock.

Maybe @agrare has some ideas?

Hey @jrafanie actually this is a different add_disk() method that he is calling here, the one you link to is adding a disk to the initial CloneVM vmware method. @johny5v is calling vm.add_disk after the provision so it is actually calling this and the :sync => true is for the automation_engine to call wait_for_taskid here. The only change to the underlying addDisk method in the VimBroker was this which was released back in fine.

If any MIQ worker code was spinning the CPU it would show up with the worker process title, the fact that it says it is just “/usr/local/bin/ruby” it seems like it might be spawned from an automate method.

You could try calling vm.add_disk from a rails console to take automate out of it and see if you see the same issue.

1 Like

Hey @jrafanie, sorry for my delay. I ran several times this is in our TEST environment directly from rails console:

HOST_ARRAY = ["vnnub0102-TB","vnnub0106-TB","vnnub0136-TB"]  
$evm =
threads = []
Thread.abort_on_exception = true
HOST_ARRAY.each { |host|
  threads << {
    vm = $evm.vmdb('vm').find_by_name(host)
    vm.add_disk( "[#{vm.storage_name}]", 1024,  {:sync => true, :thin_provisioned => true,  :dependent => true})

and it worked fine. Unfortunately, our test environment is currently single node only, could it be different if I run that code simultaneously on 2 nodes?

@johny5v Yeah, it’s possible the race condition is more likely to occur from two or more concurrent clients as opposed to a single client IP making multiple concurrent requests. How frequently does this occur? How often do you have multiple concurrent provisions? If you did 3 provisions at once, how many would hit this issue? What about if you did 10?

Keep in mind, there could be other vmware requests that are “racing” with the add_disk function so, it might work 100% of the time when you do add_disk from several ip addresses, in threads, all concurrently.

@jrafanie we have this windows server flow available for like a year (through gaprind/hammer and finally ivanchuck) and issue appeared for the first time a few days after ivanchuk migration. Our window server admins (who are responsible for 95% windows server requests) were used to always add several servers into the shopping cart and run them at once. After this issue we asked them to not use a shopping cart for many servers but instead to order max 1 server per 5 minutes or so.
In few days I am gonna get another test appliance so I will make another test.

By the way - is there by chance any simple way how to create and release distributed lock in MIQ automation scripts?

@johny5v Hello. I’m not sure what happened in the code or in the environment to cause concurrent provisionings that add_disk to fail possibly due to a concurrency issue. Do you have a test environment with the old code that you can verify it works with lots of concurrent provisions and compare? Maybe you can try instrumenting a failing environment and run diagnostics like strace and look at lsof to get an idea of what the high CPU process is doing. It’s not easy.

I don’t think distributed locks with automation is a good idea especially if what’s being done while “holding a lock” includes external systems. There’s just too many things involved and could go wrong.

Is it possible to move the add_disk part of the provisioning to a followup task that’s run asynchronously?

I haven’t been able to run my above test code in parrallel in our test environment. On 1 instance it runs fine, on other I am getting error:

/usr/local/lib/ruby/gems/2.5.0/gems/activesupport-5.1.7/lib/active_support/dependencies.rb:509:in `load_missing_constant’: Circular dependency detected while autoloading constant Vm (RuntimeError)

If I run the code without threads, it runs fine. Without CPU issue.

We don’t have any old environment to test this behaviour. Meanwhile I changed :sync to false in add_disk method call to see what happens…