High CPU load on Jansa-3

Hi there,

I am observing a quite interesting situation in a jansa-3 setup. I have this jansa-3 appliance which is single member of a zone with some VMware Providers configured. So time goes by and appliance runs for 7 days. Now CPU load is at 180 and top shows a lot of ruby processes running with nice between 7 and 10.

Some VMware providers don’t work because connection to the vCenter is currently not possible.

Does anyone know how to analyze or troubleshoot the cause instead of just rebooting and hoping that it never happens again?

Kind Regards,
phospi

You can use vmdb && rake evm:status from the command line on the appliance to show the PIDs of the worker processes, then you could probably match up which workers are causing the load in top. In general though I’d recommend only one provider per zone (unless it’s a really small provider); that way you distribute the provider-specific workers between appliances.

Your appliance has 32GB memory, of which most is consumed, with little buff/cache used. It’s a really busy appliance!

Hope this helps,
pemcg

Unfortunately, there is no match.
Please don’t be confused. The output is from a different appliance for aws providers because I rebooted the vmware appliance but the behaviour is exactly the same.

[root@JANSA vmdb]# rake evm:status
Checking EVM status...
 Region | Zone | Server        | Status  |  PID |   SPID | Workers | Version | Started    | Heartbeat   | MB Usa                                               ge | Roles
--------+------+---------------+---------+------+--------+---------+---------+------------+-------------+-------                                               ---+------------------------------------------------------------------------------------------------------------                                               -----------------------------------------
     21 | AWS  | JANSA | started | 2274 | 175243 |     100 | jansa-3 | 2021-03-09 | 07:48:47UTC |      3                                               81 | automate:database_operations:ems_inventory:ems_operations:event:git_owner:remote_console:reporting:schedule                                               r:smartstate:user_interface:web_services

* marks a master appliance

 Type                        | Status  |    PID | SPID   | Queue                                                    | Started     | Heartbeat   | MB Usage
-----------------------------+---------+--------+--------+----------------------------------------------------------+-------------+-------------+----------
 Amazon::Cloud::EventCatcher | started | 417845 | 441302 | ems_21000000000229                                       | 22:50:25UTC | 07:48:31UTC | 222/2048
 Amazon::Cloud::EventCatcher | started | 417848 | 441290 | ems_21000000000238                                       | 22:50:18UTC | 07:48:41UTC | 223/2048
 Amazon::Cloud::EventCatcher | started | 417850 | 441295 | ems_21000000000250                                       | 22:50:19UTC | 07:48:29UTC | 220/2048
 Amazon::Cloud::EventCatcher | started | 417852 | 441313 | ems_21000000000253                                       | 22:50:25UTC | 07:48:34UTC | 223/2048
 Amazon::Cloud::EventCatcher | started | 417854 | 441305 | ems_21000000000256                                       | 22:50:21UTC | 07:48:32UTC | 224/2048
 Amazon::Cloud::EventCatcher | started | 417856 | 441303 | ems_21000000000280                                       | 22:50:24UTC | 07:48:34UTC | 224/2048
 Amazon::Cloud::EventCatcher | started | 417858 | 441282 | ems_21000000000304                                       | 22:50:18UTC | 07:48:38UTC | 223/2048
 Amazon::Cloud::EventCatcher | started | 417860 | 441294 | ems_21000000000307                                       | 22:50:19UTC | 07:48:29UTC | 223/2048
 Amazon::Cloud::EventCatcher | started | 417862 | 441307 | ems_21000000000319                                       | 22:50:24UTC | 07:48:34UTC | 223/2048
 Amazon::Cloud::EventCatcher | started | 417864 | 441304 | ems_21000000000346                                       | 22:50:23UTC | 07:48:32UTC | 222/2048
 Amazon::Cloud::EventCatcher | started | 417866 | 441300 | ems_21000000000355                                       | 22:50:22UTC | 07:48:42UTC | 223/2048
 Amazon::Cloud::EventCatcher | started | 417868 | 441306 | ems_21000000000370                                       | 22:50:22UTC | 07:48:32UTC | 224/2048
 Amazon::Cloud::EventCatcher | started | 417870 | 441301 | ems_21000000000379                                       | 22:50:22UTC | 07:48:42UTC | 222/2048
 Amazon::Cloud::EventCatcher | started | 417872 | 441311 | ems_21000000000393                                       | 22:50:23UTC | 07:48:34UTC | 222/2048
 Amazon::Cloud::EventCatcher | started | 417874 | 441312 | ems_21000000000638                                       | 22:50:25UTC | 07:48:34UTC | 223/2048
 Amazon::Cloud::EventCatcher | started | 417876 | 441291 | ems_21000000000653                                       | 22:50:18UTC | 07:48:41UTC | 221/2048
 Amazon::Cloud::EventCatcher | started | 417878 | 441310 | ems_21000000000731                                       | 22:50:25UTC | 07:48:34UTC | 224/2048
 Amazon::Cloud::EventCatcher | started | 417880 | 441308 | ems_21000000000764                                       | 22:50:24UTC | 07:48:33UTC | 220/2048
 Amazon::Cloud::EventCatcher | started | 417882 | 441309 | ems_21000000000770                                       | 22:50:23UTC | 07:48:34UTC | 221/2048
 Amazon::Cloud::EventCatcher | started | 417942 | 441332 | ems_21000000000217                                       | 22:50:57UTC | 07:48:39UTC | 223/2048
 Amazon::Cloud::EventCatcher | started | 417943 | 441338 | ems_21000000000223                                       | 22:50:59UTC | 07:48:38UTC | 222/2048
 Amazon::Cloud::EventCatcher | started | 417945 | 441331 | ems_21000000000313                                       | 22:50:55UTC | 07:48:38UTC | 224/2048
 Amazon::Cloud::EventCatcher | started | 417947 | 441337 | ems_21000000000349                                       | 22:50:59UTC | 07:48:27UTC | 224/2048
 Amazon::Cloud::EventCatcher | started | 417949 | 441334 | ems_21000000000358                                       | 22:50:55UTC | 07:48:38UTC | 224/2048
 Amazon::Cloud::EventCatcher | started | 417951 | 441336 | ems_21000000000361                                       | 22:50:57UTC | 07:48:37UTC | 223/2048
 Amazon::Cloud::EventCatcher | started | 417953 | 441333 | ems_21000000000390                                       | 22:50:55UTC | 07:48:38UTC | 223/2048
 Amazon::Cloud::EventCatcher | started | 417955 | 441335 | ems_21000000000701                                       | 22:51:00UTC | 07:48:38UTC | 223/2048
 Amazon::Cloud::EventCatcher | started | 418133 | 441344 | ems_21000000000292                                       | 22:51:21UTC | 07:48:27UTC | 224/2048
 Amazon::Cloud::EventCatcher | started | 418138 | 441345 | ems_21000000000316                                       | 22:51:22UTC | 07:48:27UTC | 224/2048
 Amazon::Cloud::EventCatcher | started | 418140 | 441346 | ems_21000000000352                                       | 22:51:26UTC | 07:48:31UTC | 224/2048
 Amazon::Cloud::EventCatcher | started | 418231 | 441376 | ems_21000000000665                                       | 22:51:39UTC | 07:48:36UTC | 222/2048
 Amazon::Cloud::EventCatcher | started | 418278 | 441391 | ems_21000000000683                                       | 22:51:58UTC | 07:48:39UTC | 223/2048
 Amazon::Cloud::EventCatcher | started | 418313 | 441403 | ems_21000000000232                                       | 22:52:08UTC | 07:48:34UTC | 223/2048
 Amazon::Cloud::EventCatcher | started | 418316 | 441404 | ems_21000000000298                                       | 22:52:09UTC | 07:48:32UTC | 224/2048
 Amazon::Cloud::EventCatcher | started | 418318 | 441408 | ems_21000000000373                                       | 22:52:10UTC | 07:48:39UTC | 222/2048
 Amazon::Cloud::EventCatcher | started | 418320 | 441407 | ems_21000000000674                                       | 22:52:10UTC | 07:48:39UTC | 220/2048
 Amazon::Cloud::EventCatcher | started | 418431 | 441421 | ems_21000000000364                                       | 22:52:29UTC | 07:48:38UTC | 223/2048
 Amazon::Cloud::EventCatcher | started | 418433 | 441418 | ems_21000000000382                                       | 22:52:24UTC | 07:48:33UTC | 221/2048
 Amazon::Cloud::EventCatcher | started | 418435 | 441420 | ems_21000000000385                                       | 22:52:29UTC | 07:48:36UTC | 221/2048
 Amazon::Cloud::EventCatcher | started | 418705 | 441460 | ems_21000000000259                                       | 22:52:45UTC | 07:48:36UTC | 223/2048
 Amazon::Cloud::EventCatcher | started | 418707 | 441457 | ems_21000000000289                                       | 22:52:44UTC | 07:48:35UTC | 223/2048
 Amazon::Cloud::EventCatcher | started | 418709 | 441456 | ems_21000000000301                                       | 22:52:44UTC | 07:48:34UTC | 223/2048
 Amazon::Cloud::EventCatcher | started | 419046 | 441511 | ems_21000000000277                                       | 22:53:49UTC | 07:48:34UTC | 223/2048
 Amazon::Cloud::EventCatcher | started | 419047 | 441512 | ems_21000000000286                                       | 22:53:49UTC | 07:48:39UTC | 223/2048
 Amazon::Cloud::Refresh      | started | 417886 | 441556 | ems_21000000000229:ems_21000000000230:ems_21000000000231 | 22:53:27UTC | 07:48:36UTC | 315/2048
 Amazon::Cloud::Refresh      | started | 417887 | 441549 | ems_21000000000238:ems_21000000000239:ems_21000000000240 | 22:53:25UTC | 07:48:41UTC | 314/2048
 Amazon::Cloud::Refresh      | started | 417889 | 441528 | ems_21000000000250:ems_21000000000251:ems_21000000000252 | 22:53:24UTC | 07:48:37UTC | 311/2048
 Amazon::Cloud::Refresh      | started | 417891 | 441545 | ems_21000000000253:ems_21000000000254:ems_21000000000255 | 22:53:25UTC | 07:48:36UTC | 314/2048
 Amazon::Cloud::Refresh      | started | 417893 | 441531 | ems_21000000000256:ems_21000000000257:ems_21000000000258 | 22:53:23UTC | 07:48:37UTC | 316/2048
 Amazon::Cloud::Refresh      | started | 417895 | 441537 | ems_21000000000280:ems_21000000000281:ems_21000000000282 | 22:53:24UTC | 07:48:42UTC | 314/2048
 Amazon::Cloud::Refresh      | started | 417897 | 441536 | ems_21000000000304:ems_21000000000305:ems_21000000000306 | 22:53:25UTC | 07:48:38UTC | 312/2048
 Amazon::Cloud::Refresh      | started | 417899 | 441541 | ems_21000000000307:ems_21000000000308:ems_21000000000309 | 22:53:24UTC | 07:48:41UTC | 313/2048
 Amazon::Cloud::Refresh      | started | 417904 | 441540 | ems_21000000000313:ems_21000000000314:ems_21000000000315 | 22:53:24UTC | 07:48:41UTC | 314/2048
 Amazon::Cloud::Refresh      | started | 417907 | 441550 | ems_21000000000319:ems_21000000000320:ems_21000000000321 | 22:53:26UTC | 07:48:38UTC | 312/2048
 Amazon::Cloud::Refresh      | started | 417909 | 441555 | ems_21000000000346:ems_21000000000347:ems_21000000000348 | 22:53:25UTC | 07:48:38UTC | 314/2048
 Amazon::Cloud::Refresh      | started | 417911 | 441532 | ems_21000000000349:ems_21000000000350:ems_21000000000351 | 22:53:24UTC | 07:48:34UTC | 314/2048
 Amazon::Cloud::Refresh      | started | 417913 | 441542 | ems_21000000000355:ems_21000000000356:ems_21000000000357 | 22:53:24UTC | 07:48:33UTC | 310/2048
 Amazon::Cloud::Refresh      | started | 417915 | 441530 | ems_21000000000361:ems_21000000000362:ems_21000000000363 | 22:53:24UTC | 07:48:34UTC | 313/2048
 Amazon::Cloud::Refresh      | started | 417917 | 441533 | ems_21000000000370:ems_21000000000371:ems_21000000000372 | 22:53:24UTC | 07:48:34UTC | 315/2048
 Amazon::Cloud::Refresh      | started | 417919 | 441527 | ems_21000000000379:ems_21000000000380:ems_21000000000381 | 22:53:25UTC | 07:48:35UTC | 308/2048
 Amazon::Cloud::Refresh      | started | 417921 | 441546 | ems_21000000000390:ems_21000000000391:ems_21000000000392 | 22:53:26UTC | 07:48:37UTC | 313/2048
 Amazon::Cloud::Refresh      | started | 417923 | 441548 | ems_21000000000393:ems_21000000000394:ems_21000000000395 | 22:53:26UTC | 07:48:39UTC | 313/2048
 Amazon::Cloud::Refresh      | started | 417925 | 441535 | ems_21000000000638:ems_21000000000639:ems_21000000000640 | 22:53:23UTC | 07:48:36UTC | 313/2048
 Amazon::Cloud::Refresh      | started | 417927 | 441529 | ems_21000000000653:ems_21000000000654:ems_21000000000655 | 22:53:24UTC | 07:48:34UTC | 308/2048
 Amazon::Cloud::Refresh      | started | 417929 | 441543 | ems_21000000000701:ems_21000000000702:ems_21000000000703 | 22:53:27UTC | 07:48:37UTC | 308/2048
 Amazon::Cloud::Refresh      | started | 417931 | 441534 | ems_21000000000731:ems_21000000000732:ems_21000000000733 | 22:53:24UTC | 07:48:38UTC | 314/2048
 Amazon::Cloud::Refresh      | started | 417933 | 441544 | ems_21000000000764:ems_21000000000765:ems_21000000000766 | 22:53:27UTC | 07:48:35UTC | 307/2048
 Amazon::Cloud::Refresh      | started | 417935 | 441547 | ems_21000000000770:ems_21000000000771:ems_21000000000772 | 22:53:27UTC | 07:48:41UTC | 307/2048
 Amazon::Cloud::Refresh      | started | 418094 | 441558 | ems_21000000000217:ems_21000000000218:ems_21000000000219 | 22:53:36UTC | 07:48:33UTC | 313/2048
 Amazon::Cloud::Refresh      | started | 418095 | 441557 | ems_21000000000223:ems_21000000000224:ems_21000000000225 | 22:53:34UTC | 07:48:35UTC | 314/2048
 Amazon::Cloud::Refresh      | started | 418097 | 441563 | ems_21000000000292:ems_21000000000293:ems_21000000000294 | 22:53:40UTC | 07:48:36UTC | 312/2048
 Amazon::Cloud::Refresh      | started | 418099 | 441560 | ems_21000000000316:ems_21000000000317:ems_21000000000318 | 22:53:38UTC | 07:48:42UTC | 314/2048
 Amazon::Cloud::Refresh      | started | 418101 | 441559 | ems_21000000000358:ems_21000000000359:ems_21000000000360 | 22:53:37UTC | 07:48:34UTC | 314/2048
 Amazon::Cloud::Refresh      | started | 418160 | 441571 | ems_21000000000352:ems_21000000000353:ems_21000000000354 | 22:53:48UTC | 07:48:38UTC | 314/2048
 Amazon::Cloud::Refresh      | started | 418161 | 441568 | ems_21000000000665:ems_21000000000666:ems_21000000000667 | 22:53:46UTC | 07:48:35UTC | 311/2048
 Amazon::Cloud::Refresh      | started | 418291 | 441580 | ems_21000000000232:ems_21000000000233:ems_21000000000234 | 22:54:05UTC | 07:48:38UTC | 313/2048
 Amazon::Cloud::Refresh      | started | 418292 | 441581 | ems_21000000000373:ems_21000000000374:ems_21000000000375 | 22:54:05UTC | 07:48:34UTC | 313/2048
 Amazon::Cloud::Refresh      | started | 418294 | 441578 | ems_21000000000674:ems_21000000000675:ems_21000000000676 | 22:54:06UTC | 07:48:34UTC | 309/2048
 Amazon::Cloud::Refresh      | started | 418296 | 441579 | ems_21000000000683:ems_21000000000684:ems_21000000000685 | 22:54:05UTC | 07:48:36UTC | 308/2048
 Amazon::Cloud::Refresh      | started | 418357 | 441583 | ems_21000000000298:ems_21000000000299:ems_21000000000300 | 22:54:12UTC | 07:48:40UTC | 312/2048
 Amazon::Cloud::Refresh      | started | 418359 | 441582 | ems_21000000000382:ems_21000000000383:ems_21000000000384 | 22:54:14UTC | 07:48:37UTC | 308/2048
 Amazon::Cloud::Refresh      | started | 418484 | 441595 | ems_21000000000301:ems_21000000000302:ems_21000000000303 | 22:54:28UTC | 07:48:33UTC | 311/2048
 Amazon::Cloud::Refresh      | started | 418486 | 441600 | ems_21000000000364:ems_21000000000365:ems_21000000000366 | 22:54:27UTC | 07:48:41UTC | 313/2048
 Amazon::Cloud::Refresh      | started | 418489 | 441591 | ems_21000000000385:ems_21000000000386:ems_21000000000387 | 22:54:24UTC | 07:48:39UTC | 310/2048
 Amazon::Cloud::Refresh      | started | 418993 | 441630 | ems_21000000000259:ems_21000000000260:ems_21000000000261 | 22:55:16UTC | 07:48:38UTC | 314/2048
 Amazon::Cloud::Refresh      | started | 418995 | 441629 | ems_21000000000277:ems_21000000000278:ems_21000000000279 | 22:55:15UTC | 07:48:37UTC | 312/2048
 Amazon::Cloud::Refresh      | started | 418997 | 441627 | ems_21000000000286:ems_21000000000287:ems_21000000000288 | 22:55:13UTC | 07:48:32UTC | 314/2048
 Amazon::Cloud::Refresh      | started | 418999 | 441628 | ems_21000000000289:ems_21000000000290:ems_21000000000291 | 22:55:14UTC | 07:48:35UTC | 314/2048
 Azure::Cloud::Refresh       | started | 417937 | 441561 | ems_21000000000388:ems_21000000000389                    | 22:53:25UTC | 07:48:32UTC | 302/2048
 EventHandler                | started |  37448 | 177717 | ems                                                      | 2021-03-09  | 07:48:30UTC | 213/600
 Generic                     | started |  38968 | 177905 | generic                                                  | 2021-03-09  | 07:48:30UTC | 519/600
 Generic                     | started |  39316 | 178374 | generic                                                  | 2021-03-09  | 07:48:31UTC | 484/600
 Priority                    | started |   3015 | 175255 | generic                                                  | 2021-03-09  | 07:48:38UTC | 285/600
 Priority                    | started |   3016 | 175254 | generic                                                  | 2021-03-09  | 07:48:39UTC | 290/600
 RemoteConsole               | started |   3174 |        | http:5000                                                | 2021-03-09  | 07:48:40UTC | 174/1024
 Reporting                   | started | 423028 | 443331 | reporting                                                | 00:02:17UTC | 07:48:41UTC | 570/600
 Reporting                   | started | 434486 | 449634 | reporting                                                | 04:02:13UTC | 07:48:38UTC | 505/600
 Schedule                    | started |   3014 | 175257 |                                                          | 2021-03-09  | 07:48:37UTC | 219/500
 Ui                          | started |   3175 |        | http:3000                                                | 2021-03-09  | 07:48:31UTC | 496/1024
 WebService                  | started |   3176 |        | http:4000                                                | 2021-03-09  | 07:48:31UTC | 392/1024

All rows have the values: Region=21, Zone=AWS, Server=JANSA


[root@JANSAvmdb]# top
top - 07:51:31 up 7 days, 19:16,  1 user,  load average: 15.13, 14.57, 14.52
Tasks: 366 total,  17 running, 349 sleeping,   0 stopped,   0 zombie
%Cpu(s): 18.1 us, 23.6 sy, 55.6 ni,  0.0 id,  0.0 wa,  1.4 hi,  1.4 si,  0.0 st
MiB Mem :  64151.6 total,  30931.6 free,  31530.3 used,   1689.7 buff/cache
MiB Swap:   6140.0 total,   6140.0 free,      0.0 used.  31888.3 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   2274 root      20   0  831272 416392  16476 R  70.6   0.6 650:11.65 ruby
  37195 root      30  10  198328 119276   5752 R  35.3   0.2   3285:17 ruby
  36424 root      30  10  198164 118816   5588 R  29.4   0.2   3252:22 ruby
  37181 root      30  10  199516 120344   5768 R  29.4   0.2   3270:34 ruby
  35696 root      30  10  197680 112260   5696 R  23.5   0.2   3298:25 ruby
 418489 root      27   7  542700 340288  15112 S  23.5   0.5   0:37.21 ruby
  35835 root      30  10  196796 117744   5752 R  17.6   0.2   3306:44 ruby
  35969 root      30  10  199076 119952   5680 R  17.6   0.2   3277:32 ruby
  36065 root      30  10  199612 118308   5672 R  17.6   0.2   3283:25 ruby
  36100 root      30  10  199684 112752   5688 R  17.6   0.2   3287:31 ruby
  36197 root      30  10  198360 119152   5596 R  17.6   0.2   3288:29 ruby
  37036 root      30  10  197652 112128   5596 R  17.6   0.2   3276:48 ruby
  35561 root      30  10  198936 119820   5684 R  11.8   0.2   3308:36 ruby
  37268 root      30  10  198820 119804   5784 R  11.8   0.2   3271:07 ruby
 417923 root      27   7  546036 343540  15304 R  11.8   0.5   0:37.94 ruby
  36543 root      30  10  199436 112724   5784 R   5.9   0.2   3281:27 ruby


[root@JANSA vmdb]# ps -ef | grep ruby
root        2281    2276  0 Mar09 ?        00:00:19 ruby /bin/evm_watchdog.rb
root       35561       1 29 Mar09 ?        2-07:09:03 /usr/bin/ruby
root       35696       1 29 Mar09 ?        2-06:58:48 /usr/bin/ruby
root       35835       1 29 Mar09 ?        2-07:07:09 /usr/bin/ruby
root       35969       1 29 Mar09 ?        2-06:37:56 /usr/bin/ruby
root       36065       1 29 Mar09 ?        2-06:43:56 /usr/bin/ruby
root       36100       1 29 Mar09 ?        2-06:47:58 /usr/bin/ruby
root       36197       1 29 Mar09 ?        2-06:48:52 /usr/bin/ruby
root       36424       1 29 Mar09 ?        2-06:12:50 /usr/bin/ruby
root       36543       1 29 Mar09 ?        2-06:41:53 /usr/bin/ruby
root       37036       1 29 Mar09 ?        2-06:37:14 /usr/bin/ruby
root       37181       1 29 Mar09 ?        2-06:31:01 /usr/bin/ruby
root       37195       1 29 Mar09 ?        2-06:45:42 /usr/bin/ruby
root       37268       1 29 Mar09 ?        2-06:31:31 /usr/bin/ruby

That ps line isn’t really giving you the full picture. Try something like ps -ef --forest to show the whole parent/child process hierarchy. If any /usr/bin/ruby processes are child processes under a generic or priority worker then they’re probably automate workflows that are running (which would include the processing of provider events). Interestingly the /usr/bin/ruby processes in your listing have a PPID of 1 which is systemd.

Yes, you probably need to split those providers over multiple zones I think. Do you have a separate standalone database appliance (I think you’ll need one for performance reasons)?

pemcg

Well, we have a dedicated database appliance and we already split up our providers across multiple zones. The distribution of providers over zones usually works very well. And additionally the appliance responsiveness is better than expected. I assume that the ``` nice == 10` setting is very helpful here.

My question is what causes the spawn of endlessly running ruby processes with nice == 10. Where do they come from? What are they there for?

I’m observing those ruby processes for vmware and aws providers and it looks like they appear if some providers are not reachable.