CloudForms with OpenShift provider: multiple RefreshWorkers created - problem with swap


#1

Hello,

I currently manage a CloudForms installation (CFME 5.7.1.3) running on two VMs (one for the CloudForms appliance, the other dedicated to the PostgreSQL database). CloudForms is connected to a single cloud provider (OpenShift Enterprise), mainly to monitor OpenShift resource utilization, but at some point the Web UI becomes extremely slow.

I looked into the issue, and running β€œtop” on the appliance VM shows a list of ruby processes that are constantly swapping.

top - 10:21:50 up 10 days, 17:41,  2 users,  load average: 15.81, 14.63, 14.55
Tasks: 240 total,   4 running, 236 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.1 us, 15.1 sy, 46.0 ni,  8.1 id, 28.7 wa,  0.0 hi,  0.5 si,  0.4 st
KiB Mem : 26588984 total,   212108 free, 26260772 used,   116104 buff/cache
KiB Swap:  9957372 total,  2728800 free,  7228572 used.    37972 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                             SWAP
30875 root      27   7 3426164 2.310g   2072 R  56.1  9.1   4:55.20 ruby                                                                                                               34396
   41 root      20   0       0      0      0 D  30.6  0.0   1790:49 kswapd0                                                                                                                0
 3598 root      21   1 1083224 685516   1884 R  24.3  2.6  74:06.34 ruby                                                                                                               41208
20710 root      30  10  894604 367996   2036 S  18.9  1.4   2:52.43 ruby                                                                                                              143596
25837 root      30  10  873748 364880   2068 S  17.9  1.4   1:58.86 ruby                                                                                                              135156
17952 root      21   1 1028392 644688   2004 D  17.3  2.4 818:04.55 ruby                                                                                                               64064
 9296 root      27   7  797112 326816   1856 S  15.9  1.2 173:01.39 ruby                                                                                                               96676
12611 root      27   7 3653400 1.926g   1844 D   9.3  7.6  14:30.41 ruby                                                                                                              683540
30182 root      27   7 3493692 1.927g   1888 D   9.0  7.6  12:47.61 ruby                                                                                                              503408
10653 root      27   7 3185412 1.905g   1936 D   6.3  7.5  10:20.38 ruby                                                                                                              365368
14446 root      27   7 3193512 1.706g   1944 D   6.3  6.7   8:32.27 ruby                                                                                                              615032
16622 root      27   7 3513504 1.932g   1912 D   6.3  7.6  14:49.13 ruby                                                                                                              657896
26669 root      27   7 3470600 1.915g   1928 R   6.0  7.6  12:01.70 ruby                                                                                                              494568
21943 root      27   7 3539372 1.952g   1916 D   5.6  7.7  14:34.06 ruby                                                                                                              618020
17803 root      20   0  918980 330700   1908 S   5.0  1.2 343:08.15 ruby                                                                                                              168724
 4426 root      27   7 3492624 1.924g   1936 D   4.7  7.6  10:00.00 ruby                                                                                                              473240
18065 root      21   1  745068 349884   1968 S   4.0  1.3 125:53.87 ruby                                                                                                               58212
19447 root      27   7 3490000 1.882g   1972 D   3.0  7.4   7:02.45 ruby                                                                                                              458132
18131 root      21   1  777144 184412   1776 S   2.3  0.7  11:10.89 ruby                                                                                                              241000
25827 root      27   7 3664724 1.562g   2008 D   2.3  6.2   5:53.23 ruby                                                                                                              951740
23605 root      27   7  852000 197096   1892 S   2.0  0.7   1:24.02 ruby                                                                                                              279028
    1 root      20   0  193632   1588    924 S   0.7  0.0   1:22.75 systemd                                                                                                             2184
 1767 root      20   0       0      0      0 S   0.3  0.0   0:00.41 kworker/1:2                                                                                                            0
 9347 root      23   3  776236 157396   1424 S   0.3  0.6  12:45.04 ruby                                                                                                              237124
29579 root      20   0       0      0      0 S   0.3  0.0   0:00.15 kworker/u8:2             

Comparing these ruby PIDs with the processes created by the evmserverd service, the heaviest ones are all RefreshWorkers (MIQ: OpenshiftEnterprise::ContainerManager::RefreshWorker), and they are consuming all of the CloudForms memory and slowing down the Web UI.

● evmserverd.service - EVM server daemon
   Loaded: loaded (/usr/lib/systemd/system/evmserverd.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2017-04-21 17:31:21 CEST; 2 days ago
  Process: 17383 ExecStop=/bin/sh -c /bin/evmserver.sh stop (code=killed, signal=TERM)
  Process: 17790 ExecStart=/bin/sh -c /bin/evmserver.sh start (code=exited, status=0/SUCCESS)
 Main PID: 17803 (ruby)
   CGroup: /system.slice/evmserverd.service
           β”œβ”€ 3163 MIQ: MiqEmsMetricsProcessorWorker id: 1000000026812, queue: ems_metrics_processor
           β”œβ”€ 3598 MIQ: MiqPriorityWorker id: 1000000026779, queue: generic
           β”œβ”€ 4426 MIQ: OpenshiftEnterprise::ContainerManager::RefreshWorker id: 1000000026905, queue
           β”œβ”€ 9296 MIQ: MiqEventHandler id: 1000000025893, queue: ems
           β”œβ”€ 9347 MIQ: MiqScheduleWorker id: 1000000025899
           β”œβ”€10653 MIQ: OpenshiftEnterprise::ContainerManager::RefreshWorker id: 1000000026912, queue
           β”œβ”€12611 MIQ: OpenshiftEnterprise::ContainerManager::RefreshWorker id: 1000000026879, queue
           β”œβ”€14437 MIQ: MiqEmsMetricsProcessorWorker id: 1000000026915, queue: ems_metrics_processor
           β”œβ”€14446 MIQ: OpenshiftEnterprise::ContainerManager::RefreshWorker id: 1000000026916, queue
           β”œβ”€16622 MIQ: OpenshiftEnterprise::ContainerManager::RefreshWorker id: 1000000026884, queue
           β”œβ”€17803 MIQ Server
           β”œβ”€17952 MIQ: MiqPriorityWorker id: 1000000025293, queue: generic
           β”œβ”€18065 MIQ: OpenshiftEnterprise::ContainerManager::EventCatcher id: 1000000025301, queue:
           β”œβ”€18117 puma 3.3.0 (tcp://127.0.0.1:5000) [MIQ: Web Server Worker]
           β”œβ”€18131 puma 3.3.0 (tcp://127.0.0.1:3000) [MIQ: Web Server Worker]
           β”œβ”€18145 puma 3.3.0 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
           β”œβ”€19447 MIQ: OpenshiftEnterprise::ContainerManager::RefreshWorker id: 1000000026919, queue
           β”œβ”€20710 MIQ: MiqGenericWorker id: 1000000026921, queue: generic
           β”œβ”€21943 MIQ: OpenshiftEnterprise::ContainerManager::RefreshWorker id: 1000000026888, queue
           β”œβ”€23605 MIQ: MiqReportingWorker id: 1000000026696, queue: reporting
           β”œβ”€23613 MIQ: MiqReportingWorker id: 1000000026697, queue: reporting
           β”œβ”€25827 MIQ: OpenshiftEnterprise::ContainerManager::RefreshWorker id: 1000000026924, queue
           β”œβ”€25837 MIQ: MiqGenericWorker id: 1000000026925, queue: generic
           β”œβ”€26669 MIQ: OpenshiftEnterprise::ContainerManager::RefreshWorker id: 1000000026893, queue
           β”œβ”€30182 MIQ: OpenshiftEnterprise::ContainerManager::RefreshWorker id: 1000000026899, queue
           └─30875 MIQ: OpenshiftEnterprise::ContainerManager::RefreshWorker id: 1000000026928, queue

Apr 21 17:31:10 prod-fr2-cf-front-01 systemd[1]: Starting EVM server daemon...
Apr 21 17:31:15 prod-fr2-cf-front-01 sh[17790]: /var/www/miq/vmdb/app/models/mixins/supports_feature_mixin.rb:103: warning: key :terminate is duplicated and overwritten on line 111
Apr 21 17:31:19 prod-fr2-cf-front-01 sh[17790]: Starting EVM...
Apr 21 17:31:21 prod-fr2-cf-front-01 sh[17790]: Running EVM in background...
Apr 21 17:31:21 prod-fr2-cf-front-01 systemd[1]: Started EVM server daemon.

At some point, I also started seeing an error in evm.log showing that new workers were being blocked from starting:

_MIQ(MiqServer#start_algorithm_used_swap_percent_lt_value) Not allowing worker [MiqReportingWorker] to start since system memory usage has exceeded 60% of swap:_
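
From what I can tell, this 60% threshold comes from the worker monitor settings visible in the Advanced tab under Configure / Configuration. The structure looks roughly like the following; I am quoting the keys and default values from memory, so they may not match 5.7 exactly:

:server:
  :worker_monitor:
    # no new workers are started while swap usage is at or above this percentage
    :start_algorithm:
      :name: :used_swap_percent_lt_value
      :value: 60
    # workers are killed once swap usage exceeds this percentage (value shown is illustrative)
    :kill_algorithm:
      :name: :used_swap_percent_gt_value
      :value: 80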

And that is despite a fairly generous configuration on both of my VMs:

4 vCPUs
26 GB RAM

I can understand that refreshing this provider could be heavy depending on the size of the OpenShift platform, but we are managing only 17 nodes with about 400 projects, most of which see very little activity. So I don't really think this behavior is normal.

Also, when I kill any of these processes, they are immediately recreated by the evmserverd service. I would like to limit the RefreshWorkers' memory usage or the number of worker processes, so that CloudForms is not overloaded just to keep the relationships between OpenShift objects up to date, but I couldn't find a setting that clearly allows this.
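
The closest thing I could find is the :workers: tree in the Advanced settings, which seems to hold per-worker counts and memory thresholds. The snippet below is only a sketch from memory (the key names and values may differ in 5.7), and I am not sure whether the :ems_refresh_worker_base: entry actually applies to the OpenshiftEnterprise RefreshWorker:

:workers:
  :worker_base:
    :defaults:
      # fallback values for every worker type (example values only)
      :count: 1
      :memory_threshold: 400.megabytes
    :queue_worker_base:
      :ems_refresh_worker_base:
        :defaults:
          # should cover provider refresh workers (example values only)
          :count: 1
          :memory_threshold: 2.gigabytes
          :nice_delta: 7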

Is there a supported way to do this? If not, what can I do to prevent this behavior?

Thanks

RAKOTOARISOA JΓ©rΓ©my


#2

Facing the exact same issue. MiqEmsMetricsProcessorWorker is consuming a lot of memory.


#3

I moved the C & U (Capacity & Utilization) roles to another appliance. At least the UI/web workers will no longer go down because of the OOM issue.
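
For anyone else hitting this: roles are assigned per server under Configure / Configuration (they also show up as the :server: :role: entry in the Advanced settings). Roughly, the split looks like the following; the role names are approximate, so adjust them to whatever your appliances actually run:

# UI appliance: keep the interface/operational roles, drop the metrics roles
:server:
  :role: automate,database_operations,ems_inventory,ems_operations,event,reporting,scheduler,user_interface,web_services

# dedicated C & U appliance: carries the metrics (Capacity & Utilization) roles
:server:
  :role: ems_metrics_coordinator,ems_metrics_collector,ems_metrics_processor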


#4

@blomquisg @gtanzillo @kbrock do you think this could be a generic issue? It seems to be part of the metrics processing rather than collection. What do you think?


#5

@simon3z looks like the customer opened a support case. We’ll track it there.