Smart State Scans Failing with "Too Many Open Files" Error

Hi

I am looking for some support with an issue I am having with SmartState scans.
My SmartState scans work for about 7 days, but then they all fail, and the only way to get them working again is to restart MIQ.

I receive the error below when they start failing:
Unable to mount filesystem. Reason:[Too many open files - /var/www/miq/vmdb/log/vim.log]

I have tried increasing the ulimits, but this has made no difference.
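One thing worth checking: limits changed with `ulimit` or in `/etc/security/limits.conf` do not apply to processes that are already running, nor to daemons started by systemd (those take the unit's `LimitNOFILE=` setting instead). Reading `/proc/<pid>/limits` shows the limit the worker actually has. A minimal sketch, assuming a Linux appliance with procfs mounted; the script and the `max_open_files` function are hypothetical helpers, not part of ManageIQ:

```python
import resource
import sys


def max_open_files(pid):
    """Return the (soft, hard) 'Max open files' limits of a process.

    Parses /proc/<pid>/limits (Linux only). 'unlimited' is mapped to
    resource.RLIM_INFINITY so the result is comparable with getrlimit().
    """
    with open(f"/proc/{pid}/limits") as f:
        for line in f:
            if line.startswith("Max open files"):
                # Columns: 'Max open files' <soft> <hard> 'files'
                soft, hard = line.split()[3:5]
                return tuple(
                    resource.RLIM_INFINITY if v == "unlimited" else int(v)
                    for v in (soft, hard)
                )
    raise ValueError(f"no 'Max open files' entry for pid {pid}")


if __name__ == "__main__":
    # Usage: python3 fdlimits.py <PID_OF_MiqSmartProxyWorker>
    pid = sys.argv[1] if len(sys.argv) > 1 else "self"
    soft, hard = max_open_files(pid)
    print(f"pid {pid}: soft={soft} hard={hard}")
```

If the soft value for the MiqSmartProxyWorker still reads 1024 after raising the limit, the change never reached that process. Even when it does, a higher limit would only postpone the failure if the worker really is leaking descriptors.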

I am running the MIQ VMware appliance, version ivanchuk-1.20190911145513_9f959bd.

Any help would be greatly appreciated.

Thanks

This is a weird one; I’ve not heard of it before. @jrafanie or @rpo, does this sound familiar to you?

Hmmm, I’ve never seen that. It seems like the worker isn’t closing some file(s), leaking FDs over time.

Any idea how I could resolve the leak?

I found a more detailed error, if it’s any use:

ERROR – : Q-task_id([job_dispatcher]) MIQ(VmScan#process_abort) job aborting, Unable to mount filesystem. Reason:[Too many open files - /var/www/miq/vmdb/log/vim.log]

Does the process get killed when it hits “too many open files”? If not, does lsof show which files/sockets it has open?

Hi Guys, I set up a new instance of the ManageIQ appliance (ivanchuk-4.20200317112844_4cebf11), thinking that would solve the issue, but sadly not: I still get the failure after a few days. I found the error below, which reports error status 16.

[2020-06-04T05:05:50.922138 #2818:12545f8] ERROR – : Q-task_id([job_dispatcher]) MIQ(ManageIQ::Providers::Vmware::InfraManager::Vm#scan_via_miq_vm) ScanMetadata error status:[16]: message:[Unable to mount filesystem. Reason:[Too many open files - /var/www/miq/vmdb/log/vim.log]]

I’m really banging my head off the wall with this one, as I would love to deploy ManageIQ in our production environment, but at the moment that is unrealistic with SmartState scans failing after a few days.

Any help would be greatly appreciated.

Hi Guys

I think it could be the MiqSmartProxyWorker causing the issue: running lsof -p <PID_OF_MiqSmartProxyWorker> | wc -l returned a value of 1111, and that value didn’t increase when I ran a SmartState scan.

But then I toggled the SmartProxy role under Configuration --> Server:


A new PID for MiqSmartProxyWorker was created (as expected) and I was able to run a SmartState scan. This is not a fix, though, as the issue will return in a couple of days.

Any ideas on how I can fix this?

It does look like MiqSmartProxyWorker isn’t closing files after a SmartState scan.

It seems to be leaking FDs and hitting 1024, which is causing the failures:

ls -al /proc/<PID_OF_MiqSmartProxyWorker>/fd | wc -l

1024
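To see *what* is leaking rather than just how much, it can help to group the worker's descriptors by what they point to. Note that `ls -al /proc/<PID>/fd | wc -l` also counts the `total` line and the `.` and `..` entries, and `lsof -p` also lists memory-mapped libraries, the working directory and similar non-descriptor entries, so both figures run a little higher than the real descriptor count. A minimal sketch, assuming Linux procfs and permission to inspect the worker process; `fd_targets` is a hypothetical helper, not a ManageIQ tool:

```python
import os
import sys
from collections import Counter


def fd_targets(pid):
    """Count a process's open file descriptors, grouped by what they point to.

    Reads /proc/<pid>/fd (Linux only). Sockets and pipes appear as e.g.
    'socket:[12345]', so they are collapsed to 'socket' / 'pipe' to make
    large numbers of them easy to spot.
    """
    fd_dir = f"/proc/{pid}/fd"
    counts = Counter()
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except FileNotFoundError:
            continue  # descriptor was closed between listdir() and readlink()
        if target.startswith(("socket:", "pipe:")):
            target = target.split(":", 1)[0]
        counts[target] += 1
    return counts


if __name__ == "__main__":
    # Usage: python3 fdtargets.py <PID_OF_MiqSmartProxyWorker>
    pid = sys.argv[1] if len(sys.argv) > 1 else "self"
    counts = fd_targets(pid)
    print(f"pid {pid}: {sum(counts.values())} open descriptors")
    for target, n in counts.most_common(10):
        print(f"{n:6d}  {target}")
```

Running it before and after a SmartState scan and comparing which entry grows would help narrow down where the descriptors are leaking, which would also be useful detail for a bug report.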

Any ideas why MiqSmartProxyWorker is leaking FDs?

Is this a bug?

@Mpm2020 If you haven’t already, can you open a bug report for this at github.com/ManageIQ/manageiq/issues? I don’t want to lose track of it. It seems like a bug somewhere to me, and hopefully we can narrow it down. Thanks!