SSA on new RHV-M

@pemcg helped me get MIQ working with my production RHV-M setup. Thanks again.

I am working to get SSA working, and I hope I have connected all the dots: the role is enabled, the SmartProxy is on in config, affinity is set, and I believe the relationship is set right (not too clear on that one).

When I choose to run SSA on a VM, I can see the job queued in Tasks, and it then begins to run, but after some time it errors out with "Job timed out due to inactivity threshold reached." I have dug through evm.log but don’t see any errors.
So I am wondering what logs I can look through to get some insight into what is happening during the SSA, hoping to see where the process is ending and causing the timeout.
SSA on a host works without issue: the task completes and I can see results.

I also read somewhere on the Red Hat site that I need to present my storage domains as Direct LUNs. I am unclear about that; any help understanding it would be great.

Thanks for all your help guys.

T

This is the line I was referring to.

  • Each ManageIQ appliance performing SmartState Analysis requires sharable, non-bootable DirectLUN access to each attached iSCSI/FCP storage domain. In order to perform smart analysis, the appliance must mount the data storage as a DirectLUN disk.

I kind of understand what this means, but I am unclear how to actually set this in the RHV Manager.
T

I found the section below this about adding custom properties to the VM, adding:

```
directlun=:readonly
```
However, when I choose custom properties, I have to choose a key from a dropdown, and I am not sure what to choose here. The options are:
  • sndbuf
  • hugepages
  • vhost
  • sap_agent
  • mdev_type
  • viodiskcache

Hoping this is the last hurdle.

T

The MIQ appliance needs to be able to read the VM disks and metadata directly from the storage domains. In the case of NFS storage domains, the MIQ appliance needs to be able to mount the storage domain itself, so it may need a NIC added in the storage network (if there is one). For FC/iSCSI domains you need to present all of the storage domain LUNs as Direct LUN disks to the MIQ appliance VM.
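As a quick sanity check for the NFS case, you can try mounting the storage domain export from the appliance by hand (a sketch only; the server and export path below are placeholders, so substitute your own):

```
# Hypothetical export path - substitute your storage domain's server:/path.
mkdir -p /mnt/sd_test
mount -t nfs nfs.example.com:/exports/data /mnt/sd_test
ls /mnt/sd_test      # a storage domain UUID directory should be visible
umount /mnt/sd_test
```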

For example, in the RHV-M console under Storage -> Disks, create the new sharable Direct LUN disk from the iSCSI LUN.

You’ll get a scary-looking warning, but once you’ve created the sharable disk, you attach it as a Direct LUN to the MIQ appliance.

Don’t forget to check the ‘R/O’ box to make it readonly.
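Once attached, a quick way to confirm from inside the MIQ appliance that the disk is visible and read-only (the device name /dev/vdb is just an example; check your own lsblk output):

```
lsblk                        # the new Direct LUN should appear as an additional disk
blockdev --getro /dev/vdb    # prints 1 if the kernel sees the device as read-only
```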

Finally you have to tell MIQ which VM it’s running on. Find the MIQ appliance VM in the MIQ console, then select Configuration -> Edit Management Engine Relationship.

hope this helps.
pemcg

Some other useful tips for SmartState Analysis:

  1. In Configuration → Settings, copy the ‘sample’ analysis profile:

Call the new profile ‘default’ (it must be this name).

Now you can go to the ‘File’ tab of the new analysis profile and add any files that you’d like examined:
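For example (purely illustrative entries, not a required list), the ‘File’ tab takes guest file paths, one per line:

```
/etc/redhat-release
/etc/chrony.conf
/etc/ssh/sshd_config
```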

  2. Add the “VM SmartState Analysis profile” to the provider.

This allows you to tag any VMs that shouldn’t be analyzed with the exclusions/do_not_analyze tag (mainly Windows VMs running stateless applications such as Exchange Server that don’t like being snapshotted). It’s also a really good idea to tag your own MIQ appliance, as you don’t want it analysing itself.

Once a SmartState Analysis has run, the filters All VMs → Global Filters → Analysis Failed and Analysis Successful should show the VMs that have been tagged by the control policy.

hope this helps,
pemcg

@pemcg Thank you for this… it does make sense. I am going to try this today. Sorry for the delay; I was out of town last week. I figured I would add the Direct LUN to the VM, but there were no LUNs available, so your first post shows me how to make them available. I think that’s what I was looking for. I will let you know how this goes.

Thanks again.
T

Quick question: when setting the MIQ relationship… from other things I have read, this needs to be done for all VMs, but it seems it’s only allowing me to set one at a time. Does that seem right, or should I be able to set this for all VMs?

Thanks
T

@pemcg I think I answered my own question on that last one…

I was able to add the LUNs as you showed without issue. I can see them in the MIQ appliance when I run lsblk, but they don’t have mount points. So far when I try to run an SSA I still get the same results: just a timeout, no errors. I don’t see any snapshots being created or anything like that in RHV-M; not sure if I should. I am still tinkering.

Thanks.

T

The management engine relationship only applies to the appliances (the “management engines”); it’s literally telling MIQ which VM it’s running on. This really only applies when using a RHV provider, as MIQ needs to know which RHV datacenter it’s running in.

Regarding the SSA failures, what error or message are you getting in Administrator -> Tasks (upper right hand side of the WebUI)?

pemcg

Thanks, that does make sense. I have set that for the MIQ VM. The error is the same as before: a timeout error for inactivity. Screenshot attached.

Thanks
T

Just to confirm, you have both the SmartState Analysis and SmartProxy server roles enabled on your appliance? How many appliances do you have in your region?

pemcg

Yes, I just double-checked; I have both set to “on”. I only have one appliance.

T

I have not added all the datastores to MIQ as you instructed, only two for testing, but I am attempting to run the SSA on two VMs that I know are in the datastores that I did add. Does the MIQ appliance need me to add all of them to get it to work? I was trying to prove it worked before I went and added them all, but I will if needed.

T

No, you should just need the datastores holding the VMs that you’re attempting to scan. You pre-empted my next question (have you tried scanning several VMs?), but it sounds like you have.

How many VMs is the appliance managing? How many providers?

The other thing worth confirming is that the MIQ appliance is actually running as a VM in RHV, rather than outside of RHV (such as on AWS or VMware). The SmartProxy needs to be running on a RHV VM.

A timeout suggests that the message isn’t being picked up by a worker for some reason. Did you check to see if any of your workers are exceeding their memory thresholds and being terminated?

pemcg

Thanks again for your help on this. You may have hit on the solution here…

First, I only have 53 VMs currently in this RHV-M stack, and 1 provider in MIQ.
Second, yes, the MIQ appliance is running as a VM in this RHV-M stack.

But when you asked about checking the workers, I ran the command you gave me the last time you helped and I got some errors (below). It looks like a worker is reaching its max and quitting… can you tell me which worker I need to increase? Is it the “VM Analysis Collector”? That one is currently set to 2 workers and 2 GB.
Thanks again.
T

[root@ManageIQ ~]# zgrep 'MiqServer#exceeded_memory_threshold' /var/www/miq/vmdb/log/evm.log* | grep -E 'WARN|ERROR'

/var/www/miq/vmdb/log/evm.log-20201224.gz:[----] W, [2020-12-23T11:24:38.640118 #7464:2ac2de715964] WARN -- : MIQ(MiqServer#exceeded_memory_threshold?) Worker [ManageIQ::Providers::Redhat::InfraManager::MetricsCollectorWorker] with ID: [1000000000139], PID: [7811], GUID: [0aee3823-8554-4c6a-a155-2d6b66b1c05c] process memory usage [634960000] exceeded limit [629145600], requesting worker to exit

/var/www/miq/vmdb/log/evm.log-20201224.gz:[----] W, [2020-12-23T15:35:59.156686 #7464:2ac2de715964] WARN -- : MIQ(MiqServer#exceeded_memory_threshold?) Worker [ManageIQ::Providers::Redhat::InfraManager::MetricsCollectorWorker] with ID: [1000000000138], PID: [7809], GUID: [b5b7baf0-4750-486d-9628-7b89ee44a39d] process memory usage [629696000] exceeded limit [629145600], requesting worker to exit

I just noticed that the dates on those warnings are old, so now I’m not sure if that is the cause of the issue I’m dealing with. They are also for the MetricsCollectorWorker rather than the VM Analysis Collector, and the 629145600-byte limit works out to exactly 600 MiB.
T

@pemcg… Well, I thought I had made some progress, but no luck. I found that when I added the Direct LUN to the MIQ appliance using VirtIO instead of VirtIO-SCSI, the appliance could see the drive properly: when attached via VirtIO-SCSI, lsblk would show just the drive listed, but when attached via VirtIO, lsblk would show not only the drives but also all of the LVM logical volumes on that storage domain, and I could see the device IDs of each object on that Direct LUN.

So I rebooted the appliance to be safe and attempted to run an SSA… but alas, I am still getting the same issue: it just sits there for about an hour, then times out with no error. Any idea how I can look into the logs to see where this process is failing? Thanks again for all your help.
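For reference, since RHV block storage domains are LVM-based, the LVM query tools are a quick way to confirm what the appliance can see through the Direct LUN (read-only queries, safe to run):

```
pvs    # the Direct LUN device should show up as an LVM physical volume
vgs    # the storage domain's volume group should be listed
lvs    # the individual disk image logical volumes should appear
```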

T

I think I’ve also reproduced this on my Jansa-2 appliance, in that the scans time out like yours. I’ll start a conversation on Gitter about this and see if we can debug it.

pemcg

Well, that could be good news, knowing I’m not the only one. Thanks so much; looking forward to seeing what you find.

Thanks
T

So we have a fix for this (thanks @agrare!).

You’ll need to stop the MIQ service with:

systemctl stop evmserverd

then copy down this file: manageiq/vm_scan.rb at master · ManageIQ/manageiq · GitHub

replacing the existing /var/www/miq/vmdb/app/models/vm_scan.rb
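If the appliance has internet access, one way to do that (the raw URL here is inferred from the GitHub link above, so verify it against that page first):

```
# Back up the original, then fetch the updated file from GitHub.
cp /var/www/miq/vmdb/app/models/vm_scan.rb /var/www/miq/vmdb/app/models/vm_scan.rb.bak
curl -o /var/www/miq/vmdb/app/models/vm_scan.rb \
  https://raw.githubusercontent.com/ManageIQ/manageiq/master/app/models/vm_scan.rb
```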

Then restart the MIQ service with:

systemctl start evmserverd

You can watch the MIQ workers start using:

vmdb                  # appliance alias that changes into /var/www/miq/vmdb
watch rake evm:status

When all the workers have started you should be good to go.

Hope this helps,
pemcg