State var "disappears" and an error with StateVarHash

Hello all,

I’ve got a transient problem when deploying VMs using rhc-miq-quickstart. Sometimes during a VM deployment I see the WARN message “Failed to load BinaryBlob with ID [xxxxx] for MiqAeEngine::StateVarHash”. After that the state machine generally fails with "[State var :vm_prov_request_ids is blank!]". It seems that the vm_prov_request_ids state variable ends up empty, although it definitely had a value earlier (I can see it in earlier log messages).

Has anybody seen this, and where should I investigate further?

WBR,
Viacheslav Medvedev

automation log fragment:
[----] W, [2020-12-15T11:37:46.941925 #510589:2aac912a796c] WARN -- : Q-task_id([r1000000000360_service_template_provision_task_1000000000361]) Failed to load BinaryBlob with ID [1000000007953] for MiqAeEngine::StateVarHash
[----] I, [2020-12-15T11:37:46.957163 #510589:2aac912a796c] INFO -- : Q-task_id([r1000000000360_service_template_provision_task_1000000000361]) Updated namespace [/Service/Provisioning/StateMachines/ServiceProvision_Template/build_vm_provision_request?MiqServer%3A%3Amiq_server=1000000000001&Service%3A%3AService=1000000000151&ServiceTemplateProvisionTask%3A%3Aservice_template_provision_task=1000000000361&User%3A%3Auser=1000000000008&ae_state=post1&ae_state_data=---%20%21ruby%2Fobject%3AMiqAeEngine%3A%3AStateVarHash%0Abinary_blob_id%3A%201000000007953%0A&ae_state_previous=---%0A%22%2Frhc-miq-quickstart%2FService%2FProvisioning%2FStateMachines%2FServiceProvision_Template%2Fbuild_vm_provision_request%22%3A%0A%20%20ae_state%3A%20post1%0A%20%20ae_state_retries%3A%206%0A%20%20ae_state_started%3A%202020-12-15%2008%3A31%3A22%20UTC%0A%20%20ae_state_max_retries%3A%20100%0A&ae_state_retries=6&ae_state_started=2020-12-15%2008%3A31%3A22%20UTC&dialog_option_1_flavor=small&dialog_option_1_number_of_vms=1&dialog_option_1_os=centos_linux_7_7&dialog_tag_0_environment=prod&dialog_tag_0_location=b7&dialog_tag_0_machine_type=web&dialog_tag_0_network_location=internal&object_name=build_vm_provision_request&request=clone_to_service&service_action=Provision&vmdb_object_type=service_template_provision_task#create rhc-miq-quickstart/Service/Provisioning/StateMachines]
[----] I, [2020-12-15T11:37:46.984697 #510589:2aac912a796c] INFO -- : Q-task_id([r1000000000360_service_template_provision_task_1000000000361]) In State=[post1], invoking [on_entry] method=[update_serviceprovision_status(status => 'Processing Post1 Customizations')]
[----] I, [2020-12-15T11:37:46.991537 #510589:2aac912a796c] INFO -- : Q-task_id([r1000000000360_service_template_provision_task_1000000000361]) Updated namespace [Service/Provisioning/StateMachines/ServiceProvision_Template/update_serviceprovision_status RedHatConsulting_Utilities/Service/Provisioning/StateMachines]
[----] I, [2020-12-15T11:37:46.999378 #510589:2aac912a796c] INFO -- : Q-task_id([r1000000000360_service_template_provision_task_1000000000361]) Invoking [inline] method [/RedHatConsulting_Utilities/Service/Provisioning/StateMachines/ServiceProvision_Template/update_serviceprovision_status] with inputs [{"status"=>"Processing Post1 Customizations", "email_only_on_error"=>true}]
[----] I, [2020-12-15T11:37:47.000828 #510589:2aac912a796c] INFO -- : Q-task_id([r1000000000360_service_template_provision_task_1000000000361]) <AEMethod [/RedHatConsulting_Utilities/Service/Provisioning/StateMachines/ServiceProvision_Template/update_serviceprovision_status]> Starting
[----] I, [2020-12-15T11:37:47.487278 #510589:2aac9b745500] INFO -- : Q-task_id([r1000000000360_service_template_provision_task_1000000000361]) Save Telemetry as State Var: { service_provisioning_telemetry_on_entry_post1 => 2020-12-15 11:37:47 +0300, :miq_request_id => 1000000000360 }
[----] I, [2020-12-15T11:37:47.498866 #510589:2aac912a796c] INFO -- : Q-task_id([r1000000000360_service_template_provision_task_1000000000361]) <AEMethod [/RedHatConsulting_Utilities/Service/Provisioning/StateMachines/ServiceProvision_Template/update_serviceprovision_status]> Ending
[----] I, [2020-12-15T11:37:47.499026 #510589:2aac912a796c] INFO -- : Q-task_id([r1000000000360_service_template_provision_task_1000000000361]) Method exited with rc=MIQ_OK
[----] I, [2020-12-15T11:37:47.499475 #510589:2aac912a796c] INFO -- : Q-task_id([r1000000000360_service_template_provision_task_1000000000361]) Processing State=[post1]
[----] I, [2020-12-15T11:37:47.499727 #510589:2aac912a796c] INFO -- : Q-task_id([r1000000000360_service_template_provision_task_1000000000361]) Following Relationship [miqaedb:/Service/Provisioning/StateMachines/Methods/check_vm_provision_requests#create]
[----] I, [2020-12-15T11:37:47.510315 #510589:2aac912a796c] INFO -- : Q-task_id([r1000000000360_service_template_provision_task_1000000000361]) Updated namespace [miqaedb:/Service/Provisioning/StateMachines/Methods/check_vm_provision_requests#create rhc-miq-quickstart/Service/Provisioning/StateMachines]
[----] I, [2020-12-15T11:37:47.548287 #510589:2aac912a796c] INFO -- : Q-task_id([r1000000000360_service_template_provision_task_1000000000361]) Updated namespace [Service/Provisioning/StateMachines/Methods/check_vm_provision_requests rhc-miq-quickstart/Service/Provisioning/StateMachines]
[----] I, [2020-12-15T11:37:47.551329 #510589:2aac912a796c] INFO -- : Q-task_id([r1000000000360_service_template_provision_task_1000000000361]) Invoking [inline] method [/rhc-miq-quickstart/Service/Provisioning/StateMachines/Methods/check_vm_provision_requests] with inputs [{}]
[----] I, [2020-12-15T11:37:47.558043 #510589:2aac912a796c] INFO -- : Q-task_id([r1000000000360_service_template_provision_task_1000000000361]) Updated namespace [StdLib/Core/Core RedHatConsulting_Utilities/StdLib]
[----] I, [2020-12-15T11:37:47.570296 #510589:2aac912a796c] INFO -- : Q-task_id([r1000000000360_service_template_provision_task_1000000000361]) Loading embedded method /RedHatConsulting_Utilities/StdLib/Core/Core
[----] I, [2020-12-15T11:37:47.606126 #510589:2aac912a796c] INFO -- : Q-task_id([r1000000000360_service_template_provision_task_1000000000361]) <AEMethod [/rhc-miq-quickstart/Service/Provisioning/StateMachines/Methods/check_vm_provision_requests]> Starting
[----] E, [2020-12-15T11:37:48.048508 #510589:2aac9f471668] ERROR -- : Q-task_id([r1000000000360_service_template_provision_task_1000000000361]) [State var :vm_prov_request_ids is blank!]
/rhc-miq-quickstart/Service/Provisioning/StateMachines/Methods/check_vm_provision_requests:37:in `main'
/rhc-miq-quickstart/Service/Provisioning/StateMachines/Methods/check_vm_provision_requests:84:in

State vars should only be “simple” data types (rather than, for example, a Rails object). What sort of variable are you storing (e.g. string, int, etc.)?
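From the thread it appears that StateVarHash persists state vars as YAML inside a BinaryBlob, so a rough local test of whether a value is a “simple” type is whether it survives a YAML round-trip unchanged. A minimal sketch (my assumption of the serialization semantics, not ManageIQ code):

```ruby
require 'yaml'

# Sketch: a state var is only "safe" to persist if it round-trips through
# YAML unchanged (strings, numbers, arrays, hashes of those, symbols).
def yaml_safe?(value)
  YAML.safe_load(YAML.dump(value), permitted_classes: [Symbol]) == value
end

puts yaml_safe?(%w[1000000000360 1000000000361]) # array of strings -> true
puts yaml_safe?('small')                         # plain string -> true
```

A Rails model object would fail this check (or deserialize into something different), which is why only simple types belong in state vars.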

pemcg

vm_prov_request_ids is an array of strings. In fact the “check_vm_provision_requests” method that fails is Red Hat (rhc-miq-quickstart) code, not mine. It uses this variable to carry the provision request ids between state machine stages. What I’ve noticed is that this variable loses its value from time to time.
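For context, a hypothetical reconstruction of the guard that raises the error (a sketch, not the actual rhc-miq-quickstart source; `state_vars` here stands in for what `$evm.get_state_var` returns): if the StateVarHash fails to load from its BinaryBlob, the hash comes back empty and the check fires even though the value was set in an earlier state.

```ruby
# Hypothetical sketch of the failing guard in check_vm_provision_requests.
def check_request_ids(state_vars)
  ids = state_vars[:vm_prov_request_ids]
  raise 'State var :vm_prov_request_ids is blank!' if ids.nil? || ids.empty?
  ids
end

puts check_request_ids(vm_prov_request_ids: %w[1000000000362]).inspect
```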

It would be interesting to see if any of your generic workers are being killed; try the following on all of your appliances:

grep 'MiqServer#validate_worker' evm.log | grep "WARN\|ERROR"

pemcg

no such entries :frowning:

But here’s what the log contains about this BinaryBlob #7953

Let’s look at the older log files, try this:

zgrep 'MiqServer#validate_worker' /var/www/miq/vmdb/log/evm.log* | grep "WARN\|ERROR"

Your command returned no results. But here are all the worker-related errors:
zgrep 'worker' /var/www/miq/vmdb/log/evm.log* | grep -E "WARN|ERROR"

Interesting, the search string has changed - my bad. This is exactly what I was hoping to see. Try increasing the thresholds for your generic workers, probably up to 1GB (generic workers process Automate tasks).
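For reference, the generic worker memory threshold lives in the advanced settings under the workers tree. The exact path below is from memory and may differ between versions, so verify it against your appliance before changing anything:

```yaml
# Advanced settings fragment (path assumed; verify on your version)
:workers:
  :worker_base:
    :queue_worker_base:
      :generic_worker:
        :memory_threshold: 1.gigabytes
```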

Hope this helps,
pemcg


@pemcg thank you for your suggestions. I’ve increased the memory threshold for the workers. The “memory usage” messages disappeared, but the problem with “Failed to load BinaryBlob” is still there.

This time I got the WARN message “Failed to load BinaryBlob with ID [xxxxx] for MiqAeEngine::StateVarHash”, and an Ansible Tower job run from the state machine failed because the pshost variable was not passed (in fact all the ansible_* state vars were lost and the job template ran with its default variables).

I checked that the pshost var was set correctly in the preretirement state:

[----] I, [2020-12-24T17:53:20.117637 #256659:2b1e0116c5e8] INFO -- : Q-task_id([r1000000000501_vm_retire_task_1000000000513]) Processing get_pshost...
[----] I, [2020-12-24T17:53:20.120186 #256659:2b1e0116c5e8] INFO -- : Q-task_id([r1000000000501_vm_retire_task_1000000000513]) DEBUG m: pshost_lookup_strategy_manualbytag
[----] I, [2020-12-24T17:53:20.120576 #256659:2b1e0116c5e8] INFO -- : Q-task_id([r1000000000501_vm_retire_task_1000000000513]) Processing pshost_lookup_strategy_manualbytag...
[----] I, [2020-12-24T17:53:20.125034 #256659:2b1e0116c5e8] INFO -- : Q-task_id([r1000000000501_vm_retire_task_1000000000513]) Searching for pshost name with key [pshost_lookup_manualbytags_lookup_internal_test]
[----] I, [2020-12-24T17:53:20.127372 #256659:2b1e0116c5e8] INFO -- : Q-task_id([r1000000000501_vm_retire_task_1000000000513]) DEBUG pshost from settings: XXXXXXXXXXXXXXXXXXXX

@pemcg I’ve updated from jansa-1 to jansa-2 and found these messages in evm.log. Could they relate to this issue, what do you think? I’ve changed the limits: :log_threshold to 250 kbytes, :element_threshold to 20 kbytes. Let’s see…

[----] W, [2020-12-28T13:19:22.147016 #359336:2aee3abc7ebc]  WARN -- : MIQ(ops_controller-explorer): Session object size of 103.3 KB exceeds threshold of 100 KB
[----] W, [2020-12-28T17:27:56.420485 #359336:2aee3abc7318]  WARN -- : MIQ(ops_controller-accordion_select): Session object size of 154 KB exceeds threshold of 100 KB
[----] W, [2020-12-28T17:28:14.176421 #359336:2aee3abc7a70]  WARN -- : MIQ(ops_controller-change_tab): Session object size of 188.5 KB exceeds threshold of 100 KB
[----] W, [2020-12-28T17:28:20.133489 #359336:2aee3abc2174]  WARN -- : MIQ(ops_controller-tree_select): Session object size of 154.5 KB exceeds threshold of 100 KB

No more errors related to workers or session object size, but the problem is still there. I’ve found this error, which seems to be related to the issue. Investigating further.

[----] I, [2021-01-22T12:05:20.504423 #1132708:2af4e273d974]  INFO -- : Q-task_id([r1000000000624_service_template_provision_task_1000000000641]) MIQ(MiqQueue#deliver) Message id: [1000001912924], Delivering...
[----] I, [2021-01-22T12:05:20.508976 #1132708:2af4e273d974]  INFO -- : Q-task_id([r1000000000624_service_template_provision_task_1000000000641]) MIQ(MiqAeEngine.deliver) Delivering {:event_type=>:service_provisioned, "MiqEvent::miq_event"=>1000000015905, :miq_event_id=>1000000015905, "EventStream::event_stream"=>1000000015905, :event_stream_id=>1000000015905} for object [Service.1000000000233] with state [] to Automate
[----] E, [2021-01-22T12:05:20.512625 #1132708:2af4e273d974] ERROR -- : Q-task_id([r1000000000624_service_template_provision_task_1000000000641]) MIQ(MiqQueue#deliver) Message id: [1000001912924], Error: [Couldn't find Service]
[----] E, [2021-01-22T12:05:20.512809 #1132708:2af4e273d974] ERROR -- : Q-task_id([r1000000000624_service_template_provision_task_1000000000641]) [ActiveRecord::RecordNotFound]: Couldn't find Service  Method:[block (2 levels) in <class:LogProxy>]
[----] E, [2021-01-22T12:05:20.512919 #1132708:2af4e273d974] ERROR -- : Q-task_id([r1000000000624_service_template_provision_task_1000000000641]) /opt/manageiq/manageiq-gemset/gems/activerecord-5.2.4.4/lib/active_record/core.rb:217:in `find_by!'
/opt/manageiq/manageiq-gemset/bundler/gems/manageiq-automation_engine-f0185addb172/lib/miq_automation_engine/engine/miq_ae_engine.rb:85:in `deliver'
/var/www/miq/vmdb/app/models/miq_queue.rb:484:in `block in dispatch_method'
/usr/share/ruby/timeout.rb:93:in `block in timeout'
/usr/share/ruby/timeout.rb:33:in `block in catch'
/usr/share/ruby/timeout.rb:33:in `catch'
/usr/share/ruby/timeout.rb:33:in `catch'
/usr/share/ruby/timeout.rb:108:in `timeout'
/var/www/miq/vmdb/app/models/miq_queue.rb:482:in `dispatch_method'
/var/www/miq/vmdb/app/models/miq_queue.rb:459:in `block in deliver'
/var/www/miq/vmdb/app/models/user.rb:317:in `with_user'
/var/www/miq/vmdb/app/models/user.rb:332:in `with_user_group'
/var/www/miq/vmdb/app/models/miq_queue.rb:459:in `deliver'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:80:in `deliver_queue_message'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:104:in `deliver_message'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:122:in `block in do_work'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:118:in `loop'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:118:in `do_work'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:256:in `block in do_work_loop'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:253:in `loop'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:253:in `do_work_loop'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:113:in `run'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:95:in `start'
/var/www/miq/vmdb/lib/workers/bin/run_single_worker.rb:113:in `<main>'
[----] I, [2021-01-22T12:05:20.513043 #1132708:2af4e273d974]  INFO -- : Q-task_id([r1000000000624_service_template_provision_task_1000000000641]) MIQ(MiqQueue#delivered) Message id: [1000001912924], State: [error], Delivered in [0.008648454] seconds

It seems that the problem is related to the binary blob purging mechanism. Three blobs were purged, and the three corresponding tasks failed.

@pemcg could you please advise on how this purging mechanism can be tuned? Right now binary_blob_purge_interval is set to the default of 1.hour. All the tasks run between purges are OK; all three tasks that were running during a purge failed.

evm.log
[----] I, [2021-01-22T16:34:06.232508 #1172161:2b0472885954] INFO -- : MIQ(BinaryBlob.purge_by_scope) Purging Binary blobs...
[----] I, [2021-01-22T16:34:06.233984 #1172161:2b0472885954] INFO -- : MIQ(BinaryBlob.purge_in_batches) Purging 3 Binary blobs.
[----] I, [2021-01-22T16:34:06.236919 #1172161:2b0472885954] INFO -- : MIQ(BinaryBlob.purge_by_scope) Purging Binary blobs...Complete - Deleted 3 records

automation.log
[----] W, [2021-01-22T16:34:46.175922 #1172161:2b0472885954] WARN -- : Q-task_id([r1000000000655_service_template_provision_task_1000000000679]) Failed to load BinaryBlob with ID [1000000012727] for MiqAeEngine::StateVarHash
[----] W, [2021-01-22T16:35:03.448877 #1172159:2aef14ab394c] WARN -- : Q-task_id([r1000000000657_service_template_provision_task_1000000000681]) Failed to load BinaryBlob with ID [1000000012728] for MiqAeEngine::StateVarHash
[----] W, [2021-01-22T16:35:03.698953 #1172161:2b0472885954] WARN -- : Q-task_id([r1000000000652_vm_retire_task_1000000000676]) Failed to load BinaryBlob with ID [1000000012729] for MiqAeEngine::StateVarHash

If you go into the advanced settings you should see a value for binary_blob_purge_interval in the schedule worker section that you should be able to change.
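If it helps, the setting should look something like the fragment below in the advanced settings tree. The exact nesting is my recollection of the schedule worker section and may differ by version, so check it on your appliance:

```yaml
# Advanced settings fragment (path assumed; verify on your version)
:workers:
  :worker_base:
    :schedule_worker:
      :binary_blob_purge_interval: 12.hours
```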

It is very strange that 1 hour was chosen as the default. I want to dig into that, because it might be a bug.

Actually, as I’m digging in, the code will only purge BinaryBlob records if they are NOT associated with a resource. That is, they are already orphaned when the purger runs.

cc @agrare (since you added the purging code)

I’ve set binary_blob_purge_interval = 12.hours so that the purge process runs outside business hours.

What resources are checked? I’m not a Ruby professional, but maybe what I see here contains some clues.

manageiq/app/models/binary_blob/purging.rb

def purge_scope(_older_than = nil)
  where(:resource => nil)
end
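To illustrate what that scope selects, here is a sketch with plain Ruby objects (not ActiveRecord): `where(:resource => nil)` keeps only rows whose polymorphic resource association is empty, i.e. NULL resource_type/resource_id. Notably, an in-flight StateVarHash blob also has a nil resource, so it matches the purge scope if the purger happens to run mid-provision.

```ruby
# Sketch mimicking BinaryBlob.purge_scope with plain Ruby structs.
Blob = Struct.new(:id, :resource_type, :resource_id)

blobs = [
  Blob.new(1_000_000_013_520, nil, nil),                            # state-var blob, still in use
  Blob.new(1_000_000_012_477, 'MiqReportResult', 1_000_000_000_136) # associated, kept
]

# Equivalent of where(:resource => nil): rows with an empty association.
purgeable = blobs.select { |b| b.resource_type.nil? && b.resource_id.nil? }
puts purgeable.map(&:id).inspect # => [1000000013520]
```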

and what’s strange is that the database records created during provisioning have empty resource_type and resource_id fields:

vmdb_test=# select * from binary_blobs;

      id       |  resource_type   |  resource_id  |               md5                |  size   | part_size |       name        | data_type
---------------+------------------+---------------+----------------------------------+---------+-----------+-------------------+-----------
<......>
1000000013520 |                  |               | 9d8dd2e50902df0bd5d33fd0ff44e895 |    2843 |   1048576 |                   | YAML
1000000012477 | MiqReportResult  | 1000000000136 | 7523dd936e39a26181ff28a1af61870c |    4063 |   1048576 | report_results    | YAML
1000000012480 | MiqTask          | 1000000009160 | 68dc29647c74f5b34218f69edcabdfb7 |       9 |   1048576 | task_results      | YAML
1000000001879 | MiqReportResult  | 1000000000020 | 22e2a9538050e82a707dca9ffe686374 |    4063 |   1048576 | report_results    | YAML
1000000013106 | MiqTask          | 1000000009433 | 4ee1a51f18611218ddb4a7a1a3339723 |   25071 |   1048576 | task_results      | YAML
1000000013112 | MiqTask          | 1000000009435 | 340556208bac9e18ba3fc3203ea01c84 |   24998 |   1048576 | task_results      | YAML
1000000013255 | MiqTask          | 1000000009503 | bab3714adcfe69258798368f374634e1 |   25015 |   1048576 | task_results      | YAML
1000000013256 | MiqTask          | 1000000009504 | 8d2cc030c811c4b612f0b187b674b42d |   25015 |   1048576 | task_results      | YAML
1000000013188 | MiqReportResult  | 1000000000144 | e731284eda03f6456584601db9500c85 |    4063 |   1048576 | report_results    | YAML
1000000012897 | MiqTask          | 1000000009342 | 68dc29647c74f5b34218f69edcabdfb7 |       9 |   1048576 | task_results      | YAML
1000000013019 | MiqTask          | 1000000009400 | 68dc29647c74f5b34218f69edcabdfb7 |       9 |   1048576 | task_results      | YAML
1000000013353 | MiqTask          | 1000000009597 | dbf6a3b4f8cc2f0e86de48262cb92712 |   26238 |   1048576 | task_results      | YAML
1000000013379 | MiqTask          | 1000000009622 | cdd7d440d95f7adc3c585c699d09747d |   26238 |   1048576 | task_results      | YAML
1000000013521 |                  |               | 0757b0f40d934564660e726eb4279a24 |    1396 |   1048576 |                   | YAML
1000000013189 | MiqTask          | 1000000009471 | 68dc29647c74f5b34218f69edcabdfb7 |       9 |   1048576 | task_results      | YAML
<......>

I’d expect those “empty” records to be deleted on the next purge. Unfortunately I can’t tell what the content is. If you want to see it, in a Rails console do:

BinaryBlob.find(1000000013520).binary

That should at least show you what the contents of those records are, to see if they are relevant.

I’ve checked those BinaryBlobs. They are related to the provisioning process; contents as follows. Several binary blobs are created and then deleted during VM provisioning. All of them have the resource_type, resource_id and name fields empty.

irb(main):006:0> BinaryBlob.find(1000000036489).binary
=> "---\nvm_provisioning_telemetry_on_entry_CustomizeRequest: 2021-03-15 19:29:28.508798733\n  +03:00\nansible_stats_vm_template_platform: linux\nansible_stats_dns_domain: XXXXXXXXXXXX\nmiq_project_name: XXXXXXXXXX\nansible_stats_miq_project_name: XXXXXXXXXX\npshost: XXXXXXXXXXXXXXXXXX\nansible_stats_pshost: XXXXXXXXXXXXXXXXX\nvm_provisioning_telemetry_on_exit_CustomizeRequest: 2021-03-15 19:29:31.100215258\n  +03:00\nvm_provisioning_telemetry_on_entry_Placement: 2021-03-15 19:29:31.673014976 +03:00\nvm_provisioning_telemetry_on_exit_Placement: 2021-03-15 19:29:36.257719746 +03:00\nvm_provisioning_telemetry_on_entry_PreProvision: 2021-03-15 19:29:36.738058716 +03:00\nvm_provisioning_telemetry_on_exit_PreProvision: 2021-03-15 19:29:37.849041524 +03:00\nvm_provisioning_telemetry_on_entry_AquireIPAddress: 2021-03-15 19:29:38.312186879\n  +03:00\naddr_mode:\n- static\n- Static\nip_addr: 10.XX.XX.XX\nsubnet_mask: 255.255.255.0\ngateway: 10.XX.XX.XX\ndns_servers: 10.XX.XX.XX,10.XX.XX.XX,10.XX.XX.XX,10.XX.XX.XX,10.XX.XX.XX\ndns_domain: XXXXXXXXXXXXX\ndns_suffixes: XXXXXXXXXXXXXX\nlinux_domain_name: XXXXXXXXXXXXXXXXXX\nansible_stats_vm_target_name: XXXXXXXXXXXXXXX\nansible_stats_ip_addr: 10.XX.XX.XX\nvm_provisioning_telemetry_on_exit_AquireIPAddress: 2021-03-15 19:29:47.261975896 +03:00\nvm_provisioning_telemetry_on_entry_AddDNSRecord: 2021-03-15 19:29:47.728566896 +03:00\nvm_provisioning_telemetry_on_exit_AddDNSRecord: 2021-03-15 19:31:35.426705327 +03:00\nvm_provisioning_telemetry_on_entry_vCenterFolders: 2021-03-15 19:31:35.902954069 +03:00\nvm_provisioning_telemetry_on_exit_vCenterFolders: 2021-03-15 19:31:36.917791189 +03:00\nvm_provisioning_telemetry_on_entry_JoinDomain: 2021-03-15 19:31:37.381523922 +03:00\nautomate_workspace_guid: e5914ada-7ec1-4157-8c34-98a87157c6e1\nJoinDomain_ansible_method_task_id: 1000000014520\n"
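Parsing that `binary` string back as YAML recovers the state-var hash. A sketch using a trimmed, anonymized stand-in for the dump above (the hostname and IP are placeholders, not real values):

```ruby
require 'yaml'

# Trimmed stand-in for the BinaryBlob's YAML payload (values anonymized).
raw = "---\npshost: tower.example.com\naddr_mode:\n- static\n- Static\nip_addr: 10.0.0.5\n"

vars = YAML.safe_load(raw)        # plain hash of the persisted state vars
puts vars['pshost']               # => tower.example.com
puts vars['addr_mode'].inspect    # => ["static", "Static"]
```

This is consistent with the dump above: the blob is simply the serialized state vars, which is why provisioning loses all of them at once when the blob is purged.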

Those empty records are created and deleted automatically during VM provisioning process. The problem occurs when BinaryBlob purging starts during provisioning.