MIQ OpenStack Instance Provision Very Slow

providers

#1

Hello,

We are seeing very long provisioning times for a simple instance on an OpenStack provider (~40 minutes, sometimes reaching 2 hours!). After some investigation we believe the root cause is most likely the amount of time it takes to refresh the MIQ inventory (EmsRefresh).

In our analysis we noticed that:
• Most of the provisioning time (~25 minutes) is spent in the CheckProvisioned phase (evidence 1).
• However, the cloning process itself in OpenStack only takes about 3 minutes.
• For the rest of the provisioning time, MIQ is waiting for EmsRefresh to create the new instance in the VMDB (evidence 2).

We also noticed that the MIQ workers are not running over capacity: on average the server uses about half of its RAM, it is not swapping, and CPU time looks fine as well. We are looking for tuning guidance to reduce the refresh time and thereby reduce this problem.
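
Our understanding of the CheckProvisioned wait, based on the poll_destination_in_vmdb messages in evidence 2, is roughly the following (a simplified sketch runnable from a Rails console, not the exact ManageIQ implementation; the ems_ref value is the one from our logs):

    # The provision task keeps looking up the freshly cloned instance in the
    # VMDB by its ems_ref. The lookup only succeeds once EmsRefresh has saved
    # the new inventory, so every attempt before that logs
    # "Unable to find Vm ... will retry" and requeues CheckProvisioned.
    vm = VmOrTemplate.find_by(:ems_ref => "5539330b-e63b-48c3-9e74-070dd70f889e")
    vm.nil?   # => true until the next EmsRefresh of the provider completes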

We would appreciate any help in finding the best approach to fix this problem.

Evidence:

  1. CheckProvisioned waiting time:

[----] I, [2017-10-19T11:03:03.709350 #2982:427980] INFO -- : Q-task_id([miq_provision_4923]) Processed State=[CheckProvisioned] with Result=[retry]
[----] I, [2017-10-19T11:04:10.998310 #27075:427980] INFO -- : Q-task_id([miq_provision_4923]) Processed State=[CheckProvisioned] with Result=[retry]
(…)
[----] I, [2017-10-19T11:26:58.116785 #27075:427980] INFO -- : Q-task_id([miq_provision_4923]) Processed State=[CheckProvisioned] with Result=[retry]
[----] I, [2017-10-19T11:28:06.723697 #2982:427980] INFO -- : Q-task_id([miq_provision_4923]) Processed State=[CheckProvisioned] with Result=[retry]
[----] I, [2017-10-19T11:29:13.752986 #2982:427980] INFO -- : Q-task_id([miq_provision_4923]) Processed State=[RegisterDHCP] with Result=[retry]
[----] I, [2017-10-19T11:30:21.422286 #27075:427980] INFO -- : Q-task_id([miq_provision_4923]) Processed State=[RegisterDHCP] with Result=[retry]

  2. Provisioning only continues after the VM is created in the VMDB (EmsRefresh):

Cloning started (CheckProvisioned):
[----] I, [2017-10-19T11:03:03.709350 #2982:427980] INFO -- : Q-task_id([miq_provision_4923]) Processed State=[CheckProvisioned] with Result=[retry]
[----] I, [2017-10-19T11:03:04.390900 #2982:427980] INFO -- : Q-task_id([miq_provision_4923]) MIQ(ManageIQ::Providers::Openstack::CloudManager::Provision#poll_destination_in_vmdb) Unable to find Vm [vmdiasdev003] with ems_ref [5539330b-e63b-48c3-9e74-070dd70f889e], will retry

VM creation in the VMDB:
[----] I, [2017-10-19T11:27:56.861898 #3557:6df984] INFO -- : MIQ(EmsRefresh.save_vms_inventory) EMS: [Global IaaS - CMPN], id: [1] Creating Vm [vmdiasdev003] location: [] storage id: [] uid_ems: [5539330b-e63b-48c3-9e74-070dd70f889e] ems_ref: [5539330b-e63b-48c3-9e74-070dd70f889e]

Last occurrence of "Unable to find Vm":
[----] I, [2017-10-19T11:27:47.597154 #27075:427980] INFO -- : Q-task_id([miq_provision_4923]) MIQ(ManageIQ::Providers::Openstack::CloudManager::Provision#poll_destination_in_vmdb) Unable to find Vm [vmdiasdev003] with ems_ref [5539330b-e63b-48c3-9e74-070dd70f889e], will retry

CheckProvisioned state finishes successfully:
[----] I, [2017-10-19T11:29:11.827705 #2982:427980] INFO -- : Q-task_id([miq_provision_4923]) Processed State=[CheckProvisioned] with Result=[ok]

  3. Full refresh duration:

[root@miqapplbnpbr005 vmdb]# rails c
Loading production environment (Rails 5.0.0.1)
irb(main):001:0> openstack = ManageIQ::Providers::Openstack::CloudManager.first
PostgreSQLAdapter#log_after_checkout, connection_pool: size: 5, connections: 1, in use: 1, waiting_in_queue: 0
=> #<ManageIQ::Providers::Openstack::CloudManager id: 1, name: "Global IaaS - CMPN", created_on: "2017-02-09 18:53:43", updated_on: "2017-10-20 12:27:36", guid: "13dd2cee-eef9-11e6-8956-fa163e98a73b", zone_id: 1, type: "ManageIQ::Providers::Openstack::CloudManager", api_version: "v2", uid_ems: nil, host_default_vnc_port_start: nil, host_default_vnc_port_end: nil, provider_region: nil, last_refresh_error: nil, last_refresh_date: "2017-10-20 12:27:36", provider_id: nil, realm: nil, tenant_id: 1, project: nil, parent_ems_id: nil, subscription: nil>
irb(main):003:0> a = DateTime.now; EmsRefresh.refresh(openstack); b = DateTime.now; puts a.to_s + "|" + b.to_s
/opt/rubies/ruby-2.2.5/lib/ruby/gems/2.2.0/gems/fog-core-1.43.0/lib/fog/core/attributes/default.rb:52: warning: redefining `object_id' may cause serious problems
2017-10-20T10:32:12-02:00|2017-10-20T10:53:45-02:00
=> nil
irb(main):004:0>
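
For reference, the same measurement can also be taken with Benchmark from the Ruby standard library (it times the same EmsRefresh.refresh call, just reporting the elapsed seconds directly):

    require 'benchmark'
    openstack = ManageIQ::Providers::Openstack::CloudManager.first
    # Full refresh of the provider; prints the elapsed wall-clock time in seconds
    puts Benchmark.realtime { EmsRefresh.refresh(openstack) }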

Regards.


#2

@gmileo
You are not alone with this issue. The root cause is suboptimal code in the EMS Refresh worker. The developers seem to suffer from a devstack/packstack syndrome: they may have tested their code, but on laptops rather than in a real environment. I have cases opened for this, and for now you can use the following workaround to resolve the issue:

  1. Use ManageIQ fine-2 or fine-3, or CFME 5.8.1.

  2. You need an integration account for the EMS connection. This account should have the admin role on your OpenStack domain and the admin role on all projects on the OpenStack side.

  3. If you do not use the Default domain, you need to create an 'admin' project in your domain.

  4. For Heat template collection, the "stacks:global_index" rule should be set to "context_is_admin" in the /etc/heat/policy.json configuration file on the OpenStack controllers; see the excerpt below.
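
     For illustration only, the relevant fragment of policy.json would look roughly like this (an excerpt, assuming the usual rule-reference syntax; the context_is_admin rule is normally already defined, and the rest of the file stays as shipped):

      {
          "context_is_admin": "role:admin",
          "stacks:global_index": "rule:context_is_admin"
      }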

  5. On the appliances that have the Provider Inventory role enabled, change the advanced settings for the refresh worker:

      :ems:
        :ems_openstack:
          :excon:
            :omit_default_port: true
            :read_timeout: 60
          :refresh:
            :is_admin: true
            :heat:
              :is_global_admin: true
    
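     After restarting the refresh worker, you can verify that these keys were picked up from a Rails console on the same appliance (key paths exactly as above; a read-only check):

      # These reflect the merged advanced settings on this appliance:
      ::Settings.ems.ems_openstack.refresh.is_admin               # => true
      ::Settings.ems.ems_openstack.refresh.heat.is_global_admin   # => true
      ::Settings.ems.ems_openstack.excon.read_timeout             # => 60
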
  6. Change the memory_threshold and restart_interval properties for the refresh worker:

    :workers:
    <...>
      :queue_worker_base:
      <...>
        :ems_refresh_worker:
          :defaults:
            :memory_threshold: 2.gigabytes
            <...>
            :restart_interval: 8.hours
    
  7. On the appliances that have the Provider Inventory role enabled, change the 'get_inventory_collection' method in the file /var/www/miq/vmdb/app/models/manageiq/providers/openstack/network_manager/refresh_parser.rb:

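    # Passing the :is_admin refresh setting as the third argument to
    # handled_list makes the network inventory listing use the integration
    # account's admin scope: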
    def get_inventory_collection(collection_type)
      @network_service.handled_list(collection_type, {}, ::Settings.ems.ems_openstack.refresh.is_admin)
    end
    

And you should be OK.

https://bugzilla.redhat.com/show_bug.cgi?id=1497169
https://bugzilla.redhat.com/show_bug.cgi?id=1469306
https://bugzilla.redhat.com/show_bug.cgi?id=1460977
https://bugzilla.redhat.com/show_bug.cgi?id=1498520


#3

And please describe your environment: which OpenStack version you use, how you connect the EMS provider, and how many domains and projects you have on the OpenStack side?


#4

You can get a rough breakdown of where the time is spent in an inventory refresh using the ems_refresh_timings.rb script here: https://github.com/RHsyseng/cfme-log-parsing

If the save_inventory time is really large, you might be able to improve things by tuning the database: increasing the size of shared_buffers, and maybe putting the database disk on faster storage. Also check that you’re running the database maintenance scripts (installable from appliance_console).
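
As a sketch of the kind of check meant here (the ~25% figure is a general PostgreSQL guideline rather than a CFME-specific recommendation), you can read the current value through the existing VMDB connection and then adjust postgresql.conf:

    # From a Rails console on the database appliance: show the current value
    ActiveRecord::Base.connection.execute("SHOW shared_buffers").first
    # => {"shared_buffers"=>"..."}
    #
    # To raise it, edit shared_buffers in the appliance's postgresql.conf and
    # restart PostgreSQL; a common starting point is roughly 25% of the RAM
    # available to the database.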

Hope this helps,
pemcg