How is AWS SNS/S3 service used?


#1

We were testing ManageIQ Botvinnik with Amazon AWS as a cloud provider. In our account we don’t have any instances running. However, about 1.5 million SNS notifications were consumed by the ManageIQ user we created over a period of 5 days.

Can you describe how ManageIQ uses SNS, and what controls are available to restrict its usage? What S3 resources does ManageIQ require? We noted 1 S3 unit used for a put/get request.

We did some basic search and found http://manageiq.org/documentation/development/providers/amazon_aws_config/
This did not quite fully explain the 1.5 million notifications.

Thanks,
Kiran


#2

AWS Config is leveraged for detecting events from Amazon, and thereby automating many tasks in ManageIQ. 1.5 million seems a bit high to me as well, given that you have no instances.

@blomquisg Can you comment here?


#3

@kirank,

Can you tell us what types of SNS notifications you’re seeing?

I haven’t seen any cases where so many messages were collected from AWS Config.

If you’re not using the Event capture feature for any other providers, you can disable that server role, and that will keep ManageIQ from attempting to contact the AWS Config service looking for events.

Alternatively, if you are using the event server role for providers other than Amazon, then I believe you can change the poll setting for the Amazon Event Catcher. Increasing that configuration to a much higher value (1.week, perhaps) would significantly reduce how often the Amazon Event Catcher worker attempts to fetch events from the AWS Config service. @Fryguy, does that sound right?
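For reference, a rough sketch of where that poll setting lives in the Advanced Settings YAML (key names are from memory and may differ by release; treat this as an assumption to verify against your appliance):

```yaml
workers:
  worker_base:
    :event_catcher:
      :event_catcher_amazon:
        :poll: 1.week   # raised from the default; reduces AWS Config polling attempts
```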

If you can pinpoint the types of messages being consumed from the SNS topic, we might be able to track down what’s actually consuming those messages.

Thanks Kiran!


GregB


#4

@blomquisg,
What I have is not in-depth info, but maybe it will help. I don’t have access to the logs on ManageIQ yet. Here’s the AWS side of things. This snippet repeats for a while over the hours. I have yet to check the event poll setup; I will do so when I get to the office in a day.

AmazonSNS	ListTopics	USW1-DataTransfer-Out-Bytes		07/09/15 15:00:00	07/09/15 16:00:00	0
AmazonSNS	ListTopics	USW1-Requests-Tier1		07/09/15 15:00:00	07/09/15 16:00:00	29576
AmazonSNS	ListTopics	USW1-DataTransfer-Out-Bytes		07/09/15 16:00:00	07/09/15 17:00:00	0
AmazonSNS	ListTopics	USW1-Requests-Tier1		07/09/15 16:00:00	07/09/15 17:00:00	37322
AmazonSNS	ListTopics	USW1-DataTransfer-Out-Bytes		07/09/15 17:00:00	07/09/15 18:00:00	0
AmazonSNS	ListTopics	USW1-Requests-Tier1		07/09/15 17:00:00	07/09/15 18:00:00	37461
AmazonSNS	ListTopics	USW1-Requests-Tier1		07/09/15 18:00:00	07/09/15 19:00:00	37814
AmazonSNS	ListTopics	USW1-DataTransfer-Out-Bytes		07/09/15 18:00:00	07/09/15 19:00:00	0
AmazonSNS	ListTopics	USW1-DataTransfer-Out-Bytes		07/09/15 19:00:00	07/09/15 20:00:00	0
AmazonSNS	ListTopics	USW1-Requests-Tier1		07/09/15 19:00:00	07/09/15 20:00:00	37778

#5

That only looks like one request an hour, which I think is reasonable… unless I’m misunderstanding the data there. So, over 5 days that would only be about 120 requests.

I think we cache the topic list anyway, unless maybe the worker is restarting once an hour for some reason? @kirank, when you get a chance, I think we’d like to see the evm.log as well, to check whether something is going on with the event catcher. In addition, the aws.log might have some detail (though it might need to be put into debug mode first).


#6

I got more in-depth data today. I had the AWS log set at info, but there is good information below. If I switch to debug now, I won’t be able to correlate with AWS billing for a few days (or maybe a month).
@Fryguy, it took me a little while when I first looked at it. The last column is the aggregated count of “ListTopics” requests in the hour from 15:00 to 16:00: in this line there were 29,576 requests according to AWS. So, totaled across the 5 days, it is actually much higher than 1.5 million, more like 4.4 million (again according to AWS billing).

AmazonSNS ListTopics USW1-Requests-Tier1 07/09/15 15:00:00 07/09/15 16:00:00 29576

Now, trying to correlate with the ManageIQ logs, I see this entry repeatedly (which I think is due to the Amazon AWS Config service not being set up):

[----] W, [2015-07-10T03:16:00.020176 #29382:3d59e4c]  WARN -- : MIQ(AmazonEventMonitor#find_or_create_queue) Unable to find the AWS Config Topic. Cannot collect Amazon events for AWS Access Key ID AYYYYYYYYYYYYYYYYYYYY
[----] W, [2015-07-10T03:16:00.020214 #29382:3d59e4c]  WARN -- : MIQ(AmazonEventMonitor#find_or_create_queue) Contact Amazon to create the AWS Config service and topic for Amazon events.

I checked a two-hour window, 7/10/2015 16:00 to 18:00. There were 2,560 log entries, so roughly (2,560 lines / 2 lines per request) / 2 hours = 640 requests per hour, or about 10 per minute.
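To make that arithmetic explicit (figures are the ones quoted above; this is just a back-of-the-envelope check):

```ruby
# Back-of-the-envelope check of the log-derived request rate.
log_lines     = 2560   # WARN lines seen in the 2-hour window
lines_per_req = 2      # each failed topic lookup logs two WARN lines
window_hours  = 2

requests          = log_lines / lines_per_req   # 1280 requests in the window
requests_per_hour = requests / window_hours     # 640 per hour
requests_per_min  = requests_per_hour / 60.0    # ~10.7 per minute

puts requests_per_hour          # => 640
puts requests_per_min.round(1)  # => 10.7
```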

Trying to match this with Amazon’s billing:

AmazonSNS	ListTopics	USW1-Requests-Tier1		7/10/2015 16:00	7/10/2015 17:00	37780
AmazonSNS	ListTopics	USW1-Requests-Tier1		7/10/2015 17:00	7/10/2015 18:00	37821

That is about 75,601 API requests over the same two hours, so AWS is clearly seeing far more API requests from the ManageIQ side than the logs account for.

For a little more insight, I turned on debug on the AWS side just to see:
aws.log

[----] W, [2015-08-05T11:31:13.118641 #2052:3f224a4]  WARN -- : MIQ(AmazonEventMonitor#find_or_create_queue) Contact Amazon to create the AWS Config service and topic for Amazon events.
[----] D, [2015-08-05T11:31:13.118754 #2052:3f224a4] DEBUG -- : MIQ(AmazonEventMonitor#find_or_create_queue) Looking for Amazon SQS Queue manageiq-awsconfig-queue-90ba7a22-264c-11e5-8b49-525400fa21a1 ...
[----] D, [2015-08-05T11:31:18.289550 #2052:3f224a4] DEBUG -- : [AWS SQS 400 5.169702 0 retries] get_queue_url(:queue_name=>"manageiq-awsconfig-queue-1111111-1111-1111-1111-11111111111") AWS::SQS::Errors::NonExistentQueue The specified queue does not exist for this wsdl version.
[----] D, [2015-08-05T11:31:23.378971 #2052:3f224a4] DEBUG -- : [AWS SNS 200 5.088042 0 retries] list_topics()  
[----] W, [2015-08-05T11:31:23.379261 #2052:3f224a4]  WARN -- : MIQ(AmazonEventMonitor#find_or_create_queue) Unable to find the AWS Config Topic. Cannot collect Amazon events for AWS Access Key ID AYYYYYYYYYYYYYYYYYYYY

evm.log around the above time:

[----] W, [2015-08-05T11:31:23.379306 #2052:3f224a4]  WARN -- : <AWS> MIQ(AmazonEventMonitor#find_or_create_queue) Contact Amazon to create the AWS Config service and topic for Amazon events.
[----] I, [2015-08-05T11:31:24.593391 #2089:77fe98]  INFO -- : MIQ(ScheduleWorker.do_work) Number of scheduled items to be processed: 2.
[----] I, [2015-08-05T11:31:24.617448 #2089:77fe98]  INFO -- : MIQ(MiqQueue.put)        Message id: [103260],  id: [], Zone: [default], Role: [ems_metrics_coordinator], Server: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [Metric::Capture.perf_capture_timer], Timeout: [600], Priority: [20], State: [ready], Deliver On: [], Data: [], Args: []
[----] I, [2015-08-05T11:31:24.624027 #2089:77fe98]  INFO -- : MIQ(MiqQueue.put)        Message id: [103261],  id: [], Zone: [default], Role: [smartstate], Server: [], Ident: [generic], Target id: [], Instance id: [], Task id: [job_dispatcher], Command: [JobProxyDispatcher.dispatch], Timeout: [600], Priority: [20], State: [ready], Deliver On: [], Data: [], Args: []
[----] I, [2015-08-05T11:31:28.170581 #1758:853e8c]  INFO -- : MIQ(MiqServer.populate_queue_messages) Fetched 2 miq_queue rows for queue_name=generic, wcount=4, priority=200
[----] I, [2015-08-05T11:31:28.721798 #2075:86fe98]  INFO -- : MIQ(MiqQueue.get_via_drb) Message id: [103260], MiqWorker id: [386], Zone: [default], Role: [ems_metrics_coordinator], Server: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [Metric::Capture.perf_capture_timer], Timeout: [600], Priority: [20], State: [dequeue], Deliver On: [], Data: [], Args: [], Dequeued in: [4.111119133] seconds
[----] I, [2015-08-05T11:31:28.722035 #2075:86fe98]  INFO -- : MIQ(MiqQueue.deliver)    Message id: [103260], Delivering...

@blomquisg, this server was set up as an event monitor. I will check whether this AWS SNS access stops if I remove the event monitor role. The poll time was set at 15 seconds; however, the logs show the check with AWS running at roughly 10 requests per minute.

Does this look like a bug, or at least something to enhance in MIQ?


#7

@kirank This is awesome info. Thanks!

@blomquisg This feels like a bug to me, or at the very least we don’t seem to be handling the case where the topic has not been created properly. What do you think?


#8

@Fryguy, yeah, this definitely looks like a bug!

I think we need to do caching similar to what we’re doing for the OpenStack event monitor: cache the fact that we tried to connect to AWS Config but couldn’t establish the connection, then sleep for maybe a day? That way, if the AWS Config service is configured later, it can be picked up at some point. But if it’s never configured, we only check once per day.

Alternatively, we can just include a setting in the Advanced Configuration that allows the Amazon Event Catcher to be selectively disabled (even if the Event server role is enabled).
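A minimal sketch of that caching idea, with a hypothetical `lookup` callable standing in for the real AWS Config topic lookup (class name, structure, and the one-day interval are all assumptions for illustration, not the actual ManageIQ implementation):

```ruby
# Sketch: remember a failed AWS Config topic lookup and skip retries for a
# back-off interval, so the worker doesn't hit SNS on every poll cycle.
class CachedTopicFinder
  RETRY_INTERVAL = 24 * 60 * 60  # retry once per day (hypothetical value)

  def initialize(lookup, clock: -> { Time.now })
    @lookup = lookup        # callable returning the topic, or nil if not found
    @clock  = clock         # injectable clock, handy for testing
    @last_failure = nil
  end

  def find_topic
    # Skip the remote call entirely while inside the back-off window.
    return nil if @last_failure && @clock.call - @last_failure < RETRY_INTERVAL

    topic = @lookup.call
    @last_failure = topic.nil? ? @clock.call : nil
    topic
  end
end
```

With something like this in place, the event catcher would hit SNS at most once per day while AWS Config is unconfigured, instead of on every poll.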


#9

Following up on other experiments:

Also, the GitHub issue for the above bug: https://github.com/ManageIQ/manageiq/issues/3740

One comment I might add: since it doesn’t seem to make sense to enable the event monitor role without AWS Config, would it be better for ManageIQ to create the AWS Config setup itself? I’m not sure of the complexities, but it’s a thought.

Thanks much for the discussion!
PS: For the default poll times, would it be cheaper (in terms of cost) to distinguish public clouds like Amazon and set their defaults in hours or days rather than seconds?


#10

After observing the logs, I still have one sticky pattern in aws.log, even after setting the event_catcher poll times higher:

[----] D, [2015-08-06T03:04:39.395591 #15129:d6de90] DEBUG -- : [AWS EC2 200 5.161249 0 retries] describe_regions()  
[----] D, [2015-08-06T03:04:40.058517 #15129:d6de90] DEBUG -- : [AWS EC2 200 0.056964 0 retries] describe_regions()  
[----] D, [2015-08-06T03:04:40.063959 #15129:38a1cb0] DEBUG -- : MIQ(AmazonEventMonitor#find_or_create_queue) Looking for Amazon SQS Queue manageiq-awsconfig-queue-111111-2222-222222222222222222222222 ...
[----] D, [2015-08-06T03:04:45.353220 #15129:38a1cb0] DEBUG -- : [AWS SQS 200 5.270353 0 retries] get_queue_url(:queue_name=>"manageiq-awsconfig-queue-111111-2222-222222222222222222222222")  
[----] D, [2015-08-06T03:04:45.358808 #15129:38a1cb0] DEBUG -- : MIQ(AmazonEventMonitor#find_or_create_queue) ... found Amazon SQS Queue
[----] D, [2015-08-06T03:05:00.408708 #15129:38a1cb0] DEBUG -- : [AWS SQS 200 15.048913 0 retries] receive_message(:queue_url=>"https://sqs.us-west-1.amazonaws.com/500000000000/manageiq-awsconfig-queue-111111-2222-222222222222222222222222",:wait_time_seconds=>15)  
[----] D, [2015-08-06T03:07:13.626851 #15246:50fe88] DEBUG -- : [AWS EC2 200 5.148773 0 retries] describe_regions()  
[----] D, [2015-08-06T03:07:14.272910 #15246:50fe88] DEBUG -- : [AWS EC2 200 0.055524 0 retries] describe_regions()  
[----] D, [2015-08-06T03:07:14.277291 #15246:32041b4] DEBUG -- : MIQ(AmazonEventMonitor#find_or_create_queue) Looking for Amazon SQS Queue manageiq-awsconfig-queue-111111-2222-222222222222222222222222 ...
[----] D, [2015-08-06T03:07:19.470103 #15246:32041b4] DEBUG -- : [AWS SQS 200 5.175946 0 retries] get_queue_url(:queue_name=>"manageiq-awsconfig-queue-111111-2222-222222222222222222222222")  
[----] D, [2015-08-06T03:07:19.476391 #15246:32041b4] DEBUG -- : MIQ(AmazonEventMonitor#find_or_create_queue) ... found Amazon SQS Queue
[----] D, [2015-08-06T03:07:34.539535 #15246:32041b4] DEBUG -- : [AWS SQS 200 15.061835 0 retries] receive_message(:queue_url=>"https://sqs.us-west-1.amazonaws.com/500000000000/manageiq-awsconfig-queue-111111-2222-222222222222222222222222",:wait_time_seconds=>15)  
[----] D, [2015-08-06T03:09:48.853513 #15360:911e8c] DEBUG -- : [AWS EC2 200 5.176123 0 retries] describe_regions()  
[----] D, [2015-08-06T03:09:49.541996 #15360:911e8c] DEBUG -- : [AWS EC2 200 0.054285 0 retries] describe_regions()  
[----] D, [2015-08-06T03:09:49.547678 #15360:3611874] DEBUG -- : MIQ(AmazonEventMonitor#find_or_create_queue) Looking for Amazon SQS Queue manageiq-awsconfig-queue-111111-2222-22222222222222222222 ...
[----] D, [2015-08-06T03:09:54.739550 #15360:3611874] DEBUG -- : [AWS SQS 200 5.175516 0 retries] get_queue_url(:queue_name=>"manageiq-awsconfig-queue-111111-2222-222222222222222222222")  
[----] D, [2015-08-06T03:09:54.750705 #15360:3611874] DEBUG -- : MIQ(AmazonEventMonitor#find_or_create_queue) ... found Amazon SQS Queue
[----] D, [2015-08-06T03:10:09.811707 #15360:3611874] DEBUG -- : [AWS SQS 200 15.059455 0 retries] receive_message(:queue_url=>"https://sqs.us-west-1.amazonaws.com/500000000000/manageiq-awsconfig-queue-111111-2222-222222222222222222222222",:wait_time_seconds=>15)  

Not sure where wait_time_seconds=>15 originates from.

Poll settings:

workers:
  worker_base:
    :defaults:
      :count: 1
      :gc_interval: 15.minutes
      :heartbeat_freq: 60.seconds
      :heartbeat_timeout: 2.minutes
      :memory_threshold: 200.megabytes
      :nice_delta: 10
      :parent_time_threshold: 3.minutes
      :poll: 3.seconds
      :poll_escalate_max: 30.seconds
      :poll_method: :normal
      :restart_interval: 0.hours
      :starting_timeout: 10.minutes
    :ems_refresh_core_worker:
      :poll: 1.seconds
      :memory_threshold: 400.megabytes
      :nice_delta: 1
      :thread_shutdown_timeout: 10.seconds
    :event_catcher:
      :defaults:
        :ems_event_page_size: 100
        :ems_event_thread_shutdown_timeout: 10.seconds
        :memory_threshold: 2.gigabytes
        :nice_delta: 1
        :poll: 45.minutes
      :event_catcher_redhat:
        :poll: 15.hours
      :event_catcher_vmware:
        :poll: 1.seconds
      :event_catcher_openstack:
        :poll: 15.minutes
        :topics:
          :nova: notifications.*
          :cinder: notifications.*
          :glance: notifications.*
          :quantum: notifications.*
        :duration: 10.seconds
        :capacity: 50
        :amqp_port: 5672
      :event_catcher_openstack_infra:
        :poll: 15.minutes
        :topics:
          :nova: notifications.*
          :cinder: notifications.*
          :glance: notifications.*
          :neutron: notifications.*
          :heat: notifications.*
          :ironic: notifications.*
        :duration: 10.seconds
        :capacity: 50
        :amqp_port: 5672
      :event_catcher_amazon:
        :poll: 15.minutes
      :event_catcher_kubernetes:
        :poll: 1.seconds

#11

@kirank, first off, thanks so much for all the feedback! This is really a ton of good information.

Yes, definitely! I’ll add notes that:

  • If you enable the event monitor role, ManageIQ will attempt to use the AWS SNS service to gain access to the AWS Config service, and this may result in charges on your account.
  • Disabling the event monitor role will disable any attempts by ManageIQ to contact the AWS Config service.

Perfect! Thank you! I’m currently working on some different approaches to fix this. I’ll keep progress posted on that issue.

Again, another great suggestion. I think it definitely makes sense to change public cloud default settings at the very least. Then, users can decide to increase the frequency of polling as necessary.

I’m looking into this now … I can’t tell from the message yet, but this might be coming from the Amazon API SDK. I’ll dig into it deeper and see how best to make it configurable.

Thanks so much for all your feedback!


GregB


#12

The wait_time_seconds is a default value in the AWS API SDK. I’ll open up a configuration value in the VMDB configuration file.
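If that lands, the knob would presumably sit alongside the other Amazon event catcher settings, something like the sketch below (the `:wait_time_seconds` key is hypothetical; only the default of 15 seconds comes from the logs above, and SQS long polling caps the value at 20 seconds):

```yaml
:event_catcher:
  :event_catcher_amazon:
    :poll: 15.minutes
    :wait_time_seconds: 15   # hypothetical setting; SQS allows 0-20 seconds
```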


#13

Thanks @blomquisg. One quick note regarding my request for a documentation update: I was referring to adding a note/warning that the SNS topic “AWSConfig_topic” and the AWS cloud provider configuration should be in the same region; otherwise, the SNS topic will not be found by ManageIQ. That mismatch in regions was what led me on a bit of a debug chase. It’s possible that the person who sets up the cloud provider config and the person who sets up SNS are not the same, so it’s an easy mistake to make.
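One way such a check could be sanity-tested, assuming the standard ARN layout (`arn:aws:sns:<region>:<account-id>:<topic-name>`): compare the region embedded in the topic ARN against the region the provider is configured for. This is an illustrative helper, not part of ManageIQ:

```ruby
# Extract the region field from an SNS topic ARN and compare it against the
# region the ManageIQ cloud provider is configured for.
def topic_region(topic_arn)
  parts = topic_arn.split(":")
  # ARN layout: arn:aws:sns:<region>:<account-id>:<topic-name>
  raise ArgumentError, "not an SNS ARN: #{topic_arn}" unless parts[2] == "sns"
  parts[3]
end

def region_mismatch?(topic_arn, provider_region)
  topic_region(topic_arn) != provider_region
end

# A topic created in us-west-1 will never be found by a provider
# configured against us-east-1:
region_mismatch?("arn:aws:sns:us-west-1:123456789012:AWSConfig_topic",
                 "us-east-1")   # => true
```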

Thanks to you and fryguy for all the help!


#14

Ah, super important! Thanks!


#15

I hope this is the right thread. Do we have any info on whether people have implemented more functionality than the out-of-the-box one?

Meaning start, stop, shutdown?