Timestamps and Log Aggregation


#1

I’m currently looking at an EFK stack to aggregate log entries from multiple appliances in one spot. I’m curious how other people are solving this as I’m running into an issue with the timestamps of log entries.

[----] I, [2016-08-09T17:37:02.794408 #16835:1045988] 

The log timestamp is written in the appliance’s local timezone, with no offset indicated. Elasticsearch really seems to want UTC timestamps; otherwise I end up with local time (-04:00) being interpreted as UTC, and Kibana then converts for display as well, resulting in a log timestamp of 13:37:02.794408 for the example above.

I see 3 solutions to this:

  1. Change the appliance timezone to UTC. This has the unfortunate side effect of things like the dashboard showing last run/next run in the future (with no TZ indication)

  2. Have the appliance start appending the TZ information to log entries (an example of what that might look like follows this list)

  3. Use fluentd to somehow mangle the time to include the TZ information and/or convert to UTC. I’m not sure this is possible as the time field is treated somewhat specially.
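
For option 2, the example entry above would become something like:

[----] I, [2016-08-09T17:37:02.794408-04:00 #16835:1045988]

i.e. the same line with the local UTC offset appended to the timestamp.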

Any thoughts or suggestions from other folks who have solved this problem already?

Thanks,
Matt


#2

It would be great to have all systems run in UTC, and then have UIs adapt to display time data in the TZ of the browser.

So #1 would be ideal.

Given that this is not the case today, the least invasive option is to have fluentd include the TZ information, which can be done; see https://github.com/uken/fluent-plugin-elasticsearch#time_key_format
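
Roughly, the output side would then look something like this (just a sketch: the tag pattern is a placeholder, and it assumes the record carries its timestamp in a field called time whose value includes a numeric offset such as -0400):

<match your.logs.**>
  @type elasticsearch
  logstash_format true
  # which record field holds the timestamp and how to parse it, offset included,
  # so it reaches Elasticsearch unambiguously
  time_key time
  time_key_format %Y-%m-%dT%H:%M:%S.%N%z
</match>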

-peter


#3

Thanks for the reply, Peter. Maybe I’m approaching this the wrong way, but I had the appliance(s) sending their logs to fluentd on a central server, and that server was using the elasticsearch plugin to push the data. So if my appliances are in a different timezone than the central server, that wouldn’t work.

I’m new to the log aggregation scene, would the preferred method be to have each appliance ship logs directly to elasticsearch?


#4

First, you could use the timestamp of the central fluentd service for the indexing time, ignoring the timestamp in the original payload. You don’t lose any information, as you can still make that original timestamp available as a field in the record.
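
For example, a quick sketch of the idea (the path, tag, and field name here are made up): if the stage that parses the log line captures its timestamp under a name other than time, fluentd stamps the event with its own clock, and the original value simply rides along in the record:

<source>
  @type tail
  # capturing the log’s timestamp as "logtime" rather than "time" means the event
  # time defaults to when fluentd read the line; the original string stays in the record
  format /^(?<logtime>[^ ]+) (?<message>.*)$/
  path /var/log/example/app.log
  pos_file /var/log/td-agent/app.log.pos
  tag example.app
</source>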

You could also have a table mapping known systems to their timezones and have fluentd modify their timestamps to add a TZ value.
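
As a sketch of that idea (the tags and the offset table are invented, and it assumes each source is tagged with its appliance name, e.g. logs.appliance-nyc):

<filter logs.**>
  @type record_transformer
  enable_ruby
  <record>
    # look up a fixed UTC offset for the sending system by its tag suffix;
    # Hash[...] is used instead of a {...} literal because a closing brace
    # can end the ${...} placeholder early
    tz ${ Hash['appliance-nyc', '-0400', 'appliance-ams', '+0200'].fetch(tag_parts[1], '+0000') }
  </record>
</filter>

From there you could either keep tz as a separate field or concatenate it onto the timestamp string in the same filter.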

Time is so critical to effective logs that it might be worth running a local fluentd instance just for that purpose: the appliances send to their local fluentd instances, which then forward to the central one.

Second, I don’t think you are approaching it the wrong way. There are a few different schools of thought on what to do here.

A lot of folks use very lightweight collectors, like fluent-bit, filebeat (formerly logstash-forwarder), or rsyslog, to ship logs, with some additional metadata enrichment, to a central aggregator (or normalizer), which forms the records for proper storage in the warehouse (Elasticsearch). That central system might also pull metadata out of the logs to enrich them further, or add even more contextual metadata as needed.

Some folks feel that is too much to do on a central server, and so they do that same thing at the edge, where data is being collected.

I prefer the central approach so that I can massage the data without having to redeploy my collectors.

-peter


#5

I think we were on the same page. Here’s what I ended up with this morning.

On the MIQ appliance, td-agent is installed and configured as:

<source>
  @type tail
  # treat each multi-line automation.log entry (starting with "[----]") as one record
  format multiline
  format_firstline '/^\[----\].*/'
  format1 '/^\[----\]\ [A-Z],\ \[(?<time>[^\ ]*)\ #(?<pid>[0-9]*):(?<thread>[\h]*)\]\s+(?<level>[A-Z]*)\ --\ :(?<message>.*)/'
  path /var/www/miq/vmdb/log/automation.log
  pos_file /var/log/td-agent/automation.log.pos
  tag cfme.automation
  # use the captured timestamp as the event time, but also keep the original
  # string in the record so it can be reshaped downstream
  keep_time_key true
  time_key time
  time_format %Y-%m-%dT%H:%M:%S.%N
</source>

<filter cfme.automation>
  @type record_transformer
  enable_ruby
  <record>
    # record which appliance this entry came from
    appliance "#{Socket.gethostname}"
  </record>
  <record>
    # rewrite the time field to include the local UTC offset
    time ${time.strftime('%Y-%m-%dT%H:%M:%S.%N%z')}
  </record>
</filter>

## match tag=cfme.** and forward to another td-agent server  
<match cfme.**>
  @type forward
  <server>
    host central-fluentd.example.com
  </server>
</match>

And on the central EFK server I’ve got:

<match cfme.automation>
  @type copy
  <store>
    @type elasticsearch
    logstash_format true
    # parse the TZ-annotated time field produced by the appliance-side filter
    time_key time
    time_key_format "%Y-%m-%dT%H:%M:%S.%N%z"
    include_tag_key true
    flush_interval 15s
  </store>
</match>

Looks like I’m losing the nanosecond resolution along the way. I’m guessing the fluentd representation of ‘time’ during the filter doesn’t include it, as I’ve seen a couple of references to fluentd not handling sub-second precision very well. I can live with that (or, worst case, change the filter to strptime then strftime).
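
If it comes to that, the strptime/strftime change would just mean re-parsing the string that keep_time_key preserved instead of formatting fluentd’s second-resolution event time. An untested sketch (it assumes this version of record_transformer exposes the record via ${record[...]}, and that Ruby’s time library is loaded in the td-agent process):

<filter cfme.automation>
  @type record_transformer
  enable_ruby
  <record>
    # re-parse the original microsecond timestamp kept by keep_time_key and
    # append the local UTC offset, rather than using the second-resolution event time
    time ${Time.strptime(record['time'], '%Y-%m-%dT%H:%M:%S.%N').strftime('%Y-%m-%dT%H:%M:%S.%N%z')}
  </record>
</filter>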

If you have any comments, I’d be glad to hear them, but I think your suggestions got me across the finish line. Thanks!

Matt


#6

What version of td-agent/fluentd?

See https://github.com/fluent/fluentd/issues/461 and http://stackoverflow.com/questions/27928479/fluentd-loses-milliseconds-and-now-log-messages-are-stored-out-of-order-in-elast


#7

Yeah, I think that was one of the places I saw it.

I’m running td-agent-2.3.2-0.el7.x86_64 on both systems


#8

I don’t know whether td-agent 2.3.2 has those fixes or not, but if it does, that should address it.


#9

The nanoseconds were included before I started mangling the time to add the TZ info, so it is probably in there. I’ll mess around with it to see what I can figure out.

Thanks!