Moving metrics out of Postgres


#1

The metrics table contains time-series data of events and statistics (e.g., CPU load). We spend a lot of CPU resources rolling up the data and trying to maintain decent indexes.

I’m looking into pulling that table out of Postgres and into something like Graphite, which is more geared towards this type of data and its rollup requirements. There are good comparisons out there of InfluxDB, OpenTSDB, and LevelDB, and plenty of great graphing options such as Grafana and Cubism.

Question: What will break if metrics is not physically in Postgres?

Reporting

One problem area I see is reporting. Customers have probably already written custom SQL queries that include metric data, though I don’t know how widespread this is. If the tables are not in the database, these reports will break.

Load testing

We currently use capacity and utilization (C&U) to load our servers and observe performance under load. If we move this workload out of the database, or eliminate it entirely, then we’ll need to find different load-testing options. (This is a good problem to have, but it needs to be on our radar.)

Replication

If a customer has many nodes replicating to a central reporting master, how will this be handled in this scenario? What else does this central node need to do other than reporting? Do we need rollups to run a full VMDB?

Any other areas of concern?


#2

I do agree that the metrics collection is a large performance hit to the Postgres DB and the appliance overall.

Reporting is bigger than just reports, in that the UI uses it heavily when displaying data. The metrics shown in all of the Monitoring / Utilization charts as well as the Optimize tab are all reporting-based. In addition, even just displaying a VM summary screen dips down into some of the metrics for trending data and normal operating ranges (using virtual columns that bring in the metrics, I think).

Any movement of the actual storage of the metric data should probably provide reporting-compatible models so the existing processes can still fetch the data with minimal code changes and, hopefully, a minimal performance hit.


#3

Great subject @kbrock, and I really like the idea of optimizing the way the metrics data can be maintained, processed, and accessed over time.

To answer some of your questions:

Reporting

I am not aware of customers directly accessing the data through SQL queries today. I think the advice has always been to NOT go directly to SQL for anything, as any custom capabilities built that way MAY break with the next release.
Instead, users have been driven towards using scripts (console) when they needed something that would not be available through the API or through Automate scripts.

The “reporting engine” is the main consumer of this data, AFAIK. It is my understanding that any rendered screens leveraging that data (charts, …) do indeed go through the reporting engine, so hopefully this is the one area to focus on when looking into transitioning to a new datastore.

Load testing

Agree. When looking into a new solution, it would be necessary to understand the impact this may have on the “appliance” model besides CPU and memory. Example: growth of the repo over time, ability to sustain non-persistent data (aka potential fragmentation)…

Replication

Usually users do not replicate the real-time data up to the master region; they mostly focus on the rollup data.
The replication is currently done by triggers (leveraging rubyrep), so this would have to be factored into the new solution.


#4

While I’m not a stickler for formality, I think this endeavor may be a candidate for the production of an investigation report (IR), which is a document that describes and analyzes proposed changes or new features. Typically, it provides background information: what’s changing and why, new components and how they differ from the old (Graphite, for example), and any other info that would be useful when analyzing the proposed change. It then identifies specific areas of change (the tables in question) and describes, at a high level, how each area would change, identifying possible issues and raising related questions along the way.

I think an IR would provide the requisite background and framework to better enable the discussion and analysis of this proposal.


#5

Thanks.

So if anyone has any particular items they want me to address in my Investigative Report, please let me know.


#6

@dclarizio The existing models will be used to populate and read from the metrics tables that exist in another “database”.

I need to ensure that the testing is sufficient around this area.

If there are a bunch of SQL joins from non-metrics models into the metrics tables, or just plain raw SQL, that causes a problem. (I’m under the impression that “reporting” uses raw SQL.)


#7

I’d like it to contain:

  • A distilled description of Graphite and how it differs from Postgres.
  • What areas will be impacted (tables, features, etc.).
  • Your initial take on how those changes could be made (addressing the obvious, high-level issues).


#8

Regarding replication, we only replicate rollup data, but if Graphite is doing that “on the fly” it might be tricky.

Some other thoughts:

  • Purging old data. By default, realtime data is purged every 4 hours, rollup data is purged every 6 months. Would you have to keep 6 months of realtime data to have the rollup data available? Would that be costly storage-wise?
  • Remote access. We have a single place to go for data and multiple appliances may need to access it. Is that available in this scenario?
  • HA. We don’t currently have database HA by default, but someone could set that up in an external PG setup. In addition, we are planning to bake that into the project at some point. If Graphite is buried in the appliance we might have to deal with that as well.
  • Infrastructure rollups. Example: we roll up VM hourly data to Host hourly data based on which VMs are in a Host.

#9

I want to ask whether the reference to ‘metrics’ is specifically the ‘metrics’ table within the VMDB that is partitioned into 24 hourly tables (00 to 23), or whether this is a more general reference to both the ‘metrics’ and ‘metric_rollups’ tables?
When C&U is active for VMware environments, the ‘metrics’ table can become extremely busy, but for other providers it does not seem as busy.


#10

My answers are based upon my Graphite experience; the various options will probably be similar.

Purging (and rolling up)

Graphite handles purging and rollups with configuration. E.g.: keep data rolled up by minute for 2 months, by hour for 1 year, and by day for a few years. (I think Graphite is limited to 6 samples per minute; InfluxDB looks better there.) You do not need to keep ultra-granular data around just to compute rollups: you insert the real-time data, and it takes care of purging and rolling up. We want pretty standard stuff.
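As a rough sketch of what such a retention ladder looks like (the file name and exact periods below are illustrative, not a recommendation), this is roughly how it could be created with whisper, Graphite’s storage library; in a stock deployment the same policy normally lives in storage-schemas.conf:

```python
# A minimal sketch, assuming Graphite's whisper storage library is installed
# (the "whisper" package on PyPI). The retention ladder is illustrative only:
# per-minute data for ~2 months, per-hour for 1 year, per-day for 5 years.
import whisper

archives = [
    (60,    60 * 24 * 60),   # 60s resolution, kept for 60 days
    (3600,  24 * 365),       # 1h resolution, kept for 1 year
    (86400, 365 * 5),        # 1d resolution, kept for 5 years
]

# Creates a fixed-size database file; whisper then rolls samples up and
# expires them automatically as they age out of each archive.
whisper.create("host1_cpu.wsp", archives,
               xFilesFactor=0.5, aggregationMethod="average")
```

The point is that once the archives are declared, the datastore owns the rollup and purge work instead of our workers.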

Most time-series databases are key-value, and while they have the concept of hierarchical keys, from what I can tell they do not have the concept of rolling up across keys. I’m thinking we would just insert a record at each level of the hierarchy, e.g. insert a record with a host key “/outer_region/inner_region/host1/cpu” and a record with a region key “/outer_region/inner_region/cpu”.

I’m assuming infrastructure rollup would be handled the same way.
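For illustration only (Graphite paths are dot-separated rather than slash-separated, and all of the names below are made up), inserting the same sample at each level of the hierarchy over carbon’s plaintext protocol could look roughly like this:

```python
# Illustration only: carbon's plaintext protocol is "<path> <value> <timestamp>\n"
# over TCP (port 2003 by default). The hierarchy and host names are hypothetical.
import socket
import time

def send_samples(samples, host="localhost", port=2003):
    payload = "".join(f"{path} {value} {ts}\n" for path, value, ts in samples)
    with socket.create_connection((host, port)) as sock:
        sock.sendall(payload.encode("utf-8"))

now = int(time.time())
cpu = 42.0
send_samples([
    ("outer_region.inner_region.host1.cpu", cpu, now),  # host-level key
    ("outer_region.inner_region.cpu",       cpu, now),  # region-level key
])
```

Note that the timestamp is explicit in each line, which also matters later in this thread for backfilling and for providers that report with latency.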

The goal is to get out of the business of spending too much effort on these rollups. But it does mean the customer cannot as easily create a dynamic rollup like you can in SQL. (If you want to have a discussion about changes in reporting rollups, let’s put it into a separate thread. I’ve seen tons of heated debates on customer requirements here.)

Remote access

statsd/collectd are very popular front ends for Graphite. They give TCP and UDP access to insert a lot of data very easily and quickly. Newer contenders like InfluxDB provide this interface out of the box. It is crazy easy to add counters and metrics.
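As a sketch of how lightweight that is (the metric names below are hypothetical; statsd listens on UDP port 8125 by default):

```python
# Minimal sketch: statsd's UDP line format is "<name>:<value>|<type>".
import socket

def statsd_send(line, host="localhost", port=8125):
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))

statsd_send("miq.host1.cpu_usage:42|g")            # gauge
statsd_send("miq.metrics.samples_collected:1|c")   # counter
```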

I’m assuming this will live in the master appliance VM; I need to look at the footprint. Graphite is tricky to install on a Mac, but is quite easy on Ubuntu and RHEL.

HA

Most of these solutions are big-data centric, so they can often dynamically add and remove nodes from the cluster on the fly. Graphite relies upon Redis (not fun to replicate), but others lean on Hadoop/Cassandra. InfluxDB has no external dependencies.


#11

@tch Yes. Need to clarify this.

I was talking about metrics, metric_rollups, and the 36 related numbered tables. Capacity and Utilization populates these tables to display usage statistics in reports and to drive business processes.

@rpo / @tch

This is a performance, extensibility, and reusability play.

The hypothesis is:
If we used a database that is geared towards time series, it would roll up (across time) and purge for us. This means less CPU/network load on our servers and better utilization of disk space (read: quicker indexes, less IO, no index fragmentation).

Also, if we used a standardized system to populate these metrics, the customers (and the performance team) could extend it to collect all of the needed metrics, rather than installing an additional monitoring system like Monit, Nagios, or others.

And a standardized time-series system has many UI options already written, allowing us to integrate these solutions or have customers integrate them into their existing dashboards. The sheer number of tools Graphite touts is impressive.

Not easy peasy.

A full-blown solution is collection (Sensu), storage (Graphite), and storage of non-polling data (statsd – read Etsy’s posts). This introduces RabbitMQ, Redis, and other stuff under the covers. The built-in UI is also horrible, hence the tons of UI components. That can be read as a good thing or as a sign of an underlying problem.

Also, one company says that with Sensu and 4 servers they handle 250 clients and collect 25k metrics/hour (perf-ref). Those numbers seem low.

So a transition path seems to be: store the data in Graphite or a similar server, and transition some of our charting components to one of the plethora of UI options. If that works, we can look into using a standard collection mechanism.


#12

One concern I have is based on the fact that metrics collection can vary greatly depending on the provider. For example, VMware’s “real time” collection is really data that has already been rolled up over a short period of time. In contrast, for a NetApp filer, a snapshot of raw metric registers and timers is returned. In the latter case, the collector is responsible for calculating deltas between adjacent sets of data. If the collector cannot explicitly set the collection time when the data is saved in the database, it cannot compensate for the latency inherent in this low-level collection. It may also make it impossible to fill in missing data after the fact, as we do now under some conditions.
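To make the delta case concrete (the field names and numbers here are purely illustrative), the collector-side math is roughly:

```python
# Illustrative only: the kind of delta math a collector must do when a
# provider hands back raw, monotonically increasing counters instead of
# pre-rolled-up rates.
def rate_from_counters(prev, curr):
    """prev/curr are (capture_time, counter_value) snapshots."""
    (t0, c0), (t1, c1) = prev, curr
    return (c1 - c0) / (t1 - t0)   # units per second over the interval

# The resulting rate belongs to the interval ending at the second snapshot's
# capture time, so the datastore must let us record that time explicitly.
rate = rate_from_counters((1_000_000, 500), (1_000_030, 650))  # -> 5.0/sec
```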

I think part of the investigation needs to address the various methods of metrics collection, and how they would fit into the new scheme.


#13

@rpo thanks for reminding me about this. This is exactly why I started this thread.

InfluxDB and Graphite both support setting the exact timestamp.

Of course, it is necessary that we have clocks synced across servers. I wonder if we want to enable NTP. (Though I guess our current system works with that not enabled.)
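For illustration (this assumes an InfluxDB 1.x-style HTTP write endpoint; the database, measurement, and tag names are made up), an explicit-timestamp write could look like:

```python
# Illustration only: the trailing integer in the line-protocol body is the
# sample's own capture time (seconds, because of precision=s), not the time
# of insertion, which lets the collector backfill or compensate for latency.
import time
import requests

capture_time = int(time.time()) - 300  # e.g. a sample captured 5 minutes ago
line = f"cpu_usage,host=host1 value=42.0 {capture_time}"

requests.post(
    "http://localhost:8086/write",
    params={"db": "miq_metrics", "precision": "s"},
    data=line,
)
```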


#14

Some of my thoughts revolve around implementation and the load generated from that implementation:

  • Will every worker have to connect to two DBs (Postgres and the time-series DB) now?

  • We will have to determine if the overhead of every worker/appliance connecting is less than the additional load we see on Postgres maintaining metric data now.

  • Will this eliminate the need for perf_rollup messages? I have seen some performance hiccups when a storm of these messages hits the queue all at once. The time-series DB should essentially take on that load.

  • Benchmarking how the time-series DB performs vs. a C&U processor worker will be critical.

  • Can the time-series DB handle the different kinds of metrics? @rpo already mentioned this in post 12.

  • Will we have to normalize the data to a common level before moving it into the time-series DB, and what is the overhead of that normalization process?

  • Complexity of deploying ManageIQ: will the time-series and Postgres DBs live on the same appliance, or can they be split out onto an appliance provided specifically for that purpose? Dividing the DBs will allow for better performance tuning but will increase the complexity of deployment.

+1 for the graphs I’ve seen in Grafana.


#15

Hi. I just joined the ManageIQ community and am pretty new to the product, but I’m already loving it. Thanks for all the efforts to date! I’m curious why there is a need for a structured datastore such as PostgreSQL in the design. Why not use something like MongoDB or Elasticsearch or another NoSQL solution?

I’m an end consumer who needs to get the stats out of ManageIQ and into something like Splunk for a more holistic view across the infrastructure estate. Having the information already in a form ready for ingestion elsewhere would be really handy.


#16

@akrzos as always, great input from you.

We will need two database connections. The UI workers will probably require a connection to both, but I’m hoping that the workers populating the metrics would only need to connect to the metrics store.

The Sensu model has each of the workers speak over RabbitMQ and only the central spot connects to the database. We could do something similar using DRb.

@phil_griffiths

Since the data is very structured and we had a relational database, it was easy to put it in there. This thread is focused on what we need to keep in mind as we evaluate our NoSQL options.

Some metrics/event solutions (e.g. Graylog) use MongoDB because it supports indexes and arbitrary event structure.

But solutions centered around metric collection tend to use Cassandra, Hadoop, and similar column-oriented or key-value NoSQL databases. These are better at supporting writes and appends than the pruning involved in rollups, and they tend to work better with more than 5 machines dedicated to the collection and storage of metrics.

So I’m focused on a lighter weight solution that will do quick data aggregation and rollups for us.


#17

Thanks @kbrock, I understand where you’re coming from and will watch with interest the NoSQL platform of choice for future generations of ManageIQ!

I don’t suppose there is any current documentation around the schema/table structures of the PostgreSQL solution as it stands? I’d like to get some data out so I can build some visualisation/reports around what we already have, to build the business case for further investment.

Thanks
Phil.


#18

It is tricky business to use a product’s database schema to run reporting.
Well, the first draft is always easy; SQL is pretty universal. But when the product changes, that is when the reports start to become brittle.

Does the reporting built into the app not meet your needs? I’m under the impression it is quite flexible, though I have not written any.


#19

The built-in reporting may produce what I need as long as I can automate production and output in CSV, etc. But when the underlying format in the database changes beneath me, I have the same problem as getting the data straight out of the database anyway. But this is a problem for any application which experiences rapid development, after all.

I need to combine the reporting from ManageIQ with other data, such as that from SAN arrays, to give a holistic perspective across the infrastructure estate. Not everything is in or managed by ManageIQ, but I need to take it all into account to make informed decisions. This is why I like tools like Splunk or Elasticsearch/Kibana, which allow me to abstract the data from the individual tools and combine it into the views I need for my business.

