At Zebrium, we provide an Autonomous Monitoring service that automatically detects anomalies within logs and metrics. We started by correlating anomalies across log streams to automatically raise incidents that require our user's attention. Now, we have taken the step to augment our incident detection by detecting anomalies within a group of related metrics and correlate those with anomalies found in logs.
Since Prometheus is very popular in Kubernetes environments, we wanted to support discovering and scraping Prometheus targets and send those scraped metrics to our software running in the cloud for anomaly detection. Latency is important to us as we need to receive the metrics in near real time as they get scraped. We also need to preserve labels, types, and full fidelity of time stamps for anomaly detection and log correlation purposes. And, we need to do all this while being as efficient as possible in sending the metrics over the wire as data will be going from a user's Kubernetes cluster to our software which is running in the cloud.
In this blog, I will talk about our journey with using standard Prometheus server for our requirements and explain why we have built and open sourced a forked instance of the Prometheus server/scraper.
In summary, we are able to achieve near real time updates, preserve valuable information such as labels and types, handle out of order samples, and achieve over 500x bandwidth reduction (which translates to under 0.7 bytes per sample vs 391 bytes per sample in the raw form).
To achieve the goal of making our autonomous incident detection more accurate by detecting anomalies within a group of related metrics, and correlating those with anomalies found in logs, we came up with the following incremental requirements for Prometheus:
RQMT-1 : Ability to send and store time series data efficiently and in near real-time by consuming minimal bandwidth from the client-side to our SaaS back-end time series database.
RQMT-2 : Ability to type each metric (e.g. counters, gauge, histogram. etc.) to properly inform our anomaly detection.
RQMT-3: Ability to tolerate out-of-order times stamps of each consecutive sample so we don't omit inserting time series samples into our TSDB that may be critical to anomaly detection.
RQMT-4: Ability to tolerate differences in times stamps for any given sample so we don't omit inserting time series samples into our TSDB that may be critical to anomaly detection.
RQMT-5: Ability to automatically group similar metrics to help improve anomaly detection. For example, if there are different time series for cpu-user-time and cpu-interrupt-time, we want to automatically group by CPU without explicit direction from the end-user.
RQMT-6: Ability to correlate metrics with logs that are collected outside of our metrics collector (this is the Kubernetes log collector we use). To achieve this we need to include identical labels that allow us to match metrics and log events that come from the same container/source.
Prometheus provides a very elegant deployment model for Kubernetes environments with out-of-the-box metrics collection requiring virtually no effort. It's popularity, coupled with a wide range of available exporters made it the obvious choice.
A Little Bit About Prometheus...
Prometheus is perhaps the most popular an open-source systems monitoring and alerting toolkit adopted by many organizations and with a very active developer and user community.
The Prometheus architecture has two main components of interest for our solution:
Metrics targets: These are the end points that export metrics over HTTP. These are stateless end points export all the metrics as a blob in plain text format that is defined by Prometheus. There is a large number of available exporters covering Databases, Hardware, Storage, APIs, Messaging Systems, HTTP, miscellaneous software products. Prometheus supports four core types of metrics:
Monotonically increasing counters
Gauges with a value that can arbitrarily go down or up
Histograms exposes as buckets of observations
Summary which exposes streaming quantiles
Prometheus server: This discovers all the Prometheus targets and periodically scrapes metrics from them. In each scrape, it collects all the metrics from each target for a particular time stamp and saves them to its local time series database (TSDB). One can query this time series database and graph the metrics through the built-in graphing interface or through Grafana.
Compelling Attributes of Prometheus - Why we chose it!
Prometheus server provides a remote write feature to transparently send samples to a remote end-point. One could write a remote-storage-adapter for Prometheus to continually ship all the metrics from its local time series database (TSDB) storage to remote storage or another time series database. This is shown in the picture below:
This mechanism has been used by various cloud monitoring services to ship metrics from a user's Prometheus server to the cloud. In the Prometheus world, each Prometheus exporter exports a set of metrics. Each metric's value on a time scale is one time series, i.e a time series is a series of values for a given metric on the time axis. Each metric has a metric name, and the associated metadata. This metadata contains a set of labels and values. To the remote storage, metrics are written one time-series at a time.
We investigated using this remote write feature and discovered the following:
We came up with a few options to address the incremental requirements we needed in the standard Prometheus Remote Storage Adapter
This is our customized Prometheus server:
Each scraped blob contains a set of metrics. Each metric contains:
The basic idea is, if we have to send the blob as is, even after compression, it still represents a lot of data going over the wire. But if we do the diff of the blob with the blob that we scraped last time from this target, that diff will be very small. We call this diff an incremental blob . This incremental blob is computed as follows:
One can see that the incremental blob contains the data of the samples that changed from the last time we scraped and on top of that we do all the above mentioned optimizations to reduce its size.
Sending one request to the remote server for each blob is quite expensive in terms of the number of HTTP requests. So, we coalesce blobs, either incremental or full blobs, across all the targets and send them in one request. If the scraping interval is too small, we also coalesce blobs from the same target in one request. Each blob, either incremental or full, is also compressed before sending on the wire.
On the remote server side, given the incremental blob, we should be able to reconstruct the original full blob.
So far so good, one might think. Reconstructing the full blob from an incremental blob on the remote server side means you have to maintain the state. So how do we synchronize this state with the client side's Zebrium metrics collector? Well, we keep the state between remote server and metrics collector as independent as possible, something like a NFS file handle approach. The state is divided between metrics collector and remote server as follows:
On the remote server side, we reconstruct the metrics data as if it was scraped from the local target. We then perform a learning function to group the related metrics for ingest into our analytics database. Remember this example: if there are different time series for cpu-user-time and cpu-interrupt-time, we need to automatically group by CPU without explicit direction from the end-user ( addresses RQMT-5 )
Zebrium transfers metrics by using only 0.7 bytes per sample in most cases, including all metadata, and achieves near real-time results in lab testing ( addresses RQMT-1 ).
We have added three modules to the Prometheus server: zpacker , zcacher and zqmgr:
Here is what we have seen in terms of number of bytes sent over the wire from our own Kubernetes test cluster (as mentioned above, we achieved over 500x bandwidth reduction - ~0.7 bytes per sample vs 391 bytes per sample in the raw form):
2020-03-09T06:54:23.454-07:00 15810 INFO: worker.go:192 worker.dumpStats: Account summary stats for portal11, reqs=7660 fbytes=78304197246 bytes=405821031 cbytes=138765890 compr=564 misses=12 errs=0 full_blobs=115 fsamples=127504 labels=0 incr_blobs=47500 isamples=200095802
So to transfer a total of 200,223,306 samples, we only sent 138,765,890 bytes (about 132MB). On average, this works out to 0.69 bytes per sample
Prometheus server proved to be an outstanding choice to use as the platform for our metrics collection. It provides a very elegant deployment model for Kubernetes environments and out-of-the-box metrics collection requiring virtually no effort. This coupled with a wide range of available exporters, allows our users to take full advantage of the Zebrium Autonomous Monitoring solution .
As outlined about, the stringent requirements (RQMTs 1-6) imposed by our anomaly detection meant it was necessary to address them through custom code. The Zebrium Metrics Collector preserves all meta data and time stamps regardless of ordering, with no loss of data and while achieving bandwidth efficiency of 500:1 compression over the wire. Coupling this with with additional labels inserted at the source, allows us to cross correlate logs and metrics coming from the same container or same target. Our back-end machine learning then groups related metrics for ingest into our analytics database.
From this foundation, we are able to use machine learning to perform autonomous anomaly detection and incident creation across logs and metrics . But more on this part in a future blog... Stay tuned! Or try it by signing up for a free account .
The code we have built for this has been open sourced under the Apache license and can be found in the following Github repositories: