As Wayfair has grown and matured its software development and data center operations over the past decade, and particularly over the last five years, we have embraced the principle of providing maximum visibility into our processes and systems. The result is a very large logging and time series infrastructure built to receive a constant stream of application logs and metrics. Our logging infrastructure, built on the Elasticsearch stack, has benefited from continuous improvements from the Elastic company, as well as other resources such as consulting, training, and books. However, Graphite, our time series solution, has been more or less a self-supported system.

As a company that has experienced hyper growth over the last five years, Wayfair often needs to speedily adopt solutions and integrate them into mission critical operations roles. For us, Graphite was relatively easy for a bootstrapped engineering team to adopt and deploy quickly. Its schema-less model meant developers rapidly embraced it as they could send data in a format meaningful to them, without checking with a central gatekeeper. This schema-less environment also allowed the database to become quite unruly, however, since there was very little means to analyze and track the cardinality of various metrics.

While Graphite is a powerful open source package, its active development cycle has been on the wane for a number of years. During this time, Graphite’s user base has drifted away to other solutions such as Prometheus, DataDog, and InfluxDB. We did our best to have Graphite work for us: Our approach was to throw a considerable amount of compute and storage resources at it in order to force scalability, and a considerable amount of technical debt in maintaining arcane configurations and heavy handed cleanup scripts. The amount of engineering resources tied up in keeping the system stable and performant was relatively high, which meant those same engineers have not been able to work on higher value projects. This is not necessarily a unique story.

We knew it was time for a bolder move. For us that meant being on the lookout for newer platforms built on high availability principles – one that was tailor-made for the type of distributed pipeline we had successfully used with the Elastic stack.

Our Experience with Graphite

As with many open source projects, our teams were able to move quite rapidly with Graphite. It can be very fast for write and read, but works within very fixed scalability constraints. Our implementation is quite coupled with the StatsD aggregation process. Since this uses an upstream hash ring to route metrics, it has always been a challenge to refactor the storage footprint of a Graphite/Carbon cluster. There aren’t “relocate shard”, “backup shard”, or “restore shard” operations in the Graphite/Carbon world (mainly because there is no proper sharding concept; each metric is essentially a shard).

Having no replication/redundancy with this stack hurts us in several ways. Foremost is that it is impossible to do the kind of on-the fly maintenance and hardware upgrades that require us to bring a node down for any length of time (e.g., OS patching, relocating VMs to new hardware).  This also means that there is always high contention for a given metric as it only resides on a single host.

Up until a year ago, we were “getting by” with fewer than 10 Carbon storage hosts across three data centers and using a multi-layered Graphite proxy to merge data from remote data centers into a single web service. Getting by wasn’t adequate; we had many slow queries with frequent timeouts (the opposite of high availability). We were also continually hitting the storage wall until we had each Carbon host attached to nearly 7TB of direct attached storage. Not surprisingly, our storage team eventually indicated that this model wasn’t scalable.

The “legacy” Graphite cluster in service until early 2017. This cluster relied heavily on cross-datacenter HTTP calls using our Boston Graphite layer as a direct proxy for remote data center Carbon storage layers. Provisioning compute and storage was more of a challenge in the remote data centers.

In an effort to prepare for the massive traffic we received during 2017’s Cyber 5 weekend, we built out the cluster to 64 data nodes with only 0.5TB attached to each node. Further, we split this group of nodes into two clusters of “critical” and “general” data. The critical data mostly fed our 24/7 Network Operations Center (NOC) team as well as our most customer-critical web service alerts.

That’s a fair amount of compute power, but it addressed our main areas of system failure in the form of large contention for data at the TCP level of a given Carbon host. As there is no central index for each measurement, the Carbon architecture requires a fair number of redundant searches, which leads to network and file system contention. In order to scale so wide (as Wayfair needed to do), we had to consolidate storage in our central Boston, MA data center; meaning our data centers in Seattle, WA and Ireland would not have their own Graphite/Carbon storage instances. Instead, we sent our StatsD metrics cross-datacenter via UDP into the main data center. This element itself had a lot of overhead to monitor and wasn’t robust against a major network outage or latency.

Specifications of our current Graphite Implementation (Refactored for Holiday 2017) as follows:

  • Data Points / Second: Up to 1 Million
  • Total Storage Allocated: 35 TB
  • Carbon Storage Hosts: 64
  • Graphite Web server Hosts: 40
  • StatsD Pipeline Hosts: 50
  • Retention: 2 Months (at 10 sec. resolution)

The “Super Graphite” cluster built for Holiday 2017.

Through our efforts at scaling Graphite, we came to some very concrete conclusions (many already self-evident):

  1. Graphite/Carbon does not support true clustering (replication and high availability).
  2. Graphite/Carbon has a unique storage model, and although often fast due to it’s fixed temporal resolution, requires pre-allocation for each series, leading to storage infrastructure that is difficult to manage. It can even lead to rapid and catastrophic depletion of disk resources.
  3. Graphite/Carbon does not provide an out-of-the-box data pipeline solution. We’ve relied on our highly optimized in-house StatsD package. While this robust service is very good at what it does, the technical debt and opportunity costs in maintaining such a tool for internal consumption is high. For example, it currently lacks any sort of I/O support for Kafka. UDP is its only I/O mechanism.  If we wanted more I/O options, we would need to build them.
  4. The Python code underlying Graphite and Carbon has maxed out its performance potential.
  5. There is no straightforward way to move metrics from one data node to another (modern clustered solutions support this via shard relocation and backup/restore operations).

Our Next Generation Time Series Database

With the above conclusions in mind, we set out over a year ago to evaluate a number of replacement systems. After an initial investigation, we concluded that InfluxDB from InfluxData offered us the right combination of attributes that we felt would take us to the next level. We had a number of core technical requirements that were honed over the years of working with Graphite:

  • Support hundreds applications sending metrics from multiple data centers
  • Millions of points per second
  • Support for non-blocking I/O
  • Data availability within seconds (for alerts and graphs)
  • Tolerance for rapid spikes in traffic at certain times of day
  • Easy for developers to integrate into their code
  • Retention periods of two months to over a year
  • Support for data replication and shard management.

Through our evaluation and proof of concept phase, InfluxDB proved capable of meeting these core objectives.

InfluxDB is architected in such a way that allows us to balance horizontal and vertical scaling approaches for both compute and storage. Our in-house time series technology review identified some key attributes that offered substantive advantages over the status quo:

  1. InfluxDB is undergoing rapid performance improvements and hardening.
  2. InfluxDB supported true clustering technology (replication, high availability, shard management).
  3. InfluxDB and all related packages from InfluxData are written highly optimized in Go, a language that fully exploits multi-core environments.
  4. InfluxDB was built with an emphasis on efficient and scalable storage. Knowing that many of the metrics in a time series system will be sparse and/or ephemeral, InfluxDB per data point compression will generally be higher than Carbon’s as there is no pre-allocation of storage for stored data points (Carbon may be more storage efficient for fully saturated data streams).
  5. InfluxDB has an active community and an ecosystem of related tools for building enterprise-wide installations. It’s more appealing to us as an organization to contribute to an occasional Telegraf plugin (a plugin-based data streaming utility) that others in the industry are using and improving, than to maintain a completely proprietary codebase for the backbone of our data pipeline.
  6. InfluxDB has a growing customer base from a diverse set of companies in IT, IoT, financial analytics, and data science. We feel this diverse customer set will allow this platform to evolve into a general analytics platform that can support Wayfair applications as broad as IT infrastructure monitoring, application performance monitoring, as well as data science and market research.

General architecture for our InfluxDB proof of concept cluster at Wayfair.  Data nodes, contain shards, meta nodes manage the cluster state (shard assignments).  Telegraf is the gateway into the cluster.

As part of our migration process, we have worked closely with our development teams to retool their code so that they could send metrics to both Graphite (via StatsD format) and InfluxDB (via line protocol format). Using feature toggles, each application can write in either format or both at the same time. This allowed us to do head-to-head testing of query performance as well as load test our new InfluxDB cluster.

Grafana graph showing a data comparison in Graphite and InfluxDB.

A Modern Metrics Data Pipeline

In addition to the benefits shown, we are using our migration to InfluxDB as an opportunity to build a more flexible and robust data architecture with Kafka as an intermediate metrics buffer. This is modeled after a paradigm we’ve used successfully with our logging system. InfluxData’s Telegraf service made it relatively easy to configure a multi-layered pipeline by which applications could send data to Telegraf and allow Telegraf to pipe it into Kafka for later consumption. In practice we are using Telegraf twice: One layer to receive raw InfluxDB line protocol from applications via UDP and forward it on to Kafka, and an additional layer to consume metrics from the Kafka buffer and write to InfluxDB.

Benefits of this buffered model include:

  1. Allows us to connect multiple data centers by mirroring Kafka topics to shuttle metrics, rather than through cross-datacenter database replication.
  2. Allows us to use fast, non-blocking data protocols such as UDP upstream (at the top of the data funnel) and more transactionally robust protocols such as TCP as we get downstream.  
  3. Gives us the ability to inject various processing hooks into the data stream as our business needs evolve.
  4. Makes it easy to write the same data to multiple instances of InfluxDB (simply by consuming the same topic as the main cluster).
  5. Gives us multi-day tolerance against a severe network connectivity incident (dependent on the configured age limit of messages in Kafka).

Reference architecture for Wayfair’s InfluxDB metrics data pipeline.  We use this Kafka-centric pipeline to integrate metrics data from three different data centers (including secure and non-secure zones in each).

The Road Ahead

As I write this, we are busy fine tuning our data pipeline architecture to take advantage of Kafka in more sophisticated ways. For example, we are working on dynamically routing data into separate Kafka topics based on tags present in the incoming data. This will enable us to route data into separate databases within a single instance, or even separate InfluxDB clusters. This decoupling of the movement of data from it’s production and consumption forms the basis for our high availability and disaster recovery strategies. It will also help us break the cycle of monolithic unscalable systems early on.

Wayfair is working strategically with InfluxData to ensure we are on a path that is scalable, robust, and in line with the future direction of their platform. We also think Wayfair’s massive e-commerce platform provides a great set of rich case studies to help drive InfluxDB further in its support of features that larger enterprises need. The issues which we’ll continue to hammer concern things such as monitoring, high availability, support for even greater cardinality, and more elegant solutions for multi-tenant instances.

One thing is certain – InfluxDB will be a major component in our Cyber 5 Holiday Weekend monitoring and alerting systems. As we’ve learned from past experiences, there is no better time of year to pilot previous designs and collect specifications for future improvements.

One More Note

If you are in the Boston area, or plan on being here at some point (and you are data enthusiast), check out our meetup group, Time Series Boston. This group is jointly sponsored by Wayfair and InfluxData, and will provide a great new platform for the sharing and understanding of time series data concepts in this region.