Wayfair Engineering FAQ, a conversation with the Chief Architect

[Photo: Ben Clark]

Q1: What's the tech stack in a nutshell?

A: PHP on Linux, and a few data backends, with continuous deployment at the pace of ~250 zero-downtime code pushes a day.

Q2: Whoa, that's a lot of code pushes! Break that down for me.

A: What I'm actually counting there is git changesets that are going to some type of production system. We group them in batches, so merges and testing don't get too hairy. The rumors that we're so cowboy we only test in production are an exaggeration.

Q3: What's the main idea of Wayfair Engineering?

A: Stay fast while growing to enormous size.

Q4: That sounds cool, but how do you measure 'enormous size'?

A: Anything you can think to measure: revenue, active customers, daily visitors, site traffic, terabytes of data, engineering headcount, warehouse square footage, linear miles traveled by our packages being delivered, square miles covered by our proprietary transportation network, number of products we sell. The numbers are out there every quarter in our reports, and in the independent venues: however you count, we're huge. I'd give you the actual numbers today, but I want to be able to leave this page up for a while, and at the rate we're growing, snapshot numbers get stale fast.

Q5: How do you measure 'fast'?

A: Two things. First and foremost, speed to market for new business ideas and features. We also look at performance metrics: web page load times, lead times for delivery, etc., etc. We measure everything.

Q6: How's the relationship between tech and the business?

A: It's very good, and it's one of mutual respect and cooperation. It's a founder-led business. Steve Conine (tech) and Niraj Shah (business), co-founders/owners, provide a good model of that kind of relationship for the rest of the company, and we take our cue from them. Steve is a very business-minded tech entrepreneur who's also a phenomenal programmer, and Niraj is an unusually tech-savvy business person. They were both engineering majors at Cornell. There's less of a divide there in the first place than at many companies. The core of Wayfair is an innovative, completely custom e-commerce platform, and management has consistently described that to the outside world as being a big part of the equity value of the company. On the tech side, let me ask people out there: how valuable would it be to you, to have a deep reserve of confidence that you're working at a well-managed business, where your engineering efforts aren't going to be wasted on some improperly vetted hunch that will send the balance sheet into the red for no good reason? Combine that analytical sense and good judgment with Wayfair's characteristic aggressive business innovation, and we all feel pretty good about working together. The home goods niche of the economy isn't going to go online all by itself: we're going to need to make it easy for people, and that's going to come from a strenuous, combined effort by all parts of Wayfair.

Q7: The home goods niche? You're taking credit for a whole sector of the retail economy going online?

A: Of course it's not *all* us, but we do account for a hefty percentage of every dollar that goes online for the purchase of home goods.

Q8: 'Innovative, completely custom e-commerce platform,' you say? Do you want to elaborate?

A: It's not that we never buy third-party software, but we have a strong bias towards 'build' in build-vs-buy discussions, and that has only become more pronounced over time. At the core, there's no third-party platform like Demandware or Magento, just an evolving set of data models and architectural principles, a lot of code, and some great components that our developers know what to do with. We're very patient with the early stages of DIY efforts that aren't necessarily up to industry standards at first, if we think we can gain a sustainable advantage over time. Most recently, we've insourced some big parts of our marketing tech stack, for which we formerly used outside vendors or commercial software. It's satisfying when we can leave behind vendor-based point solutions to individual problems, and stand up a new part of the living, breathing, integrated whole, which allows us to take advantage of everything our platform has to offer.

Q9: What languages do you write code in?

A: PHP 7 and JavaScript, centered around ReactJS with Hypernova for server-side rendering, are the bread and butter. We have also written important things in Python and C#, and some key components in Java. Objective-C for iOS mobile apps, Java for Android, and some emerging language platforms for VR and AR. We use Puppet for configuration management, so there's a certain amount of Ruby hacking as well, and a lot of systems scripting with Python. Once in a while we write some C or C++, for optimized numerical work, PHP extensions, and opensource infrastructure like Twitter's Twemproxy (patches) and Statsdcc (from scratch, inspired by things in Node.js and other languages).

Q10: So you opensource code. Where can I find that?

A: Yes, we do that all the time, most of it on https://github.com/wayfair. Check it out!

Q11: VR and AR?

A: Virtual and augmented reality. We've got a lot going on in that space, particularly in a small department we call Wayfair Next that Steve Conine is leading. Right now the biggest push is to model the catalogue. From this we get excellent 2D imagery for the site, and next-gen experiences on things like the Google Tango AR devices that are becoming available to the general public in September. If you have a dev kit, check out the 'WayfairView' app. Big picture: we want to make it easier and easier to buy, say, an easy chair from your couch. VR/AR is going to be a big part of that.

Q12: What are your data platforms?

A: We're proud of how far we got as a business, from our founding in 2002 until 2010, on a keep-it-simple-stupid or KISS architecture of relational-database-backed web scripting. SQL Server was and still is our core for OLTP, and it allows us to plug new tools into our integrated operational and analytical infrastructure very quickly. But to drive innovative customer experiences, we now rely on Solr-backed search and browse, and Redis/Memcache for fast access to complex data structures and ordinary caching. We have a modern, on-premises big data infrastructure consisting of Hadoop and Vertica clusters, and some specialized, vertically-scaled big-memory and GPU machines, for analytical workloads. We do our machine learning, statistical analysis and other types of computation on that setup, and funnel the results to the 'Storefront,' as we call it, and to the operational business systems. RabbitMQ and Kafka provide a kind of circulatory system for the data, and they are gradually replacing what traditional ETL we have. As I speak with other architects and CTOs around the industry, I actually think it's pretty rare, at the biggest and most successful companies, to junk your relational databases, even when you're many years into adopting these next-generation auto-sharding platforms. We're fine with that.

Q13: OK, so with all these relational databases, do you use an ORM?

A: There's a joke around Wayfair Engineering, that if you use the word 'ORM' in a positive way, you might notice a sudden drop in your career prospects. Joking aside, I do think excessive reliance on ORMs tends to foster careless data access code, and excessive round trips to the back end. We mostly use the 'phrasebook' pattern and hand-make our data access layer. Besides, it's not as if an easily generated mock would really help you: by the time you've horizontally partitioned your data to the extent we have, ORMs are close to useless. We try to make it easy for everybody to develop against a readily accessible development database infrastructure. On the other hand, there is actually a bit of Hibernate in our Java, SQLAlchemy in our Python, and both the Entity Framework and NHibernate in our C#. ORMs on language platforms like those can have some engineering benefits in addition to the convenience features, such as connection pooling, caching of various kinds, etc. None of that works, or at least works well, in PHP, so we just use PDO like the rest of the PHP world, and we're experimenting with SQL Relay for some other kinds of optimization and encapsulation of the details of how we talk to the databases. At the higher levels, we have some pretty handy traits, which are PHP's mechanism for something like multiple inheritance, to inject common functionality into our codebase in a DRY way. No fanaticism, one way or the other.
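For readers who haven't met the 'phrasebook' pattern mentioned above, here is a minimal sketch of the general idea, written in Python with an in-memory SQLite database so that it runs anywhere. It is not Wayfair's data access layer: the table, the column names, the phrase names and the `run` helper are all invented for illustration. The point is only that every SQL statement lives in one named catalogue, and the hand-written access functions look statements up by name and bind parameters.

```python
import sqlite3

# Hypothetical phrasebook: every SQL statement the application may run,
# keyed by a descriptive name. Table and column names are invented.
PHRASEBOOK = {
    "create_orders": (
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, customer_id INTEGER, total_cents INTEGER)"
    ),
    "insert_order": (
        "INSERT INTO orders (customer_id, total_cents) "
        "VALUES (:customer_id, :total_cents)"
    ),
    "orders_for_customer": (
        "SELECT id, total_cents FROM orders "
        "WHERE customer_id = :customer_id ORDER BY id"
    ),
}


def run(conn, phrase, **params):
    """Look up a named statement and execute it with bound parameters."""
    return conn.execute(PHRASEBOOK[phrase], params)


def orders_for_customer(conn, customer_id):
    """Hand-written data access function: one named query, one round trip."""
    rows = run(conn, "orders_for_customer", customer_id=customer_id).fetchall()
    return [{"id": row[0], "total_cents": row[1]} for row in rows]


# Usage with an in-memory database (an assumption made so the sketch is runnable).
conn = sqlite3.connect(":memory:")
run(conn, "create_orders")
run(conn, "insert_order", customer_id=7, total_cents=12999)
run(conn, "insert_order", customer_id=7, total_cents=4500)
run(conn, "insert_order", customer_id=8, total_cents=100)
print(orders_for_customer(conn, 7))
```

Because the SQL is plain text in one place, it can be reviewed and tuned by people who know the database well, which is roughly the trade being made against the convenience of an ORM.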

Q14: What are your thoughts on web services?

A: We have a handful of important web services behind the scenes at Wayfair. Search and browse for products, orders, and some other things, are powered by Solr, which is an opensource, Java-based web service that we have patched a few times for our own needs. Our Python-based customer recommendations and search enhancements, and our C#-based inventory service, deliver a lot of value. There are other examples.

Q15: Do you have any other kinds of services?

A: Good question. Some of the highest-value systems we have are data processing services that ingest data from our messaging platforms (Rabbit, Kafka) and push value-added results to where we can use them to move faster on behalf of customers and suppliers. There are some DIY ones, but most live in the frameworks Celery, Storm or Spark. You could call our caching system a service. It's a composite Redis/Memcache/consistent-hashing thing with smart proxies everywhere. You're using regular Redis and Memcache commands, not going through an adapter layer, but that's true of ElastiCache too, so we're far from eccentric in this way. Unlike in early ElastiCache (although they have added this more recently), the sharding is taken care of for you. We built it on the back of work by Twitter, Pinterest and Instagram, but we added some innovative elements of our own. It has some similarities with Facebook's mcrouter, which is pretty awesome, and which we might well have chosen instead, if it had Redis support.
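To make the consistent-hashing part of that answer concrete, here is a minimal, generic sketch in Python. It is not the Wayfair proxy layer or Twemproxy's implementation: the `HashRing` class, the node names, the key format and the choice of SHA-256 with 100 virtual points per node are all assumptions made for illustration. It shows the property that makes the technique attractive for cache sharding: when a node leaves the ring, only the keys that lived on that node are remapped.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring with virtual points per node (illustrative only)."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas  # virtual points per node, smooths the distribution
        self._points = []         # sorted hash positions on the ring
        self._owner = {}          # hash position -> node name
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value):
        return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

    def add(self, node):
        for i in range(self.replicas):
            point = self._hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node):
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def node_for(self, key):
        if not self._points:
            raise ValueError("ring is empty")
        # First point clockwise from the key's hash, wrapping around at the end.
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


# Hypothetical cache nodes and keys.
ring = HashRing(["cache-a", "cache-b", "cache-c"])
keys = [f"product:{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}

ring.remove("cache-c")
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]

# Every key that moved used to live on the removed node; the rest stayed put.
print(len(moved), all(before[k] == "cache-c" for k in moved))
```

With naive `hash(key) % number_of_nodes` sharding, removing one node would remap most keys and empty most of the cache at once, which is the failure mode this scheme avoids.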

Q16: What about micro-services, or SOA?

A: We're not really into all that, although as I said, we have some pretty awesome services back there. Is your code base, or a big part of it, really a monolith, in any pejorative sense, if it has decent separation of concerns, and you can deploy small modifications to any layer of it without rebuilding the whole thing, and without downtime? We've had all of that for years. Many of the best big tech companies have largely monolithic code bases, and they're too busy adding awesome features to the core to want to replatform. But don't get me wrong: there are some cool micro-service set-ups out there. If we keep developing very valuable *macro* services at the rate we've been doing that, we'll eventually have so many of them that micro-service-style orchestration techniques will start to make sense for us. Our Python services are the most numerous ones we have, and we're already experimenting with Docker, Mesos and Kubernetes for them. It's just that over time I have seen the importance of web services diminishing, as data platforms become easier to scale horizontally, and server-side-of-the-front-end programming becomes easier and more powerful. The data is just too readily available for these layer cakes of HTTP indirection to make any sense in a well-designed, modern setup.

Q17: Why do you like PHP?

A: I'm not sure I *do* like it, but it attracts the right kind of people: neither ivory-tower language snobs, nor hipster code posers. No fanatics, but no luddites either. We have some fun with all that, when we're trying to make sure the culture stays strong. I tried to depict both sides of the ivory-tower/hipster thing in this picture a few years ago, in a comic-strip-style blog post on our Python ops: I think the tweed jacket combined with the Brooklyn t-shirt really gets the point across. (To MIT professors, and to my former neighbors in Park Slope: I kid because I love!)

[Image: adam_rolling_eyeswd]

With every other language, there are a lot of fanatics who think it's the answer to every problem, and will wear your ears out explaining why. Even people who love PHP don't think that about PHP. It's just a solid platform for web development, the kind the tattooed web ops expert in the picture would think is a fine thing to have running on his servers. There's no server lifecycle management to worry about, and practical problem-solvers gravitate to it. It's also very readable, even if you don't know it well, so let's all just pause to give it the big thank you it deserves for killing Perl (with a substantial assist from Python, of course, on the systems scripting side). That needed to happen, and in retrospect it's obvious that neither Java nor .NET had the slightest chance to do it. 80% of the internet runs on PHP, including a bunch of the biggest sites, which we're rapidly becoming one of. It'll do.

Q18: So PHP is a cultural thing?

A: Yes. Let me draw an analogy, which I sometimes use in talks for new hires. Do you remember the scene in Star Wars, when Luke Skywalker sees the Millennium Falcon for the first time, and says "What a piece of junk!"? Han Solo responds, "She'll make point five past lightspeed. She may not look like much, but she's got it where it counts, kid. I've made a lot of special modifications myself." H/t @danmil for that analogy. Our PHP-and-friends stack might seem inelegant to language snobs, but it takes us where we want to go fast. The Millennium Falcon is still a space ship, after all! Let's try a 'car' analogy: anyone who has ever done the coding equivalent of putting a Porsche engine in a VW Golf, or can show us the chops and attitude for that and wants to try, is welcome at Wayfair Engineering. Adding lambdas to the opensource php_mustache extension, which we did, is a great example of something that fits that mold. If you're more of an "I won't drive at all unless I have a Lamborghini" person, you should seek a company more willing to splurge on the shiniest tools, before thinking about whether there is really a need. If your mindset is more "My tank-like SUV keeps me safe, and I don't care that it handles poorly," there are plenty of J2EE shops out there.

Q19: OK, I'm getting the picture. Why use the other languages at all?

A: PHP is not a great fit for every programming task. Sometimes you need a long-running daemon that can respond to requests with little startup/wake-up overhead. We have excellent services in Python, C# and Java for that. The C# code grew out of our early-stage Microsoft heritage, but we are now doing some phenomenal things with it, and we have added some very elegant functional programming in F#. Python is our favorite language for data science, machine learning and the like, and it combines low-latency service qualities, the way we run it, with the convenience and productivity of dynamic typing, and that super-handy mix of the functional and object-oriented styles. Java allows us to tap directly into platform-level infrastructure such as Solr, Elasticsearch, Hadoop, Kafka and Storm.

Q20: You've been talking about speed, and you mentioned that you measure web performance earlier. Can you give a bit more detail on that?

A: Sure. Web performance measurement is basically a 3-legged stool: RUM, or real user monitoring; synthetic monitoring, which is externally-located bots that measure page speed; and server-side execution metrics. We have a centralized performance team that makes sure we have the right tools and dashboards to be proactive about all of that. They also work on framework-level changes that can make a big impact, when those don't naturally belong with a more specialized group. They play a strategic role for us, but that team wouldn't be very effective if we didn't have a good culture of thinking about web performance in a broader context of putting a great user experience into the hands, and onto all the devices, of our customers. The RUM instrumentation gives us great insight into what our customers are actually experiencing. I didn't come up with this name, but my joke name for that department is the RUM distillery, and you can imagine the joking about operating precise instruments in the right state of mind. We have some cool 'responsive' experiences here and there, but the RUM tells us that our decision to emphasize adaptive delivery over responsive design was a good one. Check out Wayfair mobile web, on a small iOS or Android device, and you'll see what I mean. Our native apps are fast too, but that's a separate discipline, where server-side execution and expert Java and Objective-C programming are the key components.
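As a small illustration of what can be done with RUM data of the kind described above, here is a sketch in Python that summarises page-load samples by percentile rather than by average. Everything in it is an assumption: the sample values are invented, the nearest-rank `percentile` helper is just one common definition, and the interview does not say which statistics the performance team actually tracks.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical page-load times in milliseconds, as reported by real browsers.
load_times_ms = [820, 910, 950, 1010, 1100, 1180, 1250, 1400, 2300, 6200]

mean_ms = sum(load_times_ms) / len(load_times_ms)
summary = {
    "mean": mean_ms,
    "p50": percentile(load_times_ms, 50),
    "p95": percentile(load_times_ms, 95),
}
print(summary)
```

In this made-up sample the median visitor waits about a second while the slowest one waits more than six, and the mean on its own conveys neither experience well, which is one reason percentile views of real-user data are popular.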

Q21: Thoughts on the cloud?

A: We run a few elastic workloads on public cloud infrastructure, but that's a drop in the ocean of Wayfair tech. Don't get me wrong: if we were starting Wayfair today, we would do it on public cloud infrastructure, for the speed-to-market aspect, for sure. In fact, Wayfair was very briefly a Yahoo! Store in 2002, before Steve built the first version of the platform to run in a colocated cage in a data center. We run colo-style to this day. Wayfair was already a multi-hundred-million-dollar company before the cloud was a thing. We think about it, and do some analysis and experiments periodically. But ultimately our traffic is not extremely spiky, and we grow into the holiday spike provisioning pretty early the following year. The economics, control and convenience have not yet aligned to make it worthwhile to go through a big process of switching. We're not big enough to have whole data centers, at least not yet, but we have our /22 ARIN range, and we use the Border Gateway Protocol (BGP) to make sure we have the kind of relationship with our ISPs where we have a lot more control than when we were smaller. Wrestling with these types of configurations is interesting work, and it attracts really good network and systems people. Let's face it: the public cloud is awesome, but when the problem is under the hood of the hypervisor, you're in for a frustrating day at the office. We do a lot of virtualization, and we like it, but when various types of systems become very cookie-cutter or have certain types of requirements, we run physical boxes. Virtualization adds overhead, and it's one more thing that can break. If you can provision basic types ahead of demand, the IaaS side of the cloud becomes just another provider, and of course the higher-level services are fraught with problems of vendor lock-in. The way cloud adoption presents itself to new or small companies, it's kind of ironic that we're moving too fast to be bothered with moving to the cloud. But never say never.

Q22: OK, sign me up. How do you succeed in Wayfair Engineering?

A: It's hard to answer that question without using some cliches, but I'll try to use the ones that are characteristic and relevant. Programmers with the polyglot, DevOps-savvy innovator background tend to do really well here. Boy Scout principle for refactoring, rather than a penchant for from-scratch rewrites. Bias for action: if you're not embarrassed by the first version, you waited too long to ship it. Just ship! If you find yourself tempted by a months-long science project, don't do it. Instead, fast-follow/adopt something that's already here in the general area (whether we wrote it or it's open source from outside), and innovate at the margins for now. But when you see a quick win that you think is on a path to a real breakthrough, pounce.
