Wayfair Tech Blog

Lessons from a datacenter move

Last winter we were discussing all of our upcoming projects and what new hardware they would require in the datacenter. Then we took a look at the space we had in our cage at our main datacenter. It turned out we didn’t have enough: the facility wouldn’t give us any more power in our current footprint, and there was no room to expand our cage. We had two basic options. One was to add cage space, either in the same building or in another facility, and rely on cross connects or WAN connections. We weren’t wild about this approach because we knew it would come back to bite us later as we continuously fought with the split and had to decide which systems belonged in which space. The other option was to move entirely into a bigger footprint. We opted to stay in the same facility, which made moving significantly easier, and moved into a space that is 70% larger than our old one, giving us plenty of room as we grow. Another major driver in the decision to move entirely was that it gave us the opportunity to completely redo our network infrastructure from the ground up, with a much more modular design and, finally, 10Gb everywhere in our core and aggregation layers.

Some stats on the move:

  • Data migrated for NAS and SAN block storage: 161 TB
  • Network cables plugged in: 798
  • Physical servers moved or newly installed: 99 rack mount and 50 blades
  • Physical servers decommissioned to save power and simplify our environment: 49
  • VMs newly stood up or migrated: 619

It’s worth noting that the physical moves were done over the course of two months. Why so long? Unlike many companies that can take a weekend to bring things down, we aren’t afforded that luxury. We have customer service working in our offices seven days a week, both in the US and in Europe, and we have our website to think about, which never closes. In fact, we were able to pull this off with only a single 4-hour outage to our storefront, and several very small outages to our internal and backend systems during weeknights throughout the project.

Lessons Learned:

No matter how good your documentation is, it’s probably not good enough. Most folks’ documentation concentrates on break/fix and the general architecture of a system: what’s installed, how it’s configured, and so on. Since we drastically changed our network infrastructure, we had to re-IP every server as it was moved, so we had to go through and come up with procedures for everything else that needs to happen when a machine suddenly has a new IP address. We use DNS for some things, but not everything, so we had to ensure that inter-related systems were also updated when we moved things.
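
To give a flavor of the kind of checks this required, here’s a minimal sketch of a post-move sanity check, assuming a runbook-style mapping of hostnames to their new addresses (the hostnames, IPs, and script itself are illustrative, not our actual tooling): it confirms that DNS already returns the new address for each moved server.

```python
import socket

# Hypothetical mapping of moved hosts to the new IPs assigned in the new space.
# In practice this would come from the migration runbook or an IPAM export.
NEW_IPS = {
    "db01.corp.example.com": "10.20.1.15",
    "web03.corp.example.com": "10.20.2.42",
}

def dns_mismatches(expected):
    """Return (host, resolved, expected) tuples where DNS doesn't match the plan."""
    problems = []
    for host, new_ip in expected.items():
        try:
            resolved = socket.gethostbyname(host)
        except socket.gaierror:
            resolved = "unresolvable"
        if resolved != new_ip:
            problems.append((host, resolved, new_ip))
    return problems

if __name__ == "__main__":
    for host, got, want in dns_mismatches(NEW_IPS):
        print(f"{host}: DNS returns {got}, expected {want}")
```

The same idea extends to the systems that don’t use DNS: a quick search of configuration for hard-coded old addresses catches the stragglers before users do.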

Get business leads involved in the timeline. This sounds funny, but one of the biggest measures of success for a project like this is the perception of the users. Since a good percentage of the systems we moved had particular business units as their main “customers”, we worked with leaders from those business units to understand how they used the systems, which days or times of day saw the heaviest use, and whether they had any concerns about off-hours operations during different parts of the week. Once we had this information from many different groups, we sat down in a big room with all the engineers responsible for these systems, came up with a calendar for the move, and then got final approval on dates from the business leads. This was probably the smartest thing we did, and it went a long way toward keeping our “customer satisfaction” high.

Another thing we learned early on was to divide the work of physically moving the equipment from the work done by the subject matter experts to make system changes and verify that things were working properly after the move. This freed the subject matter experts to get right to work, without having to worry about other, unrelated systems being moved in the same maintenance window. How did we pull this off? Again, include everyone. We have a large Infrastructure Engineering team, 73 people as of this writing. We got everyone involved, from our frontline and IT Support groups all the way up to directors; even Steve Conine, one of our co-founders, did an overnight stint at the datacenter helping with the physical move of servers. It was an amazing team effort, and we would never have had such a smooth transition if everyone hadn’t stepped up in a big way.

I hope these little tidbits are helpful to anyone taking on such a monumental task as moving an entire data center.  As always, thanks for reading.
