We had big plans to expand our server infrastructure this year. So, we put together a rough plan that included capacity planning, hardware selection, hardware testing and validation, roll out, and finally, using the new hardware!
Since our savvy Engineering team knew what they needed, the Operations team was able to compile all those needs to understand the scope of our task at hand. We checked our current performance, and used those numbers to figure out roughly how many more servers we’d need. Initial calculations had us adding roughly 2,000 new servers to our existing stable of 800! Those old servers were purchased with older processor technology, and things have come a long way since. We just didn’t know how far they had come until we started testing.
What We Wanted in a Server
After looking at all the options, we eventually decided to test platforms from two vendors. Both platforms were sled-based servers that fit into a larger chassis, kind of like a blade system but with fewer shared devices. The only commonly shared device on these servers is power.
We initially selected an Ivy Bridge and a Sandy Bridge processor. Both performed fairly well, but we were looking for more horsepower. We kept testing until we found the monster Intel E5-2670 processor. The performance was so impressive that we were able to cut the number of servers we needed to buy from 2,000 down to about 700 machines! We opted for a single-processor config, as we were seeing more than enough power from that single chip.
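The back-of-the-envelope math behind that reduction is simple. Here’s a minimal sketch, assuming a hypothetical 2.9x per-server performance multiplier (the real number came from our own benchmarks):

```python
import math

def servers_needed(workload_units: float, per_server_units: float) -> int:
    """Round up -- you can't buy a fraction of a server."""
    return math.ceil(workload_units / per_server_units)

# Projected workload, expressed in "old-server units" of capacity.
workload = 2000

# Old hardware: one unit of capacity per box.
print(servers_needed(workload, 1.0))  # 2000 servers

# Hypothetical benchmark: one E5-2670 sled does the work of ~2.9
# old servers, which is roughly how 2,000 shrinks to about 700.
print(servers_needed(workload, 2.9))  # 690 servers
```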
We loaded these processors into HP SL230s Gen8 sleds housed in the SL6500 chassis. Each sled is capable of holding up to six drives, two of which are hot-swappable. We decided to go with the largest SATA drive we could get our hands on, their 1TB drive. In a small group of systems, we also added their 200GB SSD. All systems shipped with 32GB of RAM and are expandable to a whopping 256GB!
Bringing Them Home
So, we had our quantities and configs nailed down. Now we needed the new servers to talk to each other and to the old ones. We selected the HP 8200 series chassis switch and connected each one to the core over a 20 Gbit LACP trunk. That core switch also ties back to our original network with another 20 Gbit LACP interconnect. Each server rack was independently networked, so every rack was completely self-contained.
We worked with our data center provider to build out more space for our new equipment, and started working with HP to save ourselves as much trouble as possible. We sent over a variety of BIOS configs to be applied to each server, firmware updates to be installed, and specifications for how the servers should be racked in their cabinets. This was a huge time saver for us! After that, the network chassis were installed, servers were installed into the chassis, and all network cables were run, connected, and labelled. Once complete, each cabinet was ready to be crated and shipped to its destination.
Once the servers arrived on the loading dock at the data center, they were uncrated, moved into the cage, and bolted down. Power was applied and the cross-connects were run to the core, then the two core switches were connected to one another. From the time the servers were taken off the truck, it was roughly half a day until they were fully ready for use. By the third day after arrival, servers were being used by the teams in full production.
Meanwhile . . .
While we were waiting for the servers to be built, configured, and delivered, we had a lot of work to do. Up to this point, we had used Google Docs as our inventory system, and configuration management was a combination of Kickstart scripts plus things written down in various places. There aren’t a ton of great inventory systems out there but we needed more than a spreadsheet, so we began using Tumblr’s excellent Collins inventory system. (For examples, check out these images.)
We wanted to add the new servers to Collins in the most efficient manner, so we devised an intake process to handle that. The systems would be PXE booted, a few hardware tests would run to ensure the new systems weren’t misbehaving, and then the servers would add themselves to Collins. Having each server add itself helped us avoid human error.
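As a rough illustration, the self-registration step might look something like the sketch below. The Collins host is a placeholder, and deriving the asset tag from the chassis serial is an assumption for the example, not a description of our production script:

```python
import subprocess
from urllib import request

# Placeholder host -- not a real Collins instance.
COLLINS = "http://collins.example.com:9000"

def asset_tag_from_serial() -> str:
    """Read the chassis serial via dmidecode to use as the asset tag.

    (Illustrative choice: any stable hardware identifier would do.)
    """
    out = subprocess.run(
        ["dmidecode", "-s", "system-serial-number"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

def build_intake_request(tag: str) -> request.Request:
    """Build the HTTP call that creates the asset in Collins.

    Collins exposes asset creation as PUT /api/asset/:tag; auth and
    extra attributes are omitted here for brevity.
    """
    return request.Request(
        f"{COLLINS}/api/asset/{tag}",
        method="PUT",
        headers={"Accept": "application/json"},
    )

req = build_intake_request("SL230-TEST-001")
print(req.get_method(), req.full_url)
```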
Next, it was time for OSes. We had perfected our Cobbler-based OS installs the year before, when we went through a much smaller server expansion. With our new tools in place, that process changed somewhat, so scripts were created to streamline the process of installing a machine. We didn’t want Collins to just be a record of what we had; we wanted Collins to be a record of what each server does, whether or not it’s undergoing maintenance, what its IPs are, and so on. These scripts ensured that a system could be installed quickly and that our inventory database stayed accurate.
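In spirit, those scripts tie the inventory update to the install itself: record what a machine is about to become before kicking off the install, so inventory never drifts from reality. A minimal sketch, with hypothetical status names and attribute keys (not our actual code):

```python
from dataclasses import dataclass, field

@dataclass
class Asset:
    tag: str
    status: str = "Unallocated"
    attributes: dict = field(default_factory=dict)

def begin_install(asset: Asset, role: str, primary_ip: str) -> Asset:
    """Mark the asset as provisioning and record what it will be."""
    asset.status = "Provisioning"
    asset.attributes.update({"ROLE": role, "PRIMARY_IP": primary_ip})
    # Real script: push this to Collins over HTTP, trigger the
    # Cobbler/PXE install, then flip the status on success.
    return asset

box = begin_install(Asset("sl230-0042"), role="web", primary_ip="10.0.4.17")
print(box.status, box.attributes["ROLE"])
```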
We needed a more sophisticated configuration management system than Kickstart, so we selected Chef. As our understanding of Chef improved, we ran it on our old systems as well as the new ones. It’s great to know that we can quickly install an OS, and that configs on every system will always be correct and easily changeable in a repeatable fashion.
We’re really proud of our new infrastructure but definitely aren’t resting on our laurels. We’re always looking to make things better and more efficient. There are still more things to add to Chef, new things to research, and efficiencies to be gained. We never like to say “good enough” around here. We like it that way.