The Setup

Lessons Learned in the Land of 1000 Edge Cases

Created by Alex Juarez / @mralexjuarez

Administrative Notes

Slides Available at http://slides.unsupported.io/txlf-2014

About Me

  • Principal Engineer
  • Red Hat Certified Architect (RHCA)
  • At Rackspace for Seven Years
  • After Hours for most of that time
  • Web Developer for years before that
  • I am a fan of Whiskey

Hand Raising / Making Assumptions

The Land of 1000 Edge Cases

So why this topic?

Unique Snowflakes

On any given day, any of our more than 4,500 customers can call in about servers on their account. Together, that comes to about 100,000 servers.

One Box Wonder

Role Based Servers

A Tiered Environment

The Lessons

Lesson 1

Plan for Failure

Murphy's Law

Anything that can go wrong will go wrong.

Things will fail in the most grandiose fashion.

Ever think a truck could crash into a DC?

It happens

Grandiose

Fashion

At the server level

  • RAID-1 OS Drives
  • Dual Power Supplies
  • Plugged into separate power units
  • Managed Backup Offering

Lesson 2

Standardize an Image

Kickstart

  • KS file created based on an OS
  • Set Partitions
  • Create the first couple of users
  • Install Packages

Default Partitions

/boot       100MB
/tmp        2GB
swap        2GB
/           50GB
/var/log    15%
Remainder   Unallocated
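
As a worked example of the layout above, here is a minimal sketch of how the sizes divide up a disk. It assumes the 15% for /var/log is measured against the whole disk (the slide does not say what the percentage is relative to) and uses a hypothetical 300 GB drive; the function name is illustrative only.

```python
# Sketch: how the default partition layout divides a disk.
# Assumption: /var/log's 15% is taken from the total disk size.

FIXED_MB = {
    "/boot": 100,
    "/tmp": 2 * 1024,
    "swap": 2 * 1024,
    "/": 50 * 1024,
}
VAR_LOG_PERCENT = 15


def plan_layout(disk_mb):
    """Return a dict of partition -> size in MB, including the unallocated rest."""
    layout = dict(FIXED_MB)
    # Integer arithmetic avoids floating-point rounding surprises.
    layout["/var/log"] = disk_mb * VAR_LOG_PERCENT // 100
    used = sum(layout.values())
    if used > disk_mb:
        raise ValueError("disk too small for the default layout")
    layout["unallocated"] = disk_mb - used
    return layout


if __name__ == "__main__":
    for name, size in plan_layout(300 * 1024).items():
        print(f"{name:12} {size:>8} MB")
```

Leaving the remainder unallocated means space can be handed to whichever filesystem turns out to need it later, rather than guessing up front.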

Install Packages

  • @ubuntu-minimal
  • @ubuntu-standard
  • curl
  • iptraf
  • locate
  • mtr
  • netcat-openbsd
  • strace
  • snmpd
  • sysstat
  • vim-nox

Lesson 3

Automate Some of the Things

Device Configuration

Low Touch

The Changes We Do Make

/tmp is mounted with noexec
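
One way to verify this setting on a running server is to read the mount table. The sketch below parses text in the /proc/mounts format (device, mount point, filesystem type, options); the sample data and function name are made up for illustration and are not part of the original tooling.

```python
# Sketch: check whether a mount point carries the noexec option.
# Input is text in /proc/mounts format; SAMPLE below is invented data.

def is_noexec(mounts_text, mount_point="/tmp"):
    """Return True if the last entry for mount_point lists noexec."""
    result = False
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            # A later mount on the same path shadows earlier ones.
            result = "noexec" in fields[3].split(",")
    return result


SAMPLE = """\
/dev/sda1 / ext4 rw,relatime 0 0
/dev/sda3 /tmp ext4 rw,nosuid,nodev,noexec,relatime 0 0
"""

if __name__ == "__main__":
    print("/tmp noexec:", is_noexec(SAMPLE))
```

On a live system the same function could be fed the contents of /proc/mounts.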

Add apache to /etc/cron.deny

Set sysstat to record disk activity

Set sysstat for 27 day retention
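
As a rough illustration, the retention period in the sysstat configuration file is controlled by a HISTORY setting. The sketch below rewrites that line in a config file's text; the location of the file differs between distributions, so it works on a string rather than a path, and the function name is hypothetical.

```python
# Sketch: set the sysstat retention (HISTORY) in config file text.
# Works on a string because the file path differs by distribution.
import re

HISTORY_LINE = re.compile(r"^[ \t]*HISTORY=.*$", re.MULTILINE)


def set_history(config_text, days=27):
    """Return config_text with HISTORY set to days, appending it if absent."""
    new_line = f"HISTORY={days}"
    if HISTORY_LINE.search(config_text):
        return HISTORY_LINE.sub(new_line, config_text)
    # No existing setting: append one, keeping the file newline-terminated.
    needs_newline = bool(config_text) and not config_text.endswith("\n")
    return config_text + ("\n" if needs_newline else "") + new_line + "\n"


if __name__ == "__main__":
    print(set_history("# sysstat settings\nHISTORY=7\n"), end="")
```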

Set the soft and hard limits for the number of open files to 8192
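
For reference, open-file limits of this kind are commonly expressed as lines in the limits.conf format (domain, type, item, value). The sketch below only renders those lines; applying them to all users with "*" is an assumption, since the slide does not say which accounts the limit covers.

```python
# Sketch: render limits.conf-style lines for the open-file limit.
# Assumption: the limit applies to every user ("*" domain).

def nofile_limit_lines(value=8192, domain="*"):
    """Return the soft and hard nofile lines in limits.conf format."""
    return [f"{domain} {kind} nofile {value}" for kind in ("soft", "hard")]


if __name__ == "__main__":
    print("\n".join(nofile_limit_lines()))
```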

Add rm aliases to include --preserve-root

Increase the size of bash history

Alert Remediation

Alerts would clear before we could get to them.

Some alerts now auto-update and close out.
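
The idea can be sketched as a small triage pass over open alerts. Everything here is hypothetical (the Alert fields, the state names, the closing note); the slides do not describe how the real system is built.

```python
# Sketch: auto-close alerts whose condition cleared before anyone responded.
# The data model below is invented for illustration.
from dataclasses import dataclass


@dataclass
class Alert:
    host: str
    check: str
    state: str            # latest monitoring state: "firing" or "cleared"
    status: str = "open"  # ticket status: "open" or "closed"
    note: str = ""


def auto_close_cleared(alerts):
    """Close open alerts that have cleared; return those still needing a human."""
    needs_attention = []
    for alert in alerts:
        if alert.status != "open":
            continue
        if alert.state == "cleared":
            alert.status = "closed"
            alert.note = "Condition cleared on its own; closed automatically."
        else:
            needs_attention.append(alert)
    return needs_attention


if __name__ == "__main__":
    queue = [
        Alert("web1", "load", "cleared"),
        Alert("db1", "disk", "firing"),
    ]
    for alert in auto_close_cleared(queue):
        print("needs attention:", alert.host, alert.check)
```

Recording a note on the closed alert keeps a trail for the customer even though no technician touched it.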

Configuration Management

Leveraging Ansible

  • Installing Services
  • Adding / Removing Users
  • Installing / Configuring LDAP

Lesson 4

Monitoring / Alert Design

Alerting on System Vitals

  • CPU
  • Load Average
  • Memory Usage
  • Swap Usage
  • Paging
  • Disk Space Usage
  • Log Monitoring


Thresholds were set statically

Something went bump.

  • Increased tickets to the support floor
  • Increased latency
  • Increased noise for our customers

What did we change?

  • Stop monitoring CPU
  • Stop monitoring Swap Usage
  • Load Average
  • Paging
  • Disk Space Usage
  • Log Monitoring


Thresholds were set dynamically
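
The slides do not say how the dynamic thresholds are derived. One common approach, shown here purely as an assumption, is to base each server's threshold on its own recent history: alert when a value exceeds the mean by a few standard deviations.

```python
# Sketch: derive a per-server alert threshold from recent samples.
# Assumption: "dynamic" means baseline-relative (mean + N standard deviations);
# the original talk does not specify the method.
from statistics import mean, stdev


def dynamic_threshold(samples, sigmas=3.0, floor=0.0):
    """Return mean + sigmas * stdev of samples, never below floor."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    return max(floor, mean(samples) + sigmas * stdev(samples))


if __name__ == "__main__":
    load_history = [0.8, 1.1, 0.9, 1.3, 1.0, 1.2]  # made-up load averages
    print(f"alert above {dynamic_threshold(load_history, floor=2.0):.2f}")
```

The floor keeps a very quiet server from alerting on trivial blips, which is the kind of noise static thresholds produced in the other direction.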

Lesson 5

Writing tools

Tool Development

Rackspace has encouraged the development of tools to help you get your job done.

Holland Backup Manager

Holland is an Open Source backup framework originally developed at Rackspace and written in Python. Its goal is to help facilitate backing up databases with greater configurability, consistency, and ease. Holland is capable of backing up other types of data, too. Because of its plugin structure, Holland can be used to back up anything you want by whatever means you want.

http://hollandbackup.org/

recap

Recap is a reporting script that generates reports of various information about the server.

https://github.com/rackerlabs/recap

IUS

The IUS Community Project is aimed at providing up-to-date and regularly maintained RPM packages for the latest upstream versions of PHP, Python, MySQL and other common software specifically for Red Hat Enterprise Linux. IUS can be thought of as a better way to upgrade RHEL, when you need to.

https://iuscommunity.org/

Lesson 6

Handing Over the Server

Communication is Key

Documentation is Key

Getting an alert and having steps to fix it

Lessons Learned

  • Plan for Failure
  • Standardize an Image
  • Automate Some of the Things
  • Monitoring / Alert Design
  • Writing tools
  • Handing Over the Server

Wanna Holler?

  • Email: alex.juarez@rackspace.com
  • Twitter: @mralexjuarez

Got Skills?

Come by and try your hand at our break fix!

Questions?