Best Practices in Systems Administration
From Laen
This is the outline for a paper I was planning on doing. It grew on me a bit, to the point that it would probably just end up being a book. It started while working as a sysadmin for a telecom startup in Tampa, FL.
Best Practices -- Building an Enterprise from the Ground Up
Description of Initial Environment
- Number and type of systems
- Applications supported
- Deficiencies
- Strengths
The growing and shrinking network
(knowing when to split and when to consolidate)
Virtualization
- Why it's important
- Don't more machines mean more administrative overhead?
Sure. So does supporting more applications.
The question is: how much does your administrative overhead go up per machine you provision (virtual or otherwise)? If it's significant, that implies you could improve your provisioning or configuration management systems. In a well-automated environment, the additional load of one more machine should be negligible.
Supporting more applications _absolutely_ adds to administrative overhead.
- Don't more machines mean more system overhead?
Sure. Each virtual machine is going to use some additional disk space and RAM that it wouldn't if you weren't subdividing the machine. What we strive to do is trade system overhead for an increase in our administrative efficiency.
- Technologies
- Xen - In Xen, each VM runs its own kernel, and each VM may be running a different OS.
- Zones
- Domains
- UML
- Vservers
- KVM
- VMware
- Virtual Iron
- Eucalyptus
Set up a cloud infrastructure. Allow users to "check out" a chunk of CPUs.
Networks
- Single subnets
- ELANs/VLANs
- Administrative/Backup Restore Networks
- Multiple Subnets
- Firewalling and Policy Enforcement
- Storage Networking
Auto-installation
Reproducibility
Autoinstallation ensures that each system starts its life the same.
Autoinstall vs. Manual Installation
Autoinstallation allows you to answer all the installation questions once, up front, and stamp out multiple machines based on those answers. Besides the time savings, this nets a reliability win as well: answering lots of questions by hand means lots of opportunities for mistakes, which can lead to unreproducible systems.
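Here is a minimal sketch of the "answer once, stamp out many" idea. The answer keys, values, and output format are all hypothetical and are not the syntax of any real installer; the point is only that per-host differences are limited to a small, explicit set of fields.

```python
from string import Template
from typing import Optional

# Hypothetical answer-file format; every real installer has its own syntax.
ANSWER_TEMPLATE = Template(
    "hostname=$hostname\n"
    "timezone=$timezone\n"
    "root_disk=$root_disk\n"
    "packages=$packages\n"
)

# Answers decided once, up front, and shared by every machine (assumed values).
SITE_ANSWERS = {
    "timezone": "UTC",
    "root_disk": "disk0",
    "packages": "base,ssh,cfengine",
}


def render_answers(hostname: str, overrides: Optional[dict] = None) -> str:
    """Return the answer file for one host: site answers plus explicit overrides."""
    answers = {**SITE_ANSWERS, **(overrides or {}), "hostname": hostname}
    # substitute() raises KeyError on a missing answer instead of guessing one.
    return ANSWER_TEMPLATE.substitute(answers)
```

Calling render_answers("web1") and render_answers("web2") yields files that differ only in the hostname line, which is the reproducibility property described above.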
Partitioning
Package selection
Don't be stingy, but don't install more than you'll realistically use on a regular basis. More packages mean more pieces, which increases the complexity of the system (through possible odd interactions) and can affect your uptime.
Post-install scripts
You may be tempted to toss in a bunch of post-install changes right here. Really, that should be done by your configuration management tool. Your post-install scripts should just install your configuration management tool and little, if anything, else.
Heterogeneity and other Pizza Toppings
(Knowing that new OSes will be added to the mix, and planning for it)
- Directory layouts of network file servers
- Choosing software and technologies based on portability
Location Independence of Data and Applications
(/local, /remote, SANs, NAS, etc)
- SAN vs NAS.
- /local and /remote
- Package tree
- Data tree
There are three principles I hold dear in package installation:
- Binaries and Data should be held separate
- Software shouldn't be able to tell if it's running from a local disk or a remote fileserver.
- No special LD_LIBRARY_PATH should be required to run it. The binary should have the right paths encoded into its run path, or it should be replaced with a script that sets the LD_LIBRARY_PATH correctly for that application.
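The wrapper-script option in the third principle could look something like the sketch below. The package name and paths are hypothetical (they assume a /pkgs/packagename/current layout), and a real wrapper would end with a call to main().

```python
import os
import sys

# Hypothetical locations; adjust to wherever the real binary and its libraries live.
REAL_BINARY = "/pkgs/exampleapp/current/libexec/exampleapp"
LIB_DIRS = ["/pkgs/exampleapp/current/lib"]


def wrapped_env(environ, lib_dirs):
    """Return a copy of environ with lib_dirs prepended to LD_LIBRARY_PATH."""
    env = dict(environ)
    existing = env.get("LD_LIBRARY_PATH", "")
    parts = list(lib_dirs) + ([existing] if existing else [])
    env["LD_LIBRARY_PATH"] = ":".join(parts)
    return env


def main():
    """Replace this process with the real binary, passing arguments through."""
    env = wrapped_env(os.environ, LIB_DIRS)
    os.execve(REAL_BINARY, [REAL_BINARY] + sys.argv[1:], env)
```

Because the variable is set only for this one application, users never need a special LD_LIBRARY_PATH in their own shell environment.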
Let's start with the first principle.
Separation of Binaries from Data
Okay, what I'm really talking about here is keeping the "strictly static stuff" and the "highly or possibly variable stuff" it uses and acts on apart. Binaries on one hand, and on the other hand configs, log files, and run-time things like pid files and sockets.
Here's a directory structure I've had great success with:
Unchanging Binaries:
/pkgs/
/pkgs/packagename/
/pkgs/packagename/version/
/pkgs/packagename/current -> /pkgs/packagename/version/
Variable Data:
/data/
/data/packagename/
/data/packagename/conf
/data/packagename/logs
/data/packagename/run
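As a rough sketch, the layout above can be maintained with two small helpers. The function names are made up for illustration, and the relative symlink target is my own assumption (the listing above shows an absolute one); a relative target keeps the link valid wherever the tree is mounted.

```python
import os


def activate_version(pkg_root: str, package: str, version: str) -> str:
    """Point <pkg_root>/<package>/current at <version> using an atomic rename."""
    pkg_dir = os.path.join(pkg_root, package)
    if not os.path.isdir(os.path.join(pkg_dir, version)):
        raise FileNotFoundError(f"{package} {version} is not installed under {pkg_dir}")
    current = os.path.join(pkg_dir, "current")
    tmp = os.path.join(pkg_dir, ".current.tmp")
    if os.path.lexists(tmp):
        os.remove(tmp)
    # Build the new link under a temporary name first.
    os.symlink(version, tmp)
    # Renaming over an existing symlink replaces it atomically on POSIX systems.
    os.replace(tmp, current)
    return current


def ensure_data_dirs(data_root: str, package: str) -> None:
    """Create the conf, logs, and run directories for a package's variable data."""
    for sub in ("conf", "logs", "run"):
        os.makedirs(os.path.join(data_root, package, sub), exist_ok=True)
```

Upgrading or rolling back then amounts to calling activate_version again with a different version; nothing under the data tree has to move.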
Location Independence
(or, "Software shouldn't be able to tell if it's running from a local disk, or a remote fileserver")
Server Independence
(or, "Data should be separate from machine personality. An app should be able to be brought up again anywhere in your infrastructure.")
Data center layout and design issues
- Raised floor
- Under-floor cooling
- Flooding (raised power)
- Cable-rails
- Cleanliness
The importance of security
(designing with security in mind)
- Physical Security
- Umasks, File permissions, ACLs
- Firewalling
- As a means of policy enforcement
- As a means of knowing what hosts are communicating with your servers
- For security
- Hardened OSes
Monitoring
Being the first to know when something bad happens
- SNMP
- mon
- nagios
- Syslog-ng
- Event correlation
- Logsurfer++
- SEC
- Anomaly detection (swatch/logsurfer)
- IDSes and Anomaly Detection
- Netflow/Flowscan
Trending
While Monitoring is about noticing individual events, Trending is about looking at how the number and severity of events change over time. A single DNS server failure may not be a huge deal, but if it starts happening more and more often, then you may have an issue worth investigating.
Trending is about logging monitor data, and charting the frequency of events over time to look for patterns or progressions.
- Cricket
- mon
- Nagios
- Cacti
- NetMRG
- SNMP
- rrdtool
- logsurfer/SEC
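To make the trending idea concrete, here is a minimal sketch that counts logged events per day and compares the most recent window with the one before it. The window size, growth factor, and minimum event count are arbitrary assumptions, and none of this is tied to any of the tools listed above.

```python
from collections import Counter
from datetime import date, timedelta


def daily_counts(event_dates):
    """Count events per calendar day from an iterable of date objects."""
    return Counter(event_dates)


def is_trending_up(counts, today, window_days=7, factor=2.0, min_events=3):
    """True if the latest window has at least `factor` times the previous window's events."""

    def window_total(end, days):
        return sum(counts.get(end - timedelta(days=i), 0) for i in range(days))

    recent = window_total(today, window_days)
    previous = window_total(today - timedelta(days=window_days), window_days)
    # A handful of isolated failures is monitoring noise, not a trend.
    if recent < min_events:
        return False
    # No earlier events at all: a burst of new ones is worth a look.
    if previous == 0:
        return True
    return recent >= factor * previous
```

A single DNS failure today returns False, while five failures this week against one the week before returns True.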
Time Series Databases
- Graphite
- OpenTSDB
Host naming conventions and CNAMEs
- Service names
The hostname of a box isn't really important. What's important is that you know what services are pointed at the box. Every service you run on a box should be given a CNAME.
- Why location has no place in hostnames
- Why service has no place in hostnames
- Okay, so what _should_ go in hostnames?
- RFC2100 - The Naming of Hosts
- Host databases (DNS? HCD?)
- Appliance Servers
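The "every service gets a CNAME" rule can be sketched as a small generator that turns a service-to-host mapping into zone-file records. The service names, hostnames, and the example.com domain are placeholders.

```python
# Hypothetical mapping of service name to the host currently providing it.
SERVICES = {
    "mail": "pluto",
    "www": "pluto",
    "ldap": "saturn",
}


def cname_records(services, domain):
    """Return one zone-file CNAME line per service, sorted by service name."""
    lines = []
    for service, host in sorted(services.items()):
        lines.append(f"{service}.{domain}. IN CNAME {host}.{domain}.")
    return lines
```

Moving mail to another box is then a one-line change to the mapping; clients keep using mail.example.com and never learn the hostname.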
Backups
- Disk Based
- Tape Based
- Network based
- Offsite
- Monitoring and auditing
Keeping it all together
Cfengine, expect, PIKT, puppet, and rsync
- The UNIXverse as a virtual machine
- Using cfengine to keep it all together
- Assigning roles to systems and "role packages"
Machines aren't important. The roles they play are. Group your machines together by what their role is in the network. You may have only one mailhost, but that shouldn't stop you from referring to it in all your configurations as "mailhost" instead of its actual hostname. One day you may add an additional mailhost, and when you do, all you'll have to do is add it to the "mailhost" group, and it'll get updated with all the configuration that is standard for a mailhost on your network.
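A minimal sketch of role groups follows, with hypothetical hostnames and configuration item names. It does not reflect how cfengine or any other tool actually represents classes; it only illustrates that configuration hangs off the role, not the host.

```python
# Hypothetical role membership: role name -> set of hosts playing that role.
ROLES = {
    "mailhost": {"pluto"},
    "webserver": {"mars", "venus"},
}

# Hypothetical "role packages": role name -> configuration items for that role.
ROLE_CONFIG = {
    "mailhost": ["smtp", "spam-filter"],
    "webserver": ["httpd"],
}


def roles_for(host, roles):
    """Return the sorted list of roles a host belongs to."""
    return sorted(role for role, members in roles.items() if host in members)


def config_for(host, roles, role_config):
    """Return the configuration items a host receives through its roles."""
    items = []
    for role in roles_for(host, roles):
        for item in role_config.get(role, []):
            if item not in items:
                items.append(item)
    return items
```

Adding a second mailhost is just ROLES["mailhost"].add("neptune"); config_for("neptune", ROLES, ROLE_CONFIG) then returns the standard mailhost items without any host-specific configuration being written.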
Configuration Management
- Revision Control
- cfengine / puppet / chuff
- Generating Configuration Files (erb, etc)
- Global, Site, Role, and Host configuration (Sharing "recipes")
- Configuration Directories
- Tracking Unauthorized Changes (Tripwire/radmind/CM_SAFE)
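One way to picture Global, Site, Role, and Host configuration is as layers merged in order. The sketch below assumes the most specific layer wins (host over role over site over global); that precedence is my assumption, and the keys and values are made up.

```python
def effective_config(global_cfg, site_cfg, role_cfg, host_cfg):
    """Merge configuration layers; later (more specific) layers override earlier ones."""
    merged = {}
    for layer in (global_cfg, site_cfg, role_cfg, host_cfg):
        merged.update(layer)
    return merged


# Hypothetical layers for one mail server at one site.
example = effective_config(
    {"ntp": "ntp.example.com", "syslog": "loghost"},  # global defaults
    {"ntp": "ntp.branch.example.com"},                # site override
    {"smtp_relay": True},                             # role: mailhost
    {"syslog": "localhost"},                          # host exception
)
```

Keeping host-level entries rare is the goal: the more a machine's configuration comes from the global, site, and role layers, the easier it is to rebuild.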
Sticking square software in a round hole
(dealing with software that doesn't fit our pretty little world)
- Symlinks
- Production Release Management
Planning for the Future and Big Picture Thinking
(knowing what you hope to accomplish)
- Daydreaming
- Identifying problems
- Brainstorming
- User feedback
Writing policies while being flexible
We're not fascists, we're facilitators
Time management
(small groups with lots to do)
- Ticketing systems
- Off-hours support
- On-hours support
Time Management for System Administrators
Developers and root
- sudo
- SSH (SSU)
- SELinux/AppArmor
- Solaris BSM
The Solaris Basic Security Module (BSM) allows you to log practically all activity that takes place on your system, including file writes, execs, and changes in privilege levels (such as from running su). This can produce a detailed log of what a user runs as root.
Good ol' human interaction
(communication in a group)
- Meetings
- One team meeting a week, to touch base with everyone and see what teammates are working on.
- Monday meetings are good for getting everyone pointed in the right direction.
- Friday meetings are good for wrapping up the week and providing status.
- Tuesday meetings are good because everyone's got themselves together from Monday, and most of the week is still ahead.
- Mailing lists
- Have a team mailing list that all communication to and from the customers gets CC'ed on.
- Intranets
- Blogs and Wikis
- Store knowledge in a wiki
- Store status reports in a blog
Writing business cases by using graphs
(Getting Things Done When You Are Not In Charge?)
- cricket/orca/cacti graphs
- User buy-in (Let _them_ drive changes). One of the best things you can do to drive change in an organization is get your users to request it.
Change control, positives and negatives
- Manage change, don't stop it.
- Too restrictive
- Not restrictive enough
- Just right..
- Document!
OSS and RFCs
(The importance of open protocols)
- Giving users a choice.
Who are you to decide what program your users use to read their email? Who are you to dictate their work flow? Let them use the tools that they're most comfortable with by supporting open protocols.
- Competition
This also fosters competition among vendors. Your new vendor doesn't support the open protocols you use? Tough.
Redundancy, HA, and fault tolerance
- HA is good, but fault tolerance is better.
- Maintenance windows
- Failing Safe
- HA is expensive, so use it sparingly.
Documentation
- Document everything
Think of your job in terms of "Standard Operating Procedures". What do you do, and how do you do it? Have a standard form that you load up and walk through.
- Amnesia (Reminding yourself)
Humans forget things. In any procedure, you risk forgetting a step. So, step through your procedure. If what you're doing deviates from the procedure, edit the procedure and add to the decision tree. Wikis are good for this.
- Include your vision (why was this designed this way?)
Don't assume things are self-evident. Write down _why_ you made the decisions you did. Talk about other ways of doing the same thing that you considered, and why you eventually chose the one you did.
- Standard Operating Procedures (SOPs)
SOPs contain the human subroutines you use to do your job. They should have an implementation plan, a test plan, and a backout plan. If an SOP is too complicated to document clearly, then consider simplifying the process, either through scripting and automation, or by splitting the SOP into two parts.
- Training new sysadmins
You should be able to hand your SOPs to anyone trained in the arts of systems administration and have them follow the procedure. A good way of testing your SOPs is to hand them to the junior people on your team and ask them to carry out the procedure while you supervise. Their questions will tell you which areas of your documentation need work.
Keeping a log
- Work log
- Change log
- Reminding yourself
- Billing
- Supporting headcount
- ..and if that fails, resume fodder. ;)
Extending to remote sites
- Console Servers
Nowadays, I buy servers with ILOMs.
- Remote Power
Again, systems with ILOMs.
- Remote Backup
Replicate data wherever possible. Try to make remote sites more-or-less expendable.
- Service Contracts
- Colocation
- Outsourcing
- Reliable Hardware
- Redundancy
Mergers and Acquisitions
- Taking over other machines
- Taking over other sysadmins
- There _will_ be down-time
- Learning about new environments
Politics and Organizational changes
- Communication with management
- Communication with coworkers
- Communication with users
- Defining your job and responsibilities
Working too much!
- Dealing with interruptions
- Understanding your priorities
- Know when to walk away
Guiding principles
- Choose applications based on "openness". That is, ones that support published protocols. Decide on the protocols, so that users can choose their own applications. Most users won't have a strong preference on application, but those that do will think of you each time they're forced to use an application that they don't want to use.
- Choose hardware based on flexibility. Disk arrays should be compatible with many different types of hardware. Computers should be able to run many OSes (SparcLinux, Solaris, Linux, *BSD). Favor upgradability (buy the cheap version and keep adding on). Hardware should be duplicated in a test lab, which provides a development and learning environment, a pre-production test environment, and a spares kit.
- Don't do anything that can't be scripted. If a vendor wishes to sell you something that can't be configured via script, choose another product. Disk arrays must have a command line tool. Scripts should be under revision control.
- Constantly monitor uptime and resources to determine which hosts are overloaded and which can handle more load.
- Use CNAMEs for every application. This makes it even easier to relocate applications. For services that require fixed IP addresses, use IP aliases.
- Separate application binaries from application data. Store application data in a place that can be easily duplicated. Databases are good. We install applications so that location doesn't matter; why treat data any differently?
- Take note of files that change from the initial installation. Keep them under revision control and make sure you can easily reproduce the change when the application moves to a new machine, or the existing machine is reloaded from scratch.
- Treat systems as though they're going to be reloaded from scratch tomorrow. When a system is down, you're graded on how quickly you can get the services it runs back up.
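The principle about taking note of files that change from the initial installation can be sketched as a checksum baseline: snapshot a tree right after install, snapshot it again later, and compare. This is an illustration only, not how Tripwire or radmind work internally, and the function names are hypothetical.

```python
import hashlib
import os


def snapshot(root):
    """Map each regular file under root (by relative path) to its SHA-256 digest."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and special files; only regular file contents are hashed.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            with open(path, "rb") as fh:
                digests[os.path.relpath(path, root)] = hashlib.sha256(fh.read()).hexdigest()
    return digests


def changed_files(baseline, current):
    """Return sorted paths that were added, removed, or modified since the baseline."""
    paths = set(baseline) | set(current)
    return sorted(p for p in paths if baseline.get(p) != current.get(p))
```

The paths this reports are the candidates for revision control: they are what you would have to reproduce when the application moves or the machine is reloaded from scratch.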
