Notes from the “A Day in the Life of Facebook Operations” presentation by Tom Cook, Systems Engineer at Facebook, at the Surge 2010 conference.
So far this is the most attended session: standing room only, even before it started.
What do Facebook sysadmins have to support?
700 billion minutes spent on the site each month
6 billion pieces of content updated
3 billion photos
1 million Facebook Connect implementations
1/2 billion active users
Infrastructure Growth
fb reached a limit on leasing datacenter space
fb is building their own datacenter: http://www.facebook.com/prinevilledatacenter
currently serving out of California and Virginia
Initially a LAMP stack. LB -> Web Servers -> Services/Memcached/Databases
Originally Facebook was a simple Apache + PHP site. When fb started hitting the limits of that, they started compiling PHP to C++ (HipHop for PHP).
FB claims to have the biggest memcached deployment in the world: they serve 300 terabytes of cached data out of memory (cache-aside sketch below).
One of the MySQL improvements contributed back upstream is Flashcache.
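The memcached tier sits in front of MySQL, and the usual pattern is cache-aside: check memcached first, fall back to the database on a miss, then repopulate the cache. A minimal Python sketch of that pattern using pymemcache; the key scheme and the stand-in database function are invented for illustration, this is the generic pattern rather than Facebook's actual client:

    from pymemcache.client.base import Client

    cache = Client(("localhost", 11211))          # one memcached instance (placeholder address)

    def fetch_user_from_db(user_id: int) -> bytes:
        # Stand-in for the real MySQL query that sits behind the cache tier.
        return f"user:{user_id}:profile-data".encode()

    def get_user(user_id: int) -> bytes:
        key = f"user:{user_id}"
        value = cache.get(key)                    # 1. try memcached first
        if value is None:
            value = fetch_user_from_db(user_id)   # 2. miss: hit the database
            cache.set(key, value, expire=300)     # 3. repopulate with a TTL
        return value

    print(get_user(42))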
Services supported
News Feed
Search
Cache
Service implementation languages
C++
PHP - front end
python
Ruby
Java
Erlang (chat)
How do these services talk to each other? JSON? SOAP? No, fb implemented Thrift: a lightweight software framework for cross-language development, the common glue behind all Facebook systems.
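A rough sketch of what a Thrift call looks like from Python. The service and method here are hypothetical (not a real Facebook interface), and the stub import assumes the code generated by `thrift -gen py` is on the path:

    # Hypothetical IDL, e.g. profile.thrift:
    #   service ProfileService {
    #     string getName(1: i64 userId)
    #   }
    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from profile import ProfileService           # generated stub (assumed on the path)

    transport = TTransport.TBufferedTransport(TSocket.TSocket("localhost", 9090))
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    client = ProfileService.Client(protocol)

    transport.open()
    print(client.getName(42))    # the same call shape is available from PHP, C++, Java, ...
    transport.close()

The point is that the wire format and the client/server scaffolding come for free in every language, which is what lets the PHP front end talk to C++, Java and Erlang services uniformly.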
For Systems, what does fb have to worry about on a daily basis?
deployment
monitoring
data management
core operating system updates
Facebook OS is…. CentOS!
Systems Management
Configuration Management
CFEngine for system management (see the sketch after this list)
On Demand
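CFEngine's model is convergent: you declare the desired state, and the agent keeps nudging each machine toward it on every run. A toy Python sketch of that idea; this is not CFEngine syntax, and the file path and service name are invented:

    import os
    import subprocess

    # Desired state: file contents we want on every machine (path/content invented).
    DESIRED_FILES = {
        "/etc/myapp/app.conf": "max_connections = 512\n",
    }

    def converge_file(path: str, content: str) -> None:
        """Rewrite the file only if it has drifted from the desired content."""
        current = open(path).read() if os.path.exists(path) else None
        if current != content:
            with open(path, "w") as f:
                f.write(content)

    def converge_service(name: str) -> None:
        """Start the service only if it is not already running (idempotent)."""
        running = subprocess.run(["pgrep", "-x", name], capture_output=True).returncode == 0
        if not running:
            subprocess.run(["service", name, "start"], check=False)

    for path, content in DESIRED_FILES.items():
        converge_file(path, content)
    converge_service("myapp")    # invented service name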
Deployments
Web Push - new code gets deployed to fb at least once a day. It's a coordinated push: everyone is aware, the dev team gets notified, and everyone sits on IRC during the push. The process is understood by engineers and the rest of the company.
the push software is built on top of their on-demand control tools (see the sketch after this list)
code distributed via internal BitTorrent swarm
the PHP gets compiled, and the resulting few-hundred-MB binary is rapidly pushed out via BitTorrent
it takes one minute to push across the entire network
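The push itself rides on those on-demand control tools. A very rough sketch of the fan-out idea: run the same fetch command on many web servers in parallel and collect the results. The hostnames and the `torrentctl` command are placeholders, not Facebook's tooling, and plain ssh stands in for their internal control channel:

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    HOSTS = [f"web{i:03d}.example.com" for i in range(1, 101)]   # placeholder host list
    CMD = "torrentctl fetch /releases/www.tar.bz2"               # placeholder fetch command

    def run_on(host: str):
        # ssh out and run the command; collect the exit code per host.
        proc = subprocess.run(["ssh", host, CMD], capture_output=True)
        return host, proc.returncode

    with ThreadPoolExecutor(max_workers=50) as pool:
        for host, rc in pool.map(run_on, HOSTS):
            status = "ok" if rc == 0 else f"failed (rc={rc})"
            print(f"{host}: {status}")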
Backend Deployments - only Engineering and Operations are involved. Engineers write, test, and deploy their own code.
Quickly make performance decisions
Expose changes to a subset of real traffic (see the rollout sketch after this list)
No ‘commit and quit’
Deeply involved in moving services to production
Ops ‘embedded’ into engineering teams
Heavy change logging - pinpointing the code behind every push and change
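A tiny sketch of what "expose changes to a subset of real traffic" can look like: hash the user id into a stable bucket and only turn the feature on below a threshold. The feature name and percentage are made up; Facebook's real gating system is internal:

    import hashlib

    def exposed(user_id: int, feature: str, percent: float) -> bool:
        """Deterministically expose `feature` to roughly `percent`% of users."""
        digest = hashlib.md5(f"{feature}:{user_id}".encode()).hexdigest()
        bucket = int(digest[:8], 16) % 10000      # stable bucket in [0, 9999]
        return bucket < percent * 100             # e.g. percent=1.0 -> ~1% of users

    # Roll a hypothetical change out to 1% of users first, then widen it.
    for uid in (1, 42, 1337, 424242):
        print(uid, exposed(uid, "new_feed_ranker", percent=1.0))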
Monitoring and Metrics of servers and performance at facebook
Ganglia - aggregated metrics (gmetric sketch after this list)
fast
straightforward
nested grids & pools
over 5 million monitored metrics
Facebook also has an in-house monitoring system on top of this
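Custom numbers can be fed into Ganglia with the stock gmetric tool. A small sketch of pushing an application-level gauge from Python; the metric name and value are invented:

    import subprocess

    def emit_metric(name: str, value: float, units: str = "count") -> None:
        """Publish one custom metric to the local Ganglia agent via gmetric."""
        subprocess.run(
            ["gmetric",
             f"--name={name}",
             f"--value={value}",
             "--type=double",
             f"--units={units}"],
            check=True,
        )

    # Invented application-level metric, purely for illustration.
    emit_metric("php_workers_busy", 312, units="workers")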
Monitoring - facebook still uses Nagios!
To manage the complexity and sheer number of alarms and monitored systems, the fb team relies on aggregation. Initially, alarms were handled over email.
Scribe - a high-performance logging application. They initially used syslog. Hadoop and Hive are also used.
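Scribe speaks Thrift as well. A sketch of what logging a line from Python can look like, assuming the thrift-generated scribe bindings are on the path; the category name is invented and the details are illustrative rather than Facebook's production setup:

    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from scribe import scribe                     # thrift-generated bindings (assumed)

    socket = TSocket.TSocket(host="localhost", port=1463)     # commonly used scribed port
    transport = TTransport.TFramedTransport(socket)
    protocol = TBinaryProtocol.TBinaryProtocol(transport, strictRead=False, strictWrite=False)
    client = scribe.Client(iprot=protocol, oprot=protocol)

    transport.open()
    entry = scribe.LogEntry(category="www_errors", message="something broke\n")
    client.Log(messages=[entry])                  # scribed buffers and forwards downstream
    transport.close()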
How does the work get done?
clear delineation of dependencies and responsibilities
Constant Failure
Handling failure started at the level of individual servers; then the focus moved to whole racks
Now the focus is on clusters: a logical delineation based on function (web, db, feed, etc.)
The next stage is entire datacenters - what do you do if a natural disaster strikes?
Constant Communication - information is shared constantly.
IRC
lots of automated bots that get and set data
internal news updates
“Headers” on internal tools
Change log/feeds
Small teams
Interesting fact - each fb server gets an update on average every eight minutes.
Busiest day for FB is the day after Halloween :)
URLs to check out: