dl.google.com: Powered by Go
26 July 2013
Brad Fitzpatrick
Gopher, Google

dl.google.com serves Google's downloads, including the Linux package repositories fetched by, e.g.:

$ apt-get update
each "payload" (~URL) described by a protobuf:
Moving bytes from a source to a destination is a single call:

n, err := io.Copy(dst, src)
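As a minimal, self-contained illustration (not from the talk) of what that one line does:

package main

import (
    "io"
    "log"
    "os"
    "strings"
)

func main() {
    src := strings.NewReader("hello from io.Copy\n")
    // io.Copy shovels bytes from src to os.Stdout until EOF and
    // reports how many bytes it moved.
    n, err := io.Copy(os.Stdout, src)
    if err != nil {
        log.Fatal(err)
    }
    log.Printf("copied %d bytes", n)
}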
First step: rewrite the payload_server, not the payload_fetcher (the payload_fetcher is still running at this point). A later step eliminates the payload_fetcher entirely; fast start-up time.

The server is built on the standard net/http package. A minimal "Hello, world" HTTP server:

package main

import (
    "fmt"
    "log"
    "net/http"
    "os"
)

func handler(w http.ResponseWriter, r *http.Request) {
    fmt.Fprintf(os.Stdout, "%s details: %+v\n", r.URL.Path, r)
    fmt.Fprintf(w, "Hello, world! at %s\n", r.URL.Path)
}

func main() {
    log.Printf("Running...")
    log.Fatal(http.ListenAndServe("127.0.0.1:8080", http.HandlerFunc(handler)))
}
Serving a directory tree is just http.FileServer:

package main

import (
    "log"
    "net/http"
    "os"
    "path/filepath"
)

func main() {
    log.Printf("Running...")
    log.Fatal(http.ListenAndServe(
        "127.0.0.1:8080",
        http.FileServer(http.Dir(
            filepath.Join(os.Getenv("HOME"), "go", "doc")))))
}
Range requests come for free:

$ curl -H "Range: bytes=5-" http://localhost:8080

And it's not just files: http.ServeContent serves any io.ReadSeeker, handling Range and conditional requests:

package main

import (
    "log"
    "net/http"
    "strings"
    "time"
)

func main() {
    log.Printf("Running...")
    err := http.ListenAndServe("127.0.0.1:8080", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        http.ServeContent(w, r, "foo.txt", time.Now(),
            strings.NewReader("I am some content.\n"))
    }))
    log.Fatal(err)
}
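As a quick check of that behavior (this snippet is not from the talk; it uses net/http/httptest), the Range request above should yield a 206 Partial Content with just the requested bytes:

package main

import (
    "fmt"
    "net/http"
    "net/http/httptest"
    "strings"
    "time"
)

func main() {
    h := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        http.ServeContent(w, r, "foo.txt", time.Now(),
            strings.NewReader("I am some content.\n"))
    })

    // Issue the same Range request as the curl example, in-process.
    req := httptest.NewRequest("GET", "http://example.com/", nil)
    req.Header.Set("Range", "bytes=5-")
    rec := httptest.NewRecorder()
    h.ServeHTTP(rec, req)

    fmt.Println(rec.Code)                          // expect 206
    fmt.Println(rec.Header().Get("Content-Range")) // expect "bytes 5-18/19"
    fmt.Print(rec.Body.String())                   // expect "some content.\n"
}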
For caching, dl.google.com uses groupcache. First, declare who you are and who your peers are.
me := "http://10.0.0.1" peers := groupcache.NewHTTPPool(me) // Whenever peers change: peers.Set("http://10.0.0.1", "http://10.0.0.2", "http://10.0.0.3")
This peer interface is pluggable. (e.g. inside Google it's automatic.)
Declare a group. (group of keys, shared between group of peers)
var thumbNails = groupcache.NewGroup("thumbnail", 64<<20, groupcache.GetterFunc(
    func(ctx groupcache.Context, key string, dest groupcache.Sink) error {
        fileName := key
        dest.SetBytes(generateThumbnail(fileName))
        return nil
    }))
Request keys
var data []byte
err := thumbNails.Get(ctx, "big-file.jpg",
    groupcache.AllocatingByteSliceSink(&data))
// ...
http.ServeContent(w, r, "big-file-thumb.jpg", modTime, bytes.NewReader(data))
Payloads are served by composing ReaderAts:

// A SizeReaderAt is a ReaderAt with a Size method.
//
// An io.SectionReader implements SizeReaderAt.
type SizeReaderAt interface {
    Size() int64
    io.ReaderAt
}

// NewMultiReaderAt is like io.MultiReader but produces a ReaderAt
// (and Size), instead of just a reader.
func NewMultiReaderAt(parts ...SizeReaderAt) SizeReaderAt {
    m := &multi{
        parts: make([]offsetAndSource, 0, len(parts)),
    }
    var off int64
    for _, p := range parts {
        m.parts = append(m.parts, offsetAndSource{off, p})
        off += p.Size()
    }
    m.size = off
    return m
}
// NewChunkAlignedReaderAt returns a ReaderAt wrapper that is backed
// by a ReaderAt r of size totalSize where the wrapper guarantees that
// all ReadAt calls are aligned to chunkSize boundaries and of size
// chunkSize (except for the final chunk, which may be shorter).
//
// A chunk-aligned reader is good for caching, letting upper layers have
// any access pattern, but guarantees that the wrapped ReaderAt sees
// only nicely-cacheable access patterns & sizes.
func NewChunkAlignedReaderAt(r SizeReaderAt, chunkSize int) SizeReaderAt {
    // ...
}
r only sees ReadAt calls on 2MB offset boundaries, of size 2MB (unless it's the final chunk).
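The body of NewChunkAlignedReaderAt is elided above. As a sketch only (this is not the dl.google.com implementation, and it omits the caching layer that makes the alignment worthwhile), a wrapper meeting that contract could look like this, reusing the SizeReaderAt interface defined above:

// chunkAligned is a hypothetical sketch, not the production code.
// It turns arbitrary ReadAt calls into whole-chunk, chunk-aligned reads
// on the underlying SizeReaderAt.
type chunkAligned struct {
    r         SizeReaderAt
    chunkSize int64
}

// newChunkAlignedSketch mirrors the NewChunkAlignedReaderAt signature above.
func newChunkAlignedSketch(r SizeReaderAt, chunkSize int) SizeReaderAt {
    return &chunkAligned{r: r, chunkSize: int64(chunkSize)}
}

func (c *chunkAligned) Size() int64 { return c.r.Size() }

func (c *chunkAligned) ReadAt(p []byte, off int64) (n int, err error) {
    if off < 0 || off >= c.Size() {
        return 0, io.EOF
    }
    for len(p) > 0 && off < c.Size() {
        // Identify the chunk containing off; the final chunk may be short.
        chunkStart := off - off%c.chunkSize
        chunkLen := c.chunkSize
        if chunkStart+chunkLen > c.Size() {
            chunkLen = c.Size() - chunkStart
        }
        // Read the whole chunk: this is the aligned, full-size access
        // that the wrapper guarantees the underlying reader will see.
        chunk := make([]byte, chunkLen)
        rn, rerr := c.r.ReadAt(chunk, chunkStart)
        if int64(rn) < chunkLen {
            if rerr == nil {
                rerr = io.ErrUnexpectedEOF
            }
            return n, rerr
        }
        // Copy out only the bytes the caller asked for.
        copied := copy(p, chunk[off-chunkStart:])
        n += copied
        p = p[copied:]
        off += int64(copied)
    }
    if len(p) > 0 {
        err = io.EOF
    }
    return n, err
}

The point of the alignment is that those whole-chunk reads are exactly the shape a chunk cache such as groupcache wants to see.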
The rest of the source file behind the NewMultiReaderAt snippet follows: the HTTP handler that serves a concatenation of readers via ServeContent, and the multi type whose ReadAt does the real work.

package main
import (
"io"
"log"
"net/http"
"sort"
"strings"
"time"
)
var modTime = time.Unix(1374708739, 0)
func part(s string) SizeReaderAt {
    return io.NewSectionReader(strings.NewReader(s), 0, int64(len(s)))
}

func handler(w http.ResponseWriter, r *http.Request) {
    sra := NewMultiReaderAt(
        part("Hello, "),
        part(" world! "),
        part("You requested "+r.URL.Path+"\n"),
    )
    rs := io.NewSectionReader(sra, 0, sra.Size())
    http.ServeContent(w, r, "foo.txt", modTime, rs)
}
func main() {
log.Printf("Running...")
http.HandleFunc("/", handler)
log.Fatal(http.ListenAndServe("127.0.0.1:8080", nil))
}
type offsetAndSource struct {
off int64
SizeReaderAt
}
type multi struct {
parts []offsetAndSource
size int64
}
func (m *multi) Size() int64 { return m.size }
func (m *multi) ReadAt(p []byte, off int64) (n int, err error) {
wantN := len(p)
// Skip past the requested offset.
skipParts := sort.Search(len(m.parts), func(i int) bool {
// This function returns whether parts[i] will
// contribute any bytes to our output.
part := m.parts[i]
return part.off+part.Size() > off
})
parts := m.parts[skipParts:]
// How far to skip in the first part.
needSkip := off
if len(parts) > 0 {
needSkip -= parts[0].off
}
for len(parts) > 0 && len(p) > 0 {
readP := p
partSize := parts[0].Size()
if int64(len(readP)) > partSize-needSkip {
readP = readP[:partSize-needSkip]
}
pn, err0 := parts[0].ReadAt(readP, needSkip)
if err0 != nil {
return n, err0
}
n += pn
p = p[pn:]
if int64(pn)+needSkip == partSize {
parts = parts[1:]
}
needSkip = 0
}
if n != wantN {
err = io.ErrUnexpectedEOF
}
return
}
In the end: just the payload_server, no payload_fetcher.

groupcache is now open source (github.com/golang/groupcache).

Brad Fitzpatrick
Gopher, Google