lmbench - tools for performance analysis
	.sp .5
\s+2Larry McVoy\s0
	.sp -.5
lm@sgi.com
	.sp 3
Silicon Graphics Engineering
	.bp
Outline
	What's lmbench?
	Why do I care?
	Example benchmark
	Benchmark timing interfaces
	Results
	Getting lmbench
	What's next for lmbench?
What's lmbench?
	Suite of micro benchmarks
	Measures system performance
		Latency and bandwidth measurements
	Measurements focus on OS and hardware
		Not marketing numbers
		Not theoretical numbers
Why do I care?
	Results for lots of current systems
		P6, UltraSPARC, R10K, Alphas, latest Linux
		Apples to apples comparison
	Benchmark performance predicts application performance
	Benchmarks are available and easily extended
		Very useful as base for new benchmarks
		Check out src/timing.c
	Used by Sun, SGI, Intel, and Linus for performance analysis
		Found bugs, too...
Example benchmark
	.so bench.c
Benchmark timing interfaces
	start(), usecs = stop(), usecs = now(), usecs = delta()
	usecs = adjust(usecs)
	bandwidth(bytes, verbose)
		10.2 MB in .5 secs, 20.4 MB/sec
	latency(xfers, size)
		1000 xfers in 2.0 seconds, 2.000 millisec/xfer, 10.5 KB/sec
	Etc.
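The arithmetic behind the two sample report lines above can be sketched as follows. This is an illustration only, not the contents of src/timing.c: the names sketch_start(), sketch_stop(), sketch_mb_per_sec() and sketch_ms_per_xfer() are hypothetical, gettimeofday() is assumed as the clock, and 1 MB is assumed to mean 2^20 bytes.

```c
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical helpers for illustration; not the lmbench interfaces. */
static struct timeval sketch_t0;

/* Record the starting time. */
void sketch_start(void)
{
    gettimeofday(&sketch_t0, NULL);
}

/* Return the microseconds elapsed since sketch_start(). */
double sketch_stop(void)
{
    struct timeval t1;

    gettimeofday(&t1, NULL);
    return (t1.tv_sec - sketch_t0.tv_sec) * 1e6
         + (t1.tv_usec - sketch_t0.tv_usec);
}

/* Bandwidth in MB/sec, taking 1 MB as 2^20 bytes (an assumption). */
double sketch_mb_per_sec(double bytes, double usecs)
{
    return (bytes / (1024.0 * 1024.0)) / (usecs / 1e6);
}

/* Latency in milliseconds per transfer. */
double sketch_ms_per_xfer(double xfers, double usecs)
{
    return (usecs / 1e3) / xfers;
}
```

With these definitions, 10.2 MB moved in .5 secs works out to 20.4 MB/sec, and 1000 xfers in 2.0 seconds works out to 2.000 millisec/xfer, matching the sample lines.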
Results
	Systems measured
	Memory latency graphs
	Context switching graphs
	Process creation & signaling
	Interprocess communication latency
	File & VM system latencies
	Communication bandwidths
Systems measured
	.ps 14
	.vs 16
	.sp .5i
	.po -.25i
	.so systems.tbl
Memory latencies
	How is memory latency measured?
	Ultra, P5, P6, K210
	Memory latency summary
How is memory latency measured?
	Build a circular linked list in an array
		Each entry points to the next (wraps at end)
		Walk it in an unrolled loop of dependent loads
			C src: ``p = *p;''
		Vary stride & array size
	Plot latency against array size on a log base 2 scale
	Cache sizes & latencies are apparent
	Cache line size can be seen
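A minimal sketch of the pointer-chasing idea described above, under stated assumptions. The names ring_build() and ring_walk() are hypothetical, the stride is assumed to be a nonzero multiple of sizeof(char *) with size >= stride, and the walk loop is left rolled for brevity where the real benchmark unrolls it.

```c
#include <stddef.h>
#include <stdlib.h>   /* malloc/free, for the caller */

/*
 * Link the entries that sit `stride` bytes apart in buf into a ring:
 * each entry holds the address of the next, and the last wraps to the
 * first. Assumes stride is a nonzero multiple of sizeof(char *) and
 * size >= stride.
 */
char **ring_build(char *buf, size_t size, size_t stride)
{
    size_t n = size / stride;
    size_t k;

    for (k = 0; k < n; k++) {
        char **slot = (char **)(buf + k * stride);
        *slot = buf + ((k + 1) % n) * stride;
    }
    return (char **)buf;
}

/*
 * Follow the chain for `hops` dependent loads. Returning the final
 * pointer keeps the compiler from discarding the loop.
 */
char **ring_walk(char **p, long hops)
{
    while (hops-- > 0)
        p = (char **)*p;
    return p;
}
```

Timing ring_walk() and dividing the elapsed time by hops gives a per-load latency for one (array size, stride) pair; repeating over a range of sizes and strides yields the points for the graphs.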
Sun Ultra1 memory latencies
	.sp -.2i
	.so ../../Results/tmp/mem.ultrasparc.10.pic
Asus P5 memory latencies
	.sp -.2i
	.so ../../Results/tmp/mem.P5-133.2.pic
Intel P6 memory latencies
	.sp -.2i
	.so ../../Results/tmp/mem.P6.pic
HP K210 memory latencies
	.sp -.2i
	.so ../../Results/tmp/mem.K210.pic
Memory latency summary
	.ps 14
	.vs 16
	.sp 1i
	.po -.24i
	.so lat_allmem.tbl
Context switching
	How is context switching measured?
	P6, R10K, K210, Alpha
How is context switching measured?
	Build a ring of processes linked with pipes
		Pass token through the ring
		Processes may have an artificial ``footprint''
		Vary footprint size and number of processes
	Plot as multiple lines
		Y is context switch time
		X is number of processes
		Each line is a different footprint size
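The ring of processes can be sketched roughly as below. This is an assumption-laden illustration rather than the lmbench source: ring_pass() and MAX_PROCS are hypothetical names, the token is a long that each child increments so the result can be checked, and the artificial footprint and the timing calls are left out.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_PROCS 64   /* arbitrary limit for this sketch */

/*
 * Pass a token `laps` times around a ring of `nprocs` processes linked
 * by pipes. Pipe i carries the token from process i to process i+1;
 * the caller is process 0. Each child increments the token, so the
 * value returned is laps * (nprocs - 1). Returns -1 on error.
 */
long ring_pass(int nprocs, int laps)
{
    int fds[MAX_PROCS][2];
    long token = 0;
    int i, j, in, out;

    if (nprocs < 1 || nprocs > MAX_PROCS)
        return -1;
    for (i = 0; i < nprocs; i++)
        if (pipe(fds[i]) < 0)
            return -1;

    for (i = 1; i < nprocs; i++) {
        pid_t pid = fork();
        if (pid < 0)
            return -1;
        if (pid == 0) {
            /* Child i: keep only its own input and output ends. */
            in = fds[i - 1][0];
            out = fds[i][1];
            for (j = 0; j < nprocs; j++) {
                if (fds[j][0] != in)
                    close(fds[j][0]);
                if (fds[j][1] != out)
                    close(fds[j][1]);
            }
            while (read(in, &token, sizeof token) == (ssize_t)sizeof token) {
                token++;
                if (write(out, &token, sizeof token) != (ssize_t)sizeof token)
                    break;
            }
            _exit(0);   /* EOF on input: the ring is shutting down */
        }
    }

    /* Process 0: write into pipe 0, read back from the last pipe. */
    in = fds[nprocs - 1][0];
    out = fds[0][1];
    for (j = 0; j < nprocs; j++) {
        if (fds[j][0] != in)
            close(fds[j][0]);
        if (fds[j][1] != out)
            close(fds[j][1]);
    }
    for (i = 0; i < laps; i++) {
        if (write(out, &token, sizeof token) != (ssize_t)sizeof token ||
            read(in, &token, sizeof token) != (ssize_t)sizeof token) {
            token = -1;
            break;
        }
    }
    close(out);   /* EOF ripples around the ring and the children exit */
    close(in);
    for (i = 1; i < nprocs; i++)
        wait(NULL);
    return token;
}
```

Wrapping the lap loop in a timer and dividing by laps * nprocs gives a rough per-hop cost, which includes the pipe read and write as well as the context switch.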
P6/Linux context switch times
	.sp -.2i
	.so ../../Results/tmp/ctx.P6.pic
R10K context switch times
	.sp -.2i
	.so ../../Results/tmp/ctx.R10K.pic
K210 context switch times
	.sp -.2i
	.so ../../Results/tmp/ctx.K210.pic
Alpha context switch times
	.sp -.2i
	.so ../../Results/tmp/ctx.8400.pic
Process creation & signaling
	Null syscall
		write an int to /dev/null
	Null process
		fork/exit/wait loop
	Simple process
		fork/exec/exit/wait loop
	/bin/sh process
		fork/exec sh -c/exit/wait loop
	Signal installation & handling
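The first two loops listed above might look like the sketch below. The function names null_syscalls() and null_procs() are hypothetical and timing is omitted; each function returns the number of iterations completed so a caller can divide elapsed time by that count.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* "Null syscall": write an int to /dev/null n times.
 * Returns the number of writes completed, or -1 if the open fails. */
int null_syscalls(int n)
{
    int fd = open("/dev/null", O_WRONLY);
    int word = 0;
    int done = 0;

    if (fd < 0)
        return -1;
    while (done < n &&
           write(fd, &word, sizeof word) == (ssize_t)sizeof word)
        done++;
    close(fd);
    return done;
}

/* "Null process": fork a child that exits at once, then wait for it.
 * Returns the number of fork/exit/wait cycles completed. */
int null_procs(int n)
{
    int done = 0;

    while (done < n) {
        pid_t pid = fork();
        if (pid < 0)
            break;
        if (pid == 0)
            _exit(0);
        if (waitpid(pid, NULL, 0) != pid)
            break;
        done++;
    }
    return done;
}
```

The simple-process and /bin/sh variants would add an exec call in the child before it exits.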
Process creation & signaling summary
	.ps 14
	.vs 16
	.sp 1i
	.po -.24i
	.so proc.tbl
Interprocess communication latency
	Round trip includes 2 process context switches
	Hot potato test
		Loop: I send to you, you send to me
	Measured using pipes, UDP, TCP, RPC
	Loopback numbers
Interprocess communication summary
	.ps 14
	.vs 16
	.sp 1i
	.po -.24i
	.so ipc.tbl
File and VM system latencies
	File create and delete time
	mmap(2) latency
	Protection fault latency (new)
	Page fault latency
File and VM system summary
	.ps 14
	.vs 16
	.sp 1i
	.po -.24i
	.so fs_vm.tbl
Communication bandwidths
	MB/sec moved through
		pipes
		TCP sockets
		file system
			read & mmap interfaces
		bcopy
			libc & hand unrolled
		memory
			read & write
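As an illustration of the "hand unrolled" copy that is compared against the libc routine, here is a minimal sketch. The name copy_unrolled() and the unroll factor of eight are assumptions made for this example, not details taken from the benchmark source.

```c
#include <stddef.h>
#include <string.h>   /* memcpy, for the libc side of the comparison */

/*
 * Copy n longs from src to dst, eight per loop iteration, then finish
 * any remainder one at a time. The regions must not overlap.
 */
void copy_unrolled(long *dst, const long *src, size_t n)
{
    size_t i = 0;

    for (; i + 8 <= n; i += 8) {
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
        dst[i + 4] = src[i + 4];
        dst[i + 5] = src[i + 5];
        dst[i + 6] = src[i + 6];
        dst[i + 7] = src[i + 7];
    }
    for (; i < n; i++)
        dst[i] = src[i];
}
```

Timing copy_unrolled() and memcpy() over the same buffers, then dividing bytes moved by elapsed time, gives the two copy bandwidth figures side by side.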
Communication bandwidth summary
	.ps 14
	.vs 16
	.sp 1i
	.po -.24i
	.so bw.tbl
Getting lmbench
	On the Web
		http://reality.sgi.com/employees/lm/
		Tar file is there
		Results are there
		USENIX paper is there
		Updated USENIX paper coming
What's next for lmbench?
	Finer tuning of the benchmarks
		Too much variance across runs
	Bandwidth measurements of L1, L2 caches
	Fix memory latency benchmark so that prefetch doesn't help
	MP measurements
		I.e., cache to cache latency (spinlocks)
	Open for suggestions...
lmbench - tools for performance analysis
	.sp .5
\s+2Larry McVoy\s0
	.sp -.5
lm@sgi.com
	.sp 2
Silicon Graphics Engineering
	.sp
\s+2Carl Staelin\s0
	.sp -.5
staelin@hpl.hp.com
	.sp -.25
	.so_nobp HP
	.sp -.75
Hewlett-Packard Laboratories
	.bp
