#%include "default.mgp"
%tab 1 size 6, vgap 40, prefix "  ", icon box "green" 50
%tab 2 size 5, vgap 40, prefix "     ", icon arc "yellow" 50
%tab 3 size 4, vgap 40, prefix "           --", icon delta3 "white" 40
%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%tfont "/usr/share/fonts/default/TrueType/timr____.ttf"
%fore "white"
%center


%size 8
Linux AIO Performance and Robustness for Enterprise Workloads

%fore "darkorange"
%size 4
Suparna Bhattacharya
John Tran
Mike Sullivan
Chris Mason
%size 4
{suparna@in, jbtran@us, mksully@us}.ibm.com, mason@suse.com

%left
%size 8
%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
Overview of kernel AIO in Linux(tm) 2.6
%fore "skyblue"

	AIO overlaps processing with I/O operations
		improved utilization of CPU and devices 
		especially under variable loads
%pause

	AIO syscalls
		io_setup(max_events, &ctx)
		io_submit(ctx, nr, iocbs[]) 
			IO_CMD_PREAD, IO_CMD_PWRITE, IO_CMD_POLL
		io_getevents(ctx, min_nr, nr, events[], timeout)
		io_destroy(ctx)
		io_cancel(ctx, iocb, &result)

%%%%%%%%%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
Retry-based AIO Recap
%fore "skyblue"

	Sync I/O and AIO share a common code path
		Async io_wait context for AIO
			Allocated as part of kiocb
			Replaces default on-stack wait queue entry for sync I/O
		Retry exit for AIO instead of blocking 
			Return number of bytes completed or -EIOCBRETRY	
%pause

	AIO proceeds as a series of non-blocking iterations
		Retry kicked via async wait queue callback on wakeup
			Reissues fop->aio_read/write 
			Modified arguments representing the remaining I/O
		Worker threads use caller's address space during retries
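%%%%%%%%%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
Retry Model: A User-space Sketch
%fore "skyblue"
%size 3

The retry iterations can be pictured with a small user-space model. This is a sketch, not kernel code.
Kiocb, aio_read, run and the value chosen for EIOCBRETRY are illustrative assumptions.
Each entry in arrivals stands for one wakeup and says how many bytes became ready.

```python
EIOCBRETRY = -1   # sentinel for this model only, not the kernel's errno value

class Kiocb:
    """Toy request: 'left' bytes still to transfer, 'done' bytes finished."""
    def __init__(self, nbytes):
        self.left = nbytes
        self.done = 0

def aio_read(iocb, ready):
    """One non-blocking pass: consume what is ready, never block."""
    n = min(iocb.left, ready)
    if n == 0:
        return EIOCBRETRY          # would have blocked: take the retry exit
    iocb.left -= n                 # arguments now describe the remaining I/O
    iocb.done += n
    return n

def run(iocb, arrivals):
    """Each wakeup kicks one retry; stop once nothing is left."""
    passes = 0
    for ready in arrivals:
        passes += 1
        aio_read(iocb, ready)
        if iocb.left == 0:
            break
    return iocb.done, passes
```

For example, run(Kiocb(16384), [0, 4096, 0, 12288]) completes the 16KB read in four passes.
%size 8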

%%%%%%%%%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
Impact of Readahead on AIO Reads
%fore "skyblue"
%size 4
%fore "green"
          Current		Ahead
    ----|-------|--------|----------------|-----
        ^start          ^start+size
                          ^ahead_start     ^ahead_start+ahead_size
                  ^preoffset
%fore "skyblue"
%pause

%size 6
	Impact on Streaming Random AIO Reads
		fop->aio_read(fd, o1, 16384) = -EIOCBRETRY
			Readahead o1 to o1+64KB, pre = o1 
%pause
		fop->aio_read(fd, o2, 16384) = -EIOCBRETRY
			Readahead o2 to o2+8KB, pre = o2 
%pause
		fop->aio_read(fd, o3, 16384) = -EIOCBRETRY
			No readahead, Slow read 4KB
%pause
		fop->aio_read(fd, o1, 16384) = 16384
%pause
		fop->aio_read(fd, o2, 16384) =  8192
%pause
		fop->aio_read(fd, o3, 16384) =  4096
%size 8
%%%%%%%%%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
Upfront Readahead for AIO
%fore "skyblue"

	Tells readahead logic about all pages in a request
		Doesn't repeat readahead on retries
%pause
	I/O Pattern with Streaming AIO Reads
		fop->aio_read(fd, o1, 16384) = -EIOCBRETRY
			Readahead o1 to o1+64KB, pre = o1+12KB
		fop->aio_read(fd, o2, 16384) = -EIOCBRETRY
			Readahead o2 to o2+20KB, pre = o2+16KB
		fop->aio_read(fd, o1, 16384) = 16384
		fop->aio_read(fd, o2, 16384) = 16384

%pause
	Addressing Sendfile regression 
		Upfront only within max readahead pages
		Restrict to AIO case
%fore "skyblue"
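%%%%%%%%%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
Upfront Readahead: A Sketch
%fore "skyblue"
%size 3

The upfront readahead rules reduce to a small calculation, sketched here in user space.
The function upfront_readahead and its parameters are hypothetical names, and PAGE = 4096
is an assumed page size. None of this is taken from the kernel patch itself.

```python
PAGE = 4096   # assumed page size

def upfront_readahead(offset, nbytes, max_ra_pages, is_aio, is_retry):
    """Return (first_page, npages) to hand to readahead, or None."""
    if not is_aio or is_retry:
        return None                # AIO only, and only on the first pass
    first = offset // PAGE
    last = (offset + nbytes - 1) // PAGE
    npages = min(last - first + 1, max_ra_pages)   # stay within max readahead
    return first, npages
```

With a 16KB request at offset 0 and max_ra_pages = 32 this returns (0, 4) on the first pass
and None on every retry, so the readahead window is not shrunk by repeated calls.
%size 8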

%%%%%%%%%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
Random AIO Read Throughput
%fore "skyblue"
%center
%newimage -scrzoom 70 "aioread-perf.eps" 
%leftfill
%size 4
Filesystem: ext3 4KB blocksize, 1GB file
AIC7896 Ultra2 SCSI
4-way Pentium(tm) III 700MHz, 512MB
%size 8

%%%%%%%%%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
AIO DIO vs Cached I/O Integrity
%fore "skyblue"

	DIO vs Buffered I/O races (2.4 & 2.6) - sct
		Meta-data mods vs actual disk block instantiation
			Atomicity of flush & DIO read - i_sem
			Truncate vs DIO write/read - r/w i_alloc_sem
			DIO fallback to buffered I/O for writes to sparse regions
%pause

	AIO DIO specific races (2.6)
		AIO-DIO read/write vs truncate
			hold i_alloc_sem till I/O completion
		AIO-DIO file extends
			block instantiation atomicity with i_size updates
			force synchronous behaviour
		AIO-DIO writes to sparse regions
			request spanning allocated & sparse regions
			force synchronous behaviour

%%%%%%%%%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
Concurrent Synchronized I/O
%fore "skyblue"

	Synchronized Writes (O_SYNC, fsync)
		I/O committed to disk before request completes	
%pause
	Concurrent DIO writes
		i_sem serializes parallel DIO writes	
			could be released once blocks are looked up and I/O is started (helps streaming AIO-DIO writes)
%pause
	Concurrent O_SYNC buffered writes
		Per-address space page lists (dirty & writeback)
			i_sem held from list traversal until writeback completes: serializes parallel O_SYNC writes, hard to retry-enable for AIO
			races between sync and background writes
		Radix-tree lookup of range to O_SYNC
			avoids i_sem across I/O waits & AIO retry friendly
%%%%%%%%%%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
Tagged Radix-tree based Writeback
%fore "skyblue"

	Lookup dirty or writeback pages in O(log64(n))
		Adds tag bits for each slot to each radix-tree node
			To search keep going down sub-trees under slots with tag set 
		Tagged gang-lookup for in-order searches in a range
			Replaces per-address space page list traversals
%pause

	To synchronize writes to disk ...
		Radix-tree walk and writeout dirty pages in range
		Radix-tree walk and wait on writeback pages in range
		Repeatable logic for wait on writeback for AIO O_SYNC
			Issue all writeouts for the range during first iteration
			Wait for writeback completion converted to a retry exit
			Retries don't re-dirty pages, fall through to wait on writeback step
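%%%%%%%%%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
Tagged Lookup: A Sketch
%fore "skyblue"
%size 3

The tagged search can be illustrated with a toy tree. This is a user-space sketch under stated assumptions:
fixed height, dicts and sets instead of slot arrays and tag bitmaps, and no tag clearing.
Node, path, insert and tagged_lookup are illustrative names, not the kernel's radix-tree interface.

```python
FANOUT, HEIGHT = 64, 3     # 64-way nodes as on the slide; the height is arbitrary

class Node:
    def __init__(self):
        self.slots = {}        # slot -> child Node, or the page at the leaf level
        self.tagged = set()    # slots whose subtree holds a tagged page

def path(index):
    """Slot to follow at each level, top level first (6 bits per level)."""
    return [(index >> (6 * lvl)) & (FANOUT - 1) for lvl in reversed(range(HEIGHT))]

def insert(root, index, page, tagged=False):
    node = root
    for slot in path(index)[:-1]:
        if tagged:
            node.tagged.add(slot)          # propagate the tag along the path
        node = node.slots.setdefault(slot, Node())
    leaf_slot = path(index)[-1]
    node.slots[leaf_slot] = page
    if tagged:
        node.tagged.add(leaf_slot)

def tagged_lookup(node, lo, hi, level=HEIGHT - 1, base=0):
    """Yield (index, page) for tagged pages with lo <= index <= hi, in order."""
    span = FANOUT ** level                 # indices covered by one slot here
    for slot in sorted(node.tagged):
        start = base + slot * span
        if start > hi or start + span - 1 < lo:
            continue                       # subtree lies outside the range
        if level == 0:
            yield start, node.slots[slot]
        else:
            yield from tagged_lookup(node.slots[slot], lo, hi, level - 1, start)
```

Untagged subtrees are never entered, so a range walk touches only nodes on the way to dirty or writeback pages.
%size 8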

%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
Random AIO Write Throughput
%fore "skyblue"
%center
%newimage -scrzoom 70 "aiowrite-perf.eps" 
%leftfill
%size 4
Filesystem: ext3 4KB blocksize, 1GB file
AIC7896 Ultra2 SCSI
4-way Pentium(tm) III 700MHz, 512MB
%size 8
%%%%%%%%%%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
Retry Storms & Filtered Waitqueues
%fore "skyblue"
%size 4

Hashed wait queues & overloaded wakeups can interact with AIO retry-logic in tricky ways
%pause
%fore "green"

    CPU1                                  	CPU2

    lock_page(px)
    ...
    unlock_page(px)
                                                            lock_page(py)
    wait_on_page_writeback_wq(px)         	...
                                          	               unlock_page(py) 
                                                            wakes up px
                                                            triggering a retry
                  <-----------------------------------
   lock_page(px)                                   wait_on_page_writeback_wq(py)
   ...
   unlock_page(px)  ---- wakes up py --- causes retry ---->

%pause
%fore "orange"
Filtered wakeups wake only the waiters for the specific object and reason, so redundant wakeups and the retry storms they trigger (as above) no longer occur.
%fore "skyblue"
%size 8
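%%%%%%%%%%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
Filtered Wakeups: A Sketch
%fore "skyblue"
%size 3

The difference between plain hashed wait queues and filtered wakeups can be shown with a toy model.
This is a user-space sketch only. HashedWaitQueues, the bucket count and the reason strings
are illustrative assumptions, not the kernel's wait queue interface.

```python
NBUCKETS = 4   # deliberately tiny so that different pages collide

class HashedWaitQueues:
    def __init__(self, filtered):
        self.filtered = filtered
        self.buckets = [[] for _ in range(NBUCKETS)]

    def wait(self, waiter, page, reason):
        self.buckets[page % NBUCKETS].append((waiter, page, reason))

    def wake(self, page, reason):
        """Return the waiters woken by an event on (page, reason)."""
        bucket = self.buckets[page % NBUCKETS]
        if self.filtered:
            hit = [w for w in bucket if (w[1], w[2]) == (page, reason)]
        else:
            hit = list(bucket)             # everyone sharing the bucket wakes
        for w in hit:
            bucket.remove(w)
        return [w[0] for w in hit]
```

Take px = 1 and py = 5, which share a bucket. Unfiltered, wake(5, "unlock") wakes a waiter on
(1, "writeback") and would kick a pointless retry. Filtered, the same call wakes nobody.
%size 8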
%%%%%%%%%%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
AIO and Database Workloads
%fore "skyblue"

	DB2(R) page cleaners
		Flush dirty buffer pool pages to disk
			Number and behaviour can be tuned according to demand
			Frees agent processes to be dedicated to processing txns
		AIO reduces the number of page cleaner processes
			Helps OLTP workloads
			Stream random synchronized AIO writes to preallocated blocks
			Individual request size = Database page size (e.g. 8KB)
			Keep disk queues maximally utilized and limit contention

%pause
	AIO reads for prefetching data
		Expected to help decision support workloads	
		Streaming large random AIO reads	
			Individual request size = Database extent size (e.g. 256KB)
			Need to tune readahead setting for device for buffered AIO

%%%%%%%%%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
AIO OLTP Performance - Raw Devices
%fore "skyblue"
%size 3
%area 50 20 0 20, leftfill, fore "green"
Update-intensive OLTP database workload
Derived from a TPC benchmark, but in no way comparable to any TPC results

%area 50 20 50 20, leftfill, fore "green"
DB2 V8, Linux 2.6.1, 2-way AMD Opteron, 
QLogic 2342 FC, 2 storage servers x 8 disk enclosures x 14 disks each
RAID-0 configuration, stripe size 256KB

%pause
%area 100 50 0 40, leftfill, fore "skyblue"
%center
%size 8
______________________________________________________________
      Configuration                    Relative Throughput 
______________________________________________________________

      1 page cleaner with AIO               133                 
      55 page cleaners without AIO        122                 
______________________________________________________________
%pause

______________________________________________________________
      Configuration                    Page Cleaner Writes (%)
______________________________________________________________

      1 page cleaner with AIO               100                 
      55 page cleaners without AIO         70                 
______________________________________________________________
%leftfill
%%%%%%%%%%%%%%%%%%%%%
%page
%size 8
%fore "white"
AIO OLTP Performance - Filesystems
%fore "skyblue"
%size 3
%area 50 20 0 20, leftfill, fore "green"
Update-intensive OLTP database workload
Derived from a TPC benchmark, but in no way comparable to any TPC results

%area 50 20 50 20, leftfill, fore "green"
DB2 V8, Linux 2.6.0+mm1, 2-way AMD Opteron
QLogic 2310 FC, 4 disk enclosures, each with 2 disk
RAID-0 arrays, stripe size 256KB

%pause
%area 100 50 0 40, leftfill, fore "skyblue"
%center
%size 8
______________________________________________________________
      Configuration                    Relative Throughput 
______________________________________________________________

      5 page cleaners with AIO (buffered)   113.7               
      5 page cleaners without AIO              100                 
______________________________________________________________
%pause

______________________________________________________________
      Configuration                    Page Cleaner Writes (%)
______________________________________________________________

      5 page cleaners with AIO (buffered)   100                
      5 page cleaners without AIO                37                 
______________________________________________________________
%leftfill
%%%%%%%%%%%%%%%%%%%%%
%page
%size 8
%fore "white"
Combining AIO & Communications I/O
%fore "skyblue"

	Different API models
		Communications IO uses epoll + O_NONBLOCK
		Disk-based AIO uses native AIO API 

%pause
	Combining into a single event loop
		Extend epoll for notification of AIO completion
		AIO poll
			IO_CMD_POLL (fd, event mask)
			retry based implementation
		Native AIO support for communications I/O
			e.g. AIO pipe (retry-based implementation)
			read immediate added to retry-model
			context switch reduction vs IO stalls
			AIO on sockets not implemented yet
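%%%%%%%%%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
Single Event Loop: A Sketch
%fore "skyblue"
%size 3

One way to picture a combined event loop is a readiness loop that also watches a completion descriptor.
This is a user-space sketch under assumptions: pipes stand in for a socket and for a hypothetical
AIO completion notification fd, and a helper thread plays the disk. It does not use the native AIO API.

```python
import os, selectors, threading

def event_loop():
    sel = selectors.DefaultSelector()
    net_r, net_w = os.pipe()      # stands in for a non-blocking socket
    aio_r, aio_w = os.pipe()      # hypothetical AIO completion notification fd
    os.set_blocking(net_r, False)
    os.set_blocking(aio_r, False)
    sel.register(net_r, selectors.EVENT_READ, "net")
    sel.register(aio_r, selectors.EVENT_READ, "aio")

    # A helper thread plays the disk and "completes" one request.
    disk = threading.Thread(target=os.write, args=(aio_w, b"D"))
    disk.start()
    os.write(net_w, b"N")         # a peer sends one message

    seen = set()
    while seen != {"net", "aio"}:
        for key, _ in sel.select(timeout=5):
            os.read(key.fd, 1)    # consume the event
            seen.add(key.data)    # dispatch on the kind of source
    disk.join()
    sel.close()
    for fd in (net_r, net_w, aio_r, aio_w):
        os.close(fd)
    return sorted(seen)
```

Both kinds of event arrive through the same select call, so the application needs only one loop.
%size 8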
%%%%%%%%%%%%%%%%%%%%%
%page
%fore "white"
Conclusions 
%fore "skyblue"

	In retrospect ...
		Real challenges are beyond sync->async conversion
		AIO exposed I/O patterns less likely with sync I/O alone
		AIO appeared to magnify some problems early
			hashed waitqueues -> filtered wakeups
			readahead window collapse with large random reads
		Enhancements improved concurrency for sync I/O paths

%pause
	Room for future work
		More benchmarking, analysis & optimizations
		AIO fsync
		More widely used AIO applications
		Study to determine if network AIO is worth it
		Enabling efficient POSIX AIO implementations
%%%%%%%%%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%size 6
%fore "white"
Our Thanks to ...
%fore "skyblue"
%size 4

Andrew Morton <akpm@digeo.com>
Daniel McNeil <daniel@osdl.org>
Janet Morgan <janetmor@us.ibm.com>
Badari Pulavarty <pbadari@us.ibm.com>
Stephen Tweedie <sct@redhat.com>
William Lee Irwin <wli@holomorphy.com>

%pause
and several other people on the linux-aio and the linux-kernel mailing lists for various suggestions and fixes

%pause
%size 6
%fore "white"
Downloads:
%fore "orange"
%size 4

AIO patches discussed here are available at www.kernel.org/pub/linux/kernel/people/suparna/aio
AIO web page: http://lse.sf.net/io/aio.html

%%%%%%%%%%%%%%%%%%%%%%
%page
%size 8
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
Disclaimers and Trademarks
%fore "skyblue"

%size 4
This work represents the view of the authors and 
does not necessarily represent the view of IBM.

IBM and DB2 are registered trademarks
of International Business Machines Corporation in the 
United States and/or other countries.

Linux is a registered trademark of Linus Torvalds.

Pentium is a trademark of Intel Corporation in the
United States, other countries or both.

Other company, product, and service names may be 
trademarks or service marks of others.

The benchmarks discussed in this presentation were
conducted for research purposes only, under laboratory
conditions. Results will not be realised in all
computing environments.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%