Golang library for io_uring framework (without CGO)
===

io_uring is the new kernel interface for asynchronous IO. The best introduction is
[io_uring intro](https://kernel.dk/io_uring.pdf).

Note that this library is mostly tested on 5.8.* kernels. The core of the library doesn't use any of the newer features and will work on any kernel whose io_uring supports the IORING_SETUP_CQSIZE and IORING_SETUP_ATTACH_WQ flags and eventfd notifications (IORING_REGISTER_EVENTFD). However, some of the tests depend on the latest features and will probably fail with cryptic errors if run on a kernel that doesn't support them.

Benchmarks
===

Benchmarks for reading a 40gb file were collected on a 5.8.15 kernel, with ext4 and a Samsung EVO 960. The file is opened with O_DIRECT. The benchmarks compare the fastest ways to read a file using optimal strategies with io_uring or the os package.

#### io_uring

- 16 rings (one per core) with shared kernel workers
- one thread per ring to reap completions (faster than a single thread with epoll and eventfd's)
- 4096 submission queue size
- 8192 completion queue size
- 100 000 concurrent readers

```
BenchmarkReadAt/enter_4096-16    5000000    1709 ns/op    2397.39 MB/s    34 B/op    2 allocs/op
```

#### os

- 128 os threads (more than that hurts performance)

```
BenchmarkReadAt/os_4096-128    5000000    1901 ns/op    2154.84 MB/s    0 B/op    0 allocs/op
```

Implementation
===

#### memory ordering (atomics)

io_uring relies on StoreRelease/LoadAcquire atomic semantics to guarantee that submitted entries are visible to the kernel by the time the sq tail is updated, and vice versa for completed entries and the cq head.
Based on comments ([#1](https://github.com/golang/go/issues/32428), [#2](https://github.com/golang/go/issues/35639)), golang provides sufficient (actually stronger) guarantees for this to work. However, I wasn't able to find any mention of this in the memory model specification, so it is subject to change, although that is highly unlikely.

Also, the runtime/internal/atomic package has implementations of weaker atomics that provide exactly StoreRelease and LoadAcquire. They can be linked in with the `go:linkname` pragma, but this is not safe, and it doesn't provide any noticeable perf improvement that would justify the hack.

```go
//go:linkname LoadAcq runtime/internal/atomic.LoadAcq
func LoadAcq(ptr *uint32) uint32

//go:linkname StoreRel runtime/internal/atomic.StoreRel
func StoreRel(ptr *uint32, val uint32)
```

#### pointers

Casting pointers to uintptr is generally unsafe, as there is no guarantee that the object will remain in the same location after the cast. The documentation for unsafe.Pointer specifies the situations in which it can be done safely. Communication with io_uring obviously assumes that a pointer stays valid until the end of the execution. Take a look at the writev syscall example:

```go
func Writev(sqe *SQEntry, fd uintptr, iovec []syscall.Iovec, offset uint64, flags uint32) {
	sqe.opcode = IORING_OP_WRITEV
	sqe.fd = int32(fd)
	sqe.len = uint32(len(iovec))
	sqe.offset = offset
	sqe.opcodeFlags = flags
	sqe.addr = (uint64)(uintptr(unsafe.Pointer(&iovec[0])))
}
```

In this example `sqe.addr` may become invalid right after the Writev helper returns. To lock the pointer in place there is the hidden pragma `go:uintptrescapes`.

```go
//go:uintptrescapes

func (q *Queue) Syscall(opts func(*uring.SQEntry), ptrs ...uintptr) (uring.CQEntry, error) {
	return q.Complete(opts)
}

...

func (f *File) WriteAt(buf []byte, off int64) (int, error) {
	if len(buf) == 0 {
		return 0, nil
	}
	iovec := []syscall.Iovec{{Base: &buf[0], Len: uint64(len(buf))}}
	return ioRst(f.queue.Syscall(func(sqe *uring.SQEntry) {
		uring.Writev(sqe, f.ufd, iovec, uint64(off), 0)
		sqe.SetFlags(f.flags)
	}, uintptr(unsafe.Pointer(&iovec[0]))))
}
```

`ptrs` prevents the pointers from being moved to another location until Syscall exits.

This approach has several limitations:

- the pragma `go:uintptrescapes` forces heap allocation (e.g. iovec escapes to the heap in this example).
- you cannot use an interface for queue.Syscall.

It should be possible to achieve the same without heap allocation (e.g. the same way as in syscall.Syscall/syscall.RawSyscall).

#### synchronization and goroutines

The submission queue requires synchronization if used concurrently by multiple goroutines, which leads to contention with a large number of CPUs. The natural way to avoid contention is to set up a ring per thread; io_uring provides the handy flag IORING_SETUP_ATTACH_WQ that allows sharing the same kernel worker pool between multiple rings.

On linux we can use syscall.Gettid efficiently to assign work to a particular ring in a way that minimizes contention. It is also critical to ensure that the completion path doesn't have to synchronize with submission, as that introduces noticeable degradation.

Another potentially unsafe improvement is to link procPin from the runtime and use it in place of syscall.Gettid; my last benchmarks show no difference (maybe a couple of ns, not in favor of procPin/procUnpin).

```go
//go:linkname procPin runtime.procPin
func procPin() int

//go:linkname procUnpin runtime.procUnpin
func procUnpin() int
```

Inside the runtime we could use gopark/goready directly; however, this is not available outside of the runtime, so I had to use a simple channel for notifying the submitter on completion.
This works nicely and doesn't introduce a lot of overhead. The whole approach adds ~750ns at a high submission rate (this includes spawning a goroutine, submitting a nop uring operation, and waiting for the completion).

Several weak points of this approach are:

- IOPOLL and SQPOLL can't be used, as they would lead to the creation of a polling thread for each CPU.
- Submissions are not batched (one syscall per operation). However, this can be improved with a batch API.

By introducing uring into the runtime directly we could decrease the overhead even further, by removing syscall.Gettid, removing sq synchronization, and improving gopark/goready efficiency (compared with waiting on a channel).