Golang library for io_uring framework (without CGO)
===

io_uring is the new kernel interface for asynchronous IO. The best introduction is
[io_uring intro](https://kernel.dk/io_uring.pdf).

Note that this library is mostly tested on 5.8.* kernels. The core of the library doesn't use any of the newer features and will work on any kernel whose io_uring supports the IORING_SETUP_CQSIZE and IORING_SETUP_ATTACH_WQ flags and eventfd notifications (IORING_REGISTER_EVENTFD). However, some of the tests depend on the latest features and will probably fail with cryptic errors if run on a kernel that doesn't support them.
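
For example, support for a setup flag can be probed at startup with a raw io_uring_setup(2) call and a discarded fd. This helper is not part of the library; the struct layout and syscall number below are for linux/amd64:

```go
package main

import (
        "fmt"
        "syscall"
        "unsafe"
)

// ioUringParams mirrors the kernel's struct io_uring_params (120 bytes).
type ioUringParams struct {
        sqEntries    uint32
        cqEntries    uint32
        flags        uint32
        sqThreadCPU  uint32
        sqThreadIdle uint32
        features     uint32
        wqFd         uint32
        resv         [3]uint32
        sqOff        [10]uint32 // struct io_sqring_offsets
        cqOff        [10]uint32 // struct io_cqring_offsets
}

const (
        sysIoUringSetup   = 425    // linux/amd64 syscall number
        ioringSetupCQSize = 1 << 3 // IORING_SETUP_CQSIZE
)

// probe returns nil if the kernel accepts the given setup flags.
func probe(flags uint32) error {
        params := ioUringParams{flags: flags, cqEntries: 8}
        fd, _, errno := syscall.Syscall(sysIoUringSetup, 4, uintptr(unsafe.Pointer(&params)), 0)
        if errno != 0 {
                return errno
        }
        return syscall.Close(int(fd))
}

func main() {
        if err := probe(ioringSetupCQSize); err != nil {
                fmt.Println("IORING_SETUP_CQSIZE not supported:", err)
        }
}
```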

Benchmarks
===

Benchmarks for reading a 40 GB file were collected on a 5.8.15 kernel, ext4 and a Samsung EVO 960. The file is opened with O_DIRECT. The benchmarks compare the fastest way to read a file using an optimal strategy with io_uring against the standard blocking approach on os threads.

#### io_uring

- 16 rings (one per core) with shared kernel workers
- one thread per ring to reap completions (faster than a single thread with epoll and eventfds)
- 4096 submission queue size
- 8192 completion queue size
- 100,000 concurrent readers

```
BenchmarkReadAt/enter_4096-16            5000000          1709 ns/op	2397.39 MB/s          34 B/op          2 allocs/op
```

#### os

- 128 os threads (more than that hurts performance)

```
BenchmarkReadAt/os_4096-128              5000000          1901 ns/op	2154.84 MB/s           0 B/op          0 allocs/op
```

Implementation
===

#### memory ordering (atomics)

io_uring relies on StoreRelease/LoadAcquire atomic semantics to guarantee that submitted entries are visible to the kernel by the time the sq tail is updated, and vice versa for completions and the cq head.
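
Concretely, the submission side looks roughly like the sketch below (the `ring` type and its fields are illustrative, not the library's actual layout). Go's sync/atomic operations are sequentially consistent, hence at least as strong as the release/acquire pairs io_uring requires:

```go
import "sync/atomic"

// ring is an illustrative view of the mmap'ed submission ring.
type ring struct {
        head    *uint32   // advanced by the kernel as it consumes entries
        tail    *uint32   // advanced by us as we publish entries
        mask    uint32    // len(entries) - 1
        array   []uint32  // index array the kernel reads
        entries []SQEntry // the actual submission queue entries
}

type SQEntry struct{ /* opcode, fd, addr, ... */ }

// push publishes one entry. The final StoreUint32 acts as the release:
// everything written to entries/array before it is guaranteed visible
// to the kernel once the kernel acquires the new tail value.
func (r *ring) push(sqe SQEntry) bool {
        head := atomic.LoadUint32(r.head) // acquire: observe kernel progress
        tail := *r.tail                   // plain read: only we write the tail
        if tail-head >= uint32(len(r.entries)) {
                return false // submission queue is full
        }
        idx := tail & r.mask
        r.entries[idx] = sqe
        r.array[idx] = idx
        atomic.StoreUint32(r.tail, tail+1) // release: publish the new entry
        return true
}
```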

Based on the comments in [#1](https://github.com/golang/go/issues/32428) and [#2](https://github.com/golang/go/issues/35639), golang provides sufficient guarantees (actually stronger ones) for this to work. However, I wasn't able to find any mention of this in the memory model specification, so it is subject to change, though that is highly unlikely.

Also, the runtime/internal/atomic package has implementations of weaker atomics that provide exactly StoreRelease and LoadAcquire. You can link to them using the `go:linkname` pragma, but it is unsafe and unsupported, and it doesn't provide any noticeable perf improvement to justify the hack:

```go
import _ "unsafe" // required for go:linkname

// Note: the package also needs a (possibly empty) .s file so the
// compiler accepts function declarations without bodies.

//go:linkname LoadAcq runtime/internal/atomic.LoadAcq
func LoadAcq(ptr *uint32) uint32

//go:linkname StoreRel runtime/internal/atomic.StoreRel
func StoreRel(ptr *uint32, val uint32)
```

#### pointers

Casting pointers to uintptr is generally unsafe: there is no guarantee that the object will remain at the same location after the cast. The documentation for unsafe.Pointer specifies the situations in which it can be done safely. Communication with io_uring obviously assumes that a pointer stays valid until the operation completes. Take a look at the writev example:

```go
func Writev(sqe *SQEntry, fd uintptr, iovec []syscall.Iovec, offset uint64, flags uint32) {
        sqe.opcode = IORING_OP_WRITEV
        sqe.fd = int32(fd)
        sqe.len = uint32(len(iovec))
        sqe.offset = offset
        sqe.opcodeFlags = flags
        // the pointer is stored as a plain integer; nothing here keeps iovec alive
        sqe.addr = (uint64)(uintptr(unsafe.Pointer(&iovec[0])))
}
```

In this example `sqe.addr` may become invalid right after the Writev helper returns. To pin the pointer in place there is a hidden pragma, `go:uintptrescapes`.

```go
//go:uintptrescapes

// Syscall forces every pointer passed in ptrs to escape to the heap and
// keeps it reachable until the call returns.
func (q *Queue) Syscall(opts func(*uring.SQEntry), ptrs ...uintptr) (uring.CQEntry, error) {
        return q.Complete(opts)
}

...

func (f *File) WriteAt(buf []byte, off int64) (int, error) {
        if len(buf) == 0 {
                return 0, nil
        }
        iovec := []syscall.Iovec{{Base: &buf[0], Len: uint64(len(buf))}}
        return ioRst(f.queue.Syscall(func(sqe *uring.SQEntry) {
                uring.Writev(sqe, f.ufd, iovec, uint64(off), 0)
                sqe.SetFlags(f.flags)
        }, uintptr(unsafe.Pointer(&iovec[0]))))
}
```

`ptrs` prevent the pointed-to objects from being moved or freed until Syscall exits.

This approach has several limitations:

- the pragma `go:uintptrescapes` forces a heap allocation (e.g. iovec escapes to the heap in this example).
- queue.Syscall cannot be called through an interface (the pragma is attached to the concrete declaration).

It should be possible to achieve the same without the heap allocation (e.g. the same way syscall.Syscall/syscall.RawSyscall do it).

#### synchronization and goroutines

The submission queue requires synchronization if used concurrently by multiple goroutines, which leads to contention on machines with many CPUs. The natural way to avoid the contention is to set up a ring per thread; io_uring provides the handy flag IORING_SETUP_ATTACH_WQ, which allows sharing the same kernel worker pool between multiple rings.
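
The resulting topology looks roughly like this (setupRing is a hypothetical wrapper over io_uring_setup(2); the real library's constructor differs): the first ring is created normally and the others pass its fd via the wq_fd field together with IORING_SETUP_ATTACH_WQ:

```go
// setupRing is assumed to wrap io_uring_setup(2): it fills io_uring_params
// (setting params.wq_fd = wqFd when attaching) and returns the ring fd.
func setupRing(entries, flags, wqFd uint32) (int, error) {
        panic("sketch: see the probe example above for the raw syscall")
}

const ioringSetupAttachWQ = 1 << 5 // IORING_SETUP_ATTACH_WQ

// setupShardedRings creates n rings that share one kernel worker pool.
func setupShardedRings(n int, entries uint32) ([]int, error) {
        // the first ring owns the kernel worker pool
        first, err := setupRing(entries, 0, 0)
        if err != nil {
                return nil, err
        }
        rings := []int{first}
        // every other ring attaches to the first ring's pool via wq_fd
        for i := 1; i < n; i++ {
                fd, err := setupRing(entries, ioringSetupAttachWQ, uint32(first))
                if err != nil {
                        return nil, err
                }
                rings = append(rings, fd)
        }
        return rings, nil
}
```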

On linux we can use syscall.Gettid to cheaply assign work to a particular ring in a way that minimizes contention. It is also critical that the completion path doesn't have to synchronize with submission, as that introduces noticeable degradation.
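
The sharding itself can be as simple as the sketch below (types are illustrative, not the library's actual ones):

```go
import "syscall"

type queue struct{ /* per-ring submission state, guarded by a mutex */ }

type shardedQueue struct {
        shards []*queue // one queue per ring
}

// shard picks a ring based on the current OS thread id. Goroutines can
// migrate between threads, so this is a contention-avoiding heuristic,
// not an affinity guarantee; each shard still needs its own lock, but
// that lock is almost never contended.
func (q *shardedQueue) shard() *queue {
        return q.shards[syscall.Gettid()%len(q.shards)]
}
```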

Another potentially unsafe improvement is to link procPin from the runtime and use it in place of syscall.Gettid; my last benchmark shows no difference (maybe a couple of ns, not in favor of procPin/procUnpin).

```go
// procPin pins the calling goroutine to its P and returns the P's id;
// procUnpin releases it. Linked from the runtime, so the same caveats
// apply as for runtime/internal/atomic above.

//go:linkname procPin runtime.procPin
func procPin() int

//go:linkname procUnpin runtime.procUnpin
func procUnpin() int
```

Inside the runtime we could use gopark/goready directly, but they are not available outside of the runtime, so I had to use a plain channel to notify the submitter about a completion. This works nicely and doesn't introduce a lot of overhead. The whole approach adds ~750ns per operation at a high submission rate (this includes spawning a goroutine, submitting a nop uring operation, and waiting for its completion).
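
The completion path then looks roughly like this sketch (all types and helpers are illustrative; the library's internals differ):

```go
import "syscall"

type cqEntry struct {
        res      int32  // syscall result, negative errno on failure
        userData uint64 // links the completion back to its submission
}

type request struct {
        done chan cqEntry // buffered with capacity 1 so the reaper never blocks
}

type queue struct {
        inflight map[uint64]*request
}

func (q *queue) waitCQE() cqEntry   { panic("sketch: read the cq ring") }
func (q *queue) submit(data uint64) { panic("sketch: write the SQE, io_uring_enter") }

// reap runs on the per-ring completion thread: it matches every CQE to
// its request via user_data and wakes the waiting submitter.
func (q *queue) reap() {
        for {
                cqe := q.waitCQE()
                q.inflight[cqe.userData].done <- cqe
        }
}

// complete parks the calling goroutine on a channel until the reaper
// delivers the CQE for its submission.
func (q *queue) complete(data uint64) (cqEntry, error) {
        req := &request{done: make(chan cqEntry, 1)}
        q.inflight[data] = req // real code synchronizes this map access
        q.submit(data)
        cqe := <-req.done
        if cqe.res < 0 {
                return cqe, syscall.Errno(-cqe.res)
        }
        return cqe, nil
}
```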

Several weak points of this approach are:

- IOPOLL and SQPOLL can't be used, as they would lead to the creation of a polling thread per CPU.
- Submissions are not batched (one syscall per operation). However, this can be improved with a batch API (see the sketch below).
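
Such a batch API could look like the sketch below (entirely hypothetical, reusing the queue/request/cqEntry types from the previous sketch): SQEs are prepared up front, and a single io_uring_enter submits them all:

```go
type sqEntry struct{ /* opcode, fd, addr, ... */ }

func (q *queue) prepare(opts func(*sqEntry)) *request { panic("sketch: fill the next SQE") }
func (q *queue) enter(n int) error                    { panic("sketch: submit n SQEs at once") }

// batch amortizes the submission syscall over several operations.
type batch struct {
        q    *queue
        reqs []*request
}

// add queues one operation without entering the kernel.
func (b *batch) add(opts func(*sqEntry)) {
        b.reqs = append(b.reqs, b.q.prepare(opts))
}

// flush submits everything queued so far with a single io_uring_enter
// call and waits for all completions.
func (b *batch) flush() ([]cqEntry, error) {
        if err := b.q.enter(len(b.reqs)); err != nil {
                return nil, err
        }
        out := make([]cqEntry, 0, len(b.reqs))
        for _, req := range b.reqs {
                out = append(out, <-req.done)
        }
        b.reqs = b.reqs[:0]
        return out, nil
}
```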

By moving uring into the runtime directly we could decrease the overhead even further: syscall.Gettid and the sq synchronization go away, and gopark/goready is more efficient than waiting on a channel.