kernel - Refactor lockmgr()

* Seriously refactor lockmgr() so we can use atomic_fetchadd_*() for shared locks and reduce unnecessary atomic ops and atomic op loops. The main win here is being able to use atomic_fetchadd_*() when acquiring and releasing shared locks (a sketch of this fast path follows below). A simple fstat() loop (which utilizes a LK_SHARED lockmgr lock on the vnode) improves from 191ns to around 110ns per loop with 32 concurrent threads (on a 16-core/32-thread xeon).

* To accomplish this, the 32-bit lk_count field becomes 64 bits. The shared count is separated into the high 32 bits, allowing it to be manipulated for both blocking shared requests and the shared lock count field. The low count bits are used for exclusive locks. Control bits are adjusted to manage lockmgr features.

  LKC_SHARED   Indicates shared lock count is active, else excl lock count. Can predispose the lock when the related count is 0 (does not have to be cleared, for example).

  LKC_UPREQ    Queued upgrade request. Automatically granted by releasing entity (UPREQ -> ~SHARED|1).

  LKC_EXREQ    Queued exclusive request (only when lock held shared). Automatically granted by releasing entity (EXREQ -> ~SHARED|1).

  LKC_EXREQ2   Aggregated exclusive request. When EXREQ cannot be obtained due to the lock being held exclusively or EXREQ already being queued, EXREQ2 is flagged for wakeup/retries.

  LKC_CANCEL   Cancel API support.

  LKC_SMASK    Shared lock count mask (LKC_SCOUNT increments).

  LKC_XMASK    Exclusive lock count mask (+1 increments).

  The 'no lock' condition occurs when LKC_XMASK is 0 and LKC_SMASK is 0, regardless of the state of LKC_SHARED.

* Lockmgr still supports exclusive priority over shared locks. The semantics have slightly changed. The priority mechanism only applies to the EXREQ holder. Once an exclusive lock is obtained, any blocking shared or exclusive locks will have equal priority until the exclusive lock is released. Once released, shared locks can squeeze in, but then the next pending exclusive lock will assert its priority over any new shared locks when it wakes up and loops.

  This isn't quite what I wanted, but it seems to work quite well. I had to make a trade-off in the EXREQ lock-grant mechanism to improve performance.

* In addition, we use atomic_fcmpset_long() instead of atomic_cmpset_long() to reduce cache line flip flopping at least a little.

* Remove lockcount() and lockcountnb(), which tried to count lock refs. Replace with lockinuse(), which simply tells the caller whether the lock is referenced or not.

* Expand some of the copyright notices (years and authors) for major rewrites. Really there are a lot more and I have to pay more attention to adjustments.
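Below is a minimal, hypothetical sketch of the shared-lock fast path this layout enables. The bit positions and constant values are assumptions made only for illustration (the names mirror the description above); it is not the actual lockmgr code.

```c
/*
 * Illustrative only -- not the real lockmgr implementation.  The bit
 * layout and constant values below are assumptions; only the names
 * mirror the commit text.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define LKC_XMASK   0x00000000003FFFFFULL  /* exclusive lock count (low bits)  */
#define LKC_SHARED  0x0000000001000000ULL  /* shared mode active               */
#define LKC_EXREQ   0x0000000002000000ULL  /* queued exclusive request         */
#define LKC_SCOUNT  0x0000000100000000ULL  /* shared count increment           */
#define LKC_SMASK   0xFFFFFFFF00000000ULL  /* shared lock count (high 32 bits) */

struct lock {
	_Atomic uint64_t lk_count;
};

/*
 * Shared-lock fast path: a single fetchadd bumps the shared count.
 * If the result shows shared mode with no exclusive holder and no
 * queued exclusive request, the lock is ours; otherwise back the
 * increment out and fall into a (not shown) slow path that sets
 * LKC_SHARED or sleeps.
 */
static bool
lockmgr_shared_try(struct lock *lkp)
{
	uint64_t count;

	count = atomic_fetch_add(&lkp->lk_count, LKC_SCOUNT) + LKC_SCOUNT;
	if ((count & (LKC_SHARED | LKC_EXREQ | LKC_XMASK)) == LKC_SHARED)
		return true;
	atomic_fetch_sub(&lkp->lk_count, LKC_SCOUNT);
	return false;
}
```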
kernel - Refactor smp collision statistics

* Add an indefinite wait timing API (sys/indefinite.h, sys/indefinite2.h). This interface uses the TSC and will record lock latencies to our pcpu stats in microseconds. The systat -pv 1 display shows this under smpcoll.

  Note that latencies generated by tokens, lockmgr, and mutex locks do not necessarily reflect actual lost cpu time, as the kernel will schedule other threads while those are blocked, if other threads are available.

* Formalize TSC operations more, supplying types (tsc_uclock_t and tsc_sclock_t).

* Reinstrument lockmgr, mutex, token, and spinlocks to use the new indefinite timing interface.
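As a rough illustration of the idea (not the actual sys/indefinite.h or sys/indefinite2.h interface), a TSC-based wait-timing helper might look like the sketch below. The function names, the per-cpu statistic, and the tsc_frequency value are assumptions.

```c
/*
 * Conceptual sketch only -- not the kernel's indefinite timing API.
 * Records how long a caller waited, in microseconds, using the TSC.
 */
#include <stdint.h>
#include <x86intrin.h>			/* __rdtsc() */

typedef uint64_t tsc_uclock_t;		/* unsigned TSC timestamp type */

static uint64_t tsc_frequency = 2400000000ULL;	/* assumed; measured at boot */
static uint64_t pcpu_smpcoll_us;		/* hypothetical per-cpu stat */

struct indefinite_info {
	tsc_uclock_t base;
};

static inline void
indef_begin(struct indefinite_info *info)
{
	info->base = __rdtsc();
}

static inline void
indef_end(struct indefinite_info *info)
{
	tsc_uclock_t delta = __rdtsc() - info->base;

	/* convert TSC ticks to microseconds and accumulate the latency */
	pcpu_smpcoll_us += delta * 1000000ULL / tsc_frequency;
}
```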
kernel - KVABIO stabilization

* bp->b_cpumask must be cleared in vfs_vmio_release().

* Generally speaking, it is desirable for the kernel to set B_KVABIO when flushing or disposing of a buffer, as long as b_cpumask is also correct. This avoids unnecessary synchronization when underlying device drivers support KVABIO, even if the filesystem does not.

* In findblk() we cannot just gratuitously clear B_KVABIO. We must issue a bkvasync_all() to clear the flag in order to ensure proper synchronization with the caller's desired B_KVABIO state.

* It was intended that bkvasync_all() clear the B_KVABIO flag. Make sure it does.

* In contrast, B_KVABIO can always be set at any time, so long as the cpumask is cleared whenever the mappings are changed, and also as long as the caller's B_KVABIO state is respected if the buffer is later returned to the caller in a locked state. If the buffer will simply be disposed of by the kernel instead, the flag can be set. The wrapper (typically a vn_strategy() or dev_dstrategy() call) will clear the flag via bkvasync_all() if the target does not support KVABIO.

* Kernel support code outside of filesystems and device drivers is expected to support KVABIO.

* nvtruncbuf() and nvextendbuf() now use bread_kvabio() (i.e. they now properly support KVABIO).

* The buf_countdeps(), buf_checkread(), and buf_checkwrite() callbacks call bkvasync_all() in situations where the vnode does not support KVABIO. This is because the kernel might have set the flag for other incidental operations even if the filesystem did not.

* As per the above, devfs_spec_strategy() now sets B_KVABIO and properly calls bkvasync() when it needs to operate directly on buf->b_data.

* Fix a bug in tmpfs. tmpfs was using bread_kvabio() as intended, but failed to call bkvasync() prior to operating directly on buf->b_data (prior to calling uiomovebp()).

* Any VFS function that calls BUF_LOCK*() itself may also have to call bkvasync_all() if it wishes to operate directly on buf->b_data, even if the VFS is not KVABIO aware. This is because the VFS bypassed the normal buffer cache APIs to obtain a locked buffer.
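The last rule above is the one most easily gotten wrong; the hedged sketch below (simplified stand-in declarations and an illustrative flag value, not the kernel's real definitions) shows the shape of a VFS routine that obtained a locked buffer itself and must synchronize before touching b_data.

```c
/*
 * Sketch of the rule above: a VFS that bypasses the normal buffer
 * cache API must synchronize the buffer's KVA before touching b_data
 * directly.  Declarations are simplified stand-ins.
 */
struct buf {
	int	 b_flags;
	char	*b_data;
};

#define B_KVABIO	0x00100000	/* illustrative flag value */

void bkvasync_all(struct buf *bp);	/* sync KVA on all cpus, clear B_KVABIO */

static void
vfs_touch_buffer_data(struct buf *bp)
{
	/*
	 * The kernel may have set B_KVABIO for its own incidental
	 * operations even if this filesystem is not KVABIO-aware, so
	 * the mapping may only be valid on some cpus.  Sync first.
	 */
	if (bp->b_flags & B_KVABIO)
		bkvasync_all(bp);

	bp->b_data[0] = 0;		/* now safe to access directly */
}
```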
kernel - Add KVABIO API (ability to avoid global TLB syncs)

* Add KVABIO support. This works as follows:

  (1) Devices can set D_KVABIO in the ops flags to specify that the device strategy routine supports the API. The dev_dstrategy() wrapper will fully synchronize the buffer to all cpus prior to dispatch if the device flag is not set.

  (2) Vnodes can set VKVABIO in v_flag to indicate that VOP_STRATEGY supports the API. The vn_strategy() wrapper will fully synchronize the buffer to all cpus prior to dispatch if the vnode flag is not set.

  (3) GETBLK_KVABIO and FINDBLK_KVABIO flags added to allow buffer cache consumers (primarily filesystem code) to indicate that they support the API. B_KVABIO flag added to struct buf. This occurs on a per-acquisition basis. For example, a standard bread() will clear the flag, indicating no support. A bread_kvabio() will set the flag, indicating support.

* The getblk(), getcacheblk(), and cluster*() interfaces set the flag for any I/O they dispatch, and then adjust the flag as necessary upon return according to the caller's wishes.
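A minimal sketch of the dispatch-side behavior described in (1) and (2): if the target strategy routine has not declared KVABIO support, the wrapper synchronizes the buffer first. The types, flag values, and helpers below are simplified assumptions, not the kernel's real definitions.

```c
/*
 * Illustrative sketch of the vn_strategy()-style wrapper check.
 * Flag values and types are stand-ins.
 */
#define B_KVABIO	0x00100000	/* buffer uses cpu-local KVA mappings        */
#define VKVABIO		0x00000800	/* vnode's VOP_STRATEGY supports KVABIO      */

struct buf   { int b_flags; };
struct vnode { int v_flag; };

void bkvasync_all(struct buf *bp);			/* sync mapping on all cpus    */
void vop_strategy_dispatch(struct vnode *vp, struct buf *bp);	/* hypothetical dispatch */

static void
vn_strategy_sketch(struct vnode *vp, struct buf *bp)
{
	/*
	 * If the target VOP_STRATEGY does not understand KVABIO, make
	 * the buffer's mapping globally visible (clearing B_KVABIO)
	 * before handing the buffer down.
	 */
	if ((bp->b_flags & B_KVABIO) && (vp->v_flag & VKVABIO) == 0)
		bkvasync_all(bp);
	vop_strategy_dispatch(vp, bp);
}
```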
kernel - Expand breadnx/breadcb/cluster_readx/cluster_readcb API

* Pass B_NOTMETA flagging into breadnx(), breadcb(), cluster_readx(), and cluster_readcb(). Solve issues where data can wind up not being tagged B_NOTMETA in read-ahead and clustered buffers.

* Adjust the standard bread(), breadn(), and cluster_read() inlines to pass B_NOTMETA.
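For illustration, the inline adjustment might look roughly like the sketch below; the signature shown for breadnx() and the flag value are simplified assumptions, not the actual prototypes.

```c
/*
 * Illustrative sketch -- simplified stand-in signature, not the real
 * breadnx() prototype.  The point is that the convenience wrapper
 * forwards B_NOTMETA so read-ahead/clustered buffers get tagged too.
 */
#include <stddef.h>

struct vnode;
struct buf;

#define B_NOTMETA	0x04000000	/* illustrative flag value */

int breadnx(struct vnode *vp, long long loffset, int size, int bflags,
	    long long *raoffset, int *rabsize, int cnt, struct buf **bpp);

static inline int
bread_sketch(struct vnode *vp, long long loffset, int size, struct buf **bpp)
{
	*bpp = NULL;	/* no pre-acquired buffer; let breadnx() obtain one */
	return breadnx(vp, loffset, size, B_NOTMETA, NULL, NULL, 0, bpp);
}
```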
kernel - Add cluster_readcb()

* This function is similar to breadcb() in that it issues the requested buffer I/O asynchronously with a callback, but then also clusters additional asynchronous I/Os (without a callback) to improve performance.

* Used by HAMMER2 to improve performance.
kernel - Greatly improve shared memory fault rate concurrency / shared tokens

This commit rolls up a lot of work to improve postgres database operations and the system in general. With these changes we can pgbench -j 8 -c 40 on our 48-core opteron monster at 140000+ tps, and the shm vm_fault rate hits 3.1M pps.

* Implement shared tokens. They work as advertised, with some caveats. It is acceptable to acquire a shared token while you already hold the same token exclusively, but you will deadlock if you acquire an exclusive token while you hold the same token shared (see the usage sketch below). Currently exclusive tokens are not given priority over shared tokens, so starvation is possible under certain circumstances.

* Create a critical code path in vm_fault() using the new shared token feature to quickly fault-in pages which already exist in the VM cache. pmap_object_init_pt() also uses the new feature. This increases fault-in concurrency by a ridiculously huge amount, particularly on SHM segments (say when you have a large number of postgres clients). Scaling for large numbers of clients on large numbers of cores is significantly improved. This also increases fault-in concurrency for MAP_SHARED file maps.

* Expand the breadn() and cluster_read() APIs. Implement breadnx() and cluster_readx(), which allow a getblk()'d bp to be passed in. If *bpp is not NULL a bp is being passed in; otherwise the routines call getblk().

* Modify the HAMMER read path to use the new API. Instead of calling getcacheblk() HAMMER now calls getblk() and checks the B_CACHE flag. This gives getblk() a chance to regenerate a fully cached buffer from VM backing store without having to acquire any hammer-related locks, resulting in even faster operation.

* If kern.ipc.shm_use_phys is set to 2 the VM pages will be pre-allocated. This can take quite a while for a large map and also lock the machine up for a few seconds. Defaults to off.

* Reorder the smp_invltlb()/cpu_invltlb() combos in a few places, running cpu_invltlb() last.

* An invalidation interlock might be needed in pmap_enter() under certain circumstances, so enable the code for now.

* vm_object_backing_scan_callback() was failing to properly check the validity of a vm_object after acquiring its token. Add the required check + some debugging.

* Make vm_object_set_writeable_dirty() a bit more cache friendly.

* The vmstats sysctl was scanning every process's vm_map (requiring a vm_map read lock to do so), which can stall for long periods of time when the system is paging heavily. Change the mechanic to a LWP flag which can be tested with minimal locking.

* Have the phys_pager mark the page as dirty too, to make sure nothing tries to free it.

* Remove the spinlock in pmap_prefault_ok(); since we do not delete page table pages it shouldn't be needed.

* Add a required cpu_ccfence() in pmap_inval.c. The code generated prior to this fix was still correct, and this makes sure it stays that way.

* Replace several manual wiring cases with calls to vm_page_wire().
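A brief usage sketch of the shared-token nesting rule stated above. The token type and acquire/release calls are written here as stand-in prototypes (the shared acquire is assumed to be exposed as lwkt_gettoken_shared()); the two functions simply contrast the legal and the deadlocking orders.

```c
/*
 * Stand-in prototypes for illustration; see the lead-in for assumptions.
 */
struct lwkt_token { int t_dummy; };

void lwkt_gettoken(struct lwkt_token *tok);		/* exclusive acquire            */
void lwkt_gettoken_shared(struct lwkt_token *tok);	/* shared acquire (assumed name) */
void lwkt_reltoken(struct lwkt_token *tok);

static struct lwkt_token demo_token;

static void
legal_nesting(void)
{
	lwkt_gettoken(&demo_token);		/* hold exclusively ...        */
	lwkt_gettoken_shared(&demo_token);	/* ... then shared: acceptable */
	lwkt_reltoken(&demo_token);
	lwkt_reltoken(&demo_token);
}

static void
deadlocking_nesting(void)
{
	lwkt_gettoken_shared(&demo_token);	/* hold shared ...                 */
	lwkt_gettoken(&demo_token);		/* ... then exclusive: DEADLOCKS   */
	lwkt_reltoken(&demo_token);
	lwkt_reltoken(&demo_token);
}
```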
kernel - Major performance changes to VM page management.

This commit significantly changes the way the kernel caches VM pages. Essentially what happens now is that vnodes and VM pages which are accessed often wind up in the VM active queue and last on the list for recycling, while vnodes and VM pages which are only accessed once or twice wind up on the VM inactive queue and are inserted in the middle of the list for recycling.

Previously vnodes were essentially recycled in an LRU fashion, and due to algorithmic design issues VM pages associated with files scanned via open()/read() were also winding up getting recycled in an LRU fashion. This caused relatively often-used data to get recycled way too early in the face of large filesystem scans (tar, rdist, cvs, etc).

In the new scheme vnodes and VM pages are essentially split into two camps: those which are used often and those which are only used once or twice. The ones used often wind up in the VM active queue (and their vnodes are last on the list of vnodes which can be recycled), and the ones used only once or twice wind up in the VM inactive queue.

The cycling of a large number of files from single-use scans (tar, rdist, cvs, etc on large data sets) now only recycles within the inactive set and does not touch the active set AT ALL. So, for example, files often-accessed by a shell or other programs tend to remain cached permanently. Permanence here is a relative term. Given enough memory pressure such files WILL be recycled. But single-use scans even of huge data sets will not create this sort of memory pressure.

Examples of how active VM pages and vnodes will get recycled include: (1) too many pages or vnodes wind up being marked as active; (2) memory pressure created by anonymous memory from running processes.

Technical description of changes:

* The buffer cache is limited. For example, on a 3G system the buffer cache only manages around 200MB. The VM page cache, on the other hand, can cover all available memory. This means that data can cycle in and out of the buffer cache at a much higher rate than it would from the VM page cache.

* VM pages were losing their activity history (m->act_count) when wired to back buffer cache pages. Because the buffer cache only manages around 200MB, the VM pages were being cycled in and out of the buffer cache over a shorter time period than they would be able to survive in the VM page queues. This caused VM pages to get recycled in more of an LRU fashion instead of based on usage, particularly the VM pages for files accessed with open()/read(). VM pages now retain their activity history, and it also gets updated even while the VM pages are owned by the buffer cache.

* Files accessed just once, for example in a large 'tar', 'find', or 'ls', could cause vnodes for files accessed numerous times to get kicked out of the vnode free list. This could occur due to an edge case when many tiny files are iterated (such as in a cvs update), on machines with 2G or more of memory. In these cases the vnode cache would reach its maximum number of vnodes without the VM page cache ever coming under pressure, forcing the VM system to throw away vnodes. The VM system invariably chose vnodes with small numbers of cached VM pages (which is what we desire), but wound up choosing them in strict LRU order regardless of whether the vnode was for a file accessed just once or for a file accessed many times.
More technical description of changes:

* The buffer cache now inherits the highest m->act_count from the VM pages backing it, and updates its tracking b_act_count whenever the buffer is getblk()'d (and HAMMER does it manually for buffers it attaches to internal structures).

* VAGE in the vnode->v_flag field has been changed to VAGE0 and VAGE1 (a 2-bit counter). Vnodes start out marked as being fully aged (count of 3) and the count is decremented every time the vnode is opened.

* When a vnode is placed on the vnode free list, aged vnodes are now inserted into the middle of the list while non-aged vnodes are inserted at the end. So aged vnodes get recycled first.

* VM pages returned from the buffer cache are now placed in the inactive queue or the active queue based on m->act_count. This works properly now that we do not lose the activity state when wiring and unwiring the VM page for buffer cache backings.

* The VM system now sets a much larger inactive page target, 1/4 of available memory. This, combined with the vnode reclamation algorithm which reclaims 1/10 of the active vnodes in the system, is now responsible for regulating the distribution of 'active' pages versus 'inactive' pages.

  It is important to note that the inactive page target and the vnode reclamation algorithm set a minimum size for the pages and vnodes intended to be on the inactive side of the ledger. Memory pressure from having too many active pages or vnodes will cause VM pages to move to the inactive side. But, as already mentioned, the simple one-time cycling of files such as in a tar, rdist, or other file scan will NOT cause this sort of memory pressure.

Negative aspects of the patch:

* Very large data sets which might have previously fit in memory but do not fit in e.g. 1/2 of available memory will no longer be fully cached. This is an either-or type of deal. We can't prevent active pages from getting recycled unless we reduce the amount of data we allow to get cached from 'one time' uses before starting to recycle that data.

-Matt
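A tiny sketch of the act_count-based placement described above: the queue choice when a buffer-cache wiring is released is driven by the page's preserved activity history. The threshold constant and helper names are illustrative, not the kernel's actual code.

```c
/*
 * Sketch of the placement rule, not the actual VM code.  Because
 * act_count is no longer reset by the wire/unwire cycle, a page that
 * was used often before entering the buffer cache still looks "hot"
 * here and returns to the active queue.
 */
struct vm_page {
	int act_count;		/* preserved activity history */
};

#define ACT_THRESHOLD	5	/* illustrative activation threshold */

enum vm_queue { PQ_INACTIVE, PQ_ACTIVE };

static enum vm_queue
page_release_queue(const struct vm_page *m)
{
	return (m->act_count >= ACT_THRESHOLD) ? PQ_ACTIVE : PQ_INACTIVE;
}
```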
bioqdisksort - refactor I/O queueing to fix read starvation issues.

It is possible to queue several hundred megabytes worth of write I/Os all at once. When this occurs, whether we sort the queue or not, reads wind up getting seriously starved.

Refactor bioqdisksort() to prioritize reads over writes and to also allow writes to 'leak' into the read space every so often to prevent write starvation. The new code is designed to make best use of drive zone caches.
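Conceptually, the read-priority plus write-leak policy behaves like the toy selector below. This is not the actual bioqdisksort() code; the structure, the interval constant, and the omission of dequeueing/sorting are all simplifications for illustration.

```c
/*
 * Toy model of the policy, not the real bioqdisksort().  Reads are
 * preferred, but a write is allowed to "leak" through periodically so
 * writes cannot be starved indefinitely.
 */
#include <stddef.h>

struct bio;				/* opaque for this sketch */

struct bioq_sketch {
	struct bio *read_head;		/* sorted read queue  */
	struct bio *write_head;		/* sorted write queue */
	int	    reads_since_write;	/* leak bookkeeping   */
};

#define WRITE_LEAK_INTERVAL	16	/* illustrative: one write per 16 reads */

/* choose the next BIO to dispatch (dequeueing omitted for brevity) */
static struct bio *
bioq_pick_next(struct bioq_sketch *q)
{
	if (q->write_head != NULL &&
	    (q->read_head == NULL ||
	     q->reads_since_write >= WRITE_LEAK_INTERVAL)) {
		q->reads_since_write = 0;
		return q->write_head;
	}
	if (q->read_head != NULL) {
		q->reads_since_write++;
		return q->read_head;
	}
	return NULL;
}
```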
bioqdisksort - fixes to avoid starvation

Long chains of pipelined write I/O were being sorted in front of other requests. Due to the pipelining, these other requests would wind up getting starved virtually permanently.

Prevent starvation by forcing one out of every 16 BIOs to be ordered. This fixes issues with HAMMER, which tends to have more of an absolute ordering of meta-data versus data than UFS does.
MPSAFE - tsleep_interlock, BUF/BIO, cluster, swap_pager.

* tsleep_interlock()/tsleep() could miss wakeups during periods of heavy cpu activity. What would happen is that code in between the two calls would try to send an IPI (say, issue a wakeup()), but while sending the IPI the kernel would be forced to process incoming IPIs synchronously to avoid a deadlock.

  The new tsleep_interlock()/tsleep() code adds another TAILQ_ENTRY to the thread structure, allowing tsleep_interlock() to formally place the thread on the appropriate sleep queue without having to deschedule the thread. Any wakeup which occurs between the interlock and the real tsleep() call will remove the thread from the queue, and the later tsleep() call will recognize this and simply return without sleeping.

  The new tsleep() call requires PINTERLOCKED to be passed to tsleep() so it knows that the thread has already been placed on a sleep queue (a usage sketch follows below).

* Continue making BUF/BIO MPSAFE. Remove B_ASYNC and B_WANT from buf->b_flags and add a new bio->bio_flags field to the bio. Add BIO_SYNC, BIO_WANT, and BIO_DONE. Use atomic_cmpset_int() (aka cmpxchg) to interlock biodone() against biowait().

  vn_strategy() and dev_dstrategy() call semantics now require that synchronous BIOs install a bio_done function and set BIO_SYNC in the bio.

* Clean up the cluster code a bit.

* Redo the swap_pager code. Instead of issuing I/O during the collection, which depended on critical sections to avoid races in the cluster append, we now build the entire collection first and then dispatch the I/O. This allows us to use only async completion for the BIOs, instead of a hybrid sync-or-async completion.
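A usage sketch of the interlocked sleep from the first bullet. The prototypes and the PINTERLOCKED value below are simplified for illustration, but the ordering shown is the point of the change: queue the thread first, re-test the condition, then sleep with PINTERLOCKED so a wakeup() issued in the window between the two calls is not lost.

```c
/*
 * Simplified prototypes for illustration -- see the lead-in.
 */
void tsleep_interlock(const volatile void *ident, int flags);
int  tsleep(const volatile void *ident, int flags, const char *wmesg, int timo);

#define PINTERLOCKED	0x0400		/* illustrative flag value */

static volatile int done;

static void
wait_for_done(void)
{
	while (done == 0) {
		tsleep_interlock(&done, 0);	/* queue ourselves first     */
		if (done)			/* re-test after the interlock */
			break;
		/* PINTERLOCKED: we are already on the sleep queue */
		tsleep(&done, PINTERLOCKED, "waitdn", 0);
	}
}
```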
Add bio_ops->io_checkread and io_checkwrite - a read and write pre-check which gives HAMMER a chance to set B_LOCKED if the kernel wants to write out a passively held buffer.

Change B_LOCKED semantics slightly. B_LOCKED buffers will not be written until B_LOCKED is cleared. This allows HAMMER to hold off B_DELWRI writes on passively held buffers.
Convert the global 'bioops' into per-mount bio_ops. For now we also have to have a per-buffer b_ops as well, since the controlling filesystem cannot be located from information in struct buf (b_vp could be the backing store, so that can't be used). This change allows HAMMER to use bio_ops.

Change the ordering of the bio_ops.io_deallocate call so it occurs before the buffer's B_LOCKED is checked. This allows the deallocate call to set B_LOCKED to retain the buffer in situations where the target filesystem is unable to immediately disassociate the buffer.

Also keep VMIO intact for B_LOCKED buffers (in addition to B_DELWRI buffers). HAMMER will use this feature to keep buffers passively associated with other filesystem structures and thus be able to avoid constantly brelse()ing and getblk()ing them.
I'm growing tired of having to add #include lines for header files that the include file(s) I really want depend on. Go through nearly all major system include files and add appropriately #ifndef'd #include lines to include all dependent header files.

Kernel source files now only need to #include the header files they directly depend on. So, for example, if I wanted to add a SYSCTL to a kernel source file, I would only have to #include <sys/sysctl.h> to bring in the support for it, rather than four or five header files in addition to <sys/sysctl.h>.
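For instance, the added lines in a header typically take the form shown below (the header and guard names here are just an example of the convention, not a specific file from the tree).

```c
/*
 * Example of an #ifndef'd dependency include near the top of a system
 * header, so consumers need only include this header directly.
 */
#ifndef _SYS_EXAMPLE_H_
#define _SYS_EXAMPLE_H_

#ifndef _SYS_TYPES_H_
#include <sys/types.h>		/* pulled in so consumers don't have to */
#endif
#ifndef _SYS_QUEUE_H_
#include <sys/queue.h>
#endif

/* ... declarations that depend on the headers above ... */

#endif	/* !_SYS_EXAMPLE_H_ */
```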