kernel - Refactor vfs_cache 2/N

* Use lockmgr locks for the ncp lock.  Convert nc_lockstatus / nc_locktd
  to a struct lock (see the sketch below).

  lockmgr locks use atomic_fetchadd_*() instead of atomic_fcmpset_*() for
  nominal shared and exclusive lock count operations, which avoids
  contention loops on failed fcmpset operations.  There is still cache
  line contention, but since the code doesn't have to loop as much it
  scales to core count a whole lot better.

* Two experimental __cachealign alignments were added, bloating struct
  namecache.  This won't stay this way.

* Retain the non-optimal nc_vp ref count mess, which is why nc_vprefs is
  needed.  This will be fixed next.
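A minimal sketch of the field conversion, assuming the new embedded lock is
named nc_lock; the field name and everything besides nc_lockstatus /
nc_locktd / nc_vprefs is an assumption, and the real struct namecache in
sys/namecache.h has many more members:

    #include <sys/lock.h>

    /*
     * Hypothetical sketch only: the hand-rolled lock state is replaced
     * by a full lockmgr lock embedded in the namecache entry.  The exact
     * placement of the __cachealign padding is not reproduced here.
     */
    struct namecache {
            /* ... */
    #if 0
            int             nc_lockstatus;  /* old: shared/excl state bits */
            struct thread   *nc_locktd;     /* old: exclusive lock holder */
    #endif
            struct lock     nc_lock;        /* new: embedded lockmgr lock */
            int             nc_vprefs;      /* refs on nc_vp (to be reworked) */
            /* ... */
    } __cachealign;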
kernel/lockmgr: Add lockmgr_try().

It just adds LK_NOWAIT to the flags and returns whether the lock was
obtained.  It is similar to other functions such as spin_trylock() or
FreeBSD's mtx_trylock() and can be used to port the latter.

Note that like these functions, it returns TRUE if successful, while
lockmgr() returns 0 if successful.  This difference was the source of
minor confusion and porting mistakes in the past.  In fact, our driver
porting document also didn't point out this difference.  I will fix some
of these little issues in a separate commit.
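A minimal sketch of what such a wrapper looks like, assuming the
two-argument lockmgr() form; the actual inline lives in sys/lock.h and may
differ in detail:

    /*
     * Sketch only: returns TRUE (non-zero) if the lock was obtained,
     * mirroring spin_trylock()/mtx_trylock(), whereas lockmgr() itself
     * returns 0 on success.
     */
    static __inline boolean_t
    lockmgr_try(struct lock *lkp, u_int flags)
    {
            return (lockmgr(lkp, flags | LK_NOWAIT) == 0);
    }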
kernel - Refactor lockmgr()

* Seriously refactor lockmgr() so we can use atomic_fetchadd_*() for
  shared locks and reduce unnecessary atomic ops and atomic op loops.

  The main win here is being able to use atomic_fetchadd_*() when
  acquiring and releasing shared locks (see the sketch below).  A simple
  fstat() loop (which utilizes a LK_SHARED lockmgr lock on the vnode)
  improves from 191ns to around 110ns per loop with 32 concurrent threads
  (on a 16-core/32-thread Xeon).

* To accomplish this, the 32-bit lk_count field becomes 64-bits.  The
  shared count is separated into the high 32 bits, allowing it to be
  manipulated for both blocking shared requests and the shared lock count
  field.  The low count bits are used for exclusive locks.  Control bits
  are adjusted to manage lockmgr features.

  LKC_SHARED    Indicates the shared lock count is active, else the
                exclusive lock count is.  Can predispose the lock when
                the related count is 0 (does not have to be cleared, for
                example).

  LKC_UPREQ     Queued upgrade request.  Automatically granted by the
                releasing entity (UPREQ -> ~SHARED|1).

  LKC_EXREQ     Queued exclusive request (only when the lock is held
                shared).  Automatically granted by the releasing entity
                (EXREQ -> ~SHARED|1).

  LKC_EXREQ2    Aggregated exclusive request.  When EXREQ cannot be
                obtained due to the lock being held exclusively or EXREQ
                already being queued, EXREQ2 is flagged for
                wakeup/retries.

  LKC_CANCEL    Cancel API support.

  LKC_SMASK     Shared lock count mask (LKC_SCOUNT increments).

  LKC_XMASK     Exclusive lock count mask (+1 increments).

  The 'no lock' condition occurs when LKC_XMASK is 0 and LKC_SMASK is 0,
  regardless of the state of LKC_SHARED.

* Lockmgr still supports exclusive priority over shared locks.  The
  semantics have slightly changed.  The priority mechanism only applies
  to the EXREQ holder.  Once an exclusive lock is obtained, any blocking
  shared or exclusive locks will have equal priority until the exclusive
  lock is released.  Once released, shared locks can squeeze in, but then
  the next pending exclusive lock will assert its priority over any new
  shared locks when it wakes up and loops.

  This isn't quite what I wanted, but it seems to work quite well.  I had
  to make a trade-off in the EXREQ lock-grant mechanism to improve
  performance.

* In addition, we use atomic_fcmpset_long() instead of
  atomic_cmpset_long() to reduce cache line flip-flopping at least a
  little.

* Remove lockcount() and lockcountnb(), which tried to count lock refs.
  Replace with lockinuse(), which simply tells the caller whether the
  lock is referenced or not.

* Expand some of the copyright notices (years and authors) for major
  rewrites.  Really there are a lot more and I have to pay more attention
  to adjustments.
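A rough sketch of the shared-lock fast path this layout enables.  The flag
values, the helper name, and the u_long count type are illustrative
assumptions; the real definitions live in sys/lock.h, and the real code
also handles the free and predisposed cases that this sketch punts to the
slow path:

    /* Illustrative bit layout: shared count in the high 32 bits. */
    #define LKC_SCOUNT      0x0000000100000000LU    /* one shared ref */
    #define LKC_SMASK       0xFFFFFFFF00000000LU    /* shared count mask */
    #define LKC_XMASK       0x00000000000FFFFFLU    /* exclusive count mask */
    #define LKC_SHARED      0x0000000001000000LU    /* shared mode active */
    #define LKC_EXREQ       0x0000000002000000LU    /* exclusive req queued */
    #define LKC_UPREQ       0x0000000004000000LU    /* upgrade req queued */

    static int
    shared_fastpath(volatile u_long *countp)
    {
            u_long count;

            /*
             * A single fetchadd bumps the shared count; no cmpset retry
             * loop is needed even under heavy contention.  fetchadd
             * returns the old value, so add LKC_SCOUNT back to see the
             * value we produced.
             */
            count = atomic_fetchadd_long(countp, LKC_SCOUNT) + LKC_SCOUNT;

            /*
             * Success if the lock is already in shared mode and there is
             * no exclusive holder and no queued exclusive/upgrade request.
             */
            if ((count & LKC_SHARED) &&
                (count & (LKC_XMASK | LKC_EXREQ | LKC_UPREQ)) == 0)
                    return (1);

            /* Otherwise back the count out and take the slow path. */
            atomic_fetchadd_long(countp, (u_long)-(long)LKC_SCOUNT);
            return (0);
    }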
kernel - Refactor smp collision statistics (2)

* Refactor indefinite_info mechanics.  Instead of tracking indefinite
  loops on a per-thread basis for tokens, track them on a scheduler
  basis.  The scheduler records the overhead while it is live-looping on
  tokens, but the moment it finds a thread it can actually schedule it
  stops (then restarts later the next time it is entered), even if some
  of the other threads still have unresolved tokens.

  This gives us a fairer representation of how many cpu cycles are
  actually being wasted waiting for tokens.

* Go back to using a local indefinite_info in the lockmgr*(), mutex*(),
  and spinlock code.

* Refactor lockmgr() by implementing an __inline frontend to interpret
  the directive.  Since this argument is usually a constant, the change
  effectively removes the switch() (see the sketch below).

  Use LK_NOCOLLSTATS to create a clean recursion to wrap the blocking
  case with the indefinite*() API.
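A simplified sketch of the __inline frontend idea; the backend function
names below are illustrative, not necessarily the actual kernel symbols:

    /*
     * When 'flags' is a compile-time constant at the call site (the
     * usual case), the compiler folds the switch away and emits a direct
     * call to a single backend.
     */
    static __inline int
    lockmgr(struct lock *lkp, u_int flags)
    {
            switch (flags & LK_TYPE_MASK) {
            case LK_SHARED:
                    return (lockmgr_shared(lkp, flags));
            case LK_EXCLUSIVE:
                    return (lockmgr_exclusive(lkp, flags));
            case LK_DOWNGRADE:
                    return (lockmgr_downgrade(lkp, flags));
            case LK_RELEASE:
                    return (lockmgr_release(lkp, flags));
            default:
                    return (lockmgr_other(lkp, flags));
            }
    }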
kernel - Refactor smp collision statistics

* Add an indefinite wait timing API (sys/indefinite.h, sys/indefinite2.h).

  This interface uses the TSC and will record lock latencies to our pcpu
  stats in microseconds.  The systat -pv 1 display shows this under
  smpcoll.

  Note that latencies generated by tokens, lockmgr, and mutex locks do
  not necessarily reflect actual lost cpu time, as the kernel will
  schedule other threads while those are blocked, if other threads are
  available.

* Formalize TSC operations more, supply a type (tsc_uclock_t and
  tsc_sclock_t).

* Reinstrument lockmgr, mutex, token, and spinlocks to use the new
  indefinite timing interface (see the usage sketch below).
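The intended usage pattern is roughly as follows.  This is a sketch based
on the description above; the indefinite_*() names and signatures are
assumptions rather than a verified copy of sys/indefinite2.h, and the
busy-wait loop stands in for the real sleep/spin paths being instrumented:

    static void
    example_contended_acquire(struct lock *lkp)
    {
            indefinite_info_t info;

            indefinite_init(&info, "lock", 1, 'l'); /* start TSC timing */
            while (lockmgr(lkp, LK_EXCLUSIVE | LK_NOWAIT) != 0)
                    indefinite_check(&info);        /* accumulate wait time */
            indefinite_done(&info);                 /* fold usecs into pcpu stats */
    }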
kernel - Add lock canceling features

* The current (typically exclusive) lock holder can enable cancel mode
  by executing lockmgr(lk, LK_CANCEL_BEG, 0).  This call always succeeds.
  The lock state is not otherwise affected.

  Any threads currently blocked on the lock, and any future thread which
  attempts to gain the lock, that also specify the LK_CANCELABLE flag
  will be canceled as long as cancel mode is active, and their operation
  will return ENOLCK.

  NOTE! Threads which do not specify LK_CANCELABLE are not affected by
  cancel mode and their blocking locks will block normally.

  WARNING! Cancel mode is not stackable.  The system will panic if you
  enable cancel mode on a lock where it is already enabled.

* The current (typically exclusive) lock holder can terminate cancel
  mode by executing lockmgr(lk, LK_CANCEL_END, 0).  This call always
  succeeds.  Once cancel mode is terminated, any other threads that would
  block on the lock and specify the LK_CANCELABLE flag will block
  normally and not be canceled.

  The current lock holder can also terminate cancel mode by simply
  releasing the last lock with LK_RELEASE.  That is, a release where the
  lock count returns to 0.

* Lock canceling is an optional feature.  Your lock cannot be canceled
  unless you specify LK_CANCELABLE (see the usage sketch below).
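In usage terms the API looks roughly like this.  A sketch only, written
with the two-argument lockmgr() form and abbreviated error handling:

    #include <sys/lock.h>

    /* Holder side: enable cancel mode around a long-running operation. */
    static void
    holder_example(struct lock *lk)
    {
            lockmgr(lk, LK_EXCLUSIVE);
            lockmgr(lk, LK_CANCEL_BEG);     /* always succeeds; not stackable */
            /* ... long-running work while other threads may try the lock ... */
            lockmgr(lk, LK_CANCEL_END);     /* or implicit on the final LK_RELEASE */
            lockmgr(lk, LK_RELEASE);
    }

    /* Opt-in side: only LK_CANCELABLE waiters can be canceled. */
    static int
    waiter_example(struct lock *lk)
    {
            int error;

            error = lockmgr(lk, LK_EXCLUSIVE | LK_CANCELABLE);
            if (error == ENOLCK)
                    return (error);         /* canceled by the holder; back out */
            /* ... got the lock ... */
            lockmgr(lk, LK_RELEASE);
            return (0);
    }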
kernel - Performance tuning (3)

* The VOP_CLOSE issues revealed a bigger issue with vn_lock().  Many
  callers do not check the return code for vn_lock(), and in nearly all
  of those cases it wouldn't fail anyway due to a prior ref, but it
  creates an API issue.

* Add the LK_FAILRECLAIM flag to vn_lock().  This flag explicitly allows
  vn_lock() to fail if the vnode is undergoing reclamation.  This fixes
  numerous issues, particularly when VOP_CLOSE() is called during a
  reclaim due to recent LK_UPGRADE's that we do in some VFS *_close()
  functions (see the sketch below).

* Remove some unused LK_ defines.
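A sketch of the calling convention this enables; the surrounding function
is illustrative:

    /*
     * With LK_FAILRECLAIM the caller must be prepared for vn_lock() to
     * fail if the vnode is being reclaimed, instead of silently relying
     * on a prior ref to make failure impossible.
     */
    static int
    example_lock_vnode(struct vnode *vp)
    {
            int error;

            error = vn_lock(vp, LK_EXCLUSIVE | LK_FAILRECLAIM);
            if (error) {
                    /* vnode went into reclamation; do not touch it further */
                    return (error);
            }
            /* ... operate on the locked vnode ... */
            vn_unlock(vp);
            return (0);
    }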
kernel - Rewrite lockmgr / struct lock

* Rewrite lockmgr() to remove the exclusive spinlock used internally to
  guard operations.

* Retain existing API and operational semantics.  This is primarily:

  - Acquiring a LK_SHARED lock on a lock the caller already owns
    exclusively simply bumps the count and retains the exclusive nature
    of the lock.

  - Exclusive requests and upgrade requests have priority over shared
    locks even if the lock is currently held shared, unless the thread
    is flagged for deadlock treatment.

  - Upgrade requests are capable of guaranteeing the upgrade (as before).
    This could be further enhanced because we now have the last release
    transfer the exclusive lock to the upgrade requestor, but the
    original API didn't have a function for this so neither do we.  The
    more primitive detection method is used (aka LK_SLEEPFAIL and/or
    LK_EXCLUPGRADE).

* Reduce multiple tracking fields into one field so we can use
  atomic_cmpset_int().

* Hot-path common operations.  A single atomic_cmpset_int() gets us
  through (see the sketch below).
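A rough sketch of the hot-path idea; the bit layout and helper name are
made up for illustration, and the real flag values live in sys/lock.h of
that era:

    #define LKC_ILL_EXCL    0x80000000      /* illustrative "exclusive" bit */

    static int
    excl_fastpath(volatile u_int *countp)
    {
            /* free -> exclusive with count 1, in a single atomic op */
            if (*countp == 0 &&
                atomic_cmpset_int(countp, 0, LKC_ILL_EXCL | 1))
                    return (1);
            return (0);     /* contended: fall back to the slow path */
    }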
kernel - Major SMP performance patch / VM system, bus-fault/seg-fault fixes

This is a very large patch which reworks locking in the entire VM
subsystem, concentrated on VM objects and the x86-64 pmap code.  These
fixes remove nearly all the spin lock contention for non-threaded VM
faults and narrow contention for threaded VM faults to just the threads
sharing the pmap.

Multi-socket many-core machines will see a 30-50% improvement in parallel
build performance (tested on a 48-core opteron), depending on how well
the build parallelizes.

As part of this work a long-standing problem on 64-bit systems where
programs would occasionally seg-fault or bus-fault for no reason has been
fixed.  The problem was related to races between vm_fault, the vm_object
collapse code, and the vm_map splitting code.

* Most uses of vm_token have been removed.  All uses of vm_spin have been
  removed.  These have been replaced with per-object tokens and per-queue
  (vm_page_queues[]) spin locks.

  Note in particular that since we still have the page coloring code the
  PQ_FREE and PQ_CACHE queues are actually many queues, individually
  spin-locked, resulting in very excellent MP page allocation and freeing
  performance.

* Reworked vm_page_lookup() and vm_object->rb_memq.  All (object, pindex)
  lookup operations are now covered by the vm_object hold/drop system,
  which utilizes pool tokens on vm_objects.  Calls now require that the
  VM object be held in order to ensure a stable outcome (see the sketch
  after this commit).

  Also added vm_page_lookup_busy_wait(), vm_page_lookup_busy_try(),
  vm_page_busy_wait(), vm_page_busy_try(), and other API functions which
  integrate the PG_BUSY handling.

* Added OBJ_CHAINLOCK.  Most vm_object operations are protected by the
  vm_object_hold/drop() facility, which is token-based.  Certain critical
  functions which must traverse backing_object chains use a hard-locking
  flag and lock almost the entire chain as it is traversed to prevent
  races against object deallocation, collapses, and splits.

  The last object in the chain (typically a vnode) is NOT locked in this
  manner, so concurrent faults which terminate at the same vnode will
  still have good performance.  This is important e.g. for parallel
  compiles which might be running dozens of the same compiler binary
  concurrently.

* Created a per-vm_map token and removed most uses of vmspace_token.

* Removed the mp_lock in sys_execve().  It has not been needed in a
  while.

* Add kmem_lim_size() which returns approximate available memory (reduced
  by available KVM), in megabytes.  This is now used to scale up the slab
  allocator cache and the pipe buffer caches to reduce unnecessary global
  kmem operations.

* Rewrote vm_page_alloc(), various bits in vm/vm_contig.c, the swapcache
  scan code, and the pageout scan code.  These routines were rewritten to
  use the per-queue spin locks.

* Replaced the exponential backoff in the spinlock code with something a
  bit less complex and cleaned it up.

* Restructured the IPIQ func/arg1/arg2 array for better cache locality.
  Removed the per-queue ip_npoll and replaced it with a per-cpu gd_npoll,
  which is used by other cores to determine if they need to issue an
  actual hardware IPI or not.  This reduces hardware IPI issuance
  considerably (and the removal of the decontention code reduced it even
  more).

* Temporarily removed the lwkt thread fairq code and disabled a number of
  features.  These will be worked back in once we track down some of the
  remaining performance issues.

  Temporarily removed the lwkt thread resequencer for tokens for the same
  reason.  This might wind up being permanent.

  Added splz_check()s in a few critical places.

* Increased the number of pool tokens from 1024 to 4001 and went to a
  prime-number mod algorithm to reduce overlaps.

* Removed the token decontention code.  This was a bit of an eyesore, and
  while it did its job when we had global locks it just gets in the way
  now that most of the global locks are gone.

  Replaced the decontention code with a fallback which acquires the
  tokens in sorted order, to guarantee that deadlocks will always be
  resolved eventually in the scheduler.

* Introduced a simplified spin-for-a-little-while function
  _lwkt_trytoken_spin() that the token code now uses rather than giving
  up immediately.

* The vfs_bio subsystem no longer uses vm_token and now uses the
  vm_object_hold/drop API for buffer cache operations, resulting in very
  good concurrency.

* Gave the vnode its own spinlock instead of sharing
  vp->v_lock.lk_spinlock, which fixes a deadlock.

* Adjusted all platform pmap.c's to handle the new main kernel APIs.  The
  i386 pmap.c is still a bit out of date but should be compatible.

* Completely rewrote very large chunks of the x86-64 pmap.c code.  The
  critical path no longer needs pmap_spin, but pmap_spin itself is still
  used heavily, particularly in the pv_entry handling code.

  A per-pmap token and per-pmap object are now used to serialize pmap
  access and vm_page lookup operations when needed.

  The x86-64 pmap.c code now uses only vm_page->crit_count instead of
  both crit_count and hold_count, which fixes races against other parts
  of the kernel that use vm_page_hold().

  _pmap_allocpte() mechanics have been completely rewritten to remove
  potential races.  Much of pmap_enter() and pmap_enter_quick() has also
  been rewritten.  Many other changes.

* The following subsystems (and probably more) no longer use the vm_token
  or vmobj_token in critical paths:

  x The swap_pager now uses the vm_object_hold/drop API instead of
    vm_token.

  x mmap() and vm_map/vm_mmap in general now use the vm_object_hold/drop
    API instead of vm_token.

  x vnode_pager

  x zalloc

  x vm_page handling

  x vfs_bio

  x umtx system calls

  x vm_fault and friends

* Minor fixes to fill_kinfo_proc() to deal with process scan panics (ps)
  revealed by recent global lock removals.

* lockmgr() locks no longer support LK_NOSPINWAIT.  Spin locks are
  unconditionally acquired.

* Replaced netif/e1000's spinlocks with lockmgr locks.  The spinlocks
  were not appropriate owing to the large context they were covering.

* Misc atomic ops added.
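For example, an (object, pindex) lookup under the new convention looks
roughly like this.  A sketch only; real callers typically also busy the
page via vm_page_busy_try()/vm_page_busy_wait() before dropping the hold:

    /*
     * vm_page_lookup() is only stable while the object is held via the
     * token-based hold/drop API, which guards against collapses, splits,
     * and deallocation of the object.
     */
    static vm_page_t
    example_lookup(vm_object_t object, vm_pindex_t pindex)
    {
            vm_page_t m;

            vm_object_hold(object);
            m = vm_page_lookup(object, pindex);
            /* real callers busy the page here before dropping the hold */
            vm_object_drop(object);

            return (m);
    }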