kernel - vnode recycling, intermediate fix

* Fix a condition where vnlru (the vnode recycler) can live-lock on
  unsuitable vnodes in the inactive list and stop making progress,
  causing the system to block.

  First, don't deactivate vnodes which the inactive scan won't recycle.
  Vnodes which are in the namecache topology but not at a leaf won't be
  recycled by the vnlru thread.  Leave these vnodes on the active queue.
  This prevents the inactive queue from filling up with vnodes that it
  can't recycle.

  Second, the active scan in vnlru() will now call
  cache_inval_vp_quick() to attempt to make a vnode presentable so it
  can be deactivated (see the sketch after this entry).  The inactive
  scan also does the same thing, because some leakage can happen anyway.

* The active scan should be able to make continuous progress as
  successful cache_inval_vp_quick() calls make more and more vnodes
  presentable that might previously have been internal nodes in the
  namecache topology.  So the active scan should be able to achieve the
  desired balance between the active and inactive queues.

* This should also improve performance when constant recycling is
  happening by moving more of the work to the active->inactive
  transition and doing less work in the inactive->free transition.

* Add cache_inval_vp_quick(), a function which attempts to trivially
  disassociate a vnode from the namecache topology, handling any direct
  children if the vnode is not at a leaf (but not recursing on its own).
  The definition of 'trivially' for the children is: namecache records
  that can be locked non-blocking, have no additional refs, and do not
  record a vnode.

* Clean up cache_unlink_parent().  Have cache_zap() use this function
  instead of re-rolling the same code.  The cache_rename() code winds up
  being slightly more complex, and now cache_inval_vp_quick() can use
  the function too.
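A minimal sketch of the reworked active-scan decision, using simplified
stand-in types; vnode_is_nch_leaf() and vdeactivate() are hypothetical
helpers for illustration, not the kernel's actual functions:

    struct vnode;
    int  cache_inval_vp_quick(struct vnode *vp);  /* 0 on success */
    int  vnode_is_nch_leaf(struct vnode *vp);     /* hypothetical */
    void vdeactivate(struct vnode *vp);           /* hypothetical */

    static void
    vnlru_active_scan_one(struct vnode *vp)
    {
            /*
             * Internal namecache nodes cannot be recycled by the
             * inactive scan, so first try to trivially disassociate
             * the vnode and only deactivate it on success.
             */
            if (!vnode_is_nch_leaf(vp) && cache_inval_vp_quick(vp) != 0)
                    return;                 /* stay on the active queue */
            vdeactivate(vp);                /* active -> inactive */
    }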
hammer2 - Fix issue where deleted files sometimes linger until umount (2)

This is related to the issue of having to retain the inodes for deleted
files that still have live references.  Even though their nlinks count
has dropped to 0, such inodes must be retained and be fully operational
until the last live reference goes away.  When that reference DOES go
away, we need to dispose of the inode as quickly as possible.

* The last fix wasn't good enough.  Some vnodes still lingered for
  indefinite periods of time after a rm -rf.  In addition, the last fix
  attempted to clean out inodes that might still have had dirty buffers
  associated with the vnode.

* Fall back to the method that UFS and HAMMER1 use, which is to obtain
  a full ref on ip->vp using vget() (or similar) that we can cycle to
  force the vnode to be inactivated (see the sketch after this entry).
  This also entails using the inode lock in the inactive/reclaim path
  to interlock the ip->vp access, unfortunately.  The vnode buffers and
  inode are now cleaned up in the inactivation path (when nlinks is 0)
  instead of the reclaim path.

* Validated against a (roughly) 20 million inode distfile unpack and
  another few million inodes created via grok processing.

* Add a vfs support function in the kernel called vfinalize() which
  operates on a referenced vnode.  This function flags the vnode for
  immediate deactivation when the last ref is released.
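A sketch of the ref-cycling idea under the assumptions above; the
surrounding locking and the exact call site are simplified, so this
models rather than reproduces the committed code:

    /*
     * When the last nlink is gone, cycle a full ref on the vnode so
     * that its release immediately runs the inactivation path, where
     * the buffers and the inode itself can now be cleaned up.
     */
    if (ip->meta.nlinks == 0 && (vp = ip->vp) != NULL) {
            if (vget(vp, LK_EXCLUSIVE) == 0) {
                    vfinalize(vp);  /* deactivate on last ref release */
                    vput(vp);       /* unlock + drop ref -> inactivate */
            }
    }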
kernel - Add kmalloc_obj subsystem step 1

* Implement per-zone memory management for kmalloc() in the form of
  kmalloc_obj() and friends.  Currently the subsystem uses the same
  malloc_type structure but is otherwise distinct from the normal
  kmalloc(), so to avoid programming mistakes the *_obj() subsystem
  post-pends '_obj' to malloc_type pointers passed into it.  This
  mechanism will eventually replace objcache.

  This mechanism is designed to greatly reduce fragmentation issues on
  systems with long uptimes.  Eventually the feature will be better
  integrated and I will be able to remove the _obj stuff.

* This is an object allocator, so the zone must be dedicated to one
  type of object with a fixed size.  All allocations out of the zone
  are of that object.  The allocator is not quite type-stable yet, but
  will be once existential locks are integrated into the freeing
  mechanism.

* Implement a mini-slab allocator for management.  Since the zones are
  single-object, similar to objcache, the fixed-size mini-slabs are a
  lot easier to optimize and much simpler in construction than the main
  kernel slab allocator.

  Uses a per-zone/per-cpu active/alternate slab with an ultra-optimized
  allocation path, and a per-zone partial/full/empty list.  Also has a
  globaldata-based per-cpu cache of free slabs.  The mini-slab
  allocator frees slabs back to the same cpu they were originally
  allocated from in order to retain memory locality over time.

* Implement a passive cleanup poller.  This currently polls kmalloc
  zones very slowly, looking for excess full slabs to release back to
  the global slab cache or the system (if the global slab cache is
  full).  This code will ultimately also handle existential type-stable
  freeing.

* Fragmentation is greatly reduced due to the distinct zones.  Slabs
  are dedicated to the zone and do not share allocation space with
  other zones.  Also, when a zone is destroyed, all of its memory is
  cleanly disposed of and there is no left-over fragmentation.

* Initially use the new interface (usage sketched after this entry) for
  the following zones, which tend to or can become quite big:

  vnodes
  namecache (but not related strings)
  hammer2 chains
  hammer2 inodes
  tmpfs nodes
  tmpfs dirents (but not related strings)
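The usage pattern would look roughly like this; the macro name and its
arguments are assumptions based on the '_obj' suffix convention
described above, not a quote of the committed API:

    /* declare a fixed-size, single-object zone (names illustrative) */
    MALLOC_DEFINE_OBJ(M_FOONODE, sizeof(struct foonode),
                      "foonode", "foonode structures");

    struct foonode *fn;

    /*
     * The '_obj' suffix on the type pointer guards against mixing
     * the two allocator families by accident.
     */
    fn = kmalloc_obj(sizeof(*fn), M_FOONODE_obj, M_WAITOK | M_ZERO);
    /* ... use fn ... */
    kfree_obj(fn, M_FOONODE_obj);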
kernel - Refactor cache_vref() using counter trick

* Refactor cache_vref() such that it is able to validate that a vnode
  (whose ref count might be 0) is not in VRECLAIM, without acquiring
  the vnode lock.  This is the normal case.  If cache_vref() is unable
  to do this, it backs down to the old method, which was to get a
  vnode lock, validate that the vnode is not in VRECLAIM, then release
  the lock.  A model of the counter trick follows this entry.

* NOTE: In DragonFlyBSD, holding a vref on a vnode (vref, NOT vhold)
  will prevent the vnode from transitioning to VRECLAIM.

* Use the new feature for nlookup's naccess() tests and for the
  *stat*() series of system calls.  This significantly increases
  performance.  However, we are not entirely cache-contention free, as
  both the namecache entry and the vnode are still referenced,
  requiring atomic adds.
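A userland model of the counter trick, using C11 atomics in place of
the kernel primitives; the field names and the odd-count-means-writer
convention are illustrative assumptions:

    #include <stdatomic.h>
    #include <stdbool.h>

    #define VRECLAIM 0x0001

    struct vnode {
            atomic_uint v_update;   /* bumped around vnode state changes */
            atomic_uint v_flags;
    };

    /* returns true if !VRECLAIM could be validated without the lock */
    static bool
    vref_quick(struct vnode *vp)
    {
            unsigned upd = atomic_load(&vp->v_update);

            if (upd & 1)                            /* change in progress */
                    return false;
            if (atomic_load(&vp->v_flags) & VRECLAIM)
                    return false;
            /* ... acquire the ref here ... */

            /*
             * Revalidate: any intervening state change forces the
             * locked slow path (a real implementation would back the
             * ref out on failure).
             */
            return atomic_load(&vp->v_update) == upd;
    }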
kernel - Normalize the vx_*() vnode interface

* The vx_*() vnode interface is used for initial allocations, reclaims,
  and terminations.  Normalize all use cases to prevent the mixing
  together of the vx_*() API and the vn_*() API.  For example,
  vx_lock() should not be paired with vn_unlock(), and so forth.

* Integrate an update-counter mechanism into the vx_*() API and assert
  reasonability.

* Change vfs_cache.c to use an int update counter instead of a long.
  The vfs_cache code can't quite use the spin-lock update counter API
  yet.  Use proper atomics for load and store.

* Implement VOP_GETATTR_QUICK, meant to be a 'quick' version of
  VOP_GETATTR() that only retrieves information related to permissions
  and ownership.  This will be fast-pathed in a later commit.

* Implement vx_downgrade() to convert an exclusive vx_lock into an
  exclusive vn_lock (for vnodes).  Adjust all use cases in the
  getnewvnode() path (see the sketch after this entry).

* Remove unnecessary locks in tmpfs_getattr() and don't use any in
  tmpfs_getattr_quick().

* Remove unnecessary locks in hammer2_vop_getattr() and don't use any
  in hammer2_vop_getattr_quick().
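A sketch of the normalized pairing discipline in the getnewvnode()
path; the call sequence is reconstructed from the description above
and the argument details are elided:

    vp = getnewvnode(tag, mp, 0, 0);    /* returned vx_lock()ed */

    /* ... initialize v_data, v_type, etc. while vx-locked ... */

    /*
     * Convert the exclusive vx_lock into an exclusive vn_lock.  From
     * here on only the vn_*() API may be used on this vnode; mixing
     * the APIs (e.g. vx_lock() + vn_unlock()) is disallowed.
     */
    vx_downgrade(vp);
    vn_unlock(vp);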
kernel - Comment future vrele() code intention

* vrele() currently uses atomic_fcmpset_*() and will in the future use
  atomic_fetchadd_*() instead, but I can't change it without a bit more
  work.

* Avoid updating v_flag and v_act if the values do not change, reducing
  SMP contention a bit (see the sketch after this entry).
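The contention-avoidance pattern as a minimal model; the field and
value names are illustrative:

    /*
     * Read before write: skip the store entirely when the value is
     * already correct, so a cache line shared with other cpus is not
     * needlessly invalidated.
     */
    if (vp->v_act != new_act)
            vp->v_act = new_act;
    if ((vp->v_flag & wanted_mask) != wanted_bits)
            vp->v_flag = (vp->v_flag & ~wanted_mask) | wanted_bits;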
kernel - Refactor vfs_cache 3/N

* Leave the vnode held for each linked namecache entry, allowing us to
  remove all the hold/drop code for 0->1 and 1->0 lock transitions of
  ncps.  This significantly simplifies the cache_lock*() and
  cache_unlock() functions.

* Adjust the vnode recycling code to check v_auxrefs against
  v_namecache_count instead of against 0 (see the sketch after this
  entry).
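A sketch of the adjusted recycle test; the field names follow the
commit text, while the surrounding logic is an assumption:

    /*
     * Each linked ncp now holds the vnode, so auxiliary refs that
     * merely match the namecache link count no longer block recycling.
     */
    if (vp->v_auxrefs != vp->v_namecache_count)
            return (EBUSY);         /* something else really holds it */
    /* ... safe to recycle ... */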
kernel: Remove numerous #include <sys/thread2.h>.

Most of them were added when we converted spl*() calls to
crit_enter()/crit_exit(), almost 14 years ago.  We can now remove a
good chunk of them again where crit_*() calls are no longer used.

I had to adjust some files that were relying on thread2.h, or on
headers that it includes, coming in indirectly via other headers from
which it was removed.
kernel - Fix rare vref() assertion

* The VREF_TERMINATE flag gets cleared when a vnode is reactivated.
  However, concurrent LK_SHARED locks on vnodes can race the v_state
  test.  Thus the code cannot assume that VREF_TERMINATE has been
  cleared when v_state is VS_ACTIVE.  To avoid the race, we simply
  unconditionally clear VREF_TERMINATE on a successful vget() (see the
  sketch after this entry).

* Could be reproduced by running blogbench and synth together, both of
  which generate extreme filesystem-intensive loads.
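A model of the fix; the real vget() is considerably more involved, and
keeping VREF_TERMINATE inside v_refcnt is an assumption here:

    if (vget(vp, LK_SHARED) == 0) {
            /*
             * Cannot assert that VREF_TERMINATE is already clear: a
             * concurrent shared vget() can observe VS_ACTIVE before
             * the reactivating cpu clears the flag.  Clearing it
             * unconditionally is cheap and closes the race.
             */
            atomic_clear_int(&vp->v_refcnt, VREF_TERMINATE);
            /* ... proceed with the vnode ... */
    }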
kernel - Localize [in]activevnodes globals, improve allocvnode

* Move to globaldata, keeping the globals as rollup statistics (see the
  sketch after this entry).

* We already solved the normal active->inactive->active issues in prior
  work; this change primarily affects vnode termination, such as for
  unlink operations.

* Enhance allocvnode() to reuse a convenient reclaimed vnode if we can
  find one on the pcpu's inactive list and lock it non-blocking.  This
  reduces unnecessary vnode count bloat.
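A userland model of the pcpu counter plus rollup pattern; in the kernel
the counters hang off struct globaldata, and all names below are
illustrative:

    #include <stdatomic.h>

    #define NCPUS 32

    struct pcpu_vstats {
            long inactive;                  /* touched only by owner cpu */
            char pad[64 - sizeof(long)];    /* avoid false sharing */
    } vstats[NCPUS];

    atomic_long inactivevnodes;             /* rollup statistic only */

    /* hot path: cpu-local increment, no atomics, no shared line */
    static void
    vstat_inactive_inc(int cpu)
    {
            vstats[cpu].inactive++;
    }

    /* slow path, e.g. once per second: fold into the global */
    static void
    vstat_rollup(void)
    {
            long total = 0;

            for (int i = 0; i < NCPUS; ++i)
                    total += vstats[i].inactive;
            atomic_store(&inactivevnodes, total);
    }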
kernel - Refactor the vnode active/inactive lists

* Remove the global vfs_spin lock and the single vnode_active_list and
  single vnode_inactive_list.

* Replace them with a pcpu array of spinlocks and lists (see the sketch
  after this entry).  However, for this initial push the array is
  simply hashed based on the vnode pointer, so it isn't really being
  acted on pcpu.

* Significantly reduces numerous bottlenecks when vnodes start to get
  recycled by vnlru().  Cache line bounces are still a problem, but
  direct spinlock conflicts are essentially gone.
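A sketch of the hashed bucket selection; the structure layout uses
stub/illustrative type names, and the hash shown is just one plausible
pointer hash:

    #include <stdint.h>

    struct vnode_index {
            struct spinlock   spin;
            struct vnodelst   active_list;
            struct vnodelst   inactive_list;
    } __cachealign;

    static struct vnode_index *vnode_list_hash;   /* ncpus entries */

    static inline struct vnode_index *
    vnode_index_of(struct vnode *vp, int ncpus)
    {
            /* hashed on the vnode pointer for now, not truly pcpu */
            return (&vnode_list_hash[((uintptr_t)vp >> 8) % ncpus]);
    }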
kernel - Refactor lockmgr()

* Seriously refactor lockmgr() so we can use atomic_fetchadd_*() for
  shared locks and reduce unnecessary atomic ops and atomic op loops.

  The main win here is being able to use atomic_fetchadd_*() when
  acquiring and releasing shared locks.  A simple fstat() loop (which
  utilizes a LK_SHARED lockmgr lock on the vnode) improves from 191ns
  to around 110ns per loop with 32 concurrent threads (on a
  16-core/32-thread xeon).

* To accomplish this, the 32-bit lk_count field becomes 64-bits.  The
  shared count is separated into the high 32-bits, allowing it to be
  manipulated for both blocking shared requests and the shared lock
  count field.  The low count bits are used for exclusive locks.
  Control bits are adjusted to manage lockmgr features (a sketch of the
  layout and the shared fast path follows this entry).

  LKC_SHARED    Indicates shared lock count is active, else excl lock
                count.  Can predispose the lock when the related count
                is 0 (does not have to be cleared, for example).

  LKC_UPREQ     Queued upgrade request.  Automatically granted by the
                releasing entity (UPREQ -> ~SHARED|1).

  LKC_EXREQ     Queued exclusive request (only when the lock is held
                shared).  Automatically granted by the releasing entity
                (EXREQ -> ~SHARED|1).

  LKC_EXREQ2    Aggregated exclusive request.  When EXREQ cannot be
                obtained due to the lock being held exclusively or
                EXREQ already being queued, EXREQ2 is flagged for
                wakeup/retries.

  LKC_CANCEL    Cancel API support.

  LKC_SMASK     Shared lock count mask (LKC_SCOUNT increments).

  LKC_XMASK     Exclusive lock count mask (+1 increments).

  The 'no lock' condition occurs when LKC_XMASK is 0 and LKC_SMASK is
  0, regardless of the state of LKC_SHARED.

* Lockmgr still supports exclusive priority over shared locks.  The
  semantics have slightly changed.  The priority mechanism only applies
  to the EXREQ holder.  Once an exclusive lock is obtained, any
  blocking shared or exclusive locks will have equal priority until the
  exclusive lock is released.  Once released, shared locks can squeeze
  in, but then the next pending exclusive lock will assert its priority
  over any new shared locks when it wakes up and loops.

  This isn't quite what I wanted, but it seems to work quite well.  I
  had to make a trade-off in the EXREQ lock-grant mechanism to improve
  performance.

* In addition, we use atomic_fcmpset_long() instead of
  atomic_cmpset_long() to reduce cache line flip-flopping at least a
  little.

* Remove lockcount() and lockcountnb(), which tried to count lock refs.
  Replace with lockinuse(), which simply tells the caller whether the
  lock is referenced or not.

* Expand some of the copyright notices (years and authors) for major
  rewrites.  Really there are a lot more and I have to pay more
  attention to adjustments.
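An illustrative reconstruction of the 64-bit lk_count layout and the
shared-lock fast path; the exact bit positions and the back-out logic
are assumptions consistent with the description, not the committed
values:

    #define LKC_SMASK   0xFFFFFFFF00000000LU    /* shared count (high 32) */
    #define LKC_SCOUNT  0x0000000100000000LU    /* one shared increment */
    #define LKC_XMASK   0x00000000000003FFLU    /* excl count (low bits) */
    #define LKC_SHARED  0x0000000000010000LU    /* count field is shared */

    uint64_t count;

    /* shared acquire fast path: one fetchadd, no cmpset loop */
    count = atomic_fetchadd_long(&lkp->lk_count, LKC_SCOUNT) + LKC_SCOUNT;
    if ((count & LKC_SHARED) && (count & LKC_XMASK) == 0)
            return (0);             /* granted */

    /*
     * Contended (or a free lock with LKC_SHARED clear, in this
     * model): back the increment out and take the slow path.
     */
    atomic_fetchadd_long(&lkp->lk_count, -LKC_SCOUNT);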
kernel - Overhaul namecache operations to reduce SMP contention

* Overhaul the namecache code to remove a significant amount of
  cacheline ping-ponging from the namecache paths.  This primarily
  affects multi-socket systems, but also improves multi-core
  single-socket systems.

  Cacheline ping-ponging in the critical path can constrict a
  multi-core system to roughly ~1-2M operations per second running
  through that path.  For example, even if looking up different paths
  or stating different files, even something as simple as a non-atomic
  ++global_counter seriously derates performance when it is being
  executed on all cores at once.

  In the simple non-conflicting single-component stat() case, this
  improves performance from ~2.5M/second to ~25M/second on a 4-socket
  48-core opteron, has a similar improvement on a 2-socket 32-thread
  xeon, and also significantly improves namecache performance on
  single-socket multi-core systems.

* Remove the vfs.cache.numcalls and vfs.cache.numchecks debugging
  counters.  These global counters caused significant cache
  ping-ponging and were only being used for debugging.

* Implement a poor-man's referenced-structure pcpu cache for struct
  mount and struct namecache (a model follows this entry).  This allows
  atomic ops on the ref-count for these structures to be avoided in
  certain critical path cases.  For now, limit it to ncdir and nrdir
  (nrdir particularly, which is usually the same across nearly all
  processes in the system).  Eventually we will want to expand this
  cache to handle more cases.

  Because we are holding refs persistently, add a bit of infrastructure
  to clear the cache as necessary (e.g. when doing an unmount).

* Shift the 'cachedvnodes' global to a per-cpu accumulator, then roll
  the counter back up into the global approximately once per second.
  The critical paths adjust only the per-cpu accumulator, removing
  another global cache ping-pong from nearly all vnode and nlookup
  paths.

* The nlookup structure now 'borrows' the ucred reference from
  td->td_ucred instead of crhold()ing it, removing another global
  ref/unref from all nlookup paths.

* We have a large hash table of spinlocks for nchash; add a little pad,
  from 24 to 32 bytes.  It's ok that two spin locks share the same
  cache line (it's a huge table); adding the pad cleans up the
  cacheline-crossing cases.

* Add a bit of pad to put mount->mnt_refs on its own cache line versus
  prior fields which are accessed shared.  But don't bother isolating
  it completely.
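A model of the poor-man's pcpu ref cache; every name below is a
stand-in, and the real version must also support being flushed (e.g.
at unmount):

    struct mount;
    void mount_ref(struct mount *mp);     /* stand-in atomic ref */
    void mount_unref(struct mount *mp);   /* stand-in atomic unref */

    struct pcpu_mntcache {
            struct mount *cached;         /* holds one persistent ref */
    };

    /* hot path: if this cpu already caches mp, no atomics at all */
    static struct mount *
    mount_ref_cached(struct pcpu_mntcache *pc, struct mount *mp)
    {
            if (pc->cached == mp)
                    return (mp);
            if (pc->cached != NULL)
                    mount_unref(pc->cached);
            mount_ref(mp);          /* one atomic op, then amortized away */
            pc->cached = mp;
            return (mp);
    }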
kernel - Attempt to fix cluster pbuf deadlock on recursive filesystems

* Change the global pbuf count limits (used primarily for clustered
  I/O) to per-mount and per-device limits.  The per-mount / per-device
  limit is set to nswbuf_kva / 10, allowing 10 different entities to
  obtain pbufs concurrently without interference (see the sketch after
  this entry).

* This change goes a long way towards fixing deadlocks that could occur
  with the old global system (a global limit of nswbuf_kva / 2) when
  the I/O system recurses through a virtual block device or filesystem.
  Two examples of virtual block devices are the 'vn' device and the
  crypto layer.

* We also note that even normal filesystem read and write I/O strategy
  calls will recurse at least once to dive into the underlying block
  device.  DFly also had issues with pbuf hogging by one mount causing
  unnecessary stalls in other mounts.  This fix also prevents pbuf
  hogging.

* Remove the unused internal O_MAPONREAD flag.

Reported-by: htse, multiple
Testing-by: htse, dillon
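A sketch of the per-entity limit; the mnt_pbuf_count field name is an
assumption, while passing a free-count pointer to getpbuf_kva() and
relpbuf() follows the existing pbuf API style:

    /* at mount (or device open) time: a private pool, 1/10 of the pbufs */
    mp->mnt_pbuf_count = nswbuf_kva / 10;

    /*
     * Clustered I/O then draws against the mount's own counter, so a
     * recursing or hogging entity can no longer starve the others.
     */
    bp = getpbuf_kva(&mp->mnt_pbuf_count);
    /* ... issue the I/O ... */
    relpbuf(bp, &mp->mnt_pbuf_count);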
kernel - Rename desiredvnodes to maxvnodes, fix deadlock

* Rename the kernel variable 'desiredvnodes' to 'maxvnodes' to match
  the sysctl name (which has always been 'maxvnodes'), and to make the
  code more readable.

* Probable fix for a rare mount/umount deadlock which can occur in two
  situations: (1) when a large number of mounts and unmounts are
  running concurrently, and (2) during a umount -a, shutdown, or
  reboot.

* Considered minor; normal use cases will not reproduce this bug.  Only
  synth or poudriere can generate the mount/umount traffic necessary to
  reproduce it.

* Also fixes a minor kernel memory leak of the mount structure which
  can occur when a 'df' or filesystem sync races a umount.  Also minor.

Reported-by: marino (mount race)