kernel - Clean up cache_hysteresis contention, better trigger_syncer()

* cache_hysteresis() is no longer multi-entrant; being multi-entrant caused unnecessary contention between cpus. Only one cpu can run it at a time, and other cpus simply return if it is already running.

* Add better trigger_syncer functions. Add trigger_syncer_start() and trigger_syncer_stop() to elide filesystem sleeps, ensuring that a filesystem waiting on a flush due to excessive dirty pages does not race the flusher and wind up twiddling its fingers while no flush is happening.
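A minimal userspace sketch of the register-before-test pattern the start/stop pair implies (all names, fields, and sequencing here are invented stand-ins, not the kernel implementation): the waiter registers its interest before re-testing the dirty count, so a flush cannot slip in between the test and the sleep and leave the waiter sleeping while no flush is pending.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical single-threaded model of the trigger_syncer_start()/
 * trigger_syncer_stop() idea described above.  NOT the kernel code.
 */
static int dirty_pages;
static int flush_requested;

static void trigger_syncer_start(void)
{
    /* Register intent first; this is what closes the race window. */
    flush_requested = 1;
}

static void syncer_run(void)
{
    if (flush_requested) {
        dirty_pages = 0;            /* flush everything */
        flush_requested = 0;
    }
}

static bool must_wait(void)
{
    /* Re-test only after registering; never sleep on a stale count. */
    return dirty_pages > 0;
}

static void trigger_syncer_stop(void)
{
    flush_requested = 0;            /* waiter no longer needs a flush */
}
```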
kernel - Attempt to fix broken vfs.cache.numunres tracker (2)

* The main culprit appears to be cache_allocroot() accounting for new root ncps differently than the rest of the module, so anything which mounts and unmounts continuously, like dsynth, can seriously skew the numbers.

* Fix that and run an overnight test.
kernel - check nc_generation in nlookup path

* With nc_generation now operating in a more usable manner, we can use it in nlookup() to check for changes. When a change is detected, the related lock will be cycled and the entire nlookup() will retry up to debug.nlookup_max_retries times, which currently defaults to 4.

* Add debugging via debug.nlookup_debug. Set to 3 for nc_generation debugging.

* Move "Parent directory lost" kprintfs into a debugging conditional, reported via (debug.nlookup_debug & 4).

* This fixes lookup/remove races which could sometimes cause open() and other system calls to return EINVAL or ENOTCONN. Basically what happened was that nlookup() wound up on an NCF_DESTROYED entry.

* A few minutes worth of a dsynth bulk does not report any random generation-number mismatches or retries, so the code in this commit is probably very close to correct.
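The retry loop described above can be sketched in userspace (a simplification: the structure, the resolve callback, and the constant are models of the mechanism, not the real kernel code). The lookup snapshots the generation, performs the work that may drop and re-acquire locks, and retries the whole lookup if the generation moved in the meantime, up to the retry limit.

```c
#include <assert.h>

/* Mirrors the debug.nlookup_max_retries default mentioned above. */
#define NLOOKUP_MAX_RETRIES 4

struct ncp_sim { int nc_generation; };

/*
 * Simplified model of the nlookup() retry loop: returns 0 on success,
 * -1 if the entry kept changing for too many retries.
 */
static int nlookup_sim(struct ncp_sim *ncp, int (*resolve)(struct ncp_sim *))
{
    for (int retries = 0; retries < NLOOKUP_MAX_RETRIES; ++retries) {
        int orig_gen = ncp->nc_generation;

        if (resolve(ncp) != 0)
            continue;                       /* lock was cycled, retry */
        if (ncp->nc_generation == orig_gen)
            return 0;                       /* no concurrent change */
        /* generation changed while unlocked: retry the whole lookup */
    }
    return -1;
}

/* Test helper: a resolver that races us exactly once. */
static int sim_calls;
static int flaky_resolve(struct ncp_sim *ncp)
{
    if (sim_calls++ == 0)
        ncp->nc_generation += 2;    /* concurrent change on first pass */
    return 0;
}
```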
kernel - Change ncp->nc_generation operation

* Change nc_generation operation. Bit 0 is reserved. The field is incremented by 2 whenever major changes are being made to the ncp (linking, unlinking, destruction, resolve, unresolve, vnode adjustment), and then incremented by 2 again when the operation is complete. The caller can test for a major gen change using:

	curr_gen = ncp->nc_generation & ~3;
	if ((orig_gen - curr_gen) & ~1)
		(retry needed)

* Allows unlocked/relocked code to determine whether the ncp has possibly changed or not (will be used in upcoming commits).

* Adjust the kern_rename() code to use the generation numbers.

* Bit 0 will be used to check for a combination of major changes and lock cycling in the future.
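The bracketing scheme can be modeled in userspace as follows (a sketch only: the struct and helper names are invented, and the check is lifted directly from the snippet in the commit text). A major operation bumps the counter by 2 at start and by 2 again at completion, so a completed operation moves the masked value and forces a retry.

```c
#include <assert.h>

/* Userspace model of the nc_generation bracketing; NOT the kernel code. */
struct ncp_sim { unsigned nc_generation; };

static void major_op_begin(struct ncp_sim *ncp) { ncp->nc_generation += 2; }
static void major_op_end(struct ncp_sim *ncp)   { ncp->nc_generation += 2; }

/* Snapshot taken before dropping the lock, masked per the commit text. */
static unsigned gen_snapshot(const struct ncp_sim *ncp)
{
    return ncp->nc_generation & ~3u;
}

/* Non-zero means a major change occurred and a retry is needed. */
static unsigned gen_changed(const struct ncp_sim *ncp, unsigned orig_gen)
{
    unsigned curr_gen = ncp->nc_generation & ~3u;

    return (orig_gen - curr_gen) & ~1u;
}
```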
kernel - namecache eviction fixes

* Fix several namecache eviction issues which interfere with the nlookup*() functions. There is an optimization where nlookup*() avoids locking intermediate ncp's in a path whenever possible, on the assumption that the ref on the ncp will prevent eviction. This assumption fails when the machine is under a heavy namecache load. Errors included spurious ENOTCONN and EINVAL error codes from file operations.

* Refactor the namecache code to not evict resolved namecache entries which have extra refs under normal operation. This allows nlookup*() and other functions to operate semi-lockless for intermediate elements in a path. However, they still obtain a ref, which is a cache-unfriendly atomic operation. This fixes numerous weird errors that occur during heavy dsynth bulk builds.

* Also fix a bug which evicted too many resolved namecache entries when attempting to evict unresolved entries. This should improve performance under heavy namecache loads a bit.
kernel - Fix lock order reversal in cache_resolve_mp()

* This function is a helper used when path lookups cross mount boundaries.

* The locking order between namecache records and vnodes must be { ncp, vnode }.

* Fix a lock order reversal in cache_resolve_mp(), which was doing { vnode, ncp }. This deadlock is very rare because mount points are almost never evicted from the namecache. However, dsynth can trigger this bug due to its heavy use of null mounts and high concurrent path lookup loads.
kernel - vnode recycling, intermediate fix

* Fix a condition where vnlru (the vnode recycler) can live-lock on unsuitable vnodes in the inactive list and stop making progress, causing the system to block. First, don't deactivate vnodes which the inactive scan won't recycle. Vnodes which are in the namecache topology but not at a leaf won't be recycled by the vnlru thread. Leave these vnodes on the active queue. This prevents the inactive queue from filling up with vnodes that it can't recycle. Second, the active scan in vnlru() will now call cache_inval_vp_quick() to attempt to make a vnode presentable so it can be deactivated. The inactive scan also does the same thing, because some leakage can happen anyway.

* The active scan should be able to make continuous progress, as successful cache_inval_vp_quick() calls make more and more vnodes presentable that might previously have been internal nodes in the namecache topology. So the active scan should be able to achieve the desired balance between the active and inactive queues.

* This should also improve performance when constant recycling is happening, by moving more of the work to the active->inactive transition and doing less work in the inactive->free transition.

* Add cache_inval_vp_quick(), a function which attempts to trivially disassociate a vnode from the namecache topology, handling any direct children if the vnode is not at a leaf (but not recursing on its own). The definition of 'trivial' for the children is: namecache records that can be locked non-blocking, have no additional refs, and do not record a vnode.

* Clean up cache_unlink_parent(). Have cache_zap() use this function instead of re-rolling the same code. The cache_rename() code winds up being slightly more complex, and now cache_inval_vp_quick() can use the function too.
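The 'trivial child' test described above can be expressed as a simple predicate (a userspace model only: the field names are invented stand-ins for the real ncp state, which uses flags and lock/ref machinery).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model of the three conditions cache_inval_vp_quick()
 * requires before it will disassociate a direct child: the child can
 * be locked without blocking, has no additional refs, and records no
 * vnode.  Field names are assumptions, not the kernel structure.
 */
struct child_sim {
    bool  lockable_nonblocking;     /* trylock would succeed */
    int   extra_refs;               /* refs beyond the scan's own */
    void *vp;                       /* recorded vnode, if any */
};

static bool child_is_trivial(const struct child_sim *ch)
{
    return ch->lockable_nonblocking &&
           ch->extra_refs == 0 &&
           ch->vp == NULL;
}
```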
kernel - Temporary work-around for vnode recyclement problems

* vnlru deadlocks were encountered on grok while indexing ~20 million files in deep directory trees.

* Add vfscache_unres accounting to keep track of unresolved ncp's at the leaves of the namecache tree. Start trimming the namecache when the unresolved leaf count exceeds 1/16 of maxvnodes, in addition to the other algorithms.

* Add code in vnlru to decommission vnodes with children in the namecache when those children are trivial (e.g. unresolved, dead, or negative entries that can be easily locked).
kernel - Fix namecache issue that can slow systems down over time

* Fix a serious dead-record issue with the namecache. It is very common to create a blah.new file and then rename it over an existing file, say blah.jpg, in order to atomically replace the existing file. Such rename-over operations can cause up to (2 * maxvnodes) dead namecache records to build up on a single hash slot's list.

* Over time, this could result in over a million records on a single hash slot's list, which is often scanned during namecache lookups, causing the kernel to turn into a sludge-pile. This was not a memory leak per se; the kernel still cleans excess structures (above 2 * maxvnodes) up, but that just maintains the status quo and leaves the system in a slow, poorly-responsive state.

* Fixed by proactively deleting matching dead entries during namecache lookups. The 'live' record is typically at the beginning of the list, so the namecache lookup now scans the hash slot's list backwards and attempts to dispose of dead records.
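The backward-scan disposal can be sketched with a hand-rolled doubly-linked list (an illustration of the idea only: the kernel uses its own list macros, locking, and an NCF_DESTROYED flag rather than this simplified structure).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for a namecache hash-chain record. */
struct nc_sim {
    struct nc_sim *prev, *next;
    const char    *name;
    int            destroyed;       /* models a dead (DESTROYED) record */
};

/* Append a record at the tail of the chain (test scaffolding). */
static struct nc_sim *push_tail(struct nc_sim **headp, struct nc_sim **tailp,
                                const char *name, int destroyed)
{
    struct nc_sim *n = calloc(1, sizeof(*n));

    n->name = name;
    n->destroyed = destroyed;
    n->prev = *tailp;
    if (*tailp)
        (*tailp)->next = n;
    else
        *headp = n;
    *tailp = n;
    return n;
}

/*
 * Walk the chain from the tail toward the head, unlinking and freeing
 * dead records that match 'name'.  The live record near the head is
 * left alone.  Returns the number of records pruned.
 */
static int prune_dead_tailward(struct nc_sim **headp, struct nc_sim **tailp,
                               const char *name)
{
    struct nc_sim *ncp = *tailp;
    int pruned = 0;

    while (ncp) {
        struct nc_sim *prev = ncp->prev;    /* save before freeing */

        if (ncp->destroyed && strcmp(ncp->name, name) == 0) {
            if (ncp->prev) ncp->prev->next = ncp->next; else *headp = ncp->next;
            if (ncp->next) ncp->next->prev = ncp->prev; else *tailp = ncp->prev;
            free(ncp);
            ++pruned;
        }
        ncp = prev;
    }
    return pruned;
}
```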
kernel - Make sure nl_dvp is non-NULL in a few situations

* When NLC_REFDVP is set, nl_dvp should be returned non-NULL when the nlookup succeeds. However, there is one case where nlookup() can succeed but nl_dvp can be NULL, and that is when the nlookup() represents a mount point.

* Fix three instances where this case was not being checked and could lead to a NULL pointer dereference / kernel panic.

* Do the full resolve treatment for cache_resolve_dvp(). In null-mount situations where we have A/B and we null-mount B onto C, path resolutions of C via the null mount will resolve B but not resolve A. This breaks an assumption that nlookup() and cache_dvpref() make about the parent ncp having a valid vnode. In fact, the parent ncp of B (which is A) might not, because the resolve path for B may have bypassed it due to the presence of the null mount.

* Should fix occasional 'mkdir /var/cache' calls that fail with EINVAL instead of EEXIST.

Reported-by: zach
kernel - Add kmalloc_obj subsystem step 1

* Implement per-zone memory management for kmalloc() in the form of kmalloc_obj() and friends. Currently the subsystem uses the same malloc_type structure but is otherwise distinct from the normal kmalloc(), so to avoid programming mistakes the *_obj() subsystem post-pends '_obj' to malloc_type pointers passed into it. This mechanism will eventually replace objcache. It is designed to greatly reduce fragmentation issues on systems with long uptimes. Eventually the feature will be better integrated and I will be able to remove the _obj stuff.

* This is an object allocator, so the zone must be dedicated to one type of object with a fixed size. All allocations out of the zone are of that object. The allocator is not quite type-stable yet, but will be once existential locks are integrated into the freeing mechanism.

* Implement a mini-slab allocator for management. Since the zones are single-object, similar to objcache, the fixed-size mini-slabs are a lot easier to optimize and much simpler in construction than the main kernel slab allocator. Uses a per-zone/per-cpu active/alternate slab with an ultra-optimized allocation path, and a per-zone partial/full/empty list. Also has a globaldata-based per-cpu cache of free slabs. The mini-slab allocator frees slabs back to the same cpu they were originally allocated from in order to retain memory locality over time.

* Implement a passive cleanup poller. This currently polls kmalloc zones very slowly, looking for excess full slabs to release back to the global slab cache or the system (if the global slab cache is full). This code will ultimately also handle existential type-stable freeing.

* Fragmentation is greatly reduced due to the distinct zones. Slabs are dedicated to the zone and do not share allocation space with other zones. Also, when a zone is destroyed, all of its memory is cleanly disposed of and there will be no left-over fragmentation.

* Initially use the new interface for the following. These zones tend to, or can, become quite big:

	vnodes
	namecache (but not related strings)
	hammer2 chains
	hammer2 inodes
	tmpfs nodes
	tmpfs dirents (but not related strings)
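The core idea of a single-object zone can be illustrated with a tiny userspace sketch (all names here are invented and this is nothing like the per-cpu mini-slab machinery described above; it only shows why fixed-size zones avoid cross-zone fragmentation): every object in the zone has the same size, a slab is just an array of objects threaded onto a free list, and destroying the zone releases all of its memory at once.

```c
#include <assert.h>
#include <stdlib.h>

#define OBJS_PER_SLAB 64

/* Toy fixed-size object zone; a single slab for the sake of the sketch. */
struct obj_zone {
    size_t  obj_size;
    void   *free_list;      /* next pointer stored in the free object itself */
    void   *slab;
};

static int zone_init(struct obj_zone *z, size_t obj_size)
{
    char *p;
    size_t i;

    if (obj_size < sizeof(void *))
        obj_size = sizeof(void *);
    z->obj_size = obj_size;
    z->slab = malloc(obj_size * OBJS_PER_SLAB);
    if (z->slab == NULL)
        return -1;
    z->free_list = NULL;
    for (i = 0, p = z->slab; i < OBJS_PER_SLAB; ++i, p += obj_size) {
        *(void **)p = z->free_list;     /* push object onto the free list */
        z->free_list = p;
    }
    return 0;
}

static void *zone_alloc(struct obj_zone *z)
{
    void *obj = z->free_list;

    if (obj)
        z->free_list = *(void **)obj;
    return obj;
}

static void zone_free(struct obj_zone *z, void *obj)
{
    *(void **)obj = z->free_list;
    z->free_list = obj;
}

static void zone_destroy(struct obj_zone *z)
{
    free(z->slab);          /* all of the zone's memory goes away together */
    z->slab = z->free_list = NULL;
}
```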
kernel - Deal with VOP_NRENAME races

* VOP_NRENAME() as implemented by the kernel can race any number of ways, including deadlocking, allowing duplicate entries, and panicking tmpfs. It typically requires a heavy test load to replicate, but a dsynth build triggered the issue at least once. Other recently reported tmpfs issues with log file handling might also be affected.

* A per-mount (semi-global) lock is now obtained whenever a directory is renamed. This helps deal with numerous MP races that can cause lock order reversals. Loosely taken from netbsd and linux (mjg brought me up to speed on this). Renaming directories is fraught with issues, and this fix, while somewhat brutish, is fine. Directories are very rarely renamed at a high rate.

* kern_rename() now proactively locks all four elements of a rename operation (source_dir, source_file, dest_dir, dest_file) instead of only two.

* The new locking function, cache_lock4_tondlocked(), takes no chances on lock order reversals and will use a (currently brute-force) non-blocking and lock-cycling algorithm. Probably needs some work.

* Fix a bug in cache_nlookup() related to reusing DESTROYED entries in the hash table. The algorithm tried to reuse the entries while maintaining shared locks, since only the entries need to be manipulated to reuse them. However, this resulted in lookup races which could cause duplicate entries. The duplicate entries then triggered assertions in TMPFS.

* nlookup now tries a little harder and will retry if the parent of an element is flagged DESTROYED after its lock was released. DESTROYED elements are not necessarily temporary events, as an operation can wind up running in a deleted directory and must properly fail under those conditions.

* Use krateprintf() to reduce debug output related to rename race reporting.

* Revamp nfsrv_rename() as well (requires more testing).

* Allow nfs_namei() to be called in a loop for retry purposes if desired. It now detects that the nd structure is initialized from a prior run and won't try to re-parse the mbuf (needs testing).

Reported-by: zrj, mjg
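One classic way to take four locks without lock-order reversals is to acquire them in a canonical (address) order, skipping duplicates. This sketch illustrates the problem cache_lock4_tondlocked() solves; the kernel function itself uses a non-blocking, lock-cycling approach rather than address ordering, so this is an analogy, not the implementation.

```c
#include <assert.h>
#include <stdlib.h>

struct lk { int locked; };

/* qsort comparator: order lock pointers by address. */
static int cmp_ptr(const void *a, const void *b)
{
    const struct lk *x = *(struct lk * const *)a;
    const struct lk *y = *(struct lk * const *)b;

    return (x > y) - (x < y);
}

/*
 * Lock the four rename elements (source_dir, source_file, dest_dir,
 * dest_file) in address order, skipping duplicates such as a rename
 * within a single directory, so every caller acquires them in the
 * same global order and no reversal is possible.
 */
static void lock4_ordered(struct lk *v[4])
{
    int i;

    qsort(v, 4, sizeof(v[0]), cmp_ptr);
    for (i = 0; i < 4; ++i) {
        if (i > 0 && v[i] == v[i - 1])
            continue;                   /* same node passed twice */
        v[i]->locked = 1;
    }
}
```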
kernel - Refactor in-kernel system call API to remove bcopy()

* Change the in-kernel system call prototype to take the system call arguments as a separate pointer, and make the contents read-only:

	int sy_call_t (void *);                                 (old)
	int sy_call_t (struct sysmsg *sysmsg, const void *);    (new)

* System calls with 6 arguments or less no longer need to copy the arguments from the trapframe to a holding structure. Instead, we simply point into the trapframe. The L1 cache footprint will be a bit smaller, but in simple tests the results are not noticeably faster... maybe 1ns or so (roughly 1%).
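The new calling convention can be sketched in userspace as follows (the structures here are invented stand-ins; the real struct sysmsg and trapframe differ). The point is that the dispatcher hands the syscall a read-only pointer straight into the saved argument area instead of bcopy()ing the arguments into a holding structure.

```c
#include <assert.h>

/* Invented stand-ins for struct sysmsg and the trapframe argument area. */
struct sysmsg_sim { long retval; };

typedef int sy_call_sim_t(struct sysmsg_sim *sysmsg, const void *uap);

struct sample_args { long a; long b; };

/* A sample "system call": adds its two arguments into the result. */
static int sys_sample(struct sysmsg_sim *sysmsg, const void *uap)
{
    const struct sample_args *ap = uap;     /* points into the frame */

    sysmsg->retval = ap->a + ap->b;
    return 0;
}

/* Dispatcher: no copy, just pass a pointer to the frame's arguments. */
static int dispatch(sy_call_sim_t *fn, struct sysmsg_sim *msg,
                    const void *frame_args)
{
    return fn(msg, frame_args);
}
```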