kernel - Clean up cache_hysteresis contention, better trigger_syncer()

* cache_hysteresis() is no longer multi-entrant, which caused unnecessary
  contention between cpus.  Only one cpu can run it at a time; other cpus
  simply return if it is already running.

* Add better trigger_syncer functions.  Add trigger_syncer_start() and
  trigger_syncer_stop() to elide filesystem sleeps, ensuring that a
  filesystem waiting on a flush due to excessive dirty pages does not race
  the flusher and wind up twiddling its fingers while no flush is
  happening.
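The single-entrant behavior described above can be sketched in userland C.
This is an illustrative model only, not the kernel's actual code: the
atomic flag and the work counter are hypothetical stand-ins for the real
per-cpu logic.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical guard flag: nonzero while one cpu runs the routine. */
atomic_int cache_hysteresis_running;
int hysteresis_work_count;      /* stands in for the actual cleanup work */

bool
cache_hysteresis(void)
{
	int expected = 0;

	/* If another cpu already holds the flag, just return instead
	 * of contending. */
	if (!atomic_compare_exchange_strong(&cache_hysteresis_running,
					    &expected, 1))
		return false;
	hysteresis_work_count++;	/* the real cleanup would go here */
	atomic_store(&cache_hysteresis_running, 0);
	return true;
}
```

A cpu that loses the compare-and-swap returns immediately, so only the
winner pays the cost of the scan.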
kernel - Rejigger mount code to add vfs_flags in struct vfsops

* Rejigger the mount code so we can add a vfs_flags field to vfsops,
  which mount_init() has visibility to.

* Allows nullfs to flag that its mounts do not need a syncer thread.
  Previously nullfs would destroy the syncer thread after the fact.

* Improves dsynth performance (it does lots of nullfs mounts).
hammer2 - Fix inode & chain limits, improve flush pipeline

* Reorganize VFS_MODIFYING() to avoid certain deadlock conditions and
  adjust hammer2 to unconditionally stall in VFS_MODIFYING() when dirty
  limits are exceeded.  Make sure VFS_MODIFYING() is called in all
  appropriate filesystem-modifying paths.  This ensures that inode and
  chain structure allocation limits are adhered to.

* Fix hammer2's wakeup code for the dirty inode count hysteresis.  This
  fixes a situation where stalls due to excessive dirty inodes were
  waiting a full second before resuming operation based on the dirty
  count hysteresis.  The hysteresis now works as intended:

  (1) Trigger a sync when the dirty count reaches 50% of N.
  (2) Stall the frontend when the dirty count reaches 100% of N.
  (3) Resume the frontend when the dirty count drops to 66% of N.

* Fix trigger_syncer() to guarantee that the syncer will flush the
  filesystem ASAP when called.  If the filesystem is already in a flush,
  it will be flushed again.  Previously, if the filesystem was already in
  a flush, it would wait one second before flushing again, which
  significantly reduced performance under conditions where the dirty
  chain limit or the dirty inode limit is constantly being hit
  (e.g. chown -R, etc).

Reported-by: tuxillo
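The three hysteresis rules can be modeled as a small state machine.
The sketch below is illustrative only: `DIRTY_LIMIT` plays the role of
"N", and the flags are hypothetical stand-ins for hammer2's internal
state.

```c
#include <stdbool.h>

#define DIRTY_LIMIT 100		/* "N" in the commit text (illustrative) */

bool frontend_stalled;		/* frontend blocked on dirty count */
bool sync_triggered;		/* a flush has been requested */

/* Apply the three hysteresis rules to the current dirty count. */
void
dirty_hysteresis(int dirty_count)
{
	if (dirty_count >= DIRTY_LIMIT) {
		/* (2) 100% of N: stall the frontend */
		frontend_stalled = true;
	} else if (frontend_stalled &&
		   dirty_count <= DIRTY_LIMIT * 66 / 100) {
		/* (3) dropped to 66% of N: resume the frontend */
		frontend_stalled = false;
	}
	if (dirty_count >= DIRTY_LIMIT / 2) {
		/* (1) 50% of N: trigger a sync */
		sync_triggered = true;
	}
}
```

The gap between the 100% stall point and the 66% resume point is what
prevents the frontend from thrashing on and off around a single
threshold.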
kernel: Remove numerous #include <sys/thread2.h>.

Most of them were added when we converted spl*() calls to
crit_enter()/crit_exit(), almost 14 years ago.  We can now remove a good
chunk of them again where crit_*() is no longer used.

I had to adjust some files that were relying on thread2.h, or on headers
it includes, coming in via other headers that it was removed from.
kernel - Add trigger_syncer(), VFS_MODIFYING()

* Add trigger_syncer().  This function asynchronously triggers the syncer
  vnode in the syncer thread for the mount.  It is a NOP if there is no
  syncer thread or syncer vnode.

  Will be used by HAMMER2 to pipeline syncs when heavy filesystem
  activity over-extends internal memory structures.

* Add VFS_MODIFYING().  This is a hook into the filesystem that modifying
  filesystem ops in the kernel will call prior to locking any vnodes.  It
  allows the filesystem to moderate the over-allocation of internal
  structures.  Waiting until after the VOP is called is too late, so we
  need kernel support for this.  Numerous attempts to hack moderation
  code into the H2 VOPs have all failed spectacularly.

  In H2, over-allocation can occur because H2 must retain disconnected
  inodes related to file creation and deletion until the next sync cycle.
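The calling pattern for the hook can be sketched as follows.  This is a
minimal userland model, not the kernel's real structures: the struct
layout, `example_modifying`, and `kern_setattr` are hypothetical, and a
real hook would sleep while dirty limits are exceeded rather than bump a
counter.

```c
#include <stddef.h>

/* Toy mount structure carrying an optional VFS_MODIFYING hook. */
struct mount {
	void	(*mnt_modifying)(struct mount *mp);
	int	stall_count;		/* for illustration only */
};

/* The kernel-side dispatch: a NOP if the VFS supplies no hook. */
void
VFS_MODIFYING(struct mount *mp)
{
	if (mp->mnt_modifying)
		mp->mnt_modifying(mp);
}

/* Example VFS hook; the real thing would stall on dirty limits here. */
void
example_modifying(struct mount *mp)
{
	mp->stall_count++;
}

/* A modifying op path: call the hook *before* locking any vnodes,
 * since moderating after the VOP has started is too late. */
void
kern_setattr(struct mount *mp)
{
	VFS_MODIFYING(mp);
	/* ... lock vnode, dispatch the VOP ... */
}
```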
kernel - Add dirty vnode management facility

* Keep track of how many vnodes are queued to the syncer, which is
  basically the number of dirty vnodes.  The syncer vnode is included,
  so the idle count is usually 1 and not 0.

* vn_syncer_count() returns the count.

* vn_syncer_one() attempts to fsync the next dirty vnode immediately, if
  it can acquire it non-blocking.  The special syncer vnode is ignored.
  On failure the vnode will be requeued for 1 second, so this routine can
  be cycled.
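The accounting above can be illustrated with a deliberately simplified
model.  Everything here is hypothetical: the real facility tracks vnodes
on the syncer worklist and takes non-blocking locks, whereas this sketch
only models the count and the "ignore the syncer vnode itself" rule.

```c
#include <stdbool.h>

/* The syncer vnode itself keeps the idle count at 1, not 0. */
int syncer_count = 1;

int
vn_syncer_count(void)
{
	return syncer_count;
}

/* Try to flush one dirty vnode immediately.  The special syncer vnode
 * is ignored.  In the real code a vnode that cannot be acquired
 * non-blocking is requeued for 1 second so the caller can cycle. */
bool
vn_syncer_one(void)
{
	if (syncer_count > 1) {
		syncer_count--;		/* pretend the fsync succeeded */
		return true;
	}
	return false;
}
```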
kernel - Attempt to fix panic during shutdown from tmpfs

* When doing a forced unmount on tmpfs, an active vnode may wind up
  staying on the mount's dirty worklist, resulting in an assertion.

* Try to handle the case by forcefully removing the vnode from the dirty
  worklist in the forced-unmount case.

Reported-by: zrj (Rimvydas Jasinskas)
kernel - Change bundirty() location in I/O sequence

* When doing a write BIO, do not bundirty() the buffer prior to issuing
  the vn_strategy().  Instead, bundirty() the buffer when the I/O is
  complete, primarily in bpdone().

  The I/O's data buffer is protected during the operation by
  vfs_busy_pages(), so related VM pages cannot be modified while the
  write is running.  And, of course, the buffer itself is locked
  exclusively for the duration of the operation.  Thus this change should
  NOT introduce any redirtying races.

* This change ensures that vp->v_rbdirty_tree remains non-empty until all
  related write I/Os have completed, removing a race condition for code
  which checks vp->v_rbdirty_tree to determine e.g. whether a file
  requires synchronization or not.  This race could cause problems
  because the system buffer flusher might be in the midst of flushing a
  buffer just as a filesystem decides to sync and starts checking
  vp->v_rbdirty_tree.

* This should theoretically fix a long-standing but difficult-to-reproduce
  bug in HAMMER1 where a backend flush occurs at an inopportune time.
kernel - Fix hammer recovery crash (due to recent syncer work)

* Unconditionally create a syncer thread for each mount.  This way we can
  create the thread prior to calling VFS_MOUNT.

* hammer(1) needs to acquire vnodes and potentially issue vn_rdwr()'s
  during mount for recovery purposes.  The syncer thread is expected to
  already exist (and it does now).

* Remove the default syncer thread.

* Rewrite speedup_syncer().
kernel - Optimize sync and msync for tmpfs and nfs

* Flesh out the vfs_sync API and implement vhold/vdrop callbacks (used by
  NFS).

* Use MNTK_THR_SYNC in tmpfs and finish implementing it in nfs.  This
  will optimize sync and msync for these filesystems.

* In both cases inode attributes are either synchronous or don't involve
  any VFS work to flush, so we don't have to use VISDIRTY.
kernel - Optimize vfs_msync() when MNTK_THR_SYNC is used

* vfs_msync() will now use vsyncscan() when MNTK_THR_SYNC is set.  Lazy
  synchronization scans will still properly ignore MADV_NOSYNC areas, but
  will not be able to optimize away the scan overhead for those vnodes
  (they remain on the syncer list).

  This change allows both lazy synchronization and explicit 'sync'
  commands to avoid having to scan all cached vnodes on the system,
  resulting in O(1) operation in many cases where it might have taken a
  few seconds before (on large systems with hundreds of thousands to
  millions of vnodes cached).  With this change both the vnode sync and
  the memory sync will be optimal.  Currently implemented for hammer1
  and hammer2.

* Add VOBJDIRTY to the set of flags that will place the vnode on the
  syncer list.  This occurs from vm_page_dirty() and other bits of code,
  only if MNTK_THR_SYNC is set.

  Theoretically it should be safe for us to do this even though neither
  the vm_object nor the related vnode are likely locked or guarded,
  because neither can go away while an associated vm_page is busied.  The
  syncer list code itself is protected with a token.
hammer - Use new vsyncscan() mechanic (3)

* The vsyncscan() feature requires using MNTK_THR_SYNC, otherwise the
  callback has to deal with vnodes unrelated to the mount point.  Assert
  this in vsyncscan().

* Enable MNTK_THR_SYNC in hammer.

* Clean up edge cases in the scan2 callback.
kernel - Add vsyncscan() infrastructure

* For VFS's which support it, allows vnodes with dirty inodes to be
  placed on the syncer list rather than just vnodes with dirty buffers.
  The VFS can then implement its VFS_SYNC ops by calling vsyncscan()
  instead of vmntvnodescan().

* On large systems with potentially hundreds of thousands to millions of
  cached vnodes, this reduces sync scan overhead by several orders of
  magnitude.

* Add the VISDIRTY flag to vnode->v_flag to indicate a dirty inode, and
  adjust the syncer add/delete code to use the flag.

* Clean up vfs_sync.c.  Always initialize mp->mnt_syncer_ctx to
  something.  Change the kern.syncdelay sysctl to use SYSCTL_PROC, which
  properly range-checks syncdelay.

* Implement vsyncscan(), which only scans the syncer lists for a mount
  point.
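The shape of a VFS_SYNC built on a per-mount syncer list can be sketched
as below.  This is a userland approximation: the real vsyncscan() walks
the mount's syncer worklists under a token, while this sketch uses a
plain array, and `flush_inode`/`example_vfs_sync` are hypothetical names.

```c
#define VISDIRTY 0x0001		/* vnode has a dirty inode (illustrative) */

struct vnode {
	int v_flag;
};

/* Scan only the mount's syncer-list entries, invoking the callback on
 * each dirty vnode; contrast with vmntvnodescan(), which visits every
 * cached vnode on the mount. */
int
vsyncscan(struct vnode *list, int n, int (*callback)(struct vnode *vp))
{
	int error = 0;
	int i;

	for (i = 0; i < n; i++) {
		if (list[i].v_flag & VISDIRTY)
			error |= callback(&list[i]);
	}
	return error;
}

static int
flush_inode(struct vnode *vp)
{
	vp->v_flag &= ~VISDIRTY;	/* a real VFS writes the inode here */
	return 0;
}

/* A VFS_SYNC op implemented via vsyncscan() instead of a full scan. */
int
example_vfs_sync(struct vnode *list, int n)
{
	return vsyncscan(list, n, flush_inode);
}
```

The sync cost is proportional to the number of dirty vnodes rather than
the number of cached vnodes, which is the source of the orders-of-magnitude
improvement mentioned above.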
kernel - Change time_second to time_uptime for all expiration calculations

* Vet the entire kernel and change use cases for expiration calculations
  using time_second to use time_uptime instead.

* Protects these expiration calculations from step changes in the wall
  time, particularly needed for route table entries.

* Probably requires further variable type adjustments, but the use of
  time_uptime instead of time_second is highly unlikely to ever overrun
  any demotions to int still present.
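The reasoning behind the change can be shown with a small model.  The
globals below stand in for the kernel's time_second (wall clock, which
can step) and time_uptime (monotonic seconds since boot); the expiration
helpers are hypothetical.

```c
long time_second;	/* wall time: may step forward or backward */
long time_uptime;	/* monotonic: only ever advances */

long expire_at;		/* e.g. a route table entry's expiration */

/* Expiration is now anchored to the monotonic clock ... */
void
set_expiration(long ttl)
{
	expire_at = time_uptime + ttl;	/* was: time_second + ttl */
}

/* ... so a step in the wall clock cannot prematurely expire (or
 * indefinitely extend) the entry. */
int
is_expired(void)
{
	return time_uptime >= expire_at;
}
```

Had set_expiration() used time_second, stepping the wall clock forward by
an hour (e.g. via ntpd or date(1)) would instantly expire every entry.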