kernel - Normalize the vx_*() vnode interface

* The vx_*() vnode interface is used for initial allocations, reclaims,
  and terminations.  Normalize all use cases to prevent the mixing
  together of the vx_*() API and the vn_*() API.  For example, vx_lock()
  should not be paired with vn_unlock(), and so forth.

* Integrate an update-counter mechanism into the vx_*() API and assert
  reasonability.

* Change vfs_cache.c to use an int update counter instead of a long.
  The vfs_cache code can't quite use the spin-lock update counter API
  yet.  Use proper atomics for load and store.

* Implement VOP_GETATTR_QUICK, meant to be a 'quick' version of
  VOP_GETATTR() that only retrieves information related to permissions
  and ownership.  This will be fast-pathed in a later commit.

* Implement vx_downgrade() to convert an exclusive vx_lock into an
  exclusive vn_lock (for vnodes).  Adjust all use cases in the
  getnewvnode() path.

* Remove unnecessary locks in tmpfs_getattr() and don't use any in
  tmpfs_getattr_quick().

* Remove unnecessary locks in hammer2_vop_getattr() and don't use any
  in hammer2_vop_getattr_quick().
kernel - Rejigger mount code to add vfs_flags in struct vfsops

* Rejigger the mount code so we can add a vfs_flags field to vfsops,
  which mount_init() has visibility to.

* Allows nullfs to flag that its mounts do not need a syncer thread.
  Previously nullfs would destroy the syncer thread after the fact.

* Improves dsynth performance (it does lots of nullfs mounts).
Rename some functions to better names.

devfs_find_device_by_udev() -> devfs_find_device_by_devid()
dev2udev()                  -> devid_from_dev()
udev2dev()                  -> dev_from_devid()

This fits with the rest of the code.  'dev' usually means a cdev_t, such
as in make_dev(), etc.  Instead of 'udev', use 'devid', since that's
what dev_t is, a "Device ID".
kernel: Save some indent here and there and some small cleanup.

All these are related to an inspection of the places where we do:

    if (...) {
        ...
        goto blah;
    } else {
        ...
    }

in which case the 'else' is not needed.  I only changed places where I
thought that it improves readability or is just as readable without the
'else'.
sys/kern: Don't implement .vfs_sync unless sync is supported

The only reason filesystems with no need for syncing (e.g. no backing
storage) implement .vfs_sync is that they need a sync returning 0 on
unmount.  If unmount accepts a sync return value of EOPNOTSUPP for
filesystems that do not support sync, those filesystems no longer have
to implement .vfs_sync with vfs_stdsync() only to pass dounmount().

The drawback is when there is a sync (other than vfs_stdnosync) that
returns EOPNOTSUPP for real errors.  The existing filesystems in
DragonFly don't do this (and shouldn't either).

Also see https://bugs.dragonflybsd.org/issues/2912.

# grep "\.vfs_sync" sys/vfs sys/gnu/vfs -rI | grep vfs_stdsync
sys/vfs/udf/udf_vfsops.c:               .vfs_sync = vfs_stdsync,
sys/vfs/portal/portal_vfsops.c:         .vfs_sync = vfs_stdsync
sys/vfs/devfs/devfs_vfsops.c:           .vfs_sync = vfs_stdsync,
sys/vfs/isofs/cd9660/cd9660_vfsops.c:   .vfs_sync = vfs_stdsync,
sys/vfs/autofs/autofs_vfsops.c:         .vfs_sync = vfs_stdsync, /* for unmount(2) */
sys/vfs/tmpfs/tmpfs_vfsops.c:           .vfs_sync = vfs_stdsync,
sys/vfs/dirfs/dirfs_vfsops.c:           .vfs_sync = vfs_stdsync,
sys/vfs/ntfs/ntfs_vfsops.c:             .vfs_sync = vfs_stdsync,
sys/vfs/procfs/procfs_vfsops.c:         .vfs_sync = vfs_stdsync
sys/vfs/hpfs/hpfs_vfsops.c:             .vfs_sync = vfs_stdsync,
sys/vfs/nullfs/null_vfsops.c:           .vfs_sync = vfs_stdsync,
kernel - Performance tuning

* Use a shared lock in the exec*() code, open, close, chdir, fchdir,
  access, stat, and readlink.

* Adjust nlookup() to allow the last namecache record in a path to be
  locked shared if it is already resolved and the caller requests it.

* Remove nearly all global locks from critical dsched paths.  Defer
  creation of the tdio until an I/O actually occurs (huge savings in
  the fork/exit paths).

* Improves fork/exec concurrency of static binaries on monster from
  14200/sec to 55000/sec+.  For dynamic binaries, improve from around
  2500/sec to 9000/sec or so (48 cores fork/exec'ing different dynamic
  binaries).  For the same dynamic binary it's more around 5000/sec or
  so.

  There are lots of issues here, including the fact that all dynamic
  binaries load many shared resources even when the binaries are
  different programs, e.g. libc.so.X and ld-elf.so.2, as well as
  /dev/urandom (from libc), and access numerous common path elements.
  Nearly all of these paths are now non-contending.

  The major remaining contention is in per-vm_page/PMAP manipulation.
  This is per-page, and concurrent execs of the same program tend to
  pipeline, so it isn't a big problem.
hpfs - Fix a couple of panics and a little cleanup.

* Fix compilation with HPFS_DEBUG.

* Fix a panic due to the CNP_PDIRUNLOCK flag not being cleared.

* Fix a panic where the vnode returned after a lookup is not NULL in
  the ENOENT case.

* Disable write support completely.  It was pretty minimal, and
  operations like create or rename were not supported.

It has been tested with a filesystem created by OS/2 Warp 2.1.  Copying
data out of it worked fine, but there is still an outstanding issue
with overlapping buffers.
kernel - Greatly improve shared memory fault rate concurrency / shared tokens

This commit rolls up a lot of work to improve postgres database
operations and the system in general.  With these changes we can
pgbench -j 8 -c 40 on our 48-core opteron monster at 140000+ tps, and
the shm vm_fault rate hits 3.1M pps.

* Implement shared tokens.  They work as advertised, with some caveats.
  It is acceptable to acquire a shared token while you already hold the
  same token exclusively, but you will deadlock if you acquire an
  exclusive token while you hold the same token shared.

  Currently exclusive tokens are not given priority over shared tokens,
  so starvation is possible under certain circumstances.

* Create a critical code path in vm_fault() using the new shared token
  feature to quickly fault-in pages which already exist in the VM
  cache.  pmap_object_init_pt() also uses the new feature.

  This increases fault-in concurrency by a ridiculously huge amount,
  particularly on SHM segments (say when you have a large number of
  postgres clients).  Scaling for large numbers of clients on large
  numbers of cores is significantly improved.  This also increases
  fault-in concurrency for MAP_SHARED file maps.

* Expand the breadn() and cluster_read() APIs.  Implement breadnx() and
  cluster_readx() which allow a getblk()'d bp to be passed.  If *bpp is
  not NULL a bp is being passed in, otherwise the routines call
  getblk().

* Modify the HAMMER read path to use the new API.  Instead of calling
  getcacheblk(), HAMMER now calls getblk() and checks the B_CACHE flag.
  This gives getblk() a chance to regenerate a fully cached buffer from
  VM backing store without having to acquire any hammer-related locks,
  resulting in even faster operation.

* If kern.ipc.shm_use_phys is set to 2 the VM pages will be
  pre-allocated.  This can take quite a while for a large map and can
  also lock the machine up for a few seconds.  Defaults to off.

* Reorder the smp_invltlb()/cpu_invltlb() combos in a few places,
  running cpu_invltlb() last.
* An invalidation interlock might be needed in pmap_enter() under
  certain circumstances, so enable the code for now.

* vm_object_backing_scan_callback() was failing to properly check the
  validity of a vm_object after acquiring its token.  Add the required
  check + some debugging.

* Make vm_object_set_writeable_dirty() a bit more cache friendly.

* The vmstats sysctl was scanning every process's vm_map (requiring a
  vm_map read lock to do so), which can stall for long periods of time
  when the system is paging heavily.  Change the mechanic to a LWP flag
  which can be tested with minimal locking.

* Have the phys_pager mark the page as dirty too, to make sure nothing
  tries to free it.

* Remove the spinlock in pmap_prefault_ok(); since we do not delete
  page table pages it shouldn't be needed.

* Add a required cpu_ccfence() in pmap_inval.c.  The code generated
  prior to this fix was still correct, and this makes sure it stays
  that way.

* Replace several manual wiring cases with calls to vm_page_wire().
kernel: Add missing MODULE_VERSION()s for file systems.

The loader will figure out by itself whether to load a module or not,
depending on whether it's already in the kernel config or not, if (and
only if) MODULE_VERSION() is present.  I.e., if MSDOSFS (which has
MODULE_VERSION()) is in the config and msdos_load="YES" is in
/boot/loader.conf, msdos.ko will not be loaded by the loader at all.

Without MODULE_VERSION() this leads (in the best case) to whining in
dmesg, as for ahci, or (in the worst case) to weird behavior, such as
for nullfs:

# mount -a
null: vfsload(null): No such file or directory

Therefore, we definitely want MODULE_VERSION() for all new modules.
This commit is the first in a series to add the missing
MODULE_VERSION()s.

I know that ufs is not a module, it is just included for completeness'
sake.

Reported-by: marino, tuxillo
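For reference, the declarations being added look roughly like this (the
module name and placement shown are an illustrative guess; the exact
name and version are per-filesystem):

```c
/* e.g. in a filesystem's *_vfsops.c, near its existing VFS_SET() and
 * MODULE_DEPEND() declarations: */
MODULE_VERSION(msdosfs, 1);
```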
kernel - Add additional fields to kinfo_cputime

* Add a message field and address to allow the kernel to report
  contention points on the cpus to userland.

* Enhance the mplock and token subsystems to record contention points.

* Enhance the scheduler to record contention information in the per-cpu
  cpu_time structure.
kernel - lwkt_token revamp

* Simplify the token API.  Hide the lwkt_tokref mechanics and simplify
  the lwkt_gettoken()/lwkt_reltoken() API to remove the need to declare
  and pass a lwkt_tokref along with the token.  This makes tokens
  operate more like locks.

  There is a minor restriction that tokens must be unlocked in exactly
  the reverse order they were locked in, and another restriction
  limiting the maximum number of tokens a thread can hold to a defined
  value (32 for now).  The tokrefs are now an array embedded in the
  thread structure.

* Improve performance when blocking and unblocking threads with
  recursively held tokens.

* Improve performance when acquiring the same token recursively.  This
  operation is now O(1) and requires no locks or critical sections of
  any sort.

  This will allow us to acquire redundant tokens in deep call paths
  without having to worry about performance issues.

* Add a flags field to the lwkt_token and lwkt_tokref structures and
  add a flagged feature which will acquire the MP lock along with a
  particular token.  This will be used as a transitory mechanism in
  upcoming MPSAFE work.  The mplock feature in the token structure can
  be directly connected to a mpsafe sysctl without being vulnerable to
  state-change races.
kernel - fine-grained namecache and partial vnode MPSAFE work

Namecache subsystem

* All vnode->v_flag modifications now use vsetflags() and vclrflags().
  Because some flags are set and cleared by vhold()/vdrop(), which do
  not require any locks to be held, all modifications must use atomic
  ops.

* Clean up and revamp the namecache MPSAFE work.  Namecache operations
  now use a fine-grained MPSAFE locking model which loosely follows
  these rules:

  - Lock ordering is child to parent, e.g. lock file, then lock parent
    directory.  This allows resolver recursions up the parent directory
    chain.

  - Downward-traversing namecache invalidations and path lookups will
    unlock the parent (but leave it referenced) before attempting to
    lock the child.

  - Namecache hash table lookups utilize a per-bucket spinlock.

  - Vnode locks may be acquired while holding namecache locks, but not
    vice-versa.  Vnodes are not destroyed until all namecache
    references go away, but can enter reclamation.  Namecache lookups
    detect the case and re-resolve to overcome the race.  Namecache
    entries are not destroyed while referenced.

* Remove vfs_token; the namecache MPSAFE model is now totally
  fine-grained.

* Revamp namecache locking primitives (cache_lock/cache_unlock and
  friends).  Use atomic ops and nc_exlocks instead of nc_locktd, and
  build in a request flag.  This solves busy/tsleep races between lock
  holder and lock requester.

* Revamp namecache parent/child linkages.  Instead of using vfs_token
  to lock such operations we simply lock both child and parent
  namecache entries.  Hash table operations are also fully integrated
  with the parent/child linking operations.

* The vnode->v_namecache list is locked via vnode->v_spinlock, which is
  actually vnode->v_lock.lk_spinlock.

* Revamp cache_vref() and cache_vget().  The passed namecache entry
  must be referenced and locked.  Internals are simplified.
* Fix a deadlock by moving the call to _cache_hysteresis() to a place
  where the current thread otherwise does not hold any locked ncp's.

* Revamp nlookup() to follow the new namecache locking rules.

* Fix a number of places, e.g. in vfs/nfs/nfs_subs.c, where
  ncp->nc_parent or ncp->nc_vp was being accessed with an unlocked ncp.
  nc_parent and nc_vp accesses are only valid if the ncp is locked.

* Add the vfs.cache_mpsafe sysctl, which defaults to 0.  This may be
  set to 1 to enable MPSAFE namecache operations for [l,f]stat() and
  open() system calls (for the moment).

VFS/VNODE subsystem

* Use a global spinlock, for now called vfs_spin, to manage
  vnode_free_list.  Use vnode->v_spinlock (and vfs_spin) to manage
  vhold/vdrop ops and to interlock v_auxrefs tests against vnode
  terminations.

* Integrate per-mount mnt_token and (for now) the MP lock into VOP_*()
  and VFS_*() operations.  This allows the MP lock to be shifted
  further inward from the system calls, but we don't do it quite yet.

* HAMMER: VOP_GETATTR, VOP_READ, and VOP_INACTIVE are now MPSAFE.  The
  corresponding sysctls have been removed.

* FIFOFS: Needed some MPSAFE work in order to allow HAMMER to make
  things MPSAFE above, since HAMMER forwards vops for in-filesystem
  fifos to fifofs.

* Add some debugging kprintf()s when certain MP races are averted, for
  testing only.

MISC

* Add some assertions to the VM system.

* Document existing and newly MPSAFE code.
DEVFS - rollup - all kernel devices

* Make changes needed to kernel devices to use devfs.

* Also pre-generate some devices (usually 4) to support system
  utilities which do not yet deal with the auto-cloning device support.

* Adjust the spec_vnops for various filesystems to vector to dummy code
  for read and write, for VBLK/VCHR nodes in old filesystems which are
  no longer supported.

Submitted-by: Alex Hornung <ahornung@gmail.com>
HAMMER / VFS_VGET - Add optional dvp argument to VFS_VGET().  Fix readdirplus

* VGET is used by NFS to acquire a vnode given an inode number.  HAMMER
  requires additional information to determine the PFS the inode is
  being acquired from.  Add an optional directory vnode argument to the
  VGET.  If non-NULL, HAMMER will extract the PFS information from this
  vnode.

* Adjust NFS to pass the dvp to VGET when doing a readdirplus.  Note
  that the PFS is already encoded in file handles, but readdirplus
  acquires the attributes for each directory entry it scans (readdir
  does not).

  This fixes readdirplus for NFS-served HAMMER PFS exports.