kernel - Greatly improve shared memory fault rate concurrency / shared tokens

This commit rolls up a lot of work to improve postgres database operations
and the system in general. With these changes we can pgbench -j 8 -c 40 on
our 48-core opteron monster at 140000+ tps, and the shm vm_fault rate hits
3.1M pps.

* Implement shared tokens. They work as advertised, with some caveats.

  It is acceptable to acquire a shared token while you already hold the
  same token exclusively, but you will deadlock if you acquire an exclusive
  token while you hold the same token shared.

  Currently exclusive tokens are not given priority over shared tokens, so
  starvation is possible under certain circumstances.

* Create a critical code path in vm_fault() using the new shared token
  feature to quickly fault-in pages which already exist in the VM cache.
  pmap_object_init_pt() also uses the new feature.

  This increases fault-in concurrency by a ridiculously huge amount,
  particularly on SHM segments (say when you have a large number of
  postgres clients). Scaling for large numbers of clients on large numbers
  of cores is significantly improved. This also increases fault-in
  concurrency for MAP_SHARED file maps.

* Expand the breadn() and cluster_read() APIs. Implement breadnx() and
  cluster_readx(), which allow a getblk()'d bp to be passed. If *bpp is
  not NULL a bp is being passed in; otherwise the routines call getblk().

* Modify the HAMMER read path to use the new API. Instead of calling
  getcacheblk() HAMMER now calls getblk() and checks the B_CACHE flag.

  This gives getblk() a chance to regenerate a fully cached buffer from VM
  backing store without having to acquire any hammer-related locks,
  resulting in even faster operation.

* If kern.ipc.shm_use_phys is set to 2 the VM pages will be pre-allocated.
  This can take quite a while for a large map and can also lock the
  machine up for a few seconds. Defaults to off.

* Reorder the smp_invltlb()/cpu_invltlb() combos in a few places, running
  cpu_invltlb() last.
* An invalidation interlock might be needed in pmap_enter() under certain
  circumstances, so enable the code for now.

* vm_object_backing_scan_callback() was failing to properly check the
  validity of a vm_object after acquiring its token. Add the required
  check + some debugging.

* Make vm_object_set_writeable_dirty() a bit more cache friendly.

* The vmstats sysctl was scanning every process's vm_map (requiring a
  vm_map read lock to do so), which can stall for long periods of time
  when the system is paging heavily. Change the mechanic to a LWP flag
  which can be tested with minimal locking.

* Have the phys_pager mark the page as dirty too, to make sure nothing
  tries to free it.

* Remove the spinlock in pmap_prefault_ok(); since we do not delete page
  table pages it shouldn't be needed.

* Add a required cpu_ccfence() in pmap_inval.c. The code generated prior
  to this fix was still correct; this change makes sure it stays that way.

* Replace several manual wiring cases with calls to vm_page_wire().
kernel - VM PAGER part 2/2 - Expand vinitvmio() and vnode_pager_alloc()

* vinitvmio() is responsible for assigning the initial VM object size
  based on the file size. Adjust vinitvmio() to conform to the new
  nvextendbuf() and nvtruncbuf() API.

* vinitvmio() has been given two additional parameters, blksize and boff,
  to allow it to determine how much larger the VM object must be relative
  to the byte-granular file size passed to it.

* Remove vm_page_alloc() and remove the pgo_alloc vector from struct
  pagerops. Convert all the VM pager allocation procedures into global
  procedures which are called directly.

  Trying to feed everything through a single function was a joke when all
  the callers knew precisely what kind of VM object they were creating
  anyway.

  Add the extra arguments to vnode_pager_alloc() which vinitvmio() needs
  to pass in.
kernel - fine-grained namecache and partial vnode MPSAFE work

Namecache subsystem

* All vnode->v_flag modifications now use vsetflags() and vclrflags().
  Because some flags are set and cleared by vhold()/vdrop(), which do not
  require any locks to be held, all modifications must use atomic ops.

* Clean up and revamp the namecache MPSAFE work. Namecache operations now
  use a fine-grained MPSAFE locking model which loosely follows these
  rules:

  - Lock ordering is child to parent, e.g. lock file, then lock parent
    directory. This allows resolver recursions up the parent directory
    chain.

  - Downward-traversing namecache invalidations and path lookups will
    unlock the parent (but leave it referenced) before attempting to lock
    the child.

  - Namecache hash table lookups utilize a per-bucket spinlock.

  - Vnode locks may be acquired while holding namecache locks, but not
    vice-versa. Vnodes are not destroyed until all namecache references
    go away, but can enter reclamation. Namecache lookups detect the case
    and re-resolve to overcome the race. Namecache entries are not
    destroyed while referenced.

* Remove vfs_token; the namecache MPSAFE model is now totally
  fine-grained.

* Revamp the namecache locking primitives (cache_lock/cache_unlock and
  friends). Use atomic ops and nc_exlocks instead of nc_locktd, and build
  in a request flag. This solves busy/tsleep races between the lock
  holder and the lock requester.

* Revamp namecache parent/child linkages. Instead of using vfs_token to
  lock such operations we simply lock both child and parent namecache
  entries. Hash table operations are also fully integrated with the
  parent/child linking operations.

* The vnode->v_namecache list is locked via vnode->v_spinlock, which is
  actually vnode->v_lock.lk_spinlock.

* Revamp cache_vref() and cache_vget(). The passed namecache entry must
  be referenced and locked. Internals are simplified.
* Fix a deadlock by moving the call to _cache_hysteresis() to a place
  where the current thread otherwise does not hold any locked ncp's.

* Revamp nlookup() to follow the new namecache locking rules.

* Fix a number of places, e.g. in vfs/nfs/nfs_subs.c, where
  ncp->nc_parent or ncp->nc_vp was being accessed with an unlocked ncp.
  nc_parent and nc_vp accesses are only valid if the ncp is locked.

* Add the vfs.cache_mpsafe sysctl, which defaults to 0. This may be set
  to 1 to enable MPSAFE namecache operations for [l,f]stat() and open()
  system calls (for the moment).

VFS/VNODE subsystem

* Use a global spinlock, for now called vfs_spin, to manage
  vnode_free_list. Use vnode->v_spinlock (and vfs_spin) to manage
  vhold/vdrop ops and to interlock v_auxrefs tests against vnode
  terminations.

* Integrate the per-mount mnt_token and (for now) the MP lock into
  VOP_*() and VFS_*() operations. This allows the MP lock to be shifted
  further inward from the system calls, but we don't do it quite yet.

* HAMMER: VOP_GETATTR, VOP_READ, and VOP_INACTIVE are now MPSAFE. The
  corresponding sysctls have been removed.

* FIFOFS: Needed some MPSAFE work in order to allow HAMMER to make things
  MPSAFE above, since HAMMER forwards vops for in-filesystem fifos to
  fifofs.

* Add some debugging kprintf()s when certain MP races are averted, for
  testing only.

MISC

* Add some assertions to the VM system.

* Document existing and newly MPSAFE code.
Kernel - fix access checks

* VOP_ACCESS() is used for more than just access(). UFS and other
  filesystems (but not HAMMER) were calling it in the
  open/create/rename/unlink paths. The uid/gid must be used in those
  cases, not the ruid/rgid.

  Add a VOP_EACCESS() macro which passes the appropriate flag to use the
  uid/gid instead of the ruid/rgid, and adjust the filesystems to use
  this macro.

Reported-by: Stathis Kamperis <ekamperi@gmail.com>
DEVFS - rollup - all kernel devices

* Make changes needed to kernel devices to use devfs.

* Also pre-generate some devices (usually 4) to support system utilities
  which do not yet deal with the auto-cloning device support.

* Adjust the spec_vnops for various filesystems to vector to dummy code
  for read and write, for VBLK/VCHR nodes in old filesystems which are no
  longer supported.

Submitted-by: Alex Hornung <ahornung@gmail.com>
HAMMER / VFS_VGET - Add optional dvp argument to VFS_VGET(). Fix readdirplus

* VGET is used by NFS to acquire a vnode given an inode number. HAMMER
  requires additional information to determine the PFS the inode is being
  acquired from.

  Add an optional directory vnode argument to the VGET. If non-NULL,
  HAMMER will extract the PFS information from this vnode.

* Adjust NFS to pass the dvp to VGET when doing a readdirplus. Note that
  the PFS is already encoded in file handles, but readdirplus acquires
  the attributes for each directory entry it scans (readdir does not).

  This fixes readdirplus for NFS-served HAMMER PFS exports.
* Implement the ability to export NULLFS mounts via NFS.

* Enforce PFS isolation when exporting a HAMMER PFS via a NULLFS mount.

NOTE: Exporting anything other than HAMMER PFS roots via nullfs does NOT
protect the parent of the exported directory from being accessed via NFS.

Generally speaking this feature is implemented by giving each nullfs
mount a synthesized fsid based on what is being mounted, and by
implementing the NFS export infrastructure in the nullfs code instead of
just bypassing those functions to the underlying VFS.
Give the device major / minor numbers their own separate 32 bit fields in
the kernel. Change dev_ops to use a RB tree to index major device numbers
and remove the 256 device major number limitation.

Build a dynamic major number assignment feature into dev_ops_add() and
adjust ASR (which already had a hand-rolled one) and MFS to use the
feature. MFS at least does not require any filesystem visibility to
access its backing device. Major device numbers >= 256 are used for
dynamic assignment.

Retain filesystem compatibility for device numbers that fall within the
range that can be represented in UFS or struct stat (which is a single
32 bit field supporting 8 bit major numbers and 24 bit minor numbers).
Major namecache work primarily to support NULLFS.

* Move the nc_mount field out of the namecache{} record and use a new
  namecache handle structure called nchandle { mount, ncp } for all API
  accesses to the namecache.

* Remove all mount point linkages from the namecache topology. Each mount
  now has its own namecache topology rooted at the root of the mount
  point.

  Mount points are flagged in their underlying filesystem's namecache
  topology, but instead of linking the mount into the topology, the flag
  simply triggers a mountlist scan to locate the mount. ".." is handled
  the same way... when the root of a topology is encountered the scan can
  traverse to the underlying filesystem via a field stored in the mount
  structure.

* Ref the mount structure based on the number of nchandle structures
  referencing it, and do not kfree() the mount structure during a forced
  unmount if refs remain.

These changes have the following effects:

* Traversal across mount points no longer requires locking of any sort,
  preventing process blockages occurring in one mount from leaking across
  a mount point to another mount.

* Aliased namespaces such as occur with NULLFS no longer duplicate the
  namecache topology of the underlying filesystem. Instead, a NULLFS
  mount simply shares the underlying topology (differentiating between it
  and the underlying topology by the fact that the namecache handles
  { mount, ncp } contain NULLFS's mount pointer).

  This saves an immense amount of memory and allows NULLFS to be used
  heavily within a system without creating any adverse impact on kernel
  memory or performance.

* Since the namecache topology for a NULLFS mount is shared with the
  underlying mount, the namecache records are in fact the same records,
  and thus full coherency between the NULLFS mount and the underlying
  filesystem is maintained by design.

* Future efforts, such as a unionfs or shadow fs implementation, now have
  a mount structure to work with. The new API is a lot more flexible than
  the old one.
Change the kernel dev_t, representing a pointer to a specinfo structure, to cdev_t. Change struct specinfo to struct cdev. The name 'cdev' was taken from FreeBSD. Remove the dev_t shim for the kernel. This commit generally removes the overloading of 'dev_t' between userland and the kernel. Also fix a bug in libkvm where a kernel dev_t (now cdev_t) was not being properly converted to a userland dev_t.
VNode sequencing and locking - part 3/4. VNode aliasing is handled by the namecache (aka nullfs), so there is no longer a need to have VOP_LOCK, VOP_UNLOCK, or VOP_ISSLOCKED as 'VOP' functions. Both NFS and DEADFS have been using standard locking functions for some time and are no longer special cases. Replace all uses with native calls to vn_lock, vn_unlock, and vn_islocked. We can't have these as VOP functions anyhow because of the introduction of the new SYSLINK transport layer, since vnode locks are primarily used to protect the local vnode structure itself.