kernel - Improve tmpfs support

* When a file in tmpfs is truncated to a size that is not on a block boundary, or extended (but not written) to a size that is not on a block boundary, the nvextendbuf() and nvtruncbuf() functions must modify the contents of the straddling buffer and bdwrite() it. However, a bdwrite() for a tmpfs buffer results in a dirty buffer cache buffer that will likely be cycled out to swap relatively soon under even a modest load. This is not desirable when there is no memory pressure forcing it out. Tmpfs almost always uses buwrite() in order to leave the buffer 'clean' (the underlying VM pages are dirtied instead), to prevent unnecessary paging of tmpfs data to swap when the buffer gets recycled or the vnode cycles out.

* Add support for calling buwrite() in these functions by changing the 'trivial' boolean into a flags variable.

* Tmpfs now passes the appropriate flag, preventing the undesirable behavior.
<sys/slaballoc.h>: Switch to lighter <sys/_malloc.h> header.

<sys/globaldata.h> embeds SLGlobalData, which in turn embeds struct malloc_type. Adjust several kernel sources with missing includes where memory allocation is performed. Try to use alphabetical include order; now (in most cases) <sys/malloc.h> is included after <sys/objcache.h>. Once things get cleaned up, the <sys/malloc.h> inclusion could be moved out of <sys/idr.h> into the drm Linux compat layer's linux/slab.h without side effects.
kernel: Clean up <sys/uio.h> issues.

The iovec_free() inline greatly complicates inclusion of this header. The NULL check is not always visible from <sys/_null.h>. Luckily only three kernel sources need it: kern_subr.c, sys_generic.c and uipc_syscalls.c. Also, just a single dev/drm source makes use of 'struct uio'.

* Include <sys/uio.h> explicitly first in drm_fops.c to avoid the kfree() macro override in the drm compat layer.

* Use <sys/_uio.h> where only the enums and struct uio are needed, but ensure that userland will not include it, for possible later <sys/user.h> use.

* Stop using <sys/vnode.h> as a shortcut for the uiomove*() prototypes. The uiomove*() family of functions potentially transfers data across the kernel/user space boundary, so the presence of this header explicitly marks sources as doing so.

* Prefer to add <sys/uio.h> after <sys/systm.h>, but before <sys/proc.h> and definitely before <sys/malloc.h> (except for the 3 sources mentioned above). This will allow removing <sys/malloc.h> from <sys/uio.h> later on.

* Adjust <sys/user.h> to use component headers instead of <sys/uio.h>. While there, take the opportunity for a minimal whitespace cleanup.

No functional differences observed in compiler intermediates.
kernel - Refactor buffer cache code in preparation for vm_page repurposing

* Keep buffer_map but no longer use vm_map_findspace/vm_map_delete to manage buffer sizes. Instead, reserve MAXBSIZE of unallocated KVM for each buffer.

* Refactor the buffer cache management code. bufspace exhaustion now has hysteresis; bufcount works just about the same.

* Start work on the repurposing code (currently disabled).
kernel - Add kqueue support to NFS (fix firefox issues w/nfs)

* Firefox appears to get semi-random memory corruption and otherwise implodes if one or more of the filesystems it accesses do not support kqueue. This appears to be due to some interaction between firefox, glib, and the kernel when kqueue support is missing from a filesystem.

* Add host-local kqueue support to NFS. As with locks, the support is host-local only and will not work across multiple clients sharing the same files.

* Appears to stabilize firefox when the file(s) it accesses are on NFS.
kernel - Adjust UFS and HAMMER to use uiomovebp()

* Add uiomovebp(), a version of uiomove() which is aware of a locked bp representing the to or from buffer and can work around issues related to VM faults causing recursions and deadlocks on the user buffer. uiomovebp() does not yet detect or handle deadlocks. Implementing deadlock handling will require a certain degree of finesse related to the vnode and bp locks, and we don't want to have to do it unless we actually deadlock. TODO.

* Adjust UFS, HAMMER, TMPFS, MSDOSFS, NFS, NTFS to use uiomovebp().
kernel - Fix NFS client & server bugs

* A very long-standing bug in the server cache was finally whacked. The write-gather code was improperly returning the wrong mbuf for the server to reply with, causing client stalls. This behavior depends on the client doing burst asynchronous writes. Newer releases of DragonFly do burst asynchronous writes, but older ones tended not to.

* The server cache was not MPSAFE. Add an MP token to fix that.

* Remove critical sections from the server cache which are no longer needed.

* Fix a potential client-side rpc request race where a request's NEEDSXMIT flag is not set until after the request possibly blocks, which can lead to issues if another thread picks up the request and believes that it has already been transmitted when it has not.

* Document a big problem with NFSv2 and HAMMER-served directories. NFSv2 only has 32-bit directory cookies. It is possible to work around the problem by using rdirplus (which is now the default). However, some servers may not be able to handle rdirplus with an NFSv2 mount. Users who need to serve out NFSv2 cannot serve HAMMER directories with NFSv2 unless the clients support rdirplus. Our defaults are NFSv3 and rdirplus, and NFSv3 does NOT have this problem.

Reported-by: Thomas Nikolajsen <thomas.nikolajsen@mail.dk>
network - Tokenize NFS, fix MP races

* Now that the rest of the network stack is running MPSAFE, poor NFS is hitting races and other issues because it was depending on the MP lock.

* Recombobulate NFS with tokens, protecting all border crossings: a global nfs_token is used for the nfs mount list, nfsd list, and server socket list; a per-socket token (nfssvc_sock->ns_token) governs each served mount; a per-mount token (nfsmount->nm_token) governs each client mount.

* Callouts and TCP upcalls are protected. The per-socket TCP upcall is protected by the nfssvc_sock token.

* The NFS iod thread pairs and nfsd threads now run MPSAFE.

* NFSv3 is now holy-shit fast and can trivially max out a GigE link without TSO when the server is not otherwise limited by server-side disks.
kernel - Major MPSAFE Infrastructure

* vm_page_lookup() now requires the vm_token to be held on call instead of the MP lock. Also fix the few places where the routine was being called without the vm_token. Various situations where a vm_page_lookup() is performed followed by vm_page_wire() without busying the page, and other similar situations, require the vm_token to be held across the whole block of code.

* bio_done callbacks are now MPSAFE, but some drivers (ata, ccd, vinum, aio, nfs) are not MPSAFE yet, so get the mplock for those. They will be converted to a generic driver-wide token later.

* Remove critical sections that used to protect VM system related interrupts; replace them with the vm_token.

* Spinlocks now bump thread->td_critcount in addition to mycpu->gd_spinlock*. Note that the ordering is important. Then remove gd_spinlock* checks elsewhere that are covered by td_critcount and replace them with assertions. Also use td_critcount in the kern_mutex.c code instead of gd_spinlock*. This fixes situations where the last crit_exit() would call splx() without checking for spinlocks. Adding the additional checks would have made the crit_*() inlines too complex, so instead we just fold it into td_critcount.

* lwkt_yield() no longer guarantees that lwkt_switch() will be called, so call lwkt_switch() instead in places where a switch is required, for example to unwind a preemption. Otherwise the kernel could end up live-locking trying to yield, because the new switch code does not necessarily schedule a different kernel thread.

* Add the sysctl user_pri_sched (default 0). Setting this makes the LWKT scheduler more aggressively schedule user threads when runnable kernel threads are unable to gain token/mplock resources. For debugging only.

* Change the bufspin spinlock to bufqspin and bufcspin, and generally rework vfs_bio.c to lock numerous fields with bufcspin. Also use bufcspin to interlock waitrunningbufspace() and friends.
* Remove several mplocks in vfs_bio.c that are no longer needed. Protect the page manipulation code in vfs_bio.c with vm_token instead of the mplock.

* Fix a deadlock in the FINDBLK_TEST/BUF_LOCK sequence which can occur because a buffer may change its (vp,loffset) during the BUF_LOCK call. Even though the code checks for this after the lock succeeds, the locking operation itself can still create a deadlock between two threads by locking an unexpected buffer when the caller is already holding other buffers locked. Fix this by adding an interlock refcounter, b_refs; getnewbuf() will avoid reusing such buffers.

* The syncer_token was not protecting all accesses to the syncer list. Fix that.

* Make HAMMER MPSAFE. All major entry points now use a per-mount token, hmp->fs_token. Backend callbacks (bioops, bio_done) use hmp->io_token. The cache cases for the read and getattr paths require no tokens at all (as before). The bitfield flags had to be separated into two groups to deal with SMP cache coherency races, and certain flags in the hammer_record structure had to be separated for the same reason. Certain interactions between the frontend and the backend must use hmp->io_token. It is important to note that for any given buffer there are two locking entities: (1) the hammer structure and (2) the buffer cache buffer. These interactions are very fragile. Do not allow the kernel to flush a dirty buffer if we are unable to obtain a norefs interlock on the buffer, which fixes numerous frontend/backend MP races on the io structure. Add a write interlock in one of the recover_flush_buffer cases.