kernel - Improve tmpfs support

* When a file in tmpfs is truncated to a size that is not on a block boundary, or extended (but not written) to a size that is not on a block boundary, the nvextendbuf() and nvtruncbuf() functions must modify the contents of the straddling buffer and bdwrite() it. However, a bdwrite() for a tmpfs buffer results in a dirty buffer cache buffer that will likely be cycled out to swap relatively soon under even modest load. This is not desirable when there is no memory pressure to force it out. Tmpfs almost always uses buwrite() in order to leave the buffer 'clean' (the underlying VM pages are dirtied instead), to prevent unnecessary paging of tmpfs data to swap when the buffer gets recycled or the vnode cycles out.

* Add support for calling buwrite() in these functions by changing the 'trivial' boolean into a flags variable.

* Tmpfs now passes the appropriate flag, preventing the undesirable behavior. A minimal sketch of the flags change follows.
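A minimal user-space model of the boolean-to-flags change, for illustration only. The flag names (NVX_TRIVIAL, NVX_BUWRITE) and the helper are hypothetical stand-ins, not the kernel's actual identifiers:

    #include <stdio.h>

    #define NVX_TRIVIAL  0x0001   /* replaces the old 'trivial' boolean */
    #define NVX_BUWRITE  0x0002   /* hypothetical: request buwrite() semantics */

    /* Decide how the straddling buffer is written back. */
    static void
    finalize_straddling_buf(int flags)
    {
        if (flags & NVX_BUWRITE) {
            /* tmpfs path: dirty the backing VM pages, leave the
             * buffer 'clean' so it is not paged to swap on recycle. */
            printf("buwrite(): buffer clean, VM pages dirtied\n");
        } else {
            /* default path: delayed write of a dirty buffer, which
             * may be cycled out to swap under modest load. */
            printf("bdwrite(): dirty buffer cache buffer\n");
        }
    }

    int
    main(void)
    {
        finalize_straddling_buf(0);           /* regular filesystem */
        finalize_straddling_buf(NVX_BUWRITE); /* tmpfs */
        return 0;
    }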
kernel: Remove numerous #include <sys/thread2.h>.

Most of them were added when we converted spl*() calls to crit_enter()/crit_exit(), almost 14 years ago. We can now remove a good chunk of them again where crit_*() calls are no longer used. Some files had to be adjusted because they relied on <sys/thread2.h>, or on headers it includes, coming in via other headers from which it was removed.
kernel: Remove <sys/sysref{,2}.h> inclusion from files that don't need it.

Some of the headers are public in one way or another, so bump __DragonFly_version for safety. While here, add a missing <sys/objcache.h> include to kern_exec.c, which previously relied on it coming in via <sys/sysref.h> (itself included by <sys/vm_map.h> prior to this commit).
kernel - KVABIO stabilization

* bp->b_cpumask must be cleared in vfs_vmio_release().

* Generally speaking, it is desirable for the kernel to set B_KVABIO when flushing or disposing of a buffer, as long as b_cpumask is also correct. This avoids unnecessary synchronization when the underlying device drivers support KVABIO, even if the filesystem does not.

* In findblk() we cannot just gratuitously clear B_KVABIO. We must issue a bkvasync_all() to clear the flag in order to ensure proper synchronization with the caller's desired B_KVABIO state.

* It was intended that bkvasync_all() clear the B_KVABIO flag. Make sure it does.

* In contrast, B_KVABIO can always be set at any time, as long as the cpumask is cleared whenever the mappings are changed, and as long as the caller's B_KVABIO state is respected if the buffer is later returned to the caller in a locked state. If the buffer will simply be disposed of by the kernel instead, the flag can be set. The wrapper (typically a vn_strategy() or dev_dstrategy() call) will clear the flag via bkvasync_all() if the target does not support KVABIO.

* Kernel support code outside of filesystems and device drivers is expected to support KVABIO.

* nvtruncbuf() and nvextendbuf() now use bread_kvabio() (i.e. they now properly support KVABIO).

* The buf_countdeps(), buf_checkread(), and buf_checkwrite() callbacks call bkvasync_all() in situations where the vnode does not support KVABIO, because the kernel might have set the flag for other incidental operations even if the filesystem did not.

* As per the above, devfs_spec_strategy() now sets B_KVABIO and properly calls bkvasync() when it needs to operate directly on buf->b_data.

* Fix a bug in tmpfs. tmpfs was using bread_kvabio() as intended, but failed to call bkvasync() prior to operating directly on buf->b_data (prior to calling uiomovebp()).

* Any VFS function that calls BUF_LOCK*() itself may also have to call bkvasync_all() if it wishes to operate directly on buf->b_data, even if the VFS is not KVABIO aware, because the VFS bypassed the normal buffer cache APIs to obtain a locked buffer. A small model of the flag/cpumask discipline follows.
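A small user-space model of the B_KVABIO / b_cpumask discipline described above, assuming a simplified buffer structure. The names mirror the kernel's but the implementation is illustrative only:

    #include <stdio.h>
    #include <stdint.h>

    #define B_KVABIO  0x01              /* per-cpu KVA mappings may be stale */
    #define NCPUS     4

    struct buf {
        int      b_flags;
        uint32_t b_cpumask;             /* cpus with a synchronized view */
    };

    /* Synchronize the current cpu's view before touching b_data. */
    static void
    bkvasync(struct buf *bp, int cpu)
    {
        if (bp->b_flags & B_KVABIO)
            bp->b_cpumask |= 1u << cpu; /* per-cpu TLB sync would go here */
    }

    /* Synchronize all cpus and drop KVABIO mode.  As the commit notes,
     * this must also clear B_KVABIO so a non-KVABIO caller (e.g. via
     * findblk()) sees fully synchronized mappings. */
    static void
    bkvasync_all(struct buf *bp)
    {
        bp->b_cpumask = (1u << NCPUS) - 1;
        bp->b_flags &= ~B_KVABIO;
    }

    int
    main(void)
    {
        struct buf b = { B_KVABIO, 0 };
        bkvasync(&b, 2);                /* cpu 2 about to touch b_data */
        bkvasync_all(&b);               /* hand off to a non-KVABIO caller */
        printf("flags=%#x cpumask=%#x\n", b.b_flags, b.b_cpumask);
        return 0;
    }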
kernel - Merge Mihai Carabas's VKERNEL/VMM GSOC project into the main tree

* This merge contains work primarily by Mihai Carabas, with some misc fixes also by Matthew Dillon.

* Special note on the GSOC core: this is, needless to say, a huge amount of work compressed down into a few paragraphs of comments. Adds the pc64/vmm subdirectory and tons of stuff to support hardware virtualization in guest-user mode, plus the ability for programs (vkernels) running in this mode to make normal system calls to the host.

* Add system call infrastructure for VMM mode operations in kern/sys_vmm.c which vectors through a structure to machine-specific implementations.

  vmm_guest_ctl_args() - bootstraps VMM and EPT modes. Copies down the original user stack for EPT (since EPT 'physical' addresses cannot reach that far into the backing store represented by the process's original VM space). Also installs the GUEST_CR3 for the guest using parameters supplied by the guest.

  vmm_guest_sync_addr_args() - a host helper function that the vkernel can use to invalidate page tables on multiple real cpus. This is a lot more efficient than having the vkernel try to do it itself with IPI signals via cpusync*().

* Add Intel VMX support to the host infrastructure. Again, tons of work compressed down into a one-paragraph commit message. AMD SVM support is not part of this GSOC and not yet supported by DragonFly.

* Remove the PG_* defines for PTEs and related MMU operations. Replace them with a table lookup so the same pmap code can be used for normal page tables and also EPT tables (see the sketch after this list). Also include X86_PG_V defines specific to normal page tables for a few situations outside the pmap code.

* Adjust DDB to disassemble VMX-related (Intel) instructions.

* Add infrastructure to exit1() to deal with the related structures.

* Optimize pfind() and pfindn() to remove the global token when looking up the current process's PID. (Matt)

* Add support for EPT (double-layer page tables). This primarily required adjusting the pmap code to use a table lookup to get the PG_* bits. Add an indirect vector for copyin, copyout, and other user address space copy operations to support manual walks when EPT is in use. A multitude of system calls which manually looked up user addresses via the vm_map now need a VMM layer call to translate EPT.

* Remove the MP lock from trapsignal() use cases in trap().

* (Matt) Add pthread_yield()s in most spin loops to help situations where the vkernel is running on more cpus than the host has, and to help with scheduler edge cases on the host.

* (Matt) Add a pmap_fault_page_quick() infrastructure that vm_fault_page() uses to try to shortcut operations and avoid locks. Implement it for pc64. This function checks whether the page is already faulted in as requested by looking up the PTE. If not, it returns NULL and the full-blown vm_fault_page() code continues running.

* (Matt) Remove the MP lock from most of the vkernel's trap() code.

* (Matt) Use a shared spinlock when possible for certain critical paths related to the copyin/copyout path.
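A sketch of the table-lookup approach for the PG_* bits, modeled in user-space C. DragonFly's real tables and index names differ; the EPT bit values shown are illustrative (in EPT the read/write permissions occupy the low bits, so 'valid' is modeled here as readable):

    #include <stdint.h>
    #include <stdio.h>

    enum { PG_V_IDX, PG_RW_IDX, PG_BITS_COUNT };

    static const uint64_t pmap_bits_normal[PG_BITS_COUNT] = {
        0x001,  /* X86_PG_V  */
        0x002,  /* X86_PG_RW */
    };
    static const uint64_t pmap_bits_ept[PG_BITS_COUNT] = {
        0x001,  /* EPT read permission, standing in for 'valid' */
        0x002,  /* EPT write permission */
    };

    struct pmap {
        const uint64_t *pmap_bits;      /* selected at pmap init time */
    };

    /* One pmap code path serves both PTE layouts via the indirection. */
    static int
    pte_valid(const struct pmap *pm, uint64_t pte)
    {
        return (pte & pm->pmap_bits[PG_V_IDX]) != 0;
    }

    int
    main(void)
    {
        struct pmap host  = { pmap_bits_normal };
        struct pmap guest = { pmap_bits_ept };
        printf("host:%d guest:%d\n",
               pte_valid(&host, 0x003), pte_valid(&guest, 0x003));
        return 0;
    }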
kernel - Fix cpu/token starvation, vfs_busy deadlocks, incl. sysctl

* Remove the mplock around the userland sysctl system call; it should no longer be needed.

* Remove the mplock around getcwd(); it should no longer be needed.

* Change vfs_busy(), sys_mount(), and related mount code to use the per-mount token instead of the mp lock.

* Fix a race in vfs_busy() which could cause it to never get woken up.

* Fix a deadlock in nlookup() when the lookup races an unmount. When the mp is flagged MNTK_UNMOUNT, the unmount is in progress and the lookup must fail instead of looping.

* The per-mount token now protects mp->mnt_kern_flag.

* The unmount code now waits for the final mnt_refs to return to the proper value, fixing races with other code that might temporarily ref the mount point.

* Add lwkt_yield()s in nvtruncbuf*() and nvnode_pager_setsize(), reducing cpu stalls due to large file-extending I/Os. Also in tmpfs.

* Use a marker in the vm_meter code and check for vmobj_token collisions. When a collision is detected, give other threads a chance to take the token. This prevents hogging of this very important token. (A model of the marker technique follows.)

Testing-by: dillon, vsrinivas, ftigeot
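A user-space model of the marker technique used in the vm_meter scan. The list handling follows the usual BSD <sys/queue.h> pattern; the token collision check is reduced to a comment, since the point here is how the marker lets the scan drop and reacquire the lock without losing its place:

    #include <stdio.h>
    #include <sys/queue.h>

    struct obj {
        TAILQ_ENTRY(obj) entry;
        int is_marker;
        int id;
    };
    TAILQ_HEAD(objlist, obj);

    static void
    scan_with_marker(struct objlist *list)
    {
        struct obj marker = { .is_marker = 1 };
        struct obj *o;

        TAILQ_INSERT_HEAD(list, &marker, entry);
        while ((o = TAILQ_NEXT(&marker, entry)) != NULL) {
            /* Advance the marker past the node we are visiting. */
            TAILQ_REMOVE(list, &marker, entry);
            TAILQ_INSERT_AFTER(list, o, &marker, entry);
            if (o->is_marker)           /* skip other scanners' markers */
                continue;
            /* In the kernel: on vmobj_token collision, release the
             * token here and let other threads run; the marker keeps
             * our position valid across the blocking window. */
            printf("visit %d\n", o->id);
        }
        TAILQ_REMOVE(list, &marker, entry);
    }

    int
    main(void)
    {
        struct objlist list = TAILQ_HEAD_INITIALIZER(list);
        struct obj a = { .id = 1 }, b = { .id = 2 };
        TAILQ_INSERT_TAIL(&list, &a, entry);
        TAILQ_INSERT_TAIL(&list, &b, entry);
        scan_with_marker(&list);
        return 0;
    }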
kernel - Major SMP performance patch / VM system, bus-fault/seg-fault fixes

This is a very large patch which reworks locking in the entire VM subsystem, concentrated on VM objects and the x86-64 pmap code. These fixes remove nearly all the spin lock contention for non-threaded VM faults and narrow contention for threaded VM faults to just the threads sharing the pmap. Multi-socket many-core machines will see a 30-50% improvement in parallel build performance (tested on a 48-core opteron), depending on how well the build parallelizes.

As part of this work a long-standing problem on 64-bit systems where programs would occasionally seg-fault or bus-fault for no reason has been fixed. The problem was related to races between vm_fault, the vm_object collapse code, and the vm_map splitting code.

* Most uses of vm_token have been removed. All uses of vm_spin have been removed. These have been replaced with per-object tokens and per-queue (vm_page_queues[]) spin locks. Note in particular that since we still have the page coloring code, the PQ_FREE and PQ_CACHE queues are actually many queues, individually spin-locked, resulting in excellent MP page allocation and freeing performance.

* Reworked vm_page_lookup() and vm_object->rb_memq. All (object,pindex) lookup operations are now covered by the vm_object hold/drop system, which utilizes pool tokens on vm_objects. Calls now require that the VM object be held in order to ensure a stable outcome. Also added vm_page_lookup_busy_wait(), vm_page_lookup_busy_try(), vm_page_busy_wait(), vm_page_busy_try(), and other API functions which integrate the PG_BUSY handling.

* Added OBJ_CHAINLOCK. Most vm_object operations are protected by the vm_object_hold/drop() facility, which is token-based. Certain critical functions which must traverse backing_object chains use a hard-locking flag and lock almost the entire chain as it is traversed, to prevent races against object deallocation, collapses, and splits. The last object in the chain (typically a vnode) is NOT locked in this manner, so concurrent faults which terminate at the same vnode will still have good performance. This is important e.g. for parallel compiles which might be running dozens of the same compiler binary concurrently.

* Created a per-vm_map token and removed most uses of vmspace_token.

* Removed the mp_lock in sys_execve(). It has not been needed in a while.

* Added kmem_lim_size(), which returns approximate available memory (reduced by available KVM), in megabytes. This is now used to scale up the slab allocator cache and the pipe buffer caches to reduce unnecessary global kmem operations.

* Rewrote vm_page_alloc(), various bits in vm/vm_contig.c, the swapcache scan code, and the pageout scan code. These routines were rewritten to use the per-queue spin locks.

* Replaced the exponential backoff in the spinlock code with something a bit less complex and cleaned it up.

* Restructured the IPIQ func/arg1/arg2 array for better cache locality. Removed the per-queue ip_npoll and replaced it with a per-cpu gd_npoll, which is used by other cores to determine if they need to issue an actual hardware IPI or not. This reduces hardware IPI issuance considerably (and the removal of the decontention code reduced it even more).

* Temporarily removed the lwkt thread fairq code and disabled a number of features. These will be worked back in once we track down some of the remaining performance issues. Temporarily removed the lwkt thread resequencer for tokens for the same reason. This might wind up being permanent. Added splz_check()s in a few critical places.

* Increased the number of pool tokens from 1024 to 4001 and went to a prime-number mod algorithm to reduce overlaps.

* Removed the token decontention code. This was a bit of an eyesore and, while it did its job when we had global locks, it just gets in the way now that most of the global locks are gone. Replaced the decontention code with a fallback which acquires the tokens in sorted order, to guarantee that deadlocks will always be resolved eventually in the scheduler (see the sketch after this list).

* Introduced a simplified spin-for-a-little-while function, _lwkt_trytoken_spin(), that the token code now uses rather than giving up immediately.

* The vfs_bio subsystem no longer uses vm_token and now uses the vm_object_hold/drop API for buffer cache operations, resulting in very good concurrency.

* Gave the vnode its own spinlock instead of sharing vp->v_lock.lk_spinlock, which fixes a deadlock.

* Adjusted all platform pmap.c files to handle the new main kernel APIs. The i386 pmap.c is still a bit out of date but should be compatible.

* Completely rewrote very large chunks of the x86-64 pmap.c code. The critical path no longer needs pmap_spin, but pmap_spin itself is still used heavily, particularly in the pv_entry handling code. A per-pmap token and per-pmap object are now used to serialize pmap access and vm_page lookup operations when needed. The x86-64 pmap.c code now uses only vm_page->crit_count instead of both crit_count and hold_count, which fixes races against other parts of the kernel that use vm_page_hold(). _pmap_allocpte() mechanics have been completely rewritten to remove potential races. Much of pmap_enter() and pmap_enter_quick() has also been rewritten. Many other changes.

* The following subsystems (and probably more) no longer use vm_token or vmobj_token in critical paths:

  x The swap_pager now uses the vm_object_hold/drop API instead of vm_token.
  x mmap() and vm_map/vm_mmap in general now use the vm_object_hold/drop API instead of vm_token.
  x vnode_pager
  x zalloc
  x vm_page handling
  x vfs_bio
  x umtx system calls
  x vm_fault and friends

* Minor fixes to fill_kinfo_proc() to deal with process scan panics (ps) revealed by recent global lock removals.

* lockmgr() locks no longer support LK_NOSPINWAIT. Spin locks are unconditionally acquired.

* Replaced netif/e1000's spinlocks with lockmgr locks. The spinlocks were not appropriate owing to the large context they were covering.

* Misc atomic ops added.
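A compact model of the sorted-order fallback that replaced the decontention code. Acquiring contended tokens in a single global order (here, by address) guarantees forward progress; the helper names are illustrative, not the kernel's:

    #include <stdio.h>
    #include <stdlib.h>

    struct token { const char *name; };

    static int
    tokcmp(const void *a, const void *b)
    {
        const struct token *x = *(struct token *const *)a;
        const struct token *y = *(struct token *const *)b;
        return (x > y) - (x < y);   /* total order by address */
    }

    /* On contention: sort the thread's wanted tokens and (re)acquire
     * them in that order.  Two threads wanting {A,B} and {B,A} then
     * always lock in the same sequence, so neither can block the other
     * mid-acquisition forever. */
    static void
    acquire_sorted(struct token **toks, int n)
    {
        qsort(toks, n, sizeof(*toks), tokcmp);
        for (int i = 0; i < n; i++)
            printf("acquire %s\n", toks[i]->name);
    }

    int
    main(void)
    {
        struct token a = { "tok_a" }, b = { "tok_b" };
        struct token *want[] = { &b, &a };  /* requested out of order */
        acquire_sorted(want, 2);
        return 0;
    }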
kernel - Major MPSAFE Infrastructure 2

* Refactor buffer cache code which assumed content-stable data across a non-blocking BUF_LOCK(). This is no longer true; the contents must be re-verified after the BUF_LOCK() succeeds (see the sketch after this list).

* Make setting and clearing B_DELWRI atomic with buffer reassignment.

* Release the cached mplock when looping in the scheduler and run splz_check() to avoid livelocking cpus.

* Refactor the mplock contention handling code to handle both mplock and token contention. Generate a 2uS delay for all but one cpu to try to avoid livelocks.

* Do not splz() from inside a spinlock; it will just panic.

* Fix the token description field for 'systat -pv 1'.

* Optimize the MP_LOCK macros a bit.
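A user-space model of the revalidation pattern from the first item, using a pthread mutex in place of BUF_LOCK(). The identity fields and helper name are illustrative:

    #include <pthread.h>
    #include <stdio.h>

    struct buf {
        pthread_mutex_t b_lock;
        void *b_vp;        /* identity: owning vnode */
        long  b_loffset;   /* identity: logical offset */
    };

    /* Non-blocking lock of a buffer found by lookup.  The buffer can be
     * repurposed between the lookup and the lock, so its identity must
     * be re-verified once the lock is held. */
    static struct buf *
    trylock_buf(struct buf *bp, void *vp, long loffset)
    {
        if (pthread_mutex_trylock(&bp->b_lock) != 0)
            return NULL;                        /* contended: retry lookup */
        if (bp->b_vp != vp || bp->b_loffset != loffset) {
            pthread_mutex_unlock(&bp->b_lock);  /* stale: buffer reused */
            return NULL;
        }
        return bp;                              /* locked and still ours */
    }

    int
    main(void)
    {
        int vnode;
        struct buf b = { PTHREAD_MUTEX_INITIALIZER, &vnode, 4096 };
        printf("%s\n", trylock_buf(&b, &vnode, 4096) ? "valid" : "retry");
        return 0;
    }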
kernel - lwkt_token revamp

* Simplify the token API. Hide the lwkt_tokref mechanics and simplify the lwkt_gettoken()/lwkt_reltoken() API to remove the need to declare and pass a lwkt_tokref along with the token. This makes tokens operate more like locks. There is a minor restriction that tokens must be unlocked in exactly the reverse order they were locked in, and another restriction limiting the maximum number of tokens a thread can hold to a defined value (32 for now). The tokrefs are now an array embedded in the thread structure. (A model of the new discipline follows this list.)

* Improve performance when blocking and unblocking threads with recursively held tokens.

* Improve performance when acquiring the same token recursively. This operation is now O(1) and requires no locks or critical sections of any sort. This will allow us to acquire redundant tokens in deep call paths without having to worry about performance issues.

* Add a flags field to the lwkt_token and lwkt_tokref structures and add a flagged feature which will acquire the MP lock along with a particular token. This will be used as a transitory mechanism in upcoming MPSAFE work. The mplock feature in the token structure can be directly connected to a mpsafe sysctl without being vulnerable to state-change races.
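A user-space model of the simplified API and its two restrictions (strict LIFO release, at most 32 held tokens). The tokref array is thread-local here, standing in for the array embedded in the thread structure; function names are simplified:

    #include <assert.h>
    #include <stdio.h>

    #define LWKT_MAXTOKENS 32           /* per-thread cap from the commit */

    struct token  { const char *name; };
    struct tokref { struct token *tok; };

    static __thread struct tokref tok_stack[LWKT_MAXTOKENS];
    static __thread int tok_depth;

    static void
    gettoken(struct token *tok)
    {
        assert(tok_depth < LWKT_MAXTOKENS);
        tok_stack[tok_depth++].tok = tok;   /* recursion is an O(1) push */
    }

    static void
    reltoken(struct token *tok)
    {
        /* Tokens must be released in exactly the reverse order. */
        assert(tok_depth > 0 && tok_stack[tok_depth - 1].tok == tok);
        tok_depth--;
    }

    int
    main(void)
    {
        struct token a = { "a" }, b = { "b" };
        gettoken(&a);
        gettoken(&b);
        gettoken(&a);   /* recursive acquisition, no locks needed */
        reltoken(&a);
        reltoken(&b);
        reltoken(&a);
        printf("LIFO discipline ok, depth=%d\n", tok_depth);
        return 0;
    }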
kernel - Even more buffer cache / VM coherency work

* nvtruncbuf()/nvextendbuf() now clear the cached layer-2 disk offset from the buffer cache buffer being zero-extended or zero-truncated. This is required by HAMMER since HAMMER never overwrites data in the same media block.

* Convert HAMMER over to the new nvtruncbuf/nvextendbuf API. The new API automatically handles zero-truncations and zero-extensions within the buffer straddling the file EOF and also changes the way backing VM pages are handled. Instead of cutting the VM pages off at the nearest boundary past file EOF, any pages in the straddling buffer are left fully valid and intact, which avoids numerous pitfalls the old API had in dealing with VM page valid/dirty bits during file truncations and extensions.

* Make sure the PG_ZERO flag in the VM page is cleared in allocbuf().

* Refactor HAMMER's strategy code to close two small windows of opportunity where stale data might be read from the media. In particular, refactor hammer_ip_*_bulk(), hammer_frontend_trunc*(), and hammer_io_direct_write(). These were detected by the fsx test program on a heavily paging system with physical memory set artificially low.

  Data flows through three stages in HAMMER:

  (1) Buffer cache.
  (2) In-memory records referencing the direct-write data offset on the media until the actual B-Tree is updated on-media at a later time.
  (3) Media B-Tree lookups referencing the committed data offset on the media.

  HAMMER must perform a careful, fragile dance to ensure that access to the data from userland doesn't slip through any cracks while the data is transitioning between stages. Two cracks were found and fixed:

  (A) The direct-write code was allowing the BUF/BIO in the strategy call to complete before adding the in-memory record to the index for the stage 1->2 transition. Now fixed (see the ordering sketch below).

  (B) The HAMMER truncation code was skipping in-memory records queued to the backend flusher under the assumption that the backend flusher would deal with them, which it will eventually, but there was a small window where the data was still accessible by userland after the truncation if userland did a truncation followed by an extension. Now fixed.
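A toy user-space model of the ordering fix in (A). The variables stand in for the real buffer cache and record index; the point is solely that stage 2 becomes visible before stage 1's BIO completes, so a reader always finds the data in at least one stage:

    #include <stdio.h>

    static int record_indexed;   /* stage 2: in-memory record visible */
    static int bio_done;         /* stage 1: write BIO has completed  */

    static void
    direct_write_fixed(void)
    {
        record_indexed = 1;   /* index the in-memory record first...   */
        bio_done = 1;         /* ...then allow the BIO to complete     */
    }

    static void
    reader(void)
    {
        /* With the fixed ordering, bio_done implies record_indexed, so
         * a lookup that misses the buffer cache still finds the record. */
        if (bio_done && !record_indexed)
            printf("stale read window!\n");
        else
            printf("data visible in at least one stage\n");
    }

    int
    main(void)
    {
        direct_write_fixed();
        reader();
        return 0;
    }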
kernel - More buffer cache / VM coherency work

* Add a buffer offset argument to nvtruncbuf(). The truncation length and the blocksize of the block containing the truncation point alone are insufficient, since prior blocks might use a different blocksize (see the example after this list).

* Add a buffer offset argument to nvnode_pager_setsize() for the same reason.

* nvtruncbuf() and nvextendbuf() now bdwrite() the buffer being zero-filled. This fixes a race where the clean buffer might be discarded and read from the media's pre-truncation backing store again before the filesystem has a chance to adjust it.

* nvextendbuf() now takes additional arguments. The block offsets for the old and new blocks must be passed.

* Convert UFS over to the nv*() API, hopefully solving any remaining fsx VM/BUF coherency issues.

* Correct bugs in swap_burst_read mode, but leave the mode disabled. There are still unresolved issues when the mode is enabled. (Reported-by: YONETANI Tomokazu <qhwt+dfly@les.ath.cx>)

* Fix a bug in vm_prefault() which would leak VM pages, eventually causing the machine to run out of memory.
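A worked example (illustrative numbers) of why the truncation length plus the straddling block's own blocksize are not enough: with UFS-style layouts, earlier blocks may use a different blocksize, so the caller must pass the offset of the truncation point within its buffer explicitly:

    #include <stdio.h>

    int
    main(void)
    {
        long full_bsize = 8192;     /* first two blocks are full-sized */
        long frag_size  = 1024;     /* tail of the file uses fragments */
        long length     = 17000;    /* truncation point */

        /* The offset of the truncation point within its own block
         * depends on the layout of all preceding blocks, not just on
         * frag_size, so the filesystem must compute and pass it. */
        long boff;
        if (length < 2 * full_bsize)
            boff = length % full_bsize;
        else
            boff = (length - 2 * full_bsize) % frag_size;
        printf("offset within straddling block: %ld\n", boff);  /* 616 */
        return 0;
    }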
kernel - Add new bufcache/VM consolidated API, fsx fixes for NFS

* Add kern/vfs_vm.c with a new API for vtruncbuf() and vnode_pager_setsize() called nvtruncbuf(), nvextendbuf(), and nvnode_pager_setsize(). This API solves numerous problems with data coherency between the VM and buffer cache subsystems.

  Generally speaking, what this API does is allow the VM pages backing the buffer straddling EOF in a file to remain valid instead of being invalidated. Take NFS, for example, with 32K buffers and, say, a 16385-byte file. The NFS buffer cache buffer is backed by 8 x 4K VM pages but the actual file only requires 5 x 4K pages. This API keeps all 8 VM pages valid (the arithmetic is worked through below). It also handles zeroing out portions of the buffer after truncation and zero-extending portions of the buffer after a file extension.

  NFS has been migrated to the new API. HAMMER will soon follow. UFS and EXT2FS are harder due to their far more complex buffer cache sizing operations (owing to their fragment vs full-sized block handling).

* Remodel the NFS client to use the new API. This allows NFS to consolidate all truncation and extension operations into nfs_meta_setsize(), including all code which previously had to deal with special buffer cache / VM cases related to truncation and extension.

* Fix a bug in kern/vfs_bio.c where NFS buffers requiring the clearing of B_NEEDCOMMIT failed to also clear B_CLUSTEROK, leading to occasional attempts by NFS to issue RPCs larger than the NFS I/O block size (resulting in a panic).

* NFS now uses vop_stdgetpages() and vop_stdputpages(). The NFS-specific nfs_getpages() and nfs_putpages() have been removed. Remove a vinvalbuf() in the nfs_bioread() code on remote-directory modification which was deadlocking getpages. This needs more work.

* Simplify the local-vs-remote modification tests in NFS. This needs more work. What was happening, generally, was that the larger number of RPCs in flight allowed by the NFS client was creating too much confusion in the attribute feedback in the RPC replies, causing the NFS client to lose track of the file's actual size during heavy modifying operations (aka fsx tests).
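The arithmetic from the NFS example above, as a small program; the page and NFS block sizes match the values quoted in the commit:

    #include <stdio.h>

    #define PAGE_SIZE 4096
    #define NFS_BSIZE 32768

    int
    main(void)
    {
        long filesize   = 16385;
        long file_pages = (filesize + PAGE_SIZE - 1) / PAGE_SIZE;  /* 5 */
        long buf_pages  = NFS_BSIZE / PAGE_SIZE;                   /* 8 */

        printf("file needs %ld pages; the straddling buffer keeps all "
               "%ld valid\n", file_pages, buf_pages);
        return 0;
    }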