kernel - Refactor in-kernel system call API to remove bcopy()

* Change the in-kernel system call prototype to take the system call
  arguments as a separate pointer, and make the contents read-only:

	int sy_call_t (void *);

  becomes:

	int sy_call_t (struct sysmsg *sysmsg, const void *);

* System calls with six arguments or fewer no longer need to copy the
  arguments from the trapframe to a holding structure.  Instead, we
  simply point into the trapframe.  The L1 cache footprint will be a
  bit smaller, but in simple tests the results are not noticeably
  faster... maybe 1ns or so (roughly 1%).
kernel - Implement sbrk(), change low-address mmap hinting

* Change mmap()'s internal lower address bound from dmax (32GB) to
  RLIMIT_DATA's current value.  This allows the rlimit to be reduced
  (for example) and for hinted mmap()s to then map space below the 4GB
  mark.  The default data rlimit is 32GB.

  This change is needed to support several languages, at least lua and
  probably one or two others, which use mmap hinting under the
  assumption that it can map space below the 4GB address mark.  The
  data limit must be lowered with a limit command too, which can be
  scripted or patched for such programs.

* Implement the sbrk() system call.  This system call was already
  present but just returned EOPNOTSUPP, and libc previously had its
  own shim for sbrk() which used the ancient break() system call.
  (Note that the prior implementation did not return ENOSYS or raise
  a signal).

  sbrk() in the kernel is thread-safe for positive increments and is
  also byte-granular (the old libc sbrk() was only page-granular).

  sbrk() in the kernel does not implement negative increments and will
  return EOPNOTSUPP if asked to.  Negative increments were historically
  designed to be able to 'free' memory allocated with sbrk(), but it is
  not possible to implement the case in a modern VM system due to the
  mmap changes above:

  (1) The new mmap hinting changes make it possible for normal mmap()s
      to have mapped space prior to the RLIMIT_DATA resource limit
      being increased, causing intermingling of sbrk() and user
      mmap()d regions.

  (2) Negative increments are not even remotely thread-safe.

* Note that the previous commit refactored libc to use the kernel
  sbrk() and fall back to its previous emulation code on failure, so
  libc supports both new and old kernels.

* Remove the brk() shim from libc.  brk() is not implemented by the
  kernel.  Symbol removed.  This requires testing against ports, so we
  may have to add it back in, but basically there is no way to
  implement brk() properly with the mmap() hinting fix.

* Adjust manual pages.
kernel - per-thread fd cache, p_fd lock bypass

* Implement a per-thread (fd,fp) cache.  Cache hits can keep fp's in a
  held state (avoiding the need to fhold()/fdrop() the ref count) and
  bypass the p_fd spinlock.  This allows the file pointer structure to
  generally be shared across cpu caches.

* Up to four descriptors can be cached in each thread, LRU.  This is
  the common case.  Highly threaded programs tend to focus work on
  distinct file descriptors in each thread.

* One file descriptor can be cached in up to four threads.  This is a
  significant limitation, though relatively uncommon.  On a cache miss
  the code drops into the normal shared p_fd spinlock lookup.
kernel - Improve mountlist_scan() performance, track vfs_getvfs()

* Use a shared token whenever possible, and do not hold the token
  across the callback in the mountlist_scan() call.

* vfs_getvfs() now mount_hold()s the returned mp.  The caller is
  expected to mount_drop() it when done.  This fixes a very rare race.
kernel - Fix panic during coredump

* Multi-threaded coredumps were not stopping all other threads before
  attempting to scan the vm_map, resulting in numerous possible panics.

* Add a new process state, SCORE, indicating that a core dump is in
  progress, and adjust proc_stop() and friends as well as any code
  which tests the SSTOP state.  SCORE overrides SSTOP.

* The coredump code actively waits for all running threads to stop
  before proceeding.

* Prevent a deadlock between a SIGKILL and a core dump in progress by
  temporarily counting the master exit thread as a stopped thread
  (which allows the coredump to proceed and finish).

Reported-by: marino
kernel - Major signal path adjustments to fix races, tsleep race fixes, +more

* Refactor the signal code to properly hold the lp->lwp_token, in
  particular the ksignal() and lwp_signotify() paths.

* The tsleep() path must also hold lp->lwp_token to properly handle
  lp->lwp_stat states and interlocks.

* Refactor the timeout code in tsleep() to ensure that endtsleep() is
  only called from the proper context, and fix races between
  endtsleep() and lwkt_switch().

* Rename proc->p_flag to proc->p_flags.

* Rename lwp->lwp_flag to lwp->lwp_flags.

* Add lwp->lwp_mpflags and move flags which require atomic ops
  (are adjusted when not the current thread) to the new field.

* Add td->td_mpflags and move flags which require atomic ops
  (are adjusted when not the current thread) to the new field.

* Add some freeze-testing code to the x86-64 trap code (disabled by
  default).
kernel - Add per-process token, adjust signal code to use it

* Add proc->p_token and use it to interlock signal-related operations.

* Remove the use of proc_token in various signal paths.  Note that
  proc_token is still used in conjunction with pfind().

* Remove the use of proc_token in CURSIG*()/issignal() sequences,
  which also removes its use in the tsleep path and the syscall path.
  p->p_token is used instead.

* Move the automatic interlock in the tsleep code to before the CURSIG
  code, fixing a rare race where a SIGCHLD could race against a parent
  process in sigsuspend().  Also acquire p->p_token here to interlock
  LWP_SINTR handling.
kernel - fine-grained namecache and partial vnode MPSAFE work

Namecache subsystem

* All vnode->v_flag modifications now use vsetflags() and vclrflags().
  Because some flags are set and cleared by vhold()/vdrop(), which do
  not require any locks to be held, all modifications must use atomic
  ops.

* Clean up and revamp the namecache MPSAFE work.  Namecache operations
  now use a fine-grained MPSAFE locking model which loosely follows
  these rules:

  - Lock ordering is child to parent, e.g. lock file, then lock the
    parent directory.  This allows resolver recursions up the parent
    directory chain.

  - Downward-traversing namecache invalidations and path lookups will
    unlock the parent (but leave it referenced) before attempting to
    lock the child.

  - Namecache hash table lookups utilize a per-bucket spinlock.

  - Vnode locks may be acquired while holding namecache locks, but not
    vice-versa.  Vnodes are not destroyed until all namecache
    references go away, but can enter reclamation.  Namecache lookups
    detect the case and re-resolve to overcome the race.  Namecache
    entries are not destroyed while referenced.

* Remove vfs_token; the namecache MPSAFE model is now totally
  fine-grained.

* Revamp the namecache locking primitives (cache_lock/cache_unlock and
  friends).  Use atomic ops and nc_exlocks instead of nc_locktd, and
  build in a request flag.  This solves busy/tsleep races between the
  lock holder and the lock requester.

* Revamp namecache parent/child linkages.  Instead of using vfs_token
  to lock such operations we simply lock both the child and parent
  namecache entries.  Hash table operations are also fully integrated
  with the parent/child linking operations.

* The vnode->v_namecache list is locked via vnode->v_spinlock, which is
  actually vnode->v_lock.lk_spinlock.

* Revamp cache_vref() and cache_vget().  The passed namecache entry
  must be referenced and locked.  Internals are simplified.

* Fix a deadlock by moving the call to _cache_hysteresis() to a place
  where the current thread otherwise does not hold any locked ncp's.

* Revamp nlookup() to follow the new namecache locking rules.

* Fix a number of places, e.g. in vfs/nfs/nfs_subs.c, where
  ncp->nc_parent or ncp->nc_vp was being accessed with an unlocked
  ncp.  nc_parent and nc_vp accesses are only valid if the ncp is
  locked.

* Add the vfs.cache_mpsafe sysctl, which defaults to 0.  This may be
  set to 1 to enable MPSAFE namecache operations for [l,f]stat() and
  open() system calls (for the moment).

VFS/VNODE subsystem

* Use a global spinlock, for now called vfs_spin, to manage
  vnode_free_list.  Use vnode->v_spinlock (and vfs_spin) to manage
  vhold/vdrop ops and to interlock v_auxrefs tests against vnode
  terminations.

* Integrate the per-mount mnt_token and (for now) the MP lock into
  VOP_*() and VFS_*() operations.  This allows the MP lock to be
  shifted further inward from the system calls, but we don't do that
  quite yet.

* HAMMER: VOP_GETATTR, VOP_READ, and VOP_INACTIVE are now MPSAFE.  The
  corresponding sysctls have been removed.

* FIFOFS: Needed some MPSAFE work in order to allow HAMMER to make
  things MPSAFE above, since HAMMER forwards vops for in-filesystem
  fifos to fifofs.

* Add some debugging kprintf()s when certain MP races are averted, for
  testing only.

MISC

* Add some assertions to the VM system.

* Document existing and newly MPSAFE code.
kernel - Move mplock to machine-independent C

* Remove the per-platform mplock code and move it all into
  machine-independent code: sys/mplock2.h and kern/kern_mplock.c.

* Inline the critical path.

* When a conflict occurs kern_mplock.c will KTR-log the file and line
  number of both the holder and the conflicting acquirer.  Set
  debug.ktr.giant_enable=-1 to enable conflict logging.
kernel - adjust falloc and arguments to dupfdopen, fsetfd, fdcheckstd

* Make changes to the pointer type passed (proc, lwp, filedesc) to
  numerous routines.

* falloc() needs access to td_ucred (it was previously using p_ucred,
  which is not MPSAFE).

* Adjust fsetfd() to make it conform to the other fsetfd*() procedures.

* Make related changes to fdcheckstd() and dupfdopen().
kernel - use new td_ucred in numerous places

* Use curthread->td_ucred in numerous places, primarily system calls,
  where curproc->p_ucred was used before.

* Clean up local variable use related to the above.

* Adjust several places where p_ucred is replaced to properly deal
  with lwp threading races, to avoid accessing and freeing a
  potentially stale ucred.

* Adjust static procedures in the ktrace code to generally take lwp
  pointers instead of proc pointers.
kernel - Move MP lock inward, plus misc other stuff

* Remove the MPSAFE flag from the syscalls.master file.  All system
  calls are now called without the MP lock held and will acquire the
  MP lock if necessary.

* Shift the MP lock inward.  Try to leave most copyin/copyout
  operations outside the MP lock.  Reorder some of the copyouts in the
  linux emulation code to suit.  Kernel resource operations are MP
  safe.  Process ucred access is now outside the MP lock but not quite
  MP safe yet (will be fixed in a follow-up).

* Remove unnecessary KKASSERT(p) calls left over from the time before
  system calls were prefixed with sys_*.

* Fix a bunch of cases in the linux emulation code when setting groups
  where the ngrp range check is incorrect.