kernel - Add per-process capability-based restrictions

* This new system allows userland to set capability restrictions which turn off numerous kernel features and root accesses. These restrictions are inherited by sub-processes recursively. Once set, restrictions cannot be removed.

  Basic restrictions that mimic an unadorned jail can be enabled without creating a jail, but generally speaking real security also requires creating a chrooted filesystem topology, and a jail is still needed to really segregate processes from each other. If you do so, however, you can (for example) disable mount/umount and most global root-only features.

* Add new system calls and a manual page for syscap_get(2) and syscap_set(2).

* Add sys/caps.h.

* Add the "setcaps" userland utility and manual page.

* Remove priv.9 and the priv_check infrastructure, replacing it with a newly designed caps infrastructure.

* The intention is to add path restriction lists and similar features to improve jail-less security in the near future, and to optimize the priv_check code.
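The key semantics above are that restrictions are set-only and inherited across fork. The following is a minimal portable toy model of that behaviour; the bit names, structures, and functions are invented for illustration and are not the actual DragonFly implementation or API.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of one-way capability restrictions.  SYSCAP_NOMOUNT and
 * SYSCAP_NOSETHOST are made-up bit names for illustration only.
 */
#define SYSCAP_NOMOUNT   0x0001  /* disallow mount/umount */
#define SYSCAP_NOSETHOST 0x0002  /* disallow sethostname  */

struct toy_proc {
    uint32_t caps;   /* restriction bits: set-only, never cleared */
};

/* Setting restrictions only ORs bits in; they can never be removed. */
static void
toy_syscap_set(struct toy_proc *p, uint32_t newcaps)
{
    p->caps |= newcaps;
}

/* Children inherit the parent's restrictions recursively. */
static void
toy_fork(const struct toy_proc *parent, struct toy_proc *child)
{
    child->caps = parent->caps;
}

/* A restricted operation checks the bit instead of just "am I root?". */
static int
toy_can_mount(const struct toy_proc *p)
{
    return (p->caps & SYSCAP_NOMOUNT) == 0;
}
```

Because toy_syscap_set() only ever ORs bits in, there is deliberately no way to widen a process's privileges back out, which is what makes the mechanism safe to expose to userland.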
kernel - Refactor in-kernel system call API to remove bcopy()

* Change the in-kernel system call prototype to take the system call arguments as a separate pointer, and make the contents read-only:

  Old: int sy_call_t (void *);
  New: int sy_call_t (struct sysmsg *sysmsg, const void *);

* System calls with 6 arguments or less no longer need to copy the arguments from the trapframe to a holding structure. Instead, we simply point into the trapframe. The L1 cache footprint will be a bit smaller, but in simple tests the results are not noticeably faster... maybe 1ns or so (roughly 1%).
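The new calling convention can be sketched with a toy dispatcher. The struct sysmsg fields, trapframe layout, and handler below are simplified stand-ins, not the kernel's actual definitions; the point is that the args pointer aims directly at the trapframe registers instead of at a bcopy()'d holding structure.

```c
#include <assert.h>

/* Simplified stand-in for the kernel's sysmsg. */
struct sysmsg {
    long sm_result;          /* return value back to userland */
};

/* Hypothetical trapframe: the first registers carry syscall args. */
struct toy_trapframe {
    long tf_args[6];
};

/* Per-syscall argument view, overlaid on the trapframe registers. */
struct add_args {
    long a;
    long b;
};

/* New-style handler: int sy_call_t(struct sysmsg *, const void *) */
static int
sys_toy_add(struct sysmsg *sysmsg, const void *uap)
{
    const struct add_args *args = uap;   /* read-only view */

    sysmsg->sm_result = args->a + args->b;
    return 0;
}

/* Dispatch: no bcopy() into a holding structure, just point at the frame. */
static int
toy_syscall(struct toy_trapframe *tf, struct sysmsg *sysmsg)
{
    return sys_toy_add(sysmsg, tf->tf_args);
}
```

Making the argument block const is what allows the kernel to hand out a pointer into the live trapframe safely: handlers can no longer scribble on it.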
libc - Implement sigblockall() and sigunblockall() (2)

* Clean up the logic a bit. Store the lwp or proc pointer in the vm_map_backing structure and make vm_map_fork() and friends more aware of it.

* Rearrange lwp allocation in [v]fork() to make the pointer(s) available to vm_fork().

* Put the thread mappings on the lwp's list immediately rather than waiting for the first fault, which means that per-thread mappings will be deterministically removed on thread exit whether any faults happened or not.

* Adjust the vmspace_fork*() functions to not propagate 'dead' lwp mappings for threads that won't exist in the forked process. Only the lwp mappings for the thread doing the [v]fork() are retained.
<sys/slaballoc.h>: Switch to lighter <sys/_malloc.h> header.

<sys/globaldata.h> embeds SLGlobalData, which in turn embeds the "struct malloc_type". Adjust several kernel sources for missing includes where memory allocation is performed. Try to use alphabetical include order; now (in most cases) <sys/malloc.h> is included after <sys/objcache.h>.

Once it gets cleaned up, the <sys/malloc.h> inclusion could be moved out of <sys/idr.h> to the drm Linux compat layer's linux/slab.h without side effects.
kernel: Remove <sys/sysref{,2}.h> inclusion from files that don't need it.

Some of the headers are public in one way or another, so bump __DragonFly_version for safety.

While here, add a missing <sys/objcache.h> include to kern_exec.c, which was previously relying on it coming in via <sys/sysref.h> (which was included by <sys/vm_map.h> prior to this commit).
kernel - Redo struct vmspace allocator and ref-count handling.

* Get rid of the sysref-based allocator and ref-count handler and replace it with objcache. Replace all sysref API calls in other kernel modules with vmspace_*() API calls (adding new API calls as needed).

* Roll our own, hopefully safer, ref-count handling. We get rid of exitingcnt and instead just leave holdcnt bumped during the exit/reap sequence. We add vm_refcnt and redo vm_holdcnt. Now a formal reference (vm_refcnt) is ALSO covered by a holdcnt. Stage-1 termination occurs when vm_refcnt transitions from 1->0. Stage-2 termination occurs when vm_holdcnt transitions from 1->0.

* Should fix a rare reported panic under heavy load.
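The two-stage termination rule can be sketched in portable C. This is a toy model of the invariant described above (every formal reference is also covered by a hold; stage-1 teardown at vm_refcnt 1->0, final free at vm_holdcnt 1->0); names and layout are illustrative, not the kernel's.

```c
#include <assert.h>
#include <stdatomic.h>

/* Toy stand-in for struct vmspace's two counters. */
struct toy_vmspace {
    atomic_int vm_refcnt;    /* formal references                */
    atomic_int vm_holdcnt;   /* holds; covers every reference    */
    int stage1_done;         /* stage-1 termination ran          */
    int freed;               /* stage-2 (final) termination ran  */
};

static void
toy_vmspace_hold(struct toy_vmspace *vm)
{
    atomic_fetch_add(&vm->vm_holdcnt, 1);
}

static void
toy_vmspace_drop(struct toy_vmspace *vm)
{
    if (atomic_fetch_sub(&vm->vm_holdcnt, 1) == 1)
        vm->freed = 1;               /* stage-2: really free */
}

static void
toy_vmspace_ref(struct toy_vmspace *vm)
{
    toy_vmspace_hold(vm);            /* a ref is ALSO a hold */
    atomic_fetch_add(&vm->vm_refcnt, 1);
}

static void
toy_vmspace_rel(struct toy_vmspace *vm)
{
    if (atomic_fetch_sub(&vm->vm_refcnt, 1) == 1)
        vm->stage1_done = 1;         /* stage-1: tear down mappings */
    toy_vmspace_drop(vm);            /* drop the covering hold */
}
```

The design point is that a third party holding only a holdcnt (e.g. procfs poking at an exiting process) can see stage-1 termination happen under it, but the structure itself cannot be freed out from under it until the last hold is dropped.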
kernel - Refactor the vmspace locking code and several use cases

* Reorder the vnode ref/rele sequence in the exec path so p_textvp is left in a more valid state while being initialized.

* Remove the vm_exitingcnt test in exec_new_vmspace(). Release various resources unconditionally on the last exiting thread regardless of the state of exitingcnt. This just moves some of the resource releases out of the wait*() system call path and back into the exit*() path.

* Implement a hold/drop mechanic for vmspaces and use it in procfs_rwmem(), vmspace_anonymous_count(), vmspace_swap_count(), and various other places. This does a better job of protecting the vmspace from deletion while various unrelated third parties might be trying to access it.

* Implement vmspace_free() for other code to call instead of trying to call sysref_put() directly. Interlock with a vmspace_hold() so final termination processing always keys off the vm_holdcount.

* Implement vm_object_allocate_hold() and use it in a few places in order to allow OBJT_SWAP objects to be allocated atomically, so other third parties (like the swapcache cleaning code) can't wiggle their way in and access a partially initialized object.

* Reorder the vmspace_terminate() code and introduce some flags to ensure that resources are terminated at the proper time and in the proper order.
cache_fullpath - Guess mountpoints if requested

* cache_fullpath() (and vn_fullpath()) now take an extra parameter, guess, which, if != 0, makes cache_fullpath() look for a matching mp if an ncp flagged as a mountpoint is found while traversing upwards. This fixes uses of *_fullpath when no nch is provided, but only a vp.

* Change all consumers of cache_fullpath() and vn_fullpath() to accommodate the extra parameter.

Suggested-by: Matthew Dillon
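The upward traversal described above can be sketched as a toy: when the walk reaches an entry flagged as a mountpoint and guess is nonzero, a mount list is searched for the entry covering that mount so the walk can continue on the other side. All structures and names below are invented for illustration; they are not the real namecache types.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_ISMOUNTPT 0x01    /* entry is the root of a mount */

/* Toy name-cache entry with an upward (parent) link. */
struct toy_ncp {
    const char     *name;
    struct toy_ncp *parent;
    int             flags;
};

/* Toy mount: maps a mount's root ncp to the ncp it covers. */
struct toy_mount {
    struct toy_ncp *mnt_root;     /* root ncp inside the mount  */
    struct toy_ncp *mnt_cover;    /* covered ncp in the parent fs */
};

/*
 * Find the next entry when walking upward.  Without guess the walk
 * dead-ends at a mountpoint root; with guess != 0 we scan the mount
 * list for a matching mount and cross over to the covered entry.
 */
static struct toy_ncp *
toy_parent(struct toy_ncp *ncp, struct toy_mount *mounts, int nmounts,
           int guess)
{
    if (ncp->parent != NULL)
        return ncp->parent;
    if ((ncp->flags & TOY_ISMOUNTPT) && guess) {
        for (int i = 0; i < nmounts; i++) {
            if (mounts[i].mnt_root == ncp)
                return mounts[i].mnt_cover;
        }
    }
    return NULL;
}
```

This is why the guess mode helps the vp-only callers: with no nch in hand there is no mount context to consult, so the matching mount has to be guessed from the mountpoint flag instead.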
kernel - Move mplock to machine-independent C

* Remove the per-platform mplock code and move it all into machine-independent code: sys/mplock2.h and kern/kern_mplock.c.

* Inline the critical path.

* When a conflict occurs kern_mplock.c will KTR log the file and line number of both the holder and the conflicting acquirer. Set debug.ktr.giant_enable=-1 to enable conflict logging.
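The conflict-logging idea can be sketched with a toy giant lock that remembers the file/line of both the holder and the last losing acquirer, with callers passing their location via a macro the way the kernel's wrappers do. This is a simplified portable model (no recursion, no KTR); every name below is invented.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Toy "giant" lock that records who holds it and who last lost a race. */
struct toy_mplock {
    atomic_int  locked;
    const char *hold_file;  int hold_line;   /* current holder   */
    const char *conf_file;  int conf_line;   /* last conflictor  */
};

static int
toy_try_mplock(struct toy_mplock *mp, const char *file, int line)
{
    int expected = 0;

    if (!atomic_compare_exchange_strong(&mp->locked, &expected, 1)) {
        mp->conf_file = file;           /* conflict: log who lost */
        mp->conf_line = line;
        return 0;
    }
    mp->hold_file = file;               /* success: log the holder */
    mp->hold_line = line;
    return 1;
}

static void
toy_rel_mplock(struct toy_mplock *mp)
{
    atomic_store(&mp->locked, 0);
}

/* Callers pass their location automatically, like a kernel macro would. */
#define TOY_TRY_MPLOCK(mp) toy_try_mplock((mp), __FILE__, __LINE__)
```

Recording both sides of a conflict is what makes the KTR log useful in practice: a stall shows not just where the spin happened but which code path was sitting on the lock at the time.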
kernel - use new td_ucred in numerous places

* Use curthread->td_ucred in numerous places, primarily system calls, where curproc->p_ucred was used before.

* Clean up local variable use related to the above.

* Adjust several places where p_ucred is replaced to properly deal with lwp threading races, to avoid accessing and freeing a potentially stale ucred.

* Adjust static procedures in the ktrace code to generally take lwp pointers instead of proc pointers.
kernel - Move MP lock inward, plus misc other stuff

* Remove the MPSAFE flag from the syscalls.master file. All system calls are now called without the MP lock held and will acquire the MP lock if necessary.

* Shift the MP lock inward. Try to leave most copyin/copyout operations outside the MP lock. Reorder some of the copyouts in the linux emulation code to suit. Kernel resource operations are MP safe. Process ucred access is now outside the MP lock but not quite MP safe yet (will be fixed in a followup).

* Remove unnecessary KKASSERT(p) calls left over from the time before system calls were prefixed with sys_*.

* Fix a bunch of cases in the linux emulation code where the ngrp range check is incorrect when setting groups.