kernel - Add per-process capability-based restrictions * This new system allows userland to set capability restrictions which turns off numerous kernel features and root accesses. These restrictions are inherited by sub-processes recursively. Once set, restrictions cannot be removed. Basic restrictions that mimic an unadorned jail can be enabled without creating a jail, but generally speaking real security also requires creating a chrooted filesystem topology, and a jail is still needed to really segregate processes from each other. If you do so, however, you can (for example) disable mount/umount and most global root-only features. * Add new system calls and a manual page for syscap_get(2) and syscap_set(2) * Add sys/caps.h * Add the "setcaps" userland utility and manual page. * Remove priv.9 and the priv_check infrastructure, replacing it with a newly designed caps infrastructure. * The intention is to add path restriction lists and similar features to improve jailess security in the near future, and to optimize the priv_check code.
kernel - Refactor in-kernel system call API to remove bcopy() * Change the in-kernel system call prototype to take the system call arguments as a separate pointer, and make the contents read-only. int sy_call_t (void *); int sy_call_t (struct sysmsg *sysmsg, const void *); * System calls with 6 arguments or less no longer need to copy the arguments from the trapframe to a holding structure. Instead, we simply point into the trapframe. The L1 cache footprint will be a bit smaller, but in simple tests the results are not noticably faster... maybe 1ns or so (roughly 1%).
Rename some functions to better names. devfs_find_device_by_udev() -> devfs_find_device_by_devid() dev2udev() -> devid_from_dev() udev2dev() -> dev_from_devid() This fits with the rest of the code. 'dev' usually means a cdev_t, such as in make_dev(), etc. Instead of 'udev', use 'devid', since that's what dev_t is, a "Device ID".
kernel - Major signal path adjustments to fix races, tsleep race fixes, +more * Refactor the signal code to properly hold the lp->lwp_token. In particular the ksignal() and lwp_signotify() paths. * The tsleep() path must also hold lp->lwp_token to properly handle lp->lwp_stat states and interlocks. * Refactor the timeout code in tsleep() to ensure that endtsleep() is only called from the proper context, and fix races between endtsleep() and lwkt_switch(). * Rename proc->p_flag to proc->p_flags * Rename lwp->lwp_flag to lwp->lwp_flags * Add lwp->lwp_mpflags and move flags which require atomic ops (are adjusted when not the current thread) to the new field. * Add td->td_mpflags and move flags which require atomic ops (are adjusted when not the current thread) to the new field. * Add some freeze testing code to the x86-64 trap code (default disabled).
kernel - Move mplock to machine-independent C * Remove the per-platform mplock code and move it all into machine-independent code: sys/mplock2.h and kern/kern_mplock.c. * Inline the critical path. * When a conflict occurs kern_mplock.c will KTR log the file and line number of both the holder and conflicting acquirer. Set debug.ktr.giant_enable=-1 to enable conflict logging.
kernel - Move MP lock inward, plus misc other stuff * Remove the MPSAFE flag from the syscalls.master file. All system calls are now called without the MP lock held and will acquire the MP lock if necessary. * Shift the MP lock inward. Try to leave most copyin/copyout operations outside the MP lock. Reorder some of the copyouts in the linux emulation code to suit. Kernel resource operations are MP safe. Process ucred access is now outside the MP lock but not quite MP safe yet (will be fixed in a followup). * Remove unnecessary KKASSERT(p) calls left over from the time before system calls where prefixed with sys_* * Fix a bunch of cases in the linux emulation code when setting groups where the ngrp range check is incorrect.
1:1 Userland threading stage 2.10/4: Separate p_stats into p_ru and lwp_ru. proc.p_ru keeps track of all statistics directly related to a proc. This consists of RSS usage and nswap information and aggregate numbers for all former lwps of this proc. proc.p_cru is the sum of all stats of reaped children. lwp.lwp_ru contains the stats directly related to one specific lwp, meaning packet, scheduler switch or page fault counts, etc. This information gets added to lwp.lwp_proc.p_ru when the lwp exits.
VNode sequencing and locking - part 3/4. VNode aliasing is handled by the namecache (aka nullfs), so there is no longer a need to have VOP_LOCK, VOP_UNLOCK, or VOP_ISSLOCKED as 'VOP' functions. Both NFS and DEADFS have been using standard locking functions for some time and are no longer special cases. Replace all uses with native calls to vn_lock, vn_unlock, and vn_islocked. We can't have these as VOP functions anyhow because of the introduction of the new SYSLINK transport layer, since vnode locks are primarily used to protect the local vnode structure itself.