kernel - Add per-process capability-based restrictions

* This new system allows userland to set capability restrictions which
  turn off numerous kernel features and root accesses. These
  restrictions are inherited by sub-processes recursively. Once set,
  restrictions cannot be removed.

  Basic restrictions that mimic an unadorned jail can be enabled without
  creating a jail, but generally speaking real security also requires
  creating a chrooted filesystem topology, and a jail is still needed to
  really segregate processes from each other. If you do so, however, you
  can (for example) disable mount/umount and most global root-only
  features.

* Add new system calls and a manual page for syscap_get(2) and
  syscap_set(2).

* Add sys/caps.h.

* Add the "setcaps" userland utility and manual page.

* Remove priv.9 and the priv_check infrastructure, replacing it with a
  newly designed caps infrastructure.

* The intention is to add path restriction lists and similar features
  to improve jailless security in the near future, and to optimize the
  priv_check code.
kernel - Remove P_SWAPPEDOUT flag and paging mode

* This code basically no longer functions in any worthwhile or useful
  manner, so remove it. The code harkens back to a time when machines
  had very little memory and had to time-share processes by actually
  descheduling them for long periods of time (like 20 seconds) and
  paging out the related memory. In modern times the chooser algorithm
  just doesn't work well because we can no longer assume that programs
  with large memory footprints can be demoted.

* In modern times machines have sufficient memory to rely almost
  entirely on the VM fault and pageout scan. The latencies caused by
  fault-ins are usually sufficient to demote paging-intensive processes
  while allowing the machine to continue to function. If this
  functionality needs to be added back in, it can be added back in on
  the fault path and not here.
kernel: Remove numerous #include <sys/thread2.h>.

Most of them were added when we converted spl*() calls to
crit_enter()/crit_exit(), almost 14 years ago. We can now remove a good
chunk of them again where crit_*() is no longer used.

I had to adjust some files that were relying on thread2.h, or headers
that it includes, coming in via other headers that it was removed from.
kernel - Optimize lwp-specific signaling.

* Optimize the signal code to remove most instances of needing
  proc->p_token when lwp-specific signals are sent.

* Add a CURSIG_LCK_TRACE() macro which can now return with p_token
  held, and pass the status to postsig() which then consumes it.

* lwpsignal() now tries very hard to avoid acquiring proc->p_token.

* Significantly improves vkernel operation under heavy (vkernel) IPI
  loads.
kernel - Fix panic during coredump

* Multi-threaded coredumps were not stopping all other threads before
  attempting to scan the vm_map, resulting in numerous possible panics.

* Add a new process state, SCORE, indicating that a core dump is in
  progress, and adjust proc_stop() and friends as well as any code
  which tests the SSTOP state. SCORE overrides SSTOP.

* The coredump code actively waits for all running threads to stop
  before proceeding.

* Prevent a deadlock between a SIGKILL and a core dump in progress by
  temporarily counting the master exit thread as a stopped thread
  (which allows the coredump to proceed and finish).

Reported-by: marino
kernel - proc_token removal pass stage 1/2

* Remove proc_token use from all subsystems except kern/kern_proc.c.

* The token had become mostly useless in these subsystems now that
  process locking is more fine-grained. Do the final wipe of proc_token
  except for allproc/zombproc list use in kern_proc.c.
kernel - Attempt to make procfs MPSAFE

* pfs_pfind() now acquires the p->p_token in addition to its PHOLD().

* Replace PRELE()'s with pfs_pdone(), which releases the token along
  with the PRELE().

* Double-check the validity of nch's passed to cache_fullpath(). This
  probably still needs work.

Reported-by: swildner
kernel - Major signal path adjustments to fix races, tsleep race fixes, +more

* Refactor the signal code to properly hold the lp->lwp_token. In
  particular the ksignal() and lwp_signotify() paths.

* The tsleep() path must also hold lp->lwp_token to properly handle
  lp->lwp_stat states and interlocks.

* Refactor the timeout code in tsleep() to ensure that endtsleep() is
  only called from the proper context, and fix races between
  endtsleep() and lwkt_switch().

* Rename proc->p_flag to proc->p_flags.

* Rename lwp->lwp_flag to lwp->lwp_flags.

* Add lwp->lwp_mpflags and move flags which require atomic ops (are
  adjusted when not the current thread) to the new field.

* Add td->td_mpflags and move flags which require atomic ops (are
  adjusted when not the current thread) to the new field.

* Add some freeze testing code to the x86-64 trap code (default
  disabled).
kernel - Fix tsleep(), remove MAILBOX signals, change signalset locks for LWPs

* tsleep() was improperly calling lwkt_gettoken() and potentially
  blocking prior to sleeping, which it isn't supposed to do. This may
  have been the cause of several odd panics and corruption, though no
  smoking gun was found.

* Change access to lp->lwp_siglist to use a spinlock instead of a
  token. Add a per-LWP spinlock in addition to the per-LWP token.

* Remove MAILBOX signals (which required p->p_token). These are no
  longer used.
kernel - Make numerous proc accesses use p->p_token instead of proc_token

* pfind() and zpfind() now return a referenced proc structure; callers
  must release the proc with PRELE(). Callers no longer need to hold
  proc_token for stable access.

* Enhance pgrp, adding pgrp->pg_token and pgrp->pg_refs in addition to
  pgrp->pg_lock. The lock is used to interlock races between fork() and
  signals while the token and refs are used to control access.

* Add pfindn(), a version of pfind() which does not ref the returned
  proc. Some code still uses it (linux emulation) ---> needs work.

* Add pgref() and pgrel() to mess with the pgrp's pg_refs. pgrel()
  automatically destroys the pgrp when the last reference goes away.

* Most process group operations now use the per-process token instead
  of proc_token, though pgfind() still needs it temporarily.

* pgfind() now returns a referenced pgrp or NULL.

* Interlock signal handling with p->p_token instead of proc_token.

* Adjust most nice/priority functions to use the per-process token.

* Add protective PHOLD()s in various places in the signal code, the
  ptrace code, and procfs.

* Change funsetown() to take the address of the sigio pointer to match
  fsetown(); add sanity assertions.

* pgrp's in tty sessions are now ref-counted.
kernel - Add per-process token, adjust signal code to use it.

* Add proc->p_token and use it to interlock signal-related operations.

* Remove the use of proc_token in various signal paths. Note that
  proc_token is still used in conjunction with pfind().

* Remove the use of proc_token in CURSIG*()/issignal() sequences, which
  also removes its use in the tsleep path and the syscall path.
  p->p_token is used instead.

* Move the automatic interlock in the tsleep code to before the CURSIG
  code, fixing a rare race where a SIGCHLD could race against a parent
  process in sigsuspend(). Also acquire p->p_token here to interlock
  LWP_SINTR handling.
kernel - Change the discrete mplock into mp_token

* Use a lwkt_token for the mp_lock. This consolidates our longer-term
  spinnable locks (the mplock and tokens) into just tokens, making it
  easier to solve performance issues.

* Some refactoring of the token code was needed to guarantee the
  ordering when acquiring and releasing the mp_token vs other tokens.

* The thread switch code, lwkt_switch(), is simplified by this change,
  though not necessarily faster.

* Remove td_mpcount, mp_lock, and other related fields.

* Remove assertions related to td_mpcount and friends, generally
  folding them into similar assertions for tokens.
Rework stopping of procs.

Before, proc_stop() would sleep until all running lwps stopped. This
broke when a stop signal was actually coming from the console and was
executed in the context of the idle thread.

Now we count all sleeping threads as stopped and also set LWP_WSTOP to
indicate so. These threads will stop before returning to userland.
Running threads (including the current one) will eventually stop when
returning to userland and will increase p_nstopped. The last thread
stopping will then send a signal to the parent process.

Discussed-with: Thomas E. Spanjaard <tgen@netphreax.net>