kernel - Add per-process capability-based restrictions

* This new system allows userland to set capability restrictions which
  turn off numerous kernel features and root accesses. These
  restrictions are inherited by sub-processes recursively. Once set,
  restrictions cannot be removed. Basic restrictions that mimic an
  unadorned jail can be enabled without creating a jail, but generally
  speaking real security also requires creating a chrooted filesystem
  topology, and a jail is still needed to really segregate processes
  from each other. If you do so, however, you can (for example)
  disable mount/umount and most global root-only features.

* Add new system calls and a manual page for syscap_get(2) and
  syscap_set(2).

* Add sys/caps.h.

* Add the "setcaps" userland utility and manual page.

* Remove priv.9 and the priv_check infrastructure, replacing it with a
  newly designed caps infrastructure.

* The intention is to add path restriction lists and similar features
  to improve jail-less security in the near future, and to optimize
  the priv_check code.
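As a usage illustration only: a minimal sketch of restricting a
process from userland, assuming the syscap_set(2) prototype from the
new manual page and the SYSCAP_RESTRICTEDROOT constant from
sys/caps.h (verify both against the installed headers):

	/*
	 * Minimal sketch (DragonFly-only): irrevocably restrict this
	 * process and everything it forks or execs.  syscap_set()'s
	 * signature and SYSCAP_RESTRICTEDROOT are assumptions taken
	 * from the new manual page and sys/caps.h.
	 */
	#include <sys/types.h>
	#include <sys/caps.h>
	#include <stdio.h>
	#include <unistd.h>

	int
	main(void)
	{
		/* Disable most global root-only features; irreversible. */
		if (syscap_set(SYSCAP_RESTRICTEDROOT, 0, NULL, 0) < 0) {
			perror("syscap_set");
			return (1);
		}
		/* Restrictions are inherited by the exec'd shell. */
		execl("/bin/sh", "sh", (char *)NULL);
		perror("execl");
		return (1);
	}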
kernel - Refactor in-kernel system call API to remove bcopy()

* Change the in-kernel system call prototype to take the system call
  arguments as a separate pointer, and make the contents read-only:

	int sy_call_t (void *);					/* old */
	int sy_call_t (struct sysmsg *sysmsg, const void *);	/* new */

* System calls with 6 arguments or fewer no longer need to copy the
  arguments from the trapframe to a holding structure. Instead, we
  simply point into the trapframe. The L1 cache footprint will be a
  bit smaller, but in simple tests the results are not noticeably
  faster... maybe 1ns or so (roughly 1%).
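For illustration, a handler under the new ABI might look like this
sketch (sys_example and struct example_args are hypothetical names;
only the calling convention is the point here):

	/*
	 * Sketch of a handler under the new prototype.  The argument
	 * block arrives via a separate read-only pointer which may
	 * point directly into the trapframe, so it must not be
	 * modified.
	 */
	int
	sys_example(struct sysmsg *sysmsg, const struct example_args *uap)
	{
		/* Read arguments in place; no bcopy() into a holding struct. */
		int v = uap->value;

		sysmsg->sysmsg_result = v;	/* returned to userland */
		return (0);
	}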
kernel - Remove SMP bottlenecks on uidinfo, descriptors, and lockf

* Use an eventcounter and the per-thread fd cache to fix bottlenecks
  in checkfdclosed(). This will work well for the vast majority of
  applications and test benches.

* Batch holdfp*() operations on kqueue collections when implementing
  poll() and select(). This significantly improves performance. Full
  scaling is not yet achieved, however.

* Increase copyin item batching from 8 to 32 for select() and poll().

* Give the uidinfo structure a pcpu array to hold the posixlocks and
  openfiles count fields, with a rollup contained in the uidinfo
  structure itself. This removes numerous global bottlenecks related
  to open(), close(), dup*(), and lockf operations (posixlocks count).
  ui_openfiles will force a rollup on limit reached to be sure that
  the limit was actually reached. ui_posixlocks stays fairly loose.
  Each cpu generally rolls up only when the pcpu count exceeds +32 or
  goes below -32 (sketched below).

* Give the proc structure a pcpu array for the same counts, in order
  to properly support seteuid() and such.

* Replace P_ADVLOCK with a char field proc->p_advlock_flag, and remove
  token operations around the field.
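A sketch of the pcpu rollup scheme (field and helper names are
illustrative, not the actual kernel code):

	/*
	 * Each cpu accumulates into its own slot and folds the delta
	 * into the shared count only when it drifts past +/-32, so the
	 * global cache line is rarely touched.
	 */
	#define UI_PCPU_ROLLUP	32

	static void
	ui_posixlocks_adjust(struct uidinfo *uip, int delta)
	{
		/* cpu-local slot; real code must also hold off migration */
		struct ui_pcpu *pcpu = &uip->ui_pcpu[mycpuid];
		int n;

		n = (pcpu->pu_posixlocks += delta);
		if (n < -UI_PCPU_ROLLUP || n > UI_PCPU_ROLLUP) {
			pcpu->pu_posixlocks = 0;
			atomic_add_int(&uip->ui_posixlocks, n);	/* rollup */
		}
	}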
kernel - Improve uidinfo

* Improve uifind() to check td_cred for likely uids, avoiding all
  locking on hits (see the sketch below).

* Create proc0 cred's cr_uidinfo and cr_ruidinfo using uicreate().
  Creds should now never have a NULL cr_uidinfo or cr_ruidinfo, so
  also remove the conditionals that tested for NULL.

Suggested-by: __mjg
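A sketch of the lockless fast path (uihold() is the existing ref
helper; uifind_hash() stands in for the real spinlocked hash lookup):

	struct uidinfo *
	uifind(uid_t uid)
	{
		struct ucred *cr = curthread->td_ucred;
		struct uidinfo *uip;

		/*
		 * The current thread's cred usually already references
		 * the uid being looked up; take a ref without any lock,
		 * since the cred's own ref keeps the uidinfo stable.
		 */
		if (cr) {
			if ((uip = cr->cr_uidinfo) != NULL &&
			    uip->ui_uid == uid) {
				uihold(uip);
				return (uip);
			}
			if ((uip = cr->cr_ruidinfo) != NULL &&
			    uip->ui_uid == uid) {
				uihold(uip);
				return (uip);
			}
		}
		return (uifind_hash(uid));	/* slow path: hash + spinlock */
	}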
kernel - Optimize struct uidinfo

* Refactor struct uidinfo. Use atomic ops for ui_posixlocks and
  ui_proccnt; they were already being used for ui_openfiles and
  ui_ref.

* Refactor ui_ref a bit to improve the drop code, using a cute trick
  for the transition (sketched below). When we transition to 0 we
  allow ui_ref to actually go to 0, and then do an independent lookup
  of the uid with the hash table spinlock held to conditionally free
  the structure if it is still at 0. This allows us to completely
  avoid using atomic_cmpset_int(), which can be seriously inefficient
  due to races in SMP environments.

Suggested-by: mjg__
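The drop side of that trick, as a sketch (names are illustrative; the
real code re-looks up the uid in the hash rather than re-testing the
pointer):

	void
	uidrop(struct uidinfo *uip)
	{
		/*
		 * Let the count reach 0 with a plain fetchadd, then
		 * re-check under the hash spinlock, since a concurrent
		 * uifind() may have revived the entry.  No cmpset loop
		 * is needed anywhere.
		 */
		if (atomic_fetchadd_int(&uip->ui_ref, -1) == 1) {
			spin_lock(&uihash_lock);
			if (uip->ui_ref == 0) {
				LIST_REMOVE(uip, ui_hash);
				spin_unlock(&uihash_lock);
				kfree(uip, M_UIDINFO);
			} else {
				spin_unlock(&uihash_lock);
			}
		}
	}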
kernel - Break up scheduler and loadavg callout

* Change the scheduler and loadavg callouts from cpu 0 to all cpus,
  and adjust allproc_scan() and alllwp_scan() to segment the hash
  table when asked. Every cpu is now tasked with handling the nominal
  scheduler recalc and nominal load calculation for a portion of the
  process list. The portion is unrelated to which cpu(s) the processes
  are actually scheduled on; it is strictly a way to spread the work
  around, split up by hash range (sketched below).

* Significantly reduces cpu 0 stalls when a large number of user
  processes or threads are present (that is, in the tens of thousands
  or more). In the test below, before this change, cpu 0 was straining
  under 40%+ interrupt load (from the callout). After this change the
  load is spread across all cpus, approximately 1.5% per cpu.

* Tested with 400,000 running user processes on a 32-thread
  dual-socket xeon (yes, these numbers are real):

	12:27PM up 8 mins, 3 users, load avg: 395143.28, 270541.13, 132638.33
	12:33PM up 14 mins, 3 users, load avg: 399496.57, 361405.54, 225669.14

* NOTE: There are still a number of other non-segmented allproc scans
  in the system, particularly related to paging and swapping.

* NOTE: Further spreading-out of the work may be needed, by using a
  more frequent callout and a smaller hash index range for each.
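A sketch of the segmentation (allproc_scan_segmented and the callback
are hypothetical names; ALLPROC_HSIZE is the process hash size):

	/*
	 * Each cpu's callout covers only its slice of the process
	 * hash, so no single cpu ever walks the whole list.
	 */
	static void
	schedcpu_callout(void *arg)
	{
		int lo = ALLPROC_HSIZE * mycpuid / ncpus;
		int hi = ALLPROC_HSIZE * (mycpuid + 1) / ncpus;

		/* recalc priorities/loadavg for buckets [lo, hi) only */
		allproc_scan_segmented(schedcpu_stats, NULL, lo, hi);
	}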
kernel: Remove the COMPAT_43 kernel option along with all related code.

It has been commented out in our default kernel config files for
almost five years now, since 9466f37df5258f3bc3d99ae43627a71c1c085e7d.

Approved-by: dillon
Dragonfly-bug: <https://bugs.dragonflybsd.org/issues/2946>
chgsbsize: Optimize for x86_64 by avoiding the uidinfo spinlock

This kills one of the highly contended spinlocks on the accept(2)
path, and it also greatly helps the connect(2) path. With this commit,
tools/kq_connect_client could do 273Kconns/s instead of 260Kconns/s
(~5% improvement; connect(2) is still cpu bound, however).
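A sketch of the lock-free accounting (field names are illustrative;
the real code also has to handle rlimit edge cases):

	/*
	 * A 64-bit fetchadd replaces the per-uid spinlock on every
	 * socket buffer resize; overshoot is simply backed out.
	 */
	int
	chgsbsize(struct uidinfo *uip, u_long *hiwat, u_long to, rlim_t max)
	{
		long delta = (long)to - (long)*hiwat;

		if (delta > 0) {
			if (atomic_fetchadd_long(&uip->ui_sbsize, delta) +
			    delta > (long)max) {
				/* blew the limit: back out and fail */
				atomic_add_long(&uip->ui_sbsize, -delta);
				return (0);
			}
		} else {
			atomic_add_long(&uip->ui_sbsize, delta);
		}
		*hiwat = to;
		return (1);
	}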
kernel - proc_token performance cleanups

* pfind()/pfindn()/zpfind() now acquire proc_token shared.

* Fix a bug in alllwp_scan(): p->p_token must be held while scanning a
  process's lwp's (see the sketch below).

* The process list scan can use a shared token; use pfind() instead of
  pfindn() and remove proc_token for individual pid lookups.

* cwd can use a shared p->p_token.

* getgroups(), seteuid(), and numerous other uid/gid access and
  setting functions need to use p->p_token, not proc_token (Reported
  by enjolras).
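The lwp-scan interlock, as a sketch (FOREACH_LWP_IN_PROC is the
existing iterator; the callback is illustrative):

	lwkt_gettoken(&p->p_token);
	FOREACH_LWP_IN_PROC(lp, p) {
		/* the lwp list is stable while p_token is held */
		func(lp, data);
	}
	lwkt_reltoken(&p->p_token);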
kernel - Performance optimization pass

* Numerous pid- and priority-related syscalls, such as getpid(), were
  improperly acquiring proc_token to protect fields that are now
  protected with per-process or per-pgrp tokens. Do a pass on
  kern_prot.c and kern_resource.c fixing these issues. This removes
  the use of proc_token from several common system call paths, but it
  should be noted that none of these system calls are in critical
  paths. The benefit is probably minor but will improve performance in
  the face of allproc-scanning operations (such as when you do a 'ps'
  or 'top').

* vmntvnodescan() is not in the critical path except for vflush()s,
  which occur on umount. vflush() passes a NULL fast function, and
  vmntvnodescan() only needs to hold the vmobj_token when the fastfunc
  is non-NULL, so do not hold the vmobj_token when fastfunc is NULL.
  This primarily improves performance when tmpfs's are being mounted
  and unmounted at a high rate (poudriere bulk builds).
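The vmntvnodescan() change reduces to a conditional like this sketch
(heavily simplified from the real scan loop):

	if (fastfunc)
		lwkt_gettoken(&vmobj_token);
	/* ... per-vnode scan, invoking fastfunc when non-NULL ... */
	if (fastfunc)
		lwkt_reltoken(&vmobj_token);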
kernel - Major signal path adjustments to fix races, tsleep race fixes, +more

* Refactor the signal code to properly hold the lp->lwp_token, in
  particular the ksignal() and lwp_signotify() paths.

* The tsleep() path must also hold lp->lwp_token to properly handle
  lp->lwp_stat states and interlocks.

* Refactor the timeout code in tsleep() to ensure that endtsleep() is
  only called from the proper context, and fix races between
  endtsleep() and lwkt_switch().

* Rename proc->p_flag to proc->p_flags.

* Rename lwp->lwp_flag to lwp->lwp_flags.

* Add lwp->lwp_mpflags and move flags which require atomic ops (are
  adjusted when not the current thread) to the new field.

* Add td->td_mpflags and move flags which require atomic ops (are
  adjusted when not the current thread) to the new field.

* Add some freeze testing code to the x86-64 trap code (default
  disabled).
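The resulting convention, as a two-line sketch (flag names are
illustrative):

	lp->lwp_flags |= LWP_SOMEFLAG;		/* owner-only: plain RMW */
	atomic_set_int(&lp->lwp_mpflags, LWP_MP_SOMEFLAG); /* cross-cpu */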
kernel - Make numerous proc accesses use p->p_token instead of proc_token.

* pfind() and zpfind() now return a referenced proc structure; callers
  must release the proc with PRELE() and no longer need to hold
  proc_token for stable access (see the usage sketch below).

* Enhance pgrp, adding pgrp->pg_token and pgrp->pg_refs in addition to
  pgrp->pg_lock. The lock is used to interlock races between fork()
  and signals while the token and refs are used to control access.

* Add pfindn(), a version of pfind() which does not ref the returned
  proc. Some code still uses it (linux emulation) ---> needs work.

* Add pgref() and pgrel() to mess with the pgrp's pg_refs. pgrel()
  automatically destroys the pgrp when the last reference goes away.

* Most process group operations now use the per-process token instead
  of proc_token, though pgfind() still needs it temporarily.

* pgfind() now returns a referenced pgrp or NULL.

* Interlock signal handling with p->p_token instead of proc_token.

* Adjust most nice/priority functions to use the per-process token.

* Add protective PHOLD()s in various places in the signal code, the
  ptrace code, and procfs.

* Change funsetown() to take the address of the sigio pointer to match
  fsetown(), and add sanity assertions.

* pgrp's in tty sessions are now ref-counted.
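The new lookup contract, as a usage sketch:

	struct proc *p;

	if ((p = pfind(pid)) != NULL) {
		lwkt_gettoken(&p->p_token);	/* per-process, not proc_token */
		/* ... inspect or signal the process ... */
		lwkt_reltoken(&p->p_token);
		PRELE(p);			/* drop pfind()'s reference */
	}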