kernel - Add per-process capability-based restrictions * This new system allows userland to set capability restrictions that turn off numerous kernel features and root accesses. These restrictions are inherited by sub-processes recursively. Once set, restrictions cannot be removed. Basic restrictions that mimic an unadorned jail can be enabled without creating a jail, but generally speaking real security also requires creating a chrooted filesystem topology, and a jail is still needed to really segregate processes from each other. If you do so, however, you can (for example) disable mount/umount and most global root-only features. * Add new system calls and a manual page for syscap_get(2) and syscap_set(2) * Add sys/caps.h * Add the "setcaps" userland utility and manual page. * Remove priv.9 and the priv_check infrastructure, replacing it with a newly designed caps infrastructure. * The intention is to add path restriction lists and similar features to improve jailless security in the near future, and to optimize the priv_check code.
kernel - Fix /dev/fd/N and clean up the old dup error-code-driven path * When opening /dev/fd/N, replicate the file pointer for descriptors that represent vnodes instead of dup()ing. This ensures that the seek offset and other fp-related elements are not shared unexpectedly. * Refactor the open() path to allow dev_dopen() to replace the struct file by passing a struct file ** instead of a struct file *. This removes old error-code-based hacks. * This fixes the shared seek position that fexecve() was operating with due to its use of /dev/fd/N for scripts. Reported-by: aly
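The shared-offset behavior described above can be seen from userland with ordinary POSIX calls. The following is a minimal sketch, not the kernel change itself: the helper name offset_seen_after_seek() is hypothetical. It seeks one descriptor and reports the offset visible through another. A dup()ed descriptor shares the open file description and so sees the new offset, while a descriptor obtained from a separate open() keeps its own.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: seek `fd` to `pos`, then report the current
 * offset as seen through `other`.  If both descriptors refer to the
 * same open file description (as after dup()), the result is `pos`.
 */
static off_t
offset_seen_after_seek(int fd, int other, off_t pos)
{
	off_t r = lseek(fd, pos, SEEK_SET);

	assert(r == pos);
	return lseek(other, 0, SEEK_CUR);
}
```

With a dup()ed pair the helper returns the position just set. With two independent opens of the same file it returns whatever offset the second descriptor already had.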
kernel - Add PROC_PDEATHSIG_CTL and PROC_PDEATHSIG_STATUS * Add PROC_PDEATHSIG_CTL and PROC_PDEATHSIG_STATUS to procctl(2). This follows the Linux and FreeBSD semantics. Note, however, that because the child of a fork() clears the setting, these semantics have a fork/exit race between an exiting parent and a child which has not yet set up its death wish. * Also fix a number of signal ranging checks. Requested-by: zrj
kernel - Remove P_SWAPPEDOUT flag and paging mode * This code basically no longer functions in any worthwhile or useful manner; remove it. The code harkens back to a time when machines had very little memory and had to time-share processes by actually descheduling them for long periods of time (like 20 seconds) and paging out the related memory. In modern times the chooser algorithm just doesn't work well because we can no longer assume that programs with large memory footprints can be demoted. * In modern times machines have sufficient memory to rely almost entirely on the VM fault and pageout scan. The latencies caused by fault-ins are usually sufficient to demote paging-intensive processes while allowing the machine to continue to function. If functionality needs to be added back in, it can be added back in on the fault path and not here.
kernel - Refactor sysclock_t from 32 to 64 bits * Refactor the core cpu timer API, changing sysclock_t from 32 to 64 bits. Provide a full 64-bit count from all sources. * Implement muldivu64() using gcc's 128-bit integer type. This function takes three 64-bit values, performs (a * b) / d using a 128-bit intermediate calculation, and returns a 64-bit result. Change all timer scaling functions to use this function, which effectively gives systimers the capability of handling any timeout that fits in 64 bits for the timer's resolution. * Remove TSC frequency scaling, it is no longer needed. The TSC timer is now used at its full resolution. * Use atomic_fcmpset_long() instead of a clock spinlock when updating the msb bits for hardware timer sources less than 64 bits wide. * Properly recalculate existing systimers when the clock source is changed. Existing systimers were not being recalculated, leading to the system failing to boot when time sources had radically different clock frequencies.
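The (a * b) / d calculation with a 128-bit intermediate can be sketched as follows. This is an illustration under stated assumptions, not the kernel's code: the name muldivu64_sketch() and the assert on the divisor are choices made here, and the sketch relies on the unsigned __int128 extension provided by gcc and clang on 64-bit targets.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative sketch of a 64x64/64 scaling helper.
 * The product a * b is formed in 128 bits, so it cannot overflow
 * before the division.  The quotient is truncated to 64 bits, so
 * callers must choose operands whose result fits.
 */
static inline uint64_t
muldivu64_sketch(uint64_t a, uint64_t b, uint64_t d)
{
	unsigned __int128 t;

	assert(d != 0);
	t = (unsigned __int128)a * b;
	return (uint64_t)(t / d);
}
```

A typical use would be converting a tick count between two frequencies, for example muldivu64_sketch(ticks, new_freq, old_freq), where a plain 64-bit multiply could overflow for large tick counts.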
kernel - Allow 8254 timer to be forced, clean-up user/sys/intr/idle * Allows the 8254 timer to be forced on for machines which do not support the LAPIC timer during deep-sleep. Fix an assertion that occurs in this situation. hw.i8254.intr_disable="0" * Adjust the statclock to calculate user/sys/intr/idle time properly when the clock interrupt occurs from an interrupt thread instead of from a hard interrupt. Basically when the clock interrupt occurs from an interrupt thread, we have to look at curthread->td_preempted instead of curthread. In addition RQF_INTPEND will be set across the call due to the way processing works and we have to look at the bitmask of interrupt sources instead of this bit. Reported-by: CuteLarva
kernel - Fix rare wait*() deadlock * It is possible for the kernel to deadlock two processes or process threads attempting to wait*() on the same pid. * Fix by adding a bit of magic to give ownership of the reaping operation to one of the waiters, and causing the other waiters to skip/reject that pid.
<sys/kinfo.h>: Fix legacy inclusion issues. Sadly this header was not being included properly for a long time. Make it publicly accessible and add a big NOTE on how to do it properly in future code. This makes <sys/user.h> the only other header that defines _KERNEL_STRUCTURES to solve long-term inclusion order issues. The previous variant was hiding implicit dependencies; adjust netstat(1). Any change in this header breaks a lot of ports, so try not to change any of the structs. Also make sure KERN_SIGTRAMP has public visibility. While there, remove two defines that have not been used since they were introduced in 5dfd06ac148512faf075c4e399e8485fd955578f
Add <sys/cpumask.h>. Collect and gather all scattered cpumask bits into the correct headers. This cleans up the namespace and simplifies platform handling in asm macros. The cpumask_t together with its macros is already a non-MI feature that is used in userland utilities, libraries, the kernel scheduler and syscalls. It deserves a sys/ header. Adjust syscalls.master and rerun sysent. While there, fix an issue in ports that set a POSIX environment but have an implementation that sets thread names through pthread_set_name_np().
kernel and libc - Reimplement lwp_setname*() using /dev/lpmap * Generally speaking we are implementing the features necessary to allow per-thread titling set via pthread_set_name_np() to show up in 'ps' output, and to use lpmap to make it fast. * The lwp_setname() system call now stores the title in lpmap->thread_title[]. * Implement a libc fast-path for lwp_setname() using lpmap. If called more than 10 times, libc will use lpmap for any further calls, which omits the need to make any system calls. * setproctitle() now stores the title in upmap->proc_title[] instead of replacing proc->p_args. proc->p_args is now no longer modified from its original contents. * The kernel now includes lpmap->thread_title[] in the following priority order when retrieving the process command line: (1) lpmap->thread_title[], the user-supplied thread title, if not empty; (2) upmap->proc_title[], the user-supplied process title, if not empty; (3) proc->p_args, the original process arguments (no longer modified). * Put the TID in /dev/lpmap for convenient access * Enhance the KERN_PROC_ARGS sysctl to allow the TID to be specified. The sysctl now accepts { KERN_PROC, KERN_PROC_ARGS, pid, tid } in addition to the existing { KERN_PROC, KERN_PROC_ARGS, pid } mechanism. Enhance libkvm to use the new feature. libkvm will fall-back to the old version if necessary.
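The priority order used when retrieving the command line can be sketched as a small selection function. This is only an illustration of the ordering described above: the function name pick_title_sketch() and its plain string parameters are hypothetical stand-ins for lpmap->thread_title[], upmap->proc_title[] and proc->p_args, and the kernel's actual data structures are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical sketch of the title selection order:
 *   1. thread title, if present and not empty
 *   2. process title, if present and not empty
 *   3. original process arguments
 */
static const char *
pick_title_sketch(const char *thread_title, const char *proc_title,
		  const char *p_args)
{
	assert(p_args != NULL);
	if (thread_title != NULL && thread_title[0] != '\0')
		return thread_title;
	if (proc_title != NULL && proc_title[0] != '\0')
		return proc_title;
	return p_args;
}
```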
kernel - sigblockall()/sigunblockall() support (per thread shared page) * Implement /dev/lpmap, a per-thread RW shared page between userland and the kernel. Each thread in the process will receive a unique shared page for communication with the kernel when memory-mapping /dev/lpmap and can access various variables via this map. * The current thread's TID is retained for both fork() and vfork(). Previously it was only retained for vfork(). This avoids userland code confusion for any bits and pieces that are indexed based on the TID. * Implement support for a per-thread block-all-signals feature that does not require any system calls (see next commit to libc). The functions will be called sigblockall() and sigunblockall(). The lpmap->blockallsigs variable prevents normal signals from being dispatched. They will still be queued to the LWP as per normal. The behavior is not quite that of a signal mask when dealing with global signals. The low 31 bits represent a recursion counter, allowing recursive use of the functions. The high bit (bit 31) is set by the kernel if a signal was prevented from being dispatched. When userland decrements the counter to 0 (the low 31 bits), it can check and clear bit 31 and if found to be set userland can then make a dummy 'real' system call to cause pending signals to be delivered. Synchronous TRAPs (e.g. kernel-generated SIGFPE, SIGSEGV, etc) are not affected by this feature and will still be dispatched synchronously. * PThreads is expected to unmap the mapped page upon thread exit. The kernel will force-unmap the page upon thread exit if pthreads does not. XXX needs work - currently if the page has not been faulted in the kernel has no visibility into the mapping and will not unmap it, but neither will it get confused if the address is accessed. To be fixed soon, because otherwise programs using LWP primitives instead of pthreads might not realize that libc has mapped the page.
* The TID is reset to 1 on a successful exec*() * On [v]fork(), if lpmap exists for the current thread, the kernel will copy the lpmap->blockallsigs value to the lpmap for the new thread in the new process. This way sigblock*() state is retained across the [v]fork(). This feature not only reduces code confusion in userland, it also allows [v]fork() to be implemented by the userland program in a way that ensures no signal races in either the parent or the new child process until it is ready for them. * The implementation leverages our vm_map_backing extents by having the per-thread memory mappings indexed within the lwp. This allows the lwp to remove the mappings when it exits (since not doing so would result in a wild pmap entry and kernel memory disclosure). * The implementation currently delays instantiation of the mapped page(s) and some side structures until the first fault. XXX this will have to be changed.
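The blockallsigs word layout (low 31 bits as a recursion counter, bit 31 as a kernel-set "signal was held back" flag) can be modeled in plain C. This is a userland sketch under stated assumptions, not the libc implementation: the names sketch_sigblockall(), sketch_sigunblockall() and the SKETCH_* macros are hypothetical, the word is an ordinary C11 atomic rather than a field in the mapped lpmap page, and the dummy system call is represented only by the return value.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define SKETCH_PENDING	0x80000000u	/* bit 31: set by the kernel */
#define SKETCH_COUNT	0x7fffffffu	/* low 31 bits: recursion count */

/* Enter a blocked section; may be nested. */
static void
sketch_sigblockall(_Atomic uint32_t *word)
{
	atomic_fetch_add(word, 1);
}

/*
 * Leave a blocked section.  Returns 1 when the outermost section was
 * left and the pending bit was set, meaning the caller should now make
 * a dummy system call so held-back signals get delivered.
 */
static int
sketch_sigunblockall(_Atomic uint32_t *word)
{
	uint32_t old = atomic_fetch_sub(word, 1);

	assert((old & SKETCH_COUNT) != 0);
	if ((old & SKETCH_COUNT) == 1 && (old & SKETCH_PENDING) != 0) {
		atomic_fetch_and(word, ~SKETCH_PENDING);
		return 1;
	}
	return 0;
}
```

Only the outermost unblock checks and clears bit 31, so nested sections stay cheap and the dummy system call happens at most once per outermost section.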
drm - Refactor task_struct and implement mm_struct * Change td->td_linux_task from an embedded structure to a pointer. * Add p->p_linux_mm to support tracking mm_struct's. * Change the 'current' macro to test td->td_linux_task and call a support function, linux_task_alloc(), if it is NULL. * Implement callbacks from the main kernel for thread exit and process exit to support functions that drop the td_linux_task and p_linux_mm pointers. Initialize and clear these callbacks in the module load/unload in drm_drv.c * Implement required support functions in linux_sched.c
kernel - Refactor tty_token, fix SMP performance issues * Remove most uses of tty_token in favor of per-tty tp->t_token. This is particularly important for removing bottlenecks related to PTYs, which are used all over the place. tty_token remains in a few places managing overall registration and global list manipulation. * tty structures are now required to be persistent. Implement a separate ttyinit() function. Continue to allow ttyregister() and ttyunregister() calls, but these no longer presume destruction of the structure. * Refactor ttymalloc() to take a **tty pointer and interlock allocations. Allocations are intended to be one-time. ttymalloc() only requires the tty_token for initial allocations. * Remove all critical section use that was combined with tty_token and tp->t_token. Leave only the tokens. The critical sections were hold-overs going all the way back to pre-SMP days. * syscons now gets its own token, vga_token. The ISA VGA code and the framebuffer code also now use this token instead of tty_token. * The keyboard subsystem now uses kbd_token instead of tty_token. * A few remaining serial-like devices (snp, nmdm) also get their own tokens, as well as use the now required tp->t_token. * Remove use of tty_token in the session management code. This fixes a niggling performance path since sessions almost universally go hand-in-hand with fork/exec/exit sequences. Instead we use the already-existing per-hash session token.
kernel - Remove SMP bottlenecks on uidinfo, descriptors, and lockf * Use an eventcounter and the per-thread fd cache to fix bottlenecks in checkfdclosed(). This will work well for the vast majority of applications and test benches. * Batch holdfp*() operations on kqueue collections when implementing poll() and select(). This significantly improves performance. Full scaling not yet achieved, however. * Increase copyin item batching from 8 to 32 for select() and poll(). * Give the uidinfo structure a pcpu array to hold the posixlocks and openfiles count fields, with a rollup contained in the uidinfo structure itself. This removes numerous global bottlenecks related to open(), close(), dup*(), and lockf operations (posixlocks count). ui_openfiles will force a rollup on limit reached to be sure that the limit was actually reached. ui_posixlocks stays fairly loose. Each cpu rolls up generally only when the pcpu count exceeds +32 or goes below -32. * Give the proc structure a pcpu array for the same counts, in order to properly support seteuid() and such. * Replace P_ADVLOCK with a char field proc->p_advlock_flag, and remove token operations around the field.
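The pcpu counting scheme with a loose rollup can be sketched in a few lines. This is a single-threaded model of the idea only: the struct and function names, the fixed SKETCH_NCPUS, and the absence of atomics or per-cpu addressing are all simplifications assumed here and do not reflect the kernel's actual uidinfo layout.

```c
#include <assert.h>

#define SKETCH_NCPUS	4	/* assumed cpu count for the model */
#define SKETCH_SLOP	32	/* roll up beyond +32 / -32 */

struct sketch_uidinfo {
	long	rollup;			/* shared, approximate total */
	int	pcpu[SKETCH_NCPUS];	/* cheap per-cpu deltas */
};

/* Adjust the count on one cpu; touch the shared total only rarely. */
static void
sketch_adjust(struct sketch_uidinfo *ui, int cpu, int delta)
{
	assert(cpu >= 0 && cpu < SKETCH_NCPUS);
	ui->pcpu[cpu] += delta;
	if (ui->pcpu[cpu] > SKETCH_SLOP || ui->pcpu[cpu] < -SKETCH_SLOP) {
		ui->rollup += ui->pcpu[cpu];
		ui->pcpu[cpu] = 0;
	}
}

/* Force a full rollup, e.g. before deciding a limit was really hit. */
static long
sketch_exact(struct sketch_uidinfo *ui)
{
	int i;

	for (i = 0; i < SKETCH_NCPUS; ++i) {
		ui->rollup += ui->pcpu[i];
		ui->pcpu[i] = 0;
	}
	return ui->rollup;
}
```

The shared total can lag by up to 32 per cpu in either direction, which is why a limit check that appears to fail is followed by a forced rollup before the failure is reported.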