KERN_PROC - Change behavior and bump version to 600302

* Change the default behavior to not include pure LWPs, i.e. to not
  include pure kernel threads without a process (pid returned as -1).

* Add a flag, KERN_PROC_FLAG_LWKT, to re-include the LWPs for programs
  that don't get confused by them.

* Adjust /bin/ps and /usr/bin/top to use the flag.  Also conditionalize
  on the existence of the flag so buildworld on older systems doesn't
  fail.

* Clean up the sysctl kernel interface for KERN_PROC a bit, since
  adding the flag creates many more combinations that need to be
  handled as discrete sysctls.
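The default filtering can be sketched as follows.  This is an illustrative mock (stand-in types and flag name modeled on the commit, not the actual kernel code): entries whose pid is -1 (pure kernel LWPs) are dropped unless the caller passes the re-include flag.

```c
#include <assert.h>

#define FLAG_LWKT 0x1           /* stand-in for KERN_PROC_FLAG_LWKT */

struct kinfo { int pid; };      /* stand-in for struct kinfo_proc */

/*
 * Copy entries from 'in' to 'out', skipping pid == -1 entries unless
 * FLAG_LWKT is set.  Returns the number of entries kept.
 */
static int
filter_procs(const struct kinfo *in, int n, struct kinfo *out, int flags)
{
        int kept = 0;

        for (int i = 0; i < n; ++i) {
                if (in[i].pid == -1 && !(flags & FLAG_LWKT))
                        continue;       /* hide pure kernel threads */
                out[kept++] = in[i];
        }
        return kept;
}
```

With the flag clear (the new default), consumers such as ps never see the pid -1 entries; passing the flag restores the old behavior for programs that can handle them.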
kernel - Fix kernel crash in sysctl path

* Fix a kernel crash which can occur in a particular sysctl due to some
  processes not having a p_textnch path.  The sysctl code was assuming
  that p->p_textnch would always be valid.  procfs already has the
  needed check.

* Fix a race against exit by requiring that proc->p_token be held.

Reported-by: htop devs, BenBE, cgzones
kernel - Fix rare wait*() deadlock

* It is possible for the kernel to deadlock two processes or process
  threads attempting to wait*() on the same pid.

* Fix by adding a bit of magic to give ownership of the reaping
  operation to one of the waiters, causing the other waiters to
  skip/reject that pid.
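The ownership "magic" can be sketched with a single atomic claim flag (an illustrative mock, not the kernel's actual implementation): exactly one waiter wins the compare-and-swap and performs the reap, while every other waiter sees the pid as claimed and skips it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for the per-process state a wait*() caller examines. */
struct zombie {
        atomic_bool reap_claimed;
};

/*
 * Attempt to take ownership of the reap.  Returns true for exactly
 * one caller; all other callers get false and must skip this pid.
 */
static bool
claim_reap(struct zombie *z)
{
        bool expected = false;

        return atomic_compare_exchange_strong(&z->reap_claimed,
                                              &expected, true);
}
```

Because the compare-and-swap can succeed for only one caller, two waiters can no longer each block waiting for the other to finish reaping the same pid.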
kernel - Rejigger random number generator to be per-cpu 1/2

* Refactor all the kernel random number generation code to operate on a
  per-cpu basis.  The csprng, ibaa, and l15 structures are now per-cpu.

* RDRAND now runs a periodic timer callback on all available cpus
  rather than just on cpu 0, allowing rdrand data to mix into each
  cpu's rng independently.

* The nrandom helper thread now chains state with an iteration between
  cpus, injecting a random data buffer generated from the previous cpu
  into the mix of the current one.
Rename some functions to better names.

devfs_find_device_by_udev() -> devfs_find_device_by_devid()
dev2udev()                  -> devid_from_dev()
udev2dev()                  -> dev_from_devid()

This fits with the rest of the code.  'dev' usually means a cdev_t,
such as in make_dev(), etc.  Instead of 'udev', use 'devid', since
that's what a dev_t is: a "Device ID".
<sys/types.h>: Get rid of udev_t.

In a time long long ago, dev_t was a pointer, which later became cdev_t
during the great cleanups, until it ended up being a uint32_t, just
like udev_t.  See, for example, the definition of __dev_t in
<sys/stat.h>.

This commit cleans up further by removing the udev_t type, leaving just
the POSIX dev_t type for both kernel and userland.  Put it inside a
_DEV_T_DECLARED guard to prepare for further cleanups in <sys/stat.h>.
kernel and libc - Reimplement lwp_setname*() using /dev/lpmap

* Generally speaking, we are implementing the features necessary to
  allow per-thread titles set via pthread_set_name_np() to show up in
  'ps' output, and to use lpmap to make it fast.

* The lwp_setname() system call now stores the title in
  lpmap->thread_title[].

* Implement a libc fast-path for lwp_setname() using lpmap.  If called
  more than 10 times, libc will use lpmap for any further calls, which
  removes the need to make any system calls.

* setproctitle() now stores the title in upmap->proc_title[] instead of
  replacing proc->p_args.  proc->p_args is now no longer modified from
  its original contents.

* The kernel now consults the following priority order when retrieving
  the process command line:

  lpmap->thread_title[]   User-supplied thread title, if not empty
  upmap->proc_title[]     User-supplied process title, if not empty
  proc->p_args            Original process arguments (no longer modified)

* Put the TID in /dev/lpmap for convenient access.

* Enhance the KERN_PROC_ARGS sysctl to allow the TID to be specified.
  The sysctl now accepts { KERN_PROC, KERN_PROC_ARGS, pid, tid } in
  addition to the existing { KERN_PROC, KERN_PROC_ARGS, pid }
  mechanism.  Enhance libkvm to use the new feature.  libkvm will fall
  back to the old version if necessary.
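The priority order above amounts to "first non-empty string wins".  A minimal sketch of that selection (illustrative only; the parameter names mirror the fields from the commit, not actual kernel code):

```c
#include <assert.h>
#include <string.h>

/*
 * Return the command line to report, using the first non-empty
 * candidate in priority order.
 */
static const char *
pick_cmdline(const char *thread_title,   /* like lpmap->thread_title[] */
             const char *proc_title,     /* like upmap->proc_title[] */
             const char *p_args)         /* original process arguments */
{
        if (thread_title != NULL && thread_title[0] != '\0')
                return thread_title;
        if (proc_title != NULL && proc_title[0] != '\0')
                return proc_title;
        return p_args;
}
```

A thread that never calls lwp_setname() leaves its title empty, so ps falls through to the process title, and finally to the now-immutable original arguments.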
libc - Implement sigblockall() and sigunblockall() (2)

* Clean up the logic a bit.  Store the lwp or proc pointer in the
  vm_map_backing structure and make vm_map_fork() and friends more
  aware of it.

* Rearrange lwp allocation in [v]fork() to make the pointer(s)
  available to vm_fork().

* Put the thread mappings on the lwp's list immediately rather than
  waiting for the first fault, which means that per-thread mappings
  will be deterministically removed on thread exit whether any faults
  happened or not.

* Adjust the vmspace_fork*() functions to not propagate 'dead' lwp
  mappings for threads that won't exist in the forked process.  Only
  the lwp mappings for the thread doing the [v]fork() are retained.
kernel - sigblockall()/sigunblockall() support (per thread shared page) * Implement /dev/lpmap, a per-thread RW shared page between userland and the kernel. Each thread in the process will receive a unique shared page for communication with the kernel when memory-mapping /dev/lpmap and can access varous variables via this map. * The current thread's TID is retained for both fork() and vfork(). Previously it was only retained for vfork(). This avoids userland code confusion for any bits and pieces that are indexed based on the TID. * Implement support for a per-thread block-all-signals feature that does not require any system calls (see next commit to libc). The functions will be called sigblockall() and sigunblockall(). The lpmap->blockallsigs variable prevents normal signals from being dispatched. They will still be queued to the LWP as per normal. The behavior is not quite that of a signal mask when dealing with global signals. The low 31 bits represents a recursion counter, allowing recursive use of the functions. The high bit (bit 31) is set by the kernel if a signal was prevented from being dispatched. When userland decrements the counter to 0 (the low 31 bits), it can check and clear bit 31 and if found to be set userland can then make a dummy 'real' system call to cause pending signals to be delivered. Synchronous TRAPs (e.g. kernel-generated SIGFPE, SIGSEGV, etc) are not affected by this feature and will still be dispatched synchronously. * PThreads is expected to unmap the mapped page upon thread exit. The kernel will force-unmap the page upon thread exit if pthreads does not. XXX needs work - currently if the page has not been faulted in the kernel has no visbility into the mapping and will not unmap it, but neither will it get confused if the address is accessed. To be fixed soon. Because if we don't, programs using LWP primitives instead of pthreads might not realize that libc has mapped the page. 
* The TID is reset to 1 on a successful exec*() * On [v]fork(), if lpmap exists for the current thread, the kernel will copy the lpmap->blockallsigs value to the lpmap for the new thread in the new process. This way sigblock*() state is retained across the [v]fork(). This feature not only reduces code confusion in userland, it also allows [v]fork() to be implemented by the userland program in a way that ensures no signal races in either the parent or the new child process until it is ready for them. * The implementation leverages our vm_map_backing extents by having the per-thread memory mappings indexed within the lwp. This allows the lwp to remove the mappings when it exits (since not doing so would result in a wild pmap entry and kernel memory disclosure). * The implementation currently delays instantiation of the mapped page(s) and some side structures until the first fault. XXX this will have to be changed.
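The counter/flag protocol can be sketched in userland terms as follows.  This is an illustrative sketch of the bit layout described in the commit (low 31 bits are a recursion counter, bit 31 is the kernel's "signal was held back" flag), not the actual libc code, and it assumes balanced block/unblock nesting; the real implementation would use atomic operations on the shared lpmap page.

```c
#include <assert.h>
#include <stdint.h>

#define BAS_PENDING 0x80000000u /* bit 31: kernel deferred a signal */

static uint32_t blockallsigs;   /* stands in for lpmap->blockallsigs */

static void
sigblockall_sketch(void)
{
        ++blockallsigs;         /* bump recursion counter (low 31 bits) */
}

/*
 * Returns 1 if the caller must make a dummy 'real' system call to
 * flush pending signals, 0 otherwise.
 */
static int
sigunblockall_sketch(void)
{
        --blockallsigs;         /* counter > 0, so no borrow into bit 31 */
        if ((blockallsigs & ~BAS_PENDING) == 0 &&
            (blockallsigs & BAS_PENDING) != 0) {
                blockallsigs &= ~BAS_PENDING;   /* clear bit 31 */
                return 1;       /* caller: issue dummy syscall now */
        }
        return 0;
}
```

Note that only the outermost sigunblockall() ever triggers the flush, which is what makes the recursive usage cheap: inner pairs touch nothing but the shared-page counter.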
kernel - Change fill_kinfo_lwp() and fix top

* Change fill_kinfo_lwp(), an internal function used by kern_proc.c
  and libkvm, to aggregate lwp data instead of replacing it.  Note
  that fill_kinfo_proc() zeroes the lwp sub-structure and is already
  typically called before zero or more fill_kinfo_lwp() calls, so the
  new aggregation essentially just works even though the API is a bit
  different.

  In addition, when getprocs is told to aggregate lwps, the tid field
  will be set to -1 since it is not applicable in the aggregation
  case.

* 'top' will now properly aggregate the threads belonging to a process
  when thread mode 'H' is not in effect.

* Also allow top to display cpu percentages above 100%, since in the
  aggregation case the sum of the threads can easily exceed 100% of
  one core.

Requested-by: hsw
kernel - Refactor tty_token, fix SMP performance issues

* Remove most uses of tty_token in favor of the per-tty tp->t_token.
  This is particularly important for removing bottlenecks related to
  PTYs, which are used all over the place.  tty_token remains in a few
  places managing overall registration and global list manipulation.

* tty structures are now required to be persistent.  Implement a
  separate ttyinit() function.  Continue to allow ttyregister() and
  ttyunregister() calls, but these no longer presume destruction of
  the structure.

* Refactor ttymalloc() to take a **tty pointer and interlock
  allocations.  Allocations are intended to be one-time.  ttymalloc()
  only requires tty_token for initial allocations.

* Remove all critical-section use that was combined with tty_token and
  tp->t_token, leaving only the tokens.  The critical sections were
  hold-overs going all the way back to pre-SMP days.

* syscons now gets its own token, vga_token.  The ISA VGA code and the
  framebuffer code also now use this token instead of tty_token.

* The keyboard subsystem now uses kbd_token instead of tty_token.

* A few remaining serial-like devices (snp, nmdm) also get their own
  tokens, as well as using the now-required tp->t_token.

* Remove use of tty_token in the session management code.  This fixes
  a niggling performance path, since sessions almost universally go
  hand-in-hand with fork/exec/exit sequences.  Instead we use the
  already-existing per-hash session token.
system - Add wait6(), waitid(), and si_pid/si_uid siginfo support

* Add the wait6() system call (header definitions taken from FreeBSD).
  This required rearranging kern_wait() a bit.  In particular, we now
  maintain a hold count of 1 on the process during processing instead
  of releasing the hold count early.

* Add waitid() to libc (waitid.c taken from FreeBSD).

* Adjust manual pages (taken from FreeBSD).

* Add siginfo si_pid and si_uid support.  This basically allows a
  process taking a signal to determine where the signal came from.
  The fields already existed in siginfo but were not implemented.

  Implemented using a non-queued per-process array of signal numbers.
  The last originator sending any given signal is recorded and passed
  through to userland in the siginfo.

* Fixes the 'lightdm' X display manager, which relies on si_pid
  support.  In addition, note that avoiding long lightdm-related
  latencies and timeouts requires a softlink from libmozjs-52.so to
  libmozjs-52.so.0 (must be addressed in dports, not in this commit).

Loosely-taken-from: FreeBSD (wait6, waitid support only)
Reviewed-by: swildner
kernel - Make certain sysctls unlocked

* Automatically flag all SYSCTL_[U]INT, [U]LONG, and [U]QUAD
  definitions CTLFLAG_NOLOCK.  These do not have to be locked.  Will
  improve program startup performance a tad.

* Flag a ton of other sysctls used in program startup, and also in
  'ps', CTLFLAG_NOLOCK.

* For kern.hostname, interlock changes using XLOCK and allow the
  sysctl to run NOLOCK, avoiding unnecessary cache-line bouncing.
kernel - Fix rare allproc scan vs p_ucred race

* This race can occur because p->p_ucred can change out from under an
  allproc scan when the scan is filtering based on credentials.

* Access p->p_ucred via the per-process spinlock (p->p_spin).  Also
  maintain a cache of the last ucred seen during the loop in order to
  avoid having to spin-lock every process.

* Add a missing spinlock around p->p_ucred = NULL in exit1().  This is
  also only applicable to races against allproc scans, since p_token
  is held during exit1().

Reported-by: mjg_
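The last-ucred cache can be sketched as below (mock types and an instrumentation counter for illustration, not the kernel code): the scan only takes the per-process spinlock when a process's cred pointer differs from the one validated last, so a run of processes sharing one ucred costs a single lock acquisition.

```c
#include <assert.h>
#include <stddef.h>

struct ucred { int uid; };                  /* stand-in for struct ucred */
struct proc  { struct ucred *p_ucred; };    /* stand-in for struct proc */

static int spinlocks_taken;                 /* instrumentation only */

/* Count procs whose credential uid matches 'uid'. */
static int
scan_count_uid(const struct proc *procs, int nprocs, int uid)
{
        struct ucred *cached = NULL;        /* last cred we validated */
        int cached_match = 0;
        int count = 0;

        for (int i = 0; i < nprocs; ++i) {
                struct ucred *cr = procs[i].p_ucred;

                if (cr == NULL)
                        continue;           /* process is exiting */
                if (cr != cached) {
                        /* real code: hold p->p_spin while reading */
                        ++spinlocks_taken;
                        cached = cr;
                        cached_match = (cr->uid == uid);
                }
                count += cached_match;
        }
        return count;
}
```

Since many processes (e.g. all of one user's shells) share a single ucred, the pointer-equality check short-circuits most iterations.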
kernel - Break up scheduler and loadavg callout

* Change the scheduler and loadavg callouts from cpu 0 to all cpus, and
  adjust allproc_scan() and alllwp_scan() to segment the hash table
  when asked.

  Every cpu is now tasked with handling the nominal scheduler recalc
  and nominal load calculation for a portion of the process list.  The
  portion is unrelated to which cpu(s) the processes are actually
  scheduled on; it is strictly a way to spread the work around, split
  up by hash range.

* Significantly reduces cpu 0 stalls when a large number of user
  processes or threads are present (that is, in the tens of thousands
  or more).  In the test below, before this change, cpu 0 was
  straining under 40%+ interrupt load (from the callout).  After this
  change the load is spread across all cpus, approximately 1.5% per
  cpu.

* Tested with 400,000 running user processes on a 32-thread
  dual-socket xeon (yes, these numbers are real):

  12:27PM  up 8 mins,  3 users, load avg: 395143.28, 270541.13, 132638.33
  12:33PM  up 14 mins, 3 users, load avg: 399496.57, 361405.54, 225669.14

* NOTE: There are still a number of other non-segmented allproc scans
  in the system, particularly related to paging and swapping.

* NOTE: Further spreading-out of the work may be needed, using a more
  frequent callout and a smaller hash index range for each.
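The hash-range segmentation can be sketched as follows (an illustrative sketch with an assumed bucket count, not the kernel code): each cpu computes its own half-open slice of the hash range, and together the slices cover every bucket exactly once.

```c
#include <assert.h>

#define NBUCKETS 64     /* assumed hash table size for the sketch */

/*
 * Compute the half-open bucket range [*lo, *hi) that cpu 'cpuid' of
 * 'ncpus' should scan during its callout.
 */
static void
scan_segment(int cpuid, int ncpus, int *lo, int *hi)
{
        int per = (NBUCKETS + ncpus - 1) / ncpus;   /* ceiling division */

        *lo = cpuid * per;
        *hi = (*lo + per > NBUCKETS) ? NBUCKETS : *lo + per;
}
```

Since the slice is a function only of (cpuid, ncpus), no coordination between the per-cpu callouts is needed; shrinking the per-callout slice while raising the callout frequency (the second NOTE above) only changes the arithmetic, not the structure.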