kernel - Add PROC_PDEATHSIG_CTL and PROC_PDEATHSIG_STATUS

* Add PROC_PDEATHSIG_CTL and PROC_PDEATHSIG_STATUS to procctl(2).
  This follows the Linux and FreeBSD semantics. However, it should
  be noted that since the child of a fork() clears the setting,
  these semantics have a fork/exit race between an exiting parent
  and a child which has not yet set up its death wish.

* Also fix a number of signal ranging checks.

Requested-by: zrj
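A minimal usage sketch, assuming the FreeBSD-style procctl(2) calling
convention (an int signal number as the data argument; header set may
vary slightly by system):

    #include <sys/procctl.h>
    #include <sys/wait.h>           /* P_PID on some systems */
    #include <err.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
            int sig = SIGTERM;
            int cur = 0;

            /* Request SIGTERM delivery when our parent exits. */
            if (procctl(P_PID, getpid(), PROC_PDEATHSIG_CTL, &sig) < 0)
                    err(1, "PROC_PDEATHSIG_CTL");

            /* Read the setting back; 0 means 'disabled'. */
            if (procctl(P_PID, getpid(), PROC_PDEATHSIG_STATUS, &cur) < 0)
                    err(1, "PROC_PDEATHSIG_STATUS");
            printf("pdeathsig = %d\n", cur);
            return 0;
    }

Note that because a fork()ed child clears the setting, a child that
wants a death signal must re-arm it itself, which is exactly the race
window described above.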
kernel - Refactor in-kernel system call API to remove bcopy()

* Change the in-kernel system call prototype to take the system call
  arguments as a separate pointer, and make the contents read-only:

    int sy_call_t (void *);                                (old)
    int sy_call_t (struct sysmsg *sysmsg, const void *);   (new)

* System calls with 6 arguments or less no longer need to copy the
  arguments from the trapframe to a holding structure. Instead, we
  simply point into the trapframe. The L1 cache footprint will be a
  bit smaller, but in simple tests the results are not noticeably
  faster... maybe 1ns or so (roughly 1%).
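A hypothetical handler under the new convention might look like the
sketch below (the argument structure, do_example(), and the exact
sysmsg result field name are illustrative, not the actual kernel
code):

    /*
     * Hypothetical 2-argument system call under the new prototype.
     * 'uap' points directly at the trapframe-backed argument area
     * and is read-only; results are returned via 'sysmsg'.
     */
    struct example_args {           /* illustrative argument block */
            int     fd;
            long    offset;
    };

    extern long do_example(int fd, long offset);

    int
    sys_example(struct sysmsg *sysmsg, const struct example_args *uap)
    {
            /* No bcopy() of the arguments; just dereference uap. */
            long result = do_example(uap->fd, uap->offset);

            if (result < 0)
                    return (EINVAL);
            sysmsg->sysmsg_lresult = result;  /* field name assumed */
            return (0);
    }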
libc - Implement sigblockall() and sigunblockall() (3)

* I half expected my reordering of lwp_fork*() to cause problems, and
  it did. Fix a panic window due to the new lwp getting added to the
  process before its underlying lwkt thread is assigned.

* Document the subtle issues, including the fact that in the
  [v]fork() case we must assign the correct TID in lwp_fork1() for
  vm_fork() to consume when dealing with /dev/lpmap.
libc - Implement sigblockall() and sigunblockall() (2)

* Clean up the logic a bit. Store the lwp or proc pointer in the
  vm_map_backing structure and make vm_map_fork() and friends more
  aware of it.

* Rearrange lwp allocation in [v]fork() to make the pointer(s)
  available to vm_fork().

* Put the thread mappings on the lwp's list immediately rather than
  waiting for the first fault, which means that per-thread mappings
  will be deterministically removed on thread exit whether any
  faults happened or not.

* Adjust the vmspace_fork*() functions to not propagate 'dead' lwp
  mappings for threads that won't exist in the forked process. Only
  the lwp mappings for the thread doing the [v]fork() are retained.
kernel - sigblockall()/sigunblockall() support (per thread shared page)

* Implement /dev/lpmap, a per-thread RW shared page between userland
  and the kernel. Each thread in the process will receive a unique
  shared page for communication with the kernel when memory-mapping
  /dev/lpmap, and can access various variables via this map.

* The current thread's TID is retained for both fork() and vfork().
  Previously it was only retained for vfork(). This avoids userland
  code confusion for any bits and pieces that are indexed based on
  the TID.

* Implement support for a per-thread block-all-signals feature that
  does not require any system calls (see next commit to libc). The
  functions will be called sigblockall() and sigunblockall().

  The lpmap->blockallsigs variable prevents normal signals from
  being dispatched. They will still be queued to the LWP as per
  normal. The behavior is not quite that of a signal mask when
  dealing with global signals.

  The low 31 bits represent a recursion counter, allowing recursive
  use of the functions. The high bit (bit 31) is set by the kernel
  if a signal was prevented from being dispatched. When userland
  decrements the counter to 0 (the low 31 bits), it can check and
  clear bit 31, and if it was found to be set, userland can then
  make a dummy 'real' system call to cause pending signals to be
  delivered (see the sketch following this entry).

  Synchronous TRAPs (e.g. kernel-generated SIGFPE, SIGSEGV, etc) are
  not affected by this feature and will still be dispatched
  synchronously.

* PThreads is expected to unmap the mapped page upon thread exit.
  The kernel will force-unmap the page upon thread exit if pthreads
  does not.

  XXX needs work - currently if the page has not been faulted in,
  the kernel has no visibility into the mapping and will not unmap
  it, but neither will it get confused if the address is accessed.
  To be fixed soon, because if we don't, programs using LWP
  primitives instead of pthreads might not realize that libc has
  mapped the page.

* The TID is reset to 1 on a successful exec*().

* On [v]fork(), if lpmap exists for the current thread, the kernel
  will copy the lpmap->blockallsigs value to the lpmap for the new
  thread in the new process. This way sigblock*() state is retained
  across the [v]fork().

  This feature not only reduces code confusion in userland, it also
  allows [v]fork() to be implemented by the userland program in a
  way that ensures no signal races in either the parent or the new
  child process until it is ready for them.

* The implementation leverages our vm_map_backing extents by having
  the per-thread memory mappings indexed within the lwp. This
  allows the lwp to remove the mappings when it exits (since not
  doing so would result in a wild pmap entry and kernel memory
  disclosure).

* The implementation currently delays instantiation of the mapped
  page(s) and some side structures until the first fault.

  XXX this will have to be changed.
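A minimal userland sketch of the counter protocol described above
(the lpmap layout, the __lpmap pointer, and the choice of dummy
system call are all hypothetical; the actual libc implementation may
differ):

    #include <signal.h>
    #include <stdint.h>
    #include <unistd.h>

    struct lpmap_sketch {                  /* hypothetical layout */
            uint32_t blockallsigs;         /* [30:0] count, [31] pending */
    };
    extern struct lpmap_sketch *__lpmap;   /* mapped from /dev/lpmap */

    void
    sigblockall(void)
    {
            /* Bump the recursion counter; no system call needed. */
            __atomic_add_fetch(&__lpmap->blockallsigs, 1,
                __ATOMIC_SEQ_CST);
    }

    void
    sigunblockall(void)
    {
            uint32_t v;

            v = __atomic_sub_fetch(&__lpmap->blockallsigs, 1,
                __ATOMIC_SEQ_CST);
            if (v == 0x80000000U) {
                    /*
                     * Counter hit 0 with bit 31 set: the kernel
                     * withheld at least one signal.  Clear the flag
                     * and make a dummy 'real' system call so the
                     * pending signals get delivered.
                     */
                    __atomic_fetch_and(&__lpmap->blockallsigs,
                        0x7FFFFFFFU, __ATOMIC_SEQ_CST);
                    kill(getpid(), 0);     /* any trap will do */
            }
    }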
kernel - Fix lwp_tid ranging error

* We intended for lwp_tid's to range from 1..INT_MAX, but we had the
  following code in the kernel, and unfortunately GCC optimizes out
  the conditional entirely:

    if (++lp->lwp_tid <= 0)
            lp->lwp_tid = 1;

* In addition, the pthreads library actually would like to use a few
  high bits for flags, so fix the limit while we are here to
  0x3FFFFFFF. This saves us a pthreads assertion in the mutex code
  if an application is actually able to wind the thread id up this
  high over its life. Since the TID allocation mechanism is pretty
  simplistic, it is entirely possible to do this given heavy thread
  creation / deletion rates and enough time.

* Change the conditional such that GCC does not optimize it out. We
  do not want to depend on -fwrapv for the kernel code to compile
  properly, so we don't use it.

    if (lp->lwp_tid == 0 || lp->lwp_tid == 0x3FFFFFFF)
            lp->lwp_tid = 1;
    else
            ++lp->lwp_tid;
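The conditional disappears because signed overflow is undefined
behavior in C, so the compiler may assume the increment never wraps.
A standalone illustration of the pattern (hypothetical, not the
kernel code):

    /*
     * Because signed overflow is undefined, an optimizing compiler
     * may deduce that 'tid' starts at 1, is only ever incremented,
     * and (by assumption) never wraps -- so '++tid <= 0' can never
     * be true and the reset branch may be deleted entirely.
     */
    static int tid = 1;

    int
    alloc_tid(void)
    {
            if (++tid <= 0)         /* may be optimized out */
                    tid = 1;
            return tid;
    }

Testing against explicit boundary values, as in the fix above,
involves no overflow at all, so the compiler cannot remove it.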
kernel - Refactor scheduler weightings part 2/2

* Change the default fork()ing mechanic from 0x80 (random cpu) to
  0x20 (best cpu). We no longer need to mix it up on fork because
  weight4 now works.

* The best cpu algorithm has a number of desirable characteristics
  for fork() and fork()/exec():

  - Will generally start the child topologically 'close' to the
    parent, reducing fork/exec/exit/wait overheads, but still spread
    the children out while machine load is light. If the child
    sticks around for long enough, it will get spread out even more
    optimally. If not, closer is better.

  - Will not stack children up on the same cpu unnecessarily (e.g.
    when a parent fork()s a bunch of times at once).

  - Will optimize heavy and very-heavy load situations. If the
    child has nowhere else reasonable to go, this will schedule it
    on a hyper-thread sibling or even on the same cpu as the parent,
    depending on the load.

* Gives us around a 15% improvement in fork/exec/exit/wait
  performance.

* Once a second we clear the td_wakefromcpu hint on the currently
  running thread. This allows a thread which has become cpu-bound
  to start to 'wander' afield (though the scheduler will still try
  to avoid moving it too far away, topologically).
kernel - VM rework part 12 - Core pmap work, stabilize & optimize

* Add tracking for the number of PTEs mapped writeable in md_page.
  Change how PG_WRITEABLE and PG_MAPPED are cleared in the vm_page
  to avoid clear/set races. This problem occurs because we would
  have otherwise tried to clear the bits without hard-busying the
  page. This allows the bits to be set with only an atomic op.

  Procedures which test these bits universally do so while holding
  the page hard-busied, and now call pmap_mapped_sync() beforehand
  to properly synchronize the bits.

* Fix bugs related to various counters: pm_stats.resident_count,
  wiring counts, vm_page->md.writeable_count, and
  vm_page->md.pmap_count.

* Fix bugs related to synchronizing removed ptes with the vm_page.
  Fix one case where we were improperly updating (m)'s state based
  on a lost race against a pte swap-to-0 (pulling the pte).

* Fix a bug related to the page soft-busying code when the
  m->object/m->pindex race is lost.

* Implement a heuristic version of vm_page_active() which just
  updates act_count unlocked if the page is already in the
  PQ_ACTIVE queue, or if it is fictitious.

* Allow races against the backing scan for pmap_remove_all() and
  pmap_page_protect(VM_PROT_READ). Callers of these routines for
  these cases expect full synchronization of the page dirty state.
  We can identify when a page has not been fully cleaned out by
  checking vm_page->md.pmap_count and vm_page->md.writeable_count.
  In the rare situation where this happens, simply retry.

* Assert that the PTE pindex is properly interlocked in
  pmap_enter(). We still allow PTEs to be pulled by other routines
  without the interlock, but multiple pmap_enter()s of the same
  page will be interlocked.

* Assert additional wiring count failure cases.

* (UNTESTED) Flag DEVICE pages (dev_pager_getfake()) as being
  PG_UNMANAGED. This essentially prevents all the various reference
  counters (e.g. vm_page->md.pmap_count and
  vm_page->md.writeable_count), PG_M, PG_A, etc from being updated.

  The vm_page's aren't tracked in the pmap at all because there is
  no way to find them... they are 'fake', so without a pv_entry we
  can't track them. Instead we simply rely on the vm_map_backing
  scan to manipulate the PTEs.

* Optimize the new vm_map_entry_shadow() to use a shared object
  token instead of an exclusive one. OBJ_ONEMAPPING will be cleared
  with the shared token.

* Optimize single-threaded access to pmaps to avoid pmap_inval_*()
  complexities.

* Optimize __read_mostly for more globals.

* Optimize pmap_testbit(), pmap_clearbit(), pmap_page_protect().
  Pre-check vm_page->md.writeable_count and vm_page->md.pmap_count
  for an easy degenerate return before doing real work.

* Optimize pmap_inval_smp() and pmap_inval_smp_cmpset() for the
  single-threaded pmap case, when called on the same CPU the pmap
  is associated with. This allows us to use simple atomics and
  cpu_*() instructions and avoid the complexities of the
  pmap_inval_*() infrastructure.

* Randomize the page queue used in bio_page_alloc(). This does not
  appear to hurt performance (e.g. under heavy tmpfs use) on large
  many-core NUMA machines, and it makes vm_page_alloc()'s job
  easier. This change might have a downside for temporary files,
  but for more long-lasting files there's no point allocating pages
  localized to a particular cpu.

* Optimize vm_page_alloc().

  (1) Refactor the _vm_page_list_find*() routines to avoid
      re-scanning the same array indices over and over again when
      trying to find a page.

  (2) Add a heuristic, vpq.lastq, for each queue, which we set if a
      _vm_page_list_find*() operation had to go far afield to find
      its page. Subsequent finds will skip to the far-afield
      position until the current cpu's queues have pages again.

  (3) Reduce PQ_L2_SIZE from an extravagant 2048 entries per queue
      down to 1024. The original 2048 was meant to provide 8-way
      set-associativity for 256 cores but wound up reducing
      performance due to longer index iterations.

* Refactor the vm_page_hash[] array. This array is used to shortcut
  vm_object locks and locate VM pages more quickly, without locks.
  The new code limits the size of the array to something more
  reasonable, implements a 4-way set-associative replacement policy
  using 'ticks', and rewrites the hashing math (a sketch of the
  replacement idea follows this entry).

* Effectively remove pmap_object_init_pt() for now. In current
  tests it does not actually improve performance, probably because
  it may map pages that are not actually used by the program.

* Remove vm_map_backing->refs. This field is no longer used.

* Remove more of the old, now-stale code related to the use of
  pv_entry's for terminal PTEs.

* Remove more of the old shared page-table-page code. This worked
  but could never be fully validated and was prone to bugs, so
  remove it. In the future we will likely use larger 2MB and 1GB
  pages anyway.

* Remove pmap_softwait()/pmap_softhold()/pmap_softdone().

* Remove more #if 0'd code.
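A rough sketch of a 4-way set-associative, ticks-based replacement
policy of the kind described above (structure and function names are
illustrative, not the actual kernel code):

    #define VM_PAGE_HASH_SET 4              /* 4-way set associative */

    struct vm_page;                         /* opaque here */

    struct vm_page_hash_elm {               /* hypothetical layout */
            struct vm_page *m;
            int ticks;                      /* last-use time */
    };

    /*
     * Insert 'm' into the 4-entry set selected by 'hash'; if it is
     * not already present, evict the entry with the oldest tick
     * count.  'mask' is the table size minus one, 'now' the current
     * ticks value.
     */
    static void
    page_hash_enter(struct vm_page_hash_elm *tbl, int mask,
                    struct vm_page *m, int hash, int now)
    {
            struct vm_page_hash_elm *set;
            struct vm_page_hash_elm *best;
            int i;

            set = &tbl[hash & mask & ~(VM_PAGE_HASH_SET - 1)];
            best = set;
            for (i = 0; i < VM_PAGE_HASH_SET; ++i) {
                    if (set[i].m == m) {
                            set[i].ticks = now;     /* refresh */
                            return;
                    }
                    if (set[i].ticks - best->ticks < 0)
                            best = &set[i];         /* older entry */
            }
            best->m = m;
            best->ticks = now;
    }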
kernel - Implement sbrk(), change low-address mmap hinting

* Change mmap()'s internal lower address bound from dmax (32GB) to
  RLIMIT_DATA's current value. This allows the rlimit to be e.g.
  reduced, and for hinted mmap()s to then map space below the 4GB
  mark. The default data rlimit is 32GB.

  This change is needed to support several languages, at least lua
  and probably another one or two, which use mmap hinting under the
  assumption that it can map space below the 4GB address mark. The
  data limit must be lowered with a limit command too, which can be
  scripted or patched for such programs.

* Implement the sbrk() system call. This system call was already
  present but just returned EOPNOTSUPP, and libc previously had its
  own shim for sbrk() which used the ancient break() system call.
  (Note that the prior implementation did not return ENOSYS or
  raise a signal.)

  sbrk() in the kernel is thread-safe for positive increments and
  is also byte-granular (the old libc sbrk() was only
  page-granular).

  sbrk() in the kernel does not implement negative increments and
  will return EOPNOTSUPP if asked to. Negative increments were
  historically designed to be able to 'free' memory allocated with
  sbrk(), but it is not possible to implement the case in a modern
  VM system due to the mmap changes above:

  (1) Because the new mmap hinting changes make it possible for
      normal mmap()s to have mapped space prior to the RLIMIT_DATA
      resource limit being increased, causing intermingling of
      sbrk() and user mmap()d regions.

  (2) Because negative increments are not even remotely
      thread-safe.

* Note that the previous commit refactored libc to use the kernel
  sbrk() and fall back to its previous emulation code on failure,
  so libc supports both new and old kernels.

* Remove the brk() shim from libc. brk() is not implemented by the
  kernel. Symbol removed. Requires testing against ports, so we may
  have to add it back in, but basically there is no way to
  implement brk() properly with the mmap() hinting fix.

* Adjust manual pages.
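A minimal usage sketch of the new semantics, assuming the usual
sbrk() error convention of returning (void *)-1 and setting errno
(illustrative only; on an old kernel libc's fallback emulation kicks
in instead):

    #include <errno.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
            /* Positive increments are thread-safe and byte-granular. */
            void *base = sbrk(100);

            if (base == (void *)-1) {
                    perror("sbrk");
                    return 1;
            }

            /* Negative increments are no longer implemented. */
            if (sbrk(-100) == (void *)-1 && errno == EOPNOTSUPP)
                    printf("negative increments not supported\n");
            return 0;
    }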
kernel - Remove SMP bottlenecks on uidinfo, descriptors, and lockf

* Use an eventcounter and the per-thread fd cache to fix bottlenecks
  in checkfdclosed(). This will work well for the vast majority of
  applications and test benches.

* Batch holdfp*() operations on kqueue collections when implementing
  poll() and select(). This significantly improves performance.
  Full scaling is not yet achieved, however.

* Increase copyin item batching from 8 to 32 for select() and
  poll().

* Give the uidinfo structure a pcpu array to hold the posixlocks
  and openfiles count fields, with a rollup contained in the
  uidinfo structure itself. This removes numerous global
  bottlenecks related to open(), close(), dup*(), and lockf
  operations (the posixlocks count).

  ui_openfiles will force a rollup on limit reached, to be sure
  that the limit was actually reached. ui_posixlocks stays fairly
  loose. Each cpu generally rolls up only when its pcpu count
  exceeds +32 or goes below -32. (A sketch of this pattern follows
  this entry.)

* Give the proc structure a pcpu array for the same counts, in
  order to properly support seteuid() and such.

* Replace P_ADVLOCK with a char field proc->p_advlock_flag, and
  remove token operations around the field.
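A simplified sketch of the pcpu-counter-with-rollup idea (all names
hypothetical; the kernel version hangs the array off uidinfo and
uses its own atomic primitives):

    #define MAXCPU            256           /* illustrative */
    #define PCPU_ROLLUP_LIMIT 32

    struct pcpu_count {                     /* one per cpu */
            long delta;
    } __attribute__((aligned(64)));         /* avoid false sharing */

    struct loose_counter {
            long total;                     /* loose global rollup */
            struct pcpu_count pcpu[MAXCPU];
    };

    /*
     * Adjust the counter from 'cpu'.  The shared 'total' cache line
     * is only touched when the local delta drifts past +/-32, so
     * global contention is rare.
     */
    static void
    counter_adjust(struct loose_counter *c, int cpu, long n)
    {
            struct pcpu_count *pc = &c->pcpu[cpu];
            long d;

            d = __atomic_add_fetch(&pc->delta, n, __ATOMIC_RELAXED);
            if (d > PCPU_ROLLUP_LIMIT || d < -PCPU_ROLLUP_LIMIT) {
                    __atomic_sub_fetch(&pc->delta, d, __ATOMIC_RELAXED);
                    __atomic_add_fetch(&c->total, d, __ATOMIC_RELAXED);
            }
    }

A limit check against 'total' can force a full rollup first, which
matches the ui_openfiles behavior described above.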
kernel - Rewrite umtx_sleep() and umtx_wakeup()

* Rewrite umtx_sleep() and umtx_wakeup() to no longer use
  vm_fault_page_quick(). Calling the VM fault code incurs a huge
  overhead and creates massive contention when many threads are
  using these calls. The new code uses fuword(), translates to the
  physical address via the PTmap, and has very low overhead and
  basically zero contention.

* Instead, impose a mandatory timeout for umtx_sleep() and cap it
  at 2 seconds (adjustable via the sysctl kern.umtx_timeout_max,
  set in microseconds). When the memory mapping underpinning a umtx
  changes, userland will not stall for more than 2 seconds.

* The common remapping case caused by fork() is handled by the
  kernel by immediately waking up all sleeping umtx_sleep() calls
  for the related process.

* Any other copy-on-write or remapping cases will stall for no more
  than the maximum timeout (2 seconds). This might include paging
  to/from swap, for example, which can remap the physical page
  underpinning the umtx. It could also include user application
  snafus or weirdness.

* umtx_sleep() and umtx_wakeup() still translate the user virtual
  address to a physical address for the tsleep() and wakeup()
  operations. This is done via a fault-protected access to the
  PTmap (the page-table self-mapping).
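A minimal userland sketch of the interface (prototypes as in the
manual pages, declared inline here for clarity; a real mutex would
layer atomics and contention counts on top of this):

    int umtx_sleep(volatile const int *ptr, int value, int timeout);
    int umtx_wakeup(volatile const int *ptr, int count);

    volatile int futex_word = 0;

    void
    waiter(void)
    {
            /*
             * Sleep only while *ptr still equals the expected value.
             * Even with timeout 0, the kernel now caps the sleep at
             * kern.umtx_timeout_max (2 seconds by default), so a
             * remapped page cannot strand us: we wake up and
             * re-test.
             */
            while (futex_word == 0)
                    umtx_sleep(&futex_word, 0, 0);
    }

    void
    waker(void)
    {
            futex_word = 1;
            umtx_wakeup(&futex_word, 1);    /* wake one waiter */
    }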
kernel - Simplify umtx_sleep and umtx_wakeup support

* Rip out the vm_page_action / vm_page_event() API. This code was
  fairly SMP-unfriendly and created serious bottlenecks with large
  threaded user programs using mutexes.

* Replace it with a simpler mechanism that simply wakes up any UMTX
  domain tsleeps after a fork().

* Implement a 4uS spin loop in umtx_sleep() similar to what the
  pipe code does.
kernel - Restrict kill(-1, ...) to the reaper group as well

* When kill(-1, ...) is issued to 'all processes', restrict the
  list of processes signaled to the sender's reaper group or any
  sub-group of that group.

* This works around issues found when testing low maxproc limits.
  At least one (and probably several) third-party programs do not
  properly handle errors when [v]fork() returns -1 and may try to
  signal the returned 'pid', which being -1 crowbars the entire
  system. A sketch of the failure mode follows this entry.

* Fixes an issue where a cmake running under synth hits a fork
  failure and tries to signal the whole system. With this change,
  the cmake winds up only crowbaring its own chroot, due to
  synthexec's use of the reaper feature.

* Adjust the kill.2 manual page to reflect the change.
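The failure mode is the classic unchecked-fork pattern (hypothetical
illustration, not taken from any particular port):

    #include <signal.h>
    #include <sys/types.h>
    #include <unistd.h>

    static void
    spawn_and_kill(void)
    {
            pid_t pid = fork();

            if (pid == 0)
                    _exit(0);       /* child */
            /*
             * BUG: no check for pid == -1.  If fork() failed, this
             * becomes kill(-1, SIGTERM), which used to signal every
             * process the sender could reach.  With this change the
             * damage is confined to the sender's reaper group.
             */
            kill(pid, SIGTERM);
    }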
kernel - Automatically downscale NPROC resource limit

* Downscale the NPROC resource limit based on fork and chroot
  depth, up to 50%, and also make the limit apply to root
  processes. This is intended to be a poor-man's safety, preventing
  run-away (root or other) process creation from completely
  imploding a system.

* Each level of fork() downscales the NPROC resource limit by 1/3%,
  capped at 32 levels (~10%).

* Each chroot (including one made by a jail) downscales the NPROC
  resource limit by 10%, up to 40%.
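A worked sketch of the arithmetic (illustrative only; the kernel's
integer math and rounding may differ):

    /*
     * Downscale an NPROC limit: 1/3% per fork level, capped at 32
     * levels (~10%), plus 10% per chroot, capped at 40%.  Combined,
     * the reduction tops out around 50%.
     */
    static long
    nproc_downscale(long limit, int forkdepth, int chrootdepth)
    {
            if (forkdepth > 32)
                    forkdepth = 32;
            if (chrootdepth > 4)
                    chrootdepth = 4;
            limit -= limit * forkdepth / 300;   /* 1/3% per level */
            limit -= limit * chrootdepth / 10;  /* 10% per chroot */
            return limit;
    }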
kernel - Be nicer to pthreads in vfork()

* When vfork()ing, give the new sub-process's lwp the same TID as
  the one that called vfork(). Even though user processes are not
  supposed to do anything sophisticated inside a vfork() prior to
  exec()ing, some things, such as fileno() having to lock in a
  threaded environment, might not be apparent to the programmer.

* By giving the sub-process the same TID, operations done inside
  the vfork() prior to exec that interact with pthreads will not
  confuse pthreads and cause corruption due to e.g. TID 0 clashing
  with TID 0 running in the parent that is running concurrently.