kernel - Change allocvnode() to not recursively block freeing vnodes

allocvnode() has caused many deadlock issues over the years, including
recent issues with softupdates, because it is often called from deep
within VFS modules and attempts to clean and free unrelated vnodes when
the vnode limit is reached to make room for the new one.

* numvnodes is not protected by any locks and needs atomic ops.

* Change allocvnode() to always allocate and not attempt to free other
  vnodes.

* allocvnode() now flags the LWP to handle reducing the number of
  vnodes in the system when it returns to userland, instead.
  Consolidate several flags into a single conditional function call,
  lwpuserret().

  When triggered, this code will do a limited scan of the free list to
  try to find vnodes to free.  (See the sketch after this message.)

* The vnlru_proc_wait() code existed to handle a separate algorithm
  related to vnodes with cached buffers and VM pages, but represented a
  major bottleneck in the system.  Remove vnlru_proc_wait() and allow
  vnodes with buffers and/or non-empty VM objects to be placed on the
  free list.

  This also requires not vhold()ing the vnode for related buffer cache
  buffers, since the vnode will not go away until related buffers have
  been cleaned out.  We shouldn't need those holds.

Testing-by: vsrinivas
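A minimal sketch of the new flow, assuming illustrative names
(LWP_MP_VNLRU and the M_VNODE malloc type are not necessarily the
committed identifiers):

    struct vnode *
    allocvnode(int lktimeout, int lkflags)
    {
            struct vnode *vp;

            /* unlocked global counter: atomic op instead of a token */
            atomic_add_int(&numvnodes, 1);

            vp = kmalloc(sizeof(*vp), M_VNODE, M_ZERO | M_WAITOK);
            /* ... normal vnode initialization ... */

            /*
             * Never clean or free other vnodes from deep inside a VFS
             * call path.  Just flag the lwp; lwpuserret() performs a
             * limited free-list scan on return to userland.
             */
            if (numvnodes >= maxvnodes) {
                    atomic_set_int(&curthread->td_lwp->lwp_mpflags,
                                   LWP_MP_VNLRU);
            }
            return (vp);
    }
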
kernel - usched_dfly revamp (7), bring back td_release, sysv_sem, weights

* Bring back the td_release kernel priority adjustment.

* sysv_sem now attempts to delay wakeups until after releasing its
  token (see the sketch below).

* Tune default weights.

* Do not depress priority until we've become the uschedcp.

* Fix priority sort for LWKT and usched_dfly to avoid context-switching
  across all runnable threads twice.
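A hedged sketch of the sysv_sem idea (token and variable names are
illustrative): record that a wakeup is owed while the token is held,
and only issue it after the token is released, so the woken threads do
not immediately contend on it.

    lwkt_gettoken(&semu_token);
    /* ... adjust semaphore values, note satisfied waits ... */
    dowakeup = (completed_ops != 0);
    lwkt_reltoken(&semu_token);

    /* wake waiters only after the token is free */
    if (dowakeup)
            wakeup(semptr);
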
kernel - Fix signal masking race assertion panic w/vkernel

* sigsuspend() and pselect() record the old signal mask in order to
  allow an interrupting signal to run its handler before the old mask
  is restored.

* When multiple threads are present, a race can ensue where another
  thread changes the signal handler after sigsuspend() or pselect()
  have been interrupted, but before they are able to process the
  signal.

* If the signal is no longer enabled, the old signal mask is not
  restored on system call return, resulting in an assertion and panic.

* Fix the problem by checking the flag and restoring the old signal
  mask on return, rather than asserting when the flag is found to be
  non-zero on return (see the sketch below).

Reported-by: Venkatesh Srinivas
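A minimal sketch of the fix in the return-to-userland path, assuming
the flag and saved-mask fields are named LWP_OLDMASK and
lwp_oldsigmask:

    /*
     * If the interrupting signal's handler never ran (flag still set),
     * restore the saved mask here rather than asserting that the flag
     * must already be clear.
     */
    if (lp->lwp_flags & LWP_OLDMASK) {
            lp->lwp_flags &= ~LWP_OLDMASK;
            lp->lwp_sigmask = lp->lwp_oldsigmask;
    }
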
kern: Update traps, sigbus->sigsegv, cleanup and fixes

The primary purpose of this changeset is to change the signal delivered
on an attempt to access protected memory from bus error (SIGBUS) to
segfault (SIGSEGV).

In the process of doing this, it was noticed that there were
differences between i386 and x86_64, as well as differences between the
actual and virtual kernels.  Several of these differences were
addressed.

Extra whitespace was removed and some syncing with the divergent
FreeBSD version was done.  Some "#if 0" sections were removed.

The virtual kernels were tested along with the real kernels.  A
conservative approach was taken, as there seems to be more cruft that
could come out.
kernel - Major signal path adjustments to fix races, tsleep race fixes, +more

* Refactor the signal code to properly hold the lp->lwp_token.  In
  particular the ksignal() and lwp_signotify() paths.

* The tsleep() path must also hold lp->lwp_token to properly handle
  lp->lwp_stat states and interlocks (see the sketch below).

* Refactor the timeout code in tsleep() to ensure that endtsleep() is
  only called from the proper context, and fix races between
  endtsleep() and lwkt_switch().

* Rename proc->p_flag to proc->p_flags

* Rename lwp->lwp_flag to lwp->lwp_flags

* Add lwp->lwp_mpflags and move flags which require atomic ops (are
  adjusted when not the current thread) to the new field.

* Add td->td_mpflags and move flags which require atomic ops (are
  adjusted when not the current thread) to the new field.

* Add some freeze testing code to the x86-64 trap code (default
  disabled).
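A hedged sketch of the interlock this buys, using the stock lwp fields;
the real lwp_signotify() logic is more involved than shown:

    /*
     * Signal side: lwp_token serializes lwp_stat checks against the
     * tsleep() path, which now takes the same token.
     */
    lwkt_gettoken(&lp->lwp_token);
    if (lp->lwp_stat == LSSLEEP && (lp->lwp_flags & LWP_SINTR))
            setrunnable(lp);                /* wake the tsleep */
    lwkt_reltoken(&lp->lwp_token);
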
kernel - Fix itimer hard critical section panic

* ksignal() needs per-lwp tokens as well as the process token, but the
  existing itimer code only acquires the process token.

* Flag the itimer signal and issue the ksignal() from the trap's AST
  code instead of trying to issue it from the hardclock (see the
  sketch below).

Reported-by: swildner
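A minimal sketch of the two halves, with an illustrative flag name
(LWP_MP_VIRTALRM) and need_user_resched() standing in as one way to
request the AST:

    /* hardclock, hard critical section: flag only, no ksignal() */
    atomic_set_int(&lp->lwp_mpflags, LWP_MP_VIRTALRM);
    need_user_resched();                /* force an AST on the way out */

    /* trap's AST path, where tokens may be acquired safely */
    if (lp->lwp_mpflags & LWP_MP_VIRTALRM) {
            atomic_clear_int(&lp->lwp_mpflags, LWP_MP_VIRTALRM);
            ksignal(p, SIGVTALRM);
    }
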
kernel - Hold required token when accessing p_flags, adjust kmem access

* Numerous adjustments to p->p_flag were not being done with p->p_token
  held.  In particular uiomove().

* Replace P_DEADLKTREAT with LWP_DEADLKTREAT in several places where it
  had not been previously converted (see the sketch below).

* Allow DMAP access in is_globaldata_space() for x86-64.
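A hedged sketch of the uiomove() pattern after the conversion: the
deadlock-treatment hint is per-lwp and only touched by the current
thread, so it no longer needs the process token.

    int save = lp->lwp_flags & LWP_DEADLKTREAT;

    lp->lwp_flags |= LWP_DEADLKTREAT;
    error = copyout(cp, iov->iov_base, cnt);
    lp->lwp_flags = (lp->lwp_flags & ~LWP_DEADLKTREAT) | save;
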
kernel - Fix signal delivery races

* The send side was using p->p_token, but the processing code in trap
  was still using the mp_lock.  Fix the trap processing code to use
  p->p_token (see the sketch below).

* This fixes several nasty races that can cause signals to be lost and
  vkernels to freeze, and possibly affected other programs which depend
  on signals between threads.
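A minimal sketch of the corrected trap-side processing; the CURSIG loop
is the stock idiom, shown here only to place the token:

    /* AST signal processing: p_token now, not the mp_lock */
    lwkt_gettoken(&p->p_token);
    while ((sig = CURSIG_TRACE(lp)) != 0)
            postsig(sig);
    lwkt_reltoken(&p->p_token);
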
kernel - Make most of the fork and exit paths MPSAFE

* Remove the MP lock from numerous system calls (mainly socket calls)
  that no longer need it.

* Use proc_token in a couple of places that still need work (instead of
  the MP lock).  For example, the process group (pgrp) and several
  places which call pfind() still need to use the proc_token.

* Use the per-process p->p_token in fork1(), exit1(), and lwp_exit().
  The critical portions of these paths now have significant
  concurrency.

* Use the per-process p->p_token when traversing p->p_children,
  primarily aiding the kern_wait() code (see the sketch below).  So the
  wait*() system calls should now have significant concurrency.

* Change the fgetown() API to avoid certain races.

* Add M_ZERO to the struct filedesc_to_leader allocation for safety
  purposes.
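A hedged sketch of the kern_wait() traversal, which now relies on the
parent's p_token to keep the p_children list stable:

    lwkt_gettoken(&p->p_token);         /* protects p->p_children */
    LIST_FOREACH(child, &p->p_children, p_sibling) {
            /* ... match pid/pgid, reap zombies, etc ... */
    }
    lwkt_reltoken(&p->p_token);
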
kernel - scheduler adjustments for large ncpus / 48-core monster

* Change the LWKT scheduler's token spinning algorithm.  It used to
  DELAY a short period of time and then simply retry, creating a lot of
  contention between cpus trying to acquire a token.

  Now the LWKT scheduler uses a FIFO index mechanic to resequence the
  contending cpus into 1uS retry slots using essentially just
  atomic_fetchadd_int(), so it is very cache friendly.  The spin-retry
  thus has a bounded cache management traffic load regardless of the
  number of cpus, and contending cpus will not be tripping over each
  other.  (See the sketch after this message.)

  The new algorithm slightly regresses 4-cpu operation (~5% under heavy
  contention) but significantly improves 48-cpu operation.  It is also
  flexible enough for further work down the road.  The old algorithm
  simply did not scale very well.

  Add three sysctls:

  sysctl lwkt.spin_method=1

        0       Allow a user thread to be scheduled on a cpu while
                kernel threads are contended on a token, using the IPI
                mechanic to interrupt the user thread and reschedule on
                decontention.  This can potentially result in excessive
                IPI traffic.

        1       Allow a user thread to be scheduled on a cpu while
                kernel threads are contended on a token, reschedule on
                the next clock tick (100 Hz typically).  Decontention
                will NOT generate any IPI traffic.  DEFAULT.

        2       Do not allow a user thread to be scheduled on a cpu
                while kernel threads are contended.  Should not be used
                normally, for debugging only.

  sysctl lwkt.spin_delay=1

        Slot time in microseconds, default 1uS.  Recommended values are
        1 or 2 but not longer.

  sysctl lwkt.spin_loops=10

        Number of times the LWKT scheduler loops on contended threads
        before giving up and allowing an idle-thread HLT.  In order to
        wake up from the HLT, decontention will cause an IPI, so you do
        not want to set this value too small.  Values between 10 and
        100 are recommended.

* Redo the token decontention algorithm.  Use a new gd_reqflags flag,
  RQF_WAKEUP, coupled with RQF_AST_LWKT_RESCHED in the per-cpu
  globaldata structure, to determine which cpus actually need to be
  IPId on token decontention (to wake up their idle threads stuck in
  HLT).  This requires that all gd_reqflags operations use locked
  atomic instructions rather than non-locked instructions.

* Decontention IPIs are a last-gasp effort if the LWKT scheduler has
  spun too many times.  Under normal conditions, even under heavy
  contention, actual IPIing should be minimal.
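A hedged sketch of the FIFO resequencing mechanic with illustrative
field names (t_fifo_head/t_fifo_tail are not necessarily the committed
names): each contending cpu takes a ticket via atomic_fetchadd_int()
and then retries in its own ~1uS DELAY() slot, so the cache traffic
stays bounded no matter how many cpus contend.

    int slot = atomic_fetchadd_int(&tok->t_fifo_head, 1);
    int loops = lwkt_spin_loops;

    while (tok->t_fifo_tail != slot) {
            DELAY(lwkt_spin_delay);     /* 1-2uS slot time */
            if (--loops <= 0)
                    break;              /* give up, allow idle HLT */
    }
    /* decontention advances the sequence: */
    /* atomic_add_int(&tok->t_fifo_tail, 1); */
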
kernel - Change the discrete mplock into mp_token

* Use a lwkt_token for the mp_lock (see the sketch below).  This
  consolidates our longer-term spinnable locks (the mplock and tokens)
  into just tokens, making it easier to solve performance issues.

* Some refactoring of the token code was needed to guarantee the
  ordering when acquiring and releasing the mp_token vs other tokens.

* The thread switch code, lwkt_switch(), is simplified by this change,
  though not necessarily faster.

* Remove td_mpcount, mp_lock, and other related fields.

* Remove assertions related to td_mpcount and friends, generally
  folding them into similar assertions for tokens.
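A minimal sketch of the consolidation; whether the wrappers are macros
or inline functions in the actual commit is not shown here:

    /* the mplock is now just another token */
    struct lwkt_token mp_token = LWKT_TOKEN_INITIALIZER(mp_token);

    #define get_mplock()        lwkt_gettoken(&mp_token)
    #define rel_mplock()        lwkt_reltoken(&mp_token)
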
kernel - (mainly x86_64) - Fix a number of rare races

* Move the MP lock from outside to inside exit1(), also fixing an issue
  where sigexit() was calling exit1() without it.

* Move calls to dsched_exit_thread() and biosched_done() out of the
  platform code and into the mainline code.  This also fixes an issue
  where the code was improperly blocking way too late in the thread
  termination code, after the point where the thread had been
  descheduled permanently and tsleep decommissioned for the thread.

* Cleanup and document related code areas.

* Fix a missing proc_token release in the SIGKILL exit path.

* Fix FAKE_MCOUNT()s in the x86-64 code.  These are NOPs anyway (since
  kernel profiling doesn't work), but fix them anyway.

* Use APIC_PUSH_FRAME() in the Xcpustop assembly code for x86-64 in
  order to properly acquire a working %gs.  This may improve the
  handling of panic()s on x86_64.

* Also fix some cases of #if JG'd (ifdef'd out) code in case the code
  is ever used later on.

* Protect set_user_TLS() with a critical section to be safe.

* Add debug code to help track down further x86-64 seg-fault issues,
  and provide better kprintf()s for the debug path in question.
kernel - Separate inherited mplocks from td_mplocks and fix a gettoken bug

* Separate out td_mpcount into td_xpcount and td_mpcount.  td_xpcount
  is an inherited mpcount.  A preempting thread inherits the mpcount of
  the thread being preempted until it switches out, to guarantee that
  the mplock remains atomic through the preemption (as expected by the
  poor thread that got preempted).

* Fix a serious but hard to reproduce bug in lwkt_gettoken().  This
  function marks the token reference as being MPSAFE if td_mpcount is
  non-zero, even when the token is not a MPSAFE token.  However, until
  this patch td_mpcount also included inherited mpcounts when one
  thread preempts another, and the inherited mpcounts could go away if
  the thread blocks or switches, leaving the token unprotected.

* Fix a less serious bug where a new token reference was being
  populated prior to td_toks_stop being incremented, and where an
  existing token reference was being depopulated after td_toks_stop was
  decremented.  Nothing can race us, but switch the index
  increment/decrement around to protect the slot being operated upon
  (see the sketch below).

* Add a ton of assertions in the interrupt, trap, and syscall paths to
  assert that the mplock, number of tokens, and critcount remain
  unchanged across driver and other calls.
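A hedged sketch of the corrected slot ordering; the acquire/release
internals are elided:

    /* lwkt_gettoken(): reserve the slot before populating it */
    ref = td->td_toks_stop;
    KKASSERT(ref < &td->td_toks_end);
    ++td->td_toks_stop;                 /* slot now inside the range */
    cpu_ccfence();
    ref->tr_tok = tok;                  /* ... then fill it in */

    /* lwkt_reltoken(): depopulate, then shrink the counted range */
    ref = td->td_toks_stop - 1;
    /* ... release the reference ... */
    cpu_sfence();
    td->td_toks_stop = ref;
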
kernel - Change lwp_fork() to not hold the mplock in the new thread

* Change lwp_fork() to produce a mpsafe thread at startup instead of
  one with the mplock held.

* Change all fork_trampoline() functions and all kernel callbacks via
  cpu_set_fork_handler() to expect a thread without the mplock held.

* Adjust the thread procedures for aio etc (those not yet mpsafe) to
  acquire the mplock themselves (see the sketch below).
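A minimal sketch of the adjustment for a not-yet-mpsafe kernel thread;
aio_daemon is an illustrative name:

    static void
    aio_daemon(void *arg)
    {
            get_mplock();       /* threads now start without the mplock */
            for (;;) {
                    /* ... existing aio processing loop ... */
            }
            /* rel_mplock() on the exit path */
    }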