kernel - Add TDF_RUNNING assertions

* Assert that the target lwkt thread being switched to is not flagged as running.

* Assert that the originating lwkt thread being switched from is flagged as running.

* Fix the running flag's initial condition for the idle thread.
kernel - Revamp LWKT thread migration

* Rearrange the handling of TDF_RUNNING, making lwkt_switch() responsible for it instead of the assembly switch code. Adjust td->td_switch() to return the previously running thread. This allows lwkt_switch() to process thread migration between cpus after the thread has been completely switched out, removing the need to loop on TDF_RUNNING on the target cpu.

* Fixes the lwkt_setcpu_remote() livelock failure.

* This required major surgery on the core thread switch assembly; testing is needed. I tried to avoid doing this, but the livelock problems persisted, so the only solution was to remove the need for the loops that were causing them.

* NOTE: The user process scheduler is still using the old giveaway/acquire method. More work is needed here.

Reported-by: "Magliano Andre'" <masterblaster@tiscali.it>
kernel - rewrite the LWKT scheduler's priority mechanism

The purpose of these changes is to begin to address the issue of cpu-bound kernel threads: for example, the crypto threads, or a HAMMER prune cycle that operates entirely out of the buffer cache. These threads tend to hiccup the system, creating temporary lockups, because as kernel threads they never switch away.

* Change the LWKT scheduler from a strict hard priority model to a fair-share with hard priority queueing model. A kernel thread will be queued with a hard priority, giving it dibs on the cpu earlier if it has a higher priority. However, if the thread runs past its fair-share quantum it will then become limited by that quantum and other, lower-priority threads will be allowed to run.

* Rewrite lwkt_yield() and lwkt_user_yield(); remove uio_yield(). Both yield functions are now very fast and can be called without further timing conditionals, simplifying numerous callers. lwkt_user_yield() now uses the fair-share quantum to determine when a cpu-bound kernel thread should yield the cpu.

* Implement the new yield in the crypto kernel threads, HAMMER, and other places (many of which already used the old yield functions, which did not work very well).

* lwkt_switch() now round-robins only after the fair-share quantum is exhausted. It does not necessarily always round-robin.

* Separate the critical section count from td_pri. Add td_critcount.
Bring in all of Joe Talbott's SMP virtual kernel work to date, which brings virtual kernel builds with SMP almost all the way through a full boot.

This work includes:

* Creation of 'cpu' threads via libthread_xu
* Globaldata initialization
* AP synchronization
* Bootstrapping to the idle thread
* SMP pmap (mmu) functions
* IPI handling

My part of this commit:

* Bring all the signal interrupts under DragonFly's machine-independent interrupt handler API. This properly deals with the MP lock and critical section handling.
* Some additional pmap bits to handle SMP invalidation issues.

Submitted-by: Joe Talbott <josepht@cstone.net>
Additional-bits-by: Matt Dillon
Implement struct lwp->lwp_vmspace. This allows vkernels to run threaded and to run emulated VM spaces on a per-thread basis. struct proc->p_vmspace is left intact, making it easy to switch into and out of an emulated VM space. This is needed for the virtual kernel SMP work.

This also gives us the flexibility to run emulated VM spaces in their own threads, or in a limited number of separate threads. Linux does this and they say it improved performance. I don't think it necessarily improved performance, but it's nice to have the flexibility to do it in the future.
* Use SYSREF for vmspace structures. This replaces the vmspace structure's roll-your-own refcnt implementation and replaces its zalloc backing store. Numerous procedures have been added to handle termination and DTOR operations and to properly interlock with vm_exitingcnt, all centered around the vmspace_sysref_class declaration.

* Replace pmap_activate() and pmap_deactivate() with pmap_replacevm(). This replaces numerous instances where roll-your-own deactivate/activate sequences were being used, creating small windows of opportunity in which an update to the kernel pmap would not be visible to running code.

* Properly deactivate pmaps in the teardown code and add assertions to that effect. Cases had to be fixed in cpu_exit_switch(), the exec code, the AIO code, and a few other places.

* Add pmap_puninit(), which is called as part of the DTOR sequence for vmspaces, allowing the kmem mapping and VM object to be recovered. We could not do this with the previous zalloc() implementation.

* Properly initialize the per-cpu sysid allocator (globaldata->gd_sysid_alloc).

Make the following adjustments to the LWP exiting code:

* P_WEXIT interlocks the master exiting thread, eliminating races which can occur when it is signaling the 'other' threads.

* LWP_WEXIT interlocks individual exiting threads, eliminating races which can occur there and streamlining some of the tests.

* Don't bother queueing the last LWP to the reaper. Instead, just leave it in the p_lwps list (but still decrement nthreads), and add code to kern_wait() to reap the last thread. This improves exit/wait performance for unthreaded applications.

* Fix a VMSPACE teardown race in the LWP code. It turns out that it was still possible for the VMSPACE for an exiting LWP to be ripped out from under it by the reaper (due to a conditional that was really supposed to be a loop), or by kern_wait() (due to not waiting for all the LWPs to enter an exiting state). The fix is to have the LWPs PHOLD() the process and then PRELE() it when they are reaped.

This is a little mixed up because the addition of SYSREF revealed a number of other semi-related bugs in the pmap and LWP code which also had to be fixed.
Modify the trapframe, sigcontext, ucontext, etc. Add %gs to the trapframe, and add xflags and an expanded floating point save area to sigcontext/ucontext so traps can be fully specified.

Remove all the %gs hacks in the system code and signal trampoline and handle %gs faults natively, like we do %fs faults.

Implement writebacks to the virtual page table to set VPTE_M and VPTE_A, and add checks for VPTE_R and VPTE_W.

Consolidate the TLS save area into a MD structure that can be accessed by MI code.

Reformulate the vmspace_ctl() system call to allow an extended context to be passed (for TLS info, and soon the FP and eventually the LDT).

Adjust the GDB patches to recognize the new location of %gs.

Properly detect non-exception returns to the virtual kernel when the virtual kernel is running an emulated user process and receives a signal.

And misc other work on the virtual kernel.