kernel - Move CPUMASK_LOCK out of the cpumask_t

* Add cpulock_t (a 32-bit integer on all platforms) and implement CPULOCK_EXCL as well as space for a counter.

* Break out CPUMASK_LOCK: add a new field to the pmap (pm_active_lock) and to the process vmm (p_vmm_cpulock) and implement the mmu interlock there. The VMM subsystem uses additional bits in cpulock_t as a mask counter for implementing its interlock. The PMAP subsystem just uses the CPULOCK_EXCL bit in pm_active_lock for its own interlock.

* Max cpus on 64-bit systems is now 64 instead of 63.

* cpumask_t is now a pure cpu mask and no longer requires all-or-none atomic ops, just normal bit-for-bit atomic ops. This will hopefully allow us to extend it past the 64-cpu limit soon.
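
A minimal sketch of the exclusive interlock, using C11 atomics rather than the kernel's own atomic ops; cpulock_t, CPULOCK_EXCL and pm_active_lock are names from the commit, while the counter mask, the helper names and the spin loop are illustrative:

    #include <stdatomic.h>
    #include <stdint.h>

    typedef uint32_t cpulock_t;

    #define CPULOCK_EXCL    0x80000000U     /* exclusive interlock bit */
    #define CPULOCK_CNTMASK 0x7fffffffU     /* illustrative: space for the VMM counter */

    /* Acquire the exclusive interlock on a lock word such as pm_active_lock. */
    static void
    cpulock_excl_acquire(_Atomic cpulock_t *lk)
    {
        cpulock_t v;

        for (;;) {
            v = atomic_load(lk);
            if ((v & CPULOCK_EXCL) == 0 &&
                atomic_compare_exchange_weak(lk, &v, v | CPULOCK_EXCL))
                return;
            /* spin; the real kernel can pause or yield here */
        }
    }

    static void
    cpulock_excl_release(_Atomic cpulock_t *lk)
    {
        atomic_fetch_and(lk, ~CPULOCK_EXCL);
    }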

kernel - Change the discrete mplock into mp_token

* Use a lwkt_token for the mp_lock. This consolidates our longer-term spinnable locks (the mplock and tokens) into just tokens, making it easier to solve performance issues.

* Some refactoring of the token code was needed to guarantee the ordering when acquiring and releasing the mp_token vs other tokens.

* The thread switch code, lwkt_switch(), is simplified by this change, though not necessarily faster.

* Remove td_mpcount, mp_lock, and other related fields.

* Remove assertions related to td_mpcount and friends, generally folding them into similar assertions for tokens.
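
Conceptually the change reduces the old mplock entry points to operations on a single global token, roughly as below; the wrapper bodies are a sketch of the idea, not the committed code:

    struct lwkt_token;                          /* opaque here */
    void lwkt_gettoken(struct lwkt_token *);    /* the LWKT token API */
    void lwkt_reltoken(struct lwkt_token *);

    extern struct lwkt_token mp_token;          /* the single global token */

    /* The legacy mplock entry points reduce to token operations. */
    static inline void
    get_mplock(void)
    {
        lwkt_gettoken(&mp_token);
    }

    static inline void
    rel_mplock(void)
    {
        lwkt_reltoken(&mp_token);
    }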

kernel - Add support for up to 63 cpus & 512G of ram for 64-bit builds.

* Increase SMP_MAXCPU to 63 for 64-bit builds.

* cpumask_t is 64 bits on 64-bit builds now. It remains 32 bits on 32-bit builds.

* Add #define's for atomic_set_cpumask(), atomic_clear_cpumask(), and atomic_cmpset_cpumask(). Replace all use cases on cpu masks with these functions.

* Add CPUMASK(), BSRCPUMASK(), and BSFCPUMASK() macros. Replace all use cases on cpu masks with these functions. In particular note that (1 << cpu) just doesn't work with a 64-bit cpumask (a short sketch follows this entry). Numerous bits of assembly also had to be adjusted to use e.g. btq instead of btl, etc.

* Change __uint32_t declarations that were meant to be cpu masks to use cpumask_t (most already did). Also change other bits of code which work on cpu masks to be more agnostic. For example, poll_cpumask0 and lwp_cpumask.

* 64-bit atomic ops cannot use "iq", they must use "r", because most x86-64 instructions do NOT have 64-bit immediate value support.

* Rearrange initial kernel memory allocations to start from KvaStart and not KERNBASE, because only 2GB of KVM is available after KERNBASE. Certain VM allocations with > 32G of ram can exceed 2GB; for example, vm_page_array[]. 2GB was not enough.

* Remove numerous mdglobaldata fields that are not used.

* Align CPU_prvspace[] for now. Eventually it will be moved into a mapped area. Reserve sufficient space at MPPTDI now, but it is still unused.

* When pre-allocating kernel page table PD entries, calculate the number of page table pages at KvaStart and at KERNBASE separately, since the KVA space starting at KERNBASE caps out at 2GB.

* Change kmem_init() and vm_page_startup() to not take memory range arguments. Instead the globals (virtual_start and virtual_end) are manipulated directly.
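
The 64-bit-safe mask handling looks roughly like this; the macro bodies below are portable stand-ins for illustration, not the kernel's implementations:

    #include <stdint.h>

    typedef uint64_t cpumask_t;     /* 64 bits on 64-bit builds, 32 on 32-bit builds */

    /*
     * (1 << cpu) is an int-sized shift and silently breaks for cpu >= 32;
     * the constant has to be widened to the mask type first.
     */
    #define CPUMASK(cpu)    ((cpumask_t)1 << (cpu))

    /*
     * Lowest / highest set cpu in a mask.  Portable stand-ins here; the
     * kernel versions map onto bsfq/bsrq, which is the same reason the
     * assembly had to switch from btl to btq, etc.
     */
    #define BSFCPUMASK(mask)   ((int)__builtin_ctzll(mask))
    #define BSRCPUMASK(mask)   (63 - (int)__builtin_clzll(mask))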

kernel - Rewrite the LWKT scheduler's priority mechanism

The purpose of these changes is to begin to address the issue of cpu-bound kernel threads. For example, the crypto threads, or a HAMMER prune cycle that operates entirely out of the buffer cache. These threads tend to hiccup the system, creating temporary lockups, because they never switch away due to their nature as kernel threads.

* Change the LWKT scheduler from a strict hard priority model to a fair-share with hard priority queueing model. A kernel thread will be queued with a hard priority, giving it dibs on the cpu earlier if it has a higher priority. However, if the thread runs past its fair-share quantum it will then become limited by that quantum and other lower-priority threads will be allowed to run.

* Rewrite lwkt_yield() and lwkt_user_yield(), remove uio_yield(). Both yield functions are now very fast and can be called without further timing conditionals, simplifying numerous callers. lwkt_user_yield() now uses the fair-share quantum to determine when to yield the cpu for a cpu-bound kernel thread.

* Implement the new yield in the crypto kernel threads, HAMMER, and other places (many of which already used the old yield functions, which didn't work very well).

* lwkt_switch() now only round-robins after the fair-share quantum is exhausted. It does not necessarily always round-robin.

* Separate the critical section count from td_pri. Add td_critcount.
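
Roughly what the new yield check amounts to, as a userland sketch; td_pri and td_critcount are named in the commit, but td_fairq_ticks, FAIRQ_QUANTUM and the helper itself are illustrative stand-ins for the real scheduler state:

    #include <stdbool.h>

    /* Illustrative fields only; the real ones live in struct thread. */
    struct thread {
        int td_pri;          /* hard LWKT priority, controls queueing order */
        int td_critcount;    /* critical-section nesting, now separate from td_pri */
        int td_fairq_ticks;  /* cpu time consumed in the current fair-share quantum */
    };

    #define FAIRQ_QUANTUM   4    /* ticks; the value here is made up */

    /*
     * A cpu-bound kernel thread calls its yield function inside its loop.
     * The check is cheap: nothing happens until the thread has burned
     * through its fair-share quantum, at which point lower-priority
     * runnable threads get the cpu.
     */
    static bool
    fairq_should_yield(const struct thread *td)
    {
        return (td->td_critcount == 0 && td->td_fairq_ticks >= FAIRQ_QUANTUM);
    }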

kernel - More lwbuf followup work

* Make lwbuf objcache-only, removing all the manual per-cpu allocation tracking. Keep the cpumask stuff. We will deal with the KVM fragmentation issue inside objcache later on.

* This basically takes us back to Sam's original objcache implementation.

* Remove unnecessary assembly symbols (assembly didn't use those globaldata fields). Remove related globaldata fields now that we are back to the objcache-only implementation.

kernel - Introduce lightweight buffers

* Summary: The lightweight buffer (lwbuf) subsystem is effectively a reimplementation of the sfbuf (sendfile buffers) implementation. It was designed to be lighter weight than the sfbuf implementation when possible; on x86_64 we use the DMAP and the implementation is -very- simple. It was also designed to be more SMP friendly.

* Replace all consumption of sfbuf with lwbuf.

* Refactor sfbuf to act as an external refcount mechanism for sendfile(2); this will probably go away eventually as well.
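
A sketch of why the x86_64 path can be so light: with a direct map (DMAP) of physical memory already in place, "mapping" a page for kernel access is pure arithmetic. Everything below (the base address, the structures, lwbuf_kva()) is illustrative; only the DMAP idea itself comes from the commit:

    #include <stdint.h>

    /* No KVA allocation, no TLB shootdowns, nothing to track per cpu. */
    #define DMAP_BASE        0xffff800000000000ULL   /* illustrative address */
    #define PHYS_TO_DMAP(pa) ((void *)(uintptr_t)(DMAP_BASE + (uint64_t)(pa)))

    struct vm_page { uint64_t phys_addr; };
    struct lwbuf   { struct vm_page *m; };

    /* Return a kernel virtual address for the page held by an lwbuf. */
    static void *
    lwbuf_kva(struct lwbuf *lwb)
    {
        return (PHYS_TO_DMAP(lwb->m->phys_addr));
    }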

kernel - Fix some rare pmap races in i386 and x86_64.

* Adjust pmap_inval_init() to enter a critical section and add a new pmap_inval_done() function which flushes and exits it. It was possible for an interrupt or other preemptive action to come along during a pmap operation and issue its own pmap operation, potentially leading to races which corrupt the pmap. This case was tested and could actually occur, though the damage (if any) is unknown. x86_64 machines have had a long-standing and difficult-to-reproduce bug where a program would sometimes seg-fault for no reason. It is unknown whether this fixes the bug or not.

* Interlock the pmap structure when invalidating pages using a bit in the pm_active field (sketched after this entry). Check for the interlock in swtch.s when switching into threads and print a nice warning if it occurs. It was possible for one cpu to initiate a pmap-modifying operation while another switches into a thread using the pmap the first cpu was in the middle of modifying. The case is extremely rare but can occur if the cpu doing the modifying operation receives an SMI interrupt, stalling it long enough for the other cpu to switch into the thread and resume running in userspace.

* pmap_protect() assumed no races when clearing PG_RW and PG_M due to the pmap_inval operations it runs. This should in fact be true with the above fixes. However, the rest of the pmap code uses atomic operations, so adjust pmap_protect() to also use atomic operations.
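
A sketch of the invalidating side of the pm_active interlock, written with C11 atomics; the bit is shown under the CPUMASK_LOCK name used elsewhere in this log, and its position and the helper names are illustrative:

    #include <stdatomic.h>
    #include <stdint.h>

    typedef uint64_t cpumask_t;

    /* One reserved bit in pm_active doubles as the interlock; this is the
     * arrangement the later cpulock_t work splits back out of the mask. */
    #define CPUMASK_LOCK    ((cpumask_t)1 << 63)

    struct pmap { _Atomic cpumask_t pm_active; };

    /* Invalidating cpu: take the interlock before touching the page tables.
     * The switch code checks this bit before running a thread on the pmap
     * (and the committed code prints a warning if it ever sees it set). */
    static void
    pmap_interlock(struct pmap *pm)
    {
        while (atomic_fetch_or(&pm->pm_active, CPUMASK_LOCK) & CPUMASK_LOCK)
            ;    /* extremely rare: spin until the other operation finishes */
    }

    static void
    pmap_deinterlock(struct pmap *pm)
    {
        atomic_fetch_and(&pm->pm_active, ~CPUMASK_LOCK);
    }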

kernel - Move mplock to machine-independent C

* Remove the per-platform mplock code and move it all into machine-independent code: sys/mplock2.h and kern/kern_mplock.c.

* Inline the critical path.

* When a conflict occurs kern_mplock.c will KTR log the file and line number of both the holder and the conflicting acquirer. Set debug.ktr.giant_enable=-1 to enable conflict logging.
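
A rough sketch of what an inlined fast path with an out-of-line contested path looks like; the ownership encoding, the helper names and the recursion handling the real code needs are simplified or made up here, and only the KTR-on-conflict behaviour comes from the commit:

    #include <stdatomic.h>

    extern _Atomic int mp_lock;    /* illustrative: owner cpu id, or -1 when free */
    extern int mycpuid;            /* stand-in for the current cpu id */
    void _get_mplock_contested(const char *file, int line);   /* slow path in kern_mplock.c */

    /*
     * Inline fast path: a single compare-and-set.  Only a conflict drops
     * into the contested path, which is where the file/line of the holder
     * and of the contender get KTR-logged.
     */
    static inline void
    _get_mplock_debug(const char *file, int line)
    {
        int unowned = -1;

        if (!atomic_compare_exchange_strong(&mp_lock, &unowned, mycpuid))
            _get_mplock_contested(file, line);
    }

    #define get_mplock()    _get_mplock_debug(__FILE__, __LINE__)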

Implement struct lwp->lwp_vmspace. Leave p_vmspace intact. This allows vkernels to run threaded and to run emulated VM spaces on a per-thread basis.

struct proc->p_vmspace is left intact, making it easy to switch into and out of an emulated VM space. This is needed for the virtual kernel SMP work.

This also gives us the flexibility to run emulated VM spaces in their own threads, or in a limited number of separate threads. Linux does this and they say it improved performance. I don't think it necessarily improved performance, but it's nice to have the flexibility to do it in the future.

Modify the trapframe, sigcontext, ucontext, etc.

Add %gs to the trapframe, and add xflags and an expanded floating point save area to sigcontext/ucontext so traps can be fully specified. Remove all the %gs hacks in the system code and signal trampoline and handle %gs faults natively, like we do %fs faults.

Implement writebacks to the virtual page table to set VPTE_M and VPTE_A, and add checks for VPTE_R and VPTE_W.

Consolidate the TLS save area into an MD structure that can be accessed by MI code. Reformulate the vmspace_ctl() system call to allow an extended context to be passed (for TLS info and soon the FP and eventually the LDT).

Adjust the GDB patches to recognize the new location of %gs. Properly detect non-exception returns to the virtual kernel when the virtual kernel is running an emulated user process and receives a signal. And misc other work on the virtual kernel.
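
A sketch of the virtual page table writeback idea: when a fault is resolved through a vkernel's virtual page table, permission is checked against the VPTE and the accessed/modified bits are or'd back in atomically. The VPTE_* names are from the commit, but the bit values, the valid bit and the helper are illustrative:

    #include <stdatomic.h>
    #include <stdint.h>

    typedef uint64_t vpte_t;

    /* Bit values are illustrative; only the names come from the commit. */
    #define VPTE_V  0x0001      /* valid (assumed for the sketch) */
    #define VPTE_R  0x0002      /* read allowed */
    #define VPTE_W  0x0004      /* write allowed */
    #define VPTE_A  0x0008      /* accessed, written back on any fault */
    #define VPTE_M  0x0010      /* modified, written back on a write fault */

    /*
     * Check permissions in the virtual pte and write back A/M so the
     * vkernel's pmap sees accurate accessed/modified state.  Returns -1
     * when the vkernel itself must service the fault.
     */
    static int
    vpte_fault_writeback(_Atomic vpte_t *pte, int write)
    {
        vpte_t v = atomic_load(pte);

        if ((v & VPTE_V) == 0 || (write ? (v & VPTE_W) : (v & VPTE_R)) == 0)
            return (-1);
        atomic_fetch_or(pte, write ? (VPTE_M | VPTE_A) : VPTE_A);
        return (0);
    }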

Major kernel build infrastructure changes, part 1/2 (sys).

These changes are primarily designed to create a 2-layer machine and cpu build hierarchy in order to support virtual kernel builds in the near term and future porting efforts in the long term.

* Split arch/ into a set of platform architectures under machine/ and a set of cpu architectures under cpu/. All platform and cpu header files will be accessible via <machine/*.h>. Platform header files may override cpu header files (the platform header file then typically #include's the cpu header file).

* Any cpu header files that are not overridden will be copied directly into /usr/include/machine/, allowing the platform to omit those header files (not have to create degenerate forwarding header files).

* All source files access platform and cpu architecture files via the <machine/*.h> path. The <cpu/*.h> path should only be used by platform header files when including the lower level cpu header files.

* Require both the 'machine' and the 'machine_arch' directives in the kernel config file (an example follows this entry).

* When building modules in the presence of a kernel config, use the IF files, use*.h files, and opt*.h files provided by the kernel config and do not generate them in each module's object directory. This streamlines the module build considerably.
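
A hypothetical kernel config excerpt; the two directive names come from the commit, the values are only examples:

    # select the platform tree (machine/) and the cpu tree (cpu/)
    machine         pc32        # platform architecture (example value)
    machine_arch    i386        # cpu architecture (example value)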

Remove LWKT reader-writer locks (kern/lwkt_rwlock.c).

Remove lwkt_wait queues (only RW locks used them). Convert remaining uses of RW locks to LOCKMGR locks.

In recent months lockmgr locks have been simplified to the point where we no longer need a lighter-weight fully blocking lock. The removal also simplifies lwkt_schedule() in that it no longer needs a special case to deal with wait lists.