kernel - Rearrange struct vmmeter (requires world and kernel build)

* Expand v_lock_name from 16 to 32 bytes.
* Add a v_lock_addr field to go along with v_lock_name. These fields report SMP contention (see the layout sketch below).
* Rearrange vmmeter_uint_end so that it does not include v_lock_name or v_lock_addr.
* Clean up the do_vmmeter_pcpu() sysctl code. Remove the useless aggregation code and just do a structural copy of the per-cpu gd_cnt (struct vmmeter) structure.
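
A rough sketch of the resulting layout. v_lock_name and v_lock_addr come from the change itself; the surrounding counters and the exact vmmeter_uint_end convention are illustrative assumptions:

    struct vmmeter {
            u_int   v_swtch;            /* u_int counters ... */
            /* ... more u_int counters ... */
            u_int   vmmeter_uint_end;   /* end-of-u_int-range marker */
            /*
             * Non-u_int reporting fields now sit past the u_int range
             * so bulk counter operations skip them.
             */
            char    v_lock_name[32];    /* contended lock name (was 16) */
            void    *v_lock_addr;       /* address of the contended lock */
    };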

kernel - fbsd kpi support, add sleepq*() API (untested)

* Initial sleepq*() API. We use our tsleep*() API underneath it (see the sketch below). This is a horrible API, so add a note that it should only be used for FreeBSD compat stuff.
  - Add tsleep/wakeup domains to implement the two sleepq*() queues.
  - Track blocking refs per queue in the sleepq API.
  - Do not track individual threads (just let tsleep*()/wakeup*() do its thing).
  - objcache for wchan, 1K hash table for now, and retain a cache of available wchan structures in the hash table (up to 4 per slot).
  - Include the hash-slot spin lock, as FreeBSD compat code will use it for interlock tests.
  - Relax sleepq_signal() a bit, allowing it to wake up more than one thread (the DragonFly wakeup_*_one*() is a bit non-deterministic).
* For now, add discrete fields to the thread structure. It's a bit of bloat, but it's better than dynamically allocating a side-structure. We already use our tsleep*() API and related fields underneath; add a few more needed for tracking the wchan structure, the queue, and the timeout.
* Add an sbintime_t type (as 64-bit ticks), and an sbticks global counter: monotonic ticks since boot, 64 bits.
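
A minimal sketch of how such a shim can sit on top of tsleep(); tsleep() and its signature are real, while the wrapper bodies and the td_sqtimo thread field are illustrative assumptions:

    /* sleepq_wait(): the wchan address doubles as the tsleep ident */
    void
    sleepq_wait(const void *wchan, int pri)
    {
            tsleep(wchan, 0, "sleepq", 0);
    }

    /* timed variant: td_sqtimo stands in for the new timeout field */
    int
    sleepq_timedwait(const void *wchan, int pri)
    {
            return (tsleep(wchan, 0, "sleepq", curthread->td_sqtimo));
    }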

vm: Change 'kernel_map' global to type 'struct vm_map *'

Change the global variable 'kernel_map' from type 'struct vm_map' to a pointer to this struct. This simplifies the code a bit, since all invocations take its address. The change also aligns with NetBSD's 'kernel_map', which is likewise a pointer, and thus helps the porting of NVMM.

No functional changes.
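
A before/after illustration of a typical call site (vm_map_lock() shown for example purposes):

    /* before: kernel_map was a struct, so callers took its address */
    vm_map_lock(&kernel_map);

    /* after: kernel_map is already a 'struct vm_map *' */
    vm_map_lock(kernel_map);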

<sys/sysref.h>: Switch to lighter <sys/_malloc.h> header.

* Make <sys/sysref2.h> a kernel-only header.
* Remove <sys/types.h> includes that follow <sys/param.h> in devfs(5).
* Add <sys/malloc.h> includes where it is actually used in sources.

While there, minor whitespace cleanup.
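
A hedged illustration of the split-header pattern this follows; the exact contents of <sys/_malloc.h> and the M_EXAMPLE type are assumptions:

    /* in a header: only the malloc-type declaration machinery is needed */
    #include <sys/_malloc.h>
    MALLOC_DECLARE(M_EXAMPLE);

    /* in a .c file that actually allocates: pull in the full API */
    #include <sys/malloc.h>
    p = kmalloc(sizeof(*p), M_EXAMPLE, M_WAITOK | M_ZERO);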

kernel - Remove cache ping-pong on common scheduler operations

* Reflect the dfly_curprocmask and dfly_rdyprocmask bits in the scheduler's pcpu structures (see the sketch below). This allows us to reduce global atomic ops that are virtually guaranteed to cause cache ping-ponging.
* sched_yield and token-based yield operations no longer clear the bit in the curprocmask, since they are just yielding (XXX needs work; a later blocking op then might not pull a new process from another cpu).
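
A minimal sketch of the mirroring idea; ATOMIC_CPUMASK_ORBIT() is the real cpumask op, while the dd->dd_flags field and the DFLY_PCPU_CURSET flag are illustrative assumptions:

    /*
     * Test the pcpu mirror first; only fall through to the shared
     * cpumask (and its cache-line ping-pong) when the bit actually
     * changes state.
     */
    if ((dd->dd_flags & DFLY_PCPU_CURSET) == 0) {
            dd->dd_flags |= DFLY_PCPU_CURSET;
            ATOMIC_CPUMASK_ORBIT(dfly_curprocmask, gd->gd_cpuid);
    }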

kernel - Cleanup token code, add simple exclusive priority (2)

* The priority mechanism revealed an issue with lwkt_switch()'s fall-back code when dealing with contended tokens. The code was refusing to schedule a lower-priority thread requesting an exclusive lock on a cpu while another thread on that same cpu was requesting a shared lock. This creates a problem for the exclusive-priority feature. More pointedly, it also creates a fairness problem in the mixed lock-type use case generally.
* Change the mechanism to allow any thread polling on tokens to be scheduled (see the sketch below). The scheduler still iterates in priority order. This imposes a little extra overhead on userspace returns, since a thread might be scheduled that then tries to return to userland without being the designated user thread.
* This also fixes another bug that cropped up recently where a 32-way threaded program would sometimes not quickly schedule to all 32 cpus, sometimes leaving one or two cpus idle for a few seconds.
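
A hedged sketch of the relaxed fall-back loop; gd_tdrunq, td_threadq, and lwkt_getalltokens() exist in the LWKT code, but the loop structure here only illustrates the technique:

    /*
     * Iterate the run queue in priority order and take the first
     * thread whose tokens can all be acquired, regardless of how
     * shared and exclusive requests are mixed on this cpu.
     */
    TAILQ_FOREACH(ntd, &gd->gd_tdrunq, td_threadq) {
            if (lwkt_getalltokens(ntd, 0)) {
                    /* all tokens acquired; switch to this thread */
                    break;
            }
            /* still contended; keep polling lower-priority candidates */
    }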

kernel - Refactor sysctl locking

* Get rid of the global topology lock. Use a pcpu shared lock instead, and change the XLOCK code (which is barely ever executed) to obtain an exclusive lock on all cpus (see the sketch below).
* Add CTLFLAG_NOLOCK, which disables the automatic per-OID sysctl lock.

Suggested-by: mjg (Mateusz Guzik)
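
A minimal sketch of the pcpu read-lock pattern; lockmgr() and the LK_* directives are real, while the sysctl_cpu_lock array name and its sizing are illustrative assumptions:

    struct lock sysctl_cpu_lock[SMP_MAXCPU];

    /* readers take only their own cpu's lock: no shared cache line */
    lockmgr(&sysctl_cpu_lock[mycpuid], LK_SHARED);
    /* ... read ... */
    lockmgr(&sysctl_cpu_lock[mycpuid], LK_RELEASE);

    /* the rare XLOCK path takes every cpu's lock exclusively */
    for (i = 0; i < ncpus; ++i)
            lockmgr(&sysctl_cpu_lock[i], LK_EXCLUSIVE);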

kernel - Refactor smp collision statistics (2)

* Refactor the indefinite_info mechanics. Instead of tracking indefinite loops on a per-thread basis for tokens, track them on a per-scheduler basis. The scheduler records the overhead while it is live-looping on tokens, but the moment it finds a thread it can actually schedule it stops (then restarts later, the next time it is entered), even if some of the other threads still have unresolved tokens. This gives us a fairer representation of how many cpu cycles are actually being wasted waiting for tokens.
* Go back to using a local indefinite_info in the lockmgr*(), mutex*(), and spinlock code.
* Refactor lockmgr() by implementing an __inline frontend to interpret the directive (see the sketch below). Since this argument is usually a constant, the change effectively removes the switch(). Use LK_NOCOLLSTATS to create a clean recursion to wrap the blocking case with the indefinite*() API.
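
A hedged sketch of the __inline frontend idea; LK_TYPE_MASK and the LK_* directives are real, while the _lockmgr_*() backend names are illustrative assumptions:

    /*
     * With a constant 'flags' argument the compiler resolves the
     * switch() at compile time and calls the backend directly.
     */
    static __inline int
    lockmgr(struct lock *lkp, u_int flags)
    {
            switch (flags & LK_TYPE_MASK) {
            case LK_SHARED:
                    return (_lockmgr_shared(lkp, flags));
            case LK_EXCLUSIVE:
                    return (_lockmgr_exclusive(lkp, flags));
            case LK_RELEASE:
                    return (_lockmgr_release(lkp, flags));
            default:
                    return (_lockmgr_other(lkp, flags));
            }
    }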

kernel - Refactor smp collision statistics

* Add an indefinite wait timing API (sys/indefinite.h, sys/indefinite2.h). This interface uses the TSC and records lock latencies to our pcpu stats in microseconds; the systat -pv 1 display shows this under smpcoll (see the usage sketch below). Note that latencies generated by tokens, lockmgr, and mutex locks do not necessarily reflect actual lost cpu time, as the kernel will schedule other threads while those are blocked, if other threads are available.
* Formalize TSC operations more, supplying types (tsc_uclock_t and tsc_sclock_t).
* Reinstrument lockmgr, mutex, token, and spinlocks to use the new indefinite timing interface.
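
A minimal usage sketch of the init/check/done shape such an interface typically takes; beyond the indefinite*() prefix and header names from the commit, the exact signatures and the try_acquire() helper are assumptions:

    indefinite_info_t info;

    indefinite_init(&info, "lock", 1, 'L');
    while (!try_acquire(lk)) {
            /* samples the TSC, accumulating latency into pcpu stats */
            indefinite_check(&info);
    }
    indefinite_done(&info);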

kernel - Fix GCC reordering problem with td_critcount

* Wrap all ++td->td_critcount and --td->td_critcount use cases with an inline which executes cpu_ccfence() before and after, to guarantee that GCC does not try to reorder the operation around critical memory changes (see the sketch below).
* This fixes a race in lockmgr() and possibly a few other places too.
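
A sketch of the wrapper shape this describes; cpu_ccfence() is the real compiler-fence macro, while the wrapper name is an assumption:

    /*
     * cpu_ccfence() is a compiler memory barrier; one on each side
     * pins the increment so GCC cannot migrate it across surrounding
     * critical memory operations.
     */
    static __inline void
    crit_enter_raw(thread_t td)
    {
            cpu_ccfence();
            ++td->td_critcount;
            cpu_ccfence();
    }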

kernel - Fix sys% time reporting

* Fix system time reporting in systat -vm 1, systat -pv 1, and process stats.
* Basically the issue is that when coincident systimer interrupts occur, such as when the statclock, hardclock, and schedclock all fire at the same time, the statclock must execute first in order to properly detect the state the current thread is in. If it does not, it may see a lwkt thread scheduled by one of the other systimers and improperly dock the current thread as being in 'system' time.
* The various systimer interrupts could wind up out of phase and desynchronized due to the tsc_frequency not being perfectly divisible by the requested frequencies. In addition, various timers could queue in an undesirable order due to being different integral frequencies of each other.
* Refactor the systimer API a bit, adding new functions which guarantee synchronization for nominally requested frequencies and which guarantee ordering for coincident systimer events (which statclock uses). This should completely solve the problem (see the sketch below).
* Also, if the RQF_INTPEND flag is set, count the time as interrupt time. This will give us a slightly more accurate understanding of interrupt overhead (alternatively we could do this test just for the case where curthread is the idlethread, which might be more accurate).
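
A hedged sketch of the phase-alignment idea only; sys_cputimer and sysclock_t are real, while the function name and rounding scheme are assumptions:

    /*
     * Derive each periodic systimer's period from the common timer
     * base with identical truncation, so timers whose frequencies are
     * integral multiples of each other stay phase-aligned and fire
     * coincidentally (letting the ordering rules run statclock first).
     */
    static sysclock_t
    systimer_aligned_period(int freq)
    {
            return (sys_cputimer->freq / freq);
    }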

kernel - Fix bottlenecks that develop when many processes are running

* When a large number of processes or threads are running (in the tens of thousands or more), a number of O(n) or O(ncpus) bottlenecks can develop. These bottlenecks do not develop when only a few thousand threads are present. By fixing these bottlenecks, and assuming kern.maxproc is autoconfigured or manually set high enough, DFly can now handle hundreds of thousands of active processes running, polling, sleeping, whatever.

  Tested to around 400,000 discrete processes (no shared VM pages) on a 32-thread dual-socket Xeon system. Each process is placed in a 1/10 second sleep loop using umtx timeouts:

  baseline - (before changes) system bottlenecked starting at around the 30,000 process mark, eating all available cpu with a high IPI rate from hash collisions, and other unrelated user processes bogged down due to the scheduling overhead.
  200,000 processes - system settles down to 45% idle and a low IPI rate.
  220,000 processes - system 30% idle and low IPI rate.
  250,000 processes - system 0% idle and low IPI rate.
  300,000 processes - system 0% idle and low IPI rate.
  400,000 processes - scheduler begins to bottleneck again after the 350,000 mark while the process test is still in its fork/exec loop. Once all 400,000 processes are settled down, the system behaves fairly well: 0% idle with a modest IPI rate averaging 300 IPI/sec/cpu (due to hash collisions in the wakeup code).

* More work will be needed to better handle processes with massively shared VM pages. It should also be noted that the system does a *VERY* good job allocating and releasing kernel resources during this test using discrete processes. It can kill 400,000 processes in a few seconds when I ^C the test.

* Change lwkt_enqueue()'s linear td_runq scan into a double-ended scan (see the sketch after this list). This bottleneck does not arise when large numbers of processes are running in usermode, because typically only one user process per cpu will be scheduled to LWKT. However, it does arise when large numbers of threads are woken up in-kernel. While in-kernel, a thread schedules directly to LWKT. Round-robin operation tends to result in appends to the tail of the queue, so this optimization saves an enormous amount of cpu time when large numbers of threads are present.

* Limit ncallout to ~5 minutes worth of ring. The calculation code is primarily designed to allocate less space on low-memory machines, but it would also cause an excessively-sized ring to be allocated on large-memory machines. 512MB was observed on a 32-way box.

* Remove vm_map->hint, which had basically stopped functioning in a useful manner. Add a new vm_map hinting mechanism that caches up to four (size, align) start addresses for vm_map_findspace(). This cache is used to quickly index into the linear vm_map_entry list before entering the linear search phase. It fixes a serious bottleneck that arises from vm_map_findspace()'s linear scan of the vm_map_entry list when the kernel_map becomes fragmented, typically when the machine is managing a large number of processes or threads (in the tens of thousands or more). This will also reduce overheads for processes with highly fragmented vm_maps.

* Dynamically size the action_hash[] array in vm/vm_page.c. This array is used to record blocked umtx operations. The limited size of the array could result in an excessive number of hash entries when a large number of processes/threads are present in the system. Again, the effect is noticed as the number of threads exceeds a few tens of thousands.
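
A hedged sketch of the double-ended insertion scan; gd_tdrunq, td_threadq, and the TAILQ macros are real, while the priority-ordering details here are illustrative of the technique only:

    /*
     * Walk inward from both ends of the priority-ordered run queue so
     * an insertion near either end (the common wakeup/append case) is
     * found in a few steps instead of a full head-to-tail scan.
     */
    fwd = TAILQ_FIRST(&gd->gd_tdrunq);
    rev = TAILQ_LAST(&gd->gd_tdrunq, lwkt_queue);
    for (;;) {
            if (rev == NULL || td->td_pri <= rev->td_pri) {
                    /* found from the tail side (or queue is empty) */
                    if (rev)
                            TAILQ_INSERT_AFTER(&gd->gd_tdrunq, rev,
                                               td, td_threadq);
                    else
                            TAILQ_INSERT_HEAD(&gd->gd_tdrunq, td,
                                              td_threadq);
                    break;
            }
            if (td->td_pri > fwd->td_pri) {
                    /* found from the head side */
                    TAILQ_INSERT_BEFORE(fwd, td, td_threadq);
                    break;
            }
            fwd = TAILQ_NEXT(fwd, td_threadq);
            rev = TAILQ_PREV(rev, lwkt_queue, td_threadq);
    }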

kernel - Fix excessive call stack depth on stuck interrupt

* Fix an issue where a stuck level interrupt can result in an excessively deep call stack and a possible panic.
* Fixed by disallowing thread preemption when curthread->td_nest_count is >= 2 (see the sketch below). The critical-section count test is not sufficient for the fast-interrupt unpend -> preemption case.
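
A minimal sketch of where such a guard sits in lwkt_preempt(); the function and td_nest_count are real, while the surrounding checks shown are assumptions:

    void
    lwkt_preempt(thread_t ntd, int critcount)
    {
            thread_t td = curthread;

            /*
             * Refuse to preempt when interrupt nesting is already
             * deep; otherwise a stuck level interrupt keeps unpending
             * and growing the call stack until the kernel panics.
             */
            if (td->td_nest_count >= 2) {
                    ++preempt_miss;     /* counter name illustrative */
                    return;
            }
            /* ... existing critical-section and priority tests ... */
    }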