kernel - Add kmalloc_obj subsystem step 1

* Implement per-zone memory management to kmalloc() in the form of kmalloc_obj() and friends. Currently the subsystem uses the same malloc_type structure but is otherwise distinct from the normal kmalloc(), so to avoid programming mistakes the *_obj() subsystem post-pends '_obj' to malloc_type pointers passed into it. This mechanism will eventually replace objcache.

  This mechanism is designed to greatly reduce fragmentation issues on systems with long uptimes. Eventually the feature will be better integrated and I will be able to remove the _obj stuff.

* This is an object allocator, so the zone must be dedicated to one type of object with a fixed size. All allocations out of the zone are of the object. The allocator is not quite type-stable yet, but will be once existential locks are integrated into the freeing mechanism.

* Implement a mini-slab allocator for management. Since the zones are single-object, similar to objcache, the fixed-size mini-slabs are a lot easier to optimize and much simpler in construction than the main kernel slab allocator.

  Uses a per-zone/per-cpu active/alternate slab with an ultra-optimized allocation path, and a per-zone partial/full/empty list. Also has a globaldata-based per-cpu cache of free slabs. The mini-slab allocator frees slabs back to the same cpu they were originally allocated from in order to retain memory locality over time.

* Implement a passive cleanup poller. This currently polls kmalloc zones very slowly looking for excess full slabs to release back to the global slab cache or the system (if the global slab cache is full). This code will ultimately also handle existential type-stable freeing.

* Fragmentation is greatly reduced due to the distinct zones. Slabs are dedicated to the zone and do not share allocation space with other zones. Also, when a zone is destroyed, all of its memory is cleanly disposed of and there will be no left-over fragmentation.

* Initially use the new interface for the following. These zones tend to or can become quite big:

    vnodes
    namecache (but not related strings)
    hammer2 chains
    hammer2 inodes
    tmpfs nodes
    tmpfs dirents (but not related strings)
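  A rough illustration of the per-zone/per-cpu fast path described above follows. This is a sketch only; the structure layout, field names, and helper are hypothetical, not the committed kmalloc_obj code:

    #include <stddef.h>

    /*
     * Illustrative sketch only.  Each zone keeps a per-cpu active slab;
     * the fast path pops a pre-sized object off its free list with no
     * atomic ops, and only falls back to the alternate slab / zone
     * lists on a miss.
     */
    struct mini_slab {
        struct mini_slab *next;       /* partial/full/empty list linkage */
        void             *free_list;  /* free objects linked via first word */
        int              nfree;
        int              cpuid;       /* cpu the slab was allocated from */
    };

    struct obj_zone_pcpu {
        struct mini_slab *active;     /* current allocation slab */
        struct mini_slab *alternate;  /* swapped in when active empties */
    };

    static void *
    zone_alloc_fastpath(struct obj_zone_pcpu *pc)
    {
        struct mini_slab *sl = pc->active;
        void *obj;

        if (sl && sl->free_list) {
            obj = sl->free_list;
            sl->free_list = *(void **)obj;   /* pop one fixed-size object */
            --sl->nfree;
            return (obj);
        }
        return (NULL);   /* slow path: rotate alternate, refill from zone */
    }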
kernel - Initial commit for Existential structural tracking

* Initial commit for exislock (existence-lock) support, based on cpu tick interlocks. Requires a bit more work. (see also previous commit 'Add missing sys/exislock.h' which wasn't actually missing at the time).

  This is a type-safe critical section plus a cache-friendly soft-lock type for system data structures that allows interlocking against structural existence and usability. The critical path has no atomic operations or cache ping-ponging. Cache ping-ponging occurs at most once per pseudo_tick (once every 2 real ticks) to reset the timeout field.

* Implements a global 'pseudo_ticks' counter. This counter is only able to count when all cpus are armed. All cpus are armed on every 1->0 transition of their mycpu->gd_exislockcnt field. Thus, while the interlock is held, the global pseudo_ticks counter will increment by at most one, then stall.

  The cpus are disarmed when this increment occurs on even ticks and rearmed (if gd_exislockcnt is 0) on odd ticks. Thus the global pseudo_ticks counter increments at roughly hz / 2 under most conditions.

  This means that even when the per-cpu type-safe critical section is under very heavy load, cycling constantly, the global pseudo_ticks still tends to increment on a regular basis because there are a lot of 1->0 transitions occurring on each cpu. Importantly, the cpus do not need to be synchronized in order for pseudo_ticks to increment.

* This codebase will be used to implement type-safe storage and lockless hash-table based caches. It works like this:

  - You use exis_hold() and exis_drop() around the type-safe code sections. While held, the global pseudo_ticks variable is guaranteed to increment no more than once (due to a prior arming condition).

  - You access the hash table or other robust structural topology (with or without locks depending), then call exis_isusable() on the structure you get to determine if it is usable. If TRUE is returned, the structure will remain type-safe and will not be repurposed for the duration of the exis_hold().

    You then proceed to use the structure as needed, with or without further locking depending on what you are doing. For example, accessing stat information from a vnode structure could potentially proceed without further locking. A usage sketch follows.
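  Usage sketch. exis_hold(), exis_drop(), and exis_isusable() are named in the commit above; the hash table, the 'xobj' structure, and the exact argument convention for exis_isusable() are assumptions for illustration:

    /*
     * Sketch: lockless lookup protected by the type-safe critical
     * section.  While held, pseudo_ticks advances at most once, so an
     * object that tests usable cannot be repurposed under us.
     */
    static int
    xobj_get_stats(struct xhash *tab, int key, struct xstats *st)
    {
        struct xobj *obj;
        int found = 0;

        exis_hold();                      /* enter type-safe section */
        obj = xhash_lookup(tab, key);     /* lockless lookup (hypothetical) */
        if (obj && exis_isusable(obj)) {
            /* obj stays type-safe until exis_drop(); copy what we need */
            *st = obj->stats;
            found = 1;
        }
        exis_drop();                      /* leave type-safe section */
        return (found);
    }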
Adjust headers for <machine/stdint.h> visibility.

This also reduces namespace pollution a bit. Include <machine/stdint.h> where <stdint.h> is used too. External compilers under -ffreestanding (__STDC_HOSTED__ == 0) will use their own <stdint.h> version and will not include <machine/stdint.h>.
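A minimal illustration of the rule above (the consumer shown is hypothetical): code that needs both the standard names and the kernel's double-underscore types now includes both headers explicitly, because a freestanding compiler's own <stdint.h> will not pull in <machine/stdint.h>:

    #include <machine/stdint.h>   /* __uint32_t, __int64_t, ... */
    #include <stdint.h>           /* uint32_t, int64_t, ... (compiler's own under -ffreestanding) */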
kernel - Add uid, gid, and inum to stat data for pipes

* fstat(pipefd) now populates additional fields: uid, gid, and inum. In line with other BSDs and Linux. Not sure why any program would use the inum field but... now it's populated.

* Add an anonymous inode allocator to the pcpu structure. No atomic ops required. Basically just does:

    pipe->inum = gd->gd_anoninum++ * ncpus + gd->gd_cpuid + 2;

* Facility can be used for other things as needed.

Suggested-by: mjg
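  Sketch of the allocator idea (the helper itself is hypothetical; gd_anoninum, ncpus, and gd_cpuid are from the commit). Striding the per-cpu counter by ncpus and offsetting by the cpuid keeps each cpu's sequence disjoint, so no atomic op is needed:

    /* Hypothetical helper illustrating the pcpu anonymous inum scheme */
    static __inline ino_t
    anon_inum_alloc(globaldata_t gd)
    {
        /* +2 keeps inum 0 and 1 reserved */
        return (gd->gd_anoninum++ * ncpus + gd->gd_cpuid + 2);
    }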
kernel - Localize [in]activevnodes globals, improve allocvnode

* Move to globaldata, keep globals as rollup statistics.

* We already solved normal active->inactive->active issues in prior work, this change primarily affects vnode termination, such as for unlink operations.

* Enhance allocvnode to reuse a convenient reclaimed vnode if we can find one on the pcpu's inactive list and lock it non-blocking. This reduces unnecessary vnode count bloating.
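  Sketch of the allocvnode reuse scan (simplified; the list and field names are assumed, and ref-count handling is omitted):

    /*
     * Try to find a reclaimed vnode on this cpu's inactive list that we
     * can grab with a non-blocking lock before allocating a new vnode.
     */
    struct vnode *vp;

    TAILQ_FOREACH(vp, &gd->gd_vnode_inactive_list, v_list) {
        if ((vp->v_flag & VRECLAIMED) == 0)
            continue;
        if (vn_lock(vp, LK_EXCLUSIVE | LK_NOWAIT) == 0)
            return (vp);          /* reuse this vnode */
    }
    /* otherwise fall back to allocating a fresh vnode as before */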
kernel - Refactor sysctl locking

* Get rid of the global topology lock. Instead use a pcpu shared lock, and change the XLOCK code (which is barely ever executed) to obtain an exclusive lock on all cpus.

* Add CTLFLAG_NOLOCK, which disables the automatic per-OID sysctl lock.

Suggested-by: mjg (Mateusz Guzik)
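  Sketch of the pcpu shared-lock pattern (lock array and names are illustrative, not the committed code). Normal sysctl operations take only the local cpu's lock shared; the rarely-executed XLOCK path takes every cpu's lock exclusively:

    /* read path: local cpu only, shared */
    lockmgr(&pcpu_sysctl_lock[mycpuid], LK_SHARED);
    /* ... run the sysctl handler ... */
    lockmgr(&pcpu_sysctl_lock[mycpuid], LK_RELEASE);

    /* XLOCK path: exclusive on all cpus */
    for (i = 0; i < ncpus; ++i)
        lockmgr(&pcpu_sysctl_lock[i], LK_EXCLUSIVE);
    /* ... structural change ... */
    for (i = ncpus - 1; i >= 0; --i)
        lockmgr(&pcpu_sysctl_lock[i], LK_RELEASE);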
kernel - Refactor smp collision statistics (2)

* Refactor indefinite_info mechanics. Instead of tracking indefinite loops on a per-thread basis for tokens, track them on a scheduler basis. The scheduler records the overhead while it is live-looping on tokens, but the moment it finds a thread it can actually schedule it stops (then restarts later the next time it is entered), even if some of the other threads still have unresolved tokens.

  This gives us a fairer representation of how many cpu cycles are actually being wasted waiting for tokens.

* Go back to using a local indefinite_info in the lockmgr*(), mutex*(), and spinlock code.

* Refactor lockmgr() by implementing an __inline frontend to interpret the directive. Since this argument is usually a constant, the change effectively removes the switch().

  Use LK_NOCOLLSTATS to create a clean recursion to wrap the blocking case with the indefinite*() API.
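  Sketch of the __inline frontend idea (the backend helper names are hypothetical). When the directive is a compile-time constant the comparisons fold away, which is what effectively removes the switch() from the common path:

    static __inline int
    lockmgr(struct lock *lkp, u_int flags)
    {
        if (__builtin_constant_p(flags)) {
            if (flags == LK_SHARED)
                return (lockmgr_shared(lkp, flags));     /* hypothetical backend */
            if (flags == LK_EXCLUSIVE)
                return (lockmgr_exclusive(lkp, flags));  /* hypothetical backend */
        }
        return (lockmgr_slowpath(lkp, flags));           /* generic dispatcher */
    }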
kernel - Improve tsleep/wakeup queue collisions

* Expand the per-cpu array of TAILQs into an array of structures for tsleep/wakeup operation. The new structure stores up to four idents using a 4-way set-associative algorithm (-1 in ident0 handles overflows), allowing the originating cpu for a wakeup() to implement a second-level filter after the global array's cpumask.

* This filter prevents nearly all possible spurious IPIs that used to occur due to ident hash collisions, even when the hash table size is forced to be relatively small. The code isn't the best in the world, but the IPIs it saves probably blow away the added overhead.

Testing-by: sephe, dillon
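  Sketch of the 4-way filter (structure and field names assumed). Each bucket remembers up to four sleep idents; a bucket whose ident0 slot holds -1 has overflowed, so the filter cannot prove the ident is absent and the IPI must still be sent:

    struct tslpque {
        TAILQ_HEAD(, thread) queue;
        const volatile void *ident0;   /* -1 == overflow */
        const volatile void *ident1;
        const volatile void *ident2;
        const volatile void *ident3;
    };

    static int
    wakeup_ident_maybe_present(const struct tslpque *tq, const void *ident)
    {
        if (tq->ident0 == (const void *)(intptr_t)-1)
            return (1);    /* overflowed: must assume present, send IPI */
        return (tq->ident0 == ident || tq->ident1 == ident ||
                tq->ident2 == ident || tq->ident3 == ident);
    }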
kernel - Break up scheduler and loadavg callout

* Change the scheduler and loadavg callouts from cpu 0 to all cpus, and adjust the allproc_scan() and alllwp_scan() to segment the hash table when asked. Every cpu is now tasked with handling the nominal scheduler recalc and nominal load calculation for a portion of the process list.

  The portion is unrelated to which cpu(s) the processes are actually scheduled on, it is strictly a way to spread the work around, split up by hash range.

* Significantly reduces cpu 0 stalls when a large number of user processes or threads are present (that is, in the tens of thousands or more). In the test below, before this change, cpu 0 was straining under 40%+ interrupt load (from the callout). After this change the load is spread across all cpus, approximately 1.5% per cpu.

* Tested with 400,000 running user processes on a 32-thread dual-socket xeon (yes, these numbers are real):

    12:27PM up 8 mins, 3 users, load avg: 395143.28, 270541.13, 132638.33
    12:33PM up 14 mins, 3 users, load avg: 399496.57, 361405.54, 225669.14

* NOTE: There are still a number of other non-segmented allproc scans in the system, particularly related to paging and swapping.

* NOTE: Further spreading-out of the work may be needed, by using a more frequent callout and smaller hash index range for each.
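  Sketch of the segmentation idea (the helper and the hash-size constant are placeholders, not the committed API). Each cpu's callout walks only its own slice of the process hash table, so the per-tick work is divided across ncpus instead of all landing on cpu 0:

    static void
    schedcpu_callout(void *arg)
    {
        int cpuid = mycpu->gd_cpuid;
        int lo = PROC_HASH_SIZE * cpuid / ncpus;        /* placeholder constant */
        int hi = PROC_HASH_SIZE * (cpuid + 1) / ncpus;

        /* scan only hash buckets [lo, hi) from this cpu (placeholder helper) */
        allproc_scan_segment(schedcpu_stats, NULL, lo, hi);
    }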
kernel - pmap and vkernel work

* Remove the pmap.pm_token entirely. The pmap is currently protected primarily by fine-grained locks and the vm_map lock. The intention is to eventually be able to protect it without the vm_map lock at all.

* Enhance pv_entry acquisition (representing PTE locations) to include a placemarker facility for non-existent PTEs, allowing the PTE location to be locked whether a pv_entry exists for it or not.

* Fix dev_dmmap (struct dev_mmap) (for future use), it was returning a page index for physical memory as a 32-bit integer instead of a 64-bit integer.

* Use pmap_kextract() instead of pmap_extract() where appropriate.

* Put the token contention test back in kern_clock.c for real kernels so token contention shows up as sys% instead of idle%.

* Modify the pmap_extract() API to also return a locked pv_entry, and add pmap_extract_done() to release it. Adjust users of pmap_extract(). (See the usage sketch after this list.)

* Change madvise/mcontrol MADV_INVAL (used primarily by the vkernel) to use a shared vm_map lock instead of an exclusive lock. This significantly improves the vkernel's performance and significantly reduces stalls and glitches when typing in one under heavy loads.

* The new placemarkers also have the side effect of fixing several difficult-to-reproduce bugs in the pmap code, by ensuring that shared and unmanaged pages are properly locked whereas before only managed pages (with pv_entry's) were properly locked.

* Adjust the vkernel's pmap code to use atomic ops in numerous places.

* Rename the pmap_change_wiring() call to pmap_unwire(). The routine was only being used to unwire (and could only safely be called for unwiring anyway). Remove the unused 'wired' and the 'entry' arguments. Also change how pmap_unwire() works to remove a small race condition.

* Fix race conditions in the vmspace_*() system calls which could lead to pmap corruption. Note that the vkernel did not trigger any of these conditions, I found them while looking for another bug.

* Add missing maptypes to procfs's /proc/*/map report.
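  Usage sketch for the reworked pmap_extract() / pmap_extract_done() pair (the exact handle type and argument order are assumptions). The caller now gets the translation while the underlying PTE location stays locked, and releases it explicitly:

    void *handle;
    vm_paddr_t pa;

    pa = pmap_extract(pmap, va, &handle);   /* returns locked pv/handle */
    if (pa != 0) {
        /* safe to use the physical address while the PTE stays locked */
    }
    pmap_extract_done(handle);              /* release the lock */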
vkernel - Restabilize pmap code, redo kqueue, systimer, and console code

* Remove vm_token and add necessary vm_page spin locks to the vkernel's pmap code, improving its stability.

* Separate the systimer interrupt and console tty support from the kqueue subsystem.

    Uses SIGURG for systimer
    Uses SIGIO for kqueue
    Uses SIGALRM for cothread signalling

* The vkernel systimer code now uses a dedicated cothread for timing. The cothread is a bit of a hack at the moment but is a more direct way of handling systimers.

* Attempt to fix user%/sys%/intr%/idle% in the systat -vm and systat -pv output. Still isn't perfect, but it is now more accurate.
kernel - Further refactor vmstats, adjust page coloring algorithm

* Further refactor vmstats by tracking adjustments in gd->gd_vmstats_adj and doing a copyback of the global vmstats into gd->gd_vmstats. All code critical paths access the localized copy to test VM state, removing most cache ping-ponging on the global structure. The global structure 'vmstats' still contains the master copy.

* Bump PQ_L2_SIZE up to 512. We use this to localize the VM page queues. Make some adjustments to the pg_color calculation to reduce (in fact almost eliminate) SMP conflicts on the vm_page_queue[] between cpus when the VM system is operating normally (not paging).

* This pumps the 4-socket opteron test system up to ~4.5-4.7M page faults/sec in testing (using a mmap/bzero/munmap loop on 16MB x N processes). This pumps the 2-socket xeon test system up to 4.6M page faults/sec with 32 threads (250K/sec on one core, 1M on 4 cores, 4M on 16 cores, 5.6M on 32 threads). This is near the theoretical maximum possible for this test.

* In this particular page fault test, PC sampling indicates *NO* further globals are undergoing cache ping-ponging. The PC sampling predominantly indicates pagezero(), which is expected. The Xeon is zeroing an aggregate of 22GBytes/sec at 32 threads running normal vm_fault's.
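  Sketch of the split described in the first bullet (simplified; field choice is illustrative). Critical paths write only the local delta and read only the local copy; a periodic rollup folds the deltas into the global 'vmstats' and copies the result back into each cpu's gd_vmstats:

    /* critical path: account for a freed page, pcpu delta only */
    ++mycpu->gd_vmstats_adj.v_free_count;

    /* critical path: test VM state against the localized copy, no global access */
    if (mycpu->gd_vmstats.v_free_count < mycpu->gd_vmstats.v_free_reserved) {
        /* take the low-memory path */
    }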
kernel - Remove most global atomic ops for VM page statistics

* Use a pcpu globaldata->gd_vmstats to update page statistics.

* Hardclock rolls the individual stats into the global vmstats structure.

* Force-roll any pcpu stat that goes below -10, to ensure that the low-memory handling algorithms still work properly.
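  Sketch of the force-roll rule (simplified and not the committed code; the field and types are illustrative). Updates normally touch only the pcpu delta and hardclock folds it into the global later, but a delta that drifts below -10 is rolled immediately so the global never understates free memory enough to confuse the low-memory tests:

    gd->gd_vmstats.v_free_count += delta;    /* pcpu only, no atomic op */
    if (gd->gd_vmstats.v_free_count < -10) {
        /* force-roll: fold the negative delta into the global now */
        atomic_add_int(&vmstats.v_free_count, gd->gd_vmstats.v_free_count);
        gd->gd_vmstats.v_free_count = 0;
    }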
kernel - Overhaul namecache operations to reduce SMP contention

* Overhaul the namecache code to remove a significant amount of cacheline ping-ponging from the namecache paths. This primarily affects multi-socket systems but also improves multi-core single-socket systems.

  Cacheline ping-ponging in the critical path can constrict a multi-core system to roughly ~1-2M operations per second running through that path. For example, even if looking up different paths or stating different files, even something as simple as a non-atomic ++global_counter seriously derates performance when it is being executed on all cores at once.

  In the simple non-conflicting single-component stat() case, this improves performance from ~2.5M/second to ~25M/second on a 4-socket 48-core opteron and has a similar improvement on a 2-socket 32-thread xeon, as well as significantly improves namecache perf on single-socket multi-core systems.

* Remove the vfs.cache.numcalls and vfs.cache.numchecks debugging counters. These global counters caused significant cache ping-ponging and were only being used for debugging.

* Implement a poor-man's referenced-structure pcpu cache for struct mount and struct namecache. This allows atomic ops on the ref-count for these structures to be avoided in certain critical path cases. For now limit to ncdir and nrdir (nrdir particularly, which is usually the same across nearly all processes in the system). Eventually we will want to expand this cache to handle more cases.

  Because we are holding refs persistently, add a bit of infrastructure to clear the cache as necessary (e.g. when doing an unmount).

* Shift the 'cachedvnodes' global to a per-cpu accumulator, then roll-up the counter back to the global approximately once per second. The code critical paths adjust only the per-cpu accumulator, removing another global cache ping-pong from nearly all vnode and nlookup paths.

* The nlookup structure now 'Borrows' the ucred reference from td->td_ucred instead of crhold()ing it, removing another global ref/unref from all nlookup paths.

* We have a large hash table of spinlocks for nchash, add a little pad from 24 to 32 bytes. It's ok that two spin locks share the same cache line (it's a huge table), adding the pad cleans up cacheline-crossing cases.

* Add a bit of pad to put mount->mnt_refs on its own cache-line versus prior fields which are accessed shared. But don't bother isolating it completely.
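  Sketch of the "poor-man's" pcpu ref cache idea (structure and helper names are illustrative, not the committed code). Each cpu keeps a persistent reference on a hot namecache pointer (e.g. nrdir); as long as the cached pointer matches the one being used, no atomic ref/unref touches the shared structure:

    struct pcpu_ncache_ref {
        struct namecache *ncp;    /* persistently referenced, or NULL */
    };

    static struct namecache *
    cache_hold_pcpu(struct pcpu_ncache_ref *pc, struct namecache *ncp)
    {
        if (pc->ncp == ncp)
            return (ncp);          /* hit: ref already held, no atomic op */
        if (pc->ncp)
            cache_drop_ref(pc->ncp);   /* illustrative drop helper */
        cache_hold_ref(ncp);           /* illustrative hold helper */
        pc->ncp = ncp;
        return (ncp);
    }

  On unmount (or similar teardown) the cached entries have to be dropped, which is the clearing infrastructure the commit mentions.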
kernel - Refactor cpu localization for VM page allocations

* Change how cpu localization works. The old scheme was extremely unbalanced in terms of vm_page_queue[] load. The new scheme uses cpu topology information to break the vm_page_queue[] down into major blocks based on the physical package id, minor blocks based on the core id in each physical package, and then by 1's based on (pindex + object->pg_color).

  If PQ_L2_SIZE is not big enough such that 16-way operation is attainable by physical and core id, we break the queue down only by physical id.

  Note that the core id is a real core count, not a cpu thread count, so an 8-core/16-thread x 2 socket xeon system will just fit in the 16-way requirement (there are 256 PQ_FREE queues).

* When a particular queue does not have a free page, iterate nearby queues starting at +/- 1 (before, we started at +/- PQ_L2_SIZE/2), in an attempt to retain as much locality as possible. This won't be perfect but it should be good enough.

* Also fix an issue with the idlezero counters.
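  Sketch of the queue selection described above (the real pg_color arithmetic differs in detail; names and the even-division assumption are illustrative). The physical package id picks a major block of queues, the core id a minor block within it, and (pindex + pg_color) walks by 1's inside that block:

    static int
    vm_page_select_queue(int phys_id, int core_id, int nphys, int ncores,
                         long pindex, int pg_color)
    {
        int block = PQ_L2_SIZE / nphys;      /* queues per physical package */
        int sub   = block / ncores;          /* queues per core (assumes even fit) */
        int base  = phys_id * block + core_id * sub;

        return (base + (int)((pindex + pg_color) % sub));
    }

  On a miss, the allocator then probes base +/- 1, +/- 2, ... rather than jumping half the queue space away, keeping the fallback pages as local as possible.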