kernel - Rearrange struct vmmeter (requires world and kernel build)

* Expand v_lock_name from 16 to 32 bytes.

* Add a v_lock_addr field to go along with v_lock_name. These fields
  report SMP contention.

* Rearrange vmmeter_uint_end to not include v_lock_name or v_lock_addr.

* Clean up the do_vmmeter_pcpu() sysctl code. Remove the useless
  aggregation code and just do a structural copy of the per-cpu
  gd_cnt (struct vmmeter) structure.
kernel - Major refactor of pageout daemon algorithms

* Rewrite a large chunk of the pageout daemon's algorithm to
  significantly improve page selection for pageout on low-memory
  systems.

* Implement persistent markers for hold and active queue scans.
  Instead of moving pages within the queues, we now implement a
  persistent marker and just move the marker instead. This ensures
  100% fair scanning of these queues.

* The pageout state machine is now governed by the following sysctls
  (with some example default settings from a 32G box containing
  8071042 pages):

  vm.v_free_reserved: 20216
  vm.v_free_min: 40419
  vm.v_paging_wait: 80838
  vm.v_paging_start: 121257
  vm.v_paging_target1: 161676
  vm.v_paging_target2: 202095

  And separately:

  vm.v_inactive_target: 484161

  The arrangement is as follows:

  reserved < severe < minimum < wait < start < target1 < target2

* Paging is governed as follows: The pageout daemon is activated when
  FREE+CACHE falls below (v_paging_start). The daemon will free memory
  until FREE+CACHE reaches (v_paging_target1), and then continue to
  free memory more slowly until FREE+CACHE reaches (v_paging_target2).

  If, due to memory demand, FREE+CACHE falls below (v_paging_wait),
  most userland processes will begin short-stalls on VM allocations
  and page faults, and return to normal operation once FREE+CACHE
  rises above (v_paging_wait) (that is, as soon as possible). If, due
  to memory demand, FREE+CACHE falls below (v_paging_min), most
  userland processes will block on VM allocations and page faults
  until the level returns to above (v_paging_wait).

  The hysteresis between (wait) and (start) allows most processes to
  continue running normally during nominal paging activities.

* The pageout daemon operates in batches and then loops as necessary.
  Pages will be moved from CACHE to FREE as necessary, then from
  INACTIVE to CACHE as necessary, then from ACTIVE to INACTIVE as
  necessary.
  Care is taken to avoid completely exhausting any given queue to
  ensure that the queue scan is reasonably efficient.

* The ACTIVE to INACTIVE scan has been significantly reorganized and
  integrated with the page_stats scan (which updates m->act_count for
  pages in the ACTIVE queue). Pages in the ACTIVE queue are no longer
  moved within the lists. Instead, a persistent roving marker is
  employed for each queue.

  The m->act_count test is made against a dynamically adjusted
  comparison variable called vm.pageout_stats_actcmp. When no progress
  is made this variable is increased, and when sufficient progress is
  made this variable is decreased. Thus, under very heavy memory
  loads, a more permissive m->act_count test allows active pages to be
  deactivated more quickly.

* The INACTIVE to FREE+CACHE scan remains relatively unchanged. A
  two-pass LRU arrangement continues to be employed in order to give
  the system time to reclaim a deactivated page before it would
  otherwise get paged out.

* The vm_pageout_page_stats() scan has been almost completely
  rewritten. This scan is responsible for updating m->act_count on
  pages in the ACTIVE queue. Example sysctl settings are shown below:

  vm.pageout_stats_rsecs: 300   <--- passive run time (seconds) after pageout
  vm.pageout_stats_scan: 472    <--- max number of pages to scan per tick
  vm.pageout_stats_ticks: 10    <--- poll rate in ticks
  vm.pageout_stats_inamin: 16   <--- inactive ratio governing dynamic
  vm.pageout_stats_inalim: 4096      adjustment of actcmp
  vm.pageout_stats_actcmp: 2    <--- dynamically adjusted by the kernel

  The page stats code polls slowly and will update m->act_count and
  deactivate pages until it is able to achieve (v_inactive_target)
  worth of pages in the inactive queue. Once this target has been
  reached, the poll stops deactivating pages, but will continue to run
  for (pageout_stats_rsecs) seconds after the pageout daemon last ran
  (typically 5 minutes) and continue to passively update m->act_count
  during this period.
  The polling resumes upon any pageout daemon activation and the cycle
  repeats.

* The vm_pageout_page_stats() scan is mostly responsible for selecting
  the correct pages to move from ACTIVE to INACTIVE. Choosing the
  correct pages allows the system to continue to operate smoothly
  while concurrent paging is in progress. The additional 5 minutes of
  passive operation allows it to pre-stage m->act_count for pages in
  the ACTIVE queue to help grease the wheels for the next pageout
  daemon activation.

TESTING

* On a test box with memory limited to 2GB, running chrome. Video runs
  smoothly despite constant paging. Active tabs appear to operate
  smoothly. Inactive tabs are able to page-in decently fast and resume
  operation.

* On a workstation with 32GB of memory and a large number of open
  chrome tabs, allowed to sit overnight (chrome burns up a lot of
  memory when tabs remain open), then video tested the next day.
  Paging appeared to operate well and so far there has been no
  stuttering.

* On a 64GB build box running dsynth 32/32 (intentionally overloaded).
  The full bulk starts normally. The packages tend to get larger and
  larger as they are built. dsynth and the pageout daemon operate
  reasonably well in this situation. I was mostly looking for
  excessive stalls due to heavy memory loads and it looks like the new
  code handles it quite well.
kernel - Remove P_SWAPPEDOUT flag and paging mode

* This code basically no longer functions in any worthwhile or useful
  manner, so remove it. The code harkens back to a time when machines
  had very little memory and had to time-share processes by actually
  descheduling them for long periods of time (like 20 seconds) and
  paging out the related memory. In modern times the chooser algorithm
  just doesn't work well because we can no longer assume that programs
  with large memory footprints can be demoted.

* In modern times machines have sufficient memory to rely almost
  entirely on the VM fault and pageout scan. The latencies caused by
  fault-ins are usually sufficient to demote paging-intensive
  processes while allowing the machine to continue to function. If
  this functionality needs to be added back in, it can be added back
  in on the fault path and not here.
<sys/slaballoc.h>: Switch to lighter <sys/_malloc.h> header.

The <sys/globaldata.h> header embeds SLGlobalData, which in turn
embeds "struct malloc_type". Adjust several kernel sources for missing
includes where memory allocation is performed. Try to use alphabetical
include order. Now (in most cases) <sys/malloc.h> is included after
<sys/objcache.h>.

Once it gets cleaned up, the <sys/malloc.h> inclusion could be moved
out of <sys/idr.h> to the drm Linux compat layer's linux/slab.h
without side effects.
kernel - VM rework part 20 - Fix vmmeter_neg_slop_cnt

* Fix some serious issues with the vmmeter_neg_slop_cnt calculation.
  The main problem is that this calculation was causing
  vmstats.v_free_min to be recalculated to a much higher value than it
  should have been, resulting in systems starting to page far earlier
  than they should.

  For example, the 128G TR started paging tmpfs data with 25GB of free
  memory, which was not intended. The correct target for that amount
  of memory is more around 3GB.

* Remove vmmeter_neg_slop_cnt entirely and refactor the
  synchronization code to be smarter. It will now synchronize vmstats
  fields whose adjustments exceed -1024, but only if paging would
  actually be needed in the worst-case scenario.

* This algorithm needs low-memory testing and might require more
  tuning.
kernel - VM rework part 16 - Optimization & cleanup pass

* Adjust __exclusive_cache_line to use 128-byte alignment per a
  suggestion by mjg. Use this for the global vmstats.

* Add the vmmeter_neg_slop_cnt global, which is a more generous
  dynamic calculation versus the static -VMMETER_SLOP_COUNT. The idea
  is to reduce how often vm_page_alloc() synchronizes its per-cpu
  statistics with the global vmstats.
kernel - Expand page count fields to 64 bits

* 32-bit page count fields limit us to 8TB of RAM. Expand to allow up
  to the DMAP limit (32TB). Do an initial pass on various page count
  fields and change them from int's to long's or vm_pindex_t's.

* Fix a 32-bit overflow in the pv_entry initialization code.

  pv_entry_max = shpgperproc * maxproc + vm_page_array_size;
                 2000 * 1046516 + pages_of_phys_memory;

  maxproc is 1046516 @ 512GB. This calculation overflows its 32-bit
  signed variable somewhere between 256G and 512G of RAM. This can
  lead to a zinitna() allocation in pvzone that is much too large.

Reported-by: zrj
kernel - Make certain sysctls unlocked

* Automatically flag all SYSCTL_[U]INT, [U]LONG, and [U]QUAD
  definitions CTLFLAG_NOLOCK. These do not have to be locked. Will
  improve program startup performance a tad.

* Flag a ton of other sysctls used in program startup, and also those
  used by 'ps', CTLFLAG_NOLOCK.

* For kern.hostname, interlock changes using XLOCK and allow the
  sysctl to run NOLOCK, avoiding unnecessary cache line bouncing.
kernel - Break up scheduler and loadavg callout

* Change the scheduler and loadavg callouts from cpu 0 to all cpus,
  and adjust allproc_scan() and alllwp_scan() to segment the hash
  table when asked. Every cpu is now tasked with handling the nominal
  scheduler recalc and nominal load calculation for a portion of the
  process list. The portion is unrelated to which cpu(s) the processes
  are actually scheduled on; it is strictly a way to spread the work
  around, split up by hash range.

* Significantly reduces cpu 0 stalls when a large number of user
  processes or threads are present (that is, in the tens of thousands
  or more). In the test below, before this change, cpu 0 was straining
  under a 40%+ interrupt load (from the callout). After this change
  the load is spread across all cpus, approximately 1.5% per cpu.

* Tested with 400,000 running user processes on a 32-thread
  dual-socket xeon (yes, these numbers are real):

  12:27PM up 8 mins, 3 users, load avg: 395143.28, 270541.13, 132638.33
  12:33PM up 14 mins, 3 users, load avg: 399496.57, 361405.54, 225669.14

* NOTE: There are still a number of other non-segmented allproc scans
  in the system, particularly related to paging and swapping.

* NOTE: Further spreading-out of the work may be needed, by using a
  more frequent callout and a smaller hash index range for each.
kernel - Store page statistics in bytes

* Store page statistics in bytes rather than pages. Pages aren't
  useful for userland display and there is no reason to force userland
  to do the conversion.

* Include a realtime timestamp along with ticks in the structure.

* Flesh out text output for kcollect. Reverse the output order to
  print oldest data first, so output from the -f option stays
  consistent.
kernel - Further refactor vmstats, adjust page coloring algorithm

* Further refactor vmstats by tracking adjustments in
  gd->gd_vmstats_adj and doing a copyback of the global vmstats into
  gd->gd_vmstats. All critical code paths access the localized copy to
  test VM state, removing most cache ping-pongs on the global
  structure. The global structure 'vmstats' still contains the master
  copy.

* Bump PQ_L2_SIZE up to 512. We use this to localize the VM page
  queues. Make some adjustments to the pg_color calculation to reduce
  (in fact, almost eliminate) SMP conflicts on vm_page_queue[] between
  cpus when the VM system is operating normally (not paging).

* This pumps the 4-socket opteron test system up to ~4.5-4.7M page
  faults/sec in testing (using a mmap/bzero/munmap loop on 16MB x N
  processes). This pumps the 2-socket xeon test system up to 4.6M page
  faults/sec with 32 threads (250K/sec on one core, 1M on 4 cores, 4M
  on 16 cores, 5.6M on 32 threads). This is near the theoretical
  maximum possible for this test.

* In this particular page fault test, PC sampling indicates *NO*
  further globals are undergoing cache ping-ponging. The PC sampling
  predominantly indicates pagezero(), which is expected. The Xeon is
  zeroing an aggregate of 22GBytes/sec at 32 threads running normal
  vm_fault's.
kernel - Refactor struct vmstats and vm_zone

* These changes significantly improve the simultaneous non-conflicting
  VM fault rate. On our 4-socket opteron (48 cores, which makes a
  great test case because its cache mastership stalls are so
  expensive), the maximum concurrent VM fault rate increased from
  ~2.4M/sec to ~3.5M/sec, and suffers no degradation after topping
  out.

* Refactor the fields in struct vmstats to separate out mostly
  read-only variables from nominally modified variables, reducing
  cache mastership stalls.

* Remove the vm_shared_hit, vm_shared_count, and vm_shared_miss sysctl
  statistics, removing related cache mastership stalls from the
  critical path.

* Move the spinlock in vpgqueues to the base of the structure.

* Increase the vmstats slop (how large a negative value can accumulate
  in pcpu stats before rolling it up).

* Fix cache mastership stalls in the zalloc() and zfree() paths by
  consolidating the pcpu elements into their own cache-aligned
  structure and giving each pcpu its own znalloc counter.
kernel - Remove most global atomic ops for VM page statistics

* Use a pcpu globaldata->gd_vmstats to update page statistics.

* Hardclock rolls the individual stats into the global vmstats
  structure.

* Force-roll any pcpu stat that goes below -10, to ensure that the
  low-memory handling algorithms still work properly.
kernel - Reduce memory testing and early-boot zeroing

* Reduce the amount of memory testing and early-boot zeroing that we
  do, improving boot times on systems with large amounts of memory.

* Fix a race in the page zeroing count.

* Refactor the VM zeroidle code. Instead of having just one kernel
  thread, have one on each cpu. This significantly increases the rate
  at which the machine can eat up idle cycles to pre-zero pages in the
  cold path, improving performance of hot-path (normal) page
  allocations which request zero'd pages.

* On systems with a lot of cpus there is usually a little idle time
  (e.g. 0.1%) on a few of the cpus, even under extreme loads. At the
  same time, such loads might also imply a lot of zfod faults
  requiring zero'd pages. On our 48-core opteron we see a zfod rate of
  1.0 to 1.5 GBytes/sec and a page-freeing rate of 1.3 - 2.5
  GBytes/sec. Distributing the page zeroing code and eating up these
  minuscule bits of idle time significantly improves the kernel's
  ability to provide a pre-zero'd page (vs having to zero it in the
  hot path). Under the synth test load the kernel was still able to
  provide 400-700 MBytes/sec worth of pre-zero'd pages, whereas before
  this change the kernel was only able to provide 20 MBytes/sec worth.
kernel - proc_token removal pass stage 1/2

* Remove proc_token use from all subsystems except kern/kern_proc.c.

* The token had become mostly useless in these subsystems now that
  process locking is more fine-grained. Do the final wipe of
  proc_token except for allproc/zombproc list use in kern_proc.c.
kernel - Rewrite do_vmtotal and change the way VM statistics are
collected

* The vmtotal sysctl was iterating through all VM objects. This is a
  problem on machines with huge amounts of memory which might have
  millions of VM objects.

* Collect running VM statistics in the swap pager and vm_page modules,
  on a per-cpu basis. Add a struct vmtotal structure to globaldata.

  Active real memory use is how many VM pages are mapped to processes.
  Total real memory use is how many VM pages are allocated, whether
  they are mapped to processes or not. Shared real memory use
  represents VM pages mapped to more than one process. Total virtual
  memory use is total real memory plus allocated swap space. The
  remaining fields are left 0 and not currently supported.

* This represents a more realistic view of memory and VM. In
  particular, totalling up the file sizes of all mmap()'d files is no
  longer a collected statistic because the system really has no way of
  knowing how little or how much of a file is 'active', or even ever
  accessed.

* The vmtotal sysctl (e.g. used by systat -vm 1) now just iterates
  cpus to aggregate gd_vmtotal for VM statistics. This is basically
  O(1) for VM statistics. It still iterates processes (which we will
  want to fix too, eventually), but the main scaling issue was with VM
  objects and that has been fixed.