kernel - Implement mlockall() properly

* Implement mlockall()'s MCL_CURRENT, and generally reimplement mlockall()
  with Linux-like expectations (a small userland example follows this
  entry).  The system makes a best effort to allocate and lock the memory
  associated with the process's address space.

* The prior semantics, which disallowed protection changes on locked
  memory, have been removed.  Modern applications assume that protection
  changes work on locked memory, even if a change forces a fault.

* As with Linux, some license is taken: mlockall() will only force-fault
  copy-on-write-flagged anonymous pages present at the time of the call.
  It will not force a copy-on-write operation on unmodified file-backed
  pages that have been mapped MAP_PRIVATE but not yet modified (they still
  represent the file's actual content).  Nor will it force-fault the
  parent process's pages when the parent issues a fork() (which forces all
  anonymous pages in both the parent and child to become copy-on-write).
  Such pages can still take a write fault and be COWed, and the resulting
  newly allocated page will be wired as expected.

Submitted-by: tuxillo
Testing-by: tuxillo, dillon

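A minimal userland illustration of the new semantics (standard POSIX calls
only, nothing DragonFly-specific assumed): fault in an anonymous region,
lock it with MCL_CURRENT, then change its protection, which the old code
would have refused on locked memory:

    #include <sys/mman.h>
    #include <err.h>
    #include <string.h>

    int
    main(void)
    {
            size_t len = 65536;
            char *buf;

            buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_ANON | MAP_PRIVATE, -1, 0);
            if (buf == MAP_FAILED)
                    err(1, "mmap");
            memset(buf, 0x55, len);     /* fault in the anonymous pages */

            /* Best-effort lock of everything currently mapped. */
            if (mlockall(MCL_CURRENT) < 0)
                    err(1, "mlockall");

            /*
             * Protection changes on locked memory are now permitted,
             * even though a later write to the region would have to
             * fault again.
             */
            if (mprotect(buf, len, PROT_READ) < 0)
                    err(1, "mprotect");
            return 0;
    }
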
kernel - Rename vm_map_wire() and vm_map_unwire()

* These names are mutant throwbacks to an earlier age and no longer mean
  what they imply.

* Rename vm_map_wire() to vm_map_kernel_wiring().  This function can wire
  and unwire VM ranges in a vm_map under kernel control.  Userland has no
  say.

* Rename vm_map_unwire() to vm_map_user_wiring().  This function can wire
  and unwire VM ranges in a vm_map under user control.  Userland can
  adjust the user wiring state for pages.

vm: Change 'kernel_map' global to type 'struct vm_map *'

Change the global variable 'kernel_map' from type 'struct vm_map' to a
pointer to this struct.  This simplifies the code a bit, since all
invocations previously had to take its address.  The change also aligns
with NetBSD, whose 'kernel_map' is likewise a pointer, which helps the
porting of NVMM.

No functional changes.

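Schematically, the change looks like this at a call site; example_op() is
a hypothetical consumer used only to illustrate the difference, not code
from the tree:

    /* Before: 'kernel_map' was the structure itself, so every caller
     * had to take its address. */
    struct vm_map kernel_map;
    example_op(&kernel_map);

    /* After: 'kernel_map' is a 'struct vm_map *' (as in NetBSD) and is
     * passed directly. */
    struct vm_map *kernel_map;
    example_op(kernel_map);
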
kernel - Remove MAP_VPAGETABLE

* This will break vkernel support for now, but after a lot of mulling
  there's just no other way forward.  MAP_VPAGETABLE was basically a
  software page-table feature for mmap()s that allowed the vkernel to
  implement page tables without needing hardware virtualization support.

* The basic problem is that the VM system is moving to an extent-based
  mechanism for tracking VM pages entered into PMAPs and is no longer
  indexing individual terminal PTEs with pv_entry's.

  This means that the VM system is no longer able to get an exact list of
  PTEs in PMAPs that a particular vm_page is using.  It just has a flag
  indicating 'this page is in at least one pmap' or 'this page is not in
  any pmaps'.  To track down the PTEs, the VM system must run through the
  extents via the vm_map_backing structures hanging off the related VM
  object.

  This mechanism does not work with MAP_VPAGETABLE.  Short of scanning
  the entire real pmap, the kernel has no way to reverse-index a page
  that might be indirected through MAP_VPAGETABLE.

* We will need actual hardware MMU virtualization to get the vkernel
  working again.

kernel - Start work on a better burst page-fault mechanic

* The vm.fault_quick sysctl is now a burst count.  It still defaults to 1,
  which gives the same operation as before.  Performance is roughly the
  same with values from 1 to 8, as more work needs to be done to optimize
  pmap_enter().

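For experimentation, the burst count can be adjusted from userland.  A
minimal sketch using sysctlbyname(); the sysctl name comes from the entry
above, and it is assumed to be a plain integer sysctl (setting it requires
root):

    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <err.h>
    #include <stdio.h>

    int
    main(void)
    {
            int burst = 8;                  /* try an 8-page fault burst */
            int old;
            size_t oldlen = sizeof(old);

            if (sysctlbyname("vm.fault_quick", &old, &oldlen,
                             &burst, sizeof(burst)) < 0)
                    err(1, "sysctlbyname");
            printf("vm.fault_quick: %d -> %d\n", old, burst);
            return 0;
    }
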
libc - Implement sigblockall() and sigunblockall() (2)

* Clean up the logic a bit.  Store the lwp or proc pointer in the
  vm_map_backing structure and make vm_map_fork() and friends more aware
  of it.

* Rearrange lwp allocation in [v]fork() to make the pointer(s) available
  to vm_fork().

* Put the thread mappings on the lwp's list immediately rather than
  waiting for the first fault, which means that per-thread mappings will
  be deterministically removed on thread exit whether any faults happened
  or not.

* Adjust the vmspace_fork*() functions to not propagate 'dead' lwp
  mappings for threads that won't exist in the forked process.  Only the
  lwp mappings for the thread doing the [v]fork() are retained.

kernel - sigblockall()/sigunblockall() support (per-thread shared page)

* Implement /dev/lpmap, a per-thread RW shared page between userland and
  the kernel.  Each thread in the process receives a unique shared page
  for communication with the kernel when memory-mapping /dev/lpmap and
  can access various variables via this map.

* The current thread's TID is retained for both fork() and vfork().
  Previously it was only retained for vfork().  This avoids userland code
  confusion for any bits and pieces that are indexed based on the TID.

* Implement support for a per-thread block-all-signals feature that does
  not require any system calls (see the next commit to libc).  The
  functions will be called sigblockall() and sigunblockall().

  The lpmap->blockallsigs variable prevents normal signals from being
  dispatched.  They will still be queued to the LWP as per normal.  The
  behavior is not quite that of a signal mask when dealing with global
  signals.

  The low 31 bits represent a recursion counter, allowing recursive use
  of the functions.  The high bit (bit 31) is set by the kernel if a
  signal was prevented from being dispatched.  When userland decrements
  the counter to 0 (the low 31 bits), it can check and clear bit 31, and
  if it was found to be set, userland can then make a dummy 'real' system
  call to cause pending signals to be delivered (a userland sketch
  follows this entry).

  Synchronous TRAPs (e.g. kernel-generated SIGFPE, SIGSEGV, etc.) are not
  affected by this feature and will still be dispatched synchronously.

* PThreads is expected to unmap the mapped page upon thread exit.  The
  kernel will force-unmap the page upon thread exit if pthreads does not.

  XXX needs work - currently, if the page has not been faulted in, the
  kernel has no visibility into the mapping and will not unmap it, but
  neither will it get confused if the address is accessed.  To be fixed
  soon, because programs using LWP primitives instead of pthreads might
  not realize that libc has mapped the page.

* The TID is reset to 1 on a successful exec*().

* On [v]fork(), if an lpmap exists for the current thread, the kernel
  copies the lpmap->blockallsigs value to the lpmap for the new thread in
  the new process.  This way sigblock*() state is retained across the
  [v]fork().

  This feature not only reduces code confusion in userland, it also
  allows [v]fork() to be implemented by the userland program in a way
  that ensures no signal races in either the parent or the new child
  process until it is ready for them.

* The implementation leverages our vm_map_backing extents by having the
  per-thread memory mappings indexed within the lwp.  This allows the lwp
  to remove the mappings when it exits (since not doing so would result
  in a wild pmap entry and kernel memory disclosure).

* The implementation currently delays instantiation of the mapped page(s)
  and some side structures until the first fault.  XXX this will have to
  be changed.

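A hedged sketch of what the userland side of this protocol might look
like, following the bit layout described above.  Only the blockallsigs
field and the bit semantics come from this entry; the
__lpmap_blockallsigs pointer, the headers, and the choice of getpid() as
the dummy system call are illustrative assumptions, not the actual libc
code:

    #include <sys/types.h>
    #include <machine/atomic.h>
    #include <unistd.h>

    /*
     * Assumed: points at this thread's lpmap->blockallsigs word,
     * established when libc mmap()s /dev/lpmap for the thread.
     */
    extern volatile u_int *__lpmap_blockallsigs;

    #define LPMAP_BAS_PENDING 0x80000000U  /* bit 31: kernel deferred a signal */

    void
    sigblockall(void)
    {
            /* Low 31 bits are a recursion counter; just bump it. */
            atomic_add_int(__lpmap_blockallsigs, 1);
    }

    void
    sigunblockall(void)
    {
            u_int res;

            /* Drop one level of recursion; fetchadd returns the old value. */
            res = atomic_fetchadd_int(__lpmap_blockallsigs, -1) - 1;

            /*
             * If the count reached zero and the kernel flagged a deferred
             * signal, clear the flag and make any cheap 'real' system
             * call so the pending signals actually get delivered.
             */
            if (res == LPMAP_BAS_PENDING) {
                    atomic_clear_int(__lpmap_blockallsigs, LPMAP_BAS_PENDING);
                    (void)getpid();
            }
    }
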
kernel - VM rework part 18 - Cleanup

* Significantly reduce the zone limit for pvzone (for pmap pv_entry
  structures).  pv_entry's are no longer allocated on a per-page basis,
  so the limit can be made much smaller.  This also has the effect of
  reducing the per-cpu cache limit, which ultimately stabilizes wired
  memory use for the zone.

* Also reduce the generic per-cpu cache limit for zones.  This only
  really affects the pvzone.

* Make pvzone, mapentzone, and swap_zone __read_mostly.

* Enhance vmstat -z to report current structural use and actual total
  memory use.

* Also clean up the copyright statement for vm/vm_zone.c.  John Dyson's
  original copyright was slightly different from the BSD copyright and
  stipulated no changes, so separate out the DragonFly addendum.

kernel - VM rework part 13 - Core pmap work, stabilize & optimize

* Refactor the vm_page_hash hash again to get a better distribution.

* I tried to only hash shared objects, but this resulted in a number of
  edge cases where program re-use could miss the optimization.

* Add a sysctl vm.page_hash_vnode_only (default off).  If turned on, only
  vm_page's associated with vnodes will be hashed.  This should generally
  not be necessary.

* Refactor vm_page_list_find2() again to avoid all duplicate queue
  checks.  This time I mocked the algorithm up in userland and twisted it
  until it did what I wanted.

* VM_FAULT_QUICK_DEBUG was accidentally left on; turn it off.

* Do not remove the original page from the pmap when vm_fault_object()
  must do a COW.  And just in case this is ever added back in later,
  don't do it using pmap_remove_specific()!!!  Use pmap_remove_pages() to
  avoid the backing scan lock.

  vm_fault_page() will now do this removal (for procfs rwmem), the
  normal vm_fault will of course replace the page anyway, and the umtx
  code uses different recovery mechanisms now and should be ok.

* Optimize vm_map_entry_shadow() for the situation where the old object
  is no longer shared.  Get rid of an unnecessary transient kmalloc() and
  vm_object_hold_shared().

kernel - VM rework part 12 - Core pmap work, stabilize & optimize

* Add tracking for the number of PTEs mapped writeable in md_page.
  Change how PG_WRITEABLE and PG_MAPPED are cleared in the vm_page to
  avoid clear/set races.  This problem occurs because we would have
  otherwise tried to clear the bits without hard-busying the page.  This
  allows the bits to be set with only an atomic op.

  Procedures which test these bits universally do so while holding the
  page hard-busied, and now call pmap_mapped_sync() beforehand to
  properly synchronize the bits.

* Fix bugs related to various counters: pm_stats.resident_count, wiring
  counts, vm_page->md.writeable_count, and vm_page->md.pmap_count.

* Fix bugs related to synchronizing removed PTEs with the vm_page.  Fix
  one case where we were improperly updating (m)'s state based on a lost
  race against a PTE swap-to-0 (pulling the PTE).

* Fix a bug related to the page soft-busying code when the
  m->object/m->pindex race is lost.

* Implement a heuristic version of vm_page_active() which just updates
  act_count unlocked if the page is already in the PQ_ACTIVE queue, or if
  it is fictitious.

* Allow races against the backing scan for pmap_remove_all() and
  pmap_page_protect(VM_PROT_READ).  Callers of these routines for these
  cases expect full synchronization of the page dirty state.  We can
  identify when a page has not been fully cleaned out by checking
  vm_page->md.pmap_count and vm_page->md.writeable_count.  In the rare
  situation where this happens, simply retry.

* Assert that the PTE pindex is properly interlocked in pmap_enter().  We
  still allow PTEs to be pulled by other routines without the interlock,
  but multiple pmap_enter()s of the same page will be interlocked.

* Assert additional wiring count failure cases.

* (UNTESTED) Flag DEVICE pages (dev_pager_getfake()) as being
  PG_UNMANAGED.  This essentially prevents all the various reference
  counters (e.g. vm_page->md.pmap_count and vm_page->md.writeable_count),
  PG_M, PG_A, etc. from being updated.

  The vm_page's aren't tracked in the pmap at all because there is no way
  to find them.  They are 'fake', so without a pv_entry we can't track
  them.  Instead we simply rely on the vm_map_backing scan to manipulate
  the PTEs.

* Optimize the new vm_map_entry_shadow() to use a shared object token
  instead of an exclusive one.  OBJ_ONEMAPPING will be cleared with the
  shared token.

* Optimize single-threaded access to pmaps to avoid pmap_inval_*()
  complexities.

* Apply __read_mostly to more globals.

* Optimize pmap_testbit(), pmap_clearbit(), and pmap_page_protect().
  Pre-check vm_page->md.writeable_count and vm_page->md.pmap_count for an
  easy degenerate return before doing real work.

* Optimize pmap_inval_smp() and pmap_inval_smp_cmpset() for the
  single-threaded pmap case, when called on the same CPU the pmap is
  associated with.  This allows us to use simple atomics and cpu_*()
  instructions and avoid the complexities of the pmap_inval_*()
  infrastructure.

* Randomize the page queue used in bio_page_alloc().  This does not
  appear to hurt performance (e.g. heavy tmpfs use) on large many-core
  NUMA machines, and it makes vm_page_alloc()'s job easier.

  This change might have a downside for temporary files, but for
  longer-lasting files there's no point allocating pages localized to a
  particular cpu.

* Optimize vm_page_alloc().

  (1) Refactor the _vm_page_list_find*() routines to avoid re-scanning
      the same array indices over and over again when trying to find a
      page.

  (2) Add a heuristic, vpq.lastq, for each queue, which we set if a
      _vm_page_list_find*() operation had to go far afield to find its
      page.  Subsequent finds will skip to the far-afield position until
      the current CPU's queues have pages again.

  (3) Reduce PQ_L2_SIZE from an extravagant 2048 entries per queue down
      to 1024.  The original 2048 was meant to provide 8-way
      set-associativity for 256 cores but wound up reducing performance
      due to longer index iterations.

* Refactor the vm_page_hash[] array.  This array is used to shortcut
  vm_object locks and locate VM pages more quickly, without locks.  The
  new code limits the size of the array to something more reasonable,
  implements a 4-way set-associative replacement policy using 'ticks',
  and rewrites the hashing math (a sketch follows this entry).

* Effectively remove pmap_object_init_pt() for now.  In current tests it
  does not actually improve performance, probably because it may map
  pages that are not actually used by the program.

* Remove vm_map_backing->refs.  This field is no longer used.

* Remove more of the old, now-stale code related to the use of pv_entry's
  for terminal PTEs.

* Remove more of the old shared page-table-page code.  This worked but
  could never be fully validated and was prone to bugs, so remove it.  In
  the future we will likely use larger 2MB and 1GB pages anyway.

* Remove pmap_softwait()/pmap_softhold()/pmap_softdone().

* Remove more #if 0'd code.

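A rough sketch of the 4-way set-associative idea described above for
vm_page_hash[].  The entry layout, array size, and hash math here are
placeholders, kernel context is assumed (vm_page_t, vm_object_t, the
'ticks' global), and the real code additionally has to validate and
manage these entries safely without locks:

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <vm/vm.h>
    #include <vm/vm_object.h>
    #include <vm/vm_page.h>

    #define VM_PAGE_HASH_SIZE   (64 * 1024)  /* placeholder, power of 2 */
    #define VM_PAGE_HASH_SETS   4            /* 4-way set associative */

    /* Placeholder entry layout; the real structure differs. */
    struct vm_page_hash_elm {
            vm_page_t       m;
            int             ticks;   /* last-use time, for replacement */
    };

    static struct vm_page_hash_elm vm_page_hash[VM_PAGE_HASH_SIZE];

    /* Placeholder hash: returns the base index of a 4-entry set. */
    static __inline int
    vm_page_hash_index(vm_object_t object, vm_pindex_t pindex)
    {
            uint64_t h;

            h = (uint64_t)(uintptr_t)object ^
                ((uint64_t)pindex * 0x9E3779B97F4A7C15ULL);
            h = (h >> 6) & (VM_PAGE_HASH_SIZE - 1);
            return ((int)h & ~(VM_PAGE_HASH_SETS - 1));
    }

    /* Lockless lookup of (object, pindex); refresh 'ticks' on a hit. */
    static vm_page_t
    vm_page_hash_find(vm_object_t object, vm_pindex_t pindex)
    {
            struct vm_page_hash_elm *set;
            int i;

            set = &vm_page_hash[vm_page_hash_index(object, pindex)];
            for (i = 0; i < VM_PAGE_HASH_SETS; ++i) {
                    vm_page_t m = set[i].m;

                    if (m && m->object == object && m->pindex == pindex) {
                            set[i].ticks = ticks;   /* mark recently used */
                            return (m);
                    }
            }
            return (NULL);  /* caller falls back to the locked path */
    }

    /* Cache a page by overwriting the oldest entry in its set. */
    static void
    vm_page_hash_enter(vm_page_t m)
    {
            struct vm_page_hash_elm *set, *victim;
            int i;

            set = &vm_page_hash[vm_page_hash_index(m->object, m->pindex)];
            victim = &set[0];
            for (i = 1; i < VM_PAGE_HASH_SETS; ++i) {
                    if (set[i].ticks < victim->ticks)
                            victim = &set[i];   /* oldest entry in the set */
            }
            victim->m = m;
            victim->ticks = ticks;
    }
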
kernel - VM rework part 11 - Core pmap work to remove terminal PVs

* Remove the pv_entry_t's belonging to terminal PTEs.  The pv_entry's for
  PT, PD, PDP, and PML4 remain.  This reduces kernel memory use for
  pv_entry's by 99%.  The pmap code now iterates vm_object->backing_list
  (of vm_map_backing structures) to run down pages for various operations
  (a sketch follows this entry).

* Remove vm_page->pv_list.  This was one of the biggest sources of
  contention for shared faults.  However, in this first attempt I am
  leaving all sorts of ref-counting intact, so the contention has not
  been entirely removed yet.

* Current hacks:

  - Dynamic page table page removal is currently disabled because the
    vm_map_backing scan needs to be able to deterministically run down
    PTE pointers.  Removal only occurs at program exit.

  - PG_DEVICE_IDX probably isn't being handled properly yet.

  - Shared page faults are not yet optimized.

* So far, minor improvements in performance across the board.  This is
  relatively unoptimized.  The buildkernel test improves by 2% and the
  zero-fill fault test improves by around 10%.  Kernel memory use is
  improved (reduced) enormously.

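A hedged sketch of how a pmap operation might run down a page via the
object's backing list, per the description above.  The backing_list name
and the ba.pmap/ba.start/ba.end/ba.offset fields come from these entries;
the TAILQ member name, the assumption that the caller holds the object
spinlock, and the offset math details are illustrative, not the actual
tree code:

    #include <sys/param.h>
    #include <sys/queue.h>
    #include <vm/vm.h>
    #include <vm/vm_map.h>
    #include <vm/vm_object.h>
    #include <vm/vm_page.h>

    /*
     * For each vm_map_backing hanging off the page's object, compute the
     * virtual address the page would have in that mapping and operate on
     * the PTE in that pmap.  The object spinlock is assumed to be held
     * by the caller.
     */
    static void
    example_page_rundown(vm_page_t m)
    {
            vm_object_t object = m->object;
            struct vm_map_backing *ba;
            vm_ooffset_t pgoff;
            vm_offset_t va;

            pgoff = (vm_ooffset_t)m->pindex << PAGE_SHIFT;

            TAILQ_FOREACH(ba, &object->backing_list, entry) {
                    /*
                     * ba->offset is the absolute object offset at
                     * ba->start (see VM rework part 8 below), so first
                     * check whether this mapping covers the page at all.
                     */
                    if (pgoff < ba->offset ||
                        pgoff >= ba->offset + (ba->end - ba->start))
                            continue;
                    va = ba->start + (vm_offset_t)(pgoff - ba->offset);

                    /* ... inspect/modify the PTE for va in ba->pmap ... */
            }
    }
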
kernel - VM rework part 9 - Precursor work for terminal pv_entry removal

* Clean up the API a bit.

* Get rid of pmap_enter_quick().

* Remove unused procedures.

* Document that vm_page_protect() (and thus the related
  pmap_page_protect()) must be called with a hard-busied page.  This
  ensures that the operation does not race a new pmap_enter() of the
  page.

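A minimal sketch of the required calling pattern, assuming DragonFly's
usual page busy/unbusy helpers (treat the exact helper names and
signatures as assumptions):

    #include <sys/param.h>
    #include <vm/vm.h>
    #include <vm/vm_page.h>

    /*
     * Downgrade all mappings of a page to read-only.  The page must be
     * hard-busied around vm_page_protect() so the operation cannot race
     * a concurrent pmap_enter() of the same page.
     */
    static void
    example_make_readonly(vm_page_t m)
    {
            vm_page_busy_wait(m, FALSE, "pgprot");  /* acquire hard-busy */
            vm_page_protect(m, VM_PROT_READ);
            vm_page_wakeup(m);                      /* release hard-busy */
    }
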
kernel - VM rework part 8 - Precursor work for terminal pv_entry removal

* Adjust structures so the pmap code can iterate backing_ba's with just
  the vm_object spinlock.

  Add a ba.pmap back-pointer.

  Move entry->start and entry->end into the ba (ba.start, ba.end).  This
  replicates the base entry->ba.start and entry->ba.end, but local
  modifications are locked by individual objects to allow pmap ops to
  just look at backing ba's iterated via the object.

  Remove the entry->map back-pointer.

  Remove the ba.entry_base back-pointer.

* ba.offset is now an absolute offset and not additive.  Adjust all code
  that calculates and uses ba.offset (fortunately it is all concentrated
  in vm_map.c and vm_fault.c).

* Refactor ba.start/offset/end modifications to be atomic with the
  necessary spin-locks so the pmap code can safely iterate the
  vm_map_backing list for a vm_object.

* Test the VM system with a full synth run.

kernel - VM rework part 7 - Initial vm_map_backing index

* Implement a TAILQ and hang vm_map_backing structures off of the related
  object.  This feature is still in progress and will eventually be used
  to allow pmaps to manipulate vm_page's without pv_entry's.

  At the same time, remove all sharing of vm_map_backing.  For example,
  clips no longer share the vm_map_backing.  We can't share the
  structures if they are being used to itemize areas for pmap management.

  TODO - reoptimize this at some point.

  TODO - not yet quite deterministic enough for pmap searches (due to
  clips).

* Refactor vm_object_reference_quick() to again allow operation on any
  vm_object whose ref_count is already at least 1, or which belongs to a
  vnode.  The ref_count is no longer being used for complex vm_object
  collapse, shadowing, or migration code.  This allows us to avoid a
  number of unnecessary token grabs on objects during clips, shadowing,
  and forks.

* Clean up a few fields in vm_object.  Name TAILQ_ENTRY() elements
  blahblah_entry instead of blahblah_list.

* Fix an issue with a.out binaries (which are still supported but nobody
  uses) where the object refs on the binaries were not being properly
  accounted for.