kernel - Change pager interface to pass page index 1/2

* Change the *getpage() API to include the page index as an argument. This allows us to avoid passing any vm_page_t for OBJT_MGTDEVICE VM pages.

  By removing this requirement, the VM system no longer has to pre-allocate a placemarker page for DRM faults, and the DRM system can directly install the page in the pmap without tracking it via a vm_page_t.
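A minimal sketch of the interface shape this describes; the typedefs and argument order are illustrative stand-ins, not the exact DragonFly prototypes:

    /* Sketch only: illustrative stand-ins for the kernel types. */
    #include <stdint.h>

    typedef uint64_t vm_pindex_t;          /* page index within an object */
    typedef struct vm_object *vm_object_t; /* opaque for this sketch */
    typedef struct vm_page   *vm_page_t;   /* opaque for this sketch */

    /* Before: the caller had to supply a vm_page_t, so even OBJT_MGTDEVICE
     * faults needed a pre-allocated placemarker page. */
    typedef int (*pgo_getpage_old_t)(vm_object_t obj, vm_page_t *mpp,
                                     int seqaccess);

    /* After: the page index travels as an explicit argument, so a
     * managed-device pager can service the fault and install the mapping
     * directly in the pmap without ever materializing a vm_page_t. */
    typedef int (*pgo_getpage_new_t)(vm_object_t obj, vm_pindex_t pindex,
                                     vm_page_t *mpp, int seqaccess);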
kernel - VM rework part 15 - Core pmap work, refactor PG_*

* Augment PG_FICTITIOUS. This takes over some of PG_UNMANAGED's previous capabilities. In addition, the pmap_*() API will work with fictitious pages, making mmap() operations (e.g. of the GPU) more consistent.

* Add PG_UNQUEUED. This prevents a vm_page from being manipulated in the vm_page_queues[] in any way. This takes over another feature of the old PG_UNMANAGED flag.

* Remove PG_UNMANAGED.

* Remove PG_DEVICE_IDX. This is no longer relevant; we use PG_FICTITIOUS for all device pages.

* Refactor vm_contig_pg_alloc(), vm_contig_pg_free(), vm_page_alloc_contig(), and vm_page_free_contig(). These functions now set PG_FICTITIOUS | PG_UNQUEUED on the returned pages, and properly clear the bits upon free or if/when a regular (but special contig-managed) page is handed over to the normal paging system. This, combined with making the pmap*() functions work better with PG_FICTITIOUS, is the primary 'fix' for some of DRM's hacks. (A sketch of the flag handling follows.)
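A sketch of the flag discipline in the contig allocator, assuming illustrative helper names and flag values; only the PG_FICTITIOUS and PG_UNQUEUED names mirror the text above:

    /* Sketch only: flag values and helpers are illustrative. */
    #include <stdint.h>

    #define PG_FICTITIOUS  0x0004u  /* not managed by the normal paging system */
    #define PG_UNQUEUED    0x0008u  /* never placed on vm_page_queues[] */

    struct vm_page { uint32_t flags; };

    /* Contig allocation marks pages so the page queues leave them alone. */
    static void contig_mark(struct vm_page *m)
    {
        m->flags |= PG_FICTITIOUS | PG_UNQUEUED;
    }

    /* On free, or when handing a contig-managed page back to the normal
     * paging system, the bits must be cleared again. */
    static void contig_unmark(struct vm_page *m)
    {
        m->flags &= ~(PG_FICTITIOUS | PG_UNQUEUED);
    }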
kernel: Remove numerous #include <sys/thread2.h>.

Most of them were added when we converted spl*() calls to crit_enter()/crit_exit(), almost 14 years ago. We can now remove a good chunk of them again where crit_*() calls are no longer used.

I had to adjust some files that were relying on thread2.h, or on headers that it includes, coming in indirectly via other headers from which it was removed.
kernel - Remove PG_ZERO and zeroidle (page-zeroing) entirely

* Remove the PG_ZERO flag and remove all page-zeroing optimizations, entirely. After doing a substantial amount of testing, these optimizations, which existed all the way back to CSRG BSD, no longer provide any benefit on a modern system.

  - Pre-zeroing a page only takes 80ns on a modern cpu, while vm_fault overhead in general is at least ~1 microsecond.

  - Pre-zeroing a page leads to a cold-cache case on use, forcing the fault source (e.g. a userland program) to actually get the data from main memory in its likely immediate use of the faulted page, reducing performance.

  - Zeroing the page at fault-time is actually more optimal because it does not require any reading of dynamic ram and leaves the cache hot.

  - Multiple synth and build tests show that active idle-time zeroing of pages actually reduces performance somewhat, and incidental allocations of already-zeroed pages (from page-table tear-downs) do not affect performance in any meaningful way.

* Remove bcopyi() and obbcopy() -> collapse into bcopy(). These other versions existed because bcopy() used to be specially-optimized and could not be used in all situations. That is no longer true.

* Remove the bcopy function pointer argument to m_devget(). It is no longer used. This function existed to help support ancient drivers which might have needed a special memory copy to read and write mapped data. It has long been supplanted by BUSDMA.
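To make the cost argument concrete: at 80ns per page-zero against roughly 1000ns of total fault overhead, pre-zeroing can save at most ~8% of a fault, and it trades that for cold cache lines at the moment the program touches the page. A minimal sketch of the fault-time approach, with hypothetical names:

    /* Sketch only: names and the already_zeroed shortcut are illustrative. */
    #include <string.h>
    #include <stdbool.h>

    #define PAGE_SIZE 4096

    /* Zero the page at fault time, when its cache lines are about to be
     * touched anyway, instead of maintaining a pre-zeroed pool. The
     * zero-fill writes leave the lines hot for the faulting program. */
    static void
    fault_zero_fill(void *kva, bool already_zeroed)
    {
        /* Pages recovered from page-table tear-downs may arrive zeroed;
         * per the commit text, exploiting that makes no measurable
         * difference, so the simple path zeroes unless the caller knows
         * better. */
        if (!already_zeroed)
            memset(kva, 0, PAGE_SIZE);
    }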
kernel - Merge Mihai Carabas's VKERNEL/VMM GSOC project into the main tree

* This merge contains work primarily by Mihai Carabas, with some misc fixes also by Matthew Dillon.

* Special note on GSOC core: this is, needless to say, a huge amount of work compressed down into a few paragraphs of comments. Adds the pc64/vmm subdirectory and tons of stuff to support hardware virtualization in guest-user mode, plus the ability for programs (vkernels) running in this mode to make normal system calls to the host.

* Add system call infrastructure for VMM mode operations in kern/sys_vmm.c which vectors through a structure to machine-specific implementations.

  vmm_guest_ctl_args() - bootstrap VMM and EPT modes. Copydown the original user stack for EPT (since EPT 'physical' addresses cannot reach that far into the backing store represented by the process's original VM space). Also installs the GUEST_CR3 for the guest using parameters supplied by the guest.

  vmm_guest_sync_addr_args() - a host helper function that the vkernel can use to invalidate page tables on multiple real cpus. This is a lot more efficient than having the vkernel try to do it itself with IPI signals via cpusync*().

* Add Intel VMX support to the host infrastructure. Again, tons of work compressed down into a one-paragraph commit message. AMD SVM support is not part of this GSOC and not yet supported by DragonFly.

* Remove PG_* defines for PTEs and related mmu operations. Replace with a table lookup so the same pmap code can be used for normal page tables and also EPT tables. (A sketch of the table-lookup idea follows this commit.) Also include X86_PG_V defines specific to normal page tables for a few situations outside the pmap code.

* Adjust DDB to disassemble VMX-related (Intel) instructions.

* Add infrastructure to exit1() to deal with related structures.

* Optimize pfind() and pfindn() to remove the global token when looking up the current process's PID. (Matt)

* Add support for EPT (double layer page tables). This primarily required adjusting the pmap code to use a table lookup to get the PG_* bits. Add an indirect vector for copyin, copyout, and other user address space copy operations to support manual walks when EPT is in use. A multitude of system calls which manually looked up user addresses via the vm_map now need a VMM layer call to translate EPT.

* Remove the MP lock from trapsignal() use cases in trap().

* (Matt) Add pthread_yield()s in most spin loops to help situations where the vkernel is running on more cpus than the host has, and to help with scheduler edge cases on the host.

* (Matt) Add a pmap_fault_page_quick() infrastructure that vm_fault_page() uses to try to shortcut operations and avoid locks. Implement it for pc64. This function checks whether the page is already faulted in as requested by looking up the PTE. If not, it returns NULL and the full-blown vm_fault_page() code continues running.

* (Matt) Remove the MP lock from most of the vkernel's trap() code.

* (Matt) Use a shared spinlock when possible for certain critical paths related to the copyin/copyout path.
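A sketch of the PG_* table-lookup idea: instead of compile-time PG_V/PG_RW/... constants, the pmap consults a per-pmap table, so the same code can emit normal page-table bits or EPT bits. The indices and bit values here are illustrative, not DragonFly's actual pmap_bits[] layout:

    /* Sketch only: table layout and values are illustrative. */
    #include <stdint.h>

    enum { T_PG_V, T_PG_RW, T_PG_U, T_PG_BITS };

    struct pmap {
        uint64_t pmap_bits[T_PG_BITS];
    };

    /* Normal x86-64 PTE encodings. */
    static const uint64_t x86_bits[T_PG_BITS] = {
        [T_PG_V]  = 0x001,   /* present */
        [T_PG_RW] = 0x002,   /* writable */
        [T_PG_U]  = 0x004,   /* user */
    };

    /* EPT encodings differ: read/write/execute live in the low bits. */
    static const uint64_t ept_bits[T_PG_BITS] = {
        [T_PG_V]  = 0x001,   /* read access stands in for 'valid' */
        [T_PG_RW] = 0x002,   /* write access */
        [T_PG_U]  = 0x000,   /* no user/supervisor distinction in EPT */
    };

    /* The same pmap code builds a PTE prototype for either table type. */
    static uint64_t
    pte_proto(const struct pmap *pm, int writable)
    {
        uint64_t pte = pm->pmap_bits[T_PG_V];
        if (writable)
            pte |= pm->pmap_bits[T_PG_RW];
        return pte;
    }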
kernel - Greatly improve shared memory fault rate concurrency / shared tokens

This commit rolls up a lot of work to improve postgres database operations and the system in general. With these changes we can pgbench -j 8 -c 40 on our 48-core opteron monster at 140000+ tps, and the shm vm_fault rate hits 3.1M pps.

* Implement shared tokens. They work as advertised, with some caveats. It is acceptable to acquire a shared token while you already hold the same token exclusively, but you will deadlock if you acquire an exclusive token while you hold the same token shared. Currently exclusive tokens are not given priority over shared tokens, so starvation is possible under certain circumstances. (A minimal usage sketch follows this commit.)

* Create a critical code path in vm_fault() using the new shared token feature to quickly fault-in pages which already exist in the VM cache. pmap_object_init_pt() also uses the new feature. This increases fault-in concurrency by a ridiculously huge amount, particularly on SHM segments (say, when you have a large number of postgres clients). Scaling for large numbers of clients on large numbers of cores is significantly improved. This also increases fault-in concurrency for MAP_SHARED file maps.

* Expand the breadn() and cluster_read() APIs. Implement breadnx() and cluster_readx(), which allow a getblk()'d bp to be passed. If *bpp is not NULL a bp is being passed in, otherwise the routines call getblk().

* Modify the HAMMER read path to use the new API. Instead of calling getcacheblk(), HAMMER now calls getblk() and checks the B_CACHE flag. This gives getblk() a chance to regenerate a fully cached buffer from VM backing store without having to acquire any hammer-related locks, resulting in even faster operation.

* If kern.ipc.shm_use_phys is set to 2 the VM pages will be pre-allocated. This can take quite a while for a large map and also lock the machine up for a few seconds. Defaults to off.

* Reorder the smp_invltlb()/cpu_invltlb() combos in a few places, running cpu_invltlb() last.

* An invalidation interlock might be needed in pmap_enter() under certain circumstances; enable the code for now.

* vm_object_backing_scan_callback() was failing to properly check the validity of a vm_object after acquiring its token. Add the required check plus some debugging.

* Make vm_object_set_writeable_dirty() a bit more cache friendly.

* The vmstats sysctl was scanning every process's vm_map (requiring a vm_map read lock to do so), which can stall for long periods of time when the system is paging heavily. Change the mechanic to a LWP flag which can be tested with minimal locking.

* Have the phys_pager mark the page as dirty too, to make sure nothing tries to free it.

* Remove the spinlock in pmap_prefault_ok(); since we do not delete page table pages it shouldn't be needed.

* Add a required cpu_ccfence() in pmap_inval.c. The code generated prior to this fix was still correct; this makes sure it stays that way.

* Replace several manual wiring cases with calls to vm_page_wire().
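A minimal sketch of the acquisition rule for shared tokens. lwkt_gettoken(), lwkt_gettoken_shared(), and lwkt_reltoken() are the DragonFly token calls; the opaque struct stand-in and the example function are illustrative scaffolding, not kernel code:

    /* Sketch only: the ordering rule, not real kernel code. */
    struct lwkt_token;                     /* opaque stand-in */
    void lwkt_gettoken(struct lwkt_token *);
    void lwkt_gettoken_shared(struct lwkt_token *);
    void lwkt_reltoken(struct lwkt_token *);

    void
    token_order_example(struct lwkt_token *tok)
    {
        /* OK: shared-after-exclusive is permitted. */
        lwkt_gettoken(tok);               /* exclusive */
        lwkt_gettoken_shared(tok);        /* shared while exclusive: fine */
        lwkt_reltoken(tok);
        lwkt_reltoken(tok);

        /* NOT OK: exclusive-after-shared deadlocks. */
        lwkt_gettoken_shared(tok);        /* shared */
        /* lwkt_gettoken(tok); */         /* would deadlock: never upgrade */
        lwkt_reltoken(tok);
    }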
kernel - Major SMP performance patch / VM system, bus-fault/seg-fault fixes

This is a very large patch which reworks locking in the entire VM subsystem, concentrated on VM objects and the x86-64 pmap code. These fixes remove nearly all the spin lock contention for non-threaded VM faults and narrow contention for threaded VM faults to just the threads sharing the pmap.

Multi-socket many-core machines will see a 30-50% improvement in parallel build performance (tested on a 48-core opteron), depending on how well the build parallelizes.

As part of this work a long-standing problem on 64-bit systems, where programs would occasionally seg-fault or bus-fault for no reason, has been fixed. The problem was related to races between vm_fault, the vm_object collapse code, and the vm_map splitting code.

* Most uses of vm_token have been removed. All uses of vm_spin have been removed. These have been replaced with per-object tokens and per-queue (vm_page_queues[]) spin locks. Note in particular that since we still have the page coloring code, the PQ_FREE and PQ_CACHE queues are actually many queues, individually spin-locked, resulting in very excellent MP page allocation and freeing performance.

* Reworked vm_page_lookup() and vm_object->rb_memq. All (object,pindex) lookup operations are now covered by the vm_object hold/drop system, which utilizes pool tokens on vm_objects. Calls now require that the VM object be held in order to ensure a stable outcome. Also added vm_page_lookup_busy_wait(), vm_page_lookup_busy_try(), vm_page_busy_wait(), vm_page_busy_try(), and other API functions which integrate the PG_BUSY handling. (A lookup sketch follows this commit.)

* Added OBJ_CHAINLOCK. Most vm_object operations are protected by the vm_object_hold/drop() facility, which is token-based. Certain critical functions which must traverse backing_object chains use a hard-locking flag and lock almost the entire chain as it is traversed, to prevent races against object deallocation, collapses, and splits. The last object in the chain (typically a vnode) is NOT locked in this manner, so concurrent faults which terminate at the same vnode will still have good performance. This is important e.g. for parallel compiles which might be running dozens of the same compiler binary concurrently.

* Created a per-vm_map token and removed most uses of vmspace_token.

* Removed the mp_lock in sys_execve(). It has not been needed in a while.

* Add kmem_lim_size(), which returns approximate available memory (reduced by available KVM), in megabytes. This is now used to scale up the slab allocator cache and the pipe buffer caches to reduce unnecessary global kmem operations.

* Rewrote vm_page_alloc(), various bits in vm/vm_contig.c, the swapcache scan code, and the pageout scan code. These routines were rewritten to use the per-queue spin locks.

* Replaced the exponential backoff in the spinlock code with something a bit less complex and cleaned it up.

* Restructured the IPIQ func/arg1/arg2 array for better cache locality. Removed the per-queue ip_npoll and replaced it with a per-cpu gd_npoll, which is used by other cores to determine if they need to issue an actual hardware IPI or not. This reduces hardware IPI issuance considerably (and the removal of the decontention code reduced it even more).

* Temporarily removed the lwkt thread fairq code and disabled a number of features. These will be worked back in once we track down some of the remaining performance issues. Temporarily removed the lwkt thread resequencer for tokens for the same reason.
  This might wind up being permanent. Added splz_check()s in a few critical places.

* Increased the number of pool tokens from 1024 to 4001 and went to a prime-number mod algorithm to reduce overlaps. (A hashing sketch follows this commit.)

* Removed the token decontention code. This was a bit of an eyesore, and while it did its job when we had global locks it just gets in the way now that most of the global locks are gone. Replaced the decontention code with a fallback which acquires the tokens in sorted order, to guarantee that deadlocks will always be resolved eventually in the scheduler.

* Introduced a simplified spin-for-a-little-while function _lwkt_trytoken_spin() that the token code now uses rather than giving up immediately.

* The vfs_bio subsystem no longer uses vm_token and now uses the vm_object_hold/drop API for buffer cache operations, resulting in very good concurrency.

* Gave the vnode its own spinlock instead of sharing vp->v_lock.lk_spinlock, which fixes a deadlock.

* Adjusted all platform pmap.c's to handle the new main kernel APIs. The i386 pmap.c is still a bit out of date but should be compatible.

* Completely rewrote very large chunks of the x86-64 pmap.c code. The critical path no longer needs pmap_spin, but pmap_spin itself is still used heavily, particularly in the pv_entry handling code. A per-pmap token and per-pmap object are now used to serialize pmap access and vm_page lookup operations when needed. The x86-64 pmap.c code now uses only vm_page->crit_count instead of both crit_count and hold_count, which fixes races against other parts of the kernel that use vm_page_hold(). _pmap_allocpte() mechanics have been completely rewritten to remove potential races. Much of pmap_enter() and pmap_enter_quick() has also been rewritten. Many other changes.

* The following subsystems (and probably more) no longer use the vm_token or vmobj_token in critical paths:

  x The swap_pager now uses the vm_object_hold/drop API instead of vm_token.
  x mmap() and vm_map/vm_mmap in general now use the vm_object_hold/drop API instead of vm_token.
  x vnode_pager
  x zalloc
  x vm_page handling
  x vfs_bio
  x umtx system calls
  x vm_fault and friends

* Minor fixes to fill_kinfo_proc() to deal with process scan panics (ps) revealed by recent global lock removals.

* lockmgr() locks no longer support LK_NOSPINWAIT. Spin locks are unconditionally acquired.

* Replaced netif/e1000's spinlocks with lockmgr locks. The spinlocks were not appropriate owing to the large context they were covering.

* Misc atomic ops added.
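A minimal sketch of the held-object lookup discipline described above. vm_object_hold()/vm_object_drop(), vm_page_lookup_busy_wait(), and vm_page_wakeup() are the API names from the commit; the stub declarations only approximate the real kernel signatures:

    /* Sketch only: stubs approximate the kernel declarations. */
    typedef unsigned long vm_pindex_t;
    typedef struct vm_object *vm_object_t;
    typedef struct vm_page   *vm_page_t;

    void      vm_object_hold(vm_object_t obj);
    void      vm_object_drop(vm_object_t obj);
    vm_page_t vm_page_lookup_busy_wait(vm_object_t obj, vm_pindex_t pindex,
                                       int also_m_busy, const char *msg);
    void      vm_page_wakeup(vm_page_t m);   /* releases the busy state */

    vm_page_t
    lookup_and_busy(vm_object_t obj, vm_pindex_t pindex)
    {
        vm_page_t m;

        vm_object_hold(obj);    /* stabilizes the (object,pindex) space */
        m = vm_page_lookup_busy_wait(obj, pindex, 1, "sketch");
        vm_object_drop(obj);    /* page is busied; the hold can end */
        return m;               /* caller vm_page_wakeup()s it when done */
    }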
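The pool-token change is a simple hashing tweak: with a power-of-two table, distinct addresses whose low bits coincide always collide, while a prime modulus spreads them. A sketch under those assumptions, with an illustrative shift and table:

    /* Sketch only: the shift amount and table are illustrative. */
    #include <stdint.h>

    #define POOL_TOKENS 4001  /* prime; was 1024 (a power of two) */

    struct lwkt_token { int t_count; };   /* stand-in body */
    static struct lwkt_token pool_tokens[POOL_TOKENS];

    /* Map a structure address to its pool token. A prime modulus reduces
     * systematic overlap compared to masking with (1024 - 1), where
     * objects allocated at power-of-two strides all land on the same
     * token. The shift discards low bits that alignment makes identical. */
    static struct lwkt_token *
    pool_token_for(const void *addr)
    {
        uintptr_t a = (uintptr_t)addr;
        return &pool_tokens[(a >> 6) % POOL_TOKENS];
    }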
kernel - Numerous VM MPSAFE fixes

* Remove most critical sections from the VM subsystem; these are no longer applicable (vm_token covers the access).

* _pmap_allocpte() for x86-64 - Conditionalize the zeroing of the vm_page after the grab. The grab can race other threads and result in a page which had already been zero'd AND populated with pte's, so we can't just zero it. Use m->valid to determine if the page is actually newly allocated or not. (A sketch follows this commit.)

  NOTE: The 32-bit code already properly zeros the page by detecting whether the pte has already been entered or not. The 64-bit code couldn't do this neatly so we used another method.

* Hold the pmap vm_object in pmap_release() and pmap_object_init_pt() for the x86-64 pmap code. This prevents related loops from blocking on the pmap vm_object when freeing VM pages, which is not expected by the code.

* pmap_copy() for x86-64 needs the vm_token; critical sections are no longer sufficient.

* Assert that PG_MANAGED is set when clearing pte's out of a pmap via the PV entries. The pte's must exist in this case and it's a critical panic if they don't.

* pmap_replacevm() for x86-64 - Adjust newvm->vm_sysref prior to assigning it to p->p_vmspace to handle any potential MP races with other sysrefs on the vmspace.

* faultin() needs p->p_token, not proc_token.

* swapout_procs_callback() needs p->p_token.

* Deallocate the VM object associated with a vm_page after freeing the page instead of before freeing the page. This fixes a potential use-after-refs-transition-to-0 case if a MP race occurs.
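A minimal sketch of the conditional-zeroing fix, with stub types; the grab semantics are abbreviated and the field layout is illustrative:

    /* Sketch only: stubs abbreviate the real kernel interfaces. */
    #include <string.h>

    #define PAGE_SIZE 4096

    struct vm_page { int valid; void *kva; };

    /* After a racy grab, m->valid tells us whether another thread already
     * initialized (and possibly populated) this page-table page. Zeroing
     * unconditionally would wipe live pte's. */
    static void
    allocpte_init(struct vm_page *m)
    {
        if (m->valid == 0) {
            memset(m->kva, 0, PAGE_SIZE);  /* truly new page: zero it */
            m->valid = 1;                  /* mark initialized */
        }
        /* else: page already zero'd and possibly holds pte's; leave it */
    }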
kernel - SWAP CACHE part 13/many - More vm_pindex_t work for vm_objects on i386

* vm_object->size also needs to be a vm_pindex_t, e.g. when mmap()ing regular HAMMER files or block devices, or for HAMMER's own use of block devices, in order to support vm_object operations past the 16TB mark.

* Introduce a 64-bit-friendly trunc_page64() and round_page64(), just to make sure we don't cut off page alignment operations on 64-bit offsets.
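A sketch of what 64-bit-safe page rounding has to look like; the exact DragonFly macro bodies may differ, but the point is that the mask must be widened to 64 bits or the upper half of the offset can be cut off:

    #include <stdint.h>

    #define PAGE_SHIFT 12
    #define PAGE_MASK  ((1 << PAGE_SHIFT) - 1)   /* int-width mask */

    /* An int-width ~PAGE_MASK applied to a 64-bit offset is a trap if the
     * mask ends up unsigned: it zero-extends and truncates the high bits.
     * Widening explicitly avoids the hazard. */
    #define trunc_page64(x) ((x) & ~(int64_t)PAGE_MASK)
    #define round_page64(x) (((x) + PAGE_MASK) & ~(int64_t)PAGE_MASK)

    /* e.g. trunc_page64((int64_t)17 << 40) keeps the full past-16TB offset */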
kernel - SWAP CACHE part 6/many - Refactor swap_pager_freespace()

* Refactor swap_pager_freespace() to use a RB_SCAN() instead of a vm_pindex_t iteration. This is necessary if we intend to allow swap backing store for vnodes, because the related files & VM objects can be huge. This is also generally a good idea in 64-bit mode to help deal with x86_64's massive address space.

* Start adding swap space freeing calls in the OBJT_VNODE handling code and generic VM object handling code.

* Remove various checks for OBJT_SWAP from swap*() and swp*() functions to allow them to be used with OBJT_VNODE objects.

* Add checks for degenerate cases to reduce call overheads, as the swap handling functions are now called for vnode objects too.

* Add assertions for pagers which do not need swap support.
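A sketch of why an RB_SCAN beats index iteration for huge objects: iterating page indices is O(object size), while scanning the red-black tree visits only the swap blocks that actually exist. The structures and callback shapes below are illustrative, not the real swblock definitions:

    /* Sketch only: illustrative structures and callback shapes. */
    struct swblock     { unsigned long swb_index; };
    struct swfree_info { unsigned long begin, end; };

    /* Index iteration touches every possible pindex in the range:
     *   for (pindex = begin; pindex < end; ++pindex)
     *       swp_pager_meta_free(object, pindex);     -- O(range)      */

    /* A range-limiting compare lets the scan skip whole subtrees. */
    static int
    swfree_cmp(struct swblock *swb, void *data)
    {
        struct swfree_info *info = data;
        if (swb->swb_index < info->begin) return -1;  /* left of range  */
        if (swb->swb_index >= info->end)  return  1;  /* right of range */
        return 0;                                     /* in range: visit */
    }

    /* Called once per existing swblock in range -- O(allocated blocks). */
    static int
    swfree_callback(struct swblock *swb, void *data)
    {
        (void)swb; (void)data;   /* free the swap space backing this block */
        return 0;
    }

    /* Invoked roughly as:
     * RB_SCAN(swblock_rb_tree, &object->swblock_root,
     *         swfree_cmp, swfree_callback, &info);                    */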
kernel - SWAP CACHE part 3/many - Rearrange VM pagerops

* Remove pgo_init, pgo_pageunswapped, and pgo_strategy.

* The swap pager was the only consumer of pgo_pageunswapped and pgo_strategy. Since these functions will soon operate on any VM object type and not just OBJT_SWAP, there's no point putting them in pagerops.

* Make swap_pager_strategy() and swap_pager_unswapped() global functions and call them directly.
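The shape of the change, sketched with illustrative declarations: calls that used to indirect through the per-type pagerops table become direct calls, since every object type will route to the swap pager anyway:

    /* Sketch only: declarations are illustrative. */
    struct vm_object; struct vm_page; struct bio;

    /* Before (illustrative): dispatch through the pagerops table, e.g.
     *   (*pagerops->pgo_strategy)(object, bio);                       */

    /* After: global entry points, valid for any VM object type. */
    void swap_pager_strategy(struct vm_object *object, struct bio *bio);
    void swap_pager_unswapped(struct vm_page *m);

    static void
    pageout_io(struct vm_object *object, struct bio *bio)
    {
        swap_pager_strategy(object, bio);   /* direct call, no pagerops hop */
    }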
kernel - SWAP CACHE part 2/many - Remove VM pager lists

* VM pager lists were used to associate handles with VM objects. Only the device_pager actually used them. Store the VM object in cdev_t->si_object instead and remove the device pager's VM pager list.

* phys_pager and swap_pager only use anonymous objects; the VM pager lists were implemented but not used. Assert that the handles are NULL and remove the VM pager lists.

* Remove vm_pager_object_lookup().
kernel - simplify vm pager ops, add pre-faulting for zero-fill pages

* Remove the behind and ahead arguments to struct pagerops->pgo_getpages and pagerops->pgo_haspage. Adjust pgo_getpages() to pgo_getpage(), change *_pager_getpages() to *_pager_getpage(), etc. Add a sequential access flag to the call.

  The VM system is no longer responsible for dealing with read-ahead on pager ops; the individual pagers are now responsible. The vnode pager now specifies the sequential access heuristic based on the hint passed to it. HAMMER uses this hint to issue readaheads via the buffer cache.

* Move, rename, and consolidate pmap_prefault(). Remove this function from all platform sources and place it in vm/vm_fault.c. Add a simple platform-specific pmap_prefault_ok() function to test particular virtual addresses.

* The new prefault code is called vm_prefault(). Enhance the code to also prefault and make writable (when it can) zero-fill pages. The new zero-fill prefault feature improves buildworld times by over 5% by greatly reducing the number of VM faults taken during normal program operation. This particularly helps larger applications and concurrent applications on SMP systems. The code is conditionalized such that small applications (which do not benefit much from prefaulting zero-fill) still run about as fast as they did before. (A sketch of the prefault pattern follows this commit.)

* Fix an issue in vm_fault() where the vm_map was being unlocked before the prefault code was called, when it really needs to be unlocked after the prefault code is called.
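A sketch of the prefault pattern: after the primary fault is resolved, probe a small window of nearby virtual addresses and enter any already-resident (or zero-fill) pages into the pmap so they never fault at all. The names, window size, and one-directional scan are illustrative; only pmap_prefault_ok() is named in the commit:

    /* Sketch only: window size, scan direction, and helpers illustrative. */
    #include <stdbool.h>
    #include <stdint.h>

    #define PAGE_SIZE      4096
    #define PREFAULT_PAGES    8   /* illustrative window */

    typedef uintptr_t vm_offset_t;
    struct pmap;

    /* Platform hook: is this va safe and cheap to prefault? */
    bool pmap_prefault_ok(struct pmap *pmap, vm_offset_t va);
    /* Hypothetical helper: enter a resident or zero-fill page without
     * taking a full vm_fault. */
    void prefault_enter(struct pmap *pmap, vm_offset_t va);

    static void
    vm_prefault_sketch(struct pmap *pmap, vm_offset_t fault_va)
    {
        for (int i = 1; i <= PREFAULT_PAGES; ++i) {
            vm_offset_t va = fault_va + (vm_offset_t)i * PAGE_SIZE;
            if (!pmap_prefault_ok(pmap, va))
                break;              /* stop at the first unsuitable va */
            prefault_enter(pmap, va);
        }
    }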
Change *_pager_allocate() to take off_t instead of vm_ooffset_t. The actual underlying type (a 64-bit signed integer) is the same; recent and upcoming work is standardizing on off_t.

Move object->un_pager.vnp.vnp_size to vnode->v_filesize. As before, the field is still only valid when a VM object is associated with the vnode.