kernel - Change pager interface to pass page index 1/2

* Change the *getpage() API to include the page index as an argument.
  This allows us to avoid passing any vm_page_t for OBJT_MGTDEVICE
  VM pages.

  By removing this requirement, the VM system no longer has to
  pre-allocate a placemarker page for DRM faults, and the DRM system
  can directly install the page in the pmap without tracking it via
  a vm_page_t.
kernel - VM rework part 15 - Core pmap work, refactor PG_*

* Augment PG_FICTITIOUS.  This takes over some of PG_UNMANAGED's
  previous capabilities.  In addition, the pmap_*() API will work with
  fictitious pages, making mmap() operation (e.g. of the GPU) more
  consistent.

* Add PG_UNQUEUED.  This prevents a vm_page from being manipulated in
  the vm_page_queues[] in any way.  This takes over another feature of
  the old PG_UNMANAGED flag.

* Remove PG_UNMANAGED.

* Remove PG_DEVICE_IDX.  This is no longer relevant.  We use
  PG_FICTITIOUS for all device pages.

* Refactor vm_contig_pg_alloc(), vm_contig_pg_free(),
  vm_page_alloc_contig(), and vm_page_free_contig().

  These functions now set PG_FICTITIOUS | PG_UNQUEUED on the returned
  pages, and properly clear the bits upon free or if/when a regular
  (but special contig-managed) page is handed over to the normal
  paging system.

  This, combined with making the pmap_*() functions work better with
  PG_FICTITIOUS, is the primary 'fix' for some of DRM's hacks.
kernel - VM rework part 14 - Core pmap work, stabilize for X/drm

* Don't gratuitously change the vm_page flags in the drm code.

  The vm_phys_fictitious_reg_range() code in drm_vm.c was clearing
  PG_UNMANAGED.  It was only luck that this worked before, but because
  these are faked pages, PG_UNMANAGED must be set or the system will
  implode trying to convert the physical address back to a vm_page in
  certain routines.

  The ttm code was setting PG_FICTITIOUS in order to prevent the page
  from getting into the active or inactive queues (they had a
  conditional test for PG_FICTITIOUS).  But ttm never cleared the bit
  before freeing the page.  Remove the hack and instead fix it in
  vm_page.c.

* In vm_object_terminate(), allow the case where there are still wired
  pages in an OBJT_MGTDEVICE object that has wound up on a queue
  (don't complain about it).  This situation arises because the ttm
  code uses the contig malloc API, which returns wired pages.

  NOTE: vm_page_activate()/vm_page_deactivate() are allowed to mess
  with wired pages.  Wired pages are not anything 'special' to the
  queues, which allows us to avoid messing with the queues when pages
  are assigned to the buffer cache.
kernel - VM rework part 12 - Core pmap work, stabilize & optimize

* Add tracking for the number of PTEs mapped writeable in md_page.
  Change how PG_WRITEABLE and PG_MAPPED are cleared in the vm_page to
  avoid clear/set races.  This problem occurs because we would have
  otherwise tried to clear the bits without hard-busying the page.

  This allows the bits to be set with only an atomic op.  Procedures
  which test these bits universally do so while holding the page
  hard-busied, and now call pmap_mapped_sync() beforehand to properly
  synchronize the bits.

* Fix bugs related to various counters: pm_stats.resident_count,
  wiring counts, vm_page->md.writeable_count, and
  vm_page->md.pmap_count.

* Fix bugs related to synchronizing removed PTEs with the vm_page.
  Fix one case where we were improperly updating (m)'s state based on
  a lost race against a pte swap-to-0 (pulling the pte).

* Fix a bug related to the page soft-busying code when the
  m->object/m->pindex race is lost.

* Implement a heuristic version of vm_page_active() which just updates
  act_count unlocked if the page is already in the PQ_ACTIVE queue, or
  if it is fictitious.

* Allow races against the backing scan for pmap_remove_all() and
  pmap_page_protect(VM_PROT_READ).  Callers of these routines for
  these cases expect full synchronization of the page dirty state.
  We can identify when a page has not been fully cleaned out by
  checking vm_page->md.pmap_count and vm_page->md.writeable_count.
  In the rare situation where this happens, simply retry.

* Assert that the PTE pindex is properly interlocked in pmap_enter().
  We still allow PTEs to be pulled by other routines without the
  interlock, but multiple pmap_enter()s of the same page will be
  interlocked.

* Assert additional wiring count failure cases.

* (UNTESTED) Flag DEVICE pages (dev_pager_getfake()) as being
  PG_UNMANAGED.  This essentially prevents all the various reference
  counters (e.g. vm_page->md.pmap_count and
  vm_page->md.writeable_count), PG_M, PG_A, etc from being updated.

  The vm_page's aren't tracked in the pmap at all because there is no
  way to find them; they are 'fake', so without a pv_entry we can't
  track them.  Instead we simply rely on the vm_map_backing scan to
  manipulate the PTEs.

* Optimize the new vm_map_entry_shadow() to use a shared object token
  instead of an exclusive one.  OBJ_ONEMAPPING will be cleared with
  the shared token.

* Optimize single-threaded access to pmaps to avoid pmap_inval_*()
  complexities.

* Optimize __read_mostly for more globals.

* Optimize pmap_testbit(), pmap_clearbit(), pmap_page_protect().
  Pre-check vm_page->md.writeable_count and vm_page->md.pmap_count
  for an easy degenerate return before doing real work.

* Optimize pmap_inval_smp() and pmap_inval_smp_cmpset() for the
  single-threaded pmap case, when called on the same CPU the pmap is
  associated with.  This allows us to use simple atomics and cpu_*()
  instructions and avoid the complexities of the pmap_inval_*()
  infrastructure.

* Randomize the page queue used in bio_page_alloc().  This does not
  appear to hurt performance (e.g. heavy tmpfs use) on large
  many-core NUMA machines, and it makes vm_page_alloc()'s job easier.

  This change might have a downside for temporary files, but for more
  long-lasting files there's no point allocating pages localized to a
  particular cpu.

* Optimize vm_page_alloc().

  (1) Refactor the _vm_page_list_find*() routines to avoid
      re-scanning the same array indices over and over again when
      trying to find a page.

  (2) Add a heuristic, vpq.lastq, for each queue, which we set if a
      _vm_page_list_find*() operation had to go far afield to find
      its page.  Subsequent finds will skip to the far-afield
      position until the current CPU's queues have pages again.

  (3) Reduce PQ_L2_SIZE from an extravagant 2048 entries per queue
      down to 1024.  The original 2048 was meant to provide 8-way
      set-associativity for 256 cores but wound up reducing
      performance due to longer index iterations.

* Refactor the vm_page_hash[] array.  This array is used to shortcut
  vm_object locks and locate VM pages more quickly, without locks.
  The new code limits the size of the array to something more
  reasonable, implements a 4-way set-associative replacement policy
  using 'ticks', and rewrites the hashing math.

* Effectively remove pmap_object_init_pt() for now.  In current tests
  it does not actually improve performance, probably because it may
  map pages that are not actually used by the program.

* Remove vm_map_backing->refs.  This field is no longer used.

* Remove more of the old now-stale code related to use of pv_entry's
  for terminal PTEs.

* Remove more of the old shared page-table-page code.  This worked
  but could never be fully validated and was prone to bugs, so remove
  it.  In the future we will likely use larger 2MB and 1GB pages
  anyway.

* Remove pmap_softwait()/pmap_softhold()/pmap_softdone().

* Remove more #if 0'd code.
kernel - VM rework part 7 - Initial vm_map_backing index

* Implement a TAILQ and hang vm_map_backing structures off of the
  related object.  This feature is still in progress and will
  eventually be used to allow pmaps to manipulate vm_page's without
  pv_entry's.

  At the same time, remove all sharing of vm_map_backing.  For
  example, clips no longer share the vm_map_backing.  We can't share
  the structures if they are being used to itemize areas for pmap
  management.

  TODO - reoptimize this at some point.
  TODO - not yet quite deterministic enough for pmap searches (due to
         clips).

* Refactor vm_object_reference_quick() to again allow operation on
  any vm_object whose ref_count is already at least 1, or which
  belongs to a vnode.  The ref_count is no longer being used for
  complex vm_object collapse, shadowing, or migration code.  This
  allows us to avoid a number of unnecessary token grabs on objects
  during clips, shadowing, and forks.

* Clean up a few fields in vm_object.  Name TAILQ_ENTRY() elements
  blahblah_entry instead of blahblah_list.

* Fix an issue with a.out binaries (which are still supported but
  nobody uses) where the object refs on the binaries were not being
  properly accounted for.
kernel: Remove numerous #include <sys/thread2.h>.

Most of them were added when we converted spl*() calls to
crit_enter()/crit_exit(), almost 14 years ago.  We can now remove a
good chunk of them again where crit_*() calls are no longer used.

I had to adjust some files that were relying on thread2.h, or on
headers that it includes, coming in via other headers that it was
removed from.
kernel - refactor vm_page busy

* Move PG_BUSY, PG_WANTED, PG_SBUSY, and PG_SWAPINPROG out of
  m->flags.

* Add m->busy_count with PBUSY_LOCKED, PBUSY_WANTED,
  PBUSY_SWAPINPROG, and PBUSY_MASK (for the soft-busy count).

* Add support for acquiring a soft-busy count without a hard-busy.
  This requires that there not already be a hard-busy.  The purpose
  of this is to allow a vm_page to be 'locked' in a shared manner via
  the soft-busy for situations where we only intend to read from it.
kernel - pmap and vkernel work

* Remove pmap.pm_token entirely.  The pmap is currently protected
  primarily by fine-grained locks and the vm_map lock.  The intention
  is to eventually be able to protect it without the vm_map lock at
  all.

* Enhance pv_entry acquisition (representing PTE locations) to
  include a placemarker facility for non-existent PTEs, allowing the
  PTE location to be locked whether a pv_entry exists for it or not.

* Fix dev_dmmap (struct dev_mmap) (for future use); it was returning
  a page index for physical memory as a 32-bit integer instead of a
  64-bit integer.

* Use pmap_kextract() instead of pmap_extract() where appropriate.

* Put the token contention test back in kern_clock.c for real kernels
  so token contention shows up as sys% instead of idle%.

* Modify the pmap_extract() API to also return a locked pv_entry, and
  add pmap_extract_done() to release it.  Adjust users of
  pmap_extract().

* Change madvise/mcontrol MADV_INVAL (used primarily by the vkernel)
  to use a shared vm_map lock instead of an exclusive lock.  This
  significantly improves the vkernel's performance and significantly
  reduces stalls and glitches when typing in one under heavy loads.

* The new placemarkers also have the side effect of fixing several
  difficult-to-reproduce bugs in the pmap code, by ensuring that
  shared and unmanaged pages are properly locked, whereas before only
  managed pages (with pv_entry's) were properly locked.

* Adjust the vkernel's pmap code to use atomic ops in numerous
  places.

* Rename the pmap_change_wiring() call to pmap_unwire().  The routine
  was only being used to unwire (and could only safely be called for
  unwiring anyway).  Remove the unused 'wired' and 'entry' arguments.
  Also change how pmap_unwire() works to remove a small race
  condition.

* Fix race conditions in the vmspace_*() system calls which could
  lead to pmap corruption.  Note that the vkernel did not trigger any
  of these conditions; I found them while looking for another bug.

* Add missing maptypes to procfs's /proc/*/map report.
drm - Fix lock order reversal

* Lock order reversal caused by holding dev_pager_mtx() across the
  object->un_pager.devp.ops->cdev_pg_dtor() call.  devpgr -> drmslk.

* Move the lock from before to after the call.  Holding the mutex
  shouldn't be necessary across the call.  This also fixes the
  reversal, as devpgr is no longer held across the call.

* Fixes a lock order reversal against drm_ioctl(), which obtains
  drmslk first and recurses into a device pager operation which gets
  devpgr.

* Fix a few other incidental bugs that would normally not be
  triggered by the DRM code due to outer locks held by the DRM code.
  Plus some formatting fixes.
drm - Fix deadlock in ttm pager

* Fix a deadlock which most often occurs via the ttm (radeon) VM
  pager.  A similar path is also used by i915 (all Intel).

* Basically removes an unnecessary lock in the paging path which was
  creating the deadlock.

Reported-by: ivadasz, ftigeot
kernel - Fix bug in cdev_pager_allocate() that was messing up gem/ttm

* cdev_pager_allocate() was assuming that the passed vm_object handle
  was a cdev_t, and was populating a field in it, but that is not
  always the case.  Fix that case.

* This solves RBTREE corruption in drm/ttm.

Reported-by: Joris Giovannangeli
hammer2 - Merge Mihai Carabas's VKERNEL/VMM GSOC project into the main tree

* This merge contains work primarily by Mihai Carabas, with some misc
  fixes also by Matthew Dillon.

* Special note on GSOC core: this is, needless to say, a huge amount
  of work compressed down into a few paragraphs of comments.  Adds
  the pc64/vmm subdirectory and tons of stuff to support hardware
  virtualization in guest-user mode, plus the ability for programs
  (vkernels) running in this mode to make normal system calls to the
  host.

* Add system call infrastructure for VMM mode operations in
  kern/sys_vmm.c which vectors through a structure to
  machine-specific implementations.

  vmm_guest_ctl_args() - bootstrap VMM and EPT modes.  Copydown the
  original user stack for EPT (since EPT 'physical' addresses cannot
  reach that far into the backing store represented by the process's
  original VM space).  Also installs the GUEST_CR3 for the guest
  using parameters supplied by the guest.

  vmm_guest_sync_addr_args() - a host helper function that the
  vkernel can use to invalidate page tables on multiple real cpus.
  This is a lot more efficient than having the vkernel try to do it
  itself with IPI signals via cpusync*().

* Add Intel VMX support to the host infrastructure.  Again, tons of
  work compressed down into a one-paragraph commit message.  Intel
  VMX support added.  AMD SVM support is not part of this GSOC and
  not yet supported by DragonFly.

* Remove PG_* defines for PTEs and related mmu operations.  Replace
  with a table lookup so the same pmap code can be used for normal
  page tables and also EPT tables.

* Also include X86_PG_V defines specific to normal page tables for a
  few situations outside the pmap code.

* Adjust DDB to disassemble VMX-related (Intel) instructions.

* Add infrastructure to exit1() to deal with related structures.

* Optimize pfind() and pfindn() to remove the global token when
  looking up the current process's PID.  (Matt)

* Add support for EPT (double layer page tables).  This primarily
  required adjusting the pmap code to use a table lookup to get the
  PG_* bits.

  Add an indirect vector for copyin, copyout, and other user address
  space copy operations to support manual walks when EPT is in use.
  A multitude of system calls which manually looked up user addresses
  via the vm_map now need a VMM layer call to translate EPT.

* Remove the MP lock from trapsignal() use cases in trap().

* (Matt) Add pthread_yield()s in most spin loops to help situations
  where the vkernel is running on more cpus than the host has, and to
  help with scheduler edge cases on the host.

* (Matt) Add a pmap_fault_page_quick() infrastructure that
  vm_fault_page() uses to try to shortcut operations and avoid locks.
  Implement it for pc64.  This function checks whether the page is
  already faulted in as requested by looking up the PTE.  If not, it
  returns NULL and the full-blown vm_fault_page() code continues
  running.

* (Matt) Remove the MP lock from most of the vkernel's trap() code.

* (Matt) Use a shared spinlock when possible for certain critical
  paths related to the copyin/copyout path.