From b12defdc619df06fafb50cc7535a919224daa63c Mon Sep 17 00:00:00 2001
From: Matthew Dillon
Date: Tue, 18 Oct 2011 10:36:11 -0700
Subject: [PATCH] kernel - Major SMP performance patch / VM system, bus-fault/seg-fault fixes

This is a very large patch which reworks locking in the entire VM subsystem,
concentrated on VM objects and the x86-64 pmap code.  These fixes remove
nearly all the spin lock contention for non-threaded VM faults and narrow
contention for threaded VM faults to just the threads sharing the pmap.

Multi-socket many-core machines will see a 30-50% improvement in parallel
build performance (tested on a 48-core Opteron), depending on how well the
build parallelizes.

As part of this work a long-standing problem on 64-bit systems where programs
would occasionally seg-fault or bus-fault for no apparent reason has been
fixed.  The problem was caused by races between vm_fault, the vm_object
collapse code, and the vm_map splitting code.

* Most uses of vm_token and all uses of vm_spin have been removed.  These
  have been replaced with per-object tokens and per-queue (vm_page_queues[])
  spin locks.

  Note in particular that since we still have the page coloring code, the
  PQ_FREE and PQ_CACHE queues are actually many queues, individually
  spin-locked, resulting in excellent MP page allocation and freeing
  performance.

* Reworked vm_page_lookup() and vm_object->rb_memq.  All (object, pindex)
  lookup operations are now covered by the vm_object hold/drop system,
  which utilizes pool tokens on vm_objects.  Callers must now hold the
  VM object to ensure a stable outcome.

  Also added vm_page_lookup_busy_wait(), vm_page_lookup_busy_try(),
  vm_page_busy_wait(), vm_page_busy_try(), and other API functions which
  integrate the PG_BUSY handling (see the usage sketch below).

* Added OBJ_CHAINLOCK.  Most vm_object operations are protected by the
  vm_object_hold/drop() facility, which is token-based.  Certain critical
  functions which must traverse backing_object chains use a hard-locking
  flag and lock almost the entire chain as it is traversed, to prevent
  races against object deallocation, collapses, and splits.

  The last object in the chain (typically a vnode) is NOT locked in this
  manner, so concurrent faults which terminate at the same vnode will
  still have good performance.  This is important e.g. for parallel
  compiles which might be running dozens of instances of the same
  compiler binary concurrently.

* Created a per-vm_map token and removed most uses of vmspace_token.

* Removed the mp_lock in sys_execve().  It has not been needed in a while.

* Added kmem_lim_size(), which returns approximate available memory
  (limited by available KVM), in megabytes.  This is now used to scale up
  the slab allocator cache and the pipe buffer caches to reduce
  unnecessary global kmem operations.

* Rewrote vm_page_alloc(), various bits in vm/vm_contig.c, the swapcache
  scan code, and the pageout scan code.  These routines were rewritten to
  use the per-queue spin locks.

* Replaced the exponential backoff in the spinlock code with something a
  bit less complex and cleaned it up.

* Restructured the IPIQ func/arg1/arg2 array for better cache locality.
  Removed the per-queue ip_npoll and replaced it with a per-cpu gd_npoll,
  which is used by other cores to determine if they need to issue an
  actual hardware IPI or not.  This reduces hardware IPI issuance
  considerably (and the removal of the decontention code reduced it even
  more).

* Temporarily removed the lwkt thread fairq code and disabled a number of
  features.  These will be worked back in once we track down some of the
  remaining performance issues.

  Temporarily removed the lwkt thread resequencer for tokens for the same
  reason.  This might wind up being permanent.

  Added splz_check()s in a few critical places.

* Increased the number of pool tokens from 1024 to 4001 and went to a
  prime-number mod algorithm to reduce overlaps (see the indexing sketch
  below).

* Removed the token decontention code.  This was a bit of an eyesore and
  while it did its job when we had global locks, it just gets in the way
  now that most of the global locks are gone.

  Replaced the decontention code with a fallback which acquires the
  tokens in sorted order, to guarantee that deadlocks will always be
  resolved eventually in the scheduler.

* Introduced a simplified spin-for-a-little-while function
  _lwkt_trytoken_spin() that the token code now uses rather than giving
  up immediately.

* The vfs_bio subsystem no longer uses vm_token and now uses the
  vm_object_hold/drop API for buffer cache operations, resulting in very
  good concurrency.

* Gave the vnode its own spinlock instead of sharing vp->v_lock.lk_spinlock,
  which fixes a deadlock.

* Adjusted all platform pmap.c's to handle the new main kernel APIs.  The
  i386 pmap.c is still a bit out of date but should be compatible.

* Completely rewrote very large chunks of the x86-64 pmap.c code.  The
  critical path no longer needs pmap_spin, but pmap_spin itself is still
  used heavily, particularly in the pv_entry handling code.

  A per-pmap token and a per-pmap object are now used to serialize pmap
  access and vm_page lookup operations when needed.

  The x86-64 pmap.c code now uses only vm_page->crit_count instead of both
  crit_count and hold_count, which fixes races against other parts of the
  kernel that use vm_page_hold().

  _pmap_allocpte() mechanics have been completely rewritten to remove
  potential races.  Much of pmap_enter() and pmap_enter_quick() has also
  been rewritten.

  Many other changes.

* The following subsystems (and probably more) no longer use the vm_token
  or vmobj_token in critical paths:

  x The swap_pager now uses the vm_object_hold/drop API instead of
    vm_token.

  x mmap() and vm_map/vm_mmap in general now use the vm_object_hold/drop
    API instead of vm_token.

  x vnode_pager

  x zalloc

  x vm_page handling

  x vfs_bio

  x umtx system calls

  x vm_fault and friends

* Minor fixes to fill_kinfo_proc() to deal with process scan panics (ps)
  revealed by recent global lock removals.

* lockmgr() locks no longer support LK_NOSPINWAIT.  Spin locks are
  unconditionally acquired.

* Replaced netif/e1000's spinlocks with lockmgr locks.  The spinlocks were
  not appropriate owing to the large context they were covering.
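  For reference, the usual conversion at a call site that previously
  serialized page lookups with vm_token looks roughly like the following.
  This is only an illustrative sketch ('obj', 'pindex', and the "wmesg"
  string are placeholders); the real conversions appear in the diff below,
  e.g. in sys/dev/agp/agp.c.

	/* old: global vm_token covered the lookup */
	lwkt_gettoken(&vm_token);
	m = vm_page_lookup(obj, pindex);
	vm_page_unwire(m, 0);
	lwkt_reltoken(&vm_token);

	/* new: hold the object, look the page up and busy it atomically */
	vm_object_hold(obj);
	m = vm_page_lookup_busy_wait(obj, pindex, FALSE, "wmesg");
	vm_page_unwire(m, 0);
	vm_page_wakeup(m);		/* releases PG_BUSY */
	vm_object_drop(obj);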
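  The prime-number mod indexing mentioned above is the same scheme applied
  to the tsleep hash table in kern/kern_synch.c below; a minimal sketch of
  the idea (the token-pool implementation itself may differ in detail):

	/* old: power-of-2 table, low bits masked off */
	#define TABLESIZE	1024
	#define LOOKUP(x)	(((intptr_t)(x) >> 6) & (TABLESIZE - 1))

	/* new: prime-sized table, plain modulo spreads aligned addresses */
	#define TABLESIZE	4001
	#define LOOKUP(x)	(((u_int)(uintptr_t)(x)) % TABLESIZE)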
* Misc atomic ops added --- sys/cpu/i386/include/atomic.h | 11 + sys/cpu/i386/include/cpu.h | 5 +- sys/cpu/x86_64/include/atomic.h | 11 + sys/cpu/x86_64/include/cpu.h | 5 +- sys/dev/agp/agp.c | 18 +- sys/dev/agp/agp_i810.c | 9 +- sys/dev/netif/e1000/e1000_osdep.h | 15 +- sys/dev/netif/e1000/if_em.h | 32 +- sys/emulation/43bsd/43bsd_vm.c | 2 - .../linux/i386/linprocfs/linprocfs_misc.c | 29 +- sys/emulation/linux/i386/linux_machdep.c | 6 +- sys/kern/imgact_aout.c | 9 +- sys/kern/imgact_elf.c | 59 +- sys/kern/init_main.c | 2 +- sys/kern/kern_clock.c | 25 +- sys/kern/kern_exec.c | 12 +- sys/kern/kern_kinfo.c | 27 +- sys/kern/kern_lock.c | 9 +- sys/kern/kern_slaballoc.c | 182 +-- sys/kern/kern_spinlock.c | 200 ++- sys/kern/kern_synch.c | 10 +- sys/kern/kern_umtx.c | 6 - sys/kern/kern_xio.c | 12 - sys/kern/link_elf.c | 4 +- sys/kern/link_elf_obj.c | 4 +- sys/kern/lwkt_ipiq.c | 119 +- sys/kern/lwkt_thread.c | 434 +++--- sys/kern/lwkt_token.c | 478 ++++--- sys/kern/sys_pipe.c | 24 + sys/kern/sys_process.c | 9 +- sys/kern/sysv_shm.c | 4 +- sys/kern/tty.c | 7 +- sys/kern/uipc_syscalls.c | 32 +- sys/kern/vfs_bio.c | 132 +- sys/kern/vfs_cache.c | 50 +- sys/kern/vfs_cluster.c | 5 + sys/kern/vfs_journal.c | 12 +- sys/kern/vfs_lock.c | 31 +- sys/kern/vfs_mount.c | 6 +- sys/kern/vfs_subr.c | 42 +- sys/kern/vfs_vm.c | 17 +- sys/platform/pc32/i386/machdep.c | 4 - sys/platform/pc32/i386/pmap.c | 179 ++- sys/platform/pc32/include/pmap.h | 10 + sys/platform/pc64/include/pmap.h | 14 +- sys/platform/pc64/x86_64/pmap.c | 908 +++++++------ sys/platform/vkernel/conf/files | 1 + sys/platform/vkernel/i386/cpu_regs.c | 4 +- sys/platform/vkernel/i386/mp.c | 2 + sys/platform/vkernel/include/pmap.h | 10 + sys/platform/vkernel/platform/pmap.c | 159 ++- sys/platform/vkernel64/conf/files | 1 + sys/platform/vkernel64/include/pmap.h | 10 + sys/platform/vkernel64/platform/pmap.c | 183 +-- sys/platform/vkernel64/x86_64/cpu_regs.c | 9 +- sys/platform/vkernel64/x86_64/mp.c | 4 +- sys/sys/globaldata.h | 11 +- sys/sys/lock.h | 2 +- sys/sys/malloc.h | 1 + sys/sys/param.h | 1 + sys/sys/spinlock.h | 14 +- sys/sys/spinlock2.h | 96 +- sys/sys/thread.h | 16 +- sys/sys/time.h | 1 + sys/sys/vnode.h | 6 +- sys/vfs/devfs/devfs_vnops.c | 2 +- sys/vfs/nwfs/nwfs_io.c | 2 +- sys/vfs/procfs/procfs_map.c | 31 +- sys/vfs/smbfs/smbfs_io.c | 2 +- sys/vm/device_pager.c | 6 +- sys/vm/phys_pager.c | 2 - sys/vm/pmap.h | 5 +- sys/vm/swap_pager.c | 216 +-- sys/vm/vm.h | 1 + sys/vm/vm_contig.c | 171 ++- sys/vm/vm_fault.c | 533 +++++--- sys/vm/vm_glue.c | 10 +- sys/vm/vm_kern.c | 9 +- sys/vm/vm_map.c | 508 ++++--- sys/vm/vm_map.h | 39 +- sys/vm/vm_meter.c | 4 +- sys/vm/vm_mmap.c | 58 +- sys/vm/vm_object.c | 1156 +++++++++------- sys/vm/vm_object.h | 34 +- sys/vm/vm_page.c | 1180 ++++++++++++----- sys/vm/vm_page.h | 169 +-- sys/vm/vm_page2.h | 55 + sys/vm/vm_pageout.c | 619 ++++++--- sys/vm/vm_swap.c | 59 +- sys/vm/vm_swapcache.c | 152 ++- sys/vm/vm_unix.c | 8 +- sys/vm/vm_vmspace.c | 5 +- sys/vm/vm_zone.c | 2 - sys/vm/vnode_pager.c | 148 ++- 94 files changed, 5531 insertions(+), 3407 deletions(-) diff --git a/sys/cpu/i386/include/atomic.h b/sys/cpu/i386/include/atomic.h index 8212a031e1..9ddaff1789 100644 --- a/sys/cpu/i386/include/atomic.h +++ b/sys/cpu/i386/include/atomic.h @@ -375,6 +375,7 @@ atomic_intr_cond_exit(__atomic_intr_t *p, void (*func)(void *), void *arg) extern int atomic_cmpset_int(volatile u_int *_dst, u_int _old, u_int _new); extern long atomic_cmpset_long(volatile u_long *_dst, u_long _exp, u_long _src); extern u_int 
atomic_fetchadd_int(volatile u_int *_p, u_int _v); +extern u_long atomic_fetchadd_long(volatile u_long *_p, u_long _v); #else @@ -411,6 +412,16 @@ atomic_fetchadd_int(volatile u_int *_p, u_int _v) return (_v); } +static __inline u_long +atomic_fetchadd_long(volatile u_long *_p, u_long _v) +{ + __asm __volatile(MPLOCKED "xaddl %0,%1; " \ + : "+r" (_v), "=m" (*_p) \ + : "m" (*_p) \ + : "memory"); + return (_v); +} + #endif /* KLD_MODULE */ #if defined(KLD_MODULE) diff --git a/sys/cpu/i386/include/cpu.h b/sys/cpu/i386/include/cpu.h index 1ed4711abf..1621d1ca73 100644 --- a/sys/cpu/i386/include/cpu.h +++ b/sys/cpu/i386/include/cpu.h @@ -72,12 +72,9 @@ * * We now have to use a locked bus cycle due to LWKT_RESCHED/WAKEUP * signalling by other cpus. - * - * NOTE: need_lwkt_resched() sets RQF_WAKEUP but clear_lwkt_resched() does - * not clear it. Only the scheduler will clear RQF_WAKEUP. */ #define need_lwkt_resched() \ - atomic_set_int(&mycpu->gd_reqflags, RQF_AST_LWKT_RESCHED | RQF_WAKEUP) + atomic_set_int(&mycpu->gd_reqflags, RQF_AST_LWKT_RESCHED) #define need_user_resched() \ atomic_set_int(&mycpu->gd_reqflags, RQF_AST_USER_RESCHED) #define need_proftick() \ diff --git a/sys/cpu/x86_64/include/atomic.h b/sys/cpu/x86_64/include/atomic.h index 4e2fd4085a..4850c5e252 100644 --- a/sys/cpu/x86_64/include/atomic.h +++ b/sys/cpu/x86_64/include/atomic.h @@ -401,6 +401,7 @@ atomic_intr_cond_exit(__atomic_intr_t *p, void (*func)(void *), void *arg) extern int atomic_cmpset_int(volatile u_int *_dst, u_int _old, u_int _new); extern long atomic_cmpset_long(volatile u_long *_dst, u_long _exp, u_long _src); extern u_int atomic_fetchadd_int(volatile u_int *_p, u_int _v); +extern u_long atomic_fetchadd_long(volatile u_long *_p, u_long _v); #else @@ -442,6 +443,16 @@ atomic_fetchadd_int(volatile u_int *_p, u_int _v) return (_v); } +static __inline u_long +atomic_fetchadd_long(volatile u_long *_p, u_long _v) +{ + __asm __volatile(MPLOCKED "xaddq %0,%1; " \ + : "+r" (_v), "=m" (*_p) \ + : "m" (*_p) \ + : "memory"); + return (_v); +} + #endif /* KLD_MODULE */ #if defined(KLD_MODULE) diff --git a/sys/cpu/x86_64/include/cpu.h b/sys/cpu/x86_64/include/cpu.h index a9a440e7cb..dff82a44aa 100644 --- a/sys/cpu/x86_64/include/cpu.h +++ b/sys/cpu/x86_64/include/cpu.h @@ -73,12 +73,9 @@ * We do not have to use a locked bus cycle but we do have to use an * atomic instruction because an interrupt on the local cpu can modify * the gd_reqflags field. - * - * NOTE: need_lwkt_resched() sets RQF_WAKEUP but clear_lwkt_resched() does - * not clear it. Only the scheduler will clear RQF_WAKEUP. 
*/ #define need_lwkt_resched() \ - atomic_set_int(&mycpu->gd_reqflags, RQF_AST_LWKT_RESCHED | RQF_WAKEUP) + atomic_set_int(&mycpu->gd_reqflags, RQF_AST_LWKT_RESCHED) #define need_user_resched() \ atomic_set_int(&mycpu->gd_reqflags, RQF_AST_USER_RESCHED) #define need_proftick() \ diff --git a/sys/dev/agp/agp.c b/sys/dev/agp/agp.c index 3ce837a157..77aec71f02 100644 --- a/sys/dev/agp/agp.c +++ b/sys/dev/agp/agp.c @@ -566,13 +566,15 @@ agp_generic_bind_memory(device_t dev, struct agp_memory *mem, vm_page_wakeup(m); for (k = 0; k < i + j; k += AGP_PAGE_SIZE) AGP_UNBIND_PAGE(dev, offset + k); - lwkt_gettoken(&vm_token); + vm_object_hold(mem->am_obj); for (k = 0; k <= i; k += PAGE_SIZE) { - m = vm_page_lookup(mem->am_obj, - OFF_TO_IDX(k)); + m = vm_page_lookup_busy_wait( + mem->am_obj, OFF_TO_IDX(k), + FALSE, "agppg"); vm_page_unwire(m, 0); + vm_page_wakeup(m); } - lwkt_reltoken(&vm_token); + vm_object_drop(mem->am_obj); lockmgr(&sc->as_lock, LK_RELEASE); return error; } @@ -621,12 +623,14 @@ agp_generic_unbind_memory(device_t dev, struct agp_memory *mem) */ for (i = 0; i < mem->am_size; i += AGP_PAGE_SIZE) AGP_UNBIND_PAGE(dev, mem->am_offset + i); - lwkt_gettoken(&vm_token); + vm_object_hold(mem->am_obj); for (i = 0; i < mem->am_size; i += PAGE_SIZE) { - m = vm_page_lookup(mem->am_obj, atop(i)); + m = vm_page_lookup_busy_wait(mem->am_obj, atop(i), + FALSE, "agppg"); vm_page_unwire(m, 0); + vm_page_wakeup(m); } - lwkt_reltoken(&vm_token); + vm_object_drop(mem->am_obj); agp_flush_cache(); AGP_FLUSH_TLB(dev); diff --git a/sys/dev/agp/agp_i810.c b/sys/dev/agp/agp_i810.c index 4f921eb591..7e457da1d9 100644 --- a/sys/dev/agp/agp_i810.c +++ b/sys/dev/agp/agp_i810.c @@ -1008,10 +1008,13 @@ agp_i810_free_memory(device_t dev, struct agp_memory *mem) * Unwire the page which we wired in alloc_memory. 
*/ vm_page_t m; - lwkt_gettoken(&vm_token); - m = vm_page_lookup(mem->am_obj, 0); + + vm_object_hold(mem->am_obj); + m = vm_page_lookup_busy_wait(mem->am_obj, 0, + FALSE, "agppg"); + vm_object_drop(mem->am_obj); vm_page_unwire(m, 0); - lwkt_reltoken(&vm_token); + vm_page_wakeup(m); } else { contigfree(sc->argb_cursor, mem->am_size, M_AGP); sc->argb_cursor = NULL; diff --git a/sys/dev/netif/e1000/e1000_osdep.h b/sys/dev/netif/e1000/e1000_osdep.h index 0bf4e45ee2..153ce288e7 100644 --- a/sys/dev/netif/e1000/e1000_osdep.h +++ b/sys/dev/netif/e1000/e1000_osdep.h @@ -37,8 +37,7 @@ #include #include -#include -#include +#include #include #include #include @@ -68,12 +67,12 @@ #define PCI_COMMAND_REGISTER PCIR_COMMAND /* Mutex used in the shared code */ -#define E1000_MUTEX struct spinlock -#define E1000_MUTEX_INIT(spin) spin_init(spin) -#define E1000_MUTEX_DESTROY(spin) spin_uninit(spin) -#define E1000_MUTEX_LOCK(spin) spin_lock(spin) -#define E1000_MUTEX_TRYLOCK(spin) spin_trylock(spin) -#define E1000_MUTEX_UNLOCK(spin) spin_unlock(spin) +#define E1000_MUTEX struct lock +#define E1000_MUTEX_INIT(spin) lockinit(spin, "emtx", 0, 0) +#define E1000_MUTEX_DESTROY(spin) lockuninit(spin) +#define E1000_MUTEX_LOCK(spin) lockmgr(spin, LK_EXCLUSIVE) +#define E1000_MUTEX_TRYLOCK(spin) (lockmgr(spin, LK_EXCLUSIVE | LK_NOWAIT) == 0) +#define E1000_MUTEX_UNLOCK(spin) lockmgr(spin, LK_RELEASE) typedef uint64_t u64; typedef uint32_t u32; diff --git a/sys/dev/netif/e1000/if_em.h b/sys/dev/netif/e1000/if_em.h index f69cd0134c..916d03e29f 100644 --- a/sys/dev/netif/e1000/if_em.h +++ b/sys/dev/netif/e1000/if_em.h @@ -316,9 +316,9 @@ struct adapter { int if_flags; int max_frame_size; int min_frame_size; - struct spinlock core_spin; - struct spinlock tx_spin; - struct spinlock rx_spin; + struct lock core_spin; + struct lock tx_spin; + struct lock rx_spin; int em_insert_vlan_header; /* Task for FAST handling */ @@ -469,21 +469,21 @@ typedef struct _DESCRIPTOR_PAIR } DESC_ARRAY, *PDESC_ARRAY; #define EM_CORE_LOCK_INIT(_sc, _name) \ - spin_init(&(_sc)->core_spin) + lockinit(&(_sc)->core_spin, "emcore", 0, 0) #define EM_TX_LOCK_INIT(_sc, _name) \ - spin_init(&(_sc)->tx_spin) + lockinit(&(_sc)->tx_spin, "emtx", 0, 0) #define EM_RX_LOCK_INIT(_sc, _name) \ - spin_init(&(_sc)->rx_spin) -#define EM_CORE_LOCK_DESTROY(_sc) spin_uninit(&(_sc)->core_spin) -#define EM_TX_LOCK_DESTROY(_sc) spin_uninit(&(_sc)->tx_spin) -#define EM_RX_LOCK_DESTROY(_sc) spin_uninit(&(_sc)->rx_spin) -#define EM_CORE_LOCK(_sc) spin_lock(&(_sc)->core_spin) -#define EM_TX_LOCK(_sc) spin_lock(&(_sc)->tx_spin) -#define EM_TX_TRYLOCK(_sc) spin_trylock(&(_sc)->tx_spin) -#define EM_RX_LOCK(_sc) spin_lock(&(_sc)->rx_spin) -#define EM_CORE_UNLOCK(_sc) spin_unlock(&(_sc)->core_spin) -#define EM_TX_UNLOCK(_sc) spin_unlock(&(_sc)->tx_spin) -#define EM_RX_UNLOCK(_sc) spin_unlock(&(_sc)->rx_spin) + lockinit(&(_sc)->rx_spin, "emrx", 0, 0) +#define EM_CORE_LOCK_DESTROY(_sc) lockuninit(&(_sc)->core_spin) +#define EM_TX_LOCK_DESTROY(_sc) lockuninit(&(_sc)->tx_spin) +#define EM_RX_LOCK_DESTROY(_sc) lockuninit(&(_sc)->rx_spin) +#define EM_CORE_LOCK(_sc) lockmgr(&(_sc)->core_spin, LK_EXCLUSIVE) +#define EM_TX_LOCK(_sc) lockmgr(&(_sc)->tx_spin, LK_EXCLUSIVE) +#define EM_TX_TRYLOCK(_sc) (lockmgr(&(_sc)->tx_spin, LK_EXCLUSIVE | LK_NOWAIT) == 0) +#define EM_RX_LOCK(_sc) lockmgr(&(_sc)->rx_spin, LK_EXCLUSIVE) +#define EM_CORE_UNLOCK(_sc) lockmgr(&(_sc)->core_spin, LK_RELEASE) +#define EM_TX_UNLOCK(_sc) lockmgr(&(_sc)->tx_spin, LK_RELEASE) +#define EM_RX_UNLOCK(_sc)
lockmgr(&(_sc)->rx_spin, LK_RELEASE) #define EM_CORE_LOCK_ASSERT(_sc) #define EM_TX_LOCK_ASSERT(_sc) diff --git a/sys/emulation/43bsd/43bsd_vm.c b/sys/emulation/43bsd/43bsd_vm.c index 8f22217a69..69f87c669c 100644 --- a/sys/emulation/43bsd/43bsd_vm.c +++ b/sys/emulation/43bsd/43bsd_vm.c @@ -113,11 +113,9 @@ sys_ommap(struct ommap_args *uap) if (uap->flags & OMAP_INHERIT) flags |= MAP_INHERIT; - lwkt_gettoken(&vm_token); error = kern_mmap(curproc->p_vmspace, uap->addr, uap->len, prot, flags, uap->fd, uap->pos, &uap->sysmsg_resultp); - lwkt_reltoken(&vm_token); return (error); } diff --git a/sys/emulation/linux/i386/linprocfs/linprocfs_misc.c b/sys/emulation/linux/i386/linprocfs/linprocfs_misc.c index 229c654464..204877dc18 100644 --- a/sys/emulation/linux/i386/linprocfs/linprocfs_misc.c +++ b/sys/emulation/linux/i386/linprocfs/linprocfs_misc.c @@ -741,10 +741,29 @@ linprocfs_domaps(struct proc *curp, struct proc *p, struct pfsnode *pfs, */ map->hint = entry; ostart = entry->start; - obj = entry->object.vm_object; - for( lobj = tobj = obj; tobj; tobj = tobj->backing_object) - lobj = tobj; + /* + * Find the bottom-most object, leaving the base object + * and the bottom-most object held (but only one hold + * if they happen to be the same). + */ + obj = entry->object.vm_object; + vm_object_hold(obj); + + lobj = obj; + while (lobj && (tobj = lobj->backing_object) != NULL) { + KKASSERT(tobj != obj); + vm_object_hold(tobj); + if (tobj == lobj->backing_object) { + if (lobj != obj) { + vm_object_lock_swap(); + vm_object_drop(lobj); + } + lobj = tobj; + } else { + vm_object_drop(tobj); + } + } if (lobj) { off = IDX_TO_OFF(lobj->size); @@ -771,6 +790,10 @@ linprocfs_domaps(struct proc *curp, struct proc *p, struct pfsnode *pfs, name = "[stack]"; } + if (lobj != obj) + vm_object_drop(lobj); + vm_object_drop(obj); + /* * We cannot safely hold the map locked while accessing * userspace as a VM fault might recurse the locked map. 
diff --git a/sys/emulation/linux/i386/linux_machdep.c b/sys/emulation/linux/i386/linux_machdep.c index 0c30fb5138..c0b0f2e886 100644 --- a/sys/emulation/linux/i386/linux_machdep.c +++ b/sys/emulation/linux/i386/linux_machdep.c @@ -625,8 +625,7 @@ linux_mmap_common(caddr_t linux_addr, size_t linux_len, int linux_prot, flags |= MAP_NOSYNC; } - lwkt_gettoken(&vm_token); - lwkt_gettoken(&vmspace_token); + lwkt_gettoken(&curproc->p_vmspace->vm_map.token); if (linux_flags & LINUX_MAP_GROWSDOWN) { flags |= MAP_STACK; @@ -711,8 +710,7 @@ linux_mmap_common(caddr_t linux_addr, size_t linux_len, int linux_prot, error = kern_mmap(curproc->p_vmspace, addr, len, prot, flags, fd, pos, &new); - lwkt_reltoken(&vmspace_token); - lwkt_reltoken(&vm_token); + lwkt_reltoken(&curproc->p_vmspace->vm_map.token); if (error == 0) *res = new; diff --git a/sys/kern/imgact_aout.c b/sys/kern/imgact_aout.c index 49727fb144..67a01713cc 100644 --- a/sys/kern/imgact_aout.c +++ b/sys/kern/imgact_aout.c @@ -180,7 +180,8 @@ exec_aout_imgact(struct image_params *imgp) count = vm_map_entry_reserve(MAP_RESERVE_COUNT); vm_map_lock(map); object = vp->v_object; - vm_object_reference(object); + vm_object_hold(object); + vm_object_reference_locked(object); text_end = virtual_offset + a_out->a_text; error = vm_map_insert(map, &count, object, @@ -189,14 +190,16 @@ exec_aout_imgact(struct image_params *imgp) VM_MAPTYPE_NORMAL, VM_PROT_READ | VM_PROT_EXECUTE, VM_PROT_ALL, MAP_COPY_ON_WRITE | MAP_PREFAULT); + if (error) { + vm_object_drop(object); vm_map_unlock(map); vm_map_entry_release(count); return (error); } data_end = text_end + a_out->a_data; if (a_out->a_data) { - vm_object_reference(object); + vm_object_reference_locked(object); error = vm_map_insert(map, &count, object, file_offset + a_out->a_text, text_end, data_end, @@ -204,11 +207,13 @@ exec_aout_imgact(struct image_params *imgp) VM_PROT_ALL, VM_PROT_ALL, MAP_COPY_ON_WRITE | MAP_PREFAULT); if (error) { + vm_object_drop(object); vm_map_unlock(map); vm_map_entry_release(count); return (error); } } + vm_object_drop(object); if (bss_size) { error = vm_map_insert(map, &count, NULL, 0, diff --git a/sys/kern/imgact_elf.c b/sys/kern/imgact_elf.c index 287eb84970..243e1d78ec 100644 --- a/sys/kern/imgact_elf.c +++ b/sys/kern/imgact_elf.c @@ -251,6 +251,8 @@ __elfN(load_section)(struct proc *p, struct vmspace *vmspace, struct vnode *vp, object = vp->v_object; error = 0; + vm_object_hold(object); + /* * It's necessary to fail if the filsz + offset taken from the * header is greater than the actual file pager object's size. 
@@ -262,6 +264,7 @@ __elfN(load_section)(struct proc *p, struct vmspace *vmspace, struct vnode *vp, */ if ((off_t)filsz + offset > vp->v_filesize || filsz > memsz) { uprintf("elf_load_section: truncated ELF file\n"); + vm_object_drop(object); return (ENOEXEC); } @@ -280,7 +283,7 @@ __elfN(load_section)(struct proc *p, struct vmspace *vmspace, struct vnode *vp, map_len = round_page(offset+filsz) - file_addr; if (map_len != 0) { - vm_object_reference(object); + vm_object_reference_locked(object); /* cow flags: don't dump readonly sections in core */ cow = MAP_COPY_ON_WRITE | MAP_PREFAULT | @@ -300,11 +303,13 @@ __elfN(load_section)(struct proc *p, struct vmspace *vmspace, struct vnode *vp, vm_map_entry_release(count); if (rv != KERN_SUCCESS) { vm_object_deallocate(object); + vm_object_drop(object); return (EINVAL); } /* we can stop now if we've covered it all */ if (memsz == filsz) { + vm_object_drop(object); return (0); } } @@ -333,6 +338,7 @@ __elfN(load_section)(struct proc *p, struct vmspace *vmspace, struct vnode *vp, vm_map_unlock(&vmspace->vm_map); vm_map_entry_release(count); if (rv != KERN_SUCCESS) { + vm_object_drop(object); return (EINVAL); } } @@ -352,15 +358,17 @@ __elfN(load_section)(struct proc *p, struct vmspace *vmspace, struct vnode *vp, vm_page_unhold(m); } if (error) { + vm_object_drop(object); return (error); } } + vm_object_drop(object); /* * set it to the specified protection */ - vm_map_protect(&vmspace->vm_map, map_addr, map_addr + map_len, prot, - FALSE); + vm_map_protect(&vmspace->vm_map, map_addr, map_addr + map_len, + prot, FALSE); return (error); } @@ -1180,6 +1188,8 @@ each_segment(struct proc *p, segment_callback func, void *closure, int writable) for (entry = map->header.next; error == 0 && entry != &map->header; entry = entry->next) { vm_object_t obj; + vm_object_t lobj; + vm_object_t tobj; /* * Don't dump inaccessible mappings, deal with legacy @@ -1212,17 +1222,40 @@ each_segment(struct proc *p, segment_callback func, void *closure, int writable) if ((obj = entry->object.vm_object) == NULL) continue; - /* Find the deepest backing object. */ - while (obj->backing_object != NULL) - obj = obj->backing_object; - - /* Ignore memory-mapped devices and such things. */ - if (obj->type != OBJT_DEFAULT && - obj->type != OBJT_SWAP && - obj->type != OBJT_VNODE) - continue; + /* + * Find the bottom-most object, leaving the base object + * and the bottom-most object held (but only one hold + * if they happen to be the same). + */ + vm_object_hold(obj); + + lobj = obj; + while (lobj && (tobj = lobj->backing_object) != NULL) { + KKASSERT(tobj != obj); + vm_object_hold(tobj); + if (tobj == lobj->backing_object) { + if (lobj != obj) { + vm_object_lock_swap(); + vm_object_drop(lobj); + } + lobj = tobj; + } else { + vm_object_drop(tobj); + } + } - error = (*func)(entry, closure); + /* + * The callback only applies to default, swap, or vnode + * objects. Other types of objects such as memory-mapped + * devices are ignored. 
+ */ + if (lobj->type == OBJT_DEFAULT || lobj->type == OBJT_SWAP || + lobj->type == OBJT_VNODE) { + error = (*func)(entry, closure); + } + if (lobj != obj) + vm_object_drop(lobj); + vm_object_drop(obj); } return (error); } diff --git a/sys/kern/init_main.c b/sys/kern/init_main.c index c5e2489cdb..3452b44f8f 100644 --- a/sys/kern/init_main.c +++ b/sys/kern/init_main.c @@ -213,7 +213,7 @@ mi_startup(void) */ if ((long)sysinit % 8 != 0) { kprintf("Fixing sysinit value...\n"); - sysinit = (long)sysinit + 4; + sysinit = (void *)((long)(intptr_t)sysinit + 4); } #endif sysinit_end = SET_LIMIT(sysinit_set); diff --git a/sys/kern/kern_clock.c b/sys/kern/kern_clock.c index a2d6fa333e..428276759b 100644 --- a/sys/kern/kern_clock.c +++ b/sys/kern/kern_clock.c @@ -790,9 +790,12 @@ schedclock(systimer_t info, int in_ipi __unused, struct intrframe *frame) ru->ru_ixrss += pgtok(vm->vm_tsize); ru->ru_idrss += pgtok(vm->vm_dsize); ru->ru_isrss += pgtok(vm->vm_ssize); - rss = pgtok(vmspace_resident_count(vm)); - if (ru->ru_maxrss < rss) - ru->ru_maxrss = rss; + if (lwkt_trytoken(&vm->vm_map.token)) { + rss = pgtok(vmspace_resident_count(vm)); + if (ru->ru_maxrss < rss) + ru->ru_maxrss = rss; + lwkt_reltoken(&vm->vm_map.token); + } } } } @@ -1421,3 +1424,19 @@ tsc_test_target(int64_t target) #endif return(-1); } + +/* + * Delay the specified number of nanoseconds using the tsc. This function + * returns immediately if the TSC is not supported. At least one cpu_pause() + * will be issued. + */ +void +tsc_delay(int ns) +{ + int64_t clk; + + clk = tsc_get_target(ns); + cpu_pause(); + while (tsc_test_target(clk) == 0) + cpu_pause(); +} diff --git a/sys/kern/kern_exec.c b/sys/kern/kern_exec.c index fdc83cc0e1..729c8f5e42 100644 --- a/sys/kern/kern_exec.c +++ b/sys/kern/kern_exec.c @@ -576,8 +576,6 @@ exec_fail: /* * execve() system call. - * - * MPALMOSTSAFE */ int sys_execve(struct execve_args *uap) @@ -588,7 +586,6 @@ sys_execve(struct execve_args *uap) bzero(&args, sizeof(args)); - get_mplock(); error = nlookup_init(&nd, uap->fname, UIO_USERSPACE, NLC_FOLLOW); if (error == 0) { error = exec_copyin_args(&args, uap->fname, PATH_USERSPACE, @@ -604,7 +601,6 @@ sys_execve(struct execve_args *uap) exit1(W_EXITCODE(0, SIGABRT)); /* NOTREACHED */ } - rel_mplock(); /* * The syscall result is returned in registers to the new program. @@ -635,9 +631,8 @@ exec_map_page(struct image_params *imgp, vm_pindex_t pageno, if (pageno >= object->size) return (EIO); + vm_object_hold(object); m = vm_page_grab(object, pageno, VM_ALLOC_NORMAL | VM_ALLOC_RETRY); - - lwkt_gettoken(&vm_token); while ((m->valid & VM_PAGE_BITS_ALL) != VM_PAGE_BITS_ALL) { ma = m; @@ -656,13 +651,12 @@ exec_map_page(struct image_params *imgp, vm_pindex_t pageno, vm_page_protect(m, VM_PROT_NONE); vnode_pager_freepage(m); } - lwkt_reltoken(&vm_token); return EIO; } } - vm_page_hold(m); /* requires vm_token to be held */ + vm_page_hold(m); vm_page_wakeup(m); /* unbusy the page */ - lwkt_reltoken(&vm_token); + vm_object_drop(object); *plwb = lwbuf_alloc(m, *plwb); *pdata = (void *)lwbuf_kva(*plwb); diff --git a/sys/kern/kern_kinfo.c b/sys/kern/kern_kinfo.c index 65fbf483c6..71717b95f6 100644 --- a/sys/kern/kern_kinfo.c +++ b/sys/kern/kern_kinfo.c @@ -53,6 +53,8 @@ #include #ifdef _KERNEL #include +#include +#include #else #include @@ -72,6 +74,7 @@ fill_kinfo_proc(struct proc *p, struct kinfo_proc *kp) { struct session *sess; struct pgrp *pgrp; + struct vmspace *vm; pgrp = p->p_pgrp; sess = pgrp ? 
pgrp->pg_session : NULL; @@ -145,14 +148,22 @@ fill_kinfo_proc(struct proc *p, struct kinfo_proc *kp) kp->kp_nice = p->p_nice; kp->kp_swtime = p->p_swtime; - if (p->p_vmspace) { - kp->kp_vm_map_size = p->p_vmspace->vm_map.size; - kp->kp_vm_rssize = vmspace_resident_count(p->p_vmspace); - kp->kp_vm_prssize = vmspace_president_count(p->p_vmspace); - kp->kp_vm_swrss = p->p_vmspace->vm_swrss; - kp->kp_vm_tsize = p->p_vmspace->vm_tsize; - kp->kp_vm_dsize = p->p_vmspace->vm_dsize; - kp->kp_vm_ssize = p->p_vmspace->vm_ssize; + if ((vm = p->p_vmspace) != NULL) { +#ifdef _KERNEL + sysref_get(&vm->vm_sysref); + lwkt_gettoken(&vm->vm_map.token); +#endif + kp->kp_vm_map_size = vm->vm_map.size; + kp->kp_vm_rssize = vmspace_resident_count(vm); + kp->kp_vm_prssize = vmspace_president_count(vm); + kp->kp_vm_swrss = vm->vm_swrss; + kp->kp_vm_tsize = vm->vm_tsize; + kp->kp_vm_dsize = vm->vm_dsize; + kp->kp_vm_ssize = vm->vm_ssize; +#ifdef _KERNEL + lwkt_reltoken(&vm->vm_map.token); + sysref_put(&vm->vm_sysref); +#endif } if (p->p_ucred && jailed(p->p_ucred)) diff --git a/sys/kern/kern_lock.c b/sys/kern/kern_lock.c index da26055d89..4593cba907 100644 --- a/sys/kern/kern_lock.c +++ b/sys/kern/kern_lock.c @@ -196,14 +196,7 @@ debuglockmgr(struct lock *lkp, u_int flags, } #endif - /* - * So sue me, I'm too tired. - */ - if (spin_trylock(&lkp->lk_spinlock) == FALSE) { - if (flags & LK_NOSPINWAIT) - return(EBUSY); - spin_lock(&lkp->lk_spinlock); - } + spin_lock(&lkp->lk_spinlock); extflags = (flags | lkp->lk_flags) & LK_EXTFLG_MASK; td = curthread; diff --git a/sys/kern/kern_slaballoc.c b/sys/kern/kern_slaballoc.c index 8eb1b4e6f3..747ecc3753 100644 --- a/sys/kern/kern_slaballoc.c +++ b/sys/kern/kern_slaballoc.c @@ -227,6 +227,24 @@ SYSCTL_INT(_kern, OID_AUTO, zone_big_alloc, CTLFLAG_RD, &ZoneBigAlloc, 0, ""); SYSCTL_INT(_kern, OID_AUTO, zone_gen_alloc, CTLFLAG_RD, &ZoneGenAlloc, 0, ""); SYSCTL_INT(_kern, OID_AUTO, zone_cache, CTLFLAG_RW, &ZoneRelsThresh, 0, ""); +/* + * Returns the kernel memory size limit for the purposes of initializing + * various subsystem caches. The smaller of available memory and the KVM + * memory space is returned. + * + * The size in megabytes is returned. + */ +size_t +kmem_lim_size(void) +{ + size_t limsize; + + limsize = (size_t)vmstats.v_page_count * PAGE_SIZE; + if (limsize > KvaSize) + limsize = KvaSize; + return (limsize / (1024 * 1024)); +} + static void kmeminit(void *dummy) { @@ -234,12 +252,28 @@ kmeminit(void *dummy) int usesize; int i; - limsize = (size_t)vmstats.v_page_count * PAGE_SIZE; - if (limsize > KvaSize) - limsize = KvaSize; + limsize = kmem_lim_size(); + usesize = (int)(limsize * 1024); /* convert to KB */ - usesize = (int)(limsize / 1024); /* convert to KB */ + /* + * If the machine has a large KVM space and more than 8G of ram, + * double the zone release threshold to reduce SMP invalidations. + * If more than 16G of ram, do it again. + * + * The BIOS eats a little ram so add some slop. We want 8G worth of + * memory sticks to trigger the first adjustment. + */ + if (ZoneRelsThresh == ZONE_RELS_THRESH) { + if (limsize >= 7 * 1024) + ZoneRelsThresh *= 2; + if (limsize >= 15 * 1024) + ZoneRelsThresh *= 2; + } + /* + * Calculate the zone size. 
This typically calculates to + * ZALLOC_MAX_ZONE_SIZE + */ ZoneSize = ZALLOC_MIN_ZONE_SIZE; while (ZoneSize < ZALLOC_MAX_ZONE_SIZE && (ZoneSize << 1) < usesize) ZoneSize <<= 1; @@ -276,9 +310,7 @@ malloc_init(void *data) if (vmstats.v_page_count == 0) panic("malloc_init not allowed before vm init"); - limsize = (size_t)vmstats.v_page_count * PAGE_SIZE; - if (limsize > KvaSize) - limsize = KvaSize; + limsize = kmem_lim_size() * (1024 * 1024); type->ks_limit = limsize / 10; type->ks_next = kmemstatistics; @@ -1363,7 +1395,7 @@ chunk_mark_free(SLZone *z, void *chunk) * Interrupt code which has preempted other code is not allowed to * use PQ_CACHE pages. However, if an interrupt thread is run * non-preemptively or blocks and then runs non-preemptively, then - * it is free to use PQ_CACHE pages. + * it is free to use PQ_CACHE pages. <--- may not apply any longer XXX */ static void * kmem_slab_alloc(vm_size_t size, vm_offset_t align, int flags) @@ -1371,22 +1403,13 @@ kmem_slab_alloc(vm_size_t size, vm_offset_t align, int flags) vm_size_t i; vm_offset_t addr; int count, vmflags, base_vmflags; - vm_page_t mp[ZALLOC_MAX_ZONE_SIZE / PAGE_SIZE]; + vm_page_t mbase = NULL; + vm_page_t m; thread_t td; size = round_page(size); addr = vm_map_min(&kernel_map); - /* - * Reserve properly aligned space from kernel_map. RNOWAIT allocations - * cannot block. - */ - if (flags & M_RNOWAIT) { - if (lwkt_trytoken(&vm_token) == 0) - return(NULL); - } else { - lwkt_gettoken(&vm_token); - } count = vm_map_entry_reserve(MAP_RESERVE_COUNT); crit_enter(); vm_map_lock(&kernel_map); @@ -1396,19 +1419,22 @@ kmem_slab_alloc(vm_size_t size, vm_offset_t align, int flags) panic("kmem_slab_alloc(): kernel_map ran out of space!"); vm_map_entry_release(count); crit_exit(); - lwkt_reltoken(&vm_token); return(NULL); } /* * kernel_object maps 1:1 to kernel_map. */ - vm_object_reference(&kernel_object); + vm_object_hold(&kernel_object); + vm_object_reference_locked(&kernel_object); vm_map_insert(&kernel_map, &count, &kernel_object, addr, addr, addr + size, VM_MAPTYPE_NORMAL, VM_PROT_ALL, VM_PROT_ALL, 0); + vm_object_drop(&kernel_object); + vm_map_set_wired_quick(&kernel_map, addr, size, &count); + vm_map_unlock(&kernel_map); td = curthread; @@ -1424,32 +1450,28 @@ kmem_slab_alloc(vm_size_t size, vm_offset_t align, int flags) flags, ((int **)&size)[-1]); } - /* - * Allocate the pages. Do not mess with the PG_ZERO flag yet. + * Allocate the pages. Do not mess with the PG_ZERO flag or map + * them yet. VM_ALLOC_NORMAL can only be set if we are not preempting. + * + * VM_ALLOC_SYSTEM is automatically set if we are preempting and + * M_WAITOK was specified as an alternative (i.e. M_USE_RESERVE is + * implied in this case), though I'm not sure if we really need to + * do that. */ - for (i = 0; i < size; i += PAGE_SIZE) { - vm_page_t m; - - /* - * VM_ALLOC_NORMAL can only be set if we are not preempting. - * - * VM_ALLOC_SYSTEM is automatically set if we are preempting and - * M_WAITOK was specified as an alternative (i.e. M_USE_RESERVE is - * implied in this case), though I'm not sure if we really need to - * do that. 
- */ - vmflags = base_vmflags; - if (flags & M_WAITOK) { - if (td->td_preempted) - vmflags |= VM_ALLOC_SYSTEM; - else - vmflags |= VM_ALLOC_NORMAL; - } + vmflags = base_vmflags; + if (flags & M_WAITOK) { + if (td->td_preempted) + vmflags |= VM_ALLOC_SYSTEM; + else + vmflags |= VM_ALLOC_NORMAL; + } + vm_object_hold(&kernel_object); + for (i = 0; i < size; i += PAGE_SIZE) { m = vm_page_alloc(&kernel_object, OFF_TO_IDX(addr + i), vmflags); - if (i / PAGE_SIZE < NELEM(mp)) - mp[i / PAGE_SIZE] = m; + if (i == 0) + mbase = m; /* * If the allocation failed we either return NULL or we retry. @@ -1463,73 +1485,73 @@ kmem_slab_alloc(vm_size_t size, vm_offset_t align, int flags) if (m == NULL) { if (flags & M_WAITOK) { if (td->td_preempted) { - vm_map_unlock(&kernel_map); lwkt_switch(); - vm_map_lock(&kernel_map); } else { - vm_map_unlock(&kernel_map); vm_wait(0); - vm_map_lock(&kernel_map); } i -= PAGE_SIZE; /* retry */ continue; } + break; + } + } - /* - * We were unable to recover, cleanup and return NULL - * - * (vm_token already held) - */ - while (i != 0) { - i -= PAGE_SIZE; - m = vm_page_lookup(&kernel_object, OFF_TO_IDX(addr + i)); - /* page should already be busy */ - vm_page_free(m); - } - vm_map_delete(&kernel_map, addr, addr + size, &count); - vm_map_unlock(&kernel_map); - vm_map_entry_release(count); - crit_exit(); - lwkt_reltoken(&vm_token); - return(NULL); + /* + * Check and deal with an allocation failure + */ + if (i != size) { + while (i != 0) { + i -= PAGE_SIZE; + m = vm_page_lookup(&kernel_object, OFF_TO_IDX(addr + i)); + /* page should already be busy */ + vm_page_free(m); } + vm_map_lock(&kernel_map); + vm_map_delete(&kernel_map, addr, addr + size, &count); + vm_map_unlock(&kernel_map); + vm_object_drop(&kernel_object); + + vm_map_entry_release(count); + crit_exit(); + return(NULL); } /* * Success! * - * Mark the map entry as non-pageable using a routine that allows us to - * populate the underlying pages. - * - * The pages were busied by the allocations above. + * NOTE: The VM pages are still busied. mbase points to the first one + * but we have to iterate via vm_page_next() */ - vm_map_set_wired_quick(&kernel_map, addr, size, &count); + vm_object_drop(&kernel_object); crit_exit(); /* * Enter the pages into the pmap and deal with PG_ZERO and M_ZERO. 
*/ - for (i = 0; i < size; i += PAGE_SIZE) { - vm_page_t m; + m = mbase; + i = 0; - if (i / PAGE_SIZE < NELEM(mp)) - m = mp[i / PAGE_SIZE]; - else - m = vm_page_lookup(&kernel_object, OFF_TO_IDX(addr + i)); + while (i < size) { + /* + * page should already be busy + */ m->valid = VM_PAGE_BITS_ALL; - /* page should already be busy */ vm_page_wire(m); - pmap_enter(&kernel_pmap, addr + i, m, VM_PROT_ALL, 1); + pmap_enter(&kernel_pmap, addr + i, m, VM_PROT_ALL | VM_PROT_NOSYNC, 1); if ((m->flags & PG_ZERO) == 0 && (flags & M_ZERO)) bzero((char *)addr + i, PAGE_SIZE); vm_page_flag_clear(m, PG_ZERO); KKASSERT(m->flags & (PG_WRITEABLE | PG_MAPPED)); vm_page_flag_set(m, PG_REFERENCED); vm_page_wakeup(m); + + i += PAGE_SIZE; + vm_object_hold(&kernel_object); + m = vm_page_next(m); + vm_object_drop(&kernel_object); } - vm_map_unlock(&kernel_map); + smp_invltlb(); vm_map_entry_release(count); - lwkt_reltoken(&vm_token); return((void *)addr); } @@ -1540,9 +1562,7 @@ static void kmem_slab_free(void *ptr, vm_size_t size) { crit_enter(); - lwkt_gettoken(&vm_token); vm_map_remove(&kernel_map, (vm_offset_t)ptr, (vm_offset_t)ptr + size); - lwkt_reltoken(&vm_token); crit_exit(); } diff --git a/sys/kern/kern_spinlock.c b/sys/kern/kern_spinlock.c index e5a2c4b057..018cfedbe2 100644 --- a/sys/kern/kern_spinlock.c +++ b/sys/kern/kern_spinlock.c @@ -28,8 +28,11 @@ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. - * - * $DragonFly: src/sys/kern/kern_spinlock.c,v 1.16 2008/09/11 01:11:42 y0netan1 Exp $ + */ +/* + * The spinlock code utilizes two counters to form a virtual FIFO, allowing + * a spinlock to allocate a slot and then only issue memory read operations + * until it is handed the lock (if it is not the next owner for the lock). 
*/ #include @@ -42,6 +45,7 @@ #endif #include #include +#include #include #include #include @@ -49,11 +53,13 @@ #include #include -#define BACKOFF_INITIAL 1 -#define BACKOFF_LIMIT 256 - #ifdef SMP +struct indefinite_info { + sysclock_t base; + int secs; +}; + /* * Kernal Trace */ @@ -66,20 +72,14 @@ KTR_INFO_MASTER(spin); KTR_INFO(KTR_SPIN_CONTENTION, spin, beg, 0, SPIN_STRING, SPIN_ARG_SIZE); KTR_INFO(KTR_SPIN_CONTENTION, spin, end, 1, SPIN_STRING, SPIN_ARG_SIZE); -KTR_INFO(KTR_SPIN_CONTENTION, spin, backoff, 2, - "spin=%p bo1=%d thr=%p bo=%d", - ((2 * sizeof(void *)) + (2 * sizeof(int)))); -KTR_INFO(KTR_SPIN_CONTENTION, spin, bofail, 3, SPIN_STRING, SPIN_ARG_SIZE); - -#define logspin(name, mtx, type) \ - KTR_LOG(spin_ ## name, mtx, type) -#define logspin_backoff(mtx, bo1, thr, bo) \ - KTR_LOG(spin_backoff, mtx, bo1, thr, bo) +#define logspin(name, spin, type) \ + KTR_LOG(spin_ ## name, spin, type) #ifdef INVARIANTS static int spin_lock_test_mode; #endif +struct spinlock pmap_spin = SPINLOCK_INITIALIZER(pmap_spin); static int64_t spinlocks_contested1; SYSCTL_QUAD(_debug, OID_AUTO, spinlocks_contested1, CTLFLAG_RD, @@ -91,71 +91,67 @@ SYSCTL_QUAD(_debug, OID_AUTO, spinlocks_contested2, CTLFLAG_RD, &spinlocks_contested2, 0, "Serious spinlock contention count"); -static int spinlocks_backoff_limit = BACKOFF_LIMIT; -SYSCTL_INT(_debug, OID_AUTO, spinlocks_bolim, CTLFLAG_RW, - &spinlocks_backoff_limit, 0, - "Contested spinlock backoff limit"); +static int spinlocks_hardloops = 40; +SYSCTL_INT(_debug, OID_AUTO, spinlocks_hardloops, CTLFLAG_RW, + &spinlocks_hardloops, 0, + "Hard loops waiting for spinlock"); #define SPINLOCK_NUM_POOL (1024) static struct spinlock pool_spinlocks[SPINLOCK_NUM_POOL]; -struct exponential_backoff { - int backoff; - int nsec; - struct spinlock *mtx; - sysclock_t base; -}; -static int exponential_backoff(struct exponential_backoff *bo); - -static __inline -void -exponential_init(struct exponential_backoff *bo, struct spinlock *mtx) -{ - bo->backoff = BACKOFF_INITIAL; - bo->nsec = 0; - bo->mtx = mtx; - bo->base = 0; /* silence gcc */ -} +static int spin_indefinite_check(struct spinlock *spin, + struct indefinite_info *info); /* * We contested due to another exclusive lock holder. We lose. + * + * We have to unwind the attempt and may acquire the spinlock + * anyway while doing so. countb was incremented on our behalf. */ int -spin_trylock_wr_contested2(globaldata_t gd) +spin_trylock_contested(struct spinlock *spin) { - ++spinlocks_contested1; + globaldata_t gd = mycpu; + + /*++spinlocks_contested1;*/ --gd->gd_spinlocks_wr; --gd->gd_curthread->td_critcount; return (FALSE); } /* - * We were either contested due to another exclusive lock holder, - * or due to the presence of shared locks + * The spin_lock() inline was unable to acquire the lock. * - * NOTE: If value indicates an exclusively held mutex, no shared bits - * would have been set and we can throw away value. + * atomic_swap_int() is the absolute fastest spinlock instruction, at + * least on multi-socket systems. All instructions seem to be about + * the same on single-socket multi-core systems. */ void -spin_lock_wr_contested2(struct spinlock *mtx) +spin_lock_contested(struct spinlock *spin) { - struct exponential_backoff backoff; - int value; + int i; - /* - * Wait until we can gain exclusive access vs another exclusive - * holder. 
- */ - ++spinlocks_contested1; - exponential_init(&backoff, mtx); - - logspin(beg, mtx, 'w'); - do { - if (exponential_backoff(&backoff)) - break; - value = atomic_swap_int(&mtx->lock, SPINLOCK_EXCLUSIVE); - } while (value & SPINLOCK_EXCLUSIVE); - logspin(end, mtx, 'w'); + i = 0; + while (atomic_swap_int(&spin->counta, 1)) { + cpu_pause(); + if (i == spinlocks_hardloops) { + struct indefinite_info info = { 0, 0 }; + + logspin(beg, spin, 'w'); + while (atomic_swap_int(&spin->counta, 1)) { + cpu_pause(); + ++spin->countb; + if ((++i & 0x7F) == 0x7F) { + if (spin_indefinite_check(spin, &info)) + break; + } + } + logspin(end, spin, 'w'); + return; + } + ++spin->countb; + ++i; + } } static __inline int @@ -167,19 +163,17 @@ _spin_pool_hash(void *ptr) return (i); } -struct spinlock * -spin_pool_lock(void *chan) +void +_spin_pool_lock(void *chan) { struct spinlock *sp; sp = &pool_spinlocks[_spin_pool_hash(chan)]; spin_lock(sp); - - return (sp); } void -spin_pool_unlock(void *chan) +_spin_pool_unlock(void *chan) { struct spinlock *sp; @@ -187,58 +181,24 @@ spin_pool_unlock(void *chan) spin_unlock(sp); } -/* - * Handle exponential backoff and indefinite waits. - * - * If the system is handling a panic we hand the spinlock over to the caller - * after 1 second. After 10 seconds we attempt to print a debugger - * backtrace. We also run pending interrupts in order to allow a console - * break into DDB. - */ + static int -exponential_backoff(struct exponential_backoff *bo) +spin_indefinite_check(struct spinlock *spin, struct indefinite_info *info) { sysclock_t count; - int backoff; - -#ifdef _RDTSC_SUPPORTED_ - if (cpu_feature & CPUID_TSC) { - backoff = - (((u_long)rdtsc() ^ (((u_long)curthread) >> 5)) & - (bo->backoff - 1)) + BACKOFF_INITIAL; - } else -#endif - backoff = bo->backoff; - logspin_backoff(bo->mtx, bo->backoff, curthread, backoff); - /* - * Quick backoff - */ - for (; backoff; --backoff) - cpu_pause(); - if (bo->backoff < spinlocks_backoff_limit) { - bo->backoff <<= 1; - return (FALSE); - } else { - bo->backoff = BACKOFF_INITIAL; - } - - logspin(bofail, bo->mtx, 'u'); - - /* - * Indefinite - */ - ++spinlocks_contested2; cpu_spinlock_contested(); - if (bo->nsec == 0) { - bo->base = sys_cputimer->count(); - bo->nsec = 1; - } count = sys_cputimer->count(); - if (count - bo->base > sys_cputimer->freq) { - kprintf("spin_lock: %p, indefinite wait!\n", bo->mtx); + if (info->secs == 0) { + info->base = count; + ++info->secs; + } else if (count - info->base > sys_cputimer->freq) { + kprintf("spin_lock: %p, indefinite wait (%d secs)!\n", + spin, info->secs); + info->base = count; + ++info->secs; if (panicstr) return (TRUE); #if defined(INVARIANTS) @@ -247,14 +207,12 @@ exponential_backoff(struct exponential_backoff *bo) return (TRUE); } #endif - ++bo->nsec; #if defined(INVARIANTS) - if (bo->nsec == 11) + if (info->secs == 11) print_backtrace(-1); #endif - if (bo->nsec == 60) - panic("spin_lock: %p, indefinite wait!\n", bo->mtx); - bo->base = count; + if (info->secs == 60) + panic("spin_lock: %p, indefinite wait!\n", spin); } return (FALSE); } @@ -277,7 +235,7 @@ SYSCTL_INT(_debug, OID_AUTO, spin_test_count, CTLFLAG_RW, &spin_test_count, 0, static int sysctl_spin_lock_test(SYSCTL_HANDLER_ARGS) { - struct spinlock mtx; + struct spinlock spin; int error; int value = 0; int i; @@ -291,12 +249,12 @@ sysctl_spin_lock_test(SYSCTL_HANDLER_ARGS) * Indefinite wait test */ if (value == 1) { - spin_init(&mtx); - spin_lock(&mtx); /* force an indefinite wait */ + spin_init(&spin); + spin_lock(&spin); /* force an 
indefinite wait */ spin_lock_test_mode = 1; - spin_lock(&mtx); - spin_unlock(&mtx); /* Clean up the spinlock count */ - spin_unlock(&mtx); + spin_lock(&spin); + spin_unlock(&spin); /* Clean up the spinlock count */ + spin_unlock(&spin); spin_lock_test_mode = 0; } @@ -306,10 +264,10 @@ sysctl_spin_lock_test(SYSCTL_HANDLER_ARGS) if (value == 2) { globaldata_t gd = mycpu; - spin_init(&mtx); + spin_init(&spin); for (i = spin_test_count; i > 0; --i) { - spin_lock_quick(gd, &mtx); - spin_unlock_quick(gd, &mtx); + spin_lock_quick(gd, &spin); + spin_unlock_quick(gd, &spin); } } diff --git a/sys/kern/kern_synch.c b/sys/kern/kern_synch.c index fd0d760801..2929109a74 100644 --- a/sys/kern/kern_synch.c +++ b/sys/kern/kern_synch.c @@ -313,8 +313,8 @@ updatepcpu(struct lwp *lp, int cpticks, int ttlticks) * tsleep/wakeup hash table parameters. Try to find the sweet spot for * like addresses being slept on. */ -#define TABLESIZE 1024 -#define LOOKUP(x) (((intptr_t)(x) >> 6) & (TABLESIZE - 1)) +#define TABLESIZE 4001 +#define LOOKUP(x) (((u_int)(uintptr_t)(x)) % TABLESIZE) static cpumask_t slpque_cpumasks[TABLESIZE]; @@ -371,8 +371,10 @@ _tsleep_interlock(globaldata_t gd, const volatile void *ident, int flags) if (td->td_flags & TDF_TSLEEPQ) { id = LOOKUP(td->td_wchan); TAILQ_REMOVE(&gd->gd_tsleep_hash[id], td, td_sleepq); - if (TAILQ_FIRST(&gd->gd_tsleep_hash[id]) == NULL) - atomic_clear_cpumask(&slpque_cpumasks[id], gd->gd_cpumask); + if (TAILQ_FIRST(&gd->gd_tsleep_hash[id]) == NULL) { + atomic_clear_cpumask(&slpque_cpumasks[id], + gd->gd_cpumask); + } } else { td->td_flags |= TDF_TSLEEPQ; } diff --git a/sys/kern/kern_umtx.c b/sys/kern/kern_umtx.c index d7137c11db..869db8eae4 100644 --- a/sys/kern/kern_umtx.c +++ b/sys/kern/kern_umtx.c @@ -119,7 +119,6 @@ sys_umtx_sleep(struct umtx_sleep_args *uap) * Otherwise the physical page we sleep on my not match the page * being woken up. 
*/ - lwkt_gettoken(&vm_token); m = vm_fault_page_quick((vm_offset_t)uap->ptr, VM_PROT_READ|VM_PROT_WRITE, &error); if (m == NULL) { @@ -166,7 +165,6 @@ sys_umtx_sleep(struct umtx_sleep_args *uap) /*vm_page_dirty(m); we don't actually dirty the page */ vm_page_unhold(m); done: - lwkt_reltoken(&vm_token); return(error); } @@ -178,9 +176,7 @@ done: static void umtx_sleep_page_action_cow(vm_page_t m, vm_page_action_t action) { - lwkt_gettoken(&vm_token); wakeup_domain(action->data, PDOMAIN_UMTX); - lwkt_reltoken(&vm_token); } /* @@ -203,7 +199,6 @@ sys_umtx_wakeup(struct umtx_wakeup_args *uap) cpu_mfence(); if ((vm_offset_t)uap->ptr & (sizeof(int) - 1)) return (EFAULT); - lwkt_gettoken(&vm_token); m = vm_fault_page_quick((vm_offset_t)uap->ptr, VM_PROT_READ, &error); if (m == NULL) { error = EFAULT; @@ -221,7 +216,6 @@ sys_umtx_wakeup(struct umtx_wakeup_args *uap) vm_page_unhold(m); error = 0; done: - lwkt_reltoken(&vm_token); return(error); } diff --git a/sys/kern/kern_xio.c b/sys/kern/kern_xio.c index 64416d5dcf..669700617a 100644 --- a/sys/kern/kern_xio.c +++ b/sys/kern/kern_xio.c @@ -177,8 +177,6 @@ xio_init_kbuf(xio_t xio, void *kbase, size_t kbytes) xio->xio_error = 0; if ((n = PAGE_SIZE - xio->xio_offset) > kbytes) n = kbytes; - lwkt_gettoken(&vm_token); - crit_enter(); for (i = 0; n && i < XIO_INTERNAL_PAGES; ++i) { if ((paddr = pmap_kextract(addr)) == 0) break; @@ -191,8 +189,6 @@ xio_init_kbuf(xio_t xio, void *kbase, size_t kbytes) n = PAGE_SIZE; addr += PAGE_SIZE; } - crit_exit(); - lwkt_reltoken(&vm_token); xio->xio_npages = i; /* @@ -223,14 +219,10 @@ xio_init_pages(xio_t xio, struct vm_page **mbase, int npages, int xflags) xio->xio_pages = xio->xio_internal_pages; xio->xio_npages = npages; xio->xio_error = 0; - lwkt_gettoken(&vm_token); - crit_enter(); for (i = 0; i < npages; ++i) { vm_page_hold(mbase[i]); xio->xio_pages[i] = mbase[i]; } - crit_exit(); - lwkt_reltoken(&vm_token); return(0); } @@ -244,16 +236,12 @@ xio_release(xio_t xio) int i; vm_page_t m; - lwkt_gettoken(&vm_token); - crit_enter(); for (i = 0; i < xio->xio_npages; ++i) { m = xio->xio_pages[i]; if (xio->xio_flags & XIOF_WRITE) vm_page_dirty(m); vm_page_unhold(m); } - crit_exit(); - lwkt_reltoken(&vm_token); xio->xio_offset = 0; xio->xio_npages = 0; xio->xio_bytes = 0; diff --git a/sys/kern/link_elf.c b/sys/kern/link_elf.c index 9f7b05d4e4..879a9732e7 100644 --- a/sys/kern/link_elf.c +++ b/sys/kern/link_elf.c @@ -570,7 +570,8 @@ link_elf_load_file(const char* filename, linker_file_t* result) error = ENOMEM; goto out; } - vm_object_reference(ef->object); + vm_object_hold(ef->object); + vm_object_reference_locked(ef->object); ef->address = (caddr_t)vm_map_min(&kernel_map); error = vm_map_find(&kernel_map, ef->object, 0, (vm_offset_t *)&ef->address, @@ -578,6 +579,7 @@ link_elf_load_file(const char* filename, linker_file_t* result) 1, VM_MAPTYPE_NORMAL, VM_PROT_ALL, VM_PROT_ALL, 0); + vm_object_drop(ef->object); if (error) { vm_object_deallocate(ef->object); kfree(ef, M_LINKER); diff --git a/sys/kern/link_elf_obj.c b/sys/kern/link_elf_obj.c index 51c867d174..3872ad3941 100644 --- a/sys/kern/link_elf_obj.c +++ b/sys/kern/link_elf_obj.c @@ -660,7 +660,8 @@ link_elf_obj_load_file(const char *filename, linker_file_t * result) error = ENOMEM; goto out; } - vm_object_reference(ef->object); + vm_object_hold(ef->object); + vm_object_reference_locked(ef->object); ef->address = (caddr_t) vm_map_min(&kernel_map); ef->bytes = 0; @@ -679,6 +680,7 @@ link_elf_obj_load_file(const char *filename, linker_file_t * result) 
round_page(mapsize), PAGE_SIZE, TRUE, VM_MAPTYPE_NORMAL, VM_PROT_ALL, VM_PROT_ALL, FALSE); + vm_object_drop(ef->object); if (error) { vm_object_deallocate(ef->object); ef->object = NULL; diff --git a/sys/kern/lwkt_ipiq.c b/sys/kern/lwkt_ipiq.c index 93369b67be..2235f5946f 100644 --- a/sys/kern/lwkt_ipiq.c +++ b/sys/kern/lwkt_ipiq.c @@ -73,7 +73,6 @@ static __int64_t ipiq_fifofull; /* number of fifo full conditions detected */ static __int64_t ipiq_avoided; /* interlock with target avoids cpu ipi */ static __int64_t ipiq_passive; /* passive IPI messages */ static __int64_t ipiq_cscount; /* number of cpu synchronizations */ -static int ipiq_optimized = 1; /* XXX temporary sysctl */ static int ipiq_debug; /* set to 1 for debug */ #ifdef PANIC_DEBUG static int panic_ipiq_cpu = -1; @@ -92,8 +91,6 @@ SYSCTL_QUAD(_lwkt, OID_AUTO, ipiq_passive, CTLFLAG_RW, &ipiq_passive, 0, "Number of passive IPI messages sent"); SYSCTL_QUAD(_lwkt, OID_AUTO, ipiq_cscount, CTLFLAG_RW, &ipiq_cscount, 0, "Number of cpu synchronizations"); -SYSCTL_INT(_lwkt, OID_AUTO, ipiq_optimized, CTLFLAG_RW, &ipiq_optimized, 0, - ""); SYSCTL_INT(_lwkt, OID_AUTO, ipiq_debug, CTLFLAG_RW, &ipiq_debug, 0, ""); #ifdef PANIC_DEBUG @@ -193,7 +190,7 @@ lwkt_send_ipiq3(globaldata_t target, ipifunc3_t func, void *arg1, int arg2) ++ipiq_fifofull; DEBUG_PUSH_INFO("send_ipiq3"); while (ip->ip_windex - ip->ip_rindex > MAXCPUFIFO / 4) { - if (atomic_poll_acquire_int(&ip->ip_npoll) || ipiq_optimized == 0) { + if (atomic_poll_acquire_int(&target->gd_npoll)) { logipiq(cpu_send, func, arg1, arg2, gd, target); cpu_send_ipiq(target->gd_cpuid); } @@ -213,16 +210,17 @@ lwkt_send_ipiq3(globaldata_t target, ipifunc3_t func, void *arg1, int arg2) * Queue the new message */ windex = ip->ip_windex & MAXCPUFIFO_MASK; - ip->ip_func[windex] = func; - ip->ip_arg1[windex] = arg1; - ip->ip_arg2[windex] = arg2; + ip->ip_info[windex].func = func; + ip->ip_info[windex].arg1 = arg1; + ip->ip_info[windex].arg2 = arg2; cpu_sfence(); ++ip->ip_windex; + atomic_set_cpumask(&target->gd_ipimask, gd->gd_cpumask); /* * signal the target cpu that there is work pending. 
*/ - if (atomic_poll_acquire_int(&ip->ip_npoll) || ipiq_optimized == 0) { + if (atomic_poll_acquire_int(&target->gd_npoll)) { logipiq(cpu_send, func, arg1, arg2, gd, target); cpu_send_ipiq(target->gd_cpuid); } else { @@ -282,7 +280,7 @@ lwkt_send_ipiq3_passive(globaldata_t target, ipifunc3_t func, ++ipiq_fifofull; DEBUG_PUSH_INFO("send_ipiq3_passive"); while (ip->ip_windex - ip->ip_rindex > MAXCPUFIFO / 4) { - if (atomic_poll_acquire_int(&ip->ip_npoll) || ipiq_optimized == 0) { + if (atomic_poll_acquire_int(&target->gd_npoll)) { logipiq(cpu_send, func, arg1, arg2, gd, target); cpu_send_ipiq(target->gd_cpuid); } @@ -302,11 +300,12 @@ lwkt_send_ipiq3_passive(globaldata_t target, ipifunc3_t func, * Queue the new message */ windex = ip->ip_windex & MAXCPUFIFO_MASK; - ip->ip_func[windex] = func; - ip->ip_arg1[windex] = arg1; - ip->ip_arg2[windex] = arg2; + ip->ip_info[windex].func = func; + ip->ip_info[windex].arg1 = arg1; + ip->ip_info[windex].arg2 = arg2; cpu_sfence(); ++ip->ip_windex; + atomic_set_cpumask(&target->gd_ipimask, gd->gd_cpumask); --gd->gd_intr_nesting_level; /* @@ -352,16 +351,17 @@ lwkt_send_ipiq3_nowait(globaldata_t target, ipifunc3_t func, return(ENOENT); } windex = ip->ip_windex & MAXCPUFIFO_MASK; - ip->ip_func[windex] = func; - ip->ip_arg1[windex] = arg1; - ip->ip_arg2[windex] = arg2; + ip->ip_info[windex].func = func; + ip->ip_info[windex].arg1 = arg1; + ip->ip_info[windex].arg2 = arg2; cpu_sfence(); ++ip->ip_windex; + atomic_set_cpumask(&target->gd_ipimask, gd->gd_cpumask); /* * This isn't a passive IPI, we still have to signal the target cpu. */ - if (atomic_poll_acquire_int(&ip->ip_npoll) || ipiq_optimized == 0) { + if (atomic_poll_acquire_int(&target->gd_npoll)) { logipiq(cpu_send, func, arg1, arg2, gd, target); cpu_send_ipiq(target->gd_cpuid); } else { @@ -468,7 +468,7 @@ lwkt_seq_ipiq(globaldata_t target) * Called from IPI interrupt (like a fast interrupt), which has placed * us in a critical section. The MP lock may or may not be held. * May also be called from doreti or splz, or be reentrantly called - * indirectly through the ip_func[] we run. + * indirectly through the ip_info[].func we run. * * There are two versions, one where no interrupt frame is available (when * called from the send code and from splz, and one where an interrupt @@ -485,11 +485,16 @@ lwkt_process_ipiq(void) globaldata_t gd = mycpu; globaldata_t sgd; lwkt_ipiq_t ip; + cpumask_t mask; int n; ++gd->gd_processing_ipiq; again: - for (n = 0; n < ncpus; ++n) { + cpu_lfence(); + mask = gd->gd_ipimask; + atomic_clear_cpumask(&gd->gd_ipimask, mask); + while (mask) { + n = BSFCPUMASK(mask); if (n != gd->gd_cpuid) { sgd = globaldata_find(n); ip = sgd->gd_ipiq; @@ -498,12 +503,24 @@ again: ; } } + mask &= ~CPUMASK(n); } if (lwkt_process_ipiq_core(gd, &gd->gd_cpusyncq, NULL)) { if (gd->gd_curthread->td_cscount == 0) goto again; /* need_ipiq(); do not reflag */ } + + /* + * Interlock to allow more IPI interrupts. Recheck ipimask after + * releasing gd_npoll. 
+ */ + if (gd->gd_ipimask) + goto again; + atomic_poll_release_int(&gd->gd_npoll); + cpu_mfence(); + if (gd->gd_ipimask) + goto again; --gd->gd_processing_ipiq; } @@ -513,10 +530,15 @@ lwkt_process_ipiq_frame(struct intrframe *frame) globaldata_t gd = mycpu; globaldata_t sgd; lwkt_ipiq_t ip; + cpumask_t mask; int n; again: - for (n = 0; n < ncpus; ++n) { + cpu_lfence(); + mask = gd->gd_ipimask; + atomic_clear_cpumask(&gd->gd_ipimask, mask); + while (mask) { + n = BSFCPUMASK(mask); if (n != gd->gd_cpuid) { sgd = globaldata_find(n); ip = sgd->gd_ipiq; @@ -525,6 +547,7 @@ again: ; } } + mask &= ~CPUMASK(n); } if (gd->gd_cpusyncq.ip_rindex != gd->gd_cpusyncq.ip_windex) { if (lwkt_process_ipiq_core(gd, &gd->gd_cpusyncq, frame)) { @@ -533,6 +556,17 @@ again: /* need_ipiq(); do not reflag */ } } + + /* + * Interlock to allow more IPI interrupts. Recheck ipimask after + * releasing gd_npoll. + */ + if (gd->gd_ipimask) + goto again; + atomic_poll_release_int(&gd->gd_npoll); + cpu_mfence(); + if (gd->gd_ipimask) + goto again; } #if 0 @@ -579,6 +613,9 @@ lwkt_process_ipiq_core(globaldata_t sgd, lwkt_ipiq_t ip, #endif /* + * Clear the originating core from our ipimask, we will process all + * incoming messages. + * * Obtain the current write index, which is modified by a remote cpu. * Issue a load fence to prevent speculative reads of e.g. data written * by the other cpu prior to it updating the index. @@ -605,9 +642,9 @@ lwkt_process_ipiq_core(globaldata_t sgd, lwkt_ipiq_t ip, while (wi - (ri = ip->ip_rindex) > 0) { ri &= MAXCPUFIFO_MASK; cpu_lfence(); - copy_func = ip->ip_func[ri]; - copy_arg1 = ip->ip_arg1[ri]; - copy_arg2 = ip->ip_arg2[ri]; + copy_func = ip->ip_info[ri].func; + copy_arg1 = ip->ip_info[ri].arg1; + copy_arg2 = ip->ip_info[ri].arg2; cpu_mfence(); ++ip->ip_rindex; KKASSERT((ip->ip_rindex & MAXCPUFIFO_MASK) == @@ -649,16 +686,8 @@ lwkt_process_ipiq_core(globaldata_t sgd, lwkt_ipiq_t ip, --mygd->gd_intr_nesting_level; /* - * If the queue is empty release ip_npoll to enable the other cpu to - * send us an IPI interrupt again. - * - * Return non-zero if there is still more in the queue. Note that we - * must re-check the indexes after potentially releasing ip_npoll. The - * caller must loop or otherwise ensure that a loop will occur prior to - * blocking. + * Return non-zero if there is still more in the queue. 
*/ - if (ip->ip_rindex == ip->ip_windex) - atomic_poll_release_int(&ip->ip_npoll); cpu_lfence(); return (ip->ip_rindex != ip->ip_windex); } @@ -716,6 +745,9 @@ void lwkt_cpusync_interlock(lwkt_cpusync_t cs) { #ifdef SMP +#if 0 + const char *smsg = "SMPSYNL"; +#endif globaldata_t gd = mycpu; cpumask_t mask; @@ -733,10 +765,18 @@ lwkt_cpusync_interlock(lwkt_cpusync_t cs) ++gd->gd_curthread->td_cscount; lwkt_send_ipiq_mask(mask, (ipifunc1_t)lwkt_cpusync_remote1, cs); logipiq2(sync_start, mask); +#if 0 + if (gd->gd_curthread->td_wmesg == NULL) + gd->gd_curthread->td_wmesg = smsg; +#endif while (cs->cs_mack != mask) { lwkt_process_ipiq(); cpu_pause(); } +#if 0 + if (gd->gd_curthread->td_wmesg == smsg) + gd->gd_curthread->td_wmesg = NULL; +#endif DEBUG_POP_INFO(); } #else @@ -755,6 +795,9 @@ lwkt_cpusync_deinterlock(lwkt_cpusync_t cs) { globaldata_t gd = mycpu; #ifdef SMP +#if 0 + const char *smsg = "SMPSYNU"; +#endif cpumask_t mask; /* @@ -773,10 +816,18 @@ lwkt_cpusync_deinterlock(lwkt_cpusync_t cs) cs->cs_func(cs->cs_data); if (mask) { DEBUG_PUSH_INFO("cpusync_deinterlock"); +#if 0 + if (gd->gd_curthread->td_wmesg == NULL) + gd->gd_curthread->td_wmesg = smsg; +#endif while (cs->cs_mack != mask) { lwkt_process_ipiq(); cpu_pause(); } +#if 0 + if (gd->gd_curthread->td_wmesg == smsg) + gd->gd_curthread->td_wmesg = NULL; +#endif DEBUG_POP_INFO(); /* * cpusyncq ipis may be left queued without the RQF flag set due to @@ -833,9 +884,9 @@ lwkt_cpusync_remote2(lwkt_cpusync_t cs) ip = &gd->gd_cpusyncq; wi = ip->ip_windex & MAXCPUFIFO_MASK; - ip->ip_func[wi] = (ipifunc3_t)(ipifunc1_t)lwkt_cpusync_remote2; - ip->ip_arg1[wi] = cs; - ip->ip_arg2[wi] = 0; + ip->ip_info[wi].func = (ipifunc3_t)(ipifunc1_t)lwkt_cpusync_remote2; + ip->ip_info[wi].arg1 = cs; + ip->ip_info[wi].arg2 = 0; cpu_sfence(); ++ip->ip_windex; if (ipiq_debug && (ip->ip_windex & 0xFFFFFF) == 0) { diff --git a/sys/kern/lwkt_thread.c b/sys/kern/lwkt_thread.c index 4fbf54d0c7..6dad21084c 100644 --- a/sys/kern/lwkt_thread.c +++ b/sys/kern/lwkt_thread.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2003-2010 The DragonFly Project. All rights reserved. + * Copyright (c) 2003-2011 The DragonFly Project. All rights reserved. 
* * This code is derived from software contributed to The DragonFly Project * by Matthew Dillon @@ -102,6 +102,7 @@ static void lwkt_schedule_remote(void *arg, int arg2, struct intrframe *frame); static void lwkt_setcpu_remote(void *arg); #endif static void lwkt_fairq_accumulate(globaldata_t gd, thread_t td); +static int lwkt_fairq_tick(globaldata_t gd, thread_t td); extern void cpu_heavy_restore(void); extern void cpu_lwkt_restore(void); @@ -130,18 +131,25 @@ SYSCTL_QUAD(_lwkt, OID_AUTO, preempt_weird, CTLFLAG_RW, &preempt_weird, 0, SYSCTL_QUAD(_lwkt, OID_AUTO, token_contention_count, CTLFLAG_RW, &token_contention_count, 0, "spinning due to token contention"); #endif -static int fairq_enable = 1; +static int fairq_enable = 0; SYSCTL_INT(_lwkt, OID_AUTO, fairq_enable, CTLFLAG_RW, &fairq_enable, 0, "Turn on fairq priority accumulators"); +static int fairq_bypass = 1; +SYSCTL_INT(_lwkt, OID_AUTO, fairq_bypass, CTLFLAG_RW, + &fairq_bypass, 0, "Allow fairq to bypass td on token failure"); +extern int lwkt_sched_debug; +int lwkt_sched_debug = 0; +SYSCTL_INT(_lwkt, OID_AUTO, sched_debug, CTLFLAG_RW, + &lwkt_sched_debug, 0, "Scheduler debug"); static int lwkt_spin_loops = 10; SYSCTL_INT(_lwkt, OID_AUTO, spin_loops, CTLFLAG_RW, - &lwkt_spin_loops, 0, ""); -static int lwkt_spin_delay = 1; -SYSCTL_INT(_lwkt, OID_AUTO, spin_delay, CTLFLAG_RW, - &lwkt_spin_delay, 0, "Scheduler spin delay in microseconds 0=auto"); -static int lwkt_spin_method = 1; -SYSCTL_INT(_lwkt, OID_AUTO, spin_method, CTLFLAG_RW, - &lwkt_spin_method, 0, "LWKT scheduler behavior when contended"); + &lwkt_spin_loops, 0, "Scheduler spin loops until sorted decon"); +static int lwkt_spin_reseq = 0; +SYSCTL_INT(_lwkt, OID_AUTO, spin_reseq, CTLFLAG_RW, + &lwkt_spin_reseq, 0, "Scheduler resequencer enable"); +static int lwkt_spin_monitor = 0; +SYSCTL_INT(_lwkt, OID_AUTO, spin_monitor, CTLFLAG_RW, + &lwkt_spin_monitor, 0, "Scheduler uses monitor/mwait"); static int lwkt_spin_fatal = 0; /* disabled */ SYSCTL_INT(_lwkt, OID_AUTO, spin_fatal, CTLFLAG_RW, &lwkt_spin_fatal, 0, "LWKT scheduler spin loops till fatal panic"); @@ -173,9 +181,12 @@ _lwkt_dequeue(thread_t td) td->td_flags &= ~TDF_RUNQ; TAILQ_REMOVE(&gd->gd_tdrunq, td, td_threadq); + gd->gd_fairq_total_pri -= td->td_pri; if (TAILQ_FIRST(&gd->gd_tdrunq) == NULL) atomic_clear_int(&gd->gd_reqflags, RQF_RUNNING); + + /*td->td_fairq_lticks = ticks;*/ } } @@ -200,7 +211,7 @@ _lwkt_enqueue(thread_t td) TAILQ_INSERT_TAIL(&gd->gd_tdrunq, td, td_threadq); atomic_set_int(&gd->gd_reqflags, RQF_RUNNING); } else { - while (xtd && xtd->td_pri > td->td_pri) + while (xtd && xtd->td_pri >= td->td_pri) xtd = TAILQ_NEXT(xtd, td_threadq); if (xtd) TAILQ_INSERT_BEFORE(xtd, td, td_threadq); @@ -208,6 +219,15 @@ _lwkt_enqueue(thread_t td) TAILQ_INSERT_TAIL(&gd->gd_tdrunq, td, td_threadq); } gd->gd_fairq_total_pri += td->td_pri; + + /* + * The thread might have been dequeued for a while, bump it's + * fairq. + */ + if (td->td_fairq_lticks != ticks) { + td->td_fairq_lticks = ticks; + lwkt_fairq_accumulate(gd, td); + } } } @@ -512,11 +532,7 @@ lwkt_switch(void) thread_t td = gd->gd_curthread; thread_t ntd; thread_t xtd; - int spinning = lwkt_spin_loops; /* loops before HLTing */ - int reqflags; - int cseq; - int oseq; - int fatal_count; + int spinning = 0; KKASSERT(gd->gd_processing_ipiq == 0); @@ -618,6 +634,12 @@ lwkt_switch(void) goto havethread_preempted; } + /* + * Update the fairq accumulator if we are switching away in a + * different tick. 
+ */ + lwkt_fairq_tick(gd, td); + /* * Implement round-robin fairq with priority insertion. The priority * insertion is handled by _lwkt_enqueue() @@ -630,18 +652,14 @@ lwkt_switch(void) */ for (;;) { /* - * Clear RQF_AST_LWKT_RESCHED (we handle the reschedule request) - * and set RQF_WAKEUP (prevent unnecessary IPIs from being - * received). + * We have already docked the current thread. If we get stuck in a + * scheduler switching loop we do not want to dock it over and over + * again. Reset lticks. */ - for (;;) { - reqflags = gd->gd_reqflags; - if (atomic_cmpset_int(&gd->gd_reqflags, reqflags, - (reqflags & ~RQF_AST_LWKT_RESCHED) | - RQF_WAKEUP)) { - break; - } - } + if (td != &gd->gd_idlethread) + td->td_fairq_lticks = ticks; + + clear_lwkt_resched(); /* * Hotpath - pull the head of the run queue and attempt to schedule @@ -653,8 +671,7 @@ lwkt_switch(void) if (ntd == NULL) { /* - * Runq is empty, switch to idle and clear RQF_WAKEUP - * to allow it to halt. + * Runq is empty, switch to idle to allow it to halt. */ ntd = &gd->gd_idlethread; #ifdef SMP @@ -663,10 +680,11 @@ lwkt_switch(void) #endif cpu_time.cp_msg[0] = 0; cpu_time.cp_stallpc = 0; - atomic_clear_int(&gd->gd_reqflags, RQF_WAKEUP); goto haveidle; } + break; +#if 0 if (ntd->td_fairq_accum >= 0) break; @@ -674,47 +692,83 @@ lwkt_switch(void) lwkt_fairq_accumulate(gd, ntd); TAILQ_REMOVE(&gd->gd_tdrunq, ntd, td_threadq); TAILQ_INSERT_TAIL(&gd->gd_tdrunq, ntd, td_threadq); +#endif } /* - * Hotpath - schedule ntd. Leaves RQF_WAKEUP set to prevent - * unwanted decontention IPIs. + * Hotpath - schedule ntd. * * NOTE: For UP there is no mplock and lwkt_getalltokens() * always succeeds. */ - if (TD_TOKS_NOT_HELD(ntd) || lwkt_getalltokens(ntd)) + if (TD_TOKS_NOT_HELD(ntd) || + lwkt_getalltokens(ntd, (spinning >= lwkt_spin_loops))) + { goto havethread; + } /* * Coldpath (SMP only since tokens always succeed on UP) * * We had some contention on the thread we wanted to schedule. * What we do now is try to find a thread that we can schedule - * in its stead until decontention reschedules on our cpu. + * in its stead. * * The coldpath scan does NOT rearrange threads in the run list - * and it also ignores the accumulator. + * and it also ignores the accumulator. We locate the thread with + * the highest accumulator value (positive or negative), then the + * next highest, and so forth. This isn't the most efficient but + * will theoretically try to schedule one thread per pass which + * is not horrible. * - * We do not immediately schedule a user priority thread, instead - * we record it in xtd and continue looking for kernel threads. - * A cpu can only have one user priority thread (normally) so just - * record the first one. + * If the accumulator for the selected thread happens to be negative + * the timer interrupt will come along and ask for another reschedule + * within 1 tick. * * NOTE: This scan will also include threads whos fairq's were * accumulated in the first loop. 
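[Annotation] The rewritten coldpath described above no longer rotates the run queue or waits for a decontention IPI: if the head thread's tokens cannot all be obtained, the scheduler walks the rest of the run list in order and runs the first candidate whose tokens it can take, and only when every runnable thread is contended does it fall back to the idle thread and bump the spinning count (which later switches lwkt_getalltokens() into sorted mode). The standalone sketch below models just that selection logic; the toy_thread list and try_tokens() stub are invented for the example and are not the kernel structures.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    struct toy_thread {
        const char        *name;
        bool               tokens_free;   /* stand-in for token acquisition   */
        struct toy_thread *next;          /* run queue order, highest pri first */
    };

    /* Stub for TD_TOKS_NOT_HELD(ntd) || lwkt_getalltokens(ntd, spinning). */
    static bool
    try_tokens(struct toy_thread *td, int spinning)
    {
        (void)spinning;                   /* real code sorts tokens when spinning */
        return td->tokens_free;
    }

    static struct toy_thread *
    pick_next(struct toy_thread *runq, int spinning)
    {
        struct toy_thread *ntd;

        if (runq == NULL)
            return NULL;                  /* empty: caller switches to idle */

        /* Hotpath: head of the run queue. */
        if (try_tokens(runq, spinning))
            return runq;

        /* Coldpath: scan the remaining threads in queue order. */
        for (ntd = runq->next; ntd != NULL; ntd = ntd->next) {
            if (try_tokens(ntd, spinning))
                return ntd;
        }
        return NULL;                      /* all contended: idle + ++spinning */
    }

    int
    main(void)
    {
        struct toy_thread c = { "c", true,  NULL };
        struct toy_thread b = { "b", false, &c };
        struct toy_thread a = { "a", false, &b };
        struct toy_thread *ntd = pick_next(&a, 0);

        printf("selected: %s\n", ntd ? ntd->name : "(idle)");
        return 0;
    }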
*/ +#ifdef INVARIANTS ++token_contention_count; +#endif + + if (fairq_bypass) + goto skip; + + need_lwkt_resched(); + xtd = NULL; + while ((ntd = TAILQ_NEXT(ntd, td_threadq)) != NULL) { +#if 0 + if (ntd->td_fairq_accum < 0) + continue; + if (xtd == NULL || ntd->td_pri > xtd->td_pri) + xtd = ntd; +#endif + if (TD_TOKS_NOT_HELD(ntd) || + lwkt_getalltokens(ntd, (spinning >= lwkt_spin_loops))) { + goto havethread; + } + } +#if 0 + if (xtd) { + if (TD_TOKS_NOT_HELD(xtd) || + lwkt_getalltokens(xtd, (spinning >= lwkt_spin_loops))) + { + ntd = xtd; + goto havethread; + } + } +#endif + +#if 0 + if (fairq_bypass) + goto skip; + xtd = NULL; while ((ntd = TAILQ_NEXT(ntd, td_threadq)) != NULL) { /* - * Try to switch to this thread. If the thread is running at - * user priority we clear WAKEUP to allow decontention IPIs - * (since this thread is simply running until the one we wanted - * decontends), and we make sure that LWKT_RESCHED is not set. - * - * Otherwise for kernel threads we leave WAKEUP set to avoid - * unnecessary decontention IPIs. + * Try to switch to this thread. Kernel threads have priority + * over user threads in this case. */ if (ntd->td_pri < TDPRI_KERN_LPSCHED) { if (xtd == NULL) @@ -722,89 +776,33 @@ lwkt_switch(void) continue; } - /* - * Do not let the fairq get too negative. Even though we are - * ignoring it atm once the scheduler decontends a very negative - * thread will get moved to the end of the queue. - */ - if (TD_TOKS_NOT_HELD(ntd) || lwkt_getalltokens(ntd)) { - if (ntd->td_fairq_accum < -TDFAIRQ_MAX(gd)) - ntd->td_fairq_accum = -TDFAIRQ_MAX(gd); + if (TD_TOKS_NOT_HELD(ntd) || + lwkt_getalltokens(ntd, (spinning >= lwkt_spin_loops))) + { goto havethread; } - - /* - * Well fubar, this thread is contended as well, loop - */ - /* */ + /* thread contested, try another */ } /* * We exhausted the run list but we may have recorded a user - * thread to try. We have three choices based on - * lwkt.decontention_method. - * - * (0) Atomically clear RQF_WAKEUP in order to receive decontention - * IPIs (to interrupt the user process) and test - * RQF_AST_LWKT_RESCHED at the same time. - * - * This results in significant decontention IPI traffic but may - * be more responsive. - * - * (1) Leave RQF_WAKEUP set so we do not receive a decontention IPI. - * An automatic LWKT reschedule will occur on the next hardclock - * (typically 100hz). - * - * This results in no decontention IPI traffic but may be less - * responsive. This is the default. - * - * (2) Refuse to schedule the user process at this time. - * - * This is highly experimental and should not be used under - * normal circumstances. This can cause a user process to - * get starved out in situations where kernel threads are - * fighting each other for tokens. + * thread to try. 
*/ if (xtd) { ntd = xtd; - - switch(lwkt_spin_method) { - case 0: - for (;;) { - reqflags = gd->gd_reqflags; - if (atomic_cmpset_int(&gd->gd_reqflags, - reqflags, - reqflags & ~RQF_WAKEUP)) { - break; - } - } - break; - case 1: - reqflags = gd->gd_reqflags; - break; - default: - goto skip; - break; - } - if ((reqflags & RQF_AST_LWKT_RESCHED) == 0 && - (TD_TOKS_NOT_HELD(ntd) || lwkt_getalltokens(ntd)) + if ((gd->gd_reqflags & RQF_AST_LWKT_RESCHED) == 0 && + (TD_TOKS_NOT_HELD(ntd) || + lwkt_getalltokens(ntd, (spinning >= lwkt_spin_loops))) ) { - if (ntd->td_fairq_accum < -TDFAIRQ_MAX(gd)) - ntd->td_fairq_accum = -TDFAIRQ_MAX(gd); goto havethread; } - -skip: - /* - * Make sure RQF_WAKEUP is set if we failed to schedule the - * user thread to prevent the idle thread from halting. - */ - atomic_set_int(&gd->gd_reqflags, RQF_WAKEUP); } +#endif +skip: /* * We exhausted the run list, meaning that all runnable threads - * are contended. + * are contested. */ cpu_pause(); ntd = &gd->gd_idlethread; @@ -815,97 +813,108 @@ skip: #endif /* - * Ok, we might want to spin a few times as some tokens are held for - * very short periods of time and IPI overhead is 1uS or worse - * (meaning it is usually better to spin). Regardless we have to - * call splz_check() to be sure to service any interrupts blocked - * by our critical section, otherwise we could livelock e.g. IPIs. - * - * The IPI mechanic is really a last resort. In nearly all other - * cases RQF_WAKEUP is left set to prevent decontention IPIs. + * We are going to have to retry but if the current thread is not + * on the runq we instead switch through the idle thread to get away + * from the current thread. We have to flag for lwkt reschedule + * to prevent the idle thread from halting. * - * When we decide not to spin we clear RQF_WAKEUP and switch to - * the idle thread. Clearing RQF_WEAKEUP allows the idle thread - * to halt and decontended tokens will issue an IPI to us. The - * idle thread will check for pending reschedules already set - * (RQF_AST_LWKT_RESCHED) before actually halting so we don't have - * to here. - * - * Also, if TDF_RUNQ is not set the current thread is trying to - * deschedule, possibly in an atomic fashion. We cannot afford to - * stay here. + * NOTE: A non-zero spinning is passed to lwkt_getalltokens() to + * instruct it to deal with the potential for deadlocks by + * ordering the tokens by address. */ - if (spinning <= 0 || (td->td_flags & TDF_RUNQ) == 0) { - atomic_clear_int(&gd->gd_reqflags, RQF_WAKEUP); + if ((td->td_flags & TDF_RUNQ) == 0) { + need_lwkt_resched(); goto haveidle; } - --spinning; - - /* - * When spinning a delay is required both to avoid livelocks from - * token order reversals (a thread may be trying to acquire multiple - * tokens), and also to reduce cpu cache management traffic. - * - * In order to scale to a large number of CPUs we use a time slot - * resequencer to force contending cpus into non-contending - * time-slots. The scheduler may still contend with the lock holder - * but will not (generally) contend with all the other cpus trying - * trying to get the same token. - * - * The resequencer uses a FIFO counter mechanic. The owner of the - * rindex at the head of the FIFO is allowed to pull itself off - * the FIFO and fetchadd is used to enter into the FIFO. This bit - * of code is VERY cache friendly and forces all spinning schedulers - * into their own time slots. 
- * - * This code has been tested to 48-cpus and caps the cache - * contention load at ~1uS intervals regardless of the number of - * cpus. Scaling beyond 64 cpus might require additional smarts - * (such as separate FIFOs for specific token cases). - * - * WARNING! We can't call splz_check() or anything else here as - * it could cause a deadlock. - */ #if defined(INVARIANTS) && defined(__amd64__) if ((read_rflags() & PSL_I) == 0) { cpu_enable_intr(); panic("lwkt_switch() called with interrupts disabled"); } #endif - cseq = atomic_fetchadd_int(&lwkt_cseq_windex, 1); - fatal_count = lwkt_spin_fatal; - while ((oseq = lwkt_cseq_rindex) != cseq) { - cpu_ccfence(); -#if !defined(_KERNEL_VIRTUAL) - if (cpu_mi_feature & CPU_MI_MONITOR) { - cpu_mmw_pause_int(&lwkt_cseq_rindex, oseq); - } else + + /* + * Number iterations so far. After a certain point we switch to + * a sorted-address/monitor/mwait version of lwkt_getalltokens() + */ + if (spinning < 0x7FFFFFFF) + ++spinning; + +#ifdef SMP + /* + * lwkt_getalltokens() failed in sorted token mode, we can use + * monitor/mwait in this case. + */ + if (spinning >= lwkt_spin_loops && + (cpu_mi_feature & CPU_MI_MONITOR) && + lwkt_spin_monitor) + { + cpu_mmw_pause_int(&gd->gd_reqflags, + (gd->gd_reqflags | RQF_SPINNING) & + ~RQF_IDLECHECK_WK_MASK); + } +#endif + + /* + * We already checked that td is still scheduled so this should be + * safe. + */ + splz_check(); + + /* + * This experimental resequencer is used as a fall-back to reduce + * hw cache line contention by placing each core's scheduler into a + * time-domain-multplexed slot. + * + * The resequencer is disabled by default. It's functionality has + * largely been superceeded by the token algorithm which limits races + * to a subset of cores. + * + * The resequencer algorithm tends to break down when more than + * 20 cores are contending. What appears to happen is that new + * tokens can be obtained out of address-sorted order by new cores + * while existing cores languish in long delays between retries and + * wind up being starved-out of the token acquisition. + */ + if (lwkt_spin_reseq && spinning >= lwkt_spin_reseq) { + int cseq = atomic_fetchadd_int(&lwkt_cseq_windex, 1); + int oseq; + + while ((oseq = lwkt_cseq_rindex) != cseq) { + cpu_ccfence(); +#if 1 + if (cpu_mi_feature & CPU_MI_MONITOR) { + cpu_mmw_pause_int(&lwkt_cseq_rindex, oseq); + } else { +#endif + cpu_pause(); + cpu_lfence(); +#if 1 + } #endif - { - DELAY(1); - cpu_lfence(); } - if (fatal_count && --fatal_count == 0) - panic("lwkt_switch: fatal spin wait"); + DELAY(1); + atomic_add_int(&lwkt_cseq_rindex, 1); } - cseq = lwkt_spin_delay; /* don't trust the system operator */ - cpu_ccfence(); - if (cseq < 1) - cseq = 1; - if (cseq > 1000) - cseq = 1000; - DELAY(cseq); - atomic_add_int(&lwkt_cseq_rindex, 1); - splz_check(); /* ok, we already checked that td is still scheduled */ /* highest level for(;;) loop */ } havethread: /* + * The thread may have been sitting in the runq for a while, be sure + * to reset td_fairq_lticks to avoid an improper scheduling tick against + * the thread if it gets dequeued again quickly. + * * We must always decrement td_fairq_accum on non-idle threads just * in case a thread never gets a tick due to being in a continuous * critical section. The page-zeroing code does this, for example. 
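[Annotation] The resequencer described in the comment above (now optional, enabled via lwkt.spin_reseq) is a ticket FIFO: a contending cpu takes a ticket with a fetch-add on lwkt_cseq_windex, spins until lwkt_cseq_rindex reaches that ticket, and bumps rindex on the way out, so contending schedulers are time-multiplexed instead of all hammering the same token cache line at once. A minimal C11 sketch of the same shape, without the kernel's monitor/mwait and DELAY() throttling, is given below; it illustrates the idea only.

    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_uint cseq_windex;   /* next ticket to hand out            */
    static atomic_uint cseq_rindex;   /* ticket currently allowed to proceed */

    static void
    reseq_enter(void)
    {
        unsigned my_ticket = atomic_fetch_add(&cseq_windex, 1);

        /* Wait for our time slot; the kernel pauses or mwaits here. */
        while (atomic_load(&cseq_rindex) != my_ticket)
            ;   /* spin */
    }

    static void
    reseq_exit(void)
    {
        /* Admit the next contending cpu. */
        atomic_fetch_add(&cseq_rindex, 1);
    }

    int
    main(void)
    {
        reseq_enter();
        printf("in my slot, retry token acquisition here\n");
        reseq_exit();
        return 0;
    }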
- * + */ + /* ntd->td_fairq_lticks = ticks; */ + --ntd->td_fairq_accum; + if (ntd->td_fairq_accum < -TDFAIRQ_MAX(gd)) + ntd->td_fairq_accum = -TDFAIRQ_MAX(gd); + + /* * If the thread we came up with is a higher or equal priority verses * the thread at the head of the queue we move our thread to the * front. This way we can always check the front of the queue. @@ -913,9 +922,8 @@ havethread: * Clear gd_idle_repeat when doing a normal switch to a non-idle * thread. */ - ++gd->gd_cnt.v_swtch; - --ntd->td_fairq_accum; ntd->td_wmesg = NULL; + ++gd->gd_cnt.v_swtch; xtd = TAILQ_FIRST(&gd->gd_tdrunq); if (ntd != xtd && ntd->td_pri >= xtd->td_pri) { TAILQ_REMOVE(&gd->gd_tdrunq, ntd, td_threadq); @@ -949,6 +957,14 @@ haveidle: lwkt_switch_return(td->td_switch(ntd)); /* ntd invalid, td_switch() can return a different thread_t */ } + +#if 1 + /* + * catch-all + */ + splz_check(); +#endif + /* NOTE: current cpu may have changed after switch */ crit_exit_quick(td); } @@ -1038,12 +1054,13 @@ lwkt_preempt(thread_t ntd, int critcount) */ KASSERT(ntd->td_critcount, ("BADCRIT0 %d", ntd->td_pri)); + td = gd->gd_curthread; if (preempt_enable == 0) { + if (ntd->td_pri > td->td_pri) + need_lwkt_resched(); ++preempt_miss; return; } - - td = gd->gd_curthread; if (ntd->td_pri <= td->td_pri) { ++preempt_miss; return; @@ -1113,6 +1130,12 @@ lwkt_preempt(thread_t ntd, int critcount) KKASSERT(ntd->td_preempted && (td->td_flags & TDF_PREEMPT_DONE)); ntd->td_preempted = NULL; td->td_flags &= ~(TDF_PREEMPT_LOCK|TDF_PREEMPT_DONE); +#if 1 + /* + * catch-all + */ + splz_check(); +#endif } /* @@ -1288,17 +1311,12 @@ _lwkt_schedule_post(globaldata_t gd, thread_t ntd, int ccount, int reschedok) } /* - * Give the thread a little fair share scheduler bump if it - * has been asleep for a while. This is primarily to avoid - * a degenerate case for interrupt threads where accumulator - * crosses into negative territory unnecessarily. + * If we are in a different tick give the thread a cycle advantage. + * This is primarily to avoid a degenerate case for interrupt threads + * where accumulator crosses into negative territory unnecessarily. */ - if (ntd->td_fairq_lticks != ticks) { - ntd->td_fairq_lticks = ticks; - ntd->td_fairq_accum += gd->gd_fairq_total_pri; - if (ntd->td_fairq_accum > TDFAIRQ_MAX(gd)) - ntd->td_fairq_accum = TDFAIRQ_MAX(gd); - } + if (ntd->td_fairq_lticks != ticks) + lwkt_fairq_accumulate(gd, ntd); } } @@ -1539,14 +1557,9 @@ lwkt_fairq_schedulerclock(thread_t td) if (fairq_enable) { while (td) { gd = td->td_gd; - if (td != &gd->gd_idlethread) { - td->td_fairq_accum -= gd->gd_fairq_total_pri; - if (td->td_fairq_accum < -TDFAIRQ_MAX(gd)) - td->td_fairq_accum = -TDFAIRQ_MAX(gd); - if (td->td_fairq_accum < 0) - need_lwkt_resched(); - td->td_fairq_lticks = ticks; - } + lwkt_fairq_tick(gd, td); + if (td->td_fairq_accum < 0) + need_lwkt_resched(); td = td->td_preempted; } } @@ -1560,6 +1573,19 @@ lwkt_fairq_accumulate(globaldata_t gd, thread_t td) td->td_fairq_accum = TDFAIRQ_MAX(td->td_gd); } +static int +lwkt_fairq_tick(globaldata_t gd, thread_t td) +{ + if (td->td_fairq_lticks != ticks && td != &gd->gd_idlethread) { + td->td_fairq_lticks = ticks; + td->td_fairq_accum -= gd->gd_fairq_total_pri; + if (td->td_fairq_accum < -TDFAIRQ_MAX(gd)) + td->td_fairq_accum = -TDFAIRQ_MAX(gd); + return TRUE; + } + return FALSE; +} + /* * Migrate the current thread to the specified cpu. 
* diff --git a/sys/kern/lwkt_token.c b/sys/kern/lwkt_token.c index f3cfd7f3ff..e1249cdd6e 100644 --- a/sys/kern/lwkt_token.c +++ b/sys/kern/lwkt_token.c @@ -76,10 +76,11 @@ #include #include +extern int lwkt_sched_debug; + #ifndef LWKT_NUM_POOL_TOKENS -#define LWKT_NUM_POOL_TOKENS 1024 /* power of 2 */ +#define LWKT_NUM_POOL_TOKENS 4001 /* prime number */ #endif -#define LWKT_MASK_POOL_TOKENS (LWKT_NUM_POOL_TOKENS - 1) static lwkt_token pool_tokens[LWKT_NUM_POOL_TOKENS]; @@ -131,9 +132,12 @@ struct lwkt_token tty_token = LWKT_TOKEN_INITIALIZER(tty_token); struct lwkt_token vnode_token = LWKT_TOKEN_INITIALIZER(vnode_token); struct lwkt_token vmobj_token = LWKT_TOKEN_INITIALIZER(vmobj_token); -static int lwkt_token_ipi_dispatch = 4; -SYSCTL_INT(_lwkt, OID_AUTO, token_ipi_dispatch, CTLFLAG_RW, - &lwkt_token_ipi_dispatch, 0, "Number of IPIs to dispatch on token release"); +static int lwkt_token_spin = 5; +SYSCTL_INT(_lwkt, OID_AUTO, token_spin, CTLFLAG_RW, + &lwkt_token_spin, 0, "Decontention spin loops"); +static int lwkt_token_delay = 0; +SYSCTL_INT(_lwkt, OID_AUTO, token_delay, CTLFLAG_RW, + &lwkt_token_delay, 0, "Decontention spin delay in ns"); /* * The collision count is bumped every time the LWKT scheduler fails @@ -159,6 +163,8 @@ SYSCTL_LONG(_lwkt, OID_AUTO, tty_collisions, CTLFLAG_RW, SYSCTL_LONG(_lwkt, OID_AUTO, vnode_collisions, CTLFLAG_RW, &vnode_token.t_collisions, 0, "Collision counter of vnode_token"); +static int _lwkt_getalltokens_sorted(thread_t td); + #ifdef SMP /* * Acquire the initial mplock @@ -175,16 +181,17 @@ cpu_get_initial_mplock(void) #endif /* - * Return a pool token given an address + * Return a pool token given an address. Use a prime number to reduce + * overlaps. */ static __inline lwkt_token_t _lwkt_token_pool_lookup(void *ptr) { - int i; + u_int i; - i = ((int)(intptr_t)ptr >> 2) ^ ((int)(intptr_t)ptr >> 12); - return(&pool_tokens[i & LWKT_MASK_POOL_TOKENS]); + i = (u_int)(uintptr_t)ptr % LWKT_NUM_POOL_TOKENS; + return(&pool_tokens[i]); } /* @@ -199,112 +206,153 @@ _lwkt_tokref_init(lwkt_tokref_t ref, lwkt_token_t tok, thread_t td) ref->tr_owner = td; } -#ifdef SMP -/* - * Force a LWKT reschedule on the target cpu when a requested token - * becomes available. - */ static +int +_lwkt_trytoken_spin(lwkt_token_t tok, lwkt_tokref_t ref) +{ + int n; + + for (n = 0; n < lwkt_token_spin; ++n) { + if (tok->t_ref == NULL && + atomic_cmpset_ptr(&tok->t_ref, NULL, ref)) { + return TRUE; + } + if (lwkt_token_delay) { + tsc_delay(lwkt_token_delay); + } else { + cpu_lfence(); + cpu_pause(); + } + } + return FALSE; +} + +static __inline void -lwkt_reltoken_mask_remote(void *arg, int arg2, struct intrframe *frame) +_lwkt_reltoken_spin(lwkt_token_t tok) { - need_lwkt_resched(); + tok->t_ref = NULL; } -#endif +#if 0 /* - * This bit of code sends a LWKT reschedule request to whatever other cpus - * had contended on the token being released. We could wake up all the cpus - * but generally speaking if there is a lot of contention we really only want - * to wake up a subset of cpus to avoid aggregating O(N^2) IPIs. The current - * cpuid is used as a basis to select which other cpus to wake up. - * - * For the selected cpus we can avoid issuing the actual IPI if the target - * cpu's RQF_WAKEUP is already set. In this case simply setting the - * reschedule flag RQF_AST_LWKT_RESCHED will be sufficient. + * Helper function used by lwkt_getalltokens[_sorted](). * - * lwkt.token_ipi_dispatch specifies the maximum number of IPIs to dispatch - * on a token release. 
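[Annotation] Raising LWKT_NUM_POOL_TOKENS from 1024 to 4001 and switching _lwkt_token_pool_lookup() from shift/xor/mask to a plain modulo works because the table size is prime: for a table of size m and addresses at stride s, indices repeat with period m/gcd(s,m), so a prime m spreads power-of-two-aligned structure addresses over the whole table instead of folding them onto a few slots. The toy program below just prints that mapping for page-aligned pointers; it is an illustration of the arithmetic, not the kernel lookup.

    #include <stdint.h>
    #include <stdio.h>

    #define POOL_TOKENS 4001    /* prime, as in the patch */

    static unsigned
    pool_index(uintptr_t ptr)
    {
        return (unsigned)(ptr % POOL_TOKENS);
    }

    int
    main(void)
    {
        uintptr_t base = 0x100000;
        int i;

        /* Page-aligned (4 KiB stride) "objects" land in distinct slots;
         * with a power-of-two table size the low bits would have to be
         * mixed by hand to get the same spread. */
        for (i = 0; i < 8; ++i)
            printf("%#lx -> slot %u\n",
                   (unsigned long)(base + 4096u * i),
                   pool_index(base + 4096u * i));
        return 0;
    }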
+ * Our attempt to acquire the token has failed. To reduce cache coherency + * bandwidth we set our cpu bit in t_collmask then wait for a reasonable + * period of time for a hand-off from the current token owner. */ -static __inline -void -_lwkt_reltoken_mask(lwkt_token_t tok) +static +int +_lwkt_trytoken_spin(lwkt_token_t tok, lwkt_tokref_t ref) { -#ifdef SMP - globaldata_t ngd; + globaldata_t gd = mycpu; cpumask_t mask; - cpumask_t tmpmask; - cpumask_t wumask; /* wakeup mask */ - cpumask_t remask; /* clear mask */ - int wucount; /* wakeup count */ - int cpuid; - int reqflags; + int n; /* - * Mask of contending cpus we want to wake up. + * Add our cpu to the collision mask and wait for the token to be + * handed off to us. */ - mask = tok->t_collmask; - cpu_ccfence(); - if (mask == 0) - return; + crit_enter(); + atomic_set_cpumask(&tok->t_collmask, gd->gd_cpumask); + for (n = 0; n < lwkt_token_spin; ++n) { + /* + * Token was released before we set our collision bit. + */ + if (tok->t_ref == NULL && + atomic_cmpset_ptr(&tok->t_ref, NULL, ref)) { + KKASSERT((tok->t_collmask & gd->gd_cpumask) != 0); + atomic_clear_cpumask(&tok->t_collmask, gd->gd_cpumask); + crit_exit(); + return TRUE; + } - /* - * Degenerate case - IPI to all contending cpus - */ - wucount = lwkt_token_ipi_dispatch; - if (wucount <= 0 || wucount >= ncpus) { - wucount = 0; - wumask = mask; - remask = mask; - } else { - wumask = 0; - remask = 0; + /* + * Token was handed-off to us. + */ + if (tok->t_ref == &gd->gd_handoff) { + KKASSERT((tok->t_collmask & gd->gd_cpumask) == 0); + tok->t_ref = ref; + crit_exit(); + return TRUE; + } + if (lwkt_token_delay) + tsc_delay(lwkt_token_delay); + else + cpu_pause(); } /* - * Calculate which cpus to IPI. These cpus are potentially in a - * HLT state waiting for token contention to go away. - * - * Ask the cpu LWKT scheduler to reschedule by setting - * RQF_AST_LWKT_RESCHEDULE. Signal the cpu if RQF_WAKEUP is not - * set (otherwise it has already been signalled or will check the - * flag very soon anyway). Both bits must be adjusted atomically - * all in one go to avoid races. - * - * The collision mask is cleared for all cpus we set the resched - * flag for, but we only IPI the ones that need signalling. + * We failed, attempt to clear our bit in the cpumask. We may race + * someone handing-off to us. If someone other than us cleared our + * cpu bit a handoff is incoming and we must wait for it. 
*/ - while (wucount && mask) { - tmpmask = mask & ~(CPUMASK(mycpu->gd_cpuid) - 1); - if (tmpmask) - cpuid = BSFCPUMASK(tmpmask); - else - cpuid = BSFCPUMASK(mask); - ngd = globaldata_find(cpuid); - for (;;) { - reqflags = ngd->gd_reqflags; - if (atomic_cmpset_int(&ngd->gd_reqflags, reqflags, - reqflags | - (RQF_WAKEUP | - RQF_AST_LWKT_RESCHED))) { - break; + for (;;) { + mask = tok->t_collmask; + cpu_ccfence(); + if (mask & gd->gd_cpumask) { + if (atomic_cmpset_cpumask(&tok->t_collmask, + mask, + mask & ~gd->gd_cpumask)) { + crit_exit(); + return FALSE; } + continue; } - if ((reqflags & RQF_WAKEUP) == 0) { - wumask |= CPUMASK(cpuid); - --wucount; + if (tok->t_ref != &gd->gd_handoff) { + cpu_pause(); + continue; } - remask |= CPUMASK(cpuid); - mask &= ~CPUMASK(cpuid); + tok->t_ref = ref; + crit_exit(); + return TRUE; + } +} + +/* + * Release token with hand-off + */ +static __inline +void +_lwkt_reltoken_spin(lwkt_token_t tok) +{ + globaldata_t xgd; + cpumask_t sidemask; + cpumask_t mask; + int cpuid; + + if (tok->t_collmask == 0) { + tok->t_ref = NULL; + return; } - if (remask) { - atomic_clear_cpumask(&tok->t_collmask, remask); - lwkt_send_ipiq3_mask(wumask, lwkt_reltoken_mask_remote, - NULL, 0); + + crit_enter(); + sidemask = ~(mycpu->gd_cpumask - 1); /* high bits >= xcpu */ + for (;;) { + mask = tok->t_collmask; + cpu_ccfence(); + if (mask == 0) { + tok->t_ref = NULL; + break; + } + if (mask & sidemask) + cpuid = BSFCPUMASK(mask & sidemask); + else + cpuid = BSFCPUMASK(mask); + xgd = globaldata_find(cpuid); + if (atomic_cmpset_cpumask(&tok->t_collmask, mask, + mask & ~CPUMASK(cpuid))) { + tok->t_ref = &xgd->gd_handoff; + break; + } } -#endif + crit_exit(); } +#endif + + /* * Obtain all the tokens required by the specified thread on the current * cpu, return 0 on failure and non-zero on success. If a failure occurs @@ -313,17 +361,22 @@ _lwkt_reltoken_mask(lwkt_token_t tok) * lwkt_getalltokens is called by the LWKT scheduler to acquire all * tokens that the thread had acquired prior to going to sleep. * - * We always clear the collision mask on token aquision. + * If spinning is non-zero this function acquires the tokens in a particular + * order to deal with potential deadlocks. We simply use address order for + * the case. * * Called from a critical section. */ int -lwkt_getalltokens(thread_t td) +lwkt_getalltokens(thread_t td, int spinning) { lwkt_tokref_t scan; lwkt_tokref_t ref; lwkt_token_t tok; + if (spinning) + return(_lwkt_getalltokens_sorted(td)); + /* * Acquire tokens in forward order, assign or validate tok->t_ref. */ @@ -340,14 +393,8 @@ lwkt_getalltokens(thread_t td) */ ref = tok->t_ref; if (ref == NULL) { - if (atomic_cmpset_ptr(&tok->t_ref, NULL, scan)) - { - if (tok->t_collmask & td->td_gd->gd_cpumask) { - atomic_clear_cpumask(&tok->t_collmask, - td->td_gd->gd_cpumask); - } + if (atomic_cmpset_ptr(&tok->t_ref, NULL,scan)) break; - } continue; } @@ -363,28 +410,26 @@ lwkt_getalltokens(thread_t td) if (ref >= &td->td_toks_base && ref < td->td_toks_stop) break; -#ifdef SMP /* - * Otherwise we failed to acquire all the tokens. - * Undo and return. We have to try once more after - * setting cpumask to cover possible races against - * the checking of t_collmask. + * Try hard to acquire this token before giving up + * and releasing the whole lot. 
*/ - atomic_set_cpumask(&tok->t_collmask, - td->td_gd->gd_cpumask); - if (atomic_cmpset_ptr(&tok->t_ref, NULL, scan)) { - if (tok->t_collmask & td->td_gd->gd_cpumask) { - atomic_clear_cpumask(&tok->t_collmask, - td->td_gd->gd_cpumask); - } + if (_lwkt_trytoken_spin(tok, scan)) break; - } -#endif + if (lwkt_sched_debug) + kprintf("toka %p %s\n", tok, tok->t_desc); + + /* + * Otherwise we failed to acquire all the tokens. + * Release whatever we did get. + */ td->td_wmesg = tok->t_desc; atomic_add_long(&tok->t_collisions, 1); lwkt_relalltokens(td); + return(FALSE); } + } return (TRUE); } @@ -393,11 +438,11 @@ lwkt_getalltokens(thread_t td) * Release all tokens owned by the specified thread on the current cpu. * * This code is really simple. Even in cases where we own all the tokens - * note that t_ref may not match the scan for recursively held tokens, - * or for the case where a lwkt_getalltokens() failed. + * note that t_ref may not match the scan for recursively held tokens which + * are held deeper in the stack, or for the case where a lwkt_getalltokens() + * failed. * - * The scheduler is responsible for maintaining the MP lock count, so - * we don't need to deal with tr_flags here. + * Tokens are released in reverse order to reduce chasing race failures. * * Called from a critical section. */ @@ -407,13 +452,136 @@ lwkt_relalltokens(thread_t td) lwkt_tokref_t scan; lwkt_token_t tok; - for (scan = &td->td_toks_base; scan < td->td_toks_stop; ++scan) { + for (scan = td->td_toks_stop - 1; scan >= &td->td_toks_base; --scan) { + /*for (scan = &td->td_toks_base; scan < td->td_toks_stop; ++scan) {*/ tok = scan->tr_tok; - if (tok->t_ref == scan) { - tok->t_ref = NULL; - _lwkt_reltoken_mask(tok); + if (tok->t_ref == scan) + _lwkt_reltoken_spin(tok); + } +} + +/* + * This is the decontention version of lwkt_getalltokens(). The tokens are + * acquired in address-sorted order to deal with any deadlocks. Ultimately + * token failures will spin into the scheduler and get here. + * + * In addition, to reduce hardware cache coherency contention monitor/mwait + * is interlocked with gd->gd_reqflags and RQF_SPINNING. Other cores which + * release a contended token will clear RQF_SPINNING and cause the mwait + * to resume. Any interrupt will also generally set RQF_* flags and cause + * mwait to resume (or be a NOP in the first place). + * + * This code is required to set up RQF_SPINNING in case of failure. The + * caller may call monitor/mwait on gd->gd_reqflags on failure. We do NOT + * want to call mwait here, and doubly so while we are holding tokens. + * + * Called from critical section + */ +static +int +_lwkt_getalltokens_sorted(thread_t td) +{ + /*globaldata_t gd = td->td_gd;*/ + lwkt_tokref_t sort_array[LWKT_MAXTOKENS]; + lwkt_tokref_t scan; + lwkt_tokref_t ref; + lwkt_token_t tok; + int i; + int j; + int n; + + /* + * Sort the token array. Yah yah, I know this isn't fun. + * + * NOTE: Recursively acquired tokens are ordered the same as in the + * td_toks_array so we can always get the earliest one first. + */ + i = 0; + scan = &td->td_toks_base; + while (scan < td->td_toks_stop) { + for (j = 0; j < i; ++j) { + if (scan->tr_tok < sort_array[j]->tr_tok) + break; } + if (j != i) { + bcopy(sort_array + j, sort_array + j + 1, + (i - j) * sizeof(lwkt_tokref_t)); + } + sort_array[j] = scan; + ++scan; + ++i; } + n = i; + + /* + * Acquire tokens in forward order, assign or validate tok->t_ref. 
+ */ + for (i = 0; i < n; ++i) { + scan = sort_array[i]; + tok = scan->tr_tok; + for (;;) { + /* + * Try to acquire the token if we do not already have + * it. + * + * NOTE: If atomic_cmpset_ptr() fails we have to + * loop and try again. It just means we + * lost a cpu race. + */ + ref = tok->t_ref; + if (ref == NULL) { + if (atomic_cmpset_ptr(&tok->t_ref, NULL, scan)) + break; + continue; + } + + /* + * Someone holds the token. + * + * Test if ref is already recursively held by this + * thread. We cannot safely dereference tok->t_ref + * (it might belong to another thread and is thus + * unstable), but we don't have to. We can simply + * range-check it. + */ + if (ref >= &td->td_toks_base && ref < td->td_toks_stop) + break; + + /* + * Try hard to acquire this token before giving up + * and releasing the whole lot. + */ + if (_lwkt_trytoken_spin(tok, scan)) + break; + if (lwkt_sched_debug) + kprintf("tokb %p %s\n", tok, tok->t_desc); + + /* + * Tokens are released in reverse order to reduce + * chasing race failures. + */ + td->td_wmesg = tok->t_desc; + atomic_add_long(&tok->t_collisions, 1); + + for (j = i - 1; j >= 0; --j) { + /*for (j = 0; j < i; ++j) {*/ + scan = sort_array[j]; + tok = scan->tr_tok; + if (tok->t_ref == scan) + _lwkt_reltoken_spin(tok); + } + return (FALSE); + } + } + + /* + * We were successful, there is no need for another core to signal + * us. + */ +#if 0 + atomic_clear_int(&gd->gd_reqflags, RQF_SPINNING); +#endif + return (TRUE); } /* @@ -479,6 +647,13 @@ _lwkt_trytokref2(lwkt_tokref_t nref, thread_t td, int blocking) if (ref >= &td->td_toks_base && ref < td->td_toks_stop) return(TRUE); + /* + * Spin generously. This is preferable to just switching + * away unconditionally. + */ + if (_lwkt_trytoken_spin(tok, nref)) + return(TRUE); + /* * Otherwise we failed, and it is not ok to attempt to * acquire a token in a hard code section. @@ -519,21 +694,6 @@ lwkt_gettoken(lwkt_token_t tok) * return tr_tok->t_ref should be assigned to this specific * ref. */ -#ifdef SMP -#if 0 - /* - * (DISABLED ATM) - Do not set t_collmask on a token - * acquisition failure, the scheduler will spin at least - * once and deal with hlt/spin semantics. - */ - atomic_set_cpumask(&tok->t_collmask, td->td_gd->gd_cpumask); - if (atomic_cmpset_ptr(&tok->t_ref, NULL, ref)) { - atomic_clear_cpumask(&tok->t_collmask, - td->td_gd->gd_cpumask); - return; - } -#endif -#endif td->td_wmesg = tok->t_desc; atomic_add_long(&tok->t_collisions, 1); logtoken(fail, ref); @@ -567,21 +727,6 @@ lwkt_gettoken_hard(lwkt_token_t tok) * return tr_tok->t_ref should be assigned to this specific * ref. */ -#ifdef SMP -#if 0 - /* - * (DISABLED ATM) - Do not set t_collmask on a token - * acquisition failure, the scheduler will spin at least - * once and deal with hlt/spin semantics. - */ - atomic_set_cpumask(&tok->t_collmask, td->td_gd->gd_cpumask); - if (atomic_cmpset_ptr(&tok->t_ref, NULL, ref)) { - atomic_clear_cpumask(&tok->t_collmask, - td->td_gd->gd_cpumask); - goto success; - } -#endif -#endif td->td_wmesg = tok->t_desc; atomic_add_long(&tok->t_collisions, 1); logtoken(fail, ref); @@ -589,11 +734,6 @@ lwkt_gettoken_hard(lwkt_token_t tok) logtoken(succ, ref); KKASSERT(tok->t_ref == ref); } -#ifdef SMP -#if 0 -success: -#endif -#endif crit_enter_hard_gd(td->td_gd); } @@ -623,21 +763,6 @@ lwkt_getpooltoken(void *ptr) * return tr_tok->t_ref should be assigned to this specific * ref. 
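[Annotation] _lwkt_getalltokens_sorted() above is the classic deadlock-avoidance discipline: once the scheduler is spinning, every thread takes its tokens in ascending address order, so two threads needing overlapping sets can never wait on each other in a cycle, and whatever was acquired is released in reverse order on failure. The self-contained sketch below shows the same discipline with ordinary pthread mutexes (insertion sort of the lock pointers, lock ascending, unlock in reverse); the mutexes and names are illustrative, not the token code.

    #include <pthread.h>
    #include <stdio.h>

    static void
    sort_by_address(pthread_mutex_t **v, int n)
    {
        int i, j;

        for (i = 1; i < n; ++i) {          /* insertion sort, as in the patch */
            pthread_mutex_t *key = v[i];
            for (j = i - 1; j >= 0 && v[j] > key; --j)
                v[j + 1] = v[j];
            v[j + 1] = key;
        }
    }

    static void
    lock_all(pthread_mutex_t **v, int n)
    {
        int i;

        sort_by_address(v, n);
        for (i = 0; i < n; ++i)            /* ascending address order */
            pthread_mutex_lock(v[i]);
    }

    static void
    unlock_all(pthread_mutex_t **v, int n)
    {
        int i;

        for (i = n - 1; i >= 0; --i)       /* reverse order on release */
            pthread_mutex_unlock(v[i]);
    }

    int
    main(void)
    {
        pthread_mutex_t a = PTHREAD_MUTEX_INITIALIZER;
        pthread_mutex_t b = PTHREAD_MUTEX_INITIALIZER;
        pthread_mutex_t *set[] = { &b, &a };   /* caller order does not matter */

        lock_all(set, 2);
        printf("both locks held, no ordering cycle possible\n");
        unlock_all(set, 2);
        return 0;
    }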
*/ -#ifdef SMP -#if 0 - /* - * (DISABLED ATM) - Do not set t_collmask on a token - * acquisition failure, the scheduler will spin at least - * once and deal with hlt/spin semantics. - */ - atomic_set_cpumask(&tok->t_collmask, td->td_gd->gd_cpumask); - if (atomic_cmpset_ptr(&tok->t_ref, NULL, ref)) { - atomic_clear_cpumask(&tok->t_collmask, - td->td_gd->gd_cpumask); - goto success; - } -#endif -#endif td->td_wmesg = tok->t_desc; atomic_add_long(&tok->t_collisions, 1); logtoken(fail, ref); @@ -645,11 +770,6 @@ lwkt_getpooltoken(void *ptr) logtoken(succ, ref); KKASSERT(tok->t_ref == ref); } -#ifdef SMP -#if 0 -success: -#endif -#endif return(tok); } @@ -712,10 +832,8 @@ lwkt_reltoken(lwkt_token_t tok) * * NOTE: The mplock is a token also so sequencing is a bit complex. */ - if (tok->t_ref == ref) { - tok->t_ref = NULL; - _lwkt_reltoken_mask(tok); - } + if (tok->t_ref == ref) + _lwkt_reltoken_spin(tok); cpu_sfence(); cpu_ccfence(); td->td_toks_stop = ref; diff --git a/sys/kern/sys_pipe.c b/sys/kern/sys_pipe.c index 9a36d75279..7fad99b114 100644 --- a/sys/kern/sys_pipe.c +++ b/sys/kern/sys_pipe.c @@ -148,6 +148,30 @@ SYSCTL_INT(_kern_pipe, OID_AUTO, bkmem_alloc, CTLFLAG_RW, &pipe_bkmem_alloc, 0, "pipe buffer from kmem"); #endif +/* + * Auto-size pipe cache to reduce kmem allocations and frees. + */ +static +void +pipeinit(void *dummy) +{ + size_t mbytes = kmem_lim_size(); + + if (pipe_maxbig == LIMITBIGPIPES) { + if (mbytes >= 7 * 1024) + pipe_maxbig *= 2; + if (mbytes >= 15 * 1024) + pipe_maxbig *= 2; + } + if (pipe_maxcache == PIPEQ_MAX_CACHE) { + if (mbytes >= 7 * 1024) + pipe_maxcache *= 2; + if (mbytes >= 15 * 1024) + pipe_maxcache *= 2; + } +} +SYSINIT(kmem, SI_BOOT2_MACHDEP, SI_ORDER_ANY, pipeinit, NULL) + static void pipeclose (struct pipe *cpipe); static void pipe_free_kmem (struct pipe *cpipe); static int pipe_create (struct pipe **cpipep); diff --git a/sys/kern/sys_process.c b/sys/kern/sys_process.c index aff01283da..bbef297bf2 100644 --- a/sys/kern/sys_process.c +++ b/sys/kern/sys_process.c @@ -93,7 +93,7 @@ pread (struct proc *procp, unsigned int addr, unsigned int *retval) { 0); if (!rv) { - vm_object_reference (object); + vm_object_reference XXX (object); rv = vm_map_wire (&kernel_map, kva, kva + PAGE_SIZE, 0); if (!rv) { @@ -156,20 +156,17 @@ pwrite (struct proc *procp, unsigned int addr, unsigned int datum) { tmap = map; rv = vm_map_lookup (&tmap, pageno, VM_PROT_WRITE, &out_entry, &object, &pindex, &out_prot, &wired); - if (rv != KERN_SUCCESS) { + if (rv != KERN_SUCCESS) return EINVAL; - } /* * Okay, we've got the page. Let's release tmap. */ - vm_map_lookup_done (tmap, out_entry, 0); /* * Fault the page in... 
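[Annotation] pipeinit() above sizes the pipe caches off kmem_lim_size(), which reports memory in megabytes: each threshold that is met doubles the compile-time default, so a machine reporting at least 7*1024 MB gets 2x pipe_maxbig/pipe_maxcache and one reporting at least 15*1024 MB gets 4x, while smaller machines keep the stock values. A small arithmetic sketch of that rule follows; the base value of 1 is a placeholder, not the real LIMITBIGPIPES or PIPEQ_MAX_CACHE.

    #include <stdio.h>

    static int
    scale_for_memory(int base, long mbytes)
    {
        int val = base;

        if (mbytes >= 7 * 1024)     /* >= 7 GiB reported by kmem_lim_size() */
            val *= 2;
        if (mbytes >= 15 * 1024)    /* >= 15 GiB: doubled again, 4x total   */
            val *= 2;
        return val;
    }

    int
    main(void)
    {
        long sizes[] = { 4 * 1024, 8 * 1024, 16 * 1024 };
        int i;

        for (i = 0; i < 3; ++i)
            printf("%5ld MB -> cache scaled to %d x base\n",
                   sizes[i], scale_for_memory(1, sizes[i]));
        return 0;
    }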
*/ - rv = vm_fault(map, pageno, VM_PROT_WRITE|VM_PROT_READ, FALSE); if (rv != KERN_SUCCESS) return EFAULT; @@ -182,7 +179,7 @@ pwrite (struct proc *procp, unsigned int addr, unsigned int datum) { VM_PROT_ALL, VM_PROT_ALL, 0); if (!rv) { - vm_object_reference (object); + vm_object_reference XXX (object); rv = vm_map_wire (&kernel_map, kva, kva + PAGE_SIZE, 0); if (!rv) { diff --git a/sys/kern/sysv_shm.c b/sys/kern/sysv_shm.c index 7c8f11757a..f26e4780ac 100644 --- a/sys/kern/sysv_shm.c +++ b/sys/kern/sysv_shm.c @@ -332,7 +332,8 @@ again: } shm_handle = shmseg->shm_internal; - vm_object_reference(shm_handle->shm_object); + vm_object_hold(shm_handle->shm_object); + vm_object_reference_locked(shm_handle->shm_object); rv = vm_map_find(&p->p_vmspace->vm_map, shm_handle->shm_object, 0, &attach_va, @@ -341,6 +342,7 @@ again: VM_MAPTYPE_NORMAL, prot, prot, 0); + vm_object_drop(shm_handle->shm_object); if (rv != KERN_SUCCESS) { vm_object_deallocate(shm_handle->shm_object); error = ENOMEM; diff --git a/sys/kern/tty.c b/sys/kern/tty.c index 21af56005a..c0e41d3733 100644 --- a/sys/kern/tty.c +++ b/sys/kern/tty.c @@ -2640,10 +2640,13 @@ ttyinfo(struct tty *tp) pctcpu = (lp->lwp_pctcpu * 10000 + FSCALE / 2) >> FSHIFT; - if (pick->p_stat == SIDL || pick->p_stat == SZOMB) + if (pick->p_stat == SIDL || pick->p_stat == SZOMB) { vmsz = 0; - else + } else { + lwkt_gettoken(&pick->p_vmspace->vm_map.token); vmsz = pgtok(vmspace_resident_count(pick->p_vmspace)); + lwkt_reltoken(&pick->p_vmspace->vm_map.token); + } crit_exit(); diff --git a/sys/kern/uipc_syscalls.c b/sys/kern/uipc_syscalls.c index 616a3dd6b1..cae66a2c65 100644 --- a/sys/kern/uipc_syscalls.c +++ b/sys/kern/uipc_syscalls.c @@ -1379,7 +1379,9 @@ sf_buf_mfree(void *arg) m = sf_buf_page(sf); if (sf_buf_free(sf)) { /* sf invalid now */ + vm_page_busy_wait(m, FALSE, "sockpgf"); vm_page_unwire(m, 0); + vm_page_wakeup(m); if (m->wire_count == 0 && m->object == NULL) vm_page_try_to_free(m); } @@ -1601,24 +1603,23 @@ retry_lookup: * interrupt can free the page) through to the * vm_page_wire() call. */ - lwkt_gettoken(&vm_token); - pg = vm_page_lookup(obj, pindex); + vm_object_hold(obj); + pg = vm_page_lookup_busy_try(obj, pindex, TRUE, &error); + if (error) { + vm_page_sleep_busy(pg, TRUE, "sfpbsy"); + vm_object_drop(obj); + goto retry_lookup; + } if (pg == NULL) { pg = vm_page_alloc(obj, pindex, VM_ALLOC_NORMAL); if (pg == NULL) { vm_wait(0); - lwkt_reltoken(&vm_token); + vm_object_drop(obj); goto retry_lookup; } - vm_page_wire(pg); - vm_page_wakeup(pg); - } else if (vm_page_sleep_busy(pg, TRUE, "sfpbsy")) { - lwkt_reltoken(&vm_token); - goto retry_lookup; - } else { - vm_page_wire(pg); } - lwkt_reltoken(&vm_token); + vm_page_wire(pg); + vm_object_drop(obj); /* * If page is not valid for what we need, initiate I/O @@ -1634,6 +1635,7 @@ retry_lookup: * completes. */ vm_page_io_start(pg); + vm_page_wakeup(pg); /* * Get the page from backing store. @@ -1654,12 +1656,12 @@ retry_lookup: td->td_ucred); vn_unlock(vp); vm_page_flag_clear(pg, PG_ZERO); + vm_page_busy_wait(pg, FALSE, "sockpg"); vm_page_io_finish(pg); if (error) { - crit_enter(); vm_page_unwire(pg, 0); + vm_page_wakeup(pg); vm_page_try_to_free(pg); - crit_exit(); ssb_unlock(&so->so_snd); goto done; } @@ -1671,14 +1673,14 @@ retry_lookup: * but this wait can be interrupted. 
*/ if ((sf = sf_buf_alloc(pg)) == NULL) { - crit_enter(); vm_page_unwire(pg, 0); + vm_page_wakeup(pg); vm_page_try_to_free(pg); - crit_exit(); ssb_unlock(&so->so_snd); error = EINTR; goto done; } + vm_page_wakeup(pg); /* * Get an mbuf header and set it up as having external storage. diff --git a/sys/kern/vfs_bio.c b/sys/kern/vfs_bio.c index 5a5d8f5616..7bea33d242 100644 --- a/sys/kern/vfs_bio.c +++ b/sys/kern/vfs_bio.c @@ -701,9 +701,11 @@ bufinit(void) */ bogus_offset = kmem_alloc_pageable(&kernel_map, PAGE_SIZE); + vm_object_hold(&kernel_object); bogus_page = vm_page_alloc(&kernel_object, (bogus_offset >> PAGE_SHIFT), VM_ALLOC_NORMAL); + vm_object_drop(&kernel_object); vmstats.v_wire_count++; } @@ -1172,15 +1174,11 @@ buwrite(struct buf *bp) /* * Set valid & dirty. - * - * WARNING! vfs_dirty_one_page() assumes vm_token is held for now. */ - lwkt_gettoken(&vm_token); for (i = 0; i < bp->b_xio.xio_npages; i++) { m = bp->b_xio.xio_pages[i]; vfs_dirty_one_page(bp, i, m); } - lwkt_reltoken(&vm_token); bqrelse(bp); } @@ -1455,7 +1453,6 @@ brelse(struct buf *bp) resid = bp->b_bufsize; foff = bp->b_loffset; - lwkt_gettoken(&vm_token); for (i = 0; i < bp->b_xio.xio_npages; i++) { m = bp->b_xio.xio_pages[i]; vm_page_flag_clear(m, PG_ZERO); @@ -1470,6 +1467,7 @@ brelse(struct buf *bp) obj = vp->v_object; poff = OFF_TO_IDX(bp->b_loffset); + vm_object_hold(obj); for (j = i; j < bp->b_xio.xio_npages; j++) { vm_page_t mtmp; @@ -1483,6 +1481,7 @@ brelse(struct buf *bp) } } bp->b_flags &= ~B_HASBOGUS; + vm_object_drop(obj); if ((bp->b_flags & B_INVAL) == 0) { pmap_qenter(trunc_page((vm_offset_t)bp->b_data), @@ -1544,7 +1543,6 @@ brelse(struct buf *bp) } if (bp->b_flags & (B_INVAL | B_RELBUF)) vfs_vmio_release(bp); - lwkt_reltoken(&vm_token); } else { /* * Rundown for non-VMIO buffers. @@ -1790,11 +1788,12 @@ vfs_vmio_release(struct buf *bp) int i; vm_page_t m; - lwkt_gettoken(&vm_token); for (i = 0; i < bp->b_xio.xio_npages; i++) { m = bp->b_xio.xio_pages[i]; bp->b_xio.xio_pages[i] = NULL; + vm_page_busy_wait(m, FALSE, "vmiopg"); + /* * The VFS is telling us this is not a meta-data buffer * even if it is backed by a block device. @@ -1828,6 +1827,7 @@ vfs_vmio_release(struct buf *bp) */ if ((m->flags & PG_BUSY) || (m->busy != 0)) { vm_page_protect(m, VM_PROT_NONE); + vm_page_wakeup(m); continue; } @@ -1841,7 +1841,6 @@ vfs_vmio_release(struct buf *bp) #if 0 if ((bp->b_flags & B_ASYNC) == 0 && !m->valid && m->hold_count == 0) { - vm_page_busy(m); vm_page_protect(m, VM_PROT_NONE); vm_page_free(m); } else @@ -1859,17 +1858,21 @@ vfs_vmio_release(struct buf *bp) * being cached for long periods of time. 
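[Annotation] The sendfile hunk above replaces the vm_token-covered lookup with the new per-object pattern: hold the object, call vm_page_lookup_busy_try(), and on a busy collision sleep on the page, drop the object and retry from the top, so the caller always ends up with a page it has busied itself (allocating one if the lookup came back NULL). The fragment below captures only the shape of that retry loop; every function in it is a stand-in written for the example, not a VM API.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Everything below is a stand-in so the loop shape compiles on its own. */
    struct obj  { int dummy; };
    struct page { bool busied; };

    static struct page the_page;

    static void obj_hold(struct obj *o) { (void)o; }
    static void obj_drop(struct obj *o) { (void)o; }

    static struct page *
    lookup_busy_try(struct obj *o, long idx, bool *busy_failed)
    {
        (void)o; (void)idx;
        *busy_failed = false;       /* pretend the busy-try always wins */
        the_page.busied = true;
        return &the_page;
    }

    static void sleep_on_busy(struct page *p) { (void)p; }

    /*
     * Shape of the hold / lookup-busy-try / sleep-and-retry idiom: the caller
     * always comes back with the page busied, without holding a global token.
     */
    static struct page *
    get_busied_page(struct obj *o, long idx)
    {
        struct page *p;
        bool busy_failed;

        for (;;) {
            obj_hold(o);
            p = lookup_busy_try(o, idx, &busy_failed);
            if (busy_failed) {
                sleep_on_busy(p);   /* wait for the current owner ...   */
                obj_drop(o);
                continue;           /* ... then retry from the top      */
            }
            obj_drop(o);
            return p;               /* NULL would mean "allocate" upstream */
        }
    }

    int
    main(void)
    {
        struct obj o;
        struct page *p = get_busied_page(&o, 0);

        printf("page %p busied: %d\n", (void *)p, p->busied);
        return 0;
    }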
*/ if (bp->b_flags & B_DIRECT) { + vm_page_wakeup(m); vm_page_try_to_free(m); } else if ((bp->b_flags & B_NOTMETA) || vm_page_count_severe()) { m->act_count = bp->b_act_count; + vm_page_wakeup(m); vm_page_try_to_cache(m); } else { m->act_count = bp->b_act_count; + vm_page_wakeup(m); } + } else { + vm_page_wakeup(m); } } - lwkt_reltoken(&vm_token); pmap_qremove(trunc_page((vm_offset_t) bp->b_data), bp->b_xio.xio_npages); @@ -2753,6 +2756,7 @@ inmem(struct vnode *vp, off_t loffset) vm_object_t obj; vm_offset_t toff, tinc, size; vm_page_t m; + int res = 1; if (findblk(vp, loffset, FINDBLK_TEST)) return 1; @@ -2765,20 +2769,24 @@ inmem(struct vnode *vp, off_t loffset) if (size > vp->v_mount->mnt_stat.f_iosize) size = vp->v_mount->mnt_stat.f_iosize; + vm_object_hold(obj); for (toff = 0; toff < vp->v_mount->mnt_stat.f_iosize; toff += tinc) { - lwkt_gettoken(&vm_token); m = vm_page_lookup(obj, OFF_TO_IDX(loffset + toff)); - lwkt_reltoken(&vm_token); - if (m == NULL) - return 0; + if (m == NULL) { + res = 0; + break; + } tinc = size; if (tinc > PAGE_SIZE - ((toff + loffset) & PAGE_MASK)) tinc = PAGE_SIZE - ((toff + loffset) & PAGE_MASK); if (vm_page_is_valid(m, - (vm_offset_t) ((toff + loffset) & PAGE_MASK), tinc) == 0) - return 0; + (vm_offset_t) ((toff + loffset) & PAGE_MASK), tinc) == 0) { + res = 0; + break; + } } - return 1; + vm_object_drop(obj); + return (res); } /* @@ -3404,11 +3412,10 @@ allocbuf(struct buf *bp, int size) m = bp->b_xio.xio_pages[i]; KASSERT(m != bogus_page, ("allocbuf: bogus page found")); - while (vm_page_sleep_busy(m, TRUE, "biodep")) - ; - + vm_page_busy_wait(m, TRUE, "biodep"); bp->b_xio.xio_pages[i] = NULL; vm_page_unwire(m, 0); + vm_page_wakeup(m); } pmap_qremove((vm_offset_t) trunc_page((vm_offset_t)bp->b_data) + (desiredpages << PAGE_SHIFT), (bp->b_xio.xio_npages - desiredpages)); @@ -3438,13 +3445,28 @@ allocbuf(struct buf *bp, int size) vp = bp->b_vp; obj = vp->v_object; - lwkt_gettoken(&vm_token); + vm_object_hold(obj); while (bp->b_xio.xio_npages < desiredpages) { vm_page_t m; vm_pindex_t pi; + int error; - pi = OFF_TO_IDX(bp->b_loffset) + bp->b_xio.xio_npages; - if ((m = vm_page_lookup(obj, pi)) == NULL) { + pi = OFF_TO_IDX(bp->b_loffset) + + bp->b_xio.xio_npages; + + /* + * Blocking on m->busy might lead to a + * deadlock: + * + * vm_fault->getpages->cluster_read->allocbuf + */ + m = vm_page_lookup_busy_try(obj, pi, FALSE, + &error); + if (error) { + vm_page_sleep_busy(m, FALSE, "pgtblk"); + continue; + } + if (m == NULL) { /* * note: must allocate system pages * since blocking here could intefere @@ -3464,27 +3486,17 @@ allocbuf(struct buf *bp, int size) } /* - * We found a page. If we have to sleep on it, - * retry because it might have gotten freed out - * from under us. - * - * We can only test PG_BUSY here. Blocking on - * m->busy might lead to a deadlock: - * - * vm_fault->getpages->cluster_read->allocbuf - * + * We found a page and were able to busy it. */ - - if (vm_page_sleep_busy(m, FALSE, "pgtblk")) - continue; vm_page_flag_clear(m, PG_ZERO); vm_page_wire(m); + vm_page_wakeup(m); bp->b_xio.xio_pages[bp->b_xio.xio_npages] = m; ++bp->b_xio.xio_npages; if (bp->b_act_count < m->act_count) bp->b_act_count = m->act_count; } - lwkt_reltoken(&vm_token); + vm_object_drop(obj); /* * Step 2. 
We've loaded the pages into the buffer, @@ -3895,7 +3907,7 @@ bpdone(struct buf *bp, int elseit) bp->b_flags |= B_CACHE; } - lwkt_gettoken(&vm_token); + vm_object_hold(obj); for (i = 0; i < bp->b_xio.xio_npages; i++) { int bogusflag = 0; int resid; @@ -3933,6 +3945,7 @@ bpdone(struct buf *bp, int elseit) * already changed correctly (see bdwrite()), so we * only need to do this here in the read case. */ + vm_page_busy_wait(m, FALSE, "bpdpgw"); if (cmd == BUF_CMD_READ && !bogusflag && resid > 0) { vfs_clean_one_page(bp, i, m); } @@ -3965,12 +3978,13 @@ bpdone(struct buf *bp, int elseit) panic("biodone: page busy < 0"); } vm_page_io_finish(m); + vm_page_wakeup(m); vm_object_pip_wakeup(obj); foff = (foff + PAGE_SIZE) & ~(off_t)PAGE_MASK; iosize -= resid; } bp->b_flags &= ~B_HASBOGUS; - lwkt_reltoken(&vm_token); + vm_object_drop(obj); } /* @@ -4075,12 +4089,12 @@ vfs_unbusy_pages(struct buf *bp) runningbufwakeup(bp); - lwkt_gettoken(&vm_token); if (bp->b_flags & B_VMIO) { struct vnode *vp = bp->b_vp; vm_object_t obj; obj = vp->v_object; + vm_object_hold(obj); for (i = 0; i < bp->b_xio.xio_npages; i++) { vm_page_t m = bp->b_xio.xio_pages[i]; @@ -4100,13 +4114,15 @@ vfs_unbusy_pages(struct buf *bp) pmap_qenter(trunc_page((vm_offset_t)bp->b_data), bp->b_xio.xio_pages, bp->b_xio.xio_npages); } + vm_page_busy_wait(m, FALSE, "bpdpgw"); vm_object_pip_wakeup(obj); vm_page_flag_clear(m, PG_ZERO); vm_page_io_finish(m); + vm_page_wakeup(m); } bp->b_flags &= ~B_HASBOGUS; + vm_object_drop(obj); } - lwkt_reltoken(&vm_token); } /* @@ -4142,21 +4158,24 @@ vfs_busy_pages(struct vnode *vp, struct buf *bp) if (bp->b_flags & B_VMIO) { vm_object_t obj; - lwkt_gettoken(&vm_token); - obj = vp->v_object; KASSERT(bp->b_loffset != NOOFFSET, ("vfs_busy_pages: no buffer offset")); /* - * Loop until none of the pages are busy. + * Busy all the pages. We have to busy them all at once + * to avoid deadlocks. */ retry: for (i = 0; i < bp->b_xio.xio_npages; i++) { vm_page_t m = bp->b_xio.xio_pages[i]; - if (vm_page_sleep_busy(m, FALSE, "vbpage")) + if (vm_page_busy_try(m, FALSE)) { + vm_page_sleep_busy(m, FALSE, "vbpage"); + while (--i >= 0) + vm_page_wakeup(bp->b_xio.xio_pages[i]); goto retry; + } } /* @@ -4254,12 +4273,12 @@ retry: */ vm_page_protect(m, VM_PROT_NONE); } + vm_page_wakeup(m); } if (bogus) { pmap_qenter(trunc_page((vm_offset_t)bp->b_data), bp->b_xio.xio_pages, bp->b_xio.xio_npages); } - lwkt_reltoken(&vm_token); } /* @@ -4294,15 +4313,10 @@ vfs_clean_pages(struct buf *bp) KASSERT(bp->b_loffset != NOOFFSET, ("vfs_clean_pages: no buffer offset")); - /* - * vm_token must be held for vfs_clean_one_page() calls. - */ - lwkt_gettoken(&vm_token); for (i = 0; i < bp->b_xio.xio_npages; i++) { m = bp->b_xio.xio_pages[i]; vfs_clean_one_page(bp, i, m); } - lwkt_reltoken(&vm_token); } /* @@ -4321,9 +4335,6 @@ vfs_clean_pages(struct buf *bp) * This routine is typically called after a read completes (dirty should * be zero in that case as we are not called on bogus-replace pages), * or before a write is initiated. - * - * NOTE: vm_token must be held by the caller, and vm_page_set_validclean() - * currently assumes the vm_token is held. */ static void vfs_clean_one_page(struct buf *bp, int pageno, vm_page_t m) @@ -4418,6 +4429,8 @@ vfs_clean_one_page(struct buf *bp, int pageno, vm_page_t m) * * WARNING! vm_page_set_validclean() currently assumes vm_token * is held. The page might not be busied (bdwrite() case). + * XXX remove this comment once we've validated that this + * is no longer an issue. 
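[Annotation] The new vfs_busy_pages() retry loop above busies every page of the buffer in a single all-or-nothing pass: when vm_page_busy_try() fails on page i it sleeps on that page, wakes up (releases) the i pages it had already busied, and restarts from the beginning, so the code never blocks while holding a partial set busy. The same rollback pattern in a self-contained form, with stand-in page_busy_try()/page_wakeup() helpers:

    #include <stdbool.h>
    #include <stdio.h>

    #define NPAGES 4

    static bool busy[NPAGES];       /* models PG_BUSY for each page */

    static bool
    page_busy_try(int i)
    {
        if (busy[i])
            return false;           /* already busied by someone else */
        busy[i] = true;
        return true;
    }

    static void
    page_wakeup(int i)
    {
        busy[i] = false;
    }

    static void
    page_sleep_busy(int i)
    {
        busy[i] = false;            /* model: the other owner releases it */
    }

    static void
    busy_all_pages(void)
    {
        int i;

    retry:
        for (i = 0; i < NPAGES; ++i) {
            if (!page_busy_try(i)) {
                page_sleep_busy(i);     /* wait for the contended page     */
                while (--i >= 0)        /* roll back pages we already own  */
                    page_wakeup(i);
                goto retry;             /* and restart from page 0         */
            }
        }
    }

    int
    main(void)
    {
        busy[2] = true;             /* simulate one page busied elsewhere */
        busy_all_pages();
        printf("all %d pages busied\n", NPAGES);
        return 0;
    }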
*/ vm_page_set_validclean(m, soff & PAGE_MASK, eoff - soff); } @@ -4548,8 +4561,10 @@ vm_hold_load_pages(struct buf *bp, vm_offset_t from, vm_offset_t to) * could intefere with paging I/O, no matter which * process we are. */ + vm_object_hold(&kernel_object); p = bio_page_alloc(&kernel_object, pg >> PAGE_SHIFT, (vm_pindex_t)((to - pg) >> PAGE_SHIFT)); + vm_object_drop(&kernel_object); if (p) { vm_page_wire(p); p->valid = VM_PAGE_BITS_ALL; @@ -4584,15 +4599,14 @@ bio_page_alloc(vm_object_t obj, vm_pindex_t pg, int deficit) { vm_page_t p; + ASSERT_LWKT_TOKEN_HELD(vm_object_token(obj)); + /* * Try a normal allocation, allow use of system reserve. */ - lwkt_gettoken(&vm_token); p = vm_page_alloc(obj, pg, VM_ALLOC_NORMAL | VM_ALLOC_SYSTEM); - if (p) { - lwkt_reltoken(&vm_token); + if (p) return(p); - } /* * The normal allocation failed and we clearly have a page @@ -4607,7 +4621,6 @@ bio_page_alloc(vm_object_t obj, vm_pindex_t pg, int deficit) * page now exists. */ if (vm_page_lookup(obj, pg)) { - lwkt_reltoken(&vm_token); return(NULL); } @@ -4631,7 +4644,6 @@ bio_page_alloc(vm_object_t obj, vm_pindex_t pg, int deficit) ++lowmempgfails; vm_wait(hz); } - lwkt_reltoken(&vm_token); return(p); } @@ -4657,7 +4669,6 @@ vm_hold_free_pages(struct buf *bp, vm_offset_t from, vm_offset_t to) index = (from - trunc_page((vm_offset_t)bp->b_data)) >> PAGE_SHIFT; newnpages = index; - lwkt_gettoken(&vm_token); for (pg = from; pg < to; pg += PAGE_SIZE, index++) { p = bp->b_xio.xio_pages[index]; if (p && (index < bp->b_xio.xio_npages)) { @@ -4669,13 +4680,12 @@ vm_hold_free_pages(struct buf *bp, vm_offset_t from, vm_offset_t to) } bp->b_xio.xio_pages[index] = NULL; pmap_kremove(pg); - vm_page_busy(p); + vm_page_busy_wait(p, FALSE, "vmhldpg"); vm_page_unwire(p, 0); vm_page_free(p); } } bp->b_xio.xio_npages = newnpages; - lwkt_reltoken(&vm_token); } /* diff --git a/sys/kern/vfs_cache.c b/sys/kern/vfs_cache.c index 73518f3228..8dd32b3c80 100644 --- a/sys/kern/vfs_cache.c +++ b/sys/kern/vfs_cache.c @@ -878,10 +878,10 @@ _cache_setvp(struct mount *mp, struct namecache *ncp, struct vnode *vp) */ if (!TAILQ_EMPTY(&ncp->nc_list)) vhold(vp); - spin_lock(&vp->v_spinlock); + spin_lock(&vp->v_spin); ncp->nc_vp = vp; TAILQ_INSERT_HEAD(&vp->v_namecache, ncp, nc_vnode); - spin_unlock(&vp->v_spinlock); + spin_unlock(&vp->v_spin); if (ncp->nc_exlocks) vhold(vp); @@ -970,10 +970,10 @@ _cache_setunresolved(struct namecache *ncp) ncp->nc_error = ENOTCONN; if ((vp = ncp->nc_vp) != NULL) { atomic_add_int(&numcache, -1); - spin_lock(&vp->v_spinlock); + spin_lock(&vp->v_spin); ncp->nc_vp = NULL; TAILQ_REMOVE(&vp->v_namecache, ncp, nc_vnode); - spin_unlock(&vp->v_spinlock); + spin_unlock(&vp->v_spin); /* * Any vp associated with an ncp with children is @@ -1259,7 +1259,7 @@ cache_inval_vp(struct vnode *vp, int flags) struct namecache *next; restart: - spin_lock(&vp->v_spinlock); + spin_lock(&vp->v_spin); ncp = TAILQ_FIRST(&vp->v_namecache); if (ncp) _cache_hold(ncp); @@ -1267,7 +1267,7 @@ restart: /* loop entered with ncp held and vp spin-locked */ if ((next = TAILQ_NEXT(ncp, nc_vnode)) != NULL) _cache_hold(next); - spin_unlock(&vp->v_spinlock); + spin_unlock(&vp->v_spin); _cache_lock(ncp); if (ncp->nc_vp != vp) { kprintf("Warning: cache_inval_vp: race-A detected on " @@ -1280,16 +1280,16 @@ restart: _cache_inval(ncp, flags); _cache_put(ncp); /* also releases reference */ ncp = next; - spin_lock(&vp->v_spinlock); + spin_lock(&vp->v_spin); if (ncp && ncp->nc_vp != vp) { - spin_unlock(&vp->v_spinlock); + spin_unlock(&vp->v_spin); 
kprintf("Warning: cache_inval_vp: race-B detected on " "%s\n", ncp->nc_name); _cache_drop(ncp); goto restart; } } - spin_unlock(&vp->v_spinlock); + spin_unlock(&vp->v_spin); return(TAILQ_FIRST(&vp->v_namecache) != NULL); } @@ -1308,7 +1308,7 @@ cache_inval_vp_nonblock(struct vnode *vp) struct namecache *ncp; struct namecache *next; - spin_lock(&vp->v_spinlock); + spin_lock(&vp->v_spin); ncp = TAILQ_FIRST(&vp->v_namecache); if (ncp) _cache_hold(ncp); @@ -1316,7 +1316,7 @@ cache_inval_vp_nonblock(struct vnode *vp) /* loop entered with ncp held */ if ((next = TAILQ_NEXT(ncp, nc_vnode)) != NULL) _cache_hold(next); - spin_unlock(&vp->v_spinlock); + spin_unlock(&vp->v_spin); if (_cache_lock_nonblock(ncp)) { _cache_drop(ncp); if (next) @@ -1334,16 +1334,16 @@ cache_inval_vp_nonblock(struct vnode *vp) _cache_inval(ncp, 0); _cache_put(ncp); /* also releases reference */ ncp = next; - spin_lock(&vp->v_spinlock); + spin_lock(&vp->v_spin); if (ncp && ncp->nc_vp != vp) { - spin_unlock(&vp->v_spinlock); + spin_unlock(&vp->v_spin); kprintf("Warning: cache_inval_vp: race-B detected on " "%s\n", ncp->nc_name); _cache_drop(ncp); goto done; } } - spin_unlock(&vp->v_spinlock); + spin_unlock(&vp->v_spin); done: return(TAILQ_FIRST(&vp->v_namecache) != NULL); } @@ -1609,11 +1609,11 @@ cache_fromdvp(struct vnode *dvp, struct ucred *cred, int makeit, * Handle the makeit == 0 degenerate case */ if (makeit == 0) { - spin_lock(&dvp->v_spinlock); + spin_lock(&dvp->v_spin); nch->ncp = TAILQ_FIRST(&dvp->v_namecache); if (nch->ncp) cache_hold(nch); - spin_unlock(&dvp->v_spinlock); + spin_unlock(&dvp->v_spin); } /* @@ -1623,14 +1623,14 @@ cache_fromdvp(struct vnode *dvp, struct ucred *cred, int makeit, /* * Break out if we successfully acquire a working ncp. */ - spin_lock(&dvp->v_spinlock); + spin_lock(&dvp->v_spin); nch->ncp = TAILQ_FIRST(&dvp->v_namecache); if (nch->ncp) { cache_hold(nch); - spin_unlock(&dvp->v_spinlock); + spin_unlock(&dvp->v_spin); break; } - spin_unlock(&dvp->v_spinlock); + spin_unlock(&dvp->v_spin); /* * If dvp is the root of its filesystem it should already @@ -1770,14 +1770,14 @@ cache_fromdvp_try(struct vnode *dvp, struct ucred *cred, break; } vn_unlock(pvp); - spin_lock(&pvp->v_spinlock); + spin_lock(&pvp->v_spin); if ((nch.ncp = TAILQ_FIRST(&pvp->v_namecache)) != NULL) { _cache_hold(nch.ncp); - spin_unlock(&pvp->v_spinlock); + spin_unlock(&pvp->v_spin); vrele(pvp); break; } - spin_unlock(&pvp->v_spinlock); + spin_unlock(&pvp->v_spin); if (pvp->v_flag & VROOT) { nch.ncp = _cache_get(pvp->v_mount->mnt_ncmountpt.ncp); error = cache_resolve_mp(nch.mount); @@ -3324,17 +3324,17 @@ vn_fullpath(struct proc *p, struct vnode *vn, char **retbuf, char **freebuf, if ((vn = p->p_textvp) == NULL) return (EINVAL); } - spin_lock(&vn->v_spinlock); + spin_lock(&vn->v_spin); TAILQ_FOREACH(ncp, &vn->v_namecache, nc_vnode) { if (ncp->nc_nlen) break; } if (ncp == NULL) { - spin_unlock(&vn->v_spinlock); + spin_unlock(&vn->v_spin); return (EINVAL); } _cache_hold(ncp); - spin_unlock(&vn->v_spinlock); + spin_unlock(&vn->v_spin); atomic_add_int(&numfullpathcalls, -1); nch.ncp = ncp;; diff --git a/sys/kern/vfs_cluster.c b/sys/kern/vfs_cluster.c index 80388b92fe..3b8d575580 100644 --- a/sys/kern/vfs_cluster.c +++ b/sys/kern/vfs_cluster.c @@ -534,8 +534,11 @@ cluster_rbuild(struct vnode *vp, off_t filesize, off_t loffset, off_t doffset, cluster_append(&bp->b_bio1, tbp); for (j = 0; j < tbp->b_xio.xio_npages; ++j) { vm_page_t m; + m = tbp->b_xio.xio_pages[j]; + vm_page_busy_wait(m, FALSE, "clurpg"); vm_page_io_start(m); + 
vm_page_wakeup(m); vm_object_pip_add(m->object, 1); if ((bp->b_xio.xio_npages == 0) || (bp->b_xio.xio_pages[bp->b_xio.xio_npages-1] != m)) { @@ -978,7 +981,9 @@ cluster_wbuild(struct vnode *vp, int blksize, off_t start_loffset, int bytes) for (j = 0; j < tbp->b_xio.xio_npages; ++j) { m = tbp->b_xio.xio_pages[j]; + vm_page_busy_wait(m, FALSE, "clurpg"); vm_page_io_start(m); + vm_page_wakeup(m); vm_object_pip_add(m->object, 1); if ((bp->b_xio.xio_npages == 0) || (bp->b_xio.xio_pages[bp->b_xio.xio_npages - 1] != m)) { diff --git a/sys/kern/vfs_journal.c b/sys/kern/vfs_journal.c index c07773dfd0..99d3456799 100644 --- a/sys/kern/vfs_journal.c +++ b/sys/kern/vfs_journal.c @@ -1354,18 +1354,18 @@ jrecord_write_vnode_ref(struct jrecord *jrec, struct vnode *vp) struct nchandle nch; nch.mount = vp->v_mount; - spin_lock(&vp->v_spinlock); + spin_lock(&vp->v_spin); TAILQ_FOREACH(nch.ncp, &vp->v_namecache, nc_vnode) { if ((nch.ncp->nc_flag & (NCF_UNRESOLVED|NCF_DESTROYED)) == 0) break; } if (nch.ncp) { cache_hold(&nch); - spin_unlock(&vp->v_spinlock); + spin_unlock(&vp->v_spin); jrecord_write_path(jrec, JLEAF_PATH_REF, nch.ncp); cache_drop(&nch); } else { - spin_unlock(&vp->v_spinlock); + spin_unlock(&vp->v_spin); } } @@ -1376,7 +1376,7 @@ jrecord_write_vnode_link(struct jrecord *jrec, struct vnode *vp, struct nchandle nch; nch.mount = vp->v_mount; - spin_lock(&vp->v_spinlock); + spin_lock(&vp->v_spin); TAILQ_FOREACH(nch.ncp, &vp->v_namecache, nc_vnode) { if (nch.ncp == notncp) continue; @@ -1385,11 +1385,11 @@ jrecord_write_vnode_link(struct jrecord *jrec, struct vnode *vp, } if (nch.ncp) { cache_hold(&nch); - spin_unlock(&vp->v_spinlock); + spin_unlock(&vp->v_spin); jrecord_write_path(jrec, JLEAF_PATH_REF, nch.ncp); cache_drop(&nch); } else { - spin_unlock(&vp->v_spinlock); + spin_unlock(&vp->v_spin); } } diff --git a/sys/kern/vfs_lock.c b/sys/kern/vfs_lock.c index 5cf2ca69cc..db7423bfdf 100644 --- a/sys/kern/vfs_lock.c +++ b/sys/kern/vfs_lock.c @@ -247,7 +247,7 @@ __vfreetail(struct vnode *vp) * This routine is only valid if the vnode is already either VFREE or * VCACHED, or if it can become VFREE or VCACHED via vnode_terminate(). * - * WARNING! This functions is typically called with v_spinlock held. + * WARNING! This functions is typically called with v_spin held. * * MPSAFE */ @@ -296,7 +296,7 @@ vrele(struct vnode *vp) * An auxiliary reference DOES NOT move a vnode out of the VFREE state * once it has entered it. * - * WARNING! vhold() and vhold_interlocked() must not acquire v_spinlock. + * WARNING! vhold() and vhold_interlocked() must not acquire v_spin. * The spinlock may or may not already be held by the caller. * vdrop() will clean up the free list state. * @@ -319,7 +319,7 @@ vhold_interlocked(struct vnode *vp) * Remove an auxiliary reference from the vnode. * * vdrop needs to check for a VCACHE->VFREE transition to catch cases - * where a vnode is held past its reclamation. We use v_spinlock to + * where a vnode is held past its reclamation. We use v_spin to * interlock VCACHED -> !VCACHED transitions. 
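/*
 * Illustrative sketch (not part of the patch): the page-busy discipline the
 * cluster code follows after this patch.  vm_page_io_start() is only called
 * on a page the caller has hard-busied, so the sequence is busy-wait, start
 * the I/O accounting, then wake the page back up, exactly as cluster_rbuild()
 * and cluster_wbuild() now do.  The function name is hypothetical.
 */
static void
example_start_page_io(vm_page_t m)
{
        vm_page_busy_wait(m, FALSE, "exiostr");  /* hard-busy the page */
        vm_page_io_start(m);                     /* bump busy-I/O accounting */
        vm_page_wakeup(m);                       /* drop the hard busy */
        vm_object_pip_add(m->object, 1);         /* as in cluster_rbuild() */
}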
* * MPSAFE @@ -328,13 +328,13 @@ void vdrop(struct vnode *vp) { KKASSERT(vp->v_sysref.refcnt != 0 && vp->v_auxrefs > 0); - spin_lock(&vp->v_spinlock); + spin_lock(&vp->v_spin); atomic_subtract_int(&vp->v_auxrefs, 1); if ((vp->v_flag & VCACHED) && vshouldfree(vp)) { _vclrflags(vp, VCACHED); __vfree(vp); } - spin_unlock(&vp->v_spinlock); + spin_unlock(&vp->v_spin); } /* @@ -393,13 +393,13 @@ vnode_terminate(struct vnode *vp) if (vp->v_mount) VOP_INACTIVE(vp); } - spin_lock(&vp->v_spinlock); + spin_lock(&vp->v_spin); KKASSERT((vp->v_flag & (VFREE|VCACHED)) == 0); if (vshouldfree(vp)) __vfree(vp); else _vsetflags(vp, VCACHED); /* inactive but not yet free*/ - spin_unlock(&vp->v_spinlock); + spin_unlock(&vp->v_spin); vx_unlock(vp); } @@ -422,6 +422,7 @@ vnode_ctor(void *obj, void *private, int ocflags) RB_INIT(&vp->v_rbclean_tree); RB_INIT(&vp->v_rbdirty_tree); RB_INIT(&vp->v_rbhash_tree); + spin_init(&vp->v_spin); return(TRUE); } @@ -471,7 +472,7 @@ vx_lock_nonblock(struct vnode *vp) { if (lockcountnb(&vp->v_lock)) return(EBUSY); - return(lockmgr(&vp->v_lock, LK_EXCLUSIVE | LK_NOWAIT | LK_NOSPINWAIT)); + return(lockmgr(&vp->v_lock, LK_EXCLUSIVE | LK_NOWAIT)); } void @@ -546,16 +547,16 @@ vget(struct vnode *vp, int flags) * We are allowed to reactivate the vnode while we hold * the VX lock, assuming it can be reactivated. */ - spin_lock(&vp->v_spinlock); + spin_lock(&vp->v_spin); if (vp->v_flag & VFREE) { __vbusy(vp); sysref_activate(&vp->v_sysref); - spin_unlock(&vp->v_spinlock); + spin_unlock(&vp->v_spin); sysref_put(&vp->v_sysref); } else if (vp->v_flag & VCACHED) { _vclrflags(vp, VCACHED); sysref_activate(&vp->v_sysref); - spin_unlock(&vp->v_spinlock); + spin_unlock(&vp->v_spin); sysref_put(&vp->v_sysref); } else { if (sysref_isinactive(&vp->v_sysref)) { @@ -563,7 +564,7 @@ vget(struct vnode *vp, int flags) kprintf("Warning vp %p reactivation race\n", vp); } - spin_unlock(&vp->v_spinlock); + spin_unlock(&vp->v_spin); } _vclrflags(vp, VINACTIVE); error = 0; @@ -619,12 +620,12 @@ vx_get_nonblock(struct vnode *vp) void vx_put(struct vnode *vp) { - spin_lock(&vp->v_spinlock); + spin_lock(&vp->v_spin); if ((vp->v_flag & VCACHED) && vshouldfree(vp)) { _vclrflags(vp, VCACHED); __vfree(vp); } - spin_unlock(&vp->v_spinlock); + spin_unlock(&vp->v_spin); lockmgr(&vp->v_lock, LK_RELEASE); sysref_put(&vp->v_sysref); } @@ -715,7 +716,7 @@ allocfreevnode(void) * Cycle if we can't. * * We use a bad hack in vx_lock_nonblock() which avoids - * the lock order reversal between vfs_spin and v_spinlock. + * the lock order reversal between vfs_spin and v_spin. * This is very fragile code and I don't want to use * vhold here. */ diff --git a/sys/kern/vfs_mount.c b/sys/kern/vfs_mount.c index 5b05cc99e8..393f487ef8 100644 --- a/sys/kern/vfs_mount.c +++ b/sys/kern/vfs_mount.c @@ -490,14 +490,14 @@ visleaf(struct vnode *vp) { struct namecache *ncp; - spin_lock(&vp->v_spinlock); + spin_lock(&vp->v_spin); TAILQ_FOREACH(ncp, &vp->v_namecache, nc_vnode) { if (!TAILQ_EMPTY(&ncp->nc_list)) { - spin_unlock(&vp->v_spinlock); + spin_unlock(&vp->v_spin); return(0); } } - spin_unlock(&vp->v_spinlock); + spin_unlock(&vp->v_spin); return(1); } diff --git a/sys/kern/vfs_subr.c b/sys/kern/vfs_subr.c index 65e16cc427..2d19bbfcfe 100644 --- a/sys/kern/vfs_subr.c +++ b/sys/kern/vfs_subr.c @@ -507,10 +507,10 @@ vtruncbuf(struct vnode *vp, off_t length, int blksize) /* * Debugging only */ - spin_lock(&vp->v_spinlock); + spin_lock(&vp->v_spin); filename = TAILQ_FIRST(&vp->v_namecache) ? 
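/*
 * Illustrative sketch (not part of the patch): the v_spin interlock used by
 * vdrop() and vx_put() above for the VCACHED -> VFREE transition.  The flag
 * test, flag clear, and __vfree() must occur atomically with respect to
 * other v_spin users; the vnode now carries its own spinlock (initialized
 * in vnode_ctor()) rather than reusing v_lock.lk_spinlock.  The helper name
 * is hypothetical.
 */
static void
example_vnode_maybe_free(struct vnode *vp)
{
        spin_lock(&vp->v_spin);
        if ((vp->v_flag & VCACHED) && vshouldfree(vp)) {
                _vclrflags(vp, VCACHED);
                __vfree(vp);
        }
        spin_unlock(&vp->v_spin);
}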
TAILQ_FIRST(&vp->v_namecache)->nc_name : "?"; - spin_unlock(&vp->v_spinlock); + spin_unlock(&vp->v_spin); /* * Make sure no buffers were instantiated while we were trying @@ -1243,32 +1243,29 @@ vclean_vxlocked(struct vnode *vp, int flags) /* * If the vnode has an object, destroy it. */ - lwkt_gettoken(&vmobj_token); - object = vp->v_object; + while ((object = vp->v_object) != NULL) { + vm_object_hold(object); + if (object == vp->v_object) + break; + vm_object_drop(object); + } + if (object != NULL) { /* * Use vm_object_lock() rather than vm_object_hold to avoid * creating an extra (self-)hold on the object. - * - * NOTE: vm_object_terminate() eats the object lock. */ - vm_object_lock(object); - KKASSERT(object == vp->v_object); if (object->ref_count == 0) { - if ((object->flags & OBJ_DEAD) == 0) { - /* eats object lock */ + if ((object->flags & OBJ_DEAD) == 0) vm_object_terminate(object); - } else { - vm_object_unlock(object); - } + vm_object_drop(object); vclrflags(vp, VOBJBUF); } else { vm_pager_deallocate(object); vclrflags(vp, VOBJBUF); - vm_object_unlock(object); + vm_object_drop(object); } } - lwkt_reltoken(&vmobj_token); KKASSERT((vp->v_flag & VOBJBUF) == 0); /* @@ -1512,14 +1509,22 @@ vinitvmio(struct vnode *vp, off_t filesize, int blksize, int boff) vm_object_t object; int error = 0; - lwkt_gettoken(&vmobj_token); retry: - if ((object = vp->v_object) == NULL) { + while ((object = vp->v_object) != NULL) { + vm_object_hold(object); + if (object == vp->v_object) + break; + vm_object_drop(object); + } + + if (object == NULL) { object = vnode_pager_alloc(vp, filesize, 0, 0, blksize, boff); + /* * Dereference the reference we just created. This assumes * that the object is associated with the vp. */ + vm_object_hold(object); object->ref_count--; vrele(vp); } else { @@ -1528,12 +1533,13 @@ retry: if (vp->v_object == object) vm_object_dead_sleep(object, "vodead"); vn_lock(vp, LK_EXCLUSIVE | LK_RETRY); + vm_object_drop(object); goto retry; } } KASSERT(vp->v_object != NULL, ("vinitvmio: NULL object")); vsetflags(vp, VOBJBUF); - lwkt_reltoken(&vmobj_token); + vm_object_drop(object); return (error); } diff --git a/sys/kern/vfs_vm.c b/sys/kern/vfs_vm.c index a72fc691b6..137ec542cc 100644 --- a/sys/kern/vfs_vm.c +++ b/sys/kern/vfs_vm.c @@ -219,10 +219,10 @@ nvtruncbuf(struct vnode *vp, off_t length, int blksize, int boff) /* * Debugging only */ - spin_lock(&vp->v_spinlock); + spin_lock(&vp->v_spin); filename = TAILQ_FIRST(&vp->v_namecache) ? TAILQ_FIRST(&vp->v_namecache)->nc_name : "?"; - spin_unlock(&vp->v_spinlock); + spin_unlock(&vp->v_spin); /* * Make sure no buffers were instantiated while we were trying @@ -417,8 +417,11 @@ nvnode_pager_setsize(struct vnode *vp, off_t length, int blksize, int boff) */ if ((object = vp->v_object) == NULL) return; - if (length == vp->v_filesize) + vm_object_hold(object); + if (length == vp->v_filesize) { + vm_object_drop(object); return; + } /* * Calculate the size of the VM object, coverage includes @@ -456,23 +459,19 @@ nvnode_pager_setsize(struct vnode *vp, off_t length, int blksize, int boff) * invalidated. */ pi = OFF_TO_IDX(length + PAGE_MASK); - lwkt_gettoken(&vm_token); while (pi < nobjsize) { - do { - m = vm_page_lookup(object, pi); - } while (m && vm_page_sleep_busy(m, TRUE, "vsetsz")); + m = vm_page_lookup_busy_wait(object, pi, FALSE, "vmpg"); if (m) { - vm_page_busy(m); vm_page_protect(m, VM_PROT_NONE); vm_page_wakeup(m); } ++pi; } - lwkt_reltoken(&vm_token); } else { /* * File has expanded. 
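/*
 * Illustrative sketch (not part of the patch): the hold-and-recheck loop
 * used above in vclean_vxlocked() and vinitvmio() to obtain a stable
 * reference on vp->v_object without vmobj_token.  The association is
 * re-tested after the hold because vm_object_hold() can block, during
 * which time vp->v_object can change.  The function name is hypothetical.
 */
static vm_object_t
example_hold_vp_object(struct vnode *vp)
{
        vm_object_t object;

        while ((object = vp->v_object) != NULL) {
                vm_object_hold(object);
                if (object == vp->v_object)     /* still associated: done */
                        break;
                vm_object_drop(object);         /* raced, retry */
        }
        return (object);                        /* NULL or held object */
}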
*/ vp->v_filesize = length; } + vm_object_drop(object); } diff --git a/sys/platform/pc32/i386/machdep.c b/sys/platform/pc32/i386/machdep.c index 55cf644fd3..958af52633 100644 --- a/sys/platform/pc32/i386/machdep.c +++ b/sys/platform/pc32/i386/machdep.c @@ -901,10 +901,6 @@ cpu_halt(void) * critical section. * * NOTE: On an SMP system we rely on a scheduler IPI to wake a HLTed cpu up. - * However, there are cases where the idlethread will be entered with - * the possibility that no IPI will occur and in such cases - * lwkt_switch() sets RQF_WAKEUP. We usually check - * RQF_IDLECHECK_WK_MASK. * * NOTE: cpu_idle_hlt again defaults to 2 (use ACPI sleep states). Set to * 1 to just use hlt and for debugging purposes. diff --git a/sys/platform/pc32/i386/pmap.c b/sys/platform/pc32/i386/pmap.c index f7c79218ec..977afd63e0 100644 --- a/sys/platform/pc32/i386/pmap.c +++ b/sys/platform/pc32/i386/pmap.c @@ -48,14 +48,7 @@ /* * Manages physical address maps. * - * In most cases the vm_token must be held when manipulating a user pmap - * or elements within a vm_page, and the kvm_token must be held when - * manipulating the kernel pmap. Operations on user pmaps may require - * additional synchronization. - * - * In some cases the caller may hold the required tokens to prevent pmap - * functions from blocking on those same tokens. This typically only works - * for lookup-style operations. + * In most cases we hold page table pages busy in order to manipulate them. */ /* * PMAP_DEBUG - see platform/pc32/include/pmap.h @@ -72,6 +65,7 @@ #include #include #include +#include #include #include @@ -89,6 +83,7 @@ #include #include #include +#include #include #include @@ -369,11 +364,18 @@ pmap_bootstrap(vm_paddr_t firstaddr, vm_paddr_t loadaddr) * The kernel's pmap is statically allocated so we don't have to use * pmap_create, which is unlikely to work correctly at this part of * the boot sequence (XXX and which no longer exists). + * + * The kernel_pmap's pm_pteobj is used only for locking and not + * for mmu pages. */ kernel_pmap.pm_pdir = (pd_entry_t *)(KERNBASE + (u_int)IdlePTD); kernel_pmap.pm_count = 1; kernel_pmap.pm_active = (cpumask_t)-1 & ~CPUMASK_LOCK; + kernel_pmap.pm_pteobj = &kernel_object; TAILQ_INIT(&kernel_pmap.pm_pvlist); + TAILQ_INIT(&kernel_pmap.pm_pvlist_free); + spin_init(&kernel_pmap.pm_spin); + lwkt_token_init(&kernel_pmap.pm_token, "kpmap_tok"); nkpt = NKPT; /* @@ -977,17 +979,15 @@ pmap_qremove(vm_offset_t va, int count) * This routine works like vm_page_lookup() but also blocks as long as the * page is busy. This routine does not busy the page it returns. * - * The caller must hold vm_token. + * The caller must hold the object. */ static vm_page_t pmap_page_lookup(vm_object_t object, vm_pindex_t pindex) { vm_page_t m; - ASSERT_LWKT_TOKEN_HELD(&vm_token); - do { - m = vm_page_lookup(object, pindex); - } while (m && vm_page_sleep_busy(m, FALSE, "pplookp")); + ASSERT_LWKT_TOKEN_HELD(vm_object_token(object)); + m = vm_page_lookup_busy_wait(object, pindex, FALSE, "pplookp"); return(m); } @@ -1041,10 +1041,7 @@ _pmap_unwire_pte_hold(pmap_t pmap, vm_page_t m, pmap_inval_info_t info) * Wait until we can busy the page ourselves. We cannot have * any active flushes if we block. 
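/*
 * Illustrative sketch (not part of the patch): the lookup convention the
 * rewritten pmap_page_lookup() above relies on.  The pte object must be
 * held by the caller; vm_page_lookup_busy_wait() then hands back the page
 * busied (as the callers in this patch treat it) and the caller wakes it
 * up once it has what it needs.  The helper name is hypothetical.
 */
static vm_page_t
example_lookup_ptp(pmap_t pmap, vm_pindex_t ptepindex)
{
        vm_page_t m;

        vm_object_hold(pmap->pm_pteobj);
        m = vm_page_lookup_busy_wait(pmap->pm_pteobj, ptepindex,
                                     FALSE, "explkp");
        if (m)
                vm_page_wakeup(m);      /* caller only wanted the pointer */
        vm_object_drop(pmap->pm_pteobj);
        return (m);
}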
*/ - if (m->flags & PG_BUSY) { - while (vm_page_sleep_busy(m, FALSE, "pmuwpt")) - ; - } + vm_page_busy_wait(m, FALSE, "pmuwpt"); KASSERT(m->queue == PQ_NONE, ("_pmap_unwire_pte_hold: %p->queue != PQ_NONE", m)); @@ -1056,7 +1053,6 @@ _pmap_unwire_pte_hold(pmap_t pmap, vm_page_t m, pmap_inval_info_t info) * the current one, when clearing a page directory * entry. */ - vm_page_busy(m); pmap_inval_interlock(info, pmap, -1); KKASSERT(pmap->pm_pdir[m->pindex]); pmap->pm_pdir[m->pindex] = 0; @@ -1079,7 +1075,7 @@ _pmap_unwire_pte_hold(pmap_t pmap, vm_page_t m, pmap_inval_info_t info) vm_page_unhold(m); --m->wire_count; KKASSERT(m->wire_count == 0); - --vmstats.v_wire_count; + atomic_add_int(&vmstats.v_wire_count, -1); vm_page_flag_clear(m, PG_MAPPED | PG_WRITEABLE); vm_page_flash(m); vm_page_free_zero(m); @@ -1087,6 +1083,7 @@ _pmap_unwire_pte_hold(pmap_t pmap, vm_page_t m, pmap_inval_info_t info) } else { KKASSERT(m->hold_count > 1); vm_page_unhold(m); + vm_page_wakeup(m); return 0; } } @@ -1120,6 +1117,8 @@ pmap_unuse_pt(pmap_t pmap, vm_offset_t va, vm_page_t mpte, { unsigned ptepindex; + ASSERT_LWKT_TOKEN_HELD(vm_object_token(pmap->pm_pteobj)); + if (va >= UPT_MIN_ADDRESS) return 0; @@ -1129,8 +1128,9 @@ pmap_unuse_pt(pmap_t pmap, vm_offset_t va, vm_page_t mpte, (pmap->pm_ptphint->pindex == ptepindex)) { mpte = pmap->pm_ptphint; } else { - mpte = pmap_page_lookup( pmap->pm_pteobj, ptepindex); + mpte = pmap_page_lookup(pmap->pm_pteobj, ptepindex); pmap->pm_ptphint = mpte; + vm_page_wakeup(mpte); } } @@ -1158,6 +1158,9 @@ pmap_pinit0(struct pmap *pmap) pmap->pm_cached = 0; pmap->pm_ptphint = NULL; TAILQ_INIT(&pmap->pm_pvlist); + TAILQ_INIT(&pmap->pm_pvlist_free); + spin_init(&pmap->pm_spin); + lwkt_token_init(&pmap->pm_token, "pmap_tok"); bzero(&pmap->pm_stats, sizeof pmap->pm_stats); } @@ -1196,11 +1199,11 @@ pmap_pinit(struct pmap *pmap) ptdpg = vm_page_grab(pmap->pm_pteobj, PTDPTDI, VM_ALLOC_NORMAL | VM_ALLOC_RETRY); pmap->pm_pdirm = ptdpg; - vm_page_flag_clear(ptdpg, PG_MAPPED | PG_BUSY); + vm_page_flag_clear(ptdpg, PG_MAPPED); + vm_page_wire(ptdpg); ptdpg->valid = VM_PAGE_BITS_ALL; - ptdpg->wire_count = 1; - ++vmstats.v_wire_count; pmap_kenter((vm_offset_t)pmap->pm_pdir, VM_PAGE_TO_PHYS(ptdpg)); + vm_page_wakeup(ptdpg); } if ((ptdpg->flags & PG_ZERO) == 0) bzero(pmap->pm_pdir, PAGE_SIZE); @@ -1220,6 +1223,9 @@ pmap_pinit(struct pmap *pmap) pmap->pm_cached = 0; pmap->pm_ptphint = NULL; TAILQ_INIT(&pmap->pm_pvlist); + TAILQ_INIT(&pmap->pm_pvlist_free); + spin_init(&pmap->pm_spin); + lwkt_token_init(&pmap->pm_token, "pmap_tok"); bzero(&pmap->pm_stats, sizeof pmap->pm_stats); pmap->pm_stats.resident_count = 1; } @@ -1238,18 +1244,15 @@ pmap_puninit(pmap_t pmap) vm_page_t p; KKASSERT(pmap->pm_active == 0); - lwkt_gettoken(&vm_token); if ((p = pmap->pm_pdirm) != NULL) { KKASSERT(pmap->pm_pdir != NULL); pmap_kremove((vm_offset_t)pmap->pm_pdir); + vm_page_busy_wait(p, FALSE, "pgpun"); p->wire_count--; - vmstats.v_wire_count--; - KKASSERT((p->flags & PG_BUSY) == 0); - vm_page_busy(p); + atomic_add_int(&vmstats.v_wire_count, -1); vm_page_free_zero(p); pmap->pm_pdirm = NULL; } - lwkt_reltoken(&vm_token); if (pmap->pm_pdir) { kmem_free(&kernel_map, (vm_offset_t)pmap->pm_pdir, PAGE_SIZE); pmap->pm_pdir = NULL; @@ -1271,11 +1274,13 @@ pmap_puninit(pmap_t pmap) void pmap_pinit2(struct pmap *pmap) { - lwkt_gettoken(&vm_token); + /* + * XXX copies current process, does not fill in MPPTDI + */ + spin_lock(&pmap_spin); TAILQ_INSERT_TAIL(&pmap_list, pmap, pm_pmnode); - /* XXX copies current process, does not 
fill in MPPTDI */ bcopy(PTD + KPTDI, pmap->pm_pdir + KPTDI, nkpt * PTESIZE); - lwkt_reltoken(&vm_token); + spin_unlock(&pmap_spin); } /* @@ -1299,10 +1304,10 @@ pmap_release_free_page(struct pmap *pmap, vm_page_t p) * page-table pages. Those pages are zero now, and * might as well be placed directly into the zero queue. */ - if (vm_page_sleep_busy(p, FALSE, "pmaprl")) + if (vm_page_busy_try(p, FALSE)) { + vm_page_sleep_busy(p, FALSE, "pmaprl"); return 0; - - vm_page_busy(p); + } /* * Remove the page table page from the processes address space. @@ -1334,7 +1339,7 @@ pmap_release_free_page(struct pmap *pmap, vm_page_t p) vm_page_wakeup(p); } else { p->wire_count--; - vmstats.v_wire_count--; + atomic_add_int(&vmstats.v_wire_count, -1); vm_page_free_zero(p); } return 1; @@ -1378,7 +1383,7 @@ _pmap_allocpte(pmap_t pmap, unsigned ptepindex) } if (m->wire_count == 0) - vmstats.v_wire_count++; + atomic_add_int(&vmstats.v_wire_count, 1); m->wire_count++; @@ -1415,11 +1420,9 @@ _pmap_allocpte(pmap_t pmap, unsigned ptepindex) pmap_zero_page(ptepa); } } - m->valid = VM_PAGE_BITS_ALL; vm_page_flag_clear(m, PG_ZERO); - } - else { + } else { KKASSERT((m->flags & PG_ZERO) == 0); } @@ -1441,6 +1444,8 @@ pmap_allocpte(pmap_t pmap, vm_offset_t va) vm_offset_t ptepa; vm_page_t m; + ASSERT_LWKT_TOKEN_HELD(vm_object_token(pmap->pm_pteobj)); + /* * Calculate pagetable page index */ @@ -1475,8 +1480,9 @@ pmap_allocpte(pmap_t pmap, vm_offset_t va) (pmap->pm_ptphint->pindex == ptepindex)) { m = pmap->pm_ptphint; } else { - m = pmap_page_lookup( pmap->pm_pteobj, ptepindex); + m = pmap_page_lookup(pmap->pm_pteobj, ptepindex); pmap->pm_ptphint = m; + vm_page_wakeup(m); } m->hold_count++; return m; @@ -1497,7 +1503,7 @@ pmap_allocpte(pmap_t pmap, vm_offset_t va) * Called when a pmap initialized by pmap_pinit is being released. * Should only be called if the map contains no valid mappings. * - * No requirements. + * Caller must hold pmap->pm_token */ static int pmap_release_callback(struct vm_page *p, void *data); @@ -1516,10 +1522,12 @@ pmap_release(struct pmap *pmap) info.pmap = pmap; info.object = object; - vm_object_hold(object); - lwkt_gettoken(&vm_token); + + spin_lock(&pmap_spin); TAILQ_REMOVE(&pmap_list, pmap, pm_pmnode); + spin_unlock(&pmap_spin); + vm_object_hold(object); do { info.error = 0; info.mpte = NULL; @@ -1532,9 +1540,9 @@ pmap_release(struct pmap *pmap) info.error = 1; } } while (info.error); - pmap->pm_cached = 0; - lwkt_reltoken(&vm_token); vm_object_drop(object); + + pmap->pm_cached = 0; } /* @@ -1574,7 +1582,7 @@ pmap_growkernel(vm_offset_t kstart, vm_offset_t kend) vm_page_t nkpg; pd_entry_t newpdir; - lwkt_gettoken(&vm_token); + vm_object_hold(kptobj); if (kernel_vm_end == 0) { kernel_vm_end = KERNBASE; nkpt = 0; @@ -1612,13 +1620,15 @@ pmap_growkernel(vm_offset_t kstart, vm_offset_t kend) /* * This update must be interlocked with pmap_pinit2. 
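/*
 * Illustrative sketch (not part of the patch): the non-blocking busy
 * pattern used above in pmap_release_free_page() and in the pmap_collect()
 * rewrite.  vm_page_busy_try() is attempted first; only if it succeeds are
 * the (possibly stale) wire/hold tests re-checked with the page busied,
 * and the page is always woken up afterwards.  The function name is
 * hypothetical.
 */
static void
example_try_reclaim_page(vm_page_t m)
{
        if (m->wire_count || m->hold_count)     /* cheap pre-filter */
                return;
        if (vm_page_busy_try(m, TRUE))          /* non-zero: could not busy */
                return;
        if (m->wire_count == 0 && m->hold_count == 0)
                pmap_remove_all(m);             /* re-checked while busied */
        vm_page_wakeup(m);
}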
*/ + spin_lock(&pmap_spin); TAILQ_FOREACH(pmap, &pmap_list, pm_pmnode) { *pmap_pde(pmap, kernel_vm_end) = newpdir; } + spin_unlock(&pmap_spin); kernel_vm_end = (kernel_vm_end + PAGE_SIZE * NPTEPG) & ~(PAGE_SIZE * NPTEPG - 1); } - lwkt_reltoken(&vm_token); + vm_object_drop(kptobj); } /* @@ -1721,13 +1731,16 @@ pmap_collect(void) warningdone++; } - for(i = 0; i < vm_page_array_size; i++) { + for (i = 0; i < vm_page_array_size; i++) { m = &vm_page_array[i]; - if (m->wire_count || m->hold_count || m->busy || - (m->flags & PG_BUSY)) { + if (m->wire_count || m->hold_count) continue; + if (vm_page_busy_try(m, TRUE) == 0) { + if (m->wire_count == 0 && m->hold_count == 0) { + pmap_remove_all(m); + } + vm_page_wakeup(m); } - pmap_remove_all(m); } lwkt_reltoken(&vm_token); } @@ -1769,13 +1782,16 @@ pmap_remove_entry(struct pmap *pmap, vm_page_t m, test_m_maps_pv(m, pv); TAILQ_REMOVE(&m->md.pv_list, pv, pv_list); m->md.pv_list_count--; - m->object->agg_pv_list_count--; + atomic_add_int(&m->object->agg_pv_list_count, -1); if (TAILQ_EMPTY(&m->md.pv_list)) vm_page_flag_clear(m, PG_MAPPED | PG_WRITEABLE); TAILQ_REMOVE(&pmap->pm_pvlist, pv, pv_plist); ++pmap->pm_generation; + vm_object_hold(pmap->pm_pteobj); rtval = pmap_unuse_pt(pmap, va, pv->pv_ptem, info); + vm_object_drop(pmap->pm_pteobj); free_pv_entry(pv); + return rtval; } @@ -1802,7 +1818,7 @@ pmap_insert_entry(pmap_t pmap, vm_offset_t va, vm_page_t mpte, vm_page_t m) TAILQ_INSERT_TAIL(&m->md.pv_list, pv, pv_list); ++pmap->pm_generation; m->md.pv_list_count++; - m->object->agg_pv_list_count++; + atomic_add_int(&m->object->agg_pv_list_count, 1); } /* @@ -1904,9 +1920,11 @@ pmap_remove(struct pmap *pmap, vm_offset_t sva, vm_offset_t eva) if (pmap == NULL) return; + vm_object_hold(pmap->pm_pteobj); lwkt_gettoken(&vm_token); if (pmap->pm_stats.resident_count == 0) { lwkt_reltoken(&vm_token); + vm_object_drop(pmap->pm_pteobj); return; } @@ -1922,6 +1940,7 @@ pmap_remove(struct pmap *pmap, vm_offset_t sva, vm_offset_t eva) pmap_remove_page(pmap, sva, &info); pmap_inval_done(&info); lwkt_reltoken(&vm_token); + vm_object_drop(pmap->pm_pteobj); return; } @@ -1985,6 +2004,7 @@ pmap_remove(struct pmap *pmap, vm_offset_t sva, vm_offset_t eva) } pmap_inval_done(&info); lwkt_reltoken(&vm_token); + vm_object_drop(pmap->pm_pteobj); } /* @@ -2003,7 +2023,6 @@ pmap_remove_all(vm_page_t m) if (!pmap_initialized || (m->flags & PG_FICTITIOUS)) return; - lwkt_gettoken(&vm_token); pmap_inval_init(&info); while ((pv = TAILQ_FIRST(&m->md.pv_list)) != NULL) { KKASSERT(pv->pv_pmap->pm_stats.resident_count > 0); @@ -2042,15 +2061,16 @@ pmap_remove_all(vm_page_t m) TAILQ_REMOVE(&pv->pv_pmap->pm_pvlist, pv, pv_plist); ++pv->pv_pmap->pm_generation; m->md.pv_list_count--; - m->object->agg_pv_list_count--; + atomic_add_int(&m->object->agg_pv_list_count, -1); if (TAILQ_EMPTY(&m->md.pv_list)) vm_page_flag_clear(m, PG_MAPPED | PG_WRITEABLE); + vm_object_hold(pv->pv_pmap->pm_pteobj); pmap_unuse_pt(pv->pv_pmap, pv->pv_va, pv->pv_ptem, &info); + vm_object_drop(pv->pv_pmap->pm_pteobj); free_pv_entry(pv); } KKASSERT((m->flags & (PG_MAPPED|PG_WRITEABLE)) == 0); pmap_inval_done(&info); - lwkt_reltoken(&vm_token); } /* @@ -2193,6 +2213,7 @@ pmap_enter(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot, print_backtrace(-1); } + vm_object_hold(pmap->pm_pteobj); lwkt_gettoken(&vm_token); /* @@ -2204,7 +2225,8 @@ pmap_enter(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot, else mpte = NULL; - pmap_inval_init(&info); + if ((prot & VM_PROT_NOSYNC) == 0) + pmap_inval_init(&info); pte 
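/*
 * Illustrative sketch (not part of the patch): how the new pmap_spin
 * spinlock interlocks pmap_pinit2() against pmap_growkernel() in the i386
 * pmap above.  Insertion into pmap_list and the kernel-pde propagation
 * loop both run under pmap_spin, so a newly registered pmap cannot miss a
 * kernel page table update.  The helpers below are hypothetical
 * condensations of the two sides.
 */
static void
example_register_pmap(struct pmap *pmap)
{
        spin_lock(&pmap_spin);
        TAILQ_INSERT_TAIL(&pmap_list, pmap, pm_pmnode);
        spin_unlock(&pmap_spin);
}

static void
example_propagate_kernel_pde(vm_offset_t kernel_vm_end, pd_entry_t newpdir)
{
        struct pmap *pmap;

        spin_lock(&pmap_spin);
        TAILQ_FOREACH(pmap, &pmap_list, pm_pmnode)
                *pmap_pde(pmap, kernel_vm_end) = newpdir;
        spin_unlock(&pmap_spin);
}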
= pmap_pte(pmap, va); /* @@ -2338,18 +2360,24 @@ validate: * to update the pte. */ if ((origpte & ~(PG_M|PG_A)) != newpte) { - pmap_inval_interlock(&info, pmap, va); + if (prot & VM_PROT_NOSYNC) + cpu_invlpg((void *)va); + else + pmap_inval_interlock(&info, pmap, va); ptbase_assert(pmap); KKASSERT(*pte == 0 || (*pte & PG_FRAME) == (newpte & PG_FRAME)); *pte = newpte | PG_A; - pmap_inval_deinterlock(&info, pmap); + if ((prot & VM_PROT_NOSYNC) == 0) + pmap_inval_deinterlock(&info, pmap); if (newpte & PG_RW) vm_page_flag_set(m, PG_WRITEABLE); } KKASSERT((newpte & PG_MANAGED) == 0 || (m->flags & PG_MAPPED)); - pmap_inval_done(&info); + if ((prot & VM_PROT_NOSYNC) == 0) + pmap_inval_done(&info); lwkt_reltoken(&vm_token); + vm_object_drop(pmap->pm_pteobj); } /* @@ -2371,6 +2399,7 @@ pmap_enter_quick(pmap_t pmap, vm_offset_t va, vm_page_t m) vm_offset_t ptepa; pmap_inval_info info; + vm_object_hold(pmap->pm_pteobj); lwkt_gettoken(&vm_token); pmap_inval_init(&info); @@ -2414,8 +2443,9 @@ pmap_enter_quick(pmap_t pmap, vm_offset_t va, vm_page_t m) (pmap->pm_ptphint->pindex == ptepindex)) { mpte = pmap->pm_ptphint; } else { - mpte = pmap_page_lookup( pmap->pm_pteobj, ptepindex); + mpte = pmap_page_lookup(pmap->pm_pteobj, ptepindex); pmap->pm_ptphint = mpte; + vm_page_wakeup(mpte); } if (mpte) mpte->hold_count++; @@ -2441,6 +2471,7 @@ pmap_enter_quick(pmap_t pmap, vm_offset_t va, vm_page_t m) KKASSERT(((*pte ^ pa) & PG_FRAME) == 0); pmap_inval_done(&info); lwkt_reltoken(&vm_token); + vm_object_drop(pmap->pm_pteobj); return; } @@ -2469,6 +2500,7 @@ pmap_enter_quick(pmap_t pmap, vm_offset_t va, vm_page_t m) /* pmap_inval_add(&info, pmap, va); shouldn't be needed inval->valid */ pmap_inval_done(&info); lwkt_reltoken(&vm_token); + vm_object_drop(pmap->pm_pteobj); } /* @@ -2552,10 +2584,8 @@ pmap_object_init_pt(pmap_t pmap, vm_offset_t addr, vm_prot_t prot, info.pmap = pmap; vm_object_hold(object); - lwkt_gettoken(&vm_token); vm_page_rb_tree_RB_SCAN(&object->rb_memq, rb_vm_page_scancmp, pmap_object_init_pt_callback, &info); - lwkt_reltoken(&vm_token); vm_object_drop(object); } @@ -2576,16 +2606,17 @@ pmap_object_init_pt_callback(vm_page_t p, void *data) vmstats.v_free_count < vmstats.v_free_reserved) { return(-1); } + if (vm_page_busy_try(p, TRUE)) + return 0; if (((p->valid & VM_PAGE_BITS_ALL) == VM_PAGE_BITS_ALL) && - (p->busy == 0) && (p->flags & (PG_BUSY | PG_FICTITIOUS)) == 0) { - vm_page_busy(p); + (p->flags & PG_FICTITIOUS) == 0) { if ((p->queue - p->pc) == PQ_CACHE) vm_page_deactivate(p); rel_index = p->pindex - info->start_pindex; pmap_enter_quick(info->pmap, info->addr + i386_ptob(rel_index), p); - vm_page_wakeup(p); } + vm_page_wakeup(p); return(0); } @@ -2880,8 +2911,11 @@ pmap_remove_pages(pmap_t pmap, vm_offset_t sva, vm_offset_t eva) else iscurrentpmap = 0; + if (pmap->pm_pteobj) + vm_object_hold(pmap->pm_pteobj); lwkt_gettoken(&vm_token); pmap_inval_init(&info); + for (pv = TAILQ_FIRST(&pmap->pm_pvlist); pv; pv = npv) { if (pv->pv_va >= eva || pv->pv_va < sva) { npv = TAILQ_NEXT(pv, pv_plist); @@ -2935,7 +2969,7 @@ pmap_remove_pages(pmap_t pmap, vm_offset_t sva, vm_offset_t eva) save_generation = ++pmap->pm_generation; m->md.pv_list_count--; - m->object->agg_pv_list_count--; + atomic_add_int(&m->object->agg_pv_list_count, -1); TAILQ_REMOVE(&m->md.pv_list, pv, pv_list); if (TAILQ_EMPTY(&m->md.pv_list)) vm_page_flag_clear(m, PG_MAPPED | PG_WRITEABLE); @@ -2954,6 +2988,8 @@ pmap_remove_pages(pmap_t pmap, vm_offset_t sva, vm_offset_t eva) } pmap_inval_done(&info); lwkt_reltoken(&vm_token); + if 
(pmap->pm_pteobj) + vm_object_drop(pmap->pm_pteobj); } /* @@ -3127,10 +3163,6 @@ pmap_phys_address(vm_pindex_t ppn) * is necessary that 0 only be returned when there are truly no * reference bits set. * - * XXX: The exact number of bits to check and clear is a matter that - * should be tested and standardized at some point in the future for - * optimal aging of shared pages. - * * No requirements. */ int @@ -3406,6 +3438,7 @@ done: * * Only called with new VM spaces. * The process must have only a single thread. + * The process must hold the vmspace->vm_map.token for oldvm and newvm * No other requirements. */ void @@ -3488,14 +3521,14 @@ pmap_interlock_wait(struct vmspace *vm) struct pmap *pmap = &vm->vm_pmap; if (pmap->pm_active & CPUMASK_LOCK) { - DEBUG_PUSH_INFO("pmap_interlock_wait"); crit_enter(); + DEBUG_PUSH_INFO("pmap_interlock_wait"); while (pmap->pm_active & CPUMASK_LOCK) { cpu_ccfence(); lwkt_process_ipiq(); } - crit_exit(); DEBUG_POP_INFO(); + crit_exit(); } } diff --git a/sys/platform/pc32/include/pmap.h b/sys/platform/pc32/include/pmap.h index dbbe40a31f..a82e923169 100644 --- a/sys/platform/pc32/include/pmap.h +++ b/sys/platform/pc32/include/pmap.h @@ -136,6 +136,12 @@ #ifndef _SYS_QUEUE_H_ #include #endif +#ifndef _SYS_SPINLOCK_H_ +#include +#endif +#ifndef _SYS_THREAD_H_ +#include +#endif #ifndef _MACHINE_TYPES_H_ #include #endif @@ -227,6 +233,7 @@ struct pmap { struct vm_object *pm_pteobj; /* Container for pte's */ TAILQ_ENTRY(pmap) pm_pmnode; /* list of pmaps */ TAILQ_HEAD(,pv_entry) pm_pvlist; /* list of mappings in pmap */ + TAILQ_HEAD(,pv_entry) pm_pvlist_free; /* free mappings */ int pm_count; /* reference count */ cpumask_t pm_active; /* active on cpus */ cpumask_t pm_cached; /* cached on cpus */ @@ -234,6 +241,8 @@ struct pmap { struct pmap_statistics pm_stats; /* pmap statistics */ struct vm_page *pm_ptphint; /* pmap ptp hint */ int pm_generation; /* detect pvlist deletions */ + struct spinlock pm_spin; + struct lwkt_token pm_token; }; #define pmap_resident_count(pmap) (pmap)->pm_stats.resident_count @@ -283,6 +292,7 @@ extern vm_offset_t clean_eva; extern vm_offset_t clean_sva; extern char *ptvmmap; /* poor name! 
*/ +void pmap_release(struct pmap *pmap); void pmap_interlock_wait (struct vmspace *); void pmap_bootstrap (vm_paddr_t, vm_paddr_t); void *pmap_mapdev (vm_paddr_t, vm_size_t); diff --git a/sys/platform/pc64/include/pmap.h b/sys/platform/pc64/include/pmap.h index 6857376e0b..8469067e63 100644 --- a/sys/platform/pc64/include/pmap.h +++ b/sys/platform/pc64/include/pmap.h @@ -131,6 +131,12 @@ #ifndef _SYS_QUEUE_H_ #include #endif +#ifndef _SYS_SPINLOCK_H_ +#include +#endif +#ifndef _SYS_THREAD_H_ +#include +#endif #ifndef _MACHINE_TYPES_H_ #include #endif @@ -172,7 +178,6 @@ extern u_int64_t KPML4phys; /* physical address of kernel level 4 */ static __inline void pte_store(pt_entry_t *ptep, pt_entry_t pte) { - *ptep = pte; } @@ -188,6 +193,7 @@ struct vmspace; struct md_page { int pv_list_count; + int pv_generation; TAILQ_HEAD(,pv_entry) pv_list; }; @@ -212,12 +218,16 @@ struct pmap { struct vm_object *pm_pteobj; /* Container for pte's */ TAILQ_ENTRY(pmap) pm_pmnode; /* list of pmaps */ TAILQ_HEAD(,pv_entry) pm_pvlist; /* list of mappings in pmap */ + TAILQ_HEAD(,pv_entry) pm_pvlist_free; /* free mappings */ int pm_count; /* reference count */ cpumask_t pm_active; /* active on cpus */ int pm_filler02; /* (filler sync w/vkernel) */ struct pmap_statistics pm_stats; /* pmap statistics */ struct vm_page *pm_ptphint; /* pmap ptp hint */ int pm_generation; /* detect pvlist deletions */ + int pm_hold; + struct spinlock pm_spin; + struct lwkt_token pm_token; }; #define CPUMASK_LOCK CPUMASK(SMP_MAXCPU) @@ -241,6 +251,7 @@ typedef struct pv_entry { TAILQ_ENTRY(pv_entry) pv_list; TAILQ_ENTRY(pv_entry) pv_plist; struct vm_page *pv_ptem; /* VM page for pte */ + u_int pv_hold; /* hold on destruction count */ } *pv_entry_t; #ifdef _KERNEL @@ -262,6 +273,7 @@ extern vm_offset_t clean_eva; extern vm_offset_t clean_sva; extern char *ptvmmap; /* poor name! 
*/ +void pmap_release(struct pmap *pmap); void pmap_interlock_wait (struct vmspace *); void pmap_bootstrap (vm_paddr_t *); void *pmap_mapdev (vm_paddr_t, vm_size_t); diff --git a/sys/platform/pc64/x86_64/pmap.c b/sys/platform/pc64/x86_64/pmap.c index ac06c70459..7396c056e2 100644 --- a/sys/platform/pc64/x86_64/pmap.c +++ b/sys/platform/pc64/x86_64/pmap.c @@ -102,6 +102,8 @@ #include #include #include +#include +#include #include #include @@ -203,6 +205,10 @@ struct msgbuf *msgbufp=0; static pt_entry_t *pt_crashdumpmap; static caddr_t crashdumpmap; +static int pmap_yield_count = 64; +SYSCTL_INT(_machdep, OID_AUTO, pmap_yield_count, CTLFLAG_RW, + &pmap_yield_count, 0, "Yield during init_pt/release"); + #define DISABLE_PSE static pv_entry_t get_pv_entry (void); @@ -225,9 +231,8 @@ static int pmap_release_free_page (pmap_t pmap, vm_page_t p); static vm_page_t _pmap_allocpte (pmap_t pmap, vm_pindex_t ptepindex); static pt_entry_t * pmap_pte_quick (pmap_t pmap, vm_offset_t va); static vm_page_t pmap_page_lookup (vm_object_t object, vm_pindex_t pindex); -static int _pmap_unwire_pte_hold(pmap_t pmap, vm_offset_t va, vm_page_t m, +static int pmap_unwire_pte_hold(pmap_t pmap, vm_offset_t va, vm_page_t m, pmap_inval_info_t info); -static int pmap_unuse_pt (pmap_t, vm_offset_t, vm_page_t, pmap_inval_info_t); static vm_offset_t pmap_kmem_choose(vm_offset_t addr); static unsigned pdir4mb; @@ -393,7 +398,8 @@ static __inline pt_entry_t * vtopte(vm_offset_t va) { - uint64_t mask = ((1ul << (NPTEPGSHIFT + NPDEPGSHIFT + NPDPEPGSHIFT + NPML4EPGSHIFT)) - 1); + uint64_t mask = ((1ul << (NPTEPGSHIFT + NPDEPGSHIFT + + NPDPEPGSHIFT + NPML4EPGSHIFT)) - 1); return (PTmap + ((va >> PAGE_SHIFT) & mask)); } @@ -402,7 +408,8 @@ static __inline pd_entry_t * vtopde(vm_offset_t va) { - uint64_t mask = ((1ul << (NPDEPGSHIFT + NPDPEPGSHIFT + NPML4EPGSHIFT)) - 1); + uint64_t mask = ((1ul << (NPDEPGSHIFT + NPDPEPGSHIFT + + NPML4EPGSHIFT)) - 1); return (PDmap + ((va >> PDRSHIFT) & mask)); } @@ -609,11 +616,19 @@ pmap_bootstrap(vm_paddr_t *firstaddr) * The kernel's pmap is statically allocated so we don't have to use * pmap_create, which is unlikely to work correctly at this part of * the boot sequence (XXX and which no longer exists). + * + * The kernel_pmap's pm_pteobj is used only for locking and not + * for mmu pages. */ kernel_pmap.pm_pml4 = (pdp_entry_t *) (PTOV_OFFSET + KPML4phys); kernel_pmap.pm_count = 1; kernel_pmap.pm_active = (cpumask_t)-1 & ~CPUMASK_LOCK; + kernel_pmap.pm_pteobj = &kernel_object; TAILQ_INIT(&kernel_pmap.pm_pvlist); + TAILQ_INIT(&kernel_pmap.pm_pvlist_free); + kernel_pmap.pm_hold = 0; + spin_init(&kernel_pmap.pm_spin); + lwkt_token_init(&kernel_pmap.pm_token, "kpmap_tok"); /* * Reserve some special page table entries/VA space for temporary @@ -852,7 +867,7 @@ pmap_track_modified(vm_offset_t va) /* * Extract the physical page address associated with the map/VA pair. * - * The caller must hold vm_token if non-blocking operation is desired. + * The caller must hold pmap->pm_token if non-blocking operation is desired. 
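/*
 * Illustrative sketch (not part of the patch): initialization of the new
 * per-pmap fields declared above (pm_pvlist_free, pm_spin, pm_token, and
 * pm_hold on x86-64), condensed from the pmap_pinit0()/pmap_pinit()
 * changes in this patch.  The function name is hypothetical.
 */
static void
example_pmap_init_locks(struct pmap *pmap)
{
        TAILQ_INIT(&pmap->pm_pvlist);
        TAILQ_INIT(&pmap->pm_pvlist_free);
        pmap->pm_hold = 0;                      /* x86-64 pmap only */
        spin_init(&pmap->pm_spin);
        lwkt_token_init(&pmap->pm_token, "pmap_tok");
}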
*/ vm_paddr_t pmap_extract(pmap_t pmap, vm_offset_t va) @@ -861,7 +876,7 @@ pmap_extract(pmap_t pmap, vm_offset_t va) pt_entry_t *pte; pd_entry_t pde, *pdep; - lwkt_gettoken(&vm_token); + lwkt_gettoken(&pmap->pm_token); rtval = 0; pdep = pmap_pde(pmap, va); if (pdep != NULL) { @@ -875,7 +890,7 @@ pmap_extract(pmap_t pmap, vm_offset_t va) } } } - lwkt_reltoken(&vm_token); + lwkt_reltoken(&pmap->pm_token); return rtval; } @@ -1103,10 +1118,8 @@ pmap_qremove(vm_offset_t va, int count) * This routine works like vm_page_lookup() but also blocks as long as the * page is busy. This routine does not busy the page it returns. * - * Unless the caller is managing objects whos pages are in a known state, - * the call should be made with both vm_token held and the governing object - * and its token held so the page's object association remains valid on - * return. + * The call should be made with the governing object held so the page's + * object association remains valid on return. * * This function can block! */ @@ -1116,9 +1129,8 @@ pmap_page_lookup(vm_object_t object, vm_pindex_t pindex) { vm_page_t m; - do { - m = vm_page_lookup(object, pindex); - } while (m && vm_page_sleep_busy(m, FALSE, "pplookp")); + ASSERT_LWKT_TOKEN_HELD(vm_object_token(object)); + m = vm_page_lookup_busy_wait(object, pindex, FALSE, "pplookp"); return(m); } @@ -1160,71 +1172,73 @@ pmap_dispose_proc(struct proc *p) ***************************************************/ /* - * This routine unholds page table pages, and if the hold count - * drops to zero, then it decrements the wire count. + * After removing a page table entry, this routine is used to + * conditionally free the page, and manage the hold/wire counts. */ static __inline int -pmap_unwire_pte_hold(pmap_t pmap, vm_offset_t va, vm_page_t m, - pmap_inval_info_t info) +pmap_unuse_pt(pmap_t pmap, vm_offset_t va, vm_page_t mpte, + pmap_inval_info_t info) { - KKASSERT(m->hold_count > 0); - if (m->hold_count > 1) { - vm_page_unhold(m); - return 0; - } else { - return _pmap_unwire_pte_hold(pmap, va, m, info); - } + if (mpte) + return (pmap_unwire_pte_hold(pmap, va, mpte, info)); + return 0; } -static +/* + * This routine reduces the wire_count on a page. If the wire_count + * would drop to zero we remove the PT, PD, or PDP from its parent page + * table. Under normal operation this only occurs with PT pages. + */ +static __inline int -_pmap_unwire_pte_hold(pmap_t pmap, vm_offset_t va, vm_page_t m, - pmap_inval_info_t info) +pmap_unwire_pte_hold(pmap_t pmap, vm_offset_t va, vm_page_t m, + pmap_inval_info_t info) { + if (!vm_page_unwire_quick(m)) + return 0; + /* * Wait until we can busy the page ourselves. We cannot have * any active flushes if we block. We own one hold count on the * page so it cannot be freed out from under us. */ - if (m->flags & PG_BUSY) { - while (vm_page_sleep_busy(m, FALSE, "pmuwpt")) - ; - } + vm_page_busy_wait(m, FALSE, "pmuwpt"); KASSERT(m->queue == PQ_NONE, ("_pmap_unwire_pte_hold: %p->queue != PQ_NONE", m)); /* - * This case can occur if new references were acquired while - * we were blocked. + * New references can bump the wire_count while we were blocked, + * try to unwire quickly again (e.g. 2->1). 
*/ - if (m->hold_count > 1) { - KKASSERT(m->hold_count > 1); - vm_page_unhold(m); + if (vm_page_unwire_quick(m) == 0) { + vm_page_wakeup(m); return 0; } /* * Unmap the page table page */ - KKASSERT(m->hold_count == 1); - vm_page_busy(m); + KKASSERT(m->wire_count == 1); pmap_inval_interlock(info, pmap, -1); if (m->pindex >= (NUPDE + NUPDPE)) { /* PDP page */ pml4_entry_t *pml4; pml4 = pmap_pml4e(pmap, va); + KKASSERT(*pml4); *pml4 = 0; } else if (m->pindex >= NUPDE) { /* PD page */ pdp_entry_t *pdp; pdp = pmap_pdpe(pmap, va); + KKASSERT(*pdp); *pdp = 0; } else { /* PT page */ pd_entry_t *pd; pd = pmap_pde(pmap, va); + KKASSERT(*pd); *pd = 0; } @@ -1251,16 +1265,11 @@ _pmap_unwire_pte_hold(pmap_t pmap, vm_offset_t va, vm_page_t m, } /* - * This was our last hold, the page had better be unwired - * after we decrement wire_count. - * - * FUTURE NOTE: shared page directory page could result in - * multiple wire counts. + * This was our wiring. */ - vm_page_unhold(m); - --m->wire_count; + KKASSERT(m->flags & PG_UNMANAGED); + vm_page_unwire(m, 0); KKASSERT(m->wire_count == 0); - --vmstats.v_wire_count; vm_page_flag_clear(m, PG_MAPPED | PG_WRITEABLE); vm_page_flash(m); vm_page_free_zero(m); @@ -1268,37 +1277,6 @@ _pmap_unwire_pte_hold(pmap_t pmap, vm_offset_t va, vm_page_t m, return 1; } -/* - * After removing a page table entry, this routine is used to - * conditionally free the page, and manage the hold/wire counts. - */ -static -int -pmap_unuse_pt(pmap_t pmap, vm_offset_t va, vm_page_t mpte, - pmap_inval_info_t info) -{ - vm_pindex_t ptepindex; - - if (va >= VM_MAX_USER_ADDRESS) - return 0; - - if (mpte == NULL) { - ptepindex = pmap_pde_pindex(va); -#if JGHINT - if (pmap->pm_ptphint && - (pmap->pm_ptphint->pindex == ptepindex)) { - mpte = pmap->pm_ptphint; - } else { -#endif - mpte = pmap_page_lookup(pmap->pm_pteobj, ptepindex); - pmap->pm_ptphint = mpte; -#if JGHINT - } -#endif - } - return pmap_unwire_pte_hold(pmap, va, mpte, info); -} - /* * Initialize pmap0/vmspace0. This pmap is not added to pmap_list because * it, and IdlePTD, represents the template used to update all other pmaps. @@ -1315,6 +1293,10 @@ pmap_pinit0(struct pmap *pmap) pmap->pm_active = 0; pmap->pm_ptphint = NULL; TAILQ_INIT(&pmap->pm_pvlist); + TAILQ_INIT(&pmap->pm_pvlist_free); + pmap->pm_hold = 0; + spin_init(&pmap->pm_spin); + lwkt_token_init(&pmap->pm_token, "pmap_tok"); bzero(&pmap->pm_stats, sizeof pmap->pm_stats); } @@ -1325,7 +1307,7 @@ pmap_pinit0(struct pmap *pmap) void pmap_pinit(struct pmap *pmap) { - vm_page_t ptdpg; + vm_page_t pml4pg; /* * No need to allocate page table space yet but we do need a valid @@ -1339,43 +1321,52 @@ pmap_pinit(struct pmap *pmap) /* * Allocate an object for the ptes */ - if (pmap->pm_pteobj == NULL) - pmap->pm_pteobj = vm_object_allocate(OBJT_DEFAULT, NUPDE + NUPDPE + PML4PML4I + 1); + if (pmap->pm_pteobj == NULL) { + pmap->pm_pteobj = vm_object_allocate(OBJT_DEFAULT, + NUPDE + NUPDPE + PML4PML4I + 1); + } /* * Allocate the page directory page, unless we already have * one cached. If we used the cached page the wire_count will * already be set appropriately. 
*/ - if ((ptdpg = pmap->pm_pdirm) == NULL) { - ptdpg = vm_page_grab(pmap->pm_pteobj, - NUPDE + NUPDPE + PML4PML4I, - VM_ALLOC_NORMAL | VM_ALLOC_RETRY); - pmap->pm_pdirm = ptdpg; - vm_page_flag_clear(ptdpg, PG_MAPPED | PG_BUSY); - ptdpg->valid = VM_PAGE_BITS_ALL; - if (ptdpg->wire_count == 0) - ++vmstats.v_wire_count; - ptdpg->wire_count = 1; - pmap_kenter((vm_offset_t)pmap->pm_pml4, VM_PAGE_TO_PHYS(ptdpg)); - } - if ((ptdpg->flags & PG_ZERO) == 0) + if ((pml4pg = pmap->pm_pdirm) == NULL) { + pml4pg = vm_page_grab(pmap->pm_pteobj, + NUPDE + NUPDPE + PML4PML4I, + VM_ALLOC_NORMAL | VM_ALLOC_RETRY); + pmap->pm_pdirm = pml4pg; + vm_page_unmanage(pml4pg); + vm_page_flag_clear(pml4pg, PG_MAPPED); + pml4pg->valid = VM_PAGE_BITS_ALL; + vm_page_wire(pml4pg); + vm_page_wakeup(pml4pg); + pmap_kenter((vm_offset_t)pmap->pm_pml4, + VM_PAGE_TO_PHYS(pml4pg)); + } + if ((pml4pg->flags & PG_ZERO) == 0) bzero(pmap->pm_pml4, PAGE_SIZE); #ifdef PMAP_DEBUG else - pmap_page_assertzero(VM_PAGE_TO_PHYS(ptdpg)); + pmap_page_assertzero(VM_PAGE_TO_PHYS(pml4pg)); #endif + vm_page_flag_clear(pml4pg, PG_ZERO); pmap->pm_pml4[KPML4I] = KPDPphys | PG_RW | PG_V | PG_U; pmap->pm_pml4[DMPML4I] = DMPDPphys | PG_RW | PG_V | PG_U; /* install self-referential address mapping entry */ - pmap->pm_pml4[PML4PML4I] = VM_PAGE_TO_PHYS(ptdpg) | PG_V | PG_RW | PG_A | PG_M; + pmap->pm_pml4[PML4PML4I] = VM_PAGE_TO_PHYS(pml4pg) | + PG_V | PG_RW | PG_A | PG_M; pmap->pm_count = 1; pmap->pm_active = 0; pmap->pm_ptphint = NULL; TAILQ_INIT(&pmap->pm_pvlist); + TAILQ_INIT(&pmap->pm_pvlist_free); + pmap->pm_hold = 0; + spin_init(&pmap->pm_spin); + lwkt_token_init(&pmap->pm_token, "pmap_tok"); bzero(&pmap->pm_stats, sizeof pmap->pm_stats); pmap->pm_stats.resident_count = 1; } @@ -1392,15 +1383,13 @@ pmap_puninit(pmap_t pmap) vm_page_t p; KKASSERT(pmap->pm_active == 0); - lwkt_gettoken(&vm_token); if ((p = pmap->pm_pdirm) != NULL) { KKASSERT(pmap->pm_pml4 != NULL); KKASSERT(pmap->pm_pml4 != (void *)(PTOV_OFFSET + KPML4phys)); pmap_kremove((vm_offset_t)pmap->pm_pml4); - p->wire_count--; - vmstats.v_wire_count--; - KKASSERT((p->flags & PG_BUSY) == 0); - vm_page_busy(p); + vm_page_busy_wait(p, FALSE, "pgpun"); + KKASSERT(p->flags & PG_UNMANAGED); + vm_page_unwire(p, 0); vm_page_free_zero(p); pmap->pm_pdirm = NULL; } @@ -1413,7 +1402,6 @@ pmap_puninit(pmap_t pmap) vm_object_deallocate(pmap->pm_pteobj); pmap->pm_pteobj = NULL; } - lwkt_reltoken(&vm_token); } /* @@ -1425,10 +1413,12 @@ pmap_puninit(pmap_t pmap) void pmap_pinit2(struct pmap *pmap) { - lwkt_gettoken(&vm_token); + /* + * XXX copies current process, does not fill in MPPTDI + */ + spin_lock(&pmap_spin); TAILQ_INSERT_TAIL(&pmap_list, pmap, pm_pmnode); - /* XXX copies current process, does not fill in MPPTDI */ - lwkt_reltoken(&vm_token); + spin_unlock(&pmap_spin); } /* @@ -1448,10 +1438,10 @@ pmap_release_free_page(struct pmap *pmap, vm_page_t p) * page-table pages. Those pages are zero now, and * might as well be placed directly into the zero queue. */ - if (vm_page_sleep_busy(p, FALSE, "pmaprl")) + if (vm_page_busy_try(p, FALSE)) { + vm_page_sleep_busy(p, FALSE, "pmaprl"); return 0; - - vm_page_busy(p); + } /* * Remove the page table page from the processes address space. @@ -1464,13 +1454,14 @@ pmap_release_free_page(struct pmap *pmap, vm_page_t p) } else if (p->pindex >= (NUPDE + NUPDPE)) { /* * Remove a PDP page from the PML4. We do not maintain - * hold counts on the PML4 page. + * wire counts on the PML4 page. 
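/*
 * Illustrative sketch (not part of the patch): the wire-count based
 * teardown that replaces the old hold_count scheme in the x86-64 pmap.
 * vm_page_unwire_quick() is used here the way this patch uses it: it
 * drops one wiring and returns non-zero only when the drop cannot be
 * made without reaching zero, at which point the caller busies the page,
 * unmaps it from its parent, and frees it.  Flag clearing and
 * vm_page_flash() as done in pmap_unwire_pte_hold() are elided; the
 * helper name is hypothetical.
 */
static int
example_drop_ptp_wiring(vm_page_t m)
{
        if (!vm_page_unwire_quick(m))
                return (0);                     /* dropped one wiring, others remain */

        vm_page_busy_wait(m, FALSE, "exptp");
        if (vm_page_unwire_quick(m) == 0) {     /* rewired while we blocked */
                vm_page_wakeup(m);
                return (0);
        }

        /* last wiring: the parent page table entry would be cleared here */
        KKASSERT(m->wire_count == 1);
        vm_page_unwire(m, 0);
        KKASSERT(m->wire_count == 0);
        vm_page_free_zero(m);
        return (1);
}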
*/ pml4_entry_t *pml4; vm_page_t m4; int idx; - m4 = vm_page_lookup(pmap->pm_pteobj, NUPDE + NUPDPE + PML4PML4I); + m4 = vm_page_lookup(pmap->pm_pteobj, + NUPDE + NUPDPE + PML4PML4I); KKASSERT(m4 != NULL); pml4 = (void *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(m4)); idx = (p->pindex - (NUPDE + NUPDPE)) % NPML4EPG; @@ -1478,10 +1469,9 @@ pmap_release_free_page(struct pmap *pmap, vm_page_t p) pml4[idx] = 0; } else if (p->pindex >= NUPDE) { /* - * Remove a PD page from the PDP and drop the hold count - * on the PDP. The PDP is left cached in the pmap if - * the hold count drops to 0 so the wire count remains - * intact. + * Remove a PD page from the PDP and drop the wire count + * on the PDP. The PDP has a wire_count just from being + * mapped so the wire_count should never drop to 0 here. */ vm_page_t m3; pdp_entry_t *pdp; @@ -1494,13 +1484,13 @@ pmap_release_free_page(struct pmap *pmap, vm_page_t p) idx = (p->pindex - NUPDE) % NPDPEPG; KKASSERT(pdp[idx] != 0); pdp[idx] = 0; - m3->hold_count--; + if (vm_page_unwire_quick(m3)) + panic("pmap_release_free_page: m3 wire_count 1->0"); } else { /* - * Remove a PT page from the PD and drop the hold count - * on the PD. The PD is left cached in the pmap if - * the hold count drops to 0 so the wire count remains - * intact. + * Remove a PT page from the PD and drop the wire count + * on the PD. The PD has a wire_count just from being + * mapped so the wire_count should never drop to 0 here. */ vm_page_t m2; pd_entry_t *pd; @@ -1512,17 +1502,18 @@ pmap_release_free_page(struct pmap *pmap, vm_page_t p) pd = (void *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(m2)); idx = p->pindex % NPDEPG; pd[idx] = 0; - m2->hold_count--; + if (vm_page_unwire_quick(m2)) + panic("pmap_release_free_page: m2 wire_count 1->0"); } /* - * One fewer mappings in the pmap. p's hold count had better - * be zero. + * p's wire_count should be transitioning from 1 to 0 here. */ + KKASSERT(p->wire_count == 1); + KKASSERT(p->flags & PG_UNMANAGED); KKASSERT(pmap->pm_stats.resident_count > 0); + vm_page_flag_clear(p, PG_MAPPED | PG_WRITEABLE); --pmap->pm_stats.resident_count; - if (p->hold_count) - panic("pmap_release: freeing held page table page"); if (pmap->pm_ptphint && (pmap->pm_ptphint->pindex == p->pindex)) pmap->pm_ptphint = NULL; @@ -1536,9 +1527,8 @@ pmap_release_free_page(struct pmap *pmap, vm_page_t p) vm_page_flag_set(p, PG_ZERO); vm_page_wakeup(p); } else { - p->wire_count--; + vm_page_unwire(p, 0); KKASSERT(p->wire_count == 0); - vmstats.v_wire_count--; /* JG eventually revert to using vm_page_free_zero() */ vm_page_free(p); } @@ -1548,6 +1538,11 @@ pmap_release_free_page(struct pmap *pmap, vm_page_t p) /* * This routine is called when various levels in the page table need to * be populated. This routine cannot fail. + * + * We returned a page wired for the caller. If we had to map the page into + * a parent page table it will receive an additional wire_count. For example, + * an empty page table directory which is still mapped into its pdp will + * retain a wire_count of 1. */ static vm_page_t @@ -1568,8 +1563,11 @@ _pmap_allocpte(pmap_t pmap, vm_pindex_t ptepindex) * don't want to zero-out a raced page as this would desynchronize * the pv_entry's for the related pte's and cause pmap_remove_all() * to panic. 
+ * + * Page table pages are unmanaged (do not use the normal PQ_s) */ if (m->valid == 0) { + vm_page_unmanage(m); if ((m->flags & PG_ZERO) == 0) { pmap_zero_page(VM_PAGE_TO_PHYS(m)); } @@ -1591,12 +1589,10 @@ _pmap_allocpte(pmap_t pmap, vm_pindex_t ptepindex) ("_pmap_allocpte: %p->queue != PQ_NONE", m)); /* - * Increment the hold count for the page we will be returning to + * Increment the wire_count for the page we will be returning to * the caller. */ - m->hold_count++; - if (m->wire_count++ == 0) - vmstats.v_wire_count++; + vm_page_wire(m); /* * Map the pagetable page into the process address space, if @@ -1608,20 +1604,23 @@ _pmap_allocpte(pmap_t pmap, vm_pindex_t ptepindex) */ if (ptepindex >= (NUPDE + NUPDPE)) { /* - * Wire up a new PDP page in the PML4 + * Wire up a new PDP page in the PML4. + * + * (m) is busied so we cannot race another thread trying + * to map the PDP entry in the PML4. */ vm_pindex_t pml4index; pml4_entry_t *pml4; pml4index = ptepindex - (NUPDE + NUPDPE); pml4 = &pmap->pm_pml4[pml4index]; - if (*pml4 & PG_V) { - if (--m->wire_count == 0) - --vmstats.v_wire_count; - vm_page_wakeup(m); - return(m); + if ((*pml4 & PG_V) == 0) { + *pml4 = VM_PAGE_TO_PHYS(m) | (PG_U | PG_RW | PG_V | + PG_A | PG_M); + ++pmap->pm_stats.resident_count; + vm_page_wire_quick(m); /* wire for mapping */ } - *pml4 = VM_PAGE_TO_PHYS(m) | PG_U | PG_RW | PG_V | PG_A | PG_M; + /* return (m) wired for the caller */ } else if (ptepindex >= NUPDE) { /* * Wire up a new PD page in the PDP @@ -1635,41 +1634,47 @@ _pmap_allocpte(pmap_t pmap, vm_pindex_t ptepindex) pdpindex = ptepindex - NUPDE; pml4index = pdpindex >> NPML4EPGSHIFT; + /* + * Once mapped the PDP is not unmapped during normal operation + * so we only need to handle races in the unmapped case. + * + * Mapping a PD into the PDP requires an additional wiring + * of the PDP. + */ pml4 = &pmap->pm_pml4[pml4index]; if ((*pml4 & PG_V) == 0) { - /* - * Have to allocate a new PDP page, recurse. - * This always succeeds. Returned page will - * be held. - */ pdppg = _pmap_allocpte(pmap, NUPDE + NUPDPE + pml4index); + /* pdppg wired for the map and also wired for return */ } else { - /* - * Add a held reference to the PDP page. - */ pdppg = PHYS_TO_VM_PAGE(*pml4 & PG_FRAME); - pdppg->hold_count++; + vm_page_wire_quick(pdppg); } + /* we have an extra ref on pdppg now for our use */ /* - * Now find the pdp_entry and map the PDP. If the PDP - * has already been mapped unwind and return the - * already-mapped PDP held. + * Now find the PD entry in the PDP and map it. * - * pdppg is left held (hold_count is incremented for - * each PD in the PDP). + * (m) is busied so we cannot race another thread trying + * to map the PD entry in the PDP. + * + * If the PD entry is already mapped we have to drop one + * wire count on the pdppg that we had bumped above. 
*/ pdp = (pdp_entry_t *)PHYS_TO_DMAP(*pml4 & PG_FRAME); pdp = &pdp[pdpindex & ((1ul << NPDPEPGSHIFT) - 1)]; - if (*pdp & PG_V) { - vm_page_unhold(pdppg); - if (--m->wire_count == 0) - --vmstats.v_wire_count; - vm_page_wakeup(m); - return(m); + + if ((*pdp & PG_V) == 0) { + *pdp = VM_PAGE_TO_PHYS(m) | (PG_U | PG_RW | PG_V | + PG_A | PG_M); + vm_page_wire_quick(m); /* wire for mapping */ + ++pmap->pm_stats.resident_count; + /* eat extra pdppg wiring for mapping */ + } else { + if (vm_page_unwire_quick(pdppg)) + panic("pmap_allocpte: unwire case 1"); } - *pdp = VM_PAGE_TO_PHYS(m) | PG_U | PG_RW | PG_V | PG_A | PG_M; + /* return (m) wired for the caller */ } else { /* * Wire up the new PT page in the PD @@ -1679,53 +1684,68 @@ _pmap_allocpte(pmap_t pmap, vm_pindex_t ptepindex) pml4_entry_t *pml4; pdp_entry_t *pdp; pd_entry_t *pd; + vm_page_t pdppg; vm_page_t pdpg; pdpindex = ptepindex >> NPDPEPGSHIFT; pml4index = pdpindex >> NPML4EPGSHIFT; /* - * Locate the PDP page in the PML4, then the PD page in - * the PDP. If either does not exist we simply recurse - * to allocate them. + * Locate the PDP page in the PML4 * - * We can just recurse on the PD page as it will recurse - * on the PDP if necessary. + * Once mapped the PDP is not unmapped during normal operation + * so we only need to handle races in the unmapped case. */ pml4 = &pmap->pm_pml4[pml4index]; if ((*pml4 & PG_V) == 0) { + pdppg = _pmap_allocpte(pmap, NUPDE + pdpindex); + } else { + pdppg = PHYS_TO_VM_PAGE(*pml4 & PG_FRAME); + vm_page_wire_quick(pdppg); + } + /* we have an extra ref on pdppg now for our use */ + + /* + * Locate the PD page in the PDP + * + * Once mapped the PDP is not unmapped during normal operation + * so we only need to handle races in the unmapped case. + * + * We can scrap the extra reference on pdppg not needed if + * *pdp is already mapped and also not needed if it wasn't + * because the _pmap_allocpte() picked up the case for us. + */ + pdp = (pdp_entry_t *)PHYS_TO_DMAP(*pml4 & PG_FRAME); + pdp = &pdp[pdpindex & ((1ul << NPDPEPGSHIFT) - 1)]; + + if ((*pdp & PG_V) == 0) { pdpg = _pmap_allocpte(pmap, NUPDE + pdpindex); - pdp = (pdp_entry_t *)PHYS_TO_DMAP(*pml4 & PG_FRAME); - pdp = &pdp[pdpindex & ((1ul << NPDPEPGSHIFT) - 1)]; } else { - pdp = (pdp_entry_t *)PHYS_TO_DMAP(*pml4 & PG_FRAME); - pdp = &pdp[pdpindex & ((1ul << NPDPEPGSHIFT) - 1)]; - if ((*pdp & PG_V) == 0) { - pdpg = _pmap_allocpte(pmap, NUPDE + pdpindex); - } else { - pdpg = PHYS_TO_VM_PAGE(*pdp & PG_FRAME); - pdpg->hold_count++; - } + pdpg = PHYS_TO_VM_PAGE(*pdp & PG_FRAME); + vm_page_wire_quick(pdpg); } + vm_page_unwire_quick(pdppg); + /* we have an extra ref on pdpg now for our use */ /* - * Now fill in the pte in the PD. If the pte already exists - * (again, if we raced the grab), unhold pdpg and unwire - * m, returning a held m. + * Locate the PT page in the PD. * - * pdpg is left held (hold_count is incremented for - * each PT in the PD). + * (m) is busied so we cannot race another thread trying + * to map the PT page in the PD. 
*/ pd = (pd_entry_t *)PHYS_TO_DMAP(*pdp & PG_FRAME); pd = &pd[ptepindex & ((1ul << NPDEPGSHIFT) - 1)]; - if (*pd != 0) { - vm_page_unhold(pdpg); - if (--m->wire_count == 0) - --vmstats.v_wire_count; - vm_page_wakeup(m); - return(m); + if ((*pd & PG_V) == 0) { + *pd = VM_PAGE_TO_PHYS(m) | (PG_U | PG_RW | PG_V | + PG_A | PG_M); + ++pmap->pm_stats.resident_count; + vm_page_wire_quick(m); /* wire for mapping */ + /* eat extra pdpg wiring for mapping */ + } else { + if (vm_page_unwire_quick(pdpg)) + panic("pmap_allocpte: unwire case 2"); } - *pd = VM_PAGE_TO_PHYS(m) | PG_U | PG_RW | PG_V | PG_A | PG_M; + /* return (m) wired for the caller */ } /* @@ -1733,7 +1753,6 @@ _pmap_allocpte(pmap_t pmap, vm_pindex_t ptepindex) * valid bits, mapped flag, unbusy, and we're done. */ pmap->pm_ptphint = m; - ++pmap->pm_stats.resident_count; #if 0 m->valid = VM_PAGE_BITS_ALL; @@ -1753,6 +1772,8 @@ pmap_allocpte(pmap_t pmap, vm_offset_t va) pd_entry_t *pd; vm_page_t m; + ASSERT_LWKT_TOKEN_HELD(vm_object_token(pmap->pm_pteobj)); + /* * Calculate pagetable page index */ @@ -1777,13 +1798,13 @@ pmap_allocpte(pmap_t pmap, vm_offset_t va) /* * If the page table page is mapped, we just increment the - * hold count, and activate it. + * wire count, and activate it. */ if (pd != NULL && (*pd & PG_V) != 0) { - /* YYY hint is used here on i386 */ - m = pmap_page_lookup( pmap->pm_pteobj, ptepindex); + m = pmap_page_lookup(pmap->pm_pteobj, ptepindex); pmap->pm_ptphint = m; - m->hold_count++; + vm_page_wire_quick(m); + vm_page_wakeup(m); return m; } /* @@ -1801,9 +1822,21 @@ pmap_allocpte(pmap_t pmap, vm_offset_t va) * Release any resources held by the given physical map. * Called when a pmap initialized by pmap_pinit is being released. * Should only be called if the map contains no valid mappings. 
+ * + * Caller must hold pmap->pm_token */ static int pmap_release_callback(struct vm_page *p, void *data); +static __inline +void +pmap_auto_yield(struct rb_vm_page_scan_info *info) +{ + if (++info->desired >= pmap_yield_count) { + info->desired = 0; + lwkt_yield(); + } +} + void pmap_release(struct pmap *pmap) { @@ -1819,10 +1852,13 @@ pmap_release(struct pmap *pmap) info.pmap = pmap; info.object = object; - vm_object_hold(object); - lwkt_gettoken(&vm_token); + + spin_lock(&pmap_spin); TAILQ_REMOVE(&pmap_list, pmap, pm_pmnode); + spin_unlock(&pmap_spin); + info.desired = 0; + vm_object_hold(object); do { info.error = 0; info.mpte = NULL; @@ -1835,8 +1871,10 @@ pmap_release(struct pmap *pmap) info.error = 1; } } while (info.error); - lwkt_reltoken(&vm_token); vm_object_drop(object); + + while (pmap->pm_hold) + tsleep(pmap, 0, "pmapx", 1); } static @@ -1851,10 +1889,12 @@ pmap_release_callback(struct vm_page *p, void *data) } if (!pmap_release_free_page(info->pmap, p)) { info->error = 1; + pmap_auto_yield(info); return(-1); } if (info->object->generation != info->limit) { info->error = 1; + pmap_auto_yield(info); return(-1); } return(0); @@ -1877,7 +1917,7 @@ pmap_growkernel(vm_offset_t kstart, vm_offset_t kend) pdp_entry_t newpdp; int update_kernel_vm_end; - lwkt_gettoken(&vm_token); + vm_object_hold(kptobj); /* * bootstrap kernel_vm_end on first real VM use @@ -1981,7 +2021,7 @@ pmap_growkernel(vm_offset_t kstart, vm_offset_t kend) if (update_kernel_vm_end && kernel_vm_end < kstart) kernel_vm_end = kstart; - lwkt_reltoken(&vm_token); + vm_object_drop(kptobj); } /* @@ -1997,13 +2037,13 @@ pmap_destroy(pmap_t pmap) if (pmap == NULL) return; - lwkt_gettoken(&vm_token); + lwkt_gettoken(&pmap->pm_token); count = --pmap->pm_count; if (count == 0) { - pmap_release(pmap); + pmap_release(pmap); /* eats pm_token */ panic("destroying a pmap is not yet implemented"); } - lwkt_reltoken(&vm_token); + lwkt_reltoken(&pmap->pm_token); } /* @@ -2013,9 +2053,9 @@ void pmap_reference(pmap_t pmap) { if (pmap != NULL) { - lwkt_gettoken(&vm_token); + lwkt_gettoken(&pmap->pm_token); pmap->pm_count++; - lwkt_reltoken(&vm_token); + lwkt_reltoken(&pmap->pm_token); } } @@ -2031,7 +2071,7 @@ static __inline void free_pv_entry(pv_entry_t pv) { - pv_entry_count--; + atomic_add_int(&pv_entry_count, -1); KKASSERT(pv_entry_count >= 0); zfree(pvzone, pv); } @@ -2044,7 +2084,7 @@ static pv_entry_t get_pv_entry(void) { - pv_entry_count++; + atomic_add_int(&pv_entry_count, 1); if (pv_entry_high_water && (pv_entry_count > pv_entry_high_water) && (pmap_pagedaemon_waken == 0)) { @@ -2067,21 +2107,23 @@ pmap_collect(void) if (pmap_pagedaemon_waken == 0) return; - lwkt_gettoken(&vm_token); + pmap_pagedaemon_waken = 0; if (warningdone < 5) { kprintf("pmap_collect: collecting pv entries -- suggest increasing PMAP_SHPGPERPROC\n"); warningdone++; } - for(i = 0; i < vm_page_array_size; i++) { + for (i = 0; i < vm_page_array_size; i++) { m = &vm_page_array[i]; - if (m->wire_count || m->hold_count || m->busy || - (m->flags & PG_BUSY)) + if (m->wire_count || m->hold_count) continue; - pmap_remove_all(m); + if (vm_page_busy_try(m, TRUE) == 0) { + if (m->wire_count == 0 && m->hold_count == 0) { + pmap_remove_all(m); + } + vm_page_wakeup(m); + } } - pmap_pagedaemon_waken = 0; - lwkt_reltoken(&vm_token); } @@ -2092,16 +2134,17 @@ pmap_collect(void) * Otherwise we must search the list for the entry. In either case we * free the now unused entry. 
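/*
 * Illustrative sketch (not part of the patch): how the pmap_yield_count /
 * pmap_auto_yield() throttle above is used from a long page scan.  A
 * callback that wants to yield the cpu periodically simply calls
 * pmap_auto_yield(info), which lwkt_yield()s every pmap_yield_count
 * invocations; pmap_release_callback() does this on its retry paths.
 * The callback body below is hypothetical.
 */
static int
example_scan_callback(struct vm_page *p, void *data)
{
        struct rb_vm_page_scan_info *info = data;

        /* ... per-page work on p ... */
        pmap_auto_yield(info);                  /* yield every N invocations */
        return (0);
}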
* - * Caller must hold vm_token + * Caller must hold pmap->pm_token */ static int pmap_remove_entry(struct pmap *pmap, vm_page_t m, - vm_offset_t va, pmap_inval_info_t info) + vm_offset_t va, pmap_inval_info_t info) { pv_entry_t pv; int rtval; + spin_lock(&pmap_spin); if (m->md.pv_list_count < pmap->pm_stats.resident_count) { TAILQ_FOREACH(pv, &m->md.pv_list, pv_list) { if (pmap == pv->pv_pmap && va == pv->pv_va) @@ -2118,13 +2161,19 @@ pmap_remove_entry(struct pmap *pmap, vm_page_t m, KKASSERT(pv); TAILQ_REMOVE(&m->md.pv_list, pv, pv_list); + m->md.pv_generation++; m->md.pv_list_count--; - m->object->agg_pv_list_count--; + vm_page_spin_lock(m); + if (m->object) + atomic_add_int(&m->object->agg_pv_list_count, -1); + vm_page_spin_unlock(m); KKASSERT(m->md.pv_list_count >= 0); if (TAILQ_EMPTY(&m->md.pv_list)) vm_page_flag_clear(m, PG_MAPPED | PG_WRITEABLE); TAILQ_REMOVE(&pmap->pm_pvlist, pv, pv_plist); ++pmap->pm_generation; + spin_unlock(&pmap_spin); + rtval = pmap_unuse_pt(pmap, va, pv->pv_ptem, info); free_pv_entry(pv); @@ -2134,7 +2183,7 @@ pmap_remove_entry(struct pmap *pmap, vm_page_t m, /* * Create a pv entry for page at pa for (pmap, va). * - * Caller must hold vm_token + * Caller must hold pmap token */ static void @@ -2147,26 +2196,34 @@ pmap_insert_entry(pmap_t pmap, vm_offset_t va, vm_page_t mpte, vm_page_t m) pv->pv_pmap = pmap; pv->pv_ptem = mpte; + spin_lock(&pmap_spin); TAILQ_INSERT_TAIL(&pmap->pm_pvlist, pv, pv_plist); TAILQ_INSERT_TAIL(&m->md.pv_list, pv, pv_list); - ++pmap->pm_generation; + m->md.pv_generation++; m->md.pv_list_count++; - m->object->agg_pv_list_count++; + vm_page_spin_lock(m); + if (m->object) + atomic_add_int(&m->object->agg_pv_list_count, 1); + vm_page_spin_unlock(m); + pmap->pm_generation++; + spin_unlock(&pmap_spin); } /* * pmap_remove_pte: do the things to unmap a page in a process * - * Caller must hold vm_token + * Caller must hold pmap token */ static int pmap_remove_pte(struct pmap *pmap, pt_entry_t *ptq, vm_offset_t va, - pmap_inval_info_t info) + pmap_inval_info_t info) { pt_entry_t oldpte; vm_page_t m; + ASSERT_LWKT_TOKEN_HELD(&pmap->pm_token); + pmap_inval_interlock(info, pmap, va); oldpte = pte_load_clear(ptq); pmap_inval_deinterlock(info, pmap); @@ -2197,9 +2254,12 @@ pmap_remove_pte(struct pmap *pmap, pt_entry_t *ptq, vm_offset_t va, if (oldpte & PG_A) vm_page_flag_set(m, PG_REFERENCED); return pmap_remove_entry(pmap, m, va, info); - } else { + } +/* + else { return pmap_unuse_pt(pmap, va, NULL, info); } +*/ return 0; } @@ -2210,7 +2270,7 @@ pmap_remove_pte(struct pmap *pmap, pt_entry_t *ptq, vm_offset_t va, * This function may not be called from an interrupt if the pmap is * not kernel_pmap. 
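pmap_remove_entry()/pmap_insert_entry() above now mutate the pv lists under pmap_spin, bump m->md.pv_generation and pmap->pm_generation on every change, and defer the blocking work (pmap_unuse_pt(), free_pv_entry()) until the spinlock has been dropped. A small user-space analogue of the writer side of that protocol, with a pthread mutex standing in for the spinlock (list, lock and counter names invented); the matching reader side appears in the pmap_remove_pages()/pmap_clearbit() hunks further down, see the sketch there:

#include <pthread.h>
#include <sys/queue.h>
#include <stdlib.h>

struct pv {
	TAILQ_ENTRY(pv) pv_list;
};
TAILQ_HEAD(pvlist, pv);

static pthread_mutex_t	pv_spin = PTHREAD_MUTEX_INITIALIZER;
static struct pvlist	pv_head = TAILQ_HEAD_INITIALIZER(pv_head);
static unsigned int	pv_generation;

static void
pv_insert(struct pv *pv)
{
	pthread_mutex_lock(&pv_spin);	/* spin_lock(&pmap_spin) analogue */
	TAILQ_INSERT_TAIL(&pv_head, pv, pv_list);
	++pv_generation;		/* every mutation bumps the counter */
	pthread_mutex_unlock(&pv_spin);
}

static void
pv_remove(struct pv *pv)
{
	pthread_mutex_lock(&pv_spin);
	TAILQ_REMOVE(&pv_head, pv, pv_list);
	++pv_generation;
	pthread_mutex_unlock(&pv_spin);
	free(pv);			/* blocking/expensive work unlocked */
}

int
main(void)
{
	struct pv *pv = malloc(sizeof(*pv));

	pv_insert(pv);
	pv_remove(pv);
	return 0;
}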
* - * Caller must hold vm_token + * Caller must hold pmap->pm_token */ static void @@ -2218,6 +2278,8 @@ pmap_remove_page(struct pmap *pmap, vm_offset_t va, pmap_inval_info_t info) { pt_entry_t *pte; + ASSERT_LWKT_TOKEN_HELD(&pmap->pm_token); + pte = pmap_pte(pmap, va); if (pte == NULL) return; @@ -2247,9 +2309,11 @@ pmap_remove(struct pmap *pmap, vm_offset_t sva, vm_offset_t eva) if (pmap == NULL) return; - lwkt_gettoken(&vm_token); + vm_object_hold(pmap->pm_pteobj); + lwkt_gettoken(&pmap->pm_token); if (pmap->pm_stats.resident_count == 0) { - lwkt_reltoken(&vm_token); + lwkt_reltoken(&pmap->pm_token); + vm_object_drop(pmap->pm_pteobj); return; } @@ -2265,7 +2329,8 @@ pmap_remove(struct pmap *pmap, vm_offset_t sva, vm_offset_t eva) if (pde && (*pde & PG_PS) == 0) { pmap_remove_page(pmap, sva, &info); pmap_inval_done(&info); - lwkt_reltoken(&vm_token); + lwkt_reltoken(&pmap->pm_token); + vm_object_drop(pmap->pm_pteobj); return; } } @@ -2335,18 +2400,16 @@ pmap_remove(struct pmap *pmap, vm_offset_t sva, vm_offset_t eva) } } pmap_inval_done(&info); - lwkt_reltoken(&vm_token); + lwkt_reltoken(&pmap->pm_token); + vm_object_drop(pmap->pm_pteobj); } /* - * pmap_remove_all: + * Removes this physical page from all physical maps in which it resides. + * Reflects back modify bits to the pager. * - * Removes this physical page from all physical maps in which it resides. - * Reflects back modify bits to the pager. - * - * This routine may not be called from an interrupt. + * This routine may not be called from an interrupt. */ - static void pmap_remove_all(vm_page_t m) @@ -2354,13 +2417,48 @@ pmap_remove_all(vm_page_t m) struct pmap_inval_info info; pt_entry_t *pte, tpte; pv_entry_t pv; + struct pmap *pmap; if (!pmap_initialized || (m->flags & PG_FICTITIOUS)) return; - lwkt_gettoken(&vm_token); pmap_inval_init(&info); + spin_lock(&pmap_spin); while ((pv = TAILQ_FIRST(&m->md.pv_list)) != NULL) { + /* + * We have to be holding the pmap token to interlock + * the pte destruction and pv removal. XXX need hold on + * pmap. 
+ */ + pmap = pv->pv_pmap; + spin_unlock(&pmap_spin); + lwkt_gettoken(&pmap->pm_token); /* XXX hold race */ + spin_lock(&pmap_spin); + if (pv != TAILQ_FIRST(&m->md.pv_list)) { + spin_unlock(&pmap_spin); + lwkt_reltoken(&pmap->pm_token); + spin_lock(&pmap_spin); + continue; + } + + /* + * Remove the pv + */ + TAILQ_REMOVE(&m->md.pv_list, pv, pv_list); + TAILQ_REMOVE(&pv->pv_pmap->pm_pvlist, pv, pv_plist); + m->md.pv_generation++; + m->md.pv_list_count--; + vm_page_spin_lock(m); + if (m->object) + atomic_add_int(&m->object->agg_pv_list_count, -1); + vm_page_spin_unlock(m); + KKASSERT(m->md.pv_list_count >= 0); + ++pv->pv_pmap->pm_generation; + spin_unlock(&pmap_spin); + + /* + * pv is now isolated + */ KKASSERT(pv->pv_pmap->pm_stats.resident_count > 0); --pv->pv_pmap->pm_stats.resident_count; @@ -2380,28 +2478,29 @@ pmap_remove_all(vm_page_t m) if (tpte & PG_M) { #if defined(PMAP_DIAGNOSTIC) if (pmap_nw_modified(tpte)) { - kprintf( - "pmap_remove_all: modified page not writable: va: 0x%lx, pte: 0x%lx\n", - pv->pv_va, tpte); + kprintf("pmap_remove_all: modified page not " + "writable: va: 0x%lx, pte: 0x%lx\n", + pv->pv_va, tpte); } #endif if (pmap_track_modified(pv->pv_va)) - vm_page_dirty(m); + vm_page_dirty(m); /* XXX races(m) */ } - TAILQ_REMOVE(&m->md.pv_list, pv, pv_list); - TAILQ_REMOVE(&pv->pv_pmap->pm_pvlist, pv, pv_plist); - ++pv->pv_pmap->pm_generation; - m->md.pv_list_count--; - m->object->agg_pv_list_count--; - KKASSERT(m->md.pv_list_count >= 0); + + spin_lock(&pmap_spin); if (TAILQ_EMPTY(&m->md.pv_list)) vm_page_flag_clear(m, PG_MAPPED | PG_WRITEABLE); + spin_unlock(&pmap_spin); + pmap_unuse_pt(pv->pv_pmap, pv->pv_va, pv->pv_ptem, &info); + lwkt_reltoken(&pv->pv_pmap->pm_token); + free_pv_entry(pv); + spin_lock(&pmap_spin); } + spin_unlock(&pmap_spin); KKASSERT((m->flags & (PG_MAPPED|PG_WRITEABLE)) == 0); pmap_inval_done(&info); - lwkt_reltoken(&vm_token); } /* @@ -2436,11 +2535,10 @@ pmap_protect(pmap_t pmap, vm_offset_t sva, vm_offset_t eva, vm_prot_t prot) if (prot & VM_PROT_WRITE) return; - lwkt_gettoken(&vm_token); + lwkt_gettoken(&pmap->pm_token); pmap_inval_init(&info); for (; sva < eva; sva = va_next) { - pml4e = pmap_pml4e(pmap, sva); if ((*pml4e & PG_V) == 0) { va_next = (sva + NBPML4) & ~PML4MASK; @@ -2527,7 +2625,7 @@ again: } } pmap_inval_done(&info); - lwkt_reltoken(&vm_token); + lwkt_reltoken(&pmap->pm_token); } /* @@ -2577,7 +2675,8 @@ pmap_enter(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot, #endif } - lwkt_gettoken(&vm_token); + vm_object_hold(pmap->pm_pteobj); + lwkt_gettoken(&pmap->pm_token); /* * In the case that a page table page is not @@ -2588,14 +2687,16 @@ pmap_enter(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot, else mpte = NULL; - pmap_inval_init(&info); + if ((prot & VM_PROT_NOSYNC) == 0) + pmap_inval_init(&info); pde = pmap_pde(pmap, va); if (pde != NULL && (*pde & PG_V) != 0) { if ((*pde & PG_PS) != 0) panic("pmap_enter: attempted pmap_enter on 2MB page"); pte = pmap_pde_to_pte(pde, va); - } else + } else { panic("pmap_enter: invalid page directory va=%#lx", va); + } KKASSERT(pte != NULL); pa = VM_PAGE_TO_PHYS(m); @@ -2632,7 +2733,7 @@ pmap_enter(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot, * bits below. */ if (mpte) - mpte->hold_count--; + vm_page_unwire_quick(mpte); /* * We might be turning off write access to the page, @@ -2670,6 +2771,9 @@ pmap_enter(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot, * Enter on the PV list if part of our managed memory. 
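The pmap_remove_all() loop above cannot block on a pmap token while holding pmap_spin, so it drops the spinlock, takes the token, retakes the spinlock, and then revalidates that the pv it picked is still at the head of the list before isolating it. A condensed model of that drop/relock/revalidate retry, with mutexes standing in for the spinlock and token (names invented):

#include <pthread.h>
#include <sys/queue.h>

struct pv {
	TAILQ_ENTRY(pv)	pv_list;
	pthread_mutex_t	*pv_token;	/* the owning pmap's "token" */
};
TAILQ_HEAD(pvlist, pv);

static pthread_mutex_t	list_spin = PTHREAD_MUTEX_INITIALIZER;

/*
 * Pop the head of the list.  Returns the pv with its owner's token held,
 * or NULL if the list is empty.
 */
static struct pv *
grab_head(struct pvlist *head)
{
	struct pv *pv;

	pthread_mutex_lock(&list_spin);
	while ((pv = TAILQ_FIRST(head)) != NULL) {
		pthread_mutex_t *tok = pv->pv_token;

		/* cannot block on tok while holding the spin lock */
		pthread_mutex_unlock(&list_spin);
		pthread_mutex_lock(tok);	/* lwkt_gettoken() analogue */
		pthread_mutex_lock(&list_spin);

		if (pv == TAILQ_FIRST(head)) {	/* still the head: keep it */
			TAILQ_REMOVE(head, pv, pv_list);
			break;
		}
		/* raced against another remover; drop the token and retry */
		pthread_mutex_unlock(&list_spin);
		pthread_mutex_unlock(tok);
		pthread_mutex_lock(&list_spin);
	}
	pthread_mutex_unlock(&list_spin);
	return pv;
}

int
main(void)
{
	static pthread_mutex_t tok = PTHREAD_MUTEX_INITIALIZER;
	struct pvlist head = TAILQ_HEAD_INITIALIZER(head);
	struct pv pv = { .pv_token = &tok };
	struct pv *got;

	TAILQ_INSERT_TAIL(&head, &pv, pv_list);
	got = grab_head(&head);
	if (got != NULL)
		pthread_mutex_unlock(got->pv_token);
	return 0;
}

The real code still flags a remaining hold race on the pmap itself (the XXX comments in the hunk); the sketch sidesteps that by assuming the token object stays valid.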
Note that we * raise IPL while manipulating pv_table since pmap_enter can be * called at interrupt time. + * + * The new mapping covers mpte's new wiring count so we don't + * unwire it. */ if (pmap_initialized && (m->flags & (PG_FICTITIOUS|PG_UNMANAGED)) == 0) { @@ -2703,15 +2807,21 @@ validate: * to update the pte. */ if ((origpte & ~(PG_M|PG_A)) != newpte) { - pmap_inval_interlock(&info, pmap, va); + if ((prot & VM_PROT_NOSYNC) == 0) + pmap_inval_interlock(&info, pmap, va); *pte = newpte | PG_A; - pmap_inval_deinterlock(&info, pmap); + if (prot & VM_PROT_NOSYNC) + cpu_invlpg((void *)va); + else + pmap_inval_deinterlock(&info, pmap); if (newpte & PG_RW) vm_page_flag_set(m, PG_WRITEABLE); } KKASSERT((newpte & PG_MANAGED) == 0 || (m->flags & PG_MAPPED)); - pmap_inval_done(&info); - lwkt_reltoken(&vm_token); + if ((prot & VM_PROT_NOSYNC) == 0) + pmap_inval_done(&info); + lwkt_reltoken(&pmap->pm_token); + vm_object_drop(pmap->pm_pteobj); } /* @@ -2727,21 +2837,22 @@ pmap_enter_quick(pmap_t pmap, vm_offset_t va, vm_page_t m) pt_entry_t *pte; vm_paddr_t pa; vm_page_t mpte; - vm_pindex_t ptepindex; - pd_entry_t *ptepa; pmap_inval_info info; - lwkt_gettoken(&vm_token); + lwkt_gettoken(&pmap->pm_token); + vm_object_hold(pmap->pm_pteobj); pmap_inval_init(&info); if (va < UPT_MAX_ADDRESS && pmap == &kernel_pmap) { - kprintf("Warning: pmap_enter_quick called on UVA with kernel_pmap\n"); + kprintf("Warning: pmap_enter_quick called on UVA with" + "kernel_pmap\n"); #ifdef DDB db_print_backtrace(); #endif } if (va >= UPT_MAX_ADDRESS && pmap != &kernel_pmap) { - kprintf("Warning: pmap_enter_quick called on KVA without kernel_pmap\n"); + kprintf("Warning: pmap_enter_quick called on KVA without" + "kernel_pmap\n"); #ifdef DDB db_print_backtrace(); #endif @@ -2752,41 +2863,11 @@ pmap_enter_quick(pmap_t pmap, vm_offset_t va, vm_page_t m) /* * Calculate the page table page (mpte), allocating it if necessary. * - * A held page table page (mpte), or NULL, is passed onto the + * A wired page table page (mpte), or NULL, is passed onto the * section following. */ if (va < VM_MAX_USER_ADDRESS) { - /* - * Calculate pagetable page index - */ - ptepindex = pmap_pde_pindex(va); - - do { - /* - * Get the page directory entry - */ - ptepa = pmap_pde(pmap, va); - - /* - * If the page table page is mapped, we just increment - * the hold count, and activate it. - */ - if (ptepa && (*ptepa & PG_V) != 0) { - if (*ptepa & PG_PS) - panic("pmap_enter_quick: unexpected mapping into 2MB page"); -// if (pmap->pm_ptphint && -// (pmap->pm_ptphint->pindex == ptepindex)) { -// mpte = pmap->pm_ptphint; -// } else { - mpte = pmap_page_lookup( pmap->pm_pteobj, ptepindex); - pmap->pm_ptphint = mpte; -// } - if (mpte) - mpte->hold_count++; - } else { - mpte = _pmap_allocpte(pmap, ptepindex); - } - } while (mpte == NULL); + mpte = pmap_allocpte(pmap, va); } else { mpte = NULL; /* this code path is not yet used */ @@ -2799,17 +2880,21 @@ pmap_enter_quick(pmap_t pmap, vm_offset_t va, vm_page_t m) */ pte = vtopte(va); if (*pte & PG_V) { - if (mpte) - pmap_unwire_pte_hold(pmap, va, mpte, &info); pa = VM_PAGE_TO_PHYS(m); KKASSERT(((*pte ^ pa) & PG_FRAME) == 0); pmap_inval_done(&info); - lwkt_reltoken(&vm_token); + if (mpte) + pmap_unwire_pte_hold(pmap, va, mpte, &info); + vm_object_drop(pmap->pm_pteobj); + lwkt_reltoken(&pmap->pm_token); return; } /* - * Enter on the PV list if part of our managed memory + * Enter on the PV list if part of our managed memory. + * + * The new mapping covers mpte's new wiring count so we don't + * unwire it. 
*/ if ((m->flags & (PG_FICTITIOUS|PG_UNMANAGED)) == 0) { pmap_insert_entry(pmap, va, mpte, m); @@ -2832,7 +2917,8 @@ pmap_enter_quick(pmap_t pmap, vm_offset_t va, vm_page_t m) *pte = pa | PG_V | PG_U | PG_MANAGED; /* pmap_inval_add(&info, pmap, va); shouldn't be needed inval->valid */ pmap_inval_done(&info); - lwkt_reltoken(&vm_token); + vm_object_drop(pmap->pm_pteobj); + lwkt_reltoken(&pmap->pm_token); } /* @@ -2910,12 +2996,11 @@ pmap_object_init_pt(pmap_t pmap, vm_offset_t addr, vm_prot_t prot, info.mpte = NULL; info.addr = addr; info.pmap = pmap; + info.desired = 0; vm_object_hold(object); - lwkt_gettoken(&vm_token); vm_page_rb_tree_RB_SCAN(&object->rb_memq, rb_vm_page_scancmp, pmap_object_init_pt_callback, &info); - lwkt_reltoken(&vm_token); vm_object_drop(object); } @@ -2925,6 +3010,7 @@ pmap_object_init_pt_callback(vm_page_t p, void *data) { struct rb_vm_page_scan_info *info = data; vm_pindex_t rel_index; + /* * don't allow an madvise to blow away our really * free pages allocating pv entries. @@ -2933,16 +3019,18 @@ pmap_object_init_pt_callback(vm_page_t p, void *data) vmstats.v_free_count < vmstats.v_free_reserved) { return(-1); } + if (vm_page_busy_try(p, TRUE)) + return 0; if (((p->valid & VM_PAGE_BITS_ALL) == VM_PAGE_BITS_ALL) && - (p->busy == 0) && (p->flags & (PG_BUSY | PG_FICTITIOUS)) == 0) { - vm_page_busy(p); + (p->flags & PG_FICTITIOUS) == 0) { if ((p->queue - p->pc) == PQ_CACHE) vm_page_deactivate(p); rel_index = p->pindex - info->start_pindex; pmap_enter_quick(info->pmap, info->addr + x86_64_ptob(rel_index), p); - vm_page_wakeup(p); } + vm_page_wakeup(p); + pmap_auto_yield(info); return(0); } @@ -2960,7 +3048,7 @@ pmap_prefault_ok(pmap_t pmap, vm_offset_t addr) pd_entry_t *pde; int ret; - lwkt_gettoken(&vm_token); + lwkt_gettoken(&pmap->pm_token); pde = pmap_pde(pmap, addr); if (pde == NULL || *pde == 0) { ret = 0; @@ -2968,7 +3056,7 @@ pmap_prefault_ok(pmap_t pmap, vm_offset_t addr) pte = vtopte(addr); ret = (*pte) ? 0 : 1; } - lwkt_reltoken(&vm_token); + lwkt_reltoken(&pmap->pm_token); return(ret); } @@ -2987,7 +3075,7 @@ pmap_change_wiring(pmap_t pmap, vm_offset_t va, boolean_t wired) if (pmap == NULL) return; - lwkt_gettoken(&vm_token); + lwkt_gettoken(&pmap->pm_token); pte = pmap_pte(pmap, va); if (wired && !pmap_pte_w(pte)) @@ -3013,7 +3101,7 @@ pmap_change_wiring(pmap_t pmap, vm_offset_t va, boolean_t wired) else atomic_clear_long_nonlocked(pte, PG_W); #endif - lwkt_reltoken(&vm_token); + lwkt_reltoken(&pmap->pm_token); } @@ -3056,11 +3144,8 @@ pmap_copy(pmap_t dst_pmap, pmap_t src_pmap, vm_offset_t dst_addr, pmap_inval_add(&info, dst_pmap, -1); pmap_inval_add(&info, src_pmap, -1); - /* - * vm_token section protection is required to maintain the page/object - * associations. 
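pmap_copy() above now takes the source and destination pmap tokens instead of the global vm_token. LWKT tokens resolve contention in the scheduler rather than by sleeping with locks held, so the kernel can simply take both; a plain-mutex analogue has to impose an ordering itself, for example by address. A sketch of that classic pattern (the fake_pmap struct and helpers are invented, not anything in this patch):

#include <pthread.h>
#include <stdint.h>

struct fake_pmap {
	pthread_mutex_t pm_token;
};

static void
pmap_pair_lock(struct fake_pmap *a, struct fake_pmap *b)
{
	if ((uintptr_t)a > (uintptr_t)b) {
		struct fake_pmap *t = a; a = b; b = t;
	}
	pthread_mutex_lock(&a->pm_token);
	if (a != b)
		pthread_mutex_lock(&b->pm_token);
}

static void
pmap_pair_unlock(struct fake_pmap *a, struct fake_pmap *b)
{
	if ((uintptr_t)a > (uintptr_t)b) {
		struct fake_pmap *t = a; a = b; b = t;
	}
	if (a != b)
		pthread_mutex_unlock(&b->pm_token);
	pthread_mutex_unlock(&a->pm_token);
}

int
main(void)
{
	struct fake_pmap src = { PTHREAD_MUTEX_INITIALIZER };
	struct fake_pmap dst = { PTHREAD_MUTEX_INITIALIZER };

	pmap_pair_lock(&src, &dst);
	/* the pte copy loop would run here */
	pmap_pair_unlock(&src, &dst);
	return 0;
}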
- */ - lwkt_gettoken(&vm_token); + lwkt_gettoken(&src_pmap->pm_token); + lwkt_gettoken(&dst_pmap->pm_token); for (addr = src_addr; addr < end_addr; addr = pdnxt) { pt_entry_t *src_pte, *dst_pte; vm_page_t dstmpte, srcmpte; @@ -3098,8 +3183,11 @@ pmap_copy(pmap_t dst_pmap, pmap_t src_pmap, vm_offset_t dst_addr, continue; } + /* + * + */ srcmpte = vm_page_lookup(src_pmap->pm_pteobj, ptepindex); - if ((srcmpte == NULL) || (srcmpte->hold_count == 0) || + if (srcmpte == NULL || srcmpte->wire_count == 1 || (srcmpte->flags & PG_BUSY)) { continue; } @@ -3148,7 +3236,7 @@ pmap_copy(pmap_t dst_pmap, pmap_t src_pmap, vm_offset_t dst_addr, *dst_pte = ptetemp & ~(PG_M | PG_A); ++dst_pmap->pm_stats.resident_count; pmap_insert_entry(dst_pmap, addr, - dstmpte, m); + dstmpte, m); KKASSERT(m->flags & PG_MAPPED); } else { kprintf("WARNING: pmap_copy: dst_pte race detected and corrected\n"); @@ -3165,7 +3253,8 @@ pmap_copy(pmap_t dst_pmap, pmap_t src_pmap, vm_offset_t dst_addr, } } failed: - lwkt_reltoken(&vm_token); + lwkt_reltoken(&dst_pmap->pm_token); + lwkt_reltoken(&src_pmap->pm_token); pmap_inval_done(&info); #endif } @@ -3274,28 +3363,26 @@ pmap_page_exists_quick(pmap_t pmap, vm_page_t m) if (!pmap_initialized || (m->flags & PG_FICTITIOUS)) return FALSE; - lwkt_gettoken(&vm_token); - + spin_lock(&pmap_spin); TAILQ_FOREACH(pv, &m->md.pv_list, pv_list) { if (pv->pv_pmap == pmap) { - lwkt_reltoken(&vm_token); + spin_unlock(&pmap_spin); return TRUE; } loops++; if (loops >= 16) break; } - lwkt_reltoken(&vm_token); + spin_unlock(&pmap_spin); return (FALSE); } /* - * Remove all pages from specified address space - * this aids process exit speeds. Also, this code - * is special cased for current process only, but - * can have the more generic (and slightly slower) - * mode enabled. This is much faster than pmap_remove - * in the case of running down an entire address space. + * Remove all pages from specified address space this aids process exit + * speeds. Also, this code is special cased for current process only, but + * can have the more generic (and slightly slower) mode enabled. This + * is much faster than pmap_remove in the case of running down an entire + * address space. */ void pmap_remove_pages(pmap_t pmap, vm_offset_t sva, vm_offset_t eva) @@ -3304,6 +3391,7 @@ pmap_remove_pages(pmap_t pmap, vm_offset_t sva, vm_offset_t eva) pt_entry_t *pte, tpte; pv_entry_t pv, npv; vm_page_t m; + vm_offset_t va; pmap_inval_info info; int iscurrentpmap; int save_generation; @@ -3314,39 +3402,92 @@ pmap_remove_pages(pmap_t pmap, vm_offset_t sva, vm_offset_t eva) else iscurrentpmap = 0; - lwkt_gettoken(&vm_token); + if (pmap->pm_pteobj) + vm_object_hold(pmap->pm_pteobj); + lwkt_gettoken(&pmap->pm_token); pmap_inval_init(&info); + + spin_lock(&pmap_spin); for (pv = TAILQ_FIRST(&pmap->pm_pvlist); pv; pv = npv) { + /* + * Validate the pv. We have to interlock the address with + * pmap_spin unlocked. + */ if (pv->pv_va >= eva || pv->pv_va < sva) { npv = TAILQ_NEXT(pv, pv_plist); continue; } KKASSERT(pmap == pv->pv_pmap); - if (iscurrentpmap) pte = vtopte(pv->pv_va); else pte = pmap_pte_quick(pmap, pv->pv_va); - pmap_inval_interlock(&info, pmap, pv->pv_va); /* * We cannot remove wired pages from a process' mapping - * at this time + * at this time. This does not require an invaldiation + * interlock as PG_W cannot be set by the MMU. 
*/ if (*pte & PG_W) { - pmap_inval_deinterlock(&info, pmap); npv = TAILQ_NEXT(pv, pv_plist); continue; } + + /* + * Interlock the pte so we can safely remove it + */ + save_generation = pmap->pm_generation; + va = pv->pv_va; + spin_unlock(&pmap_spin); + + pmap_inval_interlock(&info, pmap, va); + + /* + * Restart the scan if the pv list changed out from under us. + */ + spin_lock(&pmap_spin); + if (save_generation != pmap->pm_generation) { + spin_unlock(&pmap_spin); + pmap_inval_deinterlock(&info, pmap); + kprintf("Warning: pmap_remove_pages race-A avoided\n"); + spin_lock(&pmap_spin); + npv = TAILQ_FIRST(&pmap->pm_pvlist); + continue; + } + KKASSERT(pmap == pv->pv_pmap && va == pv->pv_va); + + /* + * Extract the pte and clear its memory + */ tpte = pte_load_clear(pte); KKASSERT(tpte & PG_MANAGED); m = PHYS_TO_VM_PAGE(tpte & PG_FRAME); - KASSERT(m < &vm_page_array[vm_page_array_size], ("pmap_remove_pages: bad tpte %lx", tpte)); + /* + * Remove the entry, set npv + */ + npv = TAILQ_NEXT(pv, pv_plist); + TAILQ_REMOVE(&pmap->pm_pvlist, pv, pv_plist); + m->md.pv_generation++; + m->md.pv_list_count--; + vm_page_spin_lock(m); + if (m->object) + atomic_add_int(&m->object->agg_pv_list_count, -1); + vm_page_spin_unlock(m); + TAILQ_REMOVE(&m->md.pv_list, pv, pv_list); + if (TAILQ_EMPTY(&m->md.pv_list)) + vm_page_flag_clear(m, PG_MAPPED | PG_WRITEABLE); + save_generation = ++pmap->pm_generation; + + spin_unlock(&pmap_spin); + + /* + * Adjust the pmap and cleanup the tpte and related vm_page + */ KKASSERT(pmap->pm_stats.resident_count > 0); --pmap->pm_stats.resident_count; pmap_inval_deinterlock(&info, pmap); @@ -3358,16 +3499,6 @@ pmap_remove_pages(pmap_t pmap, vm_offset_t sva, vm_offset_t eva) vm_page_dirty(m); } - npv = TAILQ_NEXT(pv, pv_plist); - TAILQ_REMOVE(&pmap->pm_pvlist, pv, pv_plist); - save_generation = ++pmap->pm_generation; - - m->md.pv_list_count--; - m->object->agg_pv_list_count--; - TAILQ_REMOVE(&m->md.pv_list, pv, pv_list); - if (TAILQ_EMPTY(&m->md.pv_list)) - vm_page_flag_clear(m, PG_MAPPED | PG_WRITEABLE); - pmap_unuse_pt(pmap, pv->pv_va, pv->pv_ptem, &info); free_pv_entry(pv); @@ -3375,20 +3506,24 @@ pmap_remove_pages(pmap_t pmap, vm_offset_t sva, vm_offset_t eva) * Restart the scan if we blocked during the unuse or free * calls and other removals were made. */ + spin_lock(&pmap_spin); if (save_generation != pmap->pm_generation) { kprintf("Warning: pmap_remove_pages race-A avoided\n"); npv = TAILQ_FIRST(&pmap->pm_pvlist); } } + spin_unlock(&pmap_spin); pmap_inval_done(&info); - lwkt_reltoken(&vm_token); + lwkt_reltoken(&pmap->pm_token); + if (pmap->pm_pteobj) + vm_object_drop(pmap->pm_pteobj); } /* * pmap_testbit tests bits in pte's note that the testbit/clearbit * routines are inline, and a lot of things compile-time evaluate. 
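pmap_remove_pages() above (and pmap_clearbit() just below) snapshot the generation counter, copy whatever the invalidation interlock needs, drop pmap_spin for the blocking call, and restart the scan from the head if the generation moved in the meantime. A simplified reader-side model of that save/compare/restart pattern (names invented; the real code also prints the "race-A avoided" warning on restart):

#include <pthread.h>
#include <sys/queue.h>
#include <stdio.h>

struct pv {
	TAILQ_ENTRY(pv)	pv_list;
	int		pv_va;		/* value the blocking call needs */
};
TAILQ_HEAD(pvlist, pv);

static pthread_mutex_t	pv_spin = PTHREAD_MUTEX_INITIALIZER;
static unsigned int	pv_generation;

static void
blocking_interlock(int va)
{
	/* stands in for pmap_inval_interlock(); may sleep */
	printf("interlock va=%d\n", va);
}

static void
scan(struct pvlist *head)
{
	struct pv *pv, *npv;
	unsigned int save;
	int va;

	pthread_mutex_lock(&pv_spin);
	for (pv = TAILQ_FIRST(head); pv; pv = npv) {
		save = pv_generation;
		va = pv->pv_va;			/* copy before unlocking */
		pthread_mutex_unlock(&pv_spin);
		blocking_interlock(va);
		pthread_mutex_lock(&pv_spin);
		if (save != pv_generation) {
			/* list changed underneath us: restart from head */
			npv = TAILQ_FIRST(head);
			continue;
		}
		/* generation unchanged, pv is still safe to use */
		npv = TAILQ_NEXT(pv, pv_list);
	}
	pthread_mutex_unlock(&pv_spin);
}

int
main(void)
{
	struct pvlist head = TAILQ_HEAD_INITIALIZER(head);
	struct pv a = { .pv_va = 1 }, b = { .pv_va = 2 };

	TAILQ_INSERT_TAIL(&head, &a, pv_list);
	TAILQ_INSERT_TAIL(&head, &b, pv_list);
	scan(&head);
	return 0;
}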
* - * Caller must hold vm_token + * Caller must hold pmap_spin */ static boolean_t @@ -3428,19 +3563,27 @@ pmap_testbit(vm_page_t m, int bit) } /* - * this routine is used to modify bits in ptes + * This routine is used to modify bits in ptes + * + * Caller must NOT hold pmap_spin */ static __inline void pmap_clearbit(vm_page_t m, int bit) { struct pmap_inval_info info; + int save_generation; + vm_offset_t save_va; + struct pmap *save_pmap; pv_entry_t pv; pt_entry_t *pte; pt_entry_t pbits; - if (!pmap_initialized || (m->flags & PG_FICTITIOUS)) + if (bit == PG_RW) + vm_page_flag_clear(m, PG_WRITEABLE); + if (!pmap_initialized || (m->flags & PG_FICTITIOUS)) { return; + } pmap_inval_init(&info); @@ -3448,6 +3591,8 @@ pmap_clearbit(vm_page_t m, int bit) * Loop over all current mappings setting/clearing as appropos If * setting RO do we need to clear the VAC? */ + spin_lock(&pmap_spin); +restart: TAILQ_FOREACH(pv, &m->md.pv_list, pv_list) { /* * don't write protect pager mappings @@ -3473,9 +3618,20 @@ pmap_clearbit(vm_page_t m, int bit) * PG_M even for PTEs generated via virtual memory maps, * because the virtual kernel will invalidate the pmap * entry when/if it needs to resynchronize the Modify bit. + * + * We have to restart our scan if m->md.pv_generation changes + * on us. */ - if (bit & PG_RW) - pmap_inval_interlock(&info, pv->pv_pmap, pv->pv_va); + if (bit & PG_RW) { + save_generation = m->md.pv_generation; + save_pmap = pv->pv_pmap; + save_va = pv->pv_va; + spin_unlock(&pmap_spin); + pmap_inval_interlock(&info, save_pmap, save_va); + spin_lock(&pmap_spin); + if (save_generation != m->md.pv_generation) + goto restart; + } pte = pmap_pte_quick(pv->pv_pmap, pv->pv_va); again: pbits = *pte; @@ -3508,30 +3664,39 @@ again: atomic_clear_long(pte, bit); } } - if (bit & PG_RW) - pmap_inval_deinterlock(&info, pv->pv_pmap); + if (bit & PG_RW) { + save_generation = m->md.pv_generation; + save_pmap = pv->pv_pmap; + spin_unlock(&pmap_spin); + pmap_inval_deinterlock(&info, save_pmap); + spin_lock(&pmap_spin); + if (save_generation != m->md.pv_generation) + goto restart; + } } + spin_unlock(&pmap_spin); pmap_inval_done(&info); } /* - * pmap_page_protect: + * Lower the permission for all mappings to a given page. * - * Lower the permission for all mappings to a given page. + * Page must be busied by caller. */ void pmap_page_protect(vm_page_t m, vm_prot_t prot) { /* JG NX support? */ if ((prot & VM_PROT_WRITE) == 0) { - lwkt_gettoken(&vm_token); if (prot & (VM_PROT_READ | VM_PROT_EXECUTE)) { + /* + * NOTE: pmap_clearbit(.. PG_RW) also clears + * the PG_WRITEABLE flag in (m). + */ pmap_clearbit(m, PG_RW); - vm_page_flag_clear(m, PG_WRITEABLE); } else { pmap_remove_all(m); } - lwkt_reltoken(&vm_token); } } @@ -3542,16 +3707,16 @@ pmap_phys_address(vm_pindex_t ppn) } /* - * pmap_ts_referenced: + * Return a count of reference bits for a page, clearing those bits. + * It is not necessary for every reference bit to be cleared, but it + * is necessary that 0 only be returned when there are truly no + * reference bits set. * - * Return a count of reference bits for a page, clearing those bits. - * It is not necessary for every reference bit to be cleared, but it - * is necessary that 0 only be returned when there are truly no - * reference bits set. + * XXX: The exact number of bits to check and clear is a matter that + * should be tested and standardized at some point in the future for + * optimal aging of shared pages. 
* - * XXX: The exact number of bits to check and clear is a matter that - * should be tested and standardized at some point in the future for - * optimal aging of shared pages. + * This routine may not block. */ int pmap_ts_referenced(vm_page_t m) @@ -3563,8 +3728,7 @@ pmap_ts_referenced(vm_page_t m) if (!pmap_initialized || (m->flags & PG_FICTITIOUS)) return (rtval); - lwkt_gettoken(&vm_token); - + spin_lock(&pmap_spin); if ((pv = TAILQ_FIRST(&m->md.pv_list)) != NULL) { pvf = pv; do { @@ -3572,6 +3736,7 @@ pmap_ts_referenced(vm_page_t m) TAILQ_REMOVE(&m->md.pv_list, pv, pv_list); TAILQ_INSERT_TAIL(&m->md.pv_list, pv, pv_list); + /*++pv->pv_pmap->pm_generation; not needed */ if (!pmap_track_modified(pv->pv_va)) continue; @@ -3591,7 +3756,7 @@ pmap_ts_referenced(vm_page_t m) } } while ((pv = pvn) != NULL && pv != pvf); } - lwkt_reltoken(&vm_token); + spin_unlock(&pmap_spin); return (rtval); } @@ -3607,9 +3772,9 @@ pmap_is_modified(vm_page_t m) { boolean_t res; - lwkt_gettoken(&vm_token); + spin_lock(&pmap_spin); res = pmap_testbit(m, PG_M); - lwkt_reltoken(&vm_token); + spin_unlock(&pmap_spin); return (res); } @@ -3619,9 +3784,7 @@ pmap_is_modified(vm_page_t m) void pmap_clear_modify(vm_page_t m) { - lwkt_gettoken(&vm_token); pmap_clearbit(m, PG_M); - lwkt_reltoken(&vm_token); } /* @@ -3632,9 +3795,7 @@ pmap_clear_modify(vm_page_t m) void pmap_clear_reference(vm_page_t m) { - lwkt_gettoken(&vm_token); pmap_clearbit(m, PG_A); - lwkt_reltoken(&vm_token); } /* @@ -3756,7 +3917,7 @@ pmap_mincore(pmap_t pmap, vm_offset_t addr) vm_page_t m; int val = 0; - lwkt_gettoken(&vm_token); + lwkt_gettoken(&pmap->pm_token); ptep = pmap_pte(pmap, addr); if (ptep && (pte = *ptep) != 0) { @@ -3795,7 +3956,8 @@ pmap_mincore(pmap_t pmap, vm_offset_t addr) } } done: - lwkt_reltoken(&vm_token); + lwkt_reltoken(&pmap->pm_token); + return val; } @@ -3806,7 +3968,7 @@ done: * The vmspace for all lwps associated with the process will be adjusted * and cr3 will be reloaded if any lwp is the current lwp. * - * Caller must hold vmspace_token + * The process must hold the vmspace->vm_map.token for oldvm and newvm */ void pmap_replacevm(struct proc *p, struct vmspace *newvm, int adjrefs) @@ -3832,7 +3994,7 @@ pmap_replacevm(struct proc *p, struct vmspace *newvm, int adjrefs) * same as the process vmspace, but virtual kernels need to swap out contexts * on a per-lwp basis. * - * Caller does not necessarily hold vmspace_token. Caller must control + * Caller does not necessarily hold any vmspace tokens. Caller must control * the lwp (typically be in the context of the lwp). We use a critical * section to protect against statclock and hardclock (statistics collection). 
*/ @@ -3890,14 +4052,14 @@ pmap_interlock_wait(struct vmspace *vm) struct pmap *pmap = &vm->vm_pmap; if (pmap->pm_active & CPUMASK_LOCK) { - DEBUG_PUSH_INFO("pmap_interlock_wait"); crit_enter(); + DEBUG_PUSH_INFO("pmap_interlock_wait"); while (pmap->pm_active & CPUMASK_LOCK) { cpu_ccfence(); lwkt_process_ipiq(); } - crit_exit(); DEBUG_POP_INFO(); + crit_exit(); } } diff --git a/sys/platform/vkernel/conf/files b/sys/platform/vkernel/conf/files index 903eb20699..13b3101fb4 100644 --- a/sys/platform/vkernel/conf/files +++ b/sys/platform/vkernel/conf/files @@ -40,6 +40,7 @@ platform/vkernel/i386/mp.c optional smp \ cpu/i386/misc/elf_machdep.c standard cpu/i386/misc/lwbuf.c standard cpu/i386/misc/in_cksum2.s optional inet +cpu/i386/misc/monitor.s standard cpu/i386/misc/ktr.c optional ktr cpu/i386/misc/db_disasm.c optional ddb cpu/i386/misc/i386-gdbstub.c optional ddb diff --git a/sys/platform/vkernel/i386/cpu_regs.c b/sys/platform/vkernel/i386/cpu_regs.c index 0c2afd2ea9..2dc5ba047c 100644 --- a/sys/platform/vkernel/i386/cpu_regs.c +++ b/sys/platform/vkernel/i386/cpu_regs.c @@ -683,9 +683,7 @@ fetchupcall (struct vmupcall *vu, int morepending, void *rsp) * critical section. * * Note on cpu_idle_hlt: On an SMP system we rely on a scheduler IPI - * to wake a HLTed cpu up. However, there are cases where the idlethread - * will be entered with the possibility that no IPI will occur and in such - * cases lwkt_switch() sets RQF_WAKEUP. We nominally check RQF_IDLECHEK_MASK. + * to wake a HLTed cpu up. */ static int cpu_idle_hlt = 1; static int cpu_idle_hltcnt; diff --git a/sys/platform/vkernel/i386/mp.c b/sys/platform/vkernel/i386/mp.c index 931a04e5b1..a1e459f899 100644 --- a/sys/platform/vkernel/i386/mp.c +++ b/sys/platform/vkernel/i386/mp.c @@ -390,6 +390,7 @@ start_all_aps(u_int boot_addr) */ ap_tids[0] = pthread_self(); + vm_object_hold(&kernel_object); for (x = 1; x <= mp_naps; x++) { /* Allocate space for the CPU's private space. 
*/ @@ -450,6 +451,7 @@ start_all_aps(u_int boot_addr) DELAY(1000); } } + vm_object_drop(&kernel_object); return(ncpus - 1); } diff --git a/sys/platform/vkernel/include/pmap.h b/sys/platform/vkernel/include/pmap.h index 51f83403bd..20974f0fcc 100644 --- a/sys/platform/vkernel/include/pmap.h +++ b/sys/platform/vkernel/include/pmap.h @@ -73,6 +73,12 @@ #ifndef _SYS_QUEUE_H_ #include #endif +#ifndef _SYS_SPINLOCK_H_ +#include +#endif +#ifndef _SYS_THREAD_H_ +#include +#endif #ifndef _SYS_VKERNEL_H_ #include #endif @@ -126,12 +132,15 @@ struct pmap { cpumask_t pm_cpucachemask;/* Invalidate cpu mappings */ TAILQ_ENTRY(pmap) pm_pmnode; /* list of pmaps */ TAILQ_HEAD(,pv_entry) pm_pvlist; /* list of mappings in pmap */ + TAILQ_HEAD(,pv_entry) pm_pvlist_free; /* free mappings */ int pm_count; /* reference count */ cpumask_t pm_active; /* active on cpus */ int pm_pdindex; /* page dir page in obj */ struct pmap_statistics pm_stats; /* pmap statistics */ struct vm_page *pm_ptphint; /* pmap ptp hint */ int pm_generation; /* detect pvlist deletions */ + struct spinlock pm_spin; + struct lwkt_token pm_token; }; #define CPUMASK_LOCK CPUMASK(SMP_MAXCPU) @@ -168,6 +177,7 @@ extern vm_offset_t clean_eva; void pmap_bootstrap (void); void *pmap_mapdev (vm_paddr_t, vm_size_t); void pmap_unmapdev (vm_offset_t, vm_size_t); +void pmap_release(struct pmap *pmap); struct vm_page *pmap_use_pt (pmap_t, vm_offset_t); #ifdef SMP void pmap_set_opt (void); diff --git a/sys/platform/vkernel/platform/pmap.c b/sys/platform/vkernel/platform/pmap.c index 125fcdc1e8..1f19d6e402 100644 --- a/sys/platform/vkernel/platform/pmap.c +++ b/sys/platform/vkernel/platform/pmap.c @@ -75,6 +75,7 @@ #include #include +#include #include @@ -153,11 +154,19 @@ pmap_bootstrap(void) { vm_pindex_t i = (vm_offset_t)KernelPTD >> PAGE_SHIFT; + /* + * The kernel_pmap's pm_pteobj is used only for locking and not + * for mmu pages. 
+ */ kernel_pmap.pm_pdir = KernelPTD - (KvaStart >> SEG_SHIFT); kernel_pmap.pm_pdirpte = KernelPTA[i]; kernel_pmap.pm_count = 1; kernel_pmap.pm_active = (cpumask_t)-1 & ~CPUMASK_LOCK; + kernel_pmap.pm_pteobj = &kernel_object; TAILQ_INIT(&kernel_pmap.pm_pvlist); + TAILQ_INIT(&kernel_pmap.pm_pvlist_free); + spin_init(&kernel_pmap.pm_spin); + lwkt_token_init(&kernel_pmap.pm_token, "kpmap_tok"); i386_protection_init(); } @@ -214,22 +223,27 @@ pmap_pinit(struct pmap *pmap) VM_ALLOC_NORMAL | VM_ALLOC_RETRY); ptdpg->wire_count = 1; - ++vmstats.v_wire_count; + atomic_add_int(&vmstats.v_wire_count, 1); /* not usually mapped */ - vm_page_flag_clear(ptdpg, PG_MAPPED | PG_BUSY); ptdpg->valid = VM_PAGE_BITS_ALL; + vm_page_flag_clear(ptdpg, PG_MAPPED); + vm_page_wakeup(ptdpg); pmap_kenter((vm_offset_t)pmap->pm_pdir, VM_PAGE_TO_PHYS(ptdpg)); pmap->pm_pdirpte = KernelPTA[(vm_offset_t)pmap->pm_pdir >> PAGE_SHIFT]; if ((ptdpg->flags & PG_ZERO) == 0) bzero(pmap->pm_pdir, PAGE_SIZE); + vm_page_flag_clear(ptdpg, PG_ZERO); pmap->pm_count = 1; pmap->pm_active = 0; pmap->pm_ptphint = NULL; pmap->pm_cpucachemask = 0; TAILQ_INIT(&pmap->pm_pvlist); + TAILQ_INIT(&pmap->pm_pvlist_free); + spin_init(&pmap->pm_spin); + lwkt_token_init(&pmap->pm_token, "pmap_tok"); bzero(&pmap->pm_stats, sizeof pmap->pm_stats); pmap->pm_stats.resident_count = 1; } @@ -242,7 +256,6 @@ pmap_pinit(struct pmap *pmap) void pmap_puninit(pmap_t pmap) { - lwkt_gettoken(&vm_token); if (pmap->pm_pdir) { kmem_free(&kernel_map, (vm_offset_t)pmap->pm_pdir, PAGE_SIZE); pmap->pm_pdir = NULL; @@ -251,7 +264,6 @@ pmap_puninit(pmap_t pmap) vm_object_deallocate(pmap->pm_pteobj); pmap->pm_pteobj = NULL; } - lwkt_reltoken(&vm_token); } @@ -268,11 +280,9 @@ pmap_puninit(pmap_t pmap) void pmap_pinit2(struct pmap *pmap) { - crit_enter(); - lwkt_gettoken(&vm_token); + spin_lock(&pmap_spin); TAILQ_INSERT_TAIL(&pmap_list, pmap, pm_pmnode); - lwkt_reltoken(&vm_token); - crit_exit(); + spin_unlock(&pmap_spin); } /* @@ -280,7 +290,7 @@ pmap_pinit2(struct pmap *pmap) * * Should only be called if the map contains no valid mappings. * - * No requirements. + * Caller must hold pmap->pm_token */ static int pmap_release_callback(struct vm_page *p, void *data); @@ -320,13 +330,13 @@ pmap_release(struct pmap *pmap) info.pmap = pmap; info.object = object; - crit_enter(); - lwkt_gettoken(&vm_token); + + spin_lock(&pmap_spin); TAILQ_REMOVE(&pmap_list, pmap, pm_pmnode); - crit_exit(); + spin_unlock(&pmap_spin); + vm_object_hold(object); do { - crit_enter(); info.error = 0; info.mpte = NULL; info.limit = object->generation; @@ -337,15 +347,14 @@ pmap_release(struct pmap *pmap) if (!pmap_release_free_page(pmap, info.mpte)) info.error = 1; } - crit_exit(); } while (info.error); + vm_object_drop(object); /* * Leave the KVA reservation for pm_pdir cached for later reuse. 
*/ pmap->pm_pdirpte = 0; pmap->pm_cpucachemask = 0; - lwkt_reltoken(&vm_token); } /* @@ -955,10 +964,9 @@ pmap_page_lookup(vm_object_t object, vm_pindex_t pindex) { vm_page_t m; -retry: - m = vm_page_lookup(object, pindex); - if (m && vm_page_sleep_busy(m, FALSE, "pplookp")) - goto retry; + ASSERT_LWKT_TOKEN_HELD(vm_object_token(object)); + m = vm_page_lookup_busy_wait(object, pindex, FALSE, "pplookp"); + return(m); } @@ -972,8 +980,7 @@ retry: static int _pmap_unwire_pte_hold(pmap_t pmap, vm_page_t m) { - while (vm_page_sleep_busy(m, FALSE, "pmuwpt")) - ; + vm_page_busy_wait(m, FALSE, "pmuwpt"); KASSERT(m->queue == PQ_NONE, ("_pmap_unwire_pte_hold: %p->queue != PQ_NONE", m)); @@ -981,7 +988,6 @@ _pmap_unwire_pte_hold(pmap_t pmap, vm_page_t m) /* * Unmap the page table page. */ - vm_page_busy(m); KKASSERT(pmap->pm_pdir[m->pindex] != 0); pmap_inval_pde(&pmap->pm_pdir[m->pindex], pmap, (vm_offset_t)m->pindex << SEG_SHIFT); @@ -1001,7 +1007,7 @@ _pmap_unwire_pte_hold(pmap_t pmap, vm_page_t m) vm_page_unhold(m); --m->wire_count; KKASSERT(m->wire_count == 0); - --vmstats.v_wire_count; + atomic_add_int(&vmstats.v_wire_count, -1); vm_page_flag_clear(m, PG_MAPPED | PG_WRITEABLE); vm_page_flash(m); vm_page_free_zero(m); @@ -1009,6 +1015,8 @@ _pmap_unwire_pte_hold(pmap_t pmap, vm_page_t m) } KKASSERT(m->hold_count > 1); vm_page_unhold(m); + vm_page_wakeup(m); + return 0; } @@ -1033,6 +1041,8 @@ pmap_unuse_pt(pmap_t pmap, vm_offset_t va, vm_page_t mpte) { unsigned ptepindex; + ASSERT_LWKT_TOKEN_HELD(vm_object_token(pmap->pm_pteobj)); + if (mpte == NULL) { /* * page table pages in the kernel_pmap are not managed. @@ -1044,8 +1054,9 @@ pmap_unuse_pt(pmap_t pmap, vm_offset_t va, vm_page_t mpte) (pmap->pm_ptphint->pindex == ptepindex)) { mpte = pmap->pm_ptphint; } else { - mpte = pmap_page_lookup( pmap->pm_pteobj, ptepindex); + mpte = pmap_page_lookup(pmap->pm_pteobj, ptepindex); pmap->pm_ptphint = mpte; + vm_page_wakeup(mpte); } } return pmap_unwire_pte_hold(pmap, mpte); @@ -1066,10 +1077,10 @@ pmap_release_free_page(struct pmap *pmap, vm_page_t p) * page-table pages. Those pages are zero now, and * might as well be placed directly into the zero queue. */ - if (vm_page_sleep_busy(p, FALSE, "pmaprl")) + if (vm_page_busy_try(p, FALSE)) { + vm_page_sleep_busy(p, FALSE, "pmaprl"); return 0; - - vm_page_busy(p); + } KKASSERT(pmap->pm_stats.resident_count > 0); --pmap->pm_stats.resident_count; @@ -1109,7 +1120,7 @@ pmap_release_free_page(struct pmap *pmap, vm_page_t p) * optimize the free call. */ p->wire_count--; - vmstats.v_wire_count--; + atomic_add_int(&vmstats.v_wire_count, -1); vm_page_free_zero(p); return 1; } @@ -1136,6 +1147,16 @@ _pmap_allocpte(pmap_t pmap, unsigned ptepindex) m = vm_page_grab(pmap->pm_pteobj, ptepindex, VM_ALLOC_NORMAL | VM_ALLOC_ZERO | VM_ALLOC_RETRY); + if (m->valid == 0) { + if ((m->flags & PG_ZERO) == 0) + pmap_zero_page(VM_PAGE_TO_PHYS(m)); + m->valid = VM_PAGE_BITS_ALL; + vm_page_flag_clear(m, PG_ZERO); + } else { + KKASSERT((m->flags & PG_ZERO) == 0); + } + vm_page_flag_set(m, PG_MAPPED); + KASSERT(m->queue == PQ_NONE, ("_pmap_allocpte: %p->queue != PQ_NONE", m)); @@ -1157,7 +1178,7 @@ _pmap_allocpte(pmap_t pmap, unsigned ptepindex) } if (m->wire_count == 0) - vmstats.v_wire_count++; + atomic_add_int(&vmstats.v_wire_count, 1); m->wire_count++; /* @@ -1176,16 +1197,6 @@ _pmap_allocpte(pmap_t pmap, unsigned ptepindex) */ pmap->pm_ptphint = m; - /* - * Try to use the new mapping, but if we cannot, then - * do it with the routine that maps the page explicitly. 
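pmap_page_lookup() above now returns the page already busied via vm_page_lookup_busy_wait(), sleeping on "pplookp" while another thread holds it busy, instead of the old lookup/sleep/retry loop. A tiny user-space model of look-up-and-busy-with-wait, using a mutex/condvar pair as the busy state (names invented, not the kernel API):

#include <pthread.h>
#include <stdbool.h>

struct fake_page {
	pthread_mutex_t	lock;
	pthread_cond_t	cv;
	bool		busy;
};

/* would be an (object, pindex) RB-tree lookup in the kernel */
static struct fake_page *
page_lookup(struct fake_page *table, int pindex)
{
	return &table[pindex];
}

static struct fake_page *
page_lookup_busy_wait(struct fake_page *table, int pindex)
{
	struct fake_page *m = page_lookup(table, pindex);

	pthread_mutex_lock(&m->lock);
	while (m->busy)			/* the "pplookp" sleep */
		pthread_cond_wait(&m->cv, &m->lock);
	m->busy = true;			/* returned busied to the caller */
	pthread_mutex_unlock(&m->lock);
	return m;
}

static void
page_wakeup(struct fake_page *m)	/* vm_page_wakeup() analogue */
{
	pthread_mutex_lock(&m->lock);
	m->busy = false;
	pthread_cond_broadcast(&m->cv);
	pthread_mutex_unlock(&m->lock);
}

int
main(void)
{
	struct fake_page table[1] = {
		{ PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false }
	};
	struct fake_page *m;

	m = page_lookup_busy_wait(table, 0);
	page_wakeup(m);
	return 0;
}

Because the page comes back busied, call sites that only want the ptphint release it right away, which is why several of the adjusted callers now do a vm_page_wakeup() immediately after the lookup.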
- */ - if ((m->flags & PG_ZERO) == 0) - pmap_zero_page(ptepa); - - m->valid = VM_PAGE_BITS_ALL; - vm_page_flag_clear(m, PG_ZERO); - vm_page_flag_set(m, PG_MAPPED); vm_page_wakeup(m); return (m); @@ -1204,6 +1215,8 @@ pmap_allocpte(pmap_t pmap, vm_offset_t va) vm_offset_t ptepa; vm_page_t m; + ASSERT_LWKT_TOKEN_HELD(vm_object_token(pmap->pm_pteobj)); + /* * Calculate pagetable page index */ @@ -1238,8 +1251,9 @@ pmap_allocpte(pmap_t pmap, vm_offset_t va) (pmap->pm_ptphint->pindex == ptepindex)) { m = pmap->pm_ptphint; } else { - m = pmap_page_lookup( pmap->pm_pteobj, ptepindex); + m = pmap_page_lookup(pmap->pm_pteobj, ptepindex); pmap->pm_ptphint = m; + vm_page_wakeup(m); } m->hold_count++; return m; @@ -1310,12 +1324,16 @@ pmap_collect(void) warningdone++; } - for(i = 0; i < vm_page_array_size; i++) { + for (i = 0; i < vm_page_array_size; i++) { m = &vm_page_array[i]; - if (m->wire_count || m->hold_count || m->busy || - (m->flags & PG_BUSY)) + if (m->wire_count || m->hold_count) continue; - pmap_remove_all(m); + if (vm_page_busy_try(m, TRUE) == 0) { + if (m->wire_count == 0 && m->hold_count == 0) { + pmap_remove_all(m); + } + vm_page_wakeup(m); + } } lwkt_reltoken(&vm_token); } @@ -1325,6 +1343,8 @@ pmap_collect(void) * in the header and we must copy the following entry up * to the header. Otherwise we must search the list for * the entry. In either case we free the now unused entry. + * + * caller must hold vm_token */ static int pmap_remove_entry(struct pmap *pmap, vm_page_t m, vm_offset_t va) @@ -1353,12 +1373,14 @@ pmap_remove_entry(struct pmap *pmap, vm_page_t m, vm_offset_t va) TAILQ_REMOVE(&m->md.pv_list, pv, pv_list); m->md.pv_list_count--; - m->object->agg_pv_list_count--; + atomic_add_int(&m->object->agg_pv_list_count, -1); TAILQ_REMOVE(&pmap->pm_pvlist, pv, pv_plist); if (TAILQ_EMPTY(&m->md.pv_list)) vm_page_flag_clear(m, PG_MAPPED | PG_WRITEABLE); ++pmap->pm_generation; + vm_object_hold(pmap->pm_pteobj); rtval = pmap_unuse_pt(pmap, va, pv->pv_ptem); + vm_object_drop(pmap->pm_pteobj); free_pv_entry(pv); crit_exit(); @@ -1384,7 +1406,7 @@ pmap_insert_entry(pmap_t pmap, vm_offset_t va, vm_page_t mpte, vm_page_t m) TAILQ_INSERT_TAIL(&m->md.pv_list, pv, pv_list); ++pmap->pm_generation; m->md.pv_list_count++; - m->object->agg_pv_list_count++; + atomic_add_int(&m->object->agg_pv_list_count, 1); crit_exit(); } @@ -1484,10 +1506,12 @@ pmap_remove(struct pmap *pmap, vm_offset_t sva, vm_offset_t eva) if (pmap == NULL) return; + vm_object_hold(pmap->pm_pteobj); lwkt_gettoken(&vm_token); KKASSERT(pmap->pm_stats.resident_count >= 0); if (pmap->pm_stats.resident_count == 0) { lwkt_reltoken(&vm_token); + vm_object_drop(pmap->pm_pteobj); return; } @@ -1500,6 +1524,7 @@ pmap_remove(struct pmap *pmap, vm_offset_t sva, vm_offset_t eva) ((pmap->pm_pdir[(sva >> PDRSHIFT)] & VPTE_PS) == 0)) { pmap_remove_page(pmap, sva); lwkt_reltoken(&vm_token); + vm_object_drop(pmap->pm_pteobj); return; } @@ -1562,6 +1587,7 @@ pmap_remove(struct pmap *pmap, vm_offset_t sva, vm_offset_t eva) } } lwkt_reltoken(&vm_token); + vm_object_drop(pmap->pm_pteobj); } /* @@ -1588,7 +1614,6 @@ pmap_remove_all(vm_page_t m) } #endif - crit_enter(); lwkt_gettoken(&vm_token); while ((pv = TAILQ_FIRST(&m->md.pv_list)) != NULL) { KKASSERT(pv->pv_pmap->pm_stats.resident_count > 0); @@ -1623,15 +1648,16 @@ pmap_remove_all(vm_page_t m) TAILQ_REMOVE(&pv->pv_pmap->pm_pvlist, pv, pv_plist); ++pv->pv_pmap->pm_generation; m->md.pv_list_count--; - m->object->agg_pv_list_count--; + atomic_add_int(&m->object->agg_pv_list_count, -1); if 
(TAILQ_EMPTY(&m->md.pv_list)) vm_page_flag_clear(m, PG_MAPPED | PG_WRITEABLE); + vm_object_hold(pv->pv_pmap->pm_pteobj); pmap_unuse_pt(pv->pv_pmap, pv->pv_va, pv->pv_ptem); + vm_object_drop(pv->pv_pmap->pm_pteobj); free_pv_entry(pv); } KKASSERT((m->flags & (PG_MAPPED | PG_WRITEABLE)) == 0); lwkt_reltoken(&vm_token); - crit_exit(); } /* @@ -1767,6 +1793,7 @@ pmap_enter(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot, va &= VPTE_FRAME; + vm_object_hold(pmap->pm_pteobj); lwkt_gettoken(&vm_token); /* @@ -1906,6 +1933,7 @@ validate: } KKASSERT((newpte & VPTE_MANAGED) == 0 || m->flags & PG_MAPPED); lwkt_reltoken(&vm_token); + vm_object_drop(pmap->pm_pteobj); } /* @@ -1934,6 +1962,7 @@ pmap_enter_quick(pmap_t pmap, vm_offset_t va, vm_page_t m) */ ptepindex = va >> PDRSHIFT; + vm_object_hold(pmap->pm_pteobj); lwkt_gettoken(&vm_token); do { @@ -1955,6 +1984,7 @@ pmap_enter_quick(pmap_t pmap, vm_offset_t va, vm_page_t m) } else { mpte = pmap_page_lookup( pmap->pm_pteobj, ptepindex); pmap->pm_ptphint = mpte; + vm_page_wakeup(mpte); } if (mpte) mpte->hold_count++; @@ -1972,6 +2002,7 @@ pmap_enter_quick(pmap_t pmap, vm_offset_t va, vm_page_t m) if (*pte) { pmap_unwire_pte_hold(pmap, mpte); lwkt_reltoken(&vm_token); + vm_object_drop(pmap->pm_pteobj); return; } @@ -2002,6 +2033,7 @@ pmap_enter_quick(pmap_t pmap, vm_offset_t va, vm_page_t m) /*pmap_inval_add(&info, pmap, va); shouldn't be needed 0->valid */ /*pmap_inval_flush(&info); don't need for vkernel */ lwkt_reltoken(&vm_token); + vm_object_drop(pmap->pm_pteobj); } /* @@ -2098,12 +2130,10 @@ pmap_object_init_pt(pmap_t pmap, vm_offset_t addr, vm_prot_t prot, info.addr = addr; info.pmap = pmap; - crit_enter(); - lwkt_gettoken(&vm_token); + vm_object_hold(object); vm_page_rb_tree_RB_SCAN(&object->rb_memq, rb_vm_page_scancmp, pmap_object_init_pt_callback, &info); - lwkt_reltoken(&vm_token); - crit_exit(); + vm_object_drop(object); } /* @@ -2115,6 +2145,7 @@ pmap_object_init_pt_callback(vm_page_t p, void *data) { struct rb_vm_page_scan_info *info = data; vm_pindex_t rel_index; + /* * don't allow an madvise to blow away our really * free pages allocating pv entries. 
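The counter updates in these hunks (object->agg_pv_list_count here, vmstats.v_wire_count and pv_entry_count elsewhere in the patch) switch from plain ++/-- to atomic_add_int() because they are no longer reliably covered by one global token. The C11 user-space equivalent of that change, with an invented counter name:

#include <stdatomic.h>
#include <stdio.h>

static atomic_int agg_pv_list_count;	/* stand-in for the object counter */

static void
pv_added(void)
{
	atomic_fetch_add(&agg_pv_list_count, 1);
}

static void
pv_removed(void)
{
	atomic_fetch_add(&agg_pv_list_count, -1);
}

int
main(void)
{
	pv_added();
	pv_added();
	pv_removed();
	printf("agg_pv_list_count=%d\n", atomic_load(&agg_pv_list_count));
	return 0;
}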
@@ -2123,16 +2154,17 @@ pmap_object_init_pt_callback(vm_page_t p, void *data) vmstats.v_free_count < vmstats.v_free_reserved) { return(-1); } + if (vm_page_busy_try(p, TRUE)) + return 0; if (((p->valid & VM_PAGE_BITS_ALL) == VM_PAGE_BITS_ALL) && - (p->busy == 0) && (p->flags & (PG_BUSY | PG_FICTITIOUS)) == 0) { - vm_page_busy(p); + (p->flags & PG_FICTITIOUS) == 0) { if ((p->queue - p->pc) == PQ_CACHE) vm_page_deactivate(p); rel_index = p->pindex - info->start_pindex; pmap_enter_quick(info->pmap, info->addr + i386_ptob(rel_index), p); - vm_page_wakeup(p); } + vm_page_wakeup(p); return(0); } @@ -2233,7 +2265,7 @@ pmap_copy(pmap_t dst_pmap, pmap_t src_pmap, vm_offset_t dst_addr, if (src_pmap->pm_pdir == NULL) return; - crit_enter(); + lwkt_gettoken(&vm_token); src_frame = get_ptbase1(src_pmap, src_addr); dst_frame = get_ptbase2(dst_pmap, src_addr); @@ -2335,7 +2367,7 @@ pmap_copy(pmap_t dst_pmap, pmap_t src_pmap, vm_offset_t dst_addr, dst_pte++; } } - crit_exit(); + lwkt_reltoken(&vm_token); } /* @@ -2533,7 +2565,8 @@ pmap_remove_pages(pmap_t pmap, vm_offset_t sva, vm_offset_t eva) vm_page_t m; int32_t save_generation; - crit_enter(); + if (pmap->pm_pteobj) + vm_object_hold(pmap->pm_pteobj); lwkt_gettoken(&vm_token); for (pv = TAILQ_FIRST(&pmap->pm_pvlist); pv; pv = npv) { if (pv->pv_va >= eva || pv->pv_va < sva) { @@ -2575,7 +2608,7 @@ pmap_remove_pages(pmap_t pmap, vm_offset_t sva, vm_offset_t eva) save_generation = ++pmap->pm_generation; m->md.pv_list_count--; - m->object->agg_pv_list_count--; + atomic_add_int(&m->object->agg_pv_list_count, -1); TAILQ_REMOVE(&m->md.pv_list, pv, pv_list); if (TAILQ_FIRST(&m->md.pv_list) == NULL) vm_page_flag_clear(m, PG_MAPPED | PG_WRITEABLE); @@ -2593,7 +2626,8 @@ pmap_remove_pages(pmap_t pmap, vm_offset_t sva, vm_offset_t eva) } } lwkt_reltoken(&vm_token); - crit_exit(); + if (pmap->pm_pteobj) + vm_object_drop(pmap->pm_pteobj); } /* @@ -2997,6 +3031,9 @@ done: return val; } +/* + * Caller must hold vmspace->vm_map.token for oldvm and newvm + */ void pmap_replacevm(struct proc *p, struct vmspace *newvm, int adjrefs) { diff --git a/sys/platform/vkernel64/conf/files b/sys/platform/vkernel64/conf/files index b8c6c028b8..232e5b0be6 100644 --- a/sys/platform/vkernel64/conf/files +++ b/sys/platform/vkernel64/conf/files @@ -31,6 +31,7 @@ platform/vkernel64/x86_64/mp.c optional smp \ cpu/x86_64/misc/elf_machdep.c standard cpu/x86_64/misc/lwbuf.c standard cpu/x86_64/misc/in_cksum2.s optional inet +cpu/x86_64/misc/monitor.s standard cpu/x86_64/misc/ktr.c optional ktr cpu/x86_64/misc/db_disasm.c optional ddb cpu/x86_64/misc/x86_64-gdbstub.c optional ddb diff --git a/sys/platform/vkernel64/include/pmap.h b/sys/platform/vkernel64/include/pmap.h index 81fff9ad4f..2a9ffc2d29 100644 --- a/sys/platform/vkernel64/include/pmap.h +++ b/sys/platform/vkernel64/include/pmap.h @@ -96,6 +96,12 @@ #ifndef _SYS_QUEUE_H_ #include #endif +#ifndef _SYS_SPINLOCK_H_ +#include +#endif +#ifndef _SYS_THREAD_H_ +#include +#endif #ifndef _SYS_VKERNEL_H_ #include #endif @@ -151,12 +157,15 @@ struct pmap { struct vm_object *pm_pteobj; /* Container for pte's */ TAILQ_ENTRY(pmap) pm_pmnode; /* list of pmaps */ TAILQ_HEAD(,pv_entry) pm_pvlist; /* list of mappings in pmap */ + TAILQ_HEAD(,pv_entry) pm_pvlist_free; /* free mappings */ int pm_count; /* reference count */ cpumask_t pm_active; /* active on cpus */ vm_pindex_t pm_pdindex; /* page dir page in obj */ struct pmap_statistics pm_stats; /* pmap statistics */ struct vm_page *pm_ptphint; /* pmap ptp hint */ int pm_generation; /* detect pvlist 
deletions */ + struct spinlock pm_spin; + struct lwkt_token pm_token; }; #define pmap_resident_count(pmap) (pmap)->pm_stats.resident_count @@ -193,6 +202,7 @@ extern vm_offset_t clean_eva; void pmap_bootstrap(vm_paddr_t *, int64_t); void *pmap_mapdev (vm_paddr_t, vm_size_t); void pmap_unmapdev (vm_offset_t, vm_size_t); +void pmap_release(struct pmap *pmap); struct vm_page *pmap_use_pt (pmap_t, vm_offset_t); #endif /* _KERNEL */ diff --git a/sys/platform/vkernel64/platform/pmap.c b/sys/platform/vkernel64/platform/pmap.c index 246d36a563..4a7b41d53f 100644 --- a/sys/platform/vkernel64/platform/pmap.c +++ b/sys/platform/vkernel64/platform/pmap.c @@ -48,15 +48,6 @@ /* * Manages physical address maps. - * - * In most cases the vm_token must be held when manipulating a user pmap - * or elements within a vm_page, and the kvm_token must be held when - * manipulating the kernel pmap. Operations on user pmaps may require - * additional synchronization. - * - * In some cases the caller may hold the required tokens to prevent pmap - * functions from blocking on those same tokens. This typically only works - * for lookup-style operations. */ #if JG @@ -89,6 +80,7 @@ #include #include #include +#include #include #include @@ -336,7 +328,8 @@ pmap_pte(pmap_t pmap, vm_offset_t va) PMAP_INLINE pt_entry_t * vtopte(vm_offset_t va) { - uint64_t mask = ((1ul << (NPTEPGSHIFT + NPDEPGSHIFT + NPDPEPGSHIFT + NPML4EPGSHIFT)) - 1); + uint64_t mask = ((1ul << (NPTEPGSHIFT + NPDEPGSHIFT + + NPDPEPGSHIFT + NPML4EPGSHIFT)) - 1); return (PTmap + ((va >> PAGE_SHIFT) & mask)); } @@ -344,7 +337,8 @@ vtopte(vm_offset_t va) static __inline pd_entry_t * vtopde(vm_offset_t va) { - uint64_t mask = ((1ul << (NPDEPGSHIFT + NPDPEPGSHIFT + NPML4EPGSHIFT)) - 1); + uint64_t mask = ((1ul << (NPDEPGSHIFT + NPDPEPGSHIFT + + NPML4EPGSHIFT)) - 1); return (PDmap + ((va >> PDRSHIFT) & mask)); } @@ -473,11 +467,18 @@ pmap_bootstrap(vm_paddr_t *firstaddr, int64_t ptov_offset) * The kernel's pmap is statically allocated so we don't have to use * pmap_create, which is unlikely to work correctly at this part of * the boot sequence (XXX and which no longer exists). + * + * The kernel_pmap's pm_pteobj is used only for locking and not + * for mmu pages. */ kernel_pmap.pm_pml4 = (pml4_entry_t *)PHYS_TO_DMAP(KPML4phys); kernel_pmap.pm_count = 1; kernel_pmap.pm_active = (cpumask_t)-1; /* don't allow deactivation */ + kernel_pmap.pm_pteobj = &kernel_object; TAILQ_INIT(&kernel_pmap.pm_pvlist); + TAILQ_INIT(&kernel_pmap.pm_pvlist_free); + lwkt_token_init(&kernel_pmap.pm_token, "kpmap_tok"); + spin_init(&kernel_pmap.pm_spin); /* * Reserve some special page table entries/VA space for temporary @@ -870,9 +871,8 @@ pmap_page_lookup(vm_object_t object, vm_pindex_t pindex) { vm_page_t m; - do { - m = vm_page_lookup(object, pindex); - } while (m && vm_page_sleep_busy(m, FALSE, "pplookp")); + ASSERT_LWKT_TOKEN_HELD(vm_object_token(object)); + m = vm_page_lookup_busy_wait(object, pindex, FALSE, "pplookp"); return(m); } @@ -925,8 +925,7 @@ static __inline int pmap_unwire_pte_hold(pmap_t pmap, vm_offset_t va, static int _pmap_unwire_pte_hold(pmap_t pmap, vm_offset_t va, vm_page_t m) { - while (vm_page_sleep_busy(m, FALSE, "pmuwpt")) - ; + vm_page_busy_wait(m, FALSE, "pmuwpt"); KASSERT(m->queue == PQ_NONE, ("_pmap_unwire_pte_hold: %p->queue != PQ_NONE", m)); @@ -935,7 +934,6 @@ _pmap_unwire_pte_hold(pmap_t pmap, vm_offset_t va, vm_page_t m) * Unmap the page table page. 
*/ //abort(); /* JG */ - vm_page_busy(m); /* pmap_inval_add(info, pmap, -1); */ if (m->pindex >= (NUPDE + NUPDPE)) { @@ -986,7 +984,7 @@ _pmap_unwire_pte_hold(pmap_t pmap, vm_offset_t va, vm_page_t m) vm_page_unhold(m); --m->wire_count; KKASSERT(m->wire_count == 0); - --vmstats.v_wire_count; + atomic_add_int(&vmstats.v_wire_count, -1); vm_page_flag_clear(m, PG_MAPPED | PG_WRITEABLE); vm_page_flash(m); vm_page_free_zero(m); @@ -994,6 +992,7 @@ _pmap_unwire_pte_hold(pmap_t pmap, vm_offset_t va, vm_page_t m) } else { KKASSERT(m->hold_count > 1); vm_page_unhold(m); + vm_page_wakeup(m); return 0; } } @@ -1020,6 +1019,8 @@ pmap_unuse_pt(pmap_t pmap, vm_offset_t va, vm_page_t mpte) /* JG Use FreeBSD/amd64 or FreeBSD/i386 ptepde approaches? */ vm_pindex_t ptepindex; + ASSERT_LWKT_TOKEN_HELD(vm_object_token(pmap->pm_pteobj)); + if (mpte == NULL) { /* * page table pages in the kernel_pmap are not managed. @@ -1033,6 +1034,7 @@ pmap_unuse_pt(pmap_t pmap, vm_offset_t va, vm_page_t mpte) } else { mpte = pmap_page_lookup(pmap->pm_pteobj, ptepindex); pmap->pm_ptphint = mpte; + vm_page_wakeup(mpte); } } @@ -1085,20 +1087,25 @@ pmap_pinit(struct pmap *pmap) ptdpg = vm_page_grab(pmap->pm_pteobj, NUPDE + NUPDPE + PML4PML4I, VM_ALLOC_NORMAL | VM_ALLOC_RETRY); pmap->pm_pdirm = ptdpg; - vm_page_flag_clear(ptdpg, PG_MAPPED | PG_BUSY); + vm_page_flag_clear(ptdpg, PG_MAPPED); ptdpg->valid = VM_PAGE_BITS_ALL; if (ptdpg->wire_count == 0) - ++vmstats.v_wire_count; + atomic_add_int(&vmstats.v_wire_count, 1); ptdpg->wire_count = 1; + vm_page_wakeup(ptdpg); pmap_kenter((vm_offset_t)pmap->pm_pml4, VM_PAGE_TO_PHYS(ptdpg)); } if ((ptdpg->flags & PG_ZERO) == 0) bzero(pmap->pm_pml4, PAGE_SIZE); + vm_page_flag_clear(ptdpg, PG_ZERO); pmap->pm_count = 1; pmap->pm_active = 0; pmap->pm_ptphint = NULL; TAILQ_INIT(&pmap->pm_pvlist); + TAILQ_INIT(&pmap->pm_pvlist_free); + spin_init(&pmap->pm_spin); + lwkt_token_init(&pmap->pm_token, "pmap_tok"); bzero(&pmap->pm_stats, sizeof pmap->pm_stats); pmap->pm_stats.resident_count = 1; } @@ -1117,18 +1124,15 @@ pmap_puninit(pmap_t pmap) vm_page_t p; KKASSERT(pmap->pm_active == 0); - lwkt_gettoken(&vm_token); if ((p = pmap->pm_pdirm) != NULL) { KKASSERT(pmap->pm_pml4 != NULL); pmap_kremove((vm_offset_t)pmap->pm_pml4); + vm_page_busy_wait(p, FALSE, "pgpun"); p->wire_count--; - vmstats.v_wire_count--; - KKASSERT((p->flags & PG_BUSY) == 0); - vm_page_busy(p); + atomic_add_int(&vmstats.v_wire_count, -1); vm_page_free_zero(p); pmap->pm_pdirm = NULL; } - lwkt_reltoken(&vm_token); if (pmap->pm_pml4) { kmem_free(&kernel_map, (vm_offset_t)pmap->pm_pml4, PAGE_SIZE); pmap->pm_pml4 = NULL; @@ -1152,11 +1156,9 @@ pmap_puninit(pmap_t pmap) void pmap_pinit2(struct pmap *pmap) { - crit_enter(); - lwkt_gettoken(&vm_token); + spin_lock(&pmap_spin); TAILQ_INSERT_TAIL(&pmap_list, pmap, pm_pmnode); - lwkt_reltoken(&vm_token); - crit_exit(); + spin_unlock(&pmap_spin); } /* @@ -1175,10 +1177,10 @@ pmap_release_free_page(struct pmap *pmap, vm_page_t p) * page-table pages. Those pages are zero now, and * might as well be placed directly into the zero queue. */ - if (vm_page_sleep_busy(p, FALSE, "pmaprl")) + if (vm_page_busy_try(p, FALSE)) { + vm_page_sleep_busy(p, FALSE, "pmaprl"); return 0; - - vm_page_busy(p); + } /* * Remove the page table page from the processes address space. 
@@ -1247,7 +1249,7 @@ pmap_release_free_page(struct pmap *pmap, vm_page_t p) } else { abort(); p->wire_count--; - vmstats.v_wire_count--; + atomic_add_int(&vmstats.v_wire_count, -1); /* JG eventually revert to using vm_page_free_zero() */ vm_page_free(p); } @@ -1264,13 +1266,20 @@ _pmap_allocpte(pmap_t pmap, vm_pindex_t ptepindex) vm_page_t m, pdppg, pdpg; /* - * Find or fabricate a new pagetable page + * Find or fabricate a new pagetable page. Handle allocation + * races by checking m->valid. */ m = vm_page_grab(pmap->pm_pteobj, ptepindex, VM_ALLOC_NORMAL | VM_ALLOC_ZERO | VM_ALLOC_RETRY); - if ((m->flags & PG_ZERO) == 0) { - pmap_zero_page(VM_PAGE_TO_PHYS(m)); + if (m->valid == 0) { + if ((m->flags & PG_ZERO) == 0) { + pmap_zero_page(VM_PAGE_TO_PHYS(m)); + } + m->valid = VM_PAGE_BITS_ALL; + vm_page_flag_clear(m, PG_ZERO); + } else { + KKASSERT((m->flags & PG_ZERO) == 0); } KASSERT(m->queue == PQ_NONE, @@ -1283,14 +1292,13 @@ _pmap_allocpte(pmap_t pmap, vm_pindex_t ptepindex) m->hold_count++; if (m->wire_count == 0) - vmstats.v_wire_count++; + atomic_add_int(&vmstats.v_wire_count, 1); m->wire_count++; /* * Map the pagetable page into the process address space, if * it isn't already there. */ - ++pmap->pm_stats.resident_count; if (ptepindex >= (NUPDE + NUPDPE)) { @@ -1391,9 +1399,6 @@ _pmap_allocpte(pmap_t pmap, vm_pindex_t ptepindex) * Set the page table hint */ pmap->pm_ptphint = m; - - m->valid = VM_PAGE_BITS_ALL; - vm_page_flag_clear(m, PG_ZERO); vm_page_flag_set(m, PG_MAPPED); vm_page_wakeup(m); @@ -1413,6 +1418,8 @@ pmap_allocpte(pmap_t pmap, vm_offset_t va) pd_entry_t *pd; vm_page_t m; + ASSERT_LWKT_TOKEN_HELD(vm_object_token(pmap->pm_pteobj)); + /* * Calculate pagetable page index */ @@ -1441,9 +1448,10 @@ pmap_allocpte(pmap_t pmap, vm_offset_t va) */ if (pd != NULL && (*pd & VPTE_V) != 0) { /* YYY hint is used here on i386 */ - m = pmap_page_lookup( pmap->pm_pteobj, ptepindex); + m = pmap_page_lookup(pmap->pm_pteobj, ptepindex); pmap->pm_ptphint = m; - m->hold_count++; + vm_page_hold(m); + vm_page_wakeup(m); return m; } /* @@ -1462,7 +1470,7 @@ pmap_allocpte(pmap_t pmap, vm_offset_t va) * Called when a pmap initialized by pmap_pinit is being released. * Should only be called if the map contains no valid mappings. * - * No requirements. 
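_pmap_allocpte() above now keys one-time initialization of a page-table page off m->valid: vm_page_grab() hands the page back busied, the first grabber zeroes it (unless PG_ZERO already says it is zero) and marks it valid, and later grabbers that raced on the same pindex simply reuse it. A rough user-space sketch of that grab/initialize-once idea, with a mutex standing in for the busy state and without the PG_ZERO shortcut (names invented):

#include <pthread.h>
#include <stdbool.h>
#include <string.h>

#define FAKE_PAGE_SIZE	4096

struct fake_page {
	pthread_mutex_t	busy;		/* vm_page_grab() returns it "busied" */
	bool		valid;		/* already initialized? */
	unsigned char	data[FAKE_PAGE_SIZE];
};

static struct fake_page *
page_grab(struct fake_page *m)
{
	pthread_mutex_lock(&m->busy);
	return m;
}

static void
page_wakeup(struct fake_page *m)
{
	pthread_mutex_unlock(&m->busy);
}

/* every racing caller ends up with the same, zeroed page table page */
static struct fake_page *
allocpte(struct fake_page *m)
{
	page_grab(m);
	if (!m->valid) {		/* first grabber initializes */
		memset(m->data, 0, sizeof(m->data));
		m->valid = true;
	}				/* later grabbers just reuse it */
	page_wakeup(m);
	return m;
}

int
main(void)
{
	static struct fake_page pt = { .busy = PTHREAD_MUTEX_INITIALIZER };

	allocpte(&pt);
	allocpte(&pt);			/* second call sees valid == true */
	return 0;
}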
+ * Caller must hold pmap->pm_token */ static int pmap_release_callback(struct vm_page *p, void *data); @@ -1481,13 +1489,13 @@ pmap_release(struct pmap *pmap) info.pmap = pmap; info.object = object; - crit_enter(); - lwkt_gettoken(&vm_token); + + spin_lock(&pmap_spin); TAILQ_REMOVE(&pmap_list, pmap, pm_pmnode); - crit_exit(); + spin_unlock(&pmap_spin); + vm_object_hold(object); do { - crit_enter(); info.error = 0; info.mpte = NULL; info.limit = object->generation; @@ -1498,9 +1506,8 @@ pmap_release(struct pmap *pmap) if (!pmap_release_free_page(pmap, info.mpte)) info.error = 1; } - crit_exit(); } while (info.error); - lwkt_reltoken(&vm_token); + vm_object_drop(object); } static int @@ -1540,8 +1547,7 @@ pmap_growkernel(vm_offset_t kstart, vm_offset_t kend) addr = kend; - crit_enter(); - lwkt_gettoken(&vm_token); + vm_object_hold(kptobj); if (kernel_vm_end == 0) { kernel_vm_end = KvaStart; nkpt = 0; @@ -1614,8 +1620,7 @@ pmap_growkernel(vm_offset_t kstart, vm_offset_t kend) break; } } - lwkt_reltoken(&vm_token); - crit_exit(); + vm_object_drop(kptobj); } /* @@ -1751,12 +1756,16 @@ pmap_collect(void) warningdone++; } - for(i = 0; i < vm_page_array_size; i++) { + for (i = 0; i < vm_page_array_size; i++) { m = &vm_page_array[i]; - if (m->wire_count || m->hold_count || m->busy || - (m->flags & PG_BUSY)) + if (m->wire_count || m->hold_count) continue; - pmap_remove_all(m); + if (vm_page_busy_try(m, TRUE) == 0) { + if (m->wire_count == 0 && m->hold_count == 0) { + pmap_remove_all(m); + } + vm_page_wakeup(m); + } } lwkt_reltoken(&vm_token); } @@ -1767,6 +1776,8 @@ pmap_collect(void) * in the header and we must copy the following entry up * to the header. Otherwise we must search the list for * the entry. In either case we free the now unused entry. + * + * caller must hold vm_token. 
*/ static int pmap_remove_entry(struct pmap *pmap, vm_page_t m, vm_offset_t va) @@ -1774,7 +1785,6 @@ pmap_remove_entry(struct pmap *pmap, vm_page_t m, vm_offset_t va) pv_entry_t pv; int rtval; - crit_enter(); if (m->md.pv_list_count < pmap->pm_stats.resident_count) { TAILQ_FOREACH(pv, &m->md.pv_list, pv_list) { if (pmap == pv->pv_pmap && va == pv->pv_va) @@ -1796,16 +1806,18 @@ pmap_remove_entry(struct pmap *pmap, vm_page_t m, vm_offset_t va) if (pv) { TAILQ_REMOVE(&m->md.pv_list, pv, pv_list); m->md.pv_list_count--; - m->object->agg_pv_list_count--; + atomic_add_int(&m->object->agg_pv_list_count, -1); KKASSERT(m->md.pv_list_count >= 0); if (TAILQ_EMPTY(&m->md.pv_list)) vm_page_flag_clear(m, PG_MAPPED | PG_WRITEABLE); TAILQ_REMOVE(&pmap->pm_pvlist, pv, pv_plist); ++pmap->pm_generation; + KKASSERT(pmap->pm_pteobj != NULL); + vm_object_hold(pmap->pm_pteobj); rtval = pmap_unuse_pt(pmap, va, pv->pv_ptem); + vm_object_drop(pmap->pm_pteobj); free_pv_entry(pv); } - crit_exit(); return rtval; } @@ -1827,7 +1839,7 @@ pmap_insert_entry(pmap_t pmap, vm_offset_t va, vm_page_t mpte, vm_page_t m) TAILQ_INSERT_TAIL(&pmap->pm_pvlist, pv, pv_plist); TAILQ_INSERT_TAIL(&m->md.pv_list, pv, pv_list); m->md.pv_list_count++; - m->object->agg_pv_list_count++; + atomic_add_int(&m->object->agg_pv_list_count, 1); crit_exit(); } @@ -1862,9 +1874,9 @@ pmap_remove_pte(struct pmap *pmap, pt_entry_t *ptq, vm_offset_t va) if (oldpte & VPTE_M) { #if defined(PMAP_DIAGNOSTIC) if (pmap_nw_modified((pt_entry_t) oldpte)) { - kprintf( - "pmap_remove: modified page not writable: va: 0x%lx, pte: 0x%lx\n", - va, oldpte); + kprintf("pmap_remove: modified page not " + "writable: va: 0x%lx, pte: 0x%lx\n", + va, oldpte); } #endif if (pmap_track_modified(pmap, va)) @@ -1924,10 +1936,12 @@ pmap_remove(struct pmap *pmap, vm_offset_t sva, vm_offset_t eva) if (pmap == NULL) return; + vm_object_hold(pmap->pm_pteobj); lwkt_gettoken(&vm_token); KKASSERT(pmap->pm_stats.resident_count >= 0); if (pmap->pm_stats.resident_count == 0) { lwkt_reltoken(&vm_token); + vm_object_drop(pmap->pm_pteobj); return; } @@ -1941,6 +1955,7 @@ pmap_remove(struct pmap *pmap, vm_offset_t sva, vm_offset_t eva) if (pde && (*pde & VPTE_PS) == 0) { pmap_remove_page(pmap, sva); lwkt_reltoken(&vm_token); + vm_object_drop(pmap->pm_pteobj); return; } } @@ -2009,6 +2024,7 @@ pmap_remove(struct pmap *pmap, vm_offset_t sva, vm_offset_t eva) } } lwkt_reltoken(&vm_token); + vm_object_drop(pmap->pm_pteobj); } /* @@ -2019,7 +2035,6 @@ pmap_remove(struct pmap *pmap, vm_offset_t sva, vm_offset_t eva) * * No requirements. 
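pmap_remove() now brackets its work with vm_object_hold(pmap->pm_pteobj) followed by lwkt_gettoken(&vm_token), and every early return releases the pair in reverse order. The patch releases them explicitly before each return; an equivalent way to keep the pairs symmetric is a single exit label, sketched here with two mutexes standing in for the object token and vm_token (nothing below is kernel API):

    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t obj_token = PTHREAD_MUTEX_INITIALIZER;  /* pm_pteobj token */
    static pthread_mutex_t vm_tok    = PTHREAD_MUTEX_INITIALIZER;  /* vm_token        */
    static int resident_count;

    static void fake_pmap_remove(bool single_page)
    {
        pthread_mutex_lock(&obj_token);    /* vm_object_hold(pmap->pm_pteobj) */
        pthread_mutex_lock(&vm_tok);       /* lwkt_gettoken(&vm_token)        */

        if (resident_count == 0)
            goto done;                     /* early exit keeps release order  */
        if (single_page)
            goto done;                     /* pmap_remove_page() analogue     */
        /* full range removal would go here */
    done:
        pthread_mutex_unlock(&vm_tok);     /* lwkt_reltoken(&vm_token)        */
        pthread_mutex_unlock(&obj_token);  /* vm_object_drop(pmap->pm_pteobj) */
    }

    int main(void)
    {
        fake_pmap_remove(true);
        return 0;
    }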
*/ - static void pmap_remove_all(vm_page_t m) { @@ -2036,7 +2051,6 @@ pmap_remove_all(vm_page_t m) } #endif - crit_enter(); lwkt_gettoken(&vm_token); while ((pv = TAILQ_FIRST(&m->md.pv_list)) != NULL) { KKASSERT(pv->pv_pmap->pm_stats.resident_count > 0); @@ -2071,16 +2085,17 @@ pmap_remove_all(vm_page_t m) TAILQ_REMOVE(&pv->pv_pmap->pm_pvlist, pv, pv_plist); ++pv->pv_pmap->pm_generation; m->md.pv_list_count--; - m->object->agg_pv_list_count--; + atomic_add_int(&m->object->agg_pv_list_count, -1); KKASSERT(m->md.pv_list_count >= 0); if (TAILQ_EMPTY(&m->md.pv_list)) vm_page_flag_clear(m, PG_MAPPED | PG_WRITEABLE); + vm_object_hold(pv->pv_pmap->pm_pteobj); pmap_unuse_pt(pv->pv_pmap, pv->pv_va, pv->pv_ptem); + vm_object_drop(pv->pv_pmap->pm_pteobj); free_pv_entry(pv); } KKASSERT((m->flags & (PG_MAPPED|PG_WRITEABLE)) == 0); lwkt_reltoken(&vm_token); - crit_exit(); } /* @@ -2227,6 +2242,7 @@ pmap_enter(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot, va = trunc_page(va); + vm_object_hold(pmap->pm_pteobj); lwkt_gettoken(&vm_token); /* @@ -2357,6 +2373,7 @@ validate: } KKASSERT((newpte & VPTE_MANAGED) == 0 || (m->flags & PG_MAPPED)); lwkt_reltoken(&vm_token); + vm_object_drop(pmap->pm_pteobj); } /* @@ -2384,6 +2401,7 @@ pmap_enter_quick(pmap_t pmap, vm_offset_t va, vm_page_t m) */ ptepindex = pmap_pde_pindex(va); + vm_object_hold(pmap->pm_pteobj); lwkt_gettoken(&vm_token); do { @@ -2405,6 +2423,7 @@ pmap_enter_quick(pmap_t pmap, vm_offset_t va, vm_page_t m) } else { mpte = pmap_page_lookup( pmap->pm_pteobj, ptepindex); pmap->pm_ptphint = mpte; + vm_page_wakeup(mpte); } if (mpte) mpte->hold_count++; @@ -2425,6 +2444,7 @@ pmap_enter_quick(pmap_t pmap, vm_offset_t va, vm_page_t m) pa = VM_PAGE_TO_PHYS(m); KKASSERT(((*pte ^ pa) & VPTE_FRAME) == 0); lwkt_reltoken(&vm_token); + vm_object_drop(pmap->pm_pteobj); return; } @@ -2453,6 +2473,7 @@ pmap_enter_quick(pmap_t pmap, vm_offset_t va, vm_page_t m) /*pmap_inval_add(&info, pmap, va); shouldn't be needed 0->valid */ /*pmap_inval_flush(&info); don't need for vkernel */ lwkt_reltoken(&vm_token); + vm_object_drop(pmap->pm_pteobj); } /* @@ -2533,12 +2554,10 @@ pmap_object_init_pt(pmap_t pmap, vm_offset_t addr, vm_prot_t prot, info.addr = addr; info.pmap = pmap; - crit_enter(); - lwkt_gettoken(&vm_token); + vm_object_hold(object); vm_page_rb_tree_RB_SCAN(&object->rb_memq, rb_vm_page_scancmp, pmap_object_init_pt_callback, &info); - lwkt_reltoken(&vm_token); - crit_exit(); + vm_object_drop(object); } static @@ -2555,16 +2574,17 @@ pmap_object_init_pt_callback(vm_page_t p, void *data) vmstats.v_free_count < vmstats.v_free_reserved) { return(-1); } + if (vm_page_busy_try(p, TRUE)) + return 0; if (((p->valid & VM_PAGE_BITS_ALL) == VM_PAGE_BITS_ALL) && - (p->busy == 0) && (p->flags & (PG_BUSY | PG_FICTITIOUS)) == 0) { - vm_page_busy(p); + (p->flags & PG_FICTITIOUS) == 0) { if ((p->queue - p->pc) == PQ_CACHE) vm_page_deactivate(p); rel_index = p->pindex - info->start_pindex; pmap_enter_quick(info->pmap, info->addr + x86_64_ptob(rel_index), p); - vm_page_wakeup(p); } + vm_page_wakeup(p); return(0); } @@ -2800,8 +2820,10 @@ pmap_remove_pages(pmap_t pmap, vm_offset_t sva, vm_offset_t eva) vm_page_t m; int save_generation; - crit_enter(); + if (pmap->pm_pteobj) + vm_object_hold(pmap->pm_pteobj); lwkt_gettoken(&vm_token); + for (pv = TAILQ_FIRST(&pmap->pm_pvlist); pv; pv = npv) { if (pv->pv_va >= eva || pv->pv_va < sva) { npv = TAILQ_NEXT(pv, pv_plist); @@ -2842,7 +2864,7 @@ pmap_remove_pages(pmap_t pmap, vm_offset_t sva, vm_offset_t eva) save_generation = 
++pmap->pm_generation; m->md.pv_list_count--; - m->object->agg_pv_list_count--; + atomic_add_int(&m->object->agg_pv_list_count, -1); TAILQ_REMOVE(&m->md.pv_list, pv, pv_list); if (TAILQ_EMPTY(&m->md.pv_list)) vm_page_flag_clear(m, PG_MAPPED | PG_WRITEABLE); @@ -2860,7 +2882,8 @@ pmap_remove_pages(pmap_t pmap, vm_offset_t sva, vm_offset_t eva) } } lwkt_reltoken(&vm_token); - crit_exit(); + if (pmap->pm_pteobj) + vm_object_drop(pmap->pm_pteobj); } /* @@ -3207,6 +3230,8 @@ done: /* * Replace p->p_vmspace with a new one. If adjrefs is non-zero the new * vmspace will be ref'd and the old one will be deref'd. + * + * Caller must hold vmspace->vm_map.token for oldvm and newvm */ void pmap_replacevm(struct proc *p, struct vmspace *newvm, int adjrefs) diff --git a/sys/platform/vkernel64/x86_64/cpu_regs.c b/sys/platform/vkernel64/x86_64/cpu_regs.c index 6007b68c6b..7ffb915749 100644 --- a/sys/platform/vkernel64/x86_64/cpu_regs.c +++ b/sys/platform/vkernel64/x86_64/cpu_regs.c @@ -687,10 +687,7 @@ fetchupcall(struct vmupcall *vu, int morepending, void *rsp) * critical section. * * Note on cpu_idle_hlt: On an SMP system we rely on a scheduler IPI - * to wake a HLTed cpu up. However, there are cases where the idlethread - * will be entered with the possibility that no IPI will occur and in such - * cases lwkt_switch() sets RQF_WAKEUP and we nominally check - * RQF_IDLECHECK_WK_MASK. + * to wake a HLTed cpu up. */ static int cpu_idle_hlt = 1; static int cpu_idle_hltcnt; @@ -721,9 +718,7 @@ cpu_idle(void) /* * The idle loop halts only if no threads are scheduleable - * and no signals have occured. If we race a signal - * RQF_WAKEUP and other gd_reqflags will cause umtx_sleep() - * to return immediately. + * and no signals have occured. */ if (cpu_idle_hlt && (td->td_gd->gd_reqflags & RQF_IDLECHECK_WK_MASK) == 0) { diff --git a/sys/platform/vkernel64/x86_64/mp.c b/sys/platform/vkernel64/x86_64/mp.c index 61d2ba32bb..7358034150 100644 --- a/sys/platform/vkernel64/x86_64/mp.c +++ b/sys/platform/vkernel64/x86_64/mp.c @@ -121,7 +121,7 @@ ap_finish(void) while (try_mplock() == 0) DELAY(100000); if (bootverbose) - kprintf("Active CPU Mask: %08x\n", smp_active_mask); + kprintf("Active CPU Mask: %08lx\n", (long)smp_active_mask); } SYSINIT(finishsmp, SI_BOOT2_FINISH_SMP, SI_ORDER_FIRST, ap_finish, NULL) @@ -392,6 +392,7 @@ start_all_aps(u_int boot_addr) */ ap_tids[0] = pthread_self(); + vm_object_hold(&kernel_object); for (x = 1; x <= mp_naps; x++) { /* Allocate space for the CPU's private space. 
*/ @@ -452,6 +453,7 @@ start_all_aps(u_int boot_addr) DELAY(1000); } } + vm_object_drop(&kernel_object); return(ncpus - 1); } diff --git a/sys/sys/globaldata.h b/sys/sys/globaldata.h index b506f3b1a1..0c48caf167 100644 --- a/sys/sys/globaldata.h +++ b/sys/sys/globaldata.h @@ -136,8 +136,10 @@ struct globaldata { struct timeval gd_stattv; int gd_intr_nesting_level; /* hard code, intrs, ipis */ struct vmmeter gd_cnt; + cpumask_t gd_ipimask; /* pending ipis from cpus */ struct lwkt_ipiq *gd_ipiq; /* array[ncpu] of ipiq's */ struct lwkt_ipiq gd_cpusyncq; /* ipiq for cpu synchro */ + u_int gd_npoll; /* ipiq synchronization */ int gd_fairq_total_pri; struct thread gd_unused02B; struct thread gd_idlethread; @@ -166,7 +168,8 @@ struct globaldata { u_int gd_idle_repeat; /* repeated switches to idle */ int gd_ireserved[7]; const char *gd_infomsg; /* debugging */ - void *gd_preserved[10]; /* future fields */ + struct lwkt_tokref gd_handoff; /* hand-off tokref */ + void *gd_preserved[8]; /* future fields */ /* extended by */ }; @@ -181,7 +184,7 @@ typedef struct globaldata *globaldata_t; #define RQB_AST_UPCALL 6 #define RQB_TIMER 7 #define RQB_RUNNING 8 -#define RQB_WAKEUP 9 +#define RQB_SPINNING 9 #define RQF_IPIQ (1 << RQB_IPIQ) #define RQF_INTPEND (1 << RQB_INTPEND) @@ -192,13 +195,13 @@ typedef struct globaldata *globaldata_t; #define RQF_AST_LWKT_RESCHED (1 << RQB_AST_LWKT_RESCHED) #define RQF_AST_UPCALL (1 << RQB_AST_UPCALL) #define RQF_RUNNING (1 << RQB_RUNNING) -#define RQF_WAKEUP (1 << RQB_WAKEUP) +#define RQF_SPINNING (1 << RQB_SPINNING) #define RQF_AST_MASK (RQF_AST_OWEUPC|RQF_AST_SIGNAL|\ RQF_AST_USER_RESCHED|RQF_AST_LWKT_RESCHED|\ RQF_AST_UPCALL) #define RQF_IDLECHECK_MASK (RQF_IPIQ|RQF_INTPEND|RQF_TIMER) -#define RQF_IDLECHECK_WK_MASK (RQF_IDLECHECK_MASK|RQF_WAKEUP) +#define RQF_IDLECHECK_WK_MASK (RQF_IDLECHECK_MASK|RQF_AST_LWKT_RESCHED) /* * globaldata flags diff --git a/sys/sys/lock.h b/sys/sys/lock.h index efd70ba23a..fa50cf2585 100644 --- a/sys/sys/lock.h +++ b/sys/sys/lock.h @@ -131,7 +131,7 @@ struct lock { #define LK_SLEEPFAIL 0x00000020 /* sleep, then return failure */ #define LK_CANRECURSE 0x00000040 /* allow recursive exclusive lock */ #define LK_UNUSED0080 0x00000080 -#define LK_NOSPINWAIT 0x01000000 /* don't wait for spinlock */ +#define LK_UNUSED0100x 0x01000000 #define LK_TIMELOCK 0x02000000 #define LK_PCATCH 0x04000000 /* timelocked with signal catching */ /* diff --git a/sys/sys/malloc.h b/sys/sys/malloc.h index 340b2f781b..4a4895cda4 100644 --- a/sys/sys/malloc.h +++ b/sys/sys/malloc.h @@ -174,6 +174,7 @@ MALLOC_DECLARE(M_IP6NDP); /* for INET6 */ MALLOC_DECLARE(M_IOV); /* XXX struct malloc_type is unused for contig*(). 
*/ +size_t kmem_lim_size(void); void contigfree (void *addr, unsigned long size, struct malloc_type *type); void *contigmalloc (unsigned long size, struct malloc_type *type, diff --git a/sys/sys/param.h b/sys/sys/param.h index cad4a967d7..4a0c972a44 100644 --- a/sys/sys/param.h +++ b/sys/sys/param.h @@ -140,6 +140,7 @@ #define PWAKEUP_ONE 0x00008000 /* argument to wakeup: only one */ #define PDOMAIN_MASK 0xFFFF0000 /* address domains for wakeup */ #define PDOMAIN_UMTX 0x00010000 /* independant domain for UMTX */ +#define PDOMAIN_XLOCK 0x00020000 /* independant domain for fifo_lock */ #define PWAKEUP_ENCODE(domain, cpu) ((domain) | (cpu)) #define PWAKEUP_DECODE(domain) ((domain) & PWAKEUP_CPUMASK) diff --git a/sys/sys/spinlock.h b/sys/sys/spinlock.h index db02d4cd0e..703f34c2c7 100644 --- a/sys/sys/spinlock.h +++ b/sys/sys/spinlock.h @@ -39,14 +39,20 @@ * Note that the spinlock structure is retained whether we are SMP or not, * so structures using embedded spinlocks do not change size for SMP vs UP * builds. + * + * DragonFly spinlocks use a chasing counter. A core desiring a spinlock + * does a atomic_fetchadd_int() on countb and then waits for counta to + * reach its value using MWAIT. Releasing the spinlock involves an + * atomic_add_int() on counta. If no MWAIT is available the core can spin + * waiting for the value to change which is still represented by a shared+ro + * cache entry. */ struct spinlock { - volatile int lock; /* 0 = unlocked, 1 = locked */ + int counta; + int countb; }; -#define SPINLOCK_EXCLUSIVE 0x80000000 - -#define SPINLOCK_INITIALIZER(head) { 0 } +#define SPINLOCK_INITIALIZER(head) { 0, 0 } #endif diff --git a/sys/sys/spinlock2.h b/sys/sys/spinlock2.h index 4d5660830f..85811d5b96 100644 --- a/sys/sys/spinlock2.h +++ b/sys/sys/spinlock2.h @@ -51,10 +51,14 @@ #include #include +extern struct spinlock pmap_spin; + #ifdef SMP -extern int spin_trylock_wr_contested2(globaldata_t gd); -extern void spin_lock_wr_contested2(struct spinlock *mtx); +int spin_trylock_contested(struct spinlock *spin); +void spin_lock_contested(struct spinlock *spin); +void _spin_pool_lock(void *chan); +void _spin_pool_unlock(void *chan); #endif @@ -65,29 +69,26 @@ extern void spin_lock_wr_contested2(struct spinlock *mtx); * TRUE on success. */ static __inline boolean_t -spin_trylock(struct spinlock *mtx) +spin_trylock(struct spinlock *spin) { globaldata_t gd = mycpu; - int value; ++gd->gd_curthread->td_critcount; cpu_ccfence(); ++gd->gd_spinlocks_wr; - if ((value = atomic_swap_int(&mtx->lock, SPINLOCK_EXCLUSIVE)) != 0) - return (spin_trylock_wr_contested2(gd)); -#ifdef SMP + if (atomic_swap_int(&spin->counta, 1)) + return (spin_trylock_contested(spin)); #ifdef DEBUG_LOCKS int i; for (i = 0; i < SPINLOCK_DEBUG_ARRAY_SIZE; i++) { if (gd->gd_curthread->td_spinlock_stack_id[i] == 0) { gd->gd_curthread->td_spinlock_stack_id[i] = 1; - gd->gd_curthread->td_spinlock_stack[i] = mtx; + gd->gd_curthread->td_spinlock_stack[i] = spin; gd->gd_curthread->td_spinlock_caller_pc[i] = __builtin_return_address(0); break; } } -#endif #endif return (TRUE); } @@ -95,7 +96,7 @@ spin_trylock(struct spinlock *mtx) #else static __inline boolean_t -spin_trylock(struct spinlock *mtx) +spin_trylock(struct spinlock *spin) { globaldata_t gd = mycpu; @@ -107,28 +108,33 @@ spin_trylock(struct spinlock *mtx) #endif +/* + * Return TRUE if the spinlock is held (we can't tell by whom, though) + */ +static __inline int +spin_held(struct spinlock *spin) +{ + return(spin->counta != 0); +} + /* * Obtain an exclusive spinlock and return. 
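The new struct spinlock comment describes a chasing counter: a waiter fetchadds countb and spins (or MWAITs) until counta catches up, and release advances counta. Stripped of the kernel's fast-path/contested-path split and its atomic_*_int()/MWAIT primitives, the idea behaves like a ticket lock; the stand-alone C11 model below illustrates only that idea, not the kernel's exact implementation.

    #include <stdatomic.h>

    struct model_spinlock {
        atomic_int counta;          /* "now serving" */
        atomic_int countb;          /* next ticket   */
    };

    static void model_spin_lock(struct model_spinlock *spin)
    {
        int ticket = atomic_fetch_add(&spin->countb, 1);
        while (atomic_load(&spin->counta) != ticket)
            ;                       /* the kernel would MWAIT or pause here */
    }

    static void model_spin_unlock(struct model_spinlock *spin)
    {
        atomic_fetch_add(&spin->counta, 1);
    }

    int main(void)
    {
        struct model_spinlock spin;

        atomic_init(&spin.counta, 0);
        atomic_init(&spin.countb, 0);
        model_spin_lock(&spin);
        model_spin_unlock(&spin);
        return 0;
    }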
*/ static __inline void -spin_lock_quick(globaldata_t gd, struct spinlock *mtx) +spin_lock_quick(globaldata_t gd, struct spinlock *spin) { -#ifdef SMP - int value; -#endif - ++gd->gd_curthread->td_critcount; cpu_ccfence(); ++gd->gd_spinlocks_wr; #ifdef SMP - if ((value = atomic_swap_int(&mtx->lock, SPINLOCK_EXCLUSIVE)) != 0) - spin_lock_wr_contested2(mtx); + if (atomic_swap_int(&spin->counta, 1)) + spin_lock_contested(spin); #ifdef DEBUG_LOCKS int i; for (i = 0; i < SPINLOCK_DEBUG_ARRAY_SIZE; i++) { if (gd->gd_curthread->td_spinlock_stack_id[i] == 0) { gd->gd_curthread->td_spinlock_stack_id[i] = 1; - gd->gd_curthread->td_spinlock_stack[i] = mtx; + gd->gd_curthread->td_spinlock_stack[i] = spin; gd->gd_curthread->td_spinlock_caller_pc[i] = __builtin_return_address(0); break; @@ -139,9 +145,9 @@ spin_lock_quick(globaldata_t gd, struct spinlock *mtx) } static __inline void -spin_lock(struct spinlock *mtx) +spin_lock(struct spinlock *spin) { - spin_lock_quick(mycpu, mtx); + spin_lock_quick(mycpu, spin); } /* @@ -150,14 +156,14 @@ spin_lock(struct spinlock *mtx) * cleared. */ static __inline void -spin_unlock_quick(globaldata_t gd, struct spinlock *mtx) +spin_unlock_quick(globaldata_t gd, struct spinlock *spin) { #ifdef SMP #ifdef DEBUG_LOCKS int i; for (i = 0; i < SPINLOCK_DEBUG_ARRAY_SIZE; i++) { if ((gd->gd_curthread->td_spinlock_stack_id[i] == 1) && - (gd->gd_curthread->td_spinlock_stack[i] == mtx)) { + (gd->gd_curthread->td_spinlock_stack[i] == spin)) { gd->gd_curthread->td_spinlock_stack_id[i] = 0; gd->gd_curthread->td_spinlock_stack[i] = NULL; gd->gd_curthread->td_spinlock_caller_pc[i] = NULL; @@ -165,34 +171,62 @@ spin_unlock_quick(globaldata_t gd, struct spinlock *mtx) } } #endif - mtx->lock = 0; + /* + * Don't use a locked instruction here. + */ + KKASSERT(spin->counta != 0); + cpu_sfence(); + spin->counta = 0; + cpu_sfence(); #endif KKASSERT(gd->gd_spinlocks_wr > 0); --gd->gd_spinlocks_wr; cpu_ccfence(); --gd->gd_curthread->td_critcount; +#if 0 + if (__predict_false(gd->gd_reqflags & RQF_IDLECHECK_MASK)) + lwkt_maybe_splz(gd->gd_curthread); +#endif } static __inline void -spin_unlock(struct spinlock *mtx) +spin_unlock(struct spinlock *spin) { - spin_unlock_quick(mycpu, mtx); + spin_unlock_quick(mycpu, spin); } static __inline void -spin_init(struct spinlock *mtx) +spin_pool_lock(void *chan) { - mtx->lock = 0; +#ifdef SMP + _spin_pool_lock(chan); +#else + spin_lock(NULL); +#endif } static __inline void -spin_uninit(struct spinlock *mtx) +spin_pool_unlock(void *chan) { - /* unused */ +#ifdef SMP + _spin_pool_unlock(chan); +#else + spin_unlock(NULL); +#endif } -struct spinlock *spin_pool_lock(void *); -void spin_pool_unlock(void *); +static __inline void +spin_init(struct spinlock *spin) +{ + spin->counta = 0; + spin->countb = 0; +} + +static __inline void +spin_uninit(struct spinlock *spin) +{ + /* unused */ +} #endif /* _KERNEL */ #endif /* _SYS_SPINLOCK2_H_ */ diff --git a/sys/sys/thread.h b/sys/sys/thread.h index 77abda57ec..fbaf8ab455 100644 --- a/sys/sys/thread.h +++ b/sys/sys/thread.h @@ -102,7 +102,7 @@ struct intrframe; typedef struct lwkt_token { struct lwkt_tokref *t_ref; /* Owning ref or NULL */ long t_collisions; /* Collision counter */ - cpumask_t t_collmask; /* Collision cpu mask for resched */ + cpumask_t t_collmask; /* Collision resolve mask */ const char *t_desc; /* Descriptive name */ } lwkt_token; @@ -161,7 +161,7 @@ struct lwkt_tokref { struct thread *tr_owner; /* me */ }; -#define MAXCPUFIFO 16 /* power of 2 */ +#define MAXCPUFIFO 32 /* power of 2 */ #define 
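spin_pool_lock()/spin_pool_unlock() map an arbitrary channel pointer onto _spin_pool_lock()/_spin_pool_unlock(), whose bodies are not part of this hunk. A pool lock of this sort is typically a hash from the pointer to a fixed table of locks; the sketch below is only a guess at that shape (the table size, shift, and names are made up), not the kernel's actual implementation.

    #include <pthread.h>
    #include <stdint.h>

    #define POOL_SIZE 64                    /* illustrative size only */

    static pthread_mutex_t pool[POOL_SIZE];

    static void pool_init(void)
    {
        for (int i = 0; i < POOL_SIZE; i++)
            pthread_mutex_init(&pool[i], NULL);
    }

    static pthread_mutex_t *pool_lookup(const void *chan)
    {
        /* fold the pointer bits down to a table index */
        uintptr_t v = (uintptr_t)chan;
        return &pool[(v >> 6) % POOL_SIZE];
    }

    static void pool_lock(const void *chan)   { pthread_mutex_lock(pool_lookup(chan)); }
    static void pool_unlock(const void *chan) { pthread_mutex_unlock(pool_lookup(chan)); }

    int main(void)
    {
        int some_channel;

        pool_init();
        pool_lock(&some_channel);
        pool_unlock(&some_channel);
        return 0;
    }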
MAXCPUFIFO_MASK (MAXCPUFIFO - 1) #define LWKT_MAXTOKENS 32 /* max tokens beneficially held by thread */ @@ -178,10 +178,12 @@ typedef struct lwkt_ipiq { int ip_rindex; /* only written by target cpu */ int ip_xindex; /* written by target, indicates completion */ int ip_windex; /* only written by source cpu */ - ipifunc3_t ip_func[MAXCPUFIFO]; - void *ip_arg1[MAXCPUFIFO]; - int ip_arg2[MAXCPUFIFO]; - u_int ip_npoll; /* synchronization to avoid excess IPIs */ + struct { + ipifunc3_t func; + void *arg1; + int arg2; + char filler[32 - sizeof(int) - sizeof(void *) * 2]; + } ip_info[MAXCPUFIFO]; } lwkt_ipiq; /* @@ -439,7 +441,7 @@ extern int lwkt_trytoken(lwkt_token_t); extern void lwkt_reltoken(lwkt_token_t); extern void lwkt_reltoken_hard(lwkt_token_t); extern int lwkt_cnttoken(lwkt_token_t, thread_t); -extern int lwkt_getalltokens(thread_t); +extern int lwkt_getalltokens(thread_t, int); extern void lwkt_relalltokens(thread_t); extern void lwkt_drain_token_requests(void); extern void lwkt_token_init(lwkt_token_t, const char *); diff --git a/sys/sys/time.h b/sys/sys/time.h index df757e3e28..ae2f107d9a 100644 --- a/sys/sys/time.h +++ b/sys/sys/time.h @@ -215,6 +215,7 @@ int tstohz_high (struct timespec *); int tstohz_low (struct timespec *); int64_t tsc_get_target(int ns); int tsc_test_target(int64_t target); +void tsc_delay(int ns); #else /* !_KERNEL */ diff --git a/sys/sys/vnode.h b/sys/sys/vnode.h index fd95f86bb3..8d0008bd82 100644 --- a/sys/sys/vnode.h +++ b/sys/sys/vnode.h @@ -144,16 +144,17 @@ struct mountctl_opt { * associated with the vnode. Otherwise it will be set to NOOFFSET. * * NOTE: The following fields require a spin or token lock. Note that - * additional subsystems may use v_token or v_spinlock for other + * additional subsystems may use v_token or v_spin for other * purposes, e.g. vfs/fifofs/fifo_vnops.c * - * v_namecache v_spinlock + * v_namecache v_spin * v_rb* v_token */ RB_HEAD(buf_rb_tree, buf); RB_HEAD(buf_rb_hash, buf); struct vnode { + struct spinlock v_spin; int v_flag; /* vnode flags (see below) */ int v_writecount; int v_opencount; /* number of explicit opens */ @@ -208,7 +209,6 @@ struct vnode { #define v_rdev v_un.vu_cdev.vu_cdevinfo #define v_cdevnext v_un.vu_cdev.vu_cdevnext #define v_fifoinfo v_un.vu_fifoinfo -#define v_spinlock v_lock.lk_spinlock /* * Vnode flags. diff --git a/sys/vfs/devfs/devfs_vnops.c b/sys/vfs/devfs/devfs_vnops.c index bb13ec13d2..dcc88fd661 100644 --- a/sys/vfs/devfs/devfs_vnops.c +++ b/sys/vfs/devfs/devfs_vnops.c @@ -2029,7 +2029,7 @@ devfs_spec_getpages(struct vop_getpages_args *ap) */ if (!error || (m->valid == VM_PAGE_BITS_ALL)) { if (m->valid) { - if (m->flags & PG_WANTED) { + if (m->flags & PG_REFERENCED) { vm_page_activate(m); } else { vm_page_deactivate(m); diff --git a/sys/vfs/nwfs/nwfs_io.c b/sys/vfs/nwfs/nwfs_io.c index caa5d1db9c..50771ddbee 100644 --- a/sys/vfs/nwfs/nwfs_io.c +++ b/sys/vfs/nwfs/nwfs_io.c @@ -472,7 +472,7 @@ nwfs_getpages(struct vop_getpages_args *ap) * now tell them that it is ok to use. 
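The restructured lwkt_ipiq packs each func/arg1/arg2 triple plus filler into a fixed-size slot; the filler expression works out so that each slot is exactly 32 bytes (two slots per 64-byte cache line on x86-64). A user-space copy of the layout can check that arithmetic at compile time; the function-pointer member below merely stands in for ipifunc3_t.

    struct fake_ipiq_slot {
        void (*func)(void);         /* stands in for ipifunc3_t */
        void  *arg1;
        int    arg2;
        char   filler[32 - sizeof(int) - sizeof(void *) * 2];
    };

    _Static_assert(sizeof(struct fake_ipiq_slot) == 32,
                   "each IPI FIFO slot should occupy exactly 32 bytes");

    int main(void)
    {
        return 0;
    }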
*/ if (!error) { - if (m->flags & PG_WANTED) + if (m->flags & PG_REFERENCED) vm_page_activate(m); else vm_page_deactivate(m); diff --git a/sys/vfs/procfs/procfs_map.c b/sys/vfs/procfs/procfs_map.c index 7ae9ec9275..f75df879e1 100644 --- a/sys/vfs/procfs/procfs_map.c +++ b/sys/vfs/procfs/procfs_map.c @@ -100,6 +100,9 @@ procfs_domap(struct proc *curp, struct lwp *lp, struct pfsnode *pfs, } obj = entry->object.vm_object; + if (obj) + vm_object_hold(obj); + if (obj && (obj->shadow_count == 1)) privateresident = obj->resident_page_count; else @@ -117,13 +120,28 @@ procfs_domap(struct proc *curp, struct lwp *lp, struct pfsnode *pfs, resident = 0; addr = entry->start; while (addr < entry->end) { - if (pmap_extract( pmap, addr)) + if (pmap_extract(pmap, addr)) resident++; addr += PAGE_SIZE; } - - for( lobj = tobj = obj; tobj; tobj = tobj->backing_object) - lobj = tobj; + if (obj) { + lobj = obj; + while ((tobj = lobj->backing_object) != NULL) { + KKASSERT(tobj != obj); + vm_object_hold(tobj); + if (tobj == lobj->backing_object) { + if (lobj != obj) { + vm_object_lock_swap(); + vm_object_drop(lobj); + } + lobj = tobj; + } else { + vm_object_drop(tobj); + } + } + } else { + lobj = NULL; + } freepath = NULL; fullpath = "-"; @@ -156,6 +174,8 @@ procfs_domap(struct proc *curp, struct lwp *lp, struct pfsnode *pfs, vn_fullpath(p, vp, &fullpath, &freepath, 1); vrele(vp); } + if (lobj != obj) + vm_object_drop(lobj); } else { type = "none"; flags = 0; @@ -179,6 +199,9 @@ procfs_domap(struct proc *curp, struct lwp *lp, struct pfsnode *pfs, (entry->eflags & MAP_ENTRY_NEEDS_COPY)?"NC":"NNC", type, fullpath); + if (obj) + vm_object_drop(obj); + if (freepath != NULL) { kfree(freepath, M_TEMP); freepath = NULL; diff --git a/sys/vfs/smbfs/smbfs_io.c b/sys/vfs/smbfs/smbfs_io.c index 26023d5ca1..faab8d000e 100644 --- a/sys/vfs/smbfs/smbfs_io.c +++ b/sys/vfs/smbfs/smbfs_io.c @@ -510,7 +510,7 @@ smbfs_getpages(struct vop_getpages_args *ap) * now tell them that it is ok to use. */ if (!error) { - if (m->flags & PG_WANTED) + if (m->flags & PG_REFERENCED) vm_page_activate(m); else vm_page_deactivate(m); diff --git a/sys/vm/device_pager.c b/sys/vm/device_pager.c index 8b3590a50d..e3900bc533 100644 --- a/sys/vm/device_pager.c +++ b/sys/vm/device_pager.c @@ -136,9 +136,11 @@ dev_pager_alloc(void *handle, off_t size, vm_prot_t prot, off_t foff) /* * Gain a reference to the object. 
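The procfs map code now descends the backing_object chain by holding the next object, re-checking that it is still lobj->backing_object (the link can change while the hold blocks), and only then dropping the previous level; vm_object_lock_swap() reorders token precedence before that drop and has no user-space analogue, so it is omitted from the model. The structures and helper below are illustrative stand-ins.

    #include <pthread.h>
    #include <stddef.h>

    struct fake_obj {
        pthread_mutex_t  lock;
        struct fake_obj *backing;
    };

    /* obj must already be locked by the caller; returns the bottom of the
     * chain, also locked (it may be obj itself), mirroring the procfs walk */
    static struct fake_obj *walk_to_bottom(struct fake_obj *obj)
    {
        struct fake_obj *lobj = obj;
        struct fake_obj *tobj;

        while ((tobj = lobj->backing) != NULL) {
            pthread_mutex_lock(&tobj->lock);        /* hold the child */
            if (tobj == lobj->backing) {            /* link still the same? */
                if (lobj != obj)
                    pthread_mutex_unlock(&lobj->lock);
                lobj = tobj;                        /* step down one level */
            } else {
                pthread_mutex_unlock(&tobj->lock);  /* raced; retry this level */
            }
        }
        return lobj;
    }

    int main(void)
    {
        struct fake_obj bottom = { PTHREAD_MUTEX_INITIALIZER, NULL };
        struct fake_obj top    = { PTHREAD_MUTEX_INITIALIZER, &bottom };
        struct fake_obj *lobj;

        pthread_mutex_lock(&top.lock);              /* caller holds the top object */
        lobj = walk_to_bottom(&top);
        if (lobj != &top)
            pthread_mutex_unlock(&lobj->lock);
        pthread_mutex_unlock(&top.lock);
        return 0;
    }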
*/ - vm_object_reference(object); + vm_object_hold(object); + vm_object_reference_locked(object); if (OFF_TO_IDX(foff + size) > object->size) object->size = OFF_TO_IDX(foff + size); + vm_object_drop(object); } mtx_unlock(&dev_pager_mtx); @@ -210,12 +212,10 @@ dev_pager_getpage(vm_object_t object, vm_page_t *mpp, int seqaccess) page = dev_pager_getfake(paddr); TAILQ_INSERT_TAIL(&object->un_pager.devp.devp_pglist, page, pageq); - lwkt_gettoken(&vm_token); vm_object_hold(object); vm_page_free(*mpp); vm_page_insert(page, object, offset); vm_object_drop(object); - lwkt_reltoken(&vm_token); } mtx_unlock(&dev_pager_mtx); return (VM_PAGER_OK); diff --git a/sys/vm/phys_pager.c b/sys/vm/phys_pager.c index a82180d8b0..918b8a07f3 100644 --- a/sys/vm/phys_pager.c +++ b/sys/vm/phys_pager.c @@ -81,7 +81,6 @@ phys_pager_getpage(vm_object_t object, vm_page_t *mpp, int seqaccess) { vm_page_t m = *mpp; - lwkt_gettoken(&vm_token); if ((m->flags & PG_ZERO) == 0) vm_page_zero_fill(m); vm_page_flag_set(m, PG_ZERO); @@ -89,7 +88,6 @@ phys_pager_getpage(vm_object_t object, vm_page_t *mpp, int seqaccess) vm_page_unmanage(m); m->valid = VM_PAGE_BITS_ALL; m->dirty = 0; - lwkt_reltoken(&vm_token); return (VM_PAGER_OK); } diff --git a/sys/vm/pmap.h b/sys/vm/pmap.h index e926e3d6b7..1f3ae7a273 100644 --- a/sys/vm/pmap.h +++ b/sys/vm/pmap.h @@ -77,6 +77,10 @@ #include #endif +#ifndef _SYS_SPINLOCK_H_ +#include +#endif + #ifndef _MACHINE_PMAP_H_ #include #endif @@ -179,7 +183,6 @@ void pmap_kmodify_nc(vm_offset_t va); void pmap_kremove (vm_offset_t); void pmap_kremove_quick (vm_offset_t); void pmap_reference (pmap_t); -void pmap_release (pmap_t); void pmap_remove (pmap_t, vm_offset_t, vm_offset_t); void pmap_remove_pages (pmap_t, vm_offset_t, vm_offset_t); void pmap_zero_page (vm_paddr_t); diff --git a/sys/vm/swap_pager.c b/sys/vm/swap_pager.c index 6005ce773d..7d02dd2c01 100644 --- a/sys/vm/swap_pager.c +++ b/sys/vm/swap_pager.c @@ -423,11 +423,11 @@ swap_pager_alloc(void *handle, off_t size, vm_prot_t prot, off_t offset) vm_object_t object; KKASSERT(handle == NULL); - lwkt_gettoken(&vm_token); object = vm_object_allocate(OBJT_DEFAULT, OFF_TO_IDX(offset + PAGE_MASK + size)); + vm_object_hold(object); swp_pager_meta_convert(object); - lwkt_reltoken(&vm_token); + vm_object_drop(object); return (object); } @@ -446,7 +446,7 @@ swap_pager_alloc(void *handle, off_t size, vm_prot_t prot, off_t offset) static void swap_pager_dealloc(vm_object_t object) { - lwkt_gettoken(&vm_token); + vm_object_hold(object); vm_object_pip_wait(object, "swpdea"); /* @@ -456,7 +456,7 @@ swap_pager_dealloc(vm_object_t object) * if paging is still in progress on some objects. */ swp_pager_meta_free_all(object); - lwkt_reltoken(&vm_token); + vm_object_drop(object); } /************************************************************************ @@ -473,19 +473,15 @@ swap_pager_dealloc(vm_object_t object) * Also has the side effect of advising that somebody made a mistake * when they configured swap and didn't configure enough. * - * The caller must hold vm_token. + * The caller must hold the object. * This routine may not block. - * - * NOTE: vm_token must be held to avoid races with bitmap frees from - * vm_page_remove() via swap_pager_page_removed(). 
*/ static __inline swblk_t swp_pager_getswapspace(vm_object_t object, int npages) { swblk_t blk; - ASSERT_LWKT_TOKEN_HELD(&vm_token); - + lwkt_gettoken(&vm_token); if ((blk = blist_alloc(swapblist, npages)) == SWAPBLK_NONE) { if (swap_pager_full != 2) { kprintf("swap_pager_getswapspace: failed\n"); @@ -500,6 +496,7 @@ swp_pager_getswapspace(vm_object_t object, int npages) vm_swap_cache_use += npages; swp_sizecheck(); } + lwkt_reltoken(&vm_token); return(blk); } @@ -514,7 +511,6 @@ swp_pager_getswapspace(vm_object_t object, int npages) * We must be called at splvm() to avoid races with bitmap frees from * vm_page_remove() aka swap_pager_page_removed(). * - * The caller must hold vm_token. * This routine may not block. */ @@ -523,18 +519,22 @@ swp_pager_freeswapspace(vm_object_t object, swblk_t blk, int npages) { struct swdevt *sp = &swdevt[BLK2DEVIDX(blk)]; + lwkt_gettoken(&vm_token); sp->sw_nused -= npages; if (object->type == OBJT_SWAP) vm_swap_anon_use -= npages; else vm_swap_cache_use -= npages; - if (sp->sw_flags & SW_CLOSING) + if (sp->sw_flags & SW_CLOSING) { + lwkt_reltoken(&vm_token); return; + } blist_free(swapblist, blk, npages); vm_swap_size += npages; swp_sizecheck(); + lwkt_reltoken(&vm_token); } /* @@ -554,9 +554,9 @@ swp_pager_freeswapspace(vm_object_t object, swblk_t blk, int npages) void swap_pager_freespace(vm_object_t object, vm_pindex_t start, vm_pindex_t size) { - lwkt_gettoken(&vm_token); + vm_object_hold(object); swp_pager_meta_free(object, start, size); - lwkt_reltoken(&vm_token); + vm_object_drop(object); } /* @@ -565,9 +565,9 @@ swap_pager_freespace(vm_object_t object, vm_pindex_t start, vm_pindex_t size) void swap_pager_freespace_all(vm_object_t object) { - lwkt_gettoken(&vm_token); + vm_object_hold(object); swp_pager_meta_free_all(object); - lwkt_reltoken(&vm_token); + vm_object_drop(object); } /* @@ -584,7 +584,7 @@ swap_pager_freespace_all(vm_object_t object) * * If we exhaust the object we will return a value n <= count. * - * The caller must hold vm_token. + * The caller must hold the object. * * WARNING! If count == 0 then -1 can be returned as a degenerate case, * callers should always pass a count value > 0. @@ -596,7 +596,7 @@ swap_pager_condfree(vm_object_t object, vm_pindex_t *basei, int count) { struct swfreeinfo info; - ASSERT_LWKT_TOKEN_HELD(&vm_token); + ASSERT_LWKT_TOKEN_HELD(vm_object_token(object)); info.object = object; info.basei = *basei; /* skip up to this page index */ @@ -619,7 +619,7 @@ swap_pager_condfree(vm_object_t object, vm_pindex_t *basei, int count) * We do not have to deal with clearing PG_SWAPPED in related VM * pages because there are no related VM pages. * - * The caller must hold vm_token. + * The caller must hold the object. 
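swp_pager_getswapspace() and swp_pager_freeswapspace() now take vm_token themselves, only around the shared blist and the global swap counters, while the caller holds the per-object token. The resulting split is roughly the two-level scheme modeled below: a short-lived global lock for shared allocator state, with per-object state protected by the lock the caller already holds. The "blist" and counter here are trivial fakes.

    #include <pthread.h>

    typedef long swblk_t;                            /* stand-in type */

    static pthread_mutex_t global_alloc_lock = PTHREAD_MUTEX_INITIALIZER;
    static swblk_t next_free_blk;                    /* fake blist cursor   */
    static long    swap_in_use;                      /* fake global counter */

    /* caller holds the per-object lock; the global lock is taken only
     * around the shared allocator state, mirroring the narrowed vm_token */
    static swblk_t fake_getswapspace(int npages)
    {
        swblk_t blk;

        pthread_mutex_lock(&global_alloc_lock);
        blk = next_free_blk;
        next_free_blk += npages;
        swap_in_use += npages;
        pthread_mutex_unlock(&global_alloc_lock);
        return blk;
    }

    int main(void)
    {
        return (int)fake_getswapspace(4);
    }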
*/ static int swap_pager_condfree_callback(struct swblock *swap, void *data) @@ -654,10 +654,10 @@ void swap_pager_page_inserted(vm_page_t m) { if (m->object->swblock_count) { - lwkt_gettoken(&vm_token); + vm_object_hold(m->object); if (swp_pager_meta_ctl(m->object, m->pindex, 0) != SWAPBLK_NONE) vm_page_flag_set(m, PG_SWAPPED); - lwkt_reltoken(&vm_token); + vm_object_drop(m->object); } } @@ -679,7 +679,8 @@ swap_pager_reserve(vm_object_t object, vm_pindex_t start, vm_size_t size) swblk_t blk = SWAPBLK_NONE; vm_pindex_t beg = start; /* save start index */ - lwkt_gettoken(&vm_token); + vm_object_hold(object); + while (size) { if (n == 0) { n = BLIST_MAX_ALLOC; @@ -690,7 +691,7 @@ swap_pager_reserve(vm_object_t object, vm_pindex_t start, vm_size_t size) if (n == 0) { swp_pager_meta_free(object, beg, start - beg); - lwkt_reltoken(&vm_token); + vm_object_drop(object); return(-1); } } @@ -702,7 +703,7 @@ swap_pager_reserve(vm_object_t object, vm_pindex_t start, vm_size_t size) --n; } swp_pager_meta_free(object, start, n); - lwkt_reltoken(&vm_token); + vm_object_drop(object); return(0); } @@ -718,21 +719,15 @@ swap_pager_reserve(vm_object_t object, vm_pindex_t start, vm_size_t size) * indirectly through swp_pager_meta_build() or if paging is still in * progress on the source. * - * This routine can be called at any spl - * * XXX vm_page_collapse() kinda expects us not to block because we * supposedly do not need to allocate memory, but for the moment we * *may* have to get a little memory from the zone allocator, but * it is taken from the interrupt memory. We should be ok. * * The source object contains no vm_page_t's (which is just as well) - * * The source object is of type OBJT_SWAP. * - * The source and destination objects must be locked or - * inaccessible (XXX are they ?) - * - * The caller must hold vm_token. + * The source and destination objects must be held by the caller. */ void swap_pager_copy(vm_object_t srcobject, vm_object_t dstobject, @@ -740,7 +735,8 @@ swap_pager_copy(vm_object_t srcobject, vm_object_t dstobject, { vm_pindex_t i; - ASSERT_LWKT_TOKEN_HELD(&vm_token); + ASSERT_LWKT_TOKEN_HELD(vm_object_token(srcobject)); + ASSERT_LWKT_TOKEN_HELD(vm_object_token(dstobject)); /* * transfer source to destination. @@ -819,15 +815,14 @@ swap_pager_haspage(vm_object_t object, vm_pindex_t pindex) /* * do we have good backing store at the requested index ? */ - - lwkt_gettoken(&vm_token); + vm_object_hold(object); blk0 = swp_pager_meta_ctl(object, pindex, 0); if (blk0 == SWAPBLK_NONE) { - lwkt_reltoken(&vm_token); + vm_object_drop(object); return (FALSE); } - lwkt_reltoken(&vm_token); + vm_object_drop(object); return (TRUE); } @@ -848,18 +843,18 @@ swap_pager_haspage(vm_object_t object, vm_pindex_t pindex) * depends on it. * * The page must be busied or soft-busied. - * The caller must hold vm_token if the caller does not wish to block here. + * The caller can hold the object to avoid blocking, else we might block. * No other requirements. */ void swap_pager_unswapped(vm_page_t m) { if (m->flags & PG_SWAPPED) { - lwkt_gettoken(&vm_token); + vm_object_hold(m->object); KKASSERT(m->flags & PG_SWAPPED); swp_pager_meta_ctl(m->object, m->pindex, SWM_FREE); vm_page_flag_clear(m, PG_SWAPPED); - lwkt_reltoken(&vm_token); + vm_object_drop(m->object); } } @@ -933,9 +928,9 @@ swap_pager_strategy(vm_object_t object, struct bio *bio) * FREE PAGE(s) - destroy underlying swap that is no longer * needed. 
*/ - lwkt_gettoken(&vm_token); + vm_object_hold(object); swp_pager_meta_free(object, start, count); - lwkt_reltoken(&vm_token); + vm_object_drop(object); bp->b_resid = 0; biodone(bio); return; @@ -960,7 +955,8 @@ swap_pager_strategy(vm_object_t object, struct bio *bio) /* * Execute read or write */ - lwkt_gettoken(&vm_token); + vm_object_hold(object); + while (count > 0) { swblk_t blk; @@ -1045,7 +1041,8 @@ swap_pager_strategy(vm_object_t object, struct bio *bio) ++start; data += PAGE_SIZE; } - lwkt_reltoken(&vm_token); + + vm_object_drop(object); /* * Flush out last buffer @@ -1189,10 +1186,13 @@ swap_pager_getpage(vm_object_t object, vm_page_t *mpp, int seqaccess) int i; int j; int raonly; + int error; + u_int32_t flags; vm_page_t marray[XIO_INTERNAL_PAGES]; mreq = *mpp; + vm_object_hold(object); if (mreq->object != object) { panic("swap_pager_getpages: object mismatch %p/%p", object, @@ -1210,33 +1210,38 @@ swap_pager_getpage(vm_object_t object, vm_page_t *mpp, int seqaccess) * set on the last page of the read-ahead to continue the pipeline. */ if (mreq->valid == VM_PAGE_BITS_ALL) { - if (swap_burst_read == 0 || mreq->pindex + 1 >= object->size) + if (swap_burst_read == 0 || mreq->pindex + 1 >= object->size) { + vm_object_drop(object); return(VM_PAGER_OK); - lwkt_gettoken(&vm_token); + } blk = swp_pager_meta_ctl(object, mreq->pindex + 1, 0); if (blk == SWAPBLK_NONE) { - lwkt_reltoken(&vm_token); + vm_object_drop(object); return(VM_PAGER_OK); } - m = vm_page_lookup(object, mreq->pindex + 1); - if (m == NULL) { + m = vm_page_lookup_busy_try(object, mreq->pindex + 1, + TRUE, &error); + if (error) { + vm_object_drop(object); + return(VM_PAGER_OK); + } else if (m == NULL) { m = vm_page_alloc(object, mreq->pindex + 1, VM_ALLOC_QUICK); if (m == NULL) { - lwkt_reltoken(&vm_token); + vm_object_drop(object); return(VM_PAGER_OK); } } else { - if ((m->flags & PG_BUSY) || m->busy || m->valid) { - lwkt_reltoken(&vm_token); + if (m->valid) { + vm_page_wakeup(m); + vm_object_drop(object); return(VM_PAGER_OK); } vm_page_unqueue_nowakeup(m); - vm_page_busy(m); } + /* page is busy */ mreq = m; raonly = 1; - lwkt_reltoken(&vm_token); } else { raonly = 0; } @@ -1250,7 +1255,6 @@ swap_pager_getpage(vm_object_t object, vm_page_t *mpp, int seqaccess) * Note that blk and iblk can be SWAPBLK_NONE but the loop is * set up such that the case(s) are handled implicitly. */ - lwkt_gettoken(&vm_token); blk = swp_pager_meta_ctl(mreq->object, mreq->pindex, 0); marray[0] = mreq; @@ -1264,25 +1268,28 @@ swap_pager_getpage(vm_object_t object, vm_page_t *mpp, int seqaccess) break; if ((blk ^ iblk) & dmmax_mask) break; - m = vm_page_lookup(object, mreq->pindex + i); - if (m == NULL) { + m = vm_page_lookup_busy_try(object, mreq->pindex + i, + TRUE, &error); + if (error) { + break; + } else if (m == NULL) { m = vm_page_alloc(object, mreq->pindex + i, VM_ALLOC_QUICK); if (m == NULL) break; } else { - if ((m->flags & PG_BUSY) || m->busy || m->valid) + if (m->valid) { + vm_page_wakeup(m); break; + } vm_page_unqueue_nowakeup(m); - vm_page_busy(m); } + /* page is busy */ marray[i] = m; } if (i > 1) vm_page_flag_set(marray[i - 1], PG_RAM); - lwkt_reltoken(&vm_token); - /* * If mreq is the requested page and we have nothing to do return * VM_PAGER_FAIL. 
If raonly is set mreq is just another read-ahead @@ -1292,8 +1299,10 @@ swap_pager_getpage(vm_object_t object, vm_page_t *mpp, int seqaccess) KKASSERT(i == 1); if (raonly) { vnode_pager_freepage(mreq); + vm_object_drop(object); return(VM_PAGER_OK); } else { + vm_object_drop(object); return(VM_PAGER_FAIL); } } @@ -1357,17 +1366,26 @@ swap_pager_getpage(vm_object_t object, vm_page_t *mpp, int seqaccess) * If this is a read-ahead only we return immediately without * waiting for I/O. */ - if (raonly) + if (raonly) { + vm_object_drop(object); return(VM_PAGER_OK); + } /* * Read-ahead includes originally requested page case. */ - lwkt_gettoken(&vm_token); - while ((mreq->flags & PG_SWAPINPROG) != 0) { - vm_page_flag_set(mreq, PG_WANTED | PG_REFERENCED); + for (;;) { + flags = mreq->flags; + cpu_ccfence(); + if ((flags & PG_SWAPINPROG) == 0) + break; + tsleep_interlock(mreq, 0); + if (!atomic_cmpset_int(&mreq->flags, flags, + flags | PG_WANTED | PG_REFERENCED)) { + continue; + } mycpu->gd_cnt.v_intrans++; - if (tsleep(mreq, 0, "swread", hz*20)) { + if (tsleep(mreq, PINTERLOCKED, "swread", hz*20)) { kprintf( "swap_pager: indefinite wait buffer: " " offset: %lld, size: %ld\n", @@ -1376,13 +1394,13 @@ swap_pager_getpage(vm_object_t object, vm_page_t *mpp, int seqaccess) ); } } - lwkt_reltoken(&vm_token); /* * mreq is left bussied after completion, but all the other pages * are freed. If we had an unrecoverable read error the page will * not be valid. */ + vm_object_drop(object); if (mreq->valid != VM_PAGE_BITS_ALL) return(VM_PAGER_ERROR); else @@ -1427,6 +1445,8 @@ swap_pager_putpages(vm_object_t object, vm_page_t *m, int count, int i; int n = 0; + vm_object_hold(object); + if (count && m[0]->object != object) { panic("swap_pager_getpages: object mismatch %p/%p", object, @@ -1442,10 +1462,8 @@ swap_pager_putpages(vm_object_t object, vm_page_t *m, int count, * force sync if not pageout process */ if (object->type == OBJT_DEFAULT) { - lwkt_gettoken(&vm_token); if (object->type == OBJT_DEFAULT) swp_pager_meta_convert(object); - lwkt_reltoken(&vm_token); } if (curthread != pagethread) @@ -1474,6 +1492,8 @@ swap_pager_putpages(vm_object_t object, vm_page_t *m, int count, * Adjust difference ( if possible ). If the current async * count is too low, we may not be able to make the adjustment * at this time. + * + * vm_token needed for nsw_wcount sleep interlock */ lwkt_gettoken(&vm_token); n -= nsw_wcount_async_max; @@ -1548,6 +1568,8 @@ swap_pager_putpages(vm_object_t object, vm_page_t *m, int count, bp = getpbuf_kva(&nsw_wcount_async); bio = &bp->b_bio1; + lwkt_reltoken(&vm_token); + pmap_qenter((vm_offset_t)bp->b_data, &m[i], n); bp->b_bcount = PAGE_SIZE * n; @@ -1570,8 +1592,6 @@ swap_pager_putpages(vm_object_t object, vm_page_t *m, int count, mycpu->gd_cnt.v_swapout++; mycpu->gd_cnt.v_swappgsout += bp->b_xio.xio_npages; - lwkt_reltoken(&vm_token); - bp->b_dirtyoff = 0; /* req'd for NFS */ bp->b_dirtyend = bp->b_bcount; /* req'd for NFS */ bp->b_cmd = BUF_CMD_WRITE; @@ -1613,6 +1633,7 @@ swap_pager_putpages(vm_object_t object, vm_page_t *m, int count, */ swp_pager_async_iodone(bio); } + vm_object_drop(object); } /* @@ -1668,7 +1689,6 @@ swp_pager_async_iodone(struct bio *bio) */ if (bp->b_xio.xio_npages) object = bp->b_xio.xio_pages[0]->object; - lwkt_gettoken(&vm_token); /* * remove the mapping for kernel virtual @@ -1751,6 +1771,7 @@ swp_pager_async_iodone(struct bio *bio) * been dirty in the first place, and they * do have backing store (the vnode). 
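The PG_SWAPINPROG wait above was rewritten around tsleep_interlock() plus an atomic_cmpset of the flags word, so setting PG_WANTED and registering for the sleep cannot race the I/O completion side clearing PG_SWAPINPROG. In user space the same lost-wakeup problem is conventionally solved with a mutex/condvar pair, as in this rough equivalent (the flag names mirror the kernel's, but the mechanism is pthreads, not tsleep):

    #include <pthread.h>

    #define PG_SWAPINPROG 0x1
    #define PG_WANTED     0x2

    struct fake_page {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        unsigned        flags;
    };

    static void wait_for_swapin(struct fake_page *m)
    {
        pthread_mutex_lock(&m->lock);
        while (m->flags & PG_SWAPINPROG) {
            m->flags |= PG_WANTED;               /* registered under the lock */
            pthread_cond_wait(&m->cond, &m->lock);
        }
        pthread_mutex_unlock(&m->lock);
    }

    static void swapin_done(struct fake_page *m)
    {
        pthread_mutex_lock(&m->lock);
        m->flags &= ~PG_SWAPINPROG;
        if (m->flags & PG_WANTED) {
            m->flags &= ~PG_WANTED;
            pthread_cond_broadcast(&m->cond);    /* wakeup(m) analogue */
        }
        pthread_mutex_unlock(&m->lock);
    }

    int main(void)
    {
        struct fake_page m = {
            PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, PG_SWAPINPROG
        };

        swapin_done(&m);        /* completion first, so the wait returns at once */
        wait_for_swapin(&m);
        return 0;
    }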
*/ + vm_page_busy_wait(m, FALSE, "swadpg"); swp_pager_meta_ctl(m->object, m->pindex, SWM_FREE); vm_page_flag_clear(m, PG_SWAPPED); @@ -1760,6 +1781,7 @@ swp_pager_async_iodone(struct bio *bio) } vm_page_flag_clear(m, PG_SWAPINPROG); vm_page_io_finish(m); + vm_page_wakeup(m); } } else if (bio->bio_caller_info1.index & SWBIO_READ) { /* @@ -1826,6 +1848,7 @@ swp_pager_async_iodone(struct bio *bio) * When using the swap to cache clean vnode pages * we do not mess with the page dirty bits. */ + vm_page_busy_wait(m, FALSE, "swadpg"); if (m->object->type == OBJT_SWAP) vm_page_undirty(m); vm_page_flag_clear(m, PG_SWAPINPROG); @@ -1837,6 +1860,7 @@ swp_pager_async_iodone(struct bio *bio) vm_page_protect(m, VM_PROT_READ); #endif vm_page_io_finish(m); + vm_page_wakeup(m); } } @@ -1854,7 +1878,10 @@ swp_pager_async_iodone(struct bio *bio) * NOTE: Due to synchronous operations in the write case b_cmd may * already be set to BUF_CMD_DONE and BIO_SYNC may have already * been cleared. + * + * Use vm_token to interlock nsw_rcount/wcount wakeup? */ + lwkt_gettoken(&vm_token); if (bio->bio_caller_info1.index & SWBIO_READ) nswptr = &nsw_rcount; else if (bio->bio_caller_info1.index & SWBIO_SYNC) @@ -1868,6 +1895,8 @@ swp_pager_async_iodone(struct bio *bio) /* * Fault-in a potentially swapped page and remove the swap reference. + * + * object must be held. */ static __inline void swp_pager_fault_page(vm_object_t object, vm_pindex_t pindex) @@ -1876,6 +1905,8 @@ swp_pager_fault_page(vm_object_t object, vm_pindex_t pindex) vm_page_t m; int error; + ASSERT_LWKT_TOKEN_HELD(vm_object_token(object)); + if (object->type == OBJT_VNODE) { /* * Any swap related to a vnode is due to swapcache. We must @@ -1913,12 +1944,15 @@ swap_pager_swapoff(int devidx) swblk_t v; int i; - lwkt_gettoken(&vm_token); lwkt_gettoken(&vmobj_token); rescan: TAILQ_FOREACH(object, &vm_object_list, object_list) { + if (object->type != OBJT_SWAP && object->type != OBJT_VNODE) + continue; + vm_object_hold(object); if (object->type == OBJT_SWAP || object->type == OBJT_VNODE) { - RB_FOREACH(swap, swblock_rb_tree, &object->swblock_root) { + RB_FOREACH(swap, + swblock_rb_tree, &object->swblock_root) { for (i = 0; i < SWAP_META_PAGES; ++i) { v = swap->swb_pages[i]; if (v != SWAPBLK_NONE && @@ -1926,14 +1960,15 @@ rescan: swp_pager_fault_page( object, swap->swb_index + i); + vm_object_drop(object); goto rescan; } } } } + vm_object_drop(object); } lwkt_reltoken(&vmobj_token); - lwkt_reltoken(&vm_token); /* * If we fail to locate all swblocks we just fail gracefully and @@ -1963,12 +1998,13 @@ rescan: /* * Lookup the swblock containing the specified swap block index. * - * The caller must hold vm_token. + * The caller must hold the object. */ static __inline struct swblock * swp_pager_lookup(vm_object_t object, vm_pindex_t index) { + ASSERT_LWKT_TOKEN_HELD(vm_object_token(object)); index &= ~SWAP_META_MASK; return (RB_LOOKUP(swblock_rb_tree, &object->swblock_root, index)); } @@ -1976,19 +2012,20 @@ swp_pager_lookup(vm_object_t object, vm_pindex_t index) /* * Remove a swblock from the RB tree. * - * The caller must hold vm_token. + * The caller must hold the object. */ static __inline void swp_pager_remove(vm_object_t object, struct swblock *swap) { + ASSERT_LWKT_TOKEN_HELD(vm_object_token(object)); RB_REMOVE(swblock_rb_tree, &object->swblock_root, swap); } /* * Convert default object to swap object if necessary * - * The caller must hold vm_token. + * The caller must hold the object. 
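swap_pager_swapoff() now holds each object while faulting its pages in, and because that work can block (and the global object list can then change underneath the iterator), it drops the object and restarts the scan from the head after any object it actually touched. The control flow reduces to the classic restartable-scan pattern below; the list, lock, and "work" here are stand-ins, not kernel structures.

    #include <pthread.h>
    #include <stdbool.h>
    #include <stddef.h>

    struct obj {
        struct obj *next;
        bool        needs_work;
    };

    static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
    static struct obj *obj_list;

    static void drain_all(void)
    {
        struct obj *o;

        pthread_mutex_lock(&list_lock);
    rescan:
        for (o = obj_list; o != NULL; o = o->next) {
            if (!o->needs_work)
                continue;
            /* in the real code the work may block and the list may change,
             * so the iterator cannot be trusted afterwards */
            o->needs_work = false;
            goto rescan;
        }
        pthread_mutex_unlock(&list_lock);
    }

    int main(void)
    {
        struct obj a = { NULL, true };
        struct obj b = { &a, true };

        obj_list = &b;
        drain_all();
        return 0;
    }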
*/ static void swp_pager_meta_convert(vm_object_t object) @@ -2009,7 +2046,7 @@ swp_pager_meta_convert(vm_object_t object) * the swapblk is not valid, it is freed instead. Any previously * assigned swapblk is freed. * - * The caller must hold vm_token. + * The caller must hold the object. */ static void swp_pager_meta_build(vm_object_t object, vm_pindex_t index, swblk_t swapblk) @@ -2018,6 +2055,7 @@ swp_pager_meta_build(vm_object_t object, vm_pindex_t index, swblk_t swapblk) struct swblock *oswap; KKASSERT(swapblk != SWAPBLK_NONE); + ASSERT_LWKT_TOKEN_HELD(vm_object_token(object)); /* * Convert object if necessary @@ -2081,7 +2119,7 @@ retry: * out. This routine does *NOT* operate on swap metadata associated * with resident pages. * - * The caller must hold vm_token. + * The caller must hold the object. */ static int swp_pager_meta_free_callback(struct swblock *swb, void *data); @@ -2090,6 +2128,8 @@ swp_pager_meta_free(vm_object_t object, vm_pindex_t index, vm_pindex_t count) { struct swfreeinfo info; + ASSERT_LWKT_TOKEN_HELD(vm_object_token(object)); + /* * Nothing to do */ @@ -2113,7 +2153,7 @@ swp_pager_meta_free(vm_object_t object, vm_pindex_t index, vm_pindex_t count) } /* - * The caller must hold vm_token. + * The caller must hold the object. */ static int @@ -2147,7 +2187,6 @@ swp_pager_meta_free_callback(struct swblock *swap, void *data) swblk_t v = swap->swb_pages[index]; if (v != SWAPBLK_NONE) { - swp_pager_freeswapspace(object, v, 1); swap->swb_pages[index] = SWAPBLK_NONE; if (--swap->swb_count == 0) { swp_pager_remove(object, swap); @@ -2155,6 +2194,7 @@ swp_pager_meta_free_callback(struct swblock *swap, void *data) --object->swblock_count; break; } + swp_pager_freeswapspace(object, v, 1); /* can block */ } ++index; } @@ -2168,7 +2208,7 @@ swp_pager_meta_free_callback(struct swblock *swap, void *data) * This routine locates and destroys all swap metadata associated with * an object. * - * The caller must hold vm_token. + * The caller must hold the object. */ static void swp_pager_meta_free_all(vm_object_t object) @@ -2176,6 +2216,8 @@ swp_pager_meta_free_all(vm_object_t object) struct swblock *swap; int i; + ASSERT_LWKT_TOKEN_HELD(vm_object_token(object)); + while ((swap = RB_ROOT(&object->swblock_root)) != NULL) { swp_pager_remove(object, swap); for (i = 0; i < SWAP_META_PAGES; ++i) { @@ -2213,7 +2255,7 @@ swp_pager_meta_free_all(vm_object_t object) * SWM_FREE remove and free swap block from metadata * SWM_POP remove from meta data but do not free.. pop it out * - * The caller must hold vm_token. + * The caller must hold the object. 
*/ static swblk_t swp_pager_meta_ctl(vm_object_t object, vm_pindex_t index, int flags) @@ -2232,10 +2274,6 @@ swp_pager_meta_ctl(vm_object_t object, vm_pindex_t index, int flags) r1 = swap->swb_pages[index]; if (r1 != SWAPBLK_NONE) { - if (flags & SWM_FREE) { - swp_pager_freeswapspace(object, r1, 1); - r1 = SWAPBLK_NONE; - } if (flags & (SWM_FREE|SWM_POP)) { swap->swb_pages[index] = SWAPBLK_NONE; if (--swap->swb_count == 0) { @@ -2244,6 +2282,10 @@ swp_pager_meta_ctl(vm_object_t object, vm_pindex_t index, int flags) --object->swblock_count; } } + if (flags & SWM_FREE) { + swp_pager_freeswapspace(object, r1, 1); + r1 = SWAPBLK_NONE; + } } } return(r1); diff --git a/sys/vm/vm.h b/sys/vm/vm.h index 88e6370dbe..3ea45bbbe0 100644 --- a/sys/vm/vm.h +++ b/sys/vm/vm.h @@ -84,6 +84,7 @@ typedef u_char vm_prot_t; /* protection codes */ #define VM_PROT_WRITE ((vm_prot_t) 0x02) #define VM_PROT_EXECUTE ((vm_prot_t) 0x04) #define VM_PROT_OVERRIDE_WRITE ((vm_prot_t) 0x08) /* copy-on-write */ +#define VM_PROT_NOSYNC ((vm_prot_t) 0x10) #define VM_PROT_RW (VM_PROT_READ|VM_PROT_WRITE) #define VM_PROT_ALL (VM_PROT_READ|VM_PROT_WRITE|VM_PROT_EXECUTE) diff --git a/sys/vm/vm_contig.c b/sys/vm/vm_contig.c index cf9a3ab7ef..313f0887a5 100644 --- a/sys/vm/vm_contig.c +++ b/sys/vm/vm_contig.c @@ -1,6 +1,4 @@ /* - * (MPSAFE) - * * Copyright (c) 2003, 2004 The DragonFly Project. All rights reserved. * * This code is derived from software contributed to The DragonFly Project @@ -120,8 +118,11 @@ #include #include +#include #include +static void vm_contig_pg_free(int start, u_long size); + /* * vm_contig_pg_clean: * @@ -137,71 +138,98 @@ * * Otherwise if the object is of any other type, the generic * pageout (daemon) flush routine is invoked. - * - * The caller must hold vm_token. */ -static int -vm_contig_pg_clean(int queue) +static void +vm_contig_pg_clean(int queue, int count) { vm_object_t object; - vm_page_t m, m_tmp, next; + vm_page_t m, m_tmp; + struct vm_page marker; + struct vpgqueues *pq = &vm_page_queues[queue]; - ASSERT_LWKT_TOKEN_HELD(&vm_token); + /* + * Setup a local marker + */ + bzero(&marker, sizeof(marker)); + marker.flags = PG_BUSY | PG_FICTITIOUS | PG_MARKER; + marker.queue = queue; + marker.wire_count = 1; - for (m = TAILQ_FIRST(&vm_page_queues[queue].pl); m != NULL; m = next) { - KASSERT(m->queue == queue, - ("vm_contig_clean: page %p's queue is not %d", - m, queue)); - next = TAILQ_NEXT(m, pageq); + vm_page_queues_spin_lock(queue); + TAILQ_INSERT_HEAD(&pq->pl, &marker, pageq); + vm_page_queues_spin_unlock(queue); - if (m->flags & PG_MARKER) + /* + * Iterate the queue. Note that the vm_page spinlock must be + * acquired before the pageq spinlock so it's easiest to simply + * not hold it in the loop iteration. 
+ */ + while (count-- > 0 && (m = TAILQ_NEXT(&marker, pageq)) != NULL) { + vm_page_and_queue_spin_lock(m); + if (m != TAILQ_NEXT(&marker, pageq)) { + vm_page_and_queue_spin_unlock(m); + ++count; continue; - - if (vm_page_sleep_busy(m, TRUE, "vpctw0")) - return (TRUE); - + } + KKASSERT(m->queue == queue); + + TAILQ_REMOVE(&pq->pl, &marker, pageq); + TAILQ_INSERT_AFTER(&pq->pl, m, &marker, pageq); + + if (m->flags & PG_MARKER) { + vm_page_and_queue_spin_unlock(m); + continue; + } + if (vm_page_busy_try(m, TRUE)) { + vm_page_and_queue_spin_unlock(m); + continue; + } + vm_page_and_queue_spin_unlock(m); + + /* + * We've successfully busied the page + */ + if (m->queue - m->pc != queue) { + vm_page_wakeup(m); + continue; + } + if ((object = m->object) == NULL) { + vm_page_wakeup(m); + continue; + } vm_page_test_dirty(m); if (m->dirty) { - object = m->object; + vm_object_hold(object); + KKASSERT(m->object == object); + if (object->type == OBJT_VNODE) { + vm_page_wakeup(m); vn_lock(object->handle, LK_EXCLUSIVE|LK_RETRY); vm_object_page_clean(object, 0, 0, OBJPC_SYNC); vn_unlock(((struct vnode *)object->handle)); - return (TRUE); } else if (object->type == OBJT_SWAP || object->type == OBJT_DEFAULT) { m_tmp = m; vm_pageout_flush(&m_tmp, 1, 0); - return (TRUE); + } else { + vm_page_wakeup(m); } - } - KKASSERT(m->busy == 0); - if (m->dirty == 0 && m->hold_count == 0) { - vm_page_busy(m); + vm_object_drop(object); + } else if (m->hold_count == 0) { vm_page_cache(m); + } else { + vm_page_wakeup(m); } } - return (FALSE); -} -/* - * vm_contig_pg_flush: - * - * Attempt to flush (count) pages from the given page queue. This may or - * may not succeed. Take up to passes and delay 1/20 of a second - * between each pass. - * - * The caller must hold vm_token. - */ -static void -vm_contig_pg_flush(int queue, int count) -{ - while (count > 0) { - if (!vm_contig_pg_clean(queue)) - break; - --count; - } + /* + * Scrap our local marker + */ + vm_page_queues_spin_lock(queue); + TAILQ_REMOVE(&pq->pl, &marker, pageq); + vm_page_queues_spin_unlock(queue); } + /* * vm_contig_pg_alloc: * @@ -211,8 +239,6 @@ vm_contig_pg_flush(int queue, int count) * * Malloc()'s data structures have been used for collection of * statistics and for allocations of less than a page. - * - * The caller must hold vm_token. */ static int vm_contig_pg_alloc(unsigned long size, vm_paddr_t low, vm_paddr_t high, @@ -269,15 +295,15 @@ again: * normal state. */ if ((i == vmstats.v_page_count) || - ((VM_PAGE_TO_PHYS(&pga[i]) + size) > high)) { + ((VM_PAGE_TO_PHYS(&pga[i]) + size) > high)) { /* * Best effort flush of all inactive pages. * This is quite quick, for now stall all * callers, even if they've specified M_NOWAIT. */ - vm_contig_pg_flush(PQ_INACTIVE, - vmstats.v_inactive_count); + vm_contig_pg_clean(PQ_INACTIVE, + vmstats.v_inactive_count); /* * Best effort flush of active pages. @@ -290,8 +316,8 @@ again: * will fail in the index < 0 case. */ if (pass > 0 && (mflags & M_WAITOK)) { - vm_contig_pg_flush (PQ_ACTIVE, - vmstats.v_active_count); + vm_contig_pg_clean(PQ_ACTIVE, + vmstats.v_active_count); } /* @@ -323,14 +349,31 @@ again: } /* + * Try to allocate the pages. 
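The rewritten vm_contig_pg_clean() walks the page queue with a local marker entry instead of trusting saved "next" pointers across blocking operations: the marker is advanced past each page under the queue lock, so the scan position stays valid no matter what processing the page triggers. Using <sys/queue.h> in user space, the skeleton of that walk looks like the sketch below (entry and queue types are stand-ins):

    #include <sys/queue.h>
    #include <pthread.h>
    #include <stdbool.h>

    struct entry {
        TAILQ_ENTRY(entry) link;
        bool is_marker;
    };
    TAILQ_HEAD(entry_list, entry);

    static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

    static void scan_queue(struct entry_list *q)
    {
        struct entry marker = { .is_marker = true };
        struct entry *e;

        pthread_mutex_lock(&queue_lock);
        TAILQ_INSERT_HEAD(q, &marker, link);
        while ((e = TAILQ_NEXT(&marker, link)) != NULL) {
            /* keep our place: move the marker past the element first */
            TAILQ_REMOVE(q, &marker, link);
            TAILQ_INSERT_AFTER(q, e, &marker, link);
            if (e->is_marker)
                continue;           /* someone else's marker: skip it */
            /*
             * Process "e" here.  In the kernel this is where the page is
             * busied and the queue lock can be dropped; the marker keeps
             * the scan position valid while that happens.
             */
        }
        TAILQ_REMOVE(q, &marker, link);
        pthread_mutex_unlock(&queue_lock);
    }

    int main(void)
    {
        struct entry_list q = TAILQ_HEAD_INITIALIZER(q);
        struct entry a = { .is_marker = false };

        TAILQ_INSERT_TAIL(&q, &a, link);
        scan_queue(&q);
        return 0;
    }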
+ * * (still in critical section) */ for (i = start; i < (start + size / PAGE_SIZE); i++) { m = &pga[i]; + + if (vm_page_busy_try(m, TRUE)) { + vm_contig_pg_free(start, + (i - start) * PAGE_SIZE); + start++; + goto again; + } pqtype = m->queue - m->pc; if (pqtype == PQ_CACHE) { - vm_page_busy(m); vm_page_free(m); + --i; + continue; /* retry the page */ + } + if (pqtype != PQ_FREE) { + vm_page_wakeup(m); + vm_contig_pg_free(start, + (i - start) * PAGE_SIZE); + start++; + goto again; } KKASSERT(m->object == NULL); vm_page_unqueue_nowakeup(m); @@ -339,14 +382,15 @@ again: vm_page_zero_count--; KASSERT(m->dirty == 0, ("vm_contig_pg_alloc: page %p was dirty", m)); - m->wire_count = 0; - m->busy = 0; + KKASSERT(m->wire_count == 0); + KKASSERT(m->busy == 0); /* - * Clear all flags except PG_ZERO and PG_WANTED. This - * also clears PG_BUSY. + * Clear all flags except PG_BUSY, PG_ZERO, and + * PG_WANTED, then unbusy the now allocated page. */ - vm_page_flag_clear(m, ~(PG_ZERO|PG_WANTED)); + vm_page_flag_clear(m, ~(PG_BUSY|PG_ZERO|PG_WANTED)); + vm_page_wakeup(m); } /* @@ -382,13 +426,11 @@ vm_contig_pg_free(int start, u_long size) if (size == 0) panic("vm_contig_pg_free: size must not be 0"); - lwkt_gettoken(&vm_token); for (i = start; i < (start + size / PAGE_SIZE); i++) { m = &pga[i]; - vm_page_busy(m); + vm_page_busy_wait(m, FALSE, "cpgfr"); vm_page_free(m); } - lwkt_reltoken(&vm_token); } /* @@ -411,8 +453,6 @@ vm_contig_pg_kmap(int start, u_long size, vm_map_t map, int flags) if (size == 0) panic("vm_contig_pg_kmap: size must not be 0"); - lwkt_gettoken(&vm_token); - /* * We've found a contiguous chunk that meets our requirements. * Allocate KVM, and assign phys pages and return a kernel VM @@ -429,7 +469,6 @@ vm_contig_pg_kmap(int start, u_long size, vm_map_t map, int flags) */ vm_map_unlock(map); vm_map_entry_release(count); - lwkt_reltoken(&vm_token); return (0); } @@ -437,7 +476,7 @@ vm_contig_pg_kmap(int start, u_long size, vm_map_t map, int flags) * kernel_object maps 1:1 to kernel_map. */ vm_object_hold(&kernel_object); - vm_object_reference(&kernel_object); + vm_object_reference_locked(&kernel_object); vm_map_insert(map, &count, &kernel_object, addr, addr, addr + size, @@ -460,7 +499,6 @@ vm_contig_pg_kmap(int start, u_long size, vm_map_t map, int flags) vm_object_drop(&kernel_object); - lwkt_reltoken(&vm_token); return (addr); } @@ -498,21 +536,18 @@ contigmalloc_map( int index; void *rv; - lwkt_gettoken(&vm_token); index = vm_contig_pg_alloc(size, low, high, alignment, boundary, flags); if (index < 0) { kprintf("contigmalloc_map: failed size %lu low=%llx " "high=%llx align=%lu boundary=%lu flags=%08x\n", size, (long long)low, (long long)high, alignment, boundary, flags); - lwkt_reltoken(&vm_token); return NULL; } rv = (void *)vm_contig_pg_kmap(index, size, map, flags); if (rv == NULL) vm_contig_pg_free(index, size); - lwkt_reltoken(&vm_token); return rv; } diff --git a/sys/vm/vm_fault.c b/sys/vm/vm_fault.c index 5092e2dd5f..57825d27d1 100644 --- a/sys/vm/vm_fault.c +++ b/sys/vm/vm_fault.c @@ -139,9 +139,6 @@ static void vm_set_nosync(vm_page_t m, vm_map_entry_t entry); static void vm_prefault(pmap_t pmap, vm_offset_t addra, vm_map_entry_t entry, int prot); -/* - * The caller must hold vm_token. - */ static __inline void release_page(struct faultstate *fs) { @@ -151,8 +148,6 @@ release_page(struct faultstate *fs) } /* - * The caller must hold vm_token. - * * NOTE: Once unlocked any cached fs->entry becomes invalid, any reuse * requires relocking and then checking the timestamp. 
* @@ -189,8 +184,6 @@ unlock_map(struct faultstate *fs) /* * Clean up after a successful call to vm_fault_object() so another call * to vm_fault_object() can be made. - * - * The caller must hold vm_token. */ static void _cleanup_successful_fault(struct faultstate *fs, int relock) @@ -208,17 +201,13 @@ _cleanup_successful_fault(struct faultstate *fs, int relock) } } -/* - * The caller must hold vm_token. - */ static void _unlock_things(struct faultstate *fs, int dealloc) { - vm_object_pip_wakeup(fs->first_object); _cleanup_successful_fault(fs, 0); if (dealloc) { - vm_object_deallocate(fs->first_object); - fs->first_object = NULL; + /*vm_object_deallocate(fs->first_object);*/ + /*fs->first_object = NULL; drop used later on */ } unlock_map(fs); if (fs->vp != NULL) { @@ -275,6 +264,8 @@ vm_fault(vm_map_t map, vm_offset_t vaddr, vm_prot_t fault_type, int fault_flags) fs.fault_flags = fault_flags; growstack = 1; + lwkt_gettoken(&map->token); + RetryFault: /* * Find the vm_map_entry representing the backing store and resolve @@ -307,12 +298,13 @@ RetryFault: if (result == KERN_INVALID_ADDRESS && growstack && map != &kernel_map && curproc != NULL) { result = vm_map_growstack(curproc, vaddr); - if (result != KERN_SUCCESS) - return (KERN_FAILURE); - growstack = 0; - goto RetryFault; + if (result == KERN_SUCCESS) { + growstack = 0; + goto RetryFault; + } + result = KERN_FAILURE; } - return (result); + goto done; } /* @@ -328,8 +320,10 @@ RetryFault: &fs.entry, &fs.first_object, &first_pindex, &fs.first_prot, &fs.wired); - if (result != KERN_SUCCESS) - return result; + if (result != KERN_SUCCESS) { + result = KERN_FAILURE; + goto done; + } /* * If we don't COW now, on a user wire, the user will never @@ -370,24 +364,12 @@ RetryFault: } /* - * Make a reference to this object to prevent its disposal while we - * are messing with it. Once we have the reference, the map is free - * to be diddled. Since objects reference their shadows (and copies), - * they will stay around as well. - * * Bump the paging-in-progress count to prevent size changes (e.g. * truncation operations) during I/O. This must be done after * obtaining the vnode lock in order to avoid possible deadlocks. - * - * The vm_object must be held before manipulation. 
*/ - lwkt_gettoken(&vm_token); vm_object_hold(fs.first_object); - vm_object_reference(fs.first_object); fs.vp = vnode_pager_lock(fs.first_object); - vm_object_pip_add(fs.first_object, 1); - vm_object_drop(fs.first_object); - lwkt_reltoken(&vm_token); fs.lookup_still_valid = TRUE; fs.first_m = NULL; @@ -411,10 +393,12 @@ RetryFault: result = vm_fault_vpagetable(&fs, &first_pindex, fs.entry->aux.master_pde, fault_type); - if (result == KERN_TRY_AGAIN) + if (result == KERN_TRY_AGAIN) { + vm_object_drop(fs.first_object); goto RetryFault; + } if (result != KERN_SUCCESS) - return (result); + goto done; } /* @@ -434,13 +418,11 @@ RetryFault: result = vm_fault_object(&fs, first_pindex, fault_type); if (result == KERN_TRY_AGAIN) { - /*lwkt_reltoken(&vm_token);*/ + vm_object_drop(fs.first_object); goto RetryFault; } - if (result != KERN_SUCCESS) { - /*lwkt_reltoken(&vm_token);*/ - return (result); - } + if (result != KERN_SUCCESS) + goto done; /* * On success vm_fault_object() does not unlock or deallocate, and fs.m @@ -461,7 +443,6 @@ RetryFault: vm_prefault(fs.map->pmap, vaddr, fs.entry, fs.prot); } } - lwkt_gettoken(&vm_token); unlock_things(&fs); /*KKASSERT(fs.m->queue == PQ_NONE); page-in op may deactivate page */ @@ -470,23 +451,16 @@ RetryFault: /* * If the page is not wired down, then put it where the pageout daemon * can find it. - * - * We do not really need to get vm_token here but since all the - * vm_*() calls have to doing it here improves efficiency. */ - /*lwkt_gettoken(&vm_token);*/ if (fs.fault_flags & VM_FAULT_WIRE_MASK) { - lwkt_reltoken(&vm_token); /* before wire activate does not */ if (fs.wired) vm_page_wire(fs.m); else vm_page_unwire(fs.m, 1); } else { vm_page_activate(fs.m); - lwkt_reltoken(&vm_token); /* before wire activate does not */ } - /*lwkt_reltoken(&vm_token); after wire/activate works */ if (curthread->td_lwp) { if (fs.hardfault) { @@ -500,12 +474,16 @@ RetryFault: * Unlock everything, and return */ vm_page_wakeup(fs.m); - vm_object_deallocate(fs.first_object); + /*vm_object_deallocate(fs.first_object);*/ /*fs.m = NULL; */ - /*fs.first_object = NULL; */ - /*lwkt_reltoken(&vm_token);*/ - - return (KERN_SUCCESS); + /*fs.first_object = NULL; must still drop later */ + + result = KERN_SUCCESS; +done: + if (fs.first_object) + vm_object_drop(fs.first_object); + lwkt_reltoken(&map->token); + return (result); } /* @@ -556,6 +534,8 @@ vm_fault_page(vm_map_t map, vm_offset_t vaddr, vm_prot_t fault_type, fs.fault_flags = fault_flags; KKASSERT((fault_flags & VM_FAULT_WIRE_MASK) == 0); + lwkt_gettoken(&map->token); + RetryFault: /* * Find the vm_map_entry representing the backing store and resolve @@ -577,7 +557,8 @@ RetryFault: if (result != KERN_SUCCESS) { *errorp = result; - return (NULL); + fs.m = NULL; + goto done; } /* @@ -607,19 +588,16 @@ RetryFault: * to be diddled. Since objects reference their shadows (and copies), * they will stay around as well. * + * The reference should also prevent an unexpected collapse of the + * parent that might move pages from the current object into the + * parent unexpectedly, resulting in corruption. + * * Bump the paging-in-progress count to prevent size changes (e.g. * truncation operations) during I/O. This must be done after * obtaining the vnode lock in order to avoid possible deadlocks. - * - * The vm_object must be held before manipulation. 
*/ - lwkt_gettoken(&vm_token); vm_object_hold(fs.first_object); - vm_object_reference(fs.first_object); fs.vp = vnode_pager_lock(fs.first_object); - vm_object_pip_add(fs.first_object, 1); - vm_object_drop(fs.first_object); - lwkt_reltoken(&vm_token); fs.lookup_still_valid = TRUE; fs.first_m = NULL; @@ -643,11 +621,14 @@ RetryFault: result = vm_fault_vpagetable(&fs, &first_pindex, fs.entry->aux.master_pde, fault_type); - if (result == KERN_TRY_AGAIN) + if (result == KERN_TRY_AGAIN) { + vm_object_drop(fs.first_object); goto RetryFault; + } if (result != KERN_SUCCESS) { *errorp = result; - return (NULL); + fs.m = NULL; + goto done; } } @@ -660,18 +641,22 @@ RetryFault: */ result = vm_fault_object(&fs, first_pindex, fault_type); - if (result == KERN_TRY_AGAIN) + if (result == KERN_TRY_AGAIN) { + vm_object_drop(fs.first_object); goto RetryFault; + } if (result != KERN_SUCCESS) { *errorp = result; - return(NULL); + fs.m = NULL; + goto done; } if ((orig_fault_type & VM_PROT_WRITE) && (fs.prot & VM_PROT_WRITE) == 0) { *errorp = KERN_PROTECTION_FAILURE; unlock_and_deallocate(&fs); - return(NULL); + fs.m = NULL; + goto done; } /* @@ -687,7 +672,6 @@ RetryFault: * (so we don't want to lose the fact that the page will be dirtied * if a write fault was specified). */ - lwkt_gettoken(&vm_token); vm_page_hold(fs.m); if (fault_type & VM_PROT_WRITE) vm_page_dirty(fs.m); @@ -718,11 +702,14 @@ RetryFault: * Unlock everything, and return the held page. */ vm_page_wakeup(fs.m); - vm_object_deallocate(fs.first_object); + /*vm_object_deallocate(fs.first_object);*/ /*fs.first_object = NULL; */ - lwkt_reltoken(&vm_token); - *errorp = 0; + +done: + if (fs.first_object) + vm_object_drop(fs.first_object); + lwkt_reltoken(&map->token); return(fs.m); } @@ -744,6 +731,7 @@ vm_fault_object_page(vm_object_t object, vm_ooffset_t offset, struct faultstate fs; struct vm_map_entry entry; + ASSERT_LWKT_TOKEN_HELD(vm_object_token(object)); bzero(&entry, sizeof(entry)); entry.object.vm_object = object; entry.maptype = VM_MAPTYPE_NORMAL; @@ -770,17 +758,15 @@ RetryFault: * to be diddled. Since objects reference their shadows (and copies), * they will stay around as well. * + * The reference should also prevent an unexpected collapse of the + * parent that might move pages from the current object into the + * parent unexpectedly, resulting in corruption. + * * Bump the paging-in-progress count to prevent size changes (e.g. * truncation operations) during I/O. This must be done after * obtaining the vnode lock in order to avoid possible deadlocks. */ - lwkt_gettoken(&vm_token); - vm_object_hold(fs.first_object); - vm_object_reference(fs.first_object); fs.vp = vnode_pager_lock(fs.first_object); - vm_object_pip_add(fs.first_object, 1); - vm_object_drop(fs.first_object); - lwkt_reltoken(&vm_token); fs.lookup_still_valid = TRUE; fs.first_m = NULL; @@ -836,7 +822,6 @@ RetryFault: * (so we don't want to lose the fact that the page will be dirtied * if a write fault was specified). */ - lwkt_gettoken(&vm_token); vm_page_hold(fs.m); if (fault_type & VM_PROT_WRITE) vm_page_dirty(fs.m); @@ -870,9 +855,8 @@ RetryFault: * Unlock everything, and return the held page. */ vm_page_wakeup(fs.m); - vm_object_deallocate(fs.first_object); + /*vm_object_deallocate(fs.first_object);*/ /*fs.first_object = NULL; */ - lwkt_reltoken(&vm_token); *errorp = 0; return(fs.m); @@ -886,8 +870,6 @@ RetryFault: * This implements an N-level page table. Any level can terminate the * scan by setting VPTE_PS. 
A linear mapping is accomplished by setting * VPTE_PS in the master page directory entry set via mcontrol(MADV_SETMAP). - * - * No requirements (vm_token need not be held). */ static int @@ -900,6 +882,7 @@ vm_fault_vpagetable(struct faultstate *fs, vm_pindex_t *pindex, int result = KERN_SUCCESS; vpte_t *ptep; + ASSERT_LWKT_TOKEN_HELD(vm_object_token(fs->first_object)); for (;;) { /* * We cannot proceed if the vpte is not valid, not readable @@ -996,7 +979,7 @@ vm_fault_vpagetable(struct faultstate *fs, vm_pindex_t *pindex, * deallocated, fs.m will contained a resolved, busied page, and fs.object * will have an additional PIP count if it is not equal to fs.first_object. * - * No requirements. + * fs->first_object must be held on call. */ static int @@ -1005,11 +988,16 @@ vm_fault_object(struct faultstate *fs, { vm_object_t next_object; vm_pindex_t pindex; + int error; + ASSERT_LWKT_TOKEN_HELD(vm_object_token(fs->first_object)); fs->prot = fs->first_prot; fs->object = fs->first_object; pindex = first_pindex; + vm_object_chain_acquire(fs->first_object); + vm_object_pip_add(fs->first_object, 1); + /* * If a read fault occurs we try to make the page writable if * possible. There are three cases where we cannot make the @@ -1034,78 +1022,85 @@ vm_fault_object(struct faultstate *fs, fs->prot &= ~VM_PROT_WRITE; } - lwkt_gettoken(&vm_token); + /* vm_object_hold(fs->object); implied b/c object == first_object */ for (;;) { /* * If the object is dead, we stop here */ if (fs->object->flags & OBJ_DEAD) { + vm_object_pip_wakeup(fs->first_object); + vm_object_chain_release_all(fs->first_object, + fs->object); + if (fs->object != fs->first_object) + vm_object_drop(fs->object); unlock_and_deallocate(fs); - lwkt_reltoken(&vm_token); return (KERN_PROTECTION_FAILURE); } /* - * See if the page is resident. + * See if the page is resident. Wait/Retry if the page is + * busy (lots of stuff may have changed so we can't continue + * in that case). + * + * We can theoretically allow the soft-busy case on a read + * fault if the page is marked valid, but since such + * pages are typically already pmap'd, putting that + * special case in might be more effort then it is + * worth. We cannot under any circumstances mess + * around with a vm_page_t->busy page except, perhaps, + * to pmap it. */ - fs->m = vm_page_lookup(fs->object, pindex); - if (fs->m != NULL) { - int queue; + fs->m = vm_page_lookup_busy_try(fs->object, pindex, + TRUE, &error); + if (error) { + vm_object_pip_wakeup(fs->first_object); + vm_object_chain_release_all(fs->first_object, + fs->object); + if (fs->object != fs->first_object) + vm_object_drop(fs->object); + unlock_things(fs); + vm_page_sleep_busy(fs->m, TRUE, "vmpfw"); + mycpu->gd_cnt.v_intrans++; + /*vm_object_deallocate(fs->first_object);*/ + /*fs->first_object = NULL;*/ + fs->m = NULL; + return (KERN_TRY_AGAIN); + } + if (fs->m) { /* - * Wait/Retry if the page is busy. We have to do this - * if the page is busy via either PG_BUSY or - * vm_page_t->busy because the vm_pager may be using - * vm_page_t->busy for pageouts ( and even pageins if - * it is the vnode pager ), and we could end up trying - * to pagein and pageout the same page simultaneously. + * The page is busied for us. * - * We can theoretically allow the busy case on a read - * fault if the page is marked valid, but since such - * pages are typically already pmap'd, putting that - * special case in might be more effort then it is - * worth. 
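The vm_page_lookup_busy_try() call above folds the (object,pindex) lookup and the hard-busy attempt into a single operation, with the error out-parameter separating "page exists but is busy" from "no resident page". In isolation the idiom is the sketch below; lookup_page_retry() is illustrative, and vm_fault_object() itself has to unwind its PIP and chain state before sleeping, which is omitted here.

/*
 * Look up a page and hard-busy it in one step.  If 'error' is set the
 * page exists but could not be busied, so sleep on it and retry; a
 * NULL return with no error means there is no resident page.  The
 * caller must hold the object.
 */
static vm_page_t
lookup_page_retry(vm_object_t object, vm_pindex_t pindex)
{
	vm_page_t m;
	int error;

	ASSERT_LWKT_TOKEN_HELD(vm_object_token(object));
	for (;;) {
		m = vm_page_lookup_busy_try(object, pindex, TRUE, &error);
		if (error) {
			vm_page_sleep_busy(m, TRUE, "vmpfw");
			continue;
		}
		return (m);	/* busied page, or NULL if not resident */
	}
}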
We cannot under any circumstances mess - * around with a vm_page_t->busy page except, perhaps, - * to pmap it. - */ - if ((fs->m->flags & PG_BUSY) || fs->m->busy) { - unlock_things(fs); - vm_page_sleep_busy(fs->m, TRUE, "vmpfw"); - mycpu->gd_cnt.v_intrans++; - vm_object_deallocate(fs->first_object); - fs->first_object = NULL; - lwkt_reltoken(&vm_token); - return (KERN_TRY_AGAIN); - } - - /* * If reactivating a page from PQ_CACHE we may have * to rate-limit. */ - queue = fs->m->queue; + int queue = fs->m->queue; vm_page_unqueue_nowakeup(fs->m); if ((queue - fs->m->pc) == PQ_CACHE && vm_page_count_severe()) { vm_page_activate(fs->m); + vm_page_wakeup(fs->m); + fs->m = NULL; + vm_object_pip_wakeup(fs->first_object); + vm_object_chain_release_all(fs->first_object, + fs->object); + if (fs->object != fs->first_object) + vm_object_drop(fs->object); unlock_and_deallocate(fs); vm_waitpfault(); - lwkt_reltoken(&vm_token); return (KERN_TRY_AGAIN); } /* - * Mark page busy for other processes, and the - * pagedaemon. If it still isn't completely valid - * (readable), or if a read-ahead-mark is set on - * the VM page, jump to readrest, else we found the - * page and can return. + * If it still isn't completely valid (readable), + * or if a read-ahead-mark is set on the VM page, + * jump to readrest, else we found the page and + * can return. * * We can release the spl once we have marked the * page busy. */ - vm_page_busy(fs->m); - if (fs->m->object != &kernel_object) { if ((fs->m->valid & VM_PAGE_BITS_ALL) != VM_PAGE_BITS_ALL) { @@ -1130,7 +1125,11 @@ vm_fault_object(struct faultstate *fs, * If the page is beyond the object size we fail */ if (pindex >= fs->object->size) { - lwkt_reltoken(&vm_token); + vm_object_pip_wakeup(fs->first_object); + vm_object_chain_release_all(fs->first_object, + fs->object); + if (fs->object != fs->first_object) + vm_object_drop(fs->object); unlock_and_deallocate(fs); return (KERN_PROTECTION_FAILURE); } @@ -1143,7 +1142,11 @@ vm_fault_object(struct faultstate *fs, limticks = vm_fault_ratelimit(curproc->p_vmspace); if (limticks) { - lwkt_reltoken(&vm_token); + vm_object_pip_wakeup(fs->first_object); + vm_object_chain_release_all( + fs->first_object, fs->object); + if (fs->object != fs->first_object) + vm_object_drop(fs->object); unlock_and_deallocate(fs); tsleep(curproc, 0, "vmrate", limticks); fs->didlimit = 1; @@ -1157,14 +1160,25 @@ vm_fault_object(struct faultstate *fs, fs->m = NULL; if (!vm_page_count_severe()) { fs->m = vm_page_alloc(fs->object, pindex, - (fs->vp || fs->object->backing_object) ? VM_ALLOC_NORMAL : VM_ALLOC_NORMAL | VM_ALLOC_ZERO); + ((fs->vp || fs->object->backing_object) ? + VM_ALLOC_NORMAL : + VM_ALLOC_NORMAL | VM_ALLOC_ZERO)); } if (fs->m == NULL) { - lwkt_reltoken(&vm_token); + vm_object_pip_wakeup(fs->first_object); + vm_object_chain_release_all(fs->first_object, + fs->object); + if (fs->object != fs->first_object) + vm_object_drop(fs->object); unlock_and_deallocate(fs); vm_waitpfault(); return (KERN_TRY_AGAIN); } + + /* + * Fall through to readrest. We have a new page which + * will have to be paged (since m->valid will be 0). 
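Every error exit above and below repeats the same unwind: wake the PIP count on first_object, release the chain locks down to the current object, drop that object if the fault advanced past first_object, then unlock. Factored out it would read as the sketch below; fault_object_unwind() is illustrative, the patch open-codes the sequence at each site.

/*
 * One-shot cleanup for vm_fault_object() error exits: undo the PIP
 * bump and chain locks taken at entry and drop any object the fault
 * chained to beyond first_object.  'deallocate' selects between the
 * unlock_and_deallocate() and unlock_things() flavors.
 */
static void
fault_object_unwind(struct faultstate *fs, int deallocate)
{
	vm_object_pip_wakeup(fs->first_object);
	vm_object_chain_release_all(fs->first_object, fs->object);
	if (fs->object != fs->first_object)
		vm_object_drop(fs->object);
	if (deallocate)
		unlock_and_deallocate(fs);
	else
		unlock_things(fs);
}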
+ */ } readrest: @@ -1228,17 +1242,22 @@ readrest: mt = vm_page_lookup(fs->first_object, scan_pindex); - if (mt == NULL || - (mt->valid != VM_PAGE_BITS_ALL)) { + if (mt == NULL) + break; + if (vm_page_busy_try(mt, TRUE)) + goto skip; + + if (mt->valid != VM_PAGE_BITS_ALL) { + vm_page_wakeup(mt); break; } - if (mt->busy || - (mt->flags & (PG_BUSY | PG_FICTITIOUS | PG_UNMANAGED)) || + if ((mt->flags & + (PG_FICTITIOUS | PG_UNMANAGED)) || mt->hold_count || mt->wire_count) { + vm_page_wakeup(mt); goto skip; } - vm_page_busy(mt); if (mt->dirty == 0) vm_page_test_dirty(mt); if (mt->dirty) { @@ -1302,7 +1321,11 @@ skip: */ fs->m = vm_page_lookup(fs->object, pindex); if (fs->m == NULL) { - lwkt_reltoken(&vm_token); + vm_object_pip_wakeup(fs->first_object); + vm_object_chain_release_all( + fs->first_object, fs->object); + if (fs->object != fs->first_object) + vm_object_drop(fs->object); unlock_and_deallocate(fs); return (KERN_TRY_AGAIN); } @@ -1325,10 +1348,17 @@ skip: * the same time that we are. */ if (rv == VM_PAGER_ERROR) { - if (curproc) - kprintf("vm_fault: pager read error, pid %d (%s)\n", curproc->p_pid, curproc->p_comm); - else - kprintf("vm_fault: pager read error, thread %p (%s)\n", curthread, curproc->p_comm); + if (curproc) { + kprintf("vm_fault: pager read error, " + "pid %d (%s)\n", + curproc->p_pid, + curproc->p_comm); + } else { + kprintf("vm_fault: pager read error, " + "thread %p (%s)\n", + curthread, + curproc->p_comm); + } } /* @@ -1346,8 +1376,12 @@ skip: if (((fs->map != &kernel_map) && (rv == VM_PAGER_ERROR)) || (rv == VM_PAGER_BAD)) { vnode_pager_freepage(fs->m); - lwkt_reltoken(&vm_token); fs->m = NULL; + vm_object_pip_wakeup(fs->first_object); + vm_object_chain_release_all(fs->first_object, + fs->object); + if (fs->object != fs->first_object) + vm_object_drop(fs->object); unlock_and_deallocate(fs); if (rv == VM_PAGER_ERROR) return (KERN_FAILURE); @@ -1373,19 +1407,35 @@ skip: fs->first_m = fs->m; /* - * Move on to the next object. Lock the next object before - * unlocking the current one. + * Move on to the next object. The chain lock should prevent + * the backing_object from getting ripped out from under us. */ - pindex += OFF_TO_IDX(fs->object->backing_object_offset); - next_object = fs->object->backing_object; + if ((next_object = fs->object->backing_object) != NULL) { + vm_object_hold(next_object); + vm_object_chain_acquire(next_object); + KKASSERT(next_object == fs->object->backing_object); + pindex += OFF_TO_IDX(fs->object->backing_object_offset); + } + if (next_object == NULL) { /* * If there's no object left, fill the page in the top * object with zeros. */ if (fs->object != fs->first_object) { + if (fs->first_object->backing_object != + fs->object) { + vm_object_hold(fs->first_object->backing_object); + } + vm_object_chain_release_all( + fs->first_object->backing_object, + fs->object); + if (fs->first_object->backing_object != + fs->object) { + vm_object_drop(fs->first_object->backing_object); + } vm_object_pip_wakeup(fs->object); - + vm_object_drop(fs->object); fs->object = fs->first_object; pindex = first_pindex; fs->m = fs->first_m; @@ -1410,6 +1460,8 @@ skip: } if (fs->object != fs->first_object) { vm_object_pip_wakeup(fs->object); + vm_object_lock_swap(); + vm_object_drop(fs->object); } KASSERT(fs->object != next_object, ("object loop %p", next_object)); @@ -1421,7 +1473,7 @@ skip: * PAGE HAS BEEN FOUND. [Loop invariant still holds -- the object lock * is held.] * - * vm_token is still held + * object still held. 
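Moving down the backing_object chain now follows a fixed order: hold the child, chain-lock it, re-verify that it is still the parent's backing object, account for the backing offset, then lock-swap and drop the parent (first_object itself is never dropped here). A condensed sketch of that single step follows; chain_step() is illustrative and leaves out the fs->m / fs->first_m bookkeeping.

/*
 * Step from 'object' to its backing object while keeping the chain
 * stable.  Returns the newly held and chain-locked backing object, or
 * NULL if there is none.  first_object is left held either way.
 */
static vm_object_t
chain_step(vm_object_t first_object, vm_object_t object, vm_pindex_t *pindexp)
{
	vm_object_t next_object;

	if ((next_object = object->backing_object) == NULL)
		return (NULL);
	vm_object_hold(next_object);
	vm_object_chain_acquire(next_object);
	KKASSERT(next_object == object->backing_object);
	*pindexp += OFF_TO_IDX(object->backing_object_offset);

	if (object != first_object) {
		vm_object_pip_wakeup(object);	/* PIP ref on intermediate */
		vm_object_lock_swap();		/* keep lock order sane */
		vm_object_drop(object);
	}
	return (next_object);
}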
* * If the page is being written, but isn't already owned by the * top-level object, we have to copy it into a new page owned by the @@ -1480,23 +1532,24 @@ skip: fs->map == NULL || lockmgr(&fs->map->lock, LK_EXCLUSIVE|LK_NOWAIT) == 0) ) { - - fs->lookup_still_valid = 1; /* - * get rid of the unnecessary page + * (first_m) and (m) are both busied. We have + * move (m) into (first_m)'s object/pindex + * in an atomic fashion, then free (first_m). + * + * first_object is held so second remove + * followed by the rename should wind + * up being atomic. vm_page_free() might + * block so we don't do it until after the + * rename. */ + fs->lookup_still_valid = 1; vm_page_protect(fs->first_m, VM_PROT_NONE); + vm_page_remove(fs->first_m); + vm_page_rename(fs->m, fs->first_object, + first_pindex); vm_page_free(fs->first_m); - fs->first_m = NULL; - - /* - * grab the page and put it into the - * process'es object. The page is - * automatically made dirty. - */ - vm_page_rename(fs->m, fs->first_object, first_pindex); fs->first_m = fs->m; - vm_page_busy(fs->first_m); fs->m = NULL; mycpu->gd_cnt.v_cow_optim++; } else { @@ -1514,11 +1567,24 @@ skip: release_page(fs); } + /* + * We intend to revert to first_object, undo the + * chain lock through to that. + */ + if (fs->first_object->backing_object != fs->object) + vm_object_hold(fs->first_object->backing_object); + vm_object_chain_release_all( + fs->first_object->backing_object, + fs->object); + if (fs->first_object->backing_object != fs->object) + vm_object_drop(fs->first_object->backing_object); + /* * fs->object != fs->first_object due to above * conditional */ vm_object_pip_wakeup(fs->object); + vm_object_drop(fs->object); /* * Only use the new page below... @@ -1552,7 +1618,11 @@ skip: if (relock_map(fs) || fs->map->timestamp != fs->map_generation) { release_page(fs); - lwkt_reltoken(&vm_token); + vm_object_pip_wakeup(fs->first_object); + vm_object_chain_release_all(fs->first_object, + fs->object); + if (fs->object != fs->first_object) + vm_object_drop(fs->object); unlock_and_deallocate(fs); return (KERN_TRY_AGAIN); } @@ -1582,7 +1652,10 @@ skip: } } - lwkt_reltoken(&vm_token); + vm_object_pip_wakeup(fs->first_object); + vm_object_chain_release_all(fs->first_object, fs->object); + if (fs->object != fs->first_object) + vm_object_drop(fs->object); /* * Page had better still be busy. 
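The reordered COW optimization above hinges on doing the remove and the rename back to back while first_object is held, and only then freeing the old page, since vm_page_free() can block. As a standalone sketch (move_page_cow() is illustrative; both pages are assumed busied and the destination object held):

/*
 * Move a busied source page to (dst_object, dst_pindex), replacing the
 * busied placeholder page currently there.  The remove/rename pair is
 * done while dst_object is held; the potentially blocking free of the
 * old page is deferred until after the rename.
 */
static void
move_page_cow(vm_page_t src_m, vm_page_t old_m,
	      vm_object_t dst_object, vm_pindex_t dst_pindex)
{
	vm_page_protect(old_m, VM_PROT_NONE);
	vm_page_remove(old_m);
	vm_page_rename(src_m, dst_object, dst_pindex);
	vm_page_free(old_m);
	/* src_m now lives at (dst_object, dst_pindex) and is still busied */
}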
We are still locked up and @@ -1622,9 +1695,12 @@ vm_fault_wire(vm_map_t map, vm_map_entry_t entry, boolean_t user_wire) vm_offset_t end; vm_offset_t va; vm_paddr_t pa; + vm_page_t m; pmap_t pmap; int rv; + lwkt_gettoken(&map->token); + pmap = vm_map_pmap(map); start = entry->start; end = entry->end; @@ -1632,7 +1708,6 @@ vm_fault_wire(vm_map_t map, vm_map_entry_t entry, boolean_t user_wire) (entry->object.vm_object->type == OBJT_DEVICE); if (entry->eflags & MAP_ENTRY_KSTACK) start += PAGE_SIZE; - lwkt_gettoken(&vm_token); map->timestamp++; vm_map_unlock(map); @@ -1654,17 +1729,21 @@ vm_fault_wire(vm_map_t map, vm_map_entry_t entry, boolean_t user_wire) if ((pa = pmap_extract(pmap, va)) == 0) continue; pmap_change_wiring(pmap, va, FALSE); - if (!fictitious) - vm_page_unwire(PHYS_TO_VM_PAGE(pa), 1); + if (!fictitious) { + m = PHYS_TO_VM_PAGE(pa); + vm_page_busy_wait(m, FALSE, "vmwrpg"); + vm_page_unwire(m, 1); + vm_page_wakeup(m); + } } - vm_map_lock(map); - lwkt_reltoken(&vm_token); - return (rv); + goto done; } } + rv = KERN_SUCCESS; +done: vm_map_lock(map); - lwkt_reltoken(&vm_token); - return (KERN_SUCCESS); + lwkt_reltoken(&map->token); + return (rv); } /* @@ -1679,8 +1758,11 @@ vm_fault_unwire(vm_map_t map, vm_map_entry_t entry) vm_offset_t end; vm_offset_t va; vm_paddr_t pa; + vm_page_t m; pmap_t pmap; + lwkt_gettoken(&map->token); + pmap = vm_map_pmap(map); start = entry->start; end = entry->end; @@ -1693,16 +1775,19 @@ vm_fault_unwire(vm_map_t map, vm_map_entry_t entry) * Since the pages are wired down, we must be able to get their * mappings from the physical map system. */ - lwkt_gettoken(&vm_token); for (va = start; va < end; va += PAGE_SIZE) { pa = pmap_extract(pmap, va); if (pa != 0) { pmap_change_wiring(pmap, va, FALSE); - if (!fictitious) - vm_page_unwire(PHYS_TO_VM_PAGE(pa), 1); + if (!fictitious) { + m = PHYS_TO_VM_PAGE(pa); + vm_page_busy_wait(m, FALSE, "vmwupg"); + vm_page_unwire(m, 1); + vm_page_wakeup(m); + } } } - lwkt_reltoken(&vm_token); + lwkt_reltoken(&map->token); } /* @@ -1743,6 +1828,7 @@ vm_fault_ratelimit(struct vmspace *vmspace) * Copy all of the pages from a wired-down map entry to another. * * The source and destination maps must be locked for write. + * The source and destination maps token must be held * The source map entry must be wired down (or be a sharing map * entry corresponding to a main map entry that is wired down). 
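Un-wiring a mapped page now busies it around the vm_page_unwire() call instead of relying on vm_token, so the wire count cannot be raced. The per-page step reduces to the sketch below; unwire_one_page() is illustrative and the OBJT_DEVICE/fictitious-page exception from the real loops is elided.

/*
 * Unwire the page backing a single wired VA.  The page is hard-busied
 * for the duration of the wire count update.
 */
static void
unwire_one_page(pmap_t pmap, vm_offset_t va)
{
	vm_paddr_t pa;
	vm_page_t m;

	if ((pa = pmap_extract(pmap, va)) == 0)
		return;
	pmap_change_wiring(pmap, va, FALSE);
	m = PHYS_TO_VM_PAGE(pa);
	vm_page_busy_wait(m, FALSE, "vmwupg");
	vm_page_unwire(m, 1);
	vm_page_wakeup(m);
}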
* @@ -1761,10 +1847,6 @@ vm_fault_copy_entry(vm_map_t dst_map, vm_map_t src_map, vm_page_t dst_m; vm_page_t src_m; -#ifdef lint - src_map++; -#endif /* lint */ - src_object = src_entry->object.vm_object; src_offset = src_entry->offset; @@ -1912,7 +1994,7 @@ vm_fault_additional_pages(vm_page_t m, int rbehind, int rahead, startpindex = pindex - rbehind; } - lwkt_gettoken(&vm_token); + vm_object_hold(object); for (tpindex = pindex; tpindex > startpindex; --tpindex) { if (vm_page_lookup(object, tpindex - 1)) break; @@ -1922,10 +2004,10 @@ vm_fault_additional_pages(vm_page_t m, int rbehind, int rahead, while (tpindex < pindex) { rtm = vm_page_alloc(object, tpindex, VM_ALLOC_SYSTEM); if (rtm == NULL) { - lwkt_reltoken(&vm_token); for (j = 0; j < i; j++) { vm_page_free(marray[j]); } + vm_object_drop(object); marray[0] = m; *reqpage = 0; return 1; @@ -1934,7 +2016,7 @@ vm_fault_additional_pages(vm_page_t m, int rbehind, int rahead, ++i; ++tpindex; } - lwkt_reltoken(&vm_token); + vm_object_drop(object); } else { i = 0; } @@ -1954,7 +2036,7 @@ vm_fault_additional_pages(vm_page_t m, int rbehind, int rahead, if (endpindex > object->size) endpindex = object->size; - lwkt_gettoken(&vm_token); + vm_object_hold(object); while (tpindex < endpindex) { if (vm_page_lookup(object, tpindex)) break; @@ -1965,7 +2047,7 @@ vm_fault_additional_pages(vm_page_t m, int rbehind, int rahead, ++i; ++tpindex; } - lwkt_reltoken(&vm_token); + vm_object_drop(object); return (i); } @@ -2048,23 +2130,23 @@ vm_prefault(pmap_t pmap, vm_offset_t addra, vm_map_entry_t entry, int prot) if (lp == NULL || (pmap != vmspace_pmap(lp->lwp_vmspace))) return; - lwkt_gettoken(&vm_token); - - object = entry->object.vm_object; - KKASSERT(object != NULL); - vm_object_hold(object); - starta = addra - PFBAK * PAGE_SIZE; if (starta < entry->start) starta = entry->start; else if (starta > addra) starta = 0; + object = entry->object.vm_object; + KKASSERT(object != NULL); + vm_object_hold(object); KKASSERT(object == entry->object.vm_object); + vm_object_chain_acquire(object); + for (i = 0; i < PAGEORDER_SIZE; i++) { vm_object_t lobject; vm_object_t nobject; int allocated = 0; + int error; addr = addra + vm_prefault_pageorder[i]; if (addr > addra + (PFFOR * PAGE_SIZE)) @@ -2088,9 +2170,6 @@ vm_prefault(pmap_t pmap, vm_offset_t addra, vm_map_entry_t entry, int prot) * In order to not have to check the pager via *haspage*() * we stop if any non-default object is encountered. e.g. * a vnode or swap object would stop the loop. - * - * XXX It is unclear whether hold chaining is sufficient - * to maintain the validity of the backing object chain. 
*/ index = ((addr - entry->start) + entry->offset) >> PAGE_SHIFT; lobject = object; @@ -2098,9 +2177,10 @@ vm_prefault(pmap_t pmap, vm_offset_t addra, vm_map_entry_t entry, int prot) pprot = prot; KKASSERT(lobject == entry->object.vm_object); - vm_object_hold(lobject); + /*vm_object_hold(lobject); implied */ - while ((m = vm_page_lookup(lobject, pindex)) == NULL) { + while ((m = vm_page_lookup_busy_try(lobject, pindex, + TRUE, &error)) == NULL) { if (lobject->type != OBJT_DEFAULT) break; if (lobject->backing_object == NULL) { @@ -2112,7 +2192,9 @@ vm_prefault(pmap_t pmap, vm_offset_t addra, vm_map_entry_t entry, int prot) break; } - /* NOTE: allocated from base object */ + /* + * NOTE: Allocated from base object + */ m = vm_page_alloc(object, index, VM_ALLOC_NORMAL | VM_ALLOC_ZERO); @@ -2135,37 +2217,38 @@ vm_prefault(pmap_t pmap, vm_offset_t addra, vm_map_entry_t entry, int prot) } if (lobject->backing_object_offset & PAGE_MASK) break; - while ((nobject = lobject->backing_object) != NULL) { - vm_object_hold(nobject); - if (nobject == lobject->backing_object) { - pindex += - lobject->backing_object_offset >> - PAGE_SHIFT; - vm_object_lock_swap(); - vm_object_drop(lobject); - lobject = nobject; - break; - } - vm_object_drop(nobject); - } - if (nobject == NULL) { - kprintf("vm_prefault: Warning, backing object " - "race averted lobject %p\n", - lobject); - continue; + nobject = lobject->backing_object; + vm_object_hold(nobject); + KKASSERT(nobject == lobject->backing_object); + pindex += lobject->backing_object_offset >> PAGE_SHIFT; + if (lobject != object) { + vm_object_lock_swap(); + vm_object_drop(lobject); } + lobject = nobject; pprot &= ~VM_PROT_WRITE; + vm_object_chain_acquire(lobject); } - vm_object_drop(lobject); /* - * NOTE: lobject now invalid (if we did a zero-fill we didn't - * bother assigning lobject = object). + * NOTE: A non-NULL (m) will be associated with lobject if + * it was found there, otherwise it is probably a + * zero-fill page associated with the base object. * - * Give-up if the page is not available. + * Give-up if no page is available. */ - if (m == NULL) + if (m == NULL) { + if (lobject != object) { + if (object->backing_object != lobject) + vm_object_hold(object->backing_object); + vm_object_chain_release_all( + object->backing_object, lobject); + if (object->backing_object != lobject) + vm_object_drop(object->backing_object); + vm_object_drop(lobject); + } break; + } /* * Do not conditionalize on PG_RAM. If pages are present in @@ -2178,11 +2261,22 @@ vm_prefault(pmap_t pmap, vm_offset_t addra, vm_map_entry_t entry, int prot) * be I/O bound anyway). * * The object must be marked dirty if we are mapping a - * writable page. + * writable page. m->object is either lobject or object, + * both of which are still held. */ if (pprot & VM_PROT_WRITE) vm_object_set_writeable_dirty(m->object); + if (lobject != object) { + if (object->backing_object != lobject) + vm_object_hold(object->backing_object); + vm_object_chain_release_all(object->backing_object, + lobject); + if (object->backing_object != lobject) + vm_object_drop(object->backing_object); + vm_object_drop(lobject); + } + /* * Enter the page into the pmap if appropriate. If we had * allocated the page we have to place it on a queue. 
If not @@ -2195,24 +2289,25 @@ vm_prefault(pmap_t pmap, vm_offset_t addra, vm_map_entry_t entry, int prot) pmap_enter(pmap, addr, m, pprot, 0); vm_page_deactivate(m); vm_page_wakeup(m); + } else if (error) { + /* couldn't busy page, no wakeup */ } else if ( ((m->valid & VM_PAGE_BITS_ALL) == VM_PAGE_BITS_ALL) && - (m->busy == 0) && - (m->flags & (PG_BUSY | PG_FICTITIOUS)) == 0) { + (m->flags & PG_FICTITIOUS) == 0) { /* * A fully valid page not undergoing soft I/O can * be immediately entered into the pmap. */ - vm_page_busy(m); - if ((m->queue - m->pc) == PQ_CACHE) { + if ((m->queue - m->pc) == PQ_CACHE) vm_page_deactivate(m); - } if (pprot & VM_PROT_WRITE) vm_set_nosync(m, entry); pmap_enter(pmap, addr, m, pprot, 0); vm_page_wakeup(m); + } else { + vm_page_wakeup(m); } } + vm_object_chain_release(object); vm_object_drop(object); - lwkt_reltoken(&vm_token); } diff --git a/sys/vm/vm_glue.c b/sys/vm/vm_glue.c index 1cdbf83566..dcdb0db410 100644 --- a/sys/vm/vm_glue.c +++ b/sys/vm/vm_glue.c @@ -431,11 +431,13 @@ scheduler_callback(struct proc *p, void *data) * * Each second of sleep time is worth ~1MB */ + lwkt_gettoken(&p->p_vmspace->vm_map.token); pgs = vmspace_resident_count(p->p_vmspace); if (pgs < p->p_vmspace->vm_swrss) { pri -= (p->p_vmspace->vm_swrss - pgs) / (1024 * 1024 / PAGE_SIZE); } + lwkt_reltoken(&p->p_vmspace->vm_map.token); /* * If this process is higher priority and there is @@ -508,13 +510,11 @@ static int swapout_procs_callback(struct proc *p, void *data); void swapout_procs(int action) { - lwkt_gettoken(&vmspace_token); allproc_scan(swapout_procs_callback, &action); - lwkt_reltoken(&vmspace_token); } /* - * The caller must hold proc_token and vmspace_token. + * The caller must hold proc_token */ static int swapout_procs_callback(struct proc *p, void *data) @@ -593,7 +593,7 @@ swapout_procs_callback(struct proc *p, void *data) } /* - * The caller must hold proc_token and vmspace_token and p->p_token + * The caller must hold proc_token and p->p_token */ static void swapout(struct proc *p) @@ -606,7 +606,9 @@ swapout(struct proc *p) /* * remember the process resident count */ + lwkt_gettoken(&p->p_vmspace->vm_map.token); p->p_vmspace->vm_swrss = vmspace_resident_count(p->p_vmspace); + lwkt_reltoken(&p->p_vmspace->vm_map.token); p->p_flag |= P_SWAPPEDOUT; p->p_swtime = 0; } diff --git a/sys/vm/vm_kern.c b/sys/vm/vm_kern.c index b28bcb9219..a0e616c6ac 100644 --- a/sys/vm/vm_kern.c +++ b/sys/vm/vm_kern.c @@ -186,12 +186,15 @@ kmem_alloc3(vm_map_t map, vm_size_t size, int kmflags) vm_map_entry_release(count); return (0); } - vm_object_reference(&kernel_object); + vm_object_hold(&kernel_object); + vm_object_reference_locked(&kernel_object); vm_map_insert(map, &count, &kernel_object, addr, addr, addr + size, VM_MAPTYPE_NORMAL, VM_PROT_ALL, VM_PROT_ALL, cow); + vm_object_drop(&kernel_object); + vm_map_unlock(map); if (kmflags & KM_KRESERVE) vm_map_entry_krelease(count); @@ -215,7 +218,6 @@ kmem_alloc3(vm_map_t map, vm_size_t size, int kmflags) * We're intentionally not activating the pages we allocate to prevent a * race with page-out. vm_map_wire will wire the pages. */ - lwkt_gettoken(&vm_token); vm_object_hold(&kernel_object); for (i = gstart; i < size; i += PAGE_SIZE) { vm_page_t mem; @@ -229,7 +231,6 @@ kmem_alloc3(vm_map_t map, vm_size_t size, int kmflags) vm_page_wakeup(mem); } vm_object_drop(&kernel_object); - lwkt_reltoken(&vm_token); /* * And finally, mark the data as non-pageable. 
@@ -278,7 +279,6 @@ kmem_suballoc(vm_map_t parent, vm_map_t result, size = round_page(size); - lwkt_gettoken(&vm_token); *min = (vm_offset_t) vm_map_min(parent); ret = vm_map_find(parent, NULL, (vm_offset_t) 0, min, size, PAGE_SIZE, @@ -294,7 +294,6 @@ kmem_suballoc(vm_map_t parent, vm_map_t result, vm_map_init(result, *min, *max, vm_map_pmap(parent)); if ((ret = vm_map_submap(parent, *min, *max, result)) != KERN_SUCCESS) panic("kmem_suballoc: unable to change range to submap"); - lwkt_reltoken(&vm_token); } /* diff --git a/sys/vm/vm_map.c b/sys/vm/vm_map.c index a657c6522f..94e6c02cab 100644 --- a/sys/vm/vm_map.c +++ b/sys/vm/vm_map.c @@ -160,7 +160,7 @@ static int randomize_mmap; SYSCTL_INT(_vm, OID_AUTO, randomize_mmap, CTLFLAG_RW, &randomize_mmap, 0, "Randomize mmap offsets"); -static void vm_map_entry_shadow(vm_map_entry_t entry); +static void vm_map_entry_shadow(vm_map_entry_t entry, int addref); static vm_map_entry_t vm_map_entry_create(vm_map_t map, int *); static void vm_map_entry_dispose (vm_map_t map, vm_map_entry_t entry, int *); static void _vm_map_clip_end (vm_map_t, vm_map_entry_t, vm_offset_t, int *); @@ -169,7 +169,6 @@ static void vm_map_entry_delete (vm_map_t, vm_map_entry_t, int *); static void vm_map_entry_unwire (vm_map_t, vm_map_entry_t); static void vm_map_copy_entry (vm_map_t, vm_map_t, vm_map_entry_t, vm_map_entry_t); -static void vm_map_split (vm_map_entry_t); static void vm_map_unclip_range (vm_map_t map, vm_map_entry_t start_entry, vm_offset_t start, vm_offset_t end, int *count, int flags); /* @@ -248,18 +247,18 @@ vmspace_alloc(vm_offset_t min, vm_offset_t max) { struct vmspace *vm; - lwkt_gettoken(&vmspace_token); vm = sysref_alloc(&vmspace_sysref_class); bzero(&vm->vm_startcopy, (char *)&vm->vm_endcopy - (char *)&vm->vm_startcopy); vm_map_init(&vm->vm_map, min, max, NULL); pmap_pinit(vmspace_pmap(vm)); /* (some fields reused) */ + lwkt_gettoken(&vm->vm_map.token); vm->vm_map.pmap = vmspace_pmap(vm); /* XXX */ vm->vm_shm = NULL; vm->vm_exitingcnt = 0; cpu_vmspace_alloc(vm); sysref_activate(&vm->vm_sysref); - lwkt_reltoken(&vmspace_token); + lwkt_reltoken(&vm->vm_map.token); return (vm); } @@ -302,14 +301,14 @@ vmspace_terminate(struct vmspace *vm) * If exitingcnt is non-zero we can't get rid of the entire vmspace * yet, but we can scrap user memory. 
*/ - lwkt_gettoken(&vmspace_token); + lwkt_gettoken(&vm->vm_map.token); if (vm->vm_exitingcnt) { shmexit(vm); pmap_remove_pages(vmspace_pmap(vm), VM_MIN_USER_ADDRESS, VM_MAX_USER_ADDRESS); vm_map_remove(&vm->vm_map, VM_MIN_USER_ADDRESS, VM_MAX_USER_ADDRESS); - lwkt_reltoken(&vmspace_token); + lwkt_reltoken(&vm->vm_map.token); return; } cpu_vmspace_free(vm); @@ -334,9 +333,11 @@ vmspace_terminate(struct vmspace *vm) vm_map_unlock(&vm->vm_map); vm_map_entry_release(count); + lwkt_gettoken(&vmspace_pmap(vm)->pm_token); pmap_release(vmspace_pmap(vm)); + lwkt_reltoken(&vmspace_pmap(vm)->pm_token); + lwkt_reltoken(&vm->vm_map.token); sysref_put(&vm->vm_sysref); - lwkt_reltoken(&vmspace_token); } /* @@ -362,9 +363,9 @@ vmspace_unlock(struct vmspace *vm __unused) void vmspace_exitbump(struct vmspace *vm) { - lwkt_gettoken(&vmspace_token); + lwkt_gettoken(&vm->vm_map.token); ++vm->vm_exitingcnt; - lwkt_reltoken(&vmspace_token); + lwkt_reltoken(&vm->vm_map.token); } /* @@ -378,13 +379,16 @@ vmspace_exitfree(struct proc *p) { struct vmspace *vm; - lwkt_gettoken(&vmspace_token); vm = p->p_vmspace; + lwkt_gettoken(&vm->vm_map.token); p->p_vmspace = NULL; - if (--vm->vm_exitingcnt == 0 && sysref_isinactive(&vm->vm_sysref)) + if (--vm->vm_exitingcnt == 0 && sysref_isinactive(&vm->vm_sysref)) { + lwkt_reltoken(&vm->vm_map.token); vmspace_terminate(vm); - lwkt_reltoken(&vmspace_token); + } else { + lwkt_reltoken(&vm->vm_map.token); + } } /* @@ -396,15 +400,15 @@ vmspace_exitfree(struct proc *p) * No requirements. */ int -vmspace_swap_count(struct vmspace *vmspace) +vmspace_swap_count(struct vmspace *vm) { - vm_map_t map = &vmspace->vm_map; + vm_map_t map = &vm->vm_map; vm_map_entry_t cur; vm_object_t object; int count = 0; int n; - lwkt_gettoken(&vmspace_token); + lwkt_gettoken(&vm->vm_map.token); for (cur = map->header.next; cur != &map->header; cur = cur->next) { switch(cur->maptype) { case VM_MAPTYPE_NORMAL: @@ -421,7 +425,7 @@ vmspace_swap_count(struct vmspace *vmspace) break; } } - lwkt_reltoken(&vmspace_token); + lwkt_reltoken(&vm->vm_map.token); return(count); } @@ -433,14 +437,14 @@ vmspace_swap_count(struct vmspace *vmspace) * No requirements. 
*/ int -vmspace_anonymous_count(struct vmspace *vmspace) +vmspace_anonymous_count(struct vmspace *vm) { - vm_map_t map = &vmspace->vm_map; + vm_map_t map = &vm->vm_map; vm_map_entry_t cur; vm_object_t object; int count = 0; - lwkt_gettoken(&vmspace_token); + lwkt_gettoken(&vm->vm_map.token); for (cur = map->header.next; cur != &map->header; cur = cur->next) { switch(cur->maptype) { case VM_MAPTYPE_NORMAL: @@ -457,7 +461,7 @@ vmspace_anonymous_count(struct vmspace *vmspace) break; } } - lwkt_reltoken(&vmspace_token); + lwkt_reltoken(&vm->vm_map.token); return(count); } @@ -497,6 +501,7 @@ vm_map_init(struct vm_map *map, vm_offset_t min, vm_offset_t max, pmap_t pmap) map->hint = &map->header; map->timestamp = 0; map->flags = 0; + lwkt_token_init(&map->token, "vm_map"); lockinit(&map->lock, "thrd_sleep", (hz + 9) / 10, 0); TUNABLE_INT("vm.cache_vmspaces", &vmspace_sysref_class.nom_cache); } @@ -520,14 +525,14 @@ vm_map_init(struct vm_map *map, vm_offset_t min, vm_offset_t max, pmap_t pmap) */ static void -vm_map_entry_shadow(vm_map_entry_t entry) +vm_map_entry_shadow(vm_map_entry_t entry, int addref) { if (entry->maptype == VM_MAPTYPE_VPAGETABLE) { vm_object_shadow(&entry->object.vm_object, &entry->offset, - 0x7FFFFFFF); /* XXX */ + 0x7FFFFFFF, addref); /* XXX */ } else { vm_object_shadow(&entry->object.vm_object, &entry->offset, - atop(entry->end - entry->start)); + atop(entry->end - entry->start), addref); } entry->eflags &= ~MAP_ENTRY_NEEDS_COPY; } @@ -743,12 +748,8 @@ vm_map_entry_dispose(vm_map_t map, vm_map_entry_t entry, int *countp) * Insert/remove entries from maps. * * The related map must be exclusively locked. + * The caller must hold map->token * No other requirements. - * - * NOTE! We currently acquire the vmspace_token only to avoid races - * against the pageout daemon's calls to vmspace_*_count(), which - * are unable to safely lock the vm_map without potentially - * deadlocking. */ static __inline void vm_map_entry_link(vm_map_t map, @@ -757,7 +758,6 @@ vm_map_entry_link(vm_map_t map, { ASSERT_VM_MAP_LOCKED(map); - lwkt_gettoken(&vmspace_token); map->nentries++; entry->prev = after_where; entry->next = after_where->next; @@ -765,7 +765,6 @@ vm_map_entry_link(vm_map_t map, after_where->next = entry; if (vm_map_rb_tree_RB_INSERT(&map->rb_root, entry)) panic("vm_map_entry_link: dup addr map %p ent %p", map, entry); - lwkt_reltoken(&vmspace_token); } static __inline void @@ -781,14 +780,12 @@ vm_map_entry_unlink(vm_map_t map, panic("vm_map_entry_unlink: attempt to mess with " "locked entry! %p", entry); } - lwkt_gettoken(&vmspace_token); prev = entry->prev; next = entry->next; next->prev = prev; prev->next = next; vm_map_rb_tree_RB_REMOVE(&map->rb_root, entry); map->nentries--; - lwkt_reltoken(&vmspace_token); } /* @@ -861,10 +858,11 @@ vm_map_lookup_entry(vm_map_t map, vm_offset_t address, vm_map_entry_t *entry) * address range. The object's size should match that of the address range. * * The map must be exclusively locked. + * The object must be held. * The caller must have reserved sufficient vm_map_entry structures. * - * If object is non-NULL, ref count must be bumped by caller - * prior to making call to account for the new entry. + * If object is non-NULL, ref count must be bumped by caller prior to + * making call to account for the new entry. 
*/ int vm_map_insert(vm_map_t map, int *countp, @@ -878,8 +876,11 @@ vm_map_insert(vm_map_t map, int *countp, vm_map_entry_t prev_entry; vm_map_entry_t temp_entry; vm_eflags_t protoeflags; + int must_drop = 0; ASSERT_VM_MAP_LOCKED(map); + if (object) + ASSERT_LWKT_TOKEN_HELD(vm_object_token(object)); /* * Check that the start and end points are not bogus. @@ -925,8 +926,7 @@ vm_map_insert(vm_map_t map, int *countp, if (cow & MAP_IS_KSTACK) protoeflags |= MAP_ENTRY_KSTACK; - lwkt_gettoken(&vm_token); - lwkt_gettoken(&vmobj_token); + lwkt_gettoken(&map->token); if (object) { /* @@ -934,7 +934,6 @@ vm_map_insert(vm_map_t map, int *countp, * process. We have to set or clear OBJ_ONEMAPPING * appropriately. */ - if ((object->ref_count > 1) || (object->shadow_count != 0)) { vm_object_clear_flag(object, OBJ_ONEMAPPING); } @@ -957,11 +956,10 @@ vm_map_insert(vm_map_t map, int *countp, if ((prev_entry->inheritance == VM_INHERIT_DEFAULT) && (prev_entry->protection == prot) && (prev_entry->max_protection == max)) { - lwkt_reltoken(&vmobj_token); - lwkt_reltoken(&vm_token); map->size += (end - prev_entry->end); prev_entry->end = end; vm_map_simplify_entry(map, prev_entry, countp); + lwkt_reltoken(&map->token); return (KERN_SUCCESS); } @@ -974,12 +972,14 @@ vm_map_insert(vm_map_t map, int *countp, object = prev_entry->object.vm_object; offset = prev_entry->offset + (prev_entry->end - prev_entry->start); - vm_object_reference_locked(object); + if (object) { + vm_object_hold(object); + vm_object_chain_wait(object); + vm_object_reference_locked(object); + must_drop = 1; + } } - lwkt_reltoken(&vmobj_token); - lwkt_reltoken(&vm_token); - /* * NOTE: if conditionals fail, object can be NULL here. This occurs * in things like the buffer map where we manage kva but do not manage @@ -1045,7 +1045,10 @@ vm_map_insert(vm_map_t map, int *countp, object, OFF_TO_IDX(offset), end - start, cow & MAP_PREFAULT_PARTIAL); } + if (must_drop) + vm_object_drop(object); + lwkt_reltoken(&map->token); return (KERN_SUCCESS); } @@ -1178,8 +1181,9 @@ vm_map_findspace(vm_map_t map, vm_offset_t start, vm_size_t length, /* * vm_map_find finds an unallocated region in the target address map with - * the given length. The search is defined to be first-fit from the - * specified address; the region found is returned in the same parameter. + * the given length and allocates it. The search is defined to be first-fit + * from the specified address; the region found is returned in the same + * parameter. * * If object is non-NULL, ref count must be bumped by caller * prior to making call to account for the new entry. 
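The ref-bump idiom used above, and again in the clip and fork paths further down, always waits out any chain lock before taking the reference so that an in-progress collapse or split cannot be raced. Isolated, it is simply the following (reference_object_safe() is an illustrative name):

/*
 * Safely add a reference to an object that may be in the middle of a
 * collapse or split: hold it, wait for any chain lock to clear, then
 * take the reference while the token is still held.
 */
static void
reference_object_safe(vm_object_t object)
{
	vm_object_hold(object);
	vm_object_chain_wait(object);
	vm_object_reference_locked(object);
	vm_object_drop(object);
}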
@@ -1202,6 +1206,8 @@ vm_map_find(vm_map_t map, vm_object_t object, vm_ooffset_t offset, count = vm_map_entry_reserve(MAP_RESERVE_COUNT); vm_map_lock(map); + if (object) + vm_object_hold(object); if (fitit) { if (vm_map_findspace(map, start, length, align, 0, addr)) { vm_map_unlock(map); @@ -1215,6 +1221,8 @@ vm_map_find(vm_map_t map, vm_object_t object, vm_ooffset_t offset, maptype, prot, max, cow); + if (object) + vm_object_drop(object); vm_map_unlock(map); vm_map_entry_release(count); @@ -1350,7 +1358,12 @@ _vm_map_clip_start(vm_map_t map, vm_map_entry_t entry, vm_offset_t start, switch(entry->maptype) { case VM_MAPTYPE_NORMAL: case VM_MAPTYPE_VPAGETABLE: - vm_object_reference(new_entry->object.vm_object); + if (new_entry->object.vm_object) { + vm_object_hold(new_entry->object.vm_object); + vm_object_chain_wait(new_entry->object.vm_object); + vm_object_reference_locked(new_entry->object.vm_object); + vm_object_drop(new_entry->object.vm_object); + } break; default: break; @@ -1407,7 +1420,12 @@ _vm_map_clip_end(vm_map_t map, vm_map_entry_t entry, vm_offset_t end, switch(entry->maptype) { case VM_MAPTYPE_NORMAL: case VM_MAPTYPE_VPAGETABLE: - vm_object_reference(new_entry->object.vm_object); + if (new_entry->object.vm_object) { + vm_object_hold(new_entry->object.vm_object); + vm_object_chain_wait(new_entry->object.vm_object); + vm_object_reference_locked(new_entry->object.vm_object); + vm_object_drop(new_entry->object.vm_object); + } break; default: break; @@ -2075,7 +2093,7 @@ vm_map_unwire(vm_map_t map, vm_offset_t start, vm_offset_t real_end, MAP_ENTRY_NEEDS_COPY; if (copyflag && ((entry->protection & VM_PROT_WRITE) != 0)) { - vm_map_entry_shadow(entry); + vm_map_entry_shadow(entry, 0); } else if (entry->object.vm_object == NULL && !map->system_map) { vm_map_entry_allocate_object(entry); @@ -2272,7 +2290,7 @@ vm_map_wire(vm_map_t map, vm_offset_t start, vm_offset_t real_end, int kmflags) MAP_ENTRY_NEEDS_COPY; if (copyflag && ((entry->protection & VM_PROT_WRITE) != 0)) { - vm_map_entry_shadow(entry); + vm_map_entry_shadow(entry, 0); } else if (entry->object.vm_object == NULL && !map->system_map) { vm_map_entry_allocate_object(entry); @@ -2438,6 +2456,7 @@ vm_map_clean(vm_map_t map, vm_offset_t start, vm_offset_t end, vm_map_entry_t entry; vm_size_t size; vm_object_t object; + vm_object_t tobj; vm_ooffset_t offset; vm_map_lock_read(map); @@ -2446,6 +2465,8 @@ vm_map_clean(vm_map_t map, vm_offset_t start, vm_offset_t end, vm_map_unlock_read(map); return (KERN_INVALID_ADDRESS); } + lwkt_gettoken(&map->token); + /* * Make a first pass to check for holes. */ @@ -2468,12 +2489,7 @@ vm_map_clean(vm_map_t map, vm_offset_t start, vm_offset_t end, /* * Make a second pass, cleaning/uncaching pages from the indicated * objects as we go. - * - * Hold vm_token to avoid blocking in vm_object_reference() */ - lwkt_gettoken(&vm_token); - lwkt_gettoken(&vmobj_token); - for (current = entry; current->start < end; current = current->next) { offset = current->offset + (start - current->start); size = (end <= current->end ? 
end : current->end) - start; @@ -2494,6 +2510,10 @@ vm_map_clean(vm_map_t map, vm_offset_t start, vm_offset_t end, } else { object = current->object.vm_object; } + + if (object) + vm_object_hold(object); + /* * Note that there is absolutely no sense in writing out * anonymous objects, so we track down the vnode object @@ -2504,11 +2524,19 @@ vm_map_clean(vm_map_t map, vm_offset_t start, vm_offset_t end, * note: certain anonymous maps, such as MAP_NOSYNC maps, * may start out with a NULL object. */ - while (object && object->backing_object) { - offset += object->backing_object_offset; - object = object->backing_object; - if (object->size < OFF_TO_IDX( offset + size)) - size = IDX_TO_OFF(object->size) - offset; + while (object && (tobj = object->backing_object) != NULL) { + vm_object_hold(tobj); + if (tobj == object->backing_object) { + vm_object_lock_swap(); + offset += object->backing_object_offset; + vm_object_drop(object); + object = tobj; + if (object->size < OFF_TO_IDX(offset + size)) + size = IDX_TO_OFF(object->size) - + offset; + break; + } + vm_object_drop(tobj); } if (object && (object->type == OBJT_VNODE) && (current->protection & VM_PROT_WRITE) && @@ -2527,6 +2555,7 @@ vm_map_clean(vm_map_t map, vm_offset_t start, vm_offset_t end, */ int flags; + /* no chain wait needed for vnode objects */ vm_object_reference_locked(object); vn_lock(object->handle, LK_EXCLUSIVE | LK_RETRY); flags = (syncio || invalidate) ? OBJPC_SYNC : 0; @@ -2556,6 +2585,7 @@ vm_map_clean(vm_map_t map, vm_offset_t start, vm_offset_t end, (object->type == OBJT_DEVICE))) { int clean_only = (object->type == OBJT_DEVICE) ? FALSE : TRUE; + /* no chain wait needed for vnode/device objects */ vm_object_reference_locked(object); switch(current->maptype) { case VM_MAPTYPE_NORMAL: @@ -2571,10 +2601,11 @@ vm_map_clean(vm_map_t map, vm_offset_t start, vm_offset_t end, vm_object_deallocate_locked(object); } start += size; + if (object) + vm_object_drop(object); } - lwkt_reltoken(&vmobj_token); - lwkt_reltoken(&vm_token); + lwkt_reltoken(&map->token); vm_map_unlock_read(map); return (KERN_SUCCESS); @@ -2629,6 +2660,7 @@ vm_map_delete(vm_map_t map, vm_offset_t start, vm_offset_t end, int *countp) vm_map_entry_t first_entry; ASSERT_VM_MAP_LOCKED(map); + lwkt_gettoken(&map->token); again: /* * Find the start of the region, and clip it. Set entry to point @@ -2705,20 +2737,22 @@ again: offidxend = offidxstart + count; - /* - * Hold vm_token when manipulating vm_objects, - * - * Hold vmobj_token when potentially adding or removing - * objects (collapse requires both). - */ - lwkt_gettoken(&vm_token); - lwkt_gettoken(&vmobj_token); - vm_object_hold(object); - if (object == &kernel_object) { + vm_object_hold(object); vm_object_page_remove(object, offidxstart, offidxend, FALSE); - } else { + vm_object_drop(object); + } else if (object && object->type != OBJT_DEFAULT && + object->type != OBJT_SWAP) { + /* + * vnode object routines cannot be chain-locked + */ + vm_object_hold(object); + pmap_remove(map->pmap, s, e); + vm_object_drop(object); + } else if (object) { + vm_object_hold(object); + vm_object_chain_acquire(object); pmap_remove(map->pmap, s, e); if (object != NULL && @@ -2740,12 +2774,10 @@ again: object->size = offidxstart; } } + vm_object_chain_release(object); + vm_object_drop(object); } - vm_object_drop(object); - lwkt_reltoken(&vmobj_token); - lwkt_reltoken(&vm_token); - /* * Delete the entry (which may delete the object) only after * removing all pmap entries pointing to its pages. 
@@ -2755,6 +2787,7 @@ again: vm_map_entry_delete(map, entry, countp); entry = next; } + lwkt_reltoken(&map->token); return (KERN_SUCCESS); } @@ -2837,29 +2870,109 @@ vm_map_check_protection(vm_map_t map, vm_offset_t start, vm_offset_t end, } /* - * Split the pages in a map entry into a new object. This affords - * easier removal of unused pages, and keeps object inheritance from - * being a negative impact on memory usage. + * If appropriate this function shadows the original object with a new object + * and moves the VM pages from the original object to the new object. + * The original object will also be collapsed, if possible. * - * The vm_map must be exclusively locked. - * The orig_object should be held. + * We can only do this for normal memory objects with a single mapping, and + * it only makes sense to do it if there are 2 or more refs on the original + * object. i.e. typically a memory object that has been extended into + * multiple vm_map_entry's with non-overlapping ranges. + * + * This makes it easier to remove unused pages and keeps object inheritance + * from being a negative impact on memory usage. + * + * On return the (possibly new) entry->object.vm_object will have an + * additional ref on it for the caller to dispose of (usually by cloning + * the vm_map_entry). The additional ref had to be done in this routine + * to avoid racing a collapse. The object's ONEMAPPING flag will also be + * cleared. + * + * The vm_map must be locked and its token held. */ static void vm_map_split(vm_map_entry_t entry) { - vm_page_t m; - vm_object_t orig_object, new_object, source; +#if 0 + /* UNOPTIMIZED */ + vm_object_t oobject; + + oobject = entry->object.vm_object; + vm_object_hold(oobject); + vm_object_chain_wait(oobject); + vm_object_reference_locked(oobject); + vm_object_clear_flag(oobject, OBJ_ONEMAPPING); + vm_object_drop(oobject); +#else + /* OPTIMIZED */ + vm_object_t oobject, nobject, bobject; vm_offset_t s, e; + vm_page_t m; vm_pindex_t offidxstart, offidxend, idx; vm_size_t size; vm_ooffset_t offset; - orig_object = entry->object.vm_object; - if (orig_object->type != OBJT_DEFAULT && orig_object->type != OBJT_SWAP) + /* + * Setup. Chain lock the original object throughout the entire + * routine to prevent new page faults from occuring. + * + * XXX can madvise WILLNEED interfere with us too? + */ + oobject = entry->object.vm_object; + vm_object_hold(oobject); + vm_object_chain_acquire(oobject); + + /* + * Original object cannot be split? + */ + if (oobject->handle == NULL || (oobject->type != OBJT_DEFAULT && + oobject->type != OBJT_SWAP)) { + vm_object_chain_release(oobject); + vm_object_reference_locked(oobject); + vm_object_clear_flag(oobject, OBJ_ONEMAPPING); + vm_object_drop(oobject); return; - if (orig_object->ref_count <= 1) + } + + /* + * Collapse original object with its backing store as an + * optimization to reduce chain lengths when possible. + * + * If ref_count <= 1 there aren't other non-overlapping vm_map_entry's + * for oobject, so there's no point collapsing it. + * + * Then re-check whether the object can be split. + */ + vm_object_collapse(oobject); + + if (oobject->ref_count <= 1 || + (oobject->type != OBJT_DEFAULT && oobject->type != OBJT_SWAP) || + (oobject->flags & (OBJ_NOSPLIT|OBJ_ONEMAPPING)) != OBJ_ONEMAPPING) { + vm_object_chain_release(oobject); + vm_object_reference_locked(oobject); + vm_object_clear_flag(oobject, OBJ_ONEMAPPING); + vm_object_drop(oobject); return; + } + + /* + * Acquire the chain lock on the backing object. 
+ * + * Give bobject an additional ref count for when it will be shadowed + * by nobject. + */ + if ((bobject = oobject->backing_object) != NULL) { + vm_object_hold(bobject); + vm_object_chain_wait(bobject); + vm_object_reference_locked(bobject); + vm_object_chain_acquire(bobject); + KKASSERT(bobject->backing_object == bobject); + KKASSERT((bobject->flags & OBJ_DEAD) == 0); + } + /* + * Calculate the object page range and allocate the new object. + */ offset = entry->offset; s = entry->start; e = entry->end; @@ -2868,53 +2981,65 @@ vm_map_split(vm_map_entry_t entry) offidxend = offidxstart + OFF_TO_IDX(e - s); size = offidxend - offidxstart; - switch(orig_object->type) { + switch(oobject->type) { case OBJT_DEFAULT: - new_object = default_pager_alloc(NULL, IDX_TO_OFF(size), - VM_PROT_ALL, 0); + nobject = default_pager_alloc(NULL, IDX_TO_OFF(size), + VM_PROT_ALL, 0); break; case OBJT_SWAP: - new_object = swap_pager_alloc(NULL, IDX_TO_OFF(size), - VM_PROT_ALL, 0); + nobject = swap_pager_alloc(NULL, IDX_TO_OFF(size), + VM_PROT_ALL, 0); break; default: /* not reached */ - new_object = NULL; + nobject = NULL; KKASSERT(0); } - if (new_object == NULL) + + if (nobject == NULL) { + if (bobject) { + vm_object_chain_release(bobject); + vm_object_deallocate(bobject); + vm_object_drop(bobject); + } + vm_object_chain_release(oobject); + vm_object_reference_locked(oobject); + vm_object_clear_flag(oobject, OBJ_ONEMAPPING); + vm_object_drop(oobject); return; + } /* - * vm_token required when manipulating vm_objects. + * The new object will replace entry->object.vm_object so it needs + * a second reference (the caller expects an additional ref). */ - lwkt_gettoken(&vm_token); - lwkt_gettoken(&vmobj_token); + vm_object_hold(nobject); + vm_object_reference_locked(nobject); + vm_object_chain_acquire(nobject); - vm_object_hold(new_object); - - source = orig_object->backing_object; - if (source != NULL) { - vm_object_hold(source); - /* Referenced by new_object */ - vm_object_reference_locked(source); - LIST_INSERT_HEAD(&source->shadow_head, - new_object, shadow_list); - vm_object_clear_flag(source, OBJ_ONEMAPPING); - new_object->backing_object_offset = - orig_object->backing_object_offset + - IDX_TO_OFF(offidxstart); - new_object->backing_object = source; - source->shadow_count++; - source->generation++; - vm_object_drop(source); + /* + * nobject shadows bobject (oobject already shadows bobject). + */ + if (bobject) { + nobject->backing_object_offset = + oobject->backing_object_offset + IDX_TO_OFF(offidxstart); + nobject->backing_object = bobject; + bobject->shadow_count++; + bobject->generation++; + LIST_INSERT_HEAD(&bobject->shadow_head, nobject, shadow_list); + vm_object_clear_flag(bobject, OBJ_ONEMAPPING); /* XXX? */ + vm_object_chain_release(bobject); + vm_object_drop(bobject); } + /* + * Move the VM pages from oobject to nobject + */ for (idx = 0; idx < size; idx++) { vm_page_t m; - retry: - m = vm_page_lookup(orig_object, offidxstart + idx); + m = vm_page_lookup_busy_wait(oobject, offidxstart + idx, + TRUE, "vmpg"); if (m == NULL) continue; @@ -2924,24 +3049,23 @@ vm_map_split(vm_map_entry_t entry) * * We do not have to VM_PROT_NONE the page as mappings should * not be changed by this operation. + * + * NOTE: The act of renaming a page updates chaingen for both + * objects. 
*/ - if (vm_page_sleep_busy(m, TRUE, "spltwt")) - goto retry; - vm_page_busy(m); - vm_page_rename(m, new_object, idx); + vm_page_rename(m, nobject, idx); /* page automatically made dirty by rename and cache handled */ - vm_page_busy(m); + /* page remains busy */ } - if (orig_object->type == OBJT_SWAP) { - vm_object_pip_add(orig_object, 1); + if (oobject->type == OBJT_SWAP) { + vm_object_pip_add(oobject, 1); /* - * copy orig_object pages into new_object - * and destroy unneeded pages in - * shadow object. + * copy oobject pages into nobject and destroy unneeded + * pages in shadow object. */ - swap_pager_copy(orig_object, new_object, offidxstart, 0); - vm_object_pip_wakeup(orig_object); + swap_pager_copy(oobject, nobject, offidxstart, 0); + vm_object_pip_wakeup(oobject); } /* @@ -2949,17 +3073,36 @@ vm_map_split(vm_map_entry_t entry) * for a simple wakeup. */ for (idx = 0; idx < size; idx++) { - m = vm_page_lookup(new_object, idx); - if (m) + m = vm_page_lookup(nobject, idx); + if (m) { + KKASSERT(m->flags & PG_BUSY); vm_page_wakeup(m); + } } - - entry->object.vm_object = new_object; + entry->object.vm_object = nobject; entry->offset = 0LL; - vm_object_deallocate_locked(orig_object); - vm_object_drop(new_object); - lwkt_reltoken(&vmobj_token); - lwkt_reltoken(&vm_token); + + /* + * Cleanup + * + * NOTE: There is no need to remove OBJ_ONEMAPPING from oobject, the + * related pages were moved and are no longer applicable to the + * original object. + * + * NOTE: Deallocate oobject (due to its entry->object.vm_object being + * replaced by nobject). + */ + vm_object_chain_release(nobject); + vm_object_drop(nobject); + if (bobject) { + vm_object_chain_release(bobject); + vm_object_drop(bobject); + } + vm_object_chain_release(oobject); + /*vm_object_clear_flag(oobject, OBJ_ONEMAPPING);*/ + vm_object_deallocate_locked(oobject); + vm_object_drop(oobject); +#endif } /* @@ -2967,11 +3110,11 @@ vm_map_split(vm_map_entry_t entry) * entry. The entries *must* be aligned properly. * * The vm_map must be exclusively locked. - * vm_token must be held + * The vm_map's token must be held. */ static void vm_map_copy_entry(vm_map_t src_map, vm_map_t dst_map, - vm_map_entry_t src_entry, vm_map_entry_t dst_entry) + vm_map_entry_t src_entry, vm_map_entry_t dst_entry) { vm_object_t src_object; @@ -2980,9 +3123,6 @@ vm_map_copy_entry(vm_map_t src_map, vm_map_t dst_map, if (src_entry->maptype == VM_MAPTYPE_SUBMAP) return; - ASSERT_LWKT_TOKEN_HELD(&vm_token); - lwkt_gettoken(&vmobj_token); /* required for collapse */ - if (src_entry->wired_count == 0) { /* * If the source entry is marked needs_copy, it is already @@ -3004,25 +3144,14 @@ vm_map_copy_entry(vm_map_t src_map, vm_map_t dst_map, * probably try to destroy the object. The lock is a pool * token and doesn't care. 
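The page-migration loop in the optimized vm_map_split() above combines the lookup and busy wait and leaves each moved page busied until after the swap pager copy. Reduced to its core it looks like this (split_move_pages() is illustrative; the wakeup pass and the swap_pager_copy() step stay with the caller):

/*
 * Migrate the resident pages for [offidxstart, offidxstart + size)
 * from oobject to nobject.  Each page is busied by the lookup and is
 * left busied; the caller wakes the pages after any pager copy.
 */
static void
split_move_pages(vm_object_t oobject, vm_object_t nobject,
		 vm_pindex_t offidxstart, vm_pindex_t size)
{
	vm_page_t m;
	vm_pindex_t idx;

	for (idx = 0; idx < size; ++idx) {
		m = vm_page_lookup_busy_wait(oobject, offidxstart + idx,
					     TRUE, "vmpg");
		if (m == NULL)
			continue;
		/* rename dirties the page and handles the cache state */
		vm_page_rename(m, nobject, idx);
	}
}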
*/ - if ((src_object = src_entry->object.vm_object) != NULL) { - vm_object_lock(src_object); - if ((src_object->handle == NULL) && - (src_object->type == OBJT_DEFAULT || - src_object->type == OBJT_SWAP)) { - vm_object_collapse(src_object); - if ((src_object->flags & (OBJ_NOSPLIT|OBJ_ONEMAPPING)) == OBJ_ONEMAPPING) { - vm_map_split(src_entry); - vm_object_unlock(src_object); - src_object = src_entry->object.vm_object; - vm_object_lock(src_object); - } - } - vm_object_reference_locked(src_object); - vm_object_unlock(src_object); - vm_object_clear_flag(src_object, OBJ_ONEMAPPING); + if (src_entry->object.vm_object != NULL) { + vm_map_split(src_entry); + src_object = src_entry->object.vm_object; dst_entry->object.vm_object = src_object; - src_entry->eflags |= (MAP_ENTRY_COW|MAP_ENTRY_NEEDS_COPY); - dst_entry->eflags |= (MAP_ENTRY_COW|MAP_ENTRY_NEEDS_COPY); + src_entry->eflags |= (MAP_ENTRY_COW | + MAP_ENTRY_NEEDS_COPY); + dst_entry->eflags |= (MAP_ENTRY_COW | + MAP_ENTRY_NEEDS_COPY); dst_entry->offset = src_entry->offset; } else { dst_entry->object.vm_object = NULL; @@ -3039,7 +3168,6 @@ vm_map_copy_entry(vm_map_t src_map, vm_map_t dst_map, */ vm_fault_copy_entry(dst_map, src_map, dst_entry, src_entry); } - lwkt_reltoken(&vmobj_token); } /* @@ -3063,15 +3191,14 @@ vmspace_fork(struct vmspace *vm1) vm_object_t object; int count; - lwkt_gettoken(&vm_token); - lwkt_gettoken(&vmspace_token); - lwkt_gettoken(&vmobj_token); + lwkt_gettoken(&vm1->vm_map.token); vm_map_lock(old_map); /* * XXX Note: upcalls are not copied. */ vm2 = vmspace_alloc(old_map->min_offset, old_map->max_offset); + lwkt_gettoken(&vm2->vm_map.token); bcopy(&vm1->vm_startcopy, &vm2->vm_startcopy, (caddr_t)&vm1->vm_endcopy - (caddr_t)&vm1->vm_startcopy); new_map = &vm2->vm_map; /* XXX */ @@ -3101,29 +3228,31 @@ vmspace_fork(struct vmspace *vm1) * Clone the entry, creating the shared object if * necessary. */ - object = old_entry->object.vm_object; - if (object == NULL) { + if (old_entry->object.vm_object == NULL) vm_map_entry_allocate_object(old_entry); - object = old_entry->object.vm_object; - } /* - * Add the reference before calling vm_map_entry_shadow - * to insure that a shadow object is created. + * Shadow a map_entry which needs a copy, replacing + * its object with a new object that points to the + * old one. Ask the shadow code to automatically add + * an additional ref. We can't do it afterwords + * because we might race a collapse */ - vm_object_reference_locked(object); if (old_entry->eflags & MAP_ENTRY_NEEDS_COPY) { - vm_map_entry_shadow(old_entry); - /* Transfer the second reference too. */ - vm_object_reference_locked( - old_entry->object.vm_object); - vm_object_deallocate_locked(object); - object = old_entry->object.vm_object; + vm_map_entry_shadow(old_entry, 1); + } else { + if (old_entry->object.vm_object) { + object = old_entry->object.vm_object; + vm_object_hold(object); + vm_object_chain_wait(object); + vm_object_reference_locked(object); + vm_object_drop(object); + } } - vm_object_clear_flag(object, OBJ_ONEMAPPING); /* - * Clone the entry, referencing the shared object. + * Clone the entry. We've already bumped the ref on + * any vm_object. 
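/*
 * Illustrative sketch, not part of the patch: the reference-taking
 * sequence used in the fork path above.  vm_object_reference_locked()
 * now requires the object to be held and not chain-locked, so callers
 * hold the object, wait out any chain lock, add the ref, then drop the
 * hold.  add_object_ref() is a hypothetical wrapper.
 */
static void
add_object_ref(vm_object_t object)
{
	vm_object_hold(object);
	vm_object_chain_wait(object);		/* don't race a collapse/split */
	vm_object_reference_locked(object);
	vm_object_drop(object);
}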
*/ new_entry = vm_map_entry_create(new_map, &count); *new_entry = *old_entry; @@ -3169,9 +3298,8 @@ vmspace_fork(struct vmspace *vm1) vm_map_unlock(new_map); vm_map_entry_release(count); - lwkt_reltoken(&vmobj_token); - lwkt_reltoken(&vmspace_token); - lwkt_reltoken(&vm_token); + lwkt_reltoken(&vm2->vm_map.token); + lwkt_reltoken(&vm1->vm_map.token); return (vm2); } @@ -3467,11 +3595,13 @@ vmspace_exec(struct proc *p, struct vmspace *vmcopy) * we create a new vmspace. Note that exitingcnt and upcalls * are not copied to the new vmspace. */ - lwkt_gettoken(&vmspace_token); + lwkt_gettoken(&oldvmspace->vm_map.token); if (vmcopy) { newvmspace = vmspace_fork(vmcopy); + lwkt_gettoken(&newvmspace->vm_map.token); } else { newvmspace = vmspace_alloc(map->min_offset, map->max_offset); + lwkt_gettoken(&newvmspace->vm_map.token); bcopy(&oldvmspace->vm_startcopy, &newvmspace->vm_startcopy, (caddr_t)&oldvmspace->vm_endcopy - (caddr_t)&oldvmspace->vm_startcopy); @@ -3484,8 +3614,9 @@ vmspace_exec(struct proc *p, struct vmspace *vmcopy) */ pmap_pinit2(vmspace_pmap(newvmspace)); pmap_replacevm(p, newvmspace, 0); + lwkt_reltoken(&newvmspace->vm_map.token); + lwkt_reltoken(&oldvmspace->vm_map.token); sysref_put(&oldvmspace->vm_sysref); - lwkt_reltoken(&vmspace_token); } /* @@ -3501,14 +3632,18 @@ vmspace_unshare(struct proc *p) struct vmspace *oldvmspace = p->p_vmspace; struct vmspace *newvmspace; - lwkt_gettoken(&vmspace_token); - if (oldvmspace->vm_sysref.refcnt == 1 && oldvmspace->vm_exitingcnt == 0) + lwkt_gettoken(&oldvmspace->vm_map.token); + if (oldvmspace->vm_sysref.refcnt == 1 && oldvmspace->vm_exitingcnt == 0) { + lwkt_reltoken(&oldvmspace->vm_map.token); return; + } newvmspace = vmspace_fork(oldvmspace); + lwkt_gettoken(&newvmspace->vm_map.token); pmap_pinit2(vmspace_pmap(newvmspace)); pmap_replacevm(p, newvmspace, 0); + lwkt_reltoken(&newvmspace->vm_map.token); + lwkt_reltoken(&oldvmspace->vm_map.token); sysref_put(&oldvmspace->vm_sysref); - lwkt_reltoken(&vmspace_token); } /* @@ -3607,6 +3742,7 @@ RetryLookup: */ entry = map->hint; *out_entry = entry; + *object = NULL; if ((entry == &map->header) || (vaddr < entry->start) || (vaddr >= entry->end)) { @@ -3711,7 +3847,7 @@ RetryLookup: } use_read_lock = 0; - vm_map_entry_shadow(entry); + vm_map_entry_shadow(entry, 0); } else { /* * We're attempting to read a copy-on-write page -- diff --git a/sys/vm/vm_map.h b/sys/vm/vm_map.h index 15cee91627..5e54a0fdf4 100644 --- a/sys/vm/vm_map.h +++ b/sys/vm/vm_map.h @@ -207,17 +207,19 @@ vm_map_entry_set_behavior(struct vm_map_entry *entry, u_char behavior) } /* - * Maps are doubly-linked lists of map entries, kept sorted - * by address. A single hint is provided to start - * searches again from the last successful search, - * insertion, or removal. + * Maps are doubly-linked lists of map entries, kept sorted by address. + * A single hint is provided to start searches again from the last + * successful search, insertion, or removal. * - * Note: the lock structure cannot be the first element of vm_map - * because this can result in a running lockup between two or more - * system processes trying to kmem_alloc_wait() due to kmem_alloc_wait() - * and free tsleep/waking up 'map' and the underlying lockmgr also - * sleeping and waking up on 'map'. The lockup occurs when the map fills - * up. The 'exec' map, for example. 
+ * NOTE: The lock structure cannot be the first element of vm_map + * because this can result in a running lockup between two or more + * system processes trying to kmem_alloc_wait() due to kmem_alloc_wait() + * and free tsleep/waking up 'map' and the underlying lockmgr also + * sleeping and waking up on 'map'. The lockup occurs when the map fills + * up. The 'exec' map, for example. + * + * NOTE: The vm_map structure can be hard-locked with the lockmgr lock + * or soft-serialized with the token, or both. */ struct vm_map { struct vm_map_entry header; /* List of entries */ @@ -233,6 +235,7 @@ struct vm_map { struct pmap *pmap; /* Physical map */ u_int president_cache; /* Remember president count */ u_int president_ticks; /* Save ticks for cache */ + struct lwkt_token token; /* Soft serializer */ #define min_offset header.start #define max_offset header.end }; @@ -429,6 +432,9 @@ vmspace_pmap(struct vmspace *vmspace) return &vmspace->vm_pmap; } +/* + * Caller must hold the vmspace->vm_map.token + */ static __inline long vmspace_resident_count(struct vmspace *vmspace) { @@ -450,6 +456,7 @@ vmspace_president_count(struct vmspace *vmspace) vm_map_entry_t cur; vm_object_t object; u_int count = 0; + u_int n; #ifdef _KERNEL if (map->president_ticks == ticks / hz || vm_map_lock_read_try(map)) @@ -466,9 +473,15 @@ vmspace_president_count(struct vmspace *vmspace) object->type != OBJT_SWAP) { break; } - if (object->agg_pv_list_count != 0) { - count += object->resident_page_count / - object->agg_pv_list_count; + /* + * synchronize non-zero case, contents of field + * can change at any time due to pmap ops. + */ + if ((n = object->agg_pv_list_count) != 0) { +#ifdef _KERNEL + cpu_ccfence(); +#endif + count += object->resident_page_count / n; } break; default: diff --git a/sys/vm/vm_meter.c b/sys/vm/vm_meter.c index b30128e645..4e87866bb8 100644 --- a/sys/vm/vm_meter.c +++ b/sys/vm/vm_meter.c @@ -216,7 +216,7 @@ do_vmtotal_callback(struct proc *p, void *data) * Note active objects. */ paging = 0; - lwkt_gettoken(&vm_token); + lwkt_gettoken(&p->p_token); if (p->p_vmspace) { map = &p->p_vmspace->vm_map; vm_map_lock_read(map); @@ -233,7 +233,7 @@ do_vmtotal_callback(struct proc *p, void *data) } vm_map_unlock_read(map); } - lwkt_reltoken(&vm_token); + lwkt_reltoken(&p->p_token); if (paging) totalp->t_pw++; return(0); diff --git a/sys/vm/vm_mmap.c b/sys/vm/vm_mmap.c index d6f1b2da31..526387ef4e 100644 --- a/sys/vm/vm_mmap.c +++ b/sys/vm/vm_mmap.c @@ -145,7 +145,7 @@ sys_sstk(struct sstk_args *uap) * is maintained as long as you do not write directly to the underlying * character device. 
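/*
 * Illustrative sketch, not part of the patch: the snapshot idiom used
 * above for object->agg_pv_list_count.  The racy field is copied into
 * a local, cpu_ccfence() keeps the compiler from re-reading it, and
 * the snapshot is then tested and used consistently.  racy_ratio() is
 * a hypothetical example.
 */
static u_int
racy_ratio(vm_object_t object)
{
	u_int n;

	if ((n = object->agg_pv_list_count) != 0) {
		cpu_ccfence();		/* divide by the snapshot, not a re-read */
		return (object->resident_page_count / n);
	}
	return (0);
}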
* - * No requirements; sys_mmap path holds the vm_token + * No requirements */ int kern_mmap(struct vmspace *vms, caddr_t uaddr, size_t ulen, @@ -382,8 +382,7 @@ kern_mmap(struct vmspace *vms, caddr_t uaddr, size_t ulen, } } - /* Token serializes access to vm_map.nentries against vm_mmap */ - lwkt_gettoken(&vm_token); + lwkt_gettoken(&vms->vm_map.token); /* * Do not allow more then a certain number of vm_map_entry structures @@ -393,7 +392,7 @@ kern_mmap(struct vmspace *vms, caddr_t uaddr, size_t ulen, if (max_proc_mmap && vms->vm_map.nentries >= max_proc_mmap * vms->vm_sysref.refcnt) { error = ENOMEM; - lwkt_reltoken(&vm_token); + lwkt_reltoken(&vms->vm_map.token); goto done; } @@ -402,7 +401,7 @@ kern_mmap(struct vmspace *vms, caddr_t uaddr, size_t ulen, if (error == 0) *res = (void *)(addr + pageoff); - lwkt_reltoken(&vm_token); + lwkt_reltoken(&vms->vm_map.token); done: if (fp) fdrop(fp); @@ -465,12 +464,12 @@ sys_msync(struct msync_args *uap) map = &p->p_vmspace->vm_map; /* - * vm_token serializes extracting the address range for size == 0 + * map->token serializes extracting the address range for size == 0 * msyncs with the vm_map_clean call; if the token were not held * across the two calls, an intervening munmap/mmap pair, for example, * could cause msync to occur on a wrong region. */ - lwkt_gettoken(&vm_token); + lwkt_gettoken(&map->token); /* * XXX Gak! If size is zero we are supposed to sync "all modified @@ -500,7 +499,7 @@ sys_msync(struct msync_args *uap) rv = vm_map_clean(map, addr, addr + size, (flags & MS_ASYNC) == 0, (flags & MS_INVALIDATE) != 0); done: - lwkt_reltoken(&vm_token); + lwkt_reltoken(&map->token); switch (rv) { case KERN_SUCCESS: @@ -559,20 +558,20 @@ sys_munmap(struct munmap_args *uap) map = &p->p_vmspace->vm_map; - /* vm_token serializes between the map check and the actual unmap */ - lwkt_gettoken(&vm_token); + /* map->token serializes between the map check and the actual unmap */ + lwkt_gettoken(&map->token); /* * Make sure entire range is allocated. */ if (!vm_map_check_protection(map, addr, addr + size, VM_PROT_NONE, FALSE)) { - lwkt_reltoken(&vm_token); + lwkt_reltoken(&map->token); return (EINVAL); } /* returns nothing but KERN_SUCCESS anyway */ vm_map_remove(map, addr, addr + size); - lwkt_reltoken(&vm_token); + lwkt_reltoken(&map->token); return (0); } @@ -799,7 +798,7 @@ sys_mincore(struct mincore_args *uap) map = &p->p_vmspace->vm_map; pmap = vmspace_pmap(p->p_vmspace); - lwkt_gettoken(&vm_token); + lwkt_gettoken(&map->token); vm_map_lock_read(map); RestartScan: timestamp = map->timestamp; @@ -869,8 +868,12 @@ RestartScan: * required to maintain the object * association. And XXX what if the page is * busy? What's the deal with that? + * + * XXX vm_token - legacy for pmap_ts_referenced + * in i386 and vkernel pmap code. */ - crit_enter(); + lwkt_gettoken(&vm_token); + vm_object_hold(current->object.vm_object); m = vm_page_lookup(current->object.vm_object, pindex); if (m && m->valid) { @@ -884,7 +887,8 @@ RestartScan: mincoreinfo |= MINCORE_REFERENCED_OTHER; } } - crit_exit(); + vm_object_drop(current->object.vm_object); + lwkt_reltoken(&vm_token); } /* @@ -963,7 +967,7 @@ RestartScan: error = 0; done: - lwkt_reltoken(&vm_token); + lwkt_reltoken(&map->token); return (error); } @@ -1173,7 +1177,7 @@ sys_munlock(struct munlock_args *uap) * Currently used by mmap, exec, and sys5 shared memory. * Handle is either a vnode pointer or NULL for MAP_ANON. 
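/*
 * Illustrative sketch, not part of the patch: the per-map token is held
 * across check-then-act sequences in the mmap/msync/munmap paths so an
 * intervening munmap/mmap from another thread cannot invalidate the
 * check.  unmap_checked() is a hypothetical helper mirroring the
 * sys_munmap() change above.
 */
static int
unmap_checked(vm_map_t map, vm_offset_t addr, vm_size_t size)
{
	lwkt_gettoken(&map->token);
	if (!vm_map_check_protection(map, addr, addr + size,
				     VM_PROT_NONE, FALSE)) {
		lwkt_reltoken(&map->token);
		return (EINVAL);
	}
	vm_map_remove(map, addr, addr + size);
	lwkt_reltoken(&map->token);
	return (0);
}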
* - * No requirements; kern_mmap path holds the vm_token + * No requirements */ int vm_mmap(vm_map_t map, vm_offset_t *addr, vm_size_t size, vm_prot_t prot, @@ -1198,7 +1202,7 @@ vm_mmap(vm_map_t map, vm_offset_t *addr, vm_size_t size, vm_prot_t prot, return (EINVAL); size = objsize; - lwkt_gettoken(&vm_token); + lwkt_gettoken(&map->token); /* * XXX messy code, fixme @@ -1210,7 +1214,7 @@ vm_mmap(vm_map_t map, vm_offset_t *addr, vm_size_t size, vm_prot_t prot, esize = map->size + size; /* workaround gcc4 opt */ if (esize < map->size || esize > p->p_rlimit[RLIMIT_VMEM].rlim_cur) { - lwkt_reltoken(&vm_token); + lwkt_reltoken(&map->token); return(ENOMEM); } } @@ -1227,7 +1231,7 @@ vm_mmap(vm_map_t map, vm_offset_t *addr, vm_size_t size, vm_prot_t prot, * will optimize it out. */ if (foff & PAGE_MASK) { - lwkt_reltoken(&vm_token); + lwkt_reltoken(&map->token); return (EINVAL); } @@ -1236,12 +1240,12 @@ vm_mmap(vm_map_t map, vm_offset_t *addr, vm_size_t size, vm_prot_t prot, *addr = round_page(*addr); } else { if (*addr != trunc_page(*addr)) { - lwkt_reltoken(&vm_token); + lwkt_reltoken(&map->token); return (EINVAL); } eaddr = *addr + size; if (eaddr < *addr) { - lwkt_reltoken(&vm_token); + lwkt_reltoken(&map->token); return (EINVAL); } fitit = FALSE; @@ -1263,7 +1267,7 @@ vm_mmap(vm_map_t map, vm_offset_t *addr, vm_size_t size, vm_prot_t prot, object = default_pager_alloc(handle, objsize, prot, foff); if (object == NULL) { - lwkt_reltoken(&vm_token); + lwkt_reltoken(&map->token); return(ENOMEM); } docow = MAP_PREFAULT_PARTIAL; @@ -1287,7 +1291,7 @@ vm_mmap(vm_map_t map, vm_offset_t *addr, vm_size_t size, vm_prot_t prot, handle = (void *)(intptr_t)vp->v_rdev; object = dev_pager_alloc(handle, objsize, prot, foff); if (object == NULL) { - lwkt_reltoken(&vm_token); + lwkt_reltoken(&map->token); return(EINVAL); } docow = MAP_PREFAULT_PARTIAL; @@ -1304,13 +1308,13 @@ vm_mmap(vm_map_t map, vm_offset_t *addr, vm_size_t size, vm_prot_t prot, error = VOP_GETATTR(vp, &vat); if (error) { - lwkt_reltoken(&vm_token); + lwkt_reltoken(&map->token); return (error); } docow = MAP_PREFAULT_PARTIAL; object = vnode_pager_reference(vp); if (object == NULL && vp->v_type == VREG) { - lwkt_reltoken(&vm_token); + lwkt_reltoken(&map->token); kprintf("Warning: cannot mmap vnode %p, no " "object\n", vp); return(EINVAL); @@ -1402,7 +1406,7 @@ vm_mmap(vm_map_t map, vm_offset_t *addr, vm_size_t size, vm_prot_t prot, if (vp != NULL) vn_mark_atime(vp, td); out: - lwkt_reltoken(&vm_token); + lwkt_reltoken(&map->token); switch (rv) { case KERN_SUCCESS: diff --git a/sys/vm/vm_object.c b/sys/vm/vm_object.c index ee49c2dcdd..e3bbd2147c 100644 --- a/sys/vm/vm_object.c +++ b/sys/vm/vm_object.c @@ -97,11 +97,11 @@ #define EASY_SCAN_FACTOR 8 -static void vm_object_qcollapse(vm_object_t object); +static void vm_object_qcollapse(vm_object_t object, + vm_object_t backing_object); static int vm_object_page_collect_flush(vm_object_t object, vm_page_t p, int pagerflags); static void vm_object_lock_init(vm_object_t); -static void vm_object_hold_wait(vm_object_t); /* @@ -194,8 +194,7 @@ vm_object_hold(vm_object_t obj) debugvm_object_hold(vm_object_t obj, char *file, int line) #endif { - if (obj == NULL) - return; + KKASSERT(obj != NULL); /* * Object must be held (object allocation is stable due to callers @@ -222,6 +221,9 @@ debugvm_object_hold(vm_object_t obj, char *file, int line) #endif } +/* + * Drop the token and hold_count on the object. 
+ */ void vm_object_drop(vm_object_t obj) { @@ -252,41 +254,14 @@ vm_object_drop(vm_object_t obj) * The lock is a pool token, keep holding it across potential * wakeups to interlock the tsleep/wakeup. */ - if (refcount_release(&obj->hold_count)) + if (refcount_release(&obj->hold_count)) { + if (obj->ref_count == 0 && (obj->flags & OBJ_DEAD)) + zfree(obj_zone, obj); wakeup(obj); - vm_object_unlock(obj); -} - -/* - * This can only be called while the caller holds the object - * with the OBJ_DEAD interlock. Since there are no refs this - * is the only thing preventing an object destruction race. - */ -static void -vm_object_hold_wait(vm_object_t obj) -{ - vm_object_lock(obj); - -#if defined(DEBUG_LOCKS) - int i; - - for (i = 0; i < VMOBJ_DEBUG_ARRAY_SIZE; i++) { - if ((obj->debug_hold_bitmap & (1 << i)) && - (obj->debug_hold_thrs[i] == curthread)) { - kprintf("vm_object %p: self-hold in at %s:%d\n", obj, - obj->debug_hold_file[i], obj->debug_hold_line[i]); - panic("vm_object: self-hold in terminate or collapse"); - } } -#endif - - while (obj->hold_count) - tsleep(obj, 0, "vmobjhld", 0); - - vm_object_unlock(obj); + vm_object_unlock(obj); /* uses pool token, ok to call on freed obj */ } - /* * Initialize a freshly allocated object * @@ -321,7 +296,7 @@ _vm_object_allocate(objtype_t type, vm_pindex_t size, vm_object_t object) next_index = (next_index + incr) & PQ_L2_MASK; object->handle = NULL; object->backing_object = NULL; - object->backing_object_offset = (vm_ooffset_t) 0; + object->backing_object_offset = (vm_ooffset_t)0; object->generation++; object->swblock_count = 0; @@ -376,41 +351,115 @@ vm_object_allocate(objtype_t type, vm_pindex_t size) } /* - * Add an additional reference to a vm_object. + * Add an additional reference to a vm_object. The object must already be + * held. The original non-lock version is no longer supported. The object + * must NOT be chain locked by anyone at the time the reference is added. + * + * Referencing a chain-locked object can blow up the fairly sensitive + * ref_count and shadow_count tests in the deallocator. Most callers + * will call vm_object_chain_wait() prior to calling + * vm_object_reference_locked() to avoid the case. * - * Object passed by caller must be stable or caller must already - * hold vmobj_token to avoid races. + * The object must be held. */ void -vm_object_reference(vm_object_t object) +vm_object_reference_locked(vm_object_t object) { - lwkt_gettoken(&vmobj_token); - vm_object_hold(object); - vm_object_reference_locked(object); - vm_object_drop(object); - lwkt_reltoken(&vmobj_token); + KKASSERT(object != NULL); + ASSERT_LWKT_TOKEN_HELD(vm_object_token(object)); + KKASSERT((object->flags & OBJ_CHAINLOCK) == 0); + object->ref_count++; + if (object->type == OBJT_VNODE) { + vref(object->handle); + /* XXX what if the vnode is being destroyed? */ + } } +/* + * Object OBJ_CHAINLOCK lock handling. + * + * The caller can chain-lock backing objects recursively and then + * use vm_object_chain_release_all() to undo the whole chain. + * + * Chain locks are used to prevent collapses and are only applicable + * to OBJT_DEFAULT and OBJT_SWAP objects. Chain locking operations + * on other object types are ignored. This is also important because + * it allows e.g. the vnode underlying a memory mapping to take concurrent + * faults. + * + * The object must usually be held on entry, though intermediate + * objects need not be held on release. 
+ */ void -vm_object_reference_locked(vm_object_t object) +vm_object_chain_wait(vm_object_t object) { - if (object) { - ASSERT_LWKT_TOKEN_HELD(&vmobj_token); - /*NOTYET*/ - /*ASSERT_LWKT_TOKEN_HELD(vm_object_token(object));*/ - object->ref_count++; - if (object->type == OBJT_VNODE) { - vref(object->handle); - /* XXX what if the vnode is being destroyed? */ + ASSERT_LWKT_TOKEN_HELD(vm_object_token(object)); + while (object->flags & OBJ_CHAINLOCK) { + vm_object_set_flag(object, OBJ_CHAINWANT); + tsleep(object, 0, "objchain", 0); + } +} + +void +vm_object_chain_acquire(vm_object_t object) +{ + if (object->type == OBJT_DEFAULT || object->type == OBJT_SWAP) { + vm_object_chain_wait(object); + vm_object_set_flag(object, OBJ_CHAINLOCK); + } +} + +void +vm_object_chain_release(vm_object_t object) +{ + ASSERT_LWKT_TOKEN_HELD(vm_object_token(object)); + if (object->type == OBJT_DEFAULT || object->type == OBJT_SWAP) { + KKASSERT(object->flags & OBJ_CHAINLOCK); + if (object->flags & OBJ_CHAINWANT) { + vm_object_clear_flag(object, + OBJ_CHAINLOCK | OBJ_CHAINWANT); + wakeup(object); + } else { + vm_object_clear_flag(object, OBJ_CHAINLOCK); } } } +/* + * This releases the entire chain starting with object and recursing + * through backing_object until stopobj is encountered. stopobj is + * not released. The caller will typically release stopobj manually + * before making this call (as the deepest object is the most likely + * to collide with other threads). + * + * object and stopobj must be held by the caller. This code looks a + * bit odd but has been optimized fairly heavily. + */ +void +vm_object_chain_release_all(vm_object_t first_object, vm_object_t stopobj) +{ + vm_object_t backing_object; + vm_object_t object; + + vm_object_chain_release(stopobj); + object = first_object; + + while (object != stopobj) { + KKASSERT(object); + if (object != first_object) + vm_object_hold(object); + backing_object = object->backing_object; + vm_object_chain_release(object); + if (object != first_object) + vm_object_drop(object); + object = backing_object; + } +} + /* * Dereference an object and its underlying vnode. * - * The caller must hold vmobj_token. - * The object must be locked but not held. This function will eat the lock. + * The object must be held and will be held on return. 
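/*
 * Illustrative sketch, not part of the patch: locking the first two
 * levels of a shadow chain with the new OBJ_CHAINLOCK interlock, along
 * the lines of what vm_map_split() does.  The chain lock stalls
 * concurrent collapses and splits; it is a no-op for non-DEFAULT /
 * non-SWAP objects, so a vnode terminating the chain still takes
 * concurrent faults.  chain_lock_pair() and chain_unlock_pair() are
 * hypothetical helpers.
 */
static vm_object_t
chain_lock_pair(vm_object_t object)
{
	vm_object_t bobject;

	vm_object_hold(object);
	vm_object_chain_acquire(object);	/* stops collapses of object */
	if ((bobject = object->backing_object) != NULL) {
		vm_object_hold(bobject);
		vm_object_chain_acquire(bobject); /* and of the backing level */
	}
	return (bobject);		/* held + chain-locked, or NULL */
}

static void
chain_unlock_pair(vm_object_t object, vm_object_t bobject)
{
	if (bobject) {
		vm_object_chain_release(bobject);
		vm_object_drop(bobject);
	}
	vm_object_chain_release(object);
	vm_object_drop(object);
}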
*/ static void vm_object_vndeallocate(vm_object_t object) @@ -420,18 +469,16 @@ vm_object_vndeallocate(vm_object_t object) KASSERT(object->type == OBJT_VNODE, ("vm_object_vndeallocate: not a vnode object")); KASSERT(vp != NULL, ("vm_object_vndeallocate: missing vp")); - ASSERT_LWKT_TOKEN_HELD(&vmobj_token); + ASSERT_LWKT_TOKEN_HELD(vm_object_token(object)); #ifdef INVARIANTS if (object->ref_count == 0) { vprint("vm_object_vndeallocate", vp); panic("vm_object_vndeallocate: bad object reference count"); } #endif - object->ref_count--; if (object->ref_count == 0) vclrflags(vp, VTEXT); - vm_object_unlock(object); vrele(vp); } @@ -447,25 +494,29 @@ vm_object_vndeallocate(vm_object_t object) void vm_object_deallocate(vm_object_t object) { - lwkt_gettoken(&vmobj_token); - vm_object_deallocate_locked(object); - lwkt_reltoken(&vmobj_token); + if (object) { + vm_object_hold(object); + vm_object_deallocate_locked(object); + vm_object_drop(object); + } } void vm_object_deallocate_locked(vm_object_t object) { vm_object_t temp; - - ASSERT_LWKT_TOKEN_HELD(&vmobj_token); - - if (object) - vm_object_lock(object); + int must_drop = 0; while (object != NULL) { +#if 0 + /* + * Don't rip a ref_count out from under an object undergoing + * collapse, it will confuse the collapse code. + */ + vm_object_chain_wait(object); +#endif if (object->type == OBJT_VNODE) { vm_object_vndeallocate(object); - /* vndeallocate ate the lock */ break; } @@ -475,20 +526,6 @@ vm_object_deallocate_locked(vm_object_t object) } if (object->ref_count > 2) { object->ref_count--; - vm_object_unlock(object); - break; - } - - /* - * We currently need the vm_token from this point on, and - * we must recheck ref_count after acquiring it. - */ - lwkt_gettoken(&vm_token); - - if (object->ref_count > 2) { - object->ref_count--; - lwkt_reltoken(&vm_token); - vm_object_unlock(object); break; } @@ -502,122 +539,172 @@ vm_object_deallocate_locked(vm_object_t object) if (object->ref_count == 2 && object->shadow_count == 0) { vm_object_set_flag(object, OBJ_ONEMAPPING); object->ref_count--; - lwkt_reltoken(&vm_token); - vm_object_unlock(object); break; } /* * If the second ref is from a shadow we chain along it - * if object's handle is exhausted. + * upwards if object's handle is exhausted. * * We have to decrement object->ref_count before potentially * collapsing the first shadow object or the collapse code - * will not be able to handle the degenerate case. + * will not be able to handle the degenerate case to remove + * object. However, if we do it too early the object can + * get ripped out from under us. */ - if (object->ref_count == 2 && object->shadow_count == 1) { + if (object->ref_count == 2 && object->shadow_count == 1 && + object->handle == NULL && (object->type == OBJT_DEFAULT || + object->type == OBJT_SWAP)) { + temp = LIST_FIRST(&object->shadow_head); + KKASSERT(temp != NULL); + vm_object_hold(temp); + + /* + * Wait for any paging to complete so the collapse + * doesn't (or isn't likely to) qcollapse. pip + * waiting must occur before we acquire the + * chainlock. + */ + while ( + temp->paging_in_progress || + object->paging_in_progress + ) { + vm_object_pip_wait(temp, "objde1"); + vm_object_pip_wait(object, "objde2"); + } + + /* + * If the parent is locked we have to give up, as + * otherwise we would be acquiring locks in the + * wrong order and potentially deadlock. 
+ */ + if (temp->flags & OBJ_CHAINLOCK) { + vm_object_drop(temp); + goto skip; + } + vm_object_chain_acquire(temp); + + /* + * Recheck/retry after the hold and the paging + * wait, both of which can block us. + */ + if (object->ref_count != 2 || + object->shadow_count != 1 || + object->handle || + LIST_FIRST(&object->shadow_head) != temp || + (object->type != OBJT_DEFAULT && + object->type != OBJT_SWAP)) { + vm_object_chain_release(temp); + vm_object_drop(temp); + continue; + } + + /* + * We can safely drop object's ref_count now. + */ + KKASSERT(object->ref_count == 2); object->ref_count--; - if (object->handle == NULL && - (object->type == OBJT_DEFAULT || - object->type == OBJT_SWAP)) { - temp = LIST_FIRST(&object->shadow_head); - KASSERT(temp != NULL, - ("vm_object_deallocate: ref_count: " - "%d, shadow_count: %d", - object->ref_count, - object->shadow_count)); - lwkt_reltoken(&vm_token); - vm_object_lock(temp); - - if ((temp->handle == NULL) && - (temp->type == OBJT_DEFAULT || - temp->type == OBJT_SWAP)) { - /* - * Special case, must handle ref_count - * manually to avoid recursion. - */ - temp->ref_count++; - vm_object_lock_swap(); - while ( - temp->paging_in_progress || - object->paging_in_progress - ) { - vm_object_pip_wait(temp, - "objde1"); - vm_object_pip_wait(object, - "objde2"); - } + /* + * If our single parent is not collapseable just + * decrement ref_count (2->1) and stop. + */ + if (temp->handle || (temp->type != OBJT_DEFAULT && + temp->type != OBJT_SWAP)) { + vm_object_chain_release(temp); + vm_object_drop(temp); + break; + } - if (temp->ref_count == 1) { - temp->ref_count--; - vm_object_unlock(object); - object = temp; - goto doterm; - } + /* + * At this point we have already dropped object's + * ref_count so it is possible for a race to + * deallocate obj out from under us. Any collapse + * will re-check the situation. We must not block + * until we are able to collapse. + * + * Bump temp's ref_count to avoid an unwanted + * degenerate recursion (can't call + * vm_object_reference_locked() because it asserts + * that CHAINLOCK is not set). + */ + temp->ref_count++; + KKASSERT(temp->ref_count > 1); - lwkt_gettoken(&vm_token); - vm_object_collapse(temp); - lwkt_reltoken(&vm_token); - vm_object_unlock(object); - object = temp; - continue; - } - vm_object_unlock(temp); - } else { - lwkt_reltoken(&vm_token); + /* + * Collapse temp, then deallocate the extra ref + * formally. + */ + vm_object_collapse(temp); + vm_object_chain_release(temp); + if (must_drop) { + vm_object_lock_swap(); + vm_object_drop(object); } - vm_object_unlock(object); - break; + object = temp; + must_drop = 1; + continue; } /* - * Normal dereferencing path + * Drop the ref and handle termination on the 1->0 transition. + * We may have blocked above so we have to recheck. */ - object->ref_count--; - if (object->ref_count != 0) { - lwkt_reltoken(&vm_token); - vm_object_unlock(object); +skip: + KKASSERT(object->ref_count != 0); + if (object->ref_count >= 2) { + object->ref_count--; break; } + KKASSERT(object->ref_count == 1); /* - * Termination path - * - * We may have to loop to resolve races if we block getting - * temp's lock. If temp is non NULL we have to swap the - * lock order so the original object lock as at the top - * of the lock heap. + * 1->0 transition. Chain through the backing_object. + * Maintain the ref until we've located the backing object, + * then re-check. 
*/ - lwkt_reltoken(&vm_token); -doterm: while ((temp = object->backing_object) != NULL) { - vm_object_lock(temp); + vm_object_hold(temp); if (temp == object->backing_object) break; - vm_object_unlock(temp); + vm_object_drop(temp); } + + /* + * 1->0 transition verified, retry if ref_count is no longer + * 1. Otherwise disconnect the backing_object (temp) and + * clean up. + */ + if (object->ref_count != 1) { + vm_object_drop(temp); + continue; + } + + /* + * It shouldn't be possible for the object to be chain locked + * if we're removing the last ref on it. + */ + KKASSERT((object->flags & OBJ_CHAINLOCK) == 0); + if (temp) { LIST_REMOVE(object, shadow_list); temp->shadow_count--; temp->generation++; object->backing_object = NULL; - vm_object_lock_swap(); } - /* - * Don't double-terminate, we could be in a termination - * recursion due to the terminate having to sync data - * to disk. - */ - if ((object->flags & OBJ_DEAD) == 0) { + --object->ref_count; + if ((object->flags & OBJ_DEAD) == 0) vm_object_terminate(object); - /* termination ate the object lock */ - } else { - vm_object_unlock(object); - } + if (must_drop && temp) + vm_object_lock_swap(); + if (must_drop) + vm_object_drop(object); object = temp; + must_drop = 1; } + if (must_drop && object) + vm_object_drop(object); } /* @@ -625,11 +712,8 @@ doterm: * * The object must have zero references. * - * The caller must be holding vmobj_token and properly interlock with - * OBJ_DEAD (at the moment). - * - * The caller must have locked the object only, and not be holding it. - * This function will eat the caller's lock on the object. + * The object must held. The caller is responsible for dropping the object + * after terminate returns. Terminate does NOT drop the object. */ static int vm_object_terminate_callback(vm_page_t p, void *data); @@ -640,8 +724,8 @@ vm_object_terminate(vm_object_t object) * Make sure no one uses us. Once we set OBJ_DEAD we should be * able to safely block. */ + ASSERT_LWKT_TOKEN_HELD(vm_object_token(object)); KKASSERT((object->flags & OBJ_DEAD) == 0); - ASSERT_LWKT_TOKEN_HELD(&vmobj_token); vm_object_set_flag(object, OBJ_DEAD); /* @@ -684,10 +768,8 @@ vm_object_terminate(vm_object_t object) * removes them from paging queues. Don't free wired pages, just * remove them from the object. */ - lwkt_gettoken(&vm_token); vm_page_rb_tree_RB_SCAN(&object->rb_memq, NULL, vm_object_terminate_callback, NULL); - lwkt_reltoken(&vm_token); /* * Let the pager know object is dead. @@ -695,12 +777,12 @@ vm_object_terminate(vm_object_t object) vm_pager_deallocate(object); /* - * Wait for the object hold count to hit zero, clean out pages as - * we go. + * Wait for the object hold count to hit 1, clean out pages as + * we go. vmobj_token interlocks any race conditions that might + * pick the object up from the vm_object_list after we have cleared + * rb_memq. */ - lwkt_gettoken(&vm_token); for (;;) { - vm_object_hold_wait(object); if (RB_ROOT(&object->rb_memq) == NULL) break; kprintf("vm_object_terminate: Warning, object %p " @@ -709,7 +791,6 @@ vm_object_terminate(vm_object_t object) vm_page_rb_tree_RB_SCAN(&object->rb_memq, NULL, vm_object_terminate_callback, NULL); } - lwkt_reltoken(&vm_token); /* * There had better not be any pages left @@ -718,13 +799,12 @@ vm_object_terminate(vm_object_t object) /* * Remove the object from the global object list. 
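/*
 * Illustrative sketch, not part of the patch: the hold-and-recheck
 * idiom used throughout the new code.  vm_object_hold() can block on
 * the object's pool token, so the pointer that led to the object is
 * re-verified after the hold is acquired and the hold is retried if it
 * changed in the meantime.  hold_backing() is a hypothetical helper.
 */
static vm_object_t
hold_backing(vm_object_t object)
{
	vm_object_t temp;

	while ((temp = object->backing_object) != NULL) {
		vm_object_hold(temp);
		if (temp == object->backing_object)
			break;			/* still ours, keep the hold */
		vm_object_drop(temp);		/* raced, retry */
	}
	return (temp);				/* held backing object, or NULL */
}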
- * - * (we are holding vmobj_token) */ + lwkt_gettoken(&vmobj_token); TAILQ_REMOVE(&vm_object_list, object, object_list); vm_object_count--; vm_object_dead_wakeup(object); - vm_object_unlock(object); + lwkt_reltoken(&vmobj_token); if (object->ref_count != 0) { panic("vm_object_terminate2: object with references, " @@ -732,27 +812,32 @@ vm_object_terminate(vm_object_t object) } /* - * Free the space for the object. + * NOTE: The object hold_count is at least 1, so we cannot zfree() + * the object here. See vm_object_drop(). */ - zfree(obj_zone, object); } /* - * The caller must hold vm_token. + * The caller must hold the object. */ static int vm_object_terminate_callback(vm_page_t p, void *data __unused) { - if (p->busy || (p->flags & PG_BUSY)) - panic("vm_object_terminate: freeing busy page %p", p); - if (p->wire_count == 0) { - vm_page_busy(p); + vm_object_t object; + + object = p->object; + vm_page_busy_wait(p, FALSE, "vmpgtrm"); + if (object != p->object) { + kprintf("vm_object_terminate: Warning: Encountered " + "busied page %p on queue %d\n", p, p->queue); + vm_page_wakeup(p); + } else if (p->wire_count == 0) { vm_page_free(p); mycpu->gd_cnt.v_pfree++; } else { if (p->queue != PQ_NONE) - kprintf("vm_object_terminate: Warning: Encountered wired page %p on queue %d\n", p, p->queue); - vm_page_busy(p); + kprintf("vm_object_terminate: Warning: Encountered " + "wired page %p on queue %d\n", p, p->queue); vm_page_remove(p); vm_page_wakeup(p); } @@ -763,12 +848,12 @@ vm_object_terminate_callback(vm_page_t p, void *data __unused) * The object is dead but still has an object<->pager association. Sleep * and return. The caller typically retests the association in a loop. * - * Must be called with the vmobj_token held. + * The caller must hold the object. */ void vm_object_dead_sleep(vm_object_t object, const char *wmesg) { - ASSERT_LWKT_TOKEN_HELD(&vmobj_token); + ASSERT_LWKT_TOKEN_HELD(vm_object_token(object)); if (object->handle) { vm_object_set_flag(object, OBJ_DEADWNT); tsleep(object, 0, wmesg, 0); @@ -780,12 +865,12 @@ vm_object_dead_sleep(vm_object_t object, const char *wmesg) * Wakeup anyone waiting for the object<->pager disassociation on * a dead object. * - * Must be called with the vmobj_token held. + * The caller must hold the object. 
*/ void vm_object_dead_wakeup(vm_object_t object) { - ASSERT_LWKT_TOKEN_HELD(&vmobj_token); + ASSERT_LWKT_TOKEN_HELD(vm_object_token(object)); if (object->flags & OBJ_DEADWNT) { vm_object_clear_flag(object, OBJ_DEADWNT); wakeup(object); @@ -816,7 +901,7 @@ vm_object_page_clean(vm_object_t object, vm_pindex_t start, vm_pindex_t end, struct vnode *vp; int wholescan; int pagerflags; - int curgeneration; + int generation; vm_object_hold(object); if (object->type != OBJT_VNODE || @@ -858,10 +943,8 @@ vm_object_page_clean(vm_object_t object, vm_pindex_t start, vm_pindex_t end, */ if (wholescan) { info.error = 0; - lwkt_gettoken(&vm_token); vm_page_rb_tree_RB_SCAN(&object->rb_memq, rb_vm_page_scancmp, vm_object_page_clean_pass1, &info); - lwkt_reltoken(&vm_token); if (info.error == 0) { vm_object_clear_flag(object, OBJ_WRITEABLE|OBJ_MIGHTBEDIRTY); @@ -878,19 +961,17 @@ vm_object_page_clean(vm_object_t object, vm_pindex_t start, vm_pindex_t end, */ do { info.error = 0; - curgeneration = object->generation; - lwkt_gettoken(&vm_token); + generation = object->generation; vm_page_rb_tree_RB_SCAN(&object->rb_memq, rb_vm_page_scancmp, vm_object_page_clean_pass2, &info); - lwkt_reltoken(&vm_token); - } while (info.error || curgeneration != object->generation); + } while (info.error || generation != object->generation); vm_object_clear_flag(object, OBJ_CLEANING); vm_object_drop(object); } /* - * The caller must hold vm_token. + * The caller must hold the object. */ static int @@ -899,22 +980,26 @@ vm_object_page_clean_pass1(struct vm_page *p, void *data) struct rb_vm_page_scan_info *info = data; vm_page_flag_set(p, PG_CLEANCHK); - if ((info->limit & OBJPC_NOSYNC) && (p->flags & PG_NOSYNC)) + if ((info->limit & OBJPC_NOSYNC) && (p->flags & PG_NOSYNC)) { info->error = 1; - else + } else if (vm_page_busy_try(p, FALSE) == 0) { vm_page_protect(p, VM_PROT_READ); /* must not block */ + vm_page_wakeup(p); + } else { + info->error = 1; + } return(0); } /* - * The caller must hold vm_token. + * The caller must hold the object */ static int vm_object_page_clean_pass2(struct vm_page *p, void *data) { struct rb_vm_page_scan_info *info = data; - int n; + int generation; /* * Do not mess with pages that were inserted after we started @@ -923,12 +1008,22 @@ vm_object_page_clean_pass2(struct vm_page *p, void *data) if ((p->flags & PG_CLEANCHK) == 0) return(0); + generation = info->object->generation; + vm_page_busy_wait(p, TRUE, "vpcwai"); + if (p->object != info->object || + info->object->generation != generation) { + info->error = 1; + vm_page_wakeup(p); + return(0); + } + /* * Before wasting time traversing the pmaps, check for trivial * cases where the page cannot be dirty. */ if (p->valid == 0 || (p->queue - p->pc) == PQ_CACHE) { KKASSERT((p->dirty & p->valid) == 0); + vm_page_wakeup(p); return(0); } @@ -940,6 +1035,7 @@ vm_object_page_clean_pass2(struct vm_page *p, void *data) vm_page_test_dirty(p); if ((p->dirty & p->valid) == 0) { vm_page_flag_clear(p, PG_CLEANCHK); + vm_page_wakeup(p); return(0); } @@ -951,6 +1047,7 @@ vm_object_page_clean_pass2(struct vm_page *p, void *data) */ if ((info->limit & OBJPC_NOSYNC) && (p->flags & PG_NOSYNC)) { vm_page_flag_clear(p, PG_CLEANCHK); + vm_page_wakeup(p); return(0); } @@ -959,98 +1056,101 @@ vm_object_page_clean_pass2(struct vm_page *p, void *data) * the pages that get successfully flushed. Set info->error if * we raced an object modification. 
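/*
 * Illustrative sketch, not part of the patch: the busy-or-restart idiom
 * used by the RB-tree scan callbacks above.  If the page cannot be
 * busied without blocking, the callback sleeps on it, flags
 * info->error, and returns so the caller can restart the scan instead
 * of stalling the tree traversal.  my_scan_callback() and the "myscan"
 * wmesg are hypothetical.
 */
static int
my_scan_callback(vm_page_t p, void *data)
{
	struct rb_vm_page_scan_info *info = data;

	if (vm_page_busy_try(p, TRUE)) {
		vm_page_sleep_busy(p, TRUE, "myscan");
		info->error = 1;		/* caller rescans */
		return (0);
	}
	/* ... operate on the busied page ... */
	vm_page_wakeup(p);
	return (0);
}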
*/ - n = vm_object_page_collect_flush(info->object, p, info->pagerflags); - if (n == 0) - info->error = 1; + vm_object_page_collect_flush(info->object, p, info->pagerflags); return(0); } /* * Collect the specified page and nearby pages and flush them out. - * The number of pages flushed is returned. + * The number of pages flushed is returned. The passed page is busied + * by the caller and we are responsible for its disposition. * - * The caller must hold vm_token. + * The caller must hold the object. */ static int vm_object_page_collect_flush(vm_object_t object, vm_page_t p, int pagerflags) { int runlen; + int error; int maxf; int chkb; int maxb; int i; - int curgeneration; vm_pindex_t pi; vm_page_t maf[vm_pageout_page_count]; vm_page_t mab[vm_pageout_page_count]; vm_page_t ma[vm_pageout_page_count]; - curgeneration = object->generation; + ASSERT_LWKT_TOKEN_HELD(vm_object_token(object)); pi = p->pindex; - while (vm_page_sleep_busy(p, TRUE, "vpcwai")) { - if (object->generation != curgeneration) { - return(0); - } - } - KKASSERT(p->object == object && p->pindex == pi); maxf = 0; for(i = 1; i < vm_pageout_page_count; i++) { vm_page_t tp; - if ((tp = vm_page_lookup(object, pi + i)) != NULL) { - if ((tp->flags & PG_BUSY) || - ((pagerflags & VM_PAGER_IGNORE_CLEANCHK) == 0 && - (tp->flags & PG_CLEANCHK) == 0) || - (tp->busy != 0)) - break; - if((tp->queue - tp->pc) == PQ_CACHE) { - vm_page_flag_clear(tp, PG_CLEANCHK); - break; - } - vm_page_test_dirty(tp); - if ((tp->dirty & tp->valid) == 0) { - vm_page_flag_clear(tp, PG_CLEANCHK); - break; - } - maf[ i - 1 ] = tp; - maxf++; - continue; + tp = vm_page_lookup_busy_try(object, pi + i, TRUE, &error); + if (error) + break; + if (tp == NULL) + break; + if ((pagerflags & VM_PAGER_IGNORE_CLEANCHK) == 0 && + (tp->flags & PG_CLEANCHK) == 0) { + vm_page_wakeup(tp); + break; } - break; + if ((tp->queue - tp->pc) == PQ_CACHE) { + vm_page_flag_clear(tp, PG_CLEANCHK); + vm_page_wakeup(tp); + break; + } + vm_page_test_dirty(tp); + if ((tp->dirty & tp->valid) == 0) { + vm_page_flag_clear(tp, PG_CLEANCHK); + vm_page_wakeup(tp); + break; + } + maf[i - 1] = tp; + maxf++; } maxb = 0; chkb = vm_pageout_page_count - maxf; - if (chkb) { - for(i = 1; i < chkb;i++) { - vm_page_t tp; - - if ((tp = vm_page_lookup(object, pi - i)) != NULL) { - if ((tp->flags & PG_BUSY) || - ((pagerflags & VM_PAGER_IGNORE_CLEANCHK) == 0 && - (tp->flags & PG_CLEANCHK) == 0) || - (tp->busy != 0)) - break; - if((tp->queue - tp->pc) == PQ_CACHE) { - vm_page_flag_clear(tp, PG_CLEANCHK); - break; - } - vm_page_test_dirty(tp); - if ((tp->dirty & tp->valid) == 0) { - vm_page_flag_clear(tp, PG_CLEANCHK); - break; - } - mab[ i - 1 ] = tp; - maxb++; - continue; - } + /* + * NOTE: chkb can be 0 + */ + for(i = 1; chkb && i < chkb; i++) { + vm_page_t tp; + + tp = vm_page_lookup_busy_try(object, pi - i, TRUE, &error); + if (error) + break; + if (tp == NULL) + break; + if ((pagerflags & VM_PAGER_IGNORE_CLEANCHK) == 0 && + (tp->flags & PG_CLEANCHK) == 0) { + vm_page_wakeup(tp); break; } + if ((tp->queue - tp->pc) == PQ_CACHE) { + vm_page_flag_clear(tp, PG_CLEANCHK); + vm_page_wakeup(tp); + break; + } + vm_page_test_dirty(tp); + if ((tp->dirty & tp->valid) == 0) { + vm_page_flag_clear(tp, PG_CLEANCHK); + vm_page_wakeup(tp); + break; + } + mab[i - 1] = tp; + maxb++; } - for(i = 0; i < maxb; i++) { + /* + * All pages in the maf[] and mab[] array are busied. 
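/*
 * Illustrative sketch, not part of the patch: the lookup-busy-try
 * clustering pattern used by the collect/flush code above.  Every
 * successful lookup returns a busied page, so every early-out path
 * must vm_page_wakeup() a page it decides not to keep.
 * collect_dirty_run() is hypothetical; run[] must have at least
 * 'limit' slots and the object is assumed held by the caller.
 */
static int
collect_dirty_run(vm_object_t object, vm_pindex_t pi,
		  vm_page_t *run, int limit)
{
	vm_page_t tp;
	int error;
	int n;

	for (n = 0; n < limit; n++) {
		tp = vm_page_lookup_busy_try(object, pi + n, TRUE, &error);
		if (error || tp == NULL)
			break;			/* busied elsewhere, or absent */
		vm_page_test_dirty(tp);
		if ((tp->dirty & tp->valid) == 0) {
			vm_page_wakeup(tp);	/* clean, don't keep it */
			break;
		}
		run[n] = tp;			/* stays busied for the flush */
	}
	return (n);
}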
+ */ + for (i = 0; i < maxb; i++) { int index = (maxb - i) - 1; ma[index] = mab[i]; vm_page_flag_clear(ma[index], PG_CLEANCHK); @@ -1064,7 +1164,11 @@ vm_object_page_collect_flush(vm_object_t object, vm_page_t p, int pagerflags) } runlen = maxb + maxf + 1; + for (i = 0; i < runlen; i++) + vm_page_hold(ma[i]); + vm_pageout_flush(ma, runlen, pagerflags); + for (i = 0; i < runlen; i++) { if (ma[i]->valid & ma[i]->dirty) { vm_page_protect(ma[i], VM_PROT_READ); @@ -1078,6 +1182,7 @@ vm_object_page_collect_flush(vm_object_t object, vm_page_t p, int pagerflags) if (i >= maxb + 1 && (maxf > i - maxb - 1)) maxf = i - maxb - 1; } + vm_page_unhold(ma[i]); } return(maxf + 1); } @@ -1102,18 +1207,14 @@ vm_object_pmap_copy_1(vm_object_t object, vm_pindex_t start, vm_pindex_t end) if (object == NULL || (object->flags & OBJ_WRITEABLE) == 0) return; - /* - * spl protection needed to prevent races between the lookup, - * an interrupt unbusy/free, and our protect call. - */ - lwkt_gettoken(&vm_token); + vm_object_hold(object); for (idx = start; idx < end; idx++) { p = vm_page_lookup(object, idx); if (p == NULL) continue; vm_page_protect(p, VM_PROT_READ); } - lwkt_reltoken(&vm_token); + vm_object_drop(object); } /* @@ -1135,16 +1236,16 @@ vm_object_pmap_remove(vm_object_t object, vm_pindex_t start, vm_pindex_t end) info.start_pindex = start; info.end_pindex = end - 1; - lwkt_gettoken(&vm_token); + vm_object_hold(object); vm_page_rb_tree_RB_SCAN(&object->rb_memq, rb_vm_page_scancmp, vm_object_pmap_remove_callback, &info); if (start == 0 && end == object->size) vm_object_clear_flag(object, OBJ_WRITEABLE); - lwkt_reltoken(&vm_token); + vm_object_drop(object); } /* - * The caller must hold vm_token. + * The caller must hold the object */ static int vm_object_pmap_remove_callback(vm_page_t p, void *data __unused) @@ -1178,20 +1279,25 @@ vm_object_madvise(vm_object_t object, vm_pindex_t pindex, int count, int advise) { vm_pindex_t end, tpindex; vm_object_t tobject; + vm_object_t xobj; vm_page_t m; + int error; if (object == NULL) return; end = pindex + count; - lwkt_gettoken(&vm_token); + vm_object_hold(object); + tobject = object; /* * Locate and adjust resident pages */ for (; pindex < end; pindex += 1) { relookup: + if (tobject != object) + vm_object_drop(tobject); tobject = object; tpindex = pindex; shadowlookup: @@ -1207,13 +1313,12 @@ shadowlookup: } } - /* - * spl protection is required to avoid a race between the - * lookup, an interrupt unbusy/free, and our busy check. - */ - - m = vm_page_lookup(tobject, tpindex); + m = vm_page_lookup_busy_try(tobject, tpindex, TRUE, &error); + if (error) { + vm_page_sleep_busy(m, TRUE, "madvpo"); + goto relookup; + } if (m == NULL) { /* * There may be swap even if there is no backing page @@ -1224,33 +1329,40 @@ shadowlookup: /* * next object */ - if (tobject->backing_object == NULL) + while ((xobj = tobject->backing_object) != NULL) { + KKASSERT(xobj != object); + vm_object_hold(xobj); + if (xobj == tobject->backing_object) + break; + vm_object_drop(xobj); + } + if (xobj == NULL) continue; tpindex += OFF_TO_IDX(tobject->backing_object_offset); - tobject = tobject->backing_object; + if (tobject != object) { + vm_object_lock_swap(); + vm_object_drop(tobject); + } + tobject = xobj; goto shadowlookup; } /* - * If the page is busy or not in a normal active state, - * we skip it. If the page is not managed there are no - * page queues to mess with. Things can break if we mess - * with pages in any of the below states. 
+ * If the page is not in a normal active state, we skip it. + * If the page is not managed there are no page queues to + * mess with. Things can break if we mess with pages in + * any of the below states. */ if ( - m->hold_count || + /*m->hold_count ||*/ m->wire_count || (m->flags & PG_UNMANAGED) || m->valid != VM_PAGE_BITS_ALL ) { + vm_page_wakeup(m); continue; } - if (vm_page_sleep_busy(m, TRUE, "madvpo")) { - goto relookup; - } - vm_page_busy(m); - /* * Theoretically once a page is known not to be busy, an * interrupt cannot come along and rip it out from under us. @@ -1285,77 +1397,108 @@ shadowlookup: } vm_page_wakeup(m); } - lwkt_reltoken(&vm_token); + if (tobject != object) + vm_object_drop(tobject); + vm_object_drop(object); } /* * Create a new object which is backed by the specified existing object - * range. The source object reference is deallocated. - * - * The new object and offset into that object are returned in the source - * parameters. + * range. Replace the pointer and offset that was pointing at the existing + * object with the pointer/offset for the new object. * * No other requirements. */ void -vm_object_shadow(vm_object_t *object, vm_ooffset_t *offset, vm_size_t length) +vm_object_shadow(vm_object_t *objectp, vm_ooffset_t *offset, vm_size_t length, + int addref) { vm_object_t source; vm_object_t result; - source = *object; + source = *objectp; /* * Don't create the new object if the old object isn't shared. + * We have to chain wait before adding the reference to avoid + * racing a collapse or deallocation. + * + * Add the additional ref to source here to avoid racing a later + * collapse or deallocation. Clear the ONEMAPPING flag whether + * addref is TRUE or not in this case because the original object + * will be shadowed. */ - lwkt_gettoken(&vm_token); - - if (source != NULL && - source->ref_count == 1 && - source->handle == NULL && - (source->type == OBJT_DEFAULT || - source->type == OBJT_SWAP)) { - lwkt_reltoken(&vm_token); - return; + if (source) { + vm_object_hold(source); + vm_object_chain_wait(source); + if (source->ref_count == 1 && + source->handle == NULL && + (source->type == OBJT_DEFAULT || + source->type == OBJT_SWAP)) { + vm_object_drop(source); + if (addref) { + vm_object_clear_flag(source, OBJ_ONEMAPPING); + vm_object_reference_locked(source); + } + return; + } + vm_object_reference_locked(source); + vm_object_clear_flag(source, OBJ_ONEMAPPING); } /* - * Allocate a new object with the given length + * Allocate a new object with the given length. The new object + * is returned referenced but we may have to add another one. + * If we are adding a second reference we must clear OBJ_ONEMAPPING. + * (typically because the caller is about to clone a vm_map_entry). + * + * The source object currently has an extra reference to prevent + * collapses into it while we mess with its shadow list, which + * we will remove later in this routine. */ - if ((result = vm_object_allocate(OBJT_DEFAULT, length)) == NULL) panic("vm_object_shadow: no object for shadowing"); + vm_object_hold(result); + if (addref) { + vm_object_reference_locked(result); + vm_object_clear_flag(result, OBJ_ONEMAPPING); + } /* - * The new object shadows the source object, adding a reference to it. - * Our caller changes his reference to point to the new object, - * removing a reference to the source object. Net result: no change - * of reference count. + * The new object shadows the source object. Chain wait before + * adjusting shadow_count or the shadow list to avoid races. 
* * Try to optimize the result object's page color when shadowing * in order to maintain page coloring consistency in the combined * shadowed object. */ + KKASSERT(result->backing_object == NULL); result->backing_object = source; if (source) { + vm_object_chain_wait(source); LIST_INSERT_HEAD(&source->shadow_head, result, shadow_list); source->shadow_count++; source->generation++; - result->pg_color = (source->pg_color + OFF_TO_IDX(*offset)) & PQ_L2_MASK; + result->pg_color = (source->pg_color + OFF_TO_IDX(*offset)) & + PQ_L2_MASK; } /* - * Store the offset into the source object, and fix up the offset into - * the new object. + * Adjust the return storage. Drop the ref on source before + * returning. */ result->backing_object_offset = *offset; - lwkt_reltoken(&vm_token); + vm_object_drop(result); + *offset = 0; + if (source) { + vm_object_deallocate_locked(source); + vm_object_drop(source); + } /* * Return the new things */ - *offset = 0; - *object = result; + *objectp = result; } #define OBSC_TEST_ALL_SHADOWED 0x0001 @@ -1365,21 +1508,22 @@ vm_object_shadow(vm_object_t *object, vm_ooffset_t *offset, vm_size_t length) static int vm_object_backing_scan_callback(vm_page_t p, void *data); /* - * The caller must hold vm_token. + * The caller must hold the object. */ static __inline int -vm_object_backing_scan(vm_object_t object, int op) +vm_object_backing_scan(vm_object_t object, vm_object_t backing_object, int op) { struct rb_vm_page_scan_info info; - vm_object_t backing_object; - backing_object = object->backing_object; + vm_object_assert_held(object); + vm_object_assert_held(backing_object); + + KKASSERT(backing_object == object->backing_object); info.backing_offset_index = OFF_TO_IDX(object->backing_object_offset); /* * Initial conditions */ - if (op & OBSC_TEST_ALL_SHADOWED) { /* * We do not want to have to test for the existence of @@ -1390,13 +1534,17 @@ vm_object_backing_scan(vm_object_t object, int op) * been ZFOD faulted yet? If we do not test for this, the * shadow test may succeed! XXX */ - if (backing_object->type != OBJT_DEFAULT) { + if (backing_object->type != OBJT_DEFAULT) return(0); - } } if (op & OBSC_COLLAPSE_WAIT) { KKASSERT((backing_object->flags & OBJ_DEAD) == 0); vm_object_set_flag(backing_object, OBJ_DEAD); + lwkt_gettoken(&vmobj_token); + TAILQ_REMOVE(&vm_object_list, backing_object, object_list); + vm_object_count--; + vm_object_dead_wakeup(backing_object); + lwkt_reltoken(&vmobj_token); } /* @@ -1419,7 +1567,7 @@ vm_object_backing_scan(vm_object_t object, int op) } /* - * The caller must hold vm_token. + * The caller must hold the object. */ static int vm_object_backing_scan_callback(vm_page_t p, void *data) @@ -1477,22 +1625,13 @@ vm_object_backing_scan_callback(vm_page_t p, void *data) /* * Check for busy page */ - if (op & (OBSC_COLLAPSE_WAIT | OBSC_COLLAPSE_NOWAIT)) { vm_page_t pp; - if (op & OBSC_COLLAPSE_NOWAIT) { - if ( - (p->flags & PG_BUSY) || - !p->valid || - p->hold_count || - p->wire_count || - p->busy - ) { + if (vm_page_busy_try(p, TRUE)) { + if (op & OBSC_COLLAPSE_NOWAIT) { return(0); - } - } else if (op & OBSC_COLLAPSE_WAIT) { - if (vm_page_sleep_busy(p, TRUE, "vmocol")) { + } else { /* * If we slept, anything could have * happened. Ask that the scan be restarted. @@ -1500,15 +1639,20 @@ vm_object_backing_scan_callback(vm_page_t p, void *data) * Since the object is marked dead, the * backing offset should not have changed. 
*/ + vm_page_sleep_busy(p, TRUE, "vmocol"); info->error = -1; return(-1); } } - - /* - * Busy the page - */ - vm_page_busy(p); + if (op & OBSC_COLLAPSE_NOWAIT) { + if (p->valid == 0 /*|| p->hold_count*/ || + p->wire_count) { + vm_page_wakeup(p); + return(0); + } + } else { + /* XXX what if p->valid == 0 , hold_count, etc? */ + } KASSERT( p->object == backing_object, @@ -1559,6 +1703,7 @@ vm_object_backing_scan_callback(vm_page_t p, void *data) vm_page_deactivate(p); vm_page_rename(p, object, new_pindex); + vm_page_wakeup(p); /* page automatically made dirty by rename */ } return(0); @@ -1569,53 +1714,85 @@ vm_object_backing_scan_callback(vm_page_t p, void *data) * when paging_in_progress is true for an object... This is not a complete * operation, but should plug 99.9% of the rest of the leaks. * - * The caller must hold vm_token and vmobj_token. + * The caller must hold the object and backing_object and both must be + * chainlocked. + * * (only called from vm_object_collapse) */ static void -vm_object_qcollapse(vm_object_t object) +vm_object_qcollapse(vm_object_t object, vm_object_t backing_object) { - vm_object_t backing_object = object->backing_object; - - if (backing_object->ref_count != 1) - return; - - backing_object->ref_count += 2; - - vm_object_backing_scan(object, OBSC_COLLAPSE_NOWAIT); - - backing_object->ref_count -= 2; + if (backing_object->ref_count == 1) { + backing_object->ref_count += 2; + vm_object_backing_scan(object, backing_object, + OBSC_COLLAPSE_NOWAIT); + backing_object->ref_count -= 2; + } } /* * Collapse an object with the object backing it. Pages in the backing * object are moved into the parent, and the backing object is deallocated. * - * The caller must hold (object). + * object must be held and chain-locked on call. + * + * The caller must have an extra ref on object to prevent a race from + * destroying it during the collapse. */ void vm_object_collapse(vm_object_t object) { - ASSERT_LWKT_TOKEN_HELD(&vm_token); - ASSERT_LWKT_TOKEN_HELD(&vmobj_token); + vm_object_t backing_object; + + /* + * Only one thread is attempting a collapse at any given moment. + * There are few restrictions for (object) that callers of this + * function check so reentrancy is likely. + */ + KKASSERT(object != NULL); vm_object_assert_held(object); + KKASSERT(object->flags & OBJ_CHAINLOCK); - while (TRUE) { - vm_object_t backing_object; + for (;;) { + vm_object_t bbobj; + int dodealloc; /* - * Verify that the conditions are right for collapse: - * - * The object exists and the backing object exists. + * We have to hold the backing object, check races. */ - if (object == NULL) + while ((backing_object = object->backing_object) != NULL) { + vm_object_hold(backing_object); + if (backing_object == object->backing_object) + break; + vm_object_drop(backing_object); + } + + /* + * No backing object? Nothing to collapse then. + */ + if (backing_object == NULL) break; - if ((backing_object = object->backing_object) == NULL) + /* + * You can't collapse with a non-default/non-swap object. + */ + if (backing_object->type != OBJT_DEFAULT && + backing_object->type != OBJT_SWAP) { + vm_object_drop(backing_object); + backing_object = NULL; break; + } - vm_object_hold(backing_object); + /* + * Chain-lock the backing object too because if we + * successfully merge its pages into the top object we + * will collapse backing_object->backing_object as the + * new backing_object. Re-check that it is still our + * backing object. 
+ */ + vm_object_chain_acquire(backing_object); if (backing_object != object->backing_object) { + vm_object_chain_release(backing_object); vm_object_drop(backing_object); continue; } @@ -1632,16 +1809,17 @@ vm_object_collapse(vm_object_t object) (object->type != OBJT_DEFAULT && object->type != OBJT_SWAP) || (object->flags & OBJ_DEAD)) { - vm_object_drop(backing_object); break; } + /* + * If paging is in progress we can't do a normal collapse. + */ if ( object->paging_in_progress != 0 || backing_object->paging_in_progress != 0 ) { - vm_object_drop(backing_object); - vm_object_qcollapse(object); + vm_object_qcollapse(object, backing_object); break; } @@ -1655,13 +1833,14 @@ vm_object_collapse(vm_object_t object) * vm_object_backing_scan fails the shadowing test in this * case. */ - if (backing_object->ref_count == 1) { /* * If there is exactly one reference to the backing * object, we can collapse it into the parent. */ - vm_object_backing_scan(object, OBSC_COLLAPSE_WAIT); + KKASSERT(object->backing_object == backing_object); + vm_object_backing_scan(object, backing_object, + OBSC_COLLAPSE_WAIT); /* * Move the pager from backing_object to object. @@ -1677,53 +1856,61 @@ vm_object_collapse(vm_object_t object) * new swapper is able to optimize the * destroy-source case. */ - vm_object_pip_add(object, 1); - swap_pager_copy( - backing_object, - object, - OFF_TO_IDX(object->backing_object_offset), TRUE); + swap_pager_copy(backing_object, object, + OFF_TO_IDX(object->backing_object_offset), + TRUE); vm_object_pip_wakeup(object); - vm_object_pip_wakeup(backing_object); } + /* * Object now shadows whatever backing_object did. - * Note that the reference to - * backing_object->backing_object moves from within - * backing_object to within object. + * Remove object from backing_object's shadow_list. + * + * NOTE: backing_object->backing_object moves from + * within backing_object to within object. */ - LIST_REMOVE(object, shadow_list); - object->backing_object->shadow_count--; - object->backing_object->generation++; - if (backing_object->backing_object) { + KKASSERT(object->backing_object == backing_object); + backing_object->shadow_count--; + backing_object->generation++; + + while ((bbobj = backing_object->backing_object) != NULL) { + vm_object_hold(bbobj); + if (bbobj == backing_object->backing_object) + break; + vm_object_drop(bbobj); + } + if (bbobj) { LIST_REMOVE(backing_object, shadow_list); - backing_object->backing_object->shadow_count--; - backing_object->backing_object->generation++; + bbobj->shadow_count--; + bbobj->generation++; } - object->backing_object = backing_object->backing_object; - if (object->backing_object) { - LIST_INSERT_HEAD( - &object->backing_object->shadow_head, - object, - shadow_list - ); - object->backing_object->shadow_count++; - object->backing_object->generation++; + object->backing_object = bbobj; + if (bbobj) { + LIST_INSERT_HEAD(&bbobj->shadow_head, + object, shadow_list); + bbobj->shadow_count++; + bbobj->generation++; } object->backing_object_offset += - backing_object->backing_object_offset; + backing_object->backing_object_offset; + + vm_object_drop(bbobj); /* - * Discard backing_object. + * Discard the old backing_object. Nothing should be + * able to ref it, other than a vm_map_split(), + * and vm_map_split() will stall on our chain lock. + * And we control the parent so it shouldn't be + * possible for it to go away either. 
* - * Since the backing object has no pages, no pager left, - * and no object references within it, all that is - * necessary is to dispose of it. + * Since the backing object has no pages, no pager + * left, and no object references within it, all + * that is necessary is to dispose of it. */ - KASSERT(backing_object->ref_count == 1, ("backing_object %p was somehow " "re-referenced during collapse!", @@ -1734,71 +1921,98 @@ vm_object_collapse(vm_object_t object) backing_object)); /* - * Wait for hold count to hit zero + * The object can be destroyed. + * + * XXX just fall through and dodealloc instead + * of forcing destruction? */ - vm_object_drop(backing_object); - vm_object_hold_wait(backing_object); - - /* (we are holding vmobj_token) */ - TAILQ_REMOVE(&vm_object_list, backing_object, - object_list); - --backing_object->ref_count; /* safety/debug */ - vm_object_count--; - - zfree(obj_zone, backing_object); - + --backing_object->ref_count; + if ((backing_object->flags & OBJ_DEAD) == 0) + vm_object_terminate(backing_object); object_collapses++; + dodealloc = 1; + dodealloc = 0; } else { - vm_object_t new_backing_object; - /* * If we do not entirely shadow the backing object, * there is nothing we can do so we give up. */ - - if (vm_object_backing_scan(object, OBSC_TEST_ALL_SHADOWED) == 0) { - vm_object_drop(backing_object); + if (vm_object_backing_scan(object, backing_object, + OBSC_TEST_ALL_SHADOWED) == 0) { break; } + while ((bbobj = backing_object->backing_object) != NULL) { + vm_object_hold(bbobj); + if (bbobj == backing_object->backing_object) + break; + vm_object_drop(bbobj); + } /* * Make the parent shadow the next object in the - * chain. Deallocating backing_object will not remove + * chain. Remove object from backing_object's + * shadow list. + * + * Deallocating backing_object will not remove * it, since its reference count is at least 2. */ - + KKASSERT(object->backing_object == backing_object); LIST_REMOVE(object, shadow_list); backing_object->shadow_count--; backing_object->generation++; - new_backing_object = backing_object->backing_object; - if ((object->backing_object = new_backing_object) != NULL) { - vm_object_reference(new_backing_object); - LIST_INSERT_HEAD( - &new_backing_object->shadow_head, - object, - shadow_list - ); - new_backing_object->shadow_count++; - new_backing_object->generation++; + /* + * Add a ref to bbobj + */ + if (bbobj) { + vm_object_chain_wait(bbobj); + vm_object_reference_locked(bbobj); + LIST_INSERT_HEAD(&bbobj->shadow_head, + object, shadow_list); + bbobj->shadow_count++; + bbobj->generation++; object->backing_object_offset += backing_object->backing_object_offset; + object->backing_object = bbobj; + vm_object_drop(bbobj); + } else { + object->backing_object = NULL; } /* - * Drop the reference count on backing_object. Since - * its ref_count was at least 2, it will not vanish; - * so we don't need to call vm_object_deallocate, but - * we do anyway. + * Drop the reference count on backing_object. To + * handle ref_count races properly we can't assume + * that the ref_count is still at least 2 so we + * have to actually call vm_object_deallocate() + * (after clearing the chainlock). */ - vm_object_drop(backing_object); - vm_object_deallocate_locked(backing_object); object_bypasses++; + dodealloc = 1; } /* - * Try again with this object's new backing object. + * Clean up the original backing_object and try again with + * this object's new backing object (loop). 
*/ + vm_object_chain_release(backing_object); + + /* + * The backing_object was + */ + if (dodealloc) + vm_object_deallocate_locked(backing_object); + vm_object_drop(backing_object); + /* loop */ + } + + /* + * Clean up any left over backing_object + */ + if (backing_object) { +#if 1 + vm_object_chain_release(backing_object); +#endif + vm_object_drop(backing_object); } } @@ -1820,10 +2034,10 @@ vm_object_page_remove(vm_object_t object, vm_pindex_t start, vm_pindex_t end, /* * Degenerate cases and assertions */ - lwkt_gettoken(&vm_token); + vm_object_hold(object); if (object == NULL || (object->resident_page_count == 0 && object->swblock_count == 0)) { - lwkt_reltoken(&vm_token); + vm_object_drop(object); return; } KASSERT(object->type != OBJT_PHYS, @@ -1873,17 +2087,23 @@ vm_object_page_remove(vm_object_t object, vm_pindex_t start, vm_pindex_t end, * Cleanup */ vm_object_pip_wakeup(object); - lwkt_reltoken(&vm_token); + vm_object_drop(object); } /* - * The caller must hold vm_token. + * The caller must hold the object */ static int vm_object_page_remove_callback(vm_page_t p, void *data) { struct rb_vm_page_scan_info *info = data; + if (vm_page_busy_try(p, TRUE)) { + vm_page_sleep_busy(p, TRUE, "vmopar"); + info->error = 1; + return(0); + } + /* * Wired pages cannot be destroyed, but they can be invalidated * and we do so if clean_only (limit) is not set. @@ -1897,16 +2117,7 @@ vm_object_page_remove_callback(vm_page_t p, void *data) vm_page_protect(p, VM_PROT_NONE); if (info->limit == 0) p->valid = 0; - return(0); - } - - /* - * The busy flags are only cleared at - * interrupt -- minimize the spl transitions - */ - - if (vm_page_sleep_busy(p, TRUE, "vmopar")) { - info->error = 1; + vm_page_wakeup(p); return(0); } @@ -1917,16 +2128,21 @@ vm_object_page_remove_callback(vm_page_t p, void *data) */ if (info->limit && p->valid) { vm_page_test_dirty(p); - if (p->valid & p->dirty) + if (p->valid & p->dirty) { + vm_page_wakeup(p); return(0); - if (p->hold_count) + } +#if 0 + if (p->hold_count) { + vm_page_wakeup(p); return(0); + } +#endif } /* * Destroy the page */ - vm_page_busy(p); vm_page_protect(p, VM_PROT_NONE); vm_page_free(p); return(0); @@ -1950,8 +2166,6 @@ vm_object_page_remove_callback(vm_page_t p, void *data) * prev_size Size of reference to prev_object * next_size Size of reference to next_object * - * The caller must hold vm_token and vmobj_token. - * * The caller does not need to hold (prev_object) but must have a stable * pointer to it (typically by holding the vm_map locked). 
*/ @@ -1961,12 +2175,8 @@ vm_object_coalesce(vm_object_t prev_object, vm_pindex_t prev_pindex, { vm_pindex_t next_pindex; - ASSERT_LWKT_TOKEN_HELD(&vm_token); - ASSERT_LWKT_TOKEN_HELD(&vmobj_token); - - if (prev_object == NULL) { + if (prev_object == NULL) return (TRUE); - } vm_object_hold(prev_object); @@ -1979,6 +2189,7 @@ vm_object_coalesce(vm_object_t prev_object, vm_pindex_t prev_pindex, /* * Try to collapse the object first */ + vm_object_chain_acquire(prev_object); vm_object_collapse(prev_object); /* @@ -1988,6 +2199,7 @@ vm_object_coalesce(vm_object_t prev_object, vm_pindex_t prev_pindex, */ if (prev_object->backing_object != NULL) { + vm_object_chain_release(prev_object); vm_object_drop(prev_object); return (FALSE); } @@ -1998,6 +2210,7 @@ vm_object_coalesce(vm_object_t prev_object, vm_pindex_t prev_pindex, if ((prev_object->ref_count > 1) && (prev_object->size != next_pindex)) { + vm_object_chain_release(prev_object); vm_object_drop(prev_object); return (FALSE); } @@ -2021,6 +2234,7 @@ vm_object_coalesce(vm_object_t prev_object, vm_pindex_t prev_pindex, if (next_pindex + next_size > prev_object->size) prev_object->size = next_pindex + next_size; + vm_object_chain_release(prev_object); vm_object_drop(prev_object); return (TRUE); } @@ -2028,14 +2242,15 @@ vm_object_coalesce(vm_object_t prev_object, vm_pindex_t prev_pindex, /* * Make the object writable and flag is being possibly dirty. * - * No requirements. + * The caller must hold the object. XXX called from vm_page_dirty(), + * There is currently no requirement to hold the object. */ void vm_object_set_writeable_dirty(vm_object_t object) { struct vnode *vp; - lwkt_gettoken(&vm_token); + /*vm_object_assert_held(object);*/ vm_object_set_flag(object, OBJ_WRITEABLE|OBJ_MIGHTBEDIRTY); if (object->type == OBJT_VNODE && (vp = (struct vnode *)object->handle) != NULL) { @@ -2043,7 +2258,6 @@ vm_object_set_writeable_dirty(vm_object_t object) vsetflags(vp, VOBJDIRTY); } } - lwkt_reltoken(&vm_token); } #include "opt_ddb.h" @@ -2059,14 +2273,14 @@ static int _vm_object_in_map (vm_map_t map, vm_object_t object, static int vm_object_in_map (vm_object_t object); /* - * The caller must hold vm_token. + * The caller must hold the object. */ static int _vm_object_in_map(vm_map_t map, vm_object_t object, vm_map_entry_t entry) { vm_map_t tmpm; vm_map_entry_t tmpe; - vm_object_t obj; + vm_object_t obj, nobj; int entcount; if (map == 0) @@ -2098,9 +2312,23 @@ _vm_object_in_map(vm_map_t map, vm_object_t object, vm_map_entry_t entry) case VM_MAPTYPE_VPAGETABLE: obj = entry->object.vm_object; while (obj) { - if (obj == object) + if (obj == object) { + if (obj != entry->object.vm_object) + vm_object_drop(obj); return 1; - obj = obj->backing_object; + } + while ((nobj = obj->backing_object) != NULL) { + vm_object_hold(nobj); + if (nobj == obj->backing_object) + break; + vm_object_drop(nobj); + } + if (obj != entry->object.vm_object) { + if (nobj) + vm_object_lock_swap(); + vm_object_drop(obj); + } + obj = nobj; } break; default: diff --git a/sys/vm/vm_object.h b/sys/vm/vm_object.h index 9a66a83e41..1d9b46b255 100644 --- a/sys/vm/vm_object.h +++ b/sys/vm/vm_object.h @@ -128,18 +128,22 @@ typedef u_char objtype_t; * vm_object A VM object which represents an arbitrarily sized * data store. * - * Locking requirements: vmobj_token for ref_count and object_list, and - * vm_token for everything else. + * Locking requirements: + * vmobj_token for object_list + * + * vm_object_hold/drop() for most vm_object related operations. 
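The vm_object_coalesce() hunk above illustrates the new locking discipline for collapses: hold the object, chain-lock it, collapse, then unwind in reverse order. A minimal sketch of that sequence, assuming only the APIs introduced by this patch (the example_ function name is invented, and the caller is assumed to already own a reference on the object):

	/* Sketch only; assumes <vm/vm.h> and <vm/vm_object.h>. */
	static void
	example_collapse(vm_object_t obj)
	{
		vm_object_hold(obj);		/* token-based hold keeps obj stable */
		vm_object_chain_acquire(obj);	/* OBJ_CHAINLOCK vs splits/collapses */

		vm_object_collapse(obj);	/* obj must be held, chain-locked, and ref'd */

		vm_object_chain_release(obj);
		vm_object_drop(obj);
	}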
+ * + * OBJ_CHAINLOCK to avoid chain/shadow object collisions */ struct vm_object { TAILQ_ENTRY(vm_object) object_list; /* vmobj_token */ - LIST_HEAD(, vm_object) shadow_head; /* objects that this is a shadow for */ - LIST_ENTRY(vm_object) shadow_list; /* chain of shadow objects */ + LIST_HEAD(, vm_object) shadow_head; /* objects we are a shadow for */ + LIST_ENTRY(vm_object) shadow_list; /* chain of shadow objects */ RB_HEAD(vm_page_rb_tree, vm_page) rb_memq; /* resident pages */ int generation; /* generation ID */ vm_pindex_t size; /* Object size */ - int ref_count; /* vmobj_token */ - int shadow_count; /* how many objects that this is a shadow for */ + int ref_count; + int shadow_count; /* count of objs we are a shadow for */ objtype_t type; /* type of pager */ u_short flags; /* see below */ u_short pg_color; /* color of first page in obj */ @@ -148,8 +152,8 @@ struct vm_object { u_int agg_pv_list_count; /* aggregate pv list count */ struct vm_object *backing_object; /* object that I'm a shadow of */ vm_ooffset_t backing_object_offset;/* Offset in backing object */ - void *handle; - int hold_count; /* refcount for object liveness */ + void *handle; /* control handle: vp, etc */ + int hold_count; /* count prevents destruction */ #if defined(DEBUG_LOCKS) /* @@ -187,6 +191,8 @@ struct vm_object { /* * Flags */ +#define OBJ_CHAINLOCK 0x0001 /* backing_object/shadow changing */ +#define OBJ_CHAINWANT 0x0002 #define OBJ_ACTIVE 0x0004 /* active objects */ #define OBJ_DEAD 0x0008 /* dead objects (during rundown) */ #define OBJ_NOSPLIT 0x0010 /* dont split this object */ @@ -195,7 +201,7 @@ struct vm_object { #define OBJ_MIGHTBEDIRTY 0x0100 /* object might be dirty */ #define OBJ_CLEANING 0x0200 #define OBJ_DEADWNT 0x1000 /* waiting because object is dead */ -#define OBJ_ONEMAPPING 0x2000 /* One USE (a single, non-forked) mapping flag */ +#define OBJ_ONEMAPPING 0x2000 /* flag single vm_map_entry mapping */ #define OBJ_NOMSYNC 0x4000 /* disable msync() system call */ #define IDX_TO_OFF(idx) (((vm_ooffset_t)(idx)) << PAGE_SHIFT) @@ -275,12 +281,16 @@ void vm_object_page_remove (vm_object_t, vm_pindex_t, vm_pindex_t, boolean_t); void vm_object_pmap_copy (vm_object_t, vm_pindex_t, vm_pindex_t); void vm_object_pmap_copy_1 (vm_object_t, vm_pindex_t, vm_pindex_t); void vm_object_pmap_remove (vm_object_t, vm_pindex_t, vm_pindex_t); -void vm_object_reference (vm_object_t); void vm_object_reference_locked (vm_object_t); -void vm_object_shadow (vm_object_t *, vm_ooffset_t *, vm_size_t); +void vm_object_chain_wait (vm_object_t); +void vm_object_chain_acquire(vm_object_t object); +void vm_object_chain_release(vm_object_t object); +void vm_object_chain_release_all(vm_object_t object, vm_object_t stopobj); +void vm_object_shadow (vm_object_t *, vm_ooffset_t *, vm_size_t, int); void vm_object_madvise (vm_object_t, vm_pindex_t, int, int); void vm_object_init2 (void); -vm_page_t vm_fault_object_page(vm_object_t, vm_ooffset_t, vm_prot_t, int, int *); +vm_page_t vm_fault_object_page(vm_object_t, vm_ooffset_t, + vm_prot_t, int, int *); void vm_object_dead_sleep(vm_object_t, const char *); void vm_object_dead_wakeup(vm_object_t); void vm_object_lock_swap(void); diff --git a/sys/vm/vm_page.c b/sys/vm/vm_page.c index fe28554063..92b6a3df5e 100644 --- a/sys/vm/vm_page.c +++ b/sys/vm/vm_page.c @@ -90,6 +90,7 @@ #include #include +#include #define VMACTION_HSIZE 256 #define VMACTION_HMASK (VMACTION_HSIZE - 1) @@ -98,8 +99,12 @@ static void vm_page_queue_init(void); static void vm_page_free_wakeup(void); static vm_page_t 
vm_page_select_cache(vm_object_t, vm_pindex_t); static vm_page_t _vm_page_list_find2(int basequeue, int index); +static void _vm_page_deactivate_locked(vm_page_t m, int athead); -struct vpgqueues vm_page_queues[PQ_COUNT]; /* Array of tailq lists */ +/* + * Array of tailq lists + */ +__cachealign struct vpgqueues vm_page_queues[PQ_COUNT]; LIST_HEAD(vm_page_action_list, vm_page_action); struct vm_page_action_list action_list[VMACTION_HSIZE]; @@ -124,8 +129,10 @@ vm_page_queue_init(void) vm_page_queues[PQ_HOLD].cnt = &vmstats.v_active_count; /* PQ_NONE has no queue */ - for (i = 0; i < PQ_COUNT; i++) + for (i = 0; i < PQ_COUNT; i++) { TAILQ_INIT(&vm_page_queues[i].pl); + spin_init(&vm_page_queues[i].spin); + } for (i = 0; i < VMACTION_HSIZE; i++) LIST_INIT(&action_list[i]); @@ -170,8 +177,6 @@ vm_add_new_page(vm_paddr_t pa) struct vpgqueues *vpq; vm_page_t m; - ++vmstats.v_page_count; - ++vmstats.v_free_count; m = PHYS_TO_VM_PAGE(pa); m->phys_addr = pa; m->flags = 0; @@ -179,14 +184,16 @@ vm_add_new_page(vm_paddr_t pa) m->queue = m->pc + PQ_FREE; KKASSERT(m->dirty == 0); + atomic_add_int(&vmstats.v_page_count, 1); + atomic_add_int(&vmstats.v_free_count, 1); vpq = &vm_page_queues[m->queue]; if (vpq->flipflop) TAILQ_INSERT_TAIL(&vpq->pl, m, pageq); else TAILQ_INSERT_HEAD(&vpq->pl, m, pageq); vpq->flipflop = 1 - vpq->flipflop; + ++vpq->lcnt; - vm_page_queues[m->queue].lcnt++; return (m); } @@ -253,9 +260,11 @@ vm_page_startup(void) vm_page_queue_init(); - /* VKERNELs don't support minidumps and as such don't need vm_page_dump */ #if !defined(_KERNEL_VIRTUAL) /* + * VKERNELs don't support minidumps and as such don't need + * vm_page_dump + * * Allocate a bitmap to indicate that a random physical page * needs to be included in a minidump. * @@ -359,6 +368,326 @@ rb_vm_page_compare(struct vm_page *p1, struct vm_page *p2) return(0); } +/* + * Each page queue has its own spin lock, which is fairly optimal for + * allocating and freeing pages at least. + * + * The caller must hold the vm_page_spin_lock() before locking a vm_page's + * queue spinlock via this function. Also note that m->queue cannot change + * unless both the page and queue are locked. + */ +static __inline +void +_vm_page_queue_spin_lock(vm_page_t m) +{ + u_short queue; + + queue = m->queue; + if (queue != PQ_NONE) { + spin_lock(&vm_page_queues[queue].spin); + KKASSERT(queue == m->queue); + } +} + +static __inline +void +_vm_page_queue_spin_unlock(vm_page_t m) +{ + u_short queue; + + queue = m->queue; + cpu_ccfence(); + if (queue != PQ_NONE) + spin_unlock(&vm_page_queues[queue].spin); +} + +static __inline +void +_vm_page_queues_spin_lock(u_short queue) +{ + cpu_ccfence(); + if (queue != PQ_NONE) + spin_lock(&vm_page_queues[queue].spin); +} + + +static __inline +void +_vm_page_queues_spin_unlock(u_short queue) +{ + cpu_ccfence(); + if (queue != PQ_NONE) + spin_unlock(&vm_page_queues[queue].spin); +} + +void +vm_page_queue_spin_lock(vm_page_t m) +{ + _vm_page_queue_spin_lock(m); +} + +void +vm_page_queues_spin_lock(u_short queue) +{ + _vm_page_queues_spin_lock(queue); +} + +void +vm_page_queue_spin_unlock(vm_page_t m) +{ + _vm_page_queue_spin_unlock(m); +} + +void +vm_page_queues_spin_unlock(u_short queue) +{ + _vm_page_queues_spin_unlock(queue); +} + +/* + * This locks the specified vm_page and its queue in the proper order + * (page first, then queue). The queue may change so the caller must + * recheck on return. 
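Because m->queue can change while only the page spinlock is held, the page-then-queue helpers that follow require callers to re-check m->queue once both locks are held. A usage sketch (not part of the patch; the example_ name is invented):

	/* Sketch only; assumes <vm/vm.h> and <vm/vm_page.h>. */
	static void
	example_queue_check(vm_page_t m)
	{
		vm_page_and_queue_spin_lock(m);		/* page spinlock, then queue */
		if (m->queue == PQ_INACTIVE) {
			/* queue membership is now stable; safe to adjust linkage */
		}
		vm_page_and_queue_spin_unlock(m);
	}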
+ */ +static __inline +void +_vm_page_and_queue_spin_lock(vm_page_t m) +{ + vm_page_spin_lock(m); + _vm_page_queue_spin_lock(m); +} + +static __inline +void +_vm_page_and_queue_spin_unlock(vm_page_t m) +{ + _vm_page_queues_spin_unlock(m->queue); + vm_page_spin_unlock(m); +} + +void +vm_page_and_queue_spin_unlock(vm_page_t m) +{ + _vm_page_and_queue_spin_unlock(m); +} + +void +vm_page_and_queue_spin_lock(vm_page_t m) +{ + _vm_page_and_queue_spin_lock(m); +} + +/* + * Helper function removes vm_page from its current queue. + * Returns the base queue the page used to be on. + * + * The vm_page and the queue must be spinlocked. + * This function will unlock the queue but leave the page spinlocked. + */ +static __inline u_short +_vm_page_rem_queue_spinlocked(vm_page_t m) +{ + struct vpgqueues *pq; + u_short queue; + + queue = m->queue; + if (queue != PQ_NONE) { + pq = &vm_page_queues[queue]; + TAILQ_REMOVE(&pq->pl, m, pageq); + atomic_add_int(pq->cnt, -1); + pq->lcnt--; + m->queue = PQ_NONE; + if ((queue - m->pc) == PQ_FREE && (m->flags & PG_ZERO)) + atomic_subtract_int(&vm_page_zero_count, 1); + vm_page_queues_spin_unlock(queue); + if ((queue - m->pc) == PQ_CACHE || (queue - m->pc) == PQ_FREE) + return (queue - m->pc); + } + return queue; +} + +/* + * Helper function places the vm_page on the specified queue. + * + * The vm_page must be spinlocked. + * This function will return with both the page and the queue locked. + */ +static __inline void +_vm_page_add_queue_spinlocked(vm_page_t m, u_short queue, int athead) +{ + struct vpgqueues *pq; + + KKASSERT(m->queue == PQ_NONE); + + if (queue != PQ_NONE) { + vm_page_queues_spin_lock(queue); + pq = &vm_page_queues[queue]; + ++pq->lcnt; + atomic_add_int(pq->cnt, 1); + m->queue = queue; + + /* + * Put zero'd pages on the end ( where we look for zero'd pages + * first ) and non-zerod pages at the head. + */ + if (queue - m->pc == PQ_FREE) { + if (m->flags & PG_ZERO) { + TAILQ_INSERT_TAIL(&pq->pl, m, pageq); + atomic_add_int(&vm_page_zero_count, 1); + } else { + TAILQ_INSERT_HEAD(&pq->pl, m, pageq); + } + } else if (athead) { + TAILQ_INSERT_HEAD(&pq->pl, m, pageq); + } else { + TAILQ_INSERT_TAIL(&pq->pl, m, pageq); + } + /* leave the queue spinlocked */ + } +} + +/* + * Wait until page is no longer PG_BUSY or (if also_m_busy is TRUE) + * m->busy is zero. Returns TRUE if it had to sleep, FALSE if we + * did not. Only one sleep call will be made before returning. + * + * This function does NOT busy the page and on return the page is not + * guaranteed to be available. + */ +void +vm_page_sleep_busy(vm_page_t m, int also_m_busy, const char *msg) +{ + u_int32_t flags; + + for (;;) { + flags = m->flags; + cpu_ccfence(); + + if ((flags & PG_BUSY) == 0 && + (also_m_busy == 0 || (flags & PG_SBUSY) == 0)) { + break; + } + tsleep_interlock(m, 0); + if (atomic_cmpset_int(&m->flags, flags, + flags | PG_WANTED | PG_REFERENCED)) { + tsleep(m, PINTERLOCKED, msg, 0); + break; + } + } +} + +/* + * Wait until PG_BUSY can be set, then set it. If also_m_busy is TRUE we + * also wait for m->busy to become 0 before setting PG_BUSY. 
+ */ +void +VM_PAGE_DEBUG_EXT(vm_page_busy_wait)(vm_page_t m, + int also_m_busy, const char *msg + VM_PAGE_DEBUG_ARGS) +{ + u_int32_t flags; + + for (;;) { + flags = m->flags; + cpu_ccfence(); + if (flags & PG_BUSY) { + tsleep_interlock(m, 0); + if (atomic_cmpset_int(&m->flags, flags, + flags | PG_WANTED | PG_REFERENCED)) { + tsleep(m, PINTERLOCKED, msg, 0); + } + } else if (also_m_busy && (flags & PG_SBUSY)) { + tsleep_interlock(m, 0); + if (atomic_cmpset_int(&m->flags, flags, + flags | PG_WANTED | PG_REFERENCED)) { + tsleep(m, PINTERLOCKED, msg, 0); + } + } else { + if (atomic_cmpset_int(&m->flags, flags, + flags | PG_BUSY)) { +#ifdef VM_PAGE_DEBUG + m->busy_func = func; + m->busy_line = lineno; +#endif + break; + } + } + } +} + +/* + * Attempt to set PG_BUSY. If also_m_busy is TRUE we only succeed if m->busy + * is also 0. + * + * Returns non-zero on failure. + */ +int +VM_PAGE_DEBUG_EXT(vm_page_busy_try)(vm_page_t m, int also_m_busy + VM_PAGE_DEBUG_ARGS) +{ + u_int32_t flags; + + for (;;) { + flags = m->flags; + cpu_ccfence(); + if (flags & PG_BUSY) + return TRUE; + if (also_m_busy && (flags & PG_SBUSY)) + return TRUE; + if (atomic_cmpset_int(&m->flags, flags, flags | PG_BUSY)) { +#ifdef VM_PAGE_DEBUG + m->busy_func = func; + m->busy_line = lineno; +#endif + return FALSE; + } + } +} + +/* + * Clear the PG_BUSY flag and return non-zero to indicate to the caller + * that a wakeup() should be performed. + * + * The vm_page must be spinlocked and will remain spinlocked on return. + * The related queue must NOT be spinlocked (which could deadlock us). + * + * (inline version) + */ +static __inline +int +_vm_page_wakeup(vm_page_t m) +{ + u_int32_t flags; + + for (;;) { + flags = m->flags; + cpu_ccfence(); + if (atomic_cmpset_int(&m->flags, flags, + flags & ~(PG_BUSY | PG_WANTED))) { + break; + } + } + return(flags & PG_WANTED); +} + +/* + * Clear the PG_BUSY flag and wakeup anyone waiting for the page. This + * is typically the last call you make on a page before moving onto + * other things. + */ +void +vm_page_wakeup(vm_page_t m) +{ + KASSERT(m->flags & PG_BUSY, ("vm_page_wakeup: page not busy!!!")); + vm_page_spin_lock(m); + if (_vm_page_wakeup(m)) { + vm_page_spin_unlock(m); + wakeup(m); + } else { + vm_page_spin_unlock(m); + } +} + /* * Holding a page keeps it from being reused. Other parts of the system * can still disassociate the page from its current object and free it, or @@ -368,38 +697,44 @@ rb_vm_page_compare(struct vm_page *p1, struct vm_page *p2) * reference is released. (see vm_page_wire() if you want to prevent the * page from being disassociated from its object too). * - * The caller must hold vm_token. - * * The caller must still validate the contents of the page and, if necessary, * wait for any pending I/O (e.g. vm_page_sleep_busy() loop) to complete * before manipulating the page. + * + * XXX get vm_page_spin_lock() here and move FREE->HOLD if necessary */ void vm_page_hold(vm_page_t m) { - ASSERT_LWKT_TOKEN_HELD(&vm_token); - ++m->hold_count; + vm_page_spin_lock(m); + atomic_add_int(&m->hold_count, 1); + if (m->queue - m->pc == PQ_FREE) { + _vm_page_queue_spin_lock(m); + _vm_page_rem_queue_spinlocked(m); + _vm_page_add_queue_spinlocked(m, PQ_HOLD, 0); + _vm_page_queue_spin_unlock(m); + } + vm_page_spin_unlock(m); } /* * The opposite of vm_page_hold(). A page can be freed while being held, - * which places it on the PQ_HOLD queue. We must call vm_page_free_toq() - * in this case to actually free it once the hold count drops to 0. 
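vm_page_busy_try(), vm_page_sleep_busy() and vm_page_wakeup() above replace the old vm_page_busy()/vm_page_flash() protocol. A typical non-blocking caller would look roughly like this sketch (illustrative only; the example_ name and the "xmpl" wait message are invented):

	/* Sketch only; assumes <vm/vm.h> and <vm/vm_page.h>. */
	static int
	example_try_page(vm_page_t m)
	{
		if (vm_page_busy_try(m, TRUE)) {
			/* Could not hard-busy; wait for the owner, caller retries */
			vm_page_sleep_busy(m, TRUE, "xmpl");
			return (1);
		}
		/* PG_BUSY is held: the page cannot be freed or renamed under us */
		/* ... operate on the page ... */
		vm_page_wakeup(m);		/* clears PG_BUSY, wakes waiters */
		return (0);
	}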
- * - * The caller must hold vm_token if non-blocking operation is desired, - * but otherwise does not need to. + * which places it on the PQ_HOLD queue. If we are able to busy the page + * after the hold count drops to zero we will move the page to the + * appropriate PQ_FREE queue by calling vm_page_free_toq(). */ void vm_page_unhold(vm_page_t m) { - lwkt_gettoken(&vm_token); - --m->hold_count; - KASSERT(m->hold_count >= 0, ("vm_page_unhold: hold count < 0!!!")); + vm_page_spin_lock(m); + atomic_add_int(&m->hold_count, -1); if (m->hold_count == 0 && m->queue == PQ_HOLD) { - vm_page_busy(m); - vm_page_free_toq(m); + _vm_page_queue_spin_lock(m); + _vm_page_rem_queue_spinlocked(m); + _vm_page_add_queue_spinlocked(m, PQ_FREE + m->pc, 0); + _vm_page_queue_spin_unlock(m); } - lwkt_reltoken(&vm_token); + vm_page_spin_unlock(m); } /* @@ -411,39 +746,31 @@ vm_page_unhold(vm_page_t m) * here so we *can't* do this anyway. * * This routine may not block. - * This routine must be called with the vm_token held. * This routine must be called with the vm_object held. * This routine must be called with a critical section held. */ void vm_page_insert(vm_page_t m, vm_object_t object, vm_pindex_t pindex) { - ASSERT_LWKT_TOKEN_HELD(&vm_token); + ASSERT_LWKT_TOKEN_HELD(vm_object_token(object)); if (m->object != NULL) panic("vm_page_insert: already inserted"); + object->generation++; + object->resident_page_count++; + /* - * Record the object/offset pair in this page + * Record the object/offset pair in this page and add the + * pv_list_count of the page to the object. + * + * The vm_page spin lock is required for interactions with the pmap. */ + vm_page_spin_lock(m); m->object = object; m->pindex = pindex; - - /* - * Insert it into the object. - */ vm_page_rb_tree_RB_INSERT(&object->rb_memq, m); - object->generation++; - - /* - * show that the object has one more resident page. - */ - object->resident_page_count++; - - /* - * Add the pv_list_cout of the page when its inserted in - * the object - */ - object->agg_pv_list_count = object->agg_pv_list_count + m->md.pv_list_count; + atomic_add_int(&object->agg_pv_list_count, m->md.pv_list_count); + vm_page_spin_unlock(m); /* * Since we are inserting a new and possibly dirty page, @@ -459,8 +786,7 @@ vm_page_insert(vm_page_t m, vm_object_t object, vm_pindex_t pindex) } /* - * Removes the given vm_page_t from the global (object,index) hash table - * and from the object's memq. + * Removes the given vm_page_t from the (object,index) table * * The underlying pmap entry (if any) is NOT removed here. * This routine may not block. @@ -476,9 +802,7 @@ vm_page_remove(vm_page_t m) { vm_object_t object; - lwkt_gettoken(&vm_token); if (m->object == NULL) { - lwkt_reltoken(&vm_token); return; } @@ -491,23 +815,26 @@ vm_page_remove(vm_page_t m) /* * Remove the page from the object and update the object. + * + * The vm_page spin lock is required for interactions with the pmap. */ + vm_page_spin_lock(m); vm_page_rb_tree_RB_REMOVE(&object->rb_memq, m); object->resident_page_count--; - object->agg_pv_list_count = object->agg_pv_list_count - m->md.pv_list_count; - object->generation++; + atomic_add_int(&object->agg_pv_list_count, -m->md.pv_list_count); m->object = NULL; + vm_page_spin_unlock(m); - vm_object_drop(object); + object->generation++; - lwkt_reltoken(&vm_token); + vm_object_drop(object); } /* * Locate and return the page at (object, pindex), or NULL if the * page could not be found. * - * The caller must hold vm_token. + * The caller must hold the vm_object token. 
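vm_page_hold()/vm_page_unhold() now take only the page spinlock internally, so a caller no longer needs vm_token to pin a page across a blocking operation. A hedged sketch (example_ name invented):

	/* Sketch only; assumes <vm/vm.h> and <vm/vm_page.h>. */
	static void
	example_pin_page(vm_page_t m)
	{
		vm_page_hold(m);	/* page cannot be reused, though it may still
					 * be freed from its object onto PQ_HOLD */
		/* ... potentially blocking work that must not lose the page ... */
		vm_page_unhold(m);	/* last hold moves a PQ_HOLD page to PQ_FREE */
	}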
*/ vm_page_t vm_page_lookup(vm_object_t object, vm_pindex_t pindex) @@ -517,30 +844,131 @@ vm_page_lookup(vm_object_t object, vm_pindex_t pindex) /* * Search the hash table for this object/offset pair */ - ASSERT_LWKT_TOKEN_HELD(&vm_token); + ASSERT_LWKT_TOKEN_HELD(vm_object_token(object)); m = vm_page_rb_tree_RB_LOOKUP(&object->rb_memq, pindex); KKASSERT(m == NULL || (m->object == object && m->pindex == pindex)); return(m); } +vm_page_t +VM_PAGE_DEBUG_EXT(vm_page_lookup_busy_wait)(struct vm_object *object, + vm_pindex_t pindex, + int also_m_busy, const char *msg + VM_PAGE_DEBUG_ARGS) +{ + u_int32_t flags; + vm_page_t m; + + ASSERT_LWKT_TOKEN_HELD(vm_object_token(object)); + m = vm_page_rb_tree_RB_LOOKUP(&object->rb_memq, pindex); + while (m) { + KKASSERT(m->object == object && m->pindex == pindex); + flags = m->flags; + cpu_ccfence(); + if (flags & PG_BUSY) { + tsleep_interlock(m, 0); + if (atomic_cmpset_int(&m->flags, flags, + flags | PG_WANTED | PG_REFERENCED)) { + tsleep(m, PINTERLOCKED, msg, 0); + m = vm_page_rb_tree_RB_LOOKUP(&object->rb_memq, + pindex); + } + } else if (also_m_busy && (flags & PG_SBUSY)) { + tsleep_interlock(m, 0); + if (atomic_cmpset_int(&m->flags, flags, + flags | PG_WANTED | PG_REFERENCED)) { + tsleep(m, PINTERLOCKED, msg, 0); + m = vm_page_rb_tree_RB_LOOKUP(&object->rb_memq, + pindex); + } + } else if (atomic_cmpset_int(&m->flags, flags, + flags | PG_BUSY)) { +#ifdef VM_PAGE_DEBUG + m->busy_func = func; + m->busy_line = lineno; +#endif + break; + } + } + return m; +} + /* - * vm_page_rename() + * Attempt to lookup and busy a page. * - * Move the given memory entry from its current object to the specified - * target object/offset. + * Returns NULL if the page could not be found * - * The object must be locked. - * This routine may not block. + * Returns a vm_page and error == TRUE if the page exists but could not + * be busied. * - * Note: This routine will raise itself to splvm(), the caller need not. + * Returns a vm_page and error == FALSE on success. + */ +vm_page_t +VM_PAGE_DEBUG_EXT(vm_page_lookup_busy_try)(struct vm_object *object, + vm_pindex_t pindex, + int also_m_busy, int *errorp + VM_PAGE_DEBUG_ARGS) +{ + u_int32_t flags; + vm_page_t m; + + ASSERT_LWKT_TOKEN_HELD(vm_object_token(object)); + m = vm_page_rb_tree_RB_LOOKUP(&object->rb_memq, pindex); + *errorp = FALSE; + while (m) { + KKASSERT(m->object == object && m->pindex == pindex); + flags = m->flags; + cpu_ccfence(); + if (flags & PG_BUSY) { + *errorp = TRUE; + break; + } + if (also_m_busy && (flags & PG_SBUSY)) { + *errorp = TRUE; + break; + } + if (atomic_cmpset_int(&m->flags, flags, flags | PG_BUSY)) { +#ifdef VM_PAGE_DEBUG + m->busy_func = func; + m->busy_line = lineno; +#endif + break; + } + } + return m; +} + +/* + * Caller must hold the related vm_object + */ +vm_page_t +vm_page_next(vm_page_t m) +{ + vm_page_t next; + + next = vm_page_rb_tree_RB_NEXT(m); + if (next && next->pindex != m->pindex + 1) + next = NULL; + return (next); +} + +/* + * vm_page_rename() + * + * Move the given vm_page from its current object to the specified + * target object/offset. The page must be busy and will remain so + * on return. * - * Note: Swap associated with the page must be invalidated by the move. We + * new_object must be held. + * This routine might block. XXX ? + * + * NOTE: Swap associated with the page must be invalidated by the move. 
We * have to do this for several reasons: (1) we aren't freeing the * page, (2) we are dirtying the page, (3) the VM system is probably * moving the page from object A to B, and will then later move * the backing store from A to B and we can't have a conflict. * - * Note: We *always* dirty the page. It is necessary both for the + * NOTE: We *always* dirty the page. It is necessary both for the * fact that we moved it, and because we may be invalidating * swap. If the page is on the cache, we have to deactivate it * or vm_page_dirty() will panic. Dirty pages are not allowed @@ -549,16 +977,16 @@ vm_page_lookup(vm_object_t object, vm_pindex_t pindex) void vm_page_rename(vm_page_t m, vm_object_t new_object, vm_pindex_t new_pindex) { - lwkt_gettoken(&vm_token); - vm_object_hold(new_object); - vm_page_remove(m); + KKASSERT(m->flags & PG_BUSY); + ASSERT_LWKT_TOKEN_HELD(vm_object_token(new_object)); + if (m->object) { + ASSERT_LWKT_TOKEN_HELD(vm_object_token(m->object)); + vm_page_remove(m); + } vm_page_insert(m, new_object, new_pindex); if (m->queue - m->pc == PQ_CACHE) vm_page_deactivate(m); vm_page_dirty(m); - vm_page_wakeup(m); - vm_object_drop(new_object); - lwkt_reltoken(&vm_token); } /* @@ -566,47 +994,34 @@ vm_page_rename(vm_page_t m, vm_object_t new_object, vm_pindex_t new_pindex) * is being moved between queues or otherwise is to remain BUSYied by the * caller. * - * The caller must hold vm_token * This routine may not block. */ void vm_page_unqueue_nowakeup(vm_page_t m) { - int queue = m->queue; - struct vpgqueues *pq; - - ASSERT_LWKT_TOKEN_HELD(&vm_token); - if (queue != PQ_NONE) { - pq = &vm_page_queues[queue]; - m->queue = PQ_NONE; - TAILQ_REMOVE(&pq->pl, m, pageq); - (*pq->cnt)--; - pq->lcnt--; - } + vm_page_and_queue_spin_lock(m); + (void)_vm_page_rem_queue_spinlocked(m); + vm_page_spin_unlock(m); } /* * vm_page_unqueue() - Remove a page from its queue, wakeup the pagedemon * if necessary. * - * The caller must hold vm_token * This routine may not block. */ void vm_page_unqueue(vm_page_t m) { - int queue = m->queue; - struct vpgqueues *pq; + u_short queue; - ASSERT_LWKT_TOKEN_HELD(&vm_token); - if (queue != PQ_NONE) { - m->queue = PQ_NONE; - pq = &vm_page_queues[queue]; - TAILQ_REMOVE(&pq->pl, m, pageq); - (*pq->cnt)--; - pq->lcnt--; - if ((queue - m->pc) == PQ_CACHE || (queue - m->pc) == PQ_FREE) - pagedaemon_wakeup(); + vm_page_and_queue_spin_lock(m); + queue = _vm_page_rem_queue_spinlocked(m); + if (queue == PQ_FREE || queue == PQ_CACHE) { + vm_page_spin_unlock(m); + pagedaemon_wakeup(); + } else { + vm_page_spin_unlock(m); } } @@ -618,14 +1033,21 @@ vm_page_unqueue(vm_page_t m) * The page coloring optimization attempts to locate a page that does * not overload other nearby pages in the object in the cpu's L1 or L2 * caches. We need this optimization because cpu caches tend to be - * physical caches, while object spaces tend to be virtual. + * physical caches, while object spaces tend to be virtual. This optimization + * also gives us multiple queues and spinlocks to worth with on SMP systems. * - * Must be called with vm_token held. - * This routine may not block. + * The page is returned spinlocked and removed from its queue (it will + * be on PQ_NONE), or NULL. The page is not PG_BUSY'd. The caller + * is responsible for dealing with the busy-page case (usually by + * deactivating the page and looping). + * + * NOTE: This routine is carefully inlined. A non-inlined version + * is available for outside callers but the only critical path is + * from within this source file. 
* - * Note that this routine is carefully inlined. A non-inlined version - * is available for outside callers but the only critical path is - * from within this source file. + * NOTE: This routine assumes that the vm_pages found in PQ_CACHE and PQ_FREE + * represent stable storage, allowing us to order our locks vm_page + * first, then queue. */ static __inline vm_page_t @@ -633,12 +1055,23 @@ _vm_page_list_find(int basequeue, int index, boolean_t prefer_zero) { vm_page_t m; - if (prefer_zero) - m = TAILQ_LAST(&vm_page_queues[basequeue+index].pl, pglist); - else - m = TAILQ_FIRST(&vm_page_queues[basequeue+index].pl); - if (m == NULL) - m = _vm_page_list_find2(basequeue, index); + for (;;) { + if (prefer_zero) + m = TAILQ_LAST(&vm_page_queues[basequeue+index].pl, pglist); + else + m = TAILQ_FIRST(&vm_page_queues[basequeue+index].pl); + if (m == NULL) { + m = _vm_page_list_find2(basequeue, index); + return(m); + } + vm_page_and_queue_spin_lock(m); + if (m->queue == basequeue + index) { + _vm_page_rem_queue_spinlocked(m); + /* vm_page_t spin held, no queue spin */ + break; + } + vm_page_and_queue_spin_unlock(m); + } return(m); } @@ -656,20 +1089,42 @@ _vm_page_list_find2(int basequeue, int index) * same place. Even though this is not totally optimal, we've already * blown it by missing the cache case so we do not care. */ - - for(i = PQ_L2_SIZE / 2; i > 0; --i) { - if ((m = TAILQ_FIRST(&pq[(index + i) & PQ_L2_MASK].pl)) != NULL) - break; - - if ((m = TAILQ_FIRST(&pq[(index - i) & PQ_L2_MASK].pl)) != NULL) - break; + for (i = PQ_L2_SIZE / 2; i > 0; --i) { + for (;;) { + m = TAILQ_FIRST(&pq[(index + i) & PQ_L2_MASK].pl); + if (m) { + _vm_page_and_queue_spin_lock(m); + if (m->queue == + basequeue + ((index + i) & PQ_L2_MASK)) { + _vm_page_rem_queue_spinlocked(m); + return(m); + } + _vm_page_and_queue_spin_unlock(m); + continue; + } + m = TAILQ_FIRST(&pq[(index - i) & PQ_L2_MASK].pl); + if (m) { + _vm_page_and_queue_spin_lock(m); + if (m->queue == + basequeue + ((index - i) & PQ_L2_MASK)) { + _vm_page_rem_queue_spinlocked(m); + return(m); + } + _vm_page_and_queue_spin_unlock(m); + continue; + } + break; /* next i */ + } } return(m); } /* - * Must be called with vm_token held if the caller desired non-blocking - * operation and a stable result. + * Returns a vm_page candidate for allocation. The page is not busied so + * it can move around. The caller must busy the page (and typically + * deactivate it if it cannot be busied!) + * + * Returns a spinlocked vm_page that has been removed from its queue. */ vm_page_t vm_page_list_find(int basequeue, int index, boolean_t prefer_zero) @@ -678,57 +1133,94 @@ vm_page_list_find(int basequeue, int index, boolean_t prefer_zero) } /* - * Find a page on the cache queue with color optimization. As pages - * might be found, but not applicable, they are deactivated. This - * keeps us from using potentially busy cached pages. + * Find a page on the cache queue with color optimization, remove it + * from the queue, and busy it. The returned page will not be spinlocked. + * + * A candidate failure will be deactivated. Candidates can fail due to + * being busied by someone else, in which case they will be deactivated. * * This routine may not block. - * Must be called with vm_token held. 
+ * */ -vm_page_t +static vm_page_t vm_page_select_cache(vm_object_t object, vm_pindex_t pindex) { vm_page_t m; - ASSERT_LWKT_TOKEN_HELD(&vm_token); - while (TRUE) { - m = _vm_page_list_find( - PQ_CACHE, - (pindex + object->pg_color) & PQ_L2_MASK, - FALSE - ); - if (m && ((m->flags & (PG_BUSY|PG_UNMANAGED)) || m->busy || - m->hold_count || m->wire_count)) { - /* cache page found busy */ - vm_page_deactivate(m); + for (;;) { + m = _vm_page_list_find(PQ_CACHE, + (pindex + object->pg_color) & PQ_L2_MASK, + FALSE); + if (m == NULL) + break; + /* + * (m) has been removed from its queue and spinlocked + */ + if (vm_page_busy_try(m, TRUE)) { + _vm_page_deactivate_locked(m, 0); + vm_page_spin_unlock(m); #ifdef INVARIANTS kprintf("Warning: busy page %p found in cache\n", m); #endif - continue; + } else { + /* + * We successfully busied the page + */ + if ((m->flags & PG_UNMANAGED) == 0 && + m->hold_count == 0 && + m->wire_count == 0) { + vm_page_spin_unlock(m); + pagedaemon_wakeup(); + return(m); + } + _vm_page_deactivate_locked(m, 0); + if (_vm_page_wakeup(m)) { + vm_page_spin_unlock(m); + wakeup(m); + } else { + vm_page_spin_unlock(m); + } } - return m; } - /* not reached */ + return (m); } /* * Find a free or zero page, with specified preference. We attempt to * inline the nominal case and fall back to _vm_page_select_free() - * otherwise. + * otherwise. A busied page is removed from the queue and returned. * - * This routine must be called with a critical section held. * This routine may not block. */ static __inline vm_page_t -vm_page_select_free(vm_object_t object, vm_pindex_t pindex, boolean_t prefer_zero) +vm_page_select_free(vm_object_t object, vm_pindex_t pindex, + boolean_t prefer_zero) { vm_page_t m; - m = _vm_page_list_find( - PQ_FREE, - (pindex + object->pg_color) & PQ_L2_MASK, - prefer_zero - ); + for (;;) { + m = _vm_page_list_find(PQ_FREE, + (pindex + object->pg_color) & PQ_L2_MASK, + prefer_zero); + if (m == NULL) + break; + if (vm_page_busy_try(m, TRUE)) { + _vm_page_deactivate_locked(m, 0); + vm_page_spin_unlock(m); +#ifdef INVARIANTS + kprintf("Warning: busy page %p found in cache\n", m); +#endif + } else { + KKASSERT((m->flags & PG_UNMANAGED) == 0); + KKASSERT(m->hold_count == 0); + KKASSERT(m->wire_count == 0); + vm_page_spin_unlock(m); + pagedaemon_wakeup(); + + /* return busied and removed page */ + return(m); + } + } return(m); } @@ -759,8 +1251,6 @@ vm_page_alloc(vm_object_t object, vm_pindex_t pindex, int page_req) { vm_page_t m = NULL; - lwkt_gettoken(&vm_token); - KKASSERT(object != NULL); KASSERT(!vm_page_lookup(object, pindex), ("vm_page_alloc: page already allocated")); @@ -811,7 +1301,6 @@ loop: if (m != NULL) { KASSERT(m->dirty == 0, ("Found dirty cache page %p", m)); - vm_page_busy(m); vm_page_protect(m, VM_PROT_NONE); vm_page_free(m); goto loop; @@ -820,7 +1309,6 @@ loop: /* * On failure return NULL */ - lwkt_reltoken(&vm_token); #if defined(DIAGNOSTIC) if (vmstats.v_cache_count > 0) kprintf("vm_page_alloc(NORMAL): missing pages on cache queue: %d\n", vmstats.v_cache_count); @@ -832,46 +1320,43 @@ loop: /* * No pages available, wakeup the pageout daemon and give up. */ - lwkt_reltoken(&vm_token); vm_pageout_deficit++; pagedaemon_wakeup(); return (NULL); } /* - * Good page found. The page has not yet been busied. We are in - * a critical section. + * Good page found. The page has already been busied for us. + * + * v_free_count can race so loop if we don't find the expected + * page. 
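With the rewrite, vm_page_alloc() hands back a page that has already been removed from its queue and hard-busied, and vm_page_insert() asserts that the vm_object token is held. A sketch of the common allocation path under those rules (example_ name invented):

	/* Sketch only; assumes <vm/vm.h>, <vm/vm_object.h> and <vm/vm_page.h>. */
	static vm_page_t
	example_alloc_page(vm_object_t object, vm_pindex_t pindex)
	{
		vm_page_t m;

		vm_object_hold(object);		/* required for vm_page_insert() */
		m = vm_page_alloc(object, pindex, VM_ALLOC_NORMAL);
		vm_object_drop(object);
		if (m == NULL)
			return (NULL);		/* normal allocations can fail */

		/* m is PG_BUSY, on no queue, with valid == 0; initialize it, then: */
		vm_page_wakeup(m);
		return (m);
	}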
*/ - KASSERT(m != NULL, ("vm_page_alloc(): missing page on free queue\n")); + if (m == NULL) + goto loop; KASSERT(m->dirty == 0, ("vm_page_alloc: free/cache page %p was dirty", m)); /* - * Remove from free queue + * NOTE: page has already been removed from its queue and busied. */ - vm_page_unqueue_nowakeup(m); + KKASSERT(m->queue == PQ_NONE); /* * Initialize structure. Only the PG_ZERO flag is inherited. Set * the page PG_BUSY */ - if (m->flags & PG_ZERO) { - vm_page_zero_count--; - m->flags = PG_ZERO | PG_BUSY; - } else { - m->flags = PG_BUSY; - } - m->wire_count = 0; - m->hold_count = 0; + vm_page_flag_clear(m, ~(PG_ZERO | PG_BUSY)); + KKASSERT(m->wire_count == 0); + KKASSERT(m->busy == 0); m->act_count = 0; - m->busy = 0; m->valid = 0; /* - * vm_page_insert() is safe while holding vm_token. Note also that - * inserting a page here does not insert it into the pmap (which - * could cause us to block allocating memory). We cannot block - * anywhere. + * Caller must be holding the object lock (asserted by + * vm_page_insert()). + * + * NOTE: Inserting a page here does not insert it into any pmaps + * (which could cause us to block allocating memory). */ vm_page_insert(m, object, pindex); @@ -881,8 +1366,6 @@ loop: */ pagedaemon_wakeup(); - lwkt_reltoken(&vm_token); - /* * A PG_BUSY page is returned. */ @@ -982,33 +1465,35 @@ vm_waitpfault(void) * Put the specified page on the active list (if appropriate). Ensure * that act_count is at least ACT_INIT but do not otherwise mess with it. * - * The page queues must be locked. + * The caller should be holding the page busied ? XXX * This routine may not block. */ void vm_page_activate(vm_page_t m) { - lwkt_gettoken(&vm_token); - if (m->queue != PQ_ACTIVE) { - if ((m->queue - m->pc) == PQ_CACHE) - mycpu->gd_cnt.v_reactivated++; + u_short oqueue; - vm_page_unqueue(m); + vm_page_spin_lock(m); + if (m->queue != PQ_ACTIVE) { + _vm_page_queue_spin_lock(m); + oqueue = _vm_page_rem_queue_spinlocked(m); + /* page is left spinlocked, queue is unlocked */ + if (oqueue == PQ_CACHE) + mycpu->gd_cnt.v_reactivated++; if (m->wire_count == 0 && (m->flags & PG_UNMANAGED) == 0) { - m->queue = PQ_ACTIVE; - vm_page_queues[PQ_ACTIVE].lcnt++; - TAILQ_INSERT_TAIL(&vm_page_queues[PQ_ACTIVE].pl, - m, pageq); if (m->act_count < ACT_INIT) m->act_count = ACT_INIT; - vmstats.v_active_count++; + _vm_page_add_queue_spinlocked(m, PQ_ACTIVE, 0); } + _vm_page_and_queue_spin_unlock(m); + if (oqueue == PQ_CACHE || oqueue == PQ_FREE) + pagedaemon_wakeup(); } else { if (m->act_count < ACT_INIT) m->act_count = ACT_INIT; + vm_page_spin_unlock(m); } - lwkt_reltoken(&vm_token); } /* @@ -1017,7 +1502,6 @@ vm_page_activate(vm_page_t m) * queues. * * This routine may not block. - * This routine must be called at splvm() */ static __inline void vm_page_free_wakeup(void) @@ -1065,26 +1549,18 @@ vm_page_free_wakeup(void) } /* - * vm_page_free_toq: + * Returns the given page to the PQ_FREE or PQ_HOLD list and disassociates + * it from its VM object. * - * Returns the given page to the PQ_FREE list, disassociating it with - * any VM object. - * - * The vm_page must be PG_BUSY on entry. PG_BUSY will be released on - * return (the page will have been freed). No particular spl is required - * on entry. - * - * This routine may not block. + * The vm_page must be PG_BUSY on entry. PG_BUSY will be released on + * return (the page will have been freed). 
*/ void vm_page_free_toq(vm_page_t m) { - struct vpgqueues *pq; - - lwkt_gettoken(&vm_token); mycpu->gd_cnt.v_tfree++; - KKASSERT((m->flags & PG_MAPPED) == 0); + KKASSERT(m->flags & PG_BUSY); if (m->busy || ((m->queue - m->pc) == PQ_FREE)) { kprintf( @@ -1098,21 +1574,22 @@ vm_page_free_toq(vm_page_t m) } /* - * unqueue, then remove page. Note that we cannot destroy - * the page here because we do not want to call the pager's - * callback routine until after we've put the page on the - * appropriate free queue. + * Remove from object, spinlock the page and its queues and + * remove from any queue. No queue spinlock will be held + * after this section (because the page was removed from any + * queue). */ - vm_page_unqueue_nowakeup(m); vm_page_remove(m); + vm_page_and_queue_spin_lock(m); + _vm_page_rem_queue_spinlocked(m); /* * No further management of fictitious pages occurs beyond object * and queue removal. */ if ((m->flags & PG_FICTITIOUS) != 0) { + vm_page_spin_unlock(m); vm_page_wakeup(m); - lwkt_reltoken(&vm_token); return; } @@ -1132,32 +1609,30 @@ vm_page_free_toq(vm_page_t m) * Clear the UNMANAGED flag when freeing an unmanaged page. */ if (m->flags & PG_UNMANAGED) { - vm_page_flag_clear(m, PG_UNMANAGED); + vm_page_flag_clear(m, PG_UNMANAGED); } if (m->hold_count != 0) { vm_page_flag_clear(m, PG_ZERO); - m->queue = PQ_HOLD; + _vm_page_add_queue_spinlocked(m, PQ_HOLD, 0); } else { - m->queue = PQ_FREE + m->pc; + _vm_page_add_queue_spinlocked(m, PQ_FREE + m->pc, 0); } - pq = &vm_page_queues[m->queue]; - pq->lcnt++; - ++(*pq->cnt); /* - * Put zero'd pages on the end ( where we look for zero'd pages - * first ) and non-zerod pages at the head. + * This sequence allows us to clear PG_BUSY while still holding + * its spin lock, which reduces contention vs allocators. We + * must not leave the queue locked or _vm_page_wakeup() may + * deadlock. */ - if (m->flags & PG_ZERO) { - TAILQ_INSERT_TAIL(&pq->pl, m, pageq); - ++vm_page_zero_count; + _vm_page_queue_spin_unlock(m); + if (_vm_page_wakeup(m)) { + vm_page_spin_unlock(m); + wakeup(m); } else { - TAILQ_INSERT_HEAD(&pq->pl, m, pageq); + vm_page_spin_unlock(m); } - vm_page_wakeup(m); vm_page_free_wakeup(); - lwkt_reltoken(&vm_token); } /* @@ -1165,8 +1640,6 @@ vm_page_free_toq(vm_page_t m) * * Remove a non-zero page from one of the free queues; the page is removed for * zeroing, so do not issue a wakeup. - * - * MPUNSAFE */ vm_page_t vm_page_free_fromq_fast(void) @@ -1175,19 +1648,42 @@ vm_page_free_fromq_fast(void) vm_page_t m; int i; - lwkt_gettoken(&vm_token); for (i = 0; i < PQ_L2_SIZE; ++i) { m = vm_page_list_find(PQ_FREE, qi, FALSE); - qi = (qi + PQ_PRIME2) & PQ_L2_MASK; - if (m && (m->flags & PG_ZERO) == 0) { - KKASSERT(m->busy == 0 && (m->flags & PG_BUSY) == 0); - vm_page_unqueue_nowakeup(m); - vm_page_busy(m); - break; + /* page is returned spinlocked and removed from its queue */ + if (m) { + if (vm_page_busy_try(m, TRUE)) { + /* + * We were unable to busy the page, deactivate + * it and loop. + */ + _vm_page_deactivate_locked(m, 0); + vm_page_spin_unlock(m); + } else if ((m->flags & PG_ZERO) == 0) { + /* + * The page is not PG_ZERO'd so return it. 
+ */ + vm_page_spin_unlock(m); + break; + } else { + /* + * The page is PG_ZERO, requeue it and loop + */ + _vm_page_add_queue_spinlocked(m, + PQ_FREE + m->pc, + 0); + vm_page_queue_spin_unlock(m); + if (_vm_page_wakeup(m)) { + vm_page_spin_unlock(m); + wakeup(m); + } else { + vm_page_spin_unlock(m); + } + } + m = NULL; } - m = NULL; + qi = (qi + PQ_PRIME2) & PQ_L2_MASK; } - lwkt_reltoken(&vm_token); return (m); } @@ -1209,13 +1705,12 @@ vm_page_free_fromq_fast(void) * will eventually be extended to support 4MB unmanaged physical * mappings. * - * Must be called with a critical section held. - * Must be called with vm_token held. + * Caller must be holding the page busy. */ void vm_page_unmanage(vm_page_t m) { - ASSERT_LWKT_TOKEN_HELD(&vm_token); + KKASSERT(m->flags & PG_BUSY); if ((m->flags & PG_UNMANAGED) == 0) { if (m->wire_count == 0) vm_page_unqueue(m); @@ -1227,8 +1722,7 @@ vm_page_unmanage(vm_page_t m) * Mark this page as wired down by yet another map, removing it from * paging queues as necessary. * - * The page queues must be locked. - * This routine may not block. + * Caller must be holding the page busy. */ void vm_page_wire(vm_page_t m) @@ -1239,18 +1733,16 @@ vm_page_wire(vm_page_t m) * it is already off the queues). Don't do anything with fictitious * pages because they are always wired. */ - lwkt_gettoken(&vm_token); + KKASSERT(m->flags & PG_BUSY); if ((m->flags & PG_FICTITIOUS) == 0) { - if (m->wire_count == 0) { + if (atomic_fetchadd_int(&m->wire_count, 1) == 0) { if ((m->flags & PG_UNMANAGED) == 0) vm_page_unqueue(m); - vmstats.v_wire_count++; + atomic_add_int(&vmstats.v_wire_count, 1); } - m->wire_count++; KASSERT(m->wire_count != 0, ("vm_page_wire: wire_count overflow m=%p", m)); } - lwkt_reltoken(&vm_token); } /* @@ -1281,37 +1773,32 @@ vm_page_wire(vm_page_t m) void vm_page_unwire(vm_page_t m, int activate) { - lwkt_gettoken(&vm_token); + KKASSERT(m->flags & PG_BUSY); if (m->flags & PG_FICTITIOUS) { /* do nothing */ } else if (m->wire_count <= 0) { panic("vm_page_unwire: invalid wire count: %d", m->wire_count); } else { - if (--m->wire_count == 0) { - --vmstats.v_wire_count; + if (atomic_fetchadd_int(&m->wire_count, -1) == 1) { + atomic_add_int(&vmstats.v_wire_count, -1); if (m->flags & PG_UNMANAGED) { ; } else if (activate) { - TAILQ_INSERT_TAIL( - &vm_page_queues[PQ_ACTIVE].pl, m, pageq); - m->queue = PQ_ACTIVE; - vm_page_queues[PQ_ACTIVE].lcnt++; - vmstats.v_active_count++; + vm_page_spin_lock(m); + _vm_page_add_queue_spinlocked(m, PQ_ACTIVE, 0); + _vm_page_and_queue_spin_unlock(m); } else { + vm_page_spin_lock(m); vm_page_flag_clear(m, PG_WINATCFLS); - TAILQ_INSERT_TAIL( - &vm_page_queues[PQ_INACTIVE].pl, m, pageq); - m->queue = PQ_INACTIVE; - vm_page_queues[PQ_INACTIVE].lcnt++; - vmstats.v_inactive_count++; + _vm_page_add_queue_spinlocked(m, PQ_INACTIVE, + 0); ++vm_swapcache_inactive_heuristic; + _vm_page_and_queue_spin_unlock(m); } } } - lwkt_reltoken(&vm_token); } - /* * Move the specified page to the inactive queue. If the page has * any associated swap, the swap is deallocated. @@ -1320,35 +1807,32 @@ vm_page_unwire(vm_page_t m, int activate) * to 1 if we want this page to be 'as if it were placed in the cache', * except without unmapping it from the process address space. * + * vm_page's spinlock must be held on entry and will remain held on return. * This routine may not block. - * The caller must hold vm_token. 
*/ -static __inline void -_vm_page_deactivate(vm_page_t m, int athead) +static void +_vm_page_deactivate_locked(vm_page_t m, int athead) { + u_short oqueue; + /* * Ignore if already inactive. */ if (m->queue == PQ_INACTIVE) return; + _vm_page_queue_spin_lock(m); + oqueue = _vm_page_rem_queue_spinlocked(m); if (m->wire_count == 0 && (m->flags & PG_UNMANAGED) == 0) { - if ((m->queue - m->pc) == PQ_CACHE) + if (oqueue == PQ_CACHE) mycpu->gd_cnt.v_reactivated++; vm_page_flag_clear(m, PG_WINATCFLS); - vm_page_unqueue(m); - if (athead) { - TAILQ_INSERT_HEAD(&vm_page_queues[PQ_INACTIVE].pl, - m, pageq); - } else { - TAILQ_INSERT_TAIL(&vm_page_queues[PQ_INACTIVE].pl, - m, pageq); + _vm_page_add_queue_spinlocked(m, PQ_INACTIVE, athead); + if (athead == 0) ++vm_swapcache_inactive_heuristic; - } - m->queue = PQ_INACTIVE; - vm_page_queues[PQ_INACTIVE].lcnt++; - vmstats.v_inactive_count++; } + _vm_page_queue_spin_unlock(m); + /* leaves vm_page spinlocked */ } /* @@ -1359,35 +1843,55 @@ _vm_page_deactivate(vm_page_t m, int athead) void vm_page_deactivate(vm_page_t m) { - lwkt_gettoken(&vm_token); - _vm_page_deactivate(m, 0); - lwkt_reltoken(&vm_token); + vm_page_spin_lock(m); + _vm_page_deactivate_locked(m, 0); + vm_page_spin_unlock(m); +} + +void +vm_page_deactivate_locked(vm_page_t m) +{ + _vm_page_deactivate_locked(m, 0); } /* * Attempt to move a page to PQ_CACHE. + * * Returns 0 on failure, 1 on success * - * No requirements. + * The page should NOT be busied by the caller. This function will validate + * whether the page can be safely moved to the cache. */ int vm_page_try_to_cache(vm_page_t m) { - lwkt_gettoken(&vm_token); - if (m->dirty || m->hold_count || m->busy || m->wire_count || - (m->flags & (PG_BUSY|PG_UNMANAGED))) { - lwkt_reltoken(&vm_token); + vm_page_spin_lock(m); + if (vm_page_busy_try(m, TRUE)) { + vm_page_spin_unlock(m); + return(0); + } + if (m->dirty || m->hold_count || m->wire_count || + (m->flags & PG_UNMANAGED)) { + if (_vm_page_wakeup(m)) { + vm_page_spin_unlock(m); + wakeup(m); + } else { + vm_page_spin_unlock(m); + } return(0); } - vm_page_busy(m); + vm_page_spin_unlock(m); + + /* + * Page busied by us and no longer spinlocked. Dirty pages cannot + * be moved to the cache. + */ vm_page_test_dirty(m); if (m->dirty) { vm_page_wakeup(m); - lwkt_reltoken(&vm_token); return(0); } vm_page_cache(m); - lwkt_reltoken(&vm_token); return(1); } @@ -1400,21 +1904,39 @@ vm_page_try_to_cache(vm_page_t m) int vm_page_try_to_free(vm_page_t m) { - lwkt_gettoken(&vm_token); - if (m->dirty || m->hold_count || m->busy || m->wire_count || - (m->flags & (PG_BUSY|PG_UNMANAGED))) { - lwkt_reltoken(&vm_token); + vm_page_spin_lock(m); + if (vm_page_busy_try(m, TRUE)) { + vm_page_spin_unlock(m); + return(0); + } + if (m->dirty || m->hold_count || m->wire_count || + (m->flags & PG_UNMANAGED)) { + if (_vm_page_wakeup(m)) { + vm_page_spin_unlock(m); + wakeup(m); + } else { + vm_page_spin_unlock(m); + } return(0); } + vm_page_spin_unlock(m); + + /* + * Page busied by us and no longer spinlocked. Dirty pages will + * not be freed by this function. We have to re-test the + * dirty bit after cleaning out the pmaps. + */ vm_page_test_dirty(m); if (m->dirty) { - lwkt_reltoken(&vm_token); + vm_page_wakeup(m); return(0); } - vm_page_busy(m); vm_page_protect(m, VM_PROT_NONE); + if (m->dirty) { + vm_page_wakeup(m); + return(0); + } vm_page_free(m); - lwkt_reltoken(&vm_token); return(1); } @@ -1423,16 +1945,12 @@ vm_page_try_to_free(vm_page_t m) * * Put the specified page onto the page cache queue (if appropriate). 
* - * The caller must hold vm_token. - * This routine may not block. * The page must be busy, and this routine will release the busy and * possibly even free the page. */ void vm_page_cache(vm_page_t m) { - ASSERT_LWKT_TOKEN_HELD(&vm_token); - if ((m->flags & PG_UNMANAGED) || m->busy || m->wire_count || m->hold_count) { kprintf("vm_page_cache: attempting to cache busy/held page\n"); @@ -1473,12 +1991,16 @@ vm_page_cache(vm_page_t m) vm_page_deactivate(m); vm_page_wakeup(m); } else { - vm_page_unqueue_nowakeup(m); - m->queue = PQ_CACHE + m->pc; - vm_page_queues[m->queue].lcnt++; - TAILQ_INSERT_TAIL(&vm_page_queues[m->queue].pl, m, pageq); - vmstats.v_cache_count++; - vm_page_wakeup(m); + _vm_page_and_queue_spin_lock(m); + _vm_page_rem_queue_spinlocked(m); + _vm_page_add_queue_spinlocked(m, PQ_CACHE + m->pc, 0); + _vm_page_queue_spin_unlock(m); + if (_vm_page_wakeup(m)) { + vm_page_spin_unlock(m); + wakeup(m); + } else { + vm_page_spin_unlock(m); + } vm_page_free_wakeup(); } } @@ -1504,7 +2026,7 @@ vm_page_cache(vm_page_t m) * space from active. The idea is to not force this to happen too * often. * - * No requirements. + * The page must be busied. */ void vm_page_dontneed(vm_page_t m) @@ -1518,14 +2040,12 @@ vm_page_dontneed(vm_page_t m) /* * occassionally leave the page alone */ - lwkt_gettoken(&vm_token); if ((dnw & 0x01F0) == 0 || m->queue == PQ_INACTIVE || m->queue - m->pc == PQ_CACHE ) { if (m->act_count >= ACT_INIT) --m->act_count; - lwkt_reltoken(&vm_token); return; } @@ -1554,14 +2074,44 @@ vm_page_dontneed(vm_page_t m) */ head = 1; } - _vm_page_deactivate(m, head); - lwkt_reltoken(&vm_token); + vm_page_spin_lock(m); + _vm_page_deactivate_locked(m, head); + vm_page_spin_unlock(m); +} + +/* + * These routines manipulate the 'soft busy' count for a page. A soft busy + * is almost like PG_BUSY except that it allows certain compatible operations + * to occur on the page while it is busy. For example, a page undergoing a + * write can still be mapped read-only. + * + * Because vm_pages can overlap buffers m->busy can be > 1. m->busy is only + * adjusted while the vm_page is PG_BUSY so the flash will occur when the + * busy bit is cleared. + */ +void +vm_page_io_start(vm_page_t m) +{ + KASSERT(m->flags & PG_BUSY, ("vm_page_io_start: page not busy!!!")); + atomic_add_char(&m->busy, 1); + vm_page_flag_set(m, PG_SBUSY); +} + +void +vm_page_io_finish(vm_page_t m) +{ + KASSERT(m->flags & PG_BUSY, ("vm_page_io_finish: page not busy!!!")); + atomic_subtract_char(&m->busy, 1); + if (m->busy == 0) + vm_page_flag_clear(m, PG_SBUSY); } /* * Grab a page, blocking if it is busy and allocating a page if necessary. * A busy page is returned or NULL. * + * The page is not removed from its queues. XXX? + * * If VM_ALLOC_RETRY is specified VM_ALLOC_NORMAL must also be specified. 
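For callers that simply need a busied page at (object, pindex), vm_page_grab() below now layers on the lookup_busy_try/sleep_busy primitives and vm_page_alloc(). A usage sketch under the VM_ALLOC_RETRY contract (example_ name invented):

	/* Sketch only; assumes <vm/vm.h>, <vm/vm_object.h> and <vm/vm_page.h>. */
	static void
	example_grab_page(vm_object_t object, vm_pindex_t pindex)
	{
		vm_page_t m;

		/* VM_ALLOC_RETRY loops until an existing or new page is PG_BUSY'd */
		m = vm_page_grab(object, pindex, VM_ALLOC_NORMAL | VM_ALLOC_RETRY);
		if (m) {
			/* ... read or initialize the page contents ... */
			vm_page_wakeup(m);
		}
	}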
* If VM_ALLOC_RETRY is not specified * @@ -1581,42 +2131,33 @@ vm_page_t vm_page_grab(vm_object_t object, vm_pindex_t pindex, int allocflags) { vm_page_t m; - int generation; + int error; KKASSERT(allocflags & (VM_ALLOC_NORMAL|VM_ALLOC_INTERRUPT|VM_ALLOC_SYSTEM)); - lwkt_gettoken(&vm_token); vm_object_hold(object); -retrylookup: - if ((m = vm_page_lookup(object, pindex)) != NULL) { - if (m->busy || (m->flags & PG_BUSY)) { - generation = object->generation; - - while ((object->generation == generation) && - (m->busy || (m->flags & PG_BUSY))) { - vm_page_flag_set(m, PG_WANTED | PG_REFERENCED); - tsleep(m, 0, "pgrbwt", 0); - if ((allocflags & VM_ALLOC_RETRY) == 0) { - m = NULL; - goto done; - } + for (;;) { + m = vm_page_lookup_busy_try(object, pindex, TRUE, &error); + if (error) { + vm_page_sleep_busy(m, TRUE, "pgrbwt"); + if ((allocflags & VM_ALLOC_RETRY) == 0) { + m = NULL; + break; } - goto retrylookup; + } else if (m == NULL) { + m = vm_page_alloc(object, pindex, + allocflags & ~VM_ALLOC_RETRY); + if (m) + break; + vm_wait(0); + if ((allocflags & VM_ALLOC_RETRY) == 0) + break; } else { - vm_page_busy(m); - goto done; + /* m found */ + break; } } - m = vm_page_alloc(object, pindex, allocflags & ~VM_ALLOC_RETRY); - if (m == NULL) { - vm_wait(0); - if ((allocflags & VM_ALLOC_RETRY) == 0) - goto done; - goto retrylookup; - } -done: vm_object_drop(object); - lwkt_reltoken(&vm_token); return(m); } @@ -1912,8 +2453,7 @@ vm_page_is_valid(vm_page_t m, int base, int size) /* * update dirty bits from pmap/mmu. May not block. * - * Caller must hold vm_token if non-blocking operation desired. - * No other requirements. + * Caller must hold the page busy */ void vm_page_test_dirty(vm_page_t m) diff --git a/sys/vm/vm_page.h b/sys/vm/vm_page.h index 479d4ec4ce..1a726dc463 100644 --- a/sys/vm/vm_page.h +++ b/sys/vm/vm_page.h @@ -187,6 +187,14 @@ struct vm_page { #endif }; +#ifdef VM_PAGE_DEBUG +#define VM_PAGE_DEBUG_EXT(name) name ## _debug +#define VM_PAGE_DEBUG_ARGS , const char *func, int lineno +#else +#define VM_PAGE_DEBUG_EXT(name) name +#define VM_PAGE_DEBUG_ARGS +#endif + #ifndef __VM_PAGE_T_DEFINED__ #define __VM_PAGE_T_DEFINED__ typedef struct vm_page *vm_page_t; @@ -285,6 +293,9 @@ struct vpgqueues { int *cnt; int lcnt; int flipflop; /* probably not the best place */ + struct spinlock spin; + char unused[64 - sizeof(struct pglist) - + sizeof(int *) - sizeof(int) * 2]; }; extern struct vpgqueues vm_page_queues[PQ_COUNT]; @@ -312,6 +323,9 @@ extern struct vpgqueues vm_page_queues[PQ_COUNT]; * * PG_SWAPPED indicates that the page is backed by a swap block. Any * VM object type other than OBJT_DEFAULT can have swap-backed pages now. + * + * PG_SBUSY is set when m->busy != 0. PG_SBUSY and m->busy are only modified + * when the page is PG_BUSY. */ #define PG_BUSY 0x00000001 /* page is in transit (O) */ #define PG_WANTED 0x00000002 /* someone is waiting for page (O) */ @@ -330,7 +344,7 @@ extern struct vpgqueues vm_page_queues[PQ_COUNT]; #define PG_SWAPPED 0x00004000 /* backed by swap */ #define PG_NOTMETA 0x00008000 /* do not back with swap */ #define PG_ACTIONLIST 0x00010000 /* lookaside action list present */ - /* u_short, only 16 flag bits */ +#define PG_SBUSY 0x00020000 /* soft-busy also set */ /* * Misc constants. 
@@ -396,85 +410,23 @@ vm_page_flag_clear(vm_page_t m, unsigned int bits) atomic_clear_int(&(m)->flags, bits); } -#ifdef VM_PAGE_DEBUG - -static __inline void -_vm_page_busy(vm_page_t m, const char *func, int lineno) -{ - ASSERT_LWKT_TOKEN_HELD(&vm_token); - KASSERT((m->flags & PG_BUSY) == 0, - ("vm_page_busy: page already busy!!!")); - vm_page_flag_set(m, PG_BUSY); - m->busy_func = func; - m->busy_line = lineno; -} - -#define vm_page_busy(m) _vm_page_busy(m, __func__, __LINE__) - -#else - -static __inline void -vm_page_busy(vm_page_t m) -{ - ASSERT_LWKT_TOKEN_HELD(&vm_token); - KASSERT((m->flags & PG_BUSY) == 0, - ("vm_page_busy: page already busy!!!")); - vm_page_flag_set(m, PG_BUSY); -} - -#endif - /* - * vm_page_flash: - * - * wakeup anyone waiting for the page. + * Wakeup anyone waiting for the page after potentially unbusying + * (hard or soft) or doing other work on a page that might make a + * waiter ready. The setting of PG_WANTED is integrated into the + * related flags and it can't be set once the flags are already + * clear, so there should be no races here. */ static __inline void vm_page_flash(vm_page_t m) { - lwkt_gettoken(&vm_token); if (m->flags & PG_WANTED) { vm_page_flag_clear(m, PG_WANTED); wakeup(m); } - lwkt_reltoken(&vm_token); } -/* - * Clear the PG_BUSY flag and wakeup anyone waiting for the page. This - * is typically the last call you make on a page before moving onto - * other things. - */ -static __inline void -vm_page_wakeup(vm_page_t m) -{ - KASSERT(m->flags & PG_BUSY, ("vm_page_wakeup: page not busy!!!")); - vm_page_flag_clear(m, PG_BUSY); - vm_page_flash(m); -} - -/* - * These routines manipulate the 'soft busy' count for a page. A soft busy - * is almost like PG_BUSY except that it allows certain compatible operations - * to occur on the page while it is busy. For example, a page undergoing a - * write can still be mapped read-only. 
- */ -static __inline void -vm_page_io_start(vm_page_t m) -{ - atomic_add_char(&(m)->busy, 1); -} - -static __inline void -vm_page_io_finish(vm_page_t m) -{ - atomic_subtract_char(&m->busy, 1); - if (m->busy == 0) - vm_page_flash(m); -} - - #if PAGE_SIZE == 4096 #define VM_PAGE_BITS_ALL 0xff #endif @@ -494,6 +446,17 @@ vm_page_io_finish(vm_page_t m) #define VM_ALLOC_QUICK 0x10 /* like NORMAL but do not use cache */ #define VM_ALLOC_RETRY 0x80 /* indefinite block (vm_page_grab()) */ +void vm_page_queue_spin_lock(vm_page_t); +void vm_page_queues_spin_lock(u_short); +void vm_page_and_queue_spin_lock(vm_page_t); + +void vm_page_queue_spin_unlock(vm_page_t); +void vm_page_queues_spin_unlock(u_short); +void vm_page_and_queue_spin_unlock(vm_page_t m); + +void vm_page_io_finish(vm_page_t m); +void vm_page_io_start(vm_page_t m); +void vm_page_wakeup(vm_page_t m); void vm_page_hold(vm_page_t); void vm_page_unhold(vm_page_t); void vm_page_activate (vm_page_t); @@ -504,8 +467,12 @@ int vm_page_try_to_cache (vm_page_t); int vm_page_try_to_free (vm_page_t); void vm_page_dontneed (vm_page_t); void vm_page_deactivate (vm_page_t); +void vm_page_deactivate_locked (vm_page_t); void vm_page_insert (vm_page_t, struct vm_object *, vm_pindex_t); vm_page_t vm_page_lookup (struct vm_object *, vm_pindex_t); +vm_page_t VM_PAGE_DEBUG_EXT(vm_page_lookup_busy_wait)(struct vm_object *, vm_pindex_t, + int, const char * VM_PAGE_DEBUG_ARGS); +vm_page_t VM_PAGE_DEBUG_EXT(vm_page_lookup_busy_try)(struct vm_object *, vm_pindex_t, int, int * VM_PAGE_DEBUG_ARGS); void vm_page_remove (vm_page_t); void vm_page_rename (vm_page_t, struct vm_object *, vm_pindex_t); void vm_page_startup (void); @@ -514,6 +481,7 @@ void vm_page_unwire (vm_page_t, int); void vm_page_wire (vm_page_t); void vm_page_unqueue (vm_page_t); void vm_page_unqueue_nowakeup (vm_page_t); +vm_page_t vm_page_next (vm_page_t); void vm_page_set_validclean (vm_page_t, int, int); void vm_page_set_validdirty (vm_page_t, int, int); void vm_page_set_valid (vm_page_t, int, int); @@ -531,6 +499,27 @@ void vm_page_event_internal(vm_page_t, vm_page_event_t); void vm_page_dirty(vm_page_t m); void vm_page_register_action(vm_page_action_t action, vm_page_event_t event); void vm_page_unregister_action(vm_page_action_t action); +void vm_page_sleep_busy(vm_page_t m, int also_m_busy, const char *msg); +void VM_PAGE_DEBUG_EXT(vm_page_busy_wait)(vm_page_t m, int also_m_busy, const char *wmsg VM_PAGE_DEBUG_ARGS); +int VM_PAGE_DEBUG_EXT(vm_page_busy_try)(vm_page_t m, int also_m_busy VM_PAGE_DEBUG_ARGS); + +#ifdef VM_PAGE_DEBUG + +#define vm_page_lookup_busy_wait(object, pindex, alsob, msg) \ + vm_page_lookup_busy_wait_debug(object, pindex, alsob, msg, \ + __func__, __LINE__) + +#define vm_page_lookup_busy_try(object, pindex, alsob, errorp) \ + vm_page_lookup_busy_try_debug(object, pindex, alsob, errorp, \ + __func__, __LINE__) + +#define vm_page_busy_wait(m, alsob, msg) \ + vm_page_busy_wait_debug(m, alsob, msg, __func__, __LINE__) + +#define vm_page_busy_try(m, alsob) \ + vm_page_busy_try_debug(m, alsob, __func__, __LINE__) + +#endif /* * Reduce the protection of a page. This routine never raises the @@ -551,15 +540,16 @@ void vm_page_unregister_action(vm_page_action_t action); * might have changed, however. 
*/ static __inline void -vm_page_protect(vm_page_t mem, int prot) +vm_page_protect(vm_page_t m, int prot) { + KKASSERT(m->flags & PG_BUSY); if (prot == VM_PROT_NONE) { - if (mem->flags & (PG_WRITEABLE|PG_MAPPED)) { - pmap_page_protect(mem, VM_PROT_NONE); + if (m->flags & (PG_WRITEABLE|PG_MAPPED)) { + pmap_page_protect(m, VM_PROT_NONE); /* PG_WRITEABLE & PG_MAPPED cleared by call */ } - } else if ((prot == VM_PROT_READ) && (mem->flags & PG_WRITEABLE)) { - pmap_page_protect(mem, VM_PROT_READ); + } else if ((prot == VM_PROT_READ) && (m->flags & PG_WRITEABLE)) { + pmap_page_protect(m, VM_PROT_READ); /* PG_WRITEABLE cleared by call */ } } @@ -624,39 +614,6 @@ vm_page_free_zero(vm_page_t m) vm_page_free_toq(m); } -/* - * Wait until page is no longer PG_BUSY or (if also_m_busy is TRUE) - * m->busy is zero. Returns TRUE if it had to sleep ( including if - * it almost had to sleep and made temporary spl*() mods), FALSE - * otherwise. - * - * This routine assumes that interrupts can only remove the busy - * status from a page, not set the busy status or change it from - * PG_BUSY to m->busy or vise versa (which would create a timing - * window). - * - * Note: as an inline, 'also_m_busy' is usually a constant and well - * optimized. - */ -static __inline int -vm_page_sleep_busy(vm_page_t m, int also_m_busy, const char *msg) -{ - if ((m->flags & PG_BUSY) || (also_m_busy && m->busy)) { - lwkt_gettoken(&vm_token); - if ((m->flags & PG_BUSY) || (also_m_busy && m->busy)) { - /* - * Page is busy. Wait and retry. - */ - vm_page_flag_set(m, PG_WANTED | PG_REFERENCED); - tsleep(m, 0, msg, 0); - } - lwkt_reltoken(&vm_token); - return(TRUE); - /* not reached */ - } - return(FALSE); -} - /* * Set page to not be dirty. Note: does not clear pmap modify bits . */ diff --git a/sys/vm/vm_page2.h b/sys/vm/vm_page2.h index 354192d5fb..c0c754b158 100644 --- a/sys/vm/vm_page2.h +++ b/sys/vm/vm_page2.h @@ -47,6 +47,12 @@ #ifndef _VM_PAGE_H_ #include #endif +#ifndef _SYS_SPINLOCK_H_ +#include +#endif +#ifndef _SYS_SPINLOCK2_H_ +#include +#endif #ifdef _KERNEL @@ -195,6 +201,55 @@ vm_page_clear_dirty_beg_nonincl(vm_page_t m, int base, int size) vm_page_clear_dirty(m, base, size - base); } +static __inline +void +vm_page_spin_lock(vm_page_t m) +{ + spin_pool_lock(m); +} + +static __inline +void +vm_page_spin_unlock(vm_page_t m) +{ + spin_pool_unlock(m); +} + +/* + * Wire a vm_page that is already wired. Does not require a busied + * page. + */ +static __inline +void +vm_page_wire_quick(vm_page_t m) +{ + if (atomic_fetchadd_int(&m->wire_count, 1) == 0) + panic("vm_page_wire_quick: wire_count was 0"); +} + +/* + * Unwire a vm_page quickly, does not require a busied page. + * + * This routine refuses to drop the wire_count to 0 and will return + * TRUE if it would have had to (instead of decrementing it to 0). + * The caller can then busy the page and deal with it. + */ +static __inline +int +vm_page_unwire_quick(vm_page_t m) +{ + KKASSERT(m->wire_count > 0); + for (;;) { + u_int wire_count = m->wire_count; + + cpu_ccfence(); + if (wire_count == 1) + return TRUE; + if (atomic_cmpset_int(&m->wire_count, wire_count, wire_count - 1)) + return FALSE; + } +} + #endif /* _KERNEL */ #endif /* _VM_VM_PAGE2_H_ */ diff --git a/sys/vm/vm_pageout.c b/sys/vm/vm_pageout.c index e5e1af2dc2..855084a298 100644 --- a/sys/vm/vm_pageout.c +++ b/sys/vm/vm_pageout.c @@ -94,6 +94,7 @@ #include #include +#include #include /* @@ -235,8 +236,6 @@ vm_fault_ratecheck(void) * We set the busy bit to cause potential page faults on this page to * block. 
Note the careful timing, however, the busy bit isn't set till * late and we cannot do anything that will mess with the page. - * - * The caller must hold vm_token. */ static int vm_pageout_clean(vm_page_t m) @@ -244,6 +243,7 @@ vm_pageout_clean(vm_page_t m) vm_object_t object; vm_page_t mc[2*vm_pageout_page_count]; int pageout_count; + int error; int ib, is, page_base; vm_pindex_t pindex = m->pindex; @@ -260,9 +260,13 @@ vm_pageout_clean(vm_page_t m) /* * Don't mess with the page if it's busy, held, or special + * + * XXX do we really need to check hold_count here? hold_count + * isn't supposed to mess with vm_page ops except prevent the + * page from being reused. */ - if ((m->hold_count != 0) || - ((m->busy != 0) || (m->flags & (PG_BUSY|PG_UNMANAGED)))) { + if (m->hold_count != 0 || (m->flags & PG_UNMANAGED)) { + vm_page_wakeup(m); return 0; } @@ -302,12 +306,14 @@ more: break; } - if ((p = vm_page_lookup(object, pindex - ib)) == NULL) { + p = vm_page_lookup_busy_try(object, pindex - ib, TRUE, &error); + if (error || p == NULL) { ib = 0; break; } - if (((p->queue - p->pc) == PQ_CACHE) || - (p->flags & (PG_BUSY|PG_UNMANAGED)) || p->busy) { + if ((p->queue - p->pc) == PQ_CACHE || + (p->flags & PG_UNMANAGED)) { + vm_page_wakeup(p); ib = 0; break; } @@ -316,6 +322,7 @@ more: p->queue != PQ_INACTIVE || p->wire_count != 0 || /* may be held by buf cache */ p->hold_count != 0) { /* may be undergoing I/O */ + vm_page_wakeup(p); ib = 0; break; } @@ -334,10 +341,12 @@ more: pindex + is < object->size) { vm_page_t p; - if ((p = vm_page_lookup(object, pindex + is)) == NULL) + p = vm_page_lookup_busy_try(object, pindex + is, TRUE, &error); + if (error || p == NULL) break; if (((p->queue - p->pc) == PQ_CACHE) || (p->flags & (PG_BUSY|PG_UNMANAGED)) || p->busy) { + vm_page_wakeup(p); break; } vm_page_test_dirty(p); @@ -345,6 +354,7 @@ more: p->queue != PQ_INACTIVE || p->wire_count != 0 || /* may be held by buf cache */ p->hold_count != 0) { /* may be undergoing I/O */ + vm_page_wakeup(p); break; } mc[page_base + pageout_count] = p; @@ -377,7 +387,8 @@ more: * the parent to do more sophisticated things we may have to change * the ordering. * - * The caller must hold vm_token. + * The pages in the array must be busied by the caller and will be + * unbusied by this function. */ int vm_pageout_flush(vm_page_t *mc, int count, int flags) @@ -387,13 +398,13 @@ vm_pageout_flush(vm_page_t *mc, int count, int flags) int numpagedout = 0; int i; - ASSERT_LWKT_TOKEN_HELD(&vm_token); - /* * Initiate I/O. Bump the vm_page_t->busy counter. */ for (i = 0; i < count; i++) { - KASSERT(mc[i]->valid == VM_PAGE_BITS_ALL, ("vm_pageout_flush page %p index %d/%d: partially invalid page", mc[i], i, count)); + KASSERT(mc[i]->valid == VM_PAGE_BITS_ALL, + ("vm_pageout_flush page %p index %d/%d: partially " + "invalid page", mc[i], i, count)); vm_page_io_start(mc[i]); } @@ -403,9 +414,13 @@ vm_pageout_flush(vm_page_t *mc, int count, int flags) * cannot clear the bit for us since the I/O completion code * typically runs from an interrupt. The act of making the page * read-only handles the case for us. + * + * Then we can unbusy the pages, we still hold a reference by virtue + * of our soft-busy. */ for (i = 0; i < count; i++) { vm_page_protect(mc[i], VM_PROT_READ); + vm_page_wakeup(mc[i]); } object = mc[0]->object; @@ -431,8 +446,10 @@ vm_pageout_flush(vm_page_t *mc, int count, int flags) * essentially lose the changes by pretending it * worked. 
*/ + vm_page_busy_wait(mt, FALSE, "pgbad"); pmap_clear_modify(mt); vm_page_undirty(mt); + vm_page_wakeup(mt); break; case VM_PAGER_ERROR: case VM_PAGER_FAIL: @@ -463,6 +480,7 @@ vm_pageout_flush(vm_page_t *mc, int count, int flags) * might still be read-heavy. */ if (pageout_status[i] != VM_PAGER_PEND) { + vm_page_busy_wait(mt, FALSE, "pgouw"); if (vm_page_count_severe()) vm_page_deactivate(mt); #if 0 @@ -470,6 +488,7 @@ vm_pageout_flush(vm_page_t *mc, int count, int flags) vm_page_protect(mt, VM_PROT_READ); #endif vm_page_io_finish(mt); + vm_page_wakeup(mt); vm_object_pip_wakeup(object); } } @@ -478,15 +497,12 @@ vm_pageout_flush(vm_page_t *mc, int count, int flags) #if !defined(NO_SWAPPING) /* - * vm_pageout_object_deactivate_pages - * - * deactivate enough pages to satisfy the inactive target - * requirements or if vm_page_proc_limit is set, then - * deactivate all of the pages in the object and its - * backing_objects. + * deactivate enough pages to satisfy the inactive target + * requirements or if vm_page_proc_limit is set, then + * deactivate all of the pages in the object and its + * backing_objects. * * The map must be locked. - * The caller must hold vm_token. * The caller must hold the vm_object. */ static int vm_pageout_object_deactivate_pages_callback(vm_page_t, void *); @@ -496,26 +512,22 @@ vm_pageout_object_deactivate_pages(vm_map_t map, vm_object_t object, vm_pindex_t desired, int map_remove_only) { struct rb_vm_page_scan_info info; - vm_object_t tmp; + vm_object_t lobject; + vm_object_t tobject; int remove_mode; - while (object) { - if (pmap_resident_count(vm_map_pmap(map)) <= desired) - return; - - vm_object_hold(object); + lobject = object; - if (object->type == OBJT_DEVICE || object->type == OBJT_PHYS) { - vm_object_drop(object); - return; - } - if (object->paging_in_progress) { - vm_object_drop(object); - return; - } + while (lobject) { + if (pmap_resident_count(vm_map_pmap(map)) <= desired) + break; + if (lobject->type == OBJT_DEVICE || lobject->type == OBJT_PHYS) + break; + if (lobject->paging_in_progress) + break; remove_mode = map_remove_only; - if (object->shadow_count > 1) + if (lobject->shadow_count > 1) remove_mode = 1; /* @@ -525,18 +537,26 @@ vm_pageout_object_deactivate_pages(vm_map_t map, vm_object_t object, info.limit = remove_mode; info.map = map; info.desired = desired; - vm_page_rb_tree_RB_SCAN(&object->rb_memq, NULL, + vm_page_rb_tree_RB_SCAN(&lobject->rb_memq, NULL, vm_pageout_object_deactivate_pages_callback, &info ); - tmp = object->backing_object; - vm_object_drop(object); - object = tmp; + while ((tobject = lobject->backing_object) != NULL) { + KKASSERT(tobject != object); + vm_object_hold(tobject); + if (tobject == lobject->backing_object) + break; + vm_object_drop(tobject); + } + if (lobject != object) + vm_object_drop(lobject); + lobject = tobject; } + if (lobject != object) + vm_object_drop(lobject); } /* - * The caller must hold vm_token. * The caller must hold the vm_object. 
*/ static int @@ -549,9 +569,15 @@ vm_pageout_object_deactivate_pages_callback(vm_page_t p, void *data) return(-1); } mycpu->gd_cnt.v_pdpages++; - if (p->wire_count != 0 || p->hold_count != 0 || p->busy != 0 || - (p->flags & (PG_BUSY|PG_UNMANAGED)) || - !pmap_page_exists_quick(vm_map_pmap(info->map), p)) { + + if (vm_page_busy_try(p, TRUE)) + return(0); + if (p->wire_count || p->hold_count || (p->flags & PG_UNMANAGED)) { + vm_page_wakeup(p); + return(0); + } + if (!pmap_page_exists_quick(vm_map_pmap(info->map), p)) { + vm_page_wakeup(p); return(0); } @@ -562,44 +588,52 @@ vm_pageout_object_deactivate_pages_callback(vm_page_t p, void *data) actcount = 1; } - if ((p->queue != PQ_ACTIVE) && - (p->flags & PG_REFERENCED)) { + vm_page_and_queue_spin_lock(p); + if (p->queue != PQ_ACTIVE && (p->flags & PG_REFERENCED)) { + vm_page_and_queue_spin_unlock(p); vm_page_activate(p); p->act_count += actcount; vm_page_flag_clear(p, PG_REFERENCED); } else if (p->queue == PQ_ACTIVE) { if ((p->flags & PG_REFERENCED) == 0) { p->act_count -= min(p->act_count, ACT_DECLINE); - if (!info->limit && (vm_pageout_algorithm || (p->act_count == 0))) { - vm_page_busy(p); + if (!info->limit && + (vm_pageout_algorithm || (p->act_count == 0))) { + vm_page_and_queue_spin_unlock(p); vm_page_protect(p, VM_PROT_NONE); vm_page_deactivate(p); - vm_page_wakeup(p); } else { TAILQ_REMOVE(&vm_page_queues[PQ_ACTIVE].pl, p, pageq); TAILQ_INSERT_TAIL(&vm_page_queues[PQ_ACTIVE].pl, p, pageq); + vm_page_and_queue_spin_unlock(p); } } else { + vm_page_and_queue_spin_unlock(p); vm_page_activate(p); vm_page_flag_clear(p, PG_REFERENCED); - if (p->act_count < (ACT_MAX - ACT_ADVANCE)) - p->act_count += ACT_ADVANCE; - TAILQ_REMOVE(&vm_page_queues[PQ_ACTIVE].pl, p, pageq); - TAILQ_INSERT_TAIL(&vm_page_queues[PQ_ACTIVE].pl, p, pageq); + + vm_page_and_queue_spin_lock(p); + if (p->queue == PQ_ACTIVE) { + if (p->act_count < (ACT_MAX - ACT_ADVANCE)) + p->act_count += ACT_ADVANCE; + TAILQ_REMOVE(&vm_page_queues[PQ_ACTIVE].pl, p, pageq); + TAILQ_INSERT_TAIL(&vm_page_queues[PQ_ACTIVE].pl, p, pageq); + } + vm_page_and_queue_spin_unlock(p); } } else if (p->queue == PQ_INACTIVE) { - vm_page_busy(p); + vm_page_and_queue_spin_unlock(p); vm_page_protect(p, VM_PROT_NONE); - vm_page_wakeup(p); + } else { + vm_page_and_queue_spin_unlock(p); } + vm_page_wakeup(p); return(0); } /* * Deactivate some number of pages in a map, try to do it fairly, but * that is really hard to do. - * - * The caller must hold vm_token. */ static void vm_pageout_map_deactivate_pages(vm_map_t map, vm_pindex_t desired) @@ -683,13 +717,10 @@ vm_pageout_map_deactivate_pages(vm_map_t map, vm_pindex_t desired) * to optimize shadow chain collapses but I don't quite see why it would * be necessary. An OBJ_DEAD object should terminate any and all vm_pages * synchronously and not have to be kicked-start. - * - * The caller must hold vm_token. */ static void vm_pageout_page_free(vm_page_t m) { - vm_page_busy(m); vm_page_protect(m, VM_PROT_NONE); vm_page_free(m); } @@ -704,14 +735,11 @@ struct vm_pageout_scan_info { static int vm_pageout_scan_callback(struct proc *p, void *data); -/* - * The caller must hold vm_token. 
- */ static int vm_pageout_scan(int pass) { struct vm_pageout_scan_info info; - vm_page_t m, next; + vm_page_t m; struct vm_page marker; struct vnode *vpfailed; /* warning, allowed to be stale */ int maxscan, pcount; @@ -739,14 +767,6 @@ vm_pageout_scan(int pass) inactive_original_shortage = inactive_shortage; vm_pageout_deficit = 0; - /* - * Initialize our marker - */ - bzero(&marker, sizeof(marker)); - marker.flags = PG_BUSY | PG_FICTITIOUS | PG_MARKER; - marker.queue = PQ_INACTIVE; - marker.wire_count = 1; - /* * Start scanning the inactive queue for pages we can move to the * cache or free. The scan will stop when the target is reached or @@ -769,52 +789,82 @@ vm_pageout_scan(int pass) maxlaunder = 10000; /* - * We will generally be in a critical section throughout the - * scan, but we can release it temporarily when we are sitting on a - * non-busy page without fear. this is required to prevent an - * interrupt from unbusying or freeing a page prior to our busy - * check, leaving us on the wrong queue or checking the wrong - * page. + * Initialize our marker + */ + bzero(&marker, sizeof(marker)); + marker.flags = PG_BUSY | PG_FICTITIOUS | PG_MARKER; + marker.queue = PQ_INACTIVE; + marker.wire_count = 1; + + /* + * Inactive queue scan. + * + * NOTE: The vm_page must be spinlocked before the queue to avoid + * deadlocks, so it is easiest to simply iterate the loop + * with the queue unlocked at the top. */ -rescan0: vpfailed = NULL; + + vm_page_queues_spin_lock(PQ_INACTIVE); + TAILQ_INSERT_HEAD(&vm_page_queues[PQ_INACTIVE].pl, &marker, pageq); maxscan = vmstats.v_inactive_count; - for (m = TAILQ_FIRST(&vm_page_queues[PQ_INACTIVE].pl); - m != NULL && maxscan-- > 0 && inactive_shortage > 0; - m = next - ) { + vm_page_queues_spin_unlock(PQ_INACTIVE); + + while ((m = TAILQ_NEXT(&marker, pageq)) != NULL && + maxscan-- > 0 && inactive_shortage > 0) + { + vm_page_and_queue_spin_lock(m); + if (m != TAILQ_NEXT(&marker, pageq)) { + vm_page_and_queue_spin_unlock(m); + ++maxscan; + continue; + } + KKASSERT(m->queue == PQ_INACTIVE); + TAILQ_REMOVE(&vm_page_queues[PQ_INACTIVE].pl, + &marker, pageq); + TAILQ_INSERT_AFTER(&vm_page_queues[PQ_INACTIVE].pl, m, + &marker, pageq); mycpu->gd_cnt.v_pdpages++; /* - * It's easier for some of the conditions below to just loop - * and catch queue changes here rather then check everywhere - * else. + * Skip marker pages */ - if (m->queue != PQ_INACTIVE) - goto rescan0; - next = TAILQ_NEXT(m, pageq); + if (m->flags & PG_MARKER) { + vm_page_and_queue_spin_unlock(m); + continue; + } /* - * skip marker pages + * Try to busy the page. Don't mess with pages which are + * already busy or reorder them in the queue. */ - if (m->flags & PG_MARKER) + if (vm_page_busy_try(m, TRUE)) { + vm_page_and_queue_spin_unlock(m); continue; + } + vm_page_and_queue_spin_unlock(m); + KKASSERT(m->queue == PQ_INACTIVE); /* - * A held page may be undergoing I/O, so skip it. + * The page has been successfully busied and is now no + * longer spinlocked. The queue is no longer spinlocked + * either. */ - if (m->hold_count) { - TAILQ_REMOVE(&vm_page_queues[PQ_INACTIVE].pl, m, pageq); - TAILQ_INSERT_TAIL(&vm_page_queues[PQ_INACTIVE].pl, m, pageq); - ++vm_swapcache_inactive_heuristic; - continue; - } /* - * Dont mess with busy pages, keep in the front of the - * queue, most likely are being paged out. + * A held page may be undergoing I/O, so skip it. 
*/ - if (m->busy || (m->flags & PG_BUSY)) { + if (m->hold_count) { + vm_page_and_queue_spin_lock(m); + if (m->queue == PQ_INACTIVE) { + TAILQ_REMOVE(&vm_page_queues[PQ_INACTIVE].pl, + m, pageq); + TAILQ_INSERT_TAIL(&vm_page_queues[PQ_INACTIVE].pl, + m, pageq); + } + vm_page_and_queue_spin_unlock(m); + ++vm_swapcache_inactive_heuristic; + vm_page_wakeup(m); continue; } @@ -825,7 +875,7 @@ rescan0: */ vm_page_flag_clear(m, PG_REFERENCED); pmap_clear_reference(m); - + /* fall through to end */ } else if (((m->flags & PG_REFERENCED) == 0) && (actcount = pmap_ts_referenced(m))) { /* @@ -840,10 +890,13 @@ rescan0: */ vm_page_activate(m); m->act_count += (actcount + ACT_ADVANCE); + vm_page_wakeup(m); continue; } /* + * (m) is still busied. + * * If the upper level VM system knows about any page * references, we activate the page. We also set the * "activation count" higher than normal so that we will less @@ -854,6 +907,7 @@ rescan0: actcount = pmap_ts_referenced(m); vm_page_activate(m); m->act_count += (actcount + ACT_ADVANCE + 1); + vm_page_wakeup(m); continue; } @@ -888,7 +942,6 @@ rescan0: * Clean pages can be placed onto the cache queue. * This effectively frees them. */ - vm_page_busy(m); vm_page_cache(m); --inactive_shortage; } else if ((m->flags & PG_WINATCFLS) == 0 && pass == 0) { @@ -905,9 +958,14 @@ rescan0: * the thrash point for a heavily loaded machine. */ vm_page_flag_set(m, PG_WINATCFLS); - TAILQ_REMOVE(&vm_page_queues[PQ_INACTIVE].pl, m, pageq); - TAILQ_INSERT_TAIL(&vm_page_queues[PQ_INACTIVE].pl, m, pageq); + vm_page_and_queue_spin_lock(m); + if (m->queue == PQ_INACTIVE) { + TAILQ_REMOVE(&vm_page_queues[PQ_INACTIVE].pl, m, pageq); + TAILQ_INSERT_TAIL(&vm_page_queues[PQ_INACTIVE].pl, m, pageq); + } + vm_page_and_queue_spin_unlock(m); ++vm_swapcache_inactive_heuristic; + vm_page_wakeup(m); } else if (maxlaunder > 0) { /* * We always want to try to flush some dirty pages if @@ -935,13 +993,20 @@ rescan0: * Those objects are in a "rundown" state. */ if (!swap_pageouts_ok || (object->flags & OBJ_DEAD)) { - TAILQ_REMOVE(&vm_page_queues[PQ_INACTIVE].pl, m, pageq); - TAILQ_INSERT_TAIL(&vm_page_queues[PQ_INACTIVE].pl, m, pageq); + vm_page_and_queue_spin_lock(m); + if (m->queue == PQ_INACTIVE) { + TAILQ_REMOVE(&vm_page_queues[PQ_INACTIVE].pl, m, pageq); + TAILQ_INSERT_TAIL(&vm_page_queues[PQ_INACTIVE].pl, m, pageq); + } + vm_page_and_queue_spin_unlock(m); ++vm_swapcache_inactive_heuristic; + vm_page_wakeup(m); continue; } /* + * (m) is still busied. + * * The object is already known NOT to be dead. It * is possible for the vget() to block the whole * pageout daemon, but the new low-memory handling @@ -974,11 +1039,20 @@ rescan0: flags |= LK_NOWAIT; else flags |= LK_TIMELOCK; + vm_page_hold(m); + vm_page_wakeup(m); + + /* + * We have unbusied (m) temporarily so we can + * acquire the vp lock without deadlocking. + * (m) is held to prevent destruction. + */ if (vget(vp, flags) != 0) { vpfailed = vp; ++pageout_lock_miss; if (object->flags & OBJ_MIGHTBEDIRTY) vnodes_skipped++; + vm_page_unhold(m); continue; } @@ -995,6 +1069,7 @@ rescan0: if (object->flags & OBJ_MIGHTBEDIRTY) vnodes_skipped++; vput(vp); + vm_page_unhold(m); continue; } @@ -1004,52 +1079,64 @@ rescan0: * page back onto the end of the queue so that * statistics are more correct if we don't. 
*/ - if (m->busy || (m->flags & PG_BUSY)) { + if (vm_page_busy_try(m, TRUE)) { vput(vp); + vm_page_unhold(m); continue; } + vm_page_unhold(m); /* - * If the page has become held it might - * be undergoing I/O, so skip it + * (m) is busied again + * + * We own the busy bit and remove our hold + * bit. If the page is still held it + * might be undergoing I/O, so skip it. */ if (m->hold_count) { - TAILQ_REMOVE(&vm_page_queues[PQ_INACTIVE].pl, m, pageq); - TAILQ_INSERT_TAIL(&vm_page_queues[PQ_INACTIVE].pl, m, pageq); + vm_page_and_queue_spin_lock(m); + if (m->queue == PQ_INACTIVE) { + TAILQ_REMOVE(&vm_page_queues[PQ_INACTIVE].pl, m, pageq); + TAILQ_INSERT_TAIL(&vm_page_queues[PQ_INACTIVE].pl, m, pageq); + } + vm_page_and_queue_spin_unlock(m); ++vm_swapcache_inactive_heuristic; if (object->flags & OBJ_MIGHTBEDIRTY) vnodes_skipped++; + vm_page_wakeup(m); vput(vp); continue; } + /* (m) is left busied as we fall through */ } /* + * page is busy and not held here. + * * If a page is dirty, then it is either being washed * (but not yet cleaned) or it is still in the * laundry. If it is still in the laundry, then we * start the cleaning operation. * - * This operation may cluster, invalidating the 'next' - * pointer. To prevent an inordinate number of - * restarts we use our marker to remember our place. - * * decrement inactive_shortage on success to account * for the (future) cleaned page. Otherwise we * could wind up laundering or cleaning too many * pages. */ - TAILQ_INSERT_AFTER(&vm_page_queues[PQ_INACTIVE].pl, m, &marker, pageq); if (vm_pageout_clean(m) != 0) { --inactive_shortage; --maxlaunder; } - next = TAILQ_NEXT(&marker, pageq); - TAILQ_REMOVE(&vm_page_queues[PQ_INACTIVE].pl, &marker, pageq); + /* clean ate busy, page no longer accessible */ if (vp != NULL) vput(vp); + } else { + vm_page_wakeup(m); } } + vm_page_queues_spin_lock(PQ_INACTIVE); + TAILQ_REMOVE(&vm_page_queues[PQ_INACTIVE].pl, &marker, pageq); + vm_page_queues_spin_unlock(PQ_INACTIVE); /* * We want to move pages from the active queue to the inactive @@ -1079,32 +1166,66 @@ rescan0: active_shortage = inactive_original_shortage * 2; } - pcount = vmstats.v_active_count; recycle_count = 0; - m = TAILQ_FIRST(&vm_page_queues[PQ_ACTIVE].pl); + marker.queue = PQ_ACTIVE; + + vm_page_queues_spin_lock(PQ_ACTIVE); + TAILQ_INSERT_HEAD(&vm_page_queues[PQ_ACTIVE].pl, &marker, pageq); + vm_page_queues_spin_unlock(PQ_ACTIVE); + pcount = vmstats.v_active_count; + + while ((m = TAILQ_NEXT(&marker, pageq)) != NULL && + pcount-- > 0 && (inactive_shortage > 0 || active_shortage > 0)) + { + vm_page_and_queue_spin_lock(m); + if (m != TAILQ_NEXT(&marker, pageq)) { + vm_page_and_queue_spin_unlock(m); + ++pcount; + continue; + } + KKASSERT(m->queue == PQ_ACTIVE); + TAILQ_REMOVE(&vm_page_queues[PQ_ACTIVE].pl, + &marker, pageq); + TAILQ_INSERT_AFTER(&vm_page_queues[PQ_ACTIVE].pl, m, + &marker, pageq); - while ((m != NULL) && (pcount-- > 0) && - (inactive_shortage > 0 || active_shortage > 0) - ) { /* - * If the page was ripped out from under us, just stop. + * Skip marker pages */ - if (m->queue != PQ_ACTIVE) - break; - next = TAILQ_NEXT(m, pageq); + if (m->flags & PG_MARKER) { + vm_page_and_queue_spin_unlock(m); + continue; + } /* - * Don't deactivate pages that are busy. + * Try to busy the page. Don't mess with pages which are + * already busy or reorder them in the queue. 
*/ - if ((m->busy != 0) || - (m->flags & PG_BUSY) || - (m->hold_count != 0)) { - TAILQ_REMOVE(&vm_page_queues[PQ_ACTIVE].pl, m, pageq); - TAILQ_INSERT_TAIL(&vm_page_queues[PQ_ACTIVE].pl, m, pageq); - m = next; + if (vm_page_busy_try(m, TRUE)) { + vm_page_and_queue_spin_unlock(m); continue; } + /* + * Don't deactivate pages that are held, even if we can + * busy them. (XXX why not?) + */ + if (m->hold_count != 0) { + TAILQ_REMOVE(&vm_page_queues[PQ_ACTIVE].pl, + m, pageq); + TAILQ_INSERT_TAIL(&vm_page_queues[PQ_ACTIVE].pl, + m, pageq); + vm_page_and_queue_spin_unlock(m); + vm_page_wakeup(m); + continue; + } + vm_page_and_queue_spin_unlock(m); + + /* + * The page has been successfully busied and the page and + * queue are no longer locked. + */ + /* * The count for pagedaemon pages is done after checking the * page for eligibility... @@ -1133,8 +1254,15 @@ rescan0: * actcount is only valid if the object ref_count is non-zero. */ if (actcount && m->object->ref_count != 0) { - TAILQ_REMOVE(&vm_page_queues[PQ_ACTIVE].pl, m, pageq); - TAILQ_INSERT_TAIL(&vm_page_queues[PQ_ACTIVE].pl, m, pageq); + vm_page_and_queue_spin_lock(m); + if (m->queue == PQ_ACTIVE) { + TAILQ_REMOVE(&vm_page_queues[PQ_ACTIVE].pl, + m, pageq); + TAILQ_INSERT_TAIL(&vm_page_queues[PQ_ACTIVE].pl, + m, pageq); + } + vm_page_and_queue_spin_unlock(m); + vm_page_wakeup(m); } else { m->act_count -= min(m->act_count, ACT_DECLINE); if (vm_pageout_algorithm || @@ -1156,7 +1284,6 @@ rescan0: m->object->ref_count == 0) { if (inactive_shortage > 0) ++recycle_count; - vm_page_busy(m); vm_page_protect(m, VM_PROT_NONE); if (m->dirty == 0 && inactive_shortage > 0) { @@ -1168,15 +1295,31 @@ rescan0: } } else { vm_page_deactivate(m); + vm_page_wakeup(m); } } else { - TAILQ_REMOVE(&vm_page_queues[PQ_ACTIVE].pl, m, pageq); - TAILQ_INSERT_TAIL(&vm_page_queues[PQ_ACTIVE].pl, m, pageq); + vm_page_and_queue_spin_lock(m); + if (m->queue == PQ_ACTIVE) { + TAILQ_REMOVE( + &vm_page_queues[PQ_ACTIVE].pl, + m, pageq); + TAILQ_INSERT_TAIL( + &vm_page_queues[PQ_ACTIVE].pl, + m, pageq); + } + vm_page_and_queue_spin_unlock(m); + vm_page_wakeup(m); } } - m = next; } + /* + * Clean out our local marker. + */ + vm_page_queues_spin_lock(PQ_ACTIVE); + TAILQ_REMOVE(&vm_page_queues[PQ_ACTIVE].pl, &marker, pageq); + vm_page_queues_spin_unlock(PQ_ACTIVE); + /* * The number of actually free pages can drop down to v_free_reserved, * we try to build the free count back above v_free_min. Note that @@ -1209,25 +1352,39 @@ rescan0: while (vmstats.v_free_count < (vmstats.v_free_min + vmstats.v_free_target) / 2) { /* - * + * This steals some code from vm/vm_page.c */ static int cache_rover = 0; - m = vm_page_list_find(PQ_CACHE, cache_rover, FALSE); + + m = vm_page_list_find(PQ_CACHE, cache_rover & PQ_L2_MASK, FALSE); if (m == NULL) break; - if ((m->flags & (PG_BUSY|PG_UNMANAGED)) || - m->busy || - m->hold_count || - m->wire_count) { + /* page is returned removed from its queue and spinlocked */ + if (vm_page_busy_try(m, TRUE)) { + vm_page_deactivate_locked(m); + vm_page_spin_unlock(m); #ifdef INVARIANTS kprintf("Warning: busy page %p found in cache\n", m); #endif + continue; + } + vm_page_spin_unlock(m); + pagedaemon_wakeup(); + + /* + * Page has been successfully busied and it and its queue + * is no longer spinlocked. 
+ */ + if ((m->flags & PG_UNMANAGED) || + m->hold_count || + m->wire_count) { vm_page_deactivate(m); + vm_page_wakeup(m); continue; } KKASSERT((m->flags & PG_MAPPED) == 0); KKASSERT(m->dirty == 0); - cache_rover = (cache_rover + PQ_PRIME2) & PQ_L2_MASK; + cache_rover += PQ_PRIME2; vm_pageout_page_free(m); mycpu->gd_cnt.v_dfree++; } @@ -1310,7 +1467,7 @@ rescan0: } /* - * The caller must hold vm_token and proc_token. + * The caller must hold proc_token. */ static int vm_pageout_scan_callback(struct proc *p, void *data) @@ -1362,20 +1519,20 @@ vm_pageout_scan_callback(struct proc *p, void *data) * so that during long periods of time where there is no paging, * that some statistic accumulation still occurs. This code * helps the situation where paging just starts to occur. - * - * The caller must hold vm_token. */ static void vm_pageout_page_stats(void) { - vm_page_t m,next; - int pcount,tpcount; /* Number of pages to check */ static int fullintervalcount = 0; + struct vm_page marker; + vm_page_t m; + int pcount, tpcount; /* Number of pages to check */ int page_shortage; - page_shortage = - (vmstats.v_inactive_target + vmstats.v_cache_max + vmstats.v_free_min) - - (vmstats.v_free_count + vmstats.v_inactive_count + vmstats.v_cache_count); + page_shortage = (vmstats.v_inactive_target + vmstats.v_cache_max + + vmstats.v_free_min) - + (vmstats.v_free_count + vmstats.v_inactive_count + + vmstats.v_cache_count); if (page_shortage <= 0) return; @@ -1383,76 +1540,136 @@ vm_pageout_page_stats(void) pcount = vmstats.v_active_count; fullintervalcount += vm_pageout_stats_interval; if (fullintervalcount < vm_pageout_full_stats_interval) { - tpcount = (vm_pageout_stats_max * vmstats.v_active_count) / vmstats.v_page_count; + tpcount = (vm_pageout_stats_max * vmstats.v_active_count) / + vmstats.v_page_count; if (pcount > tpcount) pcount = tpcount; } else { fullintervalcount = 0; } - m = TAILQ_FIRST(&vm_page_queues[PQ_ACTIVE].pl); - while ((m != NULL) && (pcount-- > 0)) { + bzero(&marker, sizeof(marker)); + marker.flags = PG_BUSY | PG_FICTITIOUS | PG_MARKER; + marker.queue = PQ_ACTIVE; + marker.wire_count = 1; + + vm_page_queues_spin_lock(PQ_ACTIVE); + TAILQ_INSERT_HEAD(&vm_page_queues[PQ_ACTIVE].pl, &marker, pageq); + vm_page_queues_spin_unlock(PQ_ACTIVE); + + while ((m = TAILQ_NEXT(&marker, pageq)) != NULL && + pcount-- > 0) + { int actcount; - if (m->queue != PQ_ACTIVE) { - break; + vm_page_and_queue_spin_lock(m); + if (m != TAILQ_NEXT(&marker, pageq)) { + vm_page_and_queue_spin_unlock(m); + ++pcount; + continue; } + KKASSERT(m->queue == PQ_ACTIVE); + TAILQ_REMOVE(&vm_page_queues[PQ_ACTIVE].pl, &marker, pageq); + TAILQ_INSERT_AFTER(&vm_page_queues[PQ_ACTIVE].pl, m, + &marker, pageq); - next = TAILQ_NEXT(m, pageq); /* - * Don't deactivate pages that are busy. + * Ignore markers */ - if ((m->busy != 0) || - (m->flags & PG_BUSY) || - (m->hold_count != 0)) { - TAILQ_REMOVE(&vm_page_queues[PQ_ACTIVE].pl, m, pageq); - TAILQ_INSERT_TAIL(&vm_page_queues[PQ_ACTIVE].pl, m, pageq); - m = next; + if (m->flags & PG_MARKER) { + vm_page_and_queue_spin_unlock(m); continue; } + /* + * Ignore pages we can't busy + */ + if (vm_page_busy_try(m, TRUE)) { + vm_page_and_queue_spin_unlock(m); + continue; + } + vm_page_and_queue_spin_unlock(m); + KKASSERT(m->queue == PQ_ACTIVE); + + /* + * We now have a safely busied page, the page and queue + * spinlocks have been released. 
+ * + * Ignore held pages + */ + if (m->hold_count) { + vm_page_wakeup(m); + continue; + } + + /* + * Calculate activity + */ actcount = 0; if (m->flags & PG_REFERENCED) { vm_page_flag_clear(m, PG_REFERENCED); actcount += 1; } - actcount += pmap_ts_referenced(m); + + /* + * Update act_count and move page to end of queue. + */ if (actcount) { m->act_count += ACT_ADVANCE + actcount; if (m->act_count > ACT_MAX) m->act_count = ACT_MAX; - TAILQ_REMOVE(&vm_page_queues[PQ_ACTIVE].pl, m, pageq); - TAILQ_INSERT_TAIL(&vm_page_queues[PQ_ACTIVE].pl, m, pageq); - } else { - if (m->act_count == 0) { - /* - * We turn off page access, so that we have - * more accurate RSS stats. We don't do this - * in the normal page deactivation when the - * system is loaded VM wise, because the - * cost of the large number of page protect - * operations would be higher than the value - * of doing the operation. - */ - vm_page_busy(m); - vm_page_protect(m, VM_PROT_NONE); - vm_page_deactivate(m); - vm_page_wakeup(m); - } else { - m->act_count -= min(m->act_count, ACT_DECLINE); - TAILQ_REMOVE(&vm_page_queues[PQ_ACTIVE].pl, m, pageq); - TAILQ_INSERT_TAIL(&vm_page_queues[PQ_ACTIVE].pl, m, pageq); + vm_page_and_queue_spin_lock(m); + if (m->queue == PQ_ACTIVE) { + TAILQ_REMOVE(&vm_page_queues[PQ_ACTIVE].pl, + m, pageq); + TAILQ_INSERT_TAIL(&vm_page_queues[PQ_ACTIVE].pl, + m, pageq); } + vm_page_and_queue_spin_unlock(m); + vm_page_wakeup(m); + continue; } - m = next; + if (m->act_count == 0) { + /* + * We turn off page access, so that we have + * more accurate RSS stats. We don't do this + * in the normal page deactivation when the + * system is loaded VM wise, because the + * cost of the large number of page protect + * operations would be higher than the value + * of doing the operation. + * + * We use the marker to save our place so + * we can release the spin lock. both (m) + * and (next) will be invalid. + */ + vm_page_protect(m, VM_PROT_NONE); + vm_page_deactivate(m); + } else { + m->act_count -= min(m->act_count, ACT_DECLINE); + vm_page_and_queue_spin_lock(m); + if (m->queue == PQ_ACTIVE) { + TAILQ_REMOVE(&vm_page_queues[PQ_ACTIVE].pl, + m, pageq); + TAILQ_INSERT_TAIL(&vm_page_queues[PQ_ACTIVE].pl, + m, pageq); + } + vm_page_and_queue_spin_unlock(m); + } + vm_page_wakeup(m); } + + /* + * Remove our local marker + */ + vm_page_queues_spin_lock(PQ_ACTIVE); + TAILQ_REMOVE(&vm_page_queues[PQ_ACTIVE].pl, &marker, pageq); + vm_page_queues_spin_unlock(PQ_ACTIVE); + } -/* - * The caller must hold vm_token. - */ static int vm_pageout_free_page_calc(vm_size_t count) { @@ -1488,11 +1705,6 @@ vm_pageout_thread(void) int pass; int inactive_shortage; - /* - * Permanently hold vm_token. - */ - lwkt_gettoken(&vm_token); - /* * Initialize some paging parameters. */ @@ -1714,10 +1926,8 @@ static void vm_daemon(void) { /* - * Permanently hold vm_token. + * XXX vm_daemon_needed specific token? */ - lwkt_gettoken(&vm_token); - while (TRUE) { tsleep(&vm_daemon_needed, 0, "psleep", 0); if (vm_pageout_req_swapout) { @@ -1733,7 +1943,7 @@ vm_daemon(void) } /* - * Caller must hold vm_token and proc_token. + * Caller must hold proc_token. 
*/ static int vm_daemon_callback(struct proc *p, void *data __unused) @@ -1768,11 +1978,12 @@ vm_daemon_callback(struct proc *p, void *data __unused) if (p->p_flag & P_SWAPPEDOUT) limit = 0; + lwkt_gettoken(&p->p_vmspace->vm_map.token); size = vmspace_resident_count(p->p_vmspace); if (limit >= 0 && size >= limit) { - vm_pageout_map_deactivate_pages( - &p->p_vmspace->vm_map, limit); + vm_pageout_map_deactivate_pages(&p->p_vmspace->vm_map, limit); } + lwkt_reltoken(&p->p_vmspace->vm_map.token); return (0); } diff --git a/sys/vm/vm_swap.c b/sys/vm/vm_swap.c index ae00c76afa..82e794a7c1 100644 --- a/sys/vm/vm_swap.c +++ b/sys/vm/vm_swap.c @@ -65,6 +65,7 @@ #include #include #include +#include /* * Indirect driver for multi-controller paging. @@ -442,6 +443,7 @@ swapoff_one(int index) swblk_t dvbase, vsbase; u_int pq_active_clean, pq_inactive_clean; struct swdevt *sp; + struct vm_page marker; vm_page_t m; mtx_lock(&swap_mtx); @@ -456,28 +458,61 @@ swapoff_one(int index) * of data we will have to page back in, plus an epsilon so * the system doesn't become critically low on swap space. */ - lwkt_gettoken(&vm_token); - TAILQ_FOREACH(m, &vm_page_queues[PQ_ACTIVE].pl, pageq) { + bzero(&marker, sizeof(marker)); + marker.flags = PG_BUSY | PG_FICTITIOUS | PG_MARKER; + marker.queue = PQ_ACTIVE; + marker.wire_count = 1; + + vm_page_queues_spin_lock(PQ_ACTIVE); + TAILQ_INSERT_HEAD(&vm_page_queues[PQ_ACTIVE].pl, &marker, pageq); + + while ((m = TAILQ_NEXT(&marker, pageq)) != NULL) { + TAILQ_REMOVE(&vm_page_queues[PQ_ACTIVE].pl, + &marker, pageq); + TAILQ_INSERT_AFTER(&vm_page_queues[PQ_ACTIVE].pl, m, + &marker, pageq); if (m->flags & (PG_MARKER | PG_FICTITIOUS)) continue; - if (m->dirty == 0) { - vm_page_test_dirty(m); - if (m->dirty == 0) - ++pq_active_clean; + if (vm_page_busy_try(m, FALSE) == 0) { + vm_page_queues_spin_unlock(PQ_ACTIVE); + if (m->dirty == 0) { + vm_page_test_dirty(m); + if (m->dirty == 0) + ++pq_active_clean; + } + vm_page_wakeup(m); + vm_page_queues_spin_lock(PQ_ACTIVE); } } - TAILQ_FOREACH(m, &vm_page_queues[PQ_INACTIVE].pl, pageq) { + TAILQ_REMOVE(&vm_page_queues[PQ_ACTIVE].pl, &marker, pageq); + vm_page_queues_spin_unlock(PQ_ACTIVE); + + marker.queue = PQ_INACTIVE; + vm_page_queues_spin_lock(PQ_INACTIVE); + TAILQ_INSERT_HEAD(&vm_page_queues[PQ_INACTIVE].pl, &marker, pageq); + + while ((m = TAILQ_NEXT(&marker, pageq)) != NULL) { + TAILQ_REMOVE(&vm_page_queues[PQ_INACTIVE].pl, + &marker, pageq); + TAILQ_INSERT_AFTER(&vm_page_queues[PQ_INACTIVE].pl, m, + &marker, pageq); if (m->flags & (PG_MARKER | PG_FICTITIOUS)) continue; - if (m->dirty == 0) { - vm_page_test_dirty(m); - if (m->dirty == 0) - ++pq_inactive_clean; + if (vm_page_busy_try(m, FALSE) == 0) { + vm_page_queues_spin_unlock(PQ_INACTIVE); + if (m->dirty == 0) { + vm_page_test_dirty(m); + if (m->dirty == 0) + ++pq_inactive_clean; + } + vm_page_wakeup(m); + vm_page_queues_spin_lock(PQ_INACTIVE); } } - lwkt_reltoken(&vm_token); + TAILQ_REMOVE(&vm_page_queues[PQ_INACTIVE].pl, &marker, pageq); + vm_page_queues_spin_unlock(PQ_INACTIVE); if (vmstats.v_free_count + vmstats.v_cache_count + pq_active_clean + pq_inactive_clean + vm_swap_size < aligned_nblks + nswap_lowat) { diff --git a/sys/vm/vm_swapcache.c b/sys/vm/vm_swapcache.c index dc29d1693c..ee7bfc8d74 100644 --- a/sys/vm/vm_swapcache.c +++ b/sys/vm/vm_swapcache.c @@ -76,6 +76,7 @@ #include #include +#include #include #define INACTIVE_LIST (&vm_page_queues[PQ_INACTIVE].pl) @@ -171,7 +172,6 @@ vm_swapcached_thread(void) swapcached_thread, SHUTDOWN_PRI_FIRST); 
EVENTHANDLER_REGISTER(shutdown_pre_sync, shutdown_swapcache, NULL, SHUTDOWN_PRI_SECOND); - lwkt_gettoken(&vm_token); /* * Initialize our marker for the inactive scan (SWAPC_WRITING) @@ -180,7 +180,11 @@ vm_swapcached_thread(void) page_marker.flags = PG_BUSY | PG_FICTITIOUS | PG_MARKER; page_marker.queue = PQ_INACTIVE; page_marker.wire_count = 1; + + vm_page_queues_spin_lock(PQ_INACTIVE); TAILQ_INSERT_HEAD(INACTIVE_LIST, &page_marker, pageq); + vm_page_queues_spin_unlock(PQ_INACTIVE); + vm_swapcache_hysteresis = vmstats.v_inactive_target / 2; vm_swapcache_inactive_heuristic = -vm_swapcache_hysteresis; @@ -264,8 +268,9 @@ vm_swapcached_thread(void) /* * Cleanup (NOT REACHED) */ + vm_page_queues_spin_lock(PQ_INACTIVE); TAILQ_REMOVE(INACTIVE_LIST, &page_marker, pageq); - lwkt_reltoken(&vm_token); + vm_page_queues_spin_unlock(PQ_INACTIVE); lwkt_gettoken(&vmobj_token); TAILQ_REMOVE(&vm_object_list, &object_marker, object_list); @@ -279,9 +284,6 @@ static struct kproc_desc swpc_kp = { }; SYSINIT(swapcached, SI_SUB_KTHREAD_PAGE, SI_ORDER_SECOND, kproc_start, &swpc_kp) -/* - * The caller must hold vm_token. - */ static void vm_swapcache_writing(vm_page_t marker) { @@ -316,22 +318,50 @@ vm_swapcache_writing(vm_page_t marker) * can end up with a very high datarate of VM pages * cycling from it. */ - m = marker; count = vm_swapcache_maxlaunder; - while ((m = TAILQ_NEXT(m, pageq)) != NULL && count--) { + vm_page_queues_spin_lock(PQ_INACTIVE); + while ((m = TAILQ_NEXT(marker, pageq)) != NULL && count-- > 0) { + KKASSERT(m->queue == PQ_INACTIVE); + + if (vm_swapcache_curburst < 0) + break; + TAILQ_REMOVE(INACTIVE_LIST, marker, pageq); + TAILQ_INSERT_AFTER(INACTIVE_LIST, m, marker, pageq); if (m->flags & (PG_MARKER | PG_SWAPPED)) { ++count; continue; } - if (vm_swapcache_curburst < 0) - break; - if (vm_swapcache_test(m)) + if (vm_page_busy_try(m, TRUE)) continue; - object = m->object; + vm_page_queues_spin_unlock(PQ_INACTIVE); + + if ((object = m->object) == NULL) { + vm_page_wakeup(m); + vm_page_queues_spin_lock(PQ_INACTIVE); + continue; + } + vm_object_hold(object); + if (m->object != object) { + vm_object_drop(object); + vm_page_wakeup(m); + vm_page_queues_spin_lock(PQ_INACTIVE); + continue; + } + if (vm_swapcache_test(m)) { + vm_object_drop(object); + vm_page_wakeup(m); + vm_page_queues_spin_lock(PQ_INACTIVE); + continue; + } + vp = object->handle; - if (vp == NULL) + if (vp == NULL) { + vm_object_drop(object); + vm_page_wakeup(m); + vm_page_queues_spin_lock(PQ_INACTIVE); continue; + } switch(vp->v_type) { case VREG: @@ -341,8 +371,12 @@ vm_swapcache_writing(vm_page_t marker) * (and leave it unset for meta-data buffers) as * appropriate when double buffering is enabled. */ - if (m->flags & PG_NOTMETA) + if (m->flags & PG_NOTMETA) { + vm_object_drop(object); + vm_page_wakeup(m); + vm_page_queues_spin_lock(PQ_INACTIVE); continue; + } /* * If data_enable is 0 do not try to swapcache data. @@ -352,11 +386,17 @@ vm_swapcache_writing(vm_page_t marker) if (vm_swapcache_data_enable == 0 || ((vp->v_flag & VSWAPCACHE) == 0 && vm_swapcache_use_chflags)) { + vm_object_drop(object); + vm_page_wakeup(m); + vm_page_queues_spin_lock(PQ_INACTIVE); continue; } if (vm_swapcache_maxfilesize && object->size > (vm_swapcache_maxfilesize >> PAGE_SHIFT)) { + vm_object_drop(object); + vm_page_wakeup(m); + vm_page_queues_spin_lock(PQ_INACTIVE); continue; } isblkdev = 0; @@ -368,21 +408,27 @@ vm_swapcache_writing(vm_page_t marker) * (and leave it unset for meta-data buffers) as * appropriate when double buffering is enabled. 
*/ - if (m->flags & PG_NOTMETA) + if (m->flags & PG_NOTMETA) { + vm_object_drop(object); + vm_page_wakeup(m); + vm_page_queues_spin_lock(PQ_INACTIVE); continue; - if (vm_swapcache_meta_enable == 0) + } + if (vm_swapcache_meta_enable == 0) { + vm_object_drop(object); + vm_page_wakeup(m); + vm_page_queues_spin_lock(PQ_INACTIVE); continue; + } isblkdev = 1; break; default: + vm_object_drop(object); + vm_page_wakeup(m); + vm_page_queues_spin_lock(PQ_INACTIVE); continue; } - /* - * Ok, move the marker and soft-busy the page. - */ - TAILQ_REMOVE(INACTIVE_LIST, marker, pageq); - TAILQ_INSERT_AFTER(INACTIVE_LIST, m, marker, pageq); /* * Assign swap and initiate I/O. @@ -394,30 +440,28 @@ vm_swapcache_writing(vm_page_t marker) /* * Setup for next loop using marker. */ - m = marker; + vm_object_drop(object); + vm_page_queues_spin_lock(PQ_INACTIVE); } /* - * Cleanup marker position. If we hit the end of the - * list the marker is placed at the tail. Newly deactivated - * pages will be placed after it. + * The marker could wind up at the end, which is ok. If we hit the + * end of the list adjust the heuristic. * * Earlier inactive pages that were dirty and become clean * are typically moved to the end of PQ_INACTIVE by virtue * of vfs_vmio_release() when they become unwired from the * buffer cache. */ - TAILQ_REMOVE(INACTIVE_LIST, marker, pageq); - if (m) { - TAILQ_INSERT_BEFORE(m, marker, pageq); - } else { - TAILQ_INSERT_TAIL(INACTIVE_LIST, marker, pageq); + if (m == NULL) vm_swapcache_inactive_heuristic = -vm_swapcache_hysteresis; - } + vm_page_queues_spin_unlock(PQ_INACTIVE); } /* - * Flush the specified page using the swap_pager. + * Flush the specified page using the swap_pager. The page + * must be busied by the caller and its disposition will become + * the responsibility of this function. * * Try to collect surrounding pages, including pages which may * have already been assigned swap. Try to cluster within a @@ -430,8 +474,6 @@ vm_swapcache_writing(vm_page_t marker) * should be sufficient. * * Returns a count of pages we might have flushed (minimum 1) - * - * The caller must hold vm_token. 
*/ static int @@ -445,10 +487,12 @@ vm_swapcached_flush(vm_page_t m, int isblkdev) int i; int j; int count; + int error; vm_page_io_start(m); vm_page_protect(m, VM_PROT_READ); object = m->object; + vm_object_hold(object); /* * Try to cluster around (m), keeping in mind that the swap pager @@ -457,15 +501,21 @@ vm_swapcached_flush(vm_page_t m, int isblkdev) x = (int)m->pindex & SWAP_META_MASK; marray[x] = m; basei = m->pindex; + vm_page_wakeup(m); for (i = x - 1; i >= 0; --i) { - m = vm_page_lookup(object, basei - x + i); - if (m == NULL) + m = vm_page_lookup_busy_try(object, basei - x + i, + TRUE, &error); + if (error || m == NULL) break; - if (vm_swapcache_test(m)) + if (vm_swapcache_test(m)) { + vm_page_wakeup(m); break; - if (isblkdev && (m->flags & PG_NOTMETA)) + } + if (isblkdev && (m->flags & PG_NOTMETA)) { + vm_page_wakeup(m); break; + } vm_page_io_start(m); vm_page_protect(m, VM_PROT_READ); if (m->queue - m->pc == PQ_CACHE) { @@ -473,17 +523,23 @@ vm_swapcached_flush(vm_page_t m, int isblkdev) vm_page_deactivate(m); } marray[i] = m; + vm_page_wakeup(m); } ++i; for (j = x + 1; j < SWAP_META_PAGES; ++j) { - m = vm_page_lookup(object, basei - x + j); - if (m == NULL) + m = vm_page_lookup_busy_try(object, basei - x + j, + TRUE, &error); + if (error || m == NULL) break; - if (vm_swapcache_test(m)) + if (vm_swapcache_test(m)) { + vm_page_wakeup(m); break; - if (isblkdev && (m->flags & PG_NOTMETA)) + } + if (isblkdev && (m->flags & PG_NOTMETA)) { + vm_page_wakeup(m); break; + } vm_page_io_start(m); vm_page_protect(m, VM_PROT_READ); if (m->queue - m->pc == PQ_CACHE) { @@ -491,6 +547,7 @@ vm_swapcached_flush(vm_page_t m, int isblkdev) vm_page_deactivate(m); } marray[j] = m; + vm_page_wakeup(m); } count = j - i; @@ -501,11 +558,14 @@ vm_swapcached_flush(vm_page_t m, int isblkdev) while (i < j) { if (rtvals[i] != VM_PAGER_PEND) { + vm_page_busy_wait(marray[i], FALSE, "swppgfd"); vm_page_io_finish(marray[i]); + vm_page_wakeup(marray[i]); vm_object_pip_wakeup(object); } ++i; } + vm_object_drop(object); return(count); } @@ -514,17 +574,15 @@ vm_swapcached_flush(vm_page_t m, int isblkdev) * Does not test m->queue, PG_MARKER, or PG_SWAPPED. * * Returns 0 on success, 1 on failure - * - * The caller must hold vm_token. */ static int vm_swapcache_test(vm_page_t m) { vm_object_t object; - if (m->flags & (PG_BUSY | PG_UNMANAGED)) + if (m->flags & PG_UNMANAGED) return(1); - if (m->busy || m->hold_count || m->wire_count) + if (m->hold_count || m->wire_count) return(1); if (m->valid != VM_PAGE_BITS_ALL) return(1); @@ -544,8 +602,6 @@ vm_swapcache_test(vm_page_t m) /* * Cleaning pass - * - * The caller must hold vm_token. 
*/ static void @@ -562,7 +618,6 @@ vm_swapcache_cleaning(vm_object_t marker) /* * Look for vnode objects */ - lwkt_gettoken(&vm_token); lwkt_gettoken(&vmobj_token); while ((object = TAILQ_NEXT(object, object_list)) != NULL) { @@ -635,5 +690,4 @@ vm_swapcache_cleaning(vm_object_t marker) marker->backing_object = object; lwkt_reltoken(&vmobj_token); - lwkt_reltoken(&vm_token); } diff --git a/sys/vm/vm_unix.c b/sys/vm/vm_unix.c index 9f65700707..554ed72fe5 100644 --- a/sys/vm/vm_unix.c +++ b/sys/vm/vm_unix.c @@ -77,9 +77,10 @@ sys_obreak(struct obreak_args *uap) int error; error = 0; - lwkt_gettoken(&vm_token); - base = round_page((vm_offset_t) vm->vm_daddr); + lwkt_gettoken(&vm->vm_map.token); + + base = round_page((vm_offset_t)vm->vm_daddr); new = round_page((vm_offset_t)uap->nsize); old = base + ctob(vm->vm_dsize); @@ -137,6 +138,7 @@ sys_obreak(struct obreak_args *uap) vm->vm_dsize -= btoc(old - new); } done: - lwkt_reltoken(&vm_token); + lwkt_reltoken(&vm->vm_map.token); + return (error); } diff --git a/sys/vm/vm_vmspace.c b/sys/vm/vm_vmspace.c index d8b185492a..fcab966a4f 100644 --- a/sys/vm/vm_vmspace.c +++ b/sys/vm/vm_vmspace.c @@ -279,10 +279,8 @@ sys_vmspace_mmap(struct vmspace_mmap_args *uap) int error; /* - * We hold the vmspace token to serialize calls to vkernel_find_vmspace - * and the vm token to serialize calls to kern_mmap. + * We hold the vmspace token to serialize calls to vkernel_find_vmspace. */ - lwkt_gettoken(&vm_token); lwkt_gettoken(&vmspace_token); if ((vkp = curproc->p_vkernel) == NULL) { error = EINVAL; @@ -305,7 +303,6 @@ sys_vmspace_mmap(struct vmspace_mmap_args *uap) lwkt_reltoken(&vkp->token); done3: lwkt_reltoken(&vmspace_token); - lwkt_reltoken(&vm_token); return (error); } diff --git a/sys/vm/vm_zone.c b/sys/vm/vm_zone.c index 82e0cebed2..411db5c324 100644 --- a/sys/vm/vm_zone.c +++ b/sys/vm/vm_zone.c @@ -406,7 +406,6 @@ zget(vm_zone_t z) * Interrupt zones do not mess with the kernel_map, they * simply populate an existing mapping. */ - lwkt_gettoken(&vm_token); vm_object_hold(z->zobj); savezpc = z->zpagecount; nbytes = z->zpagecount * PAGE_SIZE; @@ -441,7 +440,6 @@ zget(vm_zone_t z) } nitems = ((z->zpagecount * PAGE_SIZE) - nbytes) / z->zsize; vm_object_drop(z->zobj); - lwkt_reltoken(&vm_token); } else if (z->zflags & ZONE_SPECIAL) { /* * The special zone is the one used for vm_map_entry_t's. diff --git a/sys/vm/vnode_pager.c b/sys/vm/vnode_pager.c index d5bb3df17f..081862ec41 100644 --- a/sys/vm/vnode_pager.c +++ b/sys/vm/vnode_pager.c @@ -129,7 +129,7 @@ vnode_pager_alloc(void *handle, off_t length, vm_prot_t prot, off_t offset, * Serialize potential vnode/object teardowns and interlocks */ vp = (struct vnode *)handle; - lwkt_gettoken(&vmobj_token); + lwkt_gettoken(&vp->v_token); /* * Prevent race condition when allocating the object. This @@ -140,14 +140,18 @@ vnode_pager_alloc(void *handle, off_t length, vm_prot_t prot, off_t offset, tsleep(vp, 0, "vnpobj", 0); } vsetflags(vp, VOLOCK); + lwkt_reltoken(&vp->v_token); /* * If the object is being terminated, wait for it to * go away. 
*/ - while (((object = vp->v_object) != NULL) && - (object->flags & OBJ_DEAD)) { + while ((object = vp->v_object) != NULL) { + vm_object_hold(object); + if ((object->flags & OBJ_DEAD) == 0) + break; vm_object_dead_sleep(object, "vadead"); + vm_object_drop(object); } if (vp->v_sysref.refcnt <= 0) @@ -173,6 +177,7 @@ vnode_pager_alloc(void *handle, off_t length, vm_prot_t prot, off_t offset, * And an object of the appropriate size */ object = vm_object_allocate(OBJT_VNODE, lsize); + vm_object_hold(object); object->flags = 0; object->handle = handle; vp->v_object = object; @@ -180,7 +185,7 @@ vnode_pager_alloc(void *handle, off_t length, vm_prot_t prot, off_t offset, if (vp->v_mount && (vp->v_mount->mnt_kern_flag & MNTK_NOMSYNC)) vm_object_set_flag(object, OBJ_NOMSYNC); } else { - object->ref_count++; /* protected by vmobj_token */ + object->ref_count++; if (object->size != lsize) { kprintf("vnode_pager_alloc: Warning, objsize " "mismatch %jd/%jd vp=%p obj=%p\n", @@ -198,12 +203,15 @@ vnode_pager_alloc(void *handle, off_t length, vm_prot_t prot, off_t offset, } vref(vp); + lwkt_gettoken(&vp->v_token); vclrflags(vp, VOLOCK); if (vp->v_flag & VOWANT) { vclrflags(vp, VOWANT); wakeup(vp); } - lwkt_reltoken(&vmobj_token); + lwkt_reltoken(&vp->v_token); + + vm_object_drop(object); return (object); } @@ -219,28 +227,30 @@ vnode_pager_reference(struct vnode *vp) { vm_object_t object; - /* - * Serialize potential vnode/object teardowns and interlocks - */ - lwkt_gettoken(&vmobj_token); - /* * Prevent race condition when allocating the object. This * can happen with NFS vnodes since the nfsnode isn't locked. + * + * Serialize potential vnode/object teardowns and interlocks */ + lwkt_gettoken(&vp->v_token); while (vp->v_flag & VOLOCK) { vsetflags(vp, VOWANT); tsleep(vp, 0, "vnpobj", 0); } vsetflags(vp, VOLOCK); + lwkt_reltoken(&vp->v_token); /* * Prevent race conditions against deallocation of the VM * object. */ - while (((object = vp->v_object) != NULL) && - (object->flags & OBJ_DEAD)) { + while ((object = vp->v_object) != NULL) { + vm_object_hold(object); + if ((object->flags & OBJ_DEAD) == 0) + break; vm_object_dead_sleep(object, "vadead"); + vm_object_drop(object); } /* @@ -248,17 +258,20 @@ vnode_pager_reference(struct vnode *vp) * NULL returns if it does not. */ if (object) { - object->ref_count++; /* protected by vmobj_token */ + object->ref_count++; vref(vp); } + lwkt_gettoken(&vp->v_token); vclrflags(vp, VOLOCK); if (vp->v_flag & VOWANT) { vclrflags(vp, VOWANT); wakeup(vp); } + lwkt_reltoken(&vp->v_token); + if (object) + vm_object_drop(object); - lwkt_reltoken(&vmobj_token); return (object); } @@ -350,18 +363,24 @@ vnode_pager_setsize(struct vnode *vp, vm_ooffset_t nsize) { vm_pindex_t nobjsize; vm_pindex_t oobjsize; - vm_object_t object = vp->v_object; + vm_object_t object; + while ((object = vp->v_object) != NULL) { + vm_object_hold(object); + if (vp->v_object == object) + break; + vm_object_drop(object); + } if (object == NULL) return; /* * Hasn't changed size */ - if (nsize == vp->v_filesize) + if (nsize == vp->v_filesize) { + vm_object_drop(object); return; - - lwkt_gettoken(&vm_token); + } /* * Has changed size. 
Adjust the VM object's size and v_filesize @@ -393,9 +412,8 @@ vnode_pager_setsize(struct vnode *vp, vm_ooffset_t nsize) vm_offset_t kva; vm_page_t m; - do { - m = vm_page_lookup(object, OFF_TO_IDX(nsize)); - } while (m && vm_page_sleep_busy(m, TRUE, "vsetsz")); + m = vm_page_lookup_busy_wait(object, OFF_TO_IDX(nsize), + TRUE, "vsetsz"); if (m && m->valid) { int base = (int)nsize & PAGE_MASK; @@ -409,7 +427,6 @@ vnode_pager_setsize(struct vnode *vp, vm_ooffset_t nsize) * * This is byte aligned. */ - vm_page_busy(m); lwb = lwbuf_alloc(m, &lwb_cache); kva = lwbuf_kva(lwb); bzero((caddr_t)kva + base, size); @@ -450,12 +467,14 @@ vnode_pager_setsize(struct vnode *vp, vm_ooffset_t nsize) if (m->dirty != 0) m->dirty = VM_PAGE_BITS_ALL; vm_page_wakeup(m); + } else if (m) { + vm_page_wakeup(m); } } } else { vp->v_filesize = nsize; } - lwkt_reltoken(&vm_token); + vm_object_drop(object); } /* @@ -616,16 +635,10 @@ vnode_pager_generic_getpages(struct vnode *vp, vm_page_t *mpp, int bytecount, /* * Severe hack to avoid deadlocks with the buffer cache */ - lwkt_gettoken(&vm_token); for (i = 0; i < count; ++i) { - vm_page_t mt = mpp[i]; - - while (vm_page_sleep_busy(mt, FALSE, "getpgs")) - ; - vm_page_busy(mt); - vm_page_io_finish(mt); + vm_page_busy_wait(mpp[i], FALSE, "getpgs"); + vm_page_io_finish(mpp[i]); } - lwkt_reltoken(&vm_token); /* * Calculate the actual number of bytes read and clean up the @@ -638,7 +651,7 @@ vnode_pager_generic_getpages(struct vnode *vp, vm_page_t *mpp, int bytecount, if (i != reqpage) { if (error == 0 && mt->valid) { - if (mt->flags & PG_WANTED) + if (mt->flags & PG_REFERENCED) vm_page_activate(mt); else vm_page_deactivate(mt); @@ -829,37 +842,62 @@ vnode_pager_generic_putpages(struct vnode *vp, vm_page_t *m, int bytecount, return rtvals[0]; } +/* + * Run the chain and if the bottom-most object is a vnode-type lock the + * underlying vnode. A locked vnode or NULL is returned. 
+ */ struct vnode * vnode_pager_lock(vm_object_t object) { - struct thread *td = curthread; /* XXX */ + struct vnode *vp = NULL; + vm_object_t lobject; + vm_object_t tobject; int error; - ASSERT_LWKT_TOKEN_HELD(vm_object_token(object)); - - for (; object != NULL; object = object->backing_object) { - if (object->type != OBJT_VNODE) - continue; - if (object->flags & OBJ_DEAD) - return NULL; + if (object == NULL) + return(NULL); - for (;;) { - struct vnode *vp = object->handle; - error = vget(vp, LK_SHARED | LK_RETRY | LK_CANRECURSE); - if (error == 0) { - if (object->handle != vp) { - vput(vp); - continue; - } - return (vp); - } - if ((object->flags & OBJ_DEAD) || - (object->type != OBJT_VNODE)) { - return NULL; + ASSERT_LWKT_TOKEN_HELD(vm_object_token(object)); + lobject = object; + + while (lobject->type != OBJT_VNODE) { + if (lobject->flags & OBJ_DEAD) + break; + tobject = lobject->backing_object; + if (tobject == NULL) + break; + vm_object_hold(tobject); + if (tobject == lobject->backing_object) { + if (lobject != object) { + vm_object_lock_swap(); + vm_object_drop(lobject); } - kprintf("vnode_pager_lock: vp %p error %d lockstatus %d, retrying\n", vp, error, lockstatus(&vp->v_lock, td)); + lobject = tobject; + } else { + vm_object_drop(tobject); + } + } + while (lobject->type == OBJT_VNODE && + (lobject->flags & OBJ_DEAD) == 0) { + /* + * Extract the vp + */ + vp = lobject->handle; + error = vget(vp, LK_SHARED | LK_RETRY | LK_CANRECURSE); + if (error == 0) { + if (lobject->handle == vp) + break; + vput(vp); + } else { + kprintf("vnode_pager_lock: vp %p error %d " + "lockstatus %d, retrying\n", + vp, error, + lockstatus(&vp->v_lock, curthread)); tsleep(object->handle, 0, "vnpgrl", hz); } + vp = NULL; } - return NULL; + if (lobject != object) + vm_object_drop(lobject); + return (vp); } -- 2.41.0
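The queue scans reworked above (vm_pageout_scan(), vm_pageout_page_stats(), swapoff_one(), vm_swapcache_writing()) all share the same marker-walk pattern: insert a PG_MARKER page into the queue under the queue spinlock, leapfrog the marker past each candidate, try-busy the candidate, and only then drop the queue lock to work on it. The sketch below illustrates that pattern in userspace. It is a simplified illustration, not kernel code: fake_page, queue_lock, page_busy_try() and page_wakeup() are stand-ins for vm_page, the per-queue spinlock, vm_page_busy_try() and vm_page_wakeup(), a pthread mutex replaces the spinlock, and the single-threaded demo omits the re-check the real loops perform after re-acquiring the page and queue locks.

        /*
         * Simplified userspace sketch of the marker-based queue scan.
         * The names below are illustrative stand-ins, not kernel
         * interfaces.
         */
        #include <sys/queue.h>
        #include <pthread.h>
        #include <stdio.h>

        struct fake_page {
                TAILQ_ENTRY(fake_page) pageq;
                int     is_marker;      /* stands in for PG_MARKER */
                int     busy;           /* stands in for PG_BUSY */
                int     id;
        };
        TAILQ_HEAD(pagelist, fake_page);

        static struct pagelist  inactive_q = TAILQ_HEAD_INITIALIZER(inactive_q);
        static pthread_mutex_t  queue_lock = PTHREAD_MUTEX_INITIALIZER;

        static int
        page_busy_try(struct fake_page *m)
        {
                if (m->busy)
                        return (1);     /* already busy, skip the page */
                m->busy = 1;
                return (0);
        }

        static void
        page_wakeup(struct fake_page *m)
        {
                m->busy = 0;
        }

        static void
        scan_inactive(void)
        {
                struct fake_page marker = { .is_marker = 1 };
                struct fake_page *m;

                pthread_mutex_lock(&queue_lock);
                TAILQ_INSERT_HEAD(&inactive_q, &marker, pageq);

                while ((m = TAILQ_NEXT(&marker, pageq)) != NULL) {
                        /*
                         * Leapfrog the marker over (m) so our position
                         * survives dropping the queue lock.  The real
                         * loops also re-verify that (m) still follows
                         * the marker after re-locking.
                         */
                        TAILQ_REMOVE(&inactive_q, &marker, pageq);
                        TAILQ_INSERT_AFTER(&inactive_q, m, &marker, pageq);

                        if (m->is_marker)       /* other scanners' markers */
                                continue;
                        if (page_busy_try(m))   /* don't fight busy pages */
                                continue;

                        /*
                         * (m) is busied, so the queue lock can be
                         * released while it is processed; the marker
                         * keeps the scan position stable even if (m)
                         * is requeued or freed meanwhile.
                         */
                        pthread_mutex_unlock(&queue_lock);
                        printf("processing page %d\n", m->id);
                        page_wakeup(m);
                        pthread_mutex_lock(&queue_lock);
                }

                TAILQ_REMOVE(&inactive_q, &marker, pageq);
                pthread_mutex_unlock(&queue_lock);
        }

        int
        main(void)
        {
                static struct fake_page pages[4];
                int i;

                for (i = 0; i < 4; ++i) {
                        pages[i].id = i;
                        TAILQ_INSERT_TAIL(&inactive_q, &pages[i], pageq);
                }
                pages[2].busy = 1;      /* simulate a page busied elsewhere */
                scan_inactive();
                return (0);
        }

The marker is removed only while the queue lock is held, which is why each exit path above re-acquires the lock before falling out of the loop, mirroring the vm_page_queues_spin_lock()/TAILQ_REMOVE() cleanup in the patched scans.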
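vm_page2.h also gains vm_page_wire_quick()/vm_page_unwire_quick(), which adjust wire_count without busying the page and refuse to make the final 1->0 transition so the caller can fall back to the busied path. A rough equivalent of that loop in portable C11 atomics follows; the names here are illustrative only, not the kernel's API.

        /*
         * Drop one wire reference lock-free, but refuse the final 1->0
         * transition so the caller can busy the page and finish the
         * unwire through the slower path.
         */
        #include <stdatomic.h>
        #include <stdbool.h>
        #include <assert.h>

        static bool
        wire_drop_quick(atomic_uint *wire_count)
        {
                unsigned int count = atomic_load(wire_count);

                for (;;) {
                        assert(count > 0);
                        if (count == 1)
                                return (true);  /* caller must busy the page */
                        if (atomic_compare_exchange_weak(wire_count, &count,
                                                         count - 1))
                                return (false); /* reference dropped */
                        /* CAS failed: count was reloaded, just retry */
                }
        }

        int
        main(void)
        {
                atomic_uint wires = 3;

                while (!wire_drop_quick(&wires))
                        ;
                return (wires == 1 ? 0 : 1);    /* final unwire left to slow path */
        }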