kernel - Refactor Xinvltlb and the pmap page & global tlb invalidation code
* Augment Xinvltlb to handle both full-TLB invalidation and per-page
invalidation.
* Remove the old lwkt_ipi-based per-page invalidation code.
* Include Xinvltlb interrupts in the V_IPI statistics counter
(so they show up in systat -pv 1).
* Add loop counters to detect and log possible endless loops (see the
sketch after this list).
* Fix single_apic_ipi_passive(), though note that this function is
currently unused. Interrupts must be hard-disabled when checking
icr_lo.
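As a rough sketch of the loop-counter idea (the threshold, mask, and
variable names here are assumptions, not the kernel's exact code), the
waits that spin for remote acknowledgement count iterations and log
once when the count becomes implausibly large:

	int loops = 0;

	while (CPUMASK_TESTNZERO(smp_invltlb_mask)) {	/* cpus pending */
		cpu_pause();
		if (++loops == 1000000000) {	/* hypothetical threshold */
			kprintf("smp_invltlb: cpu%d possible endless "
				"loop\n", mycpuid);
			loops = 0;	/* log again if it persists */
		}
	}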
* NEW INVALIDATION MECHANISM
The new invalidation mechanism is primarily contained in mp_machdep.c
and pmap_inval.c. Supply new all-in-one rollup functions which include
the *ptep contents adjustment, replacing the prior piecemeal functions.
The new mechanism uses Xinvltlb for both full-tlb and per-page
invalidations. This interrupt ignores critical sections (that is,
will operate even if kernel code is in a critical section), which
significantly improves the latency and stability of our pmap pte
invalidation support functions.
For example, prior to these changes the invalidation code used the
lwkt_ipiq paths, which are subject to critical sections and could
result in long stalls across substantially ALL cpus when one cpu was
in a long cpu-bound critical section.
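As a hedged sketch, the rollup shape looks roughly like the following
(the signature, the smp_invlpg() helper, and the pm_active field are
illustrative here, not the exact kernel symbols):

	pt_entry_t
	pmap_inval_smp(pmap_t pmap, vm_offset_t va, pt_entry_t *ptep,
		       pt_entry_t npte)
	{
		pt_entry_t opte;

		/* the *ptep contents adjustment is part of the rollup */
		opte = atomic_swap_long(ptep, npte);

		/*
		 * Xinvltlb-based flush, full-TLB or per-page, directed
		 * at the cpus in pmap->pm_active (field illustrative).
		 */
		if (va == (vm_offset_t)-1)
			smp_invltlb();
		else
			smp_invlpg(&pmap->pm_active, va); /* hypothetical */

		return opte;	/* caller can inspect accessed/modified bits */
	}

The single entry point lets callers atomically replace a pte and
synchronize the other cpus without juggling separate invalidation
steps.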
* NEW SMP_INVLTLB() OPTIMIZATION
smp_invltlb() always used Xinvltlb, and it still does. However, the
code now avoids IPIing idle cpus, instead flagging them to issue the
cpu_invltlb() call when they wake up.
To make this work the idle code must temporarily enter a critical section
so 'normal' interrupts do not run until it has a chance to check and act
on the flag. This will slightly increase interrupt latency on an idle
cpu.
This change significantly reduces smp_invltlb() overhead by avoiding
having to pull idle cpus out of their high-latency/low-power state,
and it thus also prevents the high wakeup latency on those cpus from
stalling the invalidation for everyone else.
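A hedged sketch of the flagging scheme (the smp_idle_mask and
smp_invltlb_req names are illustrative; the cpumask macros are the
standard DragonFly ones):

	/* sender side (smp_invltlb): flag idle cpus, don't IPI them */
	cpumask_t mask = smp_active_mask;

	CPUMASK_NANDMASK(mask, smp_idle_mask);
	ATOMIC_CPUMASK_ORMASK(smp_invltlb_req, smp_idle_mask);
	/* ... send Xinvltlb only to the cpus remaining in mask ... */

	/* idle loop side, immediately after waking up */
	crit_enter();			/* hold off 'normal' interrupts */
	if (CPUMASK_TESTBIT(smp_invltlb_req, mycpuid)) {
		ATOMIC_CPUMASK_NANDBIT(smp_invltlb_req, mycpuid);
		cpu_invltlb();		/* perform the deferred flush */
	}
	crit_exit();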
* Remove unnecessary calls to smp_invltlb(). It is not necessary to
call this function when a *ptep is transitioning from 0 to non-zero,
because the TLB cannot be caching a translation for a page that was
not mapped. This significantly cuts down on smp_invltlb() traffic
under load.
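In code form the check is roughly this (a sketch, not the exact
kernel code):

	opte = atomic_swap_long(ptep, npte);
	if (opte)	/* 0 -> non-zero: no cpu can have it cached */
		smp_invltlb();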
* Remove a bunch of unused code in these paths.
* Add machdep.report_invltlb_src and machdep.report_invlpg_src,
down-counters which generate one stack backtrace when they reach 0.
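A hedged sketch of the down-counter pattern (the exact hook point in
the invalidation path may differ):

	static int report_invltlb_src;
	SYSCTL_INT(_machdep, OID_AUTO, report_invltlb_src, CTLFLAG_RW,
		   &report_invltlb_src, 0, "backtrace once on reaching 0");

	/* in the smp_invltlb() path */
	if (report_invltlb_src > 0) {
		if (--report_invltlb_src == 0)
			print_backtrace(-1);	/* one-shot backtrace */
	}

For example, 'sysctl machdep.report_invltlb_src=1' prints a backtrace
identifying the very next smp_invltlb() caller.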
TIMING TESTS
No appreciable differences with the new code other than it feeling
smoother.
Test setup: mount_tmpfs dummy /usr/obj
On monster (4-socket, 48-core):
time make -j 50 buildworld
BEFORE: 7849.697u 4693.979s 16:23.07 1275.9%
AFTER: 7682.598u 4467.224s 15:47.87 1281.8%
time make -j 50 nativekernel NO_MODULES=TRUE
BEFORE: 927.608u 254.626s 1:36.01 1231.3%
AFTER: 531.124u 204.456s 1:25.99 855.4%
On 2 x E5-2620 (2-socket, 32-core):
time make -j 50 buildworld
BEFORE: 5750.042u 2291.083s 10:35.62 1265.0%
AFTER: 5694.573u 2280.078s 10:34.96 1255.9%
time make -j 50 nativekernel NO_MODULES=TRUE
BEFORE: 431.338u 84.458s 0:54.71 942.7%
AFTER: 414.962u 92.312s 0:54.75 926.5%
(time mostly spent in the mkdep pass and on the final link)
Memory thread tests, 64 threads each allocating memory:
BEFORE: 3.1M faults/sec
AFTER: 3.1M faults/sec