kernel - Optimize spinlocks for 48-core contention
* Change the spinlock algorithm to do a read-test before atomic_swap_int().
This has no effect on single-chip cpus (tested on phenom II quad-core),
but has a HUGE HUGE HUGE effect on multi-chip/many-core systems. On
monster (48-core opteron / 4 x 12-core chips) concurrent kernel compile
time is reduced from 170 seconds to 75 seconds with this one change.
That's roughly a 2.3x speedup (about 56% less wall-clock time).
The change is important because it unloads the hardware cache
coherency bus and communication by creating a closed loop with the
pre-read, which essentially passively waits for the cache update
instead of actively issuing a locked bus cycle memory op. This prevents
total armageddon on the memory buses when a substantial number of
cores are doing real work.
* Increase the number of pool spinlocks from 1024 to 8192. We need them
now that vm_pages use pool spinlocks.
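
For illustration only, the read-test-before-swap idea described in the
first item can be sketched in portable C11. This is not the kernel code:
struct demo_spinlock and the demo_spin_*() functions are hypothetical
names, and C11 atomic_exchange stands in for atomic_swap_int(). The
point is that a waiter spins on a plain load, which can be satisfied
from its own cache, and only issues the locked swap once the lock
looks free.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type for illustration; not the kernel's spinlock. */
struct demo_spinlock {
	atomic_int locked;	/* 0 = free, 1 = held */
};

static void
demo_spin_init(struct demo_spinlock *sp)
{
	atomic_init(&sp->locked, 0);
}

/*
 * One acquisition attempt.  The plain read comes first; the atomic
 * swap (a locked bus cycle on x86) is only issued if the lock
 * appeared to be free.
 */
static bool
demo_spin_trylock(struct demo_spinlock *sp)
{
	if (atomic_load_explicit(&sp->locked, memory_order_relaxed) != 0)
		return false;	/* held: do not touch the bus */
	return atomic_exchange_explicit(&sp->locked, 1,
	    memory_order_acquire) == 0;
}

static void
demo_spin_lock(struct demo_spinlock *sp)
{
	/* A real implementation would likely add a cpu pause hint here. */
	while (!demo_spin_trylock(sp))
		;
}

static void
demo_spin_unlock(struct demo_spinlock *sp)
{
	atomic_store_explicit(&sp->locked, 0, memory_order_release);
}
```

Without the leading load, every spinning core would repeatedly issue
the exchange and pull the cache line exclusive, which is the bus
traffic the change avoids.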