kernel - Refactor cpu localization for VM page allocations
* Change how cpu localization works. The old scheme was extremely unbalanced
in terms of vm_page_queue[] load.
The new scheme uses cpu topology information to break the vm_page_queue[]
down into major blocks based on the physical package id, minor blocks
based on the core id in each physical package, and then individual
queues within each minor block based on (pindex + object->pg_color).
If PQ_L2_SIZE is too small to give each core a 16-way set of queues
when dividing by both physical id and core id, we break the queue down
only by physical id.
Note that the core id counts real cores, not cpu threads, so an
8-core/16-thread x 2 socket Xeon system will just fit the 16-way
requirement (there are 256 PQ_FREE queues).
* When a particular queue does not have a free page, iterate nearby queues
starting at +/- 1 (previously we started at +/- PQ_L2_SIZE/2), in an attempt to
retain as much locality as possible. This won't be perfect but it should
be good enough.
* Also fix an issue with the idlezero counters.
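
The queue selection and nearby-queue fallback described above can be
sketched in userland C as follows. This is an illustration only, not
the code in this commit: struct topo, queue_index() and find_nearby()
are hypothetical names, the package and core counts are assumed to be
powers of two that divide PQ_L2_SIZE evenly, and the alternating
+1/-1, +2/-2 visit order is one plausible reading of "+/- 1".

```c
#include <assert.h>

#define PQ_L2_SIZE	256			/* 256 PQ_FREE queues, per the text above */
#define PQ_L2_MASK	(PQ_L2_SIZE - 1)

/* Hypothetical topology summary; not the kernel's real structures. */
struct topo {
	int nphys;	/* physical packages (sockets) */
	int ncores;	/* real cores per package, not threads */
};

/*
 * Pick a queue: major block by package id, minor block by core id,
 * then a single queue inside the block from (pindex + pg_color).
 * Falls back to package-only blocks when 16-way is not attainable.
 */
static int
queue_index(const struct topo *t, int phys_id, int core_id,
	    unsigned long pindex, unsigned long pg_color)
{
	int phys_span, core_span;
	unsigned long color = pindex + pg_color;

	assert(t->nphys > 0 && t->ncores > 0);
	assert(phys_id >= 0 && phys_id < t->nphys);
	assert(core_id >= 0 && core_id < t->ncores);

	phys_span = PQ_L2_SIZE / t->nphys;	/* queues per package */
	core_span = phys_span / t->ncores;	/* queues per core */

	if (core_span >= 16) {
		return (phys_id * phys_span + core_id * core_span +
			(int)(color % (unsigned long)core_span));
	}
	/* Not enough queues for 16-way per core: localize by package only. */
	return (phys_id * phys_span +
		(int)(color % (unsigned long)phys_span));
}

/*
 * Find a queue with a free page, preferring queues close to 'start'.
 * Visit order is start, start+1, start-1, start+2, start-2, ...
 * with wrap-around. Returns -1 if every queue is empty.
 */
static int
find_nearby(const int *free_count, int start)
{
	int i;

	if (free_count[start] > 0)
		return (start);
	for (i = 1; i <= PQ_L2_SIZE / 2; ++i) {
		int up = (start + i) & PQ_L2_MASK;
		int down = (start - i + PQ_L2_SIZE) & PQ_L2_MASK;

		if (free_count[up] > 0)
			return (up);
		if (free_count[down] > 0)
			return (down);
	}
	return (-1);
}
```

Under these assumptions the 2-socket, 8-core example gets 128 queues
per package and 16 per core, which just meets the 16-way requirement.
A 2-socket, 16-core system would leave only 8 queues per core and so
takes the package-only path.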