kernel - Improve exec performance
* Improves non-shared 32-way-concurrent exec performance for a small
static binary on the Xeon from 92KE/s (92000 execs/sec) to 136KE/s.
* Improves single-threaded test performance from ~4.5KE/s to ~6.5KE/s
and, for reasons I don't entirely understand, sometimes up to ~8KE/s.
* Several changes here, but the only one that matters for the test is
that pv_placemarker_wakeup() no longer needs a spin_lock/spin_unlock
pair on the pmap. I adjusted the code so the pmap spinlock is not
required for placemarker wakeup operations.
What I think might have happened here is that this removal also got
rid of a shared/exclusive spinlock ping-pong. Still, the huge
improvement in performance was not expected. Even with the removal
there is still an atomic_swap_long() in the code path.
My guess is that multiple atomic ops degrade the instruction pipeline
more than one would otherwise expect due to the multiple memory
fences.
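
For illustration only, here is a minimal userland sketch of the general
idea: releasing a placemarker slot with one atomic swap instead of a
lock/unlock pair. This is not the kernel code. It uses C11 atomics
rather than atomic_swap_long(), and struct placemarker, PM_EMPTY,
PM_WAITING and the three helper functions are hypothetical names made
up for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical encoding, not taken from the kernel sources. */
#define PM_EMPTY    (-1L)         /* slot holds no placemarker */
#define PM_WAITING  0x40000000L   /* flag bit: a thread sleeps on this slot */

struct placemarker {
    atomic_long slot;             /* page index, optionally ORed with PM_WAITING */
};

/* Try to claim an empty slot for page index pindex; true on success. */
static bool
placemarker_claim(struct placemarker *pm, long pindex)
{
    long expected = PM_EMPTY;

    assert(pindex >= 0 && (pindex & PM_WAITING) == 0);
    return atomic_compare_exchange_strong(&pm->slot, &expected, pindex);
}

/*
 * A waiter flags the slot before it goes to sleep.  Returns false if
 * the slot was already released, in which case there is nothing to
 * wait for.
 */
static bool
placemarker_mark_waiting(struct placemarker *pm)
{
    long old = atomic_load(&pm->slot);

    while (old != PM_EMPTY) {
        /* On failure, old is reloaded with the current slot value. */
        if (atomic_compare_exchange_weak(&pm->slot, &old, old | PM_WAITING))
            return true;
    }
    return false;
}

/*
 * Release the slot with a single atomic swap and no lock.  Returns
 * true when a waiter had flagged the slot, meaning the caller must
 * issue a wakeup.
 */
static bool
placemarker_release(struct placemarker *pm)
{
    long old = atomic_exchange(&pm->slot, PM_EMPTY);

    return old != PM_EMPTY && (old & PM_WAITING) != 0;
}
```

The release path here is one atomic read-modify-write. Wrapping it in a
lock/unlock pair would add at least two more atomic operations on a
shared cache line, which is the kind of overhead speculated about
above.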