pthreads - Improve low level lock performance when heavily contested
* The low-level __thr_umtx_lock()/unlock and related primitives are
used by pthreads, but have very poor performance when heavily contested:
  * Calling sched_yield() in the lock loop just doesn't work well.
  * The lock path attempts to sleep too quickly, which costs a great
    deal of system overhead.
  * The unlock path issues broadcast wakeups for waiters, causing
    excessive IPIs.
* Stop calling sched_yield() in the loop. Let the userland scheduler's
dynamic priority deal with it.
* Scale the spin count up significantly, and then further based on
the number of pthreads in the application. If the program is stupid
enough to cause excessive contention, then the penalty for making that
perform well is going to be more cpu time.
* Issue a wakeup1() equivalent on unlock if there are any waiters,
significantly reducing system IPIs.
To make this work reliably, the primary lock loop, when it sleeps,
will now always do so with a 1ms timeout, then loop/recheck. If
an API timeout is specified in excess of 1ms, the timo variable
is reduced on each loop and proper timeout handling occurs on
the last call.
* Running qemu w/ 32-cores specified (on a 64/128 threadripper host),
with nvmm, reduces build-all time from 9:10 to 8:20, relative to
a native host build time (usched restricted to 32 cores) of 6:11.
So this is a significant improvement.
(currently qemu-6.0.0 w/nvmm has some significant contention when a
high cpu count is configured, due to the implementation).
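
For illustration only, the spin-then-sleep loop described above might look
roughly like the following C sketch. This is not the committed code. The
names sketch_lock(), spin_budget(), fake_sleep() and the spin constants
are hypothetical, the kernel sleep is replaced by a caller-supplied stub,
and timo_us == 0 is assumed to mean "no timeout".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SLICE_US 1000	/* 1ms sleep slice, per the description above */

/* Hypothetical stand-in for the kernel sleep primitive. */
typedef void (*sleep_fn)(atomic_int *lock, int slice_us, void *ctx);

/* Demo stub: records the slices requested instead of sleeping. */
struct sleep_log { int calls; int total_us; };

static void fake_sleep(atomic_int *lock, int slice_us, void *ctx)
{
	struct sleep_log *log = ctx;

	(void)lock;
	log->calls++;
	log->total_us += slice_us;
}

/* Assumed scaling: a base count plus a per-thread increment, capped. */
static int spin_budget(int nthreads)
{
	int spins = 1000 + 100 * nthreads;	/* made-up constants */

	return (spins > 100000) ? 100000 : spins;
}

static bool try_acquire(atomic_int *lock)
{
	int expected = 0;

	return atomic_compare_exchange_strong(lock, &expected, 1);
}

/*
 * Returns 0 once the lock is acquired, -1 if timo_us expires first.
 * Note that no sched_yield() is issued anywhere in the loop.
 */
static int sketch_lock(atomic_int *lock, int timo_us, int nthreads,
		       sleep_fn do_sleep, void *ctx)
{
	assert(timo_us >= 0);
	for (;;) {
		int spins = spin_budget(nthreads);

		/* Spin first; sleeping too eagerly is what was expensive. */
		for (int i = 0; i < spins; i++) {
			if (try_acquire(lock))
				return 0;
		}
		if (timo_us == 0) {
			/* No API timeout: sleep one slice, then recheck. */
			do_sleep(lock, SLICE_US, ctx);
			continue;
		}
		/* API timeout: consume it one slice at a time. */
		int slice = (timo_us < SLICE_US) ? timo_us : SLICE_US;

		do_sleep(lock, slice, ctx);
		timo_us -= slice;
		if (timo_us == 0)	/* last slice: final try, then fail */
			return try_acquire(lock) ? 0 : -1;
	}
}
```

With a lock that stays held and a 2500us timeout, this sketch sleeps in
slices of 1000, 1000 and 500us before reporting a timeout. Because every
sleep is bounded by the slice length, a waiter that misses a single-waiter
wakeup rechecks the lock within about 1ms instead of sleeping indefinitely.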