kernel - Fix MP system call performance regression
* The userland scheduler was unconditionally calling lwkt_switch()
via userexit() (i.e. on every system call), creating unnecessary
overhead and possibly also triggering a bsd4 scheduler event
requiring a common spinlock.
* Rearrange the code slightly to reduce instances where lwkt_switch()
is called. We still want to switch when a higher priority LWKT
thread is potentially runnable or when the LWKT fairq accumulator
for the current thread has been exhausted.
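
A minimal sketch of the two conditions described above, for illustration
only. The struct names, fields and the helper sketch_need_lwkt_switch()
are hypothetical stand-ins, not the actual kernel structures or the code
in this commit:

```c
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; not the real kernel structures. */
struct sketch_thread {
    int pri;          /* LWKT priority of the current thread */
    int fairq_accum;  /* remaining fairq budget; <= 0 means exhausted */
};

struct sketch_cpu {
    int best_runnable_pri;  /* highest runnable LWKT priority, -1 if none */
};

/*
 * Return true only in the two cases the commit message keeps:
 * a higher priority thread may be runnable, or the current
 * thread's fairq accumulator has been exhausted.
 */
static bool
sketch_need_lwkt_switch(const struct sketch_cpu *cpu,
                        const struct sketch_thread *td)
{
    if (cpu->best_runnable_pri > td->pri)
        return true;
    if (td->fairq_accum <= 0)
        return true;
    return false;
}
```

In this sketch the system call exit path would call lwkt_switch() only
when the helper returns true, instead of on every call.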
* This removes system call overhead multiplication on MP systems. For
example, on a 48-core box, system call overhead when all 48 cpus are
busy doing getuid() loops went from 10us back down to 270ns (which
is near the single-cpu test results).
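
A getuid() loop of the kind mentioned above could look roughly like the
following. This is a sketch, not the program used for the numbers in
this message. The function name getuid_ns_per_call() and the use of
clock_gettime(CLOCK_MONOTONIC) for timing are assumptions:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>
#include <unistd.h>

/*
 * Time 'iters' back-to-back getuid() calls and return the average
 * wall-clock cost per call in nanoseconds. The result includes the
 * loop overhead, so it is an upper bound on the system call cost.
 */
static double
getuid_ns_per_call(long iters)
{
    struct timespec t0, t1;
    long i;

    assert(iters > 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < iters; i++)
        (void)getuid();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return ((double)(t1.tv_sec - t0.tv_sec) * 1e9 +
            (double)(t1.tv_nsec - t0.tv_nsec)) / (double)iters;
}
```

To approximate the all-cpus-busy case, one would presumably run one
copy of such a loop per cpu at the same time and compare the per-call
figure against a single copy running alone.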