Enhance the pmap_kenter*() API and friends, separating out entries which
only need invalidation on the local cpu against entries which need invalidation
across the entire system, and provide a synchronization abstraction.
Enhance sf_buf_alloc() and friends to allow the caller to specify whether the
sf_buf's kernel mapping is going to be used on just the current cpu or
whether it needs to be valid across all cpus. This is done by maintaining
a cpumask of known-synchronized cpus in the struct sf_buf
Optimize sf_buf_alloc() and friends by removing both TAILQ operations in the
critical path. TAILQ operations to remove the sf_buf from the free queue
are now done in a lazy fashion. Most sf_buf operations allocate a buf,
work on it, and free it, so why waste time moving the sf_buf off the freelist
if we are only going to move back onto the free list a microsecond later?
Fix a bug in sf_buf_alloc() code as it was being used by the PIPE code.
sf_buf_alloc() was unconditionally using PCATCH in its tsleep() call, which
is only correct when called from the sendfile() interface.
Optimize the PIPE code to require only local cpu_invlpg()'s when mapping
sf_buf's, greatly reducing the number of IPIs required. On a DELL-2550,
a pipe test which explicitly blows out the sf_buf caching by using huge
buffers improves from 350 to 550 MBytes/sec. However, note that buildworld
times were not found to have changed.
Replace the PIPE code's custom 'struct pipemapping' structure with a
struct xio and use the XIO API functions rather then its own.