kernel - Optimize kfree() to greatly reduce IPI traffic
* Instead of IPIing the chunk being freed to the originating cpu we
use atomic ops to directly link the chunk onto the target slab.
We then notify the target cpu via an IPI message only in the case where
we believe the slab has to be entered back onto the target cpu's
ZoneAry.
This reduces the IPI messaging load by a factor of 100x or more.
kfree() sends virtually no IPIs any more.
* Move malloc_type accounting to the cpu issuing the kmalloc or kfree
(kfree used to forward the accounting to the target cpu). The
accounting is done using the per-cpu malloc_type accounting array
so large deltas will likely accumulate, but they should all cancel
out properly in the summation.
* Use the kmemusage array and kup->ku_pagecnt to track whether a
SLAB is active or not, which allows the handler for the asynchronous IPI
to validate that the SLAB still exists before trying to access it.
This is necessary because once the cpu doing the kfree() successfully
links the chunk into z_RChunks, the target slab can get ripped out
from under it by the owning cpu.
* The special cpu-competing linked list is different from the linked list
normally used to find free chunks, so the localized code and the
MP code is segregated.
We pay special attention to list ordering to try to avoid unnecessary
cache mastership changes, though it should be noted that the c_Next
link field in the chunk creates an issue no matter what we do.
A 100% lockless algorithm is used. atomic_cmpset_ptr() is used
to manage the z_RChunks singly-linked list.
* Remove the page localization code for now. For the life of the
typically chunk of memory I don't think this provided much of
an advantage.
Prodded-by: Venkatesh Srinivas