kernel - Refactor bcmp, bcopy, bzero, memset
* For now continue to use stosq/stosb, movsq/movsb, cmpsq/cmpsb sequences
which are well optimized on AMD and Intel. Do not use just the '*b'
string op by itself; while that is optimized on Intel, it is not
optimized on AMD.
* Note that two string ops in a row result in a serious pessimization.
For now, work around this by conditionalizing the trailing movsb,
stosb, or cmpsb op so it executes only when the remaining count is
non-zero. That is, assume nominal 8-byte alignment.
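The structure described in the two bullets above can be sketched as follows. This is a minimal illustration, not the committed code: the function name is hypothetical, and it assumes x86-64 with GCC/Clang extended asm. The key point is that the byte-granular string op is branched over entirely when the remaining count is zero, so the common aligned case executes only one string op.

```c
#include <stddef.h>

/* Hypothetical sketch of the bzero approach: bulk-zero with
 * "rep stosq", then run "rep stosb" only when a sub-8-byte
 * remainder exists, avoiding back-to-back string ops in the
 * nominal 8-byte-aligned case. x86-64, GCC/Clang extended asm. */
static void
bzero_sketch(void *dst, size_t len)
{
    unsigned char *d = dst;
    size_t qwords = len >> 3;   /* 8-byte chunks */
    size_t rem = len & 7;       /* byte tail     */

    __asm__ volatile("rep stosq"
        : "+D" (d), "+c" (qwords)
        : "a" (0L)
        : "memory");
    if (rem) {                  /* skip the second string op when 0 */
        __asm__ volatile("rep stosb"
            : "+D" (d), "+c" (rem)
            : "a" (0L)
            : "memory");
    }
}
```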
* Refactor pagezero() to use a movq/addq/jne sequence. This is
significantly faster than movsq on AMD and only very slightly slower
than movsq on Intel.
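A C rendering of the pagezero() loop shape is shown below as a sketch only; the function name and the PAGE_SIZE value are assumptions, and the real routine is assembly. The compiler lowers the loop body to the store/advance/branch (movq/addq/jne) sequence described above rather than a rep-prefixed string op.

```c
#include <stddef.h>

#define PAGE_SIZE 4096  /* assumption for illustration */

/* Hypothetical sketch of pagezero(): a plain store/advance/branch
 * loop (movq $0,(%reg); addq $8,%reg; cmpq/jne) instead of
 * rep stosq/movsq. */
static void
pagezero_sketch(void *page)
{
    long *p = page;
    long *end = p + PAGE_SIZE / sizeof(long);

    do {
        *p = 0;          /* movq $0,(%reg) */
        p++;             /* addq $8,%reg   */
    } while (p != end);  /* cmpq ; jne     */
}
```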
* Use the adjusted kernel code above for these functions in libc as
well, with minor modifications. Since we are copying the code
wholesale, replace the copyright on the related files in libc.
* Refactor libc's memset() to replicate the fill byte across all 64
bits of a register and then use code similar to bzero().
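The byte-replication step can be sketched in portable C as below. The function name is hypothetical and this is not the committed libc code; it only illustrates the idea of multiplying the fill byte by 0x0101010101010101 so a bzero()-style 64-bit word loop can be reused.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the memset() refactor: replicate the byte
 * into all 8 lanes of a 64-bit word (e.g. 0xAB -> 0xABABABABABABABAB),
 * then store word-at-a-time with a byte tail, as in bzero(). */
static void *
memset_sketch(void *dst, int c, size_t len)
{
    uint64_t pattern = (uint8_t)c * 0x0101010101010101ULL;
    unsigned char *d = dst;

    while (len >= 8) {           /* bzero()-style 64-bit word loop  */
        memcpy(d, &pattern, 8);  /* memcpy sidesteps strict aliasing */
        d += 8;
        len -= 8;
    }
    while (len--)                /* byte tail, only when count > 0  */
        *d++ = (unsigned char)c;
    return dst;
}
```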
Reported-by: mjg_ (info on pessimizations)