serialize: Optimize atomic_intr_cond_{enter,try,exit}()
Use the counter (30 bits) of __atomic_intr_t as a wait counter instead of
a request counter:
- This avoids counter updates in atomic_intr_cond_try().
- Move the counter decrement from atomic_intr_cond_exit() to
atomic_intr_cond_enter().
- Try to obtain intr_cond first in atomic_intr_cond_enter(). The counter
is incremented only if that first try fails.
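The wait-counter scheme above can be sketched in portable C11 as follows.
This is an illustration under stated assumptions, not the code in this
commit: the sk_* names, the bit layout (bit 31 as the held flag, low 30
bits as the waiter count) and the sk_count_cb helper are hypothetical, and
the real routines may differ in detail.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout for this sketch only. */
#define SK_HELD     0x80000000u   /* bit 31: lock is held */
#define SK_WAITMASK 0x3fffffffu   /* low 30 bits: number of waiters */

typedef _Atomic uint32_t sk_intr_t;

/* Single attempt: one atomic read-modify-write, counter untouched. */
static bool sk_cond_try(sk_intr_t *p)
{
    return (atomic_fetch_or(p, SK_HELD) & SK_HELD) == 0;
}

/*
 * Acquire. The counter is touched only on the contended path:
 * incremented after the first failed try, decremented once acquired.
 * wait_fn(arg) stands in for whatever the caller does while waiting.
 */
static void sk_cond_enter(sk_intr_t *p, void (*wait_fn)(void *), void *arg)
{
    if (sk_cond_try(p))
        return;                    /* uncontended fast path */
    atomic_fetch_add(p, 1);        /* register as a waiter */
    while (!sk_cond_try(p))
        wait_fn(arg);
    atomic_fetch_sub(p, 1);        /* acquired, no longer waiting */
}

/* Release: clear the held bit and wake waiters if any are registered. */
static void sk_cond_exit(sk_intr_t *p, void (*wake_fn)(void *), void *arg)
{
    uint32_t old = atomic_fetch_and(p, ~SK_HELD);

    assert(old & SK_HELD);         /* caller must hold the lock */
    if (old & SK_WAITMASK)
        wake_fn(arg);
}

/* Hypothetical callback for experiments: counts how often it is called. */
static void sk_count_cb(void *arg)
{
    ++*(int *)arg;
}
```

In this sketch a successful sk_cond_try() followed by sk_cond_exit()
issues two atomic read-modify-write operations, a failed sk_cond_try()
issues one, and an uncontended sk_cond_enter()/sk_cond_exit() pair issues
two.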
This reduces the number of locked bus cycle instructions:
- For "try ok/exit" sequence: 4 -> 2.
- For "try fail": 3 -> 1.
- For uncontended "enter/exit" sequence: 3 -> 2.
For the contended "enter/exit" sequence, this increases the number of
locked bus cycle instructions from 3 to 4. Compared with the cost of the
sleep on that path, the extra instruction should be relatively cheap.
Tested on an 8 HT (i7-3770) box, using kq_accept_server/kq_connect_client:
- 4/4 TX/RX rings device (BCM5719, using MSI-X), slight improvement.
- 8/8 TX/RX rings device (Intel 82580, using MSI-X), slight improvement.
- 1/2 TX/RX rings device (Intel 82599, using MSI), no observable
improvement.