socket/tcp: Implement asynchronous pru_attach for TCP
This commit mainly splits TCP pru_attach into two parts:
- The first part operates on the socket buffer, so it can run directly
in the caller's thread.
- The second part creates and initializes the tcpcb. It still runs in
netisr, but we no longer wait for the result of this operation
(lwkt_sendmsg() is used instead of lwkt_domsg()).
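The difference between the two message calls can be illustrated with a
user-space analogy. This is only a sketch under stated assumptions: it
uses pthreads rather than LWKT, and struct msg, struct port, msg_send(),
msg_do(), worker() and attach_handler() are hypothetical names invented
for the illustration, not kernel interfaces. msg_send() queues the
message and returns at once (like lwkt_sendmsg()); msg_do() queues it
and then blocks until the worker has handled it (like lwkt_domsg()).

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical message: a handler to run on the worker ("netisr") thread. */
struct msg {
    struct msg *next;
    void (*handler)(struct msg *);
    int done;    /* set by the worker once the handler has run */
    int result;  /* filled in by the handler */
};

/* Hypothetical FIFO message port served by one worker thread. */
struct port {
    pthread_mutex_t lock;
    pthread_cond_t cv;
    struct msg *head, *tail;
    int stop;
};

static void port_init(struct port *p)
{
    pthread_mutex_init(&p->lock, NULL);
    pthread_cond_init(&p->cv, NULL);
    p->head = p->tail = NULL;
    p->stop = 0;
}

/* Asynchronous send: queue the message and return without waiting. */
static void msg_send(struct port *p, struct msg *m)
{
    assert(m->handler != NULL);
    m->next = NULL;
    m->done = 0;
    pthread_mutex_lock(&p->lock);
    if (p->tail != NULL)
        p->tail->next = m;
    else
        p->head = m;
    p->tail = m;
    pthread_cond_broadcast(&p->cv);
    pthread_mutex_unlock(&p->lock);
}

/* Synchronous send: queue the message, then block until it is handled. */
static int msg_do(struct port *p, struct msg *m)
{
    msg_send(p, m);
    pthread_mutex_lock(&p->lock);
    while (!m->done)
        pthread_cond_wait(&p->cv, &p->lock);
    pthread_mutex_unlock(&p->lock);
    return m->result;
}

/* Ask the worker to exit once the queue has been drained. */
static void port_stop(struct port *p)
{
    pthread_mutex_lock(&p->lock);
    p->stop = 1;
    pthread_cond_broadcast(&p->cv);
    pthread_mutex_unlock(&p->lock);
}

/* Worker loop: dequeue messages in FIFO order and run their handlers. */
static void *worker(void *arg)
{
    struct port *p = arg;

    pthread_mutex_lock(&p->lock);
    for (;;) {
        while (p->head == NULL && !p->stop)
            pthread_cond_wait(&p->cv, &p->lock);
        if (p->head == NULL)
            break;  /* stop requested and nothing left to handle */
        struct msg *m = p->head;
        p->head = m->next;
        if (p->head == NULL)
            p->tail = NULL;
        pthread_mutex_unlock(&p->lock);
        m->handler(m);  /* run the handler outside the lock */
        pthread_mutex_lock(&p->lock);
        m->done = 1;
        pthread_cond_broadcast(&p->cv);
    }
    pthread_mutex_unlock(&p->lock);
    return NULL;
}

/* Stand-in for the second half of attach: pretend the pcb was set up. */
static void attach_handler(struct msg *m)
{
    m->result = 0;
}
```

In this sketch the port is FIFO, so a message queued with msg_send() is
always handled before any message the same caller queues afterwards;
the caller only gives up learning the result immediately.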
This removes the last lwkt_domsg() from the most commonly used socket APIs.
This is enabled by default and can be disabled by setting the sysctl
kern.ipc.socreate_fast to 0.
Measured effect of this change on a 2-way E5-2600v2 with an Intel 82599
(10GbE), using tools/kq_connect_client:
- Connect rate increases by ~10Kconns/s; we are now doing 395Kconns/s.
- Idle time on the CPUs not running netisrs increases (55% -> 65%).
- The IPI rate to the CPUs not running netisrs drops (40Kipis/s ->
23Kipis/s).