tcp: Implement asynchronous pru_rcvd

This mainly avoids the extra scheduling cost that lwkt_domsg() adds on the
reception path. TCP pru_rcvd is now carried out with lwkt_sendmsg().

Since TCP's pru_rcvd requests can be batched, one pru_rcvd netmsg is
embedded in struct socket and reused by lwkt_sendmsg(), which avoids
allocating a netmsg for each pru_rcvd. Whether the embedded pru_rcvd netmsg
should be sent is determined by its MSG_DONE bit. Since the user thread and
the netisr thread can be on different CPUs, the embedded pru_rcvd netmsg's
MSG_DONE bit is protected by a spinlock.
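
The gating described above can be pictured with a small userspace sketch. It
is not the kernel code: struct rcvd_msg, its fields, and the three helper
functions are hypothetical names, and a C11 atomic_flag spin loop stands in
for the kernel spinlock. It only shows the idea that the single embedded
message is queued when its "done" bit is set and is skipped while it is
still in flight.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the pru_rcvd netmsg embedded in struct socket. */
struct rcvd_msg {
    atomic_flag lock;   /* toy spinlock protecting 'done' */
    bool done;          /* models MSG_DONE: true means idle, may be sent */
    int sends;          /* how many times the message was actually queued */
};

static void rcvd_msg_lock(struct rcvd_msg *m)
{
    /* Spin until the flag was previously clear. */
    while (atomic_flag_test_and_set_explicit(&m->lock, memory_order_acquire))
        ;
}

static void rcvd_msg_unlock(struct rcvd_msg *m)
{
    atomic_flag_clear_explicit(&m->lock, memory_order_release);
}

static void rcvd_msg_init(struct rcvd_msg *m)
{
    atomic_flag_clear(&m->lock);
    m->done = true;
    m->sends = 0;
}

/* User-thread side: queue the message only if it is not already in flight. */
static bool rcvd_msg_try_send(struct rcvd_msg *m)
{
    bool sent = false;

    rcvd_msg_lock(m);
    if (m->done) {
        m->done = false;    /* message is now owned by the handler */
        m->sends++;         /* stands in for lwkt_sendmsg() */
        sent = true;
    }
    rcvd_msg_unlock(m);
    return sent;
}

/* Handler side: mark the message idle again (stands in for the reply). */
static void rcvd_msg_reply(struct rcvd_msg *m)
{
    rcvd_msg_lock(m);
    m->done = true;
    rcvd_msg_unlock(m);
}
```

Several reads that happen while the message is in flight collapse into one
pending request, which is what makes a single embedded message sufficient in
this model.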
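
To cope with the following race, which could drop window updates,
tcp_usr_rcvd() replies to the asynchronous rcvd netmsg before calling
tcp_output(). The diagram shows the problematic ordering, in which the
reply comes after tcp_output():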

    netisr thread                  user thread

    tcp_usr_rcvd()                 sorcvtcp()
    {                              {
        tcp_output()                   :
            :                          :
            :                      sbunlinkmbuf()
            :                      if (rcvd & MSG_DONE)      (2)
            :                          lwkt_sendmsg(rcvd)
            :                          :
        lwkt_replymsg(rcvd)  (1)
    }

At (2) the window update is dropped, since the rcvd netmsg has not yet been
replied to; that only happens at (1).
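
The effect of the ordering can be replayed with a deliberately simplified,
single-threaded model. Everything here is an assumption made for
illustration: struct sim, its fields, and the sim_* functions are
hypothetical, locking is omitted because the interleaving is fixed by hand,
and "advertised" merely stands for whatever tcp_output() has accounted for.
The user thread's read is placed between the two netisr steps, as in the
diagram.

```c
#include <stdbool.h>

struct sim {
    bool msg_done;    /* models MSG_DONE on the embedded rcvd message */
    int freed;        /* bytes the user thread removed from the buffer */
    int advertised;   /* bytes covered by the last window update */
};

/* netisr side: the tcp_output() step advertises everything freed so far. */
static void sim_output(struct sim *s) { s->advertised = s->freed; }

/* netisr side: replying makes the embedded message sendable again. */
static void sim_reply(struct sim *s) { s->msg_done = true; }

/* User side: free n bytes, then request a rcvd only if the message is idle.
 * Returns true if a new request was queued. */
static bool sim_user_read(struct sim *s, int n)
{
    s->freed += n;
    if (!s->msg_done)
        return false;       /* message still in flight: nothing is sent */
    s->msg_done = false;    /* stands in for lwkt_sendmsg(rcvd) */
    return true;
}

/* Replays one handler run with the user read in the middle.
 * Returns true if the freed space is neither advertised nor covered by a
 * newly queued request, i.e. the window update is lost in this model. */
static bool sim_update_lost(bool reply_first)
{
    /* The handler is running, so the message starts out in flight. */
    struct sim s = { .msg_done = false, .freed = 0, .advertised = 0 };
    bool resent;

    if (reply_first) {
        sim_reply(&s);
        resent = sim_user_read(&s, 100);
        sim_output(&s);
    } else {
        sim_output(&s);
        resent = sim_user_read(&s, 100);
        sim_reply(&s);
    }
    return s.advertised != s.freed && !resent;
}
```

In this model sim_update_lost(false) reports a lost update, while
sim_update_lost(true) does not: with the reply issued first, a read that
lands in the middle either queues a fresh request or is picked up by the
output step that follows.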

The result, on an i7-2600 (4C/8T, 3.4GHz), running 32 parallel
netperf -H 127.0.0.1 -t TCP_STREAM -P0 -l 30 (4 runs, unit: Mbps):

    old  30253.88  30242.58  30162.55  30101.51
    new  33962.74  33798.70  33499.92  33482.35

This gives a ~12% performance improvement (mean 30190.13 vs. 33685.93 Mbps).