altq: Implement a two-level "rough" priority queue for the plain sub-queue

The "rough" part comes from two sources:
- The hardware queue can be deep, normally 512 descriptors or more even
  for GigE.
- Round-robin on the transmission queues is used by all of the multiple
  transmission queue capable hardware supported by DragonFly as of this
  commit.

These two sources affect the packet priority set by DragonFly.

DragonFly's "rough" priority queue has only two levels, i.e. high
priority and normal priority, which should be enough.  Each queue has
its own header.  The normal priority queue is dequeued only when there
are no packets in the high priority queue.  During enqueue, if the
sub-queue is full and the high priority queue length is less than half
of the sub-queue length (both packet count and byte count), drop-head
is applied on the normal priority queue.

The M_PRIO mbuf flag is added to mark that the mbuf is destined for the
high priority queue.  Currently TCP uses it to prioritize SYN, SYN|ACK,
and pure ACK w/o FIN and RST.  This behaviour can be turned off via
net.inet.tcp.prio_synack, which is on by default.

The performance improvement!

The test environment (all three boxes use an Intel i7-2600 w/ HT
enabled):

+-----+             +-----+
|     |        +->- emx1  |
|  A  | bnx0 --+    |  B  |  TCP_MAERTS
|     |        |    +-----+
+-----+        |
               |    +-----+
               +-<- emx1  |
                    |  C  |  TCP_STREAM/TCP_RR
                    +-----+

A's kernel has this commit compiled in.  bnx0 has all four transmission
queues enabled.  For bnx0, the hardware's transmission queue round-robin
is on TSO segment boundaries.

Some baseline measurements:

B<--A TCP_MAERTS (raw stats, 128 clients): 984 Mbps
      (tcp_stream -H A -l 15 -i 128 -r)
C-->A TCP_STREAM (128 clients):            942 Mbps
      (tcp_stream -H A -l 15 -i 128)
C-->A TCP_CC (768 clients):             221199 conns/s
      (tcp_cc -H A -l 15 -i 768)

To effectively measure TCP_CC, the prefix route's MSL is changed to
10ms:
      route change 10.1.0.0/24 -msl 10

All stats gathered in the following measurements are below the baseline
measurements (well, they should be).

C-->A TCP_CC improvement, while the B<--A TCP_MAERTS test is running:

                       TCP_MAERTS (raw)   TCP_CC
TSO     prio_synack=1  948 Mbps           15988 conns/s
TSO     prio_synack=0  965 Mbps            8867 conns/s
non-TSO prio_synack=1  943 Mbps           18128 conns/s
non-TSO prio_synack=0  959 Mbps           11371 conns/s

* 80% TCP_CC performance improvement w/ TSO and 60% w/o TSO!

C-->A TCP_STREAM improvement, while the B<--A TCP_MAERTS test is
running:

                       TCP_MAERTS (raw)   TCP_STREAM
TSO     prio_synack=1  969 Mbps           920 Mbps
TSO     prio_synack=0  969 Mbps           865 Mbps
non-TSO prio_synack=1  969 Mbps           920 Mbps
non-TSO prio_synack=0  969 Mbps           879 Mbps

* 6% TCP_STREAM performance improvement w/ TSO and 4% w/o TSO.
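The enqueue/dequeue policy described by the commit above can be modeled
in userland roughly as follows.  This is a minimal sketch, not the
kernel code: the fixed-size rings, the `struct pkt`/`struct subq` names
and the packet-count-only limit are illustrative assumptions (the
kernel applies the same test to the byte count as well).

```c
/*
 * Userland model of the two-level "rough" priority queue: the normal
 * queue is served only when the high priority queue is empty, and a
 * full sub-queue drop-heads the normal queue as long as the high
 * priority queue holds less than half of the sub-queue limit.
 */
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SUBQ_MAXLEN	8	/* illustrative; real queues are deeper */

struct pkt {
	int	id;
	bool	prio;		/* M_PRIO analogue */
};

struct subq {
	struct pkt	hi[SUBQ_MAXLEN];	/* high priority queue */
	struct pkt	lo[SUBQ_MAXLEN];	/* normal priority queue */
	int		hi_len;
	int		lo_len;
};

static int
subq_len(const struct subq *q)
{
	return (q->hi_len + q->lo_len);
}

/*
 * Enqueue one packet; returns false if it had to be dropped.
 */
static bool
subq_enqueue(struct subq *q, struct pkt p)
{
	if (subq_len(q) >= SUBQ_MAXLEN) {
		if (q->hi_len < SUBQ_MAXLEN / 2 && q->lo_len > 0) {
			/* drop-head on the normal priority queue */
			memmove(&q->lo[0], &q->lo[1],
			    (q->lo_len - 1) * sizeof(struct pkt));
			q->lo_len--;
		} else {
			return (false);	/* drop the new packet */
		}
	}
	if (p.prio)
		q->hi[q->hi_len++] = p;
	else
		q->lo[q->lo_len++] = p;
	return (true);
}

/*
 * Dequeue: normal priority only when the high priority queue is empty.
 */
static bool
subq_dequeue(struct subq *q, struct pkt *p)
{
	if (q->hi_len > 0) {
		*p = q->hi[0];
		memmove(&q->hi[0], &q->hi[1],
		    (q->hi_len - 1) * sizeof(struct pkt));
		q->hi_len--;
		return (true);
	}
	if (q->lo_len > 0) {
		*p = q->lo[0];
		memmove(&q->lo[0], &q->lo[1],
		    (q->lo_len - 1) * sizeof(struct pkt));
		q->lo_len--;
		return (true);
	}
	return (false);
}
```

With this policy a burst of bulk-data packets cannot starve a SYN,
SYN|ACK or pure ACK: the prioritized packet evicts the oldest normal
packet instead of being tail-dropped.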
altq: Add byte based limit and counter

- This avoids having too many mbufs sitting on the send queue for TSO
  capable devices.  Even though DragonFly already limits a TSO burst to
  at most 4 TCP segments by default, a TSO capable device could still
  have 4 times as many mbufs sitting on its send queue as a non-TSO
  capable device.
- This paves the way for AQMs that require a send queue byte counter,
  e.g. CoDel.

For ethernet devices, the byte based limit is (1514 x max_packets).
For other devices, e.g. pseudo devices, the byte based limit is
(MCLBYTES x max_packets).
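The limit computation above is simple enough to sketch.  The structure
and function names here are illustrative, not the kernel's; only the
constants (1514 per ethernet frame, MCLBYTES otherwise) come from the
commit.

```c
/*
 * Sketch of the byte based sub-queue limit: a packet is admitted only
 * when both the packet counter and the byte counter have room.
 */
#include <assert.h>
#include <stdbool.h>

#define MCLBYTES	2048	/* mbuf cluster size */
#define ETHER_FRAMELEN	1514	/* max ethernet frame w/o FCS */

struct ifq_limits {
	int	max_packets;	/* packet based limit */
	long	max_bytes;	/* byte based limit */
	int	pkts;		/* current packet counter */
	long	bytes;		/* current byte counter */
};

static void
ifq_set_maxlen(struct ifq_limits *q, int max_packets, bool is_ether)
{
	q->max_packets = max_packets;
	q->max_bytes = (long)(is_ether ? ETHER_FRAMELEN : MCLBYTES) *
	    max_packets;
}

/* Both limits must have room for the packet to be enqueued. */
static bool
ifq_room(const struct ifq_limits *q, long pktlen)
{
	return (q->pkts < q->max_packets &&
	    q->bytes + pktlen <= q->max_bytes);
}
```

An AQM such as CoDel needs exactly the `bytes` counter maintained here
to estimate queue sojourn, which is the second motivation listed above.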
ifsq: Let ifaltq_subque know its related hardware TX queue's serializer

This avoids the following operations on the packet transmission hot
path:
- Dereferencing device driver supplied serialize function pointers
- Locating the hardware TX queue's serializer

Compared to the lwkt_serialize functions, the above two operations are
costly.

Driver changes:
- For device drivers which use the default ifnet serializer, no
  additional code is needed; if_attach() will assign the ifnet
  serializer to the ifaltq_subque.
- For device drivers which use independent serializers for the main
  function, RX queues and TX queues, ifsq_set_hw_serialize() must be
  called to properly assign the hardware TX queue's serializer to the
  ifaltq_subque.  Drivers in this category are bce(4), emx(4), igb(4)
  and jme(4).
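The idea is just pointer caching: resolve the serializer once at attach
time and store it in the sub-queue.  A minimal userland sketch, with
lwkt_serialize reduced to a lock word and all names illustrative:

```c
/*
 * Cache the hardware TX queue's serializer in the sub-queue so the
 * transmit hot path does not chase driver-supplied function pointers.
 */
#include <assert.h>
#include <stddef.h>

struct lwkt_serialize {
	int	locked;		/* stand-in for the real serializer */
};

struct ifaltq_subque {
	struct lwkt_serialize	*ifsq_hw_serialize;
};

/* Called at attach time by drivers with independent TX serializers. */
static void
ifsq_set_hw_serialize(struct ifaltq_subque *ifsq,
    struct lwkt_serialize *slz)
{
	ifsq->ifsq_hw_serialize = slz;
}

/* Hot path: direct access, no indirection through the driver. */
static void
ifsq_serialize_hw(struct ifaltq_subque *ifsq)
{
	ifsq->ifsq_hw_serialize->locked = 1;
}

static void
ifsq_deserialize_hw(struct ifaltq_subque *ifsq)
{
	ifsq->ifsq_hw_serialize->locked = 0;
}
```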
if: Multiple TX queue support step 3 of 3; map CPUID to subqueue

Add a CPUID to subqueue mapping method to ifaltq.  A driver can provide
its own CPUID to subqueue mapping method through ifnet.if_mapsubq,
which is used when no ALTQ packet scheduler is enabled.  ALTQ's packet
schedulers always map CPUID to the default subqueue.
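The hook can be sketched as below.  The modulo default mapping is an
assumption for illustration; the commit only specifies that drivers may
install their own method and that ALTQ schedulers always pick the
default subqueue.

```c
/*
 * CPUID to sub-queue mapping hook: the driver installs a mapping
 * function; ALTQ packet schedulers pin everything to sub-queue 0.
 */
#include <assert.h>

struct ifaltq {
	int	altq_subq_cnt;		/* number of sub-queues */
	int	(*altq_mapsubq)(struct ifaltq *, int);
};

/* Illustrative default: spread CPUs across sub-queues. */
static int
ifq_mapsubq_default(struct ifaltq *ifq, int cpuid)
{
	return (cpuid % ifq->altq_subq_cnt);
}

/* Used when an ALTQ packet scheduler is enabled. */
static int
ifq_mapsubq_altq(struct ifaltq *ifq, int cpuid)
{
	(void)ifq;
	(void)cpuid;
	return (0);		/* default sub-queue only */
}

static int
ifq_map_subq(struct ifaltq *ifq, int cpuid)
{
	return (ifq->altq_mapsubq(ifq, cpuid));
}
```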
if: Multiple TX queue support step 1 of many; introduce ifaltq subqueue

Put the plain queue information, e.g. queue head and tail, serializer,
packet staging scoreboard and ifnet.if_start schedule netmsg etc., into
its own structure (subqueue).  The ifaltq structure can contain
multiple subqueues, based on the count that drivers specify.

A subqueue's enqueue, dequeue, purging and state updating are protected
by the subqueue's serializer, so for hardware supporting multiple TX
queues, contention on queuing operations can be greatly reduced.

The subqueue is passed to if_start to let the driver know which
hardware TX queue to work on.  Only the related driver TX queue's
serializer is held, so for hardware supporting multiple TX queues,
contention on the driver's TX queue serializer can be greatly reduced.

A bunch of ifsq_ prefixed functions are added, which are used to
perform various operations on subqueues.  Commonly used ifq_ prefixed
functions are still kept, mainly for the drivers which do not support
multiple TX queues (well, these functions also ease the netif/
conversion in this step :).

All of the pseudo network devices under sys/net are converted to use
the new subqueue operations.  netproto/802_11 is converted too.  igb(4)
is converted to use the new subqueue operations; the rest of the
network drivers are only changed for the if_start interface
modification.

For ALTQs which have a packet scheduler enabled, only the first
subqueue is used (*).

(*) Whether we should utilize multiple TX queues when an ALTQ packet
scheduler is enabled is quite questionable, mainly because the
hardware's multiple TX queue packet dequeue mechanism could have a
negative impact on the ALTQ packet scheduler's decisions.
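The ifaltq/subqueue split can be modeled as follows.  The field names
and the allocation helper are illustrative; the point is only the shape
of the data: one ifaltq owning a driver-sized array of sub-queues, each
with its own per-queue state.

```c
/*
 * Userland model of the ifaltq / sub-queue relationship: the driver
 * specifies the sub-queue count (e.g. one per hardware TX queue), and
 * the ifq_ prefixed compat wrappers operate on the first sub-queue.
 */
#include <assert.h>
#include <stdlib.h>

struct ifaltq_subque {
	int	ifsq_index;	/* which hardware TX queue this maps to */
	int	ifsq_len;	/* queued packet count */
	int	ifsq_started;	/* if_start interlock */
};

struct ifaltq {
	int			altq_subq_cnt;
	struct ifaltq_subque	*altq_subq;
};

/* Driver sets the sub-queue count at attach time. */
static void
ifq_set_subq_cnt(struct ifaltq *ifq, int cnt)
{
	ifq->altq_subq_cnt = cnt;
	ifq->altq_subq = calloc(cnt, sizeof(struct ifaltq_subque));
	for (int i = 0; i < cnt; i++)
		ifq->altq_subq[i].ifsq_index = i;
}

/* ifq_ prefixed compat functions use the first (default) sub-queue. */
static struct ifaltq_subque *
ifq_get_subq_default(struct ifaltq *ifq)
{
	return (&ifq->altq_subq[0]);
}
```

Because each sub-queue carries its own serializer and state in the real
structure, two CPUs enqueueing to different hardware TX queues never
contend on the same lock.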
if: Move if_cpuid into ifaltq; prepare multiple TX queues support

if_cpuid and if_npoll_cpuid are merged and moved into ifaltq as
altq_cpuid, which indicates the owner CPU of the TX queue.  Since we
already have code in if_start_dispatch() to catch TX queue owner CPU
changes, this merge is quite safe.
if: Move IFF_OACTIVE bit into ifaltq; prepare multiple TX queues support

ifaltq.altq_hw_oactive is now used to record that the NIC's TX queue is
full.  IFF_OACTIVE is removed from the kernel; the user space
IFF_OACTIVE is kept for compatibility.

ifaltq.altq_hw_oactive should not be accessed directly.  The following
set of functions is provided and should be used:
ifq_is_oactive(ifnet.if_snd)  - Whether the NIC's TX queue is full
ifq_set_oactive(ifnet.if_snd) - The NIC's TX queue is full
ifq_clr_oactive(ifnet.if_snd) - The NIC's TX queue is no longer full
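The three accessors are plain wrappers around the new field.  A
userland sketch (the struct is reduced to the one field relevant here;
in the kernel they take ifnet.if_snd):

```c
/*
 * oactive accessors replacing the IFF_OACTIVE interface flag: the
 * "TX queue full" state lives in the ifaltq and is only touched
 * through these helpers.
 */
#include <assert.h>
#include <stdbool.h>

struct ifaltq {
	int	altq_hw_oactive;	/* NIC TX queue full */
};

static bool
ifq_is_oactive(const struct ifaltq *ifq)
{
	return (ifq->altq_hw_oactive != 0);
}

static void
ifq_set_oactive(struct ifaltq *ifq)
{
	ifq->altq_hw_oactive = 1;
}

static void
ifq_clr_oactive(struct ifaltq *ifq)
{
	ifq->altq_hw_oactive = 0;
}
```

Keeping the state per-ifaltq rather than in if_flags is what lets each
future TX queue track "full" independently instead of sharing one
interface-wide bit.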
ifq/staging: Perform IFQ packet staging for if_start scheduling

IFQ packet staging is now performed for ifnet's if_start scheduling,
i.e. if_start_schedule(), in addition to direct ifnet if_start calling.

The IFQ packet staging stopping condition
- The if_start interlock (if_snd.altq_started) is not released.
is now changed to
- if_start_schedule() is not pending on the current CPU and the
  if_start interlock (if_snd.altq_started) is not released.

By setting net.link.stage_cntmax to 8 and hw.igbX.tx_wreg_nsegs to 16,
the following performance improvement is gained:
+80Kpps for normal IP forwarding
+30Kpps for fast IP forwarding
ifq/staging: Initial implementation of the IFQ packet staging mechanism

The packets enqueued into IFQ are staged to a certain amount before the
ifnet's if_start is called.  In this way, the driver can avoid writing
to hardware registers upon every packet; instead, hardware registers
can be written once a certain amount of packets have been put onto the
hardware TX ring.  Measurement on several modern NICs (emx(4), igb(4),
bnx(4), bge(4), jme(4)) shows that aggregating the hardware register
writes can save ~20% CPU time when 18-byte UDP datagrams are
transmitted at 1.48Mpps.

IFQ packet staging is performed for direct ifnet if_start calling, i.e.
ifq_try_ifstart().

IFQ packet staging will be stopped upon any of the following
conditions:
- The count of packets enqueued on the current CPU is greater than or
  equal to ifq_stage_cntmax.
- The total length of packets enqueued on the current CPU is greater
  than or equal to the hardware's MTU - max_protohdr.  max_protohdr is
  subtracted from the hardware's MTU mainly because a full TCP
  segment's size is usually less than the hardware's MTU.
- The if_start interlock (if_snd.altq_started) is not released.
- if_start_rollup(), which is registered as a low priority netisr
  rollup function, is called; probably because no more work is pending
  for netisr.

Currently IFQ packet staging is only performed in netisr threads.

Inspired-by: Luigi Rizzo's netmap paper
             (http://info.iet.unipi.it/~luigi/netmap/)
Also-Suggested-by: dillon@
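The first two stopping conditions, the per-CPU counter thresholds, can
be sketched directly; the interlock and rollup conditions depend on
netisr state and are omitted here.  The `struct ifq_stage` layout and
the default value of 4 are illustrative assumptions.

```c
/*
 * Sketch of the staging cut-off: packets are staged per CPU until the
 * staged packet count reaches stage_cntmax or the staged bytes reach
 * MTU - max_protohdr, at which point if_start is dispatched.
 */
#include <assert.h>
#include <stdbool.h>

struct ifq_stage {
	int	stg_pktcnt;	/* packets staged on this CPU */
	long	stg_bytes;	/* bytes staged on this CPU */
};

static int	ifq_stage_cntmax = 4;	/* net.link.stage_cntmax analogue */

/*
 * Returns true when staging must stop and if_start be called, i.e.
 * when either the packet count or the byte threshold is reached.
 */
static bool
ifq_stage_done(const struct ifq_stage *stage, long mtu, long max_protohdr)
{
	if (stage->stg_pktcnt >= ifq_stage_cntmax)
		return (true);
	if (stage->stg_bytes >= mtu - max_protohdr)
		return (true);
	return (false);
}
```

The byte threshold means a single full-sized TCP segment is always
flushed immediately, which is why max_protohdr is subtracted from the
MTU.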