.\" Copyright (c) 2011-2013 Matteo Landi, Luigi Rizzo, Universita` di Pisa .\" All rights reserved. .\" .\" Redistribution and use in source and binary forms, with or without .\" modification, are permitted provided that the following conditions .\" are met: .\" 1. Redistributions of source code must retain the above copyright .\" notice, this list of conditions and the following disclaimer. .\" 2. Redistributions in binary form must reproduce the above copyright .\" notice, this list of conditions and the following disclaimer in the .\" documentation and/or other materials provided with the distribution. .\" .\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE .\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF .\" SUCH DAMAGE. .\" .\" This document is derived in part from the enet man page (enet.4) .\" distributed with 4.3BSD Unix. .\" .\" $FreeBSD: head/share/man/man4/netmap.4 228017 2011-11-27 06:55:57Z gjb $ .\" .Dd May 25, 2019 .Dt NETMAP 4 .Os .Sh NAME .Nm netmap .Nd a framework for fast packet I/O .Sh SYNOPSIS .Cd device netmap .Sh DESCRIPTION .Nm is a framework for extremely fast and efficient packet I/O (reaching 14.88 Mpps with a single core at less than 1 GHz) for both userspace and kernel clients. Userspace clients can use the .Nm API to send and receive raw packets through physical interfaces or ports of the .Xr vale 4 switch. .Pp .Xr vale 4 is a very fast (reaching 20 Mpps per port) and modular software switch, implemented within the kernel, which can interconnect virtual ports, physical devices, and the native host stack. .Pp .Nm uses a memory mapped region to share packet buffers, descriptors and queues with the kernel. .Xr ioctl 2 is used to bind interfaces/ports to file descriptors and implement non-blocking I/O, whereas blocking I/O uses .Xr select 2 and .Xr poll 2 . .Nm can exploit the parallelism in multiqueue devices and multicore systems. .Pp For the best performance, .Nm requires explicit support in device drivers; a generic emulation layer is available to implement the .Nm API on top of unmodified device drivers, at the price of reduced performance (but still better than what can be achieved with .Xr socket 2 , .Xr bpf 4 , or .Xr pcap 3 ) . .Pp For a list of devices with native .Nm support, see section .Sx SUPPORTED INTERFACES at the end of this manual page. .Sh OPERATING THE API .Nm clients must first issue the following code to open the device node and to bind the file descriptor to a specific interface or port: .Bd -literal -offset indent fd = open("/dev/netmap"); ioctl(fd, NIOCREGIF, (struct nmreq *)arg); .Ed .Pp .Nm has multiple modes of operation controlled by the content of the .Vt struct nmreq passed to .Xr ioctl 2 . In particular, the .Va nr_name field specifies whether the client operates on a physical network interface or on a port of a .Xr vale 4 switch, as indicated below. 
Additional fields in the
.Vt struct nmreq
control the details of operation.
.Bl -tag -width XXXX
.It Sy Interface name (e.g. 'em0', 'eth1', ...)
The data path of the interface is disconnected from the host stack.
Depending on additional arguments, the file descriptor is bound to
the NIC (one or all queues), or to the host stack.
.It Sy valeXXX:YYY (arbitrary XXX and YYY)
The file descriptor is bound to port YYY of a
.Xr vale 4
switch called XXX, where XXX and YYY are arbitrary alphanumeric strings.
The full name cannot exceed IFNAMSIZ characters, and YYY cannot match
the name of any existing interface.
.Pp
The switch and the port are created if they do not already exist.
.It Sy valeXXX:ifname (ifname is an existing interface)
Flags in the argument control whether the physical interface
(and optionally the corresponding host stack endpoint)
is connected to or disconnected from the
.Xr vale 4
switch named XXX.
.Pp
In this case
.Xr ioctl 2
is used only for configuring the
.Xr vale 4
switch, typically through the
.Cm vale-ctl
command.
The file descriptor cannot be used for I/O, and should be passed to
.Xr close 2
after issuing
.Xr ioctl 2 .
.El
.Pp
The binding can be removed (and the interface returns to regular
operation, or the virtual port is destroyed) with a
.Xr close 2
on the file descriptor.
.Pp
The process owning the file descriptor can then
.Xr mmap 2
the memory region that contains pre-allocated buffers, descriptors and
queues, and use them to read/write raw packets.
Non-blocking I/O is done with special
.Xr ioctl 2
commands, whereas the file descriptor can be passed to
.Xr select 2
and
.Xr poll 2
to be notified about incoming packets or available transmit buffers.
.Ss DATA STRUCTURES
The data structures in the mmapped memory are described below
(see
.In net/netmap.h
for reference).
All physical devices operating in
.Nm
mode use the same memory region, shared by the kernel and all processes
that own
.Pa /dev/netmap
descriptors bound to those devices
(NOTE: visibility may be restricted in future implementations).
Virtual ports instead use separate memory regions, shared only with
the kernel.
.Pp
All references between the shared data structures are relative
(offsets or indexes).
Some macros help convert them into actual pointers.
.Bl -tag -width XXXX
.It Sy struct netmap_if (one per interface)
indicates the number of rings supported by an interface, their sizes,
and the offsets of the
.Nm
rings associated with the interface.
.Pp
.Vt struct netmap_if
is at offset
.Va nr_offset
in the shared memory region, as indicated by the corresponding field
of the
.Vt struct nmreq
returned by
.Dv NIOCREGIF .
.Bd -literal
struct netmap_if {
    char          ni_name[IFNAMSIZ]; /* name of the interface.    */
    const u_int   ni_version;        /* API version               */
    const u_int   ni_rx_rings;       /* number of rx ring pairs   */
    const u_int   ni_tx_rings;       /* if 0, same as ni_rx_rings */
    const ssize_t ring_ofs[];        /* offset of tx and rx rings */
};
.Ed
.It Sy struct netmap_ring (one per ring)
Contains the positions in the transmit and receive rings used to
synchronize the kernel and the application, and an array of
.Nm
slots describing the buffers.
.Va reserved
is used in receive rings to tell the kernel how many slots before
.Va cur
are still in use by the application and cannot yet be released.
.Pp
Each physical interface has one
.Vt struct netmap_ring
for each hardware transmit and receive ring, plus one extra transmit
and one extra receive structure that connect to the host stack.
.Bd -literal
struct netmap_ring {
    const ssize_t  buf_ofs;     /* see details */
    const uint32_t num_slots;   /* number of slots in the ring */
    uint32_t       avail;       /* number of usable slots */
    uint32_t       cur;         /* 'current' read/write index */
    uint32_t       reserved;    /* not refilled before current */

    const uint16_t nr_buf_size;
    uint16_t       flags;
#define NR_TIMESTAMP 0x0002     /* set timestamp on *sync() */
#define NR_FORWARD   0x0004     /* enable NS_FORWARD for ring */
#define NR_RX_TSTMP  0x0008     /* set rx timestamp in slots */
    struct timeval ts;
    struct netmap_slot slot[0]; /* array of slots */
};
.Ed
.Pp
In transmit rings, after a system call
.Va cur
indicates the first slot that can be used for transmissions, and
.Va avail
reports how many of them are available.
Before the next
.Nm Ns -related
system call on the file descriptor, the application should fill buffers
and slots with data, and update
.Va cur
and
.Va avail
accordingly, as shown in the figure below:
.Bd -literal
          cur
          |----- avail ---|   (after syscall)
          v
 TX  [*****aaaaaaaaaaaaaaaaa**]
 TX  [*****TTTTTaaaaaaaaaaaa**]
               ^
               |-- avail --|   (before syscall)
              cur
.Ed
.Pp
In receive rings, after a system call
.Va cur
indicates the first slot that contains a valid packet, and
.Va avail
reports how many of them are available.
Before the next
.Nm Ns -related
system call on the file descriptor, the application can process buffers
and release them to the kernel, updating
.Va cur
and
.Va avail
accordingly, as shown in the figure below.
Receive rings have an additional field called
.Va reserved
to indicate how many buffers before
.Va cur
cannot be released because they are still being processed.
.Bd -literal
             cur
       |-res-|-- avail --|   (after syscall)
             v
 RX  [**rrrrrrRRRRRRRRRRRR******]
 RX  [**...........rrrrRRR******]
                  |res|--|   (before syscall)
                         ^
                        cur
.Ed
.It Sy struct netmap_slot (one per buffer)
Contains the metadata for a packet:
.Bd -literal
struct netmap_slot {
    uint32_t buf_idx;   /* buffer index */
    uint16_t len;       /* packet length */
    uint16_t flags;     /* buf changed, etc. */
#define NS_BUF_CHANGED  0x0001  /* must resync, buffer changed */
#define NS_REPORT       0x0002  /* tell hw to report results */
#define NS_FORWARD      0x0004  /* pass packet to the other endpoint */
#define NS_NO_LEARN     0x0008  /* disable src MAC lookup in the bridge */
#define NS_INDIRECT     0x0010  /* payload is in a user-supplied buffer */
#define NS_MOREFRAG     0x0020  /* packet continues in the next slot */
#define NS_PORT_SHIFT   8
#define NS_PORT_MASK    (0xff << NS_PORT_SHIFT)
#define NS_RFRAGS(_slot)        ( ((_slot)->flags >> 8) & 0xff)
    uint64_t ptr;       /* buffer address (indirect buffers) */
};
.Ed
.Pp
The flags control how the buffer associated with the slot should be
managed.
.It Sy packet buffers
are normally fixed-size (2 Kbyte) buffers allocated by the kernel that
contain packet data.
.El
.Pp
Addresses are computed through macros in order to support access to
objects in the shared memory region, e.g.:
.Bl -tag -width ".Fn NETMAP_BUF ring buf_idx"
.It Fn NETMAP_TXRING nifp i
Returns the address of the
.Va i Ns -th
transmit ring.
.It Fn NETMAP_RXRING nifp i
Returns the address of the
.Va i Ns -th
receive ring.
.It Fn NETMAP_BUF ring buf_idx
Returns the address of the buffer with index
.Va buf_idx
(which can be part of any ring for the given interface).
.El
.Ss FLAGS
Normally, buffers are associated with slots when interfaces are bound,
and one packet is fully contained in a single buffer.
Clients can, however, modify the mapping using the following flags:
.Bl -tag -width ".Fn NS_RFRAGS slot"
.It Dv NS_BUF_CHANGED
indicates that the
.Va buf_idx
in the slot has changed.
This can be useful if the client wants to implement some form of
zero-copy forwarding (e.g. by passing buffers from an input interface
to an output interface), or needs to process packets out of order.
.Pp
The flag MUST be used whenever the buffer index is changed.
.It Dv NS_REPORT
indicates that we want to be woken up when this buffer has been
transmitted.
This reduces performance but ensures a prompt notification when a
buffer has been sent.
Normally,
.Nm
notifies transmit completions in batches, hence signals may be delayed
indefinitely.
However, such notifications are needed before closing a descriptor.
.It Dv NS_FORWARD
When the device is opened in
.Sq transparent
mode, the client can mark slots in receive rings with this flag.
The corresponding packets are forwarded to the other endpoint at the
next system call, thus restoring (in a selective way) the connection
between the NIC and the host stack.
.It Dv NS_NO_LEARN
tells the forwarding code that the SRC MAC address for this packet
should not be used in the learning bridge.
.It Dv NS_INDIRECT
indicates that the packet's payload is not in the
.Nm Ns -supplied
buffer, but in a user-supplied buffer whose user virtual address is in
the
.Va ptr
field of the slot.
The size can reach 65535 bytes.
This is only supported on the transmit ring of virtual ports.
.It Dv NS_MOREFRAG
indicates that the packet continues with subsequent buffers;
the last buffer in a packet must have the flag cleared.
The maximum length of a chain is 64 buffers.
This is only supported on virtual ports.
.It Fn NS_RFRAGS slot
on receive rings, returns the number of remaining buffers in a packet,
including this one.
Slots with a value greater than 1 also have
.Dv NS_MOREFRAG
set.
The length refers to the individual buffer; there is no field for the
total length.
.Pp
On transmit rings, if
.Dv NS_DST
is set, it is passed to the lookup function, which can use it e.g. as
the index of the destination port instead of doing an address lookup.
.El
.Sh SYSTEM CALLS
.Nm
supports
.Xr ioctl 2
commands to synchronize the state of the rings between the kernel and
the user processes, as well as to query and configure the interface.
The former do not require any argument, whereas the latter use a
.Vt struct nmreq
defined as follows:
.Bd -literal
struct nmreq {
    char      nr_name[IFNAMSIZ];
    uint32_t  nr_version;    /* API version */
#define NETMAP_API        4      /* current version */
    uint32_t  nr_offset;     /* nifp offset in the shared region */
    uint32_t  nr_memsize;    /* size of the shared region */
    uint32_t  nr_tx_slots;   /* slots in tx rings */
    uint32_t  nr_rx_slots;   /* slots in rx rings */
    uint16_t  nr_tx_rings;   /* number of tx rings */
    uint16_t  nr_rx_rings;   /* number of rx rings */
    uint16_t  nr_ringid;     /* ring(s) we care about */
#define NETMAP_HW_RING    0x4000 /* low bits indicate one hw ring */
#define NETMAP_SW_RING    0x2000 /* we process the sw ring */
#define NETMAP_NO_TX_POLL 0x1000 /* no gratuitous txsync on poll */
#define NETMAP_RING_MASK  0xfff  /* the actual ring number */
    uint16_t  nr_cmd;
#define NETMAP_BDG_ATTACH     1  /* attach the NIC */
#define NETMAP_BDG_DETACH     2  /* detach the NIC */
#define NETMAP_BDG_LOOKUP_REG 3  /* register lookup function */
#define NETMAP_BDG_LIST       4  /* get bridge's info */
    uint16_t  nr_arg1;
    uint16_t  nr_arg2;
    uint32_t  spare2[3];
};
.Ed
.Pp
A device descriptor obtained through
.Pa /dev/netmap
supports the
.Xr ioctl 2
command codes supported by network devices, as well as specific command
codes defined in
.In net/netmap.h .
These specific command codes are as follows:
.Bl -tag -width ".Dv NIOCTXSYNC"
.It Dv NIOCGINFO
returns
.Dv EINVAL
if the named device does not support
.Nm .
Otherwise, it returns zero and advisory information about the
interface.
Note that all the information below can change before the interface is
actually put into
.Nm
mode.
.Pp
.Va nr_memsize
indicates the size of the
.Nm
memory region.
Physical devices all share the same memory region, whereas
.Xr vale 4
ports may have independent regions for each port.
These sizes can be set through system-wide
.Xr sysctl 8
variables.
.Va nr_tx_slots
and
.Va nr_rx_slots
indicate the size of transmit and receive rings, respectively.
.Va nr_tx_rings
and
.Va nr_rx_rings
indicate the number of transmit and receive rings, respectively.
Both ring number and size may be configured at runtime using
interface-specific functions (e.g.\&
.Xr sysctl 8
on BSD, or
.Xr ethtool 8
on Linux).
.It Dv NIOCREGIF
puts the interface specified via
.Va nr_name
into
.Nm
mode, disconnecting it from the host stack, and/or defines which rings
are controlled through this file descriptor.
On return, it gives the same info as
.Dv NIOCGINFO ,
and
.Va nr_ringid
indicates the identity of the rings controlled through the file
descriptor.
.Pp
Possible values for
.Va nr_ringid
are as follows:
.Bl -tag -width ".Dv NETMAP_HW_RING + i"
.It Li 0
default; all hardware rings
.It Dv NETMAP_SW_RING
the
.Dq host rings
connecting to the host stack
.It Dv NETMAP_HW_RING + i
the i-th hardware ring
.El
.Pp
By default, a
.Xr poll 2
or
.Xr select 2
call pushes out any pending packets on the transmit ring, even if no
write events were specified.
The feature can be disabled by OR-ing the flag
.Dv NETMAP_NO_TX_POLL
into
.Va nr_ringid .
Normally, you should keep this feature enabled unless you are using
separate file descriptors for the send and receive rings, because
otherwise packets are pushed out only if
.Dv NIOCTXSYNC
is called, or the send queue is full.
.Pp
.Dv NIOCREGIF
can be used multiple times to change the association of a file
descriptor to a ring pair, always within the same device.
.Pp
When registering a virtual interface that is dynamically created on a
.Xr vale 4
switch, we can specify the desired number of rings (1 by default, and
currently up to 16) by setting the
.Va nr_tx_rings
and
.Va nr_rx_rings
fields accordingly.
.It Dv NIOCTXSYNC
tells the hardware about new packets to transmit, and updates the
number of slots available for transmission.
.It Dv NIOCRXSYNC
tells the hardware about consumed packets, and asks for newly
available packets.
.El
.Pp
.Nm
uses
.Xr select 2
and
.Xr poll 2
to wake up processes when significant events occur, and
.Xr mmap 2
to map memory.
.Pp
Applications may need to create threads and bind them to specific
cores to improve performance, using standard OS primitives; see
.Xr pthread 3 .
In particular,
.Xr pthread_setaffinity_np 3
may be of use.
.Sh EXAMPLES
The following code implements a traffic generator:
.Bd -literal
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/poll.h>
#include <net/netmap.h>
#include <net/netmap_user.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <strings.h>

int
main(void)
{
    struct netmap_if *nifp;
    struct netmap_ring *ring;
    struct pollfd fds;
    struct nmreq nmr;
    void *p;
    int fd;

    fd = open("/dev/netmap", O_RDWR);
    bzero(&nmr, sizeof(nmr));
    strcpy(nmr.nr_name, "ix0");
    nmr.nr_version = NETMAP_API;
    ioctl(fd, NIOCREGIF, &nmr);
    p = mmap(0, nmr.nr_memsize, PROT_WRITE | PROT_READ,
        MAP_SHARED, fd, 0);
    nifp = NETMAP_IF(p, nmr.nr_offset);
    ring = NETMAP_TXRING(nifp, 0);
    fds.fd = fd;
    fds.events = POLLOUT;

    for (;;) {
        poll(&fds, 1, -1);
        for (; ring->avail > 0; ring->avail--) {
            uint32_t i;
            void *buf;

            i = ring->cur;
            buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
            /* prepare the packet in buf */
            ring->slot[i].len = 0;  /* set to the actual packet length */
            ring->cur = NETMAP_RING_NEXT(ring, i);
        }
    }
}
.Ed
.Sh SUPPORTED INTERFACES
.Nm
supports the following interfaces:
.Xr em 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr lem 4 ,
and
.Xr re 4 .
.Sh SEE ALSO
.Xr vale 4
.Rs
.%A Luigi Rizzo
.%T Revisiting network I/O APIs: the netmap framework
.%J Communications of the ACM
.%V 55 (3)
.%P 45-51
.%D March 2012
.Re
.Rs
.%A Luigi Rizzo
.%T netmap: a novel framework for fast packet I/O
.%D June 2012
.%O USENIX ATC '12, Boston
.Re
.Pp
.Lk http://info.iet.unipi.it/~luigi/netmap/
.Sh AUTHORS
.An -nosplit
The
.Nm
framework has been originally designed and implemented at the
Universita` di Pisa in 2011 by
.An Luigi Rizzo ,
and further extended with help from
.An Matteo Landi ,
.An Gaetano Catalli ,
.An Giuseppe Lettieri ,
and
.An Vincenzo Maffione .
.Pp
.Nm
and
.Xr vale 4
have been funded by the European Commission within the FP7 Projects
CHANGE (257422) and OPENLAB (287581).