.\" Copyright (c) 2011-2013 Matteo Landi, Luigi Rizzo, Universita` di Pisa
.\" All rights reserved.
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" This document is derived in part from the enet man page (enet.4)
.\" distributed with 4.3BSD Unix.
.\" $FreeBSD: head/share/man/man4/netmap.4 228017 2011-11-27 06:55:57Z gjb $
.Nd a framework for fast packet I/O
is a framework for extremely fast and efficient packet I/O
(reaching 14.88 Mpps with a single core at less than 1 GHz)
for both userspace and kernel clients.
Userspace clients can use the
to send and receive raw packets through physical interfaces
is a very fast (reaching 20 Mpps per port)
and modular software switch,
implemented within the kernel, which can interconnect
virtual ports, physical devices, and the native host stack.
uses a memory mapped region to share packet buffers,
descriptors and queues with the kernel.
is used to bind interfaces/ports to file descriptors and
implement non-blocking I/O, whereas blocking I/O uses
can exploit the parallelism in multiqueue devices and
For the best performance,
requires explicit support in device drivers;
a generic emulation layer is available to implement the
API on top of unmodified device drivers,
at the price of reduced performance
(but still better than what can be achieved with
For a list of devices with native
.Sx SUPPORTED INTERFACES
at the end of this manual page.
clients must first issue the following code to open the device
node and to bind the file descriptor to a specific interface or port:
.Bd -literal -offset indent
fd = open("/dev/netmap", O_RDWR);
ioctl(fd, NIOCREGIF, (struct nmreq *)arg);
.Ed
has multiple modes of operation controlled by the
field specifies whether the client operates on a physical network
interface or on a port of a
switch, as indicated below.
Additional fields in the
control the details of operation.
.It Sy Interface name (e.g. 'em0', 'eth1', ...)
The data path of the interface is disconnected from the host stack.
Depending on additional arguments,
the file descriptor is bound to the NIC (one or all queues),
or to the host stack.
.It Sy valeXXX:YYY (arbitrary XXX and YYY)
The file descriptor is bound to port YYY of a
where XXX and YYY are arbitrary alphanumeric strings.
The string cannot exceed IFNAMSIZ characters, and YYY cannot
match the name of any existing interface.
The switch and the port are created if they do not exist.
.It Sy valeXXX:ifname (ifname is an existing interface)
Flags in the argument control whether the physical interface
(and optionally the corresponding host stack endpoint)
are connected or disconnected from the
is used only for configuring the
switch, typically through the
The file descriptor cannot be used for I/O, and should be passed to
The binding can be removed (and the interface returns to
regular operation, or the virtual port is destroyed) with a
on the file descriptor.
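.Pp
As an illustration of the naming rules listed above, the following C sketch
checks whether a string has the valeXXX:YYY form and fits in a name buffer.
The helper vale_port_offset() and the constant TOY_IFNAMSIZ are hypothetical
and are not part of the netmap API.
The sketch assumes that the IFNAMSIZ limit includes the terminating NUL, and
it does not check whether YYY collides with an existing interface name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: local stand-in for the system IFNAMSIZ constant. */
#define TOY_IFNAMSIZ 16

/*
 * Hypothetical helper, not part of the netmap API.
 * Returns the offset of the port part (YYY) inside a "valeXXX:YYY"
 * name, or -1 if the string is not a well-formed VALE port name.
 */
static int
vale_port_offset(const char *name)
{
	const char *colon;

	assert(name != NULL);
	if (strncmp(name, "vale", 4) != 0)
		return (-1);	/* not a VALE name */
	if (strlen(name) >= TOY_IFNAMSIZ)
		return (-1);	/* would not fit with its NUL */
	colon = strchr(name, ':');
	if (colon == NULL || colon[1] == '\0')
		return (-1);	/* missing or empty port part */
	return ((int)(colon - name) + 1);
}
```

With this helper, a name such as "vale0:p1" is accepted and the port part
starts at offset 6, whereas a plain interface name such as "em0" is rejected.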
The processes owning the file descriptor can then
the memory region that contains pre-allocated
buffers, descriptors and queues, and use them to
read/write raw packets.
Non-blocking I/O is done with special
commands, whereas the file descriptor can be passed to
to be notified about incoming packets or available transmit buffers.
The data structures in the mmapped memory are described below
All physical devices operating in
mode use the same memory region,
shared by the kernel and all processes that own
descriptors bound to those devices
(NOTE: visibility may be restricted in future implementations).
Virtual ports instead use separate memory regions,
shared only with the kernel.
All references between the shared data structures
are relative (offsets or indexes).
Some macros help convert
them into actual pointers.
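.Pp
The idea of relative references can be sketched in a few lines of C.
The structures and macros below (toy_if, toy_ring, TOY_IF, TOY_RING) are
simplified stand-ins invented for this illustration; they only assume that
an interface descriptor is found at a byte offset from the start of the
region, and that each ring is found at a byte offset from its descriptor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-ins for the shared structures (assumed layout). */
struct toy_ring {
	unsigned int num_slots;
};

struct toy_if {
	unsigned int n_rings;
	ptrdiff_t ring_ofs[2];	/* byte offsets relative to this struct */
};

/* Offset from the start of the region -> interface descriptor. */
#define TOY_IF(base, ofs)	((struct toy_if *)((char *)(base) + (ofs)))
/* Offset stored in the descriptor -> ring. */
#define TOY_RING(nifp, i) \
	((struct toy_ring *)((char *)(nifp) + (nifp)->ring_ofs[i]))

/*
 * Build a tiny fake region: descriptor at byte 64, one ring at byte 128.
 * Returns NULL on allocation failure; the caller frees the region.
 */
static void *
toy_region_create(void)
{
	char *base = calloc(1, 256);
	struct toy_if *nifp;

	if (base == NULL)
		return (NULL);
	nifp = TOY_IF(base, 64);
	nifp->n_rings = 1;
	nifp->ring_ofs[0] = 128 - 64;
	TOY_RING(nifp, 0)->num_slots = 32;
	return (base);
}
```

Because only offsets are stored, the same region stays valid regardless of
the address at which each process maps it.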
.It Sy struct netmap_if (one per interface)
indicates the number of rings supported by an interface, their
sizes, and the offsets of the
rings associated to the interface.
in the shared memory region indicated by the
field in the structure returned by
    char ni_name[IFNAMSIZ];   /* name of the interface. */
    const u_int ni_version;   /* API version */
    const u_int ni_rx_rings;  /* number of rx ring pairs */
    const u_int ni_tx_rings;  /* if 0, same as ni_rx_rings */
    const ssize_t ring_ofs[]; /* offset of tx and rx rings */
.It Sy struct netmap_ring (one per ring)
Contains the positions in the transmit and receive rings to
synchronize the kernel and the application,
slots describing the buffers.
.Va reserved
is used in receive rings to tell the kernel the number of slots before
.Va cur
that are still in use by the application.
Each physical interface has one
.Vt struct netmap_ring
for each hardware transmit and receive ring,
plus one extra transmit and one receive structure
that connect to the host stack.
    const ssize_t buf_ofs;      /* see details */
    const uint32_t num_slots;   /* number of slots in the ring */
    uint32_t avail;             /* number of usable slots */
    uint32_t cur;               /* 'current' read/write index */
    uint32_t reserved;          /* not refilled before current */
    const uint16_t nr_buf_size;
#define NR_TIMESTAMP 0x0002     /* set timestamp on *sync() */
#define NR_FORWARD   0x0004     /* enable NS_FORWARD for ring */
#define NR_RX_TSTMP  0x0008     /* set rx timestamp in slots */
    struct netmap_slot slot[0]; /* array of slots */
In transmit rings, after a system call
indicates the first slot that can be used for transmissions, and
reports how many of them are available.
system call on the file
descriptor, the application should fill buffers and
slots with data, and update
accordingly, as shown in the figure below:
         |----- avail ---| (after syscall)
TX  [*****aaaaaaaaaaaaaaaaa**]
TX  [*****TTTTTaaaaaaaaaaaa**]
              |-- avail --| (before syscall)
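.Pp
The index bookkeeping shown in the figure can be modelled with a short C
sketch.
It uses a reduced toy_ring structure that keeps only the num_slots, avail
and cur fields; toy_ring_next() and toy_tx_fill() are hypothetical helpers
written for this illustration, not netmap functions, and no packet data is
actually written.

```c
#include <assert.h>
#include <stdint.h>

/* Reduced model of a transmit ring: only the index fields. */
struct toy_ring {
	uint32_t num_slots;	/* number of slots in the ring */
	uint32_t avail;		/* number of usable slots */
	uint32_t cur;		/* 'current' read/write index */
};

/* Next slot index, wrapping at the end of the ring. */
static uint32_t
toy_ring_next(const struct toy_ring *ring, uint32_t i)
{
	return (i + 1 == ring->num_slots ? 0 : i + 1);
}

/*
 * Hypothetical helper: claim up to `want` slots for transmission,
 * advancing cur and shrinking avail.
 * Returns the number of slots actually claimed.
 */
static uint32_t
toy_tx_fill(struct toy_ring *ring, uint32_t want)
{
	uint32_t n = 0;

	assert(ring->cur < ring->num_slots);
	while (n < want && ring->avail > 0) {
		/* a real client would fill slot[ring->cur] here */
		ring->cur = toy_ring_next(ring, ring->cur);
		ring->avail--;
		n++;
	}
	return (n);
}
```

Starting from an 8-slot ring with cur equal to 6 and avail equal to 5,
claiming three slots wraps cur around to 1 and leaves avail at 2.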
In receive rings, after a system call
indicates the first slot that contains a valid packet, and
reports how many of them are available.
system call on the file
descriptor, the application can process buffers and
release them to the kernel by updating
accordingly, as shown in the figure below.
Receive rings have an additional field called
to indicate how many buffers before
cannot be released because they are still being processed.
      |-res-|-- avail --| (after syscall)
RX  [**rrrrrrRRRRRRRRRRRR******]
RX  [**...........rrrrRRR******]
                 |res|--|<avail (before syscall)
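.Pp
A possible way to reason about the reserved field is sketched below in C.
The toy_rx_ring structure and the three helpers are hypothetical and model
only the counters; they assume that held slots form one contiguous run
ending just before cur and that slots are released oldest first.

```c
#include <assert.h>
#include <stdint.h>

/* Reduced model of a receive ring: only the index fields. */
struct toy_rx_ring {
	uint32_t num_slots;
	uint32_t avail;		/* slots with packets, starting at cur */
	uint32_t cur;		/* first slot not yet consumed */
	uint32_t reserved;	/* slots before cur still held */
};

/* Index of the oldest slot the client still holds (cur if none). */
static uint32_t
toy_rx_oldest_held(const struct toy_rx_ring *ring)
{
	assert(ring->reserved <= ring->num_slots);
	return ((ring->cur + ring->num_slots - ring->reserved) %
	    ring->num_slots);
}

/* Consume n packets but keep their buffers for later processing. */
static void
toy_rx_take(struct toy_rx_ring *ring, uint32_t n)
{
	assert(n <= ring->avail);
	ring->cur = (ring->cur + n) % ring->num_slots;
	ring->avail -= n;
	ring->reserved += n;
}

/* Hand the n oldest held buffers back to the kernel. */
static void
toy_rx_release(struct toy_rx_ring *ring, uint32_t n)
{
	assert(n <= ring->reserved);
	ring->reserved -= n;
}
```

In this model the kernel may refill every slot except those between the
index returned by toy_rx_oldest_held() and cur.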
.It Sy struct netmap_slot (one per packet)
contains the metadata for a packet:
    uint32_t buf_idx;           /* buffer index */
    uint16_t len;               /* packet length */
    uint16_t flags;             /* buf changed, etc. */
#define NS_BUF_CHANGED 0x0001   /* must resync, buffer changed */
#define NS_REPORT      0x0002   /* tell hw to report results,
                                 * e.g. by generating an interrupt
                                 */
#define NS_FORWARD     0x0004   /* pass packet to the other endpoint
                                 * (host stack or device)
                                 */
#define NS_NO_LEARN    0x0008
#define NS_INDIRECT    0x0010
#define NS_MOREFRAG    0x0020
#define NS_PORT_SHIFT  8
#define NS_PORT_MASK   (0xff << NS_PORT_SHIFT)
#define NS_RFRAGS(_slot) (((_slot)->flags >> 8) & 0xff)
    uint64_t ptr;               /* buffer address (indirect buffers) */
The flags control how the buffer associated to the slot
.It Sy packet buffers
are normally fixed size (2 Kbyte) buffers allocated by the kernel
that contain packet data.
Addresses are computed through macros in order to
support access to objects in the shared memory region, e.g.:
.Bl -tag -width ".Fn NETMAP_BUF ring buf_idx"
.It Fn NETMAP_TXRING nifp i
Returns the address of the
.It Fn NETMAP_RXRING nifp i
Returns the address of the
.It Fn NETMAP_BUF ring buf_idx
Returns the address of the buffer with index
(which can be part of any ring for the given interface).
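.Pp
The following C sketch shows one way a buffer index can be turned into an
address.
It is an assumption made for illustration, not the definition of
NETMAP_BUF: the toy_ring structure, the TOY_BUF macro and toy_ring_create()
are hypothetical, and the arithmetic (ring address plus buf_ofs plus index
times buffer size) is inferred from the buf_ofs and nr_buf_size fields
described in this page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Reduced stand-in for struct netmap_ring (assumed layout). */
struct toy_ring {
	ptrdiff_t buf_ofs;	/* offset from the ring to buffer 0 */
	uint16_t nr_buf_size;	/* size of each buffer, in bytes */
};

/* Assumed arithmetic: ring address + buf_ofs + index * buffer size. */
#define TOY_BUF(ring, idx) \
	((char *)(ring) + (ring)->buf_ofs + \
	    (size_t)(idx) * (ring)->nr_buf_size)

/*
 * Allocate a ring header followed by `nbufs` buffers of 2048 bytes.
 * Returns NULL on allocation failure; the caller frees the ring.
 */
static struct toy_ring *
toy_ring_create(uint32_t nbufs)
{
	const size_t hdr = 64;	/* room reserved for the ring header */
	struct toy_ring *ring;

	assert(sizeof(*ring) <= hdr);
	ring = calloc(1, hdr + (size_t)nbufs * 2048);
	if (ring == NULL)
		return (NULL);
	ring->buf_ofs = (ptrdiff_t)hdr;
	ring->nr_buf_size = 2048;
	return (ring);
}
```

Since buffers are addressed by index, swapping the indexes stored in two
slots is enough to exchange their buffers without copying any data.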
Normally, buffers are associated to slots when interfaces are bound,
and one packet is fully contained in a single buffer.
Clients can, however, modify the mapping using the
.Bl -tag -width ".Fn NS_RFRAGS slot"
.It Dv NS_BUF_CHANGED
in the slot has changed.
This can be useful if the client wants to implement
some form of zero-copy forwarding (e.g. by passing buffers
from an input interface to an output interface), or
needs to process packets out of order.
The flag MUST be used whenever the buffer index is changed.
indicates that we want to be woken up when this buffer
has been transmitted.
This reduces performance but ensures
a prompt notification when a buffer has been sent.
notifies transmit completions in batches, hence signals
may be delayed indefinitely.
However, we need such notifications
before closing a descriptor.
When the device is opened in
mode, the client can mark slots in receive rings with this flag.
Packets in marked slots are forwarded to
the other endpoint at the next system call, thus restoring
(in a selective way) the connection between the NIC and the
tells the forwarding code that the SRC MAC address for this
packet should not be used in the learning bridge.
indicates that the packet's payload is not in the
buffer, but in a user-supplied buffer whose
user virtual address is in the
The size can reach 65535 bytes.
This is only supported on the transmit ring of virtual ports.
indicates that the packet continues with subsequent buffers;
the last buffer in a packet must have the flag cleared.
The maximum length of a chain is 64 buffers.
This is only supported on virtual ports.
.It Fn NS_RFRAGS slot
on receive rings, returns the number of remaining buffers
in a packet, including this one.
Slots with a value greater than 1 also have
The length refers to the individual buffer;
there is no field for the total length.
On transmit rings, if
is set, it is passed to the lookup
function, which can use it e.g. as the index of the destination
port instead of doing an address lookup.
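.Pp
The flag layout described above can be exercised with a small C sketch.
The NS_MOREFRAG, NS_PORT_SHIFT and NS_PORT_MASK values are copied from the
slot definition in this page; the toy_slot structure and the three helpers
are hypothetical and exist only for this example.
Since there is no field for the total length, the last helper adds up the
lengths of the buffers that form one packet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values copied from the slot definition in this page. */
#define NS_MOREFRAG	0x0020
#define NS_PORT_SHIFT	8
#define NS_PORT_MASK	(0xff << NS_PORT_SHIFT)

/* Reduced stand-in for struct netmap_slot. */
struct toy_slot {
	uint16_t len;
	uint16_t flags;
};

/* Store an 8-bit destination port in the high byte of the flags. */
static void
toy_slot_set_port(struct toy_slot *slot, uint8_t port)
{
	slot->flags = (uint16_t)((slot->flags & ~NS_PORT_MASK) |
	    ((uint16_t)port << NS_PORT_SHIFT));
}

/* Read back the port stored by toy_slot_set_port(). */
static uint8_t
toy_slot_get_port(const struct toy_slot *slot)
{
	return ((uint8_t)((slot->flags & NS_PORT_MASK) >> NS_PORT_SHIFT));
}

/*
 * Sum the lengths of the buffers forming one packet, starting at
 * slot[0] and following NS_MOREFRAG.
 * Returns 0 if the chain is not terminated within `n` slots.
 */
static uint32_t
toy_packet_len(const struct toy_slot *slot, uint32_t n)
{
	uint32_t i, total = 0;

	assert(slot != NULL);
	for (i = 0; i < n; i++) {
		total += slot[i].len;
		if ((slot[i].flags & NS_MOREFRAG) == 0)
			return (total);
	}
	return (0);
}
```

Setting the port leaves the low flag bits untouched, so a slot can carry
both NS_MOREFRAG and a destination port at the same time.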
commands to synchronize the state of the rings
between the kernel and the user processes, as well as
to query and configure the interface.
The former do not require any argument, whereas the latter use a
    char nr_name[IFNAMSIZ];
    uint32_t nr_version;     /* API version */
#define NETMAP_API 4         /* current version */
    uint32_t nr_offset;      /* nifp offset in the shared region */
    uint32_t nr_memsize;     /* size of the shared region */
    uint32_t nr_tx_slots;    /* slots in tx rings */
    uint32_t nr_rx_slots;    /* slots in rx rings */
    uint16_t nr_tx_rings;    /* number of tx rings */
    uint16_t nr_rx_rings;    /* number of rx rings */
    uint16_t nr_ringid;      /* ring(s) we care about */
#define NETMAP_HW_RING    0x4000 /* low bits indicate one hw ring */
#define NETMAP_SW_RING    0x2000 /* we process the sw ring */
#define NETMAP_NO_TX_POLL 0x1000 /* no gratuitous txsync on poll */
#define NETMAP_RING_MASK  0xfff  /* the actual ring number */
#define NETMAP_BDG_ATTACH     1  /* attach the NIC */
#define NETMAP_BDG_DETACH     2  /* detach the NIC */
#define NETMAP_BDG_LOOKUP_REG 3  /* register lookup function */
#define NETMAP_BDG_LIST       4  /* get bridge's info */
A device descriptor obtained through
command codes supported by network devices, as well as
specific command codes defined in
These specific command codes are as follows:
.Bl -tag -width ".Dv NIOCTXSYNC"
if the named device does not support
Otherwise, it returns zero and advisory information
Note that all the information below can change before the
interface is actually put into
indicates the size of the
Physical devices all share the same memory region, whereas
ports may have independent regions for each port.
These sizes can be set through system-wide
indicate the size of transmit and receive rings, respectively.
indicate the number of transmit and receive rings, respectively.
Both ring number and size may be configured at runtime
using interface-specific functions (e.g.\&
puts the interface specified via
mode, disconnecting it from the host stack, and/or defines which
rings are controlled through this file descriptor.
On return, it gives the same info as
indicates the identity of the rings controlled through the file
.Bl -tag -width "Dv NETMAP_HW_RING + i"
default; all hardware rings
.It Dv NETMAP_SW_RING
connecting to the host stack
.It Dv NETMAP_HW_RING + i
call pushes out any pending packets on the transmit ring, even if
no write events were specified.
The feature can be disabled by OR-ing the flag
.Dv NETMAP_NO_TX_POLL
Normally, you should keep this feature unless you are using
separate file descriptors for the send and receive rings, because
otherwise packets are pushed out only if
is called, or the send queue is full.
can be used multiple times to change the association of a
file descriptor to a ring pair, always within the same device.
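.Pp
As a sketch of how a ring selector might be composed and decoded, the
following C fragment combines the constants listed in the request structure
above.
The constant values are copied from this page; toy_ringid_hw() and
toy_ringid_hw_index() are hypothetical helpers, not part of the netmap API.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the request structure definition in this page. */
#define NETMAP_HW_RING		0x4000
#define NETMAP_SW_RING		0x2000
#define NETMAP_NO_TX_POLL	0x1000
#define NETMAP_RING_MASK	0xfff

/*
 * Hypothetical helper: select hardware ring `i`, optionally
 * disabling the implicit txsync on poll.
 */
static uint16_t
toy_ringid_hw(uint16_t i, int no_tx_poll)
{
	assert(i <= NETMAP_RING_MASK);
	return ((uint16_t)(NETMAP_HW_RING | i |
	    (no_tx_poll ? NETMAP_NO_TX_POLL : 0)));
}

/*
 * Hypothetical helper: ring number if a single hardware ring is
 * selected, or -1 otherwise.
 */
static int
toy_ringid_hw_index(uint16_t ringid)
{
	if ((ringid & NETMAP_HW_RING) == 0)
		return (-1);
	return (ringid & NETMAP_RING_MASK);
}
```

For instance, selecting hardware ring 2 with the implicit txsync disabled
gives the value 0x5002, from which the ring number 2 can be recovered with
the mask.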
When registering a virtual interface that is dynamically created to a
switch, we can specify the desired number of rings (1 by default,
and currently up to 16) by setting the
tells the hardware about new packets to transmit, and updates the
number of slots available for transmission.
tells the hardware about consumed packets, and asks for newly available
to wake up processes when significant events occur, and
Applications may need to create threads and bind them to
specific cores to improve performance, using standard
.Xr pthread_setaffinity_np 3
The following code implements a traffic generator:
.Bd -literal -compact
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <net/netmap_user.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>

void
sender(void)
{
    struct netmap_if *nifp;
    struct netmap_ring *ring;
    struct nmreq nmr;
    struct pollfd fds;
    char *buf;
    void *p;
    u_int i;
    int fd;

    fd = open("/dev/netmap", O_RDWR);
    memset(&nmr, 0, sizeof(nmr));
    strcpy(nmr.nr_name, "ix0");
    nmr.nr_version = NETMAP_API;
    ioctl(fd, NIOCREGIF, &nmr);
    p = mmap(0, nmr.nr_memsize, PROT_WRITE | PROT_READ,
        MAP_SHARED, fd, 0);
    nifp = NETMAP_IF(p, nmr.nr_offset);
    ring = NETMAP_TXRING(nifp, 0);
    fds.fd = fd;
    fds.events = POLLOUT;
    for (;;) {
        poll(&fds, 1, -1);    /* wait for free transmit slots */
        for (; ring->avail > 0; ring->avail--) {
            i = ring->cur;
            buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
            /* prepare packet in buf */
            ring->slot[i].len = 0;    /* set to the packet length */
            ring->cur = NETMAP_RING_NEXT(ring, i);
        }
    }
}
.Ed
.Sh SUPPORTED INTERFACES
supports the following interfaces:
.%T Revisiting network I/O APIs: the netmap framework
.%J Communications of the ACM
.%T netmap: a novel framework for fast packet I/O
.%O USENIX ATC '12, Boston
.Lk http://info.iet.unipi.it/~luigi/netmap/
framework was originally designed and implemented at the
Universita` di Pisa in 2011 by
and further extended with help from
.An Gaetano Catalli ,
.An Giuseppe Lettieri ,
.An Vincenzo Maffione .
have been funded by the European Commission within the FP7 Projects
CHANGE (257422) and OPENLAB (287581).