 * Copyright (C) 2011-2013 Matteo Landi, Luigi Rizzo. All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *
 * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 */
/*
 * This module supports memory mapped access to network devices,
 * virtual devices etc.
 *
 * The module uses a large memory pool allocated by the kernel
 * and accessible as mmapped memory by multiple userspace threads/processes.
 * The memory pool contains packet buffers and "netmap rings",
 * i.e. user-accessible copies of the interface's queues.
 *
 * Access to the network card works like this:
 * 1. a process/thread issues one or more open() on /dev/netmap, to create
 *    select()able file descriptors on which events are reported.
 * 2. on each descriptor, the process issues an ioctl() to identify
 *    the interface that should report events to the file descriptor.
 * 3. on each descriptor, the process issues an mmap() request to
 *    map the shared memory region within the process' address space.
 *    The list of interesting queues is indicated by a location in
 *    the shared memory region.
 * 4. using the functions in the netmap(4) userspace API, a process
 *    can look up the occupation state of a queue, access memory buffers,
 *    and retrieve received packets or enqueue packets to transmit.
 * 5. using some ioctl()s the process can synchronize the userspace view
 *    of the queue with the actual status in the kernel. This includes both
 *    receiving the notification of new packets, and transmitting new
 *    packets on the output interface.
 * 6. select() or poll() can be used to wait for events on individual
 *    transmit or receive queues (or all queues for a given interface).
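 *
 * As a minimal sketch of the steps above (illustrative only; error
 * handling omitted, "em0" is a placeholder interface name):
 *
 *	fd = open("/dev/netmap", O_RDWR);		// step 1
 *	bzero(&req, sizeof(req));			// struct nmreq req
 *	req.nr_version = NETMAP_API;
 *	strncpy(req.nr_name, "em0", sizeof(req.nr_name));
 *	ioctl(fd, NIOCREGIF, &req);			// step 2
 *	mem = mmap(0, req.nr_memsize, PROT_READ | PROT_WRITE,
 *		   MAP_SHARED, fd, 0);			// step 3
 *	nifp = NETMAP_IF(mem, req.nr_offset);
 *	ring = NETMAP_RXRING(nifp, 0);			// step 4
 *	poll(&(struct pollfd){ .fd = fd, .events = POLLIN }, 1, -1); // step 6
 *	ioctl(fd, NIOCRXSYNC, NULL);			// step 5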
 *
 * SYNCHRONIZATION (USER)
 *
 * The netmap rings and data structures may be shared among multiple
 * user threads or even independent processes.
 * Any synchronization among those threads/processes is delegated
 * to the threads themselves. Only one thread at a time can be in
 * a system call on the same netmap ring. The OS does not enforce
 * this and only guarantees against system crashes in case of
 * invalid usage.
 *
 * LOCKING (INTERNAL)
 *
 * Within the kernel, access to the netmap rings is protected as follows:
 *
 * - a spinlock on each ring, to handle producer/consumer races on
 *   RX rings attached to the host stack (against multiple host
 *   threads writing from the host stack to the same ring),
 *   and on 'destination' rings attached to a VALE switch
 *   (i.e. RX rings in VALE ports, and TX rings in NIC/host ports),
 *   protecting multiple active senders for the same destination.
 *
 * - an atomic variable to guarantee that there is at most one
 *   instance of *_*xsync() on the ring at any time.
 *   For rings connected to user file
 *   descriptors, an atomic_test_and_set() protects this, and the
 *   lock on the ring is not actually used.
 *   For NIC RX rings connected to a VALE switch, an atomic_test_and_set()
 *   is also used to prevent multiple executions (the driver might indeed
 *   already guarantee this).
 *   For NIC TX rings connected to a VALE switch, the lock arbitrates
 *   access to the queue (both when allocating buffers and when pushing
 *   them out).
 *
 * - *xsync() should be protected against initializations of the card.
 *   On FreeBSD most devices have the reset routine protected by
 *   a RING lock (ixgbe, igb, em) or core lock (re); lem is missing
 *   the RING protection on rx_reset(), and this should be added.
 *
 *   On Linux there is an external lock on the tx path, which probably
 *   also arbitrates access to the reset routine. XXX to be revised
 *
 * - a per-interface core_lock protecting access from the host stack
 *   while interfaces may be detached from netmap mode.
 *   XXX there should be no need for this lock if we detach the interfaces
 *   only while they are down.
 *
 * NMG_LOCK() serializes all modifications to switches and ports.
 * A switch cannot be deleted until all ports are gone.
 *
 * For each switch, an SX lock (RWlock on Linux) protects
 * deletion of ports. When configuring or deleting a port, the
 * lock is acquired in exclusive mode (after holding NMG_LOCK).
 * When forwarding, the lock is acquired in shared mode (without NMG_LOCK).
 * The lock is held throughout the entire forwarding cycle,
 * during which the thread may incur a page fault.
 * Hence it is important that sleepable shared locks are used.
 *
 * On the rx ring, the per-port lock is grabbed initially to reserve
 * a number of slots in the ring, then the lock is released,
 * packets are copied from source to destination, and then
 * the lock is acquired again and the receive ring is updated.
 * (A similar thing is done on the tx ring for NIC and host stack
 * ports attached to the switch)
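 *
 * A pseudocode sketch of that rx-side sequence (names illustrative,
 * not taken from the code):
 *
 *	lock(port->q_lock);
 *	first = reserve_slots(rxring, n);	// just advance a pointer
 *	unlock(port->q_lock);
 *	copy_packets(src, rxring, first, n);	// no lock held here
 *	lock(port->q_lock);
 *	update_ring(rxring, first, n);		// expose the new slots
 *	unlock(port->q_lock);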
 */

/*
 * OS-specific code that is used only within this file.
 * Other OS-specific code that must be accessed by drivers
 * is present in netmap_kern.h
 */
/* __FBSDID("$FreeBSD: head/sys/dev/netmap/netmap.c 257176 2013-10-26 17:58:36Z glebius $"); */

#include <sys/types.h>
#include <sys/errno.h>
#include <sys/param.h>	/* defines used in kernel.h */
#include <sys/kernel.h>	/* types used in module initialization */
#include <sys/conf.h>	/* cdevsw struct, UID, GID */
#include <sys/devfs.h>
#include <sys/sockio.h>
#include <sys/socketvar.h>	/* struct socket */
#include <sys/malloc.h>
#include <sys/queue.h>
#include <sys/event.h>
#include <sys/poll.h>
#include <sys/lock.h>
#include <sys/socket.h>	/* sockaddrs */
#include <sys/sysctl.h>
#include <sys/bus.h>	/* bus_dmamap_* */
#include <sys/endian.h>
#include <sys/refcount.h>

#include <net/if.h>
#include <net/if_var.h>
#include <net/bpf.h>	/* BIOCIMMEDIATE */
/* reduce conditional code */
#define init_waitqueue_head(x)	/* only needed in Linux */

extern struct dev_ops netmap_cdevsw;

#include <net/netmap.h>
#include "netmap_kern.h"
#include "netmap_mem2.h"

#define selwakeuppri(x, y)	do { } while (0)	/* XXX porting in progress */
#define selrecord(x, y)		do { } while (0)	/* XXX porting in progress */

MALLOC_DEFINE(M_NETMAP, "netmap", "Network memory map");
/*
 * The following variables are used by the drivers and replicate
 * fields in the global memory pool. They only refer to buffers
 * used by physical interfaces.
 */
u_int netmap_total_buffers;
u_int netmap_buf_size;
char *netmap_buffer_base;	/* also address of an invalid buffer */
/* user-controlled variables */
int netmap_verbose;

static int netmap_no_timestamp;	/* don't timestamp on rxsync */

SYSCTL_NODE(_net, OID_AUTO, netmap, CTLFLAG_RW, 0, "Netmap args");
SYSCTL_INT(_net_netmap, OID_AUTO, verbose,
    CTLFLAG_RW, &netmap_verbose, 0, "Verbose mode");
SYSCTL_INT(_net_netmap, OID_AUTO, no_timestamp,
    CTLFLAG_RW, &netmap_no_timestamp, 0, "no_timestamp");
int netmap_mitigate = 1;
SYSCTL_INT(_net_netmap, OID_AUTO, mitigate, CTLFLAG_RW, &netmap_mitigate, 0, "");
int netmap_no_pendintr = 1;
SYSCTL_INT(_net_netmap, OID_AUTO, no_pendintr,
    CTLFLAG_RW, &netmap_no_pendintr, 0, "Always look for new received packets.");
int netmap_txsync_retry = 2;
SYSCTL_INT(_net_netmap, OID_AUTO, txsync_retry, CTLFLAG_RW,
    &netmap_txsync_retry, 0, "Number of txsync loops in bridge's flush.");

int netmap_flags = 0;	/* debug flags */
int netmap_fwd = 0;	/* force transparent mode */
int netmap_mmap_unreg = 0;	/* allow mmap of unregistered fds */
/*
 * netmap_admode selects the netmap mode to use.
 * Invalid values are reset to NETMAP_ADMODE_BEST.
 */
enum {	NETMAP_ADMODE_BEST = 0,	/* use native, fallback to generic */
	NETMAP_ADMODE_NATIVE,	/* either native or none */
	NETMAP_ADMODE_GENERIC,	/* force generic */
	NETMAP_ADMODE_LAST };
static int netmap_admode = NETMAP_ADMODE_BEST;
int netmap_generic_mit = 100*1000;	/* Generic mitigation interval in nanoseconds. */
int netmap_generic_ringsize = 1024;	/* Generic ringsize. */

SYSCTL_INT(_net_netmap, OID_AUTO, flags, CTLFLAG_RW, &netmap_flags, 0, "");
SYSCTL_INT(_net_netmap, OID_AUTO, fwd, CTLFLAG_RW, &netmap_fwd, 0, "");
SYSCTL_INT(_net_netmap, OID_AUTO, mmap_unreg, CTLFLAG_RW, &netmap_mmap_unreg, 0, "");
SYSCTL_INT(_net_netmap, OID_AUTO, admode, CTLFLAG_RW, &netmap_admode, 0, "");
SYSCTL_INT(_net_netmap, OID_AUTO, generic_mit, CTLFLAG_RW, &netmap_generic_mit, 0, "");
SYSCTL_INT(_net_netmap, OID_AUTO, generic_ringsize, CTLFLAG_RW, &netmap_generic_ringsize, 0, "");

NMG_LOCK_T netmap_global_lock;
/* mark the ring as busy, sleeping briefly until we get exclusive use */
static void
nm_kr_get(struct netmap_kring *kr)
{
	while (NM_ATOMIC_TEST_AND_SET(&kr->nr_busy))
		tsleep(kr, 0, "NM_KR_GET", 4);
}
void
netmap_disable_ring(struct netmap_kring *kr)
{
	kr->nkr_stopped = 1;
	nm_kr_get(kr);
	lockmgr(&kr->q_lock, LK_EXCLUSIVE);
	lockmgr(&kr->q_lock, LK_RELEASE);
	nm_kr_put(kr);
}
static void
netmap_set_all_rings(struct ifnet *ifp, int stopped)
{
	struct netmap_adapter *na;
	int i;

	if (!(ifp->if_capenable & IFCAP_NETMAP))
		return;

	na = NA(ifp);

	for (i = 0; i <= na->num_tx_rings; i++) {
		if (stopped)
			netmap_disable_ring(na->tx_rings + i);
		else
			na->tx_rings[i].nkr_stopped = 0;
		na->nm_notify(na, i, NR_TX, NAF_DISABLE_NOTIFY |
			(i == na->num_tx_rings ? NAF_GLOBAL_NOTIFY: 0));
	}

	for (i = 0; i <= na->num_rx_rings; i++) {
		if (stopped)
			netmap_disable_ring(na->rx_rings + i);
		else
			na->rx_rings[i].nkr_stopped = 0;
		na->nm_notify(na, i, NR_RX, NAF_DISABLE_NOTIFY |
			(i == na->num_rx_rings ? NAF_GLOBAL_NOTIFY: 0));
	}
}

void
netmap_disable_all_rings(struct ifnet *ifp)
{
	netmap_set_all_rings(ifp, 1 /* stopped */);
}

void
netmap_enable_all_rings(struct ifnet *ifp)
{
	netmap_set_all_rings(ifp, 0 /* enabled */);
}
/*
 * generic bound_checking function: clamp *v into [lo, hi], using dflt
 * when the stored value is below the allowed minimum.
 */
u_int
nm_bound_var(u_int *v, u_int dflt, u_int lo, u_int hi, const char *msg)
{
	u_int oldv = *v;
	const char *op = NULL;

	if (dflt < lo)
		dflt = lo;
	if (dflt > hi)
		dflt = hi;
	if (oldv < lo) {
		*v = dflt;
		op = "Bump";
	} else if (oldv > hi) {
		*v = hi;
		op = "Clamp";
	}
	if (op && msg)
		kprintf("%s %s to %d (was %d)\n", op, msg, *v, oldv);
	return *v;
}
/*
 * packet-dump function, user-supplied or static buffer.
 * The destination buffer must be at least 30+4*len.
 */
const char *
nm_dump_buf(char *p, int len, int lim, char *dst)
{
	static char _dst[8192];
	int i, j;
	static char hex[] ="0123456789abcdef";
	char *o;	/* output position */

#define P_HI(x)	hex[((x) & 0xf0)>>4]
#define P_LO(x)	hex[((x) & 0xf)]
#define P_C(x)	((x) >= 0x20 && (x) <= 0x7e ? (x) : '.')
	if (!dst)
		dst = _dst;
	if (lim <= 0 || lim > len)
		lim = len;
	o = dst;
	ksprintf(o, "buf 0x%p len %d lim %d\n", p, len, lim);
	o += strlen(o);
	/* hexdump routine */
	for (i = 0; i < lim; ) {
		ksprintf(o, "%5d: ", i);
		o += strlen(o);
		memset(o, ' ', 48);
		for (j=0; j < 16 && i < lim; i++, j++) {
			o[j*3] = P_HI(p[i]);
			o[j*3+1] = P_LO(p[i]);
		}
		i -= j;
		for (j=0; j < 16 && i < lim; i++, j++)
			o[j + 48] = P_C(p[i]);
		o[j+48] = '\n';
		o += j+49;
	}
	*o = '\0';
#undef P_HI
#undef P_LO
#undef P_C
	return dst;
}
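/*
 * Example of the output format produced above (illustrative bytes):
 *
 *	buf 0xffffff0012345000 len 60 lim 60
 *	    0: ff ff ff ff ff ff 00 1b 21 3c 4d 5e 08 06 00 01  ........!<M^....
 *	   16: 08 00 06 04 00 01 00 1b 21 3c 4d 5e c0 a8 01 01  ........!<M^....
 *	   ...
 */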
/*
 * Fetch configuration from the device, to cope with dynamic
 * reconfigurations after loading the module.
 */
int
netmap_update_config(struct netmap_adapter *na)
{
	struct ifnet *ifp = na->ifp;
	u_int txr, txd, rxr, rxd;

	txr = txd = rxr = rxd = 0;
	if (na->nm_config) {
		na->nm_config(na, &txr, &txd, &rxr, &rxd);
	} else {
		/* take whatever we had at init time */
		txr = na->num_tx_rings;
		txd = na->num_tx_desc;
		rxr = na->num_rx_rings;
		rxd = na->num_rx_desc;
	}

	if (na->num_tx_rings == txr && na->num_tx_desc == txd &&
	    na->num_rx_rings == rxr && na->num_rx_desc == rxd)
		return 0; /* nothing changed */
	if (netmap_verbose || na->active_fds > 0) {
		D("stored config %s: txring %d x %d, rxring %d x %d",
			NM_IFPNAME(ifp),
			na->num_tx_rings, na->num_tx_desc,
			na->num_rx_rings, na->num_rx_desc);
		D("new config %s: txring %d x %d, rxring %d x %d",
			NM_IFPNAME(ifp), txr, txd, rxr, rxd);
	}
	if (na->active_fds == 0) {
		D("configuration changed (but fine)");
		na->num_tx_rings = txr;
		na->num_tx_desc = txd;
		na->num_rx_rings = rxr;
		na->num_rx_desc = rxd;
		return 0;
	}
	D("configuration changed while active, this is bad...");
	return 1;
}
int
netmap_krings_create(struct netmap_adapter *na, u_int ntx, u_int nrx, u_int tailroom)
{
	u_int i, len, ndesc;
	struct netmap_kring *kring;

	len = (ntx + nrx) * sizeof(struct netmap_kring) + tailroom;

	na->tx_rings = kmalloc((size_t)len, M_DEVBUF, M_NOWAIT | M_ZERO);
	if (na->tx_rings == NULL) {
		D("Cannot allocate krings");
		return ENOMEM;
	}
	na->rx_rings = na->tx_rings + ntx;

	ndesc = na->num_tx_desc;
	for (i = 0; i < ntx; i++) { /* Transmit rings */
		kring = &na->tx_rings[i];
		bzero(kring, sizeof(*kring));
		kring->na = na;
		kring->nkr_num_slots = ndesc;
		/*
		 * Always keep one slot empty, so we can detect new
		 * transmissions comparing cur and nr_hwcur (they are
		 * the same only if there are no new transmissions).
		 */
		kring->nr_hwavail = ndesc - 1;
		lockinit(&kring->q_lock, "nm_txq_lock", 0, LK_CANRECURSE);
		init_waitqueue_head(&kring->si);
	}

	ndesc = na->num_rx_desc;
	for (i = 0; i < nrx; i++) { /* Receive rings */
		kring = &na->rx_rings[i];
		bzero(kring, sizeof(*kring));
		kring->na = na;
		kring->nkr_num_slots = ndesc;
		lockinit(&kring->q_lock, "nm_rxq_lock", 0, LK_CANRECURSE);
		init_waitqueue_head(&kring->si);
	}
	init_waitqueue_head(&na->tx_si);
	init_waitqueue_head(&na->rx_si);

	na->tailroom = na->rx_rings + nrx;

	return 0;
}
void
netmap_krings_delete(struct netmap_adapter *na)
{
	int i;

	for (i = 0; i < na->num_tx_rings + 1; i++) {
		lockuninit(&na->tx_rings[i].q_lock);
	}
	for (i = 0; i < na->num_rx_rings + 1; i++) {
		lockuninit(&na->rx_rings[i].q_lock);
	}
	kfree(na->tx_rings, M_DEVBUF);
	na->tx_rings = na->rx_rings = na->tailroom = NULL;
}
static struct netmap_if*
netmap_if_new(const char *ifname, struct netmap_adapter *na)
{
	struct netmap_if *nifp;

	if (netmap_update_config(na)) {
		/* configuration mismatch, report and fail */
		return NULL;
	}

	if (na->active_fds)
		goto final;

	if (na->nm_krings_create(na))
		goto cleanup;

	if (netmap_mem_rings_create(na))
		goto cleanup;

final:
	nifp = netmap_mem_if_new(ifname, na);
	if (nifp == NULL)
		goto cleanup;

	return (nifp);

cleanup:
	if (na->active_fds == 0) {
		netmap_mem_rings_delete(na);
		na->nm_krings_delete(na);
	}
	return NULL;
}
/* grab a reference to the memory allocator, if we don't have one already. The
 * reference is taken from the netmap_adapter registered with the priv.
 */
static int
netmap_get_memory_locked(struct netmap_priv_d* p)
{
	struct netmap_mem_d *nmd;
	int error = 0;

	if (p->np_na == NULL) {
		if (!netmap_mmap_unreg)
			return ENODEV;
		/* for compatibility with older versions of the API
		 * we use the global allocator when no interface has been
		 * registered
		 */
		nmd = &nm_mem;
	} else {
		nmd = p->np_na->nm_mem;
	}

	if (p->np_mref == NULL) {
		error = netmap_mem_finalize(nmd);
		if (!error)
			p->np_mref = nmd;
	} else if (p->np_mref != nmd) {
		/* a virtual port has been registered, but previous
		 * syscalls already used the global allocator.
		 * We cannot continue.
		 */
		error = ENODEV;
	}
	return error;
}

static int
netmap_get_memory(struct netmap_priv_d* p)
{
	int error;

	NMG_LOCK();
	error = netmap_get_memory_locked(p);
	NMG_UNLOCK();
	return error;
}

static int
netmap_have_memory_locked(struct netmap_priv_d* p)
{
	return p->np_mref != NULL;
}

static void
netmap_drop_memory_locked(struct netmap_priv_d* p)
{
	if (p->np_mref) {
		netmap_mem_deref(p->np_mref);
		p->np_mref = NULL;
	}
}
/*
 * File descriptor's private data destructor.
 *
 * Call nm_register(ifp,0) to stop netmap mode on the interface and
 * revert to normal operation. We expect that np_na->ifp has not gone away.
 * The second argument is the nifp to work on. In some cases it is
 * not attached yet to the netmap_priv_d so we need to pass it as
 * a separate argument.
 */
/* call with NMG_LOCK held */
static void
netmap_do_unregif(struct netmap_priv_d *priv, struct netmap_if *nifp)
{
	struct netmap_adapter *na = priv->np_na;
	struct ifnet *ifp = na->ifp;

	NMG_LOCK_ASSERT();
	na->active_fds--;
	if (na->active_fds <= 0) {	/* last instance */

		if (netmap_verbose)
			D("deleting last instance for %s", NM_IFPNAME(ifp));
		/*
		 * (TO CHECK) This function is only called
		 * when the last reference to this file descriptor goes
		 * away. This means we cannot have any pending poll()
		 * or interrupt routine operating on the structure.
		 * XXX The file may be closed in a thread while
		 * another thread is using it.
		 * Linux keeps the file opened until the last reference
		 * by any outstanding ioctl/poll or mmap is gone.
		 * FreeBSD does not track mmap()s (but we do) and
		 * wakes up any sleeping poll(). Need to check what
		 * happens if the close() occurs while a concurrent
		 * syscall is running.
		 */
		na->nm_register(na, 0); /* off, clear IFCAP_NETMAP */
		/* Wake up any sleeping threads. netmap_poll will
		 * then return POLLERR.
		 * XXX The wake up now must happen during *_down(), when
		 * we order all activities to stop. -gl
		 */
		/* XXX kqueue(9) needed; these will mirror knlist_init. */
		/* knlist_destroy(&na->tx_si.si_note); */
		/* knlist_destroy(&na->rx_si.si_note); */

		/* delete rings and buffers */
		netmap_mem_rings_delete(na);
		na->nm_krings_delete(na);
	}
	/* delete the nifp */
	netmap_mem_if_delete(na, nifp);
}
/*
 * returns 1 if this is the last instance and we can free priv
 */
static int
netmap_dtor_locked(struct netmap_priv_d *priv)
{
	struct netmap_adapter *na = priv->np_na;

	/*
	 * np_refcount is the number of active mmaps on
	 * this file descriptor
	 */
	if (--priv->np_refcount > 0) {
		return 0;
	}
	if (!na) {
		return 1; //XXX is it correct?
	}
	netmap_do_unregif(priv, priv->np_nifp);
	priv->np_nifp = NULL;
	netmap_drop_memory_locked(priv);
	if (na->ifp) {
		netmap_adapter_put(na);
	}
	return 1;
}

static void
netmap_dtor(void *data)
{
	struct netmap_priv_d *priv = data;
	int last_instance;

	NMG_LOCK();
	last_instance = netmap_dtor_locked(priv);
	NMG_UNLOCK();
	if (last_instance) {
		bzero(priv, sizeof(*priv));	/* for safety */
		kfree(priv, M_DEVBUF);
	}
}
/*
 * Handlers for synchronization of the queues from/to the host.
 * Netmap has two operating modes:
 * - in the default mode, the rings connected to the host stack are
 *   just another ring pair managed by userspace;
 * - in transparent mode (XXX to be defined) incoming packets
 *   (from the host or the NIC) are marked as NS_FORWARD upon
 *   arrival, and the user application has a chance to reset the
 *   flag for packets that should be dropped.
 *   On the RXSYNC or poll(), packets in RX rings between
 *   kring->nr_hwcur and ring->cur with NS_FORWARD still set are moved
 *   to the opposite side (host stack or NIC).
 *
 * The transfer NIC --> host is relatively easy, just encapsulate
 * into mbufs and we are done. The host --> NIC side is slightly
 * harder because there might not be room in the tx ring so it
 * might take a while before releasing the buffer.
 */
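/*
 * For example (illustrative, not from the original source), an
 * application running in transparent mode can veto the forwarding of
 * an individual received slot by clearing the flag before the next
 * synchronization:
 *
 *	struct netmap_slot *slot = &ring->slot[i];
 *	if (should_drop(slot))			// hypothetical filter
 *		slot->flags &= ~NS_FORWARD;	// drop instead of forwarding
 */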
/*
 * pass a chain of buffers to the host stack as coming from 'dst'
 */
static void
netmap_send_up(struct ifnet *dst, struct mbq *q)
{
	struct mbuf *m;

	/* send packets up, outside the lock */
	while ((m = mbq_dequeue(q)) != NULL) {
		if (netmap_verbose & NM_VERB_HOST)
			D("sending up pkt %p size %d", m, MBUF_LEN(m));
		NM_SEND_UP(dst, m);
	}
	mbq_destroy(q);
}
/*
 * put a copy of the buffers marked NS_FORWARD into an mbuf chain.
 * Run from hwcur to cur - reserved
 */
static void
netmap_grab_packets(struct netmap_kring *kring, struct mbq *q, int force)
{
	/* Take packets from hwcur to cur-reserved and pass them up.
	 * In case of no buffers we give up. At the end of the loop,
	 * the queue is drained in all cases.
	 * XXX handle reserved
	 */
	u_int lim = kring->nkr_num_slots - 1;
	struct mbuf *m;
	u_int k = kring->ring->cur, n = kring->ring->reserved;
	struct netmap_adapter *na = kring->na;

	/* compute the final position, ring->cur - ring->reserved.
	 * e.g. with 8 slots, cur = 2 and reserved = 5: k = 2 + 8 - 5 = 5.
	 */
	if (n > 0) {
		if (k < n)
			k += kring->nkr_num_slots;
		k -= n;
	}

	for (n = kring->nr_hwcur; n != k;) {
		struct netmap_slot *slot = &kring->ring->slot[n];

		n = nm_next(n, lim);
		if ((slot->flags & NS_FORWARD) == 0 && !force)
			continue;
		if (slot->len < 14 || slot->len > NETMAP_BDG_BUF_SIZE(na->nm_mem)) {
			D("bad pkt at %d len %d", n, slot->len);
			continue;
		}
		slot->flags &= ~NS_FORWARD; // XXX needed ?
		/* XXX adapt to the case of a multisegment packet */
		m = m_devget(BDG_NMB(na, slot), slot->len, 0, na->ifp, NULL);

		if (m == NULL)
			break;
		mbq_enqueue(q, m);
	}
}
/*
 * The host ring has packets from nr_hwcur to (cur - reserved)
 * to be sent down to the NIC.
 * We need to use the queue lock on the source (host RX ring)
 * to protect against netmap_transmit.
 * If the user is well behaved we do not need to acquire locks
 * on the destination(s),
 * so we only need to make sure that there are no panics because
 * of user errors.
 *
 * We scan the tx rings, which have just been
 * flushed so nr_hwcur == cur. Pushing packets down means
 * increment cur and decrement avail.
 */
static void
netmap_sw_to_nic(struct netmap_adapter *na)
{
	struct netmap_kring *kring = &na->rx_rings[na->num_rx_rings];
	struct netmap_kring *k1 = &na->tx_rings[0];
	u_int i, howmany, src_lim, dst_lim;

	/* XXX we should also check that the carrier is on */
	if (kring->nkr_stopped)
		return;

	lockmgr(&kring->q_lock, LK_EXCLUSIVE);

	if (kring->nkr_stopped)
		goto out;

	howmany = kring->nr_hwavail;	/* XXX otherwise cur - reserved - nr_hwcur */

	src_lim = kring->nkr_num_slots - 1;
	for (i = 0; howmany > 0 && i < na->num_tx_rings; i++, k1++) {
		ND("%d packets left to ring %d (space %d)", howmany, i, k1->nr_hwavail);
		dst_lim = k1->nkr_num_slots - 1;
		while (howmany > 0 && k1->ring->avail > 0) {
			struct netmap_slot *src, *dst, tmp;
			src = &kring->ring->slot[kring->nr_hwcur];
			dst = &k1->ring->slot[k1->ring->cur];
			tmp = *src;

			src->buf_idx = dst->buf_idx;
			src->flags = NS_BUF_CHANGED;

			dst->buf_idx = tmp.buf_idx;
			dst->len = tmp.len;
			dst->flags = NS_BUF_CHANGED;
			ND("out len %d buf %d from %d to %d",
				dst->len, dst->buf_idx,
				kring->nr_hwcur, k1->ring->cur);

			kring->nr_hwcur = nm_next(kring->nr_hwcur, src_lim);
			howmany--;
			kring->nr_hwavail--;
			k1->ring->cur = nm_next(k1->ring->cur, dst_lim);
			k1->ring->avail--;
		}
		kring->ring->cur = kring->nr_hwcur; // XXX
	}
out:
	lockmgr(&kring->q_lock, LK_RELEASE);
}
/*
 * netmap_txsync_to_host() passes packets up. We are called from a
 * system call in user process context, and the only contention
 * can be among multiple user threads erroneously calling
 * this routine concurrently.
 */
void
netmap_txsync_to_host(struct netmap_adapter *na)
{
	struct netmap_kring *kring = &na->tx_rings[na->num_tx_rings];
	struct netmap_ring *ring = kring->ring;
	u_int k, lim = kring->nkr_num_slots - 1;
	struct mbq q;
	int error;

	error = nm_kr_tryget(kring);
	if (error) {
		if (error == NM_KR_BUSY)
			D("ring %p busy (user error)", kring);
		return;
	}
	k = ring->cur;
	if (k > lim) {
		D("invalid ring index in stack TX kring %p", kring);
		netmap_ring_reinit(kring);
		nm_kr_put(kring);
		return;
	}

	/* Take packets from hwcur to cur and pass them up.
	 * In case of no buffers we give up. At the end of the loop,
	 * the queue is drained in all cases.
	 */
	mbq_init(&q);
	netmap_grab_packets(kring, &q, 1);
	kring->nr_hwcur = k;
	kring->nr_hwavail = ring->avail = lim;

	nm_kr_put(kring);
	netmap_send_up(na->ifp, &q);
}
/*
 * rxsync backend for packets coming from the host stack.
 * They have been put in the queue by netmap_transmit() so we
 * need to protect access to the kring using a lock.
 *
 * This routine also does the selrecord if called from the poll handler
 * (we know because td != NULL).
 *
 * NOTE: on Linux, selrecord() is defined as a macro and uses pwait
 * as an additional hidden argument.
 */
static void
netmap_rxsync_from_host(struct netmap_adapter *na, struct thread *td, void *pwait)
{
	struct netmap_kring *kring = &na->rx_rings[na->num_rx_rings];
	struct netmap_ring *ring = kring->ring;
	u_int j, n, lim = kring->nkr_num_slots;
	u_int k = ring->cur, resvd = ring->reserved;

	(void)pwait;	/* disable unused warnings */

	if (kring->nkr_stopped) /* check a first time without lock */
		return;

	lockmgr(&kring->q_lock, LK_EXCLUSIVE);

	if (kring->nkr_stopped) /* check again with lock held */
		goto unlock_out;

	if (k >= lim) {
		netmap_ring_reinit(kring);
		goto unlock_out;
	}
	/* new packets are already set in nr_hwavail */
	/* skip past packets that userspace has released */
	j = kring->nr_hwcur;
	if (resvd > 0) {
		if (resvd + ring->avail >= lim + 1) {
			D("XXX invalid reserve/avail %d %d", resvd, ring->avail);
			ring->reserved = resvd = 0; // XXX panic...
		}
		k = (k >= resvd) ? k - resvd : k + lim - resvd;
	}
	if (j != k) {
		n = k >= j ? k - j : k + lim - j;
		kring->nr_hwavail -= n;
		kring->nr_hwcur = k;
	}
	k = ring->avail = kring->nr_hwavail - resvd;
	if (k == 0 && td)
		selrecord(td, &kring->si);
	if (k && (netmap_verbose & NM_VERB_HOST))
		D("%d pkts from stack", k);
unlock_out:
	lockmgr(&kring->q_lock, LK_RELEASE);
}
/* Get a netmap adapter for the port.
 *
 * If it is possible to satisfy the request, return 0
 * with *na containing the netmap adapter found.
 * Otherwise return an error code, with *na containing NULL.
 *
 * When the port is attached to a bridge, we always return
 * EBUSY.
 * Otherwise, if the port is already bound to a file descriptor,
 * then we unconditionally return the existing adapter into *na.
 * In all the other cases, we return (into *na) either native,
 * generic or NULL, according to the following table:
 *
 *					native-support
 * active_fds   net.netmap.admode         YES             NO
 * -------------------------------------------------------
 *    >0              *                 NA(ifp)         NA(ifp)
 *
 *     0        NETMAP_ADMODE_BEST      NATIVE          GENERIC
 *     0        NETMAP_ADMODE_NATIVE    NATIVE           NULL
 *     0        NETMAP_ADMODE_GENERIC   GENERIC         GENERIC
 *
 */
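/*
 * For example (illustrative): "sysctl net.netmap.admode=2" forces the
 * generic adapter for new registrations, "=1" requires native support,
 * and "=0" (the default) prefers native with a generic fallback.
 */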
static int
netmap_get_hw_na(struct ifnet *ifp, struct netmap_adapter **na)
{
	/* generic support */
	int i = netmap_admode;	/* Take a snapshot. */
	int error = 0;
	struct netmap_adapter *prev_na;
	struct netmap_generic_adapter *gna;

	*na = NULL; /* default */

	/* reset in case of invalid value */
	if (i < NETMAP_ADMODE_BEST || i >= NETMAP_ADMODE_LAST)
		i = netmap_admode = NETMAP_ADMODE_BEST;

	if (NETMAP_CAPABLE(ifp)) {
		/* If an adapter already exists, but is
		 * attached to a vale port, we report that the
		 * port is busy.
		 */
		if (NETMAP_OWNED_BY_KERN(NA(ifp)))
			return EBUSY;

		/* If an adapter already exists, return it if
		 * there are active file descriptors or if
		 * netmap is not forced to use generic
		 * adapters.
		 */
		if (NA(ifp)->active_fds > 0 ||
				i != NETMAP_ADMODE_GENERIC) {
			*na = NA(ifp);
			return 0;
		}
	}

	/* If there isn't native support and netmap is not allowed
	 * to use generic adapters, we cannot satisfy the request.
	 */
	if (!NETMAP_CAPABLE(ifp) && i == NETMAP_ADMODE_NATIVE)
		return EINVAL;

	/* Otherwise, create a generic adapter and return it,
	 * saving the previously used netmap adapter, if any.
	 *
	 * Note that here 'prev_na', if not NULL, MUST be a
	 * native adapter, and CANNOT be a generic one. This is
	 * true because generic adapters are created on demand, and
	 * destroyed when not used anymore. Therefore, if the adapter
	 * currently attached to an interface 'ifp' is generic, it
	 * must be that
	 * (NA(ifp)->active_fds > 0 || NETMAP_OWNED_BY_KERN(NA(ifp))).
	 * Consequently, if NA(ifp) is generic, we will enter one of
	 * the branches above. This ensures that we never override
	 * a generic adapter with another generic adapter.
	 */
	prev_na = NA(ifp);
	error = generic_netmap_attach(ifp);
	if (error)
		return error;

	*na = NA(ifp);
	gna = (struct netmap_generic_adapter*)NA(ifp);
	gna->prev = prev_na; /* save old na */
	if (prev_na != NULL) {
		ifunit(ifp->if_xname);	/* XXX huh? */
		// XXX add a refcount ?
		netmap_adapter_get(prev_na);
	}
	D("Created generic NA %p (prev %p)", gna, gna->prev);

	return 0;
}
/*
 * MUST BE CALLED UNDER NMG_LOCK()
 *
 * get a refcounted reference to an interface.
 * This is always called in the execution of an ioctl().
 *
 * Return ENXIO if the interface does not exist, EINVAL if netmap
 * is not supported by the interface.
 * If successful, hold a reference.
 *
 * When the NIC is attached to a bridge, the reference is managed
 * at na->na_bdg_refcount using ADD/DROP_BDG_REF(), as is done for
 * virtual ports. Hence, on the final DROP_BDG_REF(), the NIC
 * is detached from the bridge, then ifp's refcount is dropped (this
 * is equivalent to the ifp being destroyed in case of virtual ports).
 *
 * This function uses if_rele() when we want to prevent the NIC from
 * being detached from the bridge in error handling. But once the refcount
 * is acquired by this function, it must be released using nm_if_rele().
 */
int
netmap_get_na(struct nmreq *nmr, struct netmap_adapter **na, int create)
{
	struct ifnet *ifp;
	int error = 0;
	struct netmap_adapter *ret;

	*na = NULL;	/* default return value */

	/* first try to see if this is a bridge port. */
	NMG_LOCK_ASSERT();

	error = netmap_get_bdg_na(nmr, na, create);
	if (error || *na != NULL) /* valid match in netmap_get_bdg_na() */
		return error;

	ifp = ifunit(nmr->nr_name);
	if (ifp == NULL) {
		return ENXIO;
	}

	error = netmap_get_hw_na(ifp, &ret);
	if (error)
		goto out;

	/* Users cannot use the NIC attached to a bridge directly */
	if (NETMAP_OWNED_BY_KERN(ret)) {
		error = EBUSY;
		goto out;
	}
	*na = ret;
	netmap_adapter_get(ret);

out:
	return error;
}
/*
 * Error routine called when txsync/rxsync detects an error.
 * Can't do much more than resetting cur = hwcur, avail = hwavail.
 * Return 1 on reinit.
 *
 * This routine is only called by the upper half of the kernel.
 * It only reads hwcur (which is changed only by the upper half, too)
 * and hwavail (which may be changed by the lower half, but only on
 * a tx ring and only to increase it, so any error will be recovered
 * on the next call). For the above, we don't strictly need to call
 * it under lock.
 */
int
netmap_ring_reinit(struct netmap_kring *kring)
{
	struct netmap_ring *ring = kring->ring;
	u_int i, lim = kring->nkr_num_slots - 1;
	int errors = 0;

	// XXX KASSERT nm_kr_tryget
	RD(10, "called for %s", NM_IFPNAME(kring->na->ifp));
	if (ring->cur > lim)
		errors++;
	for (i = 0; i <= lim; i++) {
		u_int idx = ring->slot[i].buf_idx;
		u_int len = ring->slot[i].len;
		if (idx < 2 || idx >= netmap_total_buffers) {
			if (!errors++)
				D("bad buffer at slot %d idx %d len %d ", i, idx, len);
			ring->slot[i].buf_idx = 0;
			ring->slot[i].len = 0;
		} else if (len > NETMAP_BDG_BUF_SIZE(kring->na->nm_mem)) {
			ring->slot[i].len = 0;
			if (!errors++)
				D("bad len %d at slot %d idx %d",
					len, i, idx);
		}
	}
	if (errors) {
		int pos = kring - kring->na->tx_rings;
		int n = kring->na->num_tx_rings + 1;

		RD(10, "total %d errors", errors);
		RD(10, "%s %s[%d] reinit, cur %d -> %d avail %d -> %d",
			NM_IFPNAME(kring->na->ifp),
			pos < n ? "TX" : "RX", pos < n ? pos : pos - n,
			ring->cur, kring->nr_hwcur,
			ring->avail, kring->nr_hwavail);
		ring->cur = kring->nr_hwcur;
		ring->avail = kring->nr_hwavail;
	}
	return (errors ? 1 : 0);
}
/*
 * Set the ring ID. For devices with a single queue, a request
 * for all rings is the same as a single ring.
 */
static int
netmap_set_ringid(struct netmap_priv_d *priv, u_int ringid)
{
	struct netmap_adapter *na = priv->np_na;
	struct ifnet *ifp = na->ifp;
	u_int i = ringid & NETMAP_RING_MASK;
	/* initially (np_qfirst == np_qlast) we don't want to lock */
	u_int lim = na->num_rx_rings;

	if (na->num_tx_rings > lim)
		lim = na->num_tx_rings;
	if ( (ringid & NETMAP_HW_RING) && i >= lim) {
		D("invalid ring id %d", i);
		return (EINVAL);
	}
	priv->np_ringid = ringid;
	if (ringid & NETMAP_SW_RING) {
		priv->np_qfirst = NETMAP_SW_RING;
		priv->np_qlast = 0;
	} else if (ringid & NETMAP_HW_RING) {
		priv->np_qfirst = i;
		priv->np_qlast = i + 1;
	} else {
		priv->np_qfirst = 0;
		priv->np_qlast = NETMAP_HW_RING;
	}
	priv->np_txpoll = (ringid & NETMAP_NO_TX_POLL) ? 0 : 1;
	if (netmap_verbose) {
		if (ringid & NETMAP_SW_RING)
			D("ringid %s set to SW RING", NM_IFPNAME(ifp));
		else if (ringid & NETMAP_HW_RING)
			D("ringid %s set to HW RING %d", NM_IFPNAME(ifp),
				priv->np_qfirst);
		else
			D("ringid %s set to all %d HW RINGS", NM_IFPNAME(ifp), lim);
	}
	return 0;
}
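/*
 * Examples (illustrative): ringid = NETMAP_HW_RING | 2 binds the file
 * descriptor to hardware ring 2 only; ringid = NETMAP_SW_RING selects
 * the host stack ring; ringid = 0 binds to all hardware rings. OR-ing
 * NETMAP_NO_TX_POLL into ringid disables the implicit txsync on poll().
 */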
/*
 * possibly move the interface to netmap-mode.
 * On success it returns a pointer to netmap_if, otherwise NULL.
 * This must be called with NMG_LOCK held.
 */
static struct netmap_if *
netmap_do_regif(struct netmap_priv_d *priv, struct netmap_adapter *na,
	uint16_t ringid, int *err)
{
	struct ifnet *ifp = na->ifp;
	struct netmap_if *nifp = NULL;
	int error, need_mem = 0;

	NMG_LOCK_ASSERT();
	/* ring configuration may have changed, fetch from the card */
	netmap_update_config(na);
	priv->np_na = na;	/* store the reference */
	error = netmap_set_ringid(priv, ringid);
	if (error)
		goto out;
	/* ensure allocators are ready */
	need_mem = !netmap_have_memory_locked(priv);
	if (need_mem) {
		error = netmap_get_memory_locked(priv);
		ND("get_memory returned %d", error);
		if (error)
			goto out;
	}
	nifp = netmap_if_new(NM_IFPNAME(ifp), na);
	if (nifp == NULL) { /* allocation failed */
		/* we should drop the allocator, but only
		 * if we were the ones who grabbed it
		 */
		error = ENOMEM;
		goto out;
	}
	na->active_fds++;
	if (ifp->if_capenable & IFCAP_NETMAP) {
		/* was already set */
	} else {
		/* Otherwise set the card in netmap mode
		 * and make it use the shared buffers.
		 *
		 * do not core lock because the race is harmless here,
		 * there cannot be any traffic to netmap_transmit()
		 */
		na->na_lut = na->nm_mem->pools[NETMAP_BUF_POOL].lut;
		ND("%p->na_lut == %p", na, na->na_lut);
		na->na_lut_objtotal = na->nm_mem->pools[NETMAP_BUF_POOL].objtotal;
		error = na->nm_register(na, 1); /* mode on */
		if (error) {
			netmap_do_unregif(priv, nifp);
			nifp = NULL;
		}
	}
out:
	*err = error;
	if (error) {
		priv->np_na = NULL;
		if (need_mem)
			netmap_drop_memory_locked(priv);
	}
	if (nifp != NULL) {
		/*
		 * advertise that the interface is ready by setting np_nifp.
		 * The barrier is needed because readers (poll and *SYNC)
		 * check for priv->np_nifp != NULL without locking
		 */
		wmb(); /* make sure previous writes are visible to all CPUs */
		priv->np_nifp = nifp;
	}
	return nifp;
}
/*
 * ioctl(2) support for the "netmap" device.
 *
 * Following is a list of accepted commands:
 * - NIOCGINFO
 * - SIOCGIFADDR	just for convenience
 * - NIOCREGIF
 * - NIOCUNREGIF
 * - NIOCTXSYNC
 * - NIOCRXSYNC
 *
 * Return 0 on success, errno otherwise.
 */
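/*
 * A minimal user-side sketch of the two sync commands (illustrative):
 *
 *	ioctl(fd, NIOCTXSYNC, NULL);	// expose newly filled TX slots to
 *					// the kernel, reclaim completed ones
 *	ioctl(fd, NIOCRXSYNC, NULL);	// collect newly received packets
 */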
static int
netmap_ioctl(struct cdev *dev, u_long cmd, caddr_t data,
	int fflag, struct thread *td)
{
	struct netmap_priv_d *priv = NULL;
	struct ifnet *ifp = NULL;
	struct nmreq *nmr = (struct nmreq *) data;
	struct netmap_adapter *na = NULL;
	int error;
	u_int i, lim;
	struct netmap_if *nifp;
	struct netmap_kring *krings;

	(void)dev;	/* UNUSED */
	(void)fflag;	/* UNUSED */

	error = devfs_get_cdevpriv((void **)&priv);
	if (error) {
		/* XXX ENOENT should be impossible, since the priv
		 * is now created in the open */
		return (error == ENOENT ? ENXIO : error);
	}

	nmr->nr_name[sizeof(nmr->nr_name) - 1] = '\0';	/* truncate name */
	switch (cmd) {
	case NIOCGINFO:		/* return capabilities etc */
		if (nmr->nr_version != NETMAP_API) {
			D("API mismatch got %d have %d",
				nmr->nr_version, NETMAP_API);
			nmr->nr_version = NETMAP_API;
			error = EINVAL;
			break;
		}
		if (nmr->nr_cmd == NETMAP_BDG_LIST) {
			error = netmap_bdg_ctl(nmr, NULL);
			break;
		}

		NMG_LOCK();
		do {
			/* memsize is always valid */
			struct netmap_mem_d *nmd = &nm_mem;
			u_int memflags;

			if (nmr->nr_name[0] != '\0') {
				/* get a refcount */
				error = netmap_get_na(nmr, &na, 1 /* create */);
				if (error)
					break;
				nmd = na->nm_mem; /* get memory allocator */
			}

			error = netmap_mem_get_info(nmd, &nmr->nr_memsize, &memflags);
			if (error)
				break;
			if (na == NULL) /* only memory info */
				break;
			nmr->nr_offset = 0;
			nmr->nr_rx_slots = nmr->nr_tx_slots = 0;
			netmap_update_config(na);
			nmr->nr_rx_rings = na->num_rx_rings;
			nmr->nr_tx_rings = na->num_tx_rings;
			nmr->nr_rx_slots = na->num_rx_desc;
			nmr->nr_tx_slots = na->num_tx_desc;
			if (memflags & NETMAP_MEM_PRIVATE)
				nmr->nr_ringid |= NETMAP_PRIV_MEM;
			netmap_adapter_put(na);
		} while (0);
		NMG_UNLOCK();
		break;

	case NIOCREGIF:
		if (nmr->nr_version != NETMAP_API) {
			nmr->nr_version = NETMAP_API;
			error = EINVAL;
			break;
		}
		/* possibly attach/detach NIC and VALE switch */
		i = nmr->nr_cmd;
		if (i == NETMAP_BDG_ATTACH || i == NETMAP_BDG_DETACH) {
			error = netmap_bdg_ctl(nmr, NULL);
			break;
		} else if (i != 0) {
			D("nr_cmd must be 0 not %d", i);
			error = EINVAL;
			break;
		}

		/* protect access to priv from concurrent NIOCREGIF */
		NMG_LOCK();
		do {
			u_int memflags;

			if (priv->np_na != NULL) {	/* thread already registered */
				error = netmap_set_ringid(priv, nmr->nr_ringid);
				break;
			}
			/* find the interface and a reference */
			error = netmap_get_na(nmr, &na, 1 /* create */); /* keep reference */
			if (error)
				break;
			ifp = na->ifp;
			if (NETMAP_OWNED_BY_KERN(na)) {
				netmap_adapter_put(na);
				error = EBUSY;
				break;
			}
			nifp = netmap_do_regif(priv, na, nmr->nr_ringid, &error);
			if (!nifp) {	/* reg. failed, release priv and ref */
				netmap_adapter_put(na);
				priv->np_nifp = NULL;
				break;
			}

			/* return the offset of the netmap_if object */
			nmr->nr_rx_rings = na->num_rx_rings;
			nmr->nr_tx_rings = na->num_tx_rings;
			nmr->nr_rx_slots = na->num_rx_desc;
			nmr->nr_tx_slots = na->num_tx_desc;
			error = netmap_mem_get_info(na->nm_mem, &nmr->nr_memsize, &memflags);
			if (error) {
				netmap_adapter_put(na);
				break;
			}
			if (memflags & NETMAP_MEM_PRIVATE) {
				nmr->nr_ringid |= NETMAP_PRIV_MEM;
				*(uint32_t *)(uintptr_t)&nifp->ni_flags |= NI_PRIV_MEM;
			}
			nmr->nr_offset = netmap_mem_if_offset(na->nm_mem, nifp);
		} while (0);
		NMG_UNLOCK();
		break;

	case NIOCUNREGIF:
		// XXX we have no data here ?
		D("deprecated, data is %p", nmr);
		error = EINVAL;
		break;

	case NIOCTXSYNC:
	case NIOCRXSYNC:
		nifp = priv->np_nifp;

		if (nifp == NULL) {
			error = ENXIO;
			break;
		}
		rmb(); /* make sure following reads are not from cache */

		na = priv->np_na;	/* we have a reference */

		if (na == NULL) {
			D("Internal error: nifp != NULL && na == NULL");
			error = ENXIO;
			break;
		}

		if (na->ifp == NULL) {
			RD(1, "the ifp is gone");
			error = ENXIO;
			break;
		}

		if (priv->np_qfirst == NETMAP_SW_RING) { /* host rings */
			if (cmd == NIOCTXSYNC)
				netmap_txsync_to_host(na);
			else
				netmap_rxsync_from_host(na, NULL, NULL);
			break;
		}
		/* find the last ring to scan */
		lim = priv->np_qlast;
		if (lim == NETMAP_HW_RING)
			lim = (cmd == NIOCTXSYNC) ?
			    na->num_tx_rings : na->num_rx_rings;

		krings = (cmd == NIOCTXSYNC) ? na->tx_rings : na->rx_rings;
		for (i = priv->np_qfirst; i < lim; i++) {
			struct netmap_kring *kring = krings + i;
			if (nm_kr_tryget(kring)) {
				error = EBUSY;
				goto out;
			}
			if (cmd == NIOCTXSYNC) {
				if (netmap_verbose & NM_VERB_TXSYNC)
					D("pre txsync ring %d cur %d hwcur %d",
					    i, kring->ring->cur,
					    kring->nr_hwcur);
				na->nm_txsync(na, i, NAF_FORCE_RECLAIM);
				if (netmap_verbose & NM_VERB_TXSYNC)
					D("post txsync ring %d cur %d hwcur %d",
					    i, kring->ring->cur,
					    kring->nr_hwcur);
			} else {
				na->nm_rxsync(na, i, NAF_FORCE_READ);
				microtime(&na->rx_rings[i].ring->ts);
			}
			nm_kr_put(kring);
		}
		break;

	case BIOCIMMEDIATE:
	case BIOCGHDRCMPLT:
	case BIOCSHDRCMPLT:
	case BIOCSSEESENT:
		D("ignore BIOCIMMEDIATE/BIOCGHDRCMPLT/BIOCSHDRCMPLT/BIOCSSEESENT");
		break;

	default:	/* allow device-specific ioctls */
	    {
		struct socket so;

		bzero(&so, sizeof(so));
		NMG_LOCK();
		error = netmap_get_na(nmr, &na, 0 /* don't create */); /* keep reference */
		if (error) {
			netmap_adapter_put(na);
			NMG_UNLOCK();
			break;
		}
		ifp = na->ifp;
		NMG_UNLOCK();
		// so->so_proto not null.
		error = ifioctl(&so, cmd, data, td);
		netmap_adapter_put(na);
	    }
		break;
	}
out:
	return (error);
}
/*
 * select(2) and poll(2) handlers for the "netmap" device.
 *
 * Can be called for one or more queues.
 * Return the event mask corresponding to ready events.
 * If there are no ready events, do a selrecord on either individual
 * selinfo or on the global one.
 * Device-dependent parts (locking and sync of tx/rx rings)
 * are done through callbacks.
 *
 * On Linux, the arguments are really pwait, the poll table, and 'td' is
 * struct file *. The first one is remapped to pwait as selrecord()
 * uses the name as an argument.
 */
static int
netmap_poll(struct cdev *dev, int events, struct thread *td)
{
	struct netmap_priv_d *priv = NULL;
	struct netmap_adapter *na;
	struct ifnet *ifp;
	struct netmap_kring *kring;
	u_int i, check_all_tx, check_all_rx, want_tx, want_rx, revents = 0;
	u_int lim_tx, lim_rx, host_forwarded = 0;
	struct mbq q;	/* packets towards the host stack */
	void *pwait = dev;	/* linux compatibility */

	/*
	 * In order to avoid nested locks, we need to "double check"
	 * txsync and rxsync if we decide to do a selrecord().
	 * retry_tx (and retry_rx, later) prevent looping forever.
	 */
	int retry_tx = 1;

	(void)pwait;
	mbq_init(&q);

	if (devfs_get_cdevpriv((void **)&priv) != 0 || priv == NULL)
		return POLLERR;

	if (priv->np_nifp == NULL) {
		D("No if registered");
		return POLLERR;
	}
	rmb(); /* make sure following reads are not from cache */

	na = priv->np_na;
	ifp = na->ifp;
	// check for deleted
	if (ifp == NULL) {
		RD(1, "the ifp is gone");
		return POLLERR;
	}

	if ( (ifp->if_capenable & IFCAP_NETMAP) == 0)
		return POLLERR;

	if (netmap_verbose & 0x8000)
		D("device %s events 0x%x", NM_IFPNAME(ifp), events);
	want_tx = events & (POLLOUT | POLLWRNORM);
	want_rx = events & (POLLIN | POLLRDNORM);

	lim_tx = na->num_tx_rings;
	lim_rx = na->num_rx_rings;

	if (priv->np_qfirst == NETMAP_SW_RING) {
		/* handle the host stack ring */
		if (priv->np_txpoll || want_tx) {
			/* push any packets up, then we are always ready */
			netmap_txsync_to_host(na);
			revents |= want_tx;
		}
		if (want_rx) {
			kring = &na->rx_rings[lim_rx];
			if (kring->ring->avail == 0)
				netmap_rxsync_from_host(na, td, dev);
			if (kring->ring->avail > 0) {
				revents |= want_rx;
			}
		}
		return (revents);
	}

	/*
	 * If we are in transparent mode, check also the host rx ring
	 * XXX Transparent mode at the moment requires to bind all
	 * rings to a single file descriptor.
	 */
	kring = &na->rx_rings[lim_rx];
	if ( (priv->np_qlast == NETMAP_HW_RING) // XXX check_all
			&& want_rx
			&& (netmap_fwd || kring->ring->flags & NR_FORWARD) ) {
		if (kring->ring->avail == 0)
			netmap_rxsync_from_host(na, td, dev);
		if (kring->ring->avail > 0)
			revents |= want_rx;
	}

	/*
	 * check_all_{tx|rx} are set if the card has more than one queue AND
	 * the file descriptor is bound to all of them. If so, we sleep on
	 * the "global" selinfo, otherwise we sleep on individual selinfo
	 * (FreeBSD only allows two selinfo's per file descriptor).
	 * The interrupt routine in the driver wakes one or the other
	 * (or both) depending on which clients are active.
	 *
	 * rxsync() is only called if we run out of buffers on a POLLIN.
	 * txsync() is called if we run out of buffers on POLLOUT, or
	 * there are pending packets to send. The latter can be disabled
	 * passing NETMAP_NO_TX_POLL in the NIOCREG call.
	 */
	check_all_tx = (priv->np_qlast == NETMAP_HW_RING) && (lim_tx > 1);
	check_all_rx = (priv->np_qlast == NETMAP_HW_RING) && (lim_rx > 1);

	if (priv->np_qlast != NETMAP_HW_RING) {
		lim_tx = lim_rx = priv->np_qlast;
	}

	/*
	 * We start with a lock free round which is cheap if we have
	 * slots available. If this fails, then lock and call the sync
	 * routines.
	 */
	for (i = priv->np_qfirst; want_rx && i < lim_rx; i++) {
		kring = &na->rx_rings[i];
		if (kring->ring->avail > 0) {
			revents |= want_rx;
			want_rx = 0;	/* also breaks the loop */
		}
	}
	for (i = priv->np_qfirst; want_tx && i < lim_tx; i++) {
		kring = &na->tx_rings[i];
		if (kring->ring->avail > 0) {
			revents |= want_tx;
			want_tx = 0;	/* also breaks the loop */
		}
	}

	/*
	 * If we need to push packets out (priv->np_txpoll) or want_tx is
	 * still set, we do need to run the txsync calls (on all rings,
	 * to avoid that the tx rings stall).
	 * XXX should also check cur != hwcur on the tx rings.
	 * Fortunately, normal tx mode has np_txpoll set.
	 */
	if (priv->np_txpoll || want_tx) {
		/* If we really want to be woken up (want_tx),
		 * do a selrecord, either on the global or on
		 * the private structure. Then issue the txsync
		 * so there is no race in the selrecord/selwait
		 */
flush_tx:
		for (i = priv->np_qfirst; i < lim_tx; i++) {
			kring = &na->tx_rings[i];
			/*
			 * Skip this ring if want_tx == 0
			 * (we have already done a successful sync on
			 * a previous ring) AND kring->cur == kring->hwcur
			 * (there are no pending transmissions for this ring).
			 */
			if (!want_tx && kring->ring->cur == kring->nr_hwcur)
				continue;
			/* make sure only one user thread is doing this */
			if (nm_kr_tryget(kring)) {
				ND("ring %p busy is %d",
				    kring, (int)kring->nr_busy);
				revents |= POLLERR;
				goto out;
			}

			if (netmap_verbose & NM_VERB_TXSYNC)
				D("send %d on %s %d",
					kring->ring->cur, NM_IFPNAME(ifp), i);
			if (na->nm_txsync(na, i, 0))
				revents |= POLLERR;

			/* Check avail/call selrecord only if called with POLLOUT */
			if (want_tx) {
				if (kring->ring->avail > 0) {
					/* stop at the first ring. We don't risk
					 * starvation.
					 */
					revents |= want_tx;
					want_tx = 0;
				}
			}
			nm_kr_put(kring);
		}
		if (want_tx && retry_tx) {
			selrecord(td, check_all_tx ?
			    &na->tx_si : &na->tx_rings[priv->np_qfirst].si);
			retry_tx = 0;
			goto flush_tx;
		}
	}

	/*
	 * now if want_rx is still set we need to lock and rxsync.
	 * Do it on all rings because otherwise we starve.
	 */
	if (want_rx) {
		int retry_rx = 1;
do_retry_rx:
		for (i = priv->np_qfirst; i < lim_rx; i++) {
			kring = &na->rx_rings[i];

			if (nm_kr_tryget(kring)) {
				revents |= POLLERR;
				goto out;
			}

			/* XXX NR_FORWARD should only be read on
			 * physical or NIC ports
			 */
			if (netmap_fwd || kring->ring->flags & NR_FORWARD) {
				ND(10, "forwarding some buffers up %d to %d",
				    kring->nr_hwcur, kring->ring->cur);
				netmap_grab_packets(kring, &q, netmap_fwd);
			}

			if (na->nm_rxsync(na, i, 0))
				revents |= POLLERR;
			if (netmap_no_timestamp == 0 ||
					kring->ring->flags & NR_TIMESTAMP) {
				microtime(&kring->ring->ts);
			}

			if (kring->ring->avail > 0) {
				revents |= want_rx;
				retry_rx = 0;
			}
			nm_kr_put(kring);
		}
		if (retry_rx) {
			retry_rx = 0;
			selrecord(td, check_all_rx ?
			    &na->rx_si : &na->rx_rings[priv->np_qfirst].si);
			goto do_retry_rx;
		}
	}

	/* forward host to the netmap ring.
	 * I am accessing nr_hwavail without lock, but netmap_transmit
	 * can only increment it, so the operation is safe.
	 */
	kring = &na->rx_rings[lim_rx];
	if ( (priv->np_qlast == NETMAP_HW_RING) // XXX check_all
			&& (netmap_fwd || kring->ring->flags & NR_FORWARD)
			&& kring->nr_hwavail > 0 && !host_forwarded) {
		netmap_sw_to_nic(na);
		host_forwarded = 1; /* prevent another pass */
		want_rx = 0;
		goto flush_tx;
	}

	if (q.head)
		netmap_send_up(na->ifp, &q);

out:
	return (revents);
}
/*------- driver support routines ------*/

static int netmap_hw_krings_create(struct netmap_adapter *);

static int
netmap_notify(struct netmap_adapter *na, u_int n_ring, enum txrx tx, int flags)
{
	struct netmap_kring *kring;

	if (tx == NR_TX) {
		kring = na->tx_rings + n_ring;
		selwakeuppri(&kring->si, PI_NET);
		if (flags & NAF_GLOBAL_NOTIFY)
			selwakeuppri(&na->tx_si, PI_NET);
	} else {
		kring = na->rx_rings + n_ring;
		selwakeuppri(&kring->si, PI_NET);
		if (flags & NAF_GLOBAL_NOTIFY)
			selwakeuppri(&na->rx_si, PI_NET);
	}
	return 0;
}
// XXX check handling of failures
int
netmap_attach_common(struct netmap_adapter *na)
{
	struct ifnet *ifp = na->ifp;

	if (na->num_tx_rings == 0 || na->num_rx_rings == 0) {
		D("%s: invalid rings tx %d rx %d",
			ifp->if_xname, na->num_tx_rings, na->num_rx_rings);
		return EINVAL;
	}
	WNA(ifp) = na;
	NETMAP_SET_CAPABLE(ifp);
	if (na->nm_krings_create == NULL) {
		na->nm_krings_create = netmap_hw_krings_create;
		na->nm_krings_delete = netmap_krings_delete;
	}
	if (na->nm_notify == NULL)
		na->nm_notify = netmap_notify;
	na->active_fds = 0;

	if (na->nm_mem == NULL)
		na->nm_mem = &nm_mem;
	return 0;
}
void
netmap_detach_common(struct netmap_adapter *na)
{
	if (na->ifp)
		WNA(na->ifp) = NULL; /* XXX do we need this? */

	if (na->tx_rings) { /* XXX should not happen */
		D("freeing leftover tx_rings");
		na->nm_krings_delete(na);
	}
	if (na->na_flags & NAF_MEM_OWNER)
		netmap_mem_private_delete(na->nm_mem);
	bzero(na, sizeof(*na));
	kfree(na, M_DEVBUF);
}
/*
 * Initialize a ``netmap_adapter`` object created by a driver on attach.
 * We allocate a block of memory with room for a struct netmap_adapter
 * plus two sets of N+2 struct netmap_kring (where N is the number
 * of hardware rings):
 * krings	0..N-1	are for the hardware queues.
 * kring	N	is for the host stack queue.
 * kring	N+1	is only used for the selinfo for all queues.
 * Return 0 on success, ENOMEM otherwise.
 *
 * By default the receive and transmit adapter ring counts are both initialized
 * to num_queues. na->num_tx_rings can be set for cards with different tx/rx
 * setups.
 */
int
netmap_attach(struct netmap_adapter *arg)
{
	struct netmap_hw_adapter *hwna = NULL;
	// XXX when is arg == NULL ?
	struct ifnet *ifp = arg ? arg->ifp : NULL;

	if (arg == NULL || ifp == NULL)
		goto fail;
	hwna = kmalloc(sizeof(*hwna), M_DEVBUF, M_NOWAIT | M_ZERO);
	if (hwna == NULL)
		goto fail;
	hwna->up = *arg;
	if (netmap_attach_common(&hwna->up)) {
		kfree(hwna, M_DEVBUF);
		goto fail;
	}
	netmap_adapter_get(&hwna->up);

	D("success for %s", NM_IFPNAME(ifp));
	return 0;

fail:
	D("fail, arg %p ifp %p na %p", arg, ifp, hwna);
	return (hwna ? EINVAL : ENOMEM);
}
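/*
 * A sketch of a driver-side attach (hypothetical "foo" driver; the
 * fields shown are the ones consumed by netmap_attach() above):
 *
 *	static int
 *	foo_netmap_attach(struct foo_softc *sc)
 *	{
 *		struct netmap_adapter na;
 *
 *		bzero(&na, sizeof(na));
 *		na.ifp = sc->ifp;
 *		na.num_tx_desc = sc->num_tx_desc;
 *		na.num_rx_desc = sc->num_rx_desc;
 *		na.num_tx_rings = na.num_rx_rings = sc->num_queues;
 *		na.nm_txsync = foo_netmap_txsync;	// hypothetical callbacks
 *		na.nm_rxsync = foo_netmap_rxsync;
 *		na.nm_register = foo_netmap_reg;
 *		return netmap_attach(&na);
 *	}
 */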
void
NM_DBG(netmap_adapter_get)(struct netmap_adapter *na)
{
	if (!na)
		return;
	refcount_acquire(&na->na_refcount);
}

/* returns 1 iff the netmap_adapter is destroyed */
int
NM_DBG(netmap_adapter_put)(struct netmap_adapter *na)
{
	if (!na)
		return 1;

	if (!refcount_release(&na->na_refcount))
		return 0;

	if (na->nm_dtor)
		na->nm_dtor(na);

	netmap_detach_common(na);

	return 1;
}

static int
netmap_hw_krings_create(struct netmap_adapter *na)
{
	return netmap_krings_create(na,
		na->num_tx_rings + 1, na->num_rx_rings + 1, 0);
}
/*
 * Free the allocated memory linked to the given ``netmap_adapter``
 * object.
 */
void
netmap_detach(struct ifnet *ifp)
{
	struct netmap_adapter *na = NA(ifp);

	if (!na)
		return;

	NMG_LOCK();
	netmap_disable_all_rings(ifp);
	netmap_adapter_put(na);
	na->ifp = NULL;
	netmap_enable_all_rings(ifp);
	NMG_UNLOCK();
}
/*
 * Intercept packets from the network stack and pass them
 * to netmap as incoming packets on the 'software' ring.
 * We rely on the OS to make sure that the ifp and na do not go
 * away (typically the caller checks for IFF_DRV_RUNNING or the like).
 * In nm_register() or whenever there is a reinitialization,
 * we make sure to access the core lock and per-ring locks
 * so that IFCAP_NETMAP is visible here.
 */
int
netmap_transmit(struct ifnet *ifp, struct mbuf *m)
{
	struct netmap_adapter *na = NA(ifp);
	struct netmap_kring *kring;
	u_int i, len = MBUF_LEN(m);
	u_int error = EBUSY, lim;
	struct netmap_slot *slot;

	// XXX [Linux] we do not need this lock
	// if we follow the down/configure/up protocol -gl
	// mtx_lock(&na->core_lock);
	if ( (ifp->if_capenable & IFCAP_NETMAP) == 0) {
		/* interface not in netmap mode anymore */
		error = ENXIO;
		goto done;
	}

	kring = &na->rx_rings[na->num_rx_rings];
	lim = kring->nkr_num_slots - 1;
	if (netmap_verbose & NM_VERB_HOST)
		D("%s packet %d len %d from the stack", NM_IFPNAME(ifp),
			kring->nr_hwcur + kring->nr_hwavail, len);
	// XXX reconsider long packets if we handle fragments
	if (len > NETMAP_BDG_BUF_SIZE(na->nm_mem)) { /* too long for us */
		D("%s from_host, drop packet size %d > %d", NM_IFPNAME(ifp),
			len, NETMAP_BDG_BUF_SIZE(na->nm_mem));
		goto done;
	}
	/* protect against other instances of netmap_transmit,
	 * and userspace invocations of rxsync().
	 */
	// XXX [Linux] there can be no other instances of netmap_transmit
	// on this same ring, but we still need this lock to protect
	// concurrent access from netmap_sw_to_nic() -gl
	lockmgr(&kring->q_lock, LK_EXCLUSIVE);
	if (kring->nr_hwavail >= lim) {
		if (netmap_verbose)
			D("stack ring %s full\n", NM_IFPNAME(ifp));
	} else {
		/* compute the insert position */
		i = nm_kr_rxpos(kring);
		slot = &kring->ring->slot[i];
		m_copydata(m, 0, (int)len, BDG_NMB(na, slot));
		slot->len = len;
		slot->flags = kring->nkr_slot_flags;
		kring->nr_hwavail++;
		if (netmap_verbose & NM_VERB_HOST)
			D("wake up host ring %s %d", NM_IFPNAME(na->ifp), na->num_rx_rings);
		na->nm_notify(na, na->num_rx_rings, NR_RX, 0);
		error = 0;
	}
	lockmgr(&kring->q_lock, LK_RELEASE);
	// mtx_unlock(&na->core_lock);

done:
	/* release the mbuf in either case of success or failure. As an
	 * alternative, put the mbuf in a free list and free the list
	 * only when really necessary.
	 */
	m_freem(m);

	return (error);
}
/*
 * netmap_reset() is called by the driver routines when reinitializing
 * a ring. The driver is in charge of locking to protect the kring.
 * If native netmap mode is not set just return NULL.
 */
struct netmap_slot *
netmap_reset(struct netmap_adapter *na, enum txrx tx, u_int n,
	u_int new_cur)
{
	struct netmap_kring *kring;
	int new_hwofs, lim;

	if (na == NULL) {
		D("NULL na, should not happen");
		return NULL;	/* no netmap support here */
	}
	if (!(na->ifp->if_capenable & IFCAP_NETMAP) || nma_is_generic(na)) {
		ND("interface not in netmap mode");
		return NULL;	/* nothing to reinitialize */
	}

	/* XXX note- in the new scheme, we are not guaranteed to be
	 * under lock (e.g. when called on a device reset).
	 * In this case, we should set a flag and do not trust too
	 * much the values. In practice: TODO
	 * - set a RESET flag somewhere in the kring
	 * - do the processing in a conservative way
	 * - let the *sync() fixup at the end.
	 */
	if (tx == NR_TX) {
		if (n >= na->num_tx_rings)
			return NULL;
		kring = na->tx_rings + n;
		new_hwofs = kring->nr_hwcur - new_cur;
	} else {
		if (n >= na->num_rx_rings)
			return NULL;
		kring = na->rx_rings + n;
		new_hwofs = kring->nr_hwcur + kring->nr_hwavail - new_cur;
	}
	lim = kring->nkr_num_slots - 1;
	if (new_hwofs > lim)
		new_hwofs -= lim + 1;

	/* Always set the new offset value and realign the ring. */
	D("%s hwofs %d -> %d, hwavail %d -> %d",
		tx == NR_TX ? "TX" : "RX",
		kring->nkr_hwofs, new_hwofs,
		kring->nr_hwavail,
		tx == NR_TX ? lim : kring->nr_hwavail);
	kring->nkr_hwofs = new_hwofs;
	if (tx == NR_TX) {
		kring->nr_hwavail = lim;
		kring->nr_hwreserved = 0;
	}

	/*
	 * Wakeup on the individual and global selwait
	 * We do the wakeup here, but the ring is not yet reconfigured.
	 * However, we are under lock so there are no races.
	 */
	na->nm_notify(na, n, tx, NAF_GLOBAL_NOTIFY);
	return kring->ring->slot;
}
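/*
 * A hypothetical driver-side use of netmap_reset() when reinitializing
 * a TX ring (sketch; names other than netmap_reset() are illustrative):
 *
 *	struct netmap_slot *slot = netmap_reset(na, NR_TX, ring_nr, 0);
 *	if (slot != NULL) {
 *		// netmap mode: program the hardware descriptors with the
 *		// netmap buffer addresses found through slot[0..n-1]
 *	}
 */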
/*
 * Default functions to handle rx/tx interrupts from a physical device.
 * "work_done" is non-null on the RX path, NULL for the TX path.
 * "generic" is 0 when we are called by a device driver, and 1 when we
 * are called by the generic netmap adapter layer.
 * We rely on the OS to make sure that there is only one active
 * instance per queue, and that there is appropriate locking.
 *
 * If the card is not in netmap mode, simply return 0,
 * so that the caller proceeds with regular processing.
 *
 * We return 0 also when the card is in netmap mode but the current
 * netmap adapter is the generic one, because this function will be
 * called by the generic layer.
 *
 * If the card is connected to a netmap file descriptor,
 * do a selwakeup on the individual queue, plus one on the global one
 * if needed (multiqueue card _and_ there are multiqueue listeners),
 * and return 1.
 *
 * Finally, if called on rx from an interface connected to a switch,
 * call the proper forwarding routine, and return 1.
 */
int
netmap_common_irq(struct ifnet *ifp, u_int q, u_int *work_done)
{
	struct netmap_adapter *na = NA(ifp);
	struct netmap_kring *kring;

	q &= NETMAP_RING_MASK;

	if (netmap_verbose) {
		RD(5, "received %s queue %d", work_done ? "RX" : "TX" , q);
	}

	if (work_done) { /* RX path */
		if (q >= na->num_rx_rings)
			return 0;	// not a physical queue
		kring = na->rx_rings + q;
		kring->nr_kflags |= NKR_PENDINTR;	// XXX atomic ?
		na->nm_notify(na, q, NR_RX,
			(na->num_rx_rings > 1 ? NAF_GLOBAL_NOTIFY : 0));
		*work_done = 1; /* do not fire napi again */
	} else { /* TX path */
		if (q >= na->num_tx_rings)
			return 0;	// not a physical queue
		kring = na->tx_rings + q;
		na->nm_notify(na, q, NR_TX,
			(na->num_tx_rings > 1 ? NAF_GLOBAL_NOTIFY : 0));
	}
	return 1;
}
int
netmap_rx_irq(struct ifnet *ifp, u_int q, u_int *work_done)
{
	// XXX could we check NAF_NATIVE_ON ?
	if (!(ifp->if_capenable & IFCAP_NETMAP))
		return 0;

	if (NA(ifp)->na_flags & NAF_SKIP_INTR) {
		ND("use regular interrupt");
		return 0;
	}

	return netmap_common_irq(ifp, q, work_done);
}
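/*
 * A hypothetical driver interrupt handler using the hook above
 * ("foo" names are illustrative):
 *
 *	void
 *	foo_rxintr(struct foo_rxqueue *q)
 *	{
 *		u_int work_done;
 *
 *		if (netmap_rx_irq(q->ifp, q->ring_id, &work_done))
 *			return;		// handled by netmap
 *		// ... regular rx processing otherwise ...
 *	}
 */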
static struct cdev *netmap_dev; /* /dev/netmap character device. */

/*
 * Module loader.
 *
 * Create the /dev/netmap device and initialize all global
 * variables.
 *
 * Return 0 on success, errno on failure.
 */
int
netmap_init(void)
{
	int error;

	NMG_LOCK_INIT();

	error = netmap_mem_init();
	if (error != 0) {
		kprintf("netmap: unable to initialize the memory allocator.\n");
		return (error);
	}
	kprintf("netmap: loaded module\n");
	netmap_dev = make_dev(&netmap_cdevsw, 0, UID_ROOT, GID_WHEEL, 0660,
			      "netmap");

	netmap_init_bridges();
	return (error);
}

/*
 * Module unloader.
 *
 * Free all the memory, and destroy the ``/dev/netmap`` device.
 */
void
netmap_fini(void)
{
	destroy_dev(netmap_dev);
	netmap_mem_fini();
	NMG_LOCK_DESTROY();
	kprintf("netmap: unloaded module.\n");
}