From be4519a228f0cdc3d23bcbc147abcf2e7d27f4f7 Mon Sep 17 00:00:00 2001
From: Sepherosa Ziehau
Date: Thu, 3 Jul 2014 21:15:27 +0800
Subject: [PATCH] udp: Make udp pcbinfo and portinfo per-cpu; greatly improve performance

MAJOR CHANGES:
- Add token to protect pcbinfo's inpcb list and wildcard hash table.
  Currently only udp per-cpu pcbinfo sets this token.  udp serializer
  and netisr barrier are nuked.
  o udp inpcb list:
    In most cases, the udp inpcb list is operated on in its owner
    netisr.  However, it is also accessed and modified (no effective
    udp inpcb will be unlinked though) in netisr0 to adjust multicast
    options if an interface is about to be detached.  So accesses to
    and modifications of the udp inpcb list must be protected with the
    token.

    At udp inpcb detach time, the udp inpcb is first removed from the
    udp inpcb list, then a message goes through all netisrs, which
    makes sure that no netisr is using or can find this udp inpcb
    from the udp inpcb list.  After all this, the udp inpcb is
    destroyed in its owner netisr.

    In netisrs, it is MP safe to find a udp inpcb from the udp inpcb
    list, then release the token and process the found udp inpcb.  In
    other threads, it is MP safe to find a udp inpcb from the udp
    inpcb list, then release the token and process the found udp
    inpcb in non-blocking fashion.  See also the usage of inpcb
    marker.
  o udp wildcard hash table:
    On the input path, the udp wildcard hash table is searched in its
    owner netisr.  In order to ease implicit binding (bind during
    send), connect after binding, and disconnect, udp inpcbs are
    inserted into and removed from other udp pcbinfos' wildcard hash
    tables in their owner netisr.  Thus the udp wildcard hash table
    must be protected with the token.

    At udp inpcb detach time, a message goes through all netisrs, and
    the udp inpcb is removed from the udp wildcard hash table
    belonging to the current netisr.  This makes sure that once the
    current netisr has run the message handler, this udp inpcb can
    neither be used nor be found in the current netisr.  When the
    message reaches the last netisr, the udp inpcb is redispatched to
    its owner netisr to be destroyed.

    In netisrs, it is MP safe to find a udp inpcb from the udp
    wildcard hash table, then release the token and process the found
    udp inpcb, e.g. use the udp inpcb found by in_pcblookuphash().  In
    other threads, it is MP safe to find a udp inpcb from the udp
    wildcard hash table, then release the token and process the found
    udp inpcb in non-blocking fashion.  See also the usage of inpcb
    container marker.
  o udp connect hash table:
    It is lockless MP safe, and only accessed and modified in its
    owner netisr.
- During inpcb iteration through the inpcb list, use an inpcb marker
  when calling functions that may block, e.g. in_pcbpurgeif0(), so
  the inpcb iteration will not stop prematurely if the inpcb being
  processed is removed from the inpcb list.
- Use the udp inpcb wildcard hash table and the udp inpcb connect hash
  table to dispatch input multicast and broadcast udp datagrams.
  Using the udp inpcb list could be time-consuming, since we would
  need to check the udp inpcb lists on all cpus; and secondly, once a
  udp inpcb has a local port, it will be in either the udp wildcard
  hash table or the udp connect hash table.
  Since the socket buffer operation on the input path may block, an
  inpcb container marker is used when iterating inpcbs from the udp
  inpcb wildcard hash table.  in_pcblookup_pkthash() is adjusted to
  skip inpcb container markers.
- udp socket so_port is no longer fixed to netisr0 msgport
  o Initial udp socket so_port is the current cpu's netisr msgport.
  o Bound but unconnected udp socket so_port is selected according to
    local port hash.
  o Connected udp socket so_port is selected according to the udp
    hash, i.e. laddr/faddr toeplitz hash (exception: a multicast
    laddr or multicast faddr is hashed to netisr0).
  o Multicast socket options are forced to be handled in netisr0,
    since udp socket so_port may not be netisr0 msgport.
- In order to support asynchronous udp inpcb detach:
  o EJUSTRETURN from the pru_detach method now means the protocol
    will call sodiscard() and sofree() for soclose().  The udp
    pru_detach method returns EJUSTRETURN as of this commit.
  o SS_ISCLOSING socket state is set before calling the pru_detach
    method, so the protocol can avoid certain expensive, unnecessary
    or disallowed operations in the pru_disconnect or pru_detach
    method, e.g. the udp pru_disconnect method avoids putting the udp
    inpcb back into the udp wildcard hash table if SS_ISCLOSING is
    set.

MISC CHANGES:
- pcbinfo's cpu id must be set now; -1 is disallowed.
- udp pru_abort method should never be called; it panics now.
- Restore traditional BSD behaviour if an unbound udp socket's
  connect fails: if the local port of the udp socket has been
  selected, its inpcb should be in the wildcard hash table, i.e. the
  udp inpcb should be visible on the udp datagram input path.
- Make sure multicast state is adjusted only in netisr0 for inet6 if
  an interface is about to be detached.

PERFORMANCE IMPROVEMENT:
For the 'kq_connect_client -u' test, this commit gives a 400%
performance improvement (31Kconns/s -> 160Kconns/s).
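
NOTE ON THE MARKER ITERATION:
The marker-based list walk described under MAJOR CHANGES can be
sketched in userspace roughly as follows.  This is only an
illustration under stated assumptions, not the kernel code: struct
pcb, PCB_PLACEMARKER, walk_with_marker() and demo_marker_walk() are
hypothetical stand-ins for struct inpcb, INP_PLACEMARKER and the loops
in in_pcbnotifyall()/in_pcbpurgeif0(), and the pcbinfo token is left
out entirely.  It relies only on the LIST macros from <sys/queue.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for INP_PLACEMARKER. */
#define PCB_PLACEMARKER	0x1

/* Hypothetical stand-in for struct inpcb. */
struct pcb {
	LIST_ENTRY(pcb) link;
	int flags;
	int id;
};
LIST_HEAD(pcbhead, pcb);

/*
 * Walk the list using a marker element.  The marker is moved past
 * each element before that element is visited, so visit() may unlink
 * the element it is handed without ending the walk early.
 * Returns the number of non-marker elements visited.
 */
static int
walk_with_marker(struct pcbhead *head, struct pcb *marker,
    void (*visit)(struct pcb *))
{
	struct pcb *p;
	int visited = 0;

	marker->flags = PCB_PLACEMARKER;
	LIST_INSERT_HEAD(head, marker, link);
	while ((p = LIST_NEXT(marker, link)) != NULL) {
		/* Advance the marker to just after p. */
		LIST_REMOVE(marker, link);
		LIST_INSERT_AFTER(p, marker, link);

		if (p->flags & PCB_PLACEMARKER)
			continue;	/* skip any other walker's marker */

		visit(p);		/* may remove p from the list */
		visited++;
	}
	LIST_REMOVE(marker, link);
	return visited;
}

/* Example visitor: unlink every element with an odd id. */
static void
unlink_odd(struct pcb *p)
{
	if (p->id & 1)
		LIST_REMOVE(p, link);
}

/*
 * Build a list with ids 1..4, walk it while unlinking odd ids, and
 * return how many elements remain; *visited_out gets the visit count.
 */
int
demo_marker_walk(int *visited_out)
{
	struct pcbhead head;
	struct pcb pcbs[4];
	struct pcb marker = { .flags = 0 };
	struct pcb *p;
	int i, remaining = 0;

	LIST_INIT(&head);
	for (i = 0; i < 4; i++) {
		pcbs[i].flags = 0;
		pcbs[i].id = i + 1;
		LIST_INSERT_HEAD(&head, &pcbs[i], link);
	}

	*visited_out = walk_with_marker(&head, &marker, unlink_odd);

	LIST_FOREACH(p, &head, link)
		remaining++;
	return remaining;
}
```

In this sketch every element is visited even though half of them are
unlinked during the walk; a plain LIST_FOREACH would read the link of
an element that had already been removed.  In the kernel the same
property is what allows the visitor to block, and the token to be
released passively, in the middle of the walk.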
--- sys/kern/uipc_msg.c | 3 +- sys/kern/uipc_socket.c | 39 +- sys/net/ipfw/ip_fw2.c | 8 +- sys/net/netmsg.h | 1 + sys/net/pf/pf.c | 2 +- sys/netinet/in.c | 6 +- sys/netinet/in_pcb.c | 410 +++++++++++++------- sys/netinet/in_pcb.h | 44 ++- sys/netinet/in_proto.c | 11 +- sys/netinet/ip_demux.c | 26 +- sys/netinet/ip_divert.c | 6 +- sys/netinet/ip_output.c | 19 + sys/netinet/raw_ip.c | 6 +- sys/netinet/tcp_subr.c | 16 +- sys/netinet/udp_usrreq.c | 731 ++++++++++++++++++++++++------------ sys/netinet/udp_var.h | 10 +- sys/netinet6/in6_ifattach.c | 44 ++- sys/netinet6/in6_pcb.c | 81 +++- sys/netinet6/in6_pcb.h | 4 +- sys/netinet6/ipsec.c | 2 +- sys/netinet6/raw_ip6.c | 2 +- sys/netinet6/udp6_usrreq.c | 44 ++- sys/sys/protosw.h | 4 + sys/sys/socketops.h | 2 +- sys/sys/socketvar.h | 2 + 25 files changed, 1029 insertions(+), 494 deletions(-) diff --git a/sys/kern/uipc_msg.c b/sys/kern/uipc_msg.c index ba95a09842..55480c4e56 100644 --- a/sys/kern/uipc_msg.c +++ b/sys/kern/uipc_msg.c @@ -274,7 +274,7 @@ so_pru_detach(struct socket *so) return (error); } -void +int so_pru_detach_direct(struct socket *so) { struct netmsg_pru_detach msg; @@ -285,6 +285,7 @@ so_pru_detach_direct(struct socket *so) msg.base.lmsg.ms_flags |= MSGF_SYNC; func((netmsg_t)&msg); KKASSERT(msg.base.lmsg.ms_flags & MSGF_DONE); + return(msg.base.lmsg.ms_error); } int diff --git a/sys/kern/uipc_socket.c b/sys/kern/uipc_socket.c index 3e56b35c08..e2f8bf4f18 100644 --- a/sys/kern/uipc_socket.c +++ b/sys/kern/uipc_socket.c @@ -112,7 +112,6 @@ static void filt_sowdetach(struct knote *kn); static int filt_sowrite(struct knote *kn, long hint); static int filt_solisten(struct knote *kn, long hint); -static void sodiscard(struct socket *so); static int soclose_sync(struct socket *so, int fflag); static void soclose_fast(struct socket *so); @@ -422,6 +421,7 @@ soclose(struct socket *so, int fflag) int error; funsetown(&so->so_sigio); + sosetstate(so, SS_ISCLOSING); if (!use_soclose_fast || (so->so_proto->pr_flags & 
PR_SYNC_PORT) || ((so->so_state & SS_ISCONNECTED) && @@ -434,7 +434,7 @@ soclose(struct socket *so, int fflag) return error; } -static void +void sodiscard(struct socket *so) { lwkt_getpooltoken(so); @@ -560,6 +560,13 @@ drop: int error2; error2 = so_pru_detach(so); + if (error2 == EJUSTRETURN) { + /* + * Protocol will call sodiscard() + * and sofree() for us. + */ + return error; + } if (error == 0) error = error2; } @@ -596,8 +603,18 @@ soclose_disconn_async_handler(netmsg_t msg) (so->so_state & SS_ISDISCONNECTING) == 0) so_pru_disconnect_direct(so); - if (so->so_pcb) - so_pru_detach_direct(so); + if (so->so_pcb) { + int error; + + error = so_pru_detach_direct(so); + if (error == EJUSTRETURN) { + /* + * Protocol will call sodiscard() + * and sofree() for us. + */ + return; + } + } sodiscard(so); sofree(so); @@ -618,8 +635,18 @@ soclose_detach_async_handler(netmsg_t msg) { struct socket *so = msg->base.nm_so; - if (so->so_pcb) - so_pru_detach_direct(so); + if (so->so_pcb) { + int error; + + error = so_pru_detach_direct(so); + if (error == EJUSTRETURN) { + /* + * Protocol will call sodiscard() + * and sofree() for us. 
+ */ + return; + } + } sodiscard(so); sofree(so); diff --git a/sys/net/ipfw/ip_fw2.c b/sys/net/ipfw/ip_fw2.c index 7c15b32c46..cef01d0aca 100644 --- a/sys/net/ipfw/ip_fw2.c +++ b/sys/net/ipfw/ip_fw2.c @@ -1566,15 +1566,15 @@ _ipfw_match_uid(const struct ipfw_flow_id *fid, struct ifnet *oif, { struct in_addr src_ip, dst_ip; struct inpcbinfo *pi; - int wildcard; + boolean_t wildcard; struct inpcb *pcb; if (fid->proto == IPPROTO_TCP) { - wildcard = 0; + wildcard = FALSE; pi = &tcbinfo[mycpuid]; } else if (fid->proto == IPPROTO_UDP) { - wildcard = 1; - pi = &udbinfo; + wildcard = TRUE; + pi = &udbinfo[mycpuid]; } else { return 0; } diff --git a/sys/net/netmsg.h b/sys/net/netmsg.h index 007025ae61..69e7007011 100644 --- a/sys/net/netmsg.h +++ b/sys/net/netmsg.h @@ -60,6 +60,7 @@ struct netmsg_base { }; #define MSGF_IGNSOPORT MSGF_USER0 /* don't check so_port */ +#define MSGF_PROTO1 MSGF_USER1 /* protocol specific */ typedef struct netmsg_base *netmsg_base_t; diff --git a/sys/net/pf/pf.c b/sys/net/pf/pf.c index de30f63c8a..f1a637afe2 100644 --- a/sys/net/pf/pf.c +++ b/sys/net/pf/pf.c @@ -3246,7 +3246,7 @@ pf_socket_lookup(int direction, struct pf_pdesc *pd) return (-1); sport = pd->hdr.udp->uh_sport; dport = pd->hdr.udp->uh_dport; - pi = &udbinfo; + pi = &udbinfo[mycpuid]; break; default: return (-1); diff --git a/sys/netinet/in.c b/sys/netinet/in.c index d2e41b5e47..3fa5544213 100644 --- a/sys/netinet/in.c +++ b/sys/netinet/in.c @@ -1414,9 +1414,11 @@ in_ifdetach_dispatch(netmsg_t nmsg) { struct lwkt_msg *lmsg = &nmsg->lmsg; struct ifnet *ifp = lmsg->u.ms_resultp; + int cpu; - in_pcbpurgeif0(LIST_FIRST(&ripcbinfo.pcblisthead), ifp); - in_pcbpurgeif0(LIST_FIRST(&udbinfo.pcblisthead), ifp); + in_pcbpurgeif0(&ripcbinfo, ifp); + for (cpu = 0; cpu < ncpus2; ++cpu) + in_pcbpurgeif0(&udbinfo[cpu], ifp); lwkt_replymsg(lmsg, 0); } diff --git a/sys/netinet/in_pcb.c b/sys/netinet/in_pcb.c index 1118d55ac4..ca79b60a45 100644 --- a/sys/netinet/in_pcb.c +++ b/sys/netinet/in_pcb.c @@ 
-89,6 +89,7 @@ #include #include #include +#include #include #include @@ -140,6 +141,13 @@ int ipport_hilastauto = IPPORT_HILASTAUTO; /* 65535 */ int udpencap_enable = 1; /* enabled by default */ int udpencap_port = 4500; /* triggers decapsulation */ +/* + * Per-netisr inpcb markers. + * NOTE: they should only be used in netisrs. + */ +static struct inpcb *in_pcbmarkers; +static struct inpcontainer *in_pcbcontainer_markers; + static int sysctl_net_ipport_check(SYSCTL_HANDLER_ARGS) { @@ -188,12 +196,22 @@ SYSCTL_PROC(_net_inet_ip_portrange, OID_AUTO, hilast, CTLTYPE_INT|CTLFLAG_RW, */ void -in_pcbinfo_init(struct inpcbinfo *pcbinfo) +in_pcbinfo_init(struct inpcbinfo *pcbinfo, int cpu, boolean_t shared) { + KASSERT(cpu >= 0 && cpu < ncpus, ("invalid cpu%d", cpu)); + pcbinfo->cpu = cpu; + LIST_INIT(&pcbinfo->pcblisthead); - pcbinfo->cpu = -1; pcbinfo->portsave = kmalloc(sizeof(*pcbinfo->portsave), M_PCB, M_WAITOK | M_ZERO); + + if (shared) { + pcbinfo->infotoken = kmalloc(sizeof(struct lwkt_token), + M_PCB, M_WAITOK); + lwkt_token_init(pcbinfo->infotoken, "infotoken"); + } else { + pcbinfo->infotoken = NULL; + } } struct baddynamicports baddynamicports; @@ -219,6 +237,39 @@ in_baddynamic(u_int16_t port, u_int16_t proto) } } +void +in_pcbonlist(struct inpcb *inp) +{ + struct inpcbinfo *pcbinfo = inp->inp_pcbinfo; + + KASSERT(&curthread->td_msgport == netisr_cpuport(pcbinfo->cpu), + ("not in the correct netisr")); + KASSERT((inp->inp_flags & INP_ONLIST) == 0, ("already on pcblist")); + inp->inp_flags |= INP_ONLIST; + + GET_PCBINFO_TOKEN(pcbinfo); + LIST_INSERT_HEAD(&pcbinfo->pcblisthead, inp, inp_list); + pcbinfo->ipi_count++; + REL_PCBINFO_TOKEN(pcbinfo); +} + +void +in_pcbofflist(struct inpcb *inp) +{ + struct inpcbinfo *pcbinfo = inp->inp_pcbinfo; + + KASSERT(&curthread->td_msgport == netisr_cpuport(pcbinfo->cpu), + ("not in the correct netisr")); + KASSERT(inp->inp_flags & INP_ONLIST, ("not on pcblist")); + inp->inp_flags &= ~INP_ONLIST; + + 
GET_PCBINFO_TOKEN(pcbinfo); + LIST_REMOVE(inp, inp_list); + KASSERT(pcbinfo->ipi_count > 0, + ("invalid inpcb count %d", pcbinfo->ipi_count)); + pcbinfo->ipi_count--; + REL_PCBINFO_TOKEN(pcbinfo); +} /* * Allocate a PCB and associate it with the socket. @@ -252,8 +303,8 @@ in_pcballoc(struct socket *so, struct inpcbinfo *pcbinfo) #endif soreference(so); so->so_pcb = inp; - LIST_INSERT_HEAD(&pcbinfo->pcblisthead, inp, inp_list); - pcbinfo->ipi_count++; + + in_pcbonlist(inp); return (0); } @@ -269,8 +320,7 @@ in_pcbunlink(struct inpcb *inp, struct inpcbinfo *pcbinfo) KASSERT((inp->inp_flags & (INP_WILDCARD | INP_CONNECTED)) == 0, ("already linked")); - LIST_REMOVE(inp, inp_list); - pcbinfo->ipi_count--; + in_pcbofflist(inp); inp->inp_pcbinfo = NULL; } @@ -285,8 +335,7 @@ in_pcblink(struct inpcb *inp, struct inpcbinfo *pcbinfo) ("already linked")); inp->inp_pcbinfo = pcbinfo; - LIST_INSERT_HEAD(&pcbinfo->pcblisthead, inp, inp_list); - pcbinfo->ipi_count++; + in_pcbonlist(inp); } static int @@ -985,9 +1034,9 @@ void in_pcbdisconnect(struct inpcb *inp) { + in_pcbremconnhash(inp); inp->inp_faddr.s_addr = INADDR_ANY; inp->inp_fport = 0; - in_pcbremconnhash(inp); } void @@ -1098,17 +1147,29 @@ in_setpeeraddr_dispatch(netmsg_t msg) } void -in_pcbnotifyall(struct inpcbhead *head, struct in_addr faddr, int err, +in_pcbnotifyall(struct inpcbinfo *pcbinfo, struct in_addr faddr, int err, void (*notify)(struct inpcb *, int)) { - struct inpcb *inp, *ninp; + struct inpcb *inp, *marker; + + KASSERT(&curthread->td_msgport == netisr_cpuport(pcbinfo->cpu), + ("not in the correct netisr")); + marker = &in_pcbmarkers[mycpuid]; /* - * note: if INP_PLACEMARKER is set we must ignore the rest of - * the structure and skip it. + * NOTE: + * - If INP_PLACEMARKER is set we must ignore the rest of the + * structure and skip it. + * - It is safe to nuke inpcbs here, since we are in their own + * netisr. 
*/ - crit_enter(); - LIST_FOREACH_MUTABLE(inp, head, inp_list, ninp) { + GET_PCBINFO_TOKEN(pcbinfo); + + LIST_INSERT_HEAD(&pcbinfo->pcblisthead, marker, inp_list); + while ((inp = LIST_NEXT(marker, inp_list)) != NULL) { + LIST_REMOVE(marker, inp_list); + LIST_INSERT_AFTER(inp, marker, inp_list); + if (inp->inp_flags & INP_PLACEMARKER) continue; #ifdef INET6 @@ -1120,21 +1181,57 @@ in_pcbnotifyall(struct inpcbhead *head, struct in_addr faddr, int err, continue; (*notify)(inp, err); /* can remove inp from list! */ } - crit_exit(); + LIST_REMOVE(marker, inp_list); + + REL_PCBINFO_TOKEN(pcbinfo); } void -in_pcbpurgeif0(struct inpcb *head, struct ifnet *ifp) +in_pcbpurgeif0(struct inpcbinfo *pcbinfo, struct ifnet *ifp) { - struct inpcb *inp; - struct ip_moptions *imo; - int i, gap; + struct inpcb *inp, *marker; + + /* + * We only need to make sure that we are in netisr0, where all + * multicast operation happen. We could check inpcbinfo which + * does not belong to netisr0 by holding the inpcbinfo's token. + * In this case, the pcbinfo must be able to be shared, i.e. + * pcbinfo->infotoken is not NULL. + */ + KASSERT(&curthread->td_msgport == netisr_cpuport(0), + ("not in netisr0")); + KASSERT(pcbinfo->cpu == 0 || pcbinfo->infotoken != NULL, + ("pcbinfo could not be shared")); + + /* + * Get a marker for the current netisr (netisr0). + * + * It is possible that the multicast address deletion blocks, + * which could cause temporary token releasing. So we use + * inpcb marker here to get a coherent view of the inpcb list. + * + * While, on the other hand, moptions are only added and deleted + * in netisr0, so we would not see staled moption or miss moption + * even if the token was released due to the blocking multicast + * address deletion. 
+ */ + marker = &in_pcbmarkers[mycpuid]; + + GET_PCBINFO_TOKEN(pcbinfo); + + LIST_INSERT_HEAD(&pcbinfo->pcblisthead, marker, inp_list); + while ((inp = LIST_NEXT(marker, inp_list)) != NULL) { + struct ip_moptions *imo; + + LIST_REMOVE(marker, inp_list); + LIST_INSERT_AFTER(inp, marker, inp_list); - for (inp = head; inp != NULL; inp = LIST_NEXT(inp, inp_list)) { if (inp->inp_flags & INP_PLACEMARKER) continue; imo = inp->inp_moptions; if ((inp->inp_vflag & INP_IPV4) && imo != NULL) { + int i, gap; + /* * Unselect the outgoing interface if it is being * detached. @@ -1149,6 +1246,11 @@ in_pcbpurgeif0(struct inpcb *head, struct ifnet *ifp) for (i = 0, gap = 0; i < imo->imo_num_memberships; i++) { if (imo->imo_membership[i]->inm_ifp == ifp) { + /* + * NOTE: + * This could block and the pcbinfo + * token could be passively released. + */ in_delmulti(imo->imo_membership[i]); gap++; } else if (gap != 0) @@ -1158,6 +1260,9 @@ in_pcbpurgeif0(struct inpcb *head, struct ifnet *ifp) imo->imo_num_memberships -= gap; } } + LIST_REMOVE(marker, inp_list); + + REL_PCBINFO_TOKEN(pcbinfo); } /* @@ -1291,6 +1396,8 @@ in_pcblocalgroup_last(const struct inpcbinfo *pcbinfo, if (pcbinfo->localgrphashbase == NULL) return NULL; + GET_PCBINFO_TOKEN(pcbinfo); + hdr = &pcbinfo->localgrphashbase[ INP_PCBLOCALGRPHASH(inp->inp_lport, pcbinfo->localgrphashmask)]; @@ -1303,8 +1410,10 @@ in_pcblocalgroup_last(const struct inpcbinfo *pcbinfo, break; } } - if (grp == NULL || grp->il_inpcnt == 1) + if (grp == NULL || grp->il_inpcnt == 1) { + REL_PCBINFO_TOKEN(pcbinfo); return NULL; + } KASSERT(grp->il_inpcnt >= 2, ("invalid localgroup inp count %d", grp->il_inpcnt)); @@ -1314,9 +1423,11 @@ in_pcblocalgroup_last(const struct inpcbinfo *pcbinfo, if (i == last) last = grp->il_inpcnt - 2; + REL_PCBINFO_TOKEN(pcbinfo); return grp->il_inp[last]; } } + REL_PCBINFO_TOKEN(pcbinfo); return NULL; } @@ -1328,6 +1439,8 @@ inp_localgroup_lookup(const struct inpcbinfo *pcbinfo, const struct inp_localgrphead *hdr; 
const struct inp_localgroup *grp; + ASSERT_PCBINFO_TOKEN_HELD(pcbinfo); + hdr = &pcbinfo->localgrphashbase[ INP_PCBLOCALGRPHASH(lport, pcbinfo->localgrphashmask)]; #ifdef INP_LOCALGROUP_HASHTHR @@ -1413,6 +1526,7 @@ in_pcblookup_pkthash(struct inpcbinfo *pcbinfo, struct in_addr faddr, } if (jinp != NULL) return (jinp); + if (wildcard) { struct inpcb *local_wild = NULL; struct inpcb *jinp_wild = NULL; @@ -1424,6 +1538,8 @@ in_pcblookup_pkthash(struct inpcbinfo *pcbinfo, struct in_addr faddr, struct sockaddr_in jsin; struct ucred *cred; + GET_PCBINFO_TOKEN(pcbinfo); + /* * Check local group first */ @@ -1432,8 +1548,10 @@ in_pcblookup_pkthash(struct inpcbinfo *pcbinfo, struct in_addr faddr, !(ifp && ifp->if_type == IFT_FAITH)) { inp = inp_localgroup_lookup(pcbinfo, laddr, lport, m->m_pkthdr.hash); - if (inp != NULL) + if (inp != NULL) { + REL_PCBINFO_TOKEN(pcbinfo); return inp; + } } /* @@ -1448,6 +1566,9 @@ in_pcblookup_pkthash(struct inpcbinfo *pcbinfo, struct in_addr faddr, INP_PCBWILDCARDHASH(lport, pcbinfo->wildcardhashmask)]; LIST_FOREACH(ic, chead, ic_list) { inp = ic->ic_inp; + if (inp->inp_flags & INP_PLACEMARKER) + continue; + jsin.sin_addr.s_addr = laddr.s_addr; #ifdef INET6 if (!(inp->inp_vflag & INP_IPV4)) @@ -1470,10 +1591,12 @@ in_pcblookup_pkthash(struct inpcbinfo *pcbinfo, struct in_addr faddr, !(inp->inp_flags & INP_FAITH)) continue; if (inp->inp_laddr.s_addr == laddr.s_addr) { - if (cred != NULL && jailed(cred)) + if (cred != NULL && jailed(cred)) { jinp = inp; - else + } else { + REL_PCBINFO_TOKEN(pcbinfo); return (inp); + } } if (inp->inp_laddr.s_addr == INADDR_ANY) { #ifdef INET6 @@ -1490,6 +1613,9 @@ in_pcblookup_pkthash(struct inpcbinfo *pcbinfo, struct in_addr faddr, } } } + + REL_PCBINFO_TOKEN(pcbinfo); + if (local_wild != NULL) return (local_wild); #ifdef INET6 @@ -1538,10 +1664,10 @@ in_pcbinsconnhash(struct inpcb *inp) } #endif - KASSERT(!(inp->inp_flags & INP_WILDCARD), - ("already on wildcardhash")); - KASSERT(!(inp->inp_flags & 
INP_CONNECTED), - ("already on connhash")); + KASSERT(&curthread->td_msgport == netisr_cpuport(pcbinfo->cpu), + ("not in the correct netisr")); + KASSERT(!(inp->inp_flags & INP_WILDCARD), ("already on wildcardhash")); + KASSERT(!(inp->inp_flags & INP_CONNECTED), ("already on connhash")); inp->inp_flags |= INP_CONNECTED; /* @@ -1558,7 +1684,12 @@ in_pcbinsconnhash(struct inpcb *inp) void in_pcbremconnhash(struct inpcb *inp) { + struct inpcbinfo *pcbinfo = inp->inp_pcbinfo; + + KASSERT(&curthread->td_msgport == netisr_cpuport(pcbinfo->cpu), + ("not in the correct netisr")); KASSERT(inp->inp_flags & INP_CONNECTED, ("inp not connected")); + LIST_REMOVE(inp, inp_hash); inp->inp_flags &= ~INP_CONNECTED; } @@ -1694,6 +1825,8 @@ in_pcbinslocalgrphash_oncpu(struct inpcb *inp, struct inpcbinfo *pcbinfo) struct inp_localgroup *grp, *grp_alloc = NULL; struct ucred *cred; + ASSERT_PCBINFO_TOKEN_HELD(pcbinfo); + if (pcbinfo->localgrphashbase == NULL) return; @@ -1829,6 +1962,8 @@ in_pcbinswildcardhash_oncpu(struct inpcb *inp, struct inpcbinfo *pcbinfo) struct inpcontainer *ic; struct inpcontainerhead *bucket; + GET_PCBINFO_TOKEN(pcbinfo); + in_pcbinslocalgrphash_oncpu(inp, pcbinfo); bucket = &pcbinfo->wildcardhashbase[ @@ -1837,6 +1972,8 @@ in_pcbinswildcardhash_oncpu(struct inpcb *inp, struct inpcbinfo *pcbinfo) ic = kmalloc(sizeof(struct inpcontainer), M_TEMP, M_INTWAIT); ic->ic_inp = inp; LIST_INSERT_HEAD(bucket, ic, ic_list); + + REL_PCBINFO_TOKEN(pcbinfo); } /* @@ -1847,6 +1984,8 @@ in_pcbinswildcardhash(struct inpcb *inp) { struct inpcbinfo *pcbinfo = inp->inp_pcbinfo; + KASSERT(&curthread->td_msgport == netisr_cpuport(pcbinfo->cpu), + ("not in correct netisr")); KASSERT(!(inp->inp_flags & INP_CONNECTED), ("already on connhash")); KASSERT(!(inp->inp_flags & INP_WILDCARD), @@ -1862,6 +2001,8 @@ in_pcbremlocalgrphash_oncpu(struct inpcb *inp, struct inpcbinfo *pcbinfo) struct inp_localgrphead *hdr; struct inp_localgroup *grp; + ASSERT_PCBINFO_TOKEN_HELD(pcbinfo); + if 
(pcbinfo->localgrphashbase == NULL) return; @@ -1896,6 +2037,8 @@ in_pcbremwildcardhash_oncpu(struct inpcb *inp, struct inpcbinfo *pcbinfo) struct inpcontainer *ic; struct inpcontainerhead *head; + GET_PCBINFO_TOKEN(pcbinfo); + in_pcbremlocalgrphash_oncpu(inp, pcbinfo); /* find bucket */ @@ -1906,10 +2049,12 @@ in_pcbremwildcardhash_oncpu(struct inpcb *inp, struct inpcbinfo *pcbinfo) if (ic->ic_inp == inp) goto found; } + REL_PCBINFO_TOKEN(pcbinfo); return; /* not found! */ found: LIST_REMOVE(ic, ic_list); /* remove container from bucket chain */ + REL_PCBINFO_TOKEN(pcbinfo); kfree(ic, M_TEMP); /* deallocate container */ } @@ -1921,7 +2066,10 @@ in_pcbremwildcardhash(struct inpcb *inp) { struct inpcbinfo *pcbinfo = inp->inp_pcbinfo; + KASSERT(&curthread->td_msgport == netisr_cpuport(pcbinfo->cpu), + ("not in correct netisr")); KASSERT(inp->inp_flags & INP_WILDCARD, ("inp not wildcard")); + in_pcbremwildcardhash_oncpu(inp, pcbinfo); inp->inp_flags &= ~INP_WILDCARD; } @@ -1958,8 +2106,9 @@ in_pcbremlists(struct inpcb *inp) } else if (inp->inp_flags & INP_CONNECTED) { in_pcbremconnhash(inp); } - LIST_REMOVE(inp, inp_list); - inp->inp_pcbinfo->ipi_count--; + + if (inp->inp_flags & INP_ONLIST) + in_pcbofflist(inp); } int @@ -1982,17 +2131,23 @@ prison_xinpcb(struct thread *td, struct inpcb *inp) int in_pcblist_global(SYSCTL_HANDLER_ARGS) { - struct inpcbinfo *pcbinfo = arg1; - struct inpcb *inp, *marker; - struct xinpcb xi; - int error, i, n; + struct inpcbinfo *pcbinfo_arr = arg1; + int pcbinfo_arrlen = arg2; + struct inpcb *marker; + int cpu, origcpu; + int error, n; + + KASSERT(pcbinfo_arrlen <= ncpus && pcbinfo_arrlen >= 1, + ("invalid pcbinfo count %d", pcbinfo_arrlen)); /* * The process of preparing the TCB list is too time-consuming and * resource-intensive to repeat twice on every request. 
*/ + n = 0; if (req->oldptr == NULL) { - n = pcbinfo->ipi_count; + for (cpu = 0; cpu < pcbinfo_arrlen; ++cpu) + n += pcbinfo_arr[cpu].ipi_count; req->oldidx = (n + n/8 + 10) * sizeof(struct xinpcb); return 0; } @@ -2000,124 +2155,72 @@ in_pcblist_global(SYSCTL_HANDLER_ARGS) if (req->newptr != NULL) return EPERM; + marker = kmalloc(sizeof(struct inpcb), M_TEMP, M_WAITOK|M_ZERO); + marker->inp_flags |= INP_PLACEMARKER; + /* * OK, now we're committed to doing something. Re-fetch ipi_count * after obtaining the generation count. */ - n = pcbinfo->ipi_count; + error = 0; + origcpu = mycpuid; + for (cpu = 0; cpu < pcbinfo_arrlen && error == 0; ++cpu) { + struct inpcbinfo *pcbinfo = &pcbinfo_arr[cpu]; + struct inpcb *inp; + struct xinpcb xi; + int i; - marker = kmalloc(sizeof(struct inpcb), M_TEMP, M_WAITOK|M_ZERO); - marker->inp_flags |= INP_PLACEMARKER; - LIST_INSERT_HEAD(&pcbinfo->pcblisthead, marker, inp_list); + lwkt_migratecpu(cpu); - i = 0; - error = 0; + GET_PCBINFO_TOKEN(pcbinfo); - while ((inp = LIST_NEXT(marker, inp_list)) != NULL && i < n) { - LIST_REMOVE(marker, inp_list); - LIST_INSERT_AFTER(inp, marker, inp_list); + n = pcbinfo->ipi_count; - if (inp->inp_flags & INP_PLACEMARKER) - continue; - if (prison_xinpcb(req->td, inp)) - continue; - bzero(&xi, sizeof xi); - xi.xi_len = sizeof xi; - bcopy(inp, &xi.xi_inp, sizeof *inp); - if (inp->inp_socket) - sotoxsocket(inp->inp_socket, &xi.xi_socket); - if ((error = SYSCTL_OUT(req, &xi, sizeof xi)) != 0) - break; - ++i; - } - LIST_REMOVE(marker, inp_list); - if (error == 0 && i < n) { - bzero(&xi, sizeof xi); - xi.xi_len = sizeof xi; - while (i < n) { - error = SYSCTL_OUT(req, &xi, sizeof xi); + LIST_INSERT_HEAD(&pcbinfo->pcblisthead, marker, inp_list); + i = 0; + while ((inp = LIST_NEXT(marker, inp_list)) != NULL && i < n) { + LIST_REMOVE(marker, inp_list); + LIST_INSERT_AFTER(inp, marker, inp_list); + + if (inp->inp_flags & INP_PLACEMARKER) + continue; + if (prison_xinpcb(req->td, inp)) + continue; + + bzero(&xi, 
sizeof xi); + xi.xi_len = sizeof xi; + bcopy(inp, &xi.xi_inp, sizeof *inp); + if (inp->inp_socket) + sotoxsocket(inp->inp_socket, &xi.xi_socket); + if ((error = SYSCTL_OUT(req, &xi, sizeof xi)) != 0) + break; ++i; } - } - kfree(marker, M_TEMP); - return(error); -} + LIST_REMOVE(marker, inp_list); -int -in_pcblist_global_cpu0(SYSCTL_HANDLER_ARGS) -{ - boolean_t migrate = FALSE; - int origcpu = mycpuid; - int error; + REL_PCBINFO_TOKEN(pcbinfo); - if (origcpu != 0) { - migrate = TRUE; - lwkt_migratecpu(0); + if (error == 0 && i < n) { + bzero(&xi, sizeof xi); + xi.xi_len = sizeof xi; + while (i < n) { + error = SYSCTL_OUT(req, &xi, sizeof xi); + if (error) + break; + ++i; + } + } } - error = in_pcblist_global(oidp, arg1, arg2, req); - - if (migrate) - lwkt_migratecpu(origcpu); + lwkt_migratecpu(origcpu); + kfree(marker, M_TEMP); return error; } int -in_pcblist_global_nomarker(SYSCTL_HANDLER_ARGS, struct xinpcb **xi0, int *nxi0) +in_pcblist_global_ncpus2(SYSCTL_HANDLER_ARGS) { - struct inpcbinfo *pcbinfo = arg1; - struct inpcb *inp; - struct xinpcb *xi; - int nxi; - - *nxi0 = 0; - *xi0 = NULL; - - /* - * The process of preparing the PCB list is too time-consuming and - * resource-intensive to repeat twice on every request. 
- */ - if (req->oldptr == NULL) { - int n = pcbinfo->ipi_count; - - req->oldidx = (n + n/8 + 10) * sizeof(struct xinpcb); - return 0; - } - - if (req->newptr != NULL) - return EPERM; - - if (pcbinfo->ipi_count == 0) - return 0; - - nxi = 0; - xi = kmalloc(pcbinfo->ipi_count * sizeof(*xi), M_TEMP, - M_WAITOK | M_ZERO | M_NULLOK); - if (xi == NULL) - return ENOMEM; - - LIST_FOREACH(inp, &pcbinfo->pcblisthead, inp_list) { - struct xinpcb *xi_ptr = &xi[nxi]; - - if (prison_xinpcb(req->td, inp)) - continue; - - xi_ptr->xi_len = sizeof(*xi_ptr); - bcopy(inp, &xi_ptr->xi_inp, sizeof(*inp)); - if (inp->inp_socket) - sotoxsocket(inp->inp_socket, &xi_ptr->xi_socket); - ++nxi; - } - - if (nxi == 0) { - kfree(xi, M_TEMP); - return 0; - } - - *nxi0 = nxi; - *xi0 = xi; - - return 0; + return in_pcblist_global(oidp, arg1, ncpus2, req); } void @@ -2182,3 +2285,40 @@ in_pcbportrange(u_short *hi0, u_short *lo0, u_short ofs, u_short step) *hi0 = hi; *lo0 = lo; } + +void +in_pcbglobalinit(void) +{ + int cpu; + + in_pcbmarkers = kmalloc(ncpus * sizeof(struct inpcb), M_PCB, + M_WAITOK | M_ZERO); + in_pcbcontainer_markers = kmalloc(ncpus * sizeof(struct inpcontainer), + M_PCB, M_WAITOK | M_ZERO); + + for (cpu = 0; cpu < ncpus; ++cpu) { + struct inpcontainer *ic = &in_pcbcontainer_markers[cpu]; + struct inpcb *marker = &in_pcbmarkers[cpu]; + + marker->inp_flags |= INP_PLACEMARKER; + ic->ic_inp = marker; + } +} + +struct inpcb * +in_pcbmarker(int cpuid) +{ + KASSERT(cpuid >= 0 && cpuid < ncpus, ("invalid cpuid %d", cpuid)); + KASSERT(curthread->td_type == TD_TYPE_NETISR, ("not in netisr")); + + return &in_pcbmarkers[cpuid]; +} + +struct inpcontainer * +in_pcbcontainer_marker(int cpuid) +{ + KASSERT(cpuid >= 0 && cpuid < ncpus, ("invalid cpuid %d", cpuid)); + KASSERT(curthread->td_type == TD_TYPE_NETISR, ("not in netisr")); + + return &in_pcbcontainer_markers[cpuid]; +} diff --git a/sys/netinet/in_pcb.h b/sys/netinet/in_pcb.h index b386cdf912..676ed636ca 100644 --- a/sys/netinet/in_pcb.h 
+++ b/sys/netinet/in_pcb.h @@ -303,6 +303,7 @@ struct inpcbportinfo { } __cachealign; struct inpcbinfo { /* XXX documentation, prefixes */ + struct lwkt_token *infotoken; /* if this inpcbinfo is shared */ struct inpcbhead *hashbase; u_long hashmask; int portinfo_mask; @@ -315,8 +316,8 @@ struct inpcbinfo { /* XXX documentation, prefixes */ struct inpcbhead pcblisthead; /* head of queue of active pcb's */ size_t ipi_size; /* allocation size for pcbs */ u_int ipi_count; /* number of pcbs in this list */ + int cpu; /* related protocol thread cpu */ u_quad_t ipi_gencnt; /* current generation count */ - int cpu; /* related protocol thread cpu or -1 */ } __cachealign; @@ -362,6 +363,7 @@ struct inpcbinfo { /* XXX documentation, prefixes */ #define IN6P_RFC2292 0x40000000 /* used RFC2292 API on the socket */ #define IN6P_MTU 0x80000000 /* receive path MTU */ +#define INP_ONLIST 0x20000000 /* on pcblist */ #define INP_RECVTTL 0x80000000 /* receive incoming IP TTL */ #define INP_CONTROLOPTS (INP_RECVOPTS|INP_RECVRETOPTS|INP_RECVDSTADDR|\ @@ -439,6 +441,28 @@ do { \ #define ASSERT_PORT_TOKEN_HELD(portinfo) #endif /* INVARIANTS */ +#define GET_PCBINFO_TOKEN(pcbinfo) \ +do { \ + if ((pcbinfo)->infotoken) \ + lwkt_gettoken((pcbinfo)->infotoken); \ +} while (0) + +#define REL_PCBINFO_TOKEN(pcbinfo) \ +do { \ + if ((pcbinfo)->infotoken) \ + lwkt_reltoken((pcbinfo)->infotoken); \ +} while (0) + +#ifdef INVARIANTS +#define ASSERT_PCBINFO_TOKEN_HELD(pcbinfo) \ +do { \ + if ((pcbinfo)->infotoken) \ + ASSERT_LWKT_TOKEN_HELD((pcbinfo)->infotoken); \ +} while (0) +#else /* !INVARIANTS */ +#define ASSERT_PCBINFO_TOKEN_HELD(pcbinfo) +#endif /* INVARIANTS */ + extern int ipport_lowfirstauto; extern int ipport_lowlastauto; extern int ipport_firstauto; @@ -450,14 +474,16 @@ union netmsg; struct xinpcb; void in_pcbportrange(u_short *, u_short *, u_short, u_short); -void in_pcbpurgeif0 (struct inpcb *, struct ifnet *); +void in_pcbpurgeif0 (struct inpcbinfo *, struct ifnet *); void in_losing 
(struct inpcb *); void in_rtchange (struct inpcb *, int); -void in_pcbinfo_init (struct inpcbinfo *); +void in_pcbinfo_init (struct inpcbinfo *, int, boolean_t); void in_pcbportinfo_init (struct inpcbportinfo *, int, boolean_t, u_short); int in_pcballoc (struct socket *, struct inpcbinfo *); void in_pcbunlink (struct inpcb *, struct inpcbinfo *); void in_pcblink (struct inpcb *, struct inpcbinfo *); +void in_pcbonlist (struct inpcb *); +void in_pcbofflist (struct inpcb *); int in_pcbbind (struct inpcb *, struct sockaddr *, struct thread *); int in_pcbbind_remote(struct inpcb *, const struct sockaddr *, struct thread *); @@ -484,7 +510,7 @@ struct inpcb * in_pcblookup_pkthash (struct inpcbinfo *, struct in_addr, u_int, struct in_addr, u_int, boolean_t, struct ifnet *, const struct mbuf *); -void in_pcbnotifyall (struct inpcbhead *, struct in_addr, +void in_pcbnotifyall (struct inpcbinfo *, struct in_addr, int, void (*)(struct inpcb *, int)); int in_setpeeraddr (struct socket *so, struct sockaddr **nam); void in_setpeeraddr_dispatch(union netmsg *); @@ -499,11 +525,15 @@ int prison_xinpcb (struct thread *p, struct inpcb *inp); void in_savefaddr (struct socket *so, const struct sockaddr *faddr); struct inpcb * in_pcblocalgroup_last(const struct inpcbinfo *, const struct inpcb *); +void in_pcbglobalinit(void); int in_pcblist_global(SYSCTL_HANDLER_ARGS); -int in_pcblist_global_cpu0(SYSCTL_HANDLER_ARGS); -int in_pcblist_global_nomarker(SYSCTL_HANDLER_ARGS, - struct xinpcb **, int *); +int in_pcblist_global_ncpus2(SYSCTL_HANDLER_ARGS); + +struct inpcb * + in_pcbmarker(int cpuid); +struct inpcontainer * + in_pcbcontainer_marker(int cpuid); #endif /* _KERNEL */ diff --git a/sys/netinet/in_proto.c b/sys/netinet/in_proto.c index 6d62f47836..a6e0af20f9 100644 --- a/sys/netinet/in_proto.c +++ b/sys/netinet/in_proto.c @@ -50,6 +50,7 @@ #include #include +#include #include #include #include @@ -85,7 +86,6 @@ #endif /* FAST_IPSEC */ #ifdef SCTP -#include #include #include #include 
@@ -122,6 +122,7 @@ struct protosw inetsw[] = {
 	.pr_flags = PR_ATOMIC|PR_ADDR|PR_MPSAFE|
 	    PR_ASYNC_SEND|PR_ASEND_HOLDTD,
+	.pr_initport = udp_initport,
 	.pr_input = udp_input,
 	.pr_output = NULL,
 	.pr_ctlinput = udp_ctlinput,
@@ -497,8 +498,14 @@ struct protosw inetsw[] = {
 #endif
 };
+static void
+inetdomain_init(void)
+{
+	in_pcbglobalinit();
+}
+
 struct domain inetdomain = {
-	AF_INET, "internet", NULL, NULL, NULL,
+	AF_INET, "internet", inetdomain_init, NULL, NULL,
 	inetsw, &inetsw[NELEM(inetsw)], SLIST_ENTRY_INITIALIZER,
 	in_inithead, 32, sizeof(struct sockaddr_in),
diff --git a/sys/netinet/ip_demux.c b/sys/netinet/ip_demux.c
index 782efce6dc..a1d769dc5a 100644
--- a/sys/netinet/ip_demux.c
+++ b/sys/netinet/ip_demux.c
@@ -86,24 +86,14 @@ tcp_addrcpu(in_addr_t faddr, in_port_t fport, in_addr_t laddr, in_port_t lport)
 	return (netisr_hashcpu(INP_MPORT_HASH_TCP(faddr, laddr, fport, lport)));
 }
-/*
- * Not implemented yet, use protocol thread 0
- */
 int
 udp_addrcpu(in_addr_t faddr, in_port_t fport, in_addr_t laddr, in_port_t lport)
 {
-#ifdef notyet
-	return (netisr_hashcpu(INP_MPORT_HASH_UDP(faddr, laddr, fport, lport)));
-#else
-	return 0;
-#endif
-}
-
-int
-udp_addrcpu_pkt(in_addr_t faddr, in_port_t fport, in_addr_t laddr,
-    in_port_t lport)
-{
-	if (IN_MULTICAST(ntohl(faddr))) {
+	/*
+	 * NOTE: laddr could be multicast, since UDP socket could be
+	 * bound to multicast address.
+	 */
+	if (IN_MULTICAST(ntohl(faddr)) || IN_MULTICAST(ntohl(laddr))) {
 		/* XXX handle multicast on CPU0 for now */
 		return 0;
 	}
@@ -472,3 +462,9 @@ tcp_initport(void)
 {
 	return netisr_cpuport(mycpuid & ncpus2_mask);
 }
+
+struct lwkt_port *
+udp_initport(void)
+{
+	return netisr_cpuport(mycpuid & ncpus2_mask);
+}
diff --git a/sys/netinet/ip_divert.c b/sys/netinet/ip_divert.c
index df6c82e8af..dac294cd59 100644
--- a/sys/netinet/ip_divert.c
+++ b/sys/netinet/ip_divert.c
@@ -127,7 +127,7 @@ static struct lwkt_token div_token = LWKT_TOKEN_INITIALIZER(div_token);
 void
 div_init(void)
 {
-	in_pcbinfo_init(&divcbinfo);
+	in_pcbinfo_init(&divcbinfo, 0, FALSE);
 	in_pcbportinfo_init(&divcbportinfo, 1, FALSE, 0);
 	/*
 	 * XXX We don't use the hash list for divert IP, but it's easier
@@ -517,8 +517,8 @@ div_send(netmsg_t msg)
 }
 SYSCTL_DECL(_net_inet_divert);
-SYSCTL_PROC(_net_inet_divert, OID_AUTO, pcblist, CTLFLAG_RD, &divcbinfo, 0,
-	    in_pcblist_global_cpu0, "S,xinpcb", "List of active divert sockets");
+SYSCTL_PROC(_net_inet_divert, OID_AUTO, pcblist, CTLFLAG_RD, &divcbinfo, 1,
+	    in_pcblist_global, "S,xinpcb", "List of active divert sockets");
 struct pr_usrreqs div_usrreqs = {
 	.pru_abort = div_abort,
diff --git a/sys/netinet/ip_output.c b/sys/netinet/ip_output.c
index da2c0c8e0d..4a35ce24d7 100644
--- a/sys/netinet/ip_output.c
+++ b/sys/netinet/ip_output.c
@@ -1348,6 +1348,25 @@ ip_ctloutput(netmsg_t msg)
 		goto done;
 	}
+	switch (sopt->sopt_name) {
+	case IP_MULTICAST_IF:
+	case IP_MULTICAST_VIF:
+	case IP_MULTICAST_TTL:
+	case IP_MULTICAST_LOOP:
+	case IP_ADD_MEMBERSHIP:
+	case IP_DROP_MEMBERSHIP:
+		/*
+		 * Handle multicast options in netisr0
+		 */
+		if (&curthread->td_msgport != netisr_cpuport(0)) {
+			/* NOTE: so_port MUST NOT be checked in netisr0 */
+			msg->lmsg.ms_flags |= MSGF_IGNSOPORT;
+			lwkt_forwardmsg(netisr_cpuport(0), &msg->lmsg);
+			return;
+		}
+		break;
+	}
+
 	switch (sopt->sopt_dir) {
 	case SOPT_SET:
 		switch (sopt->sopt_name) {
diff --git a/sys/netinet/raw_ip.c
b/sys/netinet/raw_ip.c
index b67b3f64bc..82eb6bc0bf 100644
--- a/sys/netinet/raw_ip.c
+++ b/sys/netinet/raw_ip.c
@@ -127,7 +127,7 @@ void (*ip_rsvp_force_done)(struct socket *);
 void
 rip_init(void)
 {
-	in_pcbinfo_init(&ripcbinfo);
+	in_pcbinfo_init(&ripcbinfo, 0, FALSE);
 	in_pcbportinfo_init(&ripcbportinfo, 1, FALSE, 0);
 	/*
 	 * XXX We don't use the hash list for raw IP, but it's easier
@@ -735,8 +735,8 @@ rip_send(netmsg_t msg)
 	lwkt_replymsg(&msg->lmsg, error);
 }
-SYSCTL_PROC(_net_inet_raw, OID_AUTO/*XXX*/, pcblist, CTLFLAG_RD, &ripcbinfo, 0,
-	    in_pcblist_global_cpu0, "S,xinpcb", "List of active raw IP sockets");
+SYSCTL_PROC(_net_inet_raw, OID_AUTO/*XXX*/, pcblist, CTLFLAG_RD, &ripcbinfo, 1,
+	    in_pcblist_global, "S,xinpcb", "List of active raw IP sockets");
 struct pr_usrreqs rip_usrreqs = {
 	.pru_abort = rip_abort,
diff --git a/sys/netinet/tcp_subr.c b/sys/netinet/tcp_subr.c
index 21b76f4071..9885db5415 100644
--- a/sys/netinet/tcp_subr.c
+++ b/sys/netinet/tcp_subr.c
@@ -376,8 +376,7 @@ tcp_init(void)
 	for (cpu = 0; cpu < ncpus2; cpu++) {
 		ticb = &tcbinfo[cpu];
-		in_pcbinfo_init(ticb);
-		ticb->cpu = cpu;
+		in_pcbinfo_init(ticb, cpu, FALSE);
 		ticb->hashbase = hashinit(hashsize, M_PCB,
 		    &ticb->hashmask);
 		in_pcbportinfo_init(&portinfo[cpu], hashsize, TRUE, cpu);
@@ -1178,7 +1177,6 @@ tcp_pcblist(SYSCTL_HANDLER_ARGS)
 	int error, i, n;
 	struct inpcb *marker;
 	struct inpcb *inp;
-	globaldata_t gd;
 	int origcpu, ccpu;
 	error = 0;
@@ -1189,10 +1187,8 @@ tcp_pcblist(SYSCTL_HANDLER_ARGS)
 	 * resource-intensive to repeat twice on every request.
 	 */
 	if (req->oldptr == NULL) {
-		for (ccpu = 0; ccpu < ncpus2; ++ccpu) {
-			gd = globaldata_find(ccpu);
-			n += tcbinfo[gd->gd_cpuid].ipi_count;
-		}
+		for (ccpu = 0; ccpu < ncpus2; ++ccpu)
+			n += tcbinfo[ccpu].ipi_count;
 		req->oldidx = (n + n/8 + 10) * sizeof(struct xtcpcb);
 		return (0);
 	}
@@ -1369,7 +1365,7 @@ tcp_notifyall_oncpu(netmsg_t msg)
 	struct netmsg_tcp_notify *nm = (struct netmsg_tcp_notify *)msg;
 	int nextcpu;
-	in_pcbnotifyall(&tcbinfo[mycpuid].pcblisthead, nm->nm_faddr,
+	in_pcbnotifyall(&tcbinfo[mycpuid], nm->nm_faddr,
 	    nm->nm_arg, nm->nm_notify);
 	nextcpu = mycpuid + 1;
@@ -1535,7 +1531,7 @@ tcp6_ctlinput(netmsg_t msg)
 		bzero(&th, sizeof th);
 		m_copydata(m, off, sizeof *thp, (caddr_t)&th);
-		in6_pcbnotify(&tcbinfo[0].pcblisthead, sa, th.th_dport,
+		in6_pcbnotify(&tcbinfo[0], sa, th.th_dport,
 		    (struct sockaddr *)ip6cp->ip6c_src,
 		    th.th_sport, cmd, arg, notify);
@@ -1546,7 +1542,7 @@ tcp6_ctlinput(netmsg_t msg)
 		inc.inc_isipv6 = 1;
 		syncache_unreach(&inc, &th);
 	} else {
-		in6_pcbnotify(&tcbinfo[0].pcblisthead, sa, 0,
+		in6_pcbnotify(&tcbinfo[0], sa, 0,
 		    (const struct sockaddr *)sa6_src, 0, cmd, arg, notify);
 	}
 out:
diff --git a/sys/netinet/udp_usrreq.c b/sys/netinet/udp_usrreq.c
index 33218e351f..63715aa4c4 100644
--- a/sys/netinet/udp_usrreq.c
+++ b/sys/netinet/udp_usrreq.c
@@ -118,6 +118,8 @@
 #include
 #endif
+#define MSGF_UDP_SEND MSGF_PROTO1
+
 #define UDP_KTR_STRING "inp=%p"
 #define UDP_KTR_ARGS struct inpcb *inp
@@ -132,6 +134,7 @@ KTR_INFO(KTR_UDP, udp, send_ipout, 2, UDP_KTR_STRING, UDP_KTR_ARGS);
 KTR_INFO(KTR_UDP, udp, redisp_ipout_beg, 3, UDP_KTR_STRING, UDP_KTR_ARGS);
 KTR_INFO(KTR_UDP, udp, redisp_ipout_end, 4, UDP_KTR_STRING, UDP_KTR_ARGS);
 KTR_INFO(KTR_UDP, udp, send_redisp, 5, UDP_KTR_STRING, UDP_KTR_ARGS);
+KTR_INFO(KTR_UDP, udp, send_inswildcard, 6, UDP_KTR_STRING, UDP_KTR_ARGS);
 #define logudp(name, inp) KTR_LOG(udp_##name, inp)
@@ -172,11 +175,7 @@ static int udp_reuseport_ext = 1;
 SYSCTL_INT(_net_inet_udp, OID_AUTO, reuseport_ext, CTLFLAG_RW,
 	&udp_reuseport_ext, 0, "SO_REUSEPORT extension");
-struct inpcbinfo udbinfo;
-struct inpcbportinfo udbportinfo;
-
-static struct netisr_barrier *udbinfo_br;
-static struct lwkt_serialize udbinfo_slize = LWKT_SERIALIZE_INITIALIZER;
+struct inpcbinfo udbinfo[MAXCPU];
 #ifndef UDBHASHSIZE
 #define UDBHASHSIZE 16
@@ -208,23 +207,41 @@ static void ip_2_ip6_hdr (struct ip6_hdr *ip6, struct ip *ip);
 static int udp_connect_oncpu(struct inpcb *inp,
 	    struct sockaddr_in *sin, struct sockaddr_in *if_sin);
+static boolean_t udp_inswildcardhash(struct inpcb *inp,
+    struct netmsg_base *msg, int error);
+static void udp_remwildcardhash(struct inpcb *inp);
+
 void
 udp_init(void)
 {
+	struct inpcbportinfo *portinfo;
 	int cpu;
-	in_pcbinfo_init(&udbinfo);
-	in_pcbportinfo_init(&udbportinfo, UDBHASHSIZE, FALSE, 0);
+	portinfo = kmalloc_cachealign(sizeof(*portinfo) * ncpus2, M_PCB,
+	    M_WAITOK);
-	udbinfo.hashbase = hashinit(UDBHASHSIZE, M_PCB, &udbinfo.hashmask);
-	udbinfo.portinfo = &udbportinfo;
-	udbinfo.wildcardhashbase = hashinit(UDBHASHSIZE, M_PCB,
-	    &udbinfo.wildcardhashmask);
-	udbinfo.localgrphashbase = hashinit(UDBHASHSIZE, M_PCB,
-	    &udbinfo.localgrphashmask);
-	udbinfo.ipi_size = sizeof(struct inpcb);
+	for (cpu = 0; cpu < ncpus2; cpu++) {
+		struct inpcbinfo *uicb = &udbinfo[cpu];
-	udbinfo_br = netisr_barrier_create();
+		/*
+		 * NOTE:
+		 * UDP pcb list, wildcard hash table and localgroup hash
+		 * table are shared.
+		 */
+		in_pcbinfo_init(uicb, cpu, TRUE);
+		uicb->hashbase = hashinit(UDBHASHSIZE, M_PCB, &uicb->hashmask);
+
+		in_pcbportinfo_init(&portinfo[cpu], UDBHASHSIZE, TRUE, cpu);
+		uicb->portinfo = portinfo;
+		uicb->portinfo_mask = ncpus2_mask;
+
+		uicb->wildcardhashbase = hashinit(UDBHASHSIZE, M_PCB,
+		    &uicb->wildcardhashmask);
+		uicb->localgrphashbase = hashinit(UDBHASHSIZE, M_PCB,
+		    &uicb->localgrphashmask);
+
+		uicb->ipi_size = sizeof(struct inpcb);
+	}
 	/*
 	 * Initialize UDP statistics counters for each CPU.
@@ -261,20 +278,25 @@ SYSCTL_PROC(_net_inet_udp, UDPCTL_STATS, stats, (CTLTYPE_OPAQUE | CTLFLAG_RW),
  * Returns 0 if the packet is acceptable, -1 if it is not.
  */
 static __inline int
-check_multicast_membership(struct ip *ip, struct inpcb *inp, struct mbuf *m)
+check_multicast_membership(const struct ip *ip, const struct inpcb *inp,
+    const struct mbuf *m)
 {
+	const struct ip_moptions *mopt;
 	int mshipno;
-	struct ip_moptions *mopt;
 	if (strict_mcast_mship == 0 ||
 	    !IN_MULTICAST(ntohl(ip->ip_dst.s_addr))) {
 		return (0);
 	}
+
+	KASSERT(&curthread->td_msgport == netisr_cpuport(0),
+	    ("multicast input not in netisr0"));
+
 	mopt = inp->inp_moptions;
 	if (mopt == NULL)
 		return (-1);
 	for (mshipno = 0; mshipno < mopt->imo_num_memberships; ++mshipno) {
-		struct in_multi *maddr = mopt->imo_membership[mshipno];
+		const struct in_multi *maddr = mopt->imo_membership[mshipno];
 		if (ip->ip_dst.s_addr == maddr->inm_addr.s_addr &&
 		    m->m_pkthdr.rcvif == maddr->inm_ifp) {
@@ -284,6 +306,73 @@ check_multicast_membership(struct ip *ip, struct inpcb *inp, struct mbuf *m)
 	return (-1);
 }
+struct udp_mcast_arg {
+	struct inpcb *inp;
+	struct inpcb *last;
+	struct ip *ip;
+	struct mbuf *m;
+	int iphlen;
+	struct sockaddr_in *udp_in;
+#ifdef INET6
+	struct udp_in6 *udp_in6;
+	struct udp_ip6 *udp_ip6;
+#endif
+};
+
+static int
+udp_mcast_input(struct udp_mcast_arg *arg)
+{
+	struct inpcb *inp = arg->inp;
+	struct inpcb *last = arg->last;
+	struct ip *ip = arg->ip;
+	struct mbuf *m = arg->m;
+
+	if (check_multicast_membership(ip, inp, m) < 0)
+		return ERESTART; /* caller continue */
+
+	if (last != NULL) {
+		struct mbuf *n;
+
+#ifdef IPSEC
+		/* check AH/ESP integrity. */
+		if (ipsec4_in_reject_so(m, last->inp_socket))
+			ipsecstat.in_polvio++;
+			/* do not inject data to pcb */
+		else
+#endif /*IPSEC*/
+#ifdef FAST_IPSEC
+		/* check AH/ESP integrity.
*/
+		if (ipsec4_in_reject(m, last))
+			;
+		else
+#endif /*FAST_IPSEC*/
+		if ((n = m_copypacket(m, MB_DONTWAIT)) != NULL)
+			udp_append(last, ip, n,
+			    arg->iphlen + sizeof(struct udphdr),
+			    arg->udp_in,
+#ifdef INET6
+			    arg->udp_in6, arg->udp_ip6
+#else
+			    NULL, NULL
+#endif
+			    );
+	}
+	arg->last = last = inp;
+
+	/*
+	 * Don't look for additional matches if this one does
+	 * not have either the SO_REUSEPORT or SO_REUSEADDR
+	 * socket options set. This heuristic avoids searching
+	 * through all pcbs in the common case of a non-shared
+	 * port. It * assumes that an application will never
+	 * clear these options after setting them.
+	 */
+	if (!(last->inp_socket->so_options &
+	    (SO_REUSEPORT | SO_REUSEADDR)))
+		return EJUSTRETURN; /* caller stop */
+	return 0;
+}
+
 int
 udp_input(struct mbuf **mp, int *offp, int proto)
 {
@@ -304,6 +393,7 @@ udp_input(struct mbuf **mp, int *offp, int proto)
 	int len, off;
 	struct ip save_ip;
 	struct sockaddr *append_sa;
+	struct inpcbinfo *pcbinfo = &udbinfo[mycpuid];
 	off = *offp;
 	m = *mp;
@@ -387,7 +477,12 @@ udp_input(struct mbuf **mp, int *offp, int proto)
 	if (IN_MULTICAST(ntohl(ip->ip_dst.s_addr)) ||
 	    in_broadcast(ip->ip_dst, m->m_pkthdr.rcvif)) {
+		struct inpcbhead *connhead;
+		struct inpcontainer *ic, *ic_marker;
+		struct inpcontainerhead *ichead;
+		struct udp_mcast_arg arg;
 		struct inpcb *last;
+		int error;
 		/*
 		 * Deliver a multicast or broadcast datagram to *all* sockets
@@ -410,6 +505,7 @@ udp_input(struct mbuf **mp, int *offp, int proto)
 		 */
 		udp_in.sin_port = uh->uh_sport;
 		udp_in.sin_addr = ip->ip_src;
+		arg.udp_in = &udp_in;
 		/*
 		 * Locate pcb(s) for datagram.
 		 * (Algorithm copied from raw_intr().)
@@ -417,71 +513,79 @@ udp_input(struct mbuf **mp, int *offp, int proto)
 		last = NULL;
 #ifdef INET6
 		udp_in6.uin6_init_done = udp_ip6.uip6_init_done = 0;
+		arg.udp_in6 = &udp_in6;
+		arg.udp_ip6 = &udp_ip6;
 #endif
-		LIST_FOREACH(inp, &udbinfo.pcblisthead, inp_list) {
-			KKASSERT((inp->inp_flags & INP_PLACEMARKER) == 0);
+		arg.iphlen = iphlen;
+
+		connhead = &pcbinfo->hashbase[
+		    INP_PCBCONNHASH(ip->ip_src.s_addr, uh->uh_sport,
+		    ip->ip_dst.s_addr, uh->uh_dport, pcbinfo->hashmask)];
+		LIST_FOREACH(inp, connhead, inp_hash) {
 #ifdef INET6
 			if (!(inp->inp_vflag & INP_IPV4))
 				continue;
 #endif
-			if (inp->inp_lport != uh->uh_dport)
+			if (!in_hosteq(inp->inp_faddr, ip->ip_src) ||
+			    !in_hosteq(inp->inp_laddr, ip->ip_dst) ||
+			    inp->inp_fport != uh->uh_sport ||
+			    inp->inp_lport != uh->uh_dport)
 				continue;
-			if (inp->inp_laddr.s_addr != INADDR_ANY) {
-				if (inp->inp_laddr.s_addr !=
-				    ip->ip_dst.s_addr)
-					continue;
-			}
-			if (inp->inp_faddr.s_addr != INADDR_ANY) {
-				if (inp->inp_faddr.s_addr !=
-				    ip->ip_src.s_addr ||
-				    inp->inp_fport != uh->uh_sport)
-					continue;
-			}
-			if (check_multicast_membership(ip, inp, m) < 0)
+			arg.inp = inp;
+			arg.last = last;
+			arg.ip = ip;
+			arg.m = m;
+
+			error = udp_mcast_input(&arg);
+			if (error == ERESTART)
 				continue;
+			last = arg.last;
+
+			if (error == EJUSTRETURN)
+				goto done;
+		}
-			if (last != NULL) {
-				struct mbuf *n;
+		ichead = &pcbinfo->wildcardhashbase[
+		    INP_PCBWILDCARDHASH(uh->uh_dport,
+		    pcbinfo->wildcardhashmask)];
+		ic_marker = in_pcbcontainer_marker(mycpuid);
-#ifdef IPSEC
-				/* check AH/ESP integrity. */
-				if (ipsec4_in_reject_so(m, last->inp_socket))
-					ipsecstat.in_polvio++;
-					/* do not inject data to pcb */
-				else
-#endif /*IPSEC*/
-#ifdef FAST_IPSEC
-				/* check AH/ESP integrity.
*/
-				if (ipsec4_in_reject(m, last))
-					;
-				else
-#endif /*FAST_IPSEC*/
-				if ((n = m_copypacket(m, MB_DONTWAIT)) != NULL)
-					udp_append(last, ip, n,
-					    iphlen + sizeof(struct udphdr),
-					    &udp_in,
+		GET_PCBINFO_TOKEN(pcbinfo);
+		LIST_INSERT_HEAD(ichead, ic_marker, ic_list);
+		while ((ic = LIST_NEXT(ic_marker, ic_list)) != NULL) {
+			LIST_REMOVE(ic_marker, ic_list);
+			LIST_INSERT_AFTER(ic, ic_marker, ic_list);
+
+			inp = ic->ic_inp;
+			if (inp->inp_flags & INP_PLACEMARKER)
+				continue;
 #ifdef INET6
-					    &udp_in6, &udp_ip6
-#else
-					    NULL, NULL
+			if (!(inp->inp_vflag & INP_IPV4))
+				continue;
 #endif
-					    );
-			}
-			last = inp;
-			/*
-			 * Don't look for additional matches if this one does
-			 * not have either the SO_REUSEPORT or SO_REUSEADDR
-			 * socket options set. This heuristic avoids searching
-			 * through all pcbs in the common case of a non-shared
-			 * port. It * assumes that an application will never
-			 * clear these options after setting them.
-			 */
-			if (!(last->inp_socket->so_options &
-			    (SO_REUSEPORT | SO_REUSEADDR)))
+			if (inp->inp_lport != uh->uh_dport)
+				continue;
+			if (inp->inp_laddr.s_addr != INADDR_ANY &&
+			    inp->inp_laddr.s_addr != ip->ip_dst.s_addr)
+				continue;
+
+			arg.inp = inp;
+			arg.last = last;
+			arg.ip = ip;
+			arg.m = m;
+
+			error = udp_mcast_input(&arg);
+			if (error == ERESTART)
+				continue;
+			last = arg.last;
+
+			if (error == EJUSTRETURN)
 				break;
 		}
-
+		LIST_REMOVE(ic_marker, ic_list);
+		REL_PCBINFO_TOKEN(pcbinfo);
+done:
 		if (last == NULL) {
 			/*
 			 * No matching pcb found; discard datagram.
@@ -516,8 +620,8 @@ udp_input(struct mbuf **mp, int *offp, int proto)
 	/*
 	 * Locate pcb for datagram.
 	 */
-	inp = in_pcblookup_pkthash(&udbinfo, ip->ip_src, uh->uh_sport,
-	    ip->ip_dst, uh->uh_dport, 1, m->m_pkthdr.rcvif,
+	inp = in_pcblookup_pkthash(pcbinfo, ip->ip_src, uh->uh_sport,
+	    ip->ip_dst, uh->uh_dport, TRUE, m->m_pkthdr.rcvif,
 	    udp_reuseport_ext ?
m : NULL);
 	if (inp == NULL) {
 		if (log_in_vain) {
@@ -703,22 +807,15 @@ static void
 udp_notifyall_oncpu(netmsg_t msg)
 {
 	struct netmsg_udp_notify *nm = (struct netmsg_udp_notify *)msg;
-#if 0
-	int nextcpu;
-#endif
+	int nextcpu, cpu = mycpuid;
-	in_pcbnotifyall(&udbinfo.pcblisthead, nm->nm_faddr,
-	    nm->nm_arg, nm->nm_notify);
-	lwkt_replymsg(&nm->base.lmsg, 0);
+	in_pcbnotifyall(&udbinfo[cpu], nm->nm_faddr, nm->nm_arg, nm->nm_notify);
-#if 0
-	/* XXX currently udp only runs on cpu 0 */
-	nextcpu = mycpuid + 1;
+	nextcpu = cpu + 1;
 	if (nextcpu < ncpus2)
 		lwkt_forwardmsg(netisr_cpuport(nextcpu), &nm->base.lmsg);
 	else
-		lwkt_replymsg(&nmsg->base.lmsg, 0);
-#endif
+		lwkt_replymsg(&nm->base.lmsg, 0);
 }
 void
@@ -732,8 +829,6 @@ udp_ctlinput(netmsg_t msg)
 	struct in_addr faddr;
 	struct inpcb *inp;
-	KKASSERT(&curthread->td_msgport == netisr_cpuport(0));
-
 	faddr = ((struct sockaddr_in *)sa)->sin_addr;
 	if (sa->sa_family != AF_INET || faddr.s_addr == INADDR_ANY)
 		goto done;
@@ -749,7 +844,7 @@ udp_ctlinput(netmsg_t msg)
 	if (ip) {
 		uh = (struct udphdr *)((caddr_t)ip + (ip->ip_hl << 2));
-		inp = in_pcblookup_hash(&udbinfo, faddr, uh->uh_dport,
+		inp = in_pcblookup_hash(&udbinfo[mycpuid], faddr, uh->uh_dport,
 		    ip->ip_src, uh->uh_sport, 0, NULL);
 		if (inp != NULL && inp->inp_socket != NULL)
 			(*notify)(inp, inetctlerrmap[cmd]);
@@ -769,36 +864,8 @@ done:
 	lwkt_replymsg(&msg->lmsg, 0);
 }
-static int
-udp_pcblist(SYSCTL_HANDLER_ARGS)
-{
-	struct xinpcb *xi;
-	int error, nxi, i;
-
-	udbinfo_lock();
-	error = in_pcblist_global_nomarker(oidp, arg1, arg2, req, &xi, &nxi);
-	udbinfo_unlock();
-
-	if (error) {
-		KKASSERT(xi == NULL);
-		return error;
-	}
-	if (nxi == 0) {
-		KKASSERT(xi == NULL);
-		return 0;
-	}
-
-	for (i = 0; i < nxi; ++i) {
-		error = SYSCTL_OUT(req, &xi[i], sizeof(xi[i]));
-		if (error)
-			break;
-	}
-	kfree(xi, M_TEMP);
-
-	return error;
-}
-SYSCTL_PROC(_net_inet_udp, UDPCTL_PCBLIST, pcblist, CTLFLAG_RD, &udbinfo, 0,
-	    udp_pcblist, "S,xinpcb", "List of active UDP sockets");
+SYSCTL_PROC(_net_inet_udp, UDPCTL_PCBLIST, pcblist, CTLFLAG_RD, udbinfo, 0,
+	    in_pcblist_global_ncpus2, "S,xinpcb", "List of active UDP sockets");
 static int
 udp_getcred(SYSCTL_HANDLER_ARGS)
@@ -806,7 +873,7 @@ udp_getcred(SYSCTL_HANDLER_ARGS)
 	struct sockaddr_in addrs[2];
 	struct ucred cred0, *cred = NULL;
 	struct inpcb *inp;
-	int error;
+	int error, cpu, origcpu;
 	error = priv_check(req->td, PRIV_ROOT);
 	if (error)
@@ -815,9 +882,16 @@ udp_getcred(SYSCTL_HANDLER_ARGS)
 	if (error)
 		return (error);
-	udbinfo_lock();
-	inp = in_pcblookup_hash(&udbinfo, addrs[1].sin_addr, addrs[1].sin_port,
-	    addrs[0].sin_addr, addrs[0].sin_port, 1, NULL);
+	origcpu = mycpuid;
+	cpu = udp_addrcpu(addrs[1].sin_addr.s_addr, addrs[1].sin_port,
+	    addrs[0].sin_addr.s_addr, addrs[0].sin_port);
+
+	if (cpu != origcpu)
+		lwkt_migratecpu(cpu);
+
+	inp = in_pcblookup_hash(&udbinfo[cpu],
+	    addrs[1].sin_addr, addrs[1].sin_port,
+	    addrs[0].sin_addr, addrs[0].sin_port, TRUE, NULL);
 	if (inp == NULL || inp->inp_socket == NULL) {
 		error = ENOENT;
 	} else {
@@ -826,14 +900,15 @@ udp_getcred(SYSCTL_HANDLER_ARGS)
 			cred = &cred0;
 		}
 	}
-	udbinfo_unlock();
+
+	if (cpu != origcpu)
+		lwkt_migratecpu(origcpu);
 	if (error)
 		return error;
 	return SYSCTL_OUT(req, cred, sizeof(struct ucred));
 }
-
 SYSCTL_PROC(_net_inet_udp, OID_AUTO, getcred, CTLTYPE_OPAQUE|CTLFLAG_RW,
     0, 0, udp_getcred, "S,ucred", "Get the ucred of a UDP connection");
@@ -884,7 +959,6 @@ udp_send(netmsg_t msg)
 	struct sockaddr_in *sin; /* really is initialized before use */
 	int error = 0, cpu;
-	KKASSERT(&curthread->td_msgport == netisr_cpuport(0));
 	KKASSERT(msg->send.nm_control == NULL);
 	logudp(send_beg, inp);
@@ -900,13 +974,26 @@ udp_send(netmsg_t msg)
 	}
 	if (inp->inp_lport == 0) { /* unbound socket */
+		boolean_t forwarded;
+
 		error = in_pcbbind(inp, NULL, td);
 		if (error)
 			goto release;
-		udbinfo_barrier_set();
-		in_pcbinswildcardhash(inp);
-		udbinfo_barrier_rem();
+		/*
+		 * Need to call udp_send again, after this inpcb is
+		 * inserted into wildcard hash
table.
+		 */
+		msg->send.base.lmsg.ms_flags |= MSGF_UDP_SEND;
+		forwarded = udp_inswildcardhash(inp, &msg->send.base, 0);
+		if (forwarded) {
+			/*
+			 * The message is further forwarded, so we are
+			 * done here.
+			 */
+			logudp(send_inswildcard, inp);
+			return;
+		}
 	}
 	if (dstaddr != NULL) { /* destination address specified */
@@ -1025,7 +1112,7 @@ udp_send(netmsg_t msg)
 	if (pru_flags & PRUS_DONTROUTE)
 		flags |= SO_DONTROUTE;
-	cpu = udp_addrcpu_pkt(ui->ui_dst.s_addr, ui->ui_dport,
+	cpu = udp_addrcpu(ui->ui_dst.s_addr, ui->ui_dport,
 	    ui->ui_src.s_addr, ui->ui_sport);
 	if (cpu != mycpuid) {
 		struct mbuf *m_opt = NULL;
@@ -1121,30 +1208,13 @@ SYSCTL_INT(_net_inet_udp, UDPCTL_RECVSPACE, recvspace, CTLFLAG_RW,
     &udp_recvspace, 0, "Maximum incoming UDP datagram size");
 /*
- * NOTE: (so) is referenced from soabort*() and netmsg_pru_abort()
- * will sofree() it when we return.
+ * This should never happen, since UDP socket does not support
+ * connection acception (SO_ACCEPTCONN, i.e. listen(2)).
  */
 static void
-udp_abort(netmsg_t msg)
+udp_abort(netmsg_t msg __unused)
 {
-	struct socket *so = msg->abort.base.nm_so;
-	struct inpcb *inp;
-	int error;
-
-	KKASSERT(&curthread->td_msgport == netisr_cpuport(0));
-
-	inp = so->so_pcb;
-	if (inp) {
-		soisdisconnected(so);
-
-		udbinfo_barrier_set();
-		in_pcbdetach(inp);
-		udbinfo_barrier_rem();
-		error = 0;
-	} else {
-		error = EINVAL;
-	}
-	lwkt_replymsg(&msg->abort.base.lmsg, error);
+	panic("udp_abort is called");
 }
 static void
@@ -1155,8 +1225,6 @@ udp_attach(netmsg_t msg)
 	struct inpcb *inp;
 	int error;
-	KKASSERT(&curthread->td_msgport == netisr_cpuport(0));
-
 	inp = so->so_pcb;
 	if (inp != NULL) {
 		error = EINVAL;
@@ -1166,10 +1234,7 @@ udp_attach(netmsg_t msg)
 	if (error)
 		goto out;
-	udbinfo_barrier_set();
-	error = in_pcballoc(so, &udbinfo);
-	udbinfo_barrier_rem();
-
+	error = in_pcballoc(so, &udbinfo[mycpuid]);
 	if (error)
 		goto out;
@@ -1181,26 +1246,126 @@ out:
 	lwkt_replymsg(&msg->attach.base.lmsg, error);
 }
+static boolean_t
+udp_inswildcardhash_oncpu(struct inpcb *inp)
+{
+	int cpu;
+
+	KASSERT(inp->inp_pcbinfo == &udbinfo[mycpuid],
+	    ("not on owner cpu"));
+
+	in_pcbinswildcardhash(inp);
+	for (cpu = 0; cpu < ncpus2; ++cpu) {
+		if (cpu == mycpuid) {
+			/*
+			 * This inpcb has been inserted by the above
+			 * in_pcbinswildcardhash().
+			 */
+			continue;
+		}
+		in_pcbinswildcardhash_oncpu(inp, &udbinfo[cpu]);
+	}
+
+	/* TODO need to change port again, if SO_REUSEPORT */
+	return FALSE;
+}
+
+static void
+udp_inswildcardhash_dispatch(netmsg_t msg)
+{
+	struct inpcb *inp = msg->base.nm_so->so_pcb;
+	lwkt_msg_t lmsg = &msg->base.lmsg;
+
+	KASSERT(inp->inp_lport != 0, ("local port not set yet"));
+	KASSERT((ntohs(inp->inp_lport) & ncpus2_mask) == mycpuid,
+	    ("not target cpu"));
+
+	in_pcblink(inp, &udbinfo[mycpuid]);
+	udp_inswildcardhash_oncpu(inp);
+
+	if (lmsg->ms_flags & MSGF_UDP_SEND) {
+		udp_send(msg);
+		/* msg is replied by udp_send() */
+	} else {
+		lwkt_replymsg(lmsg, lmsg->ms_error);
+	}
+}
+
+static void
+udp_sosetport(struct lwkt_msg *msg, lwkt_port_t port)
+{
+	sosetport(((struct netmsg_base *)msg)->nm_so, port);
+}
+
+static boolean_t
+udp_inswildcardhash(struct inpcb *inp, struct netmsg_base *msg, int error)
+{
+	struct route *ro = &inp->inp_route;
+	lwkt_msg_t lmsg = &msg->lmsg;
+	int cpu;
+
+	/*
+	 * Always clear the route cache, so we don't need to
+	 * worry about any owner CPU changes later.
+	 */
+	if (ro->ro_rt != NULL)
+		RTFREE(ro->ro_rt);
+	bzero(ro, sizeof(*ro));
+
+	KASSERT(inp->inp_lport != 0, ("local port not set yet"));
+	cpu = ntohs(inp->inp_lport) & ncpus2_mask;
+
+	lmsg->ms_error = error;
+	if (cpu != mycpuid) {
+		struct lwkt_port *port = netisr_cpuport(cpu);
+
+		/*
+		 * We are moving the protocol processing port the socket
+		 * is on, we have to unlink here and re-link on the
+		 * target cpu.
+		 */
+		in_pcbunlink(inp, &udbinfo[mycpuid]);
+		msg->nm_dispatch = udp_inswildcardhash_dispatch;
+
+		/* See the related comment in tcp_usrreq.c tcp_connect() */
+		lwkt_setmsg_receipt(lmsg, udp_sosetport);
+		lwkt_forwardmsg(port, lmsg);
+		return TRUE; /* forwarded */
+	}
+
+	udp_inswildcardhash_oncpu(inp);
+	return FALSE;
+}
+
 static void
 udp_bind(netmsg_t msg)
 {
 	struct socket *so = msg->bind.base.nm_so;
-	struct sockaddr *nam = msg->bind.nm_nam;
-	struct thread *td = msg->bind.nm_td;
-	struct sockaddr_in *sin = (struct sockaddr_in *)nam;
 	struct inpcb *inp;
 	int error;
 	inp = so->so_pcb;
 	if (inp) {
+		struct sockaddr *nam = msg->bind.nm_nam;
+		struct thread *td = msg->bind.nm_td;
+
 		error = in_pcbbind(inp, nam, td);
 		if (error == 0) {
+			struct sockaddr_in *sin = (struct sockaddr_in *)nam;
+			boolean_t forwarded;
+
 			if (sin->sin_addr.s_addr != INADDR_ANY)
 				inp->inp_flags |= INP_WASBOUND_NOTANY;
-			udbinfo_barrier_set();
-			in_pcbinswildcardhash(inp);
-			udbinfo_barrier_rem();
+			forwarded = udp_inswildcardhash(inp,
+			    &msg->bind.base, 0);
+			if (forwarded) {
+				/*
+				 * The message is further forwarded, so
+				 * we are done here.
+				 */
+				return;
+			}
 		}
 	} else {
 		error = EINVAL;
@@ -1217,10 +1382,9 @@ udp_connect(netmsg_t msg)
 	struct inpcb *inp;
 	struct sockaddr_in *sin = (struct sockaddr_in *)nam;
 	struct sockaddr_in *if_sin;
-	lwkt_port_t port;
+	struct lwkt_port *port;
 	int error;
-	KKASSERT(&curthread->td_msgport == netisr_cpuport(0));
 	KKASSERT(msg->connect.nm_m == NULL);
 	inp = so->so_pcb;
@@ -1230,11 +1394,8 @@ udp_connect(netmsg_t msg)
 	}
 	if (msg->connect.nm_flags & PRUC_RECONNECT) {
-		panic("UDP does not support RECONNECT");
-#ifdef notyet
 		msg->connect.nm_flags &= ~PRUC_RECONNECT;
-		in_pcblink(inp, &udbinfo);
-#endif
+		in_pcblink(inp, &udbinfo[mycpuid]);
 	}
 	if (inp->inp_faddr.s_addr != INADDR_ANY) {
@@ -1270,8 +1431,9 @@ udp_connect(netmsg_t msg)
 	    inp->inp_laddr.s_addr != INADDR_ANY ?
inp->inp_laddr.s_addr : if_sin->sin_addr.s_addr, inp->inp_lport);
 	if (port != &curthread->td_msgport) {
-#ifdef notyet
 		struct route *ro = &inp->inp_route;
+		lwkt_msg_t lmsg = &msg->connect.base.lmsg;
+		int nm_flags = PRUC_RECONNECT;
 		/*
 		 * in_pcbladdr() may have allocated a route entry for us
@@ -1282,30 +1444,69 @@ udp_connect(netmsg_t msg)
 			RTFREE(ro->ro_rt);
 		bzero(ro, sizeof(*ro));
+		if (inp->inp_flags & INP_WILDCARD) {
+			/*
+			 * Remove this inpcb from the wildcard hash before
+			 * the socket's msgport changes.
+			 */
+			udp_remwildcardhash(inp);
+		}
+
 		/*
 		 * We are moving the protocol processing port the socket
 		 * is on, we have to unlink here and re-link on the
 		 * target cpu.
 		 */
-		in_pcbunlink(so->so_pcb, &udbinfo);
-		/* in_pcbunlink(so->so_pcb, &udbinfo[mycpu->gd_cpuid]); */
-		sosetport(so, port);
-		msg->connect.nm_flags |= PRUC_RECONNECT;
-		msg->connect.base.nm_dispatch = udp_connect;
+		in_pcbunlink(inp, &udbinfo[mycpuid]);
+		msg->connect.nm_flags |= nm_flags;
-		lwkt_forwardmsg(port, &msg->connect.base.lmsg);
+		/* See the related comment in tcp_usrreq.c tcp_connect() */
+		lwkt_setmsg_receipt(lmsg, udp_sosetport);
+		lwkt_forwardmsg(port, lmsg);
 		/* msg invalid now */
 		return;
-#else
-		panic("UDP activity should only be in netisr0");
-#endif
 	}
-	KKASSERT(port == &curthread->td_msgport);
 	error = udp_connect_oncpu(inp, sin, if_sin);
 out:
+	if (error && inp != NULL && inp->inp_lport != 0 &&
+	    (inp->inp_flags & INP_WILDCARD) == 0) {
+		boolean_t forwarded;
+
+		/* Connect failed; put it to wildcard hash. */
+		forwarded = udp_inswildcardhash(inp, &msg->connect.base,
+		    error);
+		if (forwarded) {
+			/*
+			 * The message is further forwarded, so we are done
+			 * here.
+			 */
+			return;
+		}
+	}
 	lwkt_replymsg(&msg->connect.base.lmsg, error);
 }
+static void
+udp_remwildcardhash(struct inpcb *inp)
+{
+	int cpu;
+
+	KASSERT(inp->inp_pcbinfo == &udbinfo[mycpuid],
+	    ("not on owner cpu"));
+
+	for (cpu = 0; cpu < ncpus2; ++cpu) {
+		if (cpu == mycpuid) {
+			/*
+			 * This inpcb will be removed by the later
+			 * in_pcbremwildcardhash().
+			 */
+			continue;
+		}
+		in_pcbremwildcardhash_oncpu(inp, &udbinfo[cpu]);
+	}
+	in_pcbremwildcardhash(inp);
+}
+
 static int
 udp_connect_oncpu(struct inpcb *inp, struct sockaddr_in *sin,
     struct sockaddr_in *if_sin)
@@ -1320,10 +1521,15 @@ udp_connect_oncpu(struct inpcb *inp, struct sockaddr_in *sin,
 	if (oinp != NULL)
 		return EADDRINUSE;
-	udbinfo_barrier_set();
+	/*
+	 * No more errors can occur, finish adjusting the socket
+	 * and change the processing port to reflect the connected
+	 * socket. Once set we can no longer safely mess with the
+	 * socket.
+	 */
 	if (inp->inp_flags & INP_WILDCARD)
-		in_pcbremwildcardhash(inp);
+		udp_remwildcardhash(inp);
 	if (inp->inp_laddr.s_addr == INADDR_ANY)
 		inp->inp_laddr = if_sin->sin_addr;
@@ -1331,49 +1537,116 @@ udp_connect_oncpu(struct inpcb *inp, struct sockaddr_in *sin,
 	inp->inp_fport = sin->sin_port;
 	in_pcbinsconnhash(inp);
-	/*
-	 * No more errors can occur, finish adjusting the socket
-	 * and change the processing port to reflect the connected
-	 * socket. Once set we can no longer safely mess with the
-	 * socket.
-	 */
 	soisconnected(so);
-	udbinfo_barrier_rem();
-
 	return 0;
 }
+static void
+udp_detach2(struct socket *so)
+{
+	in_pcbdetach(so->so_pcb);
+	sodiscard(so);
+	sofree(so);
+}
+
+static void
+udp_detach_final_dispatch(netmsg_t msg)
+{
+	udp_detach2(msg->base.nm_so);
+}
+
+static void
+udp_detach_oncpu_dispatch(netmsg_t msg)
+{
+	struct netmsg_base *clomsg = &msg->base;
+	struct socket *so = clomsg->nm_so;
+	struct inpcb *inp = so->so_pcb;
+	struct thread *td = curthread;
+	int nextcpu, cpuid = mycpuid;
+
+	KASSERT(td->td_type == TD_TYPE_NETISR, ("not in netisr"));
+
+	if (inp->inp_flags & INP_WILDCARD) {
+		/*
+		 * This inp will be removed on the inp's
+		 * owner CPU later, so don't do it now.
+		 */
+		if (&td->td_msgport != so->so_port)
+			in_pcbremwildcardhash_oncpu(inp, &udbinfo[cpuid]);
+	}
+
+	if (cpuid == 0) {
+		/*
+		 * Free and clear multicast socket option,
+		 * which is only accessed in netisr0.
+		 */
+		ip_freemoptions(inp->inp_moptions);
+		inp->inp_moptions = NULL;
+	}
+
+	nextcpu = cpuid + 1;
+	if (nextcpu < ncpus2) {
+		lwkt_forwardmsg(netisr_cpuport(nextcpu), &clomsg->lmsg);
+	} else {
+		/*
+		 * No one could see this inpcb now; destroy this
+		 * inpcb in its owner netisr.
+		 */
+		netmsg_init(clomsg, so, &netisr_apanic_rport, 0,
+		    udp_detach_final_dispatch);
+		lwkt_sendmsg(so->so_port, &clomsg->lmsg);
+	}
+}
+
 static void
 udp_detach(netmsg_t msg)
 {
 	struct socket *so = msg->detach.base.nm_so;
+	struct netmsg_base *clomsg;
 	struct inpcb *inp;
-	int error;
-
-	KKASSERT(&curthread->td_msgport == netisr_cpuport(0));
 	inp = so->so_pcb;
-	if (inp) {
-		udbinfo_barrier_set();
-		in_pcbdetach(inp);
-		udbinfo_barrier_rem();
-		error = 0;
-	} else {
-		error = EINVAL;
+	if (inp == NULL) {
+		lwkt_replymsg(&msg->detach.base.lmsg, EINVAL);
+		return;
 	}
-	lwkt_replymsg(&msg->detach.base.lmsg, error);
+
+	/*
+	 * Reply EJUSTRETURN ASAP, we will call sodiscard() and
+	 * sofree() later.
+	 */
+	lwkt_replymsg(&msg->detach.base.lmsg, EJUSTRETURN);
+
+	if (ncpus == 1) {
+		/* Only one CPU, detach the inpcb directly. */
+		udp_detach2(so);
+		return;
+	}
+
+	/*
+	 * Remove this inpcb from the inpcb list first, so that
+	 * no one could find this inpcb from the inpcb list.
+	 */
+	in_pcbofflist(inp);
+
+	/*
+	 * Go through netisrs which process UDP to make sure
+	 * no one could find this inpcb anymore.
+	 */
+	clomsg = &so->so_clomsg;
+	netmsg_init(clomsg, so, &netisr_apanic_rport, MSGF_IGNSOPORT,
+	    udp_detach_oncpu_dispatch);
+	lwkt_sendmsg(netisr_cpuport(0), &clomsg->lmsg);
 }
 static void
 udp_disconnect(netmsg_t msg)
 {
 	struct socket *so = msg->disconnect.base.nm_so;
-	struct route *ro;
 	struct inpcb *inp;
-	int error;
-
-	KKASSERT(&curthread->td_msgport == netisr_cpuport(0));
+	boolean_t forwarded;
+	int error = 0;
 	inp = so->so_pcb;
 	if (inp == NULL) {
@@ -1385,7 +1658,7 @@ udp_disconnect(netmsg_t msg)
 		goto out;
 	}
-	udbinfo_barrier_set();
+	soclrstate(so, SS_ISCONNECTED); /* XXX */
 	in_pcbdisconnect(inp);
@@ -1396,17 +1669,24 @@ udp_disconnect(netmsg_t msg)
 	 */
 	if (!(inp->inp_flags & INP_WASBOUND_NOTANY))
 		inp->inp_laddr.s_addr = INADDR_ANY;
-	in_pcbinswildcardhash(inp);
-
-	udbinfo_barrier_rem();
-	soclrstate(so, SS_ISCONNECTED); /* XXX */
+	if (so->so_state & SS_ISCLOSING) {
+		/*
+		 * If this socket is being closed, there is no need
+		 * to put this socket back into wildcard hash table.
+		 */
+		error = 0;
+		goto out;
+	}
-	ro = &inp->inp_route;
-	if (ro->ro_rt != NULL)
-		RTFREE(ro->ro_rt);
-	bzero(ro, sizeof(*ro));
-	error = 0;
+	forwarded = udp_inswildcardhash(inp, &msg->disconnect.base, 0);
+	if (forwarded) {
+		/*
+		 * The message is further forwarded, so we are done
+		 * here.
+ */ + return; + } out: lwkt_replymsg(&msg->disconnect.base.lmsg, error); } @@ -1418,8 +1698,6 @@ udp_shutdown(netmsg_t msg) struct inpcb *inp; int error; - KKASSERT(&curthread->td_msgport == netisr_cpuport(0)); - inp = so->so_pcb; if (inp) { socantsendmore(so); @@ -1430,32 +1708,6 @@ udp_shutdown(netmsg_t msg) lwkt_replymsg(&msg->shutdown.base.lmsg, error); } -void -udbinfo_lock(void) -{ - lwkt_serialize_enter(&udbinfo_slize); -} - -void -udbinfo_unlock(void) -{ - lwkt_serialize_exit(&udbinfo_slize); -} - -void -udbinfo_barrier_set(void) -{ - netisr_barrier_set(udbinfo_br); - udbinfo_lock(); -} - -void -udbinfo_barrier_rem(void) -{ - udbinfo_unlock(); - netisr_barrier_rem(udbinfo_br); -} - struct pr_usrreqs udp_usrreqs = { .pru_abort = udp_abort, .pru_accept = pr_generic_notsupp, @@ -1477,4 +1729,3 @@ struct pr_usrreqs udp_usrreqs = { .pru_sosend = sosendudp, .pru_soreceive = soreceive }; - diff --git a/sys/netinet/udp_var.h b/sys/netinet/udp_var.h index 1e9816b63b..5744265cf1 100644 --- a/sys/netinet/udp_var.h +++ b/sys/netinet/udp_var.h @@ -147,7 +147,7 @@ SYSCTL_DECL(_net_inet_udp); #define udp_stat udpstat_percpu[mycpuid] extern struct pr_usrreqs udp_usrreqs; -extern struct inpcbinfo udbinfo; +extern struct inpcbinfo udbinfo[MAXCPU]; extern u_long udp_sendspace; extern u_long udp_recvspace; extern struct udpstat udpstat_percpu[MAXCPU]; @@ -155,8 +155,6 @@ extern int log_in_vain; int udp_addrcpu (in_addr_t faddr, in_port_t fport, in_addr_t laddr, in_port_t lport); -int udp_addrcpu_pkt (in_addr_t faddr, in_port_t fport, - in_addr_t laddr, in_port_t lport); struct lwkt_port *udp_addrport (in_addr_t faddr, in_port_t fport, in_addr_t laddr, in_port_t lport); void udp_ctlinput(netmsg_t msg); @@ -166,13 +164,9 @@ int udp_input (struct mbuf **, int *, int); void udp_notify (struct inpcb *inp, int error); void udp_shutdown (union netmsg *); struct lwkt_port *udp_ctlport (int, struct sockaddr *, void *); +struct lwkt_port *udp_initport(void); struct lwkt_port *udp_cport 
(int); -void udbinfo_lock(void); -void udbinfo_unlock(void); -void udbinfo_barrier_set(void); -void udbinfo_barrier_rem(void); - #endif #endif diff --git a/sys/netinet6/in6_ifattach.c b/sys/netinet6/in6_ifattach.c index 09d77abbe6..8094a3851f 100644 --- a/sys/netinet6/in6_ifattach.c +++ b/sys/netinet6/in6_ifattach.c @@ -44,6 +44,8 @@ #include #include #include +#include +#include #include #include @@ -763,6 +765,28 @@ statinit: in6_maxmtu = ifp->if_mtu; } +static void +in6_leavemcast_dispatch(netmsg_t nmsg) +{ + struct lwkt_msg *lmsg = &nmsg->lmsg; + struct ifnet *ifp = lmsg->u.ms_resultp; + struct in6_multi *in6m; + struct in6_multi *in6m_next; + + in6_pcbpurgeif0(&ripcbinfo, ifp); + in6_pcbpurgeif0(&udbinfo[0], ifp); + + for (in6m = LIST_FIRST(&in6_multihead); in6m; in6m = in6m_next) { + in6m_next = LIST_NEXT(in6m, in6m_entry); + if (in6m->in6m_ifp != ifp) + continue; + in6_delmulti(in6m); + in6m = NULL; + } + + lwkt_replymsg(lmsg, 0); +} + /* * NOTE: in6_ifdetach() does not support loopback if at this moment. * We don't need this function in bsdi, because interfaces are never removed @@ -776,8 +800,8 @@ in6_ifdetach(struct ifnet *ifp) struct rtentry *rt; short rtflags; struct sockaddr_in6 sin6; - struct in6_multi *in6m; - struct in6_multi *in6m_next; + struct netmsg_base nmsg; + struct lwkt_msg *lmsg = &nmsg.lmsg; /* nuke prefix list. 
this may try to remove some of ifaddrs as well */ in6_purgeprefix(ifp); @@ -844,18 +868,10 @@ in6_ifdetach(struct ifnet *ifp) } /* leave from all multicast groups joined */ - udbinfo_lock(); - in6_pcbpurgeif0(LIST_FIRST(&udbinfo.pcblisthead), ifp); - udbinfo_unlock(); - - in6_pcbpurgeif0(LIST_FIRST(&ripcbinfo.pcblisthead), ifp); - for (in6m = LIST_FIRST(&in6_multihead); in6m; in6m = in6m_next) { - in6m_next = LIST_NEXT(in6m, in6m_entry); - if (in6m->in6m_ifp != ifp) - continue; - in6_delmulti(in6m); - in6m = NULL; - } + netmsg_init(&nmsg, NULL, &curthread->td_msgport, 0, + in6_leavemcast_dispatch); + lmsg->u.ms_resultp = ifp; + lwkt_domsg(netisr_cpuport(0), lmsg, 0); /* * remove neighbor management table. we call it twice just to make diff --git a/sys/netinet6/in6_pcb.c b/sys/netinet6/in6_pcb.c index 9b196f2271..1b1af2dc6a 100644 --- a/sys/netinet6/in6_pcb.c +++ b/sys/netinet6/in6_pcb.c @@ -89,6 +89,7 @@ #include #include #include +#include #include #include @@ -887,15 +888,13 @@ in6_mapped_peeraddr_dispatch(netmsg_t msg) * cmds that are uninteresting (e.g., no error in the map). * Call the protocol specific routine (if any) to report * any errors for each matching socket. - * - * Must be called under crit_enter(). 
*/ void -in6_pcbnotify(struct inpcbhead *head, struct sockaddr *dst, in_port_t fport, - const struct sockaddr *src, in_port_t lport, int cmd, int arg, - void (*notify) (struct inpcb *, int)) +in6_pcbnotify(struct inpcbinfo *pcbinfo, struct sockaddr *dst, in_port_t fport, + const struct sockaddr *src, in_port_t lport, int cmd, int arg, + void (*notify) (struct inpcb *, int)) { - struct inpcb *inp, *ninp; + struct inpcb *inp, *marker; struct sockaddr_in6 sa6_src, *sa6_dst; u_int32_t flowinfo; @@ -930,9 +929,15 @@ in6_pcbnotify(struct inpcbhead *head, struct sockaddr *dst, in_port_t fport, } if (cmd != PRC_MSGSIZE) arg = inet6ctlerrmap[cmd]; - crit_enter(); - for (inp = LIST_FIRST(head); inp != NULL; inp = ninp) { - ninp = LIST_NEXT(inp, inp_list); + + marker = in_pcbmarker(mycpuid); + + GET_PCBINFO_TOKEN(pcbinfo); + + LIST_INSERT_HEAD(&pcbinfo->pcblisthead, marker, inp_list); + while ((inp = LIST_NEXT(marker, inp_list)) != NULL) { + LIST_REMOVE(marker, inp_list); + LIST_INSERT_AFTER(inp, marker, inp_list); if (inp->inp_flags & INP_PLACEMARKER) continue; @@ -981,7 +986,9 @@ do_notify: if (notify) (*notify)(inp, arg); } - crit_exit(); + LIST_REMOVE(marker, inp_list); + + REL_PCBINFO_TOKEN(pcbinfo); } /* @@ -1058,13 +1065,45 @@ in6_pcblookup_local(struct inpcbportinfo *portinfo, } void -in6_pcbpurgeif0(struct in6pcb *head, struct ifnet *ifp) +in6_pcbpurgeif0(struct inpcbinfo *pcbinfo, struct ifnet *ifp) { - struct in6pcb *in6p; + struct in6pcb *in6p, *marker; struct ip6_moptions *im6o; struct in6_multi_mship *imm, *nimm; - for (in6p = head; in6p != NULL; in6p = LIST_NEXT(in6p, inp_list)) { + /* + * We only need to make sure that we are in netisr0, where all + * multicast operations happen. We could check an inpcbinfo which + * does not belong to netisr0 by holding the inpcbinfo's token. + * In this case, the pcbinfo must be able to be shared, i.e. + * pcbinfo->infotoken is not NULL.
+ */ + KASSERT(&curthread->td_msgport == netisr_cpuport(0), + ("not in netisr0")); + KASSERT(pcbinfo->cpu == 0 || pcbinfo->infotoken != NULL, + ("pcbinfo could not be shared")); + + /* + * Get a marker for the current netisr (netisr0). + * + * It is possible that the multicast address deletion blocks, + * which could cause the token to be temporarily released. So we + * use an inpcb marker here to get a coherent view of the inpcb list. + * + * On the other hand, moptions are only added and deleted in + * netisr0, so we would not see a stale moption or miss a moption + * even if the token was released due to the blocking multicast + * address deletion. + */ + marker = in_pcbmarker(mycpuid); + + GET_PCBINFO_TOKEN(pcbinfo); + + LIST_INSERT_HEAD(&pcbinfo->pcblisthead, marker, inp_list); + while ((in6p = LIST_NEXT(marker, inp_list)) != NULL) { + LIST_REMOVE(marker, inp_list); + LIST_INSERT_AFTER(in6p, marker, inp_list); + + if (in6p->in6p_flags & INP_PLACEMARKER) continue; im6o = in6p->in6p_moptions; @@ -1094,6 +1133,9 @@ in6_pcbpurgeif0(struct in6pcb *head, struct ifnet *ifp) } } } + LIST_REMOVE(marker, inp_list); + + REL_PCBINFO_TOKEN(pcbinfo); } /* @@ -1193,6 +1235,7 @@ in6_pcblookup_hash(struct inpcbinfo *pcbinfo, struct in6_addr *faddr, } if (jinp != NULL) return(jinp); + if (wildcard) { struct inpcontainerhead *chead; struct inpcontainer *ic; @@ -1211,8 +1254,12 @@ in6_pcblookup_hash(struct inpcbinfo *pcbinfo, struct in6_addr *faddr, jsin6.sin6_family = AF_INET6; chead = &pcbinfo->wildcardhashbase[INP_PCBWILDCARDHASH(lport, pcbinfo->wildcardhashmask)]; + + GET_PCBINFO_TOKEN(pcbinfo); LIST_FOREACH(ic, chead, ic_list) { inp = ic->ic_inp; + if (inp->inp_flags & INP_PLACEMARKER) + continue; if (!(inp->inp_vflag & INP_IPV6)) continue; @@ -1237,10 +1284,12 @@ in6_pcblookup_hash(struct inpcbinfo *pcbinfo, struct in6_addr *faddr, continue; if (IN6_ARE_ADDR_EQUAL(&inp->in6p_laddr, laddr)) { - if (cred != NULL && jailed(cred)) + if (cred != NULL && jailed(cred)) { jinp = inp; - else +
} else { + REL_PCBINFO_TOKEN(pcbinfo); return (inp); + } } else if (IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_laddr)) { if (cred != NULL && jailed(cred)) jinp_wild = inp; @@ -1249,6 +1298,8 @@ in6_pcblookup_hash(struct inpcbinfo *pcbinfo, struct in6_addr *faddr, } } } + REL_PCBINFO_TOKEN(pcbinfo); + if (local_wild != NULL) return (local_wild); if (jinp != NULL) diff --git a/sys/netinet6/in6_pcb.h b/sys/netinet6/in6_pcb.h index d4cff1e40b..625aeea7dd 100644 --- a/sys/netinet6/in6_pcb.h +++ b/sys/netinet6/in6_pcb.h @@ -96,7 +96,7 @@ struct route_in6; struct sockaddr_in6; union netmsg; -void in6_pcbpurgeif0 (struct in6pcb *, struct ifnet *); +void in6_pcbpurgeif0 (struct inpcbinfo *, struct ifnet *); void in6_losing (struct inpcb *); int in6_pcballoc (struct socket *, struct inpcbinfo *, struct thread *); int in6_pcbbind (struct inpcb *, struct sockaddr *, struct thread *); @@ -112,7 +112,7 @@ struct inpcb * in6_pcblookup_hash (struct inpcbinfo *, struct in6_addr *, u_int, struct in6_addr *, u_int, int, struct ifnet *); -void in6_pcbnotify (struct inpcbhead *, struct sockaddr *, +void in6_pcbnotify (struct inpcbinfo *, struct sockaddr *, in_port_t, const struct sockaddr *, in_port_t, int, int, void (*)(struct inpcb *, int)); void in6_rtchange (struct inpcb *, int); diff --git a/sys/netinet6/ipsec.c b/sys/netinet6/ipsec.c index 58fcfbafe6..c3f30a2bc0 100644 --- a/sys/netinet6/ipsec.c +++ b/sys/netinet6/ipsec.c @@ -60,6 +60,7 @@ #include #include +#include #include #include #include @@ -75,7 +76,6 @@ #ifdef INET6 #include #endif -#include #ifdef INET6 #include #endif diff --git a/sys/netinet6/raw_ip6.c b/sys/netinet6/raw_ip6.c index af9a6e7290..ddde9a2f3d 100644 --- a/sys/netinet6/raw_ip6.c +++ b/sys/netinet6/raw_ip6.c @@ -310,7 +310,7 @@ rip6_ctlinput(netmsg_t msg) sa6_src = &sa6_any; } - in6_pcbnotify(&ripcbinfo.pcblisthead, sa, 0, + in6_pcbnotify(&ripcbinfo, sa, 0, (const struct sockaddr *)sa6_src, 0, cmd, 0, notify); out: lwkt_replymsg(&msg->ctlinput.base.lmsg, 0); diff 
--git a/sys/netinet6/udp6_usrreq.c b/sys/netinet6/udp6_usrreq.c index 7b59fd6db0..f0e788f703 100644 --- a/sys/netinet6/udp6_usrreq.c +++ b/sys/netinet6/udp6_usrreq.c @@ -153,6 +153,7 @@ udp6_input(struct mbuf **mp, int *offp, int proto) int plen, ulen; struct sockaddr_in6 udp_in6; struct socket *so; + struct inpcbinfo *pcbinfo = &udbinfo[0]; IP6_EXTHDR_CHECK(m, off, sizeof(struct udphdr), IPPROTO_DONE); @@ -186,7 +187,7 @@ udp6_input(struct mbuf **mp, int *offp, int proto) } if (IN6_IS_ADDR_MULTICAST(&ip6->ip6_dst)) { - struct inpcb *last; + struct inpcb *last, *marker; /* * Deliver a multicast datagram to all sockets @@ -229,9 +230,18 @@ udp6_input(struct mbuf **mp, int *offp, int proto) * (Algorithm copied from raw_intr().) */ last = NULL; - LIST_FOREACH(in6p, &udbinfo.pcblisthead, inp_list) { - KKASSERT((in6p->inp_flags & INP_PLACEMARKER) == 0); + marker = in_pcbmarker(mycpuid); + + GET_PCBINFO_TOKEN(pcbinfo); + + LIST_INSERT_HEAD(&pcbinfo->pcblisthead, marker, inp_list); + while ((in6p = LIST_NEXT(marker, inp_list)) != NULL) { + LIST_REMOVE(marker, inp_list); + LIST_INSERT_AFTER(in6p, marker, inp_list); + + if (in6p->inp_flags & INP_PLACEMARKER) + continue; if (!(in6p->inp_vflag & INP_IPV6)) continue; if (in6p->in6p_lport != uh->uh_dport) @@ -313,6 +323,9 @@ udp6_input(struct mbuf **mp, int *offp, int proto) (SO_REUSEPORT | SO_REUSEADDR)) == 0) break; } + LIST_REMOVE(marker, inp_list); + + REL_PCBINFO_TOKEN(pcbinfo); if (last == NULL) { /* @@ -361,7 +374,7 @@ udp6_input(struct mbuf **mp, int *offp, int proto) /* * Locate pcb for datagram. 
*/ - in6p = in6_pcblookup_hash(&udbinfo, &ip6->ip6_src, uh->uh_sport, + in6p = in6_pcblookup_hash(pcbinfo, &ip6->ip6_src, uh->uh_sport, &ip6->ip6_dst, uh->uh_dport, 1, m->m_pkthdr.rcvif); if (in6p == NULL) { @@ -487,11 +500,11 @@ udp6_ctlinput(netmsg_t msg) bzero(&uh, sizeof(uh)); m_copydata(m, off, sizeof(*uhp), (caddr_t)&uh); - in6_pcbnotify(&udbinfo.pcblisthead, sa, uh.uh_dport, + in6_pcbnotify(&udbinfo[0], sa, uh.uh_dport, (struct sockaddr *)ip6cp->ip6c_src, uh.uh_sport, cmd, 0, notify); } else { - in6_pcbnotify(&udbinfo.pcblisthead, sa, 0, + in6_pcbnotify(&udbinfo[0], sa, 0, (const struct sockaddr *)sa6_src, 0, cmd, 0, notify); } @@ -518,7 +531,7 @@ udp6_getcred(SYSCTL_HANDLER_ARGS) if (error) return (error); crit_enter(); - inp = in6_pcblookup_hash(&udbinfo, &addrs[1].sin6_addr, + inp = in6_pcblookup_hash(&udbinfo[0], &addrs[1].sin6_addr, addrs[1].sin6_port, &addrs[0].sin6_addr, addrs[0].sin6_port, 1, NULL); @@ -553,9 +566,7 @@ udp6_abort(netmsg_t msg) if (inp) { soisdisconnected(so); - udbinfo_barrier_set(); in6_pcbdetach(inp); - udbinfo_barrier_rem(); error = 0; } else { error = EINVAL; @@ -584,10 +595,7 @@ udp6_attach(netmsg_t msg) goto out; } - udbinfo_barrier_set(); - error = in_pcballoc(so, &udbinfo); - udbinfo_barrier_rem(); - + error = in_pcballoc(so, &udbinfo[0]); if (error) goto out; @@ -649,10 +657,7 @@ udp6_bind(netmsg_t msg) if (error == 0) { if (IN6_IS_ADDR_UNSPECIFIED(&sin6_p->sin6_addr)) inp->inp_flags |= INP_WASBOUND_NOTANY; - - udbinfo_barrier_set(); in_pcbinswildcardhash(inp); - udbinfo_barrier_rem(); } out: lwkt_replymsg(&msg->bind.base.lmsg, error); @@ -667,8 +672,6 @@ udp6_connect(netmsg_t msg) struct inpcb *inp; int error; - udbinfo_barrier_set(); - inp = so->so_pcb; if (inp == NULL) { error = EINVAL; @@ -730,7 +733,6 @@ udp6_connect(netmsg_t msg) in_pcbinswildcardhash(inp); } out: - udbinfo_barrier_rem(); lwkt_replymsg(&msg->connect.base.lmsg, error); } @@ -743,9 +745,7 @@ udp6_detach(netmsg_t msg) inp = so->so_pcb; if (inp) { - 
udbinfo_barrier_set(); in6_pcbdetach(inp); - udbinfo_barrier_rem(); error = 0; } else { error = EINVAL; @@ -777,9 +777,7 @@ udp6_disconnect(netmsg_t msg) if (IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_faddr)) { error = ENOTCONN; } else { - udbinfo_barrier_set(); in6_pcbdisconnect(inp); - udbinfo_barrier_rem(); soclrstate(so, SS_ISCONNECTED); /* XXX */ error = 0; } diff --git a/sys/sys/protosw.h b/sys/sys/protosw.h index d052d61866..48b521831a 100644 --- a/sys/sys/protosw.h +++ b/sys/sys/protosw.h @@ -243,6 +243,10 @@ struct pr_usrreqs { void (*pru_connect) (netmsg_t msg); void (*pru_connect2) (netmsg_t msg); void (*pru_control) (netmsg_t msg); + /* + * If pru_detach() returns EJUSTRETURN, then protocol will + * call sodiscard() and sofree() for soclose(). + */ void (*pru_detach) (netmsg_t msg); void (*pru_disconnect) (netmsg_t msg); void (*pru_listen) (netmsg_t msg); diff --git a/sys/sys/socketops.h b/sys/sys/socketops.h index 957d4abf6e..2266e3af10 100644 --- a/sys/sys/socketops.h +++ b/sys/sys/socketops.h @@ -89,7 +89,7 @@ int so_pru_connect2 (struct socket *so1, struct socket *so2); int so_pru_control_direct(struct socket *so, u_long cmd, caddr_t data, struct ifnet *ifp); int so_pru_detach (struct socket *so); -void so_pru_detach_direct (struct socket *so); +int so_pru_detach_direct (struct socket *so); int so_pru_disconnect (struct socket *so); void so_pru_disconnect_direct (struct socket *so); int so_pru_listen (struct socket *so, struct thread *td); diff --git a/sys/sys/socketvar.h b/sys/sys/socketvar.h index 7461e45133..198531fd59 100644 --- a/sys/sys/socketvar.h +++ b/sys/sys/socketvar.h @@ -189,6 +189,7 @@ struct socket { #define SS_ASSERTINPROG 0x0100 /* sonewconn race debugging */ #define SS_ASYNC 0x0200 /* async i/o notify */ #define SS_ISCONFIRMING 0x0400 /* deciding to accept connection req */ +#define SS_ISCLOSING 0x0800 /* in process of closing */ #define SS_INCOMP 0x0800 /* unaccepted, incomplete connection */ #define SS_COMP 0x1000 /* unaccepted, complete 
connection */ @@ -430,6 +431,7 @@ int soconnect2 (struct socket *so1, struct socket *so2); int socreate (int dom, struct socket **aso, int type, int proto, struct thread *td); int sodisconnect (struct socket *so); +void sodiscard (struct socket *so); void sofree (struct socket *so); int sogetopt (struct socket *so, struct sockopt *sopt); void sohasoutofband (struct socket *so); -- 2.41.0
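Illustration only, not part of the patch: the marker-based iteration used in in6_pcbnotify(), in6_pcbpurgeif0() and udp6_input() can be sketched in userland C as below. This is a minimal sketch under stated assumptions. The names struct node, struct list, NODE_MARKER, walk_with_marker() and drop_even() are hypothetical stand-ins for struct inpcb, the pcblisthead, INP_PLACEMARKER and the per-inpcb work; the hand-rolled list helpers stand in for the LIST_* macros, and no token is taken because the sketch is single-threaded.

```c
#include <assert.h>
#include <stddef.h>

#define NODE_MARKER 0x1	/* hypothetical; plays the role of INP_PLACEMARKER */

struct node {
	struct node *prev, *next;
	int flags;
	int id;
};

struct list {
	struct node *first;
};

static void
list_insert_head(struct list *l, struct node *n)
{
	n->prev = NULL;
	n->next = l->first;
	if (l->first != NULL)
		l->first->prev = n;
	l->first = n;
}

static void
list_insert_after(struct node *pos, struct node *n)
{
	n->prev = pos;
	n->next = pos->next;
	if (pos->next != NULL)
		pos->next->prev = n;
	pos->next = n;
}

static void
list_remove(struct list *l, struct node *n)
{
	if (n->prev != NULL)
		n->prev->next = n->next;
	else
		l->first = n->next;
	if (n->next != NULL)
		n->next->prev = n->prev;
	n->prev = n->next = NULL;
}

/*
 * Visit every real node.  visit() may unlink the node it is given;
 * the walk still reaches the rest of the list, because its position
 * is held by the marker and not by a pointer to the current node.
 * Returns the number of real nodes visited.
 */
static int
walk_with_marker(struct list *l, void (*visit)(struct list *, struct node *))
{
	struct node marker = { NULL, NULL, NODE_MARKER, -1 };
	struct node *n;
	int visited = 0;

	list_insert_head(l, &marker);
	while ((n = marker.next) != NULL) {
		/* Step the marker past n before n is processed. */
		list_remove(l, &marker);
		list_insert_after(n, &marker);

		if (n->flags & NODE_MARKER)
			continue;	/* another walker's marker */

		visit(l, n);
		visited++;
	}
	assert(marker.next == NULL);
	list_remove(l, &marker);
	return visited;
}

/* Example callback: unlink nodes with an even id while being visited. */
static void
drop_even(struct list *l, struct node *n)
{
	if (n->id % 2 == 0)
		list_remove(l, n);
}
```

Walking a list holding ids 0..4 with drop_even() visits all five nodes and leaves 1 and 3 linked. In the patch itself the marker is not a stack variable: it comes from in_pcbmarker(mycpuid), and the walk runs between GET_PCBINFO_TOKEN() and REL_PCBINFO_TOKEN().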