socket: Implement the SO_USER_COOKIE option

This socket option allows attaching an arbitrary uint32_t value to a socket as a user-defined cookie/metadata; the cookie can then be used in the kernel to help manipulate the socket's traffic. For example, this socket option can be set by WireGuard and then matched in IPFW to help control the WireGuard traffic.

This commit is mostly derived from FreeBSD, but I decided to also support this option in getsockopt(). Note that support for this option in IPFW (and PF and others) still needs to be implemented. I'd like to do it in the future, but it may take quite some effort.

This commit alone doesn't achieve much benefit, but it helps port the WireGuard code from FreeBSD, so commit it first.

Bump __DragonFly_version.

Credit: https://github.com/freebsd/freebsd-src/commit/d5e8d236f4009fc2611f996c317e94b2c8649cf5
socket: introduce SO_RERROR to detect receive buffer overflow

Kernel receive buffers are initially of a limited size, and generally the network protocols that use them don't care if a packet gets lost. However, some users do care about lost messages even when that is not baked into the protocol - such as consumers of route(4) tracking state. POSIX states that read(2) can return an error of ENOBUFS, so return this error code when an overflow is detected. Guard this with the socket option SO_RERROR so that existing applications which do not care can carry on not caring by default.

Taken-from: NetBSD
Reviewed-by: sephe
kernel - Add kern.ipc.soaccept_reuse and set default to 1

* This feature, enabled by default, allows a service listening on a socket to be killed and restarted without causing "bind: Address already in use" errors due to accepted connections still being present.

* The accepted connections may still be present either because they are still in active use (though typically this is not the case when a service is killed... its children also get killed), or, more importantly, because the sockets linger on a TCP timeout. In both of these situations we allow bind() to ignore matches against accepted connections.

  This allows a service to be restarted without having to set SO_REUSEADDR (for example, named/bind generally does not set SO_REUSEADDR, and restarting can be a pain).
kernel - Refactor inum stat data for sockets

* Assign a dummy inode number to all sockets. We previously were only assigning a dummy inode number to unix domain sockets. Use the new pcpu facility and store the inum in the socket structure.

* Rip out the old inode number assigner for unix domain sockets; it was using an atomic_fetchadd_long() on a global variable, introducing unnecessary SMP stalls, and it was specific to unix domain sockets. The new facility is generic to all sockets and uses a pcpu data structure.
udp: Save original protocol processing port for later synchronization

Unlike TCP, a user could send data w/ an address to a UDP socket on which connect(2) is being called (those data messages will be on the original protocol processing port and forwarded to the new protocol processing port later), and then close the UDP socket (the detach message could be sent to the new protocol processing port before the inflight data messages). The inflight data messages would cause a panic later, since the socket/inp has been destroyed by the detach message. I will have to say this probably will never happen for any real-world application.

We fix this by recording the original message port and synchronizing inflight data messages on it upon detach. If connect(2) moves the socket between protocol processing ports more than once, we go through all UDP processing netisrs to synchronize all possible inflight data messages.
socket: Close the soreference() race against socket owner netisr sofree()

The race is kinda like this:

    Other thread/netisrN        netisrM (so->so_pcb owner)
            :                           :
    getpooltoken(head);                 :
    so->so_head = NULL;                 :
            :                     sofree(so);  (*)
    soreference(so);                    :
    relpooltoken(head);                 :

(*) sofree(so) frees the socket, since so->so_head is NULL and getpooltoken(head) is not called.

Reported-by: dillon@
udp: Make udp pcbinfo and portinfo per-cpu; greatly improve performance

MAJOR CHANGES:

- Add a token to protect the pcbinfo's inpcb list and wildcard hash table. Currently only the udp per-cpu pcbinfo sets this token. The udp serializer and netisr barrier are nuked.

  o udp inpcb list:

    Under most cases, the udp inpcb list is operated in its owner netisr. However, it is also accessed and modified (no effective udp inpcb will be unlinked, though) in netisr0 to adjust multicast options if an interface is to be detached, so protecting udp inpcb list access and modification w/ a token is necessary.

    At udp inpcb detach time, the udp inpcb is first removed from the udp inpcb list, then a message goes through all netisrs, which makes sure that no netisr is using or can find this udp inpcb from the udp inpcb list. After all this, the udp inpcb is destroyed in its owner netisr.

    In netisrs, it is MP safe to find a udp inpcb from the udp inpcb list, then release the token and process the found udp inpcb. In other threads, it is MP safe to find a udp inpcb from the udp inpcb list, then release the token and process the found udp inpcb in a non-blocking fashion. See also the usage of the inpcb marker.

  o udp wildcard hash table:

    On the input path, the udp wildcard hash table is searched in its owner netisr. In order to ease implicit binding (bind during send), connect after binding, and disconnect, udp inpcbs are inserted into and removed from other udp pcbinfos' wildcard hash tables in their owner netisrs. Thus the udp wildcard hash table must be protected w/ a token.

    At udp inpcb detach time, a message goes through all netisrs, and the udp inpcb is removed from the udp wildcard hash table belonging to the current netisr. This makes sure that once the current netisr runs the message handler, this udp inpcb can no longer be used or found in the current netisr. When the message reaches the last netisr, the udp inpcb is redispatched to its owner netisr to be destroyed.

    In netisrs, it is MP safe to find a udp inpcb from the udp wildcard hash table, then release the token and process the found udp inpcb, e.g. use a udp inpcb found by in_pcblookuphash(). In other threads, it is MP safe to find a udp inpcb from the udp wildcard hash table, then release the token and process the found udp inpcb in a non-blocking fashion. See also the usage of the inpcb container marker.

  o udp connect hash table:

    It is lockless MP safe, and is only accessed and modified in its owner netisr.

- During inpcb iteration through the inpcb list, use an inpcb marker when calling functions which may block, e.g. in_pcbpurgeif0(), so the inpcb iteration will not stop prematurely if the inpcb being processed is removed from the inpcb list.

- Use the udp wildcard hash table and the udp connect hash table to dispatch input multicast and broadcast udp datagrams. Using the udp inpcb list could be time-consuming, since we would need to check the udp inpcb lists on all cpus; and secondly, once a udp inpcb has a local port, it will be in either the udp wildcard hash table or the udp connect hash table.

  Since socket buffer operations on the input path may block, the inpcb container marker is used when iterating inpcbs from the udp wildcard hash table. in_pcblookup_pkthash() is adjusted to skip inpcb container markers.

- udp socket so_port is no longer fixed to the netisr0 msgport:

  o The initial udp socket so_port is the current cpu's netisr msgport.
  o A bound but unconnected udp socket's so_port is selected according to the local port hash.
  o A connected udp socket's so_port is selected according to the udp hash, i.e. the laddr/faddr toeplitz hash (exception: a multicast laddr or faddr is hashed to netisr0).
  o Multicast socket options are forced to be handled in netisr0, since the udp socket so_port may not be the netisr0 msgport.

- In order to support asynchronous udp inpcb detach:

  o EJUSTRETURN from the pru_detach method now means the protocol will call sodiscard() and sofree() for soclose(). The udp pru_detach method returns EJUSTRETURN as of this commit.
  o The SS_ISCLOSING socket state is set before calling the pru_detach method, so the protocol can avoid certain expensive, unnecessary or disallowed operations in its pru_disconnect or pru_detach method, e.g. the udp pru_disconnect method avoids putting the udp inpcb back into the udp wildcard hash table if SS_ISCLOSING is set.

MISC CHANGES:

- pcbinfo's cpu id must be set now; -1 is disallowed.

- The udp pru_abort method should never be called; it panics now.

- Restore traditional BSD behaviour if an unbound udp socket's connect fails: if a local port of the udp socket has been selected, its inpcb should be in the wildcard hash table, i.e. the udp inpcb should be visible on the udp datagram input path.

- Make sure multicast state is adjusted only in netisr0 for inet6, if an interface is about to be detached.

PERFORMANCE IMPROVEMENT:

For the 'kq_connect_client -u' test, this commit gives a 400% performance improvement (31Kconns/s -> 160Kconns/s).
kernel - Adjust ssb_space_prealloc() use cases

* Add two flags to the signalsockbuf ssb_flags field:

  SSB_PREALLOC - Indicates that data preallocation tracking is being used
  SSB_STOPSUPP - Indicates that SSB_STOP flow control is being used

* Unix domain sockets set SSB_STOPSUPP; tcp and sctp sockets set SSB_PREALLOC.

* sendfile() requires that either SSB_PREALLOC or SSB_STOPSUPP be specified.

* Code now conditionalizes the use of ssb_space() vs ssb_space_prealloc() based on the presence of the SSB_PREALLOC flag.

Reported-by: sephe
kernel - network adjustments (netisr, tcp, and socket buffer changes)

* Change sowakeup() to use an atomic fetch when testing WAIT/WAKEUP for a quick return. It is now coded properly. The previous coding is not known to have created any bugs.

* Change sowakeup() to use ssb_space_prealloc() instead of ssb_space() when testing against the transmit low-water mark. This is a bug fix which primarily affects very tiny write()s. The prior code is not known to have created any problems.

* Make the netisr packet count before doing a rollup programmable and change the default from 512 to 32 for the moment. This may be changed back to 512 (or some number in between) after further testing. The issue here is that interrupt/netisr pipelining can cause ack aggregation to be delayed for too many packets.

* For TCP, when timestamps are not being used, pass the correct delta to tcp_xmit_timer() in our fallback. The function expects N+1. This should improve/fix incorrect rtt calculations when tcp timestamps are not in use.

* Fix an edge case in tcp_xmit_bandwidth_limit() where the 'ticks' global could change value out from under the code. Load the global into a local variable.

* Change the inflight code to use (t_srtt + t_rttvar) instead of (t_srtt + t_rttbest) / 2. This needs fine-tuning; the buffer is still too big. Expect more commits later.

* Call sowwakeup() when appending an mbuf to a stream. The append can call sbcompress() and make a stream buffer that has hit its mbuf limit writable again.

* Remove the ssb_notify() macro and collapse the sorwakeup() and sowwakeup() macros. They now just call sowakeup() on the appropriate sockbuf. The notify test is now done in sowakeup().