kernel - Fix long-standing bug in kqueue backend for *poll*()

* The poll() family of system calls passes an fds[] array with a series of
  descriptors and event requests. Our kernel implementation uses kqueue, but
  a long-standing bug breaks situations where more than one fds[] entry for
  the poll corresponds to the same { ident, filter } for kqueue, causing
  only the last such entry to be registered with kqueue and breaking poll().

* Added a feature to kqueue to supply further distinctions between knotes
  beyond the nominal { kq, filter, ident } tuple, allowing us to fix poll().

* Added a FreeBSD feature where poll() implements an implied POLLHUP when
  events = 0. This is used by X11 and (perhaps mistakenly) also by sshd.
  Our poll previously ignored fds[] entries with events = 0.

* Note that sshd can generate poll fds[] arrays with both an events = 0 and
  an events = POLLIN entry for the same descriptor, which broke sshd when I
  initially added the events = 0 support, due to the first bug. Now with
  that fixed, sshd works properly (a small repro follows below). However,
  it is unclear whether the authors of sshd intended events = 0 to detect
  POLLHUP or not.

Reported-by: servik (missing events = 0 poll feature)
Testing: servik, dillon
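A minimal userland repro of the duplicate-entry case described above (an
illustration, not code from the commit): two fds[] entries share the same
descriptor, so both map to the same { ident, EVFILT_READ } tuple in the
kqueue backend, and before the fix only the last entry was registered.

    #include <poll.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
        int p[2];
        int n;
        struct pollfd fds[2];

        if (pipe(p) < 0)
            return 1;
        write(p[1], "x", 1);        /* make read data available */

        fds[0].fd = p[0];
        fds[0].events = 0;          /* implied-POLLHUP entry (sshd-style) */
        fds[1].fd = p[0];
        fds[1].events = POLLIN;     /* same descriptor, same kqueue ident */

        n = poll(fds, 2, 0);
        printf("poll: %d, revents %#x %#x\n",
            n, (unsigned)fds[0].revents, (unsigned)fds[1].revents);
        /* With the fix, fds[1] reports POLLIN while fds[0] stays quiet
         * until the write side is closed. */
        return 0;
    }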
poll/select: Fix panic in kqueue backend

* The poll and select system calls use kqueue as a backend and attempt to
  cache active events from prior calls to improve performance. However,
  this makes a potential race more likely: in a high-concurrency
  application, one thread close()es a descriptor that another thread had
  previously used in a poll/select operation, and this close() races the
  later poll/select operation that is attempting to remove the kevent.

* The race can sometimes prevent the poll/select kevent copyout code from
  removing previously cached but no-longer-used events, because the removal
  references the events by their descriptor rather than directly, and the
  descriptor is no longer valid. This causes kern_kevent() to loop
  infinitely and hit a panic designed to check for that situation.

* Fix the problem by moving the removal of old events from the poll/select
  copyout code into kqueue_scan(). kqueue_scan() can detect old unused
  events using the sequence id that the poll/select kernel code stores in
  the kevent (sketched below).
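A minimal sketch of the stale-event test now done inside the scan; the
names (knote_sketch, kn_udata_seq) are illustrative stand-ins, not the
actual DragonFly structures.

    #include <stdint.h>

    struct knote_sketch {
        uint64_t kn_udata_seq;  /* serial poll/select stored in the kevent */
    };

    static int
    knote_is_stale(const struct knote_sketch *kn, uint64_t cur_serial)
    {
        /*
         * Events cached by a previous poll/select call carry an older
         * serial. Detecting them here, inside kqueue_scan(), means the
         * copyout code no longer has to look the event up by descriptor
         * later; that lookup fails (and used to loop forever) when the
         * descriptor was concurrently close()d.
         */
        return (kn->kn_udata_seq != cur_serial);
    }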
kernel - Implement POLLHUP for pipes and filesystem fifos (3)

* Add an internal NOTE_HUPONLY flag to allow the poll() system call to tell
  the kevent system that EVFILT_READ should only trigger on a HUP and not
  trigger on read-data-present (sketched below).

* Linux does not trigger POLLHUP on a half-closed socket; make DFly have
  the same behavior. POLLHUP is only triggered on a fully-closed socket.

* Fix a bug where data-present on the pipe, socket, or fifo would trigger
  an EVFILT_READ event when only a HUP was being requested. This caused our
  poll() implementation to complain about spurious events (which then
  resulted in incorrect operation).
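A sketch of the NOTE_HUPONLY semantics described above; the flag value and
struct layout are placeholders, not the kernel's definitions.

    #include <stdint.h>

    #define NOTE_HUPONLY_SK 0x0100

    struct rdfilter_state {
        uint32_t sfflags;       /* filter flags set at registration time */
        int      data_ready;    /* read data is present */
        int      hup;           /* writer side has gone away */
    };

    /* Returns nonzero if the EVFILT_READ knote should fire. */
    static int
    filt_read_sketch(const struct rdfilter_state *st)
    {
        if (st->hup)
            return (1);         /* a HUP always triggers */
        if (st->sfflags & NOTE_HUPONLY_SK)
            return (0);         /* poll() asked for HUP only; suppress data */
        return (st->data_ready);
    }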
kernel - Refactor in-kernel system call API to remove bcopy()

* Change the in-kernel system call prototype to take the system call
  arguments as a separate pointer, and make the contents read-only:

	int sy_call_t (void *);
	int sy_call_t (struct sysmsg *sysmsg, const void *);

* System calls with 6 arguments or less no longer need to copy the
  arguments from the trapframe to a holding structure. Instead, we simply
  point into the trapframe (sketched below). The L1 cache footprint will be
  a bit smaller, but in simple tests the results are not noticeably
  faster... maybe 1ns or so (roughly 1%).
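A sketch of the zero-copy argument convention described above; all names
(sysmsg_sketch, trapframe_sketch, the register layout) are illustrative
stand-ins, not the kernel's actual definitions.

    #include <stdint.h>

    struct sysmsg_sketch {
        int64_t sm_result;
    };

    typedef int sy_call_sketch_t(struct sysmsg_sketch *, const void *);

    struct trapframe_sketch {
        /* first six syscall argument registers, laid out contiguously */
        int64_t tf_arg[6];
    };

    static int
    syscall_dispatch(sy_call_sketch_t *callp, struct sysmsg_sketch *msg,
        struct trapframe_sketch *frame)
    {
        /*
         * With six or fewer arguments already contiguous in the frame,
         * the handler can read them in place: no bcopy() into a holding
         * structure, and the const qualifier keeps the handler from
         * modifying the frame.
         */
        return (callp(msg, frame->tf_arg));
    }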
kernel - Refactor kern_kevent(), fix timeout overflow (ppoll() bug) (2)

* Certain unsupported EV_ERROR events can cause kern_kevent() to live-lock,
  which hits a 'checkloop failed' panic. Silently deregister such events.

* Complain and deregister any kqueue event registered on behalf of *poll()
  which does not set any poll return flags (sketched below).

Reported-by: swildner
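A sketch of the two copyout-time guards described above; the names
(kev_sketch, EV_ERROR_SK, deregister_sketch) are illustrative only.

    #include <stdio.h>

    #define EV_ERROR_SK 0x4000

    struct kev_sketch {
        unsigned short flags;
        unsigned int   pollflags;   /* poll return flags derived from event */
    };

    static void deregister_sketch(struct kev_sketch *kev) { (void)kev; }

    /* Returns nonzero if the event should be reported to *poll(). */
    static int
    poll_copyout_check(struct kev_sketch *kev)
    {
        if (kev->flags & EV_ERROR_SK) {
            /* Unsupported EV_ERROR event: silently drop it so the
             * kern_kevent() scan loop can make progress. */
            deregister_sketch(kev);
            return (0);
        }
        if (kev->pollflags == 0) {
            /* Event produced no poll return flags: complain and drop. */
            fprintf(stderr, "poll: event set no return flags\n");
            deregister_sketch(kev);
            return (0);
        }
        return (1);
    }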
kernel - Generate POLLHUP for fully disconnected socket

* Properly generate POLLHUP for fully disconnected sockets (sketched
  below). However, there is still a possible issue: we do not set POLLHUP
  for half-closed sockets, and it is really unclear whether we should or
  not once read data has been exhausted.
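A sketch of the full-disconnect test described above; the flag values are
illustrative stand-ins for the socket state bits.

    #include <stdint.h>

    #define SS_CANTRCVMORE_SK   0x0001
    #define SS_CANTSENDMORE_SK  0x0002

    static int
    so_reports_pollhup(uint32_t ss_state)
    {
        /*
         * Only a fully-disconnected socket (both directions shut down)
         * reports POLLHUP; a half-closed socket does not.
         */
        return ((ss_state & (SS_CANTRCVMORE_SK | SS_CANTSENDMORE_SK)) ==
            (SS_CANTRCVMORE_SK | SS_CANTSENDMORE_SK));
    }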
kernel - Remove SMP bottlenecks on uidinfo, descriptors, and lockf

* Use an eventcounter and the per-thread fd cache to fix bottlenecks in
  checkfdclosed(). This will work well for the vast majority of
  applications and test benches.

* Batch holdfp*() operations on kqueue collections when implementing poll()
  and select(). This significantly improves performance. Full scaling is
  not yet achieved, however.

* Increase copyin item batching from 8 to 32 for select() and poll().

* Give the uidinfo structure a pcpu array to hold the posixlocks and
  openfiles count fields, with a rollup contained in the uidinfo structure
  itself. This removes numerous global bottlenecks related to open(),
  close(), dup*(), and lockf operations (the posixlocks count).
  ui_openfiles will force a rollup when the limit is hit, to be sure that
  the limit was actually reached. ui_posixlocks stays fairly loose: each
  cpu generally rolls up only when its pcpu count exceeds +32 or drops
  below -32 (sketched below).

* Give the proc structure a pcpu array for the same counts, in order to
  properly support seteuid() and such.

* Replace P_ADVLOCK with a char field, proc->p_advlock_flag, and remove the
  token operations around the field.
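A minimal sketch of the loose pcpu counter rollup described above;
NCPU_SKETCH and the field names are illustrative, and the atomics the real
code would need are elided.

    #include <stdint.h>

    #define NCPU_SKETCH     64
    #define ROLLUP_LIMIT    32

    struct uidinfo_sketch {
        int64_t ui_posixlocks;              /* global rollup */
        int32_t ui_pcpu[NCPU_SKETCH];       /* per-cpu deltas */
    };

    static void
    posixlocks_adjust(struct uidinfo_sketch *ui, int cpu, int delta)
    {
        ui->ui_pcpu[cpu] += delta;
        /*
         * Each cpu only touches the shared field when its local delta
         * drifts past +/-32, so the global cache line is rarely contended.
         */
        if (ui->ui_pcpu[cpu] > ROLLUP_LIMIT ||
            ui->ui_pcpu[cpu] < -ROLLUP_LIMIT) {
            ui->ui_posixlocks += ui->ui_pcpu[cpu];
            ui->ui_pcpu[cpu] = 0;
        }
    }

A limit check against ui_posixlocks alone can therefore be off by up to
32 per cpu, which is why ui_openfiles forces a full rollup before declaring
the limit reached.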
kernel - per-thread fd cache, p_fd lock bypass

* Implement a per-thread (fd,fp) cache. Cache hits can keep fp's in a held
  state (avoiding the need to fhold()/fdrop() the ref count) and bypass the
  p_fd spinlock. This allows the file pointer structure to generally be
  shared across cpu caches (sketched below).

* Up to four descriptors can be cached in each thread, LRU. This is the
  common case: highly threaded programs tend to focus work on distinct file
  descriptors in each thread.

* One file descriptor can be cached in up to four threads. This is a
  significant limitation, though a relatively uncommon case. On a cache
  miss the code drops into the normal shared p_fd spinlock lookup.
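A sketch of the per-thread (fd,fp) cache lookup described above; the struct
and function names are illustrative, not the kernel's.

    #include <stddef.h>

    #define FDCACHE_SLOTS 4

    struct file_sketch;             /* stands in for struct file */

    struct fdcache_sketch {
        int                 fd[FDCACHE_SLOTS];
        struct file_sketch *fp[FDCACHE_SLOTS];
    };

    static struct file_sketch *
    fdcache_lookup(struct fdcache_sketch *fc, int fd)
    {
        int i, j;

        for (i = 0; i < FDCACHE_SLOTS; i++) {
            if (fc->fd[i] == fd) {
                /*
                 * Move the hit to slot 0 (LRU). The fp stays held, so
                 * no fhold()/fdrop() and no p_fd spinlock on this path.
                 */
                struct file_sketch *fp = fc->fp[i];

                for (j = i; j > 0; j--) {
                    fc->fd[j] = fc->fd[j - 1];
                    fc->fp[j] = fc->fp[j - 1];
                }
                fc->fd[0] = fd;
                fc->fp[0] = fp;
                return (fp);
            }
        }
        return (NULL);      /* miss: fall back to shared p_fd lookup */
    }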
poll/select: Use a 64-bit serial for poll/select's kevent.udata.

This fixes the issue mentioned in commit
ce4975442fa0524017fb3c1aef93bbe6880ae770.

It takes ~200 years for a 2.5GHz CPU to wrap the 64-bit serial; even if CPU
speed were 10 times faster tomorrow, it would still take two decades to
wrap it (see the arithmetic check below).

Suggested-by: dillon@
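A quick back-of-envelope check of the wrap-time claim, assuming the worst
case of one serial increment per cycle at 2.5GHz:

    #include <stdio.h>

    int
    main(void)
    {
        double increments = 18446744073709551616.0;     /* 2^64 */
        double hz = 2.5e9;                              /* one bump/cycle */
        double years = increments / hz / (365.25 * 24 * 3600);

        printf("64-bit serial wraps after ~%.0f years\n", years);
        /* ~234 years; at 10x the increment rate, still ~23 years. */
        return 0;
    }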
select: Don't allow unwanted/leftover fds to be returned.

The root cause is that the lwp_kqueue_serial wraps quite quickly (6 seconds
on my laptop) if select(2) is called in a polling loop, either due to heavy
workload or a 0 timeout.

The POC test:
https://leaf.dragonflybsd.org/~sephe/select_wrap.c

Fix this issue by saving the original fd_sets and doing additional kevent
filtering before returning the fds to userland (sketched below).

poll(2) suffers from a similar issue and will be fixed in a later commit.

Reported-by: many
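A sketch of the post-scan filtering described above; the function is
illustrative, not the kernel's actual code. Returned fds are intersected
with the caller's saved interest set so that a stale kevent (left over from
a wrapped serial) cannot surface an fd the caller never asked about.

    #include <sys/select.h>

    static void
    select_filter_results(fd_set *result, fd_set *orig_interest, int nfds)
    {
        int fd;

        for (fd = 0; fd < nfds; fd++) {
            if (FD_ISSET(fd, result) && !FD_ISSET(fd, orig_interest))
                FD_CLR(fd, result);
        }
    }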
kernel - Remove mplock from KTRACE paths

* The mplock is no longer needed for KTRACE; ktrace writes are serialized
  by the vnode lock and everything else is MPSAFE. Note that this change
  means that even fast system calls may interleave in the ktrace output of
  a multi-threaded program.

* Fix a ktrace bug related to vkernels. The syscall2() code assumes that no
  tokens are held on entry (since we are coming from usermode), but a
  system call made from the vkernel may actually be nested inside another
  syscall2(). The mplock held for KTRACE caused this to assert in the
  nested syscall2(). Removing the mplock from the ktrace path also fixes
  this bug.

* Minor comment adjustment in vm_vmspace.c.

Reported-by: tuxillo
kernel - Implement ppoll system call with precise microseconds timeout.

* Implement a maximum timeout of 2000s, because systimer(9) only accepts an
  int timeout in microseconds (see the arithmetic check below).

* Add a kern.kv_sleep_threshold sysctl variable for tuning the threshold
  for the ppoll sleep duration (in nanoseconds), below which we busy-loop
  with DELAY instead of using tsleep for waiting.
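A quick check of why the cap is 2000s rather than something larger:

    #include <stdio.h>
    #include <limits.h>

    int
    main(void)
    {
        /* systimer(9) takes the period as an int count of microseconds */
        printf("INT_MAX microseconds = %.2f seconds\n", INT_MAX / 1e6);
        /* ~2147.48s; capping ppoll at 2000s leaves headroom below the
         * int overflow. */
        return 0;
    }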