kernel - Add per-process capability-based restrictions * This new system allows userland to set capability restrictions which turn off numerous kernel features and root accesses. These restrictions are inherited by sub-processes recursively. Once set, restrictions cannot be removed. Basic restrictions that mimic an unadorned jail can be enabled without creating a jail, but generally speaking real security also requires creating a chrooted filesystem topology, and a jail is still needed to really segregate processes from each other. If you do so, however, you can (for example) disable mount/umount and most global root-only features. * Add new system calls and a manual page for syscap_get(2) and syscap_set(2) * Add sys/caps.h * Add the "setcaps" userland utility and manual page. * Remove priv.9 and the priv_check infrastructure, replacing it with a newly designed caps infrastructure. * The intention is to add path restriction lists and similar features to improve jailless security in the near future, and to optimize the priv_check code.
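The inheritance and "cannot be removed" rules in the commit above can be modelled in a few lines of userland C. This is only a sketch of the semantics: `struct demo_proc`, the `DEMO_RESTRICT_*` bits, and the helper functions are hypothetical names invented for illustration, and none of them reflect the actual contents of sys/caps.h or the syscap_set(2) interface.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical restriction bits; the real definitions live in sys/caps.h. */
#define DEMO_RESTRICT_MOUNT (1u << 0)
#define DEMO_RESTRICT_ROOT  (1u << 1)

/* Hypothetical per-process state: a set bit means "feature turned off". */
struct demo_proc {
    uint32_t restricted;
};

/* Add restrictions. There is deliberately no way to clear a bit. */
static void demo_restrict(struct demo_proc *p, uint32_t bits)
{
    p->restricted |= bits;
}

/* A child starts with a copy of its parent's restrictions. */
static struct demo_proc demo_fork(const struct demo_proc *parent)
{
    struct demo_proc child = { parent->restricted };
    return child;
}

/* An operation is allowed only while its restriction bit is clear. */
static bool demo_allowed(const struct demo_proc *p, uint32_t bit)
{
    return (p->restricted & bit) == 0;
}
```

In this model a restriction set on a parent is visible in every later child, while a restriction added in a child never propagates back to the parent.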
kernel - Fix /dev/fd/N and clean up the old dup error-code-driven path * When opening /dev/fd/N, replicate the file pointer for descriptors that represent vnodes instead of dup()ing. This ensures that the seek offset and other fp-related elements are not shared unexpectedly. * Refactor the open() path to allow dev_dopen() to replace the struct file by passing a struct file ** instead of a struct file *. This removes old error-code-based hacks. * This fixes the shared seek position that fexecve() was operating with due to its use of /dev/fd/N for scripts. Reported-by: aly
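The difference between a dup()ed descriptor and a separately opened one, which is what the /dev/fd/N commit above is about, can be observed from userland with plain POSIX calls. The sketch below does not touch /dev/fd or the kernel code path; the helper name `offset_seen_by_second_fd` and the temporary file template are assumptions made for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Seek the first descriptor to `pos` and report the offset seen through a
 * second descriptor. If `share` is nonzero the second descriptor comes from
 * dup() (same file pointer); otherwise it comes from a separate open() of
 * the same path (its own file pointer). Returns -1 on any error.
 */
static off_t offset_seen_by_second_fd(int share, off_t pos)
{
    char path[] = "/tmp/fdoffsetXXXXXX";
    int fd1 = mkstemp(path);
    if (fd1 < 0)
        return -1;

    int fd2 = share ? dup(fd1) : open(path, O_RDONLY);
    off_t seen = -1;

    if (fd2 >= 0 &&
        write(fd1, "0123456789", 10) == 10 &&
        lseek(fd1, pos, SEEK_SET) == pos) {
        /* Query the second descriptor's offset without moving it. */
        seen = lseek(fd2, 0, SEEK_CUR);
    }

    if (fd2 >= 0)
        close(fd2);
    close(fd1);
    unlink(path);
    return seen;
}
```

With `share` set, the second descriptor follows the first one's seek position; without it, the second descriptor stays at offset 0. The latter is the independent behaviour the commit describes for descriptors that represent vnodes.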
kernel - Refactor in-kernel system call API to remove bcopy() * Change the in-kernel system call prototype to take the system call arguments as a separate pointer, and make the contents read-only. Old: int sy_call_t (void *); New: int sy_call_t (struct sysmsg *sysmsg, const void *); * System calls with 6 arguments or fewer no longer need to copy the arguments from the trapframe to a holding structure. Instead, we simply point into the trapframe. The L1 cache footprint will be a bit smaller, but in simple tests the results are not noticeably faster... maybe 1ns or so (roughly 1%).
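A rough userland model of the new calling convention may help: the handler receives a result message and a read-only pointer to its arguments, and the dispatcher points that argument pointer directly at the saved registers instead of copying them first. Every type and function below (`demo_sysmsg`, `demo_trapframe`, `demo_add_args`, `demo_sys_add`, `demo_dispatch`) is a hypothetical stand-in; only the shape of the handler prototype is taken from the commit message.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the kernel structures. */
struct demo_sysmsg    { int64_t result; };
struct demo_trapframe { int64_t regs[6]; };   /* up to six argument slots */
struct demo_add_args  { int64_t a; int64_t b; };

/* Handler type modelled on the new prototype: message plus const args. */
typedef int demo_sy_call_t(struct demo_sysmsg *sysmsg, const void *uap);

/* Example handler: reads its arguments in place, writes only the message. */
static int demo_sys_add(struct demo_sysmsg *sysmsg, const void *uap)
{
    const struct demo_add_args *args = uap;

    sysmsg->result = args->a + args->b;
    return 0;
}

/* Dispatcher: no copy into a holding structure, just a pointer into tf. */
static int64_t demo_dispatch(demo_sy_call_t *fn,
                             const struct demo_trapframe *tf)
{
    struct demo_sysmsg msg = { 0 };
    int error = fn(&msg, tf->regs);

    return error ? -1 : msg.result;
}
```

Because the argument pointer is `const`, a handler in this model cannot scribble on the saved registers, which is what makes pointing into the trapframe safe.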
Rename some functions to better names. devfs_find_device_by_udev() -> devfs_find_device_by_devid() dev2udev() -> devid_from_dev() udev2dev() -> dev_from_devid() This fits with the rest of the code. 'dev' usually means a cdev_t, such as in make_dev(), etc. Instead of 'udev', use 'devid', since that's what dev_t is, a "Device ID".
<sys/types.h>: Get rid of udev_t. In a time long long ago, dev_t was a pointer, which later became cdev_t during the great cleanups, until it ended up being a uint32_t, just like udev_t. See for example the definitions of __dev_t in <sys/stat.h>. This commit cleans up further by removing the udev_t type, leaving just the POSIX dev_t type for both kernel and userland. Put it inside a _DEV_T_DECLARED to prepare for further cleanups in <sys/stat.h>.
kernel: Fix a problem with shutdown(8) - When halting a system with 'shutdown -h ...', it would be stuck forever in the 'press any key' message. Fix that by enabling kbd polling right before a cngetc() call. This is a follow-up of commit ce7866b8. With-help-from: dillon, swildner
kernel - Improve umount operation * Move the cache_inval(), and both cache_unmounting() and cache_clearmntcache() into the retry loop. This ensures that the nc_refs test actually has a chance to update during the retry. This should significantly improve umount operation, reducing umount races against exiting processes still using the filesystem. * Only issue the allproc scan which matches and clears proc->p_textnch on a forced umount. Otherwise disallow the umount attempt. We may have to reallow this later, but the shutdown code now properly clears p_textnch so it should take care of the case for us (which is why this code was originally present). * Properly dispose of p->p_textnch during shutdown/halt/reboot for the calling process, proc0, and init. This is an attempt to allow the system to cleanly unmount root. * Clean up the warning and error messages to clarify umount failures. * Only reinstall the syncer vp on error if umount deinstalled it. * Add some debugging sysctls (default disabled). Reported-by: marino
kernel - Implement QUICKHALT shortcut for unmounting during shutdown * Add the MNTK_QUICKHALT flag which allows the system to just unlink but otherwise ignore certain mount types during a halt or reboot. For now we flag tmpfs, devfs, and procfs. * The main impetus for this is to reduce the messing around we do with devfs during a shutdown. Devfs has its fingers, and its vnodes, pretty much sunk throughout the system (e.g. /dev/null, system console, vty's, root mount, and so on and so forth). There's no real need to attempt to unwind all of that mess nicely.
kernel - Refactor lockmgr()

* Seriously refactor lockmgr() so we can use atomic_fetchadd_*() for shared locks and reduce unnecessary atomic ops and atomic op loops. The main win here is being able to use atomic_fetchadd_*() when acquiring and releasing shared locks. A simple fstat() loop (which utilizes a LK_SHARED lockmgr lock on the vnode) improves from 191ns to around 110ns per loop with 32 concurrent threads (on a 16-core/32-thread xeon).

* To accomplish this, the 32-bit lk_count field becomes 64-bits. The shared count is separated into the high 32-bits, allowing it to be manipulated for both blocking shared requests and the shared lock count field. The low count bits are used for exclusive locks. Control bits are adjusted to manage lockmgr features.

  LKC_SHARED    Indicates shared lock count is active, else excl lock count. Can predispose the lock when the related count is 0 (does not have to be cleared, for example).
  LKC_UPREQ     Queued upgrade request. Automatically granted by releasing entity (UPREQ -> ~SHARED|1).
  LKC_EXREQ     Queued exclusive request (only when lock held shared). Automatically granted by releasing entity (EXREQ -> ~SHARED|1).
  LKC_EXREQ2    Aggregated exclusive request. When EXREQ cannot be obtained due to the lock being held exclusively or EXREQ already being queued, EXREQ2 is flagged for wakeup/retries.
  LKC_CANCEL    Cancel API support
  LKC_SMASK     Shared lock count mask (LKC_SCOUNT increments).
  LKC_XMASK     Exclusive lock count mask (+1 increments)

  The 'no lock' condition occurs when LKC_XMASK is 0 and LKC_SMASK is 0, regardless of the state of LKC_SHARED.

* Lockmgr still supports exclusive priority over shared locks. The semantics have slightly changed. The priority mechanism only applies to the EXREQ holder. Once an exclusive lock is obtained, any blocking shared or exclusive locks will have equal priority until the exclusive lock is released. Once released, shared locks can squeeze in, but then the next pending exclusive lock will assert its priority over any new shared locks when it wakes up and loops. This isn't quite what I wanted, but it seems to work quite well. I had to make a trade-off in the EXREQ lock-grant mechanism to improve performance.

* In addition, we use atomic_fcmpset_long() instead of atomic_cmpset_long() to reduce cache line flip flopping at least a little.

* Remove lockcount() and lockcountnb(), which tried to count lock refs. Replace with lockinuse(), which simply tells the caller whether the lock is referenced or not.

* Expand some of the copyright notices (years and authors) for major rewrites. Really there are a lot more and I have to pay more attention to adjustments.
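The fetch-add idea behind the lockmgr() refactor can be sketched with C11 atomics. This is a simplified model, not the lockmgr() code: the `DEMO_*` constants, `struct demo_lock`, and the try-style helpers are hypothetical, the control bits (LKC_SHARED, LKC_EXREQ, and so on) are omitted entirely, and the whole low 32 bits are assumed to hold the exclusive count. C11's `atomic_fetch_add` and `atomic_compare_exchange_strong` stand in for the kernel's own atomic primitives.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: shared count in the high 32 bits, exclusive low. */
#define DEMO_SCOUNT (UINT64_C(1) << 32)   /* one shared holder */
#define DEMO_XMASK  UINT64_C(0xffffffff)  /* exclusive count bits */

struct demo_lock {
    _Atomic uint64_t count;
};

/*
 * Shared acquire: one unconditional fetch-add instead of a compare-and-swap
 * loop. If an exclusive holder was present, undo the increment and fail.
 */
static bool demo_try_shared(struct demo_lock *lk)
{
    uint64_t old = atomic_fetch_add(&lk->count, DEMO_SCOUNT);

    if ((old & DEMO_XMASK) == 0)
        return true;
    atomic_fetch_sub(&lk->count, DEMO_SCOUNT);
    return false;
}

/* Shared release is likewise a single fetch-sub. */
static void demo_release_shared(struct demo_lock *lk)
{
    atomic_fetch_sub(&lk->count, DEMO_SCOUNT);
}

/* Exclusive acquire succeeds only from the 'no lock' state (count == 0). */
static bool demo_try_exclusive(struct demo_lock *lk)
{
    uint64_t expected = 0;

    return atomic_compare_exchange_strong(&lk->count, &expected, 1);
}

static void demo_release_exclusive(struct demo_lock *lk)
{
    atomic_fetch_sub(&lk->count, 1);
}
```

Keeping the two counts in separate halves of one 64-bit word is what lets a shared acquire be a single add: the increment never disturbs the exclusive bits, and the value returned by the add tells the caller whether the lock was held exclusively at that moment. In this sketch an exclusive attempt can fail spuriously while a rejected shared attempt is backing out its increment, which is acceptable for try-style semantics.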
acpi_pvpanic: Notify Qemu VM host if we panic. By default the virtual machine is stopped when the pvpanic is triggered. For accessing ddb, the virtual machine execution then needs to be resumed manually (e.g. with the "cont" command on the Qemu-monitor console). Triggering a pvpanic instead of just continuing to the ddb prompt can be useful to avoid hogging the CPU while hanging in ddb, and to provide a kernel panic notification for the VM-host system.
kernel - Fix live lock in vfs_conf.c mountroot> * The mountroot> prompt calls cngetc() to process user input. However, this function hard loops and can prevent other kernel threads from running on the current cpu. * Rearrange the code to use cncheckc() and a 1/25 second tsleep(). * Fix a bug in the syscons code where NOKEY was not being properly returned as documented. Modify all use cases to handle NOKEY. This allows us to differentiate between a keyboard that is present with no key pressed and a keyboard that is not present. * Pull the automatic polling mode code out of cncheckc() (or more precisely, out of sccncheckc()) and add a new cnpoll() API function to set it manually. This fixes issues in vfs_conf when normal keyboard processing interrupts are operational and cncheckc() is used with a tsleep() delay. The normal processing interrupt wound up eating the keystrokes so the cncheckc() basically always failed. cncheckc() in general also always had a small window of opportunity where a keystroke could be lost due to loops on it. * Call cnpoll() in various places, such as when entering the debugger, asking for input in vfs_conf, and a few other places.
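The "check, then sleep 1/25 second" pattern that replaces the hard loop in the mountroot> commit can be illustrated in userland. This sketch does not use the console API: `DEMO_NOKEY`, `demo_checkc_fn`, `demo_getc_polled`, and the fake key source are hypothetical names, and `nanosleep()` stands in for the kernel's tsleep().

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Hypothetical "no key available yet" value for this sketch. */
#define DEMO_NOKEY (-1)

/* A non-blocking check: returns a character or DEMO_NOKEY. */
typedef int demo_checkc_fn(void *ctx);

/*
 * Poll at most max_polls times, sleeping 1/25 of a second between attempts
 * so that other work can run instead of spinning on the check.
 */
static int demo_getc_polled(demo_checkc_fn *checkc, void *ctx, int max_polls)
{
    const struct timespec delay = { 0, 1000000000L / 25 };

    for (int i = 0; i < max_polls; i++) {
        int c = checkc(ctx);

        if (c != DEMO_NOKEY)
            return c;
        nanosleep(&delay, NULL);
    }
    return DEMO_NOKEY;
}

/* Fake key source for demonstration: delivers 'a' on the third poll. */
static int demo_fake_checkc(void *ctx)
{
    int *calls = ctx;

    return (++*calls >= 3) ? 'a' : DEMO_NOKEY;
}
```

The caller gives up the cpu between checks, which is the property the commit is after; the polling-mode switch handled by cnpoll() is outside the scope of this model.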