procfs(5): Add '/proc/self/exe' symlink support

* Add the /proc/self symlink, which is the same as /proc/curproc.
* Add the /proc/<pid>/exe entry, which is the same as /proc/<pid>/file.

The '/proc/self/exe' symlink has already landed in NetBSD and FreeBSD [0]. It can simplify some patches to ports that look for this symlink.

[0] https://github.com/freebsd/freebsd-src/pull/976

GitHub PR: https://github.com/DragonFlyBSD/DragonFlyBSD/pull/22
kernel - Add per-process capability-based restrictions

* This new system allows userland to set capability restrictions which turn off numerous kernel features and root accesses. These restrictions are inherited by sub-processes recursively. Once set, restrictions cannot be removed. Basic restrictions that mimic an unadorned jail can be enabled without creating a jail, but generally speaking real security also requires creating a chrooted filesystem topology, and a jail is still needed to really segregate processes from each other. If you do so, however, you can (for example) disable mount/umount and most global root-only features.
* Add new system calls and a manual page for syscap_get(2) and syscap_set(2).
* Add sys/caps.h.
* Add the "setcaps" userland utility and manual page.
* Remove priv.9 and the priv_check infrastructure, replacing it with a newly designed caps infrastructure.
* The intention is to add path restriction lists and similar features to improve jailless security in the near future, and to optimize the priv_check code.
vm: Change 'kernel_map' global to type 'struct vm_map *'

Change the global variable 'kernel_map' from type 'struct vm_map' to a pointer to that struct. This simplifies the code a bit, since all invocations take its address. The change also aligns with NetBSD, where 'kernel_map' is likewise a pointer, which helps the porting of NVMM.

No functional changes.
kernel - Remove MAP_VPAGETABLE

* This will break vkernel support for now, but after a lot of mulling there's just no other way forward. MAP_VPAGETABLE was basically a software page-table feature for mmap()s that allowed the vkernel to implement page tables without needing hardware virtualization support.
* The basic problem is that the VM system is moving to an extent-based mechanism for tracking VM pages entered into PMAPs and is no longer indexing individual terminal PTEs with pv_entry's. This means that the VM system is no longer able to get an exact list of PTEs in PMAPs that a particular vm_page is using. It just has a flag 'this page is in at least one pmap' or 'this page is not in any pmaps'. To track down the PTEs, the VM system must run through the extents via the vm_map_backing structures hanging off the related VM object. This mechanism does not work with MAP_VPAGETABLE: short of scanning the entire real pmap, the kernel has no way to reverse-index a page that might be indirected through MAP_VPAGETABLE.
* We will need actual hardware MMU virtualization to get the vkernel working again.
kernel - Remove P_SWAPPEDOUT flag and paging mode

* This code basically no longer functions in any worthwhile or useful manner, so remove it. The code harkens back to a time when machines had very little memory and had to time-share processes by actually descheduling them for long periods of time (like 20 seconds) and paging out the related memory. In modern times the chooser algorithm just doesn't work well, because we can no longer assume that programs with large memory footprints can be demoted.
* In modern times machines have sufficient memory to rely almost entirely on the VM fault and pageout scan. The latencies caused by fault-ins are usually sufficient to demote paging-intensive processes while allowing the machine to continue to function. If this functionality needs to be added back in, it can be added on the fault path, not here.
kernel - Rework vfs_timestamp(), adjust default

* Rework the vfs_timestamp() precision modes as follows:

  0  TSP_SEC           seconds granularity
  1  TSP_HZ            ticks granularity
  2  TSP_USEC          ticks granularity modulo microseconds
  3  TSP_NSEC          ticks granularity modulo nanoseconds
  4  TSP_USEC_PRECISE  precise microseconds (expensive)
  5  TSP_NSEC_PRECISE  precise nanoseconds (expensive)

  The default is TSP_USEC (with tick granularity).

* Change numerous bits of code that were calling getmicrotime() or calling microtime()/nanotime() explicitly instead of calling vfs_timestamp(), procfs and devfs in particular.

Reported-by: mjg
kernel - Normalize the vx_*() vnode interface

* The vx_*() vnode interface is used for initial allocations, reclaims, and terminations. Normalize all use cases to prevent the mixing together of the vx_*() API and the vn_*() API. For example, vx_lock() should not be paired with vn_unlock(), and so forth.
* Integrate an update-counter mechanism into the vx_*() API, assert reasonability.
* Change vfs_cache.c to use an int update counter instead of a long. The vfs_cache code can't quite use the spin-lock update counter API yet. Use proper atomics for load and store.
* Implement VOP_GETATTR_QUICK, meant to be a 'quick' version of VOP_GETATTR() that only retrieves information related to permissions and ownership. This will be fast-pathed in a later commit.
* Implement vx_downgrade() to convert an exclusive vx_lock into an exclusive vn_lock (for vnodes). Adjust all use cases in the getnewvnode() path.
* Remove unnecessary locks in tmpfs_getattr() and don't use any in tmpfs_getattr_quick().
* Remove unnecessary locks in hammer2_vop_getattr() and don't use any in hammer2_vop_getattr_quick().
kernel - Rejigger mount code to add vfs_flags in struct vfsops

* Rejigger the mount code so we can add a vfs_flags field to vfsops, which mount_init() has visibility into.
* Allows nullfs to flag that its mounts do not need a syncer thread. Previously nullfs would destroy the syncer thread after the fact.
* Improves dsynth performance (it does lots of nullfs mounts).
kernel and libc - Reimplement lwp_setname*() using /dev/lpmap

* Generally speaking, we are implementing the features necessary to allow per-thread titles set via pthread_set_name_np() to show up in 'ps' output, and to use lpmap to make it fast.
* The lwp_setname() system call now stores the title in lpmap->thread_title[].
* Implement a libc fast path for lwp_setname() using lpmap. If called more than 10 times, libc will use lpmap for any further calls, which avoids the need to make any system calls.
* setproctitle() now stores the title in upmap->proc_title[] instead of replacing proc->p_args. proc->p_args is now no longer modified from its original contents.
* The kernel now consults the following sources, in priority order, when retrieving the process command line:

  lpmap->thread_title[]   User-supplied thread title, if not empty
  upmap->proc_title[]     User-supplied process title, if not empty
  proc->p_args            Original process arguments (no longer modified)

* Put the TID in /dev/lpmap for convenient access.
* Enhance the KERN_PROC_ARGS sysctl to allow the TID to be specified. The sysctl now accepts { KERN_PROC, KERN_PROC_ARGS, pid, tid } in addition to the existing { KERN_PROC, KERN_PROC_ARGS, pid } mechanism. Enhance libkvm to use the new feature; libkvm will fall back to the old version if necessary.
<sys/slaballoc.h>: Switch to lighter <sys/_malloc.h> header.

<sys/globaldata.h> embeds SLGlobalData, which in turn embeds "struct malloc_type". Adjust several kernel sources for missing includes where memory allocation is performed. Try to use alphabetical include order; now (in most cases) <sys/malloc.h> is included after <sys/objcache.h>.

Once this gets cleaned up, the <sys/malloc.h> inclusion could be moved out of <sys/idr.h> into the drm Linux compat layer's linux/slab.h without side effects.
kernel: Cleanup <sys/uio.h> issues.

The iovec_free() inline greatly complicates this header's inclusion: the NULL check is not always visible from <sys/_null.h>. Luckily only three kernel sources need it: kern_subr.c, sys_generic.c and uipc_syscalls.c. Also, just a single dev/drm source makes use of 'struct uio'.

* Include <sys/uio.h> explicitly first in drm_fops.c to avoid the kfree() macro override in the drm compat layer.
* Use <sys/_uio.h> where only the enums and struct uio are needed, but ensure that userland will not include it, for possible later <sys/user.h> use.
* Stop using <sys/vnode.h> as a shortcut for the uiomove*() prototypes. The uiomove*() family of functions potentially transfers data across the kernel/user space boundary, so this header's presence explicitly marks sources as such.
* Prefer to add <sys/uio.h> after <sys/systm.h>, but before <sys/proc.h> and definitely before <sys/malloc.h> (except for the 3 sources mentioned above). This will allow removing <sys/malloc.h> from <sys/uio.h> later on.
* Adjust <sys/user.h> to use component headers instead of <sys/uio.h>. While there, use the opportunity for a minimal whitespace cleanup.

No functional differences observed in compiler intermediates.
kernel - VM rework part 8 - Precursor work for terminal pv_entry removal

* Adjust structures so the pmap code can iterate backing_ba's with just the vm_object spinlock. Add a ba.pmap back-pointer. Move entry->start and entry->end into the ba (ba.start, ba.end). This replicates the base entry->ba.start and entry->ba.end, but local modifications are locked by individual objects, allowing pmap ops to just look at backing ba's iterated via the object. Remove the entry->map back-pointer. Remove the ba.entry_base back-pointer.
* ba.offset is now an absolute offset and not additive. Adjust all code that calculates and uses ba.offset (fortunately it is all concentrated in vm_map.c and vm_fault.c).
* Refactor ba.start/offset/end modifications to be atomic with the necessary spin-locks, so the pmap code can safely iterate the vm_map_backing list for a vm_object.
* Test the VM system with a full synth run.
kernel - VM rework part 2 - Replace backing_object with backing_ba

* Remove the vm_object based backing_object chains and all related chaining code. This removes an enormous number of locks from the VM system and also removes the object-to-object dependencies that required careful traversal code. A great deal of complex code has been removed and replaced with far simpler code. Ultimately the intention is to support removal of pv_entry tracking from vm_pages to gain lockless shared faults, but that is far in the future. It will require hanging vm_map_backing structures off of a list based in the object.
* Implement the vm_map_backing structure, which is embedded in the vm_map_entry and links to additional dynamically allocated vm_map_backing structures via entry->ba.backing_ba. This structure contains the object and offset and essentially takes over the functionality that object->backing_object used to have. Backing objects are now handled via vm_map_backing. In this commit, fork operations create a fan-in tree to share subsets of backings via vm_map_backing; in this particular commit, these subsets are not collapsed in any way.
* Remove all the vm_map_split and collapse code. Every last line is gone. It will be reimplemented using vm_map_backing in a later commit. This means that as of this commit, both recursive forks and parent-to-multiple-children forks cause an accumulation of inefficient lists of backing objects in the parent and children. This will begin to be addressed in part 3.
* The code no longer releases the vm_map lock (typically shared) across (get_pages) I/O. There are no longer any chaining locks to get in the way (hopefully). This means that the code does not have to re-check as carefully as it did before. However, some complexity will have to be added back in once we begin to address the accumulation of vm_map_backing structures.
* Paging performance improved by 30-40%.