kernel - Add per-process capability-based restrictions

* This new system allows userland to set capability restrictions which
  turn off numerous kernel features and root accesses. These
  restrictions are inherited recursively by sub-processes. Once set,
  restrictions cannot be removed.

  Basic restrictions that mimic an unadorned jail can be enabled
  without creating a jail, but generally speaking real security also
  requires creating a chrooted filesystem topology, and a jail is
  still needed to really segregate processes from each other. If you
  do so, however, you can (for example) disable mount/umount and most
  global root-only features.

* Add new system calls and a manual page for syscap_get(2) and
  syscap_set(2) (a usage sketch follows this entry).

* Add sys/caps.h

* Add the "setcaps" userland utility and manual page.

* Remove priv.9 and the priv_check infrastructure, replacing it with
  a newly designed caps infrastructure.

* The intention is to add path restriction lists and similar features
  to improve jailless security in the near future, and to optimize
  the new caps checking code.
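A minimal userland sketch of applying a restriction before exec'ing a
command. Hedged heavily: the syscap_set() prototype and the
SYSCAP_RESTRICTEDROOT flag name are assumptions for illustration
only; see syscap_set(2) and <sys/caps.h> for the real interface.

    #include <sys/types.h>
    #include <sys/caps.h>   /* new header added by this commit */
    #include <err.h>
    #include <unistd.h>

    int
    main(int argc, char **argv)
    {
        if (argc < 2)
            errx(1, "usage: restrict <command> [args...]");

        /*
         * Hypothetical call: enable a restriction for this process.
         * Once set it cannot be removed, and all sub-processes
         * inherit it recursively.
         */
        if (syscap_set(SYSCAP_RESTRICTEDROOT, 1) < 0)
            err(1, "syscap_set");

        execvp(argv[1], argv + 1);
        err(1, "execvp");
    }

This is presumably the general shape of what the setcaps utility does
for arbitrary commands.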
tmpfs - Fix readdir() races

* Fix multi-threaded deletion races against readdir(). These races
  can cause a directory scan in one thread to return EINVAL if the
  file representing the chaining cookie is deleted by another thread.

  Fix the issue by allowing tmpfs_dir_lookupbycookie() to return the
  nearest directory entry with a cookie >= the requested cookie.

* Use a better cookie value for EOF.

* Allow the lookup of "." or ".." to fail without erroring out the
  readdir(); instead just iterate to the next entry. This can occur
  when a directory is deleted out from under a scan that has
  chdir()'d into it.

* tmpfs was previously rescanning all entries to locate a cookie,
  which is rather dumb given that we have a RB tree. Do a proper
  RB-tree search instead (see the sketch after this entry).

Reported-by: zrj, others
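The "nearest cookie >=" lookup maps naturally onto the BSD
<sys/tree.h> RB_NFIND() operation, which returns the least element
greater than or equal to the search key. A self-contained sketch,
with illustrative types rather than tmpfs's actual structures:

    #include <sys/tree.h>
    #include <stddef.h>

    struct dent {
        RB_ENTRY(dent) entry;
        unsigned long  cookie;      /* readdir chaining cookie */
    };

    static int
    dent_cmp(struct dent *a, struct dent *b)
    {
        return (a->cookie < b->cookie ? -1 : a->cookie > b->cookie);
    }

    RB_HEAD(dent_tree, dent);
    RB_GENERATE_STATIC(dent_tree, dent, entry, dent_cmp)

    /*
     * Return the entry with the requested cookie, or the nearest
     * entry with a greater cookie if the exact one was deleted by
     * another thread.  RB_NFIND does the ">= key" descent in
     * O(log n), replacing the old full rescan.
     */
    static struct dent *
    lookup_by_cookie(struct dent_tree *tree, unsigned long cookie)
    {
        struct dent key;

        key.cookie = cookie;
        return (RB_NFIND(dent_tree, tree, &key));
    }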
kernel - Major refactor of pageout daemon algorithms

* Rewrite a large chunk of the pageout daemon's algorithm to
  significantly improve page selection for pageout on low-memory
  systems.

* Implement persistent markers for hold and active queue scans.
  Instead of moving pages within the queues, we now implement a
  persistent marker and just move the marker instead. This ensures
  100% fair scanning of these queues.

* The pageout state machine is now governed by the following sysctls
  (with some example default settings from a 32G box containing
  8071042 pages):

  vm.v_free_reserved: 20216
  vm.v_free_min: 40419
  vm.v_paging_wait: 80838
  vm.v_paging_start: 121257
  vm.v_paging_target1: 161676
  vm.v_paging_target2: 202095

  And separately

  vm.v_inactive_target: 484161

  The arrangement is as follows:

  reserved < severe < minimum < wait < start < target1 < target2

* Paging is governed as follows (sketched after this entry): The
  pageout daemon is activated when FREE+CACHE falls below
  (v_paging_start). The daemon will free memory until FREE+CACHE
  reaches (v_paging_target1), and then continue to free memory more
  slowly until FREE+CACHE reaches (v_paging_target2).

  If, due to memory demand, FREE+CACHE falls below (v_paging_wait),
  most userland processes will begin short-stalls on VM allocations
  and page faults, and return to normal operation once FREE+CACHE
  goes above (v_paging_wait) (that is, as soon as possible).

  If, due to memory demand, FREE+CACHE falls below (v_paging_min),
  most userland processes will block on VM allocations and page
  faults until the level returns to above (v_paging_wait).

  The hysteresis between (wait) and (start) allows most processes to
  continue running normally during nominal paging activities.

* The pageout daemon operates in batches and then loops as necessary.
  Pages will be moved from CACHE to FREE as necessary, then from
  INACTIVE to CACHE as necessary, then from ACTIVE to INACTIVE as
  necessary. Care is taken to avoid completely exhausting any given
  queue to ensure that the queue scan is reasonably efficient.

* The ACTIVE to INACTIVE scan has been significantly reorganized and
  integrated with the page_stats scan (which updates m->act_count for
  pages in the ACTIVE queue). Pages in the ACTIVE queue are no longer
  moved within the lists. Instead a persistent roving marker is
  employed for each queue.

  The m->act_count test is made against a dynamically adjusted
  comparison variable called vm.pageout_stats_actcmp. When no
  progress is made this variable is increased, and when sufficient
  progress is made this variable is decreased. Thus, under very heavy
  memory loads, a more permissive m->act_count test allows active
  pages to be deactivated more quickly.

* The INACTIVE to FREE+CACHE scan remains relatively unchanged. A
  two-pass LRU arrangement continues to be employed in order to give
  the system time to reclaim a deactivated page before it would
  otherwise get paged out.

* The vm_pageout_page_stats() scan has been almost completely
  rewritten. This scan is responsible for updating m->act_count on
  pages in the ACTIVE queue. Example sysctl settings are shown below:

  vm.pageout_stats_rsecs: 300   <--- passive run time (seconds)
                                     after pageout
  vm.pageout_stats_scan: 472    <--- max number of pages to scan
                                     per tick
  vm.pageout_stats_ticks: 10    <--- poll rate in ticks
  vm.pageout_stats_inamin: 16   <--- inactive ratio governing dynamic
  vm.pageout_stats_inalim: 4096      adjustment of actcmp
  vm.pageout_stats_actcmp: 2    <--- dynamically adjusted by the
                                     kernel

  The page stats code polls slowly and will update m->act_count and
  deactivate pages until it is able to achieve (v_inactive_target)
  worth of pages in the inactive queue. Once this target has been
  reached, the poll stops deactivating pages, but will continue to
  run for (pageout_stats_rsecs) seconds after the pageout daemon
  last ran (typically 5 minutes) and continue to passively update
  m->act_count during this period. The polling resumes upon any
  pageout daemon activation and the cycle repeats.

* The vm_pageout_page_stats() scan is mostly responsible for
  selecting the correct pages to move from ACTIVE to INACTIVE.
  Choosing the correct pages allows the system to continue to operate
  smoothly while concurrent paging is in progress. The additional 5
  minutes of passive operation allows it to pre-stage m->act_count
  for pages in the ACTIVE queue to help grease the wheels for the
  next pageout daemon activation.

TESTING

* On a test box with memory limited to 2GB, running chrome. Video
  runs smoothly despite constant paging. Active tabs appear to
  operate smoothly. Inactive tabs are able to page-in decently fast
  and resume operation.

* On a workstation with 32GB of memory and a large number of open
  chrome tabs, allowed to sit overnight (chrome burns up a lot of
  memory when tabs remain open), then video tested the next day.
  Paging appeared to operate well and so far there has been no
  stuttering.

* On a 64GB build box running dsynth 32/32 (intentionally
  overloaded). The full bulk starts normally. The packages tend to
  get larger and larger as they are built. dsynth and the pageout
  daemon operate reasonably well in this situation. I was mostly
  looking for excessive stalls due to heavy memory loads and it
  looks like the new code handles it quite well.
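In outline, the thresholds interact roughly as in the following
sketch. Illustrative C only, not the kernel's actual code: the v_*
variables mirror the sysctls above, and the two helpers are
stand-ins, not kernel functions.

    static long v_free_min;         /* block allocations below this */
    static long v_paging_wait;      /* short-stall below this */
    static long v_paging_start;     /* wake the daemon below this */
    static long v_paging_target1;   /* free quickly up to this */
    static long v_paging_target2;   /* then free slowly up to this */

    static void short_stall(void) { /* brief pause, then retry */ }
    static void block_until_above(long level) { (void)level; }

    enum paging_state { PAGING_IDLE, PAGING_FAST, PAGING_SLOW };

    /* 'freecache' is the FREE+CACHE page count. */
    static enum paging_state
    pageout_next_state(long freecache, enum paging_state cur)
    {
        if (freecache < v_paging_start)
            return (PAGING_FAST);           /* daemon activates */
        if (cur == PAGING_FAST && freecache >= v_paging_target1)
            return (PAGING_SLOW);           /* keep freeing, slowly */
        if (cur == PAGING_SLOW && freecache >= v_paging_target2)
            return (PAGING_IDLE);           /* done until next dip */
        return (cur);
    }

    /*
     * Allocation-side behavior: short-stall between (wait) and
     * (min), hard-block below (min) until the level recovers to
     * above (wait).
     */
    static void
    alloc_throttle(long freecache)
    {
        if (freecache < v_free_min)
            block_until_above(v_paging_wait);
        else if (freecache < v_paging_wait)
            short_stall();
    }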
kernel - Refactor GETATTR_QUICK() -> GETATTR_LITE()

* Refactor GETATTR_QUICK() into GETATTR_LITE() and use struct
  vattr_lite instead of struct vattr. The original GETATTR_QUICK()
  just used a struct vattr. This change ensures that users of this
  new VOP do not attempt to access attr fields that are not
  populated (see the sketch below).

Suggested-by: mjg
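An illustrative sketch of the idea. The exact field list of struct
vattr_lite in the tree may differ; the set shown here is an
assumption based on the VOP's stated purpose (permissions and
ownership, per the VOP_GETATTR_QUICK entry later in this log):

    #include <sys/types.h>

    /*
     * Trimmed attribute structure carrying only what permission and
     * ownership checks need, so callers cannot mistakenly consume
     * unpopulated struct vattr fields.
     */
    struct vattr_lite {
        int             va_type;    /* vnode type (enum vtype) */
        mode_t          va_mode;    /* access mode bits */
        uid_t           va_uid;     /* owner uid */
        gid_t           va_gid;     /* owner gid */
        nlink_t         va_nlink;   /* link count */
        u_quad_t        va_size;    /* file size */
    };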
tmpfs - Fix bug in page-free bypass path

* tmpfs moves pages between two objects in order to allow data to be
  paged to swap. tmpfs makes an optimization to evict pages that are
  already backed by swap when reclaiming a vnode.

  Fix a bug in this code which could sometimes attempt to free a VM
  page that was still flagged PG_MAPPED, tripping an assertion and
  panicking. The page is not actually mapped at this point in time,
  but the meaning of the flag has changed somewhat and the flag can
  still be left set by the time the page gets to this bit of code
  (see the sketch below).
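One plausible shape of the fix, illustrative only and not the
committed diff (tmpfs_free_swapped_page() is a hypothetical name):

    #include <vm/vm_page.h>     /* PG_MAPPED, vm_page_flag_clear() */

    /*
     * A page can reach the bypass path with a stale PG_MAPPED flag
     * even though no mapping remains, so drop the stale flag rather
     * than letting the free path's assertion trip on it.
     */
    static void
    tmpfs_free_swapped_page(vm_page_t m)
    {
        if (m->flags & PG_MAPPED)
            vm_page_flag_clear(m, PG_MAPPED);
        vm_page_free(m);
    }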
tmpfs - Change paging behavior, fix two directory-entry races

* Change the paging behavior for vfs.tmpfs.bufcache_mode (decision
  logic sketched after this entry). These changes try to reduce
  unnecessary tmpfs flushes to swap when the pageout daemon is able
  to locate sufficient clean VM pages. The pageout daemon can still
  page tmpfs data to swap via its normal operation, but tmpfs itself
  will not force write()s to pipeline to swap unless memory pressure
  is severe.

  0   tmpfs write()s are pipelined to swap via the buffer cache only
      if the VM system is below the minimum free page count.
      (this is the new default)

  1   tmpfs write()s are pipelined to swap via the buffer cache when
      the VM system is paging.

  2   Same as (1) but be more aggressive about releasing buffer
      cache buffers.

  3   tmpfs write()s are always pipelined to swap via the buffer
      cache, regardless.

* Fix tmpfs file creation, hard-linking, and rename to ensure that
  the new file is not created in a deleted directory. We must lock
  the directory node around existing tests and add checks that were
  missing. Also remove a few unnecessary recursive locks.
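A minimal sketch of the mode decision described above. The predicate
names are stand-ins for whatever tests the kernel actually uses, and
tmpfs_pipeline_to_swap() is a hypothetical helper:

    /* Stand-in predicates; the kernel's actual tests differ. */
    static int vm_paging_severe(void);  /* below min free page count */
    static int vm_paging_active(void);  /* VM system is paging */

    /*
     * Should a tmpfs write() be pipelined to swap via the buffer
     * cache, given vfs.tmpfs.bufcache_mode?
     */
    static int
    tmpfs_pipeline_to_swap(int bufcache_mode)
    {
        switch (bufcache_mode) {
        case 0:                     /* new default: severe pressure */
            return (vm_paging_severe());
        case 1:                     /* when the VM system is paging */
        case 2:                     /* ditto, + aggressive release */
            return (vm_paging_active());
        case 3:                     /* always */
        default:
            return (1);
        }
    }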
tmpfs - Too aggressive during paging

* tmpfs was being a bit too aggressive during paging. We were trying
  to bypass all the way to PQ_CACHE as well as unconditionally age
  the buffers. This caused the pageout daemon to thrash on any
  underlying tmpfs pages that were being re-referenced quickly.

* Change the defaults to (1) not try to flush all the way to
  PQ_CACHE unless the system is in an extreme low-memory condition,
  and (2) only B_AGE the buffer when the system is paging, otherwise
  allowing it to cycle normally (sketched below).
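Illustrative flag selection matching the new defaults. B_AGE and
B_TTC are real buffer flags (B_TTC per the flush-and-recycle commit
later in this log); the predicates are the same stand-ins as in the
previous sketch:

    #include <sys/buf.h>        /* struct buf, B_AGE */

    static int vm_paging_severe(void);  /* stand-in predicate */
    static int vm_paging_active(void);  /* stand-in predicate */

    static void
    tmpfs_buf_paging_flags(struct buf *bp)
    {
        if (vm_paging_severe())
            bp->b_flags |= B_TTC;   /* bypass toward PQ_CACHE */
        if (vm_paging_active())
            bp->b_flags |= B_AGE;   /* age buffer for early reuse */
    }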
kernel - Normalize the vx_*() vnode interface

* The vx_*() vnode interface is used for initial allocations,
  reclaims, and terminations. Normalize all use cases to prevent the
  mixing together of the vx_*() API and the vn_*() API. For example,
  vx_lock() should not be paired with vn_unlock(), and so forth.

* Integrate an update-counter mechanism into the vx_*() API, assert
  reasonability.

* Change vfs_cache.c to use an int update counter instead of a long.
  The vfs_cache code can't quite use the spin-lock update counter
  API yet. Use proper atomics for load and store.

* Implement VOP_GETATTR_QUICK, meant to be a 'quick' version of
  VOP_GETATTR() that only retrieves information related to
  permissions and ownership. This will be fast-pathed in a later
  commit.

* Implement vx_downgrade() to convert an exclusive vx_lock into an
  exclusive vn_lock (for vnodes). Adjust all use cases in the
  getnewvnode() path (see the sketch after this entry).

* Remove unnecessary locks in tmpfs_getattr() and don't use any in
  tmpfs_getattr_quick().

* Remove unnecessary locks in hammer2_vop_getattr() and don't use
  any in hammer2_vop_getattr_quick().
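A sketch of the pairing discipline, illustrative only (the
getnewvnode() call is schematic and example_alloc() is a hypothetical
wrapper). The point is the lock-family pairing: vx_* with vx_*,
vn_* with vn_*, and vx_downgrade() as the one-way bridge from the
allocation path into normal use.

    static struct vnode *
    example_alloc(void)
    {
        struct vnode *vp;

        vp = getnewvnode(/* tag, mount, ops, ... */);
        /* vp comes back vx_lock()'ed */

        /* ...filesystem-specific initialization under the vx lock... */

        vx_downgrade(vp);   /* exclusive vx_lock -> exclusive vn_lock */
        return (vp);        /* caller eventually vn_unlock()s; never
                               pair vx_lock() with vn_unlock() */
    }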
tmpfs - Fix races in tmpfs_nrename() and tmpfs_nrmdir()

* Lock all nrename elements before checks. This is particularly
  important when renaming over a file or empty directory, but other
  manipulations done by this code without locks could also cause
  races which result in corruption, particularly with the link
  count.

* Lock all nrmdir elements before checks, for the same reason.
tmpfs - Cleanup, refactor tmpfs_alloc_vp()

* Refactor tmpfs_alloc_vp() to handle races without needing a weird
  intermediate TMPFS_VNODE_ALLOCATING state. This also removes the
  related ALLOCATING/WAIT code, which had a totally broken tsleep()
  call in it.

* Properly zero fields in tmpfs_alloc_node().

* Cleanup some comments.
kernel - Improve tmpfs support

* When a file in tmpfs is truncated to a size that is not on a block
  boundary, or extended (but not written) to a size that is not on a
  block boundary, the nvextendbuf() and nvtruncbuf() functions must
  modify the contents of the straddling buffer and bdwrite().
  However, a bdwrite() for a tmpfs buffer will result in a dirty
  buffer cache buffer and likely force it to be cycled out to swap
  relatively soon under a modest load. This is not desirable if
  there is no memory pressure present to force it out.

  tmpfs almost always uses buwrite() in order to leave the buffer
  'clean' (the underlying VM pages are dirtied instead), to prevent
  unnecessary paging of tmpfs data to swap when the buffer gets
  recycled or the vnode cycles out.

* Add support for calling buwrite() in these functions by changing
  the 'trivial' boolean into a flags variable (sketched below).

* tmpfs now passes the appropriate flag, preventing the undesirable
  behavior.
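Schematically, the API change looks like the following. Only the
boolean-to-flags shape is taken from the commit; the flag names and
the argument list shown are assumptions for illustration:

    /* Before: a plain boolean selecting 'trivial' behavior. */
    int nvtruncbuf(struct vnode *vp, off_t length,
                   int blksize, int boff, int trivial);

    /*
     * After (illustrative): a flags argument, letting callers such
     * as tmpfs request buwrite() for the straddling buffer instead
     * of bdwrite().  Flag names here are hypothetical.
     */
    #define NVTRUNC_TRIVIAL  0x0001     /* hypothetical */
    #define NVTRUNC_BUWRITE  0x0002     /* hypothetical */

    int nvtruncbuf(struct vnode *vp, off_t length,
                   int blksize, int boff, int flags);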
tmpfs - Improve write clustering

* Setup bmap and max iosize parameters so the kernel's clustering
  code can actually cluster 16KB tmpfs blocks together into 64KB
  blocks.

* In low-memory situations the pageout daemon will flush tmpfs pages
  via the VM page queues. This ultimately runs through the
  tmpfs_vop_write() UIO_NOCOPY path, which was previously using
  cluster_awrite(). However, because other nearby buffers are
  probably not present (buwrite()s can allow buffers to be dismissed
  early), there is nothing for cluster_awrite() to latch onto to
  improve write granularity beyond 16KB.

  Go back to using cluster_write() when SYNC and DIRECT are not
  specified (see the sketch after this entry). This allows the
  clustering code to collect buffers and flush them in larger
  chunks.

* Reduces low-memory tmpfs paging I/O overheads by 4x and generally
  increases paging throughput to SSD-based swap by 2x-4x. tmpfs is
  now able to issue a lot more 64KB I/Os when under memory pressure.
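An illustrative sketch of the write-path choice (argument lists are
schematic; only the SYNC/DIRECT/clustered split is from the commit):

    /*
     * Pick the flush strategy for a completed tmpfs buffer.
     * IO_SYNC/IO_DIRECT are the standard ioflags.
     */
    if (ioflag & IO_SYNC) {
        bwrite(bp);             /* synchronous write */
    } else if (ioflag & IO_DIRECT) {
        bawrite(bp);            /* async, no clustering */
    } else {
        /* let the clustering code gather 16KB buffers into 64KB I/Os */
        cluster_write(bp, node_size, blksize, seqcount);
    }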
tmpfs - Flush and recycle pages quickly during heavy paging activity

* When the pageout daemon is operating, any write()s made via tmpfs
  will be forced to operate through the buffer cache via
  cluster_write() or bdwrite() instead of using buwrite(). This will
  cause the pages to be pipelined to backing store (swap) under
  these conditions, making them clean immediately to avoid having
  tmpfs cause further paging pressure on the system when it is
  already under paging pressure.

* In addition, the B_TTC flag is set on these buffers to attempt to
  recycle the pages directly into PQ_CACHE ASAP after they are
  flushed.

* Use cluster_write() by default to try to improve block sizes for
  physical I/O.

* tmpfs currently must move pages between two VM objects when
  reclaiming a vnode, and back again upon re-use. The current VM
  mechanism for renaming VM pages dirties them, and this can
  potentially cause the paging system to thrash on the same page
  under heavy vnode recycling loads.

  Instead of allowing this to happen, tmpfs now frees any clean
  pages that have backing store assigned when moving from the
  backing object, and any clean pages that were instantiated from
  backing store when moving to the backing object (sketched below).
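An illustrative shape of the reclaim-time page move, not the real
tmpfs code (the two helpers are stand-ins): clean pages that already
have swap backing are freed rather than renamed, so the rename path
cannot re-dirty them and feed the thrash cycle.

    #include <vm/vm_page.h>

    static int  page_has_swap_backing(vm_page_t m);     /* stand-in */
    static void rename_to_backing_object(vm_page_t m);  /* stand-in */

    static void
    move_or_free_page(vm_page_t m)
    {
        if (m->dirty == 0 && page_has_swap_backing(m)) {
            vm_page_free(m);    /* can be re-paged-in from swap */
        } else {
            rename_to_backing_object(m);
        }
    }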
tmpfs - Cycle through buffer cache when pageout daemon is running

* tmpfs usually allocates VM pages directly, but this can overwhelm
  the VM system in low-memory situations, causing processes to make
  very little progress in normal run-time operation if one or more
  of them are doing heavy writing to tmpfs.

* Significantly improves dsynth performance in low-memory situations
  where multiple worker slots are in the install-pkgs phase.
tmpfs - Make tmpfs_move_pages() more robust

* Make tmpfs_move_pages() more robust by re-checking the page queue
  in the scan loop and waiting for any paging-in-progress to
  complete.

* Possibly fixes or improves a race that could cause corruption in
  tmpfs files.

Reported-by: zrj
kernel: Cleanup <sys/uio.h> issues.

The iovec_free() inline greatly complicates inclusion of this
header. The NULL check is not always seen from <sys/_null.h>.
Luckily only three kernel sources need it: kern_subr.c,
sys_generic.c and uipc_syscalls.c. Also, just a single dev/drm
source makes use of 'struct uio'.

* Include <sys/uio.h> explicitly first in drm_fops.c to avoid the
  kfree() macro override in the drm compat layer.

* Use <sys/_uio.h> where only the enums and struct uio are needed,
  but ensure that userland will not include it for possible later
  <sys/user.h> use.

* Stop using <sys/vnode.h> as a shortcut for the uiomove*()
  prototypes. The uiomove*() family of functions potentially
  transfers data across the kernel/user space boundary, so the
  presence of this header explicitly marks sources as such.

* Prefer to add <sys/uio.h> after <sys/systm.h>, but before
  <sys/proc.h> and definitely before <sys/malloc.h> (except for the
  3 sources mentioned above; example ordering below). This will
  allow <sys/malloc.h> to be removed from <sys/uio.h> later on.

* Adjust <sys/user.h> to use component headers instead of
  <sys/uio.h>. While there, take the opportunity for a minimal
  whitespace cleanup.

No functional differences observed in compiler intermediates.
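The preferred include ordering from the second-to-last item, as a
schematic example (the surrounding headers are merely
representative of a typical kernel source):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/uio.h>        /* after systm.h ... */
    #include <sys/proc.h>       /* ... before proc.h ... */
    #include <sys/malloc.h>     /* ... and before malloc.h */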