Rename some functions to better names.

devfs_find_device_by_udev() -> devfs_find_device_by_devid()
dev2udev() -> devid_from_dev()
udev2dev() -> dev_from_devid()

This fits with the rest of the code. 'dev' usually means a cdev_t, such as in make_dev(), etc. Instead of 'udev', use 'devid', since that's what dev_t is, a "Device ID".

kernel: Cleanup <sys/uio.h> issues.

The iovec_free() inline greatly complicates inclusion of this header. Its NULL check does not always see NULL from <sys/_null.h>. Luckily only three kernel sources need it: kern_subr.c, sys_generic.c and uipc_syscalls.c. Also, just a single dev/drm source makes use of 'struct uio'.

* Include <sys/uio.h> explicitly first in drm_fops.c to avoid the kfree() macro override in the drm compat layer.
* Use <sys/_uio.h> where only the enums and struct uio are needed, but ensure that userland will not include it, for possible later <sys/user.h> use.
* Stop using <sys/vnode.h> as a shortcut for the uiomove*() prototypes. The uiomove*() family of functions may transfer data across the kernel/user space boundary, and the presence of <sys/uio.h> explicitly marks sources as doing so.
* Prefer to add <sys/uio.h> after <sys/systm.h>, but before <sys/proc.h> and definitely before <sys/malloc.h> (except for the 3 sources mentioned above). This will allow removing <sys/malloc.h> from <sys/uio.h> later on.
* Adjust <sys/user.h> to use component headers instead of <sys/uio.h>.

While there, take the opportunity for a minimal whitespace cleanup. No functional differences observed in compiler intermediates.

kernel - Performance tuning (3)

* The VOP_CLOSE issues revealed a bigger issue with vn_lock(). Many callers do not check the return code from vn_lock(). In nearly all of those cases it wouldn't fail anyway due to a prior ref, but it creates an API issue.
* Add the LK_FAILRECLAIM flag to vn_lock(). This flag explicitly allows vn_lock() to fail if the vnode is undergoing reclamation. This fixes numerous issues, particularly when VOP_CLOSE() is called during a reclaim, due to the recent LK_UPGRADEs that we do in some VFS *_close() functions.
* Remove some unused LK_ defines.

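The API issue with unchecked vn_lock() return codes can be illustrated with a small userland model. This is only a sketch: `toy_vnode`, `toy_vn_lock()`, `toy_close_path()`, the `TOY_LK_*` values and the choice of ENOENT as the error are all hypothetical stand-ins, not the kernel's definitions. The model also assumes that a request without the fail-on-reclaim flag is always granted.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values for this toy model only. */
#define TOY_LK_EXCLUSIVE   0x01
#define TOY_LK_FAILRECLAIM 0x02

struct toy_vnode {
    bool reclaiming;   /* vnode is undergoing reclamation */
    bool locked;
};

/*
 * Toy lock routine: fails (with ENOENT, an arbitrary choice) only when the
 * caller passed TOY_LK_FAILRECLAIM and the vnode is being reclaimed.
 * Without the flag the lock is always granted (assumption of this model).
 */
int toy_vn_lock(struct toy_vnode *vp, int flags)
{
    assert(vp != NULL);
    if ((flags & TOY_LK_FAILRECLAIM) && vp->reclaiming)
        return ENOENT;
    vp->locked = true;
    return 0;
}

/* A caller that opts in to failure must check the return value. */
int toy_close_path(struct toy_vnode *vp)
{
    int error = toy_vn_lock(vp, TOY_LK_EXCLUSIVE | TOY_LK_FAILRECLAIM);

    if (error)
        return error;      /* vnode is going away; skip the locked work */
    /* ... work that needs the lock would go here ... */
    vp->locked = false;
    return 0;
}
```

The design point is that failure becomes opt-in: only callers that pass the flag have to handle an error, so existing callers that ignore the return value keep their old behavior.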
kernel - Rewrite vnode ref-counting code to improve performance

* Rewrite the vnode ref-counting code and modify operation to not immediately VOP_INACTIVE a vnode when its ref count drops to 0. By doing so we avoid cycling vnodes through exclusive locks when temporarily accessing them (such as in a path lookup). Shared locks can be used throughout.
* Track active/inactive vnodes a bit differently: keep track of the number of vnodes that are still active but have zero refs, and rewrite the vnode freeing code to use the new statistics to deactivate cached vnodes.

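The deferred-inactivation idea can be sketched in userland as follows. All names here (`toy_vnode`, `toy_vref()`, `toy_vrele()`, `toy_try_deactivate()`, `toy_active_zero_refs`) are invented for illustration and say nothing about the real kernel structures or locking. The sketch only shows that dropping the last ref leaves the vnode active and bumps a counter, and that a later pass does the deactivation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_vnode {
    int  refs;
    bool active;     /* not yet deactivated */
};

/* Number of vnodes that are still active but have zero refs. */
int toy_active_zero_refs;

void toy_vnode_init(struct toy_vnode *vp)
{
    assert(vp != NULL);
    vp->refs = 1;
    vp->active = true;
}

void toy_vref(struct toy_vnode *vp)
{
    assert(vp != NULL && vp->active);
    if (vp->refs == 0)
        toy_active_zero_refs--;    /* leaves the zero-ref active set */
    vp->refs++;
}

void toy_vrele(struct toy_vnode *vp)
{
    assert(vp != NULL && vp->refs > 0);
    if (--vp->refs == 0)
        toy_active_zero_refs++;    /* stays active; nothing deactivated here */
}

/* Called later by a toy "freeing" pass. Returns true if it deactivated vp. */
bool toy_try_deactivate(struct toy_vnode *vp)
{
    assert(vp != NULL);
    if (!vp->active || vp->refs != 0)
        return false;
    vp->active = false;
    toy_active_zero_refs--;
    return true;
}
```

A short-lived access such as a path lookup maps to a `toy_vref()`/`toy_vrele()` pair, which in this model never touches the active state at all.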
kernel - Adjust UFS and HAMMER to use uiomovebp()

* Add uiomovebp(), a version of uiomove() which is aware of a locked bp representing the to or from buffer and can work around issues related to VM faults causing recursions and deadlocks on the user buffer. uiomovebp() does not yet detect or handle deadlocks. Implementing deadlock handling will require a certain degree of finesse related to the vnode and bp locks, and we don't want to have to do it unless we actually deadlock. TODO.
* Adjust UFS, HAMMER, TMPFS, MSDOSFS, NFS, NTFS to use uiomovebp().

kernel - Greatly improve shared memory fault rate concurrency / shared tokens

This commit rolls up a lot of work to improve postgres database operations and the system in general. With these changes we can run pgbench -j 8 -c 40 on our 48-core opteron monster at 140000+ tps, and the shm vm_fault rate hits 3.1M pps.

* Implement shared tokens. They work as advertised, with some caveats. It is acceptable to acquire a shared token while you already hold the same token exclusively, but you will deadlock if you acquire an exclusive token while you hold the same token shared. Currently exclusive tokens are not given priority over shared tokens, so starvation is possible under certain circumstances.
* Create a critical code path in vm_fault() using the new shared token feature to quickly fault-in pages which already exist in the VM cache. pmap_object_init_pt() also uses the new feature. This increases fault-in concurrency by a ridiculously huge amount, particularly on SHM segments (say when you have a large number of postgres clients). Scaling for large numbers of clients on large numbers of cores is significantly improved. This also increases fault-in concurrency for MAP_SHARED file maps.
* Expand the breadn() and cluster_read() APIs. Implement breadnx() and cluster_readx(), which allow a getblk()'d bp to be passed. If *bpp is not NULL a bp is being passed in, otherwise the routines call getblk().
* Modify the HAMMER read path to use the new API. Instead of calling getcacheblk(), HAMMER now calls getblk() and checks the B_CACHE flag. This gives getblk() a chance to regenerate a fully cached buffer from VM backing store without having to acquire any hammer-related locks, resulting in even faster operation.
* If kern.ipc.shm_use_phys is set to 2 the VM pages will be pre-allocated. This can take quite a while for a large map and also lock the machine up for a few seconds. Defaults to off.
* Reorder the smp_invltlb()/cpu_invltlb() combos in a few places, running cpu_invltlb() last.
* An invalidation interlock might be needed in pmap_enter() under certain circumstances; enable the code for now.
* vm_object_backing_scan_callback() was failing to properly check the validity of a vm_object after acquiring its token. Add the required check + some debugging.
* Make vm_object_set_writeable_dirty() a bit more cache friendly.
* The vmstats sysctl was scanning every process's vm_map (requiring a vm_map read lock to do so), which can stall for long periods of time when the system is paging heavily. Change the mechanism to an LWP flag which can be tested with minimal locking.
* Have the phys_pager mark the page as dirty too, to make sure nothing tries to free it.
* Remove the spinlock in pmap_prefault_ok(). Since we do not delete page table pages, it shouldn't be needed.
* Add a required cpu_ccfence() in pmap_inval.c. The code generated prior to this fix was still correct, and this makes sure it stays that way.
* Replace several manual wiring cases with calls to vm_page_wire().

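The shared-token caveats from the first bullet of this commit can be written down as a tiny decision function. This is a sketch, not the token implementation: `toy_hold` and `toy_token_request_ok()` are hypothetical names, and the function only encodes the two combinations the commit message actually describes. Same-mode re-acquisition is not covered by the text, so the model reports it as unspecified rather than guessing.

```c
#include <assert.h>

/* How the current thread already holds (or wants) a given token. */
enum toy_hold { TOY_NONE, TOY_SHARED, TOY_EXCLUSIVE };

/*
 * Returns  1 if the request is fine according to the commit message,
 *          0 if it would self-deadlock,
 *         -1 if the commit message does not cover the combination.
 */
int toy_token_request_ok(enum toy_hold held, enum toy_hold want)
{
    assert(want != TOY_NONE);
    if (held == TOY_NONE)
        return 1;                  /* first acquisition of this token */
    if (held == TOY_EXCLUSIVE && want == TOY_SHARED)
        return 1;                  /* shared while holding exclusive: OK */
    if (held == TOY_SHARED && want == TOY_EXCLUSIVE)
        return 0;                  /* exclusive while holding shared: deadlock */
    return -1;                     /* same-mode case, not described */
}
```

In practice this means that code which may need the token exclusively has to take it exclusively up front rather than starting shared and asking for more later.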
AMD64 - Refactor uio_resid and size_t assumptions.

* uio_resid changed from int to size_t (size_t == unsigned long equivalent).
* size_t assumptions in most kernel code have been refactored to operate in a 64 bit environment.
* In addition, the 2G limitation for VM related system calls such as mmap() has been removed in 32 bit environments. Note however that because read() and write() return ssize_t, these functions are still limited to a 2G byte count in 32 bit environments.

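The reason for the type change, and for the remaining read()/write() limit, comes down to which byte counts each type can represent. The sketch below is plain userland C with hypothetical helper names (`fits_in_int()`, `fits_in_ssize()`, `toy_resid_as_int()`). It assumes a POSIX system that provides SSIZE_MAX in <limits.h>.

```c
#define _POSIX_C_SOURCE 200809L   /* for SSIZE_MAX in <limits.h> */
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* True if a byte count can be stored in an int, the old uio_resid type. */
int fits_in_int(size_t n)
{
    return n <= (size_t)INT_MAX;
}

/* True if a byte count can be returned through ssize_t, as read()/write() do. */
int fits_in_ssize(size_t n)
{
    return n <= (size_t)SSIZE_MAX;
}

/* What code built around an int residual had to guarantee before storing. */
int toy_resid_as_int(size_t n)
{
    assert(fits_in_int(n));
    return (int)n;
}
```

A 3 GiB request, for example, fails `fits_in_int()` on common platforms where int is 32 bits. Where ssize_t is also 32 bits, `fits_in_ssize()` fails for it too, which is the 2G limit the note describes.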
HAMMER / VFS_VGET - Add optional dvp argument to VFS_VGET(). Fix readdirplus

* VGET is used by NFS to acquire a vnode given an inode number. HAMMER requires additional information to determine the PFS the inode is being acquired from. Add an optional directory vnode argument to the VGET. If non-NULL, HAMMER will extract the PFS information from this vnode.
* Adjust NFS to pass the dvp to VGET when doing a readdirplus. Note that the PFS is already encoded in file handles, but readdirplus acquires the attributes for each directory entry it scans (readdir does not). This fixes readdirplus for NFS-served HAMMER PFS exports.

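The optional-argument convention can be sketched as below. Nothing here is the real VFS_VGET() interface: `toy_mount`, `toy_vnode` and `toy_vget_pfs()` are hypothetical. The fallback to a per-mount default PFS when no dvp is given is an assumption of this sketch, since the commit message only says what happens in the non-NULL case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_vnode {
    uint32_t pfs_id;          /* PFS this vnode belongs to */
    uint64_t ino;
};

struct toy_mount {
    uint32_t default_pfs_id;  /* used when no dvp is supplied (assumption) */
};

/*
 * Decide which PFS an inode-number lookup should use. A non-NULL directory
 * vnode supplies the PFS; otherwise fall back to the mount's default.
 */
uint32_t toy_vget_pfs(const struct toy_mount *mp, const struct toy_vnode *dvp)
{
    assert(mp != NULL);
    return dvp != NULL ? dvp->pfs_id : mp->default_pfs_id;
}
```

A readdirplus-style caller would pass the directory being scanned as `dvp`, so every entry's inode is looked up in the same PFS as the directory.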
Add kernel-layer support for chflags checks, remove (most) from the VFS layer.

Give nlookup() and nlookup_va() the tools to do nearly all chflags-related activities. Here are the rules:

Immutable (uchg, schg)

    If set on a directory no files associated with the directory may be created, deleted, linked, or renamed. In addition, any files open()ed via the directory will be immutable whether they are flagged that way or not.

    If set on a file or directory the file or directory may not be written to, chmodded, chowned, chgrped, or renamed. The file can still be hardlinked and the file/directory can still be chflagged.

    If you do not wish the file to be linkable then set the immutable bit on all directories containing a link of the file. Once you form this closure no further links will be possible.

    NOTE ON REASONING: Security scripts should check link counts anyway; depending on a file flag which can be changed as a replacement for checking the link count is stupid. If you are secure then your closures will hold. If you aren't then nothing will save you.

    This feature is not recursive. If the directory contains subdirectories they must be flagged immutable as well.

Undeletable (uunlnk, sunlnk)

    If set on a file or directory that file or directory cannot be removed or renamed. The file can still otherwise be manipulated, linked, and so forth. However, it should be noted that any hardlinks you create will also not be deletable :-)

    If set on a directory this flag has no effect on the contents of the directory (yet). See APPEND-ONLY on directories for what you want.

Append-only (uappnd, sappnd)

    If set on a directory no file within the directory may be deleted or renamed. However, new files may be created in the directory and the files in the directory can be modified or hardlinked without restriction.

    If set on a file the file cannot be truncated, random-written, or deleted. It CAN be chmodded, chowned, renamed, and appended to with O_APPEND etc.

    If you do not wish the file to be renameable then you must also set the Undeletable flag. Setting the append-only flag will ensure that the file doesn't disappear from the filesystem, but does not prevent it from being moved about the filesystem.

Security fix - futimes()

    futimes() could be called on any open descriptor. Restrict it to just those files you own or have write permission on.

Security fix - Hardlinks

    Users can no longer hardlink foreign-owned files which they do not have write access to. The user must now have write permission on the file being hardlinked, or the user must own the file, or be root.

Security fix - fcntl()

    fcntl() can no longer be used to turn off O_APPEND mode if the file was flagged append-only.

NOTE - DIFFERENCES WITH FREEBSD

    * Append-only on directories
    * Immutable on directories to control set-in-stone & hardlinking
    * Immutable files can be hardlinked on DragonFly, not on FreeBSD.
    * User must be the owner of the file or have write access to the file being hardlinked.
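The per-file rules for the Append-only and Undeletable flags, and the new hardlink permission rule, can be expressed as a small table-driven check. This is only a sketch: the `TOY_*` and `OP_*` bits and the function names are hypothetical and do not correspond to the real flag constants or to the nlookup() code. Immutable is deliberately left out, because the rules above do not say how every operation in this list interacts with it.

```c
#include <assert.h>

/* Hypothetical file-flag bits for this sketch only. */
enum { TOY_APPEND = 1 << 0, TOY_NOUNLINK = 1 << 1 };

/* Hypothetical operation bits. */
enum {
    OP_TRUNCATE     = 1 << 0,
    OP_RANDOM_WRITE = 1 << 1,
    OP_DELETE       = 1 << 2,
    OP_RENAME       = 1 << 3,
    OP_CHMOD        = 1 << 4,
    OP_CHOWN        = 1 << 5,
    OP_APPEND_WRITE = 1 << 6,
    OP_HARDLINK     = 1 << 7
};

/* Operations denied on a file carrying the given flags, per the rules above. */
static unsigned toy_denied_ops(unsigned flags)
{
    unsigned denied = 0;

    if (flags & TOY_APPEND)
        denied |= OP_TRUNCATE | OP_RANDOM_WRITE | OP_DELETE;
    if (flags & TOY_NOUNLINK)
        denied |= OP_DELETE | OP_RENAME;
    return denied;
}

/* Returns 1 if the single operation `op` is allowed, 0 if it is denied. */
int toy_file_op_allowed(unsigned flags, unsigned op)
{
    assert(op != 0 && (op & (op - 1)) == 0);   /* exactly one op bit set */
    return (toy_denied_ops(flags) & op) == 0;
}

/* Hardlink rule: root, the owner, or anyone with write permission. */
int toy_may_hardlink(int is_root, int is_owner, int has_write)
{
    return is_root || is_owner || has_write;
}
```

With these tables, `toy_file_op_allowed(TOY_APPEND, OP_RENAME)` is allowed while `toy_file_op_allowed(TOY_APPEND | TOY_NOUNLINK, OP_RENAME)` is denied, which mirrors the advice to combine append-only with undeletable when a file must not be renamed.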