kernel - Fix improper error on certain O_EXCL open() operations * O_EXCL|O_CREAT open()s were converting EACCES to EEXIST without determining whether the error was due to an intermediate directory component. In that case the open() needs to return the error caused by the intermediate directory component. EACCES is only converted to EEXIST by an O_EXCL|O_CREAT open() when the error is caused by the last component of the path, because in that case the last component does in fact exist and it is not relevant whether it is accessible or not. * Fix by specifying whether the error came from an intermediate directory check using a previously unused field in struct nlookupdata. A bit messy, but this was the easiest way since we've run out of NLC flag bits. Reported-by: tuxillo
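The conversion rule described in this commit can be sketched as a small pure function. This is only an illustration: the helper name `excl_create_error` and its `from_intermediate` parameter are hypothetical, and the real fix records this information in a field of struct nlookupdata that is not shown here.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper, not the kernel's actual code: decide which errno
 * an O_CREAT|O_EXCL open() should report after a failed path lookup.
 *
 * error             - errno produced by the lookup
 * from_intermediate - true if the error came from an intermediate
 *                     directory component, false if from the last one
 */
static int
excl_create_error(int error, bool from_intermediate)
{
	assert(error >= 0);

	/* Errors from intermediate directories are reported unchanged. */
	if (from_intermediate)
		return error;

	/* The last component exists; its accessibility is irrelevant. */
	if (error == EACCES)
		return EEXIST;

	return error;
}
```

Only the combination of EACCES and a last-component failure is rewritten; every other error passes through untouched.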
kernel - Add per-process capability-based restrictions * This new system allows userland to set capability restrictions which turn off numerous kernel features and root accesses. These restrictions are inherited by sub-processes recursively. Once set, restrictions cannot be removed. Basic restrictions that mimic an unadorned jail can be enabled without creating a jail, but generally speaking real security also requires creating a chrooted filesystem topology, and a jail is still needed to really segregate processes from each other. If you do so, however, you can (for example) disable mount/umount and most global root-only features. * Add new system calls and a manual page for syscap_get(2) and syscap_set(2). * Add sys/caps.h. * Add the "setcaps" userland utility and manual page. * Remove priv.9 and the priv_check infrastructure, replacing it with a newly designed caps infrastructure. * The intention is to add path restriction lists and similar features to improve jail-less security in the near future, and to optimize the priv_check code.
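The two properties stated above (restrictions are inherited by sub-processes and can never be removed) can be modelled with a bitmask that only ever gains bits. This is a toy model under stated assumptions: the `demo_*` names and the two restriction bits are hypothetical and do not reflect the contents of sys/caps.h or the syscap_set(2) interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical restriction bits; the real names live in sys/caps.h. */
#define DEMO_RESTRICT_MOUNT	0x1u
#define DEMO_RESTRICT_ROOT	0x2u

struct demo_proc {
	uint32_t restrictions;	/* bitmask of active restrictions */
};

/* Restrictions can only be added; there is no call that clears them. */
static void
demo_restrict(struct demo_proc *p, uint32_t bits)
{
	assert(p != NULL);
	p->restrictions |= bits;
}

/* A child starts with every restriction its parent currently has. */
static struct demo_proc
demo_fork(const struct demo_proc *parent)
{
	struct demo_proc child = { parent->restrictions };

	return child;
}

static int
demo_is_restricted(const struct demo_proc *p, uint32_t bit)
{
	return (p->restrictions & bit) != 0;
}
```

Because the only mutator ORs bits in, a restricted process and all of its descendants stay restricted, while a child that adds further restrictions does not affect its parent.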
kernel - check nc_generation in nlookup path * With nc_generation now operating in a more usable manner, we can use it in nlookup() to check for changes. When a change is detected, the related lock will be cycled and the entire nlookup() will retry up to debug.nlookup_max_retries, which currently defaults to 4. * Add debugging via debug.nlookup_debug. Set to 3 for nc_generation debugging. * Move "Parent directory lost" kprintfs into a debugging conditional, reported via (debug.nlookup_debug & 4). * This fixes lookup/remove races which could sometimes cause open() and other system calls to return EINVAL or ENOTCONN. Basically what happened was that nlookup() wound up on a NCF_DESTROYED entry. * A few minutes worth of a dsynth bulk does not report any random generation number mismatches or retries, so the code in this commit is probably very close to correct.
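The generation-check-and-retry pattern described in this commit can be sketched in userland as follows. This is a simplified single-threaded model, not the nlookup() code: the `demo_*` names are hypothetical, and the `pending_changes` field merely simulates modifications made by another thread between the two generation reads.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the stated default of debug.nlookup_max_retries. */
#define DEMO_MAX_RETRIES	4

struct demo_entry {
	unsigned int generation;	/* bumped whenever the entry changes */
	int value;
	int pending_changes;		/* simulated concurrent changes left */
};

/* Stand-in for a concurrent modifier racing with the lookup. */
static void
demo_concurrent_change(struct demo_entry *e)
{
	if (e->pending_changes > 0) {
		e->pending_changes--;
		e->value++;
		e->generation++;
	}
}

/*
 * Read e->value, retrying when the generation changed underneath us.
 * Returns 0 on success, -1 once the retries are exhausted.
 */
static int
demo_lookup(struct demo_entry *e, int *out, int *retries_used)
{
	int retries;

	assert(e != NULL && out != NULL && retries_used != NULL);
	for (retries = 0; retries <= DEMO_MAX_RETRIES; ++retries) {
		unsigned int gen = e->generation;
		int v = e->value;

		demo_concurrent_change(e);	/* the race window */
		if (gen == e->generation) {
			*out = v;
			*retries_used = retries;
			return 0;
		}
	}
	*retries_used = DEMO_MAX_RETRIES;
	return -1;
}
```

An entry that changes twice during the lookup succeeds on the third attempt; one that never stops changing fails after the initial attempt plus four retries.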
kernel - Make sure nl_dvp is non-NULL in a few situations * When NLC_REFDVP is set, nl_dvp should be returned non-NULL when the nlookup succeeds. However, there is one case where nlookup() can succeed but nl_dvp can be NULL, and this is when the nlookup() represents a mount-point. * Fix three instances where this case was not being checked and could lead to a NULL pointer dereference / kernel panic. * Do the full resolve treatment for cache_resolve_dvp(). In null-mount situations where we have A/B and we null-mount B onto C, path resolutions of C via the null mount will resolve B but not resolve A. This breaks an assumption that nlookup() and cache_dvpref() make about the parent ncp having a valid vnode. In fact, the parent ncp of B (which is A) might not, because the resolve path for B may have bypassed it due to the presence of the null mount. * Should fix occasional 'mkdir /var/cache' calls that fail with EINVAL instead of EEXIST. Reported-by: zach
kernel - Fix /dev/fd/N and clean up the old dup error-code-driven path * When opening /dev/fd/N, replicate the file pointer for descriptors that represent vnodes instead of dup()ing. This ensures that the seek offset and other fp-related elements are not shared unexpectedly. * Refactor the open() path to allow dev_dopen() to replace the struct file by passing a struct file ** instead of a struct file *. This removes old error-code-based hacks. * This fixes the shared seek position that fexecve() was operating with due to its use of /dev/fd/N for scripts. Reported-by: aly
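The difference between sharing and replicating a file pointer is visible from userland with plain POSIX calls: descriptors created by dup() share one seek offset, while a separately opened descriptor has its own. The sketch below demonstrates only that general behaviour; it does not exercise /dev/fd/N or the kernel change itself, and the helper name `demo_offset_shared` and the temporary path template are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Returns 1 if seeking on a second descriptor moves the offset seen
 * through the first one, 0 if the offsets are independent, and -1 on
 * a setup failure.
 *
 * use_dup != 0: the second descriptor comes from dup().
 * use_dup == 0: the second descriptor comes from a fresh open().
 */
static int
demo_offset_shared(int use_dup)
{
	char path[] = "/tmp/fdoffsetXXXXXX";
	int fd1, fd2, result = -1;

	fd1 = mkstemp(path);
	if (fd1 < 0)
		return -1;

	if (write(fd1, "0123456789", 10) == 10 &&
	    lseek(fd1, 0, SEEK_SET) == 0) {
		fd2 = use_dup ? dup(fd1) : open(path, O_RDONLY);
		if (fd2 >= 0) {
			/* Move the second descriptor, then inspect the first. */
			if (lseek(fd2, 5, SEEK_SET) == 5)
				result = (lseek(fd1, 0, SEEK_CUR) == 5);
			close(fd2);
		}
	}
	close(fd1);
	unlink(path);
	return result;
}
```

Calling `demo_offset_shared(1)` exercises the dup() case and `demo_offset_shared(0)` the independent-open case.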
kernel: improve open(2) error handling When trying to open a file with the O_CREAT and O_EXCL flags while the file exists, disregard the file permissions and always return EEXIST, as described in the manpage and required by the standard. Issue: https://bugs.dragonflybsd.org/issues/2953
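The expected behaviour can be checked from userland with a short test. This is a minimal sketch assuming a writable /tmp; the helper name `demo_excl_errno` is hypothetical. It creates a file, removes all permission bits, and then attempts an exclusive create on the same path.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Returns the errno produced by open(O_CREAT|O_EXCL) on an existing,
 * permission-less file, 0 if the open unexpectedly succeeded, or -1 on
 * a setup failure.
 */
static int
demo_excl_errno(void)
{
	char path[] = "/tmp/excldemoXXXXXX";
	int fd, err;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	close(fd);

	/* Strip every permission bit so a plain open() would be denied. */
	if (chmod(path, 0) != 0) {
		unlink(path);
		return -1;
	}

	fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
	err = (fd < 0) ? errno : 0;
	if (fd >= 0)
		close(fd);
	unlink(path);
	return err;
}
```

On a system that follows the standard, the function returns EEXIST regardless of the file's mode.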
kernel - Normalize the vx_*() vnode interface * The vx_*() vnode interface is used for initial allocations, reclaims, and terminations. Normalize all use cases to prevent the mixing together of the vx_*() API and the vn_*() API. For example, vx_lock() should not be paired with vn_unlock(), and so forth. * Integrate an update-counter mechanism into the vx_*() API, assert reasonability. * Change vfs_cache.c to use an int update counter instead of a long. The vfs_cache code can't quite use the spin-lock update counter API yet. Use proper atomics for load and store. * Implement VOP_GETATTR_QUICK, meant to be a 'quick' version of VOP_GETATTR() that only retrieves information related to permissions and ownership. This will be fast-pathed in a later commit. * Implement vx_downgrade() to convert an exclusive vx_lock into an exclusive vn_lock (for vnodes). Adjust all use cases in the getnewvnode() path. * Remove unnecessary locks in tmpfs_getattr() and don't use any in tmpfs_getattr_quick(). * Remove unnecessary locks in hammer2_vop_getattr() and don't use any in hammer2_vop_getattr_quick()
kernel - Micro optimization for vnode exclusive lock * Micro-optimize open(... O_RDWR) by allowing a shared vnode lock for this case when opening a file which is not an executable. We used to unconditionally get an exclusive lock to deal with VTEXT vs O_RDWR races against executables, but this can cause unnecessary SMP contention on normal files and devices opened O_RDWR which are not executables.
hammer2 - Fix inode & chain limits, improve flush pipeline. * Reorganize VFS_MODIFYING() to avoid certain deadlock conditions and adjust hammer2 to unconditionally stall in VFS_MODIFYING() when dirty limits are exceeded. Make sure VFS_MODIFYING() is called in all appropriate filesystem-modifying paths. This ensures that inode and chain structure allocation limits are adhered to. * Fix hammer2's wakeup code for the dirty inode count hysteresis. This fixes a situation where stalls due to excessive dirty inodes were waiting a full second before resuming operation based on the dirty count hysteresis. The hysteresis now works as intended: (1) Trigger a sync when the dirty count reaches 50% N. (2) Stall the frontend when the dirty count reaches 100% N. (3) Resume the frontend when the dirty count drops to 66% N. * Fix trigger_syncer() to guarantee that the syncer will flush the filesystem ASAP when called. If the filesystem is already in a flush, it will be flushed again. Previously if the filesystem was already in a flush it would wait one second before flushing again, which significantly reduced performance under conditions where the dirty chain limit or the dirty inode limit is constantly being hit (e.g. chown -R, etc). Reported-by: tuxillo
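The three hysteresis thresholds listed in this commit can be expressed as a small state machine. This is a sketch of the stated rule only, not hammer2's implementation; the `demo_*` names are hypothetical and N is represented by the `limit` field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_dirty_state {
	long limit;	/* N: the dirty inode limit */
	bool stalled;	/* is the frontend currently stalled? */
};

/* (1) A sync is triggered once the dirty count reaches 50% of N. */
static bool
demo_should_sync(const struct demo_dirty_state *s, long dirty)
{
	return dirty * 2 >= s->limit;
}

/*
 * (2) Stall the frontend when the dirty count reaches 100% of N.
 * (3) Once stalled, resume only when the count drops to 66% of N.
 * Updates and returns the stall state.
 */
static bool
demo_update_stall(struct demo_dirty_state *s, long dirty)
{
	assert(s != NULL && s->limit > 0);

	if (!s->stalled && dirty >= s->limit)
		s->stalled = true;
	else if (s->stalled && dirty * 100 <= s->limit * 66)
		s->stalled = false;
	return s->stalled;
}
```

The gap between the 100% stall point and the 66% resume point is what keeps the frontend from flapping between stalled and running while the syncer drains dirty inodes.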
Rename some functions to better names. devfs_find_device_by_udev() -> devfs_find_device_by_devid() dev2udev() -> devid_from_dev() udev2dev() -> dev_from_devid() This fits with the rest of the code. 'dev' usually means a cdev_t, such as in make_dev(), etc. Instead of 'udev', use 'devid', since that's what dev_t is, a "Device ID".
kernel: Cleanup <sys/uio.h> issues. The iovec_free() inline greatly complicates inclusion of this header. The NULL check is not always seen from <sys/_null.h>. Luckily only three kernel sources need it: kern_subr.c, sys_generic.c and uipc_syscalls.c. Also just a single dev/drm source makes use of 'struct uio'. * Include <sys/uio.h> explicitly first in drm_fops.c to avoid kfree() macro override in drm compat layer. * Use <sys/_uio.h> where only enums and struct uio are needed, but ensure that userland will not include it for possible later <sys/user.h> use. * Stop using <sys/vnode.h> as a shortcut for uiomove*() prototypes. The uiomove*() family of functions possibly transfers data across the kernel/user space boundary. The presence of this header explicitly marks sources as such. * Prefer to add <sys/uio.h> after <sys/systm.h>, but before <sys/proc.h> and definitely before <sys/malloc.h> (except for the 3 mentioned sources). This will allow removing <sys/malloc.h> from <sys/uio.h> later on. * Adjust <sys/user.h> to use component headers instead of <sys/uio.h>. While there, use the opportunity for a minimal whitespace cleanup. No functional differences observed in compiler intermediates.
world - More ABI breakage * Make more structural changes that will break ABIs. Since we are breaking ABI's we might as well get as much of it done as possible. struct datum (ndbm and rpcsvc) struct stat (see note below) struct ipc_perm (sysv messaging and ipc) * The struct stat changes use a spare field so the structure size has NOT changed. The kernel has been modified to fill in the 'old' field for ABI compatibility. The other structures, however, will break ABIs, particularly struct ipc_perm. * Tested with a full world + kernel build. Additional work in dports will be needed, certainly a whole new package set for master (also needed due to other ABI-breaking commits). Submitted-by: swildner
sys/vfs/fuse: Add initial FUSE support The basic code design comes from FreeBSD, but the code is written from scratch. It was just easier to write from scratch than trying to port sys/fs/fuse/* in FreeBSD for various reasons. Note that this is to implement the FUSE API/ABI, but not to be compatible with the FreeBSD implementation, which contains FreeBSD specific sysctls, etc. The initial version doesn't support FUSE_WRITE; VOP_WRITE() is disabled by returning EOPNOTSUPP. It currently works with simple write(2) calls like dd(1) via direct I/O, but not when the syncer thread or mmap(2) gets involved under non-trivial conditions. It looks to be doable with custom VOP_GETPAGES() and VOP_PUTPAGES(), but if not then it requires some changes to sys/kern/* and sys/vm/* to properly support writes. Besides the above, this initial version supports basic FUSE operations invoked from file related system calls via FUSE VOP's, but not things like FUSE_IOCTL, FUSE_POLL, FUSE_FALLOCATE, etc. Although dmesg says FUSE 7.28, don't expect it to support everything 7.28 (or anywhere close to 7.28) says it has. FUSE will be dropped from DragonFly releases until it gets stabilized to a certain extent including the above, at least for write support.
sys/kern: Add struct file* arg to VOP_{GETATTR,SETATTR,READ,WRITE,FSYNC,READDIR} This commit changes VOP interface to support FUSE API/ABI. It just adds an additional struct file* argument to VOP's, so that FUSE VOP's can access *fp pointer (currently accessible only from caller of VOP's if any, with exception of VOP_OPEN(), VOP_CLOSE(), etc) and make use of its ->private_data pointer. FUSE API/ABI requires FUSE to maintain a per file (usually per file descriptor) data called fh. The fh is an opaque data whose purpose may differ among userspace filesystems, but typically used to store file descriptor value or arbitrary userspace address used by the userspace filesystem process. Below diagram illustrates typical flow of maintaining fh. The userspace filesystem uses fd obtained from opening backing store (e.g fd for regular file, socket, etc) for fh, as a consequence of end user's open(2) syscall, and expects FUSE to maintain that value for future use as an identifier for userspace. * Notes on Linux VFS I/F vs BSD VFS I/F: In Linux, supporting the concept of fh is quite straight forward since Linux kernel has functions vector built around (opened)file including things like mmap(2) handler, in addition to a vector built around inode. But since DragonFly doesn't have a vector built around file (other than a simple struct fileops, which doesn't meet requirements of fh), this change was needed for selected VOP's as minimum requirements for initial FUSE API/ABI support.
--
 FUSE user                  FUSE                       FUSE userspace fs
|                          |                          |
|---------open(2)--------->|                          |
| * issue VOP_OPEN         |---------VOP_OPEN-------->|
|                          | * issue FUSE_OPEN        | * open something
|                          |                          | * reply fd as fh
|                          |<--------VOP_OPEN---------|
|<--------open(2)----------| * store fh in fp         |
| * open success           |                          |
|                          |                          |
|...                       |                          |
|...                       |                          |
|                          |                          |
|---------read(2)--------->|                          |
| * issue VOP_READ         |---------VOP_READ-------->|
|                          | * issue FUSE_READ        |
|                          |   with fh from fp        | * read something
|                          |                          |   using fh for fd
|                          |<--------VOP_READ---------|
|<--------read(2)----------| * return read bytes      |
| * uiomove() success      |                          |
|                          |                          |
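The flow in the diagram (store fh at open time, hand it back on later reads) can be modelled with a few lines of C. This is a toy sketch under stated assumptions: `struct demo_file`, `struct demo_fh`, and the `demo_*` functions are hypothetical stand-ins, not the kernel's struct file or the real FUSE VOP implementations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-open handle as replied by the userspace filesystem. */
struct demo_fh {
	uint64_t fh;
};

/* Hypothetical stand-in for the kernel's struct file. */
struct demo_file {
	void *private_data;	/* FUSE keeps the per-open fh here */
};

/*
 * Open path: the userspace filesystem replied with fh_from_userspace;
 * remember it in the file pointer for later operations.
 */
static void
demo_vop_open(struct demo_file *fp, struct demo_fh *storage,
	      uint64_t fh_from_userspace)
{
	storage->fh = fh_from_userspace;
	fp->private_data = storage;
}

/*
 * Read path: recover the fh from the file pointer so that it can be
 * sent back to the userspace filesystem with the read request.
 */
static uint64_t
demo_vop_read_fh(const struct demo_file *fp)
{
	const struct demo_fh *h = fp->private_data;

	assert(h != NULL);
	return h->fh;
}
```

The read path works only because it receives the same file pointer that the open path populated, which is the reason the commit threads a struct file * argument through the selected VOPs.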
kernel: Remove numerous #include <sys/thread2.h>. Most of them were added when we converted spl*() calls to crit_enter()/crit_exit(), almost 14 years ago. We can now remove a good chunk of them again for where crit_*() are no longer used. I had to adjust some files that were relying on thread2.h or headers that it includes coming in via other headers that it was removed from.
kernel - Add trigger_syncer(), VFS_MODIFYING() * Add trigger_syncer(). This function asynchronously triggers the syncer vnode in the syncer thread for the mount. It is a NOP if there is no syncer thread or syncer vnode. Will be used by HAMMER2 to pipeline syncs when heavy filesystem activity over-extends internal memory structures. * Add VFS_MODIFYING(). This is a hook into the filesystem that modifying filesystem ops in the kernel will call prior to locking any vnodes. It allows the filesystem to moderate the over-allocation of internal structures. Waiting until after the VOP is called is too late, so we need kernel support for this. Numerous attempts to hack moderation code into the H2 VOPs have all failed spectacularly. In H2, over-allocation can occur because H2 must retain disconnected inodes related to file creation and deletion until the next sync cycle.