gitweb.dragonflybsd.org Git - dragonfly.git/atom - sys/kern/vfs_vnops.c history

kernel - Fix improper error on certain O_EXCL open() operations

2024-02-04T22:59:41Z

* O_EXCL|O_CREAT open()s were converting EACCES to EEXIST without
  determining whether the error was due to an intermediate directory
  component.  In fact it needs to return the error caused by the
  intermediate directory component.

  EACCES is only converted to EEXIST by an O_EXCL|O_CREAT open()
  when the error is caused by the last component of the path,
  because in that case the last component does in fact exist and
  it is not relevant whether it is accessible or not.

* Fix by specifying whether the error came from an intermediate
  directory check using a previously unused field in struct
  nlookupdata.  A bit messy, but this was the easiest way since
  we've run out of NLC flag bits.
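
The rule described above can be sketched as a small decision function.
This is only an illustration, not the code from the commit: the helper
name and its parameters are hypothetical, and the from_intermediate
flag stands in for the information the commit stores in struct
nlookupdata.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper, not the actual kernel code: pick the errno an
 * open() should report after the path lookup has failed.
 *
 * error             - errno produced by the lookup (non-zero)
 * excl_create       - true when both O_EXCL and O_CREAT were requested
 * from_intermediate - true when an intermediate directory component
 *                     caused the error
 */
static int
excl_create_errno(int error, bool excl_create, bool from_intermediate)
{
	assert(error != 0);

	/*
	 * Only an EACCES on the last path component is converted: that
	 * component exists, so its accessibility does not matter.
	 */
	if (excl_create && error == EACCES && !from_intermediate)
		return EEXIST;

	/* Errors from intermediate directories are reported unchanged. */
	return error;
}
```

With this sketch, excl_create_errno(EACCES, true, true) stays EACCES,
while excl_create_errno(EACCES, true, false) becomes EEXIST.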

Reported-by: tuxillo

[D B] sys/kern/vfs_vnops.c

kernel - Add per-process capability-based restrictions

2023-10-13T02:55:19Z

* This new system allows userland to set capability restrictions which
  turn off numerous kernel features and root accesses.  These restrictions
  are inherited by sub-processes recursively.  Once set, restrictions cannot
  be removed.

  Basic restrictions that mimic an unadorned jail can be enabled without
  creating a jail, but generally speaking real security also requires
  creating a chrooted filesystem topology, and a jail is still needed
  to really segregate processes from each other.  If you do so, however,
  you can (for example) disable mount/umount and most global root-only
  features.

* Add new system calls and a manual page for syscap_get(2) and syscap_set(2)

* Add sys/caps.h

* Add the "setcaps" userland utility and manual page.

* Remove priv.9 and the priv_check infrastructure, replacing it with
  a newly designed caps infrastructure.

* The intention is to add path restriction lists and similar features to
  improve jail-less security in the near future, and to optimize the
  priv_check code.
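
The two properties stated above (restrictions are inherited, and once
set they cannot be removed) can be modeled with a bitmask that only
ever grows.  This is a toy model under assumed names; it does not use
syscap_get(2) or syscap_set(2), and the bit names are invented for the
example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical restriction bits, invented for this example. */
#define MODEL_RESTRICT_MOUNT	0x1u
#define MODEL_RESTRICT_ROOT	0x2u

typedef uint32_t model_caps_t;

/* Add restrictions.  There is deliberately no way to clear a bit. */
static model_caps_t
model_caps_restrict(model_caps_t current, model_caps_t add)
{
	model_caps_t next = current | add;

	assert((next & current) == current);	/* nothing was removed */
	return next;
}

/* A child process starts with exactly its parent's restrictions. */
static model_caps_t
model_caps_inherit(model_caps_t parent)
{
	return parent;
}
```

A child may add further restrictions on top of what it inherited, but
neither parent nor child can get back to an unrestricted state.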

[D B] sys/kern/vfs_vnops.c

kernel - check nc_generation in nlookup path

2022-07-04T03:47:32Z

* With nc_generation now operating in a more usable manner, we can
  use it in nlookup() to check for changes.  When a change is detected,
  the related lock will be cycled and the entire nlookup() will retry up
  to debug.nlookup_max_retries, which currently defaults to 4.

* Add debugging via debug.nlookup_debug.  Set to 3 for nc_generation
  debugging.

* Move "Parent directory lost" kprintfs into a debugging conditional,
  reported via (debug.nlookup_debug & 4).

* This fixes lookup/remove races which could sometimes cause open()
  and other system calls to return EINVAL or ENOTCONN.  Basically
  what happened was that nlookup() wound up on a NCF_DESTROYED entry.

* A few minutes worth of a dsynth bulk does not report any random
  generation number mismatches or retries, so the code in this commit
  is probably very close to correct.
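
The retry scheme described above follows a common pattern: snapshot a
generation counter, do the work, and retry if the counter moved.  A
minimal userland sketch, assuming a hypothetical entry type and a
simulated lookup; returning EAGAIN once the retries are exhausted is an
assumption of this example, not something the commit states.

```c
#include <assert.h>
#include <errno.h>

#define MAX_RETRIES	4	/* stated default of debug.nlookup_max_retries */

/* Hypothetical stand-in for a cache entry with a generation counter. */
struct entry {
	unsigned int generation;	/* bumped whenever the entry changes */
	int value;
};

typedef int (*lookup_fn)(struct entry *e, int *out);

/* Run fn, retrying while the generation changes underneath it. */
static int
lookup_with_retry(struct entry *e, lookup_fn fn, int *out, int *retries)
{
	int attempt;

	assert(e && fn && out && retries);
	for (attempt = 0; attempt <= MAX_RETRIES; ++attempt) {
		unsigned int gen = e->generation;	/* snapshot */
		int error = fn(e, out);

		if (e->generation == gen) {		/* no change seen */
			*retries = attempt;
			return error;
		}
		/* Generation moved: discard the result and try again. */
	}
	*retries = MAX_RETRIES;
	return EAGAIN;		/* assumption: give up after the last retry */
}

/* Simulated lookup: the first pending_changes calls race with a change. */
static int pending_changes;

static int
simulated_lookup(struct entry *e, int *out)
{
	if (pending_changes > 0) {
		--pending_changes;
		++e->generation;	/* pretend another thread modified it */
	}
	*out = e->value;
	return 0;
}
```

With pending_changes set to 2, the lookup succeeds on the third attempt
and reports two retries.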

[D B] sys/kern/vfs_vnops.c

kernel - Make sure nl_dvp is non-NULL in a few situations

2021-06-08T21:34:40Z

* When NLC_REFDVP is set, nl_dvp should be returned non-NULL
  when the nlookup succeeds.

  However, there is one case where nlookup() can succeed but nl_dvp
  can be NULL, and this is when the nlookup() represents a
  mount-point.

* Fix three instances where this case was not being checked and
  could lead to a NULL pointer dereference / kernel panic.

* Do the full resolve treatment for cache_resolve_dvp().  In
  null-mount situations where we have A/B and we null-mount B onto C,
  path resolutions of C via the null mount will resolve B but
  not resolve A.

  This breaks an assumption that nlookup() and cache_dvpref()
  make about the parent ncp having a valid vnode.  In fact, the
  parent ncp of B (which is A) might not, because the resolve
  path for B may have bypassed it due to the presence of the null
  mount.

* Should fix occasional 'mkdir /var/cache' calls that fail with
  EINVAL instead of EEXIST.

Reported-by: zach

[D B] sys/kern/vfs_vnops.c

kernel - Fix /dev/fd/N and clean up the old dup error-code-driven path

2021-03-20T02:27:11Z

* When opening /dev/fd/N, replicate the file pointer for descriptors
  that represent vnodes instead of dup()ing.  This ensures that the seek
  offset and other fp-related elements are not shared unexpectedly.

* Refactor the open() path to allow dev_dopen() to replace the
  struct file by passing a struct file ** instead of a struct file *.
  This removes old error-code-based hacks.

* This fixes the shared seek position that fexecve() was operating with
  due to its use of /dev/fd/N for scripts.
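
The difference between sharing and replicating a file pointer can be
seen from userland with plain POSIX calls.  The sketch below does not
open /dev/fd/N itself; it only contrasts dup(), which shares the seek
offset, with a separate open() of the same file, which gets its own.
The helper name and the temporary path are choices made for this
example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read 4 bytes through one descriptor and return the offset seen
 * through a second descriptor for the same file.
 * use_dup != 0: the second descriptor comes from dup()
 * use_dup == 0: the second descriptor comes from a separate open()
 */
static off_t
offset_seen_by_second_fd(int use_dup)
{
	char path[] = "/tmp/seekdemoXXXXXX";
	char buf[4];
	ssize_t n;
	off_t off;
	int fd1, fd2;

	fd1 = mkstemp(path);
	assert(fd1 >= 0);
	n = write(fd1, "0123456789", 10);
	assert(n == 10);
	off = lseek(fd1, 0, SEEK_SET);
	assert(off == 0);

	fd2 = use_dup ? dup(fd1) : open(path, O_RDONLY);
	assert(fd2 >= 0);

	n = read(fd1, buf, sizeof(buf));	/* advances fd1's offset */
	assert(n == (ssize_t)sizeof(buf));
	off = lseek(fd2, 0, SEEK_CUR);		/* what does fd2 see? */

	close(fd2);
	close(fd1);
	unlink(path);
	return off;
}
```

The dup() case reports offset 4 because both descriptors refer to one
file pointer; the separate open() reports 0.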

Reported-by: aly

[D B] sys/kern/vfs_vnops.c

kernel - Fix atime field for PTYs

2021-01-12T06:33:02Z

* Fix the atime field for PTYs.  A calculation in the shortcut code
  that avoids having to update the timestamp on every read() or
  write() was reversed.

Reported-by: dancrossnyc

[D B] sys/kern/vfs_vnops.c

kernel: Use howmany() in a couple of places.

2020-11-07T23:18:43Z

[D B] sys/kern/vfs_vnops.c

kernel: improve open(2) error handling

2020-06-16T05:23:19Z

When trying to open a file with O_CREAT and O_EXCL flags while the file
exists, disregard the file permissions and always return EEXIST as
described in manpage and required by the standard.
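
The expected behavior can be checked from userland.  The following is a
small test sketch using only POSIX calls; the helper name and the
temporary path are choices made for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create a file, strip all of its permission bits, then try to create
 * it again with O_CREAT|O_EXCL.  Returns the errno of that second
 * open(), or 0 if it unexpectedly succeeded.
 */
static int
errno_for_excl_create_on_existing(void)
{
	char path[] = "/tmp/excldemoXXXXXX";
	int fd, rc, saved;

	fd = mkstemp(path);
	assert(fd >= 0);
	close(fd);
	rc = chmod(path, 0);		/* file exists but is inaccessible */
	assert(rc == 0);

	fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
	saved = (fd < 0) ? errno : 0;
	if (fd >= 0)
		close(fd);
	unlink(path);
	return saved;
}
```

On a system that follows the standard, the function returns EEXIST
regardless of the file's permissions.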

Issue: https://bugs.dragonflybsd.org/issues/2953

[D B] sys/kern/vfs_vnops.c

kernel - Rename spinlock counter trick API

2020-03-04T04:06:12Z

* Rename the access side of the API from spin_update_*() to
  spin_access_*() to avoid confusion.

[D B] sys/kern/vfs_vnops.c

kernel - Normalize the vx_*() vnode interface

2020-03-03T21:26:48Z

* The vx_*() vnode interface is used for initial allocations, reclaims,
  and terminations.

  Normalize all use cases to prevent the mixing together of the vx_*()
  API and the vn_*() API.  For example, vx_lock() should not be paired
  with vn_unlock(), and so forth.

* Integrate an update-counter mechanism into the vx_*() API, assert
  reasonability.

* Change vfs_cache.c to use an int update counter instead of a long.
  The vfs_cache code can't quite use the spin-lock update counter API
  yet.

  Use proper atomics for load and store.

* Implement VOP_GETATTR_QUICK, meant to be a 'quick' version of
  VOP_GETATTR() that only retrieves information related to permissions
  and ownership.  This will be fast-pathed in a later commit.

* Implement vx_downgrade() to convert an exclusive vx_lock into an
  exclusive vn_lock (for vnodes).  Adjust all use cases in the
  getnewvnode() path.

* Remove unnecessary locks in tmpfs_getattr() and don't use
  any in tmpfs_getattr_quick().

* Remove unnecessary locks in hammer2_vop_getattr() and don't use
  any in hammer2_vop_getattr_quick()

[D B] sys/kern/vfs_vnops.c

kernel - Micro optimization for vnode exclusive lock

2020-02-15T19:43:10Z

* Micro-optimize open(... O_RDWR) by allowing a shared vnode lock for
  this case when opening a file which is not an executable.

  We used to unconditionally get an exclusive lock to deal with VTEXT vs
  O_RDWR races against executables, but this can cause unnecessary SMP
  contention on normal files and devices opened O_RDWR which are not
  executables.

[D B] sys/kern/vfs_vnops.c

hammer2 - Fix inode & chain limits, improve flush pipeline.

2020-01-30T23:40:01Z

* Reorganize VFS_MODIFYING() to avoid certain deadlock conditions and
  adjust hammer2 to unconditionally stall in VFS_MODIFYING() when dirty
  limits are exceeded.

  Make sure VFS_MODIFYING() is called in all appropriate filesystem-
  modifying paths.

  This ensures that inode and chain structure allocation limits are
  adhered to.

* Fix hammer2's wakeup code for the dirty inode count hysteresis.  This
  fixes a situation where stalls due to excessive dirty inodes were waiting
  a full second before resuming operation based on the dirty count
  hysteresis.

  The hysteresis now works as intended:

  (1) Trigger a sync when the dirty count reaches 50% N.
  (2) Stall the frontend when the dirty count reaches 100% N.
  (3) Resume the frontend when the dirty count drops to 66% N.

* Fix trigger_syncer() to guarantee that the syncer will flush the
  filesystem ASAP when called.  If the filesystem is already in a flush,
  it will be flushed again.

  Previously if the filesystem was already in a flush it would wait one
  second before flushing again, which significantly reduces performance
  under conditions where the dirty chain limit or the dirty inode limit is
  constantly being hit (e.g. chown -R, etc).
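
The three hysteresis thresholds listed above can be written as a small
state machine.  This is a sketch with hypothetical type and function
names, not hammer2 code; it treats "reaches" as >= and "drops to" as
<=, which is an assumption of this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the dirty-count state. */
struct dirty_state {
	unsigned long limit;	/* N: the configured dirty limit */
	bool stalled;		/* is the frontend currently stalled? */
};

struct dirty_action {
	bool trigger_sync;	/* ask the syncer to flush */
	bool stall;		/* frontend must wait */
};

/* Feed the current dirty count in, get the resulting actions out. */
static struct dirty_action
dirty_update(struct dirty_state *st, unsigned long count)
{
	struct dirty_action act;

	assert(st->limit > 0);

	if (!st->stalled && count >= st->limit)
		st->stalled = true;			/* (2) 100% N */
	else if (st->stalled && count * 100 <= st->limit * 66)
		st->stalled = false;			/* (3) 66% N */

	act.trigger_sync = (count * 2 >= st->limit);	/* (1) 50% N */
	act.stall = st->stalled;
	return act;
}
```

Because the stall and resume thresholds differ, a frontend stalled at
100% N stays stalled at 80% N and only resumes once the count has
fallen to 66% N.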

Reported-by: tuxillo

[D B] sys/kern/vfs_vnops.c

Rename some functions to better names.

2019-12-01T11:03:21Z

devfs_find_device_by_udev() -> devfs_find_device_by_devid()
dev2udev()                  -> devid_from_dev()
udev2dev()                  -> dev_from_devid()

This fits with the rest of the code. 'dev' usually means a cdev_t,
such as in make_dev(), etc. Instead of 'udev', use 'devid', since
that's what dev_t is, a "Device ID".

[D B] sys/kern/vfs_vnops.c

kernel: Cleanup issues.

2019-10-18T08:46:47Z

 The iovec_free() inline very complicates this header inclusion.  The
 NULL check is not always seen from .  Luckily only three
 kernel sources need it: kern_subr.c, sys_generic.c and uipc_syscalls.c.
 Also just a single dev/drm source makes use of 'struct uio'.
 * Include  explicitly first in drm_fops.c to avoid kfree()
   macro override in drm compat layer.
 * Use  where only enums and struct uio is needed, but ensure
   that userland will not include it for possible later  use.
 * Stop using  as shortcut for uiomove*() prototypes.  The
   uiomove*() family functions possibly transfer data across kernel/user
   space boundary.  This header presence explicitly mark sources as such.
 * Prefer to add  after , but before 
   and definitely before  (except for 3 mentioned sources).
   This will allow to remove  from  later on.
 * Adjust  to use component headers instead of .

 While there, use opportunity for a minimal whitespace cleanup.

 No functional differences observed in compiler intermediates.

[D B] sys/kern/vfs_vnops.c

world - More ABI breakage

2019-09-13T16:54:33Z

* Make more structural changes that will break ABIs.  Since we are
  breaking ABI's we might as well get as much of it done as possible.

  struct datum		(ndbm and rpcsvc)
  struct stat		(see note below)
  struct ipc_perm	(sysv messaging and ipc)

* The struct stat changes use a spare field so the structure
  size has NOT changed.  The kernel has been modified to fill
  in the 'old' field for ABI compatibility.

  The other structures, however, will break ABIs, particularly
  struct ipc_perm.

* Tested with a full world + kernel build.  Additional work in
  dports will be needed, certainly a whole new package set for
  master (also needed due to other ABI-breaking commits).

Submitted-by: swildner

[D B] sys/kern/vfs_vnops.c

sys/vfs/fuse: Add initial FUSE support

2019-03-31T16:30:07Z

The basic code design comes from FreeBSD, but the code is written
from scratch. It was just easier to write from scratch than trying to
port sys/fs/fuse/* in FreeBSD for various reasons. Note that this is
to implement FUSE API/ABI, but not to be compatible with FreeBSD
implementation which contains FreeBSD specific sysctls, etc.

The initial version doesn't support FUSE_WRITE by disabling
VOP_WRITE() by returning EOPNOTSUPP. It currently works with simple
write(2) calls like dd(1) via direct I/O, but not when syncer thread
or mmap(2) gets involved under non trivial conditions. It looks to
be doable with custom VOP_GETPAGES() and VOP_PUTPAGES(), but if not
then it requires some changes to sys/kern/* and sys/vm/* to properly
support writes.

Besides the above, this initial version supports basic FUSE operations
invoked from file related system calls via FUSE VOP's, but not things
like FUSE_IOCTL, FUSE_POLL, FUSE_FALLOCATE, etc. Although dmesg says
FUSE 7.28, don't expect it to support everything 7.28 (or anywhere
close to 7.28) says it has.

FUSE will be dropped from DragonFly releases until it gets stabilized
to a certain extent, including the above, at least for write support.

[D B] sys/kern/vfs_vnops.c

sys/kern: Add struct file* arg to VOP_{GETATTR,SETATTR,READ,WRITE,FSYNC,READDIR}

2019-03-31T16:30:07Z

This commit changes the VOP interface to support FUSE API/ABI.
It just adds an additional struct file* argument to VOP's, so that
FUSE VOP's can access the *fp pointer (currently accessible only from
the caller of VOP's if any, with the exception of VOP_OPEN(),
VOP_CLOSE(), etc) and make use of its ->private_data pointer.

FUSE API/ABI requires FUSE to maintain per-file (usually per file
descriptor) data called fh. The fh is opaque data whose purpose
may differ among userspace filesystems, but it is typically used to
store a file descriptor value or an arbitrary userspace address used
by the userspace filesystem process.

The diagram below illustrates the typical flow of maintaining fh. The
userspace filesystem uses the fd obtained from opening the backing store
(e.g. an fd for a regular file, socket, etc.) for fh, as a consequence of
the end user's open(2) syscall, and expects FUSE to maintain that value
for future use as an identifier for userspace.

* Notes on Linux VFS I/F vs BSD VFS I/F:
In Linux, supporting the concept of fh is quite straightforward since
the Linux kernel has a function vector built around the (opened) file,
including things like the mmap(2) handler, in addition to a vector
built around the inode.

But since DragonFly doesn't have a vector built around the file (other
than a simple struct fileops, which doesn't meet the requirements of fh),
this change was needed for selected VOP's as the minimum requirement
for initial FUSE API/ABI support.

--
FUSE user                  FUSE                       FUSE userspace fs
|                          |                          |
|---------open(2)--------->|                          |
| * issue VOP_OPEN         |---------VOP_OPEN-------->|
|                          | * issue FUSE_OPEN        | * open something
|                          |                          | * reply fd as fh
|                          |<--------VOP_OPEN---------|
|<--------open(2)----------| * store fh in fp         |
| * open success           |                          |
|                          |                          |
|...                       |                          |
|...                       |                          |
|                          |                          |
|---------read(2)--------->|                          |
| * issue VOP_READ         |---------VOP_READ-------->|
|                          | * issue FUSE_READ        |
|                          |   with fh from fp        | * read something
|                          |                          |   using fh for fd
|                          |<--------VOP_READ---------|
|<--------read(2)----------| * return read bytes      |
| * uiomove() success      |                          |
|                          |                          |
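
The flow in the diagram amounts to storing the fh in per-open data at
open time and reading it back on later operations.  A minimal model of
that idea follows; the struct and function names are hypothetical
stand-ins and are not the DragonFly or FUSE definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an open file with a per-open data slot. */
struct file_model {
	void *private_data;
};

/* Hypothetical holder for the opaque handle replied by the userspace fs. */
struct fh_holder {
	uint64_t fh;
};

/* Open side: remember the fh from the open reply in the file. */
static void
model_open(struct file_model *fp, struct fh_holder *slot, uint64_t replied_fh)
{
	slot->fh = replied_fh;
	fp->private_data = slot;
}

/* Read side: recover the fh to place in the read request. */
static uint64_t
model_fh_for_read(const struct file_model *fp)
{
	const struct fh_holder *slot = fp->private_data;

	assert(slot != NULL);	/* the file must have been opened first */
	return slot->fh;
}
```

Two opens of the same file each get their own holder, so each read is
issued with the fh that belongs to its own open.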

[D B] sys/kern/vfs_vnops.c

kernel: Remove numerous #include .

2019-03-02T20:34:21Z

Most of them were added when we converted spl*() calls to
crit_enter()/crit_exit(), almost 14 years ago. We can now
remove a good chunk of them again for where crit_*() are
no longer used.

I had to adjust some files that were relying on thread2.h
or headers that it includes coming in via other headers
that it was removed from.

[D B] sys/kern/vfs_vnops.c

kernel - Add trigger_syncer(), VFS_MODIFYING()

2018-12-05T05:49:35Z

* Add trigger_syncer().  This function asynchronously triggers the
  syncer vnode in the syncer thread for the mount.  It is a NOP if
  there is no syncer thread or syncer vnode.

  Will be used by HAMMER2 to pipeline syncs when heavy filesystem
  activity over-extends internal memory structures.

* Add VFS_MODIFYING().  This is a hook into the filesystem that
  modifying filesystem ops in the kernel will call prior to locking
  any vnodes.  It allows the filesystem to moderate the over-allocation
  of internal structures.  Waiting until after the VOP is called is too
  late, so we need kernel support for this.  Numerous attempts to hack
  moderation code into the H2 VOPs have all failed spectacularly.

  In H2, over-allocation can occur because H2 must retain disconnected
  inodes related to file creation and deletion until the next sync cycle.

[D B] sys/kern/vfs_vnops.c

kernel: Remove some references to i386.

2017-12-19T17:46:42Z

While there, adjust some outdated paths in comments and some minor cleanup.

[D B] sys/kern/vfs_vnops.c