kernel - Refactor in-kernel system call API to remove bcopy()

* Change the in-kernel system call prototype to take the system call
  arguments as a separate pointer, and make the contents read-only:

	int sy_call_t (void *);

  becomes:

	int sy_call_t (struct sysmsg *sysmsg, const void *);

* System calls with six arguments or fewer no longer need to copy the
  arguments from the trapframe to a holding structure.  Instead, we
  simply point into the trapframe.  The L1 cache footprint will be a
  bit smaller, but in simple tests the results are not noticeably
  faster... maybe 1ns or so (roughly 1%).
kernel - Implement sbrk(), change low-address mmap hinting

* Change mmap()'s internal lower address bound from dmax (32GB) to
  RLIMIT_DATA's current value.  This allows the rlimit to be reduced
  (for example) and for hinted mmap()s to then map space below the 4GB
  mark.  The default data rlimit is 32GB.

  This change is needed to support several languages, at least lua and
  probably one or two others, which use mmap hinting under the
  assumption that it can map space below the 4GB address mark.  The
  data limit must be lowered with a limit command too, which can be
  scripted or patched for such programs.

* Implement the sbrk() system call.  This system call was already
  present but just returned EOPNOTSUPP, and libc previously had its
  own shim for sbrk() which used the ancient break() system call.
  (Note that the prior implementation did not return ENOSYS or raise
  a signal).

  sbrk() in the kernel is thread-safe for positive increments and is
  also byte-granular (the old libc sbrk() was only page-granular).

  sbrk() in the kernel does not implement negative increments and will
  return EOPNOTSUPP if asked to.  Negative increments were historically
  designed to be able to 'free' memory allocated with sbrk(), but it is
  not possible to implement the case in a modern VM system due to the
  mmap changes above:

  (1) The new mmap hinting changes make it possible for normal mmap()s
      to have mapped space prior to the RLIMIT_DATA resource limit
      being increased, causing intermingling of sbrk() and user
      mmap()d regions.

  (2) Negative increments are not even remotely thread-safe.

* Note that the previous commit refactored libc to use the kernel
  sbrk() and fall back to its previous emulation code on failure, so
  libc supports both new and old kernels.

* Remove the brk() shim from libc.  brk() is not implemented by the
  kernel.  Symbol removed.  This requires testing against ports, so we
  may have to add it back in, but basically there is no way to
  implement brk() properly with the mmap() hinting fix.

* Adjust manual pages.
kernel - per-thread fd cache, p_fd lock bypass

* Implement a per-thread (fd,fp) cache.  Cache hits can keep fp's in a
  held state (avoiding the need to fhold()/fdrop() the ref count) and
  bypass the p_fd spinlock.  This allows the file pointer structure to
  generally be shared across cpu caches.

* Up to four descriptors can be cached in each thread, LRU.  This is
  the common case.  Highly threaded programs tend to focus work on
  distinct file descriptors in each thread.

* One file descriptor can be cached in up to four threads.  This is a
  significant limitation, though relatively uncommon.  On a cache miss
  the code drops into the normal shared p_fd spinlock lookup.
kernel - Improve mountlist_scan() performance, track vfs_getvfs()

* Use a shared token whenever possible, and do not hold the token
  across the callback in the mountlist_scan() call.

* vfs_getvfs() now mount_hold()s the returned mp.  The caller is
  expected to mount_drop() it when done.  This fixes a very rare race.
kernel - Fix panic during coredump

* Multi-threaded coredumps were not stopping all other threads before
  attempting to scan the vm_map, resulting in numerous possible panics.

* Add a new process state, SCORE, indicating that a core dump is in
  progress, and adjust proc_stop() and friends as well as any code
  which tests the SSTOP state.  SCORE overrides SSTOP.

* The coredump code actively waits for all running threads to stop
  before proceeding.

* Prevent a deadlock between a SIGKILL and a core dump in progress by
  temporarily counting the master exit thread as a stopped thread
  (which allows the coredump to proceed and finish).

Reported-by: marino
kernel - Major signal path adjustments to fix races, tsleep race fixes, +more

* Refactor the signal code to properly hold the lp->lwp_token, in
  particular the ksignal() and lwp_signotify() paths.

* The tsleep() path must also hold lp->lwp_token to properly handle
  lp->lwp_stat states and interlocks.

* Refactor the timeout code in tsleep() to ensure that endtsleep() is
  only called from the proper context, and fix races between
  endtsleep() and lwkt_switch().

* Rename proc->p_flag to proc->p_flags.

* Rename lwp->lwp_flag to lwp->lwp_flags.

* Add lwp->lwp_mpflags and move flags which require atomic ops
  (are adjusted when not the current thread) to the new field.

* Add td->td_mpflags and move flags which require atomic ops
  (are adjusted when not the current thread) to the new field.

* Add some freeze-testing code to the x86-64 trap code (disabled by
  default).
kernel - Add per-process token, adjust signal code to use it

* Add proc->p_token and use it to interlock signal-related operations.

* Remove the use of proc_token in various signal paths.  Note that
  proc_token is still used in conjunction with pfind().

* Remove the use of proc_token in CURSIG*()/issignal() sequences,
  which also removes its use in the tsleep path and the syscall path.
  p->p_token is used instead.

* Move the automatic interlock in the tsleep code to before the CURSIG
  code, fixing a rare race where a SIGCHLD could race against a parent
  process in sigsuspend().  Also acquire p->p_token here to interlock
  LWP_SINTR handling.
kernel - fine-grained namecache and partial vnode MPSAFE work

Namecache subsystem

* All vnode->v_flag modifications now use vsetflags() and vclrflags().
  Because some flags are set and cleared by vhold()/vdrop(), which do
  not require any locks to be held, all modifications must use atomic
  ops.

* Clean up and revamp the namecache MPSAFE work.  Namecache operations
  now use a fine-grained MPSAFE locking model which loosely follows
  these rules:

  - Lock ordering is child to parent, e.g. lock file, then lock the
    parent directory.  This allows resolver recursions up the parent
    directory chain.

  - Downward-traversing namecache invalidations and path lookups will
    unlock the parent (but leave it referenced) before attempting to
    lock the child.

  - Namecache hash table lookups utilize a per-bucket spinlock.

  - Vnode locks may be acquired while holding namecache locks, but not
    vice-versa.  Vnodes are not destroyed until all namecache
    references go away, but can enter reclamation.  Namecache lookups
    detect the case and re-resolve to overcome the race.  Namecache
    entries are not destroyed while referenced.

* Remove vfs_token; the namecache MPSAFE model is now totally
  fine-grained.

* Revamp the namecache locking primitives (cache_lock/cache_unlock and
  friends).  Use atomic ops and nc_exlocks instead of nc_locktd, and
  build in a request flag.  This solves busy/tsleep races between the
  lock holder and the lock requester.

* Revamp namecache parent/child linkages.  Instead of using vfs_token
  to lock such operations we simply lock both the child and parent
  namecache entries.  Hash table operations are also fully integrated
  with the parent/child linking operations.

* The vnode->v_namecache list is locked via vnode->v_spinlock, which is
  actually vnode->v_lock.lk_spinlock.

* Revamp cache_vref() and cache_vget().  The passed namecache entry
  must be referenced and locked.  Internals are simplified.

* Fix a deadlock by moving the call to _cache_hysteresis() to a place
  where the current thread otherwise does not hold any locked ncp's.

* Revamp nlookup() to follow the new namecache locking rules.

* Fix a number of places, e.g. in vfs/nfs/nfs_subs.c, where
  ncp->nc_parent or ncp->nc_vp was being accessed with an unlocked
  ncp.  nc_parent and nc_vp accesses are only valid if the ncp is
  locked.

* Add the vfs.cache_mpsafe sysctl, which defaults to 0.  This may be
  set to 1 to enable MPSAFE namecache operations for [l,f]stat() and
  open() system calls (for the moment).

VFS/VNODE subsystem

* Use a global spinlock, for now called vfs_spin, to manage
  vnode_free_list.  Use vnode->v_spinlock (and vfs_spin) to manage
  vhold/vdrop ops and to interlock v_auxrefs tests against vnode
  terminations.

* Integrate the per-mount mnt_token and (for now) the MP lock into
  VOP_*() and VFS_*() operations.  This allows the MP lock to be
  shifted further inward from the system calls, but we don't do that
  quite yet.

* HAMMER: VOP_GETATTR, VOP_READ, and VOP_INACTIVE are now MPSAFE.  The
  corresponding sysctls have been removed.

* FIFOFS: Needed some MPSAFE work in order to allow HAMMER to make
  things MPSAFE above, since HAMMER forwards vops for in-filesystem
  fifos to fifofs.

* Add some debugging kprintf()s when certain MP races are averted, for
  testing only.

MISC

* Add some assertions to the VM system.

* Document existing and newly MPSAFE code.
kernel - Move mplock to machine-independent C

* Remove the per-platform mplock code and move it all into
  machine-independent code: sys/mplock2.h and kern/kern_mplock.c.

* Inline the critical path.

* When a conflict occurs kern_mplock.c will KTR-log the file and line
  number of both the holder and the conflicting acquirer.  Set
  debug.ktr.giant_enable=-1 to enable conflict logging.
kernel - adjust falloc and arguments to dupfdopen, fsetfd, fdcheckstd

* Make changes to the pointer type passed (proc, lwp, filedesc) to
  numerous routines.

* falloc() needs access to td_ucred (it was previously using p_ucred,
  which is not MPSAFE).

* Adjust fsetfd() to make it conform to the other fsetfd*() procedures.

* Make related changes to fdcheckstd() and dupfdopen().
kernel - use new td_ucred in numerous places

* Use curthread->td_ucred in numerous places, primarily system calls,
  where curproc->p_ucred was used before.

* Clean up local variable use related to the above.

* Adjust several places where p_ucred is replaced to properly deal
  with lwp threading races, to avoid accessing and freeing a
  potentially stale ucred.

* Adjust static procedures in the ktrace code to generally take lwp
  pointers instead of proc pointers.
kernel - Move MP lock inward, plus misc other stuff

* Remove the MPSAFE flag from the syscalls.master file.  All system
  calls are now called without the MP lock held and will acquire the
  MP lock if necessary.

* Shift the MP lock inward.  Try to leave most copyin/copyout
  operations outside the MP lock.  Reorder some of the copyouts in the
  linux emulation code to suit.  Kernel resource operations are MP
  safe.  Process ucred access is now outside the MP lock but not quite
  MP safe yet (will be fixed in a follow-up).

* Remove unnecessary KKASSERT(p) calls left over from the time before
  system calls were prefixed with sys_*.

* Fix a bunch of cases in the linux emulation code when setting groups
  where the ngrp range check is incorrect.