kernel - Rearrange struct vmmeter (requires world and kernel build)

* Expand v_lock_name from 16 to 32 bytes.
* Add a v_lock_addr field to go along with v_lock_name. These fields report SMP contention (see the layout sketch below).
* Rearrange vmmeter_uint_end so that it does not include v_lock_name or v_lock_addr.
* Clean up the do_vmmeter_pcpu() sysctl code. Remove the useless aggregation code and just do a structural copy of the per-cpu gd_cnt (struct vmmeter) structure.
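
A rough sketch of the resulting layout. v_lock_name and v_lock_addr come from the change itself; the surrounding counters and the exact vmmeter_uint_end convention are illustrative assumptions:

    struct vmmeter {
            u_int   v_swtch;            /* u_int counters ... */
            /* ... more u_int counters ... */
            u_int   vmmeter_uint_end;   /* end-of-u_int-range marker */
            /*
             * Non-u_int reporting fields now sit past the u_int range
             * so bulk counter operations skip them.
             */
            char    v_lock_name[32];    /* contended lock name (was 16) */
            void    *v_lock_addr;       /* address of the contended lock */
    };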

kernel - fbsd kpi support, add sleepq*() API (untested)

* Initial sleepq*() API. We use our tsleep*() API underneath it (see the sketch below). This is a horrible API, so add a note that it should only be used for FreeBSD compat stuff.
  - Add tsleep/wakeup domains to implement the two sleepq*() queues.
  - Track blocking refs per queue in the sleepq API.
  - Do not track individual threads (just let tsleep*()/wakeup*() do its thing).
  - objcache for wchan, 1K hash table for now, and retain a cache of available wchan structures in the hash table (up to 4 per slot).
  - Include the hash-slot spin lock, as FreeBSD compat code will use it for interlock tests.
  - Relax sleepq_signal() a bit, allowing it to wake up more than one thread (the DragonFly wakeup_*_one*() is a bit non-deterministic).
* For now, add discrete fields to the thread structure. It's a bit of bloat, but it's better than dynamically allocating a side-structure. We already use our tsleep*() API and related fields underneath; add a few more needed for tracking the wchan structure, the queue, and the timeout.
* Add an sbintime_t type (as 64-bit ticks), and an sbticks global counter: monotonic ticks since boot, 64 bits.
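
A minimal sketch of how such a shim can sit on top of tsleep(); tsleep() and its signature are real, while the wrapper bodies and the td_sqtimo thread field are illustrative assumptions:

    /* sleepq_wait(): the wchan address doubles as the tsleep ident */
    void
    sleepq_wait(const void *wchan, int pri)
    {
            tsleep(wchan, 0, "sleepq", 0);
    }

    /* timed variant: td_sqtimo stands in for the new timeout field */
    int
    sleepq_timedwait(const void *wchan, int pri)
    {
            return (tsleep(wchan, 0, "sleepq", curthread->td_sqtimo));
    }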

vm: Change 'kernel_map' global to type 'struct vm_map *'

Change the global variable 'kernel_map' from type 'struct vm_map' to a pointer to this struct. This simplifies the code a bit, since all invocations take its address. The change also aligns with NetBSD's 'kernel_map', which is likewise a pointer, and thus helps the porting of NVMM.

No functional changes.
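
A before/after illustration of a typical call site (vm_map_lock() shown for example purposes):

    /* before: kernel_map was a struct, so callers took its address */
    vm_map_lock(&kernel_map);

    /* after: kernel_map is already a 'struct vm_map *' */
    vm_map_lock(kernel_map);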

<sys/sysref.h>: Switch to lighter <sys/_malloc.h> header.

* Make <sys/sysref2.h> a kernel-only header.
* Remove <sys/types.h> includes that follow <sys/param.h> in devfs(5).
* Add <sys/malloc.h> includes where it is actually used in sources.

While there, minor whitespace cleanup.
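
A hedged illustration of the split-header pattern this follows; the exact contents of <sys/_malloc.h> and the M_EXAMPLE type are assumptions:

    /* in a header: only the malloc-type declaration machinery is needed */
    #include <sys/_malloc.h>
    MALLOC_DECLARE(M_EXAMPLE);

    /* in a .c file that actually allocates: pull in the full API */
    #include <sys/malloc.h>
    p = kmalloc(sizeof(*p), M_EXAMPLE, M_WAITOK | M_ZERO);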

kernel - Remove cache ping-pong on common scheduler operations

* Reflect the dfly_curprocmask and dfly_rdyprocmask bits in the scheduler's pcpu structures (see the sketch below). This allows us to reduce global atomic ops that are virtually guaranteed to cause cache ping-ponging.
* sched_yield and token-based yield operations no longer clear the bit in the curprocmask, since they are just yielding (XXX needs work; a later blocking op then might not pull a new process from another cpu).
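
A minimal sketch of the mirroring idea; ATOMIC_CPUMASK_ORBIT() is the real cpumask op, while the dd->dd_flags field and the DFLY_PCPU_CURSET flag are illustrative assumptions:

    /*
     * Test the pcpu mirror first; only fall through to the shared
     * cpumask (and its cache-line ping-pong) when the bit actually
     * changes state.
     */
    if ((dd->dd_flags & DFLY_PCPU_CURSET) == 0) {
            dd->dd_flags |= DFLY_PCPU_CURSET;
            ATOMIC_CPUMASK_ORBIT(dfly_curprocmask, gd->gd_cpuid);
    }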

kernel - Cleanup token code, add simple exclusive priority (2)

* The priority mechanism revealed an issue with lwkt_switch()'s fall-back code when dealing with contended tokens. The code was refusing to schedule a lower-priority thread requesting an exclusive lock on a cpu while another thread on that same cpu was requesting a shared lock. This creates a problem for the exclusive-priority feature. More pointedly, it also creates a fairness problem in the mixed lock-type use case generally.
* Change the mechanism to allow any thread polling on tokens to be scheduled (see the sketch below). The scheduler still iterates in priority order. This imposes a little extra overhead on userspace returns, since a thread might be scheduled that then tries to return to userland without being the designated user thread.
* This also fixes another bug that cropped up recently where a 32-way threaded program would sometimes not quickly schedule to all 32 cpus, sometimes leaving one or two cpus idle for a few seconds.
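
A hedged sketch of the relaxed fall-back loop; gd_tdrunq, td_threadq, and lwkt_getalltokens() exist in the LWKT code, but the loop structure here only illustrates the technique:

    /*
     * Iterate the run queue in priority order and take the first
     * thread whose tokens can all be acquired, regardless of how
     * shared and exclusive requests are mixed on this cpu.
     */
    TAILQ_FOREACH(ntd, &gd->gd_tdrunq, td_threadq) {
            if (lwkt_getalltokens(ntd, 0)) {
                    /* all tokens acquired; switch to this thread */
                    break;
            }
            /* still contended; keep polling lower-priority candidates */
    }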

kernel - Refactor sysctl locking

* Get rid of the global topology lock. Use a pcpu shared lock instead, and change the XLOCK code (which is barely ever executed) to obtain an exclusive lock on all cpus (see the sketch below).
* Add CTLFLAG_NOLOCK, which disables the automatic per-OID sysctl lock.

Suggested-by: mjg (Mateusz Guzik)
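
A minimal sketch of the pcpu read-lock pattern; lockmgr() and the LK_* directives are real, while the sysctl_cpu_lock array name and its sizing are illustrative assumptions:

    struct lock sysctl_cpu_lock[SMP_MAXCPU];

    /* readers take only their own cpu's lock: no shared cache line */
    lockmgr(&sysctl_cpu_lock[mycpuid], LK_SHARED);
    /* ... read ... */
    lockmgr(&sysctl_cpu_lock[mycpuid], LK_RELEASE);

    /* the rare XLOCK path takes every cpu's lock exclusively */
    for (i = 0; i < ncpus; ++i)
            lockmgr(&sysctl_cpu_lock[i], LK_EXCLUSIVE);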

kernel - Refactor smp collision statistics (2)

* Refactor the indefinite_info mechanics. Instead of tracking indefinite loops on a per-thread basis for tokens, track them on a per-scheduler basis. The scheduler records the overhead while it is live-looping on tokens, but the moment it finds a thread it can actually schedule it stops (then restarts later, the next time it is entered), even if some of the other threads still have unresolved tokens. This gives us a fairer representation of how many cpu cycles are actually being wasted waiting for tokens.
* Go back to using a local indefinite_info in the lockmgr*(), mutex*(), and spinlock code.
* Refactor lockmgr() by implementing an __inline frontend to interpret the directive (see the sketch below). Since this argument is usually a constant, the change effectively removes the switch(). Use LK_NOCOLLSTATS to create a clean recursion to wrap the blocking case with the indefinite*() API.
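
A hedged sketch of the __inline frontend idea; LK_TYPE_MASK and the LK_* directives are real, while the _lockmgr_*() backend names are illustrative assumptions:

    /*
     * With a constant 'flags' argument the compiler resolves the
     * switch() at compile time and calls the backend directly.
     */
    static __inline int
    lockmgr(struct lock *lkp, u_int flags)
    {
            switch (flags & LK_TYPE_MASK) {
            case LK_SHARED:
                    return (_lockmgr_shared(lkp, flags));
            case LK_EXCLUSIVE:
                    return (_lockmgr_exclusive(lkp, flags));
            case LK_RELEASE:
                    return (_lockmgr_release(lkp, flags));
            default:
                    return (_lockmgr_other(lkp, flags));
            }
    }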

kernel - Refactor smp collision statistics

* Add an indefinite wait timing API (sys/indefinite.h, sys/indefinite2.h). This interface uses the TSC and records lock latencies to our pcpu stats in microseconds; the systat -pv 1 display shows this under smpcoll (see the usage sketch below). Note that latencies generated by tokens, lockmgr, and mutex locks do not necessarily reflect actual lost cpu time, as the kernel will schedule other threads while those are blocked, if other threads are available.
* Formalize TSC operations more, supplying types (tsc_uclock_t and tsc_sclock_t).
* Reinstrument lockmgr, mutex, token, and spinlocks to use the new indefinite timing interface.
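
A minimal usage sketch of the init/check/done shape such an interface typically takes; beyond the indefinite*() prefix and header names from the commit, the exact signatures and the try_acquire() helper are assumptions:

    indefinite_info_t info;

    indefinite_init(&info, "lock", 1, 'L');
    while (!try_acquire(lk)) {
            /* samples the TSC, accumulating latency into pcpu stats */
            indefinite_check(&info);
    }
    indefinite_done(&info);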

kernel - Fix GCC reordering problem with td_critcount

* Wrap all ++td->td_critcount and --td->td_critcount use cases with an inline which executes cpu_ccfence() before and after, to guarantee that GCC does not try to reorder the operation around critical memory changes (see the sketch below).
* This fixes a race in lockmgr() and possibly a few other places too.
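
A sketch of the wrapper shape this describes; cpu_ccfence() is the real compiler-fence macro, while the wrapper name is an assumption:

    /*
     * cpu_ccfence() is a compiler memory barrier; one on each side
     * pins the increment so GCC cannot migrate it across surrounding
     * critical memory operations.
     */
    static __inline void
    crit_enter_raw(thread_t td)
    {
            cpu_ccfence();
            ++td->td_critcount;
            cpu_ccfence();
    }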

kernel - Fix sys% time reporting

* Fix system time reporting in systat -vm 1, systat -pv 1, and process stats.
* Basically the issue is that when coincident systimer interrupts occur, such as when the statclock, hardclock, and schedclock all fire at the same time, the statclock must execute first in order to properly detect the state the current thread is in. If it does not, it may see a lwkt thread scheduled by one of the other systimers and improperly dock the current thread as being in 'system' time.
* The various systimer interrupts could wind up out of phase and desynchronized due to the tsc_frequency not being perfectly divisible by the requested frequencies. In addition, various timers could queue in an undesirable order due to being different integral frequencies of each other.
* Refactor the systimer API a bit, adding new functions which guarantee synchronization for nominally requested frequencies and which guarantee ordering for coincident systimer events (which statclock uses). This should completely solve the problem (see the sketch below).
* Also, if the RQF_INTPEND flag is set, count the time as interrupt time. This will give us a slightly more accurate understanding of interrupt overhead (alternatively we could do this test just for the case where curthread is the idlethread, which might be more accurate).
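
A hedged sketch of the phase-alignment idea only; sys_cputimer and sysclock_t are real, while the function name and rounding scheme are assumptions:

    /*
     * Derive each periodic systimer's period from the common timer
     * base with identical truncation, so timers whose frequencies are
     * integral multiples of each other stay phase-aligned and fire
     * coincidentally (letting the ordering rules run statclock first).
     */
    static sysclock_t
    systimer_aligned_period(int freq)
    {
            return (sys_cputimer->freq / freq);
    }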

kernel - Fix bottlenecks that develop when many processes are running

* When a large number of processes or threads are running (in the tens of thousands or more), a number of O(n) or O(ncpus) bottlenecks can develop. These bottlenecks do not develop when only a few thousand threads are present. By fixing these bottlenecks, and assuming kern.maxproc is autoconfigured or manually set high enough, DFly can now handle hundreds of thousands of active processes running, polling, sleeping, whatever.

  Tested to around 400,000 discrete processes (no shared VM pages) on a 32-thread dual-socket Xeon system. Each process is placed in a 1/10 second sleep loop using umtx timeouts:

  baseline - (before changes) system bottlenecked starting at around the 30,000 process mark, eating all available cpu with a high IPI rate from hash collisions, and other unrelated user processes bogged down due to the scheduling overhead.
  200,000 processes - system settles down to 45% idle and a low IPI rate.
  220,000 processes - system 30% idle and low IPI rate.
  250,000 processes - system 0% idle and low IPI rate.
  300,000 processes - system 0% idle and low IPI rate.
  400,000 processes - scheduler begins to bottleneck again after the 350,000 mark while the process test is still in its fork/exec loop. Once all 400,000 processes are settled down, the system behaves fairly well: 0% idle with a modest IPI rate averaging 300 IPI/sec/cpu (due to hash collisions in the wakeup code).

* More work will be needed to better handle processes with massively shared VM pages. It should also be noted that the system does a *VERY* good job allocating and releasing kernel resources during this test using discrete processes. It can kill 400,000 processes in a few seconds when I ^C the test.

* Change lwkt_enqueue()'s linear td_runq scan into a double-ended scan (see the sketch after this list). This bottleneck does not arise when large numbers of processes are running in usermode, because typically only one user process per cpu will be scheduled to LWKT. However, it does arise when large numbers of threads are woken up in-kernel. While in-kernel, a thread schedules directly to LWKT. Round-robin operation tends to result in appends to the tail of the queue, so this optimization saves an enormous amount of cpu time when large numbers of threads are present.

* Limit ncallout to ~5 minutes worth of ring. The calculation code is primarily designed to allocate less space on low-memory machines, but it would also cause an excessively-sized ring to be allocated on large-memory machines. 512MB was observed on a 32-way box.

* Remove vm_map->hint, which had basically stopped functioning in a useful manner. Add a new vm_map hinting mechanism that caches up to four (size, align) start addresses for vm_map_findspace(). This cache is used to quickly index into the linear vm_map_entry list before entering the linear search phase. It fixes a serious bottleneck that arises from vm_map_findspace()'s linear scan of the vm_map_entry list when the kernel_map becomes fragmented, typically when the machine is managing a large number of processes or threads (in the tens of thousands or more). This will also reduce overheads for processes with highly fragmented vm_maps.

* Dynamically size the action_hash[] array in vm/vm_page.c. This array is used to record blocked umtx operations. The limited size of the array could result in an excessive number of hash entries when a large number of processes/threads are present in the system. Again, the effect is noticed as the number of threads exceeds a few tens of thousands.
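
A hedged sketch of the double-ended insertion scan; gd_tdrunq, td_threadq, and the TAILQ macros are real, while the priority-ordering details here are illustrative of the technique only:

    /*
     * Walk inward from both ends of the priority-ordered run queue so
     * an insertion near either end (the common wakeup/append case) is
     * found in a few steps instead of a full head-to-tail scan.
     */
    fwd = TAILQ_FIRST(&gd->gd_tdrunq);
    rev = TAILQ_LAST(&gd->gd_tdrunq, lwkt_queue);
    for (;;) {
            if (rev == NULL || td->td_pri <= rev->td_pri) {
                    /* found from the tail side (or queue is empty) */
                    if (rev)
                            TAILQ_INSERT_AFTER(&gd->gd_tdrunq, rev,
                                               td, td_threadq);
                    else
                            TAILQ_INSERT_HEAD(&gd->gd_tdrunq, td,
                                              td_threadq);
                    break;
            }
            if (td->td_pri > fwd->td_pri) {
                    /* found from the head side */
                    TAILQ_INSERT_BEFORE(fwd, td, td_threadq);
                    break;
            }
            fwd = TAILQ_NEXT(fwd, td_threadq);
            rev = TAILQ_PREV(rev, lwkt_queue, td_threadq);
    }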

kernel - Fix excessive call stack depth on stuck interrupt

* Fix an issue where a stuck level interrupt can result in an excessively deep call stack and a possible panic.
* Fixed by disallowing thread preemption when curthread->td_nest_count is >= 2 (see the sketch below). The critical-section count test is not sufficient for the fast-interrupt unpend -> preemption case.
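
A minimal sketch of where such a guard sits in lwkt_preempt(); the function and td_nest_count are real, while the surrounding checks shown are assumptions:

    void
    lwkt_preempt(thread_t ntd, int critcount)
    {
            thread_t td = curthread;

            /*
             * Refuse to preempt when interrupt nesting is already
             * deep; otherwise a stuck level interrupt keeps unpending
             * and growing the call stack until the kernel panics.
             */
            if (td->td_nest_count >= 2) {
                    ++preempt_miss;     /* counter name illustrative */
                    return;
            }
            /* ... existing critical-section and priority tests ... */
    }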