Matthew Dillon [Sun, 13 Aug 2017 06:35:47 +0000 (23:35 -0700)]
kernel - Scale tsleep() performance vs many (thousands) of processes
* In situations where a huge number of processes or threads are present
and sleeping (that is, more than a few thousand), the global cpumask hash
table used by tsleep() would saturate and effectively cause any wakeup()
call to broadcast to all CPUs.
* Refactor the tsleep initialization code to allow the global cpumask
hash table and the pcpu hash tables to be dynamically allocated.
* Allocate a MUCH larger global cpumask hash table, and significantly
smaller pcpu hash tables. The global cpumask hash table is now
sized to approximate 2 * maxproc, greatly reducing cpumask collisions
when large numbers of processes exist in the system.
The pcpu hash tables can be smaller without affecting performance. This
will simply result in more entries in each queue which are trivially
iterated.
Nominal maxproc ~32,000 -> in the noise (normal desktop system)
Nominal maxproc ~250,000 -> 16MB worth of hash tables (on a 128G box)
Maximal maxproc ~2,000,000 -> 122MB worth of hash tables (on a 128G box)
* Remove the unused sched_quantum sysctl and variable.
* Tested by running a pipe() chain through 900,000 processes. The
end-to-end latency dropped from 25 seconds to 10 seconds and the
pcpu IPI rate dropped from 60,000 IPIs/cpu to 5000 IPIs/cpu. This
is still a bit more than ideal, but much better than before.
* Fix a low-memory panic in zalloc(). A possible infinite recursion
was not being properly handled.
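A rough sketch of why the global cpumask hash saturates, and why sizing it to about 2 * maxproc helps. This is not the kernel code: the gmask_* names, the modulo hash, and the 64-cpu mask width are all assumptions made for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical table: one 64-bit cpumask per slot (assumes <= 64 cpus). */
struct gmask_table {
    uint64_t *masks;
    size_t    nslots;
};

/* Size the table to roughly 2 * maxproc, as the commit message describes. */
size_t gmask_slots_for(size_t maxproc)
{
    return maxproc * 2;
}

int gmask_init(struct gmask_table *t, size_t nslots)
{
    t->masks = calloc(nslots, sizeof(*t->masks));
    t->nslots = t->masks ? nslots : 0;
    return t->masks ? 0 : -1;
}

/* Placeholder hash; the real hash function is not shown in the log. */
size_t gmask_slot(const struct gmask_table *t, uintptr_t ident)
{
    return (size_t)(ident % t->nslots);
}

/* A thread on 'cpu' goes to sleep on 'ident': mark that cpu in the slot. */
void gmask_note_sleeper(struct gmask_table *t, uintptr_t ident, int cpu)
{
    t->masks[gmask_slot(t, ident)] |= (uint64_t)1 << cpu;
}

/* wakeup(ident) must notify every cpu whose bit is set in the slot. */
uint64_t gmask_wakeup_targets(const struct gmask_table *t, uintptr_t ident)
{
    return t->masks[gmask_slot(t, ident)];
}
```

With too few slots, unrelated sleep idents share a slot and the mask fills up, so every wakeup() targets every cpu. With more slots than sleepers, most slots carry only the cpus that actually have a sleeper for that ident.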
Matthew Dillon [Sat, 12 Aug 2017 20:44:54 +0000 (13:44 -0700)]
kernel - Change maxproc cap calculation
* Increase the calculation for the maxproc cap based on physical ram.
This allows a machine with 128GB of ram to maxproc past a million,
though it should be noted that PIDs are only 6 digits, so for now
a million processes is the actual limit.
Matthew Dillon [Sat, 12 Aug 2017 19:24:16 +0000 (12:24 -0700)]
kernel - Break up scheduler and loadavg callout
* Change the scheduler and loadavg callouts from cpu 0 to all cpus, and
adjust the allproc_scan() and alllwp_scan() to segment the hash table
when asked.
Every cpu is now tasked with handling the nominal scheduler recalc and
nominal load calculation for a portion of the process list. The portion
is unrelated to which cpu(s) the processes are actually scheduled on,
it is strictly a way to spread the work around, split up by hash range.
* Significantly reduces cpu 0 stalls when a large number of user processes
or threads are present (that is, in the tens of thousands or more). In
the test below, before this change, cpu 0 was straining under 40%+
interrupt load (from the callout). After this change the load is spread
across all cpus, approximately 1.5% per cpu.
* Tested with 400,000 running user processes on a 32-thread dual-socket
xeon (yes, these numbers are real):
12:27PM up 8 mins, 3 users, load avg: 395143.28, 270541.13, 132638.33
12:33PM up 14 mins, 3 users, load avg: 399496.57, 361405.54, 225669.14
* NOTE: There are still a number of other non-segmented allproc scans in
the system, particularly related to paging and swapping.
* NOTE: Further spreading-out of the work may be needed, by using a more
frequent callout and smaller hash index range for each.
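A minimal sketch of splitting a hash table into per-cpu ranges, in the spirit of the segmented allproc_scan() described above. The struct and function names are hypothetical, and the even split is an assumption; the log does not show how the kernel chooses each range.

```c
#include <stddef.h>

/* Half-open bucket range [begin, end) handled by one cpu. */
struct hash_segment {
    size_t begin;
    size_t end;
};

/*
 * Divide nbuckets as evenly as possible among ncpus.  The ranges are
 * contiguous and together cover every bucket exactly once.
 */
struct hash_segment segment_for_cpu(size_t nbuckets, unsigned cpu,
                                    unsigned ncpus)
{
    struct hash_segment s;

    s.begin = nbuckets * cpu / ncpus;
    s.end = nbuckets * (cpu + 1) / ncpus;
    return s;
}
```

Each cpu's callout would then walk only its own bucket range, regardless of where the processes in those buckets are scheduled.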
Matthew Dillon [Sat, 12 Aug 2017 18:40:30 +0000 (11:40 -0700)]
kernel - Bump wakeup hash size a little
* Bump from 4001 to 8191 entries to reduce chain length to help situations
where a large number of user threads are in a wait state (in the tens or
hundreds of thousands of threads).
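For a sense of scale, the average chain length is simply waiters divided by slots, assuming waiters spread evenly over the table. The helper name is made up for illustration.

```c
/* Average chain length, rounded up, for nthreads spread over nslots. */
unsigned long avg_chain_len(unsigned long nthreads, unsigned long nslots)
{
    return (nthreads + nslots - 1) / nslots;
}
```

With 100,000 waiting threads this gives about 25 entries per chain at 4001 slots and about 13 at 8191 slots.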
Matthew Dillon [Sat, 12 Aug 2017 18:16:26 +0000 (11:16 -0700)]
kernel - loadavg structure 32->64 bit fields
* The loadavg structure overflows when a large number of processes
are running. Yes, I in fact got it to overflow. Change the load
fields from 32 to 64 bits.
* Tested to 400,000 runnable processes.
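One way such an overflow can arise, sketched under the assumption that load is kept in the conventional BSD fixed-point form (11 fractional bits) and decayed with a multiply. The LA_* constants and function names are illustrative and not taken from the DragonFly source.

```c
#include <stdint.h>

#define LA_FSHIFT 11u                 /* fractional bits (assumed) */
#define LA_FSCALE (1u << LA_FSHIFT)
#define LA_CEXP   1884u               /* roughly exp(-1/12) * LA_FSCALE */

/* One decay step using 32-bit arithmetic; the products wrap silently. */
uint32_t loadavg_step32(uint32_t ldavg, uint32_t nrun)
{
    return (LA_CEXP * ldavg +
            nrun * LA_FSCALE * (LA_FSCALE - LA_CEXP)) >> LA_FSHIFT;
}

/* The same step with 64-bit fields; the intermediate products fit. */
uint64_t loadavg_step64(uint64_t ldavg, uint64_t nrun)
{
    return (LA_CEXP * ldavg +
            nrun * LA_FSCALE * (LA_FSCALE - LA_CEXP)) >> LA_FSHIFT;
}
```

At a steady load of 400,000 the 64-bit step returns the same fixed-point value it was given, while the 32-bit step cannot return anything above a load of 1024 because its intermediate sum is truncated to 32 bits before the shift.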
Matthew Dillon [Sat, 12 Aug 2017 17:26:17 +0000 (10:26 -0700)]
kernel - Fix bottlenecks that develop when many processes are running
* When a large number of processes or threads are running (in the tens of
thousands or more), a number of O(n) or O(ncpus) bottlenecks can develop.
These bottlenecks do not develop when only a few thousand threads
are present.
By fixing these bottlenecks, and assuming kern.maxproc is autoconfigured
or manually set high enough, DFly can now handle hundreds of thousands
of active processes running, polling, sleeping, whatever.
Tested to around 400,000 discrete processes (no shared VM pages) on
a 32-thread dual-socket Xeon system. Each process is placed in a
1/10 second sleep loop using umtx timeouts:
baseline - (before changes), system bottlenecked starting
at around the 30,000 process mark, eating all
available cpu, high IPI rate from hash
collisions, and other unrelated user processes
bogged down due to the scheduling overhead.
200,000 processes - System settles down to 45% idle, and low IPI
rate.
220,000 processes - System 30% idle and low IPI rate
250,000 processes - System 0% idle and low IPI rate
300,000 processes - System 0% idle and low IPI rate.
400,000 processes - Scheduler begins to bottleneck again after the
350,000 mark, while the test is still in its
fork/exec loop.
Once all 400,000 processes are settled down,
system behaves fairly well. 0% idle, modest
IPI rate averaging 300 IPI/sec/cpu (due to
hash collisions in the wakeup code).
* More work will be needed to better handle processes with massively
shared VM pages.
It should also be noted that the system does a *VERY* good job
allocating and releasing kernel resources during this test using
discrete processes. It can kill 400,000 processes in a few seconds
when I ^C the test.
* Change lwkt_enqueue()'s linear td_runq scan into a double-ended scan.
This bottleneck does not arise when large numbers of processes are
running in usermode, because typically only one user process per cpu
will be scheduled to LWKT.
However, this bottleneck does arise when large numbers of threads
are woken up in-kernel. While in-kernel, a thread schedules directly
to LWKT. Round-robin operation tends to result in appends to the tail
of the queue, so this optimization saves an enormous amount of cpu
time when large numbers of threads are present.
* Limit ncallout to ~5 minutes worth of ring. The calculation code is
primarily designed to allocate less space on low-memory machines,
but will also cause an excessively-sized ring to be allocated on
large-memory machines. 512MB was observed on a 32-way box.
* Remove vm_map->hint, which had basically stopped functioning in a
useful manner. Add a new vm_map hinting mechanism that caches up to
four (size, align) start addresses for vm_map_findspace(). This cache
is used to quickly index into the linear vm_map_entry list before
entering the linear search phase.
This fixes a serious bottleneck that arises due to vm_map_findspace()'s
linear scan of the vm_map_entry list when the kernel_map becomes
fragmented, typically when the machine is managing a large number of
processes or threads (in the tens of thousands or more).
This will also reduce overheads for processes with highly fragmented
vm_maps.
* Dynamically size the action_hash[] array in vm/vm_page.c. This array
is used to record blocked umtx operations. The limited size of the
array could result in an excessive number of hash entries when a large
number of processes/threads are present in the system. Again, the
effect is noticed as the number of threads exceeds a few tens of
thousands.
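The double-ended td_runq scan mentioned for lwkt_enqueue() above can be sketched as follows. This is a stand-alone illustration, not the LWKT code: the struct names are hypothetical, and it assumes the queue is kept in descending priority order with equal priorities handled FIFO.

```c
#include <stddef.h>

struct thr {
    int         pri;
    struct thr *prev;
    struct thr *next;
};

struct runq {
    struct thr *head;
    struct thr *tail;
};

/*
 * Insert td keeping the queue sorted by descending priority.  The scan
 * advances one cursor from the head and one from the tail in lock-step,
 * so a round-robin append is found on the first comparison instead of
 * after walking the whole queue.
 */
void runq_enqueue(struct runq *q, struct thr *td)
{
    struct thr *f = q->head;    /* forward cursor */
    struct thr *b = q->tail;    /* backward cursor */
    struct thr *after;          /* td is linked after this; NULL = head */

    for (;;) {
        if (b == NULL || b->pri >= td->pri) {
            after = b;          /* belongs right behind b */
            break;
        }
        if (f->pri < td->pri) {
            after = f->prev;    /* belongs right in front of f */
            break;
        }
        f = f->next;
        b = b->prev;
    }

    td->prev = after;
    td->next = after ? after->next : q->head;
    if (td->prev)
        td->prev->next = td;
    else
        q->head = td;
    if (td->next)
        td->next->prev = td;
    else
        q->tail = td;
}
```

The cost is bounded by the distance to the nearer end of the queue, which is why tail-heavy insertion patterns become cheap.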
Matthew Dillon [Thu, 10 Aug 2017 05:20:52 +0000 (22:20 -0700)]
kernel - Lower VM_MAX_USER_ADDRESS to finalize work-around for Ryzen bug
* Reduce VM_MAX_USER_ADDRESS by 2MB, effectively making the top 2MB of the
user address space unmappable. The user stack now starts 2MB down from
where it did before. Theoretically we only need to reduce the top of
the user address space by 4KB, but doing it by 2MB may be more useful for
future page table optimizations.
* As per AMD, Ryzen has an issue when the instruction pre-fetcher crosses
from canonical to non-canonical address space. This can only occur at
the top of the user stack.
In DragonFlyBSD, the signal trampoline resides at the top of the user stack
and an IRETQ into it can cause a Ryzen box to lockup and destabilize due
to this action. The bug case was basically two cpu threads on the same
core, one in a cpu-bound loop of some sort while the other takes a normal
UNIX signal (causing the IRETQ into the signal trampoline). The IRETQ
microcode freezes until the cpu-bound loop terminates, preventing the
cpu thread from being able to take any interrupt or IPI whatsoever for
the duration, and the cpu may destabilize afterwards as well.
* The pre-fetcher is somewhat heuristic, so just moving the trampoline
down is no guarantee if the top 4KB of the user stack is mapped or mappable.
It is better to make the boundary unmappable by userland.
* Bug first tracked down by myself in early 2017. AMD validated the bug
and determined that unmapping the boundary page completely solves the
issue.
* Also retain the code which places the signal trampoline in its own page
so we can maintain separate protection settings for the code, and make it
read-only (R+X).
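The address arithmetic behind the guard region, as a small sketch. It assumes 48-bit virtual addresses (lower canonical half ending at 0x00007FFFFFFFFFFF) and that user space previously ran right up to that boundary; the macro and function names are invented for illustration and are not the kernel's definitions.

```c
#include <stdint.h>

/* First non-canonical address above the lower half (48-bit VAs). */
#define CANONICAL_LOW_END 0x0000800000000000ULL
/* Size of the region left unmappable below the boundary. */
#define USER_GUARD_BYTES  (2ULL * 1024 * 1024)
/* Hypothetical stand-in for the lowered user address limit. */
#define USER_TOP          (CANONICAL_LOW_END - USER_GUARD_BYTES)

/* Nonzero if [start, start + len) lies entirely below USER_TOP. */
int user_range_ok(uint64_t start, uint64_t len)
{
    return len <= USER_TOP && start <= USER_TOP - len;
}
```

Under these assumptions no user mapping can include the page adjacent to the non-canonical hole, so an instruction fetch from user code never sits next to the boundary.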
Matthew Dillon [Wed, 9 Aug 2017 00:14:18 +0000 (17:14 -0700)]
kcollect - Fix grammar
* gunplot -> gnuplot
Reported-by: Steve Horan <steve@horan.net.au>
Sepherosa Ziehau [Tue, 8 Aug 2017 07:29:29 +0000 (15:29 +0800)]
mbuf: Minor style change.
Sepherosa Ziehau [Mon, 7 Aug 2017 09:43:54 +0000 (17:43 +0800)]
netisr: Simplify assertion related bits
Sepherosa Ziehau [Mon, 7 Aug 2017 08:45:26 +0000 (16:45 +0800)]
stf: Make route per-cpu. And it should run in the first netisr_ncpus netisrs.
Sepherosa Ziehau [Mon, 7 Aug 2017 07:57:06 +0000 (15:57 +0800)]
gre: Rework routing facilities.
- Make route per-cpu.
- Make sure that all routing related operation happens in the first
netisr_ncpus netisrs.
Sepherosa Ziehau [Fri, 4 Aug 2017 12:12:16 +0000 (20:12 +0800)]
route: Minor cleanup
Sepherosa Ziehau [Fri, 4 Aug 2017 12:06:18 +0000 (20:06 +0800)]
route: Delete ortentry, SIOC{ADD,DEL}RT and RTM_OLD{ADD,DEL}
They have not been used for more than a decade.
Sepherosa Ziehau [Fri, 4 Aug 2017 11:07:30 +0000 (19:07 +0800)]
gif: It should only run in the first netisr_ncpus netisrs
Sepherosa Ziehau [Fri, 4 Aug 2017 11:03:19 +0000 (19:03 +0800)]
route: Add rtfree_async.
This prepares to run rtalloc/lookup/free in the first netisr_ncpus
netisrs. This function is only intended to be used on slow path;
e.g. detach.
Sascha Wildner [Sun, 6 Aug 2017 22:45:23 +0000 (00:45 +0200)]
ps(1): Remove -I${.CURDIR}/../../sys from the Makefile.
FreeBSD did that in 2001 (r76812). It's not a bootstrap tool, so there
is no reason to mix up include paths.
Matthew Dillon [Sun, 6 Aug 2017 03:07:34 +0000 (20:07 -0700)]
kernel - Move sigtramp even lower
* Attempt to work around a Ryzen cpu bug by moving sigtramp even lower than
we have already.
Sascha Wildner [Sat, 5 Aug 2017 23:54:29 +0000 (01:54 +0200)]
kdump(8): Add more support for printing flag etc. names.
* New support for: eaccess() modes, faccessat() modes and atflag,
fchmodat() atflag, fchownat() atflag, fstatat() atflag,
linkat() atflag, unlinkat() atflag, utimensat() atflag,
getvfsstat() flags, lchflags() flags, chflagsat() flags and atflag,
kenv() actions, usched_set() commands, sys_checkpoint() types,
procctl() commands, mountctl() operations, and varsym_{list,set}()
levels.
* Fix flags printing in chflags() and fchflags().
* Better separate mount() and getfsstat() flags definitions.
* Adjust grepping for fcntl() commands, send/recvmsg() etc. flags to
include flags with more than one underscore in their name, like
F_DUPFD_CLOEXEC.
Still missing: extexit()'s 'how' argument.
Sascha Wildner [Sat, 5 Aug 2017 12:22:24 +0000 (14:22 +0200)]
kdump(1): Oops, remove forgotten include.
Sascha Wildner [Sat, 5 Aug 2017 12:13:42 +0000 (14:13 +0200)]
kdump(1): Remove unused ptraceopname(). ptrace ops are handled specially.
Matthew Dillon [Sat, 5 Aug 2017 04:47:23 +0000 (21:47 -0700)]
procfs - Fix blocked lock condition
* Two procfs races can result in a lock being blocked forever. Rip out
the old single-variable global procfs lock and per-node lock and replace
with a normal lockmgr lock.
* The original lock existed from a time when all of procfs was wrapped with
a global lock. This is no longer the case.
Matthew Dillon [Sat, 5 Aug 2017 04:38:10 +0000 (21:38 -0700)]
kernel - Fix serious permissions bug for sticky directories
* An optimization improperly bypassed the sticky-bit test, creating
a security issue with /tmp and /var/tmp.
* Fix by disabling the optimization for the second-to-last path component.
Any prior components retain the optimization, so long directory paths
are still well-optimized.
Sascha Wildner [Sat, 5 Aug 2017 00:51:05 +0000 (02:51 +0200)]
kdump(1): Add pathconf(2) variable name printing.
Sascha Wildner [Sat, 5 Aug 2017 00:01:41 +0000 (02:01 +0200)]
kdump(1): Add clockid_t name printing.
Sascha Wildner [Fri, 4 Aug 2017 23:40:13 +0000 (01:40 +0200)]
kdump(1): Allow auto switch funcs to not have an 'invalid' default case.
Sascha Wildner [Fri, 4 Aug 2017 20:00:11 +0000 (22:00 +0200)]
<sys/un.h>: Clean up namespace.
Sascha Wildner [Fri, 4 Aug 2017 19:59:39 +0000 (21:59 +0200)]
Remove some no longer used header files.
* <sys/device_port.h>: Kernel only, last consumer was scsi_low which was
removed in 075c6d38244abd0b0c8dc9b2974ef574b9180bf5 in January.
* <sys/ieee754.h>: NetBSD header that is supposed to be included from
the implementation's ieee.h header, which ours no longer does, so
it's a leftover from some earlier state of our libm.
* <nlist_aout.h>: Last consumer was removed in 2009 when a.out support
was dropped from modules (70eee1c9363092cae6ef68466a44ad68fa22e183).
Sascha Wildner [Fri, 4 Aug 2017 11:43:57 +0000 (13:43 +0200)]
/boot/defaults/loader.conf: Fix typo.
Sepherosa Ziehau [Fri, 4 Aug 2017 08:48:28 +0000 (16:48 +0800)]
ipid: Call ip_randomid() on all CPUs.
- Remove unnecessary crit sections.
- Remove unapplied comment.
- Minor style fixes.
Sepherosa Ziehau [Fri, 4 Aug 2017 03:39:04 +0000 (11:39 +0800)]
pfsync: Send packet in netisr0 and do it asynchronously.
Sepherosa Ziehau [Fri, 4 Aug 2017 03:38:35 +0000 (11:38 +0800)]
mbuf: Add message header for generic mbuf sending/receiving.
Sascha Wildner [Thu, 3 Aug 2017 21:41:04 +0000 (23:41 +0200)]
ifconfig.8: Fix typo.
Sepherosa Ziehau [Thu, 3 Aug 2017 09:19:11 +0000 (17:19 +0800)]
inpcb: Simplify inpcb marker interface
Sepherosa Ziehau [Thu, 3 Aug 2017 08:58:25 +0000 (16:58 +0800)]
inpcb: All inpcb accessing should be from first netisr_ncpus netisrs
Sepherosa Ziehau [Thu, 3 Aug 2017 08:15:16 +0000 (16:15 +0800)]
inet: ip_{output/input}() should only run in first netisr_ncpus netisrs
Sepherosa Ziehau [Thu, 3 Aug 2017 07:47:52 +0000 (15:47 +0800)]
udp: It only runs in the first netisr_ncpus netisrs.
Sepherosa Ziehau [Thu, 3 Aug 2017 07:02:38 +0000 (15:02 +0800)]
systimer: Adjust systimers on their owner cpus.
Sepherosa Ziehau [Thu, 3 Aug 2017 05:37:43 +0000 (13:37 +0800)]
tcp: Prevent excessive IPI from draining TCP reassemble queues.
Sepherosa Ziehau [Thu, 3 Aug 2017 05:08:46 +0000 (13:08 +0800)]
inet: Prevent excessive IPI from draining PR cloned host routes.
Sepherosa Ziehau [Thu, 3 Aug 2017 05:07:46 +0000 (13:07 +0800)]
inet: Fix up draining flag setting.
Sepherosa Ziehau [Wed, 2 Aug 2017 09:37:35 +0000 (17:37 +0800)]
inet: Prevent excessive IPI from draining IPv4 fragments.
Sepherosa Ziehau [Wed, 2 Aug 2017 09:13:21 +0000 (17:13 +0800)]
inet6: Prevent excessive IPI from draining IPv6 fragments.
Sepherosa Ziehau [Wed, 2 Aug 2017 09:03:36 +0000 (17:03 +0800)]
net: Use PR_{FAST,SLOW}HZ, some code has the assumption of these macro usage.
Sascha Wildner [Wed, 2 Aug 2017 01:59:38 +0000 (03:59 +0200)]
efivar.8: Fix typos/improve language (taken from FreeBSD).
Sascha Wildner [Tue, 1 Aug 2017 10:36:04 +0000 (12:36 +0200)]
libthread_xu: Add a check for integer overflow (FreeBSD r321011).
Sepherosa Ziehau [Tue, 1 Aug 2017 09:46:30 +0000 (17:46 +0800)]
domain: Nuke pfslowtimo.
Sepherosa Ziehau [Tue, 1 Aug 2017 09:23:05 +0000 (17:23 +0800)]
ip: Don't use pr_slowtimo.
Sepherosa Ziehau [Tue, 1 Aug 2017 08:42:21 +0000 (16:42 +0800)]
igmp: Don't use pr_slowtimo.
Sepherosa Ziehau [Tue, 1 Aug 2017 08:22:00 +0000 (16:22 +0800)]
inet6: Drain IPv6 fragments in netisr0
Sepherosa Ziehau [Tue, 1 Aug 2017 07:40:56 +0000 (15:40 +0800)]
inet6: Dispatch frag6 slowtimo to netisr0 and stop using pr_slowtimo
Sepherosa Ziehau [Mon, 31 Jul 2017 08:51:58 +0000 (16:51 +0800)]
domain: Nuke pffasttimo
Sascha Wildner [Mon, 31 Jul 2017 08:46:36 +0000 (10:46 +0200)]
kernel: Fix wrong indentation in a few places.
Sascha Wildner [Mon, 31 Jul 2017 08:36:38 +0000 (10:36 +0200)]
kernel/sdhci: Remove wrong semicolon.
Sascha Wildner [Mon, 31 Jul 2017 08:32:04 +0000 (10:32 +0200)]
kernel/ieee80211: Add missing braces.
Sascha Wildner [Mon, 31 Jul 2017 08:27:12 +0000 (10:27 +0200)]
kernel/urtwn: Add missing braces.
Sepherosa Ziehau [Mon, 31 Jul 2017 11:41:26 +0000 (19:41 +0800)]
igmp: Use callout instead of pffasttimo.
Sepherosa Ziehau [Mon, 31 Jul 2017 03:50:46 +0000 (11:50 +0800)]
icmp6: Don't use pffasttimo and dispatch fasttimo to netisr0
Reported-by: ivadasz
Sascha Wildner [Sun, 30 Jul 2017 19:30:39 +0000 (21:30 +0200)]
kernel: Add FreeBSD's virtio_scsi(4) driver.
Thanks to ivadasz for figuring out the shutdown freeze issue
(see c022ffc9484ecf07d8d7c4fb918d84a6154367be).
Tested-by: ivadasz, Peter Cannici <turkchess123@gmail.com>
Sascha Wildner [Sun, 30 Jul 2017 19:21:29 +0000 (21:21 +0200)]
kernel/cam: Add CAM_SCSI_IT_NEXUS_LOST (in preparation for virtio_scsi(4)).
Sascha Wildner [Sun, 30 Jul 2017 09:20:14 +0000 (11:20 +0200)]
larn(6): Fix two "use of index before limits check" issues.
Imre Vadász [Sun, 30 Jul 2017 14:08:58 +0000 (16:08 +0200)]
Make sure that cam(4)'s dashutdown handler runs before DEVICE_SHUTDOWN().
Previously, the DEVICE_SHUTDOWN() callback of scsi drivers was running
before the final SYNCHRONIZE_CACHE scsi command was sent by cam(4). For
most drivers this was still fine, since usually the DEVICE_SHUTDOWN()
callback - if it's even implemented - only flushes the command queue.
This change avoids freezing at the end of shutdown which was known to
happen with the twa(4) and virtio_scsi(4) drivers.
The SHUTDOWN_PRI_SECOND priority is selected because it's so far unused
and sits in between the existing handlers in the shutdown_post_sync phase,
which are at SHUTDOWN_PRI_FIRST and at SHUTDOWN_PRI_DEFAULT.
Tested-by: swildner (on twa(4)), ivadasz (on virtio_scsi(4))
Matthew Dillon [Sun, 30 Jul 2017 17:59:05 +0000 (10:59 -0700)]
kernel - Fix kcollect swapuse%
* The calculation was improperly using vm_swap_size (which is really free swap
remaining) instead of vm_swap_max.
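The corrected calculation, sketched with hypothetical parameter names standing in for vm_swap_max (total swap) and vm_swap_size (free swap remaining):

```c
#include <stdint.h>

/*
 * Percentage of swap in use.  swap_max is the total and swap_free is
 * what remains, so used = max - free and the divisor must be the total.
 */
unsigned swap_used_pct(uint64_t swap_max, uint64_t swap_free)
{
    if (swap_max == 0 || swap_free > swap_max)
        return 0;
    return (unsigned)((swap_max - swap_free) * 100 / swap_max);
}
```

Dividing by the free amount instead of the total would inflate the percentage as swap fills, which is the kind of error the commit describes.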
Matthew Dillon [Sun, 30 Jul 2017 04:02:36 +0000 (21:02 -0700)]
kcollect - add swap% to plot output
* swap% used can be added easily, so go ahead and do it. Added to the bottom
graph.
Matthew Dillon [Sun, 30 Jul 2017 01:22:47 +0000 (18:22 -0700)]
kcollect - Fix gnuplot warning when -x -f is specified
* gnuplot warns about multiple set terminal commands when the refresh
occurs. Rearrange when the command is sent to avoid the problem.
Matthew Dillon [Sun, 30 Jul 2017 01:10:48 +0000 (18:10 -0700)]
kcollect - Add a smoothing option (-s)
* Add an option (-s) which smooths the plot. The smoothing algorithm uses
an exponential average with fast collapse to the high-side so spikes do
not get lost.
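One plausible reading of "exponential average with fast collapse to the high-side", as a sketch. The function name and the alpha parameter are assumptions; the log does not give the actual smoothing constants used by kcollect.

```c
/*
 * One smoothing step.  A sample above the running average replaces it
 * immediately so spikes stay visible; a sample below it pulls the
 * average down gradually with weight alpha (0 < alpha <= 1).
 */
double smooth_step(double avg, double sample, double alpha)
{
    if (sample > avg)
        return sample;
    return avg + alpha * (sample - avg);
}
```

For example, with alpha = 0.25 a spike from 10 to 100 shows up at once, and a following sample of 0 only lowers the plotted value to 75.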
Matthew Dillon [Sun, 30 Jul 2017 00:56:07 +0000 (17:56 -0700)]
kcollect - Add -t option to limit output
* Add the -t N option, limiting the output to the most recent N seconds
worth of samples. 'm'inutes, 'h'ours, and 'd'ays suffixes are allowed
for convenience.
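A sketch of how an argument such as "300", "5m", "2h" or "1d" could be turned into seconds. The function name and the -1 error convention are illustrative and not taken from the kcollect source.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Parse "N", "Nm", "Nh" or "Nd" into a number of seconds.
 * Returns -1 on malformed input, negative values, or overflow.
 */
long parse_span_seconds(const char *arg)
{
    char *end;
    long n, mult;

    errno = 0;
    n = strtol(arg, &end, 10);
    if (end == arg || errno != 0 || n < 0)
        return -1;

    switch (*end) {
    case '\0':
        return n;               /* plain seconds */
    case 'm':
        mult = 60;
        break;
    case 'h':
        mult = 3600;
        break;
    case 'd':
        mult = 86400;
        break;
    default:
        return -1;              /* unknown suffix */
    }
    if (end[1] != '\0' || n > LONG_MAX / mult)
        return -1;
    return n * mult;
}
```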
Matthew Dillon [Sun, 30 Jul 2017 00:38:52 +0000 (17:38 -0700)]
kcollect - Adjust ordering of gnuplot solids
* Adjust ordering so active use fills solids upwards instead of downwards.
Matthew Dillon [Sat, 29 Jul 2017 23:24:41 +0000 (16:24 -0700)]
kcollect - Implement gnuplot output feature
* Implement the gnuplot output feature. This feature currently
hard-selects a set of fields (fields cannot be specified).
Generates two graphs. The first collects memory statistics
and machine load. The second collects cpu utilization and
fault, syscall, and nlookup (file path resolution) rate.
* In gnuplot output mode, -f will cause the entire dataset to be
regenerated every 60 seconds (I don't see any other way to update
an existing gnuplot window).
* Finish implementing -o fields
Matthew Dillon [Sat, 29 Jul 2017 19:28:51 +0000 (12:28 -0700)]
kcollect - Fix swap text output
* Collection data for type 'm' is still stored in bytes.
* Fix field width for megabytes display field.
Matthew Dillon [Sat, 29 Jul 2017 18:42:59 +0000 (11:42 -0700)]
kernel - Store page statistics in bytes
* Store page statistics in bytes rather than pages. Pages aren't useful
for userland display and there is no reason to force userland to do the
conversion.
* Include a realtime timestamp along with ticks in the structure.
* Flesh out text output for kcollect. Reverse output order to print
oldest data first, so output from the -f option stays consistent.
Matthew Dillon [Sat, 29 Jul 2017 17:30:25 +0000 (10:30 -0700)]
kernel - Add a sampling history mechanism called kcollect (2)
* Add collection code for remaining base statistics.
* Round-up some calculations.
Sascha Wildner [Sat, 29 Jul 2017 09:54:31 +0000 (11:54 +0200)]
Restore WARNS in ftpd, it was just for testing.
Sascha Wildner [Sat, 29 Jul 2017 09:47:08 +0000 (11:47 +0200)]
ftpd(8): Remove weird line with just '#'.
Sascha Wildner [Sat, 29 Jul 2017 08:32:49 +0000 (10:32 +0200)]
ccdconfig(8): Add missing free().
Reported-by: dcb
Dragonfly-bug: <https://bugs.dragonflybsd.org/issues/3014>
Sascha Wildner [Sat, 29 Jul 2017 08:09:17 +0000 (10:09 +0200)]
kernel: Remove some variables that are only set but never used.
Reported-by: dcb
Dragonfly-bug: <https://bugs.dragonflybsd.org/issues/3019>
Sascha Wildner [Sat, 29 Jul 2017 08:08:34 +0000 (10:08 +0200)]
libpuffs: Fix two asserts.
Reported-by: dcb
Dragonfly-bug: <https://bugs.dragonflybsd.org/issues/3013>
Matthew Dillon [Sat, 29 Jul 2017 06:36:07 +0000 (23:36 -0700)]
kernel - Add a sampling history mechanism called kcollect
* Add a kernel API which automatically samples useful statistics on a
10-second period without needing a user program to poll it. This API
is enabled by default and can be disabled by setting kern.collect_samples=0
in /boot/loader.conf (or setting it higher, if desired).
The idea is for the kernel to always collect a solid amount of historical
data for various useful statistics such that a user can pull it all up
going back upwards of 23 hours (or more, depending on configured samples)
after the fact. "Oh, what happened recently"... bang.
* The sysctl provides sufficient information to userland to be able to
process the statistics dynamically, without necessarily having to know
what they are.
The sysctl can be cut short to request less data for ongoing incremental
collection, if desired.
* Implement "load" collection to start with as a test. Add #defines for
everything I want the kernel to collect. The kernel API's critical path
is lockless.
* Start working on a front-end user program called 'kcollect'. This program
will eventually generate fancy graphs via gnuplot and have a dbm interface
for collecting data continuously if desired.
Sascha Wildner [Fri, 28 Jul 2017 20:49:13 +0000 (22:49 +0200)]
<sys/malloc.h>: Remove an empty #ifdef.
Sascha Wildner [Fri, 28 Jul 2017 19:41:48 +0000 (21:41 +0200)]
Sync ACPICA with Intel's version 20170728.
* Support in the resource walking code for _DMA.
* Various additions and improvements.
* Fix various bugs and regressions.
For a more detailed list, please see sys/contrib/dev/acpica/changes.txt.
Sepherosa Ziehau [Thu, 27 Jul 2017 09:42:25 +0000 (17:42 +0800)]
polling: Simplify the code by using netisr_*msg functions.
Sepherosa Ziehau [Thu, 27 Jul 2017 09:42:09 +0000 (17:42 +0800)]
netisr: Add netisr_sendmsg_oncpu()
Sepherosa Ziehau [Thu, 27 Jul 2017 07:58:44 +0000 (15:58 +0800)]
bridge: It should only run in netisr_ncpus netisrs
Matthew Dillon [Thu, 27 Jul 2017 06:55:29 +0000 (23:55 -0700)]
hammer2 - Allow @LABEL to be omitted
* Allow the @LABEL in "<devicepath>@LABEL" to be omitted when mounting.
If omitted, H2 will automatically supply "@LOCAL".
* Provides convenience for simple use cases.
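The defaulting rule amounts to the following string handling, shown as a userland sketch. The function name is hypothetical, and the device paths in the usage note are examples only.

```c
#include <stdio.h>
#include <string.h>

/*
 * Copy the mount spec into out, appending "@LOCAL" when the caller
 * gave only a device path.  Returns 0 on success, -1 if out is too
 * small to hold the result.
 */
int h2_full_spec(const char *spec, char *out, size_t outlen)
{
    int n;

    if (strchr(spec, '@') != NULL)
        n = snprintf(out, outlen, "%s", spec);
    else
        n = snprintf(out, outlen, "%s@LOCAL", spec);
    return (n >= 0 && (size_t)n < outlen) ? 0 : -1;
}
```

So "/dev/da0s1d" is treated as "/dev/da0s1d@LOCAL", while "/dev/da0s1d@ROOT" is left untouched.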
Matthew Dillon [Thu, 27 Jul 2017 06:42:09 +0000 (23:42 -0700)]
hammer2 - Synchronize write-in-place feature
* Disable hole creation when the check mode is disabled on a file.
Writing zeros maintains any previously assigned block.
* Allows reserving storage for a file (such as an image) by dd'ing
/dev/zero into it, as long as check mode has also been disabled
on that file. No data CRCs or hashes will be computed or checked
for the file.
* Note that this should still work properly with snapshots. A snapshot
will force block reallocation on the master for any writes that cross
the snapshot boundary.
Matthew Dillon [Thu, 27 Jul 2017 05:34:50 +0000 (22:34 -0700)]
hammer2 - Consolidate backend rename ops
* Reduce frontend XOP ops required for rename from 3 to 2 by integrating
the unlink-target operation into xop_nrename. The xop_nrename backend
function now handles replacing the target namespace when it exists and
will also get rid of any duplicates as a safety.
* Adjust the frontend inode locking order to try to avoid deadlocks.
* Adjust iparent documentation.
* Properly set iparent in the rename operation. The iparent was not
being adjusted at all.
* Properly set iparent in the inode create operation. The iparent was
improperly being set to 0 instead of 1 when the parent directory was the
mount point.
Matthew Dillon [Wed, 26 Jul 2017 19:41:20 +0000 (12:41 -0700)]
libc - Fix bug in rcmdsh()
* rcmdsh() (which really nothing should be using any more anyway) used
a generic wait(NULL) to wait for a child to exit, but this can wind
up waiting for the wrong pid in a multi-threaded or multi-fork environment.
* Solved by waiting on the specific pid instead.
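The difference is easy to show outside of libc. The helper names below are made up for this sketch; only fork(), _exit() and waitpid() are real interfaces.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits immediately with the given code. */
pid_t spawn_exit(int code)
{
    pid_t pid = fork();

    if (pid == 0)
        _exit(code);
    return pid;
}

/*
 * Wait for one specific child and return its exit code, or -1.
 * wait(NULL) would instead reap whichever child happens to exit first.
 */
int wait_for_child(pid_t pid)
{
    int status;

    while (waitpid(pid, &status, 0) == -1) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With two children outstanding, waiting on each pid explicitly returns the right status for each one no matter which exits first.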
Matthew Dillon [Wed, 26 Jul 2017 19:39:46 +0000 (12:39 -0700)]
sshlockout - Improve manual page
* Rewrite the manual page, provide a more concise example.
* Suggest using pfctl -T expire <seconds> instead of -T flush for the
crontab entry.
Reported-by: Miroslav Lachman <000.fbsd@quip.cz>
Sascha Wildner [Wed, 26 Jul 2017 17:57:32 +0000 (19:57 +0200)]
libc/libpthread: Add clock_getcpuclockid() and pthread_getcpuclockid().
* Adjust clock_gettime() and clock_getres() to accept values obtained
this way.
* Also set _POSIX_CPUTIME and _POSIX_THREAD_CPUTIME, although we should
really support values obtained by these functions in clock_settime()
too.
Based on and taken from FreeBSD's code.
Reviewed-by: sephe
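A minimal usage sketch for the process variant, using only the POSIX interfaces named above. The wrapper function is hypothetical; pthread_getcpuclockid() works the same way for a thread handle and is left out to keep the example free of thread setup.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/*
 * Read the CPU time consumed by process 'pid' into *ts.
 * Returns 0 on success and nonzero on failure.
 */
int process_cpu_time(pid_t pid, struct timespec *ts)
{
    clockid_t cid;
    int error;

    /* clock_getcpuclockid() returns an error number, not -1/errno. */
    error = clock_getcpuclockid(pid, &cid);
    if (error != 0)
        return error;
    return clock_gettime(cid, ts) == 0 ? 0 : -1;
}
```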
Sascha Wildner [Wed, 26 Jul 2017 16:53:26 +0000 (18:53 +0200)]
Remove <stab.h> and <struct.h>.
* <stab.h> had information about the a.out related symbol table
format.
* <struct.h> was removed from FreeBSD in 2001.
Neither header is needed by anything in our tree or dports.
Sascha Wildner [Wed, 26 Jul 2017 12:05:42 +0000 (14:05 +0200)]
kernel/time: Change get_curthread_cputime() to get_thread_cputime().
Sascha Wildner [Wed, 26 Jul 2017 12:05:03 +0000 (14:05 +0200)]
<sys/types.h>: Add guard around lwpid_t, also put under __BSD_VISIBLE.
Sepherosa Ziehau [Mon, 24 Jul 2017 13:31:03 +0000 (21:31 +0800)]
altq/cbq: Drain pending callout and the corresponding netmsg.
Sepherosa Ziehau [Mon, 24 Jul 2017 12:59:03 +0000 (20:59 +0800)]
altq/cbq: Redispatch restart function to netisr0.
Now, all pseudo interfaces' if_start run in netisr_ncpus netisrs.
Matthew Dillon [Tue, 25 Jul 2017 07:40:04 +0000 (00:40 -0700)]
hammer2 - Update DESIGN document
* Update the DESIGN document to reflect changes.
Matthew Dillon [Tue, 25 Jul 2017 02:50:47 +0000 (19:50 -0700)]
hammer2 - correct readdir bug
* Correct a readdir iteration bug for the new DIRENT type.
Matthew Dillon [Tue, 25 Jul 2017 02:05:33 +0000 (19:05 -0700)]
hammer2 - Initial HARDLINK -> DIRENT replacement code
* Initial removal of the vestiges of the old embedded inode code. Inodes
were moved to the root directory long ago but directories still contain
dummy OBJTYPE_HARDLINK inodes instead of real directory entries to point
to the moved inodes. These inodes ate 1024 bytes of disk space for each
directory entry.
* Remove the dummy OBJTYPE_HARDLINK inodes and replace with new
BREF_TYPE_DIRENT blockrefs. These blockrefs represent directory
entries, and the entire dirent will fit in the blockref (requiring
no data ref) if the filename is <= 64 bytes.
* This new DIRENT mechanic significantly improves performance and reduces
storage overhead vs the previous mechanic, for obvious reasons.
Directory entries are now 128 bytes instead of 1024 bytes, and since they
are collected together in indirect blocks or (if <= 4 entries) simply
placed in the 4 blockrefs embedded in the directory inode, the related
I/O tends to be fairly optimal.
Only directory entries whose filenames are > 64 bytes long require an
additional data block reference. For now, due to other constraints,
we use the minimum H2 allocation size of 1KB for these, so certainly
space is wasted. But in real life there aren't actually a whole lot
of filenames that are that long so it should be fine.
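The space accounting described above, as a small sketch. The constants come from the numbers in this message (128-byte blockref, 64-byte embedded name, 1KB minimum allocation, 4 blockrefs in the inode); the macro and function names are invented for illustration.

```c
#include <stddef.h>

#define H2_BLOCKREF_BYTES  128   /* one directory entry blockref */
#define H2_EMBED_NAME_MAX  64    /* longest name stored in the blockref */
#define H2_MIN_ALLOC_BYTES 1024  /* minimum allocation for a long name */
#define H2_INODE_BLOCKREFS 4     /* blockrefs embedded in the inode */

/* Bytes consumed by one directory entry with a name of namelen bytes. */
size_t h2_dirent_bytes(size_t namelen)
{
    if (namelen <= H2_EMBED_NAME_MAX)
        return H2_BLOCKREF_BYTES;
    return H2_BLOCKREF_BYTES + H2_MIN_ALLOC_BYTES;
}

/* Nonzero if all entries fit in the directory inode's own blockrefs. */
int h2_dir_fits_in_inode(size_t nentries)
{
    return nentries <= H2_INODE_BLOCKREFS;
}
```

A 64-byte name costs 128 bytes, while a 65-byte name costs 1152 bytes because of the extra 1KB data block.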
Matthew Dillon [Sun, 23 Jul 2017 07:57:20 +0000 (00:57 -0700)]
hammer2 - Adjust blockref to create an embedded area, start dirent work
* Create a type-specific embedded area in the blockref structure. Move
data_count and inode_count into the new area. The blockref structure
size does not change.
* Adjust code to access data_count and inode_count conditionally for
BREF_TYPE_INODE, DATA, and INDIRECT types only.
* Now that we have abandoned embedding inodes directly in directories for
normal operation, start working on removing HAMMER2_OBJTYPE_HARDLINK and
creating a real directory entry abstraction.
The real directory entry abstraction will allow directory entries to be
directly embedded in blockref structures, without requiring a data
reference for any filename <= 64 bytes. This will be accomplished by
using the new embedded area in the blockref for the directory entry
header and the check area for the filename (up to 64 bytes).
This will significantly improve directory compactness and I/O efficiency
by reducing the directory entry overhead from 1152 bytes (1024 + 128) to
just 128 bytes and guaranteeing locality since the blockrefs are
collected together in indirect blocks. Another nice facet is that since
inodes can embed up to four direct blockrefs, any directory with <=
4 entries in it can embed those entries in the directory inode itself.
So small directories will wind up being VERY compact.
We haven't entirely abandoned embedding inodes in directories as
directory entries. In fact, the feature is still used for superroot
entries, and may be allowed in the future mixed into normal directories
for 'special' non-hardlinkable directory inodes for quota control,
subdirectory snapshot, and (NFS) export purposes.
Matthew Dillon [Sun, 23 Jul 2017 06:24:19 +0000 (23:24 -0700)]
hammer2 - Cleanup pass, remove unused fields and code
* Remove the unused per-inode cluster cache. The code isn't really
compatible with the XOP mechanism.
* Remove unused hammer2_xop_nlink() and related structures. Hardlinking
is handled through normal hammer2_inode_*() functions and no longer
needs an explicit backend.
* Remove the unused iocb.cluster field. iocb's are now exclusively
backend entities.
Sepherosa Ziehau [Mon, 24 Jul 2017 03:38:43 +0000 (11:38 +0800)]
altq: Fix typo