Sepherosa Ziehau [Sat, 21 Oct 2017 22:06:27 +0000 (06:06 +0800)]
ix: Free tx mbufs proactively.
This is preparation for dillon's upcoming sendfile patch.
Sascha Wildner [Sat, 28 Oct 2017 10:48:15 +0000 (12:48 +0200)]
Remove the ancient rdist(1) tool along with related periodic(8) scripts.
There are substitutes in dports' net/44bsd-rdist and net/rdist6.
Sascha Wildner [Sat, 28 Oct 2017 10:22:41 +0000 (12:22 +0200)]
kernel/hptmv: Use __DragonFly__ instead of __DragonFly_version.
Matthew Dillon [Sat, 28 Oct 2017 01:37:16 +0000 (18:37 -0700)]
kernel - Rewrite umtx_sleep() and umtx_wakeup() (2)
* Refactor some of the umtx code to do a better job dealing
with pageout and fork() races.
This still is not ideal.
Reported-by: profmakx
zrj [Fri, 27 Oct 2017 10:31:02 +0000 (13:31 +0300)]
kldload.8: Mention /boot/modules.local purpose.
Imre Vadász [Wed, 25 Oct 2017 19:11:08 +0000 (21:11 +0200)]
boot - Abort boot if EFI-framebuffer format is unsupported.
* At the moment we support 24-bit and 32-bit pixel formats; make sure we
  notify the user and abort booting when encountering an unsupported
  framebuffer format.
Imre Vadász [Sun, 22 Oct 2017 20:47:24 +0000 (22:47 +0200)]
syscons - Add 24bit pixel format support for EFI framebuffer.
Sepherosa Ziehau [Tue, 24 Oct 2017 04:54:44 +0000 (12:54 +0800)]
x86_64: Allow the TSC MP synchronization test to be disabled.
Matthew Dillon [Tue, 24 Oct 2017 06:07:13 +0000 (23:07 -0700)]
vmstat - Fix formatting
* 'fre' memory formatting width was incorrect, causing
the rest of the field to be incorrectly offset.
* Display more precision as the field width allows.
* Add -b for 'brief' mode to display less precision.
* Add -u for 'unformatted' mode to display raw numbers (columnar
output will not be aligned).
Matthew Dillon [Sun, 22 Oct 2017 07:02:18 +0000 (00:02 -0700)]
kernel - Use different queue iterator for emergency pager
* Adjust the queue iterators to minimize collisions between
  the two pageout daemons. The regular pageout daemon will iterate
  forwards while the emergency daemon will iterate backwards.
Matthew Dillon [Sun, 22 Oct 2017 06:17:26 +0000 (23:17 -0700)]
kernel - Use different cache_rover for emergency pager
* Fix an issue where the same cache_rover index was being used for both
pageout threads. This could result in a great deal of contention
and cache line bouncing between the threads due to the vm pagerq
spinlock.
* Fix by changing cache_rover to an array[2]. In addition, the
one pageout thread iterates its rover forwards while the other
runs its rover backwards, plus a little more code, to minimize
conflicts.
Aaron LI [Tue, 17 Oct 2017 15:19:48 +0000 (23:19 +0800)]
pf: Make pf_print_host() print IPv6 addresses correctly
Taken-from: OpenBSD sys/net/pf.c v.1.615
Aaron LI [Tue, 17 Oct 2017 14:55:45 +0000 (22:55 +0800)]
pf: Always skip "urpf-failed" test for IPv6 link local addresses
We could re-embed the scope-id before we do the route lookup,
but then we would just find the very interface we've received
the packet on anyway.
Taken-from: OpenBSD sys/net/pf.c v.1.625
Aaron LI [Tue, 17 Oct 2017 14:41:42 +0000 (22:41 +0800)]
pf: use IN6_IS_SCOPE_EMBED to check kernel-internal form addresses
Use IN6_IS_SCOPE_EMBED to check kernel-internal form addresses
(s6_addr16[1] filled).
Taken-from: OpenBSD sys/net/pf.c v.1.520
Matthew Dillon [Sun, 22 Oct 2017 00:31:43 +0000 (17:31 -0700)]
kernel - Zero out syncache_percpu properly
* The kmalloc for the syncache_percpu was not using M_ZERO
which I believe can cause cache_count to be some random
value. If this value is close to or larger than the syncache
limit, the garbage collector may run with no entries to reuse,
causing a NULL pointer dereference and panic.
Reported-by: pa3k #3088
Matthew Dillon [Sat, 21 Oct 2017 23:56:06 +0000 (16:56 -0700)]
hdaa - Remove dead code
* Remove dead code (an impossible condition).
Reported-by: dcb #3077
Matthew Dillon [Sat, 21 Oct 2017 23:52:06 +0000 (16:52 -0700)]
bc - Adjust bad syntax
* Adjust a badly syntaxed expression.
Reported-by: dcb #3079
Matthew Dillon [Sat, 21 Oct 2017 23:31:30 +0000 (16:31 -0700)]
swapon - Fix minor memory leak
* Fix a minor memory leak
Reported-by: liweitianux bug #3086
Matthew Dillon [Sat, 21 Oct 2017 22:02:05 +0000 (15:02 -0700)]
kernel - Cleanup token code, add simple exclusive priority (2)
* The priority mechanism revealed an issue with lwkt_switch()'s
fall-back code in dealing with contended tokens. The code was
refusing to schedule a lower-priority thread requesting an exclusive
lock when another thread on that same cpu was requesting a shared lock.
This creates a problem for the exclusive priority feature. More
pointedly, it also creates a fairness problem in the mixed lock
type use case generally.
* Change the mechanism to allow any thread polling on tokens to be
scheduled. The scheduler will still iterate in priority order.
This imposes a little extra overhead with regards to userspace
returns as a thread might be scheduled that then tries to return
to userland without being the designated user thread.
* This also fixes another bug that cropped up recently where a
32-way threaded program would sometimes not quickly schedule to
all 32 cpus, sometimes leaving one or two cpus idle for a few
seconds.
Sepherosa Ziehau [Sat, 21 Oct 2017 07:31:40 +0000 (15:31 +0800)]
inet6: Make non-prefix and directly reachable inet6 routes work.
e.g. inet6 routes added w/ -interface:
sysctl net.inet6.icmp6.nd6_onlink_ns_rfc4861=1
ifconfig ix0 inet6 2003:db8::1
route add -inet6 2003:db8:1::/64 -interface ix0
NOTE: net.inet6.icmp6.nd6_onlink_ns_rfc4861 MUST be on.
Sascha Wildner [Sat, 21 Oct 2017 10:30:50 +0000 (12:30 +0200)]
pstat.8: Add markup.
Aaron LI [Sat, 21 Oct 2017 05:34:39 +0000 (13:34 +0800)]
pstat.8: Remove a duplicate option of swapinfo
Matthew Dillon [Fri, 20 Oct 2017 23:42:42 +0000 (16:42 -0700)]
initrd - Add 'fetch'
* Add the 'fetch' program to the recovery shell. This is just too
useful a program to not have on the rescue ramdisk.
Matthew Dillon [Thu, 19 Oct 2017 20:27:22 +0000 (13:27 -0700)]
kernel - Cleanup token code, add simple exclusive priority
* Cleanup the token code and bring the comments up to date.
* Implement exclusive priority for the situation where a thread is
acquiring only a single shared token. We cannot implement exclusive
priority when multiple tokens are held because that can lead to
deadlocks. The token code guarantees no deadlocks.
Matthew Dillon [Thu, 19 Oct 2017 19:09:56 +0000 (12:09 -0700)]
kernel - Add p_ppid
* We have proc->p_pptr, but still needed a shared p->p_token to access
the ppid. Buckle under and add proc->p_ppid as well so getppid() can
run lockless.
* Adjust the vmtotal proc scan to use a shared proc->p_token instead
of an exclusive one.
Matthew Dillon [Thu, 19 Oct 2017 19:08:05 +0000 (12:08 -0700)]
kernel - Adjust tsc_delay()
* Add more cpu_pause()'s to the tsc_delay() loop to
be more hyper-thread friendly.
Sascha Wildner [Thu, 19 Oct 2017 20:17:09 +0000 (22:17 +0200)]
kernel/acpi: Ouch, add forgotten semicolon.
Sascha Wildner [Thu, 19 Oct 2017 20:15:53 +0000 (22:15 +0200)]
kernel/acpi: Use ACPI_UUID_LENGTH in acpi_eval_osc().
Matthew Dillon [Thu, 19 Oct 2017 04:43:27 +0000 (21:43 -0700)]
kernel - Increase ACPI_SEMAPHORES_MAX_PENDING
* Increase ACPI_SEMAPHORES_MAX_PENDING to a very large number.
Some of the ACPI code assumes that mutexes always succeed,
but the mutex code uses the semaphore code and the semaphore
code appears to have an arbitrary failure path based on the
number of concurrent requests.
* This fixes kernel confusion and console spam from the ACPI
subsystem when running sysctl -a concurrently on more than
4 threads.
Matthew Dillon [Thu, 19 Oct 2017 04:42:49 +0000 (21:42 -0700)]
kernel - Make certain sysctl's unlocked (2)
* Make most of the oid translation and iteration sysctls unlocked.
Suggested-by: mjg
Matthew Dillon [Thu, 19 Oct 2017 03:04:43 +0000 (20:04 -0700)]
kernel - Make certain sysctl's unlocked
* Automatically flag all SYSCTL_[U]INT, [U]LONG, and [U]QUAD
definitions CTLFLAG_NOLOCK. These do not have to be locked.
Will improve program startup performance a tad.
* Flag a ton of other sysctls used in program startup and
also 'ps' CTLFLAG_NOLOCK.
* For kern.hostname, interlock changes using XLOCK and allow
the sysctl to run NOLOCK, avoiding unnecessary cache line
bouncing.
Matthew Dillon [Thu, 19 Oct 2017 02:01:49 +0000 (19:01 -0700)]
kernel - Refactor sysctl locking
* Get rid of the global topology lock. Instead, use a pcpu shared lock
  and change the XLOCK code (which is barely ever executed) to obtain
  an exclusive lock on all cpus.
* Add CTLFLAG_NOLOCK, which disables the automatic per-OID sysctl lock.
Suggested-by: mjg (Mateusz Guzik)
Matthew Dillon [Wed, 18 Oct 2017 06:45:56 +0000 (23:45 -0700)]
kernel - Improve pmap hinting, improve performance
* Refactor pm_pvhint into two fields, pm_pvhint_pt and pm_pvhint_pte.
These are the most common hits.
* Consolidate the pv_entry lookup core into pv_entry_lookup() and
  implement the double hinting. Adjust pv_cache() to use the
  new fields.
* Improve pmap_object_init_pt() performance by using the new
RB_SCAN_NOLK() code and soft-busying the VM pages instead
of hard-busying them. If we have to deactivate the page, however,
we must hard-busy the page.
* Fix vm_prefault_quick() committed recently. When soft-busying
VM pages for pmap-entry, we have to fall-back to a hard-busy
if the page must be moved out of PQ_CACHE.
Matthew Dillon [Wed, 18 Oct 2017 06:42:43 +0000 (23:42 -0700)]
kernel - Improve concurrency in devfs VOPs
* Use LK_SHARED instead of LK_EXCLUSIVE whenever possible. This
significantly reduces lock congestion for getattr(), read(),
and readlink().
* Check the new D_QUICK flag and, if set, devfs can avoid lock
congestion for open() and close() on devices (e.g. such as on
/dev/null, /dev/urandom, etc).
Matthew Dillon [Wed, 18 Oct 2017 06:40:29 +0000 (23:40 -0700)]
kernel - Use soft-busy in vop_helper_read_shortcut()
* Use a soft-busy for related VM pages in vop_helper_read_shortcut().
This prevents locking conflicts related to concurrent read() operation
from causing the routine to abort.
Related software can run optimized I/O from related VM pages
concurrently without conflict. This occurs in particular with
the concurrent exec*() of dynamic executables.
Matthew Dillon [Wed, 18 Oct 2017 06:36:40 +0000 (23:36 -0700)]
kernel - Add D_QUICK device flag
* Add the D_QUICK device flag. This flag tells devfs that it does not
have to handle complex opencount interactions in VOP_OPEN and
VOP_CLOSE, allowing devfs to retain the shared lock for those
operations.
* Flag kernel special devices such as /dev/zero, /dev/null,
/dev/urandom, etc, with D_QUICK.
Matthew Dillon [Wed, 18 Oct 2017 06:34:14 +0000 (23:34 -0700)]
kernel - Add more features to the RB tree
* Add RB_SCAN_NOLK(), a version of RB_SCAN() that does not protect
the iterator with a spin-lock. This can be used in any scan loop
where the scan loop is able to determine whether the iterator has
been lost or not.
* Add RB_LOOKUP_REL(), a version of RB_LOOKUP() that optimizes the
specific next-index and prev-index case.
Matthew Dillon [Wed, 18 Oct 2017 06:31:59 +0000 (23:31 -0700)]
kernel - Add lock debugging
* Refactor debug.lock_test_mode to allow dumping of stack backtraces
when lockmgr locks or spinlocks are contested.
* Make some adjustments to the indefinite code (w/ heads up from mjg).
Do not start recording the TSC until we've contested for 15 loops,
and do not record the end time or duration unless we have contested
for more than 15 loops.
Matthew Dillon [Wed, 18 Oct 2017 06:25:24 +0000 (23:25 -0700)]
kernel - refactor vm_page busy
* Move PG_BUSY, PG_WANTED, PG_SBUSY, and PG_SWAPINPROG out of m->flags.
* Add m->busy_count with PBUSY_LOCKED, PBUSY_WANTED, PBUSY_SWAPINPROG,
and PBUSY_MASK (for the soft-busy count).
* Add support for acquiring a soft-busy count without a hard-busy.
This requires that there not already be a hard-busy. The purpose
of this is to allow a vm_page to be 'locked' in a shared manner
via the soft-busy for situations where we only intend to read from
it.
Imre Vadász [Tue, 17 Oct 2017 20:06:23 +0000 (22:06 +0200)]
if_vtnet - Handle missing IFCAP_VLAN_* flags nicer. Comment IFCAP_LOR stuff.
* The if_vtnet driver used to define the IFCAP_LRO, IFCAP_VLAN_HWFILTER and
IFCAP_VLAN_HWTSO flags itself, to make the code from FreeBSD build.
Instead define IFCAP_VLAN_HWFILTER and IFCAP_VLAN_HWTSO to 0, when they
are not defined already. This allows the code to build, but all checks
for the flags fail. (Inspired by the vmxnet3 driver port).
* The IFCAP_LRO flag is unavailable in DragonFly, but the LRO offload seems
to work somehow.
* According to the virtio specification, LRO support should be possible
without rx checksum support as well.
Matthew Dillon [Tue, 17 Oct 2017 21:57:19 +0000 (14:57 -0700)]
kernel - Cleanup vm_page_repurpose()
* Remove the now unused vm_page_repurpose() function.
* Remove emrunning variable.
Imre Vadász [Tue, 17 Oct 2017 20:11:08 +0000 (22:11 +0200)]
if_vtnet - Disable rx csum offload due to unsupported ipv6 rx csum offload.
* Ignoring the checksum offloading in the receive path of the driver isn't
sufficient, since we might receive only partially checksummed packets
from the host.
* Unfortunately there is only a single feature flag for both ipv4 and ipv6
receive checksum offloading, so we need to disable both for now.
* At the moment we don't support a way to explicitly enable the rx csum
feature at runtime, but this will be easily possible by adding support
for the VIRTIO_NET_F_CTRL_GUEST_OFFLOADS feature.
* Mention this as a caveat in the manpage.
* Update correct default value of hw.vtnet.lro_disable tunable in the
manpage, to match the code again.
Sascha Wildner [Tue, 17 Oct 2017 20:18:35 +0000 (22:18 +0200)]
kernel: Remove <sys/sysref{,2}.h> inclusion from files that don't need it.
Some of the headers are public in one way or another so bump
__DragonFly_version for safety.
While here, add a missing <sys/objcache.h> include to kern_exec.c which
was previously relying on it coming in via <sys/sysref.h> (which was
included by <sys/vm_map.h> prior to this commit).
Sascha Wildner [Tue, 17 Oct 2017 20:15:17 +0000 (22:15 +0200)]
<sys/indefinite2.h>: Add missing include for VKERNEL64.
Matthew Dillon [Tue, 17 Oct 2017 18:55:24 +0000 (11:55 -0700)]
kernel - Remove 'Emergency Pager' debugging messages
* Remove these messages. They were for debugging only and, in fact,
the activation of the anonymous-only pager is not really an
'Emergency'.
Sascha Wildner [Tue, 17 Oct 2017 18:31:29 +0000 (20:31 +0200)]
Stitch LINT64 build back together.
b1793cc6ba47622ab6ad154905f5c1385a6825bd removed the debuglockmgr()
code in kern_lock.c that was enabled with the DEBUG_LOCKS kernel
option. Its only consumer was in vfs_vnops.c for vn_lock.
For now, remove all associated remains.
Justin C. Sherrill [Tue, 17 Oct 2017 18:28:14 +0000 (14:28 -0400)]
Add mount_hammer2 and newfs_hammer2 to initrd list.
Sascha Wildner [Tue, 17 Oct 2017 07:36:30 +0000 (09:36 +0200)]
Remove "kernel ppp", i.e. if_ppp.ko and pppd(8).
It has been replaced by ppp(8), in conjunction with tun(4).
While here, rename the ppp-user rc script to 'ppp' and fix up
the REQUIRE/PROVIDE situation.
Matthew Dillon [Mon, 16 Oct 2017 22:17:42 +0000 (15:17 -0700)]
mkinitrd - Add missing /var/db (3)
* When /var is mounted via tmpfs we have to mkdir the subdirs
manually.
* Add /var/db and /var/empty to the directories initrd creates
in its rc.
Submitted-by: amonk
Imre Vadász [Mon, 16 Oct 2017 22:00:32 +0000 (00:00 +0200)]
virtio_blk - Fix capacity calculation, when host sets large disk block size.
* The disk capacity in the virtio configuration space is always specified
in 512 byte sectors, so info.d_media_blksize should be 512.
* Also check for VIRTIO_BLK_F_GEOMETRY feature before reading the disk
geometry from configuration space.
* Add some device_printf calls to report the disk size and (if available)
geometry during bootup.
Matthew Dillon [Mon, 16 Oct 2017 07:28:11 +0000 (00:28 -0700)]
kernel - Rewrite umtx_sleep() and umtx_wakeup()
* Rewrite umtx_sleep() and umtx_wakeup() to no longer use
vm_fault_page_quick(). Calling the VM fault code incurs a huge
overhead and creates massive contention when many threads are
using these calls.
The new code uses fuword(), translate to the physical address via
PTmap, and has very low overhead and basically zero contention.
* Instead, impose a mandatory timeout for umtx_sleep() and cap it
at 2 seconds (adjustable via sysctl kern.umtx_timeout_max, set
in microseconds). When the memory mapping underpinning a umtx
changes, userland will not stall for more than 2 seconds.
* The common remapping case caused by fork() is handled by the kernel
by immediately waking up all sleeping umtx_sleep() calls for the
related process.
* Any other copy-on-write or remapping cases will stall no more
than the maximum timeout (2 seconds). This might include paging
to/from swap, for example, which can remap the physical page
underpinning the umtx. This could also include user application
snafus or weirdness.
* umtx_sleep() and umtx_wakeup() still translate the user virtual
address to a physical address for the tsleep() and wakeup() operation.
This is done via a fault-protected access to the PTmap (the page-table
self-mapping).
Matthew Dillon [Mon, 16 Oct 2017 01:57:43 +0000 (18:57 -0700)]
world - World build for ucred changes
* Adjust mountd and fstat kernel structure access for
changes.
Matthew Dillon [Mon, 16 Oct 2017 00:42:26 +0000 (17:42 -0700)]
kernel - Clean up ucred and plimit cache line locality
* Move struct plimit's p_spin and p_refcnt fields into their own
cacheline. This structure is massively shared and read often.
Doing this avoids unnecessary cache line ping-pongs.
* Only use p_spin to modify a resource limit. Do not use it to
access the resource limit.
* Integrate plimit's exclusivity flag into p_refcnt.
* Move struct ucred's cr_ref into its own cacheline. This structure
is massively shared and read often. Doing this avoids unnecessary
cache line ping-pongs.
Matthew Dillon [Sun, 15 Oct 2017 21:26:20 +0000 (14:26 -0700)]
kernel - Use fcmpset in lockmgr and tokens
* Use fcmpset for lockmgr and token locks.
Matthew Dillon [Sun, 15 Oct 2017 21:20:56 +0000 (14:20 -0700)]
kernel - Add atomic_fcmpset_*()
* Add atomic_fcmpset_*(). GCC has gotten good enough that it no longer
forces that &count onto the stack.
These functions work like atomic_cmpset_*() but update the originating
value on failure, allowing us to avoid reloading it from memory.
Suggested-by: mjg__
Matthew Dillon [Sun, 15 Oct 2017 19:26:28 +0000 (12:26 -0700)]
kernel - Partition large anon mappings, optimize vm_map_entry_reserve*()
* Partition large anonymous mappings in (for now) 16MB chunks.
The purpose of this is to improve concurrent VM faults for
threaded programs. Note that the pmap itself is still a
bottleneck.
* Refactor vm_map_entry_reserve() and related code to remove
unnecessary critical sections.
Matthew Dillon [Sun, 15 Oct 2017 18:25:21 +0000 (11:25 -0700)]
kernel - Optimize struct uidinfo
* Refactor struct uidinfo. Use atomic ops for ui_posixlocks
and ui_proccnt. They were already being used for ui_openfiles
and ui_ref.
* Refactor ui_ref a bit to improve the drop code. Use a cute
trick for the transition. When we transition to 0 we allow
ui_ref to actually go to 0, and then do an independent lookup
of the uid with the hash table spinlock to conditionally free
it if it remains 0.
This allows us to completely avoid using atomic_cmpset_int(),
which can be seriously inefficient due to races in SMP
environments.
Suggested-by: mjg__
Matthew Dillon [Sun, 15 Oct 2017 18:02:15 +0000 (11:02 -0700)]
kernel - pmap->pm_spin now uses a shared spinlock
* A shared spinlock is used whenever possible for pmap->pm_spin.
This is particularly beneficial for umtx_sleep/umtx_wakeup
operations.
Matthew Dillon [Sun, 15 Oct 2017 18:01:11 +0000 (11:01 -0700)]
kernel - Increase pmap placemarks hash from 16 to 64 entries
* Increase the pmap placemarks hash from 16 to 64 entries,
improving concurrent fault performance for threads a bit.
Matthew Dillon [Sun, 15 Oct 2017 17:54:59 +0000 (10:54 -0700)]
kernel - Simplify umtx_sleep and umtx_wakeup support
* Rip out the vm_page_action / vm_page_event() API. This code was
fairly SMP unfriendly and created serious bottlenecks with large
threaded user programs using mutexes.
* Replace with a simpler mechanism that simply wakes up any UMTX
domain tsleeps after a fork().
* Implement a 4uS spin loop in umtx_sleep() similar to what the
pipe code does.
Matthew Dillon [Sat, 14 Oct 2017 06:26:56 +0000 (23:26 -0700)]
kernel - Increase ncmount_cache array
* Increase the ncmount_cache hash from 1009 to 16301. The
slow-path (which can contend heavily on the mountlist_token)
was getting hit too often in the synth test due to the
number of mounts synth maintains.
* Improve the hash function to reduce chances of collisions.
Matthew Dillon [Sat, 14 Oct 2017 04:26:30 +0000 (21:26 -0700)]
kernel - Reoptimize sys_pipe
* Use atomic ops for state updates, allowing us to avoid acquiring
the other side's token. This removes all remaining contention.
* Performance boosted by around 35%. On the ryzen, bulk buffer
write->read tests between localized cpu cores went from 9.2 GB/sec
to around 13 GBytes/sec. Cross-die performance increased from
2.5 GB/sec to around 4.5 GB/sec (gigabytes/sec).
1-byte ping-ponging (write-1/read-1/turn-around/write-back-1/
read-back-1) fell from 1.0-2.0uS to 0.7-1.7uS.
* Add kern.pipe.size, allowing the kernel pipe buffer size to be
changed (affects new pipes only). The default buffer size has
been increased to 32KB (it was 16KB).
* Refactor pipelining optimizations, further reducing unnecessary
tsleep/wakeup IPIs.
* Improve kern.pipe.delay operation (an IPI avoidance mechanism),
and reduce from 5uS to 4uS.
Also add cpu_pause() in the TSC loop (suggested-by mjg_).
Matthew Dillon [Sat, 14 Oct 2017 00:55:41 +0000 (17:55 -0700)]
kernel - Refactor sys_pipe
* Refactor the pipe code in preparation for optimization. Get rid of
the dual-pipe structure and instead have one pipe structure with
two buffers.
* Scrap a ton of global statistics variables that nobody uses any more,
get rid of pipe_peer, and get rid of the slock.
Matthew Dillon [Fri, 13 Oct 2017 05:59:02 +0000 (22:59 -0700)]
kernel - Improve mountlist_scan() performance, track vfs_getvfs()
* Use a shared token whenever possible, and do not hold the token
across the callback in the mountlist_scan() call.
* vfs_getvfs() mount_hold()'s the returned mp. The caller is now
expected to mount_drop() it when done. This fixes a very rare
race.
Matthew Dillon [Fri, 13 Oct 2017 03:42:33 +0000 (20:42 -0700)]
kernel - Refactor smp collision statistics (2)
* tsc_uclock_t and tsc_sclock_t need to be exposed for now for
userland.
Matthew Dillon [Thu, 5 Oct 2017 16:09:27 +0000 (09:09 -0700)]
kernel - Refactor smp collision statistics (2)
* Refactor indefinite_info mechanics. Instead of tracking indefinite
loops on a per-thread basis for tokens, track them on a scheduler
basis. The scheduler records the overhead while it is live-looping
on tokens, but the moment it finds a thread it can actually schedule
it stops (then restarts later the next time it is entered), even
if some of the other threads still have unresolved tokens.
This gives us a fairer representation of how many cpu cycles are
actually being wasted waiting for tokens.
* Go back to using a local indefinite_info in the lockmgr*(), mutex*(),
and spinlock code.
* Refactor lockmgr() by implementing an __inline frontend to
interpret the directive. Since this argument is usually a constant,
the change effectively removes the switch().
Use LK_NOCOLLSTATS to create a clean recursion to wrap the blocking
case with the indefinite*() API.
Matthew Dillon [Thu, 5 Oct 2017 05:04:13 +0000 (22:04 -0700)]
kernel - Optimize shared -> excl spinlock contention
* When an exclusive request is spinning waiting for shared holders to
  release, throw in additional cpu_pause()'s based on the number of
  shared holders.
Suggested-by: mjg_
Matthew Dillon [Thu, 5 Oct 2017 04:46:57 +0000 (21:46 -0700)]
kernel - Refactor smp collision statistics
* Add an indefinite wait timing API (sys/indefinite.h,
sys/indefinite2.h). This interface uses the TSC and will
record lock latencies to our pcpu stats in microseconds.
The systat -pv 1 display shows this under smpcoll.
Note that latencies generated by tokens, lockmgr, and mutex
locks do not necessarily reflect actual lost cpu time as the
kernel will schedule other threads while those are blocked,
if other threads are available.
* Formalize TSC operations more, supply a type (tsc_uclock_t and
tsc_sclock_t).
* Reinstrument lockmgr, mutex, token, and spinlocks to use the new
indefinite timing interface.
Matthew Dillon [Thu, 5 Oct 2017 03:28:55 +0000 (20:28 -0700)]
kernel - KVABIO allocbuf() optimization
* When using allocbuf() to set bufsize to 0 during buffer reuse,
do not bother synchronizing the pmap.
Matthew Dillon [Wed, 4 Oct 2017 03:06:04 +0000 (20:06 -0700)]
kernel - KVABIO stabilization
* bp->b_cpumask must be cleared in vfs_vmio_release().
* Generally speaking, it is desirable for the kernel to set
  B_KVABIO when flushing or disposing of a buffer, as long as b_cpumask
is also correct. This avoids unnecessary synchronization when
underlying device drivers support KVABIO, even if the filesystem does
not.
* In findblk() we cannot just gratuitously clear B_KVABIO. We must issue
a bkvasync_all() to clear the flag in order to ensure proper
synchronization with the caller's desired B_KVABIO state.
* It was intended that bkvasync_all() clear the B_KVABIO flag. Make
sure it does.
* In contrast, B_KVABIO can always be set at any time, so long as the
cpumask is cleared whenever the mappings are changed, and also as long
as the caller's B_KVABIO state is respected if the buffer is later
returned to the caller in a locked state. If the buffer will simply
be disposed of by the kernel instead, the flag can be set. The
wrapper (typically a vn_strategy() or dev_dstrategy() call) will clear
the flag via bkvasync_all() if the target does not support KVABIO.
* Kernel support code outside of filesystem and device drivers is
expected to support KVABIO.
* nvtruncbuf() and nvextendbuf() now use bread_kvabio() (i.e. they now
properly support KVABIO).
* The buf_countdeps(), buf_checkread(), and buf_checkwrite() callbacks
call bkvasync_all() in situations where the vnode does not support
KVABIO. This is because the kernel might have set the flag for other
incidental operations even if the filesystem did not.
* As per above, devfs_spec_strategy() now sets B_KVABIO and properly
calls bkvasync() when it needs to operate directly on buf->b_data.
* Fix bug in tmpfs(). tmpfs() was using bread_kvabio() as intended,
but failed to call bkvasync() prior to operating directly on
buf->b_data (prior to calling uiomovebp()).
* Any VFS function that calls BUF_LOCK*() itself may also have to
call bkvasync_all() if it wishes to operate directly on buf->b_data,
even if the VFS is not KVABIO aware. This is because the VFS bypassed
the normal buffer cache APIs to obtain a locked buffer.
Matthew Dillon [Tue, 3 Oct 2017 01:49:28 +0000 (18:49 -0700)]
kernel - Adjust ipiq execution code a bit
* Remove unnecessary fences
* Adjust documentation
Matthew Dillon [Tue, 3 Oct 2017 01:48:19 +0000 (18:48 -0700)]
kernel - Add wakeup() probe sysctl
* Add a sysctl to allow us to probe wakeups.
* Add a few assertions in the optimized wakeup() path.
* Adjust documentation.
Matthew Dillon [Mon, 2 Oct 2017 02:42:59 +0000 (19:42 -0700)]
kernel - Implement KVABIO API in TMPFS
* TMPFS now fully supports the KVABIO API. This removes nearly all
IPIs from buffer cache operations related to TMPFS.
* In synth tests on 32-way and 48-way servers, the number of IPIs/cpu/sec
drops from 5000-12000 down to 200-1000. Needless to say, this is a
huge win, particularly on VMs.
Recommend-by: mjg_ (Mateusz Guzik)
Matthew Dillon [Mon, 2 Oct 2017 02:39:33 +0000 (19:39 -0700)]
kernel - Add KVABIO support to NVMe, disk translation layer, and swap
* Add KVABIO support to the NVMe driver. The driver no longer
requires that buffers be synchronized to all cpus.
* Add KVABIO support to the disk translation layer. The layer no
longer requires that buffers be synchronized to all cpus (note
however that the underlying device may still require such).
* Add KVABIO support to the swap subsystem. Again, actual avoidance
of buffer memory synchronization depends on the underlying devices.
Matthew Dillon [Mon, 2 Oct 2017 02:28:56 +0000 (19:28 -0700)]
kernel - Add KVABIO API (ability to avoid global TLB syncs)
* Add KVABIO support. This works as follows:
(1) Devices can set D_KVABIO in the ops flags to specify that the
device strategy routine supports the API.
The dev_dstrategy() wrapper will fully synchronize the buffer to
all cpus prior to dispatch if the device flag is not set.
(2) Vnodes can set VKVABIO in v_flag to indicate that VOP_STRATEGY
supports the API.
The vn_strategy() wrapper will fully synchronize the buffer to
all cpus prior to dispatch if the vnode flag is not set.
(3) GETBLK_KVABIO and FINDBLK_KVABIO flags added to allow buffer
cache consumers (primarily filesystem code) to indicate that
they support the API. B_KVABIO flag added to struct buf.
This occurs on a per-acquisition basis. For example, a standard
bread() will clear the flag, indicating no support. A bread_kvabio()
will set the flag, indicating support.
* The getblk(), getcacheblk(), and cluster*() interfaces set the flag for
any I/O they dispatch, and then adjust the flag as necessary upon return
according to the caller's wishes.
Matthew Dillon [Sun, 1 Oct 2017 22:11:21 +0000 (15:11 -0700)]
kernel - Remove geteblk()
* Remove geteblk(), the last B_MALLOC buffer cache API. Generally
use getpbuf_mem() instead.
Matthew Dillon [Sun, 1 Oct 2017 22:09:52 +0000 (15:09 -0700)]
kernel - Add pmap_qenter_noinval()
* Add pmap_qenter_noinval() API
Matthew Dillon [Sun, 1 Oct 2017 19:11:10 +0000 (12:11 -0700)]
kernel - Remove repurposebuf
* Remove the repurposebuf hack to prepare for the buffer cache
KVABIO API, which is a better solution.
Matthew Dillon [Sat, 30 Sep 2017 19:14:21 +0000 (12:14 -0700)]
kernel - Remove B_MALLOC
* Remove B_MALLOC buffer support. All primary buffer cache buffer
operations should now use pages. B_VMIO is required for all
vnode-centric operations like allocbuf(), but does not have to be set
for nominal I/O.
* Remove vm_hold_load_pages() and vm_hold_free_pages(). This code was
used to support mapping ad-hoc data buffers into buf structures, but
the only remaining use case in the CAM periph code can just use
getpbuf_mem() instead. So this code is no longer used.
Sepherosa Ziehau [Mon, 16 Oct 2017 05:16:18 +0000 (13:16 +0800)]
ipfw: Factor out ipfw_init_args()
Sepherosa Ziehau [Mon, 16 Oct 2017 04:52:17 +0000 (12:52 +0800)]
ipfw: Flush the rules before unloading the module.
Sepherosa Ziehau [Mon, 16 Oct 2017 04:19:23 +0000 (12:19 +0800)]
ipfw: Factor out ipfw_defrag_redispatch.
Remove no longer needed IP_FW_CONTINUE.
Sepherosa Ziehau [Mon, 16 Oct 2017 04:07:45 +0000 (12:07 +0800)]
kern: Remove ncpus2 and friends.
They were no longer used after netisr_ncpus was deployed.
Reminded-by: dillon@
Sepherosa Ziehau [Mon, 16 Oct 2017 03:44:45 +0000 (11:44 +0800)]
mpls: Use netisr_ncpus
Reminded-by: dillon@
Sascha Wildner [Sun, 15 Oct 2017 20:52:51 +0000 (22:52 +0200)]
Update the pciconf(8) database.
October 12, 2017 snapshot from http://pciids.sourceforge.net/
Sascha Wildner [Sun, 15 Oct 2017 11:07:04 +0000 (13:07 +0200)]
LINT64: Sort vmx a bit better.
Matthew Dillon [Sun, 15 Oct 2017 07:44:38 +0000 (00:44 -0700)]
Revert "libthread_xu - Wakeup all waiters"
This reverts commit
de7ba607e4500e7df6ade3916977cc8a91e1b4e9.
* Didn't intend to push this.
Sepherosa Ziehau [Sat, 30 Sep 2017 06:39:48 +0000 (14:39 +0800)]
ipfw: Implement state based "redirect", i.e. without using libalias.
Redirection creates two states, i.e. one before the translation (xlat0)
and one after the translation (xlat1). If the hash of the translated
packet indicates that it is owned by a remote CPU:
- If the packet triggers the state pair creation, the 'xlat1' will be
piggybacked by the translated packet, which will be forwarded to the
remote CPU for further evaluation. The 'xlat1' will be installed
on the remote CPU before the evaluation of the translated packet.
- Otherwise, only the translated packet will be forwarded to the remote
CPU for further evaluation.
The 'xlat1' is called the slave state, which will be deleted only when
the 'xlat0' (the master state) is deleted. The state pair is always
deleted on the CPU owning the 'xlat1'; the 'xlat0' will be forwarded
there.
The reference counting of the state pair is maintained independently
in each state; the memory of the state pair is freed only after
the sum of the counters in the two states reaches 0. This avoids
expensive per-packet atomic ops.
As far as I have tested, this implementation of "redirect" does _not_
introduce any noticeable performance reduction, latency increase, or
latency instability.
This commit also puts most of the bits necessary for NAT in place.
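A minimal userland sketch of the split reference counting described above. The names and layout (xlat_pair, xlat0_ref, xlat1_ref) are illustrative only, not the kernel's actual structures; the point is that each state decrements only its own counter on its owning CPU, and the shared memory is released once both counters are exhausted:

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Sketch: one counter per state of the pair. Each counter is only
 * touched by the CPU owning that state, so no atomic ops are needed
 * on the per-packet path; the pair is freed when the sum reaches 0.
 */
struct xlat_pair {
	int	xlat0_ref;	/* refs held via the master state */
	int	xlat1_ref;	/* refs held via the slave state */
};

static struct xlat_pair *
xlat_pair_alloc(void)
{
	struct xlat_pair *xp = calloc(1, sizeof(*xp));

	xp->xlat0_ref = 1;	/* one ref per installed state */
	xp->xlat1_ref = 1;
	return (xp);
}

/* Drop one ref from either side; free once both sides reach zero. */
static int
xlat_pair_unref(struct xlat_pair *xp, int slave)
{
	if (slave)
		xp->xlat1_ref--;
	else
		xp->xlat0_ref--;
	if (xp->xlat0_ref + xp->xlat1_ref == 0) {
		free(xp);
		return (1);	/* pair freed */
	}
	return (0);		/* other side still holds a ref */
}
```

In the real implementation the final sum check happens on the CPU owning 'xlat1', where the pair is always deleted, so no cross-CPU synchronization is needed there either.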
Matthew Dillon [Sun, 15 Oct 2017 07:13:42 +0000 (00:13 -0700)]
libthread_xu - Wakeup all waiters
* For now, punt on trying to wake up an optimized number of waiters.
Wake up all waiters and let them sort it out.
* This may fix specific count races in threaded programs using
pthread mutexes.
Matthew Dillon [Sat, 14 Oct 2017 22:28:12 +0000 (15:28 -0700)]
hammer2 - Handle error on rename in media out of space case
* Process the error code from hammer2_chain_delete() in
hammer2_xop_nrename() to ensure that we do not try to reattach
the chain under another parent.
Reported-by: arcade (Bug #3055)
Matthew Dillon [Sat, 14 Oct 2017 21:18:39 +0000 (14:18 -0700)]
sshd - Disable tunneled clear text passwords by default
* Reapply
1cb3a32c13b and
c866a462b3. sshd on DragonFlyBSD now disables
cleartext passwords by default.
Reminded-by: ivadasz
Sascha Wildner [Sat, 14 Oct 2017 19:06:14 +0000 (21:06 +0200)]
cpdup(1): Some improvements.
* Make cpdup retry failed rmdirs after chflags. It already does this
for remove().
* When deciding whether to copy a file, cpdup should ignore the
UF_ARCHIVE file flag. If that flag is supported by the destination
file system but it's cleared on a source file, then multiple
invocations of cpdup would all copy the source file because its
flags wouldn't match. OTOH, if the destination filesystem doesn't
support UF_ARCHIVE, then there's no point in cpdup setting it.
Submitted-by: Will Andrews <will@firepipe.net>
Dragonfly-bug: https://bugs.dragonflybsd.org/issues/2987
https://bugs.dragonflybsd.org/issues/2988
https://bugs.dragonflybsd.org/issues/3067
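The UF_ARCHIVE decision above can be sketched as a flag comparison that masks the bit out on both sides. flags_match() is a hypothetical helper, not cpdup's actual code; the UF_ARCHIVE value is the one used on FreeBSD/DragonFly:

```c
#include <assert.h>

#ifndef UF_ARCHIVE
#define UF_ARCHIVE	0x00000800	/* value as on FreeBSD/DragonFly */
#endif

/*
 * Compare source and destination file flags while ignoring
 * UF_ARCHIVE, so a cleared archive bit on the source (or a
 * destination filesystem that doesn't support the flag at all)
 * never forces a redundant copy.
 */
static int
flags_match(unsigned long src_flags, unsigned long dst_flags)
{
	return ((src_flags & ~(unsigned long)UF_ARCHIVE) ==
		(dst_flags & ~(unsigned long)UF_ARCHIVE));
}
```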
Matthew Dillon [Sat, 14 Oct 2017 17:59:30 +0000 (10:59 -0700)]
hammer2 - Slightly reduce LZ4 output buffer limit
* LZ4_compress_limitedOutput() appears to be able to overrun the
supplied buffer.
* Slightly reduce the LZ4 output buffer limit from a 4-byte alignment
to an 8-byte alignment to try to fix the problem.
Lubos Boucek [Fri, 13 Oct 2017 21:33:01 +0000 (21:33 +0000)]
Fix additional cases of seg-faults on crypt(3) failure
* On failure, crypt(3) returns NULL, which was then passed as a
strcmp(3) argument, causing the segfault.
opieftpd.c and opiesu.c are not actually used anywhere.
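The defensive pattern behind the fix (never hand crypt(3)'s result straight to strcmp(3)) can be sketched as follows. crypt_maybe() is a hypothetical stand-in for crypt(3) so the sketch stays self-contained; like the real function, it returns NULL on failure:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for crypt(3): NULL on failure (e.g. a bad
 * salt/setting string), a hash string on success. */
static const char *
crypt_maybe(const char *key, const char *salt)
{
	(void)key;
	if (salt == NULL || salt[0] == '\0')
		return (NULL);		/* failure case */
	return ("$1$fakehash");		/* placeholder hash */
}

/* Treat a NULL result as a mismatch instead of passing it to
 * strcmp(3), which would dereference NULL and segfault. */
static int
check_password(const char *key, const char *salt, const char *stored)
{
	const char *hash = crypt_maybe(key, salt);

	if (hash == NULL)
		return (0);		/* crypt failed: deny, don't crash */
	return (strcmp(hash, stored) == 0);
}
```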
Sascha Wildner [Sat, 14 Oct 2017 08:48:04 +0000 (10:48 +0200)]
rc.8: Clarify foo.sh behavior.
Improve wording a bit. See NetBSD's revision 1.38.
Reported-by: Aaron LI <aly@aaronly.me>
Aaron LI [Fri, 13 Oct 2017 04:26:29 +0000 (12:26 +0800)]
disklabel64: Fix an error message
Sascha Wildner [Sat, 14 Oct 2017 08:38:45 +0000 (10:38 +0200)]
ifconfig(8): Add 'lscan'. Like 'scan', but displays long SSIDs.
Submitted-by: Max Herrgard <herrgard@gmail.com>
Matthew Dillon [Sat, 14 Oct 2017 06:14:31 +0000 (23:14 -0700)]
mkinitrd - Add missing /var/db
* dhclient also needs /var/db to exist; make sure it does.
Reported-by: amonk
Matthew Dillon [Sat, 14 Oct 2017 05:39:31 +0000 (22:39 -0700)]
mkinitrd - Add missing /var/empty
* /var/empty is required by dhclient, which will SIGHUP itself
without it.
Reported-by: amonk
Matthew Dillon [Sat, 14 Oct 2017 04:44:06 +0000 (21:44 -0700)]
kernel - Rearrange namecache globals a bit
* Make sure ncspin and ncneglist are in the same cache line, and
do not overlap other globals in that cache line.
Suggested-by: mjg_