Matthew Dillon [Tue, 26 Jul 2016 19:56:31 +0000 (12:56 -0700)]
kernel - Reduce atomic ops in switch code
* Instead of using four atomic 'and' ops and four atomic 'or' ops, use
one atomic 'and' and one atomic 'or' when adjusting the pmap->pm_active.
* Store the array index and simplified cpu mask in the globaldata structure
for the above operation.
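As a rough illustration of the idea (not the kernel code itself), the sketch below uses C11 atomics in user space. The names active_mask, percpu, mask_index and mask_bit are hypothetical stand-ins for pmap->pm_active and the new globaldata fields, and the 4 x 64-bit layout is an assumption.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define MASK_WORDS 4            /* assumption: 4 x 64 bits = 256 cpus */

/* Hypothetical stand-in for the pmap's active-cpu mask. */
struct active_mask {
    _Atomic uint64_t word[MASK_WORDS];
};

/* Hypothetical per-cpu data: the word index and the single-bit mask are
 * computed once, so the switch path does no per-switch arithmetic. */
struct percpu {
    int      mask_index;
    uint64_t mask_bit;
};

static void percpu_init(struct percpu *pc, int cpuid)
{
    assert(cpuid >= 0 && cpuid < MASK_WORDS * 64);
    pc->mask_index = cpuid / 64;
    pc->mask_bit = (uint64_t)1 << (cpuid % 64);
}

/* Leaving the old address space: one atomic 'and' on one word. */
static void mask_leave(struct active_mask *m, const struct percpu *pc)
{
    atomic_fetch_and(&m->word[pc->mask_index], ~pc->mask_bit);
}

/* Entering the new address space: one atomic 'or' on one word. */
static void mask_enter(struct active_mask *m, const struct percpu *pc)
{
    atomic_fetch_or(&m->word[pc->mask_index], pc->mask_bit);
}
```

Only the word that actually holds the cpu's bit is touched, which is why one 'and' and one 'or' are enough.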
Matthew Dillon [Tue, 26 Jul 2016 19:53:39 +0000 (12:53 -0700)]
kernel - refactor CPUMASK_ADDR()
* Refactor CPUMASK_ADDR(), removing the conditionals and just indexing the
array as appropriate.
Matthew Dillon [Tue, 26 Jul 2016 00:06:52 +0000 (17:06 -0700)]
kernel - Fix VM bug introduced earlier this month
* Adding the yields to the VM page teardown and related code was a great
idea (~Jul 10th commits), but it also introduced a bug where the page
could get torn-out from under the scan due to the vm_object's token being
temporarily lost.
* Re-check page object ownership and (when applicable) its pindex before
acting on the page.
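A minimal sketch of such a re-check, assuming heavily simplified structures. The struct and function names here are hypothetical and are not the kernel's vm_page/vm_object definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the VM structures. */
struct object {
    int dummy;
};

struct page {
    struct object *object;   /* owning object, may change while we yield */
    size_t         pindex;   /* page index within that object */
};

/*
 * Meant to be called after any point where the object's token may have
 * been lost (for example across a yield). Returns true only if the page
 * is still the one the scan was looking at.
 */
static bool page_still_ours(const struct page *p, const struct object *obj,
                            size_t expected_pindex, bool check_pindex)
{
    if (p->object != obj)
        return false;            /* page was torn out or moved */
    if (check_pindex && p->pindex != expected_pindex)
        return false;            /* same object, different slot */
    return true;
}
```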
Matthew Dillon [Mon, 25 Jul 2016 23:05:40 +0000 (16:05 -0700)]
systat - Refactor memory displays for systat -vm
* Report paging and swap activity in bytes and I/Os instead of pages and
I/Os (I/Os usually matched pages).
* Report zfod and cow in bytes instead of pages.
* Replace the REAL and VIRTUAL section with something that makes a bit
more sense.
Report active memory (this is just active pages), kernel memory
(currently just wired but we can add more stuff later), Free
(inactive + cache + free is considered free/freeable memory), and
total system memory as reported at boot time.
Report total RSS - basically how many pages the system is mapping to
user processes. Due to sharing this can be a large value.
Do not try to report aggregate VSZ as there's no point in doing so
any more.
Report swap usage on the main -vm display as well as total swap
allocated.
* Fix display bug in systat -sw display.
* Add "nvme" device type match for the disk display.
Imre Vadász [Sun, 24 Jul 2016 19:11:29 +0000 (21:11 +0200)]
if_iwm - Fix inverted logic in iwm_tx().
The PROT_REQUIRE flag should be set for data frames above a certain
length, but we were setting it for !data frames above a certain length,
which makes no sense at all.
Taken-From: OpenBSD, Linux iwlwifi
Matthew Dillon [Mon, 25 Jul 2016 18:31:04 +0000 (11:31 -0700)]
kernel - Fix mountctl() / unmount race
* kern_mountctl() now properly checks to see if an unmount is in-progress
and returns an error, fixing a later panic.
Sascha Wildner [Mon, 25 Jul 2016 19:46:01 +0000 (21:46 +0200)]
sysconf.3: Fix typo.
Sascha Wildner [Mon, 25 Jul 2016 18:43:03 +0000 (20:43 +0200)]
libc/strptime: Return NULL, not 0, since the function returns char *.
While here, accept 'UTC' for %Z as well.
Taken-from: FreeBSD
Matthew Dillon [Mon, 25 Jul 2016 18:18:57 +0000 (11:18 -0700)]
mountd, mount - Change how mount signals mountd, reduce mountd spam
* mount now signals mountd with SIGUSR1 instead of SIGHUP.
* mountd now recognizes SIGUSR1 as requesting an incremental update.
Instead of wiping all exports on all mounts and then re-scanning
the exports file and re-adding from the exports file, mountd will
now only wipe the export(s) on mounts it finds in the exports file.
* Greatly reduces unnecessary mountlist scans and commands due to
mount_null and mount_tmpfs operations, while still preserving our
ability to export such filesystems.
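The signal split can be pictured with a small POSIX sketch. This is not mountd's code: the flag and function names are hypothetical, and it assumes SIGHUP keeps its meaning of a full reload while SIGUSR1 requests the incremental update.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t want_full;         /* SIGHUP seen */
static volatile sig_atomic_t want_incremental;  /* SIGUSR1 seen */

/* The handler only records the request; the main loop would act on it. */
static void on_signal(int signo)
{
    if (signo == SIGHUP)
        want_full = 1;
    else if (signo == SIGUSR1)
        want_incremental = 1;
}

/* Install the same handler for both signals. Returns 0 on success. */
static int install_handlers(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGHUP, &sa, NULL) != 0)
        return -1;
    return sigaction(SIGUSR1, &sa, NULL);
}
```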
Matthew Dillon [Mon, 25 Jul 2016 04:55:00 +0000 (21:55 -0700)]
kernel - Close a few SMP holes
* Don't trust the compiler when loading refs in cache_zap(). Make sure
it doesn't reorder or re-use the memory reference.
* In cache_nlookup() and cache_nlookup_maybe_shared(), do a full re-test
of the namecache element after locking instead of a partial re-test.
* Lock the namecache record in two situations where we need to set a
flag. Almost all other flag cases require similar locking. This fixes
a potential SMP race in a very thin window during mounting.
* Fix unmount / access races in sys_vquotactl() and, more importantly, in
sys_mount(). We were disposing of the namecache record after extracting
the mount pointer, then using the mount pointer. This could race an
unmount and result in a corrupt mount pointer.
Change the code to dispose of the namecache record after we finish using
the mount point. This is somewhat more complex than I'd like, but it
is important to unlock the namecache record across the potentially
blocking operation to prevent a lock chain from propagating upwards
towards the root.
* Enhanced debugging for the namecache teardown case when nc_refs changes
unexpectedly.
* Remove some dead code (cache_purgevfs()).
Matthew Dillon [Mon, 25 Jul 2016 04:52:26 +0000 (21:52 -0700)]
kernel - Cut buffer cache related pmap invalidations in half
* Do not bother to invalidate the TLB when tearing down a buffer
cache buffer. On the flip side, always invalidate the TLB
(the page range in question) when entering pages into a buffer
cache buffer. Only applicable to normal VMIO buffers.
* Significantly improves buffer cache / filesystem performance with
no real risk.
* Significantly improves performance for tmpfs teardowns on unmount
(which typically have to tear-down a lot of buffer cache buffers).
Matthew Dillon [Mon, 25 Jul 2016 04:49:57 +0000 (21:49 -0700)]
kernel - Add some more options for pmap_qremove*()
* Add pmap_qremove_quick() and pmap_qremove_noinval(), allowing pmap
entries to be removed without invalidation under carefully managed
circumstances by other subsystems.
* Redo the virtual kernel a little to work the same as the real kernel
when entering new pmap entries. We cannot assume that no invalidation
is needed when the prior contents of the pte is 0, because there are
several ways it could have become 0 without a prior invalidation.
Also use an atomic op to clear the entry.
Matthew Dillon [Mon, 25 Jul 2016 04:44:33 +0000 (21:44 -0700)]
kernel - cli interlock with critcount in interrupt assembly
* Disable interrupts when decrementing the critical section count
and gd_intr_nesting_level, just prior to jumping into doreti.
This prevents a stacking interrupt from occurring in this roughly
10-instruction window.
* While limited stacking is not really a problem, this closes a very
small and unlikely window where multiple device interrupts could
stack excessively and run the kernel thread out of stack space.
(unlikely that it has ever happened in real life, but becoming more
likely as some modern devices are capable of much higher interrupt
rates).
Sascha Wildner [Sun, 24 Jul 2016 22:45:46 +0000 (00:45 +0200)]
sysconf.3: Document _SC_PAGE_SIZE and _SC_PHYS_PAGES.
Taken-from: FreeBSD
Submitted-by: Sevan Janiyan
Dragonfly-bug: <https://bugs.dragonflybsd.org/issues/2929>
Matthew Dillon [Sun, 24 Jul 2016 21:02:10 +0000 (14:02 -0700)]
drm - Fix subtle plane masking bug.
* Index needs to be 1 << index.
Reported-by: davshao
Found-by: Matt Roper - https://patchwork.kernel.org/patch/7889051/
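The class of bug is easy to show in isolation. The sketch below is not the drm code; the function names are hypothetical and only contrast testing a mask against a raw index with testing it against the corresponding bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Buggy form: tests the mask against the raw index value. */
static bool plane_in_mask_buggy(uint32_t plane_mask, unsigned index)
{
    return (plane_mask & index) != 0;
}

/* Fixed form: convert the index to its bit before testing. */
static bool plane_in_mask(uint32_t plane_mask, unsigned index)
{
    assert(index < 32);
    return (plane_mask & (UINT32_C(1) << index)) != 0;
}
```

With a mask that contains only plane 2 (value 4), the buggy form misses plane 2 and wrongly reports plane 4, while index 0 can never match at all.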
zrj [Wed, 20 Jul 2016 16:59:28 +0000 (19:59 +0300)]
cpumask.9: Add short manpage.
zrj [Tue, 19 Jul 2016 16:35:16 +0000 (19:35 +0300)]
cpumask.h: Make CPUMASK_ELEMENTS implementation-defined.
No functional change intended.
zrj [Tue, 19 Jul 2016 07:07:45 +0000 (10:07 +0300)]
sys: Extract CPUMASK macros to new <machine/cpumask.h>
There are enough CPUMASK macros already for them to have their own header.
So far the only userspace users are powerd(8), usched(8) and kern_usched.c(VKERNEL64).
After the recent change to expose kernel-internal CPUMASK macros, they became available
to userland code even through the <time.h> header. It is better to avoid that.
Also this reduces POSIX namespace pollution and keeps cpu/types.h header slim.
For now leave CPUMASK_ELEMENTS (not sure about ASSYM() macro handling the _ prefix)
and cpumask_t typedef (forward decl of struct cpumask would be better in prototypes).
Matthew Dillon [Sun, 24 Jul 2016 07:56:04 +0000 (00:56 -0700)]
kernel - Fix atomic op comparison
* The sequence was testing a signed integer and then testing the same
variable using atomic_fetchadd_int(&var, 0). Unfortunately, the
atomic-op returns an unsigned value so the result is that when the
buffer count was exhausted, the program would hard-loop without
calling tsleep.
* Fixed by casting the atomic op.
* Should fix the hardlock issue once and for all.
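The pitfall can be reproduced in user space. The sketch below uses C11 atomics; fetchadd_uint() is a hypothetical stand-in for an atomic fetch-add that returns an unsigned value, and buf_count is a made-up counter in which a negative value means the pool is over-committed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical counter; negative values mean "over-committed". */
static _Atomic unsigned int buf_count;

/* Stand-in for an atomic fetch-add that returns an unsigned value. */
static unsigned int fetchadd_uint(_Atomic unsigned int *p, unsigned int v)
{
    return atomic_fetch_add(p, v);
}

/* Buggy test: an unsigned result is never negative, so a count of -1
 * looks like a huge positive number and the caller never sleeps. */
static bool exhausted_buggy(void)
{
    return fetchadd_uint(&buf_count, 0) <= 0;
}

/* Fixed test: cast back to int before comparing (this relies on the
 * usual two's-complement conversion, as kernel code does). */
static bool exhausted_fixed(void)
{
    return (int)fetchadd_uint(&buf_count, 0) <= 0;
}
```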
Matthew Dillon [Sun, 24 Jul 2016 02:19:46 +0000 (19:19 -0700)]
kernel - Refactor Xinvltlb a little, turn off the idle-thread invltlb opt
* Turn off the idle-thread invltlb optimization. This feature can be
turned on with a sysctl (default-off) machdep.optimized_invltlb. It
will be turned on by default when we've life-tested that it works
properly.
* Remove excess critical sections and interrupt disablements. All entries
into smp_invlpg() now occur with interrupts already disabled and the
thread already in a critical section. This also defers critical-section
1->0 transition handling away from smp_invlpg() and into its caller.
* Refactor the Xinvltlb APIs a bit. Have Xinvltlb enter the critical
section (it didn't before). Remove the critical section from
smp_inval_intr(). The critical section is now handled by the assembly,
and by any other callers.
* Add additional tsc-based loop/counter debugging to try to catch problems.
* Move inner-loop handling of smp_invltlb_mask to act on invltlbs a little
faster.
* Disable interrupts a little later inside pmap_inval_smp() and
pmap_inval_smp_cmpset().
Matthew Dillon [Sun, 24 Jul 2016 02:17:24 +0000 (19:17 -0700)]
hammer - remove commented out code, move a biodone()
* Remove commented-out code which is no longer applicable.
* Move the biodone() call in hammer_io_direct_write_complete() to after
the token-release, reducing stacking of tokens in biodone().
Matthew Dillon [Sun, 24 Jul 2016 02:09:26 +0000 (19:09 -0700)]
hammer - Try to fix improper DATA CRC error
* Under heavy I/O loads HAMMER has an optimization (similar to UFS) where
the logical buffer is used to issue a write to the underlying device,
rather than copying the logical buffer to a device buffer. This
optimization is earmarked by a hammer2_record.
* If the logical buffer is discarded just after it is written, and then
re-read, hammer may go through a path which calls
hammer_ip_resolve_data(). This code failed to check whether the record
was still in-progress, and in-fact the write to the device may not have
even been initiated yet, and there could also have been a device buffer
alias in the buffer cache for the device for the offset.
This caused the followup read to access the wrong data, causing HAMMER
to report a DATA CRC error. The actual media receives the correct data
eventually and a umount/remount would show an uncorrupted file.
* Try to fix the problem by calling hammer_io_direct_wait() on the record
in this path to wait for the operation to complete (and also to
invalidate the related device buffer) before trying to re-read the block
from the media.
Matthew Dillon [Sun, 24 Jul 2016 02:06:42 +0000 (19:06 -0700)]
kernel - Enhance indefinite wait buffer error message
* Enhance the error message re: indefinite wait buffer notifications.
Matthew Dillon [Sun, 24 Jul 2016 01:59:33 +0000 (18:59 -0700)]
kernel - Fix TDF_EXITING bug, instrument potential live loops
* Fix a TDF_EXITING bug. lwkt_switch_return() is called to fixup
the 'previous' thread, meaning turning off TDF_RUNNING and handling
TDF_EXITING.
However, if TDF_EXITING is not set, the old thread can be used or
acted upon / exited on by some other cpu the instant we clear
TDF_RUNNING. In this situation it is possible that the other cpu
will set TDF_EXITING in the small window of opportunity just before
we check ourselves, leading to serious thread management corruption.
* The new pmap_inval*() code runs on Xinvltlb instead of as an IPIQ
and can easily create significant latency between the two tests,
whereas the old code ran as an IPIQ and could not due to the critical
section.
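One way to avoid this kind of window, sketched here with C11 atomics, is to take the exiting state from the same atomic operation that clears the running flag instead of re-reading the flags afterwards. This is an illustration of the race only, not necessarily how the commit fixes it; the flag values and names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define F_RUNNING 0x1u   /* hypothetical flag values */
#define F_EXITING 0x2u

struct thread_stub {
    _Atomic unsigned int flags;
};

/*
 * Clear F_RUNNING on the previous thread and report whether it was
 * already exiting. The exiting state comes from the value the flags had
 * at the moment of the clear, not from a later re-read, so a flag set
 * by another cpu afterwards cannot be mistaken for ours.
 */
static bool switch_return(struct thread_stub *otd)
{
    unsigned int old = atomic_fetch_and(&otd->flags, ~F_RUNNING);

    assert(old & F_RUNNING);
    return (old & F_EXITING) != 0;
}
```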
Matthew Dillon [Sun, 24 Jul 2016 01:57:15 +0000 (18:57 -0700)]
kernel - Add vfs.repurpose_enable, adjust B_HASBOGUS
* Add vfs.repurpose_enable, default disabled. If this feature is turned on
the system will try to repurpose the VM pages underlying a buffer on
re-use instead of allowing the VM pages to cycle into the VM page cache.
Designed for high I/O-load environments.
* Use the B_HASBOGUS flag to determine if a pmap_qenter() is required,
and devolve the case to a single call to pmap_qenter() instead of one
for each bogus page.
Sascha Wildner [Sat, 23 Jul 2016 20:05:49 +0000 (22:05 +0200)]
Add a realquickkernel target, analogous to realquickworld.
It skips the recently added depend step, so it behaves like
quickkernel did before
521f740e8971df6fdb1b63933cb534746e86bfae.
Sascha Wildner [Sat, 23 Jul 2016 19:15:13 +0000 (21:15 +0200)]
Fix VKERNEL64 build.
François Tigeot [Sat, 23 Jul 2016 18:20:48 +0000 (20:20 +0200)]
kernel: Fix compilation
Sascha Wildner [Sat, 23 Jul 2016 17:15:24 +0000 (19:15 +0200)]
bsd-family-tree: Sync with FreeBSD.
François Tigeot [Sat, 23 Jul 2016 10:16:31 +0000 (12:16 +0200)]
drm/i915/gem: Reduce differences with Linux 4.4
François Tigeot [Sat, 23 Jul 2016 09:12:44 +0000 (11:12 +0200)]
drm: Sync a few headers with Linux 4.4
Sascha Wildner [Sat, 23 Jul 2016 07:40:11 +0000 (09:40 +0200)]
dmesg.8: Improve markup a bit and fix a typo (dumnr -> dumpnr).
Matthew Dillon [Sat, 23 Jul 2016 04:58:59 +0000 (21:58 -0700)]
kernel - Fix excessive ipiq recursion (4)
* Possibly the smoking gun. There was a case where the lwkt_switch()
code could wind up looping excessively calling lwkt_getalltokens()
if td_contended went negative, and td_contended on interrupt threads
could in-fact go negative.
This stopped IPIs in their tracks.
* Fix by making td_contended unsigned, causing the comparisons to work
in all situations. And add a missing assignment to 0 for the
preempted thread case.
Matthew Dillon [Sat, 23 Jul 2016 01:22:17 +0000 (18:22 -0700)]
kernel - Fix excessive ipiq recursion (3)
* Third try. I'm not quite sure why we are still getting hard locks. These
changes (so far) appear to fix the problem, but I don't know why. It
is quite possible that the problem is still not fixed.
* Setting target->gd_npoll will prevent *all* other cpus from sending an
IPI to that target. This should have been ok because we were in a
critical section and about to send the IPI to the target ourselves, after
setting gd_npoll. The critical section does not prevent Xinvltlb, Xsniff,
Xspuriousint, or Xcpustop from running, but of these only Xinvltlb does
anything significant and it should theoretically run at a higher level
on all cpus than Xipiq (and thus complete without causing a deadlock of
any sort).
So in short, it should have been ok to allow something like an Xinvltlb
to interrupt the cpu in between setting target->gd_npoll and actually
sending the Xipiq to the target. But apparently it is not ok.
* Only clear mycpu->gd_npoll when we either (1) EOI and take the IPIQ
interrupt or (2) If the IPIQ is made pending via reqflags, when we clear
the flag. Previously we were clearing gd_npoll in the IPI processing
loop itself, potentially racing new incoming interrupts before they get
EOId by our cpu. This also should have been just fine, because interrupts
are enabled in the processing loop so nothing should have been able to
back-up in the LAPIC.
I can conjecture that possibly there was a race when we cleared gd_npoll
multiple times, potentially clearing it the second (or later) times,
allowing multiple incoming IPIs to be queued from multiple cpu sources but
then cli'ing and entering, e.g., an Xinvltlb processing loop before our cpu
could acknowledge any of them. And then, possibly, trying to issue an IPI
with the system in this state.
I don't really see how this can cause a hard lock because I did not observe
any loop/counter error messages on the console which should have been
triggered if other cpus got stuck trying to issue IPIs. But LAPIC IPI
interactions are not well documented so... perhaps they were being issued
but blocked our local LAPIC from accepting a Xinvltlb due to having one
extra unacknowledged Xipiq pending? But then, our Xinvltlb processing loop
*does* enable interrupts for the duration, so it should have drained if
this were so.
In any case, we no longer gratuitously clear gd_npoll in the processing
loop. We only clear it when we know there isn't one in-flight heading to
our cpu and none queued on our cpu. What will happen now is that a second
IPI can be sent to us once we've EOI'd the first one, and wind up in
reqflags, but will not be acted upon until our current processing loop
returns.
I will note that the gratuitous clearing we did before *could* have allowed
substantially all other cpus to try to Xipiq us at nearly the same time,
so perhaps the deadlock was related to that type of situation.
* When queueing an ipiq command from mycpu to a target, interrupts were
enabled between our entry into the ipiq fifo, the setting of our cpu bit
in the target gd_ipimask, the setting of target->gd_npoll, and our
issuing of the actual IPI to the target. We now disable interrupts across
these four steps.
It should have been ok for interrupts to have been left enabled across
these four steps. It might still be, but I am not taking any chances now.
Sascha Wildner [Fri, 22 Jul 2016 19:17:54 +0000 (21:17 +0200)]
build.7: Mention that KERNCONF can have more than one config.
Sascha Wildner [Fri, 22 Jul 2016 19:17:29 +0000 (21:17 +0200)]
Run make depend in quickkernel, too.
It is much cleaner to do that, just like it is run in quickworld, too.
At the price of a small increase in build time, quickkernel will now
continue working when a new kernel header is added, which broke it
before this commit because the header would not be copied to the right
place in /usr/obj.
Matthew Dillon [Fri, 22 Jul 2016 18:22:32 +0000 (11:22 -0700)]
drm - Stabilize broadwell and improve skylake
* The issue was primarily the bitops on longs were all wrong. '1 << N'
returns an integer (even if N is a long), so those had to be 1L or 1LU.
There were also some missing parentheses in the bit test code.
* Throw in one fix from Linux, but I think it's basically a NOP when DMAPs
are used (and we use DMAPs).
* Add some code to catch a particular failure condition by locking up X
in a while/tsleep loop instead of crashing outright, allowing a remote
login to kgdb the live system.
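The first item is a general C pitfall and can be sketched without any drm code. The helpers below are hypothetical (they are not the Linux or DragonFly bitops); they only show the long-sized constant and the parenthesized test.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Set bit nr in an array of unsigned long words. */
static void set_bit_ul(unsigned long *words, unsigned long nr)
{
    /* 1UL keeps the shift in unsigned long. A plain 1 is an int, and
     * shifting it by 31 or more bits is undefined with a 32-bit int. */
    words[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
}

/* Test bit nr in an array of unsigned long words. */
static bool test_bit_ul(const unsigned long *words, unsigned long nr)
{
    /* Parenthesize the mask test before comparing with zero, since
     * '!=' binds tighter than '&'. */
    return (words[nr / BITS_PER_LONG] & (1UL << (nr % BITS_PER_LONG))) != 0;
}
```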
Matthew Dillon [Tue, 19 Jul 2016 01:27:12 +0000 (18:27 -0700)]
kernel - repurpose buffer cache entries under heavy I/O loads
* At buffer-cache I/O loads > 200 MBytes/sec (newbuf instantiations, not
cached buffer use), the buffer cache will now attempt to repurpose the
VM pages in the buffer it is recycling instead of returning the pages
to the VM system.
* sysctl vfs.repurposedspace may be used to adjust the I/O load limit.
* The repurposing code attempts to free the VM page then reassign it to
the logical offset and vnode of the new buffer. If this succeeds, the
new buffer can be returned to the caller without having to run any
SMP tlb operations. If it fails, the pages will be either freed or
returned to the VM system and the buffer cache will act as before.
* The I/O load limit has a secondary beneficial effect which is to reduce
the allocation load on the VM system to something the pageout daemon can
handle while still allowing new pages up to the I/O load limit to transfer
to VM backing store. Thus, this mechanism ONLY affects systems with I/O
load limits above 200 MBytes/sec (or whatever programmed value you decide
on).
* Pages already in the VM page cache do not count towards the I/O load limit
when reconstituting a buffer.
Matthew Dillon [Mon, 18 Jul 2016 18:44:11 +0000 (11:44 -0700)]
kernel - Refactor buffer cache code in preparation for vm_page repurposing
* Keep buffer_map but no longer use vm_map_findspace/vm_map_delete to manage
buffer sizes. Instead, reserve MAXBSIZE of unallocated KVM for each buffer.
* Refactor the buffer cache management code. bufspace exhaustion now has
hysteresis, bufcount works just about the same.
* Start work on the repurposing code (currently disabled).
Matthew Dillon [Fri, 22 Jul 2016 05:48:10 +0000 (22:48 -0700)]
hammer2 - Fix deadlocks, bad assertion, improve flushing.
* Fix a deadlock in checkdirempty(). We must release the lock on oparent
before following a hardlink. If after re-locking chain->parent != oparent,
return EAGAIN to the caller.
* When doing a full filesystem flush, pre-flush the vnodes with a normal
transaction to try to soak-up all the compression time and avoid stalling
user process writes for too long once we get inside the formal flush.
* Fix a flush bug. Flushing a deleted chain is allowed if it is an inode.
Matthew Dillon [Thu, 21 Jul 2016 02:29:06 +0000 (19:29 -0700)]
nvme - Fix BUF_KERNPROC() SMP race
* BUF_KERNPROC() must be issued before we submit the request. The subq
lock is not sufficient to interlock request completion (which only needs
the comq lock).
* Only occurs under extreme loads, probably due to an IPI or Xinvltlb
causing enough of a pause that the completion can run. NVMe is so fast,
probably no other controller would hit this particular race condition.
* Also fix a bio queueing race which can leave a bio hanging. If no
requests are available (which can only happen under very heavy I/O
loads), the signaling to the admin thread on the next I/O completion
can race the queueing of the bio. Fix the race by making sure the
admin thread is signalled *after* queueing the bio.
François Tigeot [Thu, 21 Jul 2016 10:13:58 +0000 (12:13 +0200)]
drm/i915: Mark a DragonFly-specific change as such
zrj [Fri, 20 May 2016 15:54:04 +0000 (18:54 +0300)]
drm/i915: Re-apply lost intel_dp.c diff.
Bring back intel_dp.c part of
9c52345db761baa0a08634b3e93a233804b7a91b
Also reduce spam on laptops with eDP panels on i915 load.
Great opportunity to use just implemented DRM_ERROR_RATELIMITED()
macro that uses krateprintf().
Issue is still there.
Sascha Wildner [Thu, 21 Jul 2016 06:52:54 +0000 (08:52 +0200)]
<sys/param.h>: Fix comments.
François Tigeot [Thu, 21 Jul 2016 04:01:12 +0000 (06:01 +0200)]
drm/i915: Update to Linux 4.4
* Broxton and Skylake support improvements
* Cherryview specific fixes
* Atomic modesetting conversion progress
* Improved validation of video modes. Some low-power chips can't
drive all DP screens and this is now detected by the driver.
* PSR and FBC improvements and bug fixes
* Workarounds for some specific HDMI monitors needing more time than
allowed by the spec to handle hot-plug events
* As usual, various fixes for little issues here and there
Matthew Dillon [Thu, 21 Jul 2016 01:36:14 +0000 (18:36 -0700)]
systat - enhance interrupt display (2)
* Also collapse 'dev auxN', e.g. 'igb0 rxtx0', 'igb0 rxtx1', etc is
collapsed to 'igb0 rxtx*'.
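A minimal sketch of this kind of name collapsing, assuming names of the form 'dev auxN'. The function is hypothetical and is not taken from systat; it only rewrites the trailing number of the second word.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/*
 * Rewrite "igb0 rxtx3" as "igb0 rxtx*" in place: strip the trailing run
 * of digits and append '*'. Names without a space or without a trailing
 * digit are left unchanged. Returns the same buffer for convenience.
 */
static char *collapse_name(char *name)
{
    size_t len = strlen(name);
    size_t end = len;

    if (strchr(name, ' ') == NULL)
        return name;             /* not of the 'dev auxN' form */
    while (end > 0 && isdigit((unsigned char)name[end - 1]))
        end--;
    if (end < len) {             /* at least one digit was removed */
        name[end] = '*';
        name[end + 1] = '\0';
    }
    return name;
}
```

Lines that end up with the same collapsed name can then be aggregated into one display row.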
Matthew Dillon [Thu, 21 Jul 2016 01:15:00 +0000 (18:15 -0700)]
docs - Update tuning.7
* Revamp the swap space notes for modern times.
Matthew Dillon [Thu, 21 Jul 2016 01:09:16 +0000 (18:09 -0700)]
systat - enhance interrupt display
* There are often too many interrupts to list, collapse all
interrupts with the same name (e.g. usually multi-cpu interrupts)
into a single line and aggregate the results.
Justin C. Sherrill [Thu, 21 Jul 2016 01:09:51 +0000 (21:09 -0400)]
Updates to show "4.7".
Matthew Dillon [Wed, 20 Jul 2016 23:50:01 +0000 (16:50 -0700)]
kernel - Fix excessive ipiq recursion (2)
* Second try at this fix. Use different hysteresis levels when recursively
processing incoming IPIs during a send, and in such cases only process
incoming IPIs on queues which are trying to drain.
Matthew Dillon [Wed, 20 Jul 2016 23:48:57 +0000 (16:48 -0700)]
test - burst vmpageinfo pages
* burst vm_page structures in vmpageinfo to improve the scan rate.
François Tigeot [Wed, 20 Jul 2016 21:22:57 +0000 (23:22 +0200)]
linux/scatterlist.h: Add __sg_page_iter_next()
Obtained-from: Matt Macy <mmacy@nextbsd.org>
François Tigeot [Wed, 20 Jul 2016 20:21:49 +0000 (22:21 +0200)]
drm/linux: Add bitmap_weight()
Obtained-from: FreeBSD
François Tigeot [Wed, 20 Jul 2016 20:07:41 +0000 (22:07 +0200)]
drm/linux: Add a few ida definitions
Obtained-from: FreeBSD
François Tigeot [Wed, 20 Jul 2016 19:36:27 +0000 (21:36 +0200)]
drm/linux: Add ktime_to_us() and ktime_us_delta()
Sascha Wildner [Wed, 20 Jul 2016 17:10:52 +0000 (19:10 +0200)]
Add the sigwaitinfo.2 manual page from FreeBSD.
Also, bring in a number of fixes/improvements from FreeBSD in
other manual pages.
Submitted-by: zrj
Taken-from: FreeBSD
Matthew Dillon [Wed, 20 Jul 2016 06:56:15 +0000 (23:56 -0700)]
kernel - Fix excessive ipiq recursion
* Fix a situation where excessive IPIQ recursion can occur. The problem
was revealed by the previous commit when the passive signalling mechanism
was changed.
* Passive IPI sends now signal at 1/4 full.
* Active IPI sends wait for the FIFO to be < 1/2 full only when the nesting
level is 0, otherwise they allow it to become almost completely full.
This effectively gives IPI callbacks a buffer of roughly 1/2 the FIFO in
which they can issue IPI sends without triggering the wait-process loop
(which is the cause of the nesting).
IPI callbacks do not usually send more than one or two IPI sends to any
given cpu target which should theoretically guarantee that excessive
stacking will not occur.
Reported-by: marino
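The thresholds described above can be sketched as two small predicates. The FIFO depth, the exact "almost completely full" limit and the function names are assumptions made for illustration; only the 1/4 and 1/2 levels come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

#define FIFO_SIZE 256   /* hypothetical FIFO depth */

/* Passive sends signal the target once the FIFO is a quarter full. */
static bool passive_should_signal(int queued)
{
    return queued >= FIFO_SIZE / 4;
}

/*
 * Active sends: at nesting level 0 wait until the FIFO is below half
 * full. When already nested (sending from inside an IPI callback) only
 * wait when it is almost completely full; FIFO_SIZE - 2 is an assumed
 * value for "almost".
 */
static bool active_must_wait(int queued, int nesting)
{
    int limit = (nesting == 0) ? FIFO_SIZE / 2 : FIFO_SIZE - 2;

    assert(queued >= 0 && queued <= FIFO_SIZE);
    return queued >= limit;
}
```

The gap between the two limits is the buffer of roughly half the FIFO that nested senders can consume without entering the wait-process loop.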
Matthew Dillon [Wed, 20 Jul 2016 00:14:33 +0000 (17:14 -0700)]
kernel - Fix Xinvltlb issue, fix ipiq issue, add Xsniff
* The Xinvltlb IPI interrupt enables interrupts in smp_inval_intr(), which
allows potentially pending interrupts and other things to happen. We
must use doreti instead of doreti_iret.
* Fix a reentrancy issue with lwkt_ipiq. Reentrancy can occur when the ipi
callback itself needs to issue an IPI, but the target cpu FIFO is full.
When this happens, the cpu mask may not be correct so force a scan of all
cpus in this situation.
* Add an infinite loop detection test to lwkt_process_ipiq() and jigger
another IPI if it persists more than 10 seconds, hopefully recovering the
system if as-yet unknown IPI issues persist.
* Add the Xsniff IPI and augment systat -pv to use it. This sniffs the %rip
and %rpc on all cpus, allowing us to see where the kernel spends its
time.
Matthew Dillon [Wed, 20 Jul 2016 00:12:45 +0000 (17:12 -0700)]
hammer2: Add required check to hammer2_vop_nlink()
* Add required mount compatibility check to hammer2_vop_nlink().
François Tigeot [Tue, 19 Jul 2016 22:23:35 +0000 (00:23 +0200)]
drm/linux: Add div_s64()
zrj [Tue, 19 Jul 2016 09:29:45 +0000 (12:29 +0300)]
ifnet.9: Fix if_start() prototype in manpage.
Pointed-out-by: bycn82
zrj [Mon, 18 Jul 2016 15:47:44 +0000 (18:47 +0300)]
sys: Various include guard fixes.
zrj [Mon, 18 Jul 2016 15:26:56 +0000 (18:26 +0300)]
Remove pcibus.h header.
It is a subset of pci_cfgreg.h and both headers were included together.
zrj [Mon, 18 Jul 2016 12:18:05 +0000 (15:18 +0300)]
Prune _NO_NAMESPACE_POLLUTION cases.
param.h is not needed in sys/socket.h and removing it simplifies
handling between MD and AD headers.
zrj [Mon, 18 Jul 2016 12:10:50 +0000 (15:10 +0300)]
atomic.9: Align cpumask.
Matthew Dillon [Tue, 19 Jul 2016 01:15:11 +0000 (18:15 -0700)]
kernel - Fix realtime inconsistency
* The original hardclock() code assumed that an IPI (which can't get lost)
would distribute the tick across all cpus, but that no longer happens.
Code that incremented gd->gd_time_second and maintained the compensation
base gd->gd_cpuclock_base for relative calculations via cpu_systimer()
could slowly lose seconds. Once enough seconds accumulated,
gd_cpuclock_base would overflow and one or more cpu's would wind up with
a wildly incorrect (~40 seconds off) real time.
* Fix this by having CPU N just copy the compensation base from CPU 0. That
is, the base might be up to one tick off, but that is well within the
overflow range (which is ~40 seconds) and the time code will deal with it
properly. We use the same FIFO trick that we use for basetime[] to avoid
catching CPU 0 in the act of updating the timebase.
* Add missing lfence()s. These are required because if we catch the
basetime_index just after it changed, a pre-fetch of older array
content will be very wrong.
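The FIFO trick and the need for a load fence can be sketched with C11 atomics, where an acquire load plays the role of the lfence. This is an illustration only: the structure, slot count and function names are hypothetical, and like any such scheme it relies on a reader finishing with a slot long before the writer wraps around to it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define BASE_FIFO 4   /* hypothetical number of slots, a power of 2 */

struct timebase {
    uint64_t    base[BASE_FIFO];
    _Atomic int index;           /* slot that readers should use */
};

/* Writer (think CPU 0): fill the next slot, then publish its index. */
static void timebase_publish(struct timebase *tb, uint64_t value)
{
    int cur = atomic_load_explicit(&tb->index, memory_order_relaxed);
    int next = (cur + 1) & (BASE_FIFO - 1);

    tb->base[next] = value;
    /* Release: the slot contents become visible before the new index. */
    atomic_store_explicit(&tb->index, next, memory_order_release);
}

/* Reader (any other cpu): load the index first, then the slot. */
static uint64_t timebase_read(struct timebase *tb)
{
    /* Acquire: the slot read below cannot be satisfied by a value
     * fetched before the index was loaded. */
    int idx = atomic_load_explicit(&tb->index, memory_order_acquire);

    return tb->base[idx];
}
```

Because the writer never modifies the slot that is currently published, a reader sees either the old complete entry or the new complete entry.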
John Marino [Mon, 18 Jul 2016 11:25:28 +0000 (13:25 +0200)]
libc/collate.c: Revert previous, use F11 fix for ISO 8859-5
There were some edge failures with the previous fix as discussed between
Illumos (Tirkkonen/D'Amore) and FreeBSD (bapt). They were considered
showstoppers for F11 release; this change follows the current solution
for FreeBSD.
Sascha Wildner [Sun, 17 Jul 2016 21:19:42 +0000 (23:19 +0200)]
Sync zoneinfo database with tzdata2016f from ftp://ftp.iana.org/tz/releases
* The Egyptian government changed its mind on short notice, and
Africa/Cairo did not introduce DST starting 2016-07-07 after all.
(Thanks to Mina Samuel.)
* Asia/Novosibirsk switches from +06 to +07 on 2016-07-24 at 02:00.
(Thanks to Stepan Golosunov.)
* Asia/Novokuznetsk and Asia/Novosibirsk now use numeric time zone
abbreviations instead of invented ones.
* Europe/Minsk's 1992-03-29 spring-forward transition was at 02:00 not 00:00.
(Thanks to Stepan Golosunov.)
Matthew Dillon [Sun, 17 Jul 2016 18:56:49 +0000 (11:56 -0700)]
kernel - Improve physio performance (2)
* Increase the cap on pbuf_mem buffers from 256 to 512. 256
wasn't enough to max-out three NVMe devices.
* Add 25% hysteresis to the pbuf_{mem,kva,raw}_count counters
to reduce unnecessary tsleep()s and wakeup()s (and thus
unnecessary IPIs) when the pbuf pool is exhausted.
Add a tiny bit of hysteresis for the localized *pfreecnt
as subsystems tend to use smaller values (e.g. pageout
code).
* In physio tests throughput with 3 x NVMe + 4 x SATA SSDs
increases to 6.5 GBytes/sec and max IOPS @ 4K increases
to 1.05M IOPS (yes, that's million). (random read
from urandom-filled partition using 32KB and 4KB blocks,
with high user process concurrency).
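One plausible reading of the hysteresis, sketched below, is that once the pool has run dry, sleepers are not woken until a quarter of it is free again. The structure and function names are hypothetical and the real pbuf accounting may differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical pool accounting; not the kernel's actual structures. */
struct pool {
    int  total;     /* buffers in the pool */
    int  free;      /* buffers currently free */
    bool starved;   /* set when an allocation found the pool empty */
};

/* Returns true if a buffer was taken, false if the caller must sleep. */
static bool pool_get(struct pool *p)
{
    if (p->free == 0) {
        p->starved = true;
        return false;
    }
    p->free--;
    return true;
}

/*
 * Returns true when sleepers should be woken. After the pool has run
 * dry, wakeups are held back until a quarter of the pool is free again,
 * so one release does not wake a crowd that immediately sleeps again.
 */
static bool pool_put(struct pool *p)
{
    assert(p->free < p->total);
    p->free++;
    if (p->starved && p->free >= p->total / 4) {
        p->starved = false;
        return true;
    }
    return false;
}
```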
Tomohiro Kusumi [Sun, 17 Jul 2016 13:19:11 +0000 (22:19 +0900)]
sys/kern: Mention pid 0 in usched_set(2) BUGS section
usched_set(2) only works for the current thread,
so it doesn't really matter if a caller specifies 0 or getpid().
Because of this, one would basically just pass 0 for pid.
Passing neither 0 nor current pid just results in EINVAL.
After this sanity check, uap->pid is never used.
> if (uap->pid != 0 && uap->pid != curthread->td_proc->p_pid)
> return (EINVAL);
François Tigeot [Sun, 17 Jul 2016 06:18:11 +0000 (08:18 +0200)]
drm/linux: Implement writex() functions
Matthew Dillon [Sun, 17 Jul 2016 06:15:19 +0000 (23:15 -0700)]
kernel - Improve physio performance
* See http://apollo.backplane.com/DFlyMisc/nvme_sys03.txt
* Hash the pbuf system. This chops down spin-lock collisions
at high transaction rates (>150K IOPS) by 1000x.
* Implement a pbuf with pre-allocated kernel memory that we
copy into, avoiding page table manipulations and thus
avoiding system-wide invltlb/invlpg IPIs.
* This increases NVMe IOPS tests with three cards from
150K-200K IOPS to 950K IOPS using physio (random read,
4K blocks, from urandom-filled partition, with many
process threads, from 3 NVMe cards in parallel).
* Further adjustments to the vkernel build.
Matthew Dillon [Sun, 17 Jul 2016 02:16:02 +0000 (19:16 -0700)]
kernel - Refactor Xinvltlb (3)
* Rollup invalidation operations for numerous kernel-related pmap, reducing
the number of IPIs needed (particularly for buffer cache operations).
* Implement semi-synchronous command execution, where target cpus do not
need to wait for the originating cpu to execute a command. This is used
for the above rollups when the related kernel memory is known to be accessed
concurrently with the pmap operations.
* Support invalidation of VA ranges.
* Support reduction of target cpu set for semi-synchronous commands, including
invltlb's, by removing idle cpus from the set when possible.
Sascha Wildner [Sun, 17 Jul 2016 02:58:42 +0000 (04:58 +0200)]
Fix vkernel build after pmap changes.
Matthew Dillon [Sat, 16 Jul 2016 20:07:46 +0000 (13:07 -0700)]
kernel - Refactor Xinvltlb (2)
* Back out the optimization where we avoided invalidating the tlb on
pte creation when the prior contents of the pte were 0.
The time has not yet come for this. There are still a few situations where
we appear to clear kernel pte's out without invalidating, which means
that we must invalidate when we enter new pte's into a pmap.
Reported-by: marino
Tomohiro Kusumi [Sat, 16 Jul 2016 16:07:55 +0000 (01:07 +0900)]
sbin/usched: Add cpumask limitation to usched(8) BUGS section
Tomohiro Kusumi [Sat, 16 Jul 2016 01:49:15 +0000 (10:49 +0900)]
sys/kern: Add USCHED_GET_CPUMASK for usched_set(2)
Add a new usched_set(2) command USCHED_GET_CPUMASK which simply
copies the cpumask of lwp to a pointer specified by userspace.
It's the same as USCHED_GET_CPU except that USCHED_GET_CPU copies
the cpu id of lwp to userspace.
Many of the other kernels including Linux and FreeBSD have this
functionality via kernel specific syscalls, and not having it makes
some userspace programs difficult to port to DragonFly or support
the same feature sets that are available on other platforms.
Tomohiro Kusumi [Fri, 15 Jul 2016 23:57:06 +0000 (08:57 +0900)]
sys/cpu/x86_64: Expose CPUMASK macros to userspace without _KERNEL_STRUCTURES
Userspace programs other than /sbin/usched may use cpu affinity,
as the syscall was added for userspace programs to control it,
so it should not require _KERNEL_STRUCTURES.
Also note that cpumask_t which is a structure used by CPUMASK
macros doesn't require _KERNEL_STRUCTURES.
Confirmed the change doesn't break buildworld and buildkernel/LINT64.
(I actually had compile-time issues with fio while trying to add
cpu affinity support, and ended up copy-pasting CPUMASK macros
to a DragonFly specific header in fio source without defining
_KERNEL_STRUCTURES)
Sascha Wildner [Sat, 16 Jul 2016 06:12:14 +0000 (08:12 +0200)]
Update the pciconf(8) database.
July 13, 2016 snapshot from http://pciids.sourceforge.net/
Matthew Dillon [Fri, 15 Jul 2016 20:28:39 +0000 (13:28 -0700)]
kernel - Refactor Xinvltlb and the pmap page & global tlb invalidation code
* Augment Xinvltlb to handle both TLB invalidation and per-page invalidation
* Remove the old lwkt_ipi-based per-page invalidation code.
* Include Xinvltlb interrupts in the V_IPI statistics counter
(so they show up in systat -pv 1).
* Add loop counters to detect and log possible endless loops.
* (Fix single_apic_ipi_passive() but note that this function is currently
not used. Interrupts must be hard-disabled when checking icr_lo).
* NEW INVALIDATION MECHANISM
The new invalidation mechanism is primarily enclosed in mp_machdep.c and
pmap_inval.c. Supply new all-in-one rollup functions which include the
*ptep contents adjustment, instead of prior piecemeal functions.
The new mechanism uses Xinvltlb for both full-tlb and per-page
invalidations. This interrupt ignores critical sections (that is,
will operate even if kernel code is in a critical section), which
significantly improves the latency and stability of our pmap pte
invalidation support functions.
For example, prior to these changes the invalidation code used the
lwkt_ipiq paths, which are subject to critical sections and could result
in long stalls across substantially ALL cpus when one cpu was in a long
cpu-bound critical section.
* NEW SMP_INVLTLB() OPTIMIZATION
smp_invltlb() always used Xinvltlb, and it still does. However the
code now avoids IPIing idle cpus, instead flagging them to issue the
cpu_invltlb() call when they wake-up.
To make this work the idle code must temporarily enter a critical section
so 'normal' interrupts do not run until it has a chance to check and act
on the flag. This will slightly increase interrupt latency on an idle
cpu.
This change significantly reduces smp_invltlb() overhead by avoiding
having to pull idle cpus out of their high-latency/low-power state. It
also keeps the wake-up latency of those cpus from delaying the invalidation.
* Remove unnecessary calls to smp_invltlb(). It is not necessary to call
this function when a *ptep is transitioning from 0 to non-zero. This
significantly cuts down on smp_invltlb() traffic under load.
* Remove a bunch of unused code in these paths.
* Add machdep.report_invltlb_src and machdep.report_invlpg_src, down
counters which do one stack backtrace when they hit 0.
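A down counter of this kind can be sketched as follows. The function name report_countdown() is hypothetical, and the backtrace itself is left to the caller; the sketch only shows the "fire exactly once when reaching 0" behaviour described above.

```c
/*
 * Hypothetical helper: decrement a non-zero counter and report,
 * exactly once, the call on which it reaches 0.  A counter that
 * is already 0 is left alone, so the report does not repeat.
 */
static int
report_countdown(int *counter)
{
	if (*counter > 0 && --(*counter) == 0)
		return (1);	/* caller would print a backtrace here */
	return (0);
}
```

Setting the counter to N therefore samples the Nth invalidation after the write, which is a cheap way to find out who is generating the traffic.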
TIMING TESTS
No appreciable differences with the new code other than feeling smoother.
mount_tmpfs dummy /usr/obj
On monster (4-socket, 48-core):
time make -j 50 buildworld
BEFORE: 7849.697u 4693.979s 16:23.07 1275.9%
AFTER: 7682.598u 4467.224s 15:47.87 1281.8%
time make -j 50 nativekernel NO_MODULES=TRUE
BEFORE: 927.608u 254.626s 1:36.01 1231.3%
AFTER: 531.124u 204.456s 1:25.99 855.4%
On 2 x E5-2620 (2-socket, 32-core):
time make -j 50 buildworld
BEFORE: 5750.042u 2291.083s 10:35.62 1265.0%
AFTER: 5694.573u 2280.078s 10:34.96 1255.9%
time make -j 50 nativekernel NO_MODULES=TRUE
BEFORE: 431.338u 84.458s 0:54.71 942.7%
AFTER: 414.962u 92.312s 0:54.75 926.5%
(time mostly spent in mkdep line and on final link)
Memory thread tests, 64 threads each allocating memory.
BEFORE: 3.1M faults/sec
AFTER: 3.1M faults/sec.
Matthew Dillon [Fri, 15 Jul 2016 20:25:09 +0000 (13:25 -0700)]
kernel - Remove unnecessary cpu_enable_intr()
* Remove an unnecessary cpu_enable_intr() being called just prior to
a write_rflags().
Matthew Dillon [Fri, 15 Jul 2016 20:20:32 +0000 (13:20 -0700)]
kernel - Enhance CPUMASK and atomic ops
* Add atomic_testandset_long()
Add atomic_testandclear_long()
* Add atomic_cmpxchg_long_test(). This is for debugging only, it uses the
'z' flag instead of comparing old-vs-result. But they should have the
same effect.
* Add macros for atomic_store_rel_cpumask() and atomic_load_acq_cpumask().
* Add ATOMIC_CPUMASK_TESTANDSET()
Add ATOMIC_CPUMASK_TESTANDCLR()
Add ATOMIC_CPUMASK_COPY()
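For readers unfamiliar with test-and-set style primitives, the following portable sketch shows the semantics using C11 atomics. The sketch_ names and argument order are assumptions for illustration only and do not reproduce the kernel's definitions, which would normally be written as machine-specific inline assembly.

```c
#include <stdatomic.h>

/* Atomically set bit 'bit'; return 1 if it was already set. */
static int
sketch_testandset_long(_Atomic unsigned long *p, unsigned bit)
{
	unsigned long mask = 1UL << bit;

	return ((atomic_fetch_or(p, mask) & mask) != 0);
}

/* Atomically clear bit 'bit'; return 1 if it was previously set. */
static int
sketch_testandclear_long(_Atomic unsigned long *p, unsigned bit)
{
	unsigned long mask = 1UL << bit;

	return ((atomic_fetch_and(p, ~mask) & mask) != 0);
}
```

The return value tells the caller whether it was the one that changed the bit, which is what makes these operations usable for claiming or releasing a cpu's slot in a mask.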
François Tigeot [Fri, 15 Jul 2016 21:05:36 +0000 (23:05 +0200)]
drm/linux: Add ioremap_wt()
François Tigeot [Fri, 15 Jul 2016 20:50:57 +0000 (22:50 +0200)]
drm/linux: Rework ioremap functions
No need to have pmap_mapdev_xxx() calls into the leaf functions,
put as much code as possible into __ioremap_common()
Tomohiro Kusumi [Fri, 15 Jul 2016 14:18:06 +0000 (23:18 +0900)]
sbin/newfs_hammer: Don't exit if -f when a blkdev doesn't support TRIM
With the force option, exit(1) only when ioctl(IOCTLTRIM) fails.
Matthew Dillon [Fri, 15 Jul 2016 01:14:39 +0000 (18:14 -0700)]
kernel - Rename 'cpu' global
* Rename the 'cpu' global to 'cpu_type' to avoid overloading the variable.
Many procedures iterate cpus using a local 'cpu' variable.
* Fix one instance where a procedure iterated using the global instead of
a local.
Imre Vadász [Wed, 13 Jul 2016 20:47:15 +0000 (22:47 +0200)]
vga - Check for UEFI framebuffer in vga_configure() and vga_probe().
* If we have a UEFI framebuffer, we definitely won't be able to use a VGA
device at the same time.
* TODO: Another case where we probably should disable the vga(4) driver, is
when the "VGA not present" bit is set in the ACPI FADT BootFlags
value.
Tomohiro Kusumi [Thu, 14 Jul 2016 15:59:50 +0000 (00:59 +0900)]
sbin/newfs_hammer: Refactor TRIM support
Tomohiro Kusumi [Thu, 14 Jul 2016 15:17:41 +0000 (00:17 +0900)]
sbin/newfs_hammer: Don't assume blkdev is /dev/da...
newfs_hammer has "/dev/da..." hardcoded in its TRIM support,
as TRIM sysctls exist only for physical disks.
newfs_hammer should detect non-physical block devices such as
device mapper or loopback devices, before it calls sysctl(3),
so as not to print an error message like below.
# newfs_hammer -E -L TEST /dev/mapper/linear1
Volume 0 DEVICE /dev/mapper/linear1 size 465.66TB
DEVICE /dev/mapper/linear1 (kern.cam.da.pper/linear1.trim_enabled) does not support the TRIM command
^^^^^^^^^^^^
usage: newfs_hammer -L label [-Ef] [-b bootsize] [-m savesize] [-u undosize]
[-V version] special ...
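The garbled sysctl name in the output above comes from skipping a fixed "/dev/da" prefix without checking that it is actually there. A minimal sketch of the kind of check involved is shown below; the function name trim_sysctl_name() is hypothetical, and the sysctl name format is inferred from the error message rather than taken from the newfs_hammer source.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: build the TRIM sysctl name for a da(4) device.
 * Returns 0 on success, -1 if the path is not of the form /dev/da<unit>...
 */
static int
trim_sysctl_name(const char *path, char *buf, size_t len)
{
	static const char prefix[] = "/dev/da";
	const size_t plen = sizeof(prefix) - 1;
	long unit;
	int n;

	if (strncmp(path, prefix, plen) != 0)
		return (-1);
	if (!isdigit((unsigned char)path[plen]))
		return (-1);
	unit = strtol(path + plen, NULL, 10);
	n = snprintf(buf, len, "kern.cam.da.%ld.trim_enabled", unit);
	return ((n > 0 && (size_t)n < len) ? 0 : -1);
}
```

With such a check, /dev/mapper/linear1 is rejected before any sysctl lookup is attempted, so no misleading name is printed.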
zrj [Thu, 14 Jul 2016 11:50:04 +0000 (14:50 +0300)]
<signal.h>: Bring back SI_QUEUE.
Some dports assume SI_QUEUE is available (especially in "make test").
Even though SI_QUEUE would never be active in signal handlers, add it back to
reduce the amount of patching in dports test sources.
Matthew Dillon [Thu, 14 Jul 2016 04:38:31 +0000 (21:38 -0700)]
kernel - Distribute queues in rw-sep map.
* Instead of forcing all cpus to share the same submission queue in
the ncpus > nsubqs case, distribute available submission queues
to the cpus to try to reduce conflicts.
* Will also distribute available completion queues to the submission
queues.
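One simple way to spread a limited number of queues over a larger number of cpus is round-robin assignment, sketched below. The function names and the modulo mapping are assumptions for illustration and may differ from what the driver does; queue ids start at 1 because queue 0 is the NVMe admin queue.

```c
/*
 * Hypothetical round-robin mapping.  I/O queue ids are 1-based
 * because id 0 is reserved for the admin queue.
 */
static int
subq_for_cpu(int cpu, int nsubqs)
{
	return ((cpu % nsubqs) + 1);
}

/* Spread the completion queues over the submission queues the same way. */
static int
comq_for_subq(int subq, int ncomqs)
{
	return (((subq - 1) % ncomqs) + 1);
}
```

With 8 cpus and 4 submission queues, each queue ends up shared by 2 cpus instead of all 8 cpus contending on a single queue.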
Matthew Dillon [Thu, 14 Jul 2016 02:41:21 +0000 (19:41 -0700)]
nvme - Fix comq mappings when too many cpus.
* Fix the rw-sep, minimal, and basic comq mappings. These mappings occur
when there are too many cpus to accommodate available submission and
completion queues.
* Fixes bug where a bad completion queue was being specified in the creation
of a submission queue.
Imre Vadász [Wed, 13 Jul 2016 20:12:28 +0000 (22:12 +0200)]
vga - Remove unused vga_sub_configure variable.
Sascha Wildner [Wed, 13 Jul 2016 17:21:05 +0000 (19:21 +0200)]
sigaction.2: Comment out reference to sigset().
Sascha Wildner [Wed, 13 Jul 2016 17:20:46 +0000 (19:20 +0200)]
kqueue.2: Fix a typo in a function name (sigpromask -> sigprocmask).
Imre Vadász [Tue, 12 Jul 2016 17:48:56 +0000 (19:48 +0200)]
wlan - send RTM_IEEE80211_SCAN event when scan was cancelled.
wpa_supplicant(8) expects to see a 'scan complete' event after every
scan command; if the event is not sent, it will hang
indefinitely.
Taken-From: FreeBSD (SVN r300383)
Imre Vadász [Tue, 12 Jul 2016 17:46:17 +0000 (19:46 +0200)]
wlan - restore interface state check for IEEE80211_IOC_SCAN_REQ ioctl.
Do not try to start a scan when interface is not running.
How-to-reproduce:
1) ifconfig wlan0 create wlandev urtwn0
2) wlandebug -i wlan0 state
3) ifconfig wlan0 scan
Taken-From: FreeBSD (SVN r300237)
Imre Vadász [Tue, 12 Jul 2016 19:45:55 +0000 (21:45 +0200)]
if_iwm - When stopping TX DMA, wait for all channels at once.
* Makes the TX DMA stopping more similar to Linux code, and potentially
a bit faster. Also, output an error message when TX DMA idling fails.
Taken-From: Linux iwlwifi
Imre Vadász [Mon, 20 Jun 2016 19:50:53 +0000 (21:50 +0200)]
iwm: Send PHY DB commands as async commands.
Taken-From: OpenBSD
Imre Vadász [Tue, 12 Jul 2016 15:54:06 +0000 (17:54 +0200)]
if_iwm - Set different pm_timeout for action frames.
When building a Tx Command for management frames, we are lacking
a check for action frames, for which we should set a different
pm_timeout. This causes the fw to stay awake for 100TU after each
such frame is transmitted, resulting in excessive power consumption.
Taken-From: Linux iwlwifi (git b084a35663c3f1f7)