Sascha Wildner [Sun, 17 Jul 2016 21:19:42 +0000 (23:19 +0200)]
Sync zoneinfo database with tzdata2016f from ftp://ftp.iana.org/tz/releases
* The Egyptian government changed its mind on short notice, and
Africa/Cairo did not introduce DST starting 2016-07-07 after all.
(Thanks to Mina Samuel.)
* Asia/Novosibirsk switches from +06 to +07 on 2016-07-24 at 02:00.
(Thanks to Stepan Golosunov.)
* Asia/Novokuznetsk and Asia/Novosibirsk now use numeric time zone
abbreviations instead of invented ones.
* Europe/Minsk's 1992-03-29 spring-forward transition was at 02:00 not 00:00.
(Thanks to Stepan Golosunov.)
Matthew Dillon [Sun, 17 Jul 2016 18:56:49 +0000 (11:56 -0700)]
kernel - Improve physio performance (2)
* Increase the cap on pbuf_mem buffers from 256 to 512. 256
wasn't enough to max-out three NVMe devices.
* Add 25% hysteresis to the pbuf_{mem,kva,raw}_count counters
to reduce unnecessary tsleep()s and wakeup()s (and thus
unnecessary IPIs) when the pbuf pool is exhausted.
Add a tiny bit of hysteresis for the localized *pfreecnt
as subsystems tend to use smaller values (e.g. pageout
code).
* In physio tests throughput with 3 x NVMe + 4 x SATA SSDs
increases to 6.5 GBytes/sec and max IOPS @ 4K increases
to 1.05M IOPS (yes, that's million). (random read
from urandom-filled partition using 32KB and 4KB blocks,
with high user process concurrency).
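The 25% hysteresis described above can be sketched roughly as follows. This is a single-threaded model, not the kernel's pbuf code; `struct pool`, `pool_get()` and `pool_put()` are hypothetical names, and the assumption is that waiters are only woken once a quarter of the pool is free again.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a buffer pool counter; not the kernel's pbuf code. */
struct pool {
	int total;	/* pool size */
	int free;	/* buffers currently free */
	bool starved;	/* set once an allocation found the pool empty */
};

/* Take a buffer; returns false (the caller would sleep) when none are free. */
static bool
pool_get(struct pool *p)
{
	if (p->free == 0) {
		p->starved = true;
		return false;
	}
	p->free--;
	return true;
}

/*
 * Return a buffer. Returns true only when sleepers should be woken:
 * after exhaustion we wait until 25% of the pool is free again, so each
 * single release does not trigger its own wakeup.
 */
static bool
pool_put(struct pool *p)
{
	assert(p->free < p->total);
	p->free++;
	if (p->starved && p->free >= p->total / 4) {
		p->starved = false;
		return true;
	}
	return false;
}
```

With an 8-buffer pool, the first release after exhaustion stays silent and the second (reaching 2 of 8 free) is the one that would issue the wakeup.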
Tomohiro Kusumi [Sun, 17 Jul 2016 13:19:11 +0000 (22:19 +0900)]
sys/kern: Mention pid 0 in usched_set(2) BUGS section
usched_set(2) only works for the current thread,
so it doesn't really matter if a caller specifies 0 or getpid().
Because of this, one would basically just pass 0 for pid.
Passing neither 0 nor current pid just results in EINVAL.
After this sanity check, uap->pid is never used.
> if (uap->pid != 0 && uap->pid != curthread->td_proc->p_pid)
> return (EINVAL);
François Tigeot [Sun, 17 Jul 2016 06:18:11 +0000 (08:18 +0200)]
drm/linux: Implement writex() functions
Matthew Dillon [Sun, 17 Jul 2016 06:15:19 +0000 (23:15 -0700)]
kernel - Improve physio performance
* See http://apollo.backplane.com/DFlyMisc/nvme_sys03.txt
* Hash the pbuf system. This chops down spin-lock collisions
at high transaction rates (>150K IOPS) by 1000x.
* Implement a pbuf with pre-allocated kernel memory that we
copy into, avoiding page table manipulations and thus
avoiding system-wide invltlb/invlpg IPIs.
* This increases NVMe IOPS tests with three cards from
150K-200K IOPS to 950K IOPS using physio (random read,
4K blocks, from urandom-filled partition, with many
process threads, from 3 NVMe cards in parallel).
* Further adjustments to the vkernel build.
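As an illustration of why hashing a pool reduces lock collisions, here is a minimal sketch, assuming the pool is split into buckets that would each carry their own spin lock. `NBUCKETS`, `struct bucket`, `buckets_init()` and `bucket_get()` are hypothetical and are not taken from the commit.

```c
#include <assert.h>

#define NBUCKETS 16	/* hypothetical bucket count, a power of two */

struct bucket {
	/* a real implementation would keep a per-bucket spin lock here */
	int free;
};

static struct bucket buckets[NBUCKETS];

/* Spread a total of 'n' buffers evenly over the buckets. */
static void
buckets_init(int n)
{
	assert(n >= 0);
	for (int i = 0; i < NBUCKETS; i++)
		buckets[i].free = n / NBUCKETS + (i < n % NBUCKETS ? 1 : 0);
}

/*
 * Take one buffer, starting at the bucket selected by 'hint' (for example
 * a cpu id or a rotor) and moving on to the neighbours only when that
 * bucket is empty. Returns the bucket index used, or -1 when every bucket
 * is empty.
 */
static int
bucket_get(unsigned hint)
{
	for (unsigned i = 0; i < NBUCKETS; i++) {
		unsigned b = (hint + i) & (NBUCKETS - 1);

		if (buckets[b].free > 0) {
			buckets[b].free--;
			return (int)b;
		}
	}
	return -1;
}
```

Callers with different hints normally touch different buckets, so they would contend on different locks.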
Matthew Dillon [Sun, 17 Jul 2016 02:16:02 +0000 (19:16 -0700)]
kernel - Refactor Xinvltlb (3)
* Rollup invalidation operations for numerous kernel-related pmap, reducing
the number of IPIs needed (particularly for buffer cache operations).
* Implement semi-synchronous command execution, where target cpus do not
need to wait for the originating cpu to execute a command. This is used
for the above rollups when the related kernel memory is known to be accessed
concurrently with the pmap operations.
* Support invalidation of VA ranges.
* Support reduction of target cpu set for semi-synchronous commands, including
invltlb's, by removing idle cpus from the set when possible.
Sascha Wildner [Sun, 17 Jul 2016 02:58:42 +0000 (04:58 +0200)]
Fix vkernel build after pmap changes.
Matthew Dillon [Sat, 16 Jul 2016 20:07:46 +0000 (13:07 -0700)]
kernel - Refactor Xinvltlb (2)
* Backout the optimization where we avoided invalidating the tlb on
pte creation when the prior contents of the pte was 0.
The time has not yet come for this, there are still a few situations where
we appear to clear kernel pte's out without invalidating, which means
that we must invalidate when we enter new pte's into a pmap.
Reported-by: marino
Tomohiro Kusumi [Sat, 16 Jul 2016 16:07:55 +0000 (01:07 +0900)]
sbin/usched: Add cpumask limitation to usched(8) BUGS section
Tomohiro Kusumi [Sat, 16 Jul 2016 01:49:15 +0000 (10:49 +0900)]
sys/kern: Add USCHED_GET_CPUMASK for usched_set(2)
Add a new usched_set(2) command USCHED_GET_CPUMASK which simply
copies the cpumask of lwp to a pointer specified by userspace.
It's the same as USCHED_GET_CPU except that USCHED_GET_CPU copies
the cpu id of lwp to userspace.
Many of the other kernels including Linux and FreeBSD have this
functionality via kernel specific syscalls, and not having it makes
some userspace programs difficult to port to DragonFly or support
the same feature sets that are available on other platforms.
Tomohiro Kusumi [Fri, 15 Jul 2016 23:57:06 +0000 (08:57 +0900)]
sys/cpu/x86_64: Expose CPUMASK macros to userspace without _KERNEL_STRUCTURES
Userspace programs other than /sbin/usched may use cpu affinity,
as the syscall was added for userspace programs to control it,
so it should not require _KERNEL_STRUCTURES.
Also note that cpumask_t which is a structure used by CPUMASK
macros doesn't require _KERNEL_STRUCTURES.
Confirmed the change doesn't break buildworld and buildkernel/LINT64.
(I actually had compile-time issues with fio while trying to add
cpu affinity support, and ended up copy-pasting CPUMASK macros
to a DragonFly specific header in fio source without defining
_KERNEL_STRUCTURES)
Sascha Wildner [Sat, 16 Jul 2016 06:12:14 +0000 (08:12 +0200)]
Update the pciconf(8) database.
July 13, 2016 snapshot from http://pciids.sourceforge.net/
Matthew Dillon [Fri, 15 Jul 2016 20:28:39 +0000 (13:28 -0700)]
kernel - Refactor Xinvltlb and the pmap page & global tlb invalidation code
* Augment Xinvltlb to handle both TLB invalidation and per-page invalidation
* Remove the old lwkt_ipi-based per-page invalidation code.
* Include Xinvltlb interrupts in the V_IPI statistics counter
(so they show up in systat -pv 1).
* Add loop counters to detect and log possible endless loops.
* (Fix single_apic_ipi_passive() but note that this function is currently
not used. Interrupts must be hard-disabled when checking icr_lo).
* NEW INVALIDATION MECHANISM
The new invalidation mechanism is primarily enclosed in mp_machdep.c and
pmap_inval.c. Supply new all-in-one rollup functions which include the
*ptep contents adjustment, instead of prior piecemeal functions.
The new mechanism uses Xinvltlb for both full-tlb and per-page
invalidations. This interrupt ignores critical sections (that is,
will operate even if kernel code is in a critical section), which
significantly improves the latency and stability of our pmap pte
invalidation support functions.
For example, prior to these changes the invalidation code used the
lwkt_ipiq paths which are subject to critical sections and could result
in long stalls across substantially ALL cpus when one cpu was in a long
cpu-bound critical section.
* NEW SMP_INVLTLB() OPTIMIZATION
smp_invltlb() always used Xinvltlb, and it still does. However the
code now avoids IPIing idle cpus, instead flagging them to issue the
cpu_invltlb() call when they wake-up.
To make this work the idle code must temporarily enter a critical section
so 'normal' interrupts do not run until it has a chance to check and act
on the flag. This will slightly increase interrupt latency on an idle
cpu.
This change significantly improves smp_invltlb() overhead by avoiding
having to pull idle cpus out of their high-latency/low-power state. Thus
it also avoids the high latency on those cpus messing up.
* Remove unnecessary calls to smp_invltlb(). It is not necessary to call
this function when a *ptep is transitioning from 0 to non-zero. This
significantly cuts down on smp_invltlb() traffic under load.
* Remove a bunch of unused code in these paths.
* Add machdep.report_invltlb_src and machdep.report_invlpg_src, down
counters which do one stack backtrace when they hit 0.
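The smp_invltlb() idle-cpu optimization above can be modelled with two bitmasks. This is a single-threaded sketch under the assumption of at most 64 cpus; `idle_mask`, `pending_mask`, `invltlb_targets()` and `cpu_wakeup()` are hypothetical names, and real code would need atomic operations and the critical-section handling the commit describes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static uint64_t idle_mask;	/* bit n set: cpu n is idle */
static uint64_t pending_mask;	/* bit n set: cpu n owes a TLB flush */

/*
 * Returns the set of cpus that must actually be interrupted. Idle cpus
 * are only flagged; they flush on their own when they wake up.
 */
static uint64_t
invltlb_targets(uint64_t targets)
{
	pending_mask |= targets & idle_mask;
	return targets & ~idle_mask;
}

/* Called by a cpu leaving idle; returns true if it had to flush. */
static bool
cpu_wakeup(unsigned cpu)
{
	assert(cpu < 64);
	uint64_t bit = UINT64_C(1) << cpu;

	idle_mask &= ~bit;
	if (pending_mask & bit) {
		pending_mask &= ~bit;
		return true;	/* the kernel would call cpu_invltlb() here */
	}
	return false;
}
```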
TIMING TESTS
No appreciable differences with the new code other than feeling smoother.
mount_tmpfs dummy /usr/obj
On monster (4-socket, 48-core):
time make -j 50 buildworld
BEFORE: 7849.697u 4693.979s 16:23.07 1275.9%
AFTER: 7682.598u 4467.224s 15:47.87 1281.8%
time make -j 50 nativekernel NO_MODULES=TRUE
BEFORE: 927.608u 254.626s 1:36.01 1231.3%
AFTER: 531.124u 204.456s 1:25.99 855.4%
On 2 x E5-2620 (2-socket, 32-core):
time make -j 50 buildworld
BEFORE: 5750.042u 2291.083s 10:35.62 1265.0%
AFTER: 5694.573u 2280.078s 10:34.96 1255.9%
time make -j 50 nativekernel NO_MODULES=TRUE
BEFORE: 431.338u 84.458s 0:54.71 942.7%
AFTER: 414.962u 92.312s 0:54.75 926.5%
(time mostly spent in mkdep line and on final link)
Memory thread tests, 64 threads each allocating memory.
BEFORE: 3.1M faults/sec
AFTER: 3.1M faults/sec.
Matthew Dillon [Fri, 15 Jul 2016 20:25:09 +0000 (13:25 -0700)]
kernel - Remove unnecessary cpu_enable_intr()
* Remove an unnecessary cpu_enable_intr() being called just prior to
a write_rflags().
Matthew Dillon [Fri, 15 Jul 2016 20:20:32 +0000 (13:20 -0700)]
kernel - Enhance CPUMASK and atomic ops
* Add atomic_testandset_long()
Add atomic_testandclear_long()
* Add atomic_cmpxchg_long_test(). This is for debugging only, it uses the
'z' flag instead of comparing old-vs-result. But they should have the
same effect.
* Add macros for atomic_store_rel_cpumask() and atomic_load_acq_cpumask().
* Add ATOMIC_CPUMASK_TESTANDSET()
Add ATOMIC_CPUMASK_TESTANDCLR()
Add ATOMIC_CPUMASK_COPY()
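For readers unfamiliar with test-and-set primitives, the following userland sketch shows the idea using C11 atomics. It is not the kernel implementation; `testandset_long()` and `testandclear_long()` are hypothetical stand-ins, and it assumes the usual convention that the return value reports the bit's previous state.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Atomically set bit 'bit' and report whether it was already set. */
static bool
testandset_long(atomic_ulong *p, unsigned bit)
{
	assert(bit < sizeof(unsigned long) * CHAR_BIT);
	unsigned long mask = 1UL << bit;

	return (atomic_fetch_or(p, mask) & mask) != 0;
}

/* Atomically clear bit 'bit' and report whether it was previously set. */
static bool
testandclear_long(atomic_ulong *p, unsigned bit)
{
	assert(bit < sizeof(unsigned long) * CHAR_BIT);
	unsigned long mask = 1UL << bit;

	return (atomic_fetch_and(p, ~mask) & mask) != 0;
}
```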
François Tigeot [Fri, 15 Jul 2016 21:05:36 +0000 (23:05 +0200)]
drm/linux: Add ioremap_wt()
François Tigeot [Fri, 15 Jul 2016 20:50:57 +0000 (22:50 +0200)]
drm/linux: Rework ioremap functions
No need to have pmap_mapdev_xxx() calls into the leaf functions,
put as much code as possible into __ioremap_common()
Tomohiro Kusumi [Fri, 15 Jul 2016 14:18:06 +0000 (23:18 +0900)]
sbin/newfs_hammer: Don't exit if -f when a blkdev doesn't support TRIM
With force option, exit(1) only when ioctl(IOCTLTRIM) failed.
Matthew Dillon [Fri, 15 Jul 2016 01:14:39 +0000 (18:14 -0700)]
kernel - Rename 'cpu' global
* Rename the 'cpu' global to 'cpu_type' to avoid overloading the variable.
Many procedures iterate cpus using a local 'cpu' variable.
* Fix one instance where a procedure iterated using the global instead of
a local.
Imre Vadász [Wed, 13 Jul 2016 20:47:15 +0000 (22:47 +0200)]
vga - Check for UEFI framebuffer in vga_configure() and vga_probe().
* If we have a UEFI framebuffer, we definitely won't be able to use a VGA
device at the same time.
* TODO: Another case where we probably should disable the vga(4) driver, is
when the "VGA not present" bit is set in the ACPI FADT BootFlags
value.
Tomohiro Kusumi [Thu, 14 Jul 2016 15:59:50 +0000 (00:59 +0900)]
sbin/newfs_hammer: Refactor TRIM support
Tomohiro Kusumi [Thu, 14 Jul 2016 15:17:41 +0000 (00:17 +0900)]
sbin/newfs_hammer: Don't assume blkdev is /dev/da...
newfs_hammer has "/dev/da..." hardcoded in its TRIM support,
as TRIM sysctls exist only for physical disks.
newfs_hammer should detect non physical block devices such as
device mapper or loopback devices, before it calls sysctl(3),
so as not to print an error message like below.
# newfs_hammer -E -L TEST /dev/mapper/linear1
Volume 0 DEVICE /dev/mapper/linear1 size 465.66TB
DEVICE /dev/mapper/linear1 (kern.cam.da.pper/linear1.trim_enabled) does not support the TRIM command
^^^^^^^^^^^^
usage: newfs_hammer -L label [-Ef] [-b bootsize] [-m savesize] [-u undosize]
[-V version] special ...
zrj [Thu, 14 Jul 2016 11:50:04 +0000 (14:50 +0300)]
<signal.h>: Bring back SI_QUEUE.
Some of dports assume SI_QUEUE is available (especially in "make test").
Even if in signal handlers SI_QUEUE would not be active add it back to
reduce the amount of patching in dports test sources.
Matthew Dillon [Thu, 14 Jul 2016 04:38:31 +0000 (21:38 -0700)]
kernel - Distribute queues in rw-sep map.
* Instead of forcing all cpus to share the same submission queue in
the ncpus > nsubqs case, distribute available submission queues
to the cpus to try to reduce conflicts.
* Will also distribute available completion queues to the submission
queues.
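One simple way to distribute queues when ncpus > nsubqs is a modulo mapping, sketched below. The driver's actual mapping may differ; `subq_for_cpu()` and `comq_for_subq()` are hypothetical helpers using zero-based indices rather than NVMe queue ids.

```c
#include <assert.h>

/* Hypothetical: submission queue index for a cpu when ncpus > nsubqs. */
static int
subq_for_cpu(int cpu, int nsubqs)
{
	assert(cpu >= 0 && nsubqs > 0);
	return cpu % nsubqs;
}

/* Hypothetical: completion queue index serving a submission queue. */
static int
comq_for_subq(int subq, int ncomqs)
{
	assert(subq >= 0 && ncomqs > 0);
	return subq % ncomqs;
}
```

With 8 cpus and 4 submission queues, each queue ends up shared by two cpus instead of all eight sharing one.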
Matthew Dillon [Thu, 14 Jul 2016 02:41:21 +0000 (19:41 -0700)]
nvme - Fix comq mappings when too many cpus.
* Fix the rw-sep, minimal, and basic comq mappings. These mappings occur
when there are too many cpus to accommodate available submission and
completion queues.
* Fixes bug where a bad completion queue was being specified in the creation
of a submission queue.
Imre Vadász [Wed, 13 Jul 2016 20:12:28 +0000 (22:12 +0200)]
vga - Remove unused vga_sub_configure variable.
Sascha Wildner [Wed, 13 Jul 2016 17:21:05 +0000 (19:21 +0200)]
sigaction.2: Comment out reference to sigset().
Sascha Wildner [Wed, 13 Jul 2016 17:20:46 +0000 (19:20 +0200)]
kqueue.2: Fix a typo in a function name (sigpromask -> sigprocmask).
Imre Vadász [Tue, 12 Jul 2016 17:48:56 +0000 (19:48 +0200)]
wlan - send RTM_IEEE80211_SCAN event when scan was cancelled.
wpa_supplicant(8) expects to see a 'scan complete' event after every
scan command; when the event is not sent, it will hang
indefinitely.
Taken-From: FreeBSD (SVN r300383)
Imre Vadász [Tue, 12 Jul 2016 17:46:17 +0000 (19:46 +0200)]
wlan - restore interface state check for IEEE80211_IOC_SCAN_REQ ioctl.
Do not try to start a scan when interface is not running.
How-to-reproduce:
1) ifconfig wlan0 create wlandev urtwn0
2) wlandebug -i wlan0 state
3) ifconfig wlan0 scan
Taken-From: FreeBSD (SVN r300237)
Imre Vadász [Tue, 12 Jul 2016 19:45:55 +0000 (21:45 +0200)]
if_iwm - When stopping TX DMA, wait for all channels at once.
* Makes the TX DMA stopping more similar to Linux code, and potentially
a bit faster. Also, output an error message when TX DMA idling fails.
Taken-From: Linux iwlwifi
Imre Vadász [Mon, 20 Jun 2016 19:50:53 +0000 (21:50 +0200)]
iwm: Send PHY DB commands as async commands.
Taken-From: OpenBSD
Imre Vadász [Tue, 12 Jul 2016 15:54:06 +0000 (17:54 +0200)]
if_iwm - Set different pm_timeout for action frames.
When building a Tx Command for management frames, we are lacking
a check for action frames, for which we should set a different
pm_timeout. This causes the fw to stay awake for 100TU after each
such frame is transmitted, resulting in excessive power consumption.
Taken-From: Linux iwlwifi (git b084a35663c3f1f7)
Imre Vadász [Mon, 11 Jul 2016 13:40:58 +0000 (15:40 +0200)]
if_iwm - Remove iwmsleep, it's no longer needed. Use just lksleep instead.
Matthew Dillon [Tue, 12 Jul 2016 07:23:32 +0000 (00:23 -0700)]
kernel - Adjust arp code to not spam all cpus
* Do not spam all cpus if the arp does not change the routing table.
* Supply (for now) a 1-second hysteresis for expiration updates.
* Add a little netisr debugging for kgdb.
Matthew Dillon [Tue, 12 Jul 2016 07:22:42 +0000 (00:22 -0700)]
kernel - cleanup sys/thread.h
* Cleanup unused TDPRI's
* Add a CPUMASK macro to retrieve the address of an element.
Matthew Dillon [Mon, 11 Jul 2016 19:19:20 +0000 (12:19 -0700)]
kernel - Do not spam all cpus for ipfrag_slowtimo()
* Only issue the ipfrag_slowtimo() to cpus with non-empty ip fragment
queues. This will not impact performance but significantly reduces
unnecessary IPIs to idle cpus. It makes for better systat -pv 1
eye-candy.
* Only allow one ipfrag timeout IPI to be in-flight to any particular target
cpu. This will not impact performance but may help reduce degenerate
ipiq-full conditions if the target cpu becomes cpu-bound in a critical
section.
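The two rules above (skip cpus with empty fragment queues, and allow only one timeout IPI in flight per target) can be sketched with one atomic flag per cpu. `MAXCPU`, `timeout_inflight[]`, `timeout_try_send()` and `timeout_done()` are hypothetical names for illustration only.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAXCPU 64	/* hypothetical */

static atomic_bool timeout_inflight[MAXCPU];

/* Sender side: returns true if an IPI should be sent to 'cpu'. */
static bool
timeout_try_send(int cpu, int nqueued_frags)
{
	assert(cpu >= 0 && cpu < MAXCPU);
	if (nqueued_frags == 0)
		return false;	/* nothing to expire on that cpu */
	/* exchange returns the old value: true means one is already pending */
	return !atomic_exchange(&timeout_inflight[cpu], true);
}

/* Target side: called when the timeout handler has finished. */
static void
timeout_done(int cpu)
{
	assert(cpu >= 0 && cpu < MAXCPU);
	atomic_store(&timeout_inflight[cpu], false);
}
```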
zrj [Tue, 12 Jul 2016 06:47:06 +0000 (09:47 +0300)]
<signal.h>: Don't advertise sigqueue(2) availability.
sigqueue(2) is not yet implemented.
Sascha Wildner [Mon, 11 Jul 2016 17:23:36 +0000 (19:23 +0200)]
efi/loader: Use acdragonfly.h.
acpi.h is not readily includable from sys/boot, so in order to get at
ACPICA definitions etc., the specific ACPICA headers are included
directly, along with whatever acpi.h would include by itself normally.
On DragonFly, this is acdragonfly.h, not acfreebsd.h.
It's just a cosmetic change. The resulting binaries are identical with
one header or the other.
Approved-by: ivadasz
Imre Vadász [Mon, 11 Jul 2016 16:42:03 +0000 (18:42 +0200)]
kernel/pc64: Make metadata.h more compatible with FreeBSD again.
* Use the same values as FreeBSD for MODINFOMD_EFI_MAP and MODINFOMD_EFI_FB,
to keep kernel and bootloader more compatible with FreeBSD.
Pointed-out-by: zrj
Matthew Dillon [Mon, 11 Jul 2016 00:14:56 +0000 (17:14 -0700)]
kernel - Improve vm.prefault_pages + misc
* vm_prefault_quick() now gives up more quickly when things don't work out.
This fixes a scaling issue when vm.prefault_pages is set very high. A
prefault failure would still test every page and kill performance.
(example: linear zfod burst).
* Adjust pmap page removal loop to yield every 64 pages. Before it was
yielding every 4096*8 pages.
* Adjust vm_object_*() routines to yield every 64 pages as well.
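The effect of the smaller yield interval is easy to quantify with a toy loop. `remove_pages()` below is a hypothetical stand-in that only counts how often a yield call would be made; the per-page work and the actual yield primitive are left out.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Walk 'npages' pages and count how many times the loop would yield
 * when it yields once every 'interval' pages.
 */
static size_t
remove_pages(size_t npages, size_t interval)
{
	size_t yields = 0;

	assert(interval > 0);
	for (size_t i = 0; i < npages; i++) {
		/* ... per-page teardown work would go here ... */
		if ((i + 1) % interval == 0)
			yields++;	/* a yield call would go here */
	}
	return yields;
}
```

For a run of 4096*8 pages, the old interval gives a single yield while an interval of 64 gives 512 of them.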
Matthew Dillon [Sun, 10 Jul 2016 21:13:30 +0000 (14:13 -0700)]
kernel - Reduce stalls, refactor lwkt_switch() core.
* These changes primarily affect programs which have a lot of token
contention (aka concurrent write VM faults) and exiting programs which
have very large RSSs (e.g. multiple gigabytes).
* Release proc->p_token around potentially long vmspace destruction ops.
This avoids stalls in programs like 'ps' and functions like
fork/exec/wait/exit.
* Refactor lwkt_switch(). This may also fix a bug where we improperly
called splz_check() after releasing the current thread's tokens. An
interrupt or IPI could then sneak in and corrupt a recursive token.
Remove the infinite loop cycling. When token contention is present this
caused scheduler ticks to dock the wrong thread (the current thread instead
of the target thread). Heavy token contention could cause higher priority
processes to stall for very long periods of time.
Instead, once the spin limit is exhausted we switch through the idle
thread which places us in a better context from which to continue.
* Adjust the dragonfly process scheduler to detect contention when the
current thread is the idle thread, and then attribute the tick to the
correct thread (or at least a more-correct thread).
Matthew Dillon [Sun, 10 Jul 2016 21:01:31 +0000 (14:01 -0700)]
test - Fix build warnings
* Fix build warnings in the pipe1 and pipe2 tests.
Sascha Wildner [Sun, 10 Jul 2016 12:16:35 +0000 (14:16 +0200)]
<rpc/svc.h>: Add back comment.
François Tigeot [Sun, 10 Jul 2016 11:40:37 +0000 (13:40 +0200)]
drm/linux: Avoid contention in spinlock_irq routines
* Call crit_enter() first and lockmgr() later
* This helps to avoid unnecessary contention on the same cpu if a
regular kernel thread holding the lockmgr lock is preempted by
an interrupt thread which would like to acquire the same lock
* By putting the lockmgr() call in the critical section, we avoid the
situation where the preempting interrupt routine tries to lock,
races the main thread lock, and forces an extra two thread switches
Suggested-by: sephe@
Reviewed-by: dillon@ and sephe@
Matthew Dillon [Sun, 10 Jul 2016 07:46:25 +0000 (00:46 -0700)]
kernel - Yield during VM teardown, fix zfree() contention
* Yield during the teardown of vm_page's related to process exit
to allow other processes to get some cpu. Also use lwkt_user_yield()
instead of lwkt_yield().
* zfree() had no hysteresis once the pcpu cache was full, causing massive
contention on the pool spin lock. Generally only affects page-frees
(returning the pv_entry to the pool).
Implement hysteresis on free by moving up to 32 elements out of the pcpu
cache and back into the pool when the pool becomes full.
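The free-side hysteresis can be sketched as follows, assuming the batch is moved once the per-cpu cache has reached its limit. `PCPU_MAX`, `struct zone` and `zone_free()` are hypothetical, and `pool_locks` merely counts how often the shared pool lock would have been taken.

```c
#include <assert.h>

#define PCPU_MAX   64	/* hypothetical per-cpu cache limit */
#define FREE_BATCH 32	/* elements moved back per pool-lock acquisition */

struct zone {
	int pcpu_free;	/* elements in this cpu's cache */
	int pool_free;	/* elements in the shared pool */
	int pool_locks;	/* how many times the pool lock was taken */
};

static void
zone_free(struct zone *z)
{
	if (z->pcpu_free >= PCPU_MAX) {
		int n = z->pcpu_free < FREE_BATCH ? z->pcpu_free : FREE_BATCH;

		z->pool_locks++;	/* the pool spin lock would be held here */
		z->pcpu_free -= n;
		z->pool_free += n;
	}
	z->pcpu_free++;
	assert(z->pcpu_free <= PCPU_MAX);
}
```

Without the batch, every free on a full cache would take the pool lock; with it, the lock is taken once per 32 frees in the steady state.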
Matthew Dillon [Sun, 10 Jul 2016 05:24:06 +0000 (22:24 -0700)]
kernel - Fix stalls during major token contention
* When a set of processes is seriously contending on a token, unrelated
lower-priority processes scheduled to the same cpu may stall randomly for
several seconds at a time. Such contention is rare, but can still occur
at choke-points (such as multiple threads write-faulting on the same VM
object) and result in a degenerate condition.
This occurs because the scheduler has become fixated on the contending
thread due to its priority. Because the 'current thread' might not be the
one that is contending, the scheduler clock does not account for the
contending thread.
* Add a contention heuristic to the scheduler for now which releases the
contending thread on the current cpu (allowing the userland scheduler to
choose another thread to schedule).
* At the moment I have not tried to code the scheduler clock to account for
the contending thread. Theoretically doing so would reduce its dynamic
priority so the scheduler does not fixate on it, but it is a bit of a
round-about way to solve the problem whereas coding it in lwkt_switch()
gives us nearly instant detection.
Matthew Dillon [Sun, 10 Jul 2016 00:39:03 +0000 (17:39 -0700)]
hammer2 - Add feature to allow sector overwrite, fix meta-data check code (2)
* Remove printing of now-deleted fields from debug code (fixes buildworld)
Matthew Dillon [Sat, 9 Jul 2016 23:17:19 +0000 (16:17 -0700)]
hammer2 - Add feature to allow sector overwrite, fix meta-data check code
* If a file is set to use no check code (hammer2 setcheck none <file>),
data overwrites will reuse the same sector as long as it does not violate
the most recent snapshot.
This allows the program to relax copy-on-write requirements for certain
files, for example files which might be mmap()'d SHARED+RW and then
modified constantly where the programmer has determined that the
possibility of corruption is ok.
* Implement pfs_lsnap_tid in the PFS root inode meta-data. This records the
last snapshot TID so the chain code can determine if an overwrite is
allowed.
* Remove attr_tid and dirent_tid from the inode meta-data for now.
* Only BREF_TYPE_DATA brefs inherit the inode check mode. Meta-data brefs
such as indirect blocks, or directory entries, will only use the check
code type specified in the parent inode if it is not NONE. Otherwise
they will use the default check code.
This fixes a bug where meta-data brefs could wind up being unchecked. We
want all meta-data to always be checked (at least for now).
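The two decisions described above can be written down as small predicates. This is a sketch under assumptions: the enum, `overwrite_allowed()` and `meta_check_mode()` are hypothetical, and it assumes "does not violate the most recent snapshot" means the block was last modified after the recorded snapshot TID.

```c
#include <stdbool.h>
#include <stdint.h>

enum check_mode { CHECK_NONE, CHECK_DEFAULT, CHECK_OTHER };	/* hypothetical */

/*
 * A data block may be overwritten in place only when its file uses no
 * check code and the block is newer than the last snapshot.
 */
static bool
overwrite_allowed(enum check_mode check, uint64_t block_modify_tid,
    uint64_t last_snapshot_tid)
{
	if (check != CHECK_NONE)
		return false;
	return block_modify_tid > last_snapshot_tid;
}

/*
 * Meta-data never goes unchecked: inherit the parent's mode unless it is
 * NONE, in which case fall back to the default check code.
 */
static enum check_mode
meta_check_mode(enum check_mode parent)
{
	return parent == CHECK_NONE ? CHECK_DEFAULT : parent;
}
```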
Sascha Wildner [Sat, 9 Jul 2016 07:47:16 +0000 (09:47 +0200)]
kernel: Don't use userland's <stdarg.h> from kernel files.
Use <machine/stdarg.h> instead which automatically comes in via
<sys/systm.h>.
Sascha Wildner [Sat, 9 Jul 2016 06:50:58 +0000 (08:50 +0200)]
<sys/systm.h>: Remove a leftover prototype.
Imre Vadász [Thu, 7 Jul 2016 21:37:00 +0000 (23:37 +0200)]
kernel - Change cpu_idle_hlt default for modern amd cpus.
* Set cpu_idle_hlt=3 for AMD Bobcat and later (which includes any Bulldozer
cpus and apus as well). These cpus do major power management in HLT or ACPI,
but cpu_idle_hlt=1 would try to use MWAIT. Also wakeup times should be
fast enough to make cpu_idle_hlt=2 unnecessary.
Matthew Dillon [Thu, 7 Jul 2016 15:53:38 +0000 (08:53 -0700)]
kernel - New threads should not inherit the sigaltstack
* New threads should not inherit the sigaltstack. The stack is still
inherited on a full fork().
* Fixes issue brought up by https://go-review.googlesource.com/#/c/18835/3
Reported-by: Tim Darby
Sepherosa Ziehau [Wed, 6 Jul 2016 08:17:21 +0000 (16:17 +0800)]
virtio: Fix MSI support; thus unbreak booting on bhyve
I'd like to thank Peter Grehan <grehan freebsd org> very much for
providing various information on bhyve side and helping testing
this patch.
Sascha Wildner [Wed, 6 Jul 2016 07:21:52 +0000 (09:21 +0200)]
libc/confstr: Fix comment indentation.
Sascha Wildner [Wed, 6 Jul 2016 07:00:11 +0000 (09:00 +0200)]
sysconf(3): Fix _SC_GET{GR,PW}_R_SIZE_MAX.
The standard allows returning -1 if there is no hard limit on the size
of the buffers.
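Callers therefore need to cope with a -1 result. A minimal sketch of the usual pattern follows; `FALLBACK_BUFSIZE` and the helper names are assumptions, and a real caller of getpwnam_r(3) would additionally grow the buffer when it gets ERANGE.

```c
#include <stddef.h>
#include <unistd.h>

#define FALLBACK_BUFSIZE 1024	/* arbitrary starting size, an assumption */

/* Map a sysconf() result to a usable initial buffer size. */
static size_t
pw_bufsize_from(long limit)
{
	return limit > 0 ? (size_t)limit : FALLBACK_BUFSIZE;
}

/* Initial buffer size for the getpw*_r() family on this system. */
static size_t
pw_bufsize(void)
{
	return pw_bufsize_from(sysconf(_SC_GETPW_R_SIZE_MAX));
}
```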
Sascha Wildner [Tue, 5 Jul 2016 21:09:47 +0000 (23:09 +0200)]
<unistd.h>: Add more comments to options.
Sascha Wildner [Tue, 5 Jul 2016 21:09:08 +0000 (23:09 +0200)]
sysconf(3): Add _SC_THREAD_SPORADIC_SERVER.
Sascha Wildner [Tue, 5 Jul 2016 19:32:59 +0000 (21:32 +0200)]
getconf(1): Add some variables for backward compatibility.
The standard requires all of these.
Sascha Wildner [Tue, 5 Jul 2016 19:26:48 +0000 (21:26 +0200)]
getconf(1): Fix typo (_POSIX2_EXPR_NEXT_MAX -> _POSIX2_EXPR_NEST_MAX).
Sascha Wildner [Tue, 5 Jul 2016 19:11:40 +0000 (21:11 +0200)]
getconf(1): Add some missing variables.
_POSIX_ADVISORY_INFO
_POSIX_RAW_SOCKETS
_XOPEN_STREAMS
Sascha Wildner [Tue, 5 Jul 2016 18:40:56 +0000 (20:40 +0200)]
getconf(1): Fix confstr variable names.
All these don't have an underscore.
Sepherosa Ziehau [Tue, 5 Jul 2016 14:59:17 +0000 (22:59 +0800)]
cat: Align output from cat(1) between when invoked with -be & -ne flags
Obtained-from: NetBSD
Submitted-by: <venture37 geeklan co uk>
DragonFly-bug: https://bugs.dragonflybsd.org/issues/2922
zrj [Thu, 30 Jun 2016 14:07:11 +0000 (17:07 +0300)]
Remove <varargs.h> from the system.
Similarly as it was done with <malloc.h>
Not standard header, just a symlink to machine/varargs.h and
seems not used by anything in the base (<stdarg.h> is preferred).
zrj [Fri, 1 Jul 2016 10:32:46 +0000 (13:32 +0300)]
Fix <machine/varargs.h> use cases.
First varargs.h depended on namespace pollution to provide typedef of __va_list
to declare va_list. Usually through sys/systm.h including sys/stdarg.h
So short-circuit directly to compiler builtin in case of __GNUC__
Also remove machine/varargs.h usage from other kernel sources:
sys/kern/kern_dsched.c: Not needed (just 3 dummy functions)
sys/dev/misc/tbridge/tbridge.c: Both use just __va_smth variants
sys/kern/subr_taskqueue.c: and get those through sys/systm.h
This leaves all the kernel code using <stdarg.h> variant consistently.
zrj [Thu, 30 Jun 2016 14:56:03 +0000 (17:56 +0300)]
<stdio.h>: Hide macros that break global :: ns in cxx.
Avoid expanding macros ::(!__isthreaded ?...) to poorly written
ports that assume some specific libc/stdio.h implementation.
Will help with patching efforts to have less +<cstdio> patches
in dports using c++ codes.
zrj [Thu, 30 Jun 2016 14:51:18 +0000 (17:51 +0300)]
Move __va_size() into freestanding block.
Mainly to match varargs.h layout. No users outside these headers.
zrj [Thu, 30 Jun 2016 14:12:24 +0000 (17:12 +0300)]
<wchar.h>: Reduce namespace pollution in <wchar.h>.
zrj [Thu, 30 Jun 2016 10:17:44 +0000 (13:17 +0300)]
sys/sys: Protect len and inout parameters in _IOC definition.
This should reduce the likelihood of _IOC() macro expanding to something
that wasn't intended and would provide a more flexible interface too.
While there, remove hardcoded value for IOC_DIRMASK
Taken-from: FreeBSD
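Why unprotected macro parameters are risky can be shown with a simplified encoder. `ENC_BAD`, `ENC_GOOD`, `PARMMASK`, `DIR_IN` and `DIR_OUT` are hypothetical and only loosely modelled on an ioctl command encoding; they are not the header's definitions.

```c
#define PARMMASK 0x1fffUL	/* hypothetical width of the length field */
#define DIR_OUT  0x40000000UL	/* hypothetical direction bits */
#define DIR_IN   0x80000000UL

/* Unprotected: 'inout' and 'len' are pasted into the expression as-is. */
#define ENC_BAD(inout, group, num, len) \
	(inout | ((len & PARMMASK) << 16) | ((group) << 8) | (num))

/* Protected: every parameter is parenthesized before use. */
#define ENC_GOOD(inout, group, num, len) \
	((unsigned long)((inout) | (((len) & PARMMASK) << 16) | \
	((group) << 8) | (num)))
```

Passing a conditional expression such as `wr ? DIR_IN : DIR_OUT` as the first argument makes the difference visible: in ENC_BAD the `?:` operator swallows the rest of the expression, so the length, group and number fields are lost whenever `wr` is true.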
zrj [Fri, 1 Jul 2016 06:04:56 +0000 (09:04 +0300)]
rpc: Whitespace cleanup.
While there, perform license change as per FreeBSD r258581
zrj [Thu, 30 Jun 2016 09:57:55 +0000 (12:57 +0300)]
rpc: Make a few headers more compatible with gcc.
Previously gcc compilers from dports installed patched versions
of rpc headers that override the system ones.
By applying small changes, headers are no longer patched and
do not require rebuilding gcc dports to account for possible
change in include/rpc headers after installworld.
While there perform some minor cleanup.
No functional change.
Sascha Wildner [Mon, 4 Jul 2016 08:50:47 +0000 (10:50 +0200)]
<pthread.h>: Include <machine/limits.h> instead of <limits.h> for ULONG_MAX.
Also include <limits.h> in a couple of files that were missing it.
This commit will break 4 ports:
devel/clanlib1
games/orbital_eunuchs_sniper
games/zatacka
sysutils/cdargs
These will be fixed soon.
François Tigeot [Sun, 3 Jul 2016 06:26:40 +0000 (08:26 +0200)]
drm/linux: Improve spin_unlock_irqrestore()'s implementation
Prevents compilation failures in functions not using
spin_lock_irqsave() first.
François Tigeot [Sat, 2 Jul 2016 15:31:25 +0000 (17:31 +0200)]
installer: Do not waste too many inodes on /boot
* A fully populated /boot with kernel, kernel.old, kernel.alt
and associated modules needs approximately 2K inodes
* With a 1GB /boot partition size, default newfs parameters
allocate 128K inodes
* Reduce this amount to 15K inodes, thus making an additional
13MB of disk space available on /boot
Tomohiro Kusumi [Sun, 26 Jun 2016 09:39:42 +0000 (18:39 +0900)]
sbin/hammer: Make global PFS/accounting variables static
Tomohiro Kusumi [Sat, 25 Jun 2016 16:35:00 +0000 (01:35 +0900)]
sys/vfs/hammer: Remove validate_zone()
Code becomes less clear with this function and usage of sum of results.
Just check if the given offsets are data zones or not (unless we plan
to support non-data zones, but we don't).
François Tigeot [Thu, 30 Jun 2016 05:38:34 +0000 (07:38 +0200)]
drm: Restore DRM_DEBUG_VBLANK() calls
François Tigeot [Wed, 29 Jun 2016 19:21:22 +0000 (21:21 +0200)]
drm/i915: Use the spin_lock_irq() family of functions (2/2)
Further reducing differences with Linux 4.3.
François Tigeot [Wed, 29 Jun 2016 06:12:10 +0000 (08:12 +0200)]
drm: Use the spin_lock_irq() family of functions
Reducing differences with Linux 4.3
François Tigeot [Wed, 29 Jun 2016 06:10:13 +0000 (08:10 +0200)]
drm/i915: Use the spin_lock_irq() family of functions
Reducing differences with Linux 4.3
Matthew Dillon [Wed, 29 Jun 2016 02:14:43 +0000 (19:14 -0700)]
kernel - Enhance buffer flush and cluster_write linearity (2)
* Fix a bug in the last commit. When looping the buffer has to be reset to
the marker or the iteration can wind up on the wrong queue.
* Also count the INVAL case in the loop instead of breaking out.
Matthew Dillon [Wed, 29 Jun 2016 01:58:58 +0000 (18:58 -0700)]
hammer2 - Fix inode destroy panic
* Fix a race in hammer2_inode_xop_destroy() when deleting an inode chain.
The parent can be ripped out from under the code before it gets both
parent and chain locked, resulting in an assertion in hammer2_chain_delete().
Properly test the linkage and retry if the parent changes.
Matthew Dillon [Wed, 29 Jun 2016 01:52:29 +0000 (18:52 -0700)]
kernel - Enhance buffer flush and cluster_write linearity
* flushbufqueues() was iterating between cpus, taking only one buffer off
of each cpu's queue. This forced non-linearity on flush, messing up
sequential performance for HAMMER1 and HAMMER2. For HAMMER2 this also
caused physical blocks to be allocated out of order.
Add sysctl vfs.flushperqueue to specify the number of buffers to flush
per cpu before iterating the pcpu queue. Default 1024.
* cluster_write() no longer requires that a buffer be VOP_BMAP()'d
successfully in order to issue writes. This affects HAMMER2, which does
not assign physical device blocks until the logical buffer is actually
flushed to the backend device.
* Fixes non-linearity problems for buffer daemon flushbufqueues() calls,
and for cluster_write() with or without write_behind.
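The difference between taking one buffer per cpu queue and taking a batch per queue can be seen in a small model. `NCPUS` and `flush_queues()` are hypothetical; `order[]` simply records which cpu's queue each flushed buffer came from.

```c
#include <assert.h>

#define NCPUS 4	/* hypothetical */

/*
 * Flush up to 'total' buffers, taking at most 'perqueue' consecutive
 * buffers from one cpu's queue before moving to the next. Returns the
 * number of buffers flushed.
 */
static int
flush_queues(int qlen[NCPUS], int perqueue, int total, int order[])
{
	int flushed = 0;
	int progress = 1;

	assert(perqueue > 0);
	while (flushed < total && progress) {
		progress = 0;
		for (int cpu = 0; cpu < NCPUS && flushed < total; cpu++) {
			for (int n = 0; n < perqueue && qlen[cpu] > 0 &&
			    flushed < total; n++) {
				qlen[cpu]--;
				order[flushed++] = cpu;
				progress = 1;
			}
		}
	}
	return flushed;
}
```

With perqueue set to 1 the flush order interleaves the cpus (0, 1, 2, 3, 0, ...); with a large value such as 1024 each queue is drained in a run (0, 0, 1, 1, ...), which keeps buffers queued on one cpu adjacent in the flush order.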
Matthew Dillon [Tue, 28 Jun 2016 23:12:46 +0000 (16:12 -0700)]
hammer2 - Optimize indirect block algorithm
* Pack indirect blocks for linear files significantly better.
* First level indirect block for directories reduced to 4KB (32 entries).
* For now make the first level indirect block for directories cover the
entire hash range for either inodes or directory entries (63 bits).
Matthew Dillon [Tue, 28 Jun 2016 07:26:06 +0000 (00:26 -0700)]
hammer2 - Stabilization pass
* If the HAMMER2_CHAIN_DEDUP flag is set modified_needs_new_allocation()
must return 1 to force a new allocation. This fixes a number of dirty
buffer rewrite cases that broke dedup.
* Do not try to dedup a chain flagged MODIFIED or INITIAL.
* The indirect-block deletion code in the flusher needed to also count
blockrefs if it hadn't been done yet. This fixes cases of missing
directory entries.
* For now use a transaction in hammer2_strategy_write(). We probably don't
need it due to the way the logical buffer cache is handled, but do it
anyway.
* Clean-up some of the code documentation.
* Implement sysctls for dedup and buffer invalidation enablement. dedup
is turned on by default, invalidation is turned off. Invalidation is
not currently working well.
Matthew Dillon [Mon, 27 Jun 2016 20:08:56 +0000 (13:08 -0700)]
hammer2 - Remove the hidden directory, rework deletions
* Now that inodes are separately indexed we no longer need the hidden
directory abstraction to handle unlinked-but-open files. Get rid of
ALL the hidden directory handling code.
* Rework xop_unlink and hammer2_inode_unlink_finisher(). We cannot safely
reference the inode chain's inode data to get the nlinks count. Instead,
figure it all out on the frontend using the active nlinks in the
hammer2_inode_t structure.
* Fixes hardlink removal and rename issues.
Sepherosa Ziehau [Mon, 27 Jun 2016 14:21:57 +0000 (22:21 +0800)]
ifnet: Add oqdrops statistics
Matthew Dillon [Mon, 27 Jun 2016 08:37:40 +0000 (01:37 -0700)]
hammer2 - Stabilization, fix bulkfree bugs, change 'df' output
* Automatically delete any indirect nodes which become empty. This is done
in the flusher. Verify that a rm -rf cleans everything out.
* Fix three serious bugs in the bulkfree code.
(1) A range-check of cbinfo->sstop was using '>' instead of '>=', causing
a one-element overflow during the scan and potentially corrupting
memory.
(2) The live bitmap pointer must be reloaded after calling
hammer2_chain_modify()! The old pointer points to a buffer
which must remain clean, or worse points to a buffer completely
unrelated to the hammer2 filesystem.
(3) We were zeroing the temporary bmap, but it actually needs to be
initialized properly (particularly its reserved areas). Just
zeroing it led to reserved areas being improperly marked as
available for allocation.
* Validate that the free space counter is recovered properly after a
rm -rf and bulkfree.
* Disable the modify_tid test in the bulkfree code for now and go back to
forcing a flush.
* Change 'df' reporting. I was trying to be fancy by compensating for dedup
to report how big the filesystem would be if nothing were deduped, but it
just caused confusion. We now report an unchanging total volume size and
the actual number of 16KB blocks that are fully free.
* The 'hammer2 freemap' dump now includes all indices, including those
associated with reserved areas.
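A minimal sketch of the off-by-one in bug (1) above. The bulkfree code itself is not shown in this log, so the structure, the NELEM constant and the scan_mark() helper are hypothetical; only the '>' versus '>=' comparison against sstop comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>

#define NELEM 4			/* hypothetical size of the scan window */

/* Hypothetical stand-in for the scan state; only sstop is named above. */
struct scan {
	int	bmap[NELEM];	/* valid indices are 0 .. NELEM-1 */
	size_t	sstop;		/* exclusive end of the valid range */
};

/*
 * Mark one element. Returns 0 if the element was marked, -1 if the
 * index is out of range.
 */
static int
scan_mark(struct scan *s, size_t idx)
{
	/*
	 * With '>' instead of '>=' the case idx == sstop would pass the
	 * check and write one element past the end of bmap[].
	 */
	if (idx >= s->sstop)
		return (-1);
	s->bmap[idx] = 1;
	return (0);
}
```

With sstop set to NELEM, index NELEM-1 is accepted and index NELEM is rejected, which is the boundary the original check got wrong.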
Matthew Dillon [Sun, 26 Jun 2016 21:08:21 +0000 (14:08 -0700)]
kernel - Fix panic in error path of nvextendbuf()
* nvextendbuf() was not releasing bp in the error path, leading to
a hanging lock and 'locking against myself' panic later on.
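The shape of this kind of fix can be sketched in userland as follows. This is not the nvextendbuf() code: struct buf, buf_lock(), buf_unlock() and extend_buf() are hypothetical stand-ins, and the assertion only models the 'locking against myself' panic.

```c
#include <assert.h>

/* Hypothetical stand-in for a locked buffer. */
struct buf {
	int	locked;
};

static void
buf_lock(struct buf *bp)
{
	/* Models the panic: the lock is still held from an earlier call. */
	assert(!bp->locked);
	bp->locked = 1;
}

static void
buf_unlock(struct buf *bp)
{
	bp->locked = 0;
}

/*
 * Every exit, including the error exit, goes through the single
 * release point at 'done'. Returns 0 on success, -1 on failure.
 */
static int
extend_buf(struct buf *bp, int simulate_failure)
{
	int error = 0;

	buf_lock(bp);
	if (simulate_failure) {
		error = -1;
		goto done;	/* an early return here would leak the lock */
	}
	/* ... normal work on the buffer would go here ... */
done:
	buf_unlock(bp);
	return (error);
}
```

A failed call followed by a second call on the same buffer does not trip the assertion, because the error path released the lock.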
Matthew Dillon [Sun, 26 Jun 2016 05:05:14 +0000 (22:05 -0700)]
hammer2 - Stabilization (data corruption)
* Move the check code errors into hammer2_chain_testcheck() and supply
additional information in the kprintf.
* Reformulate hammer2_io_newq() a bit.
* Fix bugs in the buffer invalidation path. The hammer2_io_newq() path
was improperly setting INVALOK. This path is only used by the freemap
code to pre-validate a buffer to avoid unnecessary reads. Fixed by
not setting INVALOK if IOCB_QUICK is set.
Matthew Dillon [Sun, 26 Jun 2016 04:59:55 +0000 (21:59 -0700)]
hammer2 - Update error message in hammer2_mount
* Update the error message to reflect the current default labels
when the '@LABEL' specification is missing.
Matthew Dillon [Sun, 26 Jun 2016 04:58:29 +0000 (21:58 -0700)]
hammer2 - Enhance freemap output
* Output the base data offset for each freemap line in the freemap
dump.
* Also provide more check data info in the output.
Matthew Dillon [Sun, 26 Jun 2016 03:57:07 +0000 (20:57 -0700)]
nvme - Handle full submission queue
* The submission queue is a ring and can be full even if requests are
available due to out-of-order completion. Update the submission queue's
subq_head from the completion queue status and check for a full condition.
The normal requeue signaling suffices for resume.
* Also note that we allocate maxqe requests, which is actually one more than
we can have on the ring at once. But now that we have the queue-full check,
this becomes a non-issue. Just leave it at maxqe for convenience.
* Tested by temporarily reducing maxqe to 16 and doing stuff to overload it.
Maxqe was returned to 256 for the commit.
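A minimal sketch of the queue-full check described above, assuming the usual ring convention that one slot always stays empty. Only subq_head is named in the commit message; the struct layout, the nqe and subq_tail fields and the helper functions are hypothetical and are not the driver's code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified submission ring. */
struct subq {
	uint32_t nqe;		/* number of slots in the ring */
	uint32_t subq_head;	/* consumer index, learned from completions */
	uint32_t subq_tail;	/* producer index, next slot to fill */
};

/* One slot is kept empty so that head == tail always means "empty". */
static int
subq_full(const struct subq *q)
{
	return ((q->subq_tail + 1) % q->nqe == q->subq_head);
}

/* Returns the slot index that was filled, or -1 if the ring is full. */
static int
subq_submit(struct subq *q)
{
	int slot;

	assert(q->nqe > 1);
	if (subq_full(q))
		return (-1);	/* caller requeues and retries later */
	slot = (int)q->subq_tail;
	q->subq_tail = (q->subq_tail + 1) % q->nqe;
	return (slot);
}

/* Completion path: adopt the head index reported with the completion. */
static void
subq_complete(struct subq *q, uint32_t reported_head)
{
	q->subq_head = reported_head % q->nqe;
}
```

A ring of N slots holds at most N-1 outstanding entries under this convention, which matches the note that maxqe requests is one more than can be on the ring at once.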
Matthew Dillon [Sat, 25 Jun 2016 22:01:44 +0000 (15:01 -0700)]
kernel - Enhance swap allocation failure message
* Output a more appropriate message if the system wants to page to swap
and no swap is configured.
Matthew Dillon [Sat, 25 Jun 2016 21:48:34 +0000 (14:48 -0700)]
kernel - Misc bug fixes and enhancements
* Add atomic_*_64() for 64-bit-explicit calls. This way if a platform
doesn't support 64-bit atomic ops H2 will at least get a compile error.
* Fix bug in sys/mutex2.h. mtx_upgrade_try() was not setting mtx_owner
on success.
* Enhance assertion panic message in lockmgr_kernproc().
Matthew Dillon [Sat, 25 Jun 2016 17:05:24 +0000 (10:05 -0700)]
hammer2 - Stabilization, optimization
* Increase the hammer2_io.refs field to 64 bits so we can add a few more
control bits.
* Track whether invalidation is ok at the DIO level for full-sized (64KB)
data blocks. We continue to use the slightly less-capable CHAIN_DEDUP
flag for smaller data blocks (this flag gets set on frontend->backend
flush whereas the DIO level flag is only cleared when a block is actually
reused for deduplication).
* Separate vfs.hammer2.cluster_enable into cluster_read and cluster_write.
Leave cluster_read enabled with a read-ahead of 4 blocks. Disable
cluster_write for now, but still set B_CLUSTEROK in the bdwrite().
This allows the frontend to 'flush' data to the backend without
initiating disk I/O on the block device, giving us a chance to discard
the data later if it winds up being temporary.
* Remove an improper BUF_KERNPROC(dio->bp) in the case where a different
thread owns the in-progress DIO.
* Defer setting of B_INVAL | B_RELBUF to when the DIO is in lastdrop.
* Add missing brelse() in the hammer2_read_file() error path. Add missing
B_CLUSTEROK in hammer2_write_file().
* The bulkfree code now ensures that the INVALOK bit in any related DIO
for a freed block is cleared, preventing accidental invalidations on
reuse.
Sascha Wildner [Sat, 25 Jun 2016 12:58:48 +0000 (14:58 +0200)]
Stop building/installing groff's soelim(1).
We have had our own version in usr.bin for ages, so groff's
version was built and installed only to be overwritten when
usr.bin was installed afterwards.
François Tigeot [Fri, 24 Jun 2016 14:32:20 +0000 (16:32 +0200)]
drm/linux: Implement some spin_lock_irq* functions
They are not just simple spin_lock/spin_unlock() variants but
disable hardware interrupt processing on the current cpu.
Suggested-by: Matt Macy <mmacy@nextbsd.org>
Tomohiro Kusumi [Fri, 24 Jun 2016 14:07:45 +0000 (23:07 +0900)]
sys/vfs/hammer: Remove DEDUP_CACHE_SIZE and wrong comment
It has been a tunable sysctl since e2ef7a95.
Sepherosa Ziehau [Fri, 24 Jun 2016 03:13:16 +0000 (11:13 +0800)]
nvme: Use high frequency interrupt for CQ processing
Suggested-by: dillon@
Reviewed-by: dillon@