Matthew Dillon [Mon, 10 Oct 2016 17:22:12 +0000 (10:22 -0700)]
vkernel - Fix FP corruption in VMX/EPT mode
* Properly invalidate the EPT TLB cache when it potentially becomes
stale.
* When destroying a VMX context, make sure the EPT TLB cache has been
invalidated for that context as a safety.
* Fix a bug in vmx_vminit() where it was losing track of the currently
loaded VMX.
* Set up the VMX to clear the host CR0_TS, and the host makes sure the FP
state is correct prior to vmlaunch.
* Make sure the GUEST_CR0's CR0_TS reflects what the vkernel says it
should reflect.
* The vkernel has a choice of asking the guest user context to #NM fault
on FP use or not. Both mechanics work, but it's probably better for it
to ensure that the FP state is valid and tell the user context to not
fault. However, this commit makes sure that both mechanics work.
* Document why we shouldn't vmclear the old current context when loading
a different context.
* Clean up some of the vkernel's pmap handling. This isn't perfect and
probably needs to be rewritten (we need a more powerful guest pmap
adjustment system call to properly synchronize changes). For now
we try to avoid races against detecting the (M)odified flag by clearing
the RW flag first.
Matthew Dillon [Sun, 9 Oct 2016 23:41:17 +0000 (16:41 -0700)]
vkernel - Add COW image capability
* Add a copy-on-write disk image feature. This allows a vkernel
to mount a disk image RO or R+W but does not try to write changes
back to the image file.
This allows multiple vkernel instances to use the same image
file.
* Note that when the vkernel operates on an image in this mode,
modifications will eat up system memory and swap, so the user
should be cognizant of the use-case. Still, the flexibility of
being able to mount the image R+W should not be underestimated.
Matthew Dillon [Sat, 8 Oct 2016 23:57:16 +0000 (16:57 -0700)]
kernel - Refactor VMX code
* Refactor the VMX code to use all three VMM states available instead of
just two. The three states are:
active and current (VMPTRLD)
active not current (replaced by some other context being VMPTRLD'd)
inactive not current (VMCLEAR)
In short, there is no need to VMCLEAR the current context when activating
another via VMPTRLD; doing so greatly reduces performance. VMCLEAR is
only really needed when a context is being destroyed or being moved to
another cpu.
* Also fixes a few bugs along the way.
* Live-loop in vmx_vmrun() when necessary; otherwise we wind up with serious
problems synchronizing IPIs. The thread will still be subject to its
process priority.
Matthew Dillon [Sat, 8 Oct 2016 02:13:41 +0000 (19:13 -0700)]
kernel - Fix low memory process kill bug
* If a process is being killed, don't let it stay put in a low-memory
vm_wait loop in kernel mode; otherwise it will never exit.
* Try to improve the chances that we can dump by adjusting an assertion in
the user thread scheduler.
Matthew Dillon [Sat, 8 Oct 2016 02:10:06 +0000 (19:10 -0700)]
kernel - Fix a system lockup with vmm
* Fix an issue where vkernel_lwp_exit() was improperly trying to kfree()
the vklp->ve pointer for the guest-thread case. This field holds a
user-supplied address in that case, not a kernel structure.
* Yield the cpu more aggressively in the VMM_GUEST_RUN loop. We were
testing for pending interrupts but were not calling lwkt_switch().
* Do not exit the vkernel on a call or jump to address 0. This debugging
code should have been removed and wasn't. A user process running under
the vkernel could cause the vkernel itself to exit.
* Numerous syntactical cleanups.
Reported-by: tuxillo
Matthew Dillon [Tue, 27 Sep 2016 21:39:03 +0000 (14:39 -0700)]
kernel - Remove mplock from KTRACE paths
* The mplock is no longer needed for KTRACE, ktrace writes are serialized
by the vnode lock and everything else is MPSAFE. Note that this change
means that even fast system calls may interleave in the ktrace output on
a multi-threaded program.
* Fix ktrace bug related to vkernels. The syscall2() code assumes that
no tokens are held on entry (since we are coming from usermode), but
a system call made from the vkernel may actually be nested inside
another syscall2(). The mplock held by KTRACE caused this to assert in
the nested syscall2(). The removal of the mplock from the ktrace path
also fixes this bug.
* Minor comment adjustment in vm_vmspace.c.
Reported-by: tuxillo
Peter Avalos [Fri, 14 Oct 2016 19:25:19 +0000 (12:25 -0700)]
Update OpenSSL to 1.0.1u.
This only affects the 4.6 branch, because master has a different OpenSSL
version that is no longer being used.
Major changes between OpenSSL 1.0.1t and OpenSSL 1.0.1u [22 Sep 2016]
o OCSP Status Request extension unbounded memory growth (CVE-2016-6304)
o SWEET32 Mitigation (CVE-2016-2183)
o OOB write in MDC2_Update() (CVE-2016-6303)
o Malformed SHA512 ticket DoS (CVE-2016-6302)
o OOB write in BN_bn2dec() (CVE-2016-2182)
o OOB read in TS_OBJ_print_bio() (CVE-2016-2180)
o Pointer arithmetic undefined behaviour (CVE-2016-2177)
o Constant time flag not preserved in DSA signing (CVE-2016-2178)
o DTLS buffered message DoS (CVE-2016-2179)
o DTLS replay protection DoS (CVE-2016-2181)
o Certificate message OOB reads (CVE-2016-6306)
Matthew Dillon [Fri, 14 Oct 2016 16:32:12 +0000 (09:32 -0700)]
kernel - Fix improper user-space access in sys___semctl()
* Fix an improper user-space access in sys___semctl().
* Fix an improper kernel-space access that was using
a user-supplied pointer.
Reported-by: Mateusz Kocielski - LogicalTrust
John Marino [Mon, 3 Oct 2016 16:19:47 +0000 (11:19 -0500)]
Relocate private panel library to intended location
The only consumer of libpanel is the DF installer. It wasn't installed in
the correct private area, but the linker found it anyway on the standard
search path. Fix the installation location and task "make upgrade" to
remove the publicly installed private libraries.
Antonio Huete Jimenez [Mon, 26 Sep 2016 22:45:50 +0000 (00:45 +0200)]
kernel/vmx - Add a missing lwkt_reltoken()
Antonio Huete Jimenez [Sun, 25 Sep 2016 10:57:09 +0000 (12:57 +0200)]
kernel/vmm - Fix build with VMM_DEBUG
Antonio Huete Jimenez [Tue, 20 Sep 2016 23:31:58 +0000 (01:31 +0200)]
vkernel - Invalidate pte before setting attributes to the vm_page
- Fixes a problem at mountroot time where it doesn't find any disk
even though the disk is detected earlier.
Antonio Huete Jimenez [Tue, 20 Sep 2016 22:03:05 +0000 (00:03 +0200)]
vkernel - Fix a vkernel lockup on startup
- During ap_init() any pending IPIs are processed manually, so
clear gd_npoll as the real kernel does.
- Do not disable interrupts for vkernels during lwkt_send_ipiq3()
because they don't seem to be re-enabled afterwards as they should be.
I'm not entirely sure this is the right fix, more investigation
is required.
John Marino [Thu, 6 Oct 2016 15:37:06 +0000 (10:37 -0500)]
localedef: Fix ctype dump (fixes widespread errors)
This was a CTYPE encoding error involving consecutive points of the same
ctype. I reported it to Illumos over a year ago but was unsure
whether it was only happening on BSD. Given the cause, the bug is also
present on Illumos.
Basically, if consecutive points were of the exact same ctype, they would
be defined as a range regardless. For example, all of these would be
considered equivalent:
<A> ... <C>, <H> (converts to <A> .. <H>)
<A>, <B>, <H> (converts to <A> .. <H>)
<A>, <J> ... <H> (converts to <A> .. <H>)
So all the points that shouldn't have been defined got "bridged" by the
extreme points.
The effects were recently reported to FreeBSD in PR 213013. There are
countless places where the ctype flags are misdefined, so this is a major
fix that has to be MFC'd.
Imre Vadász [Mon, 3 Oct 2016 12:41:29 +0000 (14:41 +0200)]
pc64: Fix typo in wrmsr_safe.
Imre Vadász [Mon, 26 Sep 2016 23:44:43 +0000 (01:44 +0200)]
cpuctl(4): Require write rights for CPUCTL_MSRSBIT and CPUCTL_MSRCBIT.
* Both CPUCTL_MSRSBIT and CPUCTL_MSRCBIT write MSR registers, so they
should require write rights like CPUCTL_WRMSR.
Sascha Wildner [Thu, 29 Sep 2016 17:39:30 +0000 (19:39 +0200)]
Sync zoneinfo database with tzdata2016g from ftp://ftp.iana.org/tz/releases
* Turkey switched from EET/EEST (+02/+03) to permanent +03,
effective 2016-09-07. (Thanks to Burak AYDIN.) Use "+03" rather
than an invented abbreviation for the new time.
* New leap second 2016-12-31 23:59:60 UTC as per IERS Bulletin C 52.
(Thanks to Tim Parenti.)
* For America/Los_Angeles, spring-forward transition times have been
corrected from 02:00 to 02:01 in 1948, and from 02:00 to 01:00 in
1950-1966.
* For zones using Soviet time on 1919-07-01, transitions to UT-based
time were at 00:00 UT, not at 02:00 local time. The affected
zones are Europe/Kirov, Europe/Moscow, Europe/Samara, and
Europe/Ulyanovsk. (Thanks to Alexander Belopolsky.)
* The Factory zone now uses the time zone abbreviation -00 instead
of a long English-language string, as -00 is now the normal way to
represent an undefined time zone.
* Several zones in Antarctica and the former Soviet Union, along
with zones intended for ships at sea that cannot use POSIX TZ
strings, now use numeric time zone abbreviations instead of
invented or obsolete alphanumeric abbreviations. The affected
zones are Antarctica/Casey, Antarctica/Davis,
Antarctica/DumontDUrville, Antarctica/Mawson, Antarctica/Rothera,
Antarctica/Syowa, Antarctica/Troll, Antarctica/Vostok,
Asia/Anadyr, Asia/Ashgabat, Asia/Baku, Asia/Bishkek, Asia/Chita,
Asia/Dushanbe, Asia/Irkutsk, Asia/Kamchatka, Asia/Khandyga,
Asia/Krasnoyarsk, Asia/Magadan, Asia/Omsk, Asia/Sakhalin,
Asia/Samarkand, Asia/Srednekolymsk, Asia/Tashkent, Asia/Tbilisi,
Asia/Ust-Nera, Asia/Vladivostok, Asia/Yakutsk, Asia/Yekaterinburg,
Asia/Yerevan, Etc/GMT-14, Etc/GMT-13, Etc/GMT-12, Etc/GMT-11,
Etc/GMT-10, Etc/GMT-9, Etc/GMT-8, Etc/GMT-7, Etc/GMT-6, Etc/GMT-5,
Etc/GMT-4, Etc/GMT-3, Etc/GMT-2, Etc/GMT-1, Etc/GMT+1, Etc/GMT+2,
Etc/GMT+3, Etc/GMT+4, Etc/GMT+5, Etc/GMT+6, Etc/GMT+7, Etc/GMT+8,
Etc/GMT+9, Etc/GMT+10, Etc/GMT+11, Etc/GMT+12, Europe/Kaliningrad,
Europe/Minsk, Europe/Samara, Europe/Volgograd, and
Indian/Kerguelen. For Europe/Moscow the invented abbreviation MSM
was replaced by +05, whereas MSK and MSD were kept as they are not
our invention and are widely used.
* Rename Asia/Rangoon to Asia/Yangon, with a backward compatibility link.
(Thanks to David Massoud.)
* Comments now cite URLs for some 1917-1921 Russian DST decrees.
(Thanks to Alexander Belopolsky.)
Matthew Dillon [Thu, 8 Sep 2016 23:02:07 +0000 (16:02 -0700)]
powerd - Detect power state changes
* The list of available frequencies changes when the power state changes,
detect such changes and set parameters for all cpus versus making
incremental changes.
* Fixes issue with powerd leaving the laptop running at a lower frequency
when the laptop is unplugged and then plugged back in.
Sepherosa Ziehau [Wed, 7 Sep 2016 10:54:24 +0000 (18:54 +0800)]
uipc: Make sure that listen is completed.
For unix sockets, only HAVEPCCACHED really means that the listen has
been completed.
Reported-by: dillon@
Matthew Dillon [Wed, 7 Sep 2016 18:41:25 +0000 (11:41 -0700)]
powerd - Add temperature-based management
* Add temperature-based management, with a default range of 75:85 (in C).
If the cpu temperature exceeds the low range, powerd will enter
temperature control mode and begin ramping-down the cpu frequency
regardless of the load in order to prevent the laptop from reaching
the high range.
* Add -H lowtemp:hightemp option to allow the range to be set when
starting or restarting powerd.
* Add code to automatically kill a previously-running powerd when a new
powerd is started. This makes the system operator's life easier as there
is no need to hunt-down and kill the previously-running powerd when
restarting it with new options.
* No desktop or server should ever get to 75C unless your cooling is broken,
so this feature is primarily targeted at laptops. Many laptops can exceed
80C due to bad cooling design (and poor design in general), and a vendor
propensity to goose the specs to make their laptops look good on paper.
Even the BIOS HOT cap tends to actually be too hot for continuous use.
But it just isn't a good idea to exceed 80C regardless of what the specs
say. The laptop will last a lot longer and this reduces your chances of
having a meltdown or fire. People who run BSD or Linux systems on laptops
often do bulk compiles on them and/or other things, such as multiple
browser windows, tabs, a lot of multi-media, multiple video windows,
multiple video outputs, etc, which can utilize all available resources
on the laptop. Vendors usually don't take all of this into account.
This feature can allow all of this to happen without burning the laptop
up.
You can also use this feature if your laptop gets too hot when sitting on
your lap :-).
Tested-by: Multiple people.
Matthew Dillon [Wed, 7 Sep 2016 01:09:14 +0000 (18:09 -0700)]
kernel - Deal with lost IPIs (VM related) (2)
* Fix an issue where Xinvltlb interacts badly with a drm console framebuffer,
imploding the machine. The 1/16 second watchdog can trigger during certain
DRM operations due to excessive interrupt disablement in the linux DRM code.
* Avoid kprintf()ing anything by default.
* Also make a minor fix to the watchdog logic to force the higher-level
Xinvltlb loop to re-test.
Matthew Dillon [Tue, 6 Sep 2016 00:11:05 +0000 (17:11 -0700)]
kernel - Deal with lost IPIs (VM related)
* Some (all?) VMs appear to be able to lose IPIs. Hopefully the same can't
be said for device interrupts! Add some recovery code for lost Xinvltlb
IPIs for now.
For synchronizing invalidations we use the TSC and run a recovery attempt
after 1/16 second, and every 1 second thereafter, if an Xinvltlb is not
responded to (smp_invltlb() and smp_invlpg()). The IPI will be re-issued.
* Some basic testing shows that a VM can stall out a cpu thread for an
indefinite period of time, potentially causing the above watchdog to
trigger. Even so it should not have required re-issuing the IPI, but
it seems it does, so the VM appears to be losing the IPI(!) when a cpu
thread stalls out on the host! At least with the VM we tested under,
type unknown.
* IPIQ IPIs currently do not have any specific recovery but I think each
cpu will poll for IPIQs slowly in the idle thread, so they might
automatically recover anyway.
Reported-by: zach
Matthew Dillon [Mon, 5 Sep 2016 19:33:42 +0000 (12:33 -0700)]
kernel - Fix indefinite wait buffer during heavy swapping
* Fix a deadlock which can occur between CAM and the VM system due to
a bug in uiomove_nofault() when called via vop_helper_read_shortcut().
If the backing store is swapped out, vm_fault()/vm_fault_object() attempts
to page the data in instead of telling uiomove_nofault() to give up.
This can result in a deadlock against the underlying vm_page's in the
file that might already be undergoing I/O.
* Probably also reported by other people over the years, but could never
track it down until now.
Reported-by: Studbolt
Matthew Dillon [Thu, 8 Sep 2016 06:09:12 +0000 (23:09 -0700)]
libc - re-stir arc4random() on fork()
* Fix an issue where the arc4random() function was not being re-stirred
on a fork.
Reported-by: zrj
Matthew Dillon [Sat, 3 Sep 2016 17:24:56 +0000 (10:24 -0700)]
libc - Fix malloc() alignment for small allocations
* malloc()'s slab allocator was set to use 8-byte alignment
for any allocation < 128 bytes that was not otherwise on
an integral alignment boundary. This breaks GCC-7 which assumes
16-byte alignment for non-16-integral sizes < 128 bytes. e.g.
if 18 bytes is allocated, GCC-7 assumes the resulting pointer will
be 16-byte-aligned.
* The standard is somewhat deficient in its characterization of what the
required alignment should be, because there are already instructions
which prefer 32 and 64 byte alignments, but are relaxed on Intel to
only require 16-byte alignments (aka %ymm and %zmm registers in the
vector extensions), and it's stupid to enforce even larger alignments
for tiny allocations.
* But generally speaking it makes sense to enforce a 16-byte alignment
for any allocations >= 16 bytes, regardless of the size being passed-in
not being 16-byte aligned, and this change does that. Allocations of
less than 16 bytes will still be 8-byte aligned because it is phenomenally
wasteful for them not to be.
Reported-by: marino
Matthew Dillon [Wed, 31 Aug 2016 02:36:53 +0000 (19:36 -0700)]
kernel - Fix LOOPMASK debugging for Xinvltlb
* Fix LOOPMASK debugging for Xinvltlb, the #if 1 can now be set to #if 0
to turn off the debugging.
zrj [Sun, 21 Aug 2016 08:30:47 +0000 (11:30 +0300)]
buildworld - bootstrap compatibility compiling older DragonFly's
* Fix a buildworld issue bootstrapping the gencat/uudecode utilities, which do
not expect newer header files (POSIX getline general visibility from <stdio.h>).
Could be MFC'd to DragonFly 4.6 branch for easier switching <-> master.
Matthew Dillon [Mon, 8 Aug 2016 17:46:35 +0000 (10:46 -0700)]
kernel - Add workaround for improper yield in ACPI path
* For now add a workaround for an improper yield that can occur indirectly
via the ACPI path. The problem is that the ACPI contrib code can hold
a spinlock across a kmalloc() call.
* The ACPI code, in particular AcpiOsAcquireLock(), uses a spin lock. At
the same time it MUST use a spinlock because it might be called from
the idle thread. But it also appears that the code might call kmalloc()
while holding a spinlock.
The kmalloc path that ACPI calls into uses M_INTWAIT, which reduces the
chance that kmalloc will try to block. However, kmalloc is used to execute
staged kfrees which can create a sequence:
kmalloc -> kmem_slab_free -> (vm system) ->
vm_object_page_remove_callback -> lwkt_user_yield().
Matthew Dillon [Mon, 8 Aug 2016 02:44:33 +0000 (19:44 -0700)]
kernel - Remove some debug output
* Remove "Warning: cache_resolve: ncp '%s' was unlinked" debug output.
This was originally added to validate a particular code path and is
no longer needed.
Sascha Wildner [Sun, 7 Aug 2016 09:29:13 +0000 (11:29 +0200)]
<time.h>: Adjust the visibility of CLOCK_REALTIME and TIMER_ABSTIME.
Looks like they came in with IEEE Std 1003.1b-1993.
Helps building math/clblas with GCC.
Reported-by: zrj
Matthew Dillon [Sun, 7 Aug 2016 04:25:26 +0000 (21:25 -0700)]
kernel - Fix memcpy assembly ABI
* memcpy must return the original (dst) argument, and wasn't for the kernel.
Nothing used it explicitly, but gcc sometimes decides to call memcpy and
assumes the correct return value. It was just luck that it hasn't mattered
up until now.
Matthew Dillon [Sat, 6 Aug 2016 18:05:10 +0000 (11:05 -0700)]
libc - Include information on the 'e' flag in the popen() manual page.
* Include information on the 'e' flag in the popen() manual page.
Matthew Dillon [Fri, 5 Aug 2016 20:12:08 +0000 (13:12 -0700)]
kernel - Fix kern.proc.pathname sysctl
* kern.proc.pathname is a sysctl used by programs to find the path
of the running program. This sysctl was created before we stored
sufficient information in the proc structure to construct the
correct path when multiple aliases are present (due to e.g. null-mounts)
to the same file.
* We do have this information, in p->p_textnch, so change the sysctl to
use it. The sysctl will now return the actual full path in the context
of whoever ran the program, so it should properly take into account
chroots and such.
Matthew Dillon [Fri, 5 Aug 2016 07:18:07 +0000 (00:18 -0700)]
dma - Fix security hole
* dma makes an age-old mistake of not properly checking whether a file
owned by a user is a symlink or not, a bug which the original mail.local
also had.
* Add O_NOFOLLOW to disallow symlinks.
Thanks-to: BSDNow Episode 152, which made me dive into dma to check when
they talked about the mail.local bug.
Matthew Dillon [Thu, 4 Aug 2016 02:38:11 +0000 (19:38 -0700)]
kernel - Fix lwp_fork/exit race (2) (vkernel)
* Fix same race as before, in vkernel also.
Matthew Dillon [Wed, 3 Aug 2016 05:32:11 +0000 (22:32 -0700)]
kernel - Turn off zeroidle in the release branch.
* Turn off automatic page zeroing when idle in the release branch. This
code has been completely removed in the master branch. Pre-zeroing of
pages no longer has any benefit on modern cpus for a multitude of reasons.
* Will result in a small improvement in performance under heavy SMP loads.
Matthew Dillon [Wed, 3 Aug 2016 05:28:54 +0000 (22:28 -0700)]
kernel - Fix lwp_fork/exit race
* In a multi-threaded program it is possible for the exit sequence to
deadlock if one thread is trying to exit (exit the entire process)
while another thread is simultaneously creating a new thread.
* Fix the issue by having the new thread check for the exit condition and
send a SIGKILL to itself, and kprintf() a message when it happens.
Matthew Dillon [Sun, 31 Jul 2016 03:40:06 +0000 (20:40 -0700)]
kernel - Refactor cpu localization for VM page allocations (3)
* Instead of iterating the cpus in the mask starting at cpu #0, iterate
starting at mycpu to the end, then from 0 to mycpu - 1.
This fixes random masked wakeups from favoring lower-numbered cpus.
* The user process scheduler (usched_dfly) was favoring lower-numbered
cpus due to a bug in the simple selection algorithm, causing forked
processes to initially weight improperly. A high fork or fork/exec
rate skewed the way the cpus were loaded.
Fix this by correctly scanning cpus from the (scancpu) rover.
* For now, use a random 'previous' affinity for initially scheduling a
fork.
Matthew Dillon [Sat, 30 Jul 2016 19:30:24 +0000 (12:30 -0700)]
kernel - Refactor cpu localization for VM page allocations (2)
* Finish up the refactoring. Localize backoffs for search failures
by doing a masked domain search. This avoids bleeding into non-local
page queues until we've completely exhausted our local queues,
regardless of the starting pg_color index.
* We try to maintain 16-way set associativity for VM page allocations
even if the topology does not allow us to do it perfectly. So, for
example, a 4-socket x 12-core (48-core) opteron can break the 256
queues into 4 x 64 queues, then split the 12-cores per socket into
sets of 3 giving 16 queues (the minimum) to each set of 3 cores.
* Refactor the page-zeroing code to only check the localized area.
This fixes a number of issues related to the zeroed pages in the
queues winding up severely unbalanced. Other cpus in the local
group can help replenish a particular cpu's pre-zeroed pages but
we intentionally allow a heavy user to exhaust the pages.
* Adjust the cpu topology code to normalize the physical package id.
Some machines start at 1, some machines start at 0. Normalize
everything to start at 0.
Matthew Dillon [Sat, 30 Jul 2016 19:27:09 +0000 (12:27 -0700)]
kernel - cleanup vfs_cache debugging
* Remove the deep namecache recursion warning, we've taken care of it
properly for a while now so we don't need to know when it happens any
more.
* Augment the cache_inval_internal warnings with more information.
Imre Vadász [Sat, 30 Jul 2016 10:32:26 +0000 (12:32 +0200)]
if_iwm - Fix iwm_poll_bit() usage in iwm_stop_device().
* The iwm(4) iwm_poll_bit() returns 1 on success and 0 on failure,
whereas iwl_poll_bit() in Linux iwlwifi returns >= 0 on success and
< 0 on failure.
Matthew Dillon [Sat, 30 Jul 2016 00:03:22 +0000 (17:03 -0700)]
kernel - Refactor cpu localization for VM page allocations
* Change how cpu localization works. The old scheme was extremely unbalanced
in terms of vm_page_queue[] load.
The new scheme uses cpu topology information to break the vm_page_queue[]
down into major blocks based on the physical package id, minor blocks
based on the core id in each physical package, and then by 1's based on
(pindex + object->pg_color).
If PQ_L2_SIZE is not big enough such that 16-way operation is attainable
by physical and core id, we break the queue down only by physical id.
Note that the core id is a real core count, not a cpu thread count, so
an 8-core/16-thread x 2 socket xeon system will just fit in the 16-way
requirement (there are 256 PQ_FREE queues).
* When a particular queue does not have a free page, iterate nearby queues
starting at +/- 1 (before, we started at +/- PQ_L2_SIZE/2), in an attempt to
retain as much locality as possible. This won't be perfect but it should
be good enough.
* Also fix an issue with the idlezero counters.
Matthew Dillon [Fri, 29 Jul 2016 21:59:15 +0000 (14:59 -0700)]
systat - Adjust extended vmstats display
* When the number of devices are few enough (or you explicitly specify
just a few disk devices, or one), there is enough room for the
extended vmstats display. Make some adjustments to this display.
* Display values in bytes (K, M, G, etc) instead of pages like the other
fields.
* Rename zfod to nzfod and subtract-away ozfod when displaying nzfod
(only in the extended display), so the viewer doesn't have to do the
subtraction in his head.
Matthew Dillon [Fri, 29 Jul 2016 20:29:03 +0000 (13:29 -0700)]
kernel - Reduce memory testing and early-boot zeroing.
* Reduce the amount of memory testing and early-boot zeroing that
we do, improving boot times on systems with large amounts of memory.
* Fix race in the page zeroing count.
* Refactor the VM zeroidle code. Instead of having just one kernel thread,
have one on each cpu.
This significantly increases the rate at which the machine can eat up
idle cycles to pre-zero pages in the cold path, improving performance
in the hot-path (normal) page allocations which request zeroed pages.
* On systems with a lot of cpus there is usually a little idle time (e.g.
0.1%) on a few of the cpus, even under extreme loads. At the same time,
such loads might also imply a lot of zfod faults requiring zero'd pages.
On our 48-core opteron we see a zfod rate of 1.0 to 1.5 GBytes/sec and
a page-freeing rate of 1.3 - 2.5 GBytes/sec. Distributing the page
zeroing code and eating up these minuscule bits of idle time improves the
kernel's ability to provide a pre-zeroed page (vs having to zero it in
the hot path) significantly.
Under the synth test load the kernel was still able to provide 400-700
MBytes/sec worth of pre-zeroed pages whereas before this change the kernel
was only able to provide 20 MBytes/sec worth of pre-zeroed pages.
Matthew Dillon [Fri, 29 Jul 2016 17:22:53 +0000 (10:22 -0700)]
kernel - Cleanup namecache stall messages on console
* Report the proper elapsed time and also include td->td_comm
in the printed output on the console.
Matthew Dillon [Fri, 29 Jul 2016 17:02:50 +0000 (10:02 -0700)]
kernel - Fix rare tsleep/callout race
* Fix a rare tsleep/callout race. The callout timer can trigger before
the tsleep() releases its lwp_token (or if someone else holds the
calling thread's lwp_token).
This case was detected, but the code failed to adjust lwp_stat before
descheduling and switching away. This resulted in an endless sleep.
Sepherosa Ziehau [Fri, 29 Jul 2016 08:56:10 +0000 (16:56 +0800)]
hyperv/vmbus: Passthrough interrupt resource allocation to nexus
This greatly simplifies interrupt allocation. Also reenable the interrupt
resource not found warning in acpi.
Matthew Dillon [Fri, 29 Jul 2016 01:05:42 +0000 (18:05 -0700)]
libthread_xu - Don't override vfork()
* Allow vfork() to operate normally in a threaded environment. The kernel
can handle multiple concurrent vfork()s by different threads (only the
calling thread blocks, same as how Linux deals with it).
Matthew Dillon [Thu, 28 Jul 2016 17:12:39 +0000 (10:12 -0700)]
kernel - Be nicer to pthreads in vfork()
* When vfork()ing, give the new sub-process's lwp the same TID as the one
that called vfork(). Even though user processes are not supposed to do
anything sophisticated inside a vfork() prior to exec()ing, some things,
such as fileno() having to lock in a threaded environment, might not be
apparent to the programmer.
* By giving the sub-process the same TID, operations done inside the
vfork() prior to exec that interact with pthreads will not confuse
pthreads and cause corruption due to e.g. TID 0 clashing with TID 0
running in the parent that is running concurrently.
Sascha Wildner [Thu, 28 Jul 2016 17:10:40 +0000 (19:10 +0200)]
ed(1): Sync with FreeBSD.
Sascha Wildner [Thu, 28 Jul 2016 17:18:46 +0000 (19:18 +0200)]
ed(1): Remove handling of non-POSIX environment.
Matthew Dillon [Thu, 28 Jul 2016 17:03:08 +0000 (10:03 -0700)]
libc - Fix more popen() issues
* Fix a file descriptor leak between popen() and pclose() in a threaded
environment. The control structure is removed from the list, then the
list is unlocked, then the file is closed. This can race a popen()
in between the unlock and the closure.
* Do not use fileno() inside vfork, it is a complex function in a threaded
environment which could lead to corruption since the vfork()'s lwp id may
clash with one from the parent process.
Matthew Dillon [Thu, 28 Jul 2016 16:39:57 +0000 (09:39 -0700)]
kernel - Fix getpid() issue in vfork() when threaded
* upmap->invfork was a 0 or 1, but in a threaded program it is possible
for multiple threads to be in vfork() at the same time. Change invfork
to a count.
* Fixes improper getpid() return when concurrent vfork()s are occurring in
a threaded program.
François Tigeot [Thu, 28 Jul 2016 06:56:12 +0000 (08:56 +0200)]
drm/linux: Clean-up pci_resource_start()
Making it less verbose
Sascha Wildner [Thu, 28 Jul 2016 20:16:33 +0000 (22:16 +0200)]
mktemp.3: Fix a typo and bump .Dd
Matthew Dillon [Wed, 27 Jul 2016 23:22:11 +0000 (16:22 -0700)]
systat - Restrict %rip sampling to root
* Only allow root to sample the %rip and %rsp on all cpus. The sysctl will
not sample and return 0 for these fields if the uid is not root.
This is for security, as %rip sampling can be used to break cryptographic
keys.
* systat -pv 1 will not display the sampling columns if the sample value
is 0.
Matthew Dillon [Wed, 27 Jul 2016 18:22:56 +0000 (11:22 -0700)]
test - Add umtx1 code
* Add umtx1 code - fast context switch tests
* Make blib.c thread-safe.
Matthew Dillon [Wed, 27 Jul 2016 18:13:44 +0000 (11:13 -0700)]
libc - Fix numerous fork/exec*() leaks, also add mkostemp() and mkostemps().
* Use O_CLOEXEC in many places to prevent temporary descriptors from leaking
into fork/exec'd code (e.g. in multi-threaded situations).
* Note that the popen code will close any other popen()'d descriptors in
the child process that it forks just prior to exec. However, there was
a descriptor leak where another thread issuing popen() at the same time
could leak the descriptors into their exec.
Use O_CLOEXEC to close this hole.
* popen() now accepts the 'e' flag (i.e. "re") to retain O_CLOEXEC in the
returned descriptor. Normal "r" (etc) will clear O_CLOEXEC in the returned
descriptor.
Note that normal "r" modes are still fine for most use cases since popen
properly closes other popen()d descriptors in the fork(). BUT!! If the
threaded program calls exec*() in other ways, such descriptors may
unintentionally be passed onto sub-processes. So consider using "re".
* Add mkostemp() and mkostemps() to allow O_CLOEXEC to be passed in,
closing a thread race that would otherwise leak the temporary descriptor
into other fork/exec()s.
Taken-from: Mostly taken from FreeBSD
Matthew Dillon [Tue, 26 Jul 2016 19:53:39 +0000 (12:53 -0700)]
kernel - refactor CPUMASK_ADDR()
* Refactor CPUMASK_ADDR(), removing the conditionals and just indexing the
array as appropriate.
zrj [Wed, 20 Jul 2016 16:59:28 +0000 (19:59 +0300)]
cpumask.9: Add short manpage.
zrj [Tue, 19 Jul 2016 16:35:16 +0000 (19:35 +0300)]
cpumask.h: Make CPUMASK_ELEMENTS implementation defined.
No functional change intended.
zrj [Tue, 19 Jul 2016 07:07:45 +0000 (10:07 +0300)]
sys: Extract CPUMASK macros to new <machine/cpumask.h>
There are more than enough CPUMASK macros already for them to deserve their
own header. So far the only userspace users are powerd(8), usched(8) and
kern_usched.c (VKERNEL64). After a recent change exposing the kernel's
internal CPUMASK macros, they became available to userland code even through
the <time.h> header. It is better to avoid that. This also reduces POSIX
namespace pollution and keeps the cpu/types.h header slim.
For now leave CPUMASK_ELEMENTS (not sure about the ASSYM() macro handling
the _ prefix) and the cpumask_t typedef (a forward declaration of struct
cpumask would be better in prototypes).
Matthew Dillon [Tue, 26 Jul 2016 23:24:14 +0000 (16:24 -0700)]
kernel - Disable lwp->lwp optimization in thread switcher
* Put #ifdef around the existing lwp->lwp switch optimization and then
disable it. This optimization tries to avoid reloading %cr3 and avoid
pmap->pm_active atomic ops when switching to a lwp that shares the same
process.
This optimization is no longer applicable on multi-core systems as such
switches are very rare. LWPs are usually distributed across multiple cores
so rarely does one switch to another on the same core (and in cpu-bound
situations, the scheduler will already be in batch mode). The conditionals
in the optimization, on the other hand, did measurably (just slightly)
reduce performance for normal switches. So turn it off.
* Implement an optimization for interrupt preemptions, but disable it for
now. I want to keep the code handy but so far my tests show no improvement
in performance with huge interrupt rates (from nvme devices), so it is
#undef'd for now.
Matthew Dillon [Tue, 26 Jul 2016 20:12:51 +0000 (13:12 -0700)]
kernel - Minor cleanup swtch.s
* Minor cleanup
Matthew Dillon [Tue, 26 Jul 2016 20:01:27 +0000 (13:01 -0700)]
kernel - Fix namecache race & panic
* Properly lock and re-check the parent association when iterating its
children, fixing a bug in a code path associated with unmounting
filesystems.
The code improperly assumed that there could be no races because there
were no accessors left. In fact, under heavy loads, the namecache
scan in this routine can race against the negative-name-cache management
code.
* Generally speaking this can only happen when lots of mounts and unmounts are
done under heavy loads (for example, tmpfs mounts during a poudriere or
synth run).
Matthew Dillon [Tue, 26 Jul 2016 19:56:31 +0000 (12:56 -0700)]
kernel - Reduce atomic ops in switch code
* Instead of using four atomic 'and' ops and four atomic 'or' ops, use
one atomic 'and' and one atomic 'or' when adjusting the pmap->pm_active.
* Store the array index and simplified cpu mask in the globaldata structure
for the above operation.
Matthew Dillon [Tue, 26 Jul 2016 00:06:52 +0000 (17:06 -0700)]
kernel - Fix VM bug introduced earlier this month
* Adding the yields to the VM page teardown and related code was a great
idea (~Jul 10th commits), but it also introduced a bug where the page
could get torn-out from under the scan due to the vm_object's token being
temporarily lost.
* Re-check page object ownership and (when applicable) its pindex before
acting on the page.
Matthew Dillon [Mon, 25 Jul 2016 23:05:40 +0000 (16:05 -0700)]
systat - Refactor memory displays for systat -vm
* Report paging and swap activity in bytes and I/Os instead of pages and
I/Os (I/Os usually matched pages).
* Report zfod and cow in bytes instead of pages.
* Replace the REAL and VIRTUAL section with something that makes a bit
more sense.
Report active memory (this is just active pages), kernel memory
(currently just wired but we can add more stuff later), Free
(inactive + cache + free is considered free/freeable memory), and
total system memory as reported at boot time.
Report total RSS - basically how many pages the system is mapping to
user processes. Due to sharing this can be a large value.
Do not try to report aggregate VSZ as there's no point in doing so
any more.
Report swap usage on the main -vm display as well as total swap
allocated.
* Fix display bug in systat -sw display.
* Add "nvme" device type match for the disk display.
Imre Vadász [Sun, 24 Jul 2016 19:11:29 +0000 (21:11 +0200)]
if_iwm - Fix inverted logic in iwm_tx().
The PROT_REQUIRE flag should be set for data frames above a certain
length, but we were setting it for !data frames above a certain length,
which makes no sense at all.
Taken-From: OpenBSD, Linux iwlwifi
Matthew Dillon [Mon, 25 Jul 2016 18:31:04 +0000 (11:31 -0700)]
kernel - Fix mountctl() / unmount race
* kern_mountctl() now properly checks to see if an unmount is in-progress
and returns an error, fixing a later panic.
Sascha Wildner [Mon, 25 Jul 2016 19:46:01 +0000 (21:46 +0200)]
sysconf.3: Fix typo.
Sascha Wildner [Mon, 25 Jul 2016 18:43:03 +0000 (20:43 +0200)]
libc/strptime: Return NULL, not 0, since the function returns char *.
While here, accept 'UTC' for %Z as well.
Taken-from: FreeBSD
Matthew Dillon [Mon, 25 Jul 2016 18:18:57 +0000 (11:18 -0700)]
mountd, mount - Change how mount signals mountd, reduce mountd spam
* mount now signals mountd with SIGUSR1 instead of SIGHUP.
* mountd now recognizes SIGUSR1 as requesting an incremental update.
Instead of wiping all exports on all mounts and then re-scanning
the exports file and re-adding from the exports file, mountd will
now only wipe the export(s) on mounts it finds in the exports file.
* Greatly reduces unnecessary mountlist scans and commands due to
mount_null and mount_tmpfs operations, while still preserving our
ability to export such filesystems.
Matthew Dillon [Mon, 25 Jul 2016 04:55:00 +0000 (21:55 -0700)]
kernel - Close a few SMP holes
* Don't trust the compiler when loading refs in cache_zap(). Make sure
it doesn't reorder or re-use the memory reference.
* In cache_nlookup() and cache_nlookup_maybe_shared(), do a full re-test
of the namecache element after locking instead of a partial re-test.
* Lock the namecache record in two situations where we need to set a
flag. Almost all other flag cases require similar locking. This fixes
a potential SMP race in a very thin window during mounting.
* Fix unmount / access races in sys_vquotactl() and, more importantly, in
sys_mount(). We were disposing of the namecache record after extracting
the mount pointer, then using the mount pointer. This could race an
unmount and result in a corrupt mount pointer.
Change the code to dispose of the namecache record after we finish using
the mount point. This is somewhat more complex than I'd like, but it
is important to unlock the namecache record across the potentially
blocking operation to prevent a lock chain from propagating upwards
towards the root.
* Enhanced debugging for the namecache teardown case when nc_refs changes
unexpectedly.
* Remove some dead code (cache_purgevfs()).
Matthew Dillon [Mon, 25 Jul 2016 04:52:26 +0000 (21:52 -0700)]
kernel - Cut buffer cache related pmap invalidations in half
* Do not bother to invalidate the TLB when tearing down a buffer
cache buffer. On the flip side, always invalidate the TLB
(the page range in question) when entering pages into a buffer
cache buffer. Only applicable to normal VMIO buffers.
* Significantly improves buffer cache / filesystem performance with
no real risk.
* Significantly improves performance for tmpfs teardowns on unmount
(which typically have to tear-down a lot of buffer cache buffers).
Matthew Dillon [Mon, 25 Jul 2016 04:49:57 +0000 (21:49 -0700)]
kernel - Add some more options for pmap_qremove*()
* Add pmap_qremove_quick() and pmap_qremove_noinval(), allowing pmap
entries to be removed without invalidation under carefully managed
circumstances by other subsystems.
* Redo the virtual kernel a little to work the same as the real kernel
when entering new pmap entries. We cannot assume that no invalidation
is needed when the prior contents of the pte is 0, because there are
several ways it could have become 0 without a prior invalidation.
Also use an atomic op to clear the entry.
Matthew Dillon [Mon, 25 Jul 2016 04:44:33 +0000 (21:44 -0700)]
kernel - cli interlock with critcount in interrupt assembly
* Disable interrupts when decrementing the critical section count
and gd_intr_nesting_level, just prior to jumping into doreti.
This prevents a stacking interrupt from occurring in this roughly
10-instruction window.
* While limited stacking is not really a problem, this closes a very
small and unlikely window where multiple device interrupts could
stack excessively and run the kernel thread out of stack space.
(unlikely that it has ever happened in real life, but becoming more
likely as some modern devices are capable of much higher interrupt
rates).
Sascha Wildner [Sun, 24 Jul 2016 22:45:46 +0000 (00:45 +0200)]
sysconf.3: Document _SC_PAGE_SIZE and _SC_PHYS_PAGES.
Taken-from: FreeBSD
Submitted-by: Sevan Janiyan
Dragonfly-bug: <https://bugs.dragonflybsd.org/issues/2929>
Matthew Dillon [Sun, 24 Jul 2016 21:02:10 +0000 (14:02 -0700)]
drm - Fix subtle plane masking bug.
* Index needs to be 1 << index.
Reported-by: davshao
Found-by: Matt Roper - https://patchwork.kernel.org/patch/7889051/
Matthew Dillon [Sun, 24 Jul 2016 07:56:04 +0000 (00:56 -0700)]
kernel - Fix atomic op comparison
* The sequence was testing a signed integer and then testing the same
variable using atomic_fetchadd_int(&var, 0). Unfortunately, the
atomic-op returns an unsigned value so the result is that when the
buffer count was exhausted, the program would hard-loop without
calling tsleep.
* Fixed by casting the atomic op.
* Should fix the hardlock issue once and for all.
Matthew Dillon [Sun, 24 Jul 2016 02:19:46 +0000 (19:19 -0700)]
kernel - Refactor Xinvltlb a little, turn off the idle-thread invltlb opt
* Turn off the idle-thread invltlb optimization. This feature can be
turned on with a sysctl (default-off) machdep.optimized_invltlb. It
will be turned on by default when we've life-tested that it works
properly.
* Remove excess critical sections and interrupt disablements. All entries
into smp_invlpg() now occur with interrupts already disabled and the
thread already in a critical section. This also defers critical-section
1->0 transition handling away from smp_invlpg() and into its caller.
* Refactor the Xinvltlb APIs a bit. Have Xinvltlb enter the critical
section (it didn't before). Remove the critical section from
smp_inval_intr(). The critical section is now handled by the assembly,
and by any other callers.
* Add additional tsc-based loop/counter debugging to try to catch problems.
* Move inner-loop handling of smp_invltlb_mask to act on invltlbs a little
faster.
* Disable interrupts a little later inside pmap_inval_smp() and
pmap_inval_smp_cmpset().
Matthew Dillon [Sun, 24 Jul 2016 02:17:24 +0000 (19:17 -0700)]
hammer - remove commented out code, move a biodone()
* Remove commented-out code which is no longer applicable.
* Move the biodone() call in hammer_io_direct_write_complete() to after
the token-release, reducing stacking of tokens in biodone().
Matthew Dillon [Sun, 24 Jul 2016 02:09:26 +0000 (19:09 -0700)]
hammer - Try to fix improper DATA CRC error
* Under heavy I/O loads HAMMER has an optimization (similar to UFS) where
the logical buffer is used to issue a write to the underlying device,
rather than copying the logical buffer to a device buffer. This
optimization is earmarked by a hammer2_record.
* If the logical buffer is discarded just after it is written, and then
re-read, hammer may go through a path which calls
hammer_ip_resolve_data(). This code failed to check whether the record
was still in-progress, and in-fact the write to the device may not have
even been initiated yet, and there could also have been a device buffer
alias in the buffer cache for the device for the offset.
This caused the followup read to access the wrong data, causing HAMMER
to report a DATA CRC error. The actual media receives the correct data
eventually and a umount/remount would show an uncorrupted file.
* Try to fix the problem by calling hammer_io_direct_wait() on the record
in this path to wait for the operation to complete (and also to
invalidate the related device buffer) before trying to re-read the block
from the media.
Matthew Dillon [Sun, 24 Jul 2016 02:06:42 +0000 (19:06 -0700)]
kernel - Enhance indefinite wait buffer error message
* Enhance the error message re: indefinite wait buffer notifications.
Matthew Dillon [Sun, 24 Jul 2016 01:59:33 +0000 (18:59 -0700)]
kernel - Fix TDF_EXITING bug, instrument potential live loops
* Fix a TDF_EXITING bug. lwkt_switch_return() is called to fixup
the 'previous' thread, meaning turning off TDF_RUNNING and handling
TDF_EXITING.
However, if TDF_EXITING is not set, the old thread can be used or
acted upon / exited on by some other cpu the instant we clear
TDF_RUNNING. In this situation it is possible that the other cpu
will set TDF_EXITING in the small window of opportunity just before
we check ourselves, leading to serious thread management corruption.
* The new pmap_inval*() code runs on Xinvltlb instead of as an IPIQ
and can easily create significant latency between the two tests,
whereas the old code ran as an IPIQ and could not due to the critical
section.
Matthew Dillon [Sun, 24 Jul 2016 01:57:15 +0000 (18:57 -0700)]
kernel - Add vfs.repurpose_enable, adjust B_HASBOGUS
* Add vfs.repurpose_enable, default disabled. If this feature is turned on
the system will try to repurpose the VM pages underlying a buffer on
re-use instead of allowing the VM pages to cycle into the VM page cache.
Designed for high I/O-load environments.
* Use the B_HASBOGUS flag to determine if a pmap_qenter() is required,
and devolve the case to a single call to pmap_qenter() instead of one
for each bogus page.
Sascha Wildner [Sat, 23 Jul 2016 20:05:49 +0000 (22:05 +0200)]
Add a realquickkernel target, analogous to realquickworld.
It skips the recently added depend step, so it behaves like
quickkernel did before
521f740e8971df6fdb1b63933cb534746e86bfae.
Sascha Wildner [Sat, 23 Jul 2016 19:15:13 +0000 (21:15 +0200)]
Fix VKERNEL64 build.
François Tigeot [Sat, 23 Jul 2016 18:20:48 +0000 (20:20 +0200)]
kernel: Fix compilation
Sascha Wildner [Sat, 23 Jul 2016 17:15:24 +0000 (19:15 +0200)]
bsd-family-tree: Sync with FreeBSD.
François Tigeot [Sat, 23 Jul 2016 10:16:31 +0000 (12:16 +0200)]
drm/i915/gem: Reduce differences with Linux 4.4
François Tigeot [Sat, 23 Jul 2016 09:12:44 +0000 (11:12 +0200)]
drm: Sync a few headers with Linux 4.4
Sascha Wildner [Sat, 23 Jul 2016 07:40:11 +0000 (09:40 +0200)]
dmesg.8: Improve markup a bit and fix a typo (dumnr -> dumpnr).
Matthew Dillon [Tue, 19 Jul 2016 01:27:12 +0000 (18:27 -0700)]
kernel - repurpose buffer cache entries under heavy I/O loads
* At buffer-cache I/O loads > 200 MBytes/sec (newbuf instantiations, not
cached buffer use), the buffer cache will now attempt to repurpose the
VM pages in the buffer it is recycling instead of returning the pages
to the VM system.
* sysctl vfs.repurposedspace may be used to adjust the I/O load limit.
* The repurposing code attempts to free the VM page then reassign it to
the logical offset and vnode of the new buffer. If this succeeds, the
new buffer can be returned to the caller without having to run any
SMP tlb operations. If it fails, the pages will be either freed or
returned to the VM system and the buffer cache will act as before.
* The I/O load limit has a secondary beneficial effect which is to reduce
the allocation load on the VM system to something the pageout daemon can
handle while still allowing new pages up to the I/O load limit to transfer
to VM backing store. Thus, this mechanism ONLY affects systems with I/O
load limits above 200 MBytes/sec (or whatever programmed value you decide
on).
* Pages already in the VM page cache do not count towards the I/O load limit
when reconstituting a buffer.
Matthew Dillon [Mon, 18 Jul 2016 18:44:11 +0000 (11:44 -0700)]
kernel - Refactor buffer cache code in preparation for vm_page repurposing
* Keep buffer_map but no longer use vm_map_findspace/vm_map_delete to manage
buffer sizes. Instead, reserve MAXBSIZE of unallocated KVM for each buffer.
* Refactor the buffer cache management code. bufspace exhaustion now has
hysteresis, bufcount works just about the same.
* Start work on the repurposing code (currently disabled).
Matthew Dillon [Fri, 22 Jul 2016 05:48:10 +0000 (22:48 -0700)]
hammer2 - Fix deadlocks, bad assertion, improve flushing.
* Fix a deadlock in checkdirempty(). We must release the lock on oparent
before following a hardlink. If after re-locking chain->parent != oparent,
return EAGAIN to the caller.
* When doing a full filesystem flush, pre-flush the vnodes with a normal
transaction to try to soak-up all the compression time and avoid stalling
user process writes for too long once we get inside the formal flush.
* Fix a flush bug. Flushing a deleted chain is allowed if it is an inode.
Sascha Wildner [Fri, 22 Jul 2016 19:17:54 +0000 (21:17 +0200)]
build.7: Mention that KERNCONF can have more than one config.
Sascha Wildner [Fri, 22 Jul 2016 19:17:29 +0000 (21:17 +0200)]
Run make depend in quickkernel, too.
It is much cleaner to do that, just like it is run in quickworld.
At the price of a small increase in build time, quickkernel will now
continue working when a new kernel header is added, which broke it
before this commit because the header would not be copied to the right
place in /usr/obj.
Matthew Dillon [Sat, 23 Jul 2016 04:58:59 +0000 (21:58 -0700)]
kernel - Fix excessive ipiq recursion (4)
* Possibly the smoking gun. There was a case where the lwkt_switch()
code could wind up looping excessively calling lwkt_getalltokens()
if td_contended went negative, and td_contended on interrupt threads
could in-fact go negative.
This stopped IPIs in their tracks.
* Fix by making td_contended unsigned, causing the comparisons to work
in all situations. And add a missing assignment to 0 for the
preempted thread case.
Matthew Dillon [Sat, 23 Jul 2016 01:22:17 +0000 (18:22 -0700)]
kernel - Fix excessive ipiq recursion (3)
* Third try. I'm not quite sure why we are still getting hard locks. These
changes (so far) appear to fix the problem, but I don't know why. It
is quite possible that the problem is still not fixed.
* Setting target->gd_npoll will prevent *all* other cpus from sending an
IPI to that target. This should have been ok because we were in a
critical section and about to send the IPI to the target ourselves, after
setting gd_npoll. The critical section does not prevent Xinvltlb, Xsniff,
Xspuriousint, or Xcpustop from running, but of these only Xinvltlb does
anything significant and it should theoretically run at a higher level
on all cpus than Xipiq (and thus complete without causing a deadlock of
any sort).
So in short, it should have been ok to allow something like an Xinvltlb
to interrupt the cpu inbetween setting target->gd_npoll and actually
sending the Xipiq to the target. But apparently it is not ok.
* Only clear mycpu->gd_npoll when we either (1) EOI and take the IPIQ
interrupt or (2) If the IPIQ is made pending via reqflags, when we clear
the flag. Previously we were clearing gd_npoll in the IPI processing
loop itself, potentially racing new incoming interrupts before they get
EOId by our cpu. This also should have been just fine, because interrupts
are enabled in the processing loop so nothing should have been able to
back-up in the LAPIC.
I can conjecture that possibly there was a race when we cleared gd_npoll
multiple times, potentially clearing it a second (or later) time,
allowing multiple incoming IPIs to be queued from multiple cpu sources but
then cli'ing and entering (e.g.) an Xinvltlb processing loop before our cpu
could acknowledge any of them. And then, possibly, trying to issue an IPI
with the system in this state.
I don't really see how this can cause a hard lock because I did not observe
any loop/counter error messages on the console which should have been
triggered if other cpus got stuck trying to issue IPIs. But LAPIC IPI
interactions are not well documented so... perhaps they were being issued
but blocked our local LAPIC from accepting a Xinvltlb due to having one
extra unacknowledged Xipiq pending? But then, our Xinvltlb processing loop
*does* enable interrupts for the duration, so it should have drained if
this were so.
In any case, we no longer gratuitously clear gd_npoll in the processing
loop. We only clear it when we know there isn't one in-flight heading to
our cpu and none queued on our cpu. What will happen now is that a second
IPI can be sent to us once we've EOI'd the first one, and wind up in
reqflags, but will not be acted upon until our current processing loop
returns.
I will note that the gratuitous clearing we did before *could* have allowed
substantially all other cpus to try to Xipiq us at nearly the same time,
so perhaps the deadlock was related to that type of situation.
* When queueing an ipiq command from mycpu to a target, interrupts were
enabled between our entry into the ipiq fifo, the setting of our cpu bit
in the target gd_ipimask, the setting of target->gd_npoll, and our
issuing of the actual IPI to the target. We now disable interrupts across
these four steps.
It should have been ok for interrupts to have been left enabled across
these four steps. It might still be, but I am not taking any chances now.