Sascha Wildner [Sun, 3 Nov 2013 22:00:28 +0000 (23:00 +0100)]
pthread/sem_timedwait: sem_timedwait()'s timespec argument is const.
Sascha Wildner [Sun, 3 Nov 2013 21:36:35 +0000 (22:36 +0100)]
Fix some more prototypes in manual pages.
Sascha Wildner [Sun, 3 Nov 2013 20:51:34 +0000 (21:51 +0100)]
bsd-family-tree: Sync with FreeBSD (for OpenBSD 5.4).
Sascha Wildner [Sun, 3 Nov 2013 20:50:39 +0000 (21:50 +0100)]
Fix two prototypes in stringlist.3 and rpc_svc_reg.3.
Sascha Wildner [Sun, 3 Nov 2013 08:21:33 +0000 (09:21 +0100)]
kernel/x86_64: Do not print a message upon every segmentation fault.
It was printed even when the SIGSEGV was caught, such as by configure
tests, causing a rather noisy console when packages were built. After
this commit we're back to the traditional behavior (no message if the
signal is caught, and the usual message if not):
pid <pid> (<user>), uid <uid>: exited on signal 11 (core dumped)
While here, adjust some comments.
François Tigeot [Sun, 3 Nov 2013 07:13:18 +0000 (08:13 +0100)]
kernel/i386: Implement atomic_swap_long()
François Tigeot [Sat, 2 Nov 2013 14:47:01 +0000 (15:47 +0100)]
drm: Use Linux atomic types and functions
Opportunistically sync to Linux 3.8 when possible
François Tigeot [Sat, 2 Nov 2013 14:36:46 +0000 (15:36 +0100)]
kref.h: Adapt to Linux 3.8's drm
* Implement kref_sub()
* The internal counter is used by the drm code, rename it to refcount
* The internal counter must be of type atomic_t
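As a rough illustration of the points above, here is a user-space sketch of a kref whose counter is an atomic integer named refcount, together with a kref_sub() that drops several references at once. It uses C11 atomics as a stand-in for atomic_t; demo_release() and demo_released are hypothetical helpers, and the real header may differ in detail.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Stand-in for the kernel type: the count lives in an atomic integer. */
struct kref {
	atomic_int refcount;
};

static void
kref_init(struct kref *kref)
{
	atomic_store(&kref->refcount, 1);
}

static void
kref_get(struct kref *kref)
{
	atomic_fetch_add(&kref->refcount, 1);
}

/*
 * Drop `count` references at once.  If that releases the last reference,
 * call release() and return 1; otherwise return 0.
 */
static int
kref_sub(struct kref *kref, unsigned int count,
    void (*release)(struct kref *kref))
{
	assert(release != NULL);
	/* atomic_fetch_sub() returns the value held before the subtraction. */
	if (atomic_fetch_sub(&kref->refcount, (int)count) == (int)count) {
		release(kref);
		return 1;
	}
	return 0;
}

/* Hypothetical release callback, used only to observe the final drop. */
static int demo_released;

static void
demo_release(struct kref *kref)
{
	(void)kref;
	demo_released = 1;
}
```

Because the decrement and the comparison come from one atomic operation, only one caller can ever observe the drop to zero.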
François Tigeot [Sat, 2 Nov 2013 14:02:53 +0000 (15:02 +0100)]
drm: Import linux/kernel.h from FreeBSD's OFED stack
François Tigeot [Sat, 2 Nov 2013 17:58:22 +0000 (18:58 +0100)]
drm: fix test_and_set_bit() prototype
François Tigeot [Sat, 2 Nov 2013 15:36:15 +0000 (16:36 +0100)]
drm: Import linux/bitops.h from the FreeBSD OFED stack
François Tigeot [Sat, 2 Nov 2013 12:41:31 +0000 (13:41 +0100)]
drm: Improve linux/atomic.h
* Add an atomic64_t definition
* Add back atomic_xchg()
* Add atomic64_read(), atomic64_xchg() and atomic64_set()
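For readers unfamiliar with the Linux names, the following user-space sketch shows what the three atomic64 operations are expected to do. It is built on C11 <stdatomic.h> purely for illustration; the actual linux/atomic.h in the tree is implemented on the kernel's own atomic primitives and is not reproduced here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* User-space stand-in for the Linux type; the kernel version differs. */
typedef struct {
	_Atomic int64_t counter;
} atomic64_t;

static inline int64_t
atomic64_read(atomic64_t *v)
{
	assert(v != NULL);
	return atomic_load(&v->counter);
}

static inline void
atomic64_set(atomic64_t *v, int64_t i)
{
	assert(v != NULL);
	atomic_store(&v->counter, i);
}

/* Store the new value and return the one it replaced, as one atomic step. */
static inline int64_t
atomic64_xchg(atomic64_t *v, int64_t new_value)
{
	assert(v != NULL);
	return atomic_exchange(&v->counter, new_value);
}
```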
François Tigeot [Sat, 2 Nov 2013 12:07:24 +0000 (13:07 +0100)]
drm: Replace drm_atomic.h by linux/atomic.h from FreeBSD's OFED stack
Antonio Huete Jimenez [Sat, 2 Nov 2013 21:06:52 +0000 (22:06 +0100)]
dirfs - Call VOP_INACTIVE() on last VOP_CLOSE()
Antonio Huete Jimenez [Sat, 2 Nov 2013 21:02:41 +0000 (22:02 +0100)]
dirfs - Add VFS_STATFS() function
Antonio Huete Jimenez [Sat, 2 Nov 2013 20:49:25 +0000 (21:49 +0100)]
dirfs - Add KTR_LOG for VFS_MOUNT / VFS_UNMOUNT
Antonio Huete Jimenez [Sat, 2 Nov 2013 20:16:54 +0000 (21:16 +0100)]
dirfs - Improve a couple KTR messages
Matthew Dillon [Sat, 2 Nov 2013 07:06:57 +0000 (00:06 -0700)]
hammer2 - Stabilization
* Fix heavy cpu use in flush due to a blown recursion which can run down
the same chain many times due to the aliasing of hammer2_chain_core
structures.
The basic problem is that there can be H2 operations running concurrently
with a flush that are not part of the flush. These operations have a
higher transaction id. When situated deep in the tree, they can cause
the flush to repeatedly traverse large portions of the tree that it had
already checked because the recording of the lower flush TID is lower
than the update_tid from the concurrent operations.
* Fix a multitude of flush / concurrent-operations races. The worst of the
lot is related to the situation where a concurrent operation does a
delete-duplicate on a chain containing a block table (which can include
an inode chain) which the flush needs to update. This results in TWO
block tables needing updating relative to different synchronization
points. Essentially, one of the chains is strictly temporary for flush
purposes while the other is the 'real' chain.
For example, if the concurrent operation is adding or deleting elements
from a block table the flush may have to add/delete DIFFERENT elements
for its own view. This requires two different versions of the block table
(one being strictly temporary).
Improper updates of the chain->bref.mirror_tid caused the flush to get
confused and assert on the blocktable not containing the expected data.
* Fix more issues with concurrent operations during a flush. If a concurrent
operation deletes a chain and the flush needs to fork a 'live' version
of the chain, the flush's version will have a lower transaction id and
must be properly ordered in hammer2_chain_core->ownerq. It was not
being ordered properly.
* Flushes are recursive and to improve concurrency the flush temporarily
unlocks the old parent when diving under a child. This can result in a
race where, due to hammer2_chain_core aliasing, the recursion can wrap
around back to the parent.
Detect the case after re-locking the parent on the way back up the tree
and do the right thing.
* Fix handling of the flush block table rollup. Consolidate the call to
modify the parent (so we can adjust the blockrefs after flushing the
children) to a single point.
* Improve flush performance. If a parent is deferred at a higher level
and then encountered again via a shallower path, we now leave it deferred
and do not try to execute it in the shallower path even though the stack
depth is ok, as it will likely become deferred at a lower level anyway.
Check a deleted-chain case early before we recurse. A deleted chain
which is flagged DUPLICATED does not have to recurse as the sub-path
is reachable via some other parent. This significantly improves
performance because there are often a ton of chains in-memory marked
DELETED.
This results in more efficient deferrals.
* Fix adjustments of modify_tid and delete_tid in delete-duplicate
operations, clean up handling of CHAIN_INITIAL, properly transfer
flags in delete-duplicate.
* Fix some gratuitous wakeups in the transaction API.
Matthew Dillon [Fri, 1 Nov 2013 16:48:10 +0000 (09:48 -0700)]
hammer2 - stabilization
* Reduce HAMMER2_FLUSH_DEPTH_LIMIT from 40 to 10 to avoid blowing out
the kernel stack.
* Retool hammer2_chain_drop() and hammer2_chain_lastdrop() to remove all
possible recursion. The in-memory topology can get very deep and very
wide. This fixes another kernel stack blowout.
* Fix a bug in hammer2_chain_flush()'s deferred flush. Now that
hammer2_chain_flush() can replace the passed-in chain, we have to drop
the extra ref before calling it instead of after.
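The general technique of tearing down a deep topology without recursion can be sketched as below. This is not the hammer2 code: struct node, node_add_child() and tree_free() are hypothetical names, and the sketch only shows how an explicit worklist threaded through the nodes keeps stack use constant no matter how deep or wide the tree is.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct node {
	struct node *child;	/* first child */
	struct node *sibling;	/* next sibling */
};

/* Allocate a node and link it in as the first child of parent (if any). */
static struct node *
node_add_child(struct node *parent)
{
	struct node *n = calloc(1, sizeof(*n));

	if (n != NULL && parent != NULL) {
		n->sibling = parent->child;
		parent->child = n;
	}
	return n;
}

/*
 * Free every node under root using a loop instead of recursion.
 * The worklist is threaded through ->sibling, so no extra memory
 * and no deep call stack are needed.  Returns the number freed.
 */
static size_t
tree_free(struct node *root)
{
	struct node *pending = root;
	size_t nfreed = 0;

	assert(root == NULL || root->sibling == NULL);
	while (pending != NULL) {
		struct node *n = pending;

		pending = n->sibling;		/* pop n off the worklist */
		if (n->child != NULL) {
			/* Splice n's child list onto the front of the worklist. */
			struct node *tail = n->child;

			while (tail->sibling != NULL)
				tail = tail->sibling;
			tail->sibling = pending;
			pending = n->child;
		}
		free(n);
		nfreed++;
	}
	return nfreed;
}
```

Each child list is walked once, when its parent is popped, so the total work stays linear in the number of nodes.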
Matthew Dillon [Fri, 1 Nov 2013 07:51:00 +0000 (00:51 -0700)]
hammer2 - stabilization
* Code generally assumes that a deleted-flagged chain can still be
duplicated. Remove bogus call to hammer2_freemap_free() and
remove bogus masking of chain->bref.data_off in hammer2_chain_delete().
Matthew Dillon [Fri, 1 Nov 2013 06:09:31 +0000 (23:09 -0700)]
hammer1 - cleanup, minor bug fixes
* Cleanup pass, remove some dead code
* Minor bug fixes, add tokens around some paths that need them.
* Remove use of the master token in several paths that don't need it,
improving concurrency.
Matthew Dillon [Fri, 1 Nov 2013 06:07:03 +0000 (23:07 -0700)]
kernel - Improve panic handling
* Clear the gd_spinlocks counter when handling a panic. This improves
our chances of being able to obtain a crash dump
Matthew Dillon [Fri, 1 Nov 2013 05:57:55 +0000 (22:57 -0700)]
hammer2 - Stabilization pass, more flush refactoring
* Add voldata.inode_tid, separate inode TID allocations from
transaction TID allocations in voldata.
* Rewrite the transaction management functions.
* Rewrite hammer2's filesystem sync code to reduce stalls.
* Keep track of a generation number on the hammer2_chain_core structure
so the flush code can re-scan when it modifies elements within the
flush transaction.
* Cleanup the duplication and delete-duplication code and hardlink handling.
The delete-duplication code now properly tags delete_tid when a flush is
delete-duplicating a chain which is deleted in the live view but is still
valid in the flush view.
* Correct numerous bugs in tracking the modified/deleted state of
a chain.
* Correct numerous flush bugs.
* Separate the mirror TID for the freemap chain from the volume chain.
This will allow freemap updates to be delayed.
* Implement a more stringent algorithm to determine when CHAIN_MOVED
can be cleared in chain->flags.
* Do a better job limiting the flush scan when concurrent modifying
operations are occurring in large volumes.
Matthew Dillon [Fri, 1 Nov 2013 05:55:45 +0000 (22:55 -0700)]
test - Adjust vnodeinfo for system changes
* Make vnodeinfo in test/debug compile again.
Matthew Dillon [Wed, 30 Oct 2013 07:13:34 +0000 (00:13 -0700)]
hammer2 - Refactor flush
* Replace HAMMER2_CHAIN_SUBMODIFIED with core->update_tid. SUBMODIFIED
applies to chain->core, not to chain. Use a TID to track updates to
make it easier for a flush to update records without messing up flush
sequencing of chains being concurrently modified outside the flush's
TID (that will be handled in the next flush).
* Make sure the DUPLICATED flag is set when duplicating a chain which
has already been duplicated to another target. This case only arises during
flushes and can occur when the flush races against concurrent updates
which are not part of the flush.
* Refactor bioq flushing during a flush. hammer2_vfs_sync now gives the
bioq a window to operate using the flush's TID before the flush actually
starts to flush.
* hammer2_chain_modify() retains the current allocation block if the TID
does not cross a flush boundary.
* chain->bref.mirror_tid is now used to track flush progress and is compared
against core->update_tid to determine when a flush is needed.
* Code cleanups.
Sascha Wildner [Thu, 31 Oct 2013 18:21:09 +0000 (19:21 +0100)]
shutdown.8: Actually, start a new paragraph for poweroff's description.
Sascha Wildner [Thu, 31 Oct 2013 17:38:01 +0000 (18:38 +0100)]
shutdown.8: Remove an empty line.
Sascha Wildner [Thu, 31 Oct 2013 14:04:38 +0000 (15:04 +0100)]
Mention KTR_IF_POLL in LINT and the ktr(4) manual page.
Sascha Wildner [Wed, 30 Oct 2013 20:46:08 +0000 (21:46 +0100)]
<sys/msgport.h>: Extend the #ifdef _KERNEL to cover the lwkt_* protos too.
This unbreaks buildworld after the previous commit to <sys/msgport.h>.
Reported-by: Ed Berger <edwberger@gmail.com>
Sascha Wildner [Wed, 30 Oct 2013 17:57:15 +0000 (18:57 +0100)]
bsd-family-tree: Sync with FreeBSD.
Sepherosa Ziehau [Wed, 30 Oct 2013 13:50:55 +0000 (21:50 +0800)]
msgport: Add putport_oncpu; helps scheduling netisr locally for spin port
Background:
High rate (actually same rate as polling(4)) IPIs on random CPUs are
observed when polling(4) is enabled and there is virtually no network
activity.
After polling(4) activities are traced using ktr(9), it turns out that the
high rate IPIs are actually from the wakeup() on netisr's msgport. Since
the sleep queue cpumask is indexed by the hash of ident, there are chances
that the netisr's msgport ident has the same hash value as other idents
that certain threads on other CPUs are waiting on. If this ever happens
(well, it does happen), the netisr's msgport wakeup will trigger "wakeup"
IPIs to other CPUs. However, these "wakeup" IPIs are actually useless,
since only netisr will wait on its msgport.
putport_oncpu() msgport method is added to call wakeup_mycpu() for spin
msgport, if we know that this port is only accessed by one thread on the
current CPU, e.g. polling(4). This is also the case for other network
code, e.g. syncache timeout, TCP timeout, fastforward flow cache timeout
etc. However, these network code's running rate is too low to unveil the
extra "wakeup" IPIs problem. lwkt_sendmsg_oncpu() is added as wrapper to
putport_oncpu() msgport method.
Currently, only polling(4) is using lwkt_sendmsg_oncpu(). Others will
be converted soon.
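The collision described above can be modelled with a toy table. Everything in this sketch is an assumption made for illustration (the table size, the hash, and all function names); it is not the kernel's sleep queue code. It only shows why a wakeup on one ident can target CPUs that sleep on an unrelated ident with the same hash, and why a wakeup restricted to the current CPU cannot.

```c
#include <assert.h>
#include <stdint.h>

#define SLPQ_HSIZE 8u	/* hypothetical; tiny so collisions are easy to see */

/* Bit n set: some thread on cpu n sleeps on an ident hashing to this slot. */
static uint32_t slpq_cpumask[SLPQ_HSIZE];

static unsigned int
slpq_hash(uintptr_t ident)
{
	return (unsigned int)((ident >> 4) % SLPQ_HSIZE);
}

static void
note_sleeper(uintptr_t ident, unsigned int cpu)
{
	assert(cpu < 32);
	slpq_cpumask[slpq_hash(ident)] |= UINT32_C(1) << cpu;
}

/* CPUs a global wakeup would have to IPI: every other cpu in the slot. */
static uint32_t
wakeup_ipi_targets(uintptr_t ident, unsigned int mycpu)
{
	return slpq_cpumask[slpq_hash(ident)] & ~(UINT32_C(1) << mycpu);
}

/* A local-only wakeup never leaves the current cpu, so it sends no IPIs. */
static uint32_t
wakeup_mycpu_ipi_targets(uintptr_t ident, unsigned int mycpu)
{
	(void)ident;
	(void)mycpu;
	return 0;
}
```

In this model, a sleeper on cpu 3 waiting on ident 0x90 shares a slot with ident 0x10, so a global wakeup of 0x10 from cpu 0 targets cpu 3 for nothing.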
Sepherosa Ziehau [Tue, 29 Oct 2013 14:10:14 +0000 (22:10 +0800)]
msgport: Merge several sendmsg functions
sendmsg_stage1 and sendmsg_stage2 are actually copy and paste of part of
sendmsg. Make the functionality inline and let sendmsg call them
sequentially.
While I am here, rename "stage1" to "prepare" and "stage2" to "start"
Sepherosa Ziehau [Tue, 29 Oct 2013 13:58:14 +0000 (21:58 +0800)]
polling: Add preliminary KTR support
Sascha Wildner [Tue, 29 Oct 2013 20:16:52 +0000 (21:16 +0100)]
ipsec_strerror.3: Ansify prototype.
Sascha Wildner [Tue, 29 Oct 2013 17:51:23 +0000 (18:51 +0100)]
Fix three typos I made in the locale manpages.
Franco Fichtner [Mon, 28 Oct 2013 21:04:43 +0000 (22:04 +0100)]
mdocml: tweak mandocdb(8) database creation
Since the current apropos(1) has a few issues, it's desirable to
switch to mdocml's version. In order to do this, tweak the handling
of MLINKS. While there, be a bit nostalgic about apropos(1) output.
Franco Fichtner [Mon, 28 Oct 2013 18:24:28 +0000 (19:24 +0100)]
man.conf: correctly point to database file
François Tigeot [Mon, 28 Oct 2013 15:16:08 +0000 (16:16 +0100)]
drm: Fix warnings introduced by recent commits
François Tigeot [Sun, 27 Oct 2013 14:52:48 +0000 (15:52 +0100)]
drm: Import linux/types.h from the FreeBSD OFED stack
François Tigeot [Sun, 27 Oct 2013 14:53:30 +0000 (15:53 +0100)]
drm: Import asm/types.h from the FreeBSD OFED stack
François Tigeot [Sun, 27 Oct 2013 09:02:44 +0000 (10:02 +0100)]
drm/i915: Put i915_drm.h into include/
Split the header in two separate files, like it is done in Linux 3.8.
Matthew Dillon [Sun, 27 Oct 2013 05:09:28 +0000 (22:09 -0700)]
hammer2 - Fix misc bugs
* Move the live_zero optimization from hammer2_chain to
hammer2_chain_core. It is only applicable to the core
and delete-duplicate operations can mess up the cache.
* Move the HAMMER2_CHAIN_COUNTEDBREFS flag to HAMMER2_CORE_COUNTEDBREFS.
It is only applicable to the core and delete-duplication operations
can really mess up calculations of live_count otherwise.
* Don't bump live_count if inserting a deleted chain.
* The vp in the hammer2_sync_scan2() is not locked on purpose. Use the
synclist token interlock to safely ref the hammer2_inode before
potentially blocking, otherwise it can get ripped out from under us.
Matthew Dillon [Sun, 27 Oct 2013 05:06:17 +0000 (22:06 -0700)]
tmpfs - Fix SMP race
* Hold the node lock in order to safely indirect through
node->tn_dir.tn_parent.
Sepherosa Ziehau [Sun, 27 Oct 2013 04:22:00 +0000 (12:22 +0800)]
mxge.4: Update according to recent changes
Matthew Dillon [Sat, 26 Oct 2013 23:38:03 +0000 (16:38 -0700)]
hammer2 - add fifo/dev support, bug fixes
* Add vops for fifo, blk, and chr devices
* Fix bug in hammer2_chain_insert() - allow insertion races to push a new
layer in all cases except when requested not to.
* Fix bug in hammer2_chain_duplicate() - Must call hammer2_chain_create()
instead of hammer2_chain_insert() in case the duplication target needs
indirect blocks.
Sepherosa Ziehau [Sat, 26 Oct 2013 14:13:32 +0000 (22:13 +0800)]
mxge: Log TX and RX descriptor count
Sepherosa Ziehau [Sat, 26 Oct 2013 13:47:56 +0000 (21:47 +0800)]
mxge: Avoid stopping TX engine if there are more packets pending on ifq
Sepherosa Ziehau [Sat, 26 Oct 2013 13:37:51 +0000 (21:37 +0800)]
mxge: Remove unused code
Sepherosa Ziehau [Sat, 26 Oct 2013 13:33:21 +0000 (21:33 +0800)]
mxge: Enable multiple RX queues and TX queues by default
Sepherosa Ziehau [Sat, 26 Oct 2013 12:38:01 +0000 (20:38 +0800)]
mxge: Use chip private input hash instead of standard RSS input hash
This restores RX performance when multiple RX queues are enabled.
If the standard RSS input hash is used, the chip's firmware will compute
the RSS hash (*); this kinda explains why using standard RSS input hash
hurts RX performance greatly.
Sysctl hw.mxgeX.use_rss is added to turn on standard RSS input hash.
It is disabled by default. It is also controlled globally by tunable
hw.mxge.use_rss.
(*) Thanks to Andrew Gallatin <gallatin@FreeBSD.org> for giving the hint.
Sepherosa Ziehau [Sat, 26 Oct 2013 10:48:02 +0000 (18:48 +0800)]
mxge: Implement multiple TX queue support
To enable multiple TX queues, you will need to set tunable
hw.mxge.num_slices or hw.mxgeX.num_slices to 0 or any value >1.
If you want to enable multiple RX queues but only one TX queue,
then in addition to the above tunables, tunable hw.mxge.multi_tx should
be set to 0.
Sascha Wildner [Sat, 26 Oct 2013 10:27:32 +0000 (12:27 +0200)]
nrelease: Fix two pkgsrc references which I had forgotten in
c36c5990ecc.
Sascha Wildner [Sat, 26 Oct 2013 10:06:38 +0000 (12:06 +0200)]
find(1): Fix locate database updating, again. :)
This commit re-applies
83c5db2eae3d86.
I should have been more clear in its commit message that locate.updatedb
was not failing with an error if this local change wasn't kept, but that
the database was incomplete and of a much smaller size.
When testing find(1) after an upgrade, a good general rule is that
/var/db/locate.database needs to have about the same size after the
upgrade as it had before the upgrade.
François Tigeot [Sat, 26 Oct 2013 07:35:19 +0000 (09:35 +0200)]
drm/i915: Remove unused file
Sepherosa Ziehau [Wed, 23 Oct 2013 13:25:37 +0000 (21:25 +0800)]
mxge: Implement MSI-X support; multiple RX rings could be enabled
One thing to note is the interrupt moderation when MSI-X is
enabled. On the PCIE-8AL-C, it looks like the interrupt rate
set on the chip is the total interrupt rate, NOT the per MSI-X vector
interrupt rate. E.g. given that the interrupt rate is set to 8000 and 8
MSI-X vectors are allocated: if two MSI-X vectors are active, then
the interrupt rate for each MSI-X vector will be ~4000. If all
MSI-X vectors are active, then the interrupt rate for each MSI-X
vector will be ~1000. This kind of interrupt moderation for
MSI-X is very unfriendly ...
MSI-X is not enabled by default yet. You could set tunable
hw.mxge.num_slices or hw.mxgeX.num_slices to 0 or any value greater
than 1 to enable MSI-X.
RSS key is not properly set up yet.
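The arithmetic behind the interrupt moderation remark can be written down as a one-line helper. The function name is hypothetical, and the even split is an approximation based on the observation in this commit message, not on chip documentation.

```c
#include <assert.h>

/*
 * Approximate per-vector interrupt rate when the configured rate is a
 * total budget shared by all active MSI-X vectors.
 */
static unsigned int
per_vector_intr_rate(unsigned int total_rate, unsigned int active_vectors)
{
	assert(active_vectors > 0);
	return total_rate / active_vectors;
}
```

With a configured rate of 8000, two active vectors get about 4000 each and eight active vectors get about 1000 each.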
Robin Hahling [Fri, 25 Oct 2013 10:15:37 +0000 (12:15 +0200)]
Fix memory leak in df(1)
Fix a memory leak in makenetvfslist which would occur when a previous
call to strdup fails and the function returns on error.
The simple fix is a call to free(3) to free memory allocated to listptr
before returning.
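The class of leak fixed here can be sketched as follows. This is not the df(1) source: dup_list() and free_list() are hypothetical helpers that only show the pattern of releasing the partially built list when a later strdup(3) fails.

```c
#define _POSIX_C_SOURCE 200809L	/* for strdup() under strict -std modes */

#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Duplicate `count` strings into a newly allocated NULL-terminated array.
 * On any allocation failure, release everything allocated so far
 * (including the array itself) before returning NULL.
 */
static char **
dup_list(const char *const *names, size_t count)
{
	char **listptr;
	size_t i;

	assert(names != NULL);
	listptr = calloc(count + 1, sizeof(*listptr));
	if (listptr == NULL)
		return NULL;
	for (i = 0; i < count; i++) {
		listptr[i] = strdup(names[i]);
		if (listptr[i] == NULL) {
			while (i > 0)
				free(listptr[--i]);
			free(listptr);	/* the step a leaky error path forgets */
			return NULL;
		}
	}
	return listptr;
}

/* Release a list returned by dup_list(). */
static void
free_list(char **listptr)
{
	size_t i;

	if (listptr == NULL)
		return;
	for (i = 0; listptr[i] != NULL; i++)
		free(listptr[i]);
	free(listptr);
}
```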
Robin Hahling [Fri, 25 Oct 2013 08:32:36 +0000 (10:32 +0200)]
Add -T option to df(1).
Add -T option to df(1) as found in Linux and FreeBSD df(1).
When given, file system type will be included in df(1) output.
This has been adapted from FreeBSD df(1).
Robin Hahling [Fri, 25 Oct 2013 06:54:23 +0000 (08:54 +0200)]
df -hi prints inodes count "human-readable"
Enable "human-readable" printing of inodes count when df(1)
is called with both -h and -i flags. This is similar to what can be
found on FreeBSD df(1) or GNU df(1).
The code has been adapted from FreeBSD's df(1) and the manpage updated
accordingly.
Example output:
Now:
% df -hi
Filesystem Size Used Avail Capacity iused ifree %iused Mounted on
[...]
/dev/serno/VB6cbedbd6-
0a1f16ee.s1a 756M 302M 393M 43% 949 96k 1% /boot
[...]
Before:
% df -hi
Filesystem Size Used Avail Capacity iused ifree %iused Mounted on
[...]
/dev/serno/VB6cbedbd6-
0a1f16ee.s1a 756M 302M 393M 43% 949 96329 1% /boot
[...]
Robin Hahling [Fri, 25 Oct 2013 06:44:59 +0000 (08:44 +0200)]
declare functions local to df module as static
This makes it compliant with style(9).
Antonio Huete Jimenez [Fri, 25 Oct 2013 08:48:52 +0000 (01:48 -0700)]
A working IPSEC implementation (1/many)
* Fix fast_ipsec(4) so that it at least builds.
* Untested and probably not working.
Reported-by: Thomas Nikolajsen
Dragonfly-bug: <http://bugs.dragonflybsd.org/issues/1843>
Sascha Wildner [Fri, 25 Oct 2013 17:38:50 +0000 (19:38 +0200)]
kernel/vmm: Rename struct guest_options to vmm_guest_options.
Sascha Wildner [Fri, 25 Oct 2013 14:58:15 +0000 (16:58 +0200)]
kernel: Some cleanup in ext2fs and linux emulation after recent work.
François Tigeot [Fri, 25 Oct 2013 08:16:22 +0000 (10:16 +0200)]
kernel: Rename idr.c to linux_idr.c
This should make it more obvious this file is implementing a Linux API.
Requested-by: swildner
François Tigeot [Fri, 25 Oct 2013 06:47:53 +0000 (08:47 +0200)]
idr: Fix idr_get_new() and idr_get_new_above() return values
These functions are supposed to return -EAGAIN and not EAGAIN when
no more descriptors are available.
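The sign matters because code written against the Linux API tests for negative return values. A minimal sketch of the convention, using a hypothetical demo allocator rather than the real idr implementation:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DEMO_MAX_IDS 4	/* hypothetical tiny pool */

struct demo_idr {
	int next;	/* next unused id */
};

/* Linux-style return value: 0 on success, negative errno on failure. */
static int
demo_get_new(struct demo_idr *idp, int *id)
{
	assert(idp != NULL && id != NULL);
	if (idp->next >= DEMO_MAX_IDS)
		return -EAGAIN;
	*id = idp->next++;
	return 0;
}
```

A caller that checks `ret < 0` or `ret == -EAGAIN` would treat a positive EAGAIN as something other than the expected error, which is why the sign had to be fixed.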
Matthew Dillon [Fri, 25 Oct 2013 04:54:48 +0000 (21:54 -0700)]
kernel - Remove proc_token, replace proc, pgrp, and session structure backend
* Isolate the remaining exposed topology for proc, pgrp, and session
into one source file (kern_proc.c).
* Remove allproc, zombproc, pgrp's spinlocks, and start tracking session
structures so we don't have to indirect through other system structures.
* Replace with arrays-of-lists, 1024 elements, including a 1024 element
token lock array to protect each list.
proc_tokens[1024]
allprocs[1024]
allpgrps[1024]
allsessn[1024]
This removes nearly all the prior proc_token contention and also removes
process-group processing contention and makes it easier to track tty
sessions.
* Normal process, Zombie processes, the original linear list, and the
original hash mechanic are now all combined into a single allprocs[]
table. The various API functions will filter out zombie vs non-zombie
based on the type of request.
* Rewrite the PID allocator to take advantage of the hashed array topology.
An atomic_fetchadd_int() is used on the static base value which will cause
each cpu to start at a different array entry, thus removing SMP conflicts.
At the moment we iterate the relatively small number of elements in the
bucket to find a free pid.
Since the same proc_tokens[n] lock applies to all three arrays (proc,
pgrp, and session), we can validate the pid against all three at the
same time with a single lock.
* Rewrite the procs sysctl to iterate the hash table. Since there are
1024 different locks, a 'ps' or similar operation no longer has any
significant effect on system performance, and 'ps' is VERY fast now
regardless of the load.
* poudriere bulk build tests on a blade (4 core / 8 thread) show virtually
no SMP collisions even under extreme loads.
* poudriere bulk build tests on monster (48-core opteron) show very low
SMP collision statistics outside of filesystem writes in most situations.
Pipes (which are already fine-grained) sometimes show significant
collisions.
Most importantly, NO collisions on the process fork/exec/exit critical
path, end-to-end. Not even in the VM system.
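The arrays-of-lists layout can be pictured with the user-space sketch below. It borrows the array names from this commit message, but the structure fields, the `pid & mask` hash, and the spin flags standing in for tokens are all assumptions made for illustration; the kernel code is different.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define ALLPROC_HSIZE	1024
#define ALLPROC_HMASK	(ALLPROC_HSIZE - 1)

struct proc {
	int pid;
	bool zombie;
	struct proc *next;	/* next entry in the same bucket */
};

static struct proc *allprocs[ALLPROC_HSIZE];
static atomic_bool proc_tokens[ALLPROC_HSIZE];	/* stand-in for tokens */

static void
bucket_lock(int n)
{
	while (atomic_exchange(&proc_tokens[n], true))
		continue;	/* spin until the previous holder releases */
}

static void
bucket_unlock(int n)
{
	atomic_store(&proc_tokens[n], false);
}

/* Only the bucket owning this pid is locked; other buckets stay free. */
static void
proc_insert(struct proc *p)
{
	int n;

	assert(p != NULL);
	n = p->pid & ALLPROC_HMASK;
	bucket_lock(n);
	p->next = allprocs[n];
	allprocs[n] = p;
	bucket_unlock(n);
}

/* Look up a pid, filtering zombie vs non-zombie by request type. */
static struct proc *
proc_find(int pid, bool want_zombie)
{
	int n = pid & ALLPROC_HMASK;
	struct proc *p;

	bucket_lock(n);
	for (p = allprocs[n]; p != NULL; p = p->next) {
		if (p->pid == pid && p->zombie == want_zombie)
			break;
	}
	bucket_unlock(n);
	return p;
}
```

Two operations contend only when their pids fall into the same bucket, which is what removes most of the contention a single global lock would cause.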
Matthew Dillon [Fri, 25 Oct 2013 01:51:24 +0000 (18:51 -0700)]
kernel - proc_token removal pass stage 1/2
* Remove proc_token use from all subsystems except kern/kern_proc.c.
* The token had become mostly useless in these subsystems now that process
locking is more fine-grained. Do the final wipe of proc_token except for
allproc/zombproc list use in kern_proc.c
Matthew Dillon [Fri, 25 Oct 2013 00:01:28 +0000 (17:01 -0700)]
kernel - Replace global vmobj_token with vmobj_tokens[] array
* Remove one of the two remaining major bottlenecks in the system, the
global vmobj_token which is used to manage access to the vm_object_list.
All VM object creation and deletion would get thrown into this list.
* Replace it with an array of 64 tokens and an array of 64 lists,
vmobj_tokens[] and vm_object_lists[]. Use a simple right-shift
hash code to index the array.
* This reduces contention by a factor of 64 or so which makes a big
difference on multi-chip cpu systems. It won't be as noticeable on
single-chip (e.g. 4-core/8-thread) systems.
* Rip-out some of the linux vmstats compat functions which were iterating
the object list and replace with the pcpu accumulator scan that was
recently implemented for dragonfly vmstats.
* TODO: proc_token.
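A right-shift hash over an object address might look like the sketch below. The shift amount and the function name are assumptions; only the table size of 64 comes from the commit message.

```c
#include <assert.h>
#include <stdint.h>

#define VMOBJ_HSIZE	64
#define VMOBJ_HSHIFT	8	/* assumed: skip low bits shared by neighbours */

static_assert((VMOBJ_HSIZE & (VMOBJ_HSIZE - 1)) == 0,
    "table size must be a power of two for the mask to work");

/* Map an object address to one of the 64 token/list slots. */
static unsigned int
vmobj_hash(const void *obj)
{
	return (unsigned int)(((uintptr_t)obj >> VMOBJ_HSHIFT) &
	    (VMOBJ_HSIZE - 1));
}
```

Shifting first discards the low address bits, which tend to be identical for aligned allocations and would otherwise pile objects into a few slots.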
Matthew Dillon [Thu, 24 Oct 2013 20:39:11 +0000 (13:39 -0700)]
kernel - Remove last exclusive vnode vm_object lock from the critical path
* Remove the last exclusive vnode vm_object lock from the critical path.
* Path gets hit on exit, but it matters a lot if one is fork/exec'ing a
lot of binaries. For example, builds which fork/exec huge numbers of
concurrent /bin/sh's, compilers, and other programs.
* vfork/exec rate on blade server, 10000 x 8-threads (80000 total) reduced
from 10.6 seconds to 3.8 seconds, for a major 2.7x improvement in
performance.
Matthew Dillon [Thu, 24 Oct 2013 19:03:23 +0000 (12:03 -0700)]
kernel - Improve vfork/exec and wait*() performance
* Use a flags interlock instead of a token interlock for PPWAIT for
vfork/exec (handling when the parent must wait for the child to
finish exec'ing).
* The exit1() code must wakeup the parent's wait*()'s. Delay the wakeup
until after the token has been released.
* Change the interlock in the parent's wait*() code to use a generation
counter.
* Do not wakeup p_nthreads on exit if the program was never multi-threaded,
saving a few cycles.
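A generation-counter interlock of the kind mentioned above can be sketched like this. All names are hypothetical and the sleep/wakeup calls are left out; the point is only that the waiter samples the counter before scanning its children and rescans if the counter moved, so an exit that lands between the scan and the sleep is not lost.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_parent {
	atomic_uint wait_gen;	/* bumped every time a child exits */
};

/* Waiter: sample the generation before scanning the child list. */
static unsigned int
wait_begin(struct demo_parent *p)
{
	assert(p != NULL);
	return atomic_load(&p->wait_gen);
}

/* Exiting child: bump the generation, then wake the parent (not shown). */
static void
child_exited(struct demo_parent *p)
{
	assert(p != NULL);
	atomic_fetch_add(&p->wait_gen, 1);
}

/* Waiter: before sleeping, check whether anything changed since the scan. */
static bool
must_rescan(struct demo_parent *p, unsigned int gen)
{
	assert(p != NULL);
	return atomic_load(&p->wait_gen) != gen;
}
```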
Matthew Dillon [Thu, 24 Oct 2013 18:26:23 +0000 (11:26 -0700)]
pstat - sync w/kernel
* Remove flags no longer used by the kernel
Sascha Wildner [Thu, 24 Oct 2013 13:43:32 +0000 (15:43 +0200)]
Remove no longer used <sys/localedef.h>.
Matthew Dillon [Thu, 24 Oct 2013 06:53:16 +0000 (23:53 -0700)]
kernel - more SMP optimizations in the VM system
* imgact_elf - drop the vm_object a little earlier in load_section(),
and use a shared object lock when iterating ELF segments.
* When starting a vforked process use a shared process token to
interlock the wait loop instead of an exclusive token. Also don't
bother with the token if there's nothing to wait for.
* When forking, pre-assign lp2 thread's td_ucred.
* Remove the vp->v_object load check loop. It should not be possible
for vp->v_object to change after being assigned as long as the vp
is referenced.
* Replace most OBJ_DEAD tests with assertions that the flag is not set.
* Remove the VOLOCK/VOWANT vnode interlock. It shouldn't be possible
for the vnode's object to change while the vnode is ref'd. This was
a leftover from a long-ago time when vnodes were more persistent and
could be recycled and race accessors.
This also removes vm_object_dead_sleep/wait and related code.
* When memory mapping a vnode object there is no need to formally
hold and chain_wait the object. We can simply add a ref to it,
because vnode objects cannot have backing chains.
* When deallocating a vm_object we can shortcut counts greater than 1
for OBJT_VNODE objects instead of counts greater than 3.
* Optimize vnode_pager_alloc(), avoiding unnecessary locks. Keep the
temporary vnode token for the moment.
* Optimize vnode_pager_reference(), removing all locks from the path.
Sascha Wildner [Wed, 23 Oct 2013 18:33:20 +0000 (20:33 +0200)]
Add some missing MLINKS for the locale manual pages (per ctype_l.3).
While here, fix NAME and SEE ALSO sections.
Sascha Wildner [Wed, 23 Oct 2013 18:32:09 +0000 (20:32 +0200)]
libhammer_stats.3: Add a missing MLINK and remove unnecessary quotes.
Matthew Dillon [Wed, 23 Oct 2013 18:01:30 +0000 (11:01 -0700)]
tmpfs - Fix readdir race
* The original cookie cache does not play nice with shared locks.
Replace it with a second RB tree, indexed by cookie.
Matthew Dillon [Wed, 23 Oct 2013 16:41:22 +0000 (09:41 -0700)]
tmpfs - Fix deadlock
* Fix deadlock introduced in recent commits. Do not recurse shared locks
on tmpfs nodes as this can deadlock against an exclusive requester.
Matthew Dillon [Wed, 23 Oct 2013 16:27:41 +0000 (09:27 -0700)]
kernel - proc_token performance cleanups
* pfind()/pfindn()/zpfind() now acquire proc_token shared.
* Fix a bug in alllwp_scan(). Must hold p->p_token while scanning
its lwp's.
* Process list scan can use a shared token, use pfind() instead of
pfindn() and remove proc_token for individual pid lookups.
* cwd can use a shared p->p_token.
* getgroups(), seteuid(), and numerous other uid/gid access and setting
functions need to use p->p_token, not proc_token (Reported by enjolras).
Matthew Dillon [Wed, 23 Oct 2013 15:53:45 +0000 (08:53 -0700)]
shutdown - Add poweroff link and code
* Adds /sbin/poweroff (hardlink to shutdown), and recognize it. This
does basically the same thing as 'halt -p'.
Taken-from: FreeBSD
Submitted-by: robin@tsf-444-wpa-2-005.epfl.ch (Robin Hahling)
Matthew Dillon [Wed, 23 Oct 2013 15:12:29 +0000 (08:12 -0700)]
tmpfs - remove most mp->mnt_token cases, kqueue filterops are MPSAFE (2)
* Fix bug introduced in tmpfs_nrename(). The RB tree removal was not being
guarded by the appropriate node lock.
* Also assert that the directory entry is still present.
Antonio Huete Jimenez [Wed, 23 Oct 2013 12:17:02 +0000 (05:17 -0700)]
ipsec - Add missing reference when so_pcb is attached.
* This fixes a panic on disconnection/detaching in the raw socket.
* Fix inspired by rts_attach()
Reported-by: David BERARD
Dragonfly-bug: <http://bugs.dragonflybsd.org/issues/1848>
Sascha Wildner [Wed, 23 Oct 2013 11:58:51 +0000 (13:58 +0200)]
Add missing whitespace in some manual pages.
Matthew Dillon [Wed, 23 Oct 2013 07:32:16 +0000 (00:32 -0700)]
tmpfs - remove most mp->mnt_token cases, kqueue filterops are MPSAFE
* tmpfs's kqueue filterops are MPSAFE, set appropriate flag.
* tmpfs's vnops frontend universally obtained the tmpfs mnt_token, but
most of tmpfs's underlying code was already sub-locked by node.
Remove most mnt_token use cases and make the portions that were not
safe, safe. This was primarily the directory lookup and scanning
code and code to create, delete, and rename files.
* Should greatly improve tmpfs concurrency.
Matthew Dillon [Wed, 23 Oct 2013 07:31:32 +0000 (00:31 -0700)]
hammer - kqueue filterops are MPSAFE
* Hammer's kqueue filterops are MPSAFE, set appropriate flag.
Matthew Dillon [Wed, 23 Oct 2013 07:26:07 +0000 (00:26 -0700)]
kernel - general cleanup and mplock removal
* General cleanup and remove use of the mplock in multiple non-critical
functions.
* Might slightly improve performance if programs run uname(),
gethostname(), or getdomainname() a lot.
Matthew Dillon [Wed, 23 Oct 2013 01:53:15 +0000 (18:53 -0700)]
kernel - Remove debugging kprintf() from procfs
* Remove a procfs console warning that is no longer applicable. The
procfs filesystem topology is never quite in sync with reality and
races between lookups and existing processes are to be expected.
Sascha Wildner [Tue, 22 Oct 2013 16:40:33 +0000 (18:40 +0200)]
Sort SEE ALSO in some manual pages.
Matthew Dillon [Tue, 22 Oct 2013 16:23:48 +0000 (09:23 -0700)]
kernel - Cleanup vfs_lock & ref-count states (2)
* Adjust trigger points such that under normal operation vnlru_proc()
handles cleaning up extra vnodes. If this is not sufficient then
the synchronous cleanup code will kick in at higher levels.
* Adjust vnode->v_act handling and try to take into account vnodes
with large memory objects (which we would rather reclaim later and
not sooner). This takes over functionality from vlrureclaim().
* Remove the vlrureclaim() mount-scanning infrastructure. vnlru_proc()
now just calls freesomevnodes(). This should now be sufficient. This
removes significant locking overheads during steady-state operation.
Sepherosa Ziehau [Tue, 22 Oct 2013 12:45:50 +0000 (20:45 +0800)]
mxge: Record RX slot count instead of its size
Original code assumes that the total size of RX slots is same as one RX
descriptor ring size. This assumption could easily be broken if we ask
chip to deliver RSS hash (RX slot size will be changed from 4 bytes to 8
bytes). RX slot count is recorded now.
Sepherosa Ziehau [Tue, 22 Oct 2013 12:09:09 +0000 (20:09 +0800)]
test: Test commit from orb
Sepherosa Ziehau [Tue, 22 Oct 2013 12:03:20 +0000 (20:03 +0800)]
mxge: Make sure RX data size is cache line size aligned
Currently even without the __cachealign, RX data struct size is properly
aligned on 2 cache line size. Add __cachealign, so that even if some
debugging fields are added, RX data struct size still will be cache line
size aligned.
Matthew Dillon [Tue, 22 Oct 2013 06:51:49 +0000 (23:51 -0700)]
kernel - Cleanup vfs_lock & ref-count states
* Clean up vp->v_state state transitions
* Fix bugs in the cachedvnodes counter tracking. v_refcnt has to
be masked against VREF_MASK to detect non-zero->0 and 0->non-zero
transitions properly.
* Clear VREF_FINALIZE when reactivating a vnode in vget().
* vhold()/vdrop() no longer prevent a vnode from being moved to the
vinactive list. They simply prevent reclamation.
* Adjust the vnlru trigger points a bit.
* When cleaning, leave the vnode on the inactive list until we determine
we can destroy it. Add a ref instead of using the VREF_TERMINATE
placeholding ref (since the vnode is still on the list).
* Implement vnode->v_act and remove the inactive mid-point stuff. The
approach now is that vnodes are selectively moved from the active list to
the inactive list as needed. Inactive vnodes are then cleaned up in order.
* Adjust hysteresis so that vnlru has a better chance of handling the
vnode garbage collection before we force it to be done synchronously
in userexit.
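A sketch of the masking point made in the cachedvnodes bullet above, written with C11 atomics instead of the kernel's atomic primitives. The bit values and the SK_ and sk_ names are assumptions for illustration and do not reproduce the kernel's definitions. The idea is that the ref-count word also carries flag bits, so the raw word can be non-zero while the masked count is zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SK_VREF_FINALIZE 0x40000000u  /* assumed flag bit in the word */
#define SK_VREF_MASK     0x3FFFFFFFu  /* assumed bits holding the count */

/* Add a ref; true if the masked count went 0 -> non-zero. */
static bool
sk_ref(atomic_uint *refcnt)
{
    unsigned int old = atomic_fetch_add(refcnt, 1);

    return (old & SK_VREF_MASK) == 0;
}

/* Drop a ref; true if the masked count went non-zero -> 0. */
static bool
sk_rel(atomic_uint *refcnt)
{
    unsigned int old = atomic_fetch_sub(refcnt, 1);

    assert((old & SK_VREF_MASK) != 0);  /* must hold a ref to drop one */
    return (old & SK_VREF_MASK) == 1;
}
```

A counter of cached vnodes would be adjusted only when one of these functions returns true. Comparing the unmasked word against zero would miss the transition whenever a flag bit is set.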
Matthew Dillon [Tue, 22 Oct 2013 06:48:49 +0000 (23:48 -0700)]
kernel - Fix hammer flush-during-reclaim bug
* hammer was improperly using vn_unlock/vn_lock to temporarily unlock a
vnode. If this is done during a reclaim vn_lock() will fail, resulting
in a locking mismatch.
* Use vx_unlock/vx_lock instead, and also check for the reclaim by checking
the VRECLAIMED flag rather than the VINACTIVE flag. VINACTIVE might not
yet be set.
Matthew Dillon [Tue, 22 Oct 2013 00:02:31 +0000 (17:02 -0700)]
test - Adjust vnodeinfo for recent kernel changes
* Adjust vnodeinfo for recent kernel changes
Sascha Wildner [Tue, 22 Oct 2013 06:23:47 +0000 (08:23 +0200)]
Fix typos in messages and manual pages.
Matthew Dillon [Mon, 21 Oct 2013 19:53:09 +0000 (12:53 -0700)]
buildworld - Adjust for recent commits
* Adjust for recent commits (VFREE no longer exists)
François Tigeot [Mon, 21 Oct 2013 18:58:22 +0000 (20:58 +0200)]
kernel: Fix sys/mqueue.h includes
Matthew Dillon [Mon, 21 Oct 2013 17:59:40 +0000 (10:59 -0700)]
kernel - Rewrite lockmgr / struct lock
* Rewrite lockmgr() to remove the exclusive spinlock used internally
to guard operations.
* Retain existing API and operational semantics. This is primarily:
- Acquiring a LK_SHARED lock on a lock the caller already owns
exclusively simply bumps the count and retains the exclusive
nature of the lock.
- Exclusive requests and upgrade requests have priority over shared
locks even if the lock is currently held shared, unless the thread
is flagged for deadlock treatment.
- Upgrade requests are capable of guaranteeing the upgrade (as before).
This could be further enhanced because we now have the last release
transfer the exclusive lock to the upgrade requestor, but the original
API didn't have a function for this so neither do we. The more
primitive detection method is used (aka LK_SLEEPFAIL and/or
LK_EXCLUPGRADE).
* Reduce multiple tracking fields into one field so we can use
atomic_cmpset_int().
* Hot-path common operations. A single atomic_cmpset_int() gets us
through.
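A sketch of the single-field idea from the last two bullets, using C11 atomic_compare_exchange in place of the kernel's atomic_cmpset_int(). The bit layout and all SK_ and sk_ names are assumptions for illustration. The sketch deliberately omits the recursion, priority and upgrade semantics listed above, as well as blocking, and shows only the uncontended fast path.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SK_LK_EXCL  0x80000000u  /* assumed: lock is held exclusively */
#define SK_LK_COUNT 0x7FFFFFFFu  /* assumed: number of shared holders */

/* Fast path for a shared lock; false means "take the slow path". */
static bool
sk_try_shared(atomic_uint *lk)
{
    unsigned int old = atomic_load(lk);

    while ((old & SK_LK_EXCL) == 0) {
        if (atomic_compare_exchange_weak(lk, &old, old + 1))
            return true;
        /* 'old' was refreshed by the failed exchange; retry. */
    }
    return false;
}

/* Fast path for an exclusive lock: one compare-and-set from idle. */
static bool
sk_try_excl(atomic_uint *lk)
{
    unsigned int expected = 0;

    return atomic_compare_exchange_strong(lk, &expected, SK_LK_EXCL);
}

static void
sk_rel_shared(atomic_uint *lk)
{
    unsigned int old = atomic_fetch_sub(lk, 1);

    assert((old & SK_LK_COUNT) != 0);
}

static void
sk_rel_excl(atomic_uint *lk)
{
    unsigned int old = atomic_fetch_and(lk, ~SK_LK_EXCL);

    assert((old & SK_LK_EXCL) != 0);
}
```

Because the holder count and the exclusive flag share one word, the uncontended case needs no separate guard lock: a single successful compare-and-set both checks and updates the state.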
Matthew Dillon [Mon, 21 Oct 2013 17:17:12 +0000 (10:17 -0700)]
kernel - Fix a SMP race between pageout and exec_new_vmspace()
* Panics on token mismatch due to p->p_vmspace being replaced out
from under a process utilizing p->p_vmspace->vm_map.map_token.
* Fix a SMP race between pageout and exec_new_vmspace(). The pageout
code properly PHOLD()s the process and related process token but
fails to hold p->p_vmspace during a potentially blocking call.
Thus it is still possible to race termination of the vmspace and/or
for the process to replace its vmspace while the pageout activity is
in progress.
* Use vmspace_hold()/vmspace_drop() and reference the vmspace directly
after loading it from p->p_vmspace. The race is allowed, but the vmspace
will no longer be destroyed out from under the pageout and the code
will no longer attempt to release the wrong token.
Sascha Wildner [Sun, 20 Oct 2013 08:04:36 +0000 (10:04 +0200)]
kernel - Rewrite vnode ref-counting code to improve performance
* Rewrite the vnode ref-counting code and modify operation to not
immediately VOP_INACTIVE a vnode when its refs drop to 0. By
doing so we avoid cycling vnodes through exclusive locks when
temporarily accessing them (such as in a path lookup). Shared
locks can be used throughout.
* Track active/inactive vnodes a bit differently, keep track of
the number of vnodes that are still active but have zero refs,
and rewrite the vnode freeing code to use the new statistics
to deactivate cached vnodes.
Sascha Wildner [Mon, 21 Oct 2013 16:13:26 +0000 (18:13 +0200)]
make.1: We use bmake.1, not make.1.