linux.git
7 years agofilesystem-dax: fix broken __dax_zero_page_range() conversion
Dan Williams [Thu, 11 May 2017 02:38:13 +0000 (19:38 -0700)]
filesystem-dax: fix broken __dax_zero_page_range() conversion

The conversion of __dax_zero_page_range() to 'struct dax_operations'
caused it to frequently fail. The mistake was treating the @size
parameter as a dax mapping length rather than just a length of the
clear_pmem() operation. The dax mapping length is assumed to be hard
coded as PAGE_SIZE.

Without this fix any page unaligned zeroing request will trigger a
-EINVAL return from bdev_dax_pgoff().

Cc: Jan Kara <jack@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Fixes: cccbce671582 ("filesystem-dax: convert to dax_direct_access()")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agolibnvdimm, btt: ensure that initializing metadata clears poison
Vishal Verma [Wed, 10 May 2017 21:01:31 +0000 (15:01 -0600)]
libnvdimm, btt: ensure that initializing metadata clears poison

If we had badblocks/poison in the metadata area of a BTT, recreating the
BTT would not clear the poison in all cases, notably the flog area. This
is because rw_bytes will only clear errors if the request being sent
down is 512B aligned and sized.

Make sure that when writing the map and info blocks, the rw_bytes being
sent are of the correct size/alignment. For the flog, instead of doing
the smaller log_entry writes only, first do a 'wipe' of the entire area
by writing zeroes in large enough chunks so that errors get cleared.

Cc: Andy Rudoff <andy.rudoff@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agolibnvdimm: add an atomic vs process context flag to rw_bytes
Vishal Verma [Wed, 10 May 2017 21:01:30 +0000 (15:01 -0600)]
libnvdimm: add an atomic vs process context flag to rw_bytes

nsio_rw_bytes can clear media errors, but this cannot be done while we
are in an atomic context due to locking within ACPI. From the BTT,
->rw_bytes may be called either from atomic or process context depending
on whether the calls happen during initialization or during IO.

During init, we want to ensure error clearing happens, and the flag
marking process context allows nsio_rw_bytes to do that. When called
during IO, we're in atomic context, and error clearing can be skipped.

Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agox86, pmem: Fix cache flushing for iovec write < 8 bytes
Ben Hutchings [Tue, 9 May 2017 17:00:43 +0000 (18:00 +0100)]
x86, pmem: Fix cache flushing for iovec write < 8 bytes

Commit 11e63f6d920d added cache flushing for unaligned writes from an
iovec, covering the first and last cache line of a >= 8 byte write and
the first cache line of a < 8 byte write.  But an unaligned write of
2-7 bytes can still cover two cache lines, so make sure we flush both
in that case.

Cc: <stable@vger.kernel.org>
Fixes: 11e63f6d920d ("x86, pmem: fix broken __copy_user_nocache ...")
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agodevice-dax: kill NR_DEV_DAX
Dan Williams [Mon, 8 May 2017 19:33:53 +0000 (12:33 -0700)]
device-dax: kill NR_DEV_DAX

There is no point to ask how many device-dax instances the kernel should
support. Since we are already using a dynamic major number, just allow
the max number of minors by default and be done. This also fixes the
fact that the proposed max for the NR_DEV_DAX range was larger than what
could be supported by alloc_chrdev_region().

Fixes: ba09c01d2fa8 ("dax: convert to the cdev api")
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agoblock, dax: move "select DAX" from BLOCK to FS_DAX
Dan Williams [Mon, 8 May 2017 17:55:27 +0000 (10:55 -0700)]
block, dax: move "select DAX" from BLOCK to FS_DAX

For configurations that do not enable DAX filesystems or drivers, do not
require the DAX core to be built.

Given that the 'direct_access' method has been removed from
'block_device_operations', we can also go ahead and remove the
block-related dax helper functions from fs/block_dev.c to
drivers/dax/super.c. This keeps dax details out of the block layer and
lets the DAX core be built as a module in the FS_DAX=n case.

Filesystems need to include dax.h to call bdev_dax_supported().

Cc: linux-xfs@vger.kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.com>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agodevice-dax: Tell kbuild DEV_DAX_PMEM depends on DEV_DAX
Mike Galbraith [Sat, 6 May 2017 04:14:43 +0000 (06:14 +0200)]
device-dax: Tell kbuild DEV_DAX_PMEM depends on DEV_DAX

ERROR: "devm_create_dev_dax" [drivers/dax/dax_pmem.ko] undefined!
ERROR: "alloc_dax_region" [drivers/dax/dax_pmem.ko] undefined!
ERROR: "dax_region_put" [drivers/dax/dax_pmem.ko] undefined!

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agoMerge branch 'for-4.12/dax' into libnvdimm-for-next
Dan Williams [Fri, 5 May 2017 06:38:43 +0000 (23:38 -0700)]
Merge branch 'for-4.12/dax' into libnvdimm-for-next

7 years agolibnvdimm, pfn: fix 'npfns' vs section alignment
Dan Williams [Fri, 5 May 2017 02:54:42 +0000 (19:54 -0700)]
libnvdimm, pfn: fix 'npfns' vs section alignment

Fix failures to create namespaces due to the vmem_altmap not advertising
enough free space to store the memmap.

 WARNING: CPU: 15 PID: 8022 at arch/x86/mm/init_64.c:656 arch_add_memory+0xde/0xf0
 [..]
 Call Trace:
  dump_stack+0x63/0x83
  __warn+0xcb/0xf0
  warn_slowpath_null+0x1d/0x20
  arch_add_memory+0xde/0xf0
  devm_memremap_pages+0x244/0x440
  pmem_attach_disk+0x37e/0x490 [nd_pmem]
  nd_pmem_probe+0x7e/0xa0 [nd_pmem]
  nvdimm_bus_probe+0x71/0x120 [libnvdimm]
  driver_probe_device+0x2bb/0x460
  bind_store+0x114/0x160
  drv_attr_store+0x25/0x30

In commit 658922e57b84 "libnvdimm, pfn: fix memmap reservation sizing"
we arranged for the capacity to be allocated, but failed to also update
the 'npfns' parameter. This leads to cases where there is enough
capacity reserved to hold all the allocated sections, but
vmemmap_populate_hugepages() still encounters -ENOMEM from
altmap_alloc_block_buf().

This fix is a stop-gap until we can teach the core memory hotplug
implementation to permit sub-section hotplug.

Cc: <stable@vger.kernel.org>
Fixes: 658922e57b84 ("libnvdimm, pfn: fix memmap reservation sizing")
Reported-by: Anisha Allada <anisha.allada@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agolibnvdimm: handle locked label storage areas
Dan Williams [Thu, 4 May 2017 18:47:22 +0000 (11:47 -0700)]
libnvdimm: handle locked label storage areas

Per the latest version of the "NVDIMM DSM Interface Example" [1], the
label data retrieval routine can report a "locked" status. In this case
all regions associated with that DIMM are disabled until the label area
is unlocked. Provide generic libnvdimm enabling for NVDIMMs with label
data area locking capabilities.

[1]: http://pmem.io/documents/

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agolibnvdimm: convert NDD_ flags to use bitops, introduce NDD_LOCKED
Dan Williams [Thu, 4 May 2017 21:01:24 +0000 (14:01 -0700)]
libnvdimm: convert NDD_ flags to use bitops, introduce NDD_LOCKED

This is a preparation patch for handling locked nvdimm label regions, a
new concept as introduced by the latest DSM document on pmem.io [1]. A
future patch will leverage nvdimm_set_locked() at DIMM probe time to
flag regions that can not be enabled. There should be no functional
difference resulting from this change.

[1]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example-V1.3.pdf

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agobrd: fix uninitialized use of brd->dax_dev
Gerald Schaefer [Wed, 3 May 2017 12:56:02 +0000 (14:56 +0200)]
brd: fix uninitialized use of brd->dax_dev

commit 1647b9b9 "brd: add dax_operations support" introduced the allocation
and freeing of a dax_device, but the allocated dax_device is not stored
into the brd_device, so brd_del_one() will eventually operate on an
uninitialized brd->dax_dev.

Fix this by storing the allocated dax_device to brd->dax_dev.

Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agoblock, dax: use correct format string in bdev_dax_supported
Arnd Bergmann [Thu, 27 Apr 2017 14:30:41 +0000 (16:30 +0200)]
block, dax: use correct format string in bdev_dax_supported

The new message has an incorrect format string, causing a warning in some
configurations:

fs/block_dev.c: In function 'bdev_dax_supported':
fs/block_dev.c:779:5: error: format '%d' expects argument of type 'int', but argument 2 has type 'long int' [-Werror=format=]
     "error: dax access failed (%d)", len);

This changes it to use the correct %ld instead of %d.

Fixes: 2093f2e9dfec ("block, dax: convert bdev_dax_supported() to dax_direct_access()")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agodevice-dax: fix sysfs attribute deadlock
Dan Williams [Sun, 30 Apr 2017 13:57:01 +0000 (06:57 -0700)]
device-dax: fix sysfs attribute deadlock

Usage of device_lock() for dax_region attributes is unnecessary and
deadlock prone. It's unnecessary because the order of registration /
un-registration guarantees that drvdata is always valid. It's deadlock
prone because it sets up this situation:

 ndctl           D    0  2170   2082 0x00000000
 Call Trace:
  __schedule+0x31f/0x980
  schedule+0x3d/0x90
  schedule_preempt_disabled+0x15/0x20
  __mutex_lock+0x402/0x980
  ? __mutex_lock+0x158/0x980
  ? align_show+0x2b/0x80 [dax]
  ? kernfs_seq_start+0x2f/0x90
  mutex_lock_nested+0x1b/0x20
  align_show+0x2b/0x80 [dax]
  dev_attr_show+0x20/0x50

 ndctl           D    0  2186   2079 0x00000000
 Call Trace:
  __schedule+0x31f/0x980
  schedule+0x3d/0x90
  __kernfs_remove+0x1f6/0x340
  ? kernfs_remove_by_name_ns+0x45/0xa0
  ? remove_wait_queue+0x70/0x70
  kernfs_remove_by_name_ns+0x45/0xa0
  remove_files.isra.1+0x35/0x70
  sysfs_remove_group+0x44/0x90
  sysfs_remove_groups+0x2e/0x50
  dax_region_unregister+0x25/0x40 [dax]
  devm_action_release+0xf/0x20
  release_nodes+0x16d/0x2b0
  devres_release_all+0x3c/0x60
  device_release_driver_internal+0x17d/0x220
  device_release_driver+0x12/0x20
  unbind_store+0x112/0x160

ndctl/2170 is trying to acquire the device_lock() to read an attribute,
and ndctl/2186 is holding the device_lock() while trying to drain all
active attribute readers.

Thanks to Yi Zhang for the reproduction script.

Fixes: d7fe1a67f658 ("dax: add region 'id', 'size', and 'align' attributes")
Cc: <stable@vger.kernel.org>
Reported-by: Yi Zhang <yizhan@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agolibnvdimm: restore "libnvdimm: band aid btt vs clear poison locking"
Dan Williams [Mon, 1 May 2017 17:00:02 +0000 (10:00 -0700)]
libnvdimm: restore "libnvdimm: band aid btt vs clear poison locking"

This continues the 4.11 status quo of disabling of error clearing from
the BTT I/O path. Toshi found that even though we have eliminated all
the libnvdimm sources of sleeping-while-atomic triggers, we still have
sleeping operations that will occur in the path to send the ACPI DSM to
the DIMM to clear the error:

 BUG: sleeping function called from invalid context at mm/slab.h:432
 in_atomic(): 1, irqs_disabled(): 0, pid: 13353, name: dd
 Call Trace:
  dump_stack+0x86/0xc3
  ___might_sleep+0x17d/0x250
  __might_sleep+0x4a/0x80
  __kmalloc+0x1c0/0x2e0
  acpi_os_allocate_zeroed+0x2d/0x2f
  acpi_evaluate_object+0x59/0x3b1
  acpi_evaluate_dsm+0xbd/0x10c
  acpi_nfit_ctl+0x1ef/0x7c0 [nfit]
  ? nsio_rw_bytes+0x152/0x280
  nvdimm_clear_poison+0x77/0x140
  nsio_rw_bytes+0x18f/0x280
  btt_write_pg+0x1d4/0x3d0 [nd_btt]
  btt_make_request+0x119/0x2d0 [nd_btt]

A solution for tracking and handling media errors natively in the BTT is
needed.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Reported-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agolibnvdimm: fix nvdimm_bus_lock() vs device_lock() ordering
Dan Williams [Sat, 29 Apr 2017 05:05:14 +0000 (22:05 -0700)]
libnvdimm: fix nvdimm_bus_lock() vs device_lock() ordering

A debug patch to turn the standard device_lock() into something that
lockdep can analyze yielded the following:

 ======================================================
 [ INFO: possible circular locking dependency detected ]
 4.11.0-rc4+ #106 Tainted: G           O
 -------------------------------------------------------
 lt-libndctl/1898 is trying to acquire lock:
  (&dev->nvdimm_mutex/3){+.+.+.}, at: [<ffffffffc023c948>] nd_attach_ndns+0x178/0x1b0 [libnvdimm]

 but task is already holding lock:
  (&nvdimm_bus->reconfig_mutex){+.+.+.}, at: [<ffffffffc022e0b1>] nvdimm_bus_lock+0x21/0x30 [libnvdimm]

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #1 (&nvdimm_bus->reconfig_mutex){+.+.+.}:
        lock_acquire+0xf6/0x1f0
        __mutex_lock+0x88/0x980
        mutex_lock_nested+0x1b/0x20
        nvdimm_bus_lock+0x21/0x30 [libnvdimm]
        nvdimm_namespace_capacity+0x1b/0x40 [libnvdimm]
        nvdimm_namespace_common_probe+0x230/0x510 [libnvdimm]
        nd_pmem_probe+0x14/0x180 [nd_pmem]
        nvdimm_bus_probe+0xa9/0x260 [libnvdimm]

 -> #0 (&dev->nvdimm_mutex/3){+.+.+.}:
        __lock_acquire+0x1107/0x1280
        lock_acquire+0xf6/0x1f0
        __mutex_lock+0x88/0x980
        mutex_lock_nested+0x1b/0x20
        nd_attach_ndns+0x178/0x1b0 [libnvdimm]
        nd_namespace_store+0x308/0x3c0 [libnvdimm]
        namespace_store+0x87/0x220 [libnvdimm]

In this case '&dev->nvdimm_mutex/3' mirrors '&dev->mutex'.

Fix this by replacing the use of device_lock() with nvdimm_bus_lock() to protect
nd_{attach,detach}_ndns() operations.

Cc: <stable@vger.kernel.org>
Fixes: 8c2f7e8658df ("libnvdimm: infrastructure for btt devices")
Reported-by: Yi Zhang <yizhan@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agolibnvdimm: rework region badblocks clearing
Dan Williams [Sat, 29 Apr 2017 22:24:03 +0000 (15:24 -0700)]
libnvdimm: rework region badblocks clearing

Toshi noticed that the new support for a region-level badblocks missed
the case where errors are cleared due to BTT I/O.

An initial attempt to fix this ran into a "sleeping while atomic"
warning due to taking the nvdimm_bus_lock() in the BTT I/O path to
satisfy the locking requirements of __nvdimm_bus_badblocks_clear().
However, that lock is not needed since we are not acting on any data that
is subject to change under that lock. The badblocks instance has its own
internal lock to handle mutations of the error list.

So, in order to make it clear that we are just acting on region devices,
rename __nvdimm_bus_badblocks_clear() to nvdimm_clear_badblocks_regions().
Eliminate the lock and consolidate all support routines for the new
nvdimm_account_cleared_poison() in drivers/nvdimm/bus.c. Finally, to the
opportunity to cleanup to some unnecessary casts, make the calling
convention of nvdimm_clear_badblocks_regions() clearer by replacing struct
resource with the minimal struct clear_badblocks_context, and use the
DEVICE_ATTR macro.

Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Reported-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agoacpi, nfit: kill ACPI_NFIT_DEBUG
Dan Williams [Fri, 28 Apr 2017 20:54:30 +0000 (13:54 -0700)]
acpi, nfit: kill ACPI_NFIT_DEBUG

Inevitably when one actually needs to debug a DSM issue it's on a
distribution kernel that has CONFIG_ACPI_NFIT_DEBUG=n. The config symbol
was only there to avoid the compile error due to the missing fallback for
print_hex_dump_debug in the CONFIG_DYNAMIC_DEBUG=n case. That was fixed
with commit cdf17449af1d "hexdump: do not print debug dumps for
!CONFIG_DEBUG", so the config symbol can just be dropped.

Cc: Joe Perches <joe@perches.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agolibnvdimm: fix clear length of nvdimm_forget_poison()
Toshi Kani [Thu, 27 Apr 2017 22:57:05 +0000 (16:57 -0600)]
libnvdimm: fix clear length of nvdimm_forget_poison()

ND_CMD_CLEAR_ERROR command returns 'clear_err.cleared', the length
of error actually cleared, which may be smaller than its requested
'len'.

Change nvdimm_clear_poison() to call nvdimm_forget_poison() with
'clear_err.cleared' when this value is valid.

Cc: <stable@vger.kernel.org>
Fixes: e046114af5fc ("libnvdimm: clear the internal poison_list when clearing badblocks")
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agolibnvdimm, pmem: fix a NULL pointer BUG in nd_pmem_notify
Toshi Kani [Tue, 25 Apr 2017 23:04:13 +0000 (17:04 -0600)]
libnvdimm, pmem: fix a NULL pointer BUG in nd_pmem_notify

The following BUG was observed when nd_pmem_notify() was called
for a BTT device.  The use of a pmem_device pointer is not valid
with BTT.

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
 IP: nd_pmem_notify+0x30/0xf0 [nd_pmem]
 Call Trace:
  nd_device_notify+0x40/0x50
  child_notify+0x10/0x20
  device_for_each_child+0x50/0x90
  nd_region_notify+0x20/0x30
  nd_device_notify+0x40/0x50
  nvdimm_region_notify+0x27/0x30
  acpi_nfit_scrub+0x341/0x590 [nfit]
  process_one_work+0x197/0x450
  worker_thread+0x4e/0x4a0
  kthread+0x109/0x140

Fix nd_pmem_notify() by setting nd_region and badblocks pointers
properly for BTT.

Cc: <stable@vger.kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Fixes: 719994660c24 ("libnvdimm: async notification support")
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agolibnvdimm, region: sysfs trigger for nvdimm_flush()
Dan Williams [Fri, 21 Apr 2017 20:28:12 +0000 (13:28 -0700)]
libnvdimm, region: sysfs trigger for nvdimm_flush()

The nvdimm_flush() mechanism helps to reduce the impact of an ADR
(asynchronous-dimm-refresh) failure. The ADR mechanism handles flushing
platform WPQ (write-pending-queue) buffers when power is removed. The
nvdimm_flush() mechanism performs that same function on-demand.

When a pmem namespace is associated with a block device, an
nvdimm_flush() is triggered with every block-layer REQ_FUA, or REQ_FLUSH
request. These requests are typically associated with filesystem
metadata updates. However, when a namespace is in device-dax mode,
userspace (think database metadata) needs another path to perform the
same flushing. In other words this is not required to make data
persistent, but in the case of metadata it allows for a smaller failure
domain in the unlikely event of an ADR failure.

The new 'deep_flush' attribute is visible when the individual DIMMs
backing a given interleave-set are described by platform firmware. In
ACPI terms this is "NVDIMM Region Mapping Structures" and associated
"Flush Hint Address Structures". Reads return "1" if the region supports
triggering WPQ flushes on all DIMMs. Reads return "0" the flush
operation is a platform nop, and in that case the attribute is
read-only.

Why sysfs and not an ioctl? An ioctl requires establishing a new
ioctl function number space for device-dax. Given that this would be
called on a device-dax fd an application could be forgiven for
accidentally calling this on a filesystem-dax fd. Placing this interface
in libnvdimm sysfs removes that potential for collision with a
filesystem ioctl, and it keeps ioctls out of the generic device-dax
implementation.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agolibnvdimm: fix phys_addr for nvdimm_clear_poison
Toshi Kani [Tue, 25 Apr 2017 21:16:51 +0000 (15:16 -0600)]
libnvdimm: fix phys_addr for nvdimm_clear_poison

nvdimm_clear_poison() expects a physical address, not an offset.
Fix nsio_rw_bytes() to call nvdimm_clear_poison() with a physical
address.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agox86, dax, pmem: remove indirection around memcpy_from_pmem()
Dan Williams [Fri, 13 Jan 2017 22:14:23 +0000 (14:14 -0800)]
x86, dax, pmem: remove indirection around memcpy_from_pmem()

memcpy_from_pmem() maps directly to memcpy_mcsafe(). The wrapper
serves no real benefit aside from affording a more generic function name
than the x86-specific 'mcsafe'. However this would not be the first time
that x86 terminology leaked into the global namespace. For lack of
better name, just use memcpy_mcsafe() directly.

This conversion also catches a place where we should have been using
plain memcpy, acpi_nfit_blk_single_io().

Cc: <x86@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agoblock: remove block_device_operations ->direct_access()
Dan Williams [Sat, 28 Jan 2017 01:22:03 +0000 (17:22 -0800)]
block: remove block_device_operations ->direct_access()

Now that all the producers and consumers of dax interfaces have been
converted to using dax_operations on a dax_device, remove the block
device direct_access enabling.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agoblock, dax: convert bdev_dax_supported() to dax_direct_access()
Dan Williams [Sat, 1 Apr 2017 23:02:38 +0000 (16:02 -0700)]
block, dax: convert bdev_dax_supported() to dax_direct_access()

Kill of the final user of bdev_direct_access() and struct blk_dax_ctl.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agofilesystem-dax: convert to dax_direct_access()
Dan Williams [Fri, 27 Jan 2017 21:31:42 +0000 (13:31 -0800)]
filesystem-dax: convert to dax_direct_access()

Now that a dax_device is plumbed through all dax-capable drivers we can
switch from block_device_operations to dax_operations for invoking
->direct_access.

This also lets us kill off some usages of struct blk_dax_ctl on the way
to its eventual removal.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agoRevert "block: use DAX for partition table reads"
Dan Williams [Fri, 27 Jan 2017 22:13:15 +0000 (14:13 -0800)]
Revert "block: use DAX for partition table reads"

commit d1a5f2b4d8a1 ("block: use DAX for partition table reads") was
part of a stalled effort to allow dax mappings of block devices. Since
then the device-dax mechanism has filled the role of dax-mapping static
device ranges.

Now that we are moving ->direct_access() from a block_device operation
to a dax_inode operation we would need block devices to map and carry
their own dax_inode reference.

Unless / until we decide to revive dax mapping of raw block devices
through the dax_inode scheme, there is no need to carry
read_dax_sector(). Its removal in turn allows for the removal of
bdev_direct_access() and should have been included in commit
223757016837 ("block_dev: remove DAX leftovers").

Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agoext2, ext4, xfs: retrieve dax_device for iomap operations
Dan Williams [Fri, 27 Jan 2017 20:04:59 +0000 (12:04 -0800)]
ext2, ext4, xfs: retrieve dax_device for iomap operations

In preparation for converting fs/dax.c to use dax_direct_access()
instead of bdev_direct_access(), add the plumbing to retrieve the
dax_device associated with a given block_device.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agodm: teach dm-targets to use a dax_device + dax_operations
Dan Williams [Wed, 12 Apr 2017 20:37:44 +0000 (13:37 -0700)]
dm: teach dm-targets to use a dax_device + dax_operations

Arrange for dm to lookup the dax services available from member devices.
Update the dax-capable targets, linear and stripe, to route dax
operations to the underlying device. Changes the target-internal
->direct_access() method to more closely align with the dax_operations
->direct_access() calling convention.

Cc: Toshi Kani <toshi.kani@hpe.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agolibnvdimm, region: fix flush hint detection crash
Dan Williams [Mon, 24 Apr 2017 22:43:05 +0000 (15:43 -0700)]
libnvdimm, region: fix flush hint detection crash

In the case where a dimm does not have any associated flush hints the
ndrd->flush_wpq array may be uninitialized leading to crashes with the
following signature:

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
 IP: region_visible+0x10f/0x160 [libnvdimm]

 Call Trace:
  internal_create_group+0xbe/0x2f0
  sysfs_create_groups+0x40/0x80
  device_add+0x2d8/0x650
  nd_async_device_register+0x12/0x40 [libnvdimm]
  async_run_entry_fn+0x39/0x170
  process_one_work+0x212/0x6c0
  ? process_one_work+0x197/0x6c0
  worker_thread+0x4e/0x4a0
  kthread+0x10c/0x140
  ? process_one_work+0x6c0/0x6c0
  ? kthread_create_on_node+0x60/0x60
  ret_from_fork+0x31/0x40

Cc: <stable@vger.kernel.org>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Fixes: f284a4f23752 ("libnvdimm: introduce nvdimm_flush() and nvdimm_has_flush()")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agodm: add dax_device and dax_operations support
Dan Williams [Wed, 12 Apr 2017 19:35:44 +0000 (12:35 -0700)]
dm: add dax_device and dax_operations support

Allocate a dax_device to represent the capacity of a device-mapper
instance. Provide a ->direct_access() method via the new dax_operations
indirection that mirrors the functionality of the current direct_access
support via block_device_operations.  Once fs/dax.c has been converted
to use dax_operations the old dm_blk_direct_access() will be removed.

A new helper dm_dax_get_live_target() is introduced to separate some of
the dm-specifics from the direct_access implementation.

This enabling is only for the top-level dm representation to upper
layers. Converting target direct_access implementations is deferred to a
separate patch.

Cc: Toshi Kani <toshi.kani@hpe.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agodax: introduce dax_direct_access()
Dan Williams [Fri, 27 Jan 2017 04:37:35 +0000 (20:37 -0800)]
dax: introduce dax_direct_access()

Replace bdev_direct_access() with dax_direct_access() that uses
dax_device and dax_operations instead of a block_device and
block_device_operations for dax. Once all consumers of the old api have
been converted bdev_direct_access() will be deleted.

Given that block device partitioning decisions can cause dax page
alignment constraints to be violated this also introduces the
bdev_dax_pgoff() helper. It handles calculating a logical pgoff relative
to the dax_device and also checks for page alignment.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agoblock: kill bdev_dax_capable()
Dan Williams [Fri, 27 Jan 2017 07:30:05 +0000 (23:30 -0800)]
block: kill bdev_dax_capable()

This is leftover dead code that has since been replaced by
bdev_dax_supported().

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agodcssblk: add dax_operations support
Dan Williams [Thu, 26 Jan 2017 21:54:35 +0000 (13:54 -0800)]
dcssblk: add dax_operations support

Setup a dax_dev to have the same lifetime as the dcssblk block device
and add a ->direct_access() method that is equivalent to
dcssblk_direct_access(). Once fs/dax.c has been converted to use
dax_operations the old dcssblk_direct_access() will be removed.

Reported-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agobrd: add dax_operations support
Dan Williams [Thu, 26 Jan 2017 00:54:45 +0000 (16:54 -0800)]
brd: add dax_operations support

Setup a dax_inode to have the same lifetime as the brd block device and
add a ->direct_access() method that is equivalent to
brd_direct_access(). Once fs/dax.c has been converted to use
dax_operations the old brd_direct_access() will be removed.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agoaxon_ram: add dax_operations support
Dan Williams [Wed, 25 Jan 2017 23:25:11 +0000 (15:25 -0800)]
axon_ram: add dax_operations support

Setup a dax_device to have the same lifetime as the axon_ram block
device and add a ->direct_access() method that is equivalent to
axon_ram_direct_access(). Once fs/dax.c has been converted to use
dax_operations the old axon_ram_direct_access() will be removed.

Reported-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agopmem: add dax_operations support
Dan Williams [Wed, 25 Jan 2017 07:02:09 +0000 (23:02 -0800)]
pmem: add dax_operations support

Setup a dax_device to have the same lifetime as the pmem block device
and add a ->direct_access() method that is equivalent to
pmem_direct_access(). Once fs/dax.c has been converted to use
dax_operations the old pmem_direct_access() will be removed.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agodax: introduce dax_operations
Dan Williams [Wed, 25 Jan 2017 02:44:18 +0000 (18:44 -0800)]
dax: introduce dax_operations

Track a set of dax_operations per dax_device that can be set at
alloc_dax() time. These operations will be used to stop the abuse of
block_device_operations for communicating dax capabilities to
filesystems. It will also be used to replace the "pmem api" and move
pmem-specific cache maintenance, and other dax-driver-specific
filesystem-dax operations, to dax device methods. In particular this
allows us to stop abusing __copy_user_nocache(), via memcpy_to_pmem(),
with a driver specific replacement.

This is a standalone introduction of the operations. Follow on patches
convert each dax-driver and teach fs/dax.c to use ->direct_access() from
dax_operations instead of block_device_operations.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agodax: add a facility to lookup a dax device by 'host' device name
Dan Williams [Wed, 19 Apr 2017 22:14:31 +0000 (15:14 -0700)]
dax: add a facility to lookup a dax device by 'host' device name

For the current block_device based filesystem-dax path, we need a way
for it to lookup the dax_device associated with a block_device. Add a
'host' property of a dax_device that can be used for this purpose. It is
a free form string, but for a dax_device associated with a block device
it is the bdev name.

This is a stop-gap until filesystems are able to mount on a dax-inode
directly.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agoacpi, nfit: fix module unload vs workqueue shutdown race
Dan Williams [Tue, 18 Apr 2017 16:56:31 +0000 (09:56 -0700)]
acpi, nfit: fix module unload vs workqueue shutdown race

The workqueue may still be running when the devres callbacks start
firing to deallocate an acpi_nfit_desc instance. Stop and flush the
workqueue before letting any other devres de-allocations proceed.

Reported-by: Linda Knippers <linda.knippers@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agotools/testing/nvdimm: fix nfit_test shutdown crash
Dan Williams [Fri, 14 Apr 2017 06:14:34 +0000 (23:14 -0700)]
tools/testing/nvdimm: fix nfit_test shutdown crash

Keep the nfit_test instances alive until after nfit_test_teardown(), as
we may be doing resource lookups until the final un-registrations have
completed. This fixes crashes of the form.

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
 IP: __release_resource+0x12/0x90
 Call Trace:
  remove_resource+0x23/0x40
  __wrap_remove_resource+0x29/0x30 [nfit_test_iomap]
  acpi_nfit_remove_resource+0xe/0x10 [nfit]
  devm_action_release+0xf/0x20
  release_nodes+0x16d/0x2b0
  devres_release_all+0x3c/0x60
  device_release+0x21/0x90
  kobject_release+0x6a/0x170
  kobject_put+0x2f/0x60
  put_device+0x17/0x20
  platform_device_unregister+0x20/0x30
  nfit_test_exit+0x36/0x960 [nfit_test]

Reported-by: Linda Knippers <linda.knippers@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agoacpi, nfit: limit ->flush_probe() to initialization work
Dan Williams [Fri, 14 Apr 2017 05:48:46 +0000 (22:48 -0700)]
acpi, nfit: limit ->flush_probe() to initialization work

The nvdimm probe flushing mechanism gives userspace a sync point where
it knows all asynchronous driver probe sequences have completed.
However, it need not wait for other asynchronous actions, like
on-demand address-range-scrub. Track the init work separately from other
work in the workqueue, and only flush the former.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agoacpi, nfit: collate health state flags
Dan Williams [Fri, 14 Apr 2017 17:27:11 +0000 (10:27 -0700)]
acpi, nfit: collate health state flags

Be tolerant of cases where the BIOS provided NFIT does not consistently
set the flags in all NVDIMM Region Mapping structures associated with a
given dimm.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agoacpi, nfit: support "map failed" dimms
Dan Williams [Fri, 14 Apr 2017 02:46:36 +0000 (19:46 -0700)]
acpi, nfit: support "map failed" dimms

Stop requiring dimms be successfully mapped into a
system-physical-address range. For provisioning and hardware remediation
purposes the kernel should account for failed devices in sysfs. If
possible it should still allow management commands to be sent to the
device.

Reported-by: Toshi Kani <toshi.kani@hpe.com>
Tested-by: Toshi Kani <toshi.kani@hpe.com>
Reported-by: Linda Knippers <linda.knippers@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agotools/testing/nvdimm: test acpi 6.1 health state flags
Dan Williams [Thu, 13 Apr 2017 22:17:38 +0000 (15:17 -0700)]
tools/testing/nvdimm: test acpi 6.1 health state flags

Add a simulated dimm with an ACPI_NFIT_MEM_MAP_FAILED indication, and
set the ACPI_NFIT_MEM_HEALTH_ENABLED flag on all the dimms where
nfit_test simulates health events, but spread it out over several
redundant memdev entries to test that the nfit driver coalesces all the
flags.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agoacpi, nfit: add support for acpi 6.1 dimm state flags
Dan Williams [Thu, 13 Apr 2017 22:05:30 +0000 (15:05 -0700)]
acpi, nfit: add support for acpi 6.1 dimm state flags

Add support for the ACPI_NFIT_MEM_MAP_FAILED ("map_fail") and
ACPI_NFIT_MEM_HEALTH_ENABLED ("smart_notify") health state flags. The
"map_fail" flag identifies DIMMs that were not mapped into one or more
physical address ranges. The "health_notify" flag indicates whether
platform firmware will send notifications when there is new SMART health
data to consume.

Acked-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agoRevert "libnvdimm: band aid btt vs clear poison locking"
Dan Williams [Thu, 13 Apr 2017 21:37:39 +0000 (14:37 -0700)]
Revert "libnvdimm: band aid btt vs clear poison locking"

This reverts commit 4aa5615e080a "libnvdimm: band aid btt vs clear
poison locking".

Now that poison list locking has been converted to a spinlock and poison
list entry allocation during i/o has been converted to GFP_NOWAIT,
revert the band-aid that disabled error clearing from btt i/o.

Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agolibnvdimm: fix clear poison locking with spinlock and GFP_NOWAIT allocation
Dave Jiang [Thu, 13 Apr 2017 21:25:17 +0000 (14:25 -0700)]
libnvdimm: fix clear poison locking with spinlock and GFP_NOWAIT allocation

The following warning results from holding a lane spinlock,
preempt_disable(), or the btt map spinlock and then trying to take the
reconfig_mutex to walk the poison list and potentially add new entries.

BUG: sleeping function called from invalid context at kernel/locking/mutex.
c:747
in_atomic(): 1, irqs_disabled(): 0, pid: 17159, name: dd
[..]
Call Trace:
dump_stack+0x85/0xc8
___might_sleep+0x184/0x250
__might_sleep+0x4a/0x90
__mutex_lock+0x58/0x9b0
? nvdimm_bus_lock+0x21/0x30 [libnvdimm]
? __nvdimm_bus_badblocks_clear+0x2f/0x60 [libnvdimm]
? acpi_nfit_forget_poison+0x79/0x80 [nfit]
? _raw_spin_unlock+0x27/0x40
mutex_lock_nested+0x1b/0x20
nvdimm_bus_lock+0x21/0x30 [libnvdimm]
nvdimm_forget_poison+0x25/0x50 [libnvdimm]
nvdimm_clear_poison+0x106/0x140 [libnvdimm]
nsio_rw_bytes+0x164/0x270 [libnvdimm]
btt_write_pg+0x1de/0x3e0 [nd_btt]
? blk_queue_enter+0x30/0x290
btt_make_request+0x11a/0x310 [nd_btt]
? blk_queue_enter+0xb7/0x290
? blk_queue_enter+0x30/0x290
generic_make_request+0x118/0x3b0

A spinlock is introduced to protect the poison list. This allows us to not
having to acquire the reconfig_mutex for touching the poison list. The
add_poison() function has been broken out into two helper functions. One to
allocate the poison entry and the other to apppend the entry. This allows us
to unlock the poison_lock in non-I/O path and continue to be able to allocate
the poison entry with GFP_KERNEL. We will use GFP_NOWAIT in the I/O path in
order to satisfy being in atomic context.

Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agodax: refactor dax-fs into a generic provider of 'struct dax_device' instances
Dan Williams [Tue, 11 Apr 2017 16:49:49 +0000 (09:49 -0700)]
dax: refactor dax-fs into a generic provider of 'struct dax_device' instances

We want dax capable drivers to be able to publish a set of dax
operations [1]. However, we do not want to further abuse block_devices
to advertise these operations. Instead we will attach these operations
to a dax device and add a lookup mechanism to go from block device path
to a dax device. A dax capable driver like pmem or brd is responsible
for registering a dax device, alongside a block device, and then a dax
capable filesystem is responsible for retrieving the dax device by path
name if it wants to call dax_operations.

For now, we refactor the dax pseudo-fs to be a generic facility, rather
than an implementation detail, of the device-dax use case. Where a "dax
device" is just an inode + dax infrastructure, and "Device DAX" is a
mapping service layered on top of that base 'struct dax_device'.
"Filesystem DAX" is then a mapping service that layers a filesystem on
top of that same base device. Filesystem DAX is associated with a
block_device for now, but perhaps directly to a dax device in the
future, or for new pmem-only filesystems.

[1]: https://lkml.org/lkml/2017/1/19/880

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agodevice-dax: rename 'dax_dev' to 'dev_dax'
Dan Williams [Tue, 31 Jan 2017 05:43:10 +0000 (21:43 -0800)]
device-dax: rename 'dax_dev' to 'dev_dax'

In preparation for introducing a struct dax_device type to the kernel
global type namespace, rename dax_dev to dev_dax. A 'dax_device'
instance will be a generic device-driver object for any provider of dax
functionality. A 'dev_dax' object is a device-dax-driver local /
internal instance.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agodevice-dax: improve fault handler debug output
Oliver O'Halloran [Tue, 11 Apr 2017 15:59:36 +0000 (01:59 +1000)]
device-dax: improve fault handler debug output

A couple of minor improvements to the debug output in the fault handlers:

a) Print the region alignment and fault size when we sent a SIGBUS
   because the region alignment is greater than the fault size.
b) Fix the message in the PFN_{DEV|MAP} check.
c) Additionally print the fault size enum value in the huge fault handler.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agodevice-dax: fix dax_dev_huge_fault() unknown fault size handling
Pushkar Jambhlekar [Tue, 11 Apr 2017 16:12:25 +0000 (09:12 -0700)]
device-dax: fix dax_dev_huge_fault() unknown fault size handling

The default case for dax_dev_huge_fault() fault size handling mistakenly
returns when it should unlock. This is not a problem in practice since
the only three possible fault sizes are handled. Going forward, if the
core mm adds a new fault size beyond pte, pmd, or pud device-dax should
abort VM_FAULT_SIGBUS requests not VM_FAULT_FALLBACK since device-dax
guarantees a configured fault granularity for all faults.

Signed-off-by: Pushkar Jambhlekar <pushkar.iit@gmail.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agoMerge branch 'for-4.11/libnvdimm' into for-4.12/dax
Dan Williams [Thu, 13 Apr 2017 04:59:01 +0000 (21:59 -0700)]
Merge branch 'for-4.11/libnvdimm' into for-4.12/dax

7 years agodevice-dax, tools/testing/nvdimm: enable device-dax with mock resources
Dave Jiang [Fri, 7 Apr 2017 22:33:36 +0000 (15:33 -0700)]
device-dax, tools/testing/nvdimm: enable device-dax with mock resources

Provide a replacement pgoff_to_phys() that translates an nfit_test
resource (allocated by vmalloc()) to a pfn.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agolibnvdimm: add support for clear poison list and badblocks for device dax
Dave Jiang [Fri, 7 Apr 2017 22:33:31 +0000 (15:33 -0700)]
libnvdimm: add support for clear poison list and badblocks for device dax

Providing mechanism to clear poison list via the ndctl ND_CMD_CLEAR_ERROR
call. We will update the poison list and also the badblocks at region level
if the region is in dax mode or in pmem mode and not active. In other
words we force badblocks to be cleared through write requests if the
address is currently accessed through a block device, otherwise it can
only be done via the ioctl+dsm path.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agolibnvdimm: Add 'resource' sysfs attribute to regions
Dave Jiang [Fri, 7 Apr 2017 22:33:25 +0000 (15:33 -0700)]
libnvdimm: Add 'resource' sysfs attribute to regions

Adding sysfs attribute in order to export the physical address of the
region. This is for supporting of user app poison clear via
ND_IOCTL_CLEAR_ERROR.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agolibnvdimm: add mechanism to publish badblocks at the region level
Dave Jiang [Fri, 7 Apr 2017 22:33:20 +0000 (15:33 -0700)]
libnvdimm: add mechanism to publish badblocks at the region level

badblocks sysfs file will be export at region level. When nvdimm event
notifier happens for NVDIMM_REVALIATE_POISON, the badblocks in the
region will be updated.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agoacpi, nfit: fix acpi_get_table leak
Dan Williams [Mon, 3 Apr 2017 20:52:14 +0000 (13:52 -0700)]
acpi, nfit: fix acpi_get_table leak

Calls to acpi_get_table() must be paired with acpi_put_table() to undo
the mapping established by acpi_tb_acquire_table().

It turns out this has no effect in practice since the NFIT will already
be mapped to support the /sys/firmware/acpi/tables/NFIT attribute in
sysfs.

Fixes: 6b11d1d67713 ("ACPI / osl: Remove acpi_get_table_with_size()/early_acpi_os_unmap_memory() users")
Cc: Lv Zheng <lv.zheng@intel.com>
Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agoacpi, nfit: remove unnecessary newline
Linda Knippers [Tue, 7 Mar 2017 21:35:14 +0000 (16:35 -0500)]
acpi, nfit: remove unnecessary newline

The newline in MODULE_PARM_DESC causes modinfo to print the parameter
data type on a separate line, which is different from all the other
module parameters and could potentially cause a problem for someone
parsing the output of modinfo.

Signed-off-by: Linda Knippers <linda.knippers@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agoacpi, nfit: allow specifying a default DSM family
Linda Knippers [Tue, 7 Mar 2017 21:35:13 +0000 (16:35 -0500)]
acpi, nfit: allow specifying a default DSM family

Provide the ability to request a default DSM family. If it is not
supported, then fall back to the normal discovery order.

This is helpful for testing platforms that support multiple DSM
families.  It will also allow administrators to request the DSM family
that their management tools support, which may not be the first one
found using the current discovery order.

Signed-off-by: Linda Knippers <linda.knippers@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agoacpi, nfit: allow override of built-in bitmasks for nvdimm DSMs
Linda Knippers [Tue, 7 Mar 2017 21:35:12 +0000 (16:35 -0500)]
acpi, nfit: allow override of built-in bitmasks for nvdimm DSMs

As it is today, we can't enable or test new NVDIMM management functions
provided by new firmware and tools without changing the kernel.  We also
can't prevent documented DSM functions from being called in the case of
buggy firmware.  This patch provides a module parameter that overrides
the DSM function mask that is built into the kernel.

If the "disable_vendor_specific" module parameter is also used we ignore
the new parameter.

Signed-off-by: Linda Knippers <linda.knippers@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agox86, pmem: fix broken __copy_user_nocache cache-bypass assumptions
Dan Williams [Thu, 6 Apr 2017 16:04:31 +0000 (09:04 -0700)]
x86, pmem: fix broken __copy_user_nocache cache-bypass assumptions

Before we rework the "pmem api" to stop abusing __copy_user_nocache()
for memcpy_to_pmem() we need to fix cases where we may strand dirty data
in the cpu cache. The problem occurs when copy_from_iter_pmem() is used
for arbitrary data transfers from userspace. There is no guarantee that
these transfers, performed by dax_iomap_actor(), will have aligned
destinations or aligned transfer lengths. Backstop the usage
__copy_user_nocache() with explicit cache management in these unaligned
cases.

Yes, copy_from_iter_pmem() is now too big for an inline, but addressing
that is saved for a later patch that moves the entirety of the "pmem
api" into the pmem driver directly.

Fixes: 5de490daec8b ("pmem: add copy_from_iter_pmem() and clear_pmem()")
Cc: <stable@vger.kernel.org>
Cc: <x86@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agodevice-dax: switch to srcu, fix rcu_read_lock() vs pte allocation
Dan Williams [Fri, 7 Apr 2017 23:42:08 +0000 (16:42 -0700)]
device-dax: switch to srcu, fix rcu_read_lock() vs pte allocation

The following warning triggers with a new unit test that stresses the
device-dax interface.

 ===============================
 [ ERR: suspicious RCU usage.  ]
 4.11.0-rc4+ #1049 Tainted: G           O
 -------------------------------
 ./include/linux/rcupdate.h:521 Illegal context switch in RCU read-side critical section!

 other info that might help us debug this:

 rcu_scheduler_active = 2, debug_locks = 0
 2 locks held by fio/9070:
  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff8d0739d7>] __do_page_fault+0x167/0x4f0
  #1:  (rcu_read_lock){......}, at: [<ffffffffc03fbd02>] dax_dev_huge_fault+0x32/0x620 [dax]

 Call Trace:
  dump_stack+0x86/0xc3
  lockdep_rcu_suspicious+0xd7/0x110
  ___might_sleep+0xac/0x250
  __might_sleep+0x4a/0x80
  __alloc_pages_nodemask+0x23a/0x360
  alloc_pages_current+0xa1/0x1f0
  pte_alloc_one+0x17/0x80
  __pte_alloc+0x1e/0x120
  __get_locked_pte+0x1bf/0x1d0
  insert_pfn.isra.70+0x3a/0x100
  ? lookup_memtype+0xa6/0xd0
  vm_insert_mixed+0x64/0x90
  dax_dev_huge_fault+0x520/0x620 [dax]
  ? dax_dev_huge_fault+0x32/0x620 [dax]
  dax_dev_fault+0x10/0x20 [dax]
  __do_fault+0x1e/0x140
  __handle_mm_fault+0x9af/0x10d0
  handle_mm_fault+0x16d/0x370
  ? handle_mm_fault+0x47/0x370
  __do_page_fault+0x28c/0x4f0
  trace_do_page_fault+0x58/0x2a0
  do_async_page_fault+0x1a/0xa0
  async_page_fault+0x28/0x30

Inserting a page table entry may trigger an allocation while we are
holding a read lock to keep the device instance alive for the duration
of the fault. Use srcu for this keep-alive protection.

Fixes: dee410792419 ("/dev/dax, core: file operations and dax-mmap")
Cc: <stable@vger.kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agolibnvdimm: band aid btt vs clear poison locking
Dan Williams [Fri, 7 Apr 2017 19:25:52 +0000 (12:25 -0700)]
libnvdimm: band aid btt vs clear poison locking

The following warning results from holding a lane spinlock,
preempt_disable(), or the btt map spinlock and then trying to take the
reconfig_mutex to walk the poison list and potentially add new entries.

 BUG: sleeping function called from invalid context at kernel/locking/mutex.c:747
 in_atomic(): 1, irqs_disabled(): 0, pid: 17159, name: dd
 [..]
 Call Trace:
  dump_stack+0x85/0xc8
  ___might_sleep+0x184/0x250
  __might_sleep+0x4a/0x90
  __mutex_lock+0x58/0x9b0
  ? nvdimm_bus_lock+0x21/0x30 [libnvdimm]
  ? __nvdimm_bus_badblocks_clear+0x2f/0x60 [libnvdimm]
  ? acpi_nfit_forget_poison+0x79/0x80 [nfit]
  ? _raw_spin_unlock+0x27/0x40
  mutex_lock_nested+0x1b/0x20
  nvdimm_bus_lock+0x21/0x30 [libnvdimm]
  nvdimm_forget_poison+0x25/0x50 [libnvdimm]
  nvdimm_clear_poison+0x106/0x140 [libnvdimm]
  nsio_rw_bytes+0x164/0x270 [libnvdimm]
  btt_write_pg+0x1de/0x3e0 [nd_btt]
  ? blk_queue_enter+0x30/0x290
  btt_make_request+0x11a/0x310 [nd_btt]
  ? blk_queue_enter+0xb7/0x290
  ? blk_queue_enter+0x30/0x290
  generic_make_request+0x118/0x3b0

As a minimal fix, disable error clearing when the BTT is enabled for the
namespace. For the final fix a larger rework of the poison list locking
is needed.

Note that this is not a problem in the blk case since that path never
calls nvdimm_clear_poison().

Cc: <stable@vger.kernel.org>
Fixes: 82bf1037f2ca ("libnvdimm: check and clear poison before writing to pmem")
Cc: Dave Jiang <dave.jiang@intel.com>
[jeff: dynamically disable error clearing in the btt case]
Suggested-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Reported-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agolibnvdimm: fix reconfig_mutex, mmap_sem, and jbd2_handle lockdep splat
Dan Williams [Fri, 7 Apr 2017 16:47:24 +0000 (09:47 -0700)]
libnvdimm: fix reconfig_mutex, mmap_sem, and jbd2_handle lockdep splat

Holding the reconfig_mutex over a potential userspace fault sets up a
lockdep dependency chain between filesystem-DAX and the libnvdimm ioctl
path. Move the user access outside of the lock.

     [ INFO: possible circular locking dependency detected ]
     4.11.0-rc3+ #13 Tainted: G        W  O
     -------------------------------------------------------
     fallocate/16656 is trying to acquire lock:
      (&nvdimm_bus->reconfig_mutex){+.+.+.}, at: [<ffffffffa00080b1>] nvdimm_bus_lock+0x21/0x30 [libnvdimm]
     but task is already holding lock:
      (jbd2_handle){++++..}, at: [<ffffffff813b4944>] start_this_handle+0x104/0x460

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #2 (jbd2_handle){++++..}:
            lock_acquire+0xbd/0x200
            start_this_handle+0x16a/0x460
            jbd2__journal_start+0xe9/0x2d0
            __ext4_journal_start_sb+0x89/0x1c0
            ext4_dirty_inode+0x32/0x70
            __mark_inode_dirty+0x235/0x670
            generic_update_time+0x87/0xd0
            touch_atime+0xa9/0xd0
            ext4_file_mmap+0x90/0xb0
            mmap_region+0x370/0x5b0
            do_mmap+0x415/0x4f0
            vm_mmap_pgoff+0xd7/0x120
            SyS_mmap_pgoff+0x1c5/0x290
            SyS_mmap+0x22/0x30
            entry_SYSCALL_64_fastpath+0x1f/0xc2

    -> #1 (&mm->mmap_sem){++++++}:
            lock_acquire+0xbd/0x200
            __might_fault+0x70/0xa0
            __nd_ioctl+0x683/0x720 [libnvdimm]
            nvdimm_ioctl+0x8b/0xe0 [libnvdimm]
            do_vfs_ioctl+0xa8/0x740
            SyS_ioctl+0x79/0x90
            do_syscall_64+0x6c/0x200
            return_from_SYSCALL_64+0x0/0x7a

    -> #0 (&nvdimm_bus->reconfig_mutex){+.+.+.}:
            __lock_acquire+0x16b6/0x1730
            lock_acquire+0xbd/0x200
            __mutex_lock+0x88/0x9b0
            mutex_lock_nested+0x1b/0x20
            nvdimm_bus_lock+0x21/0x30 [libnvdimm]
            nvdimm_forget_poison+0x25/0x50 [libnvdimm]
            nvdimm_clear_poison+0x106/0x140 [libnvdimm]
            pmem_do_bvec+0x1c2/0x2b0 [nd_pmem]
            pmem_make_request+0xf9/0x270 [nd_pmem]
            generic_make_request+0x118/0x3b0
            submit_bio+0x75/0x150

Cc: <stable@vger.kernel.org>
Fixes: 62232e45f4a2 ("libnvdimm: control (ioctl) messages for nvdimm_bus and nvdimm devices")
Cc: Dave Jiang <dave.jiang@intel.com>
Reported-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agolibnvdimm: fix blk free space accounting
Dan Williams [Tue, 4 Apr 2017 22:08:36 +0000 (15:08 -0700)]
libnvdimm: fix blk free space accounting

Commit a1f3e4d6a0c3 "libnvdimm, region: update nd_region_available_dpa()
for multi-pmem support" reworked blk dpa (DIMM Physical Address)
accounting to comprehend multiple pmem namespace allocations aliasing
with a given blk-dpa range.

The following call trace is a result of failing to account for allocated
blk capacity.

 WARNING: CPU: 1 PID: 2433 at tools/testing/nvdimm/../../../drivers/nvdimm/names
4 size_store+0x6f3/0x930 [libnvdimm]
 nd_region region5: allocation underrun: 0x0 of 0x1000000 bytes
 [..]
 Call Trace:
  dump_stack+0x86/0xc3
  __warn+0xcb/0xf0
  warn_slowpath_fmt+0x5f/0x80
  size_store+0x6f3/0x930 [libnvdimm]
  dev_attr_store+0x18/0x30

If a given blk-dpa allocation does not alias with any pmem ranges then
the full allocation should be accounted as busy space, not the size of
the current pmem contribution to the region.

The thinkos that led to this confusion was not realizing that the struct
resource management is already guaranteeing no collisions between pmem
allocations and blk allocations on the same dimm. Also, we do not try to
support blk allocations in aliased pmem holes.

This patch also fixes a case where the available blk goes negative.

Cc: <stable@vger.kernel.org>
Fixes: a1f3e4d6a0c3 ("libnvdimm, region: update nd_region_available_dpa() for multi-pmem support").
Reported-by: Dariusz Dokupil <dariusz.dokupil@intel.com>
Reported-by: Dave Jiang <dave.jiang@intel.com>
Reported-by: Vishal Verma <vishal.l.verma@intel.com>
Tested-by: Dave Jiang <dave.jiang@intel.com>
Tested-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agoacpi, nfit, libnvdimm: fix interleave set cookie calculation (64-bit comparison)
Dan Williams [Tue, 28 Mar 2017 04:53:38 +0000 (21:53 -0700)]
acpi, nfit, libnvdimm: fix interleave set cookie calculation (64-bit comparison)

While reviewing the -stable patch for commit 86ef58a4e35e "nfit,
libnvdimm: fix interleave set cookie calculation" Ben noted:

    "This is returning an int, thus it's effectively doing a 32-bit
     comparison and not the 64-bit comparison you say is needed."

Update the compare operation to be immune to this integer demotion problem.

Cc: <stable@vger.kernel.org>
Cc: Nicholas Moulin <nicholas.w.moulin@linux.intel.com>
Fixes: 86ef58a4e35e ("nfit, libnvdimm: fix interleave set cookie calculation")
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
7 years agoLinux 4.11-rc4 v4.11-rc4
Linus Torvalds [Sun, 26 Mar 2017 21:15:16 +0000 (14:15 -0700)]
Linux 4.11-rc4

7 years agoMerge tag 'char-misc-4.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregk...
Linus Torvalds [Sun, 26 Mar 2017 18:15:54 +0000 (11:15 -0700)]
Merge tag 'char-misc-4.11-rc4' of git://git./linux/kernel/git/gregkh/char-misc

Pull char/misc driver fixes from Greg KH:
 "A smattering of different small fixes for some random driver
  subsystems. Nothing all that major, just resolutions for reported
  issues and bugs.

  All have been in linux-next with no reported issues"

* tag 'char-misc-4.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (21 commits)
  extcon: int3496: Set the id pin to direction-input if necessary
  extcon: int3496: Use gpiod_get instead of gpiod_get_index
  extcon: int3496: Add dependency on X86 as it's Intel specific
  extcon: int3496: Add GPIO ACPI mapping table
  extcon: int3496: Rename GPIO pins in accordance with binding
  vmw_vmci: handle the return value from pci_alloc_irq_vectors correctly
  ppdev: fix registering same device name
  parport: fix attempt to write duplicate procfiles
  auxdisplay: img-ascii-lcd: add missing sentinel entry in img_ascii_lcd_matches
  Drivers: hv: vmbus: Don't leak memory when a channel is rescinded
  Drivers: hv: vmbus: Don't leak channel ids
  Drivers: hv: util: don't forget to init host_ts.lock
  Drivers: hv: util: move waiting for release to hv_utils_transport itself
  vmbus: remove hv_event_tasklet_disable/enable
  vmbus: use rcu for per-cpu channel list
  mei: don't wait for os version message reply
  mei: fix deadlock on mei reset
  intel_th: pci: Add Gemini Lake support
  intel_th: pci: Add Denverton SOC support
  intel_th: Don't leak module refcount on failure to activate
  ...

7 years agoMerge tag 'driver-core-4.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 26 Mar 2017 18:05:42 +0000 (11:05 -0700)]
Merge tag 'driver-core-4.11-rc4' of git://git./linux/kernel/git/gregkh/driver-core

Pull driver core fix from Greg KH:
 "Here is a single kernfs fix for 4.11-rc4 that resolves a reported
  issue.

  It has been in linux-next with no reported issues"

* tag 'driver-core-4.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  kernfs: Check KERNFS_HAS_RELEASE before calling kernfs_release_file()

7 years agoMerge tag 'tty-4.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Linus Torvalds [Sun, 26 Mar 2017 18:03:42 +0000 (11:03 -0700)]
Merge tag 'tty-4.11-rc4' of git://git./linux/kernel/git/gregkh/tty

Pull tty/serial driver fixes from Greg KH:
 "Here are some tty and serial driver fixes for 4.11-rc4.

  One of these fix a long-standing issue in the ldisc code that was
  found by Dmitry Vyukov with his great fuzzing work. The other fixes
  resolve other reported issues, and there is one revert of a patch in
  4.11-rc1 that wasn't correct.

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'tty-4.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  tty: fix data race in tty_ldisc_ref_wait()
  tty: don't panic on OOM in tty_set_ldisc()
  Revert "tty: serial: pl011: add ttyAMA for matching pl011 console"
  tty: acpi/spcr: QDF2400 E44 checks for wrong OEM revision
  serial: 8250_dw: Fix breakage when HAVE_CLK=n
  serial: 8250_dw: Honor clk_round_rate errors in dw8250_set_termios

7 years agoMerge tag 'staging-4.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh...
Linus Torvalds [Sun, 26 Mar 2017 18:02:00 +0000 (11:02 -0700)]
Merge tag 'staging-4.11-rc4' of git://git./linux/kernel/git/gregkh/staging

Pull IIO driver fixes from Greg KH:
 "Here are some small IIO driver fixes for 4.11-rc4 that resolve a
  number of tiny reported issues. All of these have been in linux-next
  for a while with no reported issues"

* tag 'staging-4.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
  iio: imu: st_lsm6dsx: fix FIFO_CTRL2 overwrite during watermark configuration
  iio: adc: ti_am335x_adc: fix fifo overrun recovery
  iio: sw-device: Fix config group initialization
  iio: magnetometer: ak8974: remove incorrect __exit markups
  iio: hid-sensor-trigger: Change get poll value function order to avoid sensor properties losing after resume from S3

7 years agoMerge tag 'usb-4.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Linus Torvalds [Sun, 26 Mar 2017 17:52:52 +0000 (10:52 -0700)]
Merge tag 'usb-4.11-rc4' of git://git./linux/kernel/git/gregkh/usb

Pull USB/PHY fixes from Greg KH:
 "Here are a number of small USB and PHY driver fixes for 4.11-rc4.

  Nothing major here, just an bunch of small fixes, and a handfull of
  good fixes from Johan for devices with crazy descriptors. There are a
  few new device ids in here as well.

  All of these have been in linux-next with no reported issues"

* tag 'usb-4.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (26 commits)
  usb: gadget: f_hid: fix: Don't access hidg->req without spinlock held
  usb: gadget: udc: remove pointer dereference after free
  usb: gadget: f_uvc: Sanity check wMaxPacketSize for SuperSpeed
  usb: gadget: f_uvc: Fix SuperSpeed companion descriptor's wBytesPerInterval
  usb: gadget: acm: fix endianness in notifications
  usb: dwc3: gadget: delay unmap of bounced requests
  USB: serial: qcserial: add Dell DW5811e
  usb: hub: Fix crash after failure to read BOS descriptor
  ACM gadget: fix endianness in notifications
  USB: usbtmc: fix probe error path
  USB: usbtmc: add missing endpoint sanity check
  USB: serial: option: add Quectel UC15, UC20, EC21, and EC25 modems
  usb: musb: fix possible spinlock deadlock
  usb: musb: dsps: fix iounmap in error and exit paths
  usb: musb: cppi41: don't check early-TX-interrupt for Isoch transfer
  usb-core: Add LINEAR_FRAME_INTR_BINTERVAL USB quirk
  uwb: i1480-dfu: fix NULL-deref at probe
  uwb: hwa-rc: fix NULL-deref at probe
  USB: wusbcore: fix NULL-deref at probe
  USB: uss720: fix NULL-deref at probe
  ...

7 years agoMerge tag 'powerpc-4.11-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc...
Linus Torvalds [Sun, 26 Mar 2017 17:34:10 +0000 (10:34 -0700)]
Merge tag 'powerpc-4.11-6' of git://git./linux/kernel/git/powerpc/linux

Pull more powerpc fixes from Michael Ellerman:
 "These are all pretty minor. The fix for idle wakeup would be a bad bug
  but has not been observed in practice.

  The update to the gcc-plugins docs was Cc'ed to Kees and Jon, Kees
  OK'ed it going via powerpc and I didn't hear from Jon.

   - cxl: Route eeh events to all slices for pci_channel_io_perm_failure state

   - powerpc/64s: Fix idle wakeup potential to clobber registers

   - Revert "powerpc/64: Disable use of radix under a hypervisor"

   - gcc-plugins: update architecture list in documentation

  Thanks to: Andrew Donnellan, Nicholas Piggin, Paul Mackerras, Vaibhav
  Jain"

* tag 'powerpc-4.11-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  gcc-plugins: update architecture list in documentation
  Revert "powerpc/64: Disable use of radix under a hypervisor"
  powerpc/64s: Fix idle wakeup potential to clobber registers
  cxl: Route eeh events to all slices for pci_channel_io_perm_failure state

7 years agoMerge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 26 Mar 2017 17:29:21 +0000 (10:29 -0700)]
Merge tag 'ext4_for_linus_stable' of git://git./linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Fix a memory leak on an error path, and two races when modifying
  inodes relating to the inline_data and metadata checksum features"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix two spelling nits
  ext4: lock the xattr block before checksuming it
  jbd2: don't leak memory if setting up journal fails
  ext4: mark inode dirty after converting inline directory

7 years agoMerge tag 'fscrypt-for-linus_stable' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 25 Mar 2017 22:36:56 +0000 (15:36 -0700)]
Merge tag 'fscrypt-for-linus_stable' of git://git./linux/kernel/git/tytso/fscrypt

Pull fscrypto fixes from Ted Ts'o:
 "A code cleanup and bugfix for fs/crypto"

* tag 'fscrypt-for-linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt:
  fscrypt: eliminate ->prepare_context() operation
  fscrypt: remove broken support for detecting keyring key revocation

7 years agoMerge tag 'hwmon-for-linus-v4.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 25 Mar 2017 22:31:50 +0000 (15:31 -0700)]
Merge tag 'hwmon-for-linus-v4.11-rc4' of git://git./linux/kernel/git/groeck/linux-staging

Pull hwmon fixes from Guenter Roeck:

 - bug fixes in asus_atk0110, it87 and max31790 drivers

 - added missing API definition to hwmon core

* tag 'hwmon-for-linus-v4.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
  hwmon: (asus_atk0110) fix uninitialized data access
  hwmon: Add missing HWMON_T_ALARM
  hwmon: (it87) Avoid registering the same chip on both SIO addresses
  hwmon: (max31790) Set correct PWM value

7 years agoMerge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Linus Torvalds [Sat, 25 Mar 2017 22:25:58 +0000 (15:25 -0700)]
Merge tag 'for-linus' of git://git./linux/kernel/git/dledford/rdma

Pull rdma fixes from Doug Ledford:
 "This has been a slow -rc cycle for the RDMA subsystem. We really
  haven't had a lot of rc fixes come in. This pull request is the first
  of this entire rc cycle and it has all of the suitable fixes so far
  and it's still only about 20 patches. The fix for the minor breakage
  cause by the dma mapping patchset is in here, as well as a couple
  other potential oops fixes, but the rest is more minor.

  Summary:

   - fix for dma_ops change in this kernel, resolving the s390, powerpc,
     and IOMMU operation

   - a few other oops fixes

   - the rest are all minor fixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma:
  IB/qib: fix false-postive maybe-uninitialized warning
  RDMA/iser: Fix possible mr leak on device removal event
  IB/device: Convert ib-comp-wq to be CPU-bound
  IB/cq: Don't process more than the given budget
  IB/rxe: increment msn only when completing a request
  uapi: fix rdma/mlx5-abi.h userspace compilation errors
  IB/core: Restore I/O MMU, s390 and powerpc support
  IB/rxe: Update documentation link
  RDMA/ocrdma: fix a type issue in ocrdma_put_pd_num()
  IB/rxe: double free on error
  RDMA/vmw_pvrdma: Activate device on ethernet link up
  RDMA/vmw_pvrdma: Dont hardcode QP header page
  RDMA/vmw_pvrdma: Cleanup unused variables
  infiniband: Fix alignment of mmap cookies to support VIPT caching
  IB/core: Protect against self-requeue of a cq work item
  i40iw: Receive netdev events post INET_NOTIFIER state

7 years agoMerge branch 'stable-4.11' of git://git.infradead.org/users/pcmoore/audit
Linus Torvalds [Sat, 25 Mar 2017 22:13:55 +0000 (15:13 -0700)]
Merge branch 'stable-4.11' of git://git.infradead.org/users/pcmoore/audit

Pull audit fix from Paul Moore:
 "We've got an audit fix, and unfortunately it is big.

  While I'm not excited that we need to be sending you something this
  large during the -rcX phase, it does fix some very real, and very
  tangled, problems relating to locking, backlog queues, and the audit
  daemon connection.

  This code has passed our testsuite without problem and it has held up
  to my ad-hoc stress tests (arguably better than the existing code),
  please consider pulling this as fix for the next v4.11-rcX tag"

* 'stable-4.11' of git://git.infradead.org/users/pcmoore/audit:
  audit: fix auditd/kernel connection state tracking

7 years agoext4: fix two spelling nits
Theodore Ts'o [Sat, 25 Mar 2017 21:33:31 +0000 (17:33 -0400)]
ext4: fix two spelling nits

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
7 years agoext4: lock the xattr block before checksuming it
Theodore Ts'o [Sat, 25 Mar 2017 21:22:47 +0000 (17:22 -0400)]
ext4: lock the xattr block before checksuming it

We must lock the xattr block before calculating or verifying the
checksum in order to avoid spurious checksum failures.

https://bugzilla.kernel.org/show_bug.cgi?id=193661

Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
7 years agoMerge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sat, 25 Mar 2017 17:34:56 +0000 (10:34 -0700)]
Merge tag 'clk-fixes-for-linus' of git://git./linux/kernel/git/clk/linux

Pull clk fixes from Stephen Boyd:
 "A handful of Sunxi and Rockchip clk driver fixes and a core framework
  one where we need to copy a string because we can't guarantee it isn't
  freed sometime later"

* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
  clk: sunxi-ng: fix recalc_rate formula of NKMP clocks
  clk: sunxi-ng: Fix div/mult settings for osc12M on A64
  clk: rockchip: Make uartpll a child of the gpll on rk3036
  clk: rockchip: add "," to mux_pll_src_apll_dpll_gpll_usb480m_p on rk3036
  clk: core: Copy connection id
  dt-bindings: arm: update Armada CP110 system controller binding
  clk: sunxi-ng: sun6i: Fix enable bit offset for hdmi-ddc module clock
  clk: sunxi: ccu-sun5i needs nkmp
  clk: sunxi-ng: mp: Adjust parent rate for pre-dividers

7 years agoIB/qib: fix false-postive maybe-uninitialized warning
Arnd Bergmann [Tue, 14 Mar 2017 12:18:45 +0000 (13:18 +0100)]
IB/qib: fix false-postive maybe-uninitialized warning

aarch64-linux-gcc-7 complains about code it doesn't fully understand:

drivers/infiniband/hw/qib/qib_iba7322.c: In function 'qib_7322_txchk_change':
include/asm-generic/bitops/non-atomic.h:105:35: error: 'shadow' may be used uninitialized in this function [-Werror=maybe-uninitialized]

The code is right, and despite trying hard, I could not come up with a version
that I liked better than just adding a fake initialization here to shut up the
warning.

Fixes: f931551bafe1 ("IB/qib: Add new qib driver for QLogic PCIe InfiniBand adapters")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
7 years agoRDMA/iser: Fix possible mr leak on device removal event
Sagi Grimberg [Mon, 27 Feb 2017 18:16:33 +0000 (20:16 +0200)]
RDMA/iser: Fix possible mr leak on device removal event

When the rdma device is removed, we must cleanup all
the rdma resources within the DEVICE_REMOVAL event
handler to let the device teardown gracefully. When
this happens with live I/O, some memory regions are
occupied. Thus, track them too and dereg all the mr's.

We are safe with mr access by iscsi_iser_cleanup_task.

Reported-by: Raju Rangoju <rajur@chelsio.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
7 years agoIB/device: Convert ib-comp-wq to be CPU-bound
Sagi Grimberg [Wed, 8 Mar 2017 20:03:17 +0000 (22:03 +0200)]
IB/device: Convert ib-comp-wq to be CPU-bound

This workqueue is used by our storage target mode ULPs
via the new CQ API. Recent observations when working
with very high-end flash storage devices reveal that
UNBOUND workqueue threads can migrate between cpu cores
and even numa nodes (although some numa locality is accounted
for).

While this attribute can be useful in some workloads,
it does not fit in very nicely with the normal
run-to-completion model we usually use in our target-mode
ULPs and the block-mq irq<->cpu affinity facilities.

The whole block-mq concept is that the completion will
land on the same cpu where the submission was performed.
The fact that our submitter thread is migrating cpus
can break this locality.

We assume that as a target mode ULP, we will serve multiple
initiators/clients and we can spread the load enough without
having to use unbound kworkers.

Also, while we're at it, expose this workqueue via sysfs which
is harmless and can be useful for debug.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>--
Signed-off-by: Doug Ledford <dledford@redhat.com>
7 years agoIB/cq: Don't process more than the given budget
Sagi Grimberg [Thu, 16 Mar 2017 16:57:00 +0000 (18:57 +0200)]
IB/cq: Don't process more than the given budget

The caller might not want this overhead.

Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
7 years agoIB/rxe: increment msn only when completing a request
David Marchand [Fri, 24 Feb 2017 14:38:26 +0000 (15:38 +0100)]
IB/rxe: increment msn only when completing a request

According to C9-147, MSN should only be incremented when the last packet of
a multi packet request has been received.

"Logically, the requester associates a sequential Send Sequence Number
(SSN) with each WQE posted to the send queue. The SSN bears a one-
to-one relationship to the MSN returned by the responder in each re-
sponse packet. Therefore, when the requester receives a response, it in-
terprets the MSN as representing the SSN of the most recent request
completed by the responder to determine which send WQE(s) can be
completed."

Fixes: 8700e3e7c485 ("Soft RoCE driver")

Signed-off-by: David Marchand <david.marchand@6wind.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
7 years agouapi: fix rdma/mlx5-abi.h userspace compilation errors
Dmitry V. Levin [Fri, 24 Feb 2017 00:28:13 +0000 (03:28 +0300)]
uapi: fix rdma/mlx5-abi.h userspace compilation errors

Consistently use types from linux/types.h to fix the following
rdma/mlx5-abi.h userspace compilation errors:

/usr/include/rdma/mlx5-abi.h:69:25: error: 'u64' undeclared here (not in a function)
  MLX5_LIB_CAP_4K_UAR = (u64)1 << 0,
/usr/include/rdma/mlx5-abi.h:69:29: error: expected ',' or '}' before numeric constant
  MLX5_LIB_CAP_4K_UAR = (u64)1 << 0,

Include <linux/if_ether.h> to fix the following rdma/mlx5-abi.h
userspace compilation error:

/usr/include/rdma/mlx5-abi.h:286:12: error: 'ETH_ALEN' undeclared here (not in a function)
  __u8 dmac[ETH_ALEN];

Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
7 years agoIB/core: Restore I/O MMU, s390 and powerpc support
Bart Van Assche [Tue, 7 Mar 2017 22:56:53 +0000 (22:56 +0000)]
IB/core: Restore I/O MMU, s390 and powerpc support

Avoid that the following error message is reported on the console
while loading an RDMA driver with I/O MMU support enabled:

DMAR: Allocating domain for mlx5_0 failed

Ensure that DMA mapping operations that use to_pci_dev() to
access to struct pci_dev see the correct PCI device. E.g. the s390
and powerpc DMA mapping operations use to_pci_dev() even with I/O
MMU support disabled.

This patch preserves the following changes of the DMA mapping updates
patch series:
- Introduction of dma_virt_ops.
- Removal of ib_device.dma_ops.
- Removal of struct ib_dma_mapping_ops.
- Removal of an if-statement from each ib_dma_*() operation.
- IB HW drivers no longer set dma_device directly.

Reported-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Reported-by: Parav Pandit <parav@mellanox.com>
Fixes: commit 99db9494035f ("IB/core: Remove ib_device.dma_device")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: parav@mellanox.com
Tested-by: parav@mellanox.com
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
7 years agoIB/rxe: Update documentation link
Leon Romanovsky [Tue, 28 Feb 2017 19:42:53 +0000 (21:42 +0200)]
IB/rxe: Update documentation link

All Soft-RoCE (rxe) is handled now in rdma-core user space library,
so the documentation. The patch below updates the documentation
link to that new location.

Reported-by: Josh Beavers <josh.beavers@gmail.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
7 years agoRDMA/ocrdma: fix a type issue in ocrdma_put_pd_num()
Dan Carpenter [Thu, 23 Feb 2017 10:40:16 +0000 (13:40 +0300)]
RDMA/ocrdma: fix a type issue in ocrdma_put_pd_num()

We want to return zero on success or negative error codes.  The type
should be int and not u8.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
7 years agoIB/rxe: double free on error
Dan Carpenter [Wed, 8 Mar 2017 05:21:52 +0000 (08:21 +0300)]
IB/rxe: double free on error

"goto err;" has it's own kfree_skb() call so it's a double free.  We
only need to free on the "goto exit;" path.

Fixes: 8700e3e7c485 ("Soft RoCE driver")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
7 years agoRDMA/vmw_pvrdma: Activate device on ethernet link up
Aditya Sarwade [Thu, 23 Feb 2017 01:22:58 +0000 (17:22 -0800)]
RDMA/vmw_pvrdma: Activate device on ethernet link up

Restore device state when ethernet link changes to active.

Acked-by: George Zhang <georgezhang@vmware.com>
Acked-by: Jorgen Hansen <jhansen@vmware.com>
Acked-by: Bryan Tan <bryantan@vmware.com>
Signed-off-by: Aditya Sarwade <asarwade@vmware.com>
Signed-off-by: Adit Ranadive <aditr@vmware.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
7 years agoRDMA/vmw_pvrdma: Dont hardcode QP header page
Adit Ranadive [Thu, 23 Feb 2017 01:22:57 +0000 (17:22 -0800)]
RDMA/vmw_pvrdma: Dont hardcode QP header page

Moved the header page count to a macro.

Reported-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Adit Ranadive <aditr@vmware.com>
Reviewed-by: Aditya Sarwade <asarwade@vmware.com>
Tested-by: Andrew Boyer <andrew.boyer@dell.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
7 years agoRDMA/vmw_pvrdma: Cleanup unused variables
Adit Ranadive [Thu, 23 Feb 2017 01:22:56 +0000 (17:22 -0800)]
RDMA/vmw_pvrdma: Cleanup unused variables

Removed the unused nreq and redundant index variables.
Moved hardcoded async and cq ring pages number to macro.

Reported-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Adit Ranadive <aditr@vmware.com>
Reviewed-by: Aditya Sarwade <asarwade@vmware.com>
Tested-by: Andrew Boyer <andrew.boyer@dell.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
7 years agoMerge tag 'vfio-v4.11-rc4' of git://github.com/awilliam/linux-vfio
Linus Torvalds [Fri, 24 Mar 2017 21:39:36 +0000 (14:39 -0700)]
Merge tag 'vfio-v4.11-rc4' of git://github.com/awilliam/linux-vfio

Pull VFIO fix from Alex Williamson:
 "Rework sanity check for mdev driver group notifier de-registration
  (Alex Williamson)"

* tag 'vfio-v4.11-rc4' of git://github.com/awilliam/linux-vfio:
  vfio: Rework group release notifier warning

7 years agoMerge branch 'for-linus' of git://git.kernel.dk/linux-block
Linus Torvalds [Fri, 24 Mar 2017 21:37:12 +0000 (14:37 -0700)]
Merge branch 'for-linus' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "A few fixes for the current series that should go into -rc4. This
  contains:

   - a fix for a potential corruption of un-started requests from Ming.

   - a blk-stat fix from Omar, ensuring we flush the stat batch before
     checking nr_samples.

   - a set of fixes from Sagi for the nvmeof family"

* 'for-linus' of git://git.kernel.dk/linux-block:
  blk-mq: don't complete un-started request in timeout handler
  nvme-loop: handle cpu unplug when re-establishing the controller
  nvme-rdma: handle cpu unplug when re-establishing the controller
  nvmet-rdma: Fix a possible uninitialized variable dereference
  nvmet: confirm sq percpu has scheduled and switched to atomic
  nvme-loop: fix a possible use-after-free when destroying the admin queue
  blk-stat: fix blk_stat_sum() if all samples are batched

7 years agoMerge tag 'ceph-for-4.11-rc4' of git://github.com/ceph/ceph-client
Linus Torvalds [Fri, 24 Mar 2017 21:35:39 +0000 (14:35 -0700)]
Merge tag 'ceph-for-4.11-rc4' of git://github.com/ceph/ceph-client

Pull ceph fix from Ilya Dryomov:
 "A fix for a writeback deadlock caused by a GFP_KERNEL allocation on
  the reclaim path, tagged for stable"

* tag 'ceph-for-4.11-rc4' of git://github.com/ceph/ceph-client:
  libceph: force GFP_NOIO for socket allocations

7 years agoMerge tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Linus Torvalds [Fri, 24 Mar 2017 21:32:21 +0000 (14:32 -0700)]
Merge tag 'armsoc-fixes' of git://git./linux/kernel/git/arm/arm-soc

Pull ARM SoC fixes from Arnd Bergmann:

 - a couple of OMAP 4.11 regression fixes, including a boot regression
   for SmartReflex, hypervisor mode in thumb2 mode, and reference
   counting of device nodes

 - a fix for cpu_idle on at91

 - minor DT fixes on across several platforms: sunxi, bcm53xx, at91,
   nsp, ns2, ux500, omap

 - a fix to correct an API change in the reset controllers

* tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (22 commits)
  arm64: dts: NS2: Add dma-coherent to relevant DT entries
  reset: fix optional reset_control_get stubs to return NULL
  ARM: sun8i: a23/a33: drop bl_en_pin GPIO pinmux in reference design DTSI
  ARM: dts: sun7i: lamobo-r1: Fix CPU port RGMII settings
  ARM: dts: NSP: GPIO reboot open-source
  ARM: at91: pm: cpu_idle: switch DDR to power-down mode
  ARM: dts: add the AB8500 clocks to the device tree
  ARM: dts: imx6sx-udoo-neo: Fix reboot hang
  ARM: sun8i: Fix the mali clock rate
  ARM: dts: BCM5301X: Correct GIC_PPI interrupt flags
  ARM: dts: BCM5301X: Fix memory start address
  ARM: dts: BCM5301X: Fix UARTs on bcm953012k
  Revert "ARM: at91/dt: sama5d2: Use new compatible for ohci node"
  ARM: OMAP2+: Release device node after it is no longer needed.
  ARM: OMAP2+: Fix device node reference counts
  ARM: OMAP2+: Remove legacy gpmc-nand.c
  ARM: OMAP2+: gpmc-onenand: propagate error on initialization failure
  ARM: dts: am335x-pcm953: Fix legacy wakeup source binding
  ARM: omap2plus_defconfig: Enable INPUT_MOUSEDEV as loadable modules
  ARM: dts: am57xx-idk: tpic2810 is on I2C bus, not SPI
  ...

7 years agoMerge tag 'for-linus-4.11b-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Fri, 24 Mar 2017 21:29:23 +0000 (14:29 -0700)]
Merge tag 'for-linus-4.11b-rc4-tag' of git://git./linux/kernel/git/xen/tip

Pull xen fixes from Juergen Gross:
 "Fixes for PM under Xen"

* tag 'for-linus-4.11b-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/acpi: upload PM state from init-domain to Xen
  xen/acpi: Replace hard coded "ACPI0007"