Eric Lee / smarc-fsl-linux-kernel

04 Nov, 2017

29 commits

2a171788b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Files removed in 'net-next' had their license header updated
in 'net'. We take the remove from 'net-next'.

Signed-off-by: David S. Miller

David S. Miller
2017-11-04 08:26:51 +0800
bf5345882 liquidio: Fix an issue with multiple switchdev enable disables

Return success if the same dispatch function is being registered again
for a given opcode and subcode, thereby allowing multiple switchdev
enables and disables.
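
As a rough illustration of the idempotent registration described above,
the following C sketch models a dispatch table keyed by opcode and
subcode. All names (ex_dispatch_table, ex_register_dispatch, the two
handlers) are hypothetical and are not the liquidio driver's API.
Rejecting a different function for an already-used key is an assumption
of this sketch, not something the commit message states.

```c
#include <assert.h>
#include <stddef.h>

#define EX_MAX_HANDLERS 16	/* arbitrary capacity for the sketch */

typedef int (*ex_dispatch_fn)(void *arg);

struct ex_dispatch_entry {
	int opcode;
	int subcode;
	ex_dispatch_fn fn;
};

struct ex_dispatch_table {
	struct ex_dispatch_entry entries[EX_MAX_HANDLERS];
	size_t count;
};

/* Two placeholder handlers, used only to exercise the table. */
int ex_handler_a(void *arg) { (void)arg; return 0; }
int ex_handler_b(void *arg) { (void)arg; return 1; }

/*
 * Register fn for (opcode, subcode). Returns 0 on success, including the
 * case where the very same fn is already registered for that key, and -1
 * if a different fn holds the key or the table is full.
 */
int ex_register_dispatch(struct ex_dispatch_table *t, int opcode,
			 int subcode, ex_dispatch_fn fn)
{
	size_t i;

	assert(t != NULL && fn != NULL);

	for (i = 0; i < t->count; i++) {
		struct ex_dispatch_entry *e = &t->entries[i];

		if (e->opcode == opcode && e->subcode == subcode)
			return e->fn == fn ? 0 : -1;
	}

	if (t->count == EX_MAX_HANDLERS)
		return -1;

	t->entries[t->count].opcode = opcode;
	t->entries[t->count].subcode = subcode;
	t->entries[t->count].fn = fn;
	t->count++;
	return 0;
}
```

With this shape, enabling the feature twice registers the handler once
and both calls report success.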

Signed-off-by: Vijaya Mohan Guvva
Signed-off-by: Satanand Burla
Signed-off-by: Felix Manlunas
Signed-off-by: David S. Miller

Vijaya Mohan Guvva
2017-11-04 08:17:29 +0800
de4cc8bd6 Merge branch 'mlxsw-Handle-changes-in-GRE-configuration'

Jiri Pirko says:

====================
mlxsw: Handle changes in GRE configuration

Petr says:

Until now, when an IP tunnel was offloaded by the mlxsw driver, the
offload was pretty much static, and changes in Linux configuration were
not reflected in the hardware. That led to discrepancies between traffic
flows in slow path and fast path. The work-around used to be to remove
all routes that forward to the netdevice and re-add them. This is
clearly suboptimal, but actually, as of the decap-only patchset, it's
not even enough anymore, and one needs to go all the way and simply drop
the tunnel and recreate it correctly.

With this patchset, the NETDEV_CHANGE events that are generated for
changes of up'd tunnel netdevices are captured and interpreted to
correctly reconfigure the HW in accordance with changes requested at the
software layer. In addition, NETDEV_CHANGEUPPER, NETDEV_UP and
NETDEV_DOWN are now handled not only for tunnel devices themselves, but
also for their bound devices. Each change is then translated to one or
more of the following updates to the HW configuration:

- refresh of offload of local route that corresponds to tunnel's local
address
- refresh of the loopback RIF
- refresh of offloads of routes that forward to the changed tunnel
- removal of tunnel offloads

These tools are used to implement the following configuration changes:

- addition of a new offloadable tunnel with local address that conflicts
with that of an already-offloaded tunnel (the existing tunnel is
onloaded, the new one isn't offloaded)
- changes to TTL, TOS that make tunnel unsuitable for offloading
- changes to ikey, okey, remote
- changes to local, which when they cause conflict with another
tunnel, lead to onloading of both newly-conflicting tunnels
- migration of a bound device of an offloaded tunnel device to a
different VRF
- changes to what device is bound to a tunnel device (i.e. like what
"ip tunnel change name g dev another" does)
- changes to up / down state of a bound device. A down bound device
doesn't forward encapsulated traffic anymore, but decap still works.

This patchset starts with a suite of patches that adapt the existing
code base step by step to facilitate introduction of the offloading
code. The five substantial patches at the end then implement the changes
mentioned above.
====================

Signed-off-by: David S. Miller

David S. Miller
2017-11-04 08:15:18 +0800
44b0fff1d mlxsw: spectrum_router: Handle down of tunnel underlay

When the bound device of a tunnel device is down, encapsulated packets
are not egressed anymore, but tunnel decap still works. Extend
mlxsw_sp_nexthop_rif_update() to take IFF_UP into consideration when
deciding whether a given next hop should be offloaded.

Because the new logic was added to mlxsw_sp_nexthop_rif_update(), this
fixes the case where a newly-added tunnel has a down bound device, which
would previously be fully offloaded. Now the down state of the bound
device is noted and next hops forwarding to such tunnel are not
offloaded.

In addition to that, notice NETDEV_UP and NETDEV_DOWN of a bound device
to force refresh of tunnel encap route offloads.

Signed-off-by: Petr Machata
Reviewed-by: Ido Schimmel
Signed-off-by: Jiri Pirko
Signed-off-by: David S. Miller

Petr Machata
2017-11-04 08:15:18 +0800
89c2b7dab mlxsw: spectrum_ipip: Handle underlay device change

When a bound device of an IP-in-IP tunnel changes, such as through
'ip tunnel change name $name dev $dev', the loopback backing the tunnel
needs to be recreated.

Signed-off-by: Petr Machata
Reviewed-by: Ido Schimmel
Signed-off-by: Jiri Pirko
Signed-off-by: David S. Miller

Petr Machata
2017-11-04 08:15:18 +0800
4cf04f3ff mlxsw: spectrum: Handle NETDEV_CHANGE on L3 tunnels

Changes to L3 tunnel netdevices (through `ip tunnel change' as well as
`ip link set') lead to NETDEV_CHANGE being generated on the tunnel
device. Because what is relevant for the tunnel in question depends on
the tunnel type, handling of the event is dispatched to the IPIP module
through a newly-added interface mlxsw_sp_ipip_ops.ol_netdev_change().

IPIP tunnels now remember the last set of tunnel parameters in struct
mlxsw_sp_ipip_entry.parms, and use it to figure out what exactly has
changed.

Signed-off-by: Petr Machata
Reviewed-by: Ido Schimmel
Signed-off-by: Jiri Pirko
Signed-off-by: David S. Miller

Petr Machata
2017-11-04 08:15:18 +0800
61481f2fc mlxsw: spectrum: Support IPIP underlay VRF migration

When a bound device of a tunnel netdevice changes VRF, the loopback RIF
that backs the tunnel needs to be updated and existing encapsulating
routes need to be refreshed.

Note that several tunnels can share the same bound device, in which case
all the impacted tunnels need to be updated.

Signed-off-by: Petr Machata
Reviewed-by: Ido Schimmel
Signed-off-by: Jiri Pirko
Signed-off-by: David S. Miller

Petr Machata
2017-11-04 08:15:18 +0800
af641713e mlxsw: spectrum_router: Onload conflicting tunnels

The approach for offloading IP tunnels implemented currently by mlxsw
doesn't allow two tunnels that have the same local IP address in the
same (underlay) VRF. Previously, offloads were introduced on demand as
encap routes were formed. When such a route was created that would cause
offload of a conflicting tunnel, mlxsw_sp_ipip_entry_create() would
detect it and return -EEXIST, which would propagate up and cause FIB
abort.

Now however IPIP entries are created as soon as an offloadable netdevice
is created, and the failure prevents creation of such device.
Furthermore, if the driver is installed at the point where such
conflicting tunnels exist, the failure actually prevents successful
modprobe.

Furthermore, follow-up patches implement handling of NETDEV_CHANGE due
to the local address change. However, NETDEV_CHANGE can't be vetoed. The
failure merely means that the offloads weren't updated, but the change
in Linux configuration is not rolled back. It is thus desirable to have
a robust way of handling these conflicts, which can later be reused for
handling NETDEV_CHANGE as well.

To fix this, when a conflicting tunnel is created, instead of failing,
simply pull the old tunnel to slow path and reject offloading the
new one.
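
The conflict policy just described (demote the existing tunnel, do not
offload the new one) can be modelled in a few lines. The sketch below is
a toy model under stated assumptions: struct ex_tunnel and
ex_tunnel_admit() are hypothetical names, addresses are plain 32-bit
values, and no hardware is involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct ex_tunnel {
	uint32_t saddr;		/* local (underlay source) address */
	int ul_vrf;		/* underlay VRF identifier */
	bool offloaded;		/* true: fast path, false: slow path */
};

/*
 * Decide the offload state of new_tun against the existing tunnels.
 * Every offloaded tunnel with the same local address in the same
 * underlay VRF is demoted to slow path, and in that case new_tun is not
 * offloaded either. Returns true if new_tun ends up offloaded.
 */
bool ex_tunnel_admit(struct ex_tunnel *tunnels, size_t n,
		     struct ex_tunnel *new_tun)
{
	bool conflict = false;
	size_t i;

	assert(new_tun != NULL && (tunnels != NULL || n == 0));

	for (i = 0; i < n; i++) {
		if (tunnels[i].offloaded &&
		    tunnels[i].saddr == new_tun->saddr &&
		    tunnels[i].ul_vrf == new_tun->ul_vrf) {
			tunnels[i].offloaded = false;
			conflict = true;
		}
	}

	new_tun->offloaded = !conflict;
	return new_tun->offloaded;
}
```

Note that the function never fails: a conflict changes offload state
only, which is what makes the same approach usable for events that
cannot be vetoed.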

Introduce two functions: mlxsw_sp_ipip_entry_demote_tunnel() and
mlxsw_sp_ipip_demote_tunnel_by_saddr() to handle this. Make them both
public, because they will be useful later on in this patchset.

Signed-off-by: Petr Machata
Reviewed-by: Ido Schimmel
Signed-off-by: Jiri Pirko
Signed-off-by: David S. Miller

Petr Machata
2017-11-04 08:15:18 +0800
4526cc8ae mlxsw: spectrum_router: Fix saddr deduction in mlxsw_sp_ipip_entry_create()

When trying to determine whether there are other offloaded tunnels with
the same local address, mlxsw_sp_ipip_entry_create() should look for a
tunnel with matching UL protocol, matching saddr, in the same VRF.
However instead of taking into account the UL protocol of the tunnel
netdevice (which mlxsw_sp_ipip_entry_saddr_matches() then compares to
the UL protocol of inspected IPIP entry), it deduces the UL protocol
from the inspected IPIP entry (and that's compared to itself).
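
To make the bug pattern concrete, here is a minimal sketch of the
lookup with the protocol taken from the tunnel being created. The types
and function names are hypothetical, and the address is reduced to an
opaque 32-bit identifier so that the same field can stand in for either
protocol; the real driver code is more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum ex_ul_proto { EX_UL_IPV4, EX_UL_IPV6 };

struct ex_ipip_entry {
	enum ex_ul_proto ul_proto;
	uint32_t saddr_id;	/* opaque stand-in for the local address */
	int ul_vrf;
};

/* True if entry has the given underlay protocol, address and VRF. */
bool ex_entry_saddr_matches(const struct ex_ipip_entry *entry,
			    enum ex_ul_proto ul_proto, uint32_t saddr_id,
			    int ul_vrf)
{
	return entry->ul_proto == ul_proto &&
	       entry->saddr_id == saddr_id &&
	       entry->ul_vrf == ul_vrf;
}

/*
 * new_proto must describe the tunnel being created. Passing
 * entries[i].ul_proto here instead would compare each entry's protocol
 * with itself, so the protocol check could never fail.
 */
bool ex_saddr_conflict(const struct ex_ipip_entry *entries, size_t n,
		       enum ex_ul_proto new_proto, uint32_t new_saddr_id,
		       int new_vrf)
{
	size_t i;

	assert(entries != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (ex_entry_saddr_matches(&entries[i], new_proto,
					   new_saddr_id, new_vrf))
			return true;
	}
	return false;
}
```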

This is currently immaterial, because only one tunnel type is offloaded,
and therefore the UL protocol always matches, but introducing support
for a tunnel with IPv6 underlay would uncover this error.

Signed-off-by: Petr Machata
Reviewed-by: Ido Schimmel
Signed-off-by: Jiri Pirko
Signed-off-by: David S. Miller

Petr Machata
2017-11-04 08:15:17 +0800
0c5f1cd5b mlxsw: spectrum_router: Generalize __mlxsw_sp_ipip_entry_update_tunnel()

The work that needs to be done to update HW configuration in response to
changes is similar to what __mlxsw_sp_ipip_entry_update_tunnel() already
does, but with a number of twists: each change requires a different
subset of things to happen. Extend the function to support all these
uses, and allow finely-grained configuration of what should happen at
each call through a suite of function arguments.

Publish the updated function to allow use from the spectrum_ipip module.

Signed-off-by: Petr Machata
Reviewed-by: Ido Schimmel
Signed-off-by: Jiri Pirko
Signed-off-by: David S. Miller

Petr Machata
2017-11-04 08:15:17 +0800
65a6121b3 mlxsw: spectrum_router: Extract __mlxsw_sp_ipip_entry_update_tunnel()

The work that's done by mlxsw_sp_netdevice_ipip_ol_vrf_event() is a good
basis for a more versatile function that would take care of all sorts of
tunnel update requests: __mlxsw_sp_ipip_entry_update_tunnel(). Extract
that function. Factor out a helper mlxsw_sp_ipip_entry_ol_lb_update() as
well.

Signed-off-by: Petr Machata
Reviewed-by: Ido Schimmel
Signed-off-by: Jiri Pirko
Signed-off-by: David S. Miller

Petr Machata
2017-11-04 08:15:17 +0800
7e75af636 mlxsw: spectrum: Propagate extack for tunnel events

The function mlxsw_sp_rif_create() takes an extack parameter. So far,
for creation of loopback interfaces, NULL was passed. For some events
however the extack can be extracted and passed along. So do that for
NETDEV_CHANGEUPPER handler.

Use the opportunity to update the type of info argument that
mlxsw_sp_netdevice_ipip_ol_event() takes. Follow-up patches will
introduce handling of more changes, and some of them carry an extack as
well, but in an info structure of a different type. Though not strictly
erroneous (the pointer could be cast whichever way), it makes no sense
to pretend the value is always of a certain type, when in fact it isn't.
So change the prototype of the above-mentioned function as well.

Signed-off-by: Petr Machata
Reviewed-by: Ido Schimmel
Signed-off-by: Jiri Pirko
Signed-off-by: David S. Miller

Petr Machata
2017-11-04 08:15:17 +0800
47518ca5d mlxsw: spectrum_router: Extract mlxsw_sp_ipip_entry_ol_up_event()

The piece of logic to promote decap route, if any, is useful for generic
tunnel updates, not just for handling of NETDEV_UP events on tunnel
interfaces. Extract it to a separate function.

Signed-off-by: Petr Machata
Reviewed-by: Ido Schimmel
Signed-off-by: Jiri Pirko
Signed-off-by: David S. Miller

Petr Machata
2017-11-04 08:15:17 +0800
6d4de4455 mlxsw: spectrum_router: Make mlxsw_sp_netdevice_ipip_ol_up_event() void

This function only ever returns 0, so don't pretend it returns anything
useful and just make it void.

Signed-off-by: Petr Machata
Reviewed-by: Ido Schimmel
Signed-off-by: Jiri Pirko
Signed-off-by: David S. Miller

Petr Machata
2017-11-04 08:15:17 +0800
a3fe198ec mlxsw: spectrum_router: Extract mlxsw_sp_ipip_entry_ol_down_event()

Signed-off-by: Petr Machata
Reviewed-by: Ido Schimmel
Signed-off-by: Jiri Pirko
Signed-off-by: David S. Miller

Petr Machata
2017-11-04 08:15:17 +0800
9fb7bd77d mlxsw: spectrum_ipip: Split accessor functions

To implement NETDEV_CHANGE notifications on IP-in-IP tunnels, the
handler needs to figure out what actually changed, to understand how
exactly to update the offloads. It will do so by storing struct
ip_tunnel_parm with previous configuration, and comparing that to the
new version.

To facilitate these comparisons, extract the code that operates on
struct ip_tunnel_parm from the existing accessor functions, and make
those a thin wrapper that extracts tunnel parameters and dispatches.

Signed-off-by: Petr Machata
Reviewed-by: Ido Schimmel
Signed-off-by: Jiri Pirko
Signed-off-by: David S. Miller

Petr Machata
2017-11-04 08:15:17 +0800
474f0ff61 mlxsw: spectrum: Move mlxsw_sp_ipip_netdev_{s, d}addr{, 4}()

These functions ideologically belong to the IPIP module, and some
follow-up work will benefit from their presence there.

Signed-off-by: Petr Machata
Reviewed-by: Ido Schimmel
Signed-off-by: Jiri Pirko
Signed-off-by: David S. Miller

Petr Machata
2017-11-04 08:15:17 +0800
cafdb2a0d mlxsw: spectrum_router: Extract mlxsw_sp_netdevice_ipip_can_offload()

Some of the code down the road needs this logic as well.

Signed-off-by: Petr Machata
Reviewed-by: Ido Schimmel
Signed-off-by: Jiri Pirko
Signed-off-by: David S. Miller

Petr Machata
2017-11-04 08:15:17 +0800
796ec7769 mlxsw: spectrum: Rename IPIP-related netdevice handlers

To distinguish between events related to tunnel device itself and its
bound device, rename a number of functions related to handling tunneling
netdevice events to include _ol_ (for "overlay") in the name. That
leaves room in the namespace for underlay-related functions, which would
have _ul_ in the name.

Signed-off-by: Petr Machata
Reviewed-by: Ido Schimmel
Signed-off-by: Jiri Pirko
Signed-off-by: David S. Miller

Petr Machata
2017-11-04 08:15:17 +0800
d4c2e9fca Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux

Pull clk fix from Stephen Boyd:
"One fix for USB clks on Uniphier PXs3 SoCs"

* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: uniphier: fix clock data for PXs3

Linus Torvalds
2017-11-04 04:56:15 +0800
81ca2caef Merge git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile

Pull arch/tile fixes from Chris Metcalf:
"Two one-line bug fixes"

* git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile:
arch/tile: Implement ->set_state_oneshot_stopped()
tile: pass machine size to sparse

Linus Torvalds
2017-11-04 01:36:43 +0800
01073ac1b Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fix from James Bottomley:
"One minor fix in the error leg of the qla2xxx driver (it oopses the
system if we get an error trying to start the internal kernel thread).

The fix is minor because the problem isn't often encountered in the
field (although it can be induced by inserting the module in a low
memory environment)"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: qla2xxx: Fix oops in qla2x00_probe_one error path

Linus Torvalds
2017-11-04 01:26:40 +0800
777a45b45 arch/tile: Implement ->set_state_oneshot_stopped()

set_state_oneshot_stopped() is called by the clkevt core, when the
next event is required at an expiry time of 'KTIME_MAX'. This normally
happens with NO_HZ_{IDLE|FULL} in both LOWRES/HIGHRES modes.

This patch makes the clockevent device stop on such an event, to
avoid spurious interrupts, as explained by commit 8fff52fd5093
("clockevents: Introduce CLOCK_EVT_STATE_ONESHOT_STOPPED state").

Signed-off-by: Chris Metcalf

Chris Metcalf
2017-11-04 01:20:54 +0800
866ba84ea Merge tag 'powerpc-4.14-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
"Some more powerpc fixes for 4.14.

This is bigger than I like to send at rc7, but that's at least partly
because I didn't send any fixes last week. If it wasn't for the IMC
driver, which is new and getting heavy testing, the diffstat would
look a bit better. I've also added ftrace on big endian to my test
suite, so we shouldn't break that again in future.

- A fix to the handling of misaligned paste instructions (P9 only),
where a change to a #define has caused the check for the
instruction to always fail.

- The preempt handling was unbalanced in the radix THP flush (P9
only). Though we don't generally use preempt we want to keep it
working as much as possible.

- Two fixes for IMC (P9 only), one when booting with restricted
number of CPUs and one in the error handling when initialisation
fails due to firmware etc.

- A revert to fix function_graph on big endian machines, and then a
rework of the reverted patch to fix kprobes blacklist handling on
big endian machines.

Thanks to: Anju T Sudhakar, Guilherme G. Piccoli, Madhavan Srinivasan,
Naveen N. Rao, Nicholas Piggin, Paul Mackerras"

* tag 'powerpc-4.14-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/perf: Fix core-imc hotplug callback failure during imc initialization
powerpc/kprobes: Dereference function pointers only if the address does not belong to kernel text
Revert "powerpc64/elfv1: Only dereference function descriptor for non-text symbols"
powerpc/64s/radix: Fix preempt imbalance in TLB flush
powerpc: Fix check for copy/paste instructions in alignment handler
powerpc/perf: Fix IMC allocation routine

Linus Torvalds
2017-11-04 00:25:53 +0800
3f46540ee Merge tag 'mmc-v4.14-rc4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc

Pull MMC fixes from Ulf Hansson:
"Fix dw_mmc request timeout issues"

* tag 'mmc-v4.14-rc4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
mmc: dw_mmc: Fix the DTO timeout calculation
mmc: dw_mmc: Add locking to the CTO timer
mmc: dw_mmc: Fix the CTO timeout calculation
mmc: dw_mmc: cancel the CTO timer after a voltage switch

Linus Torvalds
2017-11-04 00:19:20 +0800
e65a139d5 Merge tag 'drm-fixes-for-v4.14-rc8' of git://people.freedesktop.org/~airlied/linux

Pull drm fixes from Dave Airlie:

- one nouveau regression fix

- some amdgpu fixes for stable to fix hangs on some harvested Polaris
GPUs

- a set of KASAN and regression fixes for i915, their CI system seems
to be working pretty well now.

* tag 'drm-fixes-for-v4.14-rc8' of git://people.freedesktop.org/~airlied/linux:
drm/amdgpu: allow harvesting check for Polaris VCE
drm/amdgpu: return -ENOENT from uvd 6.0 early init for harvesting
drm/i915: Check incoming alignment for unfenced buffers (on i915gm)
drm/nouveau/kms/nv50: use the correct state for base channel notifier setup
drm/i915: Hold rcu_read_lock when iterating over the radixtree (vma idr)
drm/i915: Hold rcu_read_lock when iterating over the radixtree (objects)
drm/i915/edp: read edp display control registers unconditionally
drm/i915: Do not rely on wm preservation for ILK watermarks
drm/i915: Cancel the modeset retry work during modeset cleanup

Linus Torvalds
2017-11-04 00:14:22 +0800
7ba3ebff9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Pull networking fixes from David Miller:
"Hopefully this is the last batch of networking fixes for 4.14

Fingers crossed...

1) Fix stmmac to use the proper sized OF property read, from Bhadram
Varka.

2) Fix use after free in net scheduler tc action code, from Cong
Wang.

3) Fix SKB control block mangling in tcp_make_synack().

4) Use proper locking in fib_dump_info(), from Florian Westphal.

5) Fix IPG encodings in systemport driver, from Florian Fainelli.

6) Fix division by zero in NV TCP congestion control module, from
Konstantin Khlebnikov.

7) Fix use after free in nf_reject_ipv4, from Tejaswi Tanikella"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
net: systemport: Correct IPG length settings
tcp: do not mangle skb->cb[] in tcp_make_synack()
fib: fib_dump_info can no longer use __in_dev_get_rtnl
stmmac: use of_property_read_u32 instead of read_u8
net_sched: hold netns refcnt for each action
net_sched: acquire RTNL in tc_action_net_exit()
net: vrf: correct FRA_L3MDEV encode type
tcp_nv: fix division by zero in tcpnv_acked()
netfilter: nf_reject_ipv4: Fix use-after-free in send_reset
netfilter: nft_set_hash: disable fast_ops for 2-len keys

Linus Torvalds
2017-11-04 00:09:21 +0800
f0395d5b4 Merge branch 'akpm' (patches from Andrew)

Merge misc fixes from Andrew Morton:
"7 fixes"

* emailed patches from Andrew Morton:
mm, swap: fix race between swap count continuation operations
mm/huge_memory.c: deposit page table when copying a PMD migration entry
initramfs: fix initramfs rebuilds w/ compression after disabling
fs/hugetlbfs/inode.c: fix hwpoison reserve accounting
ocfs2: fstrim: Fix start offset of first cluster group during fstrim
mm, /proc/pid/pagemap: fix soft dirty marking for PMD migration entry
userfaultfd: hugetlbfs: prevent UFFDIO_COPY to fill beyond the end of i_size

Linus Torvalds
2017-11-04 00:03:50 +0800
fb615d61b Update MIPS email addresses

MIPS will soon not be a part of Imagination Technologies, and as such
many @imgtec.com email addresses will no longer be valid. This patch
updates the addresses for those who:

- Have 10 or more patches in mainline authored using an @imgtec.com
email address, or any patches dated within the past year.

- Are still with Imagination but leaving as part of the MIPS business
unit, as determined from an internal email address list.

- Haven't already updated their email address (e.g. JamesH) or expressed
a desire to be excluded (e.g. Maciej).

- Acked v2 or earlier of this patch, which leaves Deng-Cheng, Matt &
myself.

New addresses are of the form firstname.lastname@mips.com, and all
verified against an internal email address list. An entry is added to
.mailmap for each person such that get_maintainer.pl will report the new
addresses rather than @imgtec.com addresses which will soon be dead.

Instances of the affected addresses throughout the tree are then
mechanically replaced with the new @mips.com address.

Signed-off-by: Paul Burton
Cc: Deng-Cheng Zhu
Acked-by: Dengcheng Zhu
Cc: Matt Redfearn
Acked-by: Matt Redfearn
Cc: Andrew Morton
Cc: linux-kernel@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: trivial@kernel.org
Signed-off-by: Linus Torvalds

Paul Burton
2017-11-04 00:02:30 +0800

03 Nov, 2017

11 commits

941f5f0f6 x86: CPU: Fix up "cpu MHz" in /proc/cpuinfo

Commit 890da9cf0983 (Revert "x86: do not use cpufreq_quick_get() for
/proc/cpuinfo "cpu MHz"") is not sufficient to restore the previous
behavior of "cpu MHz" in /proc/cpuinfo on x86 due to some changes
made after the commit it has reverted.

To address this, make the code in question use arch_freq_get_on_cpu(),
which cpufreq also uses for reporting the current frequency of CPUs.
Since that function doesn't really depend on cpufreq in any way, drop
the CONFIG_CPU_FREQ dependency for the object file containing it.

Also refactor arch_freq_get_on_cpu() somewhat to avoid IPIs and
return cached values right away if it is called very often over a
short time (to prevent user space from triggering IPI storms through
it).

Fixes: 890da9cf0983 (Revert "x86: do not use cpufreq_quick_get() for /proc/cpuinfo "cpu MHz"")
Cc: stable@kernel.org # 4.13 - together with 890da9cf0983
Signed-off-by: Rafael J. Wysocki
Signed-off-by: Linus Torvalds

Rafael J. Wysocki
2017-11-03 23:50:13 +0800
2628bd6fc mm, swap: fix race between swap count continuation operations

One page may store a set of entries of the sis->swap_map
(swap_info_struct->swap_map) in multiple swap clusters.

If some of the entries have sis->swap_map[offset] > SWAP_MAP_MAX,
multiple pages will be used to store the set of entries of the
sis->swap_map, and the pages are linked with page->lru. This is called
swap count continuation. Previously, sis->lock was used to serialize
concurrent access to the pages which store the set of entries of the
sis->swap_map. But to improve the scalability of __swap_duplicate(), the
swap cluster lock may now be used in swap_count_continued(). This may
race with add_swap_count_continuation() operating on a nearby swap
cluster whose sis->swap_map entries are stored in the same page.

The race can cause a wrong swap count in practice, and thus unfreeable
swap entries, software lockups, etc.

To fix the race, a new spin lock called cont_lock is added to struct
swap_info_struct to protect the swap count continuation page list. This
is a lock at the swap device level, so its scalability isn't very good.
But it is still much better than the original sis->lock, because it is
only acquired/released when swap count continuation is used, which is
considered rare in practice. If it turns out that the scalability
becomes an issue for some workloads, we can split the lock into
finer-grained locks.

Link: http://lkml.kernel.org/r/20171017081320.28133-1-ying.huang@intel.com
Fixes: 235b62176712 ("mm/swap: add cluster lock")
Signed-off-by: "Huang, Ying"
Cc: Johannes Weiner
Cc: Shaohua Li
Cc: Tim Chen
Cc: Michal Hocko
Cc: Aaron Lu
Cc: Dave Hansen
Cc: Andi Kleen
Cc: Minchan Kim
Cc: Hugh Dickins
Cc: [4.11+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Huang Ying
2017-11-03 22:39:19 +0800
dd8a67f9a mm/huge_memory.c: deposit page table when copying a PMD migration entry

We need to deposit pre-allocated PTE page table when a PMD migration
entry is copied in copy_huge_pmd(). Otherwise, we will leak the
pre-allocated page and cause a NULL pointer dereference later in
zap_huge_pmd().

The missing counters during PMD migration entry copy process are added
as well.

The bug report is here: https://lkml.org/lkml/2017/10/29/214

Link: http://lkml.kernel.org/r/20171030144636.4836-1-zi.yan@sent.com
Fixes: 84c3fc4e9c563 ("mm: thp: check pmd migration entry in common path")
Signed-off-by: Zi Yan
Reported-by: Fengguang Wu
Acked-by: Kirill A. Shutemov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Zi Yan
2017-11-03 22:39:19 +0800
e08b18774 initramfs: fix initramfs rebuilds w/ compression after disabling

This is a follow-up to commit 57ddfdaa9a72 ("initramfs: fix disabling of
initramfs (and its compression)"). This particular commit fixed the use
case where we build the kernel with an initramfs with no compression,
and then we build the kernel with no initramfs.

Now this still left us with the same case as described here:

http://lkml.kernel.org/r/20170521033337.6197-1-f.fainelli@gmail.com

not working with initramfs compression. This can be seen by the
following steps/timestamps:

https://www.spinics.net/lists/kernel/msg2598153.html

.initramfs_data.cpio.gz.cmd is correct:

cmd_usr/initramfs_data.cpio.gz := /bin/bash
./scripts/gen_initramfs_list.sh -o usr/initramfs_data.cpio.gz -u 1000 -g 1000 /home/fainelli/work/uclinux-rootfs/romfs /home/fainelli/work/uclinux-rootfs/misc/initramfs.dev

and was generated the first time we generated the gzip initramfs. Since
neither the command nor its arguments have changed, we just don't call
it again, and as a consequence no initramfs cpio is regenerated.

The fix for this problem is just to properly keep track of the
.initramfs_cpio_data.d file by suffixing it with the compression
extension. This takes care of properly tracking dependencies such that
the initramfs gets (re)generated any time files are added/deleted etc.

Link: http://lkml.kernel.org/r/20170930033936.6722-1-f.fainelli@gmail.com
Fixes: db2aa7fd15e8 ("initramfs: allow again choice of the embedded initramfs compression algorithm")
Fixes: 9e3596b0c653 ("kbuild: initramfs cleanup, set target from Kconfig")
Signed-off-by: Florian Fainelli
Cc: "Francisco Blas Izquierdo Riera (klondike)"
Cc: Nicholas Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Florian Fainelli
2017-11-03 22:39:19 +0800
ab615a5b8 fs/hugetlbfs/inode.c: fix hwpoison reserve accounting

Calling madvise(MADV_HWPOISON) on a hugetlbfs page will result in bad
(negative) reserved huge page counts. This may not happen immediately,
but may happen later when the underlying file is removed or filesystem
unmounted. For example:

AnonHugePages: 0 kB
ShmemHugePages: 0 kB
HugePages_Total: 1
HugePages_Free: 0
HugePages_Rsvd: 18446744073709551615
HugePages_Surp: 0
Hugepagesize: 2048 kB
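
The HugePages_Rsvd value shown above is 2^64 - 1, which is what an
unsigned 64-bit counter holds after being decremented once from zero.
The small sketch below only demonstrates that arithmetic; the function
names are hypothetical and it is not the hugetlb accounting code.

```c
#include <assert.h>
#include <stdint.h>

/* Release n reservations from an unsigned counter, with no underflow check. */
uint64_t ex_release_reservations(uint64_t reserved, uint64_t n)
{
	return reserved - n;	/* wraps modulo 2^64 if n > reserved */
}

/* Same operation, but treating an over-release as a programming error. */
uint64_t ex_release_checked(uint64_t reserved, uint64_t n)
{
	assert(n <= reserved);
	return reserved - n;
}
```

Calling ex_release_reservations(0, 1) yields 18446744073709551615, the
same number reported in the listing, which is why the commit describes
the count as negative.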

In routine hugetlbfs_error_remove_page(), hugetlb_fix_reserve_counts is
called after remove_huge_page. hugetlb_fix_reserve_counts is designed
to be called only if a failure is returned from
hugetlb_unreserve_pages. Therefore, call hugetlb_unreserve_pages as
required and only call hugetlb_fix_reserve_counts in the unlikely event
that hugetlb_unreserve_pages returns an error.

Link: http://lkml.kernel.org/r/20171019230007.17043-2-mike.kravetz@oracle.com
Fixes: 78bb920344b8 ("mm: hwpoison: dissolve in-use hugepage in unrecoverable memory error")
Signed-off-by: Mike Kravetz
Acked-by: Naoya Horiguchi
Cc: Michal Hocko
Cc: Aneesh Kumar
Cc: Anshuman Khandual
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mike Kravetz
2017-11-03 22:39:19 +0800
105ddc93f ocfs2: fstrim: Fix start offset of first cluster group during fstrim

The first cluster group descriptor is not stored at the start of the
group but at an offset from the start. We need to take this into
account while doing fstrim on the first cluster group. Otherwise we
will wrongly start fstrim a few blocks after the desired start block and
the range can cross over into the next cluster group and zero out the
group descriptor there. This can cause filesystem corruption that cannot
be fixed by fsck.

Link: http://lkml.kernel.org/r/1507835579-7308-1-git-send-email-ashish.samant@oracle.com
Signed-off-by: Ashish Samant
Reviewed-by: Junxiao Bi
Reviewed-by: Joseph Qi
Cc: Mark Fasheh
Cc: Joel Becker
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ashish Samant
2017-11-03 22:39:19 +0800
b83d7e432 mm, /proc/pid/pagemap: fix soft dirty marking for PMD migration entry

When the pagetable is walked in the implementation of /proc/<pid>/pagemap,
pmd_soft_dirty() is used for both the PMD huge page map and the PMD
migration entries. That is wrong: pmd_swp_soft_dirty() should be used
for the PMD migration entries instead, because a different page table
entry flag is used.

As a result, /proc/pid/pagemap may report incorrect soft dirty information
for PMD migration entries.

Link: http://lkml.kernel.org/r/20171017081818.31795-1-ying.huang@intel.com
Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path")
Signed-off-by: "Huang, Ying"
Acked-by: Kirill A. Shutemov
Acked-by: Naoya Horiguchi
Cc: Michal Hocko
Cc: David Rientjes
Cc: Arnd Bergmann
Cc: Hugh Dickins
Cc: "Jérôme Glisse"
Cc: Daniel Colascione
Cc: Zi Yan
Cc: Anshuman Khandual
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Huang Ying
2017-11-03 22:39:19 +0800
1e3921471 userfaultfd: hugetlbfs: prevent UFFDIO_COPY to fill beyond the end of i_size

This oops:

kernel BUG at fs/hugetlbfs/inode.c:484!
RIP: remove_inode_hugepages+0x3d0/0x410
Call Trace:
hugetlbfs_setattr+0xd9/0x130
notify_change+0x292/0x410
do_truncate+0x65/0xa0
do_sys_ftruncate.constprop.3+0x11a/0x180
SyS_ftruncate+0xe/0x10
tracesys+0xd9/0xde

was caused by the lack of i_size check in hugetlb_mcopy_atomic_pte.

mmap() can still succeed beyond the end of the i_size after vmtruncate
zapped vmas in those ranges, but the faults must not succeed, and that
includes UFFDIO_COPY.

We could differentiate the retval to userland to represent a SIGBUS like
a page fault would do (vs SIGSEGV), but it doesn't seem very useful and
we'd need to pick a random retval as there's no meaningful syscall
retval that would differentiate from SIGSEGV and SIGBUS, there's just
-EFAULT.
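
A minimal sketch of the kind of bounds check discussed above: refuse to
fill a page whose index lies at or beyond the end of the file and report
-EFAULT. The function name, its parameters and the page-shift based
arithmetic are assumptions for illustration, not the code added to
hugetlb_mcopy_atomic_pte().

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Return 0 if page_index lies within a file of i_size bytes made of
 * pages of (1 << page_shift) bytes, or -EFAULT if it is at or beyond
 * the end.
 */
int ex_check_fill_index(uint64_t page_index, uint64_t i_size,
			unsigned int page_shift)
{
	uint64_t size_in_pages;

	assert(page_shift < 64);

	size_in_pages = i_size >> page_shift;
	return page_index >= size_in_pages ? -EFAULT : 0;
}
```

With 2 MiB pages (a shift of 21) and an i_size of two pages, indexes 0
and 1 pass while index 2 is rejected.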

Link: http://lkml.kernel.org/r/20171016223914.2421-2-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli
Reviewed-by: Mike Kravetz
Cc: Mike Rapoport
Cc: "Dr. David Alan Gilbert"
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2017-11-03 22:39:19 +0800
6ee79b6eb Merge branch 'net-mini_Qdisc'

Jiri Pirko says:

====================
net: core: introduce mini_Qdisc and eliminate usage of tp->q for clsact fastpath

This patchset's main patch is patch number 2. It carries the
description. Patch 1 is just a dependency.

---
v3->v4:
- rebased to be applicable on top of the current net-next
v2->v3:
- Using head change callback to replace miniq pointer every time tp head
changes. This eliminates one rcu dereference and makes the claim "without
added overhead" valid.
v1->v2:
- Use dev instead of skb->dev in sch_handle_egress as pointed out by Daniel
- Fixed synchronize_rcu_bh() in mini_qdisc_disable and commented
====================

Signed-off-by: David S. Miller

David S. Miller
2017-11-03 20:57:35 +0800
46209401f net: core: introduce mini_Qdisc and eliminate usage of tp->q for clsact fastpath

In sch_handle_egress and sch_handle_ingress tp->q is used only in order
to update stats. So stats and filter list are the only things that are
needed in clsact qdisc fastpath processing. Introduce new mini_Qdisc
struct to hold those items. Also, introduce a helper to swap the
mini_Qdisc structures in case filter list head changes.

This removes need for tp->q usage without added overhead.
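
The structure described here, a small object holding only the filter
list and the stats plus a helper that swaps between two copies when the
list head changes, can be modelled as below. This is a single-threaded
sketch with hypothetical ex_* names; it deliberately ignores the RCU
publication and grace-period handling that a kernel implementation would
need, and it is not the kernel's mini_Qdisc code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct ex_stats {
	uint64_t packets;
	uint64_t bytes;
};

struct ex_mini_qdisc {
	const void *filter_list;	/* head of the filter chain */
	struct ex_stats *stats;		/* shared with the owning qdisc */
};

struct ex_mini_qdisc_pair {
	struct ex_mini_qdisc miniq1;
	struct ex_mini_qdisc miniq2;
	struct ex_mini_qdisc **p_miniq;	/* pointer the fast path reads */
};

void ex_pair_init(struct ex_mini_qdisc_pair *pair, struct ex_stats *stats,
		  struct ex_mini_qdisc **p_miniq)
{
	assert(pair != NULL && stats != NULL && p_miniq != NULL);

	pair->miniq1.filter_list = NULL;
	pair->miniq1.stats = stats;
	pair->miniq2.filter_list = NULL;
	pair->miniq2.stats = stats;
	pair->p_miniq = p_miniq;
	*p_miniq = NULL;	/* nothing attached yet */
}

/* Publish a new filter list by filling the inactive copy and switching. */
void ex_pair_swap(struct ex_mini_qdisc_pair *pair, const void *new_head)
{
	struct ex_mini_qdisc *old = *pair->p_miniq;
	struct ex_mini_qdisc *next;

	if (new_head == NULL) {
		*pair->p_miniq = NULL;
		return;
	}

	next = (old == &pair->miniq1) ? &pair->miniq2 : &pair->miniq1;
	next->filter_list = new_head;
	*pair->p_miniq = next;
}

/* Fast path: touches only the mini structure. Returns 1 if accounted. */
int ex_fast_path(struct ex_mini_qdisc *miniq, uint64_t len)
{
	if (miniq == NULL)
		return 0;

	miniq->stats->packets++;
	miniq->stats->bytes += len;
	return 1;
}
```

Keeping two copies means the copy currently visible to the fast path is
never modified in place; only the inactive one is rewritten.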

Signed-off-by: Jiri Pirko
Signed-off-by: David S. Miller

Jiri Pirko
2017-11-03 20:57:24 +0800
c7eb7d723 net: sched: introduce chain_head_change callback

Add a callback that is to be called whenever head of the chain changes.
Also provide a callback for the default case when the caller gets a
block using non-extended getter.
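
A head-change callback of this kind can be sketched generically as
follows. The names (ex_chain, ex_chain_set_head, ex_default_head_change)
are hypothetical, and the default callback shown, which simply stores
the new head through a caller-supplied pointer, is an assumption about
what a default might do rather than a description of the kernel code.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*ex_head_change_fn)(void *new_head, void *priv);

struct ex_chain {
	void *head;			/* first filter of the chain */
	ex_head_change_fn head_change;	/* invoked whenever head changes */
	void *priv;			/* opaque argument for the callback */
};

/* Default callback: store the new head into the pointer passed as priv. */
void ex_default_head_change(void *new_head, void *priv)
{
	*(void **)priv = new_head;
}

/* Update the chain head and notify the owner, if it registered a callback. */
void ex_chain_set_head(struct ex_chain *chain, void *new_head)
{
	assert(chain != NULL);

	chain->head = new_head;
	if (chain->head_change != NULL)
		chain->head_change(new_head, chain->priv);
}
```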

Signed-off-by: Jiri Pirko
Signed-off-by: David S. Miller

Jiri Pirko
2017-11-03 20:57:23 +0800