Eric Lee / smarc-fsl-linux-kernel

12 Jan, 2017

5 commits

75437bb30 locking/pvqspinlock: Don't wait if vCPU is preempted ... Browse Code »

If prev node is not in running state or its vCPU is preempted, we can give
up our vCPU slices in pv_wait_node() ASAP.

Signed-off-by: Pan Xinhui
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: longman@redhat.com
Link: http://lkml.kernel.org/r/1484035006-6787-1-git-send-email-xinhui.pan@linux.vnet.ibm.com
[ Fixed typos in the changelog, removed ugly linebreak from the code. ]
Signed-off-by: Ingo Molnar

Pan Xinhui
2017-01-12 16:35:57 +0800
607904c35 locking/spinlocks: Remove the unused spin_lock_bh_nested() API ... Browse Code »

The spin_lock_bh_nested() API is defined but is not used anywhere
in the kernel. So all spin_lock_bh_nested() and related APIs are
now removed.

Signed-off-by: Waiman Long
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1483975612-16447-1-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar

Waiman Long
2017-01-12 16:33:39 +0800
ba836a6f5 Merge branch 'akpm' (patches from Andrew) ... Browse Code »

Merge fixes from Andrew Morton:
"27 fixes.

There are three patches that aren't actually fixes. They're simple
function renamings which are nice-to-have in mainline as ongoing net
development depends on them."

* akpm: (27 commits)
timerfd: export defines to userspace
mm/hugetlb.c: fix reservation race when freeing surplus pages
mm/slab.c: fix SLAB freelist randomization duplicate entries
zram: support BDI_CAP_STABLE_WRITES
zram: revalidate disk under init_lock
mm: support anonymous stable page
mm: add documentation for page fragment APIs
mm: rename __page_frag functions to __page_frag_cache, drop order from drain
mm: rename __alloc_page_frag to page_frag_alloc and __free_page_frag to page_frag_free
mm, memcg: fix the active list aging for lowmem requests when memcg is enabled
mm: don't dereference struct page fields of invalid pages
mailmap: add codeaurora.org names for nameless email commits
signal: protect SIGNAL_UNKILLABLE from unintentional clearing.
mm: pmd dirty emulation in page fault handler
ipc/sem.c: fix incorrect sem_lock pairing
lib/Kconfig.debug: fix frv build failure
mm: get rid of __GFP_OTHER_NODE
mm: fix remote numa hits statistics
mm: fix devm_memremap_pages crash, use mem_hotplug_{begin, done}
ocfs2: fix crash caused by stale lvb with fsdlm plugin
...

Linus Torvalds
2017-01-12 03:15:15 +0800
cff3b2c4b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Pull networking fixes from David Miller:

1) Fix rtlwifi crash, from Larry Finger.

2) Memory disclosure in appletalk ipddp routing code, from Vlad
Tsyrklevich.

3) r8152 can erroneously split an RX packet into multiple URBs if the
Rx FIFO is not empty when we suspend. Fix this by waiting for the
FIFO to empty before suspending. From Hayes Wang.

4) Two GRO fixes (enter slow path when not enough SKB tail room exists,
disable frag0 optimizations when there are IPV6 extension headers)
from Eric Dumazet and Herbert Xu.

5) A series of mlx5e bug fixes (do source udp port offloading for
tunnels properly, Ip fragment matching fixes, handling firmware
errors properly when installing TC rules, etc.) from Saeed Mahameed,
Or Gerlitz, Roi Dayan, Hadar Hen Zion, Gil Rockah, and Daniel
Jurgens.

6) Two VRF fixes from David Ahern (don't skip multipath selection for
VRF paths, disallow VRF to be configured with table ID 0).

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (35 commits)
net: vrf: do not allow table id 0
net: phy: marvell: fix Marvell 88E1512 used in SGMII mode
sctp: Fix spelling mistake: "Atempt" -> "Attempt"
net: ipv4: Fix multipath selection with vrf
cgroup: move CONFIG_SOCK_CGROUP_DATA to init/Kconfig
gro: use min_t() in skb_gro_reset_offset()
net/mlx5: Only cancel recovery work when cleaning up device
net/mlx5e: Remove WARN_ONCE from adaptive moderation code
net/mlx5e: Un-register uplink representor on nic_disable
net/mlx5e: Properly handle FW errors while adding TC rules
net/mlx5e: Fix kbuild warnings for uninitialized parameters
net/mlx5e: Set inline mode requirements for matching on IP fragments
net/mlx5e: Properly get address type of encapsulation IP headers
net/mlx5e: TC ipv4 tunnel encap offload error flow fixes
net/mlx5e: Warn when rejecting offload attempts of IP tunnels
net/mlx5e: Properly handle offloading of source udp port for IP tunnels
gro: Disable frag0 optimization on IPv6 ext headers
gro: Enter slow-path if there is no tailroom
mlx4: Return EOPNOTSUPP instead of ENOTSUPP
net/af_iucv: don't use paged skbs for TX on HiperSockets
...

Linus Torvalds
2017-01-12 01:52:12 +0800
a6b6e6165 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 ... Browse Code »

Pull crypto fix from Herbert Xu:
"This fixes a regression in aesni that renders it useless if it's
built-in with a modular pcbc configuration"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: aesni - Fix failure when built-in with modular pcbc

Linus Torvalds
2017-01-12 01:28:13 +0800

11 Jan, 2017

35 commits

24c63bbc1 net: vrf: do not allow table id 0 ... Browse Code »

Frank reported that vrf devices can be created with a table id of 0.
This breaks many of the run time table id checks and should not be
allowed. Detect this condition at create time and fail with EINVAL.

Fixes: 193125dbd8eb ("net: Introduce VRF device driver")
Reported-by: Frank Kellermann
Signed-off-by: David Ahern
Signed-off-by: David S. Miller

David Ahern
2017-01-11 23:04:01 +0800
a13c06525 net: phy: marvell: fix Marvell 88E1512 used in SGMII mode ... Browse Code »

When an Marvell 88E1512 PHY is connected to a nic in SGMII mode, the
fiber page is used for the SGMII host-side connection. The PHY driver
notices that SUPPORTED_FIBRE is set, so it tries reading the fiber page
for the link status, and ends up reading the MAC-side status instead of
the outgoing (copper) link. This leads to incorrect results reported
via ethtool.

If the PHY is connected via SGMII to the host, ignore the fiber page.
However, continue to allow the existing power management code to
suspend and resume the fiber page.

Fixes: 6cfb3bcc0641 ("Marvell phy: check link status in case of fiber link.")
Signed-off-by: Russell King
Signed-off-by: David S. Miller

Russell King
2017-01-11 23:02:37 +0800
eb004603c sctp: Fix spelling mistake: "Atempt" -> "Attempt" ... Browse Code »

Trivial fix to spelling mistake in WARN_ONCE message

Signed-off-by: Colin Ian King
Acked-by: Neil Horman
Signed-off-by: David S. Miller

Colin Ian King
2017-01-11 23:01:01 +0800
7a18c5b9f net: ipv4: Fix multipath selection with vrf ... Browse Code »

fib_select_path does not call fib_select_multipath if oif is set in the
flow struct. For VRF use cases oif is always set, so multipath route
selection is bypassed. Use the FLOWI_FLAG_SKIP_NH_OIF to skip the oif
check similar to what is done in fib_table_lookup.

Add saddr and proto to the flow struct for the fib lookup done by the
VRF driver to better match hash computation for a flow.

Fixes: 613d09b30f8b ("net: Use VRF device index for lookups on TX")
Signed-off-by: David Ahern
Signed-off-by: David S. Miller

David Ahern
2017-01-11 22:59:55 +0800
73b351473 cgroup: move CONFIG_SOCK_CGROUP_DATA to init/Kconfig ... Browse Code »

We now 'select SOCK_CGROUP_DATA' but Kconfig complains that this is
not right when CONFIG_NET is disabled and there is no socket interface:

warning: (CGROUP_BPF) selects SOCK_CGROUP_DATA which has unmet direct dependencies (NET)

I don't know what the correct solution for this is, but simply removing
the dependency on NET from SOCK_CGROUP_DATA by moving it out of the
'if NET' section avoids the warning and does not produce other build
errors.

Fixes: 483c4933ea09 ("cgroup: Fix CGROUP_BPF config")
Signed-off-by: Arnd Bergmann
Signed-off-by: David S. Miller

Arnd Bergmann
2017-01-11 22:47:10 +0800
7cfd5fd5a gro: use min_t() in skb_gro_reset_offset() ... Browse Code »

On 32bit arches, (skb->end - skb->data) is not 'unsigned int',
so we shall use min_t() instead of min() to avoid a compiler error.

Fixes: 1272ce87fa01 ("gro: Enter slow-path if there is no tailroom")
Reported-by: kernel test robot
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2017-01-11 21:15:40 +0800
6c711c869 Merge branch 'mlx5-fixes' ... Browse Code »

Saeed Mahameed says:

====================
Mellanox mlx5 fixes and cleanups 2017-01-10

This series includes some mlx5e general cleanups from Daniel, Gil, Hadar
and myself.
Also it includes some critical mlx5e TC offloads fixes from Or Gerlitz.

For -stable:
- net/mlx5e: Remove WARN_ONCE from adaptive moderation code

Although this fix doesn't affect any functionality, I thought it is
better to clean this -WARN_ONCE- up for -stable in case someone hits
such corner case.

Please apply and let me know if there's any problem.
====================

Signed-off-by: David S. Miller

David S. Miller
2017-01-11 10:34:03 +0800
5e44fca50 net/mlx5: Only cancel recovery work when cleaning up device ... Browse Code »

Do not attempt to drain the health workqueue when unloading the device in
the recovery flow, this can cause a deadlock when the recovery work
tries to cancel itself with sync.

Because the work is no longer unconditionally canceled when unloading, it
must be explicitly canceled in the AER flow.

fixes: 689a248df83b ("net/mlx5: Cancel recovery work in remove flow")
Signed-off-by: Daniel Jurgens
Signed-off-by: Saeed Mahameed
Signed-off-by: David S. Miller

Daniel Jurgens
2017-01-11 10:34:01 +0800
0bbcc0a8f net/mlx5e: Remove WARN_ONCE from adaptive moderation code ... Browse Code »

When trying to do interface down or changing interface configuration
under heavy traffic, some of the adaptive moderation corner cases can
occur and leave a WARN_ONCE call trace in the kernel log.

Those WARN_ONCE are meant for debug only, and should have been inserted
only under debug. We avoid such call traces by removing those WARN_ONCE.

Fixes: cb3c7fd4f839 ("net/mlx5e: Support adaptive RX coalescing")
Signed-off-by: Gil Rockah
Signed-off-by: Saeed Mahameed
Signed-off-by: David S. Miller

Gil Rockah
2017-01-11 10:34:01 +0800
3deef8cea net/mlx5e: Un-register uplink representor on nic_disable ... Browse Code »

The code before this patch registered uplink e-Switch representor
on nic_enable and unregistered on nic_cleanup, the right place
for this unregister is in nic_disable.

Fixes: 127ea380acc9 ("net/mlx5: Add Representors registration API")
Signed-off-by: Saeed Mahameed
Reviewed-by: Mohamad Haj Yahia
Signed-off-by: David S. Miller

Saeed Mahameed
2017-01-11 10:34:01 +0800
5e86397ab net/mlx5e: Properly handle FW errors while adding TC rules ... Browse Code »

When the firmware returns an error (common example is an attempt to
add twice the same rule which is refused by the some FWs), we are not
properly derefing/cleaning few resources allocated on the way.
Examples are vport vlan deref under eswitch vlan offloads, and encap
entry/neighbour deref under eswitch encapsulation offloads, fix that.

Fixes: a54e20b4fcae ('net/mlx5e: Add basic TC tunnel set action for SRIOV offloads')
Fixes: 8b32580df1cb ('net/mlx5e: Add TC vlan action for SRIOV offloads')
Signed-off-by: Or Gerlitz
Reviewed-by: Roi Dayan
Signed-off-by: Saeed Mahameed
Signed-off-by: David S. Miller

Or Gerlitz
2017-01-11 10:34:01 +0800
a757d108d net/mlx5e: Fix kbuild warnings for uninitialized parameters ... Browse Code »

kbuild warn about parameters that may be used uninitialized, fix it.

Fixes: a54e20b4fcae ('net/mlx5e: Add basic TC tunnel set action for SRIOV offloads')
Signed-off-by: Hadar Hen Zion
Signed-off-by: Saeed Mahameed
Signed-off-by: David S. Miller

Hadar Hen Zion
2017-01-11 10:34:01 +0800
0827444d0 net/mlx5e: Set inline mode requirements for matching on IP fragments ... Browse Code »

For e-switch level matching on packets being an IP fragment, we
need to make sure the source vport inline mode is L3, fix that.

Fixes: 3f7d0eb42d59 ('net/mlx5e: Offload TC matching on packets being IP fragments')
Signed-off-by: Or Gerlitz
Reviewed-by: Roi Dayan
Signed-off-by: Saeed Mahameed
Signed-off-by: David S. Miller

Or Gerlitz
2017-01-11 10:34:01 +0800
2e72eb438 net/mlx5e: Properly get address type of encapsulation IP headers ... Browse Code »

As done elsewhere in our TC/flower offload code, the address type of
the encapsulation IP headers should be realized accroding to the
addr_type field of the encapsulation control dissector key, do that.

Fixes: bbd00f7e2349 ('net/mlx5e: Add TC tunnel release action for SRIOV offloads')
Signed-off-by: Or Gerlitz
Reviewed-by: Hadar Hen Zion
Signed-off-by: Saeed Mahameed
Signed-off-by: David S. Miller

Or Gerlitz
2017-01-11 10:34:01 +0800
a42485eb0 net/mlx5e: TC ipv4 tunnel encap offload error flow fixes ... Browse Code »

When the route lookup fails we should return the actual error.

When the neigh isn't valid, we should return -EOPNOTSUPP as done
in similar cases along the code.

When the offload can't take place as of invalid neigh etc, we
must release the neigh.

Fixes: a54e20b4fcae ('net/mlx5e: Add basic TC tunnel set action for SRIOV offloads')
Signed-off-by: Or Gerlitz
Reviewed-by: Hadar Hen Zion
Signed-off-by: Saeed Mahameed
Signed-off-by: David S. Miller

Or Gerlitz
2017-01-11 10:34:01 +0800
2fcd82e9b net/mlx5e: Warn when rejecting offload attempts of IP tunnels ... Browse Code »

We silently reject offloading of IPv6 tunnels, non vxlan tunnels,
vxlan tunnels where the dst port to match is not provided, etc.

Be a bit more verbose and print a warning so the user better
realizes what went wrong here and can fix it.

Fixes: a54e20b4fcae ('net/mlx5e: Add basic TC tunnel set action for SRIOV offloads')
Fixes: bbd00f7e2349 ('net/mlx5e: Add TC tunnel release action for SRIOV offloads')
Signed-off-by: Or Gerlitz
Reviewed-by: Hadar Hen Zion
Signed-off-by: Saeed Mahameed
Signed-off-by: David S. Miller

Or Gerlitz
2017-01-11 10:34:01 +0800
cd3776638 net/mlx5e: Properly handle offloading of source udp port for IP tunnels ... Browse Code »

We can offload the matching on source udp port of ip tunnels for
decapsulation. We can not offload setting source udp port for tunnels
as part of encapsulation. Fix both the code that deals with matching
offload (decap) and the code that deal with encap offload to align with
that.

Fixes: a54e20b4fcae ('net/mlx5e: Add basic TC tunnel set action for SRIOV offloads')
Fixes: bbd00f7e2349 ('net/mlx5e: Add TC tunnel release action for SRIOV offloads')
Signed-off-by: Or Gerlitz
Reviewed-by: Hadar Hen Zion
Signed-off-by: Saeed Mahameed
Signed-off-by: David S. Miller

Or Gerlitz
2017-01-11 10:34:01 +0800
575b1967e timerfd: export defines to userspace ... Browse Code »

Since userspace is expected to call timerfd syscalls directly with these
flags/ioctls, make sure we export them so they don't have to duplicate
the values themselves.

Link: http://lkml.kernel.org/r/20161219064052.7196-1-vapier@gentoo.org
Signed-off-by: Mike Frysinger
Acked-by: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mike Frysinger
2017-01-11 10:31:55 +0800
e5bbc8a6c mm/hugetlb.c: fix reservation race when freeing surplus pages ... Browse Code »

return_unused_surplus_pages() decrements the global reservation count,
and frees any unused surplus pages that were backing the reservation.

Commit 7848a4bf51b3 ("mm/hugetlb.c: add cond_resched_lock() in
return_unused_surplus_pages()") added a call to cond_resched_lock in the
loop freeing the pages.

As a result, the hugetlb_lock could be dropped, and someone else could
use the pages that will be freed in subsequent iterations of the loop.
This could result in inconsistent global hugetlb page state, application
api failures (such as mmap) failures or application crashes.

When dropping the lock in return_unused_surplus_pages, make sure that
the global reservation count (resv_huge_pages) remains sufficiently
large to prevent someone else from claiming pages about to be freed.

Analyzed by Paul Cassella.

Fixes: 7848a4bf51b3 ("mm/hugetlb.c: add cond_resched_lock() in return_unused_surplus_pages()")
Link: http://lkml.kernel.org/r/1483991767-6879-1-git-send-email-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz
Reported-by: Paul Cassella
Suggested-by: Michal Hocko
Cc: Masayoshi Mizuma
Cc: Naoya Horiguchi
Cc: Aneesh Kumar
Cc: Hillf Danton
Cc: [3.15+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mike Kravetz
2017-01-11 10:31:55 +0800
c4e490cf1 mm/slab.c: fix SLAB freelist randomization duplicate entries ... Browse Code »

This patch fixes a bug in the freelist randomization code. When a high
random number is used, the freelist will contain duplicate entries. It
will result in different allocations sharing the same chunk.

It will result in odd behaviours and crashes. It should be uncommon but
it depends on the machines. We saw it happening more often on some
machines (every few hours of running tests).

Fixes: c7ce4f60ac19 ("mm: SLAB freelist randomization")
Link: http://lkml.kernel.org/r/20170103181908.143178-1-thgarnie@google.com
Signed-off-by: John Sperbeck
Signed-off-by: Thomas Garnier
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

John Sperbeck
2017-01-11 10:31:55 +0800
b09ab054b zram: support BDI_CAP_STABLE_WRITES ... Browse Code »

zram has used per-cpu stream feature from v4.7. It aims for increasing
cache hit ratio of scratch buffer for compressing. Downside of that
approach is that zram should ask memory space for compressed page in
per-cpu context which requires stricted gfp flag which could be failed.
If so, it retries to allocate memory space out of per-cpu context so it
could get memory this time and compress the data again, copies it to the
memory space.

In this scenario, zram assumes the data should never be changed but it is
not true without stable page support. So, If the data is changed under
us, zram can make buffer overrun so that zsmalloc free object chain is
broken so system goes crash like below

https://bugzilla.suse.com/show_bug.cgi?id=997574

This patch adds BDI_CAP_STABLE_WRITES to zram for declaring "I am block
device needing *stable write*".

Fixes: da9556a2367c ("zram: user per-cpu compression streams")
Link: http://lkml.kernel.org/r/1482366980-3782-4-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim
Reviewed-by: Sergey Senozhatsky
Cc: Takashi Iwai
Cc: Hyeoncheol Lee
Cc:
Cc: Sangseok Lee
Cc: Hugh Dickins
Cc: Darrick J. Wong
Cc: [4.7+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Minchan Kim
2017-01-11 10:31:55 +0800
e7ccfc4cc zram: revalidate disk under init_lock ... Browse Code »

Commit b4c5c60920e3 ("zram: avoid lockdep splat by revalidate_disk")
moved revalidate_disk call out of init_lock to avoid lockdep
false-positive splat. However, commit 08eee69fcf6b ("zram: remove
init_lock in zram_make_request") removed init_lock in IO path so there
is no worry about lockdep splat. So, let's restore it.

This patch is needed to set BDI_CAP_STABLE_WRITES atomically in next
patch.

Fixes: da9556a2367c ("zram: user per-cpu compression streams")
Link: http://lkml.kernel.org/r/1482366980-3782-3-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim
Reviewed-by: Sergey Senozhatsky
Cc: Takashi Iwai
Cc: Hyeoncheol Lee
Cc:
Cc: Sangseok Lee
Cc: Hugh Dickins
Cc: Darrick J. Wong
Cc: [4.7+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Minchan Kim
2017-01-11 10:31:55 +0800
f05714293 mm: support anonymous stable page ... Browse Code »

During developemnt for zram-swap asynchronous writeback, I found strange
corruption of compressed page, resulting in:

Modules linked in: zram(E)
CPU: 3 PID: 1520 Comm: zramd-1 Tainted: G E 4.8.0-mm1-00320-ge0d4894c9c38-dirty #3274
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
task: ffff88007620b840 task.stack: ffff880078090000
RIP: set_freeobj.part.43+0x1c/0x1f
RSP: 0018:ffff880078093ca8 EFLAGS: 00010246
RAX: 0000000000000018 RBX: ffff880076798d88 RCX: ffffffff81c408c8
RDX: 0000000000000018 RSI: 0000000000000000 RDI: 0000000000000246
RBP: ffff880078093cb0 R08: 0000000000000000 R09: 0000000000000000
R10: ffff88005bc43030 R11: 0000000000001df3 R12: ffff880076798d88
R13: 000000000005bc43 R14: ffff88007819d1b8 R15: 0000000000000001
FS: 0000000000000000(0000) GS:ffff88007e380000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc934048f20 CR3: 0000000077b01000 CR4: 00000000000406e0
Call Trace:
obj_malloc+0x22b/0x260
zs_malloc+0x1e4/0x580
zram_bvec_rw+0x4cd/0x830 [zram]
page_requests_rw+0x9c/0x130 [zram]
zram_thread+0xe6/0x173 [zram]
kthread+0xca/0xe0
ret_from_fork+0x25/0x30

With investigation, it reveals currently stable page doesn't support
anonymous page. IOW, reuse_swap_page can reuse the page without waiting
writeback completion so it can overwrite page zram is compressing.

Unfortunately, zram has used per-cpu stream feature from v4.7.
It aims for increasing cache hit ratio of scratch buffer for
compressing. Downside of that approach is that zram should ask
memory space for compressed page in per-cpu context which requires
stricted gfp flag which could be failed. If so, it retries to
allocate memory space out of per-cpu context so it could get memory
this time and compress the data again, copies it to the memory space.

In this scenario, zram assumes the data should never be changed
but it is not true unless stable page supports. So, If the data is
changed under us, zram can make buffer overrun because second
compression size could be bigger than one we got in previous trial
and blindly, copy bigger size object to smaller buffer which is
buffer overrun. The overrun breaks zsmalloc free object chaining
so system goes crash like above.

I think below is same problem.
https://bugzilla.suse.com/show_bug.cgi?id=997574

Unfortunately, reuse_swap_page should be atomic so that we cannot wait on
writeback in there so the approach in this patch is simply return false if
we found it needs stable page. Although it increases memory footprint
temporarily, it happens rarely and it should be reclaimed easily althoug
it happened. Also, It would be better than waiting of IO completion,
which is critial path for application latency.

Fixes: da9556a2367c ("zram: user per-cpu compression streams")
Link: http://lkml.kernel.org/r/20161120233015.GA14113@bbox
Link: http://lkml.kernel.org/r/1482366980-3782-2-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim
Acked-by: Hugh Dickins
Cc: Sergey Senozhatsky
Cc: Darrick J. Wong
Cc: Takashi Iwai
Cc: Hyeoncheol Lee
Cc:
Cc: Sangseok Lee
Cc: [4.7+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Minchan Kim
2017-01-11 10:31:55 +0800
4d09d0f45 mm: add documentation for page fragment APIs ... Browse Code »

This is a first pass at trying to add documentation for the page_frag
APIs. They may still change over time but for now I thought I would try
to get these documented so that as more network drivers and stack calls
make use of them we have one central spot to document how they are meant
to be used.

Link: http://lkml.kernel.org/r/20170104024157.13451.6758.stgit@localhost.localdomain
Signed-off-by: Alexander Duyck
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexander Duyck
2017-01-11 10:31:55 +0800
2976db801 mm: rename __page_frag functions to __page_frag_cache, drop order from drain ... Browse Code »

This patch does two things.

First it goes through and renames the __page_frag prefixed functions to
__page_frag_cache so that we can be clear that we are draining or
refilling the cache, not the frags themselves.

Second we drop the order parameter from __page_frag_cache_drain since we
don't actually need to pass it since all fragments are either order 0 or
must be a compound page.

Link: http://lkml.kernel.org/r/20170104023954.13451.5678.stgit@localhost.localdomain
Signed-off-by: Alexander Duyck
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexander Duyck
2017-01-11 10:31:55 +0800
8c2dd3e4a mm: rename __alloc_page_frag to page_frag_alloc and __free_page_frag to page_frag_free ... Browse Code »

Patch series "Page fragment updates", v4.

This patch series takes care of a few cleanups for the page fragments
API.

First we do some renames so that things are much more consistent. First
we move the page_frag_ portion of the name to the front of the functions
names. Secondly we split out the cache specific functions from the
other page fragment functions by adding the word "cache" to the name.

Finally I added a bit of documentation that will hopefully help to
explain some of this. I plan to revisit this later as we get things
more ironed out in the near future with the changes planned for the DMA
setup to support eXpress Data Path.

This patch (of 3):

This patch renames the page frag functions to be more consistent with
other APIs. Specifically we place the name page_frag first in the name
and then have either an alloc or free call name that we append as the
suffix. This makes it a bit clearer in terms of naming.

In addition we drop the leading double underscores since we are
technically no longer a backing interface and instead the front end that
is called from the networking APIs.

Link: http://lkml.kernel.org/r/20170104023854.13451.67390.stgit@localhost.localdomain
Signed-off-by: Alexander Duyck
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexander Duyck
2017-01-11 10:31:55 +0800
b4536f0c8 mm, memcg: fix the active list aging for lowmem requests when memcg is enabled ... Browse Code »

Nils Holland and Klaus Ethgen have reported unexpected OOM killer
invocations with 32b kernel starting with 4.8 kernels

kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
kworker/u4:5 cpuset=/ mems_allowed=0
CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
[...]
Mem-Info:
active_anon:58685 inactive_anon:90 isolated_anon:0
active_file:274324 inactive_file:281962 isolated_file:0
unevictable:0 dirty:649 writeback:0 unstable:0
slab_reclaimable:40662 slab_unreclaimable:17754
mapped:7382 shmem:202 pagetables:351 bounce:0
free:206736 free_pcp:332 free_cma:0
Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB writepending:96kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 813 3474 3474
Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
lowmem_reserve[]: 0 0 21292 21292
HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB

the oom killer is clearly pre-mature because there there is still a lot
of page cache in the zone Normal which should satisfy this lowmem
request. Further debugging has shown that the reclaim cannot make any
forward progress because the page cache is hidden in the active list
which doesn't get rotated because inactive_list_is_low is not memcg
aware.

The code simply subtracts per-zone highmem counters from the respective
memcg's lru sizes which doesn't make any sense. We can simply end up
always seeing the resulting active and inactive counts 0 and return
false. This issue is not limited to 32b kernels but in practice the
effect on systems without CONFIG_HIGHMEM would be much harder to notice
because we do not invoke the OOM killer for allocations requests
targeting < ZONE_NORMAL.

Fix the issue by tracking per zone lru page counts in mem_cgroup_per_node
and subtract per-memcg highmem counts when memcg is enabled. Introduce
helper lruvec_zone_lru_size which redirects to either zone counters or
mem_cgroup_get_zone_lru_size when appropriate.

We are losing empty LRU but non-zero lru size detection introduced by
ca707239e8a7 ("mm: update_lru_size warn and reset bad lru_size") because
of the inherent zone vs. node discrepancy.

Fixes: f8d1a31163fc ("mm: consider whether to decivate based on eligible zones inactive ratio")
Link: http://lkml.kernel.org/r/20170104100825.3729-1-mhocko@kernel.org
Signed-off-by: Michal Hocko
Reported-by: Nils Holland
Tested-by: Nils Holland
Reported-by: Klaus Ethgen
Acked-by: Minchan Kim
Acked-by: Mel Gorman
Acked-by: Johannes Weiner
Reviewed-by: Vladimir Davydov
Cc: [4.8+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2017-01-11 10:31:55 +0800
f073bdc51 mm: don't dereference struct page fields of invalid pages ... Browse Code »

The VM_BUG_ON() check in move_freepages() checks whether the node id of
a page matches the node id of its zone. However, it does this before
having checked whether the struct page pointer refers to a valid struct
page to begin with. This is guaranteed in most cases, but may not be
the case if CONFIG_HOLES_IN_ZONE=y.

So reorder the VM_BUG_ON() with the pfn_valid_within() check.

Link: http://lkml.kernel.org/r/1481706707-6211-2-git-send-email-ard.biesheuvel@linaro.org
Signed-off-by: Ard Biesheuvel
Acked-by: Will Deacon
Cc: Catalin Marinas
Cc: Hanjun Guo
Cc: Yisheng Xie
Cc: Robert Richter
Cc: James Morse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ard Biesheuvel
2017-01-11 10:31:55 +0800
9ebf73b27 mailmap: add codeaurora.org names for nameless email commits ... Browse Code »

Some codeaurora.org emails have crept in but the names don't exist for
them. Add the names for the emails so git can match everyone up.

Link: http://lkml.kernel.org/r/20170104194611.25933-1-sboyd@codeaurora.org
Signed-off-by: Stephen Boyd
Cc: Sarangdhar Joshi
Cc: Subash Abhinov Kasiviswanathan
Cc: Subhash Jadavani
Cc: Thomas Pedersen
Cc: Andy Gross
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Stephen Boyd
2017-01-11 10:31:55 +0800
2d39b3cd3 signal: protect SIGNAL_UNKILLABLE from unintentional clearing. ... Browse Code »

Since commit 00cd5c37afd5 ("ptrace: permit ptracing of /sbin/init") we
can now trace init processes. init is initially protected with
SIGNAL_UNKILLABLE which will prevent fatal signals such as SIGSTOP, but
there are a number of paths during tracing where SIGNAL_UNKILLABLE can
be implicitly cleared.

This can result in init becoming stoppable/killable after tracing. For
example, running:

while true; do kill -STOP 1; done &
strace -p 1

and then stopping strace and the kill loop will result in init being
left in state TASK_STOPPED. Sending SIGCONT to init will resume it, but
init will now respond to future SIGSTOP signals rather than ignoring
them.

Make sure that when setting SIGNAL_STOP_CONTINUED/SIGNAL_STOP_STOPPED
that we don't clear SIGNAL_UNKILLABLE.

Link: http://lkml.kernel.org/r/20170104122017.25047-1-jamie.iles@oracle.com
Signed-off-by: Jamie Iles
Acked-by: Oleg Nesterov
Cc: Alexander Viro
Cc: Ingo Molnar
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jamie Iles
2017-01-11 10:31:55 +0800
20f664aab mm: pmd dirty emulation in page fault handler ... Browse Code »

Andreas reported [1] made a test in jemalloc hang in THP mode in arm64:

http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de

The problem is currently page fault handler doesn't supports dirty bit
emulation of pmd for non-HW dirty-bit architecture so that application
stucks until VM marked the pmd dirty.

How the emulation work depends on the architecture. In case of arm64,
when it set up pte firstly, it sets pte PTE_RDONLY to get a chance to
mark the pte dirty via triggering page fault when store access happens.
Once the page fault occurs, VM marks the pmd dirty and arch code for
setting pmd will clear PTE_RDONLY for application to proceed.

IOW, if VM doesn't mark the pmd dirty, application hangs forever by
repeated fault(i.e., store op but the pmd is PTE_RDONLY).

This patch enables pmd dirty-bit emulation for those architectures.

[1] b8d3c4c3009d, mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called

Fixes: b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called")
Link: http://lkml.kernel.org/r/1482506098-6149-1-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim
Reported-by: Andreas Schwab
Tested-by: Andreas Schwab
Acked-by: Kirill A. Shutemov
Acked-by: Michal Hocko
Cc: Jason Evans
Cc: Will Deacon
Cc: Catalin Marinas
Cc: [4.5+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Minchan Kim
2017-01-11 10:31:55 +0800
c626bc46e ipc/sem.c: fix incorrect sem_lock pairing ... Browse Code »

Based on the syzcaller test case from dvyukov:

https://gist.githubusercontent.com/dvyukov/d0e5efefe4d7d6daed829f5c3ca26a40/raw/08d0a261fe3c987bed04fbf267e08ba04bd533ea/gistfile1.txt

The slow (i.e.: failure to acquire) syscall exit from semtimedop()
incorrectly assumed that the the same lock is acquired as it was at the
initial syscall entry.

This is wrong:
- thread A: single semop semop(), sleeps
- thread B: multi semop semop(), sleeps
- thread A: woken up by signal/timeout

With this sequence, the initial sem_lock() call locks the per-semaphore
spinlock, and it is unlocked with sem_unlock(). The call at the syscall
return locks the global spinlock. Because locknum is not updated, the
following sem_unlock() call unlocks the per-semaphore spinlock, which is
actually not locked.

The fix is trivial: Use the return value from sem_lock.

Fixes: 370b262c896e ("ipc/sem: avoid idr tree lookup for interrupted semop")
Link: http://lkml.kernel.org/r/1482215645-22328-1-git-send-email-manfred@colorfullife.com
Signed-off-by: Manfred Spraul
Reported-by: Dmitry Vyukov
Reported-by: Johanna Abrahamsson
Tested-by: Johanna Abrahamsson
Acked-by: Davidlohr Bueso
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Manfred Spraul
2017-01-11 10:31:55 +0800
da0510c47 lib/Kconfig.debug: fix frv build failure ... Browse Code »

The build of frv allmodconfig was failing with the errors like:

/tmp/cc0JSPc3.s: Assembler messages:
/tmp/cc0JSPc3.s:1839: Error: symbol `.LSLT0' is already defined
/tmp/cc0JSPc3.s:1842: Error: symbol `.LASLTP0' is already defined
/tmp/cc0JSPc3.s:1969: Error: symbol `.LELTP0' is already defined
/tmp/cc0JSPc3.s:1970: Error: symbol `.LELT0' is already defined

Commit 866ced950bcd ("kbuild: Support split debug info v4") introduced
splitting the debug info and keeping that in a separate file. Somehow,
the frv-linux gcc did not like that and I am guessing that instead of
splitting it started copying. The first report about this is at:

https://lists.01.org/pipermail/kbuild-all/2015-July/010527.html.

I will try and see if this can work with frv and if still fails I will
open a bug report with gcc. But meanwhile this is the easiest option to
solve build failure of frv.

Fixes: 866ced950bcd ("kbuild: Support split debug info v4")
Link: http://lkml.kernel.org/r/1482062348-5352-1-git-send-email-sudipm.mukherjee@gmail.com
Signed-off-by: Sudip Mukherjee
Reported-by: Fengguang Wu
Cc: Andi Kleen
Cc: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sudip Mukherjee
2017-01-11 10:31:55 +0800
41b6167e8 mm: get rid of __GFP_OTHER_NODE ... Browse Code »

The flag was introduced by commit 78afd5612deb ("mm: add
__GFP_OTHER_NODE flag") to allow proper accounting of remote node
allocations done by kernel daemons on behalf of a process - e.g.
khugepaged.

After "mm: fix remote numa hits statistics" we do not need and actually
use the flag so we can safely remove it because all allocations which
are satisfied from their "home" node are accounted properly.

[mhocko@suse.com: fix build]
Link: http://lkml.kernel.org/r/20170106122225.GK5556@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20170102153057.9451-3-mhocko@kernel.org
Signed-off-by: Michal Hocko
Acked-by: Mel Gorman
Acked-by: Vlastimil Babka
Cc: Michal Hocko
Cc: Johannes Weiner
Cc: Joonsoo Kim
Cc: Taku Izumi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2017-01-11 10:31:55 +0800
2df26639e mm: fix remote numa hits statistics ... Browse Code »

Jia He has noticed that commit b9f00e147f27 ("mm, page_alloc: reduce
branches in zone_statistics") has an unintentional side effect that
remote node allocation requests are accounted as NUMA_MISS rathat than
NUMA_HIT and NUMA_OTHER if such a request doesn't use __GFP_OTHER_NODE.

There are many of these potentially because the flag is used very rarely
while we have many users of __alloc_pages_node.

Fix this by simply ignoring __GFP_OTHER_NODE (it can be removed in a
follow up patch) and treat all allocations that were satisfied from the
preferred zone's node as NUMA_HITS because this is the same node we
requested the allocation from in most cases. If this is not the local
node then we just account it as NUMA_OTHER rather than NUMA_LOCAL.

One downsize would be that an allocation request for a node which is
outside of the mempolicy nodemask would be reported as a hit which is a
bit weird but that was the case before b9f00e147f27 already.

Fixes: b9f00e147f27 ("mm, page_alloc: reduce branches in zone_statistics")
Link: http://lkml.kernel.org/r/20170102153057.9451-2-mhocko@kernel.org
Signed-off-by: Michal Hocko
Reported-by: Jia He
Reviewed-by: Vlastimil Babka # with cbmc[1] superpowers
Acked-by: Mel Gorman
Cc: Johannes Weiner
Cc: Joonsoo Kim
Cc: Taku Izumi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2017-01-11 10:31:54 +0800