Eric Lee / smarc-fsl-linux-kernel

18 Feb, 2017

4 commits

6ebde312a net: introduce device min_header_len ... Browse Code »

[ Upstream commit 217e6fa24ce28ec87fca8da93c9016cb78028612 ]

The stack must not pass packets to device drivers that are shorter
than the minimum link layer header length.

Previously, packet sockets would drop packets smaller than or equal
to dev->hard_header_len, but this has false positives. Zero length
payload is used over Ethernet. Other link layer protocols support
variable length headers. Support for validation of these protocols
removed the min length check for all protocols.

Introduce an explicit dev->min_header_len parameter and drop all
packets below this value. Initially, set it to non-zero only for
Ethernet and loopback. Other protocols can follow in a patch to
net-next.

Fixes: 9ed988cd5915 ("packet: validate variable length ll headers")
Reported-by: Sowmini Varadhan
Signed-off-by: Willem de Bruijn
Acked-by: Eric Dumazet
Acked-by: Sowmini Varadhan
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Willem de Bruijn
2017-02-18 22:11:43 +0800
2b7f50d67 lwtunnel: valid encap attr check should return 0 when lwtunnel is disabled ... Browse Code »

[ Upstream commit 2bd137de531367fb573d90150d1872cb2a2095f7 ]

An error was reported upgrading to 4.9.8:
root@Typhoon:~# ip route add default table 210 nexthop dev eth0 via 10.68.64.1
weight 1 nexthop dev eth0 via 10.68.64.2 weight 1
RTNETLINK answers: Operation not supported

The problem occurs when CONFIG_LWTUNNEL is not enabled and a multipath
route is submitted.

The point of lwtunnel_valid_encap_type_attr is catch modules that
need to be loaded before any references are taken with rntl held. With
CONFIG_LWTUNNEL disabled, there will be no modules to load so the
lwtunnel_valid_encap_type_attr stub should just return 0.

Fixes: 9ed59592e3e3 ("lwtunnel: fix autoload of lwt modules")
Reported-by: pupilla@libero.it
Signed-off-by: David Ahern
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

David Ahern
2017-02-18 22:11:42 +0800
66cdd4347 netlabel: out of bound access in cipso_v4_validate() ... Browse Code »

[ Upstream commit d71b7896886345c53ef1d84bda2bc758554f5d61 ]

syzkaller found another out of bound access in ip_options_compile(),
or more exactly in cipso_v4_validate()

Fixes: 20e2a8648596 ("cipso: handle CIPSO options correctly when NetLabel is disabled")
Fixes: 446fda4f2682 ("[NetLabel]: CIPSOv4 engine")
Signed-off-by: Eric Dumazet
Reported-by: Dmitry Vyukov
Cc: Paul Moore
Acked-by: Paul Moore
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Eric Dumazet
2017-02-18 22:11:41 +0800
adf86d59b can: Fix kernel panic at security_sock_rcv_skb ... Browse Code »

[ Upstream commit f1712c73714088a7252d276a57126d56c7d37e64 ]

Zhang Yanmin reported crashes [1] and provided a patch adding a
synchronize_rcu() call in can_rx_unregister()

The main problem seems that the sockets themselves are not RCU
protected.

If CAN uses RCU for delivery, then sockets should be freed only after
one RCU grace period.

Recent kernels could use sock_set_flag(sk, SOCK_RCU_FREE), but let's
ease stable backports with the following fix instead.

[1]
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [] selinux_socket_sock_rcv_skb+0x65/0x2a0

Call Trace:

[] security_sock_rcv_skb+0x4c/0x60
[] sk_filter+0x41/0x210
[] sock_queue_rcv_skb+0x53/0x3a0
[] raw_rcv+0x2a3/0x3c0
[] can_rcv_filter+0x12b/0x370
[] can_receive+0xd9/0x120
[] can_rcv+0xab/0x100
[] __netif_receive_skb_core+0xd8c/0x11f0
[] __netif_receive_skb+0x24/0xb0
[] process_backlog+0x127/0x280
[] net_rx_action+0x33b/0x4f0
[] __do_softirq+0x184/0x440
[] do_softirq_own_stack+0x1c/0x30

[] do_softirq.part.18+0x3b/0x40
[] do_softirq+0x1d/0x20
[] netif_rx_ni+0xe5/0x110
[] slcan_receive_buf+0x507/0x520
[] flush_to_ldisc+0x21c/0x230
[] process_one_work+0x24f/0x670
[] worker_thread+0x9d/0x6f0
[] ? rescuer_thread+0x480/0x480
[] kthread+0x12c/0x150
[] ret_from_fork+0x3f/0x70

Reported-by: Zhang Yanmin
Signed-off-by: Eric Dumazet
Acked-by: Oliver Hartkopp
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Eric Dumazet
2017-02-18 22:11:40 +0800

15 Feb, 2017

5 commits

1cf897fcc Drivers: hv: vmbus: finally fix hv_need_to_signal_on_read() ... Browse Code »

commit 433e19cf33d34bb6751c874a9c00980552fe508c upstream.

Commit a389fcfd2cb5 ("Drivers: hv: vmbus: Fix signaling logic in
hv_need_to_signal_on_read()")
added the proper mb(), but removed the test "prev_write_sz < pending_sz"
when making the signal decision.

As a result, the guest can signal the host unnecessarily,
and then the host can throttle the guest because the host
thinks the guest is buggy or malicious; finally the user
running stress test can perceive intermittent freeze of
the guest.

This patch brings back the test, and properly handles the
in-place consumption APIs used by NetVSC (see get_next_pkt_raw(),
put_pkt_raw() and commit_rd_index()).

Fixes: a389fcfd2cb5 ("Drivers: hv: vmbus: Fix signaling logic in
hv_need_to_signal_on_read()")

Signed-off-by: Dexuan Cui
Reported-by: Rolf Neugebauer
Tested-by: Rolf Neugebauer
Cc: "K. Y. Srinivasan"
Cc: Haiyang Zhang
Cc: Stephen Hemminger
Signed-off-by: K. Y. Srinivasan
Cc: Rolf Neugebauer
Signed-off-by: Greg Kroah-Hartman

Dexuan Cui
2017-02-15 07:25:39 +0800
964dfbe3d Drivers: hv: vmbus: On the read path cleanup the logic to interrupt the host ... Browse Code »

commit 3372592a140db69fd63837e81f048ab4abf8111e upstream.

Signal the host when we determine the host is to be signaled -
on th read path. The currrent code determines the need to signal in the
ringbuffer code and actually issues the signal elsewhere. This can result
in the host viewing this interrupt as spurious since the host may also
poll the channel. Make the necessary adjustments.

Signed-off-by: K. Y. Srinivasan
Cc: Rolf Neugebauer
Signed-off-by: Greg Kroah-Hartman

K. Y. Srinivasan
2017-02-15 07:25:38 +0800
e2fdf7841 Drivers: hv: vmbus: On write cleanup the logic to interrupt the host ... Browse Code »

commit 1f6ee4e7d83586c8b10bd4f2f4346353d04ce884 upstream.

Signal the host when we determine the host is to be signaled.
The currrent code determines the need to signal in the ringbuffer
code and actually issues the signal elsewhere. This can result
in the host viewing this interrupt as spurious since the host may also
poll the channel. Make the necessary adjustments.

Signed-off-by: K. Y. Srinivasan
Cc: Rolf Neugebauer
Signed-off-by: Greg Kroah-Hartman

K. Y. Srinivasan
2017-02-15 07:25:38 +0800
4978149de target: Fix multi-session dynamic se_node_acl double free OOPs ... Browse Code »

commit 01d4d673558985d9a118e1e05026633c3e2ade9b upstream.

This patch addresses a long-standing bug with multi-session
(eg: iscsi-target + iser-target) se_node_acl dynamic free
withini transport_deregister_session().

This bug is caused when a storage endpoint is configured with
demo-mode (generate_node_acls = 1 + cache_dynamic_acls = 1)
initiators, and initiator login creates a new dynamic node acl
and attaches two sessions to it.

After that, demo-mode for the storage instance is disabled via
configfs (generate_node_acls = 0 + cache_dynamic_acls = 0) and
the existing dynamic acl is never converted to an explicit ACL.

The end result is dynamic acl resources are released twice when
the sessions are shutdown in transport_deregister_session().

If the storage instance is not changed to disable demo-mode,
or the dynamic acl is converted to an explict ACL, or there
is only a single session associated with the dynamic ACL,
the bug is not triggered.

To address this big, move the release of dynamic se_node_acl
memory into target_complete_nacl() so it's only freed once
when se_node_acl->acl_kref reaches zero.

(Drop unnecessary list_del_init usage - HCH)

Reported-by: Rob Millner
Tested-by: Rob Millner
Cc: Rob Millner
Signed-off-by: Nicholas Bellinger
Signed-off-by: Greg Kroah-Hartman

Nicholas Bellinger
2017-02-15 07:25:36 +0800
c4236b0c7 cpumask: use nr_cpumask_bits for parsing functions ... Browse Code »

commit 4d59b6ccf000862beed6fc0765d3209f98a8d8a2 upstream.

Commit 513e3d2d11c9 ("cpumask: always use nr_cpu_ids in formatting and
parsing functions") converted both cpumask printing and parsing
functions to use nr_cpu_ids instead of nr_cpumask_bits. While this was
okay for the printing functions as it just picked one of the two output
formats that we were alternating between depending on a kernel config,
doing the same for parsing wasn't okay.

nr_cpumask_bits can be either nr_cpu_ids or NR_CPUS. We can always use
nr_cpu_ids but that is a variable while NR_CPUS is a constant, so it can
be more efficient to use NR_CPUS when we can get away with it.
Converting the printing functions to nr_cpu_ids makes sense because it
affects how the masks get presented to userspace and doesn't break
anything; however, using nr_cpu_ids for parsing functions can
incorrectly leave the higher bits uninitialized while reading in these
masks from userland. As all testing and comparison functions use
nr_cpumask_bits which can be larger than nr_cpu_ids, the parsed cpumasks
can erroneously yield false negative results.

This made the taskstats interface incorrectly return -EINVAL even when
the inputs were correct.

Fix it by restoring the parse functions to use nr_cpumask_bits instead
of nr_cpu_ids.

Link: http://lkml.kernel.org/r/20170206182442.GB31078@htj.duckdns.org
Fixes: 513e3d2d11c9 ("cpumask: always use nr_cpu_ids in formatting and parsing functions")
Signed-off-by: Tejun Heo
Reported-by: Martin Steigerwald
Debugged-by: Ben Hutchings
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Tejun Heo
2017-02-15 07:25:34 +0800

09 Feb, 2017

3 commits

e02136282 irqdomain: Avoid activating interrupts more than once ... Browse Code »

commit 08d85f3ea99f1eeafc4e8507936190e86a16ee8c upstream.

Since commit f3b0946d629c ("genirq/msi: Make sure PCI MSIs are
activated early"), we can end-up activating a PCI/MSI twice (once
at allocation time, and once at startup time).

This is normally of no consequences, except that there is some
HW out there that may misbehave if activate is used more than once
(the GICv3 ITS, for example, uses the activate callback
to issue the MAPVI command, and the architecture spec says that
"If there is an existing mapping for the EventID-DeviceID
combination, behavior is UNPREDICTABLE").

While this could be worked around in each individual driver, it may
make more sense to tackle the issue at the core level. In order to
avoid getting in that situation, let's have a per-interrupt flag
to remember if we have already activated that interrupt or not.

Fixes: f3b0946d629c ("genirq/msi: Make sure PCI MSIs are activated early")
Reported-and-tested-by: Andre Przywara
Signed-off-by: Marc Zyngier
Link: http://lkml.kernel.org/r/1484668848-24361-1-git-send-email-marc.zyngier@arm.com
Signed-off-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman

Marc Zyngier
2017-02-09 15:08:31 +0800
12f822d23 percpu-refcount: fix reference leak during percpu-atomic transition ... Browse Code »

commit 966d2b04e070bc040319aaebfec09e0144dc3341 upstream.

percpu_ref_tryget() and percpu_ref_tryget_live() should return
"true" IFF they acquire a reference. But the return value from
atomic_long_inc_not_zero() is a long and may have high bits set,
e.g. PERCPU_COUNT_BIAS, and the return value of the tryget routines
is bool so the reference may actually be acquired but the routines
return "false" which results in a reference leak since the caller
assumes it does not need to do a corresponding percpu_ref_put().

This was seen when performing CPU hotplug during I/O, as hangs in
blk_mq_freeze_queue_wait where percpu_ref_kill (blk_mq_freeze_queue_start)
raced with percpu_ref_tryget (blk_mq_timeout_work).
Sample stack trace:

__switch_to+0x2c0/0x450
__schedule+0x2f8/0x970
schedule+0x48/0xc0
blk_mq_freeze_queue_wait+0x94/0x120
blk_mq_queue_reinit_work+0xb8/0x180
blk_mq_queue_reinit_prepare+0x84/0xa0
cpuhp_invoke_callback+0x17c/0x600
cpuhp_up_callbacks+0x58/0x150
_cpu_up+0xf0/0x1c0
do_cpu_up+0x120/0x150
cpu_subsys_online+0x64/0xe0
device_online+0xb4/0x120
online_store+0xb4/0xc0
dev_attr_store+0x68/0xa0
sysfs_kf_write+0x80/0xb0
kernfs_fop_write+0x17c/0x250
__vfs_write+0x6c/0x1e0
vfs_write+0xd0/0x270
SyS_write+0x6c/0x110
system_call+0x38/0xe0

Examination of the queue showed a single reference (no PERCPU_COUNT_BIAS,
and __PERCPU_REF_DEAD, __PERCPU_REF_ATOMIC set) and no requests.
However, conditions at the time of the race are count of PERCPU_COUNT_BIAS + 0
and __PERCPU_REF_DEAD and __PERCPU_REF_ATOMIC set.

The fix is to make the tryget routines use an actual boolean internally instead
of the atomic long result truncated to a int.

Fixes: e625305b3907 percpu-refcount: make percpu_ref based on longs instead of ints
Link: https://bugzilla.kernel.org/show_bug.cgi?id=190751
Signed-off-by: Douglas Miller
Reviewed-by: Jens Axboe
Signed-off-by: Tejun Heo
Fixes: e625305b3907 ("percpu-refcount: make percpu_ref based on longs instead of ints")
Signed-off-by: Greg Kroah-Hartman

Douglas Miller
2017-02-09 15:08:28 +0800
6cb0497ae base/memory, hotplug: fix a kernel oops in show_valid_zones() ... Browse Code »

commit a96dfddbcc04336bbed50dc2b24823e45e09e80c upstream.

Reading a sysfs "memoryN/valid_zones" file leads to the following oops
when the first page of a range is not backed by struct page.
show_valid_zones() assumes that 'start_pfn' is always valid for
page_zone().

BUG: unable to handle kernel paging request at ffffea017a000000
IP: show_valid_zones+0x6f/0x160

This issue may happen on x86-64 systems with 64GiB or more memory since
their memory block size is bumped up to 2GiB. [1] An example of such
systems is desribed below. 0x3240000000 is only aligned by 1GiB and
this memory block starts from 0x3200000000, which is not backed by
struct page.

BIOS-e820: [mem 0x0000003240000000-0x000000603fffffff] usable

Since test_pages_in_a_zone() already checks holes, fix this issue by
extending this function to return 'valid_start' and 'valid_end' for a
given range. show_valid_zones() then proceeds with the valid range.

[1] 'Commit bdee237c0343 ("x86: mm: Use 2GB memory block size on
large-memory x86-64 systems")'

Link: http://lkml.kernel.org/r/20170127222149.30893-3-toshi.kani@hpe.com
Signed-off-by: Toshi Kani
Cc: Greg Kroah-Hartman
Cc: Zhang Zhen
Cc: Reza Arbab
Cc: David Rientjes
Cc: Dan Williams
Signed-off-by: Greg Kroah-Hartman

Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Toshi Kani
2017-02-09 15:08:28 +0800

04 Feb, 2017

6 commits

e972cce0c lwtunnel: Fix oops on state free after encap module unload ... Browse Code »

[ Upstream commit 85c814016ce3b371016c2c054a905fa2492f5a65 ]

When attempting to free lwtunnel state after the module for the encap
has been unloaded an oops occurs:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: lwtstate_free+0x18/0x40
[..]
task: ffff88003e372380 task.stack: ffffc900001fc000
RIP: 0010:lwtstate_free+0x18/0x40
RSP: 0018:ffff88003fd83e88 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88002bbb3380 RCX: ffff88000c91a300
[..]
Call Trace:

free_fib_info_rcu+0x195/0x1a0
? rt_fibinfo_free+0x50/0x50
rcu_process_callbacks+0x2d3/0x850
? rcu_process_callbacks+0x296/0x850
__do_softirq+0xe4/0x4cb
irq_exit+0xb0/0xc0
smp_apic_timer_interrupt+0x3d/0x50
apic_timer_interrupt+0x93/0xa0
[..]
Code: e8 6e c6 fc ff 89 d8 5b 5d c3 bb de ff ff ff eb f4 66 90 66 66 66 66 90 55 48 89 e5 53 0f b7 07 48 89 fb 48 8b 04 c5 00 81 d5 81 8b 40 08 48 85 c0 74 13 ff d0 48 8d 7b 20 be 20 00 00 00 e8

The problem is after the module for the encap can be unloaded the
corresponding ops is removed and is thus NULL here.

Modules implementing lwtunnel ops should not be allowed to unload
while there is state alive using those ops, so grab the module
reference for the ops on creating lwtunnel state and of course release
the reference when freeing the state.

Fixes: 1104d9ba443a ("lwtunnel: Add destroy state operation")
Signed-off-by: Robert Shearman
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Robert Shearman
2017-02-04 16:47:11 +0800
89c258862 net: Specify the owning module for lwtunnel ops ... Browse Code »

[ Upstream commit 88ff7334f25909802140e690c0e16433e485b0a0 ]

Modules implementing lwtunnel ops should not be allowed to unload
while there is state alive using those ops, so specify the owning
module for all lwtunnel ops.

Signed-off-by: Robert Shearman
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Robert Shearman
2017-02-04 16:47:11 +0800
e9db042dc lwtunnel: fix autoload of lwt modules ... Browse Code »

[ Upstream commit 9ed59592e3e379b2e9557dc1d9e9ec8fcbb33f16]

Trying to add an mpls encap route when the MPLS modules are not loaded
hangs. For example:

CONFIG_MPLS=y
CONFIG_NET_MPLS_GSO=m
CONFIG_MPLS_ROUTING=m
CONFIG_MPLS_IPTUNNEL=m

$ ip route add 10.10.10.10/32 encap mpls 100 via inet 10.100.1.2

The ip command hangs:
root 880 826 0 21:25 pts/0 00:00:00 ip route add 10.10.10.10/32 encap mpls 100 via inet 10.100.1.2

$ cat /proc/880/stack
[] call_usermodehelper_exec+0xd6/0x134
[] __request_module+0x27b/0x30a
[] lwtunnel_build_state+0xe4/0x178
[] fib_create_info+0x47f/0xdd4
[] fib_table_insert+0x90/0x41f
[] inet_rtm_newroute+0x4b/0x52
...

modprobe is trying to load rtnl-lwt-MPLS:

root 881 5 0 21:25 ? 00:00:00 /sbin/modprobe -q -- rtnl-lwt-MPLS

and it hangs after loading mpls_router:

$ cat /proc/881/stack
[] rtnl_lock+0x12/0x14
[] register_netdevice_notifier+0x16/0x179
[] mpls_init+0x25/0x1000 [mpls_router]
[] do_one_initcall+0x8e/0x13f
[] do_init_module+0x5a/0x1e5
[] load_module+0x13bd/0x17d6
...

The problem is that lwtunnel_build_state is called with rtnl lock
held preventing mpls_init from registering.

Given the potential references held by the time lwtunnel_build_state it
can not drop the rtnl lock to the load module. So, extract the module
loading code from lwtunnel_build_state into a new function to validate
the encap type. The new function is called while converting the user
request into a fib_config which is well before any table, device or
fib entries are examined.

Fixes: 745041e2aaf1 ("lwtunnel: autoload of lwt modules")
Signed-off-by: David Ahern
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

David Ahern
2017-02-04 16:47:10 +0800
1e7cbb413 virtio-net: restore VIRTIO_HDR_F_DATA_VALID on receiving ... Browse Code »

[ Upstream commit 6391a4481ba0796805d6581e42f9f0418c099e34 ]

Commit 501db511397f ("virtio: don't set VIRTIO_NET_HDR_F_DATA_VALID on
xmit") in fact disables VIRTIO_HDR_F_DATA_VALID on receiving path too,
fixing this by adding a hint (has_data_valid) and set it only on the
receiving path.

Cc: Rolf Neugebauer
Signed-off-by: Jason Wang
Acked-by: Rolf Neugebauer
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Jason Wang
2017-02-04 16:47:09 +0800
3eab5dd0e virtio: don't set VIRTIO_NET_HDR_F_DATA_VALID on xmit ... Browse Code »

[ Upstream commit 501db511397fd6efff3aa5b4e8de415b55559550 ]

This patch part reverts fd2a0437dc33 and e858fae2b0b8 which introduced a
subtle change in how the virtio_net flags are derived from the SKBs
ip_summed field.

With the above commits, the flags are set to VIRTIO_NET_HDR_F_DATA_VALID
when ip_summed == CHECKSUM_UNNECESSARY, thus treating it differently to
ip_summed == CHECKSUM_NONE, which should be the same.

Further, the virtio spec 1.0 / CS04 explicitly says that
VIRTIO_NET_HDR_F_DATA_VALID must not be set by the driver.

Fixes: fd2a0437dc33 ("virtio_net: introduce virtio_net_hdr_{from,to}_skb")
Fixes: e858fae2b0b8 (" virtio_net: use common code for virtio_net_hdr and skb GSO conversion")
Signed-off-by: Rolf Neugebauer
Acked-by: Michael S. Tsirkin
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Rolf Neugebauer
2017-02-04 16:47:09 +0800
3524f6422 tcp: fix tcp_fastopen unaligned access complaints on sparc ... Browse Code »

[ Upstream commit 003c941057eaa868ca6fedd29a274c863167230d ]

Fix up a data alignment issue on sparc by swapping the order
of the cookie byte array field with the length field in
struct tcp_fastopen_cookie, and making it a proper union
to clean up the typecasting.

This addresses log complaints like these:
log_unaligned: 113 callbacks suppressed
Kernel unaligned access at TPC[976490] tcp_try_fastopen+0x2d0/0x360
Kernel unaligned access at TPC[9764ac] tcp_try_fastopen+0x2ec/0x360
Kernel unaligned access at TPC[9764c8] tcp_try_fastopen+0x308/0x360
Kernel unaligned access at TPC[9764e4] tcp_try_fastopen+0x324/0x360
Kernel unaligned access at TPC[976490] tcp_try_fastopen+0x2d0/0x360

Cc: Eric Dumazet
Signed-off-by: Shannon Nelson
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Shannon Nelson
2017-02-04 16:47:09 +0800

01 Feb, 2017

5 commits

143a9ad4e memory_hotplug: make zone_can_shift() return a boolean value ... Browse Code »

commit 8a1f780e7f28c7c1d640118242cf68d528c456cd upstream.

online_{kernel|movable} is used to change the memory zone to
ZONE_{NORMAL|MOVABLE} and online the memory.

To check that memory zone can be changed, zone_can_shift() is used.
Currently the function returns minus integer value, plus integer
value and 0. When the function returns minus or plus integer value,
it means that the memory zone can be changed to ZONE_{NORNAL|MOVABLE}.

But when the function returns 0, there are two meanings.

One of the meanings is that the memory zone does not need to be changed.
For example, when memory is in ZONE_NORMAL and onlined by online_kernel
the memory zone does not need to be changed.

Another meaning is that the memory zone cannot be changed. When memory
is in ZONE_NORMAL and onlined by online_movable, the memory zone may
not be changed to ZONE_MOVALBE due to memory online limitation(see
Documentation/memory-hotplug.txt). In this case, memory must not be
onlined.

The patch changes the return type of zone_can_shift() so that memory
online operation fails when memory zone cannot be changed as follows:

Before applying patch:
# grep -A 35 "Node 2" /proc/zoneinfo
Node 2, zone Normal

node_scanned 0
spanned 8388608
present 7864320
managed 7864320
# echo online_movable > memory4097/state
# grep -A 35 "Node 2" /proc/zoneinfo
Node 2, zone Normal

node_scanned 0
spanned 8388608
present 8388608
managed 8388608

online_movable operation succeeded. But memory is onlined as
ZONE_NORMAL, not ZONE_MOVABLE.

After applying patch:
# grep -A 35 "Node 2" /proc/zoneinfo
Node 2, zone Normal

node_scanned 0
spanned 8388608
present 7864320
managed 7864320
# echo online_movable > memory4097/state
bash: echo: write error: Invalid argument
# grep -A 35 "Node 2" /proc/zoneinfo
Node 2, zone Normal

node_scanned 0
spanned 8388608
present 7864320
managed 7864320

online_movable operation failed because of failure of changing
the memory zone from ZONE_NORMAL to ZONE_MOVABLE

Fixes: df429ac03936 ("memory-hotplug: more general validation of zone during online")
Link: http://lkml.kernel.org/r/2f9c3837-33d7-b6e5-59c0-6ca4372b2d84@gmail.com
Signed-off-by: Yasuaki Ishimatsu
Reviewed-by: Reza Arbab
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Yasuaki Ishimatsu
2017-02-01 15:33:13 +0800
cb1d48f55 SUNRPC: cleanup ida information when removing sunrpc module ... Browse Code »

commit c929ea0b910355e1876c64431f3d5802f95b3d75 upstream.

After removing sunrpc module, I get many kmemleak information as,
unreferenced object 0xffff88003316b1e0 (size 544):
comm "gssproxy", pid 2148, jiffies 4294794465 (age 4200.081s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[] kmemleak_alloc+0x4a/0xa0
[] kmem_cache_alloc+0x15e/0x1f0
[] ida_pre_get+0xaa/0x150
[] ida_simple_get+0xad/0x180
[] nlmsvc_lookup_host+0x4ab/0x7f0 [lockd]
[] lockd+0x4d/0x270 [lockd]
[] param_set_timeout+0x55/0x100 [lockd]
[] svc_defer+0x114/0x3f0 [sunrpc]
[] svc_defer+0x2d7/0x3f0 [sunrpc]
[] rpc_show_info+0x8a/0x110 [sunrpc]
[] proc_reg_write+0x7f/0xc0
[] __vfs_write+0xdf/0x3c0
[] vfs_write+0xef/0x240
[] SyS_write+0xad/0x130
[] entry_SYSCALL_64_fastpath+0x1a/0xa9
[] 0xffffffffffffffff

I found, the ida information (dynamic memory) isn't cleanup.

Signed-off-by: Kinglong Mee
Fixes: 2f048db4680a ("SUNRPC: Add an identifier for struct rpc_clnt")
Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

Kinglong Mee
2017-02-01 15:33:09 +0800
73fdda3b0 nfs: Don't increment lock sequence ID after NFS4ERR_MOVED ... Browse Code »

commit 059aa734824165507c65fd30a55ff000afd14983 upstream.

Xuan Qi reports that the Linux NFSv4 client failed to lock a file
that was migrated. The steps he observed on the wire:

1. The client sent a LOCK request to the source server
2. The source server replied NFS4ERR_MOVED
3. The client switched to the destination server
4. The client sent the same LOCK request to the destination
server with a bumped lock sequence ID
5. The destination server rejected the LOCK request with
NFS4ERR_BAD_SEQID

RFC 3530 section 8.1.5 provides a list of NFS errors which do not
bump a lock sequence ID.

However, RFC 3530 is now obsoleted by RFC 7530. In RFC 7530 section
9.1.7, this list has been updated by the addition of NFS4ERR_MOVED.

Reported-by: Xuan Qi
Signed-off-by: Chuck Lever
Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

Chuck Lever
2017-02-01 15:33:08 +0800
95600605f IB/cxgb3: fix misspelling in header guard ... Browse Code »

commit b1a27eac7fefff33ccf6acc919fc0725bf9815fb upstream.

Use CXGB3_... instead of CXBG3_...

Fixes: a85fb3383340 ("IB/cxgb3: Move user vendor structures")
Signed-off-by: Nicolas Iooss
Reviewed-by: Leon Romanovsky
Acked-by: Steve Wise
Signed-off-by: Doug Ledford
Signed-off-by: Greg Kroah-Hartman

Nicolas Iooss
2017-02-01 15:33:07 +0800
ade7afe9d mm, page_alloc: fix check for NULL preferred_zone ... Browse Code »

commit ea57485af8f4221312a5a95d63c382b45e7840dc upstream.

Patch series "fix premature OOM regression in 4.7+ due to cpuset races".

This is v2 of my attempt to fix the recent report based on LTP cpuset
stress test [1]. The intention is to go to stable 4.9 LTSS with this,
as triggering repeated OOMs is not nice. That's why the patches try to
be not too intrusive.

Unfortunately why investigating I found that modifying the testcase to
use per-VMA policies instead of per-task policies will bring the OOM's
back, but that seems to be much older and harder to fix problem. I have
posted a RFC [2] but I believe that fixing the recent regressions has a
higher priority.

Longer-term we might try to think how to fix the cpuset mess in a better
and less error prone way. I was for example very surprised to learn,
that cpuset updates change not only task->mems_allowed, but also
nodemask of mempolicies. Until now I expected the parameter to
alloc_pages_nodemask() to be stable. I wonder why do we then treat
cpusets specially in get_page_from_freelist() and distinguish HARDWALL
etc, when there's unconditional intersection between mempolicy and
cpuset. I would expect the nodemask adjustment for saving overhead in
g_p_f(), but that clearly doesn't happen in the current form. So we
have both crazy complexity and overhead, AFAICS.

[1] https://lkml.kernel.org/r/CAFpQJXUq-JuEP=QPidy4p_=FN0rkH5Z-kfB4qBvsf6jMS87Edg@mail.gmail.com
[2] https://lkml.kernel.org/r/7c459f26-13a6-a817-e508-b65b903a8378@suse.cz

This patch (of 4):

Since commit c33d6c06f60f ("mm, page_alloc: avoid looking up the first
zone in a zonelist twice") we have a wrong check for NULL preferred_zone,
which can theoretically happen due to concurrent cpuset modification. We
check the zoneref pointer which is never NULL and we should check the zone
pointer. Also document this in first_zones_zonelist() comment per Michal
Hocko.

Fixes: c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a zonelist twice")
Link: http://lkml.kernel.org/r/20170120103843.24587-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka
Acked-by: Mel Gorman
Acked-by: Hillf Danton
Cc: Ganapatrao Kulkarni
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Vlastimil Babka
2017-02-01 15:33:04 +0800

26 Jan, 2017

5 commits

41c6b3e89 swiotlb: Add swiotlb=noforce debug option ... Browse Code »

commit fff5d99225107f5f13fe4a9805adc2a1c4b5fb00 upstream.

On architectures like arm64, swiotlb is tied intimately to the core
architecture DMA support. In addition, ZONE_DMA cannot be disabled.

To aid debugging and catch devices not supporting DMA to memory outside
the 32-bit address space, add a kernel command line option
"swiotlb=noforce", which disables the use of bounce buffers.
If specified, trying to map memory that cannot be used with DMA will
fail, and a rate-limited warning will be printed.

Note that io_tlb_nslabs is set to 1, which is the minimal supported
value.

Signed-off-by: Geert Uytterhoeven
Signed-off-by: Konrad Rzeszutek Wilk
Signed-off-by: Greg Kroah-Hartman

Geert Uytterhoeven
2017-01-26 15:24:44 +0800
1fd1e6cd6 swiotlb: Convert swiotlb_force from int to enum ... Browse Code »

commit ae7871be189cb41184f1e05742b4a99e2c59774d upstream.

Convert the flag swiotlb_force from an int to an enum, to prepare for
the advent of more possible values.

Suggested-by: Konrad Rzeszutek Wilk
Signed-off-by: Geert Uytterhoeven
Signed-off-by: Konrad Rzeszutek Wilk
Signed-off-by: Greg Kroah-Hartman

Geert Uytterhoeven
2017-01-26 15:24:44 +0800
a297ed84b sunrpc: don't call sleeping functions from the notifier block callbacks ... Browse Code »

commit 546125d1614264d26080817d0c8cddb9b25081fa upstream.

The inet6addr_chain is an atomic notifier chain, so we can't call
anything that might sleep (like lock_sock)... instead of closing the
socket from svc_age_temp_xprts_now (which is called by the notifier
function), just have the rpc service threads do it instead.

Fixes: c3d4879e01be "sunrpc: Add a function to close..."
Signed-off-by: Scott Mayhew
Signed-off-by: J. Bruce Fields
Signed-off-by: Greg Kroah-Hartman

Scott Mayhew
2017-01-26 15:24:37 +0800
90687fc3c rcu: Narrow early boot window of illegal synchronous grace periods ... Browse Code »

commit 52d7e48b86fc108e45a656d8e53e4237993c481d upstream.

The current preemptible RCU implementation goes through three phases
during bootup. In the first phase, there is only one CPU that is running
with preemption disabled, so that a no-op is a synchronous grace period.
In the second mid-boot phase, the scheduler is running, but RCU has
not yet gotten its kthreads spawned (and, for expedited grace periods,
workqueues are not yet running. During this time, any attempt to do
a synchronous grace period will hang the system (or complain bitterly,
depending). In the third and final phase, RCU is fully operational and
everything works normally.

This has been OK for some time, but there has recently been some
synchronous grace periods showing up during the second mid-boot phase.
This code worked "by accident" for awhile, but started failing as soon
as expedited RCU grace periods switched over to workqueues in commit
8b355e3bc140 ("rcu: Drive expedited grace periods from workqueue").
Note that the code was buggy even before this commit, as it was subject
to failure on real-time systems that forced all expedited grace periods
to run as normal grace periods (for example, using the rcu_normal ksysfs
parameter). The callchain from the failure case is as follows:

early_amd_iommu_init()
|-> acpi_put_table(ivrs_base);
|-> acpi_tb_put_table(table_desc);
|-> acpi_tb_invalidate_table(table_desc);
|-> acpi_tb_release_table(...)
|-> acpi_os_unmap_memory
|-> acpi_os_unmap_iomem
|-> acpi_os_map_cleanup
|-> synchronize_rcu_expedited

The kernel showing this callchain was built with CONFIG_PREEMPT_RCU=y,
which caused the code to try using workqueues before they were
initialized, which did not go well.

This commit therefore reworks RCU to permit synchronous grace periods
to proceed during this mid-boot phase. This commit is therefore a
fix to a regression introduced in v4.9, and is therefore being put
forward post-merge-window in v4.10.

This commit sets a flag from the existing rcu_scheduler_starting()
function which causes all synchronous grace periods to take the expedited
path. The expedited path now checks this flag, using the requesting task
to drive the expedited grace period forward during the mid-boot phase.
Finally, this flag is updated by a core_initcall() function named
rcu_exp_runtime_mode(), which causes the runtime codepaths to be used.

Note that this arrangement assumes that tasks are not sent POSIX signals
(or anything similar) from the time that the first task is spawned
through core_initcall() time.

Fixes: 8b355e3bc140 ("rcu: Drive expedited grace periods from workqueue")
Reported-by: "Zheng, Lv"
Reported-by: Borislav Petkov
Signed-off-by: Paul E. McKenney
Tested-by: Stan Kain
Tested-by: Ivan
Tested-by: Emanuel Castelo
Tested-by: Bruno Pesavento
Tested-by: Borislav Petkov
Tested-by: Frederic Bezies
Signed-off-by: Greg Kroah-Hartman

Paul E. McKenney
2017-01-26 15:24:37 +0800
387812143 ARM: dts: r8a7794: remove Z clock ... Browse Code »

commit 68cc085a4daaa32f7138de1e918331c05165a484 upstream.

R8A7794 doesn't have Cortex-A15 CPUs, thus there's no Z clock...

Fixes: 0dce5454d5c2 ("ARM: shmobile: Initial r8a7794 SoC device tree")
Signed-off-by: Sergei Shtylyov
Reviewed-by: Geert Uytterhoeven
Signed-off-by: Simon Horman
Signed-off-by: Greg Kroah-Hartman

Sergei Shtylyov
2017-01-26 15:24:36 +0800

20 Jan, 2017

9 commits

cb50d45c3 power: supply: bq27xxx_battery: Fix register map for BQ27510 and BQ27520 ... Browse Code »

commit 3bee9ea1de687925d116670f036599cbed8b66b0 upstream.

The BQ27510 and BQ27520 use a slightly different register map than the
BQ27500, add a new type enum and add these gauges to it.

Fixes: d74534c27775 ("power: bq27xxx_battery: Add support for additional bq27xxx family devices")
Based-on-patch-by: Kenneth R. Crudup
Signed-off-by: Andrew F. Davis
Signed-off-by: Sebastian Reichel
Signed-off-by: Greg Kroah-Hartman

Andrew F. Davis
2017-01-20 03:18:07 +0800
f99694cda block: Change extern inline to static inline ... Browse Code »

commit 9a05e7541c39680d28ecf91892338e074738d5fd upstream.

With compilers which follow the C99 standard (like modern versions of
gcc and clang), "extern inline" does the opposite thing from older
versions of gcc (emits code for an externally linkable version of the
inline function).

"static inline" does the intended behavior in all cases instead.

Description taken from commit 6d91857d4826 ("staging, rtl8192e,
LLVMLinux: Change extern inline to static inline").

This also fixes the following GCC warning when building with CONFIG_PM
disabled:

./include/linux/blkdev.h:1143:20: warning: no previous prototype for 'blk_set_runtime_active' [-Wmissing-prototypes]

Fixes: d07ab6d11477 ("block: Add blk_set_runtime_active()")
Reviewed-by: Mika Westerberg
Signed-off-by: Tobias Klauser
Signed-off-by: Jens Axboe
Signed-off-by: Greg Kroah-Hartman

Tobias Klauser
2017-01-20 03:18:07 +0800
f9cf776b0 ASoC: hdmi-codec: use unsigned type to structure members with bit-field ... Browse Code »

commit 9e4d59ada4d602e78eee9fb5f898ce61fdddb446 upstream.

This is a fix for Linux 4.10-rc1.

In C language specification, a bit-field is interpreted as a signed or
unsigned integer type consisting of the specified number of bits.

In GCC manual, the range of a signed bit field of N bits is from
-(2^N) / 2 to ((2^N) / 2) - 1
https://www.gnu.org/software/gnu-c-manual/gnu-c-manual.html#Bit-Fields

Therefore, when defined as 1 bit-field with signed type, variables can
represents -1 and 0.

The snd-soc-hdmi-codec module includes a structure which has signed type
members with bit-fields. Codes of this module assign 0 and 1 to the
members. This seems to result in implementation-dependent behaviours.

As of v4.10-rc1 merge window, outside of sound subsystem, this structure
is referred by below GPU modules.
- tda998x
- sti-drm
- mediatek-drm-hdmi
- msm

As long as I review their codes relevant to the structure, the structure
members are used just for condition statements and printk formats.
My proposal of change is a bit intrusive to the printk formats but this
may be acceptable.

Totally, it's reasonable to use unsigned type for the structure members.
This bug is detected by Sparse, static code analyzer with below warnings.

./include/sound/hdmi-codec.h:39:26: error: dubious one-bit signed bitfield
./include/sound/hdmi-codec.h:40:28: error: dubious one-bit signed bitfield
./include/sound/hdmi-codec.h:41:29: error: dubious one-bit signed bitfield
./include/sound/hdmi-codec.h:42:31: error: dubious one-bit signed bitfield

Fixes: 09184118a8ab ("ASoC: hdmi-codec: Add hdmi-codec for external HDMI-encoders")
Signed-off-by: Takashi Sakamoto
Acked-by: Arnaud Pouliquen
Signed-off-by: Mark Brown
Signed-off-by: Greg Kroah-Hartman

Takashi Sakamoto
2017-01-20 03:18:02 +0800
28dad9aa9 btrfs: fix crash when tracepoint arguments are freed by wq callbacks ... Browse Code »

commit ac0c7cf8be00f269f82964cf7b144ca3edc5dbc4 upstream.

Enabling btrfs tracepoints leads to instant crash, as reported. The wq
callbacks could free the memory and the tracepoints started to
dereference the members to get to fs_info.

The proposed fix https://marc.info/?l=linux-btrfs&m=148172436722606&w=2
removed the tracepoints but we could preserve them by passing only the
required data in a safe way.

Fixes: bc074524e123 ("btrfs: prefix fsid to all trace events")
Reported-by: Sebastian Andrzej Siewior
Reviewed-by: Qu Wenruo
Signed-off-by: David Sterba
Signed-off-by: Greg Kroah-Hartman

David Sterba
2017-01-20 03:18:02 +0800
14d6c9667 x86/efi: Don't allocate memmap through memblock after mm_init() ... Browse Code »

commit 20b1e22d01a4b0b11d3a1066e9feb04be38607ec upstream.

With the following commit:

4bc9f92e64c8 ("x86/efi-bgrt: Use efi_mem_reserve() to avoid copying image data")

... efi_bgrt_init() calls into the memblock allocator through
efi_mem_reserve() => efi_arch_mem_reserve() *after* mm_init() has been called.

Indeed, KASAN reports a bad read access later on in efi_free_boot_services():

BUG: KASAN: use-after-free in efi_free_boot_services+0xae/0x24c
at addr ffff88022de12740
Read of size 4 by task swapper/0/0
page:ffffea0008b78480 count:0 mapcount:-127
mapping: (null) index:0x1 flags: 0x5fff8000000000()
[...]
Call Trace:
dump_stack+0x68/0x9f
kasan_report_error+0x4c8/0x500
kasan_report+0x58/0x60
__asan_load4+0x61/0x80
efi_free_boot_services+0xae/0x24c
start_kernel+0x527/0x562
x86_64_start_reservations+0x24/0x26
x86_64_start_kernel+0x157/0x17a
start_cpu+0x5/0x14

The instruction at the given address is the first read from the memmap's
memory, i.e. the read of md->type in efi_free_boot_services().

Note that the writes earlier in efi_arch_mem_reserve() don't splat because
they're done through early_memremap()ed addresses.

So, after memblock is gone, allocations should be done through the "normal"
page allocator. Introduce a helper, efi_memmap_alloc() for this. Use
it from efi_arch_mem_reserve(), efi_free_boot_services() and, for the sake
of consistency, from efi_fake_memmap() as well.

Note that for the latter, the memmap allocations cease to be page aligned.
This isn't needed though.

Tested-by: Dan Williams
Signed-off-by: Nicolai Stange
Reviewed-by: Ard Biesheuvel
Cc: Dave Young
Cc: Linus Torvalds
Cc: Matt Fleming
Cc: Mika Penttilä
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-efi@vger.kernel.org
Fixes: 4bc9f92e64c8 ("x86/efi-bgrt: Use efi_mem_reserve() to avoid copying image data")
Link: http://lkml.kernel.org/r/20170105125130.2815-1-nicstange@gmail.com
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman

Nicolai Stange
2017-01-20 03:18:00 +0800
99b17ac00 efi/x86: Prune invalid memory map entries and fix boot regression ... Browse Code »

commit 0100a3e67a9cef64d72cd3a1da86f3ddbee50363 upstream.

Some machines, such as the Lenovo ThinkPad W541 with firmware GNET80WW
(2.28), include memory map entries with phys_addr=0x0 and num_pages=0.

These machines fail to boot after the following commit,

commit 8e80632fb23f ("efi/esrt: Use efi_mem_reserve() and avoid a kmalloc()")

Fix this by removing such bogus entries from the memory map.

Furthermore, currently the log output for this case (with efi=debug)
looks like:

[ 0.000000] efi: mem45: [Reserved | | | | | | | | | | | | ] range=[0x0000000000000000-0xffffffffffffffff] (0MB)

This is clearly wrong, and also not as informative as it could be. This
patch changes it so that if we find obviously invalid memory map
entries, we print an error and skip those entries. It also detects the
display of the address range calculation overflow, so the new output is:

[ 0.000000] efi: [Firmware Bug]: Invalid EFI memory map entries:
[ 0.000000] efi: mem45: [Reserved | | | | | | | | | | | | ] range=[0x0000000000000000-0x0000000000000000] (invalid)

It also detects memory map sizes that would overflow the physical
address, for example phys_addr=0xfffffffffffff000 and
num_pages=0x0200000000000001, and prints:

[ 0.000000] efi: [Firmware Bug]: Invalid EFI memory map entries:
[ 0.000000] efi: mem45: [Reserved | | | | | | | | | | | | ] range=[phys_addr=0xfffffffffffff000-0x20ffffffffffffffff] (invalid)

It then removes these entries from the memory map.

Signed-off-by: Peter Jones
Signed-off-by: Ard Biesheuvel
[ardb: refactor for clarity with no functional changes, avoid PAGE_SHIFT]
Signed-off-by: Matt Fleming
[Matt: Include bugzilla info in commit log]
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: https://bugzilla.kernel.org/show_bug.cgi?id=191121
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman

Peter Jones
2017-01-20 03:18:00 +0800
483ecebb2 jump_labels: API for flushing deferred jump label updates ... Browse Code »

commit b6416e61012429e0277bd15a229222fd17afc1c1 upstream.

Modules that use static_key_deferred need a way to synchronize with
any delayed work that is still pending when the module is unloaded.
Introduce static_key_deferred_flush() which flushes any pending
jump label updates.

Signed-off-by: David Matlack
Acked-by: Peter Zijlstra (Intel)
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman

David Matlack
2017-01-20 03:17:59 +0800
6ca29ee3c mm: support anonymous stable page ... Browse Code »

commit f05714293a591038304ddae7cb0dd747bb3786cc upstream.

During developemnt for zram-swap asynchronous writeback, I found strange
corruption of compressed page, resulting in:

Modules linked in: zram(E)
CPU: 3 PID: 1520 Comm: zramd-1 Tainted: G E 4.8.0-mm1-00320-ge0d4894c9c38-dirty #3274
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
task: ffff88007620b840 task.stack: ffff880078090000
RIP: set_freeobj.part.43+0x1c/0x1f
RSP: 0018:ffff880078093ca8 EFLAGS: 00010246
RAX: 0000000000000018 RBX: ffff880076798d88 RCX: ffffffff81c408c8
RDX: 0000000000000018 RSI: 0000000000000000 RDI: 0000000000000246
RBP: ffff880078093cb0 R08: 0000000000000000 R09: 0000000000000000
R10: ffff88005bc43030 R11: 0000000000001df3 R12: ffff880076798d88
R13: 000000000005bc43 R14: ffff88007819d1b8 R15: 0000000000000001
FS: 0000000000000000(0000) GS:ffff88007e380000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc934048f20 CR3: 0000000077b01000 CR4: 00000000000406e0
Call Trace:
obj_malloc+0x22b/0x260
zs_malloc+0x1e4/0x580
zram_bvec_rw+0x4cd/0x830 [zram]
page_requests_rw+0x9c/0x130 [zram]
zram_thread+0xe6/0x173 [zram]
kthread+0xca/0xe0
ret_from_fork+0x25/0x30

With investigation, it reveals currently stable page doesn't support
anonymous page. IOW, reuse_swap_page can reuse the page without waiting
writeback completion so it can overwrite page zram is compressing.

Unfortunately, zram has used per-cpu stream feature from v4.7.
It aims for increasing cache hit ratio of scratch buffer for
compressing. Downside of that approach is that zram should ask
memory space for compressed page in per-cpu context which requires
stricted gfp flag which could be failed. If so, it retries to
allocate memory space out of per-cpu context so it could get memory
this time and compress the data again, copies it to the memory space.

In this scenario, zram assumes the data should never be changed
but it is not true unless stable page supports. So, If the data is
changed under us, zram can make buffer overrun because second
compression size could be bigger than one we got in previous trial
and blindly, copy bigger size object to smaller buffer which is
buffer overrun. The overrun breaks zsmalloc free object chaining
so system goes crash like above.

I think below is same problem.
https://bugzilla.suse.com/show_bug.cgi?id=997574

Unfortunately, reuse_swap_page should be atomic so that we cannot wait on
writeback in there so the approach in this patch is simply return false if
we found it needs stable page. Although it increases memory footprint
temporarily, it happens rarely and it should be reclaimed easily althoug
it happened. Also, It would be better than waiting of IO completion,
which is critial path for application latency.

Fixes: da9556a2367c ("zram: user per-cpu compression streams")
Link: http://lkml.kernel.org/r/20161120233015.GA14113@bbox
Link: http://lkml.kernel.org/r/1482366980-3782-2-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim
Acked-by: Hugh Dickins
Cc: Sergey Senozhatsky
Cc: Darrick J. Wong
Cc: Takashi Iwai
Cc: Hyeoncheol Lee
Cc:
Cc: Sangseok Lee
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Minchan Kim
2017-01-20 03:17:59 +0800
07fc9575e mm, memcg: fix the active list aging for lowmem requests when memcg is enabled ... Browse Code »

commit b4536f0c829c8586544c94735c343f9b5070bd01 upstream.

Nils Holland and Klaus Ethgen have reported unexpected OOM killer
invocations with 32b kernel starting with 4.8 kernels

kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
kworker/u4:5 cpuset=/ mems_allowed=0
CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
[...]
Mem-Info:
active_anon:58685 inactive_anon:90 isolated_anon:0
active_file:274324 inactive_file:281962 isolated_file:0
unevictable:0 dirty:649 writeback:0 unstable:0
slab_reclaimable:40662 slab_unreclaimable:17754
mapped:7382 shmem:202 pagetables:351 bounce:0
free:206736 free_pcp:332 free_cma:0
Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB writepending:96kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 813 3474 3474
Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
lowmem_reserve[]: 0 0 21292 21292
HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB

the oom killer is clearly pre-mature because there there is still a lot
of page cache in the zone Normal which should satisfy this lowmem
request. Further debugging has shown that the reclaim cannot make any
forward progress because the page cache is hidden in the active list
which doesn't get rotated because inactive_list_is_low is not memcg
aware.

The code simply subtracts per-zone highmem counters from the respective
memcg's lru sizes which doesn't make any sense. We can simply end up
always seeing the resulting active and inactive counts 0 and return
false. This issue is not limited to 32b kernels but in practice the
effect on systems without CONFIG_HIGHMEM would be much harder to notice
because we do not invoke the OOM killer for allocations requests
targeting < ZONE_NORMAL.

Fix the issue by tracking per zone lru page counts in mem_cgroup_per_node
and subtract per-memcg highmem counts when memcg is enabled. Introduce
helper lruvec_zone_lru_size which redirects to either zone counters or
mem_cgroup_get_zone_lru_size when appropriate.

We are losing empty LRU but non-zero lru size detection introduced by
ca707239e8a7 ("mm: update_lru_size warn and reset bad lru_size") because
of the inherent zone vs. node discrepancy.

Fixes: f8d1a31163fc ("mm: consider whether to decivate based on eligible zones inactive ratio")
Link: http://lkml.kernel.org/r/20170104100825.3729-1-mhocko@kernel.org
Signed-off-by: Michal Hocko
Reported-by: Nils Holland
Tested-by: Nils Holland
Reported-by: Klaus Ethgen
Acked-by: Minchan Kim
Acked-by: Mel Gorman
Acked-by: Johannes Weiner
Reviewed-by: Vladimir Davydov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Michal Hocko
2017-01-20 03:17:59 +0800

15 Jan, 2017

1 commit

17a561b19 gro: Disable frag0 optimization on IPv6 ext headers ... Browse Code »

[ Upstream commit 57ea52a865144aedbcd619ee0081155e658b6f7d ]

The GRO fast path caches the frag0 address. This address becomes
invalid if frag0 is modified by pskb_may_pull or its variants.
So whenever that happens we must disable the frag0 optimization.

This is usually done through the combination of gro_header_hard
and gro_header_slow, however, the IPv6 extension header path did
the pulling directly and would continue to use the GRO fast path
incorrectly.

This patch fixes it by disabling the fast path when we enter the
IPv6 extension header path.

Fixes: 78a478d0efd9 ("gro: Inline skb_gro_header and cache frag0 virtual address")
Reported-by: Slava Shwartsman
Signed-off-by: Herbert Xu
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Herbert Xu
2017-01-15 20:42:55 +0800

12 Jan, 2017

2 commits

34db201f0 clocksource/dummy_timer: Move hotplug callback after the real timers ... Browse Code »

commit 9bf11ecce5a2758e5a097c2f3a13d08552d0d6f9 upstream.

When the dummy timer callback is invoked before the real timer callbacks,
then it tries to install that timer for the starting CPU. If the platform
does not have a broadcast timer installed the installation fails with a
kernel crash. The crash happens due to a unconditional deference of the non
available broadcast device. This needs to be fixed in the timer core code.

But even when this is fixed in the core code then installing the dummy
timer before the real timers is a pointless exercise.

Move it to the end of the callback list.

Fixes: 00c1d17aab51 ("clocksource/dummy_timer: Convert to hotplug state machine")
Reported-and-tested-by: Mason
Signed-off-by: Thomas Gleixner
Cc: Mark Rutland
Cc: Anna-Maria Gleixner
Cc: Richard Cochran
Cc: Sebastian Andrzej Siewior
Cc: Daniel Lezcano
Cc: Peter Zijlstra ,
Cc: Sebastian Frias
Cc: Thibaud Cornic
Cc: Robin Murphy
Link: http://lkml.kernel.org/r/1147ef90-7877-e4d2-bb2b-5c4fa8d3144b@free.fr
Signed-off-by: Greg Kroah-Hartman

Thomas Gleixner
2017-01-12 18:39:45 +0800
04b97a6be PCI: Add Mellanox device IDs ... Browse Code »

commit 7254383341bc6e1a61996accd836009f0c922b21 upstream.

Add Mellanox device IDs for use by the mlx4 driver and INTx quirks.

[bhelgaas: sorted and adapted from
http://lkml.kernel.org/r/1478011644-12080-1-git-send-email-noaos@mellanox.com]
Signed-off-by: Noa Osherovich
Signed-off-by: Bjorn Helgaas
Signed-off-by: Greg Kroah-Hartman

Noa Osherovich
2017-01-12 18:39:36 +0800