18 Feb, 2017

4 commits

  • [ Upstream commit 217e6fa24ce28ec87fca8da93c9016cb78028612 ]

    The stack must not pass packets to device drivers that are shorter
    than the minimum link layer header length.

    Previously, packet sockets would drop packets smaller than or equal
    to dev->hard_header_len, but this has false positives. Zero length
    payload is used over Ethernet. Other link layer protocols support
    variable length headers. Support for validation of these protocols
    removed the min length check for all protocols.

    Introduce an explicit dev->min_header_len parameter and drop all
    packets below this value. Initially, set it to non-zero only for
    Ethernet and loopback. Other protocols can follow in a patch to
    net-next.
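
    As a rough userspace sketch of the check described above (struct
    fake_dev and packet_len_ok() are hypothetical names, not the kernel's
    code), a send path could reject any frame shorter than the device's
    minimum header length while still accepting a frame that consists of
    a header and no payload:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two device fields named above. */
struct fake_dev {
	unsigned short hard_header_len; /* maximum link layer header length */
	unsigned char min_header_len;   /* minimum link layer header length */
};

/* Return true if a frame of 'len' bytes may be handed to the driver. */
static bool packet_len_ok(const struct fake_dev *dev, size_t len)
{
	/* A zero min_header_len means "no minimum set", so nothing is dropped. */
	return len >= dev->min_header_len;
}
```

    With both lengths set to 14 (the Ethernet header size), a 13 byte
    frame is refused and a 14 byte frame with zero length payload passes.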

    Fixes: 9ed988cd5915 ("packet: validate variable length ll headers")
    Reported-by: Sowmini Varadhan
    Signed-off-by: Willem de Bruijn
    Acked-by: Eric Dumazet
    Acked-by: Sowmini Varadhan
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Willem de Bruijn
     
  • [ Upstream commit 2bd137de531367fb573d90150d1872cb2a2095f7 ]

    An error was reported upgrading to 4.9.8:
    root@Typhoon:~# ip route add default table 210 nexthop dev eth0 via 10.68.64.1
    weight 1 nexthop dev eth0 via 10.68.64.2 weight 1
    RTNETLINK answers: Operation not supported

    The problem occurs when CONFIG_LWTUNNEL is not enabled and a multipath
    route is submitted.

    The point of lwtunnel_valid_encap_type_attr is to catch modules that
    need to be loaded before any references are taken with rtnl held. With
    CONFIG_LWTUNNEL disabled, there will be no modules to load so the
    lwtunnel_valid_encap_type_attr stub should just return 0.
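
    The stub pattern can be sketched as follows. This is only an
    illustration: FAKE_CONFIG_LWTUNNEL and fake_valid_encap_type_attr()
    are made-up names, and the real function's signature is not
    reproduced here.

```c
#include <stddef.h>

/* Define FAKE_CONFIG_LWTUNNEL at build time to select the "real" variant. */
#ifdef FAKE_CONFIG_LWTUNNEL
/* Placeholder for the real check, which may need to load a module. */
int fake_valid_encap_type_attr(const void *attr, int len);
#else
/* Feature compiled out: there is nothing to load, so always succeed. */
static inline int fake_valid_encap_type_attr(const void *attr, int len)
{
	(void)attr;
	(void)len;
	return 0;
}
#endif
```

    Returning 0 from the compiled-out variant lets callers proceed
    instead of failing with "Operation not supported" as in the report.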

    Fixes: 9ed59592e3e3 ("lwtunnel: fix autoload of lwt modules")
    Reported-by: pupilla@libero.it
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David Ahern
     
  • [ Upstream commit d71b7896886345c53ef1d84bda2bc758554f5d61 ]

    syzkaller found another out of bound access in ip_options_compile(),
    or more exactly in cipso_v4_validate()

    Fixes: 20e2a8648596 ("cipso: handle CIPSO options correctly when NetLabel is disabled")
    Fixes: 446fda4f2682 ("[NetLabel]: CIPSOv4 engine")
    Signed-off-by: Eric Dumazet
    Reported-by: Dmitry Vyukov
    Cc: Paul Moore
    Acked-by: Paul Moore
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit f1712c73714088a7252d276a57126d56c7d37e64 ]

    Zhang Yanmin reported crashes [1] and provided a patch adding a
    synchronize_rcu() call in can_rx_unregister()

    The main problem seems to be that the sockets themselves are not RCU
    protected.

    If CAN uses RCU for delivery, then sockets should be freed only after
    one RCU grace period.

    Recent kernels could use sock_set_flag(sk, SOCK_RCU_FREE), but let's
    ease stable backports with the following fix instead.

    [1]
    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] selinux_socket_sock_rcv_skb+0x65/0x2a0

    Call Trace:

    [] security_sock_rcv_skb+0x4c/0x60
    [] sk_filter+0x41/0x210
    [] sock_queue_rcv_skb+0x53/0x3a0
    [] raw_rcv+0x2a3/0x3c0
    [] can_rcv_filter+0x12b/0x370
    [] can_receive+0xd9/0x120
    [] can_rcv+0xab/0x100
    [] __netif_receive_skb_core+0xd8c/0x11f0
    [] __netif_receive_skb+0x24/0xb0
    [] process_backlog+0x127/0x280
    [] net_rx_action+0x33b/0x4f0
    [] __do_softirq+0x184/0x440
    [] do_softirq_own_stack+0x1c/0x30

    [] do_softirq.part.18+0x3b/0x40
    [] do_softirq+0x1d/0x20
    [] netif_rx_ni+0xe5/0x110
    [] slcan_receive_buf+0x507/0x520
    [] flush_to_ldisc+0x21c/0x230
    [] process_one_work+0x24f/0x670
    [] worker_thread+0x9d/0x6f0
    [] ? rescuer_thread+0x480/0x480
    [] kthread+0x12c/0x150
    [] ret_from_fork+0x3f/0x70

    Reported-by: Zhang Yanmin
    Signed-off-by: Eric Dumazet
    Acked-by: Oliver Hartkopp
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

15 Feb, 2017

5 commits

  • commit 433e19cf33d34bb6751c874a9c00980552fe508c upstream.

    Commit a389fcfd2cb5 ("Drivers: hv: vmbus: Fix signaling logic in
    hv_need_to_signal_on_read()")
    added the proper mb(), but removed the test "prev_write_sz < pending_sz"
    when making the signal decision.

    As a result, the guest can signal the host unnecessarily,
    and then the host can throttle the guest because the host
    thinks the guest is buggy or malicious; finally, a user
    running a stress test can perceive intermittent freezes of
    the guest.

    This patch brings back the test, and properly handles the
    in-place consumption APIs used by NetVSC (see get_next_pkt_raw(),
    put_pkt_raw() and commit_rd_index()).
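
    A simplified sketch of the restored decision, not the driver's actual
    code: the function name and cur_write_sz are assumptions, while
    prev_write_sz and pending_sz follow the wording above. The reader
    signals only when its read moved the free space from "too small for
    the waiting writer" to "large enough".

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether the reader should signal the writer after freeing space.
 * pending_sz:    bytes the writer is waiting for (0 means it is not blocked)
 * prev_write_sz: free space before this read
 * cur_write_sz:  free space after this read
 */
static bool need_to_signal_on_read(uint32_t pending_sz,
				   uint32_t prev_write_sz,
				   uint32_t cur_write_sz)
{
	if (pending_sz == 0)
		return false; /* writer is not waiting */
	if (cur_write_sz < pending_sz)
		return false; /* still not enough room */
	/* Signal only on the transition from "too small" to "large enough". */
	return prev_write_sz < pending_sz;
}
```

    Without the final comparison, every read that leaves enough room
    would signal, which is the unnecessary signaling described above.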

    Fixes: a389fcfd2cb5 ("Drivers: hv: vmbus: Fix signaling logic in
    hv_need_to_signal_on_read()")

    Signed-off-by: Dexuan Cui
    Reported-by: Rolf Neugebauer
    Tested-by: Rolf Neugebauer
    Cc: "K. Y. Srinivasan"
    Cc: Haiyang Zhang
    Cc: Stephen Hemminger
    Signed-off-by: K. Y. Srinivasan
    Cc: Rolf Neugebauer
    Signed-off-by: Greg Kroah-Hartman

    Dexuan Cui
     
  • commit 3372592a140db69fd63837e81f048ab4abf8111e upstream.

    Signal the host when we determine the host is to be signaled -
    on the read path. The current code determines the need to signal in the
    ringbuffer code and actually issues the signal elsewhere. This can result
    in the host viewing this interrupt as spurious since the host may also
    poll the channel. Make the necessary adjustments.

    Signed-off-by: K. Y. Srinivasan
    Cc: Rolf Neugebauer
    Signed-off-by: Greg Kroah-Hartman

    K. Y. Srinivasan
     
  • commit 1f6ee4e7d83586c8b10bd4f2f4346353d04ce884 upstream.

    Signal the host when we determine the host is to be signaled.
    The current code determines the need to signal in the ringbuffer
    code and actually issues the signal elsewhere. This can result
    in the host viewing this interrupt as spurious since the host may also
    poll the channel. Make the necessary adjustments.

    Signed-off-by: K. Y. Srinivasan
    Cc: Rolf Neugebauer
    Signed-off-by: Greg Kroah-Hartman

    K. Y. Srinivasan
     
  • commit 01d4d673558985d9a118e1e05026633c3e2ade9b upstream.

    This patch addresses a long-standing bug with multi-session
    (eg: iscsi-target + iser-target) se_node_acl dynamic free
    within transport_deregister_session().

    This bug is caused when a storage endpoint is configured with
    demo-mode (generate_node_acls = 1 + cache_dynamic_acls = 1)
    initiators, and initiator login creates a new dynamic node acl
    and attaches two sessions to it.

    After that, demo-mode for the storage instance is disabled via
    configfs (generate_node_acls = 0 + cache_dynamic_acls = 0) and
    the existing dynamic acl is never converted to an explicit ACL.

    The end result is that dynamic acl resources are released twice when
    the sessions are shut down in transport_deregister_session().

    If the storage instance is not changed to disable demo-mode,
    or the dynamic acl is converted to an explicit ACL, or there
    is only a single session associated with the dynamic ACL,
    the bug is not triggered.

    To address this bug, move the release of dynamic se_node_acl
    memory into target_complete_nacl() so it's only freed once
    when se_node_acl->acl_kref reaches zero.
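
    The "free only in the release path" idea can be sketched in
    userspace as below. All names are hypothetical and the counter is a
    plain int rather than a kref, so this shows the shape of the fix and
    not the target code itself.

```c
#include <stdlib.h>

struct fake_acl {
	int refcount; /* stands in for se_node_acl->acl_kref; not atomic here */
	int *freed;   /* incremented each time the release path runs */
};

/* Allocate an ACL holding one reference; returns NULL on failure. */
static struct fake_acl *fake_acl_new(int *freed)
{
	struct fake_acl *acl = malloc(sizeof(*acl));

	if (acl) {
		acl->refcount = 1;
		acl->freed = freed;
	}
	return acl;
}

static void fake_acl_get(struct fake_acl *acl)
{
	acl->refcount++;
}

/* The only place that frees the ACL, reached once when the count hits zero. */
static void fake_acl_put(struct fake_acl *acl)
{
	if (--acl->refcount == 0) {
		++*acl->freed;
		free(acl);
	}
}
```

    With two sessions each holding a reference, the first put leaves the
    ACL alive and the second frees it exactly once, whatever the
    demo-mode settings are at that point.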

    (Drop unnecessary list_del_init usage - HCH)

    Reported-by: Rob Millner
    Tested-by: Rob Millner
    Cc: Rob Millner
    Signed-off-by: Nicholas Bellinger
    Signed-off-by: Greg Kroah-Hartman

    Nicholas Bellinger
     
  • commit 4d59b6ccf000862beed6fc0765d3209f98a8d8a2 upstream.

    Commit 513e3d2d11c9 ("cpumask: always use nr_cpu_ids in formatting and
    parsing functions") converted both cpumask printing and parsing
    functions to use nr_cpu_ids instead of nr_cpumask_bits. While this was
    okay for the printing functions as it just picked one of the two output
    formats that we were alternating between depending on a kernel config,
    doing the same for parsing wasn't okay.

    nr_cpumask_bits can be either nr_cpu_ids or NR_CPUS. We can always use
    nr_cpu_ids but that is a variable while NR_CPUS is a constant, so it can
    be more efficient to use NR_CPUS when we can get away with it.
    Converting the printing functions to nr_cpu_ids makes sense because it
    affects how the masks get presented to userspace and doesn't break
    anything; however, using nr_cpu_ids for parsing functions can
    incorrectly leave the higher bits uninitialized while reading in these
    masks from userland. As all testing and comparison functions use
    nr_cpumask_bits which can be larger than nr_cpu_ids, the parsed cpumasks
    can erroneously yield false negative results.

    This made the taskstats interface incorrectly return -EINVAL even when
    the inputs were correct.

    Fix it by restoring the parse functions to use nr_cpumask_bits instead
    of nr_cpu_ids.
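
    A toy model of the false negative, assuming a 64-bit mask and made-up
    helper names (fake_parse, fake_subset): parsing that writes only the
    low nr_cpu_ids bits leaves stale high bits behind, and a comparison
    over the full width then fails.

```c
#include <stdbool.h>
#include <stdint.h>

#define FAKE_NR_CPUS 64u /* compile-time maximum, playing the role of NR_CPUS */

/* Store the low 'nbits' bits of 'src' into '*dst'; higher bits are untouched. */
static void fake_parse(uint64_t *dst, uint64_t src, unsigned int nbits)
{
	uint64_t keep = (nbits >= 64) ? ~UINT64_C(0)
				      : ((UINT64_C(1) << nbits) - 1);

	*dst = (*dst & ~keep) | (src & keep);
}

/* Subset test over all FAKE_NR_CPUS bits, like the comparison helpers. */
static bool fake_subset(uint64_t a, uint64_t b)
{
	return (a & ~b) == 0;
}
```

    Starting from a buffer of all ones with four possible CPUs (mask
    0xF), parsing 0x3 over 4 bits yields a mask that is not a subset of
    0xF, while parsing over FAKE_NR_CPUS bits yields exactly 0x3.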

    Link: http://lkml.kernel.org/r/20170206182442.GB31078@htj.duckdns.org
    Fixes: 513e3d2d11c9 ("cpumask: always use nr_cpu_ids in formatting and parsing functions")
    Signed-off-by: Tejun Heo
    Reported-by: Martin Steigerwald
    Debugged-by: Ben Hutchings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

09 Feb, 2017

3 commits

  • commit 08d85f3ea99f1eeafc4e8507936190e86a16ee8c upstream.

    Since commit f3b0946d629c ("genirq/msi: Make sure PCI MSIs are
    activated early"), we can end-up activating a PCI/MSI twice (once
    at allocation time, and once at startup time).

    This is normally of no consequence, except that there is some
    HW out there that may misbehave if activate is used more than once
    (the GICv3 ITS, for example, uses the activate callback
    to issue the MAPVI command, and the architecture spec says that
    "If there is an existing mapping for the EventID-DeviceID
    combination, behavior is UNPREDICTABLE").

    While this could be worked around in each individual driver, it may
    make more sense to tackle the issue at the core level. In order to
    avoid getting in that situation, let's have a per-interrupt flag
    to remember if we have already activated that interrupt or not.
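
    A minimal sketch of such a flag, with hypothetical names (struct
    fake_irq and its helpers) and a counter standing in for the hardware
    callback:

```c
#include <stdbool.h>

struct fake_irq {
	bool activated;     /* the per-interrupt flag described above */
	int hw_activations; /* how often the hardware callback actually ran */
};

/* Run the hardware activation at most once per activate/deactivate cycle. */
static void fake_irq_activate(struct fake_irq *irq)
{
	if (irq->activated)
		return; /* already mapped: do not issue the command again */
	irq->activated = true;
	irq->hw_activations++;
}

static void fake_irq_deactivate(struct fake_irq *irq)
{
	irq->activated = false;
}
```

    Calling the activate helper at allocation time and again at startup
    time then reaches the hardware only once.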

    Fixes: f3b0946d629c ("genirq/msi: Make sure PCI MSIs are activated early")
    Reported-and-tested-by: Andre Przywara
    Signed-off-by: Marc Zyngier
    Link: http://lkml.kernel.org/r/1484668848-24361-1-git-send-email-marc.zyngier@arm.com
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     
  • commit 966d2b04e070bc040319aaebfec09e0144dc3341 upstream.

    percpu_ref_tryget() and percpu_ref_tryget_live() should return
    "true" IFF they acquire a reference. But the return value from
    atomic_long_inc_not_zero() is a long and may have high bits set,
    e.g. PERCPU_COUNT_BIAS, and the return value of the tryget routines
    is bool so the reference may actually be acquired but the routines
    return "false" which results in a reference leak since the caller
    assumes it does not need to do a corresponding percpu_ref_put().

    This was seen when performing CPU hotplug during I/O, as hangs in
    blk_mq_freeze_queue_wait where percpu_ref_kill (blk_mq_freeze_queue_start)
    raced with percpu_ref_tryget (blk_mq_timeout_work).
    Sample stack trace:

    __switch_to+0x2c0/0x450
    __schedule+0x2f8/0x970
    schedule+0x48/0xc0
    blk_mq_freeze_queue_wait+0x94/0x120
    blk_mq_queue_reinit_work+0xb8/0x180
    blk_mq_queue_reinit_prepare+0x84/0xa0
    cpuhp_invoke_callback+0x17c/0x600
    cpuhp_up_callbacks+0x58/0x150
    _cpu_up+0xf0/0x1c0
    do_cpu_up+0x120/0x150
    cpu_subsys_online+0x64/0xe0
    device_online+0xb4/0x120
    online_store+0xb4/0xc0
    dev_attr_store+0x68/0xa0
    sysfs_kf_write+0x80/0xb0
    kernfs_fop_write+0x17c/0x250
    __vfs_write+0x6c/0x1e0
    vfs_write+0xd0/0x270
    SyS_write+0x6c/0x110
    system_call+0x38/0xe0

    Examination of the queue showed a single reference (no PERCPU_COUNT_BIAS,
    and __PERCPU_REF_DEAD, __PERCPU_REF_ATOMIC set) and no requests.
    However, conditions at the time of the race are count of PERCPU_COUNT_BIAS + 0
    and __PERCPU_REF_DEAD and __PERCPU_REF_ATOMIC set.

    The fix is to make the tryget routines use an actual boolean internally instead
    of the atomic long result truncated to an int.
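
    The truncation can be reproduced in isolation. The sketch below uses
    fixed-width unsigned types so that the narrowing is well defined (the
    kernel code involves long and int), and FAKE_COUNT_BIAS and both
    function names are illustrative only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bias: only the top bit of a 64-bit counter is set. */
#define FAKE_COUNT_BIAS (UINT64_C(1) << 63)

/* Buggy shape: a 64-bit result squeezed through a 32-bit temporary. */
static bool tryget_truncating(uint64_t inc_not_zero_result)
{
	uint32_t ret = (uint32_t)inc_not_zero_result; /* high bits are lost */

	return ret != 0;
}

/* Fixed shape: collapse to a boolean before any narrowing can happen. */
static bool tryget_boolean(uint64_t inc_not_zero_result)
{
	bool ret = inc_not_zero_result != 0;

	return ret;
}
```

    For a result equal to FAKE_COUNT_BIAS the truncating variant reports
    failure although the reference was taken; the boolean variant
    reports success.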

    Fixes: e625305b3907 percpu-refcount: make percpu_ref based on longs instead of ints
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=190751
    Signed-off-by: Douglas Miller
    Reviewed-by: Jens Axboe
    Signed-off-by: Tejun Heo
    Fixes: e625305b3907 ("percpu-refcount: make percpu_ref based on longs instead of ints")
    Signed-off-by: Greg Kroah-Hartman

    Douglas Miller
     
  • commit a96dfddbcc04336bbed50dc2b24823e45e09e80c upstream.

    Reading a sysfs "memoryN/valid_zones" file leads to the following oops
    when the first page of a range is not backed by struct page.
    show_valid_zones() assumes that 'start_pfn' is always valid for
    page_zone().

    BUG: unable to handle kernel paging request at ffffea017a000000
    IP: show_valid_zones+0x6f/0x160

    This issue may happen on x86-64 systems with 64GiB or more memory since
    their memory block size is bumped up to 2GiB. [1] An example of such
    systems is described below. 0x3240000000 is only aligned by 1GiB and
    this memory block starts from 0x3200000000, which is not backed by
    struct page.

    BIOS-e820: [mem 0x0000003240000000-0x000000603fffffff] usable

    Since test_pages_in_a_zone() already checks holes, fix this issue by
    extending this function to return 'valid_start' and 'valid_end' for a
    given range. show_valid_zones() then proceeds with the valid range.

    [1] 'Commit bdee237c0343 ("x86: mm: Use 2GB memory block size on
    large-memory x86-64 systems")'

    Link: http://lkml.kernel.org/r/20170127222149.30893-3-toshi.kani@hpe.com
    Signed-off-by: Toshi Kani
    Cc: Greg Kroah-Hartman
    Cc: Zhang Zhen
    Cc: Reza Arbab
    Cc: David Rientjes
    Cc: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
     

04 Feb, 2017

6 commits

  • [ Upstream commit 85c814016ce3b371016c2c054a905fa2492f5a65 ]

    When attempting to free lwtunnel state after the module for the encap
    has been unloaded an oops occurs:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    IP: lwtstate_free+0x18/0x40
    [..]
    task: ffff88003e372380 task.stack: ffffc900001fc000
    RIP: 0010:lwtstate_free+0x18/0x40
    RSP: 0018:ffff88003fd83e88 EFLAGS: 00010246
    RAX: 0000000000000000 RBX: ffff88002bbb3380 RCX: ffff88000c91a300
    [..]
    Call Trace:

    free_fib_info_rcu+0x195/0x1a0
    ? rt_fibinfo_free+0x50/0x50
    rcu_process_callbacks+0x2d3/0x850
    ? rcu_process_callbacks+0x296/0x850
    __do_softirq+0xe4/0x4cb
    irq_exit+0xb0/0xc0
    smp_apic_timer_interrupt+0x3d/0x50
    apic_timer_interrupt+0x93/0xa0
    [..]
    Code: e8 6e c6 fc ff 89 d8 5b 5d c3 bb de ff ff ff eb f4 66 90 66 66 66 66 90 55 48 89 e5 53 0f b7 07 48 89 fb 48 8b 04 c5 00 81 d5 81 8b 40 08 48 85 c0 74 13 ff d0 48 8d 7b 20 be 20 00 00 00 e8

    The problem is that after the module for the encap is unloaded, the
    corresponding ops is removed and is thus NULL here.

    Modules implementing lwtunnel ops should not be allowed to unload
    while there is state alive using those ops, so grab the module
    reference for the ops on creating lwtunnel state and of course release
    the reference when freeing the state.

    Fixes: 1104d9ba443a ("lwtunnel: Add destroy state operation")
    Signed-off-by: Robert Shearman
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Robert Shearman
     
  • [ Upstream commit 88ff7334f25909802140e690c0e16433e485b0a0 ]

    Modules implementing lwtunnel ops should not be allowed to unload
    while there is state alive using those ops, so specify the owning
    module for all lwtunnel ops.

    Signed-off-by: Robert Shearman
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Robert Shearman
     
  • [ Upstream commit 9ed59592e3e379b2e9557dc1d9e9ec8fcbb33f16 ]

    Trying to add an mpls encap route when the MPLS modules are not loaded
    hangs. For example:

    CONFIG_MPLS=y
    CONFIG_NET_MPLS_GSO=m
    CONFIG_MPLS_ROUTING=m
    CONFIG_MPLS_IPTUNNEL=m

    $ ip route add 10.10.10.10/32 encap mpls 100 via inet 10.100.1.2

    The ip command hangs:
    root 880 826 0 21:25 pts/0 00:00:00 ip route add 10.10.10.10/32 encap mpls 100 via inet 10.100.1.2

    $ cat /proc/880/stack
    [] call_usermodehelper_exec+0xd6/0x134
    [] __request_module+0x27b/0x30a
    [] lwtunnel_build_state+0xe4/0x178
    [] fib_create_info+0x47f/0xdd4
    [] fib_table_insert+0x90/0x41f
    [] inet_rtm_newroute+0x4b/0x52
    ...

    modprobe is trying to load rtnl-lwt-MPLS:

    root 881 5 0 21:25 ? 00:00:00 /sbin/modprobe -q -- rtnl-lwt-MPLS

    and it hangs after loading mpls_router:

    $ cat /proc/881/stack
    [] rtnl_lock+0x12/0x14
    [] register_netdevice_notifier+0x16/0x179
    [] mpls_init+0x25/0x1000 [mpls_router]
    [] do_one_initcall+0x8e/0x13f
    [] do_init_module+0x5a/0x1e5
    [] load_module+0x13bd/0x17d6
    ...

    The problem is that lwtunnel_build_state is called with rtnl lock
    held preventing mpls_init from registering.

    Given the potential references held by the time lwtunnel_build_state is
    called, it cannot drop the rtnl lock to load the module. So, extract the module
    loading code from lwtunnel_build_state into a new function to validate
    the encap type. The new function is called while converting the user
    request into a fib_config which is well before any table, device or
    fib entries are examined.

    Fixes: 745041e2aaf1 ("lwtunnel: autoload of lwt modules")
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David Ahern
     
  • [ Upstream commit 6391a4481ba0796805d6581e42f9f0418c099e34 ]

    Commit 501db511397f ("virtio: don't set VIRTIO_NET_HDR_F_DATA_VALID on
    xmit") in fact disables VIRTIO_NET_HDR_F_DATA_VALID on the receive path
    too. Fix this by adding a hint (has_data_valid) and setting it only on
    the receive path.
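
    A sketch of how such a hint can gate the flag. The enum, the flag
    macros and fake_hdr_flags() are local stand-ins with illustrative
    values, and the NEEDS_CSUM branch is an assumption added for
    completeness rather than something stated in this message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins; the numeric values are illustrative only. */
enum fake_ip_summed {
	FAKE_CHECKSUM_NONE,
	FAKE_CHECKSUM_UNNECESSARY,
	FAKE_CHECKSUM_PARTIAL,
};

#define FAKE_F_NEEDS_CSUM 0x1u
#define FAKE_F_DATA_VALID 0x2u

/* Derive header flags; DATA_VALID is emitted only when the caller allows it. */
static uint8_t fake_hdr_flags(enum fake_ip_summed ip_summed, bool has_data_valid)
{
	if (ip_summed == FAKE_CHECKSUM_PARTIAL)
		return FAKE_F_NEEDS_CSUM;
	if (has_data_valid && ip_summed == FAKE_CHECKSUM_UNNECESSARY)
		return FAKE_F_DATA_VALID;
	return 0;
}
```

    A transmit caller would pass false and never emit DATA_VALID, while
    a receive caller would pass true.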

    Cc: Rolf Neugebauer
    Signed-off-by: Jason Wang
    Acked-by: Rolf Neugebauer
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jason Wang
     
  • [ Upstream commit 501db511397fd6efff3aa5b4e8de415b55559550 ]

    This patch partly reverts fd2a0437dc33 and e858fae2b0b8 which introduced a
    subtle change in how the virtio_net flags are derived from the SKB's
    ip_summed field.

    With the above commits, the flags are set to VIRTIO_NET_HDR_F_DATA_VALID
    when ip_summed == CHECKSUM_UNNECESSARY, thus treating it differently to
    ip_summed == CHECKSUM_NONE, which should be the same.

    Further, the virtio spec 1.0 / CS04 explicitly says that
    VIRTIO_NET_HDR_F_DATA_VALID must not be set by the driver.

    Fixes: fd2a0437dc33 ("virtio_net: introduce virtio_net_hdr_{from,to}_skb")
    Fixes: e858fae2b0b8 ("virtio_net: use common code for virtio_net_hdr and skb GSO conversion")
    Signed-off-by: Rolf Neugebauer
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Rolf Neugebauer
     
  • [ Upstream commit 003c941057eaa868ca6fedd29a274c863167230d ]

    Fix up a data alignment issue on sparc by swapping the order
    of the cookie byte array field with the length field in
    struct tcp_fastopen_cookie, and making it a proper union
    to clean up the typecasting.
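
    The layout change can be illustrated with two toy structs. These are
    not the real struct tcp_fastopen_cookie definitions: the field set is
    reduced and the 'words' member is a hypothetical stand-in for
    whatever wider-aligned view the union provides.

```c
#include <stddef.h>
#include <stdint.h>

#define FAKE_COOKIE_MAX 16

/* Before: a one-byte length ahead of the bytes pushes them to offset 1. */
struct cookie_before {
	int8_t len;
	uint8_t val[FAKE_COOKIE_MAX];
};

/* After: the bytes come first and share storage with an aligned view. */
struct cookie_after {
	union {
		uint8_t val[FAKE_COOKIE_MAX];
		uint32_t words[FAKE_COOKIE_MAX / 4]; /* hypothetical aligned view */
	};
	int8_t len;
};
```

    In the first layout a word-sized access to val is misaligned by
    construction; in the second the union starts at offset 0 and gives
    the struct word alignment, so no cast of a byte pointer is needed.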

    This addresses log complaints like these:
    log_unaligned: 113 callbacks suppressed
    Kernel unaligned access at TPC[976490] tcp_try_fastopen+0x2d0/0x360
    Kernel unaligned access at TPC[9764ac] tcp_try_fastopen+0x2ec/0x360
    Kernel unaligned access at TPC[9764c8] tcp_try_fastopen+0x308/0x360
    Kernel unaligned access at TPC[9764e4] tcp_try_fastopen+0x324/0x360
    Kernel unaligned access at TPC[976490] tcp_try_fastopen+0x2d0/0x360

    Cc: Eric Dumazet
    Signed-off-by: Shannon Nelson
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Shannon Nelson
     

01 Feb, 2017

5 commits

  • commit 8a1f780e7f28c7c1d640118242cf68d528c456cd upstream.

    online_{kernel|movable} is used to change the memory zone to
    ZONE_{NORMAL|MOVABLE} and online the memory.

    To check that memory zone can be changed, zone_can_shift() is used.
    Currently the function returns a negative integer, a positive integer,
    or 0. When the function returns a negative or positive integer,
    it means that the memory zone can be changed to ZONE_{NORMAL|MOVABLE}.

    But when the function returns 0, there are two meanings.

    One of the meanings is that the memory zone does not need to be changed.
    For example, when memory is in ZONE_NORMAL and onlined by online_kernel
    the memory zone does not need to be changed.

    Another meaning is that the memory zone cannot be changed. When memory
    is in ZONE_NORMAL and onlined by online_movable, the memory zone may
    not be changed to ZONE_MOVABLE due to memory online limitation (see
    Documentation/memory-hotplug.txt). In this case, memory must not be
    onlined.

    The patch changes the return type of zone_can_shift() so that memory
    online operation fails when memory zone cannot be changed as follows:

    Before applying patch:
    # grep -A 35 "Node 2" /proc/zoneinfo
    Node 2, zone Normal

    node_scanned 0
    spanned 8388608
    present 7864320
    managed 7864320
    # echo online_movable > memory4097/state
    # grep -A 35 "Node 2" /proc/zoneinfo
    Node 2, zone Normal

    node_scanned 0
    spanned 8388608
    present 8388608
    managed 8388608

    online_movable operation succeeded. But memory is onlined as
    ZONE_NORMAL, not ZONE_MOVABLE.

    After applying patch:
    # grep -A 35 "Node 2" /proc/zoneinfo
    Node 2, zone Normal

    node_scanned 0
    spanned 8388608
    present 7864320
    managed 7864320
    # echo online_movable > memory4097/state
    bash: echo: write error: Invalid argument
    # grep -A 35 "Node 2" /proc/zoneinfo
    Node 2, zone Normal

    node_scanned 0
    spanned 8388608
    present 7864320
    managed 7864320

    online_movable operation failed because of failure of changing
    the memory zone from ZONE_NORMAL to ZONE_MOVABLE
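
    One way to separate "no shift needed" from "shift not possible" is
    to return a boolean and report the shift through a pointer. The
    sketch below takes that approach as an assumption; the names and the
    movable_allowed parameter are made up and stand in for the real
    online limitations.

```c
#include <stdbool.h>

enum fake_zone { FAKE_ZONE_NORMAL, FAKE_ZONE_MOVABLE };

/*
 * Report via the return value whether the block may end up in 'target';
 * report the distance to move separately through '*zone_shift'.
 */
static bool fake_zone_can_shift(enum fake_zone cur, enum fake_zone target,
				bool movable_allowed, int *zone_shift)
{
	*zone_shift = 0;
	if (cur == target)
		return true;  /* allowed, and no shift is needed */
	if (target == FAKE_ZONE_MOVABLE && !movable_allowed)
		return false; /* not allowed: the caller must fail the online */
	*zone_shift = (int)target - (int)cur;
	return true;
}
```

    A caller that gets false back fails the write with an error, which
    matches the "Invalid argument" result shown above.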

    Fixes: df429ac03936 ("memory-hotplug: more general validation of zone during online")
    Link: http://lkml.kernel.org/r/2f9c3837-33d7-b6e5-59c0-6ca4372b2d84@gmail.com
    Signed-off-by: Yasuaki Ishimatsu
    Reviewed-by: Reza Arbab
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Yasuaki Ishimatsu
     
  • commit c929ea0b910355e1876c64431f3d5802f95b3d75 upstream.

    After removing the sunrpc module, I get many kmemleak reports such as:
    unreferenced object 0xffff88003316b1e0 (size 544):
    comm "gssproxy", pid 2148, jiffies 4294794465 (age 4200.081s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] kmemleak_alloc+0x4a/0xa0
    [] kmem_cache_alloc+0x15e/0x1f0
    [] ida_pre_get+0xaa/0x150
    [] ida_simple_get+0xad/0x180
    [] nlmsvc_lookup_host+0x4ab/0x7f0 [lockd]
    [] lockd+0x4d/0x270 [lockd]
    [] param_set_timeout+0x55/0x100 [lockd]
    [] svc_defer+0x114/0x3f0 [sunrpc]
    [] svc_defer+0x2d7/0x3f0 [sunrpc]
    [] rpc_show_info+0x8a/0x110 [sunrpc]
    [] proc_reg_write+0x7f/0xc0
    [] __vfs_write+0xdf/0x3c0
    [] vfs_write+0xef/0x240
    [] SyS_write+0xad/0x130
    [] entry_SYSCALL_64_fastpath+0x1a/0xa9
    [] 0xffffffffffffffff

    I found that the ida information (dynamic memory) isn't cleaned up.

    Signed-off-by: Kinglong Mee
    Fixes: 2f048db4680a ("SUNRPC: Add an identifier for struct rpc_clnt")
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Kinglong Mee
     
  • commit 059aa734824165507c65fd30a55ff000afd14983 upstream.

    Xuan Qi reports that the Linux NFSv4 client failed to lock a file
    that was migrated. The steps he observed on the wire:

    1. The client sent a LOCK request to the source server
    2. The source server replied NFS4ERR_MOVED
    3. The client switched to the destination server
    4. The client sent the same LOCK request to the destination
    server with a bumped lock sequence ID
    5. The destination server rejected the LOCK request with
    NFS4ERR_BAD_SEQID

    RFC 3530 section 8.1.5 provides a list of NFS errors which do not
    bump a lock sequence ID.

    However, RFC 3530 is now obsoleted by RFC 7530. In RFC 7530 section
    9.1.7, this list has been updated by the addition of NFS4ERR_MOVED.
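
    The rule can be sketched as a small decision function. The status
    codes below are illustrative and are not the protocol's wire values,
    and only the entry discussed here is listed; the remaining
    exceptions from the RFC are left out.

```c
#include <stdbool.h>

/* Illustrative status codes; these are not the protocol's wire values. */
enum fake_nfs_status {
	FAKE_NFS_OK,
	FAKE_NFS4ERR_MOVED,
	FAKE_NFS4ERR_OTHER, /* any error outside the exception list */
};

/* Does a reply with this status consume the lock sequence ID? */
static bool fake_status_bumps_seqid(enum fake_nfs_status status)
{
	switch (status) {
	case FAKE_NFS4ERR_MOVED: /* the addition described above */
		return false;
	default:
		return true;
	}
}

static unsigned int fake_next_seqid(unsigned int seqid,
				    enum fake_nfs_status status)
{
	return fake_status_bumps_seqid(status) ? seqid + 1 : seqid;
}
```

    With this rule the retry sent to the destination server reuses the
    sequence ID of the request that was answered with NFS4ERR_MOVED.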

    Reported-by: Xuan Qi
    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Chuck Lever
     
  • commit b1a27eac7fefff33ccf6acc919fc0725bf9815fb upstream.

    Use CXGB3_... instead of CXBG3_...

    Fixes: a85fb3383340 ("IB/cxgb3: Move user vendor structures")
    Signed-off-by: Nicolas Iooss
    Reviewed-by: Leon Romanovsky
    Acked-by: Steve Wise
    Signed-off-by: Doug Ledford
    Signed-off-by: Greg Kroah-Hartman

    Nicolas Iooss
     
  • commit ea57485af8f4221312a5a95d63c382b45e7840dc upstream.

    Patch series "fix premature OOM regression in 4.7+ due to cpuset races".

    This is v2 of my attempt to fix the recent report based on LTP cpuset
    stress test [1]. The intention is to go to stable 4.9 LTSS with this,
    as triggering repeated OOMs is not nice. That's why the patches try to
    be not too intrusive.

    Unfortunately, while investigating I found that modifying the testcase to
    use per-VMA policies instead of per-task policies will bring the OOMs
    back, but that seems to be a much older and harder to fix problem. I have
    posted a RFC [2] but I believe that fixing the recent regressions has a
    higher priority.

    Longer-term we might try to think how to fix the cpuset mess in a better
    and less error prone way. I was for example very surprised to learn,
    that cpuset updates change not only task->mems_allowed, but also
    nodemask of mempolicies. Until now I expected the parameter to
    alloc_pages_nodemask() to be stable. I wonder why do we then treat
    cpusets specially in get_page_from_freelist() and distinguish HARDWALL
    etc, when there's unconditional intersection between mempolicy and
    cpuset. I would expect the nodemask adjustment for saving overhead in
    g_p_f(), but that clearly doesn't happen in the current form. So we
    have both crazy complexity and overhead, AFAICS.

    [1] https://lkml.kernel.org/r/CAFpQJXUq-JuEP=QPidy4p_=FN0rkH5Z-kfB4qBvsf6jMS87Edg@mail.gmail.com
    [2] https://lkml.kernel.org/r/7c459f26-13a6-a817-e508-b65b903a8378@suse.cz

    This patch (of 4):

    Since commit c33d6c06f60f ("mm, page_alloc: avoid looking up the first
    zone in a zonelist twice") we have a wrong check for NULL preferred_zone,
    which can theoretically happen due to concurrent cpuset modification. We
    check the zoneref pointer which is never NULL and we should check the zone
    pointer. Also document this in first_zones_zonelist() comment per Michal
    Hocko.
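
    A reduced model of the pitfall, with hypothetical types and helpers:
    the lookup always returns a pointer into the list, so a "nothing
    found" result is only visible through the zone member of the entry
    it returns.

```c
#include <stdbool.h>
#include <stddef.h>

struct fake_zone {
	int id;
};

struct fake_zoneref {
	struct fake_zone *zone; /* NULL marks the end of the list */
};

/* Return the first allowed entry, or the terminating entry if none is. */
static struct fake_zoneref *fake_first_zone(struct fake_zoneref *list,
					    const bool *allowed)
{
	while (list->zone && !allowed[list->zone->id])
		list++;
	return list; /* never NULL, even when nothing matched */
}

/* Correct test: look at the zone pointer, not at the zoneref pointer. */
static bool fake_have_preferred_zone(const struct fake_zoneref *ref)
{
	return ref->zone != NULL;
}
```

    Testing the returned pointer itself against NULL would always pass,
    which is the wrong check described above.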

    Fixes: c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a zonelist twice")
    Link: http://lkml.kernel.org/r/20170120103843.24587-2-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Hillf Danton
    Cc: Ganapatrao Kulkarni
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     

26 Jan, 2017

5 commits

  • commit fff5d99225107f5f13fe4a9805adc2a1c4b5fb00 upstream.

    On architectures like arm64, swiotlb is tied intimately to the core
    architecture DMA support. In addition, ZONE_DMA cannot be disabled.

    To aid debugging and catch devices not supporting DMA to memory outside
    the 32-bit address space, add a kernel command line option
    "swiotlb=noforce", which disables the use of bounce buffers.
    If specified, trying to map memory that cannot be used with DMA will
    fail, and a rate-limited warning will be printed.

    Note that io_tlb_nslabs is set to 1, which is the minimal supported
    value.
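
    As an illustration only, the option can be modelled as a three-way
    mode. The enum, the parser name and the handling of a "force"
    keyword are assumptions of this sketch; the real option string may
    carry more fields than a single keyword.

```c
#include <string.h>

enum fake_swiotlb_force {
	FAKE_SWIOTLB_NORMAL,   /* bounce only when the device needs it */
	FAKE_SWIOTLB_FORCE,    /* always bounce */
	FAKE_SWIOTLB_NO_FORCE, /* never bounce; unmappable memory fails */
};

/* Map the keyword after "swiotlb=" to a mode; unknown text keeps the default. */
static enum fake_swiotlb_force fake_parse_swiotlb(const char *arg)
{
	if (strcmp(arg, "force") == 0)
		return FAKE_SWIOTLB_FORCE;
	if (strcmp(arg, "noforce") == 0)
		return FAKE_SWIOTLB_NO_FORCE;
	return FAKE_SWIOTLB_NORMAL;
}
```

    A mapping path would consult the mode and, for the no-force case,
    fail the mapping and print a rate-limited warning instead of
    bouncing.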

    Signed-off-by: Geert Uytterhoeven
    Signed-off-by: Konrad Rzeszutek Wilk
    Signed-off-by: Greg Kroah-Hartman

    Geert Uytterhoeven
     
  • commit ae7871be189cb41184f1e05742b4a99e2c59774d upstream.

    Convert the flag swiotlb_force from an int to an enum, to prepare for
    the advent of more possible values.

    Suggested-by: Konrad Rzeszutek Wilk
    Signed-off-by: Geert Uytterhoeven
    Signed-off-by: Konrad Rzeszutek Wilk
    Signed-off-by: Greg Kroah-Hartman

    Geert Uytterhoeven
     
  • commit 546125d1614264d26080817d0c8cddb9b25081fa upstream.

    The inet6addr_chain is an atomic notifier chain, so we can't call
    anything that might sleep (like lock_sock)... instead of closing the
    socket from svc_age_temp_xprts_now (which is called by the notifier
    function), just have the rpc service threads do it instead.

    Fixes: c3d4879e01be "sunrpc: Add a function to close..."
    Signed-off-by: Scott Mayhew
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Greg Kroah-Hartman

    Scott Mayhew
     
  • commit 52d7e48b86fc108e45a656d8e53e4237993c481d upstream.

    The current preemptible RCU implementation goes through three phases
    during bootup. In the first phase, there is only one CPU that is running
    with preemption disabled, so that a no-op is a synchronous grace period.
    In the second mid-boot phase, the scheduler is running, but RCU has
    not yet gotten its kthreads spawned (and, for expedited grace periods,
    workqueues are not yet running). During this time, any attempt to do
    a synchronous grace period will hang the system (or complain bitterly,
    depending). In the third and final phase, RCU is fully operational and
    everything works normally.

    This has been OK for some time, but there have recently been some
    synchronous grace periods showing up during the second mid-boot phase.
    This code worked "by accident" for a while, but started failing as soon
    as expedited RCU grace periods switched over to workqueues in commit
    8b355e3bc140 ("rcu: Drive expedited grace periods from workqueue").
    Note that the code was buggy even before this commit, as it was subject
    to failure on real-time systems that forced all expedited grace periods
    to run as normal grace periods (for example, using the rcu_normal ksysfs
    parameter). The callchain from the failure case is as follows:

    early_amd_iommu_init()
    |-> acpi_put_table(ivrs_base);
        |-> acpi_tb_put_table(table_desc);
            |-> acpi_tb_invalidate_table(table_desc);
                |-> acpi_tb_release_table(...)
                    |-> acpi_os_unmap_memory
                        |-> acpi_os_unmap_iomem
                            |-> acpi_os_map_cleanup
                                |-> synchronize_rcu_expedited

    The kernel showing this callchain was built with CONFIG_PREEMPT_RCU=y,
    which caused the code to try using workqueues before they were
    initialized, which did not go well.

    This commit therefore reworks RCU to permit synchronous grace periods
    to proceed during this mid-boot phase. This commit is therefore a
    fix to a regression introduced in v4.9, and is therefore being put
    forward post-merge-window in v4.10.

    This commit sets a flag from the existing rcu_scheduler_starting()
    function which causes all synchronous grace periods to take the expedited
    path. The expedited path now checks this flag, using the requesting task
    to drive the expedited grace period forward during the mid-boot phase.
    Finally, this flag is updated by a core_initcall() function named
    rcu_exp_runtime_mode(), which causes the runtime codepaths to be used.

    Note that this arrangement assumes that tasks are not sent POSIX signals
    (or anything similar) from the time that the first task is spawned
    through core_initcall() time.
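
    A minimal userspace sketch of the flag-driven phase switch described
    above, not the kernel implementation. All demo_* names and both enums
    are hypothetical stand-ins; only rcu_scheduler_starting() and
    rcu_exp_runtime_mode() are named in this commit message.

```c
/* Hypothetical names for the three boot phases described in the text. */
enum demo_boot_phase {
    DEMO_PHASE_EARLY,    /* one CPU, preemption disabled */
    DEMO_PHASE_MID_BOOT, /* scheduler up, kthreads/workqueues not yet up */
    DEMO_PHASE_RUNTIME   /* fully operational */
};

/* How a synchronous grace period is handled in each phase. */
enum demo_gp_strategy {
    DEMO_GP_NOOP,          /* nothing else can run, so a no-op suffices */
    DEMO_GP_DRIVE_INLINE,  /* requesting task drives the grace period */
    DEMO_GP_USE_WORKQUEUE  /* normal runtime path */
};

static enum demo_boot_phase demo_phase = DEMO_PHASE_EARLY;

/* Stand-in for the flag set from rcu_scheduler_starting(). */
void demo_scheduler_starting(void)
{
    demo_phase = DEMO_PHASE_MID_BOOT;
}

/* Stand-in for the core_initcall() function rcu_exp_runtime_mode(). */
void demo_exp_runtime_mode(void)
{
    demo_phase = DEMO_PHASE_RUNTIME;
}

/* Pick the grace-period strategy from the current phase flag. */
enum demo_gp_strategy demo_pick_strategy(void)
{
    switch (demo_phase) {
    case DEMO_PHASE_EARLY:
        return DEMO_GP_NOOP;
    case DEMO_PHASE_MID_BOOT:
        return DEMO_GP_DRIVE_INLINE;
    default:
        return DEMO_GP_USE_WORKQUEUE;
    }
}
```

    The point of the sketch is that the mid-boot phase gets its own
    strategy instead of falling through to a path that needs workqueues.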

    Fixes: 8b355e3bc140 ("rcu: Drive expedited grace periods from workqueue")
    Reported-by: "Zheng, Lv"
    Reported-by: Borislav Petkov
    Signed-off-by: Paul E. McKenney
    Tested-by: Stan Kain
    Tested-by: Ivan
    Tested-by: Emanuel Castelo
    Tested-by: Bruno Pesavento
    Tested-by: Borislav Petkov
    Tested-by: Frederic Bezies
    Signed-off-by: Greg Kroah-Hartman

    Paul E. McKenney
     
  • commit 68cc085a4daaa32f7138de1e918331c05165a484 upstream.

    R8A7794 doesn't have Cortex-A15 CPUs, thus there's no Z clock...

    Fixes: 0dce5454d5c2 ("ARM: shmobile: Initial r8a7794 SoC device tree")
    Signed-off-by: Sergei Shtylyov
    Reviewed-by: Geert Uytterhoeven
    Signed-off-by: Simon Horman
    Signed-off-by: Greg Kroah-Hartman

    Sergei Shtylyov
     

20 Jan, 2017

9 commits

  • commit 3bee9ea1de687925d116670f036599cbed8b66b0 upstream.

    The BQ27510 and BQ27520 use a slightly different register map than the
    BQ27500. Add a new type enum and add these gauges to it.

    Fixes: d74534c27775 ("power: bq27xxx_battery: Add support for additional bq27xxx family devices")
    Based-on-patch-by: Kenneth R. Crudup
    Signed-off-by: Andrew F. Davis
    Signed-off-by: Sebastian Reichel
    Signed-off-by: Greg Kroah-Hartman

    Andrew F. Davis
     
  • commit 9a05e7541c39680d28ecf91892338e074738d5fd upstream.

    With compilers which follow the C99 standard (like modern versions of
    gcc and clang), "extern inline" does the opposite thing from older
    versions of gcc (emits code for an externally linkable version of the
    inline function).

    "static inline" does the intended behavior in all cases instead.

    Description taken from commit 6d91857d4826 ("staging, rtl8192e,
    LLVMLinux: Change extern inline to static inline").

    This also fixes the following GCC warning when building with CONFIG_PM
    disabled:

    ./include/linux/blkdev.h:1143:20: warning: no previous prototype for 'blk_set_runtime_active' [-Wmissing-prototypes]
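
    A minimal sketch of the header-stub pattern this message refers to,
    assuming a hypothetical DEMO_CONFIG_PM switch and hypothetical
    demo_queue / demo_set_runtime_active() names (not the real blkdev.h
    contents). With "static inline", each translation unit gets a private
    copy, so no externally visible definition is emitted and
    -Wmissing-prototypes has nothing to warn about.

```c
/* Hypothetical stand-in for a queue structure. */
struct demo_queue {
    int runtime_active;
};

#ifdef DEMO_CONFIG_PM
/* Real implementation lives in a .c file when the feature is enabled. */
void demo_set_runtime_active(struct demo_queue *q);
#else
/*
 * Feature disabled: provide a no-op stub. "static inline" has the same
 * meaning under both the old GNU and the C99 inline rules, unlike
 * "extern inline".
 */
static inline void demo_set_runtime_active(struct demo_queue *q)
{
    (void)q; /* intentionally does nothing */
}
#endif
```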

    Fixes: d07ab6d11477 ("block: Add blk_set_runtime_active()")
    Reviewed-by: Mika Westerberg
    Signed-off-by: Tobias Klauser
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Tobias Klauser
     
  • commit 9e4d59ada4d602e78eee9fb5f898ce61fdddb446 upstream.

    This is a fix for Linux 4.10-rc1.

    In the C language specification, a bit-field is interpreted as a signed or
    unsigned integer type consisting of the specified number of bits.

    In the GCC manual, the range of a signed bit-field of N bits is from
    -(2^N) / 2 to ((2^N) / 2) - 1
    https://www.gnu.org/software/gnu-c-manual/gnu-c-manual.html#Bit-Fields

    Therefore, when defined as a 1-bit bit-field with a signed type, a
    variable can represent only -1 and 0.

    The snd-soc-hdmi-codec module includes a structure which has signed type
    members with bit-fields. The code in this module assigns 0 and 1 to these
    members. This seems to result in implementation-dependent behaviour.

    As of the v4.10-rc1 merge window, outside of the sound subsystem, this
    structure is referenced by the GPU modules below.
    - tda998x
    - sti-drm
    - mediatek-drm-hdmi
    - msm

    As far as I can tell from reviewing the code that uses the structure, its
    members are used only in conditional statements and printk formats.
    My proposed change is a bit intrusive to the printk formats, but this
    may be acceptable.

    Overall, it's reasonable to use an unsigned type for the structure members.
    This bug was detected by Sparse, a static code analyzer, with the warnings
    below.

    ./include/sound/hdmi-codec.h:39:26: error: dubious one-bit signed bitfield
    ./include/sound/hdmi-codec.h:40:28: error: dubious one-bit signed bitfield
    ./include/sound/hdmi-codec.h:41:29: error: dubious one-bit signed bitfield
    ./include/sound/hdmi-codec.h:42:31: error: dubious one-bit signed bitfield
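
    A small sketch of why a signed one-bit bit-field is dubious. The
    struct and function names are hypothetical and are not taken from
    hdmi-codec.h. A signed 1-bit field can hold only -1 and 0, so storing
    1 is an out-of-range, implementation-defined conversion (GCC reads it
    back as -1), while the unsigned variant holds 0 and 1 as intended.

```c
/* Hypothetical structures; field width of one bit in both cases. */
struct demo_flags_signed {
    signed int on : 1;   /* representable values: -1 and 0 */
};

struct demo_flags_unsigned {
    unsigned int on : 1; /* representable values: 0 and 1 */
};

/* Store v in a signed 1-bit field and read it back. */
int demo_read_signed_flag(int v)
{
    struct demo_flags_signed f;

    f.on = v; /* v == 1 does not fit: implementation-defined result */
    return f.on;
}

/* Store v in an unsigned 1-bit field and read it back. */
int demo_read_unsigned_flag(int v)
{
    struct demo_flags_unsigned f;

    f.on = (unsigned int)v & 1u; /* keep only the low bit */
    return (int)f.on;
}
```

    A test such as "if (flag == 1)" therefore never succeeds on the signed
    variant, whereas plain truth tests and the unsigned variant behave as
    expected.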

    Fixes: 09184118a8ab ("ASoC: hdmi-codec: Add hdmi-codec for external HDMI-encoders")
    Signed-off-by: Takashi Sakamoto
    Acked-by: Arnaud Pouliquen
    Signed-off-by: Mark Brown
    Signed-off-by: Greg Kroah-Hartman

    Takashi Sakamoto
     
  • commit ac0c7cf8be00f269f82964cf7b144ca3edc5dbc4 upstream.

    Enabling btrfs tracepoints leads to an instant crash, as reported. The wq
    callbacks could free the memory and the tracepoints started to
    dereference the members to get to fs_info.

    The proposed fix https://marc.info/?l=linux-btrfs&m=148172436722606&w=2
    removed the tracepoints but we could preserve them by passing only the
    required data in a safe way.

    Fixes: bc074524e123 ("btrfs: prefix fsid to all trace events")
    Reported-by: Sebastian Andrzej Siewior
    Reviewed-by: Qu Wenruo
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    David Sterba
     
  • commit 20b1e22d01a4b0b11d3a1066e9feb04be38607ec upstream.

    With the following commit:

    4bc9f92e64c8 ("x86/efi-bgrt: Use efi_mem_reserve() to avoid copying image data")

    ... efi_bgrt_init() calls into the memblock allocator through
    efi_mem_reserve() => efi_arch_mem_reserve() *after* mm_init() has been called.

    Indeed, KASAN reports a bad read access later on in efi_free_boot_services():

    BUG: KASAN: use-after-free in efi_free_boot_services+0xae/0x24c
    at addr ffff88022de12740
    Read of size 4 by task swapper/0/0
    page:ffffea0008b78480 count:0 mapcount:-127
    mapping: (null) index:0x1 flags: 0x5fff8000000000()
    [...]
    Call Trace:
    dump_stack+0x68/0x9f
    kasan_report_error+0x4c8/0x500
    kasan_report+0x58/0x60
    __asan_load4+0x61/0x80
    efi_free_boot_services+0xae/0x24c
    start_kernel+0x527/0x562
    x86_64_start_reservations+0x24/0x26
    x86_64_start_kernel+0x157/0x17a
    start_cpu+0x5/0x14

    The instruction at the given address is the first read from the memmap's
    memory, i.e. the read of md->type in efi_free_boot_services().

    Note that the writes earlier in efi_arch_mem_reserve() don't splat because
    they're done through early_memremap()ed addresses.

    So, after memblock is gone, allocations should be done through the "normal"
    page allocator. Introduce a helper, efi_memmap_alloc() for this. Use
    it from efi_arch_mem_reserve(), efi_free_boot_services() and, for the sake
    of consistency, from efi_fake_memmap() as well.

    Note that for the latter, the memmap allocations cease to be page aligned.
    This isn't needed though.

    Tested-by: Dan Williams
    Signed-off-by: Nicolai Stange
    Reviewed-by: Ard Biesheuvel
    Cc: Dave Young
    Cc: Linus Torvalds
    Cc: Matt Fleming
    Cc: Mika Penttilä
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-efi@vger.kernel.org
    Fixes: 4bc9f92e64c8 ("x86/efi-bgrt: Use efi_mem_reserve() to avoid copying image data")
    Link: http://lkml.kernel.org/r/20170105125130.2815-1-nicstange@gmail.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Nicolai Stange
     
  • commit 0100a3e67a9cef64d72cd3a1da86f3ddbee50363 upstream.

    Some machines, such as the Lenovo ThinkPad W541 with firmware GNET80WW
    (2.28), include memory map entries with phys_addr=0x0 and num_pages=0.

    These machines fail to boot after the following commit,

    commit 8e80632fb23f ("efi/esrt: Use efi_mem_reserve() and avoid a kmalloc()")

    Fix this by removing such bogus entries from the memory map.

    Furthermore, currently the log output for this case (with efi=debug)
    looks like:

    [ 0.000000] efi: mem45: [Reserved | | | | | | | | | | | | ] range=[0x0000000000000000-0xffffffffffffffff] (0MB)

    This is clearly wrong, and also not as informative as it could be. This
    patch changes it so that if we find obviously invalid memory map
    entries, we print an error and skip those entries. It also detects overflow
    in the address range calculation used for the display, so the new output is:

    [ 0.000000] efi: [Firmware Bug]: Invalid EFI memory map entries:
    [ 0.000000] efi: mem45: [Reserved | | | | | | | | | | | | ] range=[0x0000000000000000-0x0000000000000000] (invalid)

    It also detects memory map sizes that would overflow the physical
    address, for example phys_addr=0xfffffffffffff000 and
    num_pages=0x0200000000000001, and prints:

    [ 0.000000] efi: [Firmware Bug]: Invalid EFI memory map entries:
    [ 0.000000] efi: mem45: [Reserved | | | | | | | | | | | | ] range=[phys_addr=0xfffffffffffff000-0x20ffffffffffffffff] (invalid)

    It then removes these entries from the memory map.
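
    A minimal sketch of the kind of validity check described above, assuming
    4 KiB EFI pages. The function name demo_memmap_entry_valid() and the
    DEMO_PAGE_SHIFT macro are hypothetical; this is not the code added by
    the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/*
 * Return true if [phys_addr, phys_addr + size - 1] is a non-empty range
 * that fits in 64 bits, where size = num_pages << DEMO_PAGE_SHIFT.
 */
bool demo_memmap_entry_valid(uint64_t phys_addr, uint64_t num_pages)
{
    uint64_t size;

    /* Zero-length entries describe nothing. */
    if (num_pages == 0)
        return false;

    /* The shift below must not lose high bits. */
    if (num_pages > (UINT64_MAX >> DEMO_PAGE_SHIFT))
        return false;

    size = num_pages << DEMO_PAGE_SHIFT;

    /* The last byte of the range must not wrap past 2^64 - 1. */
    if (phys_addr > UINT64_MAX - (size - 1))
        return false;

    return true;
}
```

    Both examples from the message are rejected by this check: the entry
    with phys_addr=0x0 and num_pages=0, and the entry with
    phys_addr=0xfffffffffffff000 and num_pages=0x0200000000000001.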

    Signed-off-by: Peter Jones
    Signed-off-by: Ard Biesheuvel
    [ardb: refactor for clarity with no functional changes, avoid PAGE_SHIFT]
    Signed-off-by: Matt Fleming
    [Matt: Include bugzilla info in commit log]
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=191121
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Peter Jones
     
  • commit b6416e61012429e0277bd15a229222fd17afc1c1 upstream.

    Modules that use static_key_deferred need a way to synchronize with
    any delayed work that is still pending when the module is unloaded.
    Introduce static_key_deferred_flush() which flushes any pending
    jump label updates.

    Signed-off-by: David Matlack
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    David Matlack
     
  • commit f05714293a591038304ddae7cb0dd747bb3786cc upstream.

    During development of zram-swap asynchronous writeback, I found a strange
    corruption of a compressed page, resulting in:

    Modules linked in: zram(E)
    CPU: 3 PID: 1520 Comm: zramd-1 Tainted: G E 4.8.0-mm1-00320-ge0d4894c9c38-dirty #3274
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
    task: ffff88007620b840 task.stack: ffff880078090000
    RIP: set_freeobj.part.43+0x1c/0x1f
    RSP: 0018:ffff880078093ca8 EFLAGS: 00010246
    RAX: 0000000000000018 RBX: ffff880076798d88 RCX: ffffffff81c408c8
    RDX: 0000000000000018 RSI: 0000000000000000 RDI: 0000000000000246
    RBP: ffff880078093cb0 R08: 0000000000000000 R09: 0000000000000000
    R10: ffff88005bc43030 R11: 0000000000001df3 R12: ffff880076798d88
    R13: 000000000005bc43 R14: ffff88007819d1b8 R15: 0000000000000001
    FS: 0000000000000000(0000) GS:ffff88007e380000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fc934048f20 CR3: 0000000077b01000 CR4: 00000000000406e0
    Call Trace:
    obj_malloc+0x22b/0x260
    zs_malloc+0x1e4/0x580
    zram_bvec_rw+0x4cd/0x830 [zram]
    page_requests_rw+0x9c/0x130 [zram]
    zram_thread+0xe6/0x173 [zram]
    kthread+0xca/0xe0
    ret_from_fork+0x25/0x30

    Investigation revealed that stable pages currently do not support
    anonymous pages. IOW, reuse_swap_page can reuse the page without waiting
    for writeback to complete, so it can overwrite a page zram is compressing.

    Unfortunately, zram has used the per-cpu stream feature since v4.7.
    It aims to increase the cache hit ratio of the scratch buffer used for
    compression. The downside of that approach is that zram must request
    memory for the compressed page in per-cpu context, which requires a
    stricter gfp flag and so can fail. If it does, zram retries the
    allocation outside per-cpu context, where it can succeed, then
    compresses the data again and copies it into that memory.

    In this scenario, zram assumes the data never changes, but that is
    not true unless stable pages are supported. So, if the data is
    changed under us, zram can cause a buffer overrun: the second
    compression can produce a larger result than the previous attempt,
    and zram blindly copies the bigger object into the smaller buffer.
    The overrun breaks zsmalloc free object chaining, so the system
    crashes as shown above.

    I think the report below is the same problem.
    https://bugzilla.suse.com/show_bug.cgi?id=997574

    Unfortunately, reuse_swap_page must be atomic, so we cannot wait on
    writeback there. The approach in this patch is simply to return false if
    we find that the page needs to be stable. Although this increases memory
    footprint temporarily, it happens rarely and the memory should be
    reclaimed easily when it does. It is also better than waiting for IO
    completion, which is on the critical path for application latency.

    Fixes: da9556a2367c ("zram: user per-cpu compression streams")
    Link: http://lkml.kernel.org/r/20161120233015.GA14113@bbox
    Link: http://lkml.kernel.org/r/1482366980-3782-2-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Acked-by: Hugh Dickins
    Cc: Sergey Senozhatsky
    Cc: Darrick J. Wong
    Cc: Takashi Iwai
    Cc: Hyeoncheol Lee
    Cc:
    Cc: Sangseok Lee
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Minchan Kim
     
  • commit b4536f0c829c8586544c94735c343f9b5070bd01 upstream.

    Nils Holland and Klaus Ethgen have reported unexpected OOM killer
    invocations with 32b kernels starting with 4.8:

    kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
    kworker/u4:5 cpuset=/ mems_allowed=0
    CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
    [...]
    Mem-Info:
    active_anon:58685 inactive_anon:90 isolated_anon:0
    active_file:274324 inactive_file:281962 isolated_file:0
    unevictable:0 dirty:649 writeback:0 unstable:0
    slab_reclaimable:40662 slab_unreclaimable:17754
    mapped:7382 shmem:202 pagetables:351 bounce:0
    free:206736 free_pcp:332 free_cma:0
    Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
    DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB writepending:96kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
    lowmem_reserve[]: 0 813 3474 3474
    Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
    lowmem_reserve[]: 0 0 21292 21292
    HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB

    The OOM killer is clearly premature because there is still a lot
    of page cache in the zone Normal which should satisfy this lowmem
    request. Further debugging has shown that the reclaim cannot make any
    forward progress because the page cache is hidden in the active list
    which doesn't get rotated because inactive_list_is_low is not memcg
    aware.

    The code simply subtracts per-zone highmem counters from the respective
    memcg's lru sizes, which doesn't make any sense. We can easily end up
    always seeing the resulting active and inactive counts as 0 and returning
    false. This issue is not limited to 32b kernels, but in practice the
    effect on systems without CONFIG_HIGHMEM would be much harder to notice
    because we do not invoke the OOM killer for allocation requests
    targeting < ZONE_NORMAL.

    Fix the issue by tracking per-zone lru page counts in mem_cgroup_per_node
    and subtracting per-memcg highmem counts when memcg is enabled. Introduce
    helper lruvec_zone_lru_size which redirects to either zone counters or
    mem_cgroup_get_zone_lru_size when appropriate.
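
    The arithmetic behind the problem can be sketched in userspace C. This
    is an illustration only: demo_eligible_pages() is a hypothetical helper
    and the page counts in the usage note are made up, not taken from the
    report above.

```c
/*
 * Pages eligible for a lowmem request: the lru size minus the pages that
 * sit in ineligible (highmem) zones, clamped so it never goes below zero.
 */
unsigned long demo_eligible_pages(unsigned long lru_size,
                                  unsigned long ineligible_pages)
{
    unsigned long sub = ineligible_pages < lru_size ? ineligible_pages
                                                    : lru_size;

    return lru_size - sub;
}
```

    Suppose a memcg has 1000 inactive pages, none of them in highmem, while
    the whole system has 280000 inactive highmem pages. Subtracting the
    system-wide zone counter gives demo_eligible_pages(1000, 280000) == 0,
    so the memcg looks empty. Subtracting the memcg's own highmem count
    gives demo_eligible_pages(1000, 0) == 1000, which is the kind of
    per-memcg accounting the fix introduces.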

    We are losing empty LRU but non-zero lru size detection introduced by
    ca707239e8a7 ("mm: update_lru_size warn and reset bad lru_size") because
    of the inherent zone vs. node discrepancy.

    Fixes: f8d1a31163fc ("mm: consider whether to decivate based on eligible zones inactive ratio")
    Link: http://lkml.kernel.org/r/20170104100825.3729-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Nils Holland
    Tested-by: Nils Holland
    Reported-by: Klaus Ethgen
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

15 Jan, 2017

1 commit

  • [ Upstream commit 57ea52a865144aedbcd619ee0081155e658b6f7d ]

    The GRO fast path caches the frag0 address. This address becomes
    invalid if frag0 is modified by pskb_may_pull or its variants.
    So whenever that happens we must disable the frag0 optimization.

    This is usually done through the combination of gro_header_hard
    and gro_header_slow, however, the IPv6 extension header path did
    the pulling directly and would continue to use the GRO fast path
    incorrectly.

    This patch fixes it by disabling the fast path when we enter the
    IPv6 extension header path.
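
    A minimal sketch of the cached-pointer invalidation idea, assuming a
    hypothetical demo_gro_cb structure and demo_* helpers rather than the
    real sk_buff and GRO control block. Once the header area may have been
    reallocated by a pull, the cached address is cleared so that later
    header accesses fall back to the slow path.

```c
#include <stddef.h>

/* Hypothetical per-packet state: cached first-fragment address. */
struct demo_gro_cb {
    const unsigned char *frag0; /* NULL when the cache is disabled */
    size_t frag0_len;           /* bytes valid at frag0 */
};

/* Disable the fast path after anything that may move the data. */
static inline void demo_frag0_invalidate(struct demo_gro_cb *cb)
{
    cb->frag0 = NULL;
    cb->frag0_len = 0;
}

/* The fast path may be used only if the cache covers hlen bytes. */
static inline int demo_header_fast_ok(const struct demo_gro_cb *cb,
                                      size_t hlen)
{
    return cb->frag0 != NULL && cb->frag0_len >= hlen;
}
```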

    Fixes: 78a478d0efd9 ("gro: Inline skb_gro_header and cache frag0 virtual address")
    Reported-by: Slava Shwartsman
    Signed-off-by: Herbert Xu
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Herbert Xu
     

12 Jan, 2017

2 commits

  • commit 9bf11ecce5a2758e5a097c2f3a13d08552d0d6f9 upstream.

    When the dummy timer callback is invoked before the real timer callbacks,
    then it tries to install that timer for the starting CPU. If the platform
    does not have a broadcast timer installed the installation fails with a
    kernel crash. The crash happens due to an unconditional dereference of the
    unavailable broadcast device. This needs to be fixed in the timer core code.

    But even when this is fixed in the core code then installing the dummy
    timer before the real timers is a pointless exercise.

    Move it to the end of the callback list.

    Fixes: 00c1d17aab51 ("clocksource/dummy_timer: Convert to hotplug state machine")
    Reported-and-tested-by: Mason
    Signed-off-by: Thomas Gleixner
    Cc: Mark Rutland
    Cc: Anna-Maria Gleixner
    Cc: Richard Cochran
    Cc: Sebastian Andrzej Siewior
    Cc: Daniel Lezcano
    Cc: Peter Zijlstra
    Cc: Sebastian Frias
    Cc: Thibaud Cornic
    Cc: Robin Murphy
    Link: http://lkml.kernel.org/r/1147ef90-7877-e4d2-bb2b-5c4fa8d3144b@free.fr
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 7254383341bc6e1a61996accd836009f0c922b21 upstream.

    Add Mellanox device IDs for use by the mlx4 driver and INTx quirks.

    [bhelgaas: sorted and adapted from
    http://lkml.kernel.org/r/1478011644-12080-1-git-send-email-noaos@mellanox.com]
    Signed-off-by: Noa Osherovich
    Signed-off-by: Bjorn Helgaas
    Signed-off-by: Greg Kroah-Hartman

    Noa Osherovich