12 Jan, 2017

5 commits

  • If prev node is not in running state or its vCPU is preempted, we can give
    up our vCPU slices in pv_wait_node() ASAP.

    Signed-off-by: Pan Xinhui
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: longman@redhat.com
    Link: http://lkml.kernel.org/r/1484035006-6787-1-git-send-email-xinhui.pan@linux.vnet.ibm.com
    [ Fixed typos in the changelog, removed ugly linebreak from the code. ]
    Signed-off-by: Ingo Molnar

    Pan Xinhui
     
  • The spin_lock_bh_nested() API is defined but is not used anywhere
    in the kernel. So all spin_lock_bh_nested() and related APIs are
    now removed.

    Signed-off-by: Waiman Long
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1483975612-16447-1-git-send-email-longman@redhat.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • Merge fixes from Andrew Morton:
    "27 fixes.

    There are three patches that aren't actually fixes. They're simple
    function renamings which are nice-to-have in mainline as ongoing net
    development depends on them."

    * akpm: (27 commits)
    timerfd: export defines to userspace
    mm/hugetlb.c: fix reservation race when freeing surplus pages
    mm/slab.c: fix SLAB freelist randomization duplicate entries
    zram: support BDI_CAP_STABLE_WRITES
    zram: revalidate disk under init_lock
    mm: support anonymous stable page
    mm: add documentation for page fragment APIs
    mm: rename __page_frag functions to __page_frag_cache, drop order from drain
    mm: rename __alloc_page_frag to page_frag_alloc and __free_page_frag to page_frag_free
    mm, memcg: fix the active list aging for lowmem requests when memcg is enabled
    mm: don't dereference struct page fields of invalid pages
    mailmap: add codeaurora.org names for nameless email commits
    signal: protect SIGNAL_UNKILLABLE from unintentional clearing.
    mm: pmd dirty emulation in page fault handler
    ipc/sem.c: fix incorrect sem_lock pairing
    lib/Kconfig.debug: fix frv build failure
    mm: get rid of __GFP_OTHER_NODE
    mm: fix remote numa hits statistics
    mm: fix devm_memremap_pages crash, use mem_hotplug_{begin, done}
    ocfs2: fix crash caused by stale lvb with fsdlm plugin
    ...

    Linus Torvalds
     
  • Pull networking fixes from David Miller:

    1) Fix rtlwifi crash, from Larry Finger.

    2) Memory disclosure in appletalk ipddp routing code, from Vlad
    Tsyrklevich.

    3) r8152 can erroneously split an RX packet into multiple URBs if the
    Rx FIFO is not empty when we suspend. Fix this by waiting for the
    FIFO to empty before suspending. From Hayes Wang.

    4) Two GRO fixes (enter slow path when not enough SKB tail room exists,
    disable frag0 optimizations when there are IPv6 extension headers)
    from Eric Dumazet and Herbert Xu.

    5) A series of mlx5e bug fixes (do source UDP port offloading for
    tunnels properly, IP fragment matching fixes, handling firmware
    errors properly when installing TC rules, etc.) from Saeed Mahameed,
    Or Gerlitz, Roi Dayan, Hadar Hen Zion, Gil Rockah, and Daniel
    Jurgens.

    6) Two VRF fixes from David Ahern (don't skip multipath selection for
    VRF paths, disallow VRF to be configured with table ID 0).

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (35 commits)
    net: vrf: do not allow table id 0
    net: phy: marvell: fix Marvell 88E1512 used in SGMII mode
    sctp: Fix spelling mistake: "Atempt" -> "Attempt"
    net: ipv4: Fix multipath selection with vrf
    cgroup: move CONFIG_SOCK_CGROUP_DATA to init/Kconfig
    gro: use min_t() in skb_gro_reset_offset()
    net/mlx5: Only cancel recovery work when cleaning up device
    net/mlx5e: Remove WARN_ONCE from adaptive moderation code
    net/mlx5e: Un-register uplink representor on nic_disable
    net/mlx5e: Properly handle FW errors while adding TC rules
    net/mlx5e: Fix kbuild warnings for uninitialized parameters
    net/mlx5e: Set inline mode requirements for matching on IP fragments
    net/mlx5e: Properly get address type of encapsulation IP headers
    net/mlx5e: TC ipv4 tunnel encap offload error flow fixes
    net/mlx5e: Warn when rejecting offload attempts of IP tunnels
    net/mlx5e: Properly handle offloading of source udp port for IP tunnels
    gro: Disable frag0 optimization on IPv6 ext headers
    gro: Enter slow-path if there is no tailroom
    mlx4: Return EOPNOTSUPP instead of ENOTSUPP
    net/af_iucv: don't use paged skbs for TX on HiperSockets
    ...

    Linus Torvalds
     
  • Pull crypto fix from Herbert Xu:
    "This fixes a regression in aesni that renders it useless if it's
    built-in with a modular pcbc configuration"

    * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
    crypto: aesni - Fix failure when built-in with modular pcbc

    Linus Torvalds
     

11 Jan, 2017

35 commits

  • Frank reported that vrf devices can be created with a table id of 0.
    This breaks many of the run time table id checks and should not be
    allowed. Detect this condition at create time and fail with EINVAL.

    Fixes: 193125dbd8eb ("net: Introduce VRF device driver")
    Reported-by: Frank Kellermann
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
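
    The create-time check described above can be sketched in plain C as
    follows. This is a minimal userspace model, not the VRF driver code:
    struct vrf_config and vrf_config_init() are hypothetical names used
    only for illustration.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-device VRF state. */
struct vrf_config {
	uint32_t tb_id;	/* routing table id bound to the VRF device */
};

/*
 * Validate the requested table id at create time.
 * Returns 0 on success, or -EINVAL when the id is 0.
 */
static int vrf_config_init(struct vrf_config *cfg, uint32_t requested_tb_id)
{
	if (requested_tb_id == 0)
		return -EINVAL;	/* reject early so later checks can rely on a non-zero id */

	cfg->tb_id = requested_tb_id;
	return 0;
}
```

    Rejecting the value when the device is created keeps every later
    run-time check free to treat 0 as "no table".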
     
  • When a Marvell 88E1512 PHY is connected to a NIC in SGMII mode, the
    fiber page is used for the SGMII host-side connection. The PHY driver
    notices that SUPPORTED_FIBRE is set, so it tries reading the fiber page
    for the link status, and ends up reading the MAC-side status instead of
    the outgoing (copper) link. This leads to incorrect results reported
    via ethtool.

    If the PHY is connected via SGMII to the host, ignore the fiber page.
    However, continue to allow the existing power management code to
    suspend and resume the fiber page.

    Fixes: 6cfb3bcc0641 ("Marvell phy: check link status in case of fiber link.")
    Signed-off-by: Russell King
    Signed-off-by: David S. Miller

    Russell King
     
  • Trivial fix to spelling mistake in WARN_ONCE message

    Signed-off-by: Colin Ian King
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    Colin Ian King
     
  • fib_select_path does not call fib_select_multipath if oif is set in the
    flow struct. For VRF use cases oif is always set, so multipath route
    selection is bypassed. Use the FLOWI_FLAG_SKIP_NH_OIF to skip the oif
    check similar to what is done in fib_table_lookup.

    Add saddr and proto to the flow struct for the fib lookup done by the
    VRF driver to better match hash computation for a flow.

    Fixes: 613d09b30f8b ("net: Use VRF device index for lookups on TX")
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     
  • We now 'select SOCK_CGROUP_DATA' but Kconfig complains that this is
    not right when CONFIG_NET is disabled and there is no socket interface:

    warning: (CGROUP_BPF) selects SOCK_CGROUP_DATA which has unmet direct dependencies (NET)

    I don't know what the correct solution for this is, but simply removing
    the dependency on NET from SOCK_CGROUP_DATA by moving it out of the
    'if NET' section avoids the warning and does not produce other build
    errors.

    Fixes: 483c4933ea09 ("cgroup: Fix CGROUP_BPF config")
    Signed-off-by: Arnd Bergmann
    Signed-off-by: David S. Miller

    Arnd Bergmann
     
  • On 32bit arches, (skb->end - skb->data) is not 'unsigned int',
    so we shall use min_t() instead of min() to avoid a compiler error.

    Fixes: 1272ce87fa01 ("gro: Enter slow-path if there is no tailroom")
    Reported-by: kernel test robot
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
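
    The type mismatch can be illustrated outside the kernel. The sketch
    below assumes a simplified MIN_T() macro (it evaluates its arguments
    twice, unlike the kernel's min_t()) and a hypothetical helper
    pull_len(); a pointer difference has type ptrdiff_t, so both operands
    are cast to one explicit type before the comparison.

```c
#include <stddef.h>

/*
 * Simplified stand-in for the kernel's min_t(): cast both operands to
 * the named type, then compare. Arguments are evaluated more than once,
 * so do not pass expressions with side effects.
 */
#define MIN_T(type, x, y) ((type)(x) < (type)(y) ? (type)(x) : (type)(y))

/* Hypothetical helper: clamp a requested length to the room left in a buffer. */
static unsigned int pull_len(unsigned int wanted,
			     const unsigned char *data,
			     const unsigned char *end)
{
	/* (end - data) is a ptrdiff_t, while wanted is an unsigned int. */
	return MIN_T(unsigned int, wanted, end - data);
}
```

    With a 16-byte buffer, pull_len(64, buf, buf + 16) is clamped to the
    16 bytes that are available, and pull_len(8, buf, buf + 16) stays 8.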
     
  • Saeed Mahameed says:

    ====================
    Mellanox mlx5 fixes and cleanups 2017-01-10

    This series includes some mlx5e general cleanups from Daniel, Gil, Hadar
    and myself.
    Also it includes some critical mlx5e TC offloads fixes from Or Gerlitz.

    For -stable:
    - net/mlx5e: Remove WARN_ONCE from adaptive moderation code

    Although this fix doesn't affect any functionality, I thought it is
    better to clean this -WARN_ONCE- up for -stable in case someone hits
    such corner case.

    Please apply and let me know if there's any problem.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Do not attempt to drain the health workqueue when unloading the device in
    the recovery flow. Doing so can cause a deadlock when the recovery work
    tries to cancel itself with sync.

    Because the work is no longer unconditionally canceled when unloading, it
    must be explicitly canceled in the AER flow.

    Fixes: 689a248df83b ("net/mlx5: Cancel recovery work in remove flow")
    Signed-off-by: Daniel Jurgens
    Signed-off-by: Saeed Mahameed
    Signed-off-by: David S. Miller

    Daniel Jurgens
     
  • Bringing an interface down or changing its configuration under heavy
    traffic can hit some adaptive moderation corner cases and leave a
    WARN_ONCE call trace in the kernel log.

    Those WARN_ONCE calls are meant for debugging only and should have been
    enabled only under debug. Avoid such call traces by removing them.

    Fixes: cb3c7fd4f839 ("net/mlx5e: Support adaptive RX coalescing")
    Signed-off-by: Gil Rockah
    Signed-off-by: Saeed Mahameed
    Signed-off-by: David S. Miller

    Gil Rockah
     
  • The code before this patch registered the uplink e-Switch representor
    on nic_enable and unregistered it on nic_cleanup. The right place
    for the unregister is nic_disable.

    Fixes: 127ea380acc9 ("net/mlx5: Add Representors registration API")
    Signed-off-by: Saeed Mahameed
    Reviewed-by: Mohamad Haj Yahia
    Signed-off-by: David S. Miller

    Saeed Mahameed
     
  • When the firmware returns an error (a common example is an attempt to
    add the same rule twice, which some firmware versions refuse), we do not
    properly deref/clean up a few resources allocated on the way.
    Examples are the vport vlan deref under eswitch vlan offloads and the
    encap entry/neighbour deref under eswitch encapsulation offloads. Fix
    that.

    Fixes: a54e20b4fcae ('net/mlx5e: Add basic TC tunnel set action for SRIOV offloads')
    Fixes: 8b32580df1cb ('net/mlx5e: Add TC vlan action for SRIOV offloads')
    Signed-off-by: Or Gerlitz
    Reviewed-by: Roi Dayan
    Signed-off-by: Saeed Mahameed
    Signed-off-by: David S. Miller

    Or Gerlitz
     
  • kbuild warns about parameters that may be used uninitialized; fix it.

    Fixes: a54e20b4fcae ('net/mlx5e: Add basic TC tunnel set action for SRIOV offloads')
    Signed-off-by: Hadar Hen Zion
    Signed-off-by: Saeed Mahameed
    Signed-off-by: David S. Miller

    Hadar Hen Zion
     
  • For e-switch level matching on packets being an IP fragment, we
    need to make sure the source vport inline mode is L3, fix that.

    Fixes: 3f7d0eb42d59 ('net/mlx5e: Offload TC matching on packets being IP fragments')
    Signed-off-by: Or Gerlitz
    Reviewed-by: Roi Dayan
    Signed-off-by: Saeed Mahameed
    Signed-off-by: David S. Miller

    Or Gerlitz
     
  • As done elsewhere in our TC/flower offload code, the address type of
    the encapsulation IP headers should be determined according to the
    addr_type field of the encapsulation control dissector key. Do that.

    Fixes: bbd00f7e2349 ('net/mlx5e: Add TC tunnel release action for SRIOV offloads')
    Signed-off-by: Or Gerlitz
    Reviewed-by: Hadar Hen Zion
    Signed-off-by: Saeed Mahameed
    Signed-off-by: David S. Miller

    Or Gerlitz
     
  • When the route lookup fails we should return the actual error.

    When the neigh isn't valid, we should return -EOPNOTSUPP as done
    in similar cases along the code.

    When the offload can't take place because of an invalid neigh, etc., we
    must release the neigh.

    Fixes: a54e20b4fcae ('net/mlx5e: Add basic TC tunnel set action for SRIOV offloads')
    Signed-off-by: Or Gerlitz
    Reviewed-by: Hadar Hen Zion
    Signed-off-by: Saeed Mahameed
    Signed-off-by: David S. Miller

    Or Gerlitz
     
  • We silently reject offloading of IPv6 tunnels, non vxlan tunnels,
    vxlan tunnels where the dst port to match is not provided, etc.

    Be a bit more verbose and print a warning so the user better
    realizes what went wrong here and can fix it.

    Fixes: a54e20b4fcae ('net/mlx5e: Add basic TC tunnel set action for SRIOV offloads')
    Fixes: bbd00f7e2349 ('net/mlx5e: Add TC tunnel release action for SRIOV offloads')
    Signed-off-by: Or Gerlitz
    Reviewed-by: Hadar Hen Zion
    Signed-off-by: Saeed Mahameed
    Signed-off-by: David S. Miller

    Or Gerlitz
     
  • We can offload matching on the source UDP port of IP tunnels for
    decapsulation. We cannot offload setting the source UDP port for tunnels
    as part of encapsulation. Fix both the code that deals with matching
    offload (decap) and the code that deals with encap offload to align with
    that.

    Fixes: a54e20b4fcae ('net/mlx5e: Add basic TC tunnel set action for SRIOV offloads')
    Fixes: bbd00f7e2349 ('net/mlx5e: Add TC tunnel release action for SRIOV offloads')
    Signed-off-by: Or Gerlitz
    Reviewed-by: Hadar Hen Zion
    Signed-off-by: Saeed Mahameed
    Signed-off-by: David S. Miller

    Or Gerlitz
     
  • Since userspace is expected to call timerfd syscalls directly with these
    flags/ioctls, make sure we export them so they don't have to duplicate
    the values themselves.

    Link: http://lkml.kernel.org/r/20161219064052.7196-1-vapier@gentoo.org
    Signed-off-by: Mike Frysinger
    Acked-by: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Frysinger
     
  • return_unused_surplus_pages() decrements the global reservation count,
    and frees any unused surplus pages that were backing the reservation.

    Commit 7848a4bf51b3 ("mm/hugetlb.c: add cond_resched_lock() in
    return_unused_surplus_pages()") added a call to cond_resched_lock in the
    loop freeing the pages.

    As a result, the hugetlb_lock could be dropped, and someone else could
    use the pages that will be freed in subsequent iterations of the loop.
    This could result in inconsistent global hugetlb page state, application
    API failures (such as mmap failures) or application crashes.

    When dropping the lock in return_unused_surplus_pages, make sure that
    the global reservation count (resv_huge_pages) remains sufficiently
    large to prevent someone else from claiming pages about to be freed.

    Analyzed by Paul Cassella.

    Fixes: 7848a4bf51b3 ("mm/hugetlb.c: add cond_resched_lock() in return_unused_surplus_pages()")
    Link: http://lkml.kernel.org/r/1483991767-6879-1-git-send-email-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reported-by: Paul Cassella
    Suggested-by: Michal Hocko
    Cc: Masayoshi Mizuma
    Cc: Naoya Horiguchi
    Cc: Aneesh Kumar
    Cc: Hillf Danton
    Cc: [3.15+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • This patch fixes a bug in the freelist randomization code. When a high
    random number is used, the freelist will contain duplicate entries. It
    will result in different allocations sharing the same chunk.

    It will result in odd behaviours and crashes. It should be uncommon but
    it depends on the machines. We saw it happening more often on some
    machines (every few hours of running tests).

    Fixes: c7ce4f60ac19 ("mm: SLAB freelist randomization")
    Link: http://lkml.kernel.org/r/20170103181908.143178-1-thgarnie@google.com
    Signed-off-by: John Sperbeck
    Signed-off-by: Thomas Garnier
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Sperbeck
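
    One way to see why a large random value must not be able to produce
    duplicates is the sketch below. It assumes a precomputed permutation
    of slot indices that is read starting at a random offset and wraps
    around at the end, so each index is handed out exactly once whatever
    the random value is. The names are hypothetical and this is a model
    of the idea, not a copy of the code in mm/slab.c.

```c
#include <stddef.h>

/* Hypothetical walker over a precomputed permutation of 0..count-1. */
struct freelist_walk {
	const unsigned int *list;	/* permutation of slot indices */
	unsigned int count;		/* number of entries in list */
	unsigned int pos;		/* next position to read */
};

/* Start at a random position; the modulo keeps pos inside the list. */
static void freelist_walk_init(struct freelist_walk *w,
			       const unsigned int *list,
			       unsigned int count, unsigned int rand)
{
	w->list = list;
	w->count = count;
	w->pos = rand % count;
}

/* Return the next slot index, wrapping to the start of the list. */
static unsigned int freelist_walk_next(struct freelist_walk *w)
{
	if (w->pos >= w->count)
		w->pos = 0;
	return w->list[w->pos++];
}
```

    Because the random value only selects the starting offset, calling
    freelist_walk_next() count times visits every entry of the
    permutation once, so no two allocations can share a slot.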
     
  • zram has used the per-cpu stream feature since v4.7. It aims to increase
    the cache hit ratio of the scratch buffer used for compression. The
    downside of that approach is that zram must request memory for the
    compressed page in per-cpu context, which requires a stricter gfp flag
    that can fail. If it does, zram retries the allocation outside of per-cpu
    context so that it can get memory this time, compresses the data again,
    and copies it to that memory.

    In this scenario, zram assumes the data never changes, but that is not
    true without stable page support. So, if the data changes under us, zram
    can overrun the buffer, which breaks the zsmalloc free object chain, and
    the system crashes as in the report below:

    https://bugzilla.suse.com/show_bug.cgi?id=997574

    This patch adds BDI_CAP_STABLE_WRITES to zram for declaring "I am block
    device needing *stable write*".

    Fixes: da9556a2367c ("zram: user per-cpu compression streams")
    Link: http://lkml.kernel.org/r/1482366980-3782-4-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Reviewed-by: Sergey Senozhatsky
    Cc: Takashi Iwai
    Cc: Hyeoncheol Lee
    Cc:
    Cc: Sangseok Lee
    Cc: Hugh Dickins
    Cc: Darrick J. Wong
    Cc: [4.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Commit b4c5c60920e3 ("zram: avoid lockdep splat by revalidate_disk")
    moved revalidate_disk call out of init_lock to avoid lockdep
    false-positive splat. However, commit 08eee69fcf6b ("zram: remove
    init_lock in zram_make_request") removed init_lock in IO path so there
    is no worry about lockdep splat. So, let's restore it.

    This patch is needed to set BDI_CAP_STABLE_WRITES atomically in next
    patch.

    Fixes: da9556a2367c ("zram: user per-cpu compression streams")
    Link: http://lkml.kernel.org/r/1482366980-3782-3-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Reviewed-by: Sergey Senozhatsky
    Cc: Takashi Iwai
    Cc: Hyeoncheol Lee
    Cc:
    Cc: Sangseok Lee
    Cc: Hugh Dickins
    Cc: Darrick J. Wong
    Cc: [4.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • During development of zram-swap asynchronous writeback, I found strange
    corruption of a compressed page, resulting in:

    Modules linked in: zram(E)
    CPU: 3 PID: 1520 Comm: zramd-1 Tainted: G E 4.8.0-mm1-00320-ge0d4894c9c38-dirty #3274
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
    task: ffff88007620b840 task.stack: ffff880078090000
    RIP: set_freeobj.part.43+0x1c/0x1f
    RSP: 0018:ffff880078093ca8 EFLAGS: 00010246
    RAX: 0000000000000018 RBX: ffff880076798d88 RCX: ffffffff81c408c8
    RDX: 0000000000000018 RSI: 0000000000000000 RDI: 0000000000000246
    RBP: ffff880078093cb0 R08: 0000000000000000 R09: 0000000000000000
    R10: ffff88005bc43030 R11: 0000000000001df3 R12: ffff880076798d88
    R13: 000000000005bc43 R14: ffff88007819d1b8 R15: 0000000000000001
    FS: 0000000000000000(0000) GS:ffff88007e380000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fc934048f20 CR3: 0000000077b01000 CR4: 00000000000406e0
    Call Trace:
    obj_malloc+0x22b/0x260
    zs_malloc+0x1e4/0x580
    zram_bvec_rw+0x4cd/0x830 [zram]
    page_requests_rw+0x9c/0x130 [zram]
    zram_thread+0xe6/0x173 [zram]
    kthread+0xca/0xe0
    ret_from_fork+0x25/0x30

    Investigation revealed that stable pages currently do not support
    anonymous pages. IOW, reuse_swap_page can reuse the page without waiting
    for writeback to complete, so it can overwrite a page zram is compressing.

    Unfortunately, zram has used the per-cpu stream feature since v4.7.
    It aims to increase the cache hit ratio of the scratch buffer used for
    compression. The downside of that approach is that zram must request
    memory for the compressed page in per-cpu context, which requires a
    stricter gfp flag that can fail. If it does, zram retries the
    allocation outside of per-cpu context so that it can get memory this
    time, compresses the data again, and copies it to that memory.

    In this scenario, zram assumes the data never changes, but that is
    not true unless stable pages are supported. So, if the data changes
    under us, zram can overrun the buffer: the second compression can
    produce a larger result than the previous attempt, and zram blindly
    copies the larger object into the smaller buffer. The overrun breaks
    the zsmalloc free object chain, so the system crashes as shown above.

    I think the report below is the same problem:
    https://bugzilla.suse.com/show_bug.cgi?id=997574

    Unfortunately, reuse_swap_page must be atomic, so we cannot wait on
    writeback there. The approach in this patch is simply to return false if
    we find the page needs to be stable. Although this increases the memory
    footprint temporarily, it happens rarely and the memory should be easy to
    reclaim when it does. It is also better than waiting for IO completion,
    which is on the critical path for application latency.

    Fixes: da9556a2367c ("zram: user per-cpu compression streams")
    Link: http://lkml.kernel.org/r/20161120233015.GA14113@bbox
    Link: http://lkml.kernel.org/r/1482366980-3782-2-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Acked-by: Hugh Dickins
    Cc: Sergey Senozhatsky
    Cc: Darrick J. Wong
    Cc: Takashi Iwai
    Cc: Hyeoncheol Lee
    Cc:
    Cc: Sangseok Lee
    Cc: [4.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • This is a first pass at trying to add documentation for the page_frag
    APIs. They may still change over time but for now I thought I would try
    to get these documented so that as more network drivers and stack calls
    make use of them we have one central spot to document how they are meant
    to be used.

    Link: http://lkml.kernel.org/r/20170104024157.13451.6758.stgit@localhost.localdomain
    Signed-off-by: Alexander Duyck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Duyck
     
  • This patch does two things.

    First it goes through and renames the __page_frag prefixed functions to
    __page_frag_cache so that we can be clear that we are draining or
    refilling the cache, not the frags themselves.

    Second we drop the order parameter from __page_frag_cache_drain since we
    don't actually need to pass it since all fragments are either order 0 or
    must be a compound page.

    Link: http://lkml.kernel.org/r/20170104023954.13451.5678.stgit@localhost.localdomain
    Signed-off-by: Alexander Duyck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Duyck
     
  • Patch series "Page fragment updates", v4.

    This patch series takes care of a few cleanups for the page fragments
    API.

    First we do some renames so that things are much more consistent. First
    we move the page_frag_ portion of the name to the front of the functions
    names. Secondly we split out the cache specific functions from the
    other page fragment functions by adding the word "cache" to the name.

    Finally I added a bit of documentation that will hopefully help to
    explain some of this. I plan to revisit this later as we get things
    more ironed out in the near future with the changes planned for the DMA
    setup to support eXpress Data Path.

    This patch (of 3):

    This patch renames the page frag functions to be more consistent with
    other APIs. Specifically we place the name page_frag first in the name
    and then have either an alloc or free call name that we append as the
    suffix. This makes it a bit clearer in terms of naming.

    In addition we drop the leading double underscores since we are
    technically no longer a backing interface and instead the front end that
    is called from the networking APIs.

    Link: http://lkml.kernel.org/r/20170104023854.13451.67390.stgit@localhost.localdomain
    Signed-off-by: Alexander Duyck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Duyck
     
  • Nils Holland and Klaus Ethgen have reported unexpected OOM killer
    invocations with 32b kernels starting with 4.8:

    kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
    kworker/u4:5 cpuset=/ mems_allowed=0
    CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
    [...]
    Mem-Info:
    active_anon:58685 inactive_anon:90 isolated_anon:0
    active_file:274324 inactive_file:281962 isolated_file:0
    unevictable:0 dirty:649 writeback:0 unstable:0
    slab_reclaimable:40662 slab_unreclaimable:17754
    mapped:7382 shmem:202 pagetables:351 bounce:0
    free:206736 free_pcp:332 free_cma:0
    Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
    DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB writepending:96kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
    lowmem_reserve[]: 0 813 3474 3474
    Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
    lowmem_reserve[]: 0 0 21292 21292
    HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB

    The OOM killer is clearly premature because there is still a lot
    of page cache in the zone Normal which should satisfy this lowmem
    request. Further debugging has shown that the reclaim cannot make any
    forward progress because the page cache is hidden in the active list
    which doesn't get rotated because inactive_list_is_low is not memcg
    aware.

    The code simply subtracts per-zone highmem counters from the respective
    memcg's lru sizes which doesn't make any sense. We can simply end up
    always seeing the resulting active and inactive counts 0 and return
    false. This issue is not limited to 32b kernels but in practice the
    effect on systems without CONFIG_HIGHMEM would be much harder to notice
    because we do not invoke the OOM killer for allocation requests
    targeting < ZONE_NORMAL.

    Fix the issue by tracking per zone lru page counts in mem_cgroup_per_node
    and subtract per-memcg highmem counts when memcg is enabled. Introduce
    helper lruvec_zone_lru_size which redirects to either zone counters or
    mem_cgroup_get_zone_lru_size when appropriate.

    We are losing empty LRU but non-zero lru size detection introduced by
    ca707239e8a7 ("mm: update_lru_size warn and reset bad lru_size") because
    of the inherent zone vs. node discrepancy.

    Fixes: f8d1a31163fc ("mm: consider whether to decivate based on eligible zones inactive ratio")
    Link: http://lkml.kernel.org/r/20170104100825.3729-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Nils Holland
    Tested-by: Nils Holland
    Reported-by: Klaus Ethgen
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The VM_BUG_ON() check in move_freepages() checks whether the node id of
    a page matches the node id of its zone. However, it does this before
    having checked whether the struct page pointer refers to a valid struct
    page to begin with. This is guaranteed in most cases, but may not be
    the case if CONFIG_HOLES_IN_ZONE=y.

    So reorder the VM_BUG_ON() with the pfn_valid_within() check.

    Link: http://lkml.kernel.org/r/1481706707-6211-2-git-send-email-ard.biesheuvel@linaro.org
    Signed-off-by: Ard Biesheuvel
    Acked-by: Will Deacon
    Cc: Catalin Marinas
    Cc: Hanjun Guo
    Cc: Yisheng Xie
    Cc: Robert Richter
    Cc: James Morse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ard Biesheuvel
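
    The reordering can be modelled in a few lines of userspace C. The
    sketch assumes a hypothetical struct page_model whose valid flag
    plays the role of the pfn_valid_within() check and uses assert() in
    place of VM_BUG_ON(); the only point it makes is that the validity
    test comes before any other field is examined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a page slot; nid is meaningless when !valid. */
struct page_model {
	bool valid;	/* stands in for pfn_valid_within() */
	int nid;	/* node id, only trustworthy for valid pages */
};

/* Count the pages that would be moved, skipping holes in the zone. */
static size_t count_movable(const struct page_model *pages, size_t n,
			    int zone_nid)
{
	size_t moved = 0;

	for (size_t i = 0; i < n; i++) {
		if (!pages[i].valid)
			continue;	/* hole: skip before looking at nid */

		/* Safe only now that the entry is known to be valid. */
		assert(pages[i].nid == zone_nid);
		moved++;
	}
	return moved;
}
```

    With the checks in the opposite order, the assertion would fire (or
    read garbage) for the hole entry even though that entry is never
    moved.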
     
  • Some codeaurora.org emails have crept in but the names don't exist for
    them. Add the names for the emails so git can match everyone up.

    Link: http://lkml.kernel.org/r/20170104194611.25933-1-sboyd@codeaurora.org
    Signed-off-by: Stephen Boyd
    Cc: Sarangdhar Joshi
    Cc: Subash Abhinov Kasiviswanathan
    Cc: Subhash Jadavani
    Cc: Thomas Pedersen
    Cc: Andy Gross
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Boyd
     
  • Since commit 00cd5c37afd5 ("ptrace: permit ptracing of /sbin/init") we
    can now trace init processes. init is initially protected with
    SIGNAL_UNKILLABLE which will prevent fatal signals such as SIGSTOP, but
    there are a number of paths during tracing where SIGNAL_UNKILLABLE can
    be implicitly cleared.

    This can result in init becoming stoppable/killable after tracing. For
    example, running:

    while true; do kill -STOP 1; done &
    strace -p 1

    and then stopping strace and the kill loop will result in init being
    left in state TASK_STOPPED. Sending SIGCONT to init will resume it, but
    init will now respond to future SIGSTOP signals rather than ignoring
    them.

    Make sure that when setting SIGNAL_STOP_CONTINUED/SIGNAL_STOP_STOPPED
    that we don't clear SIGNAL_UNKILLABLE.

    Link: http://lkml.kernel.org/r/20170104122017.25047-1-jamie.iles@oracle.com
    Signed-off-by: Jamie Iles
    Acked-by: Oleg Nesterov
    Cc: Alexander Viro
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jamie Iles
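
    The idea of updating only the stop-related bits can be sketched as
    below. The flag values and the helper set_stop_flags() are
    illustrative assumptions, not the kernel's definitions; the sketch
    only shows that replacing the stop bits through a mask leaves an
    unrelated bit such as the unkillable flag untouched.

```c
/* Illustrative flag values; the real definitions live in kernel headers. */
#define SIG_STOP_STOPPED	0x01u
#define SIG_STOP_CONTINUED	0x02u
#define SIG_UNKILLABLE		0x40u
#define SIG_STOP_MASK		(SIG_STOP_STOPPED | SIG_STOP_CONTINUED)

/*
 * Replace the stop-state bits in cur with stop_flags while preserving
 * every other bit, including SIG_UNKILLABLE.
 */
static unsigned int set_stop_flags(unsigned int cur, unsigned int stop_flags)
{
	return (cur & ~SIG_STOP_MASK) | (stop_flags & SIG_STOP_MASK);
}
```

    A plain assignment such as flags = SIG_STOP_STOPPED would drop
    SIG_UNKILLABLE, which is the kind of unintentional clearing the
    commit guards against.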
     
  • Andreas reported that [1] made a test in jemalloc hang in THP mode on
    arm64:

    http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de

    The problem is that the page fault handler currently doesn't support
    dirty bit emulation of the pmd on architectures without a hardware dirty
    bit, so the application is stuck until the VM marks the pmd dirty.

    How the emulation works depends on the architecture. In the case of
    arm64, when it first sets up a pte, it sets PTE_RDONLY on the pte to get
    a chance to mark the pte dirty by triggering a page fault when a store
    access happens. Once the page fault occurs, the VM marks the pmd dirty
    and the arch code for setting the pmd clears PTE_RDONLY so the
    application can proceed.

    IOW, if the VM doesn't mark the pmd dirty, the application hangs forever
    on repeated faults (i.e., store op but the pmd is PTE_RDONLY).

    This patch enables pmd dirty-bit emulation for those architectures.

    [1] b8d3c4c3009d, mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called

    Fixes: b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called")
    Link: http://lkml.kernel.org/r/1482506098-6149-1-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Reported-by: Andreas Schwab
    Tested-by: Andreas Schwab
    Acked-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Cc: Jason Evans
    Cc: Will Deacon
    Cc: Catalin Marinas
    Cc: [4.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Based on the syzkaller test case from dvyukov:

    https://gist.githubusercontent.com/dvyukov/d0e5efefe4d7d6daed829f5c3ca26a40/raw/08d0a261fe3c987bed04fbf267e08ba04bd533ea/gistfile1.txt

    The slow (i.e.: failure to acquire) syscall exit from semtimedop()
    incorrectly assumed that the same lock is acquired as it was at the
    initial syscall entry.

    This is wrong:
    - thread A: single semop semop(), sleeps
    - thread B: multi semop semop(), sleeps
    - thread A: woken up by signal/timeout

    With this sequence, the initial sem_lock() call locks the per-semaphore
    spinlock, and it is unlocked with sem_unlock(). The call at the syscall
    return locks the global spinlock. Because locknum is not updated, the
    following sem_unlock() call unlocks the per-semaphore spinlock, which is
    actually not locked.

    The fix is trivial: Use the return value from sem_lock.
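
    The mismatch can be reproduced in a toy model. The sketch below is not
    the ipc/sem.c code; the demo_* names, the DEMO_GLOBAL marker and the
    boolean "locks" are assumptions made to keep the example self-contained.

```c
#include <stdbool.h>

#define DEMO_NSEMS  2
#define DEMO_GLOBAL (-1)   /* hypothetical marker: the global lock was taken */

struct demo_sem_array {
    bool global_locked;
    bool sem_locked[DEMO_NSEMS];
    bool complex_waiter;   /* a sleeping multi-op forces the global lock */
};

/* Take the per-semaphore lock or the global lock, depending on the current
 * state of the array, and report which one was taken. */
int demo_sem_lock(struct demo_sem_array *sma, int semnum)
{
    if (sma->complex_waiter) {
        sma->global_locked = true;
        return DEMO_GLOBAL;
    }
    sma->sem_locked[semnum] = true;
    return semnum;
}

/* Release the lock named by locknum. Returns false if that lock was not
 * held, which is the situation the stale locknum produces. */
bool demo_sem_unlock(struct demo_sem_array *sma, int locknum)
{
    bool *lock = (locknum == DEMO_GLOBAL) ? &sma->global_locked
                                          : &sma->sem_locked[locknum];
    bool was_held = *lock;

    *lock = false;
    return was_held;
}
```

    If the caller keeps the locknum from its first demo_sem_lock() call and
    a complex waiter appears before the second call, unlocking with the
    stale value releases a lock that is not held and leaves the global lock
    taken. Using the value returned by the second call avoids this.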

    Fixes: 370b262c896e ("ipc/sem: avoid idr tree lookup for interrupted semop")
    Link: http://lkml.kernel.org/r/1482215645-22328-1-git-send-email-manfred@colorfullife.com
    Signed-off-by: Manfred Spraul
    Reported-by: Dmitry Vyukov
    Reported-by: Johanna Abrahamsson
    Tested-by: Johanna Abrahamsson
    Acked-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • The build of frv allmodconfig was failing with errors like:

    /tmp/cc0JSPc3.s: Assembler messages:
    /tmp/cc0JSPc3.s:1839: Error: symbol `.LSLT0' is already defined
    /tmp/cc0JSPc3.s:1842: Error: symbol `.LASLTP0' is already defined
    /tmp/cc0JSPc3.s:1969: Error: symbol `.LELTP0' is already defined
    /tmp/cc0JSPc3.s:1970: Error: symbol `.LELT0' is already defined

    Commit 866ced950bcd ("kbuild: Support split debug info v4") introduced
    splitting the debug info and keeping that in a separate file. Somehow,
    the frv-linux gcc did not like that and I am guessing that instead of
    splitting it started copying. The first report about this is at:

    https://lists.01.org/pipermail/kbuild-all/2015-July/010527.html.

    I will try and see if this can work with frv, and if it still fails I
    will open a bug report with gcc. Meanwhile, this is the easiest option
    to solve the build failure of frv.

    Fixes: 866ced950bcd ("kbuild: Support split debug info v4")
    Link: http://lkml.kernel.org/r/1482062348-5352-1-git-send-email-sudipm.mukherjee@gmail.com
    Signed-off-by: Sudip Mukherjee
    Reported-by: Fengguang Wu
    Cc: Andi Kleen
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sudip Mukherjee
     
  • The flag was introduced by commit 78afd5612deb ("mm: add
    __GFP_OTHER_NODE flag") to allow proper accounting of remote node
    allocations done by kernel daemons on behalf of a process - e.g.
    khugepaged.

    After "mm: fix remote numa hits statistics" we neither need nor actually
    use the flag, so we can safely remove it because all allocations which
    are satisfied from their "home" node are accounted properly.

    [mhocko@suse.com: fix build]
    Link: http://lkml.kernel.org/r/20170106122225.GK5556@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20170102153057.9451-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Taku Izumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Jia He has noticed that commit b9f00e147f27 ("mm, page_alloc: reduce
    branches in zone_statistics") has an unintentional side effect that
    remote node allocation requests are accounted as NUMA_MISS rather than
    NUMA_HIT and NUMA_OTHER if such a request doesn't use __GFP_OTHER_NODE.

    There are potentially many of these, because the flag is used very rarely
    while we have many users of __alloc_pages_node.

    Fix this by simply ignoring __GFP_OTHER_NODE (it can be removed in a
    follow up patch) and treating all allocations that were satisfied from
    the preferred zone's node as NUMA_HIT, because this is the same node we
    requested the allocation from in most cases. If this is not the local
    node then we just account it as NUMA_OTHER rather than NUMA_LOCAL.

    One downside would be that an allocation request for a node which is
    outside of the mempolicy nodemask would be reported as a hit, which is a
    bit weird, but that was the case before b9f00e147f27 already.
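
    The accounting rule described above can be sketched as follows. This is
    a simplified user-space model, not the zone_statistics() code: it works
    on node numbers instead of zones, covers only the counters named in this
    changelog, and all demo_* names are hypothetical.

```c
/* Counters modelled here; only those mentioned in the text above. */
enum demo_stat {
    DEMO_NUMA_HIT,
    DEMO_NUMA_MISS,
    DEMO_NUMA_LOCAL,
    DEMO_NUMA_OTHER,
    DEMO_NR_STATS
};

struct demo_node_stats {
    unsigned long count[DEMO_NR_STATS];
};

/* Account one allocation on the node that satisfied it.
 * preferred_node: node the allocation was requested from
 * allocated_node: node that actually provided the page
 * local_node:     node of the CPU doing the allocation */
void demo_account(struct demo_node_stats *stats, int preferred_node,
                  int allocated_node, int local_node)
{
    /* Hit or miss depends only on whether the preferred node was used. */
    if (allocated_node == preferred_node)
        stats[allocated_node].count[DEMO_NUMA_HIT]++;
    else
        stats[allocated_node].count[DEMO_NUMA_MISS]++;

    /* Local or other depends only on where the allocating CPU runs. */
    if (allocated_node == local_node)
        stats[allocated_node].count[DEMO_NUMA_LOCAL]++;
    else
        stats[allocated_node].count[DEMO_NUMA_OTHER]++;
}
```

    Under this model a request for node 1 made from a CPU on node 0 and
    satisfied from node 1 counts as a hit plus "other", without needing any
    extra flag on the request.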

    Fixes: b9f00e147f27 ("mm, page_alloc: reduce branches in zone_statistics")
    Link: http://lkml.kernel.org/r/20170102153057.9451-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Jia He
    Reviewed-by: Vlastimil Babka # with cbmc[1] superpowers
    Acked-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Taku Izumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko