10 Jul, 2016

3 commits


09 Jul, 2016

1 commit


02 Jul, 2016

4 commits

  • One of the memory bitmaps used by the hibernation image restoration
    code is freed after the image has been loaded.

    That is not quite efficient, though, because the memory pages used
    for building that bitmap are known to be safe (ie. they were not
    used by the image kernel before hibernation) and the arch-specific
    code finalizing the image restoration may need them. In that case
    it needs to allocate those pages again via the memory management
    subsystem, check if they are really safe again by consulting the
    other bitmaps and so on.

    To avoid that, recycle those pages by putting them into the global
    list of known safe pages so that they can be given to the arch code
    right away when necessary.

    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
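
The recycling idea in the entry above can be modelled in userspace as a singly linked list threaded through the pages themselves. This is only a sketch under stated assumptions: the names `recycle_safe_page` and `get_recycled_safe_page`, the `linked_page` layout and the page size are illustrative and are not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace model: a "page" is just a malloc'ed buffer. */
#define MODEL_PAGE_SIZE 4096

/* The list node lives inside the page it describes. */
struct linked_page {
	struct linked_page *next;
};

/* Global list of pages that are already known to be safe. */
struct linked_page *safe_pages_list;

/* Put a known-safe page on the list instead of freeing it. */
void recycle_safe_page(void *page)
{
	struct linked_page *lp = page;

	assert(page != NULL);
	lp->next = safe_pages_list;
	safe_pages_list = lp;
}

/*
 * Take a page from the list. NULL means the list is empty and the
 * caller has to fall back to allocating and re-checking a page.
 */
void *get_recycled_safe_page(void)
{
	struct linked_page *lp = safe_pages_list;

	if (lp)
		safe_pages_list = lp->next;
	return lp;
}
```

Because a page taken from this list was already verified once, the consumer can skip both the allocation and the safety check, which is the saving the entry describes.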
     
  • Rework mark_unsafe_pages() to use a simpler method of clearing
    all bits in free_pages_map and to set the bits for the "unsafe"
    pages (ie. pages that were used by the image kernel before
    hibernation) with the help of duplicate_memory_bitmap().

    For this purpose, move the pfn_valid() check from mark_unsafe_pages()
    to unpack_orig_pfns() where the "unsafe" pages are discovered.

    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
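
A rough userspace model of the "clear everything, then duplicate the unsafe bits" approach from the entry above. The bitmap type, its size and all `model_*` names are assumptions made for illustration; the kernel's memory bitmaps are a different, more elaborate structure.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define MODEL_NR_PFNS 64  /* hypothetical number of page frames tracked */

struct model_bitmap {
	unsigned char bits[MODEL_NR_PFNS / CHAR_BIT];
};

void model_set_bit(struct model_bitmap *bm, size_t pfn)
{
	assert(pfn < MODEL_NR_PFNS);
	bm->bits[pfn / CHAR_BIT] |= (unsigned char)(1u << (pfn % CHAR_BIT));
}

int model_test_bit(const struct model_bitmap *bm, size_t pfn)
{
	assert(pfn < MODEL_NR_PFNS);
	return (bm->bits[pfn / CHAR_BIT] >> (pfn % CHAR_BIT)) & 1;
}

/* Set in dst every bit that is set in src; dst bits are never cleared here. */
void model_duplicate_bitmap(struct model_bitmap *dst,
			    const struct model_bitmap *src)
{
	for (size_t pfn = 0; pfn < MODEL_NR_PFNS; pfn++)
		if (model_test_bit(src, pfn))
			model_set_bit(dst, pfn);
}

/* Clear the whole free map, then mark every page the image kernel used. */
void model_mark_unsafe_pages(struct model_bitmap *free_map,
			     const struct model_bitmap *used_by_image)
{
	memset(free_map->bits, 0, sizeof(free_map->bits));
	model_duplicate_bitmap(free_map, used_by_image);
}
```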
     
  • The core image restoration code preallocates some safe pages
    (ie. pages that weren't used by the image kernel before hibernation)
    for future use before allocating the bulk of memory for loading the
    image data. Those safe pages are then freed so they can be allocated
    again (with the memory management subsystem's help). That's done to
    ensure that there will be enough safe pages for temporary data
    structures needed during image restoration.

    However, it is not really necessary to free those pages after they
    have been allocated. They can be added to the (global) list of
    safe pages right away and then picked up from there when needed
    without freeing.

    That reduces the overhead related to using safe pages, especially
    in the arch-specific code, so modify the code accordingly.

    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     
  • If a freezable workqueue aborts the suspend flow, show the
    workqueue state for debugging purposes.

    Signed-off-by: Roger Lu
    Acked-by: Tejun Heo
    Signed-off-by: Rafael J. Wysocki

    Roger Lu
     

01 Jul, 2016

1 commit

  • Logan Gunthorpe reports that hibernation stopped working reliably for
    him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
    and rodata).

    That turns out to be a consequence of a long-standing issue with the
    64-bit image restoration code on x86, which is that the temporary
    page tables set up by it to avoid page table corruption when the
    last bits of the image kernel's memory contents are copied into
    their original page frames re-use the boot kernel's text mapping,
    but that mapping may very well get corrupted just like any other
    part of the page tables. Of course, if that happens, the final
    jump to the image kernel's entry point will go nowhere.

    The exact reason why commit ab76f7b4ab23 matters here is that it
    sometimes causes a PMD of a large page to be split into PTEs
    that are allocated dynamically and get corrupted during image
    restoration as described above.

    To fix that issue, note that the code copying the last bits of the
    image kernel's memory contents to the page frames occupied by them
    previously doesn't use the kernel text mapping, because it runs from
    a special page covered by the identity mapping set up for that code
    from scratch. Hence, the kernel text mapping is only needed before
    that code starts to run and then it will only be used just for the
    final jump to the image kernel's entry point.

    Accordingly, the temporary page tables set up in swsusp_arch_resume()
    on x86-64 need to contain the kernel text mapping too. That mapping
    is only going to be used for the final jump to the image kernel, so
    it only needs to cover the image kernel's entry point, because the
    first thing the image kernel does after getting control back is to
    switch over to its own original page tables. Moreover, the virtual
    address of the image kernel's entry point in that mapping has to be
    the same as the one mapped by the image kernel's page tables.

    With that in mind, modify the x86-64's arch_hibernation_header_save()
    and arch_hibernation_header_restore() routines to pass the physical
    address of the image kernel's entry point (in addition to its virtual
    address) to the boot kernel (a small piece of assembly code involved
    in passing the entry point's virtual address to the image kernel is
    not necessary any more after that, so drop it). Update RESTORE_MAGIC
    too to reflect the image header format change.

    Next, in set_up_temporary_mappings(), use the physical and virtual
    addresses of the image kernel's entry point passed in the image
    header to set up a minimum kernel text mapping (using memory pages
    that won't be overwritten by the image kernel's memory contents) that
    will map those addresses to each other as appropriate.

    This makes the concern about the possible corruption of the original
    boot kernel text mapping go away and if the minimum kernel text
    mapping used for the final jump marks the image kernel's entry point
    memory as executable, the jump to it is guaranteed to succeed.

    Fixes: ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata)
    Link: http://marc.info/?l=linux-pm&m=146372852823760&w=2
    Reported-by: Logan Gunthorpe
    Reported-and-tested-by: Borislav Petkov
    Tested-by: Kees Cook
    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
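
The header change in the entry above (carrying both the virtual and the physical entry-point address, guarded by a magic value that changes with the format) might look roughly like the following in a userspace model. The structure layout, the field names, the magic constant and the error codes are all assumptions for illustration and are not copied from the x86 code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical magic value; it would change whenever the layout changes. */
#define MODEL_RESTORE_MAGIC 0x23456789ABCDEF12ULL

struct model_restore_record {
	uint64_t jump_address;       /* virtual address of the entry point */
	uint64_t jump_address_phys;  /* physical address of the entry point */
	uint64_t magic;              /* identifies this header format */
};

/* Image kernel side: store both addresses and the magic in the header. */
int model_header_save(void *buf, size_t size, uint64_t virt, uint64_t phys)
{
	struct model_restore_record rec = { virt, phys, MODEL_RESTORE_MAGIC };

	assert(buf != NULL);
	if (size < sizeof(rec))
		return -EOVERFLOW;
	memcpy(buf, &rec, sizeof(rec));
	return 0;
}

/* Boot kernel side: refuse headers written in a different format. */
int model_header_restore(const void *buf, uint64_t *virt, uint64_t *phys)
{
	struct model_restore_record rec;

	assert(buf != NULL && virt != NULL && phys != NULL);
	memcpy(&rec, buf, sizeof(rec));
	if (rec.magic != MODEL_RESTORE_MAGIC)
		return -EINVAL;
	*virt = rec.jump_address;
	*phys = rec.jump_address_phys;
	return 0;
}
```

With both addresses available, the boot kernel has what it needs to build a minimal text mapping that maps the virtual entry point to its physical page.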
     

28 Jun, 2016

1 commit


27 Jun, 2016

2 commits

  • Linus Torvalds
     
  • Pull SCSI fixes from James Bottomley:
    "Two straightforward fixes.

    One is a concurrency issue only affecting SAS connected SATA drives,
    but which could hang the storage subsystem if it triggers (because the
    outstanding command count on error never goes back to zero) and the
    other is a NO_TAG fallout from the switch to hostwide tags which
    causes the system to crash on module insertion (we've checked
    carefully and only the 53c700 family of drivers is vulnerable to this
    issue)"

    * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
    53c700: fix BUG on untagged commands
    scsi: fix race between simultaneous decrements of ->host_failed

    Linus Torvalds
     

25 Jun, 2016

28 commits

  • …git/mason/linux-btrfs

    Pull btrfs fixes part 2 from Chris Mason:
    "This has one patch from Omar to bring iterate_shared back to btrfs.

    We have a tree of work we queue up for directory items and it doesn't
    lend itself well to shared access. While we're cleaning it up, Omar
    has changed things to use an exclusive lock when there are delayed
    items"

    * 'for-linus-4.7-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Btrfs: fix ->iterate_shared() by upgrading i_rwsem for delayed nodes

    Linus Torvalds
     
  • Pull btrfs fixes from Chris Mason:
    "I have a two part pull this time because one of the patches Dave
    Sterba collected needed to be against v4.7-rc2 or higher (we used
    rc4). I try to make my for-linus-xx branch testable on top of the
    last major so we can hand fixes to people on the list more easily, so
    I've split this pull in two.

    This first part has some fixes and two performance improvements that
    we've been testing for some time.

    Josef's two performance fixes are most notable. The transid tracking
    patch makes a big improvement on pretty much every workload"

    * 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Btrfs: Force stripesize to the value of sectorsize
    btrfs: fix disk_i_size update bug when fallocate() fails
    Btrfs: fix error handling in map_private_extent_buffer
    Btrfs: fix error return code in btrfs_init_test_fs()
    Btrfs: don't do nocow check unless we have to
    btrfs: fix deadlock in delayed_ref_async_start
    Btrfs: track transid for delayed ref flushing

    Linus Torvalds
     
  • Pull sound fixes from Takashi Iwai:
    "Again pretty calm weeks: we've had only a few trivial / stable
    HD-audio fixes in addition to a possible race fix for snd-dummy driver
    spotted by syzkaller"

    * tag 'sound-4.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
    ALSA: dummy: Fix a use-after-free at closing
    ALSA: hda / realtek - add two more Thinkpad IDs (5050,5053) for tpt460 fixup
    ALSA: hda - Fix the headset mic jack detection on Dell machine
    ALSA: hda/tegra: iomem fixups for sparse warnings
    ALSA: hdac_regmap - fix the register access for runtime PM

    Linus Torvalds
     
  • Pull x86 kprobe fix from Thomas Gleixner:
    "A single fix clearing the TF bit when a fault is single stepped"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    kprobes/x86: Clear TF bit in fault on single-stepping

    Linus Torvalds
     
  • Pull scheduler fixes from Thomas Gleixner:
    "A couple of scheduler fixes:

    - force watchdog reset while processing sysrq-w

    - fix a deadlock when enabling trace events in the scheduler

    - fixes to the throttled next buddy logic

    - fixes for the average accounting (missing serialization and
    underflow handling)

    - allow kernel threads for fallback to online but not active cpus"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/core: Allow kthreads to fall back to online && !active cpus
    sched/fair: Do not announce throttled next buddy in dequeue_task_fair()
    sched/fair: Initialize throttle_count for new task-groups lazily
    sched/fair: Fix cfs_rq avg tracking underflow
    kernel/sysrq, watchdog, sched/core: Reset watchdog on all CPUs while processing sysrq-w
    sched/debug: Fix deadlock when enabling sched events
    sched/fair: Fix post_init_entity_util_avg() serialization

    Linus Torvalds
     
  • Commit fe742fd4f90f ("Revert "btrfs: switch to ->iterate_shared()"")
    backed out the conversion to ->iterate_shared() for Btrfs because the
    delayed inode handling in btrfs_real_readdir() is racy. However, we can
    still do readdir in parallel if there are no delayed nodes.

    This is a temporary fix which upgrades the shared inode lock to an
    exclusive lock only when we have delayed items until we come up with a
    more complete solution. While we're here, rename the
    btrfs_{get,put}_delayed_items functions to make it very clear that
    they're just for readdir.

    Tested with xfstests and by doing a parallel kernel build:

    while make tinyconfig && make -j4 && git clean dqfx; do
    :
    done

    along with a bunch of parallel finds in another shell:

    while true; do
    for ((i=0; i<4; i++)); do
    find . >/dev/null &
    done
    wait
    done

    Signed-off-by: Omar Sandoval
    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    Omar Sandoval
     
  • Pull locking fix from Thomas Gleixner:
    "A single fix to address a race in the static key logic"

    * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    locking/static_key: Fix concurrent static_key_slow_inc()

    Linus Torvalds
     
  • Pull irq fix from Thomas Gleixner:
    "A single fix for the fallout from the conversion of MIPS GIC to irq
    domains"

    * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    irqchip/mips-gic: Fix IRQs in gic_dev_domain

    Linus Torvalds
     
  • Pull powerpc fixes from Michael Ellerman:
    "mm/radix (Aneesh Kumar K.V):
    - Update to tlb functions ric argument
    - Flush page walk cache when freeing page table
    - Update Radix tree size as per ISA 3.0

    mm/hash (Aneesh Kumar K.V):
    - Use the correct PPP mask when updating HPTE
    - Don't add memory coherence if cache inhibited is set

    eeh (Gavin Shan):
    - Fix invalid cached PE primary bus

    bpf/jit (Naveen N. Rao):
    - Disable classic BPF JIT on ppc64le

    .. and fix faults caused by radix patching of SLB miss handler
    (Michael Ellerman)"

    * tag 'powerpc-4.7-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
    powerpc/bpf/jit: Disable classic BPF JIT on ppc64le
    powerpc: Fix faults caused by radix patching of SLB miss handler
    powerpc/eeh: Fix invalid cached PE primary bus
    powerpc/mm/radix: Update Radix tree size as per ISA 3.0
    powerpc/mm/hash: Don't add memory coherence if cache inhibited is set
    powerpc/mm/hash: Use the correct PPP mask when updating HPTE
    powerpc/mm/radix: Flush page walk cache when freeing page table
    powerpc/mm/radix: Update to tlb functions ric argument

    Linus Torvalds
     
  • Commit b235beea9e99 ("Clarify naming of thread info/stack allocators")
    breaks the build on some powerpc configs, where THREAD_SIZE < PAGE_SIZE:

    kernel/fork.c:235:2: error: implicit declaration of function 'free_thread_stack'
    kernel/fork.c:355:8: error: assignment from incompatible pointer type
    stack = alloc_thread_stack_node(tsk, node);
    ^

    Fix it by renaming free_stack() to free_thread_stack(), and updating the
    return type of alloc_thread_stack_node().

    Fixes: b235beea9e99 ("Clarify naming of thread info/stack allocators")
    Signed-off-by: Michael Ellerman
    Signed-off-by: Linus Torvalds

    Michael Ellerman
     
  • Merge misc fixes from Andrew Morton:
    "Two weeks worth of fixes here"

    * emailed patches from Andrew Morton: (41 commits)
    init/main.c: fix initcall_blacklisted on ia64, ppc64 and parisc64
    autofs: don't get stuck in a loop if vfs_write() returns an error
    mm/page_owner: avoid null pointer dereference
    tools/vm/slabinfo: fix spelling mistake: "Ocurrences" -> "Occurrences"
    fs/nilfs2: fix potential underflow in call to crc32_le
    oom, suspend: fix oom_reaper vs. oom_killer_disable race
    ocfs2: disable BUG assertions in reading blocks
    mm, compaction: abort free scanner if split fails
    mm: prevent KASAN false positives in kmemleak
    mm/hugetlb: clear compound_mapcount when freeing gigantic pages
    mm/swap.c: flush lru pvecs on compound page arrival
    memcg: css_alloc should return an ERR_PTR value on error
    memcg: mem_cgroup_migrate() may be called with irq disabled
    hugetlb: fix nr_pmds accounting with shared page tables
    Revert "mm: disable fault around on emulated access bit architecture"
    Revert "mm: make faultaround produce old ptes"
    mailmap: add Boris Brezillon's email
    mailmap: add Antoine Tenart's email
    mm, sl[au]b: add __GFP_ATOMIC to the GFP reclaim mask
    mm: mempool: kasan: don't poot mempool objects in quarantine
    ...

    Linus Torvalds
     
  • Pull rdma fixes from Doug Ledford:
    "This is the second batch of queued up rdma patches for this rc cycle.

    There isn't anything really major in here. It's passed 0day,
    linux-next, and local testing across a wide variety of hardware.
    There are still a few known issues to be tracked down, but this should
    amount to the vast majority of the rdma RC fixes.

    Round two of 4.7 rc fixes:

    - A couple minor fixes to the rdma core
    - Multiple minor fixes to hfi1
    - Multiple minor fixes to mlx4/mlx5
    - A few minor fixes to i40iw"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (31 commits)
    IB/srpt: Reduce QP buffer size
    i40iw: Enable level-1 PBL for fast memory registration
    i40iw: Return correct max_fast_reg_page_list_len
    i40iw: Correct status check on i40iw_get_pble
    i40iw: Correct CQ arming
    IB/rdmavt: Correct qp_priv_alloc() return value test
    IB/hfi1: Don't zero out qp->s_ack_queue in rvt_reset_qp
    IB/hfi1: Fix deadlock with txreq allocation slow path
    IB/mlx4: Prevent cross page boundary allocation
    IB/mlx4: Fix memory leak if QP creation failed
    IB/mlx4: Verify port number in flow steering create flow
    IB/mlx4: Fix error flow when sending mads under SRIOV
    IB/mlx4: Fix the SQ size of an RC QP
    IB/mlx5: Fix wrong naming of port_rcv_data counter
    IB/mlx5: Fix post send fence logic
    IB/uverbs: Initialize ib_qp_init_attr with zeros
    IB/core: Fix false search of the IB_SA_WELL_KNOWN_GUID
    IB/core: Fix RoCE v1 multicast join logic issue
    IB/core: Fix no default GIDs when netdevice reregisters
    IB/hfi1: Send a pkey change event on driver pkey update
    ...

    Linus Torvalds
     
  • Pull HID fix from Jiri Kosina:
    "hiddev ioctl() validation fix from Scott Bauer"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid:
    HID: hiddev: validate num_values for HIDIOCGUSAGES, HIDIOCSUSAGES commands

    Linus Torvalds
     
  • …l/git/groeck/linux-staging

    Pull hwmon fix from Guenter Roeck:
    "Improve fan type detection for dell-smm to prevent kernel hang"

    * tag 'hwmon-for-linus-v4.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
    hwmon: (dell-smm) Cache fan_type() calls and change fan detection

    Linus Torvalds
     
  • Pull ACPI fix from Rafael Wysocki:
    "Stable-candidate fix for a deadlock in ACPICA introduced during the
    4.5 development cycle by a commit attempting to improve the handling
    of AML code that doesn't belong to any namespace objects in a given
    definition block (Lv Zheng)"

    * tag 'acpi-4.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    ACPICA: Namespace: Fix deadlock triggered by MLC support in dynamic table loading

    Linus Torvalds
     
  • Pull power management fixes from Rafael Wysocki:
    "Fix for a latent cpufreq driver bug uncovered by a recent ACPICA
    change and several fixes for the devfreq framework, including one fix
    for an issue introduced recently.

    Specifics:

    - Fix a latent initialization issue in the pcc-cpufreq driver
    (incorrect initial value of a structure field) that has been
    uncovered by a recent ACPICA commit (Mike Galbraith).

    - Add a missing notification in an update_devfreq() error code path
    forgotten by a recent devfreq commit (Chanwoo Choi).

    - Fix devfreq device frequency initialization (Lukasz Luba).

    - Fix an incorrect IS_ERR() check in the devfreq framework discovered
    by the Smatch checker (Dan Carpenter).

    - Drop two excessive put_device() calls from the devfreq framework
    (MyungJoo Ham, Cai Zhiyong).

    - Fix a possible memory leak in the devfreq framework and drop an
    unnecessary kfree() invocation from it (MyungJoo Ham)"

    * tag 'pm-4.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    PM / devfreq: Send the DEVFREQ_POSTCHANGE notification when target() is failed
    cpufreq: pcc-cpufreq: Fix doorbell.access_width
    PM / devfreq: fix initialization of current frequency in last status
    PM / devfreq: exynos-nocp: Remove incorrect IS_ERR() check
    PM / devfreq: remove double put_device
    PM / devfreq: fix double call put_device
    PM / devfreq: fix duplicated kfree on devfreq pointer
    PM / devfreq: devm_kzalloc to have dev pointer more precisely

    Linus Torvalds
     
  • Pull xen bug fixes from David Vrabel:

    - fix x86 PV dom0 crash during early boot on some hardware

    - fix two pciback bugs affecting certain devices

    - fix potential overflow when clearing page tables in x86 PV

    * tag 'for-linus-4.7b-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
    xen-pciback: return proper values during BAR sizing
    x86/xen: avoid m2p lookup when setting early page table entries
    xen/pciback: Fix conf_space read/write overlap check.
    x86/xen: fix upper bound of pmd loop in xen_cleanhighmap()
    xen/balloon: Fix declared-but-not-defined warning

    Linus Torvalds
     
  • Pull arm64 fixes from Will Deacon:
    "Here are a few more arm64 fixes, but things do finally appear to be
    slowing down. The main fix is avoiding hibernation in a previously
    unanticipated situation where we have CPUs parked in the kernel, but
    it's all good stuff.

    - Fix icache/dcache sync for anonymous pages under migration
    - Correct the ASID limit check
    - Fix parallel builds of Image and Image.gz
    - Refuse to hibernate when we have CPUs that we can't offline"

    * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
    arm64: hibernate: Don't hibernate on systems with stuck CPUs
    arm64: smp: Add function to determine if cpus are stuck in the kernel
    arm64: mm: remove page_mapping check in __sync_icache_dcache
    arm64: fix boot image dependencies to not generate invalid images
    arm64: update ASID limit

    Linus Torvalds
     
  • When I replaced kasprintf("%pf") with a direct call to
    sprint_symbol_no_offset I must have broken the initcall blacklisting
    feature on the arches where dereference_function_descriptor() is
    non-trivial.

    Fixes: c8cdd2be213f (init/main.c: simplify initcall_blacklisted())
    Link: http://lkml.kernel.org/r/1466027283-4065-1-git-send-email-linux@rasmusvillemoes.dk
    Signed-off-by: Rasmus Villemoes
    Cc: Yang Shi
    Cc: Prarit Bhargava
    Cc: Petr Mladek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • __vfs_write() returns a negative value in an error case.

    Link: http://lkml.kernel.org/r/20160616083108.6278.65815.stgit@pluto.themaw.net
    Signed-off-by: Andrey Vagin
    Signed-off-by: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Vagin
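
As a general illustration of the entry above, here is a userspace write loop that cannot spin when the writer reports an error. It is a sketch only: the callback type and every `model_*` name are hypothetical, and the loop is not the autofs code. Keeping the return value in a signed type matters in C, because a negative error stored in an unsigned variable would compare as a very large positive count.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>  /* ssize_t */

/* Hypothetical writer: returns bytes written, or a negative error. */
typedef ssize_t (*model_write_fn)(const char *data, size_t len, void *ctx);

/* Returns the number of bytes left unwritten; 0 means everything went out. */
size_t model_write_all(model_write_fn wr, void *ctx,
		       const char *data, size_t bytes)
{
	ssize_t n;

	assert(wr != NULL);
	while (bytes > 0) {
		n = wr(data, bytes, ctx);
		if (n <= 0)  /* error or no progress: stop instead of looping */
			break;
		data += n;
		bytes -= (size_t)n;
	}
	return bytes;
}

/* Sample writer that accepts at most 3 bytes per call and counts them. */
ssize_t model_chunk_writer(const char *data, size_t len, void *ctx)
{
	size_t n = len < 3 ? len : 3;

	(void)data;
	*(size_t *)ctx += n;
	return (ssize_t)n;
}

/* Sample writer that always fails. */
ssize_t model_failing_writer(const char *data, size_t len, void *ctx)
{
	(void)data;
	(void)len;
	(void)ctx;
	return -1;
}
```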
     
  • We have dereferenced page_ext before checking it. Let's check it first
    and then use it.

    Fixes: f86e4271978b ("mm: check the return value of lookup_page_ext for all call sites")
    Link: http://lkml.kernel.org/r/1465249059-7883-1-git-send-email-sudipm.mukherjee@gmail.com
    Signed-off-by: Sudip Mukherjee
    Acked-by: Vlastimil Babka
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sudip Mukherjee
     
  • trivial fix to spelling mistake

    Link: http://lkml.kernel.org/r/1466672144-831-1-git-send-email-colin.king@canonical.com
    Signed-off-by: Colin Ian King
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Colin Ian King
     
  • The value `bytes' comes from the filesystem which is about to be
    mounted. We cannot trust that the value is always in the range we
    expect it to be.

    Check its value before using it to calculate the length for the crc32_le
    call. Its value must be larger than (or equal to) sumoff + 4.

    This fixes a kernel bug when accidentally mounting an image file which
    had the nilfs2 magic value 0x3434 at the right offset 0x406 by chance.
    The bytes 0x01 0x00 were stored at 0x408 and were interpreted as an
    s_bytes value of 1. This caused an underflow when subtracting sumoff +
    4 (20) in the call to crc32_le.

    BUG: unable to handle kernel paging request at ffff88021e600000
    IP: crc32_le+0x36/0x100
    ...
    Call Trace:
    nilfs_valid_sb.part.5+0x52/0x60 [nilfs2]
    nilfs_load_super_block+0x142/0x300 [nilfs2]
    init_nilfs+0x60/0x390 [nilfs2]
    nilfs_mount+0x302/0x520 [nilfs2]
    mount_fs+0x38/0x160
    vfs_kern_mount+0x67/0x110
    do_mount+0x269/0xe00
    SyS_mount+0x9f/0x100
    entry_SYSCALL_64_fastpath+0x16/0x71

    Link: http://lkml.kernel.org/r/1466778587-5184-2-git-send-email-konishi.ryusuke@lab.ntt.co.jp
    Signed-off-by: Torsten Hilbrich
    Tested-by: Torsten Hilbrich
    Signed-off-by: Ryusuke Konishi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Torsten Hilbrich
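
The underflow in the entry above comes from unsigned arithmetic: subtracting 20 from an on-disk size of 1 wraps around to a huge length. A small userspace sketch follows, assuming `sumoff` is 16 (the entry states that sumoff + 4 is 20). The function names and the upper bound parameter are illustrative assumptions, not the nilfs2 code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed checksum offset; the entry above gives sumoff + 4 = 20. */
#define MODEL_SUMOFF 16u

/*
 * Length of the data after the checksum field, computed without any
 * validation. For bytes < 20 the unsigned subtraction wraps around.
 */
size_t model_crc_tail_len_unchecked(uint16_t bytes)
{
	return (size_t)bytes - MODEL_SUMOFF - 4;
}

/*
 * Same computation, but the untrusted size is range-checked first.
 * Returns 1 and stores the length on success, 0 if bytes is implausible.
 * The upper bound max_bytes is an assumption of this sketch.
 */
int model_crc_tail_len_checked(uint16_t bytes, size_t max_bytes, size_t *len)
{
	assert(len != NULL);
	if (bytes < MODEL_SUMOFF + 4 || bytes > max_bytes)
		return 0;
	*len = (size_t)bytes - MODEL_SUMOFF - 4;
	return 1;
}
```

Passing the unchecked result to a CRC routine would make it read far past the buffer, which matches the paging fault shown in the trace.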
     
  • Tetsuo has reported the following potential oom_killer_disable vs.
    oom_reaper race:

    (1) freeze_processes() starts freezing user space threads.
    (2) Somebody (maybe a kernel thread) calls out_of_memory().
    (3) The OOM killer calls mark_oom_victim() on a user space thread
    P1 which is already in __refrigerator().
    (4) oom_killer_disable() sets oom_killer_disabled = true.
    (5) P1 leaves __refrigerator() and enters do_exit().
    (6) The OOM reaper calls exit_oom_victim(P1) before P1 can call
    exit_oom_victim(P1).
    (7) oom_killer_disable() returns while P1 has not yet finished.
    (8) P1 performs IO/interferes with the freezer.

    This situation is unfortunate. We cannot move oom_killer_disable after
    all the freezable kernel threads are frozen because the oom victim might
    depend on some of those kthreads to make forward progress and exit, so
    we could deadlock. It is also far from trivial to teach the oom_reaper
    to not call exit_oom_victim() because then we would lose a guarantee of
    the OOM killer and oom_killer_disable forward progress because
    exit_mm->mmput might block and never call exit_oom_victim.

    It seems the easiest way forward is to work around this race by calling
    try_to_freeze_tasks again after oom_killer_disable. This will make sure
    that all the tasks are frozen or it bails out.

    Fixes: 449d777d7ad6 ("mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper")
    Link: http://lkml.kernel.org/r/1466597634-16199-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Tetsuo Handa
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • According to some high-load testing, these two BUG assertions were
    hit, which led to a system panic. There were earlier discussions about
    removing these two BUG() assertions; doing so would not bring any
    side effects.

    So I made the following changes:

    1) use the existing macro CATCH_BH_JBD_RACES to wrap BUG() in the
    ocfs2_read_blocks_sync function like before.

    2) disable the macro CATCH_BH_JBD_RACES in Makefile by default.

    Link: http://lkml.kernel.org/r/1466574294-26863-1-git-send-email-ghe@suse.com
    Signed-off-by: Gang He
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gang He
     
  • If the memory compaction free scanner cannot successfully split a free
    page (only possible due to per-zone low watermark), terminate the free
    scanner rather than continuing to scan memory needlessly. If the
    watermark is insufficient for a free page of order <= cc->order, then
    terminate the scanner since all future splits will also likely fail.

    This prevents the compaction freeing scanner from scanning all memory on
    very large zones (very noticeable for zones > 128GB, for instance) when
    all splits will likely fail while holding zone->lock.

    compaction_alloc() iterating a 128GB zone has been benchmarked to take
    over 400ms on some systems whereas any free page isolated and ready to
    be split ends up failing in split_free_page() because of the low
    watermark check and thus the iteration continues.

    The next time compaction occurs, the freeing scanner will likely start
    at the end of the zone again since no success was made previously and we
    get the same lengthy iteration until the zone is brought above the low
    watermark. All thp page faults can take >400ms in such a state without
    this fix.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1606211820350.97086@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
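
The early-termination idea in the entry above can be reduced to a loop that stops at the first failed split rather than visiting every remaining candidate. This is a toy model: the array of per-candidate outcomes and the name `model_free_scan` are assumptions, and nothing here reflects the real scanner's data structures.

```c
#include <assert.h>
#include <stddef.h>

/*
 * split_ok[i] is non-zero if splitting candidate i would succeed
 * (that is, the watermark check passes). Returns how many candidates
 * were examined and stores how many were isolated.
 */
size_t model_free_scan(const int *split_ok, size_t nr, size_t *isolated)
{
	size_t examined = 0;

	assert(split_ok != NULL && isolated != NULL);
	*isolated = 0;
	for (size_t i = 0; i < nr; i++) {
		examined++;
		if (!split_ok[i])
			break;  /* later splits will likely fail too */
		(*isolated)++;
	}
	return examined;
}
```

Without the `break`, the loop would examine every candidate even though each one fails the same check, which is the lengthy iteration the entry measures.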
     
  • When kmemleak dumps contents of leaked objects it reads whole objects
    regardless of user-requested size. This upsets KASAN. Disable KASAN
    checks around object dump.

    Link: http://lkml.kernel.org/r/1466617631-68387-1-git-send-email-dvyukov@google.com
    Signed-off-by: Dmitry Vyukov
    Acked-by: Catalin Marinas
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
     
  • While working on s390 support for gigantic hugepages I ran into the
    following "Bad page state" warning when freeing gigantic pages:

    BUG: Bad page state in process bash pfn:580001
    page:000003d116000040 count:0 mapcount:0 mapping:ffffffff00000000 index:0x0
    flags: 0x7fffc0000000000()
    page dumped because: non-NULL mapping

    This is because page->compound_mapcount, which is part of a union with
    page->mapping, is initialized with -1 in prep_compound_gigantic_page(),
    and not cleared again during destroy_compound_gigantic_page(). Fix this
    by clearing the compound_mapcount in destroy_compound_gigantic_page()
    before clearing compound_head.

    Interestingly enough, the warning will not show up on x86_64, although
    this should not be architecture specific. Apparently there is an
    endianness issue, combined with the fact that the union contains both a
    64 bit ->mapping pointer and a 32 bit atomic_t ->compound_mapcount as
    members. The resulting bogus page->mapping on x86_64 therefore contains
    00000000ffffffff instead of ffffffff00000000 on s390, which will falsely
    trigger the PageAnon() check in free_pages_prepare() because
    page->mapping & PAGE_MAPPING_ANON is true on little-endian architectures
    like x86_64 in this case (the page is not compound anymore,
    ->compound_head was already cleared before). As a result, page->mapping
    will be cleared before doing the checks in free_pages_check().

    Not sure if the bogus "PageAnon() returning true" on x86_64 for the
    first tail page of a gigantic page (at this stage) has other theoretical
    implications, but they would also be fixed with this patch.

    Link: http://lkml.kernel.org/r/1466612719-5642-1-git-send-email-gerald.schaefer@de.ibm.com
    Signed-off-by: Gerald Schaefer
    Reviewed-by: Mike Kravetz
    Cc: Luiz Capitulino
    Cc: Naoya Horiguchi
    Cc: Hillf Danton
    Cc: "Kirill A . Shutemov"
    Cc: Dave Hansen
    Cc: Paul Gortmaker
    Cc: "Aneesh Kumar K . V"
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
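
The endianness effect described in the entry above can be reproduced in userspace with a union of a 64-bit word and a 32-bit counter. This is a sketch under assumptions: the union is a stand-in for the relevant part of struct page, and `MODEL_MAPPING_ANON` is assumed to be the low bit, mirroring how the entry describes the PAGE_MAPPING_ANON test.

```c
#include <stdint.h>

/* Assumed: the low bit of the mapping word flags an anonymous mapping. */
#define MODEL_MAPPING_ANON 1ULL

union model_page_word {
	uint64_t mapping;           /* stands in for the 64-bit ->mapping */
	int32_t compound_mapcount;  /* 32-bit counter sharing the storage */
};

/* What ->mapping looks like when the counter was set to -1 and never cleared. */
uint64_t model_stale_mapping(void)
{
	union model_page_word w;

	w.mapping = 0;
	w.compound_mapcount = -1;
	return w.mapping;
}

/* Returns 1 on a little-endian host, 0 on a big-endian one. */
int model_is_little_endian(void)
{
	const uint16_t probe = 1;

	return *(const unsigned char *)&probe == 1;
}

/* Mimics the low-bit test that makes the stale value look anonymous. */
int model_looks_anon(uint64_t mapping)
{
	return (mapping & MODEL_MAPPING_ANON) != 0;
}
```

On a little-endian host the stale word has its low 32 bits set, so the low-bit test fires; on a big-endian host the high 32 bits are set and the word is simply reported as a non-NULL mapping.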