29 Dec, 2015

2 commits


14 Dec, 2015

13 commits

  • This mmu_notifier_ops structure is never modified, so declare it as
    const, like the other mmu_notifier_ops structures.

    Done with the help of Coccinelle.

    Signed-off-by: Julia Lawall
    Signed-off-by: Joerg Roedel

    Julia Lawall
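
As a rough illustration of the pattern this commit applies, the sketch below declares a callback table as const because nothing writes to it after initialization. The names `demo_notifier_ops`, `demo_release`, `demo_invalidate_page` and `demo_get_ops` are hypothetical stand-ins and are not taken from the driver.

```c
#include <assert.h>

/* Hypothetical stand-in for a callback table such as mmu_notifier_ops. */
struct demo_notifier_ops {
	void (*release)(int *state);
	void (*invalidate_page)(int *state, unsigned long address);
};

static void demo_release(int *state)
{
	*state = 0;
}

static void demo_invalidate_page(int *state, unsigned long address)
{
	/* Record the low byte of the address so a caller can observe the call. */
	*state = (int)(address & 0xff);
}

/*
 * The table is never modified after initialization, so it is declared
 * const and the compiler is free to place it in read-only data.
 */
static const struct demo_notifier_ops demo_ops = {
	.release         = demo_release,
	.invalidate_page = demo_invalidate_page,
};

/* Users only ever receive a pointer to const. */
static const struct demo_notifier_ops *demo_get_ops(void)
{
	assert(demo_ops.release && demo_ops.invalidate_page);
	return &demo_ops;
}
```

An accidental write through the returned pointer is then rejected at compile time.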
     
  • Get rid of the three error paths that look the same and move
    error handling to a single place.

    Reviewed-by: Jesse Barnes
    Acked-By: David Woodhouse
    Signed-off-by: Joerg Roedel

    Joerg Roedel
     
  • Instead of just checking for a write access, calculate the
    flags that are passed to handle_mm_fault() more precisely and
    use the pre-defined macros.

    Reviewed-by: Jesse Barnes
    Acked-By: David Woodhouse
    Signed-off-by: Joerg Roedel

    Joerg Roedel
     
  • Not doing so is a bug and might trigger a BUG_ON in
    handle_mm_fault(). So add the proper permission checks
    before calling into mm code.

    Reviewed-by: Jesse Barnes
    Acked-By: David Woodhouse
    Signed-off-by: Joerg Roedel

    Joerg Roedel
     
  • The handle_mm_fault function expects the caller to do the
    access checks. Not doing so and calling the function with
    wrong permissions is a bug (caught by a BUG_ON).
    So fix this bug by adding proper access checking to the io
    page-fault code in the AMD IOMMUv2 driver.

    Reviewed-by: Jesse Barnes
    Acked-By: David Woodhouse
    Signed-off-by: Joerg Roedel

    Joerg Roedel
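
The access check described in the commits above can be sketched roughly as follows: translate what the device asked for into permission bits, then refuse the fault if the mapping lacks any of them. All `DEMO_*` constants and `demo_access_error()` are assumptions made for illustration; the real drivers use their own request flags and the kernel's vm_flags.

```c
#include <stdbool.h>

/* Hypothetical request bits, standing in for what the device reports. */
#define DEMO_REQ_READ   0x1u
#define DEMO_REQ_WRITE  0x2u
#define DEMO_REQ_EXEC   0x4u

/* Hypothetical mapping permissions, standing in for vm_flags bits. */
#define DEMO_VM_READ    0x1u
#define DEMO_VM_WRITE   0x2u
#define DEMO_VM_EXEC    0x4u

/*
 * Return true when the request needs a permission that the mapping
 * does not grant. The caller would then fail the fault instead of
 * calling into the mm code.
 */
static bool demo_access_error(unsigned int vm_flags, unsigned int req)
{
	unsigned int requested = 0;

	if (req & DEMO_REQ_READ)
		requested |= DEMO_VM_READ;
	if (req & DEMO_REQ_WRITE)
		requested |= DEMO_VM_WRITE;
	if (req & DEMO_REQ_EXEC)
		requested |= DEMO_VM_EXEC;

	return (requested & ~vm_flags) != 0;
}
```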
     
  • Linus Torvalds
     
  • Jan Stancek reported that I wrecked things for him by fixing things for
    Vladimir :/

    His report was due to an UNINTERRUPTIBLE wait getting -EINTR, which
    should not be possible, however my previous patch made this possible by
    unconditionally checking signal_pending().

    We cannot use current->state as was done previously, because it can be
    changed right after the store to that variable. We must instead pass
    the initial state along and use that.

    Fixes: 68985633bccb ("sched/wait: Fix signal handling in bit wait helpers")
    Reported-by: Jan Stancek
    Reported-by: Chris Mason
    Tested-by: Jan Stancek
    Tested-by: Vladimir Murzin
    Tested-by: Chris Mason
    Reviewed-by: Paul Turner
    Cc: Ingo Molnar
    Cc: tglx@linutronix.de
    Cc: Oleg Nesterov
    Cc: hpa@zytor.com
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
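
A minimal sketch of the idea, assuming a simplified model in which the wait helper receives the state the waiter originally asked for as a parameter instead of re-reading it from the task. The `DEMO_*` constants and `demo_bit_wait()` are hypothetical and only illustrate the decision, not the kernel's helpers.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical sleep states, standing in for the kernel's task states. */
#define DEMO_TASK_INTERRUPTIBLE    0x1
#define DEMO_TASK_UNINTERRUPTIBLE  0x2

/*
 * initial_state is the state the waiter requested, handed down by the
 * caller. It is not read back from the task, because by the time we
 * look the task state may already have been changed.
 */
static int demo_bit_wait(int initial_state, bool signal_pending)
{
	/* A real helper would sleep here before checking for signals. */
	if ((initial_state & DEMO_TASK_INTERRUPTIBLE) && signal_pending)
		return -EINTR;

	return 0;
}
```

With this shape an uninterruptible wait cannot return -EINTR, whatever signals are pending.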
     
  • Pull NFS client bugfix from Trond Myklebust:
    "SUNRPC: Fix a NFSv4.1 callback channel regression"

    * tag 'nfs-for-4.4-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
    SUNRPC: Fix callback channel

    Linus Torvalds
     
  • Pull timer fixlets from Thomas Gleixner:
    "Two trivial fixes which add missing header files and forward
    declarations so the code will compile even when the magic include
    chains are different"

    * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    irqchip/gic-v3: Add missing include for barrier.h
    irqchip/gic-v3: Add missing struct device_node declaration

    Linus Torvalds
     
  • Pull timer fix from Thomas Gleixner:
    "A single fix to unbreak a clocksource driver which has more than 32bit
    counter width"

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    clocksource: Mmio: remove artificial 32bit limitation

    Linus Torvalds
     
  • Pull fpga driver fixes from Greg KH:
    "Only two small fpga driver fixes here, both have been in linux-next
    for a while, and resolve some reported issues"

    * tag 'char-misc-4.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
    fpga manager: Fix firmware resource leak on error
    fpga manager: remove label

    Linus Torvalds
     
  • Pull staging driver fixes from Greg KH:
    "Here are a few staging and IIO driver fixes for 4.4-rc5.

    All of them resolve reported problems and have been in linux-next for
    a while. Nothing major here, just small fixes where needed"

    * tag 'staging-4.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
    staging: lustre: echo_copy.._lsm() dereferences userland pointers directly
    iio: adc: spmi-vadc: add missing of_node_put
    iio: fix some warning messages
    iio: light: apds9960: correct ->last_busy count
    iio: lidar: return -EINVAL on invalid signal
    staging: iio: dummy: complete IIO events delivery to userspace

    Linus Torvalds
     
  • Pull USB driver fixes from Greg KH:
    "Here are a number of small USB fixes for 4.4-rc5. All of them have
    been in linux-next. The majority are gadget and phy issues, with a
    few new quirks and device ids added as well"

    * tag 'usb-4.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (32 commits)
    USB: add quirk for devices with broken LPM
    xhci: fix usb2 resume timing and races.
    usb: musb: fail with error when no DMA controller set
    usb: gadget: uvc: fix permissions of configfs attributes
    usb: musb: core: Fix pm runtime for deferred probe
    usb: phy: msm: fix a possible NULL dereference
    USB: host: ohci-at91: fix a crash in ohci_hcd_at91_overcurrent_irq
    usb: Quiet down false peer failure messages
    usb: xhci: fix config fail of FS hub behind a HS hub with MTT
    xhci: Fix memory leak in xhci_pme_acpi_rtd3_enable()
    usb: Use the USB_SS_MULT() macro to decode burst multiplier for log message
    USB: whci-hcd: add check for dma mapping error
    usb: core : hub: Fix BOS 'NULL pointer' kernel panic
    USB: quirks: Apply ALWAYS_POLL to all ELAN devices
    usb-storage: Fix scsi-sd failure "Invalid field in cdb" for USB adapter JMicron
    USB: quirks: Fix another ELAN touchscreen
    usb: dwc3: gadget: don't prestart interrupt endpoints
    USB: serial: Another Infineon flash loader USB ID
    USB: cdc_acm: Ignore Infineon Flash Loader utility
    USB: cp210x: Remove CP2110 ID from compatibility list
    ...

    Linus Torvalds
     

13 Dec, 2015

23 commits

  • Pull ARM SoC fixes from Arnd Bergmann:
    "Here are a bunch of small bug fixes for various ARM platforms, nothing
    really sticks out this week, most of them either fix bugs in code that
    was just added in 4.4, or that has been broken for many years without
    anyone noticing.

    at91/sama5d2:
    - fix sama5d2 hardware setup of sd/mmc interface
    - proper selection of pinctrl drivers. PIO4 is necessary for sama5d2

    berlin:
    - fix incorrect clock input for SDIO

    exynos:
    - Fix potential NULL pointer dereference in Exynos PMU driver.

    imx:
    - Fix vf610 SAI clock configuration bug which is discovered by the
    newly added master mode support in SAI audio driver.
    - Fix buggy L2 cache latency values in vf610 device trees, which may
    cause system hang when cpu runs at a higher frequency.

    ixp4xx:
    - fix prototypes for readl/writel functions

    ls2080a:
    - use little-endian register access for GPIO and SDHCI

    omap:
    - Fix clock source for ARM TWD and global timers on am437x
    - Always select REGULATOR_FIXED_VOLTAGE for omap2+ instead of when
    MACH_OMAP3_PANDORA is selected
    - Fix SPI DMA handles for dm816x as only some were mapped
    - Fix up mbox cells for dm816x to make mailbox usable

    pxa:
    - use PWM lookup table for all ezx machines

    s3c24xx:
    - Remove incorrect __init annotation from s3c24xx cpufreq driver
    structures.

    versatile:
    - fix PCI IRQ mapping on Versatile PB"

    * tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
    ls2080a/dts: Add little endian property for GPIO IP block
    dt-bindings: define little-endian property for QorIQ GPIO
    ARM64: dts: ls2080a: fix eSDHC endianness
    ARM: dts: vf610: use reset values for L2 cache latencies
    ARM: pxa: use PWM lookup table for all machines
    ARM: dts: berlin: add 2nd clock for BG2Q sdhci0 and sdhci1
    ARM: dts: berlin: correct BG2Q's sdhci2 2nd clock
    ARM: dts: am4372: fix clock source for arm twd and global timers
    ARM: at91: fix pinctrl driver selection
    ARM: at91/dt: add always-on to 1.8V regulator
    ARM: dts: vf610: fix clock definition for SAI2
    ARM: imx: clk-vf610: fix SAI clock tree
    ARM: ixp4xx: fix read{b,w,l} return types
    irqchip/versatile-fpga: Fix PCI IRQ mapping on Versatile PB
    ARM: OMAP2+: enable REGULATOR_FIXED_VOLTAGE
    ARM: dts: add dm816x missing spi DT dma handles
    ARM: dts: add dm816x missing #mbox-cells
    cpufreq: s3c24xx: Do not mark s3c2410_plls_add as __init
    ARM: EXYNOS: Fix potential NULL pointer access in exynos_sys_powerdown_conf

    Linus Torvalds
     
  • Pull powerpc fixes from Michael Ellerman:
    - opal-irqchip: Fix double endian conversion from Alistair Popple
    - cxl: Set endianess of kernel contexts from Frederic Barrat
    - sbc8641: drop bogus PHY IRQ entries from DTS file from Paul Gortmaker
    - Revert "powerpc/eeh: Don't unfreeze PHB PE after reset" from Andrew
    Donnellan

    * tag 'powerpc-4.4-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
    Revert "powerpc/eeh: Don't unfreeze PHB PE after reset"
    powerpc/sbc8641: drop bogus PHY IRQ entries from DTS file
    cxl: Set endianess of kernel contexts
    powerpc/opal-irqchip: Fix double endian conversion

    Linus Torvalds
     
  • Merge misc fixes from Andrew Morton:
    "17 fixes"

    * emailed patches from Andrew Morton:
    MIPS: fix DMA contiguous allocation
    sh64: fix __NR_fgetxattr
    ocfs2: fix SGID not inherited issue
    mm/oom_kill.c: avoid attempting to kill init sharing same memory
    drivers/base/memory.c: prohibit offlining of memory blocks with missing sections
    tmpfs: fix shmem_evict_inode() warnings on i_blocks
    mm/hugetlb.c: fix resv map memory leak for placeholder entries
    mm: hugetlb: call huge_pte_alloc() only if ptep is null
    kernel: remove stop_machine() Kconfig dependency
    mm: kmemleak: mark kmemleak_init prototype as __init
    mm: fix kerneldoc on mem_cgroup_replace_page
    osd fs: __r4w_get_page rely on PageUptodate for uptodate
    MAINTAINERS: make Vladimir co-maintainer of the memory controller
    mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress
    mm: fix swapped Movable and Reclaimable in /proc/pagetypeinfo
    memcg: fix memory.high target
    mm: hugetlb: fix hugepage memory leak caused by wrong reserve count

    Linus Torvalds
     
  • Pull parisc fixes from Helge Deller:
    "Fix the boot crash on Mako machines with Huge Pages, prevent a panic
    with SATA controllers (and others) by correctly calculating the IOMMU
    space, hook up the mlock2 syscall and drop unneeded code in the parisc
    pci code"

    * 'parisc-4.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
    parisc: Disable huge pages on Mako machines
    parisc: Wire up mlock2 syscall
    parisc: Remove unused pcibios_init_bus()
    parisc iommu: fix panic due to trying to allocate too large region

    Linus Torvalds
     
  • Pull block layer fixes from Jens Axboe:
    "A set of fixes for the current series. This contains:

    - A bunch of fixes for lightnvm, should be the last round for this
    series. From Matias and Wenwei.

    - A writeback detach inode fix from Ilya, also marked for stable.

    - A block (though it says SCSI) fix for an OOPS in SCSI runtime power
    management.

    - Module init error path fixes for null_blk from Minfei"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    null_blk: Fix error path in module initialization
    lightnvm: do not compile in debugging by default
    lightnvm: prevent gennvm module unload on use
    lightnvm: fix media mgr registration
    lightnvm: replace req queue with nvmdev for lld
    lightnvm: comments on constants
    lightnvm: check mm before use
    lightnvm: refactor spin_unlock in gennvm_get_blk
    lightnvm: put blks when luns configure failed
    lightnvm: use flags in rrpc_get_blk
    block: detach bdev inode from its wb in __blkdev_put()
    SCSI: Fix NULL pointer dereference in runtime PM

    Linus Torvalds
     
  • Pull arm64 fixes from Catalin Marinas:

    - Update the linker script to use L1_CACHE_BYTES instead of hard-coded
    64. We recently changed L1_CACHE_BYTES to 128

    - Improve race condition reporting on set_pte_at() and change the BUG
    to WARN_ONCE. With hardware update of the accessed/dirty state, we
    need to ensure that set_pte_at() does not inadvertently override
    hardware updated state. The patch also makes the checks ignore
    !pte_valid() new entries

    * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
    arm64: Improve error reporting on set_pte_at() checks
    arm64: update linker script to increased L1_CACHE_BYTES value

    Linus Torvalds
     
  • Recent changes to how GFP_ATOMIC is defined seem to have broken the
    condition to use mips_alloc_from_contiguous() in
    mips_dma_alloc_coherent().

    I couldn't bottom out the exact change but I think it's this commit
    d0164adc89f6 ("mm, page_alloc: distinguish between being unable to
    sleep, unwilling to sleep and avoiding waking kswapd").

    GFP_ATOMIC has multiple bits set and the check for !(gfp & GFP_ATOMIC)
    isn't enough.

    The reason behind this condition is to check whether we can potentially
    do a sleeping memory allocation. Use gfpflags_allow_blocking() instead
    which should be more robust.

    Signed-off-by: Qais Yousef
    Acked-by: Mel Gorman
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qais Yousef
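
The pitfall can be shown with a small model: once an "atomic" mask is built from several bits, testing for "none of its bits set" no longer answers the question "may this allocation sleep?". The bit values and names below are hypothetical and only loosely modeled on the kernel's flags.

```c
#include <stdbool.h>

/* Hypothetical flag bits, loosely modeled on the kernel's GFP bits. */
#define DEMO_GFP_HIGH            0x1u  /* may dip into reserves */
#define DEMO_GFP_DIRECT_RECLAIM  0x2u  /* caller may sleep to reclaim */
#define DEMO_GFP_KSWAPD_RECLAIM  0x4u  /* may wake background reclaim */

/* Composite masks: note that both contain DEMO_GFP_KSWAPD_RECLAIM. */
#define DEMO_GFP_ATOMIC  (DEMO_GFP_HIGH | DEMO_GFP_KSWAPD_RECLAIM)
#define DEMO_GFP_KERNEL  (DEMO_GFP_DIRECT_RECLAIM | DEMO_GFP_KSWAPD_RECLAIM)

/* Old-style test: wrong as soon as the masks share a bit. */
static bool demo_may_sleep_old(unsigned int gfp)
{
	return !(gfp & DEMO_GFP_ATOMIC);
}

/* Test the one bit that actually means "blocking is allowed". */
static bool demo_allow_blocking(unsigned int gfp)
{
	return (gfp & DEMO_GFP_DIRECT_RECLAIM) != 0;
}
```

In this model the old test reports that a DEMO_GFP_KERNEL allocation may not sleep, because it shares a bit with the atomic mask, while the single-bit test gives the intended answer for both masks.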
     
  • According to arch/sh/kernel/syscalls_64.S and common sense, __NR_fgetxattr
    has to be defined to 259, but it isn't. Instead, it's defined to 269,
    which is of course used by another syscall, __NR_sched_setaffinity in this
    case.

    This bug was found by strace test suite.

    Signed-off-by: Dmitry V. Levin
    Acked-by: Geert Uytterhoeven
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry V. Levin
     
  • Commit 8f1eb48758aa ("ocfs2: fix umask ignored issue") introduced an
    issue, SGID of sub dir was not inherited from its parent dir. It is
    because SGID is set into "inode->i_mode" in ocfs2_get_init_inode(), but
    is later overwritten by "mode", which doesn't have SGID set.

    Fixes: 8f1eb48758aa ("ocfs2: fix umask ignored issue")
    Signed-off-by: Junxiao Bi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Acked-by: Srinivas Eeda
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
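
A simplified model of this kind of bug, assuming a helper that inherits the SGID bit from the parent directory and a caller that afterwards overwrites the result with the originally requested mode. The function names are hypothetical and the `DEMO_*` constants are local definitions for the sketch; this is not the ocfs2 code.

```c
/* Local constants for the sketch (octal mode bits). */
#define DEMO_S_IFDIR  0040000u
#define DEMO_S_ISGID  0002000u

/* New directories inherit SGID from an SGID parent directory. */
static unsigned int demo_init_mode(unsigned int parent_mode, unsigned int requested)
{
	unsigned int mode = requested;

	if ((parent_mode & DEMO_S_ISGID) && (mode & DEMO_S_IFDIR))
		mode |= DEMO_S_ISGID;
	return mode;
}

/* Buggy flow: the inherited bit is lost by a later overwrite. */
static unsigned int demo_mkdir_buggy(unsigned int parent_mode, unsigned int requested)
{
	unsigned int inode_mode = demo_init_mode(parent_mode, requested);

	inode_mode = requested;	/* drops the inherited SGID bit */
	return inode_mode;
}

/* Fixed flow: keep the mode computed at inode initialization. */
static unsigned int demo_mkdir_fixed(unsigned int parent_mode, unsigned int requested)
{
	return demo_init_mode(parent_mode, requested);
}
```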
     
  • It's possible that an oom killed victim shares an ->mm with the init
    process and thus oom_kill_process() would end up trying to kill init as
    well.

    This has been shown in practice:

    Out of memory: Kill process 9134 (init) score 3 or sacrifice child
    Killed process 9134 (init) total-vm:1868kB, anon-rss:84kB, file-rss:572kB
    Kill process 1 (init) sharing same memory
    ...
    Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009

    And this will result in a kernel panic.

    If a process is forked by init and selected for oom kill while still
    sharing init_mm, then it's likely this system is in an unrecoverable state.
    However, it's better not to try to kill init and allow the machine to
    panic due to unkillable processes.

    [rientjes@google.com: rewrote changelog]
    [akpm@linux-foundation.org: fix inverted test, per Ben]
    Signed-off-by: Chen Jie
    Signed-off-by: David Rientjes
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Ben Hutchings
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Jie
     
  • Commit bdee237c0343 ("x86: mm: Use 2GB memory block size on large-memory
    x86-64 systems") and 982792c782ef ("x86, mm: probe memory block size for
    generic x86 64bit") introduced large block sizes for x86. This made it
    possible to have multiple sections per memory block where previously,
    there was only ever one section per block.

    Since blocks consist of contiguous ranges of sections, there can be holes
    in the blocks where sections are not present. If one attempts to
    offline such a block, a crash occurs since the code is not designed to
    deal with this.

    This patch is a quick fix to guard against the crash by not allowing
    blocks with non-present sections to be offlined.

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=107781

    Signed-off-by: Seth Jennings
    Reported-by: Andrew Banman
    Cc: Daniel J Blueman
    Cc: Yinghai Lu
    Cc: Greg KH
    Cc: Russ Anderson
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Seth Jennings
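
The guard amounts to refusing the offline request when any section in the block is missing. A minimal sketch, assuming a block is represented as an array of presence flags; `demo_block_can_offline()` is a hypothetical name, not the function the patch touches.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * A block spans several contiguous sections. Offlining is only allowed
 * when every section in the block is present; a hole means the request
 * must be rejected.
 */
static bool demo_block_can_offline(const bool *section_present,
				   size_t sections_per_block)
{
	for (size_t i = 0; i < sections_per_block; i++) {
		if (!section_present[i])
			return false;
	}
	return true;
}
```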
     
  • Dmitry Vyukov provides a little program, autogenerated by syzkaller,
    which races a fault on a mapping of a sparse memfd object, against
    truncation of that object below the fault address: run repeatedly for a
    few minutes, it reliably generates shmem_evict_inode()'s
    WARN_ON(inode->i_blocks).

    (But there's nothing specific to memfd here, nor to the fstat which it
    happened to use to generate the fault: though that looked suspicious,
    since a shmem_recalc_inode() had been added there recently. The same
    problem can be reproduced with open+unlink in place of memfd_create, and
    with fstatfs in place of fstat.)

    v3.7 commit 0f3c42f522dc ("tmpfs: change final i_blocks BUG to WARNING")
    explains one cause of such a warning (a race with shmem_writepage to
    swap), and possible solutions; but we never took it further, and this
    syzkaller incident turns out to have a different cause.

    shmem_getpage_gfp()'s error recovery, when a freshly allocated page is
    then found to be beyond eof, looks plausible - decrementing the alloced
    count that was just before incremented - but in fact can go wrong, if a
    racing thread (the truncator, for example) gets its shmem_recalc_inode()
    in just after our delete_from_page_cache(). delete_from_page_cache()
    decrements nrpages, that shmem_recalc_inode() then balances the books by
    decrementing alloced itself, and our own decrement of alloced takes it
    one too low: leading to the WARNING when the object is finally evicted.

    Once the new page has been exposed in the page cache,
    shmem_getpage_gfp() must leave it to shmem_recalc_inode() itself to get
    the accounting right in all cases (and not fall through from "trunc:" to
    "decused:"). Adjust that error recovery block; and the reinitialization
    of info and sbinfo can be removed too.

    While we're here, fix shmem_writepage() to avoid the original issue: it
    will be safe against a racing shmem_recalc_inode(), if it merely
    increments swapped before the shmem_delete_from_page_cache() which
    decrements nrpages (but it must then do its own shmem_recalc_inode()
    before that, while still in balance, instead of after). (Aside: why do
    we shmem_recalc_inode() here in the swap path? Because its raison d'etre
    is to cope with clean sparse shmem pages being reclaimed behind our
    back: so here when swapping is a good place to look for that case.) But
    I've not now managed to reproduce this bug, even without the patch.

    I don't see why I didn't do that earlier: perhaps inhibited by the
    preference to eliminate shmem_recalc_inode() altogether. Driven by this
    incident, I do now have a patch to do so at last; but still want to sit
    on it for a bit, there's a couple of questions yet to be resolved.

    Signed-off-by: Hugh Dickins
    Reported-by: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
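
The double decrement can be followed with a toy model of the three counters. The struct and function names are hypothetical, and the racing thread is modeled as a flag that runs the recalculation at the critical moment; real code would hold a lock around each step.

```c
struct demo_shmem {
	long alloced;	/* pages charged to the object */
	long swapped;	/* pages currently out on swap */
	long nrpages;	/* pages currently in the page cache */
};

/* Anything charged but neither cached nor swapped is dropped. */
static void demo_recalc(struct demo_shmem *s)
{
	long freed = s->alloced - s->swapped - s->nrpages;

	if (freed > 0)
		s->alloced -= freed;
}

/* Old error recovery for a fresh page found to be beyond EOF. */
static void demo_undo_buggy(struct demo_shmem *s, int racing_recalc)
{
	s->nrpages--;			/* page leaves the page cache */
	if (racing_recalc)
		demo_recalc(s);		/* other thread already fixes alloced */
	s->alloced--;			/* our decrement: one too many if raced */
}

/* Recovery that leaves the bookkeeping to the recalculation. */
static void demo_undo_fixed(struct demo_shmem *s, int racing_recalc)
{
	s->nrpages--;
	if (racing_recalc)
		demo_recalc(s);
	demo_recalc(s);			/* idempotent: safe with or without the race */
}
```

Starting from one charged, cached page, the buggy path with the race ends at alloced == -1, while the other path ends at 0 in both cases.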
     
  • Dmitry Vyukov reported the following memory leak

    unreferenced object 0xffff88002eaafd88 (size 32):
    comm "a.out", pid 5063, jiffies 4295774645 (age 15.810s)
    hex dump (first 32 bytes):
    28 e9 4e 63 00 88 ff ff 28 e9 4e 63 00 88 ff ff (.Nc....(.Nc....
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    kmalloc include/linux/slab.h:458
    region_chg+0x2d4/0x6b0 mm/hugetlb.c:398
    __vma_reservation_common+0x2c3/0x390 mm/hugetlb.c:1791
    vma_needs_reservation mm/hugetlb.c:1813
    alloc_huge_page+0x19e/0xc70 mm/hugetlb.c:1845
    hugetlb_no_page mm/hugetlb.c:3543
    hugetlb_fault+0x7a1/0x1250 mm/hugetlb.c:3717
    follow_hugetlb_page+0x339/0xc70 mm/hugetlb.c:3880
    __get_user_pages+0x542/0xf30 mm/gup.c:497
    populate_vma_page_range+0xde/0x110 mm/gup.c:919
    __mm_populate+0x1c7/0x310 mm/gup.c:969
    do_mlock+0x291/0x360 mm/mlock.c:637
    SYSC_mlock2 mm/mlock.c:658
    SyS_mlock2+0x4b/0x70 mm/mlock.c:648

    Dmitry identified a potential memory leak in the routine region_chg,
    where a region descriptor is not freed on an error path.

    However, the root cause for the above memory leak resides in region_del.
    In this specific case, a "placeholder" entry is created in region_chg.
    The associated page allocation fails, and the placeholder entry is left
    in the reserve map. This is "by design" as the entry should be deleted
    when the map is released. The bug is in the region_del routine which is
    used to delete entries within a specific range (and when the map is
    released). region_del did not handle the case where a placeholder entry
    exactly matched the start of the range to be deleted. In this
    case, the entry would not be deleted and leaked. The fix is to take
    these special placeholder entries into account in region_del.

    The region_chg error path leak is also fixed.

    Fixes: feba16e25a57 ("mm/hugetlb: add region_del() to delete a specific range of entries")
    Signed-off-by: Mike Kravetz
    Reported-by: Dmitry Vyukov
    Acked-by: Hillf Danton
    Cc: [4.3+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
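
The placeholder case can be sketched with a sorted array of [from, to) entries, where a placeholder has from == to. This is a simplified model under stated assumptions: entries are dropped whole rather than trimmed or split, and `demo_region`, `demo_skip_old`, `demo_skip_fixed` and `demo_region_del` are hypothetical names.

```c
#include <stddef.h>

struct demo_region {
	long from, to;	/* normally [from, to); a placeholder has from == to */
	int present;	/* 1 while the entry is still in the map */
};

/* Old test for "entry lies before the range": also skips a placeholder at f. */
static int demo_skip_old(const struct demo_region *rg, long f)
{
	return rg->to <= f;
}

/* Do not skip a placeholder sitting exactly at the start of the range. */
static int demo_skip_fixed(const struct demo_region *rg, long f)
{
	return rg->to <= f && (rg->to != rg->from || rg->to != f);
}

/* Delete entries in [f, t) from a sorted map; returns how many were removed. */
static size_t demo_region_del(struct demo_region *map, size_t n, long f, long t,
			      int (*skip)(const struct demo_region *, long))
{
	size_t removed = 0;

	for (size_t i = 0; i < n; i++) {
		struct demo_region *rg = &map[i];

		if (!rg->present || skip(rg, f))
			continue;
		if (rg->from >= t)
			break;		/* sorted map: past the range */
		rg->present = 0;	/* simplified: no trimming or splitting */
		removed++;
	}
	return removed;
}
```

With a placeholder at (0, 0) and a delete over the whole map starting at 0, the old test leaves the placeholder behind.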
     
  • Currently at the beginning of hugetlb_fault(), we call huge_pte_offset()
    and check whether the obtained *ptep is a migration/hwpoison entry or
    not. And if not, then we get to call huge_pte_alloc(). This is racy
    because the *ptep could turn into migration/hwpoison entry after the
    huge_pte_offset() check. This race results in BUG_ON in
    huge_pte_alloc().

    We don't have to call huge_pte_alloc() when the huge_pte_offset()
    returns non-NULL, so let's fix this bug by moving the code into the else
    block.

    Note that the *ptep could turn into a migration/hwpoison entry after
    this block, but that's not a problem because we have another
    !pte_present check later (we never go into hugetlb_no_page() in that
    case.)

    Fixes: 290408d4a250 ("hugetlb: hugepage migration core")
    Signed-off-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Acked-by: David Rientjes
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Mike Kravetz
    Cc: [2.6.36+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Currently the full stop_machine() routine is only enabled on SMP if
    module unloading is enabled, or if the CPUs are hotpluggable. This
    leads to configurations where stop_machine() is broken as it will then
    only run the callback on the local CPU with irqs disabled, and not stop
    the other CPUs or run the callback on them.

    For example, this breaks MTRR setup on x86 in certain configs since
    ea8596bb2d8d379 ("kprobes/x86: Remove unused text_poke_smp() and
    text_poke_smp_batch() functions") as the MTRR is only established on the
    boot CPU.

    This patch removes the Kconfig option for STOP_MACHINE and uses the SMP
    and HOTPLUG_CPU config options to compile the correct stop_machine() for
    the architecture, removing the false dependency on MODULE_UNLOAD in the
    process.

    Link: https://lkml.org/lkml/2014/10/8/124
    References: https://bugs.freedesktop.org/show_bug.cgi?id=84794
    Signed-off-by: Chris Wilson
    Acked-by: Ingo Molnar
    Cc: "Paul E. McKenney"
    Cc: Pranith Kumar
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: H. Peter Anvin
    Cc: Tejun Heo
    Cc: Iulia Manda
    Cc: Andy Lutomirski
    Cc: Rusty Russell
    Cc: Peter Zijlstra
    Cc: Chuck Ebbert
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Wilson
     
  • The kmemleak_init() definition in mm/kmemleak.c is marked __init but its
    prototype in include/linux/kmemleak.h is marked __ref since commit
    a6186d89c913 ("kmemleak: Mark the early log buffer as __initdata").

    This causes a section mismatch which is reported as a warning when
    building with clang -Wsection, because kmemleak_init() is declared in
    section .ref.text but defined in .init.text.

    Fix this by marking kmemleak_init() prototype __init.

    Signed-off-by: Nicolas Iooss
    Signed-off-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicolas Iooss
     
  • Whoops, I missed removing the kerneldoc comment of the lrucare arg
    removed from mem_cgroup_replace_page; but it's a good comment, keep it.

    Signed-off-by: Hugh Dickins
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Commit 42cb14b110a5 ("mm: migrate dirty page without
    clear_page_dirty_for_io etc") simplified the migration of a PageDirty
    pagecache page: one stat needs moving from zone to zone and that's about
    all.

    It's convenient and safest for it to shift the PageDirty bit from old
    page to new, just before updating the zone stats: before copying data
    and marking the new PageUptodate. This is all done while both pages are
    isolated and locked, just as before; and just as before, there's a
    moment when the new page is visible in the radix_tree, but not yet
    PageUptodate. What's new is that it may now be briefly visible as
    PageDirty before it is PageUptodate.

    When I scoured the tree to see if this could cause a problem anywhere,
    the only places I found were in two similar functions __r4w_get_page():
    which look up a page with find_get_page() (not using page lock), then
    claim it's uptodate if it's PageDirty or PageWriteback or PageUptodate.

    I'm not sure whether that was right before, but now it might be wrong
    (on rare occasions): only claim the page is uptodate if PageUptodate.
    Or perhaps the page in question could never be migratable anyway?

    Signed-off-by: Hugh Dickins
    Tested-by: Boaz Harrosh
    Cc: Benny Halevy
    Cc: Trond Myklebust
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Vladimir architected and authored much of the current state of the
    memcg's slab memory accounting and tracking. Make sure he gets CC'd on
    bug reports ;-)

    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Tetsuo Handa has reported that the system might basically livelock in
    OOM condition without triggering the OOM killer.

    The issue is caused by internal dependency of the direct reclaim on
    vmstat counter updates (via zone_reclaimable) which are performed from
    the workqueue context. If all the current workers get assigned to an
    allocation request, though, they will be looping inside the allocator
    trying to reclaim memory but zone_reclaimable can see stale numbers so
    it will consider a zone reclaimable even though it has been scanned way
    too much. WQ concurrency logic will not consider this situation as a
    congested workqueue because it relies on the worker having to sleep in
    such a situation. This also means that it doesn't try to spawn new
    workers or invoke the rescuer thread if one is assigned to the
    queue.

    In order to fix this issue we need to do two things. First we have to
    let wq concurrency code know that we are in trouble so we have to do a
    short sleep. To avoid the issues handled by 0e093d99763e
    ("writeback: do not sleep on the congestion queue if there are no
    congested BDIs or if significant congestion is not being encountered in
    the current zone") we limit the sleep to worker threads, which are
    the ones of interest anyway.

    The second thing to do is to create a dedicated workqueue for vmstat and
    mark it WQ_MEM_RECLAIM to note it participates in the reclaim and to
    have a spare worker thread for it.

    Signed-off-by: Michal Hocko
    Reported-by: Tetsuo Handa
    Cc: Tejun Heo
    Cc: Cristopher Lameter
    Cc: Joonsoo Kim
    Cc: Arkadiusz Miskiewicz
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Commit 016c13daa5c9 ("mm, page_alloc: use masks and shifts when
    converting GFP flags to migrate types") has swapped MIGRATE_MOVABLE and
    MIGRATE_RECLAIMABLE in the enum definition. However, migratetype_names
    wasn't updated to reflect that.

    As a result, the file /proc/pagetypeinfo shows the counts for Movable as
    Reclaimable and vice versa.

    Additionally, commit 0aaa29a56e4f ("mm, page_alloc: reserve pageblocks
    for high-order atomic allocations on demand") introduced
    MIGRATE_HIGHATOMIC, but did not add a letter to distinguish it into
    show_migration_types(), so it doesn't appear in the listing of free
    areas during page alloc failures or oom kills.

    This patch fixes both problems. The atomic reserves will show with a
    letter 'H' in the free areas listings.

    Fixes: 016c13daa5c9 ("mm, page_alloc: use masks and shifts when converting GFP flags to migrate types")
    Fixes: 0aaa29a56e4f ("mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand")
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
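
One way to keep a name table from drifting out of step with its enum is to index the initializers by enumerator. This is a different technique from simply reordering the strings, shown here only as an illustration; the `demo_` identifiers are hypothetical and the list of types is shortened.

```c
enum demo_migratetype {
	DEMO_MIGRATE_UNMOVABLE,
	DEMO_MIGRATE_MOVABLE,
	DEMO_MIGRATE_RECLAIMABLE,
	DEMO_MIGRATE_HIGHATOMIC,
	DEMO_MIGRATE_TYPES	/* number of types, keep last */
};

/*
 * Designated initializers tie each string to its enumerator, so
 * swapping two enumerators cannot silently swap the displayed names.
 */
static const char *const demo_migratetype_names[DEMO_MIGRATE_TYPES] = {
	[DEMO_MIGRATE_UNMOVABLE]   = "Unmovable",
	[DEMO_MIGRATE_MOVABLE]     = "Movable",
	[DEMO_MIGRATE_RECLAIMABLE] = "Reclaimable",
	[DEMO_MIGRATE_HIGHATOMIC]  = "HighAtomic",
};

static const char *demo_migratetype_name(enum demo_migratetype mt)
{
	if ((unsigned int)mt >= DEMO_MIGRATE_TYPES)
		return "Unknown";
	return demo_migratetype_names[mt];
}
```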
     
  • When the memory.high threshold is exceeded, try_charge() schedules a
    task_work to reclaim the excess. The reclaim target is set to the
    number of pages requested by try_charge().

    This is wrong, because try_charge() usually charges more pages than
    requested (batch > nr_pages) in order to refill per cpu stocks. As a
    result, a process in a cgroup can easily exceed memory.high
    significantly when doing a lot of charges w/o returning to userspace
    (e.g. reading a file in big chunks).

    Fix this issue by assuring that when exceeding memory.high a process
    reclaims as many pages as were actually charged (i.e. batch).

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
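
The arithmetic behind the overshoot can be followed in a toy model: each charge adds a whole batch to the usage, so reclaiming only the requested pages lets usage creep upward. `demo_cgroup`, `demo_charge()` and `demo_reclaim()` are hypothetical names, and reclaim is assumed to always succeed in full.

```c
struct demo_cgroup {
	unsigned long usage;	/* pages currently charged */
	unsigned long high;	/* the memory.high threshold, in pages */
};

/*
 * Charge a whole batch (batch >= nr_pages) and return the reclaim
 * target. use_batch_target selects the corrected behaviour.
 */
static unsigned long demo_charge(struct demo_cgroup *cg, unsigned long nr_pages,
				 unsigned long batch, int use_batch_target)
{
	cg->usage += batch;
	if (cg->usage <= cg->high)
		return 0;
	return use_batch_target ? batch : nr_pages;
}

/* Assume reclaim frees exactly the requested number of pages. */
static void demo_reclaim(struct demo_cgroup *cg, unsigned long target)
{
	cg->usage -= (target > cg->usage) ? cg->usage : target;
}
```

Starting at the threshold and charging one page ten times with a batch of 32, the old target leaves usage 310 pages above the threshold, while the batch target keeps it at the threshold.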
     
  • When dequeue_huge_page_vma() in alloc_huge_page() fails, we fall back on
    alloc_buddy_huge_page() to directly create a hugepage from the buddy
    allocator.

    In that case, however, if alloc_buddy_huge_page() succeeds we don't
    decrement h->resv_huge_pages, which means that successful
    hugetlb_fault() returns without releasing the reserve count. As a
    result, subsequent hugetlb_fault() might fail even though there are
    still free hugepages.

    This patch simply adds decrementing code on that code path.

    I reproduced this problem when testing v4.3 kernel in the following situation:
    - the test machine/VM is a NUMA system,
    - hugepage overcommitting is enabled,
    - most of the hugepages are allocated and there's only one free hugepage
    which is on node 0 (for example),
    - another program, which calls set_mempolicy(MPOL_BIND) to bind itself to
    node 1, tries to allocate a hugepage,
    - the allocation should fail but the reserve count is still held.

    Signed-off-by: Naoya Horiguchi
    Cc: David Rientjes
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Hillf Danton
    Cc: Mike Kravetz
    Cc: [3.16+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

12 Dec, 2015

2 commits