16 Apr, 2016

1 commit

  • Pull block fixes from Jens Axboe:
    "A few fixes for the current series. This contains:

    - Two fixes for NVMe:

    One fixes a reset race that can be triggered by repeated
    insert/removal of the module.

    The other fixes an issue on some platforms, where we get probe
    timeouts since legacy interrupts aren't working. This used to not
    be a problem since we had the worker thread poll for completions,
    but since that was killed off, those poor souls can't successfully
    probe their NVMe device. Use a proper IRQ check and probe
    (MSI-X -> MSI -> legacy), like most other drivers do, to work
    around this (see the sketch after this list). Both from Keith.

    - A loop corruption issue with offset in iters, from Ming Lei.

    - A fix for not having the partition stat per cpu ref count
    initialized before sending out the KOBJ_ADD, which could cause user
    space to access the counter prior to initialization. Also from
    Ming Lei.

    - A fix for using the wrong congestion state, from Kaixu Xia"
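
    As a hedged illustration of that MSI-X -> MSI -> legacy fallback (not
    the actual driver diff; pdev, entries and nvec are assumed to be set
    up by the caller):

        static int setup_irqs(struct pci_dev *pdev,
                              struct msix_entry *entries, int nvec)
        {
                if (!pci_enable_msix(pdev, entries, nvec))
                        return 0;       /* MSI-X worked */
                if (!pci_enable_msi(pdev))
                        return 0;       /* fall back to MSI */
                return 0;               /* last resort: legacy INTx via pdev->irq */
        }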

    * 'for-linus' of git://git.kernel.dk/linux-block:
    block: loop: fix filesystem corruption in case of aio/dio
    NVMe: Always use MSI/MSI-x interrupts
    NVMe: Fix reset/remove race
    writeback: fix the wrong congested state variable definition
    block: partition: initialize percpuref before sending out KOBJ_ADD

    Linus Torvalds
     

07 Apr, 2016

1 commit

  • The pkeys changes brought about a truly hideous set of macros in:

    cde70140fed8 ("mm/gup: Overload get_user_pages() functions")

    ... which macros are (ab-)using the fact that __VA_ARGS__ can be used
    to shift parameter positions in macro arguments without breaking the
    build and so can be used to call separate C functions depending on
    the number of arguments of the macro.

    This allowed easy migration of these 3 GUP APIs, as both these variants
    worked at the C level:

    old:
    ret = get_user_pages(current, current->mm, address, 1, 1, 0, &page, NULL);

    new:
    ret = get_user_pages(address, 1, 1, 0, &page, NULL);

    ... while we also generated a (functionally harmless but noticeable) build
    time warning if the old API was used. As there are over 300 uses of these
    APIs, this trick eased the migration of the API and avoided excessive
    migration pain in linux-next.
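
    For illustration, here is a minimal, runnable userspace sketch of that
    argument-count dispatch trick; the names are hypothetical, not the
    actual kernel macros:

        #include <stdio.h>

        /* Two "APIs" of different arity, standing in for the old and new GUP calls. */
        static int gup_new(long addr, int n)
        { return printf("new API: %ld, %d\n", addr, n); }
        static int gup_old(void *tsk, void *mm, long addr, int n)
        { return printf("old API: %ld, %d\n", addr, n); }

        /* Expands to its 5th argument; the number of caller arguments shifts
         * which name lands there, so arity selects the function. */
        #define GUP_SELECT(_1, _2, _3, _4, NAME, ...) NAME
        #define get_pages(...) \
                GUP_SELECT(__VA_ARGS__, gup_old, unused3, gup_new, unused1)(__VA_ARGS__)

        int main(void)
        {
                get_pages(0x1000L, 1);                       /* 2 args -> gup_new */
                get_pages((void *)0, (void *)0, 0x1000L, 1); /* 4 args -> gup_old */
                return 0;
        }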

    Now, with its work done, get rid of all of that complication and ugliness:

    3 files changed, 16 insertions(+), 140 deletions(-)

    ... where the linecount of the migration hack was further inflated by the
    fact that there are NOMMU variants of these GUP APIs as well.

    Much of the conversion was done in linux-next over the past couple of months,
    and Linus recently removed all remaining old API uses from the upstream tree
    in the following upstream commit:

    cb107161df3c ("Convert straggling drivers to new six-argument get_user_pages()")

    There was one more old-API usage in mm/gup.c, in the CONFIG_HAVE_GENERIC_RCU_GUP
    code path that ARM, ARM64 and PowerPC use.

    After this commit any old API usage will break the build.

    [ Also fixed a PowerPC/HAVE_GENERIC_RCU_GUP warning reported by Stephen Rothwell. ]

    Cc: Andrew Morton
    Cc: Dave Hansen
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephen Rothwell
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

05 Apr, 2016

3 commits

  • Merge PAGE_CACHE_SIZE removal patches from Kirill Shutemov:
    "PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
    ago with promise that one day it will be possible to implement page
    cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely it ever will.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The first patch, which has most of the changes, was done with
    coccinelle. The second is manual fixups on top.

    The third patch removes the macro definitions"

    [ I was planning to apply this just before rc2, but then I spaced out,
    so here it is right _after_ rc2 instead.

    As Kirill suggested as a possibility, I could have decided to only
    merge the first two patches, and leave the old interfaces for
    compatibility, but I'd rather get it all done and any out-of-tree
    modules and patches can trivially do the conversion while still also
    working with older kernels, so there is little reason to try to
    maintain the redundant legacy model. - Linus ]

    * PAGE_CACHE_SIZE-removal:
    mm: drop PAGE_CACHE_* and page_cache_{get,release} definition
    mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage
    mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros

    Linus Torvalds
     
  • Mostly direct substitution with occasional adjustment or removing
    outdated comments.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
    time ago with the promise that one day it would be possible to
    implement the page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion on whether
    PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
    especially on the border between fs and mm.

    Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too
    much breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    the script below. For some reason, coccinelle doesn't patch header
    files; I've called spatch on them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)
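
    As a hedged illustration (not taken from any actual file in the tree),
    the script turns a typical filesystem snippet like:

        index  = pos >> PAGE_CACHE_SHIFT;
        offset = pos & ~PAGE_CACHE_MASK;
        page_cache_release(page);

    into:

        index  = pos >> PAGE_SHIFT;
        offset = pos & ~PAGE_MASK;
        put_page(page);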

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

02 Apr, 2016

5 commits

  • Commit fea85cff11de ("mm/page_isolation.c: return last tested pfn rather
    than failure indicator") changed the meaning of the return value. Let's
    change the function comments as well.

    Signed-off-by: Neil Zhang
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Zhang
     
    Commit bb29902a7515 ("oom, oom_reaper: protect oom_reaper_list using
    simpler way") has simplified the check for tasks already enqueued for
    the oom reaper by checking tsk->oom_reaper_list != NULL. This check is
    not sufficient because the tsk might be the head of the queue without
    any other tasks queued, and then we would simply lock up, looping on
    the same task. Fix the condition by checking for the head as well.
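
    A sketch of the fixed enqueue check as described (names follow the
    commit text; the exact upstream code may differ slightly):

        static void wake_oom_reaper(struct task_struct *tsk)
        {
                /* already queued if linked _or_ if tsk is the list head */
                if (tsk == oom_reaper_list || tsk->oom_reaper_list)
                        return;
                /* ... enqueue tsk and wake the reaper ... */
        }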

    Fixes: bb29902a7515 ("oom, oom_reaper: protect oom_reaper_list using simpler way")
    Signed-off-by: Michal Hocko
    Acked-by: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    The recently introduced batched invalidations mechanism uses its own
    mechanism for shootdown. However, it does the accounting of interrupts
    incorrectly (e.g., inc_irq_stat is called for local invalidations),
    gets trace-points wrong (e.g., TLB_REMOTE_SHOOTDOWN for local
    invalidations), and may break some platforms as it bypasses the
    invalidation mechanisms of Xen and SGI UV.

    This patch reuses the existing TLB flushing mechanisms instead. We use
    NULL as the mm to indicate that a global invalidation is required.
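
    A simplified sketch consistent with that description (not the exact
    diff); passing a NULL mm to the existing arch flush path requests a
    full flush:

        /* remote CPUs in the batch: reuse the arch path; NULL mm == global */
        if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids)
                flush_tlb_others(&batch->cpumask, NULL, 0, TLB_FLUSH_ALL);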

    Fixes: 72b252aed506b8 ("mm: send one IPI per CPU to TLB flush all entries after unmapping pages")
    Signed-off-by: Nadav Amit
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Dave Hansen
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nadav Amit
     
    It is incorrect to use next_node to find a target node; it will return
    MAX_NUMNODES or an invalid node. This will lead to a crash in the
    buddy system allocation.
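
    A sketch of the usual fix for this pattern (wrap around instead of
    treating the MAX_NUMNODES sentinel as a node):

        next = next_node(nid, node_online_map);
        if (next >= MAX_NUMNODES)
                next = first_node(node_online_map);   /* wrap, don't crash */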

    Fixes: c8721bbbdd36 ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
    Signed-off-by: Xishi Qiu
    Acked-by: Vlastimil Babka
    Acked-by: Naoya Horiguchi
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: "Laura Abbott"
    Cc: Hui Zhu
    Cc: Wang Xiaoqiang
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • Add the missing argument to set_track().

    Fixes: cd11016e5f52 ("mm, kasan: stackdepot implementation. Enable stackdepot for SLAB")
    Signed-off-by: Alexander Potapenko
    Cc: Andrey Konovalov
    Cc: Christoph Lameter
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Steven Rostedt
    Cc: Joonsoo Kim
    Cc: Konstantin Serebryany
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     

26 Mar, 2016

14 commits

  • !PageLRU should lead to SCAN_PAGE_LRU, not SCAN_SCAN_ABORT result.

    Signed-off-by: Kirill A. Shutemov
    Cc: Ebru Akagunduz
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • If
    - generic_file_read_iter() gets called with a zero read length,
    - the read offset is at a page boundary,
    - IOCB_DIRECT is not set
    - and the page in question hasn't made it into the page cache yet,
    then do_generic_file_read() will trigger a readahead with a req_size hint
    of zero.

    Since roundup_pow_of_two(0) is undefined, UBSAN reports

    UBSAN: Undefined behaviour in include/linux/log2.h:63:13
    shift exponent 64 is too large for 64-bit type 'long unsigned int'
    CPU: 3 PID: 1017 Comm: sa1 Tainted: G L 4.5.0-next-20160318+ #14
    [...]
    Call Trace:
    [...]
    [] ondemand_readahead+0x3aa/0x3d0
    [] ? ondemand_readahead+0x3aa/0x3d0
    [] ? find_get_entry+0x2d/0x210
    [] page_cache_sync_readahead+0x63/0xa0
    [] do_generic_file_read+0x80d/0xf90
    [] generic_file_read_iter+0x185/0x420
    [...]
    [] __vfs_read+0x256/0x3d0
    [...]

    when get_init_ra_size() gets called from ondemand_readahead().

    The net effect is that the initial readahead size is arch dependent for
    requested read lengths of zero: for example, since

    1UL << (sizeof(unsigned long) * 8)

    evaluates to 1 on x86 while its result is 0 on ARMv7, the initial readahead
    size becomes 4 on the former and 0 on the latter.

    What's more, whether or not the file access timestamp is updated for zero
    length reads is decided differently for the two cases of IOCB_DIRECT
    being set or cleared: in the first case, generic_file_read_iter()
    explicitly skips updating that timestamp while in the latter case, it is
    always updated through the call to do_generic_file_read().

    According to POSIX, zero length reads "do not modify the last data access
    timestamp" and thus, the IOCB_DIRECT behaviour is POSIXly correct.

    Let generic_file_read_iter() unconditionally check the requested read
    length at its entry and return immediately with success if it is zero.
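
    A sketch of that early return, following the shape described above (not
    the exact diff):

        ssize_t generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
        {
                size_t count = iov_iter_count(iter);

                if (!count)
                        return 0; /* zero-length: no readahead, no atime update */
                /* ... existing IOCB_DIRECT and buffered paths ... */
        }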

    Signed-off-by: Nicolai Stange
    Cc: Al Viro
    Reviewed-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicolai Stange
     
    Implement the stack depot and provide CONFIG_STACKDEPOT. The stack
    depot will allow KASAN to store allocation/deallocation stack traces
    for memory chunks. The stack traces are stored in a hash table and
    referenced by handles which reside in the kasan_alloc_meta and
    kasan_free_meta structures in the allocated memory chunks.

    IRQ stack traces are cut below the IRQ entry point to avoid unnecessary
    duplication.

    Right now stackdepot support is only enabled in SLAB allocator. Once
    KASAN features in SLAB are on par with those in SLUB we can switch SLUB
    to stackdepot as well, thus removing the dependency on SLUB stack
    bookkeeping, which wastes a lot of memory.

    This patch is based on the "mm: kasan: stack depots" patch originally
    prepared by Dmitry Chernenkov.

    Joonsoo has said that he plans to reuse the stackdepot code for the
    mm/page_owner.c debugging facility.
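
    A hedged usage sketch of the depot API this patch introduces (function
    names as described; exact signatures may differ):

        depot_stack_handle_t handle;
        unsigned long entries[16];
        struct stack_trace trace = {
                .entries     = entries,
                .max_entries = ARRAY_SIZE(entries),
        };

        save_stack_trace(&trace);
        /* store once in the hash table; keep only the small handle around */
        handle = depot_save_stack(&trace, GFP_NOWAIT);

        /* later, e.g. when printing a KASAN report: */
        depot_fetch_stack(handle, &trace);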

    [akpm@linux-foundation.org: s/depot_stack_handle/depot_stack_handle_t]
    [aryabinin@virtuozzo.com: comment style fixes]
    Signed-off-by: Alexander Potapenko
    Signed-off-by: Andrey Ryabinin
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Steven Rostedt
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • Add GFP flags to KASAN hooks for future patches to use.

    This patch is based on the "mm: kasan: unified support for SLUB and SLAB
    allocators" patch originally prepared by Dmitry Chernenkov.

    Signed-off-by: Alexander Potapenko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Steven Rostedt
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • Add KASAN hooks to SLAB allocator.

    This patch is based on the "mm: kasan: unified support for SLUB and SLAB
    allocators" patch originally prepared by Dmitry Chernenkov.

    Signed-off-by: Alexander Potapenko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Steven Rostedt
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • Hanjun Guo has reported that a CMA stress test causes broken accounting of
    CMA and free pages:

    > Before the test, I got:
    > -bash-4.3# cat /proc/meminfo | grep Cma
    > CmaTotal: 204800 kB
    > CmaFree: 195044 kB
    >
    >
    > After running the test:
    > -bash-4.3# cat /proc/meminfo | grep Cma
    > CmaTotal: 204800 kB
    > CmaFree: 6602584 kB
    >
    > So the freed CMA memory is more than total..
    >
    > Also the MemFree is more than mem total:
    >
    > -bash-4.3# cat /proc/meminfo
    > MemTotal: 16342016 kB
    > MemFree: 22367268 kB
    > MemAvailable: 22370528 kB

    Laura Abbott has confirmed the issue and suspected the freepage accounting
    rewrite around 3.18/4.0 by Joonsoo Kim. Joonsoo had a theory that this is
    caused by unexpected merging between MIGRATE_ISOLATE and MIGRATE_CMA
    pageblocks:

    > CMA isolates MAX_ORDER aligned blocks, but, during the process,
    > a partially isolated block exists. If MAX_ORDER is 11 and
    > pageblock_order is 9, two pageblocks make up MAX_ORDER
    > aligned block and I can think following scenario because pageblock
    > (un)isolation would be done one by one.
    >
    > (Each character means one pageblock. 'C' and 'I' mean MIGRATE_CMA
    > and MIGRATE_ISOLATE, respectively.)
    >
    > CC -> IC -> II (Isolation)
    > II -> CI -> CC (Un-isolation)
    >
    > If some pages are freed at this intermediate state such as IC or CI,
    > that page could be merged to the other page that is resident on
    > different type of pageblock and it will cause wrong freepage count.

    This was supposed to be prevented by CMA operating on MAX_ORDER blocks,
    but since it doesn't hold the zone->lock between pageblocks, a race
    window does exist.

    It's also likely that unexpected merging can occur between
    MIGRATE_ISOLATE and non-CMA pageblocks. This should be prevented in
    __free_one_page() since commit 3c605096d315 ("mm/page_alloc: restrict
    max order of merging on isolated pageblock"). However, we only check
    the migratetype of the pageblock where buddy merging has been initiated,
    not the migratetype of the buddy pageblock (or group of pageblocks)
    which can be MIGRATE_ISOLATE.

    Joonsoo has suggested checking for buddy migratetype as part of
    page_is_buddy(), but that would add extra checks in allocator hotpath
    and bloat-o-meter has shown significant code bloat (the function is
    inline).

    This patch reduces the bloat at some expense of more complicated code.
    The buddy-merging while-loop in __free_one_page() is initially bounded
    to pageblock_order and has no migratetype checks. The checks are
    placed outside, bumping the max_order if merging is allowed, and
    returning to the while-loop with a statement which can't possibly be
    considered harmful.

    This fixes the accounting bug and also removes the arguably weird state
    in the original commit 3c605096d315 where buddies could be left
    unmerged.
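
    Schematically (a simplified sketch, not the actual diff), the reworked
    loop looks like:

        continue_merging:
                while (order < max_order - 1) {
                        /* ordinary buddy merging, no migratetype checks */
                        order++;
                }
                if (max_order < MAX_ORDER) {
                        /* only before crossing pageblock_order do we look
                         * at the buddy pageblock's migratetype */
                        if (!is_migrate_isolate(migratetype) &&
                            !is_migrate_isolate(get_pageblock_migratetype(buddy))) {
                                max_order++;
                                goto continue_merging;
                        }
                }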

    Fixes: 3c605096d315 ("mm/page_alloc: restrict max order of merging on isolated pageblock")
    Link: https://lkml.org/lkml/2016/3/2/280
    Signed-off-by: Vlastimil Babka
    Reported-by: Hanjun Guo
    Tested-by: Hanjun Guo
    Acked-by: Joonsoo Kim
    Debugged-by: Laura Abbott
    Debugged-by: Joonsoo Kim
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Yasuaki Ishimatsu
    Cc: Zhang Yanfei
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K.V"
    Cc: [3.18+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • "oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task" tried
    to protect oom_reaper_list using MMF_OOM_KILLED flag. But we can do it
    by simply checking tsk->oom_reaper_list != NULL.

    Signed-off-by: Tetsuo Handa
    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • After "oom: clear TIF_MEMDIE after oom_reaper managed to unmap the
    address space" oom_reaper will call exit_oom_victim on the target task
    after it is done. This might however race with the PM freezer:

    CPU0                           CPU1                    CPU2
    freeze_processes
      try_to_freeze_tasks
                                                           # Allocation request
                                                           out_of_memory
      oom_killer_disable
                                                           wake_oom_reaper(P1)
                                   __oom_reap_task
                                     exit_oom_victim(P1)
      wait_event(oom_victims==0)
    [...]
                                                           do_exit(P1)
                                                           perform IO/interfere
                                                           with the freezer

    which breaks the oom_killer_disable semantic. We no longer have a
    guarantee that the oom victim won't interfere with the freezer because
    it might be anywhere on the way to do_exit while the freezer thinks the
    task has already terminated. It might trigger IO or touch devices which
    are frozen already.

    In order to close this race, make the oom_reaper thread freezable. This
    will work because
    a) already running oom_reaper will block freezer to enter the
    quiescent state
    b) wake_oom_reaper will not wake up the reaper after it has been
    frozen
    c) the only way to call exit_oom_victim after try_to_freeze_tasks
    is from the oom victim's context when we know the further
    interference shouldn't be possible

    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Entries are only added/removed from oom_reaper_list at head so we can
    use a single linked list and hence save a word in task_struct.

    Signed-off-by: Vladimir Davydov
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
    Tetsuo has reported that oom_kill_allocating_task=1 will cause
    oom_reaper_list corruption because oom_kill_process doesn't follow the
    standard OOM exclusion (aka it ignores TIF_MEMDIE) and allows
    enqueueing the same task multiple times - e.g. by sacrificing the same
    child multiple times.

    This patch fixes the issue by introducing a new MMF_OOM_KILLED mm flag
    which is set in oom_kill_process atomically and oom reaper is disabled
    if the flag was already set.

    Signed-off-by: Michal Hocko
    Reported-by: Tetsuo Handa
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • wake_oom_reaper has allowed only 1 oom victim to be queued. The main
    reason for that was the simplicity as other solutions would require some
    way of queuing. The current approach is racy and that was deemed
    sufficient as the oom_reaper is considered a best effort approach to
    help with oom handling when the OOM victim cannot terminate in a
    reasonable time. The race could lead to missing an oom victim which can
    get stuck

    out_of_memory                         oom_reaper
      wake_oom_reaper
        cmpxchg // OK
                                            oom_reap_task
                                              __oom_reap_task
    oom_victim terminates
                                                atomic_inc_not_zero // fail
    out_of_memory
      wake_oom_reaper
        cmpxchg // fails
                                            task_to_reap = NULL

    This race requires 2 OOM invocations in a short time period, which is
    not very likely but certainly not impossible. E.g. the original victim
    might not have released a lot of memory for some reason.

    The situation would improve considerably if wake_oom_reaper used a more
    robust queuing, which is what this patch implements. This means adding
    an oom_reaper_list list_head into task_struct (eating a hole before the
    embedded thread_struct for that purpose) and an oom_reaper_lock
    spinlock for queuing synchronization. wake_oom_reaper will then add the
    task to the queue and oom_reaper will dequeue it.
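
    A sketch of the queuing side as described (simplified; locking and
    task pinning as outlined above):

        static LIST_HEAD(oom_reaper_list);
        static DEFINE_SPINLOCK(oom_reaper_lock);

        static void wake_oom_reaper(struct task_struct *tsk)
        {
                get_task_struct(tsk);          /* pin the task for the reaper */
                spin_lock(&oom_reaper_lock);
                list_add(&tsk->oom_reaper_list, &oom_reaper_list);
                spin_unlock(&oom_reaper_lock);
                wake_up(&oom_reaper_wait);
        }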

    Signed-off-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    Inform about successful/failed oom_reaper attempts and dump all the
    held locks to tell us more about who is blocking progress.

    [akpm@linux-foundation.org: fix CONFIG_MMU=n build]
    Signed-off-by: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    When oom_reaper manages to unmap all the eligible vmas, there shouldn't
    be much of the freeable memory held by the oom victim left anymore, so
    it makes sense to clear the TIF_MEMDIE flag for the victim and allow
    the OOM killer to select another task.

    The lack of TIF_MEMDIE also means that the victim cannot access memory
    reserves anymore, but that shouldn't be a problem because it would get
    the access again if it needs to allocate and hits the OOM killer again
    due to the fatal_signal_pending resp. PF_EXITING check. We can safely
    hide the task from the OOM killer because it is clearly not a good
    candidate anymore, as everything reclaimable has been torn down
    already.

    This patch will allow capping the time an OOM victim can keep
    TIF_MEMDIE and thus hold off further global OOM killer actions, granted
    the oom reaper is able to take mmap_sem for the associated mm struct.
    This is not guaranteed now, but further steps should make sure that
    mmap_sem for write is blocked killable, which will help to reduce such
    lock contention. That is not done by this patch.

    Note that exit_oom_victim might be called on a remote task from
    __oom_reap_task now, so we have to check and clear the flag atomically;
    otherwise we might race and underflow oom_victims or wake up waiters
    too early.

    Signed-off-by: Michal Hocko
    Suggested-by: Johannes Weiner
    Suggested-by: Tetsuo Handa
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • This patch (of 5):

    This is based on the idea from Mel Gorman discussed during LSFMM 2015
    and independently brought up by Oleg Nesterov.

    The OOM killer currently allows killing only a single task, in the hope
    that the task will terminate in a reasonable time and free up its
    memory. Such a task (the oom victim) will get access to memory
    reserves via mark_oom_victim to allow forward progress should there be
    a need for additional memory during the exit path.

    It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
    construct workloads which break the core assumption mentioned above,
    and the OOM victim might take an unbounded amount of time to exit
    because it might be blocked in the uninterruptible state waiting for an
    event (e.g. a lock) which is blocked by another task looping in the
    page allocator.

    This patch reduces the probability of such a lockup by introducing a
    specialized kernel thread (oom_reaper) which tries to reclaim
    additional memory by preemptively reaping the anonymous or swapped-out
    memory owned by the oom victim, under the assumption that such memory
    won't be needed
    when its owner is killed and kicked from userspace anyway. There is
    one notable exception to this, though: if the OOM victim was in the
    process of coredumping, the result would be incomplete. This is
    considered a reasonable constraint because the overall system health is
    more important than debuggability of a particular application.

    A kernel thread has been chosen because we need a reliable way of
    invocation, so workqueue context is not appropriate because all the
    workers might be busy (e.g. allocating memory). Kswapd, which sounds
    like another good fit, is not appropriate either because it might get
    blocked on locks during reclaim as well.

    oom_reaper has to take mmap_sem on the target task for reading, so the
    solution is not 100% reliable because the semaphore might be held or
    blocked for write, but the probability is reduced considerably compared
    to basically any lock blocking forward progress, as described above.
    In order to prevent blocking on the lock without any forward progress,
    we use only a trylock and retry 10 times with a short sleep in between.
    Users of mmap_sem which need it for write should be carefully reviewed
    to use _killable waiting as much as possible, and to reduce allocation
    requests done with the lock held to an absolute minimum, to reduce the
    risk even further.

    The API between the oom killer and the oom reaper is quite trivial.
    wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only a
    NULL->mm transition, and oom_reaper clears it atomically once it is
    done with the work. This means that only a single mm_struct can be
    reaped at a time. As the operation is potentially disruptive, we try to
    limit it to the necessary minimum, and the reaper blocks any updates
    while it operates on an mm. mm_struct is pinned by mm_count to allow
    parallel exit_mmap, and a race is detected by
    atomic_inc_not_zero(mm_users).
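
    A sketch of that handshake, assuming the names used above (simplified
    from the description, not the exact diff):

        static void wake_oom_reaper(struct mm_struct *mm)
        {
                atomic_inc(&mm->mm_count);           /* pin the mm */
                if (cmpxchg(&mm_to_reap, NULL, mm))  /* only NULL->mm enqueues */
                        mmdrop(mm);                  /* someone is already queued */
                else
                        wake_up(&oom_reaper_wait);
        }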

    Signed-off-by: Michal Hocko
    Suggested-by: Oleg Nesterov
    Suggested-by: Mel Gorman
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Cc: Mel Gorman
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

23 Mar, 2016

3 commits

    mprotect(PROT_READ) fails when called by a READ_IMPLIES_EXEC binary on
    a memory-mapped file located on a non-exec fs, because mprotect does
    not check whether the fs is _executable_ or not: the PROT_EXEC flag is
    set automatically even if the memory-mapped file is located on a
    non-exec fs. Fix it by checking whether the memory-mapped file is
    located on a non-exec fs; if so, PROT_EXEC is not implied by PROT_READ.
    The implementation uses the VM_MAYEXEC flag, which mmap already sets
    properly, so this is now consistent with mmap.
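
    A sketch of the check as described, keyed off VM_MAYEXEC (simplified
    from the actual patch):

        /* READ_IMPLIES_EXEC: imply PROT_EXEC only where the mapping may
         * be executable at all, i.e. not on a noexec filesystem */
        if ((current->personality & READ_IMPLIES_EXEC) &&
            (prot & PROT_READ) && (vma->vm_flags & VM_MAYEXEC))
                prot |= PROT_EXEC;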

    I did the isolated tests (PT_GNU_STACK X/NX, multiple VMAs, X/NX fs). I
    also patched the official 3.19.0-47-generic Ubuntu 14.04 kernel and it
    seems to work.

    Signed-off-by: Piotr Kwapulinski
    Cc: Mel Gorman
    Cc: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Konstantin Khlebnikov
    Cc: Dan Williams
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Piotr Kwapulinski
     
  • kcov provides code coverage collection for coverage-guided fuzzing
    (randomized testing). Coverage-guided fuzzing is a testing technique
    that uses coverage feedback to determine new interesting inputs to a
    system. A notable user-space example is AFL
    (http://lcamtuf.coredump.cx/afl/). However, this technique is not
    widely used for kernel testing due to missing compiler and kernel
    support.

    kcov does not aim to collect as much coverage as possible. It aims to
    collect more or less stable coverage that is a function of syscall
    inputs. To achieve this goal it does not collect coverage in soft/hard
    interrupts, and instrumentation of some inherently non-deterministic
    or non-interesting parts of the kernel is disabled (e.g. scheduler,
    locking).

    Currently there is a single coverage collection mode (tracing), but the
    API anticipates additional collection modes. Initially I also
    implemented a second mode which exposes coverage in a fixed-size hash
    table of counters (what Quentin used in his original patch). I've
    dropped the second mode for simplicity.

    This patch adds the necessary support on the kernel side. The
    complementary compiler support was added in gcc revision 231296.

    We've used this support to build syzkaller system call fuzzer, which has
    found 90 kernel bugs in just 2 months:

    https://github.com/google/syzkaller/wiki/Found-Bugs

    We've also found 30+ bugs in our internal systems with syzkaller.
    Another (yet unexplored) direction where kcov coverage would greatly
    help is more traditional "blob mutation". For example, mounting a
    random blob as a filesystem, or receiving a random blob over wire.

    Why not gcov? A typical fuzzing loop looks as follows: (1) reset
    coverage, (2) execute a bit of code, (3) collect coverage, repeat. A
    typical coverage can be just a dozen basic blocks (e.g. an invalid
    input). In such a context gcov becomes prohibitively expensive, as the
    reset/collect coverage steps depend on the total number of basic
    blocks/edges in the program (in the case of the kernel it is about 2M),
    whereas the cost of kcov depends only on the number of executed basic
    blocks/edges. On top of that, the kernel requires per-thread coverage
    because there are always background threads and unrelated processes
    that also produce coverage. With inlined gcov instrumentation,
    per-thread coverage is not possible.

    kcov exposes kernel PCs and control flow to user-space which is
    insecure. But debugfs should not be mapped as user accessible.
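
    A hedged userspace usage sketch following the interface described here
    (the debugfs path and ioctl numbers are assumptions based on the
    patch's uapi header; details may differ):

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/ioctl.h>
        #include <sys/mman.h>
        #include <unistd.h>

        /* ioctls as assumed from the patch's uapi/linux/kcov.h */
        #define KCOV_INIT_TRACE _IOR('c', 1, unsigned long)
        #define KCOV_ENABLE     _IO('c', 100)
        #define KCOV_DISABLE    _IO('c', 101)
        #define COVER_SIZE      (64 << 10)  /* PCs the buffer can hold */

        int main(void)
        {
                unsigned long *cover, i;
                int fd = open("/sys/kernel/debug/kcov", O_RDWR);

                if (fd == -1)
                        return 1;
                ioctl(fd, KCOV_INIT_TRACE, COVER_SIZE);
                cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long),
                             PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
                if (cover == MAP_FAILED)
                        return 1;
                ioctl(fd, KCOV_ENABLE, 0);
                cover[0] = 0;           /* reset the recorded-PC counter */

                read(-1, NULL, 0);      /* the syscall we want coverage for */

                for (i = 0; i < cover[0]; i++)
                        printf("0x%lx\n", cover[i + 1]);
                ioctl(fd, KCOV_DISABLE, 0);
                return 0;
        }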

    Based on a patch by Quentin Casasnovas.

    [akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
    [akpm@linux-foundation.org: unbreak allmodconfig]
    [akpm@linux-foundation.org: follow x86 Makefile layout standards]
    Signed-off-by: Dmitry Vyukov
    Reviewed-by: Kees Cook
    Cc: syzkaller
    Cc: Vegard Nossum
    Cc: Catalin Marinas
    Cc: Tavis Ormandy
    Cc: Will Deacon
    Cc: Quentin Casasnovas
    Cc: Kostya Serebryany
    Cc: Eric Dumazet
    Cc: Alexander Potapenko
    Cc: Kees Cook
    Cc: Bjorn Helgaas
    Cc: Sasha Levin
    Cc: David Drysdale
    Cc: Ard Biesheuvel
    Cc: Andrey Ryabinin
    Cc: Kirill A. Shutemov
    Cc: Jiri Slaby
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
     
    Commit b430e9d1c6d4 ("remove compressed copy from zram in-memory")
    added a swap_slot_free_notify call in *end_swap_bio_read* to remove
    duplicated memory between zram and main memory.

    However, with the introduction of rw_page in zram, 8c7f01025f7b
    ("zram: implement rw_page operation of zram"), it became a no-op
    because rw_page doesn't need a bio.

    Memory footprint is really important on embedded platforms which have
    small memory (for example, 512M), because LMK or similar memory
    management modules could start to kill processes if the memory
    footprint exceeds some threshold.

    This patch restores the function for rw_page, thereby eliminating this
    duplication.

    Signed-off-by: Minchan Kim
    Cc: Sergey Senozhatsky
    Cc: karam.lee
    Cc:
    Cc: Chan Jeong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

22 Mar, 2016

1 commit

  • Pull drm updates from Dave Airlie:
    "This is the main drm pull request for 4.6 kernel.

    Overall the coolest thing here for me is the nouveau maxwell signed
    firmware support from NVidia, it's taken a long while to extract this
    from them.

    I also wish the ARM vendors just designed one set of display IP, ARM
    display block proliferation is definitely increasing.

    Core:
    - drm_event cleanups
    - Internal API cleanup making mode_fixup optional.
    - Apple GMUX vga switcheroo support.
    - DP AUX testing interface

    Panel:
    - Refactoring of DSI core for use over more transports.

    New driver:
    - ARM hdlcd driver

    i915:
    - FBC/PSR (framebuffer compression, panel self refresh) enabled by default.
    - Ongoing atomic display support work
    - Ongoing runtime PM work
    - Pixel clock limit checks
    - VBT DSI description support
    - GEM fixes
    - GuC firmware scheduler enhancements

    amdkfd:
    - Deferred probing fixes to avoid depending on Makefile or link ordering.

    amdgpu/radeon:
    - ACP support for i2s audio support.
    - Command Submission/GPU scheduler/GPUVM optimisations
    - Initial GPU reset support for amdgpu

    vmwgfx:
    - Support for DX10 gen mipmaps
    - Pageflipping and other fixes.

    exynos:
    - Exynos5420 SoC support for FIMD
    - Exynos5422 SoC support for MIPI-DSI

    nouveau:
    - GM20x secure boot support - adds acceleration for Maxwell GPUs.
    - GM200 support
    - GM20B clock driver support
    - Power sensors work

    etnaviv:
    - Correctness fixes for GPU cache flushing
    - Better support for i.MX6 systems.

    imx-drm:
    - VBlank IRQ support
    - Fence support
    - OF endpoint support

    msm:
    - HDMI support for 8996 (snapdragon 820)
    - Adreno 430 support
    - Timestamp queries support

    virtio-gpu:
    - Fixes for Android support.

    rockchip:
    - Add support for Innosilicon HDMI

    rcar-du:
    - Support for 4 crtcs
    - R8A7795 support
    - RCar Gen 3 support

    omapdrm:
    - HDMI interlace output support
    - dma-buf import support
    - Refactoring to remove a lot of legacy code.

    tilcdc:
    - Rewrite of pageflipping code
    - dma-buf support
    - pinctrl support

    vc4:
    - HDMI modesetting bug fixes
    - Significant 3D performance improvement.

    fsl-dcu (FreeScale):
    - Lots of fixes

    tegra:
    - Two small fixes

    sti:
    - Atomic support for planes
    - Improved HDMI support"

    * 'drm-next' of git://people.freedesktop.org/~airlied/linux: (1063 commits)
    drm/amdgpu: release_pages requires linux/pagemap.h
    drm/sti: restore mode_fixup callback
    drm/amdgpu/gfx7: add MTYPE definition
    drm/amdgpu: removing BO_VAs shouldn't be interruptible
    drm/amd/powerplay: show uvd/vce power gate enablement for tonga.
    drm/amd/powerplay: show uvd/vce power gate info for fiji
    drm/amdgpu: use sched fence if possible
    drm/amdgpu: move ib.fence to job.fence
    drm/amdgpu: give a fence param to ib_free
    drm/amdgpu: include the right version of gmc header files for iceland
    drm/radeon: fix indentation.
    drm/amd/powerplay: add uvd/vce dpm enabling flag to fix the performance issue for CZ
    drm/amdgpu: switch back to 32bit hw fences v2
    drm/amdgpu: remove amdgpu_fence_is_signaled
    drm/amdgpu: drop the extra fence range check v2
    drm/amdgpu: signal fences directly in amdgpu_fence_process
    drm/amdgpu: cleanup amdgpu_fence_wait_empty v2
    drm/amdgpu: keep all fences in an RCU protected array v2
    drm/amdgpu: add number of hardware submissions to amdgpu_fence_driver_init_ring
    drm/amdgpu: RCU protected amd_sched_fence_release
    ...

    Linus Torvalds
     

21 Mar, 2016

1 commit

  • Pull x86 protection key support from Ingo Molnar:
    "This tree adds support for a new memory protection hardware feature
    that is available in upcoming Intel CPUs: 'protection keys' (pkeys).

    There's a background article at LWN.net:

    https://lwn.net/Articles/643797/

    The gist is that protection keys allow the encoding of
    user-controllable permission masks in the pte. So instead of having a
    fixed protection mask in the pte (which needs a system call to change
    and works on a per page basis), the user can map a (handful of)
    protection mask variants and can change the masks runtime relatively
    cheaply, without having to change every single page in the affected
    virtual memory range.

    This allows the dynamic switching of the protection bits of large
    amounts of virtual memory, via user-space instructions. It also
    allows more precise control of MMU permission bits: for example the
    executable bit is separate from the read bit (see more about that
    below).

    This tree adds the MM infrastructure and low level x86 glue needed for
    that, plus it adds a high level API to make use of protection keys -
    if a user-space application calls:

    mmap(..., PROT_EXEC);

    or

    mprotect(ptr, sz, PROT_EXEC);

    (note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice
    this special case, and will set a special protection key on this
    memory range. It also sets the appropriate bits in the Protection
    Keys User Rights (PKRU) register so that the memory becomes unreadable
    and unwritable.

    So using protection keys the kernel is able to implement 'true'
    PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies
    PROT_READ as well. Unreadable executable mappings have security
    advantages: they cannot be read via information leaks to figure out
    ASLR details, nor can they be scanned for ROP gadgets - and they
    cannot be used by exploits for data purposes either.

    We know about no user-space code that relies on pure PROT_EXEC
    mappings today, but binary loaders could start making use of this new
    feature to map binaries and libraries in a more secure fashion.

    There is other pending pkeys work that offers more high level system
    call APIs to manage protection keys - but those are not part of this
    pull request.

    Right now there's a Kconfig that controls this feature
    (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled
    (like most x86 CPU feature enablement code that has no runtime
    overhead), but it's not user-configurable at the moment. If there's
    any serious problem with this then we can make it configurable and/or
    flip the default"

    * 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
    x86/mm/pkeys: Fix mismerge of protection keys CPUID bits
    mm/pkeys: Fix siginfo ABI breakage caused by new u64 field
    x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA
    mm/core, x86/mm/pkeys: Add execute-only protection keys support
    x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
    x86/mm/pkeys: Allow kernel to modify user pkey rights register
    x86/fpu: Allow setting of XSAVE state
    x86/mm: Factor out LDT init from context init
    mm/core, x86/mm/pkeys: Add arch_validate_pkey()
    mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
    x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
    x86/mm/pkeys: Add Kconfig prompt to existing config option
    x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
    x86/mm/pkeys: Dump PKRU with other kernel registers
    mm/core, x86/mm/pkeys: Differentiate instruction fetches
    x86/mm/pkeys: Optimize fault handling in access_error()
    mm/core: Do not enforce PKEY permissions on remote mm access
    um, pkeys: Add UML arch_*_access_permitted() methods
    mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
    x86/mm/gup: Simplify get_user_pages() PTE bit handling
    ...

    Linus Torvalds
     

20 Mar, 2016

1 commit

  • Pull powerpc updates from Michael Ellerman:
    "This was delayed a day or two by some build-breakage on old toolchains
    which we've now fixed.

    There's two PCI commits both acked by Bjorn.

    There's one commit to mm/hugepage.c which is (co)authored by Kirill.

    Highlights:
    - Restructure Linux PTE on Book3S/64 to Radix format from Paul
    Mackerras
    - Book3s 64 MMU cleanup in preparation for Radix MMU from Aneesh
    Kumar K.V
    - Add POWER9 cputable entry from Michael Neuling
    - FPU/Altivec/VSX save/restore optimisations from Cyril Bur
    - Add support for new ftrace ABI on ppc64le from Torsten Duwe

    Various cleanups & minor fixes from:
    - Adam Buchbinder, Andrew Donnellan, Balbir Singh, Christophe Leroy,
    Cyril Bur, Luis Henriques, Madhavan Srinivasan, Pan Xinhui, Russell
    Currey, Sukadev Bhattiprolu, Suraj Jitindar Singh.

    General:
    - atomics: Allow architectures to define their own __atomic_op_*
    helpers from Boqun Feng
    - Implement atomic{, 64}_*_return_* variants and acquire/release/
    relaxed variants for (cmp)xchg from Boqun Feng
    - Add powernv_defconfig from Jeremy Kerr
    - Fix BUG_ON() reporting in real mode from Balbir Singh
    - Add xmon command to dump OPAL msglog from Andrew Donnellan
    - Add xmon command to dump process/task similar to ps(1) from Douglas
    Miller
    - Clean up memory hotplug failure paths from David Gibson

    pci/eeh:
    - Redesign SR-IOV on PowerNV to give absolute isolation between VFs
    from Wei Yang.
    - EEH Support for SRIOV VFs from Wei Yang and Gavin Shan.
    - PCI/IOV: Rename and export virtfn_{add, remove} from Wei Yang
    - PCI: Add pcibios_bus_add_device() weak function from Wei Yang
    - MAINTAINERS: Update EEH details and maintainership from Russell
    Currey

    cxl:
    - Support added to the CXL driver for running on both bare-metal and
    hypervisor systems, from Christophe Lombard and Frederic Barrat.
    - Ignore probes for virtual afu pci devices from Vaibhav Jain

    perf:
    - Export Power8 generic and cache events to sysfs from Sukadev
    Bhattiprolu
    - hv-24x7: Fix usage with chip events, display change in counter
    values, display domain indices in sysfs, eliminate domain suffix in
    event names, from Sukadev Bhattiprolu

    Freescale:
    - Updates from Scott: "Highlights include 8xx optimizations, 32-bit
    checksum optimizations, 86xx consolidation, e5500/e6500 cpu
    hotplug, more fman and other dt bits, and minor fixes/cleanup"

    * tag 'powerpc-4.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (179 commits)
    powerpc: Fix unrecoverable SLB miss during restore_math()
    powerpc/8xx: Fix do_mtspr_cpu6() build on older compilers
    powerpc/rcpm: Fix build break when SMP=n
    powerpc/book3e-64: Use hardcoded mttmr opcode
    powerpc/fsl/dts: Add "jedec,spi-nor" flash compatible
    powerpc/T104xRDB: add tdm riser card node to device tree
    powerpc32: PAGE_EXEC required for inittext
    powerpc/mpc85xx: Add pcsphy nodes to FManV3 device tree
    powerpc/mpc85xx: Add MDIO bus muxing support to the board device tree(s)
    powerpc/86xx: Introduce and use common dtsi
    powerpc/86xx: Update device tree
    powerpc/86xx: Move dts files to fsl directory
    powerpc/86xx: Switch to kconfig fragments approach
    powerpc/86xx: Update defconfigs
    powerpc/86xx: Consolidate common platform code
    powerpc32: Remove one insn in mulhdu
    powerpc32: small optimisation in flush_icache_range()
    powerpc: Simplify test in __dma_sync()
    powerpc32: move xxxxx_dcache_range() functions inline
    powerpc32: Remove clear_pages() and define clear_page() inline
    ...

    Linus Torvalds
     

19 Mar, 2016

1 commit

  • Merge second patch-bomb from Andrew Morton:

    - a couple of hotfixes

    - the rest of MM

    - a new timer slack control in procfs

    - a couple of procfs fixes

    - a few misc things

    - some printk tweaks

    - lib/ updates, notably to radix-tree.

    - add my and Nick Piggin's old userspace radix-tree test harness to
    tools/testing/radix-tree/. Matthew said it was a godsend during the
    radix-tree work he did.

    - a few code-size improvements, switching to __always_inline where gcc
    screwed up.

    - partially implement character sets in sscanf

    * emailed patches from Andrew Morton : (118 commits)
    sscanf: implement basic character sets
    lib/bug.c: use common WARN helper
    param: convert some "on"/"off" users to strtobool
    lib: add "on"/"off" support to kstrtobool
    lib: update single-char callers of strtobool()
    lib: move strtobool() to kstrtobool()
    include/linux/unaligned: force inlining of byteswap operations
    include/uapi/linux/byteorder, swab: force inlining of some byteswap operations
    include/asm-generic/atomic-long.h: force inlining of some atomic_long operations
    usb: common: convert to use match_string() helper
    ide: hpt366: convert to use match_string() helper
    ata: hpt366: convert to use match_string() helper
    power: ab8500: convert to use match_string() helper
    power: charger_manager: convert to use match_string() helper
    drm/edid: convert to use match_string() helper
    pinctrl: convert to use match_string() helper
    device property: convert to use match_string() helper
    lib/string: introduce match_string() helper
    radix-tree tests: add test for radix_tree_iter_next
    radix-tree tests: add regression3 test
    ...

    Linus Torvalds
     

18 Mar, 2016

7 commits

  • Pull trivial tree updates from Jiri Kosina.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
    drivers/rtc: broken link fix
    drm/i915 Fix typos in i915_gem_fence.c
    Docs: fix missing word in REPORTING-BUGS
    lib+mm: fix few spelling mistakes
    MAINTAINERS: add git URL for APM driver
    treewide: Fix typo in printk

    Linus Torvalds
     
    shmem likes to occasionally drop the lock, schedule, then reacquire
    the lock and continue with the iteration from the last place it left
    off. This is currently done with a pretty ugly goto. Introduce
    radix_tree_iter_next() and use it throughout shmem.c.
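
    A simplified sketch of the resulting shmem-style iteration pattern:

        radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
                /* ... process *slot ... */
                if (need_resched()) {
                        cond_resched_rcu();    /* drop and retake the lock */
                        slot = radix_tree_iter_next(&iter); /* resume here */
                }
        }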

    [koct9i@gmail.com: fix bug in radix_tree_iter_next() for tagged iteration]
    Signed-off-by: Matthew Wilcox
    Cc: Hugh Dickins
    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
    Instead of a 'goto restart', we can now use radix_tree_iter_retry() to
    restart from our current position. This will make a difference when
    there are more ways to come across an indirect pointer. And it
    eliminates some confusing gotos.
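
    For example, a simplified sketch of the pattern:

        page = radix_tree_deref_slot(slot);
        if (radix_tree_deref_retry(page)) {
                /* transient entry: retry this index, don't 'goto restart' */
                slot = radix_tree_iter_retry(&iter);
                continue;
        }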

    [vbabka@suse.cz: remove now-obsolete-and-misleading comment]
    Signed-off-by: Matthew Wilcox
    Cc: Hugh Dickins
    Cc: Konstantin Khlebnikov
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • With huge pages, it is convenient to have the radix tree be able to
    return an entry that covers multiple indices. Previous attempts to deal
    with the problem have involved inserting N duplicate entries, which is a
    waste of memory and leads to problems trying to handle aliased tags, or
    probing the tree multiple times to find alternative entries which might
    cover the requested index.

    This approach inserts one canonical entry into the tree for a given
    range of indices, and may also insert other entries in order to ensure
    that lookups find the canonical entry.

    This solution only tolerates inserting powers of two that are greater
    than the fanout of the tree. If we wish to expand the radix tree's
    abilities to support large-ish pages that are less than the fanout at the
    penultimate level of the tree, then we would need to add one more step
    in lookup to ensure that any sibling nodes in the final level of the
    tree are dereferenced and we return the canonical entry that they
    reference.

    Signed-off-by: Matthew Wilcox
    Cc: Johannes Weiner
    Cc: Matthew Wilcox
    Cc: "Kirill A. Shutemov"
    Cc: Ross Zwisler
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • There are various email addresses for me throughout the kernel. Use the
    one that will always be valid.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • After the OOM killer is disabled during suspend operation, any
    !__GFP_NOFAIL && __GFP_FS allocations are forced to fail. Thus, any
    !__GFP_NOFAIL && !__GFP_FS allocations should be forced to fail as well.

    Signed-off-by: Tetsuo Handa
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
    While oom_killer_disable() is called by freeze_processes() after all
    user threads except the current thread are frozen, it is possible that
    kernel threads invoke the OOM killer and send SIGKILL to the current
    thread due to sharing the thawed victim's memory. Therefore, checking
    for SIGKILL is preferable to checking TIF_MEMDIE.

    Signed-off-by: Tetsuo Handa
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa