09 Sep, 2015

40 commits

  • mempool_destroy() does not tolerate a NULL mempool_t pointer argument and
    performs a NULL-pointer dereference. This requires additional attention
    and effort from developers/reviewers and forces all mempool_destroy()
    callers to do a NULL check

        if (pool)
                mempool_destroy(pool);

    or otherwise be invalid mempool_destroy() users.

    Tweak mempool_destroy() and NULL-check the pointer there.
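
    A minimal sketch of what such a tweak looks like (the early return is the
    whole point; the rest of the function is elided):

        void mempool_destroy(mempool_t *pool)
        {
                /* Tolerate a NULL pointer, as kfree() and vfree() do. */
                if (unlikely(!pool))
                        return;

                /* ... existing teardown of the pool's elements ... */
        }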

    Proposed by Andrew Morton.

    Link: https://lkml.org/lkml/2015/6/8/583
    Signed-off-by: Sergey Senozhatsky
    Acked-by: David Rientjes
    Cc: Julia Lawall
    Cc: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • kmem_cache_destroy() does not tolerate a NULL kmem_cache pointer argument
    and performs a NULL-pointer dereference. This requires additional
    attention and effort from developers/reviewers and forces all
    kmem_cache_destroy() callers (200+ as of 4.1) to do a NULL check

        if (cache)
                kmem_cache_destroy(cache);

    or otherwise be invalid kmem_cache_destroy() users.

    Tweak kmem_cache_destroy() and NULL-check the pointer there.
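
    On the caller side this lets cleanup paths drop the guard; a hedged
    example with a made-up cache pointer:

        /* foo_cache may still be NULL if setup failed early; the guard
         *
         *     if (foo_cache)
         *             kmem_cache_destroy(foo_cache);
         *
         * is no longer needed once the check lives in kmem_cache_destroy(). */
        kmem_cache_destroy(foo_cache);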

    Proposed by Andrew Morton.

    Link: https://lkml.org/lkml/2015/6/8/583
    Signed-off-by: Sergey Senozhatsky
    Acked-by: David Rientjes
    Cc: Julia Lawall
    Cc: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • The "killed" variable in out_of_memory() can be removed since the call to
    oom_kill_process() where we should block to allow the process time to
    exit is obvious.

    Signed-off-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Describe the purpose of struct oom_control and what each member does.

    Also make gfp_mask and order const since they are never manipulated or
    passed to functions that discard the qualifier.

    Signed-off-by: David Rientjes
    Cc: Michal Hocko
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Sysrq+f is used to kill a process either for debug or when the VM is
    otherwise unresponsive.

    It is not intended to trigger a panic when no process may be killed.

    Avoid panicking the system for sysrq+f when no processes are killed.

    Signed-off-by: David Rientjes
    Suggested-by: Michal Hocko
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The force_kill member of struct oom_control isn't needed if an order of -1
    is used instead. This is the same as order == -1 in struct
    compact_control which requires full memory compaction.

    This patch introduces no functional change.
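
    Call sites that tested force_kill can then test the order instead; a
    hedged sketch with a hypothetical helper name:

        /* Hypothetical helper: order == -1 marks a forced kill, mirroring
         * the order == -1 "full compaction" convention in compact_control. */
        static inline bool oom_kill_is_forced(const struct oom_control *oc)
        {
                return oc->order == -1;
        }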

    Signed-off-by: David Rientjes
    Cc: Sergey Senozhatsky
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • There are essential elements to an oom context that are passed around to
    multiple functions.

    Organize these elements into a new struct, struct oom_control, that
    specifies the context for an oom condition.

    This patch introduces no functional change.
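
    A rough sketch of the kind of context such a struct gathers (gfp_mask and
    order are named elsewhere in this series; the other members are
    illustrative assumptions):

        struct oom_control {
                struct zonelist *zonelist;  /* zonelist of the failing allocation */
                nodemask_t *nodemask;       /* allowed nodes, if constrained */
                gfp_t gfp_mask;             /* gfp mask of the failing allocation */
                int order;                  /* order of the failing allocation */
        };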

    Signed-off-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • This makes set_recommended_min_free_kbytes() have a return type of void as
    it cannot fail.

    Signed-off-by: Nicholas Krause
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicholas Krause
     
  • Explicitly state that __GFP_NORETRY will attempt direct reclaim and
    memory compaction before returning NULL and that the oom killer is not
    called in the current implementation of the page allocator.
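
    For callers, the documented behaviour boils down to a pattern like the
    following (a hedged example, not taken from the patch):

        struct page *page;

        /* One cheap attempt: direct reclaim and compaction may run, but the
         * allocator will neither retry endlessly nor invoke the OOM killer. */
        page = alloc_pages(GFP_KERNEL | __GFP_NORETRY, 2);
        if (!page)
                return -ENOMEM;  /* caller must cope, e.g. fall back to order 0 */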

    [akpm@linux-foundation.org: s/has/have/]
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • === Short summary ===

    iov_iter_fault_in_readable() works around a really rare case and we can
    avoid the deadlock it addresses in another way: disable page faults and
    work around copy failures by faulting after the copy in a slow path
    instead of before in a hot one.

    I have a little microbenchmark that does repeated, small writes to tmpfs.
    This patch speeds that micro up by 6.2%.

    === Long version ===

    When doing a sys_write() we have a source buffer in userspace and then a
    target file page.

    If both of those are the same physical page, there is a potential deadlock
    that we avoid. It would happen something like this:

    1. We start the write to the file
    2. Allocate page cache page and set it !Uptodate
    3. Touch the userspace buffer to copy in the user data
    4. Page fault (since source of the write not yet mapped)
    5. Page fault code tries to lock the page and deadlocks

    (more details on this below)

    To avoid this, we prefault the page to guarantee that this fault does not
    occur. But, this prefault comes at a cost. It is one of the most
    expensive things that we do in a hot write() path (especially if we
    compare it to the read path). It is working around a pretty rare case.

    To fix this, it's pretty simple. We move the "prefault" code to run after
    we attempt the copy. We explicitly disable page faults _during_ the copy,
    detect the copy failure, then execute the "prefault" outside of where the
    page lock needs to be held.

    iov_iter_copy_from_user_atomic() actually already has an implicit
    pagefault_disable() inside of it (at least on x86), but we add an explicit
    one. I don't think we can depend on every kmap_atomic() implementation to
    pagefault_disable() for eternity.
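
    In rough pseudocode, the reworked write loop looks like this (a sketch of
    the generic_perform_write() flow described above, not the literal diff):

        again:
                /* hot path: no prefault up front; faults are explicitly
                 * disabled for the duration of the copy */
                pagefault_disable();
                copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
                pagefault_enable();

                /* ... ->write_end(), which drops the page lock ... */

                if (unlikely(copied == 0)) {
                        /* slow path: the page lock is no longer held, so
                         * faulting the source buffer in cannot deadlock */
                        if (iov_iter_fault_in_readable(i, bytes))
                                return -EFAULT;
                        goto again;
                }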

    ===================================================

    The stack trace when this happens looks like this:

    wait_on_page_bit_killable+0xc0/0xd0
    __lock_page_or_retry+0x84/0xa0
    filemap_fault+0x1ed/0x3d0
    __do_fault+0x41/0xc0
    handle_mm_fault+0x9bb/0x1210
    __do_page_fault+0x17f/0x3d0
    do_page_fault+0xc/0x10
    page_fault+0x22/0x30
    generic_perform_write+0xca/0x1a0
    __generic_file_write_iter+0x190/0x1f0
    ext4_file_write_iter+0xe9/0x460
    __vfs_write+0xaa/0xe0
    vfs_write+0xa6/0x1a0
    SyS_write+0x46/0xa0
    entry_SYSCALL_64_fastpath+0x12/0x6a
    0xffffffffffffffff

    (Note, this does *NOT* happen in practice today because
    the kmap_atomic() does a pagefault_disable(). The trace
    above was obtained by taking out the pagefault_disable().)

    You can trigger the deadlock with this little code snippet:

    fd = open("foo", O_RDWR);
    fdmap = mmap(NULL, len, PROT_WRITE|PROT_READ, MAP_SHARED, fd, 0);
    write(fd, &fdmap[0], 1);

    Signed-off-by: Dave Hansen
    Cc: Al Viro
    Cc: Michal Hocko
    Cc: Jens Axboe
    Cc: Tejun Heo
    Cc: NeilBrown
    Cc: Matthew Wilcox
    Cc: Paul Cassella
    Cc: Greg Thelen
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • We want to know per-process workingset size for smart memory management
    in userland, and we use swap (e.g., zram) heavily to maximize memory
    efficiency, so the workingset includes swap as well as RSS.

    On such a system with lots of shared anonymous memory (e.g., Android),
    it's really hard to figure out exactly how much memory each process
    consumes (i.e., RSS + swap).

    This patch introduces a SwapPss field in /proc/<pid>/smaps so we can get
    a more exact workingset size per process.

    Bongkyu tested it. Result is below.

    1. 50M used swap
    SwapTotal: 461976 kB
    SwapFree: 411192 kB

    $ adb shell cat /proc/*/smaps | grep "SwapPss:" | awk '{sum += $2} END {print sum}';
    48236
    $ adb shell cat /proc/*/smaps | grep "Swap:" | awk '{sum += $2} END {print sum}';
    141184

    2. 240M used swap
    SwapTotal: 461976 kB
    SwapFree: 216808 kB

    $ adb shell cat /proc/*/smaps | grep "SwapPss:" | awk '{sum += $2} END {print sum}';
    230315
    $ adb shell cat /proc/*/smaps | grep "Swap:" | awk '{sum += $2} END {print sum}';
    1387744

    [akpm@linux-foundation.org: simplify kunmap_atomic() call]
    Signed-off-by: Minchan Kim
    Reported-by: Bongkyu Kim
    Tested-by: Bongkyu Kim
    Cc: Hugh Dickins
    Cc: Sergey Senozhatsky
    Cc: Jonathan Corbet
    Cc: Jerome Marchand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • memtest does not require these headers to be included.

    Signed-off-by: Vladimir Murzin
    Cc: Leon Romanovsky
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Murzin
     
  • - prefer pr_info(...) to printk(KERN_INFO ...)
    - use %pa for phys_addr_t
    - use cpu_to_be64 while printing the pattern in reserve_bad_mem()
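
    Taken together, the reporting helper might end up looking roughly like
    this (a sketch, not the exact hunk from the patch):

        static void __init reserve_bad_mem(u64 pattern, phys_addr_t start_bad,
                                           phys_addr_t end_bad)
        {
                pr_info("  %016llx bad mem addr %pa - %pa reserved\n",
                        cpu_to_be64(pattern), &start_bad, &end_bad);
                memblock_reserve(start_bad, end_bad - start_bad);
        }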

    Signed-off-by: Vladimir Murzin
    Cc: Leon Romanovsky
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Murzin
     
  • Since simple_strtoul is obsolete and memtest_pattern is of type int, use
    kstrtouint instead.
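
    A minimal sketch of the resulting boot-parameter parser (assuming the
    usual memtest= early_param plumbing; default-pattern handling elided):

        static unsigned int memtest_pattern __initdata;

        static int __init parse_memtest(char *arg)
        {
                if (!arg)
                        return 0;
                /* kstrtouint() rejects malformed input instead of silently
                 * accepting trailing junk like simple_strtoul() did. */
                return kstrtouint(arg, 0, &memtest_pattern);
        }
        early_param("memtest", parse_memtest);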

    Signed-off-by: Vladimir Murzin
    Cc: Leon Romanovsky
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Murzin
     
  • Notes about recent changes.

    [akpm@linux-foundation.org: various tweaks]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Mark Williamson
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • This patch sets bit 56 in pagemap if the page is mapped only once. This
    allows detecting exclusively used pages without exposing the PFN:

    present  file  exclusive  state
    0        0     0          non-present
    1        1     0          file page mapped somewhere else
    1        1     1          file page mapped only here
    1        0     0          anon non-CoWed page (shared with parent/child)
    1        0     1          anon CoWed page (or never forked)

    CoWed pages in (MAP_FILE | MAP_PRIVATE) areas are anon in this context.

    The mmap-exclusive bit doesn't reflect potential page-sharing via the
    swapcache: a page could be mapped once but have several swap ptes which
    point to it. An application can detect that via the swap bit in the
    pagemap entry and touch the pte through /proc/pid/mem to get the real
    information.

    See http://lkml.kernel.org/r/CAEVpBa+_RyACkhODZrRvQLs80iy0sqpdrd0AaP_-tgnX3Y9yNQ@mail.gmail.com

    Requested by Mark Williamson.
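
    From userspace the new bit can be read like any other pagemap flag; a
    hedged example (error handling trimmed):

        #include <fcntl.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <sys/types.h>
        #include <unistd.h>

        /* Returns 1 if vaddr in pid's address space is mapped exactly once. */
        static int page_is_exclusive(pid_t pid, unsigned long vaddr)
        {
                char path[64];
                uint64_t entry = 0;
                long psize = sysconf(_SC_PAGESIZE);
                int fd;

                snprintf(path, sizeof(path), "/proc/%d/pagemap", (int)pid);
                fd = open(path, O_RDONLY);
                if (fd < 0)
                        return -1;
                pread(fd, &entry, sizeof(entry), (vaddr / psize) * sizeof(entry));
                close(fd);
                return (int)((entry >> 56) & 1);  /* bit 56: mmap-exclusive */
        }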

    [akpm@linux-foundation.org: fix spello]
    Signed-off-by: Konstantin Khlebnikov
    Reviewed-by: Mark Williamson
    Tested-by: Mark Williamson
    Reviewed-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • This patch makes pagemap readable for normal users and hides physical
    addresses from them. For some use-cases PFN isn't required at all.

    See http://lkml.kernel.org/r/1425935472-17949-1-git-send-email-kirill@shutemov.name

    Fixes: ab676b7d6fbf ("pagemap: do not leak physical addresses to non-privileged userspace")
    Signed-off-by: Konstantin Khlebnikov
    Cc: Naoya Horiguchi
    Reviewed-by: Mark Williamson
    Tested-by: Mark Williamson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • This patch moves pmd dissection out of the reporting loop: huge pages are
    reported as a bunch of normal pages with contiguous PFNs.

    Also add the missing "FILE" bit for hugetlb vmas.

    Signed-off-by: Konstantin Khlebnikov
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Mark Williamson
    Tested-by: Mark Williamson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • This patch removes the page-shift bits (scheduled for removal since 3.11)
    and completes the migration to the new bit layout. It also cleans up a
    messy macro.

    Signed-off-by: Konstantin Khlebnikov
    Reviewed-by: Naoya Horiguchi
    Cc: Mark Williamson
    Tested-by: Mark Williamson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • This patchset makes pagemap usable again in a safe way (after the
    rowhammer bug it was made CAP_SYS_ADMIN-only). It restores access for
    non-privileged users but hides PFNs from them.

    It also adds a 'map-exclusive' bit which is set if the page is mapped only
    here: this helps in estimating the working set without exposing PFNs and
    allows distinguishing CoWed from non-CoWed private anonymous pages.

    The second patch removes the page-shift bits and completes the migration
    to the new pagemap format: the soft-dirty and mmap-exclusive flags are
    available only in the new format.

    This patch (of 5):

    This patch moves permission checks from pagemap_read() into pagemap_open().

    A pointer to the mm is saved in file->private_data. This reference pins
    only the mm_struct itself; /proc/*/mem, maps and smaps already work in the
    same way.

    See http://lkml.kernel.org/r/CA+55aFyKpWrt_Ajzh1rzp_GcwZ4=6Y=kOv8hBz172CFJp6L8Tg@mail.gmail.com
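
    A sketch of the open/release pair this describes (proc_mem_open() is the
    helper /proc/*/mem already uses; exact details may differ from the patch):

        static int pagemap_open(struct inode *inode, struct file *file)
        {
                struct mm_struct *mm;

                /* permission checks formerly done in pagemap_read() go here */
                mm = proc_mem_open(inode, PTRACE_MODE_READ);
                if (IS_ERR(mm))
                        return PTR_ERR(mm);
                file->private_data = mm;  /* pins only the mm_struct */
                return 0;
        }

        static int pagemap_release(struct inode *inode, struct file *file)
        {
                struct mm_struct *mm = file->private_data;

                if (mm)
                        mmdrop(mm);       /* drop the mm_struct pin */
                return 0;
        }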

    Signed-off-by: Konstantin Khlebnikov
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Mark Williamson
    Tested-by: Mark Williamson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • It has no callers.

    Signed-off-by: Vineet Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vineet Gupta
     
  • Each memblock_region has flags to indicate the type of the range. For the
    overlap case, memblock_add_range() inserts the lower part and leaves the
    upper part as indicated in the overlapped region.

    If the flags of the new range differ from those of the overlapped region,
    the recorded information is not correct.

    This patch adds a WARN_ON when the flags of the new range differ from
    those of the overlapped region.
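
    Roughly, the check sits where memblock_add_range() walks existing regions
    and finds the overlap (a sketch of the idea, not the exact hunk):

        /* @rgn overlaps the new range: only the non-overlapping lower part
         * is inserted, so the flags of the two had better agree. */
        WARN_ON(flags != rgn->flags);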

    Signed-off-by: Wei Yang
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • Commit febd5949e134 ("mm/memory hotplug: init the zone's size when
    calculating node totalpages") refines the function
    free_area_init_core().

    After doing so, these two parameters are not used anymore.

    This patch removes these two parameters.

    Signed-off-by: Wei Yang
    Cc: Gu Zheng
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • nr_node_ids records the highest possible node id, which is calculated by
    scanning the bitmap node_states[N_POSSIBLE]. The current implementation
    scans the bitmap from the beginning, which will scan the whole bitmap.

    This patch reverses the order by scanning from the end with
    find_last_bit().
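
    A sketch of the reversed scan (assuming the helper that computes
    nr_node_ids looks like the existing one in mm/page_alloc.c):

        static void __init setup_nr_node_ids(void)
        {
                unsigned int highest;

                /* highest set bit in the possible-node mask + 1 */
                highest = find_last_bit(node_possible_map.bits, MAX_NUMNODES);
                nr_node_ids = highest + 1;
        }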

    Signed-off-by: Wei Yang
    Cc: Tejun Heo
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • __dax_fault() takes i_mmap_lock for write. Let's pair it with a write
    unlock on the do_cow_fault() side.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • DAX is not so special: we need i_mmap_lock to protect mapping->i_mmap.

    __dax_pmd_fault() uses unmap_mapping_range() to shoot the zero page out of
    all mappings. We need to drop i_mmap_lock there to avoid a deadlock.

    Re-acquiring the lock should be fine since we check i_size after that
    point.

    Signed-off-by: Kirill A. Shutemov
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • I was basically open-coding it (thanks to copying code from do_fault()
    which probably also needs to be fixed).

    Signed-off-by: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • If the first access to a huge page was a store, there would be no existing
    zero pmd in this process's page tables. There could be a zero pmd in
    another process's page tables, if it had done a load. We can detect this
    case by noticing that the buffer_head returned from the filesystem is New,
    and ensure that other processes mapping this huge page have their page
    tables flushed.

    Signed-off-by: Matthew Wilcox
    Reported-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • This is another place where DAX assumed that pgtable_t was a pointer.
    Open code the important parts of set_huge_zero_page() in DAX and make
    set_huge_zero_page() static again.

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The original DAX code assumed that pgtable_t was a pointer, which isn't
    true on all architectures. Restructure the code to not rely on that
    assumption.

    [willy@linux.intel.com: further fixes integrated into this patch]
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The DAX code neglected to put the refcount on the huge zero page.
    Also we must notify on splits.

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • If two threads write-fault on the same hole at the same time, the winner
    of the race will return to userspace and complete their store, only to
    have the loser overwrite their store with zeroes. Fix this for now by
    taking the i_mmap_sem for write instead of read, and do so outside the
    call to get_block(). Now the loser of the race will see the block has
    already been zeroed, and will not zero it again.

    This severely limits our scalability. I have ideas for improving it, but
    those can wait for a later patch.

    Signed-off-by: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Jan Kara pointed out that in the case where we are writing to a hole, we
    can end up with a lock inversion between the page lock and the journal
    lock. We can avoid this by starting the transaction in ext4 before
    calling into DAX. The journal lock nests inside the superblock
    pagefault lock, so we have to duplicate that code from dax_fault, like
    XFS does.

    Signed-off-by: Matthew Wilcox
    Cc: Jan Kara
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • DAX wants different semantics from any currently-existing ext4 get_block
    callback. Unlike ext4_get_block_write(), it needs to honour the
    'create' flag, and unlike ext4_get_block(), it needs to be able to
    return unwritten extents. So introduce a new ext4_get_block_dax() which
    has those semantics.

    We could also change ext4_get_block_write() to honour the 'create' flag,
    but that might have consequences on other users that I do not currently
    understand.
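
    A hedged sketch of what such a callback could look like (the flag and
    helper names are assumptions about ext4 internals, not copied from the
    patch):

        int ext4_get_block_dax(struct inode *inode, sector_t iblock,
                               struct buffer_head *bh_result, int create)
        {
                /* ask for unwritten extents and, unlike ext4_get_block_write(),
                 * honour the 'create' flag */
                int flags = EXT4_GET_BLOCKS_PRE_IO | EXT4_GET_BLOCKS_UNWRIT_EXT;

                if (create)
                        flags |= EXT4_GET_BLOCKS_CREATE;
                return _ext4_get_block(inode, iblock, bh_result, flags);
        }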

    Signed-off-by: Matthew Wilcox
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Jan Kara pointed out I should be more explicit here about the perils of
    racing against truncate. The comment is mostly the same as for the PTE
    case.

    Signed-off-by: Matthew Wilcox
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • It would make more sense to have all the return values from
    vmf_insert_pfn_pmd() encoded in one place instead of having to follow
    the convention into insert_pfn(). Suggested by Jeff Moyer.

    Signed-off-by: Matthew Wilcox
    Cc: Jeff Moyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • DAX relies on the get_block function either zeroing newly allocated
    blocks before they're findable by subsequent calls to get_block, or
    marking newly allocated blocks as unwritten. ext4_get_block() cannot
    create unwritten extents, but ext4_get_block_write() can.

    Signed-off-by: Matthew Wilcox
    Reported-by: Andy Rudoff
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Fix typo s/CONFIG_TRANSPARENT_HUGEPAGES/CONFIG_TRANSPARENT_HUGEPAGE/ in
    #endif comment introduced by commit 2b26a9206d6a ("dax: add huge page
    fault support").

    Signed-off-by: Valentin Rothberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Valentin Rothberg
     
  • Use DAX to provide support for huge pages.

    Signed-off-by: Matthew Wilcox
    Cc: Hillf Danton
    Cc: "Kirill A. Shutemov"
    Cc: Theodore Ts'o
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Use DAX to provide support for huge pages.

    Signed-off-by: Matthew Wilcox
    Cc: Hillf Danton
    Cc: "Kirill A. Shutemov"
    Cc: Theodore Ts'o
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox