15 Aug, 2014

2 commits

  • Merge leftovers from Andrew Morton:
    "A few leftovers.

    I have a bunch of OCFS2 patches which are still out for review and
    which I might sneak along after -rc1. Partly my fault - I should send
    my review pokes out earlier"

    * emailed patches from Andrew Morton:
    mm: fix CROSS_MEMORY_ATTACH help text grammar
    drivers/mfd/rtsx_usb.c: export device table
    mm, hugetlb_cgroup: align hugetlb cgroup limit to hugepage size

    Linus Torvalds
     
  • Memcg aligns memory.limit_in_bytes to PAGE_SIZE as part of the resource
    counter since it makes no sense to allow a partial page to be charged.

    Because the hugetlb cgroup also uses the resource counter, its limit is
    likewise aligned to PAGE_SIZE, but that makes no sense unless it is
    aligned to the size of the hugepage being limited.

    Align hugetlb cgroup limit to hugepage size.

    Signed-off-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: "Aneesh Kumar K.V"
    Cc: Tejun Heo
    Cc: Li Zefan
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

12 Aug, 2014

2 commits

  • Pull vfs updates from Al Viro:
    "Stuff in here:

    - acct.c fixes and a general rework of the mnt_pin mechanism. That
    allows us to go for the delayed-mntput stuff, which will permit
    mntput() on a deep
    stack without worrying about stack overflows - fs shutdown will
    happen on shallow stack. IOW, we can do Eric's umount-on-rmdir
    series without introducing tons of stack overflows on new mntput()
    call chains it introduces.
    - Bruce's d_splice_alias() patches
    - more Miklos' rename() stuff.
    - a couple of regression fixes (stable fodder, in the end of branch)
    and a fix for API idiocy in iov_iter.c.

    There definitely will be another pile, maybe even two. I'd like to
    get Eric's series in this time, but even if we miss it, it'll go right
    in the beginning of for-next in the next cycle - the tricky part of
    prereqs is in this pile"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
    fix copy_tree() regression
    __generic_file_write_iter(): fix handling of sync error after DIO
    switch iov_iter_get_pages() to passing maximal number of pages
    fs: mark __d_obtain_alias static
    dcache: d_splice_alias should detect loops
    exportfs: update Exporting documentation
    dcache: d_find_alias needn't recheck IS_ROOT && DCACHE_DISCONNECTED
    dcache: remove unused d_find_alias parameter
    dcache: d_obtain_alias callers don't all want DISCONNECTED
    dcache: d_splice_alias should ignore DCACHE_DISCONNECTED
    dcache: d_splice_alias mustn't create directory aliases
    dcache: close d_move race in d_splice_alias
    dcache: move d_splice_alias
    namei: trivial fix to vfs_rename_dir comment
    VFS: allow ->d_manage() to declare -EISDIR in rcu_walk mode.
    cifs: support RENAME_NOREPLACE
    hostfs: support rename flags
    shmem: support RENAME_EXCHANGE
    shmem: support RENAME_NOREPLACE
    btrfs: add RENAME_NOREPLACE
    ...

    Linus Torvalds
     
  • If DIO results in a short write and the sync write fails, we want to bugger
    off whether the DIO part has written anything or not; the logic on return
    will take care of the right return value.

    Cc: stable@vger.kernel.org [3.16]
    Reported-by: Anton Altaparmakov
    Signed-off-by: Al Viro

    Al Viro
     

09 Aug, 2014

12 commits

  • If we set SEAL_WRITE on a file, we must make sure there cannot be any
    ongoing write-operations on the file. For write() calls, we simply lock
    the inode mutex; for mmap(), we simply verify there are no writable
    mappings. However, there might be pages pinned by AIO, Direct-IO and
    similar operations via GUP. We must make sure those do not write to the
    memfd file after we set SEAL_WRITE.

    As there is no way to notify GUP users to drop pages or to wait for them
    to be done, we implement the wait ourselves: when setting SEAL_WRITE, we
    check all pages for their ref-count. If it's bigger than 1, we know
    there's some user of the page. We then mark the page and wait for up to
    150ms for those ref-counts to be dropped. If the ref-counts are not
    dropped in time, we refuse the seal operation.

    Signed-off-by: David Herrmann
    Acked-by: Hugh Dickins
    Cc: Michael Kerrisk
    Cc: Ryan Lortie
    Cc: Lennart Poettering
    Cc: Daniel Mack
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Herrmann
     
  • memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
    that you can pass to mmap(). It can support sealing and avoids any
    connection to user-visible mount-points. Thus, it's not subject to quotas
    on mounted file-systems and can be used like malloc()'ed memory, but with
    a file-descriptor to it.

    memfd_create() returns the raw shmem file, so calls like ftruncate() can
    be used to modify the underlying inode. Also calls like fstat() will
    return proper information and mark the file as a regular file. If you want
    sealing, you can specify MFD_ALLOW_SEALING. Otherwise, sealing is not
    supported (like on all other regular files).

    Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
    subject to a filesystem size limit. It is still properly accounted to
    memcg limits, though, and to the same overcommit or no-overcommit
    accounting as all user memory.

    Signed-off-by: David Herrmann
    Acked-by: Hugh Dickins
    Cc: Michael Kerrisk
    Cc: Ryan Lortie
    Cc: Lennart Poettering
    Cc: Daniel Mack
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Herrmann
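
    A minimal userspace sketch of the memfd_create() flow described in the
    entry above (not part of the commit; it assumes kernel headers that define
    __NR_memfd_create, and the MFD_ALLOW_SEALING fallback mirrors the uapi
    definition in case the C library predates the wrapper):

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        #ifndef MFD_ALLOW_SEALING              /* from include/uapi/linux/memfd.h */
        #define MFD_ALLOW_SEALING 0x0002U
        #endif

        int main(void)
        {
                /* Anonymous, unlinked shmem file; sealing explicitly enabled. */
                int fd = syscall(__NR_memfd_create, "example", MFD_ALLOW_SEALING);
                if (fd < 0) {
                        perror("memfd_create");
                        return 1;
                }

                /* Regular-file calls work on it: set a size, then map it. */
                if (ftruncate(fd, 4096) < 0)
                        return 1;
                char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
                if (p == MAP_FAILED)
                        return 1;
                strcpy(p, "hello");
                /* The fd can be passed to another process like any other fd. */
                return 0;
        }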
     
  • If two processes share a common memory region, they usually want some
    guarantees to allow safe access. This often includes:
    - one side cannot overwrite data while the other reads it
    - one side cannot shrink the buffer while the other accesses it
    - one side cannot grow the buffer beyond previously set boundaries

    If there is a trust-relationship between both parties, there is no need
    for policy enforcement. However, if there's no trust relationship (e.g.,
    for general-purpose IPC) sharing memory-regions is highly fragile and
    often not possible without local copies. Look at the following two
    use-cases:

    1) A graphics client wants to share its rendering-buffer with a
    graphics-server. The memory-region is allocated by the client for
    read/write access and a second FD is passed to the server. While
    scanning out from the memory region, the server has no guarantee that
    the client doesn't shrink the buffer at any time, requiring rather
    cumbersome SIGBUS handling.
    2) A process wants to perform an RPC on another process. To avoid huge
    bandwidth consumption, zero-copy is preferred. After a message is
    assembled in-memory and a FD is passed to the remote side, both sides
    want to be sure that neither side modifies this shared copy anymore. The
    source may have put sensitive data into the message without a separate
    copy and the target may want to parse the message inline, to avoid a
    local copy.

    While SIGBUS handling, POSIX mandatory locking and MAP_DENYWRITE provide
    ways to achieve most of this, the first one is disproportionately ugly to
    use in libraries and the latter two are broken/racy or even disabled due
    to denial of service attacks.

    This patch introduces the concept of SEALING. If you seal a file, a
    specific set of operations is blocked on that file forever. Unlike locks,
    seals can only be set, never removed. Hence, once you have verified that a
    specific set of seals is set, you're guaranteed that no one can perform the
    blocked operations on this file anymore.

    An initial set of SEALS is introduced by this patch:
    - SHRINK: If SEAL_SHRINK is set, the file in question cannot be reduced
    in size. This affects ftruncate() and open(O_TRUNC).
    - GROW: If SEAL_GROW is set, the file in question cannot be increased
    in size. This affects ftruncate(), fallocate() and write().
    - WRITE: If SEAL_WRITE is set, no write operations (besides resizing)
    are possible. This affects fallocate(PUNCH_HOLE), mmap() and
    write().
    - SEAL: If SEAL_SEAL is set, no further seals can be added to a file.
    This basically prevents the F_ADD_SEAL operation on a file and
    can be set to prevent others from adding further seals that you
    don't want.

    The described use-cases can easily use these seals to provide safe use
    without any trust-relationship:

    1) The graphics server can verify that a passed file-descriptor has
    SEAL_SHRINK set. This allows safe scanout, while the client is
    allowed to increase buffer size for window-resizing on-the-fly.
    Concurrent writes are explicitly allowed.
    2) For general-purpose IPC, both processes can verify that SEAL_SHRINK,
    SEAL_GROW and SEAL_WRITE are set. This guarantees that neither
    process can modify the data while the other side parses it.
    Furthermore, it guarantees that even with writable FDs passed to the
    peer, it cannot increase the size to hit memory-limits of the source
    process (in case the file-storage is accounted to the source).

    The new API is an extension to fcntl(), adding two new commands:
    F_GET_SEALS: Return a bitset describing the seals on the file. This
    can be called on any FD if the underlying file supports
    sealing.
    F_ADD_SEALS: Change the seals of a given file. This requires WRITE
    access to the file and F_SEAL_SEAL may not already be set.
    Furthermore, the underlying file must support sealing and
    there may not be any existing shared mapping of that file.
    Otherwise, EBADF/EPERM is returned.
    The given seals are _added_ to the existing set of seals
    on the file. You cannot remove seals again.

    The fcntl() handler is currently specific to shmem and disabled on all
    files. A file needs to explicitly support sealing for this interface to
    work. A separate syscall is added in a follow-up, which creates files that
    support sealing. There is no intention to support this on other
    file-systems. Semantics are unclear for non-volatile files and we lack any
    use-case right now. Therefore, the implementation is specific to shmem.

    Signed-off-by: David Herrmann
    Acked-by: Hugh Dickins
    Cc: Michael Kerrisk
    Cc: Ryan Lortie
    Cc: Lennart Poettering
    Cc: Daniel Mack
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Herrmann
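
    A userspace sketch of the two fcntl() commands described in the entry
    above, seen from both ends of an IPC exchange (not from the commit; the
    numeric fallbacks mirror the uapi <linux/fcntl.h> values in case the libc
    headers lack them):

        #define _GNU_SOURCE
        #include <fcntl.h>

        #ifndef F_ADD_SEALS                    /* fallbacks mirroring the uapi values */
        #define F_ADD_SEALS    1033
        #define F_GET_SEALS    1034
        #define F_SEAL_SEAL    0x0001
        #define F_SEAL_SHRINK  0x0002
        #define F_SEAL_GROW    0x0004
        #define F_SEAL_WRITE   0x0008
        #endif

        /* Sender: freeze size and contents, then forbid adding further seals.
         * Fails with EBUSY while a writable mapping of the file still exists. */
        static int seal_message(int memfd)
        {
                return fcntl(memfd, F_ADD_SEALS,
                             F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE | F_SEAL_SEAL);
        }

        /* Receiver: only parse the message in place if the sender can no
         * longer shrink, grow or rewrite it. */
        static int message_is_safe(int memfd)
        {
                int want = F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE;
                int seals = fcntl(memfd, F_GET_SEALS);

                return seals >= 0 && (seals & want) == want;
        }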
     
  • This patch (of 6):

    The i_mmap_writable field counts existing writable mappings of an
    address_space. To allow drivers to prevent new writable mappings, make
    this counter signed and prevent new writable mappings if it is negative.
    This is modelled after i_writecount and DENYWRITE.

    This will be required by the shmem-sealing infrastructure to prevent any
    new writable mappings after the WRITE seal has been set. In case there
    exists a writable mapping, this operation will fail with EBUSY.

    Note that we rely on the fact that iff you already own a writable mapping,
    you can increase the counter without using the helpers. This is the same
    that we do for i_writecount.

    Signed-off-by: David Herrmann
    Acked-by: Hugh Dickins
    Cc: Michael Kerrisk
    Cc: Ryan Lortie
    Cc: Lennart Poettering
    Cc: Daniel Mack
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Herrmann
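
    The counter pattern described above, sketched with C11 atomics rather than
    the kernel's atomic_t (the function names here are illustrative, not the
    real helpers): a positive count means writable mappings exist, a negative
    count means new writable mappings are denied.

        #include <stdatomic.h>
        #include <stdbool.h>

        static atomic_int i_mmap_writable;      /* stand-in for the address_space field */

        /* mmap(PROT_WRITE, MAP_SHARED) path: fail if writes have been denied. */
        static bool map_writable(void)
        {
                int v = atomic_load(&i_mmap_writable);
                while (v >= 0)
                        if (atomic_compare_exchange_weak(&i_mmap_writable, &v, v + 1))
                                return true;    /* counted one more writable mapping */
                return false;                   /* counter is negative: denied */
        }

        /* SEAL_WRITE path: fail (EBUSY) if a writable mapping still exists. */
        static bool deny_writable(void)
        {
                int v = atomic_load(&i_mmap_writable);
                while (v <= 0)
                        if (atomic_compare_exchange_weak(&i_mmap_writable, &v, v - 1))
                                return true;    /* writes are now denied */
                return false;                   /* counter is positive: EBUSY */
        }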
     
  • The core mm code will provide a default gate area based on
    FIXADDR_USER_START and FIXADDR_USER_END if
    !defined(__HAVE_ARCH_GATE_AREA) && defined(AT_SYSINFO_EHDR).

    This default is only useful for ia64. arm64, ppc, s390, sh, tile, 64-bit
    UML, and x86_32 have their own code just to disable it. arm, 32-bit UML,
    and x86_64 have gate areas, but they have their own implementations.

    This gets rid of the default and moves the code into ia64.

    This should save some code on architectures without a gate area: it's now
    possible to inline the gate_area functions in the default case.

    Signed-off-by: Andy Lutomirski
    Acked-by: Nathan Lynch
    Acked-by: H. Peter Anvin
    Acked-by: Benjamin Herrenschmidt [in principle]
    Acked-by: Richard Weinberger [for um]
    Acked-by: Will Deacon [for arm64]
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Chris Metcalf
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Nathan Lynch
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • zswap_entry_cache_destroy() is only called by __init init_zswap().

    This patch also fixes the function name spelling: s/destory/destroy/ in
    zswap_entry_cache_destroy().

    Signed-off-by: Fabian Frederick
    Acked-by: Seth Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • Charge migration currently disables IRQs twice to update the charge
    statistics for the old page and then again for the new page.

    But migration is a seamless transition of a charge from one physical
    page to another one of the same size, so this should be a non-event from
    an accounting point of view. Leave the statistics alone.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Pages are now uncharged at release time, and all sources of batched
    uncharges operate on lists of pages. Directly use those lists, and
    get rid of the per-task batching state.

    This also batches statistics accounting, in addition to the res
    counter charges, to reduce IRQ-disabling and re-enabling.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Cc: Naoya Horiguchi
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The memcg uncharging code that is involved towards the end of a page's
    lifetime - truncation, reclaim, swapout, migration - is impressively
    complicated and fragile.

    Because anonymous and file pages were always charged before they had their
    page->mapping established, uncharges had to happen when the page type
    could still be known from the context; as in unmap for anonymous, page
    cache removal for file and shmem pages, and swap cache truncation for swap
    pages. However, these operations happen well before the page is actually
    freed, and so a lot of synchronization is necessary:

    - Charging, uncharging, page migration, and charge migration all need
    to take a per-page bit spinlock as they could race with uncharging.

    - Swap cache truncation happens during both swap-in and swap-out, and
    possibly repeatedly before the page is actually freed. This means
    that the memcg swapout code is called from many contexts that make
    no sense and it has to figure out the direction from page state to
    make sure memory and memory+swap are always correctly charged.

    - On page migration, the old page might be unmapped but then reused,
    so memcg code has to prevent untimely uncharging in that case.
    Because this code - which should be a simple charge transfer - is so
    special-cased, it is not reusable for replace_page_cache().

    But now that charged pages always have a page->mapping, introduce
    mem_cgroup_uncharge(), which is called after the final put_page(), when we
    know for sure that nobody is looking at the page anymore.

    For page migration, introduce mem_cgroup_migrate(), which is called after
    the migration is successful and the new page is fully rmapped. Because
    the old page is no longer uncharged after migration, prevent double
    charges by decoupling the page's memcg association (PCG_USED and
    pc->mem_cgroup) from the page holding an actual charge. The new bits
    PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
    to the new page during migration.

    mem_cgroup_migrate() is suitable for replace_page_cache() as well,
    which gets rid of mem_cgroup_replace_page_cache(). However, care
    needs to be taken because both the source and the target page can
    already be charged and on the LRU when fuse is splicing: grab the page
    lock on the charge moving side to prevent changing pc->mem_cgroup of a
    page under migration. Also, the lruvecs of both pages change as we
    uncharge the old and charge the new during migration, and putback may
    race with us, so grab the lru lock and isolate the pages iff on LRU to
    prevent races and ensure the pages are on the right lruvec afterward.

    Swap accounting is massively simplified: because the page is no longer
    uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
    transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
    before the final put_page() in page reclaim.

    Finally, page_cgroup changes are now protected by whatever protection the
    page itself offers: anonymous pages are charged under the page table lock,
    whereas page cache insertions, swapin, and migration hold the page lock.
    Uncharging happens under full exclusion with no outstanding references.
    Charging and uncharging also ensure that the page is off-LRU, which
    serializes against charge migration. Remove the very costly page_cgroup
    lock and set pc->flags non-atomically.

    [mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable]
    [vdavydov@parallels.com: fix flags definition]
    Signed-off-by: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Tested-by: Jet Chen
    Acked-by: Michal Hocko
    Tested-by: Felipe Balbi
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • These patches rework memcg charge lifetime to integrate more naturally
    with the lifetime of user pages. This drastically simplifies the code and
    reduces charging and uncharging overhead. The most expensive part of
    charging and uncharging is the page_cgroup bit spinlock, which is removed
    entirely after this series.

    Here are the top-10 profile entries of a stress test that reads a 128G
    sparse file on a freshly booted box, without even a dedicated cgroup (i.e.
    executing in the root memcg). Before:

    15.36% cat [kernel.kallsyms] [k] copy_user_generic_string
    13.31% cat [kernel.kallsyms] [k] memset
    11.48% cat [kernel.kallsyms] [k] do_mpage_readpage
    4.23% cat [kernel.kallsyms] [k] get_page_from_freelist
    2.38% cat [kernel.kallsyms] [k] put_page
    2.32% cat [kernel.kallsyms] [k] __mem_cgroup_commit_charge
    2.18% kswapd0 [kernel.kallsyms] [k] __mem_cgroup_uncharge_common
    1.92% kswapd0 [kernel.kallsyms] [k] shrink_page_list
    1.86% cat [kernel.kallsyms] [k] __radix_tree_lookup
    1.62% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn

    After:

    15.67% cat [kernel.kallsyms] [k] copy_user_generic_string
    13.48% cat [kernel.kallsyms] [k] memset
    11.42% cat [kernel.kallsyms] [k] do_mpage_readpage
    3.98% cat [kernel.kallsyms] [k] get_page_from_freelist
    2.46% cat [kernel.kallsyms] [k] put_page
    2.13% kswapd0 [kernel.kallsyms] [k] shrink_page_list
    1.88% cat [kernel.kallsyms] [k] __radix_tree_lookup
    1.67% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn
    1.39% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk
    1.30% cat [kernel.kallsyms] [k] kfree

    As you can see, the memcg footprint has shrunk quite a bit.

    text data bss dec hex filename
    37970 9892 400 48262 bc86 mm/memcontrol.o.old
    35239 9892 400 45531 b1db mm/memcontrol.o

    This patch (of 4):

    The memcg charge API charges pages before they are rmapped - i.e. have an
    actual "type" - and so every callsite needs its own set of charge and
    uncharge functions to know what type is being operated on. Worse,
    uncharge has to happen from a context that is still type-specific, rather
    than at the end of the page's lifetime with exclusive access, and so
    requires a lot of synchronization.

    Rewrite the charge API to provide a generic set of try_charge(),
    commit_charge() and cancel_charge() transaction operations, much like
    what's currently done for swap-in:

    mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
    pages from the memcg if necessary.

    mem_cgroup_commit_charge() commits the page to the charge once it
    has a valid page->mapping and PageAnon() reliably tells the type.

    mem_cgroup_cancel_charge() aborts the transaction.

    This reduces the charge API and enables subsequent patches to
    drastically simplify uncharging.

    As pages need to be committed after rmap is established but before they
    are added to the LRU, page_add_new_anon_rmap() must stop doing LRU
    additions again. Revive lru_cache_add_active_or_unevictable().

    [hughd@google.com: fix shmem_unuse]
    [hughd@google.com: Add comments on the private use of -EAGAIN]
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Signed-off-by: Hugh Dickins
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
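
    A kernel-side sketch of the call-site pattern for the three operations
    named above, roughly what an anonymous-fault path looks like after this
    series (the exact argument lists, e.g. the lrucare flag, are simplified
    assumptions):

        static int charge_new_anon_page(struct page *page, struct mm_struct *mm,
                                        struct vm_area_struct *vma, unsigned long addr)
        {
                struct mem_cgroup *memcg;

                /* 1) Reserve a charge; may reclaim from the memcg. */
                if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg))
                        return -ENOMEM;

                /* 2) Give the page its type (page->mapping via rmap), then
                 *    commit the reserved charge and put the page on the LRU. */
                page_add_new_anon_rmap(page, vma, addr);
                mem_cgroup_commit_charge(page, memcg, false);
                lru_cache_add_active_or_unevictable(page, vma);
                return 0;

                /* Any failure between try and commit would instead call
                 * mem_cgroup_cancel_charge(page, memcg) to abort the transaction. */
        }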
     
  • Aleksei hit a soft lockup while reading /proc/PID/smaps. David
    investigated the problem and suggested the right fix.

    while_each_thread() is racy and should die; this patch updates
    vm_is_stack().

    Signed-off-by: Oleg Nesterov
    Reported-by: Aleksei Besogonov
    Tested-by: Aleksei Besogonov
    Suggested-by: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • This reverts commit a640616822b2 ("slab: remove BAD_ALIEN_MAGIC").

    commit a640616822b2 ("slab: remove BAD_ALIEN_MAGIC") assumes that the
    system with !CONFIG_NUMA has only one memory node. But this turns out to
    be false, according to a report from Geert. His system, m68k, has many
    memory nodes and is configured with !CONFIG_NUMA, so it couldn't boot with
    the above change.

    Here is his failure report.

    With latest mainline, I'm getting a crash during bootup on m68k/ARAnyM:

    enable_cpucache failed for radix_tree_node, error 12.
    kernel BUG at /scratch/geert/linux/linux-m68k/mm/slab.c:1522!
    *** TRAP #7 *** FORMAT=0
    Current process id is 0
    BAD KERNEL TRAP: 00000000
    Modules linked in:
    PC: [] kmem_cache_init_late+0x70/0x8c
    SR: 2200 SP: 00345f90 a2: 0034c2e8
    d0: 0000003d d1: 00000000 d2: 00000000 d3: 003ac942
    d4: 00000000 d5: 00000000 a0: 0034f686 a1: 0034f682
    Process swapper (pid: 0, task=0034c2e8)
    Frame format=0
    Stack from 00345fc4:
    002f69ef 002ff7e5 000005f2 000360fa 0017d806 003921d4 00000000
    00000000 00000000 00000000 00000000 00000000 003ac942 00000000
    003912d6
    Call Trace: [] parse_args+0x0/0x2ca
    [] strlen+0x0/0x1a
    [] start_kernel+0x23c/0x428
    [] _sinittext+0x2d6/0x95e

    Code: f7e5 4879 002f 69ef 61ff ffca 462a 4e47 0035 4b1c 61ff
    fff0 0cc4 7005 23c0 0037 fd20 588f 265f 285f 4e75 48e7 301c
    Disabling lock debugging due to kernel taint
    Kernel panic - not syncing: Attempted to kill the idle task!

    Although there is an alternative way to fix this issue, such as disabling
    the use of the alien cache for !CONFIG_NUMA, reverting the offending commit
    seems better for now.

    Signed-off-by: Joonsoo Kim
    Reported-by: Geert Uytterhoeven
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

08 Aug, 2014

3 commits


07 Aug, 2014

21 commits

  • Change zswap to use the zpool api instead of directly using zbud. Add a
    boot-time param to allow selecting which zpool implementation to use,
    with zbud as the default.

    Signed-off-by: Dan Streetman
    Tested-by: Seth Jennings
    Cc: Weijie Yang
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • Update zbud and zsmalloc to implement the zpool api.

    [fengguang.wu@intel.com: make functions static]
    Signed-off-by: Dan Streetman
    Tested-by: Seth Jennings
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Weijie Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • Add zpool api.

    zpool provides an interface for memory storage, typically of compressed
    memory. Users can select what backend to use; currently the only
    implementations are zbud, a low density implementation with up to two
    compressed pages per storage page, and zsmalloc, a higher density
    implementation with multiple compressed pages per storage page.

    Signed-off-by: Dan Streetman
    Tested-by: Seth Jennings
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Weijie Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
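
    A kernel-side sketch of how a zpool user might store one compressed object
    (illustrative only; the real prototypes live in include/linux/zpool.h and
    details such as the ops argument here are assumptions):

        static int zpool_demo(void)
        {
                unsigned long handle;
                struct zpool *pool;
                void *buf;

                /* Pick a backend by name: "zbud" or "zsmalloc". */
                pool = zpool_create_pool("zbud", GFP_KERNEL, NULL);
                if (!pool)
                        return -ENOMEM;

                if (zpool_malloc(pool, 128, GFP_KERNEL, &handle) == 0) {
                        buf = zpool_map_handle(pool, handle, ZPOOL_MM_RW);
                        memset(buf, 0, 128);    /* compressed data would go here */
                        zpool_unmap_handle(pool, handle);
                        zpool_free(pool, handle);
                }

                zpool_destroy_pool(pool);
                return 0;
        }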
     
  • Change the type of the zbud_alloc() size param from unsigned int to
    size_t.

    Technically, this should not make any difference, as the zbud
    implementation already restricts the size to well within either type's
    limits; but as zsmalloc (and kmalloc) use size_t, and zpool will use
    size_t, this brings the size parameter type in line with zsmalloc/zpool.

    Signed-off-by: Dan Streetman
    Acked-by: Seth Jennings
    Tested-by: Seth Jennings
    Cc: Weijie Yang
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • User-visible effect:
    Architectures that choose this method of maintaining cache coherency
    (MIPS and xtensa currently) are able to use high memory on cores with
    aliasing data cache. Without this fix such architectures cannot use
    high memory (in case of xtensa it means that at most 128 MBytes of
    physical memory is available).

    The problem:
    VIPT cache with way size larger than MMU page size may suffer from
    aliasing problem: a single physical address accessed via different
    virtual addresses may end up in multiple locations in the cache.
    Virtual mappings of a physical address that always get cached in
    different cache locations are said to have different colors. L1 caching
    hardware usually doesn't handle this situation, leaving it up to
    software. Software must avoid this situation as it leads to data
    corruption.

    What can be done:
    One way to handle this is to flush and invalidate the data cache every time
    a page mapping changes color. The other way is to always map a physical
    page at a virtual address with the same color. Low-memory pages already
    have this property. Giving the architecture a way to control the color of
    high-memory page mappings allows reuse of the existing low-memory cache
    alias handling code.

    How this is done with this patch:
    Provide hooks that allow architectures with aliasing caches to align the
    mapping address of high pages according to their color. Such
    architectures may enforce similar coloring of low- and high-memory page
    mappings and reuse existing cache management functions to support
    highmem.

    This code is based on the implementation of similar feature for MIPS by
    Leonid Yegoshin.

    Signed-off-by: Max Filippov
    Cc: Leonid Yegoshin
    Cc: Chris Zankel
    Cc: Marc Gauthier
    Cc: David Rientjes
    Cc: Steven Hill
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Max Filippov
     
  • When kernel device drivers or subsystems want to bind their lifespan to
    the lifespan of the mm_struct, they usually use one of the following
    methods:

    1. Manually calling a function in the interested kernel module. The
    function call needs to be placed in mmput. This method was rejected by
    several kernel maintainers.

    2. Registering to the mmu notifier release mechanism.

    The problem with the latter approach is that the mmu_notifier_release
    callback is called from __mmu_notifier_release (called from exit_mmap).
    That function iterates over the list of mmu notifiers and doesn't expect
    the release callback function to remove itself from the list. Therefore,
    the callback function in the kernel module can't release the
    mmu_notifier object, which is actually the kernel module's object
    itself. As a result, the destruction of the kernel module's object must
    be done in a delayed fashion.

    This patch adds support for this delayed callback by adding a new
    mmu_notifier_call_srcu function that receives a function pointer and
    calls that function with call_srcu. In that function, the kernel module
    releases its object. To use mmu_notifier_call_srcu, the calling module
    needs to call, before that, a new function called
    mmu_notifier_unregister_no_release which, as its name implies,
    unregisters a notifier without calling its notifier release callback.

    This patch also adds a function that will call barrier_srcu so those
    kernel modules can sync with mmu_notifier.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Oded Gabbay
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
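
    A driver-side sketch of the teardown pattern described above, using the
    two helpers named in the commit (the surrounding structure and field
    names are illustrative):

        struct my_ctx {
                struct mmu_notifier mn;
                struct rcu_head rcu;
                /* ... driver state bound to the mm ... */
        };

        static void my_ctx_free(struct rcu_head *rcu)
        {
                kfree(container_of(rcu, struct my_ctx, rcu));
        }

        static void my_ctx_teardown(struct my_ctx *ctx, struct mm_struct *mm)
        {
                /* Unhook from the mm without invoking our own ->release()
                 * callback, then defer the actual free until the SRCU grace
                 * period has elapsed, since the notifier core may still be
                 * walking the notifier list. */
                mmu_notifier_unregister_no_release(&ctx->mn, mm);
                mmu_notifier_call_srcu(&ctx->rcu, my_ctx_free);
        }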
     
  • Charge reclaim and OOM currently use the charge batch variable, but
    batching is already disabled at that point. To simplify the charge
    logic, the batch variable is reset to the original request size when
    reclaim is entered, so it's functionally equivalent, but misleading.

    Switch reclaim/OOM to nr_pages, which is the original request size.

    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • This patch changes confusing #ifdef use in __access_remote_vm into
    merely ugly #ifdef use.

    Addresses bug https://bugzilla.kernel.org/show_bug.cgi?id=81651

    Signed-off-by: Rik van Riel
    Reported-by: David Binderman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • fault_around_bytes can only be changed via debugfs. Let's mark it
    read-mostly.

    Signed-off-by: Kirill A. Shutemov
    Suggested-by: David Rientjes
    Acked-by: David Rientjes
    Cc: Dave Hansen
    Cc: Andrey Ryabinin
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Things can go wrong if fault_around_bytes is changed under
    do_fault_around(): between fault_around_mask() and fault_around_pages().

    Let's read fault_around_bytes only once during do_fault_around() and
    calculate mask based on the reading.

    Note: fault_around_bytes can only be updated via the debugfs interface.
    Also, I've tried but was not able to trigger bad behaviour without the
    patch, so I would not consider this patch urgent.

    Signed-off-by: Kirill A. Shutemov
    Cc: Dave Hansen
    Cc: Andrey Ryabinin
    Cc: Sasha Levin
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
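
    The fix boils down to the snapshot pattern below, shown as plain C rather
    than the actual mm/memory.c hunk: take a single read of the tunable and
    derive every value from that one snapshot, so a concurrent debugfs update
    cannot hand out a mismatched mask/size pair.

        #include <stdatomic.h>
        #include <stddef.h>

        static _Atomic size_t fault_around_bytes = 65536;   /* debugfs-tunable stand-in */

        static void do_fault_around_sketch(size_t page_size)
        {
                size_t bytes = atomic_load(&fault_around_bytes);  /* read once */
                size_t mask  = ~(bytes - 1) & ~(page_size - 1);   /* align down */
                size_t pages = bytes / page_size;

                /* mask and pages now describe the same fault-around window,
                 * whatever a concurrent write to the tunable does. */
                (void)mask;
                (void)pages;
        }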
     
  • When memory cgroups are enabled, the code that decides whether to force a
    scan of anonymous pages in get_scan_count() compares global values (free,
    high_watermark) to a value that is restricted to a memory cgroup (file).
    This makes the code over-eager to force an anon scan.

    For instance, it will force an anon scan when scanning a memcg that is
    mainly populated by anonymous pages, even when there are plenty of file
    pages to get rid of in other memcgs, and even when swappiness == 0. It
    breaks users' expectations about swappiness and hurts performance.

    This patch makes sure that a forced anon scan only happens when there are
    not enough file pages in the whole zone, not just in one random memcg.

    [hannes@cmpxchg.org: cleanups]
    Signed-off-by: Jerome Marchand
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     
  • Quite a while ago, get_scan_ratio() was renamed get_scan_count(); however,
    a comment in shrink_active_list() still mentions it. This patch fixes the
    outdated comment.

    Signed-off-by: Jerome Marchand
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     
  • The oom killer scans each process and determines whether it is eligible
    for oom kill or whether the oom killer should abort because of
    concurrent memory freeing. It will abort when an eligible process is
    found to have TIF_MEMDIE set, meaning it has already been oom killed and
    we're waiting for it to exit.

    Processes with task->mm == NULL should not be considered because they
    are either kthreads or have already detached their memory and killing
    them would not lead to memory freeing. That memory is only freed after
    exit_mm() has returned, however, and not when task->mm is first set to
    NULL.

    Clear TIF_MEMDIE after exit_mm()'s mmput() so that an oom killed process
    is no longer considered for oom kill, but only until exit_mm() has
    returned. This was fragile in the past because it relied on
    exit_notify() to be reached before no longer considering TIF_MEMDIE
    processes.

    Signed-off-by: David Rientjes
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • It is possible for some platforms, such as powerpc, to set HPAGE_SHIFT to
    0 to indicate that huge pages are not supported.

    When this is the case, hugetlbfs can be disabled at boot time:
    hugetlbfs: disabling because there are no supported hugepage sizes

    Then in dissolve_free_huge_pages(), order is kept at its maximum (64 on
    64-bit), and the for loop below won't end: for (pfn = start_pfn; pfn <
    end_pfn; pfn += 1 << order)

    As suggested by Naoya, the fix below checks hugepages_supported() before
    calling dissolve_free_huge_pages().

    [rientjes@google.com: no legitimate reason to call dissolve_free_huge_pages() when !hugepages_supported()]
    Signed-off-by: Li Zhong
    Acked-by: Naoya Horiguchi
    Acked-by: David Rientjes
    Signed-off-by: David Rientjes
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zhong
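
    The shape of the fix as described above, at the call site (a sketch, not
    the exact hunk):

        /* Skip the dissolve loop entirely when huge pages are unsupported,
         * so the "pfn += 1 << order" stride is never used with a bogus order. */
        if (hugepages_supported())
                dissolve_free_huge_pages(start_pfn, end_pfn);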
     
  • __GFP_NO_KSWAPD, once the way to determine if an allocation was for thp
    or not, has gained more users. Their use is not necessarily wrong; they
    are trying to do a memory allocation that can easily fail without
    disturbing kswapd, so the bit has gained additional use cases.

    This restructures the check to determine whether MIGRATE_SYNC_LIGHT
    should be used for memory compaction in the page allocator. Rather than
    testing solely for __GFP_NO_KSWAPD, test for all bits that must be set
    for thp allocations.

    This also moves the check to be done only after the page allocator is
    aborted for deferred or contended memory compaction since setting
    migration_mode for this case is pointless.

    Signed-off-by: David Rientjes
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • try_set_zonelist_oom() and clear_zonelist_oom() are not named properly
    to imply that they require locking semantics to avoid out_of_memory()
    being reordered.

    zone_scan_lock is required for both functions to ensure that there is
    proper locking synchronization.

    Rename try_set_zonelist_oom() to oom_zonelist_trylock() and rename
    clear_zonelist_oom() to oom_zonelist_unlock() to imply there is proper
    locking semantics.

    At the same time, convert oom_zonelist_trylock() to return bool instead
    of int since only success and failure are tested.

    Signed-off-by: David Rientjes
    Cc: "Kirill A. Shutemov"
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • With memoryless node support being worked on, it's possible that, as an
    optimization, a node may not have a non-NULL zonelist. When CONFIG_NUMA
    is enabled and node 0 is memoryless, this means the zonelist for
    first_online_node may become NULL.

    The oom killer requires a zonelist that includes all memory zones for
    the sysrq trigger and pagefault out of memory handler.

    Ensure that a non-NULL zonelist is always passed to the oom killer.

    [akpm@linux-foundation.org: fix non-numa build]
    Signed-off-by: David Rientjes
    Cc: "Kirill A. Shutemov"
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • This series of patches fixes a problem that occurs when adding memory in a
    bad manner. For example: for an x86_64 machine booted with "mem=400M" and
    with 2GiB of memory installed, the following commands cause a problem:

    # echo 0x40000000 > /sys/devices/system/memory/probe
    [ 28.613895] init_memory_mapping: [mem 0x40000000-0x47ffffff]
    # echo 0x48000000 > /sys/devices/system/memory/probe
    [ 28.693675] init_memory_mapping: [mem 0x48000000-0x4fffffff]
    # echo online_movable > /sys/devices/system/memory/memory9/state
    # echo 0x50000000 > /sys/devices/system/memory/probe
    [ 29.084090] init_memory_mapping: [mem 0x50000000-0x57ffffff]
    # echo 0x58000000 > /sys/devices/system/memory/probe
    [ 29.151880] init_memory_mapping: [mem 0x58000000-0x5fffffff]
    # echo online_movable > /sys/devices/system/memory/memory11/state
    # echo online> /sys/devices/system/memory/memory8/state
    # echo online> /sys/devices/system/memory/memory10/state
    # echo offline> /sys/devices/system/memory/memory9/state
    [ 30.558819] Offlined Pages 32768
    # free
    total used free shared buffers cached
    Mem: 780588 18014398509432020 830552 0 0 51180
    -/+ buffers/cache: 18014398509380840 881732
    Swap: 0 0 0

    This is because the above commands probe higher memory after onlining a
    section with online_movable, which causes ZONE_HIGHMEM (or ZONE_NORMAL
    for systems without ZONE_HIGHMEM) to overlap ZONE_MOVABLE.

    After the second online_movable, the problem can be observed from
    zoneinfo:

    # cat /proc/zoneinfo
    ...
    Node 0, zone Movable
    pages free 65491
    min 250
    low 312
    high 375
    scanned 0
    spanned 18446744073709518848
    present 65536
    managed 65536
    ...

    This series of patches solves the problem by checking ZONE_MOVABLE when
    choosing a zone for new memory. If the new memory is inside or higher than
    ZONE_MOVABLE, make it go there instead.

    After applying this series of patches, following are free and zoneinfo
    result (after offlining memory9):

    bash-4.2# free
    total used free shared buffers cached
    Mem: 780956 80112 700844 0 0 51180
    -/+ buffers/cache: 28932 752024
    Swap: 0 0 0

    bash-4.2# cat /proc/zoneinfo

    Node 0, zone DMA
    pages free 3389
    min 14
    low 17
    high 21
    scanned 0
    spanned 4095
    present 3998
    managed 3977
    nr_free_pages 3389
    ...
    start_pfn: 1
    inactive_ratio: 1
    Node 0, zone DMA32
    pages free 73724
    min 341
    low 426
    high 511
    scanned 0
    spanned 98304
    present 98304
    managed 92958
    nr_free_pages 73724
    ...
    start_pfn: 4096
    inactive_ratio: 1
    Node 0, zone Normal
    pages free 32630
    min 120
    low 150
    high 180
    scanned 0
    spanned 32768
    present 32768
    managed 32768
    nr_free_pages 32630
    ...
    start_pfn: 262144
    inactive_ratio: 1
    Node 0, zone Movable
    pages free 65476
    min 241
    low 301
    high 361
    scanned 0
    spanned 98304
    present 65536
    managed 65536
    nr_free_pages 65476
    ...
    start_pfn: 294912
    inactive_ratio: 1

    This patch (of 7):

    Introduce zone_for_memory() in arch independent code for
    arch_add_memory() use.

    Many arch_add_memory() implementations simply select ZONE_HIGHMEM or
    ZONE_NORMAL and add the new memory to it. However, with the existence of
    ZONE_MOVABLE, the selection should be considered carefully: if new, higher
    memory is added after ZONE_MOVABLE is set up, the default zone and
    ZONE_MOVABLE may overlap each other.

    should_add_memory_movable() checks the status of ZONE_MOVABLE. If it
    already contains memory, compare the address of the new memory with that
    of the movable memory. If the new memory is higher than the movable
    memory, it should be added to ZONE_MOVABLE instead of the default zone.

    Signed-off-by: Wang Nan
    Cc: Zhang Yanfei
    Cc: Dave Hansen
    Cc: Ingo Molnar
    Cc: Yinghai Lu
    Cc: "Mel Gorman"
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: "Luck, Tony"
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Chris Metcalf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Nan
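
    A sketch of how an arch_add_memory() implementation would use the new
    helper (the zone_for_memory() parameter list shown here is an assumption
    based on the description above):

        int arch_add_memory(int nid, u64 start, u64 size)
        {
                struct pglist_data *pgdat = NODE_DATA(nid);
                unsigned long start_pfn = start >> PAGE_SHIFT;
                unsigned long nr_pages = size >> PAGE_SHIFT;

                /* Default to ZONE_NORMAL, but let the core steer the range into
                 * ZONE_MOVABLE when it lies inside or above existing movable memory. */
                struct zone *zone = pgdat->node_zones +
                                    zone_for_memory(nid, start, size, ZONE_NORMAL);

                return __add_pages(nid, zone, start_pfn, nr_pages);
        }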
     
  • This function is never called for memcg caches, because they are
    unmergeable, so remove the dead code.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Christoph Lameter
    Reviewed-by: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Setting vm_dirty_bytes and dirty_background_bytes is not protected by
    any serialization.

    Therefore, it's possible for either variable to change value after the
    test in global_dirty_limits() to determine whether available_memory
    needs to be initialized or not.

    Always ensure that available_memory is properly initialized.

    Signed-off-by: David Rientjes
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target
    node") improved the previous khugepaged logic, which allocated a
    transparent hugepage from the node of the first page being collapsed.

    However, it is still possible to collapse pages to remote memory which
    may suffer from additional access latency. With the current policy, it
    is possible that 255 pages (with PAGE_SHIFT == 12) will be collapsed
    remotely if the majority are allocated from that node.

    When zone_reclaim_mode is enabled, it means the VM should make every
    attempt to allocate locally to prevent NUMA performance degradation. In
    this case, we do not want to collapse hugepages to remote nodes that
    would suffer from increased access latency. Thus, when
    zone_reclaim_mode is enabled, only allow collapsing to nodes with
    RECLAIM_DISTANCE or less.

    There is no functional change for systems that disable
    zone_reclaim_mode.

    Signed-off-by: David Rientjes
    Cc: Dave Hansen
    Cc: Andrea Arcangeli
    Acked-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Cc: "Kirill A. Shutemov"
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes