08 Apr, 2014

1 commit

  • This patch is a continuation of efforts to optimize find_vma(),
    avoiding potentially expensive rbtree walks to locate a vma upon
    faults. The original approach (https://lkml.org/lkml/2013/11/1/410),
    where the largest vma was also cached, ended up being too specific
    and random, so further comparison with other approaches was needed.
    There are two things to consider here: the cache hit rate and the
    latency of find_vma(). Improving the hit rate does not necessarily
    translate into finding the vma any faster, as the overhead of any
    fancy caching scheme can be too high to be worthwhile.

    We currently cache the last used vma for the whole address space, which
    provides a nice optimization, reducing the total cycles in find_vma() by
    up to 250%, for workloads with good locality. On the other hand, this
    simple scheme is pretty much useless for workloads with poor locality.
    Analyzing ebizzy runs shows that, no matter how many threads are
    running, the mmap_cache hit rate is less than 2%, and in many situations
    below 1%.

    The proposed approach is to replace this scheme with a small per-thread
    cache, maximizing hit rates at a very low maintenance cost.
    Invalidations are performed by simply bumping up a 32-bit sequence
    number. The only expensive operation is the rare case of a sequence
    number overflow, where all caches that share the same address space
    are flushed. Upon a miss, the proposed replacement policy is based on
    the page number that contains the virtual address in question (a
    sketch of the scheme follows the results below). Concretely, the
    following results are seen on an 80 core, 8 socket x86-64 box:

    1) System bootup: Most programs are single threaded, so the per-thread
    scheme improves the ~50% hit rate by just adding a few more slots to
    the cache.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |   50.61% |            19.90 |
    | patched        |   73.45% |            13.58 |
    +----------------+----------+------------------+

    2) Kernel build: This one is already pretty good with the current
    approach as we're dealing with good locality.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |   75.28% |            11.03 |
    | patched        |   88.09% |             9.31 |
    +----------------+----------+------------------+

    3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |   70.66% |            17.14 |
    | patched        |   91.15% |            12.57 |
    +----------------+----------+------------------+

    4) Ebizzy: There's a fair amount of variation from run to run, but this
    approach always shows nearly perfect hit rates, while the baseline's
    are just about non-existent. The cycle counts fluctuate anywhere from
    ~60 to ~116 billion for the baseline scheme, but this approach reduces
    them considerably. For instance, with 80 threads:

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |    1.06% |            91.54 |
    | patched        |   99.97% |            14.18 |
    +----------------+----------+------------------+
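
    As a rough sketch of the scheme just described -- not the actual
    mm/vmacache.c code; the slot count and page size here are illustrative
    assumptions:

    #define PAGE_SHIFT    12                /* assumed 4K pages */
    #define VMACACHE_BITS 2
    #define VMACACHE_SIZE (1U << VMACACHE_BITS)

    struct vma;                             /* stand-in for vm_area_struct */

    struct vmacache {
        unsigned int seqnum;                /* snapshot of mm-wide seqnum */
        struct vma *slots[VMACACHE_SIZE];
    };

    /* Replacement policy: index slots by the page number of the address. */
    static unsigned int vmacache_idx(unsigned long addr)
    {
        return (addr >> PAGE_SHIFT) & (VMACACHE_SIZE - 1);
    }

    /* Invalidation: bump the mm-wide sequence number; a thread whose
     * cached seqnum no longer matches treats its cache as empty. Only
     * the rare 32-bit overflow requires explicitly flushing all caches
     * that share the address space. */
    static void vmacache_invalidate(unsigned int *mm_seqnum)
    {
        (*mm_seqnum)++;
    }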

    [akpm@linux-foundation.org: fix nommu build, per Davidlohr]
    [akpm@linux-foundation.org: document vmacache_valid() logic]
    [akpm@linux-foundation.org: attempt to untangle header files]
    [akpm@linux-foundation.org: add vmacache_find() BUG_ON]
    [hughd@google.com: add vmacache_valid_mm() (from Oleg)]
    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: adjust and enhance comments]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Acked-by: Linus Torvalds
    Reviewed-by: Michel Lespinasse
    Cc: Oleg Nesterov
    Tested-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

05 Apr, 2014

1 commit

  • Pull file locking updates from Jeff Layton:
    "Highlights:

    - maintainership change for fs/locks.c. Willy's not interested in
    maintaining it these days, and is OK with Bruce and me taking it.
    - fix for open vs setlease race that Al ID'ed
    - cleanup and consolidation of file locking code
    - eliminate unneeded BUG() call
    - merge of file-private lock implementation"

    * 'locks-3.15' of git://git.samba.org/jlayton/linux:
    locks: make locks_mandatory_area check for file-private locks
    locks: fix locks_mandatory_locked to respect file-private locks
    locks: require that flock->l_pid be set to 0 for file-private locks
    locks: add new fcntl cmd values for handling file private locks
    locks: skip deadlock detection on FL_FILE_PVT locks
    locks: pass the cmd value to fcntl_getlk/getlk64
    locks: report l_pid as -1 for FL_FILE_PVT locks
    locks: make /proc/locks show IS_FILE_PVT locks as type "FLPVT"
    locks: rename locks_remove_flock to locks_remove_file
    locks: consolidate checks for compatible filp->f_mode values in setlk handlers
    locks: fix posix lock range overflow handling
    locks: eliminate BUG() call when there's an unexpected lock on file close
    locks: add __acquires and __releases annotations to locks_start and locks_stop
    locks: remove "inline" qualifier from fl_link manipulation functions
    locks: clean up comment typo
    locks: close potential race between setlease and open
    MAINTAINERS: update entry for fs/locks.c
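
    As a usage illustration of the file-private locks merged here -- a
    minimal sketch using the F_OFD_* names these locks ultimately shipped
    under; the file path is arbitrary:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        struct flock fl;
        int fd = open("/tmp/ofd-demo", O_RDWR | O_CREAT, 0644);

        if (fd < 0) {
            perror("open");
            return 1;
        }

        memset(&fl, 0, sizeof(fl));
        fl.l_type = F_WRLCK;
        fl.l_whence = SEEK_SET;
        fl.l_start = 0;
        fl.l_len = 0;              /* zero length means "whole file" */
        fl.l_pid = 0;              /* must be 0 for file-private locks */

        /* Unlike traditional F_SETLK, the lock is owned by this open
         * file description rather than the process, so it is not lost
         * when an unrelated fd for the same file is closed. */
        if (fcntl(fd, F_OFD_SETLK, &fl) == -1)
            perror("F_OFD_SETLK");
        else
            printf("got file-private write lock\n");

        close(fd);
        return 0;
    }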

    Linus Torvalds
     

04 Apr, 2014

1 commit

  • Mark a function as static in mmap.c because it is not used outside
    this file.

    This eliminates the following warning in mm/mmap.c:

    mm/mmap.c:407:6: warning: no previous prototype for `validate_mm' [-Wmissing-prototypes]

    Signed-off-by: Rashika Kheria
    Reviewed-by: Josh Triplett
    Reviewed-by: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rashika Kheria
     

31 Mar, 2014

1 commit

  • As Trond pointed out, you can currently deadlock yourself by setting a
    file-private lock on a file that requires mandatory locking and then
    trying to do I/O on it.

    Avoid this problem by plumbing some knowledge of file-private locks into
    the mandatory locking code. In order to do this, we must pass down
    information about the struct file that's being used to
    locks_verify_locked.

    Reported-by: Trond Myklebust
    Signed-off-by: Jeff Layton
    Acked-by: J. Bruce Fields

    Jeff Layton
     

19 Mar, 2014

1 commit

  • _install_special_mapping() is the new base function for
    install_special_mapping(). This function returns a pointer to the
    created VMA, or an error code wrapped in an ERR_PTR().

    This new function is needed by the vdso 32-bit support to map the
    additional vvar and hpet pages into the 32-bit address space. This is
    done with io_remap_pfn_range() and remap_pfn_range(), which require a
    vm_area_struct.
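
    A hedged sketch of the resulting calling convention (kernel-internal
    code; the VM flags shown are illustrative):

    struct vm_area_struct *vma;

    vma = _install_special_mapping(mm, addr, len,
                                   VM_READ | VM_MAYREAD, pages);
    if (IS_ERR(vma))
        return PTR_ERR(vma);    /* error code travels inside the pointer */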

    Reviewed-by: Andy Lutomirski
    Signed-off-by: Stefani Seibold
    Link: http://lkml.kernel.org/r/1395094933-14252-3-git-send-email-stefani@seibold.net
    Signed-off-by: H. Peter Anvin

    Stefani Seibold
     

24 Jan, 2014

2 commits

  • The VM_SOFTDIRTY bit affects the vma merge routine: if two VMAs have
    all bits in vm_flags matched except the dirty bit, the kernel can no
    longer merge them and is forced to generate new VMAs instead.

    This may eventually lead to the situation where a userspace
    application reaches the vm.max_map_count limit and, in the worst
    case, crashes:

    | (gimp:11768): GLib-ERROR **: gmem.c:110: failed to allocate 4096 bytes
    |
    | (file-tiff-load:12038): LibGimpBase-WARNING **: file-tiff-load: gimp_wire_read(): error
    | xinit: connection to X server lost
    |
    | waiting for X server to shut down
    | /usr/lib64/gimp/2.0/plug-ins/file-tiff-load terminated: Hangup
    | /usr/lib64/gimp/2.0/plug-ins/script-fu terminated: Hangup
    | /usr/lib64/gimp/2.0/plug-ins/script-fu terminated: Hangup

    https://bugzilla.kernel.org/show_bug.cgi?id=67651
    https://bugzilla.gnome.org/show_bug.cgi?id=719619#c0

    The initial problem came from a missed VM_SOFTDIRTY in the do_brk()
    routine, but even if we set up VM_SOFTDIRTY there, there is still a
    way to prevent VMAs from merging: one can call

    | echo 4 > /proc/$PID/clear_refs

    and clear VM_SOFTDIRTY over all VMAs present in the memory map; a
    subsequent do_brk() will then try to extend the old VMA, find that
    the dirty bit doesn't match, and generate a new VMA instead.

    As discussed with Pavel, the right approach is to ignore the
    VM_SOFTDIRTY bit when trying to merge VMAs and, if the merge
    succeeds, mark the extended VMA with the dirty bit where needed.
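
    The idea of the fix, as a sketch (the helper name is hypothetical;
    the real comparison lives in mm/mmap.c's mergeability checks):

    /* Compare vm_flags for mergeability while ignoring the soft-dirty
     * bit; on a successful merge the extended VMA is re-marked dirty
     * where needed. */
    static inline int mergeable_flags_sketch(unsigned long a, unsigned long b)
    {
        return (a & ~VM_SOFTDIRTY) == (b & ~VM_SOFTDIRTY);
    }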

    Signed-off-by: Cyrill Gorcunov
    Reported-by: Bastian Hougaard
    Reported-by: Mel Gorman
    Cc: Pavel Emelyanov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Code that is obj-y (always built-in) or dependent on a bool Kconfig
    (built-in or absent) can never be modular. So using module_init as an
    alias for __initcall can be somewhat misleading.

    Fix these up now, so that we can relocate module_init from init.h into
    module.h in the future. If we don't do this, we'd have to add module.h
    to obviously non-modular code, and that would be a worse thing.

    The audit targets the following module_init users for change:
    mm/ksm.c bool KSM
    mm/mmap.c bool MMU
    mm/huge_memory.c bool TRANSPARENT_HUGEPAGE
    mm/mmu_notifier.c bool MMU_NOTIFIER

    Note that direct use of __initcall is discouraged, vs. one of the
    priority categorized subgroups. As __initcall gets mapped onto
    device_initcall, our use of subsys_initcall (which makes sense for these
    files) will thus change this registration from level 6-device to level
    4-subsys (i.e. slightly earlier).

    However no observable impact of that difference has been observed during
    testing.

    One might think that core_initcall (l2) or postcore_initcall (l3) would
    be more appropriate for anything in mm/ but if we look at some actual
    init functions themselves, we see things like:

    mm/huge_memory.c --> hugepage_init --> hugepage_init_sysfs
    mm/mmap.c --> init_user_reserve --> sysctl_user_reserve_kbytes
    mm/ksm.c --> ksm_init --> sysfs_create_group

    and hence the choice of subsys_initcall (l4) seems reasonable, and at
    the same time minimizes the risk of changing the priority too
    drastically all at once. We can adjust further in the future.
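
    The change pattern, in brief, using one of the audited init functions
    (body elided):

    static int __init init_user_reserve(void)
    {
        /* ... set up sysctl_user_reserve_kbytes ... */
        return 0;
    }
    subsys_initcall(init_user_reserve);  /* was: module_init(init_user_reserve); */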

    Also, several instances of missing ";" at EOL are fixed.

    Signed-off-by: Paul Gortmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Gortmaker
     

22 Jan, 2014

2 commits

  • Both do_brk and do_mmap_pgoff verify that we are actually capable of
    locking future pages if the corresponding VM_LOCKED flags are used.
    Encapsulate this logic into a single mlock_future_check() helper
    function.
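
    A sketch of the helper's shape, assuming the usual RLIMIT_MEMLOCK
    accounting (hedged, not necessarily the exact merged code):

    static inline int mlock_future_check(struct mm_struct *mm,
                                         unsigned long flags,
                                         unsigned long len)
    {
        unsigned long locked, lock_limit;

        if (flags & VM_LOCKED) {
            locked = len >> PAGE_SHIFT;
            locked += mm->locked_vm;
            lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
            if (locked > lock_limit && !capable(CAP_IPC_LOCK))
                return -EAGAIN;
        }
        return 0;
    }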

    Signed-off-by: Davidlohr Bueso
    Cc: Rik van Riel
    Reviewed-by: Michel Lespinasse
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Some applications that run on HPC clusters are designed around the
    availability of RAM, and the overcommit ratio is fine-tuned to get the
    maximum usage of memory without swapping. With growing memory, the
    1%-of-all-RAM granularity provided by overcommit_ratio has become too
    coarse for these workloads (on a 2TB machine it represents no less
    than 20GB).

    This patch adds the new overcommit_kbytes sysctl variable that allows
    a much finer granularity.
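
    Usage sketch -- when set non-zero, this value takes the place of the
    percentage when computing the commit limit:

    $ sudo sysctl -w vm.overcommit_kbytes=8388608   # commit limit = swap + 8GB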

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix nommu build]
    Signed-off-by: Jerome Marchand
    Cc: Dave Hansen
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     

15 Nov, 2013

1 commit

  • With split page table lock for PMD level we can't hold mm->page_table_lock
    while updating nr_ptes.

    Let's convert it to atomic_long_t to avoid races.
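
    The conversion pattern, roughly:

    /* before: counter updates relied on mm->page_table_lock */
    spin_lock(&mm->page_table_lock);
    mm->nr_ptes++;
    spin_unlock(&mm->page_table_lock);

    /* after: nr_ptes is an atomic_long_t, safe under split PMD locks */
    atomic_long_inc(&mm->nr_ptes);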

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: Naoya Horiguchi
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

13 Nov, 2013

4 commits

  • Merge first patch-bomb from Andrew Morton:
    "Quite a lot of other stuff is banked up awaiting further
    next->mainline merging, but this batch contains:

    - Lots of random misc patches
    - OCFS2
    - Most of MM
    - backlight updates
    - lib/ updates
    - printk updates
    - checkpatch updates
    - epoll tweaking
    - rtc updates
    - hfs
    - hfsplus
    - documentation
    - procfs
    - update gcov to gcc-4.7 format
    - IPC"

    * emailed patches from Andrew Morton: (269 commits)
    ipc, msg: fix message length check for negative values
    ipc/util.c: remove unnecessary work pending test
    devpts: plug the memory leak in kill_sb
    ./Makefile: export initial ramdisk compression config option
    init/Kconfig: add option to disable kernel compression
    drivers: w1: make w1_slave::flags long to avoid memory corruption
    drivers/w1/masters/ds1wm.c: use dev_get_platdata()
    drivers/memstick/core/ms_block.c: fix unreachable state in h_msb_read_page()
    drivers/memstick/core/mspro_block.c: fix attributes array allocation
    drivers/pps/clients/pps-gpio.c: remove redundant of_match_ptr
    kernel/panic.c: reduce 1 byte usage for print tainted buffer
    gcov: reuse kbasename helper
    kernel/gcov/fs.c: use pr_warn()
    kernel/module.c: use pr_foo()
    gcov: compile specific gcov implementation based on gcc version
    gcov: add support for gcc 4.7 gcov format
    gcov: move gcov structs definitions to a gcc version specific file
    kernel/taskstats.c: return -ENOMEM when alloc memory fails in add_del_listener()
    kernel/taskstats.c: add nla_nest_cancel() for failure processing between nla_nest_start() and nla_nest_end()
    kernel/sysctl_binary.c: use scnprintf() instead of snprintf()
    ...

    Linus Torvalds
     
  • The same calculation is currently done in three different places.
    Factor out that code so future changes have to be made in only one
    place.
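
    A sketch of the factored-out helper along the lines this series takes
    (hedged):

    unsigned long vm_commit_limit(void)
    {
        /* a percentage of RAM (minus hugetlb pages) plus all of swap */
        return ((totalram_pages - hugetlb_total_pages())
                * sysctl_overcommit_ratio / 100) + total_swap_pages;
    }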

    [akpm@linux-foundation.org: uninline vm_commit_limit()]
    Signed-off-by: Jerome Marchand
    Cc: Dave Hansen
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     
  • This patch fixes the problem that get_unmapped_area() can return illegal
    address and result in failing mmap(2) etc.

    If an address higher than PAGE_SIZE is set in
    /proc/sys/vm/mmap_min_addr, an address lower than mmap_min_addr can be
    returned by get_unmapped_area(), even if you do not pass any virtual
    address hint (i.e. the second argument).

    This is because the current get_unmapped_area() code does not take into
    account mmap_min_addr.

    This leads to two actual problems as follows:

    1. mmap(2) can fail with EPERM on a process without CAP_SYS_RAWIO,
    even though no illegal parameter is passed.

    2. The bottom-up search path after the top-down search might not work in
    arch_get_unmapped_area_topdown().

    Note: The first and third chunks of my patch, which change the "len"
    check, are for a more precise check using mmap_min_addr, and not for
    solving the above problem.

    [How to reproduce]

    --- test.c -------------------------------------------------
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <errno.h>

    int main(int argc, char *argv[])
    {
        void *ret = NULL, *last_map;
        size_t pagesize = sysconf(_SC_PAGESIZE);

        do {
            last_map = ret;
            ret = mmap(0, pagesize, PROT_NONE,
                       MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
            // printf("ret=%p\n", ret);
        } while (ret != MAP_FAILED);

        if (errno != ENOMEM) {
            printf("ERR: unexpected errno: %d (last map=%p)\n",
                   errno, last_map);
        }

        return 0;
    }
    ---------------------------------------------------------------

    $ gcc -m32 -o test test.c
    $ sudo sysctl -w vm.mmap_min_addr=65536
    vm.mmap_min_addr = 65536
    $ ./test (run as a non-privileged user)
    ERR: unexpected errno: 1 (last map=0x10000)

    Signed-off-by: Akira Takeuchi
    Signed-off-by: Kiyoshi Owada
    Reviewed-by: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akira Takeuchi
     
  • This is more or less the generic variant of commit 41aacc1eea64 ("x86
    get_unmapped_area: Access mmap_legacy_base through mm_struct member").

    So effectively, architectures which use their own
    arch_pick_mmap_layout() implementation but call the generic
    arch_get_unmapped_area() can now also randomize their mmap_base.

    All architectures which have their own arch_pick_mmap_layout() and
    call the generic arch_get_unmapped_area() (arm64, s390, tile)
    currently set mmap_base to TASK_UNMAPPED_BASE. This is also true for
    the generic arch_pick_mmap_layout() function. So this change is
    currently a no-op.

    Signed-off-by: Heiko Carstens
    Cc: Radu Caragea
    Cc: Michel Lespinasse
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Chris Metcalf
    Cc: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     

12 Sep, 2013

6 commits

  • pgoff is not used after the statement "pgoff = vma->vm_pgoff;", so the
    assignment is redundant.

    Signed-off-by: Yanchuan Nian
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yanchuan Nian
     
  • Pavel reported that in case a vma area gets unmapped and then mapped
    (or expanded) in place, the soft dirty tracker won't be able to
    recognize this situation, since it works on the pte level and ptes get
    zapped on unmap, losing the soft dirty bit of course.

    To resolve this we need to track actions on the vma level; this is
    where the VM_SOFTDIRTY flag comes in. When a new vma area is created
    (or an old one is expanded) we set this bit, and keep it set until the
    application asks for the soft dirty bit to be cleared.

    Thus when a user space application tracks memory changes, it can now
    detect whether a vma area has been renewed.

    Reported-by: Pavel Emelyanov
    Signed-off-by: Cyrill Gorcunov
    Cc: Andy Lutomirski
    Cc: Matt Mackall
    Cc: Xiao Guangrong
    Cc: Marcelo Tosatti
    Cc: KOSAKI Motohiro
    Cc: Stephen Rothwell
    Cc: Peter Zijlstra
    Cc: "Aneesh Kumar K.V"
    Cc: Rob Landley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • correct_wcount and inode in mmap_region() just complicate the code.
    This boolean was needed previously, when deny_write_access() was
    called before vma_merge(); now we can simply check VM_DENYWRITE and
    do allow_write_access() if it is set.

    allow_write_access() checks file != NULL, so this is safe even if it
    were possible to use VM_DENYWRITE && !file. We just need to ensure we
    use the same file which was deny_write_access()'ed, so the patch also
    moves "file = vma->vm_file" down after allow_write_access().

    Signed-off-by: Oleg Nesterov
    Cc: Hugh Dickins
    Cc: Al Viro
    Cc: Colin Cross
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Simple cleanup. Move "struct inode *inode" variable into "if (file)"
    block to simplify the code and avoid the unnecessary check.

    Signed-off-by: Oleg Nesterov
    Cc: Hugh Dickins
    Cc: Al Viro
    Cc: Colin Cross
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • mmap() doesn't allow non-anonymous mappings with a VM_GROWS* bit set.
    In particular this means that mmap_region()->vma_merge(file, vm_flags)
    must always fail if "vm_flags & VM_GROWS*" is set incorrectly.

    So it does not make sense to check VM_GROWS* after we have already
    allocated the new vma; the only caller that can pass this flag,
    do_mmap_pgoff(), can do the check itself.

    This also looks a bit more correct: mmap_region() has already unmapped
    the old mapping at this stage, but if mmap() is going to fail, it
    should avoid do_munmap() if possible.

    Note: we check VM_GROWS* at the end to ensure that do_mmap_pgoff()
    won't return EINVAL in cases where it currently returns another error
    code.

    Many thanks to Hugh who nacked the buggy v1.

    Signed-off-by: Oleg Nesterov
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Simple cleanup. Every user of vma_set_policy() does the same work,
    which looks a bit annoying imho. Add a new trivial helper which does
    mpol_dup() + vma_set_policy() to simplify the callers.
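
    A sketch of such a helper's shape (hedged):

    int vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst)
    {
        struct mempolicy *pol = mpol_dup(vma_policy(src));

        if (IS_ERR(pol))
            return PTR_ERR(pol);
        dst->vm_policy = pol;   /* mpol_dup() + vma_set_policy() in one */
        return 0;
    }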

    Signed-off-by: Oleg Nesterov
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

16 Aug, 2013

1 commit

  • Ben Tebulin reported:

    "Since v3.7.2 on two independent machines a very specific Git
    repository fails in 9/10 cases on git-fsck due to an SHA1/memory
    failures. This only occurs on a very specific repository and can be
    reproduced stably on two independent laptops. Git mailing list ran
    out of ideas and for me this looks like some very exotic kernel issue"

    and bisected the failure to the backport of commit 53a59fc67f97 ("mm:
    limit mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT").

    That commit itself is not actually buggy, but what it does is to make it
    much more likely to hit the partial TLB invalidation case, since it
    introduces a new case in tlb_next_batch() that previously only ever
    happened when running out of memory.

    The real bug is that the TLB gather virtual memory range setup is subtly
    buggered. It was introduced in commit 597e1c3580b7 ("mm/mmu_gather:
    enable tlb flush range in generic mmu_gather"), and the range handling
    was already fixed at least once in commit e6c495a96ce0 ("mm: fix the TLB
    range flushed when __tlb_remove_page() runs out of slots"), but that fix
    was not complete.

    The problem with the TLB gather virtual address range is that it isn't
    set up by the initial tlb_gather_mmu() initialization (which didn't get
    the TLB range information), but it is set up ad-hoc later by the
    functions that actually flush the TLB. And so any such case that forgot
    to update the TLB range entries would potentially miss TLB invalidates.

    Rather than try to figure out exactly which particular ad-hoc range
    setup was missing (I personally suspect it's the hugetlb case in
    zap_huge_pmd(), which didn't have the same logic as zap_pte_range()
    did), this patch just gets rid of the problem at the source: make the
    TLB range information available to tlb_gather_mmu(), and initialize it
    when initializing all the other tlb gather fields.

    This makes the patch larger, but conceptually much simpler. And the end
    result is much more understandable; even if you want to play games with
    partial ranges when invalidating the TLB contents in chunks, now the
    range information is always there, and anybody who doesn't want to
    bother with it won't introduce subtle bugs.
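
    Conceptually, the interface change looks like this (illustrative
    signatures):

    /* before: the range was filled in ad-hoc by whoever flushed */
    tlb_gather_mmu(&tlb, mm, fullmm);

    /* after: the virtual address range is part of initialization */
    tlb_gather_mmu(&tlb, mm, start, end);
    unmap_vmas(&tlb, vma, start, end);
    tlb_finish_mmu(&tlb, start, end);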

    Ben verified that this fixes his problem.

    Reported-bisected-and-tested-by: Ben Tebulin
    Build-testing-by: Stephen Rothwell
    Build-testing-by: Richard Weinberger
    Reviewed-by: Michal Hocko
    Acked-by: Peter Zijlstra
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

01 Aug, 2013

1 commit

  • vma_adjust() does vma_set_policy(vma, vma_policy(next)) and this
    is doubly wrong:

    1. This leaks vma->vm_policy if it is not NULL and not equal to
    next->vm_policy.

    This can happen if vma_merge() expands "area", not prev (case 8).

    2. This sets the wrong policy if vma_merge() joins prev and area,
    area is the vma the caller needs to update and it still has the
    old policy.

    Revert commit 1444f92c8498 ("mm: merging memory blocks resets
    mempolicy") which introduced these problems.

    Change mbind_range() to recheck mpol_equal() after vma_merge() to fix
    the problem that commit tried to address.

    Signed-off-by: Oleg Nesterov
    Acked-by: KOSAKI Motohiro
    Cc: Steven T Hampson
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

11 Jul, 2013

1 commit

  • Since all architectures have been converted to use vm_unmapped_area(),
    there is no remaining use for the free_area_cache.

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Cc: "James E.J. Bottomley"
    Cc: "Luck, Tony"
    Cc: Benjamin Herrenschmidt
    Cc: David Howells
    Cc: Helge Deller
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Paul Mackerras
    Cc: Richard Henderson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

10 May, 2013

1 commit

  • Dave reported an oops triggered by trinity:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    IP: newseg+0x10d/0x390
    PGD cf8c1067 PUD cf8c2067 PMD 0
    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    CPU: 2 PID: 7636 Comm: trinity-child2 Not tainted 3.9.0+#67
    ...
    Call Trace:
    ipcget+0x182/0x380
    SyS_shmget+0x5a/0x60
    tracesys+0xdd/0xe2

    This bug was introduced by commit af73e4d9506d ("hugetlbfs: fix mmap
    failure in unaligned size request").

    Reported-by: Dave Jones
    Cc:
    Signed-off-by: Li Zefan
    Reviewed-by: Naoya Horiguchi
    Acked-by: Rik van Riel
    Signed-off-by: Linus Torvalds

    Li Zefan
     

08 May, 2013

1 commit

  • The current kernel returns -EINVAL unless a given mmap length is
    "almost" hugepage aligned. This is because in sys_mmap_pgoff() the
    given length is passed to vm_mmap_pgoff() as it is without being aligned
    with hugepage boundary.

    This is a regression introduced in commit 40716e29243d ("hugetlbfs: fix
    alignment of huge page requests"), where alignment code is pushed into
    hugetlb_file_setup() and the variable len in caller side is not changed.

    To fix this, this patch partially reverts that commit and adds
    alignment code on the caller side. It also introduces hstate_sizelog()
    in order to get the proper hstate for the specified hugepage size.
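
    A sketch of the new helper (hedged):

    static inline struct hstate *hstate_sizelog(int page_size_log)
    {
        if (!page_size_log)
            return &default_hstate;     /* no explicit size requested */
        return size_to_hstate(1UL << page_size_log);
    }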

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=56881

    [akpm@linux-foundation.org: fix warning when CONFIG_HUGETLB_PAGE=n]
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Johannes Weiner
    Reported-by:
    Cc: Steven Truelove
    Cc: Jianguo Wu
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

30 Apr, 2013

7 commits

  • Fix a corner case for MAP_FIXED when the requested mapping length is
    larger than the rlimit for virtual memory. In such a case any
    overlapping mappings are unmapped before we check for the limit and
    return ENOMEM.

    The check is moved before the loop that unmaps overlapping parts of
    existing mappings. When we are about to hit the limit (currently mapped
    pages + len > limit) we scan for overlapping pages and check again
    accounting for them.
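
    The shape of the check, roughly (hedged sketch; count_vma_pages_range()
    tallies pages inside already-mapped overlapping regions):

    /* Check against RLIMIT_AS *before* unmapping anything. For MAP_FIXED,
     * pages we are about to replace don't count toward the new total. */
    if (!may_expand_vm(mm, len >> PAGE_SHIFT)) {
        unsigned long nr_pages;

        if (!(vm_flags & MAP_FIXED))
            return -ENOMEM;

        nr_pages = count_vma_pages_range(mm, addr, addr + len);
        if (!may_expand_vm(mm, (len >> PAGE_SHIFT) - nr_pages))
            return -ENOMEM;
    }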

    This fixes the situation where a userspace program expects the
    previous mappings to be preserved after the mmap() syscall has
    returned with an error. (POSIX clearly states that a successful
    mapping shall replace any previous mappings.)

    This corner case was found and can be tested with LTP testcase:

    testcases/open_posix_testsuite/conformance/interfaces/mmap/24-2.c

    In this case the mmap, which is clearly over current limit, unmaps
    dynamic libraries and the testcase segfaults right after returning into
    userspace.

    I've also looked at the second instance of the unmapping loop in
    do_brk(). The do_brk() is called from the brk() syscall and from
    vm_brk(). The brk() syscall checks for overlapping mappings and bails
    out when there are any (so it can't be triggered from the brk
    syscall). The vm_brk() is called only from binfmt handlers, so it
    shouldn't be triggered unless a binfmt handler created overlapping
    mappings.

    Signed-off-by: Cyril Hrubis
    Reviewed-by: Mel Gorman
    Reviewed-by: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyril Hrubis
     
  • Alter the admin and user reserves of the previous patches in this series
    when memory is added or removed.

    If memory is added and the reserves have been eliminated or increased
    above the default max, then we'll trust the admin.

    If memory is removed and there isn't enough free memory, then we need to
    reset the reserves.

    Otherwise keep the reserve set by the admin.

    The reserve reset code is the same as the reserve initialization code.

    I tested hot addition and removal by triggering it via sysfs. The
    reserves shrunk when they were set high and memory was removed. They
    were reset higher when memory was added again.

    [akpm@linux-foundation.org: use register_hotmemory_notifier()]
    [akpm@linux-foundation.org: init_user_reserve() and init_admin_reserve can no longer be __meminit]
    [fengguang.wu@intel.com: make init_reserve_notifier() static]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Andrew Shewmaker
    Signed-off-by: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Shewmaker
     
  • Add an admin_reserve_kbytes knob to allow admins to change the hardcoded
    memory reserve to something other than 3%, which may be multiple
    gigabytes on large memory systems. Only about 8MB is necessary to
    enable recovery in the default mode, and only a few hundred MB are
    required even when overcommit is disabled.

    This affects OVERCOMMIT_GUESS and OVERCOMMIT_NEVER.

    admin_reserve_kbytes is initialized to min(3% free pages, 8MB)

    I arrived at 8MB by summing the RSS of sshd or login, bash, and top.
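
    Usage sketch, in line with the sizes discussed:

    $ sudo sysctl -w vm.admin_reserve_kbytes=8192     # ~8MB, overcommit 'guess'
    $ sudo sysctl -w vm.admin_reserve_kbytes=131072   # ~128MB, overcommit 'never'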

    Please see first patch in this series for full background, motivation,
    testing, and full changelog.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: make init_admin_reserve() static]
    Signed-off-by: Andrew Shewmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Shewmaker
     
  • Add user_reserve_kbytes knob.

    Limit the growth of the memory reserved for other user processes to
    min(3% current process size, user_reserve_kbytes). Only about 8MB is
    necessary to enable recovery in the default mode, and only a few
    hundred MB are required even when overcommit is disabled.

    user_reserve_kbytes defaults to min(3% free pages, 128MB)

    I arrived at 128MB by taking the max VSZ of sshd, login, bash, and top ...
    then adding the RSS of each.

    This only affects OVERCOMMIT_NEVER mode.

    Background

    1. user reserve

    __vm_enough_memory reserves a hardcoded 3% of the current process size for
    other applications when overcommit is disabled. This was done so that a
    user could recover if they launched a memory hogging process. Without the
    reserve, a user would easily run into a message such as:

    bash: fork: Cannot allocate memory

    2. admin reserve

    Additionally, a hardcoded 3% of free memory is reserved for root in
    both overcommit 'guess' and 'never' modes. This was intended to
    prevent a scenario where root can't log in and perform recovery
    operations.

    Note that this reserve shrinks, and doesn't guarantee a useful reserve.

    Motivation

    The two hardcoded memory reserves should be updated to account for current
    memory sizes.

    Also, the admin reserve would be more useful if it didn't shrink too much.

    When the current code was originally written, 1GB was considered
    "enterprise". Now the 3% reserve can grow to multiple GB on large memory
    systems, and it only needs to be a few hundred MB at most to enable a user
    or admin to recover a system with an unwanted memory hogging process.

    I've found that reducing these reserves is especially beneficial for a
    specific type of application load:

    * single application system
    * one or few processes (e.g. one per core)
    * allocating all available memory
    * not initializing every page immediately
    * long running

    I've run scientific clusters with this sort of load. A long running job
    sometimes failed many hours (weeks of CPU time) into a calculation. They
    weren't initializing all of their memory immediately, and they weren't
    using calloc, so I put systems into overcommit 'never' mode. These
    clusters run diskless and have no swap.

    However, with the current reserves, a user wishing to allocate as much
    memory as possible to one process may be prevented from using, for
    example, almost 2GB out of 32GB.

    The effect is less, but still significant when a user starts a job with
    one process per core. I have repeatedly seen a set of processes
    requesting the same amount of memory fail because one of them could not
    allocate the amount of memory a user would expect to be able to allocate.
    For example, Message Passing Interface (MPI) processes, one per core. And
    it is similar for other parallel programming frameworks.

    Changing this reserve code will make the overcommit never mode more useful
    by allowing applications to allocate nearly all of the available memory.

    Also, the new admin_reserve_kbytes will be safer than the current behavior
    since the hardcoded 3% of available memory reserve can shrink to something
    useless in the case where applications have grabbed all available memory.

    Risks

    * "bash: fork: Cannot allocate memory"

    The downside of the first patch-- which creates a tunable user reserve
    that is only used in overcommit 'never' mode--is that an admin can set
    it so low that a user may not be able to kill their process, even if
    they already have a shell prompt.

    Of course, a user can get in the same predicament with the current 3%
    reserve--they just have to launch processes until 3% becomes negligible.

    * root-cant-log-in problem

    The second patch, adding the tunable admin_reserve_kbytes, allows
    the admin to shoot themselves in the foot by setting it too small.
    They can easily get the system into a state where root can't log in.

    However, the new admin_reserve_kbytes will be safer than the current
    behavior since the hardcoded 3% of available memory reserve can shrink
    to something useless in the case where applications have grabbed all
    available memory.

    Alternatives

    * Memory cgroups provide a more flexible way to limit application memory.

    Not everyone wants to set up cgroups or deal with their overhead.

    * We could create a fourth overcommit mode which provides smaller reserves.

    The size of useful reserves may be drastically different depending on
    whether the system is embedded or enterprise.

    * Force users to initialize all of their memory or use calloc.

    Some users don't want/expect the system to overcommit when they malloc.
    Overcommit 'never' mode is for this scenario, and it should work well.

    The new user and admin reserve tunables are simple to use, with low
    overhead compared to cgroups. The patches preserve current behavior where
    3% of memory is less than 128MB, except that the admin reserve doesn't
    shrink to an unusable size under pressure. The code allows admins to tune
    for embedded and enterprise usage.

    FAQ

    * How is the root-cant-login problem addressed?
    What happens if admin_reserve_kbytes is set to 0?

    Root is free to shoot themselves in the foot by setting
    admin_reserve_kbytes too low.

    On x86_64, the minimum useful reserve is:
    8MB for overcommit 'guess'
    128MB for overcommit 'never'

    admin_reserve_kbytes defaults to min(3% free memory, 8MB)

    So, anyone switching to 'never' mode needs to adjust
    admin_reserve_kbytes.

    * How do you calculate a minimum useful reserve?

    A user or the admin needs enough memory to login and perform
    recovery operations, which includes, at a minimum:

    sshd or login + bash (or some other shell) + top (or ps, kill, etc.)

    For overcommit 'guess', we can sum resident set sizes (RSS)
    because we only need enough memory to handle what the recovery
    programs will typically use. On x86_64 this is about 8MB.

    For overcommit 'never', we can take the max of their virtual sizes
    (VSZ) and add the sum of their RSS. We use VSZ instead of RSS because
    this mode forces us to ensure we can fulfill all of the requested
    memory allocations -- even if the programs only use a fraction of what
    they ask for. On x86_64 this is about 128MB.

    When swap is enabled, reserves are useful even when they are as
    small as 10MB, regardless of overcommit mode.

    When both swap and overcommit are disabled, the admin should tune the
    reserves higher to be absolutely safe. Over 230MB each was safest in
    my testing.

    * What happens if user_reserve_kbytes is set to 0?

    Note, this only affects overcommit 'never' mode.

    Then a user will be able to allocate all available memory minus
    admin_reserve_kbytes.

    However, they will easily see a message such as:

    "bash: fork: Cannot allocate memory"

    And they won't be able to recover/kill their application.
    The admin should be able to recover the system if
    admin_reserve_kbytes is set appropriately.

    * What's the difference between overcommit 'guess' and 'never'?

    "Guess" allows an allocation if there are enough free + reclaimable
    pages. It has a hardcoded 3% of free pages reserved for root.

    "Never" allows an allocation if there is enough swap + a configurable
    percentage (default is 50) of physical RAM. It has a hardcoded 3% of
    free pages reserved for root, like "Guess" mode. It also has a
    hardcoded 3% of the current process size reserved for additional
    applications.

    * Why is overcommit 'guess' not suitable even when an app eventually
    writes to every page? It takes free pages, file pages, available
    swap pages, and reclaimable slab pages into consideration. In other
    words, all of these pages are available, so why isn't it suitable?

    Because it only looks at the present state of the system. It
    does not take into account the memory that other applications have
    malloced, but haven't initialized yet. It overcommits the system.

    Test Summary

    There was little change in behavior in the default overcommit 'guess'
    mode with swap enabled before and after the patch. This was expected.

    Systems run most predictably (i.e. no oom kills) in overcommit 'never'
    mode with swap enabled. This also allowed the most memory to be allocated
    to a user application.

    Overcommit 'guess' mode without swap is a bad idea. It is easy to
    crash the system. None of the other tested combinations crashed.
    This matches my experience on the Roadrunner supercomputer.

    Without the tunable user reserve, a system in overcommit 'never' mode
    and without swap does not allow the user to recover, although the
    admin can.

    With the new tunable reserves, a system in overcommit 'never' mode
    and without swap can be configured to:

    1. maximize user-allocatable memory, running close to the edge of
    recoverability

    2. maximize recoverability, sacrificing allocatable memory to
    ensure that a user cannot take down a system

    Test Description

    Fedora 18 VM - 4 x86_64 cores, 5725MB RAM, 4GB Swap

    System is booted into multiuser console mode, with unnecessary services
    turned off. Caches were dropped before each test.

    Hogs are user memtester processes that attempt to allocate all free memory
    as reported by /proc/meminfo

    In overcommit 'never' mode, memory_ratio=100

    Test Results

    3.9.0-rc1-mm1

    Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
    ----------   ----   ----   -------------   ----   -------------   --------------
    guess        yes    1      5432/5432       no     yes             yes
    guess        yes    4      5444/5444       1      yes             yes
    guess        no     1      5302/5449       no     yes             yes
    guess        no     4      -               crash  no              no

    never        yes    1      5460/5460       1      yes             yes
    never        yes    4      5460/5460       1      yes             yes
    never        no     1      5218/5432       no     no              yes
    never        no     4      5203/5448       no     no              yes

    3.9.0-rc1-mm1-tunablereserves

    User and Admin Recovery show their respective reserves, if applicable.

    Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
    ----------   ----   ----   -------------   ----   -------------   --------------
    guess        yes    1      5419/5419       no     -     yes       8MB   yes
    guess        yes    4      5436/5436       1      -     yes       8MB   yes
    guess        no     1      5440/5440       *      -     yes       8MB   yes
    guess        no     4      -               crash  -     no        8MB   no

    * process would successfully mlock, then the oom killer would pick it

    never        yes    1      5446/5446       no     10MB  yes       20MB  yes
    never        yes    4      5456/5456       no     10MB  yes       20MB  yes
    never        no     1      5387/5429       no     128MB no        8MB   barely
    never        no     1      5323/5428       no     226MB barely    8MB   barely
    never        no     1      5323/5428       no     226MB barely    8MB   barely

    never        no     1      5359/5448       no     10MB  no        10MB  barely

    never        no     1      5323/5428       no     0MB   no        10MB  barely
    never        no     1      5332/5428       no     0MB   no        50MB  yes
    never        no     1      5293/5429       no     0MB   no        90MB  yes

    never        no     1      5001/5427       no     230MB yes       338MB yes
    never        no     4*     4998/5424       no     230MB yes       338MB yes

    * more memtesters were launched, able to allocate approximately another 100MB

    Future Work

    - Test larger memory systems.

    - Test an embedded image.

    - Test other architectures.

    - Time malloc microbenchmarks.

    - Would it be useful to be able to set overcommit policy for
    each memory cgroup?

    - Some lines are slightly above 80 chars.
    Perhaps define a macro to convert between pages and kb?
    Other places in the kernel do this.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: make init_user_reserve() static]
    Signed-off-by: Andrew Shewmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Shewmaker
     
  • Using mbind to change the mempolicy to MPOL_BIND on several adjacent
    mmapped blocks may result in a reset of the mempolicy to MPOL_DEFAULT in
    vma_adjust.

    Test code. Correct result is three lines containing "OK".

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <numaif.h>
    #include <errno.h>

    /* gcc mbind_test.c -lnuma -o mbind_test -Wall */
    #define MAXNODE 4096

    void allocate()
    {
        int ret;
        int len;
        int policy = -1;
        unsigned char *p;
        unsigned long mask[MAXNODE] = { 0 };
        unsigned long retmask[MAXNODE] = { 0 };

        len = getpagesize() * 0x2fc00;
        p = mmap(NULL, len, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
                 -1, 0);
        if (p == MAP_FAILED)
            printf("mmap err: %d\n", errno);

        mask[0] = 1;
        ret = mbind(p, len, MPOL_BIND, mask, MAXNODE, 0);
        if (ret < 0)
            printf("mbind err: %d %d\n", ret, errno);
        ret = get_mempolicy(&policy, retmask, MAXNODE, p, MPOL_F_ADDR);
        if (ret < 0)
            printf("get_mempolicy err: %d %d\n", ret, errno);

        if (policy == MPOL_BIND)
            printf("OK\n");
        else
            printf("ERROR: policy is %d\n", policy);
    }

    int main()
    {
        allocate();
        allocate();
        allocate();
        return 0;
    }

    Signed-off-by: Steven T Hampson
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hampson, Steven T
     
  • On architectures where a pgd entry may be shared between user and kernel
    (e.g. ARM+LPAE), freeing page tables needs a ceiling other than 0.
    This patch introduces a generic USER_PGTABLES_CEILING that arch code can
    override. It is the responsibility of the arch code setting the ceiling
    to ensure the complete freeing of the page tables (usually in
    pgd_free()).
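
    The generic fallback then looks like this, with arch code free to
    override it:

    #ifndef USER_PGTABLES_CEILING
    #define USER_PGTABLES_CEILING 0UL   /* no ceiling needed by default */
    #endif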

    [catalin.marinas@arm.com: commit log; shift_arg_pages(), asm-generic/pgtables.h changes]
    Signed-off-by: Hugh Dickins
    Signed-off-by: Catalin Marinas
    Cc: Russell King
    Cc: [3.3+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove the WARN_ON_ONCE(!mm) check as the comment suggested. Kernel
    code calls find_vma only when it is absolutely sure that the mm_struct
    arg to it is non-NULL.

    Signed-off-by: Zhang Yanfei
    Cc: k80c
    Cc: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     

05 Apr, 2013

1 commit

  • find_vma() can be called by multiple threads with the read lock
    held on mm->mmap_sem, and any of them can update mm->mmap_cache.
    Prevent the compiler from re-fetching mm->mmap_cache, because other
    readers could update it in the meantime:

    thread 1                              thread 2
                                          |
    find_vma()                            |  find_vma()
      struct vm_area_struct *vma = NULL;  |
      vma = mm->mmap_cache;               |
      if (!(vma && vma->vm_end > addr     |
          && vma->vm_start <= addr)) {    |
                                          |    mm->mmap_cache = vma;
      return vma;                         |
       ^^ compiler may optimize this      |
          local variable out and re-read  |
          mm->mmap_cache                  |

    This issue can be reproduced with gcc-4.8.0-1 on s390x by running
    mallocstress testcase from LTP, which triggers:

    kernel BUG at mm/rmap.c:1088!
    Call Trace:
    ([] 0x3d100c57000)
    [] do_wp_page+0x2fc/0xa88
    [] handle_pte_fault+0x41a/0xac8
    [] handle_mm_fault+0x17a/0x268
    [] do_protection_exception+0x1e2/0x394
    [] pgm_check_handler+0x138/0x13c
    [] 0x3fffcf1f07a
    Last Breaking-Event-Address:
    [] page_add_new_anon_rmap+0xc2/0x168

    Thanks to Jakub Jelinek for his insight on gcc and helping to
    track this down.
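
    The fix forces a single read of the shared pointer, so the validity
    check and the return use the same value:

    /* read mm->mmap_cache exactly once; the compiler may not re-fetch */
    vma = ACCESS_ONCE(mm->mmap_cache);
    if (!(vma && vma->vm_end > addr && vma->vm_start <= addr)) {
        /* ... slow path: rbtree walk, then update mm->mmap_cache ... */
    }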

    Signed-off-by: Jan Stancek
    Acked-by: David Rientjes
    Signed-off-by: Hugh Dickins
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Jan Stancek
     

29 Mar, 2013

1 commit

  • This reverts commit 186930500985 ("mm: introduce VM_POPULATE flag to
    better deal with racy userspace programs").

    VM_POPULATE only has any effect when userspace plays racy games with
    vmas by trying to unmap and remap memory regions that mmap or mlock are
    operating on.

    Also, the only effect of VM_POPULATE when userspace plays such games is
    that it avoids populating new memory regions that get remapped into the
    address range that was being operated on by the original mmap or mlock
    calls.

    Let's remove VM_POPULATE as there isn't any strong argument to mandate a
    new vm_flag.

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

28 Feb, 2013

1 commit

  • The stack vma is designed to grow automatically (marked with VM_GROWSUP
    or VM_GROWSDOWN depending on architecture) when an access is made beyond
    the existing boundary. However, particularly if you have not limited
    your stack at all ("ulimit -s unlimited"), this can cause the stack to
    grow even if the access was really just one past *another* segment.

    And that's wrong, especially since we first grow the segment, but then
    immediately later enforce the stack guard page on the last page of the
    segment. So _despite_ first growing the stack segment as a result of
    the access, the kernel will then make the access cause a SIGSEGV anyway!

    So do the same logic as the guard page check does, and consider an
    access to within one page of the next segment to be a bad access, rather
    than growing the stack to abut the next segment.

    Reported-and-tested-by: Heiko Carstens
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

27 Feb, 2013

1 commit

  • Pull vfs pile (part one) from Al Viro:
    "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
    locking violations, etc.

    The most visible changes here are death of FS_REVAL_DOT (replaced with
    "has ->d_weak_revalidate()") and a new helper getting from struct file
    to inode. Some bits of preparation to xattr method interface changes.

    Misc patches by various people sent this cycle *and* ocfs2 fixes from
    several cycles ago that should've been upstream right then.

    PS: the next vfs pile will be xattr stuff."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
    saner proc_get_inode() calling conventions
    proc: avoid extra pde_put() in proc_fill_super()
    fs: change return values from -EACCES to -EPERM
    fs/exec.c: make bprm_mm_init() static
    ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
    ocfs2: fix possible use-after-free with AIO
    ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
    get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
    target: writev() on single-element vector is pointless
    export kernel_write(), convert open-coded instances
    fs: encode_fh: return FILEID_INVALID if invalid fid_type
    kill f_vfsmnt
    vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
    nfsd: handle vfs_getattr errors in acl protocol
    switch vfs_getattr() to struct path
    default SET_PERSONALITY() in linux/elf.h
    ceph: prepopulate inodes only when request is aborted
    d_hash_and_lookup(): export, switch open-coded instances
    9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
    9p: split dropping the acls from v9fs_set_create_acl()
    ...

    Linus Torvalds