24 May, 2016

1 commit

  • This is a follow-up to the oom_reaper work [1]. As the async OOM killing
    depends on taking mmap_sem for read, we would really appreciate it if a
    holder for write didn't stand in the way. This patchset changes many of
    the down_write calls to be killable, to help those cases when the writer
    is blocked waiting for readers to release the lock, and so help
    __oom_reap_task to process the oom victim.

    Most of the patches are really trivial because the lock is held from
    shallow syscall paths where we can return -EINTR trivially and allow the
    current task to die (note that EINTR will never get to userspace as the
    task has a fatal signal pending). Others seem easy as well because the
    callers already handle fatal errors, bail out and return to userspace,
    which should be sufficient to handle the failure gracefully. I am not
    familiar with all those code paths, so a deeper review is really
    appreciated.

    As this work touches several areas which are not directly connected, I
    have tried to keep the CC list as small as possible; people whom I
    believed to be familiar with a particular area are CCed only on the
    specific patches (all should have received the cover, though).

    This patchset is based on linux-next and it depends on
    down_write_killable for rw_semaphores, which has been merged into the
    tip locking/rwsem branch and is part of that next tree. I guess it would
    be easiest to route these patches via mmotm because of the dependency on
    the tip tree, but if the respective maintainers prefer another way I
    have no objections.

    I haven't covered all the down_write(mm->mmap_sem) instances here

    $ git grep "down_write(.*\)" next/master | wc -l
    98
    $ git grep "down_write(.*\)" | wc -l
    62

    I have tried to cover those which should be relatively easy to review in
    this series because this alone should be a nice improvement. Other
    places can be changed on top.

    [0] http://lkml.kernel.org/r/1456752417-9626-1-git-send-email-mhocko@kernel.org
    [1] http://lkml.kernel.org/r/1452094975-551-1-git-send-email-mhocko@kernel.org
    [2] http://lkml.kernel.org/r/1456750705-7141-1-git-send-email-mhocko@kernel.org

    This patch (of 18):

    This is the first step in making mmap_sem write waiters killable. It
    focuses on the trivial ones which take the lock early after entering the
    syscall and do not change any state before that.

    Therefore it is very easy to change them to use down_write_killable and
    immediately return with -EINTR. This will allow the waiter to pass away
    without blocking the mmap_sem, which might be required to make forward
    progress. E.g. the oom reaper will need the lock for reading to
    dismantle the OOM victim's address space.

    The only tricky function in this patch is vm_mmap_pgoff, which has many
    call sites via vm_mmap. To reduce the risk, keep vm_mmap with the
    original non-killable semantics for now.

    vm_munmap callers do not bother to check the return value, so open-code
    the killable variant into the munmap syscall path for now for
    simplicity.
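
    Below is a minimal sketch of the conversion pattern used throughout the
    series (the syscall and its body are hypothetical illustrations;
    down_write_killable()/up_write() are the real primitives):

    #include <linux/errno.h>
    #include <linux/mm.h>
    #include <linux/sched.h>
    #include <linux/syscalls.h>

    SYSCALL_DEFINE2(example_mm_op, unsigned long, start, size_t, len)
    {
        struct mm_struct *mm = current->mm;
        int ret = 0;

        /* Previously: down_write(&mm->mmap_sem); -- not interruptible. */
        if (down_write_killable(&mm->mmap_sem))
            return -EINTR; /* never seen by userspace: fatal signal pending */

        /* ... the actual VMA manipulation would go here ... */

        up_write(&mm->mmap_sem);
        return ret;
    }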

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Konstantin Khlebnikov
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

22 Jan, 2016

1 commit

  • Tetsuo Handa reported underflow of NR_MLOCK on munlock.

    Testcase:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define BASE ((void *)0x400000000000)
    #define SIZE (1UL << 21)

    int main(int argc, char *argv[])
    {
        void *addr;

        system("grep Mlocked /proc/meminfo");
        addr = mmap(BASE, SIZE, PROT_READ | PROT_WRITE,
                    MAP_ANONYMOUS | MAP_PRIVATE | MAP_LOCKED | MAP_FIXED,
                    -1, 0);
        if (addr == MAP_FAILED)
            printf("mmap() failed\n"), exit(1);
        munmap(addr, SIZE);
        system("grep Mlocked /proc/meminfo");
        return 0;
    }

    It happens in munlock_vma_page() due to an unfortunate choice of the
    nr_pages data type:

    __mod_zone_page_state(zone, NR_MLOCK, -nr_pages);

    With nr_pages declared as unsigned int, -nr_pages is computed in
    unsigned arithmetic and, after the implicit conversion to long in
    __mod_zone_page_state(), becomes a value close to UINT_MAX.

    munlock_vma_page() is usually called for THP, as small pages go through
    the pagevec path.

    Let's make nr_pages a signed int.

    A similar fix in 6cdb18ad98a4 ("mm/vmstat: fix overflow in
    mod_zone_page_state()") used the `long' type, but `int' is OK here for a
    count of the number of sub-pages in a huge page.
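
    A small userspace demonstration of the promotion problem (assumes an
    LP64 system; 512 stands in for HPAGE_PMD_NR of a 2MB THP):

    #include <stdio.h>

    int main(void)
    {
        unsigned int nr_pages = 512;

        /* -nr_pages is evaluated in unsigned arithmetic, so the negative
         * delta becomes a value near UINT_MAX before the implicit
         * conversion to long in __mod_zone_page_state(). */
        long from_unsigned = -nr_pages;
        long from_signed = -(int)nr_pages;

        printf("unsigned nr_pages: %ld\n", from_unsigned); /* 4294966784 */
        printf("signed nr_pages:   %ld\n", from_signed);   /* -512 */
        return 0;
    }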

    Fixes: ff6a6da60b89 ("mm: accelerate munlock() treatment of THP pages")
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Tetsuo Handa
    Tested-by: Tetsuo Handa
    Cc: Michel Lespinasse
    Acked-by: Michal Hocko
    Cc: [4.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

16 Jan, 2016

3 commits

  • Since can_do_mlock() only returns 1 or 0, make it return bool.

    No functional change.

    [akpm@linux-foundation.org: update declaration in mm.h]
    Signed-off-by: Wang Xiaoqiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Xiaoqiang
     
  • Before THP refcounting rework, THP was not allowed to cross VMA
    boundary. So, if we have THP and we split it, PG_mlocked can be safely
    transferred to small pages.

    With the new THP refcounting and a naive approach to mlocking we can end
    up with this scenario:
    1. we have a mlocked THP, which belongs to one VM_LOCKED VMA.
    2. the process does munlock() on *part* of the THP:
       - the VMA is split into two, one of them VM_LOCKED;
       - the huge PMD is split into a PTE table;
       - the THP is still mlocked;
    3. split_huge_page():
       - transfers PG_mlocked to *all* small pages regardless of whether
         they belong to any VM_LOCKED VMA.

    We probably could munlock() all small pages on split_huge_page(), but I
    think we already have an accounting issue at step two.

    Instead of forbidding mlocked pages altogether, we just avoid mlocking
    PTE-mapped THPs and munlock THPs on split_huge_pmd().

    This means PTE-mapped THPs will be on normal lru lists and will be split
    under memory pressure by vmscan. After the split vmscan will detect
    unevictable small pages and mlock them.

    With this approach we shouldn't hit situation like described above.

    Signed-off-by: Kirill A. Shutemov
    Cc: Sasha Levin
    Cc: Aneesh Kumar K.V
    Cc: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • With the new refcounting, a THP can belong to several VMAs. This makes
    it tricky to track THP pages when they are partially mlocked. It can
    lead to leaking mlocked pages into non-VM_LOCKED VMAs and other
    problems.

    With this patch we split all pages on mlock and avoid faulting in or
    collapsing new THPs in VM_LOCKED VMAs.

    I've tried an alternative approach: do not mark THP pages mlocked and
    keep them on the normal LRUs. This way vmscan could try to split huge
    pages under memory pressure and free up subpages which don't belong to
    VM_LOCKED VMAs. But this is a user-visible change: we screw up the
    Mlocked accounting reported in meminfo, so I had to leave this approach
    aside.

    We can bring something better later, but this should be good enough for
    now.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

15 Jan, 2016

1 commit


06 Nov, 2015

6 commits

  • The previous patch introduced a flag specifying that pages in a VMA
    should be placed on the unevictable LRU, but not made present, when the
    area is created. This patch adds the ability to set this state via the
    new mlock system calls.

    We add MLOCK_ONFAULT for mlock2 and MCL_ONFAULT for mlockall.
    MLOCK_ONFAULT will set the VM_LOCKONFAULT modifier for VM_LOCKED.
    MCL_ONFAULT should be used as a modifier to the two other mlockall flags.
    When used with MCL_CURRENT, all current mappings will be marked with
    VM_LOCKED | VM_LOCKONFAULT. When used with MCL_FUTURE, the mm->def_flags
    will be marked with VM_LOCKED | VM_LOCKONFAULT. When used with both
    MCL_CURRENT and MCL_FUTURE, all current mappings and mm->def_flags will be
    marked with VM_LOCKED | VM_LOCKONFAULT.

    Prior to this patch, mlockall() will unconditionally clear the
    mm->def_flags any time it is called without MCL_FUTURE. This behavior is
    maintained after adding MCL_ONFAULT. If a call to mlockall(MCL_FUTURE) is
    followed by mlockall(MCL_CURRENT), the mm->def_flags will be cleared and
    new VMAs will be unlocked. This remains true with or without MCL_ONFAULT
    in either mlockall() invocation.

    munlock() will unconditionally clear both VMA flags. munlockall()
    unconditionally clears both VMA flags on all VMAs and in the
    mm->def_flags field.
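
    A minimal userspace sketch of the new interface described above (the
    MLOCK_ONFAULT/MCL_ONFAULT values and SYS_mlock2 come from the new uapi
    headers; the fallback defines below are assumptions for older headers,
    and a sufficient RLIMIT_MEMLOCK is assumed):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef MLOCK_ONFAULT
    #define MLOCK_ONFAULT 0x01
    #endif
    #ifndef MCL_ONFAULT
    #define MCL_ONFAULT 4
    #endif
    #ifndef SYS_mlock2
    #error "this sketch assumes kernel headers that define SYS_mlock2"
    #endif

    int main(void)
    {
        size_t len = 64UL << 20; /* 64MB anonymous mapping */
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* Lock the range, but fault pages in only as they are touched. */
        if (syscall(SYS_mlock2, buf, len, MLOCK_ONFAULT))
            perror("mlock2(MLOCK_ONFAULT)");

        /* Or apply lock-on-fault to all current and future mappings. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT))
            perror("mlockall(MCL_ONFAULT)");

        buf[0] = 1; /* this page is now faulted in and unevictable */
        munmap(buf, len);
        return 0;
    }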

    Signed-off-by: Eric B Munson
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Jonathan Corbet
    Cc: Catalin Marinas
    Cc: Geert Uytterhoeven
    Cc: Guenter Roeck
    Cc: Heiko Carstens
    Cc: Kirill A. Shutemov
    Cc: Michael Kerrisk
    Cc: Ralf Baechle
    Cc: Shuah Khan
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     
  • The cost of faulting in all memory to be locked can be very high when
    working with large mappings. If only portions of the mapping will be used
    this can incur a high penalty for locking.

    For the example of a large file, this is the usage pattern for a large
    statistical language model (this probably applies to other statistical
    or graphical models as well). For the security example, consider any
    application transacting in data that cannot be swapped out (credit card
    data, medical records, etc).

    This patch introduces the ability to request that pages are not
    pre-faulted, but are placed on the unevictable LRU when they are finally
    faulted in. The VM_LOCKONFAULT flag will be used together with VM_LOCKED
    and has no effect when set without VM_LOCKED. Setting the VM_LOCKONFAULT
    flag for a VMA will cause pages faulted into that VMA to be added to the
    unevictable LRU when they are faulted or if they are already present, but
    will not cause any missing pages to be faulted in.

    Exposing this new lock state means that we cannot overload the meaning of
    the FOLL_POPULATE flag any longer. Prior to this patch it was used to
    mean that the VMA for a fault was locked. This means we need the new
    FOLL_MLOCK flag to communicate the locked state of a VMA. FOLL_POPULATE
    will now only control if the VMA should be populated and in the case of
    VM_LOCKONFAULT, it will not be set.

    Signed-off-by: Eric B Munson
    Acked-by: Kirill A. Shutemov
    Acked-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Jonathan Corbet
    Cc: Catalin Marinas
    Cc: Geert Uytterhoeven
    Cc: Guenter Roeck
    Cc: Heiko Carstens
    Cc: Michael Kerrisk
    Cc: Ralf Baechle
    Cc: Shuah Khan
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     
  • With the refactored mlock code, introduce a new system call for mlock.
    The new call will allow the user to specify what lock states are being
    added. mlock2 is trivial at the moment, but a follow on patch will add a
    new mlock state making it useful.

    Signed-off-by: Eric B Munson
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Heiko Carstens
    Cc: Geert Uytterhoeven
    Cc: Catalin Marinas
    Cc: Stephen Rothwell
    Cc: Guenter Roeck
    Cc: Jonathan Corbet
    Cc: Kirill A. Shutemov
    Cc: Michael Kerrisk
    Cc: Ralf Baechle
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     
  • mlock() allows a user to prevent page-out of program memory, but this
    comes at the cost of faulting in the entire mapping when it is
    allocated. For large mappings where the entire area is not necessary
    this is not ideal. Instead of forcing all locked pages to be present
    when they are
    allocated, this set creates a middle ground. Pages are marked to be
    placed on the unevictable LRU (locked) when they are first used, but they
    are not faulted in by the mlock call.

    This series introduces a new mlock() system call that takes a flags
    argument along with the start address and size. This flags argument gives
    the caller the ability to request memory be locked in the traditional way,
    or to be locked after the page is faulted in. A new MCL flag is added to
    mirror the lock on fault behavior from mlock() in mlockall().

    There are two main use cases that this set covers. The first is the
    security focussed mlock case. A buffer is needed that cannot be written
    to swap. The maximum size is known, but on average the memory used is
    significantly less than this maximum. With lock on fault, the buffer is
    guaranteed to never be paged out without consuming the maximum size every
    time such a buffer is created.

    The second use case is focussed on performance. Portions of a large file
    are needed and we want to keep the used portions in memory once accessed.
    This is the case for large graphical models where the path through the
    graph is not known until run time. The entire graph is unlikely to be
    used in a given invocation, but once a node has been used it needs to stay
    resident for further processing. Given these constraints we have a number
    of options. We can potentially waste a large amount of memory by mlocking
    the entire region (this can also cause a significant stall at startup as
    the entire file is read in). We can mlock every page as we access them
    without tracking if the page is already resident but this introduces large
    overhead for each access. The third option is mapping the entire region
    with PROT_NONE and using a signal handler for SIGSEGV to
    mprotect(PROT_READ) and mlock() the needed page. Doing this page at a
    time adds a significant performance penalty. Batching can be used to
    mitigate this overhead, but in order to safely avoid trying to mprotect
    pages outside of the mapping, the boundaries of each mapping to be used in
    this way must be tracked and available to the signal handler. This is
    precisely what the mm system in the kernel should already be doing.

    For mlock(MLOCK_ONFAULT) the user is charged against RLIMIT_MEMLOCK as if
    mlock(MLOCK_LOCKED) or mmap(MAP_LOCKED) was used, i.e. when the VMA is
    created, not when the pages are faulted in. For mlockall(MCL_ONFAULT) the
    user is charged as if MCL_FUTURE was used. This decision was made to keep
    the accounting checks out of the page fault path.

    To illustrate the benefit of this set I wrote a test program that mmaps a
    5 GB file filled with random data and then makes 15,000,000 accesses to
    random addresses in that mapping. The test program was run 20 times for
    each setup. Results are reported for two program portions, setup and
    execution. The setup phase is calling mmap and optionally mlock on the
    entire region. For most experiments this is trivial, but it highlights
    the cost of faulting in the entire region. Results are averages across
    the 20 runs in milliseconds.

    mmap with mlock(MLOCK_LOCKED) on entire range:
    Setup avg: 8228.666
    Processing avg: 8274.257

    mmap with mlock(MLOCK_LOCKED) before each access:
    Setup avg: 0.113
    Processing avg: 90993.552

    mmap with PROT_NONE and signal handler and batch size of 1 page:
    With the default value in max_map_count, this gets ENOMEM as I attempt
    to change the permissions; after upping the sysctl significantly I get:
    Setup avg: 0.058
    Processing avg: 69488.073

    mmap with PROT_NONE and signal handler and batch size of 8 pages:
    Setup avg: 0.068
    Processing avg: 38204.116

    mmap with PROT_NONE and signal handler and batch size of 16 pages:
    Setup avg: 0.044
    Processing avg: 29671.180

    mmap with mlock(MLOCK_ONFAULT) on entire range:
    Setup avg: 0.189
    Processing avg: 17904.899

    The signal handler in the batch cases faulted in memory in two steps to
    avoid having to know the start and end of the faulting mapping. The first
    step covers the page that caused the fault as we know that it will be
    possible to lock. The second step speculatively tries to mlock and
    mprotect the batch size - 1 pages that follow. There may be a clever way
    to avoid this without having the program track each mapping to be covered
    by this handler in a globally accessible structure, but I could not find
    one. It should be noted that with a large enough batch size this two-step
    fault handler can still cause the program to crash if it reaches far
    beyond the end of the mapping.
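
    For reference, a simplified sketch of the PROT_NONE plus SIGSEGV-handler
    workaround described above (a hypothetical illustration, not the actual
    test program; it assumes all faults come from the tracked mapping and,
    like the workaround itself, ignores async-signal-safety caveats):

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define BATCH_PAGES 8

    static char *region;      /* start of the PROT_NONE mapping */
    static size_t region_len;
    static long page_size;

    static void on_segv(int sig, siginfo_t *si, void *uctx)
    {
        char *fault = (char *)((uintptr_t)si->si_addr &
                               ~(uintptr_t)(page_size - 1));
        size_t len = BATCH_PAGES * page_size;

        /* Clamp the batch so we never mprotect past the mapping's end. */
        if (fault + len > region + region_len)
            len = region + region_len - fault;

        /* Make the faulting batch readable and pin it; the faulting
         * instruction is retried after the handler returns. mlock() is
         * best-effort here and may fail under a small RLIMIT_MEMLOCK. */
        mprotect(fault, len, PROT_READ);
        mlock(fault, len);
        (void)sig; (void)uctx;
    }

    int main(void)
    {
        struct sigaction sa = { 0 };

        page_size = sysconf(_SC_PAGESIZE);
        region_len = 64UL << 20;

        sa.sa_sigaction = on_segv;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);

        region = mmap(NULL, region_len, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (region == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* The first read of each batch faults, gets unprotected and
         * mlocked by the handler, and then succeeds on retry. */
        volatile char c = region[12345];
        (void)c;
        munmap(region, region_len);
        return 0;
    }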

    These results show that if the developer knows that a majority of the
    mapping will be used, it is better to try and fault it in at once,
    otherwise mlock(MLOCK_ONFAULT) is significantly faster.

    The performance cost of these patches is minimal on the two benchmarks I
    have tested (stream and kernbench). The following are the average values
    across 20 runs of stream and 10 runs of kernbench, after a warmup run
    whose results were discarded.

    Avg throughput in MB/s from stream using 1000000 element arrays
    Test       4.2-rc1     4.2-rc1+lock-on-fault
    Copy:      10,566.5    10,421
    Scale:     10,685      10,503.5
    Add:       12,044.1    11,814.2
    Triad:     12,064.8    11,846.3

    Kernbench optimal load
                        4.2-rc1     4.2-rc1+lock-on-fault
    Elapsed Time        78.453      78.991
    User Time           64.2395     65.2355
    System Time         9.7335      9.7085
    Context Switches    22211.5     22412.1
    Sleeps              14965.3     14956.1

    This patch (of 6):

    Extending the mlock system call is very difficult because it currently
    does not take a flags argument. A later patch in this set will extend
    mlock to support a middle ground between pages that are locked and
    faulted in immediately and unlocked pages. To pave the way for the new
    system call, the code needs some reorganization so that all the actual
    entry points have to do is check input and translate it to VMA flags.

    Signed-off-by: Eric B Munson
    Acked-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Michael Kerrisk
    Cc: Catalin Marinas
    Cc: Geert Uytterhoeven
    Cc: Guenter Roeck
    Cc: Heiko Carstens
    Cc: Jonathan Corbet
    Cc: Ralf Baechle
    Cc: Shuah Khan
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     
  • linux/mm.h provides the offset_in_page() macro. Let's use the already
    defined macro instead of open-coding (addr & ~PAGE_MASK).
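
    A userspace illustration of the equivalence (the PAGE_* constants below
    are stand-ins for the kernel's, assuming 4KB pages):

    #include <stdio.h>

    #define PAGE_SIZE_ 4096UL
    #define PAGE_MASK_ (~(PAGE_SIZE_ - 1))
    #define offset_in_page_(p) ((unsigned long)(p) & ~PAGE_MASK_)

    int main(void)
    {
        unsigned long addr = 0x401234;

        /* Both expressions yield the offset within the page: 0x234. */
        printf("%#lx %#lx\n", addr & ~PAGE_MASK_, offset_in_page_(addr));
        return 0;
    }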

    Signed-off-by: Alexander Kuleshov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Kuleshov
     
  • In the mlockall syscall wrapper, the code after the out label just does
    a return. Remove the goto out statements and return the error values
    directly.

    Also, instead of rewriting the ret variable before every if-check, move
    the returns into the error path under the if-check.

    An objdump asm listing showed a reduction of a few asm lines. The object
    file size decreased from 220592 bytes to 220528 bytes for me (on
    aarch64).

    Signed-off-by: Alexey Klimov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Klimov
     

05 Sep, 2015

1 commit

  • vma->vm_userfaultfd_ctx is yet another vma parameter that vma_merge
    must be aware of, so that we can merge vmas back like they were
    originally before arming the userfaultfd on some memory range.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

15 Apr, 2015

4 commits

  • It's odd that we have populate_vma_page_range() and __mm_populate() in
    mm/mlock.c. They implement generic memory population, and mlocking is
    just one possible side effect, applied if VM_LOCKED is set.

    __get_user_pages() is the core of the implementation. Let's move the
    code into mm/gup.c.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Linus Torvalds
    Acked-by: David Rientjes
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • This is preparation for moving mm_populate()-related code out of
    mm/mlock.c.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Linus Torvalds
    Acked-by: David Rientjes
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • __mlock_vma_pages_range() doesn't necessarily mlock pages. It depends on
    vma flags. The same codepath is used for MAP_POPULATE.

    Let's rename __mlock_vma_pages_range() to populate_vma_page_range().

    This patch also drops mlock_vma_pages_range() references from
    documentation. It has gone in cea10a19b797 ("mm: directly use
    __mlock_vma_pages_range() in find_extend_vma()").

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Linus Torvalds
    Acked-by: David Rientjes
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • After commit a1fde08c74e9 ("VM: skip the stack guard page lookup in
    get_user_pages only for mlock") FOLL_MLOCK has lost its original
    meaning: we don't necessarily mlock the page if the flags is set -- we
    also take VM_LOCKED into consideration.

    Since we use the same codepath for __mm_populate(), let's rename
    FOLL_MLOCK to FOLL_POPULATE.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Linus Torvalds
    Acked-by: David Rientjes
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

13 Mar, 2015

1 commit

  • A userspace call to mmap(MAP_LOCKED) may result in the successful
    locking of memory while also producing a confusing audit log denial.
    can_do_mlock checks capable and rlimit. If either of these returns
    positive, can_do_mlock returns true. The capable check leads to an LSM
    hook used by apparmor and selinux, which produces the audit denial.
    Reordering so that rlimit is checked first eliminates the denial on
    success, recording a denial only when the lock fails as a result of the
    capability being denied.
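
    A sketch of the reordering described above (simplified, not the exact
    kernel function): consult the audit-free rlimit check before the
    LSM-audited capability check.

    #include <linux/capability.h>
    #include <linux/mm.h>
    #include <linux/sched.h>
    #include <linux/types.h>

    static bool can_do_mlock_reordered(void)
    {
        if (rlimit(RLIMIT_MEMLOCK) != 0)  /* no LSM hook, no audit record */
            return true;
        if (capable(CAP_IPC_LOCK))        /* may log an audit denial */
            return true;
        return false;
    }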

    Signed-off-by: Jeff Vander Stoep
    Acked-by: Nick Kralevich
    Cc: Jeff Vander Stoep
    Cc: Sasha Levin
    Cc: "Paul E. McKenney"
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Paul Cassella
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Vander Stoep
     

13 Oct, 2014

1 commit

  • Pull RCU updates from Ingo Molnar:
    "The main changes in this cycle were:

    - changes related to No-CBs CPUs and NO_HZ_FULL

    - RCU-tasks implementation

    - torture-test updates

    - miscellaneous fixes

    - locktorture updates

    - RCU documentation updates"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (81 commits)
    workqueue: Use cond_resched_rcu_qs macro
    workqueue: Add quiescent state between work items
    locktorture: Cleanup header usage
    locktorture: Cannot hold read and write lock
    locktorture: Fix __acquire annotation for spinlock irq
    locktorture: Support rwlocks
    rcu: Eliminate deadlock between CPU hotplug and expedited grace periods
    locktorture: Document boot/module parameters
    rcutorture: Rename rcutorture_runnable parameter
    locktorture: Add test scenario for rwsem_lock
    locktorture: Add test scenario for mutex_lock
    locktorture: Make torture scripting account for new _runnable name
    locktorture: Introduce torture context
    locktorture: Support rwsems
    locktorture: Add infrastructure for torturing read locks
    torture: Address race in module cleanup
    locktorture: Make statistics generic
    locktorture: Teach about lock debugging
    locktorture: Support mutexes
    locktorture: Add documentation
    ...

    Linus Torvalds
     

10 Oct, 2014

2 commits

  • Dump the contents of the relevant mm_struct when we hit the bug
    condition.

    Signed-off-by: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     
  • Trivially convert a few VM_BUG_ON calls to VM_BUG_ON_VMA to extract
    more information when they trigger.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Sasha Levin
    Reviewed-by: Naoya Horiguchi
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Vlastimil Babka
    Cc: Michel Lespinasse
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

08 Sep, 2014

1 commit

  • RCU-tasks requires the occasional voluntary context switch
    from CPU-bound in-kernel tasks. In some cases, this requires
    instrumenting cond_resched(). However, there is some reluctance
    to countenance unconditionally instrumenting cond_resched() (see
    http://lwn.net/Articles/603252/), so this commit creates a separate
    cond_resched_rcu_qs() that may be used in place of cond_resched() in
    locations prone to long-duration in-kernel looping.

    This commit currently instruments only RCU-tasks. Future possibilities
    include also instrumenting RCU, RCU-bh, and RCU-sched in order to reduce
    IPI usage.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

07 Aug, 2014

1 commit

  • Add a comment describing the circumstances in which
    __lock_page_or_retry() will or will not release the mmap_sem when
    returning 0.

    Add comments to lock_page_or_retry()'s callers (filemap_fault(),
    do_swap_page()) noting the impact on VM_FAULT_RETRY returns.

    Add comments on up the call tree, particularly replacing the false "We
    return with mmap_sem still held" comments.

    Signed-off-by: Paul Cassella
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Cassella
     

08 Apr, 2014

1 commit

  • A BUG_ON(!PageLocked) was triggered in mlock_vma_page() by Sasha Levin
    fuzzing with trinity. The call site try_to_unmap_cluster() does not lock
    the pages other than its check_page parameter (which is already locked).

    The BUG_ON in mlock_vma_page() is not documented and its purpose is
    somewhat unclear, but apparently it serializes against page migration,
    which could otherwise fail to transfer the PG_mlocked flag. This would
    not be fatal, as the page would be eventually encountered again, but
    NR_MLOCK accounting would become distorted nevertheless. This patch adds
    a comment to the BUG_ON in mlock_vma_page() and munlock_vma_page() to that
    effect.

    The call site try_to_unmap_cluster() is fixed so that for page !=
    check_page, trylock_page() is attempted (to avoid possible deadlocks as we
    already have check_page locked) and mlock_vma_page() is performed only
    upon success. If the page lock cannot be obtained, the page is left
    without PG_mlocked, which is again not a problem in the whole unevictable
    memory design.

    Signed-off-by: Vlastimil Babka
    Signed-off-by: Bob Liu
    Reported-by: Sasha Levin
    Cc: Wanpeng Li
    Cc: Michel Lespinasse
    Cc: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

24 Jan, 2014

2 commits

  • Most of the VM_BUG_ON assertions are performed on a page. Usually, when
    one of these assertions fails we'll get a BUG_ON with a call stack and
    the registers.

    Based on recent requests to add a small piece of code that dumps the
    page to various VM_BUG_ON sites, I've noticed that the page dump is
    quite useful to people debugging issues in mm.

    This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what
    VM_BUG_ON() does, also dumps the page before executing the actual
    BUG_ON.

    [akpm@linux-foundation.org: fix up includes]
    Signed-off-by: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     
  • Since commit ff6a6da60b89 ("mm: accelerate munlock() treatment of THP
    pages") munlock skips tail pages of a munlocked THP page. There is some
    attempt to prevent bad consequences of racing with a THP page split, but
    code inspection indicates that there are two problems that may lead to a
    non-fatal, yet wrong outcome.

    First, __split_huge_page_refcount() copies flags including PageMlocked
    from the head page to the tail pages. Clearing PageMlocked by
    munlock_vma_page() in the middle of this operation might result in part
    of tail pages left with PageMlocked flag. As the head page still
    appears to be a THP page until all tail pages are processed,
    munlock_vma_page() might think it munlocked the whole THP page and skip
    all the former tail pages. Before ff6a6da60, those pages would be
    cleared in further iterations of munlock_vma_pages_range(), but NR_MLOCK
    would still become undercounted (related to the next point).

    Second, NR_MLOCK accounting is based on a call to hpage_nr_pages() after
    PageMlocked is cleared. The accounting might also become inconsistent
    due to a race with __split_huge_page_refcount():

    - undercount when HUGE_PMD_NR is subtracted, but some tail pages are
    left with PageMlocked set and counted again (only possible before
    ff6a6da60)

    - overcount when hpage_nr_pages() sees a normal page (split has already
    finished), but the parallel split has meanwhile cleared PageMlocked from
    additional tail pages

    This patch prevents both problems via extending the scope of lru_lock in
    munlock_vma_page(). This is convenient because:

    - __split_huge_page_refcount() takes lru_lock for its whole operation

    - munlock_vma_page() typically takes lru_lock anyway for page isolation

    As this becomes a second function where page isolation is done with
    lru_lock already held, factor this out to a new
    __munlock_isolate_lru_page() function and clean up the code around.

    [akpm@linux-foundation.org: avoid a coding-style ugly]
    Signed-off-by: Vlastimil Babka
    Cc: Sasha Levin
    Cc: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

22 Jan, 2014

1 commit

  • All mlock related syscalls prepare lock limits, lengths and start
    parameters with the mmap_sem held. Move this logic outside of the
    critical region. For the case of mlock, continue incrementing the
    amount already locked by mm->locked_vm with the rwsem taken.

    Signed-off-by: Davidlohr Bueso
    Cc: Rik van Riel
    Reviewed-by: Michel Lespinasse
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

03 Jan, 2014

2 commits

  • Commit 7225522bb429 ("mm: munlock: batch non-THP page isolation and
    munlock+putback using pagevec") introduced __munlock_pagevec() to speed
    up munlock by holding lru_lock over multiple isolated pages. Pages that
    fail to be isolated are put_page()d immediately, also within the lock.

    This can lead to deadlock when __munlock_pagevec() becomes the holder of
    the last page pin and put_page() leads to __page_cache_release(), which
    also locks lru_lock. The deadlock has been observed by Sasha Levin
    using trinity.

    This patch avoids the deadlock by deferring put_page() operations until
    lru_lock is released. Another pagevec (which is also used by later
    phases of the function) is reused to gather the pages for the put_page()
    operation.

    Signed-off-by: Vlastimil Babka
    Reported-by: Sasha Levin
    Cc: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Since commit ff6a6da60b89 ("mm: accelerate munlock() treatment of THP
    pages") munlock skips tail pages of a munlocked THP page. However, when
    the head page already has PageMlocked unset, it will not skip the tail
    pages.

    Commit 7225522bb429 ("mm: munlock: batch non-THP page isolation and
    munlock+putback using pagevec") has added a PageTransHuge() check which
    contains VM_BUG_ON(PageTail(page)). Sasha Levin found this triggered
    using trinity, on the first tail page of a THP page without PageMlocked
    flag.

    This patch fixes the issue by skipping tail pages also in the case when
    PageMlocked flag is unset. There is still a possibility of race with
    THP page split between clearing PageMlocked and determining how many
    pages to skip. The race might result in former tail pages not being
    skipped, which is however no longer a bug, as during the skip the
    PageTail flags are cleared.

    However this race also affects correctness of NR_MLOCK accounting, which
    is to be fixed in a separate patch.

    Signed-off-by: Vlastimil Babka
    Reported-by: Sasha Levin
    Cc: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Bob Liu
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

01 Oct, 2013

1 commit

  • The function __munlock_pagevec_fill() introduced in commit 7a8010cd3627
    ("mm: munlock: manual pte walk in fast path instead of
    follow_page_mask()") uses pmd_addr_end() to restrict its operation to
    the current page table.

    This is insufficient on architectures/configurations where pmd is folded
    and pmd_addr_end() just returns the end of the full range to be walked.
    In this case, it allows pte++ to walk off the end of a page table
    resulting in unpredictable behaviour.

    This patch fixes the function by using pgd_addr_end() and pud_addr_end()
    before pmd_addr_end(), which will yield correct page table boundary on
    all configurations. This is similar to what existing page walkers do
    when walking each level of the page table.

    Additionally, the patch clarifies a comment for the get_locked_pte()
    call in the function.

    Signed-off-by: Vlastimil Babka
    Reported-by: Fengguang Wu
    Reviewed-by: Bob Liu
    Cc: Jörn Engel
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

25 Sep, 2013

1 commit

  • There is a loop in do_mlockall() that lacks a preemption point, which
    means that the following can happen on non-preemptible builds of the
    kernel. Dave Jones reports:

    "My fuzz tester keeps hitting this. Every instance shows the non-irq
    stack came in from mlockall. I'm only seeing this on one box, but
    that has more ram (8gb) than my other machines, which might explain
    it.

    INFO: rcu_preempt self-detected stall on CPU { 3} (t=6500 jiffies g=470344 c=470343 q=0)
    sending NMI to all CPUs:
    NMI backtrace for cpu 3
    CPU: 3 PID: 29664 Comm: trinity-child2 Not tainted 3.11.0-rc1+ #32
    Call Trace:
    lru_add_drain_all+0x15/0x20
    SyS_mlockall+0xa5/0x1a0
    tracesys+0xdd/0xe2"

    This commit addresses this problem by inserting the required preemption
    point.

    Reported-by: Dave Jones
    Signed-off-by: Paul E. McKenney
    Cc: KOSAKI Motohiro
    Cc: Michel Lespinasse
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul E. McKenney
     

12 Sep, 2013

6 commits

  • Currently munlock_vma_pages_range() calls follow_page_mask() to obtain
    each individual struct page. This entails repeated full page table
    translations and taking the page table lock for each page separately.

    This patch avoids the costly follow_page_mask() where possible, by
    iterating over ptes within a single pmd under a single page table lock.
    The first pte is obtained by get_locked_pte() for the non-THP page
    acquired by the initial follow_page_mask(). The rest of the on-stack
    pagevec for munlock is filled up using a pte walk, as long as
    pte_present() and vm_normal_page() are sufficient to obtain the struct
    page.

    After this patch, a 14% speedup was measured for munlocking a 56GB large
    memory area with THP disabled.

    Signed-off-by: Vlastimil Babka
    Cc: Jörn Engel
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The performance of the fast path in munlock_vma_range() can be further
    improved by avoiding atomic ops of a redundant get_page()/put_page() pair.

    When calling get_page() during page isolation, we already have the pin
    from follow_page_mask(). This pin will be then returned by
    __pagevec_lru_add(), after which we do not reference the pages anymore.

    After this patch, an 8% speedup was measured for munlocking a 56GB large
    memory area with THP disabled.

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Jörn Engel
    Acked-by: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • After introducing batching by pagevecs into munlock_vma_range(), we can
    further improve performance by bypassing the copying into per-cpu pagevec
    and the get_page/put_page pair associated with that. Instead we perform
    LRU putback directly from our pagevec. However, this is possible only for
    single-mapped pages that are evictable after munlock. Unevictable pages
    require rechecking after putting them on the unevictable list, so for
    those we fall back to putback_lru_page(), which handles that.

    After this patch, a 13% speedup was measured for munlocking a 56GB large
    memory area with THP disabled.

    [akpm@linux-foundation.org:clarify comment]
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Jörn Engel
    Acked-by: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Depending on the previous patch, which introduced batched isolation in
    munlock_vma_range(), we can also batch the updates of the NR_MLOCK page
    stats. After the whole pagevec is processed for page isolation, the
    stats are updated only once, with the number of successful isolations.
    There were, however, no measurable performance gains.

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Jörn Engel
    Acked-by: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Currently, munlock_vma_range() calls munlock_vma_page on each page in a
    loop, which results in repeated taking and releasing of the lru_lock
    spinlock for isolating pages one by one. This patch batches the munlock
    operations using an on-stack pagevec, so that isolation is done under
    single lru_lock. For THP pages, the old behavior is preserved as they
    might be split while putting them into the pagevec. After this patch, a
    9% speedup was measured for munlocking a 56GB large memory area with THP
    disabled.

    A new function __munlock_pagevec() is introduced that takes a pagevec
    and:
    1) clears PageMlocked and isolates all pages under lru_lock. Zone page
       stats can also be updated using the variant which assumes disabled
       interrupts.
    2) finishes the munlock and lru putback on all pages under their
       lock_page.
    Note that previously, lock_page covered also the PageMlocked clearing
    and page isolation, but it is not needed for those operations.

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Jörn Engel
    Acked-by: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • In munlock_vma_range(), lru_add_drain() is currently called in a loop
    before each munlock_vma_page() call.

    This is suboptimal for performance when munlocking many pages. The
    benefits of per-cpu pagevec for batching the LRU putback are removed since
    the pagevec only holds at most one page from the previous loop's
    iteration.

    The lru_add_drain() call also does not serve any purpose for correctness
    - it does not even drain pagevecs of all cpus. The munlock code already
    expects and handles situations where a page cannot be isolated from the
    LRU (e.g. because it is on some per-cpu pagevec).

    The history of the (uncommented) call also suggests that it appeared
    there as an oversight rather than intentionally. Before commit ff6a6da6
    ("mm: accelerate munlock() treatment of THP pages") the call happened
    only once upon entering the function. The commit moved the call into the
    while loop. So while the other changes in the commit improved munlock
    performance for THP pages, it introduced the abovementioned suboptimal
    per-cpu pagevec usage.

    Further in history, before commit 408e82b7 ("mm: munlock use
    follow_page"), munlock_vma_pages_range() was just a wrapper around
    __mlock_vma_pages_range which performed both mlock and munlock depending
    on a flag. However, before ba470de4 ("mmap: handle mlocked pages during
    map, remap, unmap") the function handled only mlock, not munlock. The
    lru_add_drain call thus comes from the implementation in commit b291f000
    ("mlock: mlocked pages are unevictable") and was intended only for
    mlocking, not munlocking. The original intention of draining the LRU
    pagevec at mlock time was to ensure the pages were on the LRU before the
    lock operation so that they could be placed on the unevictable list
    immediately. There is very little motivation to do the same in the
    munlock path, particularly for every single page.

    This patch therefore removes the call completely. After removing the
    call, a 10% speedup was measured for munlock() of a 56GB large memory area
    with THP disabled.

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Jörn Engel
    Acked-by: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

29 Mar, 2013

1 commit

  • This reverts commit 186930500985 ("mm: introduce VM_POPULATE flag to
    better deal with racy userspace programs").

    VM_POPULATE only has any effect when userspace plays racy games with
    vmas by trying to unmap and remap memory regions that mmap or mlock are
    operating on.

    Also, the only effect of VM_POPULATE when userspace plays such games is
    that it avoids populating new memory regions that get remapped into the
    address range that was being operated on by the original mmap or mlock
    calls.

    Let's remove VM_POPULATE as there isn't any strong argument to mandate a
    new vm_flag.

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

28 Feb, 2013

1 commit

  • munlock_vma_pages_range() was always incrementing addresses by PAGE_SIZE
    at a time. When munlocking THP pages (or the huge zero page), this
    resulted in taking the mm->page_table_lock 512 times in a row.

    We can do better by making use of the page_mask returned by
    follow_page_mask (for the huge zero page case), or the size of the page
    munlock_vma_page() operated on (for the true THP page case).

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

24 Feb, 2013

1 commit

  • Use long type for page counts in mm_populate() so as to avoid integer
    overflow when running the following test code:

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        void *p = mmap(NULL, 0x100000000000, PROT_READ,
                       MAP_PRIVATE | MAP_ANON, -1, 0);
        printf("p: %p\n", p);
        mlockall(MCL_CURRENT);
        printf("done\n");
        return 0;
    }

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse