01 Dec, 2016

1 commit

  • The following program triggers BUG() in munlock_vma_pages_range():

    // autogenerated by syzkaller (http://github.com/google/syzkaller)
    #define _GNU_SOURCE
    #include <sys/mman.h>

    int main()
    {
            mmap((void *)0x20105000ul, 0xc00000ul, 0x2ul, 0x2172ul, -1, 0);
            mremap((void *)0x201fd000ul, 0x4000ul, 0xc00000ul, 0x3ul,
                   (void *)0x203f0000ul);
            return 0;
    }

    The test case constructs a situation in which munlock_vma_pages_range()
    finds a PTE-mapped THP head in the middle of a page table and, by
    mistake, skips HPAGE_PMD_NR pages after it.

    As a result, on the next iteration it hits the middle of a PMD-mapped
    THP and gets upset when it sees an mlocked tail page.

    The solution is to skip HPAGE_PMD_NR pages only if the THP was mlocked
    during munlock_vma_page(). That guarantees the page is PMD-mapped, as
    we never mlock PTE-mapped THPs.

    Fixes: e90309c9f772 ("thp: allow mlocked THP again")
    Link: http://lkml.kernel.org/r/20161115132703.7s7rrgmwttegcdh4@black.fi.intel.com
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Dmitry Vyukov
    Cc: Konstantin Khlebnikov
    Cc: Andrey Ryabinin
    Cc: syzkaller
    Cc: Andrea Arcangeli
    Cc: [4.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

08 Oct, 2016

2 commits

  • When a VMA already has VM_LOCKED|VM_LOCKONFAULT set (via
    mlock2(,MLOCK_ONFAULT)), it can be populated again with mlock(), which
    sets the VM_LOCKED flag only.

    There is a hole in mlock_fixup() that increases mm->locked_vm twice
    even though the two operations are on the same VMA and both carry the
    VM_LOCKED flag.

    The issue can be reproduced with the following code:

    mlock2(p, 1024 * 64, MLOCK_ONFAULT); //VM_LOCKED|VM_LOCKONFAULT
    mlock(p, 1024 * 64); //VM_LOCKED

    VmLck in /proc/pid/status then shows an increase to 128k, twice the
    locked range.

    When a VMA's vm_flags change and the new vm_flags include VM_LOCKED,
    it is not necessarily a "newly locked" VMA. This patch corrects the
    bug by not incrementing mm->locked_vm when the old vm_flags already
    have VM_LOCKED set.
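
    A minimal sketch of the guard this implies in mlock_fixup(), assuming
    the pre-change flags are saved in a local old_flags before the VMA is
    merged or split (a sketch, not the literal patch):

    nr_pages = (end - start) >> PAGE_SHIFT;
    if (!lock)
            nr_pages = -nr_pages;   /* munlock: subtract */
    else if (old_flags & VM_LOCKED)
            nr_pages = 0;           /* already locked: no double accounting */
    mm->locked_vm += nr_pages;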

    Link: http://lkml.kernel.org/r/1472554781-9835-3-git-send-email-wei.guo.simon@gmail.com
    Signed-off-by: Simon Guo
    Acked-by: Kirill A. Shutemov
    Cc: Alexey Klimov
    Cc: Eric B Munson
    Cc: Geert Uytterhoeven
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Shuah Khan
    Cc: Simon Guo
    Cc: Thierry Reding
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Simon Guo
     
  • In do_mlock(), the check against the locked memory limit has a hole
    that makes the following sequence fail at step 3):

    1) The user has a memory chunk at addressA of size 50k, and the
    memlock rlimit is 64k.
    2) mlock(addressA, 30k)
    3) mlock(addressA, 40k)

    The third step should have been allowed, since the 40k request
    overlaps the 30k locked at step 2), so it actually only mlocks an
    extra 10k of memory.

    This patch checks the VMAs to calculate the actual "new" mlock size,
    if necessary, and adjusts the logic to fix this issue.
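
    For illustration, a small userspace program for the scenario above
    (assuming an RLIMIT_MEMLOCK soft limit of 64k, as in the example):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t len = 50 * 1024;
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED)
                    return 1;
            if (mlock(p, 30 * 1024))
                    perror("mlock 30k");
            /* overlaps the 30k above, so only ~10k is newly locked;
               this call failed before the fix */
            if (mlock(p, 40 * 1024))
                    perror("mlock 40k");
            return 0;
    }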

    [akpm@linux-foundation.org: clean up comment layout]
    [wei.guo.simon@gmail.com: correct a typo in count_mm_mlocked_page_nr()]
    Link: http://lkml.kernel.org/r/1473325970-11393-2-git-send-email-wei.guo.simon@gmail.com
    Link: http://lkml.kernel.org/r/1472554781-9835-2-git-send-email-wei.guo.simon@gmail.com
    Signed-off-by: Simon Guo
    Cc: Alexey Klimov
    Cc: Eric B Munson
    Cc: Geert Uytterhoeven
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Shuah Khan
    Cc: Simon Guo
    Cc: Thierry Reding
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Simon Guo
     

29 Jul, 2016

2 commits

  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

    Unfortunately, due to the reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages at both the zone and
    the node level. Most reclaim logic is based on the node counters, but
    the retry logic uses the zone counters, which do not distinguish
    inactive and active sizes. It would be possible to keep the LRU
    counters on a per-zone basis, but that is a heavier calculation across
    multiple cache lines that happens much more frequently than the retry
    checks.

    Other than the LRU counters, this is mostly a mechanical patch, but
    note that it introduces a number of anomalies. For example, the scans
    are per-zone but use per-node counters. We also mark a node as
    congested when a zone is congested. This causes weird problems that
    are fixed later, but this way the series is easier to review.

    In the event that there is excessive overhead on 32-bit systems due to
    the LRU lists being node-based, there are two potential solutions:

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persists for long periods of time, the same
    highmem pages get continually scanned. The idea would be that lowmem
    reclaim keeps those pages on a separate list until a reclaim for
    highmem pages arrives and splices the highmem pages back onto the LRU.
    It could potentially be implemented similarly to the UNEVICTABLE list.

    That would reduce the skip rate, with the potential corner case that
    highmem pages have to be scanned and reclaimed to free lowmem slab
    pages.

    2. Linear scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Node-based reclaim requires node-based LRUs and locking. This is a
    preparation patch that just moves the lru_lock to the node so later
    patches are easier to review. It is a mechanical change but note this
    patch makes contention worse because the LRU lock is hotter and direct
    reclaim and kswapd can contend on the same lock even when reclaiming
    from different zones.

    Link: http://lkml.kernel.org/r/1467970510-21195-3-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

24 May, 2016

1 commit

  • This is a follow-up work for the oom_reaper [1]. As the async OOM
    killing depends on oom_sem for read, we would really appreciate it if
    a holder for write didn't stand in the way. This patchset changes many
    of the down_write calls to be killable to help those cases when the
    writer is blocked waiting for readers to release the lock, and so
    helps __oom_reap_task to process the oom victim.

    Most of the patches are really trivial because the lock is held from
    shallow syscall paths where we can return EINTR trivially and allow
    the current task to die (note that EINTR will never get to userspace
    as the task has a fatal signal pending). Others seem to be easy as
    well, as the callers are already handling fatal errors and bail out
    and return to userspace, which should be sufficient to handle the
    failure gracefully. I am not familiar with all those code paths, so a
    deeper review is really appreciated.

    As this work is touching more areas which are not directly connected I
    have tried to keep the CC list as small as possible and people who I
    believed would be familiar are CCed only to the specific patches (all
    should have received the cover though).

    This patchset is based on linux-next and it depends on
    down_write_killable for rw_semaphores, which got merged into the tip
    locking/rwsem branch and is merged into the current linux-next tree.
    I guess it would be easiest to route these patches via mmotm because
    of the dependency on the tip tree, but if the respective maintainers
    prefer another way I have no objections.

    I haven't covered all the down_write(&mm->mmap_sem) instances here:

    $ git grep "down_write(.*\)" next/master | wc -l
    98
    $ git grep "down_write(.*\)" | wc -l
    62

    I have tried to cover those which should be relatively easy to review in
    this series because this alone should be a nice improvement. Other
    places can be changed on top.

    [0] http://lkml.kernel.org/r/1456752417-9626-1-git-send-email-mhocko@kernel.org
    [1] http://lkml.kernel.org/r/1452094975-551-1-git-send-email-mhocko@kernel.org
    [2] http://lkml.kernel.org/r/1456750705-7141-1-git-send-email-mhocko@kernel.org

    This patch (of 18):

    This is the first step in making mmap_sem write waiters killable. It
    focuses on the trivial ones, which take the lock early after entering
    the syscall and do not change any state before that.

    Therefore it is very easy to change them to use down_write_killable
    and immediately return with -EINTR. This will allow the waiter to pass
    away without blocking the mmap_sem, which might be required to make
    forward progress. E.g. the oom reaper will need the lock for reading
    to dismantle the OOM victim's address space.

    The only tricky function in this patch is vm_mmap_pgoff, which has
    many call sites via vm_mmap. To reduce the risk, keep vm_mmap with the
    original non-killable semantics for now.

    vm_munmap callers do not bother checking the return value, so open-code
    it into the munmap syscall path for now, for simplicity.
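
    A simplified sketch of the resulting pattern for the open-coded munmap
    path (not the exact diff):

    SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
    {
            struct mm_struct *mm = current->mm;
            int ret;

            if (down_write_killable(&mm->mmap_sem))
                    return -EINTR;          /* a fatal signal is pending */
            ret = do_munmap(mm, addr, len);
            up_write(&mm->mmap_sem);
            return ret;
    }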

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Konstantin Khlebnikov
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

22 Jan, 2016

1 commit

  • Tetsuo Handa reported underflow of NR_MLOCK on munlock.

    Testcase:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define BASE ((void *)0x400000000000)
    #define SIZE (1UL << 21)

    int main(int argc, char *argv[])
    {
            void *addr;

            system("grep Mlocked /proc/meminfo");
            addr = mmap(BASE, SIZE, PROT_READ | PROT_WRITE,
                        MAP_ANONYMOUS | MAP_PRIVATE | MAP_LOCKED | MAP_FIXED,
                        -1, 0);
            if (addr == MAP_FAILED)
                    printf("mmap() failed\n"), exit(1);
            munmap(addr, SIZE);
            system("grep Mlocked /proc/meminfo");
            return 0;
    }

    It happens in munlock_vma_page() due to an unfortunate choice of data
    type for nr_pages:

    __mod_zone_page_state(zone, NR_MLOCK, -nr_pages);

    With nr_pages being unsigned int, the negation is done in unsigned
    arithmetic and, implicitly cast to long in __mod_zone_page_state(), it
    becomes something around UINT_MAX.

    munlock_vma_page() is usually called for THPs, as small pages go
    through the pagevec path.

    Let's make nr_pages signed int.

    A similar fix, 6cdb18ad98a4 ("mm/vmstat: fix overflow in
    mod_zone_page_state()"), used the `long' type, but `int' is OK here
    for a count of the number of sub-pages in a huge page.
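
    A tiny userspace illustration of the underlying C pitfall (on a 64-bit
    system):

    #include <stdio.h>

    int main(void)
    {
            unsigned int nr_pages = 512;    /* HPAGE_PMD_NR on x86-64 */
            long delta = -nr_pages;         /* negation happens in unsigned arithmetic */

            printf("%ld\n", delta);         /* prints 4294966784, not -512 */
            return 0;
    }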

    Fixes: ff6a6da60b89 ("mm: accelerate munlock() treatment of THP pages")
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Tetsuo Handa
    Tested-by: Tetsuo Handa
    Cc: Michel Lespinasse
    Acked-by: Michal Hocko
    Cc: [4.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

16 Jan, 2016

3 commits

  • Since can_do_mlock() only returns 1 or 0, make it return bool.

    No functional change.

    [akpm@linux-foundation.org: update declaration in mm.h]
    Signed-off-by: Wang Xiaoqiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Xiaoqiang
     
  • Before the THP refcounting rework, a THP was not allowed to cross a VMA
    boundary. So, if we had a THP and split it, PG_mlocked could safely be
    transferred to the small pages.

    With the new THP refcounting and a naive approach to mlocking we can
    end up with this scenario:
    1. we have an mlocked THP, which belongs to one VM_LOCKED VMA.
    2. the process does munlock() on a *part* of the THP:
    - the VMA is split into two, one of them VM_LOCKED;
    - the huge PMD is split into a PTE table;
    - the THP is still mlocked;
    3. split_huge_page():
    - it transfers PG_mlocked to *all* small pages regardless of whether
    they belong to any VM_LOCKED VMA.

    We probably could munlock() all small pages on split_huge_page(), but
    I think we have an accounting issue already at step 2.

    Instead of forbidding mlocked pages altogether, we just avoid mlocking
    PTE-mapped THPs and munlock THPs on split_huge_pmd().

    This means PTE-mapped THPs will be on normal lru lists and will be split
    under memory pressure by vmscan. After the split vmscan will detect
    unevictable small pages and mlock them.

    With this approach we shouldn't hit situation like described above.

    Signed-off-by: Kirill A. Shutemov
    Cc: Sasha Levin
    Cc: Aneesh Kumar K.V
    Cc: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • With the new refcounting a THP can belong to several VMAs. This makes
    it tricky to track THP pages when they are partially mlocked. It can
    lead to leaking mlocked pages into non-VM_LOCKED VMAs and other
    problems.

    With this patch we split all pages on mlock and avoid faulting in or
    collapsing new THPs in VM_LOCKED VMAs.

    I've tried an alternative approach: do not mark THP pages mlocked and
    keep them on the normal LRUs. This way vmscan could try to split huge
    pages under memory pressure and free up subpages which don't belong to
    VM_LOCKED VMAs. But this is a user-visible change: it screws up the
    Mlocked accounting reported in meminfo, so I had to leave this
    approach aside.

    We can bring something better later, but this should be good enough for
    now.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

06 Nov, 2015

6 commits

  • The previous patch introduced a flag specifying that pages in a VMA
    should be placed on the unevictable LRU, but not made present when the
    area is created. This patch adds the ability to set this state via the
    new mlock system calls.

    We add MLOCK_ONFAULT for mlock2 and MCL_ONFAULT for mlockall.
    MLOCK_ONFAULT will set the VM_LOCKONFAULT modifier for VM_LOCKED.
    MCL_ONFAULT should be used as a modifier to the two other mlockall flags.
    When used with MCL_CURRENT, all current mappings will be marked with
    VM_LOCKED | VM_LOCKONFAULT. When used with MCL_FUTURE, the mm->def_flags
    will be marked with VM_LOCKED | VM_LOCKONFAULT. When used with both
    MCL_CURRENT and MCL_FUTURE, all current mappings and mm->def_flags will be
    marked with VM_LOCKED | VM_LOCKONFAULT.

    Prior to this patch, mlockall() will unconditionally clear the
    mm->def_flags any time it is called without MCL_FUTURE. This behavior is
    maintained after adding MCL_ONFAULT. If a call to mlockall(MCL_FUTURE) is
    followed by mlockall(MCL_CURRENT), the mm->def_flags will be cleared and
    new VMAs will be unlocked. This remains true with or without MCL_ONFAULT
    in either mlockall() invocation.

    munlock() will unconditionally clear both VMA flags. munlockall()
    unconditionally clears both VMA flags on all VMAs and in the
    mm->def_flags field.
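
    A short usage sketch of the new flags, assuming glibc 2.27+ which
    provides the mlock2() wrapper and the MLOCK_ONFAULT/MCL_ONFAULT
    definitions:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t len = 64 * 1024;
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED)
                    return 1;
            /* VM_LOCKED|VM_LOCKONFAULT: pages are locked as they are faulted in */
            if (mlock2(p, len, MLOCK_ONFAULT))
                    perror("mlock2");
            /* same behaviour for all current and future mappings */
            if (mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT))
                    perror("mlockall");
            return 0;
    }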

    Signed-off-by: Eric B Munson
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Jonathan Corbet
    Cc: Catalin Marinas
    Cc: Geert Uytterhoeven
    Cc: Guenter Roeck
    Cc: Heiko Carstens
    Cc: Kirill A. Shutemov
    Cc: Michael Kerrisk
    Cc: Ralf Baechle
    Cc: Shuah Khan
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     
  • The cost of faulting in all memory to be locked can be very high when
    working with large mappings. If only portions of the mapping will be used
    this can incur a high penalty for locking.

    For the example of a large file, this is the usage pattern for a large
    statistical language model (it probably applies to other statistical or
    graphical models as well). For the security example, think of any
    application transacting in data that cannot be swapped out (credit card
    data, medical records, etc).

    This patch introduces the ability to request that pages are not
    pre-faulted, but are placed on the unevictable LRU when they are finally
    faulted in. The VM_LOCKONFAULT flag will be used together with VM_LOCKED
    and has no effect when set without VM_LOCKED. Setting the VM_LOCKONFAULT
    flag for a VMA will cause pages faulted into that VMA to be added to the
    unevictable LRU when they are faulted or if they are already present, but
    will not cause any missing pages to be faulted in.

    Exposing this new lock state means that we cannot overload the meaning of
    the FOLL_POPULATE flag any longer. Prior to this patch it was used to
    mean that the VMA for a fault was locked. This means we need the new
    FOLL_MLOCK flag to communicate the locked state of a VMA. FOLL_POPULATE
    will now only control if the VMA should be populated and in the case of
    VM_LOCKONFAULT, it will not be set.

    Signed-off-by: Eric B Munson
    Acked-by: Kirill A. Shutemov
    Acked-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Jonathan Corbet
    Cc: Catalin Marinas
    Cc: Geert Uytterhoeven
    Cc: Guenter Roeck
    Cc: Heiko Carstens
    Cc: Michael Kerrisk
    Cc: Ralf Baechle
    Cc: Shuah Khan
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     
  • With the refactored mlock code, introduce a new system call for mlock.
    The new call will allow the user to specify what lock states are being
    added. mlock2 is trivial at the moment, but a follow-on patch will add
    a new mlock state, making it useful.
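
    Before a libc wrapper exists, the new syscall can be reached directly;
    a minimal sketch (my_mlock2 is a local helper name, and __NR_mlock2
    requires kernel headers from 4.4 or later):

    #include <unistd.h>
    #include <sys/syscall.h>

    /* flags == 0 behaves like mlock(); a later patch adds MLOCK_ONFAULT */
    static int my_mlock2(const void *addr, size_t len, int flags)
    {
            return syscall(__NR_mlock2, addr, len, flags);
    }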

    Signed-off-by: Eric B Munson
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Heiko Carstens
    Cc: Geert Uytterhoeven
    Cc: Catalin Marinas
    Cc: Stephen Rothwell
    Cc: Guenter Roeck
    Cc: Jonathan Corbet
    Cc: Kirill A. Shutemov
    Cc: Michael Kerrisk
    Cc: Ralf Baechle
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     
  • mlock() allows a user to control the paging out of program memory, but
    this comes at the cost of faulting in the entire mapping when it is
    allocated.
    For large mappings where the entire area is not necessary this is not
    ideal. Instead of forcing all locked pages to be present when they are
    allocated, this set creates a middle ground. Pages are marked to be
    placed on the unevictable LRU (locked) when they are first used, but they
    are not faulted in by the mlock call.

    This series introduces a new mlock() system call that takes a flags
    argument along with the start address and size. This flags argument gives
    the caller the ability to request memory be locked in the traditional way,
    or to be locked after the page is faulted in. A new MCL flag is added to
    mirror the lock on fault behavior from mlock() in mlockall().

    There are two main use cases that this set covers. The first is the
    security-focused mlock case. A buffer is needed that cannot be written
    to swap. The maximum size is known, but on average the memory used is
    significantly less than this maximum. With lock on fault, the buffer is
    guaranteed to never be paged out without consuming the maximum size every
    time such a buffer is created.

    The second use case is focused on performance. Portions of a large file
    are needed and we want to keep the used portions in memory once accessed.
    This is the case for large graphical models where the path through the
    graph is not known until run time. The entire graph is unlikely to be
    used in a given invocation, but once a node has been used it needs to stay
    resident for further processing. Given these constraints we have a number
    of options. We can potentially waste a large amount of memory by mlocking
    the entire region (this can also cause a significant stall at startup as
    the entire file is read in). We can mlock every page as we access them
    without tracking if the page is already resident but this introduces large
    overhead for each access. The third option is mapping the entire region
    with PROT_NONE and using a signal handler for SIGSEGV to
    mprotect(PROT_READ) and mlock() the needed page. Doing this page at a
    time adds a significant performance penalty. Batching can be used to
    mitigate this overhead, but in order to safely avoid trying to mprotect
    pages outside of the mapping, the boundaries of each mapping to be used in
    this way must be tracked and available to the signal handler. This is
    precisely what the mm system in the kernel should already be doing.

    For mlock(MLOCK_ONFAULT) the user is charged against RLIMIT_MEMLOCK as if
    mlock(MLOCK_LOCKED) or mmap(MAP_LOCKED) had been used, i.e. when the VMA
    is created, not when the pages are faulted in. For mlockall(MCL_ONFAULT) the
    user is charged as if MCL_FUTURE was used. This decision was made to keep
    the accounting checks out of the page fault path.

    To illustrate the benefit of this set I wrote a test program that mmaps a
    5 GB file filled with random data and then makes 15,000,000 accesses to
    random addresses in that mapping. The test program was run 20 times for
    each setup. Results are reported for two program portions, setup and
    execution. The setup phase is calling mmap and optionally mlock on the
    entire region. For most experiments this is trivial, but it highlights
    the cost of faulting in the entire region. Results are averages across
    the 20 runs in milliseconds.

    mmap with mlock(MLOCK_LOCKED) on entire range:
    Setup avg: 8228.666
    Processing avg: 8274.257

    mmap with mlock(MLOCK_LOCKED) before each access:
    Setup avg: 0.113
    Processing avg: 90993.552

    mmap with PROT_NONE and signal handler and batch size of 1 page:
    With the default value in max_map_count, this gets ENOMEM as I attempt
    to change the permissions; after upping the sysctl significantly I get:
    Setup avg: 0.058
    Processing avg: 69488.073

    mmap with PROT_NONE and signal handler and batch size of 8 pages:
    Setup avg: 0.068
    Processing avg: 38204.116

    mmap with PROT_NONE and signal handler and batch size of 16 pages:
    Setup avg: 0.044
    Processing avg: 29671.180

    mmap with mlock(MLOCK_ONFAULT) on entire range:
    Setup avg: 0.189
    Processing avg: 17904.899

    The signal handler in the batch cases faulted in memory in two steps to
    avoid having to know the start and end of the faulting mapping. The first
    step covers the page that caused the fault as we know that it will be
    possible to lock. The second step speculatively tries to mlock and
    mprotect the batch size - 1 pages that follow. There may be a clever way
    to avoid this without having the program track each mapping to be covered
    by this handler in a globally accessible structure, but I could not find
    it. It should be noted that with a large enough batch size this two step
    fault handler can still cause the program to crash if it reaches far
    beyond the end of the mapping.

    These results show that if the developer knows that a majority of the
    mapping will be used, it is better to try and fault it in at once,
    otherwise mlock(MLOCK_ONFAULT) is significantly faster.

    The performance cost of these patches are minimal on the two benchmarks I
    have tested (stream and kernbench). The following are the average values
    across 20 runs of stream and 10 runs of kernbench after a warmup run whose
    results were discarded.

    Avg throughput in MB/s from stream using 1000000 element arrays
    Test 4.2-rc1 4.2-rc1+lock-on-fault
    Copy: 10,566.5 10,421
    Scale: 10,685 10,503.5
    Add: 12,044.1 11,814.2
    Triad: 12,064.8 11,846.3

    Kernbench optimal load
    4.2-rc1 4.2-rc1+lock-on-fault
    Elapsed Time 78.453 78.991
    User Time 64.2395 65.2355
    System Time 9.7335 9.7085
    Context Switches 22211.5 22412.1
    Sleeps 14965.3 14956.1

    This patch (of 6):

    Extending the mlock system call is very difficult because it currently
    does not take a flags argument. A later patch in this set will extend
    mlock to support a middle ground between pages that are locked and
    faulted in immediately and unlocked pages. To pave the way for the new
    system call, the code needs some reorganization so that all the actual
    entry point does is check the input and translate it to VMA flags.

    Signed-off-by: Eric B Munson
    Acked-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Michael Kerrisk
    Cc: Catalin Marinas
    Cc: Geert Uytterhoeven
    Cc: Guenter Roeck
    Cc: Heiko Carstens
    Cc: Jonathan Corbet
    Cc: Ralf Baechle
    Cc: Shuah Khan
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     
  • linux/mm.h provides the offset_in_page() macro. Let's use the already
    defined macro instead of open-coding (addr & ~PAGE_MASK).
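
    For reference, the macro in linux/mm.h is essentially:

    #define offset_in_page(p)       ((unsigned long)(p) & ~PAGE_MASK)

    so call sites such as PAGE_ALIGN(len + (start & ~PAGE_MASK)) can be
    written as PAGE_ALIGN(len + offset_in_page(start)).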

    Signed-off-by: Alexander Kuleshov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Kuleshov
     
  • In the mlockall() syscall wrapper, the code after the 'out' label just
    returns. Remove the 'goto out' statements and return the error values
    directly.

    Also, instead of rewriting the ret variable before every if-check,
    return from the error path under the if-check.

    An objdump assembly listing shows a reduction of a few lines. The
    object file size decreased from 220592 bytes to 220528 bytes for me
    (on aarch64).

    Signed-off-by: Alexey Klimov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Klimov
     

05 Sep, 2015

1 commit

  • vma->vm_userfaultfd_ctx is yet another vma parameter that vma_merge
    must be aware of so that we can merge vmas back like they were
    originally before arming the userfaultfd on some memory range.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

15 Apr, 2015

4 commits

  • It's odd that we have populate_vma_page_range() and __mm_populate() in
    mm/mlock.c. They are an implementation of generic memory population,
    and mlocking is only one possible side effect, if VM_LOCKED is set.

    __get_user_pages() is core of the implementation. Let's move the code
    into mm/gup.c.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Linus Torvalds
    Acked-by: David Rientjes
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • This is preparation for moving mm_populate()-related code out of
    mm/mlock.c.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Linus Torvalds
    Acked-by: David Rientjes
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • __mlock_vma_pages_range() doesn't necessarily mlock pages. It depends on
    vma flags. The same codepath is used for MAP_POPULATE.

    Let's rename __mlock_vma_pages_range() to populate_vma_page_range().

    This patch also drops the mlock_vma_pages_range() references from the
    documentation. The function went away in cea10a19b797 ("mm: directly
    use __mlock_vma_pages_range() in find_extend_vma()").

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Linus Torvalds
    Acked-by: David Rientjes
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • After commit a1fde08c74e9 ("VM: skip the stack guard page lookup in
    get_user_pages only for mlock") FOLL_MLOCK has lost its original
    meaning: we don't necessarily mlock the page if the flag is set -- we
    also take VM_LOCKED into consideration.

    Since we use the same codepath for __mm_populate(), let's rename
    FOLL_MLOCK to FOLL_POPULATE.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Linus Torvalds
    Acked-by: David Rientjes
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

13 Mar, 2015

1 commit

  • A userspace call to mmap(MAP_LOCKED) may result in the successful
    locking of memory while also producing a confusing audit log denial.
    can_do_mlock checks capable and rlimit. If either of these returns
    positive, can_do_mlock returns true. The capable check leads to an LSM
    hook used by AppArmor and SELinux, which produces the audit denial.
    Reordering so that the rlimit is checked first eliminates the denial
    on success, only recording a denial when the lock is unsuccessful as a
    result of the denial.
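
    A sketch of the reordered helper, shown with the bool return type from
    the later cleanup listed above (a sketch, not the literal diff):

    bool can_do_mlock(void)
    {
            if (rlimit(RLIMIT_MEMLOCK) != 0)
                    return true;            /* no LSM hook, no audit denial */
            if (capable(CAP_IPC_LOCK))
                    return true;
            return false;
    }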

    Signed-off-by: Jeff Vander Stoep
    Acked-by: Nick Kralevich
    Cc: Jeff Vander Stoep
    Cc: Sasha Levin
    Cc: "Paul E. McKenney"
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Paul Cassella
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Vander Stoep
     

13 Oct, 2014

1 commit

  • Pull RCU updates from Ingo Molnar:
    "The main changes in this cycle were:

    - changes related to No-CBs CPUs and NO_HZ_FULL

    - RCU-tasks implementation

    - torture-test updates

    - miscellaneous fixes

    - locktorture updates

    - RCU documentation updates"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (81 commits)
    workqueue: Use cond_resched_rcu_qs macro
    workqueue: Add quiescent state between work items
    locktorture: Cleanup header usage
    locktorture: Cannot hold read and write lock
    locktorture: Fix __acquire annotation for spinlock irq
    locktorture: Support rwlocks
    rcu: Eliminate deadlock between CPU hotplug and expedited grace periods
    locktorture: Document boot/module parameters
    rcutorture: Rename rcutorture_runnable parameter
    locktorture: Add test scenario for rwsem_lock
    locktorture: Add test scenario for mutex_lock
    locktorture: Make torture scripting account for new _runnable name
    locktorture: Introduce torture context
    locktorture: Support rwsems
    locktorture: Add infrastructure for torturing read locks
    torture: Address race in module cleanup
    locktorture: Make statistics generic
    locktorture: Teach about lock debugging
    locktorture: Support mutexes
    locktorture: Add documentation
    ...

    Linus Torvalds
     

10 Oct, 2014

2 commits

  • Dump the contents of the relevant struct_mm when we hit the bug condition.

    Signed-off-by: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     
  • Trivially convert a few VM_BUG_ON calls to VM_BUG_ON_VMA to extract
    more information when they trigger.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Sasha Levin
    Reviewed-by: Naoya Horiguchi
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Vlastimil Babka
    Cc: Michel Lespinasse
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

08 Sep, 2014

1 commit

  • RCU-tasks requires the occasional voluntary context switch
    from CPU-bound in-kernel tasks. In some cases, this requires
    instrumenting cond_resched(). However, there is some reluctance
    to countenance unconditionally instrumenting cond_resched() (see
    http://lwn.net/Articles/603252/), so this commit creates a separate
    cond_resched_rcu_qs() that may be used in place of cond_resched() in
    locations prone to long-duration in-kernel looping.

    This commit currently instruments only RCU-tasks. Future possibilities
    include also instrumenting RCU, RCU-bh, and RCU-sched in order to reduce
    IPI usage.
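
    A hedged sketch of the intended usage, with hypothetical loop helpers
    (have_more_work()/do_one_item() are placeholders):

    /* in a long-running, otherwise non-sleeping kernel loop */
    while (have_more_work()) {
            do_one_item();
            cond_resched_rcu_qs();  /* reschedule and report a quiescent state */
    }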

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

07 Aug, 2014

1 commit

  • Add a comment describing the circumstances in which
    __lock_page_or_retry() will or will not release the mmap_sem when
    returning 0.

    Add comments to lock_page_or_retry()'s callers (filemap_fault(),
    do_swap_page()) noting the impact on VM_FAULT_RETRY returns.

    Add comments further up the call tree, particularly replacing the false
    "We return with mmap_sem still held" comments.

    Signed-off-by: Paul Cassella
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Cassella
     

08 Apr, 2014

1 commit

  • A BUG_ON(!PageLocked) was triggered in mlock_vma_page() by Sasha Levin
    fuzzing with trinity. The call site try_to_unmap_cluster() does not lock
    the pages other than its check_page parameter (which is already locked).

    The BUG_ON in mlock_vma_page() is not documented and its purpose is
    somewhat unclear, but apparently it serializes against page migration,
    which could otherwise fail to transfer the PG_mlocked flag. This would
    not be fatal, as the page would be eventually encountered again, but
    NR_MLOCK accounting would become distorted nevertheless. This patch adds
    a comment to the BUG_ON in mlock_vma_page() and munlock_vma_page() to that
    effect.

    The call site try_to_unmap_cluster() is fixed so that for page !=
    check_page, trylock_page() is attempted (to avoid possible deadlocks as we
    already have check_page locked) and mlock_vma_page() is performed only
    upon success. If the page lock cannot be obtained, the page is left
    without PG_mlocked, which is again not a problem in the whole unevictable
    memory design.

    Signed-off-by: Vlastimil Babka
    Signed-off-by: Bob Liu
    Reported-by: Sasha Levin
    Cc: Wanpeng Li
    Cc: Michel Lespinasse
    Cc: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

24 Jan, 2014

2 commits

  • Most of the VM_BUG_ON assertions are performed on a page. Usually, when
    one of these assertions fails we'll get a BUG_ON with a call stack and
    the registers.

    Based on recent requests to add a small piece of code that dumps the
    page at various VM_BUG_ON sites, I've noticed that the page dump is
    quite useful to people debugging issues in mm.

    This patch adds a VM_BUG_ON_PAGE(cond, page) which, beyond doing what
    VM_BUG_ON() does, also dumps the page before executing the actual
    BUG_ON.

    [akpm@linux-foundation.org: fix up includes]
    Signed-off-by: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     
  • Since commit ff6a6da60b89 ("mm: accelerate munlock() treatment of THP
    pages") munlock skips tail pages of a munlocked THP page. There is some
    attempt to prevent bad consequences of racing with a THP page split, but
    code inspection indicates that there are two problems that may lead to a
    non-fatal, yet wrong outcome.

    First, __split_huge_page_refcount() copies flags including PageMlocked
    from the head page to the tail pages. Clearing PageMlocked by
    munlock_vma_page() in the middle of this operation might result in part
    of tail pages left with PageMlocked flag. As the head page still
    appears to be a THP page until all tail pages are processed,
    munlock_vma_page() might think it munlocked the whole THP page and skip
    all the former tail pages. Before ff6a6da60, those pages would be
    cleared in further iterations of munlock_vma_pages_range(), but NR_MLOCK
    would still become undercounted (related to the next point).

    Second, NR_MLOCK accounting is based on a call to hpage_nr_pages()
    after PageMlocked is cleared. The accounting might also become
    inconsistent due to a race with __split_huge_page_refcount():

    - undercount when HPAGE_PMD_NR is subtracted, but some tail pages are
    left with PageMlocked set and counted again (only possible before
    ff6a6da60)

    - overcount when hpage_nr_pages() sees a normal page (the split has
    already finished), but the parallel split has meanwhile cleared
    PageMlocked from additional tail pages

    This patch prevents both problems via extending the scope of lru_lock in
    munlock_vma_page(). This is convenient because:

    - __split_huge_page_refcount() takes lru_lock for its whole operation

    - munlock_vma_page() typically takes lru_lock anyway for page isolation

    As this becomes a second function where page isolation is done with
    lru_lock already held, factor this out to a new
    __munlock_isolate_lru_page() function and clean up the code around.

    [akpm@linux-foundation.org: avoid a coding-style ugly]
    Signed-off-by: Vlastimil Babka
    Cc: Sasha Levin
    Cc: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

22 Jan, 2014

1 commit

  • All mlock related syscalls prepare lock limits, lengths and start
    parameters with the mmap_sem held. Move this logic outside of the
    critical region. For the case of mlock, continue incrementing the
    amount already locked by mm->locked_vm with the rwsem taken.

    Signed-off-by: Davidlohr Bueso
    Cc: Rik van Riel
    Reviewed-by: Michel Lespinasse
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

03 Jan, 2014

2 commits

  • Commit 7225522bb429 ("mm: munlock: batch non-THP page isolation and
    munlock+putback using pagevec") introduced __munlock_pagevec() to speed
    up munlock by holding lru_lock over multiple isolated pages. Pages that
    fail to be isolated are put_page()d immediately, also within the lock.

    This can lead to deadlock when __munlock_pagevec() becomes the holder of
    the last page pin and put_page() leads to __page_cache_release() which
    also locks lru_lock. The deadlock has been observed by Sasha Levin
    using trinity.

    This patch avoids the deadlock by deferring put_page() operations until
    lru_lock is released. Another pagevec (which is also used by later
    phases of the function) is reused to gather the pages for the
    put_page() operation.
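
    A rough sketch of the deferred-release pattern (simplified;
    page_isolated_for_munlock() stands in for the actual isolation step):

    struct pagevec pvec_putback;

    pagevec_init(&pvec_putback, 0);
    spin_lock_irq(&zone->lru_lock);
    for (i = 0; i < nr; i++) {
            struct page *page = pvec->pages[i];

            if (!page_isolated_for_munlock(page)) {
                    pagevec_add(&pvec_putback, page);       /* defer put_page() */
                    pvec->pages[i] = NULL;
            }
    }
    spin_unlock_irq(&zone->lru_lock);

    /* safe now: put_page() may reach __page_cache_release(), which takes lru_lock */
    pagevec_release(&pvec_putback);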

    Signed-off-by: Vlastimil Babka
    Reported-by: Sasha Levin
    Cc: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Since commit ff6a6da60b89 ("mm: accelerate munlock() treatment of THP
    pages") munlock skips tail pages of a munlocked THP page. However, when
    the head page already has PageMlocked unset, it will not skip the tail
    pages.

    Commit 7225522bb429 ("mm: munlock: batch non-THP page isolation and
    munlock+putback using pagevec") has added a PageTransHuge() check which
    contains VM_BUG_ON(PageTail(page)). Sasha Levin found this triggered
    using trinity, on the first tail page of a THP page without PageMlocked
    flag.

    This patch fixes the issue by skipping tail pages also in the case
    when the PageMlocked flag is unset. There is still a possibility of a
    race with a THP page split between clearing PageMlocked and
    determining how many pages to skip. The race might result in former
    tail pages not being skipped, which is however no longer a bug, as
    during the skip the PageTail flags are cleared.

    However this race also affects correctness of NR_MLOCK accounting, which
    is to be fixed in a separate patch.

    Signed-off-by: Vlastimil Babka
    Reported-by: Sasha Levin
    Cc: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Bob Liu
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

01 Oct, 2013

1 commit

  • The function __munlock_pagevec_fill() introduced in commit 7a8010cd3627
    ("mm: munlock: manual pte walk in fast path instead of
    follow_page_mask()") uses pmd_addr_end() for restricting its operation
    within current page table.

    This is insufficient on architectures/configurations where pmd is folded
    and pmd_addr_end() just returns the end of the full range to be walked.
    In this case, it allows pte++ to walk off the end of a page table
    resulting in unpredictable behaviour.

    This patch fixes the function by using pgd_addr_end() and pud_addr_end()
    before pmd_addr_end(), which will yield correct page table boundary on
    all configurations. This is similar to what existing page walkers do
    when walking each level of the page table.
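
    The fix follows the usual page-walker pattern of clamping the end
    through each level; roughly:

    /* 'end' can no longer cross the current page table, even with a folded pmd */
    end = pgd_addr_end(start, end);
    end = pud_addr_end(start, end);
    end = pmd_addr_end(start, end);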

    Additionally, the patch clarifies a comment for the get_locked_pte()
    call in the function.

    Signed-off-by: Vlastimil Babka
    Reported-by: Fengguang Wu
    Reviewed-by: Bob Liu
    Cc: Jörn Engel
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

25 Sep, 2013

1 commit

  • There is a loop in do_mlockall() that lacks a preemption point, which
    means that the following can happen on non-preemptible builds of the
    kernel. Dave Jones reports:

    "My fuzz tester keeps hitting this. Every instance shows the non-irq
    stack came in from mlockall. I'm only seeing this on one box, but
    that has more ram (8gb) than my other machines, which might explain
    it.

    INFO: rcu_preempt self-detected stall on CPU { 3} (t=6500 jiffies g=470344 c=470343 q=0)
    sending NMI to all CPUs:
    NMI backtrace for cpu 3
    CPU: 3 PID: 29664 Comm: trinity-child2 Not tainted 3.11.0-rc1+ #32
    Call Trace:
    lru_add_drain_all+0x15/0x20
    SyS_mlockall+0xa5/0x1a0
    tracesys+0xdd/0xe2"

    This commit addresses this problem by inserting the required preemption
    point.
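
    The preemption point sits in the per-VMA loop of do_mlockall();
    roughly (simplified):

    for (vma = current->mm->mmap; vma; vma = prev->vm_next) {
            vm_flags_t newflags;

            newflags = vma->vm_flags & ~VM_LOCKED;
            if (flags & MCL_CURRENT)
                    newflags |= VM_LOCKED;

            /* Ignore errors */
            mlock_fixup(vma, &prev, vma->vm_start, vma->vm_end, newflags);
            cond_resched();         /* the added preemption point */
    }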

    Reported-by: Dave Jones
    Signed-off-by: Paul E. McKenney
    Cc: KOSAKI Motohiro
    Cc: Michel Lespinasse
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul E. McKenney
     

12 Sep, 2013

4 commits

  • Currently munlock_vma_pages_range() calls follow_page_mask() to obtain
    each individual struct page. This entails repeated full page table
    translations and the page table lock being taken for each page separately.

    This patch avoids the costly follow_page_mask() where possible, by
    iterating over ptes within a single pmd under a single page table lock. The
    first pte is obtained by get_locked_pte() for non-THP page acquired by the
    initial follow_page_mask(). The rest of the on-stack pagevec for munlock
    is filled up using pte_walk as long as pte_present() and vm_normal_page()
    are sufficient to obtain the struct page.

    After this patch, a 14% speedup was measured for munlocking a 56GB large
    memory area with THP disabled.

    Signed-off-by: Vlastimil Babka
    Cc: Jörn Engel
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The performance of the fast path in munlock_vma_range() can be further
    improved by avoiding the atomic ops of a redundant get_page()/put_page()
    pair.

    When calling get_page() during page isolation, we already have the pin
    from follow_page_mask(). This pin will then be returned by
    __pagevec_lru_add(), after which we do not reference the pages anymore.

    After this patch, an 8% speedup was measured for munlocking a 56GB large
    memory area with THP disabled.

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Jörn Engel
    Acked-by: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • After introducing batching by pagevecs into munlock_vma_range(), we can
    further improve performance by bypassing the copying into the per-cpu
    pagevec and the get_page/put_page pair associated with that. Instead we
    perform the LRU putback directly from our pagevec. However, this is
    possible only for single-mapped pages that are evictable after munlock.
    Unevictable pages require rechecking after being put on the unevictable
    list, so for those we fall back to putback_lru_page(), which handles
    that.

    After this patch, a 13% speedup was measured for munlocking a 56GB large
    memory area with THP disabled.

    [akpm@linux-foundation.org: clarify comment]
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Jörn Engel
    Acked-by: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Depending on the previous patch, which introduced batched isolation in
    munlock_vma_range(), we can also batch the updates of the NR_MLOCK page
    stats. After the whole pagevec is processed for page isolation, the
    stats are updated only once with the number of successful isolations.
    There were, however, no measurable performance gains.

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Jörn Engel
    Acked-by: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka