30 Nov, 2016

1 commit

  • Linus found there still is a race in mremap after commit 5d1904204c99
    ("mremap: fix race between mremap() and page cleanning").

    As described by Linus:
    "the issue is that another thread might make the pte be dirty (in the
    hardware walker, so no locking of ours will make any difference)
    *after* we checked whether it was dirty, but *before* we removed it
    from the page tables"

    Fix it by moving the check after we removed it from the page table.

    Suggested-by: Linus Torvalds
    Signed-off-by: Aaron Lu
    Signed-off-by: Linus Torvalds

    Aaron Lu
     

18 Nov, 2016

1 commit

  • Prior to 3.15, there was a race between zap_pte_range() and
    page_mkclean() where writes to a page could be lost. Dave Hansen
    discovered by inspection that there is a similar race between
    move_ptes() and page_mkclean().

    We've been able to reproduce the issue by enlarging the race window with
    a msleep(), but have not been able to hit it without modifying the code.
    So, we think it's a real issue, but is difficult or impossible to hit in
    practice.

    The zap_pte_range() issue is fixed by commit 1cf35d47712d("mm: split
    'tlb_flush_mmu()' into tlb flushing and memory freeing parts"). And
    this patch is to fix the race between page_mkclean() and mremap().

    Here is one possible way to hit the race: suppose a process mmapped a
    file with READ | WRITE and SHARED, it has two threads and they are bound
    to 2 different CPUs, e.g. CPU1 and CPU2. mmap returned X, then thread
    1 did a write to addr X so that CPU1 now has a writable TLB for addr X
    on it. Thread 2 starts mremaping from addr X to Y while thread 1
    cleaned the page and then did another write to the old addr X again.
    The 2nd write from thread 1 could succeed but the value will get lost.

    thread 1 thread 2
    (bound to CPU1) (bound to CPU2)

    1: write 1 to addr X to get a
    writeable TLB on this CPU

    2: mremap starts

    3: move_ptes emptied PTE for addr X
    and setup new PTE for addr Y and
    then dropped PTL for X and Y

    4: page laundering for N by doing
    fadvise FADV_DONTNEED. When done,
    pageframe N is deemed clean.

    5: *write 2 to addr X

    6: tlb flush for addr X

    7: munmap (Y, pagesize) to make the
    page unmapped

    8: fadvise with FADV_DONTNEED again
    to kick the page off the pagecache

    9: pread the page from file to verify
    the value. If 1 is there, it means
    we have lost the written 2.

    *the write may or may not cause segmentation fault, it depends on
    if the TLB is still on the CPU.

    Please note that this is only one specific way of how the race could
    occur, it didn't mean that the race could only occur in exact the above
    config, e.g. more than 2 threads could be involved and fadvise() could
    be done in another thread, etc.

    For anonymous pages, they could race between mremap() and page reclaim:
    THP: a huge PMD is moved by mremap to a new huge PMD, then the new huge
    PMD gets unmapped/splitted/pagedout before the flush tlb happened for
    the old huge PMD in move_page_tables() and we could still write data to
    it. The normal anonymous page has similar situation.

    To fix this, check for any dirty PTE in move_ptes()/move_huge_pmd() and
    if any, did the flush before dropping the PTL. If we did the flush for
    every move_ptes()/move_huge_pmd() call then we do not need to do the
    flush in move_pages_tables() for the whole range. But if we didn't, we
    still need to do the whole range flush.

    Alternatively, we can track which part of the range is flushed in
    move_ptes()/move_huge_pmd() and which didn't to avoid flushing the whole
    range in move_page_tables(). But that would require multiple tlb
    flushes for the different sub-ranges and should be less efficient than
    the single whole range flush.

    KBuild test on my Sandybridge desktop doesn't show any noticeable change.
    v4.9-rc4:
    real 5m14.048s
    user 32m19.800s
    sys 4m50.320s

    With this commit:
    real 5m13.888s
    user 32m19.330s
    sys 4m51.200s

    Reported-by: Dave Hansen
    Signed-off-by: Aaron Lu
    Signed-off-by: Linus Torvalds

    Aaron Lu
     

27 Jul, 2016

1 commit

  • split_huge_pmd() doesn't guarantee that the pmd is normal pmd pointing
    to pte entries, which can be checked with pmd_trans_unstable(). Some
    callers make this assertion and some do it differently and some not, so
    let's do it in a unified manner.

    Link: http://lkml.kernel.org/r/1464741400-12143-1-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

24 May, 2016

1 commit

  • This is a follow up work for oom_reaper [1]. As the async OOM killing
    depends on oom_sem for read we would really appreciate if a holder for
    write didn't stood in the way. This patchset is changing many of
    down_write calls to be killable to help those cases when the writer is
    blocked and waiting for readers to release the lock and so help
    __oom_reap_task to process the oom victim.

    Most of the patches are really trivial because the lock is help from a
    shallow syscall paths where we can return EINTR trivially and allow the
    current task to die (note that EINTR will never get to the userspace as
    the task has fatal signal pending). Others seem to be easy as well as
    the callers are already handling fatal errors and bail and return to
    userspace which should be sufficient to handle the failure gracefully.
    I am not familiar with all those code paths so a deeper review is really
    appreciated.

    As this work is touching more areas which are not directly connected I
    have tried to keep the CC list as small as possible and people who I
    believed would be familiar are CCed only to the specific patches (all
    should have received the cover though).

    This patchset is based on linux-next and it depends on
    down_write_killable for rw_semaphores which got merged into tip
    locking/rwsem branch and it is merged into this next tree. I guess it
    would be easiest to route these patches via mmotm because of the
    dependency on the tip tree but if respective maintainers prefer other
    way I have no objections.

    I haven't covered all the mmap_write(mm->mmap_sem) instances here

    $ git grep "down_write(.*\)" next/master | wc -l
    98
    $ git grep "down_write(.*\)" | wc -l
    62

    I have tried to cover those which should be relatively easy to review in
    this series because this alone should be a nice improvement. Other
    places can be changed on top.

    [0] http://lkml.kernel.org/r/1456752417-9626-1-git-send-email-mhocko@kernel.org
    [1] http://lkml.kernel.org/r/1452094975-551-1-git-send-email-mhocko@kernel.org
    [2] http://lkml.kernel.org/r/1456750705-7141-1-git-send-email-mhocko@kernel.org

    This patch (of 18):

    This is the first step in making mmap_sem write waiters killable. It
    focuses on the trivial ones which are taking the lock early after
    entering the syscall and they are not changing state before.

    Therefore it is very easy to change them to use down_write_killable and
    immediately return with -EINTR. This will allow the waiter to pass away
    without blocking the mmap_sem which might be required to make a forward
    progress. E.g. the oom reaper will need the lock for reading to
    dismantle the OOM victim address space.

    The only tricky function in this patch is vm_mmap_pgoff which has many
    call sites via vm_mmap. To reduce the risk keep vm_mmap with the
    original non-killable semantic for now.

    vm_munmap callers do not bother checking the return value so open code
    it into the munmap syscall path for now for simplicity.

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Konstantin Khlebnikov
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

20 May, 2016

2 commits

  • Whatever huge pagecache implementation we go with, file rmap locking
    must be added to anon rmap locking, when mremap's move_page_tables()
    finds a pmd_trans_huge pmd entry: a simple change, let's do it now.

    Factor out take_rmap_locks() and drop_rmap_locks() to handle the locking
    for make move_ptes() and move_page_tables(), and delete the
    VM_BUG_ON_VMA which rejected vm_file and required anon_vma.

    Signed-off-by: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Andres Lagar-Cavilla
    Cc: Yang Shi
    Cc: Ning Qu
    Cc: Mel Gorman
    Cc: Andres Lagar-Cavilla
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove move_huge_pmd()'s redundant new_vma arg: all it was used for was
    a VM_NOHUGEPAGE check on new_vma flags, but the new_vma is cloned from
    the old vma, so a trans_huge_pmd in the new_vma will be as acceptable as
    it was in the old vma, alignment and size permitting.

    Signed-off-by: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Andres Lagar-Cavilla
    Cc: Yang Shi
    Cc: Ning Qu
    Cc: Mel Gorman
    Cc: Andres Lagar-Cavilla
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

18 Mar, 2016

2 commits

  • There are few things about *pte_alloc*() helpers worth cleaning up:

    - 'vma' argument is unused, let's drop it;

    - most __pte_alloc() callers do speculative check for pmd_none(),
    before taking ptl: let's introduce pte_alloc() macro which does
    the check.

    The only direct user of __pte_alloc left is userfaultfd, which has
    different expectation about atomicity wrt pmd.

    - pte_alloc_map() and pte_alloc_map_lock() are redefined using
    pte_alloc().

    [sudeep.holla@arm.com: fix build for arm64 hugetlbpage]
    [sfr@canb.auug.org.au: fix arch/arm/mm/mmu.c some more]
    Signed-off-by: Kirill A. Shutemov
    Cc: Dave Hansen
    Signed-off-by: Sudeep Holla
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • max_map_count sysctl unrelated to scheduler. Move its bits from
    include/linux/sched/sysctl.h to include/linux/mm.h.

    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

12 Feb, 2016

1 commit

  • DAX implements split_huge_pmd() by clearing pmd. This simple approach
    reduces memory overhead, as we don't need to deposit page table on huge
    page mapping to make split_huge_pmd() never-fail. PTE table can be
    allocated and populated later on page fault from backing store.

    But one side effect is that have to check if pmd is pmd_none() after
    split_huge_pmd(). In most places we do this already to deal with
    parallel MADV_DONTNEED.

    But I found two call sites which is not affected by MADV_DONTNEED (due
    down_write(mmap_sem)), but need to have the check to work with DAX
    properly.

    Signed-off-by: Kirill A. Shutemov
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Cc: Andrea Arcangeli
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

16 Jan, 2016

2 commits

  • With new refcounting we don't need to mark PMDs splitting. Let's drop
    code to handle this.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We are going to decouple splitting THP PMD from splitting underlying
    compound page.

    This patch renames split_huge_page_pmd*() functions to split_huge_pmd*()
    to reflect the fact that it doesn't imply page splitting, only PMD.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

15 Jan, 2016

1 commit

  • When inspecting a vague code inside prctl(PR_SET_MM_MEM) call (which
    testing the RLIMIT_DATA value to figure out if we're allowed to assign
    new @start_brk, @brk, @start_data, @end_data from mm_struct) it's been
    commited that RLIMIT_DATA in a form it's implemented now doesn't do
    anything useful because most of user-space libraries use mmap() syscall
    for dynamic memory allocations.

    Linus suggested to convert RLIMIT_DATA rlimit into something suitable
    for anonymous memory accounting. But in this patch we go further, and
    the changes are bundled together as:

    * keep vma counting if CONFIG_PROC_FS=n, will be used for limits
    * replace mm->shared_vm with better defined mm->data_vm
    * account anonymous executable areas as executable
    * account file-backed growsdown/up areas as stack
    * drop struct file* argument from vm_stat_account
    * enforce RLIMIT_DATA for size of data areas

    This way code looks cleaner: now code/stack/data classification depends
    only on vm_flags state:

    VM_EXEC & ~VM_WRITE -> code (VmExe + VmLib in proc)
    VM_GROWSUP | VM_GROWSDOWN -> stack (VmStk)
    VM_WRITE & ~VM_SHARED & !stack -> data (VmData)

    The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called
    "shared", but that might be strange beast like readonly-private or VM_IO
    area.

    - RLIMIT_AS limits whole address space "VmSize"
    - RLIMIT_STACK limits stack "VmStk" (but each vma individually)
    - RLIMIT_DATA now limits "VmData"

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Cyrill Gorcunov
    Cc: Quentin Casasnovas
    Cc: Vegard Nossum
    Acked-by: Linus Torvalds
    Cc: Willy Tarreau
    Cc: Andy Lutomirski
    Cc: Kees Cook
    Cc: Vladimir Davydov
    Cc: Pavel Emelyanov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

05 Jan, 2016

1 commit

  • mremap() with MREMAP_FIXED on a VM_PFNMAP range causes the following
    WARN_ON_ONCE() message in untrack_pfn().

    WARNING: CPU: 1 PID: 3493 at arch/x86/mm/pat.c:985 untrack_pfn+0xbd/0xd0()
    Call Trace:
    [] dump_stack+0x45/0x57
    [] warn_slowpath_common+0x86/0xc0
    [] warn_slowpath_null+0x1a/0x20
    [] untrack_pfn+0xbd/0xd0
    [] unmap_single_vma+0x80e/0x860
    [] unmap_vmas+0x55/0xb0
    [] unmap_region+0xac/0x120
    [] do_munmap+0x28a/0x460
    [] move_vma+0x1b3/0x2e0
    [] SyS_mremap+0x3b3/0x510
    [] entry_SYSCALL_64_fastpath+0x12/0x71

    MREMAP_FIXED moves a pfnmap from old vma to new vma. untrack_pfn() is
    called with the old vma after its pfnmap page table has been removed,
    which causes follow_phys() to fail. The new vma has a new pfnmap to
    the same pfn & cache type with VM_PAT set. Therefore, we only need to
    clear VM_PAT from the old vma in this case.

    Add untrack_pfn_moved(), which clears VM_PAT from a given old vma.
    move_vma() is changed to call this function with the old vma when
    VM_PFNMAP is set. move_vma() then calls do_munmap(), and untrack_pfn()
    is a no-op since VM_PAT is cleared.

    Reported-by: Stas Sergeev
    Signed-off-by: Toshi Kani
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc: Borislav Petkov
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/1450832064-10093-2-git-send-email-toshi.kani@hpe.com
    Signed-off-by: Thomas Gleixner

    Toshi Kani
     

06 Nov, 2015

1 commit


05 Sep, 2015

5 commits

  • Minor, but this check is overcomplicated. Two half-intervals do NOT
    overlap if END1
    Acked-by: David Rientjes
    Cc: Benjamin LaHaise
    Cc: Hugh Dickins
    Cc: Jeff Moyer
    Cc: Kirill A. Shutemov
    Cc: Laurent Dufour
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The "new_len > old_len" branch in vma_to_resize() looks very confusing.
    It only covers the VM_DONTEXPAND/pgoff checks but everything below is
    equally unneeded if new_len == old_len.

    Change this code to return if "new_len == old_len", new_len < old_len is
    not possible, otherwise the code below is wrong anyway.

    Signed-off-by: Oleg Nesterov
    Acked-by: David Rientjes
    Cc: Benjamin LaHaise
    Cc: Hugh Dickins
    Cc: Jeff Moyer
    Cc: Kirill A. Shutemov
    Cc: Laurent Dufour
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • move_vma() sets *locked even if move_page_tables() or ->mremap() fails,
    change sys_mremap() to check "ret & ~PAGE_MASK".

    I think we should simply remove the VM_LOCKED code in move_vma(), that is
    why this patch doesn't change move_vma(). But this needs more cleanups.

    Signed-off-by: Oleg Nesterov
    Acked-by: David Rientjes
    Cc: Benjamin LaHaise
    Cc: Hugh Dickins
    Cc: Jeff Moyer
    Cc: Kirill A. Shutemov
    Cc: Laurent Dufour
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • vma->vm_ops->mremap() looks more natural and clean in move_vma(), and this
    way ->mremap() can have more users. Say, vdso.

    While at it, s/aio_ring_remap/aio_ring_mremap/.

    Note: this is the minimal change before ->mremap() finds another user in
    file_operations; this method should have more arguments, and it can be
    used to kill arch_remap().

    Signed-off-by: Oleg Nesterov
    Acked-by: Pavel Emelyanov
    Acked-by: Kirill A. Shutemov
    Cc: David Rientjes
    Cc: Benjamin LaHaise
    Cc: Hugh Dickins
    Cc: Jeff Moyer
    Cc: Laurent Dufour
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • move_vma() can't just return if f_op->mremap() fails, we should unmap the
    new vma like we do if move_page_tables() fails. To avoid the code
    duplication this patch moves the "move entries back" under the new "if
    (err)" branch.

    Signed-off-by: Oleg Nesterov
    Acked-by: David Rientjes
    Cc: Benjamin LaHaise
    Cc: Hugh Dickins
    Cc: Jeff Moyer
    Cc: Kirill Shutemov
    Cc: Pavel Emelyanov
    Cc: Laurent Dufour
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

25 Jun, 2015

1 commit

  • Some architectures would like to be triggered when a memory area is moved
    through the mremap system call.

    This patch introduces a new arch_remap() mm hook which is placed in the
    path of mremap, and is called before the old area is unmapped (and the
    arch_unmap() hook is called).

    Signed-off-by: Laurent Dufour
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laurent Dufour
     

16 Apr, 2015

2 commits

  • As suggested by Kirill the "goto"s in vma_to_resize aren't necessary, just
    change them to explicit return.

    Signed-off-by: Derek Che
    Suggested-by: "Kirill A. Shutemov"
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Derek
     
  • Recently I straced bash behavior in this dd zero pipe to read test, in
    part of testing under vm.overcommit_memory=2 (OVERCOMMIT_NEVER mode):

    # dd if=/dev/zero | read x

    The bash sub shell is calling mremap to reallocate more and more memory
    untill it finally failed -ENOMEM (I expect), or to be killed by system OOM
    killer (which should not happen under OVERCOMMIT_NEVER mode); But the
    mremap system call actually failed of -EFAULT, which is a surprise to me,
    I think it's supposed to be -ENOMEM? then I wrote this piece of C code
    testing confirmed it: https://gist.github.com/crquan/326bde37e1ddda8effe5

    $ ./remap
    allocated one page @0x7f686bf71000, (PAGE_SIZE: 4096)
    grabbed 7680512000 bytes of memory (1875125 pages) @ 00007f6690993000.
    mremap failed Bad address (14).

    The -EFAULT comes from the branch of security_vm_enough_memory_mm failure,
    underlyingly it calls __vm_enough_memory which returns only 0 for success
    or -ENOMEM; So why vma_to_resize needs to return -EFAULT in this case?
    this sounds like a mistake to me.

    Some more digging into git history:

    1) Before commit 119f657c7 ("RLIMIT_AS checking fix") in May 1 2005
    (pre 2.6.12 days) it was returning -ENOMEM for this failure;

    2) but commit 119f657c7 ("untangling do_mremap(), part 1") changed it
    accidentally, to what ever is preserved in local ret, which happened to
    be -EFAULT, in a previous assignment;

    3) then in commit 54f5de709 code refactoring, it's explicitly returning
    -EFAULT, should be wrong.

    Signed-off-by: Derek Che
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Derek
     

07 Apr, 2015

1 commit

  • teach ->mremap() method to return an error and have it fail for
    aio mappings in process of being killed

    Note that in case of ->mremap() failure we need to undo move_page_tables()
    we'd already done; we could call ->mremap() first, but then the failure of
    move_page_tables() would require undoing whatever _successful_ ->mremap()
    has done, which would be a lot more headache in general.

    Signed-off-by: Al Viro

    Al Viro
     

11 Feb, 2015

1 commit


15 Dec, 2014

1 commit


14 Dec, 2014

3 commits

  • There are actually two issues this patch addresses. Let me start with
    the one I tried to solve in the beginning.

    So, in the checkpoint-restore project (criu) we try to dump tasks'
    state and restore one back exactly as it was. One of the tasks' state
    bits is rings set up with io_setup() call. There's (almost) no problems
    in dumping them, there's a problem restoring them -- if I dump a task
    with aio ring originally mapped at address A, I want to restore one
    back at exactly the same address A. Unfortunately, the io_setup() does
    not allow for that -- it mmaps the ring at whatever place mm finds
    appropriate (it calls do_mmap_pgoff() with zero address and without
    the MAP_FIXED flag).

    To make restore possible I'm going to mremap() the freshly created ring
    into the address A (under which it was seen before dump). The problem is
    that the ring's virtual address is passed back to the user-space as the
    context ID and this ID is then used as search key by all the other io_foo()
    calls. Reworking this ID to be just some integer doesn't seem to work, as
    this value is already used by libaio as a pointer using which this library
    accesses memory for aio meta-data.

    So, to make restore work we need to make sure that

    a) ring is mapped at desired virtual address
    b) kioctx->user_id matches this value

    Having said that, the patch makes mremap() on aio region update the
    kioctx's user_id and mmap_base values.

    Here appears the 2nd issue I mentioned in the beginning of this mail.
    If (regardless of the C/R dances I do) someone creates an io context
    with io_setup(), then mremap()-s the ring and then destroys the context,
    the kill_ioctx() routine will call munmap() on wrong (old) address.
    This will result in a) aio ring remaining in memory and b) some other
    vma get unexpectedly unmapped.

    What do you think?

    Signed-off-by: Pavel Emelyanov
    Acked-by: Dmitry Monakhov
    Signed-off-by: Benjamin LaHaise

    Pavel Emelyanov
     
  • The i_mmap_mutex is a close cousin of the anon vma lock, both protecting
    similar data, one for file backed pages and the other for anon memory. To
    this end, this lock can also be a rwsem. In addition, there are some
    important opportunities to share the lock when there are no tree
    modifications.

    This conversion is straightforward. For now, all users take the write
    lock.

    [sfr@canb.auug.org.au: update fremap.c]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Acked-by: "Kirill A. Shutemov"
    Acked-by: Hugh Dickins
    Cc: Oleg Nesterov
    Acked-by: Peter Zijlstra (Intel)
    Cc: Srikar Dronamraju
    Acked-by: Mel Gorman
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Convert all open coded mutex_lock/unlock calls to the
    i_mmap_[lock/unlock]_write() helpers.

    Signed-off-by: Davidlohr Bueso
    Acked-by: Rik van Riel
    Acked-by: "Kirill A. Shutemov"
    Acked-by: Hugh Dickins
    Cc: Oleg Nesterov
    Acked-by: Peter Zijlstra (Intel)
    Cc: Srikar Dronamraju
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

10 Oct, 2014

2 commits

  • "WARNING: Use #include instead of "

    Signed-off-by: Paul McQuade
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul McQuade
     
  • Trivially convert a few VM_BUG_ON calls to VM_BUG_ON_VMA to extract
    more information when they trigger.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Sasha Levin
    Reviewed-by: Naoya Horiguchi
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Vlastimil Babka
    Cc: Michel Lespinasse
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

11 May, 2014

1 commit

  • It's critical for split_huge_page() (and migration) to catch and freeze
    all PMDs on rmap walk. It gets tricky if there's concurrent fork() or
    mremap() since usually we copy/move page table entries on dup_mm() or
    move_page_tables() without rmap lock taken. To get it work we rely on
    rmap walk order to not miss any entry. We expect to see destination VMA
    after source one to work correctly.

    But after switching rmap implementation to interval tree it's not always
    possible to preserve expected walk order.

    It works fine for dup_mm() since new VMA has the same vma_start_pgoff()
    / vma_last_pgoff() and explicitly insert dst VMA after src one with
    vma_interval_tree_insert_after().

    But on move_vma() destination VMA can be merged into adjacent one and as
    result shifted left in interval tree. Fortunately, we can detect the
    situation and prevent race with rmap walk by moving page table entries
    under rmap lock. See commit 38a76013ad80.

    Problem is that we miss the lock when we move transhuge PMD. Most
    likely this bug caused the crash[1].

    [1] http://thread.gmane.org/gmane.linux.kernel.mm/96473

    Fixes: 108d6642ad81 ("mm anon rmap: remove anon_vma_moveto_tail")

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Andrea Arcangeli
    Cc: Rik van Riel
    Acked-by: Michel Lespinasse
    Cc: Dave Jones
    Cc: David Miller
    Acked-by: Johannes Weiner
    Cc: [3.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

17 Oct, 2013

1 commit

  • Revert commit 1ecfd533f4c5 ("mm/mremap.c: call pud_free() after fail
    calling pmd_alloc()").

    The original code was correct: pud_alloc(), pmd_alloc(), pte_alloc_map()
    ensure that the pud, pmd, pt is already allocated, and seldom do they
    need to allocate; on failure, upper levels are freed if appropriate by
    the subsequent do_munmap(). Whereas commit 1ecfd533f4c5 did an
    unconditional pud_free() of a most-likely still-in-use pud: saved only
    by the near-impossiblity of pmd_alloc() failing.

    Signed-off-by: Hugh Dickins
    Cc: Chen Gang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

12 Sep, 2013

1 commit


28 Aug, 2013

1 commit

  • Dave reported corrupted swap entries

    | [ 4588.541886] swap_free: Unused swap offset entry 00002d15
    | [ 4588.541952] BUG: Bad page map in process trinity-kid12 pte:005a2a80 pmd:22c01f067

    and Hugh pointed that in move_ptes _PAGE_SOFT_DIRTY bit set regardless
    the type of entry pte consists of. The trick here is that when we carry
    soft dirty status in swap entries we are to use _PAGE_SWP_SOFT_DIRTY
    instead, because this is the only place in pte which can be used for own
    needs without intersecting with bits owned by swap entry type/offset.

    Reported-and-tested-by: Dave Jones
    Signed-off-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Analyzed-by: Hugh Dickins
    Cc: Hillf Danton
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     

10 Jul, 2013

1 commit

  • This patch is very similar to commit 84d96d897671 ("mm: madvise:
    complete input validation before taking lock"): perform some basic
    validation of the input to mremap() before taking the
    ¤t->mm->mmap_sem lock.

    This also makes the MREMAP_FIXED => MREMAP_MAYMOVE dependency slightly
    more explicit.

    Signed-off-by: Rasmus Villemoes
    Cc: KOSAKI Motohiro
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     

04 Jul, 2013

1 commit

  • The soft-dirty is a bit on a PTE which helps to track which pages a task
    writes to. In order to do this tracking one should

    1. Clear soft-dirty bits from PTEs ("echo 4 > /proc/PID/clear_refs)
    2. Wait some time.
    3. Read soft-dirty bits (55'th in /proc/PID/pagemap2 entries)

    To do this tracking, the writable bit is cleared from PTEs when the
    soft-dirty bit is. Thus, after this, when the task tries to modify a
    page at some virtual address the #PF occurs and the kernel sets the
    soft-dirty bit on the respective PTE.

    Note, that although all the task's address space is marked as r/o after
    the soft-dirty bits clear, the #PF-s that occur after that are processed
    fast. This is so, since the pages are still mapped to physical memory,
    and thus all the kernel does is finds this fact out and puts back
    writable, dirty and soft-dirty bits on the PTE.

    Another thing to note, is that when mremap moves PTEs they are marked
    with soft-dirty as well, since from the user perspective mremap modifies
    the virtual memory at mremap's new address.

    Signed-off-by: Pavel Emelyanov
    Cc: Matt Mackall
    Cc: Xiao Guangrong
    Cc: Glauber Costa
    Cc: Marcelo Tosatti
    Cc: KOSAKI Motohiro
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     

24 Feb, 2013

2 commits


08 Feb, 2013

1 commit

  • Move the sysctl-related bits from include/linux/sched.h into
    a new file: include/linux/sched/sysctl.h. Then update source
    files requiring access to those bits by including the new
    header file.

    Signed-off-by: Clark Williams
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/20130207094659.06dced96@riff.lan
    Signed-off-by: Ingo Molnar

    Clark Williams
     

17 Dec, 2012

1 commit

  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests shows that balancenuma is
    incapable of converging for this workloads driven by perf which is bad
    but is potentially explained by the lack of scheduler smarts. Thomas'
    results show balancenuma improves on mainline but falls far short of
    numacore or autonuma. Srikar's results indicate we all suffer on a
    large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers which is not reported by
    the tool by default and sometimes missed in treports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore batch
    handles PTEs but I'm no longer think this is the case. It's possible
    numacore is just able to trigger it due to higher rates of migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds