07 Nov, 2011

1 commit

  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of
    net: inet_timewait_sock doesnt need
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     

01 Nov, 2011

2 commits

  • A process spent 30 minutes exiting, just munlocking the pages of a large
    anonymous area that had been alternately mprotected into page-sized vmas:
    for every single page there's an anon_vma walk through all the other
    little vmas to find the right one.

    A general fix to that would be a lot more complicated (use prio_tree on
    anon_vma?), but there's one very simple thing we can do to speed up the
    common case: if a page to be munlocked is mapped only once, then it is our
    vma that it is mapped into, and there's no need whatever to walk through
    all the others.
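
    A minimal sketch of that fast path follows, assuming helpers of that era
    (try_to_munlock(), TestClearPageMlocked()); the function name and the exact
    bookkeeping are illustrative, not the literal patch:

    #include <linux/mm.h>
    #include <linux/rmap.h>
    #include <linux/vmstat.h>

    /*
     * Sketch: a page with mapcount 1 during munlock can only be mapped by
     * the vma we are munlocking, so the per-page anon_vma walk over all the
     * sibling vmas can be skipped and the Mlocked state cleared directly.
     */
    static void munlock_mapcount_fast_path(struct page *page)
    {
            if (page_mapcount(page) > 1) {
                    try_to_munlock(page);   /* slow path: full rmap walk */
                    return;
            }
            /* mapped only into our vma: no rmap walk needed */
            if (TestClearPageMlocked(page))
                    dec_zone_page_state(page, NR_MLOCK);
    }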

    Okay, there is a very remote race in munlock_vma_pages_range(), if between
    its follow_page() and lock_page(), another process were to munlock the
    same page, then page reclaim remove it from our vma, then another process
    mlock it again. We would find it with page_mapcount 1, yet it's still
    mlocked in another process. But never mind, that's much less likely than
    the down_read_trylock() failure which munlocking already tolerates (in
    try_to_unmap_one()): in due course page reclaim will discover and move the
    page to unevictable instead.

    [akpm@linux-foundation.org: add comment]
    Signed-off-by: Hugh Dickins
    Cc: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • MCL_FUTURE does not move pages between lru lists, and draining the LRU per-cpu
    pagevecs is a nasty activity. Avoid doing it unnecessarily.

    Signed-off-by: Christoph Lameter
    Cc: David Rientjes
    Reviewed-by: Minchan Kim
    Acked-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

31 Oct, 2011

1 commit


27 May, 2011

1 commit

  • The type of vma->vm_flags is 'unsigned long', not 'int' or 'unsigned int'.
    This patch fixes such misuse.

    Signed-off-by: KOSAKI Motohiro
    [ Changed to use a typedef - we'll extend it to cover more cases
    later, since there has been discussion about making it a 64-bit
    type.. - Linus ]
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

05 May, 2011

1 commit

  • The logic in __get_user_pages() used to skip the stack guard page lookup
    whenever the caller wasn't interested in seeing what the actual page
    was. But Michel Lespinasse points out that there are cases where we
    don't care about the physical page itself (so 'pages' may be NULL), but
    do want to make sure a page is mapped into the virtual address space.

    So using the existence of the "pages" array as an indication of whether
    to look up the guard page or not isn't actually so great, and we really
    should just use the FOLL_MLOCK bit. But because that bit was only set
    for the VM_LOCKED case (and not all vma's necessarily have it, even for
    mlock()), we couldn't do that originally.

    Fix that by moving the VM_LOCKED check deeper into the call-chain, which
    actually simplifies many things. Now mlock() gets simpler, and we can
    also check for FOLL_MLOCK in __get_user_pages() and the code ends up
    much more straightforward.
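
    A hedged sketch of the resulting shape inside __get_user_pages();
    is_stack_guard_page() is a stand-in name for whatever predicate identifies
    the guard page, not a function quoted from the patch:

    /*
     * Sketch: once FOLL_MLOCK travels in the gup flags, the decision to
     * skip the stack guard page no longer depends on whether the caller
     * passed a 'pages' array.
     */
    static bool gup_skip_guard_page(struct vm_area_struct *vma,
                                    unsigned long addr, unsigned int gup_flags)
    {
            return (gup_flags & FOLL_MLOCK) && is_stack_guard_page(vma, addr);
    }

    When this returns true, the gup loop would simply move on to the next page
    instead of faulting the guard page in.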

    Reported-and-reviewed-by: Michel Lespinasse
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

13 Apr, 2011

1 commit

  • Commit 53a7706d5ed8 ("mlock: do not hold mmap_sem for extended periods
    of time") changed mlock() to care about the exact number of pages that
    __get_user_pages() had brought it. Before, it would only care about
    errors.

    And that doesn't work, because we also handled one page specially in
    __mlock_vma_pages_range(), namely the stack guard page. So when that
    case was handled, the number of pages that the function returned was off
    by one. In particular, it could be zero, and then the caller would end
    up not making any progress at all.

    Rather than try to fix up that off-by-one error for the mlock case
    specially, this just moves the logic to handle the stack guard page
    into __get_user_pages() itself, thus making all the counts come out
    right automatically.

    Reported-by: Robert Święcki
    Cc: Hugh Dickins
    Cc: Oleg Nesterov
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

24 Mar, 2011

1 commit

  • Morally, the presence of a gate vma is more an attribute of a particular mm than
    a particular task. Moreover, dropping the dependency on task_struct will help
    make both existing and future operations on mm's more flexible and convenient.

    Signed-off-by: Stephen Wilson
    Reviewed-by: Michel Lespinasse
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Signed-off-by: Al Viro

    Stephen Wilson
     

02 Feb, 2011

1 commit

  • As Tao Ma noticed, change 5ecfda0 breaks blktrace. This is because
    blktrace mmaps a file with PROT_WRITE permissions but without PROT_READ,
    so my attempt to not unnecessarily break COW during mlock ended up
    causing mlock to fail with a permission problem.

    I am proposing to let mlock ignore vma protection in all cases except
    PROT_NONE. In particular, mlock should not fail for PROT_WRITE regions
    (as in the blktrace case, which broke at 5ecfda0) or for PROT_EXEC
    regions (which seem to me like they were always broken).

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

14 Jan, 2011

5 commits

  • __get_user_pages gets a new 'nonblocking' parameter to signal that the
    caller is prepared to re-acquire mmap_sem and retry the operation if
    needed. This is used to split off long operations if they are going to
    block on a disk transfer, or when we detect contention on the mmap_sem.
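
    Roughly how a caller would use the new parameter (a hedged caller-side
    sketch; the argument order of __get_user_pages() shown here is an
    assumption for illustration, not quoted from the patch):

    /*
     * Sketch: the caller passes a 'nonblocking' flag by reference.  If gup
     * cleared it, mmap_sem was dropped while waiting on disk, so the caller
     * must re-take the semaphore (and revalidate its vma) before going on.
     */
    int nonblocking = 1;
    long nr;

    nr = __get_user_pages(tsk, mm, start, nr_pages, gup_flags,
                          pages, NULL, &nonblocking);
    if (!nonblocking)
            down_read(&mm->mmap_sem);       /* it was released for us */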

    [akpm@linux-foundation.org: remove ref to rwsem_is_contended()]
    Signed-off-by: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: KOSAKI Motohiro
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Use a single code path for faulting in pages during mlock.

    The reason to have it in this patch series is that I did not want to
    update both code paths in a later change that releases mmap_sem when
    blocking on disk.

    Signed-off-by: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: KOSAKI Motohiro
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Move the code to mlock pages from __mlock_vma_pages_range() to
    follow_page().

    This allows __mlock_vma_pages_range() to not have to break down work into
    16-page batches.

    An additional motivation for doing this within the present patch series is
    that it'll make it easier for a later change to drop mmap_sem when
    blocking on disk (we'd like to be able to resume at the page that was read
    from disk instead of at the start of a 16-page batch).

    Signed-off-by: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: KOSAKI Motohiro
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Currently mlock() holds mmap_sem in exclusive mode while the pages get
    faulted in. In the case of a large mlock, this can potentially take a
    very long time, during which various commands such as 'ps auxw' will
    block. This makes sysadmins unhappy:

    real 14m36.232s
    user 0m0.003s
    sys 0m0.015s
    (output from 'time ps auxw' while a 20GB file was being mlocked without
    being previously preloaded into page cache)

    I propose that mlock() could release mmap_sem after the VM_LOCKED bits
    have been set in all appropriate VMAs. Then a second pass could be done
    to actually mlock the pages, in small batches, releasing mmap_sem when we
    block on disk access or when we detect some contention.

    This patch:

    Before this change, mlock() holds mmap_sem in exclusive mode while the
    pages get faulted in. In the case of a large mlock, this can potentially
    take a very long time. Various things will block while mmap_sem is held,
    including 'ps auxw'. This can make sysadmins angry.

    I propose that mlock() could release mmap_sem after the VM_LOCKED bits
    have been set in all appropriate VMAs. Then a second pass could be done
    to actually mlock the pages with mmap_sem held for reads only. We need to
    recheck the vma flags after we re-acquire mmap_sem, but this is easy.

    In the case where a vma has been munlocked before mlock completes, pages
    that were already marked as PageMlocked() are handled by the munlock()
    call, and mlock() is careful to not mark new page batches as PageMlocked()
    after the munlock() call has cleared the VM_LOCKED vma flags. So, the end
    result will be identical to what'd happen if munlock() had executed after
    the mlock() call.

    In a later change, I will allow the second pass to release mmap_sem when
    blocking on disk accesses or when it is otherwise contended, so that it
    won't be held for long periods of time even in shared mode.
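
    A hedged sketch of the two-pass structure being described (vma splitting
    via mlock_fixup() and error handling are omitted; not the literal code):

    /* pass 1: mark the vmas VM_LOCKED with mmap_sem held for write */
    down_write(&mm->mmap_sem);
    for (vma = find_vma(mm, start); vma && vma->vm_start < end; vma = vma->vm_next)
            vma->vm_flags |= VM_LOCKED;
    up_write(&mm->mmap_sem);

    /* pass 2: fault the pages in with mmap_sem held only for read */
    down_read(&mm->mmap_sem);
    for (vma = find_vma(mm, start); vma && vma->vm_start < end; vma = vma->vm_next) {
            if (!(vma->vm_flags & VM_LOCKED))
                    continue;       /* munlocked since we dropped the lock */
            __mlock_vma_pages_range(vma, max(start, vma->vm_start),
                                    min(end, vma->vm_end));
    }
    up_read(&mm->mmap_sem);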

    Signed-off-by: Michel Lespinasse
    Tested-by: Valdis Kletnieks
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: KOSAKI Motohiro
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • When faulting in pages for mlock(), we want to break COW for anonymous or
    file pages within VM_WRITABLE, non-VM_SHARED vmas. However, there is no
    need to write-fault into VM_SHARED vmas since shared file pages can be
    mlocked first and dirtied later, when/if they actually get written to.
    Skipping the write fault is desirable, as we don't want to unnecessarily
    cause these pages to be dirtied and queued for writeback.
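
    The write-fault decision this describes boils down to roughly the following
    flag selection (a hedged sketch; the exact flag assembly in the patch may
    differ):

    unsigned int gup_flags = FOLL_TOUCH;

    /*
     * Break COW (write-fault) only for private writable mappings; shared
     * file mappings are mlocked read-only and dirtied later, if ever.
     */
    if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
            gup_flags |= FOLL_WRITE;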

    Signed-off-by: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Kosaki Motohiro
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Theodore Tso
    Cc: Michael Rubin
    Cc: Suleiman Souhlal
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

10 Sep, 2010

1 commit


21 Aug, 2010

1 commit


16 Aug, 2010

1 commit

  • This commit makes the stack guard page somewhat less visible to user
    space. It does this by:

    - not showing the guard page in /proc/<pid>/maps

    It looks like lvm-tools will actually read /proc/self/maps to figure
    out where all its mappings are, and effectively do a specialized
    "mlockall()" in user space. By not showing the guard page as part of
    the mapping (by just adding PAGE_SIZE to the start for grows-up
    pages), lvm-tools ends up not being aware of it.

    - by also teaching the _real_ mlock() functionality not to try to lock
    the guard page.

    That would just expand the mapping down to create a new guard page,
    so there really is no point in trying to lock it in place.

    It would perhaps be nice to show the guard page specially in
    /proc/<pid>/maps (or at least mark grow-down segments in some way), but
    let's not open ourselves up to more breakage by user space from programs
    that depend on the exact details of the 'maps' file.

    Special thanks to Henrique de Moraes Holschuh for diving into lvm-tools
    source code to see what was going on with the whole new warning.

    Reported-and-tested-by: François Valenduc
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

26 Mar, 2010

1 commit

  • Support for the PMU's BTS features has been upstreamed in
    v2.6.32, but we still have the old and disabled ptrace-BTS,
    as Linus noticed it not so long ago.

    It's buggy: TIF_DEBUGCTLMSR is trampling all over that MSR without
    regard for other uses (perf) and doesn't provide the flexibility
    needed for perf either.

    Its users are ptrace-block-step and ptrace-bts, since ptrace-bts
    was never used and ptrace-block-step can be implemented using a
    much simpler approach.

    So axe all 3000 lines of it. That includes the *locked_memory*()
    APIs in mm/mlock.c as well.

    Reported-by: Linus Torvalds
    Signed-off-by: Peter Zijlstra
    Cc: Roland McGrath
    Cc: Oleg Nesterov
    Cc: Markus Metzger
    Cc: Steven Rostedt
    Cc: Andrew Morton
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

07 Mar, 2010

1 commit

  • Make sure compiler won't do weird things with limits. E.g. fetching them
    twice may return 2 different values after writable limits are implemented.

    I.e. either use rlimit helpers added in
    3e10e716abf3c71bdb5d86b8f507f9e72236c9cd ("resource: add helpers for
    fetching rlimits") or ACCESS_ONCE if not applicable.

    Signed-off-by: Jiri Slaby
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     

16 Dec, 2009

3 commits

  • Cleanup stale comments on munlock_vma_page().

    Signed-off-by: Lee Schermerhorn
    Acked-by: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • When KSM merges an mlocked page, it has been forgetting to munlock it:
    that's been left to free_page_mlock(), which reports it in /proc/vmstat as
    unevictable_pgs_mlockfreed instead of unevictable_pgs_munlocked (and
    whinges "Page flag mlocked set for process" in mmotm, whereas mainline is
    silently forgiving). Call munlock_vma_page() to fix that.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • There's contorted mlock/munlock handling in try_to_unmap_anon() and
    try_to_unmap_file(), which we'd prefer not to repeat for KSM swapping.
    Simplify it by moving it all down into try_to_unmap_one().

    One thing is then lost, try_to_munlock()'s distinction between when no vma
    holds the page mlocked, and when a vma does mlock it, but we could not get
    mmap_sem to set the page flag. But its only caller takes no interest in
    that distinction (and is better testing SWAP_MLOCK anyway), so let's keep
    the code simple and return SWAP_AGAIN for both cases.

    try_to_unmap_file()'s TTU_MUNLOCK nonlinear handling was particularly
    amusing: once unravelled, it turns out to have been choosing between two
    different ways of doing the same nothing. Ah, no, one way was actually
    returning SWAP_FAIL when it meant to return SWAP_SUCCESS.

    [kosaki.motohiro@jp.fujitsu.com: comment adding to mlocking in try_to_unmap_one]
    [akpm@linux-foundation.org: remove test of MLOCK_PAGES]
    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Nick Piggin
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Wu Fengguang
    Cc: Minchan Kim
    Signed-off-by: KOSAKI Motohiro
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

22 Sep, 2009

3 commits

  • I'm still reluctant to clutter __get_user_pages() with another flag, just
    to avoid touching ZERO_PAGE count in mlock(); though we can add that later
    if it shows up as an issue in practice.

    But when mlocking, we can test page->mapping slightly earlier, to avoid
    the potentially bouncy rescheduling of lock_page on ZERO_PAGE - mlock
    didn't lock_page in olden ZERO_PAGE days, so we might have regressed.

    And when munlocking, it turns out that FOLL_DUMP coincidentally does
    what's needed to avoid all updates to ZERO_PAGE, so use that here also.
    Plus add comment suggested by KAMEZAWA Hiroyuki.

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Acked-by: Mel Gorman
    Cc: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • __get_user_pages() has been taking its own GUP flags, then processing
    them into FOLL flags for follow_page(). Though oddly named, the FOLL
    flags are more widely used, so pass them to __get_user_pages() now.
    Sorry, VM flags, VM_FAULT flags and FAULT_FLAGs are still distinct.

    (The patch to __get_user_pages() looks peculiar, with both gup_flags
    and foll_flags: the gup_flags remain constant; but as before there's
    an exceptional case, out of scope of the patch, in which foll_flags
    per page have FOLL_WRITE masked off.)

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Hiroaki Wakabayashi points out that when mlock() has been interrupted
    by SIGKILL, the subsequent munlock() takes unnecessarily long because
    its use of __get_user_pages() insists on faulting in all the pages
    which mlock() never reached.

    It's worse than slowness if mlock() is terminated by Out Of Memory kill:
    the munlock_vma_pages_all() in exit_mmap() insists on faulting in all the
    pages which mlock() could not find memory for; so innocent bystanders are
    killed too, and perhaps the system hangs.

    __get_user_pages() does a lot that's silly for munlock(): so remove the
    munlock option from __mlock_vma_pages_range(), and use a simple loop of
    follow_page()s in munlock_vma_pages_range() instead; ignoring absent
    pages, and not marking present pages as accessed or dirty.
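
    A hedged sketch of that loop (start, end and vma come from the
    munlock_vma_pages_range() arguments; error handling and the exact
    follow_page() flags are illustrative):

    unsigned long addr;

    for (addr = start; addr < end; addr += PAGE_SIZE) {
            struct page *page = follow_page(vma, addr, FOLL_GET);

            if (!page)
                    continue;       /* absent page: nothing to munlock */
            lock_page(page);
            munlock_vma_page(page);
            unlock_page(page);
            put_page(page);
            /* note: never mark the page accessed or dirty here */
    }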

    (Change munlock() to only go so far as mlock() reached? That does not
    work out, given the convention that mlock() claims complete success even
    when it has to give up early - in part so that an underlying file can be
    extended later, and those pages locked which earlier would give SIGBUS.)

    Signed-off-by: Hugh Dickins
    Cc:
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Reviewed-by: Hiroaki Wakabayashi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

17 Jun, 2009

1 commit

  • Currently, nobody wants to turn UNEVICTABLE_LRU off. Thus this
    configurability is unnecessary.

    Signed-off-by: KOSAKI Motohiro
    Cc: Johannes Weiner
    Cc: Andi Kleen
    Acked-by: Minchan Kim
    Cc: David Woodhouse
    Cc: Matt Mackall
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

24 Apr, 2009

1 commit

  • The current mm interface is asymmetric. One function allocates a locked
    buffer, another function only refunds the memory.

    Change this to have two functions for accounting and refunding locked
    memory, respectively; and do the actual buffer allocation in ptrace.

    [ Impact: refactor BTS buffer allocation code ]

    Signed-off-by: Markus Metzger
    Acked-by: Andrew Morton
    Cc: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Markus Metzger
     

08 Apr, 2009

1 commit


07 Apr, 2009

1 commit

  • When a ptraced task is unlinked, we need to stop branch tracing for
    that task.

    Since the unlink is called with interrupts disabled, and we need
    interrupts enabled to stop branch tracing, we defer the work.

    Collect all branch tracing related stuff in a branch tracing context.

    Reviewed-by: Oleg Nesterov
    Signed-off-by: Markus Metzger
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: roland@redhat.com
    Cc: eranian@googlemail.com
    Cc: juan.villacis@intel.com
    Cc: ak@linux.jf.intel.com
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Markus Metzger
     

18 Feb, 2009

1 commit

  • Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86, vm86: fix preemption bug
    x86, olpc: fix model detection without OFW
    x86, hpet: fix for LS21 + HPET = boot hang
    x86: CPA avoid repeated lazy mmu flush
    x86: warn if arch_flush_lazy_mmu_cpu is called in preemptible context
    x86/paravirt: make arch_flush_lazy_mmu/cpu disable preemption
    x86, pat: fix warn_on_once() while mapping 0-1MB range with /dev/mem
    x86/cpa: make sure cpa is safe to call in lazy mmu mode
    x86, ptrace, mm: fix double-free on race

    Linus Torvalds
     

11 Feb, 2009

1 commit

  • Ptrace_detach() races with __ptrace_unlink() if the traced task is
    reaped while detaching. This might cause a double-free of the BTS
    buffer.

    Change the ptrace_detach() path to only do the memory accounting in
    ptrace_bts_detach() and leave the buffer free to ptrace_bts_untrace()
    which will be called from __ptrace_unlink().

    The fix follows a proposal from Oleg Nesterov.

    Reported-by: Oleg Nesterov
    Signed-off-by: Markus Metzger
    Signed-off-by: Ingo Molnar

    Markus Metzger
     

09 Feb, 2009

1 commit

  • Commit 27421e211a39784694b597dbf35848b88363c248, Manually revert
    "mlock: downgrade mmap sem while populating mlocked regions", has
    introduced its own regression: __mlock_vma_pages_range() may report
    an error (for example, -EFAULT from trying to lock down pages from
    beyond EOF), but mlock_vma_pages_range() must hide that from its
    callers as before.

    Reported-by: Sami Farin
    Signed-off-by: Hugh Dickins
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

02 Feb, 2009

1 commit

  • This essentially reverts commit 8edb08caf68184fb170f4f69c7445929e199eaea.

    It downgraded our mmap semaphore to a read-lock while mlocking pages, in
    order to allow other threads (and external accesses like "ps" et al) to
    walk the vma lists and take page faults etc. Which is a nice idea, but
    the implementation does not work.

    Because we cannot upgrade the lock back to a write lock without
    releasing the mmap semaphore, the code had to release the lock entirely
    and then re-take it as a writelock. However, that meant that the caller
    possibly lost the vma chain that it was following, since now another
    thread could come in and mmap/munmap the range.

    The code tried to work around that by just looking up the vma again and
    erroring out if that happened, but quite frankly, that was just a buggy
    hack that doesn't actually protect against anything (the other thread
    could just have replaced the vma with another one instead of totally
    unmapping it).

    The only way to downgrade to a read map _reliably_ is to do it at the
    end, which is likely the right thing to do: do all the 'vma' operations
    with the write-lock held, then downgrade to a read after completing them
    all, and then do the "populate the newly mlocked regions" while holding
    just the read lock. And then just drop the read-lock and return to user
    space.

    The (perhaps somewhat simpler) alternative is to just make all the
    callers of mlock_vma_pages_range() know that the mmap lock got dropped,
    and just re-grab the mmap semaphore if it needs to mlock more than one
    vma region.

    So we can do this "downgrade mmap sem while populating mlocked regions"
    thing right, but the way it was done here was absolutely not correct.
    Thus the revert, in the expectation that we will do it all correctly
    some day.

    Cc: Lee Schermerhorn
    Cc: Rik van Riel
    Cc: Andrew Morton
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

14 Jan, 2009

2 commits


07 Jan, 2009

1 commit

  • The initial implementation of checking TIF_MEMDIE covers the cases of OOM
    killing. If the process has been OOM killed, TIF_MEMDIE is set and it
    returns immediately. This patch includes:

    1. add the case where the SIGKILL is sent by a user process. The
    process can try to get_user_pages() unlimited memory even if a user
    process has sent a SIGKILL to it (maybe a monitor finds the process
    exceeding its memory limit and tries to kill it). In the old
    implementation, the SIGKILL wouldn't be handled until get_user_pages()
    returned.

    2. change the return value to ERESTARTSYS. It makes no sense to
    return ENOMEM if get_user_pages() returned because a SIGKILL signal was
    received. Since the general convention for a system call interrupted by
    a signal is ERESTARTNOSYS, the new return value is consistent with that
    (a sketch of the check follows below).
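
    A hedged sketch of the check described in (1) and (2), inside the
    get_user_pages() fault loop ('i' is the number of pages done so far):

    if (fatal_signal_pending(current))
            return i ? i : -ERESTARTSYS;    /* stop at SIGKILL, as described */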

    Lee:

    An unfortunate side effect of "make-get_user_pages-interruptible" is that
    it prevents a SIGKILL'd task from munlock-ing pages that it had mlocked,
    resulting in freeing of mlocked pages. Freeing of mlocked pages, in
    itself, is not so bad. We just count them now--altho' I had hoped to
    remove this stat and add PG_MLOCKED to the free pages flags check.

    However, consider pages in shared libraries mapped by more than one task
    that a task mlocked--e.g., via mlockall(). If the task that mlocked the
    pages exits via SIGKILL, these pages would be left mlocked and
    unevictable.

    Proposed fix:

    Add another GUP flag to ignore sigkill when calling get_user_pages from
    munlock()--similar to Kosaki Motohiro's 'IGNORE_VMA_PERMISSIONS flag for
    the same purpose. We are not actually allocating memory in this case,
    which "make-get_user_pages-interruptible" intends to avoid. We're just
    munlocking pages that are already resident and mapped, and we're reusing
    get_user_pages() to access those pages.

    ?? Maybe we should combine 'IGNORE_VMA_PERMISSIONS and '_IGNORE_SIGKILL
    into a single flag: GUP_FLAGS_MUNLOCK ???

    [Lee.Schermerhorn@hp.com: ignore sigkill in get_user_pages during munlock]
    Signed-off-by: Paul Menage
    Signed-off-by: Ying Han
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Pekka Enberg
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Cc: Oleg Nesterov
    Cc: Lee Schermerhorn
    Cc: Rohit Seth
    Cc: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     

20 Dec, 2008

1 commit

  • Impact: move the BTS buffer accounting to the mlock bucket

    Add alloc_locked_buffer() and free_locked_buffer() functions to mm/mlock.c
    to allocate a buffer and account the locked memory to current.

    Account the memory for the BTS buffer to the tracer.

    Signed-off-by: Markus Metzger
    Signed-off-by: Ingo Molnar

    Markus Metzger
     

17 Nov, 2008

1 commit

  • Fix an uninitialized return value when compiling on parisc (with CONFIG_UNEVICTABLE_LRU=y):
    mm/mlock.c: In function `__mlock_vma_pages_range':
    mm/mlock.c:165: warning: `ret' might be used uninitialized in this function

    Signed-off-by: Helge Deller
    [ It isn't ever really used uninitialized, since no caller should ever
    call this function with an empty range. But the compiler is correct
    that from a local analysis standpoint that is impossible to see, and
    fixing the warning is appropriate. ]
    Signed-off-by: Linus Torvalds

    Helge Deller
     

13 Nov, 2008

1 commit

  • lockdep warns about the following message at boot time on one of my test
    machines: schedule_on_each_cpu() shouldn't be called while the task holds
    mmap_sem.

    Actually, lru_add_drain_all() exists to prevent unevictable pages from
    staying on the reclaimable LRU list, but the current unevictable code can
    rescue unevictable pages even if they stay on the reclaimable list.

    So removing it is better.

    In addition, this patch adds lru_add_drain_all() to sys_mlock() and
    sys_mlockall(). It isn't a must, but it reduces the chance of failing to
    move pages to the unevictable list; such failures can be rescued by vmscan
    later, but reducing them is better.

    Note that if the above rescuing happens, the Mlocked and Unevictable
    fields in /proc/meminfo can mismatch, but that doesn't cause any real
    trouble.

    =======================================================
    [ INFO: possible circular locking dependency detected ]
    2.6.28-rc2-mm1 #2
    -------------------------------------------------------
    lvm/1103 is trying to acquire lock:
    (&cpu_hotplug.lock){--..}, at: [] get_online_cpus+0x29/0x50

    but task is already holding lock:
    (&mm->mmap_sem){----}, at: [] sys_mlockall+0x4e/0xb0

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #3 (&mm->mmap_sem){----}:
    [] check_noncircular+0x82/0x110
    [] might_fault+0x4a/0xa0
    [] validate_chain+0xb11/0x1070
    [] might_fault+0x4a/0xa0
    [] __lock_acquire+0x263/0xa10
    [] lock_acquire+0x7c/0xb0 (*) grab mmap_sem
    [] might_fault+0x4a/0xa0
    [] might_fault+0x7b/0xa0
    [] might_fault+0x4a/0xa0
    [] copy_to_user+0x30/0x60
    [] filldir+0x7c/0xd0
    [] sysfs_readdir+0x11a/0x1f0 (*) grab sysfs_mutex
    [] filldir+0x0/0xd0
    [] filldir+0x0/0xd0
    [] vfs_readdir+0x86/0xa0 (*) grab i_mutex
    [] sys_getdents+0x6b/0xc0
    [] syscall_call+0x7/0xb
    [] 0xffffffff

    -> #2 (sysfs_mutex){--..}:
    [] check_noncircular+0x82/0x110
    [] sysfs_addrm_start+0x2c/0xc0
    [] validate_chain+0xb11/0x1070
    [] sysfs_addrm_start+0x2c/0xc0
    [] __lock_acquire+0x263/0xa10
    [] lock_acquire+0x7c/0xb0 (*) grab sysfs_mutex
    [] sysfs_addrm_start+0x2c/0xc0
    [] mutex_lock_nested+0xa5/0x2f0
    [] sysfs_addrm_start+0x2c/0xc0
    [] sysfs_addrm_start+0x2c/0xc0
    [] sysfs_addrm_start+0x2c/0xc0
    [] create_dir+0x3f/0x90
    [] sysfs_create_dir+0x29/0x50
    [] _spin_unlock+0x25/0x40
    [] kobject_add_internal+0xcd/0x1a0
    [] kobject_set_name_vargs+0x3a/0x50
    [] kobject_init_and_add+0x2d/0x40
    [] sysfs_slab_add+0xd2/0x180
    [] sysfs_add_func+0x0/0x70
    [] sysfs_add_func+0x5c/0x70 (*) grab slub_lock
    [] run_workqueue+0x172/0x200
    [] run_workqueue+0x10f/0x200
    [] worker_thread+0x0/0xf0
    [] worker_thread+0x9c/0xf0
    [] autoremove_wake_function+0x0/0x50
    [] worker_thread+0x0/0xf0
    [] kthread+0x42/0x70
    [] kthread+0x0/0x70
    [] kernel_thread_helper+0x7/0x1c
    [] 0xffffffff

    -> #1 (slub_lock){----}:
    [] check_noncircular+0xd/0x110
    [] slab_cpuup_callback+0x11f/0x1d0
    [] validate_chain+0xb11/0x1070
    [] slab_cpuup_callback+0x11f/0x1d0
    [] mark_lock+0x35d/0xd00
    [] __lock_acquire+0x263/0xa10
    [] lock_acquire+0x7c/0xb0
    [] slab_cpuup_callback+0x11f/0x1d0
    [] down_read+0x43/0x80
    [] slab_cpuup_callback+0x11f/0x1d0 (*) grab slub_lock
    [] slab_cpuup_callback+0x11f/0x1d0
    [] notifier_call_chain+0x3c/0x70
    [] _cpu_up+0x84/0x110
    [] cpu_up+0x4b/0x70 (*) grab cpu_hotplug.lock
    [] kernel_init+0x0/0x170
    [] kernel_init+0xb5/0x170
    [] kernel_init+0x0/0x170
    [] kernel_thread_helper+0x7/0x1c
    [] 0xffffffff

    -> #0 (&cpu_hotplug.lock){--..}:
    [] validate_chain+0x5af/0x1070
    [] dev_status+0x0/0x50
    [] __lock_acquire+0x263/0xa10
    [] lock_acquire+0x7c/0xb0
    [] get_online_cpus+0x29/0x50
    [] mutex_lock_nested+0xa5/0x2f0
    [] get_online_cpus+0x29/0x50
    [] get_online_cpus+0x29/0x50
    [] lru_add_drain_per_cpu+0x0/0x10
    [] get_online_cpus+0x29/0x50 (*) grab cpu_hotplug.lock
    [] schedule_on_each_cpu+0x32/0xe0
    [] __mlock_vma_pages_range+0x85/0x2c0
    [] __lock_acquire+0x285/0xa10
    [] vma_merge+0xa9/0x1d0
    [] mlock_fixup+0x180/0x200
    [] do_mlockall+0x78/0x90 (*) grab mmap_sem
    [] sys_mlockall+0x81/0xb0
    [] syscall_call+0x7/0xb
    [] 0xffffffff

    other info that might help us debug this:

    1 lock held by lvm/1103:
    #0: (&mm->mmap_sem){----}, at: [] sys_mlockall+0x4e/0xb0

    stack backtrace:
    Pid: 1103, comm: lvm Not tainted 2.6.28-rc2-mm1 #2
    Call Trace:
    [] print_circular_bug_tail+0x7c/0xd0
    [] validate_chain+0x5af/0x1070
    [] dev_status+0x0/0x50
    [] __lock_acquire+0x263/0xa10
    [] lock_acquire+0x7c/0xb0
    [] get_online_cpus+0x29/0x50
    [] mutex_lock_nested+0xa5/0x2f0
    [] get_online_cpus+0x29/0x50
    [] get_online_cpus+0x29/0x50
    [] lru_add_drain_per_cpu+0x0/0x10
    [] get_online_cpus+0x29/0x50
    [] schedule_on_each_cpu+0x32/0xe0
    [] __mlock_vma_pages_range+0x85/0x2c0
    [] __lock_acquire+0x285/0xa10
    [] vma_merge+0xa9/0x1d0
    [] mlock_fixup+0x180/0x200
    [] do_mlockall+0x78/0x90
    [] sys_mlockall+0x81/0xb0
    [] syscall_call+0x7/0xb

    Signed-off-by: KOSAKI Motohiro
    Tested-by: Kamalesh Babulal
    Cc: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: Heiko Carstens
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

20 Oct, 2008

1 commit

  • Rework POSIX error return for mlock().

    POSIX requires error codes for mlock*() system calls for some conditions
    that differ from what kernel low-level functions, such as
    get_user_pages(), return for those conditions. For more info, see:

    http://marc.info/?l=linux-kernel&m=121750892930775&w=2

    This patch provides the same translation of get_user_pages()
    error codes to POSIX-specified error codes in the context
    of the mlock rework for the unevictable lru.
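
    The translation amounts to a small helper along these lines (a hedged
    sketch; the helper name is illustrative):

    #include <linux/errno.h>

    /*
     * POSIX wants ENOMEM when part of the range is not mapped (which
     * get_user_pages() reports as -EFAULT) and EAGAIN when the pages could
     * not be locked (which get_user_pages() reports as -ENOMEM).
     */
    static int mlock_posix_error(long retval)
    {
            if (retval == -EFAULT)
                    retval = -ENOMEM;
            else if (retval == -ENOMEM)
                    retval = -EAGAIN;
            return retval;
    }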

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn