21 Jul, 2011

1 commit

  • i_alloc_sem is a rather special rw_semaphore. It's the last one that may
    be released by a non-owner, and its write side is always mirrored by
    real exclusion. Its intended use is to wait for all pending direct I/O
    requests to finish before starting a truncate.

    Replace it with a hand-grown construct:

    - exclusion for truncates is already guaranteed by i_mutex, so it can
    simply fall away
    - the reader side is replaced by an i_dio_count member in struct inode
    that counts the number of pending direct I/O requests. Truncate can't
    proceed as long as it's non-zero
    - when i_dio_count reaches zero we wake up a pending truncate using
    wake_up_bit on a new bit in i_flags
    - new references to i_dio_count can't appear while we are waiting for
    it to reach zero, because starting a new direct I/O operation always
    requires i_mutex (or an equivalent like XFS's i_iolock).

    This scheme is much simpler, and saves the space of a spinlock_t and a
    struct list_head in struct inode (typically 160 bits on a non-debug 64-bit
    system).
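
    As a rough illustration of the idea, here is a userspace analogue using
    pthreads (not the kernel code; the exclusion that prevents new I/O from
    starting during the wait, i_mutex in the kernel, is assumed to be
    provided by the caller):

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t drained = PTHREAD_COND_INITIALIZER;
    static int dio_count;               /* plays the role of i_dio_count */

    void dio_begin(void)                /* before starting a direct I/O */
    {
        pthread_mutex_lock(&lock);
        dio_count++;
        pthread_mutex_unlock(&lock);
    }

    void dio_end(void)                  /* when a direct I/O completes */
    {
        pthread_mutex_lock(&lock);
        if (--dio_count == 0)
            pthread_cond_broadcast(&drained);   /* ~ wake_up_bit() */
        pthread_mutex_unlock(&lock);
    }

    void dio_wait(void)                 /* truncate side: wait for I/O to drain */
    {
        pthread_mutex_lock(&lock);
        while (dio_count != 0)
            pthread_cond_wait(&drained, &lock);
        pthread_mutex_unlock(&lock);
    }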

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

14 Jan, 2011

3 commits

  • MADV_HUGEPAGE and MADV_NOHUGEPAGE were fully effective only if run after
    mmap and before touching the memory. While this is enough for most
    usages, it's little effort to make madvise more dynamic at runtime on an
    existing mapping by making khugepaged aware of madvise.

    MADV_HUGEPAGE: register in khugepaged immediately without waiting for a
    page fault (which may never happen if all pages are already mapped and
    the "enabled" knob was set to madvise during the initial page faults).

    MADV_NOHUGEPAGE: skip vmas marked VM_NOHUGEPAGE in khugepaged to stop
    collapsing pages where not needed.
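
    A hypothetical usage sketch (the mapping size and order of calls are made
    up) showing the advice being applied to memory that is already mapped and
    faulted in:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 16UL << 20;                      /* 16 MB, example size */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        memset(p, 0, len);                    /* pages are already faulted in */

        if (madvise(p, len, MADV_HUGEPAGE))   /* khugepaged may now collapse them */
            perror("madvise(MADV_HUGEPAGE)");

        if (madvise(p, len / 2, MADV_NOHUGEPAGE))   /* and skips this half */
            perror("madvise(MADV_NOHUGEPAGE)");

        munmap(p, len);
        return 0;
    }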

    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: Andrea Arcangeli
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Add madvise MADV_NOHUGEPAGE to mark regions that are not important to be
    hugepage backed. Return -EINVAL if the vma is not of an anonymous type,
    or the feature isn't built into the kernel. Never silently return
    success.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Add madvise MADV_HUGEPAGE to mark regions that are important to be
    hugepage backed. Return -EINVAL if the vma is not of an anonymous type,
    or the feature isn't built into the kernel. Never silently return
    success.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

16 Dec, 2009

4 commits


24 Sep, 2009

1 commit

  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
    HWPOISON: Enable error_remove_page on btrfs
    HWPOISON: Add simple debugfs interface to inject hwpoison on arbitrary PFNs
    HWPOISON: Add madvise() based injector for hardware poisoned pages v4
    HWPOISON: Enable error_remove_page for NFS
    HWPOISON: Enable .remove_error_page for migration aware file systems
    HWPOISON: The high level memory error handler in the VM v7
    HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
    HWPOISON: shmem: call set_page_dirty() with locked page
    HWPOISON: Define a new error_remove_page address space op for async truncation
    HWPOISON: Add invalidate_inode_page
    HWPOISON: Refactor truncate to allow direct truncating of page v2
    HWPOISON: check and isolate corrupted free pages v2
    HWPOISON: Handle hardware poisoned pages in try_to_unmap
    HWPOISON: Use bitmask/action code for try_to_unmap behaviour
    HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
    HWPOISON: Add poison check to page fault handling
    HWPOISON: Add basic support for poisoned pages in fault handler v3
    HWPOISON: Add new SIGBUS error codes for hardware poison signals
    HWPOISON: Add support for poison swap entries v2
    HWPOISON: Export some rmap vma locking to outside world
    ...

    Linus Torvalds
     

22 Sep, 2009

2 commits

  • This patch presents the mm interface to a dummy version of ksm.c, for
    better scrutiny of that interface: the real ksm.c follows later.

    When CONFIG_KSM is not set, madvise(2) rejects MADV_MERGEABLE and
    MADV_UNMERGEABLE with EINVAL, since that seems more helpful than
    pretending that they can be serviced. But when CONFIG_KSM=y, accept them
    even if KSM is not currently running, and even on areas which KSM will not
    touch (e.g. hugetlb or shared file or special driver mappings).

    Like other madvices, report ENOMEM despite success if any area in the
    range is unmapped, and use EAGAIN to report out of memory.

    Define vma flag VM_MERGEABLE to identify an area on which KSM may try
    merging pages: leave it to ksm_madvise() to decide whether to set it.
    Define mm flag MMF_VM_MERGEABLE to identify an mm which might contain
    VM_MERGEABLE areas, to minimize callouts when forking or exiting.

    Based upon earlier patches by Chris Wright and Izik Eidus.
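
    A minimal usage sketch (sizes are illustrative) showing the intended
    calling convention, including the EINVAL fallback when CONFIG_KSM is not
    built in:

    #define _GNU_SOURCE
    #include <errno.h>
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 4UL << 20;
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        if (madvise(p, len, MADV_MERGEABLE) == -1) {
            if (errno == EINVAL)
                fprintf(stderr, "KSM not built into this kernel\n");
            else
                perror("madvise(MADV_MERGEABLE)");
        }

        /* ... later, opt the area back out ... */
        madvise(p, len, MADV_UNMERGEABLE);
        munmap(p, len);
        return 0;
    }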

    Signed-off-by: Hugh Dickins
    Signed-off-by: Chris Wright
    Signed-off-by: Izik Eidus
    Cc: Michael Kerrisk
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Wu Fengguang
    Cc: Balbir Singh
    Cc: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Avi Kivity
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • madvise.c has several levels of switch statements; which one should do
    what? Move the MADV_DOFORK code down from madvise_vma() to
    madvise_behavior(), so madvise_vma() can be a simple router that falls
    through to madvise_behavior() by default.

    vma->vm_flags is an unsigned long so use the same type for new_flags. Add
    missing comment lines to describe MADV_DONTFORK and MADV_DOFORK.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Chris Wright
    Signed-off-by: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Wu Fengguang
    Cc: Balbir Singh
    Cc: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Avi Kivity
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

16 Sep, 2009

1 commit

  • Impact: optional, useful for debugging

    Add a new madvise subcommand to inject poison into some
    pages in a process' address space. This is useful for
    testing the poison page handling.

    This patch can allow root to tie up large amounts of memory.
    I got feedback from container developers and they didn't see any
    problem.

    v2: Use write flag for get_user_pages to make sure to always get
    a fresh page
    v3: Don't request write mapping (Fengguang Wu)
    v4: Move MADV_* number to avoid conflict with KSM (Hugh Dickins)
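
    A hedged example of how the injector might be exercised (root only,
    CONFIG_MEMORY_FAILURE assumed; the SIGBUS on the final access is the
    expected outcome, not something this sketch guarantees on every
    configuration):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        long pagesize = sysconf(_SC_PAGESIZE);
        char *p = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        p[0] = 1;                                   /* fault the page in */
        if (madvise(p, pagesize, MADV_HWPOISON))    /* inject poison */
            perror("madvise(MADV_HWPOISON)");
        else
            p[0] = 2;                   /* expected to die with SIGBUS here */
        return 0;
    }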

    Signed-off-by: Andi Kleen

    Andi Kleen
     

17 Jun, 2009

2 commits

  • The posix_madvise() function succeeds (and does nothing) when called with
    parameters (NULL, 0, -1); according to LSB tests, it should fail with
    EINVAL because -1 is not a valid flag.

    When called with a valid address and size, it correctly fails.

    So perform an initial check for valid flags first.
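
    A small illustration of the behaviour being fixed (the return values in
    the comment are before/after this patch):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        int rc = posix_madvise(NULL, 0, -1);
        /* before the fix: rc == 0; after the fix: rc == EINVAL */
        printf("posix_madvise(NULL, 0, -1) = %d\n", rc);
        return 0;
    }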

    Reported-by: Jiri Dluhos
    Signed-off-by: Nick Piggin
    Reviewed-and-Tested-by: WANG Cong
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Impact: code simplification.

    Cc: Nick Piggin
    Signed-off-by: Wu Fengguang
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     

13 May, 2009

1 commit


06 May, 2009

1 commit

  • madvise(MADV_WILLNEED) forces page cache readahead on a range of memory
    backed by a file. The assumption is made that the page required is
    order-0 and "normal" page cache.

    On hugetlbfs, this assumption is not true and order-0 pages are
    allocated and inserted into the hugetlbfs page cache. This leaks
    hugetlbfs page reservations and can cause BUGs to trigger related to
    corrupted page tables.

    This patch causes MADV_WILLNEED to be ignored for hugetlbfs-backed
    regions.
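
    For contrast, a typical MADV_WILLNEED call on an ordinary file-backed
    mapping, the case this fix leaves untouched (the file path is whatever
    the caller supplies):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* ask for page cache readahead across the whole mapping */
        if (madvise(p, st.st_size, MADV_WILLNEED))
            perror("madvise(MADV_WILLNEED)");

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }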

    Signed-off-by: Mel Gorman
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

14 Jan, 2009

1 commit


31 Jul, 2008

1 commit


28 Apr, 2008

1 commit

  • Convert XIP to support non-struct page backed memory, using VM_MIXEDMAP for
    the user mappings.

    This requires the get_xip_page API to be changed to an address based one.
    Improve the API layering a little bit too, while we're here.

    This is required in order to support XIP filesystems on memory that isn't
    backed with struct page (but memory with struct page is still supported too).

    Signed-off-by: Nick Piggin
    Acked-by: Carsten Otte
    Cc: Jared Hulbert
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

17 Jul, 2007

1 commit

  • In the new madvise_need_mmap_write() call we can avoid an extra case
    statement and function call as follows.
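
    Roughly, the helper's shape after the cleanup looks like the sketch below
    (shown with the userspace MADV_* constants for illustration only; this is
    not the kernel source, which uses its own definitions):

    #include <sys/mman.h>

    static int madvise_need_mmap_write(int behavior)
    {
        switch (behavior) {
        case MADV_REMOVE:
        case MADV_WILLNEED:
        case MADV_DONTNEED:
            return 0;       /* a read lock is enough */
        default:
            /* be safe; list the read-only behaviors explicitly */
            return 1;
        }
    }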

    Signed-off-by: Jason Baron
    Cc: Nishanth Aravamudan
    Cc: Christoph Hellwig
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Baron
     

22 May, 2007

1 commit

  • The first thing mm.h does is include sched.h, solely for the can_do_mlock()
    inline function, which dereferences "current". By dealing with
    can_do_mlock(), mm.h can be detached from sched.h, which is good. See
    below for why.

    This patch
    a) removes the unconditional inclusion of sched.h from mm.h
    b) makes can_do_mlock() a normal function in mm/mlock.c
    c) exports can_do_mlock() so as not to break compilation
    d) adds sched.h inclusions back to files that were getting it indirectly
    e) adds less bloated headers (asm/signal.h, jiffies.h) to some files that
    were getting them indirectly

    Net result is:
    a) mm.h users would get less code to open, read, preprocess, parse, ... if
    they don't need sched.h
    b) sched.h stops being dependency for significant number of files:
    on x86_64 allmodconfig touching sched.h results in recompile of 4083 files,
    after patch it's only 3744 (-8.3%).

    Cross-compile tested on

    all arm defconfigs, all mips defconfigs, all powerpc defconfigs,
    alpha alpha-up
    arm
    i386 i386-up i386-defconfig i386-allnoconfig
    ia64 ia64-up
    m68k
    mips
    parisc parisc-up
    powerpc powerpc-up
    s390 s390-up
    sparc sparc-up
    sparc64 sparc64-up
    um-x86_64
    x86_64 x86_64-up x86_64-defconfig x86_64-allnoconfig

    as well as my two usual configs.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

08 May, 2007

1 commit


29 Mar, 2007

1 commit

  • sys_madvise has down_write of mmap_sem, then madvise_remove calls
    vmtruncate_range which takes i_mutex and i_alloc_sem: no, we can easily devise
    deadlocks from that ordering.

    Make madvise_remove drop mmap_sem while calling vmtruncate_range: luckily,
    since madvise_remove doesn't split or merge vmas, it's easy to handle this
    case with a NULL prev, without restructuring sys_madvise. (Though it's sad
    to retake mmap_sem when it's unlikely to be needed, and certainly
    down_read is sufficient for MADV_REMOVE, unlike the other madvices.)

    Signed-off-by: Hugh Dickins
    Cc: Miklos Szeredi
    Cc: Badari Pulavarty
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

17 Mar, 2007

1 commit

  • madvise(MADV_REMOVE) can go into an infinite loop or cause an oops if the
    call covers a region from the start of a vma, and extending past that vma.

    Signed-off-by: Nick Piggin
    Cc: Badari Pulavarty
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

18 Apr, 2006

1 commit


15 Feb, 2006

1 commit

  • Currently, copy-on-write may change the physical address of a page even if
    the user requested that the page be pinned in memory (either by mlock or by
    get_user_pages). This happens if the process forks in the meantime, and the
    parent writes to that page. As a result, the page is orphaned: in the case
    of get_user_pages, the application will never see any data the hardware
    DMAs into this page after the COW. In the case of mlock'd memory, the
    parent does not get the realtime/security benefits of mlock.

    In particular, this affects the Infiniband modules which do DMA from and into
    user pages all the time.

    This patch adds madvise options to control whether a memory range is
    inherited across fork. This is useful e.g. when hardware is doing DMA
    from/into these pages. It could also be useful to an application wanting
    to speed up its forks by cutting large areas out of consideration.
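
    A hypothetical usage sketch (the buffer size and the surrounding DMA
    registration are assumptions) of the new advice pair:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    #define BUF_SIZE (1UL << 20)

    int main(void)
    {
        void *dma_buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (dma_buf == MAP_FAILED) { perror("mmap"); return 1; }

        /* don't let fork() duplicate this range into the child */
        if (madvise(dma_buf, BUF_SIZE, MADV_DONTFORK))
            perror("madvise(MADV_DONTFORK)");

        /* ... register dma_buf with the device, fork() helpers, etc. ... */

        madvise(dma_buf, BUF_SIZE, MADV_DOFORK);   /* re-enable inheritance */
        munmap(dma_buf, BUF_SIZE);
        return 0;
    }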

    Signed-off-by: Michael S. Tsirkin
    Acked-by: Hugh Dickins
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael S. Tsirkin
     

07 Jan, 2006

1 commit

  • Here is the patch to implement madvise(MADV_REMOVE) - which frees up a
    given range of pages and its associated backing store. The current
    implementation supports only shmfs/tmpfs; other filesystems return
    -ENOSYS.

    "Some app allocates large tmpfs files, then when some task quits and some
    client disconnect, some memory can be released. However the only way to
    release tmpfs-swap is to MADV_REMOVE". - Andrea Arcangeli

    Databases want to use this feature to drop a section of their bufferpool
    (shared memory segments) - without writing back to disk/swap space.

    This feature is also useful for supporting hot-plug memory on UML.

    Concerns raised by Andrew Morton:

    - "We have no plan for holepunching! If we _do_ have such a plan (or
    might in the future) then what would the API look like? I think
    sys_holepunch(fd, start, len), so we should start out with that."

    - Using madvise is very weird, because people will ask "why do I need to
    mmap my file before I can stick a hole in it?"

    - None of the other madvise operations call into the filesystem in this
    manner. A broad question is: is this capability an MM operation or a
    filesystem operation? truncate, for example, is a filesystem operation
    which sometimes has MM side-effects. madvise is an mm operation and with
    this patch, it gains FS side-effects, only they're really, really
    significant ones."

    Comments:

    - Andrea suggested the fs operation too, but then it's more efficient to
    have it as an mm operation with fs side effects, because userland doesn't
    immediately know the fd and physical offset of the range. It's possible to
    fix that up in userland and use the fs operation, but it's more expensive;
    the vmas are already in the kernel and we can use them.

    Short term plan & Future Direction:

    - We seem to need this interface only for shmfs/tmpfs files in the short
    term. We have to add hooks into the filesystem for correctness and
    completeness. This is what this patch does.

    - In the future, the plan is to support both fs and mmap APIs. This
    also involves implementing (other) filesystem-specific functions.

    - The current patch doesn't support VM_NONLINEAR, which can be addressed
    in the future.
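
    A usage sketch (the /dev/shm path and the sizes are assumptions) of
    punching a hole in a tmpfs-backed mapping:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        size_t len = 8UL << 20;

        int fd = open("/dev/shm/example", O_RDWR | O_CREAT | O_TRUNC, 0600);
        if (fd < 0 || ftruncate(fd, len)) { perror("setup"); return 1; }

        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        memset(p, 0xab, len);                   /* populate the whole range */

        /* free the middle quarter of the pages and their backing store */
        if (madvise(p + len / 4, len / 4, MADV_REMOVE))
            perror("madvise(MADV_REMOVE)");     /* ENOSYS on unsupported fs */

        munmap(p, len);
        close(fd);
        unlink("/dev/shm/example");
        return 0;
    }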

    Signed-off-by: Badari Pulavarty
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: Michael Kerrisk
    Cc: Ulrich Drepper
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Badari Pulavarty
     

29 Nov, 2005

1 commit

  • This replaces the (in my opinion horrible) VM_UNMAPPED logic with very
    explicit support for a "remapped page range" aka VM_PFNMAP. It allows a
    VM area to contain an arbitrary range of page table entries that the VM
    never touches, and never considers to be normal pages.

    Any user of "remap_pfn_range()" automatically gets this new
    functionality, and doesn't even have to mark the pages reserved or
    indeed mark them any other way. It just works. As a side effect, doing
    mmap() on /dev/mem works for arbitrary ranges.

    Sparc update from David in the next commit.
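
    As a sketch of the /dev/mem side effect mentioned above (the physical
    offset is an arbitrary example, and access may be restricted by
    CONFIG_STRICT_DEVMEM on later kernels):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        long pagesize = sysconf(_SC_PAGESIZE);
        int fd = open("/dev/mem", O_RDONLY);
        if (fd < 0) { perror("open /dev/mem"); return 1; }

        off_t phys = 0x100000;              /* example physical address */
        unsigned char *p = mmap(NULL, pagesize, PROT_READ, MAP_SHARED, fd, phys);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        printf("first byte at 0x%lx: 0x%02x\n", (long)phys, p[0]);
        munmap(p, pagesize);
        close(fd);
        return 0;
    }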

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

23 Nov, 2005

1 commit

  • Although we tend to associate VM_RESERVED with remap_pfn_range, quite a few
    drivers set VM_RESERVED on areas which are then populated by nopage. The
    PageReserved removal in 2.6.15-rc1 changed VM_RESERVED not to free pages in
    zap_pte_range, without changing those drivers not to set it: so their pages
    just leak away.

    Let's not change miscellaneous drivers now: introduce VM_UNPAGED at the core,
    to flag the special areas where the ptes may have no struct page, or if they
    have then it's not to be touched. Replace most instances of VM_RESERVED in
    core mm by VM_UNPAGED. Force it on in remap_pfn_range, and the sparc and
    sparc64 io_remap_pfn_range.

    Revert addition of VM_RESERVED to powerpc vdso, it's not needed there. Is it
    needed anywhere? It still governs the mm->reserved_vm statistic, and special
    vmas not to be merged, and areas not to be core dumped; but could probably be
    eliminated later (the drivers are probably specifying it because in 2.4 it
    kept swapout off the vma, but in 2.6 we work from the LRU, which these pages
    don't get on).

    Use the VM_SHM slot for VM_UNPAGED, and define VM_SHM to 0: it serves no
    purpose whatsoever, and should be removed from drivers when we clean up.

    Signed-off-by: Hugh Dickins
    Acked-by: William Irwin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

30 Oct, 2005

1 commit

  • Remove PageReserved() calls from core code by tightening VM_RESERVED
    handling in mm/ to cover PageReserved functionality.

    PageReserved special casing is removed from get_page and put_page.

    All setting and clearing of PageReserved is retained, and it is now flagged
    in the page_alloc checks to help ensure we don't introduce any refcount
    based freeing of Reserved pages.

    MAP_PRIVATE, PROT_WRITE of VM_RESERVED regions is tentatively being
    deprecated. We never completely handled it correctly anyway, and it can
    be reintroduced in future if required (Hugh has a proof of concept).

    Once PageReserved() calls are removed from kernel/power/swsusp.c, and all
    arch/ and driver code, the Set and Clear calls, and the PG_reserved bit can
    be trivially removed.

    Last real user of PageReserved is swsusp, which uses PageReserved to
    determine whether a struct page points to valid memory or not. This still
    needs to be addressed (a generic page_is_ram() should work).

    A last caveat: the ZERO_PAGE is now refcounted and managed with rmap (and
    thus mapcounted and counted towards shared rss). These writes to the struct
    page could cause excessive cacheline bouncing on big systems. There are a
    number of ways this could be addressed if it is an issue.

    Signed-off-by: Nick Piggin

    Refcount bug fix for filemap_xip.c

    Signed-off-by: Carsten Otte
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

12 Oct, 2005

1 commit

  • Revert this recent correctness change: Douglas Crosher
    reported that it broke an existing application, and that madvise() works
    without error on anonymous mappings on Solaris.

    This means that madvise() will remain non-standards-compliant: we should
    return -EBADF for all requests against non-file-backed vma's, but Linux only
    does this for MADV_WILLNEED requests.

    Signed-off-by: Suzuki K P
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Suzuki
     

05 Sep, 2005

1 commit

  • Better late than never, I've at last reviewed the madvise vma merging
    going into 2.6.13. Remove a pointless check and fix two little bugs -
    a simple test (with /proc//maps hacked to show ReadHints) showed
    both mismerges in practice: though being madvise, neither was disastrous.

    1. Correct placement of the success label in madvise_behavior: as in
    mprotect_fixup and mlock_fixup, it is necessary to update vm_flags
    when vma_merge succeeds (to handle the exceptional Case 8 noted in
    the comments above vma_merge itself).

    2. Correct initial value of prev when starting part way into a vma: as
    in sys_mprotect and do_mlock, it needs to be set to vma in this case
    (vma_merge handles only that minimum of cases shown in its comments).

    3. If find_vma_prev sets prev, then the vma it returns is prev->vm_next,
    so it's pointless to make that same assignment again in sys_madvise.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

28 Jul, 2005

1 commit


24 Jun, 2005

2 commits


22 Jun, 2005

2 commits

  • This attempts to merge back the split maps. This code is mostly copied
    from Chrisw's mlock merging from post-2.6.11 trees. The only difference is
    in munmapped_error handling. Also passed prev to willneed/dontneed,
    even though they do not handle it now, since I felt it would be cleaner,
    instead of handling prev in madvise_vma in some cases and in a subfunction
    in other cases.

    Signed-off-by: Prasanna Meda
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Prasanna Meda
     
  • This attempts to avoid splitting when it is not needed, that is, when
    vm_flags is the same as the new flags. The idea is from the
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Prasanna Meda
     

17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds