26 Sep, 2014

3 commits

  • commit 2457aec63745e235bcafb7ef312b182d8682f0fc upstream.

    aops->write_begin may allocate a new page and make it visible only to have
    mark_page_accessed called almost immediately after. Once the page is
    visible the atomic operations are necessary, which is noticeable overhead
    when writing to an in-memory filesystem like tmpfs, but should also be
    noticeable with fast storage. The objective of the patch is to initialise
    the accessed information with non-atomic operations before the page is
    visible.

    The bulk of filesystems directly or indirectly use
    grab_cache_page_write_begin or find_or_create_page for the initial
    allocation of a page cache page. This patch adds an init_page_accessed()
    helper which behaves like the first call to mark_page_accessed() but may
    be called before the page is visible and can be done non-atomically.
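
    As a hedged sketch (consistent with the changelog, not necessarily the
    exact upstream body), the helper is a plain, non-atomic flag set that is
    only safe while the page is still private to the allocating thread:

        static void init_page_accessed(struct page *page)
        {
                /*
                 * Only legal before the page is visible to other CPUs:
                 * __SetPageReferenced() is the non-atomic __set_bit()
                 * variant of SetPageReferenced().
                 */
                if (!PageReferenced(page))
                        __SetPageReferenced(page);
        }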

    The primary APIs of concern in this case are the following, and are used
    by most filesystems.

    find_get_page
    find_lock_page
    find_or_create_page
    grab_cache_page_nowait
    grab_cache_page_write_begin

    All of them are very similar in detail, so the patch creates a core helper
    pagecache_get_page() which takes a flags parameter that affects its
    behaviour, such as whether the page should be marked accessed or not. The
    old API is preserved but is basically a thin wrapper around this core
    function.
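
    To illustrate, here is a simplified sketch of how one of the old entry
    points reduces to the core helper (hedged: the upstream signature at the
    time also distinguished the gfp masks used for the page and the radix
    tree allocations):

        /* sketch: find_or_create_page() as a thin wrapper */
        static inline struct page *find_or_create_page(struct address_space *mapping,
                                                       pgoff_t offset, gfp_t gfp_mask)
        {
                return pagecache_get_page(mapping, offset,
                                          FGP_LOCK | FGP_ACCESSED | FGP_CREAT,
                                          gfp_mask);
        }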

    Each of the filesystems are then updated to avoid calling
    mark_page_accessed when it is known that the VM interfaces have already
    done the job. There is a slight snag in that the timing of the
    mark_page_accessed() has now changed so in rare cases it's possible a page
    gets to the end of the LRU as PageReferenced whereas previously it might
    have been repromoted. This is expected to be rare but it's worth the
    filesystem people thinking about it in case they see a problem with the
    timing change. It is also the case that some filesystems may be marking
    pages accessed that previously did not but it makes sense that filesystems
    have consistent behaviour in this regard.

    The test case used to evaluate this is a simple dd of a large file done
    multiple times with the file deleted on each iteration. The size of the
    file is 1/10th physical memory to avoid dirty page balancing. In the
    async case it is possible that the workload completes without even
    hitting the disk and will have variable results, but it highlights the
    impact of mark_page_accessed for async IO. The sync results are expected to be
    more stable. The exception is tmpfs where the normal case is for the "IO"
    to not hit the disk.

    The test machine was single socket and UMA to avoid any scheduling or NUMA
    artifacts. Throughput and wall times are presented for sync IO, only wall
    times are shown for async as the granularity reported by dd and the
    variability is unsuitable for comparison. As async results were variable
    due to writeback timings, I'm only reporting the maximum figures. The sync
    results were stable enough to make the mean and stddev uninteresting.

    The performance results are reported based on a run with no profiling.
    Profile data is based on a separate run with oprofile running.

    async dd
                                3.15.0-rc3             3.15.0-rc3
                                   vanilla            accessed-v2
    ext3   Max elapsed     13.9900 ( 0.00%)     11.5900 ( 17.16%)
    tmpfs  Max elapsed      0.5100 ( 0.00%)      0.4900 (  3.92%)
    btrfs  Max elapsed     12.8100 ( 0.00%)     12.7800 (  0.23%)
    ext4   Max elapsed     18.6000 ( 0.00%)     13.3400 ( 28.28%)
    xfs    Max elapsed     12.5600 ( 0.00%)      2.0900 ( 83.36%)

    The XFS figure is a bit strange as it managed to avoid a worst case by
    sheer luck but the average figures looked reasonable.

           samples  percentage
    ext3     86107      0.9783  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
    ext3     23833      0.2710  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed
    ext3      5036      0.0573  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
    ext4     64566      0.8961  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
    ext4      5322      0.0713  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed
    ext4      2869      0.0384  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
    xfs      62126      1.7675  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
    xfs       1904      0.0554  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
    xfs        103      0.0030  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed
    btrfs    10655      0.1338  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
    btrfs     2020      0.0273  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
    btrfs      587      0.0079  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed
    tmpfs    59562      3.2628  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
    tmpfs     1210      0.0696  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
    tmpfs       94      0.0054  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed

    [akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Tested-by: Prabhakar Lad
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit 07a427884348d38a6fd56fa4d78249c407196650 upstream.

    shmem_getpage_gfp uses an atomic operation to set the SwapBacked field
    before the page is even added to the LRU or made visible. This is
    unnecessary as what could it possibly race against? Use an unlocked
    variant.
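
    Concretely, the changed line in shmem_getpage_gfp amounts to the
    following (a sketch; the double-underscore helper is the non-atomic
    variant, safe because the page is not yet visible):

        -       SetPageSwapBacked(page);        /* atomic RMW on page->flags */
        +       __SetPageSwapBacked(page);      /* plain store; page not yet visible */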

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit 579f82901f6f41256642936d7e632f3979ad76d4 upstream.

    This is a patch to improve swap readahead algorithm. It's from Hugh and
    I slightly changed it.

    Hugh's original changelog:

    swapin readahead does a blind readahead, whether or not the swapin is
    sequential. This may be ok on harddisk, because large reads have
    relatively small costs, and if the readahead pages are unneeded they can
    be reclaimed easily - though, what if their allocation forced reclaim of
    useful pages? But on SSD devices large reads are more expensive than
    small ones: if the readahead pages are unneeded, reading them in causes
    significant overhead.

    This patch adds very simplistic random read detection. Stealing the
    PageReadahead technique from Konstantin Khlebnikov's patch, avoiding the
    vma/anon_vma sophistications of Shaohua Li's patch, swapin_nr_pages()
    simply looks at readahead's current success rate, and narrows or widens
    its readahead window accordingly. There is little science to its
    heuristic: it's about as stupid as can be whilst remaining effective.

    The table below shows elapsed times (in centiseconds) when running a
    single repetitive swapping load across a 1000MB mapping in 900MB ram
    with 1GB swap (the harddisk tests had taken painfully too long when I
    used mem=500M, but SSD shows similar results for that).

    Vanilla is the 3.6-rc7 kernel on which I started; Shaohua denotes his
    Sep 3 patch in mmotm and linux-next; HughOld denotes my Oct 1 patch
    which Shaohua showed to be defective; HughNew this Nov 14 patch, with
    page_cluster as usual at default of 3 (8-page reads); HughPC4 this same
    patch with page_cluster 4 (16-page reads); HughPC0 with page_cluster 0
    (1-page reads: no readahead).

    HDD for swapping to harddisk, SSD for swapping to VertexII SSD. Seq for
    sequential access to the mapping, cycling five times around; Rand for
    the same number of random touches. Anon for a MAP_PRIVATE anon mapping;
    Shmem for a MAP_SHARED anon mapping, equivalent to tmpfs.

    One weakness of Shaohua's vma/anon_vma approach was that it did not
    optimize Shmem: seen below. Konstantin's approach was perhaps mistuned,
    50% slower on Seq: did not compete and is not shown below.

    HDD          Vanilla  Shaohua  HughOld  HughNew  HughPC4  HughPC0
    Seq Anon       73921    76210    75611    76904    78191   121542
    Seq Shmem      73601    73176    73855    72947    74543   118322
    Rand Anon     895392   831243   871569   845197   846496   841680
    Rand Shmem   1058375  1053486   827935   764955   764376   756489

    SSD          Vanilla  Shaohua  HughOld  HughNew  HughPC4  HughPC0
    Seq Anon       24634    24198    24673    25107    21614    70018
    Seq Shmem      24959    24932    25052    25703    22030    69678
    Rand Anon      43014    26146    28075    25989    26935    25901
    Rand Shmem     45349    45215    28249    24268    24138    24332

    These tests are, of course, two extremes of a very simple case: under
    heavier mixed loads I've not yet observed any consistent improvement or
    degradation, and wider testing would be welcome.

    Shaohua Li:

    Tests show Vanilla is slightly better than Hugh's patch in a sequential
    workload. I observed that with Hugh's patch the readahead size is
    sometimes shrunk too fast (from 8 to 1 immediately) in a sequential
    workload if there is no hit. In such a case, continuing to do readahead
    is actually beneficial.

    I didn't prepare a sophisticated algorithm for the sequential workload
    because so far we can't guarantee sequentially accessed pages are swapped
    out sequentially. So I slightly changed Hugh's heuristic - don't shrink
    the readahead size too fast.
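
    A condensed sketch of the resulting heuristic (hedged: upstream
    swapin_nr_pages() also checks for an adjacent offset before collapsing
    to a single page when there are no hits):

        static unsigned long swapin_nr_pages(unsigned long offset)
        {
                static atomic_t last_readahead_pages;
                unsigned int hits, pages, last_ra;
                unsigned int max_pages = 1 << page_cluster;    /* default 8 */

                if (max_pages <= 1)
                        return 1;

                /* Hugh: widen the window with recent readahead hits... */
                hits = atomic_xchg(&swapin_readahead_hits, 0);
                pages = hits ? roundup_pow_of_two(hits + 2) : 1;
                if (pages > max_pages)
                        pages = max_pages;

                /* Shaohua: ...but never shrink it by more than half at once */
                last_ra = atomic_read(&last_readahead_pages) / 2;
                if (pages < last_ra)
                        pages = last_ra;
                atomic_set(&last_readahead_pages, pages);

                return pages;
        }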

    Here is my test result (unit: seconds, average of 3 runs):

               Vanilla    Hugh     New
    Seq            356     370     360
    Random        4525    2447    2444

    Attached graph is the swapin/swapout throughput I collected with 'vmstat
    2'. The first part is running a random workload (till around 1200 on
    the x-axis) and the second part is running a sequential workload.
    swapin and swapout throughput are almost identical in steady state in
    both workloads, which is the expected behaviour. In Vanilla, by
    contrast, swapin is much bigger than swapout, especially in the random
    workload (because of wrong readahead).

    Original patches by: Shaohua Li and Konstantin Khlebnikov.

    [fengguang.wu@intel.com: swapin_nr_pages() can be static]
    Signed-off-by: Hugh Dickins
    Signed-off-by: Shaohua Li
    Signed-off-by: Fengguang Wu
    Cc: Rik van Riel
    Cc: Wu Fengguang
    Cc: Minchan Kim
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Shaohua Li
     

02 Jul, 2014

1 commit

  • commit 1c8349a17137b93f0a83f276c764a6df1b9a116e upstream.

    When we perform a data integrity sync we tag all the dirty pages with
    PAGECACHE_TAG_TOWRITE at the start of ext4_da_writepages. Later we check
    for this tag in write_cache_pages_da and create a struct mpage_da_data
    containing contiguously indexed pages tagged with this tag and sync
    these pages with a call to mpage_da_map_and_submit. This process is
    done in a while loop until all the PAGECACHE_TAG_TOWRITE pages are
    synced. We also do journal start and stop in each iteration.
    journal_stop could initiate a journal commit which would call
    ext4_writepage, which in turn calls ext4_bio_write_page even for
    delayed OR unwritten buffers. When ext4_bio_write_page is called for
    such buffers, even though it does not sync them, it clears the
    PAGECACHE_TAG_TOWRITE of the corresponding page, and hence these pages
    are also not synced by the currently running data integrity sync. We
    end up with dirty pages although sync has completed.

    This could cause a potential data loss when the sync call is followed
    by a truncate_pagecache call, which is exactly the case in
    collapse_range. (It will cause generic/127 failure in xfstests)

    To avoid this issue, we can use set_page_writeback_keepwrite instead of
    set_page_writeback, which doesn't clear TOWRITE tag.
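
    Sketch of the intended call site (hedged: upstream ext4_bio_write_page
    grows a keep_towrite flag that selects the variant):

        if (keep_towrite)
                set_page_writeback_keepwrite(page);     /* leaves TOWRITE tag */
        else
                set_page_writeback(page);               /* clears TOWRITE tag */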

    Signed-off-by: Namjae Jeon
    Signed-off-by: Ashish Sangwan
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Jan Kara
    Signed-off-by: Jiri Slaby

    Namjae Jeon
     

14 Feb, 2013

1 commit

  • The s390 architecture is unique in respect to dirty page detection,
    it uses the change bit in the per-page storage key to track page
    modifications. All other architectures track dirty bits by means
    of page table entries. This property of s390 has caused numerous
    problems in the past, e.g. see git commit ef5d437f71afdf4a
    "mm: fix XFS oops due to dirty pages without buffers on s390".

    To avoid future issues in regard to per-page dirty bits convert
    s390 to a fault based software dirty bit detection mechanism. All
    user page table entries which are marked as clean will be hardware
    read-only, even if the pte is supposed to be writable. A write by the
    user process will trigger a protection fault which will cause the user
    pte to be marked as dirty and the hardware read-only bit to be removed.

    With this change the dirty bit in the storage key is irrelevant
    for Linux as a host, but the storage key is still required for
    KVM guests. The effect is that page_test_and_clear_dirty and the
    related code can be removed. The referenced bit in the storage
    key is still used by the page_test_and_clear_young primitive to
    provide page age information.

    For page cache pages of mappings with mapping_cap_account_dirty
    there will not be any change in behavior as the dirty bit tracking
    already uses read-only ptes to control the amount of dirty pages.
    Only for swap cache pages and pages of mappings without
    mapping_cap_account_dirty there can be additional protection faults.
    To avoid an excessive number of additional faults the mk_pte
    primitive checks for PageDirty if the pgprot value allows for writes
    and pre-dirties the pte. That avoids all additional faults for
    tmpfs and shmem pages until these pages are added to the swap cache.
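
    A sketch of that pre-dirtying check (simplified from the s390 code):

        static inline pte_t mk_pte(struct page *page, pgprot_t pgprot)
        {
                pte_t pte = mk_pte_phys(page_to_phys(page), pgprot);

                /* pre-dirty writable ptes of dirty pages: avoids an extra fault */
                if (pte_write(pte) && PageDirty(page))
                        pte = pte_mkdirty(pte);
                return pte;
        }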

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     

27 Dec, 2012

1 commit

  • Unfortunately with !CONFIG_PAGEFLAGS_EXTENDED, (!PageHead) is false, and
    (PageHead) is true, for tail pages. If this is indeed the intended
    behavior, which I doubt because it breaks cache cleaning on some ARM
    systems, then the nomenclature is highly problematic.

    This patch makes sure PageHead is only true for head pages and PageTail
    is only true for tail pages, and neither is true for non-compound pages.
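
    For the !CONFIG_PAGEFLAGS_EXTENDED case, where tail pages are encoded
    as PG_compound | PG_reclaim, the corrected predicates look like this
    (a sketch of the fix):

        #define PG_head_tail_mask ((1L << PG_compound) | (1L << PG_reclaim))

        static inline int PageHead(struct page *page)
        {
                /* compound set, tail marker clear: a head page */
                return (page->flags & PG_head_tail_mask) == (1L << PG_compound);
        }

        static inline int PageTail(struct page *page)
        {
                /* both bits set: a tail page */
                return (page->flags & PG_head_tail_mask) == PG_head_tail_mask;
        }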

    [ This buglet seems ancient - seems to have been introduced back in Apr
    2008 in commit 6a1e7f777f61: "pageflags: convert to the use of new
    macros". And the reason nobody noticed is because the PageHead()
    tests are almost all about just sanity-checking, and only used on
    pages that are actual page heads. The fact that the old code returned
    true for tail pages too was thus not really noticeable. - Linus ]

    Signed-off-by: Christoffer Dall
    Acked-by: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: Will Deacon
    Cc: Steve Capper
    Cc: Christoph Lameter
    Cc: stable@kernel.org # 2.6.26+
    Signed-off-by: Linus Torvalds

    Christoffer Dall
     

01 Aug, 2012

1 commit

    When a user or administrator requires swap for their application, they
    create a swap partition or file, format it with mkswap and activate it
    with swapon. Swap over the network is considered as an option in diskless
    systems. The two likely scenarios are blade servers used as part of a
    cluster, where the form factor or maintenance costs do not allow the use
    of disks, and thin clients.

    The Linux Terminal Server Project recommends the use of the Network Block
    Device (NBD) for swap according to the manual at
    https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download
    There is also documentation and tutorials on how to setup swap over NBD at
    places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The
    nbd-client also documents the use of NBD as swap. Despite this, the fact
    is that a machine using NBD for swap can deadlock within minutes if swap
    is used intensively. This patch series addresses the problem.

    The core issue is that network block devices do not use mempools like
    normal block devices do. As the host cannot control where they receive
    packets from, they cannot reliably work out in advance how much memory
    they might need. Some years ago, Peter Zijlstra developed a series of
    patches that supported swap over an NFS that at least one distribution is
    carrying within their kernels. This patch series borrows very heavily
    from Peter's work to support swapping over NBD as a pre-requisite to
    supporting swap-over-NFS. The bulk of the complexity is concerned with
    preserving memory that is allocated from the PFMEMALLOC reserves for use
    by the network layer which is needed for both NBD and NFS.

    Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
    preserve access to pages allocated under low memory situations
    to callers that are freeing memory.

    Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks

    Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
    reserves without setting PFMEMALLOC.

    Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves
    for later use by network packet processing.

    Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required

    Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set.

    Patches 7-12 allows network processing to use PFMEMALLOC reserves when
    the socket has been marked as being used by the VM to clean pages. If
    packets are received and stored in pages that were allocated under
    low-memory situations and are unrelated to the VM, the packets
    are dropped.

    Patch 11 reintroduces __skb_alloc_page which the networking
    folk may object to but is needed in some cases to propagate
    pfmemalloc from a newly allocated page to an skb. If there is a
    strong objection, this patch can be dropped with the impact being
    that swap-over-network will be slower in some cases but it should
    not fail.

    Patch 13 is a micro-optimisation to avoid a function call in the
    common case.

    Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use
    PFMEMALLOC if necessary.

    Patch 15 notes that it is still possible for the PFMEMALLOC reserve
    to be depleted. To prevent this, direct reclaimers get throttled on
    a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is
    expected that kswapd and the direct reclaimers already running
    will clean enough pages for the low watermark to be reached and
    the throttled processes are woken up.

    Patch 16 adds a statistic to track how often processes get throttled

    Some basic performance testing was run using kernel builds, netperf on
    loopback for UDP and TCP, hackbench (pipes and sockets), iozone and
    sysbench. Each of them were expected to use the sl*b allocators
    reasonably heavily but there did not appear to be significant performance
    variances.

    For testing swap-over-NBD, a machine was booted with 2G of RAM with a
    swapfile backed by NBD. 8*NUM_CPU processes were started that create
    anonymous memory mappings and read them linearly in a loop. The total
    size of the mappings was 4*PHYSICAL_MEMORY to use swap heavily under
    memory pressure.

    Without the patches and using SLUB, the machine locks up within minutes;
    with them applied, it runs to completion. With SLAB, the story is
    different as an unpatched kernel runs to completion. However, the patched
    kernel completed the test 45% faster.

    MICRO
                                              3.5.0-rc2  3.5.0-rc2
                                                vanilla    swapnbd
    Unrecognised test vmscan-anon-mmap-write
    MMTests Statistics: duration
    Sys Time Running Test (seconds)              197.80     173.07
    User+Sys Time Running Test (seconds)         206.96     182.03
    Total Elapsed Time (seconds)                3240.70    1762.09

    This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages

    Allocations of pages below the min watermark run a risk of the machine
    hanging due to a lack of memory. To prevent this, only callers who have
    PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are
    allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to
    a slab though, nothing prevents other callers consuming free objects
    within those slabs. This patch limits access to slab pages that were
    allocated from the PFMEMALLOC reserves.

    When this patch is applied, pages allocated from below the low watermark
    are returned with page->pfmemalloc set and it is up to the caller to
    determine how the page should be protected. SLAB restricts access to any
    page with page->pfmemalloc set to callers which are known to be able to
    access the PFMEMALLOC reserve. If one is not available, an attempt is
    made to allocate a new page rather than use a reserve. SLUB is a bit more
    relaxed in that it only records if the current per-CPU page was allocated
    from PFMEMALLOC reserve and uses another partial slab if the caller does
    not have the necessary GFP or process flags. This was found to be
    sufficient in tests to avoid hangs due to SLUB generally maintaining
    smaller lists than SLAB.

    In low-memory conditions it does mean that !PFMEMALLOC allocators can fail
    a slab allocation even though free objects are available because they are
    being preserved for callers that are freeing pages.
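
    The gatekeeping test itself is small. A hedged sketch (upstream SLUB
    tracks the marker on the slab page and names the check along the lines
    of pfmemalloc_match()):

        static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags)
        {
                /* reserve-backed slabs may only serve reserve-entitled callers */
                if (unlikely(PageSlabPfmemalloc(page)))
                        return gfp_pfmemalloc_allowed(gfpflags);
                return true;
        }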

    [a.p.zijlstra@chello.nl: Original implementation]
    [sebastian@breakpoint.cc: Correct order of page flag clearing]
    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

25 Mar, 2012

1 commit

  • Pull cleanup from Paul Gortmaker:
    "The changes shown here are to unify linux's BUG support under the one
    file. Due to historical reasons, we have some BUG code
    in bug.h and some in kernel.h -- i.e. the support for BUILD_BUG in
    linux/kernel.h predates the addition of linux/bug.h, but old code in
    kernel.h wasn't moved to bug.h at that time. As a band-aid, kernel.h
    was including to pseudo link them.

    This has caused confusion[1] and general yuck/WTF[2] reactions. Here
    is an example that violates the principle of least surprise:

      CC      lib/string.o
    lib/string.c: In function 'strlcat':
    lib/string.c:225:2: error: implicit declaration of function 'BUILD_BUG_ON'
    make[2]: *** [lib/string.o] Error 1
    $
    $ grep linux/bug.h lib/string.c
    #include <linux/bug.h>
    $

    We've included <linux/bug.h> for the BUG infrastructure and yet we
    still get a compile fail! [We've not included kernel.h for BUILD_BUG_ON.]
    Ugh - very confusing for someone who is new to kernel development.

    With the above in mind, the goals of this changeset are:

    1) find and fix any include/*.h files that were relying on the
    implicit presence of BUG code.
    2) find and fix any C files that were consuming kernel.h and hence
    relying on implicitly getting some/all BUG code.
    3) Move the BUG related code living in kernel.h to <linux/bug.h>
    4) remove the asm/bug.h include from kernel.h to finally break the chain.

    During development, the order was more like 3-4, build-test, 1-2. But
    to ensure that git history for bisect doesn't get needless build
    failures introduced, the commits have been reordered to fix the problem
    areas in advance.

    [1] https://lkml.org/lkml/2012/1/3/90
    [2] https://lkml.org/lkml/2012/1/17/414"

    Fix up conflicts (new radeon file, reiserfs header cleanups) as per Paul
    and linux-next.

    * tag 'bug-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux:
    kernel.h: doesn't explicitly use bug.h, so don't include it.
    bug: consolidate BUILD_BUG_ON with other bug code
    BUG: headers with BUG/BUG_ON etc. need linux/bug.h
    bug.h: add include of it to various implicit C users
    lib: fix implicit users of kernel.h for TAINT_WARN
    spinlock: macroize assert_spin_locked to avoid bug.h dependency
    x86: relocate get/set debugreg fcns to include/asm/debugreg.h

    Linus Torvalds
     

22 Mar, 2012

1 commit

  • Andrea Arcangeli pointed out to me that a check in __memory_failure()
    which was intended to prevent THP tail pages from being checked for the
    absence of the PG_lru flag (something that is always the case), was also
    preventing THP head pages from being checked.

    A THP head page could actually benefit from the call to shake_page() by
    ending up being put back to a LRU, provided it had been waiting in a
    pagevec array.

    Andrea suggested that the "!PageTransCompound(p)" in the if-statement
    should be replaced by a "!PageTransTail(p)", thus allowing THP head pages
    to be checked and possibly shaken.
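
    The shape of the change (a sketch, not the exact context lines):

        -       if (!PageTransCompound(p) && !PageLRU(p))
        +       if (!PageTransTail(p) && !PageLRU(p))
                        shake_page(p, 0);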

    Signed-off-by: Dean Nelson
    Cc: Jin Dongming
    Reviewed-by: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: Hidetoshi Seto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dean Nelson
     

05 Mar, 2012

1 commit

    If a header file is making use of BUG, BUG_ON, BUILD_BUG_ON, or any
    other BUG variant in a static inline (i.e. not in a #define) then
    that header really should be including <linux/bug.h> and not just
    expecting it to be implicitly present.

    We can make this change risk-free, since if the files using these
    headers didn't have exposure to linux/bug.h already, they would have
    been causing compile failures/warnings.
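
    For example (a made-up header; struct foo is hypothetical), a static
    inline that uses BUILD_BUG_ON must pull in linux/bug.h itself:

        /* include/linux/foo.h -- hypothetical */
        #include <linux/bug.h>          /* for BUILD_BUG_ON() below */

        struct foo { unsigned long word; };

        static inline void foo_check_layout(void)
        {
                BUILD_BUG_ON(sizeof(struct foo) != sizeof(unsigned long));
        }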

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

31 Jul, 2011

1 commit

  • * 'slub/lockless' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6: (21 commits)
    slub: When allocating a new slab also prep the first object
    slub: disable interrupts in cmpxchg_double_slab when falling back to pagelock
    Avoid duplicate _count variables in page_struct
    Revert "SLUB: Fix build breakage in linux/mm_types.h"
    SLUB: Fix build breakage in linux/mm_types.h
    slub: slabinfo update for cmpxchg handling
    slub: Not necessary to check for empty slab on load_freelist
    slub: fast release on full slab
    slub: Add statistics for the case that the current slab does not match the node
    slub: Get rid of the another_slab label
    slub: Avoid disabling interrupts in free slowpath
    slub: Disable interrupts in free_debug processing
    slub: Invert locking and avoid slab lock
    slub: Rework allocator fastpaths
    slub: Pass kmem_cache struct to lock and freeze slab
    slub: explicit list_lock taking
    slub: Add cmpxchg_double_slab()
    mm: Rearrange struct page
    slub: Move page->frozen handling near where the page->freelist handling occurs
    slub: Do not use frozen page flag but a bit in the page counters
    ...

    Linus Torvalds
     

26 Jul, 2011

1 commit


02 Jul, 2011

1 commit


29 May, 2011

1 commit

  • page_get_storage_key() and page_set_storage_key() expect a page address
    and not its page frame number. This got inconsistent with 2d42552d
    "[S390] merge page_test_dirty and page_clear_dirty".

    The result is that we read/write storage keys from random pages and do
    not have working dirty bit tracking at all.
    E.g. SetPageUptodate() doesn't clear the dirty bit of requested pages,
    which for example ext4 doesn't like very much and panics after a while.

    Unable to handle kernel paging request at virtual user address (null)
    Oops: 0004 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Modules linked in:
    CPU: 1 Not tainted 2.6.39-07551-g139f37f-dirty #152
    Process flush-94:0 (pid: 1576, task: 000000003eb34538, ksp: 000000003c287b70)
    Krnl PSW : 0704c00180000000 0000000000316b12 (jbd2_journal_file_inode+0x10e/0x138)
    R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 EA:3
    Krnl GPRS: 0000000000000000 0000000000000000 0000000000000000 0700000000000000
    0000000000316a62 000000003eb34cd0 0000000000000025 000000003c287b88
    0000000000000001 000000003c287a70 000000003f1ec678 000000003f1ec000
    0000000000000000 000000003e66ec00 0000000000316a62 000000003c287988
    Krnl Code: 0000000000316b04: f0a0000407f4 srp 4(11,%r0),2036,0
    0000000000316b0a: b9020022 ltgr %r2,%r2
    0000000000316b0e: a7740015 brc 7,316b38
    >0000000000316b12: e3d0c0000024 stg %r13,0(%r12)
    0000000000316b18: 4120c010 la %r2,16(%r12)
    0000000000316b1c: 4130d060 la %r3,96(%r13)
    0000000000316b20: e340d0600004 lg %r4,96(%r13)
    0000000000316b26: c0e50002b567 brasl %r14,36d5f4
    Call Trace:
    ([] jbd2_journal_file_inode+0x5e/0x138)
    [] mpage_da_map_and_submit+0x2e8/0x42c
    [] ext4_da_writepages+0x2da/0x504
    [] writeback_single_inode+0xf8/0x268
    [] writeback_sb_inodes+0xd2/0x18c
    [] writeback_inodes_wb+0x80/0x168
    [] wb_writeback+0x2aa/0x324
    [] wb_do_writeback+0xd2/0x274
    [] bdi_writeback_thread+0xba/0x1c4
    [] kthread+0xa6/0xb0
    [] kernel_thread_starter+0x6/0xc
    [] kernel_thread_starter+0x0/0xc
    INFO: lockdep is turned off.
    Last Breaking-Event-Address:
    [] jbd2_journal_file_inode+0x86/0x138
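
    The fix pattern is to hand these helpers the page address rather than
    the pfn (a sketch of the shape of the change):

        -       skey = page_get_storage_key(pfn);
        +       skey = page_get_storage_key(pfn << PAGE_SHIFT);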

    Reported-by: Sebastian Ott
    Signed-off-by: Heiko Carstens

    Heiko Carstens
     

23 May, 2011

1 commit

  • The page_clear_dirty primitive always sets the default storage key
    which resets the access control bits and the fetch protection bit.
    That will surprise a KVM guest that sets non-zero access control
    bits or the fetch protection bit. Merge page_test_dirty and
    page_clear_dirty back to a single function and only clear the
    dirty bit from the storage key.

    In addition move the functions page_test_and_clear_dirty and
    page_test_and_clear_young to page.h where they belong. This
    requires changing the parameter from a struct page * to a page
    frame number.
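
    A sketch of the merged primitive (hedged; shown with the pfn-to-address
    conversion that the later fix above made consistent):

        static inline int page_test_and_clear_dirty(unsigned long pfn, int mapped)
        {
                unsigned char skey = page_get_storage_key(pfn << PAGE_SHIFT);

                if (!(skey & _PAGE_CHANGED))
                        return 0;
                /* clear only the change bit; keep ACC and fetch protection */
                page_set_storage_key(pfn << PAGE_SHIFT,
                                     skey & ~_PAGE_CHANGED, mapped);
                return 1;
        }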

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     

23 Mar, 2011

1 commit


14 Jan, 2011

6 commits

  • PG_buddy can be converted to _mapcount == -2. So the PG_compound_lock can
    be added to page->flags without overflowing (because of the sparse section
    bits increasing) with CONFIG_X86_PAE=y and CONFIG_X86_PAT=y. This also
    has to move the memory hotplug code from _mapcount to lru.next to avoid
    any risk of clashes. We can't use lru.next for PG_buddy removal, but
    memory hotplug can use lru.next even more easily than the mapcount
    instead.
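
    A sketch of the encoding (using the changelog's _mapcount == -2; an
    ordinary free page has _mapcount == -1):

        #define PAGE_BUDDY_MAPCOUNT_VALUE (-2)

        static inline int PageBuddy(struct page *page)
        {
                return atomic_read(&page->_mapcount) == PAGE_BUDDY_MAPCOUNT_VALUE;
        }

        static inline void __SetPageBuddy(struct page *page)
        {
                VM_BUG_ON(atomic_read(&page->_mapcount) != -1);
                atomic_set(&page->_mapcount, PAGE_BUDDY_MAPCOUNT_VALUE);
        }

        static inline void __ClearPageBuddy(struct page *page)
        {
                VM_BUG_ON(!PageBuddy(page));
                atomic_set(&page->_mapcount, -1);
        }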

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Lately I've been working to make KVM use hugepages transparently without
    the usual restrictions of hugetlbfs. Some of the restrictions I'd like to
    see removed:

    1) hugepages have to be swappable or the guest physical memory remains
    locked in RAM and can't be paged out to swap

    2) if a hugepage allocation fails, regular pages should be allocated
    instead and mixed in the same vma without any failure and without
    userland noticing

    3) if some task quits and more hugepages become available in the
    buddy, guest physical memory backed by regular pages should be
    relocated on hugepages automatically in regions under
    madvise(MADV_HUGEPAGE) (ideally event driven by waking up the
    kernel daemon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes
    not null)

    4) avoidance of reservation and maximization of use of hugepages whenever
    possible. Reservation (needed to avoid runtime fatal failures) may be ok for
    1 machine with 1 database with 1 database cache with 1 database cache size
    known at boot time. It's definitely not feasible with a virtualization
    hypervisor usage like RHEV-H that runs an unknown number of virtual machines
    with an unknown size of each virtual machine with an unknown amount of
    pagecache that could be potentially useful in the host for guests not using
    O_DIRECT (aka cache=off).

    hugepages in the virtualization hypervisor (and also in the guest!) are
    much more important than in a regular host not using virtualization,
    because with NPT/EPT they decrease the tlb-miss cacheline accesses from 24
    to 19 in case only the hypervisor uses transparent hugepages, and they
    decrease the tlb-miss cacheline accesses from 19 to 15 in case both the
    linux hypervisor and the linux guest both use this patch (though the
    guest will limit the additional speedup to anonymous regions only for
    now...). Even more important is that the tlb miss handler is much slower
    on a NPT/EPT guest than for a regular shadow paging or no-virtualization
    scenario. So maximizing the amount of virtual memory cached by the TLB
    pays off significantly more with NPT/EPT than without (even if there would
    be no significant speedup in the tlb-miss runtime).

    The first (and more tedious) part of this work requires allowing the VM to
    handle anonymous hugepages mixed with regular pages transparently on
    regular anonymous vmas. This is what this patch tries to achieve in the
    least intrusive possible way. We want hugepages and hugetlb to be used in
    a way so that all applications can benefit without changes (as usual we
    leverage the KVM virtualization design: by improving the Linux VM at
    large, KVM gets the performance boost too).

    The most important design choice is: always fallback to 4k allocation if
    the hugepage allocation fails! This is the _very_ opposite of some large
    pagecache patches that failed with -EIO back then if a 64k (or similar)
    allocation failed...

    Second important decision (to reduce the impact of the feature on the
    existing pagetable handling code) is that at any time we can split a
    hugepage into 512 regular pages and it has to be done with an operation
    that can't fail. This way the reliability of the swapping isn't decreased
    (no need to allocate memory when we are short on memory to swap) and it's
    trivial to plug a split_huge_page* one-liner where needed without
    polluting the VM. Over time we can teach mprotect, mremap and friends to
    handle pmd_trans_huge natively without calling split_huge_page*. The fact
    it can't fail isn't just for swap: if split_huge_page would return -ENOMEM
    (instead of the current void) we'd need to rollback the mprotect from the
    middle of it (ideally including undoing the split_vma) which would be a
    big change and in the very wrong direction (it'd likely be simpler not to
    call split_huge_page at all and to teach mprotect and friends to handle
    hugepages instead of rolling them back from the middle). In short the
    very value of split_huge_page is that it can't fail.

    The collapsing and madvise(MADV_HUGEPAGE) part will remain separated and
    incremental and it'll just be a "harmless" addition later if this initial
    part is agreed upon. It also should be noted that locking-wise replacing
    regular pages with hugepages is going to be very easy if compared to what
    I'm doing below in split_huge_page, as it will only happen when
    page_count(page) matches page_mapcount(page) if we can take the PG_lock
    and mmap_sem in write mode. collapse_huge_page will be a "best effort"
    that (unlike split_huge_page) can fail at the minimal sign of trouble and
    we can try again later. collapse_huge_page will be similar to how KSM
    works and the madvise(MADV_HUGEPAGE) will work similar to
    madvise(MADV_MERGEABLE).

    The default I like is that transparent hugepages are used at page fault
    time. This can be changed with
    /sys/kernel/mm/transparent_hugepage/enabled. The control knob can be set
    to three values "always", "madvise", "never" which mean respectively that
    hugepages are always used, or only inside madvise(MADV_HUGEPAGE) regions,
    or never used. /sys/kernel/mm/transparent_hugepage/defrag instead
    controls if the hugepage allocation should defrag memory aggressively
    "always", only inside "madvise" regions, or "never".

    The pmd_trans_splitting/pmd_trans_huge locking is very solid. The
    put_page (from get_user_page users that can't use mmu notifier like
    O_DIRECT) that runs against a __split_huge_page_refcount instead was a
    pain to serialize in a way that would result always in a coherent page
    count for both tail and head. I think my locking solution with a
    compound_lock taken only after the page_first is valid and is still a
    PageHead should be safe but it surely needs review from SMP race point of
    view. In short there is no current existing way to serialize the O_DIRECT
    final put_page against split_huge_page_refcount so I had to invent a new
    one (O_DIRECT loses knowledge on the mapping status by the time gup_fast
    returns so...). And I didn't want to impact all gup/gup_fast users for
    now, maybe if we change the gup interface substantially we can avoid this
    locking, I admit I didn't think too much about it because changing the gup
    unpinning interface would be invasive.

    If we ignored O_DIRECT we could stick to the existing compound refcounting
    code, by simply adding a get_user_pages_fast_flags(foll_flags) where KVM
    (and any other mmu notifier user) would call it without FOLL_GET (and if
    FOLL_GET isn't set we'd just BUG_ON if nobody registered itself in the
    current task mmu notifier list yet). But O_DIRECT is fundamental for
    decent performance of virtualized I/O on fast storage so we can't avoid it
    to solve the race of put_page against split_huge_page_refcount to achieve
    a complete hugepage feature for KVM.

    Swap and oom works fine (well just like with regular pages ;). MMU
    notifier is handled transparently too, with the exception of the young bit
    on the pmd, that didn't have a range check but I think KVM will be fine
    because the whole point of hugepages is that EPT/NPT will also use a huge
    pmd when they notice gup returns pages with PageCompound set, so they
    won't care of a range and there's just the pmd young bit to check in that
    case.

    NOTE: in some cases if the L2 cache is small, this may slow things down
    and waste memory during COWs because 4M of memory are accessed in a single
    fault instead of 8k (the payoff is that after COW the program can run
    faster). So we might want to switch the copy_huge_page (and
    clear_huge_page too) to use non-temporal stores. I also extensively
    researched ways to avoid this cache thrashing with a full prefault logic
    that would cow in 8k/16k/32k/64k
    up to 1M (I can send those patches that fully implemented prefault) but I
    concluded they're not worth it and they add an huge additional complexity
    and they remove all tlb benefits until the full hugepage has been faulted
    in, to save a little bit of memory and some cache during app startup, but
    they still don't improve substantially the cache-trashing during startup
    if the prefault happens in >4k chunks. One reason is that those 4k pte
    entries copied are still mapped on a perfectly cache-colored hugepage, so
    the thrashing is the worst one can generate in those copies (cow of 4k
    page copies aren't so well colored so they thrash less, but again this
    results in software running faster after the page fault). Those prefault
    patches
    allowed things like a pte where post-cow pages were local 4k regular anon
    pages and the not-yet-cowed pte entries were pointing in the middle of
    some hugepage mapped read-only. If it doesn't payoff substantially with
    todays hardware it will payoff even less in the future with larger l2
    caches, and the prefault logic would bloat the VM a lot. If one is
    embedded, transparent_hugepage can be disabled during boot with sysfs or
    with the boot commandline parameter transparent_hugepage=0 (or
    transparent_hugepage=2 to restrict hugepages inside madvise regions) that
    will ensure not a single hugepage is allocated at boot time. It is simple
    enough to just disable transparent hugepage globally and let transparent
    hugepages be allocated selectively by applications in the MADV_HUGEPAGE
    region (both at page fault time, and if enabled with the
    collapse_huge_page too through the kernel daemon).

    This patch supports only hugepages mapped in the pmd, archs that have
    smaller hugepages will not fit in this patch alone. Also some archs like
    power have certain tlb limits that prevents mixing different page size in
    the same regions so they will not fit in this framework that requires
    "graceful fallback" to basic PAGE_SIZE in case of physical memory
    fragmentation. hugetlbfs remains a perfect fit for those because its
    software limits happen to match the hardware limits. hugetlbfs also
    remains a perfect fit for hugepage sizes like 1GByte that cannot be hoped
    to be found not fragmented after a certain system uptime and that would be
    very expensive to defragment with relocation, so requiring reservation.
    hugetlbfs is the "reservation way", the point of transparent hugepages is
    not to have any reservation at all and maximizing the use of cache and
    hugepages at all times automatically.

    Some performance result:

    vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
    memset page fault 1566023
    memset tlb miss 453854
    memset second tlb miss 453321
    random access tlb miss 41635
    random access second tlb miss 41658
    vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
    memset page fault 1566471
    memset tlb miss 453375
    memset second tlb miss 453320
    random access tlb miss 41636
    random access second tlb miss 41637
    vmx andrea # ./largepages3
    memset page fault 1566642
    memset tlb miss 453417
    memset second tlb miss 453313
    random access tlb miss 41630
    random access second tlb miss 41647
    vmx andrea # ./largepages3
    memset page fault 1566872
    memset tlb miss 453418
    memset second tlb miss 453315
    random access tlb miss 41618
    random access second tlb miss 41659
    vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
    vmx andrea # ./largepages3
    memset page fault 2182476
    memset tlb miss 460305
    memset second tlb miss 460179
    random access tlb miss 44483
    random access second tlb miss 44186
    vmx andrea # ./largepages3
    memset page fault 2182791
    memset tlb miss 460742
    memset second tlb miss 459962
    random access tlb miss 43981
    random access second tlb miss 43988

    ============
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>

    #define SIZE (3UL*1024*1024*1024)

    int main()
    {
            char *p = malloc(SIZE), *p2;
            struct timeval before, after;

            gettimeofday(&before, NULL);
            memset(p, 0, SIZE);
            gettimeofday(&after, NULL);
            printf("memset page fault %Lu\n",
                   (after.tv_sec-before.tv_sec)*1000000UL +
                   after.tv_usec-before.tv_usec);

            gettimeofday(&before, NULL);
            memset(p, 0, SIZE);
            gettimeofday(&after, NULL);
            printf("memset tlb miss %Lu\n",
                   (after.tv_sec-before.tv_sec)*1000000UL +
                   after.tv_usec-before.tv_usec);

            gettimeofday(&before, NULL);
            memset(p, 0, SIZE);
            gettimeofday(&after, NULL);
            printf("memset second tlb miss %Lu\n",
                   (after.tv_sec-before.tv_sec)*1000000UL +
                   after.tv_usec-before.tv_usec);

            gettimeofday(&before, NULL);
            for (p2 = p; p2 < p+SIZE; p2 += 4096)
                    *p2 = 0;
            gettimeofday(&after, NULL);
            printf("random access tlb miss %Lu\n",
                   (after.tv_sec-before.tv_sec)*1000000UL +
                   after.tv_usec-before.tv_usec);

            gettimeofday(&before, NULL);
            for (p2 = p; p2 < p+SIZE; p2 += 4096)
                    *p2 = 0;
            gettimeofday(&after, NULL);
            printf("random access second tlb miss %Lu\n",
                   (after.tv_sec-before.tv_sec)*1000000UL +
                   after.tv_usec-before.tv_usec);

            return 0;
    }
    ============

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This should work for both hugetlbfs and transparent hugepages.

    [akpm@linux-foundation.org: bring forward PageTransCompound() addition for bisectability]
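
    The test itself is simple (a sketch): true for any compound page, head
    or tail, and compiled away when THP is not configured:

        #ifdef CONFIG_TRANSPARENT_HUGEPAGE
        static inline int PageTransCompound(struct page *page)
        {
                return PageCompound(page);
        }
        #else
        static inline int PageTransCompound(struct page *page)
        {
                return 0;
        }
        #endif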
    Signed-off-by: Andrea Arcangeli
    Cc: Avi Kivity
    Cc: Marcelo Tosatti
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • split_huge_page must transform a compound page to a regular page and needs
    ClearPageCompound.
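
    A sketch for the CONFIG_PAGEFLAGS_EXTENDED case, where clearing the
    head flag of the (former) head page is enough; the tail pages are
    rewritten separately by split_huge_page:

        static inline void ClearPageCompound(struct page *page)
        {
                BUG_ON(!PageHead(page));
                ClearPageHead(page);
        }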

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Reviewed-by: Christoph Lameter
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Add a new compound_lock() needed to serialize put_page against
    __split_huge_page_refcount().
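
    A sketch of the new lock: a bit spinlock on a page flag, compiled out
    when THP is disabled:

        static inline void compound_lock(struct page *page)
        {
        #ifdef CONFIG_TRANSPARENT_HUGEPAGE
                bit_spin_lock(PG_compound_lock, &page->flags);
        #endif
        }

        static inline void compound_unlock(struct page *page)
        {
        #ifdef CONFIG_TRANSPARENT_HUGEPAGE
                bit_spin_unlock(PG_compound_lock, &page->flags);
        #endif
        }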

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Temporary IO failures, eg. due to loss of both multipath paths, can
    permanently leave the PageError bit set on a page, resulting in msync or
    fsync returning -EIO over and over again, even if IO is now getting to the
    disk correctly.

    We already clear the AS_ENOSPC and AS_EIO bits in mapping->flags in the
    filemap_fdatawait_range function. Also clearing the PageError bit on the
    page allows subsequent msync or fsync calls on this file to return without
    an error, if the subsequent IO succeeds.
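
    The wait-side change is a one-liner in filemap_fdatawait_range (a
    sketch):

        -               if (PageError(page))
        +               if (TestClearPageError(page))
                                ret = -EIO;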

    Unfortunately data written out in the msync or fsync call that returned
    -EIO can still get lost, because the page dirty bit appears to not get
    restored on IO error. However, the alternative could be potentially all
    of memory filling up with uncleanable dirty pages, hanging the system, so
    there is no nice choice here...

    Signed-off-by: Rik van Riel
    Acked-by: Valerie Aurora
    Acked-by: Jeff Layton
    Cc: Theodore Ts'o
    Acked-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

25 Oct, 2010

1 commit


16 Jul, 2010

1 commit


17 Dec, 2009

1 commit

  • * 'for-33' of git://repo.or.cz/linux-kbuild: (29 commits)
    net: fix for utsrelease.h moving to generated
    gen_init_cpio: fixed fwrite warning
    kbuild: fix make clean after mismerge
    kbuild: generate modules.builtin
    genksyms: properly consider EXPORT_UNUSED_SYMBOL{,_GPL}()
    score: add asm/asm-offsets.h wrapper
    unifdef: update to upstream revision 1.190
    kbuild: specify absolute paths for cscope
    kbuild: create include/generated in silentoldconfig
    scripts/package: deb-pkg: use fakeroot if available
    scripts/package: add KBUILD_PKG_ROOTCMD variable
    scripts/package: tar-pkg: use tar --owner=root
    Kbuild: clean up marker
    net: add net_tstamp.h to headers_install
    kbuild: move utsrelease.h to include/generated
    kbuild: move autoconf.h to include/generated
    drop explicit include of autoconf.h
    kbuild: move compile.h to include/generated
    kbuild: drop include/asm
    kbuild: do not check for include/asm-$ARCH
    ...

    Fixed non-conflicting clean merge of modpost.c as per comments from
    Stephen Rothwell (modpost.c had grown an include of linux/autoconf.h
    that needed to be changed to generated/autoconf.h)

    Linus Torvalds
     

16 Dec, 2009

3 commits

    Rename get_uflags() to stable_page_flags() and make it a global function
    for use in the hwpoison page flags filter, which needs to compare user
    page flags with the value provided by user space.

    Also move KPF_* to kernel-page-flags.h for use by user space tools.

    Acked-by: Matt Mackall
    Signed-off-by: Andi Kleen
    CC: Nick Piggin
    CC: Christoph Lameter
    Signed-off-by: Wu Fengguang
    Signed-off-by: Andi Kleen

    Wu Fengguang
     
  • The unpoisoning interface is useful for stress testing tools to
    reclaim poisoned pages (to prevent OOM)

    There is no hardware level unpoisoning, so this
    cannot be used for real memory errors, only for software injected errors.

    Note that it may leak pages silently - those which have been removed from
    the LRU cache, but not isolated from page cache/swap cache at hwpoison
    time. Especially the stress test of dirty swap cache pages shall reboot
    the system before exhausting memory.

    AK: Fix comments, add documentation, add printks, rename symbol

    Signed-off-by: Wu Fengguang
    Signed-off-by: Andi Kleen

    Wu Fengguang
     
  • Remove three degrees of obfuscation, left over from when we had
    CONFIG_UNEVICTABLE_LRU. MLOCK_PAGES is CONFIG_HAVE_MLOCKED_PAGE_BIT is
    CONFIG_HAVE_MLOCK is CONFIG_MMU. rmap.o (and memory-failure.o) are only
    built when CONFIG_MMU, so don't need such conditions at all.

    Somehow, I feel no compulsion to remove the CONFIG_HAVE_MLOCK* lines from
    169 defconfigs: leave those to evolve in due course.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Nick Piggin
    Reviewed-by: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Wu Fengguang
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

12 Dec, 2009

1 commit


24 Sep, 2009

1 commit

  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
    HWPOISON: Enable error_remove_page on btrfs
    HWPOISON: Add simple debugfs interface to inject hwpoison on arbitrary PFNs
    HWPOISON: Add madvise() based injector for hardware poisoned pages v4
    HWPOISON: Enable error_remove_page for NFS
    HWPOISON: Enable .remove_error_page for migration aware file systems
    HWPOISON: The high level memory error handler in the VM v7
    HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
    HWPOISON: shmem: call set_page_dirty() with locked page
    HWPOISON: Define a new error_remove_page address space op for async truncation
    HWPOISON: Add invalidate_inode_page
    HWPOISON: Refactor truncate to allow direct truncating of page v2
    HWPOISON: check and isolate corrupted free pages v2
    HWPOISON: Handle hardware poisoned pages in try_to_unmap
    HWPOISON: Use bitmask/action code for try_to_unmap behaviour
    HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
    HWPOISON: Add poison check to page fault handling
    HWPOISON: Add basic support for poisoned pages in fault handler v3
    HWPOISON: Add new SIGBUS error codes for hardware poison signals
    HWPOISON: Add support for poison swap entries v2
    HWPOISON: Export some rmap vma locking to outside world
    ...

    Linus Torvalds
     

22 Sep, 2009

2 commits

  • Make page_has_private() return a true boolean value and remove the double
    negations from the two callsites using it for arithmetic.
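
    After the change the helper returns a clean 0/1 (a sketch matching the
    description):

        #define PAGE_FLAGS_PRIVATE                              \
                (1 << PG_private | 1 << PG_private_2)

        static inline int page_has_private(struct page *page)
        {
                return !!(page->flags & PAGE_FLAGS_PRIVATE);
        }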

    Signed-off-by: Johannes Weiner
    Cc: Christoph Lameter
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • By the time PG_mlocked is cleared in the page freeing path, nobody else is
    looking at our page->flags anymore.

    It is thus safe to make the test-and-clear non-atomic and thereby remove
    an unnecessary and expensive operation from a hotpath.
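
    With no other references left, the freeing path can use the non-atomic
    variant (a sketch; the double underscores mark the unlocked
    test-and-clear):

        /* page is being freed; nobody else can see page->flags */
        int was_mlocked = __TestClearPageMlocked(page);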

    Signed-off-by: Johannes Weiner
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

16 Sep, 2009

1 commit

  • Hardware poisoned pages need special handling in the VM and shouldn't be
    touched again. This requires a new page flag. Define it here.

    The page flags wars seem to be over, so it shouldn't be a problem
    to get a new one.

    v2: Add TestSetHWPoison (suggested by Johannes Weiner)
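
    A sketch of the declaration in page-flags.h, using the existing macro
    families (including the TestSet variant mentioned above):

        #ifdef CONFIG_MEMORY_FAILURE
        PAGEFLAG(HWPoison, hwpoison)
        TESTSCFLAG(HWPoison, hwpoison)
        #define __PG_HWPOISON (1UL << PG_hwpoison)
        #else
        PAGEFLAG_FALSE(HWPoison)
        #define __PG_HWPOISON 0
        #endif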

    Acked-by: Christoph Lameter
    Signed-off-by: Andi Kleen

    Andi Kleen
     

27 Aug, 2009

1 commit


17 Jun, 2009

2 commits


11 May, 2009

1 commit


03 Apr, 2009

2 commits

  • Recruit a page flag to aid in cache management. The following extra flag is
    defined:

    (1) PG_fscache (PG_private_2)

    The marked page is backed by a local cache and is pinning resources in the
    cache driver.

    If PG_fscache is set, then things that checked for PG_private will now
    also check for that. This includes things like truncation and page
    invalidation. The function page_has_private() has been added to check
    for both PG_private and PG_private_2 at the same time.
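
    The new flag is an alias for PG_private_2; a sketch of the wrappers
    exposed to filesystems:

        #define PageFsCache(page)               PagePrivate2((page))
        #define SetPageFsCache(page)            SetPagePrivate2((page))
        #define ClearPageFsCache(page)          ClearPagePrivate2((page))
        #define TestSetPageFsCache(page)        TestSetPagePrivate2((page))
        #define TestClearPageFsCache(page)      TestClearPagePrivate2((page))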

    Signed-off-by: David Howells
    Acked-by: Steve Dickson
    Acked-by: Trond Myklebust
    Acked-by: Rik van Riel
    Acked-by: Al Viro
    Tested-by: Daire Byrne

    David Howells
     
  • The attached patch causes read_cache_pages() to release page-private data on a
    page for which add_to_page_cache() fails. If the filler function fails, then
    the problematic page is left attached to the pagecache (with appropriate flags
    set, one presumes) and the remaining to-be-attached pages are invalidated and
    discarded. This permits pages with caching references associated with them to
    be cleaned up.

    The invalidatepage() address space op is called (indirectly) to do the honours.
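
    A sketch of the cleanup helper (hedged; close in shape to the upstream
    read_cache_pages_invalidate_page):

        static void read_cache_pages_invalidate_page(struct address_space *mapping,
                                                     struct page *page)
        {
                if (page_has_private(page)) {
                        if (!trylock_page(page))
                                BUG();
                        page->mapping = mapping;
                        do_invalidatepage(page, 0);
                        page->mapping = NULL;
                        unlock_page(page);
                }
                page_cache_release(page);
        }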

    Signed-off-by: David Howells
    Acked-by: Steve Dickson
    Acked-by: Trond Myklebust
    Acked-by: Rik van Riel
    Acked-by: Al Viro
    Tested-by: Daire Byrne

    David Howells
     

01 Apr, 2009

1 commit

  • The mlock() facility does not exist for NOMMU since all mappings are
    effectively locked anyway, so we don't make the bits available when
    they're not useful.
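
    A sketch: the flag and its accessors compile to no-ops when there is no
    MMU:

        #if defined(CONFIG_MMU)
        PAGEFLAG(Mlocked, mlocked) __CLEARPAGEFLAG(Mlocked, mlocked)
                TESTSCFLAG(Mlocked, mlocked)
        #else
        PAGEFLAG_FALSE(Mlocked) SETPAGEFLAG_NOOP(Mlocked)
                TESTCLEARFLAG_FALSE(Mlocked)
        #endif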

    Signed-off-by: David Howells
    Reviewed-by: KOSAKI Motohiro
    Cc: Peter Zijlstra
    Cc: Greg Ungerer
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Enrik Berkhan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells