04 Aug, 2011

1 commit

  • init_fault_attr_dentries() is used to export fault_attr via debugfs.
    But it can only export it in the debugfs root directory.

    Per Forlin is working on mmc_fail_request, which adds support for injecting
    data errors after a completed host transfer in the MMC subsystem.

    The fault_attr for mmc_fail_request should be defined per mmc host and
    exported in a per-host debugfs directory such as
    /sys/kernel/debug/mmc0/mmc_fail_request.

    init_fault_attr_dentries() doesn't help with that, so this introduces
    fault_create_debugfs_attr(), which can create the attribute directory
    under an arbitrary parent directory, and replaces
    init_fault_attr_dentries() with it (see the sketch at the end of this
    entry).

    [akpm@linux-foundation.org: extraneous semicolon, per Randy]
    Signed-off-by: Akinobu Mita
    Tested-by: Per Forlin
    Cc: Jens Axboe
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Matt Mackall
    Cc: Randy Dunlap
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
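
    A minimal usage sketch of the new interface described above. The
    fault_create_debugfs_attr() signature is taken from the description and
    include/linux/fault-inject.h as I recall it; the host structure, field
    names and error handling below are hypothetical:

        #include <linux/debugfs.h>
        #include <linux/err.h>
        #include <linux/fault-inject.h>

        struct my_mmc_host {                     /* hypothetical host structure */
                struct dentry *debugfs_root;     /* e.g. /sys/kernel/debug/mmc0 */
                struct fault_attr fail_request;  /* per-host fault attributes */
        };

        static int my_mmc_add_fail_attr(struct my_mmc_host *host)
        {
                struct dentry *dir;

                /* creates <debugfs_root>/mmc_fail_request/{probability,times,...} */
                dir = fault_create_debugfs_attr("mmc_fail_request",
                                                host->debugfs_root,
                                                &host->fail_request);
                return IS_ERR(dir) ? PTR_ERR(dir) : 0;
        }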
     

26 Jul, 2011

2 commits

  • With zone_reclaim_mode enabled, it's possible for zones to be considered
    full in the zonelist_cache so they are skipped in the future. If the
    process enters direct reclaim, the ZLC may still consider zones to be full
    even after reclaiming pages. Reconsider all zones for allocation if
    direct reclaim returns successfully (a sketch of the idea follows this
    entry).

    Signed-off-by: Mel Gorman
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
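
    A simplified sketch of the idea. The call site is
    __alloc_pages_direct_reclaim() as I understand the allocator of that
    time; the surrounding code is approximate:

        /* In __alloc_pages_direct_reclaim(), after reclaim has run: */
        if (unlikely(!(*did_some_progress)))
                return NULL;

        /*
         * Direct reclaim just freed pages, so zones the zonelist cache (ZLC)
         * still remembers as "full" may be usable again.  Clear the full
         * bits so get_page_from_freelist() reconsiders every zone.
         */
        if (NUMA_BUILD)
                zlc_clear_zones_full(zonelist);

        page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
                                      high_zoneidx, alloc_flags,
                                      preferred_zone, migratetype);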
     
  • There have been a small number of complaints about significant stalls
    while copying large amounts of data on NUMA machines reported on a
    distribution bugzilla. In these cases, zone_reclaim was enabled by
    default due to large NUMA distances. In general, the complaints have not
    been about the workload itself unless it was a file server (in which case
    the recommendation was to disable zone_reclaim).

    The stalls are mostly due to significant amounts of time spent scanning
    the preferred zone for pages to free. After a failure, it might fall back
    to another node (as zonelists are often node-ordered rather than
    zone-ordered) but stall quickly again when the next allocation attempt
    occurs. In bad cases, each page allocated results in a full scan of the
    preferred zone.

    Patch 1 checks the preferred zone for recent allocation failure
    which is particularly important if zone_reclaim has failed
    recently. This avoids rescanning the zone in the near future and
    instead falling back to another node. This may hurt node locality
    in some cases but a failure to zone_reclaim is more expensive than
    a remote access.

    Patch 2 clears the zlc information after direct reclaim.
    Otherwise, zone_reclaim can mark zones full, direct reclaim can
    reclaim enough pages but the zone is still not considered for
    allocation.

    This was tested on a 24-thread 2-node x86_64 machine. The tests were
    focused on large amounts of IO. All tests were bound to the CPUs on
    node-0 to avoid disturbances due to processes being scheduled on different
    nodes. The kernels tested are

    3.0-rc6-vanilla Vanilla 3.0-rc6
    zlcfirst Patch 1 applied
    zlcreconsider Patches 1+2 applied

    FS-Mark
    ./fs_mark -d /tmp/fsmark-10813 -D 100 -N 5000 -n 208 -L 35 -t 24 -S0 -s 524288
    fsmark-3.0-rc6 3.0-rc6 3.0-rc6
    vanilla zlcfirst zlcreconsider
    Files/s min 54.90 ( 0.00%) 49.80 (-10.24%) 49.10 (-11.81%)
    Files/s mean 100.11 ( 0.00%) 135.17 (25.94%) 146.93 (31.87%)
    Files/s stddev 57.51 ( 0.00%) 138.97 (58.62%) 158.69 (63.76%)
    Files/s max 361.10 ( 0.00%) 834.40 (56.72%) 802.40 (55.00%)
    Overhead min 76704.00 ( 0.00%) 76501.00 ( 0.27%) 77784.00 (-1.39%)
    Overhead mean 1485356.51 ( 0.00%) 1035797.83 (43.40%) 1594680.26 (-6.86%)
    Overhead stddev 1848122.53 ( 0.00%) 881489.88 (109.66%) 1772354.90 ( 4.27%)
    Overhead max 7989060.00 ( 0.00%) 3369118.00 (137.13%) 10135324.00 (-21.18%)
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds) 501.49 493.91 499.93
    Total Elapsed Time (seconds) 2451.57 2257.48 2215.92

    MMTests Statistics: vmstat
    Page Ins 46268 63840 66008
    Page Outs 90821596 90671128 88043732
    Swap Ins 0 0 0
    Swap Outs 0 0 0
    Direct pages scanned 13091697 8966863 8971790
    Kswapd pages scanned 0 1830011 1831116
    Kswapd pages reclaimed 0 1829068 1829930
    Direct pages reclaimed 13037777 8956828 8648314
    Kswapd efficiency 100% 99% 99%
    Kswapd velocity 0.000 810.643 826.346
    Direct efficiency 99% 99% 96%
    Direct velocity 5340.128 3972.068 4048.788
    Percentage direct scans 100% 83% 83%
    Page writes by reclaim 0 3 0
    Slabs scanned 796672 720640 720256
    Direct inode steals 7422667 7160012 7088638
    Kswapd inode steals 0 1736840 2021238

    Test completes far faster with a large increase in the number of files
    created per second. Standard deviation is high as a small number of
    iterations were much higher than the mean. The number of pages scanned by
    zone_reclaim is reduced and kswapd is used for more work.

    LARGE DD
    3.0-rc6 3.0-rc6 3.0-rc6
    vanilla zlcfirst zlcreconsider
    download tar 59 ( 0.00%) 59 ( 0.00%) 55 ( 7.27%)
    dd source files 527 ( 0.00%) 296 (78.04%) 320 (64.69%)
    delete source 36 ( 0.00%) 19 (89.47%) 20 (80.00%)
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds) 125.03 118.98 122.01
    Total Elapsed Time (seconds) 624.56 375.02 398.06

    MMTests Statistics: vmstat
    Page Ins 3594216 439368 407032
    Page Outs 23380832 23380488 23377444
    Swap Ins 0 0 0
    Swap Outs 0 436 287
    Direct pages scanned 17482342 69315973 82864918
    Kswapd pages scanned 0 519123 575425
    Kswapd pages reclaimed 0 466501 522487
    Direct pages reclaimed 5858054 2732949 2712547
    Kswapd efficiency 100% 89% 90%
    Kswapd velocity 0.000 1384.254 1445.574
    Direct efficiency 33% 3% 3%
    Direct velocity 27991.453 184832.737 208171.929
    Percentage direct scans 100% 99% 99%
    Page writes by reclaim 0 5082 13917
    Slabs scanned 17280 29952 35328
    Direct inode steals 115257 1431122 332201
    Kswapd inode steals 0 0 979532

    This test downloads a large tarfile and copies it with dd a number of
    times - similar to the most recent bug report I've dealt with. Time to
    completion is reduced. The number of pages scanned directly is still
    disturbingly high with a low efficiency but this is likely due to the
    number of dirty pages encountered. The figures could probably be improved
    with more work around how kswapd is used and how dirty pages are handled
    but that is separate work and this result is significant on its own.

    Streaming Mapped Writer
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds) 124.47 111.67 112.64
    Total Elapsed Time (seconds) 2138.14 1816.30 1867.56

    MMTests Statistics: vmstat
    Page Ins 90760 89124 89516
    Page Outs 121028340 120199524 120736696
    Swap Ins 0 86 55
    Swap Outs 0 0 0
    Direct pages scanned 114989363 96461439 96330619
    Kswapd pages scanned 56430948 56965763 57075875
    Kswapd pages reclaimed 27743219 27752044 27766606
    Direct pages reclaimed 49777 46884 36655
    Kswapd efficiency 49% 48% 48%
    Kswapd velocity 26392.541 31363.631 30561.736
    Direct efficiency 0% 0% 0%
    Direct velocity 53780.091 53108.759 51581.004
    Percentage direct scans 67% 62% 62%
    Page writes by reclaim 385 122 1513
    Slabs scanned 43008 39040 42112
    Direct inode steals 0 10 8
    Kswapd inode steals 733 534 477

    This test just creates a large file mapping and writes to it linearly.
    Time to completion is again reduced.

    The gains are mostly down to two things. In many cases, there is less
    scanning as zone_reclaim simply gives up faster due to recent failures.
    The second reason is that memory is used more efficiently. Instead of
    scanning the preferred zone every time, the allocator falls back to
    another zone and uses it instead, improving overall memory utilisation.

    This patch: initialise ZLC for first zone eligible for zone_reclaim.

    The zonelist cache (ZLC) is used among other things to record if
    zone_reclaim() failed for a particular zone recently. The intention is to
    avoid a high cost scanning extremely long zonelists or scanning within the
    zone uselessly.

    Currently the zonelist cache is setup only after the first zone has been
    considered and zone_reclaim() has been called. The objective was to avoid
    a costly setup, but zone_reclaim is itself quite expensive. If it is
    failing regularly, such as when the first eligible zone has mostly mapped
    pages, the cost in scanning and allocation stalls is far higher than the
    ZLC initialisation step.

    This patch initialises ZLC before the first eligible zone calls
    zone_reclaim(). Once initialised, it is checked whether the zone failed
    zone_reclaim recently. If it has, the zone is skipped. As the first zone
    is now being checked, additional care has to be taken about zones marked
    full. A zone can be marked "full" because it does not have enough
    unmapped pages for zone_reclaim, but this is excessive as direct reclaim
    or kswapd may succeed where zone_reclaim fails. Zones are now marked
    "full" only when zone_reclaim has actually scanned the zone and still
    failed to reclaim enough pages.

    Signed-off-by: Mel Gorman
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

13 Jul, 2011

1 commit

  • SPARSEMEM w/o VMEMMAP and DISCONTIGMEM, both used only on 32bit, use a
    sections array to map pfn to nid, which is limited in granularity. If
    NUMA nodes are laid out such that the mapping cannot be accurate, boot
    will fail triggering BUG_ON() in mminit_verify_page_links().

    On 32bit, the granularity is 512MiB with PAE and SPARSEMEM. This seems to
    have been granular enough until commit 2706a0bf7b (x86, NUMA: Enable
    CONFIG_AMD_NUMA on 32bit too). Apparently, there is a machine which
    aligns NUMA nodes to 128MiB and has only AMD NUMA but not SRAT. This
    led to the following BUG_ON().

    On node 0 totalpages: 2096615
    DMA zone: 32 pages used for memmap
    DMA zone: 0 pages reserved
    DMA zone: 3927 pages, LIFO batch:0
    Normal zone: 1740 pages used for memmap
    Normal zone: 220978 pages, LIFO batch:31
    HighMem zone: 16405 pages used for memmap
    HighMem zone: 1853533 pages, LIFO batch:31
    BUG: Int 6: CR2 (null)
    EDI (null) ESI 00000002 EBP 00000002 ESP c1543ecc
    EBX f2400000 EDX 00000006 ECX (null) EAX 00000001
    err (null) EIP c16209aa CS 00000060 flg 00010002
    Stack: f2400000 00220000 f7200800 c1620613 00220000 01000000 04400000 00238000
    (null) f7200000 00000002 f7200b58 f7200800 c1620929 000375fe (null)
    f7200b80 c16395f0 00200a02 f7200a80 (null) 000375fe 00000002 (null)
    Pid: 0, comm: swapper Not tainted 2.6.39-rc5-00181-g2706a0b #17
    Call Trace:
    [] ? early_fault+0x2e/0x2e
    [] ? mminit_verify_page_links+0x12/0x42
    [] ? memmap_init_zone+0xaf/0x10c
    [] ? free_area_init_node+0x2b9/0x2e3
    [] ? free_area_init_nodes+0x3f2/0x451
    [] ? paging_init+0x112/0x118
    [] ? setup_arch+0x791/0x82f
    [] ? start_kernel+0x6a/0x257

    This patch implements node_map_pfn_alignment(), which determines the
    maximum internode alignment, and updates numa_register_memblks() to
    reject the NUMA configuration if that alignment exceeds the pfn -> nid
    mapping granularity of the memory model as determined by
    PAGES_PER_SECTION (see the sketch at the end of this entry).

    This makes the problematic machine boot w/ flatmem by rejecting the
    NUMA config and provides protection against crazy NUMA configurations.

    Signed-off-by: Tejun Heo
    Link: http://lkml.kernel.org/r/20110712074534.GB2872@htj.dyndns.org
    LKML-Reference:
    Reported-and-Tested-by: Hans Rosenfeld
    Cc: Conny Seidel
    Signed-off-by: H. Peter Anvin

    Tejun Heo
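
    A hedged sketch of the rejection check added to numa_register_memblks()
    in arch/x86/mm/numa.c; the exact message and types are approximate:

        u64 pfn_align = node_map_pfn_alignment();

        /* Reject the NUMA config if the sections array cannot represent it. */
        if (pfn_align && pfn_align < PAGES_PER_SECTION) {
                printk(KERN_WARNING
                       "Node alignment %LuMB < min %LuMB, rejecting NUMA config\n",
                       PFN_PHYS(pfn_align) >> 20,
                       PFN_PHYS(PAGES_PER_SECTION) >> 20);
                return -EINVAL;
        }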
     

02 Jun, 2011

1 commit

  • This reverts commit a197b59ae6e8bee56fcef37ea2482dc08414e2ac.

    As rmk says:
    "Commit a197b59ae6e8 (mm: fail GFP_DMA allocations when ZONE_DMA is not
    configured) is causing regressions on ARM with various drivers which
    use GFP_DMA.

    The behaviour up until now has been to silently ignore that flag when
    CONFIG_ZONE_DMA is not enabled, and to allocate from the normal zone.
    However, as a result of the above commit, such allocations now fail
    which causes drivers to fail. These are regressions compared to the
    previous kernel version."

    so just revert it.

    Requested-by: Russell King
    Acked-by: Andrew Morton
    Cc: David Rientjes
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

27 May, 2011

1 commit

  • During memory reclaim we determine the number of pages to be scanned per
    zone as

    (anon + file) >> priority.
    Assume
    scan = (anon + file) >> priority.

    If scan < SWAP_CLUSTER_MAX, the scan will be skipped for this time and
    priority gets higher. This has some problems.

    1. The priority is increased by 1 without any scan having been done.
    For a scan to happen at DEF_PRIORITY, the zone needs more than 512MB of
    pages (SWAP_CLUSTER_MAX << DEF_PRIORITY with 4KB pages). If
    pages >> priority < SWAP_CLUSTER_MAX, the amount is recorded and the scan
    is batched later, but one priority level is still lost. If the memory
    size is below 16MB, pages >> priority is 0 and no scan ever happens at
    DEF_PRIORITY.

    2. If zone->all_unreclaimable == true, the zone is scanned only when
    priority == 0, so x86's ZONE_DMA will never be recovered until the user
    of its pages frees memory by itself.

    3. With memcg, the memory limit can be small. A small memcg reaches
    priority < DEF_PRIORITY-2 very easily and then needs to call
    wait_iff_congested(). For a scan to happen before priority = 9, at least
    64MB of memory would be needed.

    This patch therefore forces a scan of SWAP_CLUSTER_MAX pages (see the
    sketch at the end of this entry) when:

    1. the target is small enough, and
    2. it's kswapd or memcg reclaim.

    Then we can avoid a rapid priority drop and may be able to recover
    all_unreclaimable in small zones. This patch also removes nr_saved_scan,
    since scanning now happens at the current priority even when
    pages >> priority is very small.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Ying Han
    Cc: Balbir Singh
    Cc: KOSAKI Motohiro
    Cc: Daisuke Nishimura
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
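
    A heavily simplified sketch of the change to get_scan_count() in
    mm/vmscan.c; variable and helper names are approximate:

        bool force_scan = false;

        /* kswapd and memcg (non-global) reclaim must always scan something */
        if (scanning_global_lru(sc) && current_is_kswapd())
                force_scan = true;
        if (!scanning_global_lru(sc))
                force_scan = true;

        /* ... later, when sizing the per-LRU scan targets ... */
        scan = zone_nr_lru_pages(zone, sc, lru);
        scan >>= priority;
        if (!scan && force_scan)
                scan = SWAP_CLUSTER_MAX;  /* instead of skipping and dropping priority */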
     

25 May, 2011

10 commits

  • I believe I found a problem in __alloc_pages_slowpath, which allows a
    process to get stuck endlessly looping, even when lots of memory is
    available.

    Running an I/O and memory intensive stress-test I see a 0-order page
    allocation with __GFP_IO and __GFP_WAIT, running on a system with very
    little free memory. Right about the same time that the stress-test gets
    killed by the OOM-killer, the utility trying to allocate memory gets stuck
    in __alloc_pages_slowpath even though most of the system's memory was
    freed by the oom-kill of the stress-test.

    The utility ends up looping from the rebalance label down through
    wait_iff_congested continuously. Because order=0,
    __alloc_pages_direct_compact skips the call to get_page_from_freelist.
    Because all of the reclaimable memory on the system has already been
    reclaimed, __alloc_pages_direct_reclaim skips the call to
    get_page_from_freelist. Since there is no __GFP_FS flag, the block with
    __alloc_pages_may_oom is skipped. The loop hits the wait_iff_congested,
    then jumps back to rebalance without ever trying to
    get_page_from_freelist. This loop repeats infinitely.

    The test case is pretty pathological. Running a mix of I/O stress-tests
    that do a lot of fork() and consume all of the system memory, I can pretty
    reliably hit this on 600 nodes, in about 12 hours. 32GB/node.

    Signed-off-by: Andrew Barry
    Signed-off-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Barry
     
  • The page allocator will improperly return a page from ZONE_NORMAL even
    when __GFP_DMA is passed if CONFIG_ZONE_DMA is disabled. The caller
    expects DMA memory, perhaps for ISA devices with 16-bit address registers,
    and may get higher memory resulting in undefined behavior.

    This patch causes the page allocator to return NULL in such
    circumstances, with a warning emitted to the kernel log on the first
    occurrence (see the sketch at the end of this entry).

    Signed-off-by: David Rientjes
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
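
    A sketch of the added check; its placement early in
    __alloc_pages_nodemask() is my assumption.  Note that the 02 Jun, 2011
    entry above reverts this change because ARM drivers relied on the old
    silent fallback:

        #ifndef CONFIG_ZONE_DMA
                if (WARN_ON_ONCE(gfp_mask & __GFP_DMA))
                        return NULL;    /* warn once, then fail the allocation */
        #endif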
     
  • This fixes a problem where the first pageblock got marked MIGRATE_RESERVE
    even though it only had a few free pages. For example, on the current ARM
    port the kernel starts at offset 0x8000 to leave room for boot
    parameters, and that memory is freed later.

    This in turn caused no contiguous memory to be reserved and frequent
    kswapd wakeups that emptied the caches to get more contiguous memory.

    Unfortunately, ARM needs an order-2 allocation for the pgd (see
    arm/mm/pgd.c#pgd_alloc()), so the issue is neither minor nor easily
    avoidable (see the sketch at the end of this entry).

    [kosaki.motohiro@jp.fujitsu.com: added some explanation]
    [kosaki.motohiro@jp.fujitsu.com: add !pfn_valid_within() to check]
    [minchan.kim@gmail.com: check end_pfn in pageblock_is_reserved]
    Signed-off-by: John Stultz
    Signed-off-by: Arve Hjønnevåg
    Signed-off-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Acked-by: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arve Hjønnevåg
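
    An approximate sketch of the new helper, with its shape inferred from the
    description and the bracketed fixup notes above:

        /*
         * Returns true if any page in the pageblock is reserved, e.g. the
         * hole the ARM kernel leaves below its 0x8000 load offset.
         */
        static bool pageblock_is_reserved(unsigned long start_pfn,
                                          unsigned long end_pfn)
        {
                unsigned long pfn;

                for (pfn = start_pfn; pfn < end_pfn; pfn++) {
                        if (!pfn_valid_within(pfn) ||
                            PageReserved(pfn_to_page(pfn)))
                                return true;
                }
                return false;
        }

    setup_zone_migrate_reserve() can then skip such pageblocks when choosing
    MIGRATE_RESERVE blocks.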
     
  • This originally started as a simple patch to give vmalloc() some more
    verbose output on failure on top of the plain page allocator messages.
    Johannes suggested that it might be nicer to lead with the vmalloc() info
    _before_ the page allocator messages.

    But, I do think there's a lot of value in what __alloc_pages_slowpath()
    does with its filtering and so forth.

    This patch creates a new function which other allocators can call instead
    of relying on the internal page allocator warnings (a usage sketch
    follows this entry). It also gives this function private rate-limiting
    which separates it from other printk_ratelimit() users.

    Signed-off-by: Dave Hansen
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Michal Nazarewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
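
    A hedged sketch of the new helper's assumed prototype and a hypothetical
    caller (the vmalloc message below is illustrative):

        /* mm/page_alloc.c -- assumed prototype */
        void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...);

        /*
         * Hypothetical vmalloc()-style caller: lead with the caller's own
         * context, then let the helper print the rate-limited allocator dump.
         */
        warn_alloc_failed(gfp_mask, 0,
                          "vmalloc: allocation failure: %lu bytes\n", size);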
     
  • It's uncertain this has been beneficial, so it's safer to undo it. All
    other compaction users would still go in synchronous mode if a first
    attempt at async compaction failed. Hopefully we don't need to force
    special behavior for THP (which is the only __GFP_NO_KSWAPD user so far
    and the easiest to exercise and notice). This also makes __GFP_NO_KSWAPD
    return to its original strict semantics of only bypassing kswapd, as THP
    allocations have khugepaged for async THP allocations/compactions.

    Signed-off-by: Andrea Arcangeli
    Cc: Alex Villacis Lasso
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Currently, cpu hotplug updates pcp->stat_threshold, but memory hotplug
    doesn't. There is no reason for this.

    [akpm@linux-foundation.org: fix CONFIG_SMP=n build]
    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Currently, memory hotplug calls setup_per_zone_wmarks() and
    calculate_zone_inactive_ratio(), but doesn't call
    setup_per_zone_lowmem_reserve().

    This means the number of reserved pages isn't updated even when memory
    hotplug occurs. This patch fixes it.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Commit bce7394a3e ("page-allocator: reset wmark_min and inactive ratio of
    zone when hotplug happens") introduced invalid section references.
    setup_per_zone_inactive_ratio() is marked __init and therefore can't be
    referenced from memory hotplug code.

    This patch marks it as __meminit and also marks caller as __ref.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Cc: Yasunori Goto
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Signed-off-by: Sergey Senozhatsky
    Reviewed-by: Christoph Lameter
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Architectures that implement their own show_mem() function did not pass
    the filter argument to show_free_areas() to appropriately avoid emitting
    the state of nodes that are disallowed in the current context. This patch
    passes the filter argument to show_free_areas() so those nodes are
    avoided.

    This patch also removes the show_free_areas() wrapper around
    __show_free_areas() and converts existing callers to pass an empty filter.

    ia64 emits additional information for each node, so skip_free_areas_zone()
    must be made global to filter disallowed nodes and it is converted to use
    a nid argument rather than a zone for this use case.

    Signed-off-by: David Rientjes
    Cc: Russell King
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Kyle McMartin
    Cc: Helge Deller
    Cc: James Bottomley
    Cc: "David S. Miller"
    Cc: Guan Xuetao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

24 May, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
    b43: fix comment typo reqest -> request
    Haavard Skinnemoen has left Atmel
    cris: typo in mach-fs Makefile
    Kconfig: fix copy/paste-ism for dell-wmi-aio driver
    doc: timers-howto: fix a typo ("unsgined")
    perf: Only include annotate.h once in tools/perf/util/ui/browsers/annotate.c
    md, raid5: Fix spelling error in comment ('Ofcourse' --> 'Of course').
    treewide: fix a few typos in comments
    regulator: change debug statement be consistent with the style of the rest
    Revert "arm: mach-u300/gpio: Fix mem_region resource size miscalculations"
    audit: acquire creds selectively to reduce atomic op overhead
    rtlwifi: don't touch with treewide double semicolon removal
    treewide: cleanup continuations and remove logging message whitespace
    ath9k_hw: don't touch with treewide double semicolon removal
    include/linux/leds-regulator.h: fix syntax in example code
    tty: fix typo in descripton of tty_termios_encode_baud_rate
    xtensa: remove obsolete BKL kernel option from defconfig
    m68k: fix comment typo 'occcured'
    arch:Kconfig.locks Remove unused config option.
    treewide: remove extra semicolons
    ...

    Linus Torvalds
     

21 May, 2011

1 commit

  • Commit e66eed651fd1 ("list: remove prefetching from regular list
    iterators") removed the include of prefetch.h from list.h, which
    uncovered several cases that had apparently relied on that rather
    obscure header file dependency.

    So this fixes things up a bit, using

    grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]')
    grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]')

    to guide us in finding files that either need
    inclusion, or have it despite not needing it.

    There are more of them around (mostly network drivers), but this gets
    many core ones.

    Reported-by: Stephen Rothwell
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

12 May, 2011

2 commits

  • Add an alloc_pages_exact_nid() that allocates on a specific node (see the
    sketch at the end of this entry).

    The naming is quite broken, but fixing it would need a larger renaming
    action.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: Andi Kleen
    Cc: Michal Hocko
    Cc: Balbir Singh
    Cc: KOSAKI Motohiro
    Cc: Dave Hansen
    Cc: David Rientjes
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
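
    Assumed prototype and a hypothetical use (sizes and flags are
    illustrative):

        /* include/linux/gfp.h -- assumed prototype */
        void *alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask);

        /* hypothetical: a node-local, zeroed buffer of an exact byte size */
        void *buf = alloc_pages_exact_nid(nid, 16 * 1024,
                                          GFP_KERNEL | __GFP_ZERO);
        if (buf) {
                /* ... use buf ... */
                free_pages_exact(buf, 16 * 1024);
        }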
     
  • Stefan found that nobootmem does not work on his system, which has only
    8M of RAM. This causes an early panic:

    BIOS-provided physical RAM map:
    BIOS-88: 0000000000000000 - 000000000009f000 (usable)
    BIOS-88: 0000000000100000 - 0000000000840000 (usable)
    bootconsole [earlyser0] enabled
    Notice: NX (Execute Disable) protection missing in CPU or disabled in BIOS!
    DMI not present or invalid.
    last_pfn = 0x840 max_arch_pfn = 0x100000
    init_memory_mapping: 0000000000000000-0000000000840000
    8MB LOWMEM available.
    mapped low ram: 0 - 00840000
    low ram: 0 - 00840000
    Zone PFN ranges:
    DMA 0x00000001 -> 0x00001000
    Normal empty
    Movable zone start PFN for each node
    early_node_map[2] active PFN ranges
    0: 0x00000001 -> 0x0000009f
    0: 0x00000100 -> 0x00000840
    BUG: Int 6: CR2 (null)
    EDI c034663c ESI (null) EBP c0329f38 ESP c0329ef4
    EBX c0346380 EDX 00000006 ECX ffffffff EAX fffffff4
    err (null) EIP c0353191 CS c0320060 flg 00010082
    Stack: (null) c030c533 000007cd (null) c030c533 00000001 (null) (null)
    00000003 0000083f 00000018 00000002 00000002 c0329f6c c03534d6 (null)
    (null) 00000100 00000840 (null) c0329f64 00000001 00001000 (null)
    Pid: 0, comm: swapper Not tainted 2.6.36 #5
    Call Trace:
    [] ? 0xc02e3707
    [] 0xc035e6e5
    [] ? 0xc0353191
    [] 0xc03534d6
    [] 0xc034f1cd
    [] 0xc034a824
    [] ? 0xc03513cb
    [] 0xc0349432
    [] 0xc0349066

    It turns out that we should ignore the low limit of 16M.

    Use alloc_bootmem_node_nopanic() in this case.

    [akpm@linux-foundation.org: less mess]
    Signed-off-by: Yinghai LU
    Reported-by: Stefan Hellermann
    Tested-by: Stefan Hellermann
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: [2.6.34+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     

15 Apr, 2011

1 commit

  • The memory hotplug case involves calling to build_all_zonelists() which
    in turns calls in to setup_zone_pageset(). The latter is marked
    __meminit while build_all_zonelists() itself has no particular
    annotation. build_all_zonelists() is only handed a non-NULL pointer in
    the case of memory hotplug through an existing __meminit path, so the
    setup_zone_pageset() reference is always safe.

    The options as such are either to flag build_all_zonelists() as __ref (as
    per __build_all_zonelists()), or to simply discard the __meminit
    annotation from setup_zone_pageset().

    Signed-off-by: Paul Mundt
    Acked-by: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Mundt
     

25 Mar, 2011

1 commit

  • Commit ddd588b5dd55 ("oom: suppress nodes that are not allowed from
    meminfo on oom kill") moved lib/show_mem.o out of lib/lib.a, which
    resulted in build warnings on all architectures that implement their own
    versions of show_mem():

    lib/lib.a(show_mem.o): In function `show_mem':
    show_mem.c:(.text+0x1f4): multiple definition of `show_mem'
    arch/sparc/mm/built-in.o:(.text+0xd70): first defined here

    The fix is to remove __show_mem() and add its argument to show_mem() in
    all implementations to prevent this breakage.

    Architectures that implement their own show_mem() actually don't do
    anything with the argument yet, but they could be made to filter nodes
    that aren't allowed in the current context in the future just like the
    generic implementation.

    Reported-by: Stephen Rothwell
    Reported-by: James Bottomley
    Suggested-by: Andrew Morton
    Signed-off-by: David Rientjes
    Signed-off-by: Linus Torvalds

    David Rientjes
     

24 Mar, 2011

1 commit

  • Add checks at page allocation and freeing time for whether the page is in
    use (i.e. charged) from the memcg point of view.

    This check may be useful when debugging a problem; similar checks existed
    before commit 52d4b9ac ("memcg: allocate all page_cgroup at boot").

    This patch adds some overheads at allocating or freeing memory, so it's
    enabled only when CONFIG_DEBUG_VM is enabled.

    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     

23 Mar, 2011

7 commits

  • Signed-off-by: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Add a new __GFP_OTHER_NODE flag to tell the low level numa statistics in
    zone_statistics() that an allocation is on behalf of another thread. This
    way the local and remote counters can still be correct, even when
    background daemons like khugepaged are changing memory mappings (see the
    sketch at the end of this entry).

    This only affects the accounting, but I think it's worth doing that right
    to avoid confusing users.

    I first tried to just pass down the right node, but this required a lot of
    changes to pass down this parameter and at least one addition of a 10th
    argument to a 9 argument function. Using the flag is a lot less
    intrusive.

    Open: should be also used for migration?

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Andi Kleen
    Cc: Andrea Arcangeli
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
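
    An approximate sketch of the accounting change in zone_statistics()
    (mm/vmstat.c); field and counter names are from memory and may differ
    slightly:

        void zone_statistics(struct zone *preferred_zone, struct zone *z,
                             gfp_t flags)
        {
                int local_nid = numa_node_id();
                enum zone_stat_item local_stat = NUMA_LOCAL;

                /*
                 * The allocation is on behalf of another thread (e.g.
                 * khugepaged collapsing pages for another process), so judge
                 * "local" from the preferred zone's node rather than from
                 * the node the calling daemon happens to run on.
                 */
                if (unlikely(flags & __GFP_OTHER_NODE)) {
                        local_stat = NUMA_OTHER;
                        local_nid = preferred_zone->node;
                }

                if (z->node == local_nid) {
                        __inc_zone_state(z, NUMA_HIT);
                        __inc_zone_state(z, local_stat);
                } else {
                        __inc_zone_state(z, NUMA_MISS);
                        __inc_zone_state(preferred_zone, NUMA_FOREIGN);
                }
        }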
     
  • __GFP_NO_KSWAPD allocations are usually very expensive and not mandatory
    to succeed as they have a graceful fallback. Waiting for I/O in them
    tends to be overkill in terms of latency, so we can reduce their latency
    by disabling sync migration.

    Unfortunately, even with async migration it's still possible for the
    process to be blocked waiting for a request slot (e.g. get_request_wait
    in the block layer) when ->writepage is called. To prevent
    __GFP_NO_KSWAPD blocking, this patch prevents ->writepage being called on
    dirty page cache for asynchronous migration.

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=31142

    [mel@csn.ul.ie: Avoid writebacks for NFS, retry locked pages, use bool]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Mel Gorman
    Cc: Arthur Marsh
    Cc: Clemens Ladisch
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Reported-by: Alex Villacis Lasso
    Tested-by: Alex Villacis Lasso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • free_pcppages_bulk() frees pages from pcp lists in a round-robin fashion
    by keeping a batch_free counter. But it doesn't need to keep cycling if
    there is only one non-empty list, which can be detected by
    batch_free == MIGRATE_PCPTYPES (see the sketch at the end of this entry).

    [akpm@linux-foundation.org: fix comment]
    Signed-off-by: Namhyung Kim
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
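
    A sketch of the relevant part of the free_pcppages_bulk() loop
    (approximate):

        /* pick the next non-empty pcp list, counting how many we skip */
        do {
                batch_free++;
                if (++migratetype == MIGRATE_PCPTYPES)
                        migratetype = 0;
                list = &pcp->lists[migratetype];
        } while (list_empty(list));

        /* This is the only non-empty list. Free them all. */
        if (batch_free == MIGRATE_PCPTYPES)
                batch_free = to_free;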
     
  • Displaying extremely verbose meminfo for all nodes on the system is
    overkill for page allocation failures when the context restricts that
    allocation to only a subset of nodes. We don't particularly care about
    the state of all nodes when some are not allowed in the current context;
    they may have an abundance of memory, but we can't allocate from it.

    This patch suppresses disallowed nodes from the meminfo dump on a page
    allocation failure if the context requires it.

    Signed-off-by: David Rientjes
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • When a page allocation failure occurs, show_mem() is called to dump the
    state of the VM so users may understand what happened to get into that
    condition.

    This output, however, can be extremely verbose. In irq context, it may
    result in significant delays that incur NMI watchdog timeouts when the
    machine is large (we use CONFIG_NODES_SHIFT > 8 here to define a "large"
    machine since the length of the show_mem() output is proportional to the
    number of possible nodes).

    This patch suppresses the show_mem() call in irq context when the kernel
    has CONFIG_NODES_SHIFT > 8.

    Signed-off-by: David Rientjes
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The oom killer is extremely verbose for machines with a large number of
    cpus and/or nodes. This verbosity can often be harmful if it causes other
    important messages to be scrolled out of the kernel log and incurs a
    significant time delay, specifically for kernels with
    CONFIG_NODES_SHIFT > 8.

    This patch causes only memory information to be displayed for nodes that
    are allowed by current's cpuset when dumping the VM state. Information
    for all other nodes is irrelevant to the oom condition; we don't care if
    there's an abundance of memory elsewhere if we can't access it.

    This only affects the behavior of dumping memory information when an oom
    is triggered. Other dumps, such as for sysrq+m, still display the
    unfiltered form when using the existing show_mem() interface.

    Additionally, the per-cpu pageset statistics are extremely verbose in oom
    killer output, so it is now suppressed. This removes

    nodes_weight(current->mems_allowed) * (1 + nr_cpus)

    lines from the oom killer output.

    Callers may use __show_mem(SHOW_MEM_FILTER_NODES) to filter disallowed
    nodes.

    Signed-off-by: David Rientjes
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

18 Mar, 2011

1 commit

  • Change the _mapcount value indicating PageBuddy from -2 to -128 for more
    robustness against page_mapcount() underflows (see the sketch at the end
    of this entry).

    Use reset_page_mapcount() instead of __ClearPageBuddy() in bad_page() to
    ignore the previous return value of PageBuddy().

    Signed-off-by: Andrea Arcangeli
    Reported-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
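
    An approximate sketch of the include/linux/mm.h helpers after this
    change; -128 leaves a wide gap below the default -1, so a small mapcount
    underflow no longer makes an ordinary page look like a buddy page:

        #define PAGE_BUDDY_MAPCOUNT_VALUE (-128)

        static inline int PageBuddy(struct page *page)
        {
                return atomic_read(&page->_mapcount) == PAGE_BUDDY_MAPCOUNT_VALUE;
        }

        static inline void __SetPageBuddy(struct page *page)
        {
                VM_BUG_ON(atomic_read(&page->_mapcount) != -1);
                atomic_set(&page->_mapcount, PAGE_BUDDY_MAPCOUNT_VALUE);
        }

        static inline void __ClearPageBuddy(struct page *page)
        {
                VM_BUG_ON(!PageBuddy(page));
                atomic_set(&page->_mapcount, -1);
        }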
     

26 Feb, 2011

2 commits

  • Heiko found that a recent memblock change triggers these warnings on s390:

    mm/page_alloc.c:3623:22: warning: 'last_active_region_index_in_nid' defined but not used
    mm/page_alloc.c:3638:22: warning: 'previous_active_region_index_in_nid' defined but not used

    Those two functions need to be moved under HAVE_MEMBLOCK with their only
    user, find_memory_core_early().

    -tj: Minor updates to description.

    Reported-by: Heiko Carstens
    Signed-off-by: Yinghai Lu
    Signed-off-by: Tejun Heo

    Yinghai Lu
     
  • When pfn_valid_within() failed, 'iter' was incremented twice, skipping a
    pfn (see the sketch at the end of this entry).

    Signed-off-by: Namhyung Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
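
    A sketch of the bug; the loop is believed to be the pageblock walk in
    __count_immobile_pages(), but treat the exact location and variable names
    as approximate:

        for (found = 0, iter = 0; iter < pageblock_nr_pages; iter++) {
                unsigned long check = pfn + iter;
                struct page *page;

                if (!pfn_valid_within(check)) {
                        iter++;   /* BUG: combined with the loop's own iter++
                                     this skipped the pfn following every
                                     hole; the fix removes this line */
                        continue;
                }
                page = pfn_to_page(check);
                /* ... immobility checks on 'page' ... */
        }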