13 Jan, 2012

4 commits

  • Mostly we use "enum lru_list lru": change those few "l"s to "lru"s.
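
    That is, parameter renames of the following shape (an illustrative
    example, not a line from the actual diff):

        -static void add_page_to_lru_list(struct page *page, enum lru_list l)
        +static void add_page_to_lru_list(struct page *page, enum lru_list lru)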

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • If compaction is deferred, direct reclaim is used to try to free enough
    pages for the allocation to succeed. For small high-orders, this has a
    reasonable chance of success. However, if the caller has specified
    __GFP_NO_KSWAPD to limit the disruption to the system, it makes more sense
    to fail the allocation rather than stall the caller in direct reclaim.
    This patch skips direct reclaim if compaction is deferred and the caller
    specifies __GFP_NO_KSWAPD.

    Async compaction only considers a subset of pages so it is possible for
    compaction to be deferred prematurely and not enter direct reclaim even in
    cases where it should. To compensate for this, this patch also defers
    compaction only if sync compaction failed.
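
    A minimal sketch of the bail-out described above, assuming the
    __alloc_pages_slowpath() structure of this era (not the literal diff):

        /* The caller asked us not to wake kswapd or otherwise disturb
         * the system; if compaction was deferred, fail the allocation
         * instead of stalling in direct reclaim. */
        if (deferred_compaction && (gfp_mask & __GFP_NO_KSWAPD))
            goto nopage;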

    Signed-off-by: Mel Gorman
    Acked-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • If a zone below ZONE_NORMAL has present_pages, we can set the node
    state to N_NORMAL_MEMORY right away; there is no need to loop to the end.
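
    A sketch of the early exit, modeled on check_for_regular_memory() in
    mm/page_alloc.c of this era (treat details as illustrative):

        enum zone_type zone_type;

        for (zone_type = 0; zone_type <= ZONE_NORMAL; zone_type++) {
            struct zone *zone = &pgdat->node_zones[zone_type];
            if (zone->present_pages) {
                node_set_state(zone_to_nid(zone), N_NORMAL_MEMORY);
                break;    /* one populated zone <= ZONE_NORMAL suffices */
            }
        }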

    Signed-off-by: Bob Liu
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
     
  • Having a unified structure with an LRU list set for both global zones and
    per-memcg zones allows keeping the code that deals with LRU lists simple,
    as it does not need to care about the container itself.

    Once the per-memcg LRU lists directly link struct pages, the isolation
    function and all other list manipulations are shared between the memcg
    case and the global LRU case.
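
    The unified structure is essentially (a sketch of the lruvec
    introduced by this series):

        struct lruvec {
            struct list_head lists[NR_LRU_LISTS];
        };

    Both struct zone and the per-memcg per-zone structure embed one, so
    list manipulation helpers can take a struct lruvec * without knowing
    which container it belongs to.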

    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

11 Jan, 2012

10 commits

  • __free_pages_bootmem() used to special-case higher-order frees to save
    individual page checking with free_pages_bulk().

    Nowadays, both zero order and non-zero order frees use free_pages(), which
    checks each individual page anyway, and so there is little point in making
    the distinction anymore. The higher-order loop will work just fine for
    zero order pages.

    Signed-off-by: Johannes Weiner
    Cc: Uwe Kleine-König
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit 88f5acf88ae6 ("mm: page allocator: adjust the per-cpu counter
    threshold when memory is low") changed how free_pages is calculated but
    forgot that we used to do free_pages - ((1 << order) - 1), so we ended
    up with an off-by-two when calculating free_pages.
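
    The fix, roughly (a sketch against __zone_watermark_ok(); treat the
    exact surrounding code as an assumption):

        /* was: free_pages -= (1 << order) + 1;  -- off by two */
        free_pages -= (1 << order) - 1;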

    Reported-by: Wang Sheng-Hui
    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The maximum number of dirty pages that exist in the system at any time is
    determined by a number of pages considered dirtyable and a user-configured
    percentage of those, or an absolute number in bytes.

    This number of dirtyable pages is the sum of memory provided by all the
    zones in the system minus their lowmem reserves and high watermarks, so
    that the system can retain a healthy number of free pages without having
    to reclaim dirty pages.

    But there is a flaw in that we have a zoned page allocator which does not
    care about the global state but rather the state of individual memory
    zones. And right now there is nothing that prevents one zone from filling
    up with dirty pages while other zones are spared, which frequently leads
    to situations where kswapd, in order to restore the watermark of free
    pages, does indeed have to write pages from that zone's LRU list. This
    can interfere so badly with IO from the flusher threads that major
    filesystems (btrfs, xfs, ext4) mostly ignore write requests from reclaim
    already, taking away the VM's only possibility to keep such a zone
    balanced, aside from hoping the flushers will soon clean pages from that
    zone.

    Enter per-zone dirty limits. They are to a zone's dirtyable memory what
    the global limit is to the global amount of dirtyable memory, and try to
    make sure that no single zone receives more than its fair share of the
    globally allowed dirty pages in the first place. As the number of pages
    considered dirtyable excludes the zones' lowmem reserves and high
    watermarks, the maximum number of dirty pages in a zone is such that the
    zone can always be balanced without requiring page cleaning.

    As this is a placement decision in the page allocator and pages are
    dirtied only after the allocation, this patch allows allocators to pass
    __GFP_WRITE when they know in advance that the page will be written to and
    become dirty soon. The page allocator will then attempt to allocate from
    the first zone of the zonelist - which on NUMA is determined by the task's
    NUMA memory policy - that has not exceeded its dirty limit.
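
    A sketch of the allocator-side check, assuming helpers shaped like the
    ones this patch introduces (a zone_dirty_ok() consulting the per-zone
    limit; an illustration, not the literal diff):

        /* in get_page_from_freelist(), while iterating the zonelist */
        if ((gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
            goto this_zone_full;    /* try the next zone instead */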

    At first glance, it would appear that the diversion to lower zones can
    increase pressure on them, but this is not the case. With a full high
    zone, allocations will be diverted to lower zones eventually, so it is
    more of a shift in timing of the lower zone allocations. Workloads that
    previously could fit their dirty pages completely in the higher zone may
    be forced to allocate from lower zones, but the number of pages that
    "spill over" is itself limited by the lower zones' dirty constraints,
    and thus unlikely to become a problem.

    For now, the problem of unfair dirty page distribution remains for NUMA
    configurations where the zones allowed for allocation are in sum not big
    enough to trigger the global dirty limits, wake up the flusher threads and
    remedy the situation. Because of this, an allocation that could not
    succeed on any of the considered zones is allowed to ignore the dirty
    limits before going into direct reclaim or even failing the allocation,
    until a future patch changes the global dirty throttling and flusher
    thread activation so that they take individual zone states into account.

    Test results:

    15M DMA + 3246M DMA32 + 504M Normal = 3765M memory
    40% dirty ratio
    16G USB thumb drive
    10 runs of dd if=/dev/zero of=disk/zeroes bs=32k count=$((10 << 15))

                  seconds (stddev)    nr_vmscan_write: min |      median |         max
    xfs
    vanilla:      549.747 ( 3.492)                   0.000 |       0.000 |       0.000
    patched:      550.996 ( 3.802)                   0.000 |       0.000 |       0.000

    fuse-ntfs
    vanilla:     1183.094 (53.178)               54349.000 |   59341.000 |   65163.000
    patched:      558.049 (17.914)                   0.000 |       0.000 |      43.000

    btrfs
    vanilla:      573.679 (14.015)              156657.000 |  460178.000 |  606926.000
    patched:      563.365 (11.368)                   0.000 |       0.000 |    1362.000

    ext4
    vanilla:      561.197 (15.782)                   0.000 | 2725438.000 | 4143837.000
    patched:      568.806 (17.496)                   0.000 |       0.000 |       0.000

    Signed-off-by: Johannes Weiner
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Tested-by: Wu Fengguang
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Rik van Riel
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Per-zone dirty limits try to distribute page cache pages allocated for
    writing across zones in proportion to the individual zone sizes, to reduce
    the likelihood of reclaim having to write back individual pages from the
    LRU lists in order to make progress.

    This patch:

    The amount of dirtyable pages should not include the full number of free
    pages: there is a number of reserved pages that the page allocator and
    kswapd always try to keep free.

    The closer (reclaimable pages - dirty pages) is to the number of reserved
    pages, the more likely it becomes for reclaim to run into dirty pages:

    +----------+ ---
    |   anon   |  |
    +----------+  |
    |          |  |
    |          |  -- dirty limit new    -- flusher new
    |   file   |  |                      |
    |          |  |                      |
    |          |  -- dirty limit old    -- flusher old
    |          |  |
    +----------+ --- reclaim
    | reserved |
    +----------+
    |  kernel  |
    +----------+

    This patch introduces a per-zone dirty reserve that takes both the lowmem
    reserve as well as the high watermark of the zone into account, and a
    global sum of those per-zone values that is subtracted from the global
    amount of dirtyable pages. The lowmem reserve is unavailable to page
    cache allocations and kswapd tries to keep the high watermark free. We
    don't want to end up in a situation where reclaim has to clean pages in
    order to balance zones.
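
    A sketch of the bookkeeping described above; the helper shown for the
    per-zone lowmem reserve maximum is hypothetical (the real code may
    simply inline that loop):

        unsigned long dirty_balance_reserve;    /* global sum */

        /* recomputed wherever the watermarks/lowmem reserves change */
        dirty_balance_reserve = 0;
        for_each_zone(zone) {
            unsigned long reserve = max_lowmem_reserve(zone) +  /* hypothetical helper */
                                    high_wmark_pages(zone);
            zone->dirty_balance_reserve = reserve;
            dirty_balance_reserve += reserve;
        }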

    Not treating reserved pages as dirtyable on a global level is only a
    conceptual fix. In reality, dirty pages are not distributed equally
    across zones and reclaim runs into dirty pages on a regular basis.

    But it is important to get this right before tackling the problem on a
    per-zone level, where the distance between reclaim and the dirty pages is
    mostly much smaller in absolute numbers.

    [akpm@linux-foundation.org: fix highmem build]
    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Michal Hocko
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • With CONFIG_DEBUG_PAGEALLOC configured, the CPU will generate an exception
    on access (read,write) to an unallocated page, which permits us to catch
    code which corrupts memory. However the kernel is trying to maximise
    memory usage, hence there are usually few free pages in the system and
    buggy code usually corrupts some crucial data.

    This patch changes the buddy allocator to keep more free/protected pages
    and to interlace free/protected and allocated pages to increase the
    probability of catching corruption.

    When the kernel is compiled with CONFIG_DEBUG_PAGEALLOC,
    debug_guardpage_minorder defines the minimum order used by the page
    allocator to grant a request. The requested size will be returned with
    the remaining pages used as guard pages.

    The default value of debug_guardpage_minorder is zero: no change from
    current behaviour.
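
    For example, booting with the new parameter (the parameter name comes
    from this patch; the rest of the command line is up to you):

        debug_guardpage_minorder=1

    makes the allocator satisfy an order-0 request from an order-1 block,
    returning one page and keeping the other as a guard page.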

    [akpm@linux-foundation.org: tweak documentation, s/flg/flag/]
    Signed-off-by: Stanislaw Gruszka
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: "Rafael J. Wysocki"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stanislaw Gruszka
     
  • Colin Cross reported:

    Under the following conditions, __alloc_pages_slowpath can loop forever:
    gfp_mask & __GFP_WAIT is true
    gfp_mask & __GFP_FS is false
    reclaim and compaction make no progress
    order <= PAGE_ALLOC_COSTLY_ORDER

    Signed-off-by: Mel Gorman
    Acked-by: David Rientjes
    Cc: Minchan Kim
    Cc: Pekka Enberg
    Cc: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When min_free_kbytes is updated, some pageblocks are marked
    MIGRATE_RESERVE. Ordinarily, this work is unnoticeable as it happens
    early in boot but on large machines with 1TB of memory, this has been
    reported to delay boot times, probably due to the NUMA distances
    involved.

    The bulk of the work is due to calling pageblock_is_reserved() an
    unnecessary number of times and accessing far more struct page metadata
    than is necessary. This patch significantly reduces the amount of work
    done by setup_zone_migrate_reserve(), improving boot times on 1TB
    machines.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Rename mm_page_free_direct to mm_page_free and mm_pagevec_free to
    mm_page_free_batched.

    Since v2.6.33-5426-gc475dab the kernel triggers mm_page_free_direct for
    all freed pages, not only for directly freed ones. So, let's name it
    properly. For pages freed via page-list we also trigger the
    mm_page_free_batched event.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • It is not exported and nobody uses it now.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • This patch adds the helper free_hot_cold_page_list() to free a list of
    0-order pages. It frees pages directly from the list without a
    temporary page-vector. It also calls trace_mm_pagevec_free() to
    simulate pagevec_free() behaviour.
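
    A sketch of the helper, assuming the free_hot_cold_page() and tracing
    interfaces of this era (illustrative, not the literal patch):

        void free_hot_cold_page_list(struct list_head *list, int cold)
        {
            struct page *page, *next;

            list_for_each_entry_safe(page, next, list, lru) {
                trace_mm_pagevec_free(page, cold);
                free_hot_cold_page(page, cold);
            }
        }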

    bloat-o-meter:

    add/remove: 1/1 grow/shrink: 1/3 up/down: 267/-295 (-28)
    function                  old    new  delta
    free_hot_cold_page_list     -    264   +264
    get_page_from_freelist   2129   2132     +3
    __pagevec_free            243    239     -4
    split_free_page           380    373     -7
    release_pages             606    510    -96
    free_page_list            188      -   -188

    Signed-off-by: Konstantin Khlebnikov
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Acked-by: Minchan Kim
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

09 Jan, 2012

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (53 commits)
    Kconfig: acpi: Fix typo in comment.
    misc latin1 to utf8 conversions
    devres: Fix a typo in devm_kfree comment
    btrfs: free-space-cache.c: remove extra semicolon.
    fat: Spelling s/obsolate/obsolete/g
    SCSI, pmcraid: Fix spelling error in a pmcraid_err() call
    tools/power turbostat: update fields in manpage
    mac80211: drop spelling fix
    types.h: fix comment spelling for 'architectures'
    typo fixes: aera -> area, exntension -> extension
    devices.txt: Fix typo of 'VMware'.
    sis900: Fix enum typo 'sis900_rx_bufer_status'
    decompress_bunzip2: remove invalid vi modeline
    treewide: Fix comment and string typo 'bufer'
    hyper-v: Update MAINTAINERS
    treewide: Fix typos in various parts of the kernel, and fix some comments.
    clockevents: drop unknown Kconfig symbol GENERIC_CLOCKEVENTS_MIGR
    gpio: Kconfig: drop unknown symbol 'CS5535_GPIO'
    leds: Kconfig: Fix typo 'D2NET_V2'
    sound: Kconfig: drop unknown symbol ARCH_CLPS7500
    ...

    Fix up trivial conflicts in arch/powerpc/platforms/40x/Kconfig (some new
    kconfig additions, close to removed commented-out old ones)

    Linus Torvalds
     
  • * 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (165 commits)
    reiserfs: Properly display mount options in /proc/mounts
    vfs: prevent remount read-only if pending removes
    vfs: count unlinked inodes
    vfs: protect remounting superblock read-only
    vfs: keep list of mounts for each superblock
    vfs: switch ->show_options() to struct dentry *
    vfs: switch ->show_path() to struct dentry *
    vfs: switch ->show_devname() to struct dentry *
    vfs: switch ->show_stats to struct dentry *
    switch security_path_chmod() to struct path *
    vfs: prefer ->dentry->d_sb to ->mnt->mnt_sb
    vfs: trim includes a bit
    switch mnt_namespace ->root to struct mount
    vfs: take /proc/*/mounts and friends to fs/proc_namespace.c
    vfs: opencode mntget() mnt_set_mountpoint()
    vfs: spread struct mount - remaining argument of next_mnt()
    vfs: move fsnotify junk to struct mount
    vfs: move mnt_devname
    vfs: move mnt_list to struct mount
    vfs: switch pnode.h macros to struct mount *
    ...

    Linus Torvalds
     

04 Jan, 2012

1 commit


20 Dec, 2011

1 commit


09 Dec, 2011

3 commits

  • setup_zone_migrate_reserve() expects that zone->start_pfn starts at a
    pageblock_nr_pages-aligned pfn, otherwise we could access beyond an
    existing memblock resulting in the following panic if
    CONFIG_HOLES_IN_ZONE is not configured and we do not check pfn_valid:

    IP: [] setup_zone_migrate_reserve+0xcd/0x180
    *pdpt = 0000000000000000 *pde = f000ff53f000ff53
    Oops: 0000 [#1] SMP
    Pid: 1, comm: swapper Not tainted 3.0.7-0.7-pae #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
    EIP: 0060:[] EFLAGS: 00010006 CPU: 0
    EIP is at setup_zone_migrate_reserve+0xcd/0x180
    EAX: 000c0000 EBX: f5801fc0 ECX: 000c0000 EDX: 00000000
    ESI: 000c01fe EDI: 000c01fe EBP: 00140000 ESP: f2475f58
    DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
    Process swapper (pid: 1, ti=f2474000 task=f2472cd0 task.ti=f2474000)
    Call Trace:
    [] __setup_per_zone_wmarks+0xec/0x160
    [] setup_per_zone_wmarks+0xf/0x20
    [] init_per_zone_wmark_min+0x27/0x86
    [] do_one_initcall+0x2b/0x160
    [] kernel_init+0xbe/0x157
    [] kernel_thread_helper+0x6/0xd
    Code: a5 39 f5 89 f7 0f 46 fd 39 cf 76 40 8b 03 f6 c4 08 74 32 eb 91 90 89 c8 c1 e8 0e 0f be 80 80 2f 86 c0 8b 14 85 60 2f 86 c0 89 c8 82 b4 12 00 00 c1 e0 05 03 82 ac 12 00 00 8b 00 f6 c4 08 0f
    EIP: [] setup_zone_migrate_reserve+0xcd/0x180 SS:ESP 0068:f2475f58
    CR2: 00000000000012b4

    We crashed in pageblock_is_reserved() when accessing pfn 0xc0000 because
    highstart_pfn = 0x36ffe.

    The issue was introduced in 3.0-rc1 by 6d3163ce ("mm: check if any page
    in a pageblock is reserved before marking it MIGRATE_RESERVE").

    Make sure that start_pfn is always aligned to pageblock_nr_pages to
    ensure that pfn_valid is always called at the start of each pageblock.
    Architectures with holes in pageblocks will be correctly handled by
    pfn_valid_within in pageblock_is_reserved.
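
    A sketch of the alignment fix (names follow setup_zone_migrate_reserve()
    of this era; treat the exact code as illustrative):

        start_pfn = zone->zone_start_pfn;
        end_pfn = start_pfn + zone->spanned_pages;
        /* only ever inspect pageblocks whose first pfn can be validated */
        start_pfn = roundup(start_pfn, pageblock_nr_pages);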

    Signed-off-by: Michal Hocko
    Signed-off-by: Mel Gorman
    Tested-by: Dang Bo
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Arve Hjønnevåg
    Cc: KOSAKI Motohiro
    Cc: John Stultz
    Cc: Dave Hansen
    Cc: [3.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Commit 70b50f94f1644 ("mm: thp: tail page refcounting fix") keeps all
    page_tail->_count zero at all times. But the current kernel does not
    set page_tail->_count to zero if a 1GB page is utilized. So when an
    IOMMU 1GB page is used by KVM, it will result in a kernel oops because a
    tail page's _count does not equal zero.

    kernel BUG at include/linux/mm.h:386!
    invalid opcode: 0000 [#1] SMP
    Call Trace:
    gup_pud_range+0xb8/0x19d
    get_user_pages_fast+0xcb/0x192
    ? trace_hardirqs_off+0xd/0xf
    hva_to_pfn+0x119/0x2f2
    gfn_to_pfn_memslot+0x2c/0x2e
    kvm_iommu_map_pages+0xfd/0x1c1
    kvm_iommu_map_memslots+0x7c/0xbd
    kvm_iommu_map_guest+0xaa/0xbf
    kvm_vm_ioctl_assigned_device+0x2ef/0xa47
    kvm_vm_ioctl+0x36c/0x3a2
    do_vfs_ioctl+0x49e/0x4e4
    sys_ioctl+0x5a/0x7c
    system_call_fastpath+0x16/0x1b
    RIP gup_huge_pud+0xf2/0x159

    Signed-off-by: Youquan Song
    Reviewed-by: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Youquan Song
     
  • Now all ARCH_POPULATES_NODE_MAP archs select HAVE_MEMBLOCK_NODE_MAP -
    there's no user of early_node_map[] left. Kill early_node_map[] and
    replace ARCH_POPULATES_NODE_MAP with HAVE_MEMBLOCK_NODE_MAP. Also,
    relocate for_each_mem_pfn_range() and helper from mm.h to memblock.h
    as page_alloc.c would no longer host an alternative implementation.

    This change is ultimately a one-to-one mapping and shouldn't cause any
    observable difference; however, after the recent changes, there are
    some functions which now would fit memblock.c better than page_alloc.c
    and dependency on HAVE_MEMBLOCK_NODE_MAP instead of HAVE_MEMBLOCK
    doesn't make much sense on some of them. Further cleanups for
    functions inside HAVE_MEMBLOCK_NODE_MAP in mm.h would be nice.

    -v2: Fix compile bug introduced by mis-spelling
    CONFIG_HAVE_MEMBLOCK_NODE_MAP to CONFIG_MEMBLOCK_HAVE_NODE_MAP in
    mmzone.h. Reported by Stephen Rothwell.

    Signed-off-by: Tejun Heo
    Cc: Stephen Rothwell
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu
    Cc: Tony Luck
    Cc: Ralf Baechle
    Cc: Martin Schwidefsky
    Cc: Chen Liqin
    Cc: Paul Mundt
    Cc: "David S. Miller"
    Cc: "H. Peter Anvin"

    Tejun Heo
     

29 Nov, 2011

1 commit

  • Conflicts & resolutions:

    * arch/x86/xen/setup.c

    dc91c728fd "xen: allow extra memory to be in multiple regions"
    24aa07882b "memblock, x86: Replace memblock_x86_reserve/free..."

    conflicted on xen_add_extra_mem() updates. The resolution is
    trivial as the latter just wants to replace
    memblock_x86_reserve_range() with memblock_reserve().

    * drivers/pci/intel-iommu.c

    166e9278a3f "x86/ia64: intel-iommu: move to drivers/iommu/"
    5dfe8660a3d "bootmem: Replace work_with_active_regions() with..."

    conflicted as the former moved the file under drivers/iommu/.
    Resolved by applying the changes from the latter on the moved
    file.

    * mm/Kconfig

    6661672053a "memblock: add NO_BOOTMEM config symbol"
    c378ddd53f9 "memblock, x86: Make ARCH_DISCARD_MEMBLOCK a config option"

    conflicted trivially. Both added config options. Just
    letting both add their own options resolves the conflict.

    * mm/memblock.c

    d1f0ece6cdc "mm/memblock.c: small function definition fixes"
    ed7b56a799c "memblock: Remove memblock_memory_can_coalesce()"

    conflicted. The former updates a function removed by the
    latter. Resolution is trivial.

    Signed-off-by: Tejun Heo

    Tejun Heo
     

17 Nov, 2011

1 commit


01 Nov, 2011

2 commits

  • Add __attribute__((format(printf, ...))) to the function to validate
    format and arguments. Use the vsprintf extension %pV to avoid any
    possible message interleaving. Coalesce the format string. Convert
    printks/pr_warning to pr_warn.
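
    With the __printf() macro the annotation looks like this (a sketch
    against warn_alloc_failed(), the helper this patch touches; the
    argument positions are inferred from its signature):

        __printf(3, 4)
        void warn_alloc_failed(gfp_t gfp_mask, int order,
                               const char *fmt, ...);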

    [akpm@linux-foundation.org: use the __printf() macro]
    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • When we get a bad_page bug report, it's useful to see what modules the
    user had loaded.

    Signed-off-by: Dave Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jones
     

04 Aug, 2011

1 commit

  • init_fault_attr_dentries() is used to export fault_attr via debugfs.
    But it can only export it in the debugfs root directory.

    Per Forlin is working on mmc_fail_request which adds support to inject
    data errors after a completed host transfer in MMC subsystem.

    The fault_attr for mmc_fail_request should be defined per mmc host and
    export it in debugfs directory per mmc host like
    /sys/kernel/debug/mmc0/mmc_fail_request.

    init_fault_attr_dentries() doesn't help for mmc_fail_request. So this
    introduces fault_create_debugfs_attr(), which is able to create the
    directory under an arbitrary parent directory, and replaces
    init_fault_attr_dentries().
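
    A usage sketch of the new helper (signature per lib/fault-inject of
    this era; the mmc-side names are illustrative):

        struct dentry *dir;

        dir = fault_create_debugfs_attr("mmc_fail_request",
                                        host_debugfs_dir, &fail_attr);
        if (IS_ERR(dir))
            return PTR_ERR(dir);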

    [akpm@linux-foundation.org: extraneous semicolon, per Randy]
    Signed-off-by: Akinobu Mita
    Tested-by: Per Forlin
    Cc: Jens Axboe
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Matt Mackall
    Cc: Randy Dunlap
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     

27 Jul, 2011

2 commits


26 Jul, 2011

2 commits

  • With zone_reclaim_mode enabled, it's possible for zones to be considered
    full in the zonelist_cache so they are skipped in the future. If the
    process enters direct reclaim, the ZLC may still consider zones to be full
    even after reclaiming pages. Reconsider all zones for allocation if
    direct reclaim returns successfully.
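
    A sketch of the change, assuming the zonelist cache helpers in
    mm/page_alloc.c of this era (illustrative):

        /* direct reclaim succeeded: stale "zone full" marks in the
         * zonelist cache may now be wrong, so drop them */
        if (NUMA_BUILD)
            zlc_clear_zones_full(zonelist);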

    Signed-off-by: Mel Gorman
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • There have been a small number of complaints about significant stalls
    while copying large amounts of data on NUMA machines reported on a
    distribution bugzilla. In these cases, zone_reclaim was enabled by
    default due to large NUMA distances. In general, the complaints have not
    been about the workload itself unless it was a file server (in which case
    the recommendation was to disable zone_reclaim).

    The stalls are mostly due to significant amounts of time spent scanning
    the preferred zone for pages to free. After a failure, it might fallback
    to another node (as zonelists are often node-ordered rather than
    zone-ordered) but stall quickly again when the next allocation attempt
    occurs. In bad cases, each page allocated results in a full scan of the
    preferred zone.

    Patch 1 checks the preferred zone for recent allocation failure
    which is particularly important if zone_reclaim has failed
    recently. This avoids rescanning the zone in the near future and
    instead falling back to another node. This may hurt node locality
    in some cases but a failure to zone_reclaim is more expensive than
    a remote access.

    Patch 2 clears the zlc information after direct reclaim.
    Otherwise, zone_reclaim can mark zones full, direct reclaim can
    reclaim enough pages but the zone is still not considered for
    allocation.

    This was tested on a 24-thread 2-node x86_64 machine. The tests were
    focused on large amounts of IO. All tests were bound to the CPUs on
    node-0 to avoid disturbances due to processes being scheduled on different
    nodes. The kernels tested are

    3.0-rc6-vanilla Vanilla 3.0-rc6
    zlcfirst Patch 1 applied
    zlcreconsider Patches 1+2 applied

    FS-Mark
    ./fs_mark -d /tmp/fsmark-10813 -D 100 -N 5000 -n 208 -L 35 -t 24 -S0 -s 524288
    fsmark-3.0-rc6 3.0-rc6 3.0-rc6
    vanilla zlcfirst zlcreconsider
    Files/s min 54.90 ( 0.00%) 49.80 (-10.24%) 49.10 (-11.81%)
    Files/s mean 100.11 ( 0.00%) 135.17 (25.94%) 146.93 (31.87%)
    Files/s stddev 57.51 ( 0.00%) 138.97 (58.62%) 158.69 (63.76%)
    Files/s max 361.10 ( 0.00%) 834.40 (56.72%) 802.40 (55.00%)
    Overhead min 76704.00 ( 0.00%) 76501.00 ( 0.27%) 77784.00 (-1.39%)
    Overhead mean 1485356.51 ( 0.00%) 1035797.83 (43.40%) 1594680.26 (-6.86%)
    Overhead stddev 1848122.53 ( 0.00%) 881489.88 (109.66%) 1772354.90 ( 4.27%)
    Overhead max 7989060.00 ( 0.00%) 3369118.00 (137.13%) 10135324.00 (-21.18%)
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds) 501.49 493.91 499.93
    Total Elapsed Time (seconds) 2451.57 2257.48 2215.92

    MMTests Statistics: vmstat
    Page Ins 46268 63840 66008
    Page Outs 90821596 90671128 88043732
    Swap Ins 0 0 0
    Swap Outs 0 0 0
    Direct pages scanned 13091697 8966863 8971790
    Kswapd pages scanned 0 1830011 1831116
    Kswapd pages reclaimed 0 1829068 1829930
    Direct pages reclaimed 13037777 8956828 8648314
    Kswapd efficiency 100% 99% 99%
    Kswapd velocity 0.000 810.643 826.346
    Direct efficiency 99% 99% 96%
    Direct velocity 5340.128 3972.068 4048.788
    Percentage direct scans 100% 83% 83%
    Page writes by reclaim 0 3 0
    Slabs scanned 796672 720640 720256
    Direct inode steals 7422667 7160012 7088638
    Kswapd inode steals 0 1736840 2021238

    Test completes far faster with a large increase in the number of files
    created per second. Standard deviation is high as a small number of
    iterations were much higher than the mean. The number of pages scanned by
    zone_reclaim is reduced and kswapd is used for more work.

    LARGE DD
    3.0-rc6 3.0-rc6 3.0-rc6
    vanilla zlcfirst zlcreconsider
    download tar 59 ( 0.00%) 59 ( 0.00%) 55 ( 7.27%)
    dd source files 527 ( 0.00%) 296 (78.04%) 320 (64.69%)
    delete source 36 ( 0.00%) 19 (89.47%) 20 (80.00%)
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds) 125.03 118.98 122.01
    Total Elapsed Time (seconds) 624.56 375.02 398.06

    MMTests Statistics: vmstat
    Page Ins 3594216 439368 407032
    Page Outs 23380832 23380488 23377444
    Swap Ins 0 0 0
    Swap Outs 0 436 287
    Direct pages scanned 17482342 69315973 82864918
    Kswapd pages scanned 0 519123 575425
    Kswapd pages reclaimed 0 466501 522487
    Direct pages reclaimed 5858054 2732949 2712547
    Kswapd efficiency 100% 89% 90%
    Kswapd velocity 0.000 1384.254 1445.574
    Direct efficiency 33% 3% 3%
    Direct velocity 27991.453 184832.737 208171.929
    Percentage direct scans 100% 99% 99%
    Page writes by reclaim 0 5082 13917
    Slabs scanned 17280 29952 35328
    Direct inode steals 115257 1431122 332201
    Kswapd inode steals 0 0 979532

    This test downloads a large tarfile and copies it with dd a number of
    times - similar to the most recent bug report I've dealt with. Time to
    completion is reduced. The number of pages scanned directly is still
    disturbingly high with a low efficiency but this is likely due to the
    number of dirty pages encountered. The figures could probably be improved
    with more work around how kswapd is used and how dirty pages are handled
    but that is separate work and this result is significant on its own.

    Streaming Mapped Writer
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds) 124.47 111.67 112.64
    Total Elapsed Time (seconds) 2138.14 1816.30 1867.56

    MMTests Statistics: vmstat
    Page Ins 90760 89124 89516
    Page Outs 121028340 120199524 120736696
    Swap Ins 0 86 55
    Swap Outs 0 0 0
    Direct pages scanned 114989363 96461439 96330619
    Kswapd pages scanned 56430948 56965763 57075875
    Kswapd pages reclaimed 27743219 27752044 27766606
    Direct pages reclaimed 49777 46884 36655
    Kswapd efficiency 49% 48% 48%
    Kswapd velocity 26392.541 31363.631 30561.736
    Direct efficiency 0% 0% 0%
    Direct velocity 53780.091 53108.759 51581.004
    Percentage direct scans 67% 62% 62%
    Page writes by reclaim 385 122 1513
    Slabs scanned 43008 39040 42112
    Direct inode steals 0 10 8
    Kswapd inode steals 733 534 477

    This test just creates a large file mapping and writes to it linearly.
    Time to completion is again reduced.

    The gains are mostly down to two things. In many cases, there is less
    scanning as zone_reclaim simply gives up faster due to recent failures.
    The second reason is that memory is used more efficiently. Instead of
    scanning the preferred zone every time, the allocator falls back to
    another zone and uses it instead improving overall memory utilisation.

    This patch: initialise ZLC for first zone eligible for zone_reclaim.

    The zonelist cache (ZLC) is used among other things to record if
    zone_reclaim() failed for a particular zone recently. The intention is to
    avoid a high cost scanning extremely long zonelists or scanning within the
    zone uselessly.

    Currently the zonelist cache is setup only after the first zone has been
    considered and zone_reclaim() has been called. The objective was to avoid
    a costly setup but zone_reclaim is itself quite expensive. If it is
    failing regularly such as the first eligible zone having mostly mapped
    pages, the cost in scanning and allocation stalls is far higher than the
    ZLC initialisation step.

    This patch initialises ZLC before the first eligible zone calls
    zone_reclaim(). Once initialised, it is checked whether the zone failed
    zone_reclaim recently. If it has, the zone is skipped. As the first zone
    is now being checked, additional care has to be taken about zones marked
    full. A zone can be marked "full" because it should not have enough
    unmapped pages for zone_reclaim but this is excessive as direct reclaim or
    kswapd may succeed where zone_reclaim fails. Only mark zones "full" after
    zone_reclaim fails if it failed to reclaim enough pages after scanning.
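
    A sketch of the reordering described above, using the zonelist cache
    helpers of this era (illustrative):

        /* consult the ZLC even for the first candidate zone, so a zone
         * that recently failed zone_reclaim() is skipped immediately */
        if (NUMA_BUILD && zlc_active &&
            !zlc_zone_worth_trying(zonelist, z, allowednodes))
            continue;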

    Signed-off-by: Mel Gorman
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

15 Jul, 2011

5 commits

  • Add optional region->nid which can be enabled by arch using
    CONFIG_HAVE_MEMBLOCK_NODE_MAP. When enabled, memblock also carries
    NUMA node information and replaces early_node_map[].

    Newly added memblocks have MAX_NUMNODES as nid. Arch can then call
    memblock_set_node() to set node information. memblock takes care of
    merging and node affine allocations w.r.t. node information.

    When MEMBLOCK_NODE_MAP is enabled, early_node_map[], related data
    structures and functions to manipulate and iterate it are disabled.
    memblock version of __next_mem_pfn_range() is provided such that
    for_each_mem_pfn_range() behaves the same and its users don't have to
    be updated.
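
    In essence (a sketch of the optional field and the arch hook; exact
    layout per the memblock of this era):

        struct memblock_region {
            phys_addr_t base;
            phys_addr_t size;
        #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
            int nid;    /* MAX_NUMNODES until the arch sets it */
        #endif
        };

        int memblock_set_node(phys_addr_t base, phys_addr_t size, int nid);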

    -v2: Yinghai spotted section mismatch caused by missing
    __init_memblock in memblock_set_node(). Fixed.

    Signed-off-by: Tejun Heo
    Link: http://lkml.kernel.org/r/20110714094342.GF3455@htj.dyndns.org
    Cc: Yinghai Lu
    Cc: Benjamin Herrenschmidt
    Signed-off-by: H. Peter Anvin

    Tejun Heo
     
  • With the previous changes, generic NUMA aware memblock API has feature
    parity with memblock_x86_find_in_range_node(). There currently are
    two users - x86 setup_node_data() and __alloc_memory_core_early() in
    nobootmem.c.

    This patch converts the former to use memblock_alloc_nid() and the
    latter memblock_find_in_range_node(), and kills
    memblock_x86_find_in_range_node() and related functions including
    find_memory_core_early() in page_alloc.c.

    Signed-off-by: Tejun Heo
    Link: http://lkml.kernel.org/r/1310460395-30913-9-git-send-email-tj@kernel.org
    Cc: Yinghai Lu
    Cc: Benjamin Herrenschmidt
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Signed-off-by: H. Peter Anvin

    Tejun Heo
     
  • The previous patch added for_each_mem_pfn_range() which is more
    versatile than for_each_active_range_index_in_nid(). This patch
    replaces for_each_active_range_index_in_nid() and open coded
    early_node_map[] walks with for_each_mem_pfn_range().

    All conversions in this patch are straight-forward and shouldn't cause
    any functional difference. After the conversions,
    for_each_active_range_index_in_nid() doesn't have any user left and is
    removed.

    Signed-off-by: Tejun Heo
    Link: http://lkml.kernel.org/r/1310460395-30913-4-git-send-email-tj@kernel.org
    Cc: Yinghai Lu
    Cc: Benjamin Herrenschmidt
    Signed-off-by: H. Peter Anvin

    Tejun Heo
     
  • __absent_pages_in_range() was needlessly complex. Reimplement it
    using for_each_mem_pfn_range().

    Also, update zone_absent_pages_in_node() such that it doesn't call
    __absent_pages_in_range() with @zone_start_pfn which is larger than
    @zone_end_pfn.
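
    The reimplementation boils down to (a sketch; this mirrors the shape
    described, not necessarily the committed code):

        static unsigned long __absent_pages_in_range(int nid,
                unsigned long range_start_pfn, unsigned long range_end_pfn)
        {
            unsigned long nr_absent = range_end_pfn - range_start_pfn;
            unsigned long start_pfn, end_pfn;
            int i;

            /* subtract every registered range overlapping the request */
            for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
                start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
                end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
                nr_absent -= end_pfn - start_pfn;
            }
            return nr_absent;
        }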

    Signed-off-by: Tejun Heo
    Link: http://lkml.kernel.org/r/1310460395-30913-3-git-send-email-tj@kernel.org
    Cc: Yinghai Lu
    Cc: Benjamin Herrenschmidt
    Signed-off-by: H. Peter Anvin

    Tejun Heo
     
  • Callback based iteration is cumbersome and much less useful than
    for_each_*() iterator. This patch implements for_each_mem_pfn_range()
    which replaces work_with_active_regions(). All the current users of
    work_with_active_regions() are converted.

    This simplifies walking over early_node_map and will allow converting
    internal logics in page_alloc to use iterator instead of walking
    early_node_map directly, which in turn will enable moving node
    information to memblock.

    powerpc change is only compile tested.
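
    A usage sketch of the new iterator (macro signature per the memblock
    headers of this era; the body is illustrative):

        unsigned long start_pfn, end_pfn;
        int i, nid;

        /* walk every registered memory range along with its node id */
        for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid)
            pr_info("range %d: %lx-%lx nid %d\n", i, start_pfn, end_pfn, nid);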

    Signed-off-by: Tejun Heo
    Link: http://lkml.kernel.org/r/20110714074610.GD3455@htj.dyndns.org
    Cc: Yinghai Lu
    Cc: Benjamin Herrenschmidt
    Signed-off-by: H. Peter Anvin

    Tejun Heo
     

14 Jul, 2011

2 commits

  • 25818f0f28 (memblock: Make MEMBLOCK_ERROR be 0) thankfully made
    MEMBLOCK_ERROR 0, and there is already code which expects the error
    return to be 0. There's no point in keeping MEMBLOCK_ERROR around. End
    its misery.

    Signed-off-by: Tejun Heo
    Link: http://lkml.kernel.org/r/1310457490-3356-6-git-send-email-tj@kernel.org
    Cc: Yinghai Lu
    Cc: Benjamin Herrenschmidt
    Signed-off-by: H. Peter Anvin

    Tejun Heo
     
  • a226f6c899 (FRV: Clean up bootmem allocator's page freeing algorithm)
    separated out __free_pages_bootmem() from free_all_bootmem_core().
    __free_pages_bootmem() takes @order argument but it assumes @order is
    either 0 or ilog2(BITS_PER_LONG). Note that all the current users
    match that assumption and this doesn't cause actual problems.

    Fix it by using 1 << order instead of BITS_PER_LONG.
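
    That is, roughly (a sketch; the exact loop body is an assumption):

        unsigned int nr_pages = 1 << order;    /* was BITS_PER_LONG */
        unsigned int loop;

        for (loop = 0; loop < nr_pages; loop++) {
            __ClearPageReserved(&page[loop]);
            set_page_count(&page[loop], 0);
        }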

    Signed-off-by: Tejun Heo
    Link: http://lkml.kernel.org/r/1310457490-3356-3-git-send-email-tj@kernel.org
    Cc: David Howells
    Signed-off-by: H. Peter Anvin

    Tejun Heo
     

13 Jul, 2011

1 commit

  • SPARSEMEM w/o VMEMMAP and DISCONTIGMEM, both used only on 32bit, use
    sections array to map pfn to nid which is limited in granularity. If
    NUMA nodes are laid out such that the mapping cannot be accurate, boot
    will fail triggering BUG_ON() in mminit_verify_page_links().

    On 32bit, it's 512MiB w/ PAE and SPARSEMEM. This seems to have been
    granular enough until commit 2706a0bf7b (x86, NUMA: Enable
    CONFIG_AMD_NUMA on 32bit too). Apparently, there is a machine which
    aligns NUMA nodes to 128MiB and has only AMD NUMA but not SRAT. This
    led to the following BUG_ON().

    On node 0 totalpages: 2096615
    DMA zone: 32 pages used for memmap
    DMA zone: 0 pages reserved
    DMA zone: 3927 pages, LIFO batch:0
    Normal zone: 1740 pages used for memmap
    Normal zone: 220978 pages, LIFO batch:31
    HighMem zone: 16405 pages used for memmap
    HighMem zone: 1853533 pages, LIFO batch:31
    BUG: Int 6: CR2 (null)
    EDI (null) ESI 00000002 EBP 00000002 ESP c1543ecc
    EBX f2400000 EDX 00000006 ECX (null) EAX 00000001
    err (null) EIP c16209aa CS 00000060 flg 00010002
    Stack: f2400000 00220000 f7200800 c1620613 00220000 01000000 04400000 00238000
    (null) f7200000 00000002 f7200b58 f7200800 c1620929 000375fe (null)
    f7200b80 c16395f0 00200a02 f7200a80 (null) 000375fe 00000002 (null)
    Pid: 0, comm: swapper Not tainted 2.6.39-rc5-00181-g2706a0b #17
    Call Trace:
    [] ? early_fault+0x2e/0x2e
    [] ? mminit_verify_page_links+0x12/0x42
    [] ? memmap_init_zone+0xaf/0x10c
    [] ? free_area_init_node+0x2b9/0x2e3
    [] ? free_area_init_nodes+0x3f2/0x451
    [] ? paging_init+0x112/0x118
    [] ? setup_arch+0x791/0x82f
    [] ? start_kernel+0x6a/0x257

    This patch implements node_map_pfn_alignment() which determines
    maximum internode alignment and update numa_register_memblks() to
    reject NUMA configuration if alignment exceeds the pfn -> nid mapping
    granularity of the memory model as determined by PAGES_PER_SECTION.

    This makes the problematic machine boot w/ flatmem by rejecting the
    NUMA config and provides protection against crazy NUMA configurations.
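
    A sketch of how the check is used on the x86 side (illustrative; the
    exact message text is an assumption):

        unsigned long pfn_align = node_map_pfn_alignment();

        if (pfn_align && pfn_align < PAGES_PER_SECTION) {
            printk(KERN_WARNING "Node alignment %LuMB < min %LuMB, rejecting NUMA config\n",
                   PFN_PHYS(pfn_align) >> 20,
                   PFN_PHYS(PAGES_PER_SECTION) >> 20);
            return -EINVAL;
        }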

    Signed-off-by: Tejun Heo
    Link: http://lkml.kernel.org/r/20110712074534.GB2872@htj.dyndns.org
    LKML-Reference:
    Reported-and-Tested-by: Hans Rosenfeld
    Cc: Conny Seidel
    Signed-off-by: H. Peter Anvin

    Tejun Heo
     

02 Jun, 2011

1 commit

  • This reverts commit a197b59ae6e8bee56fcef37ea2482dc08414e2ac.

    As rmk says:
    "Commit a197b59ae6e8 (mm: fail GFP_DMA allocations when ZONE_DMA is not
    configured) is causing regressions on ARM with various drivers which
    use GFP_DMA.

    The behaviour up until now has been to silently ignore that flag when
    CONFIG_ZONE_DMA is not enabled, and to allocate from the normal zone.
    However, as a result of the above commit, such allocations now fail
    which causes drivers to fail. These are regressions compared to the
    previous kernel version."

    so just revert it.

    Requested-by: Russell King
    Acked-by: Andrew Morton
    Cc: David Rientjes
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

27 May, 2011

1 commit

  • During memory reclaim we determine the number of pages to be scanned per
    zone as

    (anon + file) >> priority.
    Assume
    scan = (anon + file) >> priority.

    If scan < SWAP_CLUSTER_MAX, the scan will be skipped for this time and
    priority gets higher. This has some problems.

    1. This raises the priority by one step without any scan.
    To do a scan at this priority, the amount of pages should be larger
    than 512M. If pages>>priority < SWAP_CLUSTER_MAX, it's recorded and
    the scan will be batched later. (But we lose 1 priority.)
    If memory size is below 16M, pages >> priority is 0 and there is no
    scan at DEF_PRIORITY, forever. (See the worked numbers after this list.)

    2. If zone->all_unreclaimable==true, it's scanned only when priority==0.
    So, x86's ZONE_DMA will never be recovered until the user of pages
    frees memory by itself.

    3. With memcg, the limit of memory can be small. When using a small
    memcg, it gets priority < DEF_PRIORITY-2 very easily and needs to call
    wait_iff_congested().
    For doing a scan before priority=9, 64MB of memory should be used.
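
    Worked numbers for point 1, assuming 4KiB pages, DEF_PRIORITY = 12 and
    SWAP_CLUSTER_MAX = 32 (the defaults of this era):

        scan = pages >> 12 >= 32  requires  pages >= 32 << 12
                                          = 131072 pages = 512MiB
        pages >> 12 == 0          whenever  pages <  1 << 12
                                          = 4096 pages  = 16MiB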

    Then, this patch tries to scan SWAP_CLUSTER_MAX pages by force when:

    1. the target is small enough.
    2. it's kswapd or memcg reclaim.

    Then we can avoid a rapid priority drop and may be able to recover
    all_unreclaimable in small zones. And this patch removes nr_saved_scan.
    This will allow scanning at this priority even when pages >> priority is
    very small.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Ying Han
    Cc: Balbir Singh
    Cc: KOSAKI Motohiro
    Cc: Daisuke Nishimura
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki