30 Apr, 2013

1 commit


13 Dec, 2012

2 commits

  • We need a node which only contains movable memory. This feature is very
    important for node hotplug. If a node has normal/highmem, the memory may
    be used by the kernel and can't be offlined. If the node only contains
    movable memory, we can offline the memory and the node.

    All are prepared, we can actually introduce N_MEMORY.
    add CONFIG_MOVABLE_NODE make we can use it for movable-dedicated node

    [akpm@linux-foundation.org: fix Kconfig text]
    Signed-off-by: Lai Jiangshan
    Tested-by: Yasuaki Ishimatsu
    Signed-off-by: Wen Congyang
    Cc: Jiang Liu
    Cc: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Yinghai Lu
    Cc: Rusty Russell
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • N_HIGH_MEMORY stands for the nodes that has normal or high memory.
    N_MEMORY stands for the nodes that has any memory.

    The code here need to handle with the nodes which have memory, we should
    use N_MEMORY instead.

    Signed-off-by: Lai Jiangshan
    Acked-by: Hillf Danton
    Signed-off-by: Wen Congyang
    Cc: Christoph Lameter
    Cc: Hillf Danton
    Cc: Lin Feng
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     

12 Dec, 2012

4 commits

  • use [index] = init_value
    use N_xxxxx instead of hardcode.

    Make it more readability and easier to add new state.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Wen Congyang
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • register_node() is defined as extern in include/linux/node.h. But the
    function is only called from register_one_node() in driver/base/node.c.

    So the patch defines register_node() as static.

    Signed-off-by: Yasuaki Ishimatsu
    Acked-by: David Rientjes
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasuaki Ishimatsu
     
  • When calling unregister_node(), the function shows following message at
    device_release().

    "Device 'node2' does not have a release() function, it is broken and must
    be fixed."

    The reason is node's device struct does not have a release() function.

    So the patch registers node_device_release() to the device's release()
    function for suppressing the warning message. Additionally, the patch
    adds memset() to initialize a node struct into register_node(). Because
    the node struct is part of node_devices[] array and it cannot be freed by
    node_device_release(). So if system reuses the node struct, it has a
    garbage.

    Signed-off-by: Yasuaki Ishimatsu
    Signed-off-by: Wen Congyang
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasuaki Ishimatsu
     
  • We use a static array to store struct node. In many cases, we don't have
    too many nodes, and some memory will be unused. Convert it to per-device
    dynamically allocated memory.

    Signed-off-by: Wen Congyang
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     

30 May, 2012

1 commit

  • /sys/devices/system/node/{online,possible} outputs a garbage byte
    because print_nodes_state() returns content size + 1. To fix the bug,
    the patch changes the use of cpuset_sprintf_cpulist to follow the use at
    other places, which is clearer and safer.

    This bug was introduced in v2.6.24 (commit bde631a51876: "mm: add node
    states sysfs class attributeS").

    Signed-off-by: Ryota Ozaki
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ryota Ozaki
     

03 Feb, 2012

1 commit

  • One system with 2048g ram, reported soft lockup on recent kernel.

    [ 34.426749] cpu_dev_init done
    [ 61.166399] BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]
    [ 61.166733] Modules linked in:
    [ 61.166904] irq event stamp: 1935610
    [ 61.178431] hardirqs last enabled at (1935609): [] mutex_lock_nested+0x299/0x2b4
    [ 61.178923] hardirqs last disabled at (1935610): [] apic_timer_interrupt+0x6b/0x80
    [ 61.198767] softirqs last enabled at (1935476): [] __do_softirq+0x195/0x1ab
    [ 61.218604] softirqs last disabled at (1935471): [] call_softirq+0x1c/0x30
    [ 61.238408] CPU 0
    [ 61.238549] Modules linked in:
    [ 61.238744]
    [ 61.238825] Pid: 1, comm: swapper/0 Not tainted 3.3.0-rc1-tip-yh-02076-g962f689-dirty #171
    [ 61.278212] RIP: 0010:[] [] lock_release+0x90/0x9c
    [ 61.278627] RSP: 0018:ffff883f64dbfd70 EFLAGS: 00000246
    [ 61.298287] RAX: ffff883f64dc0000 RBX: 0000000000000000 RCX: 000000000000008b
    [ 61.298690] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
    [ 61.318383] RBP: ffff883f64dbfda0 R08: 0000000000000001 R09: 000000000000008b
    [ 61.338215] R10: 0000000000000000 R11: 0000000000000000 R12: ffff883f64dbfd10
    [ 61.338610] R13: ffff883f64dc0708 R14: ffff883f64dc0708 R15: ffffffff81095657
    [ 61.358299] FS: 0000000000000000(0000) GS:ffff883f7d600000(0000) knlGS:0000000000000000
    [ 61.378118] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [ 61.378450] CR2: 0000000000000000 CR3: 00000000024af000 CR4: 00000000000007f0
    [ 61.398144] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 61.417918] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    [ 61.418260] Process swapper/0 (pid: 1, threadinfo ffff883f64dbe000, task ffff883f64dc0000)
    [ 61.445358] Stack:
    [ 61.445511] 0000000000000002 ffff897f649ba168 ffff883f64dbfe10 ffff88ff64bb57a8
    [ 61.458040] 0000000000000000 0000000000000000 ffff883f64dbfdc0 ffffffff81ceb1b4
    [ 61.458491] 000000000011608c ffff88ff64bb58a8 ffff883f64dbfdf0 ffffffff81c57638
    [ 61.478215] Call Trace:
    [ 61.478367] [] _raw_spin_unlock+0x21/0x2e
    [ 61.497994] [] klist_next+0x9e/0xbc
    [ 61.498264] [] next_device+0xe/0x1e
    [ 61.517867] [] subsys_find_device_by_id+0xb7/0xd6
    [ 61.518197] [] find_memory_block_hinted+0x3d/0x66
    [ 61.537927] [] find_memory_block+0x10/0x12
    [ 61.538193] [] add_memory_section+0x35/0x9e
    [ 61.557932] [] memory_dev_init+0x68/0xda
    [ 61.558227] [] driver_init+0x97/0xa7
    [ 61.577853] [] kernel_init+0xf6/0x1c0
    [ 61.578140] [] kernel_thread_helper+0x4/0x10
    [ 61.597850] [] ? retint_restore_args+0xe/0xe
    [ 61.598144] [] ? start_kernel+0x3ab/0x3ab
    [ 61.617826] [] ? gs_change+0xb/0xb
    [ 61.618060] Code: 10 48 83 3b 00 eb e8 4c 89 f2 44 89 fe 4c 89 ef e8 e1 fe ff ff 65 48 8b 04 25 40 bc 00 00 c7 80 cc 06 00 00 00 00 00 00 41 54 9d 5b 41 5c 41 5d 41 5e 41 5f 5d c3 55 48 89 e5 41 57 41 89 cf
    [ 89.285380] memory_dev_init done

    Finally it takes about 55s to create 16400 memory entries.

    Root cause: for x86_64, 2048g (with 2g hole at [2g,4g), and TOP2 will be 2050g), will have 16400 memory block.

    find_memory_block/subsys_find_device_by_id will be expensive with that many entries.

    Actually, we don't need to find that memory block for BOOT path.

    Skip that finding make it get back to normal.

    [ 34.466696] cpu_dev_init done
    [ 35.290080] memory_dev_init done

    Also solved the delay with topology_init when sections_per_block is not 1.

    Signed-off-by: Yinghai Lu
    Cc: Kay Sievers
    Cc: Nathan Fontenot
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Yinghai Lu
     

07 Jan, 2012

1 commit

  • This resolves the conflict in the arch/arm/mach-s3c64xx/s3c6400.c file,
    and it fixes the build error in the arch/x86/kernel/microcode_core.c
    file, that the merge did not catch.

    The microcode_core.c patch was provided by Stephen Rothwell
    who was invaluable in the merge issues involved
    with the large sysdev removal process in the driver-core tree.

    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

22 Dec, 2011

2 commits

  • This moves the 'memory sysdev_class' over to a regular 'memory' subsystem
    and converts the devices to regular devices. The sysdev drivers are
    implemented as subsystem interfaces now.

    After all sysdev classes are ported to regular driver core entities, the
    sysdev implementation will be entirely removed from the kernel.

    Signed-off-by: Kay Sievers
    Signed-off-by: Greg Kroah-Hartman

    Kay Sievers
     
  • This moves the 'cpu sysdev_class' over to a regular 'cpu' subsystem
    and converts the devices to regular devices. The sysdev drivers are
    implemented as subsystem interfaces now.

    After all sysdev classes are ported to regular driver core entities, the
    sysdev implementation will be entirely removed from the kernel.

    Userspace relies on events and generic sysfs subsystem infrastructure
    from sysdev devices, which are made available with this conversion.

    Cc: Haavard Skinnemoen
    Cc: Hans-Christian Egtvedt
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Paul Mundt
    Cc: "David S. Miller"
    Cc: Chris Metcalf
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Borislav Petkov
    Cc: Tigran Aivazian
    Cc: Len Brown
    Cc: Zhang Rui
    Cc: Dave Jones
    Cc: Peter Zijlstra
    Cc: Russell King
    Cc: Andrew Morton
    Cc: Arjan van de Ven
    Cc: "Rafael J. Wysocki"
    Cc: "Srivatsa S. Bhat"
    Signed-off-by: Kay Sievers
    Signed-off-by: Greg Kroah-Hartman

    Kay Sievers
     

19 Nov, 2011

1 commit


25 May, 2011

1 commit

  • commit 2ac390370a ("writeback: add
    /sys/devices/system/node//vmstat") added vmstat entry. But
    strangely it only show nr_written and nr_dirtied.

    # cat /sys/devices/system/node/node20/vmstat
    nr_written 0
    nr_dirtied 0

    Of course, It's not adequate. With this patch, the vmstat show all vm
    stastics as /proc/vmstat.

    # cat /sys/devices/system/node/node0/vmstat
    nr_free_pages 899224
    nr_inactive_anon 201
    nr_active_anon 17380
    nr_inactive_file 31572
    nr_active_file 28277
    nr_unevictable 0
    nr_mlock 0
    nr_anon_pages 17321
    nr_mapped 8640
    nr_file_pages 60107
    nr_dirty 33
    nr_writeback 0
    nr_slab_reclaimable 6850
    nr_slab_unreclaimable 7604
    nr_page_table_pages 3105
    nr_kernel_stack 175
    nr_unstable 0
    nr_bounce 0
    nr_vmscan_write 0
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem 260
    nr_dirtied 1050
    nr_written 938
    numa_hit 962872
    numa_miss 0
    numa_foreign 0
    numa_interleave 8617
    numa_local 962872
    numa_other 0
    nr_anon_transparent_hugepages 0

    [akpm@linux-foundation.org: no externs in .c files]
    Signed-off-by: KOSAKI Motohiro
    Cc: Michael Rubin
    Cc: Wu Fengguang
    Acked-by: David Rientjes
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

04 Feb, 2011

1 commit

  • Update the 'phys_index' property of a the memory_block struct to be
    called start_section_nr, and add a end_section_nr property. The
    data tracked here is the same but the updated naming is more in line
    with what is stored here, namely the first and last section number
    that the memory block spans.

    The names presented to userspace remain the same, phys_index for
    start_section_nr and end_phys_index for end_section_nr, to avoid breaking
    anything in userspace.

    This also updates the node sysfs code to be aware of the new capability for
    a memory block to contain multiple memory sections and be aware of the memory
    block structure name changes (start_section_nr). This requires an additional
    parameter to unregister_mem_sect_under_nodes so that we know which memory
    section of the memory block to unregister.

    Signed-off-by: Nathan Fontenot
    Reviewed-by: Robin Holt
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Greg Kroah-Hartman

    Nathan Fontenot
     

14 Jan, 2011

1 commit


27 Oct, 2010

1 commit

  • For NUMA node systems it is important to have visibility in memory
    characteristics. Two of the /proc/vmstat values "nr_written" and
    "nr_dirtied" are added here.

    # cat /sys/devices/system/node/node20/vmstat
    nr_written 0
    nr_dirtied 0

    Signed-off-by: Michael Rubin
    Reviewed-by: Wu Fengguang
    Cc: Dave Chinner
    Cc: Jens Axboe
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Rubin
     

23 Oct, 2010

1 commit

  • Modify link_mem_sections() to pass in the previous mem_block as a hint to
    locating the next mem_block. Since they are typically added in order this
    results in a massive saving in time during boot of a very large system.
    For example, on a 16TB x86_64 machine, it reduced the total time spent
    linking all node's memory sections from 1 hour, 27 minutes to 46 seconds.

    Signed-off-by: Robin Holt
    To: Gary Hade
    To: Badari Pulavarty
    To: Ingo Molnar
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Greg Kroah-Hartman

    Robin Holt
     

10 Aug, 2010

1 commit


25 May, 2010

1 commit

  • Add a per-node sysfs file called compact. When the file is written to,
    each zone in that node is compacted. The intention that this would be
    used by something like a job scheduler in a batch system before a job
    starts so that the job can allocate the maximum number of hugepages
    without significant start-up cost.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

07 Apr, 2010

1 commit

  • NODEMASK_ALLOC/FREE are mapped to kmalloc/free if NODES_SHIFT > 8.
    Among its several users, drivers/base/node.c wasn't including slab.h
    leading to build failure if NODES_SHIFT > 8. Include slab.h from
    drivers/base/node.c.

    This isn't an ideal solution but including slab.h directly from
    nodemask.h is not an option because nodemask.h gets included
    everywhere. For now, make it work by including slab.h from its users.

    Signed-off-by: Tejun Heo
    Reported-by: Ingo Molnar

    Tejun Heo
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities include those
    headers directly instead of assuming availability. As this conversion
    needs to touch large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the followings.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and try to put the new include such that its order conforms
    to its surrounding. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    wildly available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build test were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable easily on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

19 Mar, 2010

1 commit

  • node_read_distance() has a BUILD_BUG_ON() to prevent buffer overruns when
    the number of nodes printed will exceed the buffer length.

    Each node only needs four chars: three for distance (maximum distance is
    255) and one for a seperating space or a trailing newline.

    Signed-off-by: David Rientjes
    Cc: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    David Rientjes
     

08 Mar, 2010

3 commits

  • Convert the node driver to sysdev_class attribute arrays. This
    greatly cleans up the code and remove a lot of code.

    Saves ~150 bytes of code on x86-64.

    Signed-off-by: Andi Kleen
    Signed-off-by: Greg Kroah-Hartman

    Andi Kleen
     
  • Using the new attribute argument convert the node driver class
    attributes to carry the node state. Then use a shared function to do
    what a lot of individual functions did before.

    Signed-off-by: Andi Kleen
    Signed-off-by: Greg Kroah-Hartman

    Andi Kleen
     
  • Passing the attribute to the low level IO functions allows all kinds
    of cleanups, by sharing low level IO code without requiring
    an own function for every piece of data.

    Also drivers can extend the attributes with own data fields
    and use that in the low level function.

    Similar to sysdev_attributes and normal attributes.

    This is a tree-wide sweep, converting everything in one go.

    No functional changes in this patch other than passing the new
    argument everywhere.

    Tested on x86, the non x86 parts are uncompiled.

    Signed-off-by: Andi Kleen
    Signed-off-by: Greg Kroah-Hartman

    Andi Kleen
     

16 Dec, 2009

8 commits

  • Nodemasks should not be allocated on the stack for large systems (when it
    is larger than 256 bytes) since there is a threat of overflow.

    This patch causes the unregister_mem_sect_under_nodes() nodemask to be
    allocated on the stack for smaller systems and be allocated by slab for
    larger systems.

    GFP_KERNEL is used since remove_memory_block() can block.

    Cc: Gary Hade
    Cc: Badari Pulavarty
    Cc: Alex Chiang
    Signed-off-by: David Rientjes
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • You can discover which CPUs belong to a NUMA node by examining
    /sys/devices/system/node/node#/

    However, it's not convenient to go in the other direction, when looking at
    /sys/devices/system/cpu/cpu#/

    Yes, you can muck about in sysfs, but adding these symlinks makes life a
    lot more convenient.

    Signed-off-by: Alex Chiang
    Acked-by: David Rientjes
    Cc: Gary Hade
    Cc: Badari Pulavarty
    Cc: Ingo Molnar
    Cc: David Rientjes
    Cc: Greg KH
    Cc: Randy Dunlap
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Chiang
     
  • By returning early if the node is not online, we can unindent the
    interesting code by two levels.

    No functional change.

    Signed-off-by: Alex Chiang
    Cc: Gary Hade
    Cc: Badari Pulavarty
    Cc: Ingo Molnar
    Cc: David Rientjes
    Cc: Greg KH
    Cc: Randy Dunlap
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Chiang
     
  • By returning early if the node is not online, we can unindent the
    interesting code by one level.

    No functional change.

    Signed-off-by: Alex Chiang
    Cc: Gary Hade
    Cc: Badari Pulavarty
    Cc: Ingo Molnar
    Cc: David Rientjes
    Cc: Greg KH
    Cc: Randy Dunlap
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Chiang
     
  • Commit c04fc586c (mm: show node to memory section relationship with
    symlinks in sysfs) created symlinks from nodes to memory sections, e.g.

    /sys/devices/system/node/node1/memory135 -> ../../memory/memory135

    If you're examining the memory section though and are wondering what node
    it might belong to, you can find it by grovelling around in sysfs, but
    it's a little cumbersome.

    Add a reverse symlink for each memory section that points back to the
    node to which it belongs.

    Signed-off-by: Alex Chiang
    Cc: Gary Hade
    Cc: Badari Pulavarty
    Cc: Ingo Molnar
    Acked-by: David Rientjes
    Cc: Greg KH
    Cc: Randy Dunlap
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Chiang
     
  • Offload the registration and unregistration of per node hstate sysfs
    attributes to a worker thread rather than attempt the
    allocation/attachment or detachment/freeing of the attributes in the
    context of the memory hotplug handler.

    I don't know that this is absolutely required, but the registration can
    sleep in allocations and other mem hot plug handlers do it this way. If
    it turns out this is NOT required, we can drop this patch.

    N.B., Only tested build, boot, libhugetlbfs regression.
    i.e., no memory hotplug testing.

    Signed-off-by: Lee Schermerhorn
    Reviewed-by: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Mel Gorman
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Cc: David Rientjes
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Register per node hstate attributes only for nodes with memory. As
    suggested by David Rientjes.

    With Memory Hotplug, memory can be added to a memoryless node and a node
    with memory can become memoryless. Therefore, add a memory on/off-line
    notifier callback to [un]register a node's attributes on transition
    to/from memoryless state.

    N.B., Only tested build, boot, libhugetlbfs regression.
    i.e., no memory hotplug testing.

    Signed-off-by: Lee Schermerhorn
    Reviewed-by: Andi Kleen
    Acked-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Mel Gorman
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Add the per huge page size control/query attributes to the per node
    sysdevs:

    /sys/devices/system/node/node/hugepages/hugepages-/
    nr_hugepages - r/w
    free_huge_pages - r/o
    surplus_huge_pages - r/o

    The patch attempts to re-use/share as much of the existing global hstate
    attribute initialization and handling, and the "nodes_allowed" constraint
    processing as possible.

    Calling set_max_huge_pages() with no node indicates a change to global
    hstate parameters. In this case, any non-default task mempolicy will be
    used to generate the nodes_allowed mask. A valid node id indicates an
    update to that node's hstate parameters, and the count argument specifies
    the target count for the specified node. From this info, we compute the
    target global count for the hstate and construct a nodes_allowed node mask
    contain only the specified node.

    Setting the node specific nr_hugepages via the per node attribute
    effectively ignores any task mempolicy or cpuset constraints.

    With this patch:

    (me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB
    ./ ../ free_hugepages nr_hugepages surplus_hugepages

    Starting from:
    Node 0 HugePages_Total: 0
    Node 0 HugePages_Free: 0
    Node 0 HugePages_Surp: 0
    Node 1 HugePages_Total: 0
    Node 1 HugePages_Free: 0
    Node 1 HugePages_Surp: 0
    Node 2 HugePages_Total: 0
    Node 2 HugePages_Free: 0
    Node 2 HugePages_Surp: 0
    Node 3 HugePages_Total: 0
    Node 3 HugePages_Free: 0
    Node 3 HugePages_Surp: 0
    vm.nr_hugepages = 0

    Allocate 16 persistent huge pages on node 2:
    (me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages

    [Note that this is equivalent to:
    numactl -m 2 hugeadmin --pool-pages-min 2M:+16
    ]

    Yields:
    Node 0 HugePages_Total: 0
    Node 0 HugePages_Free: 0
    Node 0 HugePages_Surp: 0
    Node 1 HugePages_Total: 0
    Node 1 HugePages_Free: 0
    Node 1 HugePages_Surp: 0
    Node 2 HugePages_Total: 16
    Node 2 HugePages_Free: 16
    Node 2 HugePages_Surp: 0
    Node 3 HugePages_Total: 0
    Node 3 HugePages_Free: 0
    Node 3 HugePages_Surp: 0
    vm.nr_hugepages = 16

    Global controls work as expected--reduce pool to 8 persistent huge pages:
    (me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

    Node 0 HugePages_Total: 0
    Node 0 HugePages_Free: 0
    Node 0 HugePages_Surp: 0
    Node 1 HugePages_Total: 0
    Node 1 HugePages_Free: 0
    Node 1 HugePages_Surp: 0
    Node 2 HugePages_Total: 8
    Node 2 HugePages_Free: 8
    Node 2 HugePages_Surp: 0
    Node 3 HugePages_Total: 0
    Node 3 HugePages_Free: 0
    Node 3 HugePages_Surp: 0

    Signed-off-by: Lee Schermerhorn
    Acked-by: Mel Gorman
    Reviewed-by: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Cc: David Rientjes
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     

22 Sep, 2009

2 commits

  • Recently we encountered OOM problems due to memory use of the GEM cache.
    Generally a large amuont of Shmem/Tmpfs pages tend to create a memory
    shortage problem.

    We often use the following calculation to determine the amount of shmem
    pages:

    shmem = NR_ACTIVE_ANON + NR_INACTIVE_ANON - NR_ANON_PAGES

    however the expression does not consider isolated and mlocked pages.

    This patch adds explicit accounting for pages used by shmem and tmpfs.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Reviewed-by: Christoph Lameter
    Acked-by: Wu Fengguang
    Cc: David Rientjes
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • The amount of memory allocated to kernel stacks can become significant and
    cause OOM conditions. However, we do not display the amount of memory
    consumed by stacks.

    Add code to display the amount of memory used for stacks in /proc/meminfo.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Reviewed-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

17 Jun, 2009

1 commit

  • Currently, nobody wants to turn UNEVICTABLE_LRU off. Thus this
    configurability is unnecessary.

    Signed-off-by: KOSAKI Motohiro
    Cc: Johannes Weiner
    Cc: Andi Kleen
    Acked-by: Minchan Kim
    Cc: David Woodhouse
    Cc: Matt Mackall
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

13 Mar, 2009

1 commit


11 Mar, 2009

1 commit

  • get_nid_for_pfn() returns int

    Presumably the (nid < 0) case has never happened.

    We do know that it is happening on one system while creating a symlink for
    a memory section so it should also happen on the same system if
    unregister_mem_sect_under_nodes() were called to remove the same symlink.

    The test was actually added in response to a problem with an earlier
    version reported by Yasunori Goto where one or more of the leading pages
    of a memory section on the 2nd node of one of his systems was
    uninitialized because I believe they coincided with a memory hole.

    That earlier version did not ignore uninitialized pages and determined
    the nid by considering only the 1st page of each memory section. This
    caused the symlink to the 1st memory section on the 2nd node to be
    incorrectly created in /sys/devices/system/node/node0 instead of
    /sys/devices/system/node/node1. The problem was fixed by adding the
    test to skip over uninitialized pages.

    I suspect we have not seen any reports of the non-removal
    of a symlink due to the incorrect declaration of the nid
    variable in unregister_mem_sect_under_nodes() because
    - systems where a memory section could have an uninitialized
    range of leading pages are probably rare.
    - memory remove is probably not done very frequently on the
    systems that are capable of demonstrating the problem.
    - lingering symlink(s) that should have been removed may
    have simply gone unnoticed.

    [garyhade@us.ibm.com: wrote changelog]
    Signed-off-by: Roel Kluin
    Cc: Gary Hade
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roel Kluin
     

07 Jan, 2009

1 commit

  • Show node to memory section relationship with symlinks in sysfs

    Add /sys/devices/system/node/nodeX/memoryY symlinks for all
    the memory sections located on nodeX. For example:
    /sys/devices/system/node/node1/memory135 -> ../../memory/memory135
    indicates that memory section 135 resides on node1.

    Also revises documentation to cover this change as well as updating
    Documentation/ABI/testing/sysfs-devices-memory to include descriptions
    of memory hotremove files 'phys_device', 'phys_index', and 'state'
    that were previously not described there.

    In addition to it always being a good policy to provide users with
    the maximum possible amount of physical location information for
    resources that can be hot-added and/or hot-removed, the following
    are some (but likely not all) of the user benefits provided by
    this change.
    Immediate:
    - Provides information needed to determine the specific node
    on which a defective DIMM is located. This will reduce system
    downtime when the node or defective DIMM is swapped out.
    - Prevents unintended onlining of a memory section that was
    previously offlined due to a defective DIMM. This could happen
    during node hot-add when the user or node hot-add assist script
    onlines _all_ offlined sections due to user or script inability
    to identify the specific memory sections located on the hot-added
    node. The consequences of reintroducing the defective memory
    could be ugly.
    - Provides information needed to vary the amount and distribution
    of memory on specific nodes for testing or debugging purposes.
    Future:
    - Will provide information needed to identify the memory
    sections that need to be offlined prior to physical removal
    of a specific node.

    Symlink creation during boot was tested on 2-node x86_64, 2-node
    ppc64, and 2-node ia64 systems. Symlink creation during physical
    memory hot-add tested on a 2-node x86_64 system.

    Signed-off-by: Gary Hade
    Signed-off-by: Badari Pulavarty
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gary Hade