31 May, 2010

2 commits


28 May, 2010

20 commits

  • Cc: Christoph Hellwig
    Acked-by: Hugh Dickins
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    npiggin@suse.de
     
  • Introduce a new truncate calling sequence into fs/mm subsystems. Rather than
    setattr > vmtruncate > truncate, have filesystems call their truncate sequence
    from ->setattr if filesystem specific operations are required. vmtruncate is
    deprecated, and truncate_pagecache and inode_newsize_ok helpers introduced
    previously should be used.

    simple_setattr is introduced for simple in-ram filesystems to implement
    the new truncate sequence. Eventually all filesystems should be converted
    to implement a setattr, and the default code in notify_change should go
    away.

    simple_setsize is also introduced to perform just the ATTR_SIZE portion
    of simple_setattr (ie. changing i_size and trimming pagecache).

    To implement the new truncate sequence:
    - filesystem specific manipulations (eg freeing blocks) must be done in
    the setattr method rather than ->truncate.
    - vmtruncate can not be used by core code to trim blocks past i_size in
    the event of write failure after allocation, so this must be performed
    in the fs code.
    - convert usage of helpers block_write_begin, nobh_write_begin,
    cont_write_begin, and *blockdev_direct_IO* to use _newtrunc postfixed
    variants. These avoid calling vmtruncate to trim blocks (see previous).
    - inode_setattr should not be used. generic_setattr is a new function
    to be used to copy simple attributes into the generic inode.
    - make use of the better opportunity to handle errors with the new sequence.

    Big problem with the previous calling sequence: the filesystem is not called
    until i_size has already changed. This means it is not allowed to fail the
    call, and also it does not know what the previous i_size was. Also, generic
    code calling vmtruncate to truncate allocated blocks in case of error had
    no good way to return a meaningful error (or, for example, atomically handle
    block deallocation).

    Cc: Christoph Hellwig
    Acked-by: Jan Kara
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    npiggin@suse.de
     
  • We don't name our generic fsync implementations very well currently.
    The no-op implementation for in-memory filesystems currently is called
    simple_sync_file which doesn't make too much sense to start with,
    the the generic one for simple filesystems is called simple_fsync
    which can lead to some confusion.

    This patch renames the generic file fsync method to generic_file_fsync
    to match the other generic_file_* routines it is supposed to be used
    with, and the no-op implementation to noop_fsync to make it obvious
    what to expect. In addition add some documentation for both methods.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (27 commits)
    Btrfs: add more error checking to btrfs_dirty_inode
    Btrfs: allow unaligned DIO
    Btrfs: drop verbose enospc printk
    Btrfs: Fix block generation verification race
    Btrfs: fix preallocation and nodatacow checks in O_DIRECT
    Btrfs: avoid ENOSPC errors in btrfs_dirty_inode
    Btrfs: move O_DIRECT space reservation to btrfs_direct_IO
    Btrfs: rework O_DIRECT enospc handling
    Btrfs: use async helpers for DIO write checksumming
    Btrfs: don't walk around with task->state != TASK_RUNNING
    Btrfs: do aio_write instead of write
    Btrfs: add basic DIO read/write support
    direct-io: do not merge logically non-contiguous requests
    direct-io: add a hook for the fs to provide its own submit_bio function
    fs: allow short direct-io reads to be completed via buffered IO
    Btrfs: Metadata ENOSPC handling for balance
    Btrfs: Pre-allocate space for data relocation
    Btrfs: Metadata ENOSPC handling for tree log
    Btrfs: Metadata reservation for orphan inodes
    Btrfs: Introduce global metadata reservation
    ...

    Linus Torvalds
     
  • Example usage of generic "numa_mem_id()":

    The mainline slab code, since ~ 2.6.19, does not handle memoryless nodes
    well. Specifically, the "fast path"--____cache_alloc()--will never
    succeed as slab doesn't cache offnode object on the per cpu queues, and
    for memoryless nodes, all memory will be "off node" relative to
    numa_node_id(). This adds significant overhead to all kmem cache
    allocations, incurring a significant regression relative to earlier
    kernels [from before slab.c was reorganized].

    This patch uses the generic topology function "numa_mem_id()" to return
    the "effective local memory node" for the calling context. This is the
    first node in the local node's generic fallback zonelist-- the same node
    that "local" mempolicy-based allocations would use. This lets slab cache
    these "local" allocations and avoid fallback/refill on every allocation.

    N.B.: Slab will need to handle node and memory hotplug events that could
    change the value returned by numa_mem_id() for any given node if recent
    changes to address memory hotplug don't already address this. E.g., flush
    all per cpu slab queues before rebuilding the zonelists while the
    "machine" is held in the stopped state.

    Performance impact on "hackbench 400 process 200"

    2.6.34-rc3-mmotm-100405-1609 no-patch this-patch
    ia64 no memoryless nodes [avg of 10]: 11.713 11.637 ~0.65 diff
    ia64 cpus all on memless nodes [10]: 228.259 26.484 ~8.6x speedup

    The slowdown of the patched kernel from ~12 sec to ~28 seconds when
    configured with memoryless nodes is the result of all cpus allocating from
    a single node's mm pagepool. The cache lines of the single node are
    distributed/interleaved over the memory of the real physical nodes, but
    the zone lock, list heads, ... of the single node with memory still each
    live in a single cache line that is accessed from all processors.

    x86_64 [8x6 AMD] [avg of 40]: 2.883 2.845

    Signed-off-by: Lee Schermerhorn
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Nick Piggin
    Cc: David Rientjes
    Cc: Eric Whitney
    Cc: KAMEZAWA Hiroyuki
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: "Luck, Tony"
    Cc: Pekka Enberg
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Introduce numa_mem_id(), based on generic percpu variable infrastructure
    to track "nearest node with memory" for archs that support memoryless
    nodes.

    Define API in when CONFIG_HAVE_MEMORYLESS_NODES
    defined, else stubs. Architectures will define HAVE_MEMORYLESS_NODES
    if/when they support them.

    Archs can override definitions of:

    numa_mem_id() - returns node number of "local memory" node
    set_numa_mem() - initialize [this cpus'] per cpu variable 'numa_mem'
    cpu_to_mem() - return numa_mem for specified cpu; may be used as lvalue

    Generic initialization of 'numa_mem' occurs in __build_all_zonelists().
    This will initialize the boot cpu at boot time, and all cpus on change of
    numa_zonelist_order, or when node or memory hot-plug requires zonelist
    rebuild. Archs that support memoryless nodes will need to initialize
    'numa_mem' for secondary cpus as they're brought on-line.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Christoph Lameter
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Nick Piggin
    Cc: David Rientjes
    Cc: Eric Whitney
    Cc: KAMEZAWA Hiroyuki
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: "Luck, Tony"
    Cc: Pekka Enberg
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Rework the generic version of the numa_node_id() function to use the new
    generic percpu variable infrastructure.

    Guard the new implementation with a new config option:

    CONFIG_USE_PERCPU_NUMA_NODE_ID.

    Archs which support this new implemention will default this option to 'y'
    when NUMA is configured. This config option could be removed if/when all
    archs switch over to the generic percpu implementation of numa_node_id().
    Arch support involves:

    1) converting any existing per cpu variable implementations to use
    this implementation. x86_64 is an instance of such an arch.
    2) archs that don't use a per cpu variable for numa_node_id() will
    need to initialize the new per cpu variable "numa_node" as cpus
    are brought on-line. ia64 is an example.
    3) Defining USE_PERCPU_NUMA_NODE_ID in arch dependent Kconfig--e.g.,
    when NUMA is configured. This is required because I have
    retained the old implementation by default to allow archs to
    be modified incrementally, as desired.

    Subsequent patches will convert x86_64 and ia64 to use this implemenation.

    Signed-off-by: Lee Schermerhorn
    Cc: Tejun Heo
    Cc: Mel Gorman
    Reviewed-by: Christoph Lameter
    Cc: Nick Piggin
    Cc: David Rientjes
    Cc: Eric Whitney
    Cc: KAMEZAWA Hiroyuki
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: "Luck, Tony"
    Cc: Pekka Enberg
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • By the previous modification, the cpu notifier can return encapsulate
    errno value. This converts the cpu notifiers for slab.

    Signed-off-by: Akinobu Mita
    Cc: Christoph Lameter
    Acked-by: Pekka Enberg
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • We have observed several workloads running on multi-node systems where
    memory is assigned unevenly across the nodes in the system. There are
    numerous reasons for this but one is the round-robin rotor in
    cpuset_mem_spread_node().

    For example, a simple test that writes a multi-page file will allocate
    pages on nodes 0 2 4 6 ... Odd nodes are skipped. (Sometimes it
    allocates on odd nodes & skips even nodes).

    An example is shown below. The program "lfile" writes a file consisting
    of 10 pages. The program then mmaps the file & uses get_mempolicy(...,
    MPOL_F_NODE) to determine the nodes where the file pages were allocated.
    The output is shown below:

    # ./lfile
    allocated on nodes: 2 4 6 0 1 2 6 0 2

    There is a single rotor that is used for allocating both file pages & slab
    pages. Writing the file allocates both a data page & a slab page
    (buffer_head). This advances the RR rotor 2 nodes for each page
    allocated.

    A quick confirmation seems to confirm this is the cause of the uneven
    allocation:

    # echo 0 >/dev/cpuset/memory_spread_slab
    # ./lfile
    allocated on nodes: 6 7 8 9 0 1 2 3 4 5

    This patch introduces a second rotor that is used for slab allocations.

    Signed-off-by: Jack Steiner
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Paul Menage
    Cc: Jack Steiner
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jack Steiner
     
  • Introduce struct mem_cgroup_thresholds. It helps to reduce number of
    checks of thresholds type (memory or mem+swap).

    [akpm@linux-foundation.org: repair comment]
    Signed-off-by: Kirill A. Shutemov
    Cc: Phil Carmody
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Paul Menage
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Since we are unable to handle an error returned by
    cftype.unregister_event() properly, let's make the callback
    void-returning.

    mem_cgroup_unregister_event() has been rewritten to be a "never fail"
    function. On mem_cgroup_usage_register_event() we save old buffer for
    thresholds array and reuse it in mem_cgroup_usage_unregister_event() to
    avoid allocation.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Phil Carmody
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Paul Menage
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • FILE_MAPPED per memcg of migrated file cache is not properly updated,
    because our hook in page_add_file_rmap() can't know to which memcg
    FILE_MAPPED should be counted.

    Basically, this patch is for fixing the bug but includes some big changes
    to fix up other messes.

    Now, at migrating mapped file, events happen in following sequence.

    1. allocate a new page.
    2. get memcg of an old page.
    3. charge ageinst a new page before migration. But at this point,
    no changes to new page's page_cgroup, no commit for the charge.
    (IOW, PCG_USED bit is not set.)
    4. page migration replaces radix-tree, old-page and new-page.
    5. page migration remaps the new page if the old page was mapped.
    6. Here, the new page is unlocked.
    7. memcg commits the charge for newpage, Mark the new page's page_cgroup
    as PCG_USED.

    Because "commit" happens after page-remap, we can count FILE_MAPPED
    at "5", because we should avoid to trust page_cgroup->mem_cgroup.
    if PCG_USED bit is unset.
    (Note: memcg's LRU removal code does that but LRU-isolation logic is used
    for helping it. When we overwrite page_cgroup->mem_cgroup, page_cgroup is
    not on LRU or page_cgroup->mem_cgroup is NULL.)

    We can lose file_mapped accounting information at 5 because FILE_MAPPED
    is updated only when mapcount changes 0->1. So we should catch it.

    BTW, historically, above implemntation comes from migration-failure
    of anonymous page. Because we charge both of old page and new page
    with mapcount=0, we can't catch
    - the page is really freed before remap.
    - migration fails but it's freed before remap
    or .....corner cases.

    New migration sequence with memcg is:

    1. allocate a new page.
    2. mark PageCgroupMigration to the old page.
    3. charge against a new page onto the old page's memcg. (here, new page's pc
    is marked as PageCgroupUsed.)
    4. page migration replaces radix-tree, page table, etc...
    5. At remapping, new page's page_cgroup is now makrked as "USED"
    We can catch 0->1 event and FILE_MAPPED will be properly updated.

    And we can catch SWAPOUT event after unlock this and freeing this
    page by unmap() can be caught.

    7. Clear PageCgroupMigration of the old page.

    So, FILE_MAPPED will be correctly updated.

    Then, for what MIGRATION flag is ?
    Without it, at migration failure, we may have to charge old page again
    because it may be fully unmapped. "charge" means that we have to dive into
    memory reclaim or something complated. So, it's better to avoid
    charge it again. Before this patch, __commit_charge() was working for
    both of the old/new page and fixed up all. But this technique has some
    racy condtion around FILE_MAPPED and SWAPOUT etc...
    Now, the kernel use MIGRATION flag and don't uncharge old page until
    the end of migration.

    I hope this change will make memcg's page migration much simpler. This
    page migration has caused several troubles. Worth to add a flag for
    simplification.

    Reviewed-by: Daisuke Nishimura
    Tested-by: Daisuke Nishimura
    Reported-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    akpm@linux-foundation.org
     
  • Only an out of memory error will cause ret to be set.

    Signed-off-by: Phil Carmody
    Acked-by: Kirill A. Shutemov
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Phil Carmody
     
  • The bottom 4 hunks are atomically changing memory to which there are no
    aliases as it's freshly allocated, so there's no need to use atomic
    operations.

    The other hunks are just atomic_read and atomic_set, and do not involve
    any read-modify-write. The use of atomic_{read,set} doesn't prevent a
    read/write or write/write race, so if a race were possible (I'm not saying
    one is), then it would still be there even with atomic_set.

    See:
    http://digitalvampire.org/blog/index.php/2007/05/13/atomic-cargo-cults/

    Signed-off-by: Phil Carmody
    Acked-by: Kirill A. Shutemov
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Phil Carmody
     
  • It's pointless to try to kill current if select_bad_process() did not find
    an eligible task to kill in mem_cgroup_out_of_memory() since it's
    guaranteed that current is a member of the memcg that is oom and it is, by
    definition, unkillable.

    Signed-off-by: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Li Zefan
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • This patch adds support for moving charge of file pages, which include
    normal file, tmpfs file and swaps of tmpfs file. It's enabled by setting
    bit 1 of /memory.move_charge_at_immigrate.

    Unlike the case of anonymous pages, file pages(and swaps) in the range
    mmapped by the task will be moved even if the task hasn't done page fault,
    i.e. they might not be the task's "RSS", but other task's "RSS" that maps
    the same file. And mapcount of the page is ignored(the page can be moved
    even if page_mapcount(page) > 1). So, conditions that the page/swap
    should be met to be moved is that it must be in the range mmapped by the
    target task and it must be charged to the old cgroup.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix warning]
    Signed-off-by: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • This patch cleans up move charge code by:

    - define functions to handle pte for each types, and make
    is_target_pte_for_mc() cleaner.

    - instead of checking the MOVE_CHARGE_TYPE_ANON bit, define a function
    that checks the bit.

    Signed-off-by: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • This adds a feature to disable oom-killer for memcg, if disabled, of
    course, tasks under memcg will stop.

    But now, we have oom-notifier for memcg. And the world around memcg is
    not under out-of-memory. memcg's out-of-memory just shows memcg hits
    limit. Then, administrator or management daemon can recover the situation
    by

    - kill some process
    - enlarge limit, add more swap.
    - migrate some tasks
    - remove file cache on tmps (difficult ?)

    Unlike oom-killer, you can take enough information before killing tasks.
    (by gcore, or, ps etc.)

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Considering containers or other resource management softwares in userland,
    event notification of OOM in memcg should be implemented. Now, memcg has
    "threshold" notifier which uses eventfd, we can make use of it for oom
    notification.

    This patch adds oom notification eventfd callback for memcg. The usage is
    very similar to threshold notifier, but control file is memory.oom_control
    and no arguments other than eventfd is required.

    % cgroup_event_notifier /cgroup/A/memory.oom_control dummy
    (About cgroup_event_notifier, see Documentation/cgroup/)

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: David Rientjes
    Cc: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • memcg's oom waitqueue is a system-wide wait_queue (for handling
    hierarchy.) So, it's better to add custom wake function and do filtering
    in wake up path.

    This patch adds a filtering feature for waking up oom-waiters. Hierarchy
    is properly handled.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

27 May, 2010

1 commit

  • I/O errors can happen due to temporary failures, like multipath
    errors or losing network contact with the iSCSI server. Because
    of that, the VM will retry readpage on the page.

    However, do_generic_file_read does not clear PG_error. This
    causes the system to be unable to actually use the data in the
    page cache page, even if the subsequent readpage completes
    successfully!

    The function filemap_fault has had a ClearPageError before
    readpage forever. This patch simply adds the same to
    do_generic_file_read.

    Signed-off-by: Jeff Moyer
    Signed-off-by: Rik van Riel
    Acked-by: Larry Woodman
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Jeff Moyer
     

26 May, 2010

2 commits

  • Slightly rearrange the logic that determines capabilities and vm_flags.
    Disable BDI_CAP_MAP_DIRECT in all cases if the device can't support the
    protections. Allow private readonly mappings of readonly backing devices.

    Signed-off-by: Bernd Schmidt
    Signed-off-by: Mike Frysinger
    Acked-by: David McCullough
    Acked-by: Greg Ungerer
    Acked-by: Paul Mundt
    Acked-by: David Howells
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bernd Schmidt
     
  • The original code called mpol_put(new) while "new" was an ERR_PTR.

    Signed-off-by: Dan Carpenter
    Cc: Lee Schermerhorn
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Carpenter
     

25 May, 2010

15 commits

  • Add global mutex zonelists_mutex to fix the possible race:

    CPU0 CPU1 CPU2
    (1) zone->present_pages += online_pages;
    (2) build_all_zonelists();
    (3) alloc_page();
    (4) free_page();
    (5) build_all_zonelists();
    (6) __build_all_zonelists();
    (7) zone->pageset = alloc_percpu();

    In step (3,4), zone->pageset still points to boot_pageset, so bad
    things may happen if 2+ nodes are in this state. Even if only 1 node
    is accessing the boot_pageset, (3) may still consume too much memory
    to fail the memory allocations in step (7).

    Besides, atomic operation ensures alloc_percpu() in step (7) will never fail
    since there is a new fresh memory block added in step(6).

    [haicheng.li@linux.intel.com: hold zonelists_mutex when build_all_zonelists]
    Signed-off-by: Haicheng Li
    Signed-off-by: Wu Fengguang
    Reviewed-by: Andi Kleen
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Haicheng Li
     
  • For each new populated zone of hotadded node, need to update its pagesets
    with dynamically allocated per_cpu_pageset struct for all possible CPUs:

    1) Detach zone->pageset from the shared boot_pageset
    at end of __build_all_zonelists().

    2) Use mutex to protect zone->pageset when it's still
    shared in onlined_pages()

    Otherwises, multiple zones of different nodes would share same boot strapping
    boot_pageset for same CPU, which will finally cause below kernel panic:

    ------------[ cut here ]------------
    kernel BUG at mm/page_alloc.c:1239!
    invalid opcode: 0000 [#1] SMP
    ...
    Call Trace:
    [] __alloc_pages_nodemask+0x131/0x7b0
    [] alloc_pages_current+0x87/0xd0
    [] __page_cache_alloc+0x67/0x70
    [] __do_page_cache_readahead+0x120/0x260
    [] ra_submit+0x21/0x30
    [] ondemand_readahead+0x166/0x2c0
    [] page_cache_async_readahead+0x80/0xa0
    [] generic_file_aio_read+0x364/0x670
    [] nfs_file_read+0xca/0x130
    [] do_sync_read+0xfa/0x140
    [] vfs_read+0xb5/0x1a0
    [] sys_read+0x51/0x80
    [] system_call_fastpath+0x16/0x1b
    RIP [] get_page_from_freelist+0x883/0x900
    RSP
    ---[ end trace 4bda28328b9990db ]

    [akpm@linux-foundation.org: merge fix]
    Signed-off-by: Haicheng Li
    Signed-off-by: Wu Fengguang
    Reviewed-by: Andi Kleen
    Reviewed-by: Christoph Lameter
    Cc: Mel Gorman
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Haicheng Li
     
  • No behavior change here.

    Move some of setup_per_cpu_pageset() code into a new function
    setup_zone_pageset() that will be useful for memory hotplug.

    Signed-off-by: Wu Fengguang
    Signed-off-by: Haicheng Li
    Reviewed-by: Andi Kleen
    Reviewed-by: Christoph Lameter
    Cc: Mel Gorman
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • In f4112de6b679d84bd9b9681c7504be7bdfb7c7d5 ("mm: introduce
    debug_kmap_atomic") I said that debug_kmap_atomic() needs
    CONFIG_TRACE_IRQFLAGS_SUPPORT.

    It was wrong. (I thought irqs_disabled() is only available when the
    architecture has CONFIG_TRACE_IRQFLAGS_SUPPORT)

    Remove the #ifdef CONFIG_TRACE_IRQFLAGS_SUPPORT check to enable
    kmap_atomic() debugging for the architectures which do not have
    CONFIG_TRACE_IRQFLAGS_SUPPORT.

    Reported-by: Andrew Morton
    Signed-off-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • Enable users to online CPUs even if the CPUs belongs to a numa node which
    doesn't have onlined local memory.

    The zonlists(pg_data_t.node_zonelists[]) of a numa node are created either
    in system boot/init period, or at the time of local memory online. For a
    numa node without onlined local memory, its zonelists are not initialized
    at present. As a result, any memory allocation operations executed by
    CPUs within this node will fail. In fact, an out-of-memory error is
    triggered when attempt to online CPUs before memory comes to online.

    This patch tries to create zonelists for such numa nodes, so that the
    memory allocation for this node can be fallback'ed to other nodes.

    [akpm@linux-foundation.org: remove unneeded export]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: minskey guo
    Cc: Minchan Kim
    Cc: Yasunori Goto
    Cc: Andi Kleen
    Cc: Christoph Lameter
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    minskey guo
     
  • For now, we have global isolation vs. memory control group isolation, do
    not allow the reclaim entry function to set an arbitrary page isolation
    callback, we do not need that flexibility.

    And since we already pass around the group descriptor for the memory
    control group isolation case, just use it to decide which one of the two
    isolator functions to use.

    The decisions can be merged into nearby branches, so no extra cost there.
    In fact, we save the indirect calls.

    Signed-off-by: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • This scan control is abused to communicate a return value from
    shrink_zones(). Write this idiomatically and remove the knob.

    Signed-off-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Signed-off-by: Johannes Weiner
    Cc: Dan Carpenter
    Cc: Rik van Riel
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • free_hot_cold_page() and __free_pages_ok() have very similar freeing
    preparation. Consolidate them.

    [akpm@linux-foundation.org: fix busted coding style]
    Signed-off-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • If vmscan is under lumpy reclaim mode, it have to ignore referenced bit
    for making contenious free pages. but current page_check_references()
    doesn't.

    Fix it.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Fix a wrong comment over page_cache_async_readahead().

    Signed-off-by: Huang Shijie
    Acked-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie
     
  • get_scan_ratio() calculates percentage and if the percentage is < 1%, it
    will round percentage down to 0% and cause we completely ignore scanning
    anon/file pages to reclaim memory even the total anon/file pages are very
    big.

    To avoid underflow, we don't use percentage, instead we directly calculate
    how many pages should be scaned. In this way, we should get several
    scanned pages for < 1% percent.

    This has some benefits:

    1. increase our calculation precision

    2. making our scan more smoothly. Without this, if percent[x] is
    underflow, shrink_zone() doesn't scan any pages and suddenly it scans
    all pages when priority is zero. With this, even priority isn't zero,
    shrink_zone() gets chance to scan some pages.

    Note, this patch doesn't really change logics, but just increase
    precision. For system with a lot of memory, this might slightly changes
    behavior. For example, in a sequential file read workload, without the
    patch, we don't swap any anon pages. With it, if anon memory size is
    bigger than 16G, we will see one anon page swapped. The 16G is calculated
    as PAGE_SIZE * priority(4096) * (fp/ap). fp/ap is assumed to be 1024
    which is common in this workload. So the impact sounds not a big deal.

    Signed-off-by: Shaohua Li
    Cc: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Use mm->task_size instead of TASK_SIZE to ensure that the entire user
    address space is migrated. mm->task_size is independent of the calling
    task context. TASK SIZE may be dependant on the address space size of the
    calling process. Usage of TASK_SIZE can lead to partial address space
    migration if the calling process was 32 bit and the migrating process was
    64 bit.

    Here is the test script used on 64 system with a 32 bit echo process:

    mount -t cgroup none /cgroup -o cpuset
    cd /cgroup

    mkdir 0
    echo 1 > 0/cpuset.cpus
    echo 0 > 0/cpuset.mems
    echo 1 > 0/cpuset.memory_migrate

    mkdir 1
    echo 1 > 1/cpuset.cpus
    echo 1 > 1/cpuset.mems
    echo 1 > 1/cpuset.memory_migrate

    echo $$ > 0/tasks
    64_bit_process &
    pid=$!

    echo $pid > 1/tasks # This does not migrate all process pages without
    # this patch. If 64 bit echo is used or this patch is
    # applied, then the full address space of $pid is
    # migrated.

    To check memory migration, I watched:
    grep MemUsed /sys/devices/system/node/node*/meminfo

    Signed-off-by: Greg Thelen
    Acked-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     
  • The fragmentation index may indicate that a failure is due to external
    fragmentation but after a compaction run completes, it is still possible
    for an allocation to fail. There are two obvious reasons as to why

    o Page migration cannot move all pages so fragmentation remains
    o A suitable page may exist but watermarks are not met

    In the event of compaction followed by an allocation failure, this patch
    defers further compaction in the zone (1 << compact_defer_shift) times.
    If the next compaction attempt also fails, compact_defer_shift is
    increased up to a maximum of 6. If compaction succeeds, the defer
    counters are reset again.

    The zone that is deferred is the first zone in the zonelist - i.e. the
    preferred zone. To defer compaction in the other zones, the information
    would need to be stored in the zonelist or implemented similar to the
    zonelist_cache. This would impact the fast-paths and is not justified at
    this time.

    Signed-off-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • …hen it should be reclaimed

    The kernel applies some heuristics when deciding if memory should be
    compacted or reclaimed to satisfy a high-order allocation. One of these
    is based on the fragmentation. If the index is below 500, memory will not
    be compacted. This choice is arbitrary and not based on data. To help
    optimise the system and set a sensible default for this value, this patch
    adds a sysctl extfrag_threshold. The kernel will only compact memory if
    the fragmentation index is above the extfrag_threshold.

    [randy.dunlap@oracle.com: Fix build errors when proc fs is not configured]
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Minchan Kim <minchan.kim@gmail.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Christoph Lameter <cl@linux-foundation.org>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman