23 Dec, 2009

1 commit

  • * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (36 commits)
    powerpc/gc/wii: Remove get_irq_desc()
    powerpc/gc/wii: hlwd-pic: convert irq_desc.lock to raw_spinlock
    powerpc/gamecube/wii: Fix off-by-one error in ugecon/usbgecko_udbg
    powerpc/mpic: Fix problem that affinity is not updated
    powerpc/mm: Fix stupid bug in subpge protection handling
    powerpc/iseries: use DECLARE_COMPLETION_ONSTACK for non-constant completion
    powerpc: Fix MSI support on U4 bridge PCIe slot
    powerpc: Handle VSX alignment faults correctly in little-endian mode
    powerpc/mm: Fix typo of cpumask_clear_cpu()
    powerpc/mm: Fix hash_utils_64.c compile errors with DEBUG enabled.
    powerpc: Convert BUG() to use unreachable()
    powerpc/pseries: Make declarations of cpu_hotplug_driver_lock() ANSI compatible.
    powerpc/pseries: Don't panic when H_PROD fails during cpu-online.
    powerpc/mm: Fix a WARN_ON() with CONFIG_DEBUG_PAGEALLOC and CONFIG_DEBUG_VM
    powerpc/defconfigs: Set HZ=100 on pseries and ppc64 defconfigs
    powerpc/defconfigs: Disable token ring in powerpc defconfigs
    powerpc/defconfigs: Reduce 64bit vmlinux by making acenic and cramfs modules
    powerpc/pseries: Select XICS and PCI_MSI PSERIES
    powerpc/85xx: Wrong variable returned on error
    powerpc/iseries: Convert to proc_fops
    ...

    Linus Torvalds
     

22 Dec, 2009

1 commit

  • The injector filter requires stable_page_flags() which is supplied
    by procfs. So make it dependent on that.

    Also add ifdefs around the filter code in memory-failure.c so that
    when the filter is disabled due to missing dependencies the whole
    code still builds.

    Reported-by: Ingo Molnar
    Signed-off-by: Andi Kleen

    Andi Kleen
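    The build-with-the-feature-off trick described above is the usual stub
    pattern: compile the real code only when the config symbol is set, and fall
    back to trivial inline stubs otherwise, so every caller keeps building. A
    minimal stand-alone sketch (FILTER_ENABLED and filter_page() are
    illustrative names, not the actual memory-failure.c symbols):

        #include <stdio.h>

        /* Build with -DFILTER_ENABLED for the real filter; without it the
         * stub keeps all callers compiling and behaves as "filter off". */
        #ifdef FILTER_ENABLED
        static int filter_page(unsigned long pfn)
        {
            /* a real filter would inspect the page's flags here */
            return pfn % 2;                 /* placeholder decision */
        }
        #else
        static inline int filter_page(unsigned long pfn)
        {
            (void)pfn;
            return 0;                       /* filtering disabled: keep every page */
        }
        #endif

        int main(void)
        {
            printf("filter_page(3) = %d\n", filter_page(3));
            return 0;
        }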
     

20 Dec, 2009

1 commit

  • …git/tip/linux-2.6-tip

    * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86, irq: Allow 0xff for /proc/irq/[n]/smp_affinity on an 8-cpu system
    Makefile: Unexport LC_ALL instead of clearing it
    x86: Fix objdump version check in arch/x86/tools/chkobjdump.awk
    x86: Reenable TSC sync check at boot, even with NONSTOP_TSC
    x86: Don't use POSIX character classes in gen-insn-attr-x86.awk
    Makefile: set LC_CTYPE, LC_COLLATE, LC_NUMERIC to C
    x86: Increase MAX_EARLY_RES; insufficient on 32-bit NUMA
    x86: Fix checking of SRAT when node 0 ram is not from 0
    x86, cpuid: Add "volatile" to asm in native_cpuid()
    x86, msr: msrs_alloc/free for CONFIG_SMP=n
    x86, amd: Get multi-node CPU info from NodeId MSR instead of PCI config space
    x86: Add IA32_TSC_AUX MSR and use it
    x86, msr/cpuid: Register enough minors for the MSR and CPUID drivers
    initramfs: add missing decompressor error check
    bzip2: Add missing checks for malloc returning NULL
    bzip2/lzma/gzip: pre-boot malloc doesn't return NULL on failure

    Linus Torvalds
     

18 Dec, 2009

5 commits

  • Memory balloon drivers can allocate a large amount of memory which is not
    movable but could be freed to accommodate memory hotplug remove.

    Prior to calling the memory hotplug notifier chain the memory in the
    pageblock is isolated. Currently, if the migrate type is not
    MIGRATE_MOVABLE the isolation will not proceed, causing the memory removal
    for that page range to fail.

    Rather than failing pageblock isolation if the migratetype is not
    MIGRATE_MOVABLE, this patch uses a notifier chain to check whether all of
    the pages in the pageblock that are not on the LRU are owned by a
    registered balloon driver (or other entity). If all of the non-movable
    pages are owned by a balloon, they can be freed later through the memory
    notifier chain and the range can still be isolated in
    set_migratetype_isolate(), as sketched after this entry.

    Signed-off-by: Robert Jennings
    Cc: Mel Gorman
    Cc: Ingo Molnar
    Cc: Brian King
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Gerald Schaefer
    Cc: KAMEZAWA Hiroyuki
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Benjamin Herrenschmidt

    Robert Jennings
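    A rough, stand-alone model of the check described above: before isolating a
    pageblock, ask a chain of registered callbacks whether they own (and can
    free) the non-movable pages in it, and only fail isolation if something
    unmovable is left unclaimed. All names below are illustrative; the kernel
    side uses a memory-isolation notifier chain that balloon drivers register
    with.

        #include <stdbool.h>
        #include <stdio.h>

        #define PAGES_PER_BLOCK 8

        /* A registered callback reports how many of the non-movable pages it
         * owns and could free on demand (stand-in for a balloon driver). */
        typedef int (*claim_fn)(const int *movable, int n);

        static claim_fn chain[4];
        static int nchain;

        static void register_claimer(claim_fn fn)
        {
            chain[nchain++] = fn;
        }

        /* Balloon driver: in this toy model it owns every non-movable page. */
        static int balloon_claim(const int *movable, int n)
        {
            int claimed = 0;

            for (int i = 0; i < n; i++)
                if (!movable[i])
                    claimed++;
            return claimed;
        }

        /* Isolation check: instead of failing outright when the block is not
         * all-movable, succeed if every non-movable page is claimed. */
        static bool can_isolate(const int *movable, int n)
        {
            int unmovable = 0, claimed = 0;

            for (int i = 0; i < n; i++)
                if (!movable[i])
                    unmovable++;
            for (int i = 0; i < nchain; i++)
                claimed += chain[i](movable, n);
            return claimed >= unmovable;
        }

        int main(void)
        {
            int block[PAGES_PER_BLOCK] = { 1, 1, 0, 1, 0, 1, 1, 1 };

            register_claimer(balloon_claim);
            printf("isolation %s\n",
                   can_isolate(block, PAGES_PER_BLOCK) ? "proceeds" : "fails");
            return 0;
        }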
     
  • …/rusty/linux-2.6-for-linus

    * 'cpumask-cleanups' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
    cpumask: rename tsk_cpumask to tsk_cpus_allowed
    cpumask: don't recommend set_cpus_allowed hack in Documentation/cpu-hotplug.txt
    cpumask: avoid dereferencing struct cpumask
    cpumask: convert drivers/idle/i7300_idle.c to cpumask_var_t
    cpumask: use modern cpumask style in drivers/scsi/fcoe/fcoe.c
    cpumask: avoid deprecated function in mm/slab.c
    cpumask: use cpu_online in kernel/perf_event.c

    Linus Torvalds
     
  • …s/security-testing-2.6

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6:
    Keys: KEYCTL_SESSION_TO_PARENT needs TIF_NOTIFY_RESUME architecture support
    NOMMU: Optimise away the {dac_,}mmap_min_addr tests
    security/min_addr.c: make init_mmap_min_addr() static
    keys: PTR_ERR return of wrong pointer in keyctl_get_security()

    Linus Torvalds
     
  • * 'kmemleak' of git://linux-arm.org/linux-2.6:
    kmemleak: fix kconfig for crc32 build error
    kmemleak: Reduce the false positives by checking for modified objects
    kmemleak: Show the age of an unreferenced object
    kmemleak: Release the object lock before calling put_object()
    kmemleak: Scan the _ftrace_events section in modules
    kmemleak: Simplify the kmemleak_scan_area() function prototype
    kmemleak: Do not use off-slab management with SLAB_NOLEAKTRACE

    Linus Torvalds
     
  • I added blk_run_backing_dev on page_cache_async_readahead so readahead I/O
    is unplugged to improve throughput, especially in RAID environments.

    The normal case is, if page N become uptodate at time T(N), then T(N)
    Acked-by: Wu Fengguang
    Cc: Jens Axboe
    Cc: KOSAKI Motohiro
    Tested-by: Ronald
    Cc: Bart Van Assche
    Cc: Vladislav Bolkhovitin
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hisashi Hifumi
     

17 Dec, 2009

12 commits

  • These days we use cpumask_empty() which takes a pointer.

    Signed-off-by: Rusty Russell
    Acked-by: Christoph Lameter

    Rusty Russell
     
  • Replacing
    error = 0;
    if (error)
    op
    with nothing is not quite an equivalent transformation ;-)

    Signed-off-by: Al Viro

    Al Viro
     
  • Found one system that boots from socket1 instead of socket0; SRAT gets rejected...

    [ 0.000000] SRAT: Node 1 PXM 0 0-a0000
    [ 0.000000] SRAT: Node 1 PXM 0 100000-80000000
    [ 0.000000] SRAT: Node 1 PXM 0 100000000-2080000000
    [ 0.000000] SRAT: Node 0 PXM 1 2080000000-4080000000
    [ 0.000000] SRAT: Node 2 PXM 2 4080000000-6080000000
    [ 0.000000] SRAT: Node 3 PXM 3 6080000000-8080000000
    [ 0.000000] SRAT: Node 4 PXM 4 8080000000-a080000000
    [ 0.000000] SRAT: Node 5 PXM 5 a080000000-c080000000
    [ 0.000000] SRAT: Node 6 PXM 6 c080000000-e080000000
    [ 0.000000] SRAT: Node 7 PXM 7 e080000000-10080000000
    ...
    [ 0.000000] NUMA: Allocated memnodemap from 500000 - 701040
    [ 0.000000] NUMA: Using 20 for the hash shift.
    [ 0.000000] Adding active range (0, 0x2080000, 0x4080000) 0 entries of 3200 used
    [ 0.000000] Adding active range (1, 0x0, 0x96) 1 entries of 3200 used
    [ 0.000000] Adding active range (1, 0x100, 0x7f750) 2 entries of 3200 used
    [ 0.000000] Adding active range (1, 0x100000, 0x2080000) 3 entries of 3200 used
    [ 0.000000] Adding active range (2, 0x4080000, 0x6080000) 4 entries of 3200 used
    [ 0.000000] Adding active range (3, 0x6080000, 0x8080000) 5 entries of 3200 used
    [ 0.000000] Adding active range (4, 0x8080000, 0xa080000) 6 entries of 3200 used
    [ 0.000000] Adding active range (5, 0xa080000, 0xc080000) 7 entries of 3200 used
    [ 0.000000] Adding active range (6, 0xc080000, 0xe080000) 8 entries of 3200 used
    [ 0.000000] Adding active range (7, 0xe080000, 0x10080000) 9 entries of 3200 used
    [ 0.000000] SRAT: PXMs only cover 917504MB of your 1048566MB e820 RAM. Not used.
    [ 0.000000] SRAT: SRAT not used.

    The early_node_map is not sorted, because node0 with a non-zero start comes
    first.

    So try to sort it right away after all regions are registered.

    This also fixes a regression introduced by 8716273c (x86: Export srat physical topology).

    -v2: make it more robust so it handles cross-node cases like node0 [0,4g), [8,12g) and node1 [4g,8g), [12g,16g)
    -v3: update comments.

    Reported-and-tested-by: Jens Axboe
    Signed-off-by: Yinghai Lu
    LKML-Reference:
    Signed-off-by: H. Peter Anvin

    Yinghai Lu
     
  • In NOMMU mode clamp dac_mmap_min_addr to zero to cause the tests on it to be
    skipped by the compiler. We do this as the minimum mmap address doesn't make
    any sense in NOMMU mode.

    mmap_min_addr and round_hint_to_min() can be discarded entirely in NOMMU mode.

    Signed-off-by: David Howells
    Acked-by: Eric Paris
    Signed-off-by: James Morris

    David Howells
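    The trick above relies on the comparison becoming a compile-time constant:
    when the minimum address is a literal 0, "addr < dac_mmap_min_addr" is
    always false and the compiler drops the whole test. A small stand-alone
    illustration (CONFIG_MMU and the variable name are borrowed from the
    description; the real definitions live in the kernel's security headers):

        #include <stdio.h>

        #ifdef CONFIG_MMU
        /* With an MMU the floor is a runtime-tunable variable. */
        static unsigned long dac_mmap_min_addr = 65536;
        #else
        /* Without an MMU the floor is a compile-time zero, so tests on it
         * are optimised away entirely. */
        #define dac_mmap_min_addr 0UL
        #endif

        static int mmap_addr_ok(unsigned long addr)
        {
            if (addr < dac_mmap_min_addr)   /* dead code when the macro is 0 */
                return 0;
            return 1;
        }

        int main(void)
        {
            printf("mmap_addr_ok(0x1000) = %d\n", mmap_addr_ok(0x1000));
            return 0;
        }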
     
  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (34 commits)
    HWPOISON: Remove stray phrase in a comment
    HWPOISON: Try to allocate migration page on the same node
    HWPOISON: Don't do early filtering if filter is disabled
    HWPOISON: Add a madvise() injector for soft page offlining
    HWPOISON: Add soft page offline support
    HWPOISON: Undefine short-hand macros after use to avoid namespace conflict
    HWPOISON: Use new shake_page in memory_failure
    HWPOISON: Use correct name for MADV_HWPOISON in documentation
    HWPOISON: mention HWPoison in Kconfig entry
    HWPOISON: Use get_user_page_fast in hwpoison madvise
    HWPOISON: add an interface to switch off/on all the page filters
    HWPOISON: add memory cgroup filter
    memcg: add accessor to mem_cgroup.css
    memcg: rename and export try_get_mem_cgroup_from_page()
    HWPOISON: add page flags filter
    mm: export stable page flags
    HWPOISON: limit hwpoison injector to known page types
    HWPOISON: add fs/device filters
    HWPOISON: return 0 to indicate success reliably
    HWPOISON: make semantics of IGNORED/DELAYED clear
    ...

    Linus Torvalds
     
  • * 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (38 commits)
    direct I/O fallback sync simplification
    ocfs: stop using do_sync_mapping_range
    cleanup blockdev_direct_IO locking
    make generic_acl slightly more generic
    sanitize xattr handler prototypes
    libfs: move EXPORT_SYMBOL for d_alloc_name
    vfs: force reval of target when following LAST_BIND symlinks (try #7)
    ima: limit imbalance msg
    Untangling ima mess, part 3: kill dead code in ima
    Untangling ima mess, part 2: deal with counters
    Untangling ima mess, part 1: alloc_file()
    O_TRUNC open shouldn't fail after file truncation
    ima: call ima_inode_free ima_inode_free
    IMA: clean up the IMA counts updating code
    ima: only insert at inode creation time
    ima: valid return code from ima_inode_alloc
    fs: move get_empty_filp() deffinition to internal.h
    Sanitize exec_permission_lite()
    Kill cached_lookup() and real_lookup()
    Kill path_lookup_open()
    ...

    Trivial conflicts in fs/direct-io.c

    Linus Torvalds
     
  • In the case of direct I/O falling back to buffered I/O we sync data
    twice currently: once at the end of generic_file_buffered_write using
    filemap_write_and_wait_range and once a little later in
    __generic_file_aio_write using do_sync_mapping_range with all flags set.

    The wait before write of the do_sync_mapping_range call does not make
    any sense, so just keep the filemap_write_and_wait_range call and move
    it to the right spot.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Now that we cache the ACL pointers in the generic inode all the generic_acl
    cruft can go away and generic_acl.c can directly implement xattr handlers
    dealing with the full Posix ACL semantics for in-memory filesystems.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Add a flags argument to struct xattr_handler and pass it to all xattr
    handler methods. This allows using the same methods for multiple
    handlers, e.g. for the ACL methods which perform exactly the same action
    for the access and default ACLs, just using a different underlying
    attribute. With a little more groundwork it'll also allow sharing the
    methods for the regular user/trusted/secure handlers in extN, ocfs2 and
    jffs2 like it's already done for xfs in this patch.

    Also change the inode argument to the handlers to a dentry, to allow
    using the handler mechanism for filesystems that require it later,
    e.g. cifs.

    [with GFS2 bits updated by Steven Whitehouse ]

    Signed-off-by: Christoph Hellwig
    Reviewed-by: James Morris
    Acked-by: Joel Becker
    Signed-off-by: Al Viro

    Christoph Hellwig
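    The sharing described above can be modelled with a tiny stand-alone
    example: two handler instances differ only in their flags value, and one
    shared method uses that value to decide which ACL it is operating on. The
    struct below is a simplification, not the kernel's actual struct
    xattr_handler; only the ACL_TYPE_* values and attribute names are real.

        #include <stdio.h>

        #define ACL_TYPE_ACCESS  0x8000
        #define ACL_TYPE_DEFAULT 0x4000

        struct xattr_handler {
            const char *prefix;
            int flags;                  /* handed back to the shared methods */
            int (*get)(const struct xattr_handler *h, char *buf, size_t size);
        };

        /* One method serves both handlers; flags says which ACL is meant. */
        static int acl_get(const struct xattr_handler *h, char *buf, size_t size)
        {
            const char *which =
                (h->flags == ACL_TYPE_ACCESS) ? "access" : "default";

            return snprintf(buf, size, "%s ACL via %s", which, h->prefix);
        }

        static const struct xattr_handler acl_access_handler = {
            .prefix = "system.posix_acl_access",
            .flags  = ACL_TYPE_ACCESS,
            .get    = acl_get,
        };

        static const struct xattr_handler acl_default_handler = {
            .prefix = "system.posix_acl_default",
            .flags  = ACL_TYPE_DEFAULT,
            .get    = acl_get,
        };

        int main(void)
        {
            char buf[64];

            acl_access_handler.get(&acl_access_handler, buf, sizeof(buf));
            puts(buf);
            acl_default_handler.get(&acl_default_handler, buf, sizeof(buf));
            puts(buf);
            return 0;
        }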
     
  • There are 2 groups of alloc_file() callers:
    * ones that are followed by ima_counts_get
    * ones giving non-regular files
    So let's pull that ima_counts_get() into alloc_file();
    it's a no-op in case of non-regular files.

    Signed-off-by: Al Viro

    Al Viro
     
  • ... and have the caller grab both mnt and dentry; kill
    leak in infiniband, while we are at it.

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     

16 Dec, 2009

17 commits

  • Variable `progress' isn't used in mem_cgroup_resize_limit() any more.
    Remove it.

    [akpm@linux-foundation.org: cleanup]
    Signed-off-by: Bob Liu
    Cc: Daisuke Nishimura
    Reviewed-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
     
  • memcg_tasklist was introduced at commit 7f4d454d(memcg: avoid deadlock
    caused by race between oom and cpuset_attach) instead of cgroup_mutex to
    fix a deadlock problem. The cgroup_mutex, which was removed by the
    commit, in mem_cgroup_out_of_memory() was originally introduced at commit
    c7ba5c9e (Memory controller: OOM handling).

    IIUC, the intention of this cgroup_mutex was to prevent task move during
    select_bad_process() so that situations like below can be avoided.

    Assume cgroup "foo" has exceeded its limit and is about to trigger oom.
    1. Process A, which has been in cgroup "baa" and uses large memory, is just
    moved to cgroup "foo". Process A can become a candidate for being killed.
    2. Process B, which has been in cgroup "foo" and uses large memory, is just
    moved from cgroup "foo". Process B can be excluded from the candidates for
    being killed.

    But this race window exists anyway even if we hold a lock, because
    __mem_cgroup_try_charge() decides whether it should trigger oom or not
    outside of the lock. So the original cgroup_mutex in
    mem_cgroup_out_of_memory(), and thus the current memcg_tasklist, has no use.
    And IMHO, those races are not so critical for users.

    This patch removes it and makes the code simpler.

    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • task_in_mem_cgroup(), which is called by select_bad_process() to check
    whether a task can be a candidate for being oom-killed from memcg's limit,
    checks "curr->use_hierarchy"("curr" is the mem_cgroup the task belongs
    to).

    But this check returns true (a false positive) when:

    /aa       use_hierarchy == 0
      /aa/00  use_hierarchy == 1
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • mem_cgroup_move_parent() calls try_charge first and cancel_charge on
    failure. IMHO, charge/uncharge (especially charge) is a high-cost
    operation, so we should avoid it as far as possible.

    This patch tries to delay try_charge in mem_cgroup_move_parent() by
    re-ordering checks it does.

    And this patch renames mem_cgroup_move_account() to
    __mem_cgroup_move_account(), changes the return value of
    __mem_cgroup_move_account() from int to void, and adds a new
    wrapper(mem_cgroup_move_account()), which checks whether a @pc is valid
    for moving account and calls __mem_cgroup_move_account().

    This patch removes the last caller of trylock_page_cgroup(), so removes
    its definition too.

    Signed-off-by: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • There are some places calling both res_counter_uncharge() and css_put() to
    cancel the charge and drop the refcount we got from mem_cgroup_try_charge().

    This patch introduces mem_cgroup_cancel_charge() and calls it in those
    places.

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Daisuke Nishimura
    Reviewed-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • In the global VM, FILE_MAPPED is used, but memcg uses MAPPED_FILE. This
    makes grepping difficult, so replace memcg's MAPPED_FILE with FILE_MAPPED.

    Also, in the global VM, mapped shared memory is accounted into FILE_MAPPED,
    but memcg doesn't do this; fix it.
    Note:
    page_is_file_cache() just checks SwapBacked or not.
    So, we need to check PageAnon.

    Cc: Balbir Singh
    Reviewed-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This is a patch for coalescing access to res_counter at charge time by
    percpu caching. At charge, memcg charges 64 pages and remembers them in a
    percpu cache. Because it's a cache, it is drained/flushed when necessary.

    This version uses the public percpu area.
    Two benefits of using the public percpu area:
    1. The sum of stocked charges in the system is limited by the number of
    cpus, not by the number of memcgs. This gives better synchronization.
    2. The drain code for flush/cpu-hotplug is very easy (and quick).

    The most important point of this patch is that we never touch res_counter
    in the fast path. The res_counter is a system-wide shared counter which is
    modified very frequently. We should touch it as little as possible to avoid
    false sharing; a toy model of the batching follows this entry.

    On an x86-64 8-cpu server, I tested the overhead of memcg at page fault by
    running a program which does map/fault/unmap in a loop, running one task
    per cpu via taskset, and summing the number of page faults over 60 seconds.

    [without memcg config]
    40156968 page-faults # 0.085 M/sec ( +- 0.046% )
    27.67 cache-miss/faults

    [root cgroup]
    36659599 page-faults # 0.077 M/sec ( +- 0.247% )
    31.58 cache miss/faults

    [in a child cgroup]
    18444157 page-faults # 0.039 M/sec ( +- 0.133% )
    69.96 cache miss/faults

    [ + coalescing uncharge patch]
    27133719 page-faults # 0.057 M/sec ( +- 0.155% )
    47.16 cache miss/faults

    [ + coalescing uncharge patch + this patch ]
    34224709 page-faults # 0.072 M/sec ( +- 0.173% )
    34.69 cache miss/faults

    Changelog (since Oct/2):
    - updated comments
    - replaced get_cpu_var() with __get_cpu_var() if possible.
    - removed mutex for system-wide drain. adds a counter instead of it.
    - removed CONFIG_HOTPLUG_CPU

    Changelog (old):
    - rebased onto the latest mmotm
    - moved charge size check before __GFP_WAIT check for avoiding unnecessary
    - added asynchronous flush routine.
    - fixed bugs pointed out by Nishimura-san.

    [akpm@linux-foundation.org: tweak comments]
    [nishimura@mxp.nes.nec.co.jp: don't do INIT_WORK() repeatedly against the same work_struct]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
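    A user-space toy model of the batching described above: each cpu keeps a
    local "stock" of pre-charged pages and only touches the shared counter when
    the stock runs out or is drained. The names (CHARGE_BATCH, stock, drain) are
    illustrative; the kernel's version lives in per-cpu data in mm/memcontrol.c
    and is considerably more involved.

        #include <stdio.h>

        #define NR_CPUS      8
        #define PAGE_SIZE    4096UL
        #define CHARGE_BATCH (64 * PAGE_SIZE)   /* pre-charge 64 pages at a time */

        /* Expensive, contended counter (stands in for res_counter). */
        static unsigned long shared_usage;

        /* Per-cpu cache of pre-charged bytes ("stock"). */
        static unsigned long stock[NR_CPUS];

        /* Fast path: satisfy the charge from the local stock. Slow path: do
         * one bulk update of the shared counter and refill the stock. */
        static void charge(int cpu, unsigned long bytes)
        {
            if (stock[cpu] >= bytes) {
                stock[cpu] -= bytes;            /* no shared access at all */
                return;
            }
            shared_usage += CHARGE_BATCH;       /* one contended update per batch */
            stock[cpu] += CHARGE_BATCH - bytes;
        }

        /* Return any cached charge to the shared counter (drain/flush). */
        static void drain(int cpu)
        {
            shared_usage -= stock[cpu];
            stock[cpu] = 0;
        }

        int main(void)
        {
            for (int i = 0; i < 1000; i++)
                charge(i % NR_CPUS, PAGE_SIZE); /* page-fault-like pattern */
            for (int cpu = 0; cpu < NR_CPUS; cpu++)
                drain(cpu);
            printf("charged after drain: %lu bytes (%lu pages)\n",
                   shared_usage, shared_usage / PAGE_SIZE);
            return 0;
        }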
     
  • In a massively parallel environment, res_counter can be a performance
    bottleneck. One strong technique to reduce lock contention is to reduce the
    number of calls by coalescing several calls into one.

    Considering charge/uncharge characteristics:
    - charge is done one by one via demand-paging.
    - uncharge is done by
    - in chunk at munmap, truncate, exit, execve...
    - one by one via vmscan/paging.

    It seems we have a chance to coalesce uncharges for improving scalability
    at unmap/truncation.

    This patch is for coalescing uncharges. To avoid scattering memcg
    structures into functions under /mm, this patch adds memcg batch-uncharge
    information to the task. The reason for per-task batching is to make use
    of the caller's context information. We do batched uncharge (delayed
    uncharge) when truncation/unmap occurs, but do direct uncharge when
    uncharge is called by memory reclaim (vmscan.c).

    The degree of coalescing depends on the caller:
    - at invalidate/truncate... pagevec size
    - at unmap ... ZAP_BLOCK_SIZE
    (memory itself will be freed in this degree.)
    So we will not coalesce too much.

    On an x86-64 8-cpu server, I tested the overhead of memcg at page fault by
    running a program which does map/fault/unmap in a loop, running one task
    per cpu via taskset, and summing the number of page faults over 60 seconds.

    [without memcg config]
    40156968 page-faults # 0.085 M/sec ( +- 0.046% )
    27.67 cache-miss/faults
    [root cgroup]
    36659599 page-faults # 0.077 M/sec ( +- 0.247% )
    31.58 miss/faults
    [in a child cgroup]
    18444157 page-faults # 0.039 M/sec ( +- 0.133% )
    69.96 miss/faults
    [child with this patch]
    27133719 page-faults # 0.057 M/sec ( +- 0.155% )
    47.16 miss/faults

    We can see some amount of improvement.
    (The root cgroup is not affected by this patch.)
    Another patch for "charge" will follow this one, and the numbers above will
    improve further.

    Changelog (since 2009/10/02):
    - renamed fields of memcg_batch (pages to bytes, memsw to memsw_bytes)
    - some cleanup and commentary/description updates.
    - added initialization code to copy_process(). (possible bug fix)

    Changelog(old):
    - fixed !CONFIG_MEM_CGROUP case.
    - rebased onto the latest mmotm + softlimit fix patches.
    - unified patch for callers
    - added comments.
    - made ->do_batch a bool.
    - removed css_get() et al. We don't need it.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • A memory cgroup has a memory.memsw.usage_in_bytes file. It shows the sum
    of the usage of pages and swapents in the cgroup. Presently the root
    cgroup's memsw.usage_in_bytes shows the wrong value - the number of
    swapents is not added.

    So take MEM_CGROUP_STAT_SWAPOUT into account.

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Fix node-oriented allocation handling in oom_kill.c. I myself think of this
    as a bugfix, not an enhancement.

    These days, things have changed:
    - alloc_pages() takes a nodemask as its argument, via __alloc_pages_nodemask().
    - mempolicy doesn't maintain its own private zonelists.
    (And cpuset doesn't use a nodemask for __alloc_pages_nodemask().)

    So the current oom-killer's check function is wrong.

    This patch does the following (the decision logic is sketched after this entry):
    - check the nodemask; if a nodemask is given and it doesn't cover all of
    node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY.
    - scan all zonelists under the nodemask; if it hits cpuset's wall,
    this failure is from the cpuset.
    And
    - modifies the caller of out_of_memory not to call oom if __GFP_THISNODE.
    This doesn't change "current" behavior. If callers use __GFP_THISNODE
    they should handle "page allocation failure" by themselves.

    - handle the __GFP_NOFAIL+__GFP_THISNODE path.
    This is something like a FIXME, but this gfpmask is not used now.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: David Rientjes
    Cc: Daisuke Nishimura
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
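    The decision logic sketched in the entry above can be modelled with node
    bitmasks. The CONSTRAINT_* names below match the kernel's enum; everything
    else (the masks, the simplified cpuset check) is illustrative only.

        #include <stdio.h>

        enum oom_constraint {
            CONSTRAINT_NONE,
            CONSTRAINT_CPUSET,
            CONSTRAINT_MEMORY_POLICY,
        };

        /* Toy node masks: bit n set means node n has memory / is allowed. */
        static enum oom_constraint constrained_alloc(unsigned long memory_nodes,
                                                     unsigned long nodemask,
                                                     unsigned long cpuset_nodes)
        {
            /* A mempolicy nodemask that doesn't cover every node with memory
             * means the failure was caused by the policy itself. */
            if (nodemask && (nodemask & memory_nodes) != memory_nodes)
                return CONSTRAINT_MEMORY_POLICY;

            /* Otherwise, if the allowed zones are walled off by a cpuset,
             * the failure comes from the cpuset. */
            if ((cpuset_nodes & memory_nodes) != memory_nodes)
                return CONSTRAINT_CPUSET;

            return CONSTRAINT_NONE;
        }

        int main(void)
        {
            unsigned long memory_nodes = 0xf;   /* nodes 0-3 have memory */

            printf("%d\n", constrained_alloc(memory_nodes, 0x3, 0xf)); /* 2: policy */
            printf("%d\n", constrained_alloc(memory_nodes, 0x0, 0x5)); /* 1: cpuset */
            printf("%d\n", constrained_alloc(memory_nodes, 0x0, 0xf)); /* 0: none */
            return 0;
        }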
     
  • In a typical oom analysis scenario, we frequently want to know, as a first
    step, whether the killed process has a memory leak or not. This patch adds
    vsz and rss information to the oom log to help this analysis and to save
    debugging time.

    example:
    ===================================================================
    rsyslogd invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
    Pid: 1308, comm: rsyslogd Not tainted 2.6.32-rc6 #24
    Call Trace:
    [] ?_spin_unlock+0x2b/0x40
    [] oom_kill_process+0xbe/0x2b0

    (snip)

    492283 pages non-shared
    Out of memory: kill process 2341 (memhog) score 527276 or a child
    Killed process 2341 (memhog) vsz:1054552kB, anon-rss:970588kB, file-rss:4kB
    ===========================================================================
    ^
    |
    here

    [rientjes@google.com: fix race, add pid & comm to message]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Better to have complete sentences.

    Signed-off-by: Andi Kleen

    Andi Kleen
     
  • Signed-off-by: Andi Kleen

    Andi Kleen
     
  • Signed-off-by: Andi Kleen

    Andi Kleen
     
  • Process-based injection is much easier to handle for test programs,
    which can first bring a page into a specific state and then test it.
    So add a new MADV_SOFT_OFFLINE to soft offline a page, similar
    to the existing hard offline injector (see the sketch after this entry).

    Signed-off-by: Andi Kleen

    Andi Kleen
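    A minimal user-space injector along the lines described above. It assumes
    the 2.6.33-era MADV_SOFT_OFFLINE value (101) when the system headers don't
    provide it, a kernel built with the hwpoison/soft-offline support, and root
    privileges; treat it as a sketch, not a reference test.

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>
        #include <sys/mman.h>

        #ifndef MADV_SOFT_OFFLINE
        #define MADV_SOFT_OFFLINE 101   /* generic value in 2.6.33-era headers */
        #endif

        int main(void)
        {
            long page = sysconf(_SC_PAGESIZE);
            char *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
            }
            memset(p, 0xaa, page);      /* bring the page into a known state */

            /* Ask the kernel to soft-offline the backing page. */
            if (madvise(p, page, MADV_SOFT_OFFLINE) != 0)
                perror("madvise(MADV_SOFT_OFFLINE)");

            munmap(p, page);
            return 0;
        }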
     
  • This is a simpler, gentler variant of memory_failure() for soft page
    offlining, controlled from user space. It doesn't kill anything; it just
    tries to invalidate the page and, if that doesn't work, migrate it away.

    This is useful for predictive failure analysis, where a page has
    a high rate of corrected errors, but hasn't gone bad yet. Instead
    it can be offlined early and avoided.

    The offlining is controlled from sysfs, including a new generic
    entry point for hard page offlining for symmetry too.

    We use the page isolate facility to prevent re-allocation
    race. Normally this is only used by memory hotplug. To avoid
    races with memory allocation I am using lock_system_sleep().
    This avoids the situation where memory hotplug is about
    to isolate a page range and then hwpoison undoes that work.
    This is a big hammer currently, but it is the simplest
    solution for now.

    When the page is not free or LRU we try to free pages
    from slab and other caches. The slab freeing is currently
    quite dumb and does not try to focus on the specific slab
    cache which might own the page. This could be potentially
    improved later.

    Thanks to Fengguang Wu and Haicheng Li for some fixes.

    [Added fix from Andrew Morton to adapt to new migrate_pages prototype]
    Signed-off-by: Andi Kleen

    Andi Kleen
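    The sysfs entry point mentioned above can be driven with a few lines of C.
    The path used here is an assumption about where the soft-offline attribute
    landed (it is not named in the text above), and the value written is a
    physical address; adjust both for your kernel.

        #include <stdio.h>

        int main(int argc, char **argv)
        {
            const char *path = "/sys/devices/system/memory/soft_offline_page";
            FILE *f;

            if (argc != 2) {
                fprintf(stderr, "usage: %s <physical-address, e.g. 0x12345000>\n",
                        argv[0]);
                return 1;
            }

            f = fopen(path, "w");       /* needs root */
            if (!f) {
                perror(path);
                return 1;
            }
            fprintf(f, "%s\n", argv[1]);
            return fclose(f) ? 1 : 0;
        }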
     
  • Signed-off-by: Andi Kleen

    Andi Kleen