13 Jan, 2012

1 commit


01 Nov, 2011

1 commit

  • test_set_oom_score_adj() was introduced in 72788c385604 ("oom: replace
    PF_OOM_ORIGIN with toggling oom_score_adj") to temporarily elevate
    current's oom_score_adj for ksm and swapoff without requiring an
    additional per-process flag.

    Using that function to both set oom_score_adj to OOM_SCORE_ADJ_MAX and
    then reinstate the previous value is racy since it's possible that
    userspace can set the value to something else itself before the old value
    is reinstated. That results in userspace setting current's oom_score_adj
    to a different value and then the kernel immediately setting it back to
    its previous value without notification.

    To fix this, a new compare_swap_oom_score_adj() function is introduced
    with the same semantics as the compare and swap CAS instruction, or
    CMPXCHG on x86. It is used to reinstate the previous value of
    oom_score_adj if and only if the present value is the same as the old
    value.

    Signed-off-by: David Rientjes
    Cc: Oleg Nesterov
    Cc: Ying Han
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

26 Jul, 2011

1 commit

  • The badness() function in the oom killer was renamed to oom_badness() in
    a63d83f427fb ("oom: badness heuristic rewrite") since it is a globally
    exported function for clarity.

    The prototype for the old function still existed in linux/oom.h, so remove
    it. There are no existing users.

    Also fixes documentation and comment references to badness() and adjusts
    them accordingly.

    Signed-off-by: David Rientjes
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

25 May, 2011

1 commit

  • There's a kernel-wide shortage of per-process flags, so it's always
    helpful to trim one when possible without incurring a significant penalty.
    It's even more important when you're planning on adding a per- process
    flag yourself, which I plan to do shortly for transparent hugepages.

    PF_OOM_ORIGIN is used by ksm and swapoff to prefer current since it has a
    tendency to allocate large amounts of memory and should be preferred for
    killing over other tasks. We'd rather immediately kill the task making
    the errant syscall rather than penalizing an innocent task.

    This patch removes PF_OOM_ORIGIN since its behavior is equivalent to
    setting the process's oom_score_adj to OOM_SCORE_ADJ_MAX.

    The process's old oom_score_adj is stored and then set to
    OOM_SCORE_ADJ_MAX during the time it used to have PF_OOM_ORIGIN. The old
    value is then reinstated when the process should no longer be considered a
    high priority for oom killing.

    Signed-off-by: David Rientjes
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Cc: Hugh Dickins
    Cc: Izik Eidus
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

11 Aug, 2010

1 commit

  • When the OOM killer scans task, it check a task is under memcg or
    not when it's called via memcg's context.

    But, as Oleg pointed out, a thread group leader may have NULL ->mm
    and task_in_mem_cgroup() may do wrong decision. We have to use
    find_lock_task_mm() in memcg as generic OOM-Killer does.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Oleg Nesterov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

10 Aug, 2010

6 commits

  • /proc/pid/oom_adj is now deprecated so that that it may eventually be
    removed. The target date for removal is August 2012.

    A warning will be printed to the kernel log if a task attempts to use this
    interface. Future warning will be suppressed until the kernel is rebooted
    to prevent spamming the kernel log.

    Signed-off-by: David Rientjes
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • This a complete rewrite of the oom killer's badness() heuristic which is
    used to determine which task to kill in oom conditions. The goal is to
    make it as simple and predictable as possible so the results are better
    understood and we end up killing the task which will lead to the most
    memory freeing while still respecting the fine-tuning from userspace.

    Instead of basing the heuristic on mm->total_vm for each task, the task's
    rss and swap space is used instead. This is a better indication of the
    amount of memory that will be freeable if the oom killed task is chosen
    and subsequently exits. This helps specifically in cases where KDE or
    GNOME is chosen for oom kill on desktop systems instead of a memory
    hogging task.

    The baseline for the heuristic is a proportion of memory that each task is
    currently using in memory plus swap compared to the amount of "allowable"
    memory. "Allowable," in this sense, means the system-wide resources for
    unconstrained oom conditions, the set of mempolicy nodes, the mems
    attached to current's cpuset, or a memory controller's limit. The
    proportion is given on a scale of 0 (never kill) to 1000 (always kill),
    roughly meaning that if a task has a badness() score of 500 that the task
    consumes approximately 50% of allowable memory resident in RAM or in swap
    space.

    The proportion is always relative to the amount of "allowable" memory and
    not the total amount of RAM systemwide so that mempolicies and cpusets may
    operate in isolation; they shall not need to know the true size of the
    machine on which they are running if they are bound to a specific set of
    nodes or mems, respectively.

    Root tasks are given 3% extra memory just like __vm_enough_memory()
    provides in LSMs. In the event of two tasks consuming similar amounts of
    memory, it is generally better to save root's task.

    Because of the change in the badness() heuristic's baseline, it is also
    necessary to introduce a new user interface to tune it. It's not possible
    to redefine the meaning of /proc/pid/oom_adj with a new scale since the
    ABI cannot be changed for backward compatability. Instead, a new tunable,
    /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may
    be used to polarize the heuristic such that certain tasks are never
    considered for oom kill while others may always be considered. The value
    is added directly into the badness() score so a value of -500, for
    example, means to discount 50% of its memory consumption in comparison to
    other tasks either on the system, bound to the mempolicy, in the cpuset,
    or sharing the same memory controller.

    /proc/pid/oom_adj is changed so that its meaning is rescaled into the
    units used by /proc/pid/oom_score_adj, and vice versa. Changing one of
    these per-task tunables will rescale the value of the other to an
    equivalent meaning. Although /proc/pid/oom_adj was originally defined as
    a bitshift on the badness score, it now shares the same linear growth as
    /proc/pid/oom_score_adj but with different granularity. This is required
    so the ABI is not broken with userspace applications and allows oom_adj to
    be deprecated for future removal.

    Signed-off-by: David Rientjes
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Cc: Minchan Kim
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • We have been used naming try_set_zone_oom and clear_zonelist_oom.
    The role of functions is to lock of zonelist for preventing parallel
    OOM. So clear_zonelist_oom makes sense but try_set_zone_oome is rather
    awkward and unmatched with clear_zonelist_oom.

    Let's change it with try_set_zonelist_oom.

    Signed-off-by: Minchan Kim
    Acked-by: David Rientjes
    Reviewed-by: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • The three oom killer sysctl variables (sysctl_oom_dump_tasks,
    sysctl_oom_kill_allocating_task, and sysctl_panic_on_oom) are better
    declared in include/linux/oom.h rather than kernel/sysctl.c.

    Signed-off-by: David Rientjes
    Acked-by: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • There are various points in the oom killer where the kernel must determine
    whether to panic or not. It's better to extract this to a helper function
    to remove all the confusion as to its semantics.

    Also fix a call to dump_header() where tasklist_lock is not read- locked,
    as required.

    There's no functional change with this patch.

    Acked-by: KOSAKI Motohiro
    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

16 Dec, 2009

1 commit

  • Fix node-oriented allocation handling in oom-kill.c I myself think of this
    as a bugfix not as an ehnancement.

    In these days, things are changed as
    - alloc_pages() eats nodemask as its arguments, __alloc_pages_nodemask().
    - mempolicy don't maintain its own private zonelists.
    (And cpuset doesn't use nodemask for __alloc_pages_nodemask())

    So, current oom-killer's check function is wrong.

    This patch does
    - check nodemask, if nodemask && nodemask doesn't cover all
    node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY.
    - Scan all zonelist under nodemask, if it hits cpuset's wall
    this faiulre is from cpuset.
    And
    - modifies the caller of out_of_memory not to call oom if __GFP_THISNODE.
    This doesn't change "current" behavior. If callers use __GFP_THISNODE
    it should handle "page allocation failure" by itself.

    - handle __GFP_NOFAIL+__GFP_THISNODE path.
    This is something like a FIXME but this gfpmask is not used now.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: David Rientjes
    Cc: Daisuke Nishimura
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

22 Sep, 2009

1 commit


28 Apr, 2008

1 commit

  • Filtering zonelists requires very frequent use of zone_idx(). This is costly
    as it involves a lookup of another structure and a substraction operation. As
    the zone_idx is often required, it should be quickly accessible. The node idx
    could also be stored here if it was found that accessing zone->node is
    significant which may be the case on workloads where nodemasks are heavily
    used.

    This patch introduces a struct zoneref to store a zone pointer and a zone
    index. The zonelist then consists of an array of these struct zonerefs which
    are looked up as necessary. Helpers are given for accessing the zone index as
    well as the node index.

    [kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers]
    [hugh@veritas.com: mm-have-zonelist: fix memcg ooms]
    [hugh@veritas.com: just return do_try_to_free_pages]
    [hugh@veritas.com: do_try_to_free_pages gfp_mask redundant]
    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

17 Oct, 2007

4 commits

  • It's not necessary to include all of linux/sched.h in linux/oom.h. Instead,
    simply include prototypes for the relevant structs and include linux/types.h
    for gfp_t.

    Cc: Andrea Arcangeli
    Cc: Christoph Lameter
    Acked-by: Alexey Dobriyan
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • OOM killer synchronization should be done with zone granularity so that memory
    policy and cpuset allocations may have their corresponding zones locked and
    allow parallel kills for other OOM conditions that may exist elsewhere in the
    system. DMA allocations can be targeted at the zone level, which would not be
    possible if locking was done in nodes or globally.

    Synchronization shall be done with a variation of "trylocks." The goal is to
    put the current task to sleep and restart the failed allocation attempt later
    if the trylock fails. Otherwise, the OOM killer is invoked.

    Each zone in the zonelist that __alloc_pages() was called with is checked for
    the newly-introduced ZONE_OOM_LOCKED flag. If any zone has this flag present,
    the "trylock" to serialize the OOM killer fails and returns zero. Otherwise,
    all the zones have ZONE_OOM_LOCKED set and the try_set_zone_oom() function
    returns non-zero.

    Cc: Andrea Arcangeli
    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The OOM killer's CONSTRAINT definitions are really more appropriate in an
    enum, so define them in include/linux/oom.h.

    Cc: Andrea Arcangeli
    Acked-by: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Move the OOM killer's extern function prototypes to include/linux/oom.h and
    include it where necessary.

    [clg@fr.ibm.com: build fix]
    Cc: Andrea Arcangeli
    Acked-by: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

21 Oct, 2006

1 commit

  • Despite mm.h is not being exported header, it does contain one thing
    which is part of userspace ABI -- value disabling OOM killer for given
    process. So,
    a) create and export include/linux/oom.h
    b) move OOM_DISABLE define there.
    c) turn bounding values of /proc/$PID/oom_adj into defines and export
    them too.

    Note: mass __KERNEL__ removal will be done later.

    Signed-off-by: Alexey Dobriyan
    Cc: Nick Piggin
    Cc: David Woodhouse
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan