30 May, 2011

1 commit

  • Thomas Gleixner reports that we now have a boot crash triggered by
    CONFIG_CPUMASK_OFFSTACK=y:

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] find_next_bit+0x55/0xb0
    Call Trace:
    [] cpumask_any_but+0x2a/0x70
    [] flush_tlb_mm+0x2b/0x80
    [] pud_populate+0x35/0x50
    [] pgd_alloc+0x9a/0xf0
    [] mm_init+0xec/0x120
    [] mm_alloc+0x53/0xd0

    which was introduced by commit de03c72cfce5 ("mm: convert
    mm->cpu_vm_mask into cpumask_var_t"), and is due to the wrong ordering of
    mm_init() vs mm_init_cpumask().

    Thomas wrote a patch to just fix the ordering of initialization, but I
    hate the new double allocation in the fork path, so I ended up instead
    doing some more radical surgery to clean it all up.
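
    A minimal sketch of the resulting ordering (assuming the cpumask now
    lives inside mm_struct and only needs its pointer set up): the cpumask
    must be initialized before mm_init(), which ends up in pgd_alloc().

        struct mm_struct *mm_alloc(void)
        {
                struct mm_struct *mm;

                mm = allocate_mm();
                if (!mm)
                        return NULL;

                memset(mm, 0, sizeof(*mm));
                mm_init_cpumask(mm);            /* must run before mm_init()... */
                return mm_init(mm, current);    /* ...which calls pgd_alloc() */
        }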

    Reported-by: Thomas Gleixner
    Reported-by: Ingo Molnar
    Cc: KOSAKI Motohiro
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

27 May, 2011

3 commits

  • Setup and cleanup of mm_struct->exe_file is currently done in fs/proc/.
    This was because exe_file was needed only for /proc/<pid>/exe. Since we
    will need the exe_file functionality also for core dumps (so the core
    name can contain the full binary path), build this functionality into
    the kernel unconditionally.

    To achieve that, move the code out of proc FS into kernel/, where in
    fact it belongs. By doing that we can make dup_mm_exe_file() static.
    Also we can drop the linux/proc_fs.h inclusion in fs/exec.c and
    kernel/fork.c.
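
    A hedged sketch of the now-static helper after the move into
    kernel/fork.c (only the fork path needs it):

        static void dup_mm_exe_file(struct mm_struct *oldmm,
                                    struct mm_struct *newmm)
        {
                /* get_mm_exe_file() takes its own reference for the child */
                newmm->exe_file = get_mm_exe_file(oldmm);
        }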

    Signed-off-by: Jiri Slaby
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     
  • The ns_cgroup is an annoying cgroup at the namespace / cgroup frontier and
    leads to some problems:

    * cgroup creation is out of control
    * cgroup names can conflict when pids loop around
    * it is not possible to have a single process handle a lot of
    namespaces without running into exponential creation time
    * we may want to create a namespace without creating a cgroup

    The ns_cgroup was replaced by a compatibility flag, 'clone_children',
    with which a newly created cgroup copies the parent cgroup's values.
    Userspace has to manually create a cgroup and add a task to the
    'tasks' file, as in the sketch below.
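
    A hedged userspace sketch of that manual procedure (the mount point
    argument and the group name "mygroup" are hypothetical):

        #include <errno.h>
        #include <limits.h>
        #include <stdio.h>
        #include <sys/stat.h>
        #include <unistd.h>

        static int enter_new_cgroup(const char *mnt)
        {
                char path[PATH_MAX];
                FILE *f;

                /* create the child group by hand... */
                snprintf(path, sizeof(path), "%s/mygroup", mnt);
                if (mkdir(path, 0755) && errno != EEXIST)
                        return -1;

                /* ...and attach ourselves via the 'tasks' file */
                snprintf(path, sizeof(path), "%s/mygroup/tasks", mnt);
                f = fopen(path, "w");
                if (!f)
                        return -1;
                fprintf(f, "%d\n", getpid());
                return fclose(f);
        }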

    This patch removes the ns_cgroup as suggested in the following thread:

    https://lists.linux-foundation.org/pipermail/containers/2009-June/018616.html

    The 'cgroup_clone' function is removed because it is no longer used.

    This is a userspace-visible change. Commit 45531757b45c ("cgroup: notify
    ns_cgroup deprecated") (merged into 2.6.27) caused the kernel to emit a
    printk, warning users that the feature is planned for removal. Since that
    time we have heard from XXX users who were affected by this.

    Signed-off-by: Daniel Lezcano
    Signed-off-by: Serge E. Hallyn
    Cc: Eric W. Biederman
    Cc: Jamal Hadi Salim
    Reviewed-by: Li Zefan
    Acked-by: Paul Menage
    Acked-by: Matt Helsley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Lezcano
     
  • Add the ability to read/write-lock CLONE_THREAD fork()ing per threadgroup

    Add an rwsem that lives in a threadgroup's signal_struct that's taken for
    reading in the fork path, under CONFIG_CGROUPS. If another part of the
    kernel later wants to use such a locking mechanism, the CONFIG_CGROUPS
    ifdefs should be changed to a higher-up flag that CGROUPS and the other
    system would both depend on.

    This is a pre-patch for cgroup-procs-write.patch.
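
    A hedged sketch of the locking helpers (the field and helper names are
    assumptions based on the description above):

        #ifdef CONFIG_CGROUPS
        static inline void threadgroup_fork_read_lock(struct task_struct *tsk)
        {
                down_read(&tsk->signal->threadgroup_fork_lock);
        }

        static inline void threadgroup_fork_read_unlock(struct task_struct *tsk)
        {
                up_read(&tsk->signal->threadgroup_fork_lock);
        }
        #else
        static inline void threadgroup_fork_read_lock(struct task_struct *tsk) {}
        static inline void threadgroup_fork_read_unlock(struct task_struct *tsk) {}
        #endif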

    Signed-off-by: Ben Blum
    Cc: "Eric W. Biederman"
    Cc: Li Zefan
    Cc: Matt Helsley
    Reviewed-by: Paul Menage
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Miao Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     

25 May, 2011

3 commits

  • cpumask_t is a very big struct and cpu_vm_mask is placed at the wrong
    position in mm_struct. It might reduce the cache hit ratio.

    This patch makes two changes:
    1) Move the cpumask to the end of mm_struct, because usually only the
    front bits of a cpumask are accessed when the system has cpu-hotplug
    capability.
    2) Convert cpu_vm_mask into cpumask_var_t. It may help to reduce the
    memory footprint if cpumask_size() uses nr_cpumask_bits properly in the
    future.

    In addition, this patch renames cpu_vm_mask to cpu_vm_mask_var, which
    may help to detect out-of-tree cpu_vm_mask users.

    This patch has no functional change.
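
    Roughly, the layout and its accessor end up as follows (a sketch; the
    accessor keeps out-of-tree users from touching the renamed field
    directly):

        struct mm_struct {
                /* ... other fields ... */
                /* placed last: usually only the first bits are touched */
                cpumask_var_t cpu_vm_mask_var;
        };

        static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
        {
                return mm->cpu_vm_mask_var;
        }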

    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KOSAKI Motohiro
    Cc: David Howells
    Cc: Koichi Yasutake
    Cc: Hugh Dickins
    Cc: Chris Metcalf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Straightforward conversion of i_mmap_lock to a mutex.
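
    In essence, the type and its call sites change mechanically (a sketch):

        struct address_space {
                /* ... */
                struct mutex i_mmap_mutex;      /* was: spinlock_t i_mmap_lock */
                /* ... */
        };

        mutex_lock(&mapping->i_mmap_mutex);     /* was: spin_lock(...) */
        mutex_unlock(&mapping->i_mmap_mutex);   /* was: spin_unlock(...) */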

    Signed-off-by: Peter Zijlstra
    Acked-by: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Hugh says:
    "The only significant loser, I think, would be page reclaim (when
    concurrent with truncation): could spin for a long time waiting for
    the i_mmap_mutex it expects would soon be dropped?"

    Counterpoints:
    - cpu contention makes the spin stop (need_resched())
    - zap pages should be freeing pages at a higher rate than reclaim
    ever can

    I think the simplification of the truncate code is definitely worth it.

    Effectively reverts: 2aa15890f3c ("mm: prevent concurrent
    unmap_mapping_range() on the same inode") and takes out the code that
    caused its problem.

    Signed-off-by: Peter Zijlstra
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

24 Apr, 2011

1 commit

  • Neil Brown pointed out that lock_depth somehow escaped the BKL
    removal work. Let's get rid of it now.

    Note that the perf scripting utilities still have a bunch of
    code for dealing with common_lock_depth in tracepoints; I have
    left that in place in case anybody wants to use that code with
    older kernels.

    Suggested-by: Neil Brown
    Signed-off-by: Jonathan Corbet
    Cc: Arnd Bergmann
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20110422111910.456c0e84@bike.lwn.net
    Signed-off-by: Ingo Molnar

    Jonathan Corbet
     

25 Mar, 2011

1 commit

  • * 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
    Documentation/iostats.txt: bit-size reference etc.
    cfq-iosched: removing unnecessary think time checking
    cfq-iosched: Don't clear queue stats when preempt.
    blk-throttle: Reset group slice when limits are changed
    blk-cgroup: Only give unaccounted_time under debug
    cfq-iosched: Don't set active queue in preempt
    block: fix non-atomic access to genhd inflight structures
    block: attempt to merge with existing requests on plug flush
    block: NULL dereference on error path in __blkdev_get()
    cfq-iosched: Don't update group weights when on service tree
    fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
    block: Require subsystems to explicitly allocate bio_set integrity mempool
    jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    fs: make fsync_buffers_list() plug
    mm: make generic_writepages() use plugging
    blk-cgroup: Add unaccounted time to timeslice_used.
    block: fixup plugging stubs for !CONFIG_BLOCK
    block: remove obsolete comments for blkdev_issue_zeroout.
    blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
    ...

    Fix up conflicts in fs/{aio.c,super.c}

    Linus Torvalds
     

24 Mar, 2011

2 commits

  • Reorganize proc_get_sb() so it can be called before the struct pid of the
    first process is allocated.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Daniel Lezcano
    Cc: Oleg Nesterov
    Cc: Alexey Dobriyan
    Acked-by: Serge E. Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • This patchset is a cleanup and a preparation to unshare the pid namespace.
    These prerequisites prepare for Eric's patchset to give a file descriptor
    to a namespace and join an existing namespace.

    This patch:

    It turns out that the existing assignment of child_reaper in
    copy_process() can handle the initial assignment of child_reaper; we
    just need to generalize the test in kernel/fork.c, as sketched below.
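
    A hedged sketch of the generalized test (the helper name matches what
    the kernel gained around this series):

        /* a pid is its namespace's child reaper iff its number there is 1 */
        static inline bool is_child_reaper(struct pid *pid)
        {
                return pid->numbers[pid->level].nr == 1;
        }

        /* in copy_process(): */
        if (thread_group_leader(p)) {
                if (is_child_reaper(pid))
                        p->nsproxy->pid_ns->child_reaper = p;
                /* ... */
        }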

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Daniel Lezcano
    Cc: Oleg Nesterov
    Cc: Alexey Dobriyan
    Acked-by: Serge E. Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

23 Mar, 2011

4 commits

  • Cleanup: kill the dead code which does nothing but complicates the code
    and confuses the reader.

    sys_unshare(CLONE_THREAD/SIGHAND/VM) is not really implemented, and I
    doubt very much it will ever work. At least, nobody even tried since the
    original 99d1419d96d7df9cfa56 ("unshare system call -v5: system call
    handler function") was applied more than 4 years ago.

    And the code is not consistent. unshare_thread() always fails
    unconditionally, while unshare_sighand() and unshare_vm() pretend to work
    if there is nothing to unshare.

    Remove the unshare_thread(), unshare_sighand() and unshare_vm() helpers
    and related variables, and add a simple CLONE_THREAD | CLONE_SIGHAND |
    CLONE_VM check into check_unshare_flags().

    Also, move the "CLONE_NEWNS needs CLONE_FS" check from
    check_unshare_flags() to sys_unshare(). This looks more consistent and
    matches the similar do_sysvsem check in sys_unshare().

    Note: with or without this patch "atomic_read(mm->mm_users) > 1" can give
    a false positive due to get_task_mm().
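
    A condensed sketch of the resulting check (flag list abbreviated; see
    the note above about the mm_users false positive):

        static int check_unshare_flags(unsigned long unshare_flags)
        {
                if (unshare_flags & ~(CLONE_THREAD | CLONE_FS | CLONE_NEWNS |
                                      CLONE_SIGHAND | CLONE_VM | CLONE_FILES |
                                      CLONE_SYSVSEM | CLONE_NEWUTS |
                                      CLONE_NEWIPC | CLONE_NEWNET))
                        return -EINVAL;
                /*
                 * Not implemented, but pretend it works if there is
                 * nothing to unshare.
                 */
                if (unshare_flags & (CLONE_THREAD | CLONE_SIGHAND | CLONE_VM)) {
                        if (atomic_read(&current->mm->mm_users) > 1)
                                return -EINVAL;
                }
                return 0;
        }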

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: Janak Desai
    Cc: Daniel Lezcano
    Cc: "Eric W. Biederman"
    Cc: KOSAKI Motohiro
    Cc: Alexey Dobriyan
    Acked-by: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Since all kthreads are created from a single helper task, they all use
    memory from a single node for their kernel stack and task struct.

    This patch suite creates kthread_create_on_node(), adding a 'node'
    parameter to the parameters already used by kthread_create().

    This parameter is used to allocate memory for the new kthread on its
    memory node if possible.
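
    The resulting interface, roughly, with a hypothetical per-cpu caller:

        struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
                                                   void *data,
                                                   int node,
                                                   const char namefmt[], ...);

        /* e.g. place a per-cpu thread's stack and task_struct on its node: */
        p = kthread_create_on_node(my_thread_fn, NULL, cpu_to_node(cpu),
                                   "mythread/%d", cpu);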

    Signed-off-by: Eric Dumazet
    Acked-by: David S. Miller
    Reviewed-by: Andi Kleen
    Acked-by: Rusty Russell
    Cc: Tejun Heo
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: David Howells
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     
  • Add a node parameter to alloc_thread_info(), and change its name to
    alloc_thread_info_node().

    This change is needed to allow a NUMA-aware kthread_create_on_cpu().
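
    A sketch of the x86 flavour (the flags and order macros are as on x86
    at the time):

        struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
                                                   int node)
        {
                struct page *page = alloc_pages_node(node, THREAD_FLAGS,
                                                     THREAD_ORDER);

                return page ? page_address(page) : NULL;
        }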

    Signed-off-by: Eric Dumazet
    Acked-by: David S. Miller
    Reviewed-by: Andi Kleen
    Acked-by: Rusty Russell
    Cc: Tejun Heo
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: David Howells
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     
  • Since all kthreads are created from a single helper task, they all use
    memory from a single node for their kernel stack and task struct.

    This patch suite creates kthread_create_on_cpu(), adding a 'cpu'
    parameter to the parameters already used by kthread_create().

    This parameter is used to allocate memory for the new kthread on its
    memory node if available.

    Users of this new function are: ksoftirqd, kworker, migration, pktgend...

    This patch:

    Add a node parameter to alloc_task_struct(), and change its name to
    alloc_task_struct_node().

    This change is needed to allow a NUMA-aware kthread_create_on_cpu().
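
    A sketch of the generic kernel/fork.c side:

        static struct task_struct *alloc_task_struct_node(int node)
        {
                /* slab honours the node hint for the task_struct itself */
                return kmem_cache_alloc_node(task_struct_cachep, GFP_KERNEL,
                                             node);
        }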

    Signed-off-by: Eric Dumazet
    Acked-by: David S. Miller
    Reviewed-by: Andi Kleen
    Acked-by: Rusty Russell
    Cc: Tejun Heo
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: David Howells
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

10 Mar, 2011

1 commit

  • This patch adds support for creating a queuing context outside
    of the queue itself. This enables us to batch up pieces of IO
    before grabbing the block device queue lock and submitting them to
    the IO scheduler.

    The context is created on the stack of the process and assigned in
    the task structure, so that we can auto-unplug it if we hit a schedule
    event.

    The current queue plugging happens implicitly if IO is submitted to
    an empty device, yet callers have to remember to unplug that IO when
    they are going to wait for it. This is an ugly API and has caused bugs
    in the past. Additionally, it requires hacks in the vm (->sync_page()
    callback) to handle that logic. By switching to an explicit plugging
    scheme we make the API a lot nicer and can get rid of the ->sync_page()
    hack in the vm.
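
    Usage of the explicit API introduced here looks roughly like this:

        struct blk_plug plug;

        blk_start_plug(&plug);          /* plug lives on our stack */
        /* ... submit a batch of bios; they are held back ... */
        blk_finish_plug(&plug);         /* flush the batch to the elevator */
        /* (the batch is also auto-flushed if the task schedules out) */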

    Signed-off-by: Jens Axboe

    Jens Axboe
     

14 Jan, 2011

4 commits

  • Add khugepaged to relocate fragmented pages into hugepages if new
    hugepages become available. (This is independent of the defrag logic
    that will have to make new hugepages available.)

    The fundamental reason why khugepaged is unavoidable is that some memory
    can be fragmented and not everything can be relocated. So when a virtual
    machine quits and releases gigabytes of hugepages, we want to use those
    freely available hugepages to create huge-pmds in the other virtual
    machines that may be running on fragmented memory, to maximize the CPU
    efficiency at all times. The scan is slow; it takes nearly zero cpu time,
    except when it copies data (in which case it means we definitely want to
    pay for that cpu time), so it seems a good tradeoff.

    In addition to the hugepages being released by other processes releasing
    memory, we have the strong suspicion that the performance impact of
    potentially defragmenting hugepages during or before each page fault
    could lead to more performance inconsistency than allocating small pages
    at first and having them collapsed into large pages later... if they
    prove themselves to be long-lived mappings (the khugepaged scan is slow,
    so short-lived mappings have a low probability of running into
    khugepaged, compared to long-lived mappings).

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This increases the size of the mm struct a bit, but it is needed to
    preallocate one pte for each hugepage so that split_huge_page will not
    require a fail path. Guarantee of success is a fundamental property of
    split_huge_page: it avoids decreasing swapping reliability and avoids
    adding -ENOMEM fail paths that would otherwise force the hugepage-unaware
    VM code to learn rolling back in the middle of its pte mangling
    operations (if anything, we need it to learn to handle pmd_trans_huge
    natively rather than to be capable of rollback). When split_huge_page
    runs, a pte is needed for the split to succeed, to map the newly split
    regular pages with regular ptes. This way all existing VM code remains
    backwards compatible by just adding a split_huge_page* one-liner. The
    memory waste of those preallocated ptes is negligible, and so it is
    worth it.
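
    The cost is one pointer per mm, roughly:

        struct mm_struct {
                /* ... */
        #ifdef CONFIG_TRANSPARENT_HUGEPAGE
                /* preallocated pte page, consumed when a huge pmd is
                 * split; protected by page_table_lock */
                pgtable_t pmd_huge_pte;
        #endif
        };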

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • We'd like to be able to oom_score_adj a process up/down as it
    enters/leaves the foreground. Currently, it is not possible to oom_adj
    down without CAP_SYS_RESOURCE. This patch allows a task to decrease its
    oom_score_adj back to the value that a CAP_SYS_RESOURCE thread set it to
    or its inherited value at fork. Assuming the thread that has forked it
    has oom_score_adj of 0, each process could decrease it back from 0 upon
    activation unless a CAP_SYS_RESOURCE thread elevated it to something
    higher.

    Alternatives considered:

    * a setuid binary
    * a daemon with CAP_SYS_RESOURCE

    Since you don't want all processes to be able to reduce their oom_adj, a
    setuid or daemon implementation would be complex. The alternatives also
    have much higher overhead.

    This patch was updated from the original based on feedback from David
    Rientjes.
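
    A hedged sketch of the check this adds to the oom_score_adj write path
    (field name per the description above; error handling condensed):

        /* lowering is allowed back down to the recorded floor... */
        if (oom_score_adj < task->signal->oom_score_adj_min &&
            !capable(CAP_SYS_RESOURCE))
                return -EACCES;

        /* ...and a privileged writer moves the floor itself */
        if (has_capability_noaudit(current, CAP_SYS_RESOURCE))
                task->signal->oom_score_adj_min = oom_score_adj;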

    Signed-off-by: Mandeep Singh Baines
    Acked-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mandeep Singh Baines
     
  • This warning was added in commit bdff746a3915 ("clone: prepare to recycle
    CLONE_STOPPED") three years ago. 2.6.26 came and went. As far as I know,
    no-one is actually using CLONE_STOPPED.

    Signed-off-by: Dave Jones
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jones
     

08 Jan, 2011

1 commit

  • * 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (30 commits)
    gameport: use this_cpu_read instead of lookup
    x86: udelay: Use this_cpu_read to avoid address calculation
    x86: Use this_cpu_inc_return for nmi counter
    x86: Replace uses of current_cpu_data with this_cpu ops
    x86: Use this_cpu_ops to optimize code
    vmstat: User per cpu atomics to avoid interrupt disable / enable
    irq_work: Use per cpu atomics instead of regular atomics
    cpuops: Use cmpxchg for xchg to avoid lock semantics
    x86: this_cpu_cmpxchg and this_cpu_xchg operations
    percpu: Generic this_cpu_cmpxchg() and this_cpu_xchg support
    percpu,x86: relocate this_cpu_add_return() and friends
    connector: Use this_cpu operations
    xen: Use this_cpu_inc_return
    taskstats: Use this_cpu_ops
    random: Use this_cpu_inc_return
    fs: Use this_cpu_inc_return in buffer.c
    highmem: Use this_cpu_xx_return() operations
    vmstat: Use this_cpu_inc_return for vm statistics
    x86: Support for this_cpu_add, sub, dec, inc_return
    percpu: Generic support for this_cpu_add, sub, dec, inc_return
    ...

    Fixed up conflicts: in arch/x86/kernel/{apic/nmi.c, apic/x2apic_uv_x.c, process.c}
    as per Tejun.

    Linus Torvalds
     

04 Jan, 2011

1 commit

  • The cgroup exit mess also uncovered a struct autogroup reference leak.
    copy_process() was simply freeing vs putting the signal_struct,
    stranding a reference.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Cc: Oleg Nesterov
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     

17 Dec, 2010

1 commit

  • __get_cpu_var() can be replaced with this_cpu_read and will then use a
    single read instruction with implied address calculation to access the
    correct per cpu instance.

    However, the address of a per cpu variable passed to __this_cpu_read()
    cannot be determined (since it's an implied address conversion through
    segment prefixes). Therefore apply this only to uses of __get_cpu_var
    where the address of the variable is not used.
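
    The transformation pattern, with a hypothetical variable name:

        /* before: computes the per-cpu address, then loads through it */
        x = __get_cpu_var(some_counter);

        /* after: a single segment-prefixed load on x86 */
        x = __this_cpu_read(some_counter);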

    Cc: Pekka Enberg
    Cc: Hugh Dickins
    Cc: Thomas Gleixner
    Acked-by: H. Peter Anvin
    Signed-off-by: Christoph Lameter
    Signed-off-by: Tejun Heo

    Christoph Lameter
     

09 Dec, 2010

1 commit

  • idle_balance() drops/retakes rq->lock, leaving the previous task
    vulnerable to set_tsk_need_resched(). Clear it after we return
    from balancing instead, and in setup_thread_stack() as well, so
    that no successfully descheduled or never-scheduled task has it set.

    Need_resched confused the skip_clock_update logic, which assumes
    that the next call to update_rq_clock() will come nearly immediately
    after being set. Make the optimization robust against the case of
    waking a sleeper before it successfully deschedules, by checking
    that the current task has not been dequeued before setting the flag,
    since it is that useless clock update we're trying to save, and
    clear it unconditionally in schedule() proper instead of
    conditionally in put_prev_task().
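
    For the fork side, the idea is roughly this (a hedged sketch of the
    setup_thread_stack() part per the description above):

        static inline void setup_thread_stack(struct task_struct *p,
                                              struct task_struct *org)
        {
                *task_thread_info(p) = *task_thread_info(org);
                task_thread_info(p)->task = p;
                /* don't inherit a pending resched from the parent */
                clear_tsk_need_resched(p);
        }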

    Signed-off-by: Mike Galbraith
    Reported-by: Bjoern B. Brandenburg
    Tested-by: Yong Zhang
    Signed-off-by: Peter Zijlstra
    Cc: stable@kernel.org
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     

30 Nov, 2010

1 commit

  • A recurring complaint from CFS users is that parallel kbuild has
    a negative impact on desktop interactivity. This patch
    implements an idea from Linus, to automatically create task
    groups. Currently, only per session autogroups are implemented,
    but the patch leaves the way open for enhancement.

    Implementation: each task's signal struct contains an inherited
    pointer to a refcounted autogroup struct containing a task group
    pointer, the default for all tasks pointing to the
    init_task_group. When a task calls setsid(), a new task group
    is created, the process is moved into the new task group, and a
    reference to the previous task group is dropped. Child
    processes inherit this task group thereafter, and increase its
    refcount. When the last thread of a process exits, the
    process's reference is dropped, such that when the last process
    referencing an autogroup exits, the autogroup is destroyed.

    At runqueue selection time, IFF a task has no cgroup assignment,
    its current autogroup is used.

    Autogroup bandwidth is controllable by setting its nice level
    through the proc filesystem:

    cat /proc/<pid>/autogroup

    Displays the task's group and the group's nice level.

    echo <nice level> > /proc/<pid>/autogroup

    Sets the task group's shares to the weight of a nice <level> task.
    Setting nice level is rate limited for !admin users due to the
    abuse risk of task group locking.

    The feature is enabled from boot by default if
    CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via
    the boot option noautogroup, and can also be turned on/off on
    the fly via:

    echo [01] > /proc/sys/kernel/sched_autogroup_enabled

    ... which will automatically move tasks to/from the root task group.

    Signed-off-by: Mike Galbraith
    Acked-by: Linus Torvalds
    Acked-by: Peter Zijlstra
    Cc: Markus Trippelsdorf
    Cc: Mathieu Desnoyers
    Cc: Paul Turner
    Cc: Oleg Nesterov
    [ Removed the task_group_path() debug code, and fixed !EVENTFD build failure. ]
    Signed-off-by: Ingo Molnar
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     

28 Oct, 2010

1 commit

  • Oleg Nesterov pointed out that we have to prevent
    multiple-threads-inside-exec itself, and that we can reuse
    ->cred_guard_mutex for it. Indeed, concurrent execve() is worthless.

    Let's move ->cred_guard_mutex from task_struct to signal_struct. It
    naturally prevents multiple-threads-inside-exec.
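
    After the move, the exec path serializes on the shared signal_struct,
    roughly:

        static int prepare_bprm_creds(struct linux_binprm *bprm)
        {
                /* one exec at a time per threadgroup */
                if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
                        return -ERESTARTNOINTR;

                bprm->cred = prepare_exec_creds();
                if (likely(bprm->cred))
                        return 0;

                mutex_unlock(&current->signal->cred_guard_mutex);
                return -ENOMEM;
        }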

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Acked-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

27 Oct, 2010

1 commit

  • It's pointless to kill a task if another thread sharing its mm cannot be
    killed to allow future memory freeing. A subsequent patch will prevent
    kills in such cases, but first it's necessary to have a way to flag a task
    that shares memory with an OOM_DISABLE task that doesn't incur an
    additional tasklist scan, which would make select_bad_process() an O(n^2)
    function.

    This patch adds an atomic counter to struct mm_struct that tracks how
    many threads attached to it have an oom_score_adj of OOM_SCORE_ADJ_MIN.
    They cannot be killed by the kernel, so their memory cannot be freed in
    oom conditions.

    This only requires task_lock() on the task that we're operating on, it
    does not require mm->mmap_sem since task_lock() pins the mm and the
    operation is atomic.
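
    A hedged sketch of the counter and its fork-path update:

        struct mm_struct {
                /* ... */
                /* how many attached threads are OOM_DISABLE */
                atomic_t oom_disable_count;
        };

        /* in the fork path, under task_lock(): */
        if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
                atomic_inc(&p->mm->oom_disable_count);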

    [rientjes@google.com: changelog and sys_unshare() code]
    [rientjes@google.com: protect oom_disable_count with task_lock in fork]
    [rientjes@google.com: use old_mm for oom_disable_count in exec]
    Signed-off-by: Ying Han
    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     

23 Sep, 2010

1 commit

  • The below bug in fork led to the rmap walk finding the parent huge-pmd
    twice instead of just once, because the anon_vma_chain objects of the
    child vma still point to the vma->vm_mm of the parent.

    The patch fixes it by making the rmap walk accurate during fork. It's
    not a big deal normally, but it's worth being accurate considering the
    cost is the same.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Acked-by: Rik van Riel
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

21 Aug, 2010

1 commit

  • It's a really simple list, and several of the users want to go backwards
    in it to find the previous vma. So rather than have to look up the
    previous entry with 'find_vma_prev()' or something similar, just make it
    doubly linked instead.
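
    The change itself is tiny (a sketch):

        struct vm_area_struct {
                /* ... */
                struct vm_area_struct *vm_next, *vm_prev; /* vm_prev is new */
                /* ... */
        };

        /* callers can now do this instead of find_vma_prev(): */
        prev = vma->vm_prev;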

    Tested-by: Ian Campbell
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

18 Aug, 2010

1 commit

  • fs: fs_struct rwlock to spinlock

    struct fs_struct.lock is an rwlock with the read-side used to protect root and
    pwd members while taking references to them. Taking a reference to a path
    typically requires just 2 atomic ops, so the critical section is very small.
    Parallel read-side operations would have cacheline contention on the lock, the
    dentry, and the vfsmount cachelines, so the rwlock is unlikely to ever give a
    real parallelism increase.

    Replace it with a spinlock to avoid one or two atomic operations in the
    typical path lookup fastpath.
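
    The resulting structure, roughly (as it looked at the time):

        struct fs_struct {
                int users;
                spinlock_t lock;        /* was: rwlock_t lock */
                int umask;
                int in_exec;
                struct path root, pwd;
        };

    Readers now take a short spin_lock() instead of read_lock(), trading
    (theoretical) read-side parallelism for fewer atomics per lookup.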

    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    Nick Piggin
     

10 Aug, 2010

1 commit

  • This a complete rewrite of the oom killer's badness() heuristic which is
    used to determine which task to kill in oom conditions. The goal is to
    make it as simple and predictable as possible so the results are better
    understood and we end up killing the task which will lead to the most
    memory freeing while still respecting the fine-tuning from userspace.

    Instead of basing the heuristic on mm->total_vm for each task, the task's
    rss and swap space is used instead. This is a better indication of the
    amount of memory that will be freeable if the oom killed task is chosen
    and subsequently exits. This helps specifically in cases where KDE or
    GNOME is chosen for oom kill on desktop systems instead of a memory
    hogging task.

    The baseline for the heuristic is a proportion of memory that each task is
    currently using in memory plus swap compared to the amount of "allowable"
    memory. "Allowable," in this sense, means the system-wide resources for
    unconstrained oom conditions, the set of mempolicy nodes, the mems
    attached to current's cpuset, or a memory controller's limit. The
    proportion is given on a scale of 0 (never kill) to 1000 (always kill),
    roughly meaning that if a task has a badness() score of 500 that the task
    consumes approximately 50% of allowable memory resident in RAM or in swap
    space.

    The proportion is always relative to the amount of "allowable" memory and
    not the total amount of RAM systemwide so that mempolicies and cpusets may
    operate in isolation; they shall not need to know the true size of the
    machine on which they are running if they are bound to a specific set of
    nodes or mems, respectively.

    Root tasks are given 3% extra memory just like __vm_enough_memory()
    provides in LSMs. In the event of two tasks consuming similar amounts of
    memory, it is generally better to save root's task.

    Because of the change in the badness() heuristic's baseline, it is also
    necessary to introduce a new user interface to tune it. It's not possible
    to redefine the meaning of /proc/pid/oom_adj with a new scale since the
    ABI cannot be changed for backward compatibility. Instead, a new tunable,
    /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may
    be used to polarize the heuristic such that certain tasks are never
    considered for oom kill while others may always be considered. The value
    is added directly into the badness() score so a value of -500, for
    example, means to discount 50% of its memory consumption in comparison to
    other tasks either on the system, bound to the mempolicy, in the cpuset,
    or sharing the same memory controller.

    /proc/pid/oom_adj is changed so that its meaning is rescaled into the
    units used by /proc/pid/oom_score_adj, and vice versa. Changing one of
    these per-task tunables will rescale the value of the other to an
    equivalent meaning. Although /proc/pid/oom_adj was originally defined as
    a bitshift on the badness score, it now shares the same linear growth as
    /proc/pid/oom_score_adj but with different granularity. This is required
    so the ABI is not broken with userspace applications and allows oom_adj to
    be deprecated for future removal.
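
    A heavily hedged sketch of the score computation described above
    ('totalpages' standing for the "allowable" memory; helper names as in
    the kernel of that era):

        /* rss + swap as a proportion of allowable memory, scaled 0..1000 */
        points = (get_mm_rss(p->mm) +
                  get_mm_counter(p->mm, MM_SWAPENTS)) * 1000 / totalpages;

        /* root tasks get a 3% discount on the 0..1000 scale */
        if (has_capability_noaudit(p, CAP_SYS_ADMIN))
                points -= 30;

        /* the per-task tunable shifts the score directly */
        points += p->signal->oom_score_adj;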

    Signed-off-by: David Rientjes
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

09 Jun, 2010

1 commit

  • Concurrency managed workqueue needs to know when workers are going to
    sleep and waking up. Using these two hooks, cmwq keeps track of the
    current concurrency level and throttles execution of new works if it's
    too high and wakes up another worker from the sleep hook if it becomes
    too low.

    This patch introduces PF_WQ_WORKER to identify workqueue workers and
    adds the following two hooks.

    * wq_worker_waking_up(): called when a worker is woken up.

    * wq_worker_sleeping(): called when a worker is going to sleep and may
    return a pointer to a local task which should be woken up. The
    returned task is woken up using try_to_wake_up_local() which is
    simplified ttwu which is called under rq lock and can only wake up
    local tasks.

    Both hooks are currently defined as noop in kernel/workqueue_sched.h.
    Later cmwq implementation will replace them with proper
    implementation.

    These hooks are hard coded as they'll always be enabled.
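
    The stubs start life as noops, roughly:

        static inline void wq_worker_waking_up(struct task_struct *task,
                                               unsigned int cpu)
        {
        }

        static inline struct task_struct *
        wq_worker_sleeping(struct task_struct *task, unsigned int cpu)
        {
                return NULL;    /* no local task to wake up yet */
        }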

    Signed-off-by: Tejun Heo
    Acked-by: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Ingo Molnar

    Tejun Heo
     

31 May, 2010

1 commit

  • This reverts commit 0ac0c0d0f837c499afd02a802f9cf52d3027fa3b, which
    caused cross-architecture build problems for all the wrong reasons.
    IA64 already added its own version of __node_random(), but the fact is,
    there is nothing architectural about the function, and the original
    commit was just badly done. Revert it, since no fix is forthcoming.

    Requested-by: Stephen Rothwell
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

28 May, 2010

3 commits

  • copy_process(pid => &init_struct_pid) doesn't do attach_pid/etc.

    It shouldn't, but this means that the idle threads run with the wrong
    pids copied from the caller's task_struct. In the x86 case, the caller
    is either the kernel_init() thread or keventd.

    In particular, this means that after the series of cpu_up/cpu_down an
    idle thread (which never exits) can run with .pid pointing to nowhere.

    Change fork_idle() to initialize idle->pids[] correctly. We only set
    .pid = &init_struct_pid but do not add .node to the list; INIT_TASK()
    does the same for the boot-cpu idle thread (swapper).
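
    The fix, roughly:

        static void init_idle_pids(struct pid_link *links)
        {
                enum pid_type type;

                for (type = PIDTYPE_PID; type < PIDTYPE_MAX; ++type) {
                        INIT_HLIST_NODE(&links[type].node); /* not hashed */
                        links[type].pid = &init_struct_pid;
                }
        }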

    Signed-off-by: Oleg Nesterov
    Cc: Cedric Le Goater
    Cc: Dave Hansen
    Cc: Eric Biederman
    Cc: Herbert Poetzl
    Cc: Mathias Krause
    Acked-by: Roland McGrath
    Acked-by: Serge Hallyn
    Cc: Sukadev Bhattiprolu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • No functional changes, just s/atomic_t count/int nr_threads/.

    With the recent changes this counter has a single user,
    get_nr_threads(). And none of its callers need a really accurate number
    of threads, not to mention that each caller obviously races with
    fork/exit. It is only used to report this value to user-space, except
    that first_tid() uses it to avoid an unnecessary while_each_thread()
    loop in the unlikely case.

    It is a bit sad we need a word in struct signal_struct for this, perhaps
    we can change get_nr_threads() to approximate the number of threads using
    signal->live and kill ->nr_threads later.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Oleg Nesterov
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • check_unshare_flags(CLONE_SIGHAND) adds CLONE_THREAD to *flags_ptr if the
    task is multithreaded to ensure unshare_thread() will fail.

    Not only is this a strange way to return the error, it is absolutely
    meaningless. If signal->count > 1 then sighand->count must also be > 1,
    and unshare_sighand() will fail anyway.

    In fact, all CLONE_THREAD/SIGHAND/VM checks inside sys_unshare() do not
    look right. Fortunately this code doesn't really work anyway.

    Signed-off-by: Oleg Nesterov
    Cc: Balbir Singh
    Acked-by: Roland McGrath
    Cc: Veaceslav Falico
    Cc: Stanislaw Gruszka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov