14 Jan, 2011

4 commits

  • Add khugepaged to relocate fragmented pages into hugepages if new
    hugepages become available. (This is independent of the defrag logic
    that will have to make new hugepages available.)

    The fundamental reason why khugepaged is unavoidable is that some memory
    can be fragmented and not everything can be relocated. So when a virtual
    machine quits and releases gigabytes of hugepages, we want to use those
    freely available hugepages to create huge pmds in the other virtual
    machines that may be running on fragmented memory, to maximize CPU
    efficiency at all times. The scan is slow and takes nearly zero CPU time,
    except when it copies data (in which case we definitely want to pay for
    that CPU time), so it seems a good tradeoff.

    In addition to the hugepages released by other processes freeing
    memory, we have a strong suspicion that the performance impact of
    potentially defragmenting hugepages during or before each page fault
    could lead to more performance inconsistency than allocating small pages
    at first and having them collapsed into large pages later... if they
    prove themselves to be long-lived mappings (the khugepaged scan is slow,
    so short-lived mappings have a low probability of running into
    khugepaged compared to long-lived mappings).

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This increases the size of the mm struct a bit, but it is needed to
    preallocate one pte for each hugepage so that split_huge_page will not
    require a fail path. Guaranteed success is a fundamental property of
    split_huge_page, needed to avoid decreasing swapping reliability and to
    avoid adding -ENOMEM fail paths that would otherwise force the
    hugepage-unaware VM code to learn rolling back in the middle of its pte
    mangling operations (if anything, we need it to learn to handle
    pmd_trans_huge natively rather than to be capable of rollback). When
    split_huge_page runs, a pte is needed for the split to succeed, to map
    the newly split regular pages with regular ptes. This way all existing
    VM code remains backwards compatible by just adding a split_huge_page*
    one-liner. The memory waste of those preallocated ptes is negligible,
    so it is worth it.
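
    A minimal sketch of the invariant in plain C, with hypothetical names
    (the real kernel code differs): the only allocation, and therefore the
    only possible failure, happens up front at hugepage fault time; the
    split itself just consumes the stash.

        #include <stdlib.h>

        typedef struct pgtable { int dummy; } pgtable;  /* stand-in type */

        struct mm_sketch {
            pgtable *prealloc_pte;    /* stashed at hugepage fault time */
        };

        /* Fault path: the one place where -ENOMEM is still allowed. */
        static int hugepage_fault(struct mm_sketch *mm)
        {
            mm->prealloc_pte = malloc(sizeof(pgtable));
            return mm->prealloc_pte ? 0 : -1;
        }

        /* Split path: consumes the stash, so it cannot fail. */
        static pgtable *split_huge_page(struct mm_sketch *mm)
        {
            pgtable *pte = mm->prealloc_pte;  /* guaranteed non-NULL */
            mm->prealloc_pte = NULL;
            return pte;   /* used to map the newly split regular pages */
        }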

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • We'd like to be able to adjust a process's oom_score_adj up and down as
    it enters and leaves the foreground. Currently, it is not possible to
    lower oom_adj without CAP_SYS_RESOURCE. This patch allows a task to
    decrease its oom_score_adj back to the value that a CAP_SYS_RESOURCE
    thread set it to, or to the value it inherited at fork. Assuming the
    thread that forked it has an oom_score_adj of 0, each process could
    decrease it back to 0 upon activation unless a CAP_SYS_RESOURCE thread
    elevated it to something higher.
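
    For example, a foreground/background switcher could demote and restore
    itself like this (a minimal sketch; the value 300 is illustrative):

        #include <stdio.h>

        static int set_oom_score_adj(int value)
        {
            FILE *f = fopen("/proc/self/oom_score_adj", "w");

            if (!f)
                return -1;
            fprintf(f, "%d\n", value);
            return fclose(f);
        }

        int main(void)
        {
            set_oom_score_adj(300);  /* backgrounded: easier oom target */
            set_oom_score_adj(0);    /* foregrounded: back to the value
                                        inherited at fork -- allowed
                                        without CAP_SYS_RESOURCE after
                                        this patch */
            return 0;
        }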

    Alternatives considered:

    * a setuid binary
    * a daemon with CAP_SYS_RESOURCE

    Since you don't want all processes to be able to reduce their oom_adj, a
    setuid or daemon implementation would be complex. The alternatives also
    have much higher overhead.

    This patch was updated from the original based on feedback from David
    Rientjes.

    Signed-off-by: Mandeep Singh Baines
    Acked-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mandeep Singh Baines
     
  • This warning was added in commit bdff746a3915 ("clone: prepare to recycle
    CLONE_STOPPED") three years ago. 2.6.26 came and went. As far as I know,
    no-one is actually using CLONE_STOPPED.

    Signed-off-by: Dave Jones
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jones
     

08 Jan, 2011

1 commit

  • * 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (30 commits)
    gameport: use this_cpu_read instead of lookup
    x86: udelay: Use this_cpu_read to avoid address calculation
    x86: Use this_cpu_inc_return for nmi counter
    x86: Replace uses of current_cpu_data with this_cpu ops
    x86: Use this_cpu_ops to optimize code
    vmstat: Use per cpu atomics to avoid interrupt disable / enable
    irq_work: Use per cpu atomics instead of regular atomics
    cpuops: Use cmpxchg for xchg to avoid lock semantics
    x86: this_cpu_cmpxchg and this_cpu_xchg operations
    percpu: Generic this_cpu_cmpxchg() and this_cpu_xchg support
    percpu,x86: relocate this_cpu_add_return() and friends
    connector: Use this_cpu operations
    xen: Use this_cpu_inc_return
    taskstats: Use this_cpu_ops
    random: Use this_cpu_inc_return
    fs: Use this_cpu_inc_return in buffer.c
    highmem: Use this_cpu_xx_return() operations
    vmstat: Use this_cpu_inc_return for vm statistics
    x86: Support for this_cpu_add, sub, dec, inc_return
    percpu: Generic support for this_cpu_add, sub, dec, inc_return
    ...

    Fixed up conflicts in arch/x86/kernel/{apic/nmi.c, apic/x2apic_uv_x.c,
    process.c} as per Tejun.

    Linus Torvalds
     

04 Jan, 2011

1 commit

  • The cgroup exit mess also uncovered a struct autogroup reference leak.
    copy_process() was simply freeing vs putting the signal_struct,
    stranding a reference.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Cc: Oleg Nesterov
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     

17 Dec, 2010

1 commit

  • __get_cpu_var() can be replaced with this_cpu_read() and will then use a
    single read instruction with implied address calculation to access the
    correct per-cpu instance.

    However, the address of a per cpu variable passed to __this_cpu_read()
    cannot be determined (since it's an implied address conversion through
    segment prefixes). Therefore apply this only to uses of __get_cpu_var
    where the address of the variable is not used.
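
    For illustration, a before/after sketch in kernel style (my_counter is
    a made-up per-cpu variable):

        #include <linux/percpu.h>

        DEFINE_PER_CPU(int, my_counter);

        /* Before: compute this CPU's address, then load from it. */
        int read_old(void)
        {
            return __get_cpu_var(my_counter);
        }

        /* After: a single segment-prefixed read on x86. */
        int read_new(void)
        {
            return this_cpu_read(my_counter);
        }

        /* Not convertible: here the address of the per-cpu variable
         * escapes, which this_cpu_read() cannot provide. */
        int *counter_addr(void)
        {
            return &__get_cpu_var(my_counter);
        }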

    Cc: Pekka Enberg
    Cc: Hugh Dickins
    Cc: Thomas Gleixner
    Acked-by: H. Peter Anvin
    Signed-off-by: Christoph Lameter
    Signed-off-by: Tejun Heo

    Christoph Lameter
     

09 Dec, 2010

1 commit

  • idle_balance() drops/retakes rq->lock, leaving the previous task
    vulnerable to set_tsk_need_resched(). Clear it after we return
    from balancing instead, and in setup_thread_stack() as well, so
    no successfully descheduled or never scheduled task has it set.

    A pending need_resched confused the skip_clock_update logic, which
    assumes that the next call to update_rq_clock() will come nearly
    immediately after being set. Make the optimization robust against
    the case where a sleeper is woken before it successfully deschedules:
    check that the current task has not been dequeued before setting the
    flag, since it is that useless clock update we're trying to save, and
    clear it unconditionally in schedule() proper instead of conditionally
    in put_prev_task().

    Signed-off-by: Mike Galbraith
    Reported-by: Bjoern B. Brandenburg
    Tested-by: Yong Zhang
    Signed-off-by: Peter Zijlstra
    Cc: stable@kernel.org
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     

30 Nov, 2010

1 commit

  • A recurring complaint from CFS users is that parallel kbuild has
    a negative impact on desktop interactivity. This patch
    implements an idea from Linus, to automatically create task
    groups. Currently, only per session autogroups are implemented,
    but the patch leaves the way open for enhancement.

    Implementation: each task's signal struct contains an inherited
    pointer to a refcounted autogroup struct containing a task group
    pointer, the default for all tasks pointing to the
    init_task_group. When a task calls setsid(), a new task group
    is created, the process is moved into the new task group, and a
    reference to the previous task group is dropped. Child
    processes inherit this task group thereafter, and increase its
    refcount. When the last thread of a process exits, the
    process's reference is dropped, such that when the last process
    referencing an autogroup exits, the autogroup is destroyed.

    At runqueue selection time, IFF a task has no cgroup assignment,
    its current autogroup is used.

    Autogroup bandwidth is controllable by setting its nice level
    through the proc filesystem:

    cat /proc/<pid>/autogroup

    Displays the task's group and the group's nice level.

    echo <nice level> > /proc/<pid>/autogroup

    Sets the task group's shares to the weight of a nice <level> task.
    Setting nice level is rate limited for !admin users due to the
    abuse risk of task group locking.

    The feature is enabled from boot by default if
    CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via
    the boot option noautogroup, and can also be turned on/off on
    the fly via:

    echo [01] > /proc/sys/kernel/sched_autogroup_enabled

    ... which will automatically move tasks to/from the root task group.
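
    For illustration, the same controls driven from C (a minimal sketch;
    the printed format is approximate):

        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
            char path[64], line[128];
            FILE *f;

            snprintf(path, sizeof(path), "/proc/%d/autogroup", getpid());

            f = fopen(path, "r");     /* e.g. "/autogroup-42 nice 0" */
            if (f) {
                if (fgets(line, sizeof(line), f))
                    printf("autogroup: %s", line);
                fclose(f);
            }

            f = fopen(path, "w");     /* renice the whole session */
            if (f) {
                fprintf(f, "10\n");   /* the group's nice level */
                fclose(f);
            }
            return 0;
        }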

    Signed-off-by: Mike Galbraith
    Acked-by: Linus Torvalds
    Acked-by: Peter Zijlstra
    Cc: Markus Trippelsdorf
    Cc: Mathieu Desnoyers
    Cc: Paul Turner
    Cc: Oleg Nesterov
    [ Removed the task_group_path() debug code, and fixed !EVENTFD build failure. ]
    Signed-off-by: Ingo Molnar
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     

28 Oct, 2010

1 commit

  • Oleg Nesterov pointed out that we have to prevent
    multiple-threads-inside-exec itself, and that we can reuse
    ->cred_guard_mutex for it. Yes, concurrent execve() has no value.

    Let's move ->cred_guard_mutex from task_struct to signal_struct. It
    naturally prevents multiple-threads-inside-exec.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Acked-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

27 Oct, 2010

1 commit

  • It's pointless to kill a task if another thread sharing its mm cannot be
    killed to allow future memory freeing. A subsequent patch will prevent
    kills in such cases, but first it's necessary to have a way to flag a
    task that shares memory with an OOM_DISABLE task without incurring an
    additional tasklist scan, which would make select_bad_process() an
    O(n^2) function.

    This patch adds an atomic counter to struct mm_struct that tracks how
    many threads attached to it have an oom_score_adj of OOM_SCORE_ADJ_MIN.
    They cannot be killed by the kernel, so their memory cannot be freed in
    oom conditions.

    This only requires task_lock() on the task that we're operating on; it
    does not require mm->mmap_sem, since task_lock() pins the mm and the
    operation is atomic.
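
    A sketch of the bookkeeping in kernel style (the helper name is made
    up; the counter field follows the description above):

        /* Call whenever a task's oom_score_adj crosses OOM_SCORE_ADJ_MIN,
         * so select_bad_process() can test the count in O(1). */
        static void account_oom_disable(struct task_struct *p)
        {
            task_lock(p);            /* pins p->mm; no mmap_sem needed */
            if (p->mm && p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
                atomic_inc(&p->mm->oom_disable_count);
            task_unlock(p);
        }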

    [rientjes@google.com: changelog and sys_unshare() code]
    [rientjes@google.com: protect oom_disable_count with task_lock in fork]
    [rientjes@google.com: use old_mm for oom_disable_count in exec]
    Signed-off-by: Ying Han
    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     

23 Sep, 2010

1 commit

  • The below bug in fork led to the rmap walk finding the parent huge-pmd
    twice instead of just once, because the anon_vma_chain objects of the
    child vma still point to the vma->vm_mm of the parent.

    The patch fixes it by making the rmap walk accurate during fork. It's
    not a big deal normally, but it is worth being accurate considering the
    cost is the same.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Acked-by: Rik van Riel
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

21 Aug, 2010

1 commit

  • It's a really simple list, and several of the users want to go backwards
    in it to find the previous vma. So rather than have to look up the
    previous entry with 'find_vma_prev()' or something similar, just make it
    doubly linked instead.
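
    A sketch of the change (field names as in the kernel; other members
    omitted):

        struct vm_area_struct {
            /* ... */
            struct vm_area_struct *vm_next;
            struct vm_area_struct *vm_prev;   /* new: walk backwards
                                                 without find_vma_prev() */
            /* ... */
        };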

    Tested-by: Ian Campbell
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

18 Aug, 2010

1 commit

  • fs: fs_struct rwlock to spinlock

    struct fs_struct.lock is an rwlock with the read-side used to protect root and
    pwd members while taking references to them. Taking a reference to a path
    typically requires just 2 atomic ops, so the critical section is very small.
    Parallel read-side operations would have cacheline contention on the lock, the
    dentry, and the vfsmount cachelines, so the rwlock is unlikely to ever give a
    real parallelism increase.

    Replace it with a spinlock to avoid one or two atomic operations in the
    typical path-lookup fastpath.
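
    A sketch of the resulting fast path (hypothetical helper name; the
    struct follows the description above):

        #include <linux/spinlock.h>
        #include <linux/path.h>

        struct fs_struct_sketch {
            spinlock_t lock;          /* was: rwlock_t */
            struct path root, pwd;
        };

        static void get_pwd_sketch(struct fs_struct_sketch *fs,
                                   struct path *pwd)
        {
            spin_lock(&fs->lock);     /* one lock+unlock pair around */
            *pwd = fs->pwd;           /* ~2 atomic ops of real work */
            path_get(pwd);
            spin_unlock(&fs->lock);
        }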

    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    Nick Piggin
     

10 Aug, 2010

1 commit

  • This is a complete rewrite of the oom killer's badness() heuristic,
    which is used to determine which task to kill in oom conditions. The
    goal is to make it as simple and predictable as possible so the results
    are better understood and we end up killing the task which will lead to
    the most memory freeing while still respecting the fine-tuning from
    userspace.

    Instead of basing the heuristic on mm->total_vm for each task, the
    task's rss and swap space are used instead. This is a better indication
    of the amount of memory that will be freeable if the oom-killed task is
    chosen and subsequently exits. This helps specifically in cases where
    KDE or GNOME is chosen for oom kill on desktop systems instead of a
    memory-hogging task.

    The baseline for the heuristic is the proportion of memory that each
    task is currently using, in memory plus swap, compared to the amount of
    "allowable" memory. "Allowable," in this sense, means the system-wide
    resources for unconstrained oom conditions, the set of mempolicy nodes,
    the mems attached to current's cpuset, or a memory controller's limit.
    The proportion is given on a scale of 0 (never kill) to 1000 (always
    kill), roughly meaning that a task with a badness() score of 500
    consumes approximately 50% of allowable memory resident in RAM or in
    swap space.

    The proportion is always relative to the amount of "allowable" memory and
    not the total amount of RAM systemwide so that mempolicies and cpusets may
    operate in isolation; they shall not need to know the true size of the
    machine on which they are running if they are bound to a specific set of
    nodes or mems, respectively.

    Root tasks are given 3% extra memory just like __vm_enough_memory()
    provides in LSMs. In the event of two tasks consuming similar amounts of
    memory, it is generally better to save root's task.

    Because of the change in the badness() heuristic's baseline, it is also
    necessary to introduce a new user interface to tune it. It's not
    possible to redefine the meaning of /proc/pid/oom_adj with a new scale,
    since the ABI cannot be changed for backward compatibility. Instead, a
    new tunable, /proc/pid/oom_score_adj, is added that ranges from -1000
    to +1000. It may be used to polarize the heuristic such that certain
    tasks are never considered for oom kill while others may always be
    considered. The value is added directly into the badness() score, so a
    value of -500, for example, means to discount 50% of its memory
    consumption in comparison to other tasks either on the system, bound to
    the mempolicy, in the cpuset, or sharing the same memory controller.

    /proc/pid/oom_adj is changed so that its meaning is rescaled into the
    units used by /proc/pid/oom_score_adj, and vice versa. Changing one of
    these per-task tunables will rescale the value of the other to an
    equivalent meaning. Although /proc/pid/oom_adj was originally defined as
    a bitshift on the badness score, it now shares the same linear growth as
    /proc/pid/oom_score_adj but with different granularity. This is required
    so the ABI is not broken with userspace applications and allows oom_adj to
    be deprecated for future removal.
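
    A simplified model of the new scale (not the exact kernel code; the
    helper and its clamping are illustrative):

        static long badness_sketch(long rss_pages, long swap_pages,
                                   long allowable_pages, long oom_score_adj,
                                   int is_root)
        {
            /* permille of "allowable" memory the task occupies */
            long points = (rss_pages + swap_pages) * 1000 / allowable_pages;

            if (is_root)
                points -= 30;         /* the ~3% bonus for root tasks */

            points += oom_score_adj;  /* -1000 (never) .. +1000 (always) */

            if (points < 0)
                points = 0;
            if (points > 1000)
                points = 1000;
            return points;            /* 500 ~= half of allowable memory */
        }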

    Signed-off-by: David Rientjes
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

09 Jun, 2010

1 commit

  • Concurrency managed workqueue needs to know when workers are going to
    sleep and waking up. Using these two hooks, cmwq keeps track of the
    current concurrency level and throttles execution of new works if it's
    too high and wakes up another worker from the sleep hook if it becomes
    too low.

    This patch introduces PF_WQ_WORKER to identify workqueue workers and
    adds the following two hooks.

    * wq_worker_waking_up(): called when a worker is woken up.

    * wq_worker_sleeping(): called when a worker is going to sleep and may
    return a pointer to a local task which should be woken up. The
    returned task is woken up using try_to_wake_up_local(), a simplified
    ttwu which is called under the rq lock and can only wake up local
    tasks.

    Both hooks are currently defined as noop stubs in
    kernel/workqueue_sched.h. The later cmwq implementation will replace
    them with proper implementations.

    These hooks are hard coded as they'll always be enabled.
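
    Per the description, the initial wiring is just the no-op stubs below
    (a sketch of kernel/workqueue_sched.h; exact signatures may differ):

        static inline void wq_worker_waking_up(struct task_struct *task,
                                               unsigned int cpu)
        {
        }

        static inline struct task_struct *
        wq_worker_sleeping(struct task_struct *task, unsigned int cpu)
        {
            return NULL;    /* no local task needs waking yet */
        }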

    Signed-off-by: Tejun Heo
    Acked-by: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Ingo Molnar

    Tejun Heo
     

31 May, 2010

1 commit

  • This reverts commit 0ac0c0d0f837c499afd02a802f9cf52d3027fa3b, which
    caused cross-architecture build problems for all the wrong reasons.
    IA64 already added its own version of __node_random(), but the fact is,
    there is nothing architectural about the function, and the original
    commit was just badly done. Revert it, since no fix is forthcoming.

    Requested-by: Stephen Rothwell
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

28 May, 2010

8 commits

  • copy_process(pid => &init_struct_pid) doesn't do attach_pid/etc.

    It shouldn't, but this means that the idle threads run with the wrong
    pids copied from the caller's task_struct. In the x86 case the caller
    is either the kernel_init() thread or keventd.

    In particular, this means that after a series of cpu_up/cpu_down calls
    an idle thread (which never exits) can run with .pid pointing to
    nowhere.

    Change fork_idle() to initialize idle->pids[] correctly. We only set
    .pid = &init_struct_pid but do not add .node to the list; INIT_TASK()
    does the same for the boot-cpu idle thread (swapper).

    Signed-off-by: Oleg Nesterov
    Cc: Cedric Le Goater
    Cc: Dave Hansen
    Cc: Eric Biederman
    Cc: Herbert Poetzl
    Cc: Mathias Krause
    Acked-by: Roland McGrath
    Acked-by: Serge Hallyn
    Cc: Sukadev Bhattiprolu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • No functional changes, just s/atomic_t count/int nr_threads/.

    With the recent changes this counter has a single user,
    get_nr_threads(). And none of its callers need a really accurate number
    of threads, not to mention that each caller obviously races with
    fork/exit. It is only used to report this value to user-space, except
    that first_tid() uses it to avoid an unnecessary while_each_thread()
    loop in the unlikely case.

    It is a bit sad we need a word in struct signal_struct for this;
    perhaps we can change get_nr_threads() to approximate the number of
    threads using signal->live and kill ->nr_threads later.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Oleg Nesterov
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • check_unshare_flags(CLONE_SIGHAND) adds CLONE_THREAD to *flags_ptr if the
    task is multithreaded to ensure unshare_thread() will fail.

    Not only is this a strange way to return the error, it is absolutely
    meaningless. If signal->count > 1 then sighand->count must also be > 1,
    and unshare_sighand() will fail anyway.

    In fact, all CLONE_THREAD/SIGHAND/VM checks inside sys_unshare() do not
    look right. Fortunately this code doesn't really work anyway.

    Signed-off-by: Oleg Nesterov
    Cc: Balbir Singh
    Acked-by: Roland McGrath
    Cc: Veaceslav Falico
    Cc: Stanislaw Gruszka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Move taskstats_tgid_free() from __exit_signal() to free_signal_struct().

    This way signal->stats never points to nowhere and we can read ->stats
    lockless.

    Signed-off-by: Oleg Nesterov
    Cc: Balbir Singh
    Cc: Roland McGrath
    Cc: Veaceslav Falico
    Cc: Stanislaw Gruszka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Kill the empty thread_group_cputime_free() helper. It was needed to free
    the per-cpu data which we no longer have.

    Signed-off-by: Oleg Nesterov
    Cc: Balbir Singh
    Cc: Roland McGrath
    Cc: Veaceslav Falico
    Cc: Stanislaw Gruszka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • We have a lot of problems with accessing task_struct->signal; it can
    "disappear" at any moment. Even current can't use its ->signal safely
    after exit_notify(). ->siglock helps, but it is not convenient, not
    always possible, and sometimes it makes sense to use task->signal even
    after this task is already dead.

    This patch adds the reference counter, sigcnt, into signal_struct. This
    reference is owned by task_struct and it is dropped in
    __put_task_struct(). Perhaps it makes sense to export
    get/put_signal_struct() later, but currently I don't see the immediate
    reason.

    Rename __cleanup_signal() to free_signal_struct() and unexport it. With
    the previous changes it does nothing except kmem_cache_free().

    Change __exit_signal() to not clear/free ->signal, it will be freed when
    the last reference to any thread in the thread group goes away.

    Note:
    - when the last thread exits, signal->tty can point to nowhere; see
    the next patch.

    - with or without this patch signal_struct->count should go away,
    or at least it should be "int nr_threads" for fs/proc. This will
    be addressed later.
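
    A sketch of the resulting lifetime rule (names per the description
    above):

        static inline void put_signal_struct(struct signal_struct *sig)
        {
            if (atomic_dec_and_test(&sig->sigcnt))
                free_signal_struct(sig);  /* formerly __cleanup_signal() */
        }

        /* __put_task_struct() then drops the reference owned by the
         * task_struct:
         *
         *     put_signal_struct(tsk->signal);
         */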

    Signed-off-by: Oleg Nesterov
    Cc: Alan Cox
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • tty_kref_put() has two callsites in copy_process() paths:

    1. if copy_process() succeeds, it is called before we copy
    signal->tty from the parent

    2. otherwise it is called from __cleanup_signal() under the
    bad_fork_cleanup_signal: label

    In both cases tty_kref_put() is not right and unneeded because we don't
    have the balancing tty_kref_get(). Fortunately, this is harmless because
    this can only happen without CLONE_THREAD, and in this case signal->tty
    must be NULL.

    Remove tty_kref_put() from copy_process() and __cleanup_signal(), and
    change another caller of __cleanup_signal(), __exit_signal(), to call
    tty_kref_put() by hand.

    I hope this change makes sense by itself, but it is also needed to make
    ->signal refcountable.

    Signed-off-by: Oleg Nesterov
    Acked-by: Alan Cox
    Acked-by: Roland McGrath
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Some workloads that create a large number of small files tend to assign
    too many pages to node 0 (multi-node systems). Part of the reason is that
    the rotor (in cpuset_mem_spread_node()) used to assign nodes starts at
    node 0 for newly created tasks.

    This patch changes the rotor to be initialized to a random node number of
    the cpuset.

    [akpm@linux-foundation.org: fix layout]
    [Lee.Schermerhorn@hp.com: Define stub numa_random() for !NUMA configuration]
    Signed-off-by: Jack Steiner
    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Paul Menage
    Cc: Jack Steiner
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jack Steiner
     

18 May, 2010

1 commit

  • Merge branch 'perf-core-for-linus' of
    git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (311 commits)
    perf tools: Add mode to build without newt support
    perf symbols: symbol inconsistency message should be done only at verbose=1
    perf tui: Add explicit -lslang option
    perf options: Type check all the remaining OPT_ variants
    perf options: Type check OPT_BOOLEAN and fix the offenders
    perf options: Check v type in OPT_U?INTEGER
    perf options: Introduce OPT_UINTEGER
    perf tui: Add workaround for slang < 2.1.4
    perf record: Fix bug mismatch with -c option definition
    perf options: Introduce OPT_U64
    perf tui: Add help window to show key associations
    perf tui: Make <- exit menus too
    perf newt: Add single key shortcuts for zoom into DSO and threads
    perf newt: Exit browser unconditionally when CTRL+C, q or Q is pressed
    perf newt: Fix the 'A'/'a' shortcut for annotate
    perf newt: Make <- exit the ui_browser
    x86, perf: P4 PMU - fix counters management logic
    perf newt: Make <- zoom out filters
    perf report: Report number of events, not samples
    perf hist: Clarify events_stats fields usage
    ...

    Fix up trivial conflicts in kernel/fork.c and tools/perf/builtin-record.c

    Linus Torvalds
     

12 May, 2010

1 commit

  • Originally, commit d899bf7b ("procfs: provide stack information for
    threads") attempted to introduce a new feature for showing where the
    thread stack was located and how many pages are being utilized by the
    stack.

    Commit c44972f1 ("procfs: disable per-task stack usage on NOMMU") was
    applied to fix the NO_MMU case.

    Commit 89240ba0 ("x86, fs: Fix x86 procfs stack information for threads on
    64-bit") was applied to fix a bug in ia32 executables being loaded.

    Commit 9ebd4eba7 ("procfs: fix /proc/<pid>/stat stack pointer for
    kernel threads") was applied to fix a bug which had kernel threads
    printing a userland stack address.

    Commit 1306d603f ('proc: partially revert "procfs: provide stack
    information for threads"') was then applied to revert the stack pages
    being used to solve a significant performance regression.

    This patch nearly undoes the effect of all these patches.

    The reason for reverting these is that the feature provides an unusable
    value in field 28. For x86_64, a fork will result in the
    task->stack_start value being updated to the current user top of stack
    and not the stack start address. This unpredictability of the
    stack_start value makes it worthless. That includes the intended use of
    showing how much stack space a thread has.

    Other architectures will get different values. As an example, ia64
    gets 0. The do_fork() and copy_process() functions appear to treat the
    stack_start and stack_size parameters as architecture specific.

    I only partially reverted c44972f1 ("procfs: disable per-task stack
    usage on NOMMU"). If I had completely reverted it, I would have had to
    change mm/Makefile to only build pagewalk.o when
    CONFIG_PROC_PAGE_MONITOR is configured. Since I could not test the
    builds without significant effort, I decided to not change mm/Makefile.

    I only partially reverted 89240ba0 ("x86, fs: Fix x86 procfs stack
    information for threads on 64-bit"). I left the KSTK_ESP() change in
    place as that seemed worthwhile.

    Signed-off-by: Robin Holt
    Cc: Stefani Seibold
    Cc: KOSAKI Motohiro
    Cc: Michal Simek
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
     

07 Apr, 2010

1 commit

  • - We weren't zeroing p->rss_stat[] at fork()

    - Consequently sync_mm_rss() was dereferencing tsk->mm for kernel
    threads and was oopsing.

    - Make __sync_task_rss_stat() static, too.

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=15648

    [akpm@linux-foundation.org: remove the BUG_ON(!mm->rss)]
    Reported-by: Troels Liebe Bentsen
    Signed-off-by: KAMEZAWA Hiroyuki
    "Michael S. Tsirkin"
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

26 Mar, 2010

1 commit

  • Support for the PMU's BTS features has been upstreamed in
    v2.6.32, but we still have the old and disabled ptrace-BTS,
    as Linus noticed not so long ago.

    It's buggy: TIF_DEBUGCTLMSR is trampling all over that MSR without
    regard for other uses (perf) and doesn't provide the flexibility
    needed for perf either.

    Its users are ptrace-block-step and ptrace-bts; ptrace-bts was
    never used, and ptrace-block-step can be implemented using a
    much simpler approach.

    So axe all 3000 lines of it. That includes the *locked_memory*()
    APIs in mm/mlock.c as well.

    Reported-by: Linus Torvalds
    Signed-off-by: Peter Zijlstra
    Cc: Roland McGrath
    Cc: Oleg Nesterov
    Cc: Markus Metzger
    Cc: Steven Rostedt
    Cc: Andrew Morton
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

14 Mar, 2010

1 commit

  • Merge branch 'core-fixes-for-linus' of
    git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    locking: Make sparse work with inline spinlocks and rwlocks
    x86/mce: Fix RCU lockdep splats
    rcu: Increase RCU CPU stall timeouts if PROVE_RCU
    ftrace: Replace read_barrier_depends() with rcu_dereference_raw()
    rcu: Suppress RCU lockdep warnings during early boot
    rcu, ftrace: Fix RCU lockdep splat in ftrace_perf_buf_prepare()
    rcu: Suppress __mpol_dup() false positive from RCU lockdep
    rcu: Make rcu_read_lock_sched_held() handle !PREEMPT
    rcu: Add control variables to lockdep_rcu_dereference() diagnostics
    rcu, cgroup: Relax the check in task_subsys_state() as early boot is now handled by lockdep-RCU
    rcu: Use wrapper function instead of exporting tasklist_lock
    sched, rcu: Fix rcu_dereference() for RCU-lockdep
    rcu: Make task_subsys_state() RCU-lockdep checks handle boot-time use
    rcu: Fix holdoff for accelerated GPs for last non-dynticked CPU
    x86/gart: Unexport gart_iommu_aperture

    Fix trivial conflicts in kernel/trace/ftrace.c

    Linus Torvalds
     

07 Mar, 2010

3 commits

  • Make sure the compiler won't do weird things with limits. E.g. fetching
    them twice may return two different values after writable limits are
    implemented.

    I.e. either use the rlimit helpers added in commit 3e10e716abf3
    ("resource: add helpers for fetching rlimits") or ACCESS_ONCE() if they
    are not applicable.
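
    For illustration (the rlimit() helper is real, from the commit cited
    above; the surrounding function is made up):

        #include <linux/resource.h>
        #include <linux/sched.h>

        static int over_cpu_limit(unsigned long runtime_secs)
        {
            unsigned long lim = rlimit(RLIMIT_CPU);   /* single fetch */

            /* Wrong: reading ...->rlim[RLIMIT_CPU].rlim_cur twice may
             * observe two different values once limits are writable. */

            return lim != RLIM_INFINITY && runtime_secs > lim;
        }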

    Signed-off-by: Jiri Slaby
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: john stultz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     
  • The old anon_vma code can lead to scalability issues with heavily forking
    workloads. Specifically, each anon_vma will be shared between the parent
    process and all its child processes.

    In a workload with 1000 child processes and a VMA with 1000 anonymous
    pages per process that get COWed, this leads to a system with a million
    anonymous pages in the same anon_vma, each of which is mapped in just one
    of the 1000 processes. However, the current rmap code needs to walk them
    all, leading to O(N) scanning complexity for each page.

    This can result in systems where one CPU is walking the page tables of
    1000 processes in page_referenced_one, while all other CPUs are stuck on
    the anon_vma lock. This leads to catastrophic failure for a benchmark
    like AIM7, where the total number of processes can reach into the tens
    of thousands. Real workloads are still a factor of 10 less process
    intensive than AIM7, but they are catching up.

    This patch changes the way anon_vmas and VMAs are linked, which allows
    us to associate multiple anon_vmas with a VMA. At fork time, each child
    process gets its own anon_vmas, in which its COWed pages will be
    instantiated. The parent's anon_vma is also linked to the VMA, because
    non-COWed pages could be present in any of the children.

    This reduces rmap scanning complexity to O(1) for the pages of the 1000
    child processes, with O(N) complexity for at most 1/N pages in the system.
    This reduces the average scanning cost in heavily forking workloads from
    O(N) to 2.

    The only real complexity in this patch stems from the fact that linking a
    VMA to anon_vmas now involves memory allocations. This means vma_adjust
    can fail, if it needs to attach a VMA to anon_vma structures. This in
    turn means error handling needs to be added to the calling functions.

    A second source of complexity is that, because there can be multiple
    anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
    "the" anon_vma lock. To prevent the rmap code from walking up an
    incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
    flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
    to make sure it is impossible to compile a kernel that needs both symbolic
    values for the same bitflag.
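
    The linking object at the heart of the change looks like this (fields
    as described in the patch; list initialization omitted):

        struct anon_vma_chain {
            struct vm_area_struct *vma;
            struct anon_vma *anon_vma;
            struct list_head same_vma;       /* all anon_vmas of one vma */
            struct list_head same_anon_vma;  /* all vmas of one anon_vma */
        };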

    Some test results:

    Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
    box with 16GB RAM and not quite enough IO), the system ends up running
    >99% in system time, with every CPU on the same anon_vma lock in the
    pageout code.

    With these changes, AIM7 hits the cross-over point around 29.7k users.
    This happens with ~99% IO wait time; there never seems to be any spike
    in system time. The anon_vma lock contention appears to be resolved.

    [akpm@linux-foundation.org: cleanups]
    Signed-off-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Cc: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Presently, the per-mm statistics counters are defined by macros in
    sched.h.

    This patch modifies them to be:
    - defined in mm.h as inline functions
    - backed by an array instead of macro-generated names

    This patch is for reducing the size of a future patch that modifies the
    implementation of the per-mm counters.
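
    A sketch of the array form (the enum names follow the kernel; the
    accessor is abbreviated):

        enum {
            MM_FILEPAGES,
            MM_ANONPAGES,
            NR_MM_COUNTERS
        };

        struct mm_rss_stat {
            atomic_long_t count[NR_MM_COUNTERS];
        };

        static inline unsigned long get_mm_counter(struct mm_rss_stat *rs,
                                                   int member)
        {
            return (unsigned long)atomic_long_read(&rs->count[member]);
        }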

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

04 Mar, 2010

1 commit

  • Lockdep-RCU commit d11c563d exported tasklist_lock, which is not
    a good thing. This patch instead exports a function that uses
    lockdep to check whether tasklist_lock is held.
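
    The exported helper amounts to this (a sketch based on the
    description):

        #ifdef CONFIG_PROVE_RCU
        int lockdep_tasklist_lock_is_held(void)
        {
            return lockdep_is_held(&tasklist_lock);
        }
        EXPORT_SYMBOL_GPL(lockdep_tasklist_lock_is_held);
        #endif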

    Suggested-by: Christoph Hellwig
    Signed-off-by: Paul E. McKenney
    Cc: laijs@cn.fujitsu.com
    Cc: dipankar@in.ibm.com
    Cc: mathieu.desnoyers@polymtl.ca
    Cc: josh@joshtriplett.org
    Cc: dvhltc@us.ibm.com
    Cc: niv@us.ibm.com
    Cc: peterz@infradead.org
    Cc: rostedt@goodmis.org
    Cc: Valdis.Kletnieks@vt.edu
    Cc: dhowells@redhat.com
    Cc: Christoph Hellwig
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
     

25 Feb, 2010

1 commit

  • Update the rcu_dereference() usages to take advantage of the new
    lockdep-based checking.
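
    A representative conversion (illustrative; exact call sites vary):

        static struct task_struct *parent_of(struct task_struct *task)
        {
            /* lockdep verifies we are in an RCU read-side critical
             * section or hold tasklist_lock */
            return rcu_dereference_check(task->real_parent,
                                         rcu_read_lock_held() ||
                                         lockdep_tasklist_lock_is_held());
        }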

    Signed-off-by: Paul E. McKenney
    Cc: laijs@cn.fujitsu.com
    Cc: dipankar@in.ibm.com
    Cc: mathieu.desnoyers@polymtl.ca
    Cc: josh@joshtriplett.org
    Cc: dvhltc@us.ibm.com
    Cc: niv@us.ibm.com
    Cc: peterz@infradead.org
    Cc: rostedt@goodmis.org
    Cc: Valdis.Kletnieks@vt.edu
    Cc: dhowells@redhat.com
    LKML-Reference:
    [ -v2: fix allmodconfig missing symbol export build failure on x86 ]
    Signed-off-by: Ingo Molnar

    Paul E. McKenney