01 Aug, 2012

1 commit

  • Pull perf updates from Ingo Molnar:
    "The biggest changes are Intel Nehalem-EX PMU uncore support, uprobes
    updates/cleanups/fixes from Oleg and diverse tooling updates (mostly
    fixes) now that Arnaldo is back from vacation."

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
    uprobes: __replace_page() needs munlock_vma_page()
    uprobes: Rename vma_address() and make it return "unsigned long"
    uprobes: Fix register_for_each_vma()->vma_address() check
    uprobes: Introduce vaddr_to_offset(vma, vaddr)
    uprobes: Teach build_probe_list() to consider the range
    uprobes: Remove insert_vm_struct()->uprobe_mmap()
    uprobes: Remove copy_vma()->uprobe_mmap()
    uprobes: Fix overflow in vma_address()/find_active_uprobe()
    uprobes: Suppress uprobe_munmap() from mmput()
    uprobes: Uprobe_mmap/munmap needs list_for_each_entry_safe()
    uprobes: Clean up and document write_opcode()->lock_page(old_page)
    uprobes: Kill write_opcode()->lock_page(new_page)
    uprobes: __replace_page() should not use page_address_in_vma()
    uprobes: Don't recheck vma/f_mapping in write_opcode()
    perf/x86: Fix missing struct before structure name
    perf/x86: Fix format definition of SNB-EP uncore QPI box
    perf/x86: Make bitfield unsigned
    perf/x86: Fix LLC-* and node-* events on Intel SandyBridge
    perf/x86: Add Intel Nehalem-EX uncore support
    perf/x86: Fix typo in format definition of uncore PCU filter
    ...

    Linus Torvalds
     

26 Jul, 2012

1 commit

  • Otherwise sched_switch events can't be filtered for a defined task:

    perf record -e sched:sched_switch ./foo

    This command doesn't report any events without this patch.

    I think it isn't a security concern if someone knows who will
    be executed next - this can already be observed by polling /proc
    state. By default perf is disabled for non-root users in any case.

    I need these events for profiling sleep times. sched_switch is used for
    getting callchains and sched_stat_* is used for getting time periods.
    These events are combined in user space, then it can be analyzed by
    perf tools.

    Signed-off-by: Andrew Vagin
    Signed-off-by: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Arun Sharma
    Link: http://lkml.kernel.org/r/1342088069-1005148-1-git-send-email-avagin@openvz.org
    Signed-off-by: Ingo Molnar

    Andrew Vagin
     

24 Jul, 2012

7 commits

  • Stefan reported a crash on a kernel before a3e5d1091c1 ("sched:
    Don't call task_group() too many times in set_task_rq()"), he
    found the reason to be that the multiple task_group()
    invocations in set_task_rq() returned different values.

    Looking at all that I found a lack of serialization and plain
    wrong comments.

    The below tries to fix it using an extra pointer which is
    updated under the appropriate scheduler locks. It's not pretty,
    but I can't really see another way given how all the cgroup
    stuff works. (A small user-space sketch of the single-snapshot
    idea follows this entry.)

    Reported-and-tested-by: Stefan Bader
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1340364965.18025.71.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
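
    A tiny user-space sketch of the race and of the single-snapshot idea
    described in the entry above. It is illustrative only; every name in it
    is made up and none of this is the kernel code:

      /*
       * Calling an accessor twice can observe two different values if the
       * underlying pointer changes concurrently; taking one snapshot (here a
       * local variable, in the patch an extra per-task pointer updated under
       * the runqueue locks) keeps all uses consistent.
       */
      #include <stdatomic.h>
      #include <stdio.h>

      struct group { int id; };

      static _Atomic(struct group *) current_group; /* flipped concurrently */

      static void set_rq_racy(void)
      {
          struct group *a = atomic_load(&current_group);
          struct group *b = atomic_load(&current_group); /* may differ from 'a' */
          printf("racy:  %d vs %d\n", a->id, b->id);
      }

      static void set_rq_fixed(void)
      {
          struct group *tg = atomic_load(&current_group); /* single snapshot */
          printf("fixed: %d and %d\n", tg->id, tg->id);
      }

      int main(void)
      {
          static struct group g0 = { 0 };
          atomic_store(&current_group, &g0);
          set_rq_racy();
          set_rq_fixed();
          return 0;
      }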
     
  • Current load balance scheme requires only one cpu in a
    sched_group (balance_cpu) to look at other peer sched_groups for
    imbalance and pull tasks towards itself from a busy cpu. Tasks
    thus pulled by balance_cpu could later get picked up by cpus
    that are in the same sched_group as that of balance_cpu.

    This scheme however fails to pull tasks that are not allowed to
    run on balance_cpu (but are allowed to run on other cpus in its
    sched_group). That can affect fairness and in some worst case
    scenarios cause starvation.

    Consider a two core (2 threads/core) system running tasks as
    below:

      Core0              Core1
     /     \            /     \
    C0      C1         C2      C3
    |       |          |       |
    v       v          v       v
    F0      T1         F1    [idle]
            T2

    F0 = SCHED_FIFO task (pinned to C0)
    F1 = SCHED_FIFO task (pinned to C2)
    T1 = SCHED_OTHER task (pinned to C1)
    T2 = SCHED_OTHER task (pinned to C1 and C2)

    F1 could become a cpu hog, which will starve T2 unless C1 pulls
    it. Between C0 and C1 however, C0 is required to look for
    imbalance between cores, which will fail to pull T2 towards
    Core0. T2 will starve eternally in this case. The same scenario
    can arise in presence of non-rt tasks as well (say we replace F1
    with high irq load).

    We tackle this problem by having balance_cpu move pinned tasks
    to one of its sibling cpus (where they can run). We first check
    if load balance goal can be met by ignoring pinned tasks,
    failing which we retry move_tasks() with a new env->dst_cpu.

    This patch modifies load balance semantics on who can move load
    towards a given cpu in a given sched_domain.

    Before this patch, a given_cpu or an ilb_cpu acting on behalf of
    an idle given_cpu is responsible for moving load to given_cpu.

    With this patch applied, balance_cpu can in addition decide on
    moving some load to a given_cpu.

    There is a remote possibility that excess load could get moved
    as a result of this (balance_cpu and given_cpu/ilb_cpu deciding
    *independently* and at the *same* time to move some load to a
    given_cpu). However we should see few such conflicting decisions
    in practice, and moreover subsequent load balance cycles should
    correct any excess load moved to given_cpu. (A toy illustration
    of the sibling fallback follows this entry.)

    Signed-off-by: Srivatsa Vaddagiri
    Signed-off-by: Prashanth Nageshappa
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/4FE06CDB.2060605@linux.vnet.ibm.com
    [ minor edits ]
    Signed-off-by: Ingo Molnar

    Srivatsa Vaddagiri
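
    A toy illustration of the fallback described above. The helper, the types
    and the bitmasks are invented for this sketch; only the idea (pick another
    allowed CPU from the same sched_group when the balancing CPU itself cannot
    run the task) comes from the commit text:

      #include <stdio.h>

      #define NR_CPUS 4

      struct task {
          const char  *name;
          unsigned int cpus_allowed;   /* bitmask of CPUs it may run on */
      };

      /* Return dst_cpu if allowed, otherwise another allowed CPU in the group. */
      static int pick_dst_cpu(const struct task *t, int dst_cpu,
                              unsigned int group_mask)
      {
          if (t->cpus_allowed & (1u << dst_cpu))
              return dst_cpu;
          for (int cpu = 0; cpu < NR_CPUS; cpu++)
              if ((group_mask & (1u << cpu)) && (t->cpus_allowed & (1u << cpu)))
                  return cpu;          /* the "new env->dst_cpu" of the text */
          return -1;                   /* nowhere in this group to put it */
      }

      int main(void)
      {
          /* T2 from the example above: allowed on C1 and C2 only. */
          struct task t2 = { "T2", (1u << 1) | (1u << 2) };
          unsigned int core0_group = (1u << 0) | (1u << 1);  /* C0 and C1 */

          printf("%s -> C%d\n", t2.name, pick_dst_cpu(&t2, 0, core0_group));
          return 0;
      }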
     
  • While load balancing, if all tasks on the source runqueue are pinned,
    we retry after excluding the corresponding source cpu. However, the
    loop counters env.loop and env.loop_break are not reset before retrying,
    which can lead to failure in moving the tasks. In this patch we reset
    env.loop and env.loop_break to their initial values before we retry.
    (A minimal sketch of the retry pattern follows this entry.)

    Signed-off-by: Prashanth Nageshappa
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/4FE06EEF.2090709@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Prashanth Nageshappa
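
    A minimal, self-contained sketch of the retry pattern; the struct below is
    not the kernel's lb_env and the numbers are arbitrary, it only shows why
    stale counters make the second pass give up immediately:

      #include <stdbool.h>
      #include <stdio.h>

      struct env {
          int loop;        /* tasks examined so far in this pass */
          int loop_break;  /* examine at most this many before bailing out */
          int loop_max;
      };

      static int try_move_tasks(struct env *env, bool all_pinned)
      {
          int moved = 0;

          while (env->loop < env->loop_max) {
              env->loop++;
              if (env->loop > env->loop_break)
                  break;               /* budget for this pass exhausted */
              if (all_pinned)
                  continue;            /* pretend every task was pinned */
              moved++;
          }
          return moved;
      }

      int main(void)
      {
          struct env env = { .loop = 0, .loop_break = 4, .loop_max = 8 };

          if (!try_move_tasks(&env, true)) {   /* first pass: all pinned */
              /* The fix: restore the counters before retrying elsewhere.
               * Without these two lines the retry moves nothing. */
              env.loop = 0;
              env.loop_break = 4;
              printf("retry moved %d tasks\n", try_move_tasks(&env, false));
          }
          return 0;
      }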
     
  • Members of 'struct lb_env' are not ordered so as to reuse the
    compiler-added padding on 64-bit architectures. In this patch we reorder
    those struct members, reducing the size of the structure from 96 bytes
    to 80 bytes on 64-bit architectures. (A stand-alone example of the
    padding effect follows this entry.)

    Suggested-by: Srivatsa Vaddagiri
    Signed-off-by: Prashanth Nageshappa
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/4FE06DDE.7000403@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Prashanth Nageshappa
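
    A stand-alone example of the padding effect; this is not the actual
    'struct lb_env' layout and the members are invented. Building and running
    it on an LP64 machine shows the size difference:

      #include <stdio.h>

      struct scattered {    /* pointer/int interleaved: padding after each int */
          void *a; int x;
          void *b; int y;
          void *c; int z;
      };

      struct grouped {      /* pointers first, ints together: padding is reused */
          void *a; void *b; void *c;
          int x; int y; int z;
      };

      int main(void)
      {
          printf("scattered: %zu bytes\n", sizeof(struct scattered)); /* typically 48 */
          printf("grouped:   %zu bytes\n", sizeof(struct grouped));   /* typically 40 */
          return 0;
      }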
     
  • Traversing an entire package is not only expensive, it also leads to tasks
    bouncing all over a partially idle and possibly quite large package. Fix
    that up by assigning a 'buddy' CPU to try to motivate. Each buddy may try
    to motivate that one other CPU, if it's busy, tough, it may then try its
    SMT sibling, but that's all this optimization is allowed to cost.

    Sibling cache buddies are cross-wired to prevent bouncing.

    4 socket 40 core + SMT Westmere box, single 30 sec tbench runs, higher is better:

    clients      1     2     4     8    16    32    64   128
    ........................................................
    pre         30    41   118   645  3769  6214 12233 14312
    post       299   603  1211  2418  4697  6847 11606 14557

    A nice increase in performance.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1339471112.7352.32.camel@marge.simpson.net
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • Separate out the cpuset related handling for CPU/Memory online/offline.
    This also helps us exploit the most obvious and basic level of optimization
    that any notification mechanism (CPU/Mem online/offline) has to offer us:
    "We *know* why we have been invoked. So stop pretending that we are lost,
    and do only the necessary amount of processing!".

    And while at it, rename scan_for_empty_cpusets() to
    scan_cpusets_upon_hotplug(), which is more appropriate considering how
    it is restructured.

    Signed-off-by: Srivatsa S. Bhat
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20120524141650.3692.48637.stgit@srivatsabhat.in.ibm.com
    Signed-off-by: Ingo Molnar

    Srivatsa S. Bhat
     
  • In the event of CPU hotplug, the kernel modifies the cpusets' cpus_allowed
    masks as and when necessary to ensure that the tasks belonging to the cpusets
    have some place (online CPUs) to run on. And regular CPU hotplug is
    destructive in the sense that the kernel doesn't remember the original cpuset
    configurations set by the user, across hotplug operations.

    However, suspend/resume (which uses CPU hotplug) is a special case in which
    the kernel has the responsibility to restore the system (during resume), to
    exactly the same state it was in before suspend.

    In order to achieve that, do the following:

    1. Don't modify cpusets during suspend/resume. At all.
    In particular, don't move the tasks from one cpuset to another, and
    don't modify any cpuset's cpus_allowed mask. So, simply ignore cpusets
    during the CPU hotplug operations that are carried out in the
    suspend/resume path.

    2. However, cpusets and sched domains are related. We just want to avoid
    altering cpusets alone. So, to keep the sched domains updated, build
    a single sched domain (containing all active cpus) during each of the
    CPU hotplug operations carried out in s/r path, effectively ignoring
    the cpusets' cpus_allowed masks.

    (Since userspace is frozen while doing all this, it will go unnoticed.)

    3. During the last CPU online operation during resume, build the sched
    domains by looking up the (unaltered) cpusets' cpus_allowed masks.
    That will bring back the system to the same original state as it was in
    before suspend.

    Ultimately, this will not only solve the cpuset problem related to suspend
    resume (i.e., restores the cpusets to exactly what they were before
    suspend, by not touching them at all) but also speeds up suspend/resume
    because we avoid
    running cpuset update code for every CPU being offlined/onlined.

    Signed-off-by: Srivatsa S. Bhat
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20120524141611.3692.20155.stgit@srivatsabhat.in.ibm.com
    Signed-off-by: Ingo Molnar

    Srivatsa S. Bhat
     

15 Jul, 2012

1 commit

  • …t-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

    Pull RCU, perf, and scheduler fixes from Ingo Molnar.

    The RCU fix is a revert for an optimization that could cause deadlocks.

    One of the scheduler commits (164c33c6adee "sched: Fix fork() error path
    to not crash") is correct but not complete (some architectures like Tile
    are not covered yet) - the resulting additional fixes are still WIP and
    Ingo did not want to delay these pending fixes. See this thread on
    lkml:

    [PATCH] fork: fix error handling in dup_task()

    The perf fixes are just trivial oneliners.

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    Revert "rcu: Move PREEMPT_RCU preemption to switch_to() invocation"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf kvm: Fix segfault with report and mixed guestmount use
    perf kvm: Fix regression with guest machine creation
    perf script: Fix format regression due to libtraceevent merge
    ring-buffer: Fix accounting of entries when removing pages
    ring-buffer: Fix crash due to uninitialized new_pages list head

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    MAINTAINERS/sched: Update scheduler file pattern
    sched/nohz: Rewrite and fix load-avg computation -- again
    sched: Fix fork() error path to not crash

    Linus Torvalds
     

06 Jul, 2012

1 commit

  • Thanks to Charles Wang for spotting the defects in the current code:

    - If we go idle during the sample window -- after sampling, we get a
    negative bias because we can negate our own sample.

    - If we wake up during the sample window we get a positive bias
    because we push the sample to a known active period.

    So rewrite the entire nohz load-avg muck once again, now adding
    copious documentation to the code.

    Reported-and-tested-by: Doug Smythies
    Reported-and-tested-by: Charles Wang
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: stable@kernel.org
    Link: http://lkml.kernel.org/r/1340373782.18025.74.camel@twins
    [ minor edits ]
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

03 Jul, 2012

1 commit

  • This reverts commit 616c310e83b872024271c915c1b9ab505b9efad9.
    (Move PREEMPT_RCU preemption to switch_to() invocation).
    Testing by Sasha Levin showed that this
    can result in deadlock due to invoking the scheduler when one of
    the runqueue locks is held. Because this commit was simply a
    performance optimization, revert it.

    Reported-by: Sasha Levin
    Signed-off-by: Paul E. McKenney
    Tested-by: Sasha Levin

    Paul E. McKenney
     

09 Jun, 2012

2 commits

  • Pull scheduler fixes from Ingo Molnar.

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched: Fix the relax_domain_level boot parameter
    sched: Validate assumptions in sched_init_numa()
    sched: Always initialize cpu-power
    sched: Fix domain iteration
    sched/rt: Fix lockdep annotation within find_lock_lowest_rq()
    sched/numa: Load balance between remote nodes
    sched/x86: Calculate booted cores after construction of sibling_mask

    Linus Torvalds
     
  • Fix lots of new kernel-doc warnings in kernel/sched/fair.c:

    Warning(kernel/sched/fair.c:3625): No description found for parameter 'env'
    Warning(kernel/sched/fair.c:3625): Excess function parameter 'sd' description in 'update_sg_lb_stats'
    Warning(kernel/sched/fair.c:3735): No description found for parameter 'env'
    Warning(kernel/sched/fair.c:3735): Excess function parameter 'sd' description in 'update_sd_pick_busiest'
    Warning(kernel/sched/fair.c:3735): Excess function parameter 'this_cpu' description in 'update_sd_pick_busiest'
    .. more warnings

    Signed-off-by: Randy Dunlap
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

06 Jun, 2012

6 commits

  • The relax_domain_level boot parameter does not get processed because
    sched_domain_level_max is 0 at the time that setup_relax_domain_level()
    is run.

    Simply accept the value as it is, as we don't know the value of
    sched_domain_level_max until sched domain construction is completed.

    Fix sched_relax_domain_level in cpuset. The build_sched_domain() routine calls
    the set_domain_attribute() routine prior to setting the sd->level, however,
    the set_domain_attribute() routine relies on the sd->level to decide whether
    idle load balancing will be off/on.

    Signed-off-by: Dimitri Sivanich
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20120605184436.GA15668@sgi.com
    Signed-off-by: Ingo Molnar

    Dimitri Sivanich
     
  • Add some code to validate assumptions we're making and output
    warnings if they do not hold.

    If this triggers, we want to know about it.

    Signed-off-by: Peter Zijlstra
    Cc: Alex Shi
    Link: http://lkml.kernel.org/n/tip-6uc3wk5s9udxtdl9cnku0vtt@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Often when we run into misshapen topologies the balance iteration
    fails to update the cpu power properly and we'll end up in /0 traps.

    Always initialize the cpu-power to a semi-sane value so that we can
    at least boot the machine, even if the load-balancer might not
    function correctly. (A tiny sketch of the defensive default follows
    this entry.)

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-3lbhyj25sr169ha7z3qht5na@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
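
    A tiny sketch of the defensive-default idea, with made-up names; the 1024
    constant merely resembles the scheduler's power scale and everything else
    here is an assumption for illustration:

      #include <stdio.h>

      #define DEFAULT_CAPACITY 1024UL      /* semi-sane, never zero */

      struct cpu { unsigned long capacity; };

      static void init_cpu(struct cpu *c)
      {
          c->capacity = DEFAULT_CAPACITY;  /* safe even if never refined later */
      }

      static unsigned long relative_load(unsigned long load, const struct cpu *c)
      {
          return load * DEFAULT_CAPACITY / c->capacity;  /* cannot divide by 0 */
      }

      int main(void)
      {
          struct cpu c;

          init_cpu(&c);
          printf("%lu\n", relative_load(2048, &c));
          return 0;
      }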
     
  • Weird topologies can lead to asymmetric domain setups. This needs
    further consideration since these setups are typically non-minimal
    too.

    For now, make it work by adding an extra mask selecting which CPUs
    are allowed to iterate up.

    The topology that triggered it is the one from David Rientjes:

    10 20 20 30
    20 10 20 20
    20 20 10 20
    30 20 20 10

    resulting in boxes that wouldn't even boot.

    Reported-by: David Rientjes
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-3p86l9cuaqnxz7uxsojmz5rm@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Roland Dreier reported spurious, hard to trigger lockdep warnings
    within the scheduler - without any real lockup.

    This bit gives us the right clue:

    > [89945.640512] [] double_lock_balance+0x5a/0x90
    > [89945.640568] [] push_rt_task+0xc6/0x290

    if you look at that code you'll find the double_lock_balance() in
    question is the one in find_lock_lowest_rq() [yay for inlining].

    Now find_lock_lowest_rq() has a bug: it fails to use
    double_unlock_balance() in one exit path, if this results in a retry in
    push_rt_task() we'll call double_lock_balance() again, at which point
    we'll run into said lockdep confusion.

    Reported-by: Roland Dreier
    Acked-by: Steven Rostedt
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1337282386.4281.77.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Commit cb83b629b ("sched/numa: Rewrite the CONFIG_NUMA sched
    domain support") removed the NODE sched domain and started checking
    if the node distance in SLIT table is farther than REMOTE_DISTANCE,
    if so, it will lose the load balance chance at exec/fork/wake_affine
    points.

    But even when the node distance is farther than REMOTE_DISTANCE,
    modern CPUs have QPI-like connections which ensure that memory
    access is not too slow between nodes. So the above change in
    behavior on NUMA machines causes a performance regression on
    various benchmarks: hackbench, tbench, netperf, oltp, etc.

    This patch restores the old scheduler behavior on all my Intel
    platforms: NHM EP/EX, WSM EP, SNB EP/EP4S, and thus fixes the
    performance regressions. (All of them have just two distance
    values, 10 and 21.)

    Signed-off-by: Alex Shi
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1338965571-9812-1-git-send-email-alex.shi@intel.com
    Signed-off-by: Ingo Molnar

    Alex Shi
     

30 May, 2012

9 commits

  • Remove explicit NULL assignment of static pointer
    dattr_cur from init_sched_domains().

    Signed-off-by: Kamalesh Babulal
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20120523091411.GG5005@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Kamalesh Babulal
     
  • No need to have the last NULL entry.

    Signed-off-by: Hiroshi Shimamoto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/4FBF29E7.5020805@ct.jp.nec.com
    Signed-off-by: Ingo Molnar

    Hiroshi Shimamoto
     
  • The strings sched_feat_names are never changed.

    Signed-off-by: Hiroshi Shimamoto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/4FBF29B2.9030904@ct.jp.nec.com
    Signed-off-by: Ingo Molnar

    Hiroshi Shimamoto
     
  • task_tick_rt() has an optimization to only reschedule SCHED_RR tasks
    if they were the only element on their rq. However, with cgroups
    a SCHED_RR task could be the only element on its per-cgroup rq but
    still be competing with other SCHED_RR tasks in its parent's
    cgroup. In this case, the SCHED_RR task in the child cgroup would
    never yield at the end of its timeslice. If the child cgroup
    rt_runtime_us was the same as the parent cgroup rt_runtime_us,
    the task in the parent cgroup would starve completely.

    Modify task_tick_rt() to check that the task is the only task on its
    rq, and that each of the scheduling entities of its ancestors is also
    the only entity on its rq. (A simplified sketch of this hierarchical
    check follows this entry.)

    Signed-off-by: Colin Cross
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1337229266-15798-1-git-send-email-ccross@android.com
    Signed-off-by: Ingo Molnar

    Colin Cross
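
    A simplified user-space sketch of the hierarchical check; the types and
    the helper are invented, only the rule (skip the reschedule just when the
    entity is alone at every level) comes from the commit text:

      #include <stdbool.h>
      #include <stdio.h>

      struct rt_rq { unsigned int nr_running; };

      struct sched_entity {
          struct rt_rq        *rq;      /* runqueue this entity is queued on */
          struct sched_entity *parent;  /* enclosing cgroup's entity, or NULL */
      };

      static bool alone_in_hierarchy(const struct sched_entity *se)
      {
          for (; se; se = se->parent)
              if (se->rq->nr_running != 1)
                  return false;         /* some ancestor level has competition */
          return true;
      }

      int main(void)
      {
          struct rt_rq top = { .nr_running = 2 }, child = { .nr_running = 1 };
          struct sched_entity parent = { .rq = &top,   .parent = NULL };
          struct sched_entity se     = { .rq = &child, .parent = &parent };

          /* Alone in its own cgroup, but competing one level up: must yield. */
          printf("reschedule needed: %s\n", alone_in_hierarchy(&se) ? "no" : "yes");
          return 0;
      }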
     
  • Since nr_cpus_allowed is used outside of sched/rt.c, and there is a
    desire to use it more widely still, move it to a more natural site.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-kr61f02y9brwzkh6x53pdptm@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • We could re-read rq->rt_avg after we validated it was smaller than
    total, invalidating the check and resulting in an unintended negative.
    (A user-space sketch of the re-read hazard follows this entry.)

    Signed-off-by: Peter Zijlstra
    Cc: David Rientjes
    Link: http://lkml.kernel.org/r/1337688268.9698.29.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
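
    A user-space sketch of the re-read hazard and of the one-snapshot fix; the
    names are hypothetical and the atomic only stands in for a concurrently
    updated scheduler field:

      #include <stdatomic.h>
      #include <stdio.h>

      static _Atomic unsigned long rt_avg;       /* updated concurrently elsewhere */

      static unsigned long available_racy(unsigned long total)
      {
          if (atomic_load(&rt_avg) >= total)     /* validate one value ...        */
              return 0;
          return total - atomic_load(&rt_avg);   /* ... subtract another: may wrap */
      }

      static unsigned long available_fixed(unsigned long total)
      {
          unsigned long avg = atomic_load(&rt_avg);   /* read exactly once */

          if (avg >= total)
              return 0;
          return total - avg;                    /* check and use agree */
      }

      int main(void)
      {
          atomic_store(&rt_avg, 100);
          printf("racy:  %lu\n", available_racy(1024));
          printf("fixed: %lu\n", available_fixed(1024));
          return 0;
      }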
     
  • SD_OVERLAP exists to allow overlapping groups; overlapping groups
    appear in NUMA topologies that aren't fully connected.

    The typical result of not fully connected NUMA is that each cpu (or
    rather node) will have different spans for a particular distance.
    However due to how sched domains are traversed -- only the first cpu
    in the mask goes one level up -- the next level only cares about the
    spans of the cpus that went up.

    Due to this two things were observed to be broken:

    - build_overlap_sched_groups() -- since it's possible the cpu we're
    building the groups for exists in multiple (or all) groups, the
    selection criteria of the first group didn't ensure there was a cpu
    for which it was true that cpumask_first(span) == cpu. Thus load-
    balancing would terminate.

    - update_group_power() -- assumed that the cpu span of the first
    group of the domain was covered by all groups of the child domain.
    The above explains why this isn't true, so deal with it.

    Signed-off-by: Peter Zijlstra
    Cc: David Rientjes
    Link: http://lkml.kernel.org/r/1337788843.9783.14.camel@laptop
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Allocators don't appreciate it when you try and allocate memory from
    offline nodes.

    Reported-and-tested-by: Tony Luck
    Reported-and-tested-by: Anton Blanchard
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-epfc1io9whb7o22bcujf31vn@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Follow up on commit 556061b00 ("sched/nohz: Fix rq->cpu_load[]
    calculations") since while that fixed the busy case it regressed the
    mostly idle case.

    Add a callback from the nohz exit to also age the rq->cpu_load[]
    array. This closes the hole where either there was no nohz load
    balance pass during the nohz, or there was a 'significant' amount of
    idle time between the last nohz balance and the nohz exit.

    So we'll update unconditionally from the tick to not insert any
    accidental 0 load periods while busy, and we try and catch up from
    nohz idle balance and nohz exit. Both these are still prone to missing
    a jiffy, but that has always been the case. (A naive illustration of
    the aging step follows this entry.)

    Signed-off-by: Peter Zijlstra
    Cc: pjt@google.com
    Cc: Venkatesh Pallipadi
    Link: http://lkml.kernel.org/n/tip-kt0trz0apodbf84ucjfdbr1a@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
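
    A naive illustration of what "aging" rq->cpu_load[] means. The per-index
    decay written out below is an assumption made for illustration (the kernel
    uses precomputed degrade factors rather than a loop), but the shape of the
    idea is the same:

      #include <stdio.h>

      #define LOAD_IDX 5

      static unsigned long cpu_load[LOAD_IDX];

      /* One tick's worth of update: higher indexes track slower-moving averages. */
      static void update_cpu_load_tick(unsigned long new_load)
      {
          for (int i = 0; i < LOAD_IDX; i++) {
              unsigned long scale = 1UL << i;

              cpu_load[i] = (cpu_load[i] * (scale - 1) + new_load) / scale;
          }
      }

      /* Catch up after 'missed' ticks during which the CPU was idle (load 0). */
      static void age_cpu_load(unsigned int missed)
      {
          while (missed--)
              update_cpu_load_tick(0);
      }

      int main(void)
      {
          update_cpu_load_tick(1024);   /* one busy tick */
          age_cpu_load(10);             /* nohz idle period, caught up on exit */

          for (int i = 0; i < LOAD_IDX; i++)
              printf("cpu_load[%d] = %lu\n", i, cpu_load[i]);
          return 0;
      }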
     

24 May, 2012

2 commits

  • Pull user namespace enhancements from Eric Biederman:
    "This is a course correction for the user namespace, so that we can
    reach an inexpensive, maintainable, and reasonably complete
    implementation.

    Highlights:
    - Config guards make it impossible to enable the user namespace and
    code that has not been converted to be user namespace safe.

    - Use of the new kuid_t type ensures that if you somehow get past the
    config guards, the kernel will encounter type errors if you enable
    user namespaces and attempt to compile in code whose permission
    checks have not been updated to be user namespace safe.

    - All uids from child user namespaces are mapped into the initial
    user namespace before they are processed, removing the need to add
    an additional check to see whether the user namespace of the compared
    uids remains the same.

    - With the user namespaces compiled out the performance is as good or
    better than it is today.

    - For most operations absolutely nothing changes performance or
    operationally with the user namespace enabled.

    - The worst case performance I could come up with was timing 1
    billion cache cold stat operations with the user namespace code
    enabled. This went from 156s to 164s on my laptop (or 156ns to
    164ns per stat operation).

    - (uid_t)-1 and (gid_t)-1 are reserved as an internal error value.
    Most uid/gid setting system calls treat these values specially
    anyway so attempting to use -1 as a uid would likely cause
    entertaining failures in userspace.

    - If setuid is called with a uid that can not be mapped setuid fails.
    I have looked at sendmail, login, ssh and every other program I
    could think of that would call setuid and they all check for and
    handle the case where setuid fails.

    - If stat or a similar system call is called from a context in which
    we can not map a uid we lie and return overflowuid. The LFS
    experience suggests not lying and returning an error code might be
    better, but the historical precedent with uids is different and I
    can not think of anything that would break by lying about a uid we
    can't map.

    - Capabilities are localized to the current user namespace making it
    safe to give the initial user in a user namespace all capabilities.

    My git tree covers all of the modifications needed to convert the core
    kernel and enough changes to make a system bootable to runlevel 1."

    Fix up trivial conflicts due to nearby independent changes in fs/stat.c

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits)
    userns: Silence silly gcc warning.
    cred: use correct cred accessor with regards to rcu read lock
    userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq
    userns: Convert cgroup permission checks to use uid_eq
    userns: Convert tmpfs to use kuid and kgid where appropriate
    userns: Convert sysfs to use kgid/kuid where appropriate
    userns: Convert sysctl permission checks to use kuid and kgids.
    userns: Convert proc to use kuid/kgid where appropriate
    userns: Convert ext4 to user kuid/kgid where appropriate
    userns: Convert ext3 to use kuid/kgid where appropriate
    userns: Convert ext2 to use kuid/kgid where appropriate.
    userns: Convert devpts to use kuid/kgid where appropriate
    userns: Convert binary formats to use kuid/kgid where appropriate
    userns: Add negative depends on entries to avoid building code that is userns unsafe
    userns: signal remove unnecessary map_cred_ns
    userns: Teach inode_capable to understand inodes whose uids map to other namespaces.
    userns: Fail exec for suid and sgid binaries with ids outside our user namespace.
    userns: Convert stat to return values mapped from kuids and kgids
    userns: Convert user specfied uids and gids in chown into kuids and kgid
    userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs
    ...

    Linus Torvalds
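
    A minimal sketch of the kuid_t type-safety point from the highlights above.
    The definitions below mirror the kernel's kuid_t/uid_eq naming but are
    illustrative stand-ins, not the kernel headers:

      #include <stdbool.h>
      #include <stdio.h>

      typedef struct { unsigned int val; } kuid_t;   /* opaque wrapper */

      #define KUIDT_INIT(v) ((kuid_t){ .val = (v) })

      static bool uid_eq(kuid_t a, kuid_t b)
      {
          return a.val == b.val;                     /* the sanctioned compare */
      }

      int main(void)
      {
          kuid_t root = KUIDT_INIT(0), user = KUIDT_INIT(1000);

          /* root == user;   <- un-converted code like this no longer compiles */
          printf("same uid: %s\n", uid_eq(root, user) ? "yes" : "no");
          return 0;
      }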
     
  • …rnel.org/pub/scm/linux/kernel/git/tip/tip

    Pull perf fixes from Ingo Molnar:

    - Leftover AMD PMU driver fix from the end of the v3.4
    stabilization cycle.

    - Late tools/perf/ changes that missed the first round:
    * endianness fixes
    * event parsing improvements
    * libtraceevent fixes factored out from trace-cmd
    * perl scripting engine fixes related to libtraceevent,
    * testcase improvements
    * perf inject / pipe mode fixes
    * plus a kernel side fix

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/x86: Update event scheduling constraints for AMD family 15h models

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    Revert "sched, perf: Use a single callback into the scheduler"
    perf evlist: Show event attribute details
    perf tools: Bump default sample freq to 4 kHz
    perf buildid-list: Work better with pipe mode
    perf tools: Fix piped mode read code
    perf inject: Fix broken perf inject -b
    perf tools: rename HEADER_TRACE_INFO to HEADER_TRACING_DATA
    perf tools: Add union u64_swap type for swapping u64 data
    perf tools: Carry perf_event_attr bitfield throught different endians
    perf record: Fix documentation for branch stack sampling
    perf target: Add cpu flag to sample_type if target has cpu
    perf tools: Always try to build libtraceevent
    perf tools: Rename libparsevent to libtraceevent in Makefile
    perf script: Rename struct event to struct event_format in perl engine
    perf script: Explicitly handle known default print arg type
    perf tools: Add hardcoded name term for pmu events
    perf tools: Separate 'mem:' event scanner bits
    perf tools: Use allocated list for each parsed event
    perf tools: Add support for displaying event parser debug info
    perf test: Move parse event automated tests to separated object

    Linus Torvalds
     

23 May, 2012

4 commits

  • This reverts commit cb04ff9ac424 ("sched, perf: Use a single
    callback into the scheduler").

    Before this change was introduced, the process switch worked
    like this (wrt. to perf event schedule):

    schedule (prev, next)
    - schedule out all perf events for prev
    - switch to next
    - schedule in all perf events for current (next)

    After the commit, the process switch looks like:

    schedule (prev, next)
    - schedule out all perf events for prev
    - schedule in all perf events for (next)
    - switch to next

    The problem is that after we schedule perf events in, the pmu
    is enabled and we can receive events even before we make the
    switch to next - so "current" is still the prev process (event
    SAMPLE data are filled in based on the value of the "current"
    process).

    That's exactly what we see for the test__PERF_RECORD test. We
    receive SAMPLEs with the PID of the process that our tracee is
    scheduled from.

    Discussed with Peter Zijlstra:

    > Bah!, yeah I guess reverting is the right thing for now. Sad
    > though.
    >
    > So by having the two hooks we have a black-spot between them
    > where we receive no events at all, this black-spot covers the
    > hand-over of current and we thus don't receive the 'wrong'
    > events.
    >
    > I rather liked we could do away with both that black-spot and
    > clean up the code a little, but apparently people rely on it.

    Signed-off-by: Jiri Olsa
    Acked-by: Peter Zijlstra
    Cc: acme@redhat.com
    Cc: paulus@samba.org
    Cc: cjashfor@linux.vnet.ibm.com
    Cc: fweisbec@gmail.com
    Cc: eranian@google.com
    Link: http://lkml.kernel.org/r/20120523111302.GC1638@m.brq.redhat.com
    Signed-off-by: Ingo Molnar

    Jiri Olsa
     
  • Pull scheduler changes from Ingo Molnar:
    "The biggest change is the cleanup/simplification of the load-balancer:
    instead of the current practice of architectures twiddling scheduler
    internal data structures and providing the scheduler domains in
    colorfully inconsistent ways, we now have generic scheduler code in
    kernel/sched/core.c:sched_init_numa() that looks at the architecture's
    node_distance() parameters and (while not fully trusting it) deduces
    a NUMA topology from it.

    This inevitably changes balancing behavior - hopefully for the better.

    There are various smaller optimizations, cleanups and fixlets as well"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched: Taint kernel with TAINT_WARN after sleep-in-atomic bug
    sched: Remove stale power aware scheduling remnants and dysfunctional knobs
    sched/debug: Fix printing large integers on 32-bit platforms
    sched/fair: Improve the ->group_imb logic
    sched/nohz: Fix rq->cpu_load[] calculations
    sched/numa: Don't scale the imbalance
    sched/fair: Revert sched-domain iteration breakage
    sched/x86: Rewrite set_cpu_sibling_map()
    sched/numa: Fix the new NUMA topology bits
    sched/numa: Rewrite the CONFIG_NUMA sched domain support
    sched/fair: Propagate 'struct lb_env' usage into find_busiest_group
    sched/fair: Add some serialization to the sched_domain load-balance walk
    sched/fair: Let minimally loaded cpu balance the group
    sched: Change rq->nr_running to unsigned int
    x86/numa: Check for nonsensical topologies on real hw as well
    x86/numa: Hard partition cpu topology masks on node boundaries
    x86/numa: Allow specifying node_distance() for numa=fake
    x86/sched: Make mwait_usable() heed to "idle=" kernel parameters properly
    sched: Update documentation and comments
    sched_rt: Avoid unnecessary dequeue and enqueue of pushable tasks in set_cpus_allowed_rt()

    Linus Torvalds
     
  • Pull perf changes from Ingo Molnar:
    "Lots of changes:

    - (much) improved assembly annotation support in perf report, with
    jump visualization, searching, navigation, visual output
    improvements and more.

    - kernel support for AMD IBS PMU hardware features. Notably 'perf
    record -e cycles:p' and 'perf top -e cycles:p' should work without
    skid now, like PEBS does on the Intel side, because it takes
    advantage of IBS transparently.

    - the libtraceevent library: it is the first step towards unifying
    tracing tooling and perf, and it also gives a tracing library for
    external tools like powertop to rely on.

    - infrastructure: various improvements and refactoring of the UI
    modules and related code

    - infrastructure: cleanup and simplification of the profiling
    targets code (--uid, --pid, --tid, --cpu, --all-cpus, etc.)

    - tons of robustness fixes all around

    - various ftrace updates: speedups, cleanups, robustness
    improvements.

    - typing 'make' in tools/ will now give you a menu of projects to
    build and a short help text to explain what each does.

    - ... and lots of other changes I forgot to list.

    The perf record make bzImage + perf report regression you reported
    should be fixed."

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (166 commits)
    tracing: Remove kernel_lock annotations
    tracing: Fix initial buffer_size_kb state
    ring-buffer: Merge separate resize loops
    perf evsel: Create events initially disabled -- again
    perf tools: Split term type into value type and term type
    perf hists: Fix callchain ip printf format
    perf target: Add uses_mmap field
    ftrace: Remove selecting FRAME_POINTER with FUNCTION_TRACER
    ftrace/x86: Have x86 ftrace use the ftrace_modify_all_code()
    ftrace: Make ftrace_modify_all_code() global for archs to use
    ftrace: Return record ip addr for ftrace_location()
    ftrace: Consolidate ftrace_location() and ftrace_text_reserved()
    ftrace: Speed up search by skipping pages by address
    ftrace: Remove extra helper functions
    ftrace: Sort all function addresses, not just per page
    tracing: change CPU ring buffer state from tracing_cpumask
    tracing: Check return value of tracing_dentry_percpu()
    ring-buffer: Reset head page before running self test
    ring-buffer: Add integrity check at end of iter read
    ring-buffer: Make addition of pages in ring buffer atomic
    ...

    Linus Torvalds
     
  • Pull cgroup updates from Tejun Heo:
    "cgroup file type addition / removal is updated so that file types are
    added and removed instead of individual files so that dynamic file
    type addition / removal can be implemented by cgroup and used by
    controllers. blkio controller changes which will come through block
    tree are dependent on this. Other changes include res_counter cleanup
    and disallowing kthread / PF_THREAD_BOUND threads to be attached to
    non-root cgroups.

    There's a reported bug with the file type addition / removal handling
    which can lead to oops on cgroup umount. The issue is being looked
    into. It shouldn't cause problems for most setups and isn't a
    security concern."

    Fix up trivial conflict in Documentation/feature-removal-schedule.txt

    * 'for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
    res_counter: Account max_usage when calling res_counter_charge_nofail()
    res_counter: Merge res_counter_charge and res_counter_charge_nofail
    cgroups: disallow attaching kthreadd or PF_THREAD_BOUND threads
    cgroup: remove cgroup_subsys->populate()
    cgroup: get rid of populate for memcg
    cgroup: pass struct mem_cgroup instead of struct cgroup to socket memcg
    cgroup: make css->refcnt clearing on cgroup removal optional
    cgroup: use negative bias on css->refcnt to block css_tryget()
    cgroup: implement cgroup_rm_cftypes()
    cgroup: introduce struct cfent
    cgroup: relocate __d_cgrp() and __d_cft()
    cgroup: remove cgroup_add_file[s]()
    cgroup: convert memcg controller to the new cftype interface
    memcg: always create memsw files if CONFIG_CGROUP_MEM_RES_CTLR_SWAP
    cgroup: convert all non-memcg controllers to the new cftype interface
    cgroup: relocate cftype and cgroup_subsys definitions in controllers
    cgroup: merge cft_release_agent cftype array into the base files array
    cgroup: implement cgroup_add_cftypes() and friends
    cgroup: build list of all cgroups under a given cgroupfs_root
    cgroup: move cgroup_clear_directory() call out of cgroup_populate_dir()
    ...

    Linus Torvalds
     

22 May, 2012

2 commits

  • Pull smp hotplug cleanups from Thomas Gleixner:
    "This series is merily a cleanup of code copied around in arch/* and
    not changing any of the real cpu hotplug horrors yet. I wish I'd had
    something more substantial for 3.5, but I underestimated the lurking
    horror..."

    Fix up trivial conflicts in arch/{arm,sparc,x86}/Kconfig and
    arch/sparc/include/asm/thread_info_32.h

    * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (79 commits)
    um: Remove leftover declaration of alloc_task_struct_node()
    task_allocator: Use config switches instead of magic defines
    sparc: Use common threadinfo allocator
    score: Use common threadinfo allocator
    sh-use-common-threadinfo-allocator
    mn10300: Use common threadinfo allocator
    powerpc: Use common threadinfo allocator
    mips: Use common threadinfo allocator
    hexagon: Use common threadinfo allocator
    m32r: Use common threadinfo allocator
    frv: Use common threadinfo allocator
    cris: Use common threadinfo allocator
    x86: Use common threadinfo allocator
    c6x: Use common threadinfo allocator
    fork: Provide kmemcache based thread_info allocator
    tile: Use common threadinfo allocator
    fork: Provide weak arch_release_[task_struct|thread_info] functions
    fork: Move thread info gfp flags to header
    fork: Remove the weak insanity
    sh: Remove cpu_idle_wait()
    ...

    Linus Torvalds
     
  • Pull RCU changes from Ingo Molnar:
    "This is the v3.5 RCU tree from Paul E. McKenney:

    1) A set of improvements and fixes to the RCU_FAST_NO_HZ feature (with
    more on the way for 3.6). Posted to LKML:
    https://lkml.org/lkml/2012/4/23/324 (commits 1-3 and 5),
    https://lkml.org/lkml/2012/4/16/611 (commit 4),
    https://lkml.org/lkml/2012/4/30/390 (commit 6), and
    https://lkml.org/lkml/2012/5/4/410 (commit 7, combined with
    the other commits for the convenience of the tester).

    2) Changes to make rcu_barrier() avoid disrupting execution of CPUs
    that have no RCU callbacks. Posted to LKML:
    https://lkml.org/lkml/2012/4/23/322.

    3) A couple of commits that improve the efficiency of the interaction
    between preemptible RCU and the scheduler, these two being all that
    survived an abortive attempt to allow preemptible RCU's
    __rcu_read_lock() to be inlined. The full set was posted to LKML at
    https://lkml.org/lkml/2012/4/14/143, and the first and third patches
    of that set remain.

    4) Lai Jiangshan's algorithmic implementation of SRCU, which includes
    call_srcu() and srcu_barrier(). A major feature of this new
    implementation is that synchronize_srcu() no longer disturbs the
    execution of other CPUs. This work is based on earlier
    implementations by Peter Zijlstra and Paul E. McKenney. Posted to
    LKML: https://lkml.org/lkml/2012/2/22/82.

    5) A number of miscellaneous bug fixes and improvements which were
    posted to LKML at: https://lkml.org/lkml/2012/4/23/353 with
    subsequent updates posted to LKML."

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
    rcu: Make rcu_barrier() less disruptive
    rcu: Explicitly initialize RCU_FAST_NO_HZ per-CPU variables
    rcu: Make RCU_FAST_NO_HZ handle timer migration
    rcu: Update RCU maintainership
    rcu: Make exit_rcu() more precise and consolidate
    rcu: Move PREEMPT_RCU preemption to switch_to() invocation
    rcu: Ensure that RCU_FAST_NO_HZ timers expire on correct CPU
    rcu: Add rcutorture test for call_srcu()
    rcu: Implement per-domain single-threaded call_srcu() state machine
    rcu: Use single value to handle expedited SRCU grace periods
    rcu: Improve srcu_readers_active_idx()'s cache locality
    rcu: Remove unused srcu_barrier()
    rcu: Implement a variant of Peter's SRCU algorithm
    rcu: Improve SRCU's wait_idx() comments
    rcu: Flip ->completed only once per SRCU grace period
    rcu: Increment upper bit only for srcu_read_lock()
    rcu: Remove fast check path from __synchronize_srcu()
    rcu: Direct algorithmic SRCU implementation
    rcu: Introduce rcutorture testing for rcu_barrier()
    timer: Fix mod_timer_pinned() header comment
    ...

    Linus Torvalds
     

19 May, 2012

1 commit

  • Merge reason: We are going to queue up a dependent patch:

    "perf tools: Move parse event automated tests to separated object"

    That depends on:

    commit e7c72d8
    perf tools: Add 'G' and 'H' modifiers to event parsing

    Conflicts:
    tools/perf/builtin-stat.c

    Conflicted with the recent 'perf_target' patches when checking the
    result of perf_evsel open routines to see if a retry is needed to cope
    with older kernels where the exclude guest/host perf_event_attr bits
    were not used.

    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo
     

17 May, 2012

1 commit

  • It's been broken forever (i.e. it's not scheduling in a power
    aware fashion), as reported by Suresh and others sending
    patches, and nobody cares enough to fix it properly ...
    so remove it to make space free for something better.

    There are various problems with the code as it stands today, first
    and foremost the user interface, which is bound to topology
    levels and has multiple values per level. This results in a
    state explosion which the administrator or distro needs to
    master and almost nobody does.

    Furthermore, large configuration state spaces aren't good: it
    means the thing doesn't just work right, because it's either
    under so many impossible-to-meet constraints, or, even if
    there's an achievable state, workloads have to be aware of
    it precisely and can never meet it for dynamic workloads.

    So pushing this kind of decision to user-space was a bad idea
    even with a single knob - it's exponentially worse with knobs
    on every node of the topology.

    There is a proposal to replace the user interface with a single
    3 state knob:

    sched_balance_policy := { performance, power, auto }

    where 'auto' would be the preferred default which looks at things
    like Battery/AC mode and possible cpufreq state or whatever the hw
    exposes to show us power use expectations - but there's been no
    progress on it in the past many months.

    Aside from that, the actual implementation of the various knobs
    is known to be broken. There have been sporadic attempts at
    fixing things but these always stop short of reaching a mergeable
    state.

    Therefore this wholesale removal with the hopes of spurring
    people who care to come forward once again and work on a
    coherent replacement.

    Signed-off-by: Peter Zijlstra
    Cc: Suresh Siddha
    Cc: Arjan van de Ven
    Cc: Vincent Guittot
    Cc: Vaidyanathan Srinivasan
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/1326104915.2442.53.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra