14 Aug, 2011

28 commits

  • Basic description of usage and effect for CFS Bandwidth Control.

    Signed-off-by: Bharata B Rao
    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184758.498036116@google.com
    Signed-off-by: Ingo Molnar

    Bharata B Rao
     
  • When a local cfs_rq blocks we return the majority of its remaining quota to the
    global bandwidth pool for use by other runqueues.

    We do this only when the quota is current and there is more than
    min_cfs_rq_quota [1ms by default] of runtime remaining on the rq.

    In the case where there are throttled runqueues and we have sufficient
    bandwidth to meter out a slice, a second timer is kicked off to handle this
    delivery, unthrottling where appropriate.

    Using a 'worst case' antagonist which executes on each cpu
    for 1ms before moving onto the next on a fairly large machine:

    no quota generations:

    197.47 ms /cgroup/a/cpuacct.usage
    199.46 ms /cgroup/a/cpuacct.usage
    205.46 ms /cgroup/a/cpuacct.usage
    198.46 ms /cgroup/a/cpuacct.usage
    208.39 ms /cgroup/a/cpuacct.usage

    Since we are allowed to use "stale" quota our usage is effectively bounded by
    the rate of input into the global pool and performance is relatively stable.

    with quota generations [1s increments]:

    119.58 ms /cgroup/a/cpuacct.usage
    119.65 ms /cgroup/a/cpuacct.usage
    119.64 ms /cgroup/a/cpuacct.usage
    119.63 ms /cgroup/a/cpuacct.usage
    119.60 ms /cgroup/a/cpuacct.usage

    The large deficit here is due to quota generations (/intentionally/) preventing
    us from now using previously stranded slack quota. The cost is that this quota
    becomes unavailable.

    with quota generations and quota return:

    200.09 ms /cgroup/a/cpuacct.usage
    200.09 ms /cgroup/a/cpuacct.usage
    198.09 ms /cgroup/a/cpuacct.usage
    200.09 ms /cgroup/a/cpuacct.usage
    200.06 ms /cgroup/a/cpuacct.usage

    By returning unused quota we're able to both stably consume our desired quota
    and prevent unintentional overages due to the abuse of slack quota from
    previous quota periods (especially on a large machine).

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184758.306848658@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
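
    As an illustration of the slack-return rule above, here is a minimal,
    stand-alone C sketch. The 1ms min_cfs_rq_quota threshold and the notion of a
    quota generation come from the commit text; the structures and helper names
    are hypothetical and this is not the kernel implementation.

        /* Stand-alone model of the slack-quota return described above. */
        #include <stdio.h>

        #define NSEC_PER_MSEC 1000000ULL
        #define MIN_CFS_RQ_QUOTA (1 * NSEC_PER_MSEC)  /* 1ms default, per the commit */

        struct global_pool { unsigned long long runtime; unsigned long long generation; };
        struct local_rq    { unsigned long long runtime; unsigned long long generation; };

        /* When a local cfs_rq dequeues its last task, hand back everything above
         * min_cfs_rq_quota, but only if the cached quota is still current. */
        static void return_local_slack(struct global_pool *pool, struct local_rq *rq)
        {
            if (rq->generation != pool->generation)   /* stale quota: nothing to return */
                return;
            if (rq->runtime <= MIN_CFS_RQ_QUOTA)      /* keep a minimum for a quick re-wake */
                return;

            pool->runtime += rq->runtime - MIN_CFS_RQ_QUOTA;
            rq->runtime = MIN_CFS_RQ_QUOTA;
        }

        int main(void)
        {
            struct global_pool pool = { .runtime = 0, .generation = 42 };
            struct local_rq rq = { .runtime = 5 * NSEC_PER_MSEC, .generation = 42 };

            return_local_slack(&pool, &rq);
            printf("returned to pool: %llu ns, kept locally: %llu ns\n",
                   pool.runtime, rq.runtime);
            return 0;
        }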
     
  • This change introduces statistics exports for the cpu sub-system; these are
    added through the use of a stat file similar to that exported by other
    subsystems.

    The following exports are included:

    nr_periods: number of periods in which execution occurred
    nr_throttled: the number of periods above in which execution was throttled
    throttled_time: cumulative wall-time that any cpus have been throttled for
    this group

    Signed-off-by: Paul Turner
    Signed-off-by: Nikhil Rao
    Signed-off-by: Bharata B Rao
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184758.198901931@google.com
    Signed-off-by: Ingo Molnar

    Nikhil Rao
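
    A small user-space reader for these exports is sketched below. The cgroup
    mount point and group name are assumptions (they mirror the /cgroup/a group
    used in the earlier benchmark output); only the three field names come from
    the commit.

        /* Read nr_periods, nr_throttled and throttled_time from cpu.stat. */
        #include <stdio.h>

        int main(void)
        {
            const char *path = "/cgroup/a/cpu.stat";  /* hypothetical mount and group */
            char key[64];
            long long value;
            FILE *f = fopen(path, "r");

            if (!f) {
                perror(path);
                return 1;
            }
            /* Expected lines: nr_periods, nr_throttled, throttled_time (ns). */
            while (fscanf(f, "%63s %lld", key, &value) == 2)
                printf("%-16s %lld\n", key, value);

            fclose(f);
            return 0;
        }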
     
  • With the machinery in place to throttle and unthrottle entities, as well as
    to handle their participation (or lack thereof), we can now enable throttling.

    There are two points at which we must check whether it's time to set the
    throttled state: put_prev_entity() and enqueue_entity().

    - put_prev_entity() is the typical throttle path, we reach it by exceeding our
    allocated run-time within update_curr()->account_cfs_rq_runtime() and going
    through a reschedule.

    - enqueue_entity() covers the case of a wake-up into an already throttled
    group. In this case we know the group cannot be on_rq and can throttle
    immediately. Checks are added at the time of put_prev_entity() and
    enqueue_entity().

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184758.091415417@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
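
    The two check points can be modelled in user space as below. The structures
    and helpers are purely illustrative stand-ins, not the scheduler code; only
    the two call sites and their roles come from the commit.

        #include <stdbool.h>
        #include <stdio.h>

        struct cfs_rq_model {
            long long runtime_remaining;
            bool throttled;
            bool on_rq;
        };

        static void throttle(struct cfs_rq_model *rq)
        {
            rq->throttled = true;
            rq->on_rq = false;
            printf("throttled\n");
        }

        /* Typical path: update_curr() drove runtime_remaining negative and forced
         * a resched; the throttle itself happens when the entity is put. */
        static void put_prev_entity_model(struct cfs_rq_model *rq)
        {
            if (!rq->throttled && rq->runtime_remaining <= 0)
                throttle(rq);
        }

        /* Wakeup path: an already throttled group cannot be on_rq, so the wakeup
         * can be refused (throttled) immediately. */
        static void enqueue_entity_model(struct cfs_rq_model *rq)
        {
            if (rq->throttled) {
                printf("wakeup into throttled group: stays dequeued\n");
                return;
            }
            rq->on_rq = true;
        }

        int main(void)
        {
            struct cfs_rq_model rq = { .runtime_remaining = -100 };

            put_prev_entity_model(&rq);  /* quota exhausted: throttles */
            enqueue_entity_model(&rq);   /* wakeup sees the throttled state */
            return 0;
        }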
     
  • Throttled tasks are invisible to cpu-offline since they are not eligible for
    selection by pick_next_task(). The regular 'escape' path for a thread that is
    blocked at offline is via ttwu->select_task_rq, however this will not handle a
    throttled group since there are no individual thread wakeups on an unthrottle.

    Resolve this by unthrottling offline cpus so that threads can be migrated.

    Signed-off-by: Paul Turner
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.989000590@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Buddies allow us to select "on-rq" entities without actually selecting them
    from a cfs_rq's rb_tree. As a result we must ensure that throttled entities
    are not falsely nominated as buddies. The fact that entities are dequeued
    within throttle_entity is not sufficient for clearing buddy status as the
    nomination may occur after throttling.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.886850167@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • From the perspective of load-balance and shares distribution, throttled
    entities should be invisible.

    However, both of these operations work on 'active' lists and are not
    inherently aware of what group hierarchies may be present. In some cases this
    may be side-stepped (e.g. we could sideload via tg_load_down in load balance)
    while in others (e.g. update_shares()) it is more difficult to compute without
    incurring some O(n^2) costs.

    Instead, track hierarchical throttled state at the time of transition. This
    allows us to easily identify whether an entity belongs to a throttled hierarchy
    and avoid incorrect interactions with it.

    Also, when an entity leaves a throttled hierarchy we need to advance its
    time averaging for shares averaging so that the elapsed throttled time is not
    considered as part of the cfs_rq's operation.

    We also use this information to prevent buddy interactions in the wakeup and
    yield_to() paths.

    Signed-off-by: Paul Turner
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.777916795@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
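
    The hierarchical tracking can be pictured with the small model below: each
    group keeps a count of throttled ancestors (itself included), so "does this
    entity belong to a throttled hierarchy?" becomes a single comparison. Field
    and function names are illustrative, not the kernel's.

        #include <stdio.h>

        #define MAX_CHILDREN 4

        struct group {
            int throttle_count;                 /* > 0 while any ancestor is throttled */
            struct group *child[MAX_CHILDREN];
            int nr_children;
        };

        static int throttled_hierarchy(const struct group *g)
        {
            return g->throttle_count > 0;
        }

        /* Walk the subtree at transition time: +1 on throttle, -1 on unthrottle. */
        static void propagate(struct group *g, int delta)
        {
            g->throttle_count += delta;
            for (int i = 0; i < g->nr_children; i++)
                propagate(g->child[i], delta);
        }

        int main(void)
        {
            struct group leaf = { 0 };
            struct group mid  = { .child = { &leaf }, .nr_children = 1 };

            propagate(&mid, +1);   /* mid gets throttled */
            printf("leaf in throttled hierarchy: %d\n", throttled_hierarchy(&leaf));
            propagate(&mid, -1);   /* mid gets unthrottled */
            printf("leaf in throttled hierarchy: %d\n", throttled_hierarchy(&leaf));
            return 0;
        }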
     
  • Extend walk_tg_tree to accept a positional argument

    static int walk_tg_tree_from(struct task_group *from,
    tg_visitor down, tg_visitor up, void *data)

    Existing semantics are preserved, caller must hold rcu_lock() or sufficient
    analogue.

    Signed-off-by: Paul Turner
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.677889157@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
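
    A user-space sketch of the quoted semantics follows: visit "down" on the way
    into each subtree and "up" on the way back out, starting from an arbitrary
    node rather than the root. The tree type and callbacks are stand-ins for the
    kernel's task_group machinery, and the recursive walk is a simplification.

        #include <stdio.h>

        struct tg {
            const char *name;
            struct tg *child[4];
            int nr_children;
        };

        typedef int (*tg_visitor)(struct tg *tg, void *data);

        static int walk_tg_tree_from(struct tg *from,
                                     tg_visitor down, tg_visitor up, void *data)
        {
            int ret = down(from, data);
            if (ret)
                return ret;                    /* a non-zero "down" aborts the walk */

            for (int i = 0; i < from->nr_children; i++) {
                ret = walk_tg_tree_from(from->child[i], down, up, data);
                if (ret)
                    return ret;
            }
            return up(from, data);
        }

        static int print_down(struct tg *tg, void *data) { (void)data; printf("down %s\n", tg->name); return 0; }
        static int print_up(struct tg *tg, void *data)   { (void)data; printf("up   %s\n", tg->name); return 0; }

        int main(void)
        {
            struct tg leaf = { .name = "leaf" };
            struct tg root = { .name = "root", .child = { &leaf }, .nr_children = 1 };

            walk_tg_tree_from(&root, print_down, print_up, NULL);
            return 0;
        }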
     
  • At the start of each period we refresh the global bandwidth pool. At this time
    we must also unthrottle any cfs_rq entities that are now within bandwidth once
    more (as quota permits).

    Unthrottled entities have their corresponding cfs_rq->throttled flag cleared
    and their entities re-enqueued.

    Signed-off-by: Paul Turner
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.574628950@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Now that consumption is tracked (via update_curr()) we add support to throttle
    group entities (and their corresponding cfs_rqs) in the case where there is no
    run-time remaining.

    Throttled entities are dequeued to prevent scheduling, additionally we mark
    them as throttled (using cfs_rq->throttled) to prevent them from becoming
    re-enqueued until they are unthrottled. A list of a task_group's throttled
    entities is maintained on the cfs_bandwidth structure.

    Note: While the machinery for throttling is added in this patch the act of
    throttling an entity exceeding its bandwidth is deferred until later within
    the series.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.480608533@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
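
    The bookkeeping described above reduces to the following stand-alone sketch:
    mark the cfs_rq throttled and link it on a per-task_group list so the period
    timer can find it later. The list handling is simplified and the names,
    while mirroring the commit text, are illustrative only.

        #include <stdbool.h>
        #include <stdio.h>

        struct cfs_rq_m {
            int cpu;
            bool throttled;
            struct cfs_rq_m *next_throttled;  /* linkage on the bandwidth structure */
        };

        struct cfs_bandwidth_m {
            struct cfs_rq_m *throttled_list;  /* all throttled cfs_rqs of the task_group */
        };

        static void throttle_cfs_rq_m(struct cfs_bandwidth_m *cfs_b, struct cfs_rq_m *rq)
        {
            /* 1. dequeue the group entity so it can no longer be picked (elided)  */
            /* 2. mark it so enqueue paths refuse it until it is unthrottled       */
            rq->throttled = true;
            /* 3. remember it for the period timer's later unthrottle pass         */
            rq->next_throttled = cfs_b->throttled_list;
            cfs_b->throttled_list = rq;
        }

        int main(void)
        {
            struct cfs_bandwidth_m cfs_b = { 0 };
            struct cfs_rq_m rq0 = { .cpu = 0 }, rq1 = { .cpu = 1 };

            throttle_cfs_rq_m(&cfs_b, &rq0);
            throttle_cfs_rq_m(&cfs_b, &rq1);

            for (struct cfs_rq_m *p = cfs_b.throttled_list; p; p = p->next_throttled)
                printf("cpu%d throttled=%d\n", p->cpu, p->throttled);
            return 0;
        }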
     
  • Since quota is managed using a global state but consumed on a per-cpu basis
    we need to ensure that our per-cpu state is appropriately synchronized.
    Most importantly, runtime that is stale (from a previous period) should not be
    locally consumable.

    We take advantage of existing sched_clock synchronization about the jiffy to
    efficiently detect whether we have (globally) crossed a quota boundary above.

    One catch is that the direction of spread on sched_clock is undefined,
    specifically, we don't know whether our local clock is behind or ahead
    of the one responsible for the current expiration time.

    Fortunately we can differentiate these by considering whether the
    global deadline has advanced. If it has not, then we assume our clock to be
    "fast" and advance our local expiration; otherwise, we know the deadline has
    truly passed and we expire our local runtime.

    Signed-off-by: Paul Turner
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.379275352@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
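
    The expiration decision can be modelled with plain integers standing in for
    sched_clock values and the global deadline. This is a sketch of the logic as
    described in the commit, not the kernel code, and the one-unit extension
    stands in for the roughly one-jiffy nudge mentioned above.

        #include <stdio.h>

        struct pool_m { unsigned long long runtime_expires; };
        struct rq_m   {
            unsigned long long runtime_remaining;
            unsigned long long runtime_expires;  /* deadline cached when runtime was drawn */
        };

        static void expire_local_runtime(struct rq_m *rq, const struct pool_m *pool,
                                         unsigned long long local_clock)
        {
            if (local_clock < rq->runtime_expires)
                return;                               /* not expired by our clock yet */

            if (pool->runtime_expires == rq->runtime_expires) {
                /* Global deadline has not advanced: our clock is merely "fast",
                 * so extend the local expiration instead of expiring runtime. */
                rq->runtime_expires += 1;
            } else {
                /* Deadline truly passed: this runtime belongs to an old period. */
                rq->runtime_remaining = 0;
            }
        }

        int main(void)
        {
            struct pool_m pool = { .runtime_expires = 100 };
            struct rq_m rq = { .runtime_remaining = 5000, .runtime_expires = 100 };

            expire_local_runtime(&rq, &pool, 105);    /* fast local clock */
            printf("fast clock: remaining=%llu expires=%llu\n",
                   rq.runtime_remaining, rq.runtime_expires);

            pool.runtime_expires = 200;               /* the global pool was refreshed */
            expire_local_runtime(&rq, &pool, 150);
            printf("real expiry: remaining=%llu\n", rq.runtime_remaining);
            return 0;
        }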
     
  • This patch adds a per-task_group timer which handles the refresh of the global
    CFS bandwidth pool.

    Since the RT pool is using a similar timer there's some small refactoring to
    share this support.

    Signed-off-by: Paul Turner
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.277271273@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Account bandwidth usage on the cfs_rq level versus the task_groups to which
    they belong. Whether we are tracking bandwidth on a given cfs_rq is maintained
    under cfs_rq->runtime_enabled.

    cfs_rq's which belong to a bandwidth constrained task_group have their runtime
    accounted via the update_curr() path, which withdraws bandwidth from the global
    pool as desired. Updates involving the global pool are currently protected
    under cfs_bandwidth->lock, local runtime is protected by rq->lock.

    This patch only assigns and tracks quota, no action is taken in the case that
    cfs_rq->runtime_used exceeds cfs_rq->runtime_assigned.

    Signed-off-by: Paul Turner
    Signed-off-by: Nikhil Rao
    Signed-off-by: Bharata B Rao
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.179386821@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
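
    The accounting path reduces to the sketch below: runtime consumed by
    update_curr() is charged against a local allotment, and more is drawn from
    the global per-task_group pool when the local allotment runs dry. The 5ms
    slice size and the structures are assumptions for illustration only.

        #include <stdio.h>

        #define SLICE_NS 5000000LL   /* refill granularity, an illustrative value */

        struct pool_m { long long quota_remaining; };
        struct rq_m   { long long runtime_remaining; };

        static void assign_cfs_rq_runtime(struct pool_m *pool, struct rq_m *rq)
        {
            /* In the kernel this step is serialized by cfs_bandwidth->lock. */
            long long amount = pool->quota_remaining < SLICE_NS ?
                               pool->quota_remaining : SLICE_NS;
            pool->quota_remaining -= amount;
            rq->runtime_remaining += amount;
        }

        static void account_cfs_rq_runtime(struct pool_m *pool, struct rq_m *rq,
                                           long long delta_exec)
        {
            rq->runtime_remaining -= delta_exec;
            if (rq->runtime_remaining <= 0)
                assign_cfs_rq_runtime(pool, rq);  /* may still leave us at or below 0 */
        }

        int main(void)
        {
            struct pool_m pool = { .quota_remaining = 20000000LL };  /* 20ms of quota */
            struct rq_m rq = { .runtime_remaining = 0 };

            assign_cfs_rq_runtime(&pool, &rq);
            account_cfs_rq_runtime(&pool, &rq, 6000000LL);           /* ran for 6ms */
            printf("local=%lld global=%lld\n",
                   rq.runtime_remaining, pool.quota_remaining);
            return 0;
        }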
     
  • Add constraints validation for CFS bandwidth hierarchies.

    Validate that:
    max(child bandwidth) <= parent bandwidth
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.083774572@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • In this patch we introduce the notion of CFS bandwidth, partitioned into
    globally unassigned bandwidth, and locally claimed bandwidth.

    - The global bandwidth is per task_group, it represents a pool of unclaimed
    bandwidth that cfs_rqs can allocate from.
    - The local bandwidth is tracked per-cfs_rq, this represents allotments from
    the global pool bandwidth assigned to a specific cpu.

    Bandwidth is managed via cgroupfs, adding two new interfaces to the cpu subsystem:
    - cpu.cfs_period_us : the bandwidth period in usecs
    - cpu.cfs_quota_us : the cpu bandwidth (in usecs) that this tg will be allowed
    to consume over period above.

    Signed-off-by: Paul Turner
    Signed-off-by: Nikhil Rao
    Signed-off-by: Bharata B Rao
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184756.972636699@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
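
    From user space the two files are configured with ordinary writes, as in the
    sketch below. The cgroup path is an assumption; it depends on where the cpu
    controller is mounted and on the group's name.

        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        static int write_str(const char *path, const char *val)
        {
            int fd = open(path, O_WRONLY);
            if (fd < 0) {
                perror(path);
                return -1;
            }
            ssize_t n = write(fd, val, strlen(val));
            close(fd);
            return n < 0 ? -1 : 0;
        }

        int main(void)
        {
            /* 100ms period, 50ms quota: the group may use at most half a cpu. */
            write_str("/cgroup/a/cpu.cfs_period_us", "100000");
            write_str("/cgroup/a/cpu.cfs_quota_us", "50000");
            return 0;
        }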
     
  • Introduce hierarchical task accounting for the group scheduling case in CFS, as
    well as promote the responsibility for maintaining rq->nr_running to the
    scheduling classes.

    The primary motivation for this is that with scheduling classes supporting
    bandwidth throttling it is possible for entities participating in throttled
    sub-trees to not have root visible changes in rq->nr_running across activate
    and de-activate operations. This in turn leads to incorrect idle and
    weight-per-task load balance decisions.

    This also allows us to make a small fixlet to the fastpath in pick_next_task()
    under group scheduling.

    Note: this issue also exists with the existing sched_rt throttling mechanism.
    This patch does not address that.

    Signed-off-by: Paul Turner
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184756.878333391@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Since [sched/cpupri: Remove the vec->lock], member pri_active
    of struct cpupri is not needed any more, just remove it. Also
    clean up the stuff related to it.

    Signed-off-by: Yong Zhang
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110806001004.GA2207@zhy
    Signed-off-by: Ingo Molnar

    Yong Zhang
     
  • [ This patch actually compiles. Thanks to Mike Galbraith for pointing
    that out. I compiled and booted this patch with no issues. ]

    Re-examining the cpupri patch, I see there's a possible race because the
    updates of the two priorities' vec->counts are not ordered by a memory
    barrier.

    When a RT runqueue is overloaded and wants to push an RT task to another
    runqueue, it scans the RT priority vectors in a loop from lowest
    priority to highest.

    When we queue or dequeue an RT task that changes a runqueue's highest
    priority task, we update the vectors to show that a runqueue is rated at
    a different priority. To do this, we first set the new priority mask,
    and increment the vec->count, and then set the old priority mask by
    decrementing the vec->count.

    If we are lowering the runqueue's RT priority rating, it will trigger a
    RT pull, and we do not care if we miss pushing to this runqueue or not.

    But if we raise the priority, but the priority is still lower than an RT
    task that is looking to be pushed, we must make sure that this runqueue
    is still seen by the push algorithm (the loop).

    Because the loop reads from lowest to highest, and the new priority is
    set before the old one is cleared, we will either see the new or old
    priority set and the vector will be checked.

    But! Since there's no memory barrier between the updates of the two, the
    old count may be decremented first before the new count is incremented.
    This means the loop may see the old count of zero and skip it, and also
    the new count of zero before it was updated. A possible runqueue that
    the RT task could move to could be missed.

    A conditional memory barrier is placed between the vec->count updates
    and is only called when both updates are done.

    The smp_wmb() has also been changed to smp_mb__before_atomic_inc/dec(),
    as they are not needed by archs that already synchronize
    atomic_inc/dec().

    The smp_rmb() has been moved to be called at every iteration of the loop
    so that the race between seeing the two updates is visible by each
    iteration of the loop, as an arch is free to optimize the reading of
    memory of the counters in the loop.

    Signed-off-by: Steven Rostedt
    Signed-off-by: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1312547269.18583.194.camel@gandalf.stny.rr.com
    Signed-off-by: Ingo Molnar

    Steven Rostedt
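
    The ordering requirement can be illustrated in user space with C11 atomics
    standing in for the kernel's smp_mb()/smp_rmb(): the new priority's count
    must be visibly incremented before the old one is decremented, or a scanner
    may miss the runqueue entirely. The array size and helper names are
    illustrative; this is a model, not the cpupri code.

        #include <stdatomic.h>
        #include <stdio.h>

        #define NR_PRIO 102

        static atomic_int vec_count[NR_PRIO];

        /* Writer: move a runqueue's rating from old_pri to new_pri. */
        static void cpupri_move(int old_pri, int new_pri)
        {
            atomic_fetch_add(&vec_count[new_pri], 1);
            atomic_thread_fence(memory_order_seq_cst);  /* kernel: barrier between updates */
            atomic_fetch_sub(&vec_count[old_pri], 1);
        }

        /* Reader: scan from lowest to highest; the per-iteration fence mirrors
         * the smp_rmb() the commit moves inside the loop. */
        static int cpupri_find(int max_pri)
        {
            for (int pri = 0; pri <= max_pri; pri++) {
                atomic_thread_fence(memory_order_seq_cst);
                if (atomic_load(&vec_count[pri]) > 0)
                    return pri;
            }
            return -1;
        }

        int main(void)
        {
            atomic_store(&vec_count[10], 1);
            cpupri_move(10, 20);
            printf("first level with a rated runqueue: %d\n", cpupri_find(NR_PRIO - 1));
            return 0;
        }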
     
  • sched/cpupri: Remove the vec->lock

    The cpupri vec->lock has been showing up as a top contention point
    lately. This is because the RT push/pull logic takes an
    aggressive approach to migrating RT tasks. The cpupri logic is
    in place to improve the performance of the push/pull when dealing
    with machines with a large number of CPUs.

    The problem though is that a vec->lock is required, where a vec is a
    global per-RT-priority structure. That is, if there are lots of
    RT tasks at the same priority, every time they are added or removed
    from the RT queue, this global vec->lock is taken. Now that more
    kernel threads are becoming RT (RCU boost and threaded interrupts)
    this is becoming much more of an issue.

    There are two variables that are being synced by the vec->lock:
    the cpupri bitmask and the vec->counter. The cpupri bitmask
    is one bit per priority. If a RT priority vec has a process queued,
    then the vec->count is > 0 and the cpupri bitmask is set for that
    RT priority.

    If the cpupri bitmask gets out of sync with the vec->counter, we could
    end up pushing a low priority RT task to a high priority queue.
    An RT task that could have run immediately could instead be queued on a
    run queue behind a higher priority task indefinitely.

    The solution is to stop using the cpupri bitmask and just look at the
    vec->count directly when doing a pull. The cpupri bitmask is just
    a fast way to scan the RT priorities when a pull is made. If, instead
    of using the bitmask, we simply examine all RT priorities and
    look at the vec->counts, we can eliminate the vec->lock. The
    scan of RT tasks is there to find a run queue that we can push an RT task
    to, and since we do not push to a higher priority queue, the scan only
    needs to go from 1 to RT task->prio, and not over all 100 RT priorities.

    The push algorithm, which does the scan of RT priorities (and
    scan of the bitmask) only happens when we have an overloaded RT run
    queue (more than one RT task queued). The grabbing of the vec->lock
    happens every time any RT task is queued or dequeued on the run
    queue for that priority. The slowdown of the scan from not using
    a bitmask is negligible compared to the speed-up of removing the vec->lock
    contention and replacing it with an atomic counter and a memory barrier.

    To prove this, I wrote a patch that times both the loop and the code
    that grabs the vec->locks. I passed the patches to various people
    (and companies) to test and show the results. I let everyone choose
    their own load to test, giving different loads on the system,
    for various different setups.

    Here's some of the results: (snipping to a few CPUs to not make
    this change log huge, but the results were consistent across
    the entire system).

    System 1 (24 CPUs)

    Before patch:
    CPU: Name Count Max Min Average Total
    ---- ---- ----- --- --- ------- -----
    [...]
    cpu 20: loop 3057 1.766 0.061 0.642 1963.170
    vec 6782949 90.469 0.089 0.414 2811760.503
    cpu 21: loop 2617 1.723 0.062 0.641 1679.074
    vec 6782810 90.499 0.089 0.291 1978499.900
    cpu 22: loop 2212 1.863 0.063 0.699 1547.160
    vec 6767244 85.685 0.089 0.435 2949676.898
    cpu 23: loop 2320 2.013 0.062 0.594 1380.265
    vec 6781694 87.923 0.088 0.431 2928538.224

    After patch:
    cpu 20: loop 2078 1.579 0.061 0.533 1108.006
    vec 6164555 5.704 0.060 0.143 885185.809
    cpu 21: loop 2268 1.712 0.065 0.575 1305.248
    vec 6153376 5.558 0.060 0.187 1154960.469
    cpu 22: loop 1542 1.639 0.095 0.533 823.249
    vec 6156510 5.720 0.060 0.190 1172727.232
    cpu 23: loop 1650 1.733 0.068 0.545 900.781
    vec 6170784 5.533 0.060 0.167 1034287.953

    All times are in microseconds. The 'loop' is the amount of time spent
    doing the loop across the priorities (the before-patch version uses the
    bitmask). The 'vec' is the amount of time in the code that requires grabbing
    the vec->lock. The second patch just does not have the vec lock, but
    encompasses the same code.

    Amazingly, the loop code even went down on average. The vec code went
    from 0.5us down to 0.18us, cutting the time spent by more than half!

    Note, more than one test was run, but they all had the same results.

    System 2 (64 CPUs)

    Before patch:
    CPU: Name Count Max Min Average Total
    ---- ---- ----- --- --- ------- -----
    cpu 60: loop 0 0 0 0 0
    vec 5410840 277.954 0.084 0.782 4232895.727
    cpu 61: loop 0 0 0 0 0
    vec 4915648 188.399 0.084 0.570 2803220.301
    cpu 62: loop 0 0 0 0 0
    vec 5356076 276.417 0.085 0.786 4214544.548
    cpu 63: loop 0 0 0 0 0
    vec 4891837 170.531 0.085 0.799 3910948.833

    After patch:
    cpu 60: loop 0 0 0 0 0
    vec 5365118 5.080 0.021 0.063 340490.267
    cpu 61: loop 0 0 0 0 0
    vec 4898590 1.757 0.019 0.071 347903.615
    cpu 62: loop 0 0 0 0 0
    vec 5737130 3.067 0.021 0.119 687108.734
    cpu 63: loop 0 0 0 0 0
    vec 4903228 1.822 0.021 0.071 348506.477

    The test run during the measurement did not have any (very few,
    from other CPUs) RT tasks pushing. But this shows that it helped
    out tremendously with the contention, as the contention happens
    because the vec->lock is taken only on queuing at an RT priority,
    and different CPUs that queue tasks at the same priority will
    have contention.

    I tested on my own 4 CPU machine with the following results:

    Before patch:
    CPU: Name Count Max Min Average Total
    ---- ---- ----- --- --- ------- -----
    cpu 0: loop 2377 1.489 0.158 0.588 1398.395
    vec 4484 770.146 2.301 4.396 19711.755
    cpu 1: loop 2169 1.962 0.160 0.576 1250.110
    vec 4425 152.769 2.297 4.030 17834.228
    cpu 2: loop 2324 1.749 0.155 0.559 1299.799
    vec 4368 779.632 2.325 4.665 20379.268
    cpu 3: loop 2325 1.629 0.157 0.561 1306.113
    vec 4650 408.782 2.394 4.348 20222.577

    After patch:
    CPU: Name Count Max Min Average Total
    ---- ---- ----- --- --- ------- -----
    cpu 0: loop 2121 1.616 0.113 0.636 1349.189
    vec 4303 1.151 0.225 0.421 1811.966
    cpu 1: loop 2130 1.638 0.178 0.644 1372.927
    vec 4627 1.379 0.235 0.428 1983.648
    cpu 2: loop 2056 1.464 0.165 0.637 1310.141
    vec 4471 1.311 0.217 0.433 1937.927
    cpu 3: loop 2154 1.481 0.162 0.601 1295.083
    vec 4236 1.253 0.230 0.425 1803.008

    This was running my migrate.c code that can be found at:
    http://lwn.net/Articles/425763/

    The migrate code does stress the RT tasks a bit. This shows that
    the loop did increase a little after the patch, but not by much.
    The vec code dropped dramatically. From 4.3us down to .42us.
    That's a 10x improvement!

    Tested-by: Mike Galbraith
    Tested-by: Luis Claudio R. Gonçalves
    Tested-by: Matthew Hank Sabins
    Signed-off-by: Steven Rostedt
    Reviewed-by: Gregory Haskins
    Acked-by: Hillf Danton
    Signed-off-by: Peter Zijlstra
    Cc: Chris Mason
    Link: http://lkml.kernel.org/r/1312317372.18583.101.camel@gandalf.stny.rr.com
    Signed-off-by: Ingo Molnar

    Steven Rostedt
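
    The lock-free bookkeeping described above boils down to the following
    user-space model: queue/dequeue touch only an atomic per-priority counter,
    and the push-side scan reads those counters directly, covering only the
    levels below the task being pushed. It is a simplified sketch, not the
    kernel's cpupri implementation.

        #include <stdatomic.h>
        #include <stdio.h>

        #define NR_RT_PRIO 100

        static atomic_int vec_count[NR_RT_PRIO];

        /* Formerly done under vec->lock; now one atomic op per transition. */
        static void rt_rq_rated(int prio)    { atomic_fetch_add(&vec_count[prio], 1); }
        static void rt_rq_unrated(int prio)  { atomic_fetch_sub(&vec_count[prio], 1); }

        /* Look for a level rated lower than the task being pushed: the scan only
         * needs to cover 1 .. task_prio, not all 100 RT priorities. */
        static int find_lower_rated_level(int task_prio)
        {
            for (int pri = 0; pri < task_prio && pri < NR_RT_PRIO; pri++)
                if (atomic_load(&vec_count[pri]) > 0)
                    return pri;
            return -1;
        }

        int main(void)
        {
            rt_rq_rated(30);
            rt_rq_rated(30);
            rt_rq_unrated(30);
            printf("candidate level for a prio-50 task: %d\n", find_lower_rated_level(50));
            return 0;
        }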
     
  • Hillf Danton proposed a patch (see link) that cleaned up the
    sched_rt code that calculates the priority of the next highest priority
    task to be used in finding run queues to pull from.

    His patch removed the calculation of the next prio and just used the current
    prio when determining if we should examine a run queue to pull from. The problem
    with his patch was that it caused more false checks, because we check a run
    queue for pushable tasks if the current priority of that run queue is higher
    in priority than the task about to run on our run queue. But after grabbing
    the locks and doing the real check, we may find that there is no task
    with a higher prio to pull after all. Thus the locks were taken with nothing to
    do.

    I added some trace_printks() to record when and how many times the run queue
    locks were taken to check for pullable tasks, compared to how many times we
    pulled a task.

    With the current method, it was:

    3806 locks taken vs 2812 pulled tasks

    With Hillf's patch:

    6728 locks taken vs 2804 pulled tasks

    The number of times locks were taken to pull a task almost doubled, with
    no improvement in the success rate.

    But his patch did get me thinking. When we look at the priority of the highest
    task to consider taking the locks to do a pull, a failure to pull can be one
    of the following: (in order of most likely)

    o RT task was pushed off already between the check and taking the lock
    o Waiting RT task can not be migrated
    o RT task's CPU affinity does not include the target run queue's CPU
    o RT task's priority changed between the check and taking the lock

    And with Hillf's patch, the thing that caused most of the failures is that
    the RT task to pull was not at the right priority to pull (not greater than
    the current RT task priority on the target run queue).

    Most of the above cases we can't help. But the current method does not check
    if the next highest prio RT task can be migrated or not, and if it can not,
    we still grab the locks to do the test (we don't find out about this fact until
    after we have the locks). I thought about this case, and realized that the
    pushable task plist that is maintained only holds RT tasks that can migrate.
    If we move the calculation of the next highest prio task from the inc/dec_rt_task()
    functions into the queuing of the pushable tasks, then we only measure the
    priorities of those tasks that we push, and we get this basically for free.

    Not only does this patch make the code a little more efficient, it cleans it
    up and makes it a little simpler.

    Thanks to Hillf Danton for inspiring me on this patch.

    Signed-off-by: Steven Rostedt
    Signed-off-by: Peter Zijlstra
    Cc: Hillf Danton
    Cc: Gregory Haskins
    Link: http://lkml.kernel.org/r/BANLkTimQ67180HxCx5vgMqumqw1EkFh3qg@mail.gmail.com
    Signed-off-by: Ingo Molnar

    Steven Rostedt
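
    The idea reduces to the toy model below: track the "next highest" priority
    from the pushable-task queue, which only ever contains migratable tasks,
    instead of recomputing it on every task enqueue/dequeue. The plist is
    reduced to a per-priority counter and the names are illustrative.

        #include <stdio.h>

        #define NR_RT_PRIO 100

        struct rt_rq_m {
            int pushable[NR_RT_PRIO];   /* queued migratable tasks per priority */
            int highest_prio_next;      /* best priority available for pushing */
        };

        static void recompute_next(struct rt_rq_m *rq)
        {
            rq->highest_prio_next = NR_RT_PRIO;      /* sentinel: nothing pushable */
            for (int pri = 0; pri < NR_RT_PRIO; pri++)
                if (rq->pushable[pri]) {
                    rq->highest_prio_next = pri;     /* lower number = higher prio */
                    break;
                }
        }

        static void enqueue_pushable(struct rt_rq_m *rq, int prio)
        {
            rq->pushable[prio]++;
            if (prio < rq->highest_prio_next)
                rq->highest_prio_next = prio;
        }

        static void dequeue_pushable(struct rt_rq_m *rq, int prio)
        {
            rq->pushable[prio]--;
            if (prio == rq->highest_prio_next)
                recompute_next(rq);
        }

        int main(void)
        {
            struct rt_rq_m rq = { .highest_prio_next = NR_RT_PRIO };

            enqueue_pushable(&rq, 40);
            enqueue_pushable(&rq, 10);
            dequeue_pushable(&rq, 10);
            printf("next pushable priority: %d\n", rq.highest_prio_next);  /* 40 */
            return 0;
        }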
     
  • When a new task is woken, the code to balance the RT task is currently
    skipped in the select_task_rq() call. But it will be pushed if the rq
    is currently overloaded with RT tasks anyway. The issue is that we
    already queued the task, and if it does get pushed, it will have to
    be dequeued and requeued on the new run queue. The advantage with
    pushing it first is that we avoid this requeuing as we are pushing it
    off before the task is ever queued.

    See commit 318e0893ce3f524 ("sched: pre-route RT tasks on wakeup")
    for more details.

    The return of select_task_rq() when it is not a wake-up has also been
    changed to return task_cpu() instead of smp_processor_id(). This is more
    of a sanity check, because currently the only other user of select_task_rq()
    besides wake-ups is exec, where task_cpu() should also be the same
    as smp_processor_id(). But if it is ever used for other purposes, let's keep
    the task on the same CPU. Why would we want to migrate it to the current
    CPU?

    Signed-off-by: Steven Rostedt
    Signed-off-by: Peter Zijlstra
    Cc: Hillf Danton
    Link: http://lkml.kernel.org/r/20110617015919.832743148@goodmis.org
    Signed-off-by: Ingo Molnar

    Steven Rostedt
     
  • There's no reason to clean the exec_start in put_prev_task_rt() as it is reset
    when the task gets back to the run queue. This saves us doing a store() in the
    fast path.

    Signed-off-by: Hillf Danton
    Signed-off-by: Steven Rostedt
    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Yong Zhang
    Link: http://lkml.kernel.org/r/BANLkTimqWD=q6YnSDi-v9y=LMWecgEzEWg@mail.gmail.com
    Signed-off-by: Ingo Molnar

    Hillf Danton
     
  • Do not call dequeue_pushable_task() when failing to push an eligible
    task, as it remains pushable, merely not at this particular moment.

    Signed-off-by: Hillf Danton
    Signed-off-by: Mike Galbraith
    Signed-off-by: Steven Rostedt
    Signed-off-by: Peter Zijlstra
    Cc: Yong Zhang
    Link: http://lkml.kernel.org/r/1306895385.4791.26.camel@marge.simson.net
    Signed-off-by: Ingo Molnar

    Hillf Danton
     
  • Checking for the validity of sd is removed, since it is already
    checked by the for_each_domain macro.

    Signed-off-by: Hillf Danton
    Signed-off-by: Steven Rostedt
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/BANLkTimT+Tut-3TshCDm-NiLLXrOznibNA@mail.gmail.com
    Signed-off-by: Ingo Molnar

    Hillf Danton
     
  • When computing the next priority for a given run-queue, the check for
    RT priority of the task determined by the pick_next_highest_task_rt()
    function could be removed, since only RT tasks are returned by the
    function.

    Reviewed-by: Yong Zhang
    Signed-off-by: Hillf Danton
    Signed-off-by: Steven Rostedt
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/BANLkTimxmWiof9s5AvS3v_0X+sMiE=0x5g@mail.gmail.com
    Signed-off-by: Ingo Molnar

    Hillf Danton
     
  • Setting child->prio = current->normal_prio _after_ SCHED_RESET_ON_FORK has
    been handled for an RT parent gives birth to a deranged mutant child with
    non-RT policy, but RT prio and sched_class.

    Move PI leakage protection up, always set priorities and weight, and if the
    child is leaving RT class, reset rt_priority to the proper value.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1311779695.8691.2.camel@marge.simson.net
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • Remove the WAKEUP_PREEMPT feature; disabling it doesn't make any sense
    and it has outlived its usefulness by a long, long while.

    Signed-off-by: Yong Zhang
    Acked-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110729082033.GB12106@zhy
    Signed-off-by: Ingo Molnar

    Yong Zhang
     
  • Since commit a2d47777 ("sched: fix stale value in average load per task")
    the variable rq->avg_load_per_task is no longer required. Remove it.

    Signed-off-by: Jan H. Schönherr
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1312189408-17172-1-git-send-email-schnhrr@cs.tu-berlin.de
    Signed-off-by: Ingo Molnar

    Jan H. Schönherr
     

12 Aug, 2011

8 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
    sparc: Don't do hypervisor calls on non-sun4v in DS driver.

    Linus Torvalds
     
  • Reported-by: Pieter-Paul Giesberts
    Signed-off-by: David S. Miller

    David S. Miller
     
  • Just like the files layout, the blocks & objects layouts are part of the
    NFS 4.1 protocol and should be automatically selected if NFS_4_1
    is selected. The small problem is that these depend on other
    kernel support being present, while files only depends on NFS
    itself.

    This patch removes the objects and blocks layouts from the user-visible
    choice, but makes sure they are selected only if the subsystems they
    depend on are present in the kernel.

    Signed-off-by: Boaz Harrosh
    Acked-by: Peng Tao
    Signed-off-by: Linus Torvalds

    Boaz Harrosh
     
  • Commit df5e6223407e ("ext4: fix deadlock in ext4_symlink() in ENOSPC
    conditions") recalculated the number of credits needed for a long
    symlink, in the process of splitting it into two transactions. However,
    the first credit calculation under-counted because if selinux is
    enabled, credits are needed to create the selinux xattr as well.

    Overrunning the reservation will result in an OOPS in
    jbd2_journal_dirty_metadata() due to this assert:

    J_ASSERT_JH(jh, handle->h_buffer_credits > 0);

    Fix this by increasing the reservation size.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Jan Kara
    Acked-by: "Theodore Ts'o"
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Eric Sandeen
     
  • Commit ae54870a1dc9 ("ext3: Fix lock inversion in ext3_symlink()")
    recalculated the number of credits needed for a long symlink, in the
    process of splitting it into two transactions. However, the first
    credit calculation under-counted because if selinux is enabled, credits
    are needed to create the selinux xattr as well.

    Overrunning the reservation will result in an OOPS in
    journal_dirty_metadata() due to this assert:

    J_ASSERT_JH(jh, handle->h_buffer_credits > 0);

    Fix this by increasing the reservation size.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Jan Kara
    Acked-by: "Theodore Ts'o"
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Eric Sandeen
     
  • The patch http://lkml.org/lkml/2003/7/13/226 introduced an RLIMIT_NPROC
    check in set_user() to check for NPROC exceeding via setuid() and
    similar functions.

    Before the check, an unprivileged user could greatly exceed the allowed
    number of processes if the program relied on rlimit only. But the check
    created a new security threat: many poorly written programs simply don't
    check the setuid() return code and believe it cannot fail if executed with
    root privileges. So, the check is removed in this patch because of the
    frequent privilege escalations related to buggy programs.

    The NPROC limit can still be enforced in the common code flow of daemons
    spawning user processes. Most daemons do fork()+setuid()+execve().
    The check introduced in execve() (1) enforces the same limit as in
    setuid() and (2) doesn't create similar security issues.

    Neil Brown suggested tracking which specific process has exceeded the
    limit by setting the PF_NPROC_EXCEEDED process flag. With the change only
    this process will fail on execve(), and other processes' execve()
    behaviour is not changed.

    Solar Designer suggested re-checking whether the NPROC limit is still
    exceeded at the moment of execve(). If the process was sleeping for
    days between set*uid() and execve(), and the NPROC counter stepped down
    under the limit, a deferred execve() failure because the NPROC limit was
    exceeded days ago would be unexpected. If the limit is not exceeded
    anymore, we clear the flag on successful calls to execve() and fork().

    The flag is also cleared on successful calls to set_user() as the limit
    was exceeded for the previous user, not the current one.

    Similar check was introduced in -ow patches (without the process flag).

    v3 - clear PF_NPROC_EXCEEDED on successful calls to set_user().

    Reviewed-by: James Morris
    Signed-off-by: Vasiliy Kulikov
    Acked-by: NeilBrown
    Signed-off-by: Linus Torvalds

    Vasiliy Kulikov
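
    The flag-based flow can be modelled in user space as below: set*uid() only
    marks the process when the new user is over RLIMIT_NPROC, and execve() is
    where the failure is finally enforced, unless the count has meanwhile
    dropped back under the limit. The types and names are stand-ins for the
    kernel's.

        #include <stdbool.h>
        #include <stdio.h>

        struct task_m { bool pf_nproc_exceeded; };            /* models PF_NPROC_EXCEEDED */
        struct user_m { long processes; long rlimit_nproc; };

        static int set_user_m(struct task_m *t, const struct user_m *u)
        {
            /* Do not fail here (too many callers ignore the return value);
             * just remember whether the limit was exceeded. */
            t->pf_nproc_exceeded = (u->processes > u->rlimit_nproc);
            return 0;
        }

        static int execve_m(struct task_m *t, const struct user_m *u)
        {
            if (t->pf_nproc_exceeded && u->processes > u->rlimit_nproc)
                return -1;                                    /* -EAGAIN in the kernel */
            t->pf_nproc_exceeded = false;                     /* cleared on success */
            return 0;
        }

        int main(void)
        {
            struct task_m task = { false };
            struct user_m user = { .processes = 120, .rlimit_nproc = 100 };

            set_user_m(&task, &user);
            printf("execve while over the limit: %d\n", execve_m(&task, &user));

            user.processes = 80;       /* the counter dropped back under the limit */
            printf("execve after dropping below: %d\n", execve_m(&task, &user));
            return 0;
        }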
     
  • …l/git/tip/linux-2.6-tip

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    perf symbols: Check '/tmp/perf-' symbol file ownership
    perf sched: Usage leftover from trace -> script rename
    perf sched: Do not delete session object prematurely
    perf tools: Check $HOME/.perfconfig ownership
    perf, x86: Add model 45 SandyBridge support
    perf tools: Add support to install perf python extension
    perf tools: do not look at ./config for configuration
    perf tools: Make clean leaves some files
    perf lock: Dropping unsupported ':r' modifier
    perf probe: Fix coredump introduced by probe module option
    jump label: Reduce the cycle count by changing the link order
    perf report: Use ui__warning in some more places
    perf python: Add PERF_RECORD_{LOST,READ,SAMPLE} routine tables
    perf evlist: Introduce 'disable' method
    trace events: Update version number reference to new 3.x scheme for EVENT_POWER_TRACING_DEPRECATED
    perf buildid-cache: Zero out buffer of filenames when adding/removing buildid

    Linus Torvalds
     
  • Change to new git tree -
    (git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git).

    Signed-off-by: Tracey Dent
    Acked-by: WANG Cong
    Signed-off-by: Linus Torvalds

    Tracey Dent
     

11 Aug, 2011

4 commits

  • This reverts commit af9d220bac41dc3201893e1601cc7c44f7da4498.

    It turns out that one was meant to be applied on top of the edac.git
    tree in -next that has more i7core_edac changes, but that wasn't clear
    in the original email.

    Reported-by: Stephen Rothwell
    Acked-by: Borislav Petkov
    Cc: Randy Dunlap
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • PNFS_BLOCK needs BLK_DEV_DM/MD, which is not a dependency for other
    pnfs layout drivers. Separate it out so others can still build when
    BLK_DEV_DM/MD is not enabled.

    Also change select to depends on to avoid build failures.

    Reported-and-tested-by: Randy Dunlap
    Signed-off-by: Peng Tao
    Acked-by: Benny Halevy
    Signed-off-by: Linus Torvalds

    Peng Tao
     
  • * 'fixes' of master.kernel.org:/home/rmk/linux-2.6-arm:
    ARM: drop experimental status for ARM_PATCH_PHYS_VIRT
    ARM: 7008/1: alignment: Make SIGBUS sent to userspace POSIXly correct
    ARM: 7007/1: alignment: Prevent ignoring of faults with ARMv6 unaligned access model
    ARM: 7010/1: mm: fix invalid loop for poison_init_mem
    ARM: 7005/1: freshen up mm/proc-arm946.S
    dmaengine: PL08x: Fix trivial build error
    ARM: Fix build error for SMP=n builds

    Linus Torvalds
     
  • * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
    powerpc: Really fix build without CONFIG_PCI
    powerpc: Fix build without CONFIG_PCI
    powerpc/4xx: Fix build of PCI code on 405
    powerpc/pseries: Simplify vpa deregistration functions
    powerpc/pseries: Cleanup VPA registration and deregistration errors
    powerpc/pseries: Fix kexec on recent firmware versions
    MAINTAINERS: change maintainership of mpc5xxx
    powerpc: Make KVM_GUEST default to n
    powerpc/kvm: Fix build errors with older toolchains
    powerpc: Lack of ibm,io-events not that important!
    powerpc: Move kdump default base address to half RMO size on 64bit
    powerpc/perf: Disable pagefaults during callchain stack read
    ppc: Remove duplicate definition of PV_POWER7
    powerpc: pseries: Fix kexec on machines with more than 4TB of RAM
    powerpc: Jump label misalignment causes oops at boot
    powerpc: Clean up some panic messages in prom_init
    powerpc: Fix device tree claim code
    powerpc: Return the_cpu_ spec from identify_cpu
    powerpc: mtspr/mtmsr should take an unsigned long

    Linus Torvalds