25 Mar, 2011

1 commit

  • Merge branch 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block

    * 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
    Documentation/iostats.txt: bit-size reference etc.
    cfq-iosched: removing unnecessary think time checking
    cfq-iosched: Don't clear queue stats when preempt.
    blk-throttle: Reset group slice when limits are changed
    blk-cgroup: Only give unaccounted_time under debug
    cfq-iosched: Don't set active queue in preempt
    block: fix non-atomic access to genhd inflight structures
    block: attempt to merge with existing requests on plug flush
    block: NULL dereference on error path in __blkdev_get()
    cfq-iosched: Don't update group weights when on service tree
    fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
    block: Require subsystems to explicitly allocate bio_set integrity mempool
    jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    fs: make fsync_buffers_list() plug
    mm: make generic_writepages() use plugging
    blk-cgroup: Add unaccounted time to timeslice_used.
    block: fixup plugging stubs for !CONFIG_BLOCK
    block: remove obsolete comments for blkdev_issue_zeroout.
    blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
    ...

    Fix up conflicts in fs/{aio.c,super.c}

    Linus Torvalds
     

23 Mar, 2011

2 commits

  • The sentence uses the possessive pronoun, which is spelled
    without an apostrophe.

    Signed-off-by: Jonathan Neuschäfer
    Cc: Jiri Kosina
    Cc: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Jonathan Neuschäfer
     
  • Because all kthreads are created from a single helper task, they all use
    memory from a single node for their kernel stack and task struct.

    This patch series creates kthread_create_on_node(), adding a 'node'
    parameter to the parameters already used by kthread_create().

    This parameter is used to allocate memory for the new kthread on its
    memory node if possible.

    Signed-off-by: Eric Dumazet
    Acked-by: David S. Miller
    Reviewed-by: Andi Kleen
    Acked-by: Rusty Russell
    Cc: Tejun Heo
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: David Howells
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
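
    A minimal usage sketch of the interface described above; the worker
    thread function and its name are illustrative assumptions, while
    kthread_create_on_node() and cpu_to_node() are the real calls:

        #include <linux/kthread.h>
        #include <linux/topology.h>     /* cpu_to_node() */

        /* Hypothetical worker body, just for the sketch. */
        static int worker_fn(void *data)
        {
                while (!kthread_should_stop()) {
                        set_current_state(TASK_INTERRUPTIBLE);
                        schedule();
                }
                return 0;
        }

        static struct task_struct *spawn_worker_for_cpu(int cpu)
        {
                /*
                 * Stack and task_struct are allocated on the memory node
                 * backing 'cpu' instead of whichever node kthreadd runs on.
                 */
                return kthread_create_on_node(worker_fn, NULL,
                                              cpu_to_node(cpu),
                                              "worker/%d", cpu);
        }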
     

16 Mar, 2011

2 commits

  • Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (62 commits)
    posix-clocks: Check write permissions in posix syscalls
    hrtimer: Remove empty hrtimer_init_hres_timer()
    hrtimer: Update hrtimer->state documentation
    hrtimer: Update base[CLOCK_BOOTTIME].offset correctly
    timers: Export CLOCK_BOOTTIME via the posix timers interface
    timers: Add CLOCK_BOOTTIME hrtimer base
    time: Extend get_xtime_and_monotonic_offset() to also return sleep
    time: Introduce get_monotonic_boottime and ktime_get_boottime
    hrtimers: extend hrtimer base code to handle more than 2 clockids
    ntp: Remove redundant and incorrect parameter check
    mn10300: Switch do_timer() to xtime_update()
    posix clocks: Introduce dynamic clocks
    posix-timers: Cleanup namespace
    posix-timers: Add support for fd based clocks
    x86: Add clock_adjtime for x86
    posix-timers: Introduce a syscall for clock tuning.
    time: Splitout compat timex accessors
    ntp: Add ADJ_SETOFFSET mode bit
    time: Introduce timekeeping_inject_offset
    posix-timer: Update comment
    ...

    Fix up new system-call-related conflicts in
    arch/x86/ia32/ia32entry.S
    arch/x86/include/asm/unistd_32.h
    arch/x86/include/asm/unistd_64.h
    arch/x86/kernel/syscall_table_32.S
    (name_to_handle_at()/open_by_handle_at() vs clock_adjtime()), and some
    due to movement of get_jiffies_64() in:
    kernel/time.c

    Linus Torvalds
     
  • Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (26 commits)
    sched: Resched proper CPU on yield_to()
    sched: Allow users with sufficient RLIMIT_NICE to change from SCHED_IDLE policy
    sched: Allow SCHED_BATCH to preempt SCHED_IDLE tasks
    sched: Clean up the IRQ_TIME_ACCOUNTING code
    sched: Add #ifdef around irq time accounting functions
    sched, autogroup: Stop claiming ownership of the root task group
    sched, autogroup: Stop going ahead if autogroup is disabled
    sched, autogroup, sysctl: Use proc_dointvec_minmax() instead
    sched: Fix the group_imb logic
    sched: Clean up some f_b_g() comments
    sched: Clean up remnants of sd_idle
    sched: Wholesale removal of sd_idle logic
    sched: Add yield_to(task, preempt) functionality
    sched: Use a buddy to implement yield_task_fair()
    sched: Limit the scope of clear_buddies
    sched: Check the right ->nr_running in yield_task_fair()
    sched: Avoid expensive initial update_cfs_load(), on UP too
    sched: Fix switch_from_fair()
    sched: Simplify the idle scheduling class
    softirqs: Account ksoftirqd time as cpustat softirq
    ...

    Linus Torvalds
     

10 Mar, 2011

1 commit

  • This patch adds support for creating a queuing context outside
    of the queue itself. This enables us to batch up pieces of IO
    before grabbing the block device queue lock and submitting them to
    the IO scheduler.

    The context is created on the stack of the process and assigned in
    the task structure, so that we can auto-unplug it if we hit a schedule
    event.

    The current queue plugging happens implicitly if IO is submitted to
    an empty device, yet callers have to remember to unplug that IO when
    they are going to wait for it. This is an ugly API and has caused bugs
    in the past. Additionally, it requires hacks in the vm (->sync_page()
    callback) to handle that logic. By switching to an explicit plugging
    scheme we make the API a lot nicer and can get rid of the ->sync_page()
    hack in the vm.

    Signed-off-by: Jens Axboe

    Jens Axboe
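
    A minimal sketch of the new API (blk_start_plug()/blk_finish_plug() are
    the entry points this patch adds; the batch-submission helper and the
    2.6.39-era submit_bio() signature are illustrative):

        #include <linux/blkdev.h>
        #include <linux/bio.h>

        /* Hypothetical helper: submit a set of already-prepared bios. */
        static void submit_batch(struct bio **bios, int nr)
        {
                struct blk_plug plug;
                int i;

                /* On-stack context; auto-flushed if we schedule(). */
                blk_start_plug(&plug);
                for (i = 0; i < nr; i++)
                        submit_bio(WRITE, bios[i]);
                /* Explicit unplug: hand the batch to the IO scheduler. */
                blk_finish_plug(&plug);
        }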
     

17 Feb, 2011

1 commit

  • There are two spellings in use for 'freeze' + 'able' - 'freezable' and
    'freezeable'. The former is the more prominent one. The latter is
    mostly used by workqueue and in a few other odd places. Unify the
    spelling to 'freezable'.

    Signed-off-by: Tejun Heo
    Reported-by: Alan Stern
    Acked-by: "Rafael J. Wysocki"
    Acked-by: Greg Kroah-Hartman
    Acked-by: Dmitry Torokhov
    Cc: David Woodhouse
    Cc: Alex Dubov
    Cc: "David S. Miller"
    Cc: Steven Whitehouse

    Tejun Heo
     

03 Feb, 2011

3 commits

  • Currently only implemented for fair class tasks.

    Add a yield_to_task() method to the fair scheduling class, allowing the
    caller of yield_to() to accelerate another thread in its thread group or
    task group.

    This is implemented via a scheduler hint, using cfs_rq->next to
    encourage the target being selected. We can rely on pick_next_entity()
    to keep things fair, so no one can accelerate a thread that has already
    used its fair share of CPU time.

    This also means callers should only call yield_to() when they really
    mean it. Calling it too often can result in the scheduler just
    ignoring the hint.

    Signed-off-by: Rik van Riel
    Signed-off-by: Marcelo Tosatti
    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
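
    A hedged usage sketch: yield_to(task, preempt) is the interface added
    by this series, while the lock-holder argument is a hypothetical
    stand-in for what a caller such as KVM would supply:

        #include <linux/sched.h>

        static void boost_lock_holder(struct task_struct *holder)
        {
                /*
                 * Hand our timeslice to the (suspected) lock holder; if
                 * the hint is not honoured, just spin politely.
                 */
                if (!holder || !yield_to(holder, true))
                        cpu_relax();
        }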
     
  • Use the buddy mechanism to implement yield_task_fair. This
    allows us to skip onto the next highest priority se at every
    level in the CFS tree, unless doing so would introduce gross
    unfairness in CPU time distribution.

    We order the buddy selection in pick_next_entity to check
    yield first, then last, then next. We need next to be able
    to override yield, because it is possible for the "next" and
    "yield" task to be different processen in the same sub-tree
    of the CFS tree. When they are, we need to go into that
    sub-tree regardless of the "yield" hint, and pick the correct
    entity once we get to the right level.

    Signed-off-by: Rik van Riel
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Rik van Riel
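
    A condensed sketch of the resulting buddy-selection order in
    pick_next_entity() (the yield hint is recorded as the skip buddy;
    condensed from the description above, so treat details as approximate):

        static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
        {
                struct sched_entity *se = __pick_first_entity(cfs_rq);
                struct sched_entity *left = se;

                /* Avoid the entity that yielded, if it is not too unfair. */
                if (cfs_rq->skip == se) {
                        struct sched_entity *second = __pick_next_entity(se);
                        if (second && wakeup_preempt_entity(second, left) < 1)
                                se = second;
                }

                /* Prefer the last buddy: return the CPU to a preempted task. */
                if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
                        se = cfs_rq->last;

                /* The next buddy overrides yield: someone wants this to run. */
                if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
                        se = cfs_rq->next;

                clear_buddies(cfs_rq, se);
                return se;
        }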
     
  • Oleg reported that on architectures with
    __ARCH_WANT_INTERRUPTS_ON_CTXSW the IPI from
    task_oncpu_function_call() can land before perf_event_task_sched_in()
    and cause interesting situations for eg. perf_install_in_context().

    This patch reworks the task_oncpu_function_call() interface to give a
    more usable primitive as well as rework all its users to hopefully be
    more obvious as well as remove the races.

    While looking at the code I also found a number of races against
    perf_event_task_sched_out() which can flip contexts between tasks so
    plug those too.

    Reported-and-reviewed-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

01 Feb, 2011

1 commit

  • All callers of do_timer() are converted to xtime_update(). The only
    users of xtime_lock are in kernel/time/. Make both local to
    kernel/time/ and remove them from the global header files.

    [ tglx: Reuse tick-internal.h instead of creating another local header
    file. Massaged changelog ]

    Signed-off-by: Torben Hohn
    Cc: Peter Zijlstra
    Cc: johnstul@us.ibm.com
    Cc: yong.zhang0@gmail.com
    Cc: hch@infradead.org
    Signed-off-by: Thomas Gleixner

    Torben Hohn
     

31 Jan, 2011

1 commit

  • xtime_update() takes the xtime_lock write lock and calls do_timer().
    It is provided to replace the do_timer() calls in the architecture
    code.

    Signed-off-by: Torben Hohn
    Cc: Peter Zijlstra
    Cc: johnstul@us.ibm.com
    Cc: yong.zhang0@gmail.com
    Cc: hch@infradead.org
    LKML-Reference:
    Signed-off-by: Thomas Gleixner

    Torben Hohn
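
    A sketch of the resulting pattern in architecture timer code (the
    handler name is illustrative; xtime_update() is the new helper):

        #include <linux/interrupt.h>

        static irqreturn_t timer_interrupt(int irq, void *dev_id)
        {
                /*
                 * Was: write_seqlock(&xtime_lock); do_timer(1);
                 *      write_sequnlock(&xtime_lock);
                 */
                xtime_update(1);        /* takes xtime_lock internally */
                return IRQ_HANDLED;
        }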
     

26 Jan, 2011

2 commits

  • When a task is taken out of the fair class we must ensure its vruntime
    is properly normalized, because when we put it back in it will be
    assumed to be normalized.

    The case that goes wrong is when changing away from the fair class
    while sleeping. Sleeping tasks have non-normalized vruntime in order
    to make sleeper-fairness work. So treat the switch away from fair as a
    wakeup and preserve the relative vruntime.

    Also update sysrq-n to call the ->switch_{to,from} methods.

    Reported-by: Onkalo Samu
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
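
    A condensed sketch of the fix (treat the switch away from fair as a
    wakeup and keep the vruntime relative; details approximate):

        static void switched_from_fair(struct rq *rq, struct task_struct *p)
        {
                struct sched_entity *se = &p->se;
                struct cfs_rq *cfs_rq = cfs_rq_of(se);

                /*
                 * A sleeping task carries absolute vruntime; normalize it
                 * so the later switch back into the fair class, which
                 * assumes normalized vruntime, does not misplace the task.
                 */
                if (!se->on_rq && p->state != TASK_RUNNING) {
                        place_entity(cfs_rq, se, 0);
                        se->vruntime -= cfs_rq->min_vruntime;
                }
        }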
     
  • Cleanup patch, freeing up PF_KSOFTIRQD and using a per-cpu ksoftirqd
    pointer instead, as suggested by Eric Dumazet.

    Tested-by: Shaun Ruffell
    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Venkatesh Pallipadi
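
    A sketch of the replacement (the per-cpu pointer takes over the role
    of the PF_KSOFTIRQD task flag):

        DEFINE_PER_CPU(struct task_struct *, ksoftirqd);

        static inline struct task_struct *this_cpu_ksoftirqd(void)
        {
                return this_cpu_read(ksoftirqd);
        }

        /*
         * Was: current->flags & PF_KSOFTIRQD
         * Now: current == this_cpu_ksoftirqd()
         */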
     

14 Jan, 2011

4 commits

  • Add khugepaged to relocate fragmented pages into hugepages if new
    hugepages become available. (This is independent of the defrag logic,
    which will have to make new hugepages available.)

    The fundamental reason why khugepaged is unavoidable is that some memory
    can be fragmented and not everything can be relocated. So when a virtual
    machine quits and releases gigabytes of hugepages, we want to use those
    freely available hugepages to create huge-pmds in the other virtual
    machines that may be running on fragmented memory, to maximize the CPU
    efficiency at all times. The scan is slow; it takes nearly zero CPU
    time, except when it copies data (in which case we definitely want to
    pay for that CPU time), so it seems a good tradeoff.

    In addition to the hugepages being released by other processes releasing
    memory, we strongly suspect that the performance impact of potentially
    defragmenting hugepages during or before each page fault could lead to
    more performance inconsistency than allocating small pages at first and
    having them collapsed into large pages later... if they prove themselves
    to be long-lived mappings (the khugepaged scan is slow, so short-lived
    mappings have a low probability of running into khugepaged compared to
    long-lived mappings).

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • We'd like to be able to oom_score_adj a process up/down as it
    enters/leaves the foreground. Currently, it is not possible to oom_adj
    down without CAP_SYS_RESOURCE. This patch allows a task to decrease its
    oom_score_adj back to the value that a CAP_SYS_RESOURCE thread set it to
    or its inherited value at fork. Assuming the thread that has forked it
    has oom_score_adj of 0, each process could decrease it back from 0 upon
    activation unless a CAP_SYS_RESOURCE thread elevated it to something
    higher.

    Alternative considered:

    * a setuid binary
    * a daemon with CAP_SYS_RESOURCE

    Since you don't want all processes to be able to reduce their oom_adj,
    a setuid or daemon implementation would be complex. The alternatives
    also have much higher overhead.

    This patch was updated from the original based on feedback from David
    Rientjes.

    Signed-off-by: Mandeep Singh Baines
    Acked-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mandeep Singh Baines
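
    A sketch of the resulting check on /proc/<pid>/oom_score_adj writes
    (the helper name is hypothetical; oom_score_adj_min is the saved floor
    the description above implies):

        static int set_oom_score_adj(struct task_struct *task, int new_adj)
        {
                /* Unprivileged tasks may only go back down to the floor. */
                if (new_adj < task->signal->oom_score_adj_min &&
                    !capable(CAP_SYS_RESOURCE))
                        return -EACCES;

                task->signal->oom_score_adj = new_adj;
                /* Privileged writers move the floor itself. */
                if (capable(CAP_SYS_RESOURCE))
                        task->signal->oom_score_adj_min = new_adj;
                return 0;
        }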
     
  • This warning was added in commit bdff746a3915 ("clone: prepare to recycle
    CLONE_STOPPED") three years ago. 2.6.26 came and went. As far as I know,
    no-one is actually using CLONE_STOPPED.

    Signed-off-by: Dave Jones
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jones
     
  • On a 16TB machine, max_user_watches has an integer overflow. Convert it
    to use a long and handle the associated fallout.

    Signed-off-by: Robin Holt
    Cc: "Eric W. Biederman"
    Acked-by: Davide Libenzi
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
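
    A sketch of the shape of the fix (per-user watch accounting moves to
    long-based types; names follow fs/eventpoll.c, the helper is
    hypothetical):

        /* was: atomic_t in struct user_struct; overflows at 16TB */
        atomic_long_t epoll_watches;

        /* was: int; the limit check becomes long-based too */
        static long max_user_watches __read_mostly;

        /* sketch of the accounting done in ep_insert() */
        static int ep_account_watch(struct eventpoll *ep)
        {
                if (atomic_long_read(&ep->user->epoll_watches) >=
                    max_user_watches)
                        return -ENOSPC;
                atomic_long_inc(&ep->user->epoll_watches);
                return 0;
        }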
     

07 Jan, 2011

3 commits

  • root_task_group is a leftover of USER_SCHED; it is now always the same
    as init_task_group. But as Mike suggested, root_task_group is the more
    suitable name to keep for the tree. So in this patch:

    init_task_group --> root_task_group
    init_task_group_load --> root_task_group_load
    INIT_TASK_GROUP_LOAD --> ROOT_TASK_GROUP_LOAD

    Suggested-by: Mike Galbraith
    Signed-off-by: Yong Zhang
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Yong Zhang
     
  • Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (30 commits)
    sched: Change wait_for_completion_*_timeout() to return a signed long
    sched, autogroup: Fix reference leak
    sched, autogroup: Fix potential access to freed memory
    sched: Remove redundant CONFIG_CGROUP_SCHED ifdef
    sched: Fix interactivity bug by charging unaccounted run-time on entity re-weight
    sched: Move periodic share updates to entity_tick()
    printk: Use this_cpu_{read|write} api on printk_pending
    sched: Make pushable_tasks CONFIG_SMP dependent
    sched: Add 'autogroup' scheduling feature: automated per session task groups
    sched: Fix unregister_fair_sched_group()
    sched: Remove unused argument dest_cpu to migrate_task()
    mutexes, sched: Introduce arch_mutex_cpu_relax()
    sched: Add some clock info to sched_debug
    cpu: Remove incorrect BUG_ON
    cpu: Remove unused variable
    sched: Fix UP build breakage
    sched: Make task dump print all 15 chars of proc comm
    sched: Update tg->shares after cpu.shares write
    sched: Allow update_cfs_load() to update global load
    sched: Implement demand based update_cfs_load()
    ...

    Linus Torvalds
     
  • Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (146 commits)
    tools, perf: Documentation for the power events API
    perf: Add calls to suspend trace point
    perf script: Make some lists static
    perf script: Use the default lost event handler
    perf session: Warn about errors when processing pipe events too
    perf tools: Fix perf_event.h header usage
    perf test: Clarify some error reports in the open syscall test
    x86, NMI: Add touch_nmi_watchdog to io_check_error delay
    x86: Avoid calling arch_trigger_all_cpu_backtrace() at the same time
    x86: Only call smp_processor_id in non-preempt cases
    perf timechart: Adjust perf timechart to the new power events
    perf: Clean up power events by introducing new, more generic ones
    perf: Do not export power_frequency, but power_start event
    perf test: Add test for counting open syscalls
    perf evsel: Auto allocate resources needed for some methods
    perf evsel: Use {cpu,thread}_map to shorten list of parameters
    perf tools: Refactor all_tids to hold nr and the map
    perf tools: Refactor cpumap to hold nr and the map
    perf evsel: Introduce per cpu and per thread open helpers
    perf evsel: Steal the counter reading routines from stat
    ...

    Linus Torvalds
     

09 Dec, 2010

2 commits

  • As noted by Peter Zijlstra at https://lkml.org/lkml/2010/11/10/391
    (while reviewing other stuff, though), tracking pushable tasks
    only makes sense on SMP systems.

    Signed-off-by: Dario Faggioli
    Acked-by: Steven Rostedt
    Acked-by: Gregory Haskins
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Dario Faggioli
     
  • There's a long-running regression that proved difficult to fix and
    which is hitting certain people and is rather annoying in its effects.

    Damien reported that after 74f5187ac8 (sched: Cure load average vs
    NO_HZ woes) his load average is unnaturally high; he also noted that
    even with that patch reverted the load average numbers are not
    correct.

    The problem is that the previous patch only solved half the NO_HZ
    problem: it addressed going into NO_HZ mode, but not coming out of it.
    This patch implements that missing half.

    When coming out of NO_HZ mode there are two important things to take
    care of:

    - Folding the pending idle delta into the global active count.
    - Correctly aging the averages for the idle-duration.

    So with this patch the NO_HZ interaction should be complete and
    behaviour between CONFIG_NO_HZ=[yn] should be equivalent.

    Furthermore, this patch slightly changes the load average computation
    by adding a rounding term to the fixed point multiplication.

    Reported-by: Damien Wyart
    Reported-by: Tim McGrath
    Tested-by: Damien Wyart
    Tested-by: Orion Poplawski
    Tested-by: Kyle McMartin
    Signed-off-by: Peter Zijlstra
    Cc: stable@kernel.org
    Cc: Chase Douglas
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
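
    The rounding term mentioned above, sketched against the fixed-point
    average update (FIXED_1 is 1.0 and FSHIFT the shift in the kernel's
    fixed-point load representation):

        static unsigned long
        calc_load(unsigned long load, unsigned long exp, unsigned long active)
        {
                load *= exp;
                load += active * (FIXED_1 - exp);
                load += 1UL << (FSHIFT - 1);    /* round instead of truncate */
                return load >> FSHIFT;
        }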
     

30 Nov, 2010

2 commits

  • A recurring complaint from CFS users is that parallel kbuild has
    a negative impact on desktop interactivity. This patch
    implements an idea from Linus, to automatically create task
    groups. Currently, only per session autogroups are implemented,
    but the patch leaves the way open for enhancement.

    Implementation: each task's signal struct contains an inherited
    pointer to a refcounted autogroup struct containing a task group
    pointer, the default for all tasks pointing to the
    init_task_group. When a task calls setsid(), a new task group
    is created, the process is moved into the new task group, and a
    reference to the previous task group is dropped. Child
    processes inherit this task group thereafter, and increase its
    refcount. When the last thread of a process exits, the
    process's reference is dropped, such that when the last process
    referencing an autogroup exits, the autogroup is destroyed.

    At runqueue selection time, IFF a task has no cgroup assignment,
    its current autogroup is used.

    Autogroup bandwidth is controllable by setting its nice level
    through the proc filesystem:

    cat /proc/<pid>/autogroup

    Displays the task's group and the group's nice level.

    echo <nice level> > /proc/<pid>/autogroup

    Sets the task group's shares to the weight of a nice <level> task.
    Setting the nice level is rate limited for !admin users due to the
    abuse risk of task group locking.

    The feature is enabled from boot by default if
    CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via
    the boot option noautogroup, and can also be turned on/off on
    the fly via:

    echo [01] > /proc/sys/kernel/sched_autogroup_enabled

    ... which will automatically move tasks to/from the root task group.

    Signed-off-by: Mike Galbraith
    Acked-by: Linus Torvalds
    Acked-by: Peter Zijlstra
    Cc: Markus Trippelsdorf
    Cc: Mathieu Desnoyers
    Cc: Paul Turner
    Cc: Oleg Nesterov
    [ Removed the task_group_path() debug code, and fixed !EVENTFD build failure. ]
    Signed-off-by: Ingo Molnar
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
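
    A reduced sketch of the bookkeeping described above (the field set is
    condensed from the description; treat it as approximate):

        struct autogroup {
                struct kref             kref;   /* dropped as referencing
                                                   processes exit */
                struct task_group       *tg;    /* the per-session group */
                struct rw_semaphore     lock;   /* serializes nice updates */
                int                     nice;   /* via /proc/<pid>/autogroup */
        };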
     
  • Add priority boosting, but only for TINY_PREEMPT_RCU. This is enabled
    by the default-off RCU_BOOST kernel parameter. The priority to which to
    boost preempted RCU readers is controlled by the RCU_BOOST_PRIO kernel
    parameter (defaulting to real-time priority 1) and the time to wait
    before boosting the readers blocking a given grace period is controlled
    by the RCU_BOOST_DELAY kernel parameter (defaulting to 500 milliseconds).

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

26 Nov, 2010

2 commits

  • The perf hardware pmu got initialized at various points during boot,
    some before early_initcall(), some after (notably arch_initcall).

    The problem is that the NMI lockup detector is run from early_initcall()
    and expects the hardware pmu to be present.

    Sanitize this by moving all architecture hardware pmu implementations to
    initialize at early_initcall() and move the lockup detector to an explicit
    initcall right after that.

    Cc: paulus
    Cc: davem
    Cc: Michael Cree
    Cc: Deng-Cheng Zhu
    Acked-by: Paul Mundt
    Acked-by: Will Deacon
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Merge reason: Pick up latest fixes.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

18 Nov, 2010

3 commits

  • Introduce a new sysctl for the shares window and disambiguate it from
    sched_time_avg.

    A 10ms window appears to be a good compromise between accuracy and performance.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
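
    A sketch of the new knob (declaration condensed; the 10ms default
    matches the compromise described above):

        /* window over which per-cpu load contributions are averaged, in ns */
        unsigned int sysctl_sched_shares_window = 10000000UL;   /* 10ms */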
     
  • By tracking a per-cpu load-avg for each cfs_rq and folding it into a
    global task_group load on each tick we can rework tg_shares_up to be
    strictly per-cpu.

    This should improve cpu-cgroup performance for smp systems
    significantly.

    [ Paul: changed to use queueing cfs_rq + bug fixes ]

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • While discussing the need for sched_idle_next(), Oleg remarked that
    since try_to_wake_up() ensures sleeping tasks will end up running on a
    sane cpu, we can do away with migrate_live_tasks().

    If we then extend the existing hack of migrating current from
    CPU_DYING to migrating the full rq worth of tasks from CPU_DYING, the
    need for the sched_idle_next() abomination disappears as well, since
    idle will be the only possible thread left after the migration thread
    stops.

    This greatly simplifies the hot-unplug task migration path, as can be
    seen from the resulting code reduction (and about half the new lines
    are comments).

    Suggested-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra