19 Sep, 2009

2 commits

  • Remove net/genetlink.h inclusion, now sched.c won't be recompiled
    because of some networking changes.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • * 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (34 commits)
    time: Prevent 32 bit overflow with set_normalized_timespec()
    clocksource: Delay clocksource down rating to late boot
    clocksource: clocksource_select must be called with mutex locked
    clocksource: Resolve cpu hotplug dead lock with TSC unstable, fix crash
    timers: Drop a function prototype
    clocksource: Resolve cpu hotplug dead lock with TSC unstable
    timer.c: Fix S/390 comments
    timekeeping: Fix invalid getboottime() value
    timekeeping: Fix up read_persistent_clock() breakage on sh
    timekeeping: Increase granularity of read_persistent_clock(), build fix
    time: Introduce CLOCK_REALTIME_COARSE
    x86: Do not unregister PIT clocksource on PIT oneshot setup/shutdown
    clocksource: Avoid clocksource watchdog circular locking dependency
    clocksource: Protect the watchdog rating changes with clocksource_mutex
    clocksource: Call clocksource_change_rating() outside of watchdog_lock
    timekeeping: Introduce read_boot_clock
    timekeeping: Increase granularity of read_persistent_clock()
    timekeeping: Update clocksource with stop_machine
    timekeeping: Add timekeeper read_clock helper functions
    timekeeping: Move NTP adjusted clock multiplier to struct timekeeper
    ...

    Fix trivial conflict due to MIPS lemote -> loongson renaming.

    Linus Torvalds
     

18 Sep, 2009

4 commits

  • …/git/tip/linux-2.6-tip

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (37 commits)
    sched: Fix SD_POWERSAVING_BALANCE|SD_PREFER_LOCAL vs SD_WAKE_AFFINE
    sched: Stop buddies from hogging the system
    sched: Add new wakeup preemption mode: WAKEUP_RUNNING
    sched: Fix TASK_WAKING & loadaverage breakage
    sched: Disable wakeup balancing
    sched: Rename flags to wake_flags
    sched: Clean up the load_idx selection in select_task_rq_fair
    sched: Optimize cgroup vs wakeup a bit
    sched: x86: Name old_perf in a unique way
    sched: Implement a gentler fair-sleepers feature
    sched: Add SD_PREFER_LOCAL
    sched: Add a few SYNC hint knobs to play with
    sched: Fix sync wakeups again
    sched: Add WF_FORK
    sched: Rename sync arguments
    sched: Rename select_task_rq() argument
    sched: Feature to disable APERF/MPERF cpu_power
    x86: sched: Provide arch implementations using aperf/mperf
    x86: Add generic aperf/mperf code
    x86: Move APERF/MPERF into a X86_FEATURE
    ...

    Fix up trivial conflict in arch/x86/include/asm/processor.h due to
    nearby addition of amd_get_nb_id() declaration from the EDAC merge.

    Linus Torvalds
     
  • With BLOCK_IOPOLL_SOFTIRQ added, softirq_to_name[] and
    show_softirq_name() need to be updated.
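
    One way to notice when such a name table falls behind its enum is a
    compile-time size check. A minimal userspace sketch, assuming a
    simplified enum: the entries listed and the softirq_name() helper are
    illustrative, not the kernel's actual definitions.

```c
/* Illustrative subset of softirq numbers; not the kernel's real list. */
enum {
    HI_SOFTIRQ = 0,
    TIMER_SOFTIRQ,
    BLOCK_SOFTIRQ,
    BLOCK_IOPOLL_SOFTIRQ,   /* the newly added entry */
    NR_SOFTIRQS             /* must stay last */
};

/* Name table indexed by softirq number. */
static const char *const softirq_to_name[] = {
    [HI_SOFTIRQ]           = "HI",
    [TIMER_SOFTIRQ]        = "TIMER",
    [BLOCK_SOFTIRQ]        = "BLOCK",
    [BLOCK_IOPOLL_SOFTIRQ] = "BLOCK_IOPOLL",
};

/* Fails to compile if the table size differs from NR_SOFTIRQS. */
_Static_assert(sizeof(softirq_to_name) / sizeof(softirq_to_name[0]) == NR_SOFTIRQS,
               "softirq_to_name[] is out of sync with the softirq enum");

/* Bounds-checked lookup; holes in the table also report "UNKNOWN". */
static const char *softirq_name(int nr)
{
    if (nr < 0 || nr >= NR_SOFTIRQS || !softirq_to_name[nr])
        return "UNKNOWN";
    return softirq_to_name[nr];
}
```

    The size check only catches a missing entry at the end of the table;
    a hole in the middle stays NULL, which is why the lookup also tests
    for it.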

    Signed-off-by: Li Zefan
    LKML-Reference:
    Signed-off-by: Steven Rostedt

    Li Zefan
     
  • For direct function pointers (like what mcount provides) PowerPC64
    requires the use of %ps, otherwise nothing is printed.

    This patch converts all prints of functions retrieved through mcount
    to use the %ps format from the %pf.

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • Merge reason: Pick up kernel/softirq.c update for dependent fix.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

17 Sep, 2009

4 commits

  • The SD_POWERSAVING_BALANCE|SD_PREFER_LOCAL code can break out of
    the domain iteration early, making us miss the SD_WAKE_AFFINE bits.

    Fix this by continuing iteration until there is no need for a
    larger domain.

    This also cleans up the cgroup stuff a bit, by not having two
    update_shares() invocations.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Clear buddies more aggressively.

    The (theoretical, haven't actually observed any of this) problem is
    that when we do not select either buddy in pick_next_entity()
    because they are too far ahead of the left-most task, we do not
    clear the buddies.

    This means that as soon as we service the left-most task, these
    same buddies will be tried again on the next schedule. Now if the
    left-most task was a pure hog, it wouldn't have done any wakeups
    and it wouldn't have set buddies of its own. That leads to the old
    buddies dominating, which would lead to bad latencies.
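
    The idea can be modelled in a few lines of userspace C. This is a toy
    sketch, not the kernel's pick_next_entity(): the struct layout, the
    gran threshold and the buddy_ok() helper are assumptions made for
    illustration, and rq->leftmost is assumed to be non-NULL.

```c
#include <stddef.h>

struct entity {
    long long vruntime;
};

struct runqueue {
    struct entity *leftmost;  /* entity with the smallest vruntime */
    struct entity *next;      /* buddy: wakee we would like to run next */
    struct entity *last;      /* buddy: entity that was preempted last */
    long long gran;           /* how far ahead of leftmost a buddy may be */
};

/* A buddy is acceptable only if it is not too far ahead of leftmost. */
static int buddy_ok(const struct runqueue *rq, const struct entity *buddy)
{
    return buddy && buddy->vruntime - rq->leftmost->vruntime <= rq->gran;
}

static struct entity *pick_next_entity(struct runqueue *rq)
{
    struct entity *se = rq->leftmost;

    if (buddy_ok(rq, rq->next))
        se = rq->next;
    else if (buddy_ok(rq, rq->last))
        se = rq->last;

    /*
     * Forget both buddies whether or not one was picked, so that a
     * rejected buddy is not tried again on the next schedule.
     */
    rq->next = NULL;
    rq->last = NULL;
    return se;
}
```

    With a buddy that is too far ahead, the left-most entity is returned
    and both buddy pointers come back cleared.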

    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Create a new wakeup preemption mode, preempt towards tasks that run
    shorter on avg. It sets next buddy to be sure we actually run the task
    we preempted for.

    Test results:

    root@twins:~# while :; do :; done &
    [1] 6537
    root@twins:~# while :; do :; done &
    [2] 6538
    root@twins:~# while :; do :; done &
    [3] 6539
    root@twins:~# while :; do :; done &
    [4] 6540

    root@twins:/home/peter# ./latt -c4 sleep 4
    Entries: 48 (clients=4)

    Averages:
    ------------------------------
    Max 4750 usec
    Avg 497 usec
    Stdev 737 usec

    root@twins:/home/peter# echo WAKEUP_RUNNING > /debug/sched_features

    root@twins:/home/peter# ./latt -c4 sleep 4
    Entries: 48 (clients=4)

    Averages:
    ------------------------------
    Max 14 usec
    Avg 5 usec
    Stdev 3 usec

    Disabled by default - needs more testing.
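
    The preemption rule described at the top of this message can be
    sketched as follows. This is a simplified userspace model: the
    avg_running field name, the struct layout and the function name are
    assumptions for illustration, not the kernel code itself.

```c
#include <stdbool.h>
#include <stddef.h>

struct task {
    unsigned long long avg_running;  /* average length of one run, in ns */
};

struct runqueue {
    struct task *curr;        /* task currently on the CPU */
    struct task *next_buddy;  /* task to favour at the next pick */
};

/* Returns true when the woken task should preempt the current one. */
static bool wakeup_running_preempt(struct runqueue *rq, struct task *woken)
{
    if (woken->avg_running < rq->curr->avg_running) {
        /* Set the next buddy so the task we preempt for really runs. */
        rq->next_buddy = woken;
        return true;
    }
    return false;
}
```

    A task that usually runs for microseconds therefore preempts a CPU
    hog, while the hog never preempts the short runner.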

    Signed-off-by: Peter Zijlstra
    Acked-by: Mike Galbraith
    Signed-off-by: Ingo Molnar
    LKML-Reference:

    Peter Zijlstra
     
  • Fix this:

    top - 21:54:00 up 2:59, 1 user, load average: 432512.33, 426421.74, 417432.74

    Which happens because we now set TASK_WAKING before activate_task().

    Cc: Peter Zijlstra
    Cc: Mike Galbraith
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

16 Sep, 2009

18 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6:
    Driver Core: devtmpfs - kernel-maintained tmpfs-based /dev
    debugfs: Modify default debugfs directory for debugging pktcdvd.
    debugfs: Modified default dir of debugfs for debugging UHCI.
    debugfs: Change debugfs directory of IWMC3200
    debugfs: Change debuhgfs directory of trace-events-sample.h
    debugfs: Fix mount directory of debugfs by default in events.txt
    hpilo: add poll f_op
    hpilo: add interrupt handler
    hpilo: staging for interrupt handling
    driver core: platform_device_add_data(): use kmemdup()
    Driver core: Add support for compatibility classes
    uio: add generic driver for PCI 2.3 devices
    driver-core: move dma-coherent.c from kernel to driver/base
    mem_class: fix bug
    mem_class: use minor as index instead of searching the array
    driver model: constify attribute groups
    UIO: remove 'default n' from Kconfig
    Driver core: Add accessor for device platform data
    Driver core: move dev_get/set_drvdata to drivers/base/dd.c
    Driver core: add new device to bus's list before probing

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block:
    block: fix linkage problem with blk_iopoll and !CONFIG_BLOCK

    Linus Torvalds
     
  • For consistency's sake, rename the argument (again).

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Clean up the code a little.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • We don't need to call update_shares() for each domain we iterate,
    just for the largest one.

    However, we should call it before wake_affine() as well, so that
    that can use up-to-date values too.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Fix the condition of strcmp for "*".
    Also fix NULL pointer dereference when glob is NULL.
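
    Both pitfalls are easy to hit: strcmp() returns 0 when the strings are
    equal, and it must never be handed a NULL pointer. A minimal sketch of
    the safe pattern, assuming that a NULL glob and "*" both mean "match
    everything" (glob_matches_all() is a hypothetical helper, not a
    function from the patch):

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: does this glob mean "match everything"? */
static bool glob_matches_all(const char *glob)
{
    /* Test for NULL first; strcmp() == 0 means the strings are equal. */
    return glob == NULL || strcmp(glob, "*") == 0;
}
```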

    Signed-off-by: Atsushi Tsuji
    LKML-Reference:
    Signed-off-by: Steven Rostedt

    Atsushi Tsuji
     
  • Add back FAIR_SLEEPERS and GENTLE_FAIR_SLEEPERS.

    FAIR_SLEEPERS is the old logic: credit sleepers with their sleep time.

    GENTLE_FAIR_SLEEPERS dampens this a bit: 50% of their sleep time gets
    credited.

    The hope here is to still give the benefits of fair-sleepers logic
    (quick wakeups, etc.) while not allowing them to have 100% of their
    sleep time as if they were running.
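
    The difference between the two modes amounts to halving the credit. A
    toy sketch, assuming the credit is subtracted from the runqueue's
    minimum virtual runtime; the struct, the parameter names and the
    clamp to zero are illustrative choices, not the kernel's code.

```c
#include <stdbool.h>

/* Hypothetical switches mirroring the two features described above. */
struct sleeper_features {
    bool fair_sleepers;         /* credit sleepers at all */
    bool gentle_fair_sleepers;  /* ... but only with half of the credit */
};

/*
 * Return the virtual runtime to give a task that wakes from sleep.
 * A smaller value means the task is scheduled sooner.
 */
static unsigned long long place_sleeper(unsigned long long min_vruntime,
                                        unsigned long long sleep_credit,
                                        struct sleeper_features f)
{
    if (!f.fair_sleepers)
        return min_vruntime;    /* no bonus for having slept */
    if (f.gentle_fair_sleepers)
        sleep_credit /= 2;      /* credit only 50% */
    /* Clamp instead of wrapping around below zero. */
    return min_vruntime > sleep_credit ? min_vruntime - sleep_credit : 0;
}
```

    With min_vruntime 1000 and a credit of 200, the full mode places the
    task at 800 and the gentle mode at 900.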

    Cc: Peter Zijlstra
    Cc: Mike Galbraith
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • And turn it on for NUMA and MC domains. This improves
    locality in balancing decisions by keeping up to
    capacity amount of tasks local before looking for idle
    CPUs. (and twice the capacity if SD_POWERSAVINGS_BALANCE
    is set.)
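
    The decision can be sketched as a single comparison. This is a
    simplified model under assumed names (prefer_local() and its
    parameters are hypothetical); the kernel derives capacity from
    cpu_power rather than taking it as an argument.

```c
#include <stdbool.h>

/*
 * Should a waking task stay in the local domain?
 * nr_running: tasks currently runnable in the local domain
 * capacity:   number of tasks the domain can run at full speed
 */
static bool prefer_local(unsigned int nr_running, unsigned int capacity,
                         bool powersavings_balance)
{
    /* Power-saving balancing tolerates twice the capacity. */
    unsigned int limit = powersavings_balance ? 2 * capacity : capacity;

    /* Still room locally: do not go looking for idle CPUs. */
    return nr_running < limit;
}
```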

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • kernel/built-in.o:(.data+0x17b0): undefined reference to `blk_iopoll_enabled'

    The extern declaration makes the compile work, but the actual
    symbol is missing when block/blk-iopoll.o isn't linked in.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Currently we use overlap to weaken the SYNC hint, but allow it to
    set the hint as well.

    echo NO_SYNC_WAKEUP > /debug/sched_features
    echo SYNC_MORE > /debug/sched_features

    preserves pipe-test behaviour without using the WF_SYNC hint.

    Worth playing with on more workloads...

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The sync argument rename to introduce WF_* broke stuff by missing a
    local alias for an argument in __wake_up_common, fix it by using
    the more descriptive wake_flags name.

    This restores WF_SYNC propagation, which fixes wake_affine()
    behaviour, which fixes pipe-test.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (134 commits)
    powerpc/nvram: Enable use Generic NVRAM driver for different size chips
    powerpc/iseries: Fix oops reading from /proc/iSeries/mf/*/cmdline
    powerpc/ps3: Workaround for flash memory I/O error
    powerpc/booke: Don't set DABR on 64-bit BookE, use DAC1 instead
    powerpc/perf_counters: Reduce stack usage of power_check_constraints
    powerpc: Fix bug where perf_counters breaks oprofile
    powerpc/85xx: Fix SMP compile error and allow NULL for smp_ops
    powerpc/irq: Improve nanodoc
    powerpc: Fix some late PowerMac G5 with PCIe ATI graphics
    powerpc/fsl-booke: Use HW PTE format if CONFIG_PTE_64BIT
    powerpc/book3e: Add missing page sizes
    powerpc/pseries: Fix to handle slb resize across migration
    powerpc/powermac: Thermal control turns system off too eagerly
    powerpc/pci: Merge ppc32 and ppc64 versions of phb_scan()
    powerpc/405ex: support cuImage via included dtb
    powerpc/405ex: provide necessary fixup function to support cuImage
    powerpc/40x: Add support for the ESTeem 195E (PPC405EP) SBC
    powerpc/44x: Add Eiger AMCC (AppliedMicro) PPC460SX evaluation board support.
    powerpc/44x: Update Arches defconfig
    powerpc/44x: Update Arches dts
    ...

    Fix up conflicts in drivers/char/agp/uninorth-agp.c

    Linus Torvalds
     
  • Placing dma-coherent.c in driver/base is better than in kernel,
    since it contains code to do per-device coherent dma memory
    handling.

    Signed-off-by: Ming Lei
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (46 commits)
    powerpc64: convert to dynamic percpu allocator
    sparc64: use embedding percpu first chunk allocator
    percpu: kill lpage first chunk allocator
    x86,percpu: use embedding for 64bit NUMA and page for 32bit NUMA
    percpu: update embedding first chunk allocator to handle sparse units
    percpu: use group information to allocate vmap areas sparsely
    vmalloc: implement pcpu_get_vm_areas()
    vmalloc: separate out insert_vmalloc_vm()
    percpu: add chunk->base_addr
    percpu: add pcpu_unit_offsets[]
    percpu: introduce pcpu_alloc_info and pcpu_group_info
    percpu: move pcpu_lpage_build_unit_map() and pcpul_lpage_dump_cfg() upward
    percpu: add @align to pcpu_fc_alloc_fn_t
    percpu: make @dyn_size mandatory for pcpu_setup_first_chunk()
    percpu: drop @static_size from first chunk allocators
    percpu: generalize first chunk allocator selection
    percpu: build first chunk allocators selectively
    percpu: rename 4k first chunk allocator to page
    percpu: improve boot messages
    percpu: fix pcpu_reclaim() locking
    ...

    Fix trivial conflict as by Tejun Heo in kernel/sched.c

    Linus Torvalds
     
  • …x/kernel/git/tip/linux-2.6-tip

    * 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    perf_counter: Fix buffer overflow in perf_copy_attr()

    Linus Torvalds
     
  • The prev_trace_clock_time is only read or written to when the
    trace_clock_lock is taken. For better performance, the two
    should share the same cache line.
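
    In portable C11 the same effect can be had by grouping the lock and
    the data it protects in one cache-line-aligned struct. A sketch under
    stated assumptions: a 64-byte line size, an int standing in for the
    spinlock type, and a hypothetical struct name.

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64  /* assumption: common x86 cache line size */

/*
 * The lock and the value it protects sit side by side, and the struct
 * starts on a cache-line boundary, so taking the lock also pulls in
 * the data.
 */
struct trace_clock_state {
    alignas(CACHE_LINE_SIZE) int lock;  /* stand-in for a spinlock */
    unsigned long long prev_time;       /* only touched with lock held */
};

/* Both members must fit inside the first cache line of the struct. */
_Static_assert(offsetof(struct trace_clock_state, prev_time) +
               sizeof(unsigned long long) <= CACHE_LINE_SIZE,
               "lock and prev_time do not share a cache line");
```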

    Reported-by: Peter Zijlstra
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • * 'x86-pat-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86, pat: Fix cacheflush address in change_page_attr_set_clr()
    mm: remove !NUMA condition from PAGEFLAGS_EXTENDED condition set
    x86: Fix earlyprintk=dbgp for machines without NX
    x86, pat: Sanity check remap_pfn_range for RAM region
    x86, pat: Lookup the protection from memtype list on vm_insert_pfn()
    x86, pat: Add lookup_memtype to get the current memtype of a paddr
    x86, pat: Use page flags to track memtypes of RAM pages
    x86, pat: Generalize the use of page flag PG_uncached
    x86, pat: Add rbtree to do quick lookup in memtype tracking
    x86, pat: Add PAT reserve free to io_mapping* APIs
    x86, pat: New i/f for driver to request memtype for IO regions
    x86, pat: ioremap to follow same PAT restrictions as other PAT users
    x86, pat: Keep identity maps consistent with mmaps even when pat_disabled
    x86, mtrr: make mtrr_aps_delayed_init static bool
    x86, pat/mtrr: Rendezvous all the cpus for MTRR/PAT init
    generic-ipi: Allow cpus not yet online to call smp_call_function with irqs disabled
    x86: Fix an incorrect argument of reserve_bootmem()
    x86: Fix system crash when loading with "reservetop" parameter

    Linus Torvalds
     
  • * 'x86-txt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86, intel_txt: clean up the impact on generic code, unbreak non-x86
    x86, intel_txt: Handle ACPI_SLEEP without X86_TRAMPOLINE
    x86, intel_txt: Fix typos in Kconfig help
    x86, intel_txt: Factor out the code for S3 setup
    x86, intel_txt: tboot.c needs
    intel_txt: Force IOMMU on for Intel TXT launch
    x86, intel_txt: Intel TXT Sx shutdown support
    x86, intel_txt: Intel TXT reboot/halt shutdown support
    x86, intel_txt: Intel TXT boot support

    Linus Torvalds
     

15 Sep, 2009

12 commits

  • Avoid the cache buddies from biasing the time distribution away
    from fork()ers. Normally the next buddy will be the preferred
    scheduling target, but this makes fork()s prefer to run the new
    child, whereas we prefer to run the parent, since that will
    generate more work.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • In order to extend the functions to have more than 1 flag (sync),
    rename the argument to flags, and explicitly define a WF_ space for
    individual flags.
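
    A flag space of this kind is typically one bit per hint, so that
    several hints can travel in a single argument. A minimal sketch: the
    WF_SYNC and WF_FORK names appear elsewhere in this log, but the bit
    values and the two helper functions here are assumptions for
    illustration.

```c
#include <stdbool.h>

/* Wake flags: one bit per hint, so several can be combined. */
#define WF_SYNC 0x01  /* synchronous wakeup hint */
#define WF_FORK 0x02  /* wakeup of a newly forked child */

static bool wake_is_sync(int wake_flags)
{
    return (wake_flags & WF_SYNC) != 0;
}

static bool wake_is_fork(int wake_flags)
{
    return (wake_flags & WF_FORK) != 0;
}
```

    A caller that previously passed sync=1 would pass WF_SYNC instead,
    and can OR in further flags without changing any signature.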

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • In order to be able to rename the sync argument, we need to rename
    the current flag argument.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • I suspect a feedback loop between cpuidle and the aperf/mperf
    cpu_power bits, where idle C-states lower the ratio, which leads
    to lower cpu_power and then less load, which generates more idle
    time, etc.

    Put in a knob to disable it.
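
    The scaling and the knob can be modelled roughly as below. This is a
    sketch under assumptions: a nominal power of 1024, counter deltas
    taken over some interval, and a hypothetical function name; the real
    arch code reads the APERF/MPERF MSRs and differs in detail.

```c
#include <stdbool.h>

#define POWER_SCALE 1024ULL  /* assumed nominal cpu_power of one CPU */

/*
 * Scale the nominal power by the aperf/mperf ratio measured over an
 * interval. With the knob off, always report the nominal power, which
 * breaks any feedback from the ratio into load balancing.
 */
static unsigned long scale_cpu_power(unsigned long long aperf_delta,
                                     unsigned long long mperf_delta,
                                     bool use_aperf_mperf)
{
    if (!use_aperf_mperf || mperf_delta == 0)
        return (unsigned long)POWER_SCALE;
    /* Note: the multiplication can overflow for very large deltas. */
    return (unsigned long)(aperf_delta * POWER_SCALE / mperf_delta);
}
```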

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Provide an arch-specific hook for cpufreq based scaling of
    cpu_power.

    Signed-off-by: Peter Zijlstra
    [ego@in.ibm.com: spotting bugs]
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Make the idle balancer more aggressive, to improve an
    x264 encoding workload provided by Jason Garrett-Glaser:

    NEXT_BUDDY NO_LB_BIAS
    encoded 600 frames, 252.82 fps, 22096.60 kb/s
    encoded 600 frames, 250.69 fps, 22096.60 kb/s
    encoded 600 frames, 245.76 fps, 22096.60 kb/s

    NO_NEXT_BUDDY LB_BIAS
    encoded 600 frames, 344.44 fps, 22096.60 kb/s
    encoded 600 frames, 346.66 fps, 22096.60 kb/s
    encoded 600 frames, 352.59 fps, 22096.60 kb/s

    NO_NEXT_BUDDY NO_LB_BIAS
    encoded 600 frames, 425.75 fps, 22096.60 kb/s
    encoded 600 frames, 425.45 fps, 22096.60 kb/s
    encoded 600 frames, 422.49 fps, 22096.60 kb/s

    Peter pointed out that this is better done via newidle_idx,
    not via LB_BIAS: newidle balancing should look for where
    there is load _now_, not where there was load 2 ticks ago.

    Worst-case latencies are improved as well, as no buddies
    means less vruntime spread. (as per prior lkml discussions)

    This change improves kbuild-peak parallelism as well.

    Reported-by: Jason Garrett-Glaser
    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • When merging select_task_rq_fair() and sched_balance_self() we lost
    the use of wake_idx, restore that and set them to 0 to make wake
    balancing more aggressive.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • While merging select_task_rq_fair() and sched_balance_self() I made
    a mistake that leads to testing the wrong task affinity.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • for_each_domain() uses RCU to serialize the sched_domains, except
    it doesn't actually use rcu_read_lock() and instead relies on
    disabling preemption -> FAIL.

    XXX: audit other sched_domain code.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • One of the problems of power-saving balancing is that under certain
    scenarios it is too slow and allows tons of real work to pile up.

    Avoid this by ignoring the powersave stuff when there's real work
    to be done.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The problem with wake_idle() is that it doesn't respect things like
    cpu_power, which means it doesn't deal well with SMT nor the recent
    RT interaction.

    To cure this, it needs to do what sched_balance_self() does, which
    leads to the possibility of merging select_task_rq_fair() and
    sched_balance_self().

    Modify sched_balance_self() to:

    - update_shares() when walking up the domain tree,
    (it only called it for the top domain, but it should
    have done this anyway), which allows us to remove
    this ugly bit from try_to_wake_up().

    - do wake_affine() on the smallest domain that contains
    both this (the waking) and the prev (the wakee) cpu for
    WAKE invocations.

    Then use the top-down balance steps it had to replace wake_idle().

    This leads to the disappearance of SD_WAKE_BALANCE and
    SD_WAKE_IDLE_FAR, with SD_WAKE_IDLE replaced with SD_BALANCE_WAKE.

    SD_WAKE_AFFINE needs SD_BALANCE_WAKE to be effective.

    Touch all topology bits to replace the old with new SD flags --
    platforms might need re-tuning, enabling SD_BALANCE_WAKE
    conditionally on a NUMA distance seems like a good additional
    feature, magny-core and small nehalem systems would want this
    enabled, systems with slow interconnects would not.
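
    The "smallest domain that contains both CPUs" step described above can
    be sketched as a walk up the domain tree. A toy userspace model: the
    struct domain layout, the bitmask span and the function name are
    assumptions for illustration, not the kernel's sched_domain API.

```c
#include <stddef.h>

struct domain {
    unsigned long span;     /* bitmask of the CPUs this domain covers */
    struct domain *parent;  /* next larger domain, NULL at the top */
};

/*
 * Walk up from the smallest domain of the waking CPU and return the
 * first one that also contains prev_cpu, or NULL if none does.
 * CPU numbers must be smaller than the bit width of unsigned long.
 */
static struct domain *smallest_common_domain(struct domain *sd,
                                             int this_cpu, int prev_cpu)
{
    unsigned long need = (1UL << this_cpu) | (1UL << prev_cpu);

    for (; sd; sd = sd->parent)
        if ((sd->span & need) == need)
            return sd;
    return NULL;
}
```

    For a three-level tree covering CPUs 0-1, 0-3 and 0-7, waking on CPU 0
    a task that last ran on CPU 3 selects the middle domain.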

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • We're going to want to drop rq->lock in try_to_wake_up() for a
    longer period of time, however we also want to deal with concurrent
    waking of the same task, which is currently handled by holding
    rq->lock.

    So introduce a new TASK state, namely TASK_WAKING, which indicates
    someone is already waking the task (other wakers will fail p->state
    & state).

    We also keep preemption disabled over the whole ttwu().
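
    The way a second waker "fails p->state & state" can be shown with a
    small model. This is a single-threaded sketch: the numeric value of
    TASK_WAKING and the split into wake_claim() and wake_finish() are
    assumptions for illustration, and the real code performs the claim
    under rq->lock.

```c
#include <stdbool.h>

#define TASK_RUNNING         0x000
#define TASK_INTERRUPTIBLE   0x001
#define TASK_UNINTERRUPTIBLE 0x002
#define TASK_WAKING          0x100  /* illustrative value */

struct task {
    unsigned int state;
};

/* First half of a wakeup: claim the task if it sleeps in a wanted state. */
static bool wake_claim(struct task *p, unsigned int state_mask)
{
    if (!(p->state & state_mask))
        return false;        /* not sleeping that way, or already claimed */
    p->state = TASK_WAKING;  /* later wakers now fail the test above */
    return true;
}

/* Second half: CPU selection is done, make the task runnable. */
static void wake_finish(struct task *p)
{
    p->state = TASK_RUNNING;
}
```

    Because TASK_WAKING shares no bits with the sleep states, a
    concurrent waker that tests p->state against its mask backs off
    without the first waker having to hold the lock the whole time.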

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra