14 Apr, 2011

1 commit


13 Apr, 2011

1 commit

  • We really only want to unplug the pending IO when the process actually
    goes to sleep. So move the test for flushing the plug up to the place
    where we actually deactivate the task - where we have properly checked
    for preemption and for the process really sleeping.

    Acked-by: Jens Axboe
    Acked-by: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

12 Apr, 2011

2 commits

  • Make XEN_SAVE_RESTORE select HIBERNATE_CALLBACKS.
    Remove XEN_SAVE_RESTORE dependency from PM_SLEEP.

    Signed-off-by: Shriram Rajagopalan
    Acked-by: Ian Campbell
    Signed-off-by: Rafael J. Wysocki

    Shriram Rajagopalan
     
  • Xen save/restore is going to use hibernate device callbacks for
    quiescing devices and putting them back to normal operations and it
    would need to select CONFIG_HIBERNATION for this purpose. However,
    that also would cause the hibernate interfaces for user space to be
    enabled, which might confuse user space, because the Xen kernels
    don't support hibernation. Moreover, it would be wasteful, as it
    would make the Xen kernels include a substantial amount of code that
    they would never use.

    To address this issue, introduce a new power management Kconfig option
    CONFIG_HIBERNATE_CALLBACKS, such that it will only select the code
    that is necessary for the hibernate device callbacks to work and make
    CONFIG_HIBERNATION select it. Then, Xen save/restore will be able to
    select CONFIG_HIBERNATE_CALLBACKS without dragging the entire
    hibernate code along with it.

    Signed-off-by: Rafael J. Wysocki
    Tested-by: Shriram Rajagopalan

    Rafael J. Wysocki
     

11 Apr, 2011

3 commits

  • calc_delta_fair() checks NICE_0_LOAD already, delete duplicate check.

    Signed-off-by: Shaohua Li
    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Link: http://lkml.kernel.org/r/1302238389.3981.92.camel@sli10-conroe
    Signed-off-by: Ingo Molnar

    Shaohua Li
     
  • The scheduler load balancer has specific code to deal with cases of an
    unbalanced system due to lots of unmovable tasks (for example because of
    hard CPU affinity). In those situations, it excludes the busiest CPU that
    has pinned tasks from load balance consideration so that it can perform
    a second load balance pass on the rest of the system.

    This all works as designed if there is only one cgroup in the system.

    However, when we have multiple cgroups, this logic has false positives and
    triggers multiple load balance passes even though there are actually no
    pinned tasks at all.

    The reason it has false positives is that the all pinned logic is deep in
    the lowest function of can_migrate_task() and is too low level:

    load_balance_fair() iterates over each task group and calls balance_tasks()
    to migrate the target load. Along the way, balance_tasks() also sets an
    all_pinned variable. Given that task groups are iterated, this all_pinned
    variable is essentially the status of the last group in the scanning
    process. A task group can have a number of reasons for no load being
    migrated, none of them due to CPU affinity. However, this status bit is
    propagated back up to the higher-level load_balance(), which incorrectly
    thinks that no tasks were moved. It kicks off the all-pinned logic and
    starts multiple passes attempting to move load onto the puller CPU.

    To fix this, move the all_pinned aggregation up to the iterator level.
    This ensures that the status is aggregated over all task groups, not just
    the last one in the list.
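
The difference between the two aggregation strategies can be sketched in
plain C. This is a minimal userspace model, not the kernel code: struct
group_scan and both function names are hypothetical, and a group counts as
"all pinned" only in the sense that no examined task passed the affinity
check (which is trivially true for a group with nothing to move).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-group scan result; not a kernel structure. */
struct group_scan {
    int tasks_seen;   /* tasks examined in this group */
    int tasks_pinned; /* of those, tasks blocked by CPU affinity */
};

/* A group looks "all pinned" unless some task passed the affinity check. */
static bool group_all_pinned(const struct group_scan *g)
{
    return g->tasks_pinned == g->tasks_seen;
}

/* Behaviour described as buggy: the flag reflects only the last group. */
bool all_pinned_last_group(const struct group_scan *groups, size_t n)
{
    bool all_pinned = true;

    for (size_t i = 0; i < n; i++)
        all_pinned = group_all_pinned(&groups[i]); /* overwritten each time */
    return all_pinned;
}

/* Behaviour described as the fix: aggregate over every group. */
bool all_pinned_aggregated(const struct group_scan *groups, size_t n)
{
    bool all_pinned = true;

    for (size_t i = 0; i < n; i++)
        if (!group_all_pinned(&groups[i]))
            all_pinned = false; /* one movable task anywhere clears it */
    return all_pinned;
}
```

With a first group holding three movable tasks and a second, empty group,
the last-group variant reports "all pinned" while the aggregated variant
does not.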

    Signed-off-by: Ken Chen
    Cc: stable@kernel.org
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/BANLkTi=ernzNawaR5tJZEsV_QVnfxqXmsQ@mail.gmail.com
    Signed-off-by: Ingo Molnar

    Ken Chen
     
  • In find_busiest_group(), the sched-domain avg_load isn't calculated at
    all if there is a group imbalance within the domain. This causes an
    erroneous imbalance calculation.

    The reason is that calculate_imbalance() sees sds->avg_load = 0 and
    dumps the entire sds->max_load into the imbalance variable, which is used
    later on to migrate the entire load from the busiest CPU to the puller CPU.

    This has two really bad effects:

    1. A stampede of task migrations, which cannot break out of the bad
    state because of a positive feedback loop: large load delta ->
    heavier load migration -> larger imbalance, and the cycle goes on.

    2. Severe imbalance in CPU queue depth. This causes really long
    scheduling latency blips, which badly affect applications that
    have tight latency requirements.

    The fix is to have the kernel calculate the domain avg_load in both cases.
    This ensures that the imbalance calculation is always sensible and the
    target is usually halfway between the busiest and the puller CPU.
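
The effect of a zero avg_load can be illustrated with a deliberately
simplified model. The function names below are hypothetical, and power
scaling and capacity terms are left out; the sketch only assumes that the
imbalance is bounded by how far the busiest CPU is above the average and
how far the puller is below it, computed in unsigned arithmetic.

```c
/* Hypothetical, simplified model; not the kernel's calculate_imbalance(). */
static unsigned long min_ul(unsigned long a, unsigned long b)
{
    return a < b ? a : b;
}

/* Average load over the domain, assumed here to be total load per CPU. */
unsigned long sketch_avg_load(unsigned long total_load, unsigned long nr_cpus)
{
    return nr_cpus ? total_load / nr_cpus : 0;
}

/*
 * Load to move from the busiest CPU to the puller CPU.
 * The arithmetic is unsigned: if avg_load is below this_load (for example
 * because it was never computed and is still 0), avg_load - this_load wraps
 * to a huge value and the result collapses to max_load - avg_load.
 */
unsigned long sketch_imbalance(unsigned long max_load,
                               unsigned long this_load,
                               unsigned long avg_load)
{
    return min_ul(max_load - avg_load, avg_load - this_load);
}
```

With max_load = 2048 and this_load = 1024, an uncomputed avg_load of 0
yields an imbalance of 2048 (the whole load), while the computed average
of 1536 yields 512, which leaves both CPUs at the average.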

    Signed-off-by: Ken Chen
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110408002322.3A0D812217F@elm.corp.google.com
    Signed-off-by: Ingo Molnar

    Ken Chen
     

09 Apr, 2011

1 commit


08 Apr, 2011

2 commits

  • …-linus', 'irq-fixes-for-linus' and 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86-32, fpu: Fix FPU exception handling on non-SSE systems
    x86, hibernate: Initialize mmu_cr4_features during boot
    x86-32, NUMA: Fix ACPI NUMA init broken by recent x86-64 change
    x86: visws: Fixup irq overhaul fallout

    * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    sched: Clean up rebalance_domains() load-balance interval calculation

    * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86/mrst/vrtc: Fix boot crash in mrst_rtc_init()
    rtc, x86/mrst/vrtc: Fix boot crash in rtc_read_alarm()

    * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    genirq: Fix cpumask leak in __setup_irq()

    * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    perf probe: Fix listing incorrect line number with inline function
    perf probe: Fix to find recursively inlined function
    perf probe: Fix multiple --vars options behavior
    perf probe: Fix to remove redundant close
    perf probe: Fix to ensure function declared file

    Linus Torvalds
     
  • * 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6:
    Fix common misspellings

    Linus Torvalds
     

05 Apr, 2011

3 commits


04 Apr, 2011

3 commits

  • …l/git/tip/linux-2.6-tip

    * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    sched: Fix rebalance interval calculation
    sched, doc: Beef up load balancing description
    sched: Leave sched_setscheduler() earlier if possible, do not disturb SCHED_FIFO tasks

    Linus Torvalds
     
  • …/git/tip/linux-2.6-tip

    * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    perf: Fix task_struct reference leak
    perf: Fix task context scheduling
    perf: mmap 512 kiB by default
    perf: Rebase max unprivileged mlock threshold on top of page size
    perf tools: Fix NO_NEWT=1 python build error
    perf symbols: Properly align symbol_conf.priv_size
    perf tools: Emit clearer message for sys_perf_event_open ENOENT return
    perf tools: Fixup exit path when not able to open events
    perf symbols: Fix vsyscall symbol lookup
    oprofile, x86: Allow setting EDGE/INV/CMASK for counter events

    Linus Torvalds
     
  • The ADJ_SETOFFSET bit added in commit 094aa188 ("ntp: Add ADJ_SETOFFSET
    mode bit") also introduced a way for any user to change the system time.
    Sneaky or buggy calls to adjtimex() could set

    ADJ_OFFSET_SS_READ | ADJ_SETOFFSET

    which would result in a successful call to timekeeping_inject_offset().
    This patch fixes the issue by adding the capability check.
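
A rough userspace sketch of how such a flag combination could slip past a
permission check follows. The SK_* constants, the structure of the earlier
check and both function names are assumptions made for illustration; the
commit message itself only states that a capability check was added.

```c
#include <errno.h>
#include <stdbool.h>

/* Illustrative flag values, assumed for this sketch. */
#define SK_ADJ_SETOFFSET       0x0100
#define SK_ADJ_OFFSET_READONLY 0x2000
#define SK_ADJ_ADJTIME         0x8000
#define SK_ADJ_OFFSET_SS_READ  0xa001

/* Assumed shape of the check before the fix. */
int sketch_check_old(unsigned int modes, bool has_cap_sys_time)
{
    if (modes & SK_ADJ_ADJTIME) {
        /* Read-only adjtime-style queries need no capability. */
        if (!(modes & SK_ADJ_OFFSET_READONLY) && !has_cap_sys_time)
            return -EPERM;
    } else if (modes && !has_cap_sys_time) {
        return -EPERM;
    }
    return 0; /* a set SETOFFSET bit would be acted on from here */
}

/* Same check plus an explicit capability test for SETOFFSET. */
int sketch_check_fixed(unsigned int modes, bool has_cap_sys_time)
{
    int ret = sketch_check_old(modes, has_cap_sys_time);

    if (ret)
        return ret;
    if ((modes & SK_ADJ_SETOFFSET) && !has_cap_sys_time)
        return -EPERM;
    return 0;
}
```

In this model an unprivileged caller passing
SK_ADJ_OFFSET_SS_READ | SK_ADJ_SETOFFSET gets 0 from the old check and
-EPERM from the fixed one.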

    Signed-off-by: Richard Cochran
    Signed-off-by: Linus Torvalds

    Richard Cochran
     

03 Apr, 2011

1 commit


01 Apr, 2011

1 commit

  • On ppc64 the crashkernel region almost always overlaps an area of firmware.
    This works fine except when using the sysfs interface to reduce the kdump
    region. If we free the firmware area we are guaranteed to crash.

    Rename free_reserved_phys_range to crash_free_reserved_phys_range and make
    it a weak function so we can override it.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     

31 Mar, 2011

5 commits

  • Fixes generated by 'codespell' and manually reviewed.

    Signed-off-by: Lucas De Marchi

    Lucas De Marchi
     
  • sys_perf_event_open() had an imbalance in the number of task refs it
    took, causing a memory leak.

    Cc: Jiri Olsa
    Cc: Oleg Nesterov
    Cc: stable@kernel.org # .37+
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Ensure we allow 512 kiB + 1 page for user control without
    assuming a 4096-byte page size.
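
The arithmetic can be sketched as follows. The helper name is hypothetical
and the limit is assumed to be expressed in kiB; the point is only that
the "+ 1 page" term is derived from the page size instead of being fixed.

```c
/*
 * Hypothetical helper: limit in kiB for "512 kiB + 1 page",
 * given the page size in bytes.
 */
unsigned long sketch_mlock_limit_kib(unsigned long page_size_bytes)
{
    return 512UL + page_size_bytes / 1024UL;
}
```

For 4096-byte pages this gives 516 kiB, and for 65536-byte pages 576 kiB,
so a constant tuned for 4096-byte pages would fall short on the latter.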

    Reported-by: Peter Zijlstra
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Paul Mackerras
    Cc: Stephane Eranian
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • The interval for checking whether scheduling domains are due to be
    balanced currently depends on boot state NR_CPUS, which may not
    accurately reflect the number of online CPUs at the time of the check.

    Thus replace NR_CPUS with num_online_cpus().

    (ed: Should only affect those who set NR_CPUS really high, such as 4096
    or so :-)

    Signed-off-by: Sisir Koppaka
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Sisir Koppaka
     
  • sched_setscheduler() (in sched.c) is called in order to change the
    scheduling policy and/or the real-time priority of a task. Thus,
    if we find out that neither of those is actually being modified, it
    is possible to return earlier and save the overhead of a full
    deactivate+activate cycle of the task in question.

    Besides that, if we have more than one SCHED_FIFO task with the same
    priority on the same rq (which means they share the same priority queue),
    having one of them change its position in the priority queue because of
    a sched_setscheduler() call (as happens by means of the
    deactivate+activate) that does not actually change the priority violates
    POSIX, which states, for SCHED_FIFO:

    "If a thread whose policy or priority has been modified by
    pthread_setschedprio() is a running thread or is runnable, the effect on
    its position in the thread list depends on the direction of the
    modification, as follows: a. b. If the priority is unchanged, the
    thread does not change position in the thread list. c. "

    http://pubs.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_08.html

    (ed: And the POSIX specification here does, briefly and somewhat unexpectedly,
    match what common sense tells us as well.)
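
The early-return idea can be modelled in userspace like this. Everything
here is hypothetical (struct sketch_task, the single flat queue, the
function names); a real run queue keeps one list per priority, so the
sketch only shows that an unchanged policy and priority leaves the queue
order alone, while a real change goes through a requeue.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_SCHED_FIFO 1 /* placeholder policy id for this sketch */

struct sketch_task {
    int id;
    int policy;
    int rt_priority;
};

/* Move queue[idx] to the tail, standing in for deactivate+activate. */
static void requeue_tail(struct sketch_task *queue, size_t n, size_t idx)
{
    struct sketch_task t = queue[idx];

    for (size_t i = idx; i + 1 < n; i++)
        queue[i] = queue[i + 1];
    queue[n - 1] = t;
}

/* Returns true if the task was requeued, false if nothing changed. */
bool sketch_setscheduler(struct sketch_task *queue, size_t n, size_t idx,
                         int policy, int rt_priority)
{
    struct sketch_task *t = &queue[idx];

    /* Early return: same policy and priority, so keep the position. */
    if (t->policy == policy && t->rt_priority == rt_priority)
        return false;

    t->policy = policy;
    t->rt_priority = rt_priority;
    requeue_tail(queue, n, idx);
    return true;
}
```

Calling sketch_setscheduler() on the head of a three-task queue with its
current policy and priority leaves it at the head; calling it with a new
priority moves it to the tail.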

    Signed-off-by: Dario Faggioli
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Dario Faggioli
     

30 Mar, 2011

2 commits


29 Mar, 2011

9 commits

  • All users converted to new interface.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • The only subtle difference is that alpha uses ACTUAL_NR_IRQS and
    prints the IRQF_DISABLED flag.

    Change the generic implementation to deal with ACTUAL_NR_IRQS if
    defined.

    The IRQF_DISABLED printing is pointless, as we nowadays run all
    interrupts with irqs disabled.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • The late night fixup failed to convert the data type from irq_desc to
    irq_data, which results in a harmless but annoying warning.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • …rnel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'irq-cleanup-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    vlynq: Convert irq functions

    * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    genirq; Fix cleanup fallout
    genirq: Fix typo and remove unused variable
    genirq: Fix new kernel-doc warnings
    genirq: Add setter for AFFINITY_SET in irq_data state
    genirq: Provide setter inline for IRQD_IRQ_INPROGRESS
    genirq: Remove handle_IRQ_event
    arm: Ns9xxx: Remove private irq flow handler
    powerpc: cell: Use the core flow handler
    genirq: Provide edge_eoi flow handler
    genirq: Move INPROGRESS, MASKED and DISABLED state flags to irq_data
    genirq: Split irq_set_affinity() so it can be called with lock held.
    genirq: Add chip flag for restricting cpu_on/offline calls
    genirq: Add chip hooks for taking CPUs on/off line.
    genirq: Add irq disabled flag to irq_data state
    genirq: Reserve the irq when calling irq_set_chip()

    Linus Torvalds
     
  • I missed the CONFIG_GENERIC_PENDING_IRQ dependency in the affinity
    related functions and the IRQ_LEVEL propagation into irq_data
    state. Did not pop up on my main test platforms. :(

    Signed-off-by: Thomas Gleixner
    Tested-by: David Daney

    Thomas Gleixner
     
  • Commit da48524eb206 ("Prevent rt_sigqueueinfo and rt_tgsigqueueinfo
    from spoofing the signal code") made the check on si_code too strict.
    There are several legitimate places where glibc wants to queue a
    negative si_code different from SI_QUEUE:

    - This was first noticed with glibc's aio implementation, which wants
    to queue a signal with si_code SI_ASYNCIO; the current kernel
    causes glibc's tst-aio4 test to fail because rt_sigqueueinfo()
    fails with EPERM.

    - Further examination of the glibc source shows that getaddrinfo_a()
    wants to use SI_ASYNCNL (which the kernel does not even define).
    The timer_create() fallback code wants to queue signals with SI_TIMER.

    As suggested by Oleg Nesterov, loosen the check to
    forbid only the problematic SI_TKILL case.
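
The two versions of the check can be contrasted in a small sketch. The
SK_SI_* values and the function names are assumptions for illustration,
as is the rule that non-negative codes stay forbidden in both versions;
the commit message only says that the strict check rejected negative
codes other than SI_QUEUE and that the loosened one singles out SI_TKILL.

```c
#include <errno.h>

/* Illustrative si_code values, assumed for this sketch. */
#define SK_SI_QUEUE   (-1)
#define SK_SI_TIMER   (-2)
#define SK_SI_ASYNCIO (-4)
#define SK_SI_TKILL   (-6)

/* Strict variant: only SI_QUEUE may be queued from user space. */
int sketch_check_strict(int si_code)
{
    return si_code != SK_SI_QUEUE ? -EPERM : 0;
}

/*
 * Loosened variant: negative codes are accepted except SI_TKILL.
 * Rejecting non-negative codes is an assumption of this sketch.
 */
int sketch_check_loosened(int si_code)
{
    if (si_code >= 0 || si_code == SK_SI_TKILL)
        return -EPERM;
    return 0;
}
```

Under the strict variant SK_SI_ASYNCIO and SK_SI_TIMER are rejected with
-EPERM; under the loosened variant they pass, and SK_SI_TKILL is still
refused.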

    Reported-by: Klaus Dittrich
    Acked-by: Julien Tinnes
    Signed-off-by: Roland Dreier
    Signed-off-by: Linus Torvalds

    Roland Dreier
     
  • Sigh, I'm overworked.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Fix new irq-related kernel-doc warnings in 2.6.38:

    Warning(kernel/irq/manage.c:149): No description found for parameter 'mask'
    Warning(kernel/irq/manage.c:149): Excess function parameter 'cpumask' description in 'irq_set_affinity'
    Warning(include/linux/irq.h:161): No description found for parameter 'state_use_accessors'
    Warning(include/linux/irq.h:161): Excess struct/union/enum/typedef member 'state_use_accessor' description in 'irq_data'

    Signed-off-by: Randy Dunlap
    Signed-off-by: Thomas Gleixner

    Randy Dunlap
     

28 Mar, 2011

3 commits


27 Mar, 2011

3 commits