19 Apr, 2011

1 commit

  • next_pidmap() just quietly accepted whatever 'last' pid that was passed
    in, which is not all that safe when one of the users is /proc.

    Admittedly the proc code should do some sanity checking on the range
    (and that will be the next commit), but that doesn't mean that the
    helper functions should just do that pidmap pointer arithmetic without
    checking the range of its arguments.

    So clamp 'last' to PID_MAX_LIMIT. The fact that we then do "last+1"
    doesn't really matter, the for-loop does check against the end of the
    pidmap array properly (it's only the actual pointer arithmetic overflow
    case we need to worry about, and going one bit beyond isn't going to
    overflow).

    [ Use PID_MAX_LIMIT rather than pid_max as per Eric Biederman ]

    Reported-by: Tavis Ormandy
    Analyzed-by: Robert Święcki
    Cc: Eric W. Biederman
    Cc: Pavel Emelyanov
    Signed-off-by: Linus Torvalds

    Linus Torvalds
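
    A minimal userspace sketch of the bounds check described above, assuming
    illustrative constants and a simplified next_pid_index() helper rather
    than the kernel's actual pid.c code:

        #include <stdio.h>

        /* Illustrative values; the real ones are kernel configuration. */
        #define PID_MAX_LIMIT  32768
        #define BITS_PER_PAGE  (4096 * 8)

        /* Refuse any 'last' outside the pid range before doing arithmetic
         * with it, mirroring the shape of the fix. */
        static int next_pid_index(unsigned int last)
        {
                if (last >= PID_MAX_LIMIT)
                        return -1;
                /* "last + 1" may point one entry past the end, which the
                 * caller's loop bound handles; the check above is what stops
                 * wild pointer arithmetic from an arbitrary 'last'. */
                return (int)((last + 1) / BITS_PER_PAGE);
        }

        int main(void)
        {
                printf("last=100        -> index %d\n", next_pid_index(100));
                printf("last=0xffffffff -> index %d\n",
                       next_pid_index(0xffffffffu));
                return 0;
        }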
     

18 Apr, 2011

1 commit

  • A dynamic posix clock is protected from asynchronous removal by a mutex.
    However, using a mutex has the unwanted effect that a long running clock
    operation in one process will unnecessarily block other processes.

    For example, one process might call read() to get an external time stamp
    coming in at one pulse per second. A second process calling clock_gettime
    would have to wait for almost a whole second.

    This patch fixes the issue by using a reader/writer semaphore instead of
    a mutex.

    Signed-off-by: Richard Cochran
    Cc: John Stultz
    Link: http://lkml.kernel.org/r/%3C20110330132421.GA31771%40riccoc20.at.omicron.at%3E
    Signed-off-by: Thomas Gleixner

    Richard Cochran
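
    A userspace sketch of the locking change described above, using POSIX
    threads instead of the kernel's rw_semaphore (slow_reader() and
    fast_reader() are made-up names, not the posix-clock API): with a
    reader/writer lock, a long-running read such as waiting for a 1 PPS
    timestamp no longer blocks a quick clock_gettime()-style read, while
    clock removal would still take the lock for writing.

        /* build: cc demo.c -lpthread */
        #include <pthread.h>
        #include <stdio.h>
        #include <unistd.h>

        static pthread_rwlock_t clk_lock = PTHREAD_RWLOCK_INITIALIZER;

        /* Slow read side, e.g. blocking for the next external time stamp. */
        static void *slow_reader(void *arg)
        {
                pthread_rwlock_rdlock(&clk_lock);
                sleep(1);                     /* pretend to wait for a pulse */
                pthread_rwlock_unlock(&clk_lock);
                return NULL;
        }

        /* Fast read side, e.g. clock_gettime(); with a plain mutex it would
         * sit behind slow_reader(), with an rwlock it proceeds at once. */
        static void *fast_reader(void *arg)
        {
                pthread_rwlock_rdlock(&clk_lock);
                puts("fast read served while the slow read is still running");
                pthread_rwlock_unlock(&clk_lock);
                return NULL;
        }

        int main(void)
        {
                pthread_t a, b;
                pthread_create(&a, NULL, slow_reader, NULL);
                usleep(10000);          /* let the slow reader take the lock */
                pthread_create(&b, NULL, fast_reader, NULL);
                pthread_join(b, NULL);
                pthread_join(a, NULL);
                return 0;
        }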
     

17 Apr, 2011

2 commits

  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block:
    block: make unplug timer trace event correspond to the schedule() unplug
    block: let io_schedule() flush the plug inline

    Linus Torvalds
     
  • Merge branches 'core-fixes-for-linus', 'perf-fixes-for-linus', 'sched-fixes-for-linus', 'timer-fixes-for-linus' and 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    futex: Set FLAGS_HAS_TIMEOUT during futex_wait restart setup

    * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    perf_event: Fix cgrp event scheduling bug in perf_enable_on_exec()
    perf: Fix a build error with some GCC versions

    * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    sched: Fix erroneous all_pinned logic
    sched: Fix sched-domain avg_load calculation

    * 'timer-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    RTC: rtc-mrst: follow on to the change of rtc_device_register()
    RTC: add missing "return 0" in new alarm func for rtc-bfin.c
    RTC: Fix s3c compile error due to missing s3c_rtc_setpie
    RTC: Fix early irqs caused by calling rtc_set_alarm too early

    * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86, amd: Disable GartTlbWlkErr when BIOS forgets it
    x86, NUMA: Fix fakenuma boot failure
    x86/mrst: Fix boot crash caused by incorrect pin to irq mapping
    x86/ce4100: Add reg property to bridges

    Linus Torvalds
     

16 Apr, 2011

2 commits

  • It's a pretty close match to what we had before - the timer triggering
    meant that nobody unplugged the plug in due time; in the new scheme this
    corresponds very closely to the schedule() unplug. It's essentially the
    difference between an explicit unplug (IO unplug) and an implicit unplug
    (timer unplug, we scheduled with pending IO queued).

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Linus correctly observes that the most important dispatch cases
    are now done from kblockd, which isn't ideal for latency reasons.
    The original reason for switching dispatches out-of-line was to
    avoid too deep a stack, so by _only_ letting the "accidental"
    flush directly in schedule() be guarded by offload to kblockd,
    we should be able to get the best of both worlds.

    So add a blk_schedule_flush_plug() that offloads to kblockd,
    and only use that from the schedule() path.

    Signed-off-by: Jens Axboe

    Jens Axboe
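
    A rough userspace sketch of the idea above, assuming a generic worker
    thread standing in for kblockd and a toy flush_plug() (none of this is
    the block layer's real API): explicit unpluggers flush inline for low
    latency, while the implicit flush from the schedule() path hands the
    work to the worker so the already-deep stack is not made deeper.

        /* build: cc demo.c -lpthread */
        #include <pthread.h>
        #include <stdio.h>

        static void flush_plug(void)
        {
                puts("flushing pending IO");
        }

        static void *worker(void *arg)
        {
                flush_plug();           /* runs on the worker's own stack */
                return NULL;
        }

        /* Explicit unplug: the caller can afford the stack, flush inline. */
        static void explicit_unplug(void)
        {
                flush_plug();
        }

        /* Implicit unplug from the scheduler path: offload to the worker. */
        static void schedule_flush_plug(pthread_t *t)
        {
                pthread_create(t, NULL, worker, NULL);
        }

        int main(void)
        {
                pthread_t t;
                explicit_unplug();
                schedule_flush_plug(&t);
                pthread_join(t, NULL);
                return 0;
        }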
     

15 Apr, 2011

2 commits

  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block:
    block: only force kblockd unplugging from the schedule() path
    block: cleanup the block plug helper functions
    block, blk-sysfs: Use the variable directly instead of a function call
    block: move queue run on unplug to kblockd
    block: kill queue_sync_plugs()
    block: readd plug trace event
    block: add callback function for unplug notification
    block: add comment on why we save and disable interrupts in flush_plug_list()
    block: fixup block IO unplug trace call
    block: remove block_unplug_timer() trace point
    block: splice plug list to local context

    Linus Torvalds
     
  • The FLAGS_HAS_TIMEOUT flag was not getting set, causing the restart_block to
    restart futex_wait() without a timeout after a signal.

    Commit b41277dc7a18ee332d in 2.6.38 introduced the regression by accidentally
    removing the FLAGS_HAS_TIMEOUT assignment from futex_wait() during the setup
    of the restart block. Restore the original behavior.

    Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=32922

    Reported-by: Tim Smith
    Reported-by: Torsten Hilbrich
    Signed-off-by: Darren Hart
    Signed-off-by: Eric Dumazet
    Cc: Peter Zijlstra
    Cc: John Kacur
    Cc: stable@kernel.org
    Link: http://lkml.kernel.org/r/%3Cdaac0eb3af607f72b9a4d3126b2ba8fb5ed3b883.1302820917.git.dvhart%40linux.intel.com%3E
    Signed-off-by: Thomas Gleixner

    Darren Hart
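
    A small sketch of the restart bookkeeping described above; the struct
    and flag names are simplified stand-ins for the kernel's restart_block
    and futex flags, not the real definitions. Recording "has timeout" when
    the wait is set up is what lets the restarted wait re-arm the original
    deadline instead of sleeping forever.

        #include <stdio.h>

        #define FLAGS_HAS_TIMEOUT  0x4        /* illustrative value */

        struct restart_state {
                unsigned int flags;
                long long    deadline_ns;
        };

        /* Set up a timed wait for restart after a signal.  Forgetting to OR
         * in FLAGS_HAS_TIMEOUT here is the omission the fix above restores. */
        static void setup_restart(struct restart_state *r, unsigned int flags,
                                  long long deadline_ns)
        {
                r->flags = flags | FLAGS_HAS_TIMEOUT;
                r->deadline_ns = deadline_ns;
        }

        static void restart_wait(const struct restart_state *r)
        {
                if (r->flags & FLAGS_HAS_TIMEOUT)
                        printf("restarting wait until %lld ns\n",
                               r->deadline_ns);
                else
                        printf("restarting wait with no timeout (the bug)\n");
        }

        int main(void)
        {
                struct restart_state r;
                setup_restart(&r, 0, 123456789LL);
                restart_wait(&r);
                return 0;
        }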
     

13 Apr, 2011

1 commit

  • We really only want to unplug the pending IO when the process actually
    goes to sleep. So move the test for flushing the plug up to the place
    where we actually deactivate the task - where we have properly checked
    for preemption and for the process really sleeping.

    Acked-by: Jens Axboe
    Acked-by: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

12 Apr, 2011

4 commits

  • It was removed with the on-stack plugging; re-add it and track the
    depth of requests added when flushing the plug.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We no longer have an unplug timer running, so no point in keeping
    the trace point.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Make XEN_SAVE_RESTORE select HIBERNATE_CALLBACKS.
    Remove XEN_SAVE_RESTORE dependency from PM_SLEEP.

    Signed-off-by: Shriram Rajagopalan
    Acked-by: Ian Campbell
    Signed-off-by: Rafael J. Wysocki

    Shriram Rajagopalan
     
  • Xen save/restore is going to use hibernate device callbacks for
    quiescing devices and putting them back to normal operations and it
    would need to select CONFIG_HIBERNATION for this purpose. However,
    that also would cause the hibernate interfaces for user space to be
    enabled, which might confuse user space, because the Xen kernels
    don't support hibernation. Moreover, it would be wasteful, as it
    would make the Xen kernels include a substantial amount of code that
    they would never use.

    To address this issue introduce new power management Kconfig option
    CONFIG_HIBERNATE_CALLBACKS, such that it will only select the code
    that is necessary for the hibernate device callbacks to work and make
    CONFIG_HIBERNATION select it. Then, Xen save/restore will be able to
    select CONFIG_HIBERNATE_CALLBACKS without dragging the entire
    hibernate code along with it.

    Signed-off-by: Rafael J. Wysocki
    Tested-by: Shriram Rajagopalan

    Rafael J. Wysocki
     

11 Apr, 2011

3 commits

  • The scheduler load balancer has specific code to deal with cases of an
    unbalanced system due to lots of unmovable tasks (for example because of
    hard CPU affinity). In those situations, it excludes the busiest CPU that
    has pinned tasks from load balance consideration so that it can perform a
    second load balance pass on the rest of the system.

    This all works as designed if there is only one cgroup in the system.

    However, when we have multiple cgroups, this logic has false positives and
    triggers multiple load balance passes even though there are actually no
    pinned tasks at all.

    The reason it has false positives is that the all-pinned logic sits deep
    in the lowest-level function, can_migrate_task(), and is too low level:

    load_balance_fair() iterates over each task group and calls balance_tasks()
    to migrate the target load. Along the way, balance_tasks() also sets an
    all_pinned variable. Given that task groups are iterated, this all_pinned
    variable ends up holding only the status of the last group scanned.
    A task group can fail to migrate any load for a number of reasons, none of
    them due to CPU affinity. However, this status bit is propagated back up
    to the higher-level load_balance(), which incorrectly concludes that no
    tasks could be moved. It then kicks off the all-pinned logic and starts
    multiple passes attempting to move load onto the puller CPU.

    To fix this, move the all_pinned aggregation up to the iterator level.
    This ensures that the status is aggregated over all task groups, not just
    the last one in the list (a sketch of this aggregation pattern follows
    the last commit in this list).

    Signed-off-by: Ken Chen
    Cc: stable@kernel.org
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/BANLkTi=ernzNawaR5tJZEsV_QVnfxqXmsQ@mail.gmail.com
    Signed-off-by: Ingo Molnar

    Ken Chen
     
  • In function find_busiest_group(), the sched-domain avg_load isn't
    calculated at all if there is a group imbalance within the domain. This
    will cause an erroneous imbalance calculation.

    The reason is that calculate_imbalance() sees sds->avg_load = 0 and
    dumps the entire sds->max_load into the imbalance variable, which is
    later used to migrate the entire load from the busiest CPU to the puller
    CPU.

    This has two really bad effects:

    1. a stampede of task migrations, which cannot break out of the bad
    state because of a positive feedback loop: large load delta -> heavier
    load migration -> larger imbalance, and the cycle goes on.

    2. severe imbalance in CPU queue depth. This causes really long
    scheduling latency blips, which badly affect applications with tight
    latency requirements.

    The fix is to have the kernel calculate the domain avg_load in both
    cases. This ensures that the imbalance calculation is always sensible
    and the target is usually halfway between the busiest and the puller CPU.

    Signed-off-by: Ken Chen
    Signed-off-by: Peter Zijlstra
    Cc:
    Link: http://lkml.kernel.org/r/20110408002322.3A0D812217F@elm.corp.google.com
    Signed-off-by: Ingo Molnar

    Ken Chen
     
  • There is a bug in perf_event_enable_on_exec() when cgroup events are
    active on a CPU: the cgroup events may be scheduled twice causing event
    state corruptions which eventually may lead to kernel panics.

    The reason is that the function needs to first schedule out the cgroup
    events, just like for the per-thread events. The cgroup events are
    scheduled back in automatically from the perf_event_context_sched_in()
    function.

    The patch also adds a WARN_ON_ONCE() in perf_cgroup_switch() to catch any
    bogus state.

    Signed-off-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110406005454.GA1062@quad
    Signed-off-by: Ingo Molnar

    Stephane Eranian
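
    Returning to the all_pinned fix above, a minimal sketch of the
    aggregation pattern it describes (the group array and names are made up
    for illustration, not the scheduler's data structures): the pinned
    status has to be combined across every task group at the iterator level,
    rather than overwritten so that only the last group's status survives.

        #include <stdio.h>

        struct group {
                int pinned;     /* 1 = nothing movable due to CPU affinity */
        };

        static int scan_groups(const struct group *g, int n)
        {
                int all_pinned = 1;

                for (int i = 0; i < n; i++) {
                        /* Aggregate across groups.  The buggy pattern was
                         * effectively "all_pinned = g[i].pinned;", so only
                         * the last group's status was reported upward. */
                        all_pinned &= g[i].pinned;
                }
                return all_pinned;
        }

        int main(void)
        {
                struct group groups[] = { { 0 }, { 0 }, { 1 } };

                printf("all_pinned = %d (0: not every group is pinned)\n",
                       scan_groups(groups, 3));
                return 0;
        }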
     

08 Apr, 2011

2 commits

  • Merge branches 'x86-fixes-for-linus', 'sched-fixes-for-linus', 'timers-fixes-for-linus', 'irq-fixes-for-linus' and 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86-32, fpu: Fix FPU exception handling on non-SSE systems
    x86, hibernate: Initialize mmu_cr4_features during boot
    x86-32, NUMA: Fix ACPI NUMA init broken by recent x86-64 change
    x86: visws: Fixup irq overhaul fallout

    * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    sched: Clean up rebalance_domains() load-balance interval calculation

    * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86/mrst/vrtc: Fix boot crash in mrst_rtc_init()
    rtc, x86/mrst/vrtc: Fix boot crash in rtc_read_alarm()

    * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    genirq: Fix cpumask leak in __setup_irq()

    * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    perf probe: Fix listing incorrect line number with inline function
    perf probe: Fix to find recursively inlined function
    perf probe: Fix multiple --vars options behavior
    perf probe: Fix to remove redundant close
    perf probe: Fix to ensure function declared file

    Linus Torvalds
     
  • * 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6:
    Fix common misspellings

    Linus Torvalds
     

04 Apr, 2011

3 commits

  • Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    sched: Fix rebalance interval calculation
    sched, doc: Beef up load balancing description
    sched: Leave sched_setscheduler() earlier if possible, do not disturb SCHED_FIFO tasks

    Linus Torvalds
     
  • Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    perf: Fix task_struct reference leak
    perf: Fix task context scheduling
    perf: mmap 512 kiB by default
    perf: Rebase max unprivileged mlock threshold on top of page size
    perf tools: Fix NO_NEWT=1 python build error
    perf symbols: Properly align symbol_conf.priv_size
    perf tools: Emit clearer message for sys_perf_event_open ENOENT return
    perf tools: Fixup exit path when not able to open events
    perf symbols: Fix vsyscall symbol lookup
    oprofile, x86: Allow setting EDGE/INV/CMASK for counter events

    Linus Torvalds
     
  • The ADJ_SETOFFSET bit added in commit 094aa188 ("ntp: Add ADJ_SETOFFSET
    mode bit") also introduced a way for any user to change the system time.
    Sneaky or buggy calls to adjtimex() could set

    ADJ_OFFSET_SS_READ | ADJ_SETOFFSET

    which would result in a successful call to timekeeping_inject_offset().
    This patch fixes the issue by adding the capability check.

    Signed-off-by: Richard Cochran
    Signed-off-by: Linus Torvalds

    Richard Cochran
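
    A small userspace demonstration of the hole being closed; this is a test
    program, not the kernel patch, and it assumes ADJ_SETOFFSET and
    ADJ_OFFSET_SS_READ are exposed by <sys/timex.h>. Run it as an
    unprivileged user: after the fix the call fails with EPERM instead of
    stepping the system clock (as root it really will step the clock by one
    second, so be careful).

        #include <errno.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/timex.h>

        int main(void)
        {
                struct timex tx;

                memset(&tx, 0, sizeof(tx));
                tx.modes = ADJ_OFFSET_SS_READ | ADJ_SETOFFSET;
                tx.time.tv_sec = 1;     /* ask to step the clock forward 1s */
                tx.time.tv_usec = 0;

                if (adjtimex(&tx) < 0)
                        printf("adjtimex: %s (EPERM expected without "
                               "CAP_SYS_TIME)\n", strerror(errno));
                else
                        printf("adjtimex succeeded: the clock was stepped\n");
                return 0;
        }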
     

01 Apr, 2011

1 commit

  • On ppc64 the crashkernel region almost always overlaps an area of firmware.
    This works fine except when using the sysfs interface to reduce the kdump
    region. If we free the firmware area we are guaranteed to crash.

    Rename free_reserved_phys_range to crash_free_reserved_phys_range and make
    it a weak function so we can override it.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
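
    A minimal sketch of the weak-function pattern mentioned above, using
    GCC's weak attribute in a standalone program; the function name is taken
    from the commit, the bodies are illustrative only. The generic
    definition is marked weak, and an architecture such as ppc64 can link in
    its own non-weak definition that skips ranges overlapping firmware.

        #include <stdio.h>

        /* Generic default: free the whole reserved region. */
        __attribute__((weak))
        void crash_free_reserved_phys_range(unsigned long begin,
                                            unsigned long end)
        {
                printf("generic: freeing %#lx-%#lx\n", begin, end);
        }

        /* An arch override would be a non-weak definition of the same
         * function in another object file; the linker then picks it over
         * the weak default above. */

        int main(void)
        {
                crash_free_reserved_phys_range(0x1000000UL, 0x2000000UL);
                return 0;
        }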
     

31 Mar, 2011

5 commits

  • Fixes generated by 'codespell' and manually reviewed.

    Signed-off-by: Lucas De Marchi

    Lucas De Marchi
     
  • sys_perf_event_open() had an imbalance in the number of task refs it
    took, causing a memory leak.

    Cc: Jiri Olsa
    Cc: Oleg Nesterov
    Cc: stable@kernel.org # .37+
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Ensure we allow 512 kiB + 1 page for user control without
    assuming a 4096-byte page size.

    Reported-by: Peter Zijlstra
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Paul Mackerras
    Cc: Stephane Eranian
    Cc:
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • The interval for checking whether scheduling domains are due to be
    balanced currently depends on NR_CPUS, which may not accurately reflect
    the number of online CPUs at the time of the check.

    Thus replace NR_CPUS with num_online_cpus().

    (ed: Should only affect those who set NR_CPUS really high, such as 4096
    or so :-)

    Signed-off-by: Sisir Koppaka
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Sisir Koppaka
     
  • sched_setscheduler() (in sched.c) is called in order to change the
    scheduling policy and/or the real-time priority of a task. Thus, if we
    find out that neither of those is actually being modified, it is possible
    to return early and save the overhead of a full deactivate+activate cycle
    of the task in question.

    Besides that, if we have more than one SCHED_FIFO task with the same
    priority on the same rq (which means they share the same priority queue),
    having one of them change its position in the priority queue because of a
    sched_setscheduler() call (as happens by means of the deactivate+activate)
    that does not actually change the priority violates POSIX, which states,
    for SCHED_FIFO:

    "If a thread whose policy or priority has been modified by
    pthread_setschedprio() is a running thread or is runnable, the effect on
    its position in the thread list depends on the direction of the
    modification, as follows: a. b. If the priority is unchanged, the
    thread does not change position in the thread list. c. "

    http://pubs.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_08.html

    (ed: And the POSIX specification here does, briefly and somewhat unexpectedly,
    match what common sense tells us as well. )

    Signed-off-by: Dario Faggioli
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Dario Faggioli
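
    A condensed sketch of the early-return test described above, with plain
    C stand-ins for the task fields rather than the scheduler's actual code:
    if neither the policy nor, for a real-time policy, the priority changes,
    the call can return before any dequeue/enqueue happens.

        #include <stdio.h>

        #define SCHED_NORMAL  0
        #define SCHED_FIFO    1
        #define SCHED_RR      2

        struct task {
                int policy;
                int rt_priority;
        };

        static int rt_policy(int policy)
        {
                return policy == SCHED_FIFO || policy == SCHED_RR;
        }

        /* Returns 1 when sched_setscheduler() could bail out early because
         * nothing would actually change for this task. */
        static int no_change(const struct task *p, int policy, int prio)
        {
                return policy == p->policy &&
                       (!rt_policy(policy) || prio == p->rt_priority);
        }

        int main(void)
        {
                struct task t = { SCHED_FIFO, 50 };

                printf("same policy and priority -> early return: %d\n",
                       no_change(&t, SCHED_FIFO, 50));
                printf("changed priority         -> full path:    %d\n",
                       no_change(&t, SCHED_FIFO, 60));
                return 0;
        }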
     

29 Mar, 2011

5 commits

  • All users converted to new interface.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • The only subtle difference is that alpha uses ACTUAL_NR_IRQS and
    prints the IRQF_DISABLED flag.

    Change the generic implementation to deal with ACTUAL_NR_IRQS if
    defined.

    The IRQF_DISABLED printing is pointless, as we nowadays run all
    interrupts with irqs disabled.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • The late-night fixup failed to convert the data type from irq_desc to
    irq_data, which results in a harmless but annoying warning.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Merge branches 'irq-cleanup-for-linus' and 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'irq-cleanup-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    vlynq: Convert irq functions

    * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    genirq; Fix cleanup fallout
    genirq: Fix typo and remove unused variable
    genirq: Fix new kernel-doc warnings
    genirq: Add setter for AFFINITY_SET in irq_data state
    genirq: Provide setter inline for IRQD_IRQ_INPROGRESS
    genirq: Remove handle_IRQ_event
    arm: Ns9xxx: Remove private irq flow handler
    powerpc: cell: Use the core flow handler
    genirq: Provide edge_eoi flow handler
    genirq: Move INPROGRESS, MASKED and DISABLED state flags to irq_data
    genirq: Split irq_set_affinity() so it can be called with lock held.
    genirq: Add chip flag for restricting cpu_on/offline calls
    genirq: Add chip hooks for taking CPUs on/off line.
    genirq: Add irq disabled flag to irq_data state
    genirq: Reserve the irq when calling irq_set_chip()

    Linus Torvalds