29 Dec, 2010

1 commit


25 Dec, 2010

1 commit


24 Dec, 2010

1 commit

  • Fix two related problems in the event-copying loop of
    ring_buffer_read_page.

    The loop condition for copying events is off-by-one.
    "len" is the remaining space in the caller-supplied page.
    "size" is the size of the next event (or two events).
    If len == size, then there is just enough space for the next event.

    size was set to rb_event_ts_length, which may include the size of two
    events if the first event is a time-extend, in order to ensure
    time-extends are kept together with the event after it. However,
    rb_advance_reader always advances by one event. This would result in the
    event after any time-extend being duplicated. Instead, get the size of
    a single event for the memcpy, but use rb_event_ts_length for the loop
    condition.
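    The fix can be sketched in userspace C; the event layout and helper
    names below are simplified stand-ins for the kernel code, and the
    memcpy is modeled as size accounting:

    ```c
    #include <assert.h>

    #define TIME_EXTEND 1  /* stand-in marker for a time-extend event */

    struct event { int type; int len; };

    /* size used for the LOOP CONDITION: a time-extend is kept together
     * with the event that follows it (cf. rb_event_ts_length) */
    static int event_ts_length(const struct event *e)
    {
            if (e->type == TIME_EXTEND)
                    return e->len + (e + 1)->len;
            return e->len;
    }

    /* size used for the COPY: one event only, since the reader
     * advances by a single event (cf. rb_advance_reader) */
    static int event_length(const struct event *e)
    {
            return e->len;
    }

    int main(void)
    {
            struct event events[] = { { TIME_EXTEND, 8 }, { 0, 16 }, { 0, 16 } };
            int len = 40;           /* remaining space in the caller page */
            int copied = 0, i = 0;

            /* "<=" rather than "<": len == size means just enough room */
            while (i < 3 && event_ts_length(&events[i]) <= len) {
                    int sz = event_length(&events[i]);  /* copy ONE event */
                    copied += sz;
                    len -= sz;
                    i++;                                /* advance by one */
            }

            assert(copied == 8 + 16 + 16);  /* all events, none duplicated */
            return 0;
    }
    ```

    Using event_ts_length for the copy as well would copy 24 bytes on the
    first iteration and then re-copy the 16-byte event that follows the
    time-extend, which is exactly the duplication the patch removes.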

    Signed-off-by: David Sharp
    LKML-Reference:
    LKML-Reference:
    Signed-off-by: Steven Rostedt

    David Sharp
     

23 Dec, 2010

1 commit

  • The taskstats structure is internally aligned on 8 byte boundaries but the
    layout of the aggregate reply, with two NLA headers and the pid (each 4
    bytes), actually forces the entire structure to be unaligned. This causes
    the kernel to issue unaligned access warnings on some architectures like
    ia64. Unfortunately, some software out there doesn't properly unroll the
    NLA packet and assumes that the start of the taskstats structure will
    always be 20 bytes from the start of the netlink payload. Aligning the
    start of the taskstats structure breaks this software, which we don't
    want. So, for now the alignment only happens on architectures that
    require it and those users will have to update to fixed versions of those
    packages. Space is reserved in the packet only when needed. This ifdef
    should be removed in several years, e.g. in 2012, once we can be confident
    that fixed versions are installed on most systems. We add the padding
    before the aggregate since the aggregate is already a defined type.

    Commit 85893120 ("delayacct: align to 8 byte boundary on 64-bit systems")
    previously addressed the alignment issues by padding out the pid field.
    This was supposed to be a compatible change but the circumstances
    described above mean that it wasn't. This patch backs out that change,
    since it was a hack, and introduces a new NULL attribute type to provide
    the padding. Padding the response with 4 bytes avoids allocating an
    aligned taskstats structure and copying it back. Since the structure
    weighs in at 328 bytes, it's too big to do it on the stack.
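    The arithmetic behind the padding attribute can be checked directly; the
    offsets follow the commit text (taskstats historically starts 20 bytes
    into the netlink payload):

    ```c
    #include <assert.h>

    int main(void)
    {
            int off = 20;                 /* historical start of taskstats */
            int pad = (8 - off % 8) % 8;  /* bytes to the next 8-byte boundary */

            assert(pad == 4);             /* size of the NULL padding attribute */
            assert((off + pad) % 8 == 0); /* aligned start once padded */
            return 0;
    }
    ```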

    Signed-off-by: Jeff Mahoney
    Reported-by: Brian Rogers
    Cc: Jeff Mahoney
    Cc: Guillaume Chazarain
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Mahoney
     

22 Dec, 2010

1 commit

  • The spinlock in kthread_worker and the wait_queue_head in kthread_work
    should both be lockdep-sensible, so change the interface to make it
    suitable for CONFIG_LOCKDEP.

    tj: comment update

    Reported-by: Nicolas
    Signed-off-by: Yong Zhang
    Signed-off-by: Andy Walls
    Tested-by: Andy Walls
    Cc: Tejun Heo
    Cc: Andrew Morton
    Signed-off-by: Tejun Heo

    Yong Zhang
     

21 Dec, 2010

1 commit


20 Dec, 2010

3 commits

  • Linus reported that the new warning introduced by commit f26f9aff6aaf
    "Sched: fix skip_clock_update optimization" triggers. The need_resched
    flag can be set by other CPUs asynchronously so this debug check is
    bogus - remove it.

    Reported-by: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Mike Galbraith
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • …nel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86-32: Make sure we can map all of lowmem if we need to
    x86, vt-d: Handle previous faults after enabling fault handling
    x86: Enable the intr-remap fault handling after local APIC setup
    x86, vt-d: Fix the vt-d fault handling irq migration in the x2apic mode
    x86, vt-d: Quirk for masking vtd spec errors to platform error handling logic
    x86, xsave: Use alloc_bootmem_align() instead of alloc_bootmem()
    bootmem: Add alloc_bootmem_align()
    x86, gcc-4.6: Use gcc -m options when building vdso
    x86: HPET: Chose a paranoid safe value for the ETIME check
    x86: io_apic: Avoid unused variable warning when CONFIG_GENERIC_PENDING_IRQ=n

    * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    perf: Fix off by one in perf_swevent_init()
    perf: Fix duplicate events with multiple-pmu vs software events
    ftrace: Have recordmcount honor endianness in fn_ELF_R_INFO
    scripts/tags.sh: Add magic for trace-events
    tracing: Fix panic when lseek() called on "trace" opened for writing

    Linus Torvalds
     
  • …l/git/tip/linux-2.6-tip

    * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    sched: Fix the irqtime code for 32bit
    sched: Fix the irqtime code to deal with u64 wraps
    nohz: Fix get_next_timer_interrupt() vs cpu hotplug
    Sched: fix skip_clock_update optimization
    sched: Cure more NO_HZ load average woes

    Linus Torvalds
     

19 Dec, 2010

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6:
    x86: avoid high BIOS area when allocating address space
    x86: avoid E820 regions when allocating address space
    x86: avoid low BIOS area when allocating address space
    resources: add arch hook for preventing allocation in reserved areas
    Revert "resources: support allocating space within a region from the top down"
    Revert "PCI: allocate bus resources from the top down"
    Revert "x86/PCI: allocate space from the end of a region, not the beginning"
    Revert "x86: allocate space within a region top-down"
    Revert "PCI: fix pci_bus_alloc_resource() hang, prefer positive decode"
    PCI: Update MCP55 quirk to not affect non HyperTransport variants

    Linus Torvalds
     

18 Dec, 2010

2 commits


17 Dec, 2010

2 commits

  • Commit 3624eb0 (PM / Hibernate: Modify signature used to mark swap)
    attempted to modify hibernate signature used to mark swap partitions
    containing hibernation images, so that old kernels don't try to
    handle compressed images. However, this change broke resume from
    hibernation on Fedora 14 that apparently doesn't pass the resume=
    argument to the kernel and tries to trigger resume from early user
    space. This doesn't work, because the signature is now different,
    so the old signature has to be restored to avoid the problem.

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=22732 .

    Reported-by: Dr. David Alan Gilbert
    Reported-by: Zhang Rui
    Reported-by: Pascal Chapperon
    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     
  • The user-space hibernation code sends a wrong notification after the image
    restoration because of a thinko in the file flag check. RDONLY
    corresponds to hibernation and WRONLY to restoration, confusingly.
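    The thinko is easy to model; this assumes the usual /dev/snapshot
    convention where a read-only open means "save the image" and a
    write-only open means "restore it" (mode_meaning() is illustrative):

    ```c
    #include <assert.h>
    #include <fcntl.h>

    static const char *mode_meaning(int flags)
    {
            /* O_RDONLY selects hibernation, O_WRONLY selects restoration */
            return (flags & O_ACCMODE) == O_RDONLY ? "hibernate" : "restore";
    }

    int main(void)
    {
            assert(mode_meaning(O_RDONLY)[0] == 'h');
            assert(mode_meaning(O_WRONLY)[0] == 'r');
            return 0;
    }
    ```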

    Signed-off-by: Takashi Iwai
    Signed-off-by: Rafael J. Wysocki
    Cc: stable@kernel.org

    Takashi Iwai
     

16 Dec, 2010

4 commits

  • …rostedt/linux-2.6-trace into perf/urgent

    Ingo Molnar
     
  • Since the irqtime accounting is using non-atomic u64 and can be read
    from remote cpus (writes are strictly cpu local, reads are not) we
    have to deal with observing partial updates.

    When we do observe partial updates the clock movement (in particular,
    ->clock_task movement) will go funny (in either direction), a
    subsequent clock update (observing the full update) will make it go
    funny in the oposite direction.

    Since we rely on these clocks to be strictly monotonic we cannot
    suffer backwards motion. One possible solution would be to simply
    ignore all backwards deltas, but that will lead to accounting
    artefacts, most notably clock_task + irq_time != clock; this
    inaccuracy would end up in user-visible stats.

    Therefore serialize the reads using a seqcount.
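    The seqcount protocol can be sketched in userspace; this shows only the
    shape of the retry loop (no memory barriers or SMP safety), with
    stand-in variables for the irqtime counters:

    ```c
    #include <assert.h>
    #include <stdint.h>

    static unsigned seq;                    /* odd while a write is in flight */
    static uint64_t irq_time_a, irq_time_b; /* the non-atomic u64 pair */

    static void write_irqtime(uint64_t a, uint64_t b)
    {
            seq++;                  /* odd: partial update may be visible */
            irq_time_a = a;
            irq_time_b = b;
            seq++;                  /* even: update complete */
    }

    static uint64_t read_irqtime_sum(void)
    {
            unsigned start;
            uint64_t sum;

            do {                    /* retry on odd or changed sequence */
                    start = seq;
                    sum = irq_time_a + irq_time_b;
            } while ((start & 1) || seq != start);

            return sum;
    }

    int main(void)
    {
            write_irqtime(100, 200);
            assert(read_irqtime_sum() == 300);
            return 0;
    }
    ```

    A reader that raced with write_irqtime() would see an odd or changed
    sequence and retry, never returning a sum built from a partial update.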

    Reviewed-by: Venkatesh Pallipadi
    Reported-by: Mikael Pettersson
    Tested-by: Mikael Pettersson
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Some ARM systems have a short sched_clock() [ which needs to be fixed
    too ], but this exposed a bug in the irq_time code as well: it doesn't
    deal with wraps at all.

    Fix the irq_time code to deal with u64 wraps by re-writing the code to
    only use delta increments, which avoids the whole issue.
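    The delta trick relies on a property of unsigned arithmetic that a short
    program can demonstrate:

    ```c
    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
            uint64_t old_clock = UINT64_MAX - 5;  /* just before the wrap */
            uint64_t new_clock = 10;              /* just after the wrap */

            /* unsigned subtraction is modulo 2^64, so the delta is still
             * correct even though new_clock < old_clock numerically */
            uint64_t delta = new_clock - old_clock;
            assert(delta == 16);
            return 0;
    }
    ```

    Accumulating such deltas sidesteps any comparison of absolute clock
    values, so a wrap never produces a bogus huge (or negative) interval.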

    Reviewed-by: Venkatesh Pallipadi
    Reported-by: Mikael Pettersson
    Tested-by: Mikael Pettersson
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The perf_swevent_enabled[] array has PERF_COUNT_SW_MAX elements.
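    In other words, valid indices run from 0 to PERF_COUNT_SW_MAX - 1, so
    the bounds check must reject event_id == PERF_COUNT_SW_MAX. A sketch
    (the constant's value here is illustrative):

    ```c
    #include <assert.h>

    #define PERF_COUNT_SW_MAX 9   /* illustrative */

    static int swevent_init_check(int event_id)
    {
            if (event_id >= PERF_COUNT_SW_MAX)  /* fixed: ">=", not ">" */
                    return -1;
            return 0;
    }

    int main(void)
    {
            assert(swevent_init_check(PERF_COUNT_SW_MAX - 1) == 0);
            /* ">" would have let this index one element past the array */
            assert(swevent_init_check(PERF_COUNT_SW_MAX) == -1);
            return 0;
    }
    ```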

    Signed-off-by: Dan Carpenter
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Dan Carpenter
     

15 Dec, 2010

1 commit


14 Dec, 2010

1 commit

  • Running the annotate branch profiler on three boxes, including my
    main box that runs firefox, evolution, xchat, and is part of the distcc farm,
    showed this with the likelys in the workqueue code:

    correct  incorrect   %  Function             File         Line
    -------  ---------  --  -------------------  -----------  ----
         96     996253  99  wq_worker_sleeping   workqueue.c   703
         96     996247  99  wq_worker_waking_up  workqueue.c   677

    The likely()s in this case were assuming that WORKER_NOT_RUNNING will
    most likely be false. But this is not the case. The reason is
    (and shown by adding trace_printks and testing it) that most of the time
    WORKER_PREP is set.

    In worker_thread() we have:

    worker_clr_flags(worker, WORKER_PREP);

    [ do work stuff ]

    worker_set_flags(worker, WORKER_PREP, false);

    (that 'false' means not to wake up an idle worker)

    The wq_worker_sleeping() is called from schedule when a worker thread
    is putting itself to sleep. Which happens most of the time outside
    of that [ do work stuff ].

    The wq_worker_waking_up is called by the wakeup worker code, which
    is also called outside that [ do work stuff ].

    Thus, the likely and unlikely used by those two functions are actually
    backwards.

    Remove the annotation and let gcc figure it out.

    Acked-by: Tejun Heo
    Signed-off-by: Steven Rostedt
    Signed-off-by: Tejun Heo

    Steven Rostedt
     

09 Dec, 2010

4 commits

  • This fixes a bug as seen on 2.6.32 based kernels where timers got
    enqueued on offline cpus.

    If a cpu goes offline it might still have pending timers. These will
    be migrated during CPU_DEAD handling after the cpu is offline.
    However while the cpu is going offline it will schedule the idle task
    which will then call tick_nohz_stop_sched_tick().

    That function in turn will call get_next_timer_interrupt() to figure
    out if the tick of the cpu can be stopped or not. If it turns out that
    the next tick is just one jiffy off (delta_jiffies == 1)
    tick_nohz_stop_sched_tick() incorrectly assumes that the tick should
    not stop and takes an early exit and thus it won't update the load
    balancer cpu.

    Just afterwards the cpu will be killed and the load balancer cpu could
    be the offline cpu.

    On 2.6.32 based kernel get_nohz_load_balancer() gets called to decide
    on which cpu a timer should be enqueued (see __mod_timer()). Which
    leads to the possibility that timers get enqueued on an offline cpu.
    These will never expire and can cause a system hang.

    This has been observed on 2.6.32 kernels. On current kernels
    __mod_timer() uses get_nohz_timer_target() which doesn't have that
    problem. However there might be other problems because of the too
    early exit from tick_nohz_stop_sched_tick() in case a cpu goes offline.

    The easiest and probably safest fix seems to be to let
    get_next_timer_interrupt() just lie and let it say there isn't any
    pending timer if the current cpu is offline.

    I also thought of moving migrate_[hr]timers() from CPU_DEAD to
    CPU_DYING, but seeing that there already have been fixes at least in
    the hrtimer code in this area I'm afraid that this could add new
    subtle bugs.
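    The "lie" can be sketched with userspace stand-ins for the kernel
    helpers (cpu_offline and the NEXT_TIMER_MAX_DELTA value here are
    illustrative):

    ```c
    #include <assert.h>

    #define NEXT_TIMER_MAX_DELTA ((1UL << 30) - 1)  /* illustrative cap */

    static int cpu_offline;
    static unsigned long next_delta = 1;  /* next timer one jiffy away */

    static unsigned long get_next_timer_interrupt(unsigned long now)
    {
            if (cpu_offline)   /* the fix: claim there is no pending timer */
                    return now + NEXT_TIMER_MAX_DELTA;
            return now + next_delta;
    }

    int main(void)
    {
            unsigned long now = 1000;

            cpu_offline = 0;
            assert(get_next_timer_interrupt(now) == now + 1);

            cpu_offline = 1;   /* going down: delta_jiffies is no longer 1 */
            assert(get_next_timer_interrupt(now) == now + NEXT_TIMER_MAX_DELTA);
            return 0;
    }
    ```

    With the large delta reported, tick_nohz_stop_sched_tick() no longer
    takes the delta_jiffies == 1 early exit on a dying cpu, so the load
    balancer cpu gets updated before the cpu is killed.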

    Signed-off-by: Heiko Carstens
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Cc: stable@kernel.org
    Signed-off-by: Ingo Molnar

    Heiko Carstens
     
  • idle_balance() drops/retakes rq->lock, leaving the previous task
    vulnerable to set_tsk_need_resched(). Clear it after we return
    from balancing instead, and in setup_thread_stack() as well, so
    no successfully descheduled or never scheduled task has it set.

    Need resched confused the skip_clock_update logic, which assumes
    that the next call to update_rq_clock() will come nearly immediately
    after being set. Make the optimization robust against the case of
    waking a sleeper before it successfully deschedules by checking that
    the current task has not been dequeued before setting the flag,
    since it is that useless clock update we're trying to save, and
    clear it unconditionally in schedule() proper instead of conditionally
    in put_prev_task().

    Signed-off-by: Mike Galbraith
    Reported-by: Bjoern B. Brandenburg
    Tested-by: Yong Zhang
    Signed-off-by: Peter Zijlstra
    Cc: stable@kernel.org
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • There's a long-running regression that proved difficult to fix and
    which is hitting certain people and is rather annoying in its effects.

    Damien reported that after 74f5187ac8 (sched: Cure load average vs
    NO_HZ woes) his load average is unnaturally high, he also noted that
    even with that patch reverted the load average numbers are not
    correct.

    The problem is that the previous patch only solved half the NO_HZ
    problem, it addressed the part of going into NO_HZ mode, not of
    coming out of NO_HZ mode. This patch implements that missing half.

    When coming out of NO_HZ mode there are two important things to take
    care of:

    - Folding the pending idle delta into the global active count.
    - Correctly aging the averages for the idle-duration.

    So with this patch the NO_HZ interaction should be complete and
    behaviour between CONFIG_NO_HZ=[yn] should be equivalent.

    Furthermore, this patch slightly changes the load average computation
    by adding a rounding term to the fixed point multiplication.
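    The rounding term amounts to adding half of the fixed-point one before
    the down-shift; a sketch using the kernel's 11-bit fixed point:

    ```c
    #include <assert.h>

    #define FSHIFT  11
    #define FIXED_1 (1UL << FSHIFT)

    /* truncating version (old behaviour) */
    static unsigned long calc_load_trunc(unsigned long load, unsigned long exp,
                                         unsigned long active)
    {
            return (load * exp + active * (FIXED_1 - exp)) >> FSHIFT;
    }

    /* rounding version: add FIXED_1/2 before shifting down */
    static unsigned long calc_load(unsigned long load, unsigned long exp,
                                   unsigned long active)
    {
            unsigned long newload = load * exp + active * (FIXED_1 - exp);
            newload += 1UL << (FSHIFT - 1);  /* the added rounding term */
            return newload >> FSHIFT;
    }

    int main(void)
    {
            /* a tiny residual load decaying with exp just below FIXED_1 */
            assert(calc_load_trunc(1, FIXED_1 - 1, 0) == 0); /* truncated away */
            assert(calc_load(1, FIXED_1 - 1, 0) == 1);       /* kept by rounding */
            return 0;
    }
    ```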

    Reported-by: Damien Wyart
    Reported-by: Tim McGrath
    Tested-by: Damien Wyart
    Tested-by: Orion Poplawski
    Tested-by: Kyle McMartin
    Signed-off-by: Peter Zijlstra
    Cc: stable@kernel.org
    Cc: Chase Douglas
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Because the multi-pmu bits can share contexts between struct pmu
    instances we could get duplicate events by iterating the pmu list.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Thomas Gleixner
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

08 Dec, 2010

2 commits

  • …r-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86/pvclock: Zero last_value on resume

    * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    perf record: Fix eternal wait for stillborn child
    perf header: Don't assume there's no attr info if no sample ids is provided
    perf symbols: Figure out start address of kernel map from kallsyms
    perf symbols: Fix kallsyms kernel/module map splitting

    * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    nohz: Fix printk_needs_cpu() return value on offline cpus
    printk: Fix wake_up_klogd() vs cpu hotplug

    Linus Torvalds
     
  • …git/tip/linux-2.6-tip

    * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    genirq: Fix incorrect proc spurious output

    Linus Torvalds
     

07 Dec, 2010

3 commits

  • * 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6:
    PM / Hibernate: Fix memory corruption related to swap
    PM / Hibernate: Use async I/O when reading compressed hibernation image

    Linus Torvalds
     
  • There is a problem that swap pages allocated before the creation of
    a hibernation image can be released and used for storing the contents
    of different memory pages while the image is being saved. Since the
    kernel stored in the image doesn't know of that, it causes memory
    corruption to occur after resume from hibernation, especially on
    systems with relatively small RAM that need to swap often.

    This issue can be addressed by keeping the GFP_IOFS bits clear
    in gfp_allowed_mask during the entire hibernation, including the
    saving of the image, until the system is finally turned off or
    the hibernation is aborted. Unfortunately, for this purpose
    it's necessary to rework the way in which the hibernate and
    suspend code manipulates gfp_allowed_mask.

    This change is based on an earlier patch from Hugh Dickins.
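    A sketch of the mask manipulation; the helper names follow the commit,
    and the flag values are the conventional GFP bit assignments:

    ```c
    #include <assert.h>

    typedef unsigned int gfp_t;
    #define __GFP_IO 0x40u
    #define __GFP_FS 0x80u
    #define GFP_IOFS (__GFP_IO | __GFP_FS)

    static gfp_t gfp_allowed_mask = 0xffu;   /* illustrative starting mask */
    static gfp_t saved_gfp_mask;

    static void pm_restrict_gfp_mask(void)
    {
            saved_gfp_mask = gfp_allowed_mask;
            gfp_allowed_mask &= ~GFP_IOFS;   /* no I/O- or FS-backed allocs */
    }

    static void pm_restore_gfp_mask(void)   /* on power-off or abort only */
    {
            gfp_allowed_mask = saved_gfp_mask;
    }

    int main(void)
    {
            pm_restrict_gfp_mask();
            assert((gfp_allowed_mask & GFP_IOFS) == 0);
            pm_restore_gfp_mask();
            assert(gfp_allowed_mask == 0xffu);
            return 0;
    }
    ```

    Keeping the bits clear across the whole of hibernation (not just around
    suspend) is what prevents swap pages backing the image from being
    reallocated while the image is written out.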

    Signed-off-by: Rafael J. Wysocki
    Reported-by: Ondrej Zary
    Acked-by: Hugh Dickins
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: stable@kernel.org

    Rafael J. Wysocki
     
  • This is a fix for reading LZO compressed image using async I/O.
    Essentially, instead of having just one page into which we keep
    reading blocks from swap, we allocate enough of them to cover the
    largest compressed size and then let block I/O pick them all up. Once
    we have them all (and here we wait), we decompress them, as usual.
    Obviously, the very first block we still pick up synchronously,
    because we need to know the size of the lot before we pick up the
    rest.

    Also fixed the copyright line, which I had forgotten before.

    Signed-off-by: Bojan Smojver
    Signed-off-by: Rafael J. Wysocki

    Bojan Smojver
     

03 Dec, 2010

1 commit

  • If a user manages to trigger an oops with fs set to KERNEL_DS, fs is not
    otherwise reset before do_exit(). do_exit may later (via mm_release in
    fork.c) do a put_user to a user-controlled address, potentially allowing
    a user to leverage an oops into a controlled write into kernel memory.

    This is only triggerable in the presence of another bug, but this
    potentially turns a lot of DoS bugs into privilege escalations, so it's
    worth fixing. I have proof-of-concept code which uses this bug along
    with CVE-2010-3849 to write a zero to an arbitrary kernel address, so
    I've tested that this is not theoretical.

    A more logical place to put this fix might be when we know an oops has
    occurred, before we call do_exit(), but that would involve changing
    every architecture, in multiple places.

    Let's just stick it in do_exit instead.
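    The effect of the fix can be modeled in userspace; fs, USER_DS and
    KERNEL_DS below are stand-ins for the real address-limit machinery:

    ```c
    #include <assert.h>

    enum addr_limit { USER_DS, KERNEL_DS };

    static enum addr_limit fs = KERNEL_DS;  /* left set by the oops path */

    /* stand-in for put_user(): can reach kernel memory iff fs == KERNEL_DS */
    static int put_user_can_hit_kernel(void)
    {
            return fs == KERNEL_DS;
    }

    static void do_exit_sketch(void)
    {
            fs = USER_DS;   /* the fix: reset the limit first thing */
            /* ... mm_release() may put_user() to a user-controlled address ... */
    }

    int main(void)
    {
            assert(put_user_can_hit_kernel());    /* exploitable state */
            do_exit_sketch();
            assert(!put_user_can_hit_kernel());   /* escalation closed */
            return 0;
    }
    ```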

    [akpm@linux-foundation.org: update code comment]
    Signed-off-by: Nelson Elhage
    Cc: KOSAKI Motohiro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nelson Elhage
     

01 Dec, 2010

2 commits

  • Since commit a1afb637 (switch /proc/irq/*/spurious to seq_file) all
    /proc/irq/XX/spurious files show the information of irq 0.

    Currently irq_spurious_proc_open() passes NULL as the 3rd argument,
    which is used as the IRQ number in irq_spurious_proc_show(), to
    single_open(). Because of this, every /proc/irq/XX/spurious file
    shows IRQ 0 information regardless of the IRQ number.

    To fix the problem, irq_spurious_proc_open() must pass the
    appropriate data (the IRQ number) to single_open().
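    The single_open() contract at issue, sketched with stand-in types: the
    third argument comes back to the show() callback as m->private, so
    passing NULL makes every file report IRQ 0:

    ```c
    #include <assert.h>
    #include <stddef.h>

    struct seq_file { void *private; };

    static long shown_irq;

    static int irq_spurious_proc_show(struct seq_file *m, void *v)
    {
            shown_irq = (long)m->private;  /* IRQ number from m->private */
            return 0;
    }

    /* stand-in for single_open(): stash data as m->private, call show() */
    static void single_open_sketch(void *data)
    {
            struct seq_file m = { .private = data };
            irq_spurious_proc_show(&m, NULL);
    }

    int main(void)
    {
            single_open_sketch(NULL);          /* the bug: always IRQ 0 */
            assert(shown_irq == 0);
            single_open_sketch((void *)42L);   /* the fix: pass the IRQ */
            assert(shown_irq == 42);
            return 0;
    }
    ```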

    Signed-off-by: Kenji Kaneshige
    Reviewed-by: Yong Zhang
    LKML-Reference:
    Cc: stable@kernel.org [2.6.33+]
    Signed-off-by: Thomas Gleixner

    Kenji Kaneshige
     
  • The file_ops struct for the "trace" special file defined llseek as seq_lseek().
    However, if the file was opened for writing only, seq_open() was not called,
    and the seek would dereference a null pointer, file->private_data.

    This patch introduces a new wrapper for seq_lseek() which checks if the file
    descriptor is opened for reading first. If not, it does nothing.
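    The wrapper's logic, sketched with stand-in types (the stub dereferences
    private_data the way seq_lseek() ultimately would):

    ```c
    #include <assert.h>
    #include <stddef.h>

    #define FMODE_READ 0x1u

    struct file { unsigned int f_mode; void *private_data; };

    static long seq_lseek_stub(struct file *f)
    {
            return *(long *)f->private_data;  /* NULL here would crash */
    }

    static long tracing_lseek(struct file *f)
    {
            if (f->f_mode & FMODE_READ)       /* seq_open() ran: safe */
                    return seq_lseek_stub(f);
            return 0;                         /* write-only: do nothing */
    }

    int main(void)
    {
            long pos = 7;
            struct file rd = { FMODE_READ, &pos };
            struct file wr = { 0, NULL };     /* opened for writing only */

            assert(tracing_lseek(&rd) == 7);
            assert(tracing_lseek(&wr) == 0);  /* no NULL dereference */
            return 0;
    }
    ```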

    Cc:
    Signed-off-by: Slava Pestov
    LKML-Reference:
    Signed-off-by: Steven Rostedt

    Slava Pestov
     

29 Nov, 2010

1 commit


27 Nov, 2010

3 commits


26 Nov, 2010

4 commits

  • This patch fixes a hang observed with 2.6.32 kernels where timers got enqueued
    on offline cpus.

    printk_needs_cpu() may return 1 if called on offline cpus. When a cpu gets
    offlined it schedules the idle process which, before killing its own cpu, will
    call tick_nohz_stop_sched_tick(). That function in turn will call
    printk_needs_cpu() in order to check if the local tick can be disabled. On
    offline cpus this function should naturally return 0 since, regardless of
    whether the tick gets disabled, the cpu will be dead shortly after. That is
    besides the fact that __cpu_disable() should already have made sure that no
    interrupts on the offlined cpu will be delivered anyway.

    In this case it prevents tick_nohz_stop_sched_tick() from calling
    select_nohz_load_balancer(). No idea if that really is a problem. However what
    made me debug this is that on 2.6.32 the function get_nohz_load_balancer() is
    used within __mod_timer() to select a cpu on which a timer gets enqueued. If
    printk_needs_cpu() returns 1 then the nohz_load_balancer cpu doesn't get
    updated when a cpu gets offlined. It may contain the cpu number of an offline
    cpu. In turn timers get enqueued on an offline cpu and not very surprisingly
    they never expire and cause system hangs.

    This has been observed on 2.6.32 kernels. On current kernels __mod_timer() uses
    get_nohz_timer_target() which doesn't have that problem. However there might be
    other problems because of the too early exit from tick_nohz_stop_sched_tick()
    in case a cpu goes offline.

    The easiest way to fix this is just to test if the current cpu is offline and call
    printk_tick() directly which clears the condition.

    Alternatively I tried a cpu hotplug notifier which would clear the condition,
    however between calling the notifier function and printk_needs_cpu() something
    could have called printk() again and the problem would be back. This seems to
    be the safest fix.
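    The shape of the fix, with userspace stand-ins for the kernel state:

    ```c
    #include <assert.h>

    static int cpu_offline;
    static int printk_pending;

    static void printk_tick(void)
    {
            /* ... flush / wake klogd ... */
            printk_pending = 0;     /* clears the condition */
    }

    static int printk_needs_cpu(void)
    {
            if (cpu_offline)        /* the fix: never keep a dead cpu ticking */
                    printk_tick();
            return printk_pending;
    }

    int main(void)
    {
            printk_pending = 1;
            cpu_offline = 0;
            assert(printk_needs_cpu() == 1);

            cpu_offline = 1;
            assert(printk_needs_cpu() == 0);  /* offline cpu reports 0 */
            return 0;
    }
    ```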

    Signed-off-by: Heiko Carstens
    Signed-off-by: Peter Zijlstra
    Cc: stable@kernel.org
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Heiko Carstens
     
  • wake_up_klogd() may get called from preemptible context but uses
    __raw_get_cpu_var() to write to a per cpu variable. If it gets preempted
    between getting the address and writing to it, the cpu in question could be
    offline if the process gets scheduled back and hence writes to the per cpu data
    of an offline cpu.

    This buggy behaviour was introduced with fa33507a "printk: robustify
    printk, fix #2" which was supposed to fix a "using smp_processor_id() in
    preemptible" warning.

    Let's use this_cpu_write() instead which disables preemption and makes sure
    that the outlined scenario cannot happen.

    Signed-off-by: Heiko Carstens
    Acked-by: Eric Dumazet
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Heiko Carstens
     
  • Stephane noticed that because the perf_sw_event() call is inside the
    perf_event_task_sched_out() call it won't get called unless we
    have a per-task counter.

    Reported-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • It was found that sometimes children of tasks with inherited events had
    one extra event. Eventually it turned out to be due to the list rotation
    not being exclusive with the list iteration in the inheritance code.

    Cure this by temporarily disabling the rotation while we inherit the events.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Cc:
    Signed-off-by: Ingo Molnar

    Thomas Gleixner