18 Oct, 2011

1 commit

  • There's a lock inversion between the cputimer->lock and rq->lock;
    notably the two callchains involved are:

    update_rlimit_cpu()
      sighand->siglock
      set_process_cpu_timer()
        cpu_timer_sample_group()
          thread_group_cputimer()
            cputimer->lock
            thread_group_cputime()
              task_sched_runtime()
                ->pi_lock
                rq->lock

    scheduler_tick()
      rq->lock
      task_tick_fair()
        update_curr()
          account_group_exec()
            cputimer->lock

    Where the first one is enabling a CLOCK_PROCESS_CPUTIME_ID timer, and
    the second one is keeping it up-to-date.

    This problem was introduced by e8abccb7193 ("posix-cpu-timers: Cure
    SMP accounting oddities").

    Cure the problem by removing the cputimer->lock and rq->lock nesting.
    This leaves concurrent enablers doing duplicate work, but the time
    wasted should be on the same order as the time otherwise wasted
    spinning on the lock, and the greater-than assignment filter should
    ensure we preserve monotonicity.

    Reported-by: Dave Jones
    Reported-by: Simon Kirby
    Signed-off-by: Peter Zijlstra
    Cc: stable@kernel.org
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Link: http://lkml.kernel.org/r/1318928713.21167.4.camel@twins
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
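
The "greater-than assignment filter" in the commit above can be pictured with a small user-space sketch. This is not the kernel code: the helper name `update_if_greater` and the `uint64_t` sample type are assumptions made for illustration, and the sketch ignores whatever locking protects the stored value itself.

```c
#include <stdint.h>

/* Hypothetical helper: overwrite *stored only when the new sample is larger. */
static void update_if_greater(uint64_t *stored, uint64_t sample)
{
    if (sample > *stored)
        *stored = sample;
}
```

Because a stored value is only ever replaced by a larger one, two enablers that sample concurrently and both write their result cannot move the stored time backwards, whichever of them writes last.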
     

17 Oct, 2011

1 commit

  • The size is always valid, but variable-length arrays generate worse code
    for no good reason (unless the function happens to be inlined and the
    compiler sees the length for the simple constant it is).

    Also, there seems to be some code generation problem on POWER, where
    Henrik Bakken reports that register r28 can get corrupted under some
    subtle circumstances (interrupt happening at the wrong time?). That all
    indicates some seriously broken compiler issues, but since variable
    length arrays are bad regardless, there's little point in trying to
    chase it down.

    "Just don't do that, then".

    Reported-by: Henrik Grindal Bakken
    Cc: Benjamin Herrenschmidt
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
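
As a rough illustration of the change described above, the sketch below uses a fixed-size scratch buffer where a variable-length array (`char buf[len]`) might otherwise have been written. The function name `copy_release`, the constant `RELEASE_BUF_LEN` and its value are assumptions for this example only, not taken from the patch.

```c
#include <stddef.h>
#include <string.h>

#define RELEASE_BUF_LEN 65  /* assumed upper bound for this sketch */

/*
 * Copies len bytes of src into out through a fixed-size scratch buffer.
 * Returns the number of bytes copied, or 0 if len does not fit.
 */
static size_t copy_release(char *out, const char *src, size_t len)
{
    char buf[RELEASE_BUF_LEN];  /* fixed size instead of "char buf[len]" */

    if (len == 0 || len > sizeof(buf))
        return 0;
    memcpy(buf, src, len);
    memcpy(out, buf, len);
    return len;
}
```

With a constant size the compiler knows the stack frame layout up front, at the cost of an explicit bounds check on `len`.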
     

01 Oct, 2011

1 commit

  • …for-linus' of git://tesla.tglx.de/git/linux-2.6-tip

    * 'irq-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
    irq: Fix check for already initialized irq_domain in irq_domain_add
    irq: Add declaration of irq_domain_simple_ops to irqdomain.h

    * 'x86-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
    x86/rtc: Don't recursively acquire rtc_lock

    * 'sched-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
    posix-cpu-timers: Cure SMP wobbles
    sched: Fix up wchan borkage
    sched/rt: Migrate equal priority tasks to available CPUs

    Linus Torvalds
     

30 Sep, 2011

2 commits

  • David reported:

    Attached below is a watered-down version of rt/tst-cpuclock2.c from
    GLIBC. Just build it with "gcc -o test test.c -lpthread -lrt" or
    similar.

    Run it several times, and you will see cases where the main thread
    will measure a process clock difference before and after the nanosleep
    which is smaller than the cpu-burner thread's individual thread clock
    difference. This doesn't make any sense since the cpu-burner thread
    is part of the top-level process's thread group.

    I've reproduced this on both x86-64 and sparc64 (using both 32-bit and
    64-bit binaries).

    For example:

    [davem@boricha build-x86_64-linux]$ ./test
    process: before(0.001221967) after(0.498624371) diff(497402404)
    thread: before(0.000081692) after(0.498316431) diff(498234739)
    self: before(0.001223521) after(0.001240219) diff(16698)
    [davem@boricha build-x86_64-linux]$

    The diff of 'process' should always be >= the diff of 'thread'.

    I make sure to wrap the 'thread' clock measurements the most tightly
    around the nanosleep() call, and that the 'process' clock measurements
    are the outer-most ones.

    ---
    #include <unistd.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <fcntl.h>
    #include <string.h>
    #include <errno.h>
    #include <pthread.h>

    static pthread_barrier_t barrier;

    static void *chew_cpu(void *arg)
    {
        pthread_barrier_wait(&barrier);
        while (1)
            __asm__ __volatile__("" : : : "memory");
        return NULL;
    }

    int main(void)
    {
        clockid_t process_clock, my_thread_clock, th_clock;
        struct timespec process_before, process_after;
        struct timespec me_before, me_after;
        struct timespec th_before, th_after;
        struct timespec sleeptime;
        unsigned long diff;
        pthread_t th;
        int err;

        err = clock_getcpuclockid(0, &process_clock);
        if (err)
            return 1;

        err = pthread_getcpuclockid(pthread_self(), &my_thread_clock);
        if (err)
            return 1;

        pthread_barrier_init(&barrier, NULL, 2);
        err = pthread_create(&th, NULL, chew_cpu, NULL);
        if (err)
            return 1;

        err = pthread_getcpuclockid(th, &th_clock);
        if (err)
            return 1;

        pthread_barrier_wait(&barrier);

        err = clock_gettime(process_clock, &process_before);
        if (err)
            return 1;

        err = clock_gettime(my_thread_clock, &me_before);
        if (err)
            return 1;

        err = clock_gettime(th_clock, &th_before);
        if (err)
            return 1;

        sleeptime.tv_sec = 0;
        sleeptime.tv_nsec = 500000000;
        nanosleep(&sleeptime, NULL);

        err = clock_gettime(th_clock, &th_after);
        if (err)
            return 1;

        err = clock_gettime(my_thread_clock, &me_after);
        if (err)
            return 1;

        err = clock_gettime(process_clock, &process_after);
        if (err)
            return 1;

        diff = process_after.tv_nsec - process_before.tv_nsec;
        printf("process: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
               process_before.tv_sec, process_before.tv_nsec,
               process_after.tv_sec, process_after.tv_nsec, diff);
        diff = th_after.tv_nsec - th_before.tv_nsec;
        printf("thread: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
               th_before.tv_sec, th_before.tv_nsec,
               th_after.tv_sec, th_after.tv_nsec, diff);
        diff = me_after.tv_nsec - me_before.tv_nsec;
        printf("self: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
               me_before.tv_sec, me_before.tv_nsec,
               me_after.tv_sec, me_after.tv_nsec, diff);

        return 0;
    }

    This is due to us using p->se.sum_exec_runtime in
    thread_group_cputime() where we iterate the thread group and sum all
    data. This does not take time since the last schedule operation (tick
    or otherwise) into account. We can cure this by using
    task_sched_runtime() at the cost of having to take locks.

    This also means we can (and must) do away with
    thread_group_sched_runtime() since the modified thread_group_cputime()
    is now more accurate and would deadlock when called from
    thread_group_sched_runtime().

    Aside from that, it makes the function safe on 32-bit systems. The old
    code added t->se.sum_exec_runtime unprotected. sum_exec_runtime is a
    64-bit value and could be changed on another CPU at the same time.

    Reported-by: David Miller
    Signed-off-by: Peter Zijlstra
    Cc: stable@kernel.org
    Link: http://lkml.kernel.org/r/1314874459.7945.22.camel@twins
    Tested-by: David Miller
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
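
The test program quoted in the commit above computes each diff from `tv_nsec` alone, which is only meaningful while both samples fall within the same second (as they do in the sample output). A variant that also folds in `tv_sec` might look like the sketch below; the helper name `timespec_diff_ns` is an assumption and not part of the original test.

```c
#include <stdint.h>
#include <time.h>

/* Nanoseconds elapsed between two samples of the same clock. */
static int64_t timespec_diff_ns(const struct timespec *before,
                                const struct timespec *after)
{
    int64_t sec = (int64_t)after->tv_sec - (int64_t)before->tv_sec;
    int64_t nsec = (int64_t)after->tv_nsec - (int64_t)before->tv_nsec;

    return sec * 1000000000LL + nsec;
}
```

Applied to the 'process' line of the sample output, before(0.001221967) and after(0.498624371), this gives the same 497402404 ns that the original program printed.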
     
  • __find_resource() incorrectly returns a resource window which overlaps
    an existing allocated window. This happens when the parent's
    resource-window spans 0x00000000 to 0xffffffff and is entirely allocated
    to all its children resource-windows.

    __find_resource() looks for gaps in resource allocation among the
    children resource windows. When it encounters the last child window it
    blindly tries the range next to one allocated to the last child. Since
    the last child's window ends at 0xffffffff the calculation overflows,
    leading the algorithm to believe that any window in the range 0x00000000
    to 0xffffffff is available for allocation. This leads to a conflicting
    window allocation.

    Michal Ludvig reported this issue seen on his platform. The following
    patch fixes the problem and has been verified by Michal. I believe this
    bug has been there for ages. It got exposed by git commit 2bbc6942273b
    ("PCI : ability to relocate assigned pci-resources")

    Signed-off-by: Ram Pai
    Tested-by: Michal Ludvig
    Signed-off-by: Linus Torvalds

    Ram Pai
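
The overflow described in the __find_resource() commit above can be reproduced in isolation. The sketch below assumes 32-bit addresses and uses a hypothetical helper, `next_gap_start`, that refuses to compute "end of last child + 1" when that child already ends at the top of the address space. It illustrates the failure mode only and is not the actual kernel fix.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Computes the first address after a child window ending at child_end.
 * Returns false when child_end is already the top of the 32-bit space,
 * where child_end + 1 would wrap around to 0.
 */
static bool next_gap_start(uint32_t child_end, uint32_t *start)
{
    if (child_end == UINT32_MAX)
        return false;
    *start = child_end + 1;
    return true;
}
```

Without the guard, the candidate start wraps to 0 and every already-allocated window looks like free space.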
     

26 Sep, 2011

2 commits

  • Commit c259e01a1ec ("sched: Separate the scheduler entry for
    preemption") contained a boo-boo wrecking wchan output. It forgot to
    put the new schedule() function in the __sched section and thereby
    doesn't get properly ignored for things like wchan.

    Tested-by: Simon Kirby
    Cc: stable@kernel.org # 2.6.39+
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110923000346.GA25425@hostway.ca
    Signed-off-by: Ingo Molnar

    Simon Kirby
     
  • If PTRACE_LISTEN fails after lock_task_sighand() it doesn't drop ->siglock.

    Reported-by: Matt Fleming
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

20 Sep, 2011

4 commits

  • The sanity check in irq_domain_add() tests desc->irq_data != NULL or
    irq_data->domain != NULL. This prevents adding an irq_domain to an irq
    descriptor when irq_data exists, which is true whenever the irq
    descriptor exists.

    This went unnoticed so far as the simple domain code did not enter
    this code path because domain->nr_irqs is always 0 for the simple domains.

    Split the check for irq_data == NULL out and have a separate warning
    for it.

    [ tglx: Made the check for irq_data == NULL separate ]

    Signed-off-by: Rob Herring
    Cc: Grant Likely
    Cc: marc.zyngier@arm.com
    Cc: thomas.abraham@linaro.org
    Cc: jamie@jamieiles.com
    Cc: b-cousson@ti.com
    Cc: shawn.guo@linaro.org
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: devicetree-discuss@lists.ozlabs.org
    Link: http://lkml.kernel.org/r/1316017900-19918-3-git-send-email-robherring2@gmail.com
    Signed-off-by: Thomas Gleixner

    Rob Herring
     
  • * 'irq-fixes-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
    x86, iommu: Mark DMAR IRQ as non-threaded
    genirq: Make irq_shutdown() symmetric vs. irq_startup again

    Linus Torvalds
     
  • Even with just the interface limited to admin, there really is little
    reason to give byte-per-byte counts for taskstats. So round it down to
    something less intrusive.

    Acked-by: Balbir Singh
    Signed-off-by: Linus Torvalds

    Linus Torvalds
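
Rounding a byte counter down to a coarser step, as the taskstats commit above describes, can be done with a mask when the step is a power of two. The 1024-byte step and the names `GRANULARITY` and `round_down_count` below are assumptions for illustration; the commit message does not state the granularity used.

```c
#include <stdint.h>

#define GRANULARITY 1024ULL  /* assumed step; must be a power of two */

/* Clears the low bits so the result is a multiple of GRANULARITY. */
static uint64_t round_down_count(uint64_t bytes)
{
    return bytes & ~(GRANULARITY - 1);
}
```

A reader of the statistics then sees 4096 for any true value from 4096 to 5119, which hides the exact size of individual reads and writes.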
     
  • Ok, this isn't optimal, since it means that 'iotop' needs admin
    capabilities, and we may have to work on this some more. But at the
    same time it is very much not acceptable to let anybody just read
    anybody else's IO statistics quite at this level.

    Use of the GENL_ADMIN_PERM suggested by Johannes Berg as an alternative
    to checking the capabilities by hand.

    Reported-by: Vasiliy Kulikov
    Cc: Johannes Berg
    Acked-by: Balbir Singh
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

18 Sep, 2011

1 commit

  • Commit 43fa5460fe60dea5c610490a1d263415419c60f6 ("sched: Try not to
    migrate higher priority RT tasks") also introduced a change in behavior
    which keeps RT tasks on the same CPU if there is an equal priority RT
    task currently running even if there are empty CPUs available.

    This can cause unnecessary wakeup latencies, and can prevent the
    scheduler from balancing all RT tasks across available CPUs.

    This change causes an RT task to search for a new CPU if an equal
    priority RT task is already running on wakeup. Lower priority tasks
    will still have to wait on higher priority tasks, but the system should
    still balance out: if both a high and a low priority RT task are on a
    given CPU, the high priority task could wake up while the low priority
    task is running and force it to search for a better runqueue.

    Signed-off-by: Shawn Bohrer
    Acked-by: Steven Rostedt
    Tested-by: Steven Rostedt
    Signed-off-by: Peter Zijlstra
    Cc: stable@kernel.org # 37+
    Link: http://lkml.kernel.org/r/1315837684-18733-1-git-send-email-sbohrer@rgmadvisors.com
    Signed-off-by: Ingo Molnar

    Shawn Bohrer
     

15 Sep, 2011

1 commit

  • Take cwq->gcwq->lock to avoid racing between drain_workqueue checking to
    make sure the workqueues are empty and cwq_dec_nr_in_flight decrementing
    and then incrementing nr_active when it activates a delayed work.

    We discovered this when a corner case in one of our drivers resulted in
    us trying to destroy a workqueue in which the remaining work would
    always requeue itself again in the same workqueue. We would hit this
    race condition and trip the BUG_ON on workqueue.c:3080.

    Signed-off-by: Thomas Tuttle
    Acked-by: Tejun Heo
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Tuttle
     

12 Sep, 2011

1 commit

  • If an irq_chip provides .irq_shutdown(), but neither .irq_disable() nor
    .irq_mask(), free_irq() crashes when jumping to NULL.
    Fix this by only trying .irq_disable() and .irq_mask() if there's no
    .irq_shutdown() provided.

    This revives the symmetry with irq_startup(), which tries .irq_startup(),
    .irq_enable(), and .irq_unmask(), and makes it consistent with the comment for
    irq_chip.irq_shutdown() in <linux/irq.h>, which says:

    * @irq_shutdown: shut down the interrupt (defaults to ->disable if NULL)

    This is also how __free_irq() behaved before the big overhaul, cfr. e.g.
    3b56f0585fd4c02d047dc406668cb40159b2d340 ("genirq: Remove bogus conditional"),
    where the core interrupt code always overrode .irq_shutdown() to
    .irq_disable() if .irq_shutdown() was NULL.

    Signed-off-by: Geert Uytterhoeven
    Cc: linux-m68k@lists.linux-m68k.org
    Link: http://lkml.kernel.org/r/1315742394-16036-2-git-send-email-geert@linux-m68k.org
    Cc: stable@kernel.org
    Signed-off-by: Thomas Gleixner

    Geert Uytterhoeven
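
The fallback order described in the irq_shutdown() commit above can be sketched with plain function pointers. `struct chip_ops`, `chip_shutdown` and the counting callback are hypothetical stand-ins for this example, not the kernel's `struct irq_chip` or its core code.

```c
/* Hypothetical, cut-down stand-in for an interrupt chip's callbacks. */
struct chip_ops {
    void (*shutdown)(void);
    void (*disable)(void);
    void (*mask)(void);
};

/* Example callback that only counts how often it was invoked. */
static int shutdown_calls;
static void count_shutdown(void)
{
    shutdown_calls++;
}

/*
 * Calls the most specific callback that is provided and never jumps
 * through a NULL pointer. Returns 1 if a callback ran, 0 if none exists.
 */
static int chip_shutdown(const struct chip_ops *ops)
{
    if (ops->shutdown)
        ops->shutdown();
    else if (ops->disable)
        ops->disable();
    else if (ops->mask)
        ops->mask();
    else
        return 0;
    return 1;
}
```

A chip that sets only `.shutdown = count_shutdown` is handled by the first branch, so the missing `disable` and `mask` callbacks are never touched.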
     

08 Sep, 2011

2 commits


31 Aug, 2011

1 commit

  • We detected a serious issue with PERF_SAMPLE_READ and
    timing information when events were being multiplexed.

    Samples would have time_running > time_enabled. That
    was easy to reproduce with a libpfm4 example (ran 3
    times to cause multiplexing on Core 2):

    $ syst_smpl -e uops_retired:freq=1 &
    $ syst_smpl -e uops_retired:freq=1 &
    $ syst_smpl -e uops_retired:freq=1 &
    IIP:0x0000000040062d ... PERIOD:2355332948 ENA=40144625315 RUN=60014875184
    syst_smpl: WARNING: time_running > time_enabled
    63277537998 uops_retired:freq=1 , scaled

    The bug was not present in kernel up to (and including) 3.0. It turns
    out the bug was introduced by the following commit:

    commit c4794295917ebeda8013b6cb9c8d71ab4f74a1fa

    events: Move lockless timer calculation into helper function

    The parameters of the function got reversed yet the call sites
    were not updated to reflect the change. That led to time_running
    and time_enabled being swapped. That had no effect when there was
    no multiplexing because in that case time_running = time_enabled
    but it would show up in any other scenario.

    Signed-off-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110829124112.GA4828@quad
    Signed-off-by: Ingo Molnar

    Eric B Munson
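
To see why swapped arguments matter in the commit above, consider how a multiplexed count is usually scaled: the raw count is multiplied by time_enabled / time_running. The sketch below is a user-space illustration with assumed helper names (`times_are_sane`, `scale_count`); it is not the kernel helper the commit refers to.

```c
#include <stdint.h>

/* An event cannot have been running for longer than it was enabled. */
static int times_are_sane(uint64_t time_enabled, uint64_t time_running)
{
    return time_running <= time_enabled;
}

/* Estimates the full count from a sample taken while multiplexed. */
static uint64_t scale_count(uint64_t raw, uint64_t time_enabled,
                            uint64_t time_running)
{
    if (time_running == 0)
        return 0;
    return (uint64_t)((double)raw * (double)time_enabled /
                      (double)time_running);
}
```

With the values from the warning above (ENA=40144625315, RUN=60014875184) the sanity check fails, and the scale factor drops below 1 instead of compensating for the time the event was not scheduled.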
     

29 Aug, 2011

4 commits

  • The current cgroup context switch code was incorrect leading
    to bogus counts. Furthermore, as soon as there was an active
    cgroup event on a CPU, the context switch cost on that CPU
    would increase by a significant amount as demonstrated by a
    simple ping/pong example:

    $ ./pong
    Both processes pinned to CPU1, running for 10s
    10684.51 ctxsw/s

    Now start a cgroup perf stat:
    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 100

    $ ./pong
    Both processes pinned to CPU1, running for 10s
    6674.61 ctxsw/s

    That's a 37% penalty.

    Note that pong is not even in the monitored cgroup.

    The results shown by perf stat are bogus:
    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 100

    Performance counter stats for 'sleep 100':

    CPU1 <not counted> cycles test
    CPU1 16,984,189,138 cycles # 0.000 GHz

    The second 'cycles' event should report a count @ CPU clock
    (here 2.4GHz) as it is counting across all cgroups.

    The patch below fixes the bogus accounting and bypasses any
    cgroup switches in case the outgoing and incoming tasks are
    in the same cgroup.

    With this patch the same test now yields:
    $ ./pong
    Both processes pinned to CPU1, running for 10s
    10775.30 ctxsw/s

    Start perf stat with cgroup:

    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10

    Run pong outside the cgroup:
    $ ./pong
    Both processes pinned to CPU1, running for 10s
    10687.80 ctxsw/s

    The penalty is now less than 2%.

    And the results for perf stat are correct:

    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10

    Performance counter stats for 'sleep 10':

    CPU1 <not counted> cycles test # 0.000 GHz
    CPU1 23,933,981,448 cycles # 0.000 GHz

    Now perf stat reports the correct counts for
    the non-cgroup event.

    If we run pong inside the cgroup, then we also get the
    correct counts:

    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10

    Performance counter stats for 'sleep 10':

    CPU1 22,297,726,205 cycles test # 0.000 GHz
    CPU1 23,933,981,448 cycles # 0.000 GHz

    10.001457237 seconds time elapsed

    Signed-off-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110825135803.GA4697@quad
    Signed-off-by: Ingo Molnar

    Stephane Eranian
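
The penalty figures quoted in the commit above follow directly from the ctxsw/s numbers. The helper below is only a worked version of that arithmetic; the name `penalty` is an assumption.

```c
/* Relative slowdown of `measured` versus `baseline`, as a fraction. */
static double penalty(double baseline, double measured)
{
    return (baseline - measured) / baseline;
}
```

Before the patch, `penalty(10684.51, 6674.61)` is about 0.375, the quoted 37%. After the patch, `penalty(10775.30, 10687.80)` is about 0.008, comfortably under the quoted 2%.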
     
  • This patch fixes the following memory leak:

    unreferenced object 0xffff880107266800 (size 512):
    comm "sched-powersave", pid 3718, jiffies 4323097853 (age 27495.450s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] create_object+0x187/0x28b
    [] kmemleak_alloc+0x73/0x98
    [] __kmalloc_node+0x104/0x159
    [] kzalloc_node.clone.97+0x15/0x17
    [] build_sched_domains+0xb7/0x7f3
    [] partition_sched_domains+0x1db/0x24a
    [] do_rebuild_sched_domains+0x3b/0x47
    [] rebuild_sched_domains+0x10/0x12
    [] sched_power_savings_store+0x6c/0x7b
    [] sched_mc_power_savings_store+0x16/0x18
    [] sysdev_class_store+0x20/0x22
    [] sysfs_write_file+0x108/0x144
    [] vfs_write+0xaf/0x102
    [] sys_write+0x4d/0x74
    [] system_call_fastpath+0x16/0x1b
    [] 0xffffffffffffffff

    Signed-off-by: WANG Cong
    Signed-off-by: Peter Zijlstra
    Cc: stable@kernel.org # 3.0
    Link: http://lkml.kernel.org/r/1313671017-4112-1-git-send-email-amwang@redhat.com
    Signed-off-by: Ingo Molnar

    WANG Cong
     
  • There is no real reason to run blk_schedule_flush_plug() with
    interrupts and preemption disabled.

    Move it into schedule() and call it when the task is going voluntarily
    to sleep. There might be false positives when the task is woken
    between that call and actually scheduling, but that's not really
    different from being woken immediately after switching away.

    This fixes a deadlock in the scheduler where the
    blk_schedule_flush_plug() callchain enables interrupts and thereby
    allows a wakeup to happen of the task that's going to sleep.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra
    Cc: Tejun Heo
    Cc: Jens Axboe
    Cc: Linus Torvalds
    Cc: stable@kernel.org # 2.6.39+
    Link: http://lkml.kernel.org/n/tip-dwfxtra7yg1b5r65m32ywtct@git.kernel.org
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • Block-IO and workqueues call into notifier functions from the
    scheduler core code with interrupts and preemption disabled. These
    calls should be made before entering the scheduler core.

    To simplify this, separate the scheduler core code into
    __schedule(). __schedule() is directly called from the places which
    set PREEMPT_ACTIVE and from schedule(). This allows us to add the work
    checks into schedule(), so they are only called when a task voluntarily
    goes to sleep.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra
    Cc: Tejun Heo
    Cc: Jens Axboe
    Cc: Linus Torvalds
    Cc: stable@kernel.org # 2.6.39+
    Link: http://lkml.kernel.org/r/20110622174918.813258321@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

27 Aug, 2011

1 commit


26 Aug, 2011

2 commits

  • It seems that 7bf693951a8e ("console: allow to retain boot console via
    boot option keep_bootcon") doesn't always achieve what it aims, as when
    printk_late_init() runs it unconditionally turns off all boot consoles.
    With this patch, I am able to see more messages on the boot console in
    KVM guests than I can without, when keep_bootcon is specified.

    I think it is appropriate for the relevant -stable trees. However, it's
    more of an annoyance than a serious bug (ideally you don't need to keep
    the boot console around as console handover should be working -- I was
    encountering a situation where the console handover wasn't working and
    not having the boot console available meant I couldn't see why).

    Signed-off-by: Nishanth Aravamudan
    Cc: David S. Miller
    Cc: Alan Cox
    Cc: Greg KH
    Acked-by: Fabio M. Di Nitto
    Cc: [2.6.39.x, 3.0.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     
  • I ran into a couple of programs which broke with the new Linux 3.0
    version. Some of those were binary only. I tried to use LD_PRELOAD to
    work around it, but it was quite difficult and in one case impossible
    because of a mix of 32bit and 64bit executables.

    For example, all kinds of management software from HP doesn't work unless
    we pretend to run a 2.6 kernel.

    $ uname -a
    Linux svivoipvnx001 3.0.0-08107-g97cd98f #1062 SMP Fri Aug 12 18:11:45 CEST 2011 i686 i686 i386 GNU/Linux

    $ hpacucli ctrl all show

    Error: No controllers detected.

    $ rpm -qf /usr/sbin/hpacucli
    hpacucli-8.75-12.0

    Another notable case is that Python now reports "linux3" from
    sys.platform, which in turn can break things that were checking
    sys.platform == "linux2":

    https://bugzilla.mozilla.org/show_bug.cgi?id=664564

    It seems pretty clear to me, though, that it's a bug in the apps that
    are using '==' instead of .startswith(), but this allows us to unbreak
    broken programs.

    This patch adds a UNAME26 personality that makes the kernel report a
    2.6.40+x version number instead. The x is the x in 3.x.

    I know this is somewhat ugly, but I didn't find a better workaround, and
    compatibility to existing programs is important.

    Some programs also read /proc/sys/kernel/osrelease. This can be worked
    around in user space with mount --bind (and a mount namespace).

    To use:

    wget ftp://ftp.kernel.org/pub/linux/kernel/people/ak/uname26/uname26.c
    gcc -o uname26 uname26.c
    ./uname26 program

    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Andi Kleen
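
The version mapping described in the UNAME26 commit above (3.x reported as 2.6.(40+x)) can be sketched in user space as follows. This is a simplified illustration, not the kernel implementation: the function name `uname26_release` is an assumption, and the sketch drops any suffix after the minor number.

```c
#include <stdio.h>

/*
 * Maps a "3.x" release string to "2.6.(40+x)".
 * Returns 0 on success, -1 if the input does not start with "3.<number>".
 */
static int uname26_release(const char *release, char *out, size_t outlen)
{
    unsigned int x;

    if (sscanf(release, "3.%u", &x) != 1)
        return -1;
    snprintf(out, outlen, "2.6.%u", 40 + x);
    return 0;
}
```

Under this mapping the 3.0.0 kernel from the uname output above would be reported as 2.6.40, and a 3.1 kernel as 2.6.41.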
     

24 Aug, 2011

2 commits

  • * 'for-linus' of git://oss.sgi.com/xfs/xfs:
    xfs: fix tracing builds inside the source tree
    xfs: remove subdirectories
    xfs: don't expect xfs headers to be in subdirectories

    Linus Torvalds
     
  • This reverts commit f3637a5f2e2eb391ff5757bc83fb5de8f9726464.

    It turns out that this breaks several drivers, one example being OMAP
    boards which use the on-board OMAP UARTs and the omap-serial driver that
    will not boot to userspace after the commit.

    Paul Walmsley reports that enabling CONFIG_DEBUG_SHIRQ reveals 'IRQ
    handler type mismatch' errors:

    IRQ handler type mismatch for IRQ 74
    current handler: serial idle
    ...

    and the reason is that setting IRQF_ONESHOT will now result in those
    interrupt handlers having different IRQF flags, and thus being
    unsharable. So the commit log in the reverted commit:

    "Since it is required for those users and
    there is no difference for others it makes sense to add this flag
    unconditionally."

    is simply not true: there may not be any difference from an "actions at
    irq time" point of view, but there is a *big* difference in how irq
    management tests this flag (see __setup_irq() in kernel/irq/manage.c).

    One solution may be to stop verifying IRQF_ONESHOT in __setup_irq(), but
    right now the safe course of action is to revert the change. Let's
    revisit this in a later merge window.

    Reported-by: Paul Walmsley
    Cc: Sebastian Andrzej Siewior
    Requested-by: Alan Cox
    Acked-by: Thomas Gleixner
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

20 Aug, 2011

1 commit

  • * 'for-linus' of git://git.kernel.dk/linux-block: (23 commits)
    Revert "cfq: Remove special treatment for metadata rqs."
    block: fix flush machinery for stacking drivers with differring flush flags
    block: improve rq_affinity placement
    blktrace: add FLUSH/FUA support
    Move some REQ flags to the common bio/request area
    allow blk_flush_policy to return REQ_FSEQ_DATA independent of *FLUSH
    xen/blkback: Make description more obvious.
    cfq-iosched: Add documentation about idling
    block: Make rq_affinity = 1 work as expected
    block: swim3: fix unterminated of_device_id table
    block/genhd.c: remove useless cast in diskstats_show()
    drivers/cdrom/cdrom.c: relax check on dvd manufacturer value
    drivers/block/drbd/drbd_nl.c: use bitmap_parse instead of __bitmap_parse
    bsg-lib: add module.h include
    cfq-iosched: Reduce linked group count upon group destruction
    blk-throttle: correctly determine sync bio
    loop: fix deadlock when sysfs and LOOP_CLR_FD race against each other
    loop: add BLK_DEV_LOOP_MIN_COUNT=%i to allow distros 0 pre-allocated loop devices
    loop: add management interface for on-demand device allocation
    loop: replace linked list of allocated devices with an idr index
    ...

    Linus Torvalds
     

19 Aug, 2011

1 commit


18 Aug, 2011

3 commits


14 Aug, 2011

1 commit


13 Aug, 2011

1 commit

  • Use the move from Linux 2.6 to Linux 3.x as an excuse to kill the
    annoying subdirectories in the XFS source code. Besides the large
    number of file renames, the only changes are to the Makefile, a few
    files including headers with the subdirectory prefix, and the binary
    sysctl compat code that includes a header under fs/xfs/ from
    kernel/.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

12 Aug, 2011

2 commits

  • The patch http://lkml.org/lkml/2003/7/13/226 introduced an RLIMIT_NPROC
    check in set_user() to check for NPROC exceeding via setuid() and
    similar functions.

    Before the check, an unprivileged user could greatly exceed the allowed
    number of processes if the program relied on rlimit only. But the check
    created a new security threat: many poorly written programs simply don't
    check the setuid() return code and believe it cannot fail if executed
    with root privileges. So the check is removed in this patch because
    privilege escalations related to such buggy programs are too frequent.

    NPROC can still be enforced in the common code flow of daemons
    spawning user processes. Most daemons do fork()+setuid()+execve().
    The check introduced in execve() (1) enforces the same limit as in
    setuid() and (2) doesn't create similar security issues.

    Neil Brown suggested tracking which specific process has exceeded the
    limit by setting the PF_NPROC_EXCEEDED process flag. With the change only
    this process would fail on execve(), and other processes' execve()
    behaviour is not changed.

    Solar Designer suggested re-checking whether the NPROC limit is still
    exceeded at the moment of execve(). If the process was sleeping for
    days between set*uid() and execve(), and the NPROC counter stepped down
    under the limit, a deferred execve() failure because the NPROC limit was
    exceeded days ago would be unexpected. If the limit is not exceeded
    anymore, we clear the flag on successful calls to execve() and fork().

    The flag is also cleared on successful calls to set_user() as the limit
    was exceeded for the previous user, not the current one.

    A similar check was introduced in -ow patches (without the process flag).

    v3 - clear PF_NPROC_EXCEEDED on successful calls to set_user().

    Reviewed-by: James Morris
    Signed-off-by: Vasiliy Kulikov
    Acked-by: NeilBrown
    Signed-off-by: Linus Torvalds

    Vasiliy Kulikov
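
The flag handling described in the RLIMIT_NPROC commit above can be modelled as a small state machine. Everything below is a user-space model under stated assumptions: `struct task_model`, the `NPROC_EXCEEDED` bit value, the function names and the exact comparison operators are illustrative and not copied from the patch.

```c
#include <stdbool.h>

#define NPROC_EXCEEDED 0x1u  /* stand-in for PF_NPROC_EXCEEDED; value is arbitrary */

struct task_model {
    unsigned int flags;
    unsigned int user_processes;  /* processes owned by the task's user */
    unsigned int nproc_limit;     /* models RLIMIT_NPROC */
};

/* set_user(): never fails on the limit, only records that it was exceeded. */
static void model_set_user(struct task_model *t)
{
    if (t->user_processes >= t->nproc_limit)
        t->flags |= NPROC_EXCEEDED;
    else
        t->flags &= ~NPROC_EXCEEDED;
}

/* execve(): fails only if the flag is set and the limit is still exceeded. */
static bool model_execve(struct task_model *t)
{
    if ((t->flags & NPROC_EXCEEDED) && t->user_processes > t->nproc_limit)
        return false;
    t->flags &= ~NPROC_EXCEEDED;
    return true;
}
```

In this model a program that ignores the result of setuid() no longer keeps running with root privileges: the uid change itself succeeds, and the limit is enforced at the later execve() instead.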
     
  • …l/git/tip/linux-2.6-tip

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    perf symbols: Check '/tmp/perf-' symbol file ownership
    perf sched: Usage leftover from trace -> script rename
    perf sched: Do not delete session object prematurely
    perf tools: Check $HOME/.perfconfig ownership
    perf, x86: Add model 45 SandyBridge support
    perf tools: Add support to install perf python extension
    perf tools: do not look at ./config for configuration
    perf tools: Make clean leaves some files
    perf lock: Dropping unsupported ':r' modifier
    perf probe: Fix coredump introduced by probe module option
    jump label: Reduce the cycle count by changing the link order
    perf report: Use ui__warning in some more places
    perf python: Add PERF_RECORD_{LOST,READ,SAMPLE} routine tables
    perf evlist: Introduce 'disable' method
    trace events: Update version number reference to new 3.x scheme for EVENT_POWER_TRACING_DEPRECATED
    perf buildid-cache: Zero out buffer of filenames when adding/removing buildid

    Linus Torvalds
     

11 Aug, 2011

2 commits

  • Add FLUSH/FUA support to blktrace. As FLUSH precedes WRITE and/or
    FUA follows WRITE, use the same 'F' flag for both cases and
    distinguish them by their (relative) position. The end results
    look like (other flags might be shown also):

    - WRITE: W
    - WRITE_FLUSH: FW
    - WRITE_FUA: WF
    - WRITE_FLUSH_FUA: FWF

    Note that we reuse TC_BARRIER due to lack of bit space of act_mask
    so that the older versions of blktrace tools will report flush
    requests as barriers from now on.

    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Signed-off-by: Namhyung Kim
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Namhyung Kim
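
The positional use of the 'F' flag listed in the blktrace commit above can be sketched as a small string builder. The function name `write_flags` is an assumption and the sketch covers only the four write cases from the list, not the other flags blktrace may print.

```c
#include <stdbool.h>
#include <stddef.h>

/* Builds the flag string for a write; out must hold at least 4 bytes. */
static void write_flags(bool flush, bool fua, char *out)
{
    size_t i = 0;

    if (flush)
        out[i++] = 'F';  /* FLUSH precedes the write */
    out[i++] = 'W';
    if (fua)
        out[i++] = 'F';  /* FUA follows the write */
    out[i] = '\0';
}
```

A reader distinguishes the two meanings of 'F' purely by whether it appears before or after the 'W'.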
     
  • It's possible to jam up the alarm timers by setting very small interval
    timers, which will cause the alarmtimer subsystem to spend all of its time
    firing and restarting timers. This can effectively lock up a box.

    A deeper fix is needed, closely mimicking the hrtimer code, but for now
    just cap the interval to 100us to avoid userland hanging the system.

    CC: Thomas Gleixner
    CC: stable@kernel.org
    Signed-off-by: John Stultz

    John Stultz
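
The interval cap from the alarmtimer commit above amounts to a lower bound on the requested period. The sketch below is one possible reading: the names are hypothetical, and treating an interval of 0 as "no repeat" and leaving it untouched is an assumption of this example rather than something the commit message states.

```c
#include <stdint.h>

#define MIN_INTERVAL_NS 100000LL  /* 100us, the cap named in the commit */

/* Raises very small non-zero intervals to the minimum. */
static int64_t clamp_interval_ns(int64_t interval_ns)
{
    if (interval_ns > 0 && interval_ns < MIN_INTERVAL_NS)
        return MIN_INTERVAL_NS;
    return interval_ns;
}
```

A request for a 1 ns interval is therefore served as 100 us, which bounds how often the timer can fire and restart.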
     

10 Aug, 2011

3 commits

  • Following common_timer_get, zero out the itimerspec passed in.

    CC: Thomas Gleixner
    CC: stable@kernel.org
    Signed-off-by: John Stultz

    John Stultz
     
  • We don't check if old_setting is non-NULL before assigning to it, so
    correct this.

    CC: Thomas Gleixner
    CC: stable@kernel.org
    Signed-off-by: John Stultz

    John Stultz
     
  • syslog-ng versions before 3.3.0beta1 (2011-05-12) assume that
    CAP_SYS_ADMIN is sufficient to access syslog, so ever since CAP_SYSLOG
    was introduced (2010-11-25) they have triggered a warning.

    Commit ee24aebffb75 ("cap_syslog: accept CAP_SYS_ADMIN for now")
    improved matters a little by making syslog-ng work again, just keeping
    the WARN_ONCE(). But still, this is a warning that writes a stack trace
    we don't care about to syslog, sets a taint flag, and alarms sysadmins
    when nothing worse has happened than use of an old userspace with a
    recent kernel.

    Convert the WARN_ONCE to a printk_once to avoid that while continuing to
    give userspace developers a hint that this is an unwanted
    backward-compatibility feature and won't be around forever.

    Reported-by: Ralf Hildebrandt
    Reported-by: Niels
    Reported-by: Paweł Sikora
    Signed-off-by: Jonathan Nieder
    Liked-by: Gergely Nagy
    Acked-by: Serge Hallyn
    Acked-by: James Morris
    Signed-off-by: Linus Torvalds

    Jonathan Nieder