04 Aug, 2010

1 commit


30 Jul, 2010

1 commit

  • It's possible for get_task_cred() as it currently stands to 'corrupt' a set of
    credentials by incrementing their usage count after their replacement by the
    task being accessed.

    What happens is that get_task_cred() can race with commit_creds():

    TASK_1                       TASK_2                       RCU_CLEANER
    -->get_task_cred(TASK_2)
    rcu_read_lock()
    __cred = __task_cred(TASK_2)
                                 -->commit_creds()
                                 old_cred = TASK_2->real_cred
                                 TASK_2->real_cred = ...
                                 put_cred(old_cred)
                                   call_rcu(old_cred)
                [__cred->usage == 0]
    get_cred(__cred)
                [__cred->usage == 1]
    rcu_read_unlock()
                                                              -->put_cred_rcu()
                                                              [__cred->usage == 1]
                                                              panic()

    However, since a task's credentials are generally not changed very often,
    we can reasonably make use of a loop involving reading the creds pointer
    and using atomic_inc_not_zero() to attempt to increment the usage count if
    it hasn't already hit zero.

    If successful, we can safely return the credentials in the knowledge that, even
    if the task we're accessing has released them, they haven't gone to the RCU
    cleanup code.
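
    A minimal sketch of that loop, close to the change described here and
    trimmed for illustration (the usual helpers from <linux/cred.h> are
    assumed):

    #include <linux/cred.h>
    #include <linux/sched.h>

    const struct cred *get_task_cred(struct task_struct *task)
    {
        const struct cred *cred;

        rcu_read_lock();

        do {
            cred = __task_cred(task);
            BUG_ON(!cred);
            /* Retry if the usage count already hit zero; those creds
             * are on their way to the RCU cleanup code. */
        } while (!atomic_inc_not_zero(&((struct cred *)cred)->usage));

        rcu_read_unlock();
        return cred;
    }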

    We then change task_state() in procfs to use get_task_cred() rather than
    calling get_cred() on the result of __task_cred(), as that suffers from the
    same problem.

    Without this change, a BUG_ON in __put_cred() or in put_cred_rcu() can be
    tripped when it is noticed that the usage count is not zero as it ought to be,
    for example:

    kernel BUG at kernel/cred.c:168!
    invalid opcode: 0000 [#1] SMP
    last sysfs file: /sys/kernel/mm/ksm/run
    CPU 0
    Pid: 2436, comm: master Not tainted 2.6.33.3-85.fc13.x86_64 #1 0HR330/OptiPlex
    745
    RIP: 0010:[] [] __put_cred+0xc/0x45
    RSP: 0018:ffff88019e7e9eb8 EFLAGS: 00010202
    RAX: 0000000000000001 RBX: ffff880161514480 RCX: 00000000ffffffff
    RDX: 00000000ffffffff RSI: ffff880140c690c0 RDI: ffff880140c690c0
    RBP: ffff88019e7e9eb8 R08: 00000000000000d0 R09: 0000000000000000
    R10: 0000000000000001 R11: 0000000000000040 R12: ffff880140c690c0
    R13: ffff88019e77aea0 R14: 00007fff336b0a5c R15: 0000000000000001
    FS: 00007f12f50d97c0(0000) GS:ffff880007400000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f8f461bc000 CR3: 00000001b26ce000 CR4: 00000000000006f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process master (pid: 2436, threadinfo ffff88019e7e8000, task ffff88019e77aea0)
    Stack:
    ffff88019e7e9ec8 ffffffff810698cd ffff88019e7e9ef8 ffffffff81069b45
    ffff880161514180 ffff880161514480 ffff880161514180 0000000000000000
    ffff88019e7e9f28 ffffffff8106aace 0000000000000001 0000000000000246
    Call Trace:
    [] put_cred+0x13/0x15
    [] commit_creds+0x16b/0x175
    [] set_current_groups+0x47/0x4e
    [] sys_setgroups+0xf6/0x105
    [] system_call_fastpath+0x16/0x1b
    Code: 48 8d 71 ff e8 7e 4e 15 00 85 c0 78 0b 8b 75 ec 48 89 df e8 ef 4a 15 00
    48 83 c4 18 5b c9 c3 55 8b 07 8b 07 48 89 e5 85 c0 74 04 0b eb fe 65 48 8b
    04 25 00 cc 00 00 48 3b b8 58 04 00 00 75
    RIP [] __put_cred+0xc/0x45
    RSP
    ---[ end trace df391256a100ebdd ]---

    Signed-off-by: David Howells
    Acked-by: Jiri Olsa
    Signed-off-by: Linus Torvalds

    David Howells
     

28 Jul, 2010

1 commit

  • The command

    echo "file ec.c +p" >/sys/kernel/debug/dynamic_debug/control

    causes an oops.

    Move the call to ddebug_remove_module() down into free_module(). In this
    way it should be called from all error paths. Currently, we are missing
    the removal if the module init routine fails.
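
    A rough sketch of the intended placement (the rest of free_module() is
    elided; the point is only that the teardown now runs on every path that
    frees the module, including the init-failure path):

    static void free_module(struct module *mod)
    {
        /* ... sysfs teardown, exception table removal, etc. elided ... */

        /* Remove the dynamic debug entries on every path that frees the
         * module, including the one taken when the module init fails. */
        ddebug_remove_module(mod->name);

        /* ... release the module's memory ... */
    }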

    Signed-off-by: Jason Baron
    Reported-by: Thomas Renninger
    Tested-by: Thomas Renninger
    Cc: [2.6.32+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Baron
     

22 Jul, 2010

5 commits


19 Jul, 2010

2 commits

  • With commits 08677214 and 59be5a8e, alloc_bootmem()/free_bootmem() and
    friends use the early_res functions for memory management when
    NO_BOOTMEM is enabled. This patch adds the kmemleak calls in the
    corresponding code paths for bootmem allocations.
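
    A hedged sketch of the shape of such a hook in the NO_BOOTMEM allocation
    path (the surrounding function and variable names are approximate, not the
    exact patch):

        /* ... early_res allocation done; addr/size describe the block ... */
        ptr = phys_to_virt(addr);
        memset(ptr, 0, size);
        /*
         * min_count == 0: let kmemleak track the block but never
         * report it as a leak.
         */
        kmemleak_alloc(ptr, size, 0, 0);
        return ptr;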

    Signed-off-by: Catalin Marinas
    Acked-by: Pekka Enberg
    Acked-by: Yinghai Lu
    Cc: H. Peter Anvin
    Cc: stable@kernel.org

    Catalin Marinas
     
  • pavel@suse.cz no longer works, replace it with working address.

    Signed-off-by: Pavel Machek
    Signed-off-by: Jiri Kosina

    Pavel Machek
     

12 Jul, 2010

1 commit


05 Jul, 2010

1 commit

  • We should initialize the module dynamic debug data structures
    only after determining that the module is not loaded yet. This
    fixes a bug introduced in 2.6.35-rc2 where, when trying to load
    a module twice, we also load its dynamic printing data twice,
    which causes all sorts of nasty issues. Also handle the dynamic
    debug cleanup later on failure.
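
    In load_module() terms the reordering amounts to something like the
    following (variable names illustrative, not the exact 2.6.35 code):

        /* Reject a duplicate before touching any dynamic debug state. */
        if (find_module(mod->name)) {
            err = -EEXIST;
            goto free_mod;        /* no ddebug state to clean up yet */
        }

        /* Only a genuinely new module registers its debug table. */
        dynamic_debug_setup(debug, num_debug);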

    Signed-off-by: Yehuda Sadeh
    Signed-off-by: Rusty Russell (removed a #ifdef)
    Signed-off-by: Linus Torvalds

    Yehuda Sadeh
     

03 Jul, 2010

1 commit


01 Jul, 2010

2 commits

  • Commit 0224cf4c5e (sched: Intoduce get_cpu_iowait_time_us())
    broke things by not making sure preemption was indeed disabled
    by the callers of nr_iowait_cpu(), which took the iowait value of
    the current cpu.

    This resulted in a heap of preempt warnings. Cure this by making
    nr_iowait_cpu() take a cpu number and fix up the callers to pass
    in the right number.
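
    A sketch of the reworked helper, roughly matching the kernel/sched.c
    change; callers now pass the cpu they actually mean instead of implicitly
    using the current one:

    unsigned long nr_iowait_cpu(int cpu)
    {
        struct rq *this = cpu_rq(cpu);

        return atomic_read(&this->nr_iowait);
    }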

    Signed-off-by: Peter Zijlstra
    Cc: Arjan van de Ven
    Cc: Sergey Senozhatsky
    Cc: Rafael J. Wysocki
    Cc: Maxim Levitsky
    Cc: Len Brown
    Cc: Pavel Machek
    Cc: Jiri Slaby
    Cc: linux-pm@lists.linux-foundation.org
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • futex_find_get_task is currently used (through lookup_pi_state) from two
    contexts, futex_requeue and futex_lock_pi_atomic. Neither path looks like
    it needs the credentials check, though. Different (e)uids shouldn't matter
    at all, because the only thing that is important for a shared futex is the
    accessibility of the shared memory.

    The credential check results in a glibc assert failure or a process hang
    (if glibc is compiled without assert support) for a shared robust pthread
    mutex with priority inheritance if a process tries to lock an already held
    lock owned by a process with a different euid:

    pthread_mutex_lock.c:312: __pthread_mutex_lock_full: Assertion `(-(e)) != 3 || !robust' failed.

    The problem is that futex_lock_pi_atomic, which is called when we try to
    lock an already held lock, checks the current holder (the tid is stored in
    the futex value) to get the PI state. It uses lookup_pi_state, which in
    turn gets the task struct from futex_find_get_task. ESRCH is returned
    either when the task is not found or when the credentials check fails.

    futex_lock_pi_atomic simply returns if it gets ESRCH. The glibc code,
    however, doesn't expect a robust lock to return ESRCH, because it should
    get either success or owner-died.
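
    With the check dropped, futex_find_get_task() reduces to a plain pid
    lookup under RCU, roughly:

    struct task_struct *futex_find_get_task(pid_t pid)
    {
        struct task_struct *p;

        rcu_read_lock();
        p = find_task_by_vpid(pid);
        if (p)
            get_task_struct(p);
        rcu_read_unlock();

        return p;
    }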

    Signed-off-by: Michal Hocko
    Acked-by: Darren Hart
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Nick Piggin
    Cc: Alexey Kuznetsov
    Cc: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

30 Jun, 2010

1 commit

  • When crashkernel is not enabled, "echo 0 > /sys/kernel/kexec_crash_size"
    OOPSes the kernel in crash_shrink_memory. This happens when
    crash_shrink_memory tries to release the 'crashk_res' resource, which is
    not reserved. Also, the value of "/sys/kernel/kexec_crash_size" shows as 1
    when it should be 0.

    This patch fixes the OOPS in crash_shrink_memory and shows
    "/sys/kernel/kexec_crash_size" as 0 when crash kernel memory is not
    reserved.
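
    A hedged sketch of the reporting side of the fix: treat an unreserved
    crashk_res (start == end) as size 0 instead of computing end - start + 1,
    with a matching guard in crash_shrink_memory():

    size_t crash_get_memory_size(void)
    {
        size_t size = 0;

        mutex_lock(&kexec_mutex);
        if (crashk_res.end != crashk_res.start)
            size = resource_size(&crashk_res);
        mutex_unlock(&kexec_mutex);

        return size;
    }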

    Signed-off-by: Pavan Naregundi
    Reviewed-by: WANG Cong
    Cc: Simon Horman
    Cc: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavan Naregundi
     

29 Jun, 2010

5 commits


25 Jun, 2010

1 commit

  • GCC 4.4.1 on ARM has been observed to replace the while loop in
    sched_avg_update with a call to uldivmod, resulting in the
    following build failure at link-time:

    kernel/built-in.o: In function `sched_avg_update':
    kernel/sched.c:1261: undefined reference to `__aeabi_uldivmod'
    kernel/sched.c:1261: undefined reference to `__aeabi_uldivmod'
    make: *** [.tmp_vmlinux1] Error 1

    This patch introduces a fake data hazard to the loop body to
    prevent the compiler optimising the loop away.
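
    The hazard is an empty asm statement with a "+rm" constraint on the loop
    variable, so GCC must assume it can change and cannot collapse the loop
    into a call to __aeabi_uldivmod. A sketch of the resulting loop:

    static void sched_avg_update(struct rq *rq)
    {
        s64 period = sched_avg_period();

        while ((s64)(rq->clock - rq->age_stamp) > period) {
            /*
             * Empty asm: tells the compiler age_stamp may change here,
             * which prevents the loop from being turned into a divmod.
             */
            asm("" : "+rm" (rq->age_stamp));
            rq->age_stamp += period;
            rq->rt_avg /= 2;
        }
    }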

    Signed-off-by: Will Deacon
    Signed-off-by: Andrew Morton
    Acked-by: Peter Zijlstra
    Cc: Catalin Marinas
    Cc: Russell King
    Cc: Linus Torvalds
    Cc:
    Signed-off-by: Ingo Molnar

    Will Deacon
     

24 Jun, 2010

1 commit

  • Because cgroup_fork() is run before sched_fork() [ from copy_process() ]
    and the child's pid is not yet visible, the child is pinned to its
    cgroup. Therefore we can silence this warning.
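
    Silencing it amounts to wrapping the offending dereference in sched_fork()
    in an RCU read-side section purely to satisfy lockdep, something like:

        /*
         * Silence PROVE_RCU: the child is not yet hashed and is pinned
         * to its cgroup (cgroup_fork() ran before sched_fork()), so no
         * real protection is needed here.
         */
        rcu_read_lock();
        set_task_cpu(p, cpu);
        rcu_read_unlock();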

    A nicer solution would be moving cgroup_fork() to right after
    dup_task_struct() and exclude PF_STARTING from task_subsys_state().

    Signed-off-by: Peter Zijlstra
    Reviewed-by: Li Zefan
    Signed-off-by: Paul E. McKenney

    Peter Zijlstra
     

23 Jun, 2010

1 commit

  • The task_group() function returns a pointer that must be protected
    by either RCU, the ->alloc_lock, or the cgroup lock (see the
    rcu_dereference_check() in task_subsys_state(), which is invoked by
    task_group()). The wake_affine() function currently does none of these,
    which means that a concurrent update would be within its rights to free
    the structure returned by task_group(). Because wake_affine() uses this
    structure only to compute load-balancing heuristics, there is no reason
    to acquire either of the two locks.

    Therefore, this commit introduces an RCU read-side critical section that
    starts before the first call to task_group() and ends after the last use
    of the "tg" pointer returned from task_group(). Thanks to Li Zefan for
    pointing out the need to extend the RCU read-side critical section from
    that proposed by the original patch.
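
    A sketch of the resulting pattern in wake_affine(), with the load
    computations elided; every use of the tg pointer stays inside the RCU
    read-side critical section:

        rcu_read_lock();

        if (sync) {
            tg = task_group(current);
            /* ... effective_load(tg, ...) for the sync case ... */
        }

        tg = task_group(p);
        /* ... last effective_load(tg, ...) call ... */

        rcu_read_unlock();

        /* the remaining balance check only uses the computed load values */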

    Signed-off-by: Daniel J Blueman
    Signed-off-by: Paul E. McKenney

    Daniel J Blueman
     

18 Jun, 2010

2 commits

  • Commit e70971591 ("sched: Optimize unused cgroup configuration") introduced
    an imbalanced scheduling bug.

    If CGROUP is not used, update_h_load() won't update h_load. When the
    system has far more tasks than logical CPUs, the incorrect
    cfs_rq[cpu]->h_load value causes load_balance() to pull too many tasks
    to the local CPU from the busiest CPU, so the busiest CPU keeps going
    round robin. That hurts performance.

    The issue was originally found with a scientific-calculation workload
    developed by Yanmin. With that commit, the workload's performance drops
    by about 40%.

    CPU : before : after

    00 : 2 : 7
    01 : 1 : 7
    02 : 11 : 6
    03 : 12 : 7
    04 : 6 : 6
    05 : 11 : 7
    06 : 10 : 6
    07 : 12 : 7
    08 : 11 : 6
    09 : 12 : 6
    10 : 1 : 6
    11 : 1 : 6
    12 : 6 : 6
    13 : 2 : 6
    14 : 2 : 6
    15 : 1 : 6

    Reviewed-by: Yanmin zhang
    Signed-off-by: Alex Shi
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Alex,Shi
     
  • Chris Wedgwood reports that 39c0cbe (sched: Rate-limit nohz) causes a
    serial console regression (unresponsiveness), and indeed it does. The
    reason is that the nohz code is skipped even when the tick was already
    stopped before the nohz_ratelimit(cpu) condition changed.

    Move the nohz_ratelimit() check to the other conditions which prevent
    long idle sleeps.

    Reported-by: Chris Wedgwood
    Tested-by: Brian Bloniarz
    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Cc: Jiri Kosina
    Cc: Linus Torvalds
    Cc: Greg KH
    Cc: Alan Cox
    Cc: OGAWA Hirofumi
    Cc: Jef Driesen
    LKML-Reference:
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     

17 Jun, 2010

2 commits


12 Jun, 2010

1 commit


11 Jun, 2010

2 commits

  • With the addition of the code to shrink the kernel tracepoint
    infrastructure, we lost kprobes being traced by perf. The reason
    is that I tested if the "tp_event->class->perf_probe" existed before
    enabling it. This prevents "ftrace only" events (like the function
    trace events) from being enabled by perf.

    Unfortunately, kprobe events do not use perf_probe. This causes
    kprobes to be missed by perf. To fix this, we add the test to
    see if "tp_event->class->reg" exists as well as perf_probe.

    Normal trace events have only "perf_probe" but no "reg" function,
    and kprobes and syscalls have the "reg" but no "perf_probe".
    The ftrace unique events do not have either, so this is a valid
    test. If a kprobe or syscall is not to be probed by perf, the
    "reg" function is called anyway, and will return a failure and
    prevent perf from probing it.
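
    The test in perf_trace_init() therefore becomes, roughly (a sketch of the
    condition after the fix, not the exact patch):

        if (tp_event->class &&
            (tp_event->class->perf_probe || tp_event->class->reg))
            ret = perf_trace_event_init(tp_event, p_event);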

    Reported-by: Srikar Dronamraju
    Tested-by: Srikar Dronamraju
    Acked-by: Peter Zijlstra
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • …/git/tip/linux-2.6-tip

    * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    tracing: Fix null pointer deref with SEND_SIG_FORCED
    perf: Fix signed comparison in perf_adjust_period()
    powerpc/oprofile: fix potential buffer overrun in op_model_cell.c
    perf symbols: Set the DSO long name when using symbol_conf.vmlinux_name

    Linus Torvalds
     

10 Jun, 2010

1 commit


09 Jun, 2010

3 commits

  • The set_type() function can change the chip implementation when the
    trigger mode changes. That might result in using a non-initialized
    irq chip when called from __setup_irq() or when called via
    set_irq_type() on an already enabled irq.

    The set_irq_type() function should not be called on an enabled irq,
    but because we forgot to put a check into it, we have a bunch of users
    which grew the habit of doing that. It never blew up, because the
    function is serialized via desc->lock against all users of desc->chip,
    so they never hit the non-initialized irq chip issue.

    The easy fix for the __setup_irq() issue would be to move the
    irq_chip_set_defaults(desc->chip) call after the trigger setting to
    make sure that a chip change is covered.

    But as we have already users, which do the type setting after
    request_irq(), the safe fix for now is to call irq_chip_set_defaults()
    from __irq_set_trigger() when desc->set_type() changed the irq chip.
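
    Roughly, that safe fix is a check after the set_type() call, along these
    lines (a sketch, not the exact patch):

        /* ->set_type() may have replaced desc->chip; if so, make sure the
         * new chip has its default handlers filled in. */
        if (chip != desc->chip)
            irq_chip_set_defaults(desc->chip);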

    Whether we should refuse to change the chip on an already enabled irq
    needs deeper analysis, but that would be a large-scale change to fix all
    the existing users, so it's neither stable nor 2.6.35 material.

    Reported-by: Esben Haabendal
    Signed-off-by: Thomas Gleixner
    Cc: Benjamin Herrenschmidt
    Cc: linuxppc-dev
    Cc: stable@kernel.org

    Thomas Gleixner
     
  • PROVE_RCU has a few issues with the cpu_cgroup because the scheduler
    typically holds rq->lock around the css rcu derefs but the generic
    cgroup code doesn't (and can't) know about that lock.

    Provide means to add extra checks to the css dereference and use that
    in the scheduler to annotate its users.

    The addition of rq->lock to these checks is correct because the
    cgroup_subsys::attach() method takes the rq->lock for each task it
    moves, therefore by holding that lock, we ensure the task is pinned to
    the current cgroup and the RCU dereference is valid.
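
    In sched.c the annotation ends up looking roughly like this, telling
    lockdep that holding the runqueue lock is an acceptable alternative to RCU
    here (a sketch, not the exact patch):

    static inline struct task_group *task_group(struct task_struct *p)
    {
        struct cgroup_subsys_state *css;

        /* rq->lock pins the task to its cgroup, see the attach() argument */
        css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
                                      lockdep_is_held(&task_rq(p)->lock));
        return container_of(css, struct task_group, css);
    }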

    That leaves one genuine race in __sched_setscheduler() where we used
    task_group() without holding any of the required locks and thus raced
    with the cgroup code. Solve this by moving the check under the
    appropriate lock.

    Signed-off-by: Peter Zijlstra
    Cc: "Paul E. McKenney"
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Frederic reported that frequency driven swevents didn't work properly
    and even caused a division-by-zero error.

    It turns out there are two bugs: the division-by-zero comes from a
    failure to deal with that case in perf_calculate_period().

    The other was more interesting and turned out to be a wrong comparison
    in perf_adjust_period(). The comparison was between an s64 and a u64 and
    got implicitly converted to an unsigned comparison. The problem is
    that period_left is typically < 0, so the test ended up being always true.

    Cure this by making the local period variables s64.
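
    A tiny userspace illustration of the pitfall (not kernel code): mixing a
    signed and an unsigned 64-bit value promotes the signed side to unsigned,
    so a negative period_left compares as a huge positive number:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        int64_t period_left = -1;   /* typically < 0 at this point */
        uint64_t threshold = 100;

        /* Mixed comparison: period_left is converted to unsigned, so -1
         * becomes 0xffffffffffffffff and the test is always true. */
        if (period_left > threshold)
            printf("mixed s64/u64 comparison taken (the bug)\n");

        /* With both operands signed the intended result comes out. */
        if (period_left > (int64_t)threshold)
            printf("signed comparison taken\n");
        else
            printf("signed comparison not taken (correct)\n");

        return 0;
    }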

    Reported-by: Frederic Weisbecker
    Tested-by: Frederic Weisbecker
    Signed-off-by: Peter Zijlstra
    Cc:
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

05 Jun, 2010

5 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
    module: fix bne2 "gave up waiting for init of module libcrc32c"
    module: verify_export_symbols under the lock
    module: move find_module check to end
    module: make locking more fine-grained.
    module: Make module sysfs functions private.
    module: move sysfs exposure to end of load_module
    module: fix kdb's illicit use of struct module_use.
    module: Make the 'usage' lists be two-way

    Linus Torvalds
     
  • Problem: it's hard to avoid an init routine stumbling over a
    request_module these days. And it's not clear it's always a bad idea:
    for example, a module like kvm with dynamic dependencies on kvm-intel
    or kvm-amd would be neater if it could simply request_module the right
    one.

    In this particular case, it's libcrc32c:

    libcrc32c_mod_init
    crypto_alloc_shash
    crypto_alloc_tfm
    crypto_find_alg
    crypto_alg_mod_lookup
    crypto_larval_lookup
    request_module

    If another module is waiting inside resolve_symbol() for libcrc32c to
    finish initializing (ie. bne2 depends on libcrc32c) then it does so
    holding the module lock, and our request_module() can't make progress
    until that is released.

    Waiting inside resolve_symbol() without the lock isn't all that hard:
    we just need to pass the -EBUSY up the call chain so we can sleep
    where we don't hold the lock. Error reporting is a bit trickier: we
    need to copy the name of the unfinished module before releasing the
    lock.

    Other notes:
    1) This also fixes a theoretical issue where a weak dependency would allow
    symbol version mismatches to be ignored.
    2) We rename use_module to ref_module to make life easier for the only
    external user (the out-of-tree ksplice patches).

    Signed-off-by: Rusty Russell
    Cc: Linus Torvalds
    Cc: Tim Abbot
    Tested-by: Brandon Philips

    Rusty Russell
     
  • It disabled preempt so it was "safe", but nothing stops another module
    slipping in before this module is added to the global list, now that we
    don't hold the lock the whole time.

    So we check this just after we check for duplicate modules, and just
    before we put the module in the global list.

    (find_symbol finds symbols in coming and going modules, too).

    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • I think Rusty may have made the lock a bit _too_ finegrained there, and
    didn't add it to some places that needed it. It looks, for example, like
    PATCH 1/2 actually drops the lock in places where it's needed
    ("find_module()" is documented to need it, but now load_module() didn't
    hold it at all when it did the find_module()).

    Rather than adding a new "module_loading" list, I think we should be able
    to just use the existing "modules" list, and just fix up the locking a
    bit.

    In fact, maybe we could just move the "look up existing module" a bit
    later - optimistically assuming that the module doesn't exist, and then
    just undoing the work if it turns out that we were wrong, just before
    adding ourselves to the list.

    Signed-off-by: Rusty Russell

    Linus Torvalds
     
  • Kay Sievers reports that we still have some
    contention over module loading which is slowing boot.

    Linus also disliked a previous "drop lock and regrab" patch to fix the
    bne2 "gave up waiting for init of module libcrc32c" message.

    This is more ambitious: we only grab the lock where we need it.

    Signed-off-by: Rusty Russell
    Cc: Brandon Philips
    Cc: Kay Sievers
    Cc: Linus Torvalds

    Rusty Russell