07 Dec, 2012

2 commits


05 Dec, 2012

4 commits

  • Don't use enum-type bitfields in the module signature info block as we can't be
    certain how the compiler will handle them. As I understand it, it is arch
    dependent, and it is possible for the compiler to rearrange them based on
    endianness and to insert a byte of padding to pad the three enums out to four
    bytes.

    Instead use u8 fields for these, which the compiler should emit in the right
    order without padding.

    Signed-off-by: David Howells
    Signed-off-by: Rusty Russell

    David Howells
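
The layout argument above can be illustrated with a small sketch. The struct below is a hypothetical stand-in for the signature info block: its name and exact field set are assumptions for illustration, not copied from the patch. Each selector is a plain one-byte field, so offsets do not depend on how a compiler packs enum bit-fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical info block: every selector is a plain byte, never an enum bit-field. */
struct sig_info_sketch {
    uint8_t  algo;        /* would carry a public-key algorithm enum value */
    uint8_t  hash;        /* would carry a digest algorithm enum value */
    uint8_t  id_type;     /* would carry a key identifier type enum value */
    uint8_t  signer_len;  /* length of the signer's name */
    uint8_t  key_id_len;  /* length of the key identifier */
    uint8_t  pad[3];      /* explicit padding rather than compiler-chosen padding */
    uint32_t sig_len;     /* a real on-disk format would also pin the byte order */
};

/* Compile-time checks: byte fields are emitted in declaration order with no gaps. */
_Static_assert(offsetof(struct sig_info_sketch, algo) == 0, "algo is first");
_Static_assert(offsetof(struct sig_info_sketch, hash) == 1, "hash follows algo");
_Static_assert(offsetof(struct sig_info_sketch, id_type) == 2, "id_type follows hash");
_Static_assert(offsetof(struct sig_info_sketch, sig_len) == 8, "sig_len is 4-byte aligned");

/* Size of the block as seen by code that parses it. */
size_t sig_info_size(void)
{
    return sizeof(struct sig_info_sketch);
}
```

With explicit padding the block is 12 bytes on common ABIs, and a parser can index it without caring about bit-field ordering.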
     
  • Norbert reported:
    "3.7-rc6 booted with nmi_watchdog=0 fails to suspend to RAM or
    offline CPUs. It's reproducable with a KVM guest and physical
    system."

    The reason is that commit bcd951cf ("watchdog: Use hotplug thread
    infrastructure") did not take the nmi_watchdog=0 case into account,
    so the CPU offline code gets stuck in the teardown function because
    it accesses uninitialized data structures.

    Add a check for watchdog_enabled into that path to cure the issue.

    Reported-and-tested-by: Norbert Warmuth
    Tested-by: Joseph Salisbury
    Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1211231033230.2701@ionos
    Link: http://bugs.launchpad.net/bugs/1079534
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Pull module fixes from Rusty Russell:
    "Module signing build fixes for blackfin and metag"

    * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
    modsign: add symbol prefix to certificate list
    linux/kernel.h: define SYMBOL_PREFIX

    Linus Torvalds
     
  • Pull workqueue fixes from Tejun Heo:
    "So, safe fixes my ass.

    Commit 8852aac25e79 ("workqueue: mod_delayed_work_on() shouldn't queue
    timer on 0 delay") had the side-effect of performing delayed_work
    sanity checks even when @delay is 0, which should be fine for any sane
    use cases.

    Unfortunately, megaraid was being overly ingenious. It seemingly
    wanted to use cancel_delayed_work_sync() before cancel_work_sync() was
    introduced, but didn't want to waste the space for full delayed_work
    as it was only going to use 0 @delay. So, it only allocated space for
    struct work_struct and then cast it to struct delayed_work and passed
    it into delayed_work functions - truly awesome engineering tradeoff to
    save some bytes.

    Xiaotian fixed it by making megaraid allocate full delayed_work for
    now. It should be converted to use work_struct and cancel_work_sync()
    but I think we'd better do that after 3.7.

    I added another commit to change BUG_ON()s in __queue_delayed_work()
    to WARN_ON_ONCE()s so that the kernel doesn't crash even if there are
    more such abuses."

    * 'for-3.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: convert BUG_ON()s in __queue_delayed_work() to WARN_ON_ONCE()s
    megaraid: fix BUG_ON() from incorrect use of delayed work

    Linus Torvalds
     

04 Dec, 2012

2 commits

  • 8852aac25e ("workqueue: mod_delayed_work_on() shouldn't queue timer on
    0 delay") unexpectedly uncovered a very nasty abuse of delayed_work in
    megaraid - it allocated work_struct, cast it to delayed_work and
    then passed that into queue_delayed_work().

    Previously, this was okay because 0 @delay short-circuited to
    queue_work() before doing anything with delayed_work. 8852aac25e
    moved 0 @delay test into __queue_delayed_work() after sanity check on
    delayed_work making megaraid trigger BUG_ON().

    Although megaraid is already fixed by c1d390d8e6 ("megaraid: fix
    BUG_ON() from incorrect use of delayed work"), this patch converts
    BUG_ON()s in __queue_delayed_work() to WARN_ON_ONCE()s so that such
    abusers, if there are more, trigger warning but don't crash the
    machine.

    Signed-off-by: Tejun Heo
    Cc: Xiaotian Feng

    Tejun Heo
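
The difference between crashing and warning once can be sketched in user space. Everything below is a simplified model under stated assumptions: `warn_on_once()`, `delayed_work_model` and the other names are hypothetical and only mimic the idea of checking a delayed work item's timer setup, reporting a violation a single time, and then carrying on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

static int warnings_emitted;  /* counts messages so the behaviour is observable */

/* Report a violated expectation the first time only, then let the caller continue. */
static bool warn_on_once(bool cond, bool *warned, const char *what)
{
    if (cond && !*warned) {
        *warned = true;
        warnings_emitted++;
        fprintf(stderr, "WARNING: %s\n", what);
    }
    return cond;
}

struct timer_model {
    void (*function)(void *);
    void *data;
};

struct delayed_work_model {
    int times_queued;
    struct timer_model timer;
};

/* Timer callback a correctly initialised item is expected to carry. */
static void delayed_timer_fn(void *data)
{
    ((struct delayed_work_model *)data)->times_queued++;
}

void init_delayed_work_model(struct delayed_work_model *dw)
{
    dw->times_queued = 0;
    dw->timer.function = delayed_timer_fn;
    dw->timer.data = dw;
}

/* Sanity-check the item, warn once if it looks abused, and queue it anyway. */
void queue_delayed_model(struct delayed_work_model *dw)
{
    static bool warned;

    warn_on_once(dw->timer.function != delayed_timer_fn || dw->timer.data != dw,
                 &warned, "delayed work timer was never initialised");
    dw->times_queued++;
}
```

An item whose timer fields were never set up (the kind of abuse described above) produces one warning no matter how often it is queued, and the program keeps running.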
     
  • This reverts commit 800d4d30c8f20bd728e5741a3b77c4859a613f7c.

    Between commits 8323f26ce342 ("sched: Fix race in task_group()") and
    800d4d30c8f2 ("sched, autogroup: Stop going ahead if autogroup is
    disabled"), autogroup is a wreck.

    With both applied, all you have to do to crash a box is disable
    autogroup during boot up, then reboot.. boom, NULL pointer dereference
    due to commit 800d4d30c8f2 not allowing autogroup to move things, and
    commit 8323f26ce342 making that the only way to switch runqueues:

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] effective_load.isra.43+0x50/0x90
    Pid: 7047, comm: systemd-user-se Not tainted 3.6.8-smp #7 MEDIONPC MS-7502/MS-7502
    RIP: effective_load.isra.43+0x50/0x90
    Process systemd-user-se (pid: 7047, threadinfo ffff880221dde000, task ffff88022618b3a0)
    Call Trace:
    select_task_rq_fair+0x255/0x780
    try_to_wake_up+0x156/0x2c0
    wake_up_state+0xb/0x10
    signal_wake_up+0x28/0x40
    complete_signal+0x1d6/0x250
    __send_signal+0x170/0x310
    send_signal+0x40/0x80
    do_send_sig_info+0x47/0x90
    group_send_sig_info+0x4a/0x70
    kill_pid_info+0x3a/0x60
    sys_kill+0x97/0x1a0
    ? vfs_read+0x120/0x160
    ? sys_read+0x45/0x90
    system_call_fastpath+0x16/0x1b
    Code: 49 0f af 41 50 31 d2 49 f7 f0 48 83 f8 01 48 0f 46 c6 48 2b 07 48 8b bf 40 01 00 00 48 85 ff 74 3a 45 31 c0 48 8b 8f 50 01 00 00 8b 11 4c 8b 89 80 00 00 00 49 89 d2 48 01 d0 45 8b 59 58 4c
    RIP [] effective_load.isra.43+0x50/0x90
    RSP
    CR2: 0000000000000000

    Signed-off-by: Mike Galbraith
    Acked-by: Ingo Molnar
    Cc: Yong Zhang
    Cc: Peter Zijlstra
    Cc: stable@vger.kernel.org # 2.6.39+
    Signed-off-by: Linus Torvalds

    Mike Galbraith
     

03 Dec, 2012

1 commit

  • Add the arch symbol prefix (if applicable) to the asm definition of
    modsign_certificate_list and modsign_certificate_list_end. This uses the
    recently defined SYMBOL_PREFIX which is derived from
    CONFIG_SYMBOL_PREFIX.

    This fixes the build of module signing on the blackfin and metag
    architectures.

    Signed-off-by: James Hogan
    Cc: Rusty Russell
    Cc: David Howells
    Cc: Mike Frysinger
    Signed-off-by: Rusty Russell

    James Hogan
     

02 Dec, 2012

3 commits

  • 8376fe22c7 ("workqueue: implement mod_delayed_work[_on]()")
    implemented mod_delayed_work[_on]() using the improved
    try_to_grab_pending(). The function is later used, among others, to
    replace [__]cancel_delayed_work() + queue_delayed_work() combinations.

    Unfortunately, a delayed_work item w/ zero @delay is handled slightly
    differently by mod_delayed_work_on() compared to
    queue_delayed_work_on(). The latter skips timer altogether and
    directly queues it using queue_work_on() while the former schedules
    timer which will expire on the closest tick. This means, when @delay
    is zero, that [__]cancel_delayed_work() + queue_delayed_work_on()
    makes the target item immediately executable while
    mod_delayed_work_on() may induce a delay of up to a full tick.

    This somewhat subtle difference breaks some of the converted users.
    e.g. block queue plugging uses delayed_work for deferred processing
    and uses mod_delayed_work_on() when the queue needs to be immediately
    unplugged. The above problem manifested as noticeably higher number
    of context switches under certain circumstances.

    The difference in behavior was caused by missing special case handling
    for 0 delay in mod_delayed_work_on() compared to
    queue_delayed_work_on(). Joonsoo Kim posted a patch to add it -
    ("workqueue: optimize mod_delayed_work_on() when @delay == 0")[1].
    The patch was queued for 3.8 but it was described as optimization and
    I missed that it was a correctness issue.

    As both queue_delayed_work_on() and mod_delayed_work_on() use
    __queue_delayed_work() for queueing, it seems that the better approach
    is to move the 0 delay special handling to the function instead of
    duplicating it in mod_delayed_work_on().

    Fix the problem by moving 0 delay special case handling from
    queue_delayed_work_on() to __queue_delayed_work(). This replaces
    Joonsoo's patch.

    [1] http://thread.gmane.org/gmane.linux.kernel/1379011/focus=1379012

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: Anders Kaseorg
    Reported-and-tested-by: Zlatko Calusic
    LKML-Reference:
    LKML-Reference:
    Cc: Joonsoo Kim

    Tejun Heo
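
The restructuring described above, where the zero-delay case is handled once in the shared helper rather than in one caller, might look roughly like this in a user-space model. The types and function names (`delayed_item`, `queue_delayed_common()` and so on) are assumptions made for illustration and are not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>

enum item_state { ITEM_IDLE, ITEM_TIMER_ARMED, ITEM_QUEUED };

struct delayed_item {
    enum item_state state;
    unsigned long expires;  /* only meaningful while the timer is armed */
};

/* Shared helper: the zero-delay special case lives here, exactly once. */
static void queue_delayed_common(struct delayed_item *it,
                                 unsigned long now, unsigned long delay)
{
    if (delay == 0) {
        it->state = ITEM_QUEUED;  /* skip the timer, run as soon as possible */
        return;
    }
    it->expires = now + delay;
    it->state = ITEM_TIMER_ARMED;
}

/* Queue only if the item is idle; returns false if it was already pending. */
bool queue_delayed(struct delayed_item *it, unsigned long now, unsigned long delay)
{
    if (it->state != ITEM_IDLE)
        return false;
    queue_delayed_common(it, now, delay);
    return true;
}

/* Re-arm regardless of the current state; returns true if it was pending. */
bool mod_delayed(struct delayed_item *it, unsigned long now, unsigned long delay)
{
    bool was_pending = it->state != ITEM_IDLE;

    it->state = ITEM_IDLE;  /* models grabbing the pending item */
    queue_delayed_common(it, now, delay);
    return was_pending;
}
```

Because both entry points go through the same helper, a zero delay makes the item immediately runnable from either path, which is the behaviour the converted users relied on.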
     
  • A rescue thread exiting TASK_INTERRUPTIBLE can lead to a task scheduling
    off, never to be seen again. In the case where this occurred, an exiting
    thread hit reiserfs homebrew conditional resched while holding a mutex,
    bringing the box to its knees.

    PID: 18105 TASK: ffff8807fd412180 CPU: 5 COMMAND: "kdmflush"
    #0 [ffff8808157e7670] schedule at ffffffff8143f489
    #1 [ffff8808157e77b8] reiserfs_get_block at ffffffffa038ab2d [reiserfs]
    #2 [ffff8808157e79a8] __block_write_begin at ffffffff8117fb14
    #3 [ffff8808157e7a98] reiserfs_write_begin at ffffffffa0388695 [reiserfs]
    #4 [ffff8808157e7ad8] generic_perform_write at ffffffff810ee9e2
    #5 [ffff8808157e7b58] generic_file_buffered_write at ffffffff810eeb41
    #6 [ffff8808157e7ba8] __generic_file_aio_write at ffffffff810f1a3a
    #7 [ffff8808157e7c58] generic_file_aio_write at ffffffff810f1c88
    #8 [ffff8808157e7cc8] do_sync_write at ffffffff8114f850
    #9 [ffff8808157e7dd8] do_acct_process at ffffffff810a268f
    [exception RIP: kernel_thread_helper]
    RIP: ffffffff8144a5c0 RSP: ffff8808157e7f58 RFLAGS: 00000202
    RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: ffffffff8107af60 RDI: ffff8803ee491d18
    RBP: 0000000000000000 R8: 0000000000000000 R9: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
    R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018

    Signed-off-by: Mike Galbraith
    Signed-off-by: Tejun Heo
    Cc: stable@vger.kernel.org

    Mike Galbraith
     
  • Pull perf fixes from Ingo Molnar:
    "This is mostly about unbreaking architectures that took the UAPI
    changes in the v3.7 cycle, plus misc fixes."

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf kvm: Fix building perf kvm on non x86 arches
    perf kvm: Rename perf_kvm to perf_kvm_stat
    perf: Make perf build for x86 with UAPI disintegration applied
    perf powerpc: Use uapi/unistd.h to fix build error
    tools: Pass the target in descend
    tools: Honour the O= flag when tool build called from a higher Makefile
    tools: Define a Makefile function to do subdir processing
    x86: Export asm/{svm.h,vmx.h,perf_regs.h}
    perf tools: Fix strbuf_addf() when the buffer needs to grow
    perf header: Fix numa topology printing
    perf, powerpc: Fix hw breakpoints returning -ENOSPC

    Linus Torvalds
     

27 Nov, 2012

2 commits

  • Dave Jones reported a bug with futex_lock_pi() that his trinity test
    exposed. Sometime between queue_me() and taking the q.lock_ptr, the
    lock_ptr became NULL, resulting in a crash.

    While futex_wake() is careful to not call wake_futex() on futex_q's with
    a pi_state or an rt_waiter (which are either waiting for a
    futex_unlock_pi() or a PI futex_requeue()), futex_wake_op() and
    futex_requeue() do not perform the same test.

    Update futex_wake_op() and futex_requeue() to test for q.pi_state and
    q.rt_waiter and abort with -EINVAL if detected. To ensure any future
    breakage is caught, add a WARN() to wake_futex() if the same condition
    is true.

    This fix has seen 3 hours of testing with "trinity -c futex" on an
    x86_64 VM with 4 CPUS.

    [akpm@linux-foundation.org: tidy up the WARN()]
    Signed-off-by: Darren Hart
    Reported-by: Dave Jones
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: John Kacur
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darren Hart
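
A minimal sketch of the guard described above, assuming a simplified waiter record: `futex_q_model` and `wake_plain_waiter()` are hypothetical names, and only the "refuse waiters that carry pi_state or rt_waiter" test is modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified waiter: non-NULL pi_state or rt_waiter marks a PI-related waiter. */
struct futex_q_model {
    const void *pi_state;
    const void *rt_waiter;
    int woken;
};

/* Wake a plain waiter; refuse PI waiters, which need the PI unlock or requeue path. */
int wake_plain_waiter(struct futex_q_model *q)
{
    if (q->pi_state != NULL || q->rt_waiter != NULL)
        return -EINVAL;
    q->woken = 1;
    return 0;
}
```

Callers on the wake-op and requeue paths would propagate the `-EINVAL` instead of waking the waiter.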
     
  • In get_sample_period(), unsigned long is not enough:

    watchdog_thresh * 2 * (NSEC_PER_SEC / 5)

    case1:
    watchdog_thresh is 10 by default, the sample value will be: 0xEE6B2800

    case2:
    with watchdog_thresh set to 20, the sample value will be: 0x1DCD65000

    In case2, we need to use u64 to express the sample period. Otherwise
    the value is truncated where unsigned long is 32 bits wide, and
    changing the threshold through proc often does not take effect.

    Signed-off-by: liu chuansheng
    Acked-by: Don Zickus
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chuansheng Liu
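
The arithmetic in the two cases can be checked with a short sketch. The function names are hypothetical; the 32-bit variant stands in for a platform where unsigned long is 32 bits wide.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC_SKETCH 1000000000u

/* Sample period computed in 32 bits: wraps once the product exceeds 2^32 - 1. */
uint32_t sample_period_u32(uint32_t thresh)
{
    return (uint32_t)(thresh * 2u * (NSEC_PER_SEC_SKETCH / 5u));
}

/* Sample period computed in 64 bits: the first operand is widened before multiplying. */
uint64_t sample_period_u64(uint32_t thresh)
{
    return (uint64_t)thresh * 2u * (NSEC_PER_SEC_SKETCH / 5u);
}
```

For a threshold of 10 both variants agree on 0xEE6B2800. For 20 the 64-bit result is 0x1DCD65000, while the 32-bit one loses the top bit and yields 0xDCD65000.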
     

13 Nov, 2012

1 commit


01 Nov, 2012

1 commit

  • Siddhesh analyzed a failure in the take over of pi futexes in case the
    owner died and provided a workaround.
    See: http://sourceware.org/bugzilla/show_bug.cgi?id=14076

    The detailed problem analysis shows:

    Futex F is initialized with PTHREAD_PRIO_INHERIT and
    PTHREAD_MUTEX_ROBUST_NP attributes.

    T1 lock_futex_pi(F);

    T2 lock_futex_pi(F);
    --> T2 blocks on the futex and creates pi_state which is associated
    to T1.

    T1 exits
    --> exit_robust_list() runs
    --> Futex F userspace value TID field is set to 0 and
    FUTEX_OWNER_DIED bit is set.

    T3 lock_futex_pi(F);
    --> Succeeds due to the check for F's userspace TID field == 0
    --> Claims ownership of the futex and sets its own TID into the
    userspace TID field of futex F
    --> returns to user space

    T1 --> exit_pi_state_list()
    --> Transfers pi_state to waiter T2 and wakes T2 via
    rt_mutex_unlock(&pi_state->mutex)

    T2 --> acquires pi_state->mutex and gains real ownership of the
    pi_state
    --> Claims ownership of the futex and sets its own TID into the
    userspace TID field of futex F
    --> returns to user space

    T3 --> observes inconsistent state

    This problem is independent of UP/SMP, preemptible/non preemptible
    kernels, or process shared vs. private. The only difference is that
    certain configurations are more likely to expose it.

    So as Siddhesh correctly analyzed the following check in
    futex_lock_pi_atomic() is the culprit:

    if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) {

    We check the userspace value for a TID value of 0 and take over the
    futex unconditionally if that's true.

    AFAICT this check is there as it is correct for a different corner
    case of futexes: the WAITERS bit became stale.

    Now the proposed change

    - if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) {
    + if (unlikely(ownerdied ||
    + !(curval & (FUTEX_TID_MASK | FUTEX_WAITERS)))) {

    solves the problem, but it's not obvious why and it wrecks the
    "stale WAITERS bit" case.

    What happens is that, due to the WAITERS bit being set (T2 is blocked
    on that futex), it forces T3 to go through lookup_pi_state(), which
    in the above case returns an existing pi_state and therefore forces T3
    to legitimately fight with T2 over the ownership of the pi_state (via
    pi_state->mutex). Problem solved!

    Though that does not work for the "WAITERS bit is stale" problem
    because if lookup_pi_state() does not find existing pi_state it
    returns -ESRCH (due to TID == 0) which causes futex_lock_pi() to
    return -ESRCH to user space because the OWNER_DIED bit is not set.

    Now there is a different solution to that problem. Do not look at the
    user space value at all and enforce a lookup of possibly available
    pi_state. If pi_state can be found, then the new incoming locker T3
    blocks on that pi_state and legitimately races with T2 to acquire the
    rt_mutex and the pi_state and therefore the proper ownership of the
    user space futex.

    lookup_pi_state() has the correct order of checks. It first tries to
    find a pi_state associated with the user space futex and only if that
    fails it checks for futex TID value = 0. If no pi_state is available
    nothing can create new state at that point because this happens with
    the hash bucket lock held.

    So the above scenario changes to:

    T1 lock_futex_pi(F);

    T2 lock_futex_pi(F);
    --> T2 blocks on the futex and creates pi_state which is associated
    to T1.

    T1 exits
    --> exit_robust_list() runs
    --> Futex F userspace value TID field is set to 0 and
    FUTEX_OWNER_DIED bit is set.

    T3 lock_futex_pi(F);
    --> Finds pi_state and blocks on pi_state->rt_mutex

    T1 --> exit_pi_state_list()
    --> Transfers pi_state to waiter T2 and wakes it via
    rt_mutex_unlock(&pi_state->mutex)

    T2 --> acquires pi_state->mutex and gains ownership of the pi_state
    --> Claims ownership of the futex and sets its own TID into the
    userspace TID field of futex F
    --> returns to user space

    This covers all gazillion points on which T3 might come in between
    T1's exit_robust_list() clearing the TID field and T2 fixing it up. It
    also solves the "WAITERS bit stale" problem by forcing the take over.

    Another benefit of changing the code this way is that it makes it less
    dependent on untrusted user space values and therefore minimizes the
    possible wreckage which might be inflicted.

    As usual after staring for too long at the futex code my brain hurts
    so much that I really want to ditch that whole optimization of
    avoiding the syscall for the non contended case for PI futexes and rip
    out the maze of corner case handling code. Unfortunately we can't as
    user space relies on that existing behaviour, but at least thinking
    about it helps me to preserve my mental sanity. Maybe we should
    nevertheless :)

    Reported-and-tested-by: Siddhesh Poyarekar
    Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1210232138540.2756@ionos
    Acked-by: Darren Hart
    Cc: stable@vger.kernel.org
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
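
The change in the order of checks can be modelled as a pure decision function. This is a heavily simplified sketch: the `lock_action` outcomes and both `decide_*` functions are hypothetical, the "old" variant models only the TID == 0 test, and locking, retries and owner lookup are left out. The three bit masks are the futex word layout from the userspace ABI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FUTEX_WAITERS    0x80000000u
#define FUTEX_OWNER_DIED 0x40000000u
#define FUTEX_TID_MASK   0x3fffffffu

enum lock_action {
    ACT_TAKE_OVER,          /* write our own TID into the futex word */
    ACT_BLOCK_ON_PI_STATE,  /* queue up on the existing pi_state */
    ACT_ATTACH_TO_OWNER     /* nonzero owner TID, no pi_state yet (not modelled further) */
};

/* Old order: trust the user space TID field before looking for kernel state. */
enum lock_action decide_old(uint32_t uval, bool pi_state_exists)
{
    if ((uval & FUTEX_TID_MASK) == 0)
        return ACT_TAKE_OVER;
    return pi_state_exists ? ACT_BLOCK_ON_PI_STATE : ACT_ATTACH_TO_OWNER;
}

/* New order: look for existing pi_state first, consult the TID field only after that. */
enum lock_action decide_new(uint32_t uval, bool pi_state_exists)
{
    if (pi_state_exists)
        return ACT_BLOCK_ON_PI_STATE;
    if ((uval & FUTEX_TID_MASK) == 0)
        return ACT_TAKE_OVER;
    return ACT_ATTACH_TO_OWNER;
}
```

In the T3 scenario the word reads TID 0 with OWNER_DIED and WAITERS set while T2's pi_state still exists: the old order takes over immediately, the new order blocks on the pi_state. A stale WAITERS bit with no pi_state still ends in a take-over.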
     

31 Oct, 2012

1 commit

  • Masaki found and patched a kallsyms issue: the last symbol in a
    module's symtab wasn't transferred. This is because we manually copy
    the zero'th entry (which is always empty) then copy the rest in a loop
    starting at 1, though from src[0]. His fix was minimal; I prefer to
    rewrite the loops in a more standard form.

    There are two loops: one to get the size, and one to copy. Make these
    identical: always count entry 0 and any defined symbol in an allocated
    non-init section.

    This bug has existed since the following commit was introduced:
    module: reduce symbol table for loaded modules (v2)
    commit: 4a4962263f07d14660849ec134ee42b63e95ea9a

    LKML: http://lkml.org/lkml/2012/10/24/27
    Reported-by: Masaki Kimura
    Cc: stable@kernel.org

    Rusty Russell
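
The "two identical loops" idea can be sketched as follows. The `sym_model` type and the `core` flag (standing in for "defined symbol in an allocated non-init section") are assumptions for illustration; the point is that counting and copying share one predicate and one loop shape, so they cannot disagree about the last entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sym_model {
    const char *name;
    bool core;  /* stands in for "defined, in an allocated non-init section" */
};

/* Single predicate used by both passes: entry 0 is always kept. */
static bool keep_symbol(const struct sym_model *src, size_t i)
{
    return i == 0 || src[i].core;
}

/* Pass 1: how many entries will the reduced table hold? */
size_t count_kept(const struct sym_model *src, size_t n)
{
    size_t kept = 0;

    for (size_t i = 0; i < n; i++)
        if (keep_symbol(src, i))
            kept++;
    return kept;
}

/* Pass 2: copy exactly the entries counted by pass 1. */
size_t copy_kept(struct sym_model *dst, const struct sym_model *src, size_t n)
{
    size_t kept = 0;

    for (size_t i = 0; i < n; i++)
        if (keep_symbol(src, i))
            dst[kept++] = src[i];
    return kept;
}
```

Both loops run from index 0 to n - 1 over the same array, so the final symbol is transferred whenever it qualifies.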
     

30 Oct, 2012

1 commit

  • I've been trying to get hardware breakpoints with perf to work
    on POWER7 but I'm getting the following:

    % perf record -e mem:0x10000000 true

    Error: sys_perf_event_open() syscall returned with 28 (No space left on device). /bin/dmesg may provide additional information.

    Fatal: No CONFIG_PERF_EVENTS=y kernel support configured?

    true: Terminated

    (FWIW adding -a and it works fine)

    Debugging it seems that __reserve_bp_slot() is returning ENOSPC
    because it thinks there are no free breakpoint slots on this
    CPU.

    I have 2 CPUs, so perf userspace is doing two perf_event_open
    syscalls to add a counter to each CPU [1]. The first syscall
    succeeds but the second is failing.

    On this second syscall, fetch_bp_busy_slots() sets slots.pinned
    to be 1, despite there being no breakpoint on this CPU. This is
    because the call to task_bp_pinned() checks all CPUs, rather
    than just the current CPU. POWER7 only has one hardware
    breakpoint per CPU (ie. HBP_NUM=1), so we return ENOSPC.

    The following patch fixes this by checking the associated CPU
    for each breakpoint in task_bp_pinned. I'm not familiar with
    this code, so it's provided as a reference to the above issue.

    Signed-off-by: Michael Neuling
    Acked-by: Peter Zijlstra
    Cc: Michael Ellerman
    Cc: Jovi Zhang
    Cc: K Prasad
    Link: http://lkml.kernel.org/r/1351268936-2956-1-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Michael Neuling
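
A user-space model of the accounting difference, assuming a flat array of breakpoints: `bp_model` and both counting functions are hypothetical and only illustrate counting a task's pinned breakpoints across all CPUs versus on the CPU being reserved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct bp_model {
    int task_id;
    int cpu;
    int weight;  /* slots consumed by this breakpoint */
};

/* Behaviour described as the problem: breakpoints on any CPU are counted. */
int task_bp_pinned_all_cpus(const struct bp_model *bps, size_t n, int task_id)
{
    int count = 0;

    for (size_t i = 0; i < n; i++)
        if (bps[i].task_id == task_id)
            count += bps[i].weight;
    return count;
}

/* Fixed behaviour: only breakpoints bound to the CPU in question are counted. */
int task_bp_pinned_on_cpu(const struct bp_model *bps, size_t n, int task_id, int cpu)
{
    int count = 0;

    for (size_t i = 0; i < n; i++)
        if (bps[i].task_id == task_id && bps[i].cpu == cpu)
            count += bps[i].weight;
    return count;
}

/* A slot is free while fewer than slots_per_cpu are pinned. */
bool slot_available(int pinned, int slots_per_cpu)
{
    return pinned < slots_per_cpu;
}
```

With one slot per CPU and one breakpoint already placed on CPU 0, the all-CPU count reports CPU 1 as full, while the per-CPU count leaves its slot free.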
     

26 Oct, 2012

3 commits

  • Merge misc fixes from Andrew Morton:
    "18 total. 15 fixes and some updates to a device_cgroup patchset which
    bring it up to date with the version which I should have merged in the
    first place."

    * emailed patches from Andrew Morton : (18 patches)
    fs/compat_ioctl.c: VIDEO_SET_SPU_PALETTE missing error check
    gen_init_cpio: avoid stack overflow when expanding
    drivers/rtc/rtc-imxdi.c: add missing spin lock initialization
    mm, numa: avoid setting zone_reclaim_mode unless a node is sufficiently distant
    pidns: limit the nesting depth of pid namespaces
    drivers/dma/dw_dmac: make driver's endianness configurable
    mm/mmu_notifier: allocate mmu_notifier in advance
    tools/testing/selftests/epoll/test_epoll.c: fix build
    UAPI: fix tools/vm/page-types.c
    mm/page_alloc.c:alloc_contig_range(): return early for err path
    rbtree: include linux/compiler.h for definition of __always_inline
    genalloc: stop crashing the system when destroying a pool
    backlight: ili9320: add missing SPI dependency
    device_cgroup: add proper checking when changing default behavior
    device_cgroup: stop using simple_strtoul()
    device_cgroup: rename deny_all to behavior
    cgroup: fix invalid rcu dereference
    mm: fix XFS oops due to dirty pages without buffers on s390

    Linus Torvalds
     
  • If one includes documentation for an external tool, it should be
    correct. This is not:

    1. Overriding the input to rngd should typically be neither
    necessary nor desired. This is especially so since newer
    versions of rngd support a number of different *types* of sources.
    2. The default kernel-exported device is called /dev/hwrng not
    /dev/hwrandom nor /dev/hw_random (both of which were used in the
    past; however, kernel and udev seem to have converged on
    /dev/hwrng.)

    Overall it is better if the documentation for rngd is kept with rngd
    rather than in a kernel Makefile.

    Signed-off-by: H. Peter Anvin
    Cc: David Howells
    Cc: Jeff Garzik
    Signed-off-by: Linus Torvalds

    H. Peter Anvin
     
  • 'struct pid' is a "variable sized struct" - a header with an array of
    upids at the end.

    The size of the array depends on the level (depth) of pid namespaces.
    Currently the nesting level is not limited, so 'struct pid' can be larger
    than one page.

    It seems reasonable that it should be smaller than a page.
    MAX_PID_NS_LEVEL is not calculated from PAGE_SIZE, because in that case it
    would depend on the architecture and config options, and it would shrink
    if someone added new fields to struct pid or struct upid.

    I suggest setting MAX_PID_NS_LEVEL = 32, because it leaves room to expand
    "struct pid" and it's more than enough for all use cases known to me.
    When someone finds a reasonable use case, we can add a config option or a
    sysctl parameter.

    In addition it will reduce the effect of another problem, when we have
    many nested namespaces and the oldest one starts dying.
    zap_pid_ns_processes will be called for each namespace and find_vpid will
    be called for each process in a namespace. find_vpid will be called
    at least max_level^2 / 2 times. The reason is that when we find a bit in
    the pidmap, we can't determine whether this pidns is the top one for the
    process.

    find_vpid is a heavy operation, so a fork bomb which creates many nested
    namespaces can make a system inaccessible for a long time. For example my
    system becomes inaccessible for a few minutes with 4000 processes.

    [akpm@linux-foundation.org: return -EINVAL in response to excessive nesting, not -ENOMEM]
    Signed-off-by: Andrew Vagin
    Acked-by: Oleg Nesterov
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Vagin
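
The nesting limit amounts to one comparison at namespace creation time. The sketch below assumes a root namespace at level 0 and a limit of 32 as suggested above; `pidns_model`, `create_child_ns()` and the exact boundary condition are illustrative assumptions.

```c
#include <assert.h>
#include <errno.h>

#define MAX_PID_NS_LEVEL 32

struct pidns_model {
    unsigned int level;  /* 0 for the root namespace */
};

/* Create a child one level deeper, refusing excessive nesting with -EINVAL. */
int create_child_ns(const struct pidns_model *parent, struct pidns_model *child)
{
    unsigned int level = parent->level + 1;

    if (level > MAX_PID_NS_LEVEL)
        return -EINVAL;
    child->level = level;
    return 0;
}
```

Under these assumptions 32 nested namespaces can be created below the root and the 33rd attempt fails, which also bounds the size of the per-level array in a "struct pid"-like object.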
     

25 Oct, 2012

3 commits

  • Pull cgroup fixes from Tejun Heo:
    "This pull request contains three fixes.

    Two are reverts of task_lock() removal in cgroup fork path. The
    optimizations incorrectly assumed that threadgroup_lock can protect
    process forks (as opposed to thread creations) too. Further cleanup
    of cgroup fork path is scheduled.

    The third fixes cgroup emptiness notification loss."

    * 'for-3.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    Revert "cgroup: Remove task_lock() from cgroup_post_fork()"
    Revert "cgroup: Drop task_lock(parent) on cgroup_fork()"
    cgroup: notify_on_release may not be triggered in some cases

    Linus Torvalds
     
  • Pull workqueue fix from Tejun Heo:
    "This pull request contains one patch from Dan Magenheimer to fix
    cancel_delayed_work() regression introduced by its reimplementation
    using try_to_grab_pending(). The reimplementation made it incorrectly
    return %true when the work item is idle.

    There aren't too many consumers of the return value but it broke at
    least ramster."

    * 'for-3.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: cancel_delayed_work() should return %false if work item is idle

    Linus Torvalds
     
  • 57b30ae77b ("workqueue: reimplement cancel_delayed_work() using
    try_to_grab_pending()") made cancel_delayed_work() always return %true
    unless someone else is also trying to cancel the work item, which is
    broken - if the target work item is idle, the return value should be
    %false.

    try_to_grab_pending() indicates that the target work item was idle by
    zero return value. Use it for return. Note that this brings
    cancel_delayed_work() in line with __cancel_work_timer() in return
    value handling.

    Signed-off-by: Dan Magenheimer
    Signed-off-by: Tejun Heo
    LKML-Reference:

    Dan Magenheimer
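
The return-value rule can be sketched with a three-state model. The names are hypothetical and the grab function only imitates the convention described above: positive when a pending item was grabbed, zero when the item was idle, negative when someone else is already canceling it.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum cancel_state { CW_IDLE, CW_PENDING, CW_BEING_CANCELED };

struct cancel_work_model {
    enum cancel_state state;
};

/* 1: pending item grabbed, 0: item was idle, negative: another canceler owns it. */
static int try_grab_pending_model(struct cancel_work_model *w)
{
    switch (w->state) {
    case CW_PENDING:
        w->state = CW_IDLE;
        return 1;
    case CW_IDLE:
        return 0;
    default:
        return -ENOENT;
    }
}

/* Report true only when a pending item was actually canceled. */
bool cancel_delayed_model(struct cancel_work_model *w)
{
    int ret = try_grab_pending_model(w);

    if (ret < 0)
        return false;
    return ret != 0;  /* pass the "was it pending?" answer through */
}
```

Returning the grab result, rather than an unconditional true, is what makes an idle item report false.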
     

24 Oct, 2012

1 commit

  • Pull perf fixes from Ingo Molnar:
    "Most of these are uprobes race fixes from Oleg, and their preparatory
    cleanups. (It's larger than what I'd normally send for an -rc kernel,
    but they looked significant enough to not delay them.)

    There's also an oprofile fix and an uncore PMU fix."

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
    perf/x86: Disable uncore on virtualized CPUs
    oprofile, x86: Fix wrapping bug in op_x86_get_ctrl()
    ring-buffer: Check for uninitialized cpu buffer before resizing
    uprobes: Fix the racy uprobe->flags manipulation
    uprobes: Fix prepare_uprobe() race with itself
    uprobes: Introduce prepare_uprobe()
    uprobes: Fix handle_swbp() vs unregister() + register() race
    uprobes: Do not delete uprobe if uprobe_unregister() fails
    uprobes: Don't return success if alloc_uprobe() fails
    uprobes/x86: Only rep+nop can be emulated correctly
    uprobes: Simplify is_swbp_at_addr(), remove stale comments
    uprobes: Kill set_orig_insn()->is_swbp_at_addr()
    uprobes: Introduce copy_opcode(), kill read_opcode()
    uprobes: Kill set_swbp()->is_swbp_at_addr()
    uprobes: Restrict valid_vma(false) to skip VM_SHARED vmas
    uprobes: Change valid_vma() to demand VM_MAYEXEC rather than VM_EXEC
    uprobes: Change write_opcode() to use FOLL_FORCE
    uprobes: Move clear_thread_flag(TIF_UPROBE) to uprobe_notify_resume()
    uprobes: Kill UTASK_BP_HIT state
    uprobes: Fix UPROBE_SKIP_SSTEP checks in handle_swbp()
    ...

    Linus Torvalds
     

22 Oct, 2012

3 commits


20 Oct, 2012

6 commits

  • The min/max call needed to have explicit types on some architectures
    (e.g. mn10300). Use clamp_t instead to avoid the warning:

    kernel/sys.c: In function 'override_release':
    kernel/sys.c:1287:10: warning: comparison of distinct pointer types lacks a cast [enabled by default]

    Reported-by: Fengguang Wu
    Signed-off-by: Kees Cook
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Emit the magic string that indicates a module has a signature after the
    signature data instead of before it. This allows module_sig_check() to
    be made simpler and faster by the elimination of the search for the
    magic string. Instead we just need to do a single memcmp().

    This works because at the end of the signature data there is the
    fixed-length signature information block. This block then falls
    immediately prior to the magic number.

    From the contents of the information block, it is trivial to calculate
    the size of the signature data and thus the size of the actual module
    data.

    Signed-off-by: David Howells
    Signed-off-by: Linus Torvalds

    David Howells
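
Parsing from the end of the file can be sketched as below. The info block is reduced here to a hypothetical 4-byte big-endian signature length, which is an assumption for brevity; the magic string matches the one the module signing code appends, and `payload_length()` is an illustrative name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SIG_MAGIC     "~Module signature appended~\n"
#define SIG_MAGIC_LEN (sizeof(SIG_MAGIC) - 1)
#define INFO_LEN      4  /* simplified info block: big-endian signature length only */

/*
 * Layout assumed: [payload][signature data][info block][magic string].
 * Returns the payload length, or -1 if the buffer is unsigned or malformed.
 */
long payload_length(const unsigned char *buf, size_t len)
{
    const unsigned char *info;
    uint32_t sig_len;

    if (len < SIG_MAGIC_LEN + INFO_LEN)
        return -1;

    /* One memcmp at a known offset replaces a search for the magic string. */
    if (memcmp(buf + len - SIG_MAGIC_LEN, SIG_MAGIC, SIG_MAGIC_LEN) != 0)
        return -1;
    len -= SIG_MAGIC_LEN;

    /* The fixed-length info block sits immediately before the magic string. */
    info = buf + len - INFO_LEN;
    sig_len = ((uint32_t)info[0] << 24) | ((uint32_t)info[1] << 16) |
              ((uint32_t)info[2] << 8) | (uint32_t)info[3];
    len -= INFO_LEN;

    if (sig_len > len)
        return -1;
    return (long)(len - sig_len);
}
```

Everything is located by walking backwards from the end of the buffer: magic, then info block, then signature data, leaving the payload at the front.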
     
  • This reverts commit 7e3aa30ac8c904a706518b725c451bb486daaae9.

    The commit incorrectly assumed that fork path always performed
    threadgroup_change_begin/end() and depended on that for
    synchronization against task exit and cgroup migration paths instead
    of explicitly grabbing task_lock().

    threadgroup_change is not locked when forking a new process (as
    opposed to a new thread in the same process) and even if it were it
    wouldn't be effective as different processes use different threadgroup
    locks.

    Revert the incorrect optimization.

    Signed-off-by: Tejun Heo
    LKML-Reference:
    Acked-by: Li Zefan
    Cc: Frederic Weisbecker
    Cc: stable@vger.kernel.org

    Tejun Heo
     
  • This reverts commit 7e381b0eb1e1a9805c37335562e8dc02e7d7848c.

    The commit incorrectly assumed that fork path always performed
    threadgroup_change_begin/end() and depended on that for
    synchronization against task exit and cgroup migration paths instead
    of explicitly grabbing task_lock().

    threadgroup_change is not locked when forking a new process (as
    opposed to a new thread in the same process) and even if it were it
    wouldn't be effective as different processes use different threadgroup
    locks.

    Revert the incorrect optimization.

    Signed-off-by: Tejun Heo
    LKML-Reference:
    Acked-by: Li Zefan
    Bitterly-Acked-by: Frederic Weisbecker
    Cc: stable@vger.kernel.org

    Tejun Heo
     
  • free_pid_ns() operates in a recursive fashion:

    free_pid_ns(parent)
      put_pid_ns(parent)
        kref_put(&ns->kref, free_pid_ns);
          free_pid_ns

    Thus, with deeply nested namespaces, userspace may trigger an avalanche
    of free_pid_ns calls, exhausting the kernel stack and eventually causing
    a panic.

    This patch turns the recursion into an iterative loop.

    Based on a patch by Andrew Vagin.

    [akpm@linux-foundation.org: export put_pid_ns() to modules]
    Signed-off-by: Cyrill Gorcunov
    Cc: Andrew Vagin
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
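
Turning the recursive release into a loop can be sketched with a reference-counted chain. This is a user-space model with hypothetical names (`ns_model`, `ns_create()`, `ns_put()`); in it, creating a child hands the caller's reference on the parent over to the child.

```c
#include <assert.h>
#include <stdlib.h>

struct ns_model {
    int refs;
    struct ns_model *parent;
};

static long ns_freed;  /* counts released namespaces for observation */

/* Create a namespace with one reference; the child inherits the caller's parent reference. */
struct ns_model *ns_create(struct ns_model *parent)
{
    struct ns_model *ns = malloc(sizeof(*ns));

    if (ns == NULL)
        abort();
    ns->refs = 1;
    ns->parent = parent;
    return ns;
}

/* Drop a reference; walk up the chain in a loop instead of recursing per level. */
void ns_put(struct ns_model *ns)
{
    while (ns != NULL) {
        struct ns_model *parent = ns->parent;

        if (--ns->refs > 0)
            break;  /* still referenced: ancestors stay alive */
        free(ns);
        ns_freed++;
        ns = parent;  /* the freed child held a reference on its parent */
    }
}
```

Stack use stays constant no matter how deep the chain is, so a very long chain is released without growing the call stack.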
     
  • Calling uname() with the UNAME26 personality set allows a leak of kernel
    stack contents. This fixes it by defensively calculating the length of
    copy_to_user() call, making the len argument unsigned, and initializing
    the stack buffer to zero (now technically unneeded, but hey, overkill).

    CVE-2012-0957

    Reported-by: PaX Team
    Signed-off-by: Kees Cook
    Cc: Andi Kleen
    Cc: PaX Team
    Cc: Brad Spengler
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
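
The three defensive measures named above (zeroed buffer, unsigned and clamped length, copying only what was written) can be sketched in user space. `copy_release()` is a hypothetical helper, `memcpy()` stands in for the copy to user space, and the "2.6.%u%s" format is used only as an example payload.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define RELEASE_BUF_LEN 65

/* Format a release string and copy out only the bytes written, plus the NUL. */
size_t copy_release(char *out, size_t out_len, unsigned int minor, const char *rest)
{
    char buf[RELEASE_BUF_LEN] = { 0 };  /* zeroed, so no stale stack bytes exist */
    size_t cap = out_len;               /* unsigned, so it cannot go negative */
    int n;

    if (cap == 0)
        return 0;
    if (cap > sizeof(buf))
        cap = sizeof(buf);              /* never format or copy past the local buffer */

    n = snprintf(buf, cap, "2.6.%u%s", minor, rest);
    if (n < 0)
        n = 0;
    if ((size_t)n >= cap)
        n = (int)(cap - 1);             /* output was truncated to cap - 1 characters */

    memcpy(out, buf, (size_t)n + 1);    /* string plus terminator, nothing more */
    return (size_t)n + 1;
}
```

Bytes in the destination beyond the terminator are left untouched, so nothing from the local buffer beyond the formatted string is ever exposed.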
     

17 Oct, 2012

2 commits

  • The console_cpu_notify() function runs with interrupts disabled in the
    CPU_DYING case. It therefore cannot block, for example, as will happen
    when it calls console_lock(). Therefore, remove the CPU_DYING leg of
    the switch statement to avoid this problem.

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Srivatsa S. Bhat
    Signed-off-by: Linus Torvalds

    Paul E. McKenney
     
  • notify_on_release must be triggered when the last process in a cgroup is
    moved to another. But if the first (and only) process in a cgroup is moved
    to another, notify_on_release is not triggered.

    # mkdir /cgroup/cpu/SRC
    # mkdir /cgroup/cpu/DST
    #
    # echo 1 >/cgroup/cpu/SRC/notify_on_release
    # echo 1 >/cgroup/cpu/DST/notify_on_release
    #
    # sleep 300 &
    [1] 8629
    #
    # echo 8629 >/cgroup/cpu/SRC/tasks
    # echo 8629 >/cgroup/cpu/DST/tasks
    -> notify_on_release for /SRC must be triggered at this point,
    but it isn't.

    This is because put_css_set() is called before CGRP_RELEASABLE is set
    in cgroup_task_migrate(). It is a regression introduced by commit
    74a1166d (cgroups: make procs file writable), which was merged into
    v3.0.

    Cc: Ben Blum
    Cc: # v3.0.x and later
    Acked-by: Li Zefan
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Tejun Heo

    Daisuke Nishimura
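
The ordering problem behind the notify_on_release regression can be shown with a toy model. Everything here is hypothetical (`struct toy_cgroup`, `toy_put()`, the two migrate helpers) and greatly simplifies how the kernel tracks release state; it only demonstrates why the flag has to be set before the reference is dropped.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy cgroup: a task count plus the flags the text mentions. */
struct toy_cgroup {
    int tasks;
    bool notify_on_release;
    bool releasable;   /* stands in for CGRP_RELEASABLE */
    int notifications; /* how many times the release notification fired */
};

/* Stand-in for put_css_set(): drop a task, then check for release. */
static void toy_put(struct toy_cgroup *cg)
{
    cg->tasks--;
    if (cg->tasks == 0 && cg->releasable && cg->notify_on_release)
        cg->notifications++;
}

/* Buggy ordering: the release check runs before the flag is set. */
static void migrate_put_first(struct toy_cgroup *src, struct toy_cgroup *dst)
{
    dst->tasks++;
    toy_put(src);
    src->releasable = true;
}

/* Fixed ordering: mark the source releasable, then drop the reference. */
static void migrate_flag_first(struct toy_cgroup *src, struct toy_cgroup *dst)
{
    dst->tasks++;
    src->releasable = true;
    toy_put(src);
}
```

With a source group that holds a single task and has `notify_on_release` set, `migrate_put_first()` leaves `notifications` at 0, while `migrate_flag_first()` raises it to 1, mirroring the SRC/DST shell session in the commit message.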
     

15 Oct, 2012

1 commit

  • Pull module signing support from Rusty Russell:
    "module signing is the highlight, but it's an all-over David Howells frenzy..."

    Hmm "Magrathea: Glacier signing key". Somebody has been reading too much HHGTTG.

    * 'modules-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (37 commits)
    X.509: Fix indefinite length element skip error handling
    X.509: Convert some printk calls to pr_devel
    asymmetric keys: fix printk format warning
    MODSIGN: Fix 32-bit overflow in X.509 certificate validity date checking
    MODSIGN: Make mrproper should remove generated files.
    MODSIGN: Use utf8 strings in signer's name in autogenerated X.509 certs
    MODSIGN: Use the same digest for the autogen key sig as for the module sig
    MODSIGN: Sign modules during the build process
    MODSIGN: Provide a script for generating a key ID from an X.509 cert
    MODSIGN: Implement module signature checking
    MODSIGN: Provide module signing public keys to the kernel
    MODSIGN: Automatically generate module signing keys if missing
    MODSIGN: Provide Kconfig options
    MODSIGN: Provide gitignore and make clean rules for extra files
    MODSIGN: Add FIPS policy
    module: signature checking hook
    X.509: Add a crypto key parser for binary (DER) X.509 certificates
    MPILIB: Provide a function to read raw data into an MPI
    X.509: Add an ASN.1 decoder
    X.509: Add simple ASN.1 grammar compiler
    ...

    Linus Torvalds
     

13 Oct, 2012

3 commits

  • Pull KGDB/KDB fixes and cleanups from Jason Wessel:
    "Cleanups
    - Clean up compile warnings in kgdboc.c and x86/kernel/kgdb.c
    - Add module event hooks for simplified debugging with gdb
    Fixes
    - Fix kdb to stop paging with 'q' on bta and dmesg
    - Fix for data that scrolls off the vga console due to line wrapping
    when using the kdb pager
    New
    - The debug core registers for kernel module events which allows a
    kernel aware gdb to automatically load symbols and break on entry
    to a kernel module
    - Allow kgdboc=kdb to setup kdb on the vga console"

    * tag 'for_linus-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb:
    tty/console: fix warnings in drivers/tty/serial/kgdboc.c
    kdb,vt_console: Fix missed data due to pager overruns
    kdb: Fix dmesg/bta scroll to quit with 'q'
    kgdboc: Accept either kbd or kdb to activate the vga + keyboard kdb shell
    kgdb,x86: fix warning about unused variable
    mips,kgdb: fix recursive page fault with CONFIG_KPROBES
    kgdb: Add module event hooks

    Linus Torvalds
     
  • Pull perf updates from Ingo Molnar:
    "This tree includes some late late perf items that missed the first
    round:

    tools:

    - Bash auto completion improvements, now we can auto complete the
    tools long options, tracepoint event names, etc, from Namhyung Kim.

    - Look up thread using tid instead of pid in 'perf sched'.

    - Move global variables into a perf_kvm struct, from David Ahern.

    - Hists refactorings, preparatory for improved 'diff' command, from
    Jiri Olsa.

    - Hists refactorings, preparatory for event group viewing work, from
    Namhyung Kim.

    - Remove double negation on optional feature macro definitions, from
    Namhyung Kim.

    - Remove several cases of needless global variables, on most
    builtins.

    - misc fixes

    kernel:

    - sysfs support for IBS on AMD CPUs, from Robert Richter.

    - Support for an upcoming Intel CPU, the Xeon-Phi / Knights Corner
    HPC blade PMU, from Vince Weaver.

    - misc fixes"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
    perf: Fix perf_cgroup_switch for sw-events
    perf: Clarify perf_cpu_context::active_pmu usage by renaming it to ::unique_pmu
    perf/AMD/IBS: Add sysfs support
    perf hists: Add more helpers for hist entry stat
    perf hists: Move he->stat.nr_events initialization to a template
    perf hists: Introduce struct he_stat
    perf diff: Removing the total_period argument from output code
    perf tool: Add hpp interface to enable/disable hpp column
    perf tools: Removing hists pair argument from output path
    perf hists: Separate overhead and baseline columns
    perf diff: Refactor diff displacement possition info
    perf hists: Add struct hists pointer to struct hist_entry
    perf tools: Complete tracepoint event names
    perf/x86: Add support for Intel Xeon-Phi Knights Corner PMU
    perf evlist: Remove some unused methods
    perf evlist: Introduce add_newtp method
    perf kvm: Move global variables into a perf_kvm struct
    perf tools: Convert to BACKTRACE_SUPPORT
    perf tools: Long option completion support for each subcommands
    perf tools: Complete long option names of perf command
    ...

    Linus Torvalds
     
  • Pull third pile of kernel_execve() patches from Al Viro:
    "The last bits of infrastructure for kernel_thread() et al., with
    alpha/arm/x86 use of those. Plus sanitizing the asm glue and
    do_notify_resume() on alpha, fixing the "disabled irq while running
    task_work stuff" breakage there.

    At that point the rest of kernel_thread/kernel_execve/sys_execve work
    can be done independently for different architectures. The only
    pending bits that do depend on having all architectures converted are
    restricted to fs/* and kernel/* - that'll obviously have to wait for
    the next cycle.

    I thought we'd have to wait for all of them done before we start
    eliminating the longjump-style insanity in kernel_execve(), but it
    turned out there's a very simple way to do that without flagday-style
    changes."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    alpha: switch to saner kernel_execve() semantics
    arm: switch to saner kernel_execve() semantics
    x86, um: convert to saner kernel_execve() semantics
    infrastructure for saner ret_from_kernel_thread semantics
    make sure that kernel_thread() callbacks call do_exit() themselves
    make sure that we always have a return path from kernel_execve()
    ppc: eeh_event should just use kthread_run()
    don't bother with kernel_thread/kernel_execve for launching linuxrc
    alpha: get rid of switch_stack argument of do_work_pending()
    alpha: don't bother passing switch_stack separately from regs
    alpha: take SIGPENDING/NOTIFY_RESUME loop into signal.c
    alpha: simplify TIF_NEED_RESCHED handling

    Linus Torvalds