15 Jan, 2016

1 commit


15 Dec, 2015

1 commit

  • [ Upstream commit fbca9d2d35c6ef1b323fae75cc9545005ba25097 ]

    During my own review, but also reported by Dmitry's syzkaller [1], it has
    been noticed that we trigger a heap out-of-bounds access on eBPF array
    maps when updating elements. This happens with any map whose
    map->value_size (specified at map creation time) is not a multiple of 8
    bytes.

    In array_map_alloc(), elem_size is round_up(attr->value_size, 8) and
    used to align array map slots for faster access. However, in function
    array_map_update_elem(), we update the element as ...

    memcpy(array->value + array->elem_size * index, value, array->elem_size);

    ... where we access 'value' out-of-bounds, since it was allocated on the
    map_update_elem() syscall side as kmalloc(map->value_size, GFP_USER)
    and later filled via copy_from_user(value, uvalue, map->value_size).
    Thus, we can read up to 7 bytes out-of-bounds.

    The same could happen from within an eBPF program, where in the worst
    case we access beyond an eBPF program's designated stack.

    Since 1be7f75d1668 ("bpf: enable non-root eBPF programs") hasn't hit an
    official release yet, this only affects privileged users.

    In case of array_map_lookup_elem(), the verifier prevents eBPF programs
    from accessing beyond map->value_size through check_map_access(). Also
    from syscall side map_lookup_elem() only copies map->value_size back to
    user, so nothing could leak.
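
    The upstream fix caps the copy length at the user-visible value size
    rather than the padded slot size. A minimal sketch of the idea,
    simplified from kernel/bpf/arraymap.c (locking and flag checks omitted):

    static int array_map_update_elem(struct bpf_map *map, void *key,
                                     void *value, u64 map_flags)
    {
        struct bpf_array *array = container_of(map, struct bpf_array, map);
        u32 index = *(u32 *)key;

        if (index >= array->map.max_entries)
            return -E2BIG;

        /* Slots stay rounded up to 8 bytes for alignment, but only
         * map->value_size bytes may be read from 'value'; copying
         * array->elem_size bytes was the out-of-bounds access. */
        memcpy(array->value + array->elem_size * index, value,
               map->value_size);
        return 0;
    }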

    [1] http://github.com/google/syzkaller

    Fixes: 28fbcfa08d8e ("bpf: add array type of eBPF maps")
    Reported-by: Dmitry Vyukov
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     

10 Nov, 2015

1 commit

  • commit 275d7d44d802ef271a42dc87ac091a495ba72fc5 upstream.

    Poma (on the way to another bug) reported an assertion triggering:

    [] module_assert_mutex_or_preempt+0x49/0x90
    [] __module_address+0x32/0x150
    [] __module_text_address+0x16/0x70
    [] symbol_put_addr+0x29/0x40
    [] dvb_frontend_detach+0x7d/0x90 [dvb_core]

    Laura Abbott produced a patch which led us to
    inspect symbol_put_addr(). This function has a comment claiming it
    doesn't need to disable preemption around the module lookup
    because it holds a reference to the module it wants to find, which
    therefore cannot go away.

    This is wrong (and a false optimization too, preempt_disable() is really
    rather cheap, and I doubt any of this is on uber critical paths,
    otherwise it would've retained a pointer to the actual module anyway and
    avoided the second lookup).

    While it's true that the module cannot go away while we hold a reference
    to it, the data structure we do the lookup in very much _CAN_ change
    while we do the lookup. Therefore fix the comment and add the
    required preempt_disable().
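
    A sketch of the fixed function, simplified from kernel/module.c: the
    reference pins the module itself, but only disabling preemption pins
    the data structure the lookup walks:

    void symbol_put_addr(void *addr)
    {
        struct module *modaddr;
        unsigned long a = (unsigned long)dereference_function_descriptor(addr);

        if (core_kernel_text(a))
            return;

        /* Holding a reference keeps the module alive, but not the
         * module list/tree that __module_text_address() traverses. */
        preempt_disable();
        modaddr = __module_text_address(a);
        BUG_ON(!modaddr);
        module_put(modaddr);
        preempt_enable();
    }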

    Reported-by: poma
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Rusty Russell
    Fixes: a6e6abd575fc ("module: remove module_text_address()")
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     

27 Oct, 2015

2 commits

  • commit fe32d3cd5e8eb0f82e459763374aa80797023403 upstream.

    These functions check should_resched() before unlocking a spinlock or
    re-enabling bottom halves; at that point preempt_count is always
    non-zero, so should_resched() always returns false. cond_resched_lock()
    worked only if spin_needbreak was set.

    This patch adds argument "preempt_offset" to should_resched().

    preempt_count offset constants for that:

    PREEMPT_DISABLE_OFFSET - offset after preempt_disable()
    PREEMPT_LOCK_OFFSET - offset after spin_lock()
    SOFTIRQ_DISABLE_OFFSET - offset after local_bh_disable()
    SOFTIRQ_LOCK_OFFSET - offset after spin_lock_bh()
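
    A sketch of how the offset is used, simplified from the generic preempt
    helpers: the caller passes the preempt_count it is known to hold, so
    should_resched() no longer unconditionally fails under a held lock:

    static __always_inline bool should_resched(int preempt_offset)
    {
        /* Reschedulable only when preempt_count() matches exactly
         * what the calling context holds. */
        return unlikely(preempt_count() == preempt_offset &&
                        tif_need_resched());
    }

    /* e.g. cond_resched_lock() holds exactly one spin_lock: */
    int __cond_resched_lock(spinlock_t *lock)
    {
        if (spin_needbreak(lock) || should_resched(PREEMPT_LOCK_OFFSET)) {
            spin_unlock(lock);
            preempt_schedule_common();
            spin_lock(lock);
            return 1;
        }
        return 0;
    }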

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Graf
    Cc: Boris Ostrovsky
    Cc: David Vrabel
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: bdb438065890 ("sched: Extract the basic add/sub preempt_count modifiers")
    Link: http://lkml.kernel.org/r/20150715095204.12246.98268.stgit@buzz
    Signed-off-by: Ingo Molnar
    Signed-off-by: Mike Galbraith
    Signed-off-by: Greg Kroah-Hartman

    Konstantin Khlebnikov
     
  • commit 874bbfe600a660cba9c776b3957b1ce393151b76 upstream.

    My system keeps crashing with the message below. vmstat_update()
    schedules a delayed work on the current CPU and expects the work to run
    on that CPU. schedule_delayed_work() is expected to make delayed work
    run on the local CPU. The problem is that the timer can be migrated with
    NO_HZ. __queue_work() queues the work in the timer handler, which could
    run on a different CPU than the one where the delayed work was
    scheduled. The end result is that the delayed work runs on a different
    CPU. This patch makes __queue_delayed_work() record the local CPU
    earlier; with the change, where the timer runs no longer changes where
    the work runs.

    [ 28.010131] ------------[ cut here ]------------
    [ 28.010609] kernel BUG at ../mm/vmstat.c:1392!
    [ 28.011099] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
    [ 28.011860] Modules linked in:
    [ 28.012245] CPU: 0 PID: 289 Comm: kworker/0:3 Tainted: G        W 4.3.0-rc3+ #634
    [ 28.013065] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153802- 04/01/2014
    [ 28.014160] Workqueue: events vmstat_update
    [ 28.014571] task: ffff880117682580 ti: ffff8800ba428000 task.ti: ffff8800ba428000
    [ 28.015445] RIP: 0010:[] [] vmstat_update+0x31/0x80
    [ 28.016282] RSP: 0018:ffff8800ba42fd80 EFLAGS: 00010297
    [ 28.016812] RAX: 0000000000000000 RBX: ffff88011a858dc0 RCX: 0000000000000000
    [ 28.017585] RDX: ffff880117682580 RSI: ffffffff81f14d8c RDI: ffffffff81f4df8d
    [ 28.018366] RBP: ffff8800ba42fd90 R08: 0000000000000001 R09: 0000000000000000
    [ 28.019169] R10: 0000000000000000 R11: 0000000000000121 R12: ffff8800baa9f640
    [ 28.019947] R13: ffff88011a81e340 R14: ffff88011a823700 R15: 0000000000000000
    [ 28.020071] FS:  0000000000000000(0000) GS:ffff88011a800000(0000) knlGS:0000000000000000
    [ 28.020071] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [ 28.020071] CR2: 00007ff6144b01d0 CR3: 00000000b8e93000 CR4: 00000000000006f0
    [ 28.020071] Stack:
    [ 28.020071]  ffff88011a858dc0 ffff8800baa9f640 ffff8800ba42fe00 ffffffff8106bd88
    [ 28.020071]  ffffffff8106bd0b 0000000000000096 0000000000000000 ffffffff82f9b1e8
    [ 28.020071]  ffffffff829f0b10 0000000000000000 ffffffff81f18460 ffff88011a81e340
    [ 28.020071] Call Trace:
    [ 28.020071] [] process_one_work+0x1c8/0x540
    [ 28.020071] [] ? process_one_work+0x14b/0x540
    [ 28.020071] [] worker_thread+0x114/0x460
    [ 28.020071] [] ? process_one_work+0x540/0x540
    [ 28.020071] [] kthread+0xf8/0x110
    [ 28.020071] [] ? kthread_create_on_node+0x200/0x200
    [ 28.020071] [] ret_from_fork+0x3f/0x70
    [ 28.020071] [] ? kthread_create_on_node+0x200/0x200
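
    A sketch of the change in __queue_delayed_work(), simplified from
    kernel/workqueue.c: resolve WORK_CPU_UNBOUND to the submitting CPU
    before arming the timer, so it no longer matters where the timer
    handler itself runs:

    static void __queue_delayed_work(int cpu, struct workqueue_struct *wq,
                                     struct delayed_work *dwork,
                                     unsigned long delay)
    {
        struct timer_list *timer = &dwork->timer;

        dwork->wq = wq;
        /* The timer isn't guaranteed to fire on this CPU under NO_HZ,
         * so record the local CPU now rather than in the handler. */
        if (cpu == WORK_CPU_UNBOUND)
            cpu = raw_smp_processor_id();
        dwork->cpu = cpu;
        timer->expires = jiffies + delay;
        add_timer_on(timer, cpu);
    }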

    Signed-off-by: Shaohua Li
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Shaohua Li
     

23 Oct, 2015

6 commits

  • commit 95c2b17534654829db428f11bcf4297c059a2a7e upstream.

    Per-IRQ directories in procfs are created only when a handler is first
    added to the irqdesc, not when the irqdesc is created. In the case of
    a shared IRQ, multiple tasks can race to create a directory. This
    race condition seems to have been present forever, but is easier to
    hit with async probing.

    Signed-off-by: Ben Hutchings
    Link: http://lkml.kernel.org/r/1443266636.2004.2.camel@decadent.org.uk
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Ben Hutchings
     
  • commit 54d27365cae88fbcc853b391dcd561e71acb81fa upstream.

    The optimized task selection logic optimistically selects a new task
    to run without first doing a full put_prev_task(). This is so that we
    can avoid a put/set on the common ancestors of the old and new task.

    Similarly, we should only call check_cfs_rq_runtime() to throttle
    eligible groups if they're part of the common ancestry, otherwise it
    is possible to end up with no eligible task in the simple task
    selection.

    Imagine:

                /root
        /prev           /next
        /A              /B

    If our optimistic selection ends up throttling /next, we goto simple
    and our put_prev_task() ends up throttling /prev, after which we're
    going to bug out in set_next_entity() because there aren't any tasks
    left.

    Avoid this scenario by only throttling common ancestors.

    Reported-by: Mohammed Naser
    Reported-by: Konstantin Khlebnikov
    Signed-off-by: Ben Segall
    [ munged Changelog ]
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Thomas Gleixner
    Cc: pjt@google.com
    Fixes: 678d5718d8d0 ("sched/fair: Optimize cgroup pick_next_task_fair()")
    Link: http://lkml.kernel.org/r/xm26wq1oswoq.fsf@sword-of-the-dawn.mtv.corp.google.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Ben Segall
     
  • commit 95913d97914f44db2b81271c2e2ebd4d2ac2df83 upstream.

    So the problem this patch is trying to address is as follows:

    CPU0                            CPU1

    context_switch(A, B)
                                    ttwu(A)
                                      LOCK A->pi_lock
                                      A->on_cpu == 0
    finish_task_switch(A)
      prev_state = A->state  <-.
      WMB                      |
      A->on_cpu = 0;           |
      UNLOCK rq0->lock         |
                               |    context_switch(C, A)
                               `--  A->state = TASK_DEAD
                                      prev_state == TASK_DEAD
                                        put_task_struct(A)
                            context_switch(A, C)
                            finish_task_switch(A)
                              A->state == TASK_DEAD
                                put_task_struct(A)

    The argument being that the WMB will allow the load of A->state on CPU0
    to cross over and observe CPU1's store of A->state, which will then
    result in a double-drop and use-after-free.

    Now the comment states (and this was true once upon a long time ago)
    that we need to observe A->state while holding rq->lock because that
    will order us against the wakeup; however the wakeup will not in fact
    acquire (that) rq->lock; it takes A->pi_lock these days.

    We can obviously fix this by upgrading the WMB to an MB, but that is
    expensive, so we'd rather avoid that.

    The alternative this patch takes is: smp_store_release(&A->on_cpu, 0),
    which avoids the MB on some archs, but not important ones like ARM.
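
    A sketch of the resulting ordering in finish_task_switch(), simplified:

    /* The state load must not be reordered past publishing on_cpu == 0,
     * because once the wakeup side observes on_cpu == 0 it may rewrite
     * prev->state (e.g. to TASK_DEAD). */
    prev_state = prev->state;
    smp_store_release(&prev->on_cpu, 0);
    if (prev_state == TASK_DEAD)
        put_task_struct(prev);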

    Reported-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Cc: manfred@colorfullife.com
    Cc: will.deacon@arm.com
    Fixes: e4a52bcb9a18 ("sched: Remove rq->lock from the first half of ttwu()")
    Link: http://lkml.kernel.org/r/20150929124509.GG3816@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     
  • commit 00cc1633816de8c95f337608a1ea64e228faf771 upstream.

    Commit 2ee507c47293 ("sched: Add function single_task_running to let a task
    check if it is the only task running on a cpu") referenced the current
    runqueue via smp_processor_id(). When CONFIG_DEBUG_PREEMPT is enabled,
    that is only allowed if preemption is disabled or the current task is
    bound to the local cpu (e.g. kernel worker).

    With commit f78195129963 ("kvm: add halt_poll_ns module parameter") KVM
    calls single_task_running. If CONFIG_DEBUG_PREEMPT is enabled that
    generates a lot of kernel messages.

    To avoid adding preemption disabling in those cases, as it would limit
    the usefulness, we change single_task_running() to directly access the
    CPU-local runqueue.
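
    A sketch of the fixed helper (kernel/sched/core.c): the raw, CPU-local
    access silences the debug check; if we migrate right after the read,
    the answer was stale anyway:

    bool single_task_running(void)
    {
        return raw_rq()->nr_running == 1;
    }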

    Cc: Tim Chen
    Suggested-by: Peter Zijlstra
    Acked-by: Peter Zijlstra (Intel)
    Fixes: 2ee507c47293 ("sched: Add function single_task_running to let a task check if it is the only task running on a cpu")
    Signed-off-by: Dominik Dingel
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Dominik Dingel
     
  • commit 57ffc5ca679f499f4704fd9b6a372916f59930ee upstream.

    It's currently possible to drop the last refcount to the aux buffer
    from NMI context, which results in the expected fireworks.

    The refcounting needs a bigger overhaul, but to cure the immediate
    problem, delay the freeing by using an irq_work.

    Reviewed-and-tested-by: Alexander Shishkin
    Reported-by: Vince Weaver
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150618103249.GK19282@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     
  • commit 2619d7e9c92d524cb155ec89fd72875321512e5b upstream.

    The internal clocksteering done for fine-grained error
    correction uses a logarithmic approximation, so any time
    adjtimex() adjusts the clock steering, timekeeping_freqadjust()
    quickly approximates the correct clock frequency over a series
    of ticks.

    Unfortunately, the logic in timekeeping_freqadjust(), introduced
    in commit:

    dc491596f639 ("timekeeping: Rework frequency adjustments to work better w/ nohz")

    used the abs() function with a s64 error value to calculate the
    size of the approximated adjustment to be made.

    Per include/linux/kernel.h:

    "abs() should not be used for 64-bit types (s64, u64, long long) - use abs64()".

    Thus on 32-bit platforms, this resulted in the clock steering taking a
    quite dampened random walk trying to converge on the proper frequency,
    which caused the adjustments to be made much slower than intended (most
    easily observed when large adjustments are made).

    This patch fixes the issue by using abs64() instead.
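
    For reference, a userspace demo of this failure class (an illustration,
    not kernel code): a 32-bit int abs() truncates an s64 argument before
    taking the absolute value:

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    int main(void)
    {
        int64_t err = -4294967299LL;    /* needs more than 32 bits */

        /* abs() takes an int: the high bits are silently dropped. */
        printf("abs((int)err) = %d\n", abs((int)err));        /* 3 */
        /* llabs() (the userspace analogue of abs64()) is correct. */
        printf("llabs(err) = %lld\n", (long long)llabs(err));
        return 0;
    }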

    Reported-by: Nuno Gonçalves
    Tested-by: Nuno Goncalves
    Signed-off-by: John Stultz
    Cc: Linus Torvalds
    Cc: Miroslav Lichvar
    Cc: Peter Zijlstra
    Cc: Prarit Bhargava
    Cc: Richard Cochran
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1441840051-20244-1-git-send-email-john.stultz@linaro.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    John Stultz
     

30 Sep, 2015

1 commit

  • commit 12c641ab8270f787dfcce08b5f20ce8b65008096 upstream.

    The logic in the initial commit of unshare made creating a new thread
    group for a process contingent upon creating a new memory address space
    for that process. That is wrong. Two separate processes in different
    thread groups can share a memory address space, and clone allows the
    creation of such processes.

    This is significant because it was observed that mm_users > 1 does not
    mean that a process is multi-threaded, as reading /proc/PID/maps
    temporarily increments mm_users, which allows other processes to
    (accidentally) interfere with unshare() calls.

    Correct the checks in check_unshare_flags(): test !thread_group_empty()
    for CLONE_THREAD, CLONE_SIGHAND, and CLONE_VM; test sighand->count > 1
    for CLONE_SIGHAND and CLONE_VM; and test !current_is_single_threaded()
    instead of mm_users > 1 for CLONE_VM.

    By using the correct checks in unshare this removes the possibility of
    an accidental denial of service attack.

    Additionally, using the correct checks in unshare ensures that only an
    explicit unshare(CLONE_VM) can possibly trigger the slow path of
    current_is_single_threaded(). As an explicit unshare(CLONE_VM) is
    pointless, it is not expected that many applications make that call.
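
    A sketch of the corrected checks, simplified from kernel/fork.c
    (flag-validity checks elided):

    static int check_unshare_flags(unsigned long unshare_flags)
    {
        if (unshare_flags & (CLONE_THREAD | CLONE_SIGHAND | CLONE_VM)) {
            if (!thread_group_empty(current))
                return -EINVAL;
        }
        if (unshare_flags & (CLONE_SIGHAND | CLONE_VM)) {
            if (atomic_read(&current->sighand->count) > 1)
                return -EINVAL;
        }
        if (unshare_flags & CLONE_VM) {
            /* replaces the racy mm_users > 1 test */
            if (!current_is_single_threaded())
                return -EINVAL;
        }
        return 0;
    }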

    Fixes: b2e0d98705e6 ("userns: Implement unshare of the user namespace")
    Reported-by: Ricky Zhou
    Reported-by: Kees Cook
    Reviewed-by: Kees Cook
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     

22 Sep, 2015

2 commits

  • commit a068acf2ee77693e0bf39d6e07139ba704f461c3 upstream.

    Many file systems that implement the show_options hook fail to correctly
    escape their output which could lead to unescaped characters (e.g. new
    lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files. This
    could lead to confusion, spoofed entries (resulting in things like
    systemd issuing false d-bus "mount" notifications), and who knows what
    else. This looks like it would only be the root user stepping on
    themselves, but it's possible weird things could happen in containers or
    in other situations with delegated mount privileges.

    Here's an example using overlay with setuid fusermount trusting the
    contents of /proc/mounts (via the /etc/mtab symlink). Imagine the use
    of "sudo" is something more sneaky:

    $ BASE="ovl"
    $ MNT="$BASE/mnt"
    $ LOW="$BASE/lower"
    $ UP="$BASE/upper"
    $ WORK="$BASE/work/
    0 0
    none /proc fuse.pwn user_id=1000"
    $ mkdir -p "$LOW" "$UP" "$WORK"
    $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
    $ cat /proc/mounts
    none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/
    0 0
    none /proc fuse.pwn user_id=1000 0 0
    $ fusermount -u /proc
    $ cat /proc/mounts
    cat: /proc/mounts: No such file or directory

    This fixes the problem by adding new seq_show_option and
    seq_show_option_n helpers, and updating the vulnerable show_option
    handlers to use them as needed. Some, like SELinux, need to be open
    coded due to unusual existing escape mechanisms.
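
    A sketch of the new helper, simplified from include/linux/seq_file.h:
    it escapes both the option name and its value before they reach the
    seq_file:

    static inline void seq_show_option(struct seq_file *m, const char *name,
                                       const char *value)
    {
        seq_putc(m, ',');
        seq_escape(m, name, ",= \t\n\\");
        if (value) {
            seq_putc(m, '=');
            seq_escape(m, value, ", \t\n\\");
        }
    }

    A handler then replaces raw printing such as
    seq_printf(m, ",upperdir=%s", path) with
    seq_show_option(m, "upperdir", path).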

    [akpm@linux-foundation.org: add lost chunk, per Kees]
    [keescook@chromium.org: seq_show_option should be using const parameters]
    Signed-off-by: Kees Cook
    Acked-by: Serge Hallyn
    Acked-by: Jan Kara
    Acked-by: Paul Moore
    Cc: J. R. Okajima
    Signed-off-by: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     
  • commit dd9d3843755da95f63dd3a376f62b3e45c011210 upstream.

    There is a race condition in SMP bootup code, which may result
    in

    WARNING: CPU: 0 PID: 1 at kernel/workqueue.c:4418
    workqueue_cpu_up_callback()
    or
    kernel BUG at kernel/smpboot.c:135!

    It can be triggered with a bit of luck in Linux guests running
    on busy hosts.

    CPU0                              CPUn
    ====                              ====

    _cpu_up()
      __cpu_up()
                                      start_secondary()
                                        set_cpu_online()
                                          cpumask_set_cpu(cpu,
                                             to_cpumask(cpu_online_bits));
      cpu_notify(CPU_ONLINE)
                                          cpumask_set_cpu(cpu,
                                             to_cpumask(cpu_active_bits));

    During the various CPU_ONLINE callbacks CPUn is online but not
    active. Several things can go wrong at that point, depending on
    the scheduling of tasks on CPU0.

    Variant 1:

    cpu_notify(CPU_ONLINE)
      workqueue_cpu_up_callback()
        rebind_workers()
          set_cpus_allowed_ptr()

    This call fails because it requires an active CPU; rebind_workers()
    ends with a warning:

    WARNING: CPU: 0 PID: 1 at kernel/workqueue.c:4418
    workqueue_cpu_up_callback()

    Variant 2:

    cpu_notify(CPU_ONLINE)
      smpboot_thread_call()
        smpboot_unpark_threads()
          ..
          __kthread_unpark()
            __kthread_bind()
            wake_up_state()
              ..
              select_task_rq()
                select_fallback_rq()

    The ->wake_cpu of the unparked thread is not allowed, making a call
    to select_fallback_rq() necessary. Then, select_fallback_rq() cannot
    find an allowed, active CPU and promptly resets the allowed CPUs, so
    that the task in question ends up on CPU0.

    When those unparked tasks are eventually executed, they run
    immediately into a BUG:

    kernel BUG at kernel/smpboot.c:135!

    Just changing the order in which the online/active bits are set
    (and adding some memory barriers), would solve the two issues
    above. However, it would change the order of operations back to
    the one before commit 6acbfb96976f ("sched: Fix hotplug vs.
    set_cpus_allowed_ptr()"), thus, reintroducing that particular
    problem.

    Going further back into history, we have at least the following
    commits touching this topic:
    - commit 2baab4e90495 ("sched: Fix select_fallback_rq() vs cpu_active/cpu_online")
    - commit 5fbd036b552f ("sched: Cleanup cpu_active madness")

    Together, these give us the following non-working solutions:

    - secondary CPU sets active before online, because active is assumed to
    be a subset of online;

    - secondary CPU sets online before active, because the primary CPU
    assumes that an online CPU is also active;

    - secondary CPU sets online and waits for primary CPU to set active,
    because it might deadlock.

    Commit 875ebe940d77 ("powerpc/smp: Wait until secondaries are
    active & online") introduces an arch-specific solution to this
    arch-independent problem.

    Now, go for a more general solution without explicit waiting and
    simply set active twice: once on the secondary CPU after online
    was set and once on the primary CPU after online was seen.

    set_cpus_allowed_ptr()")

    Signed-off-by: Jan H. Schönherr
    Acked-by: Peter Zijlstra
    Cc: Anton Blanchard
    Cc: Borislav Petkov
    Cc: Joerg Roedel
    Cc: Linus Torvalds
    Cc: Matt Wilson
    Cc: Michael Ellerman
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 6acbfb96976f ("sched: Fix hotplug vs. set_cpus_allowed_ptr()")
    Link: http://lkml.kernel.org/r/1439408156-18840-1-git-send-email-jschoenh@amazon.de
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Jan H. Schönherr
     

14 Sep, 2015

7 commits

  • commit b7560de198222994374c1340a389f12d5efb244a upstream.

    This helper is required for irq chips which do not implement an
    irq_set_type callback and need to call down the irq domain hierarchy
    for the actual trigger type change.

    This helper is required to fix further wreckage caused by the
    conversion of TI OMAP to hierarchical irq domains and is therefore
    tagged for stable.

    [ tglx: Massaged changelog ]
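
    A sketch of the helper (kernel/irq/chip.c): it simply forwards the
    trigger-type request one level up the domain hierarchy:

    int irq_chip_set_type_parent(struct irq_data *data, unsigned int type)
    {
        data = data->parent_data;
        if (data->chip->irq_set_type)
            return data->chip->irq_set_type(data, type);

        return -ENOSYS;
    }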

    Signed-off-by: Grygorii Strashko
    Cc: Sudeep Holla
    Cc: stable@vger.kernel.org # 4.1
    Link: http://lkml.kernel.org/r/1439554830-19502-3-git-send-email-grygorii.strashko@ti.com
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Grygorii Strashko
     
  • commit 6d4affea7d5aa5ca5ff4c3e5fbf3ee16801cc527 upstream.

    irq_chip_retrigger_hierarchy() returns -ENOSYS if it was not able to
    find at least one .irq_retrigger() callback implemented in the IRQ
    domain hierarchy.

    That's wrong, because check_irq_resend() expects a 0 return value from
    the callback when the hardware-assisted resend was not possible. If the
    return value is non-zero, the core code assumes a successful hardware
    resend and the software resend is not invoked.

    This results in lost interrupts on platforms where none of the parent
    irq chips in the hierarchy implements the retrigger callback.
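
    A sketch of the fix (kernel/irq/chip.c, simplified): returning 0
    instead of -ENOSYS tells check_irq_resend() that nothing was resent in
    hardware, so it falls back to the software resend:

    int irq_chip_retrigger_hierarchy(struct irq_data *data)
    {
        for (data = data->parent_data; data; data = data->parent_data)
            if (data->chip && data->chip->irq_retrigger)
                return data->chip->irq_retrigger(data);

        return 0;    /* was -ENOSYS, which faked a hardware resend */
    }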

    This is observable on TI OMAP, where the hierarchy is:

    ARM GIC <- OMAP wakeupgen <- TI crossbar

    Reviewed-by: Marc Zyngier
    Reviewed-by: Jiang Liu
    Cc: Sudeep Holla
    Link: http://lkml.kernel.org/r/1439554830-19502-2-git-send-email-grygorii.strashko@ti.com
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Grygorii Strashko
     
  • commit 24ee3cf89bef04e8bc23788aca4e029a3f0f06d9 upstream.

    The comment says it's using trialcs->mems_allowed as a temp variable,
    but the code didn't match the comment. Change the code to match the
    comment.

    This fixes an issue when writing to cpuset.mems when a sub-directory
    exists: we need to write several times for the information to persist:

    | root@alban:/sys/fs/cgroup/cpuset# mkdir footest9
    | root@alban:/sys/fs/cgroup/cpuset# cd footest9
    | root@alban:/sys/fs/cgroup/cpuset/footest9# mkdir aa
    | root@alban:/sys/fs/cgroup/cpuset/footest9# cat cpuset.mems
    |
    | root@alban:/sys/fs/cgroup/cpuset/footest9# echo 0 > cpuset.mems
    | root@alban:/sys/fs/cgroup/cpuset/footest9# cat cpuset.mems
    |
    | root@alban:/sys/fs/cgroup/cpuset/footest9# echo 0 > cpuset.mems
    | root@alban:/sys/fs/cgroup/cpuset/footest9# cat cpuset.mems
    | 0
    | root@alban:/sys/fs/cgroup/cpuset/footest9# cat aa/cpuset.mems
    |
    | root@alban:/sys/fs/cgroup/cpuset/footest9# echo 0 > aa/cpuset.mems
    | root@alban:/sys/fs/cgroup/cpuset/footest9# cat aa/cpuset.mems
    | 0
    | root@alban:/sys/fs/cgroup/cpuset/footest9#

    This should help to fix the following issue in Docker:
    https://github.com/opencontainers/runc/issues/133
    In some conditions, a Docker container needs to be started twice in
    order to work.

    Signed-off-by: Alban Crequy
    Tested-by: Iago López Galeiras
    Acked-by: Li Zefan
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Alban Crequy
     
  • commit c7999c6f3fed9e383d3131474588f282ae6d56b9 upstream.

    I ran the perf fuzzer, which triggered some WARN()s which are due to
    trying to stop/restart an event on the wrong CPU.

    Use the normal IPI pattern to ensure we run the code on the correct CPU.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Vince Weaver
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: bad7192b842c ("perf: Fix PERF_EVENT_IOC_PERIOD to force-reset the period")
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     
  • commit ee9397a6fb9bc4e52677f5e33eed4abee0f515e6 upstream.

    If rb->aux_refcount is decremented to zero before rb->refcount,
    __rb_free_aux() may be called twice resulting in a double free of
    rb->aux_pages. Fix this by adding a check to __rb_free_aux().
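
    A sketch of the guard, simplified from kernel/events/ring_buffer.c:
    zeroing the page count on the first pass makes a second call a no-op:

    static void __rb_free_aux(struct ring_buffer *rb)
    {
        int pg;

        if (rb->aux_nr_pages) {
            for (pg = 0; pg < rb->aux_nr_pages; pg++)
                rb_free_aux_page(rb, pg);

            kfree(rb->aux_pages);
            rb->aux_nr_pages = 0;    /* second call now does nothing */
        }
    }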

    Signed-off-by: Ben Hutchings
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 57ffc5ca679f ("perf: Fix AUX buffer refcounting")
    Link: http://lkml.kernel.org/r/1437953468.12842.17.camel@decadent.org.uk
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Ben Hutchings
     
  • commit 00a2916f7f82c348a2a94dbb572874173bc308a3 upstream.

    A recent fix to the shadow timestamp inadvertently broke the running
    time accounting.

    We must not update the running timestamp if we fail to schedule the
    event; the event will not have run. This can (and did) result in
    negative total runtime, because the stopped timestamp was before the
    running timestamp (we 'started' but never stopped the event -- because
    it never really started we didn't have to stop it either).

    Reported-and-Tested-by: Vince Weaver
    Fixes: 72f669c0086f ("perf: Update shadow timestamp before add event")
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Shaohua Li
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     
  • commit fed66e2cdd4f127a43fd11b8d92a99bdd429528c upstream.

    Vince reported that the fasync signal stuff doesn't work properly for
    inherited events. So fix that.

    Installing fasync allocates memory and sets filp->f_flags |= FASYNC,
    which upon the demise of the file descriptor ensures the allocation is
    freed and state is updated.

    Now for perf, we can have the events stick around for a while after the
    original FD is dead because of references from child events. So we
    cannot copy the fasync pointer around. We can however consistently use
    the parent's fasync, as that will be updated.
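
    A sketch of the helper the fix introduces (kernel/events/core.c,
    simplified): every reader resolves fasync through the parent event,
    whose file-backed state is the one kept up to date:

    static struct fasync_struct **perf_event_fasync(struct perf_event *event)
    {
        /* Only the parent event is tied to the file; child events
         * that outlive the fd borrow the parent's fasync state. */
        if (event->parent)
            event = event->parent;
        return &event->fasync;
    }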

    Reported-and-Tested-by: Vince Weaver
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: eranian@google.com
    Link: http://lkml.kernel.org/r/1434011521.1495.71.camel@twins
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     

17 Aug, 2015

2 commits

  • commit 3c00cb5e68dc719f2fc73a33b1b230aadfcb1309 upstream.

    This function can leak kernel stack data when the user siginfo_t has a
    positive si_code value. The top 16 bits of si_code describe which fields
    in the siginfo_t union are active, but they are treated inconsistently
    between copy_siginfo_from_user32, copy_siginfo_to_user32 and
    copy_siginfo_to_user.

    copy_siginfo_from_user32 is called from rt_sigqueueinfo and
    rt_tgsigqueueinfo, in which the user has full control over the top 16 bits
    of si_code.

    This fixes the following information leaks:
    x86: 8 bytes leaked when sending a signal from a 32-bit process to
    itself. This leak grows to 16 bytes if the process uses x32.
    (si_code = __SI_CHLD)
    x86: 100 bytes leaked when sending a signal from a 32-bit process to
    a 64-bit process. (si_code = -1)
    sparc: 4 bytes leaked when sending a signal from a 32-bit process to a
    64-bit process. (si_code = any)

    parisc and s390 have similar bugs, but they are not vulnerable because
    rt_[tg]sigqueueinfo have checks that prevent sending a positive si_code
    to a different process. These bugs are also fixed for consistency.

    Signed-off-by: Amanieu d'Antras
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Russell King
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: Chris Metcalf
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Amanieu d'Antras
     
  • commit 26135022f85105ad725cda103fa069e29e83bd16 upstream.

    This function may copy the si_addr_lsb, si_lower and si_upper fields to
    user mode when they haven't been initialized, which can leak kernel
    stack data to user mode.

    Just checking the value of si_code is insufficient because the same
    si_code value is shared between multiple signals. This is solved by
    checking the value of si_signo in addition to si_code.
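
    A sketch of the strengthened check in copy_siginfo_to_user(),
    simplified: the extra fields are copied only for the signals that
    actually define them:

    /* __SI_FAULT case: si_code alone is ambiguous because several
     * signals share these numeric code values. */
    #ifdef BUS_MCEERR_AO
        if (from->si_signo == SIGBUS &&
            (from->si_code == BUS_MCEERR_AR ||
             from->si_code == BUS_MCEERR_AO))
            err |= __put_user(from->si_addr_lsb, &to->si_addr_lsb);
    #endif
    #ifdef SEGV_BNDERR
        if (from->si_signo == SIGSEGV && from->si_code == SEGV_BNDERR) {
            err |= __put_user(from->si_lower, &to->si_lower);
            err |= __put_user(from->si_upper, &to->si_upper);
        }
    #endif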

    Signed-off-by: Amanieu d'Antras
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Amanieu d'Antras
     

11 Aug, 2015

2 commits

  • commit e3eea1404f5ff7a2ceb7b5e7ba412a6fd94f2935 upstream.

    Commit 4104d326b670 ("ftrace: Remove global function list and call function
    directly") simplified the ftrace code by removing the global_ops list with a
    new design. But this cleanup also broke the filtering of PIDs that are added
    to the set_ftrace_pid file.

    Add back the proper hooks to have pid filtering working once again.

    Reported-by: Matt Fleming
    Reported-by: Richard Weinberger
    Tested-by: Matt Fleming
    Signed-off-by: Steven Rostedt
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (Red Hat)
     
  • commit 75a06189fc508a2acf470b0b12710362ffb2c4b1 upstream.

    The resend mechanism happily calls the interrupt handler of interrupts
    which are marked IRQ_NESTED_THREAD from softirq context. This can
    result in crashes because the interrupt handler is not the proper way
    to invoke the device handlers. They must be invoked via
    handle_nested_irq.

    Prevent the resend even if the interrupt has no valid parent irq
    set. It's better to have a lost interrupt than a crashing machine.
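
    A sketch of the idea in the software resend path (kernel/irq/resend.c,
    simplified; the exact structure may differ): redirect nested-thread
    interrupts to their parent, or drop them when no parent is set:

    if (irq_settings_is_nested_thread(desc)) {
        /* The device handlers must run via handle_nested_irq();
         * from softirq context we can only retrigger the parent. */
        if (!desc->parent_irq)
            return;             /* better lost than crashing */
        irq = desc->parent_irq;
    }
    /* ... the (parent) irq is then raised by the resend mechanism ... */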

    Reported-by: Uwe Kleine-König
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

04 Aug, 2015

6 commits

  • commit d194e5d666225b04c7754471df0948f645b6ab3a upstream.

    The final version of commit 637241a900cb ("kmsg: honor dmesg_restrict
    sysctl on /dev/kmsg") lost a few hooks; as a result, security_syslog()
    is processed incorrectly:

    - open of /dev/kmsg checks syslog access permissions by using
    check_syslog_permissions() where security_syslog() is not called if
    dmesg_restrict is set.

    - the syslog syscall and /proc/kmsg call do_syslog(), where
    security_syslog() can be executed twice (inside
    check_syslog_permissions() and then directly in do_syslog())

    With this patch security_syslog() is called once only in all
    syslog-related operations regardless of dmesg_restrict value.

    Fixes: 637241a900cb ("kmsg: honor dmesg_restrict sysctl on /dev/kmsg")
    Signed-off-by: Vasily Averin
    Cc: Kees Cook
    Cc: Josh Boyer
    Cc: Eric Paris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vasily Averin
     
  • commit fff3b16d2754a061a3549c4307a186423a0128fd upstream.

    Many hard disks (mostly WD ones) have firmware problems and take too
    long, more than 10 seconds, to resume from suspend, which often
    exceeds the default DPM watchdog timeout (12 seconds) and results in a
    sudden kernel panic.

    Since most distros just take the default as is, we should pick a
    safer value. This patch increases the default from 12 seconds to one
    minute, which has been confirmed to be long enough for such
    problematic disks.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=91921
    Fixes: 70fea60d888d ("PM / Sleep: Detect device suspend/resume lockup and log event")
    Signed-off-by: Takashi Iwai
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Greg Kroah-Hartman

    Takashi Iwai
     
  • commit 6224beb12e190ff11f3c7d4bf50cb2922878f600 upstream.

    Fengguang Wu's tests triggered a bug in the branch tracer's start up
    test when CONFIG_DEBUG_PREEMPT is set. This is because that config
    adds some debug logic in the per-cpu field, which calls back into
    the branch tracer.

    The branch tracer has its own recursive checks, but uses a per cpu
    variable to implement it. If retrieving the per cpu variable calls
    back into the branch tracer, you can see how things will break.

    Instead of using a per cpu variable, use the trace_recursion field
    of the current task struct. Simply set a bit when entering the
    branch tracing and clear it when leaving. If the bit is set on
    entry, just don't do the tracing.

    There's also the case with lockdep, as the local_irq_save() called
    before the recursion check can also trigger code that can call back
    into the function. Changing that to a raw_local_irq_save() will protect
    that as well.

    This prevents the recursion and the inevitable crash that follows.
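
    A sketch of the guard, simplified from kernel/trace/trace_branch.c:
    the flag lives in current->trace_recursion, which can be tested and
    set without touching the per-cpu machinery:

    unsigned long flags;

    if (unlikely(trace_recursion_test(TRACE_BRANCH_BIT)))
        return;                     /* already inside the tracer */
    trace_recursion_set(TRACE_BRANCH_BIT);

    raw_local_irq_save(flags);      /* raw_: keep lockdep out of the way */
    /* ... record the branch event ... */
    raw_local_irq_restore(flags);

    trace_recursion_clear(TRACE_BRANCH_BIT);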

    Link: http://lkml.kernel.org/r/20150630141803.GA28071@wfg-t540p.sh.intel.com

    Reported-by: Fengguang Wu
    Tested-by: Fengguang Wu
    Signed-off-by: Steven Rostedt
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (Red Hat)
     
  • commit cc9e4bde03f2b4cfba52406c021364cbd2a4a0f3 upstream.

    The trace.h header, when included without CONFIG_EVENT_TRACING enabled
    (seldom done), will not compile because of a typo in the prototype
    of trace_event_enum_update().

    Signed-off-by: Steven Rostedt
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (Red Hat)
     
  • commit 6b88f44e161b9ee2a803e5b2b1fbcf4e20e8b980 upstream.

    While debugging a WARN_ON() for filtering, I found that it is possible
    for the filter string to be referenced after its end. With the filter:

    # echo '>' > /sys/kernel/debug/events/ext4/ext4_truncate_exit/filter

    The filter_parse() function can call infix_get_op() which calls
    infix_advance() that updates the infix filter pointers for the cnt
    and tail without checking if the filter is already at the end, which
    will put the cnt to zero and the tail beyond the end. The loop then calls
    infix_next() that has

    ps->infix.cnt--;
    return ps->infix.string[ps->infix.tail++];

    The cnt will now be below zero, and the tail that is returned is
    already past the end of the filter string. So far the allocation
    of the filter string usually has some buffer that is zeroed out, but
    if the filter string is of the exact size of the allocated buffer,
    there's no guarantee that the character after the nul terminating
    character will be zero.

    Luckily, only root can write to the filter.
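
    A sketch of the guarded helper (kernel/trace/trace_events_filter.c,
    simplified): bail out at the end of the string instead of walking
    past it:

    static char infix_next(struct filter_parse_state *ps)
    {
        if (!ps->infix.cnt)
            return 0;       /* never read past the filter string */

        ps->infix.cnt--;
        return ps->infix.string[ps->infix.tail++];
    }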

    Signed-off-by: Steven Rostedt
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (Red Hat)
     
  • commit b4875bbe7e68f139bd3383828ae8e994a0df6d28 upstream.

    When testing the fix for the trace filter, I could not come up with
    a scenario where the operand count goes below zero, so I added a
    WARN_ON_ONCE(cnt < 0) to the logic. But there is a legitimate case
    where it can happen (although the filter would be wrong):

    # echo '>' > /sys/kernel/debug/events/ext4/ext4_truncate_exit/filter

    That is, a single operation without any operands will hit the path
    where the WARN_ON_ONCE() can trigger. Although this is harmless and
    the filter is reported as an error, instead of spitting out a warning
    to the kernel dmesg, just fail nicely and report it via the proper
    channels.

    Link: http://lkml.kernel.org/r/558C6082.90608@oracle.com

    Reported-by: Vince Weaver
    Reported-by: Sasha Levin
    Signed-off-by: Steven Rostedt
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (Red Hat)
     

22 Jul, 2015

5 commits

  • commit 63781394c540dd9e666a6b21d70b64dd52bce76e upstream.

    request_any_context_irq() returns a negative value on failure and
    either IRQC_IS_HARDIRQ or IRQC_IS_NESTED on success, so fix the
    testing of request_any_context_irq()'s return value.

    Also fixup the return value of devm_request_any_context_irq() to make it
    consistent with request_any_context_irq().
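
    A sketch of the corrected test (kernel/irq/devres.c, simplified):
    success is a positive IRQC_* value, so only negative returns are
    errors:

    rc = request_any_context_irq(irq, handler, irqflags, devname, dev_id);
    if (rc < 0) {       /* was "if (rc)", which rejected valid successes */
        devres_free(dr);
        return rc;
    }

    dr->irq = irq;
    devres_add(dev, dr);
    return rc;          /* IRQC_IS_HARDIRQ or IRQC_IS_NESTED */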

    Fixes: 0668d3065128 ("genirq: Add devm_request_any_context_irq()")
    Signed-off-by: Axel Lin
    Reviewed-by: Stephen Boyd
    Link: http://lkml.kernel.org/r/1431334978.17783.4.camel@ingics.com
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Axel Lin
     
  • commit 9a1bd63cdae4b623494c4ebaf723a91c35ec49fb upstream.

    The list of loaded modules is walked through in
    module_kallsyms_on_each_symbol (called by kallsyms_on_each_symbol). The
    module_mutex lock should be acquired to prevent potential corruptions
    in the list.

    This was uncovered with new lockdep asserts in module code introduced
    by commit 0be964be0d45 ("module: Sanitize RCU usage and locking") in
    recent linux-next trees.

    Signed-off-by: Miroslav Benes
    Acked-by: Josh Poimboeuf
    Signed-off-by: Jiri Kosina
    Signed-off-by: Greg Kroah-Hartman

    Miroslav Benes
     
  • commit 6e91f8cb138625be96070b778d9ba71ce520ea7e upstream.

    If, at the time __rcu_process_callbacks() is invoked, there are callbacks
    in Tiny RCU's callback list, but none of them are ready to be invoked,
    the current list-management code will knit the non-ready callbacks out
    of the list. This can result in hangs and possibly worse. This commit
    therefore inserts a check for there being no callbacks that can be
    invoked immediately.
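
    A sketch of the inserted check (kernel/rcu/tiny.c, simplified): if the
    done-tail still points at the list head, nothing is ready, so leave
    the list untouched:

    static void __rcu_process_callbacks(struct rcu_ctrlblk *rcp)
    {
        unsigned long flags;

        local_irq_save(flags);
        if (rcp->donetail == &rcp->rcucblist) {
            /* No callbacks ready to invoke: don't knit anything out. */
            local_irq_restore(flags);
            return;
        }
        /* ... detach the ready sublist and invoke it ... */
    }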

    This bug is unlikely to occur -- you have to get a new callback between
    the time rcu_sched_qs() or rcu_bh_qs() was called, but before we get to
    __rcu_process_callbacks(). It was detected by the addition of RCU-bh
    testing to rcutorture, which in turn was instigated by Iftekhar Ahmed's
    mutation testing. Although this bug was made much more likely by
    915e8a4fe45e (rcu: Remove fastpath from __rcu_process_callbacks()), this
    did not cause the bug, but rather made it much more probable. That
    said, it takes more than 40 hours of rcutorture testing, on average,
    for this bug to appear, so this fix cannot be considered an emergency.

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett
    Signed-off-by: Greg Kroah-Hartman

    Paul E. McKenney
     
  • commit f9bb48825a6b5d02f4cabcc78967c75db903dcdc upstream.

    This allows for better documentation in the code and
    it allows for a simpler and fully correct version of
    fs_fully_visible to be written.

    The mount points converted and their filesystems are:
    /sys/hypervisor/s390/ s390_hypfs
    /sys/kernel/config/ configfs
    /sys/kernel/debug/ debugfs
    /sys/firmware/efi/efivars/ efivarfs
    /sys/fs/fuse/connections/ fusectl
    /sys/fs/pstore/ pstore
    /sys/kernel/tracing/ tracefs
    /sys/fs/cgroup/ cgroup
    /sys/kernel/security/ securityfs
    /sys/fs/selinux/ selinuxfs
    /sys/fs/smackfs/ smackfs

    Acked-by: Greg Kroah-Hartman
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit f9bd6733d3f11e24f3949becf277507d422ee1eb upstream.

    Add a magic sysctl table sysctl_mount_point that when used to
    create a directory forces that directory to be permanently empty.

    Update the code to use make_empty_dir_inode when accessing permanently
    empty directories.

    Update the code to not allow adding to permanently empty directories.

    Update /proc/sys/fs/binfmt_misc to be a permanently empty directory.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     

30 Jun, 2015

1 commit

  • commit 2f993cf093643b98477c421fa2b9a98dcc940323 upstream.

    While looking for other users of get_state/cond_sync, I found
    ring_buffer_attach(), and it looks obviously buggy.

    Don't we need to ensure that we have "synchronize" _between_
    list_del() and list_add() ?

    IOW, suppose that ring_buffer_attach() is preempted right after
    get_state_synchronize_rcu() and the grace period completes before
    spin_lock().

    In this case cond_synchronize_rcu() does nothing and we reuse
    ->rb_entry without waiting for a grace period in between.

    It also moves the ->rcu_pending check under "if (rb)", to make it
    more readable imo.
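
    A sketch of the reordering (kernel/events/core.c, simplified): snapshot
    the RCU state right after list_del_rcu(), and wait on it just before
    ->rb_entry is reused by list_add_rcu():

    if (event->rb) {
        spin_lock_irqsave(&event->rb->event_lock, flags);
        list_del_rcu(&event->rb_entry);
        spin_unlock_irqrestore(&event->rb->event_lock, flags);

        event->rcu_batches = get_state_synchronize_rcu();
        event->rcu_pending = 1;
    }

    if (rb) {
        if (event->rcu_pending) {
            /* ensure a full grace period between del and add */
            cond_synchronize_rcu(event->rcu_batches);
            event->rcu_pending = 0;
        }
        spin_lock_irqsave(&rb->event_lock, flags);
        list_add_rcu(&event->rb_entry, &rb->event_list);
        spin_unlock_irqrestore(&rb->event_lock, flags);
    }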

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dave@stgolabs.net
    Cc: der.herr@hofr.at
    Cc: josh@joshtriplett.org
    Cc: tj@kernel.org
    Fixes: b69cf53640da ("perf: Fix a race between ring_buffer_detach() and ring_buffer_attach()")
    Link: http://lkml.kernel.org/r/20150530200425.GA15748@redhat.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Oleg Nesterov
     

18 Jun, 2015

1 commit

  • Merge tag 'trace-fix-filter-4.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

    Pull tracing filter fix from Steven Rostedt:
    "Vince Weaver reported a warning when he added perf event filters into
    his fuzzer tests. There's a missing check of balanced operations when
    parenthesis are used, and this triggers a WARN_ON() and when reading
    the failure, the filter reports no failure occurred.

    The operands were not being checked if they match, this adds that"

    * tag 'trace-fix-filter-4.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: Have filter check for balanced ops

    Linus Torvalds
     

17 Jun, 2015

1 commit

  • When the following filter is used it causes a warning to trigger:

    # cd /sys/kernel/debug/tracing
    # echo "((dev==1)blocks==2)" > events/ext4/ext4_truncate_exit/filter
    -bash: echo: write error: Invalid argument
    # cat events/ext4/ext4_truncate_exit/filter
    ((dev==1)blocks==2)
    ^
    parse_error: No error

    ------------[ cut here ]------------
    WARNING: CPU: 2 PID: 1223 at kernel/trace/trace_events_filter.c:1640 replace_preds+0x3c5/0x990()
    Modules linked in: bnep lockd grace bluetooth ...
    CPU: 3 PID: 1223 Comm: bash Tainted: G W 4.1.0-rc3-test+ #450
    Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012
    0000000000000668 ffff8800c106bc98 ffffffff816ed4f9 ffff88011ead0cf0
    0000000000000000 ffff8800c106bcd8 ffffffff8107fb07 ffffffff8136b46c
    ffff8800c7d81d48 ffff8800d4c2bc00 ffff8800d4d4f920 00000000ffffffea
    Call Trace:
    [] dump_stack+0x4c/0x6e
    [] warn_slowpath_common+0x97/0xe0
    [] ? _kstrtoull+0x2c/0x80
    [] warn_slowpath_null+0x1a/0x20
    [] replace_preds+0x3c5/0x990
    [] create_filter+0x82/0xb0
    [] apply_event_filter+0xd4/0x180
    [] event_filter_write+0x8f/0x120
    [] __vfs_write+0x28/0xe0
    [] ? __sb_start_write+0x53/0xf0
    [] ? security_file_permission+0x30/0xc0
    [] vfs_write+0xb8/0x1b0
    [] SyS_write+0x4f/0xb0
    [] system_call_fastpath+0x12/0x6a
    ---[ end trace e11028bd95818dcd ]---

    Worse yet, reading the error message (the filter again), it says that
    there was no error, when there clearly was. The issue is that the
    code that checks the input does not check for balanced ops; that is,
    it does not verify that an op sits between a closed parenthesis and
    the next token.

    This would only cause a warning, and fail out before doing any real
    harm, but it should still not cause a warning, and the error reported
    should work:

    # cd /sys/kernel/debug/tracing
    # echo "((dev==1)blocks==2)" > events/ext4/ext4_truncate_exit/filter
    -bash: echo: write error: Invalid argument
    # cat events/ext4/ext4_truncate_exit/filter
    ((dev==1)blocks==2)
    ^
    parse_error: Meaningless filter expression

    And give no kernel warning.
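
    A sketch of the strengthened counting in check_preds()
    (kernel/trace/trace_events_filter.c, simplified): every operand pushes
    one value, every binary op consumes one, and a well-formed postfix
    expression must end with exactly one value:

    int cnt = 0, n_normal_preds = 0, n_logical_preds = 0;
    struct postfix_elt *elt;

    list_for_each_entry(elt, &ps->postfix, list) {
        if (elt->op == OP_NONE) {       /* an operand */
            cnt++;
            continue;
        }
        if (elt->op == OP_AND || elt->op == OP_OR) {
            n_logical_preds++;
            cnt--;                      /* two in, one out */
            continue;
        }
        if (elt->op != OP_NOT)
            cnt--;                      /* comparison op: two in, one out */
        n_normal_preds++;
        if (cnt < 0)                    /* op missing its operands */
            break;
    }

    if (cnt != 1 || !n_normal_preds || n_logical_preds >= n_normal_preds) {
        parse_error(ps, FILT_ERR_INVALID_FILTER, 0);
        return -EINVAL;
    }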

    Link: http://lkml.kernel.org/r/20150615175025.7e809215@gandalf.local.home

    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: stable@vger.kernel.org # 2.6.31+
    Reported-by: Vince Weaver
    Tested-by: Vince Weaver
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

15 Jun, 2015

1 commit