08 Dec, 2018

1 commit

  • commit 09d3f015d1e1b4fee7e9bbdcf54201d239393391 upstream.

    Commit:

    142b18ddc8143 ("uprobes: Fix handle_swbp() vs unregister() + register() race")

    added the UPROBE_COPY_INSN flag, and corresponding smp_wmb() and smp_rmb()
    memory barriers, to ensure that handle_swbp() uses fully-initialized
    uprobes only.

    However, the smp_rmb() is mis-placed: this barrier should be placed
    after handle_swbp() has tested for the flag, thus guaranteeing that
    (program-order) subsequent loads from the uprobe can see the initial
    stores performed by prepare_uprobe().

    Move the smp_rmb() accordingly. Also amend the comments associated
    with the two memory barriers to indicate their actual locations.
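    The publish/consume pattern described above can be modeled in userspace
    C11 atomics (an illustrative sketch with hypothetical names, not the
    kernel code): the writer performs its stores, issues a release fence
    (the smp_wmb()), then sets the flag; the reader tests the flag and only
    then issues the acquire fence (the moved smp_rmb()) before any
    subsequent loads.

```c
#include <stdatomic.h>
#include <assert.h>

/* Hypothetical stand-in for a uprobe guarded by a COPY_INSN-style flag. */
struct uprobe_like {
    int insn;              /* initialized by the "prepare" side */
    atomic_int flags;      /* COPY_INSN bit is published last */
};
#define COPY_INSN 1

static void prepare(struct uprobe_like *u)
{
    u->insn = 42;                                /* initial stores */
    atomic_thread_fence(memory_order_release);   /* pairs with smp_wmb() */
    atomic_fetch_or_explicit(&u->flags, COPY_INSN, memory_order_relaxed);
}

static int handle(struct uprobe_like *u)
{
    if (!(atomic_load_explicit(&u->flags, memory_order_relaxed) & COPY_INSN))
        return -1;                               /* not fully initialized */
    atomic_thread_fence(memory_order_acquire);   /* the moved smp_rmb()... */
    return u->insn;                              /* ...precedes these loads */
}
```

    The point of the fix is exactly the reader-side ordering: the acquire
    barrier must sit between the flag test and the later loads, otherwise
    nothing orders them.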

    Signed-off-by: Andrea Parri
    Acked-by: Oleg Nesterov
    Cc: Alexander Shishkin
    Cc: Andrew Morton
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: stable@kernel.org
    Fixes: 142b18ddc8143 ("uprobes: Fix handle_swbp() vs unregister() + register() race")
    Link: http://lkml.kernel.org/r/20181122161031.15179-1-andrea.parri@amarulasolutions.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Andrea Parri
     

04 Nov, 2018

2 commits

  • [ Upstream commit cd6fb677ce7e460c25bdd66f689734102ec7d642 ]

    Some of the scheduling tracepoints allow the perf_tp_event
    code to write to ring buffer under different cpu than the
    code is running on.

    This results in corrupted ring buffer data demonstrated in
    following perf commands:

    # perf record -e 'sched:sched_switch,sched:sched_wakeup' perf bench sched messaging
    # Running 'sched/messaging' benchmark:
    # 20 sender and receiver processes per group
    # 10 groups == 400 processes run

    Total time: 0.383 [sec]
    [ perf record: Woken up 8 times to write data ]
    0x42b890 [0]: failed to process type: -1765585640
    [ perf record: Captured and wrote 4.825 MB perf.data (29669 samples) ]

    # perf report --stdio
    0x42b890 [0]: failed to process type: -1765585640

    The reason for the corruption is that some of the scheduling tracepoints
    have __perf_task defined and thus allow storing data into another
    cpu's ring buffer:

    sched_waking
    sched_wakeup
    sched_wakeup_new
    sched_stat_wait
    sched_stat_sleep
    sched_stat_iowait
    sched_stat_blocked

    The perf_tp_event() function first stores samples for the current-cpu
    events defined for the tracepoint:

    hlist_for_each_entry_rcu(event, head, hlist_entry)
    perf_swevent_event(event, count, &data, regs);

    It then iterates the events of the 'task' and stores the sample
    for any of the task's events that pass the tracepoint checks:

    ctx = rcu_dereference(task->perf_event_ctxp[perf_sw_context]);

    list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
    if (event->attr.type != PERF_TYPE_TRACEPOINT)
    continue;
    if (event->attr.config != entry->type)
    continue;

    perf_swevent_event(event, count, &data, regs);
    }

    The above code can race with the same code running on another cpu,
    ending up with two cpus trying to store into the same ring
    buffer, which is specifically not allowed.

    This patch prevents the problem, by allowing only events with the same
    current cpu to receive the event.

    NOTE: this requires the use of (per-task-)per-cpu buffers for this
    feature to work; perf-record does this.
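    The effect of the fix can be modeled in a few lines of userspace C
    (illustrative types only, not the kernel's): while iterating the target
    task's events, skip any event bound to a different CPU, so two CPUs can
    never store into the same ring buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a tracepoint event bound to one CPU. */
struct tp_ev { int cpu; int hits; };

static void deliver_task_events(struct tp_ev *evs, size_t n, int this_cpu)
{
    for (size_t i = 0; i < n; i++) {
        if (evs[i].cpu != this_cpu)
            continue;              /* the added same-CPU check */
        evs[i].hits++;             /* stands in for perf_swevent_event() */
    }
}
```

    With the check in place, an event opened on another CPU simply receives
    nothing from this CPU's tracepoint invocation, which is why per-cpu
    buffers are required for the feature to keep working.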

    Signed-off-by: Jiri Olsa
    [peterz: small edits to Changelog]
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Andrew Vagin
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Fixes: e6dab5ffab59 ("perf/trace: Add ability to set a target task for events")
    Link: http://lkml.kernel.org/r/20180923161343.GB15054@krava
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Jiri Olsa
     
  • [ Upstream commit a9f9772114c8b07ae75bcb3654bd017461248095 ]

    When we unregister a PMU, we fail to serialize the @pmu_idr properly.
    Fix that by doing the entire thing under pmu_lock.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Fixes: 2e80a82a49c4 ("perf: Dynamic pmu types")
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Peter Zijlstra
     

13 Oct, 2018

1 commit

  • commit befb1b3c2703897c5b8ffb0044dc5d0e5f27c5d7 upstream.

    It is possible for a failure to occur while scheduling a
    pinned event. The initial portion of perf_event_read_local() contains
    the various error checks an event should pass before it can be
    considered valid. Ensure that the potential scheduling failure
    of a pinned event is checked for, and return a credible error when it
    occurs.
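    A userspace model of the added check (field names mirror the upstream
    patch, but this is a sketch, not the kernel function): a pinned event
    that is not running on the reading CPU gets a real error code instead
    of a bogus count.

```c
#include <assert.h>

#define EBUSY_ERR (-16)   /* models -EBUSY */

/* Hypothetical miniature of the relevant perf_event fields. */
struct ev { int pinned; int oncpu; };

static int read_local(const struct ev *e, int this_cpu)
{
    /* If this is a pinned event, it must be running on this CPU. */
    if (e->pinned && e->oncpu != this_cpu)
        return EBUSY_ERR;
    return 0;              /* remaining checks and the read would follow */
}
```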

    Suggested-by: Peter Zijlstra
    Signed-off-by: Reinette Chatre
    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra (Intel)
    Cc: fenghua.yu@intel.com
    Cc: tony.luck@intel.com
    Cc: acme@kernel.org
    Cc: gavin.hindman@intel.com
    Cc: jithu.joseph@intel.com
    Cc: dave.hansen@intel.com
    Cc: hpa@zytor.com
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/6486385d1f30336e9973b24c8c65f5079543d3d3.1537377064.git.reinette.chatre@intel.com
    Signed-off-by: Greg Kroah-Hartman

    Reinette Chatre
     

26 Sep, 2018

1 commit

  • commit 02e184476eff848273826c1d6617bb37e5bcc7ad upstream.

    Perf can record user stack data in response to a synchronous request, such
    as a tracepoint firing. If this happens under set_fs(KERNEL_DS), then we
    end up reading user stack data using __copy_from_user_inatomic() under
    set_fs(KERNEL_DS). I think this conflicts with the intention of using
    set_fs(KERNEL_DS). And it is explicitly forbidden by hardware on ARM64
    when both CONFIG_ARM64_UAO and CONFIG_ARM64_PAN are used.

    So fix this by forcing USER_DS when recording user stack data.
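    The save/force/restore pattern can be sketched as a userspace model
    (set_fs()/get_fs() here are hypothetical stand-ins for the kernel
    helpers, which take and restore the address-limit segment):

```c
#include <assert.h>

enum seg { KERNEL_DS, USER_DS };
static enum seg current_fs = KERNEL_DS;

static enum seg get_fs(void) { return current_fs; }
static void set_fs(enum seg s) { current_fs = s; }

/* Model of the fixed path: always copy the user stack under USER_DS,
 * restoring whatever segment the caller had on the way out. */
static int copy_user_stack(void)
{
    enum seg old = get_fs();   /* may be KERNEL_DS, e.g. tracepoint firing */
    set_fs(USER_DS);           /* the fix: read with user access limits */
    int copied_under_user_ds = (current_fs == USER_DS);
    set_fs(old);               /* restore the caller's segment */
    return copied_under_user_ds;
}
```

    This mirrors what the earlier perf_callchain_user() fix did for
    callchains, applied here to user stack data recording.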

    Signed-off-by: Yabin Cui
    Acked-by: Peter Zijlstra (Intel)
    Cc:
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 88b0193d9418 ("perf/callchain: Force USER_DS when invoking perf_callchain_user()")
    Link: http://lkml.kernel.org/r/20180823225935.27035-1-yabinc@google.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Yabin Cui
     

30 May, 2018

3 commits

  • [ Upstream commit 9e5b127d6f33468143d90c8a45ca12410e4c3fa7 ]

    Mark reported his arm64 perf fuzzer runs sometimes splat like:

    armv8pmu_read_counter+0x1e8/0x2d8
    armpmu_event_update+0x8c/0x188
    armpmu_read+0xc/0x18
    perf_output_read+0x550/0x11e8
    perf_event_read_event+0x1d0/0x248
    perf_event_exit_task+0x468/0xbb8
    do_exit+0x690/0x1310
    do_group_exit+0xd0/0x2b0
    get_signal+0x2e8/0x17a8
    do_signal+0x144/0x4f8
    do_notify_resume+0x148/0x1e8
    work_pending+0x8/0x14

    which asserts that we only call pmu::read() on ACTIVE events.

    The above callchain does:

    perf_event_exit_task()
    perf_event_exit_task_context()
    task_ctx_sched_out() // INACTIVE
    perf_event_exit_event()
    perf_event_set_state(EXIT) // EXIT
    sync_child_event()
    perf_event_read_event()
    perf_output_read()
    perf_output_read_group()
    leader->pmu->read()

    Which results in doing a pmu::read() on an !ACTIVE event.

    I _think_ this is 'new' since we added attr.inherit_stat, which added
    the perf_event_read_event() call to the exit path; without that,
    perf_event_read_output() would only trigger from samples, and for
    @event to trigger a sample, its leader _must_ be ACTIVE too.

    Still, adding this check makes it consistent with the @sub case for
    the siblings.

    Reported-and-Tested-by: Mark Rutland
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     
  • [ Upstream commit 33801b94741d6c3be9713c10aa627477216c21e2 ]

    There's two problems when installing cgroup events on CPUs: firstly
    list_update_cgroup_event() only tries to set cpuctx->cgrp for the
    first event, if that mismatches on @cgrp we'll not try again for later
    additions.

    Secondly, when we install a cgroup event into an active context, only
    issue an event reprogram when the event matches the current cgroup
    context. This avoids pointless event reprogramming.

    Signed-off-by: leilei.lin
    [ Improved the changelog and comments. ]
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: brendan.d.gregg@gmail.com
    Cc: eranian@gmail.com
    Cc: linux-kernel@vger.kernel.org
    Cc: yang_oliver@hotmail.com
    Link: http://lkml.kernel.org/r/20180306093637.28247-1-linxiulei@gmail.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    leilei.lin
     
  • [ Upstream commit c917e0f259908e75bd2a65877e25f9d90c22c848 ]

    When a perf_event is attached to a parent cgroup, it should count events
    for all child cgroups as well. In testing, however, perf stat on the
    parent cgroup reported the instruction count as "not counted" even
    though the test process was running in the child cgroup.

    We found this is because perf_event->cgrp and cpuctx->cgrp are not
    identical, so perf_event->cgrp is not updated properly.

    This patch fixes this by updating perf_cgroup properly for ancestor
    cgroup(s).

    Reported-by: Ephraim Park
    Signed-off-by: Song Liu
    Signed-off-by: Peter Zijlstra (Intel)
    Cc:
    Cc:
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/20180312165943.1057894-1-songliubraving@fb.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Song Liu
     

16 May, 2018

2 commits

  • commit 4411ec1d1993e8dbff2898390e3fed280d88e446 upstream.

    > kernel/events/ring_buffer.c:871 perf_mmap_to_page() warn: potential spectre issue 'rb->aux_pages'

    Userspace controls @pgoff through the fault address. Sanitize the
    array index before doing the array dereference.
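    The sanitization pattern is the kernel's array_index_nospec() idea:
    derive a branch-free mask that is all-ones when index < size and zero
    otherwise, then AND the index before the dereference so a mispredicted
    bounds check cannot speculatively read out of bounds. A userspace
    sketch (it relies on an arithmetic right shift, as the kernel's
    generic version does):

```c
#include <assert.h>
#include <stddef.h>

/* All-ones when index < size, zero otherwise, with no branch. */
static size_t index_mask_nospec(size_t index, size_t size)
{
    return ~(long)(index | (size - 1 - index)) >> (sizeof(long) * 8 - 1);
}

static int read_sanitized(const int *arr, size_t n, size_t idx)
{
    if (idx >= n)                      /* architectural bounds check */
        return -1;
    idx &= index_mask_nospec(idx, n);  /* clamp to 0 under misspeculation */
    return arr[idx];
}
```

    Architecturally nothing changes for in-bounds indices; the mask only
    matters on the speculative path where the bounds check was bypassed.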

    Reported-by: Dan Carpenter
    Signed-off-by: Peter Zijlstra (Intel)
    Cc:
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     
  • commit bfb3d7b8b906b66551424d7636182126e1d134c8 upstream.

    If get_callchain_buffers() fails to allocate the buffer, it decreases
    nr_callchain_events right away.

    There's therefore no point in checking the allocation error only for
    nr_callchain_events > 1. Remove that check.

    Signed-off-by: Jiri Olsa
    Tested-by: Arnaldo Carvalho de Melo
    Cc: Alexander Shishkin
    Cc: Andi Kleen
    Cc: H. Peter Anvin
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: syzkaller-bugs@googlegroups.com
    Cc: x86@kernel.org
    Link: http://lkml.kernel.org/r/20180415092352.12403-3-jolsa@kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: Greg Kroah-Hartman

    Jiri Olsa
     

26 Apr, 2018

2 commits

  • commit 78b562fbfa2cf0a9fcb23c3154756b690f4905c1 upstream.

    Return immediately when we find an issue in the user stack checks;
    otherwise the error value could get overwritten by the following check
    for PERF_SAMPLE_REGS_INTR.
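    The bug class is worth spelling out: an error code computed by one
    check is silently clobbered when a later check passes. A minimal
    illustrative model (names are hypothetical):

```c
#include <assert.h>

#define EINVAL_ERR (-22)   /* models -EINVAL */

static int validate_sample(int user_stack_ok, int regs_intr_ok)
{
    int ret = 0;

    if (!user_stack_ok)
        return EINVAL_ERR;   /* the fix: bail out immediately... */

    if (!regs_intr_ok)       /* ...so this check cannot overwrite ret */
        ret = EINVAL_ERR;

    return ret;
}
```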

    Signed-off-by: Jiri Olsa
    Cc: Alexander Shishkin
    Cc: Andi Kleen
    Cc: H. Peter Anvin
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: syzkaller-bugs@googlegroups.com
    Cc: x86@kernel.org
    Fixes: 60e2364e60e8 ("perf: Add ability to sample machine state on interrupt")
    Link: http://lkml.kernel.org/r/20180415092352.12403-1-jolsa@kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: Greg Kroah-Hartman

    Jiri Olsa
     
  • commit 5af44ca53d019de47efe6dbc4003dd518e5197ed upstream.

    The syzbot hit KASAN bug in perf_callchain_store having the entry stored
    behind the allocated bounds [1].

    We miss the sample_max_stack check for the initial event that allocates
    the callchain buffers. This missing check allows creating an event with
    a sample_max_stack value bigger than the global sysctl maximum:

    # sysctl -a | grep perf_event_max_stack
    kernel.perf_event_max_stack = 127

    # perf record -vv -C 1 -e cycles/max-stack=256/ kill
    ...
    perf_event_attr:
    size 112
    ...
    sample_max_stack 256
    ------------------------------------------------------------
    sys_perf_event_open: pid -1 cpu 1 group_fd -1 flags 0x8 = 4

    Note the '-C 1', which forces perf record to create just a single event.
    Otherwise it opens an event for every cpu; the sample_max_stack check
    then fails on the second event and all's fine.

    The fix is to run the sample_max_stack check also for the first event
    with callchains.

    [1] https://marc.info/?l=linux-kernel&m=152352732920874&w=2
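    A userspace model of the fix (illustrative names): the limit is
    enforced before the buffers are allocated, so it applies to the first
    event too, not only when nr_callchain_events > 1.

```c
#include <assert.h>

static int sysctl_perf_event_max_stack = 127;  /* the global sysctl */
static int nr_callchain_events;

static int get_callchain_buffers(int sample_max_stack)
{
    if (sample_max_stack > sysctl_perf_event_max_stack)
        return -22;          /* models -EINVAL, now checked unconditionally */
    nr_callchain_events++;   /* accounting happens only on success */
    return 0;
}
```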

    Reported-by: syzbot+7c449856228b63ac951e@syzkaller.appspotmail.com
    Signed-off-by: Jiri Olsa
    Cc: Alexander Shishkin
    Cc: Andi Kleen
    Cc: H. Peter Anvin
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: syzkaller-bugs@googlegroups.com
    Cc: x86@kernel.org
    Fixes: 97c79a38cd45 ("perf core: Per event callchain limit")
    Link: http://lkml.kernel.org/r/20180415092352.12403-2-jolsa@kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: Greg Kroah-Hartman

    Jiri Olsa
     

19 Apr, 2018

1 commit

  • commit 621b6d2ea297d0fb6030452c5bcd221f12165fcf upstream.

    A use-after-free bug was caught by KASAN while running usdt related
    code (BCC project. bcc/tests/python/test_usdt2.py):

    ==================================================================
    BUG: KASAN: use-after-free in uprobe_perf_close+0x222/0x3b0
    Read of size 4 at addr ffff880384f9b4a4 by task test_usdt2.py/870

    CPU: 4 PID: 870 Comm: test_usdt2.py Tainted: G W 4.16.0-next-20180409 #215
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
    Call Trace:
    dump_stack+0xc7/0x15b
    ? show_regs_print_info+0x5/0x5
    ? printk+0x9c/0xc3
    ? kmsg_dump_rewind_nolock+0x6e/0x6e
    ? uprobe_perf_close+0x222/0x3b0
    print_address_description+0x83/0x3a0
    ? uprobe_perf_close+0x222/0x3b0
    kasan_report+0x1dd/0x460
    ? uprobe_perf_close+0x222/0x3b0
    uprobe_perf_close+0x222/0x3b0
    ? probes_open+0x180/0x180
    ? free_filters_list+0x290/0x290
    trace_uprobe_register+0x1bb/0x500
    ? perf_event_attach_bpf_prog+0x310/0x310
    ? probe_event_disable+0x4e0/0x4e0
    perf_uprobe_destroy+0x63/0xd0
    _free_event+0x2bc/0xbd0
    ? lockdep_rcu_suspicious+0x100/0x100
    ? ring_buffer_attach+0x550/0x550
    ? kvm_sched_clock_read+0x1a/0x30
    ? perf_event_release_kernel+0x3e4/0xc00
    ? __mutex_unlock_slowpath+0x12e/0x540
    ? wait_for_completion+0x430/0x430
    ? lock_downgrade+0x3c0/0x3c0
    ? lock_release+0x980/0x980
    ? do_raw_spin_trylock+0x118/0x150
    ? do_raw_spin_unlock+0x121/0x210
    ? do_raw_spin_trylock+0x150/0x150
    perf_event_release_kernel+0x5d4/0xc00
    ? put_event+0x30/0x30
    ? fsnotify+0xd2d/0xea0
    ? sched_clock_cpu+0x18/0x1a0
    ? __fsnotify_update_child_dentry_flags.part.0+0x1b0/0x1b0
    ? pvclock_clocksource_read+0x152/0x2b0
    ? pvclock_read_flags+0x80/0x80
    ? kvm_sched_clock_read+0x1a/0x30
    ? sched_clock_cpu+0x18/0x1a0
    ? pvclock_clocksource_read+0x152/0x2b0
    ? locks_remove_file+0xec/0x470
    ? pvclock_read_flags+0x80/0x80
    ? fcntl_setlk+0x880/0x880
    ? ima_file_free+0x8d/0x390
    ? lockdep_rcu_suspicious+0x100/0x100
    ? ima_file_check+0x110/0x110
    ? fsnotify+0xea0/0xea0
    ? kvm_sched_clock_read+0x1a/0x30
    ? rcu_note_context_switch+0x600/0x600
    perf_release+0x21/0x40
    __fput+0x264/0x620
    ? fput+0xf0/0xf0
    ? do_raw_spin_unlock+0x121/0x210
    ? do_raw_spin_trylock+0x150/0x150
    ? SyS_fchdir+0x100/0x100
    ? fsnotify+0xea0/0xea0
    task_work_run+0x14b/0x1e0
    ? task_work_cancel+0x1c0/0x1c0
    ? copy_fd_bitmaps+0x150/0x150
    ? vfs_read+0xe5/0x260
    exit_to_usermode_loop+0x17b/0x1b0
    ? trace_event_raw_event_sys_exit+0x1a0/0x1a0
    do_syscall_64+0x3f6/0x490
    ? syscall_return_slowpath+0x2c0/0x2c0
    ? lockdep_sys_exit+0x1f/0xaa
    ? syscall_return_slowpath+0x1a3/0x2c0
    ? lockdep_sys_exit+0x1f/0xaa
    ? prepare_exit_to_usermode+0x11c/0x1e0
    ? enter_from_user_mode+0x30/0x30
    random: crng init done
    ? __put_user_4+0x1c/0x30
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2
    RIP: 0033:0x7f41d95f9340
    RSP: 002b:00007fffe71e4268 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
    RAX: 0000000000000000 RBX: 000000000000000d RCX: 00007f41d95f9340
    RDX: 0000000000000000 RSI: 0000000000002401 RDI: 000000000000000d
    RBP: 0000000000000000 R08: 00007f41ca8ff700 R09: 00007f41d996dd1f
    R10: 00007fffe71e41e0 R11: 0000000000000246 R12: 00007fffe71e4330
    R13: 0000000000000000 R14: fffffffffffffffc R15: 00007fffe71e4290

    Allocated by task 870:
    kasan_kmalloc+0xa0/0xd0
    kmem_cache_alloc_node+0x11a/0x430
    copy_process.part.19+0x11a0/0x41c0
    _do_fork+0x1be/0xa20
    do_syscall_64+0x198/0x490
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    Freed by task 0:
    __kasan_slab_free+0x12e/0x180
    kmem_cache_free+0x102/0x4d0
    free_task+0xfe/0x160
    __put_task_struct+0x189/0x290
    delayed_put_task_struct+0x119/0x250
    rcu_process_callbacks+0xa6c/0x1b60
    __do_softirq+0x238/0x7ae

    The buggy address belongs to the object at ffff880384f9b480
    which belongs to the cache task_struct of size 12928

    It occurs because the task_struct is freed before the perf_event which
    refers to the task, and the task's flags are checked during teardown of
    the event. perf_event_alloc() assigns the task_struct to hw.target of
    the perf_event, but there is no reference counting for it.

    As a fix, call get_task_struct() in perf_event_alloc() at the
    above-mentioned assignment, and put_task_struct() in _free_event().
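    The lifetime rule can be modeled with a simple refcount in userspace C
    (illustrative types; get_task()/put_task() stand in for
    get_task_struct()/put_task_struct()): the event takes a reference at
    allocation time and drops it at free time, so the task cannot
    disappear while an event still points at it.

```c
#include <assert.h>
#include <stddef.h>

struct task  { int refcnt; };
struct event { struct task *target; };

static struct task *get_task(struct task *t) { t->refcnt++; return t; }
static int put_task(struct task *t) { return --t->refcnt; } /* free at 0 */

static void event_alloc(struct event *e, struct task *t)
{
    e->target = get_task(t);   /* the added get_task_struct() */
}

static void event_free(struct event *e)
{
    put_task(e->target);       /* the added put_task_struct() */
    e->target = NULL;
}
```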

    Signed-off-by: Prashant Bhole
    Reviewed-by: Oleg Nesterov
    Acked-by: Peter Zijlstra (Intel)
    Cc:
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 63b6da39bb38e8f1a1ef3180d32a39d6 ("perf: Fix perf_event_exit_task() race")
    Link: http://lkml.kernel.org/r/20180409100346.6416-1-bhole_prashant_q7@lab.ntt.co.jp
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Prashant Bhole
     

08 Apr, 2018

1 commit

  • commit f67b15037a7a50c57f72e69a6d59941ad90a0f0f upstream.

    Annoyingly, modify_user_hw_breakpoint() unnecessarily complicates the
    modification of a breakpoint - simplify it and remove the pointless
    local variables.

    Also update the stale Docbook while at it.

    Signed-off-by: Linus Torvalds
    Acked-by: Thomas Gleixner
    Cc:
    Cc: Alexander Shishkin
    Cc: Andy Lutomirski
    Cc: Arnaldo Carvalho de Melo
    Cc: Frederic Weisbecker
    Cc: Jiri Olsa
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Vince Weaver
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     

29 Mar, 2018

1 commit

  • commit bd903afeb504db5655a45bb4cf86f38be5b1bf62 upstream.

    In ctx_resched(), EVENT_FLEXIBLE should be sched_out when EVENT_PINNED is
    added. However, ctx_resched() calculates ctx_event_type before checking
    this condition. As a result, pinned events will NOT get higher priority
    than flexible events.

    The following shows this issue on an Intel CPU (where ref-cycles can
    only use one hardware counter).

    1. First start:
    perf stat -C 0 -e ref-cycles -I 1000
    2. Then, in the second console, run:
    perf stat -C 0 -e ref-cycles:D -I 1000

    The second perf invocation uses a pinned event, which is expected to
    have higher priority. However, because it fails in ctx_resched(), it is
    never run.

    This patch fixes this by calculating ctx_event_type after re-evaluating
    event_type.

    Reported-by: Ephraim Park
    Signed-off-by: Song Liu
    Signed-off-by: Peter Zijlstra (Intel)
    Cc:
    Cc:
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Fixes: 487f05e18aa4 ("perf/core: Optimize event rescheduling on active contexts")
    Link: http://lkml.kernel.org/r/20180306055504.3283731-1-songliubraving@fb.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Song Liu
     

25 Feb, 2018

1 commit

  • [ Upstream commit 34900ec5c9577cc1b0f22887ac7349f458ba8ac2 ]

    Reset header size for namespace events, otherwise it only gets bigger in
    ctx iterations.

    Signed-off-by: Jiri Olsa
    Acked-by: Peter Zijlstra (Intel)
    Fixes: e422267322cd ("perf: Add PERF_RECORD_NAMESPACES to include namespaces related info")
    Link: http://lkml.kernel.org/n/tip-nlo4gonz9d4guyb8153ukzt0@git.kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Jiri Olsa
     

04 Feb, 2018

1 commit

  • [ Upstream commit 0e18dd12064e07519f7cbff4149ca7fff620cbed ]

    perf with the --namespace option leaks various memory objects, including namespaces:

    4.14.0+
    pid_namespace 1 12 2568 12 8
    user_namespace 1 39 824 39 8
    net_namespace 1 5 6272 5 8

    This happens because perf_fill_ns_link_info() fills a struct path
    (ns_path): during initialization, ns_path takes references on the
    related mnt and dentry, but without a matching path_put() nobody ever
    drops them. The leaked dentry is the name of the related namespace,
    and its leak prevents the unused namespace from being freed.

    Signed-off-by: Vasily Averin
    Acked-by: Peter Zijlstra
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Hari Bathini
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Thomas Gleixner
    Fixes: commit e422267322cd ("perf: Add PERF_RECORD_NAMESPACES to include namespaces related info")
    Link: http://lkml.kernel.org/r/c510711b-3904-e5e1-d296-61273d21118d@virtuozzo.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Vasily Averin
     

25 Dec, 2017

1 commit

  • commit 3382290ed2d5e275429cef510ab21889d3ccd164 upstream.

    [ Note, this is a Git cherry-pick of the following commit:

    506458efaf15 ("locking/barriers: Convert users of lockless_dereference() to READ_ONCE()")

    ... for easier x86 PTI code testing and back-porting. ]

    READ_ONCE() now has an implicit smp_read_barrier_depends() call, so it
    can be used instead of lockless_dereference() without any change in
    semantics.

    Signed-off-by: Will Deacon
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1508840570-22169-4-git-send-email-will.deacon@arm.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Will Deacon
     

10 Dec, 2017

1 commit

  • [ Upstream commit a9cd8194e1e6bd09619954721dfaf0f94fe2003e ]

    Event timestamps are serialized using ctx->lock, make sure to hold it
    over reading all values.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     

03 Nov, 2017

1 commit


02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne.
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

30 Oct, 2017

1 commit

  • The following commit:

    864c2357ca89 ("perf/core: Do not set cpuctx->cgrp for unscheduled cgroups")

    made list_update_cgroup_event() skip setting cpuctx->cgrp if no cgroup event
    targets %current's cgroup.

    This breaks perf_event's hierarchical support because events which target one
    of the ancestors get ignored.

    Fix it by using cgroup_is_descendant() test instead of equality.
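    The change from equality to ancestry can be sketched as a walk up the
    parent chain, the way cgroup_is_descendant() behaves (types here are
    illustrative, not the kernel's):

```c
#include <assert.h>
#include <stddef.h>

struct cg { struct cg *parent; };

/* Returns 1 if @ancestor is @c itself or any of its ancestors. */
static int is_descendant(const struct cg *c, const struct cg *ancestor)
{
    for (; c; c = c->parent)
        if (c == ancestor)
            return 1;
    return 0;
}
```

    With this test, an event targeting an ancestor cgroup still matches
    when %current runs in a descendant, restoring hierarchical behavior.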

    Signed-off-by: Tejun Heo
    Acked-by: Thomas Gleixner
    Cc: Arnaldo Carvalho de Melo
    Cc: David Carrillo-Cisneros
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: kernel-team@fb.com
    Cc: stable@vger.kernel.org # v4.9+
    Fixes: 864c2357ca89 ("perf/core: Do not set cpuctx->cgrp for unscheduled cgroups")
    Link: http://lkml.kernel.org/r/20171028164237.GA972780@devbig577.frc2.facebook.com
    Signed-off-by: Ingo Molnar

    Tejun Heo
     

10 Oct, 2017

2 commits

  • Update cgroup time when an event is scheduled in by descendants.

    Reviewed-and-tested-by: Jiri Olsa
    Signed-off-by: leilei.lin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: acme@kernel.org
    Cc: alexander.shishkin@linux.intel.com
    Cc: brendan.d.gregg@gmail.com
    Cc: yang_oliver@hotmail.com
    Link: http://lkml.kernel.org/r/CALPjY3mkHiekRkRECzMi9G-bjUQOvOjVBAqxmWkTzc-g+0LwMg@mail.gmail.com
    Signed-off-by: Ingo Molnar

    leilei.lin
     
  • Since commit:

    1fd7e4169954 ("perf/core: Remove perf_cpu_context::unique_pmu")

    ... when a PMU is unregistered then its associated ->pmu_cpu_context is
    unconditionally freed. Whilst this is fine for dynamically allocated
    context types (i.e. those registered using perf_invalid_context), this
    causes a problem for sharing of static contexts such as
    perf_{sw,hw}_context, which are used by multiple built-in PMUs and
    effectively have a global lifetime.

    Whilst testing the ARM SPE driver, which must use perf_sw_context to
    support per-task AUX tracing, unregistering the driver as a result of a
    module unload resulted in:

    Unable to handle kernel NULL pointer dereference at virtual address 00000038
    Internal error: Oops: 96000004 [#1] PREEMPT SMP
    Modules linked in: [last unloaded: arm_spe_pmu]
    PC is at ctx_resched+0x38/0xe8
    LR is at perf_event_exec+0x20c/0x278
    [...]
    ctx_resched+0x38/0xe8
    perf_event_exec+0x20c/0x278
    setup_new_exec+0x88/0x118
    load_elf_binary+0x26c/0x109c
    search_binary_handler+0x90/0x298
    do_execveat_common.isra.14+0x540/0x618
    SyS_execve+0x38/0x48

    since the software context has been freed and the ctx.pmu->pmu_disable_count
    field has been set to NULL.

    This patch fixes the problem by avoiding the freeing of static PMU contexts
    altogether. Whilst the sharing of dynamic contexts is questionable, this
    actually requires the caller to share their context pointer explicitly
    and so the burden is on them to manage the object lifetime.

    Reported-by: Kim Phillips
    Signed-off-by: Will Deacon
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Mark Rutland
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 1fd7e4169954 ("perf/core: Remove perf_cpu_context::unique_pmu")
    Link: http://lkml.kernel.org/r/1507040450-7730-1-git-send-email-will.deacon@arm.com
    Signed-off-by: Ingo Molnar

    Will Deacon
     

29 Sep, 2017

1 commit

  • The following commit:

    d9a50b0256 ("perf/aux: Ensure aux_wakeup represents most recent wakeup index")

    changed the AUX wakeup position calculation to rounddown(), which causes
    a division-by-zero in AUX overwrite mode (aka "snapshot mode").

    The zero denominator results from the fact that perf record doesn't set
    aux_watermark to anything, in which case the kernel will set it to half
    the AUX buffer size, but only for non-overwrite mode. In the overwrite
    mode aux_watermark stays zero.

    The good news is that in AUX overwrite mode wakeups don't happen and the
    related bookkeeping is not relevant, so we can simply forgo the whole
    wakeup update.
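    The shape of the fix, as an illustrative userspace model (names are
    hypothetical): in snapshot mode aux_watermark stays zero, so the wakeup
    bookkeeping is skipped entirely rather than dividing by it.

```c
#include <assert.h>

static long wakeups_done;

static void aux_update_wakeup(long pos, long watermark, int overwrite)
{
    if (overwrite)
        return;                      /* no wakeups in snapshot mode */
    /* rounddown(pos, watermark): safe here, watermark is non-zero
     * outside overwrite mode (the kernel defaults it to half the
     * AUX buffer size). */
    wakeups_done = pos - (pos % watermark);
}
```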

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: will.deacon@arm.com
    Link: http://lkml.kernel.org/r/20170906160811.16510-1-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     

21 Sep, 2017

1 commit

  • This patch fixes a bug exhibited by the following scenario:
    1. fd1 = perf_event_open with attr.config = ID1
    2. attach bpf program prog1 to fd1
    3. fd2 = perf_event_open with attr.config = ID1

    4. the user program closes fd2, and prog1 is detached from the tracepoint.
    5. the user program with fd1 no longer works properly, as the tracepoint
    produces no output any more.

    The issue happens at step 4. Multiple perf_event_open can be called
    successfully, but only one bpf prog pointer in the tp_event. In the
    current logic, any fd release for the same tp_event will free
    the tp_event->prog.

    The fix is to free tp_event->prog only when the closing fd
    corresponds to the one which registered the program.
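    The ownership check can be sketched like this. All type and function
    names here are illustrative stand-ins, not the kernel's:

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* The tp_event remembers which perf event registered the program, and
     * only that event's release detaches it. */
    struct tp_event_like {
        int prog;                       /* stands in for the bpf prog pointer */
        struct perf_event_like *owner;  /* the event that attached the prog */
    };

    struct perf_event_like {
        struct tp_event_like *tp;
    };

    static void attach_prog(struct perf_event_like *ev, int prog)
    {
        ev->tp->prog = prog;
        ev->tp->owner = ev;
    }

    static void release_event(struct perf_event_like *ev)
    {
        /* Before the fix: any fd release cleared tp->prog.
         * After the fix: only the registering event does. */
        if (ev->tp->owner == ev) {
            ev->tp->prog = 0;
            ev->tp->owner = NULL;
        }
    }

    int main(void)
    {
        struct tp_event_like tp = { 0, NULL };
        struct perf_event_like fd1 = { &tp }, fd2 = { &tp };

        attach_prog(&fd1, 42);
        release_event(&fd2);       /* step 4: closing fd2... */
        assert(tp.prog == 42);     /* ...must leave fd1's prog attached */
        release_event(&fd1);
        assert(tp.prog == 0);
        return 0;
    }
    ```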

    Signed-off-by: Yonghong Song
    Signed-off-by: David S. Miller

    Yonghong Song
     

07 Sep, 2017

2 commits

  • Pull cgroup updates from Tejun Heo:
    "Several notable changes this cycle:

    - Thread mode was merged. This will be used for cgroup2 support for
    CPU and possibly other controllers. Unfortunately, CPU controller
    cgroup2 support didn't make this pull request but most contentions
    have been resolved and the support is likely to be merged before
    the next merge window.

    - cgroup.stat now shows the number of descendant cgroups.

    - cpuset now can enable the easier-to-configure v2 behavior on v1
    hierarchy"

    * 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
    cpuset: Allow v2 behavior in v1 cgroup
    cgroup: Add mount flag to enable cpuset to use v2 behavior in v1 cgroup
    cgroup: remove unneeded checks
    cgroup: misc changes
    cgroup: short-circuit cset_cgroup_from_root() on the default hierarchy
    cgroup: re-use the parent pointer in cgroup_destroy_locked()
    cgroup: add cgroup.stat interface with basic hierarchy stats
    cgroup: implement hierarchy limits
    cgroup: keep track of number of descent cgroups
    cgroup: add comment to cgroup_enable_threaded()
    cgroup: remove unnecessary empty check when enabling threaded mode
    cgroup: update debug controller to print out thread mode information
    cgroup: implement cgroup v2 thread support
    cgroup: implement CSS_TASK_ITER_THREADED
    cgroup: introduce cgroup->dom_cgrp and threaded css_set handling
    cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS
    cgroup: reorganize cgroup.procs / task write path
    cgroup: replace css_set walking populated test with testing cgrp->nr_populated_csets
    cgroup: distinguish local and children populated states
    cgroup: remove now unused list_head @pending in cgroup_apply_cftypes()
    ...

    Linus Torvalds
     
  • Pull networking updates from David Miller:

    1) Support ipv6 checksum offload in sunvnet driver, from Shannon
    Nelson.

    2) Move to RB-tree instead of custom AVL code in inetpeer, from Eric
    Dumazet.

    3) Allow generic XDP to work on virtual devices, from John Fastabend.

    4) Add bpf device maps and XDP_REDIRECT, which can be used to build
    arbitrary switching frameworks using XDP. From John Fastabend.

    5) Remove UFO offloads from the tree, gave us little other than bugs.

    6) Remove the IPSEC flow cache, from Florian Westphal.

    7) Support ipv6 route offload in mlxsw driver.

    8) Support VF representors in bnxt_en, from Sathya Perla.

    9) Add support for forward error correction modes to ethtool, from
    Vidya Sagar Ravipati.

    10) Add time filter for packet scheduler action dumping, from Jamal Hadi
    Salim.

    11) Extend the zerocopy sendmsg() used by virtio and tap to regular
    sockets via MSG_ZEROCOPY. From Willem de Bruijn.

    12) Significantly rework value tracking in the BPF verifier, from Edward
    Cree.

    13) Add new jump instructions to eBPF, from Daniel Borkmann.

    14) Rework rtnetlink plumbing so that operations can be run without
    taking the RTNL semaphore. From Florian Westphal.

    15) Support XDP in tap driver, from Jason Wang.

    16) Add 32-bit eBPF JIT for ARM, from Shubham Bansal.

    17) Add Huawei hinic ethernet driver.

    18) Allow to report MD5 keys in TCP inet_diag dumps, from Ivan
    Delalande.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1780 commits)
    i40e: point wb_desc at the nvm_wb_desc during i40e_read_nvm_aq
    i40e: avoid NVM acquire deadlock during NVM update
    drivers: net: xgene: Remove return statement from void function
    drivers: net: xgene: Configure tx/rx delay for ACPI
    drivers: net: xgene: Read tx/rx delay for ACPI
    rocker: fix kcalloc parameter order
    rds: Fix non-atomic operation on shared flag variable
    net: sched: don't use GFP_KERNEL under spin lock
    vhost_net: correctly check tx avail during rx busy polling
    net: mdio-mux: add mdio_mux parameter to mdio_mux_init()
    rxrpc: Make service connection lookup always check for retry
    net: stmmac: Delete dead code for MDIO registration
    gianfar: Fix Tx flow control deactivation
    cxgb4: Ignore MPS_TX_INT_CAUSE[Bubble] for T6
    cxgb4: Fix pause frame count in t4_get_port_stats
    cxgb4: fix memory leak
    tun: rename generic_xdp to skb_xdp
    tun: reserve extra headroom only when XDP is set
    net: dsa: bcm_sf2: Configure IMP port TC2QOS mapping
    net: dsa: bcm_sf2: Advertise number of egress queues
    ...

    Linus Torvalds
     

05 Sep, 2017

1 commit

  • Pull x86 cache quality monitoring update from Thomas Gleixner:
    "This update provides a complete rewrite of the Cache Quality
    Monitoring (CQM) facility.

    The existing CQM support was duct taped into perf with a lot of issues
    and the attempts to fix those turned out to be incomplete and
    horrible.

    After lengthy discussions it was decided to integrate the CQM support
    into the Resource Director Technology (RDT) facility, which is the
    obvious choice as, in hardware, CQM is part of RDT. This allowed adding
    Memory Bandwidth Monitoring support on top.

    As a result the mechanisms for allocating cache/memory bandwidth and
    the corresponding monitoring mechanisms are integrated into a single
    management facility with a consistent user interface"

    * 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
    x86/intel_rdt: Turn off most RDT features on Skylake
    x86/intel_rdt: Add command line options for resource director technology
    x86/intel_rdt: Move special case code for Haswell to a quirk function
    x86/intel_rdt: Remove redundant ternary operator on return
    x86/intel_rdt/cqm: Improve limbo list processing
    x86/intel_rdt/mbm: Fix MBM overflow handler during CPU hotplug
    x86/intel_rdt: Modify the intel_pqr_state for better performance
    x86/intel_rdt/cqm: Clear the default RMID during hotcpu
    x86/intel_rdt: Show bitmask of shareable resource with other executing units
    x86/intel_rdt/mbm: Handle counter overflow
    x86/intel_rdt/mbm: Add mbm counter initialization
    x86/intel_rdt/mbm: Basic counting of MBM events (total and local)
    x86/intel_rdt/cqm: Add CPU hotplug support
    x86/intel_rdt/cqm: Add sched_in support
    x86/intel_rdt: Introduce rdt_enable_key for scheduling
    x86/intel_rdt/cqm: Add mount,umount support
    x86/intel_rdt/cqm: Add rmdir support
    x86/intel_rdt: Separate the ctrl bits from rmdir
    x86/intel_rdt/cqm: Add mon_data
    x86/intel_rdt: Prepare for RDT monitor data support
    ...

    Linus Torvalds
     

04 Sep, 2017

2 commits

  • Pull perf updates from Ingo Molnar:
    "Kernel side changes:

    - Add branch type profiling/tracing support. (Jin Yao)

    - Add the PERF_SAMPLE_PHYS_ADDR ABI to allow the tracing/profiling of
    physical memory addresses, where the PMU supports it. (Kan Liang)

    - Export some PMU capability details in the new
    /sys/bus/event_source/devices/cpu/caps/ sysfs directory. (Andi
    Kleen)

    - Aux data fixes and updates (Will Deacon)

    - kprobes fixes and updates (Masami Hiramatsu)

    - AMD uncore PMU driver fixes and updates (Janakarajan Natarajan)

    On the tooling side, here's a (limited!) list of highlights - there
    were many other changes that I could not list, see the shortlog and
    git history for details:

    UI improvements:

    - Implement a visual marker for fused x86 instructions in the
    annotate TUI browser, available now in 'perf report', more work
    needed to have it available as well in 'perf top' (Jin Yao)

    Further explanation from one of Jin's patches:

    │ ┌──cmpl $0x0,argp_program_version_hook
    81.93 │ ├──je 20
    │ │ lock cmpxchg %esi,0x38a9a4(%rip)
    │ │↓ jne 29
    │ │↓ jmp 43
    11.47 │20:└─→cmpxch %esi,0x38a999(%rip)

    That means the cmpl+je is a fused instruction pair and they should
    be considered together.

    - Record the branch type and then show statistics and info about in
    callchain entries (Jin Yao)

    Example from one of Jin's patches:

    # perf record -g -j any,save_type
    # perf report --branch-history --stdio --no-children

    38.50% div.c:45 [.] main div
    |
    ---main div.c:42 (RET CROSS_2M cycles:2)
    compute_flag div.c:28 (cycles:2)
    compute_flag div.c:27 (RET CROSS_2M cycles:1)
    rand rand.c:28 (cycles:1)
    rand rand.c:28 (RET CROSS_2M cycles:1)
    __random random.c:298 (cycles:1)
    __random random.c:297 (COND_BWD CROSS_2M cycles:1)
    __random random.c:295 (cycles:1)
    __random random.c:295 (COND_BWD CROSS_2M cycles:1)
    __random random.c:295 (cycles:1)
    __random random.c:295 (RET CROSS_2M cycles:9)

    namespaces support:

    - Add initial support for namespaces, using setns to access files in
    namespaces, grabbing their build-ids, etc. (Krister Johansen)

    perf trace enhancements:

    - Beautify pkey_{alloc,free,mprotect} arguments in 'perf trace'
    (Arnaldo Carvalho de Melo)

    - Add initial 'clone' syscall args beautifier in 'perf trace'
    (Arnaldo Carvalho de Melo)

    - Ignore 'fd' and 'offset' args for MAP_ANONYMOUS in 'perf trace'
    (Arnaldo Carvalho de Melo)

    - Beautifiers for the 'cmd' arg of several ioctl types, including:
    sound, DRM, KVM, vhost virtio and perf_events. (Arnaldo Carvalho de
    Melo)

    - Add PERF_SAMPLE_CALLCHAIN and PERF_RECORD_MMAP[2] to 'perf data'
    CTF conversion, allowing CTF trace visualization tools to show
    callchains and to resolve symbols (Geneviève Bastien)

    - Beautify the fcntl syscall, which is an interesting one in the
    sense that infrastructure had to be put in place to change the
    formatters of some arguments according to the value in a previous
    one, i.e. cmd dictates how arg and the syscall return will be
    formatted. (Arnaldo Carvalho de Melo)

    perf stat enhancements:

    - Use group read for event groups in 'perf stat', reducing overhead
    when groups are defined in the event specification, i.e. when using
    {} to enclose a list of events, asking them to be read at the same
    time, e.g.: "perf stat -e '{cycles,instructions}'" (Jiri Olsa)

    pipe mode improvements:

    - Process tracing data in 'perf annotate' pipe mode (David
    Carrillo-Cisneros)

    - Add header record types to pipe-mode, now this command:

    $ perf record -o - -e cycles sleep 1 | perf report --stdio --header

    Will show the same as in non-pipe mode, i.e. involving a perf.data
    file (David Carrillo-Cisneros)

    Vendor specific hardware event support updates/enhancements:

    - Update POWER9 vendor events tables (Sukadev Bhattiprolu)

    - Add POWER9 PMU events Sukadev (Bhattiprolu)

    - Support additional POWER8+ PVR in PMU mapfile (Shriya)

    - Add Skylake server uncore JSON vendor events (Andi Kleen)

    - Support exporting Intel PT data to sqlite3 with python perf
    scripts, this is in addition to the postgresql support that was
    already there (Adrian Hunter)"

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (253 commits)
    perf symbols: Fix plt entry calculation for ARM and AARCH64
    perf probe: Fix kprobe blacklist checking condition
    perf/x86: Fix caps/ for !Intel
    perf/core, x86: Add PERF_SAMPLE_PHYS_ADDR
    perf/core, pt, bts: Get rid of itrace_started
    perf trace beauty: Beautify pkey_{alloc,free,mprotect} arguments
    tools headers: Sync cpu features kernel ABI headers with tooling headers
    perf tools: Pass full path of FEATURES_DUMP
    perf tools: Robustify detection of clang binary
    tools lib: Allow external definition of CC, AR and LD
    perf tools: Allow external definition of flex and bison binary names
    tools build tests: Don't hardcode gcc name
    perf report: Group stat values on global event id
    perf values: Zero value buffers
    perf values: Fix allocation check
    perf values: Fix thread index bug
    perf report: Add dump_read function
    perf record: Set read_format for inherit_stat
    perf c2c: Fix remote HITM detection for Skylake
    perf tools: Fix static build with newer toolchains
    ...

    Linus Torvalds
     
  • Pull perf fixes from Thomas Gleixner:

    - Prevent a potential inconsistency in the perf user space access which
    might lead to evading sanity checks.

    - Prevent perf recording function trace entries twice

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/ftrace: Fix double traces of perf on ftrace:function
    perf/core: Fix potential double-fetch bug

    Linus Torvalds
     


01 Sep, 2017

1 commit

  • Commit 7c051267931a ("mm, fork: make dup_mmap wait for mmap_sem for
    write killable") made it possible to kill a forking task while it is
    waiting to acquire its ->mmap_sem for write, in dup_mmap().

    However, it was overlooked that this introduced a new error path before
    the new mm_struct's ->uprobes_state.xol_area has been set to NULL after
    being copied from the old mm_struct by the memcpy in dup_mm(). For a
    task that has previously hit a uprobe tracepoint, this resulted in the
    'struct xol_area' being freed multiple times if the task was killed at
    just the right time while forking.

    Fix it by setting ->uprobes_state.xol_area to NULL in mm_init() rather
    than in uprobe_dup_mmap().
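    The shape of the fix can be sketched in miniature. The struct and
    function names below are hypothetical stand-ins for mm_struct,
    mm_init() and dup_mm(), only the pointer-clearing pattern is the point:

    ```c
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* dup copies the whole struct with memcpy(), so the copy inherits the
     * original's xol_area pointer. If an error path tears down the copy
     * before that pointer is re-initialized, the same allocation is freed
     * twice. Clearing it in init, before any error path can run, closes
     * every later hole at once. */
    struct mm_like {
        int *xol_area;
    };

    static void mm_init_like(struct mm_like *mm)
    {
        mm->xol_area = NULL;   /* the fix: clear immediately in init */
    }

    static struct mm_like *dup_mm_like(const struct mm_like *oldmm)
    {
        struct mm_like *mm = malloc(sizeof(*mm));
        if (!mm)
            return NULL;
        memcpy(mm, oldmm, sizeof(*mm)); /* copies the stale xol_area too */
        mm_init_like(mm);               /* ...so NULL it right away */
        return mm;
    }

    int main(void)
    {
        int area;
        struct mm_like old = { .xol_area = &area };
        struct mm_like *new = dup_mm_like(&old);
        printf("%s\n", new->xol_area == NULL ? "cleared" : "stale");
        free(new);
        return 0;
    }
    ```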

    With CONFIG_UPROBE_EVENTS=y, the bug can be reproduced by the same C
    program given by commit 2b7e8665b4ff ("fork: fix incorrect fput of
    ->exe_file causing use-after-free"), provided that a uprobe tracepoint
    has been set on the fork_thread() function. For example:

    $ gcc reproducer.c -o reproducer -lpthread
    $ nm reproducer | grep fork_thread
    0000000000400719 t fork_thread
    $ echo "p $PWD/reproducer:0x719" > /sys/kernel/debug/tracing/uprobe_events
    $ echo 1 > /sys/kernel/debug/tracing/events/uprobes/enable
    $ ./reproducer

    Here is the use-after-free reported by KASAN:

    BUG: KASAN: use-after-free in uprobe_clear_state+0x1c4/0x200
    Read of size 8 at addr ffff8800320a8b88 by task reproducer/198

    CPU: 1 PID: 198 Comm: reproducer Not tainted 4.13.0-rc7-00015-g36fde05f3fb5 #255
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
    Call Trace:
    dump_stack+0xdb/0x185
    print_address_description+0x7e/0x290
    kasan_report+0x23b/0x350
    __asan_report_load8_noabort+0x19/0x20
    uprobe_clear_state+0x1c4/0x200
    mmput+0xd6/0x360
    do_exit+0x740/0x1670
    do_group_exit+0x13f/0x380
    get_signal+0x597/0x17d0
    do_signal+0x99/0x1df0
    exit_to_usermode_loop+0x166/0x1e0
    syscall_return_slowpath+0x258/0x2c0
    entry_SYSCALL_64_fastpath+0xbc/0xbe

    ...

    Allocated by task 199:
    save_stack_trace+0x1b/0x20
    kasan_kmalloc+0xfc/0x180
    kmem_cache_alloc_trace+0xf3/0x330
    __create_xol_area+0x10f/0x780
    uprobe_notify_resume+0x1674/0x2210
    exit_to_usermode_loop+0x150/0x1e0
    prepare_exit_to_usermode+0x14b/0x180
    retint_user+0x8/0x20

    Freed by task 199:
    save_stack_trace+0x1b/0x20
    kasan_slab_free+0xa8/0x1a0
    kfree+0xba/0x210
    uprobe_clear_state+0x151/0x200
    mmput+0xd6/0x360
    copy_process.part.8+0x605f/0x65d0
    _do_fork+0x1a5/0xbd0
    SyS_clone+0x19/0x20
    do_syscall_64+0x22f/0x660
    return_from_SYSCALL_64+0x0/0x7a

    Note: without KASAN, you may instead see a "Bad page state" message, or
    simply a general protection fault.

    Link: http://lkml.kernel.org/r/20170830033303.17927-1-ebiggers3@gmail.com
    Fixes: 7c051267931a ("mm, fork: make dup_mmap wait for mmap_sem for write killable")
    Signed-off-by: Eric Biggers
    Reported-by: Oleg Nesterov
    Acked-by: Oleg Nesterov
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Dmitry Vyukov
    Cc: Ingo Molnar
    Cc: Konstantin Khlebnikov
    Cc: Mark Rutland
    Cc: Michal Hocko
    Cc: Peter Zijlstra
    Cc: Vlastimil Babka
    Cc: [4.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Biggers
     

29 Aug, 2017

5 commits

  • For understanding how the workload maps to memory channels and hardware
    behavior, it's very important to collect address maps with physical
    addresses. For example, 3D XPoint access can only be found by filtering
    the physical address.

    Add a new sample type for physical address.

    perf already has a facility to collect data virtual address. This patch
    introduces a function to convert the virtual address to physical address.
    The function is quite generic and can be extended to any architecture as
    long as a virtual address is provided.

    - For kernel direct mapping addresses, virt_to_phys is used to convert
    the virtual addresses to physical address.

    - For user virtual addresses, __get_user_pages_fast is used to walk the
    page tables to obtain the user physical address.

    - This does not work for vmalloc addresses right now. These are not
    resolved, but code to do that could be added.

    The new sample type requires collecting the virtual address. The
    virtual address will not be output unless SAMPLE_ADDR is applied.

    For security, the physical address can only be exposed to root or
    privileged user.
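    The dispatch described above can be sketched as follows. PAGE_OFFSET
    here is an illustrative constant and the user-page branch is a
    stand-in; real code uses virt_to_phys() and __get_user_pages_fast():

    ```c
    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_OFFSET 0xffff888000000000ULL  /* illustrative value */

    static uint64_t sample_phys_addr(uint64_t vaddr, int from_user)
    {
        if (!from_user && vaddr >= PAGE_OFFSET)
            return vaddr - PAGE_OFFSET;        /* kernel direct mapping */
        if (from_user)
            return 0x1000 | (vaddr & 0xfff);   /* stand-in for a page walk */
        return 0;                              /* e.g. vmalloc: unresolved */
    }

    int main(void)
    {
        /* direct-mapped kernel address: simple offset subtraction */
        printf("%#llx\n", (unsigned long long)
               sample_phys_addr(PAGE_OFFSET + 0x2000, 0));
        return 0;
    }
    ```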

    Tested-by: Madhavan Srinivasan
    Signed-off-by: Kan Liang
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: acme@kernel.org
    Cc: mpe@ellerman.id.au
    Link: http://lkml.kernel.org/r/1503967969-48278-1-git-send-email-kan.liang@intel.com
    Signed-off-by: Ingo Molnar

    Kan Liang
     
  • I just noticed that hw.itrace_started and hw.config are aliased to the
    same location. Now, the PT driver happens to use both, which works out
    fine by sheer luck:

    - STORE(hw.itrace_started) is ordered before STORE(hw.config) in
    program order, although there are no compiler barriers to ensure that,

    - to perf_log_itrace_start(), hw.itrace_started appears set at the
    time it is intended to be set, because both stores happen on the
    same path,

    - hw.config is never reset to zero in the PT driver.

    Now, the use of hw.config by the PT driver makes more sense (it being a
    HW PMU) than messing around with itrace_started, which is an awkward API
    to begin with.

    This patch replaces hw.itrace_started with an attach_state bit and an
    API call for the PMU drivers to use to communicate the condition.
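    The replacement's shape can be sketched as a flag bit plus a small
    setter for PMU drivers. The bit value and all names here are
    illustrative, not copied from the kernel:

    ```c
    #include <assert.h>

    /* Illustrative attach_state flag standing in for the bit that
     * replaces the hw.itrace_started field. */
    #define PERF_ATTACH_ITRACE 0x10

    struct perf_event_like {
        unsigned int attach_state;
    };

    /* API call a PMU driver makes once tracing has actually started */
    static void perf_event_itrace_started(struct perf_event_like *event)
    {
        event->attach_state |= PERF_ATTACH_ITRACE;
    }

    static int itrace_started(const struct perf_event_like *event)
    {
        return !!(event->attach_state & PERF_ATTACH_ITRACE);
    }

    int main(void)
    {
        struct perf_event_like ev = { 0 };
        assert(!itrace_started(&ev));
        perf_event_itrace_started(&ev);   /* driver signals the condition */
        assert(itrace_started(&ev));
        return 0;
    }
    ```

    A state bit avoids aliasing a separate field onto hw.config, so the
    driver no longer depends on the lucky ordering described above.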

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: vince@deater.net
    Link: http://lkml.kernel.org/r/20170330153956.25994-1-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • When running perf on the ftrace:function tracepoint, there is a bug
    which can be reproduced by:

    perf record -e ftrace:function -a sleep 20 &
    perf record -e ftrace:function ls
    perf script

    ls 10304 [005] 171.853235: ftrace:function:
    perf_output_begin
    ls 10304 [005] 171.853237: ftrace:function:
    perf_output_begin
    ls 10304 [005] 171.853239: ftrace:function:
    task_tgid_nr_ns
    ls 10304 [005] 171.853240: ftrace:function:
    task_tgid_nr_ns
    ls 10304 [005] 171.853242: ftrace:function:
    __task_pid_nr_ns
    ls 10304 [005] 171.853244: ftrace:function:
    __task_pid_nr_ns

    We can see that all the function traces are doubled.

    The problem is caused by an inconsistency between the register
    function perf_ftrace_event_register() and the probe function
    perf_ftrace_function_call(): the former registers one probe
    for every perf_event, while the latter handles all perf_events
    on the current cpu. So with two perf_events on the current cpu,
    their traces are doubled.

    This patch therefore adds an extra "event" parameter to perf_tp_event(),
    which sends sample data only to that event when it is not NULL.
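    The delivery logic can be sketched like this, with simplified types
    standing in for perf_tp_event() and the per-cpu event list:

    ```c
    #include <assert.h>
    #include <stddef.h>

    struct event_like { int samples; };

    static void deliver(struct event_like *ev) { ev->samples++; }

    /* When 'only' is non-NULL, the sample goes to that single event
     * instead of being broadcast to every event on the current cpu. */
    static void tp_event_like(struct event_like **cpu_events, int n,
                              struct event_like *only)
    {
        if (only) {                 /* ftrace:function path: one probe */
            deliver(only);
            return;
        }
        for (int i = 0; i < n; i++) /* legacy path: all events on this cpu */
            deliver(cpu_events[i]);
    }

    int main(void)
    {
        struct event_like a = { 0 }, b = { 0 };
        struct event_like *list[] = { &a, &b };

        tp_event_like(list, 2, &a);   /* targeted: only 'a' samples */
        assert(a.samples == 1 && b.samples == 0);
        tp_event_like(list, 2, NULL); /* broadcast path unchanged */
        assert(a.samples == 2 && b.samples == 1);
        return 0;
    }
    ```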

    Signed-off-by: Zhou Chengming
    Reviewed-by: Jiri Olsa
    Acked-by: Steven Rostedt (VMware)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: acme@kernel.org
    Cc: alexander.shishkin@linux.intel.com
    Cc: huawei.libin@huawei.com
    Link: http://lkml.kernel.org/r/1503668977-12526-1-git-send-email-zhouchengming1@huawei.com
    Signed-off-by: Ingo Molnar

    Zhou Chengming
     
    While examining the kernel source code, I found a dangerous operation that
    could turn into a double-fetch situation (a race condition bug) where the
    same userspace memory region is fetched twice into the kernel, with sanity
    checks after the first fetch but no checks after the second fetch.

    1. The first fetch happens in line 9573 get_user(size, &uattr->size).

    2. Subsequently the 'size' variable undergoes a few sanity checks and
    transformations (line 9577 to 9584).

    3. The second fetch happens in line 9610 copy_from_user(attr, uattr, size)

    4. Given that 'uattr' can be fully controlled in userspace, an attacker can
    race to override 'uattr->size' with an arbitrary value (say, 0xFFFFFFFF)
    after the first fetch but before the second fetch. The changed value will be
    copied to 'attr->size'.

    5. There are no further checks on 'attr->size' until the end of this
    function, and once the function returns, we lose the context needed to
    verify that 'attr->size' conforms to the sanity checks performed in step 2
    (line 9577 to 9584).

    6. My manual analysis shows that 'attr->size' is not used elsewhere later,
    so there is no working exploit against it right now. However, this could
    easily turn into an exploitable bug if careless developers start to use
    'attr->size' later.

    To fix this, override the 'attr->size' from the second fetch with the value
    from the first fetch, regardless of what was actually copied in.

    In this way, it is assured that 'attr->size' is consistent with the checks
    performed after the first fetch.
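    The hardening can be sketched like this. memcpy() stands in for
    get_user()/copy_from_user(), and the struct and sanity check are
    illustrative, not the real perf_event_attr handling:

    ```c
    #include <assert.h>
    #include <stdint.h>
    #include <string.h>

    struct attr_like {
        uint32_t size;
        uint32_t config;
    };

    static int fetch_attr(struct attr_like *attr,
                          const struct attr_like *uattr)
    {
        uint32_t size;

        memcpy(&size, &uattr->size, sizeof(size));   /* first fetch */
        if (size > sizeof(*attr))                    /* sanity check */
            return -1;

        memcpy(attr, uattr, sizeof(*attr));          /* second fetch: racy */
        attr->size = size;   /* the fix: keep only the checked value */
        return 0;
    }

    int main(void)
    {
        struct attr_like user = { sizeof(struct attr_like), 7 };
        struct attr_like kern;

        assert(fetch_attr(&kern, &user) == 0 && kern.config == 7);
        user.size = 0xFFFFFFFFu;                 /* attacker-controlled */
        assert(fetch_attr(&kern, &user) == -1);  /* rejected up front */
        return 0;
    }
    ```

    Even if a racing writer changes uattr->size between the two fetches,
    attr->size ends up holding the value that passed the sanity check.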

    Signed-off-by: Meng Xu
    Acked-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Cc: acme@kernel.org
    Cc: alexander.shishkin@linux.intel.com
    Cc: meng.xu@gatech.edu
    Cc: sanidhya@gatech.edu
    Cc: taesoo@gatech.edu
    Link: http://lkml.kernel.org/r/1503522470-35531-1-git-send-email-meng.xu@gatech.edu
    Signed-off-by: Ingo Molnar

    Meng Xu
     

25 Aug, 2017

2 commits

    In XDP redirect applications using the tracepoint xdp:xdp_redirect to
    diagnose TX overruns, I noticed perf_swevent_get_recursion_context()
    was consuming 2% CPU. This simple change reduced that to 1.85%.

    Looking at the annotated asm code, it was clear that the compiler had
    laid out the in_nmi() test as the most likely branch. This small
    adjustment makes the compiler (GCC version 7.1.1 20170622
    (Red Hat 7.1.1-3)) treat in_nmi() as an unlikely branch.
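    The annotation involved can be reproduced in miniature. unlikely() is
    reimplemented here, in_nmi_sim stands in for the kernel's in_nmi(),
    and the return values are illustrative:

    ```c
    #include <assert.h>

    /* The kernel's unlikely() expands to __builtin_expect(), which only
     * steers the compiler's block layout; the value is unchanged. */
    #define unlikely(x) __builtin_expect(!!(x), 0)

    static int in_nmi_sim;   /* stand-in for in_nmi() */

    static int recursion_context(void)
    {
        if (unlikely(in_nmi_sim))   /* now laid out as the cold branch */
            return 3;
        return 0;                   /* hot path falls through */
    }

    int main(void)
    {
        assert(recursion_context() == 0);
        in_nmi_sim = 1;
        assert(recursion_context() == 3);
        return 0;
    }
    ```

    The hint changes generated-code layout, not semantics, which is why a
    one-line adjustment can shave measurable CPU off a hot tracepoint path.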

    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/150342256382.16595.986861478681783732.stgit@firesoul
    Signed-off-by: Ingo Molnar

    Jesper Dangaard Brouer
     
    An exiting/dead task has no PIDs, and in this case perf_event_pid/tid()
    return zero; change them to return -1 to distinguish this case from
    idle threads.
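    The convention can be sketched like this, with illustrative types
    standing in for the task struct and pid_alive():

    ```c
    #include <assert.h>

    struct task_like {
        int pid;     /* 0 for the idle thread, and 0 once a task is dead */
        int alive;
    };

    /* Report -1 for a dead task so it cannot be confused with the idle
     * thread, whose legitimate PID is 0. */
    static int event_pid_like(const struct task_like *t)
    {
        if (t->pid == 0 && !t->alive)
            return -1;   /* exiting/dead task */
        return t->pid;
    }

    int main(void)
    {
        struct task_like idle = { 0, 1 }, dead = { 0, 0 }, normal = { 42, 1 };
        assert(event_pid_like(&idle) == 0);
        assert(event_pid_like(&dead) == -1);
        assert(event_pid_like(&normal) == 42);
        return 0;
    }
    ```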

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20170822155928.GA6892@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov