20 Jan, 2021

4 commits

  • [ Upstream commit 69ca310f34168eae0ada434796bfc22fb4a0fa26 ]

    On some systems, some variant of the following splat is
    repeatedly seen. The common factor in all traces seems
    to be the entry point to task_file_seq_next(). With the
    patch, all warnings go away.

    rcu: INFO: rcu_sched self-detected stall on CPU
    rcu:     26-....: (20992 ticks this GP) idle=d7e/1/0x4000000000000002 softirq=81556231/81556231 fqs=4876
        (t=21033 jiffies g=159148529 q=223125)
    NMI backtrace for cpu 26
    CPU: 26 PID: 2015853 Comm: bpftool Kdump: loaded Not tainted 5.6.13-0_fbk4_3876_gd8d1f9bf80bb #1
    Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A12 10/08/2018
    Call Trace:

    dump_stack+0x50/0x70
    nmi_cpu_backtrace.cold.6+0x13/0x50
    ? lapic_can_unplug_cpu.cold.30+0x40/0x40
    nmi_trigger_cpumask_backtrace+0xba/0xca
    rcu_dump_cpu_stacks+0x99/0xc7
    rcu_sched_clock_irq.cold.90+0x1b4/0x3aa
    ? tick_sched_do_timer+0x60/0x60
    update_process_times+0x24/0x50
    tick_sched_timer+0x37/0x70
    __hrtimer_run_queues+0xfe/0x270
    hrtimer_interrupt+0xf4/0x210
    smp_apic_timer_interrupt+0x5e/0x120
    apic_timer_interrupt+0xf/0x20

    RIP: 0010:get_pid_task+0x38/0x80
    Code: 89 f6 48 8d 44 f7 08 48 8b 00 48 85 c0 74 2b 48 83 c6 55 48 c1 e6 04 48 29 f0 74 19 48 8d 78 20 ba 01 00 00 00 f0 0f c1 50 20 d2 74 27 78 11 83 c2 01 78 0c 48 83 c4 08 c3 31 c0 48 83 c4 08
    RSP: 0018:ffffc9000d293dc8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
    RAX: ffff888637c05600 RBX: ffffc9000d293e0c RCX: 0000000000000000
    RDX: 0000000000000001 RSI: 0000000000000550 RDI: ffff888637c05620
    RBP: ffffffff8284eb80 R08: ffff88831341d300 R09: ffff88822ffd8248
    R10: ffff88822ffd82d0 R11: 00000000003a93c0 R12: 0000000000000001
    R13: 00000000ffffffff R14: ffff88831341d300 R15: 0000000000000000
    ? find_ge_pid+0x1b/0x20
    task_seq_get_next+0x52/0xc0
    task_file_seq_get_next+0x159/0x220
    task_file_seq_next+0x4f/0xa0
    bpf_seq_read+0x159/0x390
    vfs_read+0x8a/0x140
    ksys_read+0x59/0xd0
    do_syscall_64+0x42/0x110
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f95ae73e76e
    Code: Bad RIP value.
    RSP: 002b:00007ffc02c1dbf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
    RAX: ffffffffffffffda RBX: 000000000170faa0 RCX: 00007f95ae73e76e
    RDX: 0000000000001000 RSI: 00007ffc02c1dc30 RDI: 0000000000000007
    RBP: 00007ffc02c1ec70 R08: 0000000000000005 R09: 0000000000000006
    R10: fffffffffffff20b R11: 0000000000000246 R12: 00000000019112a0
    R13: 0000000000000000 R14: 0000000000000007 R15: 00000000004283c0

    If unable to obtain the file structure for the current task,
    proceed to the next task number after the one returned from
    task_seq_get_next(), instead of the next task number from the
    original iterator.

    Also, save the stopping task number from task_seq_get_next()
    on failure in case of restarts.

    Fixes: eaaacd23910f ("bpf: Add task and task/file iterator targets")
    Signed-off-by: Jonathan Lemon
    Signed-off-by: Daniel Borkmann
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20201218185032.2464558-2-jonathan.lemon@gmail.com
    Signed-off-by: Sasha Levin

    Jonathan Lemon
     
  • [ Upstream commit 91b2db27d3ff9ad29e8b3108dfbf1e2f49fe9bd3 ]

    Simplify task_file_seq_get_next() by removing two in/out arguments: task
    and fstruct. Use info->task and info->files instead.

    Signed-off-by: Song Liu
    Signed-off-by: Daniel Borkmann
    Acked-by: Yonghong Song
    Link: https://lore.kernel.org/bpf/20201120002833.2481110-1-songliubraving@fb.com
    Signed-off-by: Sasha Levin

    Song Liu
     
  • [ Upstream commit 1b04fa9900263b4e217ca2509fd778b32c2b4eb2 ]

    PowerPC testing encountered boot failures due to RCU Tasks not being
    fully initialized until core_initcall() time. This commit therefore
    initializes RCU Tasks (along with Rude RCU and RCU Tasks Trace) just
    before early_initcall() time, thus allowing waiting on RCU Tasks grace
    periods from early_initcall() handlers.

    Link: https://lore.kernel.org/rcu/87eekfh80a.fsf@dja-thinkpad.axtens.net/
    Fixes: 36dadef23fcc ("kprobes: Init kprobes in early_initcall")
    Tested-by: Daniel Axtens
    Signed-off-by: Uladzislau Rezki (Sony)
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Sasha Levin

    Uladzislau Rezki (Sony)
     
  • commit 7bb83f6fc4ee84e95d0ac0d14452c2619fb3fe70 upstream.

    Enable the notrace function check on architectures which don't
    support kprobes on ftrace but do support dynamic ftrace. This notrace
    function check is not only for kprobes on ftrace but also for
    sw-breakpoint based kprobes.
    Thus there is no reason to limit this check to architectures which
    support kprobes on ftrace.

    This also changes the Kconfig dependency. Because kprobe events
    use the function tracer's address list for identifying notrace
    functions, with CONFIG_DYNAMIC_FTRACE=n they cannot check whether
    the target function is notrace or not.

    Link: https://lkml.kernel.org/r/20210105065730.2634785-1-naveen.n.rao@linux.vnet.ibm.com
    Link: https://lkml.kernel.org/r/161007957862.114704.4512260007555399463.stgit@devnote2

    Cc: stable@vger.kernel.org
    Fixes: 45408c4f92506 ("tracing: kprobes: Prohibit probing on notrace function")
    Acked-by: Naveen N. Rao
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Masami Hiramatsu
     

13 Jan, 2021

1 commit

  • [ Upstream commit 01341fbd0d8d4e717fc1231cdffe00343088ce0b ]

    In realtime scenarios, we do not want to have interference on the
    isolated cpu cores, but when invoking alloc_workqueue() for a percpu wq
    on the housekeeping cpu, it kicks a kworker on the isolated cpu.

    alloc_workqueue
      pwq_adjust_max_active
        wake_up_worker

    The comment in pwq_adjust_max_active() said:
    "Need to kick a worker after thawed or an unbound wq's
    max_active is bumped"

    So it is unnecessary to kick a kworker for a percpu wq when invoking
    alloc_workqueue(). This patch only kicks a worker based on the actual
    activation of delayed works.
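
    As a rough sketch of the resulting logic in pwq_adjust_max_active() (the
    local 'kick' flag name is illustrative; the exact placement may differ
    from the upstream patch):

    bool kick = false;

    while (!list_empty(&pwq->delayed_works) &&
           pwq->nr_active < pwq->max_active) {
            pwq_activate_first_delayed(pwq);
            kick = true;    /* a delayed work item was actually activated */
    }

    /*
     * Need to kick a worker after thawed or an unbound wq's max_active
     * is bumped -- but only when work items were actually activated,
     * which leaves idle isolated CPUs alone.
     */
    if (kick)
            wake_up_worker(pwq->pool);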

    Signed-off-by: Yunfeng Ye
    Reviewed-by: Lai Jiangshan
    Signed-off-by: Tejun Heo
    Signed-off-by: Sasha Levin

    Yunfeng Ye
     

09 Jan, 2021

4 commits

  • [ Upstream commit f7cfd871ae0c5008d94b6f66834e7845caa93c15 ]

    Recently syzbot reported[0] that there is a deadlock amongst the users
    of exec_update_mutex. The problematic lock ordering found by lockdep
    was:

    perf_event_open  (exec_update_mutex -> ovl_i_mutex)
    chown            (ovl_i_mutex       -> sb_writes)
    sendfile         (sb_writes         -> p->lock)
                       by reading from a proc file and writing to overlayfs
    proc_pid_syscall (p->lock           -> exec_update_mutex)

    While looking at possible solutions it occurred to me that all of the
    users and possible users involved only wanted the state of the given
    process to remain the same. They are all readers. The only writer is
    exec.

    There is no reason for readers to block on each other. So fix
    this deadlock by transforming exec_update_mutex into a rw_semaphore
    named exec_update_lock that only exec takes for writing.
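
    Conceptually the conversion gives the usual rw_semaphore pattern; the
    sketch below assumes exec_update_lock lives in signal_struct like the
    mutex it replaces:

    /* readers (perf_event_open, kcmp, proc_pid_syscall, ...) */
    down_read(&task->signal->exec_update_lock);
    /* ... inspect the task's exec state; other readers may run in parallel ... */
    up_read(&task->signal->exec_update_lock);

    /* the only writer: exec itself */
    if (down_write_killable(&me->signal->exec_update_lock))
            return -EINTR;
    /* ... install the new mm, creds, etc. ... */
    up_write(&me->signal->exec_update_lock);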

    Cc: Jann Horn
    Cc: Vasiliy Kulikov
    Cc: Al Viro
    Cc: Bernd Edlinger
    Cc: Oleg Nesterov
    Cc: Christopher Yeoh
    Cc: Cyrill Gorcunov
    Cc: Sargun Dhillon
    Cc: Christian Brauner
    Cc: Arnd Bergmann
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Fixes: eea9673250db ("exec: Add exec_update_mutex to replace cred_guard_mutex")
    [0] https://lkml.kernel.org/r/00000000000063640c05ade8e3de@google.com
    Reported-by: syzbot+db9cdf3dd1f64252c6ef@syzkaller.appspotmail.com
    Link: https://lkml.kernel.org/r/87ft4mbqen.fsf@x220.int.ebiederm.org
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Sasha Levin

    Eric W. Biederman
     
  • [ Upstream commit 31784cff7ee073b34d6eddabb95e3be2880a425c ]

    In preparation for converting exec_update_mutex to a rwsem so that
    multiple readers can execute in parallel and not deadlock, add
    down_read_interruptible. This is needed for perf_event_open to be
    converted (with no semantic changes) from working on a mutex to
    working on a rwsem.
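
    A hedged sketch of how a reader uses the new primitive (the lock name is
    carried over from this series; the exact call site in perf_event_open
    differs):

    err = down_read_interruptible(&task->signal->exec_update_lock);
    if (err)
            return err;     /* -EINTR: a signal arrived while waiting */
    /* ... read-only access to the task's exec state ... */
    up_read(&task->signal->exec_update_lock);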

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/87k0tybqfy.fsf@x220.int.ebiederm.org
    Signed-off-by: Sasha Levin

    Eric W. Biederman
     
  • [ Upstream commit 0f9368b5bf6db0c04afc5454b1be79022a681615 ]

    In preparation for converting exec_update_mutex to a rwsem so that
    multiple readers can execute in parallel and not deadlock, add
    down_read_killable_nested. This is needed so that kcmp_lock
    can be converted from working on mutexes to working on rw_semaphores.
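
    A sketch of the kcmp-style ordered double lock this enables (ordering by
    address and the SINGLE_DEPTH_NESTING subclass are the usual conventions;
    details may differ from the actual kcmp_lock()):

    static int kcmp_lock_sketch(struct rw_semaphore *l1, struct rw_semaphore *l2)
    {
            int err;

            if (l2 > l1)
                    swap(l1, l2);   /* stable ordering avoids ABBA deadlocks */

            err = down_read_killable(l1);
            if (!err && likely(l1 != l2)) {
                    err = down_read_killable_nested(l2, SINGLE_DEPTH_NESTING);
                    if (err)
                            up_read(l1);
            }
            return err;
    }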

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/87o8jabqh3.fsf@x220.int.ebiederm.org
    Signed-off-by: Sasha Levin

    Eric W. Biederman
     
  • [ Upstream commit 78af4dc949daaa37b3fcd5f348f373085b4e858f ]

    Syzbot reported a lock inversion involving perf. The sore point being
    perf holding exec_update_mutex() for a very long time, specifically
    across a whole bunch of filesystem ops in pmu::event_init() (uprobes)
    and anon_inode_getfile().

    This then inverts against procfs code trying to take
    exec_update_mutex.

    Move the permission checks later, such that we need to hold the mutex
    over less code.

    Reported-by: syzbot+db9cdf3dd1f64252c6ef@syzkaller.appspotmail.com
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Sasha Levin

    peterz@infradead.org
     

06 Jan, 2021

4 commits

  • [ Upstream commit ba8ea8e7dd6e1662e34e730eadfc52aa6816f9dd ]

    can_stop_idle_tick() checks whether the do_timer() duty has been taken over
    by a CPU on boot. That's silly because the boot CPU always takes over with
    the initial clockevent device.

    But even if no CPU would have installed a clockevent and taken over the
    duty then the question whether the tick on the current CPU can be stopped
    or not is moot. In that case the current CPU would have no clockevent
    either, so there would be nothing to keep ticking.

    Remove it.

    Signed-off-by: Thomas Gleixner
    Acked-by: Frederic Weisbecker
    Link: https://lore.kernel.org/r/20201206212002.725238293@linutronix.de
    Signed-off-by: Sasha Levin

    Thomas Gleixner
     
  • [ Upstream commit 38dc717e97153e46375ee21797aa54777e5498f3 ]

    Apparently there has been a longstanding race between udev/systemd and
    the module loader. Currently, the module loader sends a uevent right
    after sysfs initialization, but before the module calls its init
    function. However, some udev rules expect that the module has
    initialized already upon receiving the uevent.

    This race has been triggered recently (see link in references) in some
    systemd mount unit files. For instance, the configfs module creates the
    /sys/kernel/config mount point in its init function, however the module
    loader issues the uevent before this happens. sys-kernel-config.mount
    expects to be able to mount /sys/kernel/config upon receipt of the
    module loading uevent, but if the configfs module has not called its
    init function yet, then this directory will not exist and the mount unit
    fails. A similar situation exists for sys-fs-fuse-connections.mount, as
    the fuse sysfs mount point is created during the fuse module's init
    function. If udev is faster than module initialization then the mount
    unit would fail in a similar fashion.

    To fix this race, delay the module KOBJ_ADD uevent until after the
    module has finished calling its init routine.
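
    In effect the KOBJ_ADD notification moves into do_init_module(), after
    the init call, roughly (the error label name here is illustrative):

    /* do_init_module(), sketch */
    int ret = 0;

    if (mod->init != NULL)
            ret = do_one_initcall(mod->init);
    if (ret < 0)
            goto fail_free_freeinit;

    /*
     * The module is fully initialized only at this point, so this is now
     * where udev is told about it; rules that expect module-created sysfs
     * entries (e.g. /sys/kernel/config) can then succeed.
     */
    kobject_uevent(&mod->mkobj.kobj, KOBJ_ADD);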

    References: https://github.com/systemd/systemd/issues/17586
    Reviewed-by: Greg Kroah-Hartman
    Tested-By: Nicolas Morey-Chaisemartin
    Signed-off-by: Jessica Yu
    Signed-off-by: Sasha Levin

    Jessica Yu
     
  • [ Upstream commit 5e8ed280dab9eeabc1ba0b2db5dbe9fe6debb6b5 ]

    If a module fails to load due to an error in prepare_coming_module(),
    the subsequent error handling in load_module() runs with the module's
    state still set to MODULE_STATE_COMING. Fix it by correctly setting
    MODULE_STATE_GOING under the "bug_cleanup" label.
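
    The fix amounts to one assignment on that error path, roughly:

    /* load_module() error path, sketch */
    bug_cleanup:
            mod->state = MODULE_STATE_GOING;
            /* module notifiers and cleanup now see a module on its way out,
             * not one that still appears to be COMING */
            mutex_lock(&module_mutex);
            module_bug_cleanup(mod);
            mutex_unlock(&module_mutex);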

    Signed-off-by: Miroslav Benes
    Signed-off-by: Jessica Yu
    Signed-off-by: Sasha Levin

    Miroslav Benes
     
  • commit 2d18e54dd8662442ef5898c6bdadeaf90b3cebbc upstream.

    A memory leak is found in cgroup1_parse_param() when multiple source
    parameters overwrite fc->source in the fs_context struct without free.

    unreferenced object 0xffff888100d930e0 (size 16):
    comm "mount", pid 520, jiffies 4303326831 (age 152.783s)
    hex dump (first 16 bytes):
    74 65 73 74 6c 65 61 6b 00 00 00 00 00 00 00 00 testleak........
    backtrace:
    [] kmemdup_nul+0x2d/0xa0
    [] vfs_parse_fs_string+0xc0/0x150
    [] generic_parse_monolithic+0x15a/0x1d0
    [] path_mount+0xee1/0x1820
    [] do_mount+0xea/0x100
    [] __x64_sys_mount+0x14b/0x1f0

    Fix this bug by permitting a single source parameter and rejecting with
    an error all subsequent ones.
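
    A sketch of the check (invalf() is the fs_context helper that logs the
    message and returns -EINVAL; the error string is illustrative):

    /* cgroup1_parse_param(), handling of the "source" pseudo-option, sketch */
    if (fc->source)
            return invalf(fc, "Multiple sources not supported");
    fc->source = param->string;
    param->string = NULL;   /* ownership moves to fc, so nothing leaks */
    return 0;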

    Fixes: 8d2451f4994f ("cgroup1: switch to option-by-option parsing")
    Reported-by: Hulk Robot
    Signed-off-by: Qinglang Miao
    Reviewed-by: Zefan Li
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Qinglang Miao
     

30 Dec, 2020

12 commits

  • commit adab66b71abfe206a020f11e561f4df41f0b2aba upstream.

    It was believed that metag was the only architecture that required the ring
    buffer to keep 8 byte words aligned on 8 byte architectures, and with its
    removal, it was assumed that the ring buffer code did not need to handle
    this case. It appears that sparc64 also requires this.

    The following was reported on a sparc64 boot up:

    kernel: futex hash table entries: 65536 (order: 9, 4194304 bytes, linear)
    kernel: Running postponed tracer tests:
    kernel: Testing tracer function:
    kernel: Kernel unaligned access at TPC[552a20] trace_function+0x40/0x140
    kernel: Kernel unaligned access at TPC[552a24] trace_function+0x44/0x140
    kernel: Kernel unaligned access at TPC[552a20] trace_function+0x40/0x140
    kernel: Kernel unaligned access at TPC[552a24] trace_function+0x44/0x140
    kernel: Kernel unaligned access at TPC[552a20] trace_function+0x40/0x140
    kernel: PASSED

    Need to put back the 64BIT aligned code for the ring buffer.

    Link: https://lore.kernel.org/r/CADxRZqzXQRYgKc=y-KV=S_yHL+Y8Ay2mh5ezeZUnpRvg+syWKw@mail.gmail.com

    Cc: stable@vger.kernel.org
    Fixes: 86b3de60a0b6 ("ring-buffer: Remove HAVE_64BIT_ALIGNED_ACCESS")
    Reported-by: Anatoly Pugachev
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (VMware)
     
  • commit 60efe21e5976d3d4170a8190ca76a271d6419754 upstream.

    Disable ftrace selftests when any tracer (kernel command line options
    like ftrace=, trace_events=, kprobe_events=, and boot-time tracing)
    starts running because selftest can disturb it.

    Currently ftrace= and trace_events= are checked, but kprobe_events
    has a different flag, and boot-time tracing is not checked at all. This
    unifies the disabled flag so that all of those boot-time tracing features
    set it.

    This also fixes warnings on the kprobe-event selftest
    (CONFIG_FTRACE_STARTUP_TEST=y and CONFIG_KPROBE_EVENTS=y) with boot-time
    tracing (ftrace.event.kprobes.EVENT.probes) like below:

    [ 59.803496] trace_kprobe: Testing kprobe tracing:
    [ 59.804258] ------------[ cut here ]------------
    [ 59.805682] WARNING: CPU: 3 PID: 1 at kernel/trace/trace_kprobe.c:1987 kprobe_trace_self_tests_ib
    [ 59.806944] Modules linked in:
    [ 59.807335] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 5.10.0-rc7+ #172
    [ 59.808029] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/204
    [ 59.808999] RIP: 0010:kprobe_trace_self_tests_init+0x5f/0x42b
    [ 59.809696] Code: e8 03 00 00 48 c7 c7 30 8e 07 82 e8 6d 3c 46 ff 48 c7 c6 00 b2 1a 81 48 c7 c7 7
    [ 59.812439] RSP: 0018:ffffc90000013e78 EFLAGS: 00010282
    [ 59.813038] RAX: 00000000ffffffef RBX: 0000000000000000 RCX: 0000000000049443
    [ 59.813780] RDX: 0000000000049403 RSI: 0000000000049403 RDI: 000000000002deb0
    [ 59.814589] RBP: ffffc90000013e90 R08: 0000000000000001 R09: 0000000000000001
    [ 59.815349] R10: 0000000000000001 R11: 0000000000000000 R12: 00000000ffffffef
    [ 59.816138] R13: ffff888004613d80 R14: ffffffff82696940 R15: ffff888004429138
    [ 59.816877] FS: 0000000000000000(0000) GS:ffff88807dcc0000(0000) knlGS:0000000000000000
    [ 59.817772] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 59.818395] CR2: 0000000001a8dd38 CR3: 0000000002222000 CR4: 00000000000006a0
    [ 59.819144] Call Trace:
    [ 59.819469] ? init_kprobe_trace+0x6b/0x6b
    [ 59.819948] do_one_initcall+0x5f/0x300
    [ 59.820392] ? rcu_read_lock_sched_held+0x4f/0x80
    [ 59.820916] kernel_init_freeable+0x22a/0x271
    [ 59.821416] ? rest_init+0x241/0x241
    [ 59.821841] kernel_init+0xe/0x10f
    [ 59.822251] ret_from_fork+0x22/0x30
    [ 59.822683] irq event stamp: 16403349
    [ 59.823121] hardirqs last enabled at (16403359): [] console_unlock+0x48e/0x580
    [ 59.824074] hardirqs last disabled at (16403368): [] console_unlock+0x3f6/0x580
    [ 59.825036] softirqs last enabled at (16403200): [] __do_softirq+0x33a/0x484
    [ 59.825982] softirqs last disabled at (16403087): [] asm_call_irq_on_stack+0x10
    [ 59.827034] ---[ end trace 200c544775cdfeb3 ]---
    [ 59.827635] trace_kprobe: error on probing function entry.

    Link: https://lkml.kernel.org/r/160741764955.3448999.3347769358299456915.stgit@devnote2

    Fixes: 4d655281eb1b ("tracing/boot Add kprobe event support")
    Cc: Ingo Molnar
    Cc: stable@vger.kernel.org
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Masami Hiramatsu
     
  • commit 950cc0d2bef078e1f6459900ca4d4b2a2e0e3c37 upstream.

    The handle_inode_event() interface was added as (quoting comment):
    "a simple variant of handle_event() for groups that only have inode
    marks and don't have ignore mask".

    In other words, all backends except fanotify. The inotify backend
    also falls under this category, but because it required extra arguments
    it was left out of the initial pass of backends conversion to the
    simple interface.

    This results in code duplication between the generic helper
    fsnotify_handle_event() and the inotify_handle_event() callback
    which also happen to be buggy code.

    Generalize the handle_inode_event() arguments and add the check for
    FS_EXCL_UNLINK flag to the generic helper, so inotify backend could
    be converted to use the simple interface.

    Link: https://lore.kernel.org/r/20201202120713.702387-2-amir73il@gmail.com
    CC: stable@vger.kernel.org
    Fixes: b9a1b9772509 ("fsnotify: create method handle_inode_event() in fsnotify_operations")
    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Amir Goldstein
     
  • commit 406100f3da08066c00105165db8520bbc7694a36 upstream.

    One of our machines keeled over trying to rebuild the scheduler domains.
    Mainline produces the same splat:

    BUG: unable to handle page fault for address: 0000607f820054db
    CPU: 2 PID: 149 Comm: kworker/1:1 Not tainted 5.10.0-rc1-master+ #6
    Workqueue: events cpuset_hotplug_workfn
    RIP: build_sched_domains
    Call Trace:
    partition_sched_domains_locked
    rebuild_sched_domains_locked
    cpuset_hotplug_workfn

    It happens with cgroup2 and exclusive cpusets only. This reproducer
    triggers it on an 8-cpu vm and works most effectively with no
    preexisting child cgroups:

    cd $UNIFIED_ROOT
    mkdir cg1
    echo 4-7 > cg1/cpuset.cpus
    echo root > cg1/cpuset.cpus.partition

    # with smt/control reading 'on',
    echo off > /sys/devices/system/cpu/smt/control

    RIP maps to

    sd->shared = *per_cpu_ptr(sdd->sds, sd_id);

    from sd_init(). sd_id is calculated earlier in the same function:

    cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
    sd_id = cpumask_first(sched_domain_span(sd));

    tl->mask(cpu), which reads cpu_sibling_map on x86, returns an empty mask
    and so cpumask_first() returns >= nr_cpu_ids, which leads to the bogus
    value from per_cpu_ptr() above.

    The problem is a race between cpuset_hotplug_workfn() and a later
    offline of CPU N. cpuset_hotplug_workfn() updates the effective masks
    when N is still online, the offline clears N from cpu_sibling_map, and
    then the worker uses the stale effective masks that still have N to
    generate the scheduling domains, leading the worker to read
    N's empty cpu_sibling_map in sd_init().

    rebuild_sched_domains_locked() prevented the race during the cgroup2
    cpuset series up until the Fixes commit changed its check. Make the
    check more robust so that it can detect an offline CPU in any exclusive
    cpuset's effective mask, not just the top one.

    Fixes: 0ccea8feb980 ("cpuset: Make generate_sched_domains() work with partition")
    Signed-off-by: Daniel Jordan
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Tejun Heo
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20201112171711.639541-1-daniel.m.jordan@oracle.com
    Signed-off-by: Greg Kroah-Hartman

    Daniel Jordan
     
  • [ Upstream commit 57efa1fe5957694fa541c9062de0a127f0b9acb0 ]

    Since commit 70e806e4e645 ("mm: Do early cow for pinned pages during
    fork() for ptes") pages under a FOLL_PIN will not be write protected
    during COW for fork. This means that pages returned from
    pin_user_pages(FOLL_WRITE) should not become write protected while the pin
    is active.

    However, there is a small race where get_user_pages_fast(FOLL_PIN) can
    establish a FOLL_PIN at the same time copy_present_page() is write
    protecting it:

    CPU 0                                 CPU 1
    get_user_pages_fast()
      internal_get_user_pages_fast()
                                          copy_page_range()
                                            pte_alloc_map_lock()
                                              copy_present_page()
                                                atomic_read(has_pinned) == 0
                                                page_maybe_dma_pinned() == false
      atomic_set(has_pinned, 1);
      gup_pgd_range()
        gup_pte_range()
          pte_t pte = gup_get_pte(ptep)
          pte_access_permitted(pte)
          try_grab_compound_head()
                                                pte = pte_wrprotect(pte)
                                                set_pte_at();
                                                pte_unmap_unlock()
          // GUP now returns with a write protected page

    The first attempt to resolve this by using the write protect caused
    problems (and was missing a barrier), see commit f3c64eda3e50 ("mm: avoid
    early COW write protect games during fork()")

    Instead wrap copy_p4d_range() with the write side of a seqcount and check
    the read side around gup_pgd_range(). If there is a collision then
    get_user_pages_fast() fails and falls back to slow GUP.

    Slow GUP is safe against this race because copy_page_range() is only
    called while holding the exclusive side of the mmap_lock on the src
    mm_struct.
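
    A hedged sketch of the scheme (the seqcount field name write_protect_seq
    and the helper placement are assumptions based on this description; the
    upstream details may differ):

    /* fork side: copy_page_range(), write side of the seqcount.
     * Safe because the caller holds the src mm's mmap_lock for write.
     */
    mmap_assert_write_locked(src_mm);
    raw_write_seqcount_begin(&src_mm->write_protect_seq);
    /* ... copy_p4d_range() and below, which may write-protect PTEs ... */
    raw_write_seqcount_end(&src_mm->write_protect_seq);

    /* fast-GUP side: internal_get_user_pages_fast(), read side */
    seq = raw_read_seqcount(&current->mm->write_protect_seq);
    if (seq & 1)
            return 0;       /* a fork is mid-copy: take the slow path now */
    /* ... gup_pgd_range() walks the page tables locklessly ... */
    if (read_seqcount_retry(&current->mm->write_protect_seq, seq))
            return 0;       /* raced with fork's write protection: slow GUP */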

    [akpm@linux-foundation.org: coding style fixes]
    Link: https://lore.kernel.org/r/CAHk-=wi=iCnYCARbPGjkVJu9eyYeZ13N64tZYLdOB8CP5Q_PLw@mail.gmail.com

    Link: https://lkml.kernel.org/r/2-v4-908497cf359a+4782-gup_fork_jgg@nvidia.com
    Fixes: f3c64eda3e50 ("mm: avoid early COW write protect games during fork()")
    Signed-off-by: Jason Gunthorpe
    Suggested-by: Linus Torvalds
    Reviewed-by: John Hubbard
    Reviewed-by: Jan Kara
    Reviewed-by: Peter Xu
    Acked-by: "Ahmed S. Darwish" [seqcount_t parts]
    Cc: Andrea Arcangeli
    Cc: "Aneesh Kumar K.V"
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Cc: Jann Horn
    Cc: Kirill Shutemov
    Cc: Kirill Tkhai
    Cc: Leon Romanovsky
    Cc: Michal Hocko
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Jason Gunthorpe
     
  • [ Upstream commit 12cc126df82c96c89706aa207ad27c56f219047c ]

    __module_address() needs to be called with preemption disabled or with
    module_mutex taken. preempt_disable() is enough for read-only uses, which is
    what this fix does. Also, module_put() does an internal check for NULL,
    so the external NULL check can be dropped as well.
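
    A sketch of the resulting helper (assuming the raw-tracepoint put path
    described above; details may differ from the upstream diff):

    static void bpf_put_raw_tracepoint(struct bpf_raw_event_map *btp)
    {
            struct module *mod;

            preempt_disable();      /* enough for a read-only lookup */
            mod = __module_address((unsigned long)btp);
            module_put(mod);        /* module_put() tolerates NULL itself */
            preempt_enable();
    }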

    Fixes: a38d1107f937 ("bpf: support raw tracepoints in modules")
    Signed-off-by: Andrii Nakryiko
    Signed-off-by: Alexei Starovoitov
    Acked-by: Martin KaFai Lau
    Link: https://lore.kernel.org/bpf/20201203204634.1325171-2-andrii@kernel.org
    Signed-off-by: Sasha Levin

    Andrii Nakryiko
     
  • [ Upstream commit 4615fbc3788ddc8e7c6d697714ad35a53729aa2c ]

    When an interrupt allocation fails for N interrupts, it is pretty
    common for the error handling code to free the same number of interrupts,
    no matter how many interrupts have actually been allocated.

    This may result in the domain freeing code being unexpectedly called
    for interrupts that have no mapping in that domain. Things end pretty
    badly.

    Instead, add some checks to irq_domain_free_irqs_hierarchy() to make sure
    that it does not follow the hierarchy if no mapping exists for a given
    interrupt.

    Fixes: 6a6544e520abe ("genirq/irqdomain: Remove auto-recursive hierarchy support")
    Signed-off-by: Marc Zyngier
    Signed-off-by: Thomas Gleixner
    Link: https://lore.kernel.org/r/20201129135551.396777-1-maz@kernel.org
    Signed-off-by: Sasha Levin

    Marc Zyngier
     
  • [ Upstream commit 56292e8609e39537297a7468dda4d87b9bd81d6a ]

    The current memory-allocation interface causes the following difficulties
    for kvfree_rcu():

    a) If built with CONFIG_PROVE_RAW_LOCK_NESTING, the lockdep will
    complain about violation of the nesting rules, as in "BUG: Invalid
    wait context". This Kconfig option checks for proper raw_spinlock
    vs. spinlock nesting, in particular, it is not legal to acquire a
    spinlock_t while holding a raw_spinlock_t.

    This is a problem because kfree_rcu() uses raw_spinlock_t whereas the
    "page allocator" internally deals with spinlock_t to access to its
    zones. The code can also be broken from a higher level of view:

    raw_spin_lock(&some_lock);
    kfree_rcu(some_pointer, some_field_offset);

    b) If built with CONFIG_PREEMPT_RT, spinlock_t is converted into
    sleeplock. This means that invoking the page allocator from atomic
    contexts results in "BUG: scheduling while atomic".

    c) Please note that call_rcu() is already invoked from raw atomic context,
    so it is only reasonable to expect that kfree_rcu() and kvfree_rcu()
    will also be called from atomic raw context.

    This commit therefore defers page allocation to a clean context using the
    combination of an hrtimer and a workqueue. The hrtimer stage is required
    in order to avoid deadlocks with the scheduler. This deferred allocation
    is required only when kvfree_rcu()'s per-CPU page cache is empty.
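
    A minimal sketch of the two-stage deferral (function and variable names
    here are illustrative, not necessarily the upstream ones):

    static void fill_page_cache_func(struct work_struct *work)
    {
            /* clean, sleepable context: taking the page allocator's
             * spinlock_t locks is fine here */
            struct page *page = alloc_page(GFP_KERNEL);

            if (page) {
                    /* ... stash the page in the per-CPU cache ... */
            }
    }
    static DECLARE_WORK(page_cache_work, fill_page_cache_func);
    static struct hrtimer page_cache_timer;

    static enum hrtimer_restart schedule_page_work_fn(struct hrtimer *t)
    {
            /* hrtimer stage: still atomic, so only kick the workqueue */
            queue_work(system_highpri_wq, &page_cache_work);
            return HRTIMER_NORESTART;
    }

    /* from the kvfree_rcu() path, when the per-CPU page cache is empty */
    hrtimer_init(&page_cache_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
    page_cache_timer.function = schedule_page_work_fn;
    hrtimer_start(&page_cache_timer, 0, HRTIMER_MODE_REL);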

    Link: https://lore.kernel.org/lkml/20200630164543.4mdcf6zb4zfclhln@linutronix.de/
    Fixes: 3042f83f19be ("rcu: Support reclaim for head-less object")
    Reported-by: Sebastian Andrzej Siewior
    Signed-off-by: Uladzislau Rezki (Sony)
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Sasha Levin

    Uladzislau Rezki (Sony)
     
  • [ Upstream commit d2098b4440981705e844c50254540ba7b5f82795 ]

    Kim reported that perf-ftrace made his box unhappy. It turns out that
    commit:

    ff5c4f5cad33 ("rcu/tree: Mark the idle relevant functions noinstr")

    removed one too many notrace qualifiers, probably due to there not being
    a helpful comment.

    This commit therefore reinstates the notrace and adds a comment to avoid
    losing it again.

    [ paulmck: Apply Steven Rostedt's feedback on the comment. ]
    Fixes: ff5c4f5cad33 ("rcu/tree: Mark the idle relevant functions noinstr")
    Reported-by: Kim Phillips
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Sasha Levin

    Peter Zijlstra
     
  • [ Upstream commit 6dbce04d8417ae706596366e16841d77c454ba52 ]

    Eugenio managed to tickle #PF from NMI context which resulted in
    hitting a WARN in RCU through irqentry_enter() ->
    __rcu_irq_enter_check_tick().

    However, this situation is perfectly sane and does not warrant an
    WARN. The #PF will (necessarily) be atomic and not require messing
    with the tick state, so early return is correct. This commit
    therefore removes the WARN.

    Fixes: aaf2bc50df1f ("rcu: Abstract out rcu_irq_enter_check_tick() from rcu_nmi_enter()")
    Reported-by: "Eugenio Pérez"
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Andy Lutomirski
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Sasha Levin

    Peter Zijlstra
     
  • [ Upstream commit 345a957fcc95630bf5535d7668a59ed983eb49a7 ]

    do_sched_yield() invokes schedule() with interrupts disabled, which is
    not allowed. This goes back to the pre-git era, to commit a6efb709806c
    ("[PATCH] irqlock patch 2.5.27-H6") in the history tree.

    Reenable interrupts and remove the misleading comment which "explains" it.

    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/87r1pt7y5c.fsf@nanos.tec.linutronix.de
    Signed-off-by: Sasha Levin

    Thomas Gleixner
     
  • [ Upstream commit a57415f5d1e43c3a5c5d412cd85e2792d7ed9b11 ]

    When changing sched_rt_{runtime, period}_us, we validate that the new
    settings should at least accommodate the currently allocated -dl
    bandwidth:

    sched_rt_handler()
      --> sched_dl_bandwidth_validate()
          {
              new_bw = global_rt_runtime()/global_rt_period();

              for_each_possible_cpu(cpu) {
                  dl_b = dl_bw_of(cpu);
                  if (new_bw < dl_b->total_bw)
                      ret = -EBUSY;
              }
          }

    However, dl_b->total_bw is the allocated bandwidth of the whole root
    domain, not of a single CPU. Instead, we should compare dl_b->total_bw
    against "cpus*new_bw", where 'cpus' is the number of CPUs of the root
    domain.

    Also, the annotation below (in kernel/sched/sched.h) describes an
    implementation that only appeared in SCHED_DEADLINE v2[1]; the deadline
    scheduler kept evolving until it got merged (v9), but the annotation
    remained unchanged and is now meaningless and misleading, so update it.

    * With respect to SMP, the bandwidth is given on a per-CPU basis,
    * meaning that:
    * - dl_bw (< 100%) is the bandwidth of the system (group) on each CPU;
    * - dl_total_bw array contains, in the i-eth element, the currently
    * allocated bandwidth on the i-eth CPU.

    [1]: https://lore.kernel.org/lkml/1267385230.13676.101.camel@Palantir/

    Fixes: 332ac17ef5bf ("sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks")
    Signed-off-by: Peng Liu
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Daniel Bristot de Oliveira
    Acked-by: Juri Lelli
    Link: https://lkml.kernel.org/r/db6bbda316048cda7a1bbc9571defde193a8d67e.1602171061.git.iwtbavbm@gmail.com
    Signed-off-by: Sasha Levin

    Peng Liu
     

14 Dec, 2020

1 commit

  • Pull x86 fixes from Thomas Gleixner:
    "A set of x86 and membarrier fixes:

    - Correct a few problems in the x86 and the generic membarrier
    implementation. Small corrections for assumptions about visibility
    which have turned out not to be true.

    - Make the PAT bits for memory encryption correct vs 4K and 2M/1G
    page table entries as they are at a different location.

    - Fix a concurrency issue in the local bandwidth readout of
    resource control leading to incorrect values

    - Fix the ordering of allocating a vector for an interrupt. The order
    failed to respect the provided cpumask when the first attempt at
    allocating node-locally within the mask fails. It then tries the node
    instead of trying the full provided mask first. This leads to
    erroneous error messages and breaks the (user) supplied affinity
    request. Reorder it.

    - Make the INT3 padding detection in optprobe work correctly"

    * tag 'x86-urgent-2020-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/kprobes: Fix optprobe to detect INT3 padding correctly
    x86/apic/vector: Fix ordering in vector assignment
    x86/resctrl: Fix incorrect local bandwidth when mba_sc is enabled
    x86/mm/mem_encrypt: Fix definition of PMD_FLAGS_DEC_WP
    membarrier: Execute SYNC_CORE on the calling thread
    membarrier: Explicitly sync remote cores when SYNC_CORE is requested
    membarrier: Add an actual barrier before rseq_preempt()
    x86/membarrier: Get rid of a dubious optimization

    Linus Torvalds
     

12 Dec, 2020

2 commits

  • Remove bpf_ prefix, which causes these helpers to be reported in verifier
    dump as bpf_bpf_this_cpu_ptr() and bpf_bpf_per_cpu_ptr(), respectively. Let's
    fix it as long as it is still possible before UAPI freezes on these helpers.

    Fixes: eaa6bcb71ef6 ("bpf: Introduce bpf_per_cpu_ptr()")
    Signed-off-by: Andrii Nakryiko
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Linus Torvalds

    Andrii Nakryiko
     
  • kernel/elfcore.c only contains weak symbols, which triggers a bug with
    clang in combination with recordmcount:

    Cannot find symbol for section 2: .text.
    kernel/elfcore.o: failed

    Move the empty stubs into linux/elfcore.h as inline functions. As only
    two architectures use these, just use the architecture specific Kconfig
    symbols to key off the declaration.

    Link: https://lkml.kernel.org/r/20201204165742.3815221-2-arnd@kernel.org
    Signed-off-by: Arnd Bergmann
    Cc: Nathan Chancellor
    Cc: Nick Desaulniers
    Cc: Barret Rhoden
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     

11 Dec, 2020

3 commits

  • Pull networking fixes from David Miller:

    1) IPsec compat fixes, from Dmitry Safonov.

    2) Fix memory leak in xfrm_user_policy(). Fix from Yu Kuai.

    3) Fix polling in xsk sockets by using sk_poll_wait() instead of
    datagram_poll() which keys off of sk_wmem_alloc and such which xsk
    sockets do not update. From Xuan Zhuo.

    4) Missing init of rekey_data in cfg80211, from Sara Sharon.

    5) Fix destroy of timer before init, from Davide Caratti.

    6) Missing CRYPTO_CRC32 selects in ethernet driver Kconfigs, from Arnd
    Bergmann.

    7) Missing error return in rtm_to_fib_config() switch case, from Zhang
    Changzhong.

    8) Fix some src/dest address handling in vrf and add a testcase. From
    Stephen Suryaputra.

    9) Fix multicast handling in Seville switches driven by mscc-ocelot
    driver. From Vladimir Oltean.

    10) Fix proto value passed to skb delivery demux in udp, from Xin Long.

    11) HW pkt counters not reported correctly in enetc driver, from Claudiu
    Manoil.

    12) Fix deadlock in bridge, from Joseph Huang.

    13) Missing of_node_put() in dpaa2 driver, from Christophe JAILLET.

    14) Fix pid fetching in bpftool when there are a lot of results, from
    Andrii Nakryiko.

    15) Fix long timeouts in nft_dynset, from Pablo Neira Ayuso.

    16) Various stmmac fixes, from Fugang Duan.

    17) Fix null deref in tipc, from Cengiz Can.

    18) When mss is big, choose a more reasonable rcvq_space in tcp, from Eric
    Dumazet.

    19) Revert a geneve change that likely isn't necessary, from Jakub
    Kicinski.

    20) Avoid premature rx buffer reuse in various Intel drivers, from Björn
    Töpel.

    21) Retain ECT bits during ToS reflection in tcp, from Wei Wang.

    22) Fix TSO deferral wrt. cwnd limiting in tcp, from Neal Cardwell.

    23) MPLS_OPT_LSE_LABEL attribute is 32, not 8, bits, from Guillaume Nault

    24) Fix propagation of 32-bit signed bounds in bpf verifier and add test
    cases, from Alexei Starovoitov.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits)
    selftests: fix poll error in udpgro.sh
    selftests/bpf: Fix "dubious pointer arithmetic" test
    selftests/bpf: Fix array access with signed variable test
    selftests/bpf: Add test for signed 32-bit bound check bug
    bpf: Fix propagation of 32-bit signed bounds from 64-bit bounds.
    MAINTAINERS: Add entry for Marvell Prestera Ethernet Switch driver
    net: sched: Fix dump of MPLS_OPT_LSE_LABEL attribute in cls_flower
    net/mlx4_en: Handle TX error CQE
    net/mlx4_en: Avoid scheduling restart task if it is already running
    tcp: fix cwnd-limited bug for TSO deferral where we send nothing
    net: flow_offload: Fix memory leak for indirect flow block
    tcp: Retain ECT bits for tos reflection
    ethtool: fix stack overflow in ethnl_parse_bitset()
    e1000e: fix S0ix flow to allow S0i3.2 subset entry
    ice: avoid premature Rx buffer reuse
    ixgbe: avoid premature Rx buffer reuse
    i40e: avoid premature Rx buffer reuse
    igb: avoid transmit queue timeout in xdp path
    igb: use xdp_do_flush
    igb: skb add metasize for xdp
    ...

    Linus Torvalds
     
  • Alexei Starovoitov says:

    ====================
    pull-request: bpf 2020-12-10

    The following pull-request contains BPF updates for your *net* tree.

    We've added 21 non-merge commits during the last 12 day(s) which contain
    a total of 21 files changed, 163 insertions(+), 88 deletions(-).

    The main changes are:

    1) Fix propagation of 32-bit signed bounds from 64-bit bounds, from Alexei.

    2) Fix ring_buffer__poll() return value, from Andrii.

    3) Fix race in lwt_bpf, from Cong.

    4) Fix test_offload, from Toke.

    5) Various xsk fixes.

    Please consider pulling these changes from:

    git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

    Thanks a lot!

    Also thanks to reporters, reviewers and testers of commits in this pull-request:

    Cong Wang, Hulk Robot, Jakub Kicinski, Jean-Philippe Brucker, John
    Fastabend, Magnus Karlsson, Maxim Mikityanskiy, Yonghong Song
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
    The 64-bit signed bounds should not affect 32-bit signed bounds unless the
    verifier knows that the upper 32 bits are either all 1s or all 0s. For example,
    a register with smin_value==1 doesn't mean that s32_min_value is also equal to 1,
    since smax_value could be larger than a 32-bit subregister can hold.
    The verifier refines the smax/s32_max return value from certain helpers in
    do_refine_retval_range(). Teach the verifier to recognize that the smin/s32_min
    value is also bounded. When both the smin and smax bounds fit into a 32-bit
    subregister, the verifier can propagate those bounds.
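
    The propagation rule reads roughly as follows (a sketch in the spirit of
    the verifier's bound-combining code; the helper name is illustrative):

    static bool fits_in_s32(s64 v)
    {
            return v >= S32_MIN && v <= S32_MAX;
    }

    /* Only when both 64-bit signed bounds fit in a 32-bit subregister is it
     * safe to copy them into the 32-bit signed bounds; otherwise the upper
     * 32 bits need not be all 0s/1s and the s32 bounds must stay unknown.
     */
    if (fits_in_s32(reg->smin_value) && fits_in_s32(reg->smax_value)) {
            reg->s32_min_value = (s32)reg->smin_value;
            reg->s32_max_value = (s32)reg->smax_value;
    }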

    Fixes: 3f50f132d840 ("bpf: Verifier, do explicit ALU32 bounds tracking")
    Reported-by: Jean-Philippe Brucker
    Acked-by: John Fastabend
    Signed-off-by: Alexei Starovoitov

    Alexei Starovoitov
     

09 Dec, 2020

3 commits

  • membarrier()'s MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE is documented as
    syncing the core on all sibling threads but not necessarily the calling
    thread. This behavior is fundamentally buggy and cannot be used safely.

    Suppose a user program has two threads. Thread A is on CPU 0 and thread B
    is on CPU 1. Thread A modifies some text and calls
    membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE).

    Then thread B executes the modified code. If, at any point after
    membarrier() decides which CPUs to target, thread A could be preempted and
    replaced by thread B on CPU 0. This could even happen on exit from the
    membarrier() syscall. If this happens, thread B will end up running on CPU
    0 without having synced.

    In principle, this could be fixed by arranging for the scheduler to issue
    sync_core_before_usermode() whenever switching between two threads in the
    same mm if there is any possibility of a concurrent membarrier() call, but
    this would have considerable overhead. Instead, make membarrier() sync the
    calling CPU as well.

    As an optimization, this avoids an extra smp_mb() in the default
    barrier-only mode and an extra rseq preempt on the caller.

    Fixes: 70216e18e519 ("membarrier: Provide core serializing command, *_SYNC_CORE")
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Mathieu Desnoyers
    Link: https://lore.kernel.org/r/250ded637696d490c69bef1877148db86066881c.1607058304.git.luto@kernel.org

    Andy Lutomirski
     
  • membarrier() does not explicitly sync_core() remote CPUs; instead, it
    relies on the assumption that an IPI will result in a core sync. On x86,
    this may be true in practice, but it's not architecturally reliable. In
    particular, the SDM and APM do not appear to guarantee that interrupt
    delivery is serializing. While IRET does serialize, IPI return can
    schedule, thereby switching to another task in the same mm that was
    sleeping in a syscall. The new task could then SYSRET back to usermode
    without ever executing IRET.

    Make this more robust by explicitly calling sync_core_before_usermode()
    on remote cores. (This also helps people who search the kernel tree for
    instances of sync_core() and sync_core_before_usermode() -- one might be
    surprised that the core membarrier code doesn't currently show up in a
    such a search.)

    Fixes: 70216e18e519 ("membarrier: Provide core serializing command, *_SYNC_CORE")
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Mathieu Desnoyers
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/776b448d5f7bd6b12690707f5ed67bcda7f1d427.1607058304.git.luto@kernel.org

    Andy Lutomirski
     
  • It seems that most RSEQ membarrier users will expect any stores done before
    the membarrier() syscall to be visible to the target task(s). While this
    is extremely likely to be true in practice, nothing actually guarantees it
    by a strict reading of the x86 manuals. Rather than providing this
    guarantee by accident and potentially causing a problem down the road, just
    add an explicit barrier.
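
    Concretely this amounts to a full barrier in the rseq IPI handler,
    roughly:

    /* kernel/sched/membarrier.c, sketch */
    static void ipi_rseq(void *info)
    {
            /*
             * Make the stores done by the membarrier() caller visible to
             * this task before it resumes; once an IPI has been sent, the
             * extra smp_mb() is cheap.
             */
            smp_mb();
            rseq_preempt(current);
    }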

    Fixes: 70216e18e519 ("membarrier: Provide core serializing command, *_SYNC_CORE")
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Mathieu Desnoyers
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/d3e7197e034fa4852afcf370ca49c30496e58e40.1607058304.git.luto@kernel.org

    Andy Lutomirski
     

08 Dec, 2020

1 commit

  • Pull tracing fix from Steven Rostedt:
    "Fix userstacktrace option for instances

    While writing an application that requires user stack trace option to
    work in instances, I found that the instance option has a bug that
    makes it a nop. The check for performing the user stack trace in an
    instance checks the top level options (not the instance options) to
    determine if a user stack trace should be performed or not.

    This is not only incorrect, but also confusing for users. It confused
    me for a bit!"

    * tag 'trace-v5.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: Fix userstacktrace option for instances

    Linus Torvalds
     

07 Dec, 2020

1 commit

  • Pull irq fixes from Thomas Gleixner:
    "A set of updates for the interrupt subsystem:

    - Make multiqueue devices which use the managed interrupt affinity
    infrastructure work on PowerPC/Pseries. PowerPC does not use the
    generic infrastructure for setting up PCI/MSI interrupts and the
    multiqueue changes failed to update the legacy PCI/MSI
    infrastructure. Make this work by passing the affinity setup
    information down to the mapping and allocation functions.

    - Move Jason Cooper from MAINTAINERS to CREDITS as his mail is
    bouncing and he's not reachable. We hope all is well with him and
    say thanks for his work over the years"

    * tag 'irq-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    powerpc/pseries: Pass MSI affinity to irq_create_mapping()
    genirq/irqdomain: Add an irq_create_mapping_affinity() function
    MAINTAINERS: Move Jason Cooper to CREDITS

    Linus Torvalds
     

06 Dec, 2020

1 commit

  • Pull powerpc fixes from Michael Ellerman:
    "Some more powerpc fixes for 5.10:

    - Three commits fixing possible missed TLB invalidations for
    multi-threaded processes when CPUs are hotplugged in and out.

    - A fix for a host crash triggerable by host userspace (qemu) in KVM
    on Power9.

    - A fix for a host crash in machine check handling when running HPT
    guests on a HPT host.

    - One commit fixing potential missed TLB invalidations when using the
    hash MMU on Power9 or later.

    - A regression fix for machines with CPUs on node 0 but no memory.

    Thanks to Aneesh Kumar K.V, Cédric Le Goater, Greg Kurz, Milan
    Mohanty, Milton Miller, Nicholas Piggin, Paul Mackerras, and Srikar
    Dronamraju"

    * tag 'powerpc-5.10-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
    powerpc/64s/powernv: Fix memory corruption when saving SLB entries on MCE
    KVM: PPC: Book3S HV: XIVE: Fix vCPU id sanity check
    powerpc/numa: Fix a regression on memoryless node 0
    powerpc/64s: Trim offlined CPUs from mm_cpumasks
    kernel/cpu: add arch override for clear_tasks_mm_cpumask() mm handling
    powerpc/64s/pseries: Fix hash tlbiel_all_isa300 for guest kernels
    powerpc/64s: Fix hash ISA v3.0 TLBIEL instruction generation

    Linus Torvalds
     

05 Dec, 2020

1 commit

  • When the instances were able to use their own options, the userstacktrace
    option was left hardcoded for the top level. This made the instance
    userstacktrace option basically into a nop, and will confuse users that set
    it, but nothing happens (I was confused when it happened to me!)

    Cc: stable@vger.kernel.org
    Fixes: 16270145ce6b ("tracing: Add trace options for core options to instances")
    Signed-off-by: Steven Rostedt (VMware)

    Steven Rostedt (VMware)
     

02 Dec, 2020

1 commit

  • Pull tracing fixes from Steven Rostedt:

    - Use correct timestamp variable for ring buffer write stamp update

    - Fix up before stamp and write stamp when crossing ring buffer sub
    buffers

    - Keep a zero delta in ring buffer in slow path if cmpxchg fails

    - Fix trace_printk static buffer for archs that care

    - Fix ftrace record accounting for ftrace ops with trampolines

    - Fix DYNAMIC_FTRACE_WITH_DIRECT_CALLS dependency

    - Remove WARN_ON in hwlat tracer that triggers on something that is OK

    - Make "my_tramp" trampoline in ftrace direct sample code global

    - Fixes in the bootconfig tool for better alignment management

    * tag 'trace-v5.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    ring-buffer: Always check to put back before stamp when crossing pages
    ftrace: Fix DYNAMIC_FTRACE_WITH_DIRECT_CALLS dependency
    ftrace: Fix updating FTRACE_FL_TRAMP
    tracing: Fix alignment of static buffer
    tracing: Remove WARN_ON in start_thread()
    samples/ftrace: Mark my_tramp[12]? global
    ring-buffer: Set the right timestamp in the slow path of __rb_reserve_next()
    ring-buffer: Update write stamp with the correct ts
    docs: bootconfig: Update file format on initrd image
    tools/bootconfig: Align the bootconfig applied initrd image size to 4
    tools/bootconfig: Fix to check the write failure correctly
    tools/bootconfig: Fix errno reference after printf()

    Linus Torvalds
     

01 Dec, 2020

1 commit

  • The current ring buffer logic checks to see if the updating of the event
    buffer was interrupted, and if it is, it will try to fix up the before stamp
    with the write stamp to make them equal again. This logic is flawed, because
    if it is not interrupted, the two are guaranteed to be different, as the
    current event just updated the before stamp before allocation. This
    guarantees that the next event (this one or another interrupting one) will
    think it interrupted the time updates of a previous event and inject an
    absolute time stamp to compensate.

    The correct logic is to always update the timestamps when traversing to a
    new sub buffer.

    Cc: stable@vger.kernel.org
    Fixes: a389d86f7fd09 ("ring-buffer: Have nested events still record running time stamp")
    Signed-off-by: Steven Rostedt (VMware)

    Steven Rostedt (VMware)