20 Jan, 2021

2 commits

  • [ Upstream commit 69ca310f34168eae0ada434796bfc22fb4a0fa26 ]

    On some systems, some variant of the following splat is
    repeatedly seen. The common factor in all traces seems
    to be the entry point to task_file_seq_next(). With the
    patch, all warnings go away.

    rcu: INFO: rcu_sched self-detected stall on CPU
    rcu:     26-....: (20992 ticks this GP) idle=d7e/1/0x4000000000000002 softirq=81556231/81556231 fqs=4876
        (t=21033 jiffies g=159148529 q=223125)
    NMI backtrace for cpu 26
    CPU: 26 PID: 2015853 Comm: bpftool Kdump: loaded Not tainted 5.6.13-0_fbk4_3876_gd8d1f9bf80bb #1
    Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A12 10/08/2018
    Call Trace:
    <IRQ>
    dump_stack+0x50/0x70
    nmi_cpu_backtrace.cold.6+0x13/0x50
    ? lapic_can_unplug_cpu.cold.30+0x40/0x40
    nmi_trigger_cpumask_backtrace+0xba/0xca
    rcu_dump_cpu_stacks+0x99/0xc7
    rcu_sched_clock_irq.cold.90+0x1b4/0x3aa
    ? tick_sched_do_timer+0x60/0x60
    update_process_times+0x24/0x50
    tick_sched_timer+0x37/0x70
    __hrtimer_run_queues+0xfe/0x270
    hrtimer_interrupt+0xf4/0x210
    smp_apic_timer_interrupt+0x5e/0x120
    apic_timer_interrupt+0xf/0x20
    </IRQ>
    RIP: 0010:get_pid_task+0x38/0x80
    Code: 89 f6 48 8d 44 f7 08 48 8b 00 48 85 c0 74 2b 48 83 c6 55 48 c1 e6 04 48 29 f0 74 19 48 8d 78 20 ba 01 00 00 00 f0 0f c1 50 20 d2 74 27 78 11 83 c2 01 78 0c 48 83 c4 08 c3 31 c0 48 83 c4 08
    RSP: 0018:ffffc9000d293dc8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
    RAX: ffff888637c05600 RBX: ffffc9000d293e0c RCX: 0000000000000000
    RDX: 0000000000000001 RSI: 0000000000000550 RDI: ffff888637c05620
    RBP: ffffffff8284eb80 R08: ffff88831341d300 R09: ffff88822ffd8248
    R10: ffff88822ffd82d0 R11: 00000000003a93c0 R12: 0000000000000001
    R13: 00000000ffffffff R14: ffff88831341d300 R15: 0000000000000000
    ? find_ge_pid+0x1b/0x20
    task_seq_get_next+0x52/0xc0
    task_file_seq_get_next+0x159/0x220
    task_file_seq_next+0x4f/0xa0
    bpf_seq_read+0x159/0x390
    vfs_read+0x8a/0x140
    ksys_read+0x59/0xd0
    do_syscall_64+0x42/0x110
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f95ae73e76e
    Code: Bad RIP value.
    RSP: 002b:00007ffc02c1dbf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
    RAX: ffffffffffffffda RBX: 000000000170faa0 RCX: 00007f95ae73e76e
    RDX: 0000000000001000 RSI: 00007ffc02c1dc30 RDI: 0000000000000007
    RBP: 00007ffc02c1ec70 R08: 0000000000000005 R09: 0000000000000006
    R10: fffffffffffff20b R11: 0000000000000246 R12: 00000000019112a0
    R13: 0000000000000000 R14: 0000000000000007 R15: 00000000004283c0

    If unable to obtain the file structure for the current task,
    proceed to the next task number after the one returned from
    task_seq_get_next(), instead of the next task number from the
    original iterator.

    Also, save the stopping task number from task_seq_get_next()
    on failure in case of restarts.

    Fixes: eaaacd23910f ("bpf: Add task and task/file iterator targets")
    Signed-off-by: Jonathan Lemon
    Signed-off-by: Daniel Borkmann
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20201218185032.2464558-2-jonathan.lemon@gmail.com
    Signed-off-by: Sasha Levin

    Jonathan Lemon
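
The effect of resuming after the returned task number can be illustrated with a small user-space model. This is only a sketch, not the kernel code: `live_tids`, `next_live()` and `count_lookups()` are hypothetical stand-ins, and every task is assumed to have no file structure so that the failure path is always taken.

```c
#include <stddef.h>

/* Hypothetical model: live tids in ascending order; none of them has files. */
static const unsigned live_tids[] = { 3, 1000, 1001 };
#define N_LIVE (sizeof(live_tids) / sizeof(live_tids[0]))

/* Stand-in for task_seq_get_next(): smallest live tid >= *tid, or 0 if none.
 * On success *tid is updated to the tid that was found. */
static unsigned next_live(unsigned *tid)
{
	for (size_t i = 0; i < N_LIVE; i++) {
		if (live_tids[i] >= *tid) {
			*tid = live_tids[i];
			return live_tids[i];
		}
	}
	return 0;
}

/* Count the lookups needed to walk past every task.
 * resume_from_found == 0: step from the original iterator tid (old behaviour).
 * resume_from_found == 1: step from the tid that the lookup returned. */
static unsigned count_lookups(int resume_from_found)
{
	unsigned saved_tid = 0, lookups = 0;

	for (;;) {
		unsigned curr_tid = saved_tid;

		lookups++;
		if (!next_live(&curr_tid))
			return lookups;
		saved_tid = resume_from_found ? curr_tid + 1 : saved_tid + 1;
	}
}
```

In this model the old stepping rule needs one lookup per task number in the gap between live tasks, while the new rule needs one lookup per live task plus a final failing one.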
     
  • [ Upstream commit 91b2db27d3ff9ad29e8b3108dfbf1e2f49fe9bd3 ]

    Simplify task_file_seq_get_next() by removing two in/out arguments: task
    and fstruct. Use info->task and info->files instead.

    Signed-off-by: Song Liu
    Signed-off-by: Daniel Borkmann
    Acked-by: Yonghong Song
    Link: https://lore.kernel.org/bpf/20201120002833.2481110-1-songliubraving@fb.com
    Signed-off-by: Sasha Levin

    Song Liu
     

02 Sep, 2020

1 commit

  • Currently, the task_file iterator iterates over all files from all tasks.
    This may visit many duplicate files if many tasks share the same
    files, e.g., a typical pthreads program where the worker threads
    and the main thread share the same files.

    This patch changed the task_file iterator to skip a particular task
    if that task shares the same files as its group_leader (the task
    having the same tgid and also task->tgid == task->pid).
    This preserves the same result, visiting all files from all
    tasks, and reduces runtime cost significantly, e.g., if there are
    a lot of pthreads and the process has a lot of open files.

    Suggested-by: Andrii Nakryiko
    Signed-off-by: Yonghong Song
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Josef Bacik
    Link: https://lore.kernel.org/bpf/20200902023112.1672792-1-yhs@fb.com

    Yonghong Song
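
The skip condition can be sketched in user space as follows. The `struct files` and `struct task` types here are simplified, hypothetical stand-ins for the kernel's `files_struct` and `task_struct`, and `count_visited_files()` is an illustrative helper that does not exist in the kernel.

```c
#include <stddef.h>

/* Simplified stand-in for files_struct. */
struct files {
	int nr_open;
};

/* Simplified stand-in for task_struct. */
struct task {
	int pid;
	int tgid;
	struct files *files;
	struct task *group_leader;
};

/* A task is skipped when it is not the group leader (tgid != pid) and
 * shares its files with the group leader. */
static int should_skip(const struct task *t)
{
	return t->tgid != t->pid && t->files == t->group_leader->files;
}

/* Sum the open files that the iterator would visit after skipping. */
static int count_visited_files(const struct task *tasks, size_t n)
{
	int total = 0;

	for (size_t i = 0; i < n; i++)
		if (!should_skip(&tasks[i]))
			total += tasks[i].files->nr_open;
	return total;
}
```

A thread that has its own file table (for example after unsharing it) is still visited, so no file is lost; only the repeated walks over the leader's table are avoided.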
     

19 Aug, 2020

1 commit

  • Currently, when traversing all tasks, the next tid
    is always increased by one. This may result in
    visiting the same task multiple times in a
    pid namespace.

    This patch fixed the issue by setting the next
    tid to pid_nr_ns(pid, ns) + 1, similar to
    function next_tgid().

    Signed-off-by: Yonghong Song
    Signed-off-by: Alexei Starovoitov
    Cc: Rik van Riel
    Link: https://lore.kernel.org/bpf/20200818222310.2181500-1-yhs@fb.com

    Yonghong Song
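
A minimal user-space model of the two stepping rules, assuming a lookup that returns the smallest live number greater than or equal to the requested one. `live_nr`, `find_ge()` and `walk()` are hypothetical names used for illustration only.

```c
#include <stddef.h>

/* Hypothetical pid numbers that are live in one pid namespace, ascending. */
static const int live_nr[] = { 1, 4, 9 };
#define N_LIVE (sizeof(live_nr) / sizeof(live_nr[0]))

/* Stand-in for find_ge_pid(): smallest live number >= nr, or -1 if none. */
static int find_ge(int nr)
{
	for (size_t i = 0; i < N_LIVE; i++)
		if (live_nr[i] >= nr)
			return live_nr[i];
	return -1;
}

/* Walk the namespace and record every visit in out[]; returns the count.
 * jump_past_found == 0: next tid is tid + 1 (old behaviour).
 * jump_past_found == 1: next tid is the found number + 1 (fixed behaviour). */
static size_t walk(int jump_past_found, int *out, size_t cap)
{
	size_t n = 0;
	int tid = 0;

	for (;;) {
		int found = find_ge(tid);

		if (found < 0 || n == cap)
			return n;
		out[n++] = found;
		tid = (jump_past_found ? found : tid) + 1;
	}
}
```

With the old rule, every number in a gap resolves to the same next live task, so that task is reported once per number; with the fixed rule each live task is reported once.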
     

18 Aug, 2020

1 commit

  • With latest `bpftool prog` command, we observed the following kernel
    panic.
    BUG: kernel NULL pointer dereference, address: 0000000000000000
    #PF: supervisor instruction fetch in kernel mode
    #PF: error_code(0x0010) - not-present page
    PGD dfe894067 P4D dfe894067 PUD deb663067 PMD 0
    Oops: 0010 [#1] SMP
    CPU: 9 PID: 6023 ...
    RIP: 0010:0x0
    Code: Bad RIP value.
    RSP: 0000:ffffc900002b8f18 EFLAGS: 00010286
    RAX: ffff8883a405f400 RBX: ffff888e46a6bf00 RCX: 000000008020000c
    RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8883a405f400
    RBP: ffff888e46a6bf50 R08: 0000000000000000 R09: ffffffff81129600
    R10: ffff8883a405f300 R11: 0000160000000000 R12: 0000000000002710
    R13: 000000e9494b690c R14: 0000000000000202 R15: 0000000000000009
    FS: 00007fd9187fe700(0000) GS:ffff888e46a40000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffffffffffffd6 CR3: 0000000de5d33002 CR4: 0000000000360ee0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    <IRQ>
    rcu_core+0x1a4/0x440
    __do_softirq+0xd3/0x2c8
    irq_exit+0x9d/0xa0
    smp_apic_timer_interrupt+0x68/0x120
    apic_timer_interrupt+0xf/0x20
    </IRQ>
    RIP: 0033:0x47ce80
    Code: Bad RIP value.
    RSP: 002b:00007fd9187fba40 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
    RAX: 0000000000000002 RBX: 00007fd931789160 RCX: 000000000000010c
    RDX: 00007fd9308cdfb4 RSI: 00007fd9308cdfb4 RDI: 00007ffedd1ea0a8
    RBP: 00007fd9187fbab0 R08: 000000000000000e R09: 000000000000002a
    R10: 0000000000480210 R11: 00007fd9187fc570 R12: 00007fd9316cc400
    R13: 0000000000000118 R14: 00007fd9308cdfb4 R15: 00007fd9317a9380

    After further analysis, the bug is triggered by
    commit eaaacd23910f ("bpf: Add task and task/file iterator targets"),
    which introduced the task_file bpf iterator, which traverses all open file
    descriptors for all tasks in the current namespace.
    The latest `bpftool prog` calls a task_file bpf program to traverse
    all files in the system in order to associate processes with progs/maps, etc.
    When traversing files for a given task, the RCU read lock is taken to
    access all files in a files_struct. But the iterator used get_file() to grab
    a file, which is not right. It is possible that file->f_count is 0, and
    get_file() will unconditionally increase it.
    A later fput() may then cause all kinds of issues, with the above
    panic as one of the symptoms.

    The failure can be reproduced with the following steps in a few seconds:
    $ cat t.c
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>

    #define N 10000
    int fd[N];
    int main() {
      int i;

      /* Open the same file N times, then close every descriptor. */
      for (i = 0; i < N; i++) {
        fd[i] = open("./note.txt", O_RDONLY | O_CREAT, 0644);
        if (fd[i] < 0) {
          fprintf(stderr, "failed\n");
          return -1;
        }
      }
      for (i = 0; i < N; i++)
        close(fd[i]);

      return 0;
    }
    $ gcc -O2 t.c
    $ cat run.sh
    #!/bin/bash
    for i in {1..100}
    do
      while true; do ./a.out; done &
    done
    $ ./run.sh
    $ while true; do bpftool prog >& /dev/null; done

    This patch used get_file_rcu() which only grabs a file if the
    file->f_count is not zero. This is to ensure the file pointer
    is always valid. The above reproducer did not fail for more
    than 30 minutes.

    Fixes: eaaacd23910f ("bpf: Add task and task/file iterator targets")
    Suggested-by: Josef Bacik
    Signed-off-by: Yonghong Song
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Josef Bacik
    Link: https://lore.kernel.org/bpf/20200817174214.252601-1-yhs@fb.com

    Yonghong Song
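
The difference between an unconditional grab and a grab that fails once the count has reached zero can be sketched with C11 atomics. This is a user-space analogy under stated assumptions, not the kernel implementation: `struct file_ref`, `ref_get()` and `ref_get_not_zero()` are hypothetical names, standing in for `file->f_count`, `get_file()` and `get_file_rcu()` respectively.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference-counted object, standing in for struct file. */
struct file_ref {
	atomic_long count;
};

/* Unconditional grab, analogous to get_file(): it would also bump a
 * count that already dropped to zero, reviving an object being freed. */
static void ref_get(struct file_ref *f)
{
	atomic_fetch_add(&f->count, 1);
}

/* Conditional grab, analogous to get_file_rcu(): succeeds only while
 * the count is still non-zero. */
static bool ref_get_not_zero(struct file_ref *f)
{
	long old = atomic_load(&f->count);

	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&f->count, &old, old + 1))
			return true;
	}
	return false;
}
```

A reader that only holds the RCU read lock can still see an object whose count has already hit zero, which is why the conditional form is the safe one in that context.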
     

26 Jul, 2020

3 commits

  • This patch refactored target bpf_iter_init_seq_priv_t callback
    function to accept additional information. This will be needed
    in later patches for map element targets since a particular
    map should be passed to traverse elements for that particular
    map. In the future, other information may be passed to target
    as well, e.g., pid, cgroup id, etc. to customize the iterator.

    Signed-off-by: Yonghong Song
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200723184110.590156-1-yhs@fb.com

    Yonghong Song
     
  • There is no functionality change for this patch.
    Struct bpf_iter_reg is used to register a bpf_iter target,
    which includes information for prog_load, link_create,
    and seq_file creation.

    This patch puts the fields related to seq_file creation into
    a different structure. This will be useful for the map
    elements iterator, where one iterator covers different
    map types and different map types may have different
    seq_ops, init/fini private_data functions and
    private_data sizes.

    Signed-off-by: Yonghong Song
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200723184109.590030-1-yhs@fb.com

    Yonghong Song
     
  • Currently, the pos pointer in bpf iterator map/task/task_file
    seq_ops->start() is always incremented.
    This is incorrect. It should be incremented only if
    *pos is 0 (for SEQ_START_TOKEN), since these start()
    functions actually return the first real object.
    If *pos is not 0, start() merely finds the object
    based on the state in seq->private and does not really
    advance *pos. This patch fixed the issue
    by only incrementing *pos if it is 0.

    Note that the old *pos calculation, although not
    correct, does not affect correctness of bpf_iter
    as bpf_iter seq_file->read() does not support llseek.

    This patch also renamed "mid" in bpf_map iterator
    seq_file private data to "map_id" for better clarity.

    Fixes: 6086d29def80 ("bpf: Add bpf_map iterator")
    Fixes: eaaacd23910f ("bpf: Add task and task/file iterator targets")
    Signed-off-by: Yonghong Song
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200722195156.4029817-1-yhs@fb.com

    Yonghong Song
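
The rule for `*pos` can be sketched with a toy start() callback. This is a user-space illustration only: `struct iter_priv`, `objects` and `iter_start()` are hypothetical, and `long long` stands in for the kernel's `loff_t`.

```c
#include <stddef.h>

/* Hypothetical private iterator state, standing in for seq->private. */
struct iter_priv {
	size_t next_idx;	/* index of the next object to hand out */
};

static const int objects[] = { 10, 20, 30 };
#define N_OBJECTS (sizeof(objects) / sizeof(objects[0]))

/* Toy start(): returns the object selected by the private state.
 * *pos is bumped only on the very first call (*pos == 0), because that
 * call skips the start token and returns the first real object. On a
 * restart (*pos != 0) the object is merely looked up again. */
static const int *iter_start(struct iter_priv *priv, long long *pos)
{
	if (priv->next_idx >= N_OBJECTS)
		return NULL;
	if (*pos == 0)
		++*pos;
	return &objects[priv->next_idx];
}
```

Calling `iter_start()` a second time, after the private state has moved on, returns the next object without touching `*pos`.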
     

22 Jul, 2020

1 commit

  • One additional field btf_id is added to struct
    bpf_ctx_arg_aux to store the precomputed btf_ids.
    The btf_id is computed at build time with
    BTF_ID_LIST or BTF_ID_LIST_GLOBAL macro definitions.
    All existing bpf iterators are changed to use
    precomputed btf_ids.

    Signed-off-by: Yonghong Song
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200720163403.1393551-1-yhs@fb.com

    Yonghong Song
     

15 May, 2020

1 commit

  • task_seq_get_next might stop prematurely if get_pid_task() fails to get
    the task_struct. Such a failure doesn't mean that there are no more tasks with
    higher pids. Procfs's iteration algorithm (see next_tgid in fs/proc/base.c)
    does a retry in such a case. After this fix, instead of stopping prematurely
    after about 300 tasks on my server, the bpf_iter program now returns >4000, which
    sounds much closer to reality.

    Fixes: eaaacd23910f ("bpf: Add task and task/file iterator targets")
    Signed-off-by: Andrii Nakryiko
    Signed-off-by: Alexei Starovoitov
    Acked-by: Yonghong Song
    Link: https://lore.kernel.org/bpf/20200514055137.1564581-1-andriin@fb.com

    Andrii Nakryiko
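
The retry can be modelled in user space as below. The pid table, `find_ge()`, `get_next()` and `count_tasks()` are hypothetical stand-ins; a slot with `has_task == 0` models a pid for which get_pid_task() would fail.

```c
#include <stddef.h>

/* Hypothetical pid slot: a number plus whether a task is still attached. */
struct pid_slot {
	int nr;
	int has_task;
};

static const struct pid_slot slots[] = {
	{ 1, 1 }, { 2, 0 }, { 3, 1 }, { 7, 0 }, { 8, 1 },
};
#define N_SLOTS (sizeof(slots) / sizeof(slots[0]))

/* Smallest slot with nr >= wanted, or NULL if there is none. */
static const struct pid_slot *find_ge(int wanted)
{
	for (size_t i = 0; i < N_SLOTS; i++)
		if (slots[i].nr >= wanted)
			return &slots[i];
	return NULL;
}

/* Returns the number of the next task at or after *tid, or -1.
 * With retry == 0 a pid without a task ends the walk (old behaviour);
 * with retry == 1 the search continues at the next number. */
static int get_next(int *tid, int retry)
{
	const struct pid_slot *p;

again:
	p = find_ge(*tid);
	if (!p)
		return -1;
	*tid = p->nr;
	if (!p->has_task) {
		if (!retry)
			return -1;
		++*tid;
		goto again;
	}
	return p->nr;
}

/* Count how many tasks a full walk reports. */
static int count_tasks(int retry)
{
	int tid = 0, n = 0;

	while (get_next(&tid, retry) >= 0) {
		n++;
		tid++;
	}
	return n;
}
```

Without the retry the walk ends at the first pid that has no task, even though live tasks with higher numbers exist.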
     

14 May, 2020

2 commits

  • Commit b121b341e598 ("bpf: Add PTR_TO_BTF_ID_OR_NULL
    support") adds a field btf_id_or_null_non0_off to the
    bpf_prog->aux structure to indicate that the
    first ctx argument is PTR_TO_BTF_ID reg_type and
    all others are PTR_TO_BTF_ID_OR_NULL.
    This approach does not really scale if we have
    other different reg types in the future, e.g.,
    a pointer to a buffer.

    This patch lets bpf_iter targets register ctx argument
    reg types which may be different from the default one.
    For example, for pointers to structures, the default reg_type
    is PTR_TO_BTF_ID for tracing programs. The target can register
    a particular pointer type as PTR_TO_BTF_ID_OR_NULL, which can
    be used by the verifier to enforce accesses.

    Signed-off-by: Yonghong Song
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20200513180221.2949882-1-yhs@fb.com

    Yonghong Song
     
  • Currently, bpf_iter_reg_target takes parameters from the target
    and allocates memory to save them. This is really not
    necessary, especially since in the future we may grow the
    information passed from targets to the bpf_iter manager.

    The patch refactors the code so the target reg_info
    becomes static and the bpf_iter manager can just take
    a reference to it.

    Signed-off-by: Yonghong Song
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200513180219.2949605-1-yhs@fb.com

    Yonghong Song
     

10 May, 2020

1 commit

  • Only the tasks belonging to "current" pid namespace
    are enumerated.

    For task/file target, the bpf program will have access to
      struct task_struct *task
      u32 fd
      struct file *file
    where fd/file is an open file for the task.

    Signed-off-by: Yonghong Song
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20200509175911.2476407-1-yhs@fb.com

    Yonghong Song