06 Sep, 2020

1 commit

  • Commit 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
    changed ctl_table.proc_handler to take a kernel pointer. Adjust the
    signature of proc_ipc_sem_dointvec to match ctl_table.proc_handler which
    fixes the following sparse error/warning:

    ipc/ipc_sysctl.c:94:47: warning: incorrect type in argument 3 (different address spaces)
    ipc/ipc_sysctl.c:94:47: expected void *buffer
    ipc/ipc_sysctl.c:94:47: got void [noderef] __user *buffer
    ipc/ipc_sysctl.c:194:35: warning: incorrect type in initializer (incompatible argument 3 (different address spaces))
    ipc/ipc_sysctl.c:194:35: expected int ( [usertype] *proc_handler )( ... )
    ipc/ipc_sysctl.c:194:35: got int ( * )( ... )

    Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
    Signed-off-by: Tobias Klauser
    Signed-off-by: Andrew Morton
    Cc: Christoph Hellwig
    Cc: Alexander Viro
    Link: https://lkml.kernel.org/r/20200825105846.5193-1-tklauser@distanz.ch
    Signed-off-by: Linus Torvalds

    Tobias Klauser
     

24 Aug, 2020

1 commit

  • Replace the existing /* fall through */ comments and their variants with
    the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
    fall-through markings where that is the case.

    [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through
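
    As an illustration of the conversion described above, here is a minimal
    user-space sketch. The macro definition mirrors the approach of the
    kernel's compiler attribute header, but the feature test is simplified,
    and the function privileges_for_level() is a hypothetical example, not
    code from this commit.

```c
/*
 * User-space stand-in for the kernel's fallthrough pseudo-keyword.
 * Assumption: a GCC/Clang style compiler; other compilers get the
 * empty-statement fallback.
 */
#if defined(__has_attribute)
# if __has_attribute(__fallthrough__)
#  define fallthrough __attribute__((__fallthrough__))
# endif
#endif
#ifndef fallthrough
# define fallthrough do {} while (0)	/* fallthrough */
#endif

/* Hypothetical example: each level also grants everything below it. */
int privileges_for_level(int level)
{
	int privs = 0;

	switch (level) {
	case 2:
		privs += 4;
		fallthrough;	/* was: fall through comment */
	case 1:
		privs += 2;
		fallthrough;
	case 0:
		privs += 1;
		break;
	default:
		privs = -1;
		break;
	}
	return privs;
}
```

    With this definition, privileges_for_level(2) accumulates the values of
    all three cases, and the compiler can warn about any case that falls
    through without the explicit marker.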

    Signed-off-by: Gustavo A. R. Silva

    Gustavo A. R. Silva
     

13 Aug, 2020

2 commits

  • Remove the superfluous break, as there is a 'return' before it.

    Signed-off-by: Liao Pingfang
    Signed-off-by: Yi Wang
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/1594724361-11525-1-git-send-email-wang.yi59@zte.com.cn
    Signed-off-by: Linus Torvalds

    Liao Pingfang
     
  • Two functions are only called via function pointers, so don't bother
    inlining them.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Cc: Manfred Spraul
    Cc: Davidlohr Bueso
    Link: http://lkml.kernel.org/r/20200710200312.GA960353@localhost.localdomain
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

08 Aug, 2020

1 commit

  • The current split between do_mmap() and do_mmap_pgoff() was introduced in
    commit 1fcfd8db7f82 ("mm, mpx: add "vm_flags_t vm_flags" arg to
    do_mmap_pgoff()") to support MPX.

    The wrapper function do_mmap_pgoff() always passed 0 as the value of the
    vm_flags argument to do_mmap(). However, MPX support has subsequently
    been removed from the kernel and there were no more direct callers of
    do_mmap(); all calls were going via do_mmap_pgoff().

    Simplify the code by removing do_mmap_pgoff() and changing all callers to
    directly call do_mmap(), which now no longer takes a vm_flags argument.

    Signed-off-by: Peter Collingbourne
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Link: http://lkml.kernel.org/r/20200727194109.1371462-1-pcc@google.com
    Signed-off-by: Linus Torvalds

    Peter Collingbourne
     

10 Jun, 2020

1 commit

  • This change converts the existing mmap_sem rwsem calls to use the new mmap
    locking API instead.

    The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

09 Jun, 2020

2 commits

  • the reason is to avoid a delay caused by the synchronize_rcu() call in
    kern_umount() when the mqueue mount is freed.

    the code:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <error.h>
    #include <errno.h>
    #include <stdlib.h>

    int main()
    {
        int i;

        for (i = 0; i < 1000; i++)
            if (unshare(CLONE_NEWIPC) < 0)
                error(EXIT_FAILURE, errno, "unshare");
    }

    goes from

    Command being timed: "./ipc-namespace"
    User time (seconds): 0.00
    System time (seconds): 0.06
    Percent of CPU this job got: 0%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:08.05

    to

    Command being timed: "./ipc-namespace"
    User time (seconds): 0.00
    System time (seconds): 0.02
    Percent of CPU this job got: 96%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.03

    Signed-off-by: Giuseppe Scrivano
    Signed-off-by: Andrew Morton
    Reviewed-by: Paul E. McKenney
    Reviewed-by: Waiman Long
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Link: http://lkml.kernel.org/r/20200225145419.527994-1-gscrivan@redhat.com
    Signed-off-by: Linus Torvalds

    Giuseppe Scrivano
     
  • Sparse reports a warning at freeque()

    warning: context imbalance in freeque() - unexpected unlock

    The root cause is the missing annotation at freeque()

    Add the missing __releases(RCU) annotation
    Add the missing __releases(&msq->q_perm) annotation

    Signed-off-by: Jules Irenge
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Boqun Feng
    Cc: Lu Shuaibing
    Cc: Nathan Chancellor
    Cc: Manfred Spraul
    Cc: Davidlohr Bueso
    Link: http://lkml.kernel.org/r/20200403160505.2832-2-jbi.octave@gmail.com
    Signed-off-by: Linus Torvalds

    Jules Irenge
     

04 Jun, 2020

2 commits

  • Pull networking updates from David Miller:

    1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
    Augusto von Dentz.

    2) Add GSO partial support to igc, from Sasha Neftin.

    3) Several cleanups and improvements to r8169 from Heiner Kallweit.

    4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
    device self-test. From Andrew Lunn.

    5) Start moving away from custom driver versions, use the globally
    defined kernel version instead, from Leon Romanovsky.

    6) Support GRO via gro_cells in DSA layer, from Alexander Lobakin.

    7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.

    8) Add sriov and vf support to hinic, from Luo bin.

    9) Support Media Redundancy Protocol (MRP) in the bridging code, from
    Horatiu Vultur.

    10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.

    11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
    Dubroca. Also add ipv6 support for espintcp.

    12) Lots of ReST conversions of the networking documentation, from Mauro
    Carvalho Chehab.

    13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
    from Doug Berger.

    14) Allow to dump cgroup id and filter by it in inet_diag code, from
    Dmitry Yakunin.

    15) Add infrastructure to export netlink attribute policies to
    userspace, from Johannes Berg.

    16) Several optimizations to sch_fq scheduler, from Eric Dumazet.

    17) Fallback to the default qdisc if qdisc init fails because otherwise
    a packet scheduler init failure will make a device inoperative. From
    Jesper Dangaard Brouer.

    18) Several RISCV bpf jit optimizations, from Luke Nelson.

    19) Correct the return type of the ->ndo_start_xmit() method in several
    drivers, it's netdev_tx_t but many drivers were using
    'int'. From Yunjian Wang.

    20) Add an ethtool interface for PHY master/slave config, from Oleksij
    Rempel.

    21) Add BPF iterators, from Yonghang Song.

    22) Add cable test infrastructure, including ethtool interfaces, from
    Andrew Lunn. Marvell PHY driver is the first to support this
    facility.

    23) Remove zero-length arrays all over, from Gustavo A. R. Silva.

    24) Calculate and maintain an explicit frame size in XDP, from Jesper
    Dangaard Brouer.

    25) Add CAP_BPF, from Alexei Starovoitov.

    26) Support terse dumps in the packet scheduler, from Vlad Buslov.

    27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.

    28) Add devm_register_netdev(), from Bartosz Golaszewski.

    29) Minimize qdisc resets, from Cong Wang.

    30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
    eliminate set_fs/get_fs calls. From Christoph Hellwig.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
    selftests: net: ip_defrag: ignore EPERM
    net_failover: fixed rollback in net_failover_open()
    Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
    Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
    vmxnet3: allow rx flow hash ops only when rss is enabled
    hinic: add set_channels ethtool_ops support
    selftests/bpf: Add a default $(CXX) value
    tools/bpf: Don't use $(COMPILE.c)
    bpf, selftests: Use bpf_probe_read_kernel
    s390/bpf: Use bcr 0,%0 as tail call nop filler
    s390/bpf: Maintain 8-byte stack alignment
    selftests/bpf: Fix verifier test
    selftests/bpf: Fix sample_cnt shared between two threads
    bpf, selftests: Adapt cls_redirect to call csum_level helper
    bpf: Add csum_level helper for fixing up csum levels
    bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
    sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
    crypto/chtls: IPv6 support for inline TLS
    Crypto/chcr: Fixes a coccinile check error
    Crypto/chcr: Fixes compilations warnings
    ...

    Linus Torvalds
     
  • Pull thread updates from Christian Brauner:
    "We have been discussing using pidfds to attach to namespaces for quite
    a while and the patches have in one form or another already existed
    for about a year. But I wanted to wait to see how the general api
    would be received and adopted.

    This contains the changes to make it possible to use pidfds to attach
    to the namespaces of a process, i.e. they can be passed as the first
    argument to the setns() syscall.

    When only a single namespace type is specified the semantics are
    equivalent to passing an nsfd. That means setns(nsfd, CLONE_NEWNET)
    equals setns(pidfd, CLONE_NEWNET).

    However, when a pidfd is passed, multiple namespace flags can be
    specified in the second setns() argument and setns() will attach the
    caller to all the specified namespaces all at once or to none of them.

    Specifying 0 is not valid together with a pidfd. Here are just two
    obvious examples:

    setns(pidfd, CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET);
    setns(pidfd, CLONE_NEWUSER);

    Allowing to also attach subsets of namespaces supports various
    use-cases where callers setns to a subset of namespaces to retain
    privilege, perform an action and then re-attach another subset of
    namespaces.

    Apart from significantly reducing the number of syscalls needed to
    attach to all currently supported namespaces (eight "open+setns"
    sequences vs just a single "setns()"), this also allows atomic setns
    to a set of namespaces, i.e. either attaching to all namespaces
    succeeds or we fail without having changed anything.

    This is centered around a new internal struct nsset which holds all
    information necessary for a task to switch to a new set of namespaces
    atomically. Fwiw, with this change a pidfd becomes the only token
    needed to interact with a container. I'm expecting this to be
    picked-up by util-linux for nsenter rather soon.

    Associated with this change is a shiny new test-suite dedicated to
    setns() (for pidfds and nsfds alike)"

    * tag 'threads-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    selftests/pidfd: add pidfd setns tests
    nsproxy: attach to namespaces via pidfds
    nsproxy: add struct nsset

    Linus Torvalds
     

16 May, 2020

1 commit


15 May, 2020

1 commit

  • Commit 89163f93c6f9 ("ipc/util.c: sysvipc_find_ipc() should increase
    position index") is causing this bug (seen on 5.6.8):

    # ipcs -q

    ------ Message Queues --------
    key msqid owner perms used-bytes messages

    # ipcmk -Q
    Message queue id: 0
    # ipcs -q

    ------ Message Queues --------
    key msqid owner perms used-bytes messages
    0x82db8127 0 root 644 0 0

    # ipcmk -Q
    Message queue id: 1
    # ipcs -q

    ------ Message Queues --------
    key msqid owner perms used-bytes messages
    0x82db8127 0 root 644 0 0
    0x76d1fb2a 1 root 644 0 0

    # ipcrm -q 0
    # ipcs -q

    ------ Message Queues --------
    key msqid owner perms used-bytes messages
    0x76d1fb2a 1 root 644 0 0
    0x76d1fb2a 1 root 644 0 0

    # ipcmk -Q
    Message queue id: 2
    # ipcrm -q 2
    # ipcs -q

    ------ Message Queues --------
    key msqid owner perms used-bytes messages
    0x76d1fb2a 1 root 644 0 0
    0x76d1fb2a 1 root 644 0 0

    # ipcmk -Q
    Message queue id: 3
    # ipcrm -q 1
    # ipcs -q

    ------ Message Queues --------
    key msqid owner perms used-bytes messages
    0x7c982867 3 root 644 0 0
    0x7c982867 3 root 644 0 0
    0x7c982867 3 root 644 0 0
    0x7c982867 3 root 644 0 0

    Whenever an IPC item with a low id is deleted, the items with higher ids
    are duplicated, as if filling a hole.

    new_pos should jump over the hole of unused ids; pos can be updated
    inside the "for" loop.
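
    The intended iteration can be sketched in user space as follows. The
    table type and function names are hypothetical stand-ins for the IDR
    lookup in sysvipc_find_ipc(); the point is only that the returned
    position is derived from the index where the entry was actually found.

```c
#include <stddef.h>

#define NSLOTS 8

/* Hypothetical stand-in for the id table: a NULL slot is an unused id. */
struct ipc_table {
	const char *slot[NSLOTS];
};

/*
 * Return the first in-use entry at index >= pos, or NULL if there is
 * none. *new_pos is set to one past the index that was reached, so the
 * next call resumes after the entry and skips any hole before it.
 */
const char *find_ipc(const struct ipc_table *t, int pos, int *new_pos)
{
	const char *ipc = NULL;

	for (; pos < NSLOTS; pos++) {
		ipc = t->slot[pos];
		if (ipc != NULL)
			break;
	}
	*new_pos = pos + 1;
	return ipc;
}

/* Walk the table the way a sequential reader would; count the entries. */
int count_entries(const struct ipc_table *t)
{
	int pos = 0, n = 0;

	while (find_ipc(t, pos, &pos) != NULL)
		n++;
	return n;
}
```

    With ids 1 and 3 in use and id 0 deleted, each entry is visited exactly
    once. If new_pos were instead computed from the caller's starting
    position, the entry after a hole would be returned again on the next
    call, which matches the duplicated lines in the ipcs output above.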

    Fixes: 89163f93c6f9 ("ipc/util.c: sysvipc_find_ipc() should increase position index")
    Reported-by: Andreas Schwab
    Reported-by: Randy Dunlap
    Signed-off-by: Vasily Averin
    Signed-off-by: Andrew Morton
    Acked-by: Waiman Long
    Cc: NeilBrown
    Cc: Steven Rostedt
    Cc: Ingo Molnar
    Cc: Peter Oberparleiter
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Cc:
    Link: http://lkml.kernel.org/r/4921fe9b-9385-a2b4-1dc4-1099be6d2e39@virtuozzo.com
    Signed-off-by: Linus Torvalds

    Vasily Averin
     

09 May, 2020

1 commit

  • Add a simple struct nsset. It holds all the pieces needed to switch to a new
    set of namespaces without leaving a task in a half-switched state, which we
    will make use of in the next patch. This patch switches the existing setns
    logic over without causing a change in setns() behavior. This brings
    setns() closer to how unshare() works. The prepare_ns() function is
    responsible for preparing all necessary information, for two reasons.
    First, it minimizes dependencies between individual namespaces, i.e. all
    install handlers can expect that all fields are properly initialized
    independent of the order in which they are called. Second, it makes the
    code easier to maintain and easier to follow if it needs to be changed.

    The prepare_ns() helper will only be switched over to use a flags argument
    in the next patch. Here it will still use nstype as a simple integer
    argument, which it was argued would be clearer. I'm not particularly
    opinionated about whether this really helps or not. The struct nsset itself
    already contains the flags field since its name already indicates that it
    can contain information required by different namespaces. None of this
    should have functional consequences.
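
    The prepare-then-commit idea can be sketched in user space as follows.
    Everything here is hypothetical (the flag values, struct task_ns, struct
    nsset_sketch and the install helper); it only illustrates staging all
    changes first so that a failure leaves the task untouched.

```c
/* Hypothetical namespace flags, for illustration only. */
#define NS_NET 0x1u
#define NS_IPC 0x2u
#define NS_UTS 0x4u

/* Hypothetical per-task namespace ids. */
struct task_ns {
	int net, ipc, uts;
};

/* Loose analogue of struct nsset: requested flags plus a staging copy. */
struct nsset_sketch {
	unsigned int flags;
	struct task_ns staged;
};

/* Initialize every field up front so install order does not matter. */
static void prepare_ns(struct nsset_sketch *set, const struct task_ns *cur,
		       unsigned int flags)
{
	set->flags = flags;
	set->staged = *cur;
}

/* Install handler: a negative target id stands in for a failed check. */
static int install_ns(int *slot, int target)
{
	if (target < 0)
		return -1;
	*slot = target;
	return 0;
}

/* Switch all requested namespaces, or none of them. */
int switch_namespaces(struct task_ns *task, const struct task_ns *target,
		      unsigned int flags)
{
	struct nsset_sketch set;

	prepare_ns(&set, task, flags);

	if ((set.flags & NS_NET) && install_ns(&set.staged.net, target->net))
		return -1;
	if ((set.flags & NS_IPC) && install_ns(&set.staged.ipc, target->ipc))
		return -1;
	if ((set.flags & NS_UTS) && install_ns(&set.staged.uts, target->uts))
		return -1;

	/* Commit: the task only ever sees the fully prepared set. */
	*task = set.staged;
	return 0;
}
```

    Because the handlers write only to the staging copy, a failure in the
    second handler does not leave the first change visible in the task.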

    Signed-off-by: Christian Brauner
    Reviewed-by: Serge Hallyn
    Cc: Eric W. Biederman
    Cc: Serge Hallyn
    Cc: Jann Horn
    Cc: Michael Kerrisk
    Cc: Aleksa Sarai
    Link: https://lore.kernel.org/r/20200505140432.181565-2-christian.brauner@ubuntu.com

    Christian Brauner
     

08 May, 2020

1 commit

  • Commit cc731525f26a ("signal: Remove kernel interal si_code magic")
    changed the value of SI_FROMUSER(SI_MESGQ), this means that mq_notify() no
    longer works if the sender doesn't have rights to send a signal.

    Change __do_notify() to use do_send_sig_info() instead of kill_pid_info()
    to avoid check_kill_permission().

    This needs the additional notify.sigev_signo != 0 check, shouldn't we
    change do_mq_notify() to deny sigev_signo == 0 ?

    Test-case:

    #include <signal.h>
    #include <unistd.h>
    #include <sys/wait.h>
    #include <mqueue.h>
    #include <assert.h>

    static int notified;

    static void sigh(int sig)
    {
        notified = 1;
    }

    int main(void)
    {
        signal(SIGIO, sigh);

        int fd = mq_open("/mq", O_RDWR|O_CREAT, 0666, NULL);
        assert(fd >= 0);

        struct sigevent se = {
            .sigev_notify = SIGEV_SIGNAL,
            .sigev_signo = SIGIO,
        };
        assert(mq_notify(fd, &se) == 0);

        if (!fork()) {
            assert(setuid(1) == 0);
            mq_send(fd, "", 1, 0);
            return 0;
        }

        wait(NULL);
        mq_unlink("/mq");
        assert(notified);
        return 0;
    }

    [manfred@colorfullife.com: 1) Add self_exec_id evaluation so that the implementation matches do_notify_parent 2) use PIDTYPE_TGID everywhere]
    Fixes: cc731525f26a ("signal: Remove kernel interal si_code magic")
    Reported-by: Yoji
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Manfred Spraul
    Signed-off-by: Andrew Morton
    Acked-by: "Eric W. Biederman"
    Cc: Davidlohr Bueso
    Cc: Markus Elfring
    Cc:
    Cc:
    Link: http://lkml.kernel.org/r/e2a782e4-eab9-4f5c-c749-c07a8f7a4e66@colorfullife.com
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

27 Apr, 2020

1 commit

  • Instead of having all the sysctl handlers deal with user pointers, which
    is rather hairy in terms of the BPF interaction, copy the input to and
    from userspace in common code. This also means that the strings are
    always NUL-terminated by the common code, making the API a little bit
    safer.

    As most handlers just pass the data through to one of the common
    handlers, a lot of the changes are mechanical.
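
    The division of labor can be sketched in user space as follows. The
    names handler_fn, parse_int_handler() and common_write() are
    hypothetical, and memcpy() stands in for the copy from user space; this
    is not the kernel's proc_sys code.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical handler type: gets an already copied, NUL-terminated buffer. */
typedef int (*handler_fn)(char *buffer, size_t len, int *value);

/* Example handler: parse a decimal integer, relying on the terminator. */
int parse_int_handler(char *buffer, size_t len, int *value)
{
	char *end;
	long v = strtol(buffer, &end, 10);

	(void)len;
	if (end == buffer)
		return -1;	/* no digits found */
	*value = (int)v;
	return 0;
}

/*
 * Common code: copy the caller's bytes once, NUL-terminate them, then
 * call the handler. Handlers never touch the caller's buffer directly.
 */
int common_write(handler_fn handler, const char *user_buf, size_t len,
		 int *value)
{
	char *kbuf = malloc(len + 1);
	int ret;

	if (kbuf == NULL)
		return -1;
	memcpy(kbuf, user_buf, len);	/* stand-in for the user copy */
	kbuf[len] = '\0';
	ret = handler(kbuf, len, value);
	free(kbuf);
	return ret;
}
```

    Since the terminator is added in one place, a handler can use string
    functions such as strtol() even when the caller's input is not
    NUL-terminated.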

    Signed-off-by: Christoph Hellwig
    Acked-by: Andrey Ignatov
    Signed-off-by: Al Viro

    Christoph Hellwig
     

11 Apr, 2020

1 commit

  • If the seq_file .next function does not change the position index, a read
    after an lseek can generate unexpected output.

    https://bugzilla.kernel.org/show_bug.cgi?id=206283
    Signed-off-by: Vasily Averin
    Signed-off-by: Andrew Morton
    Acked-by: Waiman Long
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Cc: Al Viro
    Cc: Ingo Molnar
    Cc: NeilBrown
    Cc: Peter Oberparleiter
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/b7a20945-e315-8bb0-21e6-3875c14a8494@virtuozzo.com
    Signed-off-by: Linus Torvalds

    Vasily Averin
     

08 Apr, 2020

3 commits

  • Fix the following sparse warning:

    ipc/shm.c:1335:6: warning: symbol 'compat_ksys_shmctl' was not declared.
    Should it be static?

    Reported-by: Hulk Robot
    Signed-off-by: Jason Yan
    Signed-off-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200403063933.24785-1-yanaijie@huawei.com
    Signed-off-by: Linus Torvalds

    Jason Yan
     
  • Signed-off-by: somala swaraj
    Signed-off-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200301135530.18340-1-somalaswaraj@gmail.com
    Signed-off-by: Linus Torvalds

    Somala Swaraj
     
  • Now that "struct proc_ops" exists we can start putting stuff there which
    could not fly with VFS "struct file_operations"...

    Most of the fs/proc/inode.c file is dedicated to making open/read/.../close
    reliable in the event of disappearing /proc entries, which usually happens
    when a module is being removed. Files like /proc/cpuinfo which never
    disappear simply do not need such protection.

    Save 2 atomic ops, 1 allocation, 1 free per open/read/close sequence for such
    "permanent" files.

    Enable "permanent" flag for

    /proc/cpuinfo
    /proc/kmsg
    /proc/modules
    /proc/slabinfo
    /proc/stat
    /proc/sysvipc/*
    /proc/swaps

    More will come once I figure out a foolproof way to prevent out-of-tree
    module authors from marking their stuff "permanent" for performance
    reasons when it is not.

    This should help with scalability: benchmark is "read /proc/cpuinfo R times
    by N threads scattered over the system".

    N       R       t, s (before)   t, s (after)
    -----------------------------------------------------
    64      4096    1.582458        1.530502        -3.2%
    256     4096    6.371926        6.125168        -3.9%
    1024    4096    25.64888        24.47528        -4.6%

    Benchmark source:

    #include <chrono>
    #include <iostream>
    #include <thread>
    #include <vector>

    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    const int NR_CPUS = sysconf(_SC_NPROCESSORS_ONLN);
    int N;
    const char *filename;
    int R;

    int xxx = 0;

    int glue(int n)
    {
        cpu_set_t m;
        CPU_ZERO(&m);
        CPU_SET(n, &m);
        return sched_setaffinity(0, sizeof(cpu_set_t), &m);
    }

    void f(int n)
    {
        glue(n % NR_CPUS);

        while (*(volatile int *)&xxx == 0) {
        }

        for (int i = 0; i < R; i++) {
            int fd = open(filename, O_RDONLY);
            char buf[4096];
            ssize_t rv = read(fd, buf, sizeof(buf));
            asm volatile ("" :: "g" (rv));
            close(fd);
        }
    }

    int main(int argc, char *argv[])
    {
        if (argc < 4) {
            std::cerr << "usage: " << argv[0] << ' ' << "N /proc/filename R\n";
            return 1;
        }

        N = atoi(argv[1]);
        filename = argv[2];
        R = atoi(argv[3]);

        for (int i = 0; i < NR_CPUS; i++) {
            if (glue(i) == 0)
                break;
        }

        std::vector<std::thread> T;
        T.reserve(N);
        for (int i = 0; i < N; i++) {
            T.emplace_back(f, i);
        }

        auto t0 = std::chrono::system_clock::now();
        {
            *(volatile int *)&xxx = 1;
            for (auto& t: T) {
                t.join();
            }
        }
        auto t1 = std::chrono::system_clock::now();
        std::chrono::duration<double> dt = t1 - t0;
        std::cout << dt.count() << '\n';

        return 0;
    }

    P.S.:
    Explicit randomization marker is added because adding non-function pointer
    will silently disable structure layout randomization.

    [akpm@linux-foundation.org: coding style fixes]
    Reported-by: kbuild test robot
    Reported-by: Dan Carpenter
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Cc: Al Viro
    Cc: Joe Perches
    Link: http://lkml.kernel.org/r/20200222201539.GA22576@avx2
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

22 Feb, 2020

1 commit

  • This reverts commit a97955844807e327df11aa33869009d14d6b7de0.

    Commit a97955844807 ("ipc,sem: remove uneeded sem_undo_list lock usage
    in exit_sem()") removes a lock that is needed. This leads to a process
    looping infinitely in exit_sem() and can also lead to a crash. There is
    a reproducer available in [1] and with the commit reverted the issue
    does not reproduce anymore.

    Using the reproducer found in [1] it is fairly easy to reach a point where
    one of the child processes is looping infinitely in exit_sem between
    for(;;) and if (semid == -1) block, while it's trying to free its last
    sem_undo structure which has already been freed by freeary().

    Each sem_undo struct is on two lists: one per semaphore set (list_id)
    and one per process (list_proc). The list_id list tracks undos by
    semaphore set, and the list_proc by process.

    Undo structures are removed either by freeary() or by exit_sem(). The
    freeary function is invoked when the user invokes a syscall to remove a
    semaphore set. During this operation freeary() traverses the list_id
    associated with the semaphore set and removes the undo structures from
    both the list_id and list_proc lists.

    For this case, exit_sem() is called at process exit. Each process
    contains a struct sem_undo_list (referred to as "ulp") which contains
    the head for the list_proc list. When the process exits, exit_sem()
    traverses this list to remove each sem_undo struct. As in freeary(),
    whenever a sem_undo struct is removed from list_proc, it is also removed
    from the list_id list.

    Removing elements from list_id is safe for both exit_sem() and freeary()
    due to sem_lock(). Removing elements from list_proc is not safe;
    freeary() locks &un->ulp->lock when it performs
    list_del_rcu(&un->list_proc) but exit_sem() does not (locking was
    removed by commit a97955844807 ("ipc,sem: remove uneeded sem_undo_list
    lock usage in exit_sem()")).

    This can result in the following situation while executing the
    reproducer [1] : Consider a child process in exit_sem() and the parent
    in freeary() (because of semctl(sid[i], NSEM, IPC_RMID)).

    - The list_proc for the child contains the last two undo structs A and
    B (the rest have been removed either by exit_sem() or freeary()).

    - The semid for A is 1 and semid for B is 2.

    - exit_sem() removes A and at the same time freeary() removes B.

    - Since A and B have different semid sem_lock() will acquire different
    locks for each process and both can proceed.

    The bug is that they remove A and B from the same list_proc at the same
    time because only freeary() acquires the ulp lock. When exit_sem()
    removes A it makes ulp->list_proc.next to point at B and at the same
    time freeary() removes B setting B->semid=-1.

    At the next iteration of for(;;) loop exit_sem() will try to remove B.

    The only way to break from for(;;) is for (&un->list_proc ==
    &ulp->list_proc) to be true, which it is not. Then exit_sem() will check
    whether B->semid == -1, which it is, and will continue looping in for(;;)
    until the memory for B is reallocated and the value at B->semid is changed.

    At that point, exit_sem() will crash attempting to unlink B from the
    lists (this can be easily triggered by running the reproducer [1] a
    second time).

    To prove this scenario instrumentation was added to keep information
    about each sem_undo (un) struct that is removed per process and per
    semaphore set (sma).

    CPU0                                  CPU1
    [caller holds sem_lock(sma for A)]    ...
    freeary()                             exit_sem()
    ...                                   ...
    ...                                   sem_lock(sma for B)
    spin_lock(A->ulp->lock)               ...
    list_del_rcu(un_A->list_proc)         list_del_rcu(un_B->list_proc)

    Undo structures A and B have different semid and sem_lock() operations
    proceed. However they belong to the same list_proc list and they are
    removed at the same time. This results in ulp->list_proc.next
    pointing to the address of B which is already removed.

    After reverting commit a97955844807 ("ipc,sem: remove uneeded
    sem_undo_list lock usage in exit_sem()") the issue was no longer
    reproducible.

    [1] https://bugzilla.redhat.com/show_bug.cgi?id=1694779

    Link: http://lkml.kernel.org/r/20191211191318.11860-1-ioanna-maria.alifieraki@canonical.com
    Fixes: a97955844807 ("ipc,sem: remove uneeded sem_undo_list lock usage in exit_sem()")
    Signed-off-by: Ioanna Alifieraki
    Acked-by: Manfred Spraul
    Acked-by: Herton R. Krzesinski
    Cc: Arnd Bergmann
    Cc: Catalin Marinas
    Cc:
    Cc: Joel Fernandes (Google)
    Cc: Davidlohr Bueso
    Cc: Jay Vosburgh
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ioanna Alifieraki
     

04 Feb, 2020

6 commits

  • The most notable change is DEFINE_SHOW_ATTRIBUTE macro split in
    seq_file.h.

    Conversion rule is:

    llseek => proc_lseek
    unlocked_ioctl => proc_ioctl

    xxx => proc_xxx

    delete ".owner = THIS_MODULE" line

    [akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c]
    [sfr@canb.auug.org.au: fix kernel/sched/psi.c]
    Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • There is a use of uninitialized memory in msgctl_down() because msqid64
    in ksys_msgctl() is never initialized. The local msqid64 is created in
    ksys_msgctl() and then passed into msgctl_down(). Along the way msqid64
    is never initialized before msgctl_down() checks msqid64->msg_qbytes.

    KUMSAN (KernelUninitializedMemorySanitizer, a new error detection tool)
    reports:

    ==================================================================
    BUG: KUMSAN: use of uninitialized memory in msgctl_down+0x94/0x300
    Read of size 8 at addr ffff88806bb97eb8 by task syz-executor707/2022

    CPU: 0 PID: 2022 Comm: syz-executor707 Not tainted 5.2.0-rc4+ #63
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
    Call Trace:
    dump_stack+0x75/0xae
    __kumsan_report+0x17c/0x3e6
    kumsan_report+0xe/0x20
    msgctl_down+0x94/0x300
    ksys_msgctl.constprop.14+0xef/0x260
    do_syscall_64+0x7e/0x1f0
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x4400e9
    Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
    RSP: 002b:00007ffd869e0598 EFLAGS: 00000246 ORIG_RAX: 0000000000000047
    RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004400e9
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
    RBP: 00000000006ca018 R08: 0000000000000000 R09: 0000000000000000
    R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000401970
    R13: 0000000000401a00 R14: 0000000000000000 R15: 0000000000000000

    The buggy address belongs to the page:
    page:ffffea0001aee5c0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0
    flags: 0x100000000000000()
    raw: 0100000000000000 0000000000000000 ffffffff01ae0101 0000000000000000
    raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
    page dumped because: kumsan: bad access detected
    ==================================================================

    Syzkaller reproducer:
    msgctl$IPC_RMID(0x0, 0x0)

    C reproducer:
    // autogenerated by syzkaller (https://github.com/google/syzkaller)

    int main(void)
    {
    syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
    syscall(__NR_msgctl, 0, 0, 0);
    return 0;
    }

    [natechancellor@gmail.com: adjust indentation in ksys_msgctl]
    Link: https://github.com/ClangBuiltLinux/linux/issues/829
    Link: http://lkml.kernel.org/r/20191218032932.37479-1-natechancellor@gmail.com
    Link: http://lkml.kernel.org/r/20190613014044.24234-1-shuaibinglu@126.com
    Signed-off-by: Lu Shuaibing
    Signed-off-by: Nathan Chancellor
    Suggested-by: Arnd Bergmann
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Cc: NeilBrown
    From: Andrew Morton
    Subject: drivers/block/null_blk_main.c: fix layout

    Each line here overflows 80 cols by exactly one character. Delete one tab
    per line to fix.

    Cc: Shaohua Li
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lu Shuaibing
     
  • Document and update the memory barriers in ipc/sem.c:

    - Add smp_store_release() to wake_up_sem_queue_prepare() and
    document why it is needed.

    - Read q->status using READ_ONCE+smp_acquire__after_ctrl_dep()
    as the pair for the barrier inside wake_up_sem_queue_prepare().

    - Add comments to all barriers, and mention the rules in the block
    regarding locking.

    - Switch to using wake_q_add_safe().
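
    The barrier pairing in the list above can be approximated with C11
    atomics. This is an analogy under stated assumptions, not the ipc/sem.c
    code: struct waiter and both functions are hypothetical, a release store
    stands in for smp_store_release(), and a relaxed load followed by an
    acquire fence stands in for READ_ONCE+smp_acquire__after_ctrl_dep().

```c
#include <stdatomic.h>

/* Hypothetical stand-in for a queued sleeper. */
struct waiter {
	int result;		/* payload written by the waker */
	atomic_int status;	/* nonzero once the result is published */
};

/* Waker side: write the payload, then publish it with a release store. */
void publish_result(struct waiter *w, int result)
{
	w->result = result;
	atomic_store_explicit(&w->status, 1, memory_order_release);
}

/*
 * Sleeper side, lockless fast path: a plain atomic read of the status,
 * then an acquire fence before the payload is read. Returns 1 and
 * stores the payload in *result if it was published, 0 otherwise.
 */
int try_consume(struct waiter *w, int *result)
{
	if (atomic_load_explicit(&w->status, memory_order_relaxed) == 0)
		return 0;

	/* Pairs with the release store in publish_result(). */
	atomic_thread_fence(memory_order_acquire);
	*result = w->result;
	return 1;
}
```

    Without the acquire side, the read of the payload could in theory be
    satisfied before the status check, so the sleeper might see the status
    set but read a stale payload.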

    Link: http://lkml.kernel.org/r/20191020123305.14715-6-manfred@colorfullife.com
    Signed-off-by: Manfred Spraul
    Cc: Waiman Long
    Cc: Davidlohr Bueso
    Cc:
    Cc: Peter Zijlstra
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • Transfer findings from ipc/mqueue.c:

    - A control barrier was missing for the lockless receive case. So in
    theory, not yet initialized data may have been copied to user space -
    obviously only for architectures where control barriers are not NOP.

    - use smp_store_release(). In theory, the refcount may have been
    decreased to 0 already when wake_q_add() tries to get a reference.

    Link: http://lkml.kernel.org/r/20191020123305.14715-5-manfred@colorfullife.com
    Signed-off-by: Manfred Spraul
    Cc: Waiman Long
    Cc: Davidlohr Bueso
    Cc:
    Cc: Peter Zijlstra
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • Update and document memory barriers for mqueue.c:

    - ewp->state is read without any locks, thus READ_ONCE is required.

    - add smp_acquire__after_ctrl_dep() after the READ_ONCE, we need
    acquire semantics if the value is STATE_READY.

    - use wake_q_add_safe()

    - document why __set_current_state() may be used:
    Reading task->state cannot happen before the wake_q_add() call,
    which happens while holding info->lock. Thus the spin_unlock()
    is the RELEASE, and the spin_lock() is the ACQUIRE.

    For completeness: there is also a 3 CPU scenario, if the to be woken
    up task is already on another wake_q.
    Then:
    - CPU1: spin_unlock() of the task that goes to sleep is the RELEASE
    - CPU2: the spin_lock() of the waker is the ACQUIRE
    - CPU2: smp_mb__before_atomic inside wake_q_add() is the RELEASE
    - CPU3: smp_mb__after_spinlock() inside try_to_wake_up() is the ACQUIRE

    Link: http://lkml.kernel.org/r/20191020123305.14715-4-manfred@colorfullife.com
    Signed-off-by: Manfred Spraul
    Reviewed-by: Davidlohr Bueso
    Cc: Waiman Long
    Cc:
    Cc: Peter Zijlstra
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • pipelined_send() and pipelined_receive() are identical, so merge them.

    [manfred@colorfullife.com: add changelog]
    Link: http://lkml.kernel.org/r/20191020123305.14715-3-manfred@colorfullife.com
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Manfred Spraul
    Cc:
    Cc: Peter Zijlstra
    Cc: Waiman Long
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

10 Dec, 2019

1 commit

  • Replace all the occurrences of FIELD_SIZEOF() with sizeof_field() except
    at places where these are defined. Later patches will remove the unused
    definition of FIELD_SIZEOF().

    This patch is generated using following script:

    EXCLUDE_FILES="include/linux/stddef.h|include/linux/kernel.h"

    git grep -l -e "\bFIELD_SIZEOF\b" | while read file;
    do
        if [[ "$file" =~ $EXCLUDE_FILES ]]; then
            continue
        fi
        sed -i -e 's/\bFIELD_SIZEOF\b/sizeof_field/g' $file;
    done
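
    For readers unfamiliar with the macro, a minimal userspace sketch of
    what sizeof_field() computes follows. The macro body mirrors the usual
    kernel definition; struct example_msg and example_text_capacity() are
    hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Size of a struct member without needing an object of that type.
 * sizeof does not evaluate its operand, so the null pointer is never
 * actually dereferenced. */
#define sizeof_field(TYPE, MEMBER) sizeof(((TYPE *)0)->MEMBER)

/* Hypothetical structure used only for this illustration. */
struct example_msg {
    long type;
    char text[64];
    unsigned short flags;
};

/* The result is a compile-time constant, so it can be checked statically. */
static_assert(sizeof_field(struct example_msg, text) == 64,
              "text[] is declared with 64 bytes");

/* Example use: report how many bytes fit into the text member. */
static size_t example_text_capacity(void)
{
    return sizeof_field(struct example_msg, text);
}
```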

    Signed-off-by: Pankaj Bharadiya
    Link: https://lore.kernel.org/r/20190924105839.110713-3-pankaj.laxminarayan.bharadiya@intel.com
    Co-developed-by: Kees Cook
    Signed-off-by: Kees Cook
    Acked-by: David Miller # for net

    Pankaj Bharadiya
     

15 Nov, 2019

1 commit


26 Sep, 2019

3 commits

  • CONFIG_PROVE_RCU_LIST requires list_for_each_entry_rcu() to pass a lockdep
    expression if using srcu or locking for protection. It can only check
    regular RCU protection, all other protection needs to be passed as lockdep
    expression.

    Link: http://lkml.kernel.org/r/20190830231817.76862-2-joel@joelfernandes.org
    Signed-off-by: Joel Fernandes (Google)
    Cc: Arnd Bergmann
    Cc: Bjorn Helgaas
    Cc: Catalin Marinas
    Cc: "Gustavo A. R. Silva"
    Cc: Jonathan Derrick
    Cc: Keith Busch
    Cc: Lorenzo Pieralisi
    Cc: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joel Fernandes (Google)
     
  • In a few error-handling cases, null pointers were assigned to local
    variables and the jump target “out” was used, even though those
    branches of the if statement should not lead to any further meaningful
    data processing. Use an additional jump target that calls
    dev_kfree_skb() directly.

    Also return directly after an error condition is detected when this
    function needs no extra clean-up.

    Link: http://lkml.kernel.org/r/592ef10e-0b69-72d0-9789-fc48f638fdfd@web.de
    Signed-off-by: Markus Elfring
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Markus Elfring
     
  • The dev_kfree_skb() function also performs input parameter validation,
    thus the test around the call is not needed.

    This issue was detected by using the Coccinelle software.

    Link: http://lkml.kernel.org/r/07477187-63e5-cc80-34c1-32dd16b38e12@web.de
    Signed-off-by: Markus Elfring
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Markus Elfring
     

19 Sep, 2019

1 commit

  • Pull vfs mount API infrastructure updates from Al Viro:
    "Infrastructure bits of mount API conversions.

    The rest is more of per-filesystem updates and that will happen
    in separate pull requests"

    * 'work.mount-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    mtd: Provide fs_context-aware mount_mtd() replacement
    vfs: Create fs_context-aware mount_bdev() replacement
    new helper: get_tree_keyed()
    vfs: set fs_context::user_ns for reconfigure

    Linus Torvalds
     

08 Sep, 2019

1 commit

  • Matt bisected a sparc64 specific issue with semctl, shmctl and msgctl
    to a commit from my y2038 series in linux-5.1, as I missed the custom
    sys_ipc() wrapper that sparc64 uses in place of the generic version that
    I patched.

    The problem is that the sys_{sem,shm,msg}ctl() functions in the kernel
    now do not allow being called with the IPC_64 flag any more, resulting
    in a -EINVAL error when they don't recognize the command.

    Instead, the correct way to do this now is to call the internal
    ksys_old_{sem,shm,msg}ctl() functions to select the API version.

    As we generally move towards these functions anyway, change all of
    sparc_ipc() to consistently use those in place of the sys_*() versions,
    and move the required ksys_*() declarations into linux/syscalls.h.

    The IS_ENABLED(CONFIG_SYSVIPC) check is required to avoid link
    errors when ipc is disabled.

    Reported-by: Matt Turner
    Fixes: 275f22148e87 ("ipc: rename old-style shmctl/semctl/msgctl syscalls")
    Cc: stable@vger.kernel.org
    Tested-by: Matt Turner
    Tested-by: Anatoly Pugachev
    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     

06 Sep, 2019

1 commit


20 Jul, 2019

1 commit

  • Pull vfs mount updates from Al Viro:
    "The first part of mount updates.

    Convert filesystems to use the new mount API"

    * 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    mnt_init(): call shmem_init() unconditionally
    constify ksys_mount() string arguments
    don't bother with registering rootfs
    init_rootfs(): don't bother with init_ramfs_fs()
    vfs: Convert smackfs to use the new mount API
    vfs: Convert selinuxfs to use the new mount API
    vfs: Convert securityfs to use the new mount API
    vfs: Convert apparmorfs to use the new mount API
    vfs: Convert openpromfs to use the new mount API
    vfs: Convert xenfs to use the new mount API
    vfs: Convert gadgetfs to use the new mount API
    vfs: Convert oprofilefs to use the new mount API
    vfs: Convert ibmasmfs to use the new mount API
    vfs: Convert qib_fs/ipathfs to use the new mount API
    vfs: Convert efivarfs to use the new mount API
    vfs: Convert configfs to use the new mount API
    vfs: Convert binfmt_misc to use the new mount API
    convenience helper: get_tree_single()
    convenience helper get_tree_nodev()
    vfs: Kill sget_userns()
    ...

    Linus Torvalds
     

19 Jul, 2019

1 commit

  • In the sysctl code the proc_dointvec_minmax() function is often used to
    validate that the user-supplied value lies within an allowed range.
    This function uses the extra1 and extra2 members from struct ctl_table
    as the minimum and maximum allowed values.

    Where sysctl handlers are declared, every source file has some
    read-only variables that contain just an integer and whose address is
    assigned to the extra1 and extra2 members, so that the sysctl range is
    enforced.

    The special values 0, 1 and INT_MAX are very often used as range
    boundaries, leading to duplication of variables like zero=0, one=1,
    int_max=INT_MAX in different source files:

    $ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
    248

    Add a const int array containing the most commonly used values, some
    macros to refer more easily to the correct array member, and use them
    instead of creating a local one for every object file.
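
    A minimal userspace sketch of this idea follows. It is an illustration
    under stated assumptions, not the kernel implementation: the array and
    macro names follow the commit's description, while struct range_entry
    and range_entry_store() are hypothetical stand-ins for struct ctl_table
    and proc_dointvec_minmax().

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>

/* One shared table of the most common boundary values. */
static const int sysctl_vals[] = { 0, 1, INT_MAX };

/* Convenience macros that point into the shared table. */
#define SYSCTL_ZERO    ((const void *)&sysctl_vals[0])
#define SYSCTL_ONE     ((const void *)&sysctl_vals[1])
#define SYSCTL_INT_MAX ((const void *)&sysctl_vals[2])

/* Hypothetical, simplified table entry: extra1/extra2 point to the
 * minimum and maximum allowed values, or are NULL for "no bound". */
struct range_entry {
    const char *name;
    int *data;
    const void *extra1;
    const void *extra2;
};

/* Store value into the entry if it lies within [*extra1, *extra2];
 * otherwise leave the old value untouched and return -EINVAL. */
static int range_entry_store(const struct range_entry *e, int value)
{
    const int *min = e->extra1;
    const int *max = e->extra2;

    assert(e->data != NULL);
    if ((min && value < *min) || (max && value > *max))
        return -EINVAL;
    *e->data = value;
    return 0;
}
```

    With this layout an entry such as { "flag", &flag, SYSCTL_ZERO,
    SYSCTL_ONE } no longer needs its own zero and one variables.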

    This is the bloat-o-meter output comparing the old and new binary
    compiled with the default Fedora config:

    # scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
    add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
    Data                                         old     new   delta
    sysctl_vals                                    -      12     +12
    __kstrtab_sysctl_vals                          -      12     +12
    max                                           14      10      -4
    int_max                                       16       -     -16
    one                                           68       -     -68
    zero                                         128      28    -100
    Total: Before=20583249, After=20583085, chg -0.00%

    [mcroce@redhat.com: tipc: remove two unused variables]
    Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
    [akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
    [arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
    Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
    [akpm@linux-foundation.org: fix fs/eventpoll.c]
    Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
    Signed-off-by: Matteo Croce
    Signed-off-by: Arnd Bergmann
    Acked-by: Kees Cook
    Reviewed-by: Aaron Tomlin
    Cc: Matthew Wilcox
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matteo Croce
     

17 Jul, 2019

1 commit

  • Andreas Christoforou reported:

    UBSAN: Undefined behaviour in ipc/mqueue.c:414:49 signed integer overflow:
    9 * 2305843009213693951 cannot be represented in type 'long int'
    ...
    Call Trace:
    mqueue_evict_inode+0x8e7/0xa10 ipc/mqueue.c:414
    evict+0x472/0x8c0 fs/inode.c:558
    iput_final fs/inode.c:1547 [inline]
    iput+0x51d/0x8c0 fs/inode.c:1573
    mqueue_get_inode+0x8eb/0x1070 ipc/mqueue.c:320
    mqueue_create_attr+0x198/0x440 ipc/mqueue.c:459
    vfs_mkobj+0x39e/0x580 fs/namei.c:2892
    prepare_open ipc/mqueue.c:731 [inline]
    do_mq_open+0x6da/0x8e0 ipc/mqueue.c:771

    Which could be triggered by:

    struct mq_attr attr = {
        .mq_flags = 0,
        .mq_maxmsg = 9,
        .mq_msgsize = 0x1fffffffffffffff,
        .mq_curmsgs = 0,
    };

    if (mq_open("/testing", 0x40, 3, &attr) == (mqd_t) -1)
        perror("mq_open");

    mqueue_get_inode() was correctly rejecting the giant mq_msgsize, and
    preparing to return -EINVAL. During the cleanup, it calls
    mqueue_evict_inode() which performed resource usage tracking math for
    updating "user", before checking if there was a valid "user" at all
    (which would indicate that the calculations would be sane). Instead,
    delay this check to after seeing a valid "user".

    The overflow was real, but the results went unused, so while the flaw is
    harmless, it's noisy for kernel fuzzers, so just fix it by moving the
    calculation under the non-NULL "user" where it actually gets used.
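
    The shape of the fix can be sketched in userspace C as follows. This is
    a simplified illustration, not the code from ipc/mqueue.c: struct
    user_acct, struct queue_info and queue_release_accounting() are
    hypothetical, the real accounting formula has more terms, and unsigned
    arithmetic is used here so the sketch itself has no undefined
    behaviour.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-user accounting record. */
struct user_acct {
    unsigned long mq_bytes;   /* bytes currently charged to this user */
};

/* Hypothetical queue state; user stays NULL if creation failed before
 * the attributes were validated and charged. */
struct queue_info {
    struct user_acct *user;
    unsigned long maxmsg;
    unsigned long msgsize;
};

/* Undo the accounting for a queue that is being torn down. The size
 * calculation happens only under the non-NULL check, because only then
 * were the attributes validated and the result is actually used. */
static void queue_release_accounting(struct queue_info *info)
{
    struct user_acct *user = info->user;

    if (user) {
        unsigned long bytes = info->maxmsg * info->msgsize;

        assert(user->mq_bytes >= bytes);
        user->mq_bytes -= bytes;
    }
}
```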

    Link: http://lkml.kernel.org/r/201906072207.ECB65450@keescook
    Signed-off-by: Kees Cook
    Reported-by: Andreas Christoforou
    Acked-by: "Eric W. Biederman"
    Cc: Al Viro
    Cc: Arnd Bergmann
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

05 Jun, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation version 2 of the license

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 315 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Reviewed-by: Armijn Hemel
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190531190115.503150771@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

26 May, 2019

1 commit


24 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this file is released under gnu general public licence version 2 or
    at your option any later version see the file copying for more
    details

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 1 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Kate Stewart
    Reviewed-by: Richard Fontana
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190520071857.941092988@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner