22 Jun, 2020

1 commit

  • [ Upstream commit 586b58cac8b4683eb58a1446fbc399de18974e40 ]

    With CONFIG_DEBUG_ATOMIC_SLEEP=y and CONFIG_CGROUPS=y, kernel oopses in
    non-preemptible context look untidy; after the main oops, the kernel prints
    a "sleeping function called from invalid context" report because
    exit_signals() -> cgroup_threadgroup_change_begin() -> percpu_down_read()
    can sleep, and that happens before the preempt_count_set(PREEMPT_ENABLED)
    fixup.

    It looks like the same thing applies to profile_task_exit() and
    kcov_task_exit().

    Fix it by moving the preemption fixup up and the calls to
    profile_task_exit() and kcov_task_exit() down.

    Fixes: 1dc0fffc48af ("sched/core: Robustify preemption leak checks")
    Signed-off-by: Jann Horn
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200305220657.46800-1-jannh@google.com
    Signed-off-by: Sasha Levin

    Jann Horn
     

09 Jan, 2020

1 commit

  • commit 43cf75d96409a20ef06b756877a2e72b10a026fc upstream.

    Currently, when global init and all threads in its thread-group have exited
    we panic via:

    do_exit()
      -> exit_notify()
         -> forget_original_parent()
            -> find_child_reaper()

    This makes it hard to extract a usable coredump for global init from a
    kernel crashdump, because by the time we panic exit_mm() will have already
    released global init's mm.
    This patch moves the panic further up, before exit_mm() is called. As was
    the case previously, we only panic when global init and all its threads in
    the thread-group have exited.

    Signed-off-by: chenqiwu
    Acked-by: Christian Brauner
    Acked-by: Oleg Nesterov
    [christian.brauner@ubuntu.com: fix typo, rewrite commit message]
    Link: https://lore.kernel.org/r/1576736993-10121-1-git-send-email-qiwuchen55@gmail.com
    Signed-off-by: Christian Brauner
    Signed-off-by: Greg Kroah-Hartman

    chenqiwu
     

29 Nov, 2019

4 commits

  • commit 18f694385c4fd77a09851fd301236746ca83f3cb upstream.

    Instead of relying on PF_EXITING use an explicit state for the futex exit
    and set it in the futex exit function. This moves the smp barrier and the
    lock/unlock serialization into the futex code.

    As with the DEAD state this is restricted to the exit path as exec
    continues to use the same task struct.

    This makes it possible to simplify that logic in a subsequent step.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.539409004@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit f24f22435dcc11389acc87e5586239c1819d217c upstream.

    Setting task::futex_state in do_exit() is rather arbitrarily placed for no
    reason. Move it into the futex code.

    Note, this is only done for the exit cleanup as the exec cleanup cannot set
    the state to FUTEX_STATE_DEAD because the task struct is still in active
    use.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.439511191@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 4610ba7ad877fafc0a25a30c6c82015304120426 upstream.

    mm_release() contains the futex exit handling. mm_release() is called from
    do_exit()->exit_mm() and from exec()->exec_mm().

    In the exit_mm() case, PF_EXITING and the futex state are updated. In the
    exec_mm() case these states are not touched.

    As the futex exit code needs further protections against exit races, this
    needs to be split into two functions.

    Preparatory only, no functional change.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.240518241@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 3d4775df0a89240f671861c6ab6e8d59af8e9e41 upstream.

    The futex exit handling relies on PF_ flags. That's suboptimal as it
    requires a smp_mb() and an ugly lock/unlock of the exiting tasks pi_lock in
    the middle of do_exit() to enforce the observability of PF_EXITING in the
    futex code.

    Add a futex_state member to task_struct and convert the PF_EXITPIDONE logic
    over to the new state. The PF_EXITING dependency will be cleaned up in a
    later step.

    This prepares for handling various futex exit issues later.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.149449274@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

25 Sep, 2019

2 commits

    Remove workarounds that were written before there was a grace period
    after tasks left the runqueue in finish_task_switch().

    In particular, now that tasks leaving the runqueue experience an RCU
    grace period, none of the work performed by task_rcu_dereference()
    except the rcu_dereference() itself is necessary, so replace
    task_rcu_dereference() with rcu_dereference().

    Remove the code in rcuwait_wait_event() that checks to ensure the current
    task has not exited. It is no longer necessary, as it is guaranteed
    that any running task will experience an RCU grace period after it
    leaves the runqueue.

    Remove the comment in rcuwait_wake_up() as it is no longer relevant.

    Ref: 8f95c90ceb54 ("sched/wait, RCU: Introduce rcuwait machinery")
    Ref: 150593bf8693 ("sched/api: Introduce task_rcu_dereference() and try_get_task_struct()")
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Davidlohr Bueso
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Russell King - ARM Linux admin
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/87lfurdpk9.fsf_-_@x220.int.ebiederm.org
    Signed-off-by: Ingo Molnar

    Eric W. Biederman
     
  • Add a count of the number of RCU users (currently 1) of the task
    struct so that we can later add the scheduler case and get rid of the
    very subtle task_rcu_dereference(), and just use rcu_dereference().

    As suggested by Oleg have the count overlap rcu_head so that no
    additional space in task_struct is required.

    Inspired-by: Linus Torvalds
    Inspired-by: Oleg Nesterov
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Davidlohr Bueso
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Russell King - ARM Linux admin
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/87woebdplt.fsf_-_@x220.int.ebiederm.org
    Signed-off-by: Ingo Molnar

    Eric W. Biederman
     

17 Sep, 2019

1 commit

  • Pull pidfd/waitid updates from Christian Brauner:
    "This contains two features and various tests.

    First, it adds support for waiting on processes through pidfds by adding
    the P_PIDFD type to the waitid() syscall. This completes the basic
    functionality of the pidfd api (cf. [1]). In the meantime we also have
    a new addition to the userspace projects that make use of the pidfd
    api. The Qt project was nice enough to send a mail pointing out that
    they have a PR up to switch to the pidfd api (cf. [2]).

    Second, this tag contains an extension to the waitid() syscall to make
    it possible to wait on the current process group in a race-free manner
    (even though the actual problem is very unlikely) by specifying 0
    together with the P_PGID type. This extension traces back to a
    discussion on the glibc development mailing list.

    There are also a range of tests for the features above. Additionally,
    the test-suite which detected the pidfd-polling race we fixed in [3]
    is included in this tag"

    [1] https://lwn.net/Articles/794707/
    [2] https://codereview.qt-project.org/c/qt/qtbase/+/108456
    [3] commit b191d6491be6 ("pidfd: fix a poll race when setting exit_state")

    * tag 'core-process-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    waitid: Add support for waiting for the current process group
    tests: add pidfd poll tests
    tests: move common definitions and functions into pidfd.h
    pidfd: add pidfd_wait tests
    pidfd: add P_PIDFD to waitid()

    Linus Torvalds
     

10 Sep, 2019

1 commit

    It was recently discovered that the Linux version of waitid is not a
    superset of the other wait functions because it does not include support
    for waiting for the current process group. This has two downsides:
    1. An extra system call is needed to get the current process group.
    2. After the current process group is retrieved and before it is passed
    to waitid, a signal could arrive and change the current process group.
    Inherent race conditions such as these make it impossible for userspace to
    emulate this functionality, and thus violate the async-signal-safety
    requirements for waitpid.

    Arguments can be made for using a different choice of idtype and id
    for this case, but the BSDs already use P_PGID and 0 to indicate
    waiting for the current process's process group. So be nice to userspace
    programmers and don't introduce an unnecessary incompatibility.

    Some people have noted that the POSIX description is that
    waitpid will wait for the current process group, and that in
    the presence of pthreads that process group can change. To get
    clarity on this issue I looked at XNU, FreeBSD, and illumos. All of
    those flavors of unix waited for the current process group at the
    time of the call and, as written, could not adapt to the process group
    changing after the call.

    At one point Linux did adapt to the current process group changing, but
    that stopped in 161550d74c07 ("pid: sys_wait... fixes"). It has been
    over 11 years since Linux had that behavior, no programs that fail
    with the change in behavior have been reported, and I could not
    find any other unix that does this. So I think it is safe to clarify
    the definition of the current process group to mean the current process
    group at the time of the wait call.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Christian Brauner
    Reviewed-by: Oleg Nesterov
    Cc: "H. Peter Anvin"
    Cc: Arnd Bergmann
    Cc: Palmer Dabbelt
    Cc: Rich Felker
    Cc: Alistair Francis
    Cc: Zong Li
    Cc: Andrew Morton
    Cc: Oleg Nesterov
    Cc: Linus Torvalds
    Cc: Al Viro
    Cc: Florian Weimer
    Cc: Adhemerval Zanella
    Cc: GNU C Library
    Link: https://lore.kernel.org/r/20190814154400.6371-2-christian.brauner@ubuntu.com

    Eric W. Biederman
     

02 Aug, 2019

1 commit

  • This adds the P_PIDFD type to waitid().
    One of the last remaining bits for the pidfd api is to make it possible
    to wait on pidfds. With P_PIDFD added to waitid() the parts of userspace
    that want to use the pidfd api to exclusively manage processes can do so
    now.

    One of the things this will unblock in the future is the ability to make
    it possible to retrieve the exit status via waitid(P_PIDFD) for
    non-parent processes if handed a _suitable_ pidfd that has this feature
    set. This is similar to what you can do on FreeBSD with kqueue(). It
    might even end up being possible to wait on a process as a non-parent if
    an appropriate property is enabled on the pidfd.

    With P_PIDFD no scoping of the process identified by the pidfd is
    possible, i.e. it explicitly blocks things such as wait4(-1), wait4(0),
    waitid(P_ALL), waitid(P_PGID) etc. It only allows for semantics
    equivalent to wait4(pid), waitid(P_PID). Users that need scoping should
    rely on pid-based wait*() syscalls for now.

    Signed-off-by: Christian Brauner
    Reviewed-by: Kees Cook
    Reviewed-by: Oleg Nesterov
    Cc: Arnd Bergmann
    Cc: "Eric W. Biederman"
    Cc: Joel Fernandes (Google)
    Cc: Thomas Gleixner
    Cc: David Howells
    Cc: Jann Horn
    Cc: Andy Lutomirsky
    Cc: Andrew Morton
    Cc: Aleksa Sarai
    Cc: Linus Torvalds
    Cc: Al Viro
    Link: https://lore.kernel.org/r/20190727222229.6516-2-christian@brauner.io

    Christian Brauner
     

31 Jul, 2019

1 commit

  • Since commit b191d6491be6 ("pidfd: fix a poll race when setting exit_state")
    we unconditionally set exit_state to EXIT_ZOMBIE before calling into
    do_notify_parent(). This was done to eliminate a race when querying
    exit_state in do_notify_pidfd().
    Back then we decided to do the absolute minimal thing to fix this and
    not touch the rest of the exit_notify() function where exit_state is
    set.
    Since this fix has not caused any issues, change the setting of
    exit_state to EXIT_DEAD in the autoreap case to account for the fact that
    exit_state is set to EXIT_ZOMBIE unconditionally. This fix was planned
    but also explicitly requested in [1] and makes the whole code more
    consistent.

    /* References */
    [1]: https://lore.kernel.org/lkml/CAHk-=wigcxGFR2szue4wavJtH5cYTTeNES=toUBVGsmX0rzX+g@mail.gmail.com

    Signed-off-by: Christian Brauner
    Acked-by: Oleg Nesterov
    Cc: Linus Torvalds

    Christian Brauner
     

22 Jul, 2019

1 commit

  • There is a race between reading task->exit_state in pidfd_poll and
    writing it after do_notify_parent calls do_notify_pidfd. Expected
    sequence of events is:

    CPU 0                               CPU 1
    ------------------------------------------------
    exit_notify
      do_notify_parent
        do_notify_pidfd
    tsk->exit_state = EXIT_DEAD
                                        pidfd_poll
                                           if (tsk->exit_state)

    However nothing prevents the following sequence:

    CPU 0                               CPU 1
    ------------------------------------------------
    exit_notify
      do_notify_parent
        do_notify_pidfd
                                        pidfd_poll
                                           if (tsk->exit_state)
    tsk->exit_state = EXIT_DEAD

    This causes a polling task to wait forever, since poll blocks because
    exit_state is 0 and the waiting task is not notified again. A stress
    test continuously doing pidfd poll and process exits uncovered this bug.

    To fix it, we make sure that the task's exit_state is always set before
    calling do_notify_pidfd.

    Fixes: b53b0b9d9a6 ("pidfd: add polling support")
    Cc: kernel-team@android.com
    Cc: Oleg Nesterov
    Signed-off-by: Suren Baghdasaryan
    Signed-off-by: Joel Fernandes (Google)
    Link: https://lore.kernel.org/r/20190717172100.261204-1-joel@joelfernandes.org
    [christian@brauner.io: adapt commit message and drop unneeded changes from wait_task_zombie]
    Signed-off-by: Christian Brauner

    Suren Baghdasaryan
     

01 Jun, 2019

1 commit

  • cgroup_release() calls cgroup_subsys->release() which is used by the
    pids controller to uncharge its pid. We want to use it to manage
    iteration of dying tasks which requires putting it before
    __unhash_process(). Move cgroup_release() above __exit_signal().
    While this makes it uncharge before the pid is freed, pid is RCU freed
    anyway and the window is very narrow.

    Signed-off-by: Tejun Heo
    Cc: Oleg Nesterov

    Tejun Heo
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

15 May, 2019

1 commit

    The RCU reader uses rcu_dereference() inside rcu_read_lock() critical
    sections, so the writer must use WRITE_ONCE(). Just a cleanup; we still
    rely on gcc to emit atomic writes in other places.

    Link: http://lkml.kernel.org/r/20190325225636.11635-3-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Andrew Morton
    Cc: Hugh Dickins
    Cc: Jann Horn
    Cc: Jason Gunthorpe
    Cc: "Kirill A . Shutemov"
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Oleg Nesterov
    Cc: Peter Xu
    Cc: zhong jiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

08 Mar, 2019

1 commit

  • Pull cgroup updates from Tejun Heo:

    - Oleg's pids controller accounting update which gets rid of rcu delay
    in pids accounting updates

    - rstat (cgroup hierarchical stat collection mechanism) optimization

    - Doc updates

    * 'for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cpuset: remove unused task_has_mempolicy()
    cgroup, rstat: Don't flush subtree root unless necessary
    cgroup: add documentation for pids.events file
    Documentation: cgroup-v2: eliminate markup warnings
    MAINTAINERS: Update cgroup entry
    cgroup/pids: turn cgroup_subsys->free() into cgroup_subsys->release() to fix the accounting

    Linus Torvalds
     

02 Feb, 2019

1 commit

    Currently, exit_ptrace() adds all ptraced tasks to a dead list, then
    zap_pid_ns_processes() waits on all tasks in the current pidns, and only
    then are tasks from the dead list released.

    zap_pid_ns_processes() can get stuck on waiting tasks from the dead
    list. In this case, we will have one unkillable process with one or
    more dead children.

    Thanks to Oleg for the advice to release tasks in find_child_reaper().

    Link: http://lkml.kernel.org/r/20190110175200.12442-1-avagin@gmail.com
    Fixes: 7c8bd2322c7f ("exit: ptrace: shift "reap dead" code from exit_ptrace() to forget_original_parent()")
    Signed-off-by: Andrei Vagin
    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrei Vagin
     

31 Jan, 2019

1 commit

  • The only user of cgroup_subsys->free() callback is pids_cgrp_subsys which
    needs pids_free() to uncharge the pid.

    However, ->free() is called from __put_task_struct()->cgroup_free() and this
    is too late. Even the trivial program which does

    for (;;) {
            int pid = fork();
            assert(pid >= 0);
            if (pid)
                    wait(NULL);
            else
                    exit(0);
    }

    can run out of limits because release_task()->call_rcu(delayed_put_task_struct)
    implies an RCU gp after the task/pid goes away and before the final put().

    Test-case:

    mkdir -p /tmp/CG
    mount -t cgroup2 none /tmp/CG
    echo '+pids' > /tmp/CG/cgroup.subtree_control

    mkdir /tmp/CG/PID
    echo 2 > /tmp/CG/PID/pids.max

    perl -e 'while ($p = fork) { wait; } $p // die "fork failed: $!\n"' &
    echo $! > /tmp/CG/PID/cgroup.procs

    Without this patch the forking process fails soon after migration.

    Rename cgroup_subsys->free() to cgroup_subsys->release() and move the callsite
    into the new helper, cgroup_release(), called by release_task() which actually
    frees the pid(s).

    Reported-by: Herton R. Krzesinski
    Reported-by: Jan Stancek
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Tejun Heo

    Oleg Nesterov
     

21 Jan, 2019

1 commit

  • For some peculiar reason rcuwait_wake_up() has the right barrier in
    the comment, but not in the code.

    This mistake has been observed to cause a deadlock in the following
    situation:

    P1                                  P2

    percpu_up_read()                    percpu_down_write()
      rcu_sync_is_idle()                  // false
                                          rcu_sync_enter()
                                          ...
      __percpu_up_read()

    [S]   ,- __this_cpu_dec(*sem->read_count)
          |  smp_rmb();
    [L]   |  task = rcu_dereference(w->task) // NULL
          |
          |                         [S] w->task = current
          |                             smp_mb();
          |                         [L] readers_active_check() // fail
          `-> <store happens here>

    Where the smp_rmb() (obviously) fails to constrain the store.

    [ peterz: Added changelog. ]

    Signed-off-by: Prateek Sood
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Andrea Parri
    Acked-by: Davidlohr Bueso
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 8f95c90ceb54 ("sched/wait, RCU: Introduce rcuwait machinery")
    Link: https://lkml.kernel.org/r/1543590656-7157-1-git-send-email-prsood@codeaurora.org
    Signed-off-by: Ingo Molnar

    Prateek Sood
     

16 Jan, 2019

1 commit

  • Pull networking fixes from David Miller:

    1) Fix regression in multi-SKB responses to RTM_GETADDR, from Arthur
    Gautier.

    2) Fix ipv6 frag parsing in openvswitch, from Yi-Hung Wei.

    3) Unbounded recursion in ipv4 and ipv6 GUE tunnels, from Stefano
    Brivio.

    4) Use after free in hns driver, from Yonglong Liu.

    5) icmp6_send() needs to handle the case of NULL skb, from Eric
    Dumazet.

    6) Missing rcu read lock in __inet6_bind() when operating on mapped
    addresses, from David Ahern.

    7) Memory leak in tipc-nl_compat_publ_dump(), from Gustavo A. R. Silva.

    8) Fix PHY vs r8169 module loading ordering issues, from Heiner
    Kallweit.

    9) Fix bridge vlan memory leak, from Ido Schimmel.

    10) Dev refcount leak in AF_PACKET, from Jason Gunthorpe.

    11) Infoleak in ipv6_local_error(), flow label isn't completely
    initialized. From Eric Dumazet.

    12) Handle mv88e6390 errata, from Andrew Lunn.

    13) Making vhost/vsock CID hashing consistent, from Zha Bin.

    14) Fix lack of UMH cleanup when it unexpectedly exits, from Taehee Yoo.

    15) Bridge forwarding must clear skb->tstamp, from Paolo Abeni.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (87 commits)
    bnxt_en: Fix context memory allocation.
    bnxt_en: Fix ring checking logic on 57500 chips.
    mISDN: hfcsusb: Use struct_size() in kzalloc()
    net: clear skb->tstamp in bridge forwarding path
    net: bpfilter: disallow to remove bpfilter module while being used
    net: bpfilter: restart bpfilter_umh when error occurred
    net: bpfilter: use cleanup callback to release umh_info
    umh: add exit routine for UMH process
    isdn: i4l: isdn_tty: Fix some concurrency double-free bugs
    vhost/vsock: fix vhost vsock cid hashing inconsistent
    net: stmmac: Prevent RX starvation in stmmac_napi_poll()
    net: stmmac: Fix the logic of checking if RX Watchdog must be enabled
    net: stmmac: Check if CBS is supported before configuring
    net: stmmac: dwxgmac2: Only clear interrupts that are active
    net: stmmac: Fix PCI module removal leak
    tools/bpf: fix bpftool map dump with bitfields
    tools/bpf: test btf bitfield with >=256 struct member offset
    bpf: fix bpffs bitfield pretty print
    net: ethernet: mediatek: fix warning in phy_start_aneg
    tcp: change txhash on SYN-data timeout
    ...

    Linus Torvalds
     

12 Jan, 2019

1 commit

    A UMH process created by fork_usermode_blob(), such as bpfilter, needs
    to release the members of its umh_info when the process terminates.
    But do_exit() does not release the members of the umh_info, so a module
    that uses UMH needs its own code to detect whether the UMH process has
    terminated. This extra status-checking code eventually makes things
    more complex.

    A new PF_UMH flag is added and used to identify UMH processes.
    The exit_umh() helper does not release the members of the umh_info
    itself; the umh_info->cleanup callback should therefore release both
    the members of the umh_info and the private data.

    Suggested-by: David S. Miller
    Signed-off-by: Taehee Yoo
    Signed-off-by: David S. Miller

    Taehee Yoo
     

05 Jan, 2019

1 commit

  • Originally, the rule used to be that you'd have to do access_ok()
    separately, and then user_access_begin() before actually doing the
    direct (optimized) user access.

    But experience has shown that people then decide not to do access_ok()
    at all, and instead rely on it being implied by other operations or
    similar. Which makes it very hard to verify that the access has
    actually been range-checked.

    If you use the unsafe direct user accesses, hardware features (either
    SMAP - Supervisor Mode Access Protection - on x86, or PAN - Privileged
    Access Never - on ARM) do force you to use user_access_begin(). But
    nothing really forces the range check.

    By putting the range check into user_access_begin(), we actually force
    people to do the right thing (tm), and the range check will be visible
    near the actual accesses. We have way too long a history of people
    trying to avoid them.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

04 Jan, 2019

1 commit

  • Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
    of the user address range verification function since we got rid of the
    old racy i386-only code to walk page tables by hand.

    It existed because the original 80386 would not honor the write protect
    bit when in kernel mode, so you had to do COW by hand before doing any
    user access. But we haven't supported that in a long time, and these
    days the 'type' argument is a purely historical artifact.

    A discussion about extending 'user_access_begin()' to do the range
    checking resulted this patch, because there is no way we're going to
    move the old VERIFY_xyz interface to that model. And it's best done at
    the end of the merge window when I've done most of my merges, so let's
    just get this done once and for all.

    This patch was mostly done with a sed-script, with manual fix-ups for
    the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.

    There were a couple of notable cases:

    - csky still had the old "verify_area()" name as an alias.

    - the iter_iov code had magical hardcoded knowledge of the actual
    values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
    really used it)

    - microblaze used the type argument for a debug printout

    but other than those oddities this should be a total no-op patch.

    I tried to fix up all architectures, did fairly extensive grepping for
    access_ok() uses, and the changes are trivial, but I may have missed
    something. Any missed conversion should be trivially fixable, though.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

22 Jul, 2018

1 commit


21 Jul, 2018

2 commits

    Everywhere except in the pid array we distinguish between a task's pid and
    a task's tgid (thread group id). Even in the enumeration we want that
    distinction sometimes, so we have added __PIDTYPE_TGID. With leader_pid
    we almost have an implementation of PIDTYPE_TGID in struct signal_struct.

    Add PIDTYPE_TGID as a first class member of the pid_type enumeration and
    into the pids array. Then remove the __PIDTYPE_TGID special case and the
    leader_pid in signal_struct.

    The net size increase is just an extra pointer added to struct pid and
    an extra pair of pointers of an hlist_node added to task_struct.

    The effect on code maintenance is the removal of a number of special
    cases today and the potential to remove many more special cases as
    PIDTYPE_TGID gets used to its fullest. The long-term potential
    is allowing zombie thread group leaders to exit, which will remove
    a lot more special cases in the code.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • The function is general and inline so there is no need
    to hide it inside of exit.c

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

03 Apr, 2018

1 commit

  • All call sites of sys_wait4() set *rusage to NULL. Therefore, there is
    no need for the copy_to_user() handling of *rusage, and we can use
    kernel_wait4() directly.

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Acked-by: Luis R. Rodriguez
    Cc: Al Viro
    Cc: Andrew Morton
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     

05 Jan, 2018

1 commit

  • gcc -fisolate-erroneous-paths-dereference can generate calls to abort()
    from modular code too.

    [arnd@arndb.de: drop duplicate exports of abort()]
    Link: http://lkml.kernel.org/r/20180102103311.706364-1-arnd@arndb.de
    Reported-by: Vineet Gupta
    Cc: Sudip Mukherjee
    Cc: Arnd Bergmann
    Cc: Alexey Brodkin
    Cc: Russell King
    Cc: Jose Abreu
    Signed-off-by: Andrew Morton
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

15 Dec, 2017

1 commit

    The gcc option -fisolate-erroneous-paths-dereference (default at -O2
    and above) isolates faulty code paths such as null pointer accesses,
    division by zero, etc. If a gcc port doesn't implement __builtin_trap,
    an abort() call is generated instead, which causes a kernel link error.

    In this case, gcc is generating abort due to 'divide by zero' in
    lib/mpi/mpih-div.c.

    Currently 'frv' and 'arc' are failing. Other architectures have been
    broken in the past as well; m32r, for example, was fixed by commit
    d22e3d69ee1a ("m32r: fix build failure").

    Let's define this weak function, common to all architectures, and fix the
    problem permanently. We can even remove the arch-specific 'abort'
    implementations after this is done.

    Link: http://lkml.kernel.org/r/1513118956-8718-1-git-send-email-sudipm.mukherjee@gmail.com
    Signed-off-by: Sudip Mukherjee
    Cc: Alexey Brodkin
    Cc: Vineet Gupta
    Cc: Sudip Mukherjee
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sudip Mukherjee
     

25 Oct, 2017

1 commit

  • …READ_ONCE()/WRITE_ONCE()

    Please do not apply this to mainline directly, instead please re-run the
    coccinelle script shown below and apply its output.

    For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
    preference to ACCESS_ONCE(), and new code is expected to use one of the
    former. So far, there's been no reason to change most existing uses of
    ACCESS_ONCE(), as these aren't harmful, and changing them results in
    churn.

    However, for some features, the read/write distinction is critical to
    correct operation. To distinguish these cases, separate read/write
    accessors must be used. This patch migrates (most) remaining
    ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
    coccinelle script:

    ----
    // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
    // WRITE_ONCE()

    // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch

    virtual patch

    @ depends on patch @
    expression E1, E2;
    @@

    - ACCESS_ONCE(E1) = E2
    + WRITE_ONCE(E1, E2)

    @ depends on patch @
    expression E;
    @@

    - ACCESS_ONCE(E)
    + READ_ONCE(E)
    ----

    Signed-off-by: Mark Rutland <mark.rutland@arm.com>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: davem@davemloft.net
    Cc: linux-arch@vger.kernel.org
    Cc: mpe@ellerman.id.au
    Cc: shuah@kernel.org
    Cc: snitzer@redhat.com
    Cc: thor.thayer@linux.intel.com
    Cc: tj@kernel.org
    Cc: viro@zeniv.linux.org.uk
    Cc: will.deacon@arm.com
    Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Mark Rutland
     

21 Oct, 2017

1 commit

  • As pointed out by Linus and David, the earlier waitid() fix resulted in
    a (currently harmless) unbalanced user_access_end() call. This fixes it
    to just directly return EFAULT on access_ok() failure.

    Fixes: 96ca579a1ecc ("waitid(): Add missing access_ok() checks")
    Acked-by: David Daney
    Cc: Al Viro
    Signed-off-by: Kees Cook
    Signed-off-by: Linus Torvalds

    Kees Cook
     

10 Oct, 2017

1 commit

  • Adds missing access_ok() checks.

    CVE-2017-5123

    Reported-by: Chris Salls
    Signed-off-by: Kees Cook
    Acked-by: Al Viro
    Fixes: 4c48abe91be0 ("waitid(): switch copyout of siginfo to unsafe_put_user()")
    Cc: stable@kernel.org # 4.13
    Signed-off-by: Linus Torvalds

    Kees Cook
     

30 Sep, 2017

1 commit

kernel_waitid() can return a PID, an error or 0. rusage is filled only in the
    first case, and waitid(2) should've copied rusage out exactly in that case,
    *not* whenever kernel_waitid() has not returned an error. The compat variant
    shares that braino; none of the kernel_wait4() callers do, so the below
    ought to fix it.
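    The control flow being fixed can be sketched in userspace C. All names
    here are illustrative stand-ins, not the kernel's:

```c
#include <assert.h>
#include <string.h>

struct rusage_stub { long utime; };

/* Models kernel_waitid()'s three outcomes: a positive PID (rusage was
 * filled in), 0 (nothing reaped, rusage left untouched), or a negative
 * error. Hypothetical helper for illustration. */
static long fake_kernel_waitid(int outcome, struct rusage_stub *ru)
{
    if (outcome > 0)
        ru->utime = 42;   /* only the PID > 0 case fills rusage */
    return outcome;       /* 0 or -errno: ru was not written */
}

/* The buggy caller copied rusage whenever the return was not an error
 * (so also for 0); the fix copies it only when a PID was returned. */
static void copy_rusage_fixed(long ret, const struct rusage_stub *ru,
                              struct rusage_stub *user)
{
    if (ret > 0)          /* fixed: "> 0", not merely "not an error" */
        memcpy(user, ru, sizeof(*user));
}
```

    With the old condition, a 0 return copied out whatever stale data the
    rusage buffer happened to contain.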

    Reported-and-tested-by: Alexander Potapenko
    Fixes: ce72a16fa705 ("wait4(2)/waitid(2): separate copying rusage to userland")
    Cc: stable@vger.kernel.org # v4.13
    Signed-off-by: Al Viro

    Al Viro
     

12 Sep, 2017

1 commit

  • Pull namespace updates from Eric Biederman:
    "Life has been busy and I have not gotten half as much done this round
    as I would have liked. I delayed it so that a minor conflict
    resolution with the mips tree could spend a little time in linux-next
    before I sent this pull request.

    This includes two long delayed user namespace changes from Kirill
    Tkhai. It also includes a very useful change from Serge Hallyn that
    allows the security capability attribute to be used inside of user
    namespaces. The practical effect of this is people can now untar
    tarballs and install rpms in user namespaces. It had been suggested to
generalize this and encode some of the namespace information in the
    xattr name. Upon close inspection that makes the
    things that should be hard easy and the things that should be easy
    more expensive.

    Then there is my bugfix/cleanup for signal injection that removes the
    magic encoding of the siginfo union member from the kernel internal
si_code. The mips folks reported that the case where I had used
    FPE_FIXME is impossible, so I have removed FPE_FIXME from mips, while
    at the same time including a return statement in that case to keep gcc
    from complaining about uninitialized variables.

    I almost finished the work to make copy_siginfo_to_user a trivial
    copy to user. The code is available at:

    git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git neuter-copy_siginfo_to_user-v3

    But I did not have time/energy to get the code posted and reviewed
    before the merge window opened.

    I was able to see that the security excuse for just copying fields
    that we know are initialized doesn't work in practice: there are buggy
    initializations that don't initialize the proper fields in siginfo. So
    we still sometimes copy uninitialized data to userspace"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    Introduce v3 namespaced file capabilities
    mips/signal: In force_fcr31_sig return in the impossible case
    signal: Remove kernel interal si_code magic
    fcntl: Don't use ambiguous SIG_POLL si_codes
    prctl: Allow local CAP_SYS_ADMIN changing exe_file
    security: Use user_namespace::level to avoid redundant iterations in cap_capable()
    userns,pidns: Verify the userns for new pid namespaces
    signal/testing: Don't look for __SI_FAULT in userspace
    signal/mips: Document a conflict with SI_USER with SIGFPE
    signal/sparc: Document a conflict with SI_USER with SIGFPE
    signal/ia64: Document a conflict with SI_USER with SIGFPE
    signal/alpha: Document a conflict with SI_USER for SIGTRAP

    Linus Torvalds
     

05 Sep, 2017

1 commit

  • Pull locking updates from Ingo Molnar:

    - Add 'cross-release' support to lockdep, which allows APIs like
    completions, where it's not the 'owner' who releases the lock, to be
    tracked. It's all activated automatically under
    CONFIG_PROVE_LOCKING=y.

    - Clean up (restructure) the x86 atomics op implementation to be more
    readable, in preparation of KASAN annotations. (Dmitry Vyukov)

    - Fix static keys (Paolo Bonzini)

    - Add killable versions of down_read() et al (Kirill Tkhai)

    - Rework and fix jump_label locking (Marc Zyngier, Paolo Bonzini)

    - Rework (and fix) tlb_flush_pending() barriers (Peter Zijlstra)

    - Remove smp_mb__before_spinlock() and convert its usages, introduce
    smp_mb__after_spinlock() (Peter Zijlstra)

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (56 commits)
    locking/lockdep/selftests: Fix mixed read-write ABBA tests
    sched/completion: Avoid unnecessary stack allocation for COMPLETION_INITIALIZER_ONSTACK()
    acpi/nfit: Fix COMPLETION_INITIALIZER_ONSTACK() abuse
    locking/pvqspinlock: Relax cmpxchg's to improve performance on some architectures
    smp: Avoid using two cache lines for struct call_single_data
    locking/lockdep: Untangle xhlock history save/restore from task independence
    locking/refcounts, x86/asm: Disable CONFIG_ARCH_HAS_REFCOUNT for the time being
    futex: Remove duplicated code and fix undefined behaviour
    Documentation/locking/atomic: Finish the document...
    locking/lockdep: Fix workqueue crossrelease annotation
    workqueue/lockdep: 'Fix' flush_work() annotation
    locking/lockdep/selftests: Add mixed read-write ABBA tests
    mm, locking/barriers: Clarify tlb_flush_pending() barriers
    locking/lockdep: Make CONFIG_LOCKDEP_CROSSRELEASE and CONFIG_LOCKDEP_COMPLETIONS truly non-interactive
    locking/lockdep: Explicitly initialize wq_barrier::done::map
    locking/lockdep: Rename CONFIG_LOCKDEP_COMPLETE to CONFIG_LOCKDEP_COMPLETIONS
    locking/lockdep: Reword title of LOCKDEP_CROSSRELEASE config
    locking/lockdep: Make CONFIG_LOCKDEP_CROSSRELEASE part of CONFIG_PROVE_LOCKING
    locking/refcounts, x86/asm: Implement fast refcount overflow protection
    locking/lockdep: Fix the rollback and overwrite detection logic in crossrelease
    ...

    Linus Torvalds
     

17 Aug, 2017

3 commits

  • …isc.2017.08.17a', 'spin_unlock_wait_no.2017.08.17a', 'srcu.2017.07.27c' and 'torture.2017.07.24c' into HEAD

    doc.2017.08.17a: Documentation updates.
    fixes.2017.08.17a: RCU fixes.
    hotplug.2017.07.25b: CPU-hotplug updates.
    misc.2017.08.17a: Miscellaneous fixes outside of RCU (give or take conflicts).
    spin_unlock_wait_no.2017.08.17a: Remove spin_unlock_wait().
    srcu.2017.07.27c: SRCU updates.
    torture.2017.07.24c: Torture-test updates.

    Paul E. McKenney
     
  • There is no agreed-upon definition of spin_unlock_wait()'s semantics, and
    it appears that all callers could do just as well with a lock/unlock pair.
    This commit therefore replaces the spin_unlock_wait() call in do_exit()
    with spin_lock() followed immediately by spin_unlock(). This should be
    safe from a performance perspective because the lock is a per-task lock,
    and this is happening only at task-exit time.
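    The replacement pattern can be sketched with a POSIX mutex as a
    userspace stand-in for the kernel's per-task spinlock (names here are
    illustrative):

```c
#include <assert.h>
#include <pthread.h>

/* spin_unlock_wait() had no agreed-upon semantics; the commit replaces
 * it with an acquire/release pair: taking the lock waits out any current
 * holder, and releasing it immediately keeps the critical section empty. */
static pthread_mutex_t task_lock = PTHREAD_MUTEX_INITIALIZER;

static void wait_for_holder(void)
{
    /* before: spin_unlock_wait(&task_lock);       */
    /* after:  a lock/unlock pair with empty body  */
    pthread_mutex_lock(&task_lock);
    pthread_mutex_unlock(&task_lock);
}
```

    The pair is slightly more expensive than the old primitive, but as the
    commit notes this lock is per-task and taken only once, at task exit.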

    Signed-off-by: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Will Deacon
    Cc: Alan Stern
    Cc: Andrea Parri
    Cc: Linus Torvalds

    Paul E. McKenney
     
  • Currently, the exit-time support for TASKS_RCU is open-coded in do_exit().
    This commit creates exit_tasks_rcu_start() and exit_tasks_rcu_finish()
    APIs for do_exit() use. This has the benefit of confining the use of the
    tasks_rcu_exit_srcu variable to one file, allowing it to become static.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

10 Aug, 2017

1 commit

  • Lockdep is a runtime locking correctness validator that detects and
    reports a deadlock or its possibility by checking dependencies between
    locks. It's useful since it does not report just an actual deadlock but
    also the possibility of a deadlock that has not actually happened yet.
    That enables problems to be fixed before they affect real systems.

    However, this facility is only applicable to typical locks, such as
    spinlocks and mutexes, which are normally released within the context in
    which they were acquired. Synchronization primitives like page locks or
    completions, which are allowed to be released in any context, also
    create dependencies and can cause a deadlock.

    So lockdep should track these locks to do a better job. The 'crossrelease'
    implementation makes these primitives also be tracked.

    Signed-off-by: Byungchul Park
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: akpm@linux-foundation.org
    Cc: boqun.feng@gmail.com
    Cc: kernel-team@lge.com
    Cc: kirill@shutemov.name
    Cc: npiggin@gmail.com
    Cc: walken@google.com
    Cc: willy@infradead.org
    Link: http://lkml.kernel.org/r/1502089981-21272-6-git-send-email-byungchul.park@lge.com
    Signed-off-by: Ingo Molnar

    Byungchul Park