23 Mar, 2016

1 commit

  • kcov provides code coverage collection for coverage-guided fuzzing
    (randomized testing). Coverage-guided fuzzing is a testing technique
    that uses coverage feedback to determine new interesting inputs to a
    system. A notable user-space example is AFL
    (http://lcamtuf.coredump.cx/afl/). However, this technique is not
    widely used for kernel testing due to missing compiler and kernel
    support.

    kcov does not aim to collect as much coverage as possible. It aims to
    collect more or less stable coverage that is a function of syscall inputs.
    To achieve this goal it does not collect coverage in soft/hard
    interrupts, and instrumentation of some inherently non-deterministic or
    uninteresting parts of the kernel (e.g. scheduler, locking) is disabled.

    Currently there is a single coverage collection mode (tracing), but the
    API anticipates additional collection modes. Initially I also
    implemented a second mode which exposes coverage in a fixed-size hash
    table of counters (what Quentin used in his original patch). I've
    dropped the second mode for simplicity.

    This patch adds the necessary support on the kernel side. The
    complementary compiler support was added in gcc revision 231296.

    We've used this support to build syzkaller system call fuzzer, which has
    found 90 kernel bugs in just 2 months:

    https://github.com/google/syzkaller/wiki/Found-Bugs

    We've also found 30+ bugs in our internal systems with syzkaller.
    Another (yet unexplored) direction where kcov coverage would greatly
    help is more traditional "blob mutation". For example, mounting a
    random blob as a filesystem, or receiving a random blob over wire.

    Why not gcov? A typical fuzzing loop looks as follows: (1) reset
    coverage, (2) execute a bit of code, (3) collect coverage, repeat. A
    typical run may cover just a dozen basic blocks (e.g. an invalid
    input). In such a context gcov becomes prohibitively expensive, as the
    reset/collect steps depend on the total number of basic blocks/edges
    in the program (about 2M in the case of the kernel). The cost of kcov
    depends only on the number of executed basic blocks/edges. On top of
    that, the kernel requires per-thread coverage because there are always
    background threads and unrelated processes that also produce coverage.
    With inlined gcov instrumentation per-thread coverage is not possible.
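
    As an illustration (not part of the patch), here is a minimal user-space
    sketch of that reset/execute/collect loop against the kcov debugfs
    interface. The ioctl numbers and buffer layout follow the patch's
    documentation and are defined locally in case no kcov header is
    installed:

      #include <fcntl.h>
      #include <stdio.h>
      #include <sys/ioctl.h>
      #include <sys/mman.h>
      #include <unistd.h>

      #define KCOV_INIT_TRACE _IOR('c', 1, unsigned long)
      #define KCOV_ENABLE     _IO('c', 100)
      #define KCOV_DISABLE    _IO('c', 101)
      #define COVER_SIZE      (64 << 10)      /* capacity, in recorded PCs */

      int main(void)
      {
              int fd = open("/sys/kernel/debug/kcov", O_RDWR);
              if (fd < 0) { perror("open"); return 1; }
              /* Size the per-task coverage buffer (in unsigned longs). */
              if (ioctl(fd, KCOV_INIT_TRACE, COVER_SIZE)) { perror("init"); return 1; }
              unsigned long *cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long),
                                          PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
              if (cover == MAP_FAILED) { perror("mmap"); return 1; }
              /* Enable coverage collection for the current thread. */
              if (ioctl(fd, KCOV_ENABLE, 0)) { perror("enable"); return 1; }

              __atomic_store_n(&cover[0], 0, __ATOMIC_RELAXED);  /* (1) reset   */
              read(-1, NULL, 0);                                 /* (2) execute */
              unsigned long n = __atomic_load_n(&cover[0], __ATOMIC_RELAXED);
              for (unsigned long i = 0; i < n; i++)              /* (3) collect */
                      printf("0x%lx\n", cover[i + 1]);

              ioctl(fd, KCOV_DISABLE, 0);
              munmap(cover, COVER_SIZE * sizeof(unsigned long));
              close(fd);
              return 0;
      }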

    kcov exposes kernel PCs and control flow to user-space, which is
    insecure. However, debugfs should not be mapped as user accessible.

    Based on a patch by Quentin Casasnovas.

    [akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
    [akpm@linux-foundation.org: unbreak allmodconfig]
    [akpm@linux-foundation.org: follow x86 Makefile layout standards]
    Signed-off-by: Dmitry Vyukov
    Reviewed-by: Kees Cook
    Cc: syzkaller
    Cc: Vegard Nossum
    Cc: Catalin Marinas
    Cc: Tavis Ormandy
    Cc: Will Deacon
    Cc: Quentin Casasnovas
    Cc: Kostya Serebryany
    Cc: Eric Dumazet
    Cc: Alexander Potapenko
    Cc: Kees Cook
    Cc: Bjorn Helgaas
    Cc: Sasha Levin
    Cc: David Drysdale
    Cc: Ard Biesheuvel
    Cc: Andrey Ryabinin
    Cc: Kirill A. Shutemov
    Cc: Jiri Slaby
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
     

22 Mar, 2016

1 commit

  • Pull cgroup namespace support from Tejun Heo:
    "These are changes to implement namespace support for cgroup which has
    been pending for quite some time now. It is very straight-forward and
    only affects what part of cgroup hierarchies are visible.

    After unsharing, mounting a cgroup fs will be scoped to the cgroups
    the task belonged to at the time of unsharing and the cgroup paths
    exposed to userland would be adjusted accordingly"

    * 'for-4.6-ns' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: fix and restructure error handling in copy_cgroup_ns()
    cgroup: fix alloc_cgroup_ns() error handling in copy_cgroup_ns()
    Add FS_USERNS_FLAG to cgroup fs
    cgroup: Add documentation for cgroup namespaces
    cgroup: mount cgroupns-root when inside non-init cgroupns
    kernfs: define kernfs_node_dentry
    cgroup: cgroup namespace setns support
    cgroup: introduce cgroup namespaces
    sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
    kernfs: Add API to generate relative kernfs path

    Linus Torvalds
     

17 Feb, 2016

1 commit

  • Introduce the ability to create a new cgroup namespace. The newly created
    cgroup namespace remembers the cgroup of the process at the point
    of creation of the cgroup namespace (referred to as the cgroupns-root).
    The main purpose of a cgroup namespace is to virtualize the contents
    of the /proc/self/cgroup file. Processes inside a cgroup namespace
    are only able to see paths relative to their namespace root
    (unless they are moved outside of their cgroupns-root, at which point
    they will see a path relative to their cgroupns-root).
    For a correctly set up container this enables container tools
    (like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
    containers without leaking the system-level cgroup hierarchy to the task.
    This patch only implements the 'unshare' part of the cgroupns.
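
    As a hedged illustration of the intended effect once the whole series
    (including the mount and setns parts) is in place, the sketch below
    unshares into a new cgroup namespace and prints /proc/self/cgroup before
    and after; paths should then be reported relative to the cgroupns-root.
    CLONE_NEWCGROUP is defined locally with the value used by this series in
    case libc headers predate it:

      #define _GNU_SOURCE
      #include <sched.h>
      #include <stdio.h>
      #include <stdlib.h>

      #ifndef CLONE_NEWCGROUP
      #define CLONE_NEWCGROUP 0x02000000
      #endif

      static void dump_cgroup(const char *when)
      {
              char line[512];
              FILE *f = fopen("/proc/self/cgroup", "r");

              if (!f) { perror("fopen"); exit(1); }
              printf("--- %s ---\n", when);
              while (fgets(line, sizeof(line), f))
                      fputs(line, stdout);
              fclose(f);
      }

      int main(void)
      {
              dump_cgroup("before unshare");
              /* Needs CAP_SYS_ADMIN; the current cgroup becomes the ns root. */
              if (unshare(CLONE_NEWCGROUP)) { perror("unshare"); return 1; }
              dump_cgroup("after unshare");  /* paths now relative to the root */
              return 0;
      }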

    Signed-off-by: Aditya Kali
    Signed-off-by: Serge Hallyn
    Signed-off-by: Tejun Heo

    Aditya Kali
     

15 Jan, 2016

2 commits

  • While inspecting some vague code inside the prctl(PR_SET_MM_MEM) call
    (which tests the RLIMIT_DATA value to figure out whether we're allowed
    to assign new @start_brk, @brk, @start_data and @end_data in mm_struct),
    it turned out that RLIMIT_DATA, in the form it is implemented now, doesn't
    do anything useful, because most user-space libraries use the mmap()
    syscall for dynamic memory allocations.

    Linus suggested converting the RLIMIT_DATA rlimit into something suitable
    for anonymous memory accounting. But in this patch we go further, and
    the changes are bundled together as:

    * keep vma counting if CONFIG_PROC_FS=n, will be used for limits
    * replace mm->shared_vm with better defined mm->data_vm
    * account anonymous executable areas as executable
    * account file-backed growsdown/up areas as stack
    * drop struct file* argument from vm_stat_account
    * enforce RLIMIT_DATA for size of data areas

    This way the code looks cleaner: code/stack/data classification now
    depends only on the vm_flags state:

    VM_EXEC & ~VM_WRITE -> code (VmExe + VmLib in proc)
    VM_GROWSUP | VM_GROWSDOWN -> stack (VmStk)
    VM_WRITE & ~VM_SHARED & !stack -> data (VmData)

    The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called
    "shared", but it might be a strange beast like a readonly-private or
    VM_IO area.

    - RLIMIT_AS limits whole address space "VmSize"
    - RLIMIT_STACK limits stack "VmStk" (but each vma individually)
    - RLIMIT_DATA now limits "VmData"
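
    As a hedged user-space illustration (not from the patch itself) of the
    new semantics: once RLIMIT_DATA covers "VmData", an oversized private
    writable anonymous mapping is expected to fail with ENOMEM instead of
    being ignored by the limit:

      #include <errno.h>
      #include <stdio.h>
      #include <string.h>
      #include <sys/mman.h>
      #include <sys/resource.h>

      int main(void)
      {
              struct rlimit rl = { .rlim_cur = 1 << 20, .rlim_max = 1 << 20 };

              if (setrlimit(RLIMIT_DATA, &rl)) { perror("setrlimit"); return 1; }

              /* VM_WRITE & ~VM_SHARED & !stack -> accounted as data (VmData) */
              void *p = mmap(NULL, 8 << 20, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
              if (p == MAP_FAILED)
                      printf("mmap failed as expected: %s\n", strerror(errno));
              else
                      printf("mmap succeeded (kernel without this change?)\n");
              return 0;
      }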

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Cyrill Gorcunov
    Cc: Quentin Casasnovas
    Cc: Vegard Nossum
    Acked-by: Linus Torvalds
    Cc: Willy Tarreau
    Cc: Andy Lutomirski
    Cc: Kees Cook
    Cc: Vladimir Davydov
    Cc: Pavel Emelyanov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Mark those kmem allocations that are known to be easily triggered from
    userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
    memcg. For the list, see below:

    - threadinfo
    - task_struct
    - task_delay_info
    - pid
    - cred
    - mm_struct
    - vm_area_struct and vm_region (nommu)
    - anon_vma and anon_vma_chain
    - signal_struct
    - sighand_struct
    - fs_struct
    - files_struct
    - fdtable and fdtable->full_fds_bits
    - dentry and external_name
    - inode for all filesystems. This is the most tedious part, because
    most filesystems override the alloc_inode method.

    The list is far from complete, so feel free to add more objects.
    Nevertheless, it should be close to the "account everything" approach and
    keep most workloads within bounds. Malevolent users will be able to
    breach the limit, but this was possible even with the former "account
    everything" approach (simply because it did not, in fact, account
    everything).
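
    A hedged kernel-side sketch of the annotation pattern being applied here
    (names are illustrative, not from the patch): caches whose objects user
    space can create in unbounded numbers get SLAB_ACCOUNT, and one-off
    allocations pass __GFP_ACCOUNT, so both are charged to the allocating
    task's memcg:

      #include <linux/errno.h>
      #include <linux/init.h>
      #include <linux/slab.h>

      struct example_obj {
              int payload;
      };

      static struct kmem_cache *example_cachep;

      static int __init example_accounting_init(void)
      {
              struct example_obj *obj;

              /* Dedicated cache: every object is accounted to memcg. */
              example_cachep = kmem_cache_create("example_obj",
                                                 sizeof(struct example_obj),
                                                 0, SLAB_ACCOUNT, NULL);
              if (!example_cachep)
                      return -ENOMEM;

              /* Ad-hoc allocation: opt in with __GFP_ACCOUNT. */
              obj = kmalloc(sizeof(*obj), GFP_KERNEL | __GFP_ACCOUNT);
              kfree(obj);
              return 0;
      }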

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

13 Jan, 2016

1 commit

  • Pull cgroup updates from Tejun Heo:

    - cgroup v2 interface is now official. It's no longer hidden behind a
    devel flag and can be mounted using the new cgroup2 fs type.

    Unfortunately, cpu v2 interface hasn't made it yet due to the
    discussion around in-process hierarchical resource distribution and
    only memory and io controllers can be used on the v2 interface at the
    moment.

    - The existing documentation, which has always been a bit of a mess, is
    relocated under Documentation/cgroup-v1/. Documentation/cgroup-v2.txt
    is added as the authoritative documentation for the v2 interface.

    - Some features are added through for-4.5-ancestor-test branch to
    enable netfilter xt_cgroup match to use cgroup v2 paths. The actual
    netfilter changes will be merged through the net tree which pulled in
    the said branch.

    - Various cleanups

    * 'for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: rename cgroup documentations
    cgroup: fix a typo.
    cgroup: Remove resource_counter.txt in Documentation/cgroup-legacy/00-INDEX.
    cgroup: demote subsystem init messages to KERN_DEBUG
    cgroup: Fix uninitialized variable warning
    cgroup: put controller Kconfig options in meaningful order
    cgroup: clean up the kernel configuration menu nomenclature
    cgroup_pids: fix a typo.
    Subject: cgroup: Fix incomplete dd command in blkio documentation
    cgroup: kill cgrp_ss_priv[CGROUP_CANFORK_COUNT] and friends
    cpuset: Replace all instances of time_t with time64_t
    cgroup: replace unified-hierarchy.txt with a proper cgroup v2 documentation
    cgroup: rename Documentation/cgroups/ to Documentation/cgroup-legacy/
    cgroup: replace __DEVEL__sane_behavior with cgroup2 fs type

    Linus Torvalds
     

06 Jan, 2016

2 commits

  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • In the following commit:

    7675104990ed ("sched: Implement lockless wake-queues")

    we gained lockless wake-queues.

    The -RT kernel managed to lock itself up with those. There could be
    multiple attempts to enqueue task X for a wakeup _even_ if task X is
    already running.

    The reason is that task X could be runnable but not yet on a CPU. If the
    task performing the wakeup did not leave the CPU, it could perform
    multiple wakeups.

    With the proper timing, task X could be running while also enqueued for a
    wakeup. If this happens while X is performing a fork(), then its
    child will have a !NULL `wake_q` member copied.

    This is not a problem as long as the child task does not participate in
    lockless wakeups :)
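
    A hedged sketch of the kind of fix this implies (the helper name is
    illustrative): clear the child's wake-queue linkage at fork time so a
    copied, still-queued entry can never leak into the new task:

      #include <linux/sched.h>

      static void example_reset_child_wake_q(struct task_struct *p)
      {
              /* copy_process() duplicated the parent's task_struct wholesale,
               * so drop any inherited lockless wake-queue linkage. */
              p->wake_q.next = NULL;
      }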

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Davidlohr Bueso
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Fixes: 7675104990ed ("sched: Implement lockless wake-queues")
    Link: http://lkml.kernel.org/r/20151221171710.GA5499@linutronix.de
    Signed-off-by: Ingo Molnar

    Sebastian Andrzej Siewior
     

04 Dec, 2015

2 commits

  • The cputime can only be updated by the current task itself, even in
    the vtime case. So we can safely use a seqcount instead of a seqlock, as
    there is no writer concurrency involved.
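
    A hedged sketch of the resulting pattern (field and helper names are
    illustrative): the owning task is the only writer, so a plain seqcount
    suffices, and remote readers simply retry if they race with an update:

      #include <linux/sched.h>
      #include <linux/seqlock.h>

      /* writer side: only ever the task itself, so no lock is needed */
      static void example_vtime_account(struct task_struct *t, u64 delta)
      {
              write_seqcount_begin(&t->vtime_seqcount);
              t->vtime_snap += delta;
              write_seqcount_end(&t->vtime_seqcount);
      }

      /* reader side: anyone else; retry if a write was in progress */
      static u64 example_vtime_read(struct task_struct *t)
      {
              unsigned int seq;
              u64 val;

              do {
                      seq = read_seqcount_begin(&t->vtime_seqcount);
                      val = t->vtime_snap;
              } while (read_seqcount_retry(&t->vtime_seqcount, seq));

              return val;
      }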

    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Hiroshi Shimamoto
    Cc: Linus Torvalds
    Cc: Luiz Capitulino
    Cc: Mike Galbraith
    Cc: Paul E . McKenney
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1447948054-28668-8-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • The VTIME_SLEEPING state happens either when:

    1) The task is sleeping and no tickless delta is to be added to the task's
    cputime stats.
    2) The CPU isn't running vtime at all, so the same properties as in 1) apply.

    Let's rename the vtime symbol to reflect both states.

    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Hiroshi Shimamoto
    Cc: Linus Torvalds
    Cc: Luiz Capitulino
    Cc: Mike Galbraith
    Cc: Paul E . McKenney
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1447948054-28668-4-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     

30 Nov, 2015

1 commit

  • If the new child migrates to another cgroup before cgroup_post_fork() calls
    subsys->fork(), then both pids_can_attach() and pids_fork() will do the same
    pids_uncharge(old_pids) + pids_charge(pids) sequence twice.

    Change copy_process() to call threadgroup_change_begin/threadgroup_change_end
    unconditionally. percpu_down_read() is cheap and this allows other cleanups,
    see the next changes.

    Also, this way we can unify cgroup_threadgroup_rwsem and dup_mmap_sem.

    Signed-off-by: Oleg Nesterov
    Acked-by: Zefan Li
    Signed-off-by: Tejun Heo

    Oleg Nesterov
     

06 Nov, 2015

3 commits

  • Merge patch-bomb from Andrew Morton:

    - inotify tweaks

    - some ocfs2 updates (many more are awaiting review)

    - various misc bits

    - kernel/watchdog.c updates

    - Some of mm. I have a huge number of MM patches this time and quite a
    lot of it is quite difficult and much will be held over to next time.

    * emailed patches from Andrew Morton : (162 commits)
    selftests: vm: add tests for lock on fault
    mm: mlock: add mlock flags to enable VM_LOCKONFAULT usage
    mm: introduce VM_LOCKONFAULT
    mm: mlock: add new mlock system call
    mm: mlock: refactor mlock, munlock, and munlockall code
    kasan: always taint kernel on report
    mm, slub, kasan: enable user tracking by default with KASAN=y
    kasan: use IS_ALIGNED in memory_is_poisoned_8()
    kasan: Fix a type conversion error
    lib: test_kasan: add some testcases
    kasan: update reference to kasan prototype repo
    kasan: move KASAN_SANITIZE in arch/x86/boot/Makefile
    kasan: various fixes in documentation
    kasan: update log messages
    kasan: accurately determine the type of the bad access
    kasan: update reported bug types for kernel memory accesses
    kasan: update reported bug types for not user nor kernel memory accesses
    mm/kasan: prevent deadlock in kasan reporting
    mm/kasan: don't use kasan shadow pointer in generic functions
    mm/kasan: MODULE_VADDR is not available on all archs
    ...

    Linus Torvalds
     
  • The cost of faulting in all memory to be locked can be very high when
    working with large mappings. If only portions of the mapping will be used
    this can incur a high penalty for locking.

    For the example of a large file, this is the usage pattern for a large
    statistical language model (it probably applies to other statistical or
    graphical models as well). For the security example, any application
    transacting in data that cannot be swapped out (credit card data,
    medical records, etc.).

    This patch introduces the ability to request that pages are not
    pre-faulted, but are placed on the unevictable LRU when they are finally
    faulted in. The VM_LOCKONFAULT flag will be used together with VM_LOCKED
    and has no effect when set without VM_LOCKED. Setting the VM_LOCKONFAULT
    flag for a VMA will cause pages faulted into that VMA to be added to the
    unevictable LRU when they are faulted or if they are already present, but
    will not cause any missing pages to be faulted in.

    Exposing this new lock state means that we cannot overload the meaning of
    the FOLL_POPULATE flag any longer. Prior to this patch it was used to
    mean that the VMA for a fault was locked. This means we need the new
    FOLL_MLOCK flag to communicate the locked state of a VMA. FOLL_POPULATE
    will now only control if the VMA should be populated and in the case of
    VM_LOCKONFAULT, it will not be set.
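
    A hedged user-space sketch of lock-on-fault via the mlock2() syscall
    added in the same series (see the patch list above); MLOCK_ONFAULT and
    the x86_64 syscall number are defined locally in case libc headers
    predate the feature (the number is arch-specific):

      #define _GNU_SOURCE
      #include <stdio.h>
      #include <sys/mman.h>
      #include <sys/syscall.h>
      #include <unistd.h>

      #ifndef MLOCK_ONFAULT
      #define MLOCK_ONFAULT 0x01
      #endif
      #ifndef __NR_mlock2
      #define __NR_mlock2 325                /* x86_64 */
      #endif

      int main(void)
      {
              size_t len = 64UL << 20;       /* 64 MiB mapping */
              char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
              if (buf == MAP_FAILED) { perror("mmap"); return 1; }

              /* Nothing is pre-faulted here; pages join the unevictable LRU
               * only as they are first touched. */
              if (syscall(__NR_mlock2, buf, len, MLOCK_ONFAULT)) {
                      perror("mlock2");
                      return 1;
              }
              buf[0] = 1;    /* this page is now faulted in and locked */
              return 0;
      }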

    Signed-off-by: Eric B Munson
    Acked-by: Kirill A. Shutemov
    Acked-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Jonathan Corbet
    Cc: Catalin Marinas
    Cc: Geert Uytterhoeven
    Cc: Guenter Roeck
    Cc: Heiko Carstens
    Cc: Michael Kerrisk
    Cc: Ralf Baechle
    Cc: Shuah Khan
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     
  • Pull cgroup updates from Tejun Heo:
    "The cgroup core saw several significant updates this cycle:

    - percpu_rwsem for threadgroup locking is reinstated. This was
    temporarily dropped due to down_write latency issues. Oleg's
    rework of percpu_rwsem which is scheduled to be merged in this
    merge window resolves the issue.

    - On the v2 hierarchy, when controllers are enabled and disabled, all
    operations are atomic and can fail and revert cleanly. This allows
    ->can_attach() failure which is necessary for cpu RT slices.

    - Tasks now stay associated with the original cgroups after exit
    until released. This allows tracking resources held by zombies
    (e.g. pids) and makes it easy to find out where zombies came from
    on the v2 hierarchy. The pids controller was broken before these
    changes as zombies escaped the limits; unfortunately, updating this
    behavior required too many invasive changes and I don't think it's
    a good idea to backport them, so the pids controller on 4.3, the
    first version which included the pids controller, will stay broken
    at least until I'm sure about the cgroup core changes.

    - Optimization of a couple common tests using static_key"

    * 'for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (38 commits)
    cgroup: fix race condition around termination check in css_task_iter_next()
    blkcg: don't create "io.stat" on the root cgroup
    cgroup: drop cgroup__DEVEL__legacy_files_on_dfl
    cgroup: replace error handling in cgroup_init() with WARN_ON()s
    cgroup: add cgroup_subsys->free() method and use it to fix pids controller
    cgroup: keep zombies associated with their original cgroups
    cgroup: make css_set_rwsem a spinlock and rename it to css_set_lock
    cgroup: don't hold css_set_rwsem across css task iteration
    cgroup: reorganize css_task_iter functions
    cgroup: factor out css_set_move_task()
    cgroup: keep css_set and task lists in chronological order
    cgroup: make cgroup_destroy_locked() test cgroup_is_populated()
    cgroup: make css_sets pin the associated cgroups
    cgroup: relocate cgroup_[try]get/put()
    cgroup: move check_for_release() invocation
    cgroup: replace cgroup_has_tasks() with cgroup_is_populated()
    cgroup: make cgroup->nr_populated count the number of populated css_sets
    cgroup: remove an unused parameter from cgroup_task_migrate()
    cgroup: fix too early usage of static_branch_disable()
    cgroup: make cgroup_update_dfl_csses() migrate all target processes atomically
    ...

    Linus Torvalds
     

16 Oct, 2015

1 commit

  • cgroup_exit() is called when a task exits; it disassociates the
    exiting task from its cgroups and half-attaches it to the root cgroup.
    This is unnecessary and undesirable.

    No controller actually needs an exiting task to be disassociated from
    non-root cgroups. Both the cpu and perf_event controllers update the
    association to the root cgroup from their exit callbacks just to stay
    consistent with the cgroup core behavior.

    Also, this disassociation makes it difficult to track resources held
    by zombies or determine where the zombies came from. Currently, the pids
    controller is completely broken, as it uncharges on exit and zombies
    always escape the resource restriction. With the cgroup association being
    reset on exit, fixing it is pretty painful.

    There's no reason to reset cgroup membership on exit. The zombie can
    be removed from its css_set so that it doesn't show up on
    "cgroup.procs" and thus can't be migrated or interfere with cgroup
    removal. It can still pin and point to the css_set so that its cgroup
    membership is maintained. This patch makes cgroup core keep zombies
    associated with their cgroups at the time of exit.

    * Previous patches decoupled populated_cnt tracking from css_set
    lifetime, so a dying task can be simply unlinked from its css_set
    while pinning and pointing to the css_set. This keeps css_set
    association from task side alive while hiding it from "cgroup.procs"
    and populated_cnt tracking. The css_set reference is dropped when
    the task_struct is freed.

    * ->exit() callback no longer needs the css arguments as the
    associated css never changes once PF_EXITING is set. Removed.

    * cpu and perf_events controllers no longer need ->exit() callbacks.
    There's no reason to explicitly switch away on exit. The final
    schedule out is enough. The callbacks are removed.

    * On traditional hierarchies, nothing changes. "/proc/PID/cgroup"
    still reports "/" for all zombies. On the default hierarchy,
    "/proc/PID/cgroup" keeps reporting the cgroup that the task belonged
    to at the time of exit. If the cgroup gets removed before the task
    is reaped, " (deleted)" is appended.

    v2: Build breakage due to missing dummy cgroup_free() when
    !CONFIG_CGROUP fixed.

    Signed-off-by: Tejun Heo
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo

    Tejun Heo
     

15 Oct, 2015

1 commit

  • In the next patch in this series, a new field 'checking_timer' will
    be added to 'struct thread_group_cputimer'. Both this and the
    existing 'running' integer field are just used as boolean values. To
    save space in the structure, we can make both of these fields booleans.

    This is a preparatory patch to convert the existing running integer
    field to a boolean.

    Suggested-by: George Spelvin
    Signed-off-by: Jason Low
    Reviewed-by: George Spelvin
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Frederic Weisbecker
    Cc: Davidlohr Bueso
    Cc: Steven Rostedt
    Cc: hideaki.kimura@hpe.com
    Cc: terry.rudd@hpe.com
    Cc: scott.norton@hpe.com
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1444849677-29330-4-git-send-email-jason.low2@hp.com
    Signed-off-by: Thomas Gleixner

    Jason Low
     

17 Sep, 2015

1 commit

  • Note: This commit was originally committed as d59cfc09c32a but got
    reverted by 0c986253b939 due to the performance regression from
    the percpu_rwsem write down/up operations added to cgroup task
    migration path. percpu_rwsem changes which alleviate the
    performance issue are pending for v4.4-rc1 merge window.
    Re-apply.

    The cgroup side of threadgroup locking uses signal_struct->group_rwsem
    to synchronize against threadgroup changes. This per-process rwsem
    adds small overhead to thread creation, exit and exec paths, forces
    cgroup code paths to do lock-verify-unlock-retry dance in a couple
    places and makes it impossible to atomically perform operations across
    multiple processes.

    This patch replaces signal_struct->group_rwsem with a global
    percpu_rwsem cgroup_threadgroup_rwsem which is cheaper on the reader
    side and contained in cgroups proper. This patch converts one-to-one.

    This does make the writer side heavier and lowers the granularity;
    however, cgroup process migration is a fairly cold path, we do want to
    optimize thread operations over it, and cgroup migration operations don't
    take long enough for the lower granularity to matter.
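
    A hedged sketch of the one-to-one conversion (names illustrative): the
    hot per-thread paths take the global percpu_rwsem for read, while the
    cold migration path takes it for write to exclude all of them at once:

      #include <linux/init.h>
      #include <linux/percpu-rwsem.h>

      static struct percpu_rw_semaphore example_threadgroup_rwsem;

      static int __init example_init(void)
      {
              return percpu_init_rwsem(&example_threadgroup_rwsem);
      }

      static void example_fork_exit_exec_path(void)
      {
              percpu_down_read(&example_threadgroup_rwsem);   /* cheap */
              /* ... create, exec or exit a thread ... */
              percpu_up_read(&example_threadgroup_rwsem);
      }

      static void example_migrate_process(void)
      {
              percpu_down_write(&example_threadgroup_rwsem);  /* heavy, but cold */
              /* ... atomically move every task of a process ... */
              percpu_up_write(&example_threadgroup_rwsem);
      }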

    Signed-off-by: Tejun Heo
    Link: http://lkml.kernel.org/g/55F8097A.7000206@de.ibm.com
    Cc: Ingo Molnar
    Cc: Peter Zijlstra

    Tejun Heo
     

16 Sep, 2015

1 commit

  • This reverts commit d59cfc09c32a2ae31f1c3bc2983a0cd79afb3f14.

    d59cfc09c32a ("sched, cgroup: replace signal_struct->group_rwsem with
    a global percpu_rwsem") and b5ba75b5fc0e ("cgroup: simplify
    threadgroup locking") changed how cgroup synchronizes against task
    fork and exits so that it uses global percpu_rwsem instead of
    per-process rwsem; unfortunately, the write [un]lock paths of
    percpu_rwsem always involve synchronize_rcu_expedited() which turned
    out to be too expensive.

    Improvements to percpu_rwsem which alleviate this issue are scheduled to
    be merged in the coming v4.4-rc1 merge window. For now, revert
    the two commits to restore per-process rwsem. They will be re-applied
    for the v4.4-rc1 merge window.

    Signed-off-by: Tejun Heo
    Link: http://lkml.kernel.org/g/55F8097A.7000206@de.ibm.com
    Reported-by: Christian Borntraeger
    Cc: Oleg Nesterov
    Cc: "Paul E. McKenney"
    Cc: Peter Zijlstra
    Cc: Paolo Bonzini
    Cc: stable@vger.kernel.org # v4.2+

    Tejun Heo
     

05 Sep, 2015

2 commits

  • These two flags get set in vma->vm_flags to tell the VM common code
    whether the userfaultfd is armed and in which mode (only tracking missing
    faults, only tracking wrprotect faults, or both). If neither flag is
    set, it means the userfaultfd is not armed on the vma.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This adds the vm_userfaultfd_ctx to the vm_area_struct.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

02 Sep, 2015

2 commits

  • Pull cgroup updates from Tejun Heo:

    - a new PIDs controller is added. It turns out that PIDs are actually
    an independent resource from kmem due to the limited PID space.

    - more core preparations for the v2 interface. Once cpu side interface
    is settled, it should be ready for lifting the devel mask.
    for-4.3-unified-base was temporarily branched so that other trees
    (block) can pull cgroup core changes that blkcg changes depend on.

    - a non-critical idr_preload usage bug fix.

    * 'for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: pids: fix invalid get/put usage
    cgroup: introduce cgroup_subsys->legacy_name
    cgroup: don't print subsystems for the default hierarchy
    cgroup: make cftype->private a unsigned long
    cgroup: export cgrp_dfl_root
    cgroup: define controller file conventions
    cgroup: fix idr_preload usage
    cgroup: add documentation for the PIDs controller
    cgroup: implement the PIDs subsystem
    cgroup: allow a cgroup subsystem to reject a fork

    Linus Torvalds
     
  • Pull user namespace updates from Eric Biederman:
    "This finishes up the changes to ensure proc and sysfs do not start
    implementing executable files, as the there are application today that
    are only secure because such files do not exist.

    It akso fixes a long standing misfeature of /proc//mountinfo that
    did not show the proper source for files bind mounted from
    /proc//ns/*.

    It also straightens out the handling of clone flags related to user
    namespaces, fixing an unnecessary failure of unshare(CLONE_NEWUSER)
    when files such as /proc/<pid>/environ are read while <pid> is calling
    unshare. This winds up fixing a minor bug in unshare flag handling
    that dates back to the first version of unshare in the kernel.

    Finally, this fixes a minor regression caused by the introduction of
    sysfs_create_mount_point, which broke someone's in house application,
    by restoring the size of /sys/fs/cgroup to 0 bytes. Apparently that
    application uses the directory size to determine if a tmpfs is mounted
    on /sys/fs/cgroup.

    The bind mount escape fixes are present in Al Viro's for-next branch,
    and I expect them to come from there. The bind mount escape is the
    last of the user namespace related security bugs that I am aware of"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    fs: Set the size of empty dirs to 0.
    userns,pidns: Force thread group sharing, not signal handler sharing.
    unshare: Unsharing a thread does not require unsharing a vm
    nsfs: Add a show_path method to fix mountinfo
    mnt: fs_fully_visible enforce noexec and nosuid if !SB_I_NOEXEC
    vfs: Commit to never having exectuables on proc and sysfs.

    Linus Torvalds
     

01 Sep, 2015

1 commit

  • Pull scheduler updates from Ingo Molnar:
    "The biggest change in this cycle is the rewrite of the main SMP load
    balancing metric: the CPU load/utilization. The main goal was to make
    the metric more precise and more representative - see the changelog of
    this commit for the gory details:

    9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")

    It is done in a way that significantly reduces complexity of the code:

    5 files changed, 249 insertions(+), 494 deletions(-)

    and the performance testing results are encouraging. Nevertheless we
    need to keep an eye on potential regressions, since this potentially
    affects every SMP workload in existence.

    This work comes from Yuyang Du.

    Other changes:

    - SCHED_DL updates. (Andrea Parri)

    - Simplify architecture callbacks by removing finish_arch_switch().
    (Peter Zijlstra et al)

    - cputime accounting: guarantee stime + utime == rtime. (Peter
    Zijlstra)

    - optimize idle CPU wakeups some more - inspired by Facebook server
    loads. (Mike Galbraith)

    - stop_machine fixes and updates. (Oleg Nesterov)

    - Introduce the 'trace_sched_waking' tracepoint. (Peter Zijlstra)

    - sched/numa tweaks. (Srikar Dronamraju)

    - misc fixes and small cleanups"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
    sched/deadline: Fix comment in enqueue_task_dl()
    sched/deadline: Fix comment in push_dl_tasks()
    sched: Change the sched_class::set_cpus_allowed() calling context
    sched: Make sched_class::set_cpus_allowed() unconditional
    sched: Fix a race between __kthread_bind() and sched_setaffinity()
    sched: Ensure a task has a non-normalized vruntime when returning back to CFS
    sched/numa: Fix NUMA_DIRECT topology identification
    tile: Reorganize _switch_to()
    sched, sparc32: Update scheduler comments in copy_thread()
    sched: Remove finish_arch_switch()
    sched, tile: Remove finish_arch_switch
    sched, sh: Fold finish_arch_switch() into switch_to()
    sched, score: Remove finish_arch_switch()
    sched, avr32: Remove finish_arch_switch()
    sched, MIPS: Get rid of finish_arch_switch()
    sched, arm: Remove finish_arch_switch()
    sched/fair: Clean up load average references
    sched/fair: Provide runnable_load_avg back to cfs_rq
    sched/fair: Remove task and group entity load when they are dead
    sched/fair: Init cfs_rq's sched_entity load average
    ...

    Linus Torvalds
     

13 Aug, 2015

2 commits

  • The code that places signals in signal queues computes the uids, gids,
    and pids at the time the signals are enqueued, which means that tasks
    that share signal queues must be in the same pid and user namespaces.

    Sharing signal handlers is fine, but bizarre.

    So make the code in fork and userns_install clearer by only testing
    for what is functionally necessary.

    Also update the comment in unshare about unsharing a user namespace to
    be a little more explicit and make a little more sense.

    Acked-by: Oleg Nesterov
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • The logic in the initial commit of unshare made creating a new
    thread group for a process contingent upon creating a new memory
    address space for that process. That is wrong: two separate
    processes in different thread groups can share a memory address space,
    and clone allows the creation of such processes.

    This is significant because it was observed that mm_users > 1 does not
    mean that a process is multi-threaded, as reading /proc/PID/maps
    temporarily increments mm_users, which allows other processes to
    (accidentally) interfere with unshare() calls.

    Correct the checks in check_unshare_flags(): test !thread_group_empty()
    for CLONE_THREAD, CLONE_SIGHAND, and CLONE_VM; test sighand->count > 1
    for CLONE_SIGHAND and CLONE_VM; and test !current_is_single_threaded()
    instead of mm_users > 1 for CLONE_VM.

    By using the correct checks in unshare this removes the possibility of
    an accidental denial of service attack.

    Additionally, using the correct checks in unshare ensures that only an
    explicit unshare(CLONE_VM) can possibly trigger the slow path of
    current_is_single_threaded(). As an explicit unshare(CLONE_VM) is
    pointless, it is not expected that many applications make that call.

    Cc: stable@vger.kernel.org
    Fixes: b2e0d98705e60e45bbb3c0032c48824ad7ae0704 userns: Implement unshare of the user namespace
    Reported-by: Ricky Zhou
    Reported-by: Kees Cook
    Reviewed-by: Kees Cook
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

03 Aug, 2015

1 commit

  • While the current code guarantees monotonicity for stime and utime
    independently of one another, it does not guarantee that the sum of
    both is equal to the total time we started out with.

    This confuses things (and people) that look at this sum, like top, which
    will report >100% usage followed by a matching period of 0%.

    Rework the code to provide both individual monotonicity and a coherent
    sum.
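
    A hedged user-space model of the reworked adjustment (a toy, not the
    kernel code; it assumes rtime itself never decreases and ignores
    overflow): split rtime in the observed stime:utime ratio, clamp each
    piece against the previously reported value, and always derive the other
    piece from rtime so the sum stays exact:

      #include <stdint.h>
      #include <stdio.h>

      struct prev_cputime { uint64_t stime, utime; };

      static void adjust(struct prev_cputime *prev, uint64_t rtime,
                         uint64_t raw_stime, uint64_t raw_utime,
                         uint64_t *stime, uint64_t *utime)
      {
              uint64_t total = raw_stime + raw_utime;
              /* toy 64-bit math; the kernel uses a scaled division helper */
              uint64_t s = total ? (rtime * raw_stime) / total : rtime / 2;

              if (s < prev->stime)            /* keep stime monotonic */
                      s = prev->stime;
              if (rtime - s < prev->utime)    /* keep utime monotonic */
                      s = rtime - prev->utime;

              prev->stime = *stime = s;
              prev->utime = *utime = rtime - s;  /* sum equals rtime exactly */
      }

      int main(void)
      {
              struct prev_cputime prev = { 0, 0 };
              uint64_t st, ut;

              adjust(&prev, 100, 30, 60, &st, &ut);
              printf("stime=%llu utime=%llu\n",
                     (unsigned long long)st, (unsigned long long)ut);
              return 0;
      }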

    Suggested-by: Fredrik Markstrom
    Reported-by: Fredrik Markstrom
    Tested-by: Fredrik Markstrom
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Stanislaw Gruszka
    Cc: Thomas Gleixner
    Cc: jason.low2@hp.com
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

18 Jul, 2015

2 commits

  • Don't burden architectures without dynamic task_struct sizing
    with the overhead of dynamic sizing.

    Also optimize the x86 code a bit by caching task_struct_size.

    Acked-and-Tested-by: Dave Hansen
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: Denys Vlasenko
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1437128892-9831-3-git-send-email-mingo@kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • The FPU rewrite removed the dynamic allocations of 'struct fpu'.
    But, this potentially wastes massive amounts of memory (2k per
    task on systems that do not have AVX-512 for instance).

    Instead of having a separate slab, this patch just appends the
    space that we need to the 'task_struct' which we dynamically
    allocate already. This saves from doing an extra slab
    allocation at fork().

    The only real downside here is that we have to stick everything
    at the end of the task_struct. But I think the
    BUILD_BUG_ON()s I stuck in there should keep that from being too
    fragile.

    Signed-off-by: Dave Hansen
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1437128892-9831-2-git-send-email-mingo@kernel.org
    Signed-off-by: Ingo Molnar

    Dave Hansen
     

15 Jul, 2015

1 commit

  • Add a new cgroup subsystem callback can_fork that states whether the
    fork is accepted or rejected by a cgroup policy. In addition, add a
    cancel_fork callback so that if an error occurs later in the forking
    process, any state modified by can_fork can be reverted.

    Allow for a private opaque pointer to be passed from cgroup_can_fork to
    cgroup_post_fork, allowing for the fork state to be stored by each
    subsystem separately.

    Also add a tagging system for cgroup_subsys.h to allow for CGROUP_
    enumerations to be defined and used. In addition, explicitly add a
    CGROUP_CANFORK_COUNT macro to make arrays easier to define.

    This is in preparation for implementing the pids cgroup subsystem.
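
    A hedged sketch of a controller using the new callbacks (the policy and
    names are invented for illustration; the opaque priv pointer mirrors the
    interface described above, but the exact signatures may differ):

      #include <linux/atomic.h>
      #include <linux/errno.h>
      #include <linux/sched.h>

      static atomic_t example_nr_tasks = ATOMIC_INIT(0);
      #define EXAMPLE_MAX_TASKS 512

      static int example_can_fork(struct task_struct *task, void **priv_p)
      {
              if (atomic_inc_return(&example_nr_tasks) > EXAMPLE_MAX_TASKS) {
                      atomic_dec(&example_nr_tasks);
                      return -EAGAIN;         /* policy rejects this fork */
              }
              *priv_p = NULL;                 /* no per-fork state to carry */
              return 0;
      }

      static void example_cancel_fork(struct task_struct *task, void *priv)
      {
              /* the fork failed after can_fork: revert the charge */
              atomic_dec(&example_nr_tasks);
      }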

    Signed-off-by: Aleksa Sarai
    Signed-off-by: Tejun Heo

    Aleksa Sarai
     

27 Jun, 2015

1 commit

  • Pull cgroup updates from Tejun Heo:

    - threadgroup_lock got reorganized so that its users can pick the
    actual locking mechanism to use. Its only user - cgroups - is
    updated to use a percpu_rwsem instead of per-process rwsem.

    This makes things a bit lighter on hot paths and allows cgroups to
    perform and fail multi-task (a process) migrations atomically.
    Multi-task migrations are used in several places including the
    unified hierarchy.

    - Delegation rule and documentation added to unified hierarchy. This
    will likely be the last interface update from the cgroup core side
    for unified hierarchy before lifting the devel mask.

    - Some groundwork for the pids controller which is scheduled to be
    merged in the coming devel cycle.

    * 'for-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: add delegation section to unified hierarchy documentation
    cgroup: require write perm on common ancestor when moving processes on the default hierarchy
    cgroup: separate out cgroup_procs_write_permission() from __cgroup_procs_write()
    kernfs: make kernfs_get_inode() public
    MAINTAINERS: add a cgroup core co-maintainer
    cgroup: fix uninitialised iterator in for_each_subsys_which
    cgroup: replace explicit ss_mask checking with for_each_subsys_which
    cgroup: use bitmask to filter for_each_subsys
    cgroup: add seq_file forward declaration for struct cftype
    cgroup: simplify threadgroup locking
    sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem
    sched, cgroup: reorganize threadgroup locking
    cgroup: switch to unsigned long for bitmasks
    cgroup: reorganize include/linux/cgroup.h
    cgroup: separate out include/linux/cgroup-defs.h
    cgroup: fix some comment typos

    Linus Torvalds
     

26 Jun, 2015

1 commit

  • clone has some of the quirkiest syscall handling in the kernel, with a
    pile of special cases, historical curiosities, and architecture-specific
    calling conventions. In particular, clone with CLONE_SETTLS accepts a
    parameter "tls" that the C entry point completely ignores and some
    assembly entry points overwrite; instead, the low-level arch-specific
    code pulls the tls parameter out of the arch-specific register captured
    as part of pt_regs on entry to the kernel. That's a massive hack, and
    it makes the arch-specific code only work when called via the specific
    existing syscall entry points; because of this hack, any new clone-like
    system call would have to accept an identical tls argument in exactly
    the same arch-specific position, rather than providing a unified system
    call entry point across architectures.

    The first patch allows architectures to handle the tls argument via
    normal C parameter passing, if they opt in by selecting
    HAVE_COPY_THREAD_TLS. The second patch makes 32-bit and 64-bit x86 opt
    into this.

    These two patches came out of the clone4 series, which isn't ready for
    this merge window, but these first two cleanup patches were entirely
    uncontroversial and have acks. I'd like to go ahead and submit these
    two so that other architectures can begin building on top of this and
    opting into HAVE_COPY_THREAD_TLS. However, I'm also happy to wait and
    send these through the next merge window (along with v3 of clone4) if
    anyone would prefer that.

    This patch (of 2):

    clone with CLONE_SETTLS accepts an argument to set the thread-local
    storage area for the new thread. sys_clone declares an int argument
    tls_val in the appropriate point in the argument list (based on the
    various CLONE_BACKWARDS variants), but doesn't actually use or pass along
    that argument. Instead, sys_clone calls do_fork, which calls
    copy_process, which calls the arch-specific copy_thread, and copy_thread
    pulls the corresponding syscall argument out of the pt_regs captured at
    kernel entry (knowing what argument of clone that architecture passes tls
    in).

    Apart from being awful and inscrutable, that also only works because only
    one code path into copy_thread can pass the CLONE_SETTLS flag, and that
    code path comes from sys_clone with its architecture-specific
    argument-passing order. This prevents introducing a new version of the
    clone system call without propagating the same architecture-specific
    position of the tls argument.

    However, there's no reason to pull the argument out of pt_regs when
    sys_clone could just pass it down via C function call arguments.

    Introduce a new CONFIG_HAVE_COPY_THREAD_TLS for architectures to opt into,
    and a new copy_thread_tls that accepts the tls parameter as an additional
    unsigned long (syscall-argument-sized) argument. Change sys_clone's tls
    argument to an unsigned long (which does not change the ABI), and pass
    that down to copy_thread_tls.

    Architectures that don't opt into copy_thread_tls will continue to ignore
    the C argument to sys_clone in favor of the pt_regs captured at kernel
    entry, and thus will be unable to introduce new versions of the clone
    syscall.
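
    A hedged kernel-side sketch of the opt-in shape described above: the
    architecture selects HAVE_COPY_THREAD_TLS in its Kconfig and implements
    copy_thread_tls(), receiving tls as an ordinary C argument instead of
    digging it out of pt_regs (the body below is purely illustrative):

      #include <linux/sched.h>

      /* arch/<arch>/Kconfig:  select HAVE_COPY_THREAD_TLS */

      int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
                          unsigned long kthread_arg, struct task_struct *p,
                          unsigned long tls)
      {
              /* ... arch-specific setup of the child's registers/stack ... */
              if (clone_flags & CLONE_SETTLS) {
                      /* Use the C argument directly -- no pt_regs digging.
                       * How the TLS pointer is installed is arch-specific
                       * and elided here. */
              }
              return 0;
      }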

    Patch co-authored by Josh Triplett and Thiago Macieira.

    Signed-off-by: Josh Triplett
    Acked-by: Andy Lutomirski
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Thiago Macieira
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josh Triplett
     

27 May, 2015

1 commit

  • The cgroup side of threadgroup locking uses signal_struct->group_rwsem
    to synchronize against threadgroup changes. This per-process rwsem
    adds small overhead to thread creation, exit and exec paths, forces
    cgroup code paths to do lock-verify-unlock-retry dance in a couple
    places and makes it impossible to atomically perform operations across
    multiple processes.

    This patch replaces signal_struct->group_rwsem with a global
    percpu_rwsem cgroup_threadgroup_rwsem which is cheaper on the reader
    side and contained in cgroups proper. This patch converts one-to-one.

    This does make the writer side heavier and lowers the granularity;
    however, cgroup process migration is a fairly cold path, we do want to
    optimize thread operations over it, and cgroup migration operations don't
    take long enough for the lower granularity to matter.

    Signed-off-by: Tejun Heo
    Cc: Ingo Molnar
    Cc: Peter Zijlstra

    Tejun Heo
     

19 May, 2015

1 commit

  • Until now, pagefault_disable()/pagefault_enabled() used the preempt
    count to track whether we are in an environment with pagefaults disabled
    (which can be queried via in_atomic()).

    This patch introduces a separate counter in task_struct to count the
    nesting level of pagefault_disable() calls. We'll keep manipulating the
    preempt count to retain compatibility with existing pagefault handlers.

    It is now possible to verify whether we are in a pagefault_disable()
    environment by calling pagefault_disabled(). In contrast to in_atomic(),
    it will not be influenced by preempt_enable()/preempt_disable().

    This patch is based on a patch from Ingo Molnar.
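
    A hedged, simplified sketch of the mechanism (the real helpers also keep
    the preempt count in sync, as noted above): a per-task counter that
    preempt_disable()/preempt_enable() never touch:

      #include <linux/compiler.h>
      #include <linux/sched.h>

      static inline void example_pagefault_disable(void)
      {
              current->pagefault_disabled++;
              /* compiler barrier: the counter must be visible to the fault
               * handler before any access that might fault */
              barrier();
      }

      static inline void example_pagefault_enable(void)
      {
              barrier();
              current->pagefault_disabled--;
      }

      static inline bool example_pagefault_disabled(void)
      {
              /* unlike in_atomic(), unaffected by preempt_disable() */
              return current->pagefault_disabled != 0;
      }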

    Reviewed-and-tested-by: Thomas Gleixner
    Signed-off-by: David Hildenbrand
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: David.Laight@ACULAB.COM
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: airlied@linux.ie
    Cc: akpm@linux-foundation.org
    Cc: benh@kernel.crashing.org
    Cc: bigeasy@linutronix.de
    Cc: borntraeger@de.ibm.com
    Cc: daniel.vetter@intel.com
    Cc: heiko.carstens@de.ibm.com
    Cc: herbert@gondor.apana.org.au
    Cc: hocko@suse.cz
    Cc: hughd@google.com
    Cc: mst@redhat.com
    Cc: paulus@samba.org
    Cc: ralf@linux-mips.org
    Cc: schwidefsky@de.ibm.com
    Cc: yang.shi@windriver.com
    Link: http://lkml.kernel.org/r/1431359540-32227-2-git-send-email-dahi@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    David Hildenbrand
     

08 May, 2015

2 commits

  • While running a database workload, we found a scalability issue with itimers.

    Much of the problem was caused by the thread_group_cputimer spinlock.
    Each time we account for group system/user time, we need to obtain a
    thread_group_cputimer's spinlock to update the timers. On larger systems
    (such as a 16-socket machine), this caused more than 30% of total time to
    be spent trying to obtain this kernel lock to update these group timer stats.

    This patch converts the timers to 64-bit atomic variables and uses
    atomic add to update them without a lock. With this patch, the percentage
    of total time spent updating thread group cputimer timers was reduced
    from 30% down to less than 1%.

    Note: On 32-bit systems using the generic 64-bit atomics, this causes
    sample_group_cputimer() to take locks 3 times instead of just 1 time.
    However, we tested this patch on a 32-bit ARM system using the
    generic atomics and did not find the overhead to be much of an issue.
    An explanation for why this isn't an issue is that 32-bit systems usually
    have small numbers of CPUs, and cacheline contention from extra spinlocks
    called periodically is not really apparent on smaller systems.
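
    A hedged sketch of the lockless accounting pattern described above
    (struct and function names are illustrative): accumulate per-thread
    ticks into 64-bit atomics instead of taking the thread_group_cputimer
    spinlock on every accounting call:

      #include <linux/atomic.h>
      #include <linux/types.h>

      struct example_group_cputimer {
              atomic64_t utime;
              atomic64_t stime;
              atomic64_t sum_exec_runtime;
      };

      static void example_account_group_user_time(struct example_group_cputimer *ct,
                                                  u64 cputime)
      {
              /* previously: spin_lock(&ct->lock); ct->utime += cputime; ... */
              atomic64_add(cputime, &ct->utime);
      }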

    Signed-off-by: Jason Low
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Thomas Gleixner
    Acked-by: Rik van Riel
    Cc: Andrew Morton
    Cc: Aswin Chandramouleeswaran
    Cc: Borislav Petkov
    Cc: Davidlohr Bueso
    Cc: Frederic Weisbecker
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Mike Galbraith
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Preeti U Murthy
    Cc: Scott J Norton
    Cc: Steven Rostedt
    Cc: Waiman Long
    Link: http://lkml.kernel.org/r/1430251224-5764-4-git-send-email-jason.low2@hp.com
    Signed-off-by: Ingo Molnar

    Jason Low
     
  • ACCESS_ONCE doesn't work reliably on non-scalar types. This patch removes
    the rest of the existing usages of ACCESS_ONCE() in the scheduler, and uses
    the new READ_ONCE() and WRITE_ONCE() APIs as appropriate.
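
    A hedged fragment showing the shape of the conversion (the structure and
    field here are only illustrative):

      #include <linux/compiler.h>
      #include <linux/sched.h>

      /* lockless check of a flag that another CPU may be updating */
      static bool example_timer_is_running(struct thread_group_cputimer *ct)
      {
              /* READ_ONCE() documents the lockless access and keeps the
               * compiler from tearing or re-reading the load. */
              return READ_ONCE(ct->running);
      }

      static void example_stop_timer(struct thread_group_cputimer *ct)
      {
              WRITE_ONCE(ct->running, 0);
      }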

    Signed-off-by: Jason Low
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Thomas Gleixner
    Acked-by: Rik van Riel
    Acked-by: Waiman Long
    Cc: Andrew Morton
    Cc: Aswin Chandramouleeswaran
    Cc: Borislav Petkov
    Cc: Davidlohr Bueso
    Cc: Frederic Weisbecker
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Mike Galbraith
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Preeti U Murthy
    Cc: Scott J Norton
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/1430251224-5764-2-git-send-email-jason.low2@hp.com
    Signed-off-by: Ingo Molnar

    Jason Low
     

17 Apr, 2015

3 commits

  • sync_buffer() needs the mmap_sem for two distinct operations, both only
    occurring upon user context switch handling:

    1) Dealing with the exe_file.

    2) Adding the dcookie data as we need to lookup the vma that
    backs it. This is done via add_sample() and add_data().

    This patch isolates 1), for it will no longer need the mmap_sem for
    serialization. However, for now, make use of the more standard
    get_mm_exe_file(), requiring only holding the mmap_sem to read the value,
    and relying on reference counting to make sure that the exe file won't
    disappear underneath us while getting the dcookie.

    As a consequence, for 2) we move the mmap_sem locking into where we really
    need it, in lookup_dcookie(). The benefits are twofold: reduced mmap_sem
    hold times and cleaner code.

    [akpm@linux-foundation.org: export get_mm_exe_file for arch/x86/oprofile/oprofile.ko]
    Signed-off-by: Davidlohr Bueso
    Cc: Robert Richter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Oleg cleverly suggested using xchg() to set the new mm->exe_file instead
    of calling set_mm_exe_file() which requires some form of serialization --
    mmap_sem in this case. For archs that do not have atomic rmw instructions
    we still fallback to a spinlock alternative, so this should always be
    safe. As such, we only need the mmap_sem for looking up the backing
    vm_file, which can be done sharing the lock. Naturally, this means we
    need to manually deal with both the new and old file reference counting,
    and we need not worry about the MMF_EXE_FILE_CHANGED bits, which can
    probably be deleted in the future anyway.
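
    A hedged sketch of the xchg()-based swap suggested here (close to, but
    not necessarily identical to, the final code): publish the new exe_file
    and drop the reference to whatever was there before, with no mmap_sem
    needed for serialization:

      #include <linux/atomic.h>
      #include <linux/fs.h>
      #include <linux/mm_types.h>

      static void example_set_mm_exe_file(struct mm_struct *mm,
                                          struct file *new_exe_file)
      {
              struct file *old_exe_file;

              if (new_exe_file)
                      get_file(new_exe_file);
              old_exe_file = xchg(&mm->exe_file, new_exe_file);
              if (old_exe_file)
                      fput(old_exe_file);
      }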

    Signed-off-by: Davidlohr Bueso
    Suggested-by: Oleg Nesterov
    Acked-by: Oleg Nesterov
    Reviewed-by: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • This patch removes mm->mmap_sem from the mm->exe_file read side.
    Also it kills dup_mm_exe_file() and moves exe_file duplication into
    dup_mmap() where both mmap_sems are locked.

    [akpm@linux-foundation.org: fix comment typo]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Davidlohr Bueso
    Cc: Al Viro
    Cc: Oleg Nesterov
    Cc: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov