06 Jan, 2017

1 commit

  • commit bfedb589252c01fa505ac9f6f2a3d5d68d707ef4 upstream.

    During exec, dumpable is cleared if the file that is being executed is
    not readable by the user executing the file. A bug in
    ptrace_may_access allows reading the file if the executable happens to
    enter into a subordinate user namespace (via clone(CLONE_NEWUSER),
    unshare(CLONE_NEWUSER), or setns(fd, CLONE_NEWUSER)).

    This problem is fixed with only the necessary userspace breakage by
    adding a user namespace owner to mm_struct, captured at the time of
    exec, so it is clear in which user namespace CAP_SYS_PTRACE must be
    present to be able to safely give read permission to the executable.

    The function ptrace_may_access is modified to verify that the ptracer
    has CAP_SYS_PTRACE in task->mm->user_ns instead of task->cred->user_ns.
    This ensures that if the task changes its cred into a subordinate
    user namespace it does not become ptraceable.

    The function ptrace_attach is modified to only set PT_PTRACE_CAP when
    CAP_SYS_PTRACE is held over task->mm->user_ns. The intent of
    PT_PTRACE_CAP is to be a flag noting that, whatever permission changes
    the task might go through, the tracer has sufficient permissions for
    it not to be an issue. task->cred->user_ns is always the same
    as, or a descendant of, mm->user_ns, which guarantees that having
    CAP_SYS_PTRACE over mm->user_ns covers the worst case for the task's
    credentials.

    To prevent regressions, mm->dumpable and mm->user_ns are not considered
    when a task has no mm, as simply failing ptrace_may_access would cause
    regressions in privileged applications attempting to read things
    such as /proc/<pid>/stat.
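
    The heart of the new check, sketched against the 4.9-era helpers
    (ptrace_has_cap() tests CAP_SYS_PTRACE in the given namespace; treat
    this as an illustration of __ptrace_may_access, not the full diff):

    mm = task->mm;
    if (mm &&
        ((get_dumpable(mm) != SUID_DUMP_USER) &&
         !ptrace_has_cap(mm->user_ns, mode)))   /* was: the task cred's user_ns */
        return -EPERM;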

    Acked-by: Kees Cook
    Tested-by: Cyrill Gorcunov
    Fixes: 8409cca70561 ("userns: allow ptrace from non-init user namespaces")
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     

01 Nov, 2016

1 commit

  • If something goes wrong with task stack refcounting and a stack
    refcount hits zero too early, warn and leak it rather than
    potentially freeing it early (and silently).
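
    A sketch of the guard, assuming the release_task_stack() and
    account_kernel_stack() helper names used in kernel/fork.c at the time:

    static void release_task_stack(struct task_struct *tsk)
    {
        if (WARN_ON(tsk->state != TASK_DEAD))
            return;  /* Better to leak the stack than to free it prematurely */

        account_kernel_stack(tsk, -1);
        free_thread_stack(tsk);
        tsk->stack = NULL;
    }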

    Signed-off-by: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/f29119c783a9680a4b4656e751b6123917ace94b.1477926663.git.luto@kernel.org
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     

16 Oct, 2016

1 commit

  • Pull gcc plugins update from Kees Cook:
    "This adds a new gcc plugin named "latent_entropy". It is designed to
    extract as much possible uncertainty from a running system at boot
    time as possible, hoping to capitalize on any possible variation in
    CPU operation (due to runtime data differences, hardware differences,
    SMP ordering, thermal timing variation, cache behavior, etc).

    At the very least, this plugin is a much more comprehensive example
    for how to manipulate kernel code using the gcc plugin internals"

    * tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    latent_entropy: Mark functions with __latent_entropy
    gcc-plugins: Add latent_entropy plugin

    Linus Torvalds
     

11 Oct, 2016

2 commits

  • The __latent_entropy gcc attribute can be used only on functions and
    variables. If it is on a function then the plugin will instrument it for
    gathering control-flow entropy. If the attribute is on a variable then
    the plugin will initialize it with random contents. The variable must
    be an integer, an integer array type or a structure with integer fields.

    These specific functions have been selected because they are init
    functions (to help gather boot-time entropy), are called at
    unpredictable times, or have variable loops, each of which provides
    some level of latent entropy.
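
    Two hypothetical uses of the attribute (the identifiers here are made
    up for illustration):

    static u64 boot_seed[4] __latent_entropy;   /* filled with build-time random data */

    static int __init __latent_entropy init_example(void)
    {
        return 0;                               /* control flow here gets instrumented */
    }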

    Signed-off-by: Emese Revfy
    [kees: expanded commit message]
    Signed-off-by: Kees Cook

    Emese Revfy
     
  • This adds a new gcc plugin named "latent_entropy". It is designed to
    extract as much uncertainty from a running system at boot time as
    possible, hoping to capitalize on any possible variation in CPU operation
    (due to runtime data differences, hardware differences, SMP ordering,
    thermal timing variation, cache behavior, etc).

    At the very least, this plugin is a much more comprehensive example for
    how to manipulate kernel code using the gcc plugin internals.

    The need for very-early boot entropy tends to be very architecture or
    system design specific, so this plugin is more suited for those sorts
    of special cases. The existing kernel RNG already attempts to extract
    entropy from reliable runtime variation, but this plugin takes the idea to
    a logical extreme by permuting a global variable based on any variation
    in code execution (e.g. a different value (and permutation function)
    is used to permute the global based on loop count, case statement,
    if/then/else branching, etc).

    To do this, the plugin starts by inserting a local variable in every
    marked function. The plugin then adds logic so that the value of this
    variable is modified by randomly chosen operations (add, xor and rol) and
    random values (gcc generates separate static values for each location at
    compile time and also injects the stack pointer at runtime). The resulting
    value depends on the control flow path (e.g., loops and branches taken).

    Before the function returns, the plugin mixes this local variable into
    the latent_entropy global variable. The value of this global variable
    is added to the kernel entropy pool in do_one_initcall() and _do_fork(),
    though it does not credit any bytes of entropy to the pool; the contents
    of the global are just used to mix the pool.
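
    Roughly, an instrumented function behaves as if it had been written
    like the following (an illustration only, not the plugin's literal
    output; the constants stand in for the per-location build-time random
    values):

    void __latent_entropy example_init(void)
    {
        unsigned long local_entropy;
        int i;

        local_entropy = latent_entropy ^ 0x9e3779b97f4a7c15UL;  /* build-time constant */
        local_entropy += (unsigned long)&local_entropy;         /* stack pointer at runtime */

        for (i = 0; i < 10; i++)                /* loop iterations permute further */
            local_entropy = rol64(local_entropy, 13) + 0x2545f4914f6cdd1dUL;

        latent_entropy ^= local_entropy;        /* mixed back before return */
    }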

    Additionally, the plugin can pre-initialize arrays with build-time
    random contents, so that two different kernel builds running on identical
    hardware will not have the same starting values.

    Signed-off-by: Emese Revfy
    [kees: expanded commit message and code comments]
    Signed-off-by: Kees Cook

    Emese Revfy
     

08 Oct, 2016

4 commits

  • The global zero page is used to satisfy an anonymous read fault. If
    THP (Transparent HugePage) is enabled then the global huge zero page is
    used. The global huge zero page uses an atomic counter for reference
    counting and is allocated/freed dynamically according to its counter
    value.

    CPU time spent on that counter will greatly increase if there are a lot
    of processes doing anonymous read faults. This patch proposes a way to
    reduce the access to the global counter so that the CPU load can be
    reduced accordingly.

    To do this, a new flag of the mm_struct is introduced:
    MMF_USED_HUGE_ZERO_PAGE. With this flag, a process only needs to touch
    the global counter in two cases:

    1. The first time it uses the global huge zero page;
    2. When mm_users of its mm_struct reaches zero.

    Note that right now, the huge zero page is eligible to be freed as soon
    as its last use goes away. With this patch, the page will not be
    eligible to be freed until the exit of the last process from which it
    was ever used.

    And with the use of mm_users, kthreads are not eligible to use the huge
    zero page either. Since no kthread is using the huge zero page today,
    there is no difference after applying this patch. But if that is not
    desired, I can change it to when mm_count reaches zero.
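
    A sketch of the fast path this enables (using the flag name from this
    message; the atomic counter behind get_huge_zero_page() is touched at
    most once per mm):

    struct page *mm_get_huge_zero_page(struct mm_struct *mm)
    {
        if (test_bit(MMF_USED_HUGE_ZERO_PAGE, &mm->flags))
            return READ_ONCE(huge_zero_page);   /* no global counter access */

        if (!get_huge_zero_page())
            return NULL;

        /* raced with another thread of this mm: drop the extra reference */
        if (test_and_set_bit(MMF_USED_HUGE_ZERO_PAGE, &mm->flags))
            put_huge_zero_page();

        return READ_ONCE(huge_zero_page);
    }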

    Case used for test on Haswell EP:

    usemem -n 72 --readonly -j 0x200000 100G

    This spawns 72 processes, each of which will mmap 100G of anonymous
    space and then do read-only access to that space sequentially with a
    step of 2MB.

    CPU cycles from perf report for base commit:
    54.03% usemem [kernel.kallsyms] [k] get_huge_zero_page
    CPU cycles from perf report for this commit:
    0.11% usemem [kernel.kallsyms] [k] mm_get_huge_zero_page

    Performance(throughput) of the workload for base commit: 1784430792
    Performance(throughput) of the workload for this commit: 4726928591
    164% increase.

    Runtime of the workload for base commit: 707592 us
    Runtime of the workload for this commit: 303970 us
    57% drop.

    Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.com
    Signed-off-by: Aaron Lu
    Cc: Sergey Senozhatsky
    Cc: "Kirill A. Shutemov"
    Cc: Dave Hansen
    Cc: Tim Chen
    Cc: Huang Ying
    Cc: Vlastimil Babka
    Cc: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Ebru Akagunduz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Lu
     
  • After "oom: keep mm of the killed task available" we can safely detect
    an oom victim by checking task->signal->oom_mm so we do not need the
    signal_struct counter anymore so let's get rid of it.

    This alone wouldn't be sufficient for nommu archs because
    exit_oom_victim doesn't hide the process from the oom killer anymore.
    We can, however, mark the mm with a MMF flag in __mmput. We can reuse
    MMF_OOM_REAPED and rename it to a more generic MMF_OOM_SKIP.
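
    A sketch of the __mmput() end of this, with the renamed flag:

    /* at the end of __mmput(), once the address space is torn down */
    set_bit(MMF_OOM_SKIP, &mm->flags);   /* hide this mm from the oom killer */
    mmdrop(mm);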

    Link: http://lkml.kernel.org/r/1472119394-11342-6-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Lockdep complains that __mmdrop is not safe from the softirq context:

    =================================
    [ INFO: inconsistent lock state ]
    4.6.0-oomfortification2-00011-geeb3eadeab96-dirty #949 Tainted: G W
    ---------------------------------
    inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
    swapper/1/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
    (pgd_lock){+.?...}, at: pgd_free+0x19/0x6b
    {SOFTIRQ-ON-W} state was registered at:
    __lock_acquire+0xa06/0x196e
    lock_acquire+0x139/0x1e1
    _raw_spin_lock+0x32/0x41
    __change_page_attr_set_clr+0x2a5/0xacd
    change_page_attr_set_clr+0x16f/0x32c
    set_memory_nx+0x37/0x3a
    free_init_pages+0x9e/0xc7
    alternative_instructions+0xa2/0xb3
    check_bugs+0xe/0x2d
    start_kernel+0x3ce/0x3ea
    x86_64_start_reservations+0x2a/0x2c
    x86_64_start_kernel+0x17a/0x18d
    irq event stamp: 105916
    hardirqs last enabled at (105916): free_hot_cold_page+0x37e/0x390
    hardirqs last disabled at (105915): free_hot_cold_page+0x2c1/0x390
    softirqs last enabled at (105878): _local_bh_enable+0x42/0x44
    softirqs last disabled at (105879): irq_exit+0x6f/0xd1

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(pgd_lock);

    lock(pgd_lock);

    *** DEADLOCK ***

    1 lock held by swapper/1/0:
    #0: (rcu_callback){......}, at: rcu_process_callbacks+0x390/0x800

    stack backtrace:
    CPU: 1 PID: 0 Comm: swapper/1 Tainted: G W 4.6.0-oomfortification2-00011-geeb3eadeab96-dirty #949
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
    Call Trace:

    print_usage_bug.part.25+0x259/0x268
    mark_lock+0x381/0x567
    __lock_acquire+0x993/0x196e
    lock_acquire+0x139/0x1e1
    _raw_spin_lock+0x32/0x41
    pgd_free+0x19/0x6b
    __mmdrop+0x25/0xb9
    __put_task_struct+0x103/0x11e
    delayed_put_task_struct+0x157/0x15e
    rcu_process_callbacks+0x660/0x800
    __do_softirq+0x1ec/0x4d5
    irq_exit+0x6f/0xd1
    smp_apic_timer_interrupt+0x42/0x4d
    apic_timer_interrupt+0x8e/0xa0

    arch_cpu_idle+0xf/0x11
    default_idle_call+0x32/0x34
    cpu_startup_entry+0x20c/0x399
    start_secondary+0xfe/0x101

    Moreover, commit a79e53d85683 ("x86/mm: Fix pgd_lock deadlock") was
    explicit that pgd_lock must not be taken from irq context. This
    means that __mmdrop called from free_signal_struct has to be postponed
    to a user context. We already have a similar mechanism for mmput_async
    so we can use it here as well. This is safe because mm_count is pinned
    by mm_users.

    This fixes a bug introduced by "oom: keep mm of the killed task
    available".
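
    The deferral pair this adds looks roughly like the following (modelled
    on the existing mmput_async machinery; async_put_work is the
    work_struct already embedded in mm_struct):

    static void mmdrop_async_fn(struct work_struct *work)
    {
        struct mm_struct *mm;

        mm = container_of(work, struct mm_struct, async_put_work);
        __mmdrop(mm);
    }

    static void mmdrop_async(struct mm_struct *mm)
    {
        if (unlikely(atomic_dec_and_test(&mm->mm_count))) {
            INIT_WORK(&mm->async_put_work, mmdrop_async_fn);
            schedule_work(&mm->async_put_work);
        }
    }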

    Link: http://lkml.kernel.org/r/1472119394-11342-5-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • oom_reap_task has to call exit_oom_victim in order to make sure that
    the oom victim will not block the oom killer forever. This is,
    however, opening new problems (e.g. oom_killer_disable exclusion; see
    commit 74070542099c ("oom, suspend: fix oom_reaper vs.
    oom_killer_disable race")). Ideally, exit_oom_victim should only be
    called from the victim's context.

    One way to achieve this would be to rely on per mm_struct flags. We
    already have MMF_OOM_REAPED to hide a task from the oom killer since
    "mm, oom: hide mm which is shared with kthread or global init". The
    problem is that the exit path:

    do_exit
      exit_mm
        tsk->mm = NULL;
        mmput
          __mmput
            exit_oom_victim

    doesn't guarantee that exit_oom_victim will get called in a bounded
    amount of time. At least exit_aio depends on IO which might get blocked
    due to lack of memory and who knows what else is lurking there.

    This patch takes a different approach. We remember tsk->mm into the
    signal_struct and bind it to the signal struct life time for all oom
    victims. __oom_reap_task_mm as well as oom_scan_process_thread do not
    have to rely on find_lock_task_mm anymore and they will have a reliable
    reference to the mm struct. As a result all the oom specific
    communication inside the OOM killer can be done via tsk->signal->oom_mm.
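
    A sketch of how the mm gets pinned (in mark_oom_victim(); the cmpxchg
    makes the grab one-shot per signal_struct):

    /* oom_mm is bound to the signal struct life time */
    if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm))
        atomic_inc(&mm->mm_count);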

    Increasing the signal_struct for something as unlikely as the oom
    killer is far from ideal, but this approach will make the code much
    more reasonable, and long term we might even want to move task->mm
    into the signal_struct anyway. In the next step we might want to make
    the oom killer exclusion and access to memory reserves completely
    independent, which would also be nice.

    Link: http://lkml.kernel.org/r/1472119394-11342-4-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

07 Oct, 2016

1 commit

  • Pull namespace updates from Eric Biederman:
    "This set of changes is a number of smaller things that have been
    overlooked in other development cycles focused on more fundamental
    change. The devpts changes are small things that were a distraction
    until we managed to kill off DEVPTS_MULTIPLE_INSTANCES. There is a
    trivial regression fix to autofs for the unprivileged mount changes
    that went in last cycle. A pair of ioctls has been added by Andrey
    Vagin making it possible to discover the relationships between
    namespaces when referring to them through file descriptors.

    The big user visible change is starting to add simple resource limits
    to catch programs that misbehave. With namespaces in general and user
    namespaces in particular allowing users to use more kinds of
    resources, it has become important to have something to limit errant
    programs. Because the purpose of these limits is to catch errant
    programs, the code needs to be inexpensive to use as it is always on,
    and
    the default limits need to be high enough that well behaved programs
    on well behaved systems don't encounter them.

    To this end, after some review I have implemented per user per user
    namespace limits, and use them to limit the number of namespaces. The
    limits being per user mean that one user can not exhaust the limits of
    another user. The limits being per user namespace allow contexts where
    the limit is 0 and security conscious folks can remove from their
    threat analysis the code used to manage namespaces (as they have
    historically done when it was root only). At the same time the limits
    being per user namespace allow other parts of the system to use
    namespaces.

    Namespaces are increasingly being used in application sandboxing
    scenarios, so an all-or-nothing disable for the entire system would
    make increasing use of these sandboxes impossible for the security
    conscious folks.

    There is also a new limit on the maximum number of mounts present in
    a single mount namespace. It is nontrivial to guess what a reasonable
    system wide limit on the number of mount structures in the kernel
    would be, especially as it varies based on how a system is using
    containers. A limit on the number of mounts in a mount namespace,
    however, is much easier to understand and set. In most cases in
    practice only about 1000 mounts are used. Given that some autofs
    scenarios have the potential to use 30,000 to 50,000 mounts, I have
    set the default limit for the number of mounts at 100,000, which is
    well above every known set of users but low enough that the mount
    hash tables don't degrade unreasonably.

    These limits are a start. I expect this establishes a pattern that
    other limits for resources that namespaces use will follow. There has
    been interest in making inotify event limits per user per user
    namespace as well as interest expressed in making details about what
    is going on in the kernel more visible"
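
    A sketch of the pattern the limits follow, as applied to user
    namespaces (helper names from the ucount code this series adds):

    static inline struct ucounts *inc_user_namespaces(struct user_namespace *ns,
                                                      kuid_t uid)
    {
        return inc_ucount(ns, uid, UCOUNT_USER_NAMESPACES);
    }

    /* in create_user_ns() (sketch): */
    ucounts = inc_user_namespaces(ns, owner);
    if (!ucounts)
        return -ENOSPC;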

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (28 commits)
    autofs: Fix automounts by using current_real_cred()->uid
    mnt: Add a per mount namespace limit on the number of mounts
    netns: move {inc,dec}_net_namespaces into #ifdef
    nsfs: Simplify __ns_get_path
    tools/testing: add a test to check nsfs ioctl-s
    nsfs: add ioctl to get a parent namespace
    nsfs: add ioctl to get an owning user namespace for ns file descriptor
    kernel: add a helper to get an owning user namespace for a namespace
    devpts: Change the owner of /dev/pts/ptmx to the mounter of /dev/pts
    devpts: Remove sync_filesystems
    devpts: Make devpts_kill_sb safe if fsi is NULL
    devpts: Simplify devpts_mount by using mount_nodev
    devpts: Move the creation of /dev/pts/ptmx into fill_super
    devpts: Move parse_mount_options into fill_super
    userns: When the per user per user namespace limit is reached return ENOSPC
    userns; Document per user per user namespace limits.
    mntns: Add a limit on the number of mount namespaces.
    netns: Add a limit on the number of net namespaces
    cgroupns: Add a limit on the number of cgroup namespaces
    ipcns: Add a limit on the number of ipc namespaces
    ...

    Linus Torvalds
     

16 Sep, 2016

2 commits

  • vmalloc() is a bit slow, and pounding vmalloc()/vfree() will eventually
    force a global TLB flush.

    To reduce pressure on them, if CONFIG_VMAP_STACK=y, cache two thread
    stacks per CPU. This will let us quickly allocate a hopefully
    cache-hot, TLB-hot stack under heavy forking workloads (shell script style).

    On my silly pthread_create() benchmark, it saves about 2 µs per
    pthread_create()+join() with CONFIG_VMAP_STACK=y.
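
    A sketch of the cache-hit path (ignoring the interrupt-safety details
    of the real per-CPU accessors):

    #define NR_CACHED_STACKS 2
    static DEFINE_PER_CPU(struct vm_struct *, cached_stacks[NR_CACHED_STACKS]);

    /* in alloc_thread_stack_node(), before falling back to vmalloc: */
    int i;

    for (i = 0; i < NR_CACHED_STACKS; i++) {
        struct vm_struct *s = this_cpu_read(cached_stacks[i]);

        if (!s)
            continue;
        this_cpu_write(cached_stacks[i], NULL);

        tsk->stack_vm_area = s;
        return s->addr;              /* hopefully cache-hot and TLB-hot */
    }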

    Signed-off-by: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Jann Horn
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/94811d8e3994b2e962f88866290017d498eb069c.1474003868.git.luto@kernel.org
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     
  • We currently keep every task's stack around until the task_struct
    itself is freed. This means that we keep the stack allocation alive
    for longer than necessary and that, under load, we free stacks in
    big batches whenever RCU drops the last task reference. Neither of
    these is good for reuse of cache-hot memory, and freeing in batches
    prevents us from usefully caching small numbers of vmalloced stacks.

    On architectures that have thread_info on the stack, we can't easily
    change this, but on architectures that set THREAD_INFO_IN_TASK, we
    can free it as soon as the task is dead.
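
    A sketch of the refcount plumbing (stack_refcount exists only when
    THREAD_INFO_IN_TASK is set; the final reference is typically dropped
    once the task is dead, well before the task_struct itself goes away):

    void put_task_stack(struct task_struct *tsk)
    {
        if (atomic_dec_and_test(&tsk->stack_refcount))
            release_task_stack(tsk);
    }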

    Signed-off-by: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Jann Horn
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/08ca06cde00ebed0046c5d26cbbf3fbb7ef5b812.1474003868.git.luto@kernel.org
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     

02 Sep, 2016

3 commits

  • Merge fixes from Andrew Morton:
    "14 fixes"

    * emailed patches from Andrew Morton :
    rapidio/tsi721: fix incorrect detection of address translation condition
    rapidio/documentation/mport_cdev: add missing parameter description
    kernel/fork: fix CLONE_CHILD_CLEARTID regression in nscd
    MAINTAINERS: Vladimir has moved
    mm, mempolicy: task->mempolicy must be NULL before dropping final reference
    printk/nmi: avoid direct printk()-s from __printk_nmi_flush()
    treewide: remove references to the now unnecessary DEFINE_PCI_DEVICE_TABLE
    drivers/scsi/wd719x.c: remove last declaration using DEFINE_PCI_DEVICE_TABLE
    mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator
    lib/test_hash.c: fix warning in preprocessor symbol evaluation
    lib/test_hash.c: fix warning in two-dimensional array init
    kconfig: tinyconfig: provide whole choice blocks to avoid warnings
    kexec: fix double-free when failing to relocate the purgatory
    mm, oom: prevent premature OOM killer invocation for high order request

    Linus Torvalds
     
  • Commit fec1d0115240 ("[PATCH] Disable CLONE_CHILD_CLEARTID for abnormal
    exit") has caused a subtle regression in nscd which uses
    CLONE_CHILD_CLEARTID to clear the nscd_certainly_running flag in the
    shared databases, so that the clients are notified when nscd is
    restarted. Now, when nscd uses a non-persistent database, clients that
    have it mapped keep thinking the database is being updated by nscd, when
    in fact nscd has created a new (anonymous) one (for non-persistent
    databases it uses an unlinked file as backend).

    The original proposal for the CLONE_CHILD_CLEARTID change claimed
    (https://lkml.org/lkml/2006/10/25/233):

    : The NPTL library uses the CLONE_CHILD_CLEARTID flag on clone() syscalls
    : on behalf of pthread_create() library calls. This feature is used to
    : request that the kernel clear the thread-id in user space (at an address
    : provided in the syscall) when the thread disassociates itself from the
    : address space, which is done in mm_release().
    :
    : Unfortunately, when a multi-threaded process incurs a core dump (such as
    : from a SIGSEGV), the core-dumping thread sends SIGKILL signals to all of
    : the other threads, which then proceed to clear their user-space tids
    : before synchronizing in exit_mm() with the start of core dumping. This
    : misrepresents the state of process's address space at the time of the
    : SIGSEGV and makes it more difficult for someone to debug NPTL and glibc
    : problems (misleading him/her to conclude that the threads had gone away
    : before the fault).
    :
    : The fix below is to simply avoid the CLONE_CHILD_CLEARTID action if a
    : core dump has been initiated.

    The resulting patch from Roland (https://lkml.org/lkml/2006/10/26/269)
    seems to have a larger scope than the original patch asked for. It
    seems that limiting the scope of the check to core dumping should work
    for the SIGSEGV issue described above.
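
    A sketch of the resulting check in mm_release() (the core-dump test
    replaces the earlier blanket PF_SIGNALED test):

    if (tsk->clear_child_tid) {
        if (!(tsk->signal->flags & SIGNAL_GROUP_COREDUMP) &&
            atomic_read(&mm->mm_users) > 1) {
            /* notify userspace unless we died for a core dump */
            put_user(0, tsk->clear_child_tid);
            sys_futex(tsk->clear_child_tid, FUTEX_WAKE, 1, NULL, NULL, 0);
        }
        tsk->clear_child_tid = NULL;
    }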

    [Changelog partly based on Andreas' description]
    Fixes: fec1d0115240 ("[PATCH] Disable CLONE_CHILD_CLEARTID for abnormal exit")
    Link: http://lkml.kernel.org/r/1471968749-26173-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Tested-by: William Preston
    Acked-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Andreas Schwab
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Pull audit fixes from Paul Moore:
    "Two small patches to fix some bugs with the audit-by-executable
    functionality we introduced back in v4.3 (both patches are marked
    for the stable folks)"

    * 'stable-4.8' of git://git.infradead.org/users/pcmoore/audit:
    audit: fix exe_file access in audit_exe_compare
    mm: introduce get_task_exe_file

    Linus Torvalds
     

01 Sep, 2016

1 commit

  • For more convenient access if one has a pointer to the task.

    As a minor nit, take advantage of the fact that only the task lock +
    RCU are needed to safely grab ->exe_file. This saves the mm refcount
    dance.

    Use the helper in proc_exe_link.
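
    A sketch of the helper, assuming the 4.8-era mm API (get_mm_exe_file()
    takes its own reference on the file):

    struct file *get_task_exe_file(struct task_struct *task)
    {
        struct file *exe_file = NULL;
        struct mm_struct *mm;

        task_lock(task);
        mm = task->mm;
        if (mm && !(task->flags & PF_KTHREAD))
            exe_file = get_mm_exe_file(mm);
        task_unlock(task);
        return exe_file;
    }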

    Signed-off-by: Mateusz Guzik
    Acked-by: Konstantin Khlebnikov
    Acked-by: Richard Guy Briggs
    Cc: # 4.3.x
    Signed-off-by: Paul Moore

    Mateusz Guzik
     

24 Aug, 2016

1 commit

  • If CONFIG_VMAP_STACK=y is selected, kernel stacks are allocated with
    __vmalloc_node_range().

    Grsecurity has had a similar feature (called GRKERNSEC_KSTACKOVERFLOW=y)
    for a long time.
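
    A sketch of the allocation call this selects in
    alloc_thread_stack_node() (gfp details abbreviated):

    stack = __vmalloc_node_range(THREAD_SIZE, THREAD_SIZE,
                                 VMALLOC_START, VMALLOC_END,
                                 THREADINFO_GFP | __GFP_HIGHMEM,
                                 PAGE_KERNEL,
                                 0, node, __builtin_return_address(0));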

    Signed-off-by: Andy Lutomirski
    Acked-by: Michal Hocko
    Cc: Alexander Potapenko
    Cc: Andrey Ryabinin
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: Dmitry Vyukov
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/14c07d4fd173a5b117f51e8b939f9f4323e39899.1470907718.git.luto@kernel.org
    [ Minor edits. ]
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     

17 Aug, 2016

1 commit

  • cgroup_threadgroup_rwsem is acquired in read mode during process exit
    and fork. It is also grabbed in write mode during
    __cgroup_procs_write(). I've recently run into a scenario with lots of
    memory pressure and OOM and I am beginning to see

    systemd

    __switch_to+0x1f8/0x350
    __schedule+0x30c/0x990
    schedule+0x48/0xc0
    percpu_down_write+0x114/0x170
    __cgroup_procs_write.isra.12+0xb8/0x3c0
    cgroup_file_write+0x74/0x1a0
    kernfs_fop_write+0x188/0x200
    __vfs_write+0x6c/0xe0
    vfs_write+0xc0/0x230
    SyS_write+0x6c/0x110
    system_call+0x38/0xb4

    This thread is waiting on the reader of cgroup_threadgroup_rwsem to
    exit. The reader itself is under memory pressure and has gone into
    reclaim after fork. There are times the reader also ends up waiting on
    oom_lock as well.

    __switch_to+0x1f8/0x350
    __schedule+0x30c/0x990
    schedule+0x48/0xc0
    jbd2_log_wait_commit+0xd4/0x180
    ext4_evict_inode+0x88/0x5c0
    evict+0xf8/0x2a0
    dispose_list+0x50/0x80
    prune_icache_sb+0x6c/0x90
    super_cache_scan+0x190/0x210
    shrink_slab.part.15+0x22c/0x4c0
    shrink_zone+0x288/0x3c0
    do_try_to_free_pages+0x1dc/0x590
    try_to_free_pages+0xdc/0x260
    __alloc_pages_nodemask+0x72c/0xc90
    alloc_pages_current+0xb4/0x1a0
    page_table_alloc+0xc0/0x170
    __pte_alloc+0x58/0x1f0
    copy_page_range+0x4ec/0x950
    copy_process.isra.5+0x15a0/0x1870
    _do_fork+0xa8/0x4b0
    ppc_clone+0x8/0xc

    In the meantime, all processes exiting/forking are blocked, almost
    stalling the system.

    This patch moves threadgroup_change_begin from before cgroup_fork() to
    just before cgroup_can_fork(). There is no need to worry about
    threadgroup changes until the task is actually added to the
    threadgroup. This avoids having to call reclaim with
    cgroup_threadgroup_rwsem held.
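
    The shape of the move in copy_process(), sketched:

    /*
     * cgroup_fork() and the heavyweight copies (copy_mm() etc., which may
     * enter reclaim) now run without the rwsem; it is taken only here,
     * just before the new task is committed to the threadgroup.
     */
    threadgroup_change_begin(current);
    retval = cgroup_can_fork(p);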

    tj: Subject and description edits.

    Signed-off-by: Balbir Singh
    Acked-by: Zefan Li
    Cc: Oleg Nesterov
    Cc: Andrew Morton
    Cc: stable@vger.kernel.org # v4.2+
    Signed-off-by: Tejun Heo

    Balbir Singh
     

29 Jul, 2016

2 commits

  • We should account for stacks regardless of stack size, and we need to
    account in sub-page units if THREAD_SIZE < PAGE_SIZE. Change the units
    to kilobytes and move the accounting into account_kernel_stack().

    Fixes: 12580e4b54ba8 ("mm: memcontrol: report kernel stack usage in cgroup2 memory.stat")
    Link: http://lkml.kernel.org/r/9b5314e3ee5eda61b0317ec1563768602c1ef438.1468523549.git.luto@kernel.org
    Signed-off-by: Andy Lutomirski
    Cc: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Reviewed-by: Josh Poimboeuf
    Reviewed-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • Currently, NR_KERNEL_STACK tracks the number of kernel stacks in a zone.
    This only makes sense if each kernel stack exists entirely in one zone,
    and allowing vmapped stacks could break this assumption.

    Since frv has THREAD_SIZE < PAGE_SIZE, we need to track kernel stack
    allocations in a unit that divides both THREAD_SIZE and PAGE_SIZE on all
    architectures. Keep it simple and use KiB.
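
    A sketch of the accounting helper these two entries converge on
    (NR_KERNEL_STACK_KB being the renamed, KiB-based counter):

    static void account_kernel_stack(unsigned long *stack, int account)
    {
        struct zone *zone = page_zone(virt_to_page(stack));

        mod_zone_page_state(zone, NR_KERNEL_STACK_KB,
                            THREAD_SIZE / 1024 * account);
    }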

    Link: http://lkml.kernel.org/r/083c71e642c5fa5f1b6898902e1b2db7b48940d4.1468523549.git.luto@kernel.org
    Signed-off-by: Andy Lutomirski
    Cc: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Reviewed-by: Josh Poimboeuf
    Reviewed-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     

27 Jul, 2016

1 commit

  • Currently, to charge a non-slab allocation to kmemcg one has to use
    alloc_kmem_pages helper with __GFP_ACCOUNT flag. A page allocated with
    this helper should finally be freed using free_kmem_pages, otherwise it
    won't be uncharged.

    This API suits its current users fine, but it turns out to be impossible
    to use along with page reference counting, i.e. when an allocation is
    supposed to be freed with put_page, as it is the case with pipe or unix
    socket buffers.

    To overcome this limitation, this patch moves charging/uncharging to
    generic page allocator paths, i.e. to __alloc_pages_nodemask and
    free_pages_prepare, and zaps alloc/free_kmem_pages helpers. This way,
    one can use any of the available page allocation functions to get the
    allocated page charged to kmemcg - it's enough to pass __GFP_ACCOUNT,
    just like in case of kmalloc and friends. A charged page will be
    automatically uncharged on free.

    To make it possible, we need to mark pages charged to kmemcg somehow.
    To avoid introducing a new page flag, we make use of page->_mapcount for
    marking such pages. Since pages charged to kmemcg are not supposed to
    be mapped to userspace, it should work just fine. There are other
    (ab)users of page->_mapcount - buddy and balloon pages - but we don't
    conflict with them.

    In case kmemcg is compiled out or not used at runtime, this patch
    introduces no overhead to generic page allocator paths. If kmemcg is
    used, it will be plus one gfp flags check on alloc and plus one
    page->_mapcount check on free, which shouldn't hurt performance, because
    the data accessed are hot.
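
    A minimal usage sketch (hypothetical call site; any page allocator
    entry point works once the charging lives in the generic paths):

    struct page *page;

    page = alloc_page(GFP_KERNEL_ACCOUNT);  /* GFP_KERNEL | __GFP_ACCOUNT */
    if (!page)
        return -ENOMEM;
    /* ... hand the page around by reference as usual ... */
    put_page(page);                         /* uncharged automatically on the final put */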

    Link: http://lkml.kernel.org/r/a9736d856f895bcb465d9f257b54efe32eda6f99.1464079538.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Eric Dumazet
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

25 Jun, 2016

2 commits

  • Commit b235beea9e99 ("Clarify naming of thread info/stack allocators")
    breaks the build on some powerpc configs, where THREAD_SIZE < PAGE_SIZE:

    kernel/fork.c:235:2: error: implicit declaration of function 'free_thread_stack'
    kernel/fork.c:355:8: error: assignment from incompatible pointer type
    stack = alloc_thread_stack_node(tsk, node);
    ^

    Fix it by renaming free_stack() to free_thread_stack(), and updating the
    return type of alloc_thread_stack_node().

    Fixes: b235beea9e99 ("Clarify naming of thread info/stack allocators")
    Signed-off-by: Michael Ellerman
    Signed-off-by: Linus Torvalds

    Michael Ellerman
     
  • We've had the thread info allocated together with the thread stack for
    most architectures for a long time (since the thread_info was split off
    from the task struct), but that is about to change.

    But the patches that move the thread info to be off-stack (and a part of
    the task struct instead) made it clear how confused the allocator and
    freeing functions are.

    Because the common case was that we share an allocation with the thread
    stack and the thread_info, the two pointers were identical. That
    identity then meant that we would have things like

    ti = alloc_thread_info_node(tsk, node);
    ...
    tsk->stack = ti;

    which certainly _worked_ (since stack and thread_info have the same
    value), but is rather confusing: why are we assigning a thread_info to
    the stack? And if we move the thread_info away, the "confusing" code
    just gets to be entirely bogus.

    So remove all this confusion, and make it clear that we are doing the
    stack allocation by renaming and clarifying the function names to be
    about the stack. The fact that the thread_info then shares the
    allocation is an implementation detail, and not really about the
    allocation itself.

    This is a pure renaming and type fix: we pass in the same pointer, it's
    just that we clarify what the pointer means.

    The ia64 code that actually only has one single allocation (for all of
    task_struct, thread_info and kernel thread stack) now looks a bit odd,
    but since "tsk->stack" is actually not even used there, that oddity
    doesn't matter. It would be a separate thing to clean that up, I
    intentionally left the ia64 changes as a pure brute-force renaming and
    type change.

    Acked-by: Andy Lutomirski
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

27 May, 2016

1 commit

  • mmput_async is currently used only from the oom_reaper which is defined
    only for CONFIG_MMU. We can save work_struct in mm_struct for
    !CONFIG_MMU.

    [akpm@linux-foundation.org: fix typo, per Minchan]
    Link: http://lkml.kernel.org/r/20160520061658.GB19172@dhcp22.suse.cz
    Reported-by: Minchan Kim
    Signed-off-by: Michal Hocko
    Acked-by: Minchan Kim
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

24 May, 2016

2 commits

  • dup_mmap needs to lock current's mm mmap_sem for write. If the waiting
    task gets killed by the oom killer it would block oom_reaper from
    asynchronous address space reclaim and reduce the chances of timely OOM
    resolving. Wait for the lock in the killable mode and return with EINTR
    if the task got killed while waiting.
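
    A sketch of the killable acquisition in dup_mmap() (the -EINTR
    propagates back out of copy_process to the fork caller):

    if (down_write_killable(&oldmm->mmap_sem)) {
        retval = -EINTR;
        goto fail_uprobe_end;
    }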

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Oleg Nesterov
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Linux preallocates the task structs of the idle tasks for all possible
    CPUs. This currently means they all end up on node 0. This also
    implies that the cache lines used for MWAIT, which sit around the
    flags field in the task struct, are all located on node 0.

    We see a noticeable performance improvement on Knights Landing CPUs when
    the cache lines used for MWAIT are located in the local nodes of the
    CPUs using them. I would expect this to give a (likely slight)
    improvement on other systems too.

    The patch implements placing the idle task on the node of its CPU, by
    passing the right target node to copy_process().
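
    The visible part of the plumbing, sketched (copy_process() grows a
    trailing node argument; other callers pass NUMA_NO_NODE):

    /* in fork_idle(cpu): allocate the idle task on its CPU's node */
    task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0, 0,
                        cpu_to_node(cpu));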

    [akpm@linux-foundation.org: use NUMA_NO_NODE, not a bare -1]
    Link: http://lkml.kernel.org/r/1463492694-15833-1-git-send-email-andi@firstfloor.org
    Signed-off-by: Andi Kleen
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

21 May, 2016

2 commits

  • When using this program (as root):

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <err.h>

    #include <sys/io.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    #define ITER 1000
    #define FORKERS 15
    #define THREADS (6000/FORKERS) // 1850 is proc max

    static void fork_100_wait()
    {
        unsigned a, to_wait = 0;

        printf("\t%d forking %d\n", THREADS, getpid());

        for (a = 0; a < THREADS; a++) {
            switch (fork()) {
            case 0:
                usleep(1000);
                exit(0);
                break;
            case -1:
                break;
            default:
                to_wait++;
                break;
            }
        }

        printf("\t%d forked from %d, waiting for %d\n", THREADS, getpid(),
               to_wait);

        for (a = 0; a < to_wait; a++)
            wait(NULL);

        printf("\t%d waited from %d\n", THREADS, getpid());
    }

    static void run_forkers()
    {
        pid_t forkers[FORKERS];
        unsigned a;

        for (a = 0; a < FORKERS; a++) {
            switch ((forkers[a] = fork())) {
            case 0:
                fork_100_wait();
                exit(0);
                break;
            case -1:
                err(1, "DIE fork of %d'th forker", a);
                break;
            default:
                break;
            }
        }

        for (a = 0; a < FORKERS; a++)
            waitpid(forkers[a], NULL, 0);
    }

    int main()
    {
        unsigned a;
        int ret;

        ret = ioperm(10, 20, 0);
        if (ret < 0)
            err(1, "ioperm");

        for (a = 0; a < ITER; a++)
            run_forkers();

        return 0;
    }

    kmemleak reports many occurrences of this leak:
    unreferenced object 0xffff8805917c8000 (size 8192):
    comm "fork-leak", pid 2932, jiffies 4295354292 (age 1871.028s)
    hex dump (first 32 bytes):
    ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
    ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
    backtrace:
    [] kmemdup+0x25/0x50
    [] copy_thread_tls+0x6c3/0x9a0
    [] copy_process+0x1a84/0x5790
    [] wake_up_new_task+0x2d5/0x6f0
    [] _do_fork+0x12d/0x820
    ...

    The leaks are of memory items which should have been freed in
    arch/x86/kernel/process.c:exit_thread().

    Make sure the memory is freed when fork fails later in copy_process.
    This is done by calling exit_thread with the thread to kill.

    Signed-off-by: Jiri Slaby
    Cc: "David S. Miller"
    Cc: "H. Peter Anvin"
    Cc: "James E.J. Bottomley"
    Cc: Aurelien Jacquiot
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Chen Liqin
    Cc: Chris Metcalf
    Cc: Chris Zankel
    Cc: David Howells
    Cc: Fenghua Yu
    Cc: Geert Uytterhoeven
    Cc: Guan Xuetao
    Cc: Haavard Skinnemoen
    Cc: Hans-Christian Egtvedt
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Ingo Molnar
    Cc: Ivan Kokshaysky
    Cc: James Hogan
    Cc: Jeff Dike
    Cc: Jesper Nilsson
    Cc: Jiri Slaby
    Cc: Jonas Bonn
    Cc: Koichi Yasutake
    Cc: Lennox Wu
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Mikael Starvik
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Rich Felker
    Cc: Richard Henderson
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Russell King
    Cc: Steven Miao
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     
  • Tetsuo has properly noted that the mmput slow path might get blocked
    waiting for another party (e.g. exit_aio waits for an IO). If that
    happens the oom_reaper would be put out of the way and will not be
    able to process the next oom victim. We should strive to make this
    context as reliable and as independent of other subsystems as possible.

    Introduce mmput_async, which will perform the slow path from an async
    (WQ) context. This will delay the operation, but that shouldn't be a
    problem, because in most cases the oom_reaper has already reclaimed
    the victim's address space as much as possible and the remaining
    context shouldn't bind too much memory anymore. The only exception is
    when the mmap_sem trylock has failed, which shouldn't happen too often.

    The issue is only theoretical but not impossible.

    Signed-off-by: Michal Hocko
    Reported-by: Tetsuo Handa
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

03 May, 2016

1 commit

  • This patch implements the SS_AUTODISARM flag that can be OR-ed with
    SS_ONSTACK when forming ss_flags.

    When this flag is set, sigaltstack will be disabled when entering
    the signal handler; more precisely, after saving sas to uc_stack.
    When leaving the signal handler, the sigaltstack is restored from
    uc_stack.

    When this flag is used, it is safe to switch away from the signal
    handler with swapcontext(). Without this flag, a subsequent signal
    would corrupt the state of the switched-away handler.

    To detect the support of this functionality, one can do:

    stack_t ss = { .ss_flags = SS_DISABLE | SS_AUTODISARM };

    err = sigaltstack(&ss, NULL);
    if (err && errno == EINVAL)
        unsupported();

    Signed-off-by: Stas Sergeev
    Cc: Al Viro
    Cc: Aleksa Sarai
    Cc: Amanieu d'Antras
    Cc: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: Eric W. Biederman
    Cc: Frederic Weisbecker
    Cc: H. Peter Anvin
    Cc: Heinrich Schuchardt
    Cc: Jason Low
    Cc: Josh Triplett
    Cc: Konstantin Khlebnikov
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Palmer Dabbelt
    Cc: Paul Moore
    Cc: Pavel Emelyanov
    Cc: Peter Zijlstra
    Cc: Richard Weinberger
    Cc: Sasha Levin
    Cc: Shuah Khan
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Vladimir Davydov
    Cc: linux-api@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Link: http://lkml.kernel.org/r/1460665206-13646-4-git-send-email-stsp@list.ru
    Signed-off-by: Ingo Molnar

    Stas Sergeev
     

23 Mar, 2016

1 commit

  • kcov provides code coverage collection for coverage-guided fuzzing
    (randomized testing). Coverage-guided fuzzing is a testing technique
    that uses coverage feedback to determine new interesting inputs to a
    system. A notable user-space example is AFL
    (http://lcamtuf.coredump.cx/afl/). However, this technique is not
    widely used for kernel testing due to missing compiler and kernel
    support.

    kcov does not aim to collect as much coverage as possible. It aims to
    collect more or less stable coverage that is a function of syscall
    inputs. To achieve this goal it does not collect coverage in soft/hard
    interrupts, and instrumentation of some inherently non-deterministic
    or non-interesting parts of the kernel is disabled (e.g. scheduler,
    locking).

    Currently there is a single coverage collection mode (tracing), but the
    API anticipates additional collection modes. Initially I also
    implemented a second mode which exposes coverage in a fixed-size hash
    table of counters (what Quentin used in his original patch). I've
    dropped the second mode for simplicity.

    This patch adds the necessary support on the kernel side. The
    complementary compiler support was added in gcc revision 231296.

    We've used this support to build syzkaller system call fuzzer, which has
    found 90 kernel bugs in just 2 months:

    https://github.com/google/syzkaller/wiki/Found-Bugs

    We've also found 30+ bugs in our internal systems with syzkaller.
    Another (yet unexplored) direction where kcov coverage would greatly
    help is more traditional "blob mutation". For example, mounting a
    random blob as a filesystem, or receiving a random blob over wire.

    Why not gcov? A typical fuzzing loop looks as follows: (1) reset
    coverage, (2) execute a bit of code, (3) collect coverage, repeat. A
    typical coverage can be just a dozen basic blocks (e.g. an invalid
    input). In such a context gcov becomes prohibitively expensive, as the
    reset/collect coverage steps depend on the total number of basic
    blocks/edges in the program (in the case of the kernel it is about
    2M). The cost of kcov depends only on the number of executed basic
    blocks/edges. On top of that, the kernel requires per-thread coverage
    because there are always background threads and unrelated processes
    that also produce coverage. With inlined gcov instrumentation
    per-thread coverage is not possible.

    kcov exposes kernel PCs and control flow to user-space which is
    insecure. But debugfs should not be mapped as user accessible.

    Based on a patch by Quentin Casasnovas.

    [akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
    [akpm@linux-foundation.org: unbreak allmodconfig]
    [akpm@linux-foundation.org: follow x86 Makefile layout standards]
    Signed-off-by: Dmitry Vyukov
    Reviewed-by: Kees Cook
    Cc: syzkaller
    Cc: Vegard Nossum
    Cc: Catalin Marinas
    Cc: Tavis Ormandy
    Cc: Will Deacon
    Cc: Quentin Casasnovas
    Cc: Kostya Serebryany
    Cc: Eric Dumazet
    Cc: Alexander Potapenko
    Cc: Kees Cook
    Cc: Bjorn Helgaas
    Cc: Sasha Levin
    Cc: David Drysdale
    Cc: Ard Biesheuvel
    Cc: Andrey Ryabinin
    Cc: Kirill A. Shutemov
    Cc: Jiri Slaby
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
     

22 Mar, 2016

1 commit

  • Pull cgroup namespace support from Tejun Heo:
    "These are changes to implement namespace support for cgroup which has
    been pending for quite some time now. It is very straight-forward and
    only affects what part of cgroup hierarchies are visible.

    After unsharing, mounting a cgroup fs will be scoped to the cgroups
    the task belonged to at the time of unsharing and the cgroup paths
    exposed to userland would be adjusted accordingly"

    * 'for-4.6-ns' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: fix and restructure error handling in copy_cgroup_ns()
    cgroup: fix alloc_cgroup_ns() error handling in copy_cgroup_ns()
    Add FS_USERNS_FLAG to cgroup fs
    cgroup: Add documentation for cgroup namespaces
    cgroup: mount cgroupns-root when inside non-init cgroupns
    kernfs: define kernfs_node_dentry
    cgroup: cgroup namespace setns support
    cgroup: introduce cgroup namespaces
    sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
    kernfs: Add API to generate relative kernfs path

    Linus Torvalds
     

17 Feb, 2016

1 commit

  • Introduce the ability to create a new cgroup namespace. The newly
    created cgroup namespace remembers the cgroup of the process at the
    point of creation of the cgroup namespace (referred to as the
    cgroupns-root). The main purpose of a cgroup namespace is to
    virtualize the contents of the /proc/self/cgroup file. Processes
    inside a cgroup namespace are only able to see paths relative to
    their namespace root (unless they are moved outside of their
    cgroupns-root, at which point they will see a relative path from
    their cgroupns-root). For a correctly set up container this enables
    container-tools (like libcontainer, lxc, lmctfy, etc.) to create
    completely virtualized containers without leaking the system level
    cgroup hierarchy to the task. This patch only implements the
    'unshare' part of the cgroupns.
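
    A minimal usage sketch of that part (CLONE_NEWCGROUP is the new clone
    flag added by this series):

    /* scope this process's cgroup view to its current cgroups */
    if (unshare(CLONE_NEWCGROUP))
        err(1, "unshare(CLONE_NEWCGROUP)");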

    Signed-off-by: Aditya Kali
    Signed-off-by: Serge Hallyn
    Signed-off-by: Tejun Heo

    Aditya Kali
     

15 Jan, 2016

2 commits

  • When inspecting the vague code inside the prctl(PR_SET_MM_MEM) call
    (which tests the RLIMIT_DATA value to figure out if we're allowed to
    assign new @start_brk, @brk, @start_data and @end_data in mm_struct),
    it became clear that RLIMIT_DATA, in the form it's implemented now,
    doesn't do anything useful, because most user-space libraries use the
    mmap() syscall for dynamic memory allocations.

    Linus suggested converting the RLIMIT_DATA rlimit into something
    suitable for anonymous memory accounting. But in this patch we go
    further, and the changes are bundled together as:

    * keep vma counting if CONFIG_PROC_FS=n, will be used for limits
    * replace mm->shared_vm with better defined mm->data_vm
    * account anonymous executable areas as executable
    * account file-backed growsdown/up areas as stack
    * drop struct file* argument from vm_stat_account
    * enforce RLIMIT_DATA for size of data areas

    This way code looks cleaner: now code/stack/data classification depends
    only on vm_flags state:

    VM_EXEC & ~VM_WRITE -> code (VmExe + VmLib in proc)
    VM_GROWSUP | VM_GROWSDOWN -> stack (VmStk)
    VM_WRITE & ~VM_SHARED & !stack -> data (VmData)

    The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called
    "shared", but that might be a strange beast like a readonly-private
    or VM_IO area.

    - RLIMIT_AS limits whole address space "VmSize"
    - RLIMIT_STACK limits stack "VmStk" (but each vma individually)
    - RLIMIT_DATA now limits "VmData"
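
    A sketch of the classification helper after this change (the
    is_*_mapping() helpers test the vm_flags combinations listed above):

    void vm_stat_account(struct mm_struct *mm, vm_flags_t flags, long npages)
    {
        mm->total_vm += npages;

        if (is_exec_mapping(flags))
            mm->exec_vm += npages;
        else if (is_stack_mapping(flags))
            mm->stack_vm += npages;
        else if (is_data_mapping(flags))
            mm->data_vm += npages;
    }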

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Cyrill Gorcunov
    Cc: Quentin Casasnovas
    Cc: Vegard Nossum
    Acked-by: Linus Torvalds
    Cc: Willy Tarreau
    Cc: Andy Lutomirski
    Cc: Kees Cook
    Cc: Vladimir Davydov
    Cc: Pavel Emelyanov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Mark those kmem allocations that are known to be easily triggered from
    userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
    memcg. For the list, see below:

    - threadinfo
    - task_struct
    - task_delay_info
    - pid
    - cred
    - mm_struct
    - vm_area_struct and vm_region (nommu)
    - anon_vma and anon_vma_chain
    - signal_struct
    - sighand_struct
    - fs_struct
    - files_struct
    - fdtable and fdtable->full_fds_bits
    - dentry and external_name
    - inode for all filesystems. This is the most tedious part, because
    most filesystems overwrite the alloc_inode method.

    The list is far from complete, so feel free to add more objects.
    Nevertheless, it should be close to "account everything" approach and
    keep most workloads within bounds. Malevolent users will be able to
    breach the limit, but this was possible even with the former "account
    everything" approach (simply because it did not account everything in
    fact).
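
    A sketch of what the marking looks like for one of the listed caches
    (fork_init() creating the task_struct cache; the other flags as in
    that era's kernel/fork.c):

    task_struct_cachep = kmem_cache_create("task_struct",
                arch_task_struct_size, ARCH_MIN_TASKALIGN,
                SLAB_PANIC | SLAB_NOTRACK | SLAB_ACCOUNT, NULL);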

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

13 Jan, 2016

1 commit

  • Pull cgroup updates from Tejun Heo:

    - cgroup v2 interface is now official. It's no longer hidden behind a
    devel flag and can be mounted using the new cgroup2 fs type.

    Unfortunately, cpu v2 interface hasn't made it yet due to the
    discussion around in-process hierarchical resource distribution and
    only memory and io controllers can be used on the v2 interface at the
    moment.

    - The existing documentation which has always been a bit of mess is
    relocated under Documentation/cgroup-v1/. Documentation/cgroup-v2.txt
    is added as the authoritative documentation for the v2 interface.

    - Some features are added through for-4.5-ancestor-test branch to
    enable netfilter xt_cgroup match to use cgroup v2 paths. The actual
    netfilter changes will be merged through the net tree which pulled in
    the said branch.

    - Various cleanups

    * 'for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: rename cgroup documentations
    cgroup: fix a typo.
    cgroup: Remove resource_counter.txt in Documentation/cgroup-legacy/00-INDEX.
    cgroup: demote subsystem init messages to KERN_DEBUG
    cgroup: Fix uninitialized variable warning
    cgroup: put controller Kconfig options in meaningful order
    cgroup: clean up the kernel configuration menu nomenclature
    cgroup_pids: fix a typo.
    Subject: cgroup: Fix incomplete dd command in blkio documentation
    cgroup: kill cgrp_ss_priv[CGROUP_CANFORK_COUNT] and friends
    cpuset: Replace all instances of time_t with time64_t
    cgroup: replace unified-hierarchy.txt with a proper cgroup v2 documentation
    cgroup: rename Documentation/cgroups/ to Documentation/cgroup-legacy/
    cgroup: replace __DEVEL__sane_behavior with cgroup2 fs type

    Linus Torvalds