12 Aug, 2014

1 commit

  • Current upstream kernel hangs with mips and powerpc targets in
    uniprocessor mode if SECCOMP is configured.

    Bisect points to commit dbd952127d11 ("seccomp: introduce writer locking").
    Turns out that code such as
    BUG_ON(!spin_is_locked(&list_lock));
    cannot be used in uniprocessor mode, because spin_is_locked() always
    returns false in this configuration, and that assert_spin_locked()
    exists for that very purpose and must be used instead.
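
    Illustratively, the shape of the fix (a sketch, not the literal diff):

    /* Broken on uniprocessor builds: spin_is_locked() is always
     * false there, so this BUG_ON() fires unconditionally. */
    BUG_ON(!spin_is_locked(&list_lock));

    /* Correct: assert_spin_locked() is meant for exactly this and
     * degrades gracefully when !CONFIG_SMP. */
    assert_spin_locked(&list_lock);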

    Fixes: dbd952127d11 ("seccomp: introduce writer locking")
    Cc: Kees Cook
    Signed-off-by: Guenter Roeck
    Signed-off-by: Kees Cook

    Guenter Roeck
     

09 Aug, 2014

7 commits

  • This patch (of 6):

    The i_mmap_writable field counts existing writable mappings of an
    address_space. To allow drivers to prevent new writable mappings, make
    this counter signed and prevent new writable mappings if it is negative.
    This is modelled after i_writecount and DENYWRITE.

    This will be required by the shmem-sealing infrastructure to prevent any
    new writable mappings after the WRITE seal has been set. In case there
    exists a writable mapping, this operation will fail with EBUSY.

    Note that we rely on the fact that if you already own a writable
    mapping, you can increase the counter without using the helpers. This
    is the same as what we do for i_writecount.
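
    As a sketch of the counter semantics described above, modelled on the
    i_writecount helpers (names illustrative, not necessarily the final
    API):

    static inline int mapping_map_writable(struct address_space *mapping)
    {
            /* refuse once a driver has denied writable mappings (< 0) */
            return atomic_inc_unless_negative(&mapping->i_mmap_writable) ?
                    0 : -EPERM;
    }

    static inline int mapping_deny_writable(struct address_space *mapping)
    {
            /* fail with EBUSY while any writable mapping exists (> 0) */
            return atomic_dec_unless_positive(&mapping->i_mmap_writable) ?
                    0 : -EBUSY;
    }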

    Signed-off-by: David Herrmann
    Acked-by: Hugh Dickins
    Cc: Michael Kerrisk
    Cc: Ryan Lortie
    Cc: Lennart Poettering
    Cc: Daniel Mack
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Herrmann
     
    This is a small set of patches our team has had kicking around
    internally for a few versions that fixes tasks getting hung on shm_exit
    when there are many threads hammering it at once.

    Anton wrote a simple test to cause the issue:

    http://ozlabs.org/~anton/junkcode/bust_shm_exit.c

    Before applying this patchset, this test code will cause either hanging
    tracebacks or pthread out of memory errors.

    After this patchset, it will still produce output like:

    root@somehost:~# ./bust_shm_exit 1024 160
    ...
    INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by 116, t=2111 jiffies, g=241, c=240, q=7113)
    INFO: Stall ended before state dump start
    ...

    But the task will continue to run along happily, so we consider this an
    improvement over hanging, even if it's a bit noisy.

    This patch (of 3):

    exit_shm obtains the ipc_ns shm rwsem for write and holds it while it
    walks every shared memory segment in the namespace. Thus the amount of
    work is related to the number of shm segments in the namespace, not
    the number of segments that might need to be cleaned up.

    In addition, this occurs after the task has been notified that the
    thread has exited, so the number of tasks waiting for the ns shm rwsem
    can grow without bound until memory is exhausted.

    Add a list to the task struct of all shmids allocated by this task.
    Initialize the list head in copy_process. Use the ns->rwsem for
    locking. Add segments to the list after their id is added; remove them
    from it before the id is removed.

    On unshare of NEW_IPCNS orphan any ids as if the task had exited, similar
    to handling of semaphore undo.

    I chose a define for the init sequence since it's a simple list init;
    otherwise it would require a function call to avoid include loops
    between the semaphore code and the task struct. Converting the list_del
    to list_del_init for the unshare cases would remove the exit followed
    by init, but I left it to blow up if not initialized.
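
    A sketch of the resulting wiring (names approximate):

    /* task_struct gains a list of segments attached by this task: */
    struct sysv_shm {
            struct list_head shm_clist;
    };

    /* the "define for the init sequence" mentioned above; keeping it a
     * plain list init avoids an include loop with the task struct: */
    #define shm_init_task(task) INIT_LIST_HEAD(&(task)->sysvshm.shm_clist)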

    Signed-off-by: Milton Miller
    Signed-off-by: Jack Miller
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Cc: Anton Blanchard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jack Miller
     
  • It's only used in fork.c:mm_init().

    Signed-off-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • If a forking process has a thread calling (un)mmap (silly but still),
    the child process may have some of its mm's vm usage counters (total_vm
    and friends) screwed up, because currently they are copied from oldmm
    w/o holding any locks (memcpy in dup_mm).

    This patch moves the counters initialization to dup_mmap() to be called
    under oldmm->mmap_sem, which eliminates any possibility of race.
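
    A sketch of the resulting dup_mmap() shape (abridged):

    down_write(&oldmm->mmap_sem);
    /* copied while no oldmm thread can (un)map concurrently */
    mm->total_vm  = oldmm->total_vm;
    mm->shared_vm = oldmm->shared_vm;
    mm->exec_vm   = oldmm->exec_vm;
    mm->stack_vm  = oldmm->stack_vm;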

    Signed-off-by: Vladimir Davydov
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • mm->pinned_vm counts pages of mm's address space that were permanently
    pinned in memory by increasing their reference counter. The counter was
    introduced by commit bc3e53f682d9 ("mm: distinguish between mlocked and
    pinned pages"), while before it locked_vm had been used for such pages.

    Obviously, we should reset the counter on fork if !CLONE_VM, just like
    we do with locked_vm, but currently we don't. Let's fix it.

    This patch will fix the contents of /proc/pid/status:VmPin.
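
    A sketch of the fix in dup_mmap(), mirroring the existing locked_vm
    reset:

    /* the child starts with no mlocked and no pinned pages */
    mm->locked_vm = 0;
    mm->pinned_vm = 0;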

    ib_umem_get[infiniband] and perf_mmap still check pinned_vm against
    RLIMIT_MEMLOCK. It's left over from the times when pinned pages were
    accounted under locked_vm, but today it looks wrong. It isn't clear
    how we should deal with it.

    We still have some drivers accounting pinned pages under mm->locked_vm -
    this is what commit bc3e53f682d9 was fighting against. The remaining
    offenders are infiniband/usnic and vfio.

    Signed-off-by: Vladimir Davydov
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Christoph Lameter
    Cc: Roland Dreier
    Cc: Sean Hefty
    Cc: Hal Rosenstock
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • mm initialization on fork/exec is spread all over the place, which makes
    the code look inconsistent.

    We have mm_init(), which is supposed to init/nullify mm's internals, but
    it doesn't init all the fields it should:

    - on fork, ->mmap, mm_rb, vmacache_seqnum, map_count, mm_cpumask and
    locked_vm are zeroed in dup_mmap();

    - on fork, ->pmd_huge_pte is zeroed in dup_mm(), immediately before
    calling mm_init();

    - the ->cpu_vm_mask_var ptr is initialized by mm_init_cpumask(), which
    is called before mm_init() on both fork and exec;

    - ->context is initialized by init_new_context(), which is called after
    mm_init() on both fork and exec.

    Let's consolidate all the initializations in mm_init() to make the code
    look cleaner.
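
    A sketch of the consolidated initializer (abridged, fields per the
    list above):

    static struct mm_struct *mm_init(struct mm_struct *mm,
                                     struct task_struct *p)
    {
            mm->mmap = NULL;
            mm->mm_rb = RB_ROOT;
            mm->vmacache_seqnum = 0;
            mm->map_count = 0;
            mm->locked_vm = 0;
            mm->pinned_vm = 0;
            mm->pmd_huge_pte = NULL;   /* CONFIG_TRANSPARENT_HUGEPAGE */
            mm_init_cpumask(mm);
            /* ... the rest of the existing mm_init() work ... */
            if (init_new_context(p, mm)) {
                    free_mm(mm);
                    return NULL;
            }
            return mm;
    }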

    Signed-off-by: Vladimir Davydov
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Pages are now uncharged at release time, and all sources of batched
    uncharges operate on lists of pages. Directly use those lists, and
    get rid of the per-task batching state.

    This also batches statistics accounting, in addition to the res
    counter charges, to reduce IRQ-disabling and re-enabling.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

06 Aug, 2014

2 commits

  • Pull security subsystem updates from James Morris:
    "In this release:

    - PKCS#7 parser for the key management subsystem from David Howells
    - appoint Kees Cook as seccomp maintainer
    - bugfixes and general maintenance across the subsystem"

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (94 commits)
    X.509: Need to export x509_request_asymmetric_key()
    netlabel: shorter names for the NetLabel catmap funcs/structs
    netlabel: fix the catmap walking functions
    netlabel: fix the horribly broken catmap functions
    netlabel: fix a problem when setting bits below the previously lowest bit
    PKCS#7: X.509 certificate issuer and subject are mandatory fields in the ASN.1
    tpm: simplify code by using %*phN specifier
    tpm: Provide a generic means to override the chip returned timeouts
    tpm: missing tpm_chip_put in tpm_get_random()
    tpm: Properly clean sysfs entries in error path
    tpm: Add missing tpm_do_selftest to ST33 I2C driver
    PKCS#7: Use x509_request_asymmetric_key()
    Revert "selinux: fix the default socket labeling in sock_graft()"
    X.509: x509_request_asymmetric_keys() doesn't need string length arguments
    PKCS#7: fix sparse non static symbol warning
    KEYS: revert encrypted key change
    ima: add support for measuring and appraising firmware
    firmware_class: perform new LSM checks
    security: introduce kernel_fw_from_file hook
    PKCS#7: Missing inclusion of linux/err.h
    ...

    Linus Torvalds
     
  • Pull timer and time updates from Thomas Gleixner:
    "A rather large update of timers, timekeeping & co

    - Core timekeeping code is year-2038 safe now for 32bit machines.
    Now we just need to fix all in-kernel users and the gazillion of
    user-space interfaces which rely on timespec/timeval :)

    - Better cache layout for the timekeeping internal data structures.

    - Proper nanosecond based interfaces for in kernel users.

    - Tree wide cleanup of code which wants nanoseconds but does hoops
    and loops to convert back and forth from timespecs. Some of it
    definitely belongs into the ugly code museum.

    - Consolidation of the timekeeping interface zoo.

    - A fast NMI safe accessor to clock monotonic for tracing. This is a
    long standing request to support correlated user/kernel space
    traces. With proper NTP frequency correction it's also suitable
    for correlation of traces across separate machines.

    - Checkpoint/restart support for timerfd.

    - A few NOHZ[_FULL] improvements in the [hr]timer code.

    - Code move from kernel to kernel/time of all time* related code.

    - New clocksource/event drivers from the ARM universe. I'm really
    impressed that despite an architected timer in the newer chips SoC
    manufacturers insist on inventing new and differently broken SoC
    specific timers.

    [ Ed. "Impressed"? I don't think that word means what you think it means ]

    - Another round of code move from arch to drivers. Looks like most
    of the legacy mess in ARM regarding timers is sorted out except for
    a few obnoxious strongholds.

    - The usual updates and fixlets all over the place"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
    timekeeping: Fixup typo in update_vsyscall_old definition
    clocksource: document some basic timekeeping concepts
    timekeeping: Use cached ntp_tick_length when accumulating error
    timekeeping: Rework frequency adjustments to work better w/ nohz
    timekeeping: Minor fixup for timespec64->timespec assignment
    ftrace: Provide trace clocks monotonic
    timekeeping: Provide fast and NMI safe access to CLOCK_MONOTONIC
    seqcount: Add raw_write_seqcount_latch()
    seqcount: Provide raw_read_seqcount()
    timekeeping: Use tk_read_base as argument for timekeeping_get_ns()
    timekeeping: Create struct tk_read_base and use it in struct timekeeper
    timekeeping: Restructure the timekeeper some more
    clocksource: Get rid of cycle_last
    clocksource: Move cycle_last validation to core code
    clocksource: Make delta calculation a function
    wireless: ath9k: Get rid of timespec conversions
    drm: vmwgfx: Use nsec based interfaces
    drm: i915: Use nsec based interfaces
    timekeeping: Provide ktime_get_raw()
    hangcheck-timer: Use ktime_get_ns()
    ...

    Linus Torvalds
     

24 Jul, 2014

2 commits


19 Jul, 2014

1 commit

  • Normally, task_struct.seccomp.filter is only ever read or modified by
    the task that owns it (current). This property aids in fast access
    during system call filtering as read access is lockless.

    Updating the pointer from another task, however, opens up race
    conditions. To allow cross-thread filter pointer updates, writes to the
    seccomp fields are now protected by the sighand spinlock (which is shared
    by all threads in the thread group). Read access remains lockless because
    pointer updates themselves are atomic. However, writes (or cloning)
    often entail additional checking (like maximum instruction counts)
    which require locking to perform safely.

    In the case of cloning threads, the child is invisible to the system
    until it enters the task list. To make sure a child can't be cloned from
    a thread and left in a prior state, seccomp duplication is additionally
    moved under the sighand lock. Then parent and child are certain to
    have the same seccomp state when they exit the lock.
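
    A sketch of the writer-side discipline (close to, but not literally,
    the patch); note the BUG_ON() here is the very assertion the 12 Aug
    fix at the top of this log converts to assert_spin_locked():

    static void seccomp_assign_mode(struct task_struct *task,
                                    unsigned long seccomp_mode)
    {
            /* every seccomp writer must hold the shared sighand lock */
            BUG_ON(!spin_is_locked(&task->sighand->siglock));

            task->seccomp.mode = seccomp_mode;
            set_tsk_thread_flag(task, TIF_SECCOMP);
    }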

    Based on patches by Will Drewry and David Drysdale.

    Signed-off-by: Kees Cook
    Reviewed-by: Oleg Nesterov
    Reviewed-by: Andy Lutomirski

    Kees Cook
     

17 Jul, 2014

1 commit


16 Jul, 2014

2 commits


21 Jun, 2014

1 commit

  • syscall_regfunc() and syscall_unregfunc() should set/clear
    TIF_SYSCALL_TRACEPOINT system-wide, but do_each_thread() can race
    with copy_process() and miss the new child which was not added to
    the process/thread lists yet.

    Change copy_process() to update the child's TIF_SYSCALL_TRACEPOINT
    under tasklist_lock.
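
    A sketch of the copy_process() side, inside the same
    write_lock_irq(&tasklist_lock) section that links the child in:

    if (test_thread_flag(TIF_SYSCALL_TRACEPOINT))
            set_tsk_thread_flag(p, TIF_SYSCALL_TRACEPOINT);
    else
            clear_tsk_thread_flag(p, TIF_SYSCALL_TRACEPOINT);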

    Link: http://lkml.kernel.org/p/20140413185854.GB20668@redhat.com

    Cc: stable@vger.kernel.org # 2.6.33
    Fixes: a871bd33a6c0 "tracing: Add syscall tracepoints"
    Acked-by: Frederic Weisbecker
    Acked-by: Paul E. McKenney
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Steven Rostedt

    Oleg Nesterov
     

12 Jun, 2014

1 commit

  • do_posix_clock_monotonic_gettime() is a leftover from the initial
    posix timer implementation which maps to ktime_get_ts().
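
    For reference, the wrapper amounted to a one-line macro, so callers
    convert mechanically:

    /* old */
    do_posix_clock_monotonic_gettime(&ts);
    /* new (what the macro expanded to anyway) */
    ktime_get_ts(&ts);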

    Signed-off-by: Thomas Gleixner
    Cc: John Stultz
    Cc: Peter Zijlstra
    Cc: Oleg Nesterov
    Link: http://lkml.kernel.org/r/20140611234607.427408044@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

07 Jun, 2014

1 commit

  • When tracing a process in another pid namespace, it's important for fork
    event messages to contain the child's pid as seen from the tracer's pid
    namespace, not the parent's. Otherwise, the tracer won't be able to
    correlate the fork event with later SIGTRAP signals it receives from the
    child.

    We still risk a race condition if a ptracer from a different pid
    namespace attaches after we compute the pid_t value. However, sending a
    bogus fork event message in this unlikely scenario is still a vast
    improvement over the status quo where we always send bogus fork event
    messages to debuggers in a different pid namespace than the forking
    process.
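
    A sketch of computing the event message in the tracer's namespace
    (shape only, using existing pid helpers; pid here is the child's
    struct pid from get_task_pid(p, PIDTYPE_PID)):

    struct pid_namespace *ns;
    pid_t nr = 0;

    rcu_read_lock();
    /* report the child's pid as the ptracer will see it */
    ns = task_active_pid_ns(rcu_dereference(current->parent));
    if (ns)
            nr = pid_nr_ns(pid, ns);
    rcu_read_unlock();

    ptrace_event(trace, nr);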

    Signed-off-by: Matthew Dempsky
    Acked-by: Oleg Nesterov
    Cc: Kees Cook
    Cc: Julien Tinnes
    Cc: Roland McGrath
    Cc: Jan Kratochvil
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Dempsky
     

05 Jun, 2014

2 commits

    CONFIG_MM_OWNER makes no sense. It is not user-selectable; it is only
    selected automatically by CONFIG_MEMCG. So we can kill this option in
    init/Kconfig and do s/CONFIG_MM_OWNER/CONFIG_MEMCG/ globally.

    Signed-off-by: Oleg Nesterov
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
    Currently, to allocate a page that should be charged to kmemcg (e.g.
    threadinfo), we pass the __GFP_KMEMCG flag to the page allocator. The
    allocated page then has to be freed with free_memcg_kmem_pages. Apart
    from looking asymmetrical, this also requires intrusion into the
    general allocation path. So let's introduce separate functions that
    will alloc/free pages charged to kmemcg.

    The new functions are called alloc_kmem_pages and free_kmem_pages. They
    should be used when the caller would actually like to use kmalloc, but
    has to fall back to the page allocator because the allocation is large.
    They only differ from alloc_pages and free_pages in that besides
    allocating or freeing pages they also charge them to the kmem resource
    counter of the current memory cgroup.
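
    A sketch of the intended pairing:

    struct page *page = alloc_kmem_pages(GFP_KERNEL, order);

    if (!page)
            return -ENOMEM;
    /* ... use page_address(page) ... */
    free_kmem_pages((unsigned long)page_address(page), order);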

    [sfr@canb.auug.org.au: export kmalloc_order() to modules]
    Signed-off-by: Vladimir Davydov
    Acked-by: Greg Thelen
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Glauber Costa
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

08 Apr, 2014

5 commits

    To increase compiler portability there is <linux/compiler.h>, which
    provides convenience macros for various gcc constructs, e.g. __weak
    for __attribute__((weak)). I've replaced all instances of gcc
    attributes with the right macro in the kernel subsystem.
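
    For example, one of the substitutions this series performs (some_hook
    is a stand-in name):

    #include <linux/compiler.h>

    /* before */
    void __attribute__((weak)) some_hook(void);
    /* after */
    void __weak some_hook(void);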

    Signed-off-by: Gideon Israel Dsouza
    Cc: "Rafael J. Wysocki"
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gideon Israel Dsouza
     
  • PF_MEMPOLICY is an unnecessary optimization for CONFIG_SLAB users.
    There's no significant performance degradation to checking
    current->mempolicy rather than current->flags & PF_MEMPOLICY in the
    allocation path, especially since this is considered unlikely().

    Running TCP_RR with netperf-2.4.5 through localhost on a 16-cpu
    machine with 64GB of memory and without a mempolicy:

    threads     before      after
         16    1249409    1244487
         32    1281786    1246783
         48    1239175    1239138
         64    1244642    1241841
         80    1244346    1248918
         96    1266436    1254316
        112    1307398    1312135
        128    1327607    1326502
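
    The check itself changes shape roughly like this in the slab fastpath
    (sketch only):

    /* before: consult a cached per-process flag */
    if (unlikely(current->flags & PF_MEMPOLICY))
            objp = alternate_node_alloc(cachep, flags);

    /* after: test the pointer directly; it is NULL for most tasks */
    if (unlikely(current->mempolicy))
            objp = alternate_node_alloc(cachep, flags);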

    Per-process flags are a scarce resource, so we should free them up
    whenever possible and make them available. We'll be using this one
    shortly for memcg oom reserves.

    Signed-off-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Jianguo Wu
    Cc: Tim Hockin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • copy_flags() does not use the clone_flags formal and can be collapsed
    into copy_process() for cleaner code.
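
    The collapsed body is tiny; in copy_process() it reduces to roughly:

    /* formerly copy_flags(clone_flags, p) */
    p->flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER);
    p->flags |= PF_FORKNOEXEC;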

    Signed-off-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Jianguo Wu
    Cc: Tim Hockin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • This patch is a continuation of efforts trying to optimize find_vma(),
    avoiding potentially expensive rbtree walks to locate a vma upon faults.
    The original approach (https://lkml.org/lkml/2013/11/1/410), where the
    largest vma was also cached, ended up being too specific and random,
    so further comparison with other approaches was needed. There are two
    things to consider when dealing with this: the cache hit rate and the
    latency of find_vma(). Improving the hit rate does not necessarily
    translate into finding the vma any faster, as the overhead of any
    fancy caching scheme can be too high to justify.

    We currently cache the last used vma for the whole address space,
    which provides a nice optimization, cutting the total cycles spent in
    find_vma() by a factor of up to 2.5 for workloads with good locality.
    On the other hand, this simple scheme is pretty much useless for
    workloads with poor locality. Analyzing ebizzy runs shows that, no
    matter how many threads are running, the mmap_cache hit rate is less
    than 2%, and in many situations below 1%.

    The proposed approach is to replace this scheme with a small per-thread
    cache, maximizing hit rates at a very low maintenance cost.
    Invalidations are performed by simply bumping up a 32-bit sequence
    number. The only expensive operation is in the rare case of a seq
    number overflow, where all caches that share the same address space are
    flushed. Upon a miss, the proposed replacement policy is based on the
    page number that contains the virtual address in question. Concretely,
    the following results are seen on an 80 core, 8 socket x86-64 box:

    1) System bootup: Most programs are single threaded, so the per-thread
    scheme improves on the ~50% baseline hit rate just by adding a few
    more slots to the cache.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |   50.61% |            19.90 |
    | patched        |   73.45% |            13.58 |
    +----------------+----------+------------------+

    2) Kernel build: This one is already pretty good with the current
    approach as we're dealing with good locality.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |   75.28% |            11.03 |
    | patched        |   88.09% |             9.31 |
    +----------------+----------+------------------+

    3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |   70.66% |            17.14 |
    | patched        |   91.15% |            12.57 |
    +----------------+----------+------------------+

    4) Ebizzy: There's a fair amount of variation from run to run, but
    this approach always shows nearly perfect hit rates, while the
    baseline's is just about non-existent. Cycle counts for the baseline
    scheme fluctuate anywhere from ~60 to ~116 billion, while this
    approach reduces them considerably. For instance, with 80 threads:

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |    1.06% |            91.54 |
    | patched        |   99.97% |            14.18 |
    +----------------+----------+------------------+
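
    A sketch of the replacement policy described above (constants
    illustrative):

    #define VMACACHE_BITS  2
    #define VMACACHE_SIZE  (1U << VMACACHE_BITS)
    #define VMACACHE_HASH(addr) ((addr >> PAGE_SHIFT) & (VMACACHE_SIZE - 1))

    void vmacache_update(unsigned long addr, struct vm_area_struct *newvma)
    {
            /* slot picked by the page number of the faulting address;
             * a stale cache is detected via the 32-bit seqnum instead
             * of being flushed eagerly */
            if (vmacache_valid_mm(newvma->vm_mm))
                    current->vmacache[VMACACHE_HASH(addr)] = newvma;
    }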

    [akpm@linux-foundation.org: fix nommu build, per Davidlohr]
    [akpm@linux-foundation.org: document vmacache_valid() logic]
    [akpm@linux-foundation.org: attempt to untangle header files]
    [akpm@linux-foundation.org: add vmacache_find() BUG_ON]
    [hughd@google.com: add vmacache_valid_mm() (from Oleg)]
    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: adjust and enhance comments]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Acked-by: Linus Torvalds
    Reviewed-by: Michel Lespinasse
    Cc: Oleg Nesterov
    Tested-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
    Add VM_INIT_DEF_MASK to allow us to set the default flags for VMAs.
    Also add a prctl control which allows us to set the THP disable bit in
    mm->def_flags so that VMAs will pick up the setting as they are
    created.
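
    From userspace the new knob would be exercised roughly like this
    (prctl constant as added by this series; treat the value as
    illustrative):

    #include <stdio.h>
    #include <sys/prctl.h>

    #ifndef PR_SET_THP_DISABLE
    #define PR_SET_THP_DISABLE 41
    #endif

    int main(void)
    {
            /* the THP disable bit lands in mm->def_flags, so VMAs
             * created from now on inherit it */
            if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0))
                    perror("prctl");
            return 0;
    }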

    Signed-off-by: Alex Thorlton
    Suggested-by: Oleg Nesterov
    Cc: Gerald Schaefer
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Christian Borntraeger
    Cc: Paolo Bonzini
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Alexander Viro
    Cc: Johannes Weiner
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Thorlton
     

04 Apr, 2014

1 commit

  • Pull cgroup updates from Tejun Heo:
    "A lot updates for cgroup:

    - The biggest one is cgroup's conversion to kernfs. cgroup took
    after the long abandoned vfs-entangled sysfs implementation and
    made it even more convoluted over time. cgroup's internal objects
    were fused with vfs objects which also brought in vfs locking and
    object lifetime rules. Naturally, there are places where vfs rules
    don't fit and nasty hacks, such as credential switching or lock
    dance interleaving inode mutex and cgroup_mutex with object serial
    number comparison thrown in to decide whether the operation is
    actually necessary, needed to be employed.

    After conversion to kernfs, internal object lifetime and locking
    rules are mostly isolated from vfs interactions allowing shedding
    of several nasty hacks and overall simplification. This will also
    allow implementation of operations which may affect multiple cgroups,
    which weren't possible before as it would have required nesting
    i_mutexes.

    - Various simplifications including dropping of module support,
    easier cgroup name/path handling, simplified cgroup file type
    handling and task_cg_lists optimization.

    - Preparatory changes for the planned unified hierarchy, which is still
    a patchset away from being actually operational. The dummy
    hierarchy is updated to serve as the default unified hierarchy.
    Controllers which aren't claimed by other hierarchies are
    associated with it, which BTW was what the dummy hierarchy was for
    anyway.

    - Various fixes from Li and others. This pull request includes some
    patches to add missing slab.h includes to various subsystems. This
    was triggered by the xattr.h include removal from cgroup.h: cgroup.h
    indirectly got included into a lot of files, which brought in xattr.h,
    which brought in slab.h.

    There are several merge commits - one to pull in kernfs updates
    necessary for converting cgroup (already in upstream through
    driver-core), others for interfering changes in the fixes branch"

    * 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (74 commits)
    cgroup: remove useless argument from cgroup_exit()
    cgroup: fix spurious lockdep warning in cgroup_exit()
    cgroup: Use RCU_INIT_POINTER(x, NULL) in cgroup.c
    cgroup: break kernfs active_ref protection in cgroup directory operations
    cgroup: fix cgroup_taskset walking order
    cgroup: implement CFTYPE_ONLY_ON_DFL
    cgroup: make cgrp_dfl_root mountable
    cgroup: drop const from @buffer of cftype->write_string()
    cgroup: rename cgroup_dummy_root and related names
    cgroup: move ->subsys_mask from cgroupfs_root to cgroup
    cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding
    cgroup: remove NULL checks from [pr_cont_]cgroup_{name|path}()
    cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root
    cgroup: reorganize cgroup bootstrapping
    cgroup: relocate setting of CGRP_DEAD
    cpuset: use rcu_read_lock() to protect task_cs()
    cgroup_freezer: document freezer_fork() subtleties
    cgroup: update cgroup_transfer_tasks() to either succeed or fail
    cgroup: drop task_lock() protection around task->cgroups
    cgroup: update how a newly forked task gets associated with css_set
    ...

    Linus Torvalds
     

29 Mar, 2014

1 commit

    cgroup_exit() is called in the fork and exit paths. If it's called in
    the failure path during fork, PF_EXITING isn't set, and then lockdep
    will complain.

    Fix this by removing cgroup_exit() in that failure path. cgroup_fork()
    does nothing that needs cleanup.

    Reported-by: Sasha Levin
    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     

11 Mar, 2014

1 commit

  • Bad idea on -rt:

    [ 908.026136] [] rt_spin_lock_slowlock+0xaa/0x2c0
    [ 908.026145] [] task_numa_free+0x31/0x130
    [ 908.026151] [] finish_task_switch+0xce/0x100
    [ 908.026156] [] thread_return+0x48/0x4ae
    [ 908.026160] [] schedule+0x25/0xa0
    [ 908.026163] [] rt_spin_lock_slowlock+0xd5/0x2c0
    [ 908.026170] [] get_signal_to_deliver+0xaf/0x680
    [ 908.026175] [] do_signal+0x3d/0x5b0
    [ 908.026179] [] do_notify_resume+0x90/0xe0
    [ 908.026186] [] int_signal+0x12/0x17
    [ 908.026193] [] 0x7ff2a388b1cf

    and since upstream does not mind where we do this, be a bit nicer ...

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Cc: Mel Gorman
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1393568591.6018.27.camel@marge.simpson.net
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     

24 Jan, 2014

4 commits


22 Jan, 2014

1 commit

    while_each_thread() and next_thread() should die; almost every
    lockless usage is wrong.

    1. Unless g == current, the lockless while_each_thread() is not safe.

    while_each_thread(g, t) can loop forever if g exits; next_thread()
    can't reach the unhashed thread in this case. Note that this can
    happen even if g is the group leader: it can exec.

    2. Even if while_each_thread() itself were correct, people often use
    it wrongly.

    It was never safe to just take rcu_read_lock() and loop unless
    you verify that pid_alive(g) == T; even the first next_thread()
    can point to already freed/reused memory.

    This patch adds signal_struct->thread_head and task->thread_node to
    create the normal rcu-safe list with the stable head. The new
    for_each_thread(g, t) helper is always safe under rcu_read_lock() as
    long as this task_struct can't go away.

    Note: of course it is ugly to have both task_struct->thread_node and
    the old task_struct->thread_group; we will kill the latter later,
    after we change the users of while_each_thread() to use
    for_each_thread().

    Perhaps we can kill it even before we convert all users; we can
    reimplement next_thread(t) using the new thread_head/thread_node. But
    we can't do this right now because it would lead to subtle behavioural
    changes. For example, do/while_each_thread() always sees at least one
    task, while for_each_thread() can do nothing if the whole thread group
    has died. Or thread_group_empty(): currently its semantics are not
    clear unless thread_group_leader(p), and we need to audit the callers
    before we can change it.

    So this patch adds the new interface which has to coexist with the old
    one for some time, hopefully the next changes will be more or less
    straightforward and the old one will go away soon.
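
    Usage-wise, the conversion looks like:

    /* old, unsafe unless g == current: */
    t = g;
    do {
            do_something(t);
    } while_each_thread(g, t);

    /* new, safe under rcu_read_lock() while g can't go away: */
    rcu_read_lock();
    for_each_thread(g, t)
            do_something(t);
    rcu_read_unlock();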

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Sergey Dyasly
    Tested-by: Sergey Dyasly
    Reviewed-by: Sameer Nanda
    Acked-by: David Rientjes
    Cc: "Eric W. Biederman"
    Cc: Frederic Weisbecker
    Cc: Mandeep Singh Baines
    Cc: "Ma, Xindong"
    Cc: Michal Hocko
    Cc: "Tu, Xiaobing"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

21 Jan, 2014

1 commit

  • Pull scheduler changes from Ingo Molnar:

    - Add the initial implementation of SCHED_DEADLINE support: a real-time
    scheduling policy where tasks that meet their deadlines and
    periodically execute their instances in less than their runtime quota
    see real-time scheduling and won't miss any of their deadlines.
    Tasks that go over their quota get delayed (Available to privileged
    users for now)

    - Clean up and fix preempt_enable_no_resched() abuse all around the
    tree

    - Do sched_clock() performance optimizations on x86 and elsewhere

    - Fix and improve auto-NUMA balancing

    - Fix and clean up the idle loop

    - Apply various cleanups and fixes

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
    sched: Fix __sched_setscheduler() nice test
    sched: Move SCHED_RESET_ON_FORK into attr::sched_flags
    sched: Fix up attr::sched_priority warning
    sched: Fix up scheduler syscall LTP fails
    sched: Preserve the nice level over sched_setscheduler() and sched_setparam() calls
    sched/core: Fix htmldocs warnings
    sched/deadline: No need to check p if dl_se is valid
    sched/deadline: Remove unused variables
    sched/deadline: Fix sparse static warnings
    m68k: Fix build warning in mac_via.h
    sched, thermal: Clean up preempt_enable_no_resched() abuse
    sched, net: Fixup busy_loop_us_clock()
    sched, net: Clean up preempt_enable_no_resched() abuse
    sched/preempt: Fix up missed PREEMPT_NEED_RESCHED folding
    sched/preempt, locking: Rework local_bh_{dis,en}able()
    sched/clock, x86: Avoid a runtime condition in native_sched_clock()
    sched/clock: Fix up clear_sched_clock_stable()
    sched/clock, x86: Use a static_key for sched_clock_stable
    sched/clock: Remove local_irq_disable() from the clocks
    sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs
    ...

    Linus Torvalds
     

18 Jan, 2014

1 commit

  • Pull namespace fixes from Eric Biederman:
    "This is a set of 3 regression fixes.

    This fixes /proc/mounts when using "ip netns add <name>" to display
    the actual mount point.

    This fixes a regression in clone that broke lxc-attach.

    This fixes a regression in the permission checks for mounting /proc
    that made proc unmountable if binfmt_misc was in use. Oops.

    My apologies for sending this pull request so late. Al Viro gave
    interesting review comments about the d_path fix that I wanted to
    address in detail before I sent this pull request. Unfortunately a
    bad round of colds kept me from addressing that in detail until today.
    The executive summary of the review was:

    Al: Is patching d_path really sufficient?
    The prepend_path, d_path, d_absolute_path, and __d_path family of
    functions is a real mess.

    Me: Yes, patching d_path is really sufficient. Yes, the code is a mess.
    No, it is not appropriate to rewrite all of d_path for a regression
    that has existed for entirely too long already, when a two line
    change will do"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    vfs: Fix a regression in mounting proc
    fork: Allow CLONE_PARENT after setns(CLONE_NEWPID)
    vfs: In d_path don't call d_dname on a mount point

    Linus Torvalds
     

13 Jan, 2014

4 commits

    Some method to deal with rt-mutexes and make sched_dl interact with
    the current PI code is needed, raising all but trivial issues that
    need (according to us) to be solved with some restructuring of
    the PI code (i.e., going towards a proxy-execution-ish implementation).

    This is under development. In the meanwhile, as a temporary solution,
    what this commit does is:

    - ensure a pi-lock owner with waiters is never throttled down. Instead,
    when it runs out of runtime, it immediately gets replenished and its
    deadline is postponed;

    - the scheduling parameters (relative deadline and default runtime)
    used for those replenishments --during the whole period it holds the
    pi-lock-- are the ones of the waiting task with the earliest deadline.

    Acting this way, we provide some kind of boosting to the lock-owner,
    still by using the existing (actually, slightly modified by the previous
    commit) pi-architecture.

    We would stress the fact that this is only a surely needed, all but
    clean solution to the problem. In the end it's only a way to re-start
    discussion within the community. So, as always, comments, ideas, rants,
    etc.. are welcome! :-)

    Signed-off-by: Dario Faggioli
    Signed-off-by: Juri Lelli
    [ Added !RT_MUTEXES build fix. ]
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1383831828-15501-11-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Ingo Molnar

    Dario Faggioli
     
  • Turn the pi-chains from plist to rb-tree, in the rt_mutex code,
    and provide a proper comparison function for -deadline and
    -priority tasks.

    This is done mainly because:
    - classical prio field of the plist is just an int, which might
    not be enough for representing a deadline;
    - manipulating such a list would become O(nr_deadline_tasks),
    which might be too much, as the number of -deadline tasks increases.

    Therefore, an rb-tree is used, and tasks are queued in it according
    to the following logic:
    - among two -priority (i.e., SCHED_BATCH/OTHER/RR/FIFO) tasks, the
    one with the higher (lower, actually!) prio wins;
    - among a -priority and a -deadline task, the latter always wins;
    - among two -deadline tasks, the one with the earliest deadline
    wins.

    Queueing and dequeueing functions are changed accordingly, for both
    the list of a task's pi-waiters and the list of tasks blocked on
    a pi-lock.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Dario Faggioli
    Signed-off-by: Juri Lelli
    Signed-off-again-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1383831828-15501-10-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Introduces the data structures, constants and symbols needed for
    SCHED_DEADLINE implementation.

    Core data structures of SCHED_DEADLINE are defined, along with their
    initializers. Hooks for checking if a task belongs to the new policy
    are also added where they are needed.

    Adds a scheduling class, in sched/dl.c, and a new policy called
    SCHED_DEADLINE. It is an implementation of the Earliest Deadline
    First (EDF) scheduling algorithm, augmented with a mechanism (called
    Constant Bandwidth Server, CBS) that makes it possible to isolate the
    behaviour of tasks from one another.

    The typical -deadline task will be made up of a computation phase
    (instance) which is activated in a periodic or sporadic fashion. The
    expected (maximum) duration of such a computation is called the task's
    runtime; the time interval by which each instance needs to be completed
    is called the task's relative deadline. The task's absolute deadline is
    dynamically calculated as the time instant a task (better, an instance)
    activates plus the relative deadline.

    The EDF algorithm selects the task with the smallest absolute deadline
    as the one to be executed first, while the CBS ensures that each task
    runs for at most its runtime every (relative) deadline length time
    interval, avoiding any interference between different tasks (bandwidth
    isolation). Thanks to this feature, tasks that do not strictly comply
    with the computational model sketched above can also effectively use
    the new policy.

    To summarize, this patch:
    - introduces the data structures, constants and symbols needed;
    - implements the core logic of the scheduling algorithm in the new
    scheduling class file;
    - provides all the glue code between the new scheduling class and
    the core scheduler and refines the interactions between sched/dl
    and the other existing scheduling classes.
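
    For illustration, a task opts into the new policy via the
    sched_setattr() syscall added in this series. The sketch below
    hand-rolls the attr struct and syscall number (x86-64) since no libc
    wrapper exists; treat the layout and constants as illustrative:

    #include <stdint.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #ifndef __NR_sched_setattr
    #define __NR_sched_setattr 314          /* x86-64 */
    #endif
    #define SCHED_DEADLINE 6

    struct sched_attr {
            uint32_t size;
            uint32_t sched_policy;
            uint64_t sched_flags;
            int32_t  sched_nice;            /* SCHED_NORMAL/BATCH */
            uint32_t sched_priority;        /* SCHED_FIFO/RR */
            uint64_t sched_runtime;         /* -deadline params, in ns */
            uint64_t sched_deadline;
            uint64_t sched_period;
    };

    int main(void)
    {
            struct sched_attr attr = {
                    .size           = sizeof(attr),
                    .sched_policy   = SCHED_DEADLINE,
                    .sched_runtime  =  10 * 1000 * 1000,  /* 10 ms  */
                    .sched_deadline =  30 * 1000 * 1000,  /* 30 ms  */
                    .sched_period   = 100 * 1000 * 1000,  /* 100 ms */
            };

            return syscall(__NR_sched_setattr, 0, &attr, 0) ? 1 : 0;
    }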

    Signed-off-by: Dario Faggioli
    Signed-off-by: Michael Trimarchi
    Signed-off-by: Fabio Checconi
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1383831828-15501-4-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Ingo Molnar

    Dario Faggioli
     
  • Pick up the latest fixes before applying new changes.

    Signed-off-by: Ingo Molnar

    Ingo Molnar