13 Oct, 2014

1 commit

  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Optimized support for Intel "Cluster-on-Die" (CoD) topologies (Dave
    Hansen)

    - Various sched/idle refinements for better idle handling (Nicolas
    Pitre, Daniel Lezcano, Chuansheng Liu, Vincent Guittot)

    - sched/numa updates and optimizations (Rik van Riel)

    - sysbench speedup (Vincent Guittot)

    - capacity calculation cleanups/refactoring (Vincent Guittot)

    - Various cleanups to thread group iteration (Oleg Nesterov)

    - Double-rq-lock removal optimization and various refactorings
    (Kirill Tkhai)

    - various sched/deadline fixes

    ... and lots of other changes"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
    sched/dl: Use dl_bw_of() under rcu_read_lock_sched()
    sched/fair: Delete resched_cpu() from idle_balance()
    sched, time: Fix build error with 64 bit cputime_t on 32 bit systems
    sched: Improve sysbench performance by fixing spurious active migration
    sched/x86: Fix up typo in topology detection
    x86, sched: Add new topology for multi-NUMA-node CPUs
    sched/rt: Use resched_curr() in task_tick_rt()
    sched: Use rq->rd in sched_setaffinity() under RCU read lock
    sched: cleanup: Rename 'out_unlock' to 'out_free_new_mask'
    sched: Use dl_bw_of() under RCU read lock
    sched/fair: Remove duplicate code from can_migrate_task()
    sched, mips, ia64: Remove __ARCH_WANT_UNLOCKED_CTXSW
    sched: print_rq(): Don't use tasklist_lock
    sched: normalize_rt_tasks(): Don't use _irqsave for tasklist_lock, use task_rq_lock()
    sched: Fix the task-group check in tg_has_rt_tasks()
    sched/fair: Leverage the idle state info when choosing the "idlest" cpu
    sched: Let the scheduler see CPU idle states
    sched/deadline: Fix inter- exclusive cpusets migrations
    sched/deadline: Clear dl_entity params when setscheduling to different class
    sched/numa: Kill the wrong/dead TASK_DEAD check in task_numa_fault()
    ...

    Linus Torvalds
     

03 Oct, 2014

1 commit

  • Oleg noticed that a cleanup by Sylvain actually uncovered a bug; by
    calling perf_event_free_task() when failing sched_fork() we will not yet
    have done the memset() on ->perf_event_ctxp[] and will therefore try and
    'free' the inherited contexts, which are still in use by the parent
    process. This is bad..

    Suggested-by: Oleg Nesterov
    Reported-by: Oleg Nesterov
    Reported-by: Sylvain 'ythier' Hitier
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Ingo Molnar
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

19 Sep, 2014

1 commit

  • Tasks get their end of stack set to STACK_END_MAGIC with the
    aim to catch stack overruns. Currently this feature does not
    apply to init_task. This patch removes this restriction.

    Note that a similar patch was posted by Prarit Bhargava
    some time ago but was never merged:

    http://marc.info/?l=linux-kernel&m=127144305403241&w=2
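
    The canary machinery itself is small. A hedged sketch using the existing
    end_of_stack()/STACK_END_MAGIC helpers (illustrative only, not the exact
    hunk applied here; "tsk" stands for the task being set up):

      unsigned long *stackend = end_of_stack(tsk);

      *stackend = STACK_END_MAGIC;                /* plant the canary */

      /* ... later, e.g. from the scheduler or an oops handler ... */
      if (*end_of_stack(tsk) != STACK_END_MAGIC)
              panic("stack overrun in %s\n", tsk->comm);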

    Signed-off-by: Aaron Tomlin
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Oleg Nesterov
    Acked-by: Michael Ellerman
    Cc: aneesh.kumar@linux.vnet.ibm.com
    Cc: dzickus@redhat.com
    Cc: bmr@redhat.com
    Cc: jcastillo@redhat.com
    Cc: jgh@redhat.com
    Cc: minchan@kernel.org
    Cc: tglx@linutronix.de
    Cc: hannes@cmpxchg.org
    Cc: Alex Thorlton
    Cc: Andrew Morton
    Cc: Benjamin Herrenschmidt
    Cc: Daeseok Youn
    Cc: David Rientjes
    Cc: Fabian Frederick
    Cc: Geert Uytterhoeven
    Cc: Jiri Olsa
    Cc: Kees Cook
    Cc: Kirill A. Shutemov
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Michael Opdenacker
    Cc: Paul Mackerras
    Cc: Prarit Bhargava
    Cc: Rik van Riel
    Cc: Rusty Russell
    Cc: Seiji Aguchi
    Cc: Steven Rostedt
    Cc: Vladimir Davydov
    Cc: Yasuaki Ishimatsu
    Cc: linuxppc-dev@lists.ozlabs.org
    Link: http://lkml.kernel.org/r/1410527779-8133-2-git-send-email-atomlin@redhat.com
    Signed-off-by: Ingo Molnar

    Aaron Tomlin
     

08 Sep, 2014

1 commit

  • Both times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID) have scalability
    issues on large systems, due to both functions being serialized with a
    lock.

    The lock protects against reporting a wrong value, due to a thread in the
    task group exiting, its statistics reporting up to the signal struct, and
    that exited task's statistics being counted twice (or not at all).

    Protecting that with a lock results in times() and clock_gettime() being
    completely serialized on large systems.

    This can be fixed by using a seqlock around the events that gather and
    propagate statistics. As an additional benefit, the protection code can
    be moved into thread_group_cputime(), slightly simplifying the calling
    functions.

    In the case of posix_cpu_clock_get_task() things can be simplified a
    lot, because the calling function already ensures that the task sticks
    around, and the rest is now taken care of in thread_group_cputime().

    This way the statistics reporting code can run lockless.
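
    A rough user-space analogue of the retry-based reader described above
    (the names are illustrative; the kernel's seqlock_t uses finer-grained
    barriers than the sequentially consistent atomics used here):

      #include <stdatomic.h>
      #include <stdint.h>

      atomic_uint seq;                     /* odd while an update is in flight */
      _Atomic uint64_t utime, stime;       /* stand-ins for the group's cputime */

      void account_exit(uint64_t u, uint64_t s)        /* writer: a thread exits */
      {
              atomic_fetch_add(&seq, 1);               /* odd: readers will retry */
              atomic_store(&utime, u);
              atomic_store(&stime, s);
              atomic_fetch_add(&seq, 1);               /* even: snapshot stable */
      }

      void read_cputime_snapshot(uint64_t *u, uint64_t *s)   /* lockless reader */
      {
              unsigned int start;
              do {
                      start = atomic_load(&seq);
                      *u = atomic_load(&utime);
                      *s = atomic_load(&stime);
              } while ((start & 1) || atomic_load(&seq) != start);
      }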

    Signed-off-by: Rik van Riel
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alex Thorlton
    Cc: Andrew Morton
    Cc: Daeseok Youn
    Cc: David Rientjes
    Cc: Dongsheng Yang
    Cc: Geert Uytterhoeven
    Cc: Guillaume Morin
    Cc: Ionut Alexa
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Li Zefan
    Cc: Michal Hocko
    Cc: Michal Schmidt
    Cc: Oleg Nesterov
    Cc: Vladimir Davydov
    Cc: umgwanakikbuti@gmail.com
    Cc: fweisbec@gmail.com
    Cc: srao@redhat.com
    Cc: lwoodman@redhat.com
    Cc: atheurer@redhat.com
    Link: http://lkml.kernel.org/r/20140816134010.26a9b572@annuminas.surriel.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     

12 Aug, 2014

1 commit

  • Current upstream kernel hangs with mips and powerpc targets in
    uniprocessor mode if SECCOMP is configured.

    Bisect points to commit dbd952127d11 ("seccomp: introduce writer locking").
    It turns out that code such as
    BUG_ON(!spin_is_locked(&list_lock));
    cannot be used in uniprocessor mode because spin_is_locked() always
    returns false in this configuration, and that assert_spin_locked()
    exists for that very purpose and must be used instead.
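
    The fix pattern, sketched with the &list_lock identifier quoted above
    (illustrative, not the literal diff hunk):

      /* On CONFIG_SMP=n kernels spin_is_locked() always returns false,
       * so this assertion fires even when the lock is held: */
      BUG_ON(!spin_is_locked(&list_lock));

      /* assert_spin_locked() expresses the same invariant and compiles
       * away where the check cannot work: */
      assert_spin_locked(&list_lock);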

    Fixes: dbd952127d11 ("seccomp: introduce writer locking")
    Cc: Kees Cook
    Signed-off-by: Guenter Roeck
    Signed-off-by: Kees Cook

    Guenter Roeck
     

09 Aug, 2014

7 commits

  • This patch (of 6):

    The i_mmap_writable field counts existing writable mappings of an
    address_space. To allow drivers to prevent new writable mappings, make
    this counter signed and prevent new writable mappings if it is negative.
    This is modelled after i_writecount and DENYWRITE.

    This will be required by the shmem-sealing infrastructure to prevent any
    new writable mappings after the WRITE seal has been set. In case there
    exists a writable mapping, this operation will fail with EBUSY.

    Note that we rely on the fact that iff you already own a writable mapping,
    you can increase the counter without using the helpers. This is the same
    that we do for i_writecount.
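
    A rough user-space sketch of such a signed grab/deny counter, modelled
    on the description above (the helper names here are illustrative, not
    necessarily the ones added to the kernel):

      #include <stdatomic.h>
      #include <stdbool.h>

      atomic_int writable_count;      /* > 0: writable mappings, < 0: denied */

      /* Take a new writable mapping unless writers have been denied. */
      bool map_writable(void)
      {
              int v = atomic_load(&writable_count);
              while (v >= 0)
                      if (atomic_compare_exchange_weak(&writable_count, &v, v + 1))
                              return true;
              return false;           /* a seal/denial is in place: EPERM */
      }

      void unmap_writable(void)
      {
              atomic_fetch_sub(&writable_count, 1);
      }

      /* Deny future writable mappings unless one already exists. */
      bool deny_writable(void)
      {
              int v = atomic_load(&writable_count);
              while (v <= 0)
                      if (atomic_compare_exchange_weak(&writable_count, &v, v - 1))
                              return true;
              return false;           /* an existing writable mapping: EBUSY */
      }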

    Signed-off-by: David Herrmann
    Acked-by: Hugh Dickins
    Cc: Michael Kerrisk
    Cc: Ryan Lortie
    Cc: Lennart Poettering
    Cc: Daniel Mack
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Herrmann
     
  • This is a small set of patches our team has had kicking around for a few
    versions internally that fixes tasks getting hung on shm_exit when there
    are many threads hammering it at once.

    Anton wrote a simple test to cause the issue:

    http://ozlabs.org/~anton/junkcode/bust_shm_exit.c

    Before applying this patchset, this test code will cause either hanging
    tracebacks or pthread out of memory errors.

    After this patchset, it will still produce output like:

    root@somehost:~# ./bust_shm_exit 1024 160
    ...
    INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by 116, t=2111 jiffies, g=241, c=240, q=7113)
    INFO: Stall ended before state dump start
    ...

    But the task will continue to run along happily, so we consider this an
    improvement over hanging, even if it's a bit noisy.

    This patch (of 3):

    exit_shm obtains the ipc_ns shm rwsem for write and holds it while it
    walks every shared memory segment in the namespace. Thus the amount of
    work is related to the number of shm segments in the namespace not the
    number of segments that might need to be cleaned.

    In addition, this occurs after the task has been notified the thread has
    exited, so the number of tasks waiting for the ns shm rwsem can grow
    without bound until memory is exhausted.

    Add a list to the task struct of all shmids allocated by this task. Init
    the list head in copy_process. Use the ns->rwsem for locking. Add
    segments after id is added, remove before removing from id.

    On unshare of NEW_IPCNS orphan any ids as if the task had exited, similar
    to handling of semaphore undo.

    I chose a define for the init sequence since it's a simple list init;
    otherwise it would require a function call to avoid include loops between
    the semaphore code and the task struct. Converting the list_del to
    list_del_init for the unshare cases would remove the exit followed by
    init, but I left it to blow up if not initialized.

    Signed-off-by: Milton Miller
    Signed-off-by: Jack Miller
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Cc: Anton Blanchard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jack Miller
     
  • It's only used in fork.c:mm_init().

    Signed-off-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • If a forking process has a thread calling (un)mmap (silly but still),
    the child process may have some of its mm's vm usage counters (total_vm
    and friends) screwed up, because currently they are copied from oldmm
    w/o holding any locks (memcpy in dup_mm).

    This patch moves the counters initialization to dup_mmap() to be called
    under oldmm->mmap_sem, which eliminates any possibility of race.

    Signed-off-by: Vladimir Davydov
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • mm->pinned_vm counts pages of mm's address space that were permanently
    pinned in memory by increasing their reference counter. The counter was
    introduced by commit bc3e53f682d9 ("mm: distinguish between mlocked and
    pinned pages"), while before it locked_vm had been used for such pages.

    Obviously, we should reset the counter on fork if !CLONE_VM, just like
    we do with locked_vm, but currently we don't. Let's fix it.

    This patch will fix the contents of /proc/pid/status:VmPin.

    ib_umem_get[infiniband] and perf_mmap still check pinned_vm against
    RLIMIT_MEMLOCK. It's left from the times when pinned pages were accounted
    under locked_vm, but today it looks wrong. It isn't clear how we should
    deal with it.

    We still have some drivers accounting pinned pages under mm->locked_vm -
    this is what commit bc3e53f682d9 was fighting against. It's
    infiniband/usnic and vfio.

    Signed-off-by: Vladimir Davydov
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Christoph Lameter
    Cc: Roland Dreier
    Cc: Sean Hefty
    Cc: Hal Rosenstock
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • mm initialization on fork/exec is spread all over the place, which makes
    the code look inconsistent.

    We have mm_init(), which is supposed to init/nullify mm's internals, but
    it doesn't init all the fields it should:

    - on fork ->mmap, mm_rb, vmacache_seqnum, map_count, mm_cpumask, locked_vm
    are zeroed in dup_mmap();

    - on fork ->pmd_huge_pte is zeroed in dup_mm(), immediately before
    calling mm_init();

    - ->cpu_vm_mask_var ptr is initialized by mm_init_cpumask(), which is
    called before mm_init() on both fork and exec;

    - ->context is initialized by init_new_context(), which is called after
    mm_init() on both fork and exec;

    Let's consolidate all the initializations in mm_init() to make the code
    look cleaner.
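
    The consolidated initializer would then nullify those fields in one
    place; a hedged sketch of the intended shape (field names taken from the
    list above, not the exact upstream hunk):

      static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
      {
              mm->mmap = NULL;
              mm->mm_rb = RB_ROOT;
              mm->vmacache_seqnum = 0;
              mm->map_count = 0;
              mm->locked_vm = 0;
              cpumask_clear(mm_cpumask(mm));  /* previously zeroed in dup_mmap() */
              /* ... the pre-existing mm_init() body (mm_users, mmap_sem,
               *     pgd allocation, ...) continues here ... */
              return mm;
      }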

    Signed-off-by: Vladimir Davydov
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Pages are now uncharged at release time, and all sources of batched
    uncharges operate on lists of pages. Directly use those lists, and
    get rid of the per-task batching state.

    This also batches statistics accounting, in addition to the res
    counter charges, to reduce IRQ-disabling and re-enabling.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Cc: Naoya Horiguchi
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

06 Aug, 2014

2 commits

  • Pull security subsystem updates from James Morris:
    "In this release:

    - PKCS#7 parser for the key management subsystem from David Howells
    - appoint Kees Cook as seccomp maintainer
    - bugfixes and general maintenance across the subsystem"

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (94 commits)
    X.509: Need to export x509_request_asymmetric_key()
    netlabel: shorter names for the NetLabel catmap funcs/structs
    netlabel: fix the catmap walking functions
    netlabel: fix the horribly broken catmap functions
    netlabel: fix a problem when setting bits below the previously lowest bit
    PKCS#7: X.509 certificate issuer and subject are mandatory fields in the ASN.1
    tpm: simplify code by using %*phN specifier
    tpm: Provide a generic means to override the chip returned timeouts
    tpm: missing tpm_chip_put in tpm_get_random()
    tpm: Properly clean sysfs entries in error path
    tpm: Add missing tpm_do_selftest to ST33 I2C driver
    PKCS#7: Use x509_request_asymmetric_key()
    Revert "selinux: fix the default socket labeling in sock_graft()"
    X.509: x509_request_asymmetric_keys() doesn't need string length arguments
    PKCS#7: fix sparse non static symbol warning
    KEYS: revert encrypted key change
    ima: add support for measuring and appraising firmware
    firmware_class: perform new LSM checks
    security: introduce kernel_fw_from_file hook
    PKCS#7: Missing inclusion of linux/err.h
    ...

    Linus Torvalds
     
  • Pull timer and time updates from Thomas Gleixner:
    "A rather large update of timers, timekeeping & co

    - Core timekeeping code is year-2038 safe now for 32bit machines.
    Now we just need to fix all in kernel users and the gazillion of
    user space interfaces which rely on timespec/timeval :)

    - Better cache layout for the timekeeping internal data structures.

    - Proper nanosecond based interfaces for in kernel users.

    - Tree wide cleanup of code which wants nanoseconds but does hoops
    and loops to convert back and forth from timespecs. Some of it
    definitely belongs into the ugly code museum.

    - Consolidation of the timekeeping interface zoo.

    - A fast NMI safe accessor to clock monotonic for tracing. This is a
    long standing request to support correlated user/kernel space
    traces. With proper NTP frequency correction it's also suitable
    for correlation of traces across separate machines.

    - Checkpoint/restart support for timerfd.

    - A few NOHZ[_FULL] improvements in the [hr]timer code.

    - Code move from kernel to kernel/time of all time* related code.

    - New clocksource/event drivers from the ARM universe. I'm really
    impressed that despite an architected timer in the newer chips SoC
    manufacturers insist on inventing new and differently broken SoC
    specific timers.

    [ Ed. "Impressed"? I don't think that word means what you think it means ]

    - Another round of code move from arch to drivers. Looks like most
    of the legacy mess in ARM regarding timers is sorted out except for
    a few obnoxious strongholds.

    - The usual updates and fixlets all over the place"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
    timekeeping: Fixup typo in update_vsyscall_old definition
    clocksource: document some basic timekeeping concepts
    timekeeping: Use cached ntp_tick_length when accumulating error
    timekeeping: Rework frequency adjustments to work better w/ nohz
    timekeeping: Minor fixup for timespec64->timespec assignment
    ftrace: Provide trace clocks monotonic
    timekeeping: Provide fast and NMI safe access to CLOCK_MONOTONIC
    seqcount: Add raw_write_seqcount_latch()
    seqcount: Provide raw_read_seqcount()
    timekeeping: Use tk_read_base as argument for timekeeping_get_ns()
    timekeeping: Create struct tk_read_base and use it in struct timekeeper
    timekeeping: Restructure the timekeeper some more
    clocksource: Get rid of cycle_last
    clocksource: Move cycle_last validation to core code
    clocksource: Make delta calculation a function
    wireless: ath9k: Get rid of timespec conversions
    drm: vmwgfx: Use nsec based interfaces
    drm: i915: Use nsec based interfaces
    timekeeping: Provide ktime_get_raw()
    hangcheck-timer: Use ktime_get_ns()
    ...

    Linus Torvalds
     

19 Jul, 2014

1 commit

  • Normally, task_struct.seccomp.filter is only ever read or modified by
    the task that owns it (current). This property aids in fast access
    during system call filtering as read access is lockless.

    Updating the pointer from another task, however, opens up race
    conditions. To allow cross-thread filter pointer updates, writes to the
    seccomp fields are now protected by the sighand spinlock (which is shared
    by all threads in the thread group). Read access remains lockless because
    pointer updates themselves are atomic. However, writes (or cloning)
    often entail additional checking (like maximum instruction counts)
    which require locking to perform safely.

    In the case of cloning threads, the child is invisible to the system
    until it enters the task list. To make sure a child can't be cloned from
    a thread and left in a prior state, seccomp duplication is additionally
    moved under the sighand lock. Then parent and child are certain to have
    the same seccomp state when they exit the lock.

    Based on patches by Will Drewry and David Drysdale.
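
    The resulting discipline, locked writes with lockless reads of an
    atomically published pointer, can be sketched in plain C (a rough
    analogue of the scheme described above, not the kernel's seccomp code;
    the mutex stands in for the shared sighand spinlock):

      #include <pthread.h>
      #include <stdatomic.h>

      struct filter {
              struct filter *prev;            /* filters stack up */
              /* ... BPF program ... */
      };

      _Atomic(struct filter *) current_filter;                  /* read locklessly */
      pthread_mutex_t update_lock = PTHREAD_MUTEX_INITIALIZER;  /* "sighand lock" */

      /* Fast path: syscall filtering just loads the pointer, no lock taken. */
      struct filter *get_filter(void)
      {
              return atomic_load_explicit(&current_filter, memory_order_acquire);
      }

      /* Slow path: validation and publication serialize on the shared lock,
       * so updates from other threads in the group cannot race. */
      void attach_filter(struct filter *new_filter)
      {
              pthread_mutex_lock(&update_lock);
              new_filter->prev = atomic_load_explicit(&current_filter,
                                                      memory_order_relaxed);
              /* ... instruction-count and other checks happen here ... */
              atomic_store_explicit(&current_filter, new_filter,
                                    memory_order_release);
              pthread_mutex_unlock(&update_lock);
      }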

    Signed-off-by: Kees Cook
    Reviewed-by: Oleg Nesterov
    Reviewed-by: Andy Lutomirski

    Kees Cook
     

21 Jun, 2014

1 commit

  • syscall_regfunc() and syscall_unregfunc() should set/clear
    TIF_SYSCALL_TRACEPOINT system-wide, but do_each_thread() can race
    with copy_process() and miss the new child which was not added to
    the process/thread lists yet.

    Change copy_process() to update the child's TIF_SYSCALL_TRACEPOINT
    under tasklist.

    Link: http://lkml.kernel.org/p/20140413185854.GB20668@redhat.com

    Cc: stable@vger.kernel.org # 2.6.33
    Fixes: a871bd33a6c0 "tracing: Add syscall tracepoints"
    Acked-by: Frederic Weisbecker
    Acked-by: Paul E. McKenney
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Steven Rostedt

    Oleg Nesterov
     

12 Jun, 2014

1 commit

  • do_posix_clock_monotonic_gettime() is a leftover from the initial
    posix timer implementation which maps to ktime_get_ts().
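
    For callers the conversion is mechanical; code that read as the first
    line below becomes the second (ktime_get_ts() is what the old wrapper
    expanded to):

      struct timespec ts;

      do_posix_clock_monotonic_gettime(&ts);   /* old wrapper, being removed */
      ktime_get_ts(&ts);                       /* direct replacement */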

    Signed-off-by: Thomas Gleixner
    Cc: John Stultz
    Cc: Peter Zijlstra
    Cc: Oleg Nesterov
    Link: http://lkml.kernel.org/r/20140611234607.427408044@linutronix.de
    Signed-off-by: Thomas Gleixner
    Cc: Oleg Nesterov

    Thomas Gleixner
     

07 Jun, 2014

1 commit

  • When tracing a process in another pid namespace, it's important for fork
    event messages to contain the child's pid as seen from the tracer's pid
    namespace, not the parent's. Otherwise, the tracer won't be able to
    correlate the fork event with later SIGTRAP signals it receives from the
    child.

    We still risk a race condition if a ptracer from a different pid
    namespace attaches after we compute the pid_t value. However, sending a
    bogus fork event message in this unlikely scenario is still a vast
    improvement over the status quo where we always send bogus fork event
    messages to debuggers in a different pid namespace than the forking
    process.

    Signed-off-by: Matthew Dempsky
    Acked-by: Oleg Nesterov
    Cc: Kees Cook
    Cc: Julien Tinnes
    Cc: Roland McGrath
    Cc: Jan Kratochvil
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Dempsky
     

05 Jun, 2014

2 commits

  • CONFIG_MM_OWNER makes no sense. It is not user-selectable, it is only
    selected by CONFIG_MEMCG automatically. So we can kill this option in
    init/Kconfig and do s/CONFIG_MM_OWNER/CONFIG_MEMCG/ globally.

    Signed-off-by: Oleg Nesterov
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Currently to allocate a page that should be charged to kmemcg (e.g.
    threadinfo), we pass __GFP_KMEMCG flag to the page allocator. The page
    allocated is then to be freed by free_memcg_kmem_pages. Apart from
    looking asymmetrical, this also requires intrusion to the general
    allocation path. So let's introduce separate functions that will
    alloc/free pages charged to kmemcg.

    The new functions are called alloc_kmem_pages and free_kmem_pages. They
    should be used when the caller actually would like to use kmalloc, but
    has to fall back to the page allocator because the allocation is large.
    They only differ from alloc_pages and free_pages in that besides
    allocating or freeing pages they also charge them to the kmem resource
    counter of the current memory cgroup.
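
    A hedged sketch of a caller using the new pair (the order and flags are
    illustrative, and the exact signatures should be checked against the
    tree):

      unsigned int order = 2;                 /* 2^2 = 4 pages, e.g. a stack */
      struct page *page;
      unsigned long addr;

      page = alloc_kmem_pages(GFP_KERNEL, order);   /* charged to current memcg */
      if (!page)
              return -ENOMEM;
      addr = (unsigned long)page_address(page);

      /* ... use the buffer ... */

      free_kmem_pages(addr, order);                 /* uncharges and frees */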

    [sfr@canb.auug.org.au: export kmalloc_order() to modules]
    Signed-off-by: Vladimir Davydov
    Acked-by: Greg Thelen
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Glauber Costa
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

08 Apr, 2014

5 commits

  • To increase compiler portability there is <linux/compiler.h>, which
    provides convenience macros for various gcc constructs. Eg: __weak for
    __attribute__((weak)). I've replaced all instances of gcc attributes
    with the right macro in the kernel subsystem.

    Signed-off-by: Gideon Israel Dsouza
    Cc: "Rafael J. Wysocki"
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gideon Israel Dsouza
     
  • PF_MEMPOLICY is an unnecessary optimization for CONFIG_SLAB users.
    There's no significant performance degradation to checking
    current->mempolicy rather than current->flags & PF_MEMPOLICY in the
    allocation path, especially since this is considered unlikely().

    Running TCP_RR with netperf-2.4.5 through localhost on a 16 cpu machine with
    64GB of memory and without a mempolicy:

    threads    before     after
    16         1249409    1244487
    32         1281786    1246783
    48         1239175    1239138
    64         1244642    1241841
    80         1244346    1248918
    96         1266436    1254316
    112        1307398    1312135
    128        1327607    1326502

    Per-process flags are a scarce resource so we should free them up whenever
    possible and make them available. We'll be using it shortly for memcg oom
    reserves.

    Signed-off-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Jianguo Wu
    Cc: Tim Hockin
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • copy_flags() does not use the clone_flags formal and can be collapsed
    into copy_process() for cleaner code.

    Signed-off-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Jianguo Wu
    Cc: Tim Hockin
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • This patch is a continuation of efforts trying to optimize find_vma(),
    avoiding potentially expensive rbtree walks to locate a vma upon faults.
    The original approach (https://lkml.org/lkml/2013/11/1/410), where the
    largest vma was also cached, ended up being too specific and random,
    thus further comparison with other approaches was needed. There are
    two things to consider when dealing with this: the cache hit rate and
    the latency of find_vma(). Improving the hit-rate does not necessarily
    translate into finding the vma any faster, as the overhead of any fancy
    caching schemes can be too high to consider.

    We currently cache the last used vma for the whole address space, which
    provides a nice optimization, reducing the total cycles in find_vma() by
    up to 250%, for workloads with good locality. On the other hand, this
    simple scheme is pretty much useless for workloads with poor locality.
    Analyzing ebizzy runs shows that, no matter how many threads are
    running, the mmap_cache hit rate is less than 2%, and in many situations
    below 1%.

    The proposed approach is to replace this scheme with a small per-thread
    cache, maximizing hit rates at a very low maintenance cost.
    Invalidations are performed by simply bumping up a 32-bit sequence
    number. The only expensive operation is in the rare case of a seq
    number overflow, where all caches that share the same address space are
    flushed. Upon a miss, the proposed replacement policy is based on the
    page number that contains the virtual address in question. Concretely,
    the following results are seen on an 80 core, 8 socket x86-64 box:

    1) System bootup: Most programs are single threaded, so the per-thread
    scheme does improve ~50% hit rate by just adding a few more slots to
    the cache.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       | 50.61%   | 19.90            |
    | patched        | 73.45%   | 13.58            |
    +----------------+----------+------------------+

    2) Kernel build: This one is already pretty good with the current
    approach as we're dealing with good locality.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       | 75.28%   | 11.03            |
    | patched        | 88.09%   | 9.31             |
    +----------------+----------+------------------+

    3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       | 70.66%   | 17.14            |
    | patched        | 91.15%   | 12.57            |
    +----------------+----------+------------------+

    4) Ebizzy: There's a fair amount of variation from run to run, but this
    approach always shows nearly perfect hit rates, while baseline is just
    about non-existent. The cycle counts can fluctuate anywhere from ~60
    to ~116 for the baseline scheme, but this approach reduces them
    considerably. For instance, with 80 threads:

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       | 1.06%    | 91.54            |
    | patched        | 99.97%   | 14.18            |
    +----------------+----------+------------------+
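
    The mechanics of the per-thread cache can be sketched outside the
    kernel: a tiny direct-mapped cache indexed by page number, invalidated
    by bumping a shared sequence number. Sizes and names below are
    illustrative, not the kernel's:

      #include <stddef.h>
      #include <stdint.h>

      #define CACHE_SLOTS 4
      #define PAGE_SHIFT  12

      struct vma { uintptr_t vm_start, vm_end; };   /* stand-in for vm_area_struct */

      struct mm {                      /* shared by all threads of a process */
              uint32_t seqnum;         /* bumped to invalidate every thread's cache */
              /* ... rbtree of struct vma ... */
      };

      struct thread_cache {            /* one per thread */
              uint32_t seqnum;         /* mm->seqnum at last validation */
              struct vma *slot[CACHE_SLOTS];
      };

      static size_t slot_index(uintptr_t addr)
      {
              return (addr >> PAGE_SHIFT) & (CACHE_SLOTS - 1);
      }

      static void vmacache_invalidate(struct mm *mm)
      {
              if (++mm->seqnum == 0) {
                      /* rare 32-bit overflow: flush all caches sharing this mm */
              }
      }

      static struct vma *find_vma_cached(struct mm *mm, struct thread_cache *tc,
                                         uintptr_t addr)
      {
              if (tc->seqnum != mm->seqnum) {        /* stale: drop everything */
                      for (size_t i = 0; i < CACHE_SLOTS; i++)
                              tc->slot[i] = NULL;
                      tc->seqnum = mm->seqnum;
              }

              struct vma *vma = tc->slot[slot_index(addr)];
              if (vma && addr >= vma->vm_start && addr < vma->vm_end)
                      return vma;                    /* hit: no rbtree walk */

              vma = NULL;   /* miss: walk mm's rbtree here, then cache the result */
              tc->slot[slot_index(addr)] = vma;
              return vma;
      }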

    [akpm@linux-foundation.org: fix nommu build, per Davidlohr]
    [akpm@linux-foundation.org: document vmacache_valid() logic]
    [akpm@linux-foundation.org: attempt to untangle header files]
    [akpm@linux-foundation.org: add vmacache_find() BUG_ON]
    [hughd@google.com: add vmacache_valid_mm() (from Oleg)]
    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: adjust and enhance comments]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Acked-by: Linus Torvalds
    Reviewed-by: Michel Lespinasse
    Cc: Oleg Nesterov
    Tested-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Add VM_INIT_DEF_MASK, to allow us to set the default flags for VMs. It
    also adds a prctl control which allows us to set the THP disable bit in
    mm->def_flags so that VMs will pick up the setting as they are created.
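
    From user space the new knob is reached through prctl(); a minimal
    example (PR_SET_THP_DISABLE/PR_GET_THP_DISABLE need a kernel with this
    patch, and older linux/prctl.h headers may not define them yet):

      #include <stdio.h>
      #include <sys/prctl.h>

      #ifndef PR_SET_THP_DISABLE
      #define PR_SET_THP_DISABLE 41
      #define PR_GET_THP_DISABLE 42
      #endif

      int main(void)
      {
              /* Disable THP for mappings created by this process from now on. */
              if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0))
                      perror("PR_SET_THP_DISABLE");

              printf("THP disabled: %d\n", prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0));
              return 0;
      }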

    Signed-off-by: Alex Thorlton
    Suggested-by: Oleg Nesterov
    Cc: Gerald Schaefer
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Christian Borntraeger
    Cc: Paolo Bonzini
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Alexander Viro
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Paolo Bonzini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Thorlton
     

04 Apr, 2014

1 commit

  • Pull cgroup updates from Tejun Heo:
    "A lot updates for cgroup:

    - The biggest one is cgroup's conversion to kernfs. cgroup took
    after the long abandoned vfs-entangled sysfs implementation and
    made it even more convoluted over time. cgroup's internal objects
    were fused with vfs objects which also brought in vfs locking and
    object lifetime rules. Naturally, there are places where vfs rules
    don't fit and nasty hacks, such as credential switching or lock
    dance interleaving inode mutex and cgroup_mutex with object serial
    number comparison thrown in to decide whether the operation is
    actually necessary, needed to be employed.

    After conversion to kernfs, internal object lifetime and locking
    rules are mostly isolated from vfs interactions allowing shedding
    of several nasty hacks and overall simplification. This will also
    allow implmentation of operations which may affect multiple cgroups
    which weren't possible before as it would have required nesting
    i_mutexes.

    - Various simplifications including dropping of module support,
    easier cgroup name/path handling, simplified cgroup file type
    handling and task_cg_lists optimization.

    - Preparatory changes for the planned unified hierarchy, which is still
    a patchset away from being actually operational. The dummy
    hierarchy is updated to serve as the default unified hierarchy.
    Controllers which aren't claimed by other hierarchies are
    associated with it, which BTW was what the dummy hierarchy was for
    anyway.

    - Various fixes from Li and others. This pull request includes some
    patches to add missing slab.h to various subsystems. This was
    triggered by the xattr.h include removal from cgroup.h. cgroup.h
    indirectly got included into a lot of files, which brought in xattr.h,
    which brought in slab.h.

    There are several merge commits - one to pull in kernfs updates
    necessary for converting cgroup (already in upstream through
    driver-core), others for interfering changes in the fixes branch"

    * 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (74 commits)
    cgroup: remove useless argument from cgroup_exit()
    cgroup: fix spurious lockdep warning in cgroup_exit()
    cgroup: Use RCU_INIT_POINTER(x, NULL) in cgroup.c
    cgroup: break kernfs active_ref protection in cgroup directory operations
    cgroup: fix cgroup_taskset walking order
    cgroup: implement CFTYPE_ONLY_ON_DFL
    cgroup: make cgrp_dfl_root mountable
    cgroup: drop const from @buffer of cftype->write_string()
    cgroup: rename cgroup_dummy_root and related names
    cgroup: move ->subsys_mask from cgroupfs_root to cgroup
    cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding
    cgroup: remove NULL checks from [pr_cont_]cgroup_{name|path}()
    cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root
    cgroup: reorganize cgroup bootstrapping
    cgroup: relocate setting of CGRP_DEAD
    cpuset: use rcu_read_lock() to protect task_cs()
    cgroup_freezer: document freezer_fork() subtleties
    cgroup: update cgroup_transfer_tasks() to either succeed or fail
    cgroup: drop task_lock() protection around task->cgroups
    cgroup: update how a newly forked task gets associated with css_set
    ...

    Linus Torvalds
     

29 Mar, 2014

1 commit

  • cgroup_exit() is called in fork and exit path. If it's called in the
    failure path during fork, PF_EXITING isn't set, and then lockdep will
    complain.

    Fix this by removing cgroup_exit() in that failure path. cgroup_fork()
    does nothing that needs cleanup.

    Reported-by: Sasha Levin
    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     

11 Mar, 2014

1 commit

  • Bad idea on -rt:

    [ 908.026136] [] rt_spin_lock_slowlock+0xaa/0x2c0
    [ 908.026145] [] task_numa_free+0x31/0x130
    [ 908.026151] [] finish_task_switch+0xce/0x100
    [ 908.026156] [] thread_return+0x48/0x4ae
    [ 908.026160] [] schedule+0x25/0xa0
    [ 908.026163] [] rt_spin_lock_slowlock+0xd5/0x2c0
    [ 908.026170] [] get_signal_to_deliver+0xaf/0x680
    [ 908.026175] [] do_signal+0x3d/0x5b0
    [ 908.026179] [] do_notify_resume+0x90/0xe0
    [ 908.026186] [] int_signal+0x12/0x17
    [ 908.026193] [] 0x7ff2a388b1cf

    and since upstream does not mind where we do this, be a bit nicer ...

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Cc: Mel Gorman
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1393568591.6018.27.camel@marge.simpson.net
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     

22 Jan, 2014

1 commit

  • while_each_thread() and next_thread() should die, almost every lockless
    usage is wrong.

    1. Unless g == current, the lockless while_each_thread() is not safe.

    while_each_thread(g, t) can loop forever if g exits, next_thread()
    can't reach the unhashed thread in this case. Note that this can
    happen even if g is the group leader, it can exec.

    2. Even if while_each_thread() itself was correct, people often use
    it wrongly.

    It was never safe to just take rcu_read_lock() and loop unless
    you verify that pid_alive(g) == T, even the first next_thread()
    can point to the already freed/reused memory.

    This patch adds signal_struct->thread_head and task->thread_node to
    create the normal rcu-safe list with the stable head. The new
    for_each_thread(g, t) helper is always safe under rcu_read_lock() as
    long as this task_struct can't go away.

    Note: of course it is ugly to have both task_struct->thread_node and the
    old task_struct->thread_group, we will kill it later, after we change
    the users of while_each_thread() to use for_each_thread().

    Perhaps we can kill it even before we convert all users, we can
    reimplement next_thread(t) using the new thread_head/thread_node. But
    we can't do this right now because this will lead to subtle behavioural
    changes. For example, do/while_each_thread() always sees at least one
    task, while for_each_thread() can do nothing if the whole thread group
    has died. Or thread_group_empty(), currently its semantics is not clear
    unless thread_group_leader(p) and we need to audit the callers before we
    can change it.

    So this patch adds the new interface which has to coexist with the old
    one for some time, hopefully the next changes will be more or less
    straightforward and the old one will go away soon.
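
    Usage keeps the familiar shape; an illustrative conversion (not a hunk
    from this series, and do_something() is a placeholder):

      /* old: only safe with the right locks held, and often used wrongly */
      do_each_thread(g, t) {
              do_something(t);
      } while_each_thread(g, t);

      /* new: safe under rcu_read_lock() as long as 'g' itself cannot go away */
      rcu_read_lock();
      for_each_thread(g, t)
              do_something(t);
      rcu_read_unlock();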

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Sergey Dyasly
    Tested-by: Sergey Dyasly
    Reviewed-by: Sameer Nanda
    Acked-by: David Rientjes
    Cc: "Eric W. Biederman"
    Cc: Frederic Weisbecker
    Cc: Mandeep Singh Baines
    Cc: "Ma, Xindong"
    Cc: Michal Hocko
    Cc: "Tu, Xiaobing"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

21 Jan, 2014

1 commit

  • Pull scheduler changes from Ingo Molnar:

    - Add the initial implementation of SCHED_DEADLINE support: a real-time
    scheduling policy where tasks that meet their deadlines and
    periodically execute their instances in less than their runtime quota
    see real-time scheduling and won't miss any of their deadlines.
    Tasks that go over their quota get delayed (Available to privileged
    users for now; a short user-space sketch follows this list)

    - Clean up and fix preempt_enable_no_resched() abuse all around the
    tree

    - Do sched_clock() performance optimizations on x86 and elsewhere

    - Fix and improve auto-NUMA balancing

    - Fix and clean up the idle loop

    - Apply various cleanups and fixes
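
    As referenced in the SCHED_DEADLINE item above, the policy is requested
    through the new sched_setattr() system call; a hedged user-space sketch
    (the syscall number is for x86_64, and the struct layout mirrors the
    interface added in this cycle):

      #include <stdint.h>
      #include <stdio.h>
      #include <unistd.h>
      #include <sys/syscall.h>

      #ifndef SCHED_DEADLINE
      #define SCHED_DEADLINE 6
      #endif
      #ifndef __NR_sched_setattr
      #define __NR_sched_setattr 314          /* x86_64 */
      #endif

      struct sched_attr {
              uint32_t size;
              uint32_t sched_policy;
              uint64_t sched_flags;
              int32_t  sched_nice;            /* SCHED_NORMAL/BATCH */
              uint32_t sched_priority;        /* SCHED_FIFO/RR */
              uint64_t sched_runtime;         /* SCHED_DEADLINE, nanoseconds */
              uint64_t sched_deadline;
              uint64_t sched_period;
      };

      int main(void)
      {
              struct sched_attr attr = {
                      .size           = sizeof(attr),
                      .sched_policy   = SCHED_DEADLINE,
                      .sched_runtime  = 10 * 1000 * 1000,   /* 10 ms of CPU ... */
                      .sched_deadline = 30 * 1000 * 1000,   /* ... within 30 ms */
                      .sched_period   = 30 * 1000 * 1000,   /* ... every 30 ms  */
              };

              /* Requires privilege; tasks exceeding their runtime get throttled. */
              if (syscall(__NR_sched_setattr, 0, &attr, 0))
                      perror("sched_setattr");
              return 0;
      }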

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
    sched: Fix __sched_setscheduler() nice test
    sched: Move SCHED_RESET_ON_FORK into attr::sched_flags
    sched: Fix up attr::sched_priority warning
    sched: Fix up scheduler syscall LTP fails
    sched: Preserve the nice level over sched_setscheduler() and sched_setparam() calls
    sched/core: Fix htmldocs warnings
    sched/deadline: No need to check p if dl_se is valid
    sched/deadline: Remove unused variables
    sched/deadline: Fix sparse static warnings
    m68k: Fix build warning in mac_via.h
    sched, thermal: Clean up preempt_enable_no_resched() abuse
    sched, net: Fixup busy_loop_us_clock()
    sched, net: Clean up preempt_enable_no_resched() abuse
    sched/preempt: Fix up missed PREEMPT_NEED_RESCHED folding
    sched/preempt, locking: Rework local_bh_{dis,en}able()
    sched/clock, x86: Avoid a runtime condition in native_sched_clock()
    sched/clock: Fix up clear_sched_clock_stable()
    sched/clock, x86: Use a static_key for sched_clock_stable
    sched/clock: Remove local_irq_disable() from the clocks
    sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs
    ...

    Linus Torvalds