08 Apr, 2014

5 commits

  • To increase compiler portability there is <linux/compiler.h>, which
    provides convenience macros for various gcc constructs, e.g. __weak for
    __attribute__((weak)). I've replaced all instances of gcc attributes
    with the right macro in the kernel subsystem (see the sketch after this
    entry).

    Signed-off-by: Gideon Israel Dsouza
    Cc: "Rafael J. Wysocki"
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gideon Israel Dsouza
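
    A hedged illustration of the substitution described above (the function
    names here are made up; only the macro expansion matters):

    /* What <linux/compiler.h> provides for gcc builds (via compiler-gcc.h). */
    #define __weak __attribute__((weak))

    /* Before the cleanup: open-coded gcc attribute syntax. */
    void __attribute__((weak)) example_hook(void) { }

    /* After the cleanup: the portable convenience macro. */
    void __weak other_example_hook(void) { }

    int main(void) { example_hook(); other_example_hook(); return 0; }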
     
  • PF_MEMPOLICY is an unnecessary optimization for CONFIG_SLAB users.
    There's no significant performance degradation to checking
    current->mempolicy rather than current->flags & PF_MEMPOLICY in the
    allocation path, especially since this is considered unlikely().

    Running TCP_RR with netperf-2.4.5 through localhost on 16 cpu machine with
    64GB of memory and without a mempolicy:

    threads      before       after
         16     1249409     1244487
         32     1281786     1246783
         48     1239175     1239138
         64     1244642     1241841
         80     1244346     1248918
         96     1266436     1254316
        112     1307398     1312135
        128     1327607     1326502

    Per-process flags are a scarce resource so we should free them up whenever
    possible and make them available. We'll be using it shortly for memcg oom
    reserves.

    Signed-off-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Jianguo Wu
    Cc: Tim Hockin
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
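
    A hedged sketch of the fast-path change described above (simplified; the
    real test sits in the CONFIG_SLAB allocation path and the helper name
    here is invented):

    static inline bool want_alternate_node_alloc(struct task_struct *tsk)
    {
            /* before: return unlikely(tsk->flags & PF_MEMPOLICY);  */
            /* after: no per-process flag, just test the pointer.   */
            return unlikely(tsk->mempolicy != NULL);
    }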
     
  • copy_flags() does not use the clone_flags formal and can be collapsed
    into copy_process() for cleaner code.

    Signed-off-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Jianguo Wu
    Cc: Tim Hockin
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • This patch is a continuation of efforts trying to optimize find_vma(),
    avoiding potentially expensive rbtree walks to locate a vma upon faults.
    The original approach (https://lkml.org/lkml/2013/11/1/410), where the
    largest vma was also cached, ended up being too specific and random,
    thus further comparison with other approaches was needed. There are
    two things to consider when dealing with this: the cache hit rate and
    the latency of find_vma(). Improving the hit-rate does not necessarily
    translate into finding the vma any faster, as the overhead of any fancy
    caching scheme can be too high to consider.

    We currently cache the last used vma for the whole address space, which
    provides a nice optimization, reducing the total cycles in find_vma() by
    up to 250%, for workloads with good locality. On the other hand, this
    simple scheme is pretty much useless for workloads with poor locality.
    Analyzing ebizzy runs shows that, no matter how many threads are
    running, the mmap_cache hit rate is less than 2%, and in many situations
    below 1%.

    The proposed approach is to replace this scheme with a small per-thread
    cache, maximizing hit rates at a very low maintenance cost.
    Invalidations are performed by simply bumping up a 32-bit sequence
    number. The only expensive operation is in the rare case of a seq
    number overflow, where all caches that share the same address space are
    flushed. Upon a miss, the proposed replacement policy is based on the
    page number that contains the virtual address in question. Concretely,
    the following results are seen on an 80 core, 8 socket x86-64 box:

    1) System bootup: Most programs are single threaded, so the per-thread
    scheme improves the ~50% hit rate just by adding a few more slots to
    the cache.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |  50.61%  |      19.90       |
    | patched        |  73.45%  |      13.58       |
    +----------------+----------+------------------+

    2) Kernel build: This one is already pretty good with the current
    approach as we're dealing with good locality.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |  75.28%  |      11.03       |
    | patched        |  88.09%  |       9.31       |
    +----------------+----------+------------------+

    3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |  70.66%  |      17.14       |
    | patched        |  91.15%  |      12.57       |
    +----------------+----------+------------------+

    4) Ebizzy: There's a fair amount of variation from run to run, but this
    approach always shows nearly perfect hit rates, while the baseline's are
    just about non-existent. The cycle counts can fluctuate anywhere from
    ~60 to ~116 billion for the baseline scheme, but this approach reduces
    them considerably. For instance, with 80 threads:

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |   1.06%  |      91.54       |
    | patched        |  99.97%  |      14.18       |
    +----------------+----------+------------------+

    [akpm@linux-foundation.org: fix nommu build, per Davidlohr]
    [akpm@linux-foundation.org: document vmacache_valid() logic]
    [akpm@linux-foundation.org: attempt to untangle header files]
    [akpm@linux-foundation.org: add vmacache_find() BUG_ON]
    [hughd@google.com: add vmacache_valid_mm() (from Oleg)]
    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: adjust and enhance comments]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Acked-by: Linus Torvalds
    Reviewed-by: Michel Lespinasse
    Cc: Oleg Nesterov
    Tested-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
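
    A hedged sketch of the per-thread cache described above (names and layout
    are illustrative; the real code lives in mm/vmacache.c):

    #define VMACACHE_SLOTS 4        /* small enough to live in task_struct */

    struct thread_vmacache {
            unsigned int seqnum;                 /* snapshot of the mm's seqnum */
            struct vm_area_struct *vmas[VMACACHE_SLOTS];
    };

    /* Lookup/replacement slot: hash on the page number holding the address. */
    static inline unsigned int vmacache_slot(unsigned long addr)
    {
            return (addr >> PAGE_SHIFT) & (VMACACHE_SLOTS - 1);
    }

    /* Invalidation just bumps the mm-wide 32-bit sequence number; a mismatch
     * at lookup time means all slots are stale.  Only on seqnum overflow are
     * the caches of every thread sharing the mm explicitly flushed. */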
     
  • Adds VM_INIT_DEF_MASK, to allow us to set the default flags for VMAs, and
    adds a prctl control which allows us to set the THP disable bit in
    mm->def_flags so that VMAs will pick up the setting as they are created
    (see the usage sketch after this entry).

    Signed-off-by: Alex Thorlton
    Suggested-by: Oleg Nesterov
    Cc: Gerald Schaefer
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Christian Borntraeger
    Cc: Paolo Bonzini
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Alexander Viro
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Paolo Bonzini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Thorlton
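
    A hedged userspace sketch of the new control described above (constants
    per include/uapi/linux/prctl.h; the fallback defines cover older headers):

    #include <stdio.h>
    #include <sys/prctl.h>

    #ifndef PR_SET_THP_DISABLE
    #define PR_SET_THP_DISABLE 41
    #endif
    #ifndef PR_GET_THP_DISABLE
    #define PR_GET_THP_DISABLE 42
    #endif

    int main(void)
    {
            /* Set the THP disable bit in mm->def_flags; every VMA created by
             * this task from now on inherits it. */
            if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0) != 0)
                    perror("PR_SET_THP_DISABLE");

            printf("THP disabled for new mappings: %ld\n",
                   (long)prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0));
            return 0;
    }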
     

04 Apr, 2014

1 commit

  • Pull cgroup updates from Tejun Heo:
    "A lot updates for cgroup:

    - The biggest one is cgroup's conversion to kernfs. cgroup took
    after the long abandoned vfs-entangled sysfs implementation and
    made it even more convoluted over time. cgroup's internal objects
    were fused with vfs objects which also brought in vfs locking and
    object lifetime rules. Naturally, there are places where vfs rules
    don't fit and nasty hacks, such as credential switching or lock
    dance interleaving inode mutex and cgroup_mutex with object serial
    number comparison thrown in to decide whether the operation is
    actually necessary, needed to be employed.

    After conversion to kernfs, internal object lifetime and locking
    rules are mostly isolated from vfs interactions allowing shedding
    of several nasty hacks and overall simplification. This will also
    allow implementation of operations which may affect multiple cgroups
    which weren't possible before as it would have required nesting
    i_mutexes.

    - Various simplifications including dropping of module support,
    easier cgroup name/path handling, simplified cgroup file type
    handling and task_cg_lists optimization.

    - Preparatory changes for the planned unified hierarchy, which is still
    a patchset away from being actually operational. The dummy
    hierarchy is updated to serve as the default unified hierarchy.
    Controllers which aren't claimed by other hierarchies are
    associated with it, which BTW was what the dummy hierarchy was for
    anyway.

    - Various fixes from Li and others. This pull request includes some
    patches to add missing slab.h to various subsystems. This was
    triggered by the xattr.h include removal from cgroup.h. cgroup.h
    indirectly got included into a lot of files which brought in xattr.h,
    which brought in slab.h.

    There are several merge commits - one to pull in kernfs updates
    necessary for converting cgroup (already in upstream through
    driver-core), others for interfering changes in the fixes branch"

    * 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (74 commits)
    cgroup: remove useless argument from cgroup_exit()
    cgroup: fix spurious lockdep warning in cgroup_exit()
    cgroup: Use RCU_INIT_POINTER(x, NULL) in cgroup.c
    cgroup: break kernfs active_ref protection in cgroup directory operations
    cgroup: fix cgroup_taskset walking order
    cgroup: implement CFTYPE_ONLY_ON_DFL
    cgroup: make cgrp_dfl_root mountable
    cgroup: drop const from @buffer of cftype->write_string()
    cgroup: rename cgroup_dummy_root and related names
    cgroup: move ->subsys_mask from cgroupfs_root to cgroup
    cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding
    cgroup: remove NULL checks from [pr_cont_]cgroup_{name|path}()
    cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root
    cgroup: reorganize cgroup bootstrapping
    cgroup: relocate setting of CGRP_DEAD
    cpuset: use rcu_read_lock() to protect task_cs()
    cgroup_freezer: document freezer_fork() subtleties
    cgroup: update cgroup_transfer_tasks() to either succeed or fail
    cgroup: drop task_lock() protection around task->cgroups
    cgroup: update how a newly forked task gets associated with css_set
    ...

    Linus Torvalds
     

29 Mar, 2014

1 commit

  • cgroup_exit() is called in fork and exit path. If it's called in the
    failure path during fork, PF_EXITING isn't set, and then lockdep will
    complain.

    Fix this by removing cgroup_exit() in that failure path. cgroup_fork()
    does nothing that needs cleanup.

    Reported-by: Sasha Levin
    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     

11 Mar, 2014

1 commit

  • Bad idea on -rt:

    [ 908.026136] [] rt_spin_lock_slowlock+0xaa/0x2c0
    [ 908.026145] [] task_numa_free+0x31/0x130
    [ 908.026151] [] finish_task_switch+0xce/0x100
    [ 908.026156] [] thread_return+0x48/0x4ae
    [ 908.026160] [] schedule+0x25/0xa0
    [ 908.026163] [] rt_spin_lock_slowlock+0xd5/0x2c0
    [ 908.026170] [] get_signal_to_deliver+0xaf/0x680
    [ 908.026175] [] do_signal+0x3d/0x5b0
    [ 908.026179] [] do_notify_resume+0x90/0xe0
    [ 908.026186] [] int_signal+0x12/0x17
    [ 908.026193] [] 0x7ff2a388b1cf

    and since upstream does not mind where we do this, be a bit nicer ...

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Cc: Mel Gorman
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1393568591.6018.27.camel@marge.simpson.net
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     

24 Jan, 2014

4 commits


22 Jan, 2014

1 commit

  • while_each_thread() and next_thread() should die; almost every lockless
    usage is wrong.

    1. Unless g == current, the lockless while_each_thread() is not safe.

    while_each_thread(g, t) can loop forever if g exits; next_thread()
    can't reach the unhashed thread in this case. Note that this can
    happen even if g is the group leader, since it can exec.

    2. Even if while_each_thread() itself was correct, people often use
    it wrongly.

    It was never safe to just take rcu_read_lock() and loop unless
    you verify that pid_alive(g) == T; even the first next_thread()
    can point to already freed/reused memory.

    This patch adds signal_struct->thread_head and task->thread_node to
    create the normal rcu-safe list with the stable head. The new
    for_each_thread(g, t) helper is always safe under rcu_read_lock() as
    long as this task_struct can't go away.

    Note: of course it is ugly to have both task_struct->thread_node and the
    old task_struct->thread_group, we will kill it later, after we change
    the users of while_each_thread() to use for_each_thread().

    Perhaps we can kill it even before we convert all users, we can
    reimplement next_thread(t) using the new thread_head/thread_node. But
    we can't do this right now because this will lead to subtle behavioural
    changes. For example, do/while_each_thread() always sees at least one
    task, while for_each_thread() can do nothing if the whole thread group
    has died. Or thread_group_empty(): currently its semantics are not clear
    unless thread_group_leader(p), and we need to audit the callers before we
    can change it.

    So this patch adds the new interface which has to coexist with the old
    one for some time, hopefully the next changes will be more or less
    straightforward and the old one will go away soon.

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Sergey Dyasly
    Tested-by: Sergey Dyasly
    Reviewed-by: Sameer Nanda
    Acked-by: David Rientjes
    Cc: "Eric W. Biederman"
    Cc: Frederic Weisbecker
    Cc: Mandeep Singh Baines
    Cc: "Ma, Xindong"
    Cc: Michal Hocko
    Cc: "Tu, Xiaobing"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
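
    A hedged sketch of the new iteration pattern (the helper is the
    for_each_thread() added by this patch; the counting function is just an
    example):

    #include <linux/rcupdate.h>
    #include <linux/sched.h>

    /* Count the threads of @g.  Safe under rcu_read_lock() as long as the
     * caller pins @g's task_struct; no pid_alive() dance is needed for the
     * walk itself because the list head in signal_struct is stable. */
    static int count_threads(struct task_struct *g)
    {
            struct task_struct *t;
            int n = 0;

            rcu_read_lock();
            for_each_thread(g, t)
                    n++;
            rcu_read_unlock();

            return n;
    }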
     

21 Jan, 2014

1 commit

  • Pull scheduler changes from Ingo Molnar:

    - Add the initial implementation of SCHED_DEADLINE support: a real-time
    scheduling policy where tasks that meet their deadlines and
    periodically execute their instances in less than their runtime quota
    see real-time scheduling and won't miss any of their deadlines.
    Tasks that go over their quota get delayed (Available to privileged
    users for now)

    - Clean up and fix preempt_enable_no_resched() abuse all around the
    tree

    - Do sched_clock() performance optimizations on x86 and elsewhere

    - Fix and improve auto-NUMA balancing

    - Fix and clean up the idle loop

    - Apply various cleanups and fixes

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
    sched: Fix __sched_setscheduler() nice test
    sched: Move SCHED_RESET_ON_FORK into attr::sched_flags
    sched: Fix up attr::sched_priority warning
    sched: Fix up scheduler syscall LTP fails
    sched: Preserve the nice level over sched_setscheduler() and sched_setparam() calls
    sched/core: Fix htmldocs warnings
    sched/deadline: No need to check p if dl_se is valid
    sched/deadline: Remove unused variables
    sched/deadline: Fix sparse static warnings
    m68k: Fix build warning in mac_via.h
    sched, thermal: Clean up preempt_enable_no_resched() abuse
    sched, net: Fixup busy_loop_us_clock()
    sched, net: Clean up preempt_enable_no_resched() abuse
    sched/preempt: Fix up missed PREEMPT_NEED_RESCHED folding
    sched/preempt, locking: Rework local_bh_{dis,en}able()
    sched/clock, x86: Avoid a runtime condition in native_sched_clock()
    sched/clock: Fix up clear_sched_clock_stable()
    sched/clock, x86: Use a static_key for sched_clock_stable
    sched/clock: Remove local_irq_disable() from the clocks
    sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs
    ...

    Linus Torvalds
     

18 Jan, 2014

1 commit

  • Pull namespace fixes from Eric Biederman:
    "This is a set of 3 regression fixes.

    This fixes /proc/mounts when using "ip netns add <name>" to display
    the actual mount point.

    This fixes a regression in clone that broke lxc-attach.

    This fixes a regression in the permission checks for mounting /proc
    that made proc unmountable if binfmt_misc was in use. Oops.

    My apologies for sending this pull request so late. Al Viro gave
    interesting review comments about the d_path fix that I wanted to
    address in detail before I sent this pull request. Unfortunately a
    bad round of colds kept me from addressing that in detail until today.
    The executive summary of the review was:

    Al: Is patching d_path really sufficient?
    The prepend_path, d_path, d_absolute_path, and __d_path family of
    functions is a real mess.

    Me: Yes, patching d_path is really sufficient. Yes, the code is mess.
    No it is not appropriate to rewrite all of d_path for a regression
    that has existed for entirely too long already, when a two line
    change will do"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    vfs: Fix a regression in mounting proc
    fork: Allow CLONE_PARENT after setns(CLONE_NEWPID)
    vfs: In d_path don't call d_dname on a mount point

    Linus Torvalds
     

13 Jan, 2014

4 commits

  • Some method is needed to deal with rt-mutexes and make sched_dl interact
    with the current PI code; this raises all but trivial issues that need
    (according to us) to be solved with some restructuring of the pi-code
    (i.e., going toward a proxy-execution-ish implementation).

    That work is under development; in the meanwhile, as a temporary solution,
    what this commit does is:

    - ensure a pi-lock owner with waiters is never throttled down. Instead,
    when it runs out of runtime, it immediately gets replenished and its
    deadline is postponed;

    - the scheduling parameters (relative deadline and default runtime)
    used for those replenishments, during the whole period it holds the
    pi-lock, are the ones of the waiting task with the earliest deadline.

    Acting this way, we provide some kind of boosting to the lock-owner,
    still by using the existing (actually, slightly modified by the previous
    commit) pi-architecture.

    We would stress the fact that this is only a surely needed, all but
    clean solution to the problem. In the end it's only a way to re-start
    discussion within the community. So, as always, comments, ideas, rants,
    etc.. are welcome! :-)

    Signed-off-by: Dario Faggioli
    Signed-off-by: Juri Lelli
    [ Added !RT_MUTEXES build fix. ]
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1383831828-15501-11-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Ingo Molnar

    Dario Faggioli
     
  • Turn the pi-chains from plist to rb-tree, in the rt_mutex code,
    and provide a proper comparison function for -deadline and
    -priority tasks.

    This is done mainly because:
    - classical prio field of the plist is just an int, which might
    not be enough for representing a deadline;
    - manipulating such a list would become O(nr_deadline_tasks),
    which might be too much as the number of -deadline tasks increases.

    Therefore, an rb-tree is used, and tasks are queued in it according
    to the following logic:
    - among two -priority (i.e., SCHED_BATCH/OTHER/RR/FIFO) tasks, the
    one with the higher (lower, actually!) prio wins;
    - among a -priority and a -deadline task, the latter always wins;
    - among two -deadline tasks, the one with the earliest deadline
    wins.

    Queueing and dequeueing functions are changed accordingly, for both
    the list of a task's pi-waiters and the list of tasks blocked on
    a pi-lock.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Dario Faggioli
    Signed-off-by: Juri Lelli
    Signed-off-again-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1383831828-15501-10-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
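
    A hedged sketch of the ordering rules listed above (simplified; lower
    ->prio value means higher priority, and -deadline tasks carry a prio
    below every -priority task, so the second rule falls out of the first
    test):

    static int waiter_less(struct rt_mutex_waiter *left,
                           struct rt_mutex_waiter *right)
    {
            if (left->prio < right->prio)
                    return 1;

            /* Both sides are -deadline: the earlier absolute deadline wins. */
            if (dl_prio(left->prio))
                    return left->task->dl.deadline < right->task->dl.deadline;

            return 0;
    }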
     
  • Introduces the data structures, constants and symbols needed for
    SCHED_DEADLINE implementation.

    Core data structure of SCHED_DEADLINE are defined, along with their
    initializers. Hooks for checking if a task belongs to the new policy
    are also added where they are needed.

    Adds a scheduling class, in sched/dl.c and a new policy called
    SCHED_DEADLINE. It is an implementation of the Earliest Deadline
    First (EDF) scheduling algorithm, augmented with a mechanism (called
    Constant Bandwidth Server, CBS) that makes it possible to isolate
    the behaviour of tasks between each other.

    The typical -deadline task will be made up of a computation phase
    (instance) which is activated in a periodic or sporadic fashion. The
    expected (maximum) duration of such computation is called the task's
    runtime; the time interval by which each instance needs to be completed
    is called the task's relative deadline. The task's absolute deadline
    is dynamically calculated as the time instant a task (better, an
    instance) activates plus the relative deadline.

    The EDF algorithm selects the task with the smallest absolute
    deadline as the one to be executed first, while the CBS ensures that
    each task runs for at most its runtime in every (relative) deadline
    length time interval, avoiding any interference between different
    tasks (bandwidth isolation).
    Thanks to this feature, even tasks that do not strictly comply with
    the computational model sketched above can effectively use the new
    policy.

    To summarize, this patch:
    - introduces the data structures, constants and symbols needed;
    - implements the core logic of the scheduling algorithm in the new
    scheduling class file;
    - provides all the glue code between the new scheduling class and
    the core scheduler and refines the interactions between sched/dl
    and the other existing scheduling classes.

    Signed-off-by: Dario Faggioli
    Signed-off-by: Michael Trimarchi
    Signed-off-by: Fabio Checconi
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1383831828-15501-4-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Ingo Molnar

    Dario Faggioli
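
    A hedged userspace sketch of admitting a task to the new policy (there
    was no glibc wrapper, so the raw syscall is used; the struct layout
    follows include/uapi/linux/sched.h and the parameters are arbitrary
    examples):

    #define _GNU_SOURCE
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #ifndef SCHED_DEADLINE
    #define SCHED_DEADLINE 6
    #endif

    struct sched_attr {
            uint32_t size;
            uint32_t sched_policy;
            uint64_t sched_flags;
            int32_t  sched_nice;
            uint32_t sched_priority;
            uint64_t sched_runtime;      /* ns */
            uint64_t sched_deadline;     /* ns */
            uint64_t sched_period;       /* ns */
    };

    int main(void)
    {
            struct sched_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.size = sizeof(attr);
            attr.sched_policy   = SCHED_DEADLINE;
            attr.sched_runtime  =  10 * 1000 * 1000;     /* 10ms budget ...  */
            attr.sched_deadline = 100 * 1000 * 1000;     /* ... every 100ms  */
            attr.sched_period   = 100 * 1000 * 1000;

            /* __NR_sched_setattr needs kernel headers that know the syscall;
             * the policy is available to privileged users only at this point. */
            if (syscall(__NR_sched_setattr, 0, &attr, 0) != 0)
                    perror("sched_setattr");

            return 0;
    }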
     
  • Pick up the latest fixes before applying new changes.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

19 Dec, 2013

1 commit

  • There are a few subtle races, between change_protection_range (used by
    mprotect and change_prot_numa) on one side, and NUMA page migration and
    compaction on the other side.

    The basic race is that there is a time window between when the PTE gets
    made non-present (PROT_NONE or NUMA), and the TLB is flushed.

    During that time, a CPU may continue writing to the page.

    This is fine most of the time, however compaction or the NUMA migration
    code may come in, and migrate the page away.

    When that happens, the CPU may continue writing, through the cached
    translation, to what is no longer the current memory location of the
    process.

    This only affects x86, which has a somewhat optimistic pte_accessible.
    All other architectures appear to be safe, and will either always flush,
    or flush whenever there is a valid mapping, even with no permissions
    (SPARC).

    The basic race looks like this:

    CPU A                        CPU B                         CPU C

                                                               load TLB entry
    make entry PTE/PMD_NUMA
                                 fault on entry
                                                               read/write old page
                                 start migrating page
                                 change PTE/PMD to new page
                                                               read/write old page [*]
    flush TLB
                                                               reload TLB from new entry
                                                               read/write new page
                                                               lose data

    [*] the old page may belong to a new user at this point!

    The obvious fix is to flush remote TLB entries, by making
    pte_accessible() aware of the fact that PROT_NONE and PROT_NUMA memory
    may still be accessible if there is a TLB flush pending for the mm.

    This should fix both NUMA migration and compaction.

    [mgorman@suse.de: fix build]
    Signed-off-by: Rik van Riel
    Signed-off-by: Mel Gorman
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
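
    A hedged sketch of the x86 side of the fix (close to, but not necessarily
    identical to, the final code): a non-present PROT_NONE/NUMA pte is still
    reported accessible while a TLB flush for the mm is pending.

    static inline bool pte_accessible(struct mm_struct *mm, pte_t a)
    {
            if (pte_flags(a) & _PAGE_PRESENT)
                    return true;

            /* Not present, but some CPU may still hold a stale translation
             * until the pending flush completes. */
            if ((pte_flags(a) & (_PAGE_PROTNONE | _PAGE_NUMA)) &&
                mm_tlb_flush_pending(mm))
                    return true;

            return false;
    }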
     

27 Nov, 2013

2 commits

  • A zombie task obviously can't fork(), so remove the unnecessary
    initialization of child->exit_state; it is zero anyway after
    dup_task_struct().

    Note: copy_process() is huge and it has a lot of chaotic
    initializations, probably it makes sense to move them into the
    new helper called by dup_task_struct().

    Signed-off-by: Oleg Nesterov
    Cc: David Laight
    Cc: Geert Uytterhoeven
    Cc: Tejun Heo
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20131113143612.GA10540@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • Serge Hallyn writes:
    > Hi Oleg,
    >
    > commit 40a0d32d1eaffe6aac7324ca92604b6b3977eb0e :
    > "fork: unify and tighten up CLONE_NEWUSER/CLONE_NEWPID checks"
    > breaks lxc-attach in 3.12. That code forks a child which does
    > setns() and then does a clone(CLONE_PARENT). That way the
    > grandchild can be in the right namespaces (which the child was
    > not) and be a child of the original task, which is the monitor.
    >
    > lxc-attach in 3.11 was working fine with no side effects that I
    > could see. Is there a real danger in allowing CLONE_PARENT
    > when current->nsproxy->pidns_for_children is not our pidns,
    > or was this done out of an "over-abundance of caution"? Can we
    > safely revert that new extra check?

    The two fundamental things I know we can not allow are:
    - A shared signal queue aka CLONE_THREAD. Because we compute the pid
    and uid of the signal when we place it in the queue.

    - Changing the pid and by extension pid_namespace of an existing
    process.

    From a parent's perspective there is nothing special about the pid
    namespace, to deny CLONE_PARENT, because the parent simply won't know or
    care.

    From the child's perspective all that is special really are shared signal
    queues.

    User mode threading with CLONE_PARENT|CLONE_VM|CLONE_SIGHAND and tasks
    in different pid namespaces is almost certainly going to break because
    it is complicated. But shared signal handlers can look at per thread
    information to know which pid namespace a process is in, so I don't know
    of any reason not to support CLONE_PARENT|CLONE_VM|CLONE_SIGHAND threads
    at the kernel level. It would be absolutely stupid to implement but
    that is a different thing.

    So hmm.

    Because it can do no harm, and because it is a regression let's remove
    the CLONE_PARENT check and send it stable.

    Cc: stable@vger.kernel.org
    Acked-by: Oleg Nesterov
    Acked-by: Andy Lutomirski
    Acked-by: Serge E. Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
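
    A hedged userspace sketch of the lxc-attach pattern described above
    (paths, names and error handling are illustrative):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sched.h>
    #include <signal.h>
    #include <unistd.h>

    static char payload_stack[64 * 1024];

    static int attach_payload(void *arg)
    {
            /* Runs inside the target namespaces, as a child of the monitor. */
            return 0;
    }

    /* The monitor forks a helper; the helper joins the container's namespaces
     * and then uses CLONE_PARENT so the payload becomes the monitor's child. */
    int attach(const char *ns_path)        /* e.g. "/proc/<pid>/ns/pid" */
    {
            int fd = open(ns_path, O_RDONLY);

            if (fd < 0 || setns(fd, 0) != 0)
                    return -1;

            return clone(attach_payload,
                         payload_stack + sizeof(payload_stack),
                         CLONE_PARENT | SIGCHLD, NULL);
    }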
     

15 Nov, 2013

2 commits

  • The basic idea is the same as with PTE level: the lock is embedded into
    struct page of table's page.

    We can't use mm->pmd_huge_pte to store pgtables for THP, since we don't
    take mm->page_table_lock anymore. Let's reuse page->lru of table's page
    for that.

    pgtable_pmd_page_ctor() returns true if initialization is successful
    and false otherwise. The current implementation never fails, but the
    assumption that the constructor can fail will help to port it to -rt,
    where spinlock_t is rather huge and cannot be embedded into struct
    page, so dynamic allocation is required.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Reviewed-by: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
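
    A hedged sketch of the constructor contract described above (allocation
    details are illustrative; the point is that callers must now handle a
    failing ctor):

    static pmd_t *pmd_table_alloc_sketch(void)
    {
            struct page *page = alloc_pages(GFP_KERNEL | __GFP_ZERO, 0);

            if (!page)
                    return NULL;

            if (!pgtable_pmd_page_ctor(page)) {
                    /* e.g. -rt, where the split lock must be allocated */
                    __free_pages(page, 0);
                    return NULL;
            }

            return (pmd_t *)page_address(page);
    }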
     
  • With split page table lock for PMD level we can't hold mm->page_table_lock
    while updating nr_ptes.

    Let's convert it to atomic_long_t to avoid races.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: Naoya Horiguchi
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

12 Nov, 2013

1 commit

  • Pull scheduler changes from Ingo Molnar:
    "The main changes in this cycle are:

    - (much) improved CONFIG_NUMA_BALANCING support from Mel Gorman, Rik
    van Riel, Peter Zijlstra et al. Yay!

    - optimize preemption counter handling: merge the NEED_RESCHED flag
    into the preempt_count variable, by Peter Zijlstra.

    - wait.h fixes and code reorganization from Peter Zijlstra

    - cfs_bandwidth fixes from Ben Segall

    - SMP load-balancer cleanups from Peter Zijstra

    - idle balancer improvements from Jason Low

    - other fixes and cleanups"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (129 commits)
    ftrace, sched: Add TRACE_FLAG_PREEMPT_RESCHED
    stop_machine: Fix race between stop_two_cpus() and stop_cpus()
    sched: Remove unnecessary iteration over sched domains to update nr_busy_cpus
    sched: Fix asymmetric scheduling for POWER7
    sched: Move completion code from core.c to completion.c
    sched: Move wait code from core.c to wait.c
    sched: Move wait.c into kernel/sched/
    sched/wait: Fix __wait_event_interruptible_lock_irq_timeout()
    sched: Avoid throttle_cfs_rq() racing with period_timer stopping
    sched: Guarantee new group-entities always have weight
    sched: Fix hrtimer_cancel()/rq->lock deadlock
    sched: Fix cfs_bandwidth misuse of hrtimer_expires_remaining
    sched: Fix race on toggling cfs_bandwidth_used
    sched: Remove extra put_online_cpus() inside sched_setaffinity()
    sched/rt: Fix task_tick_rt() comment
    sched/wait: Fix build breakage
    sched/wait: Introduce prepare_to_wait_event()
    sched/wait: Add ___wait_cond_timeout() to wait_event*_timeout() too
    sched: Remove get_online_cpus() usage
    sched: Fix race in migrate_swap_stop()
    ...

    Linus Torvalds
     

30 Oct, 2013

2 commits

  • uprobe_copy_process() does nothing if the child shares ->mm with
    the forking process, but there is a special case: CLONE_VFORK.
    In this case it would be more correct to do dup_utask() but avoid
    dup_xol(). This is not that important (the child should not unwind
    its stack too much, since that can corrupt the parent's stack), but at
    least we need this to allow ret-probing __vfork() itself.

    Note: in theory, it would be better to check task_pt_regs(p)->sp
    instead of CLONE_VFORK; we need to dup_utask() if and only if the
    child can return from the function called by the parent. But this
    needs an arch-dependent helper, and I think that nobody actually
    does clone(same_stack, CLONE_VM).

    Reported-by: Martin Cermak
    Reported-by: David Smith
    Signed-off-by: Oleg Nesterov

    Oleg Nesterov
     
  • Preparation for the next patches.

    Move the callsite of uprobe_copy_process() in copy_process() down
    to the successful return. We do not care if copy_process() fails;
    uprobe_free_utask() won't be called in this case, so the wrong
    ->utask != NULL doesn't matter.

    OTOH, with this change we know that copy_process() can't fail when
    uprobe_copy_process() is called; the new task should either return
    to user-mode or call do_exit(). This way uprobe_copy_process() can:

    1. setup p->utask != NULL if necessary

    2. setup uprobes_state.xol_area

    3. use task_work_add(p)

    Also, move the definition of uprobe_copy_process() down so that it
    can see get_utask().

    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju

    Oleg Nesterov
     

09 Oct, 2013

2 commits

  • A newly spawned thread inside a process should stay on the same
    NUMA node as its parent. This prevents processes from being "torn"
    across multiple NUMA nodes every time they spawn a new thread.

    Signed-off-by: Rik van Riel
    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-49-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • PTE scanning and NUMA hinting fault handling is expensive so commit
    5bca2303 ("mm: sched: numa: Delay PTE scanning until a task is scheduled
    on a new node") deferred the PTE scan until a task had been scheduled on
    another node. The problem is that in the purely shared memory case that
    this may never happen and no NUMA hinting fault information will be
    captured. We are not ruling out the possibility that something better
    can be done here, but for now this patch needs to be reverted so that we
    depend entirely on the scan_delay to avoid punishing short-lived processes.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-16-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     

14 Sep, 2013

1 commit

  • Pull aio changes from Ben LaHaise:
    "First off, sorry for this pull request being late in the merge window.
    Al had raised a couple of concerns about 2 items in the series below.
    I addressed the first issue (the race introduced by Gu's use of
    mm_populate()), but he has not provided any further details on how he
    wants to rework the anon_inode.c changes (which were sent out months
    ago but have yet to be commented on).

    The bulk of the changes have been sitting in the -next tree for a few
    months, with all the issues raised being addressed"

    * git://git.kvack.org/~bcrl/aio-next: (22 commits)
    aio: rcu_read_lock protection for new rcu_dereference calls
    aio: fix race in ring buffer page lookup introduced by page migration support
    aio: fix rcu sparse warnings introduced by ioctx table lookup patch
    aio: remove unnecessary debugging from aio_free_ring()
    aio: table lookup: verify ctx pointer
    staging/lustre: kiocb->ki_left is removed
    aio: fix error handling and rcu usage in "convert the ioctx list to table lookup v3"
    aio: be defensive to ensure request batching is non-zero instead of BUG_ON()
    aio: convert the ioctx list to table lookup v3
    aio: double aio_max_nr in calculations
    aio: Kill ki_dtor
    aio: Kill ki_users
    aio: Kill unneeded kiocb members
    aio: Kill aio_rw_vect_retry()
    aio: Don't use ctx->tail unnecessarily
    aio: io_cancel() no longer returns the io_event
    aio: percpu ioctx refcount
    aio: percpu reqs_available
    aio: reqs_active -> reqs_available
    aio: fix build when migration is disabled
    ...

    Linus Torvalds
     

12 Sep, 2013

4 commits

  • Simple cleanup. Every user of vma_set_policy() does the same work, which
    looks a bit annoying imho. Add a new trivial helper which does
    mpol_dup() + vma_set_policy() to simplify the callers (a sketch follows
    this entry).

    Signed-off-by: Oleg Nesterov
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
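
    A hedged sketch of the helper described above (following the entry's own
    mpol_dup() + vma_set_policy() description; the real helper may differ in
    detail):

    static int vma_dup_policy_sketch(struct vm_area_struct *src,
                                     struct vm_area_struct *dst)
    {
            struct mempolicy *pol = mpol_dup(vma_policy(src));

            if (IS_ERR(pol))
                    return PTR_ERR(pol);

            vma_set_policy(dst, pol);
            return 0;
    }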
     
  • do_fork() denies CLONE_THREAD | CLONE_PARENT if NEWUSER | NEWPID.

    Then later copy_process() denies CLONE_SIGHAND if the new process will
    be in a different pid namespace (task_active_pid_ns() doesn't match
    current->nsproxy->pid_ns).

    This looks confusing and inconsistent. CLONE_NEWPID is very similar to
    the case when ->pid_ns was already unshared; we want the same
    restrictions, so copy_process() should also nack CLONE_PARENT.

    And it would be better to deny CLONE_NEWUSER && CLONE_SIGHAND as well
    just for consistency.

    Kill the "CLONE_NEWUSER | CLONE_NEWPID" check in do_fork() and change
    copy_process() to do the same check along with ->pid_ns check we already
    have.

    Signed-off-by: Oleg Nesterov
    Acked-by: Andy Lutomirski
    Cc: "Eric W. Biederman"
    Cc: Colin Walters
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Commit 8382fcac1b81 ("pidns: Outlaw thread creation after
    unshare(CLONE_NEWPID)") nacks CLONE_NEWPID if the forking process
    unshared pid_ns. This is correct but unnecessary, copy_pid_ns() does
    the same check.

    Remove the CLONE_NEWPID check to cleanup the code and prepare for the
    next change.

    Test-case:

    static int child(void *arg)
    {
            return 0;
    }

    static char stack[16 * 1024];

    int main(void)
    {
            pid_t pid;

            assert(unshare(CLONE_NEWUSER | CLONE_NEWPID) == 0);

            pid = clone(child, stack + sizeof(stack) / 2,
                        CLONE_NEWPID | SIGCHLD, NULL);
            assert(pid < 0 && errno == EINVAL);

            return 0;
    }

    clone(CLONE_NEWPID) correctly fails with or without this change.

    Signed-off-by: Oleg Nesterov
    Acked-by: Andy Lutomirski
    Cc: "Eric W. Biederman"
    Cc: Colin Walters
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Commit 8382fcac1b81 ("pidns: Outlaw thread creation after
    unshare(CLONE_NEWPID)") nacks CLONE_VM if the forking process unshared
    pid_ns; this obviously breaks vfork:

    int main(void)
    {
            assert(unshare(CLONE_NEWUSER | CLONE_NEWPID) == 0);
            assert(vfork() >= 0);
            _exit(0);
            return 0;
    }

    fails without this patch.

    Change this check to use CLONE_SIGHAND instead. This also forbids
    CLONE_THREAD automatically, and this is what the comment implies.

    We could probably even drop CLONE_SIGHAND and use CLONE_THREAD, but it
    would be safer to not do this. The current check denies CLONE_SIGHAND
    implicitly, and there is no reason to change this.

    Eric said "CLONE_SIGHAND is fine. CLONE_THREAD would be even better.
    Having shared signal handling between two different pid namespaces is
    the case that we are fundamentally guarding against."

    Signed-off-by: Oleg Nesterov
    Reported-by: Colin Walters
    Acked-by: Andy Lutomirski
    Reviewed-by: "Eric W. Biederman"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

08 Sep, 2013

1 commit

  • Pull namespace changes from Eric Biederman:
    "This is an assorted mishmash of small cleanups, enhancements and bug
    fixes.

    The major theme is user namespace mount restrictions. nsown_capable
    is killed as it encourages not thinking about details that need to be
    considered. A very hard to hit pid namespace exiting bug was finally
    tracked and fixed. A couple of cleanups to the basic namespace
    infrastructure.

    Finally there is an enhancement that makes per user namespace
    capabilities usable as capabilities, and an enhancement that allows
    the per userns root to nice other processes in the user namespace"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    userns: Kill nsown_capable it makes the wrong thing easy
    capabilities: allow nice if we are privileged
    pidns: Don't have unshare(CLONE_NEWPID) imply CLONE_THREAD
    userns: Allow PR_CAPBSET_DROP in a user namespace.
    namespaces: Simplify copy_namespaces so it is clear what is going on.
    pidns: Fix hang in zap_pid_ns_processes by sending a potentially extra wakeup
    sysfs: Restrict mounting sysfs
    userns: Better restrictions on when proc and sysfs can be mounted
    vfs: Don't copy mount bind mounts of /proc/<pid>/ns/mnt between namespaces
    kernel/nsproxy.c: Improving a snippet of code.
    proc: Restrict mounting the proc filesystem
    vfs: Lock in place mounts from more privileged users

    Linus Torvalds
     

31 Aug, 2013

1 commit

  • I goofed when I made unshare(CLONE_NEWPID) only work in a
    single-threaded process. There is no need for that requirement and in
    fact I analyzed things right for setns. The hard requirement
    is for tasks that share a VM to all be in the same pid namespace, and
    we properly prevent that in do_fork.

    Just to be certain I took a look through do_wait and
    forget_original_parent and there are no cases that make it any harder
    for children to be in the multiple pid namespaces than it is for
    children to be in the same pid namespace. I also performed a check to
    see if there were any uses of task->nsproxy_pid_ns I was not familiar
    with, but it is only used when allocating a new pid for a new task,
    and in checks to prevent craziness from happening.

    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

28 Aug, 2013

1 commit


14 Aug, 2013

1 commit

  • Fix inadvertent breakage in the clone syscall ABI for Microblaze that
    was introduced in commit f3268edbe6fe ("microblaze: switch to generic
    fork/vfork/clone").

    The Microblaze syscall ABI for clone takes the parent tid address in the
    4th argument; the third argument slot is used for the stack size. The
    incorrectly-used CLONE_BACKWARDS type assigned parent tid to the 3rd
    slot.

    This commit restores the original ABI so that existing userspace libc
    code will work correctly.

    All kernel versions from v3.8-rc1 were affected.

    Signed-off-by: Michal Simek
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Simek
     

31 Jul, 2013

1 commit

  • On Wed, Jun 12, 2013 at 11:14:40AM -0700, Kent Overstreet wrote:
    > On Mon, Apr 15, 2013 at 02:40:55PM +0300, Octavian Purdila wrote:
    > > When using a large number of threads performing AIO operations the
    > > IOCTX list may get a significant number of entries which will cause
    > > significant overhead. For example, when running this fio script:
    > >
    > > rw=randrw; size=256k ;directory=/mnt/fio; ioengine=libaio; iodepth=1
    > > blocksize=1024; numjobs=512; thread; loops=100
    > >
    > > on an EXT2 filesystem mounted on top of a ramdisk we can observe up to
    > > 30% CPU time spent by lookup_ioctx:
    > >
    > > 32.51% [guest.kernel] [g] lookup_ioctx
    > > 9.19% [guest.kernel] [g] __lock_acquire.isra.28
    > > 4.40% [guest.kernel] [g] lock_release
    > > 4.19% [guest.kernel] [g] sched_clock_local
    > > 3.86% [guest.kernel] [g] local_clock
    > > 3.68% [guest.kernel] [g] native_sched_clock
    > > 3.08% [guest.kernel] [g] sched_clock_cpu
    > > 2.64% [guest.kernel] [g] lock_release_holdtime.part.11
    > > 2.60% [guest.kernel] [g] memcpy
    > > 2.33% [guest.kernel] [g] lock_acquired
    > > 2.25% [guest.kernel] [g] lock_acquire
    > > 1.84% [guest.kernel] [g] do_io_submit
    > >
    > > This patch converts the ioctx list to a radix tree. For a performance
    > > comparison the above FIO script was run on a 2-socket, 8-core
    > > machine. These are the results (average and %rsd of 10 runs) for the
    > > original list based implementation and for the radix tree based
    > > implementation:
    > >
    > > cores        1          2          4          8          16         32
    > > list      109376 ms  69119 ms   35682 ms   22671 ms   19724 ms   16408 ms
    > > %rsd      0.69%      1.15%      1.17%      1.21%      1.71%      1.43%
    > > radix     73651 ms   41748 ms   23028 ms   16766 ms   15232 ms   13787 ms
    > > %rsd      1.19%      0.98%      0.69%      1.13%      0.72%      0.75%
    > > % of radix
    > > relative  66.12%     65.59%     66.63%     72.31%     77.26%     83.66%
    > > to list
    > >
    > > To consider the impact of the patch on the typical case of having
    > > only one ctx per process the following FIO script was run:
    > >
    > > rw=randrw; size=100m ;directory=/mnt/fio; ioengine=libaio; iodepth=1
    > > blocksize=1024; numjobs=1; thread; loops=100
    > >
    > > on the same system and the results are the following:
    > >
    > > list 58892 ms
    > > %rsd 0.91%
    > > radix 59404 ms
    > > %rsd 0.81%
    > > % of radix
    > > relative 100.87%
    > > to list
    >
    > So, I was just doing some benchmarking/profiling to get ready to send
    > out the aio patches I've got for 3.11 - and it looks like your patch is
    > causing a ~1.5% throughput regression in my testing :/
    ...

    I've got an alternate approach for fixing this wart in lookup_ioctx()...
    Instead of using an rbtree, just use the reserved id in the ring buffer
    header to index an array pointing to the ioctx. It's not finished yet, and
    it needs to be tidied up, but is most of the way there.

    -ben
    --
    "Thought is the essence of where you are now."
    --
    kmo> And, a rework of Ben's code, but this was entirely his idea
    kmo> -Kent

    bcrl> And fix the code to use the right mm_struct in kill_ioctx(), actually
    free memory.

    Signed-off-by: Benjamin LaHaise

    Benjamin LaHaise
     

15 Jul, 2013

1 commit

  • The __cpuinit type of throwaway sections might have made sense
    some time ago when RAM was more constrained, but now the savings
    do not offset the cost and complications. For example, the fix in
    commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time")
    is a good example of the nasty type of bugs that can be created
    with improper use of the various __init prefixes.

    After a discussion on LKML[1] it was decided that cpuinit should go
    the way of devinit and be phased out. Once all the users are gone,
    we can then finally remove the macros themselves from linux/init.h.

    This removes all the uses of the __cpuinit macros from C files in
    the core kernel directories (kernel, init, lib, mm, and include)
    that don't really have a specific maintainer.

    [1] https://lkml.org/lkml/2013/5/20/589

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker