05 Feb, 2013

1 commit

  • …x/kernel/git/frederic/linux-dynticks into sched/core

    Pull full-dynticks (user-space execution is undisturbed and
    receives no timer IRQs) preparation changes that convert the
    cputime accounting code to be full-dynticks ready,
    from Frederic Weisbecker:

    "This implements the cputime accounting on full dynticks CPUs.

    Typical cputime stats infrastructure relies on the timer tick and
    its periodic polling on the CPU to account the amount of time
    spent by the CPUs and the tasks per high level domains such as
    userspace, kernelspace, guest, ...

    Now we are preparing to implement full dynticks capability on
    Linux for Real Time and HPC users who want full CPU isolation.
    This feature requires a cputime accounting that doesn't depend
    on the timer tick.

    To implement it, this new cputime infrastructure plugs into
    kernel/user/guest boundaries to take snapshots of cputime and
    flush these to the stats when needed. This performs pretty
    much like CONFIG_VIRT_CPU_ACCOUNTING except that context location
    and cputime snapshots are synchronized between write and read
    side such that the latter can safely retrieve the pending tickless
    cputime of a task and add it to its latest cputime snapshot to
    return the correct result to the user."

    Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
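
    The write/read synchronization described in the message above can be
    sketched in user-space C as a sequence-counter protocol. This is only
    an illustration under stated assumptions: the struct, its fields and
    both function names are hypothetical, C11 atomics stand in for the
    kernel's own primitives, and the real code may be organized differently.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

enum ctx_sketch { CTX_USER, CTX_KERNEL };

/* Hypothetical per-task state; names are illustrative, not kernel fields. */
struct vtime_sketch {
    atomic_uint seq;            /* odd while the writer is mid-update */
    _Atomic uint64_t snapshot;  /* time of the last boundary crossing */
    _Atomic int where;          /* context entered at that crossing */
};

/* Write side: record a kernel/user boundary crossing. */
void vtime_cross(struct vtime_sketch *v, enum ctx_sketch next, uint64_t now)
{
    atomic_fetch_add(&v->seq, 1);   /* begin update: seq becomes odd */
    atomic_store(&v->snapshot, now);
    atomic_store(&v->where, next);
    atomic_fetch_add(&v->seq, 1);   /* end update: seq is even again */
}

/*
 * Read side: retry until a consistent (snapshot, where) pair is observed,
 * then return the time that is still pending since the snapshot.
 */
uint64_t vtime_pending(struct vtime_sketch *v, uint64_t now,
                       enum ctx_sketch *where)
{
    unsigned start;
    uint64_t snap;
    int ctx;

    do {
        start = atomic_load(&v->seq);
        snap = atomic_load(&v->snapshot);
        ctx = atomic_load(&v->where);
    } while ((start & 1) || start != atomic_load(&v->seq));

    assert(now >= snap);
    *where = (enum ctx_sketch)ctx;
    return now - snap;
}
```

    A remote reader would add the returned pending time to the user or
    system total, depending on the context reported in `where`.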
     

28 Jan, 2013

1 commit

  • While remotely reading the cputime of a task running on a
    full dynticks CPU, the values stored in the utime/stime fields
    of struct task_struct may be stale. These values may be those
    of the last kernel/user transition time snapshot and
    we need to add the tickless time spent since this snapshot.

    To fix this, flush the cputime of the dynticks CPUs on
    kernel user transition and record the time / context
    where we did this. Then on top of this snapshot and the current
    time, perform the fixup on the reader side from task_times()
    accessors.

    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Ingo Molnar
    Cc: Li Zhong
    Cc: Namhyung Kim
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    [fixed kvm module related build errors]
    Signed-off-by: Sedat Dilek

    Frederic Weisbecker
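
    A minimal sketch of the flush-on-transition and reader-side fixup
    described above, assuming a single monotonic clock value passed in by
    the caller. All names here (cputime_sketch, cputime_transition,
    cputime_read) are hypothetical and only model the idea; they are not
    the kernel's task_times() accessors.

```c
#include <assert.h>
#include <stdint.h>

enum side_sketch { SIDE_USER, SIDE_KERNEL };

/* Hypothetical per-task accounting state. */
struct cputime_sketch {
    uint64_t utime;         /* user time flushed so far */
    uint64_t stime;         /* system time flushed so far */
    uint64_t snapshot;      /* time of the last flush */
    enum side_sketch side;  /* context the task has been in since then */
};

/* Writer: on a kernel/user transition, flush the elapsed time and
 * record when and into which context the task moved. */
void cputime_transition(struct cputime_sketch *t, enum side_sketch next,
                        uint64_t now)
{
    assert(now >= t->snapshot);
    uint64_t delta = now - t->snapshot;

    if (t->side == SIDE_USER)
        t->utime += delta;
    else
        t->stime += delta;
    t->snapshot = now;
    t->side = next;
}

/* Reader: stored totals may be stale, so add the tickless time that has
 * elapsed since the snapshot to the counter matching the recorded context. */
void cputime_read(const struct cputime_sketch *t, uint64_t now,
                  uint64_t *utime, uint64_t *stime)
{
    assert(now >= t->snapshot);
    uint64_t pending = now - t->snapshot;

    *utime = t->utime;
    *stime = t->stime;
    if (t->side == SIDE_USER)
        *utime += pending;
    else
        *stime += pending;
}
```

    Note that the reader never modifies the stored totals; only the
    writer flushes, which keeps the fixup safe to repeat.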
     

21 Jan, 2013

1 commit

  • Pull misc syscall fixes from Al Viro:

    - compat syscall fixes (discussed back in December)

    - a couple of "make life easier for sigaltstack stuff by reducing
    inter-tree dependencies"

    - fix up compiler/asmlinkage calling convention disagreement of
    sys_clone()

    - misc

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    sys_clone() needs asmlinkage_protect
    make sure that /linuxrc has std{in,out,err}
    x32: fix sigtimedwait
    x32: fix waitid()
    switch compat_sys_wait4() and compat_sys_waitid() to COMPAT_SYSCALL_DEFINE
    switch compat_sys_sigaltstack() to COMPAT_SYSCALL_DEFINE
    CONFIG_GENERIC_SIGALTSTACK build breakage with asm-generic/syscalls.h
    Ensure that kernel_init_freeable() is not inlined into non __init code

    Linus Torvalds
     

20 Jan, 2013

1 commit


25 Dec, 2012

1 commit

  • The sequence:
    unshare(CLONE_NEWPID)
    clone(CLONE_THREAD|CLONE_SIGHAND|CLONE_VM)

    Creates a new process in the new pid namespace without setting
    pid_ns->child_reaper. After forking this results in a NULL
    pointer dereference.

    Avoid this and other nonsense scenarios that can show up after
    creating a new pid namespace with unshare by adding a new
    check in copy_process.

    Pointed-out-by: Oleg Nesterov
    Acked-by: Oleg Nesterov
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
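
    The kind of check described above can be modeled in a few lines of
    user-space C. This is a sketch under assumptions: the task structure
    and function name are hypothetical, namespaces are reduced to integer
    ids, and the exact condition used in the kernel may differ. The flag
    values mirror those in the Linux UAPI headers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Local copies of the relevant clone flag values. */
#define SK_CLONE_VM     0x00000100UL
#define SK_CLONE_NEWPID 0x20000000UL

/* Hypothetical model of a task: namespaces are plain integer ids. */
struct task_sketch {
    int active_pid_ns;        /* namespace the task itself lives in */
    int pid_ns_for_children;  /* namespace its children will be created in */
};

/*
 * Refuse to create a thread-like child (shared VM) or a nested new pid
 * namespace when the caller has already unshared its pid namespace, since
 * such a child would land in a namespace that has no init process yet.
 */
int check_clone_flags(const struct task_sketch *cur, unsigned long flags)
{
    assert(cur != NULL);
    if ((flags & (SK_CLONE_VM | SK_CLONE_NEWPID)) &&
        cur->active_pid_ns != cur->pid_ns_for_children)
        return -EINVAL;
    return 0;
}
```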
     

21 Dec, 2012

1 commit

  • Pull signal handling cleanups from Al Viro:
    "sigaltstack infrastructure + conversion for x86, alpha and um,
    COMPAT_SYSCALL_DEFINE infrastructure.

    Note that there are several conflicts between "unify
    SS_ONSTACK/SS_DISABLE definitions" and UAPI patches in mainline;
    resolution is trivial - just remove definitions of SS_ONSTACK and
    SS_DISABLE from arch/*/uapi/asm/signal.h; they are all identical and
    include/uapi/linux/signal.h contains the unified variant."

    Fixed up conflicts as per Al.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    alpha: switch to generic sigaltstack
    new helpers: __save_altstack/__compat_save_altstack, switch x86 and um to those
    generic compat_sys_sigaltstack()
    introduce generic sys_sigaltstack(), switch x86 and um to it
    new helper: compat_user_stack_pointer()
    new helper: restore_altstack()
    unify SS_ONSTACK/SS_DISABLE definitions
    new helper: current_user_stack_pointer()
    missing user_stack_pointer() instances
    Bury the conditionals from kernel_thread/kernel_execve series
    COMPAT_SYSCALL_DEFINE: infrastructure

    Linus Torvalds
     

20 Dec, 2012

1 commit

  • All architectures have
    CONFIG_GENERIC_KERNEL_THREAD
    CONFIG_GENERIC_KERNEL_EXECVE
    __ARCH_WANT_SYS_EXECVE
    None of them have __ARCH_WANT_KERNEL_EXECVE and there are only two callers
    of kernel_execve() (which is a trivial wrapper for do_execve() now) left.
    Kill the conditionals and make both callers use do_execve().

    Signed-off-by: Al Viro

    Al Viro
     

19 Dec, 2012

1 commit

  • Because those architectures will draw their stacks directly from the page
    allocator, rather than the slab cache, we can directly pass __GFP_KMEMCG
    flag, and issue the corresponding free_pages.

    This code path is taken when the architecture doesn't define
    CONFIG_ARCH_THREAD_INFO_ALLOCATOR (only ia64 seems to), and has
    THREAD_SIZE >= PAGE_SIZE. Luckily, most - if not all - of the remaining
    architectures fall in this category.

    This will guarantee that every stack page is accounted to the memcg the
    process currently lives on, and will cause the allocations to fail if they
    go over limit.

    For the time being, I am defining a new variant of THREADINFO_GFP, not to
    mess with the other path. Once the slab is also tracked by memcg, we can
    get rid of that flag.

    Tested to successfully protect against :(){ :|:& };:

    Signed-off-by: Glauber Costa
    Acked-by: Frederic Weisbecker
    Acked-by: Kamezawa Hiroyuki
    Reviewed-by: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: JoonSoo Kim
    Cc: Mel Gorman
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Suleiman Souhlal
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     

18 Dec, 2012

1 commit

  • Pull user namespace changes from Eric Biederman:
    "While small this set of changes is very significant with respect to
    containers in general and user namespaces in particular. The user
    space interface is now complete.

    This set of changes adds support for unprivileged users to create user
    namespaces and as a user namespace root to create other namespaces.
    The tyranny of supporting suid root preventing unprivileged users from
    using cool new kernel features is broken.

    This set of changes completes the work on setns, adding support for
    the pid, user, mount namespaces.

    This set of changes includes a bunch of basic pid namespace
    cleanups/simplifications. Of particular significance is the rework of
    the pid namespace cleanup so it no longer requires sending out
    tendrils into all kinds of unexpected cleanup paths for operation. At
    least one case of broken error handling is fixed by this cleanup.

    The files under /proc/<pid>/ns/ have been converted from regular files
    to magic symlinks which prevents incorrect caching by the VFS,
    ensuring the files always refer to the namespace the process is
    currently using and ensuring that the ptrace_may_access permission
    checks are always applied.

    The files under /proc/<pid>/ns/ have been given stable inode numbers
    so it is now possible to see if different processes share the same
    namespaces.

    Through David Miller's net tree are changes to relax many of the
    permission checks in the networking stack to allow the user
    namespace root to usefully use the networking stack. Similar changes
    for the mount namespace and the pid namespace are coming through my
    tree.

    Two small changes to add user namespace support were committed here and
    in David Miller's -net tree so that I could complete the work on the
    /proc/<pid>/ns/ files in this tree.

    Work remains to make it safe to build user namespaces and 9p, afs,
    ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the
    Kconfig guard remains in place preventing user namespaces from
    being built when any of those filesystems are enabled.

    Future design work remains to allow root users outside of the initial
    user namespace to mount more than just /proc and /sys."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (38 commits)
    proc: Usable inode numbers for the namespace file descriptors.
    proc: Fix the namespace inode permission checks.
    proc: Generalize proc inode allocation
    userns: Allow unprivilged mounts of proc and sysfs
    userns: For /proc/self/{uid,gid}_map derive the lower userns from the struct file
    procfs: Print task uids and gids in the userns that opened the proc file
    userns: Implement unshare of the user namespace
    userns: Implent proc namespace operations
    userns: Kill task_user_ns
    userns: Make create_new_namespaces take a user_ns parameter
    userns: Allow unprivileged use of setns.
    userns: Allow unprivileged users to create new namespaces
    userns: Allow setting a userns mapping to your current uid.
    userns: Allow chown and setgid preservation
    userns: Allow unprivileged users to create user namespaces.
    userns: Ignore suid and sgid on binaries if the uid or gid can not be mapped
    userns: fix return value on mntns_install() failure
    vfs: Allow unprivileged manipulation of the mount namespace.
    vfs: Only support slave subtrees across different user namespaces
    vfs: Add a user namespace reference from struct mnt_namespace
    ...

    Linus Torvalds
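
    Because the namespace files have stable inode numbers, user space can
    test whether two processes share a namespace by comparing the device
    and inode of the corresponding files. A minimal sketch, assuming a
    Linux system with /proc mounted; the helper name same_namespace() is
    hypothetical, while stat(2) and the st_dev/st_ino fields are standard.

```c
#include <assert.h>
#include <sys/stat.h>

/*
 * Compare two namespace files such as /proc/<pid>/ns/pid.
 * Returns 1 if they refer to the same namespace, 0 if they differ,
 * and -1 if either path cannot be examined.
 */
int same_namespace(const char *ns_path_a, const char *ns_path_b)
{
    struct stat a, b;

    assert(ns_path_a && ns_path_b);
    /* stat() follows the magic symlink to the namespace object itself. */
    if (stat(ns_path_a, &a) != 0 || stat(ns_path_b, &b) != 0)
        return -1;
    return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}
```

    For example, same_namespace("/proc/self/ns/pid", "/proc/1/ns/pid")
    would report whether the caller shares a pid namespace with pid 1,
    subject to the caller having permission to examine that file.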
     

17 Dec, 2012

1 commit

  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for the workloads driven by perf which is bad
    but is potentially explained by the lack of scheduler smarts. Thomas'
    results show balancenuma improves on mainline but falls far short of
    numacore or autonuma. Srikar's results indicate we all suffer on a
    large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers which is not reported by
    the tool by default and sometimes missed in reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore batch
    handles PTEs but I no longer think this is the case. It's possible
    numacore is just able to trigger it due to higher rates of migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

13 Dec, 2012

2 commits

  • Pull big execve/kernel_thread/fork unification series from Al Viro:
    "All architectures are converted to new model. Quite a bit of that
    stuff is actually shared with architecture trees; in such cases it's
    literally shared branch pulled by both, not a cherry-pick.

    A lot of ugliness and black magic is gone (-3KLoC total in this one):

    - kernel_thread()/kernel_execve()/sys_execve() redesign.

    We don't do syscalls from kernel anymore for either kernel_thread()
    or kernel_execve():

    kernel_thread() is essentially clone(2) with callback run before we
    return to userland, the callbacks either never return or do
    successful do_execve() before returning.

    kernel_execve() is a wrapper for do_execve() - it doesn't need to
    do transition to user mode anymore.

    As a result kernel_thread() and kernel_execve() are
    arch-independent now - they live in kernel/fork.c and fs/exec.c
    resp. sys_execve() is also in fs/exec.c and it's completely
    architecture-independent.

    - daemonize() is gone, along with its parts in fs/*.c

    - struct pt_regs * is no longer passed to do_fork/copy_process/
    copy_thread/do_execve/search_binary_handler/->load_binary/do_coredump.

    - sys_fork()/sys_vfork()/sys_clone() unified; some architectures
    still need wrappers (ones with callee-saved registers not saved in
    pt_regs on syscall entry), but the main part of those suckers is in
    kernel/fork.c now."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (113 commits)
    do_coredump(): get rid of pt_regs argument
    print_fatal_signal(): get rid of pt_regs argument
    ptrace_signal(): get rid of unused arguments
    get rid of ptrace_signal_deliver() arguments
    new helper: signal_pt_regs()
    unify default ptrace_signal_deliver
    flagday: kill pt_regs argument of do_fork()
    death to idle_regs()
    don't pass regs to copy_process()
    flagday: don't pass regs to copy_thread()
    bfin: switch to generic vfork, get rid of pointless wrappers
    xtensa: switch to generic clone()
    openrisc: switch to use of generic fork and clone
    unicore32: switch to generic clone(2)
    score: switch to generic fork/vfork/clone
    c6x: sanitize copy_thread(), get rid of clone(2) wrapper, switch to generic clone()
    take sys_fork/sys_vfork/sys_clone prototypes to linux/syscalls.h
    mn10300: switch to generic fork/vfork/clone
    h8300: switch to generic fork/vfork/clone
    tile: switch to generic clone()
    ...

    Conflicts:
    arch/microblaze/include/asm/Kbuild

    Linus Torvalds
     
  • Pull cgroup changes from Tejun Heo:
    "A lot of activities on cgroup side. The big changes are focused on
    making cgroup hierarchy handling saner.

    - cgroup_rmdir() had peculiar semantics - it allowed cgroup
    destruction to be vetoed by individual controllers and tried to
    drain refcnt synchronously. The vetoing never worked properly and
    caused a good deal of contortions in cgroup. memcg was the last
    remaining user. Michal Hocko removed the usage and cgroup_rmdir()
    path has been simplified significantly. This was done in a
    separate branch so that the memcg people can base further memcg
    changes on top.

    - The above allowed cleaning up cgroup lifecycle management and
    implementation of generic cgroup iterators which are used to
    improve hierarchy support.

    - cgroup_freezer updated to allow migration in and out of a frozen
    cgroup and handle hierarchy. If a cgroup is frozen, all descendant
    cgroups are frozen.

    - netcls_cgroup and netprio_cgroup updated to handle hierarchy
    properly.

    - Various fixes and cleanups.

    - Two merge commits. One to pull in memcg and rmdir cleanups (needed
    to build iterators). The other pulled in cgroup/for-3.7-fixes for
    device_cgroup fixes so that further device_cgroup patches can be
    stacked on top."

    Fixed up a trivial conflict in mm/memcontrol.c as per Tejun (due to
    commit bea8c150a7 ("memcg: fix hotplugged memory zone oops") in master
    touching code close to commit 2ef37d3fe4 ("memcg: Simplify
    mem_cgroup_force_empty_list error handling") in for-3.8)

    * 'for-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (65 commits)
    cgroup: update Documentation/cgroups/00-INDEX
    cgroup_rm_file: don't delete the uncreated files
    cgroup: remove subsystem files when remounting cgroup
    cgroup: use cgroup_addrm_files() in cgroup_clear_directory()
    cgroup: warn about broken hierarchies only after css_online
    cgroup: list_del_init() on removed events
    cgroup: fix lockdep warning for event_control
    cgroup: move list add after list head initilization
    netprio_cgroup: allow nesting and inherit config on cgroup creation
    netprio_cgroup: implement netprio[_set]_prio() helpers
    netprio_cgroup: use cgroup->id instead of cgroup_netprio_state->prioidx
    netprio_cgroup: reimplement priomap expansion
    netprio_cgroup: shorten variable names in extend_netdev_table()
    netprio_cgroup: simplify write_priomap()
    netcls_cgroup: move config inheritance to ->css_online() and remove .broken_hierarchy marking
    cgroup: remove obsolete guarantee from cgroup_task_migrate.
    cgroup: add cgroup->id
    cgroup, cpuset: remove cgroup_subsys->post_clone()
    cgroup: s/CGRP_CLONE_CHILDREN/CGRP_CPUSET_CLONE_CHILDREN/
    cgroup: rename ->create/post_create/pre_destroy/destroy() to ->css_alloc/online/offline/free()
    ...

    Linus Torvalds
     

12 Dec, 2012

1 commit

  • Pull scheduler updates from Ingo Molnar:
    "The biggest change affects group scheduling: we now track the runnable
    average on a per-task entity basis, allowing a smoother, exponential
    decay average based load/weight estimation instead of the previous
    binary on-the-runqueue/off-the-runqueue load weight method.

    This will inevitably disturb workloads that were in some sort of
    borderline balancing state or unstable equilibrium, so an eye has to
    be kept on regressions.

    For that reason the new load average is only limited to group
    scheduling (shares distribution) at the moment (which was also hurting
    the most from the prior, crude weight calculation and whose scheduling
    quality wins most from this change) - but we plan to extend this to
    regular SMP balancing as well in the future, which will simplify and
    speed up things a bit.

    Other changes involve ongoing preparatory work to extend NOHZ to the
    scheduler as well, eventually allowing completely irq-free user-space
    execution."

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
    Revert "sched/autogroup: Fix crash on reboot when autogroup is disabled"
    cputime: Comment cputime's adjusting code
    cputime: Consolidate cputime adjustment code
    cputime: Rename thread_group_times to thread_group_cputime_adjusted
    cputime: Move thread_group_cputime() to sched code
    vtime: Warn if irqs aren't disabled on system time accounting APIs
    vtime: No need to disable irqs on vtime_account()
    vtime: Consolidate a bit the ctx switch code
    vtime: Explicitly account pending user time on process tick
    vtime: Remove the underscore prefix invasion
    sched/autogroup: Fix crash on reboot when autogroup is disabled
    cputime: Separate irqtime accounting from generic vtime
    cputime: Specialize irq vtime hooks
    kvm: Directly account vtime to system on guest switch
    vtime: Make vtime_account_system() irqsafe
    vtime: Gather vtime declarations to their own header file
    sched: Describe CFS load-balancer
    sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking
    sched: Make __update_entity_runnable_avg() fast
    sched: Update_cfs_shares at period edge
    ...

    Linus Torvalds
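
    The difference between a binary on/off-runqueue weight and an
    exponentially decaying runnable average can be illustrated with a
    small floating-point model. This is a sketch only: the struct and
    function names are hypothetical, and the decay factor (chosen so that
    a contribution halves after 32 periods) is an assumption made for
    illustration rather than something stated in the message above.

```c
#include <assert.h>

/* Assumed per-period decay factor, approximately 2^(-1/32). */
#define DECAY_Y 0.97857206

/* Hypothetical per-entity tracking state. */
struct runnable_avg_sketch {
    double runnable_sum;  /* decayed sum of time spent runnable */
    double period_sum;    /* decayed sum of all elapsed periods */
};

/* Account one period; runnable_fraction is the share of that period
 * (0.0 to 1.0) during which the entity was runnable. */
void avg_update(struct runnable_avg_sketch *a, double runnable_fraction)
{
    assert(runnable_fraction >= 0.0 && runnable_fraction <= 1.0);
    a->runnable_sum = a->runnable_sum * DECAY_Y + runnable_fraction;
    a->period_sum = a->period_sum * DECAY_Y + 1.0;
}

/* Smoothed estimate of how runnable the entity has recently been. */
double avg_ratio(const struct runnable_avg_sketch *a)
{
    return a->period_sum > 0.0 ? a->runnable_sum / a->period_sum : 0.0;
}
```

    With this model an entity that goes idle does not drop to zero load
    at once; its contribution fades gradually over the following periods.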
     

11 Dec, 2012

1 commit

  • Due to the fact that migrations are driven by the CPU a task is running
    on there is no point tracking NUMA faults until one task runs on a new
    node. This patch tracks the first node used by an address space. Until
    it changes, PTE scanning is disabled and no NUMA hinting faults are
    trapped. This should help workloads that are short-lived, do not care
    about NUMA placement or have bound themselves to a single node.

    This takes advantage of the logic in "mm: sched: numa: Implement slow
    start for working set sampling" to delay when the checks are made. This
    will take advantage of processes that set their CPU and node bindings
    early in their lifetime. It will also potentially allow any initial load
    balancing to take place.

    Signed-off-by: Mel Gorman

    Mel Gorman
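
    The "first node" tracking described above can be modeled as a small
    state machine. This is a sketch with hypothetical names and sentinel
    values (mm_sketch, SCAN_INIT, SCAN_ACTIVE); it only illustrates the
    decision of when scanning may start, not the kernel's actual code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sentinel values for the tracking field. */
#define SCAN_INIT   (-1)  /* no node recorded yet */
#define SCAN_ACTIVE (-2)  /* a second node was seen, scanning enabled */

/* Hypothetical per-address-space state. */
struct mm_sketch {
    int first_nid;
};

/*
 * Decide whether PTE scanning should run for this address space, given
 * the node the current task is executing on. Scanning stays off until
 * the address space is first observed on a node other than its first one.
 */
bool numa_should_scan(struct mm_sketch *mm, int current_nid)
{
    assert(current_nid >= 0);

    if (mm->first_nid == SCAN_ACTIVE)
        return true;
    if (mm->first_nid == SCAN_INIT)
        mm->first_nid = current_nid;  /* remember the first node used */
    if (mm->first_nid == current_nid)
        return false;                 /* still on the first node */

    mm->first_nid = SCAN_ACTIVE;      /* moved nodes: enable for good */
    return true;
}
```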
     

29 Nov, 2012

6 commits


20 Nov, 2012

1 commit

  • - Add CLONE_THREAD to the unshare flags if CLONE_NEWUSER is selected,
    as changing user namespaces is only valid if there is only
    a single thread.
    - Restore the code to add CLONE_VM if CLONE_THREAD is selected and
    the code to add CLONE_SIGHAND if CLONE_VM is selected,
    making the constraints in the code clear.

    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
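
    The chain of implied flags described above can be written out as a
    short sketch. The function name is hypothetical and the constants are
    local copies of the Linux UAPI flag values; this models only the flag
    propagation, not the rest of the unshare path.

```c
#include <assert.h>

/* Local copies of the relevant clone flag values. */
#define SK_CLONE_VM      0x00000100UL
#define SK_CLONE_SIGHAND 0x00000800UL
#define SK_CLONE_THREAD  0x00010000UL
#define SK_CLONE_NEWUSER 0x10000000UL

/*
 * Expand the requested unshare flags so that each flag pulls in the
 * flag it depends on: NEWUSER -> THREAD -> VM -> SIGHAND.
 * The order of the tests matters, since each step feeds the next.
 */
unsigned long unshare_imply_flags(unsigned long flags)
{
    if (flags & SK_CLONE_NEWUSER)
        flags |= SK_CLONE_THREAD;
    if (flags & SK_CLONE_THREAD)
        flags |= SK_CLONE_VM;
    if (flags & SK_CLONE_VM)
        flags |= SK_CLONE_SIGHAND;

    /* Invariant after expansion: THREAD never appears without VM. */
    assert(!(flags & SK_CLONE_THREAD) || (flags & SK_CLONE_VM));
    return flags;
}
```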
     

19 Nov, 2012

5 commits

  • Now that we have been through every permission check in the kernel,
    having uid == 0 and gid == 0 in your local user namespace no
    longer adds any special privileges. Even having a full set
    of caps in your local user namespace is safe because capabilities
    are relative to your local user namespace, and do not confer
    unexpected privileges.

    Over the long term this should allow much more of the kernel's
    functionality to be safely used by non-root users, including functionality
    like unsharing the mount namespace that is only unsafe because
    it can fool applications whose privileges are raised when they
    are executed. Since those applications have no privileges in
    a user namespace it becomes safe to spoof and confuse those
    applications all you want.

    Those capabilities will still need to be enabled carefully because
    we may still need things like rlimits on the number of unprivileged
    mounts but that is to avoid DOS attacks not to avoid fooling root
    owned processes.

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • Unsharing of the pid namespace unlike unsharing of other namespaces
    does not take effect immediately. Instead it affects the children
    created with fork and clone. The first of these children becomes the init
    process of the new pid namespace, the rest become oddball children
    of pid 0. From the point of view of the new pid namespace the process
    that created it is pid 0, as its pid does not map.

    A couple of different semantics were considered but this one was
    settled on because it is easy to implement and it is usable from
    pam modules, the core reason for the existence of unshare.

    I took a survey of the callers of pam modules and the following
    appears to be a representative sample of their logic.
    {
            setup stuff include pam
            child = fork();
            if (!child) {
                    setuid()
                    exec /bin/bash
            }
            waitpid(child);

            pam and other cleanup
    }

    As you can see there is a fork to create the unprivileged user
    space process, which means that the unprivileged user space
    process will appear as pid 1 in the new pid namespace. Further,
    most login processes do not cope with extraneous children, which
    means shifting the duty of reaping extraneous child processes to
    the creator of those extraneous children makes the system more
    comprehensible.

    The practical reason for this set of pid namespace semantics is
    that it is simple to implement and verify they work correctly,
    whereas an implementation that requires changing the struct
    pid on a process comes with a lot more races and pain, not
    the least of which is that glibc caches getpid().

    These semantics are implemented by having two notions
    of the pid namespace of a process. There is task_active_pid_ns,
    which is the pid namespace the process was created with
    and the pid namespace that all pids are presented to
    that process in. The task_active_pid_ns is stored
    in the struct pid of the task.

    Then there is the pid namespace that will be used for children;
    that pid namespace is stored in task->nsproxy->pid_ns.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
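
    The two notions of a task's pid namespace described above can be
    modeled in plain C. This is a toy sketch: every type and function name
    is hypothetical, pid allocation is reduced to a counter, and none of
    the locking or lifetime handling of the real implementation appears.

```c
#include <assert.h>

/* Toy pid namespace: only tracks the next pid to hand out. */
struct pidns_sketch {
    int next_pid;
};

/* Toy task carrying both notions of "its" pid namespace. */
struct task_ns_sketch {
    struct pidns_sketch *active;        /* namespace the task was created in */
    struct pidns_sketch *for_children;  /* namespace its children will join */
    int pid;                            /* pid as seen inside `active` */
};

/* Unsharing only redirects future children; the caller keeps its own
 * active namespace and pid. */
void sketch_unshare_newpid(struct task_ns_sketch *t, struct pidns_sketch *fresh)
{
    fresh->next_pid = 1;
    t->for_children = fresh;
}

/* A child is created in the parent's for_children namespace and takes
 * the next free pid there, so the first child after an unshare is pid 1. */
struct task_ns_sketch sketch_fork(const struct task_ns_sketch *parent)
{
    struct task_ns_sketch child;

    assert(parent->for_children != 0);
    child.active = parent->for_children;
    child.for_children = parent->for_children;
    child.pid = child.active->next_pid++;
    return child;
}
```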
     
  • Instead of setting child_reaper and SIGNAL_UNKILLABLE one way
    for the system init process, and another way for pid namespace
    init processes, test pid->nr == 1 and use the same code for both.

    For the global init this results in SIGNAL_UNKILLABLE being set
    much earlier in the initialization process.

    This is a small cleanup and it paves the way for allowing unshare and
    enter of the pid namespace as that path like our global init also will
    not set CLONE_NEWPID.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • Track the number of pids in the proc hash table. When the number of
    pids goes to 0 schedule work to unmount the kernel mount of proc.

    Move the mount of proc into alloc_pid when we allocate the pid for
    init.

    Remove the surprising calls of pid_ns_release_proc in fork and
    proc_flush_task. Those code paths really shouldn't know about proc
    namespace implementation details and people have demonstrated several
    times that finding and understanding those code paths is difficult and
    non-obvious.

    Because of the call path detach pid is always called with the
    rtnl_lock held, free_pid is not allowed to sleep, so the work to
    unmount proc is moved to a work queue. This has the side benefit
    of not blocking the entire world waiting for the unnecessary
    rcu_barrier in deactivate_locked_super.

    In the process of making the code clear and obvious this fixes a bug
    reported by Gao feng where we would leak a
    mount of proc during clone(CLONE_NEWPID|CLONE_NEWNET) if copy_pid_ns
    succeeded and copy_net_ns failed.

    Acked-by: "Serge E. Hallyn"
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • The expressions tsk->nsproxy->pid_ns and task_active_pid_ns
    aka ns_of_pid(task_pid(tsk)) should have the same number of
    cache line misses with the practical difference that
    ns_of_pid(task_pid(tsk)) is released later in a process's life.

    Furthermore by using task_active_pid_ns it becomes trivial
    to write an unshare implementation for the pid namespace.

    So I have used task_active_pid_ns everywhere I can.

    In fork since the pid has not yet been attached to the
    process I use ns_of_pid, to achieve the same effect.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

16 Nov, 2012

1 commit

  • This was always racy, but 268720903f87e0b84b161626c4447b81671b5d18
    "uprobes: Rework register_for_each_vma() to make it O(n)" should be
    blamed anyway, it made everything worse and I didn't notice.

    register/unregister call build_map_info() and then do install/remove
    breakpoint for every mm which mmaps inode/offset. This can obviously
    race with fork()->dup_mmap() in between and we can miss the child.

    uprobe_register() could be easily fixed but unregister is much worse,
    the new mm inherits "int3" from parent and there is no way to detect
    this if uprobe goes away.

    So this patch simply adds percpu_down_read/up_read around dup_mmap(),
    and percpu_down_write/up_write into register_for_each_vma().

    This adds 2 new hooks into dup_mmap() but we can kill uprobe_dup_mmap()
    and fold it into uprobe_end_dup_mmap().

    Reported-by: Srikar Dronamraju
    Acked-by: Srikar Dronamraju
    Signed-off-by: Oleg Nesterov

    Oleg Nesterov
     

17 Oct, 2012

1 commit

  • cgroup core has a bug which violates a basic rule about event
    notifications - when a new entity needs to be added, you add that to
    the notification list first and then make the new entity conform to
    the current state. If done in the reverse order, an event happening
    in between will be lost.

    cgroup_subsys->fork() is invoked way before the new task is added to
    the css_set. Currently, cgroup_freezer is the only user of ->fork()
    and uses it to make new tasks conform to the current state of the
    freezer. If FROZEN state is requested while fork is in progress
    between cgroup_fork_callbacks() and cgroup_post_fork(), the child
    could escape freezing - the cgroup isn't frozen when ->fork() is
    called and the freezer couldn't see the new task on the css_set.

    This patch moves cgroup_subsys->fork() invocation to
    cgroup_post_fork() after the new task is added to the css_set.
    cgroup_fork_callbacks() is removed.

    Because now a task may be migrated during cgroup_subsys->fork(),
    freezer_fork() is updated so that it adheres to the usual RCU locking
    and the rather pointless comment on why locking can be different there
    is removed (if it doesn't make anything simpler, why even bother?).

    Signed-off-by: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Rafael J. Wysocki
    Cc: stable@vger.kernel.org

    Tejun Heo
     

10 Oct, 2012

1 commit

  • Pull generic execve() changes from Al Viro:
    "This introduces the generic kernel_thread() and kernel_execve()
    functions, and switches x86, arm, alpha, um and s390 over to them."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (26 commits)
    s390: convert to generic kernel_execve()
    s390: switch to generic kernel_thread()
    s390: fold kernel_thread_helper() into ret_from_fork()
    s390: fold execve_tail() into start_thread(), convert to generic sys_execve()
    um: switch to generic kernel_thread()
    x86, um/x86: switch to generic sys_execve and kernel_execve
    x86: split ret_from_fork
    alpha: introduce ret_from_kernel_execve(), switch to generic kernel_execve()
    alpha: switch to generic kernel_thread()
    alpha: switch to generic sys_execve()
    arm: get rid of execve wrapper, switch to generic execve() implementation
    arm: optimized current_pt_regs()
    arm: introduce ret_from_kernel_execve(), switch to generic kernel_execve()
    arm: split ret_from_fork, simplify kernel_thread() [based on patch by rmk]
    generic sys_execve()
    generic kernel_execve()
    new helper: current_pt_regs()
    preparation for generic kernel_thread()
    um: kill thread->forking
    um: let signal_delivered() do SIGTRAP on singlestepping into handler
    ...

    Linus Torvalds
     

09 Oct, 2012

5 commits

  • Update the generic interval tree code that was introduced in "mm: replace
    vma prio_tree with an interval tree".

    Changes:

    - fixed 'endpoing' typo noticed by Andrew Morton

    - replaced include/linux/interval_tree_tmpl.h, which was used as a
    template (including it automatically defined the interval tree
    functions) with include/linux/interval_tree_generic.h, which only
    defines a preprocessor macro INTERVAL_TREE_DEFINE(), which itself
    defines the interval tree functions when invoked. Now that is a very
    long macro which is unfortunate, but it does make the usage sites
    (lib/interval_tree.c and mm/interval_tree.c) a bit nicer than previously.

    - make use of RB_DECLARE_CALLBACKS() in the INTERVAL_TREE_DEFINE() macro,
    instead of duplicating that code in the interval tree template.

    - replaced vma_interval_tree_add(), which was actually handling the
    nonlinear and interval tree cases, with vma_interval_tree_insert_after()
    which handles only the interval tree case and has an API that is more
    consistent with the other interval tree handling functions.
    The nonlinear case is now handled explicitly in kernel/fork.c dup_mmap().
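
    The difference between a template header and a defining macro can be
    sketched as follows. RANGE_DEFINE() and both node types are
    hypothetical and far smaller than INTERVAL_TREE_DEFINE(); the sketch
    only shows how one macro invocation stamps out functions for a node
    type whose endpoints may be stored or computed.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical defining macro: nothing is generated until it is invoked.
 * START and LAST are function-like macros that extract the closed
 * interval [start, last] from a node of the given type.
 */
#define RANGE_DEFINE(prefix, type, START, LAST)                         \
static inline bool prefix##_overlaps(const type *n,                     \
                                     unsigned long start,               \
                                     unsigned long last)                \
{                                                                       \
    return START(n) <= last && start <= LAST(n);                        \
}

/* Node type with explicitly stored endpoints. */
struct span { unsigned long start, last; };
#define SPAN_START(n) ((n)->start)
#define SPAN_LAST(n)  ((n)->last)
RANGE_DEFINE(span, struct span, SPAN_START, SPAN_LAST)

/* Node type whose last endpoint is computed, as with a VMA. */
struct region { unsigned long pgoff, npages; };
#define REGION_START(n) ((n)->pgoff)
#define REGION_LAST(n)  ((n)->pgoff + (n)->npages - 1)
RANGE_DEFINE(region, struct region, REGION_START, REGION_LAST)
```

    Each usage site is then a single macro invocation, at the cost of a
    long macro body in the header.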

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Daniel Santos
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Implement an interval tree as a replacement for the VMA prio_tree. The
    algorithms are similar to lib/interval_tree.c; however that code can't be
    directly reused as the interval endpoints are not explicitly stored in the
    VMA. So instead, the common algorithm is moved into a template and the
    details (node type, how to get interval endpoints from the node, etc) are
    filled in using the C preprocessor.

    Once the interval tree functions are available, using them as a
    replacement to the VMA prio tree is a relatively simple, mechanical job.

    Signed-off-by: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Hillf Danton
    Cc: Peter Zijlstra
    Cc: Catalin Marinas
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • The deprecated /proc/<pid>/oom_adj is scheduled for removal this month.

    Signed-off-by: Davidlohr Bueso
    Acked-by: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Currently the kernel sets mm->exe_file during sys_execve() and then tracks
    the number of vmas with the VM_EXECUTABLE flag in mm->num_exe_file_vmas.
    As soon as this counter drops to zero, the kernel resets mm->exe_file to
    NULL. It also resets mm->exe_file at the last mmput(), when mm->mm_users
    drops to zero.

    A VMA with the VM_EXECUTABLE flag appears after mapping a file with the
    MAP_EXECUTABLE flag. Such vmas can appear only at sys_execve() or after
    vma splitting, because sys_mmap ignores this flag. Usually the binfmt
    module sets mm->exe_file and mmaps executable vmas with this file; they
    hold mm->exe_file while the task is running.

    Comment from v2.6.25-6245-g925d1c4 ("procfs task exe symlink"),
    where all this stuff was introduced:

    > The kernel implements readlink of /proc/pid/exe by getting the file from
    > the first executable VMA. Then the path to the file is reconstructed and
    > reported as the result.
    >
    > Because of the VMA walk the code is slightly different on nommu systems.
    > This patch avoids separate /proc/pid/exe code on nommu systems. Instead of
    > walking the VMAs to find the first executable file-backed VMA we store a
    > reference to the exec'd file in the mm_struct.
    >
    > That reference would prevent the filesystem holding the executable file
    > from being unmounted even after unmapping the VMAs. So we track the number
    > of VM_EXECUTABLE VMAs and drop the new reference when the last one is
    > unmapped. This avoids pinning the mounted filesystem.

    exe_file's vma accounting is hooked into every file mmap/munmap and vma
    split/merge just to avoid a hypothetical problem: a filesystem pinned
    against unmounting by an mm that has already unmapped all its executable
    files but is still alive.

    It seems that currently nobody depends on this behaviour. We can try to
    remove this logic and keep mm->exe_file until the final mmput().

    mm->exe_file is still protected with mm->mmap_sem, because we want to
    change it via the new sys_prctl(PR_SET_MM_EXE_FILE). A task can also use
    this syscall to change its mm->exe_file and unpin the mountpoint explicitly.
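
    From userspace, mm->exe_file is what backs the /proc/<pid>/exe
    symlink mentioned in the quoted comment. A minimal sketch of reading
    it for the current process, assuming a Linux system with /proc
    mounted; current_exe_path() is a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Copy the path of the running executable into buf.
 * Returns the path length, or -1 on error or if size is 0.
 * A path longer than size - 1 bytes is silently truncated.
 */
static ssize_t current_exe_path(char *buf, size_t size)
{
    ssize_t n;

    if (size == 0)
        return -1;

    n = readlink("/proc/self/exe", buf, size - 1);
    if (n < 0)
        return -1;

    buf[n] = '\0';  /* readlink() does not NUL-terminate */
    return n;
}
```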

    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Some security modules and oprofile still use VM_EXECUTABLE for retrieving
    a task's executable file. After this patch they will use mm->exe_file
    directly. mm->exe_file is protected with mm->mmap_sem, so locking stays
    the same.

    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Chris Metcalf [arch/tile]
    Acked-by: Tetsuo Handa [tomoyo]
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Acked-by: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

03 Oct, 2012

1 commit

  • Pull networking changes from David Miller:

    1) GRE now works over ipv6, from Dmitry Kozlov.

    2) Make SCTP more network namespace aware, from Eric Biederman.

    3) TEAM driver now works with non-ethernet devices, from Jiri Pirko.

    4) Make openvswitch network namespace aware, from Pravin B Shelar.

    5) IPV6 NAT implementation, from Patrick McHardy.

    6) Server side support for TCP Fast Open, from Jerry Chu and others.

    7) Packet BPF filter supports MOD and XOR, from Eric Dumazet and Daniel
    Borkmann.

    8) Increase the loopback default MTU to 64K, from Eric Dumazet.

    9) Use a per-task rather than per-socket page fragment allocator for
    outgoing networking traffic. This benefits processes that have very
    many mostly idle sockets, which is quite common.

    From Eric Dumazet.

    10) Use up to 32K for page fragment allocations, with fallbacks to
    smaller sizes when higher order page allocations fail. Benefits are
    a) fewer segments for the driver to process b) fewer calls to the page
    allocator c) less waste of space.

    From Eric Dumazet.

    11) Allow GRO to be used on GRE tunnels, from Eric Dumazet.

    12) VXLAN device driver, one way to handle VLAN issues such as the
    limitation of 4096 VLAN IDs yet still have some level of isolation.
    From Stephen Hemminger.

    13) As usual there is a large boatload of driver changes, with the scale
    perhaps tilted towards the wireless side this time around.

    Fix up various fairly trivial conflicts, mostly caused by the user
    namespace changes.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1012 commits)
    hyperv: Add buffer for extended info after the RNDIS response message.
    hyperv: Report actual status in receive completion packet
    hyperv: Remove extra allocated space for recv_pkt_list elements
    hyperv: Fix page buffer handling in rndis_filter_send_request()
    hyperv: Fix the missing return value in rndis_filter_set_packet_filter()
    hyperv: Fix the max_xfer_size in RNDIS initialization
    vxlan: put UDP socket in correct namespace
    vxlan: Depend on CONFIG_INET
    sfc: Fix the reported priorities of different filter types
    sfc: Remove EFX_FILTER_FLAG_RX_OVERRIDE_IP
    sfc: Fix loopback self-test with separate_tx_channels=1
    sfc: Fix MCDI structure field lookup
    sfc: Add parentheses around use of bitfield macro arguments
    sfc: Fix null function pointer in efx_sriov_channel_type
    vxlan: virtual extensible lan
    igmp: export symbol ip_mc_leave_group
    netlink: add attributes to fdb interface
    tg3: unconditionally select HWMON support when tg3 is enabled.
    Revert "net: ti cpsw ethernet: allow reading phy interface mode from DT"
    gre: fix sparse warning
    ...

    Linus Torvalds
     

02 Oct, 2012

1 commit

  • Pull scheduler changes from Ingo Molnar:
    "Continued quest to clean up and enhance the cputime code by Frederic
    Weisbecker, in preparation for future tickless kernel features.

    Other than that, smallish changes."

    Fix up trivial conflicts due to additions next to each other in arch/{x86/}Kconfig

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
    cputime: Make finegrained irqtime accounting generally available
    cputime: Gather time/stats accounting config options into a single menu
    ia64: Reuse system and user vtime accounting functions on task switch
    ia64: Consolidate user vtime accounting
    vtime: Consolidate system/idle context detection
    cputime: Use a proper subsystem naming for vtime related APIs
    sched: cpu_power: enable ARCH_POWER
    sched/nohz: Clean up select_nohz_load_balancer()
    sched: Fix load avg vs. cpu-hotplug
    sched: Remove __ARCH_WANT_INTERRUPTS_ON_CTXSW
    sched: Fix nohz_idle_balance()
    sched: Remove useless code in yield_to()
    sched: Add time unit suffix to sched sysctl knobs
    sched/debug: Limit sd->*_idx range on sysctl
    sched: Remove AFFINE_WAKEUPS feature flag
    s390: Remove leftover account_tick_vtime() header
    cputime: Consolidate vtime handling on context switch
    sched: Move cputime code to its own file
    cputime: Generalize CONFIG_VIRT_CPU_ACCOUNTING
    tile: Remove SD_PREFER_LOCAL leftover
    ...

    Linus Torvalds
     

01 Oct, 2012

1 commit

  • Let architectures select GENERIC_KERNEL_THREAD and have their copy_thread()
    treat NULL regs as "it came from kernel_thread(), the sp argument contains
    the function the new thread will be calling and stack_size contains the
    argument for that function". Switching the architectures begins shortly...
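
    The calling convention can be modelled in userspace as below. This
    is only a sketch: struct regs_model, struct thread_model and
    copy_thread_model() are hypothetical, and the real copy_thread() is
    per-architecture code that sets up registers and stacks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct regs_model { unsigned long sp; };    /* stand-in for pt_regs */

struct thread_model {
    bool is_kernel_thread;
    int (*fn)(void *);      /* kernel thread: function to call */
    void *arg;              /* kernel thread: its argument */
    unsigned long user_sp;  /* user thread: stack pointer */
};

/*
 * Model of the convention: with NULL regs, sp and stack_size are not a
 * stack description at all but carry the function and its argument.
 * The integer/pointer casts assume an ABI where unsigned long can hold
 * any pointer, as on Linux.
 */
static void copy_thread_model(struct thread_model *t, unsigned long sp,
                              unsigned long stack_size,
                              const struct regs_model *regs)
{
    if (!regs) {
        t->is_kernel_thread = true;
        t->fn = (int (*)(void *))sp;
        t->arg = (void *)stack_size;
        t->user_sp = 0;
        return;
    }

    t->is_kernel_thread = false;
    t->fn = NULL;
    t->arg = NULL;
    t->user_sp = sp ? sp : regs->sp;  /* sp == 0: reuse the parent's stack */
}

/* Example thread function used to exercise the model. */
static int demo_fn(void *arg)
{
    return *(int *)arg + 1;
}
```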

    Signed-off-by: Al Viro

    Al Viro
     

25 Sep, 2012

1 commit

  • We currently use a per socket order-0 page cache for tcp_sendmsg()
    operations.

    This page is used to build fragments for skbs.

    It's done to increase the probability of coalescing small write() calls
    into single segments in skbs still in the write queue (not yet sent).

    But it wastes a lot of memory for applications handling many mostly
    idle sockets, since each socket holds one page in sk->sk_sndmsg_page.

    It's also quite inefficient for building TSO 64KB packets, because we need
    about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit the
    page allocator more than wanted.

    This patch adds a per task frag allocator and uses bigger pages,
    if available. An automatic fallback is done in case of memory pressure.

    (up to 32768 bytes per frag, that's order-3 pages on x86)

    This increases TCP stream performance by 20% on the loopback device,
    and also benefits other network devices, since 8x fewer frags are
    mapped on transmit and unmapped on tx completion. Alexander Duyck
    mentioned a probable performance win on systems with IOMMU enabled.
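
    The fallback and the 8x figure can be sketched in a small userspace
    model. Everything here is hypothetical (frag_refill(), the
    page_alloc_fn callback, the two toy allocators); it assumes 4096-byte
    pages and is not the kernel's allocator code.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_MODEL 4096u   /* assumed page size */
#define MAX_FRAG_ORDER  3       /* 4096 << 3 = 32768 bytes */

/* Returns 2^order contiguous pages, or NULL on failure. */
typedef void *(*page_alloc_fn)(unsigned int order);

/* Try the largest order first and fall back to smaller ones. */
static void *frag_refill(page_alloc_fn alloc, unsigned int *order_out)
{
    for (int order = MAX_FRAG_ORDER; order >= 0; order--) {
        void *p = alloc((unsigned int)order);

        if (p) {
            *order_out = (unsigned int)order;
            return p;
        }
    }
    return NULL;
}

/* Number of fragments needed to cover bytes with frags of this order. */
static unsigned int frags_needed(unsigned int bytes, unsigned int order)
{
    unsigned int frag = PAGE_SIZE_MODEL << order;

    return (bytes + frag - 1) / frag;
}

/* Toy allocators: one always succeeds, one only has single pages left. */
static char pool[PAGE_SIZE_MODEL << MAX_FRAG_ORDER];

static void *alloc_plenty(unsigned int order)
{
    (void)order;
    return pool;
}

static void *alloc_under_pressure(unsigned int order)
{
    return order == 0 ? pool : NULL;
}
```

    With these assumptions a 64KB TSO packet needs 16 order-0 fragments
    but only 2 order-3 fragments, which is where the 8x reduction comes
    from.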

    It's possible some SG-enabled hardware can't cope with bigger fragments,
    but their ndo_start_xmit() should already handle this by splitting a
    fragment into sub-fragments, since some arches have PAGE_SIZE=65536.

    Successfully tested on various ethernet devices.
    (ixgbe, igb, bnx2x, tg3, mellanox mlx4)

    Signed-off-by: Eric Dumazet
    Cc: Ben Hutchings
    Cc: Vijay Subramanian
    Cc: Alexander Duyck
    Tested-by: Vijay Subramanian
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Sep, 2012

1 commit

  • Now that the last architecture to use this has stopped doing so (ARM,
    thanks Catalin!) we can remove this complexity from the scheduler
    core.

    Signed-off-by: Peter Zijlstra
    Cc: Oleg Nesterov
    Cc: Catalin Marinas
    Link: http://lkml.kernel.org/n/tip-g9p2a1w81xxbrze25v9zpzbf@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

29 Aug, 2012

1 commit