10 Oct, 2012

1 commit

  • Pull generic execve() changes from Al Viro:
    "This introduces the generic kernel_thread() and kernel_execve()
    functions, and switches x86, arm, alpha, um and s390 over to them."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (26 commits)
    s390: convert to generic kernel_execve()
    s390: switch to generic kernel_thread()
    s390: fold kernel_thread_helper() into ret_from_fork()
    s390: fold execve_tail() into start_thread(), convert to generic sys_execve()
    um: switch to generic kernel_thread()
    x86, um/x86: switch to generic sys_execve and kernel_execve
    x86: split ret_from_fork
    alpha: introduce ret_from_kernel_execve(), switch to generic kernel_execve()
    alpha: switch to generic kernel_thread()
    alpha: switch to generic sys_execve()
    arm: get rid of execve wrapper, switch to generic execve() implementation
    arm: optimized current_pt_regs()
    arm: introduce ret_from_kernel_execve(), switch to generic kernel_execve()
    arm: split ret_from_fork, simplify kernel_thread() [based on patch by rmk]
    generic sys_execve()
    generic kernel_execve()
    new helper: current_pt_regs()
    preparation for generic kernel_thread()
    um: kill thread->forking
    um: let signal_delivered() do SIGTRAP on singlestepping into handler
    ...

    Linus Torvalds
     

09 Oct, 2012

5 commits

  • Update the generic interval tree code that was introduced in "mm: replace
    vma prio_tree with an interval tree".

    Changes:

    - fixed 'endpoing' typo noticed by Andrew Morton

    - replaced include/linux/interval_tree_tmpl.h, which was used as a
    template (including it automatically defined the interval tree
    functions), with include/linux/interval_tree_generic.h, which only
    defines a preprocessor macro INTERVAL_TREE_DEFINE() that in turn
    defines the interval tree functions when invoked (see the sketch
    after this list). That is now a very long macro, which is unfortunate,
    but it does make the usage sites (lib/interval_tree.c and
    mm/interval_tree.c) a bit nicer than before.

    - make use of RB_DECLARE_CALLBACKS() in the INTERVAL_TREE_DEFINE() macro,
    instead of duplicating that code in the interval tree template.

    - replaced vma_interval_tree_add(), which was actually handling the
    nonlinear and interval tree cases, with vma_interval_tree_insert_after()
    which handles only the interval tree case and has an API that is more
    consistent with the other interval tree handling functions.
    The nonlinear case is now handled explicitly in kernel/fork.c dup_mmap().
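
    A rough illustration of the new macro (this mirrors how
    lib/interval_tree.c instantiates it for a node that stores explicit
    start/last fields; treat the exact parameter list as an approximation):

    #include <linux/interval_tree.h>
    #include <linux/interval_tree_generic.h>

    /* how to read the interval endpoints out of the node type */
    #define START(node) ((node)->start)
    #define LAST(node)  ((node)->last)

    /*
     * Expands into interval_tree_insert(), interval_tree_remove(),
     * interval_tree_iter_first(), interval_tree_iter_next(), ...
     * using the rb-node field 'rb' and the cached '__subtree_last'.
     */
    INTERVAL_TREE_DEFINE(struct interval_tree_node, rb,
                         unsigned long, __subtree_last,
                         START, LAST,, interval_tree)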

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Daniel Santos
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Implement an interval tree as a replacement for the VMA prio_tree. The
    algorithms are similar to lib/interval_tree.c; however that code can't be
    directly reused as the interval endpoints are not explicitly stored in the
    VMA. So instead, the common algorithm is moved into a template and the
    details (node type, how to get interval endpoints from the node, etc) are
    filled in using the C preprocessor.

    Once the interval tree functions are available, using them as a
    replacement to the VMA prio tree is a relatively simple, mechanical job.
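
    For example, the template only needs a pair of accessors to derive the
    interval endpoints from a vma (in page-offset units), along the lines
    of the helpers below (the names are illustrative):

    static inline unsigned long vma_start_pgoff(struct vm_area_struct *v)
    {
            return v->vm_pgoff;
    }

    static inline unsigned long vma_last_pgoff(struct vm_area_struct *v)
    {
            return v->vm_pgoff + ((v->vm_end - v->vm_start) >> PAGE_SHIFT) - 1;
    }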

    Signed-off-by: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Hillf Danton
    Cc: Peter Zijlstra
    Cc: Catalin Marinas
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • The deprecated /proc/<pid>/oom_adj is scheduled for removal this month.

    Signed-off-by: Davidlohr Bueso
    Acked-by: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Currently the kernel sets mm->exe_file during sys_execve() and then
    tracks the number of vmas with the VM_EXECUTABLE flag in
    mm->num_exe_file_vmas; as soon as this counter drops to zero, the
    kernel resets mm->exe_file to NULL. It also resets mm->exe_file at
    the last mmput(), when mm->mm_users drops to zero.

    A VMA with the VM_EXECUTABLE flag appears after mapping a file with
    the MAP_EXECUTABLE flag; such vmas can appear only at sys_execve() or
    after vma splitting, because sys_mmap ignores this flag. Usually the
    binfmt module sets mm->exe_file and mmaps executable vmas backed by
    this file, and they hold mm->exe_file while the task is running.

    Here is the comment from v2.6.25-6245-g925d1c4 ("procfs task exe
    symlink"), where all this was introduced:

    > The kernel implements readlink of /proc/pid/exe by getting the file from
    > the first executable VMA. Then the path to the file is reconstructed and
    > reported as the result.
    >
    > Because of the VMA walk the code is slightly different on nommu systems.
    > This patch avoids separate /proc/pid/exe code on nommu systems. Instead of
    > walking the VMAs to find the first executable file-backed VMA we store a
    > reference to the exec'd file in the mm_struct.
    >
    > That reference would prevent the filesystem holding the executable file
    > from being unmounted even after unmapping the VMAs. So we track the number
    > of VM_EXECUTABLE VMAs and drop the new reference when the last one is
    > unmapped. This avoids pinning the mounted filesystem.

    exe_file's vma accounting is hooked into every file mmap/munmap and
    vma split/merge just to prevent a hypothetical case of an mm pinning
    a filesystem against unmounting after it has already unmapped all of
    its executable files but is still alive.

    Nobody currently seems to depend on this behaviour, so we can try to
    remove this logic and keep mm->exe_file until the final mmput().

    mm->exe_file is still protected with mm->mmap_sem, because we want to
    change it via the new sys_prctl(PR_SET_MM_EXE_FILE). Via this syscall
    a task can also change its mm->exe_file and unpin the mountpoint
    explicitly.
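
    A minimal user-space sketch of that prctl (assuming the caller meets
    the restrictions the new prctl imposes for checkpoint/restore, and
    that PR_SET_MM/PR_SET_MM_EXE_FILE are available in the prctl headers):

    #include <sys/prctl.h>
    #include <linux/prctl.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int set_exe_file(const char *path)
    {
            int fd = open(path, O_RDONLY);

            if (fd < 0)
                    return -1;
            /* point mm->exe_file (and /proc/<pid>/exe) at the file behind fd */
            if (prctl(PR_SET_MM, PR_SET_MM_EXE_FILE, fd, 0, 0) < 0)
                    perror("PR_SET_MM_EXE_FILE");
            return close(fd);
    }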

    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Some security modules and oprofile still use VM_EXECUTABLE for
    retrieving a task's executable file. After this patch they will use
    mm->exe_file directly. mm->exe_file is protected with mm->mmap_sem,
    so locking stays the same.
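
    The pattern those users switch to is roughly "take mmap_sem, grab a
    reference, drop the lock"; a sketch (the helper name is illustrative):

    struct file *task_exe_file_sketch(struct task_struct *task)
    {
            struct mm_struct *mm = get_task_mm(task);
            struct file *exe_file = NULL;

            if (!mm)
                    return NULL;

            down_read(&mm->mmap_sem);       /* mm->exe_file is protected by mmap_sem */
            exe_file = mm->exe_file;
            if (exe_file)
                    get_file(exe_file);     /* take our own reference */
            up_read(&mm->mmap_sem);

            mmput(mm);
            return exe_file;
    }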

    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Chris Metcalf [arch/tile]
    Acked-by: Tetsuo Handa [tomoyo]
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Acked-by: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

03 Oct, 2012

1 commit

  • Pull networking changes from David Miller:

    1) GRE now works over ipv6, from Dmitry Kozlov.

    2) Make SCTP more network namespace aware, from Eric Biederman.

    3) TEAM driver now works with non-ethernet devices, from Jiri Pirko.

    4) Make openvswitch network namespace aware, from Pravin B Shelar.

    5) IPV6 NAT implementation, from Patrick McHardy.

    6) Server side support for TCP Fast Open, from Jerry Chu and others.

    7) Packet BPF filter supports MOD and XOR, from Eric Dumazet and Daniel
    Borkmann.

    8) Increase the loopback default MTU to 64K, from Eric Dumazet.

    9) Use a per-task rather than per-socket page fragment allocator for
    outgoing networking traffic. This benefits processes that have very
    many mostly idle sockets, which is quite common.

    From Eric Dumazet.

    10) Use up to 32K for page fragment allocations, with fallbacks to
    smaller sizes when higher-order page allocations fail. Benefits are
    a) fewer segments for the driver to process, b) fewer calls to the
    page allocator, and c) less wasted space.

    From Eric Dumazet.

    11) Allow GRO to be used on GRE tunnels, from Eric Dumazet.

    12) VXLAN device driver, one way to handle VLAN issues such as the
    limitation of 4096 VLAN IDs yet still have some level of isolation.
    From Stephen Hemminger.

    13) As usual there is a large boatload of driver changes, with the scale
    perhaps tilted towards the wireless side this time around.

    Fix up various fairly trivial conflicts, mostly caused by the user
    namespace changes.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1012 commits)
    hyperv: Add buffer for extended info after the RNDIS response message.
    hyperv: Report actual status in receive completion packet
    hyperv: Remove extra allocated space for recv_pkt_list elements
    hyperv: Fix page buffer handling in rndis_filter_send_request()
    hyperv: Fix the missing return value in rndis_filter_set_packet_filter()
    hyperv: Fix the max_xfer_size in RNDIS initialization
    vxlan: put UDP socket in correct namespace
    vxlan: Depend on CONFIG_INET
    sfc: Fix the reported priorities of different filter types
    sfc: Remove EFX_FILTER_FLAG_RX_OVERRIDE_IP
    sfc: Fix loopback self-test with separate_tx_channels=1
    sfc: Fix MCDI structure field lookup
    sfc: Add parentheses around use of bitfield macro arguments
    sfc: Fix null function pointer in efx_sriov_channel_type
    vxlan: virtual extensible lan
    igmp: export symbol ip_mc_leave_group
    netlink: add attributes to fdb interface
    tg3: unconditionally select HWMON support when tg3 is enabled.
    Revert "net: ti cpsw ethernet: allow reading phy interface mode from DT"
    gre: fix sparse warning
    ...

    Linus Torvalds
     

02 Oct, 2012

1 commit

  • Pull scheduler changes from Ingo Molnar:
    "Continued quest to clean up and enhance the cputime code by Frederic
    Weisbecker, in preparation for future tickless kernel features.

    Other than that, smallish changes."

    Fix up trivial conflicts due to additions next to each other in arch/{x86/}Kconfig

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
    cputime: Make finegrained irqtime accounting generally available
    cputime: Gather time/stats accounting config options into a single menu
    ia64: Reuse system and user vtime accounting functions on task switch
    ia64: Consolidate user vtime accounting
    vtime: Consolidate system/idle context detection
    cputime: Use a proper subsystem naming for vtime related APIs
    sched: cpu_power: enable ARCH_POWER
    sched/nohz: Clean up select_nohz_load_balancer()
    sched: Fix load avg vs. cpu-hotplug
    sched: Remove __ARCH_WANT_INTERRUPTS_ON_CTXSW
    sched: Fix nohz_idle_balance()
    sched: Remove useless code in yield_to()
    sched: Add time unit suffix to sched sysctl knobs
    sched/debug: Limit sd->*_idx range on sysctl
    sched: Remove AFFINE_WAKEUPS feature flag
    s390: Remove leftover account_tick_vtime() header
    cputime: Consolidate vtime handling on context switch
    sched: Move cputime code to its own file
    cputime: Generalize CONFIG_VIRT_CPU_ACCOUNTING
    tile: Remove SD_PREFER_LOCAL leftover
    ...

    Linus Torvalds
     

01 Oct, 2012

1 commit

  • Let architectures select GENERIC_KERNEL_THREAD and have their
    copy_thread() treat NULL regs as "it came from kernel_thread(); the sp
    argument contains the function the new thread will be calling, and
    stack_size the argument for that function". Switching the
    architectures over begins shortly...
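
    Under this convention an architecture's copy_thread() ends up with a
    shape roughly like the sketch below; the register and thread-struct
    field names are hypothetical, not any particular architecture's code:

    int copy_thread(unsigned long clone_flags, unsigned long sp,
                    unsigned long stack_size, struct task_struct *p,
                    struct pt_regs *regs)
    {
            struct pt_regs *childregs = task_pt_regs(p);

            if (!regs) {
                    /* kernel thread: arrange for ret_from_kernel_thread
                     * to call sp(stack_size); sp is the function here */
                    memset(childregs, 0, sizeof(*childregs));
                    p->thread.pc  = (unsigned long)ret_from_kernel_thread;
                    p->thread.fn  = sp;             /* function to run */
                    p->thread.arg = stack_size;     /* its argument */
            } else {
                    /* user thread: copy parent registers, set new stack */
                    *childregs = *regs;
                    childregs->sp = sp;
                    p->thread.pc  = (unsigned long)ret_from_fork;
            }
            p->thread.sp = (unsigned long)childregs;
            return 0;
    }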

    Signed-off-by: Al Viro

    Al Viro
     

25 Sep, 2012

1 commit

  • We currently use a per-socket order-0 page cache for tcp_sendmsg()
    operations.

    This page is used to build fragments for skbs.

    It's done to increase the probability of coalescing small write()s
    into single segments in skbs still in the write queue (not yet sent).

    But it wastes a lot of memory for applications handling many mostly
    idle sockets, since each socket holds one page in sk->sk_sndmsg_page.

    It's also quite inefficient to build TSO 64KB packets, because we need
    about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit the
    page allocator more often than wanted.

    This patch adds a per-task frag allocator and uses bigger pages,
    if available. An automatic fallback is done in case of memory pressure.

    (up to 32768 bytes per frag, that's order-3 pages on x86)

    This increases TCP stream performance by 20% on the loopback device,
    but also benefits other network devices, since 8x fewer frags are
    mapped on transmit and unmapped on tx completion. Alexander Duyck
    mentioned a probable performance win on systems with IOMMU enabled.

    It's possible some SG-enabled hardware can't cope with bigger
    fragments, but their ndo_start_xmit() should already handle this,
    splitting a fragment into sub-fragments, since some arches have
    PAGE_SIZE=65536.

    Successfully tested on various ethernet devices.
    (ixgbe, igb, bnx2x, tg3, mellanox mlx4)
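
    The refill policy is essentially "try a high-order page first, fall
    back to order-0 under memory pressure". A simplified sketch of the
    idea (the structure and function names here are illustrative, not the
    exact kernel interface):

    struct page_frag {
            struct page *page;
            unsigned int offset;
            unsigned int size;
    };

    static bool page_frag_refill(struct page_frag *pfrag, gfp_t gfp)
    {
            int order = 3;          /* up to 32KB chunks on 4KB-page arches */

            do {
                    gfp_t flags = gfp;

                    if (order)
                            flags |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY;
                    pfrag->page = alloc_pages(flags, order);
                    if (pfrag->page) {
                            pfrag->offset = 0;
                            pfrag->size = PAGE_SIZE << order;
                            return true;
                    }
            } while (--order >= 0);         /* fall back to smaller orders */

            return false;
    }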

    Signed-off-by: Eric Dumazet
    Cc: Ben Hutchings
    Cc: Vijay Subramanian
    Cc: Alexander Duyck
    Tested-by: Vijay Subramanian
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Sep, 2012

1 commit

  • Now that the last architecture to use this has stopped doing so (ARM,
    thanks Catalin!) we can remove this complexity from the scheduler
    core.

    Signed-off-by: Peter Zijlstra
    Cc: Oleg Nesterov
    Cc: Catalin Marinas
    Link: http://lkml.kernel.org/n/tip-g9p2a1w81xxbrze25v9zpzbf@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

29 Aug, 2012

3 commits

  • Now that we have uprobe_dup_mmap() we can fold uprobe_reset_state()
    into the new hook and remove it. mmput()->uprobe_clear_state() can't
    be called before dup_mmap().

    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju

    Oleg Nesterov
     
  • Add the new MMF_HAS_UPROBES flag. It is set by install_breakpoint()
    and it is copied by dup_mmap(), uprobe_pre_sstep_notifier() checks
    it to avoid the slow path if the task was never probed. Perhaps it
    makes sense to check it in valid_vma(is_register => false) as well.

    This needs the new dup_mmap()->uprobe_dup_mmap() hook. We can't use
    uprobe_reset_state() or put MMF_HAS_UPROBES into MMF_INIT_MASK; we
    need oldmm->mmap_sem to avoid the race with uprobe_register() or
    mmap() from another thread.

    Currently we never clear this bit; it can be a false positive after
    uprobe_unregister() or uprobe_munmap(), or if dup_mmap() hits the
    probed VM_DONTCOPY vma. But this is fine correctness-wise and has
    no effect unless the task hits a non-uprobe breakpoint.
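
    The fast path this buys is a single bit test before taking any uprobes
    locks, roughly:

    /* in uprobe_pre_sstep_notifier(), before doing any real work */
    if (!current->mm || !test_bit(MMF_HAS_UPROBES, &current->mm->flags))
            return 0;       /* this mm was never probed */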

    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju

    Oleg Nesterov
     
  • 1. Kill dup_mmap()->uprobe_mmap(); it was only needed to calculate
    new_mm->uprobes_state.count, which was removed by the previous patch.

    If the forking process has a pending uprobe (int3) in a vma, it will
    be copied by copy_page_range(); note that copy_page_range() checks
    vma->anon_vma, so "Don't copy ptes" is not possible after
    install_breakpoint(), which does anon_vma_prepare().

    2. Remove is_swbp_at_addr() and "int count" in uprobe_mmap(). Again,
    this was needed for uprobes_state.count.

    As a side effect this fixes the bug pointed out by Srikar:
    this code lacked the necessary put_uprobe().

    3. uprobe_munmap() becomes a nop after the previous patch. Remove the
    now-meaningless code but do not remove the helper; we will need it.

    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju

    Oleg Nesterov
     

21 Aug, 2012

1 commit

  • This patch fixes:

    https://bugzilla.redhat.com/show_bug.cgi?id=843640

    If mmap_region()->uprobe_mmap() fails, the unmap_and_free_vma path
    does unmap_region() but does not remove the soon-to-be-freed vma
    from the rb tree. Actually there are more problems, but this is how
    William noticed this bug.

    Perhaps we could do do_munmap() + return in this case, but in
    fact it is simply wrong to abort if uprobe_mmap() fails, at least
    until we move the !UPROBE_COPY_INSN code from
    install_breakpoint() to uprobe_register().

    For example, uprobe_mmap()->install_breakpoint() can fail if the
    probed insn is not supported (remember, uprobe_register()
    succeeds if nobody mmaps inode/offset); mmap() should not fail
    in this case.

    dup_mmap()->uprobe_mmap() is wrong too, for the same reason:
    fork() can race with uprobe_register() and fail for no reason if
    it wins the race and does install_breakpoint() first.

    And, if nothing else, both mmap_region() and dup_mmap() return
    success if uprobe_mmap() fails. Change them to ignore the error
    code from uprobe_mmap().
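
    In other words, the call sites go from treating uprobe_mmap() as fatal
    to treating it as best-effort; roughly (a sketch of mmap_region()):

    /* before: a failure sent us down the (broken) unmap_and_free_vma path */
    if (file && uprobe_mmap(vma))
            goto unmap_and_free_vma;

    /* after: install what we can and ignore the error */
    if (file)
            uprobe_mmap(vma);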

    Reported-and-tested-by: William Cohen
    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju
    Cc: # v3.5
    Cc: Anton Arapov
    Cc: William Cohen
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20120819171042.GB26957@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     

01 Aug, 2012

2 commits

  • Sanity:

    CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG
    CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP
    CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED
    CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM

    [mhocko@suse.cz: fix missed bits]
    Cc: Glauber Costa
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Aneesh Kumar K.V
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • vm_stat_account() currently accounts shared_vm, stack_vm and
    reserved_vm. We can also account for total_vm in vm_stat_account(),
    which makes the code tidier.

    Even for mprotect_fixup(), we get the right result in the end.

    Signed-off-by: Huang Shijie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie
     

31 Jul, 2012

3 commits

  • The function dup_task() may fail at the following function calls, in
    this order:

    0) alloc_task_struct_node()
    1) alloc_thread_info_node()
    2) arch_dup_task_struct()

    A failure at 0) is not a problem; dup_task() can just return. But a
    failure at 1) requires releasing the task_struct allocated by 0)
    before returning. Likewise, a failure at 2) requires releasing the
    task_struct and thread_info allocated by 0) and 1).

    The existing error handling calls free_task_struct() and
    free_thread_info(), which not only release the task_struct and
    thread_info but also call the architecture-specific
    arch_release_task_struct() and arch_release_thread_info().

    The problem is that the task_struct and thread_info are not yet fully
    initialized at this point, but arch_release_task_struct() and
    arch_release_thread_info() are called on them.

    For example, x86 defines its own arch_release_task_struct() that
    releases a task_xstate. If alloc_thread_info_node() fails in
    dup_task(), this error handling calls arch_release_task_struct() on a
    task_struct that has just been allocated and is still filled with
    garbage.

    This actually happened with tools/testing/fault-injection/failcmd.sh

    # env FAILCMD_TYPE=fail_page_alloc \
    ./tools/testing/fault-injection/failcmd.sh --times=100 \
    --min-order=0 --ignore-gfp-wait=0 \
    -- make -C tools/testing/selftests/ run_tests

    In order to fix this issue, make free_{task_struct,thread_info}() not
    call arch_release_{task_struct,thread_info}(), and instead call
    arch_release_{task_struct,thread_info}() explicitly where needed.

    The default arch_release_task_struct() and arch_release_thread_info()
    are defined as empty. So this change only affects the architectures
    which implement their own arch_release_task_struct() or
    arch_release_thread_info(), as listed below.

    arch_release_task_struct(): x86, sh
    arch_release_thread_info(): mn10300, tile
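
    With that, the error path in dup_task() ends up roughly as in the
    sketch below (illustrative, not the verbatim kernel/fork.c code):

    static struct task_struct *dup_task_sketch(struct task_struct *orig,
                                               int node)
    {
            struct task_struct *tsk;
            struct thread_info *ti;

            tsk = alloc_task_struct_node(node);
            if (!tsk)
                    return NULL;

            ti = alloc_thread_info_node(tsk, node);
            if (!ti)
                    goto free_tsk;

            if (arch_dup_task_struct(tsk, orig))
                    goto free_ti;

            tsk->stack = ti;
            return tsk;

    free_ti:
            free_thread_info(ti);   /* no arch_release_thread_info() here */
    free_tsk:
            free_task_struct(tsk);  /* no arch_release_task_struct() here */
            return NULL;
    }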

    Signed-off-by: Akinobu Mita
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: David Howells
    Cc: Koichi Yasutake
    Cc: Paul Mundt
    Cc: Chris Metcalf
    Cc: Salman Qazi
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • To make way for "fork: fix error handling in dup_task()", which fixes the
    errors more completely.

    Cc: Salman Qazi
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The current code can be replaced by vma_pages(). So use it to simplify
    the code.
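
    For reference, the helper is just the usual size computation:

    /* include/linux/mm.h */
    static inline unsigned long vma_pages(struct vm_area_struct *vma)
    {
            return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
    }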

    [akpm@linux-foundation.org: initialise `len' at its definition site]
    Signed-off-by: Huang Shijie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie
     

24 Jul, 2012

1 commit

  • Pull the big VFS changes from Al Viro:
    "This one is *big* and changes quite a few things around VFS. What's in there:

    - the first of two really major architecture changes - death to open
    intents.

    The former is finally there; it was very long in making, but with
    Miklos getting through really hard and messy final push in
    fs/namei.c, we finally have it. Unlike his variant, this one
    doesn't introduce struct opendata; what we have instead is
    ->atomic_open() taking preallocated struct file * and passing
    everything via its fields.

    Instead of returning struct file *, it returns -E... on error, 0
    on success and 1 in "deal with it yourself" case (e.g. symlink
    found on server, etc.).

    See comments before fs/namei.c:atomic_open(). That made a lot of
    goodies finally possible and quite a few are in that pile:
    ->lookup(), ->d_revalidate() and ->create() do not get struct
    nameidata * anymore; ->lookup() and ->d_revalidate() get lookup
    flags instead, ->create() gets "do we want it exclusive" flag.

    With the introduction of new helper (kern_path_locked()) we are rid
    of all struct nameidata instances outside of fs/namei.c; it's still
    visible in namei.h, but not for long. Come the next cycle,
    declaration will move either to fs/internal.h or to fs/namei.c
    itself. [me, miklos, hch]

    - The second major change: behaviour of final fput(). Now we have
    __fput() done without any locks held by caller *and* not from deep
    in call stack.

    That obviously lifts a lot of constraints on the locking in there.
    Moreover, it's legal now to call fput() from atomic contexts (which
    has immediately simplified life for aio.c). We also don't need
    anti-recursion logics in __scm_destroy() anymore.

    There is a price, though - the damn thing has become partially
    asynchronous. For fput() from normal process we are guaranteed
    that pending __fput() will be done before the caller returns to
    userland, exits or gets stopped for ptrace.

    For kernel threads and atomic contexts it's done via
    schedule_work(), so theoretically we might need a way to make sure
    it's finished; so far only one such place had been found, but there
    might be more.

    There's flush_delayed_fput() (do all pending __fput()) and there's
    __fput_sync() (fput() analog doing __fput() immediately). I hope
    we won't need them often; see warnings in fs/file_table.c for
    details. [me, based on task_work series from Oleg merged last
    cycle]

    - sync series from Jan

    - large part of "death to sync_supers()" work from Artem; the only
    bits missing here are exofs and ext4 ones. As far as I understand,
    those are going via the exofs and ext4 trees resp.; once they are
    in, we can put ->write_super() to rest, along with the thread
    calling it.

    - preparatory bits from unionmount series (from dhowells).

    - assorted cleanups and fixes all over the place, as usual.

    This is not the last pile for this cycle; there's at least jlayton's
    ESTALE work and fsfreeze series (the latter - in dire need of fixes,
    so I'm not sure it'll make the cut this cycle). I'll probably throw
    symlink/hardlink restrictions stuff from Kees into the next pile, too.
    Plus there's a lot of misc patches I hadn't thrown into that one -
    it's large enough as it is..."

    * 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (127 commits)
    ext4: switch EXT4_IOC_RESIZE_FS to mnt_want_write_file()
    btrfs: switch btrfs_ioctl_balance() to mnt_want_write_file()
    switch dentry_open() to struct path, make it grab references itself
    spufs: shift dget/mntget towards dentry_open()
    zoran: don't bother with struct file * in zoran_map
    ecryptfs: don't reinvent the wheels, please - use struct completion
    don't expose I_NEW inodes via dentry->d_inode
    tidy up namei.c a bit
    unobfuscate follow_up() a bit
    ext3: pass custom EOF to generic_file_llseek_size()
    ext4: use core vfs llseek code for dir seeks
    vfs: allow custom EOF in generic_file_llseek code
    vfs: Avoid unnecessary WB_SYNC_NONE writeback during sys_sync and reorder sync passes
    vfs: Remove unnecessary flushing of block devices
    vfs: Make sys_sync writeout also block device inodes
    vfs: Create function for iterating over block devices
    vfs: Reorder operations during sys_sync
    quota: Move quota syncing to ->sync_fs method
    quota: Split dquot_quota_sync() to writeback and cache flushing part
    vfs: Move noop_backing_dev_info check from sync into writeback
    ...

    Linus Torvalds
     

23 Jul, 2012

1 commit

  • Layout based on Oleg's suggestion: a singly-linked list where
    task->task_works points to the last element and the forward pointer
    from said last element points to the head. I'd still prefer a much
    more regular scheme with two pointers in task_work, but...

    Signed-off-by: Al Viro

    Al Viro
     

06 Jul, 2012

1 commit

  • In dup_task_struct(), if arch_dup_task_struct() fails, the cleanup
    code fails to clean up correctly. That's because the cleanup code
    depends on the uninitialized ti->task pointer. We fix this by making
    sure that the task and thread_info know about each other before we
    attempt to take the error path.
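
    A sketch of the reordering (set up the back-pointers first, only then
    attempt the call that may fail):

    /* in dup_task_struct(), after tsk and ti are allocated */
    tsk->stack = ti;
    setup_thread_stack(tsk, orig);  /* ti->task now points back at tsk */

    err = arch_dup_task_struct(tsk, orig);
    if (err)
            goto out;               /* the error path can rely on ti->task */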

    Signed-off-by: Salman Qazi
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20120626011815.11323.5533.stgit@dungbeetle.mtv.corp.google.com
    Signed-off-by: Ingo Molnar

    Salman Qazi
     

08 Jun, 2012

2 commits

  • This reverts commit 40af1bbdca47e5c8a2044039bb78ca8fd8b20f94.

    It's horribly and utterly broken for at least the following reasons:

    - calling sync_mm_rss() from mmput() is fundamentally wrong, because
    there's absolutely no reason to believe that the task that does the
    mmput() always does it on its own VM. Example: fork, ptrace, /proc -
    you name it.

    - calling it *after* having done mmdrop() on it is doubly insane, since
    the mm struct may well be gone now.

    - testing mm against NULL before you call it is insane too, since a
    NULL mm there would have caused oopses long before.

    .. and those are just the three bugs I found before I decided to give up
    looking for more and revert it asap. I should have caught it before I
    even took it, but I trusted Andrew too much.

    Cc: Konstantin Khlebnikov
    Cc: Markus Trippelsdorf
    Cc: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Oleg Nesterov
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • mm->rss_stat counters have per-task delta: task->rss_stat. Before
    changing task->mm pointer the kernel must flush this delta with
    sync_mm_rss().

    do_exit() already calls sync_mm_rss() to flush the rss-counters before
    committing the rss statistics into task->signal->maxrss, taskstats,
    audit and other stuff. Unfortunately the kernel does this before
    calling mm_release(), which can call put_user() for processing
    task->clear_child_tid. So at this point we can trigger page-faults and
    task->rss_stat becomes non-zero again. As a result mm->rss_stat becomes
    inconsistent and check_mm() will print something like this:

    | BUG: Bad rss-counter state mm:ffff88020813c380 idx:1 val:-1
    | BUG: Bad rss-counter state mm:ffff88020813c380 idx:2 val:1

    This patch moves sync_mm_rss() into mm_release(), and moves mm_release()
    out of do_exit() and calls it earlier. After mm_release() there should
    be no pagefaults.
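
    The flush itself just folds the per-task deltas into the mm and zeroes
    them; roughly:

    void sync_mm_rss(struct mm_struct *mm)
    {
            int i;

            for (i = 0; i < NR_MM_COUNTERS; i++) {
                    if (current->rss_stat.count[i]) {
                            add_mm_counter(mm, i, current->rss_stat.count[i]);
                            current->rss_stat.count[i] = 0;
                    }
            }
            current->rss_stat.events = 0;
    }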

    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: Konstantin Khlebnikov
    Reported-by: Markus Trippelsdorf
    Cc: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Oleg Nesterov
    Cc: [3.4.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

01 Jun, 2012

3 commits

  • Pull second pile of signal handling patches from Al Viro:
    "This one is just task_work_add() series + remaining prereqs for it.

    There probably will be another pull request from that tree this
    cycle - at least for helpers, to get them out of the way for per-arch
    fixes remaining in the tree."

    Fix trivial conflict in kernel/irq/manage.c: the merge of Andrew's pile
    had brought in commit 97fd75b7b8e0 ("kernel/irq/manage.c: use the
    pr_foo() infrastructure to prefix printks") which changed one of the
    pr_err() calls that this merge moves around.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    keys: kill task_struct->replacement_session_keyring
    keys: kill the dummy key_replace_session_keyring()
    keys: change keyctl_session_to_parent() to use task_work_add()
    genirq: reimplement exit_irq_thread() hook via task_work_add()
    task_work_add: generic process-context callbacks
    avr32: missed _TIF_NOTIFY_RESUME on one of do_notify_resume callers
    parisc: need to check NOTIFY_RESUME when exiting from syscall
    move key_repace_session_keyring() into tracehook_notify_resume()
    TIF_NOTIFY_RESUME is defined on all targets now

    Linus Torvalds
     
  • Merge misc patches from Andrew Morton:

    - the "misc" tree - stuff from all over the map

    - checkpatch updates

    - fatfs

    - kmod changes

    - procfs

    - cpumask

    - UML

    - kexec

    - mqueue

    - rapidio

    - pidns

    - some checkpoint-restore feature work. Reluctantly. Most of it
    delayed a release. I'm still rather worried that we don't have a
    clear roadmap to completion for this work.

    * emailed from Andrew Morton : (78 patches)
    kconfig: update compression algorithm info
    c/r: prctl: add ability to set new mm_struct::exe_file
    c/r: prctl: extend PR_SET_MM to set up more mm_struct entries
    c/r: procfs: add arg_start/end, env_start/end and exit_code members to /proc/$pid/stat
    syscalls, x86: add __NR_kcmp syscall
    fs, proc: introduce /proc/<pid>/task/<tid>/children entry
    sysctl: make kernel.ns_last_pid control dependent on CHECKPOINT_RESTORE
    aio/vfs: cleanup of rw_copy_check_uvector() and compat_rw_copy_check_uvector()
    eventfd: change int to __u64 in eventfd_signal()
    fs/nls: add Apple NLS
    pidns: make killed children autoreap
    pidns: use task_active_pid_ns in do_notify_parent
    rapidio/tsi721: add DMA engine support
    rapidio: add DMA engine support for RIO data transfers
    ipc/mqueue: add rbtree node caching support
    tools/selftests: add mq_perf_tests
    ipc/mqueue: strengthen checks on mqueue creation
    ipc/mqueue: correct mq_attr_ok test
    ipc/mqueue: improve performance of send/recv
    selftests: add mq_open_tests
    ...

    Linus Torvalds
     
  • The child should wake up the parent from vfork() only after finishing
    all operations on the shared mm. There is no sense in using
    CLONE_CHILD_CLEARTID together with CLONE_VFORK, but it looks more
    accurate now.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Konstantin Khlebnikov
    Cc: Markus Trippelsdorf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

30 May, 2012

3 commits

  • Merge block/IO core bits from Jens Axboe:
    "This is a bit bigger on the core side than usual, but that is purely
    because we decided to hold off on parts of Tejun's submission on 3.4
    to give it a bit more time to simmer. As a consequence, it's seen a
    long cycle in for-next.

    It contains:

    - Bug fix from Dan, wrong locking type.
    - Relax splice gifting restriction from Eric.
    - A ton of updates from Tejun, primarily for blkcg. This improves
    the code a lot, making the API nicer and cleaner, and also includes
    fixes for how we handle and tie policies and re-activate on
    switches. The changes also include generic bug fixes.
    - A simple fix from Vivek, along with a fix for doing proper delayed
    allocation of the blkcg stats."

    Fix up annoying conflict just due to different merge resolution in
    Documentation/feature-removal-schedule.txt

    * 'for-3.5/core' of git://git.kernel.dk/linux-block: (92 commits)
    blkcg: tg_stats_alloc_lock is an irq lock
    vmsplice: relax alignement requirements for SPLICE_F_GIFT
    blkcg: use radix tree to index blkgs from blkcg
    blkcg: fix blkcg->css ref leak in __blkg_lookup_create()
    block: fix elvpriv allocation failure handling
    block: collapse blk_alloc_request() into get_request()
    blkcg: collapse blkcg_policy_ops into blkcg_policy
    blkcg: embed struct blkg_policy_data in policy specific data
    blkcg: mass rename of blkcg API
    blkcg: style cleanups for blk-cgroup.h
    blkcg: remove blkio_group->path[]
    blkcg: blkg_rwstat_read() was missing inline
    blkcg: shoot down blkgs if all policies are deactivated
    blkcg: drop stuff unused after per-queue policy activation update
    blkcg: implement per-queue policy activation
    blkcg: add request_queue->root_blkg
    blkcg: make request_queue bypassing on allocation
    blkcg: make sure blkg_lookup() returns %NULL if @q is bypassing
    blkcg: make blkg_conf_prep() take @pol and return with queue lock held
    blkcg: remove static policy ID enums
    ...

    Linus Torvalds
     
  • The vma length in dup_mmap is calculated and stored in an unsigned
    int, which is insufficient and hence overflows for very large maps
    (beyond 16TB). The following program demonstrates this:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/wait.h>

    #define GIG 1024 * 1024 * 1024L
    #define EXTENT 16393

    int main(void)
    {
            int i, r;
            void *m;
            char buf[1024];

            for (i = 0; i < EXTENT; i++) {
                    m = mmap(NULL, (size_t) 1 * 1024 * 1024 * 1024L,
                             PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);

                    if (m == (void *)-1)
                            printf("MMAP Failed: %p\n", m);
                    else
                            printf("%d : MMAP returned %p\n", i, m);

                    r = fork();

                    if (r == 0) {
                            printf("%d: succeeded\n", i);
                            return 0;
                    } else if (r < 0)
                            printf("FORK Failed: %d\n", r);
                    else if (r > 0)
                            wait(NULL);
            }
            return 0;
    }

    Increase the storage size of the result to unsigned long, which is
    sufficient for storing the difference between addresses.
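
    The fix itself is essentially a type change in dup_mmap(); sketched:

    /* kernel/fork.c, dup_mmap(): the VM_ACCOUNT charge calculation */
    unsigned long len;      /* was: unsigned int, which overflowed past 16TB */

    len = (mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT;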

    Signed-off-by: Siddhesh Poyarekar
    Cc: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Jens Axboe
    Cc: Peter Zijlstra
    Acked-by: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Siddhesh Poyarekar
     
  • The swap token code no longer fits in with the current VM model. It
    does not play well with cgroups or the better NUMA placement code in
    development, since we have only one swap token globally.

    It also has the potential to mess with scalability of the system, by
    increasing the number of non-reclaimable pages on the active and
    inactive anon LRU lists.

    Last but not least, the swap token code has been broken for a year
    without complaints, as reported by Konstantin Khlebnikov. This suggests
    we no longer have much use for it.

    The days of sub-1G memory systems with heavy use of swap are over. If
    we ever need thrashing reducing code in the future, we will have to
    implement something that does scale.

    Signed-off-by: Rik van Riel
    Cc: Konstantin Khlebnikov
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Acked-by: Bob Picco
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

25 May, 2012

1 commit

  • Pull user-space probe instrumentation from Ingo Molnar:
    "The uprobes code originates from SystemTap and has been used for years
    in Fedora and RHEL kernels. This version is much rewritten, reviews
    from PeterZ, Oleg and myself shaped the end result.

    This tree includes uprobes support in 'perf probe' - but SystemTap
    (and other tools) can take advantage of user probe points as well.

    Sample usage of uprobes via perf, for example to profile malloc()
    calls without modifying user-space binaries.

    First boot a new kernel with CONFIG_UPROBE_EVENT=y enabled.

    If you don't know which function you want to probe you can pick one
    from 'perf top' or can get a list all functions that can be probed
    within libc (binaries can be specified as well):

    $ perf probe -F -x /lib/libc.so.6

    To probe libc's malloc():

    $ perf probe -x /lib64/libc.so.6 malloc
    Added new event:
    probe_libc:malloc (on 0x7eac0)

    You can now use it in all perf tools, such as:

    perf record -e probe_libc:malloc -aR sleep 1

    Make use of it to create a call graph (as the flat profile is going to
    look very boring):

    $ perf record -e probe_libc:malloc -gR make
    [ perf record: Woken up 173 times to write data ]
    [ perf record: Captured and wrote 44.190 MB perf.data (~1930712 samples) ]

    $ perf report | less

    32.03% git libc-2.15.so [.] malloc
    |
    --- malloc

    29.49% cc1 libc-2.15.so [.] malloc
    |
    --- malloc
    |
    |--0.95%-- 0x208eb1000000000
    |
    |--0.63%-- htab_traverse_noresize

    11.04% as libc-2.15.so [.] malloc
    |
    --- malloc
    |

    7.15% ld libc-2.15.so [.] malloc
    |
    --- malloc
    |

    5.07% sh libc-2.15.so [.] malloc
    |
    --- malloc
    |
    4.99% python-config libc-2.15.so [.] malloc
    |
    --- malloc
    |
    4.54% make libc-2.15.so [.] malloc
    |
    --- malloc
    |
    |--7.34%-- glob
    | |
    | |--93.18%-- 0x41588f
    | |
    | --6.82%-- glob
    | 0x41588f

    ...

    Or:

    $ perf report -g flat | less

    # Overhead Command Shared Object Symbol
    # ........ ............. ............. ..........
    #
    32.03% git libc-2.15.so [.] malloc
    27.19%
    malloc

    29.49% cc1 libc-2.15.so [.] malloc
    24.77%
    malloc

    11.04% as libc-2.15.so [.] malloc
    11.02%
    malloc

    7.15% ld libc-2.15.so [.] malloc
    6.57%
    malloc

    ...

    The core uprobes design is fairly straightforward: uprobes probe
    points register themselves at (inode:offset) addresses of
    libraries/binaries, after which all existing (or new) vmas that map
    that address will have a software breakpoint injected at that address.
    vmas are COW-ed to preserve original content. The probe points are
    kept in an rbtree.

    If user-space executes the probed inode:offset instruction address
    then an event is generated which can be recovered from the regular
    perf event channels and mmap-ed ring-buffer.

    Multiple probes at the same address are supported, they create a
    dynamic callback list of event consumers.

    The basic model is further complicated by the XOL speedup: the
    original instruction that is probed is copied (in an architecture
    specific fashion) and executed out of line when the probe triggers.
    The XOL area is a single vma per process, with a fixed number of
    entries (which limits probe execution parallelism).

    The API: uprobes are installed/removed via
    /sys/kernel/debug/tracing/uprobe_events, the API is integrated to
    align with the kprobes interface as much as possible, but is separate
    to it.

    Injecting a probe point is a privileged operation, which can be relaxed
    by setting perf_paranoid to -1.

    You can use multiple probes as well and mix them with kprobes and
    regular PMU events or tracepoints, when instrumenting a task."

    Fix up trivial conflicts in mm/memory.c due to previous cleanup of
    unmap_single_vma().

    * 'perf-uprobes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
    perf probe: Detect probe target when m/x options are absent
    perf probe: Provide perf interface for uprobes
    tracing: Fix kconfig warning due to a typo
    tracing: Provide trace events interface for uprobes
    tracing: Extract out common code for kprobes/uprobes trace events
    tracing: Modify is_delete, is_return from int to bool
    uprobes/core: Decrement uprobe count before the pages are unmapped
    uprobes/core: Make background page replacement logic account for rss_stat counters
    uprobes/core: Optimize probe hits with the help of a counter
    uprobes/core: Allocate XOL slots for uprobes use
    uprobes/core: Handle breakpoint and singlestep exceptions
    uprobes/core: Rename bkpt to swbp
    uprobes/core: Make order of function parameters consistent across functions
    uprobes/core: Make macro names consistent
    uprobes: Update copyright notices
    uprobes/core: Move insn to arch specific structure
    uprobes/core: Remove uprobe_opcode_sz
    uprobes/core: Make instruction tables volatile
    uprobes: Move to kernel/events/
    uprobes/core: Clean up, refactor and improve the code
    ...

    Linus Torvalds
     

24 May, 2012

2 commits

  • Provide a simple mechanism that allows running code in the (nonatomic)
    context of an arbitrary task.

    The caller does task_work_add(task, task_work) and this task executes
    task_work->func() either from do_notify_resume() or from do_exit(). The
    callback can rely on PF_EXITING to detect the latter case.

    "struct task_work" can be embedded in another struct, still it has "void
    *data" to handle the most common/simple case.

    This allows us to kill the ->replacement_session_keyring hack, and
    potentially this can have more users.

    Performance-wise, this adds 2 "unlikely(!hlist_empty())" checks into
    tracehook_notify_resume() and do_exit(). But at the same time we can
    remove the "replacement_session_keyring != NULL" checks from
    arch/*/signal.c and exit_creds().

    Note: task_work_add/task_work_run abuses ->pi_lock. This is only because
    this lock is already used by lookup_pi_state() to synchronize with
    do_exit() setting PF_EXITING. Fortunately the scope of this lock in
    task_work.c is really tiny, and the code is unlikely anyway.
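
    A sketch of a consumer of this interface, assuming the signatures
    implied above (init_task_work() taking a function and a data pointer,
    task_work_add() taking a notify flag); handle_payload() is a
    hypothetical callback body:

    struct my_request {
            struct task_work work;
            int payload;
    };

    static void my_request_func(struct task_work *twork)
    {
            struct my_request *req = twork->data;

            /* runs in task context, either on return to userspace or,
             * if the task is exiting, from do_exit() (PF_EXITING set) */
            handle_payload(req->payload);
            kfree(req);
    }

    static int queue_my_request(struct task_struct *task,
                                struct my_request *req)
    {
            init_task_work(&req->work, my_request_func, req);
            return task_work_add(task, &req->work, true);   /* fails if exiting */
    }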

    Signed-off-by: Oleg Nesterov
    Acked-by: David Howells
    Cc: Thomas Gleixner
    Cc: Richard Kuo
    Cc: Linus Torvalds
    Cc: Alexander Gordeev
    Cc: Chris Zankel
    Cc: David Smith
    Cc: "Frank Ch. Eigler"
    Cc: Geert Uytterhoeven
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Tejun Heo
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Oleg Nesterov
     
  • Pull fpu state cleanups from Ingo Molnar:
    "This tree streamlines further aspects of FPU handling by eliminating
    the prepare_to_copy() complication and moving that logic to
    arch_dup_task_struct().

    It also fixes the FPU dumps in threaded core dumps, removes and old
    (and now invalid) assumption plus micro-optimizes the exit path by
    avoiding an FPU save for dead tasks."

    Fixed up trivial add-add conflict in arch/sh/kernel/process.c that came
    in because we now do the FPU handling in arch_dup_task_struct() rather
    than the legacy (and now gone) prepare_to_copy().

    * 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86, fpu: drop the fpu state during thread exit
    x86, xsave: remove thread_has_fpu() bug check in __sanitize_i387_state()
    coredump: ensure the fpu state is flushed for proper multi-threaded core dump
    fork: move the real prepare_to_copy() users to arch_dup_task_struct()

    Linus Torvalds
     

22 May, 2012

2 commits

  • Pull security subsystem updates from James Morris:
    "New notable features:
    - The seccomp work from Will Drewry
    - PR_{GET,SET}_NO_NEW_PRIVS from Andy Lutomirski
    - Longer security labels for Smack from Casey Schaufler
    - Additional ptrace restriction modes for Yama by Kees Cook"

    Fix up trivial context conflicts in arch/x86/Kconfig and include/linux/filter.h

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (65 commits)
    apparmor: fix long path failure due to disconnected path
    apparmor: fix profile lookup for unconfined
    ima: fix filename hint to reflect script interpreter name
    KEYS: Don't check for NULL key pointer in key_validate()
    Smack: allow for significantly longer Smack labels v4
    gfp flags for security_inode_alloc()?
    Smack: recursive tramsmute
    Yama: replace capable() with ns_capable()
    TOMOYO: Accept manager programs which do not start with / .
    KEYS: Add invalidation support
    KEYS: Do LRU discard in full keyrings
    KEYS: Permit in-place link replacement in keyring list
    KEYS: Perform RCU synchronisation on keys prior to key destruction
    KEYS: Announce key type (un)registration
    KEYS: Reorganise keys Makefile
    KEYS: Move the key config into security/keys/Kconfig
    KEYS: Use the compat keyctl() syscall wrapper on Sparc64 for Sparc32 compat
    Yama: remove an unused variable
    samples/seccomp: fix dependencies on arch macros
    Yama: add additional ptrace scopes
    ...

    Linus Torvalds
     
  • Pull smp hotplug cleanups from Thomas Gleixner:
    "This series is merely a cleanup of code copied around in arch/* and
    not changing any of the real cpu hotplug horrors yet. I wish I'd had
    something more substantial for 3.5, but I underestimated the lurking
    horror..."

    Fix up trivial conflicts in arch/{arm,sparc,x86}/Kconfig and
    arch/sparc/include/asm/thread_info_32.h

    * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (79 commits)
    um: Remove leftover declaration of alloc_task_struct_node()
    task_allocator: Use config switches instead of magic defines
    sparc: Use common threadinfo allocator
    score: Use common threadinfo allocator
    sh-use-common-threadinfo-allocator
    mn10300: Use common threadinfo allocator
    powerpc: Use common threadinfo allocator
    mips: Use common threadinfo allocator
    hexagon: Use common threadinfo allocator
    m32r: Use common threadinfo allocator
    frv: Use common threadinfo allocator
    cris: Use common threadinfo allocator
    x86: Use common threadinfo allocator
    c6x: Use common threadinfo allocator
    fork: Provide kmemcache based thread_info allocator
    tile: Use common threadinfo allocator
    fork: Provide weak arch_release_[task_struct|thread_info] functions
    fork: Move thread info gfp flags to header
    fork: Remove the weak insanity
    sh: Remove cpu_idle_wait()
    ...

    Linus Torvalds
     

17 May, 2012

1 commit

  • The historical prepare_to_copy() is mostly a no-op, duplicated across
    the majority of architectures, with the rest following the x86 model
    of flushing the extended register state (such as the FPU) there.

    Remove it and use arch_dup_task_struct() instead.
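
    On x86, for instance, the flush moves into arch_dup_task_struct(),
    which ends up roughly of this shape (a sketch, not the verbatim code):

    int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
    {
            int ret;

            unlazy_fpu(src);                /* flush extended register state */
            *dst = *src;

            if (fpu_allocated(&src->thread.fpu)) {
                    memset(&dst->thread.fpu, 0, sizeof(dst->thread.fpu));
                    ret = fpu_alloc(&dst->thread.fpu);
                    if (ret)
                            return ret;
                    fpu_copy(dst, src);
            }
            return 0;
    }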

    Suggested-by: Oleg Nesterov
    Suggested-by: Linus Torvalds
    Signed-off-by: Suresh Siddha
    Link: http://lkml.kernel.org/r/1336692811-30576-1-git-send-email-suresh.b.siddha@intel.com
    Acked-by: Benjamin Herrenschmidt
    Cc: David Howells
    Cc: Koichi Yasutake
    Cc: Paul Mackerras
    Cc: Paul Mundt
    Cc: Chris Zankel
    Cc: Richard Henderson
    Cc: Russell King
    Cc: Haavard Skinnemoen
    Cc: Mike Frysinger
    Cc: Mark Salter
    Cc: Aurelien Jacquiot
    Cc: Mikael Starvik
    Cc: Yoshinori Sato
    Cc: Richard Kuo
    Cc: Tony Luck
    Cc: Michal Simek
    Cc: Ralf Baechle
    Cc: Jonas Bonn
    Cc: James E.J. Bottomley
    Cc: Helge Deller
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Chen Liqin
    Cc: Lennox Wu
    Cc: David S. Miller
    Cc: Chris Metcalf
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Guan Xuetao
    Signed-off-by: H. Peter Anvin

    Suresh Siddha
     

11 May, 2012

1 commit

  • A fork() failure after namespace creation for a child cloned with
    CLONE_NEWPID leaks pid_namespace/mnt_cache, because proc is mounted
    during creation but not unmounted during cleanup. Call
    pid_ns_release_proc() during cleanup.
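
    The cleanup lands in copy_process()'s error path; roughly:

    bad_fork_cleanup_namespaces:
            if (unlikely(clone_flags & CLONE_NEWPID))
                    pid_ns_release_proc(task_active_pid_ns(p));
            exit_task_namespaces(p);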

    Signed-off-by: Mike Galbraith
    Acked-by: Oleg Nesterov
    Reviewed-by: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Cc: Louis Rilling
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Galbraith
     

08 May, 2012

2 commits