17 Dec, 2014

1 commit

  • Pull vfs pile #2 from Al Viro:
    "Next pile (and there'll be one or two more).

    The large piece in this one is getting rid of /proc/*/ns/* weirdness;
    among other things, it (finally) allows making nameidata completely
    opaque outside of fs/namei.c, making for easier further cleanups in
    there"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    coda_venus_readdir(): use file_inode()
    fs/namei.c: fold link_path_walk() call into path_init()
    path_init(): don't bother with LOOKUP_PARENT in argument
    fs/namei.c: new helper (path_cleanup())
    path_init(): store the "base" pointer to file in nameidata itself
    make default ->i_fop have ->open() fail with ENXIO
    make nameidata completely opaque outside of fs/namei.c
    kill proc_ns completely
    take the targets of /proc/*/ns/* symlinks to separate fs
    bury struct proc_ns in fs/proc
    copy address of proc_ns_ops into ns_common
    new helpers: ns_alloc_inum/ns_free_inum
    make proc_ns_operations work with struct ns_common * instead of void *
    switch the rest of proc_ns_operations to working with &...->ns
    netns: switch ->get()/->put()/->install()/->inum() to working with &net->ns
    make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns
    common object embedded into various struct ....ns

    Linus Torvalds
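
The "common object embedded into various struct ....ns" item can be illustrated with a small userspace sketch. All names below (demo_ns_common, demo_pid_ns, demo_to_pid_ns) are hypothetical stand-ins, not the kernel's definitions; the point is only that generic code can work on the embedded member while type-specific code recovers the enclosing structure.

```c
#include <stddef.h>

/* Hypothetical stand-in for the common object described above. */
struct demo_ns_common {
    unsigned int inum;   /* inode number shown for the namespace */
};

/* Hypothetical namespace type that embeds the common object. */
struct demo_pid_ns {
    int level;
    struct demo_ns_common ns;
};

/* Recover the enclosing namespace from a pointer to its embedded member. */
static struct demo_pid_ns *demo_to_pid_ns(struct demo_ns_common *ns)
{
    return (struct demo_pid_ns *)((char *)ns - offsetof(struct demo_pid_ns, ns));
}

/* A generic operation only needs to see the common part. */
static unsigned int demo_ns_inum(const struct demo_ns_common *ns)
{
    return ns->inum;
}
```

Passing &p.ns instead of a void * lets the compiler check the argument type, which appears to be the motivation behind the "struct ns_common * instead of void *" item in the list.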
     

11 Dec, 2014

1 commit

  • alloc_pid() does get_pid_ns() beforehand but forgets to put_pid_ns() if it
    fails because disable_pid_allocation() was called by the exiting
    child_reaper.

    We could simply move get_pid_ns() down to successful return, but this fix
    tries to be as trivial as possible.

    Signed-off-by: Oleg Nesterov
    Reviewed-by: "Eric W. Biederman"
    Cc: Aaron Tomlin
    Cc: Pavel Emelyanov
    Cc: Serge Hallyn
    Cc: Sterling Alexander
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
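
The leak described above can be modelled in userspace with a plain counter. This is a minimal sketch, assuming hypothetical demo_* types and helpers; it is not the kernel's alloc_pid(), only the shape of the fix: a reference taken up front has to be dropped again on the failure path.

```c
#include <stddef.h>

struct demo_pid_ns {
    int refcount;        /* references held on the namespace */
    int alloc_disabled;  /* set once the reaper is exiting */
};

struct demo_pid {
    struct demo_pid_ns *ns;
};

static void demo_get_ns(struct demo_pid_ns *ns) { ns->refcount++; }
static void demo_put_ns(struct demo_pid_ns *ns) { ns->refcount--; }

/* Returns 0 on success, -1 if allocation has been disabled. */
static int demo_alloc_pid(struct demo_pid_ns *ns, struct demo_pid *pid)
{
    demo_get_ns(ns);            /* reference taken up front */
    if (ns->alloc_disabled) {
        demo_put_ns(ns);        /* the error path must drop it again */
        return -1;
    }
    pid->ns = ns;               /* success: the pid now owns the reference */
    return 0;
}
```

The alternative mentioned in the message, taking the reference only on the successful return, would avoid the extra put but moves more code.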
     

05 Dec, 2014

2 commits


01 Oct, 2013

1 commit

  • "case 0" in free_pid() assumes that disable_pid_allocation() should
    clear PIDNS_HASH_ADDING before the last pid goes away.

    However this doesn't happen if the first fork() fails to create the
    child reaper which should call disable_pid_allocation().

    Signed-off-by: Oleg Nesterov
    Reviewed-by: "Eric W. Biederman"
    Cc: "Serge E. Hallyn"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

31 Aug, 2013

1 commit

  • Serge Hallyn writes:

    > Since commit af4b8a83add95ef40716401395b44a1b579965f4 it's been
    > possible to get into a situation where a pidns reaper is
    > <defunct>, reparented to host pid 1, but never reaped. How to
    > reproduce this is documented at
    >
    > https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1168526
    > (and see
    > https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1168526/comments/13)
    > In short, run repeated starts of a container whose init is
    >
    > Process.exit(0);
    >
    > sysrq-t when such a task is playing zombie shows:
    >
    > [ 131.132978] init x ffff88011fc14580 0 2084 2039 0x00000000
    > [ 131.132978] ffff880116e89ea8 0000000000000002 ffff880116e89fd8 0000000000014580
    > [ 131.132978] ffff880116e89fd8 0000000000014580 ffff8801172a0000 ffff8801172a0000
    > [ 131.132978] ffff8801172a0630 ffff88011729fff0 ffff880116e14650 ffff88011729fff0
    > [ 131.132978] Call Trace:
    > [ 131.132978] [] schedule+0x29/0x70
    > [ 131.132978] [] do_exit+0x6e1/0xa40
    > [ 131.132978] [] ? signal_wake_up_state+0x1e/0x30
    > [ 131.132978] [] do_group_exit+0x3f/0xa0
    > [ 131.132978] [] SyS_exit_group+0x14/0x20
    > [ 131.132978] [] tracesys+0xe1/0xe6
    >
    > Further debugging showed that every time this happened, zap_pid_ns_processes()
    > started with nr_hashed being 3, while we were expecting it to drop to 2.
    > Any time it didn't happen, nr_hashed was 1 or 2. So the reaper was
    > waiting for nr_hashed to become 2, but free_pid() only wakes the reaper
    > if nr_hashed hits 1.

    The issue is that when the task group leader of an init process exits
    before the other tasks of the init process, then when the init process
    finally exits it will be a secondary task sleeping in
    zap_pid_ns_processes, waiting to be woken when the number of hashed
    pids drops to two. This case waits forever, as free_pid only sends a
    wake up when the number of hashed pids drops to 1.

    To correct this, the simple strategy of sending a possibly unnecessary
    wake up when the number of hashed pids drops to 2 is adopted.

    Sending one extraneous wake up is relatively harmless; at worst we
    waste a little cpu time in the rare case when a pid namespace
    approaches exiting.

    We can detect the case when the pid namespace drops to just two pids
    hashed race-free in free_pid.

    Dereferencing pid_ns->child_reaper with the pidmap_lock held is safe
    without the tasklist_lock because it is guaranteed that
    detach_pid will be called on the child_reaper before it is freed, and
    detach_pid calls __change_pid, which calls free_pid, which takes the
    pidmap_lock. __change_pid only calls free_pid if this is the
    last use of the pid. For a thread that is not the thread group leader,
    the thread's pid will only ever have one user because a thread's pid
    is not allowed to be the pid of a process, of a process group or
    a session. For a thread that is a thread group leader, all of
    the other threads of that process will be reaped before the
    thread group leader is allowed to be reaped, ensuring there will only
    be one user of the thread's pid as a process pid. Furthermore,
    because the thread is the init process of a pid namespace, all of the
    other processes in the pid namespace will have also been already freed,
    so the pid will not be used as a session pid or
    a process group pid for any other running process.

    CC: stable@vger.kernel.org
    Acked-by: Serge Hallyn
    Tested-by: Serge Hallyn
    Reported-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
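
The wake-up rule described in this message can be sketched as a small decision function. This is a userspace model with hypothetical names (demo_action, demo_free_pid), not the kernel's free_pid(); it only shows that the reaper is notified at both 2 and 1 remaining hashed pids.

```c
enum demo_action { DEMO_NONE, DEMO_WAKE_REAPER, DEMO_SCHEDULE_CLEANUP };

/* Decrement the hashed-pid count and decide what to do next. */
static enum demo_action demo_free_pid(int *nr_hashed)
{
    switch (--*nr_hashed) {
    case 2:   /* reaper may be a secondary thread waiting for a count of 2 */
    case 1:   /* the wake-up that was already sent before this change */
        return DEMO_WAKE_REAPER;
    case 0:   /* last pid gone: namespace state can be cleaned up */
        return DEMO_SCHEDULE_CLEANUP;
    default:
        return DEMO_NONE;
    }
}
```

With this shape, the wake-up at 2 is at worst redundant, which matches the "relatively harmless" argument in the message.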
     

04 Jul, 2013

2 commits

  • Move statement to static initialization of init_pid_ns.

    Signed-off-by: Raphael S. Carvalho
    Cc: "Eric W. Biederman"
    Acked-by: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Raphael S. Carvalho
     
  • copy_process() adds the new child to thread_group/init_task.tasks list and
    then does attach_pid(child, PIDTYPE_PID). This means that the lockless
    next_thread() or next_task() can see this thread with the wrong pid. Say,
    "ls /proc/pid/task" can list the same inode twice.

    We could move attach_pid(child, PIDTYPE_PID) up, but in this case
    find_task_by_vpid() can find the new thread before it was fully
    initialized.

    And this is already true for PIDTYPE_PGID/PIDTYPE_SID. With this patch
    copy_process() initializes child->pids[*].pid first, then calls
    attach_pid() to insert the task into the pid->tasks list.

    attach_pid() no longer needs the "struct pid*" argument, it is always
    called after pid_link->pid was already set.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Michal Hocko
    Cc: Pavel Emelyanov
    Cc: Sergey Dyasly
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

02 May, 2013

2 commits

  • Pull VFS updates from Al Viro:

    Misc cleanups all over the place, mainly wrt /proc interfaces (switch
    create_proc_entry to proc_create(), get rid of the deprecated
    create_proc_read_entry() in favor of using proc_create_data() and
    seq_file etc).

    7kloc removed.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
    don't bother with deferred freeing of fdtables
    proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
    proc: Make the PROC_I() and PDE() macros internal to procfs
    proc: Supply a function to remove a proc entry by PDE
    take cgroup_open() and cpuset_open() to fs/proc/base.c
    ppc: Clean up scanlog
    ppc: Clean up rtas_flash driver somewhat
    hostap: proc: Use remove_proc_subtree()
    drm: proc: Use remove_proc_subtree()
    drm: proc: Use minor->index to label things, not PDE->name
    drm: Constify drm_proc_list[]
    zoran: Don't print proc_dir_entry data in debug
    reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
    proc: Supply an accessor for getting the data from a PDE's parent
    airo: Use remove_proc_subtree()
    rtl8192u: Don't need to save device proc dir PDE
    rtl8187se: Use a dir under /proc/net/r8180/
    proc: Add proc_mkdir_data()
    proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
    proc: Move PDE_NET() to fs/proc/proc_net.c
    ...

    Linus Torvalds
     
  • Split the proc namespace stuff out into linux/proc_ns.h.

    Signed-off-by: David Howells
    cc: netdev@vger.kernel.org
    cc: Serge E. Hallyn
    cc: Eric W. Biederman
    Signed-off-by: Al Viro

    David Howells
     

01 May, 2013

2 commits

  • Move BITS_PER_PAGE from pid_namespace.c to pid_namespace.h, since we can
    simplify the define PID_MAP_ENTRIES by using the BITS_PER_PAGE.

    [akpm@linux-foundation.org: kernel/pid.c:54:1: warning: "BITS_PER_PAGE" redefined]
    Signed-off-by: Raphael S.Carvalho
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Raphael S.Carvalho
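
The relationship between the two defines can be worked through with concrete numbers. The page size and pid limit below are assumptions chosen for illustration (real values depend on architecture and configuration), and the DEMO_ names are hypothetical.

```c
/* Assumed values for illustration only. */
#define DEMO_PAGE_SIZE      4096UL
#define DEMO_PID_MAX_LIMIT  (4UL * 1024 * 1024)

/* One bitmap page tracks this many pids, one bit each. */
#define DEMO_BITS_PER_PAGE  (DEMO_PAGE_SIZE * 8)

/* Number of bitmap pages needed to cover every possible pid (rounded up). */
#define DEMO_PIDMAP_ENTRIES \
    ((DEMO_PID_MAX_LIMIT + DEMO_BITS_PER_PAGE - 1) / DEMO_BITS_PER_PAGE)

static unsigned long demo_pidmap_entries(void)
{
    return DEMO_PIDMAP_ENTRIES;
}
```

Under these assumptions one page holds 32768 bits, so 4194304 possible pids need 128 bitmap pages.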
     
  • find_next_offset() searches for an available clear bit in the
    respective pid bitmap (page). It returns the offset if one is found;
    otherwise it returns a value equal to BITS_PER_PAGE.

    If find_next_offset() finds no available bit, there is no point in
    calling mk_pid() (wasted CPU cycles).

    Therefore, it seems better to call mk_pid() only after the check
    (offset < BITS_PER_PAGE) has succeeded. Another point: if (offset
    < BITS_PER_PAGE) fails, mk_pid() would be called again afterwards
    anyway.

    [akpm@linux-foundation.org: simplify code]
    Signed-off-by: Raphael S. Carvalho
    Cc: "Eric W. Biederman"
    Cc: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Raphael S. Carvalho
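
The reordering argued for above can be sketched with a 64-bit word standing in for one bitmap page. The helper names and the 64-bit block size are assumptions made for this example; only the ordering (check the offset first, compute the pid afterwards) reflects the message.

```c
#define DEMO_BITS 64   /* assumed size of one bitmap block, in bits */

/* Return the first clear bit at or after 'start', or DEMO_BITS if none. */
static int demo_find_next_offset(unsigned long long map, int start)
{
    for (int off = start; off < DEMO_BITS; off++)
        if (!(map & (1ULL << off)))
            return off;
    return DEMO_BITS;
}

/* Hypothetical mk_pid(): block index and bit offset give the pid number. */
static int demo_mk_pid(int block, int offset)
{
    return block * DEMO_BITS + offset;
}

/* Return a free pid in 'block', or -1; the pid is computed only on success. */
static int demo_alloc_in_block(unsigned long long map, int block, int start)
{
    int offset = demo_find_next_offset(map, start);

    if (offset >= DEMO_BITS)
        return -1;                      /* nothing free: skip mk_pid */
    return demo_mk_pid(block, offset);
}
```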
     

28 Feb, 2013

1 commit

  • I'm not sure why, but the hlist for each entry iterators were conceived
    differently from the list ones, which take three parameters:

    list_for_each_entry(pos, head, member)

    The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

    Why did they need an extra pos parameter? I'm not quite sure. Not only
    do they not really need it, it also prevents the iterator from looking
    exactly like the list iterator, which is unfortunate.

    Besides the semantic patch, there was some manual work required:

    - Fix up the actual hlist iterators in linux/list.h
    - Fix up the declaration of other iterators based on the hlist ones.
    - A very small number of places were using the 'node' parameter, this
    was modified to use 'obj->member' instead.
    - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
    properly, so those had to be fixed up manually.

    The semantic patch which is mostly the work of Peter Senna Tschudin is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;

    [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
    [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: fix warnings]
    [akpm@linux-foundation.org: redo intrusive kvm changes]
    Tested-by: Peter Senna Tschudin
    Acked-by: Paul E. McKenney
    Signed-off-by: Sasha Levin
    Cc: Wu Fengguang
    Cc: Marcelo Tosatti
    Cc: Gleb Natapov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
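
A three-argument hlist-style iterator can be sketched in userspace as follows. The demo_* types and macros are hypothetical and much simpler than the kernel's list.h; the sketch relies on the __typeof__ extension available in GCC and Clang, and shows only that the entry pointer alone is enough to drive the loop.

```c
#include <stddef.h>

struct demo_hnode { struct demo_hnode *next; };
struct demo_hhead { struct demo_hnode *first; };

/* Map a node pointer to its containing entry, passing NULL through. */
#define demo_entry_safe(ptr, type, member) \
    ((ptr) ? (type *)((char *)(ptr) - offsetof(type, member)) : NULL)

/* Three-argument iterator: 'pos' is the entry pointer itself. */
#define demo_hlist_for_each_entry(pos, head, member)                          \
    for (pos = demo_entry_safe((head)->first, __typeof__(*(pos)), member);    \
         pos;                                                                 \
         pos = demo_entry_safe((pos)->member.next, __typeof__(*(pos)), member))

struct demo_item {
    int value;
    struct demo_hnode link;
};

/* Sum all values on the list using the three-argument form. */
static int demo_sum(struct demo_hhead *head)
{
    struct demo_item *it;
    int sum = 0;

    demo_hlist_for_each_entry(it, head, link)
        sum += it->value;
    return sum;
}
```

Code that previously used the separate node cursor can reach the node through the entry instead, which corresponds to the 'obj->member' fix-up mentioned in the list of manual work.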
     

13 Feb, 2013

1 commit


26 Dec, 2012

1 commit

  • Oleg pointed out that in a pid namespace the sequence:
    - pid 1 becomes a zombie
    - setns(thepidns), fork,...
    - reaping pid 1.
    - The injected processes exiting.

    can lead to processes attempting to access their child reaper and
    instead following a stale pointer.

    That waitpid for init can return before all of the processes in
    the pid namespace have exited is also unfortunate.

    Avoid these problems by disabling the allocation of new pids in a pid
    namespace when init dies, instead of when the last process in a pid
    namespace is reaped.

    Pointed-out-by: Oleg Nesterov
    Reviewed-by: Oleg Nesterov
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

18 Dec, 2012

3 commits

  • Merge misc patches from Andrew Morton:
    "Incoming:

    - lots of misc stuff

    - backlight tree updates

    - lib/ updates

    - Oleg's percpu-rwsem changes

    - checkpatch

    - rtc

    - aoe

    - more checkpoint/restart support

    I still have a pile of MM stuff pending - Pekka should be merging
    later today after which that is good to go. A number of other things
    are twiddling thumbs awaiting maintainer merges."

    * emailed patches from Andrew Morton : (180 commits)
    scatterlist: don't BUG when we can trivially return a proper error.
    docs: update documentation about /proc//fdinfo/ fanotify output
    fs, fanotify: add @mflags field to fanotify output
    docs: add documentation about /proc//fdinfo/ output
    fs, notify: add procfs fdinfo helper
    fs, exportfs: add exportfs_encode_inode_fh() helper
    fs, exportfs: escape nil dereference if no s_export_op present
    fs, epoll: add procfs fdinfo helper
    fs, eventfd: add procfs fdinfo helper
    procfs: add ability to plug in auxiliary fdinfo providers
    tools/testing/selftests/kcmp/kcmp_test.c: print reason for failure in kcmp_test
    breakpoint selftests: print failure status instead of cause make error
    kcmp selftests: print fail status instead of cause make error
    kcmp selftests: make run_tests fix
    mem-hotplug selftests: print failure status instead of cause make error
    cpu-hotplug selftests: print failure status instead of cause make error
    mqueue selftests: print failure status instead of cause make error
    vm selftests: print failure status instead of cause make error
    ubifs: use prandom_bytes
    mtd: nandsim: use prandom_bytes
    ...

    Linus Torvalds
     
  • Since commit 1cdcbec1a337 ("CRED: Neuter sys_capset()")
    is_container_init() has no callers.

    Signed-off-by: Gao feng
    Cc: David Howells
    Acked-by: Serge Hallyn
    Cc: James Morris
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gao feng
     
  • Pull user namespace changes from Eric Biederman:
    "While small this set of changes is very significant with respect to
    containers in general and user namespaces in particular. The user
    space interface is now complete.

    This set of changes adds support for unprivileged users to create user
    namespaces and as a user namespace root to create other namespaces.
    The tyranny of supporting suid root preventing unprivileged users from
    using cool new kernel features is broken.

    This set of changes completes the work on setns, adding support for
    the pid, user, mount namespaces.

    This set of changes includes a bunch of basic pid namespace
    cleanups/simplifications. Of particular significance is the rework of
    the pid namespace cleanup so it no longer requires sending out
    tendrils into all kinds of unexpected cleanup paths for operation. At
    least one case of broken error handling is fixed by this cleanup.

    The files under /proc/<pid>/ns/ have been converted from regular files
    to magic symlinks which prevents incorrect caching by the VFS,
    ensuring the files always refer to the namespace the process is
    currently using and ensuring that the ptrace_mayaccess permission
    checks are always applied.

    The files under /proc/<pid>/ns/ have been given stable inode numbers
    so it is now possible to see if different processes share the same
    namespaces.

    Through David Miller's net tree are changes to relax many of the
    permission checks in the networking stack, allowing the user
    namespace root to usefully use the networking stack. Similar changes
    for the mount namespace and the pid namespace are coming through my
    tree.

    Two small changes to add user namespace support were committed here and
    in David Miller's -net tree so that I could complete the work on the
    /proc/<pid>/ns/ files in this tree.

    Work remains to make it safe to build user namespaces and 9p, afs,
    ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the
    Kconfig guard remains in place preventing user namespaces from
    being built when any of those filesystems are enabled.

    Future design work remains to allow root users outside of the initial
    user namespace to mount more than just /proc and /sys."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (38 commits)
    proc: Usable inode numbers for the namespace file descriptors.
    proc: Fix the namespace inode permission checks.
    proc: Generalize proc inode allocation
    userns: Allow unprivilged mounts of proc and sysfs
    userns: For /proc/self/{uid,gid}_map derive the lower userns from the struct file
    procfs: Print task uids and gids in the userns that opened the proc file
    userns: Implement unshare of the user namespace
    userns: Implent proc namespace operations
    userns: Kill task_user_ns
    userns: Make create_new_namespaces take a user_ns parameter
    userns: Allow unprivileged use of setns.
    userns: Allow unprivileged users to create new namespaces
    userns: Allow setting a userns mapping to your current uid.
    userns: Allow chown and setgid preservation
    userns: Allow unprivileged users to create user namespaces.
    userns: Ignore suid and sgid on binaries if the uid or gid can not be mapped
    userns: fix return value on mntns_install() failure
    vfs: Allow unprivileged manipulation of the mount namespace.
    vfs: Only support slave subtrees across different user namespaces
    vfs: Add a user namespace reference from struct mnt_namespace
    ...

    Linus Torvalds
     

06 Dec, 2012

1 commit


20 Nov, 2012

1 commit

  • Assign a unique proc inode to each namespace, and use that
    inode number to ensure we only allocate at most one proc
    inode for every namespace in proc.

    A single proc inode per namespace allows userspace to test
    to see if two processes are in the same namespace.

    This has been a long requested feature and only blocked because
    a naive implementation would put the id in a global space and
    would ultimately require having a namespace for the names of
    namespaces, making migration and certain virtualization tricks
    impossible.

    We still don't have per superblock inode numbers for proc, which
    appears necessary for application unaware checkpoint/restart and
    migrations (if the application is using namespace file descriptors)
    but that is now allowed by the design if it becomes important.

    I have preallocated the ipc and uts initial proc inode numbers so
    their structures can be statically initialized.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
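
The userspace test mentioned above (are two processes in the same namespace?) can be sketched by comparing the device and inode numbers of the per-process namespace files. This is a minimal sketch assuming a kernel that includes this change and a mounted /proc; demo_same_ns is a hypothetical helper name.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Return 1 if both pids are in the same namespace of kind 'name'
 * (for example "pid" or "net"), 0 if not, -1 if either stat() fails. */
static int demo_same_ns(pid_t a, pid_t b, const char *name)
{
    char path_a[64], path_b[64];
    struct stat sa, sb;

    snprintf(path_a, sizeof(path_a), "/proc/%ld/ns/%s", (long)a, name);
    snprintf(path_b, sizeof(path_b), "/proc/%ld/ns/%s", (long)b, name);
    if (stat(path_a, &sa) != 0 || stat(path_b, &sb) != 0)
        return -1;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

For example, demo_same_ns(getpid(), getppid(), "net") asks whether the caller and its parent share a network namespace. Permission to stat another process's namespace files is subject to the usual /proc access checks, so a result of -1 should be treated as "unknown".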
     

19 Nov, 2012

5 commits

  • Looking at pid_ns->nr_hashed is a bit simpler and it works for
    disjoint process trees that an unshare or a join of a pid_namespace
    may create.

    Acked-by: "Serge E. Hallyn"
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Set nr_hashed to -1 just before we schedule the work to cleanup proc.
    Test nr_hashed just before we hash a new pid and if nr_hashed is < 0
    fail.

    This guarantees that processes never enter a pid namespace after we
    have cleaned up the state to support processes in a pid namespace.

    Currently sending SIGKILL to all of the processes in a pid namespace as
    init exits gives us this guarantee but we need something a little
    stronger to support unsharing and joining a pid namespace.

    Acked-by: "Serge E. Hallyn"
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
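
The guard described in this message can be modelled with a single counter. The structure and function names below are hypothetical and no locking is shown; the sketch only illustrates that a negative nr_hashed acts as a "closed" marker checked before a new pid is hashed.

```c
struct demo_pid_ns {
    int nr_hashed;   /* number of hashed pids, or -1 once cleanup is scheduled */
};

/* Try to hash one more pid; refuse once the namespace is being torn down. */
static int demo_hash_pid(struct demo_pid_ns *ns)
{
    if (ns->nr_hashed < 0)
        return -1;
    ns->nr_hashed++;
    return 0;
}

/* Mark the namespace closed just before scheduling its cleanup work. */
static void demo_schedule_cleanup(struct demo_pid_ns *ns)
{
    ns->nr_hashed = -1;
}
```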
     
  • Track the number of pids in the proc hash table. When the number of
    pids goes to 0 schedule work to unmount the kernel mount of proc.

    Move the mount of proc into alloc_pid when we allocate the pid for
    init.

    Remove the surprising calls of pid_ns_release_proc in fork and
    proc_flush_task. Those code paths really shouldn't know about proc
    namespace implementation details and people have demonstrated several
    times that finding and understanding those code paths is difficult and
    non-obvious.

    Because of the call path detach_pid is always called with the
    rtnl_lock held, free_pid is not allowed to sleep, so the work of
    unmounting proc is moved to a work queue. This has the side benefit
    of not blocking the entire world waiting for the unnecessary
    rcu_barrier in deactivate_locked_super.

    In the process of making the code clear and obvious this fixes a bug
    reported by Gao feng where we would leak a
    mount of proc during clone(CLONE_NEWPID|CLONE_NEWNET) if copy_pid_ns
    succeeded and copy_net_ns failed.

    Acked-by: "Serge E. Hallyn"
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • The expressions tsk->nsproxy->pid_ns and task_active_pid_ns
    aka ns_of_pid(task_pid(tsk)) should have the same number of
    cache line misses with the practical difference that
    ns_of_pid(task_pid(tsk)) is released later in a process's life.

    Furthermore by using task_active_pid_ns it becomes trivial
    to write an unshare implementation for the pid namespace.

    So I have used task_active_pid_ns everywhere I can.

    In fork since the pid has not yet been attached to the
    process I use ns_of_pid, to achieve the same effect.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • - Capture the user namespace that creates the pid namespace
    - Use that user namespace to test if it is ok to write to
    /proc/sys/kernel/ns_last_pid.

    Zhao Hongjiang noticed I was missing a put_user_ns
    when destroying a pid_ns. I have folded his patch into this one
    so that bisects will work properly.

    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

15 Aug, 2012

1 commit

  • Correct a long standing omission and use struct pid in the owner
    field of struct ip6_flowlabel when the share type is IPV6_FL_S_PROCESS.
    This guarantees we don't have issues when pid wraparound occurs.

    Use a kuid_t in the owner field of struct ip6_flowlabel when the
    share type is IPV6_FL_S_USER to add user namespace support.

    In /proc/net/ip6_flowlabel capture the current pid namespace when
    opening the file and release the pid namespace when the file is
    closed, ensuring we print the pid owner value that is meaningful to
    the reader of the file. Similarly use from_kuid_munged to print
    uid values that are meaningful to the reader of the file.

    This requires exporting pid_nr_ns so that ipv6 can continue to be built
    as a module. Yoiks what silliness.

    Acked-by: David S. Miller
    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

24 May, 2012

1 commit

  • UDP stack needs a minimum hash size value for proper operation and also
    uses alloc_large_system_hash() for proper NUMA distribution of its hash
    tables and automatic sizing depending on available system memory.

    In some low memory situations, udp_table_init() must ignore the
    alloc_large_system_hash() result and reallocate a bigger memory area.

    As we cannot easily free the old hash table, we leak it and kmemleak can
    issue a warning.

    This patch adds a low limit parameter to alloc_large_system_hash() to
    solve this problem.

    We then specify UDP_HTABLE_SIZE_MIN for UDP/UDPLite hash table
    allocation.

    Reported-by: Mark Asselstine
    Reported-by: Tim Bird
    Signed-off-by: Eric Dumazet
    Cc: Paul Gortmaker
    Signed-off-by: David S. Miller

    Tim Bird
     

14 Feb, 2012

1 commit

  • When the number of dentry cache hash table entries gets too high
    (2147483648 entries), as happens by default on a 16TB system, use of a
    signed integer in the dcache_init() initialization loop prevents the
    dentry_hashtable from getting initialized, causing a panic in
    __d_lookup(). Fix this in dcache_init() and similar areas.

    Signed-off-by: Dimitri Sivanich
    Acked-by: David S. Miller
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Dimitri Sivanich
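
The arithmetic behind this failure can be checked in userspace. The sketch below assumes a 32-bit int (INT_MAX equal to 2^31 - 1) and uses hypothetical helper names; it avoids the overflow itself by computing in 64 bits and only reports whether a signed int loop bound could hold the bucket count.

```c
#include <limits.h>

/* Number of hash buckets for a given shift (shift must be below 64). */
static unsigned long long demo_hash_entries(unsigned int shift)
{
    return 1ULL << shift;
}

/* Return 1 if the bucket count fits in a signed int, so that a loop
 * bound of type int would be safe; 0 otherwise. */
static int demo_bound_fits_int(unsigned int shift)
{
    return demo_hash_entries(shift) <= (unsigned long long)INT_MAX;
}
```

With a shift of 31 the count is 2147483648, one more than INT_MAX under the stated assumption, which is the case the message describes.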
     

13 Jan, 2012

1 commit

  • The sysctl works on the current task's pid namespace, getting and setting
    its last_pid field.

    Writing is allowed for CAP_SYS_ADMIN-capable tasks thus making it possible
    to create a task with desired pid value. This ability is required badly
    for the checkpoint/restore in userspace.

    This approach suits all the parties for now.

    Signed-off-by: Pavel Emelyanov
    Acked-by: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Cc: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
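
Reading the sysctl from userspace can be sketched as below. The path /proc/sys/kernel/ns_last_pid is the one named elsewhere in this log; demo_read_ns_last_pid is a hypothetical helper, and writing (which needs the privilege described above) is deliberately not shown.

```c
#include <stdio.h>

/* Read the last pid allocated in the caller's pid namespace.
 * Returns the value, or -1 if the file is missing or unreadable. */
static long demo_read_ns_last_pid(void)
{
    long value = -1;
    FILE *f = fopen("/proc/sys/kernel/ns_last_pid", "r");

    if (!f)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```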
     

31 Oct, 2011

1 commit

  • The changed files were only including linux/module.h for the
    EXPORT_SYMBOL infrastructure, and nothing else. Revector them
    onto the isolated export header for faster compile times.

    Nothing to see here but a whole lot of instances of:

    -#include <linux/module.h>
    +#include <linux/export.h>

    This commit is only changing the kernel dir; next targets
    will probably be mm, fs, the arch dirs, etc.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

29 Sep, 2011

1 commit

  • Long ago, using TREE_RCU with PREEMPT would result in "scheduling
    while atomic" diagnostics if you blocked in an RCU read-side critical
    section. However, PREEMPT now implies TREE_PREEMPT_RCU, which defeats
    this diagnostic. This commit therefore adds a replacement diagnostic
    based on PROVE_RCU.

    Because rcu_lockdep_assert() and lockdep_rcu_dereference() are now being
    used for things that have nothing to do with rcu_dereference(), rename
    lockdep_rcu_dereference() to lockdep_rcu_suspicious() and add a third
    argument that is a string indicating what is suspicious. This third
    argument is passed in from a new third argument to rcu_lockdep_assert().
    Update all calls to rcu_lockdep_assert() to add an informative third
    argument.

    Also, add a pair of rcu_lockdep_assert() calls from within
    rcu_note_context_switch(), one complaining if a context switch occurs
    in an RCU-bh read-side critical section and another complaining if a
    context switch occurs in an RCU-sched read-side critical section.
    These are present only if the PROVE_RCU kernel parameter is enabled.

    Finally, fix some checkpatch whitespace complaints in lockdep.c.

    Again, you must enable PROVE_RCU to see these new diagnostics. But you
    are enabling PROVE_RCU to check out new RCU uses in any case, aren't you?

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

09 Jul, 2011

1 commit


19 Apr, 2011

1 commit

  • next_pidmap() just quietly accepted whatever 'last' pid that was passed
    in, which is not all that safe when one of the users is /proc.

    Admittedly the proc code should do some sanity checking on the range
    (and that will be the next commit), but that doesn't mean that the
    helper functions should just do that pidmap pointer arithmetic without
    checking the range of its arguments.

    So clamp 'last' to PID_MAX_LIMIT. The fact that we then do "last+1"
    doesn't really matter, the for-loop does check against the end of the
    pidmap array properly (it's only the actual pointer arithmetic overflow
    case we need to worry about, and going one bit beyond isn't going to
    overflow).

    [ Use PID_MAX_LIMIT rather than pid_max as per Eric Biederman ]

    Reported-by: Tavis Ormandy
    Analyzed-by: Robert Święcki
    Cc: Eric W. Biederman
    Cc: Pavel Emelyanov
    Signed-off-by: Linus Torvalds

    Linus Torvalds
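
The range check described above can be sketched with a deliberately small limit. The constants and the function name are assumptions for this example, not the kernel's values; the point is that the untrusted 'last' value is rejected before it is used in any index or pointer arithmetic.

```c
#define DEMO_PID_MAX_LIMIT   1024u   /* assumed small limit for illustration */
#define DEMO_BITS_PER_BLOCK  64u     /* assumed bits per bitmap block */

/* Return the index of the bitmap block that holds pid last + 1,
 * or -1 if 'last' is out of range and must not be used for indexing. */
static int demo_next_block(unsigned int last)
{
    if (last >= DEMO_PID_MAX_LIMIT)
        return -1;                       /* reject before any arithmetic */
    return (int)((last + 1) / DEMO_BITS_PER_BLOCK);
}
```

For last equal to 1023 the result is 16, one past the final block; as the message notes, going one step beyond is left to the caller's loop bound, and only unbounded values are the concern here.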
     

18 Mar, 2011

1 commit


20 Aug, 2010

2 commits

  • find_task_by_vpid() says "Must be called under rcu_read_lock().". But due to
    commit 3120438 "rcu: Disable lockdep checking in RCU list-traversal primitives",
    we are currently unable to catch "find_task_by_vpid() with tasklist_lock held
    but RCU lock not held" errors due to the RCU-lockdep checks being
    suppressed in the RCU variants of the struct list_head traversals.
    This commit therefore places an explicit check for being in an RCU
    read-side critical section in find_task_by_pid_ns().

    ===================================================
    [ INFO: suspicious rcu_dereference_check() usage. ]
    ---------------------------------------------------
    kernel/pid.c:386 invoked rcu_dereference_check() without protection!

    other info that might help us debug this:

    rcu_scheduler_active = 1, debug_locks = 1
    1 lock held by rc.sysinit/1102:
    #0: (tasklist_lock){.+.+..}, at: [] sys_setpgid+0x40/0x160

    stack backtrace:
    Pid: 1102, comm: rc.sysinit Not tainted 2.6.35-rc3-dirty #1
    Call Trace:
    [] lockdep_rcu_dereference+0x94/0xb0
    [] find_task_by_pid_ns+0x6d/0x70
    [] find_task_by_vpid+0x18/0x20
    [] sys_setpgid+0x47/0x160
    [] sysenter_do_call+0x12/0x36

    Commit updated to use a new rcu_lockdep_assert() exported API rather than
    the old internal __do_rcu_dereference().

    Signed-off-by: Tetsuo Handa
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Tetsuo Handa
     
  • This avoids warnings from missing __rcu annotations
    in the rculist implementation, making it possible to
    use the same lists in both RCU and non-RCU cases.

    We can add rculist annotations later, together with
    lockdep support for rculist, which is missing as well,
    but that may involve changing all the users.

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Paul E. McKenney
    Cc: Pavel Emelyanov
    Cc: Sukadev Bhattiprolu
    Reviewed-by: Josh Triplett

    Arnd Bergmann
     

11 Aug, 2010

2 commits

  • alloc_pidmap() calculates max_scan so that if the initial offset != 0 we
    inspect the first map->page twice. This is correct, we want to find the
    unused bits < offset in this bitmap block. Add the comment.

    But it doesn't make any sense to stop the find_next_offset() loop when we
    are looking into this map->page for the second time. We have already
    checked the bits >= offset during the first attempt, it is fine to
    do this again, no matter if we succeed this time or not.

    Remove this hard-to-understand code. It optimizes the very unlikely case
    when we are going to fail, but slows down the more likely case.

    Signed-off-by: Oleg Nesterov
    Cc: Salman Qazi
    Cc: Ingo Molnar
    Cc: Sukadev Bhattiprolu
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • A program that repeatedly forks and waits is susceptible to having the
    same pid repeated, especially when it competes with another instance of
    the same program. This is really bad for the bash implementation.
    Furthermore, many shell scripts assume that pid numbers will not be
    reused for some length of time.

    Race Description:

    A                                       B

    // pid == offset == n                   // pid == offset == n + 1
    test_and_set_bit(offset, map->page)
                                            test_and_set_bit(offset, map->page);
                                            pid_ns->last_pid = pid;
    pid_ns->last_pid = pid;
                                            // pid == n + 1 is freed (wait())

                                            // Next fork()...
                                            last = pid_ns->last_pid; // == n
                                            pid = last + 1;
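
    The interleaving can also be replayed deterministically in a single
    thread. The following is a minimal sketch, not kernel code: struct
    pid_ns_model, test_and_set() and replay_race() are hypothetical names,
    and a small flag array stands in for map->page.

```c
#include <assert.h>
#include <stdbool.h>

#define NPIDS 64                        /* hypothetical, tiny pid space */

struct pid_ns_model {
    bool in_use[NPIDS];                 /* stands in for map->page */
    int last_pid;
};

/* Model of test_and_set_bit(): set the flag and return its old value. */
static bool test_and_set(struct pid_ns_model *ns, int pid)
{
    bool old = ns->in_use[pid];
    ns->in_use[pid] = true;
    return old;
}

/* Replay the race; return the pid the next fork() would try first. */
int replay_race(int n)
{
    assert(n >= 0 && n + 1 < NPIDS);
    struct pid_ns_model ns = { .last_pid = n - 1 };

    bool a_lost = test_and_set(&ns, n);         /* A claims pid n */
    bool b_lost = test_and_set(&ns, n + 1);     /* B claims pid n + 1 */
    assert(!a_lost && !b_lost);

    ns.last_pid = n + 1;                        /* B publishes first */
    ns.last_pid = n;                            /* A overwrites with the older pid */

    ns.in_use[n + 1] = false;                   /* pid n + 1 exits and is reaped */

    int candidate = ns.last_pid + 1;            /* next fork(): last == n */
    assert(!ns.in_use[candidate]);              /* n + 1 is free again */
    return candidate;
}
```

    Calling replay_race(n) returns n + 1, the pid that was freed a moment
    earlier, which is the immediate reuse the test program below looks for.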

    Code to reproduce it (Running multiple instances is more effective):

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    // The distance mod 32768 between two pids, where the first pid is expected
    // to be smaller than the second.
    int PidDistance(pid_t first, pid_t second) {
      return (second + 32768 - first) % 32768;
    }

    int main(int argc, char* argv[]) {
      int failed = 0;
      pid_t last_pid = 0;
      int i;
      printf("%zu\n", sizeof(pid_t));
      for (i = 0; i < 10000000; ++i) {
        // Report progress once per 32768 iterations.
        if (i % 32768 == 0)
          printf("Iter: %d\n", i/32768);
        int child_exit_code = i % 256;
        pid_t pid = fork();
        if (pid == -1) {
          fprintf(stderr, "fork failed, iteration %d, errno=%d\n", i, errno);
          exit(1);
        }
        if (pid == 0) {
          // Child
          exit(child_exit_code);
        } else {
          // Parent
          if (i > 0) {
            int distance = PidDistance(last_pid, pid);
            if (distance == 0 || distance > 30000) {
              fprintf(stderr,
                      "Unexpected pid sequence: previous fork: pid=%d, "
                      "current fork: pid=%d for iteration=%d.\n",
                      last_pid, pid, i);
              failed = 1;
            }
          }
          last_pid = pid;
          int status;
          int reaped = wait(&status);
          if (reaped != pid) {
            fprintf(stderr,
                    "Wait return value: expected pid=%d, "
                    "got %d, iteration %d\n",
                    pid, reaped, i);
            failed = 1;
          } else if (WEXITSTATUS(status) != child_exit_code) {
            fprintf(stderr,
                    "Unexpected exit status %x, iteration %d\n",
                    WEXITSTATUS(status), i);
            failed = 1;
          }
        }
      }
      exit(failed);
    }

    Thanks to Ted Tso for the key ideas of this implementation.

    Signed-off-by: Salman Qazi
    Cc: Ingo Molnar
    Cc: Theodore Ts'o
    Cc: Peter Zijlstra
    Cc: Sukadev Bhattiprolu
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Salman
     

28 May, 2010

1 commit

  • On a system with a substantial number of processors, the early default
    pid_max of 32k will not be enough. On a system with 1664 CPUs, 25163
    processes are started before the login prompt. It's estimated that with
    2048 CPUs we will pass the 32k limit. With 4096, we'll reach that limit
    very early during the boot cycle, and processes would stall waiting for an
    available pid.

    This patch increases the early maximum number of pids available, and
    increases the minimum number of pids that can be set during runtime.
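
    One way to picture a limit that grows with the machine is to scale it
    by the CPU count and clamp the result. This is a sketch only: the
    function name scaled_pid_max(), the per-CPU allowance, and the ceiling
    are assumptions for illustration, and the message above does not state
    the values the patch actually uses.

```c
#include <assert.h>

/* Hypothetical constants, for illustration only. */
#define PID_MAX_BASE     32768L         /* the 32k default mentioned above */
#define PIDS_PER_CPU     1024L          /* assumed per-CPU allowance */
#define PID_MAX_CEILING  4194304L       /* assumed upper bound */

/* Scale the pid limit with the CPU count, never below the base default
   and never above the ceiling. */
long scaled_pid_max(long ncpus)
{
    assert(ncpus > 0);
    long wanted = PIDS_PER_CPU * ncpus;
    long limit = wanted > PID_MAX_BASE ? wanted : PID_MAX_BASE;
    return limit < PID_MAX_CEILING ? limit : PID_MAX_CEILING;
}
```

    With these assumed constants a small machine keeps the 32k default,
    while the 1664-CPU system from the description would get a limit far
    above the 25163 processes it starts during boot.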

    [akpm@linux-foundation.org: fix warnings]
    Signed-off-by: Hedi Berriche
    Signed-off-by: Mike Travis
    Signed-off-by: Robin Holt
    Acked-by: Linus Torvalds
    Cc: Ingo Molnar
    Cc: Pavel Machek
    Cc: Alan Cox
    Cc: Greg KH
    Cc: Rik van Riel
    Cc: John Stoffel
    Cc: Jack Steiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hedi Berriche
     

14 Mar, 2010

1 commit

  • …/git/tip/linux-2.6-tip

    * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    locking: Make sparse work with inline spinlocks and rwlocks
    x86/mce: Fix RCU lockdep splats
    rcu: Increase RCU CPU stall timeouts if PROVE_RCU
    ftrace: Replace read_barrier_depends() with rcu_dereference_raw()
    rcu: Suppress RCU lockdep warnings during early boot
    rcu, ftrace: Fix RCU lockdep splat in ftrace_perf_buf_prepare()
    rcu: Suppress __mpol_dup() false positive from RCU lockdep
    rcu: Make rcu_read_lock_sched_held() handle !PREEMPT
    rcu: Add control variables to lockdep_rcu_dereference() diagnostics
    rcu, cgroup: Relax the check in task_subsys_state() as early boot is now handled by lockdep-RCU
    rcu: Use wrapper function instead of exporting tasklist_lock
    sched, rcu: Fix rcu_dereference() for RCU-lockdep
    rcu: Make task_subsys_state() RCU-lockdep checks handle boot-time use
    rcu: Fix holdoff for accelerated GPs for last non-dynticked CPU
    x86/gart: Unexport gart_iommu_aperture

    Fix trivial conflicts in kernel/trace/ftrace.c

    Linus Torvalds