29 Jun, 2013

1 commit


02 May, 2013

1 commit

  • Pull VFS updates from Al Viro,

    Misc cleanups all over the place, mainly wrt /proc interfaces (switch
    create_proc_entry to proc_create(), get rid of the deprecated
    create_proc_read_entry() in favor of using proc_create_data() and
    seq_file etc).

    7kloc removed.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
    don't bother with deferred freeing of fdtables
    proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
    proc: Make the PROC_I() and PDE() macros internal to procfs
    proc: Supply a function to remove a proc entry by PDE
    take cgroup_open() and cpuset_open() to fs/proc/base.c
    ppc: Clean up scanlog
    ppc: Clean up rtas_flash driver somewhat
    hostap: proc: Use remove_proc_subtree()
    drm: proc: Use remove_proc_subtree()
    drm: proc: Use minor->index to label things, not PDE->name
    drm: Constify drm_proc_list[]
    zoran: Don't print proc_dir_entry data in debug
    reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
    proc: Supply an accessor for getting the data from a PDE's parent
    airo: Use remove_proc_subtree()
    rtl8192u: Don't need to save device proc dir PDE
    rtl8187se: Use a dir under /proc/net/r8180/
    proc: Add proc_mkdir_data()
    proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
    proc: Move PDE_NET() to fs/proc/proc_net.c
    ...

    Linus Torvalds
     

10 Apr, 2013

1 commit


27 Mar, 2013

1 commit

  • Only allow unprivileged mounts of proc and sysfs if they are already
    mounted when the user namespace is created.

    proc and sysfs are interesting because they have content that is
    per namespace, and so fresh mounts are needed when new namespaces
    are created while at the same time proc and sysfs have content that
    is shared between every instance.

    Respect the policy of who may see the shared content of proc and sysfs
    by only allowing new mounts if there was an existing mount at the time
    the user namespace was created.

    In practice there are only two interesting cases: proc and sysfs are
    mounted at their usual places, proc and sysfs are not mounted at all
    (some form of mount namespace jail).

    Cc: stable@vger.kernel.org
    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

20 Nov, 2012

1 commit


19 Nov, 2012

4 commits

  • Track the number of pids in the proc hash table. When the number of
    pids goes to 0 schedule work to unmount the kernel mount of proc.

    Move the mount of proc into alloc_pid when we allocate the pid for
    init.

    Remove the surprising calls of pid_ns_release proc in fork and
    proc_flush_task. Those code paths really shouldn't know about proc
    namespace implementation details and people have demonstrated several
    times that finding and understanding those code paths is difficult and
    non-obvious.

    Because of the call path detach pid is alwasy called with the
    rtnl_lock held free_pid is not allowed to sleep, so the work to
    unmounting proc is moved to a work queue. This has the side benefit
    of not blocking the entire world waiting for the unnecessary
    rcu_barrier in deactivate_locked_super.

    In the process of making the code clear and obvious this fixes a bug
    reported by Gao feng where we would leak a
    mount of proc during clone(CLONE_NEWPID|CLONE_NEWNET) if copy_pid_ns
    succeeded and copy_net_ns failed.

    Acked-by: "Serge E. Hallyn"
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • The expressions tsk->nsproxy->pid_ns and task_active_pid_ns
    aka ns_of_pid(task_pid(tsk)) should have the same number of
    cache line misses with the practical difference that
    ns_of_pid(task_pid(tsk)) is released later in a processes life.

    Furthermore by using task_active_pid_ns it becomes trivial
    to write an unshare implementation for the the pid namespace.

    So I have used task_active_pid_ns everywhere I can.

    In fork since the pid has not yet been attached to the
    process I use ns_of_pid, to achieve the same effect.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • Now that we have s_fs_info pointing to our pid namespace
    the original reason for the proc root inode having a struct
    pid is gone.

    Caching a pid in the root inode has led to some complicated
    code. Now that we don't need the struct pid, just remove it.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • I had visions at one point of splitting proc into two filesystems. If
    that had happened proc/self being the the part of proc that actually deals
    with pids would have been a nice cleanup. As it is proc/self requires
    a lot of unnecessary infrastructure for a single file.

    The only user visible change is that a mounted /proc for a pid namespace
    that is dead now shows a broken proc symlink, instead of being completely
    invisible. I don't think anyone will notice or care.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

06 Oct, 2012

1 commit


14 Jul, 2012

2 commits

  • Pass mount flags to sget() so that it can use them in initialising a new
    superblock before the set function is called. They could also be passed to the
    compare function.

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     
  • Just the flags; only NFS cares even about that, but there are
    legitimate uses for such argument. And getting rid of that
    completely would require splitting ->lookup() into a couple
    of methods (at least), so let's leave that alone for now...

    Signed-off-by: Al Viro

    Al Viro
     

24 May, 2012

1 commit

  • Pull user namespace enhancements from Eric Biederman:
    "This is a course correction for the user namespace, so that we can
    reach an inexpensive, maintainable, and reasonably complete
    implementation.

    Highlights:
    - Config guards make it impossible to enable the user namespace and
    code that has not been converted to be user namespace safe.

    - Use of the new kuid_t type ensures the if you somehow get past the
    config guards the kernel will encounter type errors if you enable
    user namespaces and attempt to compile in code whose permission
    checks have not been updated to be user namespace safe.

    - All uids from child user namespaces are mapped into the initial
    user namespace before they are processed. Removing the need to add
    an additional check to see if the user namespace of the compared
    uids remains the same.

    - With the user namespaces compiled out the performance is as good or
    better than it is today.

    - For most operations absolutely nothing changes performance or
    operationally with the user namespace enabled.

    - The worst case performance I could come up with was timing 1
    billion cache cold stat operations with the user namespace code
    enabled. This went from 156s to 164s on my laptop (or 156ns to
    164ns per stat operation).

    - (uid_t)-1 and (gid_t)-1 are reserved as an internal error value.
    Most uid/gid setting system calls treat these value specially
    anyway so attempting to use -1 as a uid would likely cause
    entertaining failures in userspace.

    - If setuid is called with a uid that can not be mapped setuid fails.
    I have looked at sendmail, login, ssh and every other program I
    could think of that would call setuid and they all check for and
    handle the case where setuid fails.

    - If stat or a similar system call is called from a context in which
    we can not map a uid we lie and return overflowuid. The LFS
    experience suggests not lying and returning an error code might be
    better, but the historical precedent with uids is different and I
    can not think of anything that would break by lying about a uid we
    can't map.

    - Capabilities are localized to the current user namespace making it
    safe to give the initial user in a user namespace all capabilities.

    My git tree covers all of the modifications needed to convert the core
    kernel and enough changes to make a system bootable to runlevel 1."

    Fix up trivial conflicts due to nearby independent changes in fs/stat.c

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits)
    userns: Silence silly gcc warning.
    cred: use correct cred accessor with regards to rcu read lock
    userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq
    userns: Convert cgroup permission checks to use uid_eq
    userns: Convert tmpfs to use kuid and kgid where appropriate
    userns: Convert sysfs to use kgid/kuid where appropriate
    userns: Convert sysctl permission checks to use kuid and kgids.
    userns: Convert proc to use kuid/kgid where appropriate
    userns: Convert ext4 to user kuid/kgid where appropriate
    userns: Convert ext3 to use kuid/kgid where appropriate
    userns: Convert ext2 to use kuid/kgid where appropriate.
    userns: Convert devpts to use kuid/kgid where appropriate
    userns: Convert binary formats to use kuid/kgid where appropriate
    userns: Add negative depends on entries to avoid building code that is userns unsafe
    userns: signal remove unnecessary map_cred_ns
    userns: Teach inode_capable to understand inodes whose uids map to other namespaces.
    userns: Fail exec for suid and sgid binaries with ids outside our user namespace.
    userns: Convert stat to return values mapped from kuids and kgids
    userns: Convert user specfied uids and gids in chown into kuids and kgid
    userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs
    ...

    Linus Torvalds
     

16 May, 2012

1 commit


06 Apr, 2012

1 commit

  • The proc_parse_options() call from proc_mount() runs only once at boot
    time. So on any later mount attempt, any mount options are ignored
    because ->s_root is already initialized.

    As a consequence, "mount -o " will ignore the options. The
    only way to change mount options is "mount -o remount,".

    To fix this, parse the mount options unconditionally.

    Signed-off-by: Vasiliy Kulikov
    Reported-by: Arkadiusz Miskiewicz
    Tested-by: Arkadiusz Miskiewicz
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Cc: Valdis Kletnieks
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasiliy Kulikov
     

11 Jan, 2012

2 commits

  • Add support for mount options to restrict access to /proc/PID/
    directories. The default backward-compatible "relaxed" behaviour is left
    untouched.

    The first mount option is called "hidepid" and its value defines how much
    info about processes we want to be available for non-owners:

    hidepid=0 (default) means the old behavior - anybody may read all
    world-readable /proc/PID/* files.

    hidepid=1 means users may not access any /proc// directories, but
    their own. Sensitive files like cmdline, sched*, status are now protected
    against other users. As permission checking done in proc_pid_permission()
    and files' permissions are left untouched, programs expecting specific
    files' modes are not confused.

    hidepid=2 means hidepid=1 plus all /proc/PID/ will be invisible to other
    users. It doesn't mean that it hides whether a process exists (it can be
    learned by other means, e.g. by kill -0 $PID), but it hides process' euid
    and egid. It compicates intruder's task of gathering info about running
    processes, whether some daemon runs with elevated privileges, whether
    another user runs some sensitive program, whether other users run any
    program at all, etc.

    gid=XXX defines a group that will be able to gather all processes' info
    (as in hidepid=0 mode). This group should be used instead of putting
    nonroot user in sudoers file or something. However, untrusted users (like
    daemons, etc.) which are not supposed to monitor the tasks in the whole
    system should not be added to the group.

    hidepid=1 or higher is designed to restrict access to procfs files, which
    might reveal some sensitive private information like precise keystrokes
    timings:

    http://www.openwall.com/lists/oss-security/2011/11/05/3

    hidepid=1/2 doesn't break monitoring userspace tools. ps, top, pgrep, and
    conky gracefully handle EPERM/ENOENT and behave as if the current user is
    the only user running processes. pstree shows the process subtree which
    contains "pstree" process.

    Note: the patch doesn't deal with setuid/setgid issues of keeping
    preopened descriptors of procfs files (like
    https://lkml.org/lkml/2011/2/7/368). We rely on that the leaked
    information like the scheduling counters of setuid apps doesn't threaten
    anybody's privacy - only the user started the setuid program may read the
    counters.

    Signed-off-by: Vasiliy Kulikov
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Cc: Randy Dunlap
    Cc: "H. Peter Anvin"
    Cc: Greg KH
    Cc: Theodore Tso
    Cc: Alan Cox
    Cc: James Morris
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasiliy Kulikov
     
  • Add support for procfs mount options. Actual mount options are coming in
    the next patches.

    Signed-off-by: Vasiliy Kulikov
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Cc: Randy Dunlap
    Cc: "H. Peter Anvin"
    Cc: Greg KH
    Cc: Theodore Tso
    Cc: Alan Cox
    Cc: James Morris
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasiliy Kulikov
     

09 Dec, 2011

1 commit


28 Jul, 2011

1 commit

  • Since __proc_create() appends the name it is given to the end of the PDE
    structure that it allocates, there isn't a need to store a name pointer.
    Instead we can just replace the name pointer with a terminal char array of
    _unspecified_ length. The compiler will simply append the string to statically
    defined variables of PDE type overlapping any hole at the end of the structure
    and, unlike specifying an explicitly _zero_ length array, won't give a warning
    if you try to statically initialise it with a string of more than zero length.

    Also, whilst we're at it:

    (1) Move namelen to end just prior to name and reduce it to a single byte
    (name shouldn't be longer than NAME_MAX).

    (2) Move pde_unload_lock two places further on so that if it's four bytes in
    size on a 64-bit machine, it won't cause an unused hole in the PDE struct.

    Signed-off-by: David Howells
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Linus Torvalds

    David Howells
     

13 Jun, 2011

1 commit


24 Mar, 2011

2 commits


29 Oct, 2010

2 commits


15 Oct, 2010

1 commit

  • All file_operations should get a .llseek operation so we can make
    nonseekable_open the default for future file operations without a
    .llseek pointer.

    The three cases that we can automatically detect are no_llseek, seq_lseek
    and default_llseek. For cases where we can we can automatically prove that
    the file offset is always ignored, we use noop_llseek, which maintains
    the current behavior of not returning an error from a seek.

    New drivers should normally not use noop_llseek but instead use no_llseek
    and call nonseekable_open at open time. Existing drivers can be converted
    to do the same when the maintainer knows for certain that no user code
    relies on calling seek on the device file.

    The generated code is often incorrectly indented and right now contains
    comments that clarify for each added line why a specific variant was
    chosen. In the version that gets submitted upstream, the comments will
    be gone and I will manually fix the indentation, because there does not
    seem to be a way to do that using coccinelle.

    Some amount of new code is currently sitting in linux-next that should get
    the same modifications, which I will do at the end of the merge window.

    Many thanks to Julia Lawall for helping me learn to write a semantic
    patch that does all this.

    ===== begin semantic patch =====
    // This adds an llseek= method to all file operations,
    // as a preparation for making no_llseek the default.
    //
    // The rules are
    // - use no_llseek explicitly if we do nonseekable_open
    // - use seq_lseek for sequential files
    // - use default_llseek if we know we access f_pos
    // - use noop_llseek if we know we don't access f_pos,
    // but we still want to allow users to call lseek
    //
    @ open1 exists @
    identifier nested_open;
    @@
    nested_open(...)
    {

    }

    @ open exists@
    identifier open_f;
    identifier i, f;
    identifier open1.nested_open;
    @@
    int open_f(struct inode *i, struct file *f)
    {

    }

    @ read disable optional_qualifier exists @
    identifier read_f;
    identifier f, p, s, off;
    type ssize_t, size_t, loff_t;
    expression E;
    identifier func;
    @@
    ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
    {

    }

    @ read_no_fpos disable optional_qualifier exists @
    identifier read_f;
    identifier f, p, s, off;
    type ssize_t, size_t, loff_t;
    @@
    ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
    {
    ... when != off
    }

    @ write @
    identifier write_f;
    identifier f, p, s, off;
    type ssize_t, size_t, loff_t;
    expression E;
    identifier func;
    @@
    ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
    {

    }

    @ write_no_fpos @
    identifier write_f;
    identifier f, p, s, off;
    type ssize_t, size_t, loff_t;
    @@
    ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
    {
    ... when != off
    }

    @ fops0 @
    identifier fops;
    @@
    struct file_operations fops = {
    ...
    };

    @ has_llseek depends on fops0 @
    identifier fops0.fops;
    identifier llseek_f;
    @@
    struct file_operations fops = {
    ...
    .llseek = llseek_f,
    ...
    };

    @ has_read depends on fops0 @
    identifier fops0.fops;
    identifier read_f;
    @@
    struct file_operations fops = {
    ...
    .read = read_f,
    ...
    };

    @ has_write depends on fops0 @
    identifier fops0.fops;
    identifier write_f;
    @@
    struct file_operations fops = {
    ...
    .write = write_f,
    ...
    };

    @ has_open depends on fops0 @
    identifier fops0.fops;
    identifier open_f;
    @@
    struct file_operations fops = {
    ...
    .open = open_f,
    ...
    };

    // use no_llseek if we call nonseekable_open
    ////////////////////////////////////////////
    @ nonseekable1 depends on !has_llseek && has_open @
    identifier fops0.fops;
    identifier nso ~= "nonseekable_open";
    @@
    struct file_operations fops = {
    ... .open = nso, ...
    +.llseek = no_llseek, /* nonseekable */
    };

    @ nonseekable2 depends on !has_llseek @
    identifier fops0.fops;
    identifier open.open_f;
    @@
    struct file_operations fops = {
    ... .open = open_f, ...
    +.llseek = no_llseek, /* open uses nonseekable */
    };

    // use seq_lseek for sequential files
    /////////////////////////////////////
    @ seq depends on !has_llseek @
    identifier fops0.fops;
    identifier sr ~= "seq_read";
    @@
    struct file_operations fops = {
    ... .read = sr, ...
    +.llseek = seq_lseek, /* we have seq_read */
    };

    // use default_llseek if there is a readdir
    ///////////////////////////////////////////
    @ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier readdir_e;
    @@
    // any other fop is used that changes pos
    struct file_operations fops = {
    ... .readdir = readdir_e, ...
    +.llseek = default_llseek, /* readdir is present */
    };

    // use default_llseek if at least one of read/write touches f_pos
    /////////////////////////////////////////////////////////////////
    @ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier read.read_f;
    @@
    // read fops use offset
    struct file_operations fops = {
    ... .read = read_f, ...
    +.llseek = default_llseek, /* read accesses f_pos */
    };

    @ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier write.write_f;
    @@
    // write fops use offset
    struct file_operations fops = {
    ... .write = write_f, ...
    + .llseek = default_llseek, /* write accesses f_pos */
    };

    // Use noop_llseek if neither read nor write accesses f_pos
    ///////////////////////////////////////////////////////////

    @ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier read_no_fpos.read_f;
    identifier write_no_fpos.write_f;
    @@
    // write fops use offset
    struct file_operations fops = {
    ...
    .write = write_f,
    .read = read_f,
    ...
    +.llseek = noop_llseek, /* read and write both use no f_pos */
    };

    @ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier write_no_fpos.write_f;
    @@
    struct file_operations fops = {
    ... .write = write_f, ...
    +.llseek = noop_llseek, /* write uses no f_pos */
    };

    @ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier read_no_fpos.read_f;
    @@
    struct file_operations fops = {
    ... .read = read_f, ...
    +.llseek = noop_llseek, /* read uses no f_pos */
    };

    @ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    @@
    struct file_operations fops = {
    ...
    +.llseek = noop_llseek, /* no read or write fn */
    };
    ===== End semantic patch =====

    Signed-off-by: Arnd Bergmann
    Cc: Julia Lawall
    Cc: Christoph Hellwig

    Arnd Bergmann
     

28 May, 2010

1 commit

  • I removed 3 unused assignments. The first two get reset on the first
    statement of their functions. For "err" in root.c we don't return an
    error and we don't use the variable again.

    Signed-off-by: Dan Carpenter
    Cc: Oleg Nesterov
    Acked-by: Serge Hallyn
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Carpenter
     

04 Mar, 2010

1 commit

  • EXPORT_SYMBOL(proc_symlink);
    EXPORT_SYMBOL(proc_mkdir);
    EXPORT_SYMBOL(create_proc_entry);
    EXPORT_SYMBOL(proc_create_data);
    EXPORT_SYMBOL(remove_proc_entry);

    Those EXPORT_SYMBOL shouldn't be in fs/proc/root.c,
    should be in fs/proc/generic.c.

    Signed-off-by: Helight.Xu
    Signed-off-by: Al Viro

    Helight.Xu
     

09 May, 2009

1 commit


28 Mar, 2009

1 commit

  • simple_set_mnt() is defined as returning 'int' but always returns 0.
    Callers assume simple_set_mnt() never fails and don't properly cleanup if
    it were to _ever_ fail. For instance, get_sb_single() and get_sb_nodev()
    should:

    up_write(sb->s_unmount);
    deactivate_super(sb);

    if simple_set_mnt() fails.

    Since simple_set_mnt() never fails, would be cleaner if it did not
    return anything.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Sukadev Bhattiprolu
    Acked-by: Serge Hallyn
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Sukadev Bhattiprolu
     

05 Jan, 2009

1 commit

  • There are four BKL users in proc: de_put(), proc_lookup_de(),
    proc_readdir_de(), proc_root_readdir(),

    1) de_put()
    -----------
    de_put() is classic atomic_dec_and_test() refcount wrapper -- no BKL
    needed. BKL doesn't matter to possible refcount leak as well.

    2) proc_lookup_de()
    -------------------
    Walking PDE list is protected by proc_subdir_lock(), proc_get_inode() is
    potentially blocking, all callers of proc_lookup_de() eventually end up
    from ->lookup hooks which is protected by directory's ->i_mutex -- BKL
    doesn't protect anything.

    3) proc_readdir_de()
    --------------------
    "." and ".." part doesn't need BKL, walking PDE list is under
    proc_subdir_lock, calling filldir callback is potentially blocking
    because it writes to luserspace. All proc_readdir_de() callers
    eventually come from ->readdir hook which is under directory's
    ->i_mutex -- BKL doesn't protect anything.

    4) proc_root_readdir_de()
    -------------------------
    proc_root_readdir_de is ->readdir hook, see (3).

    Since readdir hooks doesn't use BKL anymore, switch to
    generic_file_llseek, since it also takes directory's i_mutex.

    Signed-off-by: Alexey Dobriyan

    Alexey Dobriyan
     

23 Oct, 2008

2 commits


29 Apr, 2008

5 commits

  • This set of patches fixes an proc ->open'less usage due to ->proc_fops flip in
    the most part of the kernel code. The original OOPS is described in the
    commit 2d3a4e3666325a9709cc8ea2e88151394e8f20fc:

    Typical PDE creation code looks like:

    pde = create_proc_entry("foo", 0, NULL);
    if (pde)
    pde->proc_fops = &foo_proc_fops;

    Notice that PDE is first created, only then ->proc_fops is set up to
    final value. This is a problem because right after creation
    a) PDE is fully visible in /proc , and
    b) ->proc_fops are proc_file_operations which do not have ->open callback. So, it's
    possible to ->read without ->open (see one class of oopses below).

    The fix is new API called proc_create() which makes sure ->proc_fops are
    set up before gluing PDE to main tree. Typical new code looks like:

    pde = proc_create("foo", 0, NULL, &foo_proc_fops);
    if (!pde)
    return -ENOMEM;

    Fix most networking users for a start.

    In the long run, create_proc_entry() for regular files will go.

    In addition to this, proc_create_data is introduced to fix reading from
    proc without PDE->data. The race is basically the same as above.

    create_proc_entries is replaced in the entire kernel code as new method
    is also simply better.

    This patch:

    The problem is the same as for de->proc_fops. Right now PDE becomes visible
    without data set. So, the entry could be looked up without data. This, in
    most cases, will simply OOPS.

    proc_create_data call is created to address this issue. proc_create now
    becomes a wrapper around it.

    Signed-off-by: Denis V. Lunev
    Cc: "Eric W. Biederman"
    Cc: "J. Bruce Fields"
    Cc: Alessandro Zummo
    Cc: Alexey Dobriyan
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Benjamin Herrenschmidt
    Cc: Bjorn Helgaas
    Cc: Chris Mason
    Acked-by: David Howells
    Cc: Dmitry Torokhov
    Cc: Geert Uytterhoeven
    Cc: Grant Grundler
    Cc: Greg Kroah-Hartman
    Cc: Haavard Skinnemoen
    Cc: Heiko Carstens
    Cc: Ingo Molnar
    Cc: James Bottomley
    Cc: Jaroslav Kysela
    Cc: Jeff Garzik
    Cc: Jeff Mahoney
    Cc: Jesper Nilsson
    Cc: Karsten Keil
    Cc: Kyle McMartin
    Cc: Len Brown
    Cc: Martin Schwidefsky
    Cc: Mathieu Desnoyers
    Cc: Matthew Wilcox
    Cc: Mauro Carvalho Chehab
    Cc: Mikael Starvik
    Cc: Nadia Derbey
    Cc: Neil Brown
    Cc: Paul Mackerras
    Cc: Peter Osterlund
    Cc: Pierre Peiffer
    Cc: Russell King
    Cc: Takashi Iwai
    Cc: Tony Luck
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Denis V. Lunev
     
  • Remove proc_root export. Creation and removal works well if parent PDE is
    supplied as NULL -- it worked always that way.

    So, one useless export removed and consistency added, some drivers created
    PDEs with &proc_root as parent but removed them as NULL and so on.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Use creation by full path: "driver/foo".

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Use creation by full path instead: "fs/foo".

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Remove proc_bus export and variable itself. Using pathnames works fine
    and is slightly more understandable and greppable.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

09 Feb, 2008

1 commit

  • Typical PDE creation code looks like:

    pde = create_proc_entry("foo", 0, NULL);
    if (pde)
    pde->proc_fops = &foo_proc_fops;

    Notice that PDE is first created, only then ->proc_fops is set up to
    final value. This is a problem because right after creation
    a) PDE is fully visible in /proc , and
    b) ->proc_fops are proc_file_operations which do not have ->open callback. So, it's
    possible to ->read without ->open (see one class of oopses below).

    The fix is new API called proc_create() which makes sure ->proc_fops are
    set up before gluing PDE to main tree. Typical new code looks like:

    pde = proc_create("foo", 0, NULL, &foo_proc_fops);
    if (!pde)
    return -ENOMEM;

    Fix most networking users for a start.

    In the long run, create_proc_entry() for regular files will go.

    BUG: unable to handle kernel NULL pointer dereference at virtual address 00000024
    printing eip: c1188c1b *pdpt = 000000002929e001 *pde = 0000000000000000
    Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    last sysfs file: /sys/block/sda/sda1/dev
    Modules linked in: foo af_packet ipv6 cpufreq_ondemand loop serio_raw psmouse k8temp hwmon sr_mod cdrom

    Pid: 24679, comm: cat Not tainted (2.6.24-rc3-mm1 #2)
    EIP: 0060:[] EFLAGS: 00210002 CPU: 0
    EIP is at mutex_lock_nested+0x75/0x25d
    EAX: 000006fe EBX: fffffffb ECX: 00001000 EDX: e9340570
    ESI: 00000020 EDI: 00200246 EBP: e9340570 ESP: e8ea1ef8
    DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
    Process cat (pid: 24679, ti=E8EA1000 task=E9340570 task.ti=E8EA1000)
    Stack: 00000000 c106f7ce e8ee05b4 00000000 00000001 458003d0 f6fb6f20 fffffffb
    00000000 c106f7aa 00001000 c106f7ce 08ae9000 f6db53f0 00000020 00200246
    00000000 00000002 00000000 00200246 00200246 e8ee05a0 fffffffb e8ee0550
    Call Trace:
    [] seq_read+0x24/0x28a
    [] seq_read+0x0/0x28a
    [] seq_read+0x24/0x28a
    [] seq_read+0x0/0x28a
    [] proc_reg_read+0x60/0x73
    [] proc_reg_read+0x0/0x73
    [] vfs_read+0x6c/0x8b
    [] sys_read+0x3c/0x63
    [] sysenter_past_esp+0x5f/0xa5
    [] destroy_inode+0x24/0x33
    =======================
    INFO: lockdep is turned off.
    Code: 75 21 68 e1 1a 19 c1 68 87 00 00 00 68 b8 e8 1f c1 68 25 73 1f c1 e8 84 06 e9 ff e8 52 b8 e7 ff 83 c4 10 9c 5f fa e8 28 89 ea ff fe 4e 04 79 0a f3 90 80 7e 04 00 7e f8 eb f0 39 76 34 74 33
    EIP: [] mutex_lock_nested+0x75/0x25d SS:ESP 0068:e8ea1ef8

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

06 Dec, 2007

1 commit

  • Creating PDEs with refcount 0 and "deleted" flag has problems (see below).
    Switch to usual scheme:
    * PDE is created with refcount 1
    * every de_get does +1
    * every de_put() and remove_proc_entry() do -1
    * once refcount reaches 0, PDE is freed.

    This elegantly fixes at least two following races (both observed) without
    introducing new locks, without abusing old locks, without spreading
    lock_kernel():

    1) PDE leak

    remove_proc_entry de_put
    ----------------- ------
    [refcnt = 1]
    if (atomic_read(&de->count) == 0)
    if (atomic_dec_and_test(&de->count))
    if (de->deleted)
    /* also not taken! */
    free_proc_entry(de);
    else
    de->deleted = 1;
    [refcount=0, deleted=1]

    2) use after free

    remove_proc_entry de_put
    ----------------- ------
    [refcnt = 1]

    if (atomic_dec_and_test(&de->count))
    if (atomic_read(&de->count) == 0)
    free_proc_entry(de);
    /* boom! */
    if (de->deleted)
    free_proc_entry(de);

    BUG: unable to handle kernel paging request at virtual address 6b6b6b6b
    printing eip: c10acdda *pdpt = 00000000338f8001 *pde = 0000000000000000
    Oops: 0000 [#1] PREEMPT SMP
    Modules linked in: af_packet ipv6 cpufreq_ondemand loop serio_raw psmouse k8temp hwmon sr_mod cdrom
    Pid: 23161, comm: cat Not tainted (2.6.24-rc2-8c0863403f109a43d7000b4646da4818220d501f #4)
    EIP: 0060:[] EFLAGS: 00210097 CPU: 1
    EIP is at strnlen+0x6/0x18
    EAX: 6b6b6b6b EBX: 6b6b6b6b ECX: 6b6b6b6b EDX: fffffffe
    ESI: c128fa3b EDI: f380bf34 EBP: ffffffff ESP: f380be44
    DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
    Process cat (pid: 23161, ti=f380b000 task=f38f2570 task.ti=f380b000)
    Stack: c10ac4f0 00000278 c12ce000 f43cd2a8 00000163 00000000 7da86067 00000400
    c128fa20 00896b18 f38325a8 c128fe20 ffffffff 00000000 c11f291e 00000400
    f75be300 c128fa20 f769c9a0 c10ac779 f380bf34 f7bfee70 c1018e6b f380bf34
    Call Trace:
    [] vsnprintf+0x2ad/0x49b
    [] vscnprintf+0x14/0x1f
    [] vprintk+0xc5/0x2f9
    [] handle_fasteoi_irq+0x0/0xab
    [] do_IRQ+0x9f/0xb7
    [] preempt_schedule_irq+0x3f/0x5b
    [] need_resched+0x1f/0x21
    [] printk+0x1b/0x1f
    [] de_put+0x3d/0x50
    [] proc_delete_inode+0x38/0x41
    [] proc_delete_inode+0x0/0x41
    [] generic_delete_inode+0x5e/0xc6
    [] iput+0x60/0x62
    [] d_kill+0x2d/0x46
    [] dput+0xdc/0xe4
    [] __fput+0xb0/0xcd
    [] filp_close+0x48/0x4f
    [] sys_close+0x67/0xa5
    [] sysenter_past_esp+0x5f/0x85
    =======================
    Code: c9 74 0c f2 ae 74 05 bf 01 00 00 00 4f 89 fa 5f 89 d0 c3 85 c9 57 89 c7 89 d0 74 05 f2 ae 75 01 4f 89 f8 5f c3 89 c1 89 c8 eb 06 38 00 74 07 40 4a 83 fa ff 75 f4 29 c8 c3 90 90 90 57 83 c9
    EIP: [] strnlen+0x6/0x18 SS:ESP 0068:f380be44

    Also, remove broken usage of ->deleted from reiserfs: if sget() succeeds,
    module is already pinned and remove_proc_entry() can't happen => nobody
    can mark PDE deleted.

    Dummy proc root in netns code is not marked with refcount 1. AFAICS, we
    never get it, it's just for proper /proc/net removal. I double checked
    CLONE_NETNS continues to work.

    Patch survives many hours of modprobe/rmmod/cat loops without new bugs
    which can be attributed to refcounting.

    Signed-off-by: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

30 Nov, 2007

1 commit

  • proc_kill_inodes() can clear ->i_fop in the middle of vfs_readdir resulting in
    NULL dereference during "file->f_op->readdir(file, buf, filler)".

    The solution is to remove proc_kill_inodes() completely:

    a) we don't have tricky modules implementing their tricky readdir hooks which
    could keeping this revoke from hell.

    b) In a situation when module is gone but PDE still alive, standard
    readdir will return only "." and "..", because pde->next was cleared by
    remove_proc_entry().

    c) the race proc_kill_inode() destined to prevent is not completely
    fixed, just race window made smaller, because vfs_readdir() is run
    without sb_lock held and without file_list_lock held. Effectively,
    ->i_fop is cleared at random moment, which can't fix properly anything.

    BUG: unable to handle kernel NULL pointer dereference at virtual address 00000018
    printing eip: c1061205 *pdpt = 0000000005b22001 *pde = 0000000000000000
    Oops: 0000 [#1] PREEMPT SMP
    Modules linked in: foo af_packet ipv6 cpufreq_ondemand loop serio_raw sr_mod k8temp cdrom hwmon amd_rng
    Pid: 2033, comm: find Not tainted (2.6.24-rc1-b1d08ac064268d0ae2281e98bf5e82627e0f0c56 #2)
    EIP: 0060:[] EFLAGS: 00010246 CPU: 0
    EIP is at vfs_readdir+0x47/0x74
    EAX: c6b6a780 EBX: 00000000 ECX: c1061040 EDX: c5decf94
    ESI: c6b6a780 EDI: fffffffe EBP: c9797c54 ESP: c5decf78
    DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
    Process find (pid: 2033, ti=c5dec000 task=c64bba90 task.ti=c5dec000)
    Stack: c5decf94 c1061040 fffffff7 0805ffbc 00000000 c6b6a780 c1061295 0805ffbc
    00000000 00000400 00000000 00000004 0805ffbc 4588eff4 c5dec000 c10026ba
    00000004 0805ffbc 00000400 0805ffbc 4588eff4 bfdc6c70 000000dc 0000007b
    Call Trace:
    [] filldir64+0x0/0xc5
    [] sys_getdents64+0x63/0xa5
    [] sysenter_past_esp+0x5f/0x85
    =======================
    Code: 49 83 78 18 00 74 43 8d 6b 74 bf fe ff ff ff 89 e8 e8 b8 c0 12 00 f6 83 2c 01 00 00 10 75 22 8b 5e 10 8b 4c 24 04 89 f0 8b 14 24 53 18 f6 46 1a 04 89 c7 75 0b 8b 56 0c 8b 46 08 e8 c8 66 00
    EIP: [] vfs_readdir+0x47/0x74 SS:ESP 0068:c5decf78

    hch: "Nice, getting rid of this is a very good step formwards.
    Unfortunately we have another copy of this junk in
    security/selinux/selinuxfs.c:sel_remove_entries() which would need the
    same treatment."

    Signed-off-by: Alexey Dobriyan
    Acked-by: Christoph Hellwig
    Cc: Al Viro
    Cc: Stephen Smalley
    Cc: James Morris
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan