06 Dec, 2007

1 commit

  • Creating PDEs with refcount 0 and "deleted" flag has problems (see below).
    Switch to usual scheme:
    * PDE is created with refcount 1
    * every de_get() does +1
    * every de_put() and remove_proc_entry() do -1
    * once refcount reaches 0, PDE is freed.
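
    The scheme above can be sketched in user-space C with C11 atomics. This is
    only an illustration under stated assumptions, not the kernel code: struct
    pde, pde_create(), pde_remove() and pde_freed are hypothetical names; only
    de_get() and de_put() are names taken from this message.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct proc_dir_entry; not the kernel's layout. */
struct pde {
    atomic_int count;
};

/* Counts frees so that a caller can observe when the entry really goes away. */
static atomic_int pde_freed;

/* A new entry starts with one reference, owned by its creator. */
static struct pde *pde_create(void)
{
    struct pde *de = malloc(sizeof(*de));

    if (de)
        atomic_init(&de->count, 1);
    return de;
}

/* Every user takes its own reference. */
static void de_get(struct pde *de)
{
    atomic_fetch_add(&de->count, 1);
}

/* Whoever drops the last reference frees the entry; no "deleted" flag. */
static void de_put(struct pde *de)
{
    /* atomic_fetch_sub() returns the previous value: 1 means we hit zero. */
    if (atomic_fetch_sub(&de->count, 1) == 1) {
        free(de);
        atomic_fetch_add(&pde_freed, 1);
    }
}

/* Removal only drops the creation reference, exactly like a put. */
static void pde_remove(struct pde *de)
{
    de_put(de);
}
```

    Because the decrement and the zero test are one atomic operation, the
    separate "read the count, then decide" step behind both races below is gone.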

    This elegantly fixes at least the following two races (both observed) without
    introducing new locks, without abusing old locks, without spreading
    lock_kernel():

    1) PDE leak

    remove_proc_entry                       de_put
    -----------------                       ------
                        [refcnt = 1]
    if (atomic_read(&de->count) == 0)
                                            if (atomic_dec_and_test(&de->count))
                                                if (de->deleted)
                                                    /* also not taken! */
                                                    free_proc_entry(de);
    else
        de->deleted = 1;
                        [refcount=0, deleted=1]

    2) use after free

    remove_proc_entry                       de_put
    -----------------                       ------
                        [refcnt = 1]

                                            if (atomic_dec_and_test(&de->count))
    if (atomic_read(&de->count) == 0)
        free_proc_entry(de);
                                                /* boom! */
                                                if (de->deleted)
                                                    free_proc_entry(de);

    BUG: unable to handle kernel paging request at virtual address 6b6b6b6b
    printing eip: c10acdda *pdpt = 00000000338f8001 *pde = 0000000000000000
    Oops: 0000 [#1] PREEMPT SMP
    Modules linked in: af_packet ipv6 cpufreq_ondemand loop serio_raw psmouse k8temp hwmon sr_mod cdrom
    Pid: 23161, comm: cat Not tainted (2.6.24-rc2-8c0863403f109a43d7000b4646da4818220d501f #4)
    EIP: 0060:[] EFLAGS: 00210097 CPU: 1
    EIP is at strnlen+0x6/0x18
    EAX: 6b6b6b6b EBX: 6b6b6b6b ECX: 6b6b6b6b EDX: fffffffe
    ESI: c128fa3b EDI: f380bf34 EBP: ffffffff ESP: f380be44
    DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
    Process cat (pid: 23161, ti=f380b000 task=f38f2570 task.ti=f380b000)
    Stack: c10ac4f0 00000278 c12ce000 f43cd2a8 00000163 00000000 7da86067 00000400
    c128fa20 00896b18 f38325a8 c128fe20 ffffffff 00000000 c11f291e 00000400
    f75be300 c128fa20 f769c9a0 c10ac779 f380bf34 f7bfee70 c1018e6b f380bf34
    Call Trace:
    [] vsnprintf+0x2ad/0x49b
    [] vscnprintf+0x14/0x1f
    [] vprintk+0xc5/0x2f9
    [] handle_fasteoi_irq+0x0/0xab
    [] do_IRQ+0x9f/0xb7
    [] preempt_schedule_irq+0x3f/0x5b
    [] need_resched+0x1f/0x21
    [] printk+0x1b/0x1f
    [] de_put+0x3d/0x50
    [] proc_delete_inode+0x38/0x41
    [] proc_delete_inode+0x0/0x41
    [] generic_delete_inode+0x5e/0xc6
    [] iput+0x60/0x62
    [] d_kill+0x2d/0x46
    [] dput+0xdc/0xe4
    [] __fput+0xb0/0xcd
    [] filp_close+0x48/0x4f
    [] sys_close+0x67/0xa5
    [] sysenter_past_esp+0x5f/0x85
    =======================
    Code: c9 74 0c f2 ae 74 05 bf 01 00 00 00 4f 89 fa 5f 89 d0 c3 85 c9 57 89 c7 89 d0 74 05 f2 ae 75 01 4f 89 f8 5f c3 89 c1 89 c8 eb 06 38 00 74 07 40 4a 83 fa ff 75 f4 29 c8 c3 90 90 90 57 83 c9
    EIP: [] strnlen+0x6/0x18 SS:ESP 0068:f380be44

    Also, remove broken usage of ->deleted from reiserfs: if sget() succeeds,
    module is already pinned and remove_proc_entry() can't happen => nobody
    can mark PDE deleted.

    Dummy proc root in netns code is not marked with refcount 1. AFAICS, we
    never get it, it's just for proper /proc/net removal. I double checked
    CLONE_NETNS continues to work.

    Patch survives many hours of modprobe/rmmod/cat loops without new bugs
    which can be attributed to refcounting.

    Signed-off-by: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

01 Dec, 2007

1 commit

  • Well I clearly goofed when I added the initial network namespace support
    for /proc/net. Currently things work but there are odd details visible to
    user space, even when we have a single network namespace.

    Since we do not cache proc_dir_entry dentries at the moment we can just
    modify ->lookup to return a different directory inode depending on the
    network namespace of the process looking at /proc/net, replacing the
    current technique of using a magic and fragile follow_link method.

    To accomplish that this patch:
    - Introduces a shadow_proc method to allow different dentries to
      be returned from proc_lookup.
    - Removes the old /proc/net follow_link magic.
    - Fixes a weakness in our not caching of proc generic dentries.

    As shadow_proc uses a task struct to decide which dentry to return we can
    go back later and fix the proc generic caching without modifying any code
    that uses the shadow_proc method.

    Signed-off-by: Eric W. Biederman
    Cc: "Rafael J. Wysocki"
    Cc: Pavel Machek
    Cc: Pavel Emelyanov
    Cc: "David S. Miller"
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Herbert Xu

    Eric W. Biederman
     

07 Nov, 2007

1 commit


20 Oct, 2007

3 commits

  • The namespace's proc_mnt must be kern_mount-ed to make this pointer always
    valid, independently of whether user space has mounted proc or not. This
    solves races in proc_flush_task, etc. with the proc_mnt switching from NULL
    to not-NULL.

    The initialization is done after the init's pid is created and hashed to let
    proc_get_sb() find it and get it for the root inode.

    Since the namespace holds the vfsmnt, the vfsmnt holds the superblock and the
    superblock holds the namespace, we must explicitly break this circle to destroy
    all the stuff. This is done after the init of the namespace dies. Looking a
    few steps ahead: when init exits it will kill all its children, so no
    proc_mnt will be needed after its death.

    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Each pid namespace has to be visible through its own proc mount. Thus we
    need to have per-namespace proc trees with their own superblocks.

    We cannot easily show different pid namespaces via one global proc tree, since
    each pid refers to different tasks in different namespaces. E.g. pid 1
    refers to the init task in the initial namespace and to some other task when
    seen from another namespace. Moreover, a pid existing in one namespace may
    not exist in another.

    This approach has one more advantage: tasks from the init namespace
    can see what tasks live in another namespace by reading entries from another
    proc tree.

    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • The first part is trivial: we just make proc_flush_task() operate on an
    arbitrary vfsmount with arbitrary ids and pass the pid and global proc_mnt to
    it.

    The other change is more tricky: I moved the proc_flush_task() call in
    release_task() higher to address the following problem.

    When flushing a task from many proc trees we need to know the set of ids (not
    just one pid) to find the dentries' names to flush. Thus we need to pass the
    task's pid to proc_flush_task(), as struct pid is the only object that can
    provide all the pid numbers. But after __exit_signal() the task has detached
    all its pids and this information is lost.

    This creates a tiny gap for proc_pid_lookup() to bring some dentries back to
    the tree and keep them in the hash (since pids are still alive before
    __exit_signal()) till the next shrink, but since proc_flush_task() does not
    provide a 100% guarantee that the dentries will be flushed, this is OK.
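
    The need for the whole set of ids can be illustrated with a small user-space
    sketch. It is not the kernel's struct pid: struct pid_sketch, MAX_LEVELS and
    flush_names() are hypothetical, and the only assumption carried over from
    the text is that one object holds one number per namespace the task is
    visible in, each of which names a dentry in a different proc tree.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_LEVELS 4

/* Hypothetical: one numeric id per namespace level the task is visible in. */
struct pid_sketch {
    int levels;
    int nr[MAX_LEVELS];
};

/*
 * Produce the directory name to flush in each proc tree.
 * Returns how many names were written.
 */
static int flush_names(const struct pid_sketch *pid, char names[][16])
{
    for (int i = 0; i < pid->levels; i++)
        snprintf(names[i], 16, "%d", pid->nr[i]);
    return pid->levels;
}
```

    Once the numbers are detached from the task there is nothing left to
    iterate over, which is why the flush has to run before that point.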

    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     

11 Oct, 2007

2 commits

  • The problem: proc_net files remember which network namespace they are
    against but do not hold a reference count (as that would pin
    the network namespace). So we currently have a small window where
    the reference count on a network namespace may be incremented when opening
    a /proc file when it has already gone to zero.

    To fix this, introduce maybe_get_net and get_proc_net.

    maybe_get_net increments the network namespace reference count only if it is
    greater than zero, ensuring we don't increment a reference count after it
    has gone to zero.
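
    The "increment only if greater than zero" rule can be sketched with a C11
    compare-and-exchange loop. This is a user-space illustration, not the
    kernel helper: struct net_sketch and maybe_get() are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a network namespace; only the counter matters. */
struct net_sketch {
    atomic_int count;
};

/* Take a reference only while the count is still above zero. */
static bool maybe_get(struct net_sketch *net)
{
    int old = atomic_load(&net->count);

    while (old > 0) {
        /* On failure, old is updated to the current value and rechecked. */
        if (atomic_compare_exchange_weak(&net->count, &old, old + 1))
            return true;
    }
    return false; /* already at zero: the caller must treat it as gone */
}
```

    A plain unconditional increment would revive a counter that has already
    dropped to zero, which is exactly the window described above.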

    get_proc_net handles all of the magic to go from a proc inode to the network
    namespace instance and call maybe_get_net on it.

    PROC_NET, the old accessor, is removed so that we don't get confused and use
    the wrong helper function.

    Then I fix up the callers to use get_proc_net and handle the case
    where get_proc_net returns NULL. In that case I return -ENXIO because
    effectively the network namespace has already gone away so the files
    we are trying to access don't exist anymore.

    Signed-off-by: Eric W. Biederman
    Acked-by: Paul E. McKenney
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • This patch makes /proc/net per network namespace. It modifies the global
    variables proc_net and proc_net_stat to be per network namespace.
    The proc_net file helpers are modified to take a network namespace argument,
    and all of their callers are fixed to pass &init_net for that argument.
    This ensures that all of the /proc/net files are only visible and
    usable in the initial network namespace until the code behind them
    has been updated to handle multiple network namespaces.

    Making /proc/net per namespace is necessary as at least some files
    in /proc/net depend upon the set of network devices which is per
    network namespace, and even more files in /proc/net have contents
    that are relevant to a single network namespace.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

12 Aug, 2007

1 commit


17 Jul, 2007

1 commit

  • Fix the following races:
    ===========================================
    1. Write via ->write_proc sleeps in copy_from_user(). Module disappears
    meanwhile. Or, more generically, system call done on /proc file, method
    supplied by module is called, module disappears meanwhile.

    pde = create_proc_entry()
    if (!pde)
        return -ENOMEM;
    pde->write_proc = ...
                                    open
                                    write
                                    copy_from_user
    pde = create_proc_entry();
    if (!pde) {
        remove_proc_entry();
        return -ENOMEM;
        /* module unloaded */
    }
                                    *boom*
    ==========================================
    2. bogo-revoke aka proc_kill_inodes()

    remove_proc_entry               vfs_read
    proc_kill_inodes                [check ->f_op validness]
                                    [check ->f_op->read validness]
                                    [verify_area, security permissions checks]
    ->f_op = NULL;
                                    if (file->f_op->read)
                                        /* ->f_op dereference, boom */

    NOTE, NOTE, NOTE: file_operations are proxied for regular files only. Let's
    see how this scheme behaves, then extend if needed for directories.
    Directory creators in /proc only set ->owner for them, so proxying for
    directories may be unneeded.

    NOTE, NOTE, NOTE: methods being proxied are ->llseek, ->read, ->write,
    ->poll, ->unlocked_ioctl, ->ioctl, ->compat_ioctl, ->open, ->release.
    If your in-tree module uses something else, yell at me. Full audit pending.

    [akpm@linux-foundation.org: build fix]
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

09 May, 2007

1 commit

  • proc_lookup                         remove_proc_entry
    ===========                         =================

    lock_kernel();
    spin_lock(&proc_subdir_lock);
    [find PDE with refcount 0]
    spin_unlock(&proc_subdir_lock);
                                        spin_lock(&proc_subdir_lock);
                                        [find PDE with refcount 0]
                                        [check refcount and free PDE]
                                        spin_unlock(&proc_subdir_lock);
    proc_get_inode:
        de_get(de); /* boom */

    Signed-off-by: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

08 May, 2007

1 commit

  • Adds /proc/pid/clear_refs. When any non-zero number is written to this file,
    pte_mkold() and ClearPageReferenced() are called for each pte and its
    corresponding page, respectively, in that task's VMAs. This file is only
    writable by the user who owns the task.

    It is now possible to measure _approximately_ how much memory a task is using
    by clearing the reference bits with

    echo 1 > /proc/pid/clear_refs

    and checking the reference count for each VMA from the /proc/pid/smaps output
    at a measured time interval. For example, to observe the approximate change
    in memory footprint for a task, write a script that clears the references
    (echo 1 > /proc/pid/clear_refs), sleeps, and then greps for Pgs_Referenced and
    extracts the size in kB. Add the sizes for each VMA together for the total
    referenced footprint. Moments later, repeat the process and observe the
    difference.
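
    The summing step of such a script could look like the sketch below, written
    in C rather than shell. sum_kb() is a hypothetical helper, and the field
    label is passed in by the caller because the exact label in the smaps
    output is an assumption here; Pgs_Referenced is the name used in this
    message.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Sum every "<field> <n> kB" line in smaps-style text read from `in`.
 * Lines that do not start with `field` are ignored.
 */
static long sum_kb(FILE *in, const char *field)
{
    char line[256];
    size_t len = strlen(field);
    long total = 0;
    long kb;

    while (fgets(line, sizeof(line), in)) {
        /* %ld skips the padding between the label and the number. */
        if (strncmp(line, field, len) == 0 &&
            sscanf(line + len, "%ld", &kb) == 1)
            total += kb;
    }
    return total;
}
```

    A caller would write 1 to clear_refs, sleep for the chosen interval, open
    the task's smaps file with fopen() and pass it to sum_kb() together with
    the label to match.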

    For example, using an efficient Mozilla:

    accumulated time        referenced memory
    ----------------        -----------------
             0 s                   408 kB
             1 s                   408 kB
             2 s                   556 kB
             3 s                  1028 kB
             4 s                   872 kB
             5 s                  1956 kB
             6 s                   416 kB
             7 s                  1560 kB
             8 s                  2336 kB
             9 s                  1044 kB
            10 s                   416 kB

    This is a valuable tool to get an approximate measurement of the memory
    footprint for a task.

    Cc: Hugh Dickins
    Cc: Paul Mundt
    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    [akpm@linux-foundation.org: build fixes]
    [mpm@selenic.com: rename for_each_pmd]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

15 Feb, 2007

1 commit

  • With this change the sysctl inodes can be cached and nothing needs to be done
    when removing a sysctl table.

    For a cost of 2K code we will save about 4K of static tables (when we remove
    de from ctl_table) and 70K in proc_dir_entries that we will not allocate, or
    about half that on a 32bit arch.

    The speed feels about the same, even though we can now cache the sysctl
    dentries :(

    We get the core advantage that we don't need to have a 1 to 1 mapping between
    ctl table entries and proc files, which makes it possible to have /proc/sys
    vary depending on the namespace you are in. The currently merged namespaces
    don't have an issue here but the network namespace under /proc/sys/net needs
    to have different directories depending on which network adapters are
    visible. Since /proc/sys is now simply a cache, making different directories
    visible depending on who you are is trivial to implement.

    [akpm@osdl.org: fix uninitialised var]
    [akpm@osdl.org: fix ARM build]
    [bunk@stusta.de: make things static]
    Signed-off-by: Eric W. Biederman
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

13 Feb, 2007

1 commit

  • Many struct inode_operations in the kernel can be "const". Marking them const
    moves these to the .rodata section, which avoids false sharing with potential
    dirty data. In addition it'll catch accidental writes at compile time to
    these shared resources.
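
    A minimal user-space sketch of the idea follows. struct inode_ops_sketch,
    demo_lookup(), demo_ops and call_lookup() are hypothetical names, not kernel
    definitions; the point is only that a const table of function pointers can
    be called through but not assigned to.

```c
#include <assert.h>

/* Hypothetical operations table with a single method. */
struct inode_ops_sketch {
    int (*lookup)(int);
};

static int demo_lookup(int x)
{
    return x + 1;
}

/* const lets the toolchain place the table in read-only data. */
static const struct inode_ops_sketch demo_ops = {
    .lookup = demo_lookup,
};

/* Users hold a pointer-to-const: they can call methods, not replace them. */
static int call_lookup(const struct inode_ops_sketch *ops, int x)
{
    /* ops->lookup = demo_lookup;   would be rejected at compile time */
    return ops->lookup(x);
}
```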

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     

02 Oct, 2006

1 commit


27 Sep, 2006

1 commit


24 Sep, 2006

1 commit


27 Jun, 2006

4 commits

  • Incrementally update my proc-dont-lock-task_structs-indefinitely patches so
    that they work with struct pid instead of struct task_ref.

    Mostly this is a straight 1-1 substitution.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Every inode in /proc holds a reference to a struct task_struct. If a
    directory or file is opened and remains open after the task exits this
    pinning continues. With 8K stacks on a 32bit machine the amount pinned per
    file descriptor is about 10K.

    Normally I would figure a reasonable per user process limit is about 100
    processes. With 80 processes, with 1000 file descriptors each, I can trigger
    the OOM killer on a 32bit kernel, because I have pinned about 800MB of useless
    data.

    This patch replaces the struct task_struct pointer with a pointer to a struct
    task_ref which has a struct task_struct pointer. So the pinning of dead
    tasks does not happen.

    The code now has to contend with the fact that the task may now exit at any
    time, which is a little but not much more complicated.

    With this change it takes about 1000 processes each opening up 1000 file
    descriptors before I can trigger the OOM killer. Much better.
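
    The indirection can be sketched as follows. This is a single-threaded
    user-space illustration with plain integer counts, not the kernel's
    task_ref: all names ending in _sketch, and ref_new(), ref_get(), ref_put()
    and task_exit(), are hypothetical, and real code would need proper
    synchronization.

```c
#include <assert.h>
#include <stdlib.h>

/* The large object we do not want open files to keep alive. */
struct task_sketch {
    char stack[8192];
};

/* Small handle that may outlive the task. */
struct task_ref_sketch {
    int count;                  /* plain int: single-threaded sketch only */
    struct task_sketch *task;   /* NULL once the task has exited */
};

/* The task itself owns the first reference to its handle. */
static struct task_ref_sketch *ref_new(struct task_sketch *t)
{
    struct task_ref_sketch *r = malloc(sizeof(*r));

    if (r) {
        r->count = 1;
        r->task = t;
    }
    return r;
}

/* An open /proc file takes a reference on the handle, not on the task. */
static struct task_ref_sketch *ref_get(struct task_ref_sketch *r)
{
    r->count++;
    return r;
}

static void ref_put(struct task_ref_sketch *r)
{
    if (--r->count == 0)
        free(r);
}

/* On exit the task is freed at once; only the small handle lingers. */
static void task_exit(struct task_ref_sketch *r)
{
    free(r->task);
    r->task = NULL;
    ref_put(r);
}
```

    Every user of the handle must check for a NULL task pointer, which is the
    "task may now exit at any time" complication mentioned above.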

    [mlp@google.com: task_mmu small fixes]
    Signed-off-by: Eric W. Biederman
    Cc: Trond Myklebust
    Cc: Paul Jackson
    Cc: Oleg Nesterov
    Cc: Albert Cahalan
    Signed-off-by: Prasanna Meda
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • To keep the dcache from filling up with dead /proc entries we flush them on
    process exit. However over the years that code has gotten hairy with a
    dentry_pointer and a lock in task_struct and misdocumented as a correctness
    feature.

    I have rewritten this code to look and see if we have a corresponding entry in
    the dcache and if so flush it on process exit. This removes the extra fields
    in the task_struct and allows me to trivially handle the case of a
    /proc//task/ entry as well as the current /proc/ entries.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • The sole remaining use of proc_inode.type is to discover the file descriptor
    number, so just store the file descriptor number and don't worry about
    processing this field. This removes any /proc limits on the maximum number of
    file descriptors, and clears the path to make the hard coded /proc inode
    numbers go away.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

26 Apr, 2006

1 commit


11 Apr, 2006

1 commit


29 Mar, 2006

2 commits

  • This is a conversion to make the various file_operations structs in fs/
    const. Basically a regexp job, with a few manual fixups.

    The goal is both to increase correctness (harder to accidentally write to
    shared data structures) and to reduce the false sharing of cachelines with
    things that get dirty in .data (while .rodata is nicely read only and thus
    cache clean).

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     
  • Mark the f_ops members of inodes as const, as well as fix the
    ripple-through this causes by places that copy this f_ops and then "do
    stuff" with it.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     

27 Mar, 2006

2 commits

  • Change proc_dir_entry->size to be loff_t to represent files like
    /proc/vmcore for 32bit systems with more than 4G memory.

    Needed for seeing correct size for /proc/vmcore for 32-bit systems with >
    4G RAM.
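
    The effect of the field width can be shown with a small worked example. It
    assumes the previous size field was 32 bits wide on such systems, which
    this message implies but does not state; as_32bit_size() and
    as_64bit_size() are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>

/* What a 32-bit size field keeps of a byte count: the value modulo 2^32. */
static uint32_t as_32bit_size(uint64_t bytes)
{
    return (uint32_t)bytes;
}

/* A signed 64-bit field, as loff_t is on Linux, keeps the value intact. */
static int64_t as_64bit_size(uint64_t bytes)
{
    return (int64_t)bytes;
}
```

    For a 5 GiB image (5 * 2^30 bytes) the 32-bit field would report 1 GiB,
    while the 64-bit field reports the full 5368709120 bytes.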

    Signed-off-by: Maneesh Soni
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Maneesh Soni
     
  • It has been discovered that remove_proc_entry has a race when removing
    sibling entries in the proc file system. There's no protection
    around the traversing and removing of elements that belong to the same
    subdirectory.

    This subdirectory list is protected in other areas by the BKL. So the BKL was
    at first used to protect this area too, but unfortunately, remove_proc_entry
    may be called with spinlocks held. The BKL may schedule, so this was not a
    solution.

    The final solution was to add a new global spin lock to protect this list,
    called proc_subdir_lock. This lock now protects the list in
    remove_proc_entry, and I also went around looking for other areas where this
    list is modified and added this protection there too. Care must be taken
    since these locations call several functions that may also schedule.

    Since I don't see any location where the functions that modify the
    subdirectory list are called from interrupts, the irqsave/restore versions of
    the spin lock were _not_ used.
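
    The locking pattern can be sketched in user space with a C11 atomic_flag
    used as a spin lock. This is an illustration under assumptions, not the
    kernel code: struct entry, subdir_lock and the helper functions are
    hypothetical, and the sketch has no notion of interrupts or scheduling.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sibling list node. */
struct entry {
    const char *name;
    struct entry *next;
};

/* One global lock for every sibling list, as in the description above. */
static atomic_flag subdir_lock = ATOMIC_FLAG_INIT;

static void subdir_lock_acquire(void)
{
    while (atomic_flag_test_and_set_explicit(&subdir_lock,
                                             memory_order_acquire))
        ; /* spin until the holder releases the flag */
}

static void subdir_lock_release(void)
{
    atomic_flag_clear_explicit(&subdir_lock, memory_order_release);
}

/* Insert at the head of the list while holding the lock. */
static void add_entry(struct entry **head, struct entry *e)
{
    subdir_lock_acquire();
    e->next = *head;
    *head = e;
    subdir_lock_release();
}

/* Traverse and unlink under the same lock; returns NULL if not found. */
static struct entry *remove_entry(struct entry **head, const char *name)
{
    struct entry *found = NULL;

    subdir_lock_acquire();
    for (struct entry **p = head; *p; p = &(*p)->next) {
        if (strcmp((*p)->name, name) == 0) {
            found = *p;
            *p = found->next;
            found->next = NULL;
            break;
        }
    }
    subdir_lock_release();
    return found;
}
```

    Nothing that can sleep is called between acquire and release, which
    mirrors the care the message asks for around functions that may schedule.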

    Signed-off-by: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     

13 Jan, 2006

1 commit


09 Nov, 2005

1 commit

  • You could open the /proc/sys/net/ipv4/conf// file, then
    wait for the interface to go away, try to grab as much memory as possible in
    hope to hit the (kfreed) ctl_table. Then fill it with pointers to your
    function. Then read from the file you've opened and if you are lucky,
    you'll get it called as ->proc_handler() in kernel mode.

    So this is at least an Oops and possibly more. It does depend on an
    interface going away though, so less of a security risk than it would
    otherwise be.

    Signed-off-by: Greg Kroah-Hartman
    Signed-off-by: Linus Torvalds

    Al Viro
     

08 Nov, 2005

1 commit

  • This patch adds the ability to the SMU driver to recover missing
    calibration partitions from the SMU chip itself. It also adds some
    dynamic mechanism to /proc/device-tree so that new properties are visible
    to userland.

    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Paul Mackerras

    Benjamin Herrenschmidt
     

26 Jun, 2005

1 commit

  • From: "Vivek Goyal"

    o Support for /proc/vmcore interface. This interface exports elf core image
    either in ELF32 or ELF64 format, depending on the format in which elf headers
    have been stored by crashed kernel.
    o Added support for CONFIG_VMCORE config option.
    o Removed the dependency on /proc/kcore.

    From: "Eric W. Biederman"

    This patch has been refactored to more closely match the prevailing style in
    the affected files, and to clearly indicate the dependency between
    /proc/kcore and proc/vmcore.c

    From: Hariprasad Nellitheertha

    This patch contains the code that provides an ELF format interface to the
    previous kernel's memory post kexec reboot.

    Signed off by Hariprasad Nellitheertha
    Signed-off-by: Eric Biederman
    Signed-off-by: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vivek Goyal
     

17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds