23 Sep, 2009

3 commits

  • Benjamin Herrenschmidt pointed out that vmemmap
    range is not included in KCORE_RAM, KCORE_VMALLOC ....

    This adds KCORE_VMEMMAP if SPARSEMEM_VMEMMAP is used. By this, vmemmap
    can be readable via /proc/kcore

    Because it's not vmalloc area, vread/vwrite cannot be used. But the range
    is static against the memory layout, this patch handles vmemmap area by
    the same scheme with physical memory.

    This patch assumes SPARSEMEM_VMEMMAP range is not in VMALLOC range. It's
    correct now.

    [akpm@linux-foundation.org: fix typo]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Jiri Slaby
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: WANG Cong
    Cc: Benjamin Herrenschmidt
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Presently, kclist_add() only eats start address and size as its arguments.
    Considering to make kclist dynamically reconfigulable, it's necessary to
    know which kclists are for System RAM and which are not.

    This patch add kclist types as
    KCORE_RAM
    KCORE_VMALLOC
    KCORE_TEXT
    KCORE_OTHER

    This "type" is used in a patch following this for detecting KCORE_RAM.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: WANG Cong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patchset is for /proc/kcore. With this,

    - many per-arch hooks are removed.

    - /proc/kcore will know really valid physical memory area.

    - /proc/kcore will be aware of memory hotplug.

    - /proc/kcore will be architecture independent i.e.
    if an arch supports CONFIG_MMU, it can use /proc/kcore.
    (if the arch uses usual memory layout.)

    This patch:

    /proc/kcore uses its own list handling codes. It's better to use
    generic list codes.

    No changes in logic. just clean up.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: WANG Cong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

12 Jun, 2009

1 commit


31 Mar, 2009

1 commit

  • Setting ->owner as done currently (pde->owner = THIS_MODULE) is racy
    as correctly noted at bug #12454. Someone can lookup entry with NULL
    ->owner, thus not pinning enything, and release it later resulting
    in module refcount underflow.

    We can keep ->owner and supply it at registration time like ->proc_fops
    and ->data.

    But this leaves ->owner as easy-manipulative field (just one C assignment)
    and somebody will forget to unpin previous/pin current module when
    switching ->owner. ->proc_fops is declared as "const" which should give
    some thoughts.

    ->read_proc/->write_proc were just fixed to not require ->owner for
    protection.

    rmmod'ed directories will be empty and return "." and ".." -- no harm.
    And directories with tricky enough readdir and lookup shouldn't be modular.
    We definitely don't want such modular code.

    Removing ->owner will also make PDE smaller.

    So, let's nuke it.

    Kudos to Jeff Layton for reminding about this, let's say, oversight.

    http://bugzilla.kernel.org/show_bug.cgi?id=12454

    Signed-off-by: Alexey Dobriyan

    Alexey Dobriyan
     

23 Oct, 2008

2 commits


07 Oct, 2008

1 commit

  • commit 14cf11af6cf608eb8c23e989ddb17a715ddce109 ("powerpc: Merge enough to
    start building in arch/powerpc.") unwired /proc/ppc_htab, and commit
    917f0af9e5a9ceecf9e72537fabb501254ba321d ("powerpc: Remove arch/ppc and
    include/asm-ppc") removed the rest of the /proc/ppc_htab support, but there are
    still a few references left. Kill them for good.

    Signed-off-by: Geert Uytterhoeven
    Signed-off-by: Benjamin Herrenschmidt

    Geert Uytterhoeven
     

27 Jul, 2008

1 commit

  • * keep references to ctl_table_head and ctl_table in /proc/sys inodes
    * grab the former during operations, use the latter for access to
    entry if that succeeds
    * have ->d_compare() check if table should be seen for one who does lookup;
    that allows us to avoid flipping inodes - if we have the same name resolve
    to different things, we'll just keep several dentries and ->d_compare()
    will reject the wrong ones.
    * have ->lookup() and ->readdir() scan the table of our inode first, then
    walk all ctl_table_header and scan ->attached_by for those that are
    attached to our directory.
    * implement ->getattr().
    * get rid of insane amounts of tree-walking
    * get rid of the need to know dentry in ->permission() and of the contortions
    induced by that.

    Signed-off-by: Al Viro

    Al Viro
     

26 Jul, 2008

2 commits

  • Current two-stage scheme of removing PDE emphasizes one bug in proc:

    open
    rmmod
    remove_proc_entry
    close

    ->release won't be called because ->proc_fops were cleared. In simple
    cases it's small memory leak.

    For every ->open, ->release has to be done. List of openers is introduced
    which is traversed at remove_proc_entry() if neeeded.

    Discussions with Al long ago (sigh).

    Signed-off-by: Alexey Dobriyan
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • This patch moves the extern of struct proc_kmsg_operations to
    fs/proc/internal.h and adds an #include "internal.h" to fs/proc/kmsg.c
    so that the latter sees the former.

    Signed-off-by: Adrian Bunk
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     

23 Jul, 2008

1 commit


13 Jun, 2008

1 commit


29 Apr, 2008

7 commits

  • This set of patches fixes an proc ->open'less usage due to ->proc_fops flip in
    the most part of the kernel code. The original OOPS is described in the
    commit 2d3a4e3666325a9709cc8ea2e88151394e8f20fc:

    Typical PDE creation code looks like:

    pde = create_proc_entry("foo", 0, NULL);
    if (pde)
    pde->proc_fops = &foo_proc_fops;

    Notice that PDE is first created, only then ->proc_fops is set up to
    final value. This is a problem because right after creation
    a) PDE is fully visible in /proc , and
    b) ->proc_fops are proc_file_operations which do not have ->open callback. So, it's
    possible to ->read without ->open (see one class of oopses below).

    The fix is new API called proc_create() which makes sure ->proc_fops are
    set up before gluing PDE to main tree. Typical new code looks like:

    pde = proc_create("foo", 0, NULL, &foo_proc_fops);
    if (!pde)
    return -ENOMEM;

    Fix most networking users for a start.

    In the long run, create_proc_entry() for regular files will go.

    In addition to this, proc_create_data is introduced to fix reading from
    proc without PDE->data. The race is basically the same as above.

    create_proc_entries is replaced in the entire kernel code as new method
    is also simply better.

    This patch:

    The problem is the same as for de->proc_fops. Right now PDE becomes visible
    without data set. So, the entry could be looked up without data. This, in
    most cases, will simply OOPS.

    proc_create_data call is created to address this issue. proc_create now
    becomes a wrapper around it.

    Signed-off-by: Denis V. Lunev
    Cc: "Eric W. Biederman"
    Cc: "J. Bruce Fields"
    Cc: Alessandro Zummo
    Cc: Alexey Dobriyan
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Benjamin Herrenschmidt
    Cc: Bjorn Helgaas
    Cc: Chris Mason
    Acked-by: David Howells
    Cc: Dmitry Torokhov
    Cc: Geert Uytterhoeven
    Cc: Grant Grundler
    Cc: Greg Kroah-Hartman
    Cc: Haavard Skinnemoen
    Cc: Heiko Carstens
    Cc: Ingo Molnar
    Cc: James Bottomley
    Cc: Jaroslav Kysela
    Cc: Jeff Garzik
    Cc: Jeff Mahoney
    Cc: Jesper Nilsson
    Cc: Karsten Keil
    Cc: Kyle McMartin
    Cc: Len Brown
    Cc: Martin Schwidefsky
    Cc: Mathieu Desnoyers
    Cc: Matthew Wilcox
    Cc: Mauro Carvalho Chehab
    Cc: Mikael Starvik
    Cc: Nadia Derbey
    Cc: Neil Brown
    Cc: Paul Mackerras
    Cc: Peter Osterlund
    Cc: Pierre Peiffer
    Cc: Russell King
    Cc: Takashi Iwai
    Cc: Tony Luck
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Denis V. Lunev
     
  • Now that last dozen or so users of ->get_info were removed, ditch it too.
    Everyone sane shouldd have switched to seq_file interface long ago.

    P.S.: Co-existing 3 interfaces (->get_info/->read_proc/->proc_fops) for proc
    is long-standing crap, BTW, thus
    a) put ->read_proc/->write_proc/read_proc_entry() users on death row,
    b) new such users should be rejected,
    c) everyone is encouraged to convert his favourite ->read_proc user or
    I'll do it, lazy bastards.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Remove proc_root export. Creation and removal works well if parent PDE is
    supplied as NULL -- it worked always that way.

    So, one useless export removed and consistency added, some drivers created
    PDEs with &proc_root as parent but removed them as NULL and so on.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Use creation by full path: "driver/foo".

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Use creation by full path instead: "fs/foo".

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Remove proc_bus export and variable itself. Using pathnames works fine
    and is slightly more understandable and greppable.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • The kernel implements readlink of /proc/pid/exe by getting the file from
    the first executable VMA. Then the path to the file is reconstructed and
    reported as the result.

    Because of the VMA walk the code is slightly different on nommu systems.
    This patch avoids separate /proc/pid/exe code on nommu systems. Instead of
    walking the VMAs to find the first executable file-backed VMA we store a
    reference to the exec'd file in the mm_struct.

    That reference would prevent the filesystem holding the executable file
    from being unmounted even after unmapping the VMAs. So we track the number
    of VM_EXECUTABLE VMAs and drop the new reference when the last one is
    unmapped. This avoids pinning the mounted filesystem.

    [akpm@linux-foundation.org: improve comments]
    [yamamoto@valinux.co.jp: fix dup_mmap]
    Signed-off-by: Matt Helsley
    Cc: Oleg Nesterov
    Cc: David Howells
    Cc:"Eric W. Biederman"
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Hugh Dickins
    Signed-off-by: YAMAMOTO Takashi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Helsley
     

08 Mar, 2008

1 commit

  • Current /proc/net is done with so called "shadows", but current
    implementation is broken and has little chances to get fixed.

    The problem is that dentries subtree of /proc/net directory has
    fancy revalidation rules to make processes living in different
    net namespaces see different entries in /proc/net subtree, but
    currently, tasks see in the /proc/net subdir the contents of any
    other namespace, depending on who opened the file first.

    The proposed fix is to turn /proc/net into a symlink, which points
    to /proc/self/net, which in turn shows what previously was in
    /proc/net - the network-related info, from the net namespace the
    appropriate task lives in.

    # ls -l /proc/net
    lrwxrwxrwx 1 root root 8 Mar 5 15:17 /proc/net -> self/net

    In other words - this behaves like /proc/mounts, but unlike
    "mounts", "net" is not a file, but a directory.

    Changes from v2:
    * Fixed discrepancy of /proc/net nlink count and selinux labeling
    screwup pointed out by Stephen.

    To get the correct nlink count the ->getattr callback for /proc/net
    is overridden to read one from the net->proc_net entry.

    To make selinux still work the net->proc_net entry is initialized
    properly, i.e. with the "net" name and the proc_net parent.

    Selinux fixes are
    Acked-by: Stephen Smalley

    Changes from v1:
    * Fixed a task_struct leak in get_proc_task_net, pointed out by Paul.

    Signed-off-by: Pavel Emelyanov
    Acked-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     

15 Feb, 2008

1 commit

  • proc_get_link() is always called with a dentry and a vfsmount from a struct
    path. Make proc_get_link() take it directly as an argument.

    Signed-off-by: Jan Blunck
    Acked-by: Christoph Hellwig
    Cc: Al Viro
    Cc: "J. Bruce Fields"
    Cc: Neil Brown
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Blunck
     

09 Feb, 2008

3 commits

  • Typical PDE creation code looks like:

    pde = create_proc_entry("foo", 0, NULL);
    if (pde)
    pde->proc_fops = &foo_proc_fops;

    Notice that PDE is first created, only then ->proc_fops is set up to
    final value. This is a problem because right after creation
    a) PDE is fully visible in /proc , and
    b) ->proc_fops are proc_file_operations which do not have ->open callback. So, it's
    possible to ->read without ->open (see one class of oopses below).

    The fix is new API called proc_create() which makes sure ->proc_fops are
    set up before gluing PDE to main tree. Typical new code looks like:

    pde = proc_create("foo", 0, NULL, &foo_proc_fops);
    if (!pde)
    return -ENOMEM;

    Fix most networking users for a start.

    In the long run, create_proc_entry() for regular files will go.

    BUG: unable to handle kernel NULL pointer dereference at virtual address 00000024
    printing eip: c1188c1b *pdpt = 000000002929e001 *pde = 0000000000000000
    Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    last sysfs file: /sys/block/sda/sda1/dev
    Modules linked in: foo af_packet ipv6 cpufreq_ondemand loop serio_raw psmouse k8temp hwmon sr_mod cdrom

    Pid: 24679, comm: cat Not tainted (2.6.24-rc3-mm1 #2)
    EIP: 0060:[] EFLAGS: 00210002 CPU: 0
    EIP is at mutex_lock_nested+0x75/0x25d
    EAX: 000006fe EBX: fffffffb ECX: 00001000 EDX: e9340570
    ESI: 00000020 EDI: 00200246 EBP: e9340570 ESP: e8ea1ef8
    DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
    Process cat (pid: 24679, ti=E8EA1000 task=E9340570 task.ti=E8EA1000)
    Stack: 00000000 c106f7ce e8ee05b4 00000000 00000001 458003d0 f6fb6f20 fffffffb
    00000000 c106f7aa 00001000 c106f7ce 08ae9000 f6db53f0 00000020 00200246
    00000000 00000002 00000000 00200246 00200246 e8ee05a0 fffffffb e8ee0550
    Call Trace:
    [] seq_read+0x24/0x28a
    [] seq_read+0x0/0x28a
    [] seq_read+0x24/0x28a
    [] seq_read+0x0/0x28a
    [] proc_reg_read+0x60/0x73
    [] proc_reg_read+0x0/0x73
    [] vfs_read+0x6c/0x8b
    [] sys_read+0x3c/0x63
    [] sysenter_past_esp+0x5f/0xa5
    [] destroy_inode+0x24/0x33
    =======================
    INFO: lockdep is turned off.
    Code: 75 21 68 e1 1a 19 c1 68 87 00 00 00 68 b8 e8 1f c1 68 25 73 1f c1 e8 84 06 e9 ff e8 52 b8 e7 ff 83 c4 10 9c 5f fa e8 28 89 ea ff fe 4e 04 79 0a f3 90 80 7e 04 00 7e f8 eb f0 39 76 34 74 33
    EIP: [] mutex_lock_nested+0x75/0x25d SS:ESP 0068:e8ea1ef8

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Currently we possibly lookup the pid in the wrong pid namespace. So
    seq_file convert proc_pid_status which ensures the proper pid namespaces is
    passed in.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: another build fix]
    [akpm@linux-foundation.org: s390 build fix]
    [akpm@linux-foundation.org: fix task_name() output]
    [akpm@linux-foundation.org: fix nommu build]
    Signed-off-by: Eric W. Biederman
    Cc: Andrew Morgan
    Cc: Serge Hallyn
    Cc: Cedric Le Goater
    Cc: Pavel Emelyanov
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Paul Menage
    Cc: Paul Jackson
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Currently many /proc/pid files use a crufty precursor to the current seq_file
    api, and they don't have direct access to the pid_namespace or the pid of for
    which they are displaying data.

    So implement proc_single_file_operations to make the seq_file routines easy to
    use, and to give access to the full state of the pid of we are displaying data
    for.

    Signed-off-by: Eric W. Biederman
    Cc: Oleg Nesterov
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

06 Feb, 2008

1 commit

  • This puts all the clear_refs code where it belongs and probably lets things
    compile on MMU-less systems as well.

    Signed-off-by: Matt Mackall
    Cc: Jeremy Fitzhardinge
    Cc: David Rientjes
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
     

29 Jan, 2008

1 commit

  • cat /proc/net/atm/arp causes the NULL pointer dereference in the
    get_proc_net+0xc/0x3a. This happens as proc_get_net believes that the
    parent proc dir entry contains struct net.

    Fix this assumption for "net/atm" case.

    The problem is introduced by the commit c0097b07abf5f92ab135d024dd41bd2aada1512f
    from Eric W. Biederman/Daniel Lezcano.

    Signed-off-by: Denis V. Lunev
    Signed-off-by: David S. Miller

    Denis V. Lunev
     

06 Dec, 2007

1 commit

  • Creating PDEs with refcount 0 and "deleted" flag has problems (see below).
    Switch to usual scheme:
    * PDE is created with refcount 1
    * every de_get does +1
    * every de_put() and remove_proc_entry() do -1
    * once refcount reaches 0, PDE is freed.

    This elegantly fixes at least two following races (both observed) without
    introducing new locks, without abusing old locks, without spreading
    lock_kernel():

    1) PDE leak

    remove_proc_entry de_put
    ----------------- ------
    [refcnt = 1]
    if (atomic_read(&de->count) == 0)
    if (atomic_dec_and_test(&de->count))
    if (de->deleted)
    /* also not taken! */
    free_proc_entry(de);
    else
    de->deleted = 1;
    [refcount=0, deleted=1]

    2) use after free

    remove_proc_entry de_put
    ----------------- ------
    [refcnt = 1]

    if (atomic_dec_and_test(&de->count))
    if (atomic_read(&de->count) == 0)
    free_proc_entry(de);
    /* boom! */
    if (de->deleted)
    free_proc_entry(de);

    BUG: unable to handle kernel paging request at virtual address 6b6b6b6b
    printing eip: c10acdda *pdpt = 00000000338f8001 *pde = 0000000000000000
    Oops: 0000 [#1] PREEMPT SMP
    Modules linked in: af_packet ipv6 cpufreq_ondemand loop serio_raw psmouse k8temp hwmon sr_mod cdrom
    Pid: 23161, comm: cat Not tainted (2.6.24-rc2-8c0863403f109a43d7000b4646da4818220d501f #4)
    EIP: 0060:[] EFLAGS: 00210097 CPU: 1
    EIP is at strnlen+0x6/0x18
    EAX: 6b6b6b6b EBX: 6b6b6b6b ECX: 6b6b6b6b EDX: fffffffe
    ESI: c128fa3b EDI: f380bf34 EBP: ffffffff ESP: f380be44
    DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
    Process cat (pid: 23161, ti=f380b000 task=f38f2570 task.ti=f380b000)
    Stack: c10ac4f0 00000278 c12ce000 f43cd2a8 00000163 00000000 7da86067 00000400
    c128fa20 00896b18 f38325a8 c128fe20 ffffffff 00000000 c11f291e 00000400
    f75be300 c128fa20 f769c9a0 c10ac779 f380bf34 f7bfee70 c1018e6b f380bf34
    Call Trace:
    [] vsnprintf+0x2ad/0x49b
    [] vscnprintf+0x14/0x1f
    [] vprintk+0xc5/0x2f9
    [] handle_fasteoi_irq+0x0/0xab
    [] do_IRQ+0x9f/0xb7
    [] preempt_schedule_irq+0x3f/0x5b
    [] need_resched+0x1f/0x21
    [] printk+0x1b/0x1f
    [] de_put+0x3d/0x50
    [] proc_delete_inode+0x38/0x41
    [] proc_delete_inode+0x0/0x41
    [] generic_delete_inode+0x5e/0xc6
    [] iput+0x60/0x62
    [] d_kill+0x2d/0x46
    [] dput+0xdc/0xe4
    [] __fput+0xb0/0xcd
    [] filp_close+0x48/0x4f
    [] sys_close+0x67/0xa5
    [] sysenter_past_esp+0x5f/0x85
    =======================
    Code: c9 74 0c f2 ae 74 05 bf 01 00 00 00 4f 89 fa 5f 89 d0 c3 85 c9 57 89 c7 89 d0 74 05 f2 ae 75 01 4f 89 f8 5f c3 89 c1 89 c8 eb 06 38 00 74 07 40 4a 83 fa ff 75 f4 29 c8 c3 90 90 90 57 83 c9
    EIP: [] strnlen+0x6/0x18 SS:ESP 0068:f380be44

    Also, remove broken usage of ->deleted from reiserfs: if sget() succeeds,
    module is already pinned and remove_proc_entry() can't happen => nobody
    can mark PDE deleted.

    Dummy proc root in netns code is not marked with refcount 1. AFAICS, we
    never get it, it's just for proper /proc/net removal. I double checked
    CLONE_NETNS continues to work.

    Patch survives many hours of modprobe/rmmod/cat loops without new bugs
    which can be attributed to refcounting.

    Signed-off-by: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

01 Dec, 2007

1 commit

  • Well I clearly goofed when I added the initial network namespace support
    for /proc/net. Currently things work but there are odd details visible to
    user space, even when we have a single network namespace.

    Since we do not cache proc_dir_entry dentries at the moment we can just
    modify ->lookup to return a different directory inode depending on the
    network namespace of the process looking at /proc/net, replacing the
    current technique of using a magic and fragile follow_link method.

    To accomplish that this patch:
    - introduces a shadow_proc method to allow different dentries to
    be returned from proc_lookup.
    - Removes the old /proc/net follow_link magic
    - Fixes a weakness in our not caching of proc generic dentries.

    As shadow_proc uses a task struct to decided which dentry to return we can
    go back later and fix the proc generic caching without modifying any code
    that uses the shadow_proc method.

    Signed-off-by: Eric W. Biederman
    Cc: "Rafael J. Wysocki"
    Cc: Pavel Machek
    Cc: Pavel Emelyanov
    Cc: "David S. Miller"
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Herbert Xu

    Eric W. Biederman
     

07 Nov, 2007

1 commit


20 Oct, 2007

3 commits

  • The namespace's proc_mnt must be kern_mount-ed to make this pointer always
    valid, independently of whether the user space mounted the proc or not. This
    solves raced in proc_flush_task, etc. with the proc_mnt switching from NULL
    to not-NULL.

    The initialization is done after the init's pid is created and hashed to make
    proc_get_sb() finr it and get for root inode.

    Sice the namespace holds the vfsmnt, vfsmnt holds the superblock and the
    superblock holds the namespace we must explicitly break this circle to destroy
    all the stuff. This is done after the init of the namespace dies. Running a
    few steps forward - when init exits it will kill all its children, so no
    proc_mnt will be needed after its death.

    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Each pid namespace have to be visible through its own proc mount. Thus we
    need to have per-namespace proc trees with their own superblocks.

    We cannot easily show different pid namespace via one global proc tree, since
    each pid refers to different tasks in different namespaces. E.g. pid 1
    refers to the init task in the initial namespace and to some other task when
    seeing from another namespace. Moreover - pid, exisintg in one namespace may
    not exist in the other.

    This approach has one move advantage is that the tasks from the init namespace
    can see what tasks live in another namespace by reading entries from another
    proc tree.

    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • The first part is trivial - we just make the proc_flush_task() to operate on
    arbitrary vfsmount with arbitrary ids and pass the pid and global proc_mnt to
    it.

    The other change is more tricky: I moved the proc_flush_task() call in
    release_task() higher to address the following problem.

    When flushing task from many proc trees we need to know the set of ids (not
    just one pid) to find the dentries' names to flush. Thus we need to pass the
    task's pid to proc_flush_task() as struct pid is the only object that can
    provide all the pid numbers. But after __exit_signal() task has detached all
    his pids and this information is lost.

    This creates a tiny gap for proc_pid_lookup() to bring some dentries back to
    tree and keep them in hash (since pids are still alive before __exit_signal())
    till the next shrink, but since proc_flush_task() does not provide a 100%
    guarantee that the dentries will be flushed, this is OK to do so.

    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     

11 Oct, 2007

2 commits

  • The problem: proc_net files remember which network namespace the are
    against but do not remember hold a reference count (as that would pin
    the network namespace). So we currently have a small window where
    the reference count on a network namespace may be incremented when opening
    a /proc file when it has already gone to zero.

    To fix this introduce maybe_get_net and get_proc_net.

    maybe_get_net increments the network namespace reference count only if it is
    greater then zero, ensuring we don't increment a reference count after it
    has gone to zero.

    get_proc_net handles all of the magic to go from a proc inode to the network
    namespace instance and call maybe_get_net on it.

    PROC_NET the old accessor is removed so that we don't get confused and use
    the wrong helper function.

    Then I fix up the callers to use get_proc_net and handle the case case
    where get_proc_net returns NULL. In that case I return -ENXIO because
    effectively the network namespace has already gone away so the files
    we are trying to access don't exist anymore.

    Signed-off-by: Eric W. Biederman
    Acked-by: Paul E. McKenney
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • This patch makes /proc/net per network namespace. It modifies the global
    variables proc_net and proc_net_stat to be per network namespace.
    The proc_net file helpers are modified to take a network namespace argument,
    and all of their callers are fixed to pass &init_net for that argument.
    This ensures that all of the /proc/net files are only visible and
    usable in the initial network namespace until the code behind them
    has been updated to be handle multiple network namespaces.

    Making /proc/net per namespace is necessary as at least some files
    in /proc/net depend upon the set of network devices which is per
    network namespace, and even more files in /proc/net have contents
    that are relevant to a single network namespace.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

12 Aug, 2007

1 commit


17 Jul, 2007

1 commit

  • Fix following races:
    ===========================================
    1. Write via ->write_proc sleeps in copy_from_user(). Module disappears
    meanwhile. Or, more generically, system call done on /proc file, method
    supplied by module is called, module dissapeares meanwhile.

    pde = create_proc_entry()
    if (!pde)
    return -ENOMEM;
    pde->write_proc = ...
    open
    write
    copy_from_user
    pde = create_proc_entry();
    if (!pde) {
    remove_proc_entry();
    return -ENOMEM;
    /* module unloaded */
    }
    *boom*
    ==========================================
    2. bogo-revoke aka proc_kill_inodes()

    remove_proc_entry vfs_read
    proc_kill_inodes [check ->f_op validness]
    [check ->f_op->read validness]
    [verify_area, security permissions checks]
    ->f_op = NULL;
    if (file->f_op->read)
    /* ->f_op dereference, boom */

    NOTE, NOTE, NOTE: file_operations are proxied for regular files only. Let's
    see how this scheme behaves, then extend if needed for directories.
    Directories creators in /proc only set ->owner for them, so proxying for
    directories may be unneeded.

    NOTE, NOTE, NOTE: methods being proxied are ->llseek, ->read, ->write,
    ->poll, ->unlocked_ioctl, ->ioctl, ->compat_ioctl, ->open, ->release.
    If your in-tree module uses something else, yell on me. Full audit pending.

    [akpm@linux-foundation.org: build fix]
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

09 May, 2007

1 commit

  • proc_lookup remove_proc_entry
    =========== =================

    lock_kernel();
    spin_lock(&proc_subdir_lock);
    [find PDE with refcount 0]
    spin_unlock(&proc_subdir_lock);
    spin_lock(&proc_subdir_lock);
    [find PDE with refcount 0]
    [check refcount and free PDE]
    spin_unlock(&proc_subdir_lock);
    proc_get_inode:
    de_get(de); /* boom */

    Signed-off-by: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

08 May, 2007

1 commit

  • Adds /proc/pid/clear_refs. When any non-zero number is written to this file,
    pte_mkold() and ClearPageReferenced() is called for each pte and its
    corresponding page, respectively, in that task's VMAs. This file is only
    writable by the user who owns the task.

    It is now possible to measure _approximately_ how much memory a task is using
    by clearing the reference bits with

    echo 1 > /proc/pid/clear_refs

    and checking the reference count for each VMA from the /proc/pid/smaps output
    at a measured time interval. For example, to observe the approximate change
    in memory footprint for a task, write a script that clears the references
    (echo 1 > /proc/pid/clear_refs), sleeps, and then greps for Pgs_Referenced and
    extracts the size in kB. Add the sizes for each VMA together for the total
    referenced footprint. Moments later, repeat the process and observe the
    difference.

    For example, using an efficient Mozilla:

    accumulated time referenced memory
    ---------------- -----------------
    0 s 408 kB
    1 s 408 kB
    2 s 556 kB
    3 s 1028 kB
    4 s 872 kB
    5 s 1956 kB
    6 s 416 kB
    7 s 1560 kB
    8 s 2336 kB
    9 s 1044 kB
    10 s 416 kB

    This is a valuable tool to get an approximate measurement of the memory
    footprint for a task.

    Cc: Hugh Dickins
    Cc: Paul Mundt
    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    [akpm@linux-foundation.org: build fixes]
    [mpm@selenic.com: rename for_each_pmd]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

15 Feb, 2007

1 commit

  • With this change the sysctl inodes can be cached and nothing needs to be done
    when removing a sysctl table.

    For a cost of 2K code we will save about 4K of static tables (when we remove
    de from ctl_table) and 70K in proc_dir_entries that we will not allocate, or
    about half that on a 32bit arch.

    The speed feels about the same, even though we can now cache the sysctl
    dentries :(

    We get the core advantage that we don't need to have a 1 to 1 mapping between
    ctl table entries and proc files. Making it possible to have /proc/sys vary
    depending on the namespace you are in. The currently merged namespaces don't
    have an issue here but the network namespace under /proc/sys/net needs to have
    different directories depending on which network adapters are visible. By
    simply being a cache different directories being visible depending on who you
    are is trivial to implement.

    [akpm@osdl.org: fix uninitialised var]
    [akpm@osdl.org: fix ARM build]
    [bunk@stusta.de: make things static]
    Signed-off-by: Eric W. Biederman
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman