08 Oct, 2016

1 commit

  • Allow some seq_puts removals by taking a string instead of a single
    char.

    [akpm@linux-foundation.org: update vmstat_show(), per Joe]
    Link: http://lkml.kernel.org/r/667e1cf3d436de91a5698170a1e98d882905e956.1470704995.git.joe@perches.com
    Signed-off-by: Joe Perches
    Cc: Joe Perches
    Cc: Andi Kleen
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     

27 Aug, 2016

1 commit

  • seq_read() is a nasty piece of work, not to mention buggy.

    It has (I think) an old bug which allows unprivileged userspace to read
    beyond the end of m->buf.

    I was getting these:

    BUG: KASAN: slab-out-of-bounds in seq_read+0xcd2/0x1480 at addr ffff880116889880
    Read of size 2713 by task trinity-c2/1329
    CPU: 2 PID: 1329 Comm: trinity-c2 Not tainted 4.8.0-rc1+ #96
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
    Call Trace:
    kasan_object_err+0x1c/0x80
    kasan_report_error+0x2cb/0x7e0
    kasan_report+0x4e/0x80
    check_memory_region+0x13e/0x1a0
    kasan_check_read+0x11/0x20
    seq_read+0xcd2/0x1480
    proc_reg_read+0x10b/0x260
    do_loop_readv_writev.part.5+0x140/0x2c0
    do_readv_writev+0x589/0x860
    vfs_readv+0x7b/0xd0
    do_readv+0xd8/0x2c0
    SyS_readv+0xb/0x10
    do_syscall_64+0x1b3/0x4b0
    entry_SYSCALL64_slow_path+0x25/0x25
    Object at ffff880116889100, in cache kmalloc-4096 size: 4096
    Allocated:
    PID = 1329
    save_stack_trace+0x26/0x80
    save_stack+0x46/0xd0
    kasan_kmalloc+0xad/0xe0
    __kmalloc+0x1aa/0x4a0
    seq_buf_alloc+0x35/0x40
    seq_read+0x7d8/0x1480
    proc_reg_read+0x10b/0x260
    do_loop_readv_writev.part.5+0x140/0x2c0
    do_readv_writev+0x589/0x860
    vfs_readv+0x7b/0xd0
    do_readv+0xd8/0x2c0
    SyS_readv+0xb/0x10
    do_syscall_64+0x1b3/0x4b0
    return_from_SYSCALL_64+0x0/0x6a
    Freed:
    PID = 0
    (stack is not available)
    Memory state around the buggy address:
    ffff88011688a000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    ffff88011688a080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    >ffff88011688a100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ^
    ffff88011688a180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff88011688a200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ==================================================================
    Disabling lock debugging due to kernel taint

    This seems to be the same thing that Dave Jones was seeing here:

    https://lkml.org/lkml/2016/8/12/334

    There are multiple issues here:

    1) If we enter the function with a non-empty buffer, there is an attempt
    to flush it. But it was not clearing m->from after doing so, which
    means that if we try to do this flush twice in a row without any call
    to traverse() in between, we are going to be reading from the wrong
    place -- the splat above, fixed by this patch.

    2) If there's a short write to userspace because of page faults, the
    buffer may already contain multiple lines (i.e. pos has advanced by
    more than 1), but we don't save the progress that was made so the
    next call will output what we've already returned previously. Since
    that is a much less serious issue (and I have a headache after
    staring at seq_read() for the past 8 hours), I'll leave that for now.

    Link: http://lkml.kernel.org/r/1471447270-32093-1-git-send-email-vegard.nossum@oracle.com
    Signed-off-by: Vegard Nossum
    Reported-by: Dave Jones
    Cc: Al Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vegard Nossum
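
Issue 1 above can be sketched in userspace, assuming a deliberately simplified model rather than the kernel's seq_read(). The names struct seq_model, model_fill() and model_flush() are hypothetical, and model_fill() assumes that fresh output is written at the start of the buffer without touching from. The reset_from flag selects whether from is cleared once the buffer has been drained.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the seq_file fields named in the commit. */
struct seq_model {
	char buf[16];	/* models m->buf */
	size_t from;	/* models m->from: offset of the first pending byte */
	size_t count;	/* models m->count: number of pending bytes */
};

/* Assumption: fresh output lands at buf[0] and leaves 'from' untouched. */
static void model_fill(struct seq_model *m, const char *s)
{
	size_t len = strlen(s);

	assert(len <= sizeof(m->buf));
	memcpy(m->buf, s, len);
	m->count = len;
}

/*
 * Copy up to n pending bytes to out and return how many were copied.
 * Returns 0 without copying if the read would run past the buffer.
 */
static size_t model_flush(struct seq_model *m, char *out, size_t n,
			  int reset_from)
{
	if (n > m->count)
		n = m->count;
	if (m->from + n > sizeof(m->buf))
		return 0;
	memcpy(out, m->buf + m->from, n);
	m->from += n;
	m->count -= n;
	if (m->count == 0 && reset_from)
		m->from = 0;	/* the reset that issue 1 says was missing */
	return n;
}
```

With reset_from set to 0, a second fill followed by a flush copies bytes from the stale offset instead of the new data, which is the kind of misplaced read the commit describes.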
     

15 Apr, 2016

1 commit

  • A lot of seqfile users seem to be using things like %pK that use the
    credentials of the current process, but that is actually completely
    wrong for filesystem interfaces.

    The unix semantics for permission checking files is to check permissions
    at _open_ time, not at read or write time, and that is not just a small
    detail: passing off stdin/stdout/stderr to a suid application and making
    the actual IO happen in privileged context is a classic exploit
    technique.

    So if we want to be able to look at permissions at read time, we need to
    use the file open credentials, not the current ones. Normal file
    accesses can just use "f_cred" (or any of the helper functions that do
    that, like file_ns_capable()), but the seqfile interfaces do not have
    any such options.

    It turns out that seq_file _does_ save away the user_ns information of
    the file, though. Since user_ns is just part of the full credential
    information, replace that special case with saving off the cred pointer
    instead, and suddenly seq_file has all the permission information it
    needs.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

07 Nov, 2015

3 commits

  • Since 5cec38ac866b ("fs, seq_file: fallback to vmalloc instead of oom kill
    processes") seq_buf_alloc() avoids calling the oom killer for PAGE_SIZE or
    smaller allocations; but larger allocations can use the oom killer via
    vmalloc(). Thus reads of small files can return ENOMEM, but larger files
    use the oom killer to avoid ENOMEM.

    The effect of this bug is that reads from /proc and other virtual
    filesystems can return ENOMEM instead of the preferred behavior - oom
    killing something (possibly the calling process). I don't know of anyone
    except Google who has noticed the issue.

    I suspect the fix is more needed in smaller systems where there isn't any
    reclaimable memory. But these seem like the kinds of systems which
    probably don't use the oom killer for production situations.

    Memory overcommit requires use of the oom killer to select a victim
    regardless of file size.

    Enable oom killer for small seq_buf_alloc() allocations.

    Fixes: 5cec38ac866b ("fs, seq_file: fallback to vmalloc instead of oom kill processes")
    Signed-off-by: David Rientjes
    Signed-off-by: Greg Thelen
    Acked-by: Eric Dumazet
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     
  • string_escape_str() escapes an input string according to the given
    criteria. In the case of seq_escape(), the criterion is to convert some
    characters to their octal representation.

    Signed-off-by: Andy Shevchenko
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     
  • This improves code readability.

    Signed-off-by: Andy Shevchenko
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     

12 Sep, 2015

1 commit

  • The seq_ function return values were frequently misused.

    See: commit 1f33c41c03da ("seq_file: Rename seq_overflow() to
    seq_has_overflowed() and make public")

    All uses of these return values have been removed, so convert the
    return types to void.

    Miscellanea:

    o Move seq_put_decimal_ and seq_escape prototypes closer to the
    other seq_vprintf prototypes
    o Reorder seq_putc and seq_puts to return early on overflow
    o Add argument names to seq_vprintf and seq_printf
    o Update the seq_escape kernel-doc
    o Convert a couple of leading spaces to tabs in seq_escape

    Signed-off-by: Joe Perches
    Cc: Al Viro
    Cc: Steven Rostedt
    Cc: Mark Brown
    Cc: Stephen Rothwell
    Cc: Joerg Roedel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     

11 Sep, 2015

1 commit

  • This introduces a new helper and switches current users to use it. All
    patches are compile tested. kmemleak is tested via its own test suite.

    This patch (of 6):

    The new seq_hex_dump() is a complete analogue of print_hex_dump().

    We already have a few users of this functionality. The new helper lets
    them reduce their code.

    Signed-off-by: Andy Shevchenko
    Cc: Alexander Viro
    Cc: Joe Perches
    Cc: Tadeusz Struk
    Cc: Helge Deller
    Cc: Ingo Tuchscherer
    Cc: Catalin Marinas
    Cc: Vladimir Kondratiev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     

05 Jul, 2015

1 commit

  • Pull more vfs updates from Al Viro:
    "Assorted VFS fixes and related cleanups (IMO the most interesting in
    that part are f_path-related things and Eric's descriptor-related
    stuff). UFS regression fixes (it got broken last cycle). 9P fixes.
    fs-cache series, DAX patches, Jan's file_remove_suid() work"

    [ I'd say this is much more than "fixes and related cleanups". The
    file_table locking rule change by Eric Dumazet is a rather big and
    fundamental update even if the patch isn't huge. - Linus ]

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
    9p: cope with bogus responses from server in p9_client_{read,write}
    p9_client_write(): avoid double p9_free_req()
    9p: forgetting to cancel request on interrupted zero-copy RPC
    dax: bdev_direct_access() may sleep
    block: Add support for DAX reads/writes to block devices
    dax: Use copy_from_iter_nocache
    dax: Add block size note to documentation
    fs/file.c: __fget() and dup2() atomicity rules
    fs/file.c: don't acquire files->file_lock in fd_install()
    fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
    vfs: avoid creation of inode number 0 in get_next_ino
    namei: make set_root_rcu() return void
    make simple_positive() public
    ufs: use dir_pages instead of ufs_dir_pages()
    pagemap.h: move dir_pages() over there
    remove the pointless include of lglock.h
    fs: cleanup slight list_entry abuse
    xfs: Correctly lock inode when removing suid and file capabilities
    fs: Call security_ops->inode_killpriv on truncate
    fs: Provide function telling whether file_remove_privs() will do anything
    ...

    Linus Torvalds
     

02 Jul, 2015

1 commit

  • Merge third patchbomb from Andrew Morton:

    - the rest of MM

    - scripts/gdb updates

    - ipc/ updates

    - lib/ updates

    - MAINTAINERS updates

    - various other misc things

    * emailed patches from Andrew Morton: (67 commits)
    genalloc: rename of_get_named_gen_pool() to of_gen_pool_get()
    genalloc: rename dev_get_gen_pool() to gen_pool_get()
    x86: opt into HAVE_COPY_THREAD_TLS, for both 32-bit and 64-bit
    MAINTAINERS: add zpool
    MAINTAINERS: BCACHE: Kent Overstreet has changed email address
    MAINTAINERS: move Jens Osterkamp to CREDITS
    MAINTAINERS: remove unused nbd.h pattern
    MAINTAINERS: update brcm gpio filename pattern
    MAINTAINERS: update brcm dts pattern
    MAINTAINERS: update sound soc intel patterns
    MAINTAINERS: remove website for paride
    MAINTAINERS: update Emulex ocrdma email addresses
    bcache: use kvfree() in various places
    libcxgbi: use kvfree() in cxgbi_free_big_mem()
    target: use kvfree() in session alloc and free
    IB/ehca: use kvfree() in ipz_queue_{cd}tor()
    drm/nouveau/gem: use kvfree() in u_free()
    drm: use kvfree() in drm_free_large()
    cxgb4: use kvfree() in t4_free_mem()
    cxgb3: use kvfree() in cxgb_free_mem()
    ...

    Linus Torvalds
     

01 Jul, 2015

2 commits

  • seq_open() stores its struct seq_file in file->private_data, so users
    of seq_file must not modify it.

    Link: http://lkml.kernel.org/r/cover.1433193673.git.ydroneaud@opteya.com
    Signed-off-by: Yann Droneaud
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yann Droneaud
     
  • Since the patch described below (in v2.6.15-rc1), seq_open() could use a
    struct seq_file already allocated by the caller if the pointer to the
    structure was stored in file->private_data before calling the function.

    Commit 1abe77b0fc4b485927f1f798ae81a752677e1d05
    Author: Al Viro
    Date: Mon Nov 7 17:15:34 2005 -0500

    [PATCH] allow callers of seq_open do allocation themselves

    Allow caller of seq_open() to kmalloc() seq_file + whatever else they
    want and set ->private_data to it. seq_open() will then abstain from
    doing allocation itself.

    There is no more use for this feature, since it is easily replaced by
    calls to seq_open_private() (see commit 39699037a5c9 ("[FS] seq_file:
    Introduce the seq_open_private()")) and seq_release_private() (see
    v2.6.0-test3), so support for this uncommon feature can be removed from
    seq_open().

    Link: http://lkml.kernel.org/r/cover.1433193673.git.ydroneaud@opteya.com
    Signed-off-by: Yann Droneaud
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yann Droneaud
     

24 Jun, 2015

1 commit


03 Jun, 2015

1 commit

  • Now that we're guaranteed to have a meaningful root dentry, we can just
    export seq_dentry() and use it in btrfs_show_options(). The subvolume ID
    is easy to get and can also be useful, so put that in there, too.

    Reviewed-by: David Sterba
    Signed-off-by: Omar Sandoval
    Signed-off-by: Chris Mason

    Omar Sandoval
     

14 Feb, 2015

1 commit

  • Now that all bitmap formatting usages have been converted to
    '%*pb[l]', the separate formatting functions are unnecessary. The
    following functions are removed.

    * bitmap_scn[list]printf()
    * cpumask_scnprintf(), cpulist_scnprintf()
    * [__]nodemask_scnprintf(), [__]nodelist_scnprintf()
    * seq_bitmap[_list](), seq_cpumask[_list](), seq_nodemask[_list]()
    * seq_buf_bitmask()

    Signed-off-by: Tejun Heo
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

14 Dec, 2014

1 commit

  • Since commit 058504edd026 ("fs/seq_file: fallback to vmalloc allocation"),
    seq_buf_alloc() falls back to vmalloc() when the kmalloc() for contiguous
    memory fails. This was done to address order-4 slab allocations for
    reading /proc/stat on large machines and noticed because
    PAGE_ALLOC_COSTLY_ORDER < 4, so there is no infinite loop in the page
    allocator when allocating new slab for such high-order allocations.

    Contiguous memory isn't necessary for caller of seq_buf_alloc(), however.
    Other GFP_KERNEL high-order allocations that are
    Cc: Heiko Carstens
    Cc: Christoph Hellwig
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

30 Oct, 2014

1 commit

  • The return values of seq_printf/puts/putc are frequently misused.

    Start down a path to remove all the return value uses of these
    functions.

    Move the seq_overflow() to a global inlined function called
    seq_has_overflowed() that can be used by the users of seq_file() calls.

    Update the documentation to not show return types for seq_printf
    et al. Add a description of seq_has_overflowed().

    Link: http://lkml.kernel.org/p/848ac7e3d1c31cddf638a8526fa3c59fa6fdeb8a.1412031505.git.joe@perches.com

    Cc: Al Viro
    Signed-off-by: Joe Perches
    [ Reworked the original patch from Joe ]
    Signed-off-by: Steven Rostedt

    Joe Perches
     

04 Jul, 2014

1 commit

  • There are a couple of seq_files which use the single_open() interface.
    This interface requires that the whole output must fit into a single
    buffer.

    E.g. for /proc/stat allocation failures have been observed because an
    order-4 memory allocation failed due to memory fragmentation. In such
    situations reading /proc/stat is not possible anymore.

    Therefore change the seq_file code to fall back to vmalloc allocations,
    which will usually result in a couple of order-0 allocations and hence
    also work if memory is fragmented.

    For reference a call trace where reading from /proc/stat failed:

    sadc: page allocation failure: order:4, mode:0x1040d0
    CPU: 1 PID: 192063 Comm: sadc Not tainted 3.10.0-123.el7.s390x #1
    [...]
    Call Trace:
    show_stack+0x6c/0xe8
    warn_alloc_failed+0xd6/0x138
    __alloc_pages_nodemask+0x9da/0xb68
    __get_free_pages+0x2e/0x58
    kmalloc_order_trace+0x44/0xc0
    stat_open+0x5a/0xd8
    proc_reg_open+0x8a/0x140
    do_dentry_open+0x1bc/0x2c8
    finish_open+0x46/0x60
    do_last+0x382/0x10d0
    path_openat+0xc8/0x4f8
    do_filp_open+0x46/0xa8
    do_sys_open+0x114/0x1f0
    sysc_tracego+0x14/0x1a

    Signed-off-by: Heiko Carstens
    Tested-by: David Rientjes
    Cc: Ian Kent
    Cc: Hendrik Brueckner
    Cc: Thorsten Diehl
    Cc: Andrea Righi
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Stefan Bader
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     

19 Nov, 2013

1 commit


15 Nov, 2013

1 commit

  • Several users want to know the number of bytes written by seq_*() for
    alignment purposes. Currently they use the %n format to find out,
    because seq_*() returns 0 on success.

    This patch introduces seq_setwidth() and seq_pad() so that they can
    align without using the %n format.

    Signed-off-by: Tetsuo Handa
    Signed-off-by: Kees Cook
    Cc: Joe Perches
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
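
The alignment idea can be sketched in userspace, assuming a simplified model: remember a target column when a field starts, then pad with spaces up to it and optionally append a trailing character. struct seq_model and the model_*() functions are hypothetical names for illustration and are not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a seq_file output buffer. */
struct seq_model {
	char buf[64];
	size_t count;		/* bytes written so far */
	size_t pad_until;	/* position the current field is padded to */
};

/* Start a field that should occupy at least 'width' bytes. */
static void model_setwidth(struct seq_model *m, size_t width)
{
	m->pad_until = m->count + width;
}

/* Append a string, silently dropping whatever does not fit. */
static void model_puts(struct seq_model *m, const char *s)
{
	size_t len = strlen(s);

	assert(m->count <= sizeof(m->buf));
	if (len > sizeof(m->buf) - m->count)
		len = sizeof(m->buf) - m->count;
	memcpy(m->buf + m->count, s, len);
	m->count += len;
}

/* Pad with spaces up to pad_until, then append c unless it is '\0'. */
static void model_pad(struct seq_model *m, char c)
{
	while (m->count < m->pad_until && m->count < sizeof(m->buf))
		m->buf[m->count++] = ' ';
	if (c && m->count < sizeof(m->buf))
		m->buf[m->count++] = c;
}
```

A caller would invoke model_setwidth() before writing a field and model_pad() after it, so no byte count has to be recovered through %n.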
     

25 Oct, 2013

1 commit

  • This issue was first pointed out by Jiaxing Wang several months ago, but
    it received no further comments:
    https://lkml.org/lkml/2013/6/29/41

    pread() does not change f_pos, so after a pread(), file->f_pos and
    m->read_pos differ. seq_lseek() does not update file->f_pos when
    offset equals m->read_pos, so after a pread() followed by a seq_lseek()
    to m->read_pos, a subsequent read() may read from the wrong position.
    The following program reproduces the problem:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            char str1[32] = { 0 };
            char str2[32] = { 0 };
            int poffset = 10;
            int count = 20;

            /* open any seq file */
            int fd = open("/proc/modules", O_RDONLY);

            pread(fd, str1, count, poffset);
            printf("pread:%s\n", str1);

            /* seek to where m->read_pos is */
            lseek(fd, poffset+count, SEEK_SET);

            /* supposed to read from poffset+count, but this reads from position 0 */
            read(fd, str2, count);
            printf("read:%s\n", str2);

            return 0;
    }

    output:
    pread:
    ck_netbios_ns 12665
    read:
    nf_conntrack_netbios

    /proc/modules:
    nf_conntrack_netbios_ns 12665 0 - Live 0xffffffffa038b000
    nf_conntrack_broadcast 12589 1 nf_conntrack_netbios_ns, Live 0xffffffffa0386000

    So we always update file->f_pos to offset in seq_lseek() to fix this issue.

    Signed-off-by: Jiaxing Wang
    Signed-off-by: Gu Zheng
    Signed-off-by: Al Viro

    Gu Zheng
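
The position bookkeeping described in this commit can be modelled in userspace. This is only a sketch under stated assumptions: struct pos_model, model_pread() and model_lseek() are hypothetical names, reads are assumed to return the full count, and the always_update flag stands for the behaviour after the fix.

```c
#include <assert.h>

/* Hypothetical stand-in for the two positions discussed in the commit. */
struct pos_model {
	long f_pos;	/* models file->f_pos, used by plain read() */
	long read_pos;	/* models m->read_pos, tracked by seq_file */
};

/* pread() advances read_pos but leaves f_pos alone. */
static void model_pread(struct pos_model *p, long offset, long count)
{
	assert(offset >= 0 && count >= 0);
	p->read_pos = offset + count;
}

/*
 * SEEK_SET to 'offset'. When offset already equals read_pos, the old
 * behaviour skips the f_pos update; always_update models the fix.
 */
static void model_lseek(struct pos_model *p, long offset, int always_update)
{
	assert(offset >= 0);
	if (offset != p->read_pos) {
		p->read_pos = offset;
		p->f_pos = offset;
	} else if (always_update) {
		p->f_pos = offset;
	}
}
```

In the model, a pread() of 20 bytes at offset 10 followed by a seek to 30 leaves f_pos at 0 under the old behaviour, so the next read() starts from the beginning of the file.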
     

08 Jul, 2013

1 commit

  • When we convert the file_lock_list to a set of percpu lists, we'll need
    a way to iterate over them in order to output /proc/locks info. Add
    some seq_list_*_percpu helpers to handle that.

    Signed-off-by: Jeff Layton
    Acked-by: J. Bruce Fields
    Signed-off-by: Al Viro

    Jeff Layton
     

10 Apr, 2013

1 commit

  • Same as single_open(), but preallocates the buffer of given size.
    Doesn't make any sense for sizes up to PAGE_SIZE and doesn't make
    sense if output of show() exceeds PAGE_SIZE only rarely - seq_read()
    will take care of growing the buffer and redoing show(). If you
    _know_ that it will be large, it might make more sense to look into
    saner iterator, rather than go with single-shot one. If that's
    impossible, single_open_size() might be for you.

    Again, don't use that without a good reason; occasionally that's really
    the best way to go, but very often there are better solutions.

    Signed-off-by: Al Viro

    Al Viro
     

04 Mar, 2013

1 commit

  • Pull more VFS bits from Al Viro:
    "Unfortunately, it looks like xattr series will have to wait until the
    next cycle ;-/

    This pile contains 9p cleanups and fixes (races in v9fs_fid_add()
    etc), fixup for nommu breakage in shmem.c, several cleanups and a bit
    more file_inode() work"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    constify path_get/path_put and fs_struct.c stuff
    fix nommu breakage in shmem.c
    cache the value of file_inode() in struct file
    9p: if v9fs_fid_lookup() gets to asking server, it'd better have hashed dentry
    9p: make sure ->lookup() adds fid to the right dentry
    9p: untangle ->lookup() a bit
    9p: double iput() in ->lookup() if d_materialise_unique() fails
    9p: v9fs_fid_add() can't fail now
    v9fs: get rid of v9fs_dentry
    9p: turn fid->dlist into hlist
    9p: don't bother with private lock in ->d_fsdata; dentry->d_lock will do just fine
    more file_inode() open-coded instances
    selinux: opened file can't have NULL or negative ->f_path.dentry

    (In the meantime, the hlist traversal macros have changed, so this
    required a semantic conflict fixup for the newly hlistified fid->dlist)

    Linus Torvalds
     

28 Feb, 2013

3 commits


11 Jan, 2013

1 commit

  • Fix kernel-doc warnings in fs/seq_file.c:

    Warning(fs/seq_file.c:304): No description found for parameter 'whence'
    Warning(fs/seq_file.c:304): Excess function parameter 'origin' description in 'seq_lseek'

    Signed-off-by: Randy Dunlap
    Cc: Alexander Viro
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

18 Dec, 2012

1 commit


15 Aug, 2012

1 commit

  • struct file already has a user namespace associated with it
    in file->f_cred->user_ns, unfortunately because struct
    seq_file has no struct file backpointer associated with
    it, it is difficult to get at the user namespace in seq_file
    context. Therefore add a helper function seq_user_ns to return
    the associated user namespace and a user_ns field to struct
    seq_file to be used in implementing seq_user_ns.

    Cc: Al Viro
    Cc: Eric Dumazet
    Cc: KAMEZAWA Hiroyuki
    Cc: Alexey Dobriyan
    Acked-by: David S. Miller
    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

11 Jun, 2012

1 commit

  • The existing seq_printf function is rewritten in terms of the new
    seq_vprintf which is also exported to modules. This allows GFS2
    (and potentially other seq_file users) to have a vprintf based
    interface and to avoid an extra copy into a temporary buffer in
    some cases.

    Signed-off-by: Steven Whitehouse
    Reported-by: Eric Dumazet
    Acked-by: Al Viro

    Steven Whitehouse
     

25 Mar, 2012

1 commit

  • Pull cleanup of fs/ and lib/ users of module.h from Paul Gortmaker:
    "Fix up files in fs/ and lib/ dirs to only use module.h if they really
    need it.

    These are trivial in scope vs the work done previously. We now have
    things where any few remaining cleanups can be farmed out to arch or
    subsystem maintainers, and I have done so when possible. What is
    remaining here represents the bits that don't clearly lie within a
    single arch/subsystem boundary, like the fs dir and the lib dir.

    Some duplicate includes arising from overlapping fixes from
    independent subsystem maintainer submissions are also quashed."

    Fix up trivial conflicts due to clashes with other include file cleanups
    (including some due to the previous bug.h cleanup pull).

    * tag 'module-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux:
    lib: reduce the use of module.h wherever possible
    fs: reduce the use of module.h wherever possible
    includecheck: delete any duplicate instances of module.h

    Linus Torvalds
     

24 Mar, 2012

3 commits

  • It is undocumented but a seq_file's overflow state is indicated by
    m->count == m->size. Add seq_set_overflow() and seq_overflow() to
    set/check overflow status explicitly.

    Based on an idea from Eric Dumazet.

    [akpm@linux-foundation.org: tweak code comment]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Eric Dumazet
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
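
The convention can be sketched in userspace, assuming a simplified model in which count == size marks overflow, so a successful write must always leave count below size. struct seq_model and the model_*() helpers are hypothetical names, not the kernel functions themselves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the buffer bookkeeping in struct seq_file. */
struct seq_model {
	char *buf;
	size_t size;	/* capacity of buf */
	size_t count;	/* bytes used; count == size means "overflowed" */
};

/* Check the overflow marker. */
static bool model_overflow(const struct seq_model *m)
{
	return m->count == m->size;
}

/* Set the overflow marker explicitly. */
static void model_set_overflow(struct seq_model *m)
{
	m->count = m->size;
}

/* Append s if it fits with at least one byte to spare, else mark overflow. */
static void model_puts(struct seq_model *m, const char *s)
{
	size_t len = strlen(s);

	assert(m->count <= m->size);
	if (m->count + len < m->size) {
		memcpy(m->buf + m->count, s, len);
		m->count += len;
		return;
	}
	model_set_overflow(m);
}
```

Because a completely full buffer is indistinguishable from an overflowed one in this scheme, the model refuses writes that would fill the buffer exactly.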
     
  • Process accounting applications such as top and ps visit some files under
    /proc/. With seq_put_decimal_ull(), we can optimize the /proc/<pid>/stat
    and /proc/<pid>/statm files.

    This patch adds
    - seq_put_decimal_ll() for signed values.
    - allow delimiter == 0.
    - convert seq_printf() to seq_put_decimal_ull/ll in /proc/stat, statm.

    Test result on a system with 2000+ procs.

    Before patch:
    [kamezawa@bluextal test]$ top -b -n 1 | wc -l
    2223
    [kamezawa@bluextal test]$ time top -b -n 1 > /dev/null

    real 0m0.675s
    user 0m0.044s
    sys 0m0.121s

    [kamezawa@bluextal test]$ time ps -elf > /dev/null

    real 0m0.236s
    user 0m0.056s
    sys 0m0.176s

    After patch:
    [kamezawa@bluextal ~]$ time top -b -n 1 > /dev/null

    real 0m0.657s
    user 0m0.052s
    sys 0m0.100s

    [kamezawa@bluextal ~]$ time ps -elf > /dev/null

    real 0m0.198s
    user 0m0.050s
    sys 0m0.145s

    Considering that top and ps tend to scan /proc periodically, this will
    reduce CPU consumption by top/ps to some extent.

    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • == stat_check.py
    num = 0
    with open("/proc/stat") as f:
    while num < 1000 :
    data = f.read()
    f.seek(0, 0)
    num = num + 1
    ==

    perf shows

    20.39% stat_check.py [kernel.kallsyms] [k] format_decode
    13.41% stat_check.py [kernel.kallsyms] [k] number
    12.61% stat_check.py [kernel.kallsyms] [k] vsnprintf
    10.85% stat_check.py [kernel.kallsyms] [k] memcpy
    4.85% stat_check.py [kernel.kallsyms] [k] radix_tree_lookup
    4.43% stat_check.py [kernel.kallsyms] [k] seq_printf

    This patch removes most of the calls to vsnprintf() by adding num_to_str()
    and seq_put_decimal_ull(), which print decimal numbers without the rich
    functionality provided by printf().

    On my 8cpu box.
    == Before patch ==
    [root@bluextal test]# time ./stat_check.py

    real 0m0.150s
    user 0m0.026s
    sys 0m0.121s

    == After patch ==
    [root@bluextal test]# time ./stat_check.py

    real 0m0.055s
    user 0m0.022s
    sys 0m0.030s

    [akpm@linux-foundation.org: remove incorrect comment, use less stack in num_to_str(), move comment from .h to .c, simplify seq_put_decimal_ull()]
    [andrea@betterlinux.com: avoid breaking the ABI in /proc/stat]
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrea Righi
    Cc: Eric Dumazet
    Cc: Glauber Costa
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Paul Turner
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
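
The idea behind a dedicated decimal converter can be sketched in userspace: produce the digits directly instead of going through a format-string parser. model_num_to_str() is a hypothetical function, and its contract here (no terminating NUL, return 0 when the buffer is too small) is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Write the decimal digits of num into buf without a terminating NUL.
 * Returns the number of digits written, or 0 if buf is too small.
 */
static size_t model_num_to_str(char *buf, size_t size, unsigned long long num)
{
	char tmp[20];	/* 2^64 - 1 has 20 decimal digits */
	size_t len = 0;
	size_t i;

	assert(buf != NULL);

	/* Generate digits least significant first. */
	do {
		tmp[len++] = (char)('0' + num % 10);
		num /= 10;
	} while (num);

	if (len > size)
		return 0;

	/* Reverse into the caller's buffer. */
	for (i = 0; i < len; i++)
		buf[i] = tmp[len - 1 - i];
	return len;
}
```

The loop does one division per digit and never inspects a format string, which is the work the perf profile above attributes to format_decode() and vsnprintf().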
     

22 Mar, 2012

1 commit

  • The following program illustrates the problem:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            char buf[8192];
            ssize_t n;
            int fd = open("/proc/self/maps", O_RDONLY);

            n = pread(fd, buf, sizeof(buf), 0);
            printf("%zd\n", n);

            /* lseek(fd, 0, SEEK_CUR); */ /* Uncomment to work around */

            n = pread(fd, buf, sizeof(buf), 0);
            printf("%zd\n", n);
            return 0;
    }

    The second printf() prints zero, but uncommenting the lseek() corrects its
    behaviour.

    To fix, make seq_read() mirror seq_lseek() when processing changes in
    *ppos. Restore m->version first, then if required traverse and update
    read_pos on success.

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=11856

    Signed-off-by: Earl Chew
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Earl Chew
     

29 Feb, 2012

1 commit


04 Jan, 2012

1 commit


07 Dec, 2011

1 commit

  • __d_path() API is asking for trouble and in case of apparmor d_namespace_path()
    getting just that. The root cause is that when __d_path() misses the root
    it had been told to look for, it stores the location of the most remote ancestor
    in *root. Without grabbing references. Sure, at the moment of call it had
    been pinned down by what we have in *path. And if we raced with umount -l, we
    could have very well stopped at vfsmount/dentry that got freed as soon as
    prepend_path() dropped vfsmount_lock.

    It is safe to compare these pointers with pre-existing (and known to be still
    alive) vfsmount and dentry, as long as all we are asking is "is it the same
    address?". Dereferencing is not safe and apparmor ended up stepping into
    that. d_namespace_path() really wants to examine the place where we stopped,
    even if it's not connected to our namespace. As the result, it looked
    at ->d_sb->s_magic of a dentry that might've been already freed by that point.
    All other callers had been careful enough to avoid that, but it's really
    a bad interface - it invites that kind of trouble.

    The fix is fairly straightforward, even though it's bigger than I'd like:
    * prepend_path() root argument becomes const.
    * __d_path() is never called with NULL/NULL root. It was a kludge
    to start with. Instead, we have an explicit function - d_absolute_path().
    Same as __d_path(), except that it doesn't get root passed and stops where
    it stops. apparmor and tomoyo are using it.
    * __d_path() returns NULL on path outside of root. The main
    caller is show_mountinfo() and that's precisely what we pass root for - to
    skip those outside chroot jail. Those who don't want that can (and do)
    use d_path().
    * __d_path() root argument becomes const. Everyone agrees, I hope.
    * apparmor does *NOT* try to use __d_path() or any of its variants
    when it sees that path->mnt is an internal vfsmount. In that case it's
    definitely not mounted anywhere and dentry_path() is exactly what we want
    there. Handling of sysctl()-triggered weirdness is moved to that place.
    * if apparmor is asked to do pathname relative to chroot jail
    and __d_path() tells it it's not in that jail, the sucker just calls
    d_absolute_path() instead. That's the other remaining caller of __d_path(),
    BTW.
    * seq_path_root() does _NOT_ return -ENAMETOOLONG (it's stupid anyway -
    the normal seq_file logic will take care of growing the buffer and redoing
    the call of ->show() just fine). However, if it gets path not reachable
    from root, it returns SEQ_SKIP. The only caller adjusted (i.e. stopped
    ignoring the return value as it used to do).

    Reviewed-by: John Johansen
    ACKed-by: John Johansen
    Signed-off-by: Al Viro
    Cc: stable@vger.kernel.org

    Al Viro
     

26 Oct, 2010

1 commit