28 Mar, 2014

1 commit

  • Currently we allocated anon_inode_inode in anon_inodefs_mount. This is
    somewhat fragile as if that function ever gets called again, it will
    overwrite anon_inode_inode pointer. So move the initialization of
    anon_inode_inode to anon_inode_init().

    Signed-off-by: Jan Kara
    [ Further simplified on suggestion from Dave Jones ]
    Signed-off-by: Linus Torvalds

    Jan Kara
     

26 Mar, 2014

2 commits

  • The previous commit removed the register_filesystem() call and the
    associated error handling, but left the label for the error path that no
    longer exists. Remove that too.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • anon_inodefs filesystem is a kernel internal filesystem userspace
    shouldn't mess with. Remove registration of it so userspace cannot
    even try to mount it (which would fail anyway because the filesystem is
    MS_NOUSER).

    This fixes an oops triggered by trinity when it tried mounting
    anon_inodefs which overwrote anon_inode_inode pointer while other CPU
    has been in anon_inode_getfile() between ihold() and d_instantiate().
    Thus effectively creating dentry pointing to an inode without holding a
    reference to it.

    Reported-by: Sasha Levin
    Signed-off-by: Jan Kara
    Signed-off-by: Linus Torvalds

    Jan Kara
     

09 Nov, 2013

2 commits


16 Jul, 2013

1 commit

  • Introduce a new lib function anon_inode_getfile_private(), it creates a new file
    instance by hooking it up to an anonymous inode, and a dentry that describe the
    "class" of the file, similar to anon_inode_getfile(), but each file holds a
    single inode. Furthermore, anyone who wants to create a private anon file will
    benefit from this change.

    Signed-off-by: Gu Zheng
    Signed-off-by: Benjamin LaHaise

    Gu Zheng
     

26 Feb, 2013

1 commit


23 Feb, 2013

1 commit

  • Allocating a file structure in function get_empty_filp() might fail because
    of several reasons:
    - not enough memory for file structures
    - operation is not allowed
    - user is over its limit

    Currently the function returns NULL in all cases and we loose the exact
    reason of the error. All callers of get_empty_filp() assume that the function
    can fail with ENFILE only.

    Return error through pointer. Change all callers to preserve this error code.

    [AV: cleaned up a bit, carved the get_empty_filp() part out into a separate commit
    (things remaining here deal with alloc_file()), removed pipe(2) behaviour change]

    Signed-off-by: Anatol Pomozov
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Al Viro

    Anatol Pomozov
     

21 Mar, 2012

1 commit


27 Jul, 2011

1 commit


24 Jul, 2011

1 commit

  • For a number of file systems that don't have a mount point (e.g. sockfs
    and pipefs), they are not marked as long term. Therefore in
    mntput_no_expire, all locks in vfs_mount lock are taken instead of just
    local cpu's lock to aggregate reference counts when we release
    reference to file objects. In fact, only local lock need to have been
    taken to update ref counts as these file systems are in no danger of
    going away until we are ready to unregister them.

    The attached patch marks file systems using kern_mount without
    mount point as long term. The contentions of vfs_mount lock
    is now eliminated. Before un-registering such file system,
    kern_unmount should be called to remove the long term flag and
    make the mount point ready to be freed.

    Signed-off-by: Tim Chen
    Signed-off-by: Al Viro

    Tim Chen
     

17 Jan, 2011

1 commit

  • Instead of splitting refcount between (per-cpu) mnt_count
    and (SMP-only) mnt_longrefs, make all references contribute
    to mnt_count again and keep track of how many are longterm
    ones.

    Accounting rules for longterm count:
    * 1 for each fs_struct.root.mnt
    * 1 for each fs_struct.pwd.mnt
    * 1 for having non-NULL ->mnt_ns
    * decrement to 0 happens only under vfsmount lock exclusive

    That allows nice common case for mntput() - since we can't drop the
    final reference until after mnt_longterm has reached 0 due to the rules
    above, mntput() can grab vfsmount lock shared and check mnt_longterm.
    If it turns out to be non-zero (which is the common case), we know
    that this is not the final mntput() and can just blindly decrement
    percpu mnt_count. Otherwise we grab vfsmount lock exclusive and
    do usual decrement-and-check of percpu mnt_count.

    For fs_struct.c we have mnt_make_longterm() and mnt_make_shortterm();
    namespace.c uses the latter in places where we don't already hold
    vfsmount lock exclusive and opencodes a few remaining spots where
    we need to manipulate mnt_longterm.

    Note that we mostly revert the code outside of fs/namespace.c back
    to what we used to have; in particular, normal code doesn't need
    to care about two kinds of references, etc. And we get to keep
    the optimization Nick's variant had bought us...

    Signed-off-by: Al Viro

    Al Viro
     

14 Jan, 2011

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (41 commits)
    fs: add documentation on fallocate hole punching
    Gfs2: fail if we try to use hole punch
    Btrfs: fail if we try to use hole punch
    Ext4: fail if we try to use hole punch
    Ocfs2: handle hole punching via fallocate properly
    XFS: handle hole punching via fallocate properly
    fs: add hole punching to fallocate
    vfs: pass struct file to do_truncate on O_TRUNC opens (try #2)
    fix signedness mess in rw_verify_area() on 64bit architectures
    fs: fix kernel-doc for dcache::prepend_path
    fs: fix kernel-doc for dcache::d_validate
    sanitize ecryptfs ->mount()
    switch afs
    move internal-only parts of ncpfs headers to fs/ncpfs
    switch ncpfs
    switch 9p
    pass default dentry_operations to mount_pseudo()
    switch hostfs
    switch affs
    switch configfs
    ...

    Linus Torvalds
     
  • * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits)
    Documentation/trace/events.txt: Remove obsolete sched_signal_send.
    writeback: fix global_dirty_limits comment runtime -> real-time
    ppc: fix comment typo singal -> signal
    drivers: fix comment typo diable -> disable.
    m68k: fix comment typo diable -> disable.
    wireless: comment typo fix diable -> disable.
    media: comment typo fix diable -> disable.
    remove doc for obsolete dynamic-printk kernel-parameter
    remove extraneous 'is' from Documentation/iostats.txt
    Fix spelling milisec -> ms in snd_ps3 module parameter description
    Fix spelling mistakes in comments
    Revert conflicting V4L changes
    i7core_edac: fix typos in comments
    mm/rmap.c: fix comment
    sound, ca0106: Fix assignment to 'channel'.
    hrtimer: fix a typo in comment
    init/Kconfig: fix typo
    anon_inodes: fix wrong function name in comment
    fix comment typos concerning "consistent"
    poll: fix a typo in comment
    ...

    Fix up trivial conflicts in:
    - drivers/net/wireless/iwlwifi/iwl-core.c (moved to iwl-legacy.c)
    - fs/ext4/ext4.h

    Also fix missed 'diabled' typo in drivers/net/bnx2x/bnx2x.h while at it.

    Linus Torvalds
     

13 Jan, 2011

1 commit


07 Jan, 2011

3 commits

  • The problem that this patch aims to fix is vfsmount refcounting scalability.
    We need to take a reference on the vfsmount for every successful path lookup,
    which often go to the same mount point.

    The fundamental difficulty is that a "simple" reference count can never be made
    scalable, because any time a reference is dropped, we must check whether that
    was the last reference. To do that requires communication with all other CPUs
    that may have taken a reference count.

    We can make refcounts more scalable in a couple of ways, involving keeping
    distributed counters, and checking for the global-zero condition less
    frequently.

    - check the global sum once every interval (this will delay zero detection
    for some interval, so it's probably a showstopper for vfsmounts).

    - keep a local count and only taking the global sum when local reaches 0 (this
    is difficult for vfsmounts, because we can't hold preempt off for the life of
    a reference, so a counter would need to be per-thread or tied strongly to a
    particular CPU which requires more locking).

    - keep a local difference of increments and decrements, which allows us to sum
    the total difference and hence find the refcount when summing all CPUs. Then,
    keep a single integer "long" refcount for slow and long lasting references,
    and only take the global sum of local counters when the long refcount is 0.

    This last scheme is what I implemented here. Attached mounts and process root
    and working directory references are "long" references, and everything else is
    a short reference.

    This allows scalable vfsmount references during path walking over mounted
    subtrees and unattached (lazy umounted) mounts with processes still running
    in them.

    This results in one fewer atomic op in the fastpath: mntget is now just a
    per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock
    and non-atomic decrement in the common case. However code is otherwise bigger
    and heavier, so single threaded performance is basically a wash.

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • Regardless of how much we possibly try to scale dcache, there is likely
    always going to be some fundamental contention when adding or removing children
    under the same parent. Pseudo filesystems do not seem need to have connected
    dentries because by definition they are disconnected.

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • Reduce some branches and memory accesses in dcache lookup by adding dentry
    flags to indicate common d_ops are set, rather than having to check them.
    This saves a pointer memory access (dentry->d_op) in common path lookup
    situations, and saves another pointer load and branch in cases where we
    have d_op but not the particular operation.

    Patched with:

    git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i

    Signed-off-by: Nick Piggin

    Nick Piggin
     

10 Dec, 2010

1 commit


29 Oct, 2010

1 commit


26 Oct, 2010

2 commits

  • Instead of always assigning an increasing inode number in new_inode
    move the call to assign it into those callers that actually need it.
    For now callers that need it is estimated conservatively, that is
    the call is added to all filesystems that do not assign an i_ino
    by themselves. For a few more filesystems we can avoid assigning
    any inode number given that they aren't user visible, and for others
    it could be done lazily when an inode number is actually needed,
    but that's left for later patches.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Clones an existing reference to inode; caller must already hold one.

    Signed-off-by: Al Viro

    Al Viro
     

28 May, 2010

1 commit


22 May, 2010

1 commit

  • anon_inode_mkinode() sets inode->i_mode = S_IRUSR | S_IWUSR; This means
    that (inode->i_mode & S_IFMT) == 0. This trips up some SELinux code that
    needs to determine if a given inode is a regular file, a directory, etc.
    The easiest solution is to just make sure that the anon_inode also sets
    S_IFREG.

    Signed-off-by: Eric Paris
    Signed-off-by: Al Viro

    Eric Paris
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities include those
    headers directly instead of assuming availability. As this conversion
    needs to touch large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the followings.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and try to put the new include such that its order conforms
    to its surrounding. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    wildly available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build test were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable easily on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

13 Mar, 2010

1 commit

  • Inotify was switched to use anon_inode instead of its own private filesystem
    which only had one inode in commit c44dcc56d2b5c7 "switch inotify_user to
    anon_inode"

    The problem with this is that now the inotify inode is not a distinct inode
    which can be managed by LSMs. userspace tools which use inotify were allowed
    to use the inotify inode but may not have had permission to do read/write type
    operations on the anon_inode. After looking at the anon_inode and its users
    it looks like the best solution is to just mark the anon_inode as S_PRIVATE
    so the security system will ignore it.

    Signed-off-by: Eric Paris
    Acked-by: James Morris
    Signed-off-by: Linus Torvalds

    Eric Paris
     

23 Dec, 2009

2 commits

  • * pull ACC_MODE to fs.h; we have several copies all over the place
    * nightmarish expression calculating f_mode by f_flags deserves a helper
    too (OPEN_FMODE(flags))

    Signed-off-by: Al Viro

    Al Viro
     
  • It seems a couple places such as arch/ia64/kernel/perfmon.c and
    drivers/infiniband/core/uverbs_main.c could use anon_inode_getfile()
    instead of a private pseudo-fs + alloc_file(), if only there were a way
    to get a read-only file. So provide this by having anon_inode_getfile()
    create a read-only file if we pass O_RDONLY in flags.

    Signed-off-by: Roland Dreier
    Signed-off-by: Al Viro

    Roland Dreier
     

17 Dec, 2009

3 commits

  • Filesystems outside the regular namespace do not have to clear DCACHE_UNHASHED
    in order to have a working /proc/$pid/fd/XXX. Nothing in proc prevents the
    fd link from being used if its dentry is not in the hash.

    Also, it does not get put into the dcache hash if DCACHE_UNHASHED is clear;
    that depends on the filesystem calling d_add or d_rehash.

    So delete the misleading comments and needless code.

    Acked-by: Miklos Szeredi
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    Nick Piggin
     
  • Add a d_dname method for anon_inodes filesystem, the same way pipefs and
    sockfs pseudo filesystems. This allows us to remove the DCACHE_UNHASHED
    hack from anon_inodes.c (see next patch).

    [AV: inumber is useless here, dropped from anon_inodefs_dname()]

    Signed-off-by: Nick Piggin
    Cc: Miklos Szeredi
    Cc: Davide Libenzi
    Cc: "David S. Miller"
    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Nick Piggin
     
  • ... and have the caller grab both mnt and dentry; kill
    leak in infiniband, while we are at it.

    Signed-off-by: Al Viro

    Al Viro
     

05 Oct, 2009

1 commit


23 Sep, 2009

1 commit

  • Split the anonfd interface into a bare file pointer creation one, and a
    file pointer creation plus install one.

    There are cases, like the usage of eventfds inside other kernel
    interfaces, where the file pointer created by anonfd needs to be used
    inside the initialization of other structures.

    As it is right now, as soon as anon_inode_getfd() returns, the kenrle can
    race with userspace closing the newly installed file descriptor.

    This patch, while keeping the old anon_inode_getfd(), introduces a new
    anon_inode_getfile() (whose services are reused in anon_inode_getfd())
    that allows to split the file creation phase and the fd install one.

    Once all the kernel structures are initialized, the code can call the
    proper fd_install().

    Gregory manifested the need for something like this inside KVM.

    Signed-off-by: Davide Libenzi
    Cc: Alexander Viro
    Cc: James Morris
    Cc: Peter Zijlstra
    Cc: Gregory Haskins
    Acked-by: Serge Hallyn
    Acked-by: Roland Dreier
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

18 Jun, 2009

1 commit

  • .set_page_dirty() is one of those a_ops that defaults to the
    buffer implementation when not set. Therefore provide a dummy
    function to make it do nothing.

    (Uncovered by perfcounters fd's which can now be writable-mmap-ed.)

    Signed-off-by: Peter Zijlstra
    Cc: Al Viro
    Cc: Davide Libenzi
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

28 Mar, 2009

1 commit


31 Dec, 2008

1 commit

  • There is an imbalance for anonymous inodes. If the fops->owner field is set,
    the module reference count of owner is decreases on release.
    ("filp_close" --> "__fput" ---> "fops_put")

    On the other hand, anon_inode_getfd does not increase the module reference
    count of owner. This causes two problems:

    - if owner is set, the module refcount goes negative
    - if owner is not set, the module can be unloaded while code is running

    This patch changes anon_inode_getfd to be symmetric regarding fops->owner
    handling.

    I have checked all existing users of anon_inode_getfd. Noone sets fops->owner,
    thats why nobody has seen the module refcount negative. The refcounting was
    tested with a patched and unpatched KVM module.(see patch 2/2) I also did an
    epoll_open/close test.

    Signed-off-by: Christian Borntraeger
    Reviewed-by: Davide Libenzi
    Signed-off-by: Avi Kivity

    Christian Borntraeger
     

14 Nov, 2008

1 commit

  • Wrap access to task credentials so that they can be separated more easily from
    the task_struct during the introduction of COW creds.

    Change most current->(|e|s|fs)[ug]id to current_(|e|s|fs)[ug]id().

    Change some task->e?[ug]id to task_e?[ug]id(). In some places it makes more
    sense to use RCU directly rather than a convenient wrapper; these will be
    addressed by later patches.

    Signed-off-by: David Howells
    Reviewed-by: James Morris
    Acked-by: Serge Hallyn
    Cc: Al Viro
    Signed-off-by: James Morris

    David Howells
     

25 Jul, 2008

2 commits

  • Building on the previous change to anon_inode_getfd, this patch introduces
    support for handling of O_NONBLOCK in addition to the already supported
    O_CLOEXEC. Following patches will take advantage of this support. As can be
    seen, the additional support for supporting this functionality is minimal.

    Signed-off-by: Ulrich Drepper
    Acked-by: Davide Libenzi
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     
  • This patch just extends the anon_inode_getfd interface to take an additional
    parameter with a flag value. The flag value is passed on to
    get_unused_fd_flags in anticipation for a use with the O_CLOEXEC flag.

    No actual semantic changes here, the changed callers all pass 0 for now.

    [akpm@linux-foundation.org: KVM fix]
    Signed-off-by: Ulrich Drepper
    Acked-by: Davide Libenzi
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     

02 May, 2008

1 commit

  • a) none of the callers even looks at inode or file returned by anon_inode_getfd()
    b) any caller that would try to look at those would be racy, since by the time
    it returns we might have raced with close() from another thread and that
    file would be pining for fjords.

    Signed-off-by: Al Viro

    Al Viro