27 Oct, 2010

1 commit

  • Robin Holt tried to boot a 16TB system and found af_unix was overflowing
    a 32-bit value:

    We were seeing a failure which prevented boot. The kernel was incapable
    of creating either a named pipe or unix domain socket. This comes down
    to a common kernel function called unix_create1() which does:

    atomic_inc(&unix_nr_socks);
    if (atomic_read(&unix_nr_socks) > 2 * get_max_files())
            goto out;

    The function get_max_files() is a simple return of files_stat.max_files.
    files_stat.max_files is a signed integer and is computed in
    fs/file_table.c's files_init().

    n = (mempages * (PAGE_SIZE / 1024)) / 10;
    files_stat.max_files = n;

    In our case, mempages (total_ram_pages) is approx 3,758,096,384
    (0xe0000000). That leaves max_files at approximately 1,503,238,553.
    This causes 2 * get_max_files() to integer overflow.

    Fix is to let /proc/sys/fs/file-nr & /proc/sys/fs/file-max use long
    integers, and change af_unix to use an atomic_long_t instead of atomic_t.

    get_max_files() is changed to return an unsigned long. get_nr_files() is
    changed to return a long.

    unix_nr_socks is changed from atomic_t to atomic_long_t, although this is
    not strictly needed to address Robin's problem.

    Before patch (on a 64-bit kernel):
    # echo 2147483648 >/proc/sys/fs/file-max
    # cat /proc/sys/fs/file-max
    -18446744071562067968

    After patch:
    # echo 2147483648 >/proc/sys/fs/file-max
    # cat /proc/sys/fs/file-max
    2147483648
    # cat /proc/sys/fs/file-nr
    704 0 2147483648

    Reported-by: Robin Holt
    Signed-off-by: Eric Dumazet
    Acked-by: David Miller
    Reviewed-by: Robin Holt
    Tested-by: Robin Holt
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
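
The overflow described in the commit above can be checked in user space. This is a minimal sketch, not kernel code: it assumes a 4 KiB PAGE_SIZE (the message does not state the page size), and the helper names files_init_estimate(), doubled_limit_32() and doubled_limit_64() are hypothetical. The 32-bit wraparound is computed in unsigned arithmetic so that the sketch itself does not rely on signed overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages; not stated in the commit message. */
#define SKETCH_PAGE_SIZE 4096ULL

/* Mirrors n = (mempages * (PAGE_SIZE / 1024)) / 10 from files_init(). */
static uint64_t files_init_estimate(uint64_t mempages)
{
    return (mempages * (SKETCH_PAGE_SIZE / 1024)) / 10;
}

/*
 * 2 * max_files as a 32-bit signed int sees it. The multiplication is
 * done on uint32_t (well-defined wraparound); converting back to
 * int32_t yields the two's complement value on common compilers.
 */
static int32_t doubled_limit_32(int32_t max_files)
{
    assert(max_files >= 0);
    uint32_t wrapped = 2u * (uint32_t)max_files;
    return (int32_t)wrapped;
}

/* The same product computed in 64 bits, as after the fix. */
static int64_t doubled_limit_64(int32_t max_files)
{
    assert(max_files >= 0);
    return 2 * (int64_t)max_files;
}

/* The unix_create1() style check with a 32-bit limit. */
static int over_limit_32(int32_t nr_socks, int32_t max_files)
{
    return nr_socks > doubled_limit_32(max_files);
}

/* The same check with a 64-bit limit. */
static int over_limit_64(int64_t nr_socks, int32_t max_files)
{
    return nr_socks > doubled_limit_64(max_files);
}
```

With mempages = 0xe0000000, the estimate still fits in a signed 32-bit integer, but doubling it does not, so the 32-bit comparison rejects even the first socket while the 64-bit comparison accepts it.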
     

18 Aug, 2010

2 commits

  • fs: scale files_lock

    Improve scalability of files_lock by adding per-cpu, per-sb files lists,
    protected with an lglock. The lglock provides fast access to the per-cpu lists
    to add and remove files. It also provides a snapshot of all the per-cpu lists
    (although this is very slow).

    One difficulty with this approach is that a file can be removed from the list
    by another CPU. We must track which per-cpu list the file is on with a new
    variable in the file struct (packed into a hole on 64-bit archs). Scalability
    could suffer if files are frequently removed from different cpu's list.

    However, loads with frequent removal of files imply a short interval between
    adding and removing the files, and the scheduler attempts to avoid moving
    processes too far away. Also, even in the case of cross-CPU removal, the
    hardware has much more opportunity to parallelise cacheline transfers with N
    cachelines than with 1.

    A worst-case test of 1 CPU allocating files subsequently being freed by N CPUs
    degenerates to contending on a single lock, which is no worse than before. When
    more than one CPU is allocating files, even if they are always freed by
    different CPUs, there will be more parallelism than the single-lock case.

    Testing results:

    On a 2 socket, 8 core opteron, I measure the number of times the lock is taken
    to remove the file, the number of times it is removed by the same CPU that
    added it, and the number of times it is removed by the same node that added it.

    Booting:    locks=  25049 cpu-hits=  23174 (92.5%) node-hits=  23945 (95.6%)
    kbuild -j16 locks=2281913 cpu-hits=2208126 (96.8%) node-hits=2252674 (98.7%)
    dbench 64   locks=4306582 cpu-hits=4287247 (99.6%) node-hits=4299527 (99.8%)

    So a file is removed from the same CPU it was added by over 90% of the time.
    It remains within the same node 95% of the time.

    Tim Chen ran some numbers for a 64 thread Nehalem system performing a compile.

                    throughput
    2.6.34-rc2      24.5
    +patch          24.9

                    us      sys     idle    IO wait (in %)
    2.6.34-rc2      51.25   28.25   17.25   3.25
    +patch          53.75   18.5    19      8.75

    So significantly less CPU time spent in kernel code, higher idle time and
    slightly higher throughput.

    Single threaded performance difference was within the noise of microbenchmarks.
    That is not to say a penalty does not exist: the code is larger and requires
    more memory accesses, so it will be slightly slower.

    Cc: linux-kernel@vger.kernel.org
    Cc: Tim Chen
    Cc: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    Nick Piggin
     
  • fs: cleanup files_lock locking

    Lock tty_files with a new spinlock, tty_files_lock; provide helpers to
    manipulate the per-sb files list; unexport the files_lock spinlock.

    Cc: linux-kernel@vger.kernel.org
    Cc: Christoph Hellwig
    Cc: Alan Cox
    Acked-by: Andi Kleen
    Acked-by: Greg Kroah-Hartman
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    Nick Piggin
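
The per-cpu list scheme from "fs: scale files_lock" can be illustrated in user space. This is a rough sketch under stated assumptions, not the kernel's lglock: a fixed array of lists stands in for the per-cpu lists, a C11 atomic_flag spinlock stands in for each per-cpu lock, and all names (file_node, file_add, file_del, files_count_all, NR_LISTS) are hypothetical. It shows the two points the message makes: each file records which list it is on so that any CPU can remove it, and a snapshot of all lists has to take every lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NR_LISTS 4              /* stands in for the number of CPUs */

struct file_node {
    struct file_node *prev, *next;
    int list_idx;               /* which list the node is on, -1 if none */
};

struct file_list {
    atomic_flag lock;           /* simple spinlock, one per list */
    struct file_node head;      /* circular list sentinel */
};

static struct file_list lists[NR_LISTS];

static void lists_init(void)
{
    for (int i = 0; i < NR_LISTS; i++) {
        atomic_flag_clear(&lists[i].lock);
        lists[i].head.prev = lists[i].head.next = &lists[i].head;
        lists[i].head.list_idx = i;
    }
}

static void list_lock(struct file_list *l)
{
    while (atomic_flag_test_and_set_explicit(&l->lock, memory_order_acquire))
        ;                       /* spin until the flag was clear */
}

static void list_unlock(struct file_list *l)
{
    atomic_flag_clear_explicit(&l->lock, memory_order_release);
}

/* Fast path: add to the caller's own list and remember which one. */
static void file_add(struct file_node *f, int cpu)
{
    assert(cpu >= 0 && cpu < NR_LISTS);
    struct file_list *l = &lists[cpu];

    list_lock(l);
    f->list_idx = cpu;
    f->next = l->head.next;
    f->prev = &l->head;
    l->head.next->prev = f;
    l->head.next = f;
    list_unlock(l);
}

/*
 * Remove from whichever list the node was added to, which may belong to
 * another CPU. Returns 1 on a same-CPU removal (a "cpu-hit"), else 0.
 * Assumes a node is removed exactly once, by its single owner.
 */
static int file_del(struct file_node *f, int cpu)
{
    int idx = f->list_idx;
    struct file_list *l = &lists[idx];

    list_lock(l);
    f->prev->next = f->next;
    f->next->prev = f->prev;
    f->list_idx = -1;
    list_unlock(l);
    return idx == cpu;
}

/* Slow path: a consistent snapshot needs every per-list lock. */
static size_t files_count_all(void)
{
    size_t n = 0;

    for (int i = 0; i < NR_LISTS; i++)
        list_lock(&lists[i]);
    for (int i = 0; i < NR_LISTS; i++)
        for (struct file_node *p = lists[i].head.next;
             p != &lists[i].head; p = p->next)
            n++;
    for (int i = 0; i < NR_LISTS; i++)
        list_unlock(&lists[i]);
    return n;
}
```

The return value of file_del() corresponds to the cpu-hits counter in the test results above: a removal only touches a remote lock when the file was added on a different list.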
     

13 Aug, 2010

1 commit

  • This reverts commit 3bcf3860a4ff9bbc522820b4b765e65e4deceb3e (and the
    accompanying commit c1e5c954020e "vfs/fsnotify: fsnotify_close can delay
    the final work in fput" that was a horribly ugly hack to make it work at
    all).

    The 'struct file' approach not only causes that disgusting hack, it
    somehow breaks pulseaudio, probably due to some other subtlety with
    f_count handling.

    Fix up various conflicts due to later fsnotify work.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

11 Aug, 2010

1 commit


28 Jul, 2010

1 commit

  • fanotify almost works like so:

    user context calls fsnotify_* function with a struct file.
    fsnotify takes a reference on the struct path
    user context goes about its business

    at some later point in time the fsnotify listener gets the struct path
    fanotify listener calls dentry_open() to create a file which userspace can deal with
    listener drops the reference on the struct path
    at some later point the listener calls close() on its new file

    With the switch from struct path to struct file this presents a problem for
    fput() and fsnotify_close(). fsnotify_close() is called when the filp has
    already reached 0 and __fput() wants to do its cleanup.

    The solution presented here is a bit odd. If an event is created from a
    struct file we take a reference on the file. We check however if the f_count
    was already 0 and if so we take an EXTRA reference EVEN THOUGH IT WAS ZERO.
    In __fput() (where we know the f_count hit 0 once) we check if the f_count is
    non-zero and if so we drop that 'extra' ref and return without destroying the
    file.

    Signed-off-by: Eric Paris

    Eric Paris
     

28 May, 2010

1 commit

  • __aio_put_req() plays sick games with file refcount. What
    it wants is fput() from atomic context; it's almost always
    done with f_count > 1, so they only have to deal with delayed
    work in rare cases when their reference happens to be the
    last one. Current code decrements f_count and if it hasn't
    hit 0, everything is fine. Otherwise it keeps a pointer
    to struct file (with zero f_count!) around and has delayed
    work do __fput() on it.

    Better way to do it: use atomic_long_add_unless( , -1, 1)
    instead of !atomic_long_dec_and_test(). IOW, decrement it
    only if it's not the last reference, leave refcount alone
    if it was. And use normal fput() in delayed work.

    I've made that atomic_long_add_unless call a new helper -
    fput_atomic(). Drops a reference to file if it's safe to
    do in atomic (i.e. if that's not the last one), tells if
    it had been able to do that. aio.c converted to it, __fput()
    use is gone. req->ki_file *always* contributes to refcount
    now. And __fput() became static.

    Signed-off-by: Al Viro

    Al Viro
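
The "decrement only if it is not the last reference" idea from the commit above can be sketched with C11 atomics. This is an illustrative user-space model, not the kernel implementation: add_unless() is a hypothetical stand-in for atomic_long_add_unless(), and struct file_sketch and fput_atomic_sketch() are made-up names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct file_sketch {
    atomic_long f_count;        /* reference count */
};

/*
 * Add 'a' to *v unless *v currently equals 'u'.
 * Returns true if the addition was performed.
 */
static bool add_unless(atomic_long *v, long a, long u)
{
    long old = atomic_load(v);

    while (old != u) {
        /* On failure, 'old' is reloaded with the current value. */
        if (atomic_compare_exchange_weak(v, &old, old + a))
            return true;
    }
    return false;
}

/*
 * Drop a reference only if it is not the last one. Returns true if the
 * reference was dropped; false means the caller still holds the last
 * reference and must hand it to a context that may do a full fput().
 */
static bool fput_atomic_sketch(struct file_sketch *f)
{
    assert(f != NULL);
    return add_unless(&f->f_count, -1, 1);
}
```

Because the count is left at 1 when the caller holds the last reference, the file never sits around with a zero count while waiting for the delayed work, which is the property the message is after.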
     

07 Mar, 2010

1 commit

  • We'll introduce FMODE_RANDOM which will be runtime modified. So protect
    all runtime modification to f_mode with f_lock to avoid races.

    Signed-off-by: Wu Fengguang
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Trond Myklebust
    Cc: Chuck Lever
    Cc: [2.6.33.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     

07 Feb, 2010

1 commit


23 Dec, 2009

1 commit


17 Dec, 2009

6 commits

  • Commit 3d1e4631 ("get rid of init_file()") removed the export of
    alloc_file() -- possibly inadvertently, since that commit mainly
    consisted of deleting the lines between the end of alloc_file() and
    the start of the code in init_file().

    There is in fact one modular use of alloc_file() in the tree, in
    drivers/infiniband/core/uverbs_main.c, so re-add the export to fix:

    ERROR: "alloc_file" [drivers/infiniband/core/ib_uverbs.ko] undefined!

    when CONFIG_INFINIBAND_USER_ACCESS=m.

    Cc: Al Viro
    Signed-off-by: Roland Dreier
    Signed-off-by: Linus Torvalds

    Roland Dreier
     
  • There are 2 groups of alloc_file() callers:
    * ones that are followed by ima_counts_get
    * ones giving non-regular files
    So let's pull that ima_counts_get() into alloc_file();
    it's a no-op in case of non-regular files.

    Signed-off-by: Al Viro

    Al Viro
     
  • All users outside of fs/ of get_empty_filp() have been removed. This patch
    moves the definition from the include/ directory to internal.h so no new
    users crop up and removes the EXPORT_SYMBOL. I'd love to see open intents
    stop using it too, but that's a problem for another day and a smarter
    developer!

    Signed-off-by: Eric Paris
    Acked-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Eric Paris
     
  • ... and have the caller grab both mnt and dentry; kill
    leak in infiniband, while we are at it.

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     

25 Oct, 2009

1 commit


24 Sep, 2009

1 commit

  • It's unused.

    It isn't needed -- read or write flag is already passed and sysctl
    shouldn't care about the rest.

    It _was_ used in two places at arch/frv for some reason.

    Signed-off-by: Alexey Dobriyan
    Cc: David Howells
    Cc: "Eric W. Biederman"
    Cc: Al Viro
    Cc: Ralf Baechle
    Cc: Martin Schwidefsky
    Cc: Ingo Molnar
    Cc: "David S. Miller"
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

12 Jun, 2009

2 commits

  • This function walks the s_files list, and operates primarily on the
    files in a superblock, so it better belongs here (eg. see also
    fs_may_remount_ro).

    [AV: ... and it shouldn't be static after that move]

    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    npiggin@suse.de
     
  • This patch speeds up lmbench lat_mmap test by about another 2% after the
    first patch.

    Before:
    avg = 462.286
    std = 5.46106

    After:
    avg = 453.12
    std = 9.58257

    (50 runs of each, stddev gives a reasonable confidence)

    It does this by introducing mnt_clone_write, which avoids some heavyweight
    operations of mnt_want_write if called on a vfsmount which we know already
    has a write count; and mnt_want_write_file, which can call mnt_clone_write
    if the file is open for write.

    After these two patches, mnt_want_write and mnt_drop_write go from 7% on
    the profile down to 1.3% (including mnt_clone_write).

    [AV: mnt_want_write_file() should take file alone and derive mnt from it;
    not only all callers have that form, but that's the only mnt about which
    we know that it's already held for write if file is opened for write]

    Cc: Dave Hansen
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    npiggin@suse.de
     

30 Mar, 2009

1 commit


27 Mar, 2009

1 commit


16 Mar, 2009

1 commit

  • This lock moves out of the CONFIG_EPOLL ifdef and becomes f_lock. For now,
    epoll remains the only user, but a future patch will use it to protect
    f_flags as well.

    Cc: Davide Libenzi
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jonathan Corbet

    Jonathan Corbet
     

06 Feb, 2009

2 commits

  • Conflicts:
    fs/namei.c

    Manually merged per:

    diff --cc fs/namei.c
    index 734f2b5,bbc15c2..0000000
    --- a/fs/namei.c
    +++ b/fs/namei.c
    @@@ -860,9 -848,8 +849,10 @@@ static int __link_path_walk(const char
    nd->flags |= LOOKUP_CONTINUE;
    err = exec_permission_lite(inode);
    if (err == -EAGAIN)
    - err = vfs_permission(nd, MAY_EXEC);
    + err = inode_permission(nd->path.dentry->d_inode,
    + MAY_EXEC);
    + if (!err)
    + err = ima_path_check(&nd->path, MAY_EXEC);
    if (err)
    break;

    @@@ -1525,14 -1506,9 +1509,14 @@@ int may_open(struct path *path, int acc
    flag &= ~O_TRUNC;
    }

    - error = vfs_permission(nd, acc_mode);
    + error = inode_permission(inode, acc_mode);
    if (error)
    return error;
    +
    - error = ima_path_check(&nd->path,
    ++ error = ima_path_check(path,
    + acc_mode & (MAY_READ | MAY_WRITE | MAY_EXEC));
    + if (error)
    + return error;
    /*
    * An append-only file must be opened in append mode for writing.
    */

    Signed-off-by: James Morris

    James Morris
     
  • This patch replaces the generic integrity hooks, for which IMA registered
    itself, with IMA integrity hooks in the appropriate places directly
    in the fs directory.

    Signed-off-by: Mimi Zohar
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    Mimi Zohar
     

01 Jan, 2009

1 commit


14 Nov, 2008

3 commits

  • Attach creds to file structs and discard f_uid/f_gid.

    file_operations::open() methods (such as hppfs_open()) should use file->f_cred
    rather than current_cred(). At the moment file->f_cred will be current_cred()
    at this point.

    Signed-off-by: David Howells
    Reviewed-by: James Morris
    Signed-off-by: James Morris

    David Howells
     
  • Wrap current->cred and a few other accessors to hide their actual
    implementation.

    Signed-off-by: David Howells
    Acked-by: James Morris
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    David Howells
     
  • Separate the task security context from task_struct. At this point, the
    security data is temporarily embedded in the task_struct with two pointers
    pointing to it.

    Note that the Alpha arch is altered as it refers to (E)UID and (E)GID in
    entry.S via asm-offsets.

    With comment fixes Signed-off-by: Marc Dionne

    Signed-off-by: David Howells
    Acked-by: James Morris
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    David Howells
     

02 Nov, 2008

1 commit

  • As it is, all instances of ->release() for files that have ->fasync()
    need to remember to evict file from fasync lists; forgetting that
    creates a hole and we actually have a bunch that *does* forget.

    So let's keep our lives simple - let __fput() check FASYNC in
    file->f_flags and call ->fasync() there if it's been set. And lose that
    crap in ->release() instances - leaving it there is still valid, but we
    don't have to bother anymore.

    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
     

21 Oct, 2008

1 commit


27 Jul, 2008

1 commit


02 May, 2008

1 commit


19 Apr, 2008

3 commits

  • There have been a few oopses caused by 'struct file's with NULL f_vfsmnts.
    There was also a set of potentially missed mnt_want_write()s from
    dentry_open() calls.

    This patch provides a very simple debugging framework to catch these kinds of
    bugs. It will WARN_ON() them, but should stop us from having any oopses or
    mnt_writer count imbalances.

    I'm quite convinced that this is a good thing because it found bugs in the
    stuff I was working on as soon as I wrote it.

    [hch: made it conditional on a debug option.
    But it's still a little bit too ugly]

    [hch: merged forced remount r/o fix from Dave and akpm's fix for the fix]

    Signed-off-by: Dave Hansen
    Acked-by: Al Viro
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Dave Hansen
     
  • This is the first really tricky patch in the series. It elevates the writer
    count on a mount each time a non-special file is opened for write.

    We used to do this in may_open(), but Miklos pointed out that __dentry_open()
    is used as well to create filps. This will cover even those cases, while a
    call in may_open() would not have.

    There is also an elevated count around the vfs_create() call in open_namei().
    See the comments for more details, but we need this to fix a 'create, remount,
    fail r/w open()' race.

    Some filesystems forego the use of normal vfs calls to create
    struct files. Make sure that these users elevate the mnt
    writer count because they will get __fput(), and we need
    to make sure they're balanced.

    Acked-by: Al Viro
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Dave Hansen
     
  • If someone decides to demote a file from r/w to just
    r/o, they can use this same code as __fput().

    NFS does just that, and will use this in the next
    patch.

    AV: drop write access in __fput() only after we evict from file list.

    Signed-off-by: Dave Hansen
    Cc: Erez Zadok
    Cc: Trond Myklebust
    Cc: "J Bruce Fields"
    Acked-by: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Dave Hansen
     

19 Mar, 2008

1 commit

  • Some new uses of get_empty_filp() have crept in; switched
    to alloc_file() to make sure that pieces of initialization
    won't be missing.

    We really need to kill get_empty_filp().

    [AV] fixed dentry leak on failure exit in anon_inode_getfd()

    Cc: Erez Zadok
    Cc: Trond Myklebust
    Cc: "J Bruce Fields"
    Acked-by: Al Viro
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Dave Hansen
    Signed-off-by: Al Viro

    Dave Hansen
     

09 Feb, 2008

1 commit


20 Oct, 2007

1 commit


17 Oct, 2007

1 commit

  • Why do we need r/o bind mounts?

    This feature allows a read-only view into a read-write filesystem. In the
    process of doing that, it also provides infrastructure for keeping track of
    the number of writers to any given mount.

    This has a number of uses. It allows chroots to have parts of filesystems
    writable. It will be useful for containers in the future because users may
    have root inside a container, but should not be allowed to write to
    some filesystems. This also replaces patches that vserver has had out of the
    tree for several years.

    It allows security enhancement by making sure that parts of your filesystem
    are read-only (such as when you don't trust your FTP server), when you don't want
    to have entire new filesystems mounted, or when you want atime selectively
    updated. I've been using the following script to test that the feature is
    working as desired. It takes a directory and makes a regular bind and a r/o
    bind mount of it. It then performs some normal filesystem operations on the
    three directories, including ones that are expected to fail, like creating a
    file on the r/o mount.

    This patch:

    Some filesystems forego the vfs and may_open() and create their own 'struct
    file's.

    This patch creates a couple of helper functions which can be used by these
    filesystems, and will provide a unified place which the r/o bind mount code
    may patch.

    Also, rename an existing, static-scope init_file() to a less generic name.

    Signed-off-by: Dave Hansen
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen