13 Nov, 2013

1 commit

  • Pull vfs updates from Al Viro:
    "All kinds of stuff this time around; some more notable parts:

    - RCU'd vfsmounts handling
    - new primitives for coredump handling
    - files_lock is gone
    - Bruce's delegations handling series
    - exportfs fixes

    plus misc stuff all over the place"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (101 commits)
    ecryptfs: ->f_op is never NULL
    locks: break delegations on any attribute modification
    locks: break delegations on link
    locks: break delegations on rename
    locks: helper functions for delegation breaking
    locks: break delegations on unlink
    namei: minor vfs_unlink cleanup
    locks: implement delegations
    locks: introduce new FL_DELEG lock flag
    vfs: take i_mutex on renamed file
    vfs: rename I_MUTEX_QUOTA now that it's not used for quotas
    vfs: don't use PARENT/CHILD lock classes for non-directories
    vfs: pull ext4's double-i_mutex-locking into common code
    exportfs: fix quadratic behavior in filehandle lookup
    exportfs: better variable name
    exportfs: move most of reconnect_path to helper function
    exportfs: eliminate unused "noprogress" counter
    exportfs: stop retrying once we race with rename/remove
    exportfs: clear DISCONNECTED on all parents sooner
    exportfs: more detailed comment for path_reconnect
    ...

    Linus Torvalds
     

09 Nov, 2013

1 commit


25 Oct, 2013

1 commit


20 Oct, 2013

1 commit

  • Background: nfsd v[23] has had a throughput regression since delayed fput
    went in; every read or write ends up doing fput() and we get a pair
    of extra context switches out of that (plus quite a bit of work
    in queue_work itself, apparently). Use of schedule_delayed_work()
    gives the pending files a chance to accumulate a bit before we do
    __fput() on all of them. I'm not too happy about that solution, but...
    on at least one real-world setup it recovers the roughly 10% throughput
    loss we got from the switch to delayed fput.

    Signed-off-by: Al Viro

    Al Viro
     

12 Sep, 2013

1 commit


04 Sep, 2013

1 commit


13 Jul, 2013

2 commits

  • fput() and delayed_fput() can use llist and avoid the locking.

    This is an unlikely path, so the change is not expected to improve
    performance, but the code looks simpler this way.
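
    A minimal user-space sketch of the lock-free list idea, written with C11
    atomics rather than the kernel's llist API. The names push_model() and
    take_all_model() are hypothetical; they only mirror the shape of a
    lockless push and a "detach everything" operation. A consumer could use
    the "list was empty" return value to decide whether work needs to be
    scheduled, which is an assumption about usage, not a quote of the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node and list head; not the kernel's struct llist_node. */
struct node {
    struct node *next;
    int id;
};

struct lockless_list {
    _Atomic(struct node *) first;
};

/* Push n at the head without a lock. Returns true if the list was empty. */
static bool push_model(struct lockless_list *l, struct node *n)
{
    struct node *old;

    assert(l != NULL && n != NULL);
    old = atomic_load(&l->first);
    do {
        n->next = old;  /* old is refreshed on every failed exchange */
    } while (!atomic_compare_exchange_weak(&l->first, &old, n));
    return old == NULL;
}

/* Detach the whole chain in one atomic step and hand it to the caller. */
static struct node *take_all_model(struct lockless_list *l)
{
    assert(l != NULL);
    return atomic_exchange(&l->first, (struct node *)NULL);
}
```

    Producers only ever touch the head pointer, and the consumer takes the
    entire chain at once, so neither side needs a spinlock in this sketch.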

    Signed-off-by: Oleg Nesterov
    Suggested-by: Andrew Morton
    Cc: Al Viro
    Cc: Andrey Vagin
    Cc: "Eric W. Biederman"
    Cc: David Howells
    Cc: Huang Ying
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Oleg Nesterov
     
  • A missed update to "fput: task_work_add() can fail if the caller has
    passed exit_task_work()".

    Cc: "Eric W. Biederman"
    Cc: Al Viro
    Cc: Andrey Vagin
    Cc: David Howells
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Andrew Morton
     

29 Jun, 2013

1 commit


15 Jun, 2013

1 commit

  • fput() assumes that it can't be called after exit_task_work() but
    this is not true, for example free_ipc_ns()->shm_destroy() can do
    this. In this case fput() silently leaks the file.

    Change it to fall back to delayed_fput_work if task_work_add() fails.
    The patch looks complicated but it is not; it changes the code from

    if (PF_KTHREAD) {
            schedule_work(...);
            return;
    }
    task_work_add(...)

    to

    if (!PF_KTHREAD) {
            if (!task_work_add(...))
                    return;
            /* fallback */
    }
    schedule_work(...);
    As for shm_destroy() in particular, we could make another fix but I
    think this change makes sense anyway. There could be other similar
    users; it is not safe to assume that task_work_add() can't fail.
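
    The new control flow can be modelled in plain C to see which path each
    kind of caller takes. This is only a sketch: struct task_model,
    task_work_add_model() and fput_model() are hypothetical stand-ins, and the
    two booleans model the PF_KTHREAD check and "exit_task_work() has already
    run".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Which mechanism ends up running the final release in this model. */
enum release_path { VIA_TASK_WORK, VIA_WORKQUEUE };

/* Hypothetical model of the caller's state. */
struct task_model {
    bool is_kthread;        /* models the PF_KTHREAD test */
    bool task_work_closed;  /* models "already past exit_task_work()" */
};

/* Models task_work_add(): 0 on success, nonzero on failure. */
static int task_work_add_model(const struct task_model *t)
{
    return t->task_work_closed ? -1 : 0;
}

/* The restructured flow: any failure falls through to the workqueue. */
static enum release_path fput_model(const struct task_model *t)
{
    assert(t != NULL);
    if (!t->is_kthread) {
        if (!task_work_add_model(t))
            return VIA_TASK_WORK;
        /* fallback: task work can no longer be queued */
    }
    return VIA_WORKQUEUE;   /* models schedule_work(...) */
}
```

    In the old flow the exiting-task case had no fallback, which is the
    silent leak described above; here it lands on the workqueue path.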

    Reported-by: Andrey Vagin
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Al Viro

    Oleg Nesterov
     

02 Mar, 2013

1 commit


23 Feb, 2013

3 commits

  • Allocating a file structure in function get_empty_filp() can fail for
    several reasons:
    - not enough memory for file structures
    - operation is not allowed
    - user is over its limit

    Currently the function returns NULL in all cases and we lose the exact
    reason of the error. All callers of get_empty_filp() assume that the function
    can fail with ENFILE only.

    Return error through pointer. Change all callers to preserve this error code.
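
    "Return error through pointer" refers to the kernel's error-pointer
    convention. A user-space sketch of that convention follows; err_ptr(),
    is_err() and ptr_err() are lower-case models of the kernel helpers, and
    get_empty_filp_model() with its two flag parameters is hypothetical. The
    errno chosen for each failure reason is an assumption for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Encode a negative errno in the top of the pointer range. */
static inline void *err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

/* True if p is an encoded error rather than a real object. */
static inline int is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Recover the negative errno from an encoded pointer. */
static inline long ptr_err(const void *p)
{
    return (long)(intptr_t)p;
}

struct file_model { int placeholder; };

/* Hypothetical allocator: each failure reason keeps its own error code. */
static struct file_model *get_empty_filp_model(int over_limit, int denied)
{
    struct file_model *f;

    if (over_limit)
        return err_ptr(-ENFILE);    /* over the file limit */
    if (denied)
        return err_ptr(-EPERM);     /* assumed code for "not allowed" */
    f = calloc(1, sizeof(*f));
    if (!f)
        return err_ptr(-ENOMEM);    /* out of memory */
    return f;
}
```

    A caller checks is_err() first and propagates ptr_err() instead of
    assuming a single fixed error code.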

    [AV: cleaned up a bit, carved the get_empty_filp() part out into a separate commit
    (things remaining here deal with alloc_file()), removed pipe(2) behaviour change]

    Signed-off-by: Anatol Pomozov
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Al Viro

    Anatol Pomozov
     
  • Based on parts from Anatol's patch (the rest is the next commit).

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     

21 Dec, 2012

1 commit

  • File descriptors (even those for writing) do not hold freeze protection.
    Thus mark_files_ro() must call __mnt_drop_write() to only drop protection
    against remount read-only. Calling mnt_drop_write_file() as we do now
    results in:

    [ BUG: bad unlock balance detected! ]
    3.7.0-rc6-00028-g88e75b6 #101 Not tainted
    -------------------------------------
    kworker/1:2/79 is trying to release lock (sb_writers) at:
    [] mnt_drop_write+0x24/0x30
    but there are no more locks to release!

    Reported-by: Zdenek Kabelac
    CC: stable@vger.kernel.org
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     

10 Oct, 2012

1 commit


03 Oct, 2012

1 commit

  • Pull security subsystem updates from James Morris:
    "Highlights:

    - Integrity: add local fs integrity verification to detect offline
    attacks
    - Integrity: add digital signature verification
    - Simple stacking of Yama with other LSMs (per LSS discussions)
    - IBM vTPM support on ppc64
    - Add new driver for Infineon I2C TIS TPM
    - Smack: add rule revocation for subject labels"

    Fixed conflicts with the user namespace support in kernel/auditsc.c and
    security/integrity/ima/ima_policy.c.

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (39 commits)
    Documentation: Update git repository URL for Smack userland tools
    ima: change flags container data type
    Smack: setprocattr memory leak fix
    Smack: implement revoking all rules for a subject label
    Smack: remove task_wait() hook.
    ima: audit log hashes
    ima: generic IMA action flag handling
    ima: rename ima_must_appraise_or_measure
    audit: export audit_log_task_info
    tpm: fix tpm_acpi sparse warning on different address spaces
    samples/seccomp: fix 31 bit build on s390
    ima: digital signature verification support
    ima: add support for different security.ima data types
    ima: add ima_inode_setxattr/removexattr function and calls
    ima: add inode_post_setattr call
    ima: replace iint spinblock with rwlock/read_lock
    ima: allocating iint improvements
    ima: add appraise action keywords and default rules
    ima: integrity appraisal extension
    vfs: move ima_file_free before releasing the file
    ...

    Linus Torvalds
     

27 Sep, 2012

1 commit


08 Sep, 2012

1 commit

  • ima_file_free(), called on __fput(), currently flags files that have
    changed, so that the file is re-measured. For appraising a file's
    integrity, the file's hash must be re-calculated and stored in the
    'security.ima' xattr to reflect any changes.

    This patch moves the ima_file_free() call to before releasing the file
    in preparation of ima-appraisal measuring the file and updating the
    'security.ima' xattr.

    Signed-off-by: Mimi Zohar
    Acked-by: Serge Hallyn
    Acked-by: Dmitry Kasatkin

    Mimi Zohar
     

31 Jul, 2012

1 commit

  • Most of the places where we want freeze protection coincide with the places
    where we also have remount-ro protection. So make mnt_want_write() and
    mnt_drop_write() (and their _file alternatives) prevent freezing as well.
    For the few cases that are really interested only in remount-ro protection,
    provide new function variants.

    BugLink: https://bugs.launchpad.net/bugs/897421
    Tested-by: Kamal Mostafa
    Tested-by: Peter M. Petrakis
    Tested-by: Dann Frazier
    Tested-by: Massimo Morana
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     

30 Jul, 2012

1 commit


23 Jul, 2012

1 commit

  • ... and schedule_work() for interrupt/kernel_thread callers
    (and yes, now it *is* OK to call from interrupt).

    We are guaranteed that __fput() will be done before we return
    to userland (or exit). Note that for fput() from a kernel
    thread we get an async behaviour; it's almost always OK, but
    sometimes you might need to have __fput() completed before
    you do anything else. There are two mechanisms for that -
    a general barrier (flush_delayed_fput()) and explicit
    __fput_sync(). Both should be used with care (as was the
    case for fput() from kernel threads all along). See comments
    in fs/file_table.c for details.

    Signed-off-by: Al Viro

    Al Viro
     

14 Jul, 2012

1 commit


30 May, 2012

2 commits

  • lglocks and brlocks are currently generated with some complicated macros
    in lglock.h. But there's no reason to not just use common utility
    functions and put all the data into a common data structure.

    In preparation, this patch changes the API to look more like normal
    function calls with pointers, not magic macros.

    The patch is rather large because I move over all users in one go to keep
    it bisectable. This impacts the VFS somewhat in terms of lines changed.
    But no actual behaviour change.

    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Andi Kleen
    Cc: Al Viro
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Rusty Russell
    Signed-off-by: Al Viro

    Andi Kleen
     
  • lglocks and brlocks are currently generated with some complicated macros
    in lglock.h. But there's no reason to not just use common utility
    functions and put all the data into a common data structure.

    Since there are at least two users it makes sense to share this code in a
    library. This is also easier to maintain than a macro forest.

    This will also make it possible later to dynamically allocate lglocks and
    to use them in modules (both would still need some additional, but
    now straightforward, code).

    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Andi Kleen
    Cc: Al Viro
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Rusty Russell
    Signed-off-by: Al Viro

    Andi Kleen
     

21 Mar, 2012

1 commit


07 Jan, 2012

1 commit


27 Jul, 2011

1 commit

  • This allows us to move duplicated code in
    (atomic_inc_not_zero() for now) to

    Signed-off-by: Arun Sharma
    Reviewed-by: Eric Dumazet
    Cc: Ingo Molnar
    Cc: David Miller
    Cc: Eric Dumazet
    Acked-by: Mike Frysinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun Sharma
     

17 Mar, 2011

3 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    fix cdev leak on O_PATH final fput()

    Linus Torvalds
     
  • __fput doesn't need a cdev_put() for O_PATH handles.

    Signed-off-by: mszeredi@suse.cz
    Signed-off-by: Al Viro

    Miklos Szeredi
     
  • …s/security-testing-2.6

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (33 commits)
    AppArmor: kill unused macros in lsm.c
    AppArmor: cleanup generated files correctly
    KEYS: Add an iovec version of KEYCTL_INSTANTIATE
    KEYS: Add a new keyctl op to reject a key with a specified error code
    KEYS: Add a key type op to permit the key description to be vetted
    KEYS: Add an RCU payload dereference macro
    AppArmor: Cleanup make file to remove cruft and make it easier to read
    SELinux: implement the new sb_remount LSM hook
    LSM: Pass -o remount options to the LSM
    SELinux: Compute SID for the newly created socket
    SELinux: Socket retains creator role and MLS attribute
    SELinux: Auto-generate security_is_socket_class
    TOMOYO: Fix memory leak upon file open.
    Revert "selinux: simplify ioctl checking"
    selinux: drop unused packet flow permissions
    selinux: Fix packet forwarding checks on postrouting
    selinux: Fix wrong checks for selinux_policycap_netpeer
    selinux: Fix check for xfrm selinux context algorithm
    ima: remove unnecessary call to ima_must_measure
    IMA: remove IMA imbalance checking
    ...

    Linus Torvalds
     

15 Mar, 2011

2 commits

  • Just need to make sure that the AF_UNIX garbage collector won't
    mistake an O_PATH-opened socket on a filesystem for a real opened
    AF_UNIX socket.

    Signed-off-by: Al Viro

    Al Viro
     
  • New flag for open(2) - O_PATH. Semantics:
    * pathname is resolved, but the file itself is _NOT_ opened
    as far as filesystem is concerned.
    * almost all operations on the resulting descriptors shall
    fail with -EBADF. Exceptions are:
    1) operations on descriptors themselves (i.e.
    close(), dup(), dup2(), dup3(), fcntl(fd, F_DUPFD),
    fcntl(fd, F_DUPFD_CLOEXEC, ...), fcntl(fd, F_GETFD),
    fcntl(fd, F_SETFD, ...))
    2) fcntl(fd, F_GETFL), for a common non-destructive way to
    check if descriptor is open
    3) "dfd" arguments of ...at(2) syscalls, i.e. the starting
    points of pathname resolution
    * closing such descriptor does *NOT* affect dnotify or
    posix locks.
    * permissions are checked as usual along the way to file;
    no permission checks are applied to the file itself. Of course,
    giving such thing to syscall will result in permission checks (at
    the moment it means checking that starting point of ....at() is
    a directory and caller has exec permissions on it).
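
    A small user-space sketch of these semantics, assuming a Linux system
    whose libc exposes O_PATH. The helper names are hypothetical. The first
    helper uses read(2) as an example of an ordinary operation that should
    fail with EBADF; the second uses the descriptor only as the "dfd"
    starting point of an ...at() call. The fallback value for O_PATH is the
    generic-architecture one and is an assumption for systems whose headers
    hide the macro.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <unistd.h>

#ifndef O_PATH
#define O_PATH 010000000    /* assumption: generic Linux value */
#endif

/*
 * Open path with O_PATH and try to read from it.
 * Returns the errno from read(2), 0 if the read unexpectedly succeeded,
 * or -1 if the open itself failed.
 */
static int read_errno_on_opath(const char *path)
{
    char buf[1];
    int fd, err = 0;

    assert(path != NULL);
    fd = open(path, O_PATH);
    if (fd < 0)
        return -1;
    if (read(fd, buf, sizeof(buf)) < 0)
        err = errno;
    close(fd);
    return err;
}

/*
 * Use an O_PATH descriptor for dir as the starting point of fstatat(2).
 * Returns 0 on success and -1 on failure.
 */
static int stat_via_opath_dfd(const char *dir, const char *name)
{
    struct stat st;
    int dfd, ret;

    assert(dir != NULL && name != NULL);
    dfd = open(dir, O_PATH | O_DIRECTORY);
    if (dfd < 0)
        return -1;
    ret = fstatat(dfd, name, &st, 0);
    close(dfd);
    return ret;
}
```

    Calling read_errno_on_opath("/") is expected to report EBADF, while
    stat_via_opath_dfd("/", ".") is expected to succeed, matching exception
    3) in the list above.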

    fget() and fget_light() return NULL on such descriptors; use of
    fget_raw() and fget_raw_light() is needed to get them. That protects
    existing code from dealing with those things.

    There are two things still missing (they come in the next commits):
    one is handling of symlinks (right now we refuse to open them that
    way; see the next commit for semantics related to those) and another
    is descriptor passing via SCM_RIGHTS datagrams.

    Signed-off-by: Al Viro

    Al Viro
     

08 Mar, 2011

1 commit


10 Feb, 2011

1 commit

  • ima_counts_get() updated the readcount and invalidated the PCR,
    as necessary. Only update the i_readcount in the VFS layer.
    Move the PCR invalidation checks to ima_file_check(), where it
    belongs.

    Maintaining the i_readcount in the VFS layer will allow other
    subsystems to use i_readcount.

    Signed-off-by: Mimi Zohar
    Acked-by: Eric Paris

    Mimi Zohar
     

05 Feb, 2011

1 commit

  • In get_empty_filp() since 2.6.29, file_free(f) is called with f->f_cred == NULL
    when security_file_alloc() returns an error. As a result, the kernel will panic()
    due to the put_cred(NULL) call within the RCU callback.

    Fix this bug by assigning f->f_cred before calling security_file_alloc().

    Signed-off-by: Tetsuo Handa
    Signed-off-by: David Howells
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     

17 Jan, 2011

1 commit

  • There's an unlikely() in fget_light() that assumes the file ref count
    will be 1. Running the annotate branch profiler on a desktop that is
    performing daily tasks (running firefox, evolution, xchat and is also part
    of a distcc farm), it shows that the ref count is not 1 that often.

    correct     incorrect   %   Function    File          Line
    -------     ---------   -   --------    ----          ----
    1035099358  6209599193  85  fget_light  file_table.c  315
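
    The % column can be reproduced from the two counters. The sketch below
    assumes the percentage is incorrect / (correct + incorrect) with the
    fraction truncated; the function name is hypothetical and the rounding
    rule is a guess that happens to match the row above.

```c
#include <assert.h>
#include <stdint.h>

/* Share of mispredicted branch hints, as a whole percentage (truncated). */
static unsigned pct_incorrect(uint64_t correct, uint64_t incorrect)
{
    uint64_t total = correct + incorrect;

    /* Keep the multiplication below from overflowing 64 bits. */
    assert(incorrect <= UINT64_MAX / 100);
    return total ? (unsigned)(incorrect * 100 / total) : 0;
}
```

    For the fget_light row, pct_incorrect(1035099358, 6209599193) gives 85,
    so the unlikely() hint was wrong far more often than it was right.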

    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Steven Rostedt
    Signed-off-by: Al Viro

    Steven Rostedt
     

27 Oct, 2010

1 commit

  • Robin Holt tried to boot a 16TB system and found af_unix was overflowing
    a 32bit value :

    We were seeing a failure which prevented boot. The kernel was incapable
    of creating either a named pipe or unix domain socket. This comes down
    to a common kernel function called unix_create1() which does:

    atomic_inc(&unix_nr_socks);
    if (atomic_read(&unix_nr_socks) > 2 * get_max_files())
            goto out;

    The function get_max_files() is a simple return of files_stat.max_files.
    files_stat.max_files is a signed integer and is computed in
    fs/file_table.c's files_init().

    n = (mempages * (PAGE_SIZE / 1024)) / 10;
    files_stat.max_files = n;

    In our case, mempages (total_ram_pages) is approx 3,758,096,384
    (0xe0000000). That leaves max_files at approximately 1,503,238,553.
    This causes 2 * get_max_files() to integer overflow.
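
    The arithmetic can be checked with a short sketch that does the
    computation in 64 bits, so nothing overflows while we look at it. The
    4096-byte page size is an assumption that is consistent with the quoted
    numbers, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_MODEL 4096ULL /* assumption: 4 KiB pages */

/* Same formula as the quoted files_init() lines, but in 64-bit math. */
static uint64_t max_files_for(uint64_t mempages)
{
    /* Guard the 64-bit multiplication itself. */
    assert(mempages <= UINT64_MAX / (PAGE_SIZE_MODEL / 1024));
    return (mempages * (PAGE_SIZE_MODEL / 1024)) / 10;
}

/* Would 2 * max_files still fit in a signed 32-bit int? */
static int doubled_fits_in_int32(uint64_t max_files)
{
    return 2 * max_files <= (uint64_t)INT32_MAX;
}
```

    With mempages = 0xe0000000 the first function yields 1503238553, and
    doubling that gives 3006477106, which is above the 32-bit signed maximum
    of 2147483647.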

    Fix is to let /proc/sys/fs/file-nr & /proc/sys/fs/file-max use long
    integers, and change af_unix to use an atomic_long_t instead of atomic_t.

    get_max_files() is changed to return an unsigned long. get_nr_files() is
    changed to return a long.

    unix_nr_socks is changed from atomic_t to atomic_long_t, although this is
    not strictly needed to address Robin's problem.

    Before patch (on a 64bit kernel) :
    # echo 2147483648 >/proc/sys/fs/file-max
    # cat /proc/sys/fs/file-max
    -18446744071562067968

    After patch:
    # echo 2147483648 >/proc/sys/fs/file-max
    # cat /proc/sys/fs/file-max
    2147483648
    # cat /proc/sys/fs/file-nr
    704 0 2147483648

    Reported-by: Robin Holt
    Signed-off-by: Eric Dumazet
    Acked-by: David Miller
    Reviewed-by: Robin Holt
    Tested-by: Robin Holt
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

18 Aug, 2010

2 commits

  • fs: scale files_lock

    Improve scalability of files_lock by adding per-cpu, per-sb files lists,
    protected with an lglock. The lglock provides fast access to the per-cpu lists
    to add and remove files. It also provides a snapshot of all the per-cpu lists
    (although this is very slow).

    One difficulty with this approach is that a file can be removed from the list
    by another CPU. We must track which per-cpu list the file is on with a new
    variable in the file struct (packed into a hole on 64-bit archs). Scalability
    could suffer if files are frequently removed from a different CPU's list.

    However loads with frequent removal of files imply short interval between
    adding and removing the files, and the scheduler attempts to avoid moving
    processes too far away. Also, even in the case of cross-CPU removal, the
    hardware has much more opportunity to parallelise cacheline transfers with N
    cachelines than with 1.

    A worst-case test of 1 CPU allocating files subsequently being freed by N CPUs
    degenerates to contending on a single lock, which is no worse than before. When
    more than one CPU is allocating files, even if they are always freed by
    different CPUs, there will be more parallelism than the single-lock case.

    Testing results:

    On a 2 socket, 8 core opteron, I measure the number of times the lock is taken
    to remove the file, the number of times it is removed by the same CPU that
    added it, and the number of times it is removed by the same node that added it.

    Booting:    locks=  25049 cpu-hits=  23174 (92.5%) node-hits=  23945 (95.6%)
    kbuild -j16 locks=2281913 cpu-hits=2208126 (96.8%) node-hits=2252674 (98.7%)
    dbench 64   locks=4306582 cpu-hits=4287247 (99.6%) node-hits=4299527 (99.8%)

    So a file is removed from the same CPU it was added by over 90% of the time.
    It remains within the same node 95% of the time.

    Tim Chen ran some numbers for a 64 thread Nehalem system performing a compile.

                throughput
    2.6.34-rc2  24.5
    +patch      24.9

                us     sys    idle   IO wait (in %)
    2.6.34-rc2  51.25  28.25  17.25  3.25
    +patch      53.75  18.5   19     8.75

    So significantly less CPU time spent in kernel code, higher idle time and
    slightly higher throughput.

    Single threaded performance difference was within the noise of microbenchmarks.
    That is not to say a penalty does not exist: the code is larger and more memory
    accesses are required, so it will be slightly slower.

    Cc: linux-kernel@vger.kernel.org
    Cc: Tim Chen
    Cc: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    Nick Piggin
     
  • fs: cleanup files_lock locking

    Lock tty_files with a new spinlock, tty_files_lock; provide helpers to
    manipulate the per-sb files list; unexport the files_lock spinlock.

    Cc: linux-kernel@vger.kernel.org
    Cc: Christoph Hellwig
    Cc: Alan Cox
    Acked-by: Andi Kleen
    Acked-by: Greg Kroah-Hartman
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    Nick Piggin