28 Aug, 2017

1 commit


06 Jul, 2017

1 commit

  • Most filesystems currently use mapping_set_error and
    filemap_check_errors for setting and reporting/clearing writeback errors
    at the mapping level. filemap_check_errors is indirectly called from
    most of the filemap_fdatawait_* functions and from
    filemap_write_and_wait*. These functions are called from all sorts of
    contexts to wait on writeback to finish -- e.g. mostly in fsync, but
    also in truncate calls, getattr, etc.

    The non-fsync callers are problematic. We should be reporting writeback
    errors during fsync, but many places spread over the tree clear out
    errors before they can be properly reported, or report errors at
    nonsensical times.

    If I get -EIO on a stat() call, there is no reason for me to assume that
    it is because some previous writeback failed. The fact that it also
    clears out the error such that a subsequent fsync returns 0 is a bug,
    and a nasty one since that's potentially silent data corruption.

    This patch adds a small bit of new infrastructure for setting and
    reporting errors during address_space writeback. While the above was my
    original impetus for adding this, I think it's also the case that
    current fsync semantics are just problematic for userland. Most
    applications that call fsync do so to ensure that the data they wrote
    has hit the backing store.

    In the case where there are multiple writers to the file at the same
    time, this is really hard to determine. The first one to call fsync will
    see any stored error, and the rest get back 0. The processes with open
    fds may not be associated with one another in any way. They could even
    be in different containers, so ensuring coordination between all fsync
    callers is not really an option.

    One way to remedy this would be to track what file descriptor was used
    to dirty the file, but that's rather cumbersome and would likely be
    slow. However, there is a simpler way to improve the semantics here
    without incurring too much overhead.

    This set adds an errseq_t to struct address_space, and a corresponding
    one is added to struct file. Writeback errors are recorded in the
    mapping's errseq_t, and the one in struct file is used as the "since"
    value.

    This changes the semantics of the Linux fsync implementation such that
    applications can now use it to determine whether there were any
    writeback errors since fsync(fd) was last called (or since the file was
    opened in the case of fsync having never been called).

    Note that those writeback errors may have occurred when writing data
    that was dirtied via an entirely different fd, but that's the case now
    with the current mapping_set_error/filemap_check_error infrastructure.
    This will at least prevent you from getting a false report of success.

    The new behavior is still consistent with the POSIX spec, and is more
    reliable for application developers. This patch just adds some basic
    infrastructure for doing this, and ensures that the f_wb_err "cursor"
    is properly set when a file is opened. Later patches will change the
    existing code to use this new infrastructure for reporting errors at
    fsync time.

    Signed-off-by: Jeff Layton
    Reviewed-by: Jan Kara

    Jeff Layton
     

02 Mar, 2017

1 commit


06 Dec, 2016

1 commit


07 Aug, 2015

1 commit

  • Dave Hansen reported the following;

    My laptop has been behaving strangely with 4.2-rc2. Once I log
    in to my X session, I start getting all kinds of strange errors
    from applications and see this in my dmesg:

    VFS: file-max limit 8192 reached

    The problem is that the file-max is calculated before memory is fully
    initialised and miscalculates how much memory the kernel is using. This
    patch recalculates file-max after deferred memory initialisation. Note
    that using memory hotplug infrastructure would not have avoided this
    problem as the value is not recalculated after memory hot-add.

    4.1: files_stat.max_files = 6582781
    4.2-rc2: files_stat.max_files = 8192
    4.2-rc2 patched: files_stat.max_files = 6562467

    Small differences with the patch applied and 4.1 but not enough to matter.

    Signed-off-by: Mel Gorman
    Reported-by: Dave Hansen
    Cc: Nicolai Stange
    Cc: Dave Hansen
    Cc: Alex Ng
    Cc: Fengguang Wu
    Cc: Peter Zijlstra (Intel)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

24 Jun, 2015

1 commit


12 Apr, 2015

1 commit


13 Oct, 2014

2 commits

  • Pull vfs updates from Al Viro:
    "The big thing in this pile is Eric's unmount-on-rmdir series; we
    finally have everything we need for that. The final piece of prereqs
    is delayed mntput() - now filesystem shutdown always happens on
    shallow stack.

    Other than that, we have several new primitives for iov_iter (Matt
    Wilcox, culled from his XIP-related series) pushing the conversion to
    ->read_iter()/ ->write_iter() a bit more, a bunch of fs/dcache.c
    cleanups and fixes (including the external name refcounting, which
    gives consistent behaviour of d_move() wrt procfs symlinks for long
    and short names alike) and assorted cleanups and fixes all over the
    place.

    This is just the first pile; there's a lot of stuff from various
    people that ought to go in this window. Starting with
    unionmount/overlayfs mess... ;-/"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (60 commits)
    fs/file_table.c: Update alloc_file() comment
    vfs: Deduplicate code shared by xattr system calls operating on paths
    reiserfs: remove pointless forward declaration of struct nameidata
    don't need that forward declaration of struct nameidata in dcache.h anymore
    take dname_external() into fs/dcache.c
    let path_init() failures treated the same way as subsequent link_path_walk()
    fix misuses of f_count() in ppp and netlink
    ncpfs: use list_for_each_entry() for d_subdirs walk
    vfs: move getname() from callers to do_mount()
    gfs2_atomic_open(): skip lookups on hashed dentry
    [infiniband] remove pointless assignments
    gadgetfs: saner API for gadgetfs_create_file()
    f_fs: saner API for ffs_sb_create_file()
    jfs: don't hash direct inode
    [s390] remove pointless assignment of ->f_op in vmlogrdr ->open()
    ecryptfs: ->f_op is never NULL
    android: ->f_op is never NULL
    nouveau: __iomem misannotations
    missing annotation in fs/file.c
    fs: namespace: suppress 'may be used uninitialized' warnings
    ...

    Linus Torvalds
     
  • This comment is 5 years outdated; init_file() no longer exists.

    Signed-off-by: Eric Biggers
    Signed-off-by: Al Viro

    Eric Biggers
     

08 Sep, 2014

1 commit

  • Percpu allocator now supports allocation mask. Add @gfp to
    percpu_counter_init() so that !GFP_KERNEL allocation masks can be used
    with percpu_counters too.

    We could have left percpu_counter_init() alone and added
    percpu_counter_init_gfp(); however, the number of users isn't that
    high and introducing _gfp variants to all percpu data structures would
    be quite ugly, so let's just do the conversion. This is the one with
    the most users. Other percpu data structures are a lot easier to
    convert.

    This patch doesn't make any functional difference.

    Signed-off-by: Tejun Heo
    Acked-by: Jan Kara
    Acked-by: "David S. Miller"
    Cc: x86@kernel.org
    Cc: Jens Axboe
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andrew Morton

    Tejun Heo
     

13 Jun, 2014

1 commit

  • Pull vfs updates from Al Viro:
    "This the bunch that sat in -next + lock_parent() fix. This is the
    minimal set; there's more pending stuff.

    In particular, I really hope to get acct.c fixes merged this cycle -
    we need that to deal sanely with delayed-mntput stuff. In the next
    pile, hopefully - that series is fairly short and localized
    (kernel/acct.c, fs/super.c and fs/namespace.c). In this pile: more
    iov_iter work. Most of prereqs for ->splice_write with sane locking
    order are there and Kent's dio rewrite would also fit nicely on top of
    this pile"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits)
    lock_parent: don't step on stale ->d_parent of all-but-freed one
    kill generic_file_splice_write()
    ceph: switch to iter_file_splice_write()
    shmem: switch to iter_file_splice_write()
    nfs: switch to iter_splice_write_file()
    fs/splice.c: remove unneeded exports
    ocfs2: switch to iter_file_splice_write()
    ->splice_write() via ->write_iter()
    bio_vec-backed iov_iter
    optimize copy_page_{to,from}_iter()
    bury generic_file_aio_{read,write}
    lustre: get rid of messing with iovecs
    ceph: switch to ->write_iter()
    ceph_sync_direct_write: stop poking into iov_iter guts
    ceph_sync_read: stop poking into iov_iter guts
    new helper: copy_page_from_iter()
    fuse: switch to ->write_iter()
    btrfs: switch to ->write_iter()
    ocfs2: switch to ->write_iter()
    xfs: switch to ->write_iter()
    ...

    Linus Torvalds
     

07 Jun, 2014

1 commit


07 May, 2014

2 commits

  • Beginning to introduce those. Just the callers for now, and it's
    clumsier than it'll eventually become; once we finish converting
    aio_read and aio_write instances, the things will get nicer.

    For now, these guys are in parallel to ->aio_read() and ->aio_write();
    they take iocb and iov_iter, with everything in iov_iter already
    validated. File offset is passed in iocb->ki_pos, iov/nr_segs -
    in iov_iter.

    Main concerns in that series are stack footprint and ability to
    split the damn thing cleanly.

    [fix from Peter Ujfalusi folded]

    Signed-off-by: Al Viro

    Al Viro
     
  • Since we are about to introduce new methods (read_iter/write_iter), the
    tests in a bunch of places would have to grow inconveniently. Check
    once (at open() time) and store results in ->f_mode as FMODE_CAN_READ
    and FMODE_CAN_WRITE resp. It might end up being a temporary measure -
    once everything switches from ->aio_{read,write} to ->{read,write}_iter
    it might make sense to return to open-coded checks. We'll see...

    Signed-off-by: Al Viro

    Al Viro
     

13 Apr, 2014

1 commit

  • Pull vfs updates from Al Viro:
    "The first vfs pile, with deep apologies for being very late in this
    window.

    Assorted cleanups and fixes, plus a large preparatory part of iov_iter
    work. There's a lot more of that, but it'll probably go into the next
    merge window - it *does* shape up nicely, removes a lot of
    boilerplate, gets rid of locking inconsistencie between aio_write and
    splice_write and I hope to get Kent's direct-io rewrite merged into
    the same queue, but some of the stuff after this point is having
    (mostly trivial) conflicts with the things already merged into
    mainline and with some I want more testing.

    This one passes LTP and xfstests without regressions, in addition to
    usual beating. BTW, readahead02 in ltp syscalls testsuite has started
    giving failures since "mm/readahead.c: fix readahead failure for
    memoryless NUMA nodes and limit readahead pages" - might be a false
    positive, might be a real regression..."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    missing bits of "splice: fix racy pipe->buffers uses"
    cifs: fix the race in cifs_writev()
    ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
    kill generic_file_buffered_write()
    ocfs2_file_aio_write(): switch to generic_perform_write()
    ceph_aio_write(): switch to generic_perform_write()
    xfs_file_buffered_aio_write(): switch to generic_perform_write()
    export generic_perform_write(), start getting rid of generic_file_buffer_write()
    generic_file_direct_write(): get rid of ppos argument
    btrfs_file_aio_write(): get rid of ppos
    kill the 5th argument of generic_file_buffered_write()
    kill the 4th argument of __generic_file_aio_write()
    lustre: don't open-code kernel_recvmsg()
    ocfs2: don't open-code kernel_recvmsg()
    drbd: don't open-code kernel_recvmsg()
    constify blk_rq_map_user_iov() and friends
    lustre: switch to kernel_sendmsg()
    ocfs2: don't open-code kernel_sendmsg()
    take iov_iter stuff to mm/iov_iter.c
    process_vm_access: tidy up a bit
    ...

    Linus Torvalds
     

05 Apr, 2014

1 commit

  • Pull file locking updates from Jeff Layton:
    "Highlights:

    - maintainership change for fs/locks.c. Willy's not interested in
    maintaining it these days, and is OK with Bruce and I taking it.
    - fix for open vs setlease race that Al ID'ed
    - cleanup and consolidation of file locking code
    - eliminate unneeded BUG() call
    - merge of file-private lock implementation"

    * 'locks-3.15' of git://git.samba.org/jlayton/linux:
    locks: make locks_mandatory_area check for file-private locks
    locks: fix locks_mandatory_locked to respect file-private locks
    locks: require that flock->l_pid be set to 0 for file-private locks
    locks: add new fcntl cmd values for handling file private locks
    locks: skip deadlock detection on FL_FILE_PVT locks
    locks: pass the cmd value to fcntl_getlk/getlk64
    locks: report l_pid as -1 for FL_FILE_PVT locks
    locks: make /proc/locks show IS_FILE_PVT locks as type "FLPVT"
    locks: rename locks_remove_flock to locks_remove_file
    locks: consolidate checks for compatible filp->f_mode values in setlk handlers
    locks: fix posix lock range overflow handling
    locks: eliminate BUG() call when there's an unexpected lock on file close
    locks: add __acquires and __releases annotations to locks_start and locks_stop
    locks: remove "inline" qualifier from fl_link manipulation functions
    locks: clean up comment typo
    locks: close potential race between setlease and open
    MAINTAINERS: update entry for fs/locks.c

    Linus Torvalds
     

02 Apr, 2014

4 commits


31 Mar, 2014

1 commit

  • This function currently removes leases in addition to flock locks and in
    a later patch we'll have it deal with file-private locks too. Rename it
    to locks_remove_file to indicate that it removes locks that are
    associated with a particular struct file, and not just flock locks.

    Acked-by: J. Bruce Fields
    Signed-off-by: Jeff Layton

    Jeff Layton
     

10 Mar, 2014

1 commit

  • Our write() system call has always been atomic in the sense that you get
    the expected thread-safe contiguous write, but we haven't actually
    guaranteed that concurrent writes are serialized wrt f_pos accesses, so
    threads (or processes) that share a file descriptor and use "write()"
    concurrently would quite likely overwrite each others data.

    This violates POSIX.1-2008/SUSv4 Section XSI 2.9.7 that says:

    "2.9.7 Thread Interactions with Regular File Operations

    All of the following functions shall be atomic with respect to each
    other in the effects specified in POSIX.1-2008 when they operate on
    regular files or symbolic links: [...]"

    and one of the effects is the file position update.

    This unprotected file position behavior is not new behavior, and nobody
    has ever cared. Until now. Yongzhi Pan reported unexpected behavior to
    Michael Kerrisk that was due to this.

    This resolves the issue with a f_pos-specific lock that is taken by
    read/write/lseek on file descriptors that may be shared across threads
    or processes.

    Reported-by: Yongzhi Pan
    Reported-by: Michael Kerrisk
    Cc: Al Viro
    Signed-off-by: Linus Torvalds
    Signed-off-by: Al Viro

    Linus Torvalds
     

13 Nov, 2013

1 commit

  • Pull vfs updates from Al Viro:
    "All kinds of stuff this time around; some more notable parts:

    - RCU'd vfsmounts handling
    - new primitives for coredump handling
    - files_lock is gone
    - Bruce's delegations handling series
    - exportfs fixes

    plus misc stuff all over the place"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (101 commits)
    ecryptfs: ->f_op is never NULL
    locks: break delegations on any attribute modification
    locks: break delegations on link
    locks: break delegations on rename
    locks: helper functions for delegation breaking
    locks: break delegations on unlink
    namei: minor vfs_unlink cleanup
    locks: implement delegations
    locks: introduce new FL_DELEG lock flag
    vfs: take i_mutex on renamed file
    vfs: rename I_MUTEX_QUOTA now that it's not used for quotas
    vfs: don't use PARENT/CHILD lock classes for non-directories
    vfs: pull ext4's double-i_mutex-locking into common code
    exportfs: fix quadratic behavior in filehandle lookup
    exportfs: better variable name
    exportfs: move most of reconnect_path to helper function
    exportfs: eliminate unused "noprogress" counter
    exportfs: stop retrying once we race with rename/remove
    exportfs: clear DISCONNECTED on all parents sooner
    exportfs: more detailed comment for path_reconnect
    ...

    Linus Torvalds
     

09 Nov, 2013

1 commit


25 Oct, 2013

1 commit


20 Oct, 2013

1 commit

  • Background: nfsd v[23] had throughput regression since delayed fput
    went in; every read or write ends up doing fput() and we get a pair
    of extra context switches out of that (plus quite a bit of work
    in queue_work itselfi, apparently). Use of schedule_delayed_work()
    gives it a chance to accumulate a bit before we do __fput() on all
    of them. I'm not too happy about that solution, but... on at least
    one real-world setup it reverts about 10% throughput loss we got from
    switch to delayed fput.

    Signed-off-by: Al Viro

    Al Viro
     

12 Sep, 2013

1 commit


04 Sep, 2013

1 commit


13 Jul, 2013

2 commits

  • fput() and delayed_fput() can use llist and avoid the locking.

    This is unlikely path, it is not that this change can improve
    the performance, but this way the code looks simpler.

    Signed-off-by: Oleg Nesterov
    Suggested-by: Andrew Morton
    Cc: Al Viro
    Cc: Andrey Vagin
    Cc: "Eric W. Biederman"
    Cc: David Howells
    Cc: Huang Ying
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Oleg Nesterov
     
  • A missed update to "fput: task_work_add() can fail if the caller has
    passed exit_task_work()".

    Cc: "Eric W. Biederman"
    Cc: Al Viro
    Cc: Andrey Vagin
    Cc: David Howells
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Andrew Morton
     

29 Jun, 2013

1 commit


15 Jun, 2013

1 commit

  • fput() assumes that it can't be called after exit_task_work() but
    this is not true, for example free_ipc_ns()->shm_destroy() can do
    this. In this case fput() silently leaks the file.

    Change it to fallback to delayed_fput_work if task_work_add() fails.
    The patch looks complicated but it is not, it changes the code from

    if (PF_KTHREAD) {
    schedule_work(...);
    return;
    }
    task_work_add(...)

    to
    if (!PF_KTHREAD) {
    if (!task_work_add(...))
    return;
    /* fallback */
    }
    schedule_work(...);

    As for shm_destroy() in particular, we could make another fix but I
    think this change makes sense anyway. There could be another similar
    user, it is not safe to assume that task_work_add() can't fail.

    Reported-by: Andrey Vagin
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Al Viro

    Oleg Nesterov
     

02 Mar, 2013

1 commit


23 Feb, 2013

3 commits

  • Allocating a file structure in function get_empty_filp() might fail because
    of several reasons:
    - not enough memory for file structures
    - operation is not allowed
    - user is over its limit

    Currently the function returns NULL in all cases and we loose the exact
    reason of the error. All callers of get_empty_filp() assume that the function
    can fail with ENFILE only.

    Return error through pointer. Change all callers to preserve this error code.

    [AV: cleaned up a bit, carved the get_empty_filp() part out into a separate commit
    (things remaining here deal with alloc_file()), removed pipe(2) behaviour change]

    Signed-off-by: Anatol Pomozov
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Al Viro

    Anatol Pomozov
     
  • Based on parts from Anatol's patch (the rest is the next commit).

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     

21 Dec, 2012

1 commit

  • File descriptors (even those for writing) do not hold freeze protection.
    Thus mark_files_ro() must call __mnt_drop_write() to only drop protection
    against remount read-only. Calling mnt_drop_write_file() as we do now
    results in:

    [ BUG: bad unlock balance detected! ]
    3.7.0-rc6-00028-g88e75b6 #101 Not tainted
    -------------------------------------
    kworker/1:2/79 is trying to release lock (sb_writers) at:
    [] mnt_drop_write+0x24/0x30
    but there are no more locks to release!

    Reported-by: Zdenek Kabelac
    CC: stable@vger.kernel.org
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     

10 Oct, 2012

1 commit


03 Oct, 2012

1 commit

  • Pull security subsystem updates from James Morris:
    "Highlights:

    - Integrity: add local fs integrity verification to detect offline
    attacks
    - Integrity: add digital signature verification
    - Simple stacking of Yama with other LSMs (per LSS discussions)
    - IBM vTPM support on ppc64
    - Add new driver for Infineon I2C TIS TPM
    - Smack: add rule revocation for subject labels"

    Fixed conflicts with the user namespace support in kernel/auditsc.c and
    security/integrity/ima/ima_policy.c.

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (39 commits)
    Documentation: Update git repository URL for Smack userland tools
    ima: change flags container data type
    Smack: setprocattr memory leak fix
    Smack: implement revoking all rules for a subject label
    Smack: remove task_wait() hook.
    ima: audit log hashes
    ima: generic IMA action flag handling
    ima: rename ima_must_appraise_or_measure
    audit: export audit_log_task_info
    tpm: fix tpm_acpi sparse warning on different address spaces
    samples/seccomp: fix 31 bit build on s390
    ima: digital signature verification support
    ima: add support for different security.ima data types
    ima: add ima_inode_setxattr/removexattr function and calls
    ima: add inode_post_setattr call
    ima: replace iint spinblock with rwlock/read_lock
    ima: allocating iint improvements
    ima: add appraise action keywords and default rules
    ima: integrity appraisal extension
    vfs: move ima_file_free before releasing the file
    ...

    Linus Torvalds
     

27 Sep, 2012

1 commit