08 Apr, 2020

2 commits

  • Christoph Hellwig noticed that we were doing some unnecessary
    work in orangefs_flush:

    orangefs_flush just writes out data on every close(2) call. There is
    no need to change anything about the dirty state, especially as
    orangefs doesn't treat I_DIRTY_TIMES special in any way. The code
    seems to come from partially open coding vfs_fsync.

    He sent in a patch with the above commit message and also a
    patch that was a reversion of another Orangefs patch I had
    sent upstream a while ago. I had to fix his reversion patch
    so that it would compile, which caused his "don't mess with
    I_DIRTY_TIMES" patch to fail to apply. So here I have just
    remade his patch and applied it after the fixed reversion patch.

    Signed-off-by: Mike Marshall

    Mike Marshall
     
  • Christoph Hellwig sent in a reversion of "orangefs: remember count
    when reading." because:

    ->read_iter calls can race with each other and one or
    more ->flush calls. Remove the scheme to store the read
    count in the file private data as it is completely racy and
    can cause use after free or double free conditions.

    Christoph's reversion left Orangefs unable to compile, let alone
    work. I added a patch that fixed that, but intel's kbuild test robot
    pointed out that sending Christoph's patch upstream followed by my
    patch would break bisection because of the failure to compile. So I
    have combined the reversion plus my patch... here's the commit
    message that was in my patch:

    Logically, optimal Orangefs "pages" are 4 megabytes. Reading
    large Orangefs files 4096 bytes at a time is like trying to
    kick a dead whale down the beach. Before Christoph's "Revert
    orangefs: remember count when reading." I tried to give users
    a knob whereby they could, for example, use "count" in
    read(2) or bs with dd(1) to get whatever they considered an
    appropriate amount of bytes at a time from Orangefs and fill
    as many page cache pages as they could at once.

    Without the racy code that Christoph reverted Orangefs won't
    even compile, much less work. So this replaces the logic that
    used the private file data that Christoph reverted with
    a static number of bytes to read from Orangefs.

    I ran tests like the following to determine what a
    reasonable static number of bytes might be:

    dd if=/pvfsmnt/asdf of=/dev/null count=128 bs=4194304
    dd if=/pvfsmnt/asdf of=/dev/null count=256 bs=2097152
    dd if=/pvfsmnt/asdf of=/dev/null count=512 bs=1048576
    .
    .
    .
    dd if=/pvfsmnt/asdf of=/dev/null count=4194304 bs=128

    Reads seem faster using the static number, so my "knob code"
    wasn't just racy, it wasn't even a good idea...

    Signed-off-by: Mike Marshall
    Reported-by: kbuild test robot

    Mike Marshall
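
    The dd runs in the commit message above all read the same total
    (count * bs = 512 MiB) while the block size shrinks. A minimal
    sketch of that sweep follows; it assumes the elided runs halve bs
    at every step, which the message does not state, and the helper
    names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Total bytes per dd run: 128 blocks of 4 MiB, as in the first command. */
#define TOTAL_BYTES (128ULL * 4194304ULL)

/* Return the dd count that keeps count * bs equal to TOTAL_BYTES. */
static unsigned long long count_for_bs(unsigned long long bs)
{
	assert(bs != 0 && TOTAL_BYTES % bs == 0);
	return TOTAL_BYTES / bs;
}

/*
 * Print one dd command per block size, halving bs from 4 MiB down to
 * 128 bytes (the halving step is an assumption). Returns the number
 * of commands printed.
 */
static int print_sweep(void)
{
	unsigned long long bs;
	int runs = 0;

	for (bs = 4194304ULL; bs >= 128ULL; bs /= 2) {
		printf("dd if=/pvfsmnt/asdf of=/dev/null count=%llu bs=%llu\n",
		       count_for_bs(bs), bs);
		runs++;
	}
	return runs;
}
```

    Calling print_sweep() from a small main() prints the full list of
    commands, beginning and ending with the two shown in the message.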
     

04 Dec, 2019

1 commit

  • Orangefs has no open, and orangefs checks file permissions
    on each file access. Posix requires that file permissions
    be checked on open and nowhere else. Orangefs-through-the-kernel
    needs to seem posix compliant.

    The VFS opens files, even if the filesystem provides no
    method. We can see if a file was successfully opened for
    read and/or for write by looking at file->f_mode.

    When writes are flowing from the page cache, file is no
    longer available. We can trust the VFS to have checked
    file->f_mode before writing to the page cache.

    The mode of a file might change between when it is opened
    and IO commences, or it might be created with an arbitrary mode.

    We'll make sure we don't hit EACCES during the IO stage by
    using UID 0. Some of the time we have access without changing
    to UID 0 - how to check?

    Signed-off-by: Mike Marshall

    Mike Marshall
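
    The f_mode check described above can be pictured with a userspace
    stand-in. This is only a sketch: struct sketch_file, the SKETCH_*
    flag values, and sketch_io_allowed() are hypothetical and are not
    the kernel's definitions or the code in this commit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's read/write open-mode bits. */
#define SKETCH_FMODE_READ  0x1u
#define SKETCH_FMODE_WRITE 0x2u

/* Minimal stand-in for an open file: only the mode recorded at open. */
struct sketch_file {
	unsigned int f_mode;
};

enum sketch_io { SKETCH_IO_READ, SKETCH_IO_WRITE };

/*
 * Decide whether IO in the given direction is allowed by looking only
 * at the mode recorded when the file was opened, not at the file's
 * current permission bits.
 */
static bool sketch_io_allowed(const struct sketch_file *file,
			      enum sketch_io dir)
{
	assert(file != NULL);
	if (dir == SKETCH_IO_READ)
		return (file->f_mode & SKETCH_FMODE_READ) != 0;
	return (file->f_mode & SKETCH_FMODE_WRITE) != 0;
}
```

    The point of the design is that a later chmod does not change the
    answer: access was decided once, at open.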
     

01 Aug, 2019

1 commit

  • There are 3 remaining files without an extension inside the fs docs
    dir.

    Manually convert them to ReST.

    In the case of the nfs/exporting.rst file, as the nfs docs
    aren't ported yet, I opted to convert and add a :orphan: there,
    which should be removed when it gets added into a nfs-specific
    part of the fs documentation.

    Signed-off-by: Mauro Carvalho Chehab
    Signed-off-by: Jonathan Corbet

    Mauro Carvalho Chehab
     

17 Jul, 2019

1 commit


12 Jul, 2019

2 commits


01 Jul, 2019

1 commit

  • Create a generic function to check incoming FS_IOC_SETFLAGS flag values
    and later prepare the inode for updates so that we can standardize the
    implementations that follow ext4's flag values.

    Note that the efivarfs implementation no longer fails a no-op SETFLAGS
    without CAP_LINUX_IMMUTABLE since that's the behavior in ext*.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Acked-by: David Sterba
    Reviewed-by: Bob Peterson

    Darrick J. Wong
     

04 May, 2019

13 commits

  • ->readpage looks in file->private_data to try and find out how the
    userspace program set "count" in read(2) or with "dd bs=" or whatever.

    ->readpage uses "count" and inode->i_size to calculate how much
    data Orangefs should deposit in the Orangefs shared buffer, and
    remembers which slot the data is in.

    After copying data from the Orangefs shared buffer slot into
    "the page", readpage tries to increment through the pagecache index
    and fill as many pages as it can from the extra data in the shared
    buffer. Hopefully these extra pages will soon be needed by the vfs,
    and they'll be in the pagecache already.

    Signed-off-by: Mike Marshall
    Signed-off-by: Martin Brandenburg

    Mike Marshall
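
    One way to picture the size calculation described above is the
    sketch below. It is not the code from this commit: the clamping
    order, the 4 MiB slot size, the 4096-byte page size, and all names
    are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL            /* assumed page size */
#define SKETCH_SLOT_SIZE (4ULL * 1048576)   /* assumed shared buffer slot size */

/*
 * How many bytes to request for a read starting at page_off: at least
 * one page, at most one slot, and never past the end of the file.
 */
static uint64_t sketch_read_size(uint64_t count, uint64_t i_size,
				 uint64_t page_off)
{
	uint64_t want;

	if (page_off >= i_size)
		return 0;
	want = count < SKETCH_PAGE_SIZE ? SKETCH_PAGE_SIZE : count;
	if (want > SKETCH_SLOT_SIZE)
		want = SKETCH_SLOT_SIZE;
	if (want > i_size - page_off)
		want = i_size - page_off;
	return want;
}

/* Pages beyond the first that the returned data could fill. */
static uint64_t sketch_extra_pages(uint64_t bytes)
{
	uint64_t pages;

	assert(bytes <= SKETCH_SLOT_SIZE);
	pages = (bytes + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
	return pages ? pages - 1 : 0;
}
```

    Under these assumptions a 1 MiB "count" on a large file yields one
    requested page plus 255 extra pagecache pages from a single trip
    to userspace.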
     
  • When userspace deposits more than a page of data into the shared buffer,
    we'll need to know which slot it is in when we get back to readpage
    so that we can try to use the extra data to fill some extra pages.

    Signed-off-by: Mike Marshall
    Signed-off-by: Martin Brandenburg

    Mike Marshall
     
  • Orangefs wins when it can do IO on large (up to four meg) blocks at a time,
    and loses when it has to do tiny "small io" reads and writes. Accessing
    Orangefs through the pagecache with the kernel module helps with small io,
    both reading and writing, a great deal. Readpage generally tries to fetch a
    page (four k) at a time. We'll let users use "count" (as in read(2) or
    pread(2) for example) as a knob to control how much data they get from
    Orangefs at a time and we'll try to use the data to fill extra
    pagecache pages when we get to ->readpage, hopefully resulting in
    fewer calls to readpage and Orangefs userspace.

    We need a way to remember how they set count so that we can still have
    it available when we get to ->readpage.

    - We'll use file->private_data to keep track of "count".
    We'll wrap generic_file_open with orangefs_file_open and
    initialize private_data to NULL there.

    - In ->read_iter we have access to both "count" and file, so
    we'll kmalloc some space onto file->private_data and store
    "count" there.

    - We'll kfree file->private_data each time we visit ->flush and
    reinitialize it to NULL.

    Signed-off-by: Mike Marshall
    Signed-off-by: Martin Brandenburg

    Mike Marshall
     
  • This is modeled after NFS, except our method is different. We use a
    simple timer to determine whether to invalidate the page cache. This
    is bound to perform.

    This adds a sysfs parameter cache_timeout_msecs which controls the time
    between page cache invalidations.

    Signed-off-by: Martin Brandenburg
    Signed-off-by: Mike Marshall

    Martin Brandenburg
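
    The timer scheme above amounts to comparing the current time with
    a stored expiry. A minimal sketch, assuming plain millisecond
    counters instead of whatever clock the kernel code uses; the
    struct and function names are hypothetical, and only
    cache_timeout_msecs comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Per-inode bookkeeping: the time after which cached pages are stale. */
struct sketch_inode_cache {
	uint64_t valid_until_ms;
};

/* Restart the timer, e.g. after the page cache has been (re)filled. */
static void sketch_cache_refresh(struct sketch_inode_cache *c,
				 uint64_t now_ms,
				 uint64_t cache_timeout_msecs)
{
	assert(c != NULL);
	c->valid_until_ms = now_ms + cache_timeout_msecs;
}

/* True once the timeout has elapsed and the cache should be invalidated. */
static bool sketch_cache_expired(const struct sketch_inode_cache *c,
				 uint64_t now_ms)
{
	assert(c != NULL);
	return now_ms >= c->valid_until_ms;
}
```

    A caller would check sketch_cache_expired() before trusting cached
    pages and call sketch_cache_refresh() after invalidating them.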
     
  • Go through pages and look for a consecutive writable region. After
    finding a number of consecutive writable pages or when finding that
    the next page's dirty range is not contiguous and cannot be written
    as one request, send the write to the server.

    The number of pages is determined by the client-core's buffer size.

    Signed-off-by: Martin Brandenburg
    Signed-off-by: Mike Marshall

    Martin Brandenburg
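
    The grouping rule above can be sketched in userspace as follows.
    This is an illustration only: struct sketch_range, the function
    names, and the max_pages parameter (standing in for the limit
    derived from the client-core's buffer size) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Dirty byte range of one page, as a file offset and a length. */
struct sketch_range {
	uint64_t pos;
	uint64_t len;
};

/*
 * Starting at ranges[start], count how many entries can go out as one
 * write: each must begin exactly where the previous one ends, and the
 * run is capped at max_pages entries.
 */
static size_t sketch_run_length(const struct sketch_range *r, size_t n,
				size_t start, size_t max_pages)
{
	size_t i;

	assert(r != NULL || n == 0);
	if (start >= n || max_pages == 0)
		return 0;
	for (i = start + 1; i < n && i - start < max_pages; i++) {
		if (r[i - 1].pos + r[i - 1].len != r[i].pos)
			break;
	}
	return i - start;
}

/* Number of write requests needed to send every range. */
static size_t sketch_count_requests(const struct sketch_range *r, size_t n,
				    size_t max_pages)
{
	size_t start = 0, requests = 0;

	while (start < n) {
		size_t run = sketch_run_length(r, n, start, max_pages);

		if (run == 0)
			break;
		start += run;
		requests++;
	}
	return requests;
}
```

    A page that is only partly dirty ends its run early, because the
    next page's range no longer starts where the previous one stopped.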
     
  • Attach the actual range of bytes written to plus the responsible uid/gid
    to each dirty page. This information must be sent to the server when
    the page is written out.

    Now write_begin, page_mkwrite, and invalidatepage keep up with this
    information. There are several conditions where they must write out the
    page immediately to store the new range. Two non-contiguous ranges
    cannot be stored on a single page.

    Signed-off-by: Martin Brandenburg
    Signed-off-by: Mike Marshall

    Martin Brandenburg
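
    The rule that a page carries exactly one byte range can be
    sketched like this. The struct layout and the exact conditions
    (a new write must start where the stored range ends and come from
    the same uid/gid) are assumptions for illustration, not a copy of
    the commit's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* What is remembered for one dirty page. */
struct sketch_write_range {
	uint64_t pos;      /* file offset where the dirty bytes start */
	uint64_t len;      /* number of dirty bytes */
	unsigned int uid;  /* writer responsible for the range */
	unsigned int gid;
};

/*
 * Try to grow the stored range with a new write. Returns false when
 * the new write cannot be folded in, in which case the caller would
 * have to write the page out first and then store the new range.
 */
static bool sketch_try_extend(struct sketch_write_range *wr,
			      uint64_t pos, uint64_t len,
			      unsigned int uid, unsigned int gid)
{
	assert(wr != NULL);
	if (wr->pos + wr->len != pos)
		return false;   /* would leave two separate ranges */
	if (wr->uid != uid || wr->gid != gid)
		return false;   /* different responsible writer */
	wr->len += len;
	return true;
}
```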
     
  • Without this, an fsync call is sent to the server even if no data
    changed. This resulted in a rather severe (50%) performance regression
    under certain metadata-heavy workloads.

    In the past, everything was direct IO. Nothing happened on a close call.
    An explicit fsync call would send an fsync request to the server which
    in turn fsynced the underlying file.

    Now there are cached writes. Then fsync began writing out dirty pages
    in addition to making an fsync request to the server, and close began
    calling fsync.

    With this commit, close only writes out dirty pages, and does not make
    the fsync request.

    Signed-off-by: Martin Brandenburg
    Signed-off-by: Mike Marshall

    Martin Brandenburg
     
  • direct_IO was the only caller and all direct_IO did was call it,
    so there's no use in having the code spread out into so many functions.

    Signed-off-by: Martin Brandenburg
    Signed-off-by: Mike Marshall

    Martin Brandenburg
     
  • Signed-off-by: Martin Brandenburg
    Signed-off-by: Mike Marshall

    Martin Brandenburg
     
  • Now orangefs_inode_getattr fills from cache if an inode has dirty pages.

    Also, if attr_valid and dirty pages and !flags, we spin on inode
    writeback before returning if the pages are still dirty afterward.
    Should it be the other way around?

    Signed-off-by: Martin Brandenburg
    Signed-off-by: Mike Marshall

    Martin Brandenburg
     
  • Remove orangefs_inode_read. It was used by readpage. Calling
    wait_for_direct_io directly serves the purpose just as well. There is
    now no check of the bufmap size in the readpage path. There are already
    other places the bufmap size is assumed to be greater than PAGE_SIZE.

    Important to call truncate_inode_pages now in the write path so a
    subsequent read sees the new data.

    Signed-off-by: Martin Brandenburg
    Signed-off-by: Mike Marshall

    Martin Brandenburg
     
  • This should be a no-op now, but once inode writeback works, it'll be
    necessary to have the correct attribute in the dirty inode.

    Previously the attribute fetch timeout was marked invalid and the server
    provided the updated attribute. When the inode is dirty, the server
    cannot be consulted since it does not yet know the pending setattr.

    Signed-off-by: Martin Brandenburg
    Signed-off-by: Mike Marshall

    Martin Brandenburg
     
  • No need to store the received mask. It is either STATX_BASIC_STATS or
    STATX_BASIC_STATS & ~STATX_SIZE. If STATX_SIZE is requested, the cache
    is bypassed anyway, so the cached mask is unnecessary to decide whether
    to do a real getattr.

    This is a change. Previously a getattr would want size and use the
    cached size. All of the in-kernel callers that wanted size did not want
    a cached size. Now a getattr cannot use the cached size if it wants
    size at all.

    Signed-off-by: Martin Brandenburg
    Signed-off-by: Mike Marshall

    Martin Brandenburg
     

21 Feb, 2019

1 commit


15 Aug, 2018

1 commit

  • Use new return type vm_fault_t for fault handler. For now,
    this is just documenting that the function returns a VM_FAULT
    value rather than an errno. Once all instances are converted,
    vm_fault_t will become a distinct type.

    See the following
    commit 1c8f422059ae ("mm: change return type to vm_fault_t")

    Fixed checkpatch.pl warning.

    Signed-off-by: Souptick Joarder
    Signed-off-by: Mike Marshall

    Souptick Joarder
     

02 Jun, 2018

2 commits


04 Apr, 2018

1 commit

  • Must retrieve size before running filemap_fault so the kernel has
    an up-to-date size.

    This should have been caught by xfstests generic/246, but it was masked
    by orangefs_new_inode, which set i_size to PAGE_SIZE. When nothing
    caused a getattr prior to a pagefault, i_size was still PAGE_SIZE.
    Since xfstests only read 10 bytes, it did not catch this bug.

    When orangefs_new_inode was modified to perform a getattr instead,
    i_size was set to zero, as it was a newly created file. Then
    orangefs_file_write_iter did NOT set i_size. Instead it invalidated the
    attribute cache, which should have caused the next caller to retrieve
    i_size. But the fault handler did not know it was supposed to retrieve
    i_size. So during xfstests, i_size was still zero, and filemap_fault
    returned VM_FAULT_SIGBUS.

    Fixes xfstests generic/452.

    Signed-off-by: Martin Brandenburg
    Signed-off-by: Mike Marshall

    Martin Brandenburg
     

02 Apr, 2018

1 commit


26 Jan, 2018

1 commit

  • After do_readv_writev, the inode cache is invalidated anyway, so i_size
    will never be read. It will be fetched from the server which will also
    know about updates from other machines.

    Fixes deadlock on 32-bit SMP.

    See https://marc.info/?l=linux-fsdevel&m=151268557427760&w=2

    Signed-off-by: Martin Brandenburg
    Cc: Al Viro
    Cc: Mike Marshall
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Martin Brandenburg
     

14 Nov, 2017

1 commit

  • The previous code path was to mark the inode dirty, let
    orangefs_inode_dirty set a flag in our private inode, then later during
    inode release call orangefs_flush_inode which notices the flag and
    writes the atime out.

    The code path worked almost identically for mtime, ctime, and mode
    except that those flags are set explicitly and not as side effects of
    dirty.

    Now orangefs_flush_inode is removed. Marking an inode dirty does not
    imply an atime update. Any place where flags were set before is now
    an explicit call to orangefs_inode_setattr. Since OrangeFS does not
    utilize inode writeback, the attribute change should be written out
    immediately.

    Fixes generic/120.

    In namei.c, there are several places where the directory mtime and ctime
    are set, but only the mtime is sent to the server. These don't seem
    right, but I've left them as is for now.

    Signed-off-by: Martin Brandenburg
    Signed-off-by: Mike Marshall

    Martin Brandenburg
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

15 Sep, 2017

1 commit


06 May, 2017

1 commit

  • Pull orangefs updates from Mike Marshall:
    "Orangefs cleanups, fixes and statx support.

    Some cleanups:

    - remove unused get_fsid_from_ino
    - fix bounds check for listxattr
    - clean up oversize xattr validation
    - do not set getattr_time on orangefs_lookup
    - return from orangefs_devreq_read quickly if possible
    - do not wait for timeout if umounting
    - handle zero size write in debugfs

    Bug fixes:

    - do not check possibly stale size on truncate
    - ensure the userspace component is unmounted if mount fails
    - total reimplementation of dir.c

    New feature:

    - implement statx

    The new implementation of dir.c is kind of a big deal, all new code.
    It has been posted to fs-devel during the previous rc period, we
    didn't get much review or feedback from there, but it has been
    reviewed very heavily here, so much so that we have two entire
    versions of the reimplementation.

    Not only does the new implementation fix some xfstests, but it passes
    all the new tests we made here that involve seeking and rewinding and
    giant directories and long file names. The new dir code has three
    patches itself:

    - skip forward to the next directory entry if seek is short
    - invalidate stored directory on seek
    - count directory pieces correctly"

    * tag 'for-linus-4.12-ofs-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
    orangefs: count directory pieces correctly
    orangefs: invalidate stored directory on seek
    orangefs: skip forward to the next directory entry if seek is short
    orangefs: handle zero size write in debugfs
    orangefs: do not wait for timeout if umounting
    orangefs: return from orangefs_devreq_read quickly if possible
    orangefs: ensure the userspace component is unmounted if mount fails
    orangefs: do not check possibly stale size on truncate
    orangefs: implement statx
    orangefs: remove ORANGEFS_READDIR macros
    orangefs: support very large directories
    orangefs: support llseek on directories
    orangefs: rewrite readdir to fix several bugs
    orangefs: do not set getattr_time on orangefs_lookup
    orangefs: clean up oversize xattr validation
    orangefs: fix bounds check for listxattr
    orangefs: remove unused get_fsid_from_ino

    Linus Torvalds
     

27 Apr, 2017

1 commit

  • Fortunately OrangeFS has had a getattr request mask for a long time.

    The server basically has two difficulty levels for attributes. Fetching
    any attribute except size requires communicating with the metadata
    server for that handle. Since all the attributes are right there, it
    makes sense to return them all. Fetching the size requires
    communicating with every I/O server (that the file is distributed
    across). Therefore if asked for anything except size, get everything
    except size, and if asked for size, get everything.

    Signed-off-by: Martin Brandenburg
    Signed-off-by: Mike Marshall

    Martin Brandenburg
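
    The "anything except size" versus "size" rule above reduces to a
    small mask decision. A minimal sketch: the SKETCH_* constants are
    local stand-ins chosen to mirror the statx mask bits, and
    sketch_fetch_mask() is a hypothetical name, not a function from
    this commit.

```c
#include <assert.h>

/* Local stand-ins for the statx request mask bits. */
#define SKETCH_STATX_SIZE        0x200u
#define SKETCH_STATX_BASIC_STATS 0x7ffu

static_assert((SKETCH_STATX_BASIC_STATS & SKETCH_STATX_SIZE) != 0,
	      "size must be one of the basic stats");

/*
 * Decide what to fetch for a request: asking for size means talking
 * to every I/O server anyway, so fetch everything; otherwise fetch
 * everything except size from the metadata server alone.
 */
static unsigned int sketch_fetch_mask(unsigned int request_mask)
{
	if (request_mask & SKETCH_STATX_SIZE)
		return SKETCH_STATX_BASIC_STATS;
	return SKETCH_STATX_BASIC_STATS & ~SKETCH_STATX_SIZE;
}
```

    Either way a single request returns more than was asked for, which
    is what lets later getattr calls be served from cache.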
     

22 Apr, 2017

1 commit


05 Dec, 2016

1 commit


25 Oct, 2016

1 commit

  • Replace wrong use of file->f_path.dentry->d_inode with file_inode(file).
    In case orangefs ever finds itself as an overlayfs layer, it would want
    to get its own inode and not overlayfs's inode.

    DISCLAIMER: I did not test this patch because I do not know how to set up
    an orangefs mount.

    Signed-off-by: Amir Goldstein
    Signed-off-by: Mike Marshall

    Amir Goldstein
     

11 Oct, 2016

2 commits

  • Pull more vfs updates from Al Viro:
    ">rename2() work from Miklos + current_time() from Deepa"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: Replace current_fs_time() with current_time()
    fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
    fs: Replace CURRENT_TIME with current_time() for inode timestamps
    fs: proc: Delete inode time initializations in proc_alloc_inode()
    vfs: Add current_time() api
    vfs: add note about i_op->rename changes to porting
    fs: rename "rename2" i_op to "rename"
    vfs: remove unused i_op->rename
    fs: make remaining filesystems use .rename2
    libfs: support RENAME_NOREPLACE in simple_rename()
    fs: support RENAME_NOREPLACE for local filesystems
    ncpfs: fix unused variable warning

    Linus Torvalds
     
  • Pull misc vfs updates from Al Viro:
    "Assorted misc bits and pieces.

    There are several single-topic branches left after this (rename2
    series from Miklos, current_time series from Deepa Dinamani, xattr
    series from Andreas, uaccess stuff from me) and I'd prefer to
    send those separately"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits)
    proc: switch auxv to use of __mem_open()
    hpfs: support FIEMAP
    cifs: get rid of unused arguments of CIFSSMBWrite()
    posix_acl: uapi header split
    posix_acl: xattr representation cleanups
    fs/aio.c: eliminate redundant loads in put_aio_ring_file
    fs/internal.h: add const to ns_dentry_operations declaration
    compat: remove compat_printk()
    fs/buffer.c: make __getblk_slow() static
    proc: unsigned file descriptors
    fs/file: more unsigned file descriptors
    fs: compat: remove redundant check of nr_segs
    cachefiles: Fix attempt to read i_blocks after deleting file [ver #2]
    cifs: don't use memcpy() to copy struct iov_iter
    get rid of separate multipage fault-in primitives
    fs: Avoid premature clearing of capabilities
    fs: Give dentry to inode_change_ok() instead of inode
    fuse: Propagate dentry down to inode_change_ok()
    ceph: Propagate dentry down to inode_change_ok()
    xfs: Propagate dentry down to inode_change_ok()
    ...

    Linus Torvalds
     

29 Sep, 2016

1 commit

  • Pull in an OrangeFS branch containing miscellaneous improvements.

    - clean up debugfs globals
    - remove dead code in sysfs
    - reorganize duplicated sysfs attribute structs
    - consolidate sysfs show and store functions
    - remove duplicated sysfs_ops structures
    - describe organization of sysfs
    - make devreq_mutex static
    - g_orangefs_stats -> orangefs_stats for consistency
    - rename most remaining global variables

    Martin Brandenburg
     

28 Sep, 2016

1 commit

  • CURRENT_TIME macro is not appropriate for filesystems as it
    doesn't use the right granularity for filesystem timestamps.
    Use current_time() instead.

    CURRENT_TIME is also not y2038 safe.

    This is also in preparation for the patch that transitions
    vfs timestamps to use 64 bit time and hence make them
    y2038 safe. As part of the effort current_time() will be
    extended to do range checks. Hence, it is necessary for all
    file system timestamps to use current_time(). Also,
    current_time() will be transitioned along with vfs to be
    y2038 safe.

    Note that whenever a single call to current_time() is used
    to change timestamps in different inodes, it is because they
    share the same time granularity.

    Signed-off-by: Deepa Dinamani
    Reviewed-by: Arnd Bergmann
    Acked-by: Felipe Balbi
    Acked-by: Steven Whitehouse
    Acked-by: Ryusuke Konishi
    Acked-by: David Sterba
    Signed-off-by: Al Viro

    Deepa Dinamani
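
    The granularity point above can be illustrated with a userspace
    sketch that rounds a timestamp down to a filesystem's timestamp
    granularity. struct sketch_time and sketch_trunc() are
    hypothetical names; this shows the idea only and is not the
    kernel's current_time() implementation.

```c
#include <assert.h>

#define SKETCH_NSEC_PER_SEC 1000000000L

/* Simple seconds + nanoseconds timestamp for the sketch. */
struct sketch_time {
	long long sec;
	long nsec;
};

/*
 * Round t down to a multiple of gran_ns nanoseconds. A granularity of
 * 1 keeps full precision; a granularity of one second drops the
 * nanosecond part entirely.
 */
static struct sketch_time sketch_trunc(struct sketch_time t, long gran_ns)
{
	assert(gran_ns >= 1 && gran_ns <= SKETCH_NSEC_PER_SEC);
	if (gran_ns == SKETCH_NSEC_PER_SEC)
		t.nsec = 0;
	else if (gran_ns > 1)
		t.nsec -= t.nsec % gran_ns;
	return t;
}
```

    A filesystem that stores microseconds would pass 1000 here, so a
    timestamp read back from disk compares equal to the one held in
    memory.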