21 Aug, 2020

1 commit

  • commit ec95f1dedc9c64ac5a8b0bdb7c276936c70fdedd upstream.

    Christoph Hellwig sent in a reversion of "orangefs: remember count
    when reading." because:

    ->read_iter calls can race with each other and one or
    more ->flush calls. Remove the scheme to store the read
    count in the file private data as it is completely racy and
    can cause use after free or double free conditions

    Christoph's reversion caused Orangefs not to work or to compile. I
    added a patch that fixed that, but Intel's kbuild test robot pointed
    out that sending Christoph's patch followed by my patch upstream
    would break bisection because of the failure to compile. So I have
    combined the reversion plus my patch... here's the commit message
    that was in my patch:

    Logically, optimal Orangefs "pages" are 4 megabytes. Reading
    large Orangefs files 4096 bytes at a time is like trying to
    kick a dead whale down the beach. Before Christoph's "Revert
    orangefs: remember count when reading." I tried to give users
    a knob whereby they could, for example, use "count" in
    read(2) or bs with dd(1) to get whatever they considered an
    appropriate amount of bytes at a time from Orangefs and fill
    as many page cache pages as they could at once.

    Without the racy code that Christoph reverted Orangefs won't
    even compile, much less work. So this replaces the logic that
    used the private file data that Christoph reverted with
    a static number of bytes to read from Orangefs.

    I ran tests like the following to determine what a
    reasonable static number of bytes might be:

    dd if=/pvfsmnt/asdf of=/dev/null count=128 bs=4194304
    dd if=/pvfsmnt/asdf of=/dev/null count=256 bs=2097152
    dd if=/pvfsmnt/asdf of=/dev/null count=512 bs=1048576
    .
    .
    .
    dd if=/pvfsmnt/asdf of=/dev/null count=4194304 bs=128

    Reads seem faster using the static number, so my "knob code"
    wasn't just racy, it wasn't even a good idea...

    Signed-off-by: Mike Marshall
    Reported-by: kbuild test robot
    Signed-off-by: Greg Kroah-Hartman

    Mike Marshall
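
The dd runs listed above all move the same total amount of data and differ only in block size. A minimal sketch of that arithmetic, assuming the 4096-byte page size mentioned in the commit message; the helper names `dd_total_bytes` and `pages_per_block` are made up for illustration and are not part of Orangefs:

```c
#include <stdint.h>

/* Page size assumed by this sketch (the commit message mentions 4096). */
#define SKETCH_PAGE_SIZE 4096u

/* Total bytes transferred by "dd count=<count> bs=<bs>". */
static uint64_t dd_total_bytes(uint64_t count, uint64_t bs)
{
	return count * bs;
}

/* Number of page cache pages one block of block_bytes can fill (rounded up). */
static uint64_t pages_per_block(uint64_t block_bytes)
{
	return (block_bytes + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}
```

Under these assumptions each run transfers 512 MiB, and one 4 MiB Orangefs-sized read covers 1024 pages, which is why reading 4096 bytes at a time costs so many more round trips.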
     

24 Feb, 2020

1 commit

  • [ Upstream commit 9f198a2ac543eaaf47be275531ad5cbd50db3edf ]

    if seq_file .next function does not change position index,
    read after some lseek can generate unexpected output.

    https://bugzilla.kernel.org/show_bug.cgi?id=206283
    Signed-off-by: Vasily Averin
    Signed-off-by: Mike Marshall
    Signed-off-by: Sasha Levin

    Vasily Averin
     

20 Sep, 2019

1 commit

  • Pull orangefs updates from Mike Marshall:
    "A fix and a cleanup.

    The fix: way back in the stone age (2003) mode was set to the magic
    number "755" in what is now fs/orangefs/namei.c(orangefs_symlink).
    Łukasz Wrochna reported it and Artur Świgoń sent in a patch to change
    it to octal. Maybe it shouldn't be a magic number at all but rather
    something like "S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH"...

    cleanup: Colin Ian King found a redundant assignment and sent in a
    patch to remove it"

    [ And no, octal numbers for permissions are a lot more legible than a
    binary 'or' of some line noise macros. So 0755 is preferred over
    trying to spell it out using "helpful" macros - Linus ]

    * tag 'for-linus-5.4-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
    orangefs: remove redundant assignment to err
    orangefs: Add octal zero prefix

    Linus Torvalds
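
The difference between the magic number "755" and the octal 0755 can be checked in a few lines of userspace C. This is only an illustrative sketch; the constant and function names are hypothetical and do not appear in fs/orangefs/namei.c:

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Decimal 755, as written by mistake; this is 01363 in octal. */
static const mode_t MODE_DECIMAL_TYPO = 755;

/* Octal 0755, i.e. rwxr-xr-x (493 in decimal). */
static const mode_t MODE_OCTAL = 0755;

/* The macro spelling quoted in the pull request text. */
static const mode_t MODE_MACROS =
	S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH;

/* True when mode grants exactly rwxr-xr-x and nothing else. */
static bool is_rwxr_xr_x(mode_t mode)
{
	return mode == MODE_MACROS;
}
```

Here `is_rwxr_xr_x(MODE_OCTAL)` holds while `is_rwxr_xr_x(MODE_DECIMAL_TYPO)` does not, because decimal 755 sets a different bit pattern.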
     

13 Sep, 2019

2 commits


01 Aug, 2019

2 commits

  • This file has its own proper style, except that, after a while,
    the coding style gets violated and whitespace is placed in
    different ways.

    As Sphinx and ReST are very sensitive to whitespace differences,
    I had to choose whether each entry after required/mandatory/... fields
    should start with zero spaces or with a tab. I opted to start them
    all from the zero position, in order to avoid needing to break lines
    with more than 80 columns, which would make review harder.

    Most of the other changes to porting.rst were made to use a unified
    notation which works nicely as a text file while also producing good
    HTML output after being parsed.

    Signed-off-by: Mauro Carvalho Chehab
    Signed-off-by: Jonathan Corbet

    Mauro Carvalho Chehab
     
  • There are 3 remaining files without an extension inside the fs docs
    dir.

    Manually convert them to ReST.

    In the case of the nfs/exporting.rst file, as the nfs docs
    aren't ported yet, I opted to convert and add a :orphan: there,
    which should be removed when it gets added into a nfs-specific
    part of the fs documentation.

    Signed-off-by: Mauro Carvalho Chehab
    Signed-off-by: Jonathan Corbet

    Mauro Carvalho Chehab
     

17 Jul, 2019

1 commit


13 Jul, 2019

1 commit

  • Pull common SETFLAGS/FSSETXATTR parameter checking from Darrick Wong:
    "Here's a patch series that sets up common parameter checking functions
    for the FS_IOC_SETFLAGS and FS_IOC_FSSETXATTR ioctl implementations.

    The goal here is to reduce the amount of behavioral variance between
    the filesystems where those ioctls originated (ext2 and XFS,
    respectively) and everybody else.

    - Standardize parameter checking for the SETFLAGS and FSSETXATTR
    ioctls (which were the file attribute setters for ext4 and xfs and
    have now been hoisted to the vfs)

    - Only allow the DAX flag to be set on files and directories"

    * tag 'vfs-fix-ioctl-checking-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
    vfs: only allow FSSETXATTR to set DAX flag on files and dirs
    vfs: teach vfs_ioc_fssetxattr_check to check extent size hints
    vfs: teach vfs_ioc_fssetxattr_check to check project id info
    vfs: create a generic checking function for FS_IOC_FSSETXATTR
    vfs: create a generic checking and prep function for FS_IOC_SETFLAGS

    Linus Torvalds
     

12 Jul, 2019

2 commits


04 Jul, 2019

1 commit

  • Stephen writes:
    After merging the driver-core tree, today's linux-next build (x86_64
    allmodconfig) produced this warning:

    fs/orangefs/orangefs-debugfs.c: In function 'orangefs_debugfs_init':
    fs/orangefs/orangefs-debugfs.c:193:1: warning: label 'out' defined but not used [-Wunused-label]
    out:
    ^~~
    fs/orangefs/orangefs-debugfs.c: In function 'orangefs_kernel_debug_init':
    fs/orangefs/orangefs-debugfs.c:204:17: warning: unused variable 'ret' [-Wunused-variable]
    struct dentry *ret;
    ^~~
    Fix this up and change the return type of the function to void as it can
    not fail, which cleans up some more code and variables as well.

    Cc: Mike Marshall
    Cc: Martin Brandenburg
    Cc: devel@lists.orangefs.org
    Reported-by: Stephen Rothwell
    Fixes: f095adba36bb ("orangefs: no need to check return value of debugfs_create functions")
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

03 Jul, 2019

1 commit

  • When calling debugfs functions, there is no need to ever check the
    return value. The function can work or not, but the code logic should
    never do something different based on this.

    Cc: Mike Marshall
    Cc: Martin Brandenburg
    Cc: devel@lists.orangefs.org
    Signed-off-by: Greg Kroah-Hartman
    Link: https://lore.kernel.org/r/20190612152204.GA17511@kroah.com
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

01 Jul, 2019

1 commit

  • Create a generic function to check incoming FS_IOC_SETFLAGS flag values
    and later prepare the inode for updates so that we can standardize the
    implementations that follow ext4's flag values.

    Note that the efivarfs implementation no longer fails a no-op SETFLAGS
    without CAP_LINUX_IMMUTABLE since that's the behavior in ext*.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Acked-by: David Sterba
    Reviewed-by: Bob Peterson

    Darrick J. Wong
     

21 May, 2019

2 commits


15 May, 2019

1 commit

  • To facilitate additional options to get_user_pages_fast() change the
    singular write parameter to be gup_flags.

    This patch does not change any functionality. New functionality will
    follow in subsequent patches.

    Some of the get_user_pages_fast() call sites were unchanged because they
    already passed FOLL_WRITE or 0 for the write parameter.

    NOTE: It was suggested to change the ordering of the get_user_pages_fast()
    arguments to ensure that callers were converted. This breaks the current
    GUP call site convention of having the returned pages be the final
    parameter. So the suggestion was rejected.

    Link: http://lkml.kernel.org/r/20190328084422.29911-4-ira.weiny@intel.com
    Link: http://lkml.kernel.org/r/20190317183438.2057-4-ira.weiny@intel.com
    Signed-off-by: Ira Weiny
    Reviewed-by: Mike Marshall
    Cc: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Dan Williams
    Cc: "David S. Miller"
    Cc: Heiko Carstens
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: Jason Gunthorpe
    Cc: John Hubbard
    Cc: "Kirill A. Shutemov"
    Cc: Martin Schwidefsky
    Cc: Michal Hocko
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Rich Felker
    Cc: Thomas Gleixner
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ira Weiny
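
The conversion described above amounts to turning a boolean "write" argument into a flags word. A userspace sketch of that mapping follows; `SKETCH_FOLL_WRITE` mirrors the kernel's `FOLL_WRITE` value but is redefined here so the snippet stands alone, and `write_to_gup_flags` is a hypothetical helper, not a kernel function:

```c
/* Stand-in for the kernel's FOLL_WRITE flag (0x01). */
#define SKETCH_FOLL_WRITE 0x01u

/*
 * Map the old singular "write" parameter to a gup_flags value:
 * any non-zero write request becomes FOLL_WRITE, zero stays zero.
 */
static unsigned int write_to_gup_flags(int write)
{
	return write ? SKETCH_FOLL_WRITE : 0u;
}
```

A caller that used to pass 1 would pass `SKETCH_FOLL_WRITE`, and a caller that passed 0 would be unchanged, which matches the note that some call sites needed no edit.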
     

10 May, 2019

1 commit

  • Pull orangefs updates from Mike Marshall:
    "This includes one fix and our "Orangefs through the pagecache" patch
    series which greatly improves our small IO performance and helps us
    pass more xfstests than before.

    Fix:
    - orangefs: truncate before updating size

    Pagecache series:
    - all the rest"

    * tag 'for-linus-5.2-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux: (23 commits)
    orangefs: truncate before updating size
    orangefs: copy Orangefs-sized blocks into the pagecache if possible.
    orangefs: pass slot index back to readpage.
    orangefs: remember count when reading.
    orangefs: add orangefs_revalidate_mapping
    orangefs: implement writepages
    orangefs: write range tracking
    orangefs: avoid fsync service operation on flush
    orangefs: skip inode writeout if nothing to write
    orangefs: move do_readv_writev to direct_IO
    orangefs: do not return successful read when the client-core disappeared
    orangefs: implement writepage
    orangefs: migrate to generic_file_read_iter
    orangefs: service ops done for writeback are not killable
    orangefs: remove orangefs_readpages
    orangefs: reorganize setattr functions to track attribute changes
    orangefs: let setattr write to cached inode
    orangefs: set up and use backing_dev_info
    orangefs: hold i_lock during inode_getattr
    orangefs: update attributes rather than relying on server
    ...

    Linus Torvalds
     

04 May, 2019

22 commits

  • Otherwise we race with orangefs_writepage/orangefs_writepages,
    which do not expect i_size < page_offset.

    Fixes xfstests generic/129.

    Signed-off-by: Martin Brandenburg
    Signed-off-by: Mike Marshall

    Martin Brandenburg
     
  • ->readpage looks in file->private_data to try and find out how the
    userspace program set "count" in read(2) or with "dd bs=" or whatever.

    ->readpage uses "count" and inode->i_size to calculate how much
    data Orangefs should deposit in the Orangefs shared buffer, and
    remembers which slot the data is in.

    After copying data from the Orangefs shared buffer slot into
    "the page", readpage tries to increment through the pagecache index
    and fill as many pages as it can from the extra data in the shared
    buffer. Hopefully these extra pages will soon be needed by the vfs,
    and they'll be in the pagecache already.

    Signed-off-by: Mike Marshall
    Signed-off-by: Martin Brandenburg

    Mike Marshall
     
  • When userspace deposits more than a page of data into the shared buffer,
    we'll need to know which slot it is in when we get back to readpage
    so that we can try to use the extra data to fill some extra pages.

    Signed-off-by: Mike Marshall
    Signed-off-by: Martin Brandenburg

    Mike Marshall
     
  • Orangefs wins when it can do IO on large (up to four meg) blocks at a time,
    and loses when it has to do tiny "small io" reads and writes. Accessing
    Orangefs through the pagecache with the kernel module helps with small io,
    both reading and writing, a great deal. Readpage generally tries to fetch a
    page (four k) at a time. We'll let users use "count" (as in read(2) or
    pread(2) for example) as a knob to control how much data they get from
    Orangefs at a time and we'll try to use the data to fill extra
    pagecache pages when we get to ->readpage, hopefully resulting in
    fewer calls to readpage and Orangefs userspace.

    We need a way to remember how they set count so that we can still have
    it available when we get to ->readpage.

    - We'll use file->private_data to keep track of "count".
    We'll wrap generic_file_open with orangefs_file_open and
    initialize private_data to NULL there.

    - In ->read_iter we have access to both "count" and file, so
    we'll kmalloc some space onto file->private_data and store
    "count" there.

    - We'll kfree file->private_data each time we visit ->flush and
    reinitialize it to NULL.

    Signed-off-by: Mike Marshall
    Signed-off-by: Martin Brandenburg

    Mike Marshall
     
  • This is modeled after NFS, except our method is different. We use a
    simple timer to determine whether to invalidate the page cache. This
    is bound to perform.

    This adds a sysfs parameter cache_timeout_msecs which controls the time
    between page cache invalidations.

    Signed-off-by: Martin Brandenburg
    Signed-off-by: Mike Marshall

    Martin Brandenburg
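
The timer-based check described above can be sketched in userspace C. This is an assumption-laden illustration, not the kernel code: the struct and function names are hypothetical, time is passed in as plain milliseconds rather than jiffies, and the actual dropping of cached pages is left as a comment.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-inode state: the time until which cached pages are trusted. */
struct cached_mapping {
	uint64_t valid_until_ms;
};

/* Restart the timer after the cache has been filled or refreshed. */
static void mapping_mark_valid(struct cached_mapping *m, uint64_t now_ms,
			       uint64_t cache_timeout_msecs)
{
	m->valid_until_ms = now_ms + cache_timeout_msecs;
}

/*
 * Returns true when the timer has expired and the page cache would be
 * invalidated; false when the cached pages may still be used.
 */
static bool mapping_revalidate(struct cached_mapping *m, uint64_t now_ms,
			       uint64_t cache_timeout_msecs)
{
	if (now_ms < m->valid_until_ms)
		return false;	/* still fresh, keep the cached pages */

	/* A real filesystem would drop its cached pages here. */
	mapping_mark_valid(m, now_ms, cache_timeout_msecs);
	return true;
}
```

The only tunable is the timeout, which corresponds to the cache_timeout_msecs parameter named in the commit message.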
     
  • Go through pages and look for a consecutive writable region. After
    finding a number of consecutive writable pages or when finding that
    the next page's dirty range is not contiguous and cannot be written
    as one request, send the write to the server.

    The number of pages is determined by the client-core's buffer size.

    Signed-off-by: Martin Brandenburg
    Signed-off-by: Mike Marshall

    Martin Brandenburg
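
The grouping step can be sketched as follows. This is a simplified userspace model, not the orangefs_writepages implementation: `struct dirty_page`, `joins`, and `plan_writes` are hypothetical names, the page size is assumed to be 4096, and `max_pages` stands in for the limit derived from the client-core's buffer size.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u

/* One dirty page and the byte range within it that needs writing. */
struct dirty_page {
	unsigned long index;	/* page index in the file */
	unsigned int off;	/* first dirty byte within the page */
	unsigned int len;	/* number of dirty bytes */
};

/* True if next continues prev without a gap, so both fit in one request. */
static int joins(const struct dirty_page *prev, const struct dirty_page *next)
{
	return next->index == prev->index + 1 &&
	       prev->off + prev->len == SKETCH_PAGE_SIZE &&
	       next->off == 0;
}

/*
 * Walk the sorted dirty pages and split them into write requests.
 * req_pages[i] receives the number of pages in request i; the return
 * value is the number of requests. req_pages must have room for n entries.
 */
static size_t plan_writes(const struct dirty_page *pages, size_t n,
			  size_t max_pages, size_t *req_pages)
{
	size_t nreq = 0, i = 0;

	while (i < n) {
		size_t run = 1;

		/* Extend the run while the next page is contiguous and fits. */
		while (i + run < n && run < max_pages &&
		       joins(&pages[i + run - 1], &pages[i + run]))
			run++;

		req_pages[nreq++] = run;	/* "send the write to the server" */
		i += run;
	}
	return nreq;
}
```

A run ends either because the buffer limit is reached or because the next page's dirty range does not line up with the previous one.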
     
  • Attach the actual range of bytes written to plus the responsible uid/gid
    to each dirty page. This information must be sent to the server when
    the page is written out.

    Now write_begin, page_mkwrite, and invalidatepage keep up with this
    information. There are several conditions where they must write out the
    page immediately to store the new range. Two non-contiguous ranges
    cannot be stored on a single page.

    Signed-off-by: Martin Brandenburg
    Signed-off-by: Mike Marshall

    Martin Brandenburg
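
A sketch of the per-page bookkeeping this describes, in userspace C. The struct and function names are hypothetical, and treating a uid/gid change as a reason to write the page out first is an assumption made for this illustration; the commit message only says there are "several conditions".

```c
#include <stdbool.h>

/* Hypothetical record attached to a dirty page. */
struct write_range {
	bool dirty;
	unsigned int pos;	/* first dirty byte within the page */
	unsigned int len;	/* number of dirty bytes */
	unsigned int uid, gid;	/* who is responsible for the write */
};

/*
 * Record a new write on the page. Returns true if the page had to be
 * written out first because the new range could not be merged.
 */
static bool record_write(struct write_range *wr, unsigned int pos,
			 unsigned int len, unsigned int uid, unsigned int gid)
{
	bool flushed = false;

	if (wr->dirty) {
		bool same_owner = wr->uid == uid && wr->gid == gid;
		/* Overlapping or directly adjacent ranges can be merged. */
		bool touching = pos <= wr->pos + wr->len &&
				wr->pos <= pos + len;

		if (same_owner && touching) {
			unsigned int end = wr->pos + wr->len;

			if (pos + len > end)
				end = pos + len;
			if (pos < wr->pos)
				wr->pos = pos;
			wr->len = end - wr->pos;
			return false;
		}
		flushed = true;	/* stand-in for writing the page out */
	}

	wr->dirty = true;
	wr->pos = pos;
	wr->len = len;
	wr->uid = uid;
	wr->gid = gid;
	return flushed;
}
```

Only one range is ever stored, so a write that leaves a gap forces the old range out before the new one is recorded.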
     
  • Without this, an fsync call is sent to the server even if no data
    changed. This resulted in a rather severe (50%) performance regression
    under certain metadata-heavy workloads.

    In the past, everything was direct IO. Nothing happened on a close call.
    An explicit fsync call would send an fsync request to the server which
    in turn fsynced the underlying file.

    Now there are cached writes. Then fsync began writing out dirty pages
    in addition to making an fsync request to the server, and close began
    calling fsync.

    With this commit, close only writes out dirty pages, and does not make
    the fsync request.

    Signed-off-by: Martin Brandenburg
    Signed-off-by: Mike Marshall

    Martin Brandenburg
     
  • Would happen if an inode is dirty but whatever happened is not something
    that can be written out to OrangeFS.

    Signed-off-by: Martin Brandenburg
    Signed-off-by: Mike Marshall

    Martin Brandenburg
     
  • direct_IO was the only caller and all direct_IO did was call it,
    so there's no use in having the code spread out into so many functions.

    Signed-off-by: Martin Brandenburg
    Signed-off-by: Mike Marshall

    Martin Brandenburg
     
  • Signed-off-by: Martin Brandenburg
    Signed-off-by: Mike Marshall

    Martin Brandenburg
     
  • Now orangefs_inode_getattr fills from cache if an inode has dirty pages.

    Also, if attr_valid is set, there are dirty pages, and !flags, we spin
    on inode writeback before returning if pages are still dirty afterwards.
    Should it be the other way around?

    Signed-off-by: Martin Brandenburg
    Signed-off-by: Mike Marshall

    Martin Brandenburg
     
  • Remove orangefs_inode_read. It was used by readpage. Calling
    wait_for_direct_io directly serves the purpose just as well. There is
    now no check of the bufmap size in the readpage path. There are already
    other places the bufmap size is assumed to be greater than PAGE_SIZE.

    Important to call truncate_inode_pages now in the write path so a
    subsequent read sees the new data.

    Signed-off-by: Martin Brandenburg
    Signed-off-by: Mike Marshall

    Martin Brandenburg
     
  • Signed-off-by: Martin Brandenburg
    Signed-off-by: Mike Marshall

    Martin Brandenburg
     
  • It's a copy of the loop which would run in read_pages from
    mm/readahead.c.

    Signed-off-by: Martin Brandenburg
    Signed-off-by: Mike Marshall

    Martin Brandenburg
     
  • OrangeFS accepts a mask indicating which attributes were changed. The
    kernel must not set any bits except those that were actually changed.
    The kernel must set the uid/gid of the request to the actual uid/gid
    responsible for the change.

    Code path for notify_change initiated setattrs is

    orangefs_setattr(dentry, iattr)
    -> __orangefs_setattr(inode, iattr)

    In kernel changes are initiated by calling __orangefs_setattr.

    Code path for writeback is

    orangefs_write_inode
    -> orangefs_inode_setattr

    attr_valid and attr_uid and attr_gid change together under i_lock.
    I_DIRTY changes separately.

    __orangefs_setattr
        lock
        if needs to be cleaned first, unlock and retry
        set attr_valid
        copy data in
        unlock
        mark_inode_dirty

    orangefs_inode_setattr
        lock
        copy attributes out
        unlock
        clear getattr_time
        # __writeback_single_inode clears dirty

    orangefs_inode_getattr
        # possible to get here with attr_valid set and not dirty
        lock
        if getattr_time ok or attr_valid set, unlock and return
        unlock
        do server operation
        # another thread may getattr or setattr, so check for that
        lock
        if getattr_time ok or attr_valid, unlock and return
        else, copy in
        update getattr_time
        unlock

    Signed-off-by: Martin Brandenburg
    Signed-off-by: Mike Marshall

    Martin Brandenburg
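
The orangefs_inode_getattr outline above follows a check, unlock, fetch, lock, re-check pattern. Below is a userspace sketch of that pattern using a pthread mutex in place of i_lock. All names (`struct sketch_inode`, `sketch_getattr`, the two fetch helpers) are hypothetical, and a single `size` field stands in for the full attribute set.

```c
#include <pthread.h>
#include <stdbool.h>

struct sketch_inode {
	pthread_mutex_t lock;	/* stands in for i_lock */
	bool attr_valid;	/* a local setattr is pending */
	bool getattr_time_ok;	/* cached attributes are still fresh */
	long size;		/* the one cached attribute in this sketch */
};

/* Pretend server round trip that returns the server's idea of the size. */
static long fetch_plain(struct sketch_inode *in)
{
	(void)in;
	return 42;
}

/* Same, but a setattr from "another thread" lands during the round trip. */
static long fetch_while_setattr_races(struct sketch_inode *in)
{
	pthread_mutex_lock(&in->lock);
	in->attr_valid = true;
	in->size = 7;
	pthread_mutex_unlock(&in->lock);
	return 42;
}

/* Returns true if the server's answer was copied into the inode. */
static bool sketch_getattr(struct sketch_inode *in,
			   long (*fetch)(struct sketch_inode *))
{
	long fresh;

	pthread_mutex_lock(&in->lock);
	if (in->getattr_time_ok || in->attr_valid) {
		pthread_mutex_unlock(&in->lock);
		return false;	/* cache is usable, or local changes pending */
	}
	pthread_mutex_unlock(&in->lock);

	fresh = fetch(in);	/* server operation, done without the lock */

	pthread_mutex_lock(&in->lock);
	if (in->getattr_time_ok || in->attr_valid) {
		/* Someone else got there first; do not overwrite their data. */
		pthread_mutex_unlock(&in->lock);
		return false;
	}
	in->size = fresh;
	in->getattr_time_ok = true;
	pthread_mutex_unlock(&in->lock);
	return true;
}
```

The second check is what keeps a slow server reply from clobbering attributes that a concurrent setattr stored while the lock was dropped.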
     
  • This is a fairly big change, but ultimately it's not a lot of code.

    Implement write_inode and then avoid the call to orangefs_inode_setattr
    within orangefs_setattr.

    Signed-off-by: Martin Brandenburg
    Signed-off-by: Mike Marshall

    Martin Brandenburg
     
  • Signed-off-by: Martin Brandenburg
    Signed-off-by: Mike Marshall

    Martin Brandenburg
     
  • This should be a no-op now. When inode writeback works, this will
    prevent a getattr from overwriting inode data while an inode is
    transitioning to dirty.

    Signed-off-by: Martin Brandenburg
    Signed-off-by: Mike Marshall

    Martin Brandenburg
     
  • This should be a no-op now, but once inode writeback works, it'll be
    necessary to have the correct attribute in the dirty inode.

    Previously the attribute fetch timeout was marked invalid and the server
    provided the updated attribute. When the inode is dirty, the server
    cannot be consulted since it does not yet know the pending setattr.

    Signed-off-by: Martin Brandenburg
    Signed-off-by: Mike Marshall

    Martin Brandenburg
     
  • No need to store the received mask. It is either STATX_BASIC_STATS or
    STATX_BASIC_STATS & ~STATX_SIZE. If STATX_SIZE is requested, the cache
    is bypassed anyway, so the cached mask is unnecessary to decide whether
    to do a real getattr.

    This is a change. Previously a getattr would want size and use the
    cached size. All of the in-kernel callers that wanted size did not want
    a cached size. Now a getattr cannot use the cached size if it wants
    size at all.

    Signed-off-by: Martin Brandenburg
    Signed-off-by: Mike Marshall

    Martin Brandenburg
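
The rule described above, that a request for size always bypasses the cache, can be sketched as a small predicate. The mask values mirror the Linux UAPI definitions of STATX_SIZE and STATX_BASIC_STATS but are redefined here so the snippet stands alone; `may_use_cached_attrs` and its `cache_fresh` argument are hypothetical.

```c
#include <stdbool.h>

/* Local copies of the UAPI mask values, prefixed to avoid clashes. */
#define SKETCH_STATX_SIZE        0x00000200u
#define SKETCH_STATX_BASIC_STATS 0x000007ffu

/*
 * Decide whether a getattr may be answered from cached attributes.
 * Any request that includes size goes to the server; everything else
 * may use the cache while it is still fresh.
 */
static bool may_use_cached_attrs(unsigned int request_mask, bool cache_fresh)
{
	if (request_mask & SKETCH_STATX_SIZE)
		return false;
	return cache_fresh;
}
```

With this rule there is no need to remember which mask the last server reply covered, since the only two cases are "everything" and "everything except size".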
     
  • When an inode is created, we fetch attributes from the server. There is
    no need to turn around and invalidate them.

    No need to initialize attributes after the getattr either. Either it'll
    be exactly the same, or it'll be something else and wrong.

    Signed-off-by: Martin Brandenburg
    Signed-off-by: Mike Marshall

    Martin Brandenburg