09 Dec, 2020

1 commit

  • There's a memory leak in afs_parse_source() whereby multiple source=
    parameters overwrite fc->source in the fs_context struct without freeing
    the previously recorded source.

    Fix this by only permitting a single source parameter and rejecting with
    an error all subsequent ones.

    This was caught by syzbot with the kernel memory leak detector, showing
    something like the following trace:

    unreferenced object 0xffff888114375440 (size 32):
    comm "repro", pid 5168, jiffies 4294923723 (age 569.948s)
    backtrace:
    slab_post_alloc_hook+0x42/0x79
    __kmalloc_track_caller+0x125/0x16a
    kmemdup_nul+0x24/0x3c
    vfs_parse_fs_string+0x5a/0xa1
    generic_parse_monolithic+0x9d/0xc5
    do_new_mount+0x10d/0x15a
    do_mount+0x5f/0x8e
    __do_sys_mount+0xff/0x127
    do_syscall_64+0x2d/0x3a
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Fixes: 13fcc6837049 ("afs: Add fs_context support")
    Reported-by: syzbot+86dc6632faaca40133ab@syzkaller.appspotmail.com
    Signed-off-by: David Howells
    cc: Randy Dunlap
    Signed-off-by: Linus Torvalds

    David Howells
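
    A minimal user-space sketch of the approach described above (the names
    fs_ctx and parse_source are hypothetical stand-ins, not the kernel's
    fs_context plumbing): a second source= value is rejected instead of
    overwriting the first, so the earlier allocation can never be leaked.

        #include <errno.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        struct fs_ctx {                 /* stand-in for struct fs_context */
            char *source;               /* copy of the source= value      */
        };

        /* Record a source= mount parameter; returns 0 or a negative errno. */
        static int parse_source(struct fs_ctx *ctx, const char *value)
        {
            if (ctx->source)            /* a source was already given, so */
                return -EINVAL;         /* reject rather than overwrite   */
            ctx->source = strdup(value);
            return ctx->source ? 0 : -ENOMEM;
        }

        int main(void)
        {
            struct fs_ctx ctx = { NULL };

            printf("%d\n", parse_source(&ctx, "example.org:root.cell")); /* 0 */
            printf("%d\n", parse_source(&ctx, "other.org:root.cell"));   /* rejected */
            free(ctx.source);
            return 0;
        }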
     

23 Nov, 2020

2 commits

  • Linux 5.10-rc5

    Signed-off-by: Greg Kroah-Hartman
    Change-Id: Ia5b23cceb3e0212c1c841f1297ecfab65cc9aaa6

    Greg Kroah-Hartman
     
  • When doing a lookup in a directory, the afs filesystem uses a bulk
    status fetch to speculatively retrieve the statuses of up to 48 other
    vnodes found in the same directory and it will then either update extant
    inodes or create new ones - effectively doing 'lookup ahead'.

    To avoid the possibility of deadlocking itself, however, the filesystem
    doesn't lock all of those inodes; rather just the directory inode is
    locked (by the VFS).

    When the operation completes, afs_inode_init_from_status() or
    afs_apply_status() is called, depending on whether the inode already
    exists, to commit the new status.

    A case exists, however, where the speculative status fetch operation may
    straddle a modification operation on one of those vnodes. What can then
    happen is that the speculative bulk status RPC retrieves the old status;
    whilst that is in flight, the modification happens and returns an updated
    status; the modification status is then committed, and finally we attempt
    to commit the (now stale) speculative status.

    This results in something like the following being seen in dmesg:

    kAFS: vnode modified {100058:861} 8->9 YFS.InlineBulkStatus

    showing that for vnode 861 on volume 100058, we saw YFS.InlineBulkStatus
    say that the vnode had data version 8 when we'd already recorded version
    9 due to a local modification. This was causing the cache to be
    invalidated for that vnode when it shouldn't have been. If it happens
    on a data file, this might lead to local changes being lost.

    Fix this by ignoring speculative status updates if the data version
    doesn't match the expected value.

    Note that it is possible to get a DV regression if a volume gets
    restored from a backup - but we should get a callback break in such a
    case that should trigger a recheck anyway. It might be worth checking
    the volume creation time in the volsync info and, if a change is
    observed in that (as would happen on a restore), invalidate all caches
    associated with the volume.

    Fixes: 5cf9dd55a0ec ("afs: Prospectively look up extra files when doing a single lookup")
    Signed-off-by: David Howells
    Signed-off-by: Linus Torvalds

    David Howells
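
    The rule adopted above can be modelled in a few lines of ordinary C (the
    struct names are made up for illustration and are not the afs code; the
    "expected value" is taken here to be the locally recorded data version):

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        struct vnode_state {
            uint64_t data_version;      /* version recorded locally           */
        };

        struct fetched_status {
            uint64_t data_version;      /* version reported by the bulk fetch */
        };

        /* Apply a speculatively fetched status only if it is what we expect. */
        static bool apply_speculative_status(struct vnode_state *v,
                                             const struct fetched_status *s)
        {
            if (s->data_version != v->data_version)
                return false;           /* stale: a modification raced us */
            /* ... commit the rest of the status (size, times, ...) here ... */
            return true;
        }

        int main(void)
        {
            struct vnode_state v = { .data_version = 9 };    /* local write won */
            struct fetched_status s = { .data_version = 8 }; /* old speculative */

            printf("applied=%d, dv stays at %llu\n",
                   apply_speculative_status(&v, &s),
                   (unsigned long long)v.data_version);
            return 0;
        }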
     

04 Nov, 2020

2 commits

  • The cleanup for the yfs_store_opaque_acl2_operation calls the wrong
    function to destroy the ACL content buffer. It's an afs_acl struct, not
    a yfs_acl struct - and the free function for the latter may pass invalid
    pointers to kfree().

    Fix this by using the afs_acl_put() function. The yfs_acl_put()
    function is then no longer used and can be removed.

    general protection fault, probably for non-canonical address 0x7ebde00000000: 0000 [#1] SMP PTI
    ...
    RIP: 0010:compound_head+0x0/0x11
    ...
    Call Trace:
    virt_to_cache+0x8/0x51
    kfree+0x5d/0x79
    yfs_free_opaque_acl+0x16/0x29
    afs_put_operation+0x60/0x114
    __vfs_setxattr+0x67/0x72
    __vfs_setxattr_noperm+0x66/0xe9
    vfs_setxattr+0x67/0xce
    setxattr+0x14e/0x184
    __do_sys_fsetxattr+0x66/0x8f
    do_syscall_64+0x2d/0x3a
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Fixes: e49c7b2f6de7 ("afs: Build an abstraction around an "operation" concept")
    Signed-off-by: David Howells
    Signed-off-by: Linus Torvalds

    David Howells
     
  • When using the afs.yfs.acl xattr to change an AuriStor ACL, a warning
    can be generated when the request is marshalled because the buffer
    pointer isn't increased after adding the last element, thereby
    triggering the check at the end if the ACL wasn't empty. This just
    causes something like the following warning, but doesn't stop the call
    from happening successfully:

    kAFS: YFS.StoreOpaqueACL2: Request buffer underflow (36
    Signed-off-by: Linus Torvalds

    David Howells
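
    The marshalling invariant involved can be illustrated with a stand-alone
    sketch (hypothetical names, not the YFS RPC code): the write cursor has
    to be advanced after every element, including the last one, or the final
    used-size check will not add up and an underflow is reported.

        #include <assert.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        /* Marshal 'count' 32-bit entries into 'buf'; return bytes consumed. */
        static size_t marshal_entries(uint8_t *buf, const uint32_t *entries,
                                      size_t count)
        {
            uint8_t *p = buf;

            for (size_t i = 0; i < count; i++) {
                memcpy(p, &entries[i], sizeof(entries[i]));
                p += sizeof(entries[i]);    /* advance after EVERY element,
                                             * including the last one      */
            }
            return (size_t)(p - buf);
        }

        int main(void)
        {
            uint32_t acl[3] = { 1, 2, 3 };
            uint8_t buf[sizeof(acl)];
            size_t used = marshal_entries(buf, acl, 3);

            /* The end-of-build sanity check: had the cursor not moved for
             * the last element, this would see an apparent underflow.      */
            assert(used == sizeof(buf));
            printf("used %zu of %zu bytes\n", used, sizeof(buf));
            return 0;
        }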
     

29 Oct, 2020

7 commits

  • The dirty region bounds stored in page->private on an afs page are 15 bits
    on a 32-bit box and can, at most, represent a range of up to 32K within a
    32K page with a resolution of 1 byte. This is a problem for powerpc32 with
    64K pages enabled.

    Further, transparent huge pages may get up to 2M, which will be a problem
    for the afs filesystem on all 32-bit arches in the future.

    Fix this by decreasing the resolution. For the moment, a 64K page will
    have a resolution determined from PAGE_SIZE. In the future, the page will
    need to be passed in to the helper functions so that the page size can be
    assessed and the resolution determined dynamically.

    Note that this might not be the ideal way to handle this, since it may
    allow some leakage of undirtied zero bytes to the server's copy in the case
    of a 3rd-party conflict. Fixing that would require a separately allocated
    record and is a more complicated fix.

    Fixes: 4343d00872e1 ("afs: Get rid of the afs_writeback record")
    Reported-by: kernel test robot
    Signed-off-by: David Howells
    Reviewed-by: Matthew Wilcox (Oracle)

    David Howells
     
  • Fix afs_invalidatepage() to adjust the dirty region recorded in
    page->private when truncating a page. If the dirty region is entirely
    removed, then the private data is cleared and the page dirty state is
    cleared.

    Without this, if the page is truncated and then expanded again by truncate,
    zeros from the expanded, but no-longer dirty region may get written back to
    the server if the page gets laundered due to a conflicting 3rd-party write.

    It mustn't, however, shorten the dirty region of the page if that page is
    still mmapped and has been marked dirty by afs_page_mkwrite(), so a flag is
    stored in page->private to record this.

    Fixes: 4343d00872e1 ("afs: Get rid of the afs_writeback record")
    Signed-off-by: David Howells

    David Howells
     
  • Currently, page->private on an afs page is used to store the range of
    dirtied data within the page, where the range includes the lower bound, but
    excludes the upper bound (e.g. 0-1 is a range covering a single byte).

    This, however, requires a superfluous bit for the last-byte bound so that
    on a 4KiB page, it can say 0-4096 to indicate the whole page, the idea
    being that having both numbers the same would indicate an empty range.
    This is unnecessary as the PG_private bit is clear if it's an empty range
    (as is PG_dirty).

    Alter the way the dirty range is encoded in page->private such that the
    upper bound is reduced by 1 (e.g. 0-0 then specifies the same single-byte
    range mentioned above).

    Applying this to both bounds frees up two bits, one of which can be used in
    a future commit.

    This allows the afs filesystem to be compiled on ppc32 with 64K pages;
    without this, the following warnings are seen:

    ../fs/afs/internal.h: In function 'afs_page_dirty_to':
    ../fs/afs/internal.h:881:15: warning: right shift count >= width of type [-Wshift-count-overflow]
    881 | return (priv >> __AFS_PAGE_PRIV_SHIFT) & __AFS_PAGE_PRIV_MASK;
    | ^~
    ../fs/afs/internal.h: In function 'afs_page_dirty':
    ../fs/afs/internal.h:886:28: warning: left shift count >= width of type [-Wshift-count-overflow]
    886 | return ((unsigned long)to << __AFS_PAGE_PRIV_SHIFT) | from;
    | ^~

    Fixes: 4343d00872e1 ("afs: Get rid of the afs_writeback record")
    Signed-off-by: David Howells

    David Howells
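
    The packing described above can be sketched as ordinary C (a stand-alone
    model in the style of the afs helpers, not the actual kernel header):
    once the upper bound is stored as the last byte rather than one past the
    end, both bounds fit in half of an unsigned long with no spare bit
    needed, which is what makes the larger-page cases representable.

        #include <assert.h>
        #include <stdio.h>

        #define PRIV_SHIFT  (8 * sizeof(unsigned long) / 2)
        #define PRIV_MASK   ((1UL << PRIV_SHIFT) - 1)

        /* Encode a dirty range [from, to) for storage in page-private data. */
        static unsigned long dirty_encode(unsigned long from, unsigned long to)
        {
            /* Store the inclusive last byte, so a whole 4KiB page is 0-4095
             * and no extra bit is needed for the upper bound.               */
            return ((to - 1) << PRIV_SHIFT) | from;
        }

        static unsigned long dirty_from(unsigned long priv)
        {
            return priv & PRIV_MASK;
        }

        static unsigned long dirty_to(unsigned long priv)
        {
            return ((priv >> PRIV_SHIFT) & PRIV_MASK) + 1;  /* exclusive again */
        }

        int main(void)
        {
            unsigned long priv = dirty_encode(0, 4096);

            assert(dirty_from(priv) == 0);
            assert(dirty_to(priv) == 4096);
            printf("priv=%#lx from=%lu to=%lu\n", priv, dirty_from(priv),
                   dirty_to(priv));
            return 0;
        }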
     
  • The afs filesystem uses page->private to store the dirty range within a
    page such that in the event of a conflicting 3rd-party write to the server,
    we write back just the bits that got changed locally.

    However, there are a couple of problems with this:

    (1) I need a bit to note if the page might be mapped so that partial
    invalidation doesn't shrink the range.

    (2) There aren't necessarily sufficient bits to store the entire range of
    data altered (say it's a 32-bit system with 64KiB pages or transparent
    huge pages are in use).

    So wrap the accesses in inline functions so that future commits can change
    how this works.

    Also move them out of the tracing header into the in-directory header.
    There's not really any need for them to be in the tracing header.

    Signed-off-by: David Howells

    David Howells
     
  • In afs, page->private is set to indicate the dirty region of a page. This
    is done in afs_write_begin(), but that can't take account of whether the
    copy into the page actually worked.

    Fix this by moving the change of page->private into afs_write_end().

    Fixes: 4343d00872e1 ("afs: Get rid of the afs_writeback record")
    Signed-off-by: David Howells

    David Howells
     
  • Fix the leak of the target page in afs_write_begin() when it fails.

    Fixes: 15b4650e55e0 ("afs: convert to new aops")
    Signed-off-by: David Howells
    cc: Nick Piggin

    David Howells
     
  • Fix afs to take a ref on a page when it sets PG_private on it and to drop
    the ref when removing the flag.

    Note that in afs_write_begin(), a lot of the time, PG_private is already
    set on a page to which we're going to add some data. In such a case, we
    leave the bit set and mustn't increment the page count.

    As suggested by Matthew Wilcox, use attach/detach_page_private() where
    possible.

    Fixes: 31143d5d515e ("AFS: implement basic file write support")
    Reported-by: Matthew Wilcox (Oracle)
    Signed-off-by: David Howells
    Reviewed-by: Matthew Wilcox (Oracle)

    David Howells
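
    The ownership rule can be shown with a toy, single-threaded model (this
    is not the kernel's page API; attach/detach here are hypothetical
    stand-ins): the private flag owns exactly one reference, the reference
    is taken only on the transition from clear to set, and re-dirtying a
    page that already carries private data must not take another.

        #include <assert.h>
        #include <stdbool.h>
        #include <stdio.h>

        struct page_model {             /* toy stand-in for struct page   */
            int refcount;
            bool has_private;           /* models PG_private              */
            unsigned long private_data; /* models page->private           */
        };

        /* Attach private data; the flag owns exactly one reference. */
        static void attach_private(struct page_model *p, unsigned long data)
        {
            if (!p->has_private) {
                p->refcount++;          /* take the ref only on clear->set */
                p->has_private = true;
            }
            p->private_data = data;     /* updating the data is always ok  */
        }

        /* Detach private data and drop the reference the flag owned. */
        static unsigned long detach_private(struct page_model *p)
        {
            unsigned long data = p->private_data;

            assert(p->has_private);
            p->has_private = false;
            p->private_data = 0;
            p->refcount--;
            return data;
        }

        int main(void)
        {
            struct page_model pg = { .refcount = 1 };   /* caller's own ref */

            attach_private(&pg, 1);     /* first attach takes a ref         */
            attach_private(&pg, 2);     /* re-dirtying must not             */
            detach_private(&pg);        /* so a single detach balances it   */
            printf("refcount=%d (expected 1)\n", pg.refcount);
            return 0;
        }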
     

28 Oct, 2020

4 commits

  • Fix afs_launder_page() to not clear PG_writeback on the page it is
    laundering as the flag isn't set in this case.

    Fixes: 4343d00872e1 ("afs: Get rid of the afs_writeback record")
    Signed-off-by: David Howells

    David Howells
     
  • The "op" pointer is freed earlier when we call afs_put_operation().

    Fixes: e49c7b2f6de7 ("afs: Build an abstraction around an "operation" concept")
    Signed-off-by: Dan Carpenter
    Signed-off-by: David Howells
    cc: Colin Ian King

    Dan Carpenter
     
  • The patch dca54a7bbb8c: "afs: Add tracing for cell refcount and active user
    count" from Oct 13, 2020, leads to the following Smatch complaint:

    fs/afs/cell.c:596 afs_unuse_cell()
    warn: variable dereferenced before check 'cell' (see line 592)

    Fix this by moving the retrieval of the cell debug ID to after the check of
    the validity of the cell pointer.

    Reported-by: Dan Carpenter
    Fixes: dca54a7bbb8c ("afs: Add tracing for cell refcount and active user count")
    Signed-off-by: David Howells
    cc: Dan Carpenter

    David Howells
     
  • The prevention of splice-write without explicit ops made the
    copy_file_range() syscall to an afs file (as done by the generic/112
    xfstest) fail with EINVAL.

    Fix by using iter_file_splice_write() for afs.

    Fixes: 36e2c7421f02 ("fs: don't allow splice read/write without explicit ops")
    Signed-off-by: David Howells
    Reviewed-by: Christoph Hellwig

    David Howells
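
    The shape of the change is roughly the following kernel-style fragment
    (illustrative only, not the actual afs file_operations table, and not
    buildable on its own): point ->splice_write at the iter-based helper so
    the splice path has an explicit implementation again.

        #include <linux/fs.h>

        static const struct file_operations example_file_fops = {
            .read_iter      = generic_file_read_iter,
            .write_iter     = generic_file_write_iter,
            .splice_read    = generic_file_splice_read,
            .splice_write   = iter_file_splice_write,   /* the fix */
        };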
     

17 Oct, 2020

1 commit

  • Pull afs updates from David Howells:
    "A collection of fixes to fix afs_cell struct refcounting, thereby
    fixing a slew of related syzbot bugs:

    - Fix the cell tree in the netns to use an rwsem rather than RCU.

    There seem to be some problems deriving from the use of RCU and a
    seqlock to walk the rbtree, but it's not entirely clear what since
    there are several different failures being seen.

    Changing things to use an rwsem instead makes it more robust. The
    extra performance derived from using RCU isn't necessary in this
    case since the only time we're looking up a cell is during mount or
    when cells are being manually added.

    - Fix the refcounting by splitting the usage counter into a memory
    refcount and an active users counter. The usage counter was doing
    double duty, keeping track of whether a cell is still in use and
    keeping track of when it needs to be destroyed - but this makes the
    clean up tricky. Separating these out simplifies the logic.

    - Fix purging a cell that has an alias. A cell alias pins the cell
    it's an alias of, but the alias is always later in the list. Trying
    to purge in a single pass causes rmmod to hang in such a case.

    - Fix cell removal. If a cell's manager is requeued whilst it's
    removing itself, the manager will run again and re-remove itself,
    causing problems in various places. Follow Hillf Danton's
    suggestion to insert a more terminal state that causes the manager
    to do nothing post-removal.

    In addition to the above, two other changes:

    - Add a tracepoint for the cell refcount and active users count. This
    helped with debugging the above and may be useful again in future.

    - Downgrade an assertion to a print when a still-active server is
    seen during purging. This was happening as a consequence of
    incomplete cell removal before the servers were cleaned up"

    * tag 'afs-fixes-20201016' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
    afs: Don't assert on unpurgeable server records
    afs: Add tracing for cell refcount and active user count
    afs: Fix cell removal
    afs: Fix cell purging with aliases
    afs: Fix cell refcounting by splitting the usage counter
    afs: Fix rapid cell addition/removal by not using RCU on cells tree

    Linus Torvalds
     

16 Oct, 2020

6 commits

  • Don't give an assertion failure on unpurgeable afs_server records - which
    kills the thread - but rather emit a trace line when we are purging a
    record (which only happens during network namespace removal or rmmod) and
    print a notice of the problem.

    Signed-off-by: David Howells

    David Howells
     
  • Add a tracepoint to log the cell refcount and active user count and pass in
    a reason code through various functions that manipulate these counters.

    Additionally, a helper function, afs_see_cell(), is provided to log
    interesting places that deal with a cell without actually doing any
    accounting directly.

    Signed-off-by: David Howells

    David Howells
     
  • Fix cell removal by inserting a more final state than AFS_CELL_FAILED that
    indicates that the cell has been unpublished in case the manager is already
    requeued and will go through again. The new AFS_CELL_REMOVED state will
    just immediately leave the manager function.

    Going through a second time in the AFS_CELL_FAILED state will cause it to
    try to remove the cell again, potentially leading to the proc list being
    removed.

    Fixes: 989782dcdc91 ("afs: Overhaul cell database management")
    Reported-by: syzbot+b994ecf2b023f14832c1@syzkaller.appspotmail.com
    Reported-by: syzbot+0e0db88e1eb44a91ae8d@syzkaller.appspotmail.com
    Reported-by: syzbot+2d0585e5efcd43d113c2@syzkaller.appspotmail.com
    Reported-by: syzbot+1ecc2f9d3387f1d79d42@syzkaller.appspotmail.com
    Reported-by: syzbot+18d51774588492bf3f69@syzkaller.appspotmail.com
    Reported-by: syzbot+a5e4946b04d6ca8fa5f3@syzkaller.appspotmail.com
    Suggested-by: Hillf Danton
    Signed-off-by: David Howells
    cc: Hillf Danton

    David Howells
     
  • When the afs module is removed, one of the things that has to be done is to
    purge the cell database. afs_cell_purge() cancels the management timer and
    then starts the cell manager work item to do the purging. This does a
    single run through and then assumes that all cells are now purged - but
    this is no longer the case.

    With the introduction of alias detection, a later cell in the database can
    now be holding an active count on an earlier cell (cell->alias_of). The
    purge scan passes by the earlier cell first, but that cell can't be got
    rid of until the later cell has discarded the alias. Ordinarily,
    afs_unuse_cell() would
    handle this by setting the management timer to trigger another pass - but
    afs_set_cell_timer() doesn't do anything if the namespace is being removed
    (net->live == false). rmmod then hangs in the wait on cells_outstanding in
    afs_cell_purge().

    Fix this by making afs_set_cell_timer() directly queue the cell manager if
    net->live is false. This causes additional management passes.

    Queueing the cell manager increments cells_outstanding to make sure the
    wait won't complete until all cells are destroyed.

    Fixes: 8a070a964877 ("afs: Detect cell aliases 1 - Cells with root volumes")
    Signed-off-by: David Howells

    David Howells
     
  • Management of the lifetime of afs_cell struct has some problems due to the
    usage counter being used to determine whether objects of that type are in
    use in addition to whether anyone might be interested in the structure.

    This is made trickier by cell objects being cached for a period of time in
    case they're quickly reused as they hold the result of a setup process that
    may be slow (DNS lookups, AFS RPC ops).

    Problems include the cached root volume from alias resolution pinning its
    parent cell record, rmmod occasionally hanging and occasionally producing
    assertion failures.

    Fix this by splitting the count of active users from the struct reference
    count. Things then work as follows:

    (1) The cell cache keeps +1 on the cell's activity count and this has to
    be dropped before the cell can be removed. afs_manage_cell() tries to
    exchange the 1 to a 0 with the cells_lock write-locked, and if
    successful, the record is removed from the net->cells.

    (2) One struct ref is 'owned' by the activity count. That is put when the
    active count is reduced to 0 (final_destruction label).

    (3) A ref can be held on a cell whilst it is queued for management on a
    work queue without confusing the active count. afs_queue_cell() is
    added to wrap this.

    (4) The queue's ref is dropped at the end of the management. This is
    split out into a separate function, afs_manage_cell_work().

    (5) The root volume record is put after a cell is removed (at the
    final_destruction label) rather than in the RCU destruction routine.

    (6) Volumes hold struct refs, but aren't active users.

    (7) Both counts are displayed in /proc/net/afs/cells.

    There are some management function changes:

    (*) afs_put_cell() now just decrements the refcount and triggers the RCU
    destruction if it becomes 0. It no longer sets a timer to have the
    manager do this.

    (*) afs_use_cell() and afs_unuse_cell() are added to increase and decrease
    the active count. afs_unuse_cell() sets the management timer.

    (*) afs_queue_cell() is added to queue a cell with appropriate refs.

    There are also some other fixes:

    (*) Don't let /proc/net/afs/cells access a cell's vllist if it's NULL.

    (*) Make sure that candidate cells in lookups are properly destroyed
    rather than being simply kfree'd. This ensures the bits it points to
    are destroyed also.

    (*) afs_dec_cells_outstanding() is now called in cell destruction rather
    than at "final_destruction". This ensures that cell->net is still
    valid to the end of the destructor.

    (*) As a consequence of the previous two changes, move the increment of
    net->cells_outstanding that was at the point of insertion into the
    tree to the allocation routine to correctly balance things.

    Fixes: 989782dcdc91 ("afs: Overhaul cell database management")
    Signed-off-by: David Howells

    David Howells
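
    The essence of the split can be modelled in a small, single-threaded
    sketch (toy code, not the afs implementation, which uses atomics, the
    cells_lock and RCU): 'ref' keeps the memory alive, 'active' counts real
    users, and the active count as a whole owns exactly one ref.

        #include <stdio.h>
        #include <stdlib.h>

        struct cell_model {
            int ref;                    /* memory refcount     */
            int active;                 /* active-user count   */
        };

        static struct cell_model *cell_alloc(void)
        {
            struct cell_model *c = calloc(1, sizeof(*c));

            if (!c)
                exit(1);
            c->ref = 1;                 /* owned by the active count below */
            c->active = 1;              /* e.g. the cell cache's +1        */
            return c;
        }

        static void cell_put(struct cell_model *c)
        {
            if (--c->ref == 0)
                free(c);                /* "RCU destruction" in the real code */
        }

        static void cell_use(struct cell_model *c)
        {
            c->active++;
        }

        static void cell_unuse(struct cell_model *c)
        {
            if (--c->active == 0)
                cell_put(c);            /* drop the ref the active count owned */
        }

        int main(void)
        {
            struct cell_model *c = cell_alloc();

            cell_use(c);                /* a mount starts using the cell      */
            c->ref++;                   /* queued for management work ...     */
            cell_put(c);                /* ... and the work item completes    */
            cell_unuse(c);              /* the mount goes away                */
            cell_unuse(c);              /* the cache drops its +1: freed here */
            printf("cell destroyed\n");
            return 0;
        }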
     
  • There are a number of problems being seen when rapidly mounting
    and unmounting an afs dynamic root with an explicit cell and volume
    specified (which should probably be rejected, but that's a separate issue):

    What the tests are doing is to look up/create a cell record for the name
    given and then tear it down again without actually using it to try to talk
    to a server. This is repeated endlessly, very fast, and the new cell
    collides with the old one if it's not quick enough to reuse it.

    It appears (as suggested by Hillf Danton) that the search through the RB
    tree under a read_seqbegin_or_lock() under RCU conditions isn't safe and
    that it's not blocking the write_seqlock(), despite taking two passes at
    it. He suggested that the code should take a ref on the cell it's
    attempting to look at - but this shouldn't be necessary until we've
    compared the cell names. It's possible that I'm missing a barrier
    somewhere.

    However, using an RCU search for this is overkill, really - we only need to
    access the cell name in a few places, and they're places where we may
    end up sleeping anyway.

    Fix this by switching to an R/W semaphore instead.

    Additionally, draw the down_read() call inside the function (renamed to
    afs_find_cell()) since all the callers were taking the RCU read lock (or
    should've been[*]).

    [*] afs_probe_cell_name() should have been, but that doesn't appear to be
    involved in the bug reports.

    The symptoms of this look like:

    general protection fault, probably for non-canonical address 0xf27d208691691fdb: 0000 [#1] PREEMPT SMP KASAN
    KASAN: maybe wild-memory-access in range [0x93e924348b48fed8-0x93e924348b48fedf]
    ...
    RIP: 0010:strncasecmp lib/string.c:52 [inline]
    RIP: 0010:strncasecmp+0x5f/0x240 lib/string.c:43
    afs_lookup_cell_rcu+0x313/0x720 fs/afs/cell.c:88
    afs_lookup_cell+0x2ee/0x1440 fs/afs/cell.c:249
    afs_parse_source fs/afs/super.c:290 [inline]
    ...

    Fixes: 989782dcdc91 ("afs: Overhaul cell database management")
    Reported-by: syzbot+459a5dce0b4cb70fd076@syzkaller.appspotmail.com
    Signed-off-by: David Howells
    cc: Hillf Danton
    cc: syzkaller-bugs@googlegroups.com

    David Howells
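
    A rough user-space analogue of the switch (a pthread rwlock standing in
    for the rwsem and a linked list for the rbtree; not the afs code; build
    with -pthread): the lookup takes the read lock inside the function and
    pins the cell before the lock is dropped.

        #include <pthread.h>
        #include <stdio.h>
        #include <string.h>
        #include <strings.h>

        struct cell_entry {
            const char *name;
            int refs;                   /* the real code uses atomic refcounts */
            struct cell_entry *next;
        };

        static pthread_rwlock_t cells_lock = PTHREAD_RWLOCK_INITIALIZER;
        static struct cell_entry *cells;

        static struct cell_entry *find_cell(const char *name)
        {
            struct cell_entry *c, *found = NULL;

            pthread_rwlock_rdlock(&cells_lock);     /* read lock taken here */
            for (c = cells; c; c = c->next) {
                if (strcasecmp(c->name, name) == 0) {
                    c->refs++;          /* pin it before dropping the lock  */
                    found = c;
                    break;
                }
            }
            pthread_rwlock_unlock(&cells_lock);
            return found;
        }

        int main(void)
        {
            struct cell_entry example = { .name = "example.org", .refs = 1 };

            cells = &example;
            printf("found: %s\n", find_cell("EXAMPLE.ORG") ? "yes" : "no");
            return 0;
        }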
     

14 Oct, 2020

1 commit

  • Pull block updates from Jens Axboe:

    - Series of merge handling cleanups (Baolin, Christoph)

    - Series of blk-throttle fixes and cleanups (Baolin)

    - Series cleaning up BDI, separating the block device from the
    backing_dev_info (Christoph)

    - Removal of bdget() as a generic API (Christoph)

    - Removal of blkdev_get() as a generic API (Christoph)

    - Cleanup of is-partition checks (Christoph)

    - Series reworking disk revalidation (Christoph)

    - Series cleaning up bio flags (Christoph)

    - bio crypt fixes (Eric)

    - IO stats inflight tweak (Gabriel)

    - blk-mq tags fixes (Hannes)

    - Buffer invalidation fixes (Jan)

    - Allow soft limits for zone append (Johannes)

    - Shared tag set improvements (John, Kashyap)

    - Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel)

    - DM no-wait support (Mike, Konstantin)

    - Request allocation improvements (Ming)

    - Allow md/dm/bcache to use IO stat helpers (Song)

    - Series improving blk-iocost (Tejun)

    - Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang,
    Xianting, Yang, Yufen, yangerkun)

    * tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits)
    block: fix uapi blkzoned.h comments
    blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue
    blk-mq: get rid of the dead flush handle code path
    block: get rid of unnecessary local variable
    block: fix comment and add lockdep assert
    blk-mq: use helper function to test hw stopped
    block: use helper function to test queue register
    block: remove redundant mq check
    block: invoke blk_mq_exit_sched no matter whether have .exit_sched
    percpu_ref: don't refer to ref->data if it isn't allocated
    block: ratelimit handle_bad_sector() message
    blk-throttle: Re-use the throtl_set_slice_end()
    blk-throttle: Open code __throtl_de/enqueue_tg()
    blk-throttle: Move service tree validation out of the throtl_rb_first()
    blk-throttle: Move the list operation after list validation
    blk-throttle: Fix IO hang for a corner case
    blk-throttle: Avoid tracking latency if low limit is invalid
    blk-throttle: Avoid getting the current time if tg->last_finish_time is 0
    blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()
    block: Remove redundant 'return' statement
    ...

    Linus Torvalds
     

09 Oct, 2020

1 commit

  • The afs filesystem has a lock[*] that it uses to serialise I/O operations
    going to the server (vnode->io_lock), as the server will only perform one
    modification operation at a time on any given file or directory. This
    prevents the the filesystem from filling up all the call slots to a server
    with calls that aren't going to be executed in parallel anyway, thereby
    allowing operations on other files to obtain slots.

    [*] Note that this is probably redundant for directories at least, since
    i_rwsem is used to serialise directory modifications and
    lookup/reading vs modification. The server does allow parallel
    non-modification ops, however.

    When a file truncation op completes, we truncate the in-memory copy of the
    file to match - but we do it whilst still holding the io_lock, the idea
    being to prevent races with other operations.

    However, if writeback starts in a worker thread simultaneously with
    truncation (whilst notify_change() is called with i_rwsem locked, writeback
    pays it no heed), it may manage to set PG_writeback bits on the pages that
    will get truncated before afs_setattr_success() manages to call
    truncate_pagecache(). Truncate will then wait for those pages - whilst
    still inside io_lock:

    # cat /proc/8837/stack
    [] wait_on_page_bit_common+0x184/0x1e7
    [] truncate_inode_pages_range+0x37f/0x3eb
    [] truncate_pagecache+0x3c/0x53
    [] afs_setattr_success+0x4d/0x6e
    [] afs_wait_for_operation+0xd8/0x169
    [] afs_do_sync_operation+0x16/0x1f
    [] afs_setattr+0x1fb/0x25d
    [] notify_change+0x2cf/0x3c4
    [] do_truncate+0x7f/0xb2
    [] do_sys_ftruncate+0xd1/0x104
    [] do_syscall_64+0x2d/0x3a
    [] entry_SYSCALL_64_after_hwframe+0x44/0xa9

    The writeback operation, however, stalls indefinitely because it needs to
    get the io_lock to proceed:

    # cat /proc/5940/stack
    [] afs_get_io_locks+0x58/0x1ae
    [] afs_begin_vnode_operation+0xc7/0xd1
    [] afs_store_data+0x1b2/0x2a3
    [] afs_write_back_from_locked_page+0x418/0x57c
    [] afs_writepages_region+0x196/0x224
    [] afs_writepages+0x74/0x156
    [] do_writepages+0x2d/0x56
    [] __writeback_single_inode+0x84/0x207
    [] writeback_sb_inodes+0x238/0x3cf
    [] __writeback_inodes_wb+0x68/0x9f
    [] wb_writeback+0x145/0x26c
    [] wb_do_writeback+0x16a/0x194
    [] wb_workfn+0x74/0x177
    [] process_one_work+0x174/0x264
    [] worker_thread+0x117/0x1b9
    [] kthread+0xec/0xf1
    [] ret_from_fork+0x1f/0x30

    and thus deadlock has occurred.

    Note that whilst afs_setattr() calls filemap_write_and_wait(), the fact
    that the caller is holding i_rwsem doesn't preclude more pages being
    dirtied through an mmap'd region.

    Fix this by:

    (1) Use the vnode validate_lock to mediate access between afs_setattr()
    and afs_writepages():

    (a) Exclusively lock validate_lock in afs_setattr() around the whole
    RPC operation.

    (b) If WB_SYNC_ALL isn't set on entry to afs_writepages(), try to
    shared-lock validate_lock and return immediately if we can't get it.

    (c) If WB_SYNC_ALL is set, wait for the lock.

    The validate_lock is also used to validate a file and to zap its cache
    if the file was altered by a third party, so it's probably a good fit
    for this.

    (2) Move the truncation outside of the io_lock in setattr, using the same
    hook as is used for local directory editing.

    This requires the old i_size to be retained in the operation record as
    we commit the revised status to the inode members inside the io_lock
    still, but we still need to know if we reduced the file size.

    Fixes: d2ddc776a458 ("afs: Overhaul volume and server record caching and fileserver rotation")
    Signed-off-by: David Howells
    Signed-off-by: Linus Torvalds

    David Howells
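
    The core of the new locking rule can be sketched with a portable rwlock
    (illustrative only; validate_lock in the kernel is an rwsem and the
    surrounding code is much larger; build with -pthread): truncation holds
    the lock exclusively for the whole operation, and non-WB_SYNC_ALL
    writeback merely tries for the shared lock and backs off instead of
    blocking.

        #include <pthread.h>
        #include <stdbool.h>
        #include <stdio.h>

        static pthread_rwlock_t validate_lock = PTHREAD_RWLOCK_INITIALIZER;

        /* Returns 1 if writeback ran, 0 if it backed off. */
        static int writepages(bool sync_all)
        {
            if (sync_all) {
                pthread_rwlock_rdlock(&validate_lock);   /* WB_SYNC_ALL: wait */
            } else if (pthread_rwlock_tryrdlock(&validate_lock) != 0) {
                return 0;       /* truncation in progress: try again later   */
            }
            /* ... write the dirty pages back here ... */
            pthread_rwlock_unlock(&validate_lock);
            return 1;
        }

        int main(void)
        {
            pthread_rwlock_wrlock(&validate_lock);  /* "setattr" in progress */
            printf("writeback ran: %d\n", writepages(false));   /* 0 */
            pthread_rwlock_unlock(&validate_lock);
            printf("writeback ran: %d\n", writepages(false));   /* 1 */
            return 0;
        }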
     

25 Sep, 2020

1 commit

  • Set up a readahead size by default, as very few users have a good
    reason to change it. This means coda, ecryptfs, and orangefs now
    set up the values while they were previously missing it, while ubifs,
    mtd and vboxsf manually set it to 0 to avoid readahead.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Acked-by: David Sterba [btrfs]
    Acked-by: Richard Weinberger [ubifs, mtd]
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

04 Sep, 2020

1 commit

  • Pull networking fixes from David Miller:

    1) Use netif_rx_ni() when necessary in batman-adv stack, from Jussi
    Kivilinna.

    2) Fix loss of RTT samples in rxrpc, from David Howells.

    3) Memory leak in hns_nic_dev_probe(), from Dinghao Liu.

    4) ravb module cannot be unloaded, fix from Yuusuke Ashizuka.

    5) We disable BH for too long in sctp_get_port_local(), add a
    cond_resched() here as well, from Xin Long.

    6) Fix memory leak in st95hf_in_send_cmd, from Dinghao Liu.

    7) Out of bound access in bpf_raw_tp_link_fill_link_info(), from
    Yonghong Song.

    8) Missing of_node_put() in mt7530 DSA driver, from Sumera
    Priyadarsini.

    9) Fix crash in bnxt_fw_reset_task(), from Michael Chan.

    10) Fix geneve tunnel checksumming bug in hns3, from Yi Li.

    11) Memory leak in rxkad_verify_response, from Dinghao Liu.

    12) In tipc, don't use smp_processor_id() in preemptible context. From
    Tuong Lien.

    13) Fix signedness issue in mlx4 memory allocation, from Shung-Hsi Yu.

    14) Missing clk_disable_prepare() in gemini driver, from Dan Carpenter.

    15) Fix ABI mismatch between driver and firmware in nfp, from Louis
    Peens.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (110 commits)
    net/smc: fix sock refcounting in case of termination
    net/smc: reset sndbuf_desc if freed
    net/smc: set rx_off for SMCR explicitly
    net/smc: fix toleration of fake add_link messages
    tg3: Fix soft lockup when tg3_reset_task() fails.
    doc: net: dsa: Fix typo in config code sample
    net: dp83867: Fix WoL SecureOn password
    nfp: flower: fix ABI mismatch between driver and firmware
    tipc: fix shutdown() of connectionless socket
    ipv6: Fix sysctl max for fib_multipath_hash_policy
    drivers/net/wan/hdlc: Change the default of hard_header_len to 0
    net: gemini: Fix another missing clk_disable_unprepare() in probe
    net: bcmgenet: fix mask check in bcmgenet_validate_flow()
    amd-xgbe: Add support for new port mode
    net: usb: dm9601: Add USB ID of Keenetic Plus DSL
    vhost: fix typo in error message
    net: ethernet: mlx4: Fix memory allocation in mlx4_buddy_init()
    pktgen: fix error message with wrong function name
    net: ethernet: ti: am65-cpsw: fix rmii 100Mbit link mode
    cxgb4: fix thermal zone device registration
    ...

    Linus Torvalds
     

28 Aug, 2020

2 commits

  • David Howells says:

    ====================
    rxrpc, afs: Fix probing issues

    Here are some fixes for rxrpc and afs to fix issues in the RTT measuring in
    rxrpc and thence the Volume Location server probing in afs:

    (1) Move the serial number of a received ACK into a local variable to
    simplify the next patch.

    (2) Fix the loss of RTT samples due to extra interposed ACKs causing
    baseline information to be discarded too early. This is a particular
    problem for afs when it sends a single very short call to probe a
    server it hasn't talked to recently.

    (3) Fix rxrpc_kernel_get_srtt() to indicate whether it actually has seen
    any valid samples or not.

    (4) Remove a field that's set/woken, but never read/waited on.

    (5) Expose the RTT and other probe information through procfs to make
    debugging of this stuff easier.

    (6) Fix VL rotation in afs to only use summary information from VL probing
    and not the probe running state (which gets clobbered when next a
    probe is issued).

    (7) Fix VL rotation to actually return the error aggregated from the probe
    errors.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • The fall through annotation comes after a return statement so it's not
    reachable.

    Signed-off-by: Dan Carpenter
    Signed-off-by: Gustavo A. R. Silva

    Dan Carpenter
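
    A generic illustration of the pattern being removed (not the afs source
    itself): a fall-through annotation placed after a return can never be
    reached, so it is simply deleted.

        #include <stdio.h>

        static const char *describe(int code)
        {
            switch (code) {
            case 0:
                return "ok";
                /* fallthrough */       /* unreachable: the return above has
                                         * already left the function, so the
                                         * annotation should be dropped      */
            case 1:
                return "retry";
            default:
                return "error";
            }
        }

        int main(void)
        {
            printf("%s %s %s\n", describe(0), describe(1), describe(2));
            return 0;
        }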
     
