20 Aug, 2016

1 commit

  • …rnel/git/dgc/linux-xfs

    Pull xfs and iomap fixes from Dave Chinner:
    "Changes in this update:

    Regression fixes for XFS changes introduced in 4.8-rc1:
    - buffer IO accounting assert failure
    - ENOSPC block accounting reservation issue
    - DAX IO path page cache invalidation fix
    - rmapbt on-disk block count in agf
    - correct classification of rmap block type when updating AGFL.
    - iomap support for attribute fork mapping

    Regression fixes for iomap infrastructure in 4.8-rc1:
    - fiemap: honor FIEMAP_FLAG_SYNC
    - fiemap: implement FIEMAP_FLAG_XATTR support to fix XFS regression
    - make mark_page_accessed and pagefault_disable usage consistent with
    other IO paths"

    * tag 'xfs-iomap-for-linus-4.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
    xfs: remove OWN_AG rmap when allocating a block from the AGFL
    xfs: (re-)implement FIEMAP_FLAG_XATTR
    xfs: simplify xfs_file_iomap_begin
    iomap: mark ->iomap_end as optional
    iomap: prepare iomap_fiemap for attribute mappings
    iomap: fiemap should honor the FIEMAP_FLAG_SYNC flag
    iomap: remove superflous pagefault_disable from iomap_write_actor
    iomap: remove superflous mark_page_accessed from iomap_write_actor
    xfs: store rmapbt block count in the AGF
    xfs: don't invalidate whole file on DAX read/write
    xfs: fix bogus space reservation in xfs_iomap_write_allocate
    xfs: don't assert fail on non-async buffers on ioacct decrement

    Linus Torvalds
     

18 Aug, 2016

1 commit

  • Pull networking fixes from David Miller:

    1) Buffered powersave frame test is reversed in cfg80211, fix from Felix
    Fietkau.

    2) Remove bogus WARN_ON in openvswitch, from Jarno Rajahalme.

    3) Fix some tg3 ethtool logic bugs, and one that would cause no
    interrupts to be generated when rx-coalescing is set to 0. From
    Satish Baddipadige and Siva Reddy Kallam.

    4) QLCNIC mailbox corruption and napi budget handling fix from Manish
    Chopra.

    5) Fix fib_trie logic when walking the trie during /proc/net/route
    output that can access a stale node pointer. From David Forster.

    6) Several sctp_diag fixes from Phil Sutter.

    7) PAUSE frame handling fixes in mlxsw driver from Ido Schimmel.

    8) Checksum fixup fixes in bpf from Daniel Borkmann.

    9) Memory leaks in nfnetlink, from Liping Zhang.

    10) Use after free in rxrpc, from David Howells.

    11) Use after free in new skb_array code of macvtap driver, from Jason
    Wang.

    12) Calipso resource leak, from Colin Ian King.

    13) mediatek bug fixes (missing stats sync init, etc.) from Sean Wang.

    14) Fix bpf non-linear packet write helpers, from Daniel Borkmann.

    15) Fix lockdep splats in macsec, from Sabrina Dubroca.

    16) hv_netvsc bug fixes from Vitaly Kuznetsov, mostly to do with VF
    handling.

    17) Various tc-action bug fixes, from CONG Wang.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (116 commits)
    net_sched: allow flushing tc police actions
    net_sched: unify the init logic for act_police
    net_sched: convert tcf_exts from list to pointer array
    net_sched: move tc offload macros to pkt_cls.h
    net_sched: fix a typo in tc_for_each_action()
    net_sched: remove an unnecessary list_del()
    net_sched: remove the leftover cleanup_a()
    mlxsw: spectrum: Allow packets to be trapped from any PG
    mlxsw: spectrum: Unmap 802.1Q FID before destroying it
    mlxsw: spectrum: Add missing rollbacks in error path
    mlxsw: reg: Fix missing op field fill-up
    mlxsw: spectrum: Trap loop-backed packets
    mlxsw: spectrum: Add missing packet traps
    mlxsw: spectrum: Mark port as active before registering it
    mlxsw: spectrum: Create PVID vPort before registering netdevice
    mlxsw: spectrum: Remove redundant errors from the code
    mlxsw: spectrum: Don't return upon error in removal path
    i40e: check for and deal with non-contiguous TCs
    ixgbe: Re-enable ability to toggle VLAN filtering
    ixgbe: Force VLNCTRL.VFE to be set in all VMDq paths
    ...

    Linus Torvalds
     

17 Aug, 2016

13 commits

  • Dave Chinner
     
  • When we're really tight on space, xfs_alloc_ag_vextent_small() can
    allocate a block from the AGFL and give it to the caller. Since the
    caller is never the AGFL-fixing method, we must remove the OWN_AG
    reverse mapping because it will clash with whatever rmap the caller
    wants to set up. This bug was discovered by running generic/299
    repeatedly.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Use a special read-only iomap_ops implementation to support fiemap on
    the attr fork.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • We'll never get nimap == 0 for a successful return from xfs_bmapi_read,
    so don't try to handle it.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • No need to implement it for read-only mappings.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • By passing through an -ENOENT, similar to the old XFS implementation of
    FIEMAP_FLAG_XATTR.

    Signed-off-by: Dave Chinner
    [hch: split from a larger patch]
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • The flag is checked as supported, but then we do an unconditional
    sync of the file, regardless of whether the flag is set or not. Make
    the sync conditional on having the FIEMAP_FLAG_SYNC flag set.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
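
The FIEMAP_FLAG_SYNC change above comes down to validating the flags and then syncing only on request. The following is a minimal user-space sketch of that control flow, not the kernel code: the `sketch_*` names are hypothetical, the sync step is a counter standing in for real writeback, and only the two flag values are taken from `<linux/fiemap.h>`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values mirror the FIEMAP_FLAG_* definitions in <linux/fiemap.h>. */
#define SKETCH_FIEMAP_FLAG_SYNC  0x00000001u
#define SKETCH_FIEMAP_FLAG_XATTR 0x00000002u

/* Hypothetical stand-in for a file; counts how often it was synced. */
struct sketch_file {
	int sync_calls;
};

/* Hypothetical stand-in for the writeback-and-wait step. */
static int sketch_sync_file(struct sketch_file *f)
{
	f->sync_calls++;
	return 0;
}

/*
 * Reject unsupported flags first, then sync only when the caller
 * asked for it. Returns 0 on success, -1 for unsupported flags.
 */
static int sketch_fiemap_prepare(struct sketch_file *f, uint32_t flags)
{
	const uint32_t supported =
		SKETCH_FIEMAP_FLAG_SYNC | SKETCH_FIEMAP_FLAG_XATTR;

	assert(f != NULL);
	if (flags & ~supported)
		return -1;
	if (flags & SKETCH_FIEMAP_FLAG_SYNC)
		return sketch_sync_file(f);
	return 0;
}
```

Calling `sketch_fiemap_prepare()` with no flags leaves `sync_calls` at zero, which is the behaviour the fix restores.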
     
  • iov_iter_copy_from_user_atomic disables page faults internally, no need to
    do it around the call. This also brings the iomap code in line with
    the original filemap version.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • This catches up with commit 2457ae ("mm: non-atomically mark page
    accessed during page cache allocation where possible"), which
    moved the initial access marking into the pagecache allocator.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • Track the number of blocks used for the rmapbt in the AGF. When we
    get to the AG reservation code we need this counter to quickly
    make our reservation during mount.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • When we do DAX IO, we try to invalidate the entire page cache held
    on the file. This is incorrect as it will trash the entire mapping
    tree that now tracks dirty state in exceptional entries in the radix
    tree slots.

    What we are trying to do is remove cached pages (e.g. from reads
    into holes) that sit in the radix tree over the range we are about
    to write to. Hence we should just limit the invalidation to the
    range we are about to overwrite.

    Reported-by: Jan Kara
    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
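
Limiting the invalidation to the range being overwritten, as the DAX commit above describes, means turning a byte range into a page-index range. The sketch below shows only that arithmetic; the helper name, the struct and the 4 KiB page size are assumptions made for illustration and do not come from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumes 4 KiB pages */

struct sketch_page_range {
	uint64_t first; /* index of the first page touched */
	uint64_t last;  /* index of the last page touched (inclusive) */
};

/*
 * Map a write of 'count' bytes at byte offset 'pos' onto the inclusive
 * range of page indexes it overlaps. Only these pages would need
 * invalidating before the write, rather than the whole mapping.
 */
static struct sketch_page_range sketch_write_page_range(uint64_t pos,
							 uint64_t count)
{
	struct sketch_page_range r;

	assert(count > 0); /* an empty write touches no pages */
	r.first = pos >> SKETCH_PAGE_SHIFT;
	r.last = (pos + count - 1) >> SKETCH_PAGE_SHIFT;
	return r;
}
```

A two-byte write at offset 4095 straddles a page boundary, so it maps to pages 0 through 1; everything else in the file is left alone.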
     
  • The space reservation was added without an explanation in commit

    "Add error reporting calls in error paths that return EFSCORRUPTED"

    back in 2003. There is no reason to reserve disk blocks in the
    transaction when allocating blocks for delalloc space as we already
    reserved the space when creating the delalloc extent.

    With this fix we stop running out of the reserved pool in
    generic/229, which has happened for a long time with small blocksize
    file systems, and has increased in severity with the new buffered
    write path.

    [ dchinner: we still need to pass the block reservation into
    xfs_bmapi_write() to ensure we don't deadlock during AG selection.
    See commit dbd5c8c ("xfs: pass total block res. as total
    xfs_bmapi_write() parameter") for more details on why this is
    necessary. ]

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • The buffer I/O accounting mechanism tracks async buffers under I/O. As
    an optimization, the buffer I/O count is incremented only once on the
    first async I/O for a given hold cycle of a buffer and decremented once
    the buffer is released to the LRU (or freed).

    xfs_buf_ioacct_dec() has an ASSERT() check for an XBF_ASYNC buffer, but
    we have one or two corner cases where a buffer can be submitted for I/O
    multiple times via different methods in a single hold cycle. If an async
    I/O occurs first, the I/O count is incremented. If a sync I/O occurs
    before the hold count drops, XBF_ASYNC is cleared by the time the I/O
    count is decremented.

    Remove the async assert check from xfs_buf_ioacct_dec() as this is a
    perfectly valid scenario. For the purposes of I/O accounting, we really
    only care about the buffer async state at I/O submission time.

    Discovered-and-analyzed-by: Dave Chinner
    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
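
The hold-cycle accounting described in the buffer I/O commit above can be modelled in a few lines. This is a simplified user-space sketch under stated assumptions: the flag names, structs and `sketch_submit()` helper are hypothetical, and a sync submission is modelled as simply clearing the async flag.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_ASYNC     0x1u /* I/O was submitted asynchronously */
#define SKETCH_IN_FLIGHT 0x2u /* buffer is currently counted in io_count */

struct sketch_buftarg {
	int io_count;
};

struct sketch_buf {
	struct sketch_buftarg *target;
	unsigned int flags;
};

/* Submission time: count the buffer once per hold cycle, async only. */
static void sketch_ioacct_inc(struct sketch_buf *bp)
{
	if (!(bp->flags & SKETCH_ASYNC) || (bp->flags & SKETCH_IN_FLIGHT))
		return;
	bp->flags |= SKETCH_IN_FLIGHT;
	bp->target->io_count++;
}

/* Release time: deliberately makes no assumption about SKETCH_ASYNC. */
static void sketch_ioacct_dec(struct sketch_buf *bp)
{
	if (!(bp->flags & SKETCH_IN_FLIGHT))
		return;
	bp->flags &= ~SKETCH_IN_FLIGHT;
	assert(bp->target->io_count > 0);
	bp->target->io_count--;
}

/* Submit I/O; a sync submission clears the async flag. */
static void sketch_submit(struct sketch_buf *bp, bool async)
{
	if (async)
		bp->flags |= SKETCH_ASYNC;
	else
		bp->flags &= ~SKETCH_ASYNC;
	sketch_ioacct_inc(bp);
}
```

An async submission followed by a sync one leaves the buffer counted but no longer flagged async, so a decrement that asserted on the async flag would trip even though the accounting is balanced.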
     

14 Aug, 2016

1 commit

  • Pull block fixes from Jens Axboe:

    - an NVMe fix from Gabriel, fixing a suspend/resume issue on some
    setups

    - addition of a few missing entries in the block queue sysfs
    documentation, from Joe

    - a fix for a sparse shadow warning for the bvec iterator, from
    Johannes

    - a fix for a writeback deadlock involving raid issuing barriers, and
    not flushing the plug when we wake up the flusher threads. From
    Konstantin

    - a set of patches for the NVMe target/loop/rdma code, from Roland and
    Sagi

    * 'for-linus' of git://git.kernel.dk/linux-block:
    bvec: avoid variable shadowing warning
    doc: update block/queue-sysfs.txt entries
    nvme: Suspend all queues before deletion
    mm, writeback: flush plugged IO in wakeup_flusher_threads()
    nvme-rdma: Remove unused includes
    nvme-rdma: start async event handler after reconnecting to a controller
    nvmet: Fix controller serial number inconsistency
    nvmet-rdma: Don't use the inline buffer in order to avoid allocation for small reads
    nvmet-rdma: Correctly handle RDMA device hot removal
    nvme-rdma: Make sure to shutdown the controller if we can
    nvme-loop: Remove duplicate call to nvme_remove_namespaces
    nvme-rdma: Free the I/O tags when we delete the controller
    nvme-rdma: Remove duplicate call to nvme_remove_namespaces
    nvme-rdma: Fix device removal handling
    nvme-rdma: Queue ns scanning after a sucessful reconnection
    nvme-rdma: Don't leak uninitialized memory in connect request private data

    Linus Torvalds
     

13 Aug, 2016

3 commits

  • Pull nfsd fixes from Bruce Fields:
    "Fixes for the dentry refcounting leak I introduced in 4.8-rc1, and for
    races in the LOCK code which appear to go back to the big nfsd state
    lock removal from 3.17"

    * tag 'nfsd-4.8-1' of git://linux-nfs.org/~bfields/linux:
    nfsd: don't return an unhashed lock stateid after taking mutex
    nfsd: Fix race between FREE_STATEID and LOCK
    nfsd: fix dentry refcounting on create

    Linus Torvalds
     
  • nfsd4_lock will take the st_mutex before working with the stateid it
    gets, but between the time when we drop the cl_lock and take the mutex,
    the stateid could become unhashed (à la FREE_STATEID). If that happens
    the lock stateid returned to the client will be forgotten.

    Fix this by first moving the st_mutex acquisition into
    lookup_or_create_lock_state. Then, have it check to see if the lock
    stateid is still hashed after taking the mutex. If it's not, then put
    the stateid and try the find/create again.

    Signed-off-by: Jeff Layton
    Tested-by: Alexey Kodanev
    Cc: stable@vger.kernel.org # feb9dad5 nfsd: Always lock state exclusively.
    Cc: stable@vger.kernel.org
    Signed-off-by: J. Bruce Fields

    Jeff Layton
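
The "take the mutex, re-check that the stateid is still hashed, otherwise put it and retry" pattern from the commit above can be illustrated with a single-threaded model. Everything here is hypothetical scaffolding: the one-slot table, the `race_window` hook that simulates a concurrent FREE_STATEID, and the boolean standing in for st_mutex. It shows the shape of the loop, not nfsd's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_stateid {
	bool hashed; /* still reachable through the lookup table */
	bool locked; /* stands in for st_mutex being held */
	int refcount;
};

/* Hypothetical single-slot "table" plus a small pool to allocate from. */
struct sketch_table {
	struct sketch_stateid *slot;
	struct sketch_stateid pool[4];
	size_t used;
	/* Hook that runs in the window between lookup and locking. */
	void (*race_window)(struct sketch_table *);
};

/* Return the hashed stateid, creating a fresh one if none is hashed. */
static struct sketch_stateid *sketch_find_or_create(struct sketch_table *t)
{
	if (t->slot == NULL || !t->slot->hashed) {
		assert(t->used < 4);
		t->slot = &t->pool[t->used++];
		t->slot->hashed = true;
		t->slot->locked = false;
		t->slot->refcount = 0;
	}
	t->slot->refcount++;
	return t->slot;
}

/* Look up (or create) a stateid and return it locked and still hashed. */
static struct sketch_stateid *
sketch_lookup_or_create_locked(struct sketch_table *t)
{
	for (;;) {
		struct sketch_stateid *st = sketch_find_or_create(t);

		if (t->race_window)
			t->race_window(t);

		st->locked = true; /* take the mutex */
		if (st->hashed)
			return st; /* still valid: hand it back locked */

		st->locked = false; /* lost the race: unlock, put, retry */
		st->refcount--;
	}
}

/* Example hook: unhash the current stateid exactly once. */
static void sketch_unhash_once(struct sketch_table *t)
{
	t->slot->hashed = false;
	t->race_window = NULL;
}
```

With `sketch_unhash_once` installed, the first candidate is discarded and the caller receives a second, freshly hashed stateid instead of one that has already been freed.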
     
  • Pull NFS client bugfixes from Trond Myklebust:
    "Highlights include:

    - Stable patch from Olga to fix RPCSEC_GSS upcalls when the same user
    needs multiple different security services (e.g. krb5i and krb5p).

    - Stable patch to fix a regression introduced by the use of
    SO_REUSEPORT, and that prevented the use of multiple different NFS
    versions to the same server.

    - TCP socket reconnection timer fixes.

    - Patch from Neil to disable the use of IPv6 temporary addresses"

    * tag 'nfs-for-4.8-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
    NFSv4: Cap the transport reconnection timer at 1/2 lease period
    NFSv4: Cleanup the setting of the nfs4 lease period
    SUNRPC: Limit the reconnect backoff timer to the max RPC message timeout
    SUNRPC: Fix reconnection timeouts
    NFSv4.2: LAYOUTSTATS may return NFS4ERR_ADMIN/DELEG_REVOKED
    SUNRPC: disable the use of IPv6 temporary addresses.
    SUNRPC: allow for upcalls for same uid but different gss service
    SUNRPC: Fix up socket autodisconnect
    SUNRPC: Handle EADDRNOTAVAIL on connection failures

    Linus Torvalds
     

12 Aug, 2016

4 commits

  • Merge fixes from Andrew Morton:
    "7 fixes"

    * emailed patches from Andrew Morton:
    mm/memory_hotplug.c: initialize per_cpu_nodestats for hotadded pgdats
    mm, oom: fix uninitialized ret in task_will_free_mem()
    kasan: remove the unnecessary WARN_ONCE from quarantine.c
    mm: memcontrol: fix memcg id ref counter on swap charge move
    mm: memcontrol: fix swap counter leak on swapout from offline cgroup
    proc, meminfo: use correct helpers for calculating LRU sizes in meminfo
    mm/hugetlb: fix incorrect hugepages count during mem hotplug

    Linus Torvalds
     
  • meminfo_proc_show() and si_mem_available() are using the wrong helpers
    for calculating the size of the LRUs. The user-visible impact is that
    there appears to be an abnormally high number of unevictable pages.

    Link: http://lkml.kernel.org/r/20160805105805.GR2799@techsingularity.net
    Signed-off-by: Mel Gorman
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Pull ceph fixes from Ilya Dryomov:
    "A patch for a NULL dereference bug introduced in 4.8-rc1 and a handful
    of static checker fixes"

    * tag 'ceph-for-4.8-rc2' of https://github.com/ceph/ceph-client:
    ceph: initialize pathbase in the !dentry case in encode_caps_cb()
    rbd: nuke the 32-bit pool id check
    rbd: destroy header_oloc in rbd_dev_release()
    ceph: fix null pointer dereference in ceph_flush_snaps()
    libceph: using kfree_rcu() to simplify the code
    libceph: make cancel_generic_request() static
    libceph: fix return value check in alloc_msg_with_page_vector()

    Linus Torvalds
     
  • When running LTP's nfslock01 test, the Linux client can send a LOCK
    and a FREE_STATEID request at the same time. The outcome is:

    Frame 324 R OPEN stateid [2,O]

    Frame 115004 C LOCK lockowner_is_new stateid [2,O] offset 672000 len 64
    Frame 115008 R LOCK stateid [1,L]
    Frame 115012 C WRITE stateid [0,L] offset 672000 len 64
    Frame 115016 R WRITE NFS4_OK
    Frame 115019 C LOCKU stateid [1,L] offset 672000 len 64
    Frame 115022 R LOCKU NFS4_OK
    Frame 115025 C FREE_STATEID stateid [2,L]
    Frame 115026 C LOCK lockowner_is_new stateid [2,O] offset 672128 len 64
    Frame 115029 R FREE_STATEID NFS4_OK
    Frame 115030 R LOCK stateid [3,L]
    Frame 115034 C WRITE stateid [0,L] offset 672128 len 64
    Frame 115038 R WRITE NFS4ERR_BAD_STATEID

    In other words, the server returns stateid L in a successful LOCK
    reply, but it has already released it. Subsequent uses of stateid L
    fail.

    To address this, protect the generation check in nfsd4_free_stateid
    with the st_mutex. This should guarantee that only one of two
    outcomes occurs: either LOCK returns a fresh valid stateid, or
    FREE_STATEID returns NFS4ERR_LOCKS_HELD.

    Reported-by: Alexey Kodanev
    Fix-suggested-by: Jeff Layton
    Signed-off-by: Chuck Lever
    Tested-by: Alexey Kodanev
    Cc: stable@vger.kernel.org
    Signed-off-by: J. Bruce Fields

    Chuck Lever
     

11 Aug, 2016

2 commits

  • b44061d0b9 introduced a dentry ref counting bug. Previously we were
    grabbing one ref to dchild in nfsd_create(), but with the creation of
    nfsd_create_locked() we have a ref for dchild from the lookup in
    nfsd_create(), and then another ref in nfsd_create_locked(). The ref
    from the lookup in nfsd_create() is never dropped and results in
    dentries still in use at unmount.

    Signed-off-by: Josef Bacik
    Fixes: b44061d0b9 "nfsd: reorganize nfsd_create"
    Reported-by: kernel test robot
    Reviewed-by: Jeff Layton
    Acked-by: Al Viro
    Signed-off-by: J. Bruce Fields

    Josef Bacik
     
  • Pull btrfs fixes from Chris Mason:
    "Some fixes for btrfs send/recv and fsync from Filipe and Robbie Ko.

    Bonus points to Filipe for already having xfstests in place for many
    of these"

    * 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Btrfs: remove unused function btrfs_add_delayed_qgroup_reserve()
    Btrfs: improve performance on fsync against new inode after rename/unlink
    Btrfs: be more precise on errors when getting an inode from disk
    Btrfs: send, don't bug on inconsistent snapshots
    Btrfs: send, avoid incorrect leaf accesses when sending utimes operations
    Btrfs: send, fix invalid leaf accesses due to incorrect utimes operations
    Btrfs: send, fix warning due to late freeing of orphan_dir_info structures
    Btrfs: incremental send, fix premature rmdir operations
    Btrfs: incremental send, fix invalid paths for rename operations
    Btrfs: send, add missing error check for calls to path_loop()
    Btrfs: send, fix failure to move directories with the same name around
    Btrfs: add missing check for writeback errors on fsync

    Linus Torvalds
     

10 Aug, 2016

2 commits

  • I've found a funny live-lock between raid10 barriers during resync and
    memory controller hard limits. Inside mpage_readpages() the task holds
    on to its plugged bio, which blocks the barrier in raid10. Its memory
    cgroup has no free memory, so the task goes into reclaim, but all
    reclaimable pages are dirty and cannot be written because raid10 is
    rebuilding and stuck on the barrier.

    The usual flush of such IO in schedule() never happens, because the
    caller doesn't go to sleep.

    The lock is 'live' because changing the memory limit or killing the
    tasks which hold that stuck bio unblocks all progress.

    That was what happened in 3.18.x but I see no difference in upstream
    logic. Theoretically this might happen even without memory cgroup.

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Jens Axboe

    Konstantin Khlebnikov
     
  • To distinguish non-slab pages charged to kmemcg we mark them PageKmemcg,
    which sets page->_mapcount to -512. Currently, we set/clear PageKmemcg
    in __alloc_pages_nodemask()/free_pages_prepare() for any page allocated
    with __GFP_ACCOUNT, including those that aren't actually charged to any
    cgroup, i.e. allocated from the root cgroup context. To avoid overhead
    in case cgroups are not used, we only do that if memcg_kmem_enabled() is
    true. The latter is set iff there are kmem-enabled memory cgroups
    (online or offline). The root cgroup is not considered kmem-enabled.

    As a result, if a page is allocated with __GFP_ACCOUNT for the root
    cgroup when there are kmem-enabled memory cgroups and is freed after all
    kmem-enabled memory cgroups were removed, e.g.

    # no memory cgroups have been created yet, create one
    mkdir /sys/fs/cgroup/memory/test
    # run something allocating pages with __GFP_ACCOUNT, e.g.
    # a program using pipe
    dmesg | tail
    # remove the memory cgroup
    rmdir /sys/fs/cgroup/memory/test

    we'll get a bad page state bug complaining about page->_mapcount != -1:

    BUG: Bad page state in process swapper/0 pfn:1fd945c
    page:ffffea007f651700 count:0 mapcount:-511 mapping: (null) index:0x0
    flags: 0x1000000000000000()

    To avoid that, let's mark with PageKmemcg only those pages that are
    actually charged to and hence pin a non-root memory cgroup.

    Fixes: 4949148ad433 ("mm: charge/uncharge kmemcg from generic page allocator paths")
    Reported-and-tested-by: Eric Dumazet
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
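
The PageKmemcg failure above depends on marking and unmarking being gated by a global switch that can change between allocation and free. The model below contrasts that with marking only pages that are actually charged. It is a user-space sketch with hypothetical names; only the marker value -512 and the expected free value -1 are taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAPCOUNT_FREE   (-1)   /* value the free path expects */
#define SKETCH_MAPCOUNT_KMEMCG (-512) /* PageKmemcg marker from the text */

struct sketch_page {
	int mapcount;
	bool charged; /* actually charged to a non-root cgroup */
};

/* Old scheme: mark whenever kmem accounting is globally enabled. */
static void sketch_alloc_old(struct sketch_page *p, bool kmem_enabled,
			     bool root_cgroup)
{
	p->mapcount = SKETCH_MAPCOUNT_FREE;
	p->charged = kmem_enabled && !root_cgroup;
	if (kmem_enabled)
		p->mapcount = SKETCH_MAPCOUNT_KMEMCG;
}

/* Old scheme: the marker is cleared only if the switch is still on. */
static void sketch_free_old(struct sketch_page *p, bool kmem_enabled)
{
	if (kmem_enabled)
		p->mapcount = SKETCH_MAPCOUNT_FREE;
}

/* New scheme: the marker follows the charge, not the global switch. */
static void sketch_alloc_new(struct sketch_page *p, bool kmem_enabled,
			     bool root_cgroup)
{
	p->mapcount = SKETCH_MAPCOUNT_FREE;
	p->charged = kmem_enabled && !root_cgroup;
	if (p->charged)
		p->mapcount = SKETCH_MAPCOUNT_KMEMCG;
}

/* New scheme: a marked page is by construction a charged page. */
static void sketch_free_new(struct sketch_page *p)
{
	if (p->mapcount == SKETCH_MAPCOUNT_KMEMCG) {
		assert(p->charged);
		p->charged = false; /* uncharge */
		p->mapcount = SKETCH_MAPCOUNT_FREE;
	}
}

/* The check that produces the "Bad page state" report. */
static bool sketch_bad_page_state(const struct sketch_page *p)
{
	return p->mapcount != SKETCH_MAPCOUNT_FREE;
}
```

In the old scheme a root-cgroup page allocated while the switch is on and freed after it goes off keeps the -512 marker. If the report prints `_mapcount + 1`, that would match the `mapcount:-511` line in the log above.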
     

09 Aug, 2016

2 commits


08 Aug, 2016

2 commits

  • Since commit 63a4cc24867d, bio->bi_rw contains flags in the lower
    portion and the op code in the higher portions. This means that
    old code that relies on manually setting bi_rw is most likely
    going to be broken. Instead of letting that brokenness linger,
    rename the member, to force old and out-of-tree code to break
    at compile time instead of at runtime.

    No intended functional changes in this commit.

    Signed-off-by: Jens Axboe

    Jens Axboe
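
To see why code that sets the field by hand breaks silently once the op code lives in the high bits, consider the packing sketch below. The 3-bit op width, the 32-bit field and all `sketch_*` names are assumptions chosen for illustration; the commit message only states that flags sit in the lower portion and the op in the higher portion.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: 3 op bits at the top of a 32-bit field. */
#define SKETCH_OP_BITS   3u
#define SKETCH_OP_SHIFT  (32u - SKETCH_OP_BITS)
#define SKETCH_FLAG_MASK ((UINT32_C(1) << SKETCH_OP_SHIFT) - 1)

enum sketch_op {
	SKETCH_OP_READ = 0,
	SKETCH_OP_WRITE = 1
};

/* Combine an op code and modifier flags into one field. */
static uint32_t sketch_pack(enum sketch_op op, uint32_t flags)
{
	assert((flags & ~SKETCH_FLAG_MASK) == 0); /* flags must fit below the op */
	return ((uint32_t)op << SKETCH_OP_SHIFT) | flags;
}

/* Extract the op code from the high bits. */
static enum sketch_op sketch_op_of(uint32_t field)
{
	return (enum sketch_op)(field >> SKETCH_OP_SHIFT);
}

/* Extract the modifier flags from the low bits. */
static uint32_t sketch_flags_of(uint32_t field)
{
	return field & SKETCH_FLAG_MASK;
}
```

Legacy code that stores a bare `1` to mean "write" would now decode as a read with a stray flag bit set, which still compiles. Renaming the member turns that into a build failure instead.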
     
  • Commit abf545484d31 changed it from an 'rw' flags type to the
    newer ops based interface, but now we're effectively leaking
    some bdev internals to the rest of the kernel. Since we only
    care about whether it's a read or a write at that level, just
    pass in a bool 'is_write' parameter instead.

    Then we can also move op_is_write() and friends back under
    CONFIG_BLOCK protection.

    Reviewed-by: Mike Christie
    Signed-off-by: Jens Axboe

    Jens Axboe
     

07 Aug, 2016

3 commits

  • Pull binfmt_misc update from James Bottomley:
    "This update is to allow architecture emulation containers to function
    such that the emulation binary can be housed outside the container
    itself. The container and fs parts both have acks from relevant
    experts.

    To use the new feature you have to add an F option to your binfmt_misc
    configuration"

    From the docs:
    "The usual behaviour of binfmt_misc is to spawn the binary lazily when
    the misc format file is invoked. However, this doesn't work very well
    in the face of mount namespaces and changeroots, so the F mode opens
    the binary as soon as the emulation is installed and uses the opened
    image to spawn the emulator, meaning it is always available once
    installed, regardless of how the environment changes"

    * tag 'binfmt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/binfmt_misc:
    binfmt_misc: add F option description to documentation
    binfmt_misc: add persistent opened binary handler for containers
    fs: add filp_clone_open API

    Linus Torvalds
     
  • In most cases, EPERM is returned on an immutable inode, and there are
    only a few places returning EACCES. I noticed this when running LTP on
    overlayfs: setxattr03 failed due to an unexpected EACCES on an
    immutable inode.

    So convert all EACCES to EPERM on immutable inodes.

    Acked-by: Dave Chinner
    Signed-off-by: Eryu Guan
    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Eryu Guan
     
  • Pull more vfs updates from Al Viro:
    "Assorted cleanups and fixes.

    In the "trivial API change" department - ->d_compare() losing 'parent'
    argument"

    * 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    cachefiles: Fix race between inactivating and culling a cache object
    9p: use clone_fid()
    9p: fix braino introduced in "9p: new helper - v9fs_parent_fid()"
    vfs: make dentry_needs_remove_privs() internal
    vfs: remove file_needs_remove_privs()
    vfs: fix deadlock in file_remove_privs() on overlayfs
    get rid of 'parent' argument of ->d_compare()
    cifs, msdos, vfat, hfs+: don't bother with parent in ->d_compare()
    affs ->d_compare(): don't bother with ->d_inode
    fold _d_rehash() and __d_rehash() together
    fold dentry_rcuwalk_invalidate() into its only remaining caller

    Linus Torvalds
     

06 Aug, 2016

6 commits

  • …nel/git/dgc/linux-xfs

    Pull more xfs updates from Dave Chinner:
    "This is the second part of the XFS updates for this merge cycle, and
    contains the new reverse block mapping feature for XFS.

    Reverse mapping allows us to track the owner of a specific block on
    disk precisely. It is implemented as a set of btrees (one per
    allocation group) that track the owners of allocated extents.
    Effectively it is a "used space tree" that is updated when we allocate
    or free extents. i.e. it is coherent with the free space btrees we
    already maintain and never overlaps with them.

    This reverse mapping infrastructure is the building block of several
    upcoming features - reflink, copy-on-write data, dedupe, online
    metadata and data scrubbing, highly accurate bad sector/data loss
    reporting to users, and significantly improved reconstruction of
    damaged and corrupted filesystems. There's a lot of new stuff coming
    along in the next couple of cycles, and it all builds on the rmap
    infrastructure.

    As such, it's a huge chunk of new code with new on-disk format
    features and internal infrastructure. It warns at mount time as an
    experimental feature and that it may eat data (as we do with all new
    on-disk features until they stabilise). We have not released
    userspace support for it yet - userspace support currently requires a
    download from Darrick's xfsprogs repo and a build from source, so
    access to this feature is really developer/tester only at this point.
    Initial userspace support will be released at the same time as the
    kernel with this code in it.

    The new rmap enabled code regresses 3 xfstests - all are ENOSPC
    related corner cases, one of which Darrick posted a fix for a few
    hours ago. The other two are fixed by infrastructure that is part of
    the upcoming reflink patchset. This new ENOSPC infrastructure
    requires an on-disk format tweak to keep mount times in
    check - we need to keep an on-disk count of allocated rmapbt blocks so
    we don't have to scan the entire btrees at mount time to count them.

    This is currently being tested and will be part of the fixes sent in
    the next week or two so users will not be exposed to this change"

    * tag 'xfs-rmap-for-linus-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (52 commits)
    xfs: move (and rename) the deferred bmap-free tracepoints
    xfs: collapse single use static functions
    xfs: remove unnecessary parentheses from log redo item recovery functions
    xfs: remove the extents array from the rmap update done log item
    xfs: in btree_lshift, only allocate temporary cursor when needed
    xfs: remove unnecesary lshift/rshift key initialization
    xfs: remove the get*keys and update_keys btree ops pointers
    xfs: enable the rmap btree functionality
    xfs: don't update rmapbt when fixing agfl
    xfs: disable XFS_IOC_SWAPEXT when rmap btree is enabled
    xfs: add rmap btree block detection to log recovery
    xfs: add rmap btree geometry feature flag
    xfs: propagate bmap updates to rmapbt
    xfs: enable the xfs_defer mechanism to process rmaps to update
    xfs: log rmap intent items
    xfs: create rmap update intent log items
    xfs: add rmap btree insert and delete helpers
    xfs: convert unwritten status of reverse mappings
    xfs: remove an extent from the rmap btree
    xfs: add an extent to the rmap btree
    ...

    Linus Torvalds
     
  • Pull qstr constification updates from Al Viro:
    "Fairly self-contained bunch - a surprising lot of places pass struct
    qstr * as an argument when const struct qstr * would suffice; it
    complicates analysis for no good reason.

    I'd prefer to feed that separately from the assorted fixes (those are
    in #for-linus and with somewhat trickier topology)"

    * 'work.const-qstr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    qstr: constify instances in adfs
    qstr: constify instances in lustre
    qstr: constify instances in f2fs
    qstr: constify instances in ext2
    qstr: constify instances in vfat
    qstr: constify instances in procfs
    qstr: constify instances in fuse
    qstr constify instances in fs/dcache.c
    qstr: constify instances in nfs
    qstr: constify instances in ocfs2
    qstr: constify instances in autofs4
    qstr: constify instances in hfs
    qstr: constify instances in hfsplus
    qstr: constify instances in logfs
    qstr: constify dentry_init_security

    Linus Torvalds
     
  • Inside the kafs filesystem it is possible to occasionally have a call
    processed and terminated before we've had a chance to check whether we need
    to clean up the rx queue for that call because afs_send_simple_reply() ends
    the call when it is done, but this is done in a workqueue item that might
    happen to run to completion before afs_deliver_to_call() completes.

    Further, it is possible for rxrpc_kernel_send_data() to be called to send a
    reply before the last request-phase data skb is released. The rxrpc skb
    destructor is where the ACK processing is done and the call state is
    advanced upon release of the last skb. ACK generation is also deferred to
    a work item because it's possible that the skb destructor is not called in
    a context where kernel_sendmsg() can be invoked.

    To this end, the following changes are made:

    (1) kernel_rxrpc_data_consumed() is added. This should be called whenever
    an skb is emptied so as to crank the ACK and call states. This does
    not release the skb, however. kernel_rxrpc_free_skb() must now be
    called to achieve that. These together replace
    rxrpc_kernel_data_delivered().

    (2) kernel_rxrpc_data_consumed() is wrapped by afs_data_consumed().

    This makes afs_deliver_to_call() easier to work with, as the skb can simply
    be discarded unconditionally here without trying to work out what the
    return value of the ->deliver() function means.

    The ->deliver() functions can, via afs_data_complete(),
    afs_transfer_reply() and afs_extract_data() mark that an skb has been
    consumed (thereby cranking the state) without the need to
    conditionally free the skb to make sure the state is correct on an
    incoming call for when the call processor tries to send the reply.

    (3) rxrpc_recvmsg() now has to call kernel_rxrpc_data_consumed() when it
    has finished with a packet and MSG_PEEK isn't set.

    (4) rxrpc_packet_destructor() no longer calls rxrpc_hard_ACK_data().

    Because of this, we no longer need to clear the destructor and put the
    call before we free the skb in cases where we don't want the ACK/call
    state to be cranked.

    (5) The ->deliver() call-type callbacks are made to return -EAGAIN rather
    than 0 if they expect more data (afs_extract_data() returns -EAGAIN to
    the delivery function already), and the caller is now responsible for
    producing an abort if that was the last packet.

    (6) There are many bits of unmarshalling code where:

        ret = afs_extract_data(call, skb, last, ...);
        switch (ret) {
        case 0:         break;
        case -EAGAIN:   return 0;
        default:        return ret;
        }

    is to be found. As -EAGAIN can now be passed back to the caller, we
    now just return if ret < 0:

        ret = afs_extract_data(call, skb, last, ...);
        if (ret < 0)
            return ret;

    (7) Checks for trailing data and empty final data packets have been
    consolidated as afs_data_complete(). So:

    if (skb->len > 0)
            return -EBADMSG;
    if (!last)
            return 0;

    becomes:

    ret = afs_data_complete(call, skb, last);
    if (ret < 0)
            return ret;

    (8) afs_transfer_reply() now checks the amount of data it has against the
    amount of data desired and the amount of data in the skb and returns
    an error to induce an abort if we don't get exactly what we want.
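
    The return convention described in (5) and (8) can be illustrated with a
    small user-space sketch. It is not the kernel code: sketch_call,
    sketch_deliver() and sketch_deliver_to_call() are hypothetical names, byte
    counts stand in for skbs, and mapping a short final packet to -EBADMSG is
    an assumption made only so the abort case is visible in a return value.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for per-call unmarshalling state. */
struct sketch_call {
	size_t want;	/* bytes the reply is expected to contain */
	size_t have;	/* bytes accumulated so far */
};

/*
 * Hypothetical ->deliver() callback: returns 0 when the reply is complete,
 * -EAGAIN when more data is expected, and -EBADMSG when more data arrived
 * than was wanted.
 */
static int sketch_deliver(struct sketch_call *call, size_t len)
{
	call->have += len;
	if (call->have > call->want)
		return -EBADMSG;
	if (call->have < call->want)
		return -EAGAIN;
	return 0;
}

/*
 * Hypothetical caller: -EAGAIN is only acceptable if further packets can
 * still arrive. On the last packet it means the reply was short, so the
 * caller turns it into an error (assumed here to be -EBADMSG) at the point
 * where the real code would generate an abort.
 */
static int sketch_deliver_to_call(struct sketch_call *call, size_t len,
				  bool last)
{
	int ret = sketch_deliver(call, len);

	if (ret == -EAGAIN && last)
		return -EBADMSG;
	return ret;
}
```

    With this shape the delivery function never has to know whether a packet
    was the last one; only the caller decides whether "more data expected" is
    still a legitimate state.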

    Without these changes, the following oops can occasionally be observed,
    particularly if some printks are inserted into the delivery path:

    general protection fault: 0000 [#1] SMP
    Modules linked in: kafs(E) af_rxrpc(E) [last unloaded: af_rxrpc]
    CPU: 0 PID: 1305 Comm: kworker/u8:3 Tainted: G E 4.7.0-fsdevel+ #1303
    Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
    Workqueue: kafsd afs_async_workfn [kafs]
    task: ffff88040be041c0 ti: ffff88040c070000 task.ti: ffff88040c070000
    RIP: 0010:[] [] __lock_acquire+0xcf/0x15a1
    RSP: 0018:ffff88040c073bc0 EFLAGS: 00010002
    RAX: 6b6b6b6b6b6b6b6b RBX: 0000000000000000 RCX: ffff88040d29a710
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88040d29a710
    RBP: ffff88040c073c70 R08: 0000000000000001 R09: 0000000000000001
    R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
    R13: 0000000000000000 R14: ffff88040be041c0 R15: ffffffff814c928f
    FS: 0000000000000000(0000) GS:ffff88041fa00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fa4595f4750 CR3: 0000000001c14000 CR4: 00000000001406f0
    Stack:
    0000000000000006 000000000be04930 0000000000000000 ffff880400000000
    ffff880400000000 ffffffff8108f847 ffff88040be041c0 ffffffff81050446
    ffff8803fc08a920 ffff8803fc08a958 ffff88040be041c0 ffff88040c073c38
    Call Trace:
    [] ? mark_held_locks+0x5e/0x74
    [] ? __local_bh_enable_ip+0x9b/0xa1
    [] ? trace_hardirqs_on_caller+0x16d/0x189
    [] lock_acquire+0x122/0x1b6
    [] ? lock_acquire+0x122/0x1b6
    [] ? skb_dequeue+0x18/0x61
    [] _raw_spin_lock_irqsave+0x35/0x49
    [] ? skb_dequeue+0x18/0x61
    [] skb_dequeue+0x18/0x61
    [] afs_deliver_to_call+0x344/0x39d [kafs]
    [] afs_process_async_call+0x4c/0xd5 [kafs]
    [] afs_async_workfn+0xe/0x10 [kafs]
    [] process_one_work+0x29d/0x57c
    [] worker_thread+0x24a/0x385
    [] ? rescuer_thread+0x2d0/0x2d0
    [] kthread+0xf3/0xfb
    [] ret_from_fork+0x1f/0x40
    [] ? kthread_create_on_node+0x1cf/0x1cf

    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     
  • Pull pstore fixes from Kees Cook:
    "Fixes for pstore ramoops driver to catch bad kfree() and to use better
    DT bindings"

    * tag 'pstore-v4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    ramoops: use persistent_ram_free() instead of kfree() for freeing prz
    ramoops: use DT reserved-memory bindings

    Linus Torvalds
     
  • Pull block fixes from Jens Axboe:
    "Here's the second round of block updates for this merge window.

    It's a mix of fixes for changes that went in previously in this round,
    and fixes in general. This pull request contains:

    - Fixes for loop from Christoph

    - A bdi vs gendisk lifetime fix from Dan, worth two cookies.

    - A blk-mq timeout fix, when on frozen queues. From Gabriel.

    - Writeback fix from Jan, ensuring that __writeback_single_inode()
    does the right thing.

    - Fix for bio->bi_rw usage in f2fs from me.

    - Error path deadlock fix in blk-mq sysfs registration from me.

    - Floppy O_ACCMODE fix from Jiri.

    - Fix to the new bio op methods from Mike.

    One more followup will be coming here, ensuring that we don't
    propagate the block types outside of block. That, and a rename of
    bio->bi_rw is coming right after -rc1 is cut.

    - Various little fixes"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    mm/block: convert rw_page users to bio op use
    loop: make do_req_filebacked more robust
    loop: don't try to use AIO for discards
    blk-mq: fix deadlock in blk_mq_register_disk() error path
    Include: blkdev: Removed duplicate 'struct request;' declaration.
    Fixup direct bi_rw modifiers
    block: fix bdi vs gendisk lifetime mismatch
    blk-mq: Allow timeouts to run while queue is freezing
    nbd: fix race in ioctl
    block: fix use-after-free in seq file
    f2fs: drop bio->bi_rw manual assignment
    block: add missing group association in bio-cloning functions
    blkcg: kill unused field nr_undestroyed_grps
    writeback: Write dirty times for WB_SYNC_ALL writeback
    floppy: fix open(O_ACCMODE) for ioctl-only open

    Linus Torvalds
     
  • We don't want to miss a lease period renewal due to the TCP connection
    failing to reconnect in a timely fashion. To ensure this doesn't happen,
    cap the reconnection timer so that we retry the connection attempt
    at least every 1/2 lease period.
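
    As a rough illustration of the capping rule only (not the actual patch),
    the sketch below clamps a requested reconnect delay to half the lease
    period. The helper name, the use of seconds as the unit and the one-second
    floor for very short leases are all assumptions.

```c
/*
 * Hypothetical helper: clamp a reconnect delay so that a retry happens at
 * least once every half lease period. All values are in seconds.
 */
static unsigned long sketch_cap_reconnect_delay(unsigned long delay,
						unsigned long lease_period)
{
	unsigned long cap = lease_period / 2;

	/* Assumed floor so that a tiny lease does not yield a zero delay. */
	if (cap == 0)
		cap = 1;

	return delay < cap ? delay : cap;
}
```

    For a 90 second lease, a backoff that had grown to 300 seconds would be
    cut to 45 seconds, while a 10 second delay would be left alone.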

    Signed-off-by: Trond Myklebust

    Trond Myklebust