11 Sep, 2016

2 commits

  • Pull libnvdimm fixes from Dan Williams:
    "nvdimm fixes for v4.8, two of them are tagged for -stable:

    - Fix devm_memremap_pages() to use track_pfn_insert(). Otherwise,
    DAX pmd mappings end up with an uncached pgprot, and unusable
    performance for the device-dax interface. The device-dax interface
    appeared in 4.7 so this is tagged for -stable.

    - Fix a couple VM_BUG_ON() checks in the show_smaps() path to
    understand DAX pmd entries. This fix is tagged for -stable.

    - Fix a mis-merge of the nfit machine-check handler to flip the
    polarity of an if() to match the final version of the patch that
    Vishal sent for 4.8-rc1. Without this the nfit machine check
    handler never detects / inserts new 'badblocks' entries which
    applications use to identify lost portions of files.

    - For test purposes, fix the nvdimm_clear_poison() path to operate on
    legacy / simulated nvdimm memory ranges. Without this fix a test
    can set badblocks, but never clear them on these ranges.

    - Fix the range checking done by dax_dev_pmd_fault(). This is not
    tagged for -stable since this problem is mitigated by specifying
    aligned resources at device-dax setup time.

    These patches have appeared in a next release over the past week. The
    recent rebase you can see in the timestamps was to drop an invalid fix
    as identified by the updated device-dax unit tests [1]. The -mm
    touches have an ack from Andrew"

    [1]: "[ndctl PATCH 0/3] device-dax test for recent kernel bugs"
    https://lists.01.org/pipermail/linux-nvdimm/2016-September/006855.html

    * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    libnvdimm: allow legacy (e820) pmem region to clear bad blocks
    nfit, mce: Fix SPA matching logic in MCE handler
    mm: fix cache mode of dax pmd mappings
    mm: fix show_smap() for zone_device-pmd ranges
    dax: fix mapping size check

    Linus Torvalds
     
  • Pull fscrypto fixes from Ted Ts'o:
    "Fix some brown-paper-bag bugs for fscrypto, including one which
    allows a malicious user to set an encryption policy on an empty
    directory which they do not own"

    * tag 'for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    fscrypto: require write access to mount to set encryption policy
    fscrypto: only allow setting encryption policy on directories
    fscrypto: add authorization check for setting encryption policy

    Linus Torvalds
     

10 Sep, 2016

7 commits

  • Since setting an encryption policy requires writing metadata to the
    filesystem, it should be guarded by mnt_want_write/mnt_drop_write.
    Otherwise, a user could cause a write to a frozen or readonly
    filesystem. This was handled correctly by f2fs but not by ext4. Make
    fscrypt_process_policy() handle it rather than relying on the filesystem
    to get it right.

    Signed-off-by: Eric Biggers
    Cc: stable@vger.kernel.org # 4.1+; check fs/{ext4,f2fs}
    Signed-off-by: Theodore Ts'o
    Acked-by: Jaegeuk Kim

    Eric Biggers
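
    The bracketing described above can be sketched in user-space C. This is a
    simplified model, not the kernel code: every name ending in _model is
    hypothetical, and only the read-only case is modelled, because on a frozen
    filesystem the real mnt_want_write() waits rather than failing.

```c
#include <errno.h>

/* Hypothetical stand-ins for a mount and an inode. */
struct mount_model { int readonly; int writers; };
struct inode_model { int has_policy; };

/* Models mnt_want_write(): refuse write access on a read-only mount. */
static int want_write_model(struct mount_model *m)
{
    if (m->readonly)
        return -EROFS;
    m->writers++;
    return 0;
}

/* Models mnt_drop_write(): release the write access taken above. */
static void drop_write_model(struct mount_model *m)
{
    m->writers--;
}

/* The metadata update only happens between want_write and drop_write. */
static int set_policy_model(struct mount_model *m, struct inode_model *inode)
{
    int ret = want_write_model(m);

    if (ret)
        return ret;
    inode->has_policy = 1;   /* stands in for writing the policy xattr */
    drop_write_model(m);
    return 0;
}
```

    Putting the guard in the shared helper, rather than in each filesystem's
    ioctl handler, is what lets both ext4 and f2fs inherit the same behaviour.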
     
  • The FS_IOC_SET_ENCRYPTION_POLICY ioctl allowed setting an encryption
    policy on nondirectory files. This was unintentional, and in the case
    of nonempty regular files did not behave as expected because existing
    data was not actually encrypted by the ioctl.

    In the case of ext4, the user could also trigger filesystem errors in
    ->empty_dir(), e.g. due to mismatched "directory" checksums when the
    kernel incorrectly tried to interpret a regular file as a directory.

    This bug affected ext4 with kernels v4.8-rc1 or later and f2fs with
    kernels v4.6 and later. It appears that older kernels only permitted
    directories and that the check was accidentally lost during the
    refactoring to share the file encryption code between ext4 and f2fs.

    This patch restores the !S_ISDIR() check that was present in older
    kernels.

    Signed-off-by: Eric Biggers
    Cc: stable@vger.kernel.org
    Signed-off-by: Theodore Ts'o

    Eric Biggers
     
  • On an ext4 or f2fs filesystem with file encryption supported, a user
    could set an encryption policy on any empty directory(*) to which they
    had readonly access. This is obviously problematic, since such a
    directory might be owned by another user and the new encryption policy
    would prevent that other user from creating files in their own directory
    (for example).

    Fix this by requiring inode_owner_or_capable() permission to set an
    encryption policy. This means that either the caller must own the file,
    or the caller must have the capability CAP_FOWNER.

    (*) Or also on any regular file, for f2fs v4.6 and later and ext4
    v4.8-rc1 and later; a separate bug fix is coming for that.

    Signed-off-by: Eric Biggers
    Cc: stable@vger.kernel.org # 4.1+; check fs/{ext4,f2fs}
    Signed-off-by: Theodore Ts'o

    Eric Biggers
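
    The rule "the caller owns the file, or has CAP_FOWNER" can be sketched as
    follows. This is a simplified model: the struct and function names are
    hypothetical, the real inode_owner_or_capable() also takes user namespaces
    into account, and the -EACCES return value is an assumption.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical description of the calling process. */
struct caller_model {
    uid_t fsuid;       /* filesystem uid of the caller */
    bool  cap_fowner;  /* true if the caller holds CAP_FOWNER */
};

/* Simplified model of inode_owner_or_capable(). */
static bool owner_or_capable_model(const struct caller_model *c, uid_t owner)
{
    return c->fsuid == owner || c->cap_fowner;
}

/* Returns 0 if the caller may set a policy on a file owned by 'owner'. */
static int may_set_policy_model(const struct caller_model *c, uid_t owner)
{
    return owner_or_capable_model(c, owner) ? 0 : -EACCES;
}
```

    Read access to the directory plays no part in the decision, which is the
    point of the fix.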
     
  • Attempting to dump /proc/<pid>/smaps for a process with pmd dax mappings
    currently results in the following VM_BUG_ONs:

    kernel BUG at mm/huge_memory.c:1105!
    task: ffff88045f16b140 task.stack: ffff88045be14000
    RIP: 0010:[] [] follow_trans_huge_pmd+0x2cb/0x340
    [..]
    Call Trace:
    [] smaps_pte_range+0xa0/0x4b0
    [] ? vsnprintf+0x255/0x4c0
    [] __walk_page_range+0x1fe/0x4d0
    [] walk_page_vma+0x62/0x80
    [] show_smap+0xa6/0x2b0

    kernel BUG at fs/proc/task_mmu.c:585!
    RIP: 0010:[] [] smaps_pte_range+0x499/0x4b0
    Call Trace:
    [] ? vsnprintf+0x255/0x4c0
    [] __walk_page_range+0x1fe/0x4d0
    [] walk_page_vma+0x62/0x80
    [] show_smap+0xa6/0x2b0

    These locations are sanity checking page flags that must be set for an
    anonymous transparent huge page, but are not set for the zone_device
    pages associated with dax mappings.

    Cc: Ross Zwisler
    Cc: Kirill A. Shutemov
    Acked-by: Andrew Morton
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Pull fuse fix from Miklos Szeredi:
    "This fixes a deadlock when fuse, direct I/O and loop device are
    combined"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
    fuse: direct-io: don't dirty ITER_BVEC pages

    Linus Torvalds
     
  • Pull overlayfs fix from Miklos Szeredi:
    "This fixes a regression caused by the last pull request"

    * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
    ovl: fix workdir creation

    Linus Torvalds
     
  • Pull btrfs fixes from Chris Mason:
    "I'm not proud of how long it took me to track down that one liner in
    btrfs_sync_log(), but the good news is the patches I was trying to
    blame for these problems were actually fine (sorry Filipe)"

    * 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    btrfs: introduce tickets_id to determine whether asynchronous metadata reclaim work makes progress
    btrfs: remove root_log_ctx from ctx list before btrfs_sync_log returns
    btrfs: do not decrease bytes_may_use when replaying extents

    Linus Torvalds
     

08 Sep, 2016

1 commit


06 Sep, 2016

2 commits

  • In btrfs_async_reclaim_metadata_space(), we use ticket's address to
    determine whether asynchronous metadata reclaim work is making progress.

        ticket = list_first_entry(&space_info->tickets,
                                  struct reserve_ticket, list);
        if (last_ticket == ticket) {
                flush_state++;
        } else {
                last_ticket = ticket;
                flush_state = FLUSH_DELAYED_ITEMS_NR;
                if (commit_cycles)
                        commit_cycles--;
        }

    But this is wrong: we should not rely on a local variable's address for
    this check, because addresses may repeat. In my test environment, I
    dd one 168MB file into a 256MB fs and found that, for this file, every time
    wait_reserve_ticket() is called, the local variable ticket has the same
    address.

    For the code above, assume a previous ticket's address is addrA, so
    last_ticket is addrA. btrfs_async_reclaim_metadata_space() finishes this
    ticket and wakes it up, then another ticket is added with the same address
    addrA. Now last_ticket equals the current ticket, so the current ticket's
    flush work starts from the current flush_state, not the initial
    FLUSH_DELAYED_ITEMS_NR, which may result in some enospc issues (I have seen
    this on my test machine).

    Signed-off-by: Wang Xiaoguang
    Reviewed-by: Josef Bacik
    Signed-off-by: David Sterba

    Wang Xiaoguang
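
    The commit title names a tickets_id counter as the replacement for the
    address comparison. A minimal sketch of that idea follows, assuming the
    counter is bumped whenever a ticket is completed; the struct layouts, the
    _model names and the numeric value of FLUSH_DELAYED_ITEMS_NR are
    hypothetical, and commit_cycles handling is left out.

```c
#include <stdint.h>

enum { FLUSH_DELAYED_ITEMS_NR = 1 };   /* illustrative value only */

struct space_info_model { uint64_t tickets_id; };

struct reclaim_state_model {
    uint64_t last_tickets_id;
    int      flush_state;
};

/* Called when a ticket is satisfied: the id changes even if the next
 * ticket happens to live at the same address as the old one. */
static void ticket_done_model(struct space_info_model *si)
{
    si->tickets_id++;
}

/* One iteration of the reclaim loop's progress check. */
static void reclaim_step_model(struct reclaim_state_model *rs,
                               const struct space_info_model *si)
{
    if (rs->last_tickets_id == si->tickets_id) {
        rs->flush_state++;                       /* no progress: escalate */
    } else {
        rs->last_tickets_id = si->tickets_id;    /* progress: start over */
        rs->flush_state = FLUSH_DELAYED_ITEMS_NR;
    }
}
```

    Because the counter only ever increases, two different tickets can no
    longer be mistaken for each other the way two equal addresses can.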
     
  • We use a btrfs_log_ctx structure to pass information into the
    tree log commit, and get error values out. It gets added to a per
    log-transaction list which we walk when things go bad.

    Commit d1433debe added an optimization to skip waiting for the log
    commit, but didn't take root_log_ctx out of the list. This
    patch makes sure we remove things before exiting.

    Signed-off-by: Chris Mason
    Fixes: d1433debe7f4346cf9fc0dafc71c3137d2a97bc4
    cc: stable@vger.kernel.org # 3.15+

    Chris Mason
     

05 Sep, 2016

3 commits

  • When replaying extents, there is no need to update bytes_may_use
    in btrfs_alloc_logged_file_extent(), otherwise it'll trigger a
    WARN_ON about bytes_may_use.

    Fixes: ("btrfs: update btrfs_space_info's bytes_may_use timely")
    Signed-off-by: Wang Xiaoguang
    Reviewed-by: Josef Bacik
    Signed-off-by: David Sterba

    Wang Xiaoguang
     
  • Commit f3c4ebe65ea1 ("ceph: using hash value to compose dentry offset")
    modified "if (fpos_frag(new_pos) != fi->frag)" to "if (fi->frag |=
    fpos_frag(new_pos))" in need_reset_readdir(), thus replacing a
    comparison operator with an assignment one.

    This looks like a typo which is reported by clang when building the
    kernel with some warning flags:

    fs/ceph/dir.c:600:22: error: using the result of an assignment as a
    condition without parentheses [-Werror,-Wparentheses]
    } else if (fi->frag |= fpos_frag(new_pos)) {
    ~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~
    fs/ceph/dir.c:600:22: note: place parentheses around the assignment
    to silence this warning
    } else if (fi->frag |= fpos_frag(new_pos)) {
    ^
    ( )
    fs/ceph/dir.c:600:22: note: use '!=' to turn this compound
    assignment into an inequality comparison
    } else if (fi->frag |= fpos_frag(new_pos)) {
    ^~
    !=

    Fixes: f3c4ebe65ea1 ("ceph: using hash value to compose dentry offset")
    Signed-off-by: Nicolas Iooss
    Signed-off-by: Ilya Dryomov

    Nicolas Iooss
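
    The difference between the two operators can be shown in isolation. This
    is a stand-alone illustration, not the ceph code; both function names are
    hypothetical and plain unsigned values stand in for fi->frag and
    fpos_frag(new_pos).

```c
/* Buggy form: '|=' modifies *frag and is true whenever the result is
 * non-zero, even if the two values were identical. */
static int frag_changed_buggy(unsigned *frag, unsigned new_frag)
{
    return (*frag |= new_frag) != 0;
}

/* Intended form: a pure comparison that leaves *frag untouched. */
static int frag_changed_fixed(const unsigned *frag, unsigned new_frag)
{
    return new_frag != *frag;
}
```

    The buggy form therefore has two problems at once: it reports a change
    for equal non-zero values, and it silently ORs the new value into the
    stored one.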
     
  • Workdir creation fails in latest kernel.

    Fix by allowing EOPNOTSUPP as a valid return value from
    vfs_removexattr(XATTR_NAME_POSIX_ACL_*). Upper filesystem may not support
    ACL and still be perfectly able to support overlayfs.

    Reported-by: Martin Ziegler
    Signed-off-by: Miklos Szeredi
    Fixes: c11b9fdd6a61 ("ovl: remove posix_acl_default from workdir")
    Cc:

    Miklos Szeredi
     

04 Sep, 2016

3 commits

  • Pull btrfs fixes from Chris Mason:
    "I'm still prepping a set of fixes for btrfs fsync, just nailing down a
    hard to trigger memory corruption. For now, these are tested and ready."

    * 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    btrfs: fix one bug that process may endlessly wait for ticket in wait_reserve_ticket()
    Btrfs: fix endless loop in balancing block groups
    Btrfs: kill invalid ASSERT() in process_all_refs()

    Linus Torvalds
     
  • Pull driver core fixes from Greg KH:
    "Here are three small fixes for 4.8-rc5.

    One for sysfs, one for kernfs, and one documentation fix, all for
    reported issues. All of these have been in linux-next for a while"

    * tag 'driver-core-4.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
    sysfs: correctly handle read offset on PREALLOC attrs
    documentation: drivers/core/of: fix name of of_node symlink
    kernfs: don't depend on d_find_any_alias() when generating notifications

    Linus Torvalds
     
  • In commit 8ead9dd54716 ("devpts: more pty driver interface cleanups") I
    made devpts_get_priv() just return the dentry->fs_data directly. And
    because I thought it wouldn't happen, I added a warning if you ever saw
    a pts node that wasn't on devpts.

    And no, that warning never triggered under any actual real use, but you
    can trigger it by creating nonsensical pts nodes by hand.

    So just revert the warning, and make devpts_get_priv() return NULL for
    that case like it used to.

    Reported-by: Dmitry Vyukov
    Cc: stable@vger.kernel.org # 4.6+
    Cc: Eric W Biederman
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

03 Sep, 2016

1 commit

  • Pull overlayfs fixes from Miklos Szeredi:
    "Most of this is regression fixes for posix acl behavior introduced in
    4.8-rc1 (these were caught by the pjd-fstest suite). There are also
    miscellaneous fixes marked as stable material and cleanups.

    Other than overlayfs code, it touches to add a constant
    with which to disable posix acl caching. No changes needed to the
    actual caching code, it automatically does the right thing, although
    later we may want to optimize this case.

    I'm now testing overlayfs with the following test suites to catch
    regressions:

    - unionmount-testsuite
    - xfstests
    - pjd-fstest"

    * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
    ovl: update doc
    ovl: listxattr: use strnlen()
    ovl: Switch to generic_getxattr
    ovl: copyattr after setting POSIX ACL
    ovl: Switch to generic_removexattr
    ovl: Get rid of ovl_xattr_noacl_handlers array
    ovl: Fix OVL_XATTR_PREFIX
    ovl: fix spelling mistake: "directries" -> "directories"
    ovl: don't cache acl on overlay layer
    ovl: use cached acl on underlying layer
    ovl: proper cleanup of workdir
    ovl: remove posix_acl_default from workdir
    ovl: handle umask and posix_acl_default correctly on creation
    ovl: don't copy up opaqueness

    Linus Torvalds
     

02 Sep, 2016

2 commits

  • Pull audit fixes from Paul Moore:
    "Two small patches to fix some bugs with the audit-by-executable
    functionality we introduced back in v4.3 (both patches are marked
    for the stable folks)"

    * 'stable-4.8' of git://git.infradead.org/users/pcmoore/audit:
    audit: fix exe_file access in audit_exe_compare
    mm: introduce get_task_exe_file

    Linus Torvalds
     
  • Pull xfs and iomap fixes from Dave Chinner:
    "Most of these changes are small regression fixes that address problems
    introduced in the 4.8-rc1 window. The two fixes that aren't (IO
    completion fix and superblock inprogress check) are fixes for problems
    introduced some time ago and need to be pushed back to stable kernels.

    Changes in this update:
    - iomap FIEMAP_EXTENT_MERGED usage fix
    - additional mount-time feature restrictions
    - rmap btree query fixes
    - freeze/unmount io completion workqueue fix
    - memory corruption fix for deferred operations handling"

    * tag 'xfs-iomap-for-linus-4.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
    xfs: track log done items directly in the deferred pending work item
    iomap: don't set FIEMAP_EXTENT_MERGED for extent based filesystems
    xfs: prevent dropping ioend completions during buftarg wait
    xfs: fix superblock inprogress check
    xfs: simple btree query range should look right if LE lookup fails
    xfs: fix some key handling problems in _btree_simple_query_range
    xfs: don't log the entire end of the AGF
    xfs: disallow mounting of realtime + rmap filesystems
    xfs: don't perform lookups on zero-height btrees

    Linus Torvalds
     

01 Sep, 2016

17 commits

  • If can_overcommit() in btrfs_calc_reclaim_metadata_size() returns true,
    btrfs_async_reclaim_metadata_space() will not reclaim metadata space; it just
    returns directly and also forgets to wake up the processes that are waiting for
    their tickets, so these processes will wait endlessly.

    Fstests case generic/172 with mount option "-o compress=lzo" revealed
    this bug on my test machine. Here if we have tickets to handle, we must
    handle them first.

    Signed-off-by: Wang Xiaoguang
    Reviewed-by: Josef Bacik
    Signed-off-by: David Sterba

    Wang Xiaoguang
     
  • The qgroup function may overwrite the saved error 'err' with 0
    when quota is not enabled, and this ends up in an
    endless loop in balance because we keep going back to balance
    the same block group.

    It really should use 'ret' instead.

    Signed-off-by: Liu Bo
    Reviewed-by: Qu Wenruo
    Signed-off-by: David Sterba

    Liu Bo
     
  • Suppose you have the following tree in snap1 on a file system mounted with -o
    inode_cache so that inode numbers are recycled

    └── [ 258] a
        └── [ 257] b

    and then you remove b, rename a to c, and then re-create b in c so you have the
    following tree

    └── [ 258] c
        └── [ 257] b

    and then you try to do an incremental send you will hit

    ASSERT(pending_move == 0);

    in process_all_refs(). This is because we assume that any recycling of inodes
    will not have a pending change in our path, which isn't the case. This is the
    case for the DELETE side, since we want to remove the old file using the old
    path, but on the create side we could have a pending move and need to do the
    normal pending rename dance. So remove this ASSERT() and put a comment about
    why we ignore pending_move. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • Be defensive about what underlying fs provides us in the returned xattr
    list buffer. If it's not properly null terminated, bail out with a warning
    instead of BUG.

    Signed-off-by: Miklos Szeredi
    Cc:

    Miklos Szeredi
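
    A sketch of defensively walking a NUL-separated name list, in user-space
    C. The function name count_xattr_names() is hypothetical. The commit uses
    strnlen(); this sketch swaps in ISO C memchr() for the same bounded
    search, so it does not depend on POSIX feature macros.

```c
#include <stddef.h>
#include <string.h>

/*
 * Counts the names in a buffer of 'len' bytes laid out as
 * "name1\0name2\0...". Returns -1 if the last name is not NUL
 * terminated within the buffer, instead of reading past the end.
 */
static int count_xattr_names(const char *buf, size_t len)
{
    int n = 0;

    while (len > 0) {
        const char *end = memchr(buf, '\0', len);
        size_t slen;

        if (!end)
            return -1;               /* malformed list: bail out */
        slen = (size_t)(end - buf);
        buf += slen + 1;
        len -= slen + 1;
        n++;
    }
    return n;
}
```

    An unbounded strlen() in the same loop would keep reading beyond the
    buffer when the terminator is missing, which is the situation the patch
    guards against.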
     
  • Now that overlayfs has xattr handlers for iop->{set,remove}xattr, use
    those same handlers for iop->getxattr as well.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Miklos Szeredi

    Andreas Gruenbacher
     
  • Setting a POSIX acl may also modify the file mode, so we need to copy that up to
    the overlay inode.

    Reported-by: Eryu Guan
    Fixes: d837a49bd57f ("ovl: fix POSIX ACL setting")
    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Commit d837a49bd57f ("ovl: fix POSIX ACL setting") switches
    iop->setxattr from ovl_setxattr to generic_setxattr, so switch from
    ovl_removexattr to generic_removexattr as well. As far as permission
    checking goes, the same rules should apply in either case.

    While doing that, rename ovl_setxattr to ovl_xattr_set to indicate that
    this is not an iop->setxattr implementation and remove the unused inode
    argument.

    Move ovl_other_xattr_set above ovl_own_xattr_set so that they match the
    order of handlers in ovl_xattr_handlers.

    Signed-off-by: Andreas Gruenbacher
    Fixes: d837a49bd57f ("ovl: fix POSIX ACL setting")
    Signed-off-by: Miklos Szeredi

    Andreas Gruenbacher
     
  • Use an ordinary #ifdef to conditionally include the POSIX ACL handlers
    in ovl_xattr_handlers, like the other filesystems do. Flag the code
    that is now only used conditionally with __maybe_unused.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Miklos Szeredi

    Andreas Gruenbacher
     
  • Make sure ovl_own_xattr_handler only matches attribute names starting
    with "overlay.", not "overlayXXX".

    Signed-off-by: Andreas Gruenbacher
    Fixes: d837a49bd57f ("ovl: fix POSIX ACL setting")
    Signed-off-by: Miklos Szeredi

    Andreas Gruenbacher
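
    Why the trailing dot matters can be seen with a plain prefix comparison.
    This is an illustration only: has_prefix() is a hypothetical helper, and
    the bare "overlay" and "overlay." strings follow the wording of the commit
    message rather than the exact kernel constant.

```c
#include <string.h>

/* Returns 1 if 'name' starts with 'prefix', 0 otherwise. */
static int has_prefix(const char *name, const char *prefix)
{
    return strncmp(name, prefix, strlen(prefix)) == 0;
}
```

    With the prefix "overlay", the unrelated name "overlayXXX" matches; with
    "overlay.", only names inside the overlay namespace such as
    "overlay.opaque" do.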
     
  • Trivial fix to spelling mistake in pr_err message.

    Signed-off-by: Colin Ian King
    Signed-off-by: Miklos Szeredi

    Colin Ian King
     
  • Some operations (setxattr/chmod) can make the cached acl stale. We either
    need to clear overlay's acl cache for the affected inode or prevent acl
    caching on the overlay altogether. Preventing caching has the following
    advantages:

    - no double caching, less memory used

    - overlay cache doesn't go stale when fs clears its own cache

    Possible disadvantage is performance loss. If that becomes a problem
    get_acl() can be optimized for overlayfs.

    This patch disables caching by pre setting i_*acl to a value that

    - has bit 0 set, so is_uncached_acl() will return true

    - is not equal to ACL_NOT_CACHED, so get_acl() will not overwrite it

    The constant -3 was chosen for this purpose.

    Fixes: 39a25b2b3762 ("ovl: define ->get_acl() for overlay inodes")
    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
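
    The two properties required of the sentinel can be checked with a small
    user-space model. All names ending in _MODEL or _model are hypothetical,
    and the value -1 for the "not cached" sentinel is an assumption about the
    kernel definition; only the value -3 and the bit-0 test come from the
    text above.

```c
#include <stdint.h>

/* Assumed value of the kernel's "not cached" sentinel. */
#define ACL_NOT_CACHED_MODEL ((void *)(intptr_t)-1)
/* The sentinel chosen by this patch: -3. */
#define ACL_DONT_CACHE_MODEL ((void *)(intptr_t)-3)

/* Models the bit-0 test: real ACL pointers are aligned, so bit 0 is
 * clear for them and set for the sentinels. */
static int is_uncached_acl_model(const void *acl)
{
    return (int)((intptr_t)acl & 1);
}
```

    In two's complement, -3 ends in binary ...01, so bit 0 is set, and it is
    a different bit pattern from -1, which satisfies both conditions listed
    in the commit message.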
     
  • Instead of calling ->get_acl() directly, use get_acl() to get the cached
    value.

    We will have the acl cached on the underlying inode anyway, because we do
    permission checking on both the overlay and the underlying fs.

    So, since we already have double caching, this improves performance without
    any cost.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • When mounting overlayfs it needs a clean "work" directory under the
    supplied workdir.

    Previously the mount code removed this directory if it already existed and
    created a new one. If the removal failed (e.g. directory was not empty)
    then it fell back to a read-only mount not using the workdir.

    While this has never been reported, it is possible to get a non-empty
    "work" dir from a previous mount of overlayfs in case of crash in the
    middle of an operation using the work directory.

    In this case the left over state should be discarded and the overlay
    filesystem will be consistent, guaranteed by the atomicity of operations on
    moving to/from the workdir to the upper layer.

    This patch implements cleaning out any files left in workdir. It is
    implemented using real recursion for simplicity, but the depth is limited
    to 2, because the worst case is that of a directory containing whiteouts
    under "work".

    Signed-off-by: Miklos Szeredi
    Cc:

    Miklos Szeredi
     
  • Clear out posix acl xattrs on workdir and also reset the mode after
    creation so that an inherited sgid bit is cleared.

    Signed-off-by: Miklos Szeredi
    Cc:

    Miklos Szeredi
     
  • Setting MS_POSIXACL in sb->s_flags has the side effect of passing mode to
    create functions without masking against umask.

    Another problem when creating over a whiteout is that the default posix acl
    is not inherited from the parent dir (because the real parent dir at the
    time of creation is the work directory).

    Fix these problems by:

    a) If upper fs does not have MS_POSIXACL, then mask mode with umask.

    b) If creating over a whiteout, call posix_acl_create() to get the
    inherited acls. After creation (but before moving to the final
    destination) set these acls on the created file. posix_acl_create() also
    updates the file creation mode as appropriate.

    Fixes: 39a25b2b3762 ("ovl: define ->get_acl() for overlay inodes")
    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
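
    Step (a) amounts to clearing the umask bits from the requested mode when
    the upper filesystem will not do ACL processing itself. A minimal sketch,
    with a hypothetical function name and a plain flag standing in for the
    MS_POSIXACL test; step (b) is not modelled here.

```c
#include <sys/types.h>

/*
 * Returns the mode to create with. If the upper filesystem lacks
 * POSIX ACL support, the umask is applied here; otherwise the mode is
 * passed through unchanged.
 */
static mode_t effective_create_mode(mode_t mode, mode_t umask_bits,
                                    int upper_has_posixacl)
{
    if (!upper_has_posixacl)
        mode &= ~umask_bits;
    return mode;
}
```

    For example, a requested mode of 0666 with umask 022 becomes 0644 when
    the upper filesystem has no POSIX ACL support, and stays 0666 when it
    does.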
     
  • For more convenient access if one has a pointer to the task.

    As a minor nit take advantage of the fact that only task lock + rcu are
    needed to safely grab ->exe_file. This saves mm refcount dance.

    Use the helper in proc_exe_link.

    Signed-off-by: Mateusz Guzik
    Acked-by: Konstantin Khlebnikov
    Acked-by: Richard Guy Briggs
    Cc: # 4.3.x
    Signed-off-by: Paul Moore

    Mateusz Guzik
     
  • We used to delay switching to the new credentials until after we had
    mapped the executable (and possible elf interpreter). That was kind of
    odd to begin with, since the new executable will actually then _run_
    with the new creds, but whatever.

    The bigger problem was that we also want to make sure that we turn off
    prof events and tracing before we start mapping the new executable
    state. So while this is a cleanup, it's also a fix for a possible
    information leak.

    Reported-by: Robert Święcki
    Tested-by: Peter Zijlstra
    Acked-by: David Howells
    Acked-by: Oleg Nesterov
    Acked-by: Andy Lutomirski
    Acked-by: Eric W. Biederman
    Cc: Willy Tarreau
    Cc: Kees Cook
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

31 Aug, 2016

2 commits

  • Attributes declared with __ATTR_PREALLOC use sysfs_kf_read(), which returns
    zero bytes for a non-zero offset. This breaks the checkarray script in the
    mdadm tool on Debian, where /bin/sh is 'dash', because its builtin 'read'
    reads only one byte at a time. The script gets 'i' instead of 'idle' when it
    reads the current action from /sys/block/$dev/md/sync_action and as a
    result does nothing.

    This patch adds a trivial implementation of partial read: generate the whole
    string and move the required part to the head of the buffer.

    Signed-off-by: Konstantin Khlebnikov
    Fixes: 4ef67a8c95f3 ("sysfs/kernfs: make read requests on pre-alloc files use the buffer.")
    Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=787950
    Cc: Stable # v3.19+
    Acked-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Konstantin Khlebnikov
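
    The partial-read approach described above can be sketched in user-space
    C. The function name partial_read_model() and its signature are
    hypothetical; the buffer is assumed to already hold the full generated
    attribute text of 'len' bytes.

```c
#include <stddef.h>
#include <string.h>

/*
 * 'buf' holds 'len' bytes of freshly generated attribute text.
 * Moves the part starting at offset 'pos' to the head of the buffer
 * and returns how many bytes the reader gets, capped at 'count'.
 */
static size_t partial_read_model(char *buf, size_t len, size_t pos,
                                 size_t count)
{
    if (pos) {
        if (len <= pos)
            return 0;                 /* offset at or past the end: EOF */
        len -= pos;
        memmove(buf, buf + pos, len); /* regions overlap, so memmove */
    }
    return len < count ? len : count;
}
```

    A reader that asks for one byte at a time, as dash's builtin 'read' does,
    now receives 'i', 'd', 'l', 'e' and the newline on successive calls
    instead of 'i' followed by end-of-file.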
     
  • kernfs_notify_workfn() sends out file modified events for the
    scheduled kernfs_nodes. Because the modifications aren't from
    userland, it doesn't have the matching file struct at hand and can't
    use fsnotify_modify(). Instead, it looked up the inode and then used
    d_find_any_alias() to find the dentry and used fsnotify_parent() and
    fsnotify() directly to generate notifications.

    The assumption was that the relevant dentries would have been pinned
    if there are listeners, which isn't true as inotify doesn't pin
    dentries at all and watching the parent doesn't pin the child dentries
    even for dnotify. This led to, for example, inotify watchers not
    getting notifications if the system is under memory pressure and the
    matching dentries got reclaimed. It can also be triggered through
    /proc/sys/vm/drop_caches or a remount attempt which involves shrinking
    dcache.

    fsnotify_parent() only uses the dentry to access the parent inode,
    which kernfs can do easily. Update kernfs_notify_workfn() so that it
    uses fsnotify() directly for both the parent and target inodes without
    going through d_find_any_alias(). While at it, supply the target file
    name to fsnotify() from kernfs_node->name.

    Signed-off-by: Tejun Heo
    Reported-by: Evgeny Vereshchagin
    Fixes: d911d9874801 ("kernfs: make kernfs_notify() trigger inotify events too")
    Cc: John McCutchan
    Cc: Robert Love
    Cc: Eric Paris
    Cc: stable@vger.kernel.org # v3.16+
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo