01 Dec, 2018

1 commit

  • commit 30aba6656f61ed44cba445a3c0d38b296fa9e8f5 upstream.

    Disallows open of FIFOs or regular files not owned by the user in world
    writable sticky directories, unless the owner is the same as that of the
    directory or the file is opened without the O_CREAT flag. The purpose
    is to make data spoofing attacks harder. This protection can be turned
    on and off separately for FIFOs and regular files via sysctl, just like
    the symlinks/hardlinks protection. This patch is based on Openwall's
    "HARDEN_FIFO" feature by Solar Designer.

    This is a brief list of old vulnerabilities that could have been prevented
    by this feature, some of them even allow for privilege escalation:

    CVE-2000-1134
    CVE-2007-3852
    CVE-2008-0525
    CVE-2009-0416
    CVE-2011-4834
    CVE-2015-1838
    CVE-2015-7442
    CVE-2016-7489

    This list is not meant to be complete. It's difficult to track down all
    vulnerabilities of this kind because they were often reported without any
    mention of this particular attack vector. In fact, before
    hardlinks/symlinks restrictions, fifos/regular files weren't the favorite
    vehicle to exploit them.

    [s.mesoraca16@gmail.com: fix bug reported by Dan Carpenter]
    Link: https://lkml.kernel.org/r/20180426081456.GA7060@mwanda
    Link: http://lkml.kernel.org/r/1524829819-11275-1-git-send-email-s.mesoraca16@gmail.com
    [keescook@chromium.org: drop pr_warn_ratelimited() in favor of audit changes in the future]
    [keescook@chromium.org: adjust commit subjet]
    Link: http://lkml.kernel.org/r/20180416175918.GA13494@beast
    Signed-off-by: Salvatore Mesoraca
    Signed-off-by: Kees Cook
    Suggested-by: Solar Designer
    Suggested-by: Kees Cook
    Cc: Al Viro
    Cc: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Cc: Loic
    Signed-off-by: Greg Kroah-Hartman

    Salvatore Mesoraca
     

10 Nov, 2018

1 commit

  • commit a725356b6659469d182d662f22d770d83d3bc7b5 upstream.

    Commit 031a072a0b8a ("vfs: call vfs_clone_file_range() under freeze
    protection") created a wrapper do_clone_file_range() around
    vfs_clone_file_range() moving the freeze protection to former, so
    overlayfs could call the latter.

    The more common vfs practice is to call do_xxx helpers from vfs_xxx
    helpers, where freeze protecction is taken in the vfs_xxx helper, so
    this anomality could be a source of confusion.

    It seems that commit 8ede205541ff ("ovl: add reflink/copyfile/dedup
    support") may have fallen a victim to this confusion -
    ovl_clone_file_range() calls the vfs_clone_file_range() helper in the
    hope of getting freeze protection on upper fs, but in fact results in
    overlayfs allowing to bypass upper fs freeze protection.

    Swap the names of the two helpers to conform to common vfs practice
    and call the correct helpers from overlayfs and nfsd.

    Signed-off-by: Amir Goldstein
    Signed-off-by: Miklos Szeredi
    Fixes: 031a072a0b8a ("vfs: call vfs_clone_file_range() under freeze...")
    Signed-off-by: Amir Goldstein
    Signed-off-by: Sasha Levin

    Amir Goldstein
     

21 Mar, 2018

1 commit

  • commit 95dd77580ccd66a0da96e6d4696945b8cea39431 upstream.

    On nfsv2 and nfsv3 the nfs server can export subsets of the same
    filesystem and report the same filesystem identifier, so that the nfs
    client can know they are the same filesystem. The subsets can be from
    disjoint directory trees. The nfsv2 and nfsv3 filesystems provides no
    way to find the common root of all directory trees exported form the
    server with the same filesystem identifier.

    The practical result is that in struct super s_root for nfs s_root is
    not necessarily the root of the filesystem. The nfs mount code sets
    s_root to the root of the first subset of the nfs filesystem that the
    kernel mounts.

    This effects the dcache invalidation code in generic_shutdown_super
    currently called shrunk_dcache_for_umount and that code for years
    has gone through an additional list of dentries that might be dentry
    trees that need to be freed to accomodate nfs.

    When I wrote path_connected I did not realize nfs was so special, and
    it's hueristic for avoiding calling is_subdir can fail.

    The practical case where this fails is when there is a move of a
    directory from the subtree exposed by one nfs mount to the subtree
    exposed by another nfs mount. This move can happen either locally or
    remotely. With the remote case requiring that the move directory be cached
    before the move and that after the move someone walks the path
    to where the move directory now exists and in so doing causes the
    already cached directory to be moved in the dcache through the magic
    of d_splice_alias.

    If someone whose working directory is in the move directory or a
    subdirectory and now starts calling .. from the initial mount of nfs
    (where s_root == mnt_root), then path_connected as a heuristic will
    not bother with the is_subdir check. As s_root really is not the root
    of the nfs filesystem this heuristic is wrong, and the path may
    actually not be connected and path_connected can fail.

    The is_subdir function might be cheap enough that we can call it
    unconditionally. Verifying that will take some benchmarking and
    the result may not be the same on all kernels this fix needs
    to be backported to. So I am avoiding that for now.

    Filesystems with snapshots such as nilfs and btrfs do something
    similar. But as the directory tree of the snapshots are disjoint
    from one another and from the main directory tree rename won't move
    things between them and this problem will not occur.

    Cc: stable@vger.kernel.org
    Reported-by: Al Viro
    Fixes: 397d425dc26d ("vfs: Test for and handle paths that are unreachable from their mnt_root")
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     

09 Mar, 2018

1 commit

  • commit 230f5a8969d8345fc9bbe3683f068246cf1be4b8 upstream.

    Gerd reports that ->i_mode may contain other bits besides S_IFCHR. Use
    S_ISCHR() instead. Otherwise, get_user_pages_longterm() may fail on
    device-dax instances when those are meant to be explicitly allowed.

    Fixes: 2bb6d2837083 ("mm: introduce get_user_pages_longterm")
    Cc:
    Reported-by: Gerd Rausch
    Acked-by: Jane Chu
    Reported-by: Haozhong Zhang
    Reviewed-by: Jan Kara
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     

05 Dec, 2017

2 commits

  • commit 5d38f049cee1e1c4a7ac55aa79d37d01ddcc3860 upstream.

    Commit 42f461482178 ("autofs: fix AT_NO_AUTOMOUNT not being honored")
    allowed the fstatat(2) system call to properly honor the AT_NO_AUTOMOUNT
    flag but introduced a semantic change.

    In order to honor AT_NO_AUTOMOUNT a semantic change was made to the
    negative dentry case for stat family system calls in follow_automount().

    This changed the unconditional triggering of an automount in this case
    to no longer be done and an error returned instead.

    This has caused more problems than I expected so reverting the change is
    needed.

    In a discussion with Neil Brown it was concluded that the automount(8)
    daemon can implement this change without kernel modifications. So that
    will be done instead and the autofs module documentation updated with a
    description of the problem and what needs to be done by module users for
    this specific case.

    Link: http://lkml.kernel.org/r/151174730120.6162.3848002191530283984.stgit@pluto.themaw.net
    Fixes: 42f4614821 ("autofs: fix AT_NO_AUTOMOUNT not being honored")
    Signed-off-by: Ian Kent
    Cc: Neil Brown
    Cc: Al Viro
    Cc: David Howells
    Cc: Colin Walters
    Cc: Ondrej Holy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Ian Kent
     
  • commit 2bb6d2837083de722bfdc369cb0d76ce188dd9b4 upstream.

    Patch series "introduce get_user_pages_longterm()", v2.

    Here is a new get_user_pages api for cases where a driver intends to
    keep an elevated page count indefinitely. This is distinct from usages
    like iov_iter_get_pages where the elevated page counts are transient.
    The iov_iter_get_pages cases immediately turn around and submit the
    pages to a device driver which will put_page when the i/o operation
    completes (under kernel control).

    In the longterm case userspace is responsible for dropping the page
    reference at some undefined point in the future. This is untenable for
    filesystem-dax case where the filesystem is in control of the lifetime
    of the block / page and needs reasonable limits on how long it can wait
    for pages in a mapping to become idle.

    Fixing filesystems to actually wait for dax pages to be idle before
    blocks from a truncate/hole-punch operation are repurposed is saved for
    a later patch series.

    Also, allowing longterm registration of dax mappings is a future patch
    series that introduces a "map with lease" semantic where the kernel can
    revoke a lease and force userspace to drop its page references.

    I have also tagged these for -stable to purposely break cases that might
    assume that longterm memory registrations for filesystem-dax mappings
    were supported by the kernel. The behavior regression this policy
    change implies is one of the reasons we maintain the "dax enabled.
    Warning: EXPERIMENTAL, use at your own risk" notification when mounting
    a filesystem in dax mode.

    It is worth noting the device-dax interface does not suffer the same
    constraints since it does not support file space management operations
    like hole-punch.

    This patch (of 4):

    Until there is a solution to the dma-to-dax vs truncate problem it is
    not safe to allow long standing memory registrations against
    filesytem-dax vmas. Device-dax vmas do not have this problem and are
    explicitly allowed.

    This is temporary until a "memory registration with layout-lease"
    mechanism can be implemented for the affected sub-systems (RDMA and
    V4L2).

    [akpm@linux-foundation.org: use kcalloc()]
    Link: http://lkml.kernel.org/r/151068939435.7446.13560129395419350737.stgit@dwillia2-desk3.amr.corp.intel.com
    Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
    Signed-off-by: Dan Williams
    Suggested-by: Christoph Hellwig
    Cc: Doug Ledford
    Cc: Hal Rosenstock
    Cc: Inki Dae
    Cc: Jan Kara
    Cc: Jason Gunthorpe
    Cc: Jeff Moyer
    Cc: Joonyoung Shim
    Cc: Kyungmin Park
    Cc: Mauro Carvalho Chehab
    Cc: Mel Gorman
    Cc: Ross Zwisler
    Cc: Sean Hefty
    Cc: Seung-Woo Kim
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information it it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

04 Oct, 2017

1 commit

  • Before commit 9c5d760b8d22 ("mm: split gfp_mask and mapping flags into
    separate fields") the private_* fields of struct adrress_space were
    grouped together and using "ditto" in comments describing the last
    fields was correct.

    With introduction of gpf_mask between private_lock and private_list
    "ditto" references the wrong description.

    Fix it by using the elaborate description.

    Link: http://lkml.kernel.org/r/1507009987-8746-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

15 Sep, 2017

4 commits

  • This patch constifies the path argument to kernel_read_file_from_path().

    Signed-off-by: Mimi Zohar
    Cc: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Mimi Zohar
     
  • Pull nowait read support from Al Viro:
    "Support IOCB_NOWAIT for buffered reads and block devices"

    * 'work.read_write' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    block_dev: support RFW_NOWAIT on block device nodes
    fs: support RWF_NOWAIT for buffered reads
    fs: support IOCB_NOWAIT in generic_file_buffered_read
    fs: pass iocb to do_generic_file_read

    Linus Torvalds
     
  • Pull mount flag updates from Al Viro:
    "Another chunk of fmount preparations from dhowells; only trivial
    conflicts for that part. It separates MS_... bits (very grotty
    mount(2) ABI) from the struct super_block ->s_flags (kernel-internal,
    only a small subset of MS_... stuff).

    This does *not* convert the filesystems to new constants; only the
    infrastructure is done here. The next step in that series is where the
    conflicts would be; that's the conversion of filesystems. It's purely
    mechanical and it's better done after the merge, so if you could run
    something like

    list=$(for i in MS_RDONLY MS_NOSUID MS_NODEV MS_NOEXEC MS_SYNCHRONOUS MS_MANDLOCK MS_DIRSYNC MS_NOATIME MS_NODIRATIME MS_SILENT MS_POSIXACL MS_KERNMOUNT MS_I_VERSION MS_LAZYTIME; do git grep -l $i fs drivers/staging/lustre drivers/mtd ipc mm include/linux; done|sort|uniq|grep -v '^fs/namespace.c$')

    sed -i -e 's/\/SB_RDONLY/g' \
    -e 's/\/SB_NOSUID/g' \
    -e 's/\/SB_NODEV/g' \
    -e 's/\/SB_NOEXEC/g' \
    -e 's/\/SB_SYNCHRONOUS/g' \
    -e 's/\/SB_MANDLOCK/g' \
    -e 's/\/SB_DIRSYNC/g' \
    -e 's/\/SB_NOATIME/g' \
    -e 's/\/SB_NODIRATIME/g' \
    -e 's/\/SB_SILENT/g' \
    -e 's/\/SB_POSIXACL/g' \
    -e 's/\/SB_KERNMOUNT/g' \
    -e 's/\/SB_I_VERSION/g' \
    -e 's/\/SB_LAZYTIME/g' \
    $list

    and commit it with something along the lines of 'convert filesystems
    away from use of MS_... constants' as commit message, it would save a
    quite a bit of headache next cycle"

    * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    VFS: Differentiate mount flags (MS_*) from internal superblock flags
    VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)
    vfs: Add sb_rdonly(sb) to query the MS_RDONLY flag on s_flags

    Linus Torvalds
     
  • Pull more set_fs removal from Al Viro:
    "Christoph's 'use kernel_read and friends rather than open-coding
    set_fs()' series"

    * 'work.set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: unexport vfs_readv and vfs_writev
    fs: unexport vfs_read and vfs_write
    fs: unexport __vfs_read/__vfs_write
    lustre: switch to kernel_write
    gadget/f_mass_storage: stop messing with the address limit
    mconsole: switch to kernel_read
    btrfs: switch write_buf to kernel_write
    net/9p: switch p9_fd_read to kernel_write
    mm/nommu: switch do_mmap_private to kernel_read
    serial2002: switch serial2002_tty_write to kernel_{read/write}
    fs: make the buf argument to __kernel_write a void pointer
    fs: fix kernel_write prototype
    fs: fix kernel_read prototype
    fs: move kernel_read to fs/read_write.c
    fs: move kernel_write to fs/read_write.c
    autofs4: switch autofs4_write to __kernel_write
    ashmem: switch to ->read_iter

    Linus Torvalds
     

14 Sep, 2017

1 commit

  • Pull overlayfs updates from Miklos Szeredi:
    "This fixes d_ino correctness in readdir, which brings overlayfs on par
    with normal filesystems regarding inode number semantics, as long as
    all layers are on the same filesystem.

    There are also some bug fixes, one in particular (random ioctl's
    shouldn't be able to modify lower layers) that touches some vfs code,
    but of course no-op for non-overlay fs"

    * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
    ovl: fix false positive ESTALE on lookup
    ovl: don't allow writing ioctl on lower layer
    ovl: fix relatime for directories
    vfs: add flags to d_real()
    ovl: cleanup d_real for negative
    ovl: constant d_ino for non-merge dirs
    ovl: constant d_ino across copy up
    ovl: fix readdir error value
    ovl: check snprintf return

    Linus Torvalds
     

09 Sep, 2017

2 commits

  • The fstatat(2) and statx() calls can pass the flag AT_NO_AUTOMOUNT which
    is meant to clear the LOOKUP_AUTOMOUNT flag and prevent triggering of an
    automount by the call. But this flag is unconditionally cleared for all
    stat family system calls except statx().

    stat family system calls have always triggered mount requests for the
    negative dentry case in follow_automount() which is intended but prevents
    the fstatat(2) and statx() AT_NO_AUTOMOUNT case from being handled.

    In order to handle the AT_NO_AUTOMOUNT for both system calls the negative
    dentry case in follow_automount() needs to be changed to return ENOENT
    when the LOOKUP_AUTOMOUNT flag is clear (and the other required flags are
    clear).

    AFAICT this change doesn't have any noticable side effects and may, in
    some use cases (although I didn't see it in testing) prevent unnecessary
    callbacks to the automount daemon.

    It's also possible that a stat family call has been made with a path that
    is in the process of being mounted by some other process. But stat family
    calls should return the automount state of the path as it is "now" so it
    shouldn't wait for mount completion.

    This is the same semantic as the positive dentry case already handled.

    Link: http://lkml.kernel.org/r/150216641255.11652.4204561328197919771.stgit@pluto.themaw.net
    Fixes: deccf497d804a4c5fca ("Make stat/lstat/fstatat pass AT_NO_AUTOMOUNT to vfs_statx()")
    Signed-off-by: Ian Kent
    Cc: David Howells
    Cc: Colin Walters
    Cc: Ondrej Holy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ian Kent
     
  • Allow interval trees to quickly check for overlaps to avoid unnecesary
    tree lookups in interval_tree_iter_first().

    As of this patch, all interval tree flavors will require using a
    'rb_root_cached' such that we can have the leftmost node easily
    available. While most users will make use of this feature, those with
    special functions (in addition to the generic insert, delete, search
    calls) will avoid using the cached option as they can do funky things
    with insertions -- for example, vma_interval_tree_insert_after().

    [jglisse@redhat.com: fix deadlock from typo vm_lock_anon_vma()]
    Link: http://lkml.kernel.org/r/20170808225719.20723-1-jglisse@redhat.com
    Link: http://lkml.kernel.org/r/20170719014603.19029-12-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Jérôme Glisse
    Acked-by: Christian König
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Doug Ledford
    Acked-by: Michael S. Tsirkin
    Cc: David Airlie
    Cc: Jason Wang
    Cc: Christian Benvenuti
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

08 Sep, 2017

2 commits

  • Pull quota scaling updates from Jan Kara:
    "This contains changes to make the quota subsystem more scalable.

    Reportedly it improves number of files created per second on ext4
    filesystem on fast storage by about a factor of 2x"

    * 'quota_scaling' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (28 commits)
    quota: Add lock annotations to struct members
    quota: Reduce contention on dq_data_lock
    fs: Provide __inode_get_bytes()
    quota: Inline dquot_[re]claim_reserved_space() into callsite
    quota: Inline inode_{incr,decr}_space() into callsites
    quota: Inline functions into their callsites
    ext4: Disable dirty list tracking of dquots when journalling quotas
    quota: Allow disabling tracking of dirty dquots in a list
    quota: Remove dq_wait_unused from dquot
    quota: Move locking into clear_dquot_dirty()
    quota: Do not dirty bad dquots
    quota: Fix possible corruption of dqi_flags
    quota: Propagate ->quota_read errors from v2_read_file_info()
    quota: Fix error codes in v2_read_file_info()
    quota: Push dqio_sem down to ->read_file_info()
    quota: Push dqio_sem down to ->write_file_info()
    quota: Push dqio_sem down to ->get_next_id()
    quota: Push dqio_sem down to ->release_dqblk()
    quota: Remove locking for writing to the old quota format
    quota: Do not acquire dqio_sem for dquot overwrites in v2 format
    ...

    Linus Torvalds
     
  • Pull block layer updates from Jens Axboe:
    "This is the first pull request for 4.14, containing most of the code
    changes. It's a quiet series this round, which I think we needed after
    the churn of the last few series. This contains:

    - Fix for a registration race in loop, from Anton Volkov.

    - Overflow complaint fix from Arnd for DAC960.

    - Series of drbd changes from the usual suspects.

    - Conversion of the stec/skd driver to blk-mq. From Bart.

    - A few BFQ improvements/fixes from Paolo.

    - CFQ improvement from Ritesh, allowing idling for group idle.

    - A few fixes found by Dan's smatch, courtesy of Dan.

    - A warning fixup for a race between changing the IO scheduler and
    device remova. From David Jeffery.

    - A few nbd fixes from Josef.

    - Support for cgroup info in blktrace, from Shaohua.

    - Also from Shaohua, new features in the null_blk driver to allow it
    to actually hold data, among other things.

    - Various corner cases and error handling fixes from Weiping Zhang.

    - Improvements to the IO stats tracking for blk-mq from me. Can
    drastically improve performance for fast devices and/or big
    machines.

    - Series from Christoph removing bi_bdev as being needed for IO
    submission, in preparation for nvme multipathing code.

    - Series from Bart, including various cleanups and fixes for switch
    fall through case complaints"

    * 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
    kernfs: checking for IS_ERR() instead of NULL
    drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
    drbd: Fix allyesconfig build, fix recent commit
    drbd: switch from kmalloc() to kmalloc_array()
    drbd: abort drbd_start_resync if there is no connection
    drbd: move global variables to drbd namespace and make some static
    drbd: rename "usermode_helper" to "drbd_usermode_helper"
    drbd: fix race between handshake and admin disconnect/down
    drbd: fix potential deadlock when trying to detach during handshake
    drbd: A single dot should be put into a sequence.
    drbd: fix rmmod cleanup, remove _all_ debugfs entries
    drbd: Use setup_timer() instead of init_timer() to simplify the code.
    drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
    drbd: new disk-option disable-write-same
    drbd: Fix resource role for newly created resources in events2
    drbd: mark symbols static where possible
    drbd: Send P_NEG_ACK upon write error in protocol != C
    drbd: add explicit plugging when submitting batches
    drbd: change list_for_each_safe to while(list_first_entry_or_null)
    drbd: introduce drbd_recv_header_maybe_unplug
    ...

    Linus Torvalds
     

07 Sep, 2017

5 commits

  • Merge updates from Andrew Morton:

    - various misc bits

    - DAX updates

    - OCFS2

    - most of MM

    * emailed patches from Andrew Morton : (119 commits)
    mm,fork: introduce MADV_WIPEONFORK
    x86,mpx: make mpx depend on x86-64 to free up VMA flag
    mm: add /proc/pid/smaps_rollup
    mm: hugetlb: clear target sub-page last when clearing huge page
    mm: oom: let oom_reap_task and exit_mmap run concurrently
    swap: choose swap device according to numa node
    mm: replace TIF_MEMDIE checks by tsk_is_oom_victim
    mm, oom: do not rely on TIF_MEMDIE for memory reserves access
    z3fold: use per-cpu unbuddied lists
    mm, swap: don't use VMA based swap readahead if HDD is used as swap
    mm, swap: add sysfs interface for VMA based swap readahead
    mm, swap: VMA based swap readahead
    mm, swap: fix swap readahead marking
    mm, swap: add swap readahead hit statistics
    mm/vmalloc.c: don't reinvent the wheel but use existing llist API
    mm/vmstat.c: fix wrong comment
    selftests/memfd: add memfd_create hugetlbfs selftest
    mm/shmem: add hugetlbfs support to memfd_create()
    mm, devm_memremap_pages: use multi-order radix for ZONE_DEVICE lookups
    mm/vmalloc.c: halve the number of comparisons performed in pcpu_get_vm_areas()
    ...

    Linus Torvalds
     
  • Link: http://lkml.kernel.org/r/20170525102927.6163-1-jlayton@redhat.com
    Signed-off-by: Jeff Layton
    Reviewed-by: Jan Kara
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Layton
     
  • Pull writeback error handling updates from Jeff Layton:
    "This pile continues the work from last cycle on better tracking
    writeback errors. In v4.13 we added some basic errseq_t infrastructure
    and converted a few filesystems to use it.

    This set continues refining that infrastructure, adds documentation,
    and converts most of the other filesystems to use it. The main
    exception at this point is the NFS client"

    * tag 'wberr-v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
    ecryptfs: convert to file_write_and_wait in ->fsync
    mm: remove optimizations based on i_size in mapping writeback waits
    fs: convert a pile of fsync routines to errseq_t based reporting
    gfs2: convert to errseq_t based writeback error reporting for fsync
    fs: convert sync_file_range to use errseq_t based error-tracking
    mm: add file_fdatawait_range and file_write_and_wait
    fuse: convert to errseq_t based error tracking for fsync
    mm: consolidate dax / non-dax checks for writeback
    Documentation: add some docs for errseq_t
    errseq: rename __errseq_set to errseq_set

    Linus Torvalds
     
  • Pull file locking updates from Jeff Layton:
    "This pile just has a few file locking fixes from Ben Coddington. There
    are a couple of cleanup patches + an attempt to bring sanity to the
    l_pid value that is reported back to userland on an F_GETLK request.

    After a few gyrations, he came up with a way for filesystems to
    communicate to the VFS layer code whether the pid should be translated
    according to the namespace or presented as-is to userland"

    * tag 'locks-v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
    locks: restore a warn for leaked locks on close
    fs/locks: Remove fl_nspid and use fs-specific l_pid for remote locks
    fs/locks: Use allocation rather than the stack in fcntl_getlk()

    Linus Torvalds
     
  • Pull XFS updates from Darrick Wong:
    "Here are the changes for xfs for 4.14. Most of these are cleanups and
    fixes for bad behavior, as we're mostly focusing on improving
    reliablity this cycle (read: there's potentially a lot of stuff on the
    horizon for 4.15 so better to spend a few weeks killing other bugs
    now).

    Summary:

    - Write unmount record for a ro mount to avoid unnecessary log replay

    - Clean up orphaned inodes when mounting fs readonly

    - Resubmit inode log items when buffer writeback fails to avoid
    umount hang

    - Fix log recovery corruption problems when log headers wrap around
    the end

    - Avoid infinite loop searching for free inodes when inode counters
    are wrong

    - Evict inodes involved with log redo so that we don't leak them
    later

    - Fix a potential race between reclaim and inode cluster freeing

    - Refactor the inode joining code w.r.t. transaction rolling &
    deferred ops

    - Fix a bug where the log doesn't properly deal with dirty buffers
    that are about to become ordered buffers

    - Fix the extent swap code to deal with making dirty buffers ordered
    properly

    - Consolidate page fault handlers

    - Refactor the incore extent manipulation functions to use the iext
    abstractions instead of directly modifying with extent data

    - Disable crashy chattr +/-x until we fix it

    - Don't allow us to set S_DAX for v2 inodes

    - Various cleanups

    - Clarify some documentation

    - Fix a problem where fsync and a log commit race to send the disk a
    flush command, resulting in a small window where power fail data
    loss could occur

    - Simplify some rmap operations in the fcollapse code

    - Fix some use-after-free problems in async writeback"

    * tag 'xfs-4.14-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (44 commits)
    xfs: use kmem_free to free return value of kmem_zalloc
    xfs: open code end_buffer_async_write in xfs_finish_page_writeback
    xfs: don't set v3 xflags for v2 inodes
    xfs: fix compiler warnings
    fsmap: fix documentation of FMR_OF_LAST
    xfs: simplify the rmap code in xfs_bmse_merge
    xfs: remove unused flags arg from xfs_file_iomap_begin_delay
    xfs: fix incorrect log_flushed on fsync
    xfs: disable per-inode DAX flag
    xfs: replace xfs_qm_get_rtblks with a direct call to xfs_bmap_count_leaves
    xfs: rewrite xfs_bmap_count_leaves using xfs_iext_get_extent
    xfs: use xfs_iext_*_extent helpers in xfs_bmap_split_extent_at
    xfs: use xfs_iext_*_extent helpers in xfs_bmap_shift_extents
    xfs: move some code around inside xfs_bmap_shift_extents
    xfs: use xfs_iext_get_extent in xfs_bmap_first_unused
    xfs: switch xfs_bmap_local_to_extents to use xfs_iext_insert
    xfs: add a xfs_iext_update_extent helper
    xfs: consolidate the various page fault handlers
    iomap: return VM_FAULT_* codes from iomap_page_mkwrite
    xfs: relog dirty buffers during swapext bmbt owner change
    ...

    Linus Torvalds
     

06 Sep, 2017

1 commit

  • Pull char/misc driver updates from Greg KH:
    "Here is the big char/misc driver update for 4.14-rc1.

    Lots of different stuff in here, it's been an active development cycle
    for some reason. Highlights are:

    - updated binder driver, this brings binder up to date with what
    shipped in the Android O release, plus some more changes that
    happened since then that are in the Android development trees.

    - coresight updates and fixes

    - mux driver file renames to be a bit "nicer"

    - intel_th driver updates

    - normal set of hyper-v updates and changes

    - small fpga subsystem and driver updates

    - lots of const code changes all over the driver trees

    - extcon driver updates

    - fmc driver subsystem upadates

    - w1 subsystem minor reworks and new features and drivers added

    - spmi driver updates

    Plus a smattering of other minor driver updates and fixes.

    All of these have been in linux-next with no reported issues for a
    while"

    * tag 'char-misc-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (244 commits)
    ANDROID: binder: don't queue async transactions to thread.
    ANDROID: binder: don't enqueue death notifications to thread todo.
    ANDROID: binder: Don't BUG_ON(!spin_is_locked()).
    ANDROID: binder: Add BINDER_GET_NODE_DEBUG_INFO ioctl
    ANDROID: binder: push new transactions to waiting threads.
    ANDROID: binder: remove proc waitqueue
    android: binder: Add page usage in binder stats
    android: binder: fixup crash introduced by moving buffer hdr
    drivers: w1: add hwmon temp support for w1_therm
    drivers: w1: refactor w1_slave_show to make the temp reading functionality separate
    drivers: w1: add hwmon support structures
    eeprom: idt_89hpesx: Support both ACPI and OF probing
    mcb: Fix an error handling path in 'chameleon_parse_cells()'
    MCB: add support for SC31 to mcb-lpc
    mux: make device_type const
    char: virtio: constify attribute_group structures.
    Documentation/ABI: document the nvmem sysfs files
    lkdtm: fix spelling mistake: "incremeted" -> "incremented"
    perf: cs-etm: Fix ETMv4 CONFIGR entry in perf.data file
    nvmem: include linux/err.h from header
    ...

    Linus Torvalds
     

05 Sep, 2017

7 commits


02 Sep, 2017

1 commit

  • When we introduced the bmap redo log items, we set MS_ACTIVE on the
    mountpoint and XFS_IRECOVERY on the inode to prevent unlinked inodes
    from being truncated prematurely during log recovery. This also had the
    effect of putting linked inodes on the lru instead of evicting them.

    Unfortunately, we neglected to find all those unreferenced lru inodes
    and evict them after finishing log recovery, which means that we leak
    them if anything goes wrong in the rest of xfs_mountfs, because the lru
    is only cleaned out on unmount.

    Therefore, evict unreferenced inodes in the lru list immediately
    after clearing MS_ACTIVE.

    Fixes: 17c12bcd30 ("xfs: when replaying bmap operations, don't let unlinked inodes get reaped")
    Signed-off-by: Darrick J. Wong
    Cc: viro@ZenIV.linux.org.uk
    Reviewed-by: Brian Foster

    Darrick J. Wong
     

01 Sep, 2017

1 commit


28 Aug, 2017

2 commits

  • We want the binder fix in here as well for testing and merge issues.

    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     
  • We have a MAX_LFS_FILESIZE macro that is meant to be filled in by
    filesystems (and other IO targets) that know they are 64-bit clean and
    don't have any 32-bit limits in their IO path.

    It turns out that our 32-bit value for that limit was bogus. On 32-bit,
    the VM layer is limited by the page cache to only 32-bit index values,
    but our logic for that was confusing and actually wrong. We used to
    define that value to

    (((loff_t)PAGE_SIZE << (BITS_PER_LONG-1))-1)

    which is actually odd in several ways: it limits the index to 31 bits,
    and then it limits files so that they can't have data in that last byte
    of a page that has the highest 31-bit index (ie page index 0x7fffffff).

    Neither of those limitations make sense. The index is actually the full
    32 bit unsigned value, and we can use that whole full page. So the
    maximum size of the file would logically be "PAGE_SIZE << BITS_PER_LONG".

    However, we do wan tto avoid the maximum index, because we have code
    that iterates over the page indexes, and we don't want that code to
    overflow. So the maximum size of a file on a 32-bit host should
    actually be one page less than the full 32-bit index.

    So the actual limit is ULONG_MAX << PAGE_SHIFT. That means that we will
    not actually be using the page of that last index (ULONG_MAX), but we
    can grow a file up to that limit.

    The wrong value of MAX_LFS_FILESIZE actually caused problems for Doug
    Nazar, who was still using a 32-bit host, but with a 9.7TB 2 x RAID5
    volume. It turns out that our old MAX_LFS_FILESIZE was 8TiB (well, one
    byte less), but the actual true VM limit is one page less than 16TiB.

    This was invisible until commit c2a9737f45e2 ("vfs,mm: fix a dead loop
    in truncate_inode_pages_range()"), which started applying that
    MAX_LFS_FILESIZE limit to block devices too.

    NOTE! On 64-bit, the page index isn't a limiter at all, and the limit is
    actually just the offset type itself (loff_t), which is signed. But for
    clarity, on 64-bit, just use the maximum signed value, and don't make
    people have to count the number of 'f' characters in the hex constant.

    So just use LLONG_MAX for the 64-bit case. That was what the value had
    been before too, just written out as a hex constant.

    Fixes: c2a9737f45e2 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()")
    Reported-and-tested-by: Doug Nazar
    Cc: Andreas Dilger
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Dave Kleikamp
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

24 Aug, 2017

1 commit


18 Aug, 2017

1 commit

  • Provide helper __inode_get_bytes() which assumes i_lock is already
    acquired. Quota code will need this to be able to use i_lock to protect
    consistency of quota accounting information and inode usage.

    Signed-off-by: Jan Kara

    Jan Kara
     

01 Aug, 2017

2 commits

  • Marcelo added this i_size based optimization with a patch in 2004
    (commitid is from the linux-history tree):

    commit 765dad09b4ac101a32d87af2bb793c3060497d3c
    Author: Marcelo Tosatti
    Date: Tue Sep 7 17:51:17 2004 -0700

    small wait_on_page_writeback_range() optimization

    filemap_fdatawait() calls wait_on_page_writeback_range() with -1
    as "end" parameter. This is not needed since we know the EOF
    from the inode. Use that instead.

    There may be races here, particularly with clustered or network
    filesystems. It also seems like a bit of a layering violation since
    we're operating on an address_space here, not an inode.

    Finally, it's also questionable whether this optimization really helps
    on workloads that we care about. Should we be optimizing for writeback
    vs. truncate races in a codepath where we expect to wait anyway? It
    doesn't seem worth the risk.

    Remove this optimization from the filemap_fdatawait codepaths. This
    means that filemap_fdatawait becomes a trivial wrapper around
    filemap_fdatawait_range.

    Reviewed-by: Jan Kara
    Signed-off-by: Jeff Layton

    Jeff Layton
     
  • Necessary now for gfs2_fsync and sync_file_range, but there will
    eventually be other callers.

    Reviewed-by: Jan Kara
    Signed-off-by: Jeff Layton

    Jeff Layton
     

29 Jul, 2017

1 commit


24 Jul, 2017

1 commit