28 Jul, 2016

4 commits

  • Pull networking updates from David Miller:

    1) Unified UDP encapsulation offload methods for drivers, from
    Alexander Duyck.

    2) Make DSA binding more sane, from Andrew Lunn.

    3) Support QCA9888 chips in ath10k, from Anilkumar Kolli.

    4) Several workqueue usage cleanups, from Bhaktipriya Shridhar.

    5) Add XDP (eXpress Data Path), essentially running BPF programs on RX
    packets as soon as the device sees them, with the option to mirror
    the packet on TX via the same interface. From Brenden Blanco and
    others.

    6) Allow qdisc/class stats dumps to run lockless, from Eric Dumazet.

    7) Add VLAN support to b53 and bcm_sf2, from Florian Fainelli.

    8) Simplify netlink conntrack entry layout, from Florian Westphal.

    9) Add ipv4 forwarding support to mlxsw spectrum driver, from Ido
    Schimmel, Yotam Gigi, and Jiri Pirko.

    10) Add SKB array infrastructure and convert tun and macvtap over to it.
    From Michael S Tsirkin and Jason Wang.

    11) Support qdisc packet injection in pktgen, from John Fastabend.

    12) Add neighbour monitoring framework to TIPC, from Jon Paul Maloy.

    13) Add NV congestion control support to TCP, from Lawrence Brakmo.

    14) Add GSO support to SCTP, from Marcelo Ricardo Leitner.

    15) Allow GRO and RPS to function on macsec devices, from Paolo Abeni.

    16) Support MPLS over IPV4, from Simon Horman.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits)
    xgene: Fix build warning with ACPI disabled.
    be2net: perform temperature query in adapter regardless of its interface state
    l2tp: Correctly return -EBADF from pppol2tp_getname.
    net/mlx5_core/health: Remove deprecated create_singlethread_workqueue
    net: ipmr/ip6mr: update lastuse on entry change
    macsec: ensure rx_sa is set when validation is disabled
    tipc: dump monitor attributes
    tipc: add a function to get the bearer name
    tipc: get monitor threshold for the cluster
    tipc: make cluster size threshold for monitoring configurable
    tipc: introduce constants for tipc address validation
    net: neigh: disallow transition to NUD_STALE if lladdr is unchanged in neigh_update()
    MAINTAINERS: xgene: Add driver and documentation path
    Documentation: dtb: xgene: Add MDIO node
    dtb: xgene: Add MDIO node
    drivers: net: xgene: ethtool: Use phy_ethtool_gset and sset
    drivers: net: xgene: Use exported functions
    drivers: net: xgene: Enable MDIO driver
    drivers: net: xgene: Add backward compatibility
    drivers: net: phy: xgene: Add MDIO driver
    ...

    Linus Torvalds
     
  • Pull dlm updates from David Teigland:
    "This set includes two trivial changes, one to use kmemdup and another
    to control the log level of recovery messages"

    * tag 'dlm-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
    dlm: Use kmemdup instead of kmalloc and memcpy
    dlm: add log_info config option

    Linus Torvalds
     
  • Pull f2fs updates from Jaegeuk Kim:
    "The major change in this version is mitigating cpu overheads on write
    paths by replacing redundant inode page updates with mark_inode_dirty
    calls. And we tried to reduce lock contentions as well to improve
    filesystem scalability. Another feature is setting F2FS automatically
    when detecting host-managed SMR.

    Enhancements:
    - ioctl to move a range of data between files
    - inject orphan inode errors
    - avoid flush commands congestion
    - support lazytime

    Bug fixes:
    - return proper results for some dentry operations
    - fix deadlock in add_link failure
    - disable extent_cache for fcollapse/finsert"

    * tag 'for-f2fs-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (68 commits)
    f2fs: clean up coding style and redundancy
    f2fs: get victim segment again after new cp
    f2fs: handle error case with f2fs_bug_on
    f2fs: avoid data race when deciding checkpoin in f2fs_sync_file
    f2fs: support an ioctl to move a range of data blocks
    f2fs: fix to report error number of f2fs_find_entry
    f2fs: avoid memory allocation failure due to a long length
    f2fs: reset default idle interval value
    f2fs: use blk_plug in all the possible paths
    f2fs: fix to avoid data update racing between GC and DIO
    f2fs: add maximum prefree segments
    f2fs: disable extent_cache for fcollapse/finsert inodes
    f2fs: refactor __exchange_data_block for speed up
    f2fs: fix ERR_PTR returned by bio
    f2fs: avoid mark_inode_dirty
    f2fs: move i_size_write in f2fs_write_end
    f2fs: fix to avoid redundant discard during fstrim
    f2fs: avoid mismatching block range for discard
    f2fs: fix incorrect f_bfree calculation in ->statfs
    f2fs: use percpu_rw_semaphore
    ...

    Linus Torvalds
     
  • Pull xfs updates from Dave Chinner:
    "The major addition is the new iomap based block mapping
    infrastructure. We've been kicking this about locally for years, but
    there are other filesystems that want to use it too (e.g. gfs2). Now it
    is fully working, reviewed and ready to be merged and used by other
    filesystems.

    There are a lot of other fixes and cleanups in the tree, but those are
    XFS internal things and none are of the scale or visibility of the
    iomap changes. See below for details.

    I am likely to send another pull request next week - we're just about
    ready to merge some new functionality (on disk block->owner reverse
    mapping infrastructure), but that's a huge chunk of code (74 files
    changed, 7283 insertions(+), 1114 deletions(-)) so I'm keeping that
    separate to all the "normal" pull request changes so they don't get
    lost in the noise.

    Summary of changes in this update:
    - generic iomap based IO path infrastructure
    - generic iomap based fiemap implementation
    - xfs iomap based IO path implementation
    - buffer error handling fixes
    - tracking of in flight buffer IO for unmount serialisation
    - direct IO and DAX io path separation and simplification
    - shortform directory format definition changes for wider platform
    compatibility
    - various buffer cache fixes
    - cleanups in preparation for rmap merge
    - error injection cleanups and fixes
    - log item format buffer memory allocation restructuring to prevent
    rare OOM reclaim deadlocks
    - sparse inode chunks are now fully supported"

    * tag 'xfs-for-linus-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (53 commits)
    xfs: remove EXPERIMENTAL tag from sparse inode feature
    xfs: bufferhead chains are invalid after end_page_writeback
    xfs: allocate log vector buffers outside CIL context lock
    libxfs: directory node splitting does not have an extra block
    xfs: remove dax code from object file when disabled
    xfs: skip dirty pages in ->releasepage()
    xfs: remove __arch_pack
    xfs: kill xfs_dir2_inou_t
    xfs: kill xfs_dir2_sf_off_t
    xfs: split direct I/O and DAX path
    xfs: direct calls in the direct I/O path
    xfs: stop using generic_file_read_iter for direct I/O
    xfs: split xfs_file_read_iter into buffered and direct I/O helpers
    xfs: remove s_maxbytes enforcement in xfs_file_read_iter
    xfs: kill ioflags
    xfs: don't pass ioflags around in the ioctl path
    xfs: track and serialize in-flight async buffers against unmount
    xfs: exclude never-released buffers from buftarg I/O accounting
    xfs: don't reset b_retries to 0 on every failure
    xfs: remove extraneous buffer flag changes
    ...

    Linus Torvalds
     

27 Jul, 2016

21 commits

  • Merge updates from Andrew Morton:

    - a few misc bits

    - ocfs2

    - most(?) of MM

    * emailed patches from Andrew Morton: (125 commits)
    thp: fix comments of __pmd_trans_huge_lock()
    cgroup: remove unnecessary 0 check from css_from_id()
    cgroup: fix idr leak for the first cgroup root
    mm: memcontrol: fix documentation for compound parameter
    mm: memcontrol: remove BUG_ON in uncharge_list
    mm: fix build warnings in
    mm, thp: convert from optimistic swapin collapsing to conservative
    mm, thp: fix comment inconsistency for swapin readahead functions
    thp: update Documentation/{vm/transhuge,filesystems/proc}.txt
    shmem: split huge pages beyond i_size under memory pressure
    thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE
    khugepaged: add support of collapse for tmpfs/shmem pages
    shmem: make shmem_inode_info::lock irq-safe
    khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page()
    thp: extract khugepaged from mm/huge_memory.c
    shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings
    shmem: add huge pages support
    shmem: get_unmapped_area align huge page
    shmem: prepare huge= mount option and sysfs knob
    mm, rmap: account shmem thp pages
    ...

    Linus Torvalds
     
  • Pull media updates from Mauro Carvalho Chehab:

    - new framework support for HDMI CEC and remote control support

    - new encoding codec driver for Mediatek SoC

    - new frontend driver: helene tuner

    - added support for NetUp almost universal devices, which support
    DVB-C/S/S2/T/T2 and ISDB-T

    - the mn88472 frontend driver got promoted from staging

    - a new driver for RCar video input

    - some soc_camera legacy drivers got removed: timb, omap1, mx2, mx3

    - lots of driver cleanups, improvements and fixups

    * tag 'media/v4.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (377 commits)
    [media] cec: always check all_device_types and features
    [media] cec: poll should check if there is room in the tx queue
    [media] vivid: support monitor all mode
    [media] cec: fix test for unconfigured adapter in main message loop
    [media] cec: limit the size of the transmit queue
    [media] cec: zero unused msg part after msg->len
    [media] cec: don't set fh to NULL in CEC_TRANSMIT
    [media] cec: clear all status fields before transmit and always fill in sequence
    [media] cec: CEC_RECEIVE overwrote the timeout field
    [media] cxd2841er: Reading SNR for DVB-C added
    [media] cxd2841er: Reading BER and UCB for DVB-C added
    [media] cxd2841er: fix switch-case for DVB-C
    [media] cxd2841er: fix signal strength scale for ISDB-T
    [media] cxd2841er: adjust the dB scale for DVB-C
    [media] cxd2841er: provide signal strength for DVB-C
    [media] cxd2841er: fix BER report via DVBv5 stats API
    [media] mb86a20s: apply mask to val after checking for read failure
    [media] airspy: fix error logic during device register
    [media] s5p-cec/TODO: add TODO item
    [media] cec/TODO: drop comment about sphinx documentation
    ...

    Linus Torvalds
     
  • Pull pstore subsystem updates from Kees Cook:
    "This expands the supported compressors, fixes some bugs, and finally
    adds DT bindings"

    * tag 'pstore-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    pstore/ram: add Device Tree bindings
    efi-pstore: implement efivars_pstore_exit()
    pstore: drop file opened reference count
    pstore: add lzo/lz4 compression support
    pstore: Cleanup pstore_dump()
    pstore: Enable compression on normal path (again)
    ramoops: Only unregister when registered

    Linus Torvalds
     
  • Pull orangefs updates from Mike Marshall:
    "Orangefs cleanups and enablement of O_DIRECT in open.

    Cleanups:

    - remove some unused defines, and also some obfuscatory ones.

    - remove a redundant xattr handler.

    - Remove useless xattr prefix arguments.

    - Be more picky about uid and gid handling WRT namespaces.

    Our use of current_user_ns() instead of init_user_ns left open the
    possibility that users could spoof their uids or gids when the
    server was running in a different namespace in "default security"
    mode.

    - Allow open(2) to succeed with O_DIRECT"

    * tag 'for-linus-4.8-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
    orangefs: fix namespace handling
    Orangefs: allow O_DIRECT in open
    orangefs: Remove useless xattr prefix arguments
    orangefs: Remove redundant "trusted." xattr handler
    orangefs: Remove useless defines

    Linus Torvalds
     
  • Pull ext4 updates from Ted Ts'o:
    "The major change this cycle is deleting ext4's copy of the file system
    encryption code and switching things over to using the copies in
    fs/crypto. I've updated the MAINTAINERS file to add an entry for
    fs/crypto listing Jaegeuk Kim and myself as the maintainers.

    There are also a number of bug fixes, most notably for some problems
    found by American Fuzzy Lop (AFL) courtesy of Vegard Nossum. Also
    fixed is a writeback deadlock detected by generic/130, and some
    potential races in the metadata checksum code"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (21 commits)
    ext4: verify extent header depth
    ext4: short-cut orphan cleanup on error
    ext4: fix reference counting bug on block allocation error
    MAINTAINRES: fs-crypto maintainers update
    ext4 crypto: migrate into vfs's crypto engine
    ext2: fix filesystem deadlock while reading corrupted xattr block
    ext4: fix project quota accounting without quota limits enabled
    ext4: validate s_reserved_gdt_blocks on mount
    ext4: remove unused page_idx
    ext4: don't call ext4_should_journal_data() on the journal inode
    ext4: Fix WARN_ON_ONCE in ext4_commit_super()
    ext4: fix deadlock during page writeback
    ext4: correct error value of function verifying dx checksum
    ext4: avoid modifying checksum fields directly during checksum verification
    ext4: check for extents that wrap around
    jbd2: make journal y2038 safe
    jbd2: track more dependencies on transaction commit
    jbd2: move lockdep tracking to journal_s
    jbd2: move lockdep instrumentation for jbd2 handles
    ext4: respect the nobarrier mount option in nojournal mode
    ...

    Linus Torvalds
     
  • Let's add ShmemHugePages and ShmemPmdMapped fields into meminfo and
    smaps. It indicates how many times we allocate and map shmem THP.

    NR_ANON_TRANSPARENT_HUGEPAGES is renamed to NR_ANON_THPS.

    Link: http://lkml.kernel.org/r/1466021202-61880-27-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The idea is borrowed from Peter's patch in the patchset on speculative page
    faults[1]:

    Instead of passing around the endless list of function arguments,
    replace the lot with a single structure so we can change context without
    endless function signature changes.

    The changes are mostly mechanical with exception of faultaround code:
    filemap_map_pages() got reworked a bit.

    This patch is preparation for the next one.

    [1] http://lkml.kernel.org/r/20141020222841.302891540@infradead.org

    Link: http://lkml.kernel.org/r/1466021202-61880-9-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
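
As a rough illustration of the refactoring described above, the sketch below bundles the arguments of a fault handler into one structure. It is not the kernel's code: `fault_ctx`, `CTX_FLAG_RETRY`, `is_retry()` and `handle_fault()` are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical context that stands in for a long argument list. */
struct fault_ctx {
    unsigned long address;  /* faulting address */
    unsigned int flags;     /* fault flags */
    int write;              /* non-zero for a write fault */
};

#define CTX_FLAG_RETRY 0x1u /* hypothetical flag */

/* Helpers take one pointer, so adding a field needs no signature change. */
static int is_retry(const struct fault_ctx *ctx)
{
    return (ctx->flags & CTX_FLAG_RETRY) != 0;
}

/* Returns 0 for a read fault, 1 for a write fault, 2 for a retried fault. */
int handle_fault(const struct fault_ctx *ctx)
{
    assert(ctx != NULL);
    if (is_retry(ctx))
        return 2;
    return ctx->write ? 1 : 0;
}
```

With this shape, a new piece of per-fault state becomes a new field in the structure rather than a new parameter on every function in the call chain.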
     
  • Vladimir has noticed that we might declare memcg oom even during
    readahead because read_pages only uses GFP_KERNEL (with mapping_gfp
    restriction) while __do_page_cache_readahead uses
    page_cache_alloc_readahead which adds __GFP_NORETRY to prevent
    OOMs. This gfp mask discrepancy is really unfortunate and easily
    fixable. Drop page_cache_alloc_readahead() which only has one user and
    outsource the gfp_mask logic into readahead_gfp_mask and propagate this
    mask from __do_page_cache_readahead down to read_pages.

    This alone would have only very limited impact as most filesystems are
    implementing ->readpages and the common implementation mpage_readpages
    does GFP_KERNEL (with mapping_gfp restriction) again. We can tell it to
    use readahead_gfp_mask instead as this function is called only during
    readahead as well. The same applies to read_cache_pages.

    ext4 has its own ext4_mpage_readpages but the path which has pages !=
    NULL can use the same gfp mask. Btrfs, cifs, f2fs and orangefs are
    doing a very similar pattern to mpage_readpages so the same can be
    applied to them as well.

    [akpm@linux-foundation.org: coding-style fixes]
    [mhocko@suse.com: restrict gfp mask in mpage_alloc]
    Link: http://lkml.kernel.org/r/20160610074223.GC32285@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/1465301556-26431-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Chris Mason
    Cc: Steve French
    Cc: Theodore Ts'o
    Cc: Jan Kara
    Cc: Mike Marshall
    Cc: Jaegeuk Kim
    Cc: Changman Lee
    Cc: Chao Yu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
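
The mask propagation described above can be sketched as follows. This is a simplified model, not the kernel implementation: the `demo_` types, functions and flag values are assumptions made up for the example, and only the idea of computing the readahead mask once and passing it down comes from the commit message.

```c
#include <assert.h>

typedef unsigned int demo_gfp_t;

/* Hypothetical flag values, chosen only for this sketch. */
#define DEMO_GFP_KERNEL  0x1u
#define DEMO_GFP_NORETRY 0x2u

struct demo_mapping {
    demo_gfp_t gfp_mask;    /* per-mapping allocation restrictions */
};

/* Single place that decides which mask readahead allocations use. */
demo_gfp_t demo_readahead_gfp_mask(const struct demo_mapping *m)
{
    assert(m != 0);
    return m->gfp_mask | DEMO_GFP_NORETRY;
}

/* The lower layer uses the mask it is given instead of choosing its own. */
demo_gfp_t demo_read_pages(const struct demo_mapping *m, demo_gfp_t gfp)
{
    (void)m;
    return gfp;
}

/* The top-level readahead path computes the mask once and propagates it. */
demo_gfp_t demo_do_readahead(const struct demo_mapping *m)
{
    demo_gfp_t gfp = demo_readahead_gfp_mask(m);
    return demo_read_pages(m, gfp);
}
```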
     
  • Pipes can consume a significant amount of system memory, hence they
    should be accounted to kmemcg.

    This patch marks pipe_inode_info and anonymous pipe buffer page
    allocations as __GFP_ACCOUNT so that they would be charged to kmemcg.
    Note, since a pipe buffer page can be "stolen" and get reused for other
    purposes, including mapping to userspace, we clear PageKmemcg thus
    resetting page->_mapcount and uncharge it in anon_pipe_buf_steal, which
    is introduced by this patch.

    A note regarding the anon_pipe_buf_steal implementation. We allow stealing
    the page if its ref count equals 1. It looks racy, but it is correct
    for anonymous pipe buffer pages, because:

    - We lock out all other pipe users, because ->steal is called with
    pipe_lock held, so the page can't be spliced to another pipe from
    under us.

    - The page is not on LRU and it never was.

    - Thus a parallel thread can access it only by PFN. Although this is
    quite possible (e.g. see page_idle_get_page and balloon_page_isolate)
    this is not dangerous, because all such functions do is increase page
    ref count, check if the page is the one they are looking for, and
    decrease ref count if it isn't. Since our page is clean except for
    PageKmemcg mark, which doesn't conflict with other _mapcount users,
    the worst that can happen is we see page_count > 2 due to a transient
    ref, in which case we false-positively abort ->steal, which is still
    fine, because ->steal is not guaranteed to succeed.

    Link: http://lkml.kernel.org/r/20160527150313.GD26059@esperanza
    Signed-off-by: Vladimir Davydov
    Cc: Alexander Viro
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Eric Dumazet
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
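
The steal rule described above (succeed only when we hold the sole reference, otherwise give up) can be modelled in user space as below. This is a sketch under assumptions: `demo_page` and `demo_try_steal()` are hypothetical, and the `accounted` flag merely stands in for the PageKmemcg mark.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a page with a ref count and an accounting mark. */
struct demo_page {
    atomic_int refcount;
    bool accounted;     /* models the PageKmemcg mark */
};

/*
 * Steal succeeds only if the caller holds the sole reference. A transient
 * extra reference makes the steal fail, which is acceptable because a
 * steal is not guaranteed to succeed.
 */
bool demo_try_steal(struct demo_page *page)
{
    assert(page != NULL);
    if (atomic_load(&page->refcount) != 1)
        return false;
    page->accounted = false;    /* clear the mark before the page is reused */
    return true;
}
```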
     
  • The per-sb inode writeback list tracks inodes currently under writeback
    to facilitate efficient sync processing. In particular, it ensures that
    sync only needs to walk through a list of inodes that were cleaned by
    the sync.

    Add a couple tracepoints to help identify when inodes are added/removed
    to and from the writeback lists. Piggyback off of the writeback
    lazytime tracepoint template as it already tracks the relevant inode
    information.

    Link: http://lkml.kernel.org/r/1466594593-6757-3-git-send-email-bfoster@redhat.com
    Signed-off-by: Brian Foster
    Reviewed-by: Jan Kara
    Cc: Dave Chinner
    Cc: Josef Bacik
    Cc: Holger Hoffstätte
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brian Foster
     
  • wait_sb_inodes() currently does a walk of all inodes in the filesystem
    to find dirty ones to wait on during sync. This is highly inefficient
    and wastes a lot of CPU when there are lots of clean cached inodes that
    we don't need to wait on.

    To avoid this "all inode" walk, we need to track inodes that are
    currently under writeback that we need to wait for. We do this by
    adding inodes to a writeback list on the sb when the mapping is first
    tagged as having pages under writeback. wait_sb_inodes() can then walk
    this list of "inodes under IO" and wait specifically just for the inodes
    that the current sync(2) needs to wait for.

    Define a couple helpers to add/remove an inode from the writeback list
    and call them when the overall mapping is tagged for or cleared from
    writeback. Update wait_sb_inodes() to walk only the inodes under
    writeback due to the sync.

    With this change, filesystem sync times are significantly reduced for
    fs' with largely populated inode caches and otherwise no other work to
    do. For example, on a 16xcpu 2GHz x86-64 server, 10TB XFS filesystem
    with a ~10m entry inode cache, sync times are reduced from ~7.3s to less
    than 0.1s when the filesystem is fully clean.

    Link: http://lkml.kernel.org/r/1466594593-6757-2-git-send-email-bfoster@redhat.com
    Signed-off-by: Dave Chinner
    Signed-off-by: Josef Bacik
    Signed-off-by: Brian Foster
    Reviewed-by: Jan Kara
    Tested-by: Holger Hoffstätte
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Chinner
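
A minimal user-space model of the tracking scheme described above: inodes join a per-superblock list when writeback starts, leave it when writeback ends, and sync walks only that list. All `demo_` names are hypothetical and the sketch ignores locking entirely.

```c
#include <assert.h>
#include <stddef.h>

struct demo_inode {
    struct demo_inode *prev, *next; /* links in the per-sb writeback list */
    int on_wb_list;
};

struct demo_sb {
    struct demo_inode *wb_head;     /* inodes currently under writeback */
};

/* Called when the mapping is first tagged as having pages under writeback. */
void demo_wb_list_add(struct demo_sb *sb, struct demo_inode *inode)
{
    assert(sb != NULL && inode != NULL);
    if (inode->on_wb_list)
        return;
    inode->prev = NULL;
    inode->next = sb->wb_head;
    if (sb->wb_head)
        sb->wb_head->prev = inode;
    sb->wb_head = inode;
    inode->on_wb_list = 1;
}

/* Called when the mapping no longer has any pages under writeback. */
void demo_wb_list_del(struct demo_sb *sb, struct demo_inode *inode)
{
    assert(sb != NULL && inode != NULL);
    if (!inode->on_wb_list)
        return;
    if (inode->prev)
        inode->prev->next = inode->next;
    else
        sb->wb_head = inode->next;
    if (inode->next)
        inode->next->prev = inode->prev;
    inode->prev = inode->next = NULL;
    inode->on_wb_list = 0;
}

/* Sync walks only this list; returns how many inodes it would wait on. */
int demo_count_wait(const struct demo_sb *sb)
{
    int n = 0;
    for (const struct demo_inode *i = sb->wb_head; i; i = i->next)
        n++;
    return n;
}
```

The cost of the walk now scales with the number of inodes under IO rather than with the size of the inode cache, which is the effect the commit message reports.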
     
  • Clean up unnecessary assignment for 'ret'.

    Link: http://lkml.kernel.org/r/578C61F6.4080403@huawei.com
    Signed-off-by: Jun Piao
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    piaojun
     
  • These BUG_ON(!inode) are obscure because we have already used inode to
    get osb. And actually we can guarantee here inode is valid in the
    context. So we can safely remove them.

    Link: http://lkml.kernel.org/r/5776336A.6030104@huawei.com
    Signed-off-by: Joseph Qi
    Reviewed-by: Eric Ren
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • Several prototypes in inode.h are only declared but never actually
    implemented or used, so remove them.

    Link: http://lkml.kernel.org/r/57763787.4020706@huawei.com
    Signed-off-by: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • dlm_debug_ctxt->debug_refcnt is initialized to 1 and then increased to 2
    by dlm_debug_get in dlm_debug_init. But dlm_debug_put is called only
    once in dlm_debug_shutdown when dlm is unregistered, which leaks
    dlm_debug_ctxt.

    Link: http://lkml.kernel.org/r/577BB755.4030900@huawei.com
    Signed-off-by: Joseph Qi
    Reviewed-by: Jiufei Xue
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
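
The leak pattern described above is easy to reproduce with a toy reference counter. In this sketch the `demo_` names are hypothetical and only mirror the get/put imbalance from the commit message; they are not the ocfs2 functions.

```c
#include <assert.h>
#include <stdlib.h>

struct demo_ctxt {
    int refcnt;
};

static int demo_live_objects;   /* contexts allocated and not yet freed */

struct demo_ctxt *demo_new(void)
{
    struct demo_ctxt *dc = malloc(sizeof(*dc));
    if (!dc)
        return NULL;
    dc->refcnt = 1;             /* the creator's reference */
    demo_live_objects++;
    return dc;
}

void demo_get(struct demo_ctxt *dc)
{
    dc->refcnt++;
}

/* Frees the context only when the last reference is dropped. */
void demo_put(struct demo_ctxt *dc)
{
    assert(dc->refcnt > 0);
    if (--dc->refcnt == 0) {
        free(dc);
        demo_live_objects--;
    }
}

int demo_live(void)
{
    return demo_live_objects;
}
```

Calling `demo_new()`, then `demo_get()`, then a single `demo_put()` leaves the count at 1 and the object allocated; a second `demo_put()` (or dropping the extra `demo_get()`) is needed to release it.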
     
  • The last goto is unneeded, so remove it.

    Link: http://lkml.kernel.org/r/576213D3.6080002@huawei.com
    Signed-off-by: Joseph Qi
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • Journal replay will be run when performing recovery for a dead node. To
    avoid the stale cache impact, all blocks of dead node's journal inode
    were reloaded from disk. This hurts performance. Checking whether a
    block is cached before reloading it can improve performance a lot. In
    my test env, the time doing recovery was improved from 120s to 1s.

    [akpm@linux-foundation.org: clean up the for loop p_blkno handling]
    Link: http://lkml.kernel.org/r/1466155682-24656-1-git-send-email-junxiao.bi@oracle.com
    Signed-off-by: Junxiao Bi
    Reviewed-by: Joseph Qi
    Cc: "Gang He"
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
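
The optimisation described above amounts to skipping blocks that are not in the cache, since an uncached block cannot be stale. A simplified model, with a hypothetical `demo_journal` structure and a counter standing in for real disk reads:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NBLOCKS 8

struct demo_journal {
    bool cached[DEMO_NBLOCKS];  /* true if the block has a cached copy */
    int disk_reads;             /* counts simulated reloads from disk */
};

/* Reload only cached blocks; returns how many blocks were reloaded. */
int demo_refresh(struct demo_journal *j)
{
    int reloaded = 0;

    assert(j != NULL);
    for (size_t i = 0; i < DEMO_NBLOCKS; i++) {
        if (!j->cached[i])
            continue;           /* no cached copy, so nothing can be stale */
        j->disk_reads++;        /* stands in for re-reading the block */
        reloaded++;
    }
    return reloaded;
}
```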
     
  • Obviously, memset() has zeroed the whole struct locking_max_version.
    So there is no need to zero its two fields individually.

    Link: http://lkml.kernel.org/r/1463970605-18354-1-git-send-email-zren@suse.com
    Signed-off-by: Eric Ren
    Reviewed-by: Joseph Qi
    Reviewed-by: Gang He
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Ren
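
A small sketch of why the per-field assignments are redundant. The structure here is a hypothetical two-field stand-in; its name and field names are assumptions, not the ocfs2 definition.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical two-field version struct used only for illustration. */
struct demo_version {
    unsigned char major;
    unsigned char minor;
};

void demo_reset(struct demo_version *v)
{
    assert(v != NULL);
    memset(v, 0, sizeof(*v));   /* zeroes every byte, including both fields */
    /* Assigning v->major = 0 and v->minor = 0 here would add nothing. */
}
```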
     
  • Remove the unused wrappers dax_fault() and dax_pmd_fault(). After this
    removal, rename __dax_fault() and __dax_pmd_fault() to dax_fault() and
    dax_pmd_fault() respectively, and update all callers.

    The dax_fault() and dax_pmd_fault() wrappers were initially intended to
    capture some filesystem independent functionality around page faults
    (calling sb_start_pagefault() & sb_end_pagefault(), updating file mtime
    and ctime).

    However, the following commits:

    5726b27b09cc ("ext2: Add locking for DAX faults")
    ea3d7209ca01 ("ext4: fix races between page faults and hole punching")

    added locking to the ext2 and ext4 filesystems after these common
    operations but before __dax_fault() and __dax_pmd_fault() were called.
    This means that these wrappers are no longer used, and are unlikely to
    be used in the future.

    XFS has had locking analogous to what was recently added to ext2 and
    ext4 since DAX support was initially introduced by:

    6b698edeeef0 ("xfs: add DAX file operations support")

    Link: http://lkml.kernel.org/r/20160714214049.20075-2-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Dan Williams
    Cc: Dave Chinner
    Reviewed-by: Jan Kara
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • Pull block driver updates from Jens Axboe:
    "This branch also contains core changes. I've come to the conclusion
    that from 4.9 and forward, I'll be doing just a single branch. We
    often have dependencies between core and drivers, and it's hard to
    always split them up appropriately without pulling core into drivers
    when that happens.

    That said, this contains:

    - separate secure erase type for the core block layer, from
    Christoph.

    - set of discard fixes, from Christoph.

    - bio shrinking fixes from Christoph, as a followup to the
    op/flags change in the core branch.

    - map and append request fixes from Christoph.

    - NVMeF (NVMe over Fabrics) code from Christoph. This is pretty
    exciting!

    - nvme-loop fixes from Arnd.

    - removal of ->driverfs_dev from Dan, after providing a
    device_add_disk() helper.

    - bcache fixes from Bhaktipriya and Yijing.

    - cdrom subchannel read fix from Vchannaiah.

    - set of lightnvm updates from Wenwei, Matias, Johannes, and Javier.

    - set of drbd updates and fixes from Fabian, Lars, and Philipp.

    - mg_disk error path fix from Bart.

    - user notification for failed device add for loop, from Minfei.

    - NVMe in general:
    + NVMe delay quirk from Guilherme.
    + SR-IOV support and command retry limits from Keith.
    + fix for memory-less NUMA node from Masayoshi.
    + use UINT_MAX for discard sectors, from Minfei.
    + cancel IO fixes from Ming.
    + don't allocate unused major, from Neil.
    + error code fixup from Dan.
    + use constants for PSDT/FUSE from James.
    + variable init fix from Jay.
    + fabrics fixes from Ming, Sagi, and Wei.
    + various fixes"

    * 'for-4.8/drivers' of git://git.kernel.dk/linux-block: (115 commits)
    nvme/pci: Provide SR-IOV support
    nvme: initialize variable before logical OR'ing it
    block: unexport various bio mapping helpers
    scsi/osd: open code blk_make_request
    target: stop using blk_make_request
    block: simplify and export blk_rq_append_bio
    block: ensure bios return from blk_get_request are properly initialized
    virtio_blk: use blk_rq_map_kern
    memstick: don't allow REQ_TYPE_BLOCK_PC requests
    block: shrink bio size again
    block: simplify and cleanup bvec pool handling
    block: get rid of bio_rw and READA
    block: don't ignore -EOPNOTSUPP blkdev_issue_write_same
    block: introduce BLKDEV_DISCARD_ZERO to fix zeroout
    NVMe: don't allocate unused nvme_major
    nvme: avoid crashes when node 0 is memoryless node.
    nvme: Limit command retries
    loop: Make user notify for adding loop device failed
    nvme-loop: fix nvme-loop Kconfig dependencies
    nvmet: fix return value check in nvmet_subsys_alloc()
    ...

    Linus Torvalds
     
  • Pull core block updates from Jens Axboe:

    - the big change is the cleanup from Mike Christie, cleaning up our
    uses of command types and modified flags. This is what will throw
    some merge conflicts

    - regression fix for the above for btrfs, from Vincent

    - following up to the above, better packing of struct request from
    Christoph

    - a 2038 fix for blktrace from Arnd

    - a few trivial/spelling fixes from Bart Van Assche

    - a front merge check fix from Damien, which could cause issues on
    SMR drives

    - Atari partition fix from Gabriel

    - convert cfq to highres timers, since jiffies isn't granular enough
    for some devices these days. From Jan and Jeff

    - CFQ priority boost fix for idle classes, from me

    - cleanup series from Ming, improving our bio/bvec iteration

    - a direct issue fix for blk-mq from Omar

    - fix for plug merging not involving the IO scheduler, like we do for
    other types of merges. From Tahsin

    - expose DAX type internally and through sysfs. From Toshi and Yigal

    * 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits)
    block: Fix front merge check
    block: do not merge requests without consulting with io scheduler
    block: Fix spelling in a source code comment
    block: expose QUEUE_FLAG_DAX in sysfs
    block: add QUEUE_FLAG_DAX for devices to advertise their DAX support
    Btrfs: fix comparison in __btrfs_map_block()
    block: atari: Return early for unsupported sector size
    Doc: block: Fix a typo in queue-sysfs.txt
    cfq-iosched: Charge at least 1 jiffie instead of 1 ns
    cfq-iosched: Fix regression in bonnie++ rewrite performance
    cfq-iosched: Convert slice_resid from u64 to s64
    block: Convert fifo_time from ulong to u64
    blktrace: avoid using timespec
    block/blk-cgroup.c: Declare local symbols static
    block/bio-integrity.c: Add #include "blk.h"
    block/partition-generic.c: Remove a set-but-not-used variable
    block: bio: kill BIO_MAX_SIZE
    cfq-iosched: temporarily boost queue priority for idle classes
    block: drbd: avoid to use BIO_MAX_SIZE
    block: bio: remove BIO_MAX_SECTORS
    ...

    Linus Torvalds
     

26 Jul, 2016

4 commits

  • Pull timer updates from Thomas Gleixner:
    "This update provides the following changes:

    - The rework of the timer wheel which addresses the shortcomings of
    the current wheel (cascading, slow search for next expiring timer,
    etc). That's the first major change of the wheel in almost 20
    years since Finn implemented it.

    - A large overhaul of the clocksource drivers init functions to
    consolidate the Device Tree initialization

    - Some more Y2038 updates

    - A capability fix for timerfd

    - Yet another clock chip driver

    - The usual pile of updates, comment improvements all over the place"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (130 commits)
    tick/nohz: Optimize nohz idle enter
    clockevents: Make clockevents_subsys static
    clocksource/drivers/time-armada-370-xp: Fix return value check
    timers: Implement optimization for same expiry time in mod_timer()
    timers: Split out index calculation
    timers: Only wake softirq if necessary
    timers: Forward the wheel clock whenever possible
    timers/nohz: Remove pointless tick_nohz_kick_tick() function
    timers: Optimize collect_expired_timers() for NOHZ
    timers: Move __run_timers() function
    timers: Remove set_timer_slack() leftovers
    timers: Switch to a non-cascading wheel
    timers: Reduce the CPU index space to 256k
    timers: Give a few structs and members proper names
    hlist: Add hlist_is_singular_node() helper
    signals: Use hrtimer for sigtimedwait()
    timers: Remove the deprecated mod_timer_pinned() API
    timers, net/ipv4/inet: Initialize connection request timers as pinned
    timers, drivers/tty/mips_ejtag: Initialize the poll timer as pinned
    timers, drivers/tty/metag_da: Initialize the poll timer as pinned
    ...

    Linus Torvalds
     
  • Pull x86 mm updates from Ingo Molnar:
    "Various x86 low level modifications:

    - preparatory work to support virtually mapped kernel stacks (Andy
    Lutomirski)

    - support for 64-bit __get_user() on 32-bit kernels (Benjamin
    LaHaise)

    - (involved) workaround for Knights Landing CPU erratum (Dave Hansen)

    - MPX enhancements (Dave Hansen)

    - mremap() extension to allow remapping of the special VDSO vma, for
    purposes of user level context save/restore (Dmitry Safonov)

    - hweight and entry code cleanups (Borislav Petkov)

    - bitops code generation optimizations and cleanups with modern GCC
    (H. Peter Anvin)

    - syscall entry code optimizations (Paolo Bonzini)"

    * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
    x86/mm/cpa: Add missing comment in populate_pdg()
    x86/mm/cpa: Fix populate_pgd(): Stop trying to deallocate failed PUDs
    x86/syscalls: Add compat_sys_preadv64v2/compat_sys_pwritev64v2
    x86/smp: Remove unnecessary initialization of thread_info::cpu
    x86/smp: Remove stack_smp_processor_id()
    x86/uaccess: Move thread_info::addr_limit to thread_struct
    x86/dumpstack: Rename thread_struct::sig_on_uaccess_error to sig_on_uaccess_err
    x86/uaccess: Move thread_info::uaccess_err and thread_info::sig_on_uaccess_err to thread_struct
    x86/dumpstack: When OOPSing, rewind the stack before do_exit()
    x86/mm/64: In vmalloc_fault(), use CR3 instead of current->active_mm
    x86/dumpstack/64: Handle faults when printing the "Stack: " part of an OOPS
    x86/dumpstack: Try harder to get a call trace on stack overflow
    x86/mm: Remove kernel_unmap_pages_in_pgd() and efi_cleanup_page_tables()
    x86/mm/cpa: In populate_pgd(), don't set the PGD entry until it's populated
    x86/mm/hotplug: Don't remove PGD entries in remove_pagetable()
    x86/mm: Use pte_none() to test for empty PTE
    x86/mm: Disallow running with 32-bit PTEs to work around erratum
    x86/mm: Ignore A/D bits in pte/pmd/pud_none()
    x86/mm: Move swap offset/type up in PTE to work around erratum
    x86/entry: Inline enter_from_user_mode()
    ...

    Linus Torvalds
     
  • Linux 4.7

    Kees Cook
     
  • This patch includes minor clean-ups.

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     

25 Jul, 2016

2 commits

  • Pull char/misc driver updates from Greg KH:
    "Here is the big char/misc driver update for 4.8-rc1.

    Not a lot of stuff, but it's all over the place, full details are in
    the shortlog. All of these have been in linux-next with no reported
    issues for a while"

    * tag 'char-misc-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (49 commits)
    lkdtm: silence warnings about function declarations
    lkdtm: hide unused functions
    intel_th: pci: Add Kaby Lake PCH-H support
    intel_th: Fix a deadlock in modprobing
    dsp56k: prevent a harmless underflow
    chardev: add missing line break in pr_warn
    lkdtm: use struct arrays instead of enums
    lkdtm: move jprobe entry points to start of source
    lkdtm: reorganize module paramaters
    lkdtm: rename globals for clarity
    lkdtm: rename "count" to "crash_count"
    lkdtm: remove intentional off-by-one array access
    lkdtm: split remaining logic bug tests to separate file
    lkdtm: split heap corruption tests to separate file
    lkdtm: split memory permissions tests to separate file
    lkdtm: split usercopy tests to separate file
    lkdtm: drop "alloc_size" parameter
    lkdtm: add usercopy test for blocking kernel text
    extcon: adc-jack: add suspend/resume support
    extcon: add missing of_node_put after calling of_parse_phandle
    ...

    Linus Torvalds
     
  • Pull gfs2 updates from Bob Peterson:
    "We've got ten patches this time, half of which are related to a
    plethora of nasty outcomes when inodes are transitioned from the
    unlinked state to the free state. Small file systems are particularly
    vulnerable to these problems, and it can manifest as mainly hangs, but
    also file system corruption. The patches have been tested for
    literally many weeks, with a very gruelling test, so I have a high
    level of confidence.

    - Andreas Gruenbacher wrote a series of five patches for various
    lockups during the transition of inodes from unlinked to free.

    The main patch is titled "Fix gfs2_lookup_by_inum lock inversion"
    and the other four are support and cleanup patches related to that.

    - Ben Marzinski contributed two patches with regard to a recreatable
    problem when gfs2 tries to write a page to a file that is being
    truncated, resulting in a BUG() in gfs2_remove_from_journal.

    Note that Ben had to export vfs function __block_write_full_page to
    get this to work properly. It's been posted a long time and he
    talked to various VFS people about it, and nobody seemed to mind.

    - I contributed 3 patches:
    o The first one fixes a memory corruptor: a race in which one
    process can overwrite the gl_object pointer set by another
    process, causing kernel panic and other symptoms.
    o The second patch fixes another race that resulted in a
    false-positive BUG_ON. This occurred when resource group
    reservations were freed by one process while another process
    was trying to grab a new reservation in the same resource
    group.
    o The third patch fixes a problem with doing journal replay when
    the journals are not all the same size"

    * tag 'gfs2-4.7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
    GFS2: Fix gfs2_replay_incr_blk for multiple journal sizes
    GFS2: Check rs_free with rd_rsspin protection
    gfs2: writeout truncated pages
    fs: export __block_write_full_page
    gfs2: Lock holder cleanup
    gfs2: Large-filesystem fix for 32-bit systems
    gfs2: Get rid of gfs2_ilookup
    gfs2: Fix gfs2_lookup_by_inum lock inversion
    gfs2: Initialize iopen glock holder for new inodes
    GFS2: don't set rgrp gl_object until it's inserted into rgrp tree

    Linus Torvalds
     

24 Jul, 2016

1 commit


23 Jul, 2016

2 commits

  • Pull overlayfs fixes from Miklos Szeredi:
    "This contains a fix for a potential crash/corruption issue and another
    where the suid/sgid bits weren't cleared on write"

    * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
    ovl: verify upper dentry in ovl_remove_and_whiteout()
    ovl: Copy up underlying inode's ->i_mode to overlay inode
    ovl: handle ATTR_KILL*

    Linus Torvalds
     
  • A previously selected segment may become free after write_checkpoint.
    If we garbage collect this segment and new_curseg then happens to
    reuse it, it may trigger f2fs_bug_on as below.

    panic+0x154/0x29c
    do_garbage_collect+0x15c/0xaf4
    f2fs_gc+0x2dc/0x444
    f2fs_balance_fs.part.22+0xcc/0x14c
    f2fs_balance_fs+0x28/0x34
    f2fs_map_blocks+0x5ec/0x790
    f2fs_preallocate_blocks+0xe0/0x100
    f2fs_file_write_iter+0x64/0x11c
    new_sync_write+0xac/0x11c
    vfs_write+0x144/0x1e4
    SyS_write+0x60/0xc0

    Here, maybe we check sit and ssa type during reset_curseg. So, we check
    segment is stale or not, and select a new victim to avoid this.

    Signed-off-by: Yunlei He
    Signed-off-by: Jaegeuk Kim

    Yunlei He
     

22 Jul, 2016

6 commits

  • The upper dentry may become stale before we call ovl_lock_rename_workdir.
    For example, someone could (mistakenly or maliciously) manually unlink(2)
    it directly from upperdir.

    To ensure it is not stale, let's look it up after ovl_lock_rename_workdir
    and check if it matches the upper dentry.

    Essentially, it is the same problem and similar solution as in
    commit 11f3710417d0 ("ovl: verify upper dentry before unlink and rename").

    Signed-off-by: Maxim Patlasov
    Signed-off-by: Miklos Szeredi
    Cc:

    Maxim Patlasov
     
  • Dave Chinner
     
  • Been around for long enough now, hasn't caused any regression test
    failures in the past 3 months, so it's time to make it a fully
    supported feature.

    Signed-off-by: Dave Chinner
    Reviewed-by: Eric Sandeen
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • In xfs_finish_page_writeback(), we have a loop that looks like this:

    do {
            if (off < bvec->bv_offset)
                    goto next_bh;
            if (off > end)
                    break;
            bh->b_end_io(bh, !error);
    next_bh:
            off += bh->b_size;
    } while ((bh = bh->b_this_page) != head);

    The b_end_io function is end_buffer_async_write(), which will call
    end_page_writeback() once all the buffers have been marked as no longer
    under IO. The issue here is that the only thing currently
    protecting both the bufferhead chain and the page from being
    reclaimed is the PageWriteback state held on the page.

    While we attempt to limit the loop to just the buffers covered by
    the IO, we still read from the buffer size and follow the next
    pointer in the bufferhead chain. There is no guarantee that either
    of these are valid after the PageWriteback flag has been cleared.
    Hence, loops like this are completely unsafe, and result in
    use-after-free issues. One such problem was caught by Calvin Owens
    with KASAN:

    .....
    INFO: Freed in 0x103fc80ec age=18446651500051355200 cpu=2165122683 pid=-1
    free_buffer_head+0x41/0x90
    __slab_free+0x1ed/0x340
    kmem_cache_free+0x270/0x300
    free_buffer_head+0x41/0x90
    try_to_free_buffers+0x171/0x240
    xfs_vm_releasepage+0xcb/0x3b0
    try_to_release_page+0x106/0x190
    shrink_page_list+0x118e/0x1a10
    shrink_inactive_list+0x42c/0xdf0
    shrink_zone_memcg+0xa09/0xfa0
    shrink_zone+0x2c3/0xbc0
    .....
    Call Trace:
    [] dump_stack+0x68/0x94
    [] print_trailer+0x115/0x1a0
    [] object_err+0x34/0x40
    [] kasan_report_error+0x217/0x530
    [] __asan_report_load8_noabort+0x43/0x50
    [] xfs_destroy_ioend+0x3bf/0x4c0
    [] xfs_end_bio+0x154/0x220
    [] bio_endio+0x158/0x1b0
    [] blk_update_request+0x18b/0xb80
    [] scsi_end_request+0x97/0x5a0
    [] scsi_io_completion+0x438/0x1690
    [] scsi_finish_command+0x375/0x4e0
    [] scsi_softirq_done+0x280/0x340

    Where the access is occurring during IO completion after the buffer
    had been freed from direct memory reclaim.

    Prevent use-after-free accidents in this end_io processing loop by
    pre-calculating the loop conditionals before calling bh->b_end_io().
    The loop is already limited to just the bufferheads covered by the
    IO in progress, so the offset checks are sufficient to prevent
    accessing buffers in the chain after end_page_writeback() has been
    called by the bh->b_end_io() callout.
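
    The idea of pre-calculating the loop conditionals can be sketched in
    plain userspace C. This is not the XFS code: struct buf, struct
    page_state, end_io(), build_chain() and complete_range() are
    hypothetical stand-ins for the bufferhead chain, the page writeback
    state and the b_end_io callout. The sketch assumes the last completion
    frees the whole chain, which models reclaim after end_page_writeback().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct buf;

/* Hypothetical stand-in for the page: tracks buffers still under I/O. */
struct page_state {
    int pending;        /* buffers still under I/O */
    struct buf *chain;  /* backing storage, freed when pending hits zero */
    int freed;          /* set once the chain has been released */
};

/* Hypothetical stand-in for a bufferhead on a circular per-page chain. */
struct buf {
    struct buf *next;   /* plays the role of b_this_page */
    size_t size;        /* plays the role of b_size */
    struct page_state *page;
};

/* Completion callout: the last completion frees every buffer in the chain. */
static void end_io(struct buf *b)
{
    struct page_state *pg = b->page;

    assert(pg->pending > 0);
    if (--pg->pending == 0) {
        free(pg->chain);    /* models reclaim once writeback has ended */
        pg->chain = NULL;
        pg->freed = 1;
    }
}

/*
 * Complete every buffer whose offset lies in [start, end]. Everything the
 * loop needs from a buffer (next pointer, size, end-of-chain test) is read
 * before end_io() runs, so nothing is dereferenced after a possible free.
 */
static int complete_range(struct buf *head, size_t start, size_t end)
{
    struct buf *b = head, *next;
    size_t off = 0, bsize;
    int completed = 0, last;

    do {
        if (off > end)
            break;
        next = b->next;
        bsize = b->size;
        last = (next == head);
        if (off >= start) {
            end_io(b);      /* b may be freed from here on */
            completed++;
        }
        off += bsize;
        b = next;
    } while (!last);

    return completed;
}

/* Build a circular chain of n buffers of the given size, all under I/O. */
static struct buf *build_chain(struct page_state *pg, int n, size_t size)
{
    struct buf *arr = calloc((size_t)n, sizeof(*arr));

    if (!arr)
        return NULL;
    for (int i = 0; i < n; i++) {
        arr[i].next = &arr[(i + 1) % n];
        arr[i].size = size;
        arr[i].page = pg;
    }
    pg->pending = n;
    pg->chain = arr;
    pg->freed = 0;
    return arr;
}
```

    Calling complete_range() on a four-buffer chain with a range that
    covers all of it completes each buffer once, and the loop exits
    without touching the chain after the final callout has freed it.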

    Yet another example of why Bufferheads Must Die.

    cc: # 4.7
    Signed-off-by: Dave Chinner
    Reported-and-Tested-by: Calvin Owens
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • One of the problems we currently have with delayed logging is that
    under serious memory pressure we can deadlock memory reclaim. This
    occurs when memory reclaim (such as run by kswapd) is reclaiming XFS
    inodes and issues a log force to unpin inodes that are dirty in the
    CIL.

    The CIL is pushed, but this will only occur once it gets the CIL
    context lock to ensure that all committing transactions are complete
    and no new transactions start being committed to the CIL while the
    push switches to a new context.

    The deadlock occurs when the CIL context lock is held by a
    committing process that is doing memory allocation for log vector
    buffers, and that allocation is then blocked on memory reclaim
    making progress. Memory reclaim, however, is blocked waiting for
    a log force to make progress, and so we effectively deadlock at this
    point.

    To solve this problem, we have to move the CIL log vector buffer
    allocation outside of the context lock so that memory reclaim can
    always make progress when it needs to force the log. The problem
    with doing this is that a CIL push can take place while we are
    determining if we need to allocate a new log vector buffer for
    an item and hence the current log vector may go away without
    warning. That means we cannot rely on the existing log vector being
    present when we finally grab the context lock and so we must have a
    replacement buffer ready to go at all times.

    To ensure this, introduce a "shadow log vector" buffer that is
    always guaranteed to be present when we gain the CIL context lock
    and format the item. This shadow buffer may or may not be used
    during the formatting, but if the log item does not have an existing
    log vector buffer or that buffer is too small for the new
    modifications, we swap it for the new shadow buffer and format
    the modifications into that new log vector buffer.
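
    The allocate-outside-the-lock, swap-inside-the-lock pattern can be
    sketched as a single-threaded userspace model. This is not the XFS
    implementation: struct log_vec, struct log_item, prepare_shadow() and
    format_item() are hypothetical names, and the CIL context lock is
    modelled by a plain flag (ctx_locked) so the allocator can assert that
    it is never called while the lock is held. Freeing a too-small active
    buffer on swap is also an assumption of this model.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical log vector buffer with inline storage. */
struct log_vec {
    size_t cap;     /* bytes available in data[] */
    size_t len;     /* bytes currently formatted */
    char data[];
};

/* Hypothetical log item carrying an active and a shadow buffer. */
struct log_item {
    struct log_vec *lv;      /* active buffer, may be absent or too small */
    struct log_vec *shadow;  /* spare buffer, allocated outside the lock */
};

/* Models the CIL context lock in this single-threaded sketch. */
static int ctx_locked;

static struct log_vec *lv_alloc(size_t cap)
{
    struct log_vec *lv;

    assert(!ctx_locked);    /* the point: never allocate under the lock */
    lv = malloc(sizeof(*lv) + cap);
    if (lv) {
        lv->cap = cap;
        lv->len = 0;
    }
    return lv;
}

/* Step 1, lock NOT held: make sure a large enough shadow buffer exists. */
static int prepare_shadow(struct log_item *it, size_t need)
{
    struct log_vec *lv;

    if (it->shadow && it->shadow->cap >= need)
        return 0;
    lv = lv_alloc(need);
    if (!lv)
        return -1;
    free(it->shadow);
    it->shadow = lv;
    return 0;
}

/*
 * Step 2, lock held: format the item. If the active buffer is missing or
 * too small, swap in the shadow buffer instead of allocating.
 */
static void format_item(struct log_item *it, const char *src, size_t need)
{
    ctx_locked = 1;
    if (!it->lv || it->lv->cap < need) {
        free(it->lv);           /* assumption: old buffer is dropped */
        it->lv = it->shadow;
        it->shadow = NULL;
    }
    assert(it->lv && it->lv->cap >= need);
    memcpy(it->lv->data, src, need);
    it->lv->len = need;
    ctx_locked = 0;
}
```

    In this model a first commit consumes the shadow buffer, while a second
    commit of the same size reuses the active buffer and leaves a freshly
    allocated shadow buffer unused until it is freed or swapped in later.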

    The result of this is that for any object we modify more than once
    in a given CIL checkpoint, we double the memory required
    to track dirty regions in the log. For single modifications then
    we consume the shadow log vector we allocate on commit, and that gets
    consumed by the checkpoint. However, if we make multiple
    modifications, then the second transaction commit will allocate a
    shadow log vector and hence we will end up with double the memory
    usage as only one of the log vectors is consumed by the CIL
    checkpoint. The remaining shadow vector will be freed when the log
    item is freed.

    This can probably be optimised in future - access to the shadow log
    vector is serialised by the object lock (as opposed to the active
    log vector, which is controlled by the CIL context lock) and so we
    can probably free shadow log vector from some objects when the log
    item is marked clean on removal from the AIL.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • xfsprogs source commit 4280e59dcbc4cd8e01585efe788a68eb378048e8

    xfs_da3_split() has to handle all three versions of the
    directory/attribute btree structure. The attr tree is v1, the dir
    tree is v2 or v3. The main difference between the v1 and v2/3 trees
    is the way tree nodes are split - in the v1 tree we can require a
    double split to occur because the object to be inserted may be
    larger than the space made by splitting a leaf. In this case we need
    to do a double split - one to split the full leaf, then another to
    allocate an empty leaf block in the correct location for the new
    entry. This does not happen with dir (v2/v3) formats as the objects
    being inserted are always guaranteed to fit into the new space in
    the split blocks.

    Indeed, for directories there *may* be an extra block on this buffer
    pointer. However, it's guaranteed not to be a leaf block (i.e. a
    directory data block) - the directory code only ever places hash
    index or free space blocks in this pointer (as a cursor of
    sorts), and so to use it as a directory data block will immediately
    corrupt the directory.

    The problem is that the code assumes that there may be extra blocks
    that we need to link into the tree once we've split the root, but
    this is not true for either dir or attr trees, because the extra
    attr block is always consumed by the last node split before we split
    the root. Hence the linking in an extra block is always wrong at the
    root split level, and this manifests itself in repair as a directory
    corruption in a repaired directory, leaving the directory rebuild
    incomplete.

    This is a dir v2 zero-day bug - it was in the initial dir v2 commit
    that was made back in February 1998.

    Fix this by ensuring the linking of the blocks after the root split
    never tries to make use of the extra blocks that may be held in the
    cursor. They are held there for other purposes and should never be
    touched by the root splitting code.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner