15 Oct, 2016

1 commit

  • Both import_iovec() and rw_copy_check_uvector() take an array
    (typically small and on-stack) which is used to hold an iovec array copy
    from userspace. This is to avoid an expensive memory allocation in the
    fast path (i.e. few iovec elements).

    The caller may have to check whether these functions actually used
    the provided buffer or allocated a new one -- but this differs between
    the too. Let's just add a kernel doc to clarify what the semantics are
    for each function.

    Signed-off-by: Vegard Nossum
    Signed-off-by: Al Viro

    Vegard Nossum
     

12 Oct, 2016

1 commit


11 Oct, 2016

1 commit

  • Pull misc vfs updates from Al Viro:
    "Assorted misc bits and pieces.

    There are several single-topic branches left after this (rename2
    series from Miklos, current_time series from Deepa Dinamani, xattr
    series from Andreas, uaccess stuff from from me) and I'd prefer to
    send those separately"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits)
    proc: switch auxv to use of __mem_open()
    hpfs: support FIEMAP
    cifs: get rid of unused arguments of CIFSSMBWrite()
    posix_acl: uapi header split
    posix_acl: xattr representation cleanups
    fs/aio.c: eliminate redundant loads in put_aio_ring_file
    fs/internal.h: add const to ns_dentry_operations declaration
    compat: remove compat_printk()
    fs/buffer.c: make __getblk_slow() static
    proc: unsigned file descriptors
    fs/file: more unsigned file descriptors
    fs: compat: remove redundant check of nr_segs
    cachefiles: Fix attempt to read i_blocks after deleting file [ver #2]
    cifs: don't use memcpy() to copy struct iov_iter
    get rid of separate multipage fault-in primitives
    fs: Avoid premature clearing of capabilities
    fs: Give dentry to inode_change_ok() instead of inode
    fuse: Propagate dentry down to inode_change_ok()
    ceph: Propagate dentry down to inode_change_ok()
    xfs: Propagate dentry down to inode_change_ok()
    ...

    Linus Torvalds
     

08 Oct, 2016

9 commits

  • Merge updates from Andrew Morton:

    - fsnotify updates

    - ocfs2 updates

    - all of MM

    * emailed patches from Andrew Morton : (127 commits)
    console: don't prefer first registered if DT specifies stdout-path
    cred: simpler, 1D supplementary groups
    CREDITS: update Pavel's information, add GPG key, remove snail mail address
    mailmap: add Johan Hovold
    .gitattributes: set git diff driver for C source code files
    uprobes: remove function declarations from arch/{mips,s390}
    spelling.txt: "modeled" is spelt correctly
    nmi_backtrace: generate one-line reports for idle cpus
    arch/tile: adopt the new nmi_backtrace framework
    nmi_backtrace: do a local dump_stack() instead of a self-NMI
    nmi_backtrace: add more trigger_*_cpu_backtrace() methods
    min/max: remove sparse warnings when they're nested
    Documentation/filesystems/proc.txt: add more description for maps/smaps
    mm, proc: fix region lost in /proc/self/smaps
    proc: fix timerslack_ns CAP_SYS_NICE check when adjusting self
    proc: add LSM hook checks to /proc//timerslack_ns
    proc: relax /proc//timerslack_ns capability requirements
    meminfo: break apart a very long seq_printf with #ifdefs
    seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char
    proc: faster /proc/*/status
    ...

    Linus Torvalds
     
  • When doing an nmi backtrace of many cores, most of which are idle, the
    output is a little overwhelming and very uninformative. Suppress
    messages for cpus that are idling when they are interrupted and just
    emit one line, "NMI backtrace for N skipped: idling at pc 0xNNN".

    We do this by grouping all the cpuidle code together into a new
    .cpuidle.text section, and then checking the address of the interrupted
    PC to see if it lies within that section.

    This commit suitably tags x86 and tile idle routines, and only adds in
    the minimal framework for other architectures.

    Link: http://lkml.kernel.org/r/1472487169-14923-5-git-send-email-cmetcalf@mellanox.com
    Signed-off-by: Chris Metcalf
    Acked-by: Peter Zijlstra (Intel)
    Tested-by: Peter Zijlstra (Intel)
    Tested-by: Daniel Thompson [arm]
    Tested-by: Petr Mladek
    Cc: Aaron Tomlin
    Cc: Peter Zijlstra (Intel)
    Cc: "Rafael J. Wysocki"
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Metcalf
     
  • Currently on arm there is code that checks whether it should call
    dump_stack() explicitly, to avoid trying to raise an NMI when the
    current context is not preemptible by the backtrace IPI. Similarly, the
    forthcoming arch/tile support uses an IPI mechanism that does not
    support generating an NMI to self.

    Accordingly, move the code that guards this case into the generic
    mechanism, and invoke it unconditionally whenever we want a backtrace of
    the current cpu. It seems plausible that in all cases, dump_stack()
    will generate better information than generating a stack from the NMI
    handler. The register state will be missing, but that state is likely
    not particularly helpful in any case.

    Or, if we think it is helpful, we should be capturing and emitting the
    current register state in all cases when regs == NULL is passed to
    nmi_cpu_backtrace().

    Link: http://lkml.kernel.org/r/1472487169-14923-3-git-send-email-cmetcalf@mellanox.com
    Signed-off-by: Chris Metcalf
    Tested-by: Daniel Thompson [arm]
    Reviewed-by: Petr Mladek
    Acked-by: Aaron Tomlin
    Cc: "Rafael J. Wysocki"
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Metcalf
     
  • Patch series "improvements to the nmi_backtrace code" v9.

    This patch series modifies the trigger_xxx_backtrace() NMI-based remote
    backtracing code to make it more flexible, and makes a few small
    improvements along the way.

    The motivation comes from the task isolation code, where there are
    scenarios where we want to be able to diagnose a case where some cpu is
    about to interrupt a task-isolated cpu. It can be helpful to see both
    where the interrupting cpu is, and also an approximation of where the
    cpu that is being interrupted is. The nmi_backtrace framework allows us
    to discover the stack of the interrupted cpu.

    I've tested that the change works as desired on tile, and build-tested
    x86, arm, mips, and sparc64. For x86 I confirmed that the generic
    cpuidle stuff as well as the architecture-specific routines are in the
    new cpuidle section. For arm, mips, and sparc I just build-tested it
    and made sure the generic cpuidle routines were in the new cpuidle
    section, but I didn't attempt to figure out which the platform-specific
    idle routines might be. That might be more usefully done by someone
    with platform experience in follow-up patches.

    This patch (of 4):

    Currently you can only request a backtrace of either all cpus, or all
    cpus but yourself. It can also be helpful to request a remote backtrace
    of a single cpu, and since we want that, the logical extension is to
    support a cpumask as the underlying primitive.

    This change modifies the existing lib/nmi_backtrace.c code to take a
    cpumask as its basic primitive, and modifies the linux/nmi.h code to use
    the new "cpumask" method instead.

    The existing clients of nmi_backtrace (arm and x86) are converted to
    using the new cpumask approach in this change.

    The other users of the backtracing API (sparc64 and mips) are converted
    to use the cpumask approach rather than the all/allbutself approach.
    The mips code ignored the "include_self" boolean but with this change it
    will now also dump a local backtrace if requested.

    Link: http://lkml.kernel.org/r/1472487169-14923-2-git-send-email-cmetcalf@mellanox.com
    Signed-off-by: Chris Metcalf
    Tested-by: Daniel Thompson [arm]
    Reviewed-by: Aaron Tomlin
    Reviewed-by: Petr Mladek
    Cc: "Rafael J. Wysocki"
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Ralf Baechle
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Metcalf
     
  • This came to light when implementing native 64-bit atomics for ARCv2.

    The atomic64 self-test code uses CONFIG_ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE
    to check whether atomic64_dec_if_positive() is available. It seems it
    was needed when not every arch defined it. However as of current code
    the Kconfig option seems needless

    - for CONFIG_GENERIC_ATOMIC64 it is auto-enabled in lib/Kconfig and a
    generic definition of API is present lib/atomic64.c
    - arches with native 64-bit atomics select it in arch/*/Kconfig and
    define the API in their headers

    So I see no point in keeping the Kconfig option

    Compile tested for:
    - blackfin (CONFIG_GENERIC_ATOMIC64)
    - x86 (!CONFIG_GENERIC_ATOMIC64)
    - ia64

    Link: http://lkml.kernel.org/r/1473703083-8625-3-git-send-email-vgupta@synopsys.com
    Signed-off-by: Vineet Gupta
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Ralf Baechle
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: "David S. Miller"
    Cc: Chris Metcalf
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Vineet Gupta
    Cc: Zhaoxiu Zeng
    Cc: Linus Walleij
    Cc: Alexander Potapenko
    Cc: Andrey Ryabinin
    Cc: Herbert Xu
    Cc: Ming Lin
    Cc: Arnd Bergmann
    Cc: Geert Uytterhoeven
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Andi Kleen
    Cc: Boqun Feng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vineet Gupta
     
  • Pull VFS splice updates from Al Viro:
    "There's a bunch of branches this cycle, both mine and from other folks
    and I'd rather send pull requests separately.

    This one is the conversion of ->splice_read() to ITER_PIPE iov_iter
    (and introduction of such). Gets rid of a lot of code in fs/splice.c
    and elsewhere; there will be followups, but these are for the next
    cycle... Some pipe/splice-related cleanups from Miklos in the same
    branch as well"

    * 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    pipe: fix comment in pipe_buf_operations
    pipe: add pipe_buf_steal() helper
    pipe: add pipe_buf_confirm() helper
    pipe: add pipe_buf_release() helper
    pipe: add pipe_buf_get() helper
    relay: simplify relay_file_read()
    switch default_file_splice_read() to use of pipe-backed iov_iter
    switch generic_file_splice_read() to use of ->read_iter()
    new iov_iter flavour: pipe-backed
    fuse_dev_splice_read(): switch to add_to_pipe()
    skb_splice_bits(): get rid of callback
    new helper: add_to_pipe()
    splice: lift pipe_lock out of splice_to_pipe()
    splice: switch get_iovec_page_array() to iov_iter
    splice_to_pipe(): don't open-code wakeup_pipe_readers()
    consistent treatment of EFAULT on O_DIRECT read/write

    Linus Torvalds
     
  • Pull block layer updates from Jens Axboe:
    "This is the main pull request for block layer changes in 4.9.

    As mentioned at the last merge window, I've changed things up and now
    do just one branch for core block layer changes, and driver changes.
    This avoids dependencies between the two branches. Outside of this
    main pull request, there are two topical branches coming as well.

    This pull request contains:

    - A set of fixes, and a conversion to blk-mq, of nbd. From Josef.

    - Set of fixes and updates for lightnvm from Matias, Simon, and Arnd.
    Followup dependency fix from Geert.

    - General fixes from Bart, Baoyou, Guoqing, and Linus W.

    - CFQ async write starvation fix from Glauber.

    - Add supprot for delayed kick of the requeue list, from Mike.

    - Pull out the scalable bitmap code from blk-mq-tag.c and make it
    generally available under the name of sbitmap. Only blk-mq-tag uses
    it for now, but the blk-mq scheduling bits will use it as well.
    From Omar.

    - bdev thaw error progagation from Pierre.

    - Improve the blk polling statistics, and allow the user to clear
    them. From Stephen.

    - Set of minor cleanups from Christoph in block/blk-mq.

    - Set of cleanups and optimizations from me for block/blk-mq.

    - Various nvme/nvmet/nvmeof fixes from the various folks"

    * 'for-4.9/block' of git://git.kernel.dk/linux-block: (54 commits)
    fs/block_dev.c: return the right error in thaw_bdev()
    nvme: Pass pointers, not dma addresses, to nvme_get/set_features()
    nvme/scsi: Remove power management support
    nvmet: Make dsm number of ranges zero based
    nvmet: Use direct IO for writes
    admin-cmd: Added smart-log command support.
    nvme-fabrics: Add host_traddr options field to host infrastructure
    nvme-fabrics: revise host transport option descriptions
    nvme-fabrics: rework nvmf_get_address() for variable options
    nbd: use BLK_MQ_F_BLOCKING
    blkcg: Annotate blkg_hint correctly
    cfq: fix starvation of asynchronous writes
    blk-mq: add flag for drivers wanting blocking ->queue_rq()
    blk-mq: remove non-blocking pass in blk_mq_map_request
    blk-mq: get rid of manual run of queue with __blk_mq_run_hw_queue()
    block: export bio_free_pages to other modules
    lightnvm: propagate device_add() error code
    lightnvm: expose device geometry through sysfs
    lightnvm: control life of nvm_dev in driver
    blk-mq: register device instead of disk
    ...

    Linus Torvalds
     
  • Pull trivial updates from Jiri Kosina:
    "The usual rocket science from the trivial tree"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
    tracing/syscalls: fix multiline in error message text
    lib/Kconfig.debug: fix DEBUG_SECTION_MISMATCH description
    doc: vfs: fix fadvise() sycall name
    x86/entry: spell EBX register correctly in documentation
    securityfs: fix securityfs_create_dir comment
    irq: Fix typo in tracepoint.xml

    Linus Torvalds
     
  • Pull MD updates from Shaohua Li:
    "This update includes:

    - new AVX512 instruction based raid6 gen/recovery algorithm

    - a couple of md-cluster related bug fixes

    - fix a potential deadlock

    - set nonrotational bit for raid array with SSD

    - set correct max_hw_sectors for raid5/6, which hopefuly can improve
    performance a little bit

    - other minor fixes"

    * tag 'md/4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
    md: set rotational bit
    raid6/test/test.c: bug fix: Specify aligned(alignment) attributes to the char arrays
    raid5: handle register_shrinker failure
    raid5: fix to detect failure of register_shrinker
    md: fix a potential deadlock
    md/bitmap: fix wrong cleanup
    raid5: allow arbitrary max_hw_sectors
    lib/raid6: Add AVX512 optimized xor_syndrome functions
    lib/raid6/test/Makefile: Add avx512 gen_syndrome and recovery functions
    lib/raid6: Add AVX512 optimized recovery functions
    lib/raid6: Add AVX512 optimized gen_syndrome functions
    md-cluster: make resync lock also could be interruptted
    md-cluster: introduce dlm_lock_sync_interruptible to fix tasks hang
    md-cluster: convert the completion to wait queue
    md-cluster: protect md_find_rdev_nr_rcu with rcu lock
    md-cluster: clean related infos of cluster
    md: changes for MD_STILL_CLOSED flag
    md-cluster: remove some unnecessary dlm_unlock_sync
    md-cluster: use FORCEUNLOCK in lockres_free
    md-cluster: call md_kick_rdev_from_array once ack failed

    Linus Torvalds
     

07 Oct, 2016

1 commit

  • Pull dmaengine updates from Vinod Koul:
    "This is bit large pile of code which bring in some nice additions:

    - Error reporting: we have added a new mechanism for users of
    dmaenegine to register a callback_result which tells them the
    result of the dma transaction. Right now only one user (ntb) is
    using it.

    - As we discussed on KS mailing list and pointed out NO_IRQ has no
    place in kernel, this also remove NO_IRQ from dmaengine subsystem
    (both arm and ppc users)

    - Support for IOMMU slave transfers and its implementation for arm.

    - To get better build coverage, enable COMPILE_TEST for bunch of
    driver, and fix the warning and sparse complaints on these.

    - Apart from above, usual updates spread across drivers"

    * tag 'dmaengine-4.9-rc1' of git://git.infradead.org/users/vkoul/slave-dma: (169 commits)
    async_pq_val: fix DMA memory leak
    dmaengine: virt-dma: move function declarations
    dmaengine: omap-dma: Enable burst and data pack for SG
    DT: dmaengine: rcar-dmac: document R8A7743/5 support
    dmaengine: fsldma: Unmap region obtained by of_iomap
    dmaengine: jz4780: fix resource leaks on error exit return
    dma-debug: fix ia64 build, use PHYS_PFN
    dmaengine: coh901318: fix integer overflow when shifting more than 32 places
    dmaengine: edma: avoid uninitialized variable use
    dma-mapping: fix m32r build warning
    dma-mapping: fix ia64 build, use PHYS_PFN
    dmaengine: ti-dma-crossbar: enable COMPILE_TEST
    dmaengine: omap-dma: enable COMPILE_TEST
    dmaengine: edma: enable COMPILE_TEST
    dmaengine: ti-dma-crossbar: Fix of_device_id data parameter usage
    dmaengine: ti-dma-crossbar: Correct type for of_find_property() third parameter
    dmaengine/ARM: omap-dma: Fix the DMAengine compile test on non OMAP configs
    dmaengine: edma: Rename set_bits and remove unused clear_bits helper
    dmaengine: edma: Use correct type for of_find_property() third parameter
    dmaengine: edma: Fix of_device_id data parameter usage (legacy vs TPCC)
    ...

    Linus Torvalds
     

06 Oct, 2016

4 commits

  • Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Miklos Szeredi
     
  • iov_iter variant for passing data into pipe. copy_to_iter()
    copies data into page(s) it has allocated and stuffs them into
    the pipe; copy_page_to_iter() stuffs there a reference to the
    page given to it. Both will try to coalesce if possible.
    iov_iter_zero() is similar to copy_to_iter(); iov_iter_get_pages()
    and friends will do as copy_to_iter() would have and return the
    pages where the data would've been copied. iov_iter_advance()
    will truncate everything past the spot it has advanced to.

    New primitive: iov_iter_pipe(), used for initializing those.
    pipe should be locked all along.

    Running out of space acts as fault would for iovec-backed ones;
    in other words, giving it to ->read_iter() may result in short
    read if the pipe overflows, or -EFAULT if it happens with nothing
    copied there.

    In other words, ->read_iter() on those acts pretty much like
    ->splice_read(). Moreover, all generic_file_splice_read() users,
    as well as many other ->splice_read() instances can be switched
    to that scheme - that'll happen in the next commit.

    Signed-off-by: Al Viro

    Al Viro
     
  • Pull networking updates from David Miller:

    1) BBR TCP congestion control, from Neal Cardwell, Yuchung Cheng and
    co. at Google. https://lwn.net/Articles/701165/

    2) Do TCP Small Queues for retransmits, from Eric Dumazet.

    3) Support collect_md mode for all IPV4 and IPV6 tunnels, from Alexei
    Starovoitov.

    4) Allow cls_flower to classify packets in ip tunnels, from Amir Vadai.

    5) Support DSA tagging in older mv88e6xxx switches, from Andrew Lunn.

    6) Support GMAC protocol in iwlwifi mwm, from Ayala Beker.

    7) Support ndo_poll_controller in mlx5, from Calvin Owens.

    8) Move VRF processing to an output hook and allow l3mdev to be
    loopback, from David Ahern.

    9) Support SOCK_DESTROY for UDP sockets. Also from David Ahern.

    10) Congestion control in RXRPC, from David Howells.

    11) Support geneve RX offload in ixgbe, from Emil Tantilov.

    12) When hitting pressure for new incoming TCP data SKBs, perform a
    partial rathern than a full purge of the OFO queue (which could be
    huge). From Eric Dumazet.

    13) Convert XFRM state and policy lookups to RCU, from Florian Westphal.

    14) Support RX network flow classification to igb, from Gangfeng Huang.

    15) Hardware offloading of eBPF in nfp driver, from Jakub Kicinski.

    16) New skbmod packet action, from Jamal Hadi Salim.

    17) Remove some inefficiencies in snmp proc output, from Jia He.

    18) Add FIB notifications to properly propagate route changes to
    hardware which is doing forwarding offloading. From Jiri Pirko.

    19) New dsa driver for qca8xxx chips, from John Crispin.

    20) Implement RFC7559 ipv6 router solicitation backoff, from Maciej
    Żenczykowski.

    21) Add L3 mode to ipvlan, from Mahesh Bandewar.

    22) Support 802.1ad in mlx4, from Moshe Shemesh.

    23) Support hardware LRO in mediatek driver, from Nelson Chang.

    24) Add TC offloading to mlx5, from Or Gerlitz.

    25) Convert various drivers to ethtool ksettings interfaces, from
    Philippe Reynes.

    26) TX max rate limiting for cxgb4, from Rahul Lakkireddy.

    27) NAPI support for ath10k, from Rajkumar Manoharan.

    28) Support XDP in mlx5, from Rana Shahout and Saeed Mahameed.

    29) UDP replicast support in TIPC, from Richard Alpe.

    30) Per-queue statistics for qed driver, from Sudarsana Reddy Kalluru.

    31) Support BQL in thunderx driver, from Sunil Goutham.

    32) TSO support in alx driver, from Tobias Regnery.

    33) Add stream parser engine and use it in kcm.

    34) Support async DHCP replies in ipconfig module, from Uwe
    Kleine-König.

    35) DSA port fast aging for mv88e6xxx driver, from Vivien Didelot.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1715 commits)
    mlxsw: switchx2: Fix misuse of hard_header_len
    mlxsw: spectrum: Fix misuse of hard_header_len
    net/faraday: Stop NCSI device on shutdown
    net/ncsi: Introduce ncsi_stop_dev()
    net/ncsi: Rework the channel monitoring
    net/ncsi: Allow to extend NCSI request properties
    net/ncsi: Rework request index allocation
    net/ncsi: Don't probe on the reserved channel ID (0x1f)
    net/ncsi: Introduce NCSI_RESERVED_CHANNEL
    net/ncsi: Avoid unused-value build warning from ia64-linux-gcc
    net: Add netdev all_adj_list refcnt propagation to fix panic
    net: phy: Add Edge-rate driver for Microsemi PHYs.
    vmxnet3: Wake queue from reset work
    i40e: avoid NULL pointer dereference and recursive errors on early PCI error
    qed: Add RoCE ll2 & GSI support
    qed: Add support for memory registeration verbs
    qed: Add support for QP verbs
    qed: PD,PKEY and CQ verb support
    qed: Add support for RoCE hw init
    qede: Add qedr framework
    ...

    Linus Torvalds
     
  • When the underflow checks were added to workingset_node_shadow_dec(),
    they triggered immediately:

    kernel BUG at ./include/linux/swap.h:276!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: isofs usb_storage fuse xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_REJECT nf_reject_ipv6
    soundcore wmi acpi_als pinctrl_sunrisepoint kfifo_buf tpm_tis industrialio acpi_pad pinctrl_intel tpm_tis_core tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_crypt
    CPU: 0 PID: 20929 Comm: blkid Not tainted 4.8.0-rc8-00087-gbe67d60ba944 #1
    Hardware name: System manufacturer System Product Name/Z170-K, BIOS 1803 05/06/2016
    task: ffff8faa93ecd940 task.stack: ffff8faa7f478000
    RIP: page_cache_tree_insert+0xf1/0x100
    Call Trace:
    __add_to_page_cache_locked+0x12e/0x270
    add_to_page_cache_lru+0x4e/0xe0
    mpage_readpages+0x112/0x1d0
    blkdev_readpages+0x1d/0x20
    __do_page_cache_readahead+0x1ad/0x290
    force_page_cache_readahead+0xaa/0x100
    page_cache_sync_readahead+0x3f/0x50
    generic_file_read_iter+0x5af/0x740
    blkdev_read_iter+0x35/0x40
    __vfs_read+0xe1/0x130
    vfs_read+0x96/0x130
    SyS_read+0x55/0xc0
    entry_SYSCALL_64_fastpath+0x13/0x8f
    Code: 03 00 48 8b 5d d8 65 48 33 1c 25 28 00 00 00 44 89 e8 75 19 48 83 c4 18 5b 41 5c 41 5d 41 5e 5d c3 0f 0b 41 bd ef ff ff ff eb d7 0b e8 88 68 ef ff 0f 1f 84 00
    RIP page_cache_tree_insert+0xf1/0x100

    This is a long-standing bug in the way shadow entries are accounted in
    the radix tree nodes. The shrinker needs to know when radix tree nodes
    contain only shadow entries, no pages, so node->count is split in half
    to count shadows in the upper bits and pages in the lower bits.

    Unfortunately, the radix tree implementation doesn't know of this and
    assumes all entries are in node->count. When there is a shadow entry
    directly in root->rnode and the tree is later extended, the radix tree
    implementation will copy that entry into the new node and and bump its
    node->count, i.e. increases the page count bits. Once the shadow gets
    removed and we subtract from the upper counter, node->count underflows
    and triggers the warning. Afterwards, without node->count reaching 0
    again, the radix tree node is leaked.

    Limit shadow entries to when we have actual radix tree nodes and can
    count them properly. That means we lose the ability to detect refaults
    from files that had only the first page faulted in at eviction time.

    Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
    Signed-off-by: Johannes Weiner
    Reported-and-tested-by: Linus Torvalds
    Reviewed-by: Jan Kara
    Cc: Andrew Morton
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

05 Oct, 2016

1 commit

  • Pull s390 updates from Martin Schwidefsky:
    "The new features and main improvements in this merge for v4.9

    - Support for the UBSAN sanitizer

    - Set HAVE_EFFICIENT_UNALIGNED_ACCESS, it improves the code in some
    places

    - Improvements for the in-kernel fpu code, in particular the overhead
    for multiple consecutive in kernel fpu users is recuded

    - Add a SIMD implementation for the RAID6 gen and xor operations

    - Add RAID6 recovery based on the XC instruction

    - The PCI DMA flush logic has been improved to increase the speed of
    the map / unmap operations

    - The time synchronization code has seen some updates

    And bug fixes all over the place"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (48 commits)
    s390/con3270: fix insufficient space padding
    s390/con3270: fix use of uninitialised data
    MAINTAINERS: update DASD maintainer
    s390/cio: fix accidental interrupt enabling during resume
    s390/dasd: add missing \n to end of dev_err messages
    s390/config: Enable config options for Docker
    s390/dasd: make query host access interruptible
    s390/dasd: fix panic during offline processing
    s390/dasd: fix hanging offline processing
    s390/pci_dma: improve lazy flush for unmap
    s390/pci_dma: split dma_update_trans
    s390/pci_dma: improve map_sg
    s390/pci_dma: simplify dma address calculation
    s390/pci_dma: remove dma address range check
    iommu/s390: simplify registration of I/O address translation parameters
    s390: migrate exception table users off module.h and onto extable.h
    s390: export header for CLP ioctl
    s390/vmur: fix irq pointer dereference in int handler
    s390/dasd: add missing KOBJ_CHANGE event for unformatted devices
    s390: enable UBSAN
    ...

    Linus Torvalds
     

04 Oct, 2016

3 commits

  • Pull CPU hotplug updates from Thomas Gleixner:
    "Yet another batch of cpu hotplug core updates and conversions:

    - Provide core infrastructure for multi instance drivers so the
    drivers do not have to keep custom lists.

    - Convert custom lists to the new infrastructure. The block-mq custom
    list conversion comes through the block tree and makes the diffstat
    tip over to more lines removed than added.

    - Handle unbalanced hotplug enable/disable calls more gracefully.

    - Remove the obsolete CPU_STARTING/DYING notifier support.

    - Convert another batch of notifier users.

    The relayfs changes which conflicted with the conversion have been
    shipped to me by Andrew.

    The remaining lot is targeted for 4.10 so that we finally can remove
    the rest of the notifiers"

    * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
    cpufreq: Fix up conversion to hotplug state machine
    blk/mq: Reserve hotplug states for block multiqueue
    x86/apic/uv: Convert to hotplug state machine
    s390/mm/pfault: Convert to hotplug state machine
    mips/loongson/smp: Convert to hotplug state machine
    mips/octeon/smp: Convert to hotplug state machine
    fault-injection/cpu: Convert to hotplug state machine
    padata: Convert to hotplug state machine
    cpufreq: Convert to hotplug state machine
    ACPI/processor: Convert to hotplug state machine
    virtio scsi: Convert to hotplug state machine
    oprofile/timer: Convert to hotplug state machine
    block/softirq: Convert to hotplug state machine
    lib/irq_poll: Convert to hotplug state machine
    x86/microcode: Convert to hotplug state machine
    sh/SH-X3 SMP: Convert to hotplug state machine
    ia64/mca: Convert to hotplug state machine
    ARM/OMAP/wakeupgen: Convert to hotplug state machine
    ARM/shmobile: Convert to hotplug state machine
    arm64/FP/SIMD: Convert to hotplug state machine
    ...

    Linus Torvalds
     
  • Pull low-level x86 updates from Ingo Molnar:
    "In this cycle this topic tree has become one of those 'super topics'
    that accumulated a lot of changes:

    - Add CONFIG_VMAP_STACK=y support to the core kernel and enable it on
    x86 - preceded by an array of changes. v4.8 saw preparatory changes
    in this area already - this is the rest of the work. Includes the
    thread stack caching performance optimization. (Andy Lutomirski)

    - switch_to() cleanups and all around enhancements. (Brian Gerst)

    - A large number of dumpstack infrastructure enhancements and an
    unwinder abstraction. The secret long term plan is safe(r) live
    patching plus maybe another attempt at debuginfo based unwinding -
    but all these current bits are standalone enhancements in a frame
    pointer based debug environment as well. (Josh Poimboeuf)

    - More __ro_after_init and const annotations. (Kees Cook)

    - Enable KASLR for the vmemmap memory region. (Thomas Garnier)"

    [ The virtually mapped stack changes are pretty fundamental, and not
    x86-specific per se, even if they are only used on x86 right now. ]

    * 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
    x86/asm: Get rid of __read_cr4_safe()
    thread_info: Use unsigned long for flags
    x86/alternatives: Add stack frame dependency to alternative_call_2()
    x86/dumpstack: Fix show_stack() task pointer regression
    x86/dumpstack: Remove dump_trace() and related callbacks
    x86/dumpstack: Convert show_trace_log_lvl() to use the new unwinder
    oprofile/x86: Convert x86_backtrace() to use the new unwinder
    x86/stacktrace: Convert save_stack_trace_*() to use the new unwinder
    perf/x86: Convert perf_callchain_kernel() to use the new unwinder
    x86/unwind: Add new unwind interface and implementations
    x86/dumpstack: Remove NULL task pointer convention
    fork: Optimize task creation by caching two thread stacks per CPU if CONFIG_VMAP_STACK=y
    sched/core: Free the stack early if CONFIG_THREAD_INFO_IN_TASK
    lib/syscall: Pin the task stack in collect_syscall()
    x86/process: Pin the target stack in get_wchan()
    x86/dumpstack: Pin the target stack when dumping it
    kthread: Pin the stack via try_get_task_stack()/put_task_stack() in to_live_kthread() function
    sched/core: Add try_get_task_stack() and put_task_stack()
    x86/entry/64: Fix a minor comment rebase error
    iommu/amd: Don't put completion-wait semaphore on stack
    ...

    Linus Torvalds
     
  • Pull EFI updates from Ingo Molnar:
    "Main changes in this cycle were:

    - Refactor the EFI memory map code into architecture neutral files
    and allow drivers to permanently reserve EFI boot services regions
    on x86, as well as ARM/arm64. (Matt Fleming)

    - Add ARM support for the EFI ESRT driver. (Ard Biesheuvel)

    - Make the EFI runtime services and efivar API interruptible by
    swapping spinlocks for semaphores. (Sylvain Chouleur)

    - Provide the EFI identity mapping for kexec which allows kexec to
    work on SGI/UV platforms with requiring the "noefi" kernel command
    line parameter. (Alex Thorlton)

    - Add debugfs node to dump EFI page tables on arm64. (Ard Biesheuvel)

    - Merge the EFI test driver being carried out of tree until now in
    the FWTS project. (Ivan Hu)

    - Expand the list of flags for classifying EFI regions as "RAM" on
    arm64 so we align with the UEFI spec. (Ard Biesheuvel)

    - Optimise out the EFI mixed mode if it's unsupported (CONFIG_X86_32)
    or disabled (CONFIG_EFI_MIXED=n) and switch the early EFI boot
    services function table for direct calls, alleviating us from
    having to maintain the custom function table. (Lukas Wunner)

    - Miscellaneous cleanups and fixes"

    * 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
    x86/efi: Round EFI memmap reservations to EFI_PAGE_SIZE
    x86/efi: Allow invocation of arbitrary boot services
    x86/efi: Optimize away setup_gop32/64 if unused
    x86/efi: Use kmalloc_array() in efi_call_phys_prolog()
    efi/arm64: Treat regions with WT/WC set but WB cleared as memory
    efi: Add efi_test driver for exporting UEFI runtime service interfaces
    x86/efi: Defer efi_esrt_init until after memblock_x86_fill
    efi/arm64: Add debugfs node to dump UEFI runtime page tables
    x86/efi: Remove unused find_bits() function
    fs/efivarfs: Fix double kfree() in error path
    x86/efi: Map in physical addresses in efi_map_region_fixed
    lib/ucs2_string: Speed up ucs2_utf8size()
    firmware-gsmi: Delete an unnecessary check before the function call "dma_pool_destroy"
    x86/efi: Initialize status to ensure garbage is not returned on small size
    efi: Replace runtime services spinlock with semaphore
    efi: Don't use spinlocks for efi vars
    efi: Use a file local lock for efivars
    efi/arm*: esrt: Add missing call to efi_esrt_init()
    efi/esrt: Use memremap not ioremap to access ESRT table in memory
    x86/efi-bgrt: Use efi_mem_reserve() to avoid copying image data
    ...

    Linus Torvalds
     

03 Oct, 2016

1 commit


01 Oct, 2016

1 commit

  • kbuild test robot reports:

    lib/dma-debug.c: In function 'debug_dma_map_resource':
    >> lib/dma-debug.c:1541:16: error: implicit declaration of function '__phys_to_pfn' [-Werror=implicit-function-declaration]
    entry->pfn = __phys_to_pfn(addr);
    ^~~~~~~~~~~~~

    ia64 does not provide __phys_to_pfn(), use the PHYS_PFN() alias.

    Fixes: 0e74b34dfc3318bf ("dma-debug: add support for resource mappings")
    Signed-off-by: Niklas Söderlund
    Signed-off-by: Vinod Koul

    Niklas Söderlund
     

30 Sep, 2016

1 commit


29 Sep, 2016

1 commit


28 Sep, 2016

2 commits

  • put_cpu_var takes the percpu data, not the data returned from
    get_cpu_var.

    This doesn't change the behavior.

    Cc: Tejun Heo
    Signed-off-by: Shaohua Li
    Acked-by: Tejun Heo
    Signed-off-by: David S. Miller

    Shaohua Li
     
  • * the only remaining callers of "short" fault-ins are just as happy with generic
    variants (both in lib/iov_iter.c); switch them to multipage variants, kill the
    "short" ones
    * rename the multipage variants to now available plain ones.
    * get rid of compat macro defining iov_iter_fault_in_multipage_readable by
    expanding it in its only user.

    Signed-off-by: Al Viro

    Al Viro
     

27 Sep, 2016

2 commits

  • Specifying the aligned attributes to the char data[NDISKS][PAGE_SIZE],
    char recovi[PAGE_SIZE] and char recovi[PAGE_SIZE] arrays, so that all
    malloc memory is page boundary aligned.

    Without these alignment attributes, the test causes a segfault in
    userspace when the NDISKS are changed to 4 from 16.

    The RAID stripes will be page aligned anyway, so we want to test what
    the kernel actually will execute.

    Cc: H. Peter Anvin
    Cc: Yu-cheng Yu
    Signed-off-by: Gayatri Kammela
    Reviewed-by: H. Peter Anvin
    Signed-off-by: Shaohua Li

    Gayatri Kammela
     
  • A MMIO mapped resource can not be represented by a struct page so a new
    debug type is needed to handle this. This patch add such type and
    functionality to add/remove entries and how to translate them to a
    physical address.

    Signed-off-by: Niklas Söderlund
    Acked-by: Arnd Bergmann
    Signed-off-by: Vinod Koul

    Niklas Söderlund
     

26 Sep, 2016

1 commit

  • The fixes to the radix tree test suite show that the multi-order case is
    broken. The basic reason is that the radix tree code uses tagged
    pointers with the "internal" bit in the low bits, and calculating the
    pointer indices was supposed to mask off those bits. But gcc will
    notice that we then use the index to re-create the pointer, and will
    avoid doing the arithmetic and use the tagged pointer directly.

    This cleans the code up, using the existing is_sibling_entry() helper to
    validate the sibling pointer range (instead of open-coding it), and
    using entry_to_node() to mask off the low tag bit from the pointer. And
    once you do that, you might as well just use the now cleaned-up pointer
    directly.

    [ Side note: the multi-order code isn't actually ever used in the kernel
    right now, and the only reason I didn't just delete all that code is
    that Kirill Shutemov piped up and said:

    "Well, my ext4-with-huge-pages patchset[1] uses multi-order entries.
    It also converts shmem-with-huge-pages and hugetlb to them.

    I'm okay with converting it to other mechanism, but I need
    something. (I looked into Konstantin's RFC patchset[2]. It looks
    okay, but I don't feel myself qualified to review it as I don't
    know much about radix-tree internals.)"

    [1] http://lkml.kernel.org/r/20160915115523.29737-1-kirill.shutemov@linux.intel.com
    [2] http://lkml.kernel.org/r/147230727479.9957.1087787722571077339.stgit@zurg ]

    Reported-by: Matthew Wilcox
    Cc: Andrew Morton
    Cc: Ross Zwisler
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Cedric Blancher
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

23 Sep, 2016

2 commits


22 Sep, 2016

4 commits

  • Optimize RAID6 xor_syndrome functions to take advantage of the 512-bit
    ZMM integer instructions introduced in AVX512.

    AVX512 optimized xor_syndrome functions, which is simply based on sse2.c
    written by hpa.

    The patch was tested and benchmarked before submission on
    a hardware that has AVX512 flags to support such instructions

    Cc: H. Peter Anvin
    Cc: Jim Kukunas
    Cc: Fenghua Yu
    Cc: Megha Dey
    Signed-off-by: Gayatri Kammela
    Reviewed-by: Fenghua Yu
    Signed-off-by: Shaohua Li

    Gayatri Kammela
     
  • Adding avx512 gen_syndrome and recovery functions so as to allow code to
    be compiled and tested successfully in userspace.

    This patch is tested in userspace and improvement in performace is
    observed.

    Cc: H. Peter Anvin
    Cc: Jim Kukunas
    Cc: Fenghua Yu
    Signed-off-by: Megha Dey
    Signed-off-by: Gayatri Kammela
    Reviewed-by: Fenghua Yu
    Signed-off-by: Shaohua Li

    Gayatri Kammela
     
  • Optimize RAID6 recovery functions to take advantage of
    the 512-bit ZMM integer instructions introduced in AVX512.

    AVX512 optimized recovery functions, which is simply based
    on recov_avx2.c written by Jim Kukunas

    This patch was tested and benchmarked before submission on
    a hardware that has AVX512 flags to support such instructions

    Cc: Jim Kukunas
    Cc: H. Peter Anvin
    Cc: Fenghua Yu
    Signed-off-by: Megha Dey
    Signed-off-by: Gayatri Kammela
    Reviewed-by: Fenghua Yu
    Signed-off-by: Shaohua Li

    Gayatri Kammela
     
  • Optimize RAID6 gen_syndrom functions to take advantage of
    the 512-bit ZMM integer instructions introduced in AVX512.

    AVX512 optimized gen_syndrom functions, which is simply based
    on avx2.c written by Yuanhan Liu and sse2.c written by hpa.

    The patch was tested and benchmarked before submission on
    a hardware that has AVX512 flags to support such instructions

    Cc: H. Peter Anvin
    Cc: Jim Kukunas
    Cc: Fenghua Yu
    Signed-off-by: Megha Dey
    Signed-off-by: Gayatri Kammela
    Reviewed-by: Fenghua Yu
    Signed-off-by: Shaohua Li

    Gayatri Kammela
     

21 Sep, 2016

1 commit

  • This commit introduces a generic library to estimate either the min or
    max value of a time-varying variable over a recent time window. This
    is code originally from Kathleen Nichols. The current form of the code
    is from Van Jacobson.

    A single struct minmax_sample will track the estimated windowed-max
    value of the series if you call minmax_running_max() or the estimated
    windowed-min value of the series if you call minmax_running_min().

    Nearly equivalent code is already in place for minimum RTT estimation
    in the TCP stack. This commit extracts that code and generalizes it to
    handle both min and max. Moving the code here reduces the footprint
    and complexity of the TCP code base and makes the filter generally
    available for other parts of the codebase, including an upcoming TCP
    congestion control module.

    This library works well for time series where the measurements are
    smoothly increasing or decreasing.

    Signed-off-by: Van Jacobson
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Nandita Dukkipati
    Signed-off-by: Eric Dumazet
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Neal Cardwell
     

20 Sep, 2016

3 commits

  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Some architectures use a hardware defined structure at address zero.
    Checking for a null pointer will result in many ubsan reports.
    Allow users to disable the null sanitizer.

    Signed-off-by: Christian Borntraeger
    Acked-by: Andrey Ryabinin
    Signed-off-by: Heiko Carstens
    Signed-off-by: Martin Schwidefsky

    Christian Borntraeger
     
  • The insecure_elasticity setting is an ugly wart brought out by
    users who need to insert duplicate objects (that is, distinct
    objects with identical keys) into the same table.

    In fact, those users have a much bigger problem. Once those
    duplicate objects are inserted, they don't have an interface to
    find them (unless you count the walker interface which walks
    over the entire table).

    Some users have resorted to doing a manual walk over the hash
    table which is of course broken because they don't handle the
    potential existence of multiple hash tables. The result is that
    they will break sporadically when they encounter a hash table
    resize/rehash.

    This patch provides a way out for those users, at the expense
    of an extra pointer per object. Essentially each object is now
    a list of objects carrying the same key. The hash table will
    only see the lists so nothing changes as far as rhashtable is
    concerned.

    To use this new interface, you need to insert a struct rhlist_head
    into your objects instead of struct rhash_head. While the hash
    table is unchanged, for type-safety you'll need to use struct
    rhltable instead of struct rhashtable. All the existing interfaces
    have been duplicated for rhlist, including the hash table walker.

    One missing feature is nulls marking because AFAIK the only potential
    user of it does not need duplicate objects. Should anyone need
    this it shouldn't be too hard to add.

    Signed-off-by: Herbert Xu
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller

    Herbert Xu