18 Nov, 2017

11 commits

  • We have one boolean flag in rpcrdma_req today. I'd like to add more
    flags, so convert that boolean to a bit flag.
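
    A minimal user-space sketch of the conversion, for illustration only.
    The struct, the flag names and the helpers below are hypothetical and
    are not taken from the patch; the point is just that one flags word
    leaves room for more flags where a lone bool does not.

```c
#include <stdbool.h>

/* Hypothetical flag bits; each new flag simply takes the next bit. */
#define DEMO_REQ_F_BACKCHANNEL	(1UL << 0)
#define DEMO_REQ_F_PENDING	(1UL << 1)

/* Hypothetical request object: one flags word replaces a lone bool. */
struct demo_req {
	unsigned long flags;
};

static void demo_req_set(struct demo_req *req, unsigned long flag)
{
	req->flags |= flag;
}

static void demo_req_clear(struct demo_req *req, unsigned long flag)
{
	req->flags &= ~flag;
}

static bool demo_req_test(const struct demo_req *req, unsigned long flag)
{
	return (req->flags & flag) != 0;
}
```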

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • Problem statement:

    Recently Sagi Grimberg observed that kernel RDMA-enabled storage
    initiators don't handle delayed Send completion
    correctly. If Send completion is delayed beyond the end of a ULP
    transaction, the ULP may release resources that are still being used
    by the HCA to complete a long-running Send operation.

    This is a common design trait amongst our initiators. Most Send
    operations are faster than the ULP transaction they are part of.
    Waiting for a completion for these is typically unnecessary.

    Infrequently, a network partition or some other problem crops up
    where an ordering problem can occur. In NFS parlance, the RPC Reply
    arrives and completes the RPC, but the HCA is still retrying the
    Send WR that conveyed the RPC Call. In this case, the HCA can try
    to use memory that has been invalidated or DMA unmapped, and the
    connection is lost. If that memory has been re-used for something
    else (possibly not related to NFS), the Send retransmission
    exposes that data on the wire.

    Thus we cannot assume that it is safe to release Send-related
    resources just because a ULP reply has arrived.

    After some analysis, we have determined that the completion
    housekeeping will not be difficult for xprtrdma:

    - Inline Send buffers are registered via the local DMA key, and
    are already left DMA mapped for the lifetime of a transport
    connection, thus no additional handling is necessary for those
    - Gathered Sends involving page cache pages _will_ need to
    DMA unmap those pages after the Send completes. But like
    inline send buffers, they are registered via the local DMA key,
    and thus will not need to be invalidated

    In addition, RPC completion will need to wait for Send completion
    in the latter case. However, nearly always, the Send that conveys
    the RPC Call will have completed long before the RPC Reply
    arrives, and thus no additional latency will be accrued.

    Design notes:

    In this patch, the rpcrdma_sendctx object is introduced, and a
    lock-free circular queue is added to manage a set of them per
    transport.

    The RPC client's send path already prevents sending more than one
    RPC Call at the same time. This allows us to treat the consumer
    side of the queue (rpcrdma_sendctx_get_locked) as if there is a
    single consumer thread.

    The producer side of the queue (rpcrdma_sendctx_put_locked) is
    invoked only from the Send completion handler, which is a single
    thread of execution (soft IRQ).

    The only care that needs to be taken is with the tail index, which
    is shared between the producer and consumer. Only the producer
    updates the tail index. The consumer compares the head with the
    tail to ensure that a sendctx that is in use is never handed
    out again (or, expressed more conventionally, the queue is empty).

    When the sendctx queue empties completely, there are enough Sends
    outstanding that posting more Send operations can result in a Send
    Queue overflow. In this case, the ULP is told to wait and try again.
    This introduces strong Send Queue accounting to xprtrdma.
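
    The queue discipline described above can be sketched in user-space C
    roughly as follows. This is an illustration under assumptions, not
    the patch's code: the demo_* names and the ring size are made up, and
    C11 atomics stand in for the kernel's own ordering primitives. One
    thread calls demo_ctx_get(), another calls demo_ctx_put(), and only
    the latter ever writes the tail.

```c
#include <stdatomic.h>
#include <stddef.h>

#define DEMO_NCTX 4	/* hypothetical ring size; DEMO_NCTX - 1 may be in flight */

struct demo_ctx {
	int id;
};

struct demo_ring {
	struct demo_ctx ctxs[DEMO_NCTX];
	unsigned int head;		/* written only by the get side */
	_Atomic unsigned int tail;	/* written only by the put side */
};

static unsigned int demo_next(unsigned int i)
{
	return (i + 1) % DEMO_NCTX;
}

static void demo_ring_init(struct demo_ring *r)
{
	for (int i = 0; i < DEMO_NCTX; i++)
		r->ctxs[i].id = i;
	r->head = 0;
	atomic_init(&r->tail, 0);
}

/* Send path: hand out the next free context, or NULL if all are in use. */
static struct demo_ctx *demo_ctx_get(struct demo_ring *r)
{
	unsigned int next_head = demo_next(r->head);

	if (next_head == atomic_load_explicit(&r->tail, memory_order_acquire))
		return NULL;	/* caller has to wait and try again */
	r->head = next_head;
	return &r->ctxs[next_head];
}

/* Completion path: retire every context up to and including ctx. */
static void demo_ctx_put(struct demo_ring *r, struct demo_ctx *ctx)
{
	unsigned int next_tail =
		atomic_load_explicit(&r->tail, memory_order_relaxed);

	do {
		next_tail = demo_next(next_tail);
		/* per-context cleanup (unmapping, say) would go here */
	} while (&r->ctxs[next_tail] != ctx);
	atomic_store_explicit(&r->tail, next_tail, memory_order_release);
}
```

    Because a put walks the tail forward to the given context, one
    completion can retire several older contexts at once.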

    As a final touch, Jason Gunthorpe suggested a mechanism that does
    not require signaling every Send.
    We signal once every N Sends, and perform SGE unmapping of N Send
    operations during that one completion.
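
    One way to picture the "signal once every N Sends" bookkeeping is a
    small counter like the one below. This is only a sketch: the struct
    and field names are hypothetical, and how the real code picks N is
    not shown here.

```c
#include <stdbool.h>

/* Hypothetical per-endpoint state for batching Send completions. */
struct demo_ep {
	unsigned int send_batch;	/* N: request a completion every N Sends */
	unsigned int unsignaled;	/* Sends posted since the last signaled one */
};

/* Returns true when the Send being posted should request a completion. */
static bool demo_send_needs_signal(struct demo_ep *ep)
{
	if (++ep->unsignaled >= ep->send_batch) {
		ep->unsignaled = 0;
		return true;
	}
	return false;
}
```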

    Reported-by: Sagi Grimberg
    Suggested-by: Jason Gunthorpe
    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • Commit 655fec6987be ("xprtrdma: Use gathered Send for large inline
    messages") assumed that, since the zeroth element of the Send SGE
    array always pointed to req->rl_rdmabuf, it needed to be initialized
    just once. This was a valid assumption because the Send SGE array
    and rl_rdmabuf both live in the same rpcrdma_req.

    In a subsequent patch, the Send SGE array will be separated from the
    rpcrdma_req, so the zeroth element of the SGE array needs to be
    initialized every time.

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • Clean up: Make rpcrdma_prepare_send_sges() return a negative errno
    instead of a bool. Soon callers will want distinct treatments of
    different types of failures.
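
    A sketch of why a negative errno is more useful to callers than a
    bool. Everything here is hypothetical (function names, parameters and
    the particular errno values chosen); it only shows a caller treating
    two kinds of failure differently.

```c
#include <errno.h>
#include <stddef.h>

enum demo_action { DEMO_SENT, DEMO_RETRY_LATER, DEMO_FAIL_REQUEST };

/* Returns 0 on success, or a negative errno that says what went wrong. */
static int demo_prepare_send_sges(const void *buf, unsigned int free_ctxs)
{
	if (buf == NULL)
		return -EIO;		/* nothing usable to map: permanent */
	if (free_ctxs == 0)
		return -ENOBUFS;	/* out of resources: temporary */
	return 0;
}

/* The caller can now pick a distinct treatment per failure type. */
static enum demo_action demo_send(const void *buf, unsigned int free_ctxs)
{
	switch (demo_prepare_send_sges(buf, free_ctxs)) {
	case 0:
		return DEMO_SENT;
	case -ENOBUFS:
		return DEMO_RETRY_LATER;
	default:
		return DEMO_FAIL_REQUEST;
	}
}
```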

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • When this function fails, it needs to undo the DMA mappings it's
    done so far. Otherwise these are leaked.
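
    The unwind pattern looks roughly like this in user-space C.
    demo_map() and demo_unmap() are stand-ins for real DMA mapping calls,
    and fail_at exists only so the failure path can be exercised.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NSEG 4	/* hypothetical number of segments to map */

struct demo_seg {
	bool mapped;
};

/* Stand-in for a mapping call; fails for the segment at index fail_at. */
static bool demo_map(struct demo_seg *seg, size_t idx, size_t fail_at)
{
	if (idx == fail_at)
		return false;
	seg->mapped = true;
	return true;
}

static void demo_unmap(struct demo_seg *seg)
{
	seg->mapped = false;
}

/* Map n segments; on failure, undo the mappings made so far. */
static int demo_map_all(struct demo_seg *segs, size_t n, size_t fail_at)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (!demo_map(&segs[i], i, fail_at))
			goto out_unwind;
	return 0;

out_unwind:
	while (i-- > 0)
		demo_unmap(&segs[i]);
	return -EIO;
}
```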

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • Clean up. rpcrdma_prepare_hdr_sge() sets num_sge to one, then
    rpcrdma_prepare_msg_sges() sets num_sge again to the count of SGEs
    it added, plus one for the header SGE just mapped in
    rpcrdma_prepare_hdr_sge(). This is confusing, and nails in an
    assumption about when these functions are called.

    Instead, maintain a running count that both functions can update
    with just the number of SGEs they have added to the SGE array.
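
    In sketch form, with hypothetical names, each helper adds only what
    it contributed, so neither needs to know what the other did or in
    which order they ran:

```c
/* Hypothetical Send descriptor carrying the running SGE count. */
struct demo_send {
	unsigned int num_sge;
};

/* The header always contributes exactly one SGE. */
static void demo_prepare_hdr_sge(struct demo_send *s)
{
	s->num_sge += 1;
}

/* The message body contributes one SGE per page in this sketch. */
static void demo_prepare_msg_sges(struct demo_send *s, unsigned int npages)
{
	s->num_sge += npages;
}
```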

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • We need to decode and save the incoming rdma_credits field _after_
    we know that the direction of the message is "forward direction
    Reply". Otherwise, the credits value in reverse direction Calls is
    also used to update the forward direction credits.

    It is safe to decode the rdma_credits field in rpcrdma_reply_handler
    now that rpcrdma_reply_handler is single-threaded. Receives complete
    in the same order as they were sent on the NFS server.
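
    The ordering constraint can be sketched as below. The types and names
    are hypothetical; the only point is that the credits value is applied
    after the direction check, never before it.

```c
#include <stdbool.h>
#include <stdint.h>

enum demo_dir { DEMO_FORWARD_REPLY, DEMO_REVERSE_CALL };

/* Hypothetical decoded transport header. */
struct demo_msg {
	enum demo_dir dir;
	uint32_t credits;
};

/* Hypothetical transport state: forward-direction credit grant. */
struct demo_xprt {
	uint32_t fwd_credits;
};

/* Returns true if the message was a forward Reply and credits were saved. */
static bool demo_handle_msg(struct demo_xprt *x, const struct demo_msg *m)
{
	if (m->dir != DEMO_FORWARD_REPLY)
		return false;	/* reverse Call: forward credits untouched */
	x->fwd_credits = m->credits;
	return true;
}
```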

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • I noticed that the soft IRQ thread looked pretty busy under heavy
    I/O workloads. perf suggested one area that was expensive was the
    queue_work() call in rpcrdma_wc_receive. That gave me some ideas.

    Instead of scheduling a separate worker to process RPC Replies,
    promote the Receive completion handler to IB_POLL_WORKQUEUE, and
    invoke rpcrdma_reply_handler directly.

    Note that the poll workqueue is single-threaded. In order to keep
    memory invalidation from serializing all RPC Replies, handle any
    necessary invalidation tasks in a separate multi-threaded workqueue.

    This provides a two-tier scheme, similar to OS I/O interrupt
    handlers: A fast interrupt handler that schedules the slow handler
    and re-enables the interrupt, and a slower handler that is invoked
    for any needed heavy lifting.

    Benefits include:
    - One less context switch for RPCs that don't register memory
    - Receive completion handling is moved out of soft IRQ context to
    make room for other users of soft IRQ
    - The same CPU core now DMA syncs and XDR decodes the Receive buffer

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • Clean up: I'd like to be able to invoke the tail of
    rpcrdma_reply_handler in two different places. Split the tail out
    into its own helper function.

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • Clean up: Make it easier to pass the decoded XID, vers, credits, and
    proc fields around by moving these variables into struct rpcrdma_rep.

    Note: the credits field will be handled in a subsequent patch.

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • A reply with an unrecognized value in the version field means the
    transport header is potentially garbled and therefore all the fields
    are untrustworthy.

    Fixes: 59aa1f9a3cce3 ("xprtrdma: Properly handle RDMA_ERROR ... ")
    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     

17 Oct, 2017

7 commits

  • Clean up: There are no remaining callers of this method.

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • The "safe" version of ro_unmap is used here to avoid waiting
    unnecessarily. However:

    - It is safe to wait. After all, we have to wait anyway when using
    FMR to register memory.

    - This case is rare: it occurs only after a reconnect.

    By switching this call site to ro_unmap_sync, the final use of
    ro_unmap_safe is removed.

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • In current kernels, waiting in xprt_release appears to be safe to
    do. I had erroneously believed that for ASYNC RPCs, waiting of any
    kind in xprt_release->xprt_rdma_free would result in deadlock. I've
    done injection testing and consulted with Trond to confirm that
    waiting in the RPC release path is safe.

    For the very few times where RPC resources haven't yet been released
    earlier by the reply handler, it is safe to wait synchronously in
    xprt_rdma_free for invalidation rather than deferring it to MR
    recovery.

    Note: When the QP is in error state, posting a LocalInvalidate should
    flush and mark the MR as bad. There is no way the remote HCA can
    access that MR via a QP in error state, so it is effectively already
    inaccessible and thus safe for the Upper Layer to access. The next
    time the MR is used it should be recognized and cleaned up properly
    by frwr_op_map.

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • Commit f5a73672d181 ("NFS: allow close-to-open cache semantics to
    apply to root of NFS filesystem") added a call to
    __nfs_revalidate_inode() to nfs_opendir as the lookup
    process wouldn't reliably do this.

    Subsequent commit a3fbbde70a0c ("VFS: we need to set LOOKUP_JUMPED
    on mountpoint crossing") made this unnecessary. So remove the
    unnecessary code.

    Signed-off-by: NeilBrown
    Signed-off-by: Anna Schumaker

    NeilBrown
     
  • For correct close-to-open semantics, NFS must validate
    the change attribute of a directory (or file) on open.

    Since commit ecf3d1f1aa74 ("vfs: kill FS_REVAL_DOT by adding a
    d_weak_revalidate dentry op"), open() of "." or a path ending ".." is
    not revalidated reliably (except when that directory is a mount point).

    Prior to that commit, "." was revalidated using nfs_lookup_revalidate()
    which checks the LOOKUP_OPEN flag and forces revalidation if the flag is
    set.
    Since that commit, nfs_weak_revalidate() is used for NFSv3 (which
    ignores the flags) and nothing is used for NFSv4.

    This is fixed by using nfs_lookup_verify_inode() in
    nfs_weak_revalidate(). This does the revalidation exactly when needed.
    Also, add a definition of .d_weak_revalidate for NFSv4.

    The incorrect behavior is easily demonstrated by running "echo *" in
    some non-mountpoint NFS directory while watching network traffic.
    Without this patch, "echo *" sometimes doesn't produce any traffic.
    With the patch it always does.

    Fixes: ecf3d1f1aa74 ("vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op")
    cc: stable@vger.kernel.org (3.9+)
    Signed-off-by: NeilBrown
    Signed-off-by: Anna Schumaker

    NeilBrown
     
  • The NFS_ACCESS_* flags aren't a 1:1 mapping to the MAY_* flags, so
    checking for MAY_WHATEVER might have surprising results in
    nfs*_proc_access(). Let's simplify this check when determining which
    bits to ask for, and do it in a generic place instead of copying code
    for each NFS version.
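
    To illustrate why the mapping is not 1:1, here is a sketch of one
    possible translation from a MAY_* style mask to access bits. The
    DEMO_* constants and the function are hypothetical and are not the
    code in this patch; the point is that one permission bit can fan out
    to several access bits, and that the answer depends on whether the
    object is a directory.

```c
#include <stdbool.h>

/* Hypothetical permission bits, in the style of MAY_*. */
#define DEMO_MAY_EXEC		0x1u
#define DEMO_MAY_WRITE		0x2u
#define DEMO_MAY_READ		0x4u

/* Hypothetical access bits, in the style of NFS_ACCESS_*. */
#define DEMO_ACCESS_READ	0x01u
#define DEMO_ACCESS_LOOKUP	0x02u
#define DEMO_ACCESS_MODIFY	0x04u
#define DEMO_ACCESS_EXTEND	0x08u
#define DEMO_ACCESS_DELETE	0x10u
#define DEMO_ACCESS_EXECUTE	0x20u

/* Translate a permission mask into the access bits to ask the server for. */
static unsigned int demo_access_request(unsigned int may, bool is_dir)
{
	unsigned int req = 0;

	if (may & DEMO_MAY_READ)
		req |= DEMO_ACCESS_READ;
	if (may & DEMO_MAY_WRITE) {
		req |= DEMO_ACCESS_MODIFY | DEMO_ACCESS_EXTEND;
		if (is_dir)
			req |= DEMO_ACCESS_DELETE;
	}
	if (may & DEMO_MAY_EXEC)
		req |= is_dir ? DEMO_ACCESS_LOOKUP : DEMO_ACCESS_EXECUTE;
	return req;
}
```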

    Signed-off-by: Anna Schumaker

    Anna Schumaker
     
  • Passing the NFS v4 flags into the v3 code seems weird to me, even if
    they are defined to the same values. This patch adds in generic flags
    to help me feel better.

    Signed-off-by: Anna Schumaker

    Anna Schumaker
     

16 Oct, 2017

1 commit


15 Oct, 2017

10 commits

  • Pull char/misc driver fixes from Greg KH:
    "Here are 4 patches to resolve some char/misc driver issues found these
    past weeks.

    One of them is a mei bugfix and another is a new mei device id. There
    is also a hyper-v fix for a reported issue, and a binder issue fix for
    a problem reported by a few people.

    All of these have been in my tree for a while, I don't know if
    linux-next is really testing much this month. But 0-day is happy with
    them :)"

    * tag 'char-misc-4.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
    binder: fix use-after-free in binder_transaction()
    Drivers: hv: vmbus: Fix bugs in rescind handling
    mei: me: add gemini lake devices id
    mei: always use domain runtime pm callbacks.

    Linus Torvalds
     
  • Pull USB fixes from Greg KH:
    "Here are a handful of USB driver fixes for 4.14-rc5.

    There is the "usual" usb-serial fixes and device ids, USB gadget
    fixes, and some more fixes found by the fuzz testing that is happening
    on the USB layer right now.

    All of these have been in my tree this week with no reported issues"

    * tag 'usb-4.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
    usb: usbtest: fix NULL pointer dereference
    usb: gadget: configfs: Fix memory leak of interface directory data
    usb: gadget: composite: Fix use-after-free in usb_composite_overwrite_options
    usb: misc: usbtest: Fix overflow in usbtest_do_ioctl()
    usb: renesas_usbhs: Fix DMAC sequence for receiving zero-length packet
    USB: dummy-hcd: Fix deadlock caused by disconnect detection
    usb: phy: tegra: Fix phy suspend for UDC
    USB: serial: console: fix use-after-free after failed setup
    USB: serial: console: fix use-after-free on disconnect
    USB: serial: qcserial: add Dell DW5818, DW5819
    USB: serial: cp210x: add support for ELV TFD500
    USB: serial: cp210x: fix partnum regression
    USB: serial: option: add support for TP-Link LTE module
    USB: serial: ftdi_sio: add id for Cypress WICED dev board

    Linus Torvalds
     
  • Pull dmaengine fixes from Vinod Koul:
    "Here are fixes for this round

    - fix spinlock usage and fifo response for altera driver

    - fix ti crossbar race condition

    - fix edma memcpy align"

    * tag 'dmaengine-fix-4.14-rc5' of git://git.infradead.org/users/vkoul/slave-dma:
    dmaengine: altera: fix spinlock usage
    dmaengine: altera: fix response FIFO emptying
    dmaengine: ti-dma-crossbar: Fix possible race condition with dma_inuse
    dmaengine: edma: Align the memcpy acnt array size with the transfer

    Linus Torvalds
     
  • Pull x86 fixes from Ingo Molnar:
    "A laundry list of fixes:

    - fix reboot breakage on some PCID-enabled systems

    - fix crashes/hangs on some PCID-enabled systems

    - fix microcode loading on certain older CPUs

    - various unwinder fixes

    - extend an APIC quirk to more hardware systems and disable APIC
    related warning on virtualized systems

    - various Hyper-V fixes

    - a macro definition robustness fix

    - remove jprobes IRQ disabling

    - various mem-encryption fixes"

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/microcode: Do the family check first
    x86/mm: Flush more aggressively in lazy TLB mode
    x86/apic: Update TSC_DEADLINE quirk with additional SKX stepping
    x86/apic: Silence "FW_BUG TSC_DEADLINE disabled due to Errata" on hypervisors
    x86/mm: Disable various instrumentations of mm/mem_encrypt.c and mm/tlb.c
    x86/hyperv: Fix hypercalls with extended CPU ranges for TLB flushing
    x86/hyperv: Don't use percpu areas for pcpu_flush/pcpu_flush_ex structures
    x86/hyperv: Clear vCPU banks between calls to avoid flushing unneeded vCPUs
    x86/unwind: Disable unwinder warnings on 32-bit
    x86/unwind: Align stack pointer in unwinder dump
    x86/unwind: Use MSB for frame pointer encoding on 32-bit
    x86/unwind: Fix dereference of untrusted pointer
    x86/alternatives: Fix alt_max_short macro to really be a max()
    x86/mm/64: Fix reboot interaction with CR4.PCIDE
    kprobes/x86: Remove IRQ disabling from jprobe handlers
    kprobes/x86: Set up frame pointer in kprobe trampoline

    Linus Torvalds
     
  • Pull scheduler fixes from Ingo Molnar:
    "Three fixes that address an SMP balancing performance regression"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/core: Ensure load_balance() respects the active_mask
    sched/core: Address more wake_affine() regressions
    sched/core: Fix wake_affine() performance regression

    Linus Torvalds
     
  • Pull RAS fixes from Ingo Molnar:
    "A boot parameter fix, plus a header export fix"

    * 'ras-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/mce: Hide mca_cfg
    RAS/CEC: Use the right length for "cec_disable"

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "Some tooling fixes plus three kernel fixes: a memory leak fix, a
    statistics fix and a crash fix"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/x86/intel/uncore: Fix memory leaks on allocation failures
    perf/core: Fix cgroup time when scheduling descendants
    perf/core: Avoid freeing static PMU contexts when PMU is unregistered
    tools include uapi bpf.h: Sync kernel ABI header with tooling header
    perf pmu: Unbreak perf record for arm/arm64 with events with explicit PMU
    perf script: Add missing separator for "-F ip,brstack" (and brstackoff)
    perf callchain: Compare dsos (as well) for CCKEY_FUNCTION

    Linus Torvalds
     
  • Pull locking fixes from Ingo Molnar:
    "Two lockdep fixes for bugs introduced by the cross-release dependency
    tracking feature - plus a commit that disables it because performance
    regressed in an abysmal fashion on some systems"

    * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    locking/lockdep: Disable cross-release features for now
    locking/selftest: Avoid false BUG report
    locking/lockdep: Fix stacktrace mess

    Linus Torvalds
     
  • Pull irq fixes from Ingo Molnar:
    "A CPU hotplug related fix, plus two related sanity checks"

    * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    genirq/cpuhotplug: Enforce affinity setting on startup of managed irqs
    genirq/cpuhotplug: Add sanity check for effective affinity mask
    genirq: Warn when effective affinity is not updated

    Linus Torvalds
     
  • Pull objtool fix from Ingo Molnar:
    "A single objtool fix: avoid silently broken ORC debuginfo builds and
    error out instead"

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    objtool: Upgrade libelf-devel warning to error for CONFIG_ORC_UNWINDER

    Linus Torvalds
     

14 Oct, 2017

11 commits

  • On CPUs like AMD's Geode, for example, we shouldn't even try to load
    microcode because they do not support the modern microcode loading
    interface.

    However, we do the family check *after* the other checks whether the
    loader has been disabled on the command line or whether we're running in
    a guest.

    So move the family checks first in order to exit early if we're being
    loaded on an unsupported family.

    Reported-and-tested-by: Sven Glodowski
    Signed-off-by: Borislav Petkov
    Cc: # 4.11..
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://bugzilla.suse.com/show_bug.cgi?id=1061396
    Link: http://lkml.kernel.org/r/20171012112316.977-1-bp@alien8.de
    Signed-off-by: Ingo Molnar

    Borislav Petkov
     
  • Johan Hovold reported a big boot-time slowdown on his system, caused by lockdep:

    > I had noticed that the BeagleBone Black boot time appeared to have
    > increased significantly with 4.14 and yesterday I finally had time to
    > investigate it.
    >
    > Boot time (from "Linux version" to login prompt) had in fact doubled
    > since 4.13 where it took 17 seconds (with my current config) compared to
    > the 35 seconds I now see with 4.14-rc4.
    >
    > A quick bisect pointed to lockdep and specifically the following commit:
    >
    > 28a903f63ec0 ("locking/lockdep: Handle non(or multi)-acquisition of a crosslock")

    Because the final v4.14 release is close, disable the cross-release lockdep
    features for now.

    Bisected-by: Johan Hovold
    Debugged-by: Johan Hovold
    Reported-by: Johan Hovold
    Cc: Arnd Bergmann
    Cc: Byungchul Park
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tony Lindgren
    Cc: kernel-team@lge.com
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: linux-mm@kvack.org
    Cc: linux-omap@vger.kernel.org
    Link: http://lkml.kernel.org/r/20171014072659.f2yr6mhm5ha3eou7@gmail.com
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Pull MIPS fixes from Ralf Baechle:
    "More MIPS fixes for 4.14:

    - Loongson 1: Set the default number of RX and TX queues to
    accommodate recent changes of stmmac driver.

    - BPF: Fix uninitialised target compiler error.

    - Fix cmpxchg on 32 bit signed ints for 64 bit kernels with
    !kernel_uses_llsc

    - Fix generic-board-config.sh for builds using O=

    - Remove pr_err() calls from fpu_emu() for a case which is not a
    kernel error"

    * '4.14-fixes' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
    MIPS: math-emu: Remove pr_err() calls from fpu_emu()
    MIPS: Fix generic-board-config.sh for builds using O=
    MIPS: Fix cmpxchg on 32b signed ints for 64b kernel with !kernel_uses_llsc
    MIPS: loongson1: set default number of rx and tx queues for stmmac
    MIPS: bpf: Fix uninitialised target compiler error

    Linus Torvalds
     
  • Since commit:

    94b1b03b519b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")

    x86's lazy TLB mode has been all the way lazy: when running a kernel thread
    (including the idle thread), the kernel keeps using the last user mm's
    page tables without attempting to maintain user TLB coherence at all.

    From a pure semantic perspective, this is fine -- kernel threads won't
    attempt to access user pages, so having stale TLB entries doesn't matter.

    Unfortunately, I forgot about a subtlety. By skipping TLB flushes,
    we also allow any paging-structure caches that may exist on the CPU
    to become incoherent. This means that we can have a
    paging-structure cache entry that references a freed page table, and
    the CPU is within its rights to do a speculative page walk starting
    at the freed page table.

    I can imagine this causing two different problems:

    - A speculative page walk starting from a bogus page table could read
    IO addresses. I haven't seen any reports of this causing problems.

    - A speculative page walk that involves a bogus page table can install
    garbage in the TLB. Such garbage would always be at a user VA, but
    some AMD CPUs have logic that triggers a machine check when it notices
    these bogus entries. I've seen a couple reports of this.

    Boris further explains the failure mode:

    > It is actually more of an optimization which assumes that paging-structure
    > entries are in WB DRAM:
    >
    > "TlbCacheDis: cacheable memory disable. Read-write. 0=Enables
    > performance optimization that assumes PML4, PDP, PDE, and PTE entries
    > are in cacheable WB-DRAM; memory type checks may be bypassed, and
    > addresses outside of WB-DRAM may result in undefined behavior or NB
    > protocol errors. 1=Disables performance optimization and allows PML4,
    > PDP, PDE and PTE entries to be in any memory type. Operating systems
    > that maintain page tables in memory types other than WB-DRAM must set
    > TlbCacheDis to insure proper operation."
    >
    > The MCE generated is an NB protocol error to signal that
    >
    > "Link: A specific coherent-only packet from a CPU was issued to an
    > IO link. This may be caused by software which addresses page table
    > structures in a memory type other than cacheable WB-DRAM without
    > properly configuring MSRC001_0015[TlbCacheDis]. This may occur, for
    > example, when page table structure addresses are above top of memory. In
    > such cases, the NB will generate an MCE if it sees a mismatch between
    > the memory operation generated by the core and the link type."
    >
    > I'm assuming coherent-only packets don't go out on IO links, thus the
    > error.

    To fix this, reinstate TLB coherence in lazy mode. With this patch
    applied, we do it in one of two ways:

    - If we have PCID, we simply switch back to init_mm's page tables
    when we enter a kernel thread -- this seems to be quite cheap
    except for the cost of serializing the CPU.

    - If we don't have PCID, then we set a flag and switch to init_mm
    the first time we would otherwise need to flush the TLB.

    The /sys/kernel/debug/x86/tlb_use_lazy_mode debug switch can be changed
    to override the default mode for benchmarking.

    In theory, we could optimize this better by only flushing the TLB in
    lazy CPUs when a page table is freed. Doing that would require
    auditing the mm code to make sure that all page table freeing goes
    through tlb_remove_page() as well as reworking some data structures
    to implement the improved flush logic.

    Reported-by: Markus Trippelsdorf
    Reported-by: Adam Borowski
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Borislav Petkov
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Daniel Borkmann
    Cc: Eric Biggers
    Cc: Johannes Hirte
    Cc: Kees Cook
    Cc: Kirill A. Shutemov
    Cc: Linus Torvalds
    Cc: Nadav Amit
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Roman Kagan
    Cc: Thomas Gleixner
    Fixes: 94b1b03b519b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")
    Link: http://lkml.kernel.org/r/20171009170231.fkpraqokz6e4zeco@pd.tnic
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     
  • Pull drm fixes from Dave Airlie:
    "Couple of the arm people seem to wake up so this has imx and msm
    fixes, along with a bunch of i915 stable bounds fixes and an amdgpu
    regression fix.

    All seems pretty okay for now"

    * tag 'drm-fixes-for-v4.14-rc5' of git://people.freedesktop.org/~airlied/linux:
    drm/msm: fix _NO_IMPLICIT fencing case
    drm/msm: fix error path cleanup
    drm/msm/mdp5: Remove extra pm_runtime_put call in mdp5_crtc_cursor_set()
    drm/msm/dsi: Use correct pm_runtime_put variant during host_init
    drm/msm: fix return value check in _msm_gem_kernel_new()
    drm/msm: use proper memory barriers for updating tail/head
    drm/msm/mdp5: add missing max size for 8x74 v1
    drm/amdgpu: fix placement flags in amdgpu_ttm_bind
    drm/i915/bios: parse DDI ports also for CHV for HDMI DDC pin and DP AUX channel
    gpu: ipu-v3: pre: implement workaround for ERR009624
    gpu: ipu-v3: prg: wait for double buffers to be filled on channel startup
    gpu: ipu-v3: Allow channel burst locking on i.MX6 only
    drm/i915: Read timings from the correct transcoder in intel_crtc_mode_get()
    drm/i915: Order two completing nop_submit_request
    drm/i915: Silence compiler warning for hsw_power_well_enable()
    drm/i915: Use crtc_state_is_legacy_gamma in intel_color_check
    drm/i915/edp: Increase the T12 delay quirk to 1300ms
    drm/i915/edp: Get the Panel Power Off timestamp after panel is off
    sync_file: Return consistent status in SYNC_IOC_FILE_INFO
    drm/atomic: Unref duplicated drm_atomic_state in drm_atomic_helper_resume()

    Linus Torvalds
     
  • drm/i915 fixes for 4.14-rc5:

    Three fixes for stable:

    - Use crtc_state_is_legacy_gamma in intel_color_check (Maarten)
    - Read timings from the correct transcoder (Ville).
    - Fix HDMI on BSW (Jani).

    Other fixes:

    - eDP fixes (Manasi)
    - Silence compiler warnings (Chris)
    - Order two completing nop_submit_request (Chris)

    * tag 'drm-intel-fixes-2017-10-11' of git://anongit.freedesktop.org/drm/drm-intel:
    drm/i915/bios: parse DDI ports also for CHV for HDMI DDC pin and DP AUX channel
    drm/i915: Read timings from the correct transcoder in intel_crtc_mode_get()
    drm/i915: Order two completing nop_submit_request
    drm/i915: Silence compiler warning for hsw_power_well_enable()
    drm/i915: Use crtc_state_is_legacy_gamma in intel_color_check
    drm/i915/edp: Increase the T12 delay quirk to 1300ms
    drm/i915/edp: Get the Panel Power Off timestamp after panel is off

    Dave Airlie
     
  • bunch of msm fixes

    * 'msm-fixes-4.14-rc4' of git://people.freedesktop.org/~robclark/linux:
    drm/msm: fix _NO_IMPLICIT fencing case
    drm/msm: fix error path cleanup
    drm/msm/mdp5: Remove extra pm_runtime_put call in mdp5_crtc_cursor_set()
    drm/msm/dsi: Use correct pm_runtime_put variant during host_init
    drm/msm: fix return value check in _msm_gem_kernel_new()
    drm/msm: use proper memory barriers for updating tail/head
    drm/msm/mdp5: add missing max size for 8x74 v1

    Dave Airlie
     
  • Merge misc fixes from Andrew Morton:
    "18 fixes"

    * emailed patches from Andrew Morton:
    mm, swap: use page-cluster as max window of VMA based swap readahead
    mm: page_vma_mapped: ensure pmd is loaded with READ_ONCE outside of lock
    kmemleak: clear stale pointers from task stacks
    fs/binfmt_misc.c: node could be NULL when evicting inode
    fs/mpage.c: fix mpage_writepage() for pages with buffers
    linux/kernel.h: add/correct kernel-doc notation
    tty: fall back to N_NULL if switching to N_TTY fails during hangup
    Revert "vmalloc: back off when the current task is killed"
    mm/cma.c: take __GFP_NOWARN into account in cma_alloc()
    scripts/kallsyms.c: ignore symbol type 'n'
    userfaultfd: selftest: exercise -EEXIST only in background transfer
    mm: only display online cpus of the numa node
    mm: remove unnecessary WARN_ONCE in page_vma_mapped_walk().
    mm/mempolicy: fix NUMA_INTERLEAVE_HIT counter
    include/linux/of.h: provide of_n_{addr,size}_cells wrappers for !CONFIG_OF
    mm/madvise.c: add description for MADV_WIPEONFORK and MADV_KEEPONFORK
    lib/Kconfig.debug: kernel hacking menu: runtime testing: keep tests together
    mm/migrate: fix indexing bug (off by one) and avoid out of bound access

    Linus Torvalds
     
  • When the VMA based swap readahead was introduced, a new knob

    /sys/kernel/mm/swap/vma_ra_max_order

    was added as the max window of VMA swap readahead. This is to make it
    possible to use different max window for VMA based readahead and
    original physical readahead. But Minchan Kim pointed out that this will
    cause a regression because setting page-cluster sysctl to zero cannot
    disable swap readahead with the change.

    To fix the regression, the page-cluster sysctl is used as the max window
    of both the VMA based swap readahead and original physical swap
    readahead. If more fine grained control is needed in the future, more
    knobs can be added as the subordinate knobs of the page-cluster sysctl.
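
    As a sketch of the resulting behavior: page-cluster is a logarithmic
    value, so the maximum window is 2^page-cluster pages, and a value of
    zero gives a window of one page, which means no readahead. The
    function names and the clamp below are hypothetical.

```c
#include <stdbool.h>

/* Maximum readahead window in pages for a given page-cluster value. */
static unsigned int demo_swap_ra_max_pages(unsigned int page_cluster)
{
	if (page_cluster > 16)
		page_cluster = 16;	/* arbitrary clamp so the shift stays sane */
	return 1u << page_cluster;
}

/* A window of one page means only the faulting page is read. */
static bool demo_swap_ra_enabled(unsigned int page_cluster)
{
	return demo_swap_ra_max_pages(page_cluster) > 1;
}
```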

    The vma_ra_max_order knob is deleted. Because the knob was introduced
    in v4.14-rc1, and this patch is targeted for merging before the v4.14
    release, there should be no existing users of this newly added ABI.

    Link: http://lkml.kernel.org/r/20171011070847.16003-1-ying.huang@intel.com
    Fixes: ec560175c0b6fce ("mm, swap: VMA based swap readahead")
    Signed-off-by: "Huang, Ying"
    Reported-by: Minchan Kim
    Acked-by: Minchan Kim
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Fengguang Wu
    Cc: Tim Chen
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • Loading the pmd without holding the pmd_lock exposes us to races with
    concurrent updaters of the page tables but, worse still, it also allows
    the compiler to cache the pmd value in a register and reuse it later on,
    even if we've performed a READ_ONCE in between and seen a more recent
    value.

    In the case of page_vma_mapped_walk, this leads to the following crash
    when the pmd loaded for the initial pmd_trans_huge check is all zeroes
    and a subsequent valid table entry is loaded by check_pmd. We then
    proceed into map_pte, but the compiler re-uses the zero entry inside
    pte_offset_map, resulting in a junk pointer being installed in
    pvmw->pte:

    PC is at check_pte+0x20/0x170
    LR is at page_vma_mapped_walk+0x2e0/0x540
    [...]
    Process doio (pid: 2463, stack limit = 0xffff00000f2e8000)
    Call trace:
    check_pte+0x20/0x170
    page_vma_mapped_walk+0x2e0/0x540
    page_mkclean_one+0xac/0x278
    rmap_walk_file+0xf0/0x238
    rmap_walk+0x64/0xa0
    page_mkclean+0x90/0xa8
    clear_page_dirty_for_io+0x84/0x2a8
    mpage_submit_page+0x34/0x98
    mpage_process_page_bufs+0x164/0x170
    mpage_prepare_extent_to_map+0x134/0x2b8
    ext4_writepages+0x484/0xe30
    do_writepages+0x44/0xe8
    __filemap_fdatawrite_range+0xbc/0x110
    file_write_and_wait_range+0x48/0xd8
    ext4_sync_file+0x80/0x4b8
    vfs_fsync_range+0x64/0xc0
    SyS_msync+0x194/0x1e8

    This patch fixes the problem by ensuring that READ_ONCE is used before
    the initial checks on the pmd, and this value is subsequently used when
    checking whether or not the pmd is present. check_pmd is removed and
    the pmd_present check is inlined directly.
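
    The shape of the fix, in a user-space sketch: take one snapshot of
    the entry and make every later decision from that snapshot. The
    DEMO_* names, the bit layout and the macro are stand-ins (the macro
    relies on the GCC/Clang __typeof__ extension), not the kernel's
    definitions.

```c
#include <stdint.h>

/* User-space stand-in for READ_ONCE(): forces a single volatile load. */
#define DEMO_READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

typedef uint64_t demo_pmd_t;

#define DEMO_PMD_PRESENT	0x1u	/* hypothetical bit layout */
#define DEMO_PMD_HUGE		0x2u

enum demo_kind { DEMO_NONE, DEMO_HUGE, DEMO_TABLE };

static enum demo_kind demo_classify(demo_pmd_t *pmdp)
{
	/* One snapshot; every check below uses pmde, never *pmdp again. */
	demo_pmd_t pmde = DEMO_READ_ONCE(*pmdp);

	if (pmde & DEMO_PMD_HUGE)
		return DEMO_HUGE;
	if (!(pmde & DEMO_PMD_PRESENT))
		return DEMO_NONE;
	return DEMO_TABLE;
}
```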

    Link: http://lkml.kernel.org/r/1507222630-5839-1-git-send-email-will.deacon@arm.com
    Fixes: f27176cfc363 ("mm: convert page_mkclean_one() to use page_vma_mapped_walk()")
    Signed-off-by: Will Deacon
    Tested-by: Yury Norov
    Tested-by: Richard Ruigrok
    Acked-by: Kirill A. Shutemov
    Cc: "Paul E. McKenney"
    Cc: Peter Zijlstra
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Will Deacon
     
  • Kmemleak considers any pointers on task stacks as references. This
    patch clears newly allocated and reused vmap stacks.

    Link: http://lkml.kernel.org/r/150728990124.744199.8403409836394318684.stgit@buzz
    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov