12 Feb, 2019

1 commit

  • Interactive governor has lived in Android sources for a very long time
    and this commit is based on the code present in following branch:

    https://android.googlesource.com/kernel/common android-4.4

    The Interactive governor is designed for latency-sensitive workloads,
    such as interactive user interfaces on mobile phones and tablets.
    It aims to be significantly more responsive by ramping the CPU up
    quickly when CPU-intensive activity begins.

    Existing governors sample CPU load at a particular rate, typically every
    X ms and then update the frequency from a work-handler. This can lead
    to under-powering UI threads for the period of time during which the
    user begins interacting with a previously-idle system until the next
    sample period happens.

    The 'interactive' governor uses a different approach.

    A real-time thread is used for scaling up, giving the remaining tasks
    the CPU performance benefit, unlike existing governors which are more
    likely to schedule ramp-up work to occur after your performance starved
    tasks have completed.

    The Android version of interactive governor also checks whether to scale
    the CPU frequency up soon after coming out of idle. When the CPU comes
    out of idle, the governor checks whether the CPU sampling is overdue.
    If yes, it immediately starts the sampling. Otherwise, the utilization
    hooks from the scheduler handle the sampling later. If the CPU is very
    busy from exiting idle to when the evaluation happens, then it assumes
    that the CPU is under-powered and ramps it to MAX speed.

    If the CPU was not sufficiently busy to immediately ramp to MAX speed,
    then the governor evaluates the CPU load since the last speed
    adjustment, choosing the highest value between that longer-term load or
    the short-term load since idle exit to determine the CPU speed to ramp
    to.
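
    The load selection described above can be sketched roughly as follows.
    This is only an illustration: the threshold value, the function name and
    the proportional scaling are assumptions, not taken from the governor.

```c
/* Hypothetical threshold: short-term load (percent) that triggers MAX speed. */
#define BUSY_THRESHOLD_PCT 85

/*
 * Illustrative helper, not the governor's real code.
 * short_load_pct: load since idle exit
 * long_load_pct:  load since the last speed adjustment
 */
static unsigned int pick_target_freq(unsigned int short_load_pct,
                                     unsigned int long_load_pct,
                                     unsigned int max_freq_khz)
{
    unsigned int load;

    /* Very busy right after idle exit: assume under-powered, go to MAX. */
    if (short_load_pct >= BUSY_THRESHOLD_PCT)
        return max_freq_khz;

    /* Otherwise use the higher of the two load figures. */
    load = short_load_pct > long_load_pct ? short_load_pct : long_load_pct;

    /* Assumed policy: scale the frequency proportionally to that load. */
    return (unsigned int)((unsigned long long)max_freq_khz * load / 100);
}
```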

    Idle notifiers will be handled later and are not included for now.

    The core of this code is written and maintained (in Android
    repositories) by Mike Chan and Todd Poynor over a long period of time.

    Vireshk has made changes to the governor to align it with the current
    practices followed with mainline governors, like using utilization hooks
    from the scheduler and handling kobject (for governor's sysfs directory)
    in a race free manner. And of course this included general cleanup of
    the governor as well.

    Signed-off-by: Mike Chan
    Signed-off-by: Todd Poynor
    Signed-off-by: Viresh Kumar

    Viresh Kumar
     

10 Jan, 2019

1 commit

  • commit fde872682e175743e0c3ef939c89e3c6008a1529 upstream.

    Some time back, nfsd switched from calling vfs_fsync() to using a new
    commit_metadata() hook in export_operations(). If the file system did
    not provide a commit_metadata() hook, it fell back to using
    sync_inode_metadata(). Unfortunately, that doesn't work on all file
    systems. In particular, it doesn't work on ext4 due to how the inode
    gets journalled --- the VFS writeback code will not always call
    ext4_write_inode().

    So we need to provide our own ext4_nfs_commit_metadata() method which
    calls ext4_write_inode() directly.
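
    The hook-with-fallback pattern described above might look roughly like
    the sketch below. All types and function names here are hypothetical
    stand-ins, not the kernel's definitions.

```c
/* Hypothetical stand-in for an inode; records which path synced it. */
struct inode_sketch { int synced_by; };

struct export_ops_sketch {
    /* Optional hook; a null pointer means the file system provides none. */
    int (*commit_metadata)(struct inode_sketch *inode);
};

enum { SYNCED_GENERIC = 1, SYNCED_BY_FS = 2 };

/* Stands in for the generic fallback path. */
static int generic_sync_metadata(struct inode_sketch *inode)
{
    inode->synced_by = SYNCED_GENERIC;
    return 0;
}

/* Stands in for a file-system-specific method that writes the inode itself. */
static int fs_commit_metadata(struct inode_sketch *inode)
{
    inode->synced_by = SYNCED_BY_FS;
    return 0;
}

/* Caller side: prefer the file system's hook, else use the fallback. */
static int commit_inode_metadata(const struct export_ops_sketch *ops,
                                 struct inode_sketch *inode)
{
    if (ops->commit_metadata)
        return ops->commit_metadata(inode);
    return generic_sync_metadata(inode);
}
```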

    Google-Bug-Id: 121195940
    Signed-off-by: Theodore Ts'o
    Cc: stable@kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Theodore Ts'o
     

18 Aug, 2018

1 commit

  • commit 3f5fe9fef5b2da06b6319fab8123056da5217c3f upstream.

    The recent conversion of the task state recording to use task_state_index()
    broke the sched_switch tracepoint task state output.

    task_state_index() returns surprisingly an index (0-7) which is then
    printed with __print_flags() applying bitmasks. Not really working and
    resulting in weird states like 'prev_state=t' instead of 'prev_state=I'.

    Use TASK_REPORT_MAX instead of TASK_STATE_MAX to report preemption. Build a
    bitmask from the return value of task_state_index() and store it in
    entry->prev_state, which makes __print_flags() work as expected.
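
    A minimal user-space sketch of the index-to-bitmask conversion, assuming
    that index 0 means "running" and index N corresponds to bit N-1. The
    REPORT_MAX_SKETCH value is a placeholder, not the kernel's definition.

```c
/* Placeholder for TASK_REPORT_MAX; the real value lives in the kernel headers. */
#define REPORT_MAX_SKETCH 0x100u

/* Turn a state index into the single-bit mask that flag printing expects. */
static unsigned int state_index_to_mask(unsigned int index)
{
    /* Index 0 maps to an empty mask; index N maps to bit N-1. */
    return index ? 1u << (index - 1) : 0u;
}

/* Value stored as prev_state: a dedicated marker on preemption, else the mask. */
static unsigned int switch_state(int preempt, unsigned int index)
{
    if (preempt)
        return REPORT_MAX_SKETCH;
    return state_index_to_mask(index);
}
```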

    Signed-off-by: Thomas Gleixner
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: stable@vger.kernel.org
    Fixes: efb40f588b43 ("sched/tracing: Fix trace_sched_switch task-state printing")
    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1711221304180.1751@nanos
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

11 Jul, 2018

1 commit

  • commit 4ff648decf4712d39f184fc2df3163f43975575a upstream.

    Since the following commit:

    b91473ff6e97 ("sched,tracing: Update trace_sched_pi_setprio()")

    the sched_pi_setprio trace point shows the "newprio" during a deboost:

    |futex sched_pi_setprio: comm=futex_requeue_p pid=2234 oldprio=98 newprio=98
    |futex sched_switch: prev_comm=futex_requeue_p prev_pid=2234 prev_prio=120

    This patch open codes __rt_effective_prio() in the tracepoint as the
    'newprio' to get the old behaviour back / the correct priority:

    |futex sched_pi_setprio: comm=futex_requeue_p pid=2220 oldprio=98 newprio=120
    |futex sched_switch: prev_comm=futex_requeue_p prev_pid=2220 prev_prio=120

    Peter suggested to open code the new priority so people using tracehook
    could get the deadline data out.

    Reported-by: Mansky Christian
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Fixes: b91473ff6e97 ("sched,tracing: Update trace_sched_pi_setprio()")
    Link: http://lkml.kernel.org/r/20180524132647.gg6ziuogczdmjjzu@linutronix.de
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Sebastian Andrzej Siewior
     

23 May, 2018

1 commit

  • commit 45dd9b0666a162f8e4be76096716670cf1741f0e upstream.

    Doing an audit of trace events, I discovered two trace events in the xen
    subsystem that use a hack to create zero data size trace events. This is not
    what trace events are for. Trace events add memory footprint overhead, and
    if all you need to do is see if a function is hit or not, simply make that
    function noinline and use function tracer filtering.

    Worse yet, the hack used was:

    __array(char, x, 0)

    Which creates a static string of zero length. There are assumptions about
    such constructs in ftrace that this is a dynamic string that is nul
    terminated. This is not the case with these tracepoints and can cause
    problems in various parts of ftrace.

    Nuke the trace events!

    Link: http://lkml.kernel.org/r/20180509144605.5a220327@gandalf.local.home

    Cc: stable@vger.kernel.org
    Fixes: 95a7d76897c1e ("xen/mmu: Use Xen specific TLB flush instead of the generic one.")
    Reviewed-by: Juergen Gross
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (VMware)
     

26 Apr, 2018

1 commit

  • [ Upstream commit 91633eed73a3ac37aaece5c8c1f93a18bae616a9 ]

    So far only CLOCK_MONOTONIC and CLOCK_REALTIME were taken into account as
    well as HRTIMER_MODE_ABS/REL in the hrtimer_init tracepoint. The query for
    detecting the ABS or REL timer modes is not valid anymore, it got broken
    by the introduction of HRTIMER_MODE_PINNED.

    HRTIMER_MODE_PINNED is not evaluated in the hrtimer_init() call, but for the
    sake of completeness print all given modes.

    Signed-off-by: Anna-Maria Gleixner
    Cc: Christoph Hellwig
    Cc: John Stultz
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: keescook@chromium.org
    Link: http://lkml.kernel.org/r/20171221104205.7269-9-anna-maria@linutronix.de
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Anna-Maria Gleixner
     

29 Mar, 2018

1 commit

  • commit c658dc58c7eaa8569ceb0edd1ddbdfda84fe8aa5 upstream.

    Swap the positions of blk_addr and blksz in the tracepoint print arguments
    so that they match the print format.
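
    The class of bug is easy to reproduce with any printf-style format: the
    arguments must follow the order of the conversions. The format string
    and function name below are hypothetical, not the tracepoint's own.

```c
#include <stdio.h>

/* Formats both fields; the argument order mirrors the format string. */
static int format_blk(char *buf, size_t n,
                      unsigned int blk_addr, unsigned int blksz)
{
    /* Passing (blksz, blk_addr) here would compile but print swapped values. */
    return snprintf(buf, n, "blk_addr=%u blksz=%u", blk_addr, blksz);
}
```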

    Signed-off-by: Adrian Hunter
    Fixes: d2f82254e4e8 ("mmc: core: Add members to mmc_request and mmc_data for CQE's")
    Cc: # 4.14+
    Signed-off-by: Ulf Hansson
    Signed-off-by: Greg Kroah-Hartman

    Adrian Hunter
     

25 Feb, 2018

2 commits

  • [ Upstream commit 975b820b6836b6b6c42fb84cd2e772e2b41bca67 ]

    In some cases the clock parent is set to NULL when re-parenting, which
    causes a NULL pointer dereference if the clk_set trace event is
    enabled.

    This patch sets the parent as "none" if the input parameter is NULL.
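
    A minimal sketch of the NULL-tolerant lookup, using a hypothetical
    stand-in for the clock structure:

```c
#include <stddef.h>

/* Hypothetical stand-in for the clock core structure. */
struct clk_sketch { const char *name; };

/* Name to record in the trace entry; "none" when there is no parent. */
static const char *parent_name_or_none(const struct clk_sketch *parent)
{
    return parent != NULL ? parent->name : "none";
}
```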

    Fixes: dfc202ead312 (clk: Add tracepoints for hardware operations)
    Signed-off-by: Cai Li
    Signed-off-by: Chunyan Zhang
    Signed-off-by: Stephen Boyd
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Cai Li
     
  • [ Upstream commit 23721a755f98ac846897a013c92cccb281c1bcc8 ]

    We hit this compile warning, which is caused by a missing bpf.h include in xdp.h.

    In file included from ./include/trace/events/xdp.h:10:0,
    from ./include/linux/bpf_trace.h:6,
    from drivers/net/ethernet/intel/i40e/i40e_txrx.c:29:
    ./include/trace/events/xdp.h:93:17: warning: ‘struct bpf_map’ declared inside parameter list will not be visible outside of this definition or declaration
    const struct bpf_map *map, u32 map_index),
    ^
    ./include/linux/tracepoint.h:187:34: note: in definition of macro ‘__DECLARE_TRACE’
    static inline void trace_##name(proto) \
    ^~~~~
    ./include/linux/tracepoint.h:352:24: note: in expansion of macro ‘PARAMS’
    __DECLARE_TRACE(name, PARAMS(proto), PARAMS(args), \
    ^~~~~~
    ./include/linux/tracepoint.h:477:2: note: in expansion of macro ‘DECLARE_TRACE’
    DECLARE_TRACE(name, PARAMS(proto), PARAMS(args))
    ^~~~~~~~~~~~~
    ./include/linux/tracepoint.h:477:22: note: in expansion of macro ‘PARAMS’
    DECLARE_TRACE(name, PARAMS(proto), PARAMS(args))
    ^~~~~~
    ./include/trace/events/xdp.h:89:1: note: in expansion of macro ‘DEFINE_EVENT’
    DEFINE_EVENT(xdp_redirect_template, xdp_redirect,
    ^~~~~~~~~~~~
    ./include/trace/events/xdp.h:90:2: note: in expansion of macro ‘TP_PROTO’
    TP_PROTO(const struct net_device *dev,
    ^~~~~~~~
    ./include/trace/events/xdp.h:93:17: warning: ‘struct bpf_map’ declared inside parameter list will not be visible outside of this definition or declaration
    const struct bpf_map *map, u32 map_index),
    ^
    ./include/linux/tracepoint.h:203:38: note: in definition of macro ‘__DECLARE_TRACE’
    register_trace_##name(void (*probe)(data_proto), void *data) \
    ^~~~~~~~~~
    ./include/linux/tracepoint.h:354:4: note: in expansion of macro ‘PARAMS’
    PARAMS(void *__data, proto), \
    ^~~~~~

    Reported-by: Huang Daode
    Cc: Hanjun Guo
    Fixes: 8d3b778ff544 ("xdp: tracepoint xdp_redirect also need a map argument")
    Signed-off-by: Xie XiuQi
    Acked-by: Jesper Dangaard Brouer
    Acked-by: Steven Rostedt (VMware)
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Xie XiuQi
     

22 Feb, 2018

2 commits

  • commit 1299ef1d8870d2d9f09a5aadf2f8b2c887c2d033 upstream.

    flush_tlb_single() and flush_tlb_one() sound almost identical, but
    they really mean "flush one user translation" and "flush one kernel
    translation". Rename them to flush_tlb_one_user() and
    flush_tlb_one_kernel() to make the semantics more obvious.

    [ I was looking at some PTI-related code, and the flush-one-address code
    is unnecessarily hard to understand because the names of the helpers are
    uninformative. This came up during PTI review, but no one got around to
    doing it. ]

    Signed-off-by: Andy Lutomirski
    Acked-by: Peter Zijlstra (Intel)
    Cc: Boris Ostrovsky
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: Eduardo Valentin
    Cc: Hugh Dickins
    Cc: Josh Poimboeuf
    Cc: Juergen Gross
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Linux-MM
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Link: http://lkml.kernel.org/r/3303b02e3c3d049dc5235d5651e0ae6d29a34354.1517414378.git.luto@kernel.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Andy Lutomirski
     
  • commit d8be75663cec0069b85f80191abd2682ce4a512f upstream.

    Now that kmemcheck is gone, we don't need the NOTRACK flags.

    Link: http://lkml.kernel.org/r/20171007030159.22241-5-alexander.levin@verizon.com
    Signed-off-by: Sasha Levin
    Cc: Alexander Potapenko
    Cc: Eric W. Biederman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Steven Rostedt
    Cc: Tim Hansen
    Cc: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Levin, Alexander (Sasha Levin)
     

04 Feb, 2018

1 commit

  • [ Upstream commit f859ab61875978eeaa539740ff7f7d91f5d60006 ]

    RxRPC service endpoints expire like they're supposed to by the following
    means:

    (1) Mark dead rxrpc_net structs (with ->live) rather than twiddling the
    global service conn timeout, otherwise the first rxrpc_net struct to
    die will cause connections on all others to expire immediately from
    then on.

    (2) Mark local service endpoints for which the socket has been closed
    (->service_closed) so that the expiration timeout can be much
    shortened for service and client connections going through that
    endpoint.

    (3) rxrpc_put_service_conn() needs to schedule the reaper when the usage
    count reaches 1, not 0, as idle conns have a 1 count.

    (4) The accumulator for the earliest time we might want to schedule for
    should be initialised to jiffies + MAX_JIFFY_OFFSET, not ULONG_MAX as
    the comparison functions use signed arithmetic.

    (5) Simplify the expiration handling, adding the expiration value to the
    idle timestamp each time rather than keeping track of the time in the
    past before which the idle timestamp must go to be expired. This is
    much easier to read.

    (6) Ignore the timeouts if the net namespace is dead.

    (7) Restart the service reaper work item rather than the client reaper.
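
    Point (4) can be illustrated in user space. The sketch below restates
    the wrap-safe "before" comparison and the offset constant locally; the
    names are illustrative and the cast relies on two's-complement
    conversion, as the usual compilers provide.

```c
#include <limits.h>

/* Local stand-in for the kernel's maximum safe jiffies offset. */
#define MAX_OFFSET_SKETCH ((LONG_MAX >> 1) - 1)

/* Wrap-safe "a is earlier than b": the difference is read as signed. */
static int before(unsigned long a, unsigned long b)
{
    return (long)(a - b) < 0;
}

/* Keep whichever of the running minimum and the candidate is earlier. */
static unsigned long earliest_of(unsigned long earliest, unsigned long candidate)
{
    return before(candidate, earliest) ? candidate : earliest;
}
```

    With the accumulator started at ULONG_MAX, a candidate slightly in the
    future compares as "later" and is never picked; starting at
    now + MAX_OFFSET_SKETCH avoids that.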

    Signed-off-by: David Howells
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    David Howells
     

17 Jan, 2018

1 commit

  • commit e39d200fa5bf5b94a0948db0dae44c1b73b84a56 upstream.

    Reported by syzkaller:

    BUG: KASAN: stack-out-of-bounds in write_mmio+0x11e/0x270 [kvm]
    Read of size 8 at addr ffff8803259df7f8 by task syz-executor/32298

    CPU: 6 PID: 32298 Comm: syz-executor Tainted: G OE 4.15.0-rc2+ #18
    Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016
    Call Trace:
    dump_stack+0xab/0xe1
    print_address_description+0x6b/0x290
    kasan_report+0x28a/0x370
    write_mmio+0x11e/0x270 [kvm]
    emulator_read_write_onepage+0x311/0x600 [kvm]
    emulator_read_write+0xef/0x240 [kvm]
    emulator_fix_hypercall+0x105/0x150 [kvm]
    em_hypercall+0x2b/0x80 [kvm]
    x86_emulate_insn+0x2b1/0x1640 [kvm]
    x86_emulate_instruction+0x39a/0xb90 [kvm]
    handle_exception+0x1b4/0x4d0 [kvm_intel]
    vcpu_enter_guest+0x15a0/0x2640 [kvm]
    kvm_arch_vcpu_ioctl_run+0x549/0x7d0 [kvm]
    kvm_vcpu_ioctl+0x479/0x880 [kvm]
    do_vfs_ioctl+0x142/0x9a0
    SyS_ioctl+0x74/0x80
    entry_SYSCALL_64_fastpath+0x23/0x9a

    The path of patched vmmcall will patch 3 bytes opcode 0F 01 C1(vmcall)
    to the guest memory, however, write_mmio tracepoint always prints 8 bytes
    through *(u64 *)val since kvm splits the mmio access into 8 bytes. This
    leaks 5 bytes from the kernel stack (CVE-2017-17741). This patch fixes
    it by just accessing the bytes which we operate on.

    Before patch:

    syz-executor-5567 [007] .... 51370.561696: kvm_mmio: mmio write len 3 gpa 0x10 val 0x1ffff10077c1010f

    After patch:

    syz-executor-13416 [002] .... 51302.299573: kvm_mmio: mmio write len 3 gpa 0x10 val 0xc1010f
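
    The idea of the fix, copying only the bytes actually operated on into a
    zeroed 64-bit value, can be sketched as below. The function name is
    hypothetical, and the 0xc1010f result assumes a little-endian machine.

```c
#include <stdint.h>
#include <string.h>

/* Build the traced value from at most len bytes, never reading past them. */
static uint64_t mmio_trace_val(const void *val, unsigned int len)
{
    uint64_t out = 0;
    size_t n = len < sizeof(out) ? len : sizeof(out);

    /* Unused high bytes stay zero instead of exposing adjacent stack data. */
    if (val)
        memcpy(&out, val, n);
    return out;
}
```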

    Reported-by: Dmitry Vyukov
    Reviewed-by: Darren Kenny
    Reviewed-by: Marc Zyngier
    Tested-by: Marc Zyngier
    Cc: Paolo Bonzini
    Cc: Radim Krčmář
    Cc: Marc Zyngier
    Cc: Christoffer Dall
    Signed-off-by: Wanpeng Li
    Signed-off-by: Paolo Bonzini
    Cc: Mathieu Desnoyers
    Signed-off-by: Greg Kroah-Hartman

    Wanpeng Li
     

30 Nov, 2017

1 commit


02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

29 Sep, 2017

3 commits

  • Currently TASK_PARKED is masqueraded as TASK_INTERRUPTIBLE, give it
    its own print state because it will not in fact get woken by regular
    wakeups and is a long-term state.

    This requires moving TASK_PARKED into the TASK_REPORT mask, and since
    that latter needs to be a contiguous bitmask, we need to shuffle the
    bits around a bit.
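
    One way to check the contiguity requirement is the classic bit trick
    below. It assumes the mask must be a run of set bits starting at bit 0;
    the function name is illustrative.

```c
/* True if mask has the form 0b0...011...1 (contiguous from bit 0). */
static int is_low_contiguous_mask(unsigned int mask)
{
    /* Adding one to such a mask carries past every set bit, so no bit overlaps. */
    return mask != 0 && (mask & (mask + 1)) == 0;
}
```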

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Markus reported that kthreads that idle using TASK_IDLE instead of
    TASK_INTERRUPTIBLE are reported as TASK_UNINTERRUPTIBLE and things
    like htop mark those red.

    This is undesirable, so add an explicit state for TASK_IDLE.

    Reported-by: Markus Trippelsdorf
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Convert trace_sched_switch to use the common task-state helpers and
    fix the "X" and "Z" order, possibly they ended up in the wrong order
    because TASK_REPORT has them in the wrong order too.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

17 Sep, 2017

1 commit

  • Pull networking fixes from David Miller:

    1) Fix hotplug deadlock in hv_netvsc, from Stephen Hemminger.

    2) Fix double-free in rmnet driver, from Dan Carpenter.

    3) INET connection socket layer can double put request sockets, fix
    from Eric Dumazet.

    4) Don't match collect metadata-mode tunnels if the device is down,
    from Haishuang Yan.

    5) Do not perform TSO6/GSO on ipv6 packets with extensions headers in
    be2net driver, from Suresh Reddy.

    6) Fix scaling error in gen_estimator, from Eric Dumazet.

    7) Fix 64-bit statistics deadlock in systemport driver, from Florian
    Fainelli.

    8) Fix use-after-free in sctp_sock_dump, from Xin Long.

    9) Reject invalid BPF_END instructions in verifier, from Edward Cree.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (43 commits)
    mlxsw: spectrum_router: Only handle IPv4 and IPv6 events
    Documentation: link in networking docs
    tcp: fix data delivery rate
    bpf/verifier: reject BPF_ALU64|BPF_END
    sctp: do not mark sk dumped when inet_sctp_diag_fill returns err
    sctp: fix an use-after-free issue in sctp_sock_dump
    netvsc: increase default receive buffer size
    tcp: update skb->skb_mstamp more carefully
    net: ipv4: fix l3slave check for index returned in IP_PKTINFO
    net: smsc911x: Quieten netif during suspend
    net: systemport: Fix 64-bit stats deadlock
    net: vrf: avoid gcc-4.6 warning
    qed: remove unnecessary call to memset
    tg3: clean up redundant initialization of tnapi
    tls: make tls_sw_free_resources static
    sctp: potential read out of bounds in sctp_ulpevent_type_enabled()
    MAINTAINERS: review Renesas DT bindings as well
    net_sched: gen_estimator: fix scaling error in bytes/packets samples
    nfp: wait for the NSP resource to appear on boot
    nfp: wait for board state before talking to the NSP
    ...

    Linus Torvalds
     

16 Sep, 2017

1 commit

  • Pull more KVM updates from Paolo Bonzini:
    - PPC bugfixes
    - RCU splat fix
    - swait races fix
    - pointless userspace-triggerable BUG() fix
    - misc fixes for KVM_RUN corner cases
    - nested virt correctness fixes + one host DoS
    - some cleanups
    - clang build fix
    - fix AMD AVIC with default QEMU command line options
    - x86 bugfixes

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (28 commits)
    kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly
    kvm: vmx: Handle VMLAUNCH/VMRESUME failure properly
    kvm: nVMX: Remove nested_vmx_succeed after successful VM-entry
    kvm,mips: Fix potential swait_active() races
    kvm,powerpc: Serialize wq active checks in ops->vcpu_kick
    kvm: Serialize wq active checks in kvm_vcpu_wake_up()
    kvm,x86: Fix apf_task_wake_one() wq serialization
    kvm,lapic: Justify use of swait_active()
    kvm,async_pf: Use swq_has_sleeper()
    sched/wait: Add swq_has_sleeper()
    KVM: VMX: Do not BUG() on out-of-bounds guest IRQ
    KVM: Don't accept obviously wrong gsi values via KVM_IRQFD
    kvm: nVMX: Don't allow L2 to access the hardware CR8
    KVM: trace events: update list of exit reasons
    KVM: async_pf: Fix #DF due to inject "Page not Present" and "Page Ready" exceptions simultaneously
    KVM: X86: Don't block vCPU if there is pending exception
    KVM: SVM: Add irqchip_split() checks before enabling AVIC
    KVM: Add struct kvm_vcpu pointer parameter to get_enable_apicv()
    KVM: SVM: Refactor AVIC vcpu initialization into avic_init_vcpu()
    KVM: x86: fix clang build
    ...

    Linus Torvalds
     

15 Sep, 2017

1 commit


14 Sep, 2017

2 commits

  • GFP_TEMPORARY was introduced by commit e12ba74d8ff3 ("Group short-lived
    and reclaimable kernel allocations") along with __GFP_RECLAIMABLE. Its
    primary motivation was to allow users to tell that an allocation is
    short lived and so the allocator can try to place such allocations close
    together and prevent long term fragmentation. As much as this sounds
    like a reasonable semantic it becomes much less clear when to use the
    highlevel GFP_TEMPORARY allocation flag. How long is temporary? Can the
    context holding that memory sleep? Can it take locks? It seems there is
    no good answer for those questions.

    The current implementation of GFP_TEMPORARY is basically GFP_KERNEL |
    __GFP_RECLAIMABLE which in itself is tricky because basically none of
    the existing callers provide a way to reclaim the allocated memory. So
    this is rather misleading and hard to evaluate for any benefits.

    I have checked some random users and none of them has added the flag
    with a specific justification. I suspect most of them just copied from
    other existing users and others just thought it might be a good idea to
    use without any measuring. This suggests that GFP_TEMPORARY just
    encourages cargo cult usage without any reasoning.

    I believe that our gfp flags are quite complex already and especially
    those with high-level semantics should be clearly defined to prevent
    confusion and abuse. Therefore I propose dropping GFP_TEMPORARY and
    replacing all existing users to simply use GFP_KERNEL. Please note that
    SLAB users with shrinkers will still get __GFP_RECLAIMABLE heuristic and
    so they will be placed properly for memory fragmentation prevention.

    I can see reasons we might want some gfp flag to reflect short-term
    allocations but I propose starting from a clear semantic definition and
    only then add users with proper justification.

    This was brought up before LSF this year by Matthew [1] and it
    turned out that GFP_TEMPORARY really doesn't have a clear semantic. It
    seems to be a heuristic without any measured advantage for most (if not
    all) its current users. The follow up discussion has revealed that
    opinions on what might be temporary allocation differ a lot between
    developers. So rather than trying to tweak existing users into a
    semantic which they haven't expected I propose to simply remove the flag
    and start from scratch if we really need a semantic for short term
    allocations.

    [1] http://lkml.kernel.org/r/20170118054945.GD18349@bombadil.infradead.org

    [akpm@linux-foundation.org: fix typo]
    [akpm@linux-foundation.org: coding-style fixes]
    [sfr@canb.auug.org.au: drm/i915: fix up]
    Link: http://lkml.kernel.org/r/20170816144703.378d4f4d@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20170728091904.14627-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Stephen Rothwell
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Matthew Wilcox
    Cc: Neil Brown
    Cc: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Pull block fixes from Jens Axboe:
    "Small collection of fixes that would be nice to have in -rc1. This
    contains:

    - NVMe pull request from Christoph, mostly with fixes for nvme-pci,
    host memory buffer in particular.

    - Error handling fixup for cgwb_create(), in case allocation of 'wb'
    fails. From Christophe Jaillet.

    - Ensure that trace_block_getrq() gets the 'dev' in an appropriate
    fashion, to avoid a potential NULL deref. From Greg Thelen.

    - Regression fix for dm-mq with blk-mq, fixing a problem with
    stacking IO schedulers. From me.

    - string.h fixup, fixing an issue with memcpy_and_pad(). This
    original change came in through an NVMe dependency, which is why
    I'm including it here. From Martin Wilck.

    - Fix potential int overflow in __blkdev_sectors_to_bio_pages(), from
    Mikulas.

    - MBR enable fix for sed-opal, from Scott"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    block: directly insert blk-mq request from blk_insert_cloned_request()
    mm/backing-dev.c: fix an error handling path in 'cgwb_create()'
    string.h: un-fortify memcpy_and_pad
    nvme-pci: implement the HMB entry number and size limitations
    nvme-pci: propagate (some) errors from host memory buffer setup
    nvme-pci: use appropriate initial chunk size for HMB allocation
    nvme-pci: fix host memory buffer allocation fallback
    nvme: fix lightnvm check
    block: fix integer overflow in __blkdev_sectors_to_bio_pages()
    block: sed-opal: Set MBRDone on S3 resume path if TPER is MBREnabled
    block: tolerate tracing of NULL bio

    Linus Torvalds
     

13 Sep, 2017

1 commit

  • Pull f2fs updates from Jaegeuk Kim:
    "In this round, we've mostly tuned f2fs to provide better user
    experience for Android. Especially, we've worked on atomic write
    feature again with SQLite community in order to support it officially.
    And we added or modified several facilities to analyze and enhance IO
    behaviors.

    Major changes include:
    - add app/fs io stat
    - add inode checksum feature
    - support project/journalled quota
    - enhance atomic write with new ioctl() which exposes feature set
    - enhance background gc/discard/fstrim flows with new gc_urgent mode
    - add F2FS_IOC_FS{GET,SET}XATTR
    - fix some quota flows"

    * tag 'f2fs-for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (63 commits)
    f2fs: hurry up to issue discard after io interruption
    f2fs: fix to show correct discard_granularity in sysfs
    f2fs: detect dirty inode in evict_inode
    f2fs: clear radix tree dirty tag of pages whose dirty flag is cleared
    f2fs: speed up gc_urgent mode with SSR
    f2fs: better to wait for fstrim completion
    f2fs: avoid race in between read xattr & write xattr
    f2fs: make get_lock_data_page to handle encrypted inode
    f2fs: use generic terms used for encrypted block management
    f2fs: introduce f2fs_encrypted_file for clean-up
    Revert "f2fs: add a new function get_ssr_cost"
    f2fs: constify super_operations
    f2fs: fix to wake up all sleeping flusher
    f2fs: avoid race in between atomic_read & atomic_inc
    f2fs: remove unneeded parameter of change_curseg
    f2fs: update i_flags correctly
    f2fs: don't check inode's checksum if it was dirtied or writebacked
    f2fs: don't need to update inode checksum for recovery
    f2fs: trigger fdatasync for non-atomic_write file
    f2fs: fix to avoid race in between aio and gc
    ...

    Linus Torvalds
     

12 Sep, 2017

1 commit

  • Using bpf_redirect_map is allowed for generic XDP programs, but the
    appropriate map lookup was never performed in xdp_do_generic_redirect().

    Instead the map-index is directly used as the ifindex. For the
    xdp_redirect_map sample in SKB-mode '-S', this resulted in trying
    to send on ifindex 0, which isn't valid, so the SKB
    packets were dropped. Thus, the reported performance numbers are wrong in
    commit 24251c264798 ("samples/bpf: add option for native and skb mode
    for redirect apps") for the 'xdp_redirect_map -S' case.

    Before commit 109980b894e9 ("bpf: don't select potentially stale
    ri->map from buggy xdp progs") it could crash the kernel. Like that
    commit, also check that the map_owner is correct before
    dereferencing the map pointer. But make sure that this API misuse
    can be caught by a tracepoint, allowing userspace to detect
    misbehaving bpf_progs via tracepoints.

    Fixes: 6103aa96ec07 ("net: implement XDP_REDIRECT for xdp generic")
    Fixes: 24251c264798 ("samples/bpf: add option for native and skb mode for redirect apps")
    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
     

11 Sep, 2017

1 commit

  • __get_request() can call trace_block_getrq() with bio=NULL which causes
    block_get_rq::TP_fast_assign() to deref a NULL pointer and panic.

    Syzkaller fuzzer panics with
    linux-next (1d53d908b79d7870d89063062584eead4cf83448):
    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] SMP KASAN
    Modules linked in:
    CPU: 0 PID: 2983 Comm: syzkaller401111 Not tainted 4.13.0-rc7-next-20170901+ #13
    task: ffff8801cf1da000 task.stack: ffff8801ce440000
    RIP: 0010:perf_trace_block_get_rq+0x697/0x970 include/trace/events/block.h:384
    RSP: 0018:ffff8801ce4473f0 EFLAGS: 00010246
    RAX: ffff8801cf1da000 RBX: 1ffff10039c88e84 RCX: 1ffffd1ffff84d27
    RDX: dffffc0000000001 RSI: 1ffff1003b643e7a RDI: ffffe8ffffc26938
    RBP: ffff8801ce447530 R08: 1ffff1003b643e6c R09: ffffe8ffffc26964
    R10: 0000000000000002 R11: fffff91ffff84d2d R12: ffffe8ffffc1f890
    R13: ffffe8ffffc26930 R14: ffffffff85cad9e0 R15: 0000000000000000
    FS: 0000000002641880(0000) GS:ffff8801db200000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000000000043e670 CR3: 00000001d1d7a000 CR4: 00000000001406f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    trace_block_getrq include/trace/events/block.h:423 [inline]
    __get_request block/blk-core.c:1283 [inline]
    get_request+0x1518/0x23b0 block/blk-core.c:1355
    blk_old_get_request block/blk-core.c:1402 [inline]
    blk_get_request+0x1d8/0x3c0 block/blk-core.c:1427
    sg_scsi_ioctl+0x117/0x750 block/scsi_ioctl.c:451
    sg_ioctl+0x192d/0x2ed0 drivers/scsi/sg.c:1070
    vfs_ioctl fs/ioctl.c:45 [inline]
    do_vfs_ioctl+0x1b1/0x1530 fs/ioctl.c:685
    SYSC_ioctl fs/ioctl.c:700 [inline]
    SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
    entry_SYSCALL_64_fastpath+0x1f/0xbe

    block_get_rq::TP_fast_assign() has multiple redundant ->dev assignments.
    Only one of them is NULL tolerant. Favor the NULL tolerant one.
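
    A minimal userspace sketch of the NULL-tolerant pattern, for
    illustration. struct bio_model, struct getrq_entry and
    fill_getrq_entry() are hypothetical stand-ins, not the real
    tracepoint definitions; the point is only that every field derived
    from the bio falls back to 0 when no bio is passed.

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct bio and the trace entry. */
struct bio_model {
    unsigned int dev;
    unsigned long long sector;
};

struct getrq_entry {
    unsigned int dev;
    unsigned long long sector;
};

/* Fill the trace entry without dereferencing a NULL bio. */
static void fill_getrq_entry(struct getrq_entry *e,
                             const struct bio_model *bio)
{
    e->dev = bio ? bio->dev : 0;        /* NULL-tolerant assignment */
    e->sector = bio ? bio->sector : 0;  /* same guard for the sector */
}
```

    Calling fill_getrq_entry() with a NULL bio leaves both fields at 0
    instead of faulting, which is the behaviour the fix favours.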

    Fixes: 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer and partitions index")
    Reviewed-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Greg Thelen
    Signed-off-by: Jens Axboe

    Greg Thelen
     

10 Sep, 2017

1 commit

  • Pull btrfs updates from David Sterba:
    "The changes range through all types: cleanups, core changes, sanity
    checks, fixes, other user visible changes, detailed list below:

    - deprecated: user transaction ioctl

    - mount option ssd does not change allocation alignments

    - degraded read-write mount is allowed if all the raid profile
    constraints are met, now based on more accurate check

    - defrag: do not reset compression afterwards; the NOCOMPRESS flag
    can be now overridden by defrag

    - prep work for better extent reference tracking (related to the
    qgroup slowness with balance)

    - prep work for compression heuristics

    - memory allocation reductions (may help latencies on a loaded
    system)

    - better accounting for io waiting states

    - error handling improvements (removed BUGs)

    - added more sanity checks for shared refs

    - fix readdir vs pagefault deadlock under some circumstances

    - fix for 'no-hole' mode, certain combination of compressed and
    inline extents

    - send: fix emission of invalid clone operations

    - fixup file mode if setting acls fail

    - more fixes from fuzzing

    - other cleanups"

    * 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (104 commits)
    btrfs: submit superblock io with REQ_META and REQ_PRIO
    btrfs: remove unnecessary memory barrier in btrfs_direct_IO
    btrfs: remove superfluous chunk_tree argument from btrfs_alloc_dev_extent
    btrfs: Remove chunk_objectid parameter of btrfs_alloc_dev_extent
    btrfs: pass fs_info to btrfs_del_root instead of tree_root
    Btrfs: add one more sanity check for shared ref type
    Btrfs: remove BUG_ON in __add_tree_block
    Btrfs: remove BUG() in add_data_reference
    Btrfs: remove BUG() in print_extent_item
    Btrfs: remove BUG() in btrfs_extent_inline_ref_size
    Btrfs: convert to use btrfs_get_extent_inline_ref_type
    Btrfs: add a helper to retrive extent inline ref type
    btrfs: scrub: simplify scrub worker initialization
    btrfs: scrub: clean up division in scrub_find_csum
    btrfs: scrub: clean up division in __scrub_mark_bitmap
    btrfs: scrub: use bool for flush_all_writes
    btrfs: preserve i_mode if __btrfs_set_acl() fails
    btrfs: Remove extraneous chunk_objectid variable
    btrfs: Remove chunk_objectid argument from btrfs_make_block_group
    btrfs: Remove extra parentheses from condition in copy_items()
    ...

    Linus Torvalds
     

08 Sep, 2017

3 commits

  • Pull MMC updates from Ulf Hansson:
    "MMC core:
    - Continue to refactor the mmc block code to prepare for blkmq
    - Move mmc block debugfs into block module
    - Next step for eMMC CMDQ by adding a new mmc host interface for it
    - Move Kconfig option MMC_DEBUG from core to host
    - Some additional minor improvements

    MMC host:
    - Declare structs as const when applicable
    - Explicitly request exclusive reset control when applicable
    - Improve some error paths and other various cleanups
    - sdhci: Preparations to support SDHCI OMAP
    - sdhci: Improve some PM related code
    - sdhci: Re-factoring and modernizations
    - sdhci-xenon: Add runtime PM and system sleep support
    - sdhci-xenon: Add support for eMMC HS400 Enhanced Strobe
    - sdhci-cadence: Add system sleep support
    - sdhci-of-at91: Improve system sleep support
    - dw_mmc: Add support for Hisilicon hi3660
    - sunxi: Add support for A83T eMMC
    - sunxi: Add support for DDR52 mode
    - meson-gx: Add support for UHS-I SD-cards
    - meson-gx: Cleanups and improvements
    - tmio: Fix CMD12 (STOP) handling
    - tmio: Cleanups and improvements
    - renesas_sdhi: Add r8a7743/5 support
    - renesas-sdhi: Add support for R-Car Gen3 SDHI DMAC
    - renesas_sdhi: Cleanups and improvements"

    * tag 'mmc-v4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc: (145 commits)
    mmc: renesas_sdhi: Add r8a7743/5 support
    mmc: meson-gx: fix __ffsdi2 undefined on arm32
    mmc: sdhci-xenon: add runtime pm support and reimplement standby
    mmc: core: Move mmc_start_areq() declaration
    mmc: mmci: stop building qcom dml as module
    mmc: sunxi: Reset the device at probe time
    clk: sunxi-ng: Provide a default reset hook
    mmc: meson-gx: rework tuning function
    mmc: meson-gx: change default tx phase
    mmc: meson-gx: implement voltage switch callback
    mmc: meson-gx: use CCF to handle the clock phases
    mmc: meson-gx: implement card_busy callback
    mmc: meson-gx: simplify interrupt handler
    mmc: meson-gx: work around clk-stop issue
    mmc: meson-gx: fix dual data rate mode frequencies
    mmc: meson-gx: rework clock init function
    mmc: meson-gx: rework clk_set function
    mmc: meson-gx: rework set_ios function
    mmc: meson-gx: cfg init overwrite values
    mmc: meson-gx: initialize sane clk default before clock register
    ...

    Linus Torvalds
     
  • Pull block layer updates from Jens Axboe:
    "This is the first pull request for 4.14, containing most of the code
    changes. It's a quiet series this round, which I think we needed after
    the churn of the last few series. This contains:

    - Fix for a registration race in loop, from Anton Volkov.

    - Overflow complaint fix from Arnd for DAC960.

    - Series of drbd changes from the usual suspects.

    - Conversion of the stec/skd driver to blk-mq. From Bart.

    - A few BFQ improvements/fixes from Paolo.

    - CFQ improvement from Ritesh, allowing idling for group idle.

    - A few fixes found by Dan's smatch, courtesy of Dan.

    - A warning fixup for a race between changing the IO scheduler and
    device removal. From David Jeffery.

    - A few nbd fixes from Josef.

    - Support for cgroup info in blktrace, from Shaohua.

    - Also from Shaohua, new features in the null_blk driver to allow it
    to actually hold data, among other things.

    - Various corner cases and error handling fixes from Weiping Zhang.

    - Improvements to the IO stats tracking for blk-mq from me. Can
    drastically improve performance for fast devices and/or big
    machines.

    - Series from Christoph removing bi_bdev as being needed for IO
    submission, in preparation for nvme multipathing code.

    - Series from Bart, including various cleanups and fixes for switch
    fall through case complaints"

    * 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
    kernfs: checking for IS_ERR() instead of NULL
    drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
    drbd: Fix allyesconfig build, fix recent commit
    drbd: switch from kmalloc() to kmalloc_array()
    drbd: abort drbd_start_resync if there is no connection
    drbd: move global variables to drbd namespace and make some static
    drbd: rename "usermode_helper" to "drbd_usermode_helper"
    drbd: fix race between handshake and admin disconnect/down
    drbd: fix potential deadlock when trying to detach during handshake
    drbd: A single dot should be put into a sequence.
    drbd: fix rmmod cleanup, remove _all_ debugfs entries
    drbd: Use setup_timer() instead of init_timer() to simplify the code.
    drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
    drbd: new disk-option disable-write-same
    drbd: Fix resource role for newly created resources in events2
    drbd: mark symbols static where possible
    drbd: Send P_NEG_ACK upon write error in protocol != C
    drbd: add explicit plugging when submitting batches
    drbd: change list_for_each_safe to while(list_first_entry_or_null)
    drbd: introduce drbd_recv_header_maybe_unplug
    ...

    Linus Torvalds
     
  • Pull xen updates from Juergen Gross:

    - the new pvcalls backend for routing socket calls from a guest to dom0

    - some cleanups of Xen code

    - a fix for wrong usage of {get,put}_cpu()

    * tag 'for-linus-4.14b-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (27 commits)
    xen/mmu: set MMU_NORMAL_PT_UPDATE in remap_area_mfn_pte_fn
    xen: Don't try to call xen_alloc_p2m_entry() on autotranslating guests
    xen/events: events_fifo: Don't use {get,put}_cpu() in xen_evtchn_fifo_init()
    xen/pvcalls: use WARN_ON(1) instead of __WARN()
    xen: remove not used trace functions
    xen: remove unused function xen_set_domain_pte()
    xen: remove tests for pvh mode in pure pv paths
    xen-platform: constify pci_device_id.
    xen: cleanup xen.h
    xen: introduce a Kconfig option to enable the pvcalls backend
    xen/pvcalls: implement write
    xen/pvcalls: implement read
    xen/pvcalls: implement the ioworker functions
    xen/pvcalls: disconnect and module_exit
    xen/pvcalls: implement release command
    xen/pvcalls: implement poll command
    xen/pvcalls: implement accept command
    xen/pvcalls: implement listen command
    xen/pvcalls: implement bind command
    xen/pvcalls: implement connect command
    ...

    Linus Torvalds
     

07 Sep, 2017

4 commits

  • Merge updates from Andrew Morton:

    - various misc bits

    - DAX updates

    - OCFS2

    - most of MM

    * emailed patches from Andrew Morton: (119 commits)
    mm,fork: introduce MADV_WIPEONFORK
    x86,mpx: make mpx depend on x86-64 to free up VMA flag
    mm: add /proc/pid/smaps_rollup
    mm: hugetlb: clear target sub-page last when clearing huge page
    mm: oom: let oom_reap_task and exit_mmap run concurrently
    swap: choose swap device according to numa node
    mm: replace TIF_MEMDIE checks by tsk_is_oom_victim
    mm, oom: do not rely on TIF_MEMDIE for memory reserves access
    z3fold: use per-cpu unbuddied lists
    mm, swap: don't use VMA based swap readahead if HDD is used as swap
    mm, swap: add sysfs interface for VMA based swap readahead
    mm, swap: VMA based swap readahead
    mm, swap: fix swap readahead marking
    mm, swap: add swap readahead hit statistics
    mm/vmalloc.c: don't reinvent the wheel but use existing llist API
    mm/vmstat.c: fix wrong comment
    selftests/memfd: add memfd_create hugetlbfs selftest
    mm/shmem: add hugetlbfs support to memfd_create()
    mm, devm_memremap_pages: use multi-order radix for ZONE_DEVICE lookups
    mm/vmalloc.c: halve the number of comparisons performed in pcpu_get_vm_areas()
    ...

    Linus Torvalds
     
  • Introduce MADV_WIPEONFORK semantics, which result in a VMA being empty
    in the child process after fork. This differs from MADV_DONTFORK in one
    important way.

    If a child process accesses memory that was MADV_WIPEONFORK, it will get
    zeroes. The address ranges are still valid, they are just empty.

    If a child process accesses memory that was MADV_DONTFORK, it will get a
    segmentation fault, since those address ranges are no longer valid in
    the child after fork.

    Since MADV_DONTFORK also seems to be used to allow very large programs
    to fork in systems with strict memory overcommit restrictions, changing
    the semantics of MADV_DONTFORK might break existing programs.

    MADV_WIPEONFORK only works on private, anonymous VMAs.

    The use case is libraries that store or cache information, and want to
    know that they need to regenerate it in the child process after fork.

    Examples of this would be:
    - systemd/pulseaudio API checks (fail after fork) (replacing a getpid
    check, which is too slow without a PID cache)
    - PKCS#11 API reinitialization check (mandated by specification)
    - glibc's upcoming PRNG (reseed after fork)
    - OpenSSL PRNG (reseed after fork)

    The security benefits of a forking server having a re-initialized PRNG in
    every child process are pretty obvious. However, due to libraries
    having all kinds of internal state, and programs getting compiled with
    many different versions of each library, it is unreasonable to expect
    calling programs to re-initialize everything manually after fork.

    A further complication is the proliferation of clone flags, programs
    bypassing glibc's functions to call clone directly, and programs calling
    unshare, causing the glibc pthread_atfork hook to not get called.

    It would be better to have the kernel take care of this automatically.

    The patch also adds MADV_KEEPONFORK, to undo the effects of a prior
    MADV_WIPEONFORK.

    This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:

    https://man.openbsd.org/minherit.2
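
    The semantics described above can be observed from userspace with a
    short C sketch. This is an illustration under assumptions, not part
    of the patch: child_sees_wiped_memory() is a hypothetical helper, and
    the fallback value 18 for MADV_WIPEONFORK is assumed from the generic
    UAPI headers (it differs on some architectures, such as parisc). On
    kernels without the flag, madvise() fails and the helper reports -1.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE         /* for MAP_ANONYMOUS and madvise() */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef MADV_WIPEONFORK
#define MADV_WIPEONFORK 18      /* assumed generic value; parisc differs */
#endif

/*
 * Returns 1 if the child saw zeroes in a MADV_WIPEONFORK region,
 * 0 if it saw the parent's data, and -1 on error (for example when the
 * running kernel does not know the flag and madvise() fails).
 */
static int child_sees_wiped_memory(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xAB, len);                   /* parent-only state */
    if (madvise(p, len, MADV_WIPEONFORK) != 0) {
        munmap(p, len);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        munmap(p, len);
        return -1;
    }
    if (pid == 0)
        _exit(p[0] == 0 ? 0 : 1);           /* child: zero means wiped */

    int status = 0;
    int result = -1;
    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        result = (WEXITSTATUS(status) == 0);
    if (p[0] != 0xAB)                       /* parent copy must survive */
        result = -1;
    munmap(p, len);
    return result;
}
```

    A library caching per-process state could keep a flag in such a
    region and regenerate its state whenever the flag reads back as
    zero, which is the use case listed above.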

    [akpm@linux-foundation.org: numerically order arch/parisc/include/uapi/asm/mman.h #defines]
    Link: http://lkml.kernel.org/r/20170811212829.29186-3-riel@redhat.com
    Signed-off-by: Rik van Riel
    Reported-by: Florian Weimer
    Reported-by: Colm MacCártaigh
    Reviewed-by: Mike Kravetz
    Cc: "H. Peter Anvin"
    Cc: "Kirill A. Shutemov"
    Cc: Andy Lutomirski
    Cc: Dave Hansen
    Cc: Ingo Molnar
    Cc: Helge Deller
    Cc: Kees Cook
    Cc: Matthew Wilcox
    Cc: Thomas Gleixner
    Cc: Will Drewry
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • When servicing mmap() reads from file holes the current DAX code
    allocates a page cache page of all zeroes and places the struct page
    pointer in the mapping->page_tree radix tree.

    This has three major drawbacks:

    1) It consumes memory unnecessarily. For every 4k page that is read via
    a DAX mmap() over a hole, we allocate a new page cache page. This
    means that if you read 1GiB worth of pages, you end up using 1GiB of
    zeroed memory. This is easily visible by looking at the overall
    memory consumption of the system or by looking at /proc/[pid]/smaps:

    7f62e72b3000-7f63272b3000 rw-s 00000000 103:00 12 /root/dax/data
    Size: 1048576 kB
    Rss: 1048576 kB
    Pss: 1048576 kB
    Shared_Clean: 0 kB
    Shared_Dirty: 0 kB
    Private_Clean: 1048576 kB
    Private_Dirty: 0 kB
    Referenced: 1048576 kB
    Anonymous: 0 kB
    LazyFree: 0 kB
    AnonHugePages: 0 kB
    ShmemPmdMapped: 0 kB
    Shared_Hugetlb: 0 kB
    Private_Hugetlb: 0 kB
    Swap: 0 kB
    SwapPss: 0 kB
    KernelPageSize: 4 kB
    MMUPageSize: 4 kB
    Locked: 0 kB

    2) It is slower than using a common zero page because each page fault
    has more work to do. Instead of just inserting a common zero page we
    have to allocate a page cache page, zero it, and then insert it. Here
    are the average latencies of dax_load_hole() as measured by ftrace on
    a random test box:

    Old method, using zeroed page cache pages: 3.4 us
    New method, using the common 4k zero page: 0.8 us

    This was the average latency over 1 GiB of sequential reads done by
    this simple fio script:

    [global]
    size=1G
    filename=/root/dax/data
    fallocate=none
    [io]
    rw=read
    ioengine=mmap

    3) The fact that we had to check for both DAX exceptional entries and
    for page cache pages in the radix tree made the DAX code more
    complex.

    Solve these issues by following the lead of the DAX PMD code and using a
    common 4k zero page instead. As with the PMD code we will now insert a
    DAX exceptional entry into the radix tree instead of a struct page
    pointer which allows us to remove all the special casing in the DAX
    code.

    Note that we do still pretty aggressively check for regular pages in the
    DAX radix tree, especially where we take action based on the bits set in
    the page. If we ever find a regular page in our radix tree now that
    most likely means that someone besides DAX is inserting pages (which has
    happened lots of times in the past), and we want to find that out early
    and fail loudly.

    This solution also removes the extra memory consumption. Here is that
    same /proc/[pid]/smaps after 1GiB of reading from a hole with the new
    code:

    7f2054a74000-7f2094a74000 rw-s 00000000 103:00 12 /root/dax/data
    Size: 1048576 kB
    Rss: 0 kB
    Pss: 0 kB
    Shared_Clean: 0 kB
    Shared_Dirty: 0 kB
    Private_Clean: 0 kB
    Private_Dirty: 0 kB
    Referenced: 0 kB
    Anonymous: 0 kB
    LazyFree: 0 kB
    AnonHugePages: 0 kB
    ShmemPmdMapped: 0 kB
    Shared_Hugetlb: 0 kB
    Private_Hugetlb: 0 kB
    Swap: 0 kB
    SwapPss: 0 kB
    KernelPageSize: 4 kB
    MMUPageSize: 4 kB
    Locked: 0 kB

    Overall system memory consumption is similarly improved.

    Another major change is that we remove dax_pfn_mkwrite() from our fault
    flow, and instead rely on the page fault itself to make the PTE dirty
    and writeable. The following description from the patch adding the
    vm_insert_mixed_mkwrite() call explains this a little more:

    "To be able to use the common 4k zero page in DAX we need to have our
    PTE fault path look more like our PMD fault path where a PTE entry
    can be marked as dirty and writeable as it is first inserted rather
    than waiting for a follow-up dax_pfn_mkwrite() =>
    finish_mkwrite_fault() call.

    Right now we can rely on having a dax_pfn_mkwrite() call because we
    can distinguish between these two cases in do_wp_page():

    case 1: 4k zero page => writable DAX storage
    case 2: read-only DAX storage => writeable DAX storage

    This distinction is made via vm_normal_page(). vm_normal_page()
    returns false for the common 4k zero page, though, just as it does
    for DAX ptes. Instead of special casing the DAX + 4k zero page case
    we will simplify our DAX PTE page fault sequence so that it matches
    our DAX PMD sequence, and get rid of the dax_pfn_mkwrite() helper.
    We will instead use dax_iomap_fault() to handle write-protection
    faults.

    This means that insert_pfn() needs to follow the lead of
    insert_pfn_pmd() and allow us to pass in a 'mkwrite' flag. If
    'mkwrite' is set insert_pfn() will do the work that was previously
    done by wp_page_reuse() as part of the dax_pfn_mkwrite() call path"
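
    For reference, the access pattern being optimised (mmap() reads over
    a file hole) can be reproduced with a small userspace C sketch. It is
    an illustration only: count_nonzero_in_hole() and the /tmp path are
    hypothetical, and on a non-DAX filesystem the reads go through the
    ordinary page cache rather than the DAX fault path. At 4k per page,
    the 1GiB case discussed above touches 262144 pages.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE         /* for mkstemp() and ftruncate() */
#endif
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Maps a freshly truncated (fully sparse) temporary file and touches one
 * byte per page. Returns the number of non-zero bytes seen, or -1 on error.
 */
static long count_nonzero_in_hole(size_t len)
{
    char path[] = "/tmp/hole-XXXXXX";       /* hypothetical location */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                           /* file lives until close() */

    if (ftruncate(fd, (off_t)len) != 0) {   /* the whole file is a hole */
        close(fd);
        return -1;
    }

    const unsigned char *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    long nonzero = 0;
    for (size_t off = 0; off < len; off += page)
        if (p[off] != 0)                    /* each read faults a page in */
            nonzero++;

    munmap((void *)p, len);
    close(fd);
    return nonzero;
}
```

    Every byte read from the hole is expected to be zero; what the patch
    changes is how the kernel backs those reads on DAX, not the values
    userspace observes.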

    Link: http://lkml.kernel.org/r/20170724170616.25810-4-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Cc: "Darrick J. Wong"
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Christoph Hellwig
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Ingo Molnar
    Cc: Jonathan Corbet
    Cc: Matthew Wilcox
    Cc: Steven Rostedt
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • Pull networking updates from David Miller:

    1) Support ipv6 checksum offload in sunvnet driver, from Shannon
    Nelson.

    2) Move to RB-tree instead of custom AVL code in inetpeer, from Eric
    Dumazet.

    3) Allow generic XDP to work on virtual devices, from John Fastabend.

    4) Add bpf device maps and XDP_REDIRECT, which can be used to build
    arbitrary switching frameworks using XDP. From John Fastabend.

    5) Remove UFO offloads from the tree, gave us little other than bugs.

    6) Remove the IPSEC flow cache, from Florian Westphal.

    7) Support ipv6 route offload in mlxsw driver.

    8) Support VF representors in bnxt_en, from Sathya Perla.

    9) Add support for forward error correction modes to ethtool, from
    Vidya Sagar Ravipati.

    10) Add time filter for packet scheduler action dumping, from Jamal Hadi
    Salim.

    11) Extend the zerocopy sendmsg() used by virtio and tap to regular
    sockets via MSG_ZEROCOPY. From Willem de Bruijn.

    12) Significantly rework value tracking in the BPF verifier, from Edward
    Cree.

    13) Add new jump instructions to eBPF, from Daniel Borkmann.

    14) Rework rtnetlink plumbing so that operations can be run without
    taking the RTNL semaphore. From Florian Westphal.

    15) Support XDP in tap driver, from Jason Wang.

    16) Add 32-bit eBPF JIT for ARM, from Shubham Bansal.

    17) Add Huawei hinic ethernet driver.

    18) Allow to report MD5 keys in TCP inet_diag dumps, from Ivan
    Delalande.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1780 commits)
    i40e: point wb_desc at the nvm_wb_desc during i40e_read_nvm_aq
    i40e: avoid NVM acquire deadlock during NVM update
    drivers: net: xgene: Remove return statement from void function
    drivers: net: xgene: Configure tx/rx delay for ACPI
    drivers: net: xgene: Read tx/rx delay for ACPI
    rocker: fix kcalloc parameter order
    rds: Fix non-atomic operation on shared flag variable
    net: sched: don't use GFP_KERNEL under spin lock
    vhost_net: correctly check tx avail during rx busy polling
    net: mdio-mux: add mdio_mux parameter to mdio_mux_init()
    rxrpc: Make service connection lookup always check for retry
    net: stmmac: Delete dead code for MDIO registration
    gianfar: Fix Tx flow control deactivation
    cxgb4: Ignore MPS_TX_INT_CAUSE[Bubble] for T6
    cxgb4: Fix pause frame count in t4_get_port_stats
    cxgb4: fix memory leak
    tun: rename generic_xdp to skb_xdp
    tun: reserve extra headroom only when XDP is set
    net: dsa: bcm_sf2: Configure IMP port TC2QOS mapping
    net: dsa: bcm_sf2: Advertise number of egress queues
    ...

    Linus Torvalds
     

01 Sep, 2017

1 commit

  • This extends bridge fdb table tracepoints to also cover
    learned fdb entries in the br_fdb_update path. Note that,
    unlike other tracepoints, I have moved this to when the fdb
    is modified because this is in the datapath and can generate
    a lot of noise in the trace output. br_fdb_update is also called
    from added_by_user context in the NTF_USE case, which is already
    traced, hence the !added_by_user check.

    Signed-off-by: Roopa Prabhu
    Acked-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    Roopa Prabhu
     

31 Aug, 2017

2 commits


30 Aug, 2017

3 commits