27 Dec, 2016

1 commit

  • The attempt to prevent overwriting an active state resulted in a
    disaster which effectively disables all dynamically allocated hotplug
    states.

    Cleanup the mess.

    Fixes: dc280d936239 ("cpu/hotplug: Prevent overwriting of callbacks")
    Reported-by: Markus Trippelsdorf
    Reported-by: Boris Ostrovsky
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     

26 Dec, 2016

4 commits

  • Pull timer type cleanups from Thomas Gleixner:
    "This series does a tree wide cleanup of types related to
    timers/timekeeping.

    - Get rid of cycles_t and use a plain u64. The type is not really
    helpful and caused more confusion than clarity

    - Get rid of the ktime union. The union has become useless as we use
    the scalar nanoseconds storage unconditionally now. The 32bit
    timespec alike storage got removed due to the Y2038 limitations
    some time ago.

    That leaves the odd union access around for no reason. Clean it up.

    Both changes have been done with coccinelle and a small amount of
    manual mopping up"

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    ktime: Get rid of ktime_equal()
    ktime: Cleanup ktime_set() usage
    ktime: Get rid of the union
    clocksource: Use a plain u64 instead of cycle_t

    Linus Torvalds
     
  • Pull SMP hotplug notifier removal from Thomas Gleixner:
    "This is the final cleanup of the hotplug notifier infrastructure. The
    series has been reintgrated in the last two days because there came a
    new driver using the old infrastructure via the SCSI tree.

    Summary:

    - convert the last leftover drivers utilizing notifiers

    - fixup for a completely broken hotplug user

    - prevent setup of already used states

    - removal of the notifiers

    - treewide cleanup of hotplug state names

    - consolidation of state space

    There is a sphinx based documentation pending, but that needs review
    from the documentation folks"

    * 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    irqchip/armada-xp: Consolidate hotplug state space
    irqchip/gic: Consolidate hotplug state space
    coresight/etm3/4x: Consolidate hotplug state space
    cpu/hotplug: Cleanup state names
    cpu/hotplug: Remove obsolete cpu hotplug register/unregister functions
    staging/lustre/libcfs: Convert to hotplug state machine
    scsi/bnx2i: Convert to hotplug state machine
    scsi/bnx2fc: Convert to hotplug state machine
    cpu/hotplug: Prevent overwriting of callbacks
    x86/msr: Remove bogus cleanup from the error path
    bus: arm-ccn: Prevent hotplug callback leak
    perf/x86/intel/cstate: Prevent hotplug callback leak
    ARM/imx/mmcd: Fix broken cpu hotplug handling
    scsi: qedi: Convert to hotplug state machine

    Linus Torvalds
     
  • ktime_set(S,N) was required for the timespec storage type and is still
    useful for situations where a Seconds and Nanoseconds part of a time value
    needs to be converted. For anything where the Seconds argument is 0, this
    is pointless and can be replaced with a simple assignment.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra

    Thomas Gleixner
     
  • ktime is a union because the initial implementation stored the time in
    scalar nanoseconds on 64 bit machine and in a endianess optimized timespec
    variant for 32bit machines. The Y2038 cleanup removed the timespec variant
    and switched everything to scalar nanoseconds. The union remained, but
    become completely pointless.

    Get rid of the union and just keep ktime_t as simple typedef of type s64.

    The conversion was done with coccinelle and some manual mopping up.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra

    Thomas Gleixner
     

25 Dec, 2016

4 commits

  • There is no point in having an extra type for extra confusion. u64 is
    unambiguous.

    Conversion was done with the following coccinelle script:

    @rem@
    @@
    -typedef u64 cycle_t;

    @fix@
    typedef cycle_t;
    @@
    -cycle_t
    +u64

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: John Stultz

    Thomas Gleixner
     
  • hotcpu_notifier(), cpu_notifier(), __hotcpu_notifier(), __cpu_notifier(),
    register_hotcpu_notifier(), register_cpu_notifier(),
    __register_hotcpu_notifier(), __register_cpu_notifier(),
    unregister_hotcpu_notifier(), unregister_cpu_notifier(),
    __unregister_hotcpu_notifier(), __unregister_cpu_notifier()

    are unused now. Remove them and all related code.

    Remove also the now pointless cpu notifier error injection mechanism. The
    states can be executed step by step and error rollback is the same as cpu
    down, so any state transition can be tested w/o requiring the notifier
    error injection.

    Some CPU hotplug states are kept as they are (ab)used for hotplug state
    tracking.

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20161221192112.005642358@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Developers manage to overwrite states blindly without thought. That's fatal
    and hard to debug. Add sanity checks to make it fail.

    This requries to restructure the code so that the dynamic state allocation
    happens in the same lock protected section as the actual store. Otherwise
    the previous assignment of 'Reserved' to the name field would trigger the
    overwrite check.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Sebastian Siewior
    Link: http://lkml.kernel.org/r/20161221192111.675234535@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • This was entirely automated, using the script by Al:

    PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*'
    sed -i -e "s!$PATT!#include !" \
    $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)

    to do the replacement at the end of the merge window.

    Requested-by: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

24 Dec, 2016

2 commits

  • Pull perf fixes from Ingo Molnar:
    "On the kernel side there's two x86 PMU driver fixes and a uprobes fix,
    plus on the tooling side there's a number of fixes and some late
    updates"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
    perf sched timehist: Fix invalid period calculation
    perf sched timehist: Remove hardcoded 'comm_width' check at print_summary
    perf sched timehist: Enlarge default 'comm_width'
    perf sched timehist: Honour 'comm_width' when aligning the headers
    perf/x86: Fix overlap counter scheduling bug
    perf/x86/pebs: Fix handling of PEBS buffer overflows
    samples/bpf: Move open_raw_sock to separate header
    samples/bpf: Remove perf_event_open() declaration
    samples/bpf: Be consistent with bpf_load_program bpf_insn parameter
    tools lib bpf: Add bpf_prog_{attach,detach}
    samples/bpf: Switch over to libbpf
    perf diff: Do not overwrite valid build id
    perf annotate: Don't throw error for zero length symbols
    perf bench futex: Fix lock-pi help string
    perf trace: Check if MAP_32BIT is defined (again)
    samples/bpf: Make perf_event_read() static
    uprobes: Fix uprobes on MIPS, allow for a cache flush after ixol breakpoint creation
    samples/bpf: Make samples more libbpf-centric
    tools lib bpf: Add flags to bpf_create_map()
    tools lib bpf: use __u32 from linux/types.h
    ...

    Linus Torvalds
     
  • Pull final vfs updates from Al Viro:
    "Assorted cleanups and fixes all over the place"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    sg_write()/bsg_write() is not fit to be called under KERNEL_DS
    ufs: fix function declaration for ufs_truncate_blocks
    fs: exec: apply CLOEXEC before changing dumpable task flags
    seq_file: reset iterator to first record for zero offset
    vfs: fix isize/pos/len checks for reflink & dedupe
    [iov_iter] fix iterate_all_kinds() on empty iterators
    move aio compat to fs/aio.c
    reorganize do_make_slave()
    clone_private_mount() doesn't need to touch namespace_sem
    remove a bogus claim about namespace_sem being held by callers of mnt_alloc_id()

    Linus Torvalds
     

23 Dec, 2016

1 commit

  • ... and fix the minor buglet in compat io_submit() - native one
    kills ioctx as cleanup when put_user() fails. Get rid of
    bogus compat_... in !CONFIG_AIO case, while we are at it - they
    should simply fail with ENOSYS, same as for native counterparts.

    Signed-off-by: Al Viro

    Al Viro
     

21 Dec, 2016

2 commits

  • Subtract KASLR offset from the kernel addresses reported by kcov.
    Tested on x86_64 and AArch64 (Hikey LeMaker).

    Link: http://lkml.kernel.org/r/1481417456-28826-3-git-send-email-alex.popov@linux.com
    Signed-off-by: Alexander Popov
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Ard Biesheuvel
    Cc: Mark Rutland
    Cc: Rob Herring
    Cc: Kefeng Wang
    Cc: AKASHI Takahiro
    Cc: Jon Masters
    Cc: David Daney
    Cc: Ganapatrao Kulkarni
    Cc: Dmitry Vyukov
    Cc: Nicolai Stange
    Cc: James Morse
    Cc: Andrey Ryabinin
    Cc: Andrey Konovalov
    Cc: Alexander Popov
    Cc: syzkaller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Popov
     
  • The TPM PCRs are only reset on a hard reboot. In order to validate a
    TPM's quote after a soft reboot (eg. kexec -e), the IMA measurement
    list of the running kernel must be saved and restored on boot.

    This patch uses the kexec buffer passing mechanism to pass the
    serialized IMA binary_runtime_measurements to the next kernel.

    Link: http://lkml.kernel.org/r/1480554346-29071-7-git-send-email-zohar@linux.vnet.ibm.com
    Signed-off-by: Thiago Jung Bauermann
    Signed-off-by: Mimi Zohar
    Acked-by: "Eric W. Biederman"
    Acked-by: Dmitry Kasatkin
    Cc: Andreas Steffen
    Cc: Josh Sklar
    Cc: Dave Young
    Cc: Vivek Goyal
    Cc: Baoquan He
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Stewart Smith
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mimi Zohar
     

19 Dec, 2016

3 commits


18 Dec, 2016

6 commits

  • Commit:

    72e6ae285a1d ('ARM: 8043/1: uprobes need icache flush after xol write'

    ... has introduced an arch-specific method to ensure all caches are
    flushed appropriately after an instruction is written to an XOL page.

    However, when the XOL area is created and the out-of-line breakpoint
    instruction is copied, caches are not flushed at all and stale data may
    be found in icache.

    Replace a simple copy_to_page() with arch_uprobe_copy_ixol() to allow
    the arch to ensure all caches are updated accordingly.

    This change fixes uprobes on MIPS InterAptiv (tested on Creator Ci40).

    Signed-off-by: Marcin Nowakowski
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Victor Kamensky
    Cc: linux-mips@linux-mips.org
    Link: http://lkml.kernel.org/r/1481625657-22850-1-git-send-email-marcin.nowakowski@imgtec.com
    Signed-off-by: Ingo Molnar

    Marcin Nowakowski
     
  • Pull networking fixes and cleanups from David Miller:

    1) Revert bogus nla_ok() change, from Alexey Dobriyan.

    2) Various bpf validator fixes from Daniel Borkmann.

    3) Add some necessary SET_NETDEV_DEV() calls to hsis_femac and hip04
    drivers, from Dongpo Li.

    4) Several ethtool ksettings conversions from Philippe Reynes.

    5) Fix bugs in inet port management wrt. soreuseport, from Tom Herbert.

    6) XDP support for virtio_net, from John Fastabend.

    7) Fix NAT handling within a vrf, from David Ahern.

    8) Endianness fixes in dpaa_eth driver, from Claudiu Manoil

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (63 commits)
    net: mv643xx_eth: fix build failure
    isdn: Constify some function parameters
    mlxsw: spectrum: Mark split ports as such
    cgroup: Fix CGROUP_BPF config
    qed: fix old-style function definition
    net: ipv6: check route protocol when deleting routes
    r6040: move spinlock in r6040_close as SOFTIRQ-unsafe lock order detected
    irda: w83977af_ir: cleanup an indent issue
    net: sfc: use new api ethtool_{get|set}_link_ksettings
    net: davicom: dm9000: use new api ethtool_{get|set}_link_ksettings
    net: cirrus: ep93xx: use new api ethtool_{get|set}_link_ksettings
    net: chelsio: cxgb3: use new api ethtool_{get|set}_link_ksettings
    net: chelsio: cxgb2: use new api ethtool_{get|set}_link_ksettings
    bpf: fix mark_reg_unknown_value for spilled regs on map value marking
    bpf: fix overflow in prog accounting
    bpf: dynamically allocate digest scratch buffer
    gtp: Fix initialization of Flags octet in GTPv1 header
    gtp: gtp_check_src_ms_ipv4() always return success
    net/x25: use designated initializers
    isdn: use designated initializers
    ...

    Linus Torvalds
     
  • Pull more vfs updates from Al Viro:
    "In this pile:

    - autofs-namespace series
    - dedupe stuff
    - more struct path constification"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
    ocfs2: implement the VFS clone_range, copy_range, and dedupe_range features
    ocfs2: charge quota for reflinked blocks
    ocfs2: fix bad pointer cast
    ocfs2: always unlock when completing dio writes
    ocfs2: don't eat io errors during _dio_end_io_write
    ocfs2: budget for extent tree splits when adding refcount flag
    ocfs2: prohibit refcounted swapfiles
    ocfs2: add newlines to some error messages
    ocfs2: convert inode refcount test to a helper
    simple_write_end(): don't zero in short copy into uptodate
    exofs: don't mess with simple_write_{begin,end}
    9p: saner ->write_end() on failing copy into non-uptodate page
    fix gfs2_stuffed_write_end() on short copies
    fix ceph_write_end()
    nfs_write_end(): fix handling of short copies
    vfs: refactor clone/dedupe_file_range common functions
    fs: try to clone files first in vfs_copy_file_range
    vfs: misc struct path constification
    namespace.c: constify struct path passed to a bunch of primitives
    quota: constify struct path in quota_on
    ...

    Linus Torvalds
     
  • Martin reported a verifier issue that hit the BUG_ON() for his
    test case in the mark_reg_unknown_value() function:

    [ 202.861380] kernel BUG at kernel/bpf/verifier.c:467!
    [...]
    [ 203.291109] Call Trace:
    [ 203.296501] [] mark_map_reg+0x45/0x50
    [ 203.308225] [] mark_map_regs+0x78/0x90
    [ 203.320140] [] do_check+0x226d/0x2c90
    [ 203.331865] [] bpf_check+0x48b/0x780
    [ 203.343403] [] bpf_prog_load+0x27e/0x440
    [ 203.355705] [] ? handle_mm_fault+0x11af/0x1230
    [ 203.369158] [] ? security_capable+0x48/0x60
    [ 203.382035] [] SyS_bpf+0x124/0x960
    [ 203.393185] [] ? __do_page_fault+0x276/0x490
    [ 203.406258] [] entry_SYSCALL_64_fastpath+0x13/0x94

    This issue got uncovered after the fix in a08dd0da5307 ("bpf: fix
    regression on verifier pruning wrt map lookups"). The reason why it
    wasn't noticed before was, because as mentioned in a08dd0da5307,
    mark_map_regs() was doing the id matching incorrectly based on the
    uncached regs[regno].id. So, in the first loop, we walked all regs
    and as soon as we found regno == i, then this reg's id was cleared
    when calling mark_reg_unknown_value() thus that every subsequent
    register was probed against id of 0 (which, in combination with the
    PTR_TO_MAP_VALUE_OR_NULL type is an invalid condition that no other
    register state can hold), and therefore wasn't type transitioned such
    as in the spilled register case for the second loop.

    Now since that got fixed, it turned out that 57a09bf0a416 ("bpf:
    Detect identical PTR_TO_MAP_VALUE_OR_NULL registers") used
    mark_reg_unknown_value() incorrectly for the spilled regs, and thus
    hitting the BUG_ON() in some cases due to regno >= MAX_BPF_REG.

    Although spilled regs have the same type as the non-spilled regs
    for the verifier state, that is, struct bpf_reg_state, they are
    semantically different from the non-spilled regs. In other words,
    there can be up to 64 (MAX_BPF_STACK / BPF_REG_SIZE) spilled regs
    in the stack, for example, register R could have been spilled by
    the program to stack location X, Y, Z, and in mark_map_regs() we
    need to scan these stack slots of type STACK_SPILL for potential
    registers that we have to transition from PTR_TO_MAP_VALUE_OR_NULL.
    Therefore, depending on the location, the spilled_regs regno can
    be a lot higher than just MAX_BPF_REG's value since we operate on
    stack instead. The reset in mark_reg_unknown_value() itself is
    just fine, only that the BUG_ON() was inappropriate for this. Fix
    it by making a __mark_reg_unknown_value() version that can be
    called from mark_map_reg() generically; we know for the non-spilled
    case that the regno is always < MAX_BPF_REG anyway.

    Fixes: 57a09bf0a416 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers")
    Reported-by: Martin KaFai Lau
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Commit aaac3ba95e4c ("bpf: charge user for creation of BPF maps and
    programs") made a wrong assumption of charging against prog->pages.
    Unlike map->pages, prog->pages are still subject to change when we
    need to expand the program through bpf_prog_realloc().

    This can for example happen during verification stage when we need to
    expand and rewrite parts of the program. Should the required space
    cross a page boundary, then prog->pages is not the same anymore as
    its original value that we used to bpf_prog_charge_memlock() on. Thus,
    we'll hit a wrap-around during bpf_prog_uncharge_memlock() when prog
    is freed eventually. I noticed this that despite having unlimited
    memlock, programs suddenly refused to load with EPERM error due to
    insufficient memlock.

    There are two ways to fix this issue. One would be to add a cached
    variable to struct bpf_prog that takes a snapshot of prog->pages at the
    time of charging. The other approach is to also account for resizes. I
    chose to go with the latter for a couple of reasons: i) We want accounting
    rather to be more accurate instead of further fooling limits, ii) adding
    yet another page counter on struct bpf_prog would also be a waste just
    for this purpose. We also do want to charge as early as possible to
    avoid going into the verifier just to find out later on that we crossed
    limits. The only place that needs to be fixed is bpf_prog_realloc(),
    since only here we expand the program, so we try to account for the
    needed delta and should we fail, call-sites check for outcome anyway.
    On cBPF to eBPF migrations, we don't grab a reference to the user as
    they are charged differently. With that in place, my test case worked
    fine.

    Fixes: aaac3ba95e4c ("bpf: charge user for creation of BPF maps and programs")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Geert rightfully complained that 7bd509e311f4 ("bpf: add prog_digest
    and expose it via fdinfo/netlink") added a too large allocation of
    variable 'raw' from bss section, and should instead be done dynamically:

    # ./scripts/bloat-o-meter kernel/bpf/core.o.1 kernel/bpf/core.o.2
    add/remove: 3/0 grow/shrink: 0/0 up/down: 33291/0 (33291)
    function old new delta
    raw - 32832 +32832
    [...]

    Since this is only relevant during program creation path, which can be
    considered slow-path anyway, lets allocate that dynamically and be not
    implicitly dependent on verifier mutex. Move bpf_prog_calc_digest() at
    the beginning of replace_map_fd_with_map_ptr() and also error handling
    stays straight forward.

    Reported-by: Geert Uytterhoeven
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

17 Dec, 2016

4 commits

  • Commit 57a09bf0a416 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL
    registers") introduced a regression where existing programs stopped
    loading due to reaching the verifier's maximum complexity limit,
    whereas prior to this commit they were loading just fine; the affected
    program has roughly 2k instructions.

    What was found is that state pruning couldn't be performed effectively
    anymore due to mismatches of the verifier's register state, in particular
    in the id tracking. It doesn't mean that 57a09bf0a416 is incorrect per
    se, but rather that verifier needs to perform a lot more work for the
    same program with regards to involved map lookups.

    Since commit 57a09bf0a416 is only about tracking registers with type
    PTR_TO_MAP_VALUE_OR_NULL, the id is only needed to follow registers
    until they are promoted through pattern matching with a NULL check to
    either PTR_TO_MAP_VALUE or UNKNOWN_VALUE type. After that point, the
    id becomes irrelevant for the transitioned types.

    For UNKNOWN_VALUE, id is already reset to 0 via mark_reg_unknown_value(),
    but not so for PTR_TO_MAP_VALUE where id is becoming stale. It's even
    transferred further into other types that don't make use of it. Among
    others, one example is where UNKNOWN_VALUE is set on function call
    return with RET_INTEGER return type.

    states_equal() will then fall through the memcmp() on register state;
    note that the second memcmp() uses offsetofend(), so the id is part of
    that since d2a4dd37f6b4 ("bpf: fix state equivalence"). But the bisect
    pointed already to 57a09bf0a416, where we really reach beyond complexity
    limit. What I found was that states_equal() often failed in this
    case due to id mismatches in spilled regs with registers in type
    PTR_TO_MAP_VALUE. Unlike non-spilled regs, spilled regs just perform
    a memcmp() on their reg state and don't have any other optimizations
    in place, therefore also id was relevant in this case for making a
    pruning decision.

    We can safely reset id to 0 as well when converting to PTR_TO_MAP_VALUE.
    For the affected program, it resulted in a ~17 fold reduction of
    complexity and let the program load fine again. Selftest suite also
    runs fine. The only other place where env->id_gen is used currently is
    through direct packet access, but for these cases id is long living, thus
    a different scenario.

    Also, the current logic in mark_map_regs() is not fully correct when
    marking NULL branch with UNKNOWN_VALUE. We need to cache the destination
    reg's id in any case. Otherwise, once we marked that reg as UNKNOWN_VALUE,
    it's id is reset and any subsequent registers that hold the original id
    and are of type PTR_TO_MAP_VALUE_OR_NULL won't be marked UNKNOWN_VALUE
    anymore, since mark_map_reg() reuses the uncached regs[regno].id that
    was just overridden. Note, we don't need to cache it outside of
    mark_map_regs(), since it's called once on this_branch and the other
    time on other_branch, which are both two independent verifier states.
    A test case for this is added here, too.

    Fixes: 57a09bf0a416 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers")
    Signed-off-by: Daniel Borkmann
    Acked-by: Thomas Graf
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Al Viro
     
  • Pull vfs updates from Al Viro:

    - more ->d_init() stuff (work.dcache)

    - pathname resolution cleanups (work.namei)

    - a few missing iov_iter primitives - copy_from_iter_full() and
    friends. Either copy the full requested amount, advance the iterator
    and return true, or fail, return false and do _not_ advance the
    iterator. Quite a few open-coded callers converted (and became more
    readable and harder to fuck up that way) (work.iov_iter)

    - several assorted patches, the big one being logfs removal

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    logfs: remove from tree
    vfs: fix put_compat_statfs64() does not handle errors
    namei: fold should_follow_link() with the step into not-followed link
    namei: pass both WALK_GET and WALK_MORE to should_follow_link()
    namei: invert WALK_PUT logics
    namei: shift interpretation of LOOKUP_FOLLOW inside should_follow_link()
    namei: saner calling conventions for mountpoint_last()
    namei.c: get rid of user_path_parent()
    switch getfrag callbacks to ..._full() primitives
    make skb_add_data,{_nocache}() and skb_copy_to_page_nocache() advance only on success
    [iov_iter] new primitives - copy_from_iter_full() and friends
    don't open-code file_inode()
    ceph: switch to use of ->d_init()
    ceph: unify dentry_operations instances
    lustre: switch to use of ->d_init()

    Linus Torvalds
     
  • Pull powerpc updates from Michael Ellerman:
    "Highlights include:

    - Support for the kexec_file_load() syscall, which is a prereq for
    secure and trusted boot.

    - Prevent kernel execution of userspace on P9 Radix (similar to
    SMEP/PXN).

    - Sort the exception tables at build time, to save time at boot, and
    store them as relative offsets to save space in the kernel image &
    memory.

    - Allow building the kernel with thin archives, which should allow us
    to build an allyesconfig once some other fixes land.

    - Build fixes to allow us to correctly rebuild when changing the
    kernel endian from big to little or vice versa.

    - Plumbing so that we can avoid doing a full mm TLB flush on P9
    Radix.

    - Initial stack protector support (-fstack-protector).

    - Support for dumping the radix (aka. Linux) and hash page tables via
    debugfs.

    - Fix an oops in cxl coredump generation when cxl_get_fd() is used.

    - Freescale updates from Scott: "Highlights include 8xx hugepage
    support, qbman fixes/cleanup, device tree updates, and some misc
    cleanup."

    - Many and varied fixes and minor enhancements as always.

    Thanks to:
    Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V, Anshuman
    Khandual, Anton Blanchard, Balbir Singh, Bartlomiej Zolnierkiewicz,
    Christophe Jaillet, Christophe Leroy, Denis Kirjanov, Elimar
    Riesebieter, Frederic Barrat, Gautham R. Shenoy, Geliang Tang, Geoff
    Levand, Jack Miller, Johan Hovold, Lars-Peter Clausen, Libin,
    Madhavan Srinivasan, Michael Neuling, Nathan Fontenot, Naveen N.
    Rao, Nicholas Piggin, Pan Xinhui, Peter Senna Tschudin, Rashmica
    Gupta, Rui Teng, Russell Currey, Scott Wood, Simon Guo, Suraj
    Jitindar Singh, Thiago Jung Bauermann, Tobias Klauser, Vaibhav Jain"

    [ And thanks to Michael, who took time off from a new baby to get this
    pull request done. - Linus ]

    * tag 'powerpc-4.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (174 commits)
    powerpc/fsl/dts: add FMan node for t1042d4rdb
    powerpc/fsl/dts: add sg_2500_aqr105_phy4 alias on t1024rdb
    powerpc/fsl/dts: add QMan and BMan nodes on t1024
    powerpc/fsl/dts: add QMan and BMan nodes on t1023
    soc/fsl/qman: test: use DEFINE_SPINLOCK()
    powerpc/fsl-lbc: use DEFINE_SPINLOCK()
    powerpc/8xx: Implement support of hugepages
    powerpc: get hugetlbpage handling more generic
    powerpc: port 64 bits pgtable_cache to 32 bits
    powerpc/boot: Request no dynamic linker for boot wrapper
    soc/fsl/bman: Use resource_size instead of computation
    soc/fsl/qe: use builtin_platform_driver
    powerpc/fsl_pmc: use builtin_platform_driver
    powerpc/83xx/suspend: use builtin_platform_driver
    powerpc/ftrace: Fix the comments for ftrace_modify_code
    powerpc/perf: macros for power9 format encoding
    powerpc/perf: power9 raw event format encoding
    powerpc/perf: update attribute_group data structure
    powerpc/perf: factor out the event format field
    powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown
    ...

    Linus Torvalds
     

16 Dec, 2016

3 commits

  • Pull tracing updates from Steven Rostedt:
    "This release has a few updates:

    - STM can hook into the function tracer
    - Function filtering now supports more advance glob matching
    - Ftrace selftests updates and added tests
    - Softirq tag in traces now show only softirqs
    - ARM nop added to non traced locations at compile time
    - New trace_marker_raw file that allows for binary input
    - Optimizations to the ring buffer
    - Removal of kmap in trace_marker
    - Wakeup and irqsoff tracers now adhere to the set_graph_notrace file
    - Other various fixes and clean ups"

    * tag 'trace-v4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (42 commits)
    selftests: ftrace: Shift down default message verbosity
    kprobes/trace: Fix kprobe selftest for newer gcc
    tracing/kprobes: Add a helper method to return number of probe hits
    tracing/rb: Init the CPU mask on allocation
    tracing: Use SOFTIRQ_OFFSET for softirq dectection for more accurate results
    tracing/fgraph: Have wakeup and irqsoff tracers ignore graph functions too
    fgraph: Handle a case where a tracer ignores set_graph_notrace
    tracing: Replace kmap with copy_from_user() in trace_marker writing
    ftrace/x86_32: Set ftrace_stub to weak to prevent gcc from using short jumps to it
    tracing: Allow benchmark to be enabled at early_initcall()
    tracing: Have system enable return error if one of the events fail
    tracing: Do not start benchmark on boot up
    tracing: Have the reg function allow to fail
    ring-buffer: Force rb_end_commit() and rb_set_commit_to_write() inline
    ring-buffer: Froce rb_update_write_stamp() to be inlined
    ring-buffer: Force inline of hotpath helper functions
    tracing: Make __buffer_unlock_commit() always_inline
    tracing: Make tracepoint_printk a static_key
    ring-buffer: Always inline rb_event_data()
    ring-buffer: Make rb_reserve_next_event() always inlined
    ...

    Linus Torvalds
     
  • If CONFIG_PRINTK=n:

    kernel/printk/printk.c:1893: warning: ‘cont’ defined but not used

    Note that there are actually two different struct cont definitions and
    objects: the first one is used if CONFIG_PRINTK=y, the second one became
    unused by removing console_cont_flush().

    Fixes: 5c2992ee7fd8 ("printk: remove console flushing special cases for partial buffered lines")
    Signed-off-by: Geert Uytterhoeven
    Acked-by: Petr Mladek
    [ I do the occasional "allnoconfig" builds, but apparently not often
    enough - Linus ]
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     
  • When invoked with CPUHP_AP_ONLINE_DYN state __cpuhp_setup_state()
    is expected to return positive value which is the hotplug state that
    the routine assigns.

    Signed-off-by: Boris Ostrovsky
    Cc: linux-pm@vger.kernel.org
    Cc: viresh.kumar@linaro.org
    Cc: bigeasy@linutronix.de
    Cc: rjw@rjwysocki.net
    Cc: xen-devel@lists.xenproject.org
    Link: http://lkml.kernel.org/r/1481814058-4799-2-git-send-email-boris.ostrovsky@oracle.com
    Signed-off-by: Thomas Gleixner

    Boris Ostrovsky
     

15 Dec, 2016

10 commits

  • Commit 34c3d9819fda ("genirq/affinity: Provide smarter irq spreading
    infrastructure") introduced a better IRQ spreading mechanism, taking
    account of the available NUMA nodes in the machine.

    Problem is that the algorithm of retrieving the nodemask iterates
    "linearly" based on the number of online nodes - some architectures
    present non-linear node distribution among the nodemask, like PowerPC.
    If this is the case, the algorithm lead to a wrong node count number
    and therefore to a bad/incomplete IRQ affinity distribution.

    For example, this problem were found in a machine with 128 CPUs and two
    nodes, namely nodes 0 and 8 (instead of 0 and 1, if it was linearly
    distributed). This led to a wrong affinity distribution which then led to
    a bad mq allocation for nvme driver.

    Finally, we take the opportunity to fix a comment regarding the affinity
    distribution when we have _more_ nodes than vectors.

    Fixes: 34c3d9819fda ("genirq/affinity: Provide smarter irq spreading infrastructure")
    Reported-by: Gabriel Krisman Bertazi
    Signed-off-by: Guilherme G. Piccoli
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Gabriel Krisman Bertazi
    Reviewed-by: Gavin Shan
    Cc: linux-pci@vger.kernel.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: hch@lst.de
    Link: http://lkml.kernel.org/r/1481738472-2671-1-git-send-email-gpiccoli@linux.vnet.ibm.com
    Signed-off-by: Thomas Gleixner

    Guilherme G. Piccoli
     
  • When a disfunctional timer, e.g. dummy timer, is installed, the tick core
    tries to setup the broadcast timer.

    If no broadcast device is installed, the kernel crashes with a NULL pointer
    dereference in tick_broadcast_setup_oneshot() because the function has no
    sanity check.

    Reported-by: Mason
    Signed-off-by: Thomas Gleixner
    Cc: Mark Rutland
    Cc: Anna-Maria Gleixner
    Cc: Richard Cochran
    Cc: Sebastian Andrzej Siewior
    Cc: Daniel Lezcano
    Cc: Peter Zijlstra ,
    Cc: Sebastian Frias
    Cc: Thibaud Cornic
    Cc: Robin Murphy
    Link: http://lkml.kernel.org/r/1147ef90-7877-e4d2-bb2b-5c4fa8d3144b@free.fr

    Thomas Gleixner
     
  • Al Viro
     
  • Pull xfs updates from Dave Chinner:
    "There is quite a varied bunch of stuff in this update, and some of it
    you will have already merged through the ext4 tree which imported the
    dax-4.10-iomap-pmd topic branch from the XFS tree.

    There is also a new direct IO implementation that uses the iomap
    infrastructure. It's much simpler, faster, and has lower IO latency
    than the existing direct IO infrastructure.

    Summary:
    - DAX PMD faults via iomap infrastructure
    - Direct-io support in iomap infrastructure
    - removal of now-redundant XFS inode iolock, replaced with VFS
    i_rwsem
    - synchronisation with fixes and changes in userspace libxfs code
    - extent tree lookup helpers
    - lots of little corruption detection improvements to verifiers
    - optimised CRC calculations
    - faster buffer cache lookups
    - deprecation of barrier/nobarrier mount options - we always use
    REQ_FUA/REQ_FLUSH where appropriate for data integrity now
    - cleanups to speculative preallocation
    - miscellaneous minor bug fixes and cleanups"

    * tag 'xfs-for-linus-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (63 commits)
    xfs: nuke unused tracepoint definitions
    xfs: use GPF_NOFS when allocating btree cursors
    xfs: use xfs_vn_setattr_size to check on new size
    xfs: deprecate barrier/nobarrier mount option
    xfs: Always flush caches when integrity is required
    xfs: ignore leaf attr ichdr.count in verifier during log replay
    xfs: use rhashtable to track buffer cache
    xfs: optimise CRC updates
    xfs: make xfs btree stats less huge
    xfs: don't cap maximum dedupe request length
    xfs: don't allow di_size with high bit set
    xfs: error out if trying to add attrs and anextents > 0
    xfs: don't crash if reading a directory results in an unexpected hole
    xfs: complain if we don't get nextents bmap records
    xfs: check for bogus values in btree block headers
    xfs: forbid AG btrees with level == 0
    xfs: several xattr functions can be void
    xfs: handle cow fork in xfs_bmap_trace_exlist
    xfs: pass state not whichfork to trace_xfs_extlist
    xfs: Move AGI buffer type setting to xfs_read_agi
    ...

    Linus Torvalds
     
  • It actively hurts proper merging, and makes for a lot of special cases.
    There was a good(ish) reason for doing it originally, but it's getting
    too painful to maintain. And most of the original reasons for it are
    long gone.

    So instead of having special code to flush partial lines to the console
    (as opposed to the record buffers), do _all_ the console writing from
    the record buffer, and be done with it.

    If an oops happens (or some other synchronous event), we will flush the
    partial lines due to the oops printing activity, so this does not affect
    that. It does mean that if you have a completely hung machine, a
    partial preceding line may not have been printed out.

    That was some of the original reason for this complexity, in fact, back
    when we used to test for the historical i386 "halt" instruction problem
    by doing

    pr_info("Checking 'hlt' instruction... ");

    if (!boot_cpu_data.hlt_works_ok) {
    pr_cont("disabled\n");
    return;
    }
    halt();
    halt();
    halt();
    halt();
    pr_cont("OK\n");

    and that model no longer works (it the 'hlt' instruction kills the
    machine, the partial line won't have been flushed, so you won't even see
    it).

    Of course, that was also back in the days when people actually had
    textual console output rather than a graphical splash-screen at bootup.
    How times change..

    Cc: Sergey Senozhatsky
    Cc: Joe Perches
    Cc: Steven Rostedt
    Tested-by: Petr Mladek
    Tested-by: Geert Uytterhoeven
    Tested-by: Mark Rutland
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • The record logging code looks at the previous record flags in various
    ways, and they are all wrong.

    You can't use the previous record flags to determine anything about the
    next record, because they may simply not be related. In particular, the
    reason the previous record was a continuation record may well be exactly
    _because_ the new record was printed by a different process, which is
    why the previous record was flushed.

    So all those games are simply wrong, and make the code hard to
    understand (because the code fundamentally cdoes not make sense).

    So remove it.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Pull modules updates from Jessica Yu:
    "Summary of modules changes for the 4.10 merge window:

    - The rodata= cmdline parameter has been extended to additionally
    apply to module mappings

    - Fix a hard to hit race between module loader error/clean up
    handling and ftrace registration

    - Some code cleanups, notably panic.c and modules code use a unified
    taint_flags table now. This is much cleaner than duplicating the
    taint flag code in modules.c"

    * tag 'modules-for-v4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
    module: fix DEBUG_SET_MODULE_RONX typo
    module: extend 'rodata=off' boot cmdline parameter to module mappings
    module: Fix a comment above strong_try_module_get()
    module: When modifying a module's text ignore modules which are going away too
    module: Ensure a module's state is set accordingly during module coming cleanup code
    module: remove trailing whitespace
    taint/module: Clean up global and module taint flags handling
    modpost: free allocated memory

    Linus Torvalds
     
  • Merge more updates from Andrew Morton:

    - a few misc things

    - kexec updates

    - DMA-mapping updates to better support networking DMA operations

    - IPC updates

    - various MM changes to improve DAX fault handling

    - lots of radix-tree changes, mainly to the test suite. All leading up
    to reimplementing the IDA/IDR code to be a wrapper layer over the
    radix-tree. However the final trigger-pulling patch is held off for
    4.11.

    * emailed patches from Andrew Morton : (114 commits)
    radix tree test suite: delete unused rcupdate.c
    radix tree test suite: add new tag check
    radix-tree: ensure counts are initialised
    radix tree test suite: cache recently freed objects
    radix tree test suite: add some more functionality
    idr: reduce the number of bits per level from 8 to 6
    rxrpc: abstract away knowledge of IDR internals
    tpm: use idr_find(), not idr_find_slowpath()
    idr: add ida_is_empty
    radix tree test suite: check multiorder iteration
    radix-tree: fix replacement for multiorder entries
    radix-tree: add radix_tree_split_preload()
    radix-tree: add radix_tree_split
    radix-tree: add radix_tree_join
    radix-tree: delete radix_tree_range_tag_if_tagged()
    radix-tree: delete radix_tree_locate_item()
    radix-tree: improve multiorder iterators
    btrfs: fix race in btrfs_free_dummy_fs_info()
    radix-tree: improve dump output
    radix-tree: make radix_tree_find_next_bit more useful
    ...

    Linus Torvalds
     
  • Patch series "mm: unexport __get_user_pages_unlocked()".

    This patch series continues the cleanup of get_user_pages*() functions
    taking advantage of the fact we can now pass gup_flags as we please.

    It firstly adds an additional 'locked' parameter to
    get_user_pages_remote() to allow for its callers to utilise
    VM_FAULT_RETRY functionality. This is necessary as the invocation of
    __get_user_pages_unlocked() in process_vm_rw_single_vec() makes use of
    this and no other existing higher level function would allow it to do
    so.

    Secondly existing callers of __get_user_pages_unlocked() are replaced
    with the appropriate higher-level replacement -
    get_user_pages_unlocked() if the current task and memory descriptor are
    referenced, or get_user_pages_remote() if other task/memory descriptors
    are referenced (having acquiring mmap_sem.)

    This patch (of 2):

    Add a int *locked parameter to get_user_pages_remote() to allow
    VM_FAULT_RETRY faulting behaviour similar to get_user_pages_[un]locked().

    Taking into account the previous adjustments to get_user_pages*()
    functions allowing for the passing of gup_flags, we are now in a
    position where __get_user_pages_unlocked() need only be exported for his
    ability to allow VM_FAULT_RETRY behaviour, this adjustment allows us to
    subsequently unexport __get_user_pages_unlocked() as well as allowing
    for future flexibility in the use of get_user_pages_remote().

    [sfr@canb.auug.org.au: merge fix for get_user_pages_remote API change]
    Link: http://lkml.kernel.org/r/20161122210511.024ec341@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20161027095141.2569-2-lstoakes@gmail.com
    Signed-off-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Cc: Jan Kara
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • Separate hardlockup code from watchdog.c and move it to watchdog_hld.c.
    It is mostly straight forward. Remove everything inside
    CONFIG_HARDLOCKUP_DETECTORS. This code will go to file watchdog_hld.c.
    Also update the makefile accordigly.

    Link: http://lkml.kernel.org/r/1478034826-43888-3-git-send-email-babu.moger@oracle.com
    Signed-off-by: Babu Moger
    Acked-by: Don Zickus
    Cc: Ingo Molnar
    Cc: Jiri Kosina
    Cc: Andi Kleen
    Cc: Yaowei Bai
    Cc: Aaron Tomlin
    Cc: Ulrich Obergfell
    Cc: Tejun Heo
    Cc: Hidehiro Kawai
    Cc: Josh Hunt
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Babu Moger