21 Jul, 2017

4 commits

  • David S. Miller
     
  • Pull networking fixes from David Miller:

    1) BPF verifier signed/unsigned value tracking fix, from Daniel
    Borkmann, Edward Cree, and Josef Bacik.

    2) Fix memory allocation length when setting up calls to
    ->ndo_set_mac_address, from Cong Wang.

    3) Add a new cxgb4 device ID, from Ganesh Goudar.

    4) Fix FIB refcount handling: we have to set its initial value before
    the configure callback (which can bump it). From David Ahern.

    5) Fix double-free in qcom/emac driver, from Timur Tabi.

    6) A bunch of gcc-7 string format overflow warning fixes from Arnd
    Bergmann.

    7) Fix link level headroom tests in ip_do_fragment(), from Vasily
    Averin.

    8) Fix chunk walking in SCTP when iterating over error and parameter
    headers. From Alexander Potapenko.

    9) TCP BBR congestion control fixes from Neal Cardwell.

    10) Fix SKB fragment handling in bcmgenet driver, from Doug Berger.

    11) BPF_CGROUP_RUN_PROG_SOCK_OPS needs to check for null __sk, from Cong
    Wang.

    12) xmit_recursion in ppp driver needs to be per-device not per-cpu,
    from Gao Feng.

    13) Cannot release skb->dst in UDP if IP options processing needs it.
    From Paolo Abeni.

    14) Some netdev ioctl ifr_name[] NULL termination fixes. From Alexander
    Levin and myself.

    15) Revert some rtnetlink notification changes that are causing
    regressions, from David Ahern.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (83 commits)
    net: bonding: Fix transmit load balancing in balance-alb mode
    rds: Make sure updates to cp_send_gen can be observed
    net: ethernet: ti: cpsw: Push the request_irq function to the end of probe
    ipv4: initialize fib_trie prior to register_netdev_notifier call.
    rtnetlink: allocate more memory for dev_set_mac_address()
    net: dsa: b53: Add missing ARL entries for BCM53125
    bpf: more tests for mixed signed and unsigned bounds checks
    bpf: add test for mixed signed and unsigned bounds checks
    bpf: fix up test cases with mixed signed/unsigned bounds
    bpf: allow to specify log level and reduce it for test_verifier
    bpf: fix mixed signed/unsigned derived min/max value bounds
    ipv6: avoid overflow of offset in ip6_find_1stfragopt
    net: tehuti: don't process data if it has not been copied from userspace
    Revert "rtnetlink: Do not generate notifications for CHANGEADDR event"
    net: dsa: mv88e6xxx: Enable CMODE config support for 6390X
    dt-binding: ptp: Add SoC compatibility strings for dte ptp clock
    NET: dwmac: Make dwmac reset unconditional
    net: Zero terminate ifr_name in dev_ifname().
    wireless: wext: terminate ifr name coming from userspace
    netfilter: fix netfilter_net_init() return
    ...

    Linus Torvalds
     
  • Edward reported that there's an issue in min/max value bounds
    tracking when signed and unsigned compares both provide hints
    on limits when having unknown variables. E.g. a program such
    as the following should have been rejected:

    0: (7a) *(u64 *)(r10 -8) = 0
    1: (bf) r2 = r10
    2: (07) r2 += -8
    3: (18) r1 = 0xffff8a94cda93400
    5: (85) call bpf_map_lookup_elem#1
    6: (15) if r0 == 0x0 goto pc+7
    R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R10=fp
    7: (7a) *(u64 *)(r10 -16) = -8
    8: (79) r1 = *(u64 *)(r10 -16)
    9: (b7) r2 = -1
    10: (2d) if r1 > r2 goto pc+3
    R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=0
    R2=imm-1,max_value=18446744073709551615,min_align=1 R10=fp
    11: (65) if r1 s> 0x1 goto pc+2
    R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=0,max_value=1
    R2=imm-1,max_value=18446744073709551615,min_align=1 R10=fp
    12: (0f) r0 += r1
    13: (72) *(u8 *)(r0 +0) = 0
    R0=map_value_adj(ks=8,vs=8,id=0),min_value=0,max_value=1 R1=inv,min_value=0,max_value=1
    R2=imm-1,max_value=18446744073709551615,min_align=1 R10=fp
    14: (b7) r0 = 0
    15: (95) exit

    What happens is that in the first part ...

    8: (79) r1 = *(u64 *)(r10 -16)
    9: (b7) r2 = -1
    10: (2d) if r1 > r2 goto pc+3

    ... r1 carries an unsigned value, and is compared as unsigned
    against a register carrying an immediate. Verifier deduces in
    reg_set_min_max() that since the compare is unsigned and operation
    is greater than (>), that in the fall-through/false case, r1's
    minimum bound must be 0 and maximum bound must be r2. Latter is
    larger than the bound and thus max value is reset back to being
    'invalid' aka BPF_REGISTER_MAX_RANGE. Thus, r1 state is now
    'R1=inv,min_value=0'. The subsequent test ...

    11: (65) if r1 s> 0x1 goto pc+2

    ... is a signed compare of r1 with immediate value 1. Here,
    verifier deduces in reg_set_min_max() that since the compare
    is signed this time and operation is greater than (>), that
    in the fall-through/false case, we can deduce that r1's maximum
    bound must be 1, meaning with prior test, we result in r1 having
    the following state: R1=inv,min_value=0,max_value=1. Given that
    the actual value this holds is -8, the bounds are wrongly deduced.
    When this is being added to r0 which holds the map_value(_adj)
    type, then subsequent store access in above case will go through
    check_mem_access() which invokes check_map_access_adj(), that
    will then probe whether the map memory is in bounds based
    on the min_value and max_value as well as the access size, since
    the actual unknown value lies somewhere in min_value <= x <= max_value.
    A further crafted test along the same lines produces the following
    verifier log excerpt:

    10: (3d) if r2 >= r1 goto pc+2
    R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3
    R2=imm2,min_value=2,max_value=2,min_align=2 R10=fp
    11: (b7) r7 = 1
    12: (65) if r7 s> 0x0 goto pc+2
    R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3
    R2=imm2,min_value=2,max_value=2,min_align=2 R7=imm1,max_value=0 R10=fp
    13: (b7) r0 = 0
    14: (95) exit

    from 12 to 15: R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0
    R1=inv,min_value=3 R2=imm2,min_value=2,max_value=2,min_align=2 R7=imm1,min_value=1 R10=fp
    15: (0f) r7 += r1
    16: (65) if r7 s> 0x4 goto pc+2
    R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3
    R2=imm2,min_value=2,max_value=2,min_align=2 R7=inv,min_value=4,max_value=4 R10=fp
    17: (0f) r0 += r7
    18: (72) *(u8 *)(r0 +0) = 0
    R0=map_value_adj(ks=8,vs=8,id=0),min_value=4,max_value=4 R1=inv,min_value=3
    R2=imm2,min_value=2,max_value=2,min_align=2 R7=inv,min_value=4,max_value=4 R10=fp
    19: (b7) r0 = 0
    20: (95) exit

    Meaning, in adjust_reg_min_max_vals() we must also reset range
    values on the dst when src/dst registers have mixed signed/
    unsigned derived min/max value bounds with one unbounded value
    as otherwise they can be added together deducing false boundaries.
    Once both boundaries are established from either ALU ops or
    compare operations w/o mixing signed/unsigned insns, then they
    can safely be added to other regs also having both boundaries
    established. Adding regs with one unbounded side to a map value
    where the bounded side has been learned w/o mixing ops is
    possible, but the resulting map value won't recover from that,
    meaning such an op is considered invalid at the time of actual
    access. Invalid bounds are set on the dst reg in case i) the src reg,
    or ii) the dst reg already had them. The only way to recover
    would be to perform i) ALU ops but only 'add' is allowed on map
    value types or ii) comparisons, but these are disallowed on
    pointers in case they span a range. This is fine as only BPF_JEQ
    and BPF_JNE may be performed on PTR_TO_MAP_VALUE_OR_NULL registers
    which potentially turn them into PTR_TO_MAP_VALUE type depending
    on the branch, so only here min/max value cannot be invalidated
    for them.
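
    To see why these two fall-through deductions must not simply be
    intersected, here is a small stand-alone C illustration (my own sketch,
    not verifier code) using the value -8 from the trace above:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t r1_u = (uint64_t)-8;  /* bit pattern stored at fp-16 */
        int64_t  r1_s = (int64_t)r1_u; /* same bits, viewed as signed */

        /* fall-through of insn 10, "if r1 > r2" with r2 = (u64)-1 */
        int fall_unsigned = !(r1_u > (uint64_t)-1);
        /* fall-through of insn 11, "if r1 s> 0x1" */
        int fall_signed = !(r1_s > 1);

        printf("unsigned fall-through taken: %d\n", fall_unsigned);
        printf("signed fall-through taken:   %d\n", fall_signed);
        printf("naively deduced bounds: [0, 1], actual value: %lld\n",
               (long long)r1_s);
        return 0;
    }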

    In terms of state pruning, value_from_signed is considered
    as well in states_equal() when dealing with adjusted map values.
    With regards to breaking existing programs, there is a small
    risk, but the use-cases where this could occur are rather narrow
    and mixing such compares is probably unlikely.

    Joint work with Josef and Edward.

    [0] https://lists.iovisor.org/pipermail/iovisor-dev/2017-June/000822.html

    Fixes: 484611357c19 ("bpf: allow access into map value arrays")
    Reported-by: Edward Cree
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Edward Cree
    Signed-off-by: Josef Bacik
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Pull audit fix from Paul Moore:
    "A small audit fix, just a single line, to plug a memory leak in some
    audit error handling code"

    * 'stable-4.13' of git://git.infradead.org/users/pcmoore/audit:
    audit: fix memleak in auditd_send_unicast_skb.

    Linus Torvalds
     

19 Jul, 2017

2 commits

  • Pull structure randomization updates from Kees Cook:
    "Now that IPC and other changes have landed, enable manual markings for
    randstruct plugin, including the task_struct.

    This is the rest of what was staged in -next for the gcc-plugins, and
    comes in three patches, largest first:

    - mark "easy" structs with __randomize_layout

    - mark task_struct with an optional anonymous struct to isolate the
    __randomize_layout section

    - mark structs to opt _out_ of automated marking (which will come
    later)

    And, FWIW, this continues to pass allmodconfig (normal and patched to
    enable gcc-plugins) builds of x86_64, i386, arm64, arm, powerpc, and
    s390 for me"

    * tag 'gcc-plugins-v4.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    randstruct: opt-out externally exposed function pointer structs
    task_struct: Allow randomized layout
    randstruct: Mark various structs for randomization

    Linus Torvalds
     
  • Found this issue via a kmemleak report: auditd_send_unicast_skb
    did not free the skb if rcu_dereference(auditd_conn) returned NULL.

    unreferenced object 0xffff88082568ce00 (size 256):
    comm "auditd", pid 1119, jiffies 4294708499
    backtrace:
    [] kmemleak_alloc+0x4a/0xa0
    [] kmem_cache_alloc_node+0xcc/0x210
    [] __alloc_skb+0x5d/0x290
    [] audit_make_reply+0x54/0xd0
    [] audit_receive_msg+0x967/0xd70
    ----------------
    (gdb) list *audit_receive_msg+0x967
    0xffffffff8113dff7 is in audit_receive_msg (kernel/audit.c:1133).
    1132 skb = audit_make_reply(0, AUDIT_REPLACE, 0,
    0, &pvnr, sizeof(pvnr));
    ---------------
    [] audit_receive+0x52/0xa0
    [] netlink_unicast+0x181/0x240
    [] netlink_sendmsg+0x2c2/0x3b0
    [] sock_sendmsg+0x38/0x50
    [] SYSC_sendto+0x102/0x190
    [] SyS_sendto+0xe/0x10
    [] entry_SYSCALL_64_fastpath+0x1a/0xa5
    [] 0xffffffffffffffff

    Signed-off-by: Shu Wang
    Signed-off-by: Paul Moore

    Shu Wang
     

18 Jul, 2017

6 commits

  • Pull irq fix from Thomas Gleixner:
    "Fix the fallout from reworking the locking and resource management in
    request/free_irq()"

    * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    genirq: Keep chip buslock across irq_request/release_resources()

    Linus Torvalds
     
  • Pull SMP fix from Thomas Gleixner:
    "Replace the bogus BUG_ON in the cpu hotplug code"

    * 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    smp/hotplug: Replace BUG_ON and react useful

    Linus Torvalds
     
  • The BPF map devmap holds a refcnt on the net_device structure when
    it is in the map. We need to do this to ensure on driver unload we
    don't lose a dev reference.

    However, it's not very convenient to have to manually unload the map
    when destroying a net device so add notifier handlers to do the cleanup
    automatically. But this creates a race between update/destroy BPF
    syscall and programs and the unregister netdev hook.

    Unfortunately, the best I could come up with is either to live with
    requiring manual removal of net devices from the map before removing
    the net device OR to add a mutex in devmap to ensure the map is not
    modified while we are removing a device. The fallout also requires
    that BPF programs no longer update/delete the map from the BPF program
    side because the mutex may sleep and this can not be done from inside
    an rcu critical section. This is not a real problem though because I
    have not come up with any use cases where this is actually useful in
    practice. If/when we come up with a compelling user for this we may
    need to revisit this.
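
    A rough sketch of the notifier direction described above; my_devmap_lock,
    my_devmap_max_entries and my_devmap_netdev[] are illustrative stand-ins,
    not the actual kernel/bpf/devmap.c symbols:

    static int my_devmap_notification(struct notifier_block *nb,
                                      unsigned long event, void *ptr)
    {
        struct net_device *dev = netdev_notifier_info_to_dev(ptr);
        unsigned int i;

        if (event != NETDEV_UNREGISTER)
            return NOTIFY_DONE;

        /* serialize against map updates coming in via the bpf syscall */
        mutex_lock(&my_devmap_lock);
        for (i = 0; i < my_devmap_max_entries; i++) {
            if (my_devmap_netdev[i] == dev) {
                my_devmap_netdev[i] = NULL;
                dev_put(dev); /* drop the refcnt taken on insert */
            }
        }
        mutex_unlock(&my_devmap_lock);
        return NOTIFY_OK;
    }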

    Signed-off-by: John Fastabend
    Acked-by: Daniel Borkmann
    Acked-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    John Fastabend
     
  • For performance reasons we want to avoid updating the tail pointer in
    the driver tx ring as much as possible. To accomplish this we add
    batching support to the redirect path in XDP.

    This adds another ndo op "xdp_flush" that is used to inform the driver
    that it should bump the tail pointer on the TX ring.
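
    The resulting calling pattern, sketched from the description above (the
    dev, frames[], n and i variables are assumed, and the exact ndo
    signatures of that era may differ in detail):

    /* queue several frames first ... */
    for (i = 0; i < n; i++)
        dev->netdev_ops->ndo_xdp_xmit(dev, frames[i]);

    /* ... then tell the driver once to bump its TX tail pointer */
    if (dev->netdev_ops->ndo_xdp_flush)
        dev->netdev_ops->ndo_xdp_flush(dev);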

    Signed-off-by: John Fastabend
    Signed-off-by: Jesper Dangaard Brouer
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    John Fastabend
     
  • BPF programs can use the devmap with a bpf_redirect_map() helper
    routine to forward packets to a netdevice in the map.
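
    For illustration, a minimal XDP program in the style of the kernel
    samples could use the helper roughly as follows (bpf_helpers.h provides
    SEC() and the helper declarations); the map name, the fixed slot and the
    section name are assumptions of this sketch:

    struct bpf_map_def SEC("maps") tx_port = {
        .type        = BPF_MAP_TYPE_DEVMAP,
        .key_size    = sizeof(int),
        .value_size  = sizeof(int),
        .max_entries = 64,
    };

    SEC("xdp_redirect_map")
    int xdp_redirect_map_prog(struct xdp_md *ctx)
    {
        int port = 0; /* devmap slot to forward to; fixed here for brevity */

        return bpf_redirect_map(&tx_port, port, 0);
    }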

    Signed-off-by: John Fastabend
    Signed-off-by: Jesper Dangaard Brouer
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    John Fastabend
     
  • Device map (devmap) is a BPF map, primarily useful for networking
    applications, that uses a key to lookup a reference to a netdevice.

    The map provides a clean way for BPF programs to build virtual port
    to physical port maps. Additionally, it provides a scoping function
    for the redirect action itself allowing multiple optimizations. Future
    patches will leverage the map to provide batching at the XDP layer.

    Another optimization/feature, not yet implemented, would be to support
    multiple netdevices per key, for efficient multicast and broadcast.
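
    From user space, populating such a map could look roughly like the sketch
    below, using the tools/lib/bpf syscall wrappers of that era; the function
    name, map size and "eth1" egress device are assumptions:

    #include <net/if.h>
    #include <linux/bpf.h>
    #include "bpf/bpf.h" /* bpf_create_map(), bpf_map_update_elem() */

    int setup_tx_ports(void)
    {
        int map_fd, key = 0;
        int ifindex = if_nametoindex("eth1");

        map_fd = bpf_create_map(BPF_MAP_TYPE_DEVMAP, sizeof(int),
                                sizeof(int), 64, 0);
        if (map_fd < 0 || !ifindex)
            return -1;

        /* slot 0 now redirects to eth1's ifindex */
        return bpf_map_update_elem(map_fd, &key, &ifindex, BPF_ANY);
    }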

    Signed-off-by: John Fastabend
    Acked-by: Daniel Borkmann
    Acked-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    John Fastabend
     

16 Jul, 2017

1 commit

  • Pull ->s_options removal from Al Viro:
    "Preparations for fsmount/fsopen stuff (coming next cycle). Everything
    gets moved to explicit ->show_options(), killing ->s_options off +
    some cosmetic bits around fs/namespace.c and friends. Basically, the
    stuff needed to work with fsmount series with minimum of conflicts
    with other work.

    It's not strictly required for this merge window, but it would reduce
    the PITA during the coming cycle, so it would be nice to have those
    bits and pieces out of the way"

    * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    isofs: Fix isofs_show_options()
    VFS: Kill off s_options and helpers
    orangefs: Implement show_options
    9p: Implement show_options
    isofs: Implement show_options
    afs: Implement show_options
    affs: Implement show_options
    befs: Implement show_options
    spufs: Implement show_options
    bpf: Implement show_options
    ramfs: Implement show_options
    pstore: Implement show_options
    omfs: Implement show_options
    hugetlbfs: Implement show_options
    VFS: Don't use save/replace_mount_options if not using generic_show_options
    VFS: Provide empty name qstr
    VFS: Make get_filesystem() return the affected filesystem
    VFS: Clean up whitespace in fs/namespace.c and fs/super.c
    Provide a function to create a NUL-terminated string from unterminated data
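
    The ->show_options() conversions listed above all take the same shape; a
    minimal hedged sketch, with foofs and its sb_info fields purely
    illustrative:

    static int foofs_show_options(struct seq_file *m, struct dentry *root)
    {
        struct foofs_sb_info *sbi = root->d_sb->s_fs_info; /* hypothetical */

        if (sbi->quiet)
            seq_puts(m, ",quiet");
        seq_printf(m, ",mode=%o", sbi->mode);
        return 0;
    }

    static const struct super_operations foofs_sops = {
        /* ... */
        .show_options = foofs_show_options,
    };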

    Linus Torvalds
     

15 Jul, 2017

3 commits

  • Pull power management fixes from Rafael Wysocki:
    "These fix a recently exposed issue in the PCI device wakeup code and
    one older problem related to PCI device wakeup that has been reported
    recently, modify one more piece of computations in intel_pstate to get
    rid of a rounding error, fix a possible race in the schedutil cpufreq
    governor, fix the device PM QoS sysfs interface to correctly handle
    invalid user input, fix return values of two probe routines in devfreq
    drivers and constify an attribute_group structure in devfreq.

    Specifics:

    - Avoid clearing the PCI PME Enable bit for devices as a result of
    config space restoration which confuses AML executed afterward and
    causes wakeup events to be lost on some systems (Rafael Wysocki).

    - Fix the native PCIe PME interrupts handling in the cases when the
    PME IRQ is set up as a system wakeup one so that runtime PM remote
    wakeup works as expected after system resume on systems where that
    happens (Rafael Wysocki).

    - Fix the device PM QoS sysfs interface to handle invalid user input
    correctly instead of using an uninitialized variable value as the
    latency tolerance for the device at hand (Dan Carpenter).

    - Get rid of one more rounding error from intel_pstate computations
    (Srinivas Pandruvada).

    - Fix the schedutil cpufreq governor to prevent it from possibly
    accessing uninitialized data structures from governor callbacks in
    some cases on systems when multiple CPUs share a single cpufreq
    policy object (Vikram Mulukutla).

    - Fix the return values of probe routines in two devfreq drivers
    (Gustavo Silva).

    - Constify an attribute_group structure in devfreq (Arvind Yadav)"

    * tag 'pm-fixes-4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    PCI / PM: Fix native PME handling during system suspend/resume
    PCI / PM: Restore PME Enable after config space restoration
    cpufreq: schedutil: Fix sugov_start() versus sugov_update_shared() race
    PM / QoS: return -EINVAL for bogus strings
    cpufreq: intel_pstate: Fix ratio setting for min_perf_pct
    PM / devfreq: constify attribute_group structures.
    PM / devfreq: tegra: fix error return code in tegra_devfreq_probe()
    PM / devfreq: rk3399_dmc: fix error return code in rk3399_dmcfreq_probe()

    Linus Torvalds
     
  • If we reach the limit of modprobe_limit threads running, the next
    request_module() call will fail. The original reason for adding a kill
    was to do away with possible issues in old circumstances which could
    create a recursive series of request_module() calls.

    We can do better than just be super aggressive and reject calls once we've
    reached the limit by simply making pending callers wait until the
    threshold has been reduced, and then throttling them in, one by one.

    This throttling enables requests over the kmod concurrent limit to be
    processed once a pending request completes. Only the first item queued up
    to wait is woken up. The assumption here is that once a task is woken it
    will have no other option but to also kick the queue to check whether
    there are more pending tasks -- regardless of whether or not it was
    successful.

    By throttling and processing only max kmod concurrent tasks we ensure we
    avoid unexpected fatal request_module() calls, and we keep memory
    consumption on module loading to a minimum.
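
    The throttling can be pictured as a counter plus a waitqueue; a
    simplified sketch of the idea, not the exact kernel/kmod.c code, with
    MAX_KMOD_CONCURRENT standing in for the real limit:

    static atomic_t kmod_concurrent_max = ATOMIC_INIT(MAX_KMOD_CONCURRENT);
    static DECLARE_WAIT_QUEUE_HEAD(kmod_wq);

    int request_module_throttled(void)
    {
        /* wait for a free slot instead of failing outright */
        if (atomic_dec_if_positive(&kmod_concurrent_max) < 0)
            wait_event(kmod_wq,
                       atomic_dec_if_positive(&kmod_concurrent_max) >= 0);

        /* ... run the usermode helper here ... */

        atomic_inc(&kmod_concurrent_max);
        wake_up(&kmod_wq); /* let a waiter re-check the counter */
        return 0;
    }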

    With x86_64 qemu, with 4 cores, 4 GiB of RAM it takes the following run
    time to run both tests:

    time ./kmod.sh -t 0008
    real 0m16.366s
    user 0m0.883s
    sys 0m8.916s

    time ./kmod.sh -t 0009
    real 0m50.803s
    user 0m0.791s
    sys 0m9.852s

    Link: http://lkml.kernel.org/r/20170628223155.26472-4-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Reviewed-by: Petr Mladek
    Cc: Jessica Yu
    Cc: Shuah Khan
    Cc: Rusty Russell
    Cc: Michal Marek
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • After commit 73ce0511c436 ("kernel/watchdog.c: move hardlockup
    detector to separate file"), the term 'NMI watchdog' is no longer
    appropriate in kernel/watchdog.c, so use 'watchdog' only.

    Link: http://lkml.kernel.org/r/1499928642-48983-1-git-send-email-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang
    Cc: Babu Moger
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kefeng Wang
     

14 Jul, 2017

3 commits

  • * pm-cpufreq-sched:
    cpufreq: schedutil: Fix sugov_start() versus sugov_update_shared() race

    * intel_pstate:
    cpufreq: intel_pstate: Fix ratio setting for min_perf_pct

    Rafael J. Wysocki
     
  • Pull more tracing updates from Steven Rostedt:
    "A few more minor updates:

    - Show the tgid mappings for user space trace tools to use

    - Fix and optimize the comm and tgid cache recording

    - Sanitize derived kprobe names

    - Ftrace selftest updates

    - trace file header fix

    - Update of Documentation/trace/ftrace.txt

    - Compiler warning fixes

    - Fix possible uninitialized variable"

    * tag 'trace-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    ftrace: Fix uninitialized variable in match_records()
    ftrace: Remove an unneeded NULL check
    ftrace: Hide cached module code for !CONFIG_MODULES
    tracing: Do note expose stack_trace_filter without DYNAMIC_FTRACE
    tracing: Update Documentation/trace/ftrace.txt
    tracing: Fixup trace file header alignment
    selftests/ftrace: Add a testcase for kprobe event naming
    selftests/ftrace: Add a test to probe module functions
    selftests/ftrace: Update multiple kprobes test for powerpc
    trace/kprobes: Sanitize derived event names
    tracing: Attempt to record other information even if some fail
    tracing: Treat recording tgid for idle task as a success
    tracing: Treat recording comm for idle task as a success
    tracing: Add saved_tgids file to show cached pid to tgid mappings

    Linus Torvalds
     
  • Merge yet more updates from Andrew Morton:

    - various misc things

    - kexec updates

    - sysctl core updates

    - scripts/gdb updates

    - checkpoint-restart updates

    - ipc updates

    - kernel/watchdog updates

    - Kees's "rough equivalent to the glibc _FORTIFY_SOURCE=1 feature"

    - "stackprotector: ascii armor the stack canary"

    - more MM bits

    - checkpatch updates

    * emailed patches from Andrew Morton: (96 commits)
    writeback: rework wb_[dec|inc]_stat family of functions
    ARM: samsung: usb-ohci: move inline before return type
    video: fbdev: omap: move inline before return type
    video: fbdev: intelfb: move inline before return type
    USB: serial: safe_serial: move __inline__ before return type
    drivers: tty: serial: move inline before return type
    drivers: s390: move static and inline before return type
    x86/efi: move asmlinkage before return type
    sh: move inline before return type
    MIPS: SMP: move asmlinkage before return type
    m68k: coldfire: move inline before return type
    ia64: sn: pci: move inline before type
    ia64: move inline before return type
    FRV: tlbflush: move asmlinkage before return type
    CRIS: gpio: move inline before return type
    ARM: HP Jornada 7XX: move inline before return type
    ARM: KVM: move asmlinkage before type
    checkpatch: improve the STORAGE_CLASS test
    mm, migration: do not trigger OOM killer when migrating memory
    drm/i915: use __GFP_RETRY_MAYFAIL
    ...

    Linus Torvalds
     

13 Jul, 2017

18 commits

  • Pull modules updates from Jessica Yu:
    "Summary of modules changes for the 4.13 merge window:

    - Minor code cleanups

    - Avoid accessing mod struct prior to checking module struct version,
    from Kees

    - Fix racy atomic inc/dec logic of kmod_concurrent_max in kmod, from
    Luis"

    * tag 'modules-for-v4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
    module: make the modinfo name const
    kmod: reduce atomic operations on kmod_concurrent and simplify
    module: use list_for_each_entry_rcu() on find_module_all()
    kernel/module.c: suppress warning about unused nowarn variable
    module: Add module name to modinfo
    module: Pass struct load_info into symbol checks

    Linus Torvalds
     
  • Use the ascii-armor canary to prevent unterminated C string overflows
    from being able to successfully overwrite the canary, even if they
    somehow obtain the canary value.

    Inspired by execshield ascii-armor and Daniel Micay's linux-hardened
    tree.

    Link: http://lkml.kernel.org/r/20170524155751.424-3-riel@redhat.com
    Signed-off-by: Rik van Riel
    Acked-by: Kees Cook
    Cc: Daniel Micay
    Cc: "Theodore Ts'o"
    Cc: H. Peter Anvin
    Cc: Andy Lutomirski
    Cc: Ingo Molnar
    Cc: Catalin Marinas
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Defining kexec_purgatory as a zero-length char array upsets compile time
    size checking. Since this is built on a per-arch basis, define it as an
    unsized char array (like is done for other similar things, e.g. linker
    sections). This silences the warning generated by the future
    CONFIG_FORTIFY_SOURCE, which did not like the memcmp() of a "0 byte"
    array. This drops the __weak and uses an extern instead, since both
    users define kexec_purgatory.

    Link: http://lkml.kernel.org/r/1497903987-21002-4-git-send-email-keescook@chromium.org
    Signed-off-by: Kees Cook
    Acked-by: "Eric W. Biederman"
    Cc: Daniel Micay
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • After reconfiguring watchdog sysctls etc., architecture specific
    watchdogs may not get all their parameters updated.

    watchdog_nmi_reconfigure() can be implemented to pull the new values in
    and set the arch NMI watchdog.

    [npiggin@gmail.com: add code comments]
    Link: http://lkml.kernel.org/r/20170617125933.774d3858@roar.ozlabs.ibm.com
    [arnd@arndb.de: hide unused function]
    Link: http://lkml.kernel.org/r/20170620204854.966601-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/20170616065715.18390-5-npiggin@gmail.com
    Signed-off-by: Nicholas Piggin
    Signed-off-by: Arnd Bergmann
    Reviewed-by: Don Zickus
    Tested-by: Babu Moger [sparc]
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     
  • Split SOFTLOCKUP_DETECTOR from LOCKUP_DETECTOR, and split
    HARDLOCKUP_DETECTOR_PERF from HARDLOCKUP_DETECTOR.

    LOCKUP_DETECTOR implies the general boot, sysctl, and programming
    interfaces for the lockup detectors.

    An architecture that wants to use a hard lockup detector must define
    HAVE_HARDLOCKUP_DETECTOR_PERF or HAVE_HARDLOCKUP_DETECTOR_ARCH.

    Alternatively an arch can define HAVE_NMI_WATCHDOG, which provides the
    minimum arch_touch_nmi_watchdog, and it otherwise does its own thing and
    does not implement the LOCKUP_DETECTOR interfaces.

    sparc is unusual in that it has started to implement some of the
    interfaces, but not fully yet. It should probably be converted to a full
    HAVE_HARDLOCKUP_DETECTOR_ARCH.

    [npiggin@gmail.com: fix]
    Link: http://lkml.kernel.org/r/20170617223522.66c0ad88@roar.ozlabs.ibm.com
    Link: http://lkml.kernel.org/r/20170616065715.18390-4-npiggin@gmail.com
    Signed-off-by: Nicholas Piggin
    Reviewed-by: Don Zickus
    Reviewed-by: Babu Moger
    Tested-by: Babu Moger [sparc]
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     
  • For architectures that define HAVE_NMI_WATCHDOG, instead of having them
    provide the complete touch_nmi_watchdog() function, just have them
    provide arch_touch_nmi_watchdog().

    This gives the generic code more flexibility in implementing this
    function, and arch implementations don't miss out on touching the
    softlockup watchdog or other generic details.

    Link: http://lkml.kernel.org/r/20170616065715.18390-3-npiggin@gmail.com
    Signed-off-by: Nicholas Piggin
    Reviewed-by: Don Zickus
    Reviewed-by: Babu Moger
    Tested-by: Babu Moger [sparc]
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     
  • Add a /proc/self/task/<tid>/fail-nth file that allows failing the
    0-th, 1-st, 2-nd and so on calls systematically.
    Excerpt from the added documentation:

    "Write to this file of integer N makes N-th call in the current task
    fail (N is 0-based). Read from this file returns a single char 'Y' or
    'N' that says if the fault setup with a previous write to this file
    was injected or not, and disables the fault if it wasn't yet injected.
    Note that this file enables all types of faults (slab, futex, etc).
    This setting takes precedence over all other generic settings like
    probability, interval, times, etc. But per-capability settings (e.g.
    fail_futex/ignore-private) take precedence over it. This feature is
    intended for systematic testing of faults in a single system call. See
    an example below"
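
    Following the documented interface quoted above, a user-space test could
    look roughly like this sketch (not the example shipped with the patch):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <sys/socket.h>

    int main(void)
    {
        char path[64], res;
        int fds[2], fail_nth;

        snprintf(path, sizeof(path), "/proc/self/task/%ld/fail-nth",
                 (long)syscall(SYS_gettid));
        fail_nth = open(path, O_RDWR);
        if (fail_nth < 0)
            return 1;

        write(fail_nth, "0", 1); /* fail the 0-th following call */
        socketpair(AF_LOCAL, SOCK_STREAM, 0, fds); /* may now fail */
        read(fail_nth, &res, 1); /* 'Y' if the fault was injected */
        printf("fault injected: %c\n", res);
        close(fail_nth);
        return 0;
    }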

    Why add a new setting:
    1. Existing settings are global rather than per-task.
    So parallel testing is not possible.
    2. attr->interval is close but it depends on attr->count
    which is not reset to 0, so interval does not work as expected.
    3. Trying to model this with existing settings requires manipulations
    of all of probability, interval, times, space, task-filter and
    unexposed count and per-task make-it-fail files.
    4. Existing settings are per-failure-type, and the set of failure
    types is potentially expanding.
    5. make-it-fail can't be changed by an unprivileged user and aggressive
    stress testing better be done from an unprivileged user.
    Similarly, this would require opening the debugfs files to the
    unprivileged user, as he would need to reopen at least the times file
    (not possible to pre-open before dropping privs).

    The proposed interface solves all of the above (see the example).

    We want to integrate this into the syzkaller fuzzer. A prototype has
    found 10 bugs in the kernel in the first day of usage:

    https://groups.google.com/forum/#!searchin/syzkaller/%22FAULT_INJECTION%22%7Csort:relevance

    I've made the current interface work with all types of our sandboxes.
    For setuid the secret sauce was prctl(PR_SET_DUMPABLE, 1, 0, 0, 0) to
    make /proc entries non-root owned. So I am fine with the current
    version of the code.

    [akpm@linux-foundation.org: fix build]
    Link: http://lkml.kernel.org/r/20170328130128.101773-1-dvyukov@google.com
    Signed-off-by: Dmitry Vyukov
    Cc: Akinobu Mita
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
     
  • With the current epoll architecture, target files are addressed by
    file_struct and file descriptor number, where the latter is not unique.
    Moreover, files can be transferred from another process via a unix socket,
    added into the queue and then closed, so we won't find this descriptor in
    the task fdinfo list.

    Thus to checkpoint and restore such processes CRIU needs to find out
    where exactly the target file is present to add it into epoll queue.
    For this sake one can use kcmp call where some particular target file
    from the queue is compared with arbitrary file passed as an argument.

    Because epoll target files can have same file descriptor number but
    different file_struct a caller should explicitly specify the offset
    within.

    To test whether some particular file matches an entry inside epoll, one
    has to (a minimal user-space sketch follows this list):

    - fill kcmp_epoll_slot structure with epoll file descriptor,
    target file number and target file offset (in case if only
    one target is present then it should be 0)

    - call kcmp as kcmp(pid1, pid2, KCMP_EPOLL_TFD, fd, &kcmp_epoll_slot)
    - the kernel fetches the file pointer matching file descriptor @fd of pid1
    - it looks up the file struct in the epoll queue of pid2 and returns the
    traditional 0,1,2 result for sorting purposes
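
    Putting these steps together, a minimal user-space sketch (assuming uapi
    headers that already define KCMP_EPOLL_TFD and struct kcmp_epoll_slot):

    #include <sys/types.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/kcmp.h>

    /* does @fd in @pid1 match the epoll target (@epfd, @tfd, @toff) in @pid2? */
    static int epoll_target_matches(pid_t pid1, pid_t pid2, int fd,
                                    int epfd, int tfd, unsigned int toff)
    {
        struct kcmp_epoll_slot slot = {
            .efd  = epfd,
            .tfd  = tfd,
            .toff = toff, /* 0 when only one target is present */
        };

        /* 0 means "same file"; 1 and 2 give an ordering for sorting */
        return syscall(SYS_kcmp, pid1, pid2, KCMP_EPOLL_TFD, fd,
                       (unsigned long)&slot);
    }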

    Link: http://lkml.kernel.org/r/20170424154423.511592110@gmail.com
    Signed-off-by: Cyrill Gorcunov
    Acked-by: Andrey Vagin
    Cc: Al Viro
    Cc: Pavel Emelyanov
    Cc: Michael Kerrisk
    Cc: Jason Baron
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Prevent use of uninitialized memory (originating from the stack frame of
    do_sysctl()) by verifying that the name array is filled with sufficient
    input data before comparing its specific entries with integer constants.

    Through timing measurement or analyzing the kernel debug logs, a
    user-mode program could potentially infer the results of comparisons
    against the uninitialized memory, and acquire some (very limited)
    information about the state of the kernel stack. The change also
    eliminates possible future warnings by tools such as KMSAN and other
    code checkers / instrumentations.

    Link: http://lkml.kernel.org/r/20170524122139.21333-1-mjurczyk@google.com
    Signed-off-by: Mateusz Jurczyk
    Acked-by: Kees Cook
    Cc: "David S. Miller"
    Cc: Matthew Whitehead
    Cc: "Eric W. Biederman"
    Cc: Tetsuo Handa
    Cc: Alexander Potapenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mateusz Jurczyk
     
  • To keep parity with the regular int interfaces, provide an unsigned int
    proc_douintvec_minmax() which allows you to specify a range of allowed
    valid numbers.

    Adding proc_douintvec_minmax_sysadmin() is easy but we can wait for an
    actual user for that.
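
    Usage mirrors proc_dointvec_minmax(); a hedged example of registering an
    unsigned int knob with a [1, 100] range, with all names illustrative:

    static unsigned int demo_val = 10;
    static unsigned int demo_min = 1;
    static unsigned int demo_max = 100;

    static struct ctl_table demo_table[] = {
        {
            .procname     = "demo_uint",
            .data         = &demo_val,
            .maxlen       = sizeof(unsigned int),
            .mode         = 0644,
            .proc_handler = proc_douintvec_minmax,
            .extra1       = &demo_min,
            .extra2       = &demo_max,
        },
        { }
    };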

    Link: http://lkml.kernel.org/r/20170519033554.18592-6-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Acked-by: Kees Cook
    Cc: Subash Abhinov Kasiviswanathan
    Cc: Heinrich Schuchardt
    Cc: Kees Cook
    Cc: "David S. Miller"
    Cc: Ingo Molnar
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • Commit e7d316a02f68 ("sysctl: handle error writing UINT_MAX to u32
    fields") added proc_douintvec() to start help adding support for
    unsigned int, this however was only half the work needed. Two fixes
    have come in since then for the following issues:

    o Printing the values shows a negative value; this happens since
    do_proc_dointvec() is used and this uses proc_put_long()

    This was fixed by commit 5380e5644afbba9 ("sysctl: don't print negative
    flag for proc_douintvec").

    o We can easily wrap around the int values: UINT_MAX is 4294967295, if
    we echo in 4294967295 + 1 we end up with 0, using 4294967295 + 2 we
    end up with 1.
    o We echo negative values in and they are accepted

    This was fixed by commit 425fffd886ba ("sysctl: report EINVAL if value
    is larger than UINT_MAX for proc_douintvec").

    It still also failed to be added to sysctl_check_table()... instead of
    adding it with the current implementation, just provide proper and
    simplified unsigned int support: without any unsigned int array support
    and with no negative support at all.

    Historically sysctl proc helpers have supported arrays; due to the
    complexity this adds, though, we've taken a step back to evaluate array
    users to determine whether it's worth keeping up for unsigned int. An
    evaluation using Coccinelle has been done to perform a grammatical
    search to ask ourselves:

    o How many sysctl proc_dointvec() (int) users exist which likely
    should be moved over to proc_douintvec() (unsigned int) ?
    Answer: about 8
    - Of these how many are array users ?
    Answer: Probably only 1
    o How many sysctl array users exist ?
    Answer: about 12

    This last question gives us an idea of just how popular arrays are: they are not.
    Array support should probably just be kept for strings.

    The identified uint ports are:

    drivers/infiniband/core/ucma.c - max_backlog
    drivers/infiniband/core/iwcm.c - default_backlog
    net/core/sysctl_net_core.c - rps_sock_flow_sysctl()
    net/netfilter/nf_conntrack_timestamp.c - nf_conntrack_timestamp -- bool
    net/netfilter/nf_conntrack_acct.c nf_conntrack_acct -- bool
    net/netfilter/nf_conntrack_ecache.c - nf_conntrack_events -- bool
    net/netfilter/nf_conntrack_helper.c - nf_conntrack_helper -- bool
    net/phonet/sysctl.c proc_local_port_range()

    The only possible array user is proc_local_port_range(), but it does not
    seem worth it to add array support just for this, given that the range
    support works just as well. Unsigned int support should be more desirable
    for when you *need* more than INT_MAX, or when int min/max support does
    not suffice for your ranges.

    If you forget and by mistake happen to register an unsigned int proc
    entry with an array, the driver will fail and you will get something as
    follows:

    sysctl table check failed: debug/test_sysctl//uint_0002 array not allowed
    CPU: 2 PID: 1342 Comm: modprobe Tainted: G W E
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
    Call Trace:
    dump_stack+0x63/0x81
    __register_sysctl_table+0x350/0x650
    ? kmem_cache_alloc_trace+0x107/0x240
    __register_sysctl_paths+0x1b3/0x1e0
    ? 0xffffffffc005f000
    register_sysctl_table+0x1f/0x30
    test_sysctl_init+0x10/0x1000 [test_sysctl]
    do_one_initcall+0x52/0x1a0
    ? kmem_cache_alloc_trace+0x107/0x240
    do_init_module+0x5f/0x200
    load_module+0x1867/0x1bd0
    ? __symbol_put+0x60/0x60
    SYSC_finit_module+0xdf/0x110
    SyS_finit_module+0xe/0x10
    entry_SYSCALL_64_fastpath+0x1e/0xad
    RIP: 0033:0x7f042b22d119

    Fixes: e7d316a02f68 ("sysctl: handle error writing UINT_MAX to u32 fields")
    Link: http://lkml.kernel.org/r/20170519033554.18592-5-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Suggested-by: Alexey Dobriyan
    Cc: Subash Abhinov Kasiviswanathan
    Cc: Liping Zhang
    Cc: Alexey Dobriyan
    Cc: Heinrich Schuchardt
    Cc: Kees Cook
    Cc: "David S. Miller"
    Cc: Ingo Molnar
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • The mode sysctl_writes_strict positional checks keep being copied and pasted
    as we add new proc handlers. Just add a helper to avoid code duplication.

    Link: http://lkml.kernel.org/r/20170519033554.18592-4-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Suggested-by: Kees Cook
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • Document the different sysctl_writes_strict modes in code.

    Link: http://lkml.kernel.org/r/20170519033554.18592-3-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Alexey Dobriyan
    Cc: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • Currently vmcoreinfo data is updated at boot time via subsys_initcall();
    it runs the risk of being modified by some wrong code while the system
    is running.

    As a result, vmcore dumped may contain the wrong vmcoreinfo. Later on,
    when using "crash", "makedumpfile", etc utility to parse this vmcore, we
    probably will get "Segmentation fault" or other unexpected errors.

    E.g. 1) wrong code overwrites vmcoreinfo_data; 2) further crashes the
    system; 3) trigger kdump, then we obviously will fail to recognize the
    crash context correctly due to the corrupted vmcoreinfo.

    Now, except for vmcoreinfo, all the crash data is well
    protected (including the cpu note, which is fully updated in the crash
    path, thus its correctness is guaranteed). Given that vmcoreinfo data
    is a large chunk prepared for kdump, we had better protect it as well.

    To solve this, we relocate and copy vmcoreinfo_data to the crash memory
    when kdump is loading via kexec syscalls. Because the whole crash
    memory will be protected by existing arch_kexec_protect_crashkres()
    mechanism, we naturally protect vmcoreinfo_data from write(even read)
    access under kernel direct mapping after kdump is loaded.

    Since kdump is usually loaded at the very early stage after boot, we can
    trust the correctness of the vmcoreinfo data copied.

    On the other hand, we still need to operate on the vmcoreinfo safe copy
    when a crash happens, to generate vmcoreinfo_note again; we rely on vmap()
    to map out a new kernel virtual address and update the code to use this
    new one instead in the following crash_save_vmcoreinfo().

    BTW, we do not touch vmcoreinfo_note, because it will be fully updated
    using the protected vmcoreinfo_data after crash which is surely correct
    just like the cpu crash note.

    Link: http://lkml.kernel.org/r/1493281021-20737-3-git-send-email-xlpang@redhat.com
    Signed-off-by: Xunlei Pang
    Tested-by: Michael Holzheu
    Cc: Benjamin Herrenschmidt
    Cc: Dave Young
    Cc: Eric Biederman
    Cc: Hari Bathini
    Cc: Juergen Gross
    Cc: Mahesh Salgaonkar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xunlei Pang
     
  • vmcoreinfo_max_size stands for vmcoreinfo_data; the correct one we
    should use is vmcoreinfo_note, whose total size is VMCOREINFO_NOTE_SIZE.

    Like explained in commit 77019967f06b ("kdump: fix exported size of
    vmcoreinfo note"), it should not affect the actual function, but we
    better fix it, also this change should be safe and backward compatible.

    After this, we can get rid of variable vmcoreinfo_max_size, let's use
    the corresponding macros directly, fewer variables means more safety for
    vmcoreinfo operation.

    [xlpang@redhat.com: fix build warning]
    Link: http://lkml.kernel.org/r/1494830606-27736-1-git-send-email-xlpang@redhat.com
    Link: http://lkml.kernel.org/r/1493281021-20737-2-git-send-email-xlpang@redhat.com
    Signed-off-by: Xunlei Pang
    Reviewed-by: Mahesh Salgaonkar
    Reviewed-by: Dave Young
    Cc: Hari Bathini
    Cc: Benjamin Herrenschmidt
    Cc: Eric Biederman
    Cc: Juergen Gross
    Cc: Michael Holzheu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xunlei Pang
     
  • As Eric said,
    "what we need to do is move the variable vmcoreinfo_note out of the
    kernel's .bss section. And modify the code to regenerate and keep this
    information in something like the control page.

    Definitely something like this needs a page all to itself, and ideally
    far away from any other kernel data structures. I clearly was not
    watching closely the day someone decided to keep this silly thing in
    the kernel's .bss section."

    This patch allocates extra pages for these vmcoreinfo_XXX variables; one
    advantage is that it enhances the safety of vmcoreinfo, because
    vmcoreinfo is now kept far away from other kernel data structures.

    Link: http://lkml.kernel.org/r/1493281021-20737-1-git-send-email-xlpang@redhat.com
    Signed-off-by: Xunlei Pang
    Tested-by: Michael Holzheu
    Reviewed-by: Juergen Gross
    Suggested-by: Eric Biederman
    Cc: Benjamin Herrenschmidt
    Cc: Dave Young
    Cc: Hari Bathini
    Cc: Mahesh Salgaonkar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xunlei Pang
     
  • The reason to disable interrupts seems to be to avoid switching to a
    different processor while handling per cpu data using individual loads and
    stores. If we use per cpu RMW (read-modify-write) primitives we will not
    have to disable interrupts.
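
    As a hedged illustration of the difference (my_counter is a hypothetical
    per-CPU variable, not something from the patch):

    static DEFINE_PER_CPU(unsigned long, my_counter);

    /* old pattern: pin the update to one CPU by disabling interrupts */
    static void bump_old(void)
    {
        unsigned long flags;

        local_irq_save(flags);
        __this_cpu_inc(my_counter);
        local_irq_restore(flags);
    }

    /* with a per cpu RMW primitive the irq disable/enable goes away */
    static void bump_new(void)
    {
        this_cpu_inc(my_counter);
    }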

    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1705171055130.5898@east.gentwo.org
    Signed-off-by: Christoph Lameter
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Reported-and-tested-by: Meelis Roos
    Fixes: commit d9e968cb9f84 "getrlimit()/setrlimit(): move compat to native"
    Signed-off-by: Al Viro
    Acked-by: David S. Miller
    Signed-off-by: Linus Torvalds

    Al Viro
     

12 Jul, 2017

3 commits

  • My static checker complains that if "func" is NULL then "clear_filter"
    is uninitialized. This seems like it could be true, although it's
    possible something subtle is happening that I haven't seen.

    kernel/trace/ftrace.c:3844 match_records()
    error: uninitialized symbol 'clear_filter'.

    Link: http://lkml.kernel.org/r/20170712073556.h6tkpjcdzjaozozs@mwanda

    Cc: stable@vger.kernel.org
    Fixes: f0a3b154bd7 ("ftrace: Clarify code for mod command")
    Signed-off-by: Dan Carpenter
    Signed-off-by: Steven Rostedt (VMware)

    Dan Carpenter
     
  • "func" can't be NULL and it doesn't make sense to check because we've
    already dereferenced it.

    Link: http://lkml.kernel.org/r/20170712073340.4enzeojeoupuds5a@mwanda

    Signed-off-by: Dan Carpenter
    Signed-off-by: Steven Rostedt (VMware)

    Dan Carpenter
     
  • With a shared policy in place, when one of the CPUs in the policy is
    hotplugged out and then brought back online, sugov_stop() and
    sugov_start() are called in order.

    sugov_stop() removes utilization hooks for each CPU in the policy and
    does nothing else in the for_each_cpu() loop. sugov_start() on the
    other hand iterates through the CPUs in the policy and re-initializes
    the per-cpu structure _and_ adds the utilization hook. This implies
    that the scheduler is allowed to invoke a CPU's utilization update
    hook when the rest of the per-cpu structures have yet to be
    re-inited.

    Apart from some strange values in tracepoints this doesn't cause a
    problem, but if we do end up accessing a pointer from the per-cpu
    sugov_cpu structure somewhere in the sugov_update_shared() path,
    we will likely see crashes since the memset for another CPU in the
    policy is free to race with sugov_update_shared from the CPU that is
    ready to go. So let's fix this now to first init all per-cpu
    structures, and then add the per-cpu utilization update hooks all at
    once.
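
    A sketch of the resulting ordering in sugov_start(), simplified from the
    description above; the field names are close to, but not guaranteed to
    match, the actual code:

    /* pass 1: fully initialize every per-CPU structure in the policy */
    for_each_cpu(cpu, policy->cpus) {
        struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);

        memset(sg_cpu, 0, sizeof(*sg_cpu));
        sg_cpu->sg_policy = sg_policy;
    }

    /* pass 2: only now expose the utilization update hooks */
    for_each_cpu(cpu, policy->cpus) {
        struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);

        cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util,
                                     policy_is_shared(policy) ?
                                         sugov_update_shared :
                                         sugov_update_single);
    }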

    Signed-off-by: Vikram Mulukutla
    Acked-by: Viresh Kumar
    Signed-off-by: Rafael J. Wysocki

    Vikram Mulukutla