13 Sep, 2016

1 commit


11 Sep, 2016

1 commit

  • Pull libnvdimm fixes from Dan Williams:
    "nvdimm fixes for v4.8, two of them are tagged for -stable:

    - Fix devm_memremap_pages() to use track_pfn_insert(). Otherwise,
    DAX pmd mappings end up with an uncached pgprot, and unusable
    performance for the device-dax interface. The device-dax interface
    appeared in 4.7 so this is tagged for -stable.

    - Fix a couple VM_BUG_ON() checks in the show_smaps() path to
    understand DAX pmd entries. This fix is tagged for -stable.

    - Fix a mis-merge of the nfit machine-check handler to flip the
    polarity of an if() to match the final version of the patch that
    Vishal sent for 4.8-rc1. Without this the nfit machine check
    handler never detects / inserts new 'badblocks' entries which
    applications use to identify lost portions of files.

    - For test purposes, fix the nvdimm_clear_poison() path to operate on
    legacy / simulated nvdimm memory ranges. Without this fix a test
    can set badblocks, but never clear them on these ranges.

    - Fix the range checking done by dax_dev_pmd_fault(). This is not
    tagged for -stable since this problem is mitigated by specifying
    aligned resources at device-dax setup time.

    These patches have appeared in a -next release over the past week. The
    recent rebase you can see in the timestamps was to drop an invalid fix
    as identified by the updated device-dax unit tests [1]. The -mm
    touches have an ack from Andrew"

    [1]: "[ndctl PATCH 0/3] device-dax test for recent kernel bugs"
    https://lists.01.org/pipermail/linux-nvdimm/2016-September/006855.html

    * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    libnvdimm: allow legacy (e820) pmem region to clear bad blocks
    nfit, mce: Fix SPA matching logic in MCE handler
    mm: fix cache mode of dax pmd mappings
    mm: fix show_smap() for zone_device-pmd ranges
    dax: fix mapping size check

    Linus Torvalds
     

10 Sep, 2016

4 commits

  • This work adds BPF_CALL_() macros and converts all the eBPF helper functions
    to use them, in a similar fashion to the SYSCALL_DEFINE() macros that are
    used today. Motivation for this is to hide all the register handling
    and all necessary casts from the user, so that it is done automatically in the
    background when adding a BPF_CALL_() call.

    This makes current helpers easier to review, makes future helpers easier to
    write, avoids getting the casting mess wrong, and allows for extending all
    helpers at once (e.g. build-time checks). It also makes it easier to detect
    in code reviews that unused registers are not instrumented in the code by
    accident, breaking compatibility with existing programs.

    BPF_CALL_() internals are quite similar to SYSCALL_DEFINE() ones with some
    fundamental differences, for example, for generating the actual helper function
    that carries all u64 regs, we need to fill unused regs, so that we always end up
    with 5 u64 regs as an argument.

    I reviewed several 0-5 generated BPF_CALL_() variants of the .i results and
    they all look as expected. No sparse issues spotted. We let this also sit for a
    few days with Fengguang's kbuild test robot, and there were no issues seen. On
    s390, it barked on the "uses dynamic stack allocation" notice, which is an old
    one from bpf_perf_event_output{,_tp}() reappearing here due to the conversion
    to the call wrapper, just telling that the perf raw record/frag sits on stack
    (gcc with s390's -mwarn-dynamicstack), but that's all. Did various runtime tests
    and they were fine as well. All eBPF helpers are now converted to use these
    macros, getting rid of a good chunk of all the raw castings.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
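
The idea of hiding register casts behind a wrapper macro can be sketched in user-space C. This is a simplified illustration under stated assumptions, not the kernel's BPF_CALL_x() implementation: DEMO_CALL_2() and demo_add() are hypothetical names, and only integer argument types are handled.

```c
#include <stdint.h>

/*
 * Hypothetical DEMO_CALL_2(): the public entry point always takes five
 * u64 "registers", while the helper body is written with typed arguments.
 * Unused registers r3..r5 are accepted and ignored. Only integer types
 * are supported in this sketch (no pointer arguments).
 */
#define DEMO_CALL_2(name, t1, a1, t2, a2)				\
	static uint64_t name##_typed(t1 a1, t2 a2);			\
	uint64_t name(uint64_t r1, uint64_t r2, uint64_t r3,		\
		      uint64_t r4, uint64_t r5)				\
	{								\
		(void)r3; (void)r4; (void)r5;				\
		return name##_typed((t1)r1, (t2)r2);			\
	}								\
	static uint64_t name##_typed(t1 a1, t2 a2)

/* The helper body never touches raw registers or casts them by hand. */
DEMO_CALL_2(demo_add, uint32_t, a, int32_t, b)
{
	return (uint64_t)((int64_t)a + b);
}
```

Because the casts live in one macro, a change to the calling convention (or an added build-time check) would only need to be made in that one place.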
     
  • Add BPF_SIZEOF() and BPF_FIELD_SIZEOF() macros to improve the code a bit,
    which otherwise often results in overly long bytes_to_bpf_size(sizeof())
    and bytes_to_bpf_size(FIELD_SIZEOF()) lines. So place them into a macro
    helper instead. Moreover, we currently have a BUILD_BUG_ON(BPF_FIELD_SIZEOF())
    check in convert_bpf_extensions(), but we should rather make that generic
    as well and add a BUILD_BUG_ON() test in all BPF_SIZEOF()/BPF_FIELD_SIZEOF()
    users to detect any rewriter size issues at compile time. Note, there are
    currently none, but we want to assert that it stays this way.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
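
A compile-time size check of this kind can be sketched in plain C11. The names below (DEMO_SIZEOF(), DEMO_FIELD_SIZEOF(), enum demo_size, struct demo_ctx) are hypothetical stand-ins, and the sketch uses _Static_assert rather than the kernel's BUILD_BUG_ON(); it only illustrates rejecting unsupported sizes at build time.

```c
#include <stdint.h>

/* Hypothetical size classes; the values are illustrative only. */
enum demo_size {
	DEMO_SIZE_INVALID = -1,
	DEMO_SIZE_B,	/* 1 byte  */
	DEMO_SIZE_H,	/* 2 bytes */
	DEMO_SIZE_W,	/* 4 bytes */
	DEMO_SIZE_DW	/* 8 bytes */
};

/* Map a byte count to a size class, or DEMO_SIZE_INVALID. */
#define DEMO_BYTES_TO_SIZE(bytes)			\
	((bytes) == 1 ? DEMO_SIZE_B :			\
	 (bytes) == 2 ? DEMO_SIZE_H :			\
	 (bytes) == 4 ? DEMO_SIZE_W :			\
	 (bytes) == 8 ? DEMO_SIZE_DW : DEMO_SIZE_INVALID)

/* Same mapping, but an unsupported byte count fails the build. */
#define DEMO_CHECKED_SIZE(bytes)					\
	(sizeof(struct {						\
		_Static_assert(DEMO_BYTES_TO_SIZE(bytes) >= 0,		\
			       "size is not 1, 2, 4 or 8 bytes");	\
		char dummy;						\
	}) ? DEMO_BYTES_TO_SIZE(bytes) : DEMO_SIZE_INVALID)

#define DEMO_SIZEOF(type) \
	DEMO_CHECKED_SIZE(sizeof(type))
#define DEMO_FIELD_SIZEOF(type, field) \
	DEMO_CHECKED_SIZE(sizeof(((type *)0)->field))

/* Hypothetical context structure used to exercise the macros. */
struct demo_ctx {
	uint8_t  proto;
	uint16_t port;
	uint32_t len;
	uint64_t cookie;
};

int demo_len_size(void)
{
	return DEMO_FIELD_SIZEOF(struct demo_ctx, len);
}
```

With this shape, every user of the macro carries the check, so a field whose size has no matching class would stop the build instead of producing a wrong rewrite.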
     
  • Some minor misc cleanups, e.g. use sizeof(__u32) instead of hardcoding,
    and in __bpf_skb_max_len(), I missed that we always have skb->dev valid
    anyway, so we can drop the unneeded test for dev; a few other misc bits
    are also addressed here.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • track_pfn_insert() in vmf_insert_pfn_pmd() is marking dax mappings as
    uncacheable rendering them impractical for application usage. DAX-pte
    mappings are cached and the goal of establishing DAX-pmd mappings is to
    attain more performance, not dramatically less (3 orders of magnitude).

    track_pfn_insert() relies on a previous call to reserve_memtype() to
    establish the expected page_cache_mode for the range. While memremap()
    arranges for reserve_memtype() to be called, devm_memremap_pages() does
    not. So, teach track_pfn_insert() and untrack_pfn() how to handle
    tracking without a vma, and arrange for devm_memremap_pages() to
    establish the write-back-cache reservation in the memtype tree.

    Cc: stable@vger.kernel.org
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Nilesh Choudhury
    Cc: Kirill A. Shutemov
    Reported-by: Toshi Kani
    Reported-by: Kai Zhang
    Acked-by: Andrew Morton
    Signed-off-by: Dan Williams

    Dan Williams
     

09 Sep, 2016

1 commit

  • LLVM can generate code that tests for direct packet access via
    skb->data/data_end in a way that currently gets rejected by the
    verifier, example:

    [...]
    7: (61) r3 = *(u32 *)(r6 +80)
    8: (61) r9 = *(u32 *)(r6 +76)
    9: (bf) r2 = r9
    10: (07) r2 += 54
    11: (3d) if r3 >= r2 goto pc+12
    R1=inv R2=pkt(id=0,off=54,r=0) R3=pkt_end R4=inv R6=ctx
    R9=pkt(id=0,off=0,r=0) R10=fp
    12: (18) r4 = 0xffffff7a
    14: (05) goto pc+430
    [...]

    from 11 to 24: R1=inv R2=pkt(id=0,off=54,r=0) R3=pkt_end R4=inv
    R6=ctx R9=pkt(id=0,off=0,r=0) R10=fp
    24: (7b) *(u64 *)(r10 -40) = r1
    25: (b7) r1 = 0
    26: (63) *(u32 *)(r6 +56) = r1
    27: (b7) r2 = 40
    28: (71) r8 = *(u8 *)(r9 +20)
    invalid access to packet, off=20 size=1, R9(id=0,off=0,r=0)

    The reason why this gets rejected despite a proper test is that we
    currently call find_good_pkt_pointers() only in case where we detect
    tests like rX > pkt_end, where rX is of type pkt(id=Y,off=Z,r=0) and
    derived, for example, from a register of type pkt(id=Y,off=0,r=0)
    pointing to skb->data. find_good_pkt_pointers() then fills the range
    in the current branch to pkt(id=Y,off=0,r=Z) on success.

    For above case, we need to extend that to recognize pkt_end >= rX
    pattern and mark the other branch that is taken on success with the
    appropriate pkt(id=Y,off=0,r=Z) type via find_good_pkt_pointers().
    Since eBPF operates on BPF_JGT (>) and BPF_JGE (>=), these are the
    only two practical options to test for from what LLVM could have
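    generated, since there's no such thing as BPF_JLT (<) or BPF_JLE (<=)
    that we would need to take into account as well.

    After the fix:

    [...]
    7: (61) r3 = *(u32 *)(r6 +80)
    8: (61) r9 = *(u32 *)(r6 +76)
    9: (bf) r2 = r9
    10: (07) r2 += 54
    11: (3d) if r3 >= r2 goto pc+12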
    R1=inv R2=pkt(id=0,off=54,r=0) R3=pkt_end R4=inv R6=ctx
    R9=pkt(id=0,off=0,r=0) R10=fp
    12: (18) r4 = 0xffffff7a
    14: (05) goto pc+430
    [...]

    from 11 to 24: R1=inv R2=pkt(id=0,off=54,r=54) R3=pkt_end R4=inv
    R6=ctx R9=pkt(id=0,off=0,r=54) R10=fp
    24: (7b) *(u64 *)(r10 -40) = r1
    25: (b7) r1 = 0
    26: (63) *(u32 *)(r6 +56) = r1
    27: (b7) r2 = 40
    28: (71) r8 = *(u8 *)(r9 +20)
    29: (bf) r1 = r8
    30: (25) if r8 > 0x3c goto pc+47
    R1=inv56 R2=imm40 R3=pkt_end R4=inv R6=ctx R8=inv56
    R9=pkt(id=0,off=0,r=54) R10=fp
    31: (b7) r1 = 1
    [...]

    Verifier test cases are also added in this work, one that demonstrates
    the mentioned example here and one that tries a bad packet access for
    the current/fall-through branch (the one with types pkt(id=X,off=Y,r=0),
    pkt(id=X,off=0,r=0)), then a case with good and bad accesses, and two
    with both test variants (>, >=).

    Fixes: 969bf05eb3ce ("bpf: direct packet access")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
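
The source-level shape of the bounds test discussed in this commit can be sketched in ordinary user-space C. This is only an illustration of the "data_end >= ptr, then access in the taken branch" pattern, not verifier code; struct demo_pkt, DEMO_HDR_LEN and demo_read_byte20() are hypothetical names, with 54 and 20 taken from the offsets in the verifier log.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-in for the skb->data / skb->data_end pair. */
struct demo_pkt {
	const uint8_t *data;
	const uint8_t *data_end;
};

#define DEMO_HDR_LEN 54	/* same offset as "r2 += 54" in the log */

/*
 * Returns the byte at offset 20 if at least DEMO_HDR_LEN bytes are
 * available, or -1 otherwise. The test is written as
 * "data_end >= data + len", so the access sits in the branch that is
 * taken on success rather than in the fall-through branch.
 * Assumes the underlying buffer is at least DEMO_HDR_LEN bytes long,
 * so forming data + DEMO_HDR_LEN is valid pointer arithmetic.
 */
int demo_read_byte20(const struct demo_pkt *pkt)
{
	const uint8_t *hdr_end = pkt->data + DEMO_HDR_LEN;

	if (pkt->data_end >= hdr_end)
		return pkt->data[20];	/* inside [data, data + 54) */

	return -1;			/* packet too short */
}
```

A compiler is free to emit either comparison direction for such a check, which is why recognizing both the > and >= forms matters.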
     

07 Sep, 2016

1 commit

  • The newly added bpf_overflow_handler function is only built if both
    CONFIG_EVENT_TRACING and CONFIG_BPF_SYSCALL are enabled, but the caller
    only checks the latter:

    kernel/events/core.c: In function 'perf_event_alloc':
    kernel/events/core.c:9106:27: error: 'bpf_overflow_handler' undeclared (first use in this function)

    This changes the caller so we also skip this call if CONFIG_EVENT_TRACING
    is disabled entirely.

    Signed-off-by: Arnd Bergmann
    Fixes: aa6a5f3cb2b2 ("perf, bpf: add perf events core support for BPF_PROG_TYPE_PERF_EVENT programs")
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Arnd Bergmann
     

05 Sep, 2016

1 commit

  • of_clk_init() ends up calling into pm_qos_update_request() very early
    during boot where irq is expected to stay disabled.
    pm_qos_update_request() uses cancel_delayed_work_sync() which
    correctly assumes that irq is enabled on invocation and
    unconditionally disables and re-enables it.

    Gate cancel_delayed_work_sync() invocation with keventd_up() to avoid
    enabling irq unexpectedly during early boot.

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: Qiao Zhou
    Link: http://lkml.kernel.org/r/d2501c4c-8e7b-bea3-1b01-000b36b5dfe9@asrmicro.com
    Signed-off-by: Rafael J. Wysocki

    Tejun Heo
     

04 Sep, 2016

1 commit


03 Sep, 2016

4 commits

  • Allow attaching BPF_PROG_TYPE_PERF_EVENT programs to sw and hw perf events
    via overflow_handler mechanism.
    When a program is attached, the overflow_handlers become stacked.
    The program acts as a filter.
    Returning zero from the program means that the normal perf_event_output handler
    will not be called and the sampling event won't be stored in the ring buffer.

    The overflow_handler_context==NULL is an additional safety check
    to make sure programs are not attached to hw breakpoints and watchdog
    in case other checks (that prevent that now anyway) get accidentally
    relaxed in the future.

    The program refcnt is incremented in case perf_events are inherited
    when the target task is forked.
    Similar to kprobe and tracepoint programs, there is no ioctl to
    detach the program or swap an already attached program. User space is
    expected to close(perf_event_fd) like it does right now for kprobe+bpf.
    That restriction simplifies the code quite a bit.

    The invocation of overflow_handler in __perf_event_overflow() is now
    done via READ_ONCE, since that pointer can be replaced when the program
    is attached while perf_event itself could have been active already.
    There is no need to do similar treatment for event->prog, since it's
    assigned only once before it's accessed.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
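
The stacked-handler filtering described in this commit can be sketched as a small user-space model. All names here (struct demo_event, demo_attach(), demo_overflow() and so on) are hypothetical and the model is single-threaded; where the kernel reads the handler pointer with READ_ONCE(), the sketch simply reads it once into a local variable.

```c
#include <stddef.h>

struct demo_event;
typedef void (*demo_handler_t)(struct demo_event *ev);
typedef int (*demo_prog_t)(const struct demo_event *ev);

struct demo_event {
	unsigned long sample_period;
	demo_handler_t overflow_handler;	/* currently active handler */
	demo_handler_t orig_overflow_handler;	/* saved when a program attaches */
	demo_prog_t prog;			/* attached filter program */
	int stored;				/* samples "written to the ring buffer" */
};

/* Default handler: record the sample. */
void demo_output(struct demo_event *ev)
{
	ev->stored++;
}

/* Stacked handler: run the program first, then maybe the original. */
static void demo_filter_handler(struct demo_event *ev)
{
	if (ev->prog(ev) == 0)
		return;			/* zero means: drop the sample */
	ev->orig_overflow_handler(ev);
}

/* Attach once; there is no detach or swap in this model either. */
int demo_attach(struct demo_event *ev, demo_prog_t prog)
{
	if (ev->prog)
		return -1;
	ev->prog = prog;
	ev->orig_overflow_handler = ev->overflow_handler;
	ev->overflow_handler = demo_filter_handler;
	return 0;
}

/* Overflow path: read the handler pointer once, then call it. */
void demo_overflow(struct demo_event *ev)
{
	demo_handler_t handler = ev->overflow_handler;

	handler(ev);
}

/* Example filter: keep only samples with a long period. */
int demo_only_long_periods(const struct demo_event *ev)
{
	return ev->sample_period >= 1000;
}
```

Refusing a second attach keeps the model as simple as the restriction described above: the only way to get rid of the program is to tear down the event.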
     
  • Make sure that BPF_PROG_TYPE_PERF_EVENT programs only use
    preallocated hash maps, since doing memory allocation
    in overflow_handler can crash depending on where nmi got triggered.

    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • Introduce BPF_PROG_TYPE_PERF_EVENT programs that can be attached to
    HW and SW perf events (PERF_TYPE_HARDWARE and PERF_TYPE_SOFTWARE
    correspondingly in uapi/linux/perf_event.h)

    The program visible context meta structure is
    struct bpf_perf_event_data {
            struct pt_regs regs;
            __u64 sample_period;
    };
    which is accessible directly from the program:
    int bpf_prog(struct bpf_perf_event_data *ctx)
    {
            ... ctx->sample_period ...
            ... ctx->regs.ip ...
    }

    The bpf verifier rewrites the accesses into kernel internal
    struct bpf_perf_event_data_kern which allows changing
    struct perf_sample_data without affecting bpf programs.
    New fields can be added to the end of struct bpf_perf_event_data
    in the future.

    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • The verifier supported only 4-byte metafields in
    struct __sk_buff and struct xdp_md. The metafields in the upcoming
    struct bpf_perf_event_data are 8-byte to match register width in struct pt_regs.
    Teach the verifier to recognize 8-byte metafield access.
    The patch doesn't affect safety of sockets and xdp programs.
    They check for 4-byte only ctx access before these conditions are hit.

    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

02 Sep, 2016

8 commits

  • Since commit 1f3b0f8243cb934 ("tick/nohz: Optimize nohz idle enter"),
    tick_nohz_start_idle() is not called if the idle tick can't be stopped.
    As a result, after suspending and resuming the host machine, a full
    dynticks kvm guest will softlockup:

    NMI watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [swapper/0:0]
    Call Trace:
    default_idle+0x31/0x1a0
    arch_cpu_idle+0xf/0x20
    default_idle_call+0x2a/0x50
    cpu_startup_entry+0x39b/0x4d0
    rest_init+0x138/0x140
    ? rest_init+0x5/0x140
    start_kernel+0x4c1/0x4ce
    ? set_init_arg+0x55/0x55
    ? early_idt_handler_array+0x120/0x120
    x86_64_start_reservations+0x24/0x26
    x86_64_start_kernel+0x142/0x14f

    In addition, cat /proc/stat | grep cpu in guest or host:

    cpu 398 16 5049 15754 5490 0 1 46 0 0
    cpu0 206 5 450 0 0 0 1 14 0 0
    cpu1 81 0 3937 3149 1514 0 0 9 0 0
    cpu2 45 6 332 6052 2243 0 0 11 0 0
    cpu3 65 2 328 6552 1732 0 0 11 0 0

    The idle and iowait counters are unexpectedly 0 for cpu0 (housekeeping).

    The bug is present in both guest and host kernels, and both show the
    cpu0 idle and iowait accounting issue. However, the host kernel's
    suspend/resume path touches the watchdog, which avoids the softlockup.

    - The watchdog will not be touched in the tick_nohz_stop_idle() path
    (it needs to be touched, since the scheduler stall is expected) if the
    idle_active flag is not detected.
    - The idle and iowait states will not be accounted when exiting the
    idle loop (resched or interrupt) if the idle start time and the
    idle_active flag are not set.

    This patch fixes it by reverting commit 1f3b0f8243cb934, since being
    unable to stop the idle tick does not mean the CPU cannot be idle.

    Fixes: 1f3b0f8243cb934 ("tick/nohz: Optimize nohz idle enter")
    Signed-off-by: Wanpeng Li
    Cc: Sanjeev Yadav
    Cc: Gaurav Jindal
    Cc: stable@vger.kernel.org
    Cc: kvm@vger.kernel.org
    Cc: Radim Krčmář
    Cc: Peter Zijlstra
    Cc: Paolo Bonzini
    Link: http://lkml.kernel.org/r/1472798303-4154-1-git-send-email-wanpeng.li@hotmail.com
    Signed-off-by: Thomas Gleixner

    Wanpeng Li
     
  • Merge fixes from Andrew Morton:
    "14 fixes"

    * emailed patches from Andrew Morton:
    rapidio/tsi721: fix incorrect detection of address translation condition
    rapidio/documentation/mport_cdev: add missing parameter description
    kernel/fork: fix CLONE_CHILD_CLEARTID regression in nscd
    MAINTAINERS: Vladimir has moved
    mm, mempolicy: task->mempolicy must be NULL before dropping final reference
    printk/nmi: avoid direct printk()-s from __printk_nmi_flush()
    treewide: remove references to the now unnecessary DEFINE_PCI_DEVICE_TABLE
    drivers/scsi/wd719x.c: remove last declaration using DEFINE_PCI_DEVICE_TABLE
    mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator
    lib/test_hash.c: fix warning in preprocessor symbol evaluation
    lib/test_hash.c: fix warning in two-dimensional array init
    kconfig: tinyconfig: provide whole choice blocks to avoid warnings
    kexec: fix double-free when failing to relocate the purgatory
    mm, oom: prevent premature OOM killer invocation for high order request

    Linus Torvalds
     
  • Commit fec1d0115240 ("[PATCH] Disable CLONE_CHILD_CLEARTID for abnormal
    exit") has caused a subtle regression in nscd which uses
    CLONE_CHILD_CLEARTID to clear the nscd_certainly_running flag in the
    shared databases, so that the clients are notified when nscd is
    restarted. Now, when nscd uses a non-persistent database, clients that
    have it mapped keep thinking the database is being updated by nscd, when
    in fact nscd has created a new (anonymous) one (for non-persistent
    databases it uses an unlinked file as backend).

    The original proposal for the CLONE_CHILD_CLEARTID change claimed
    (https://lkml.org/lkml/2006/10/25/233):

    : The NPTL library uses the CLONE_CHILD_CLEARTID flag on clone() syscalls
    : on behalf of pthread_create() library calls. This feature is used to
    : request that the kernel clear the thread-id in user space (at an address
    : provided in the syscall) when the thread disassociates itself from the
    : address space, which is done in mm_release().
    :
    : Unfortunately, when a multi-threaded process incurs a core dump (such as
    : from a SIGSEGV), the core-dumping thread sends SIGKILL signals to all of
    : the other threads, which then proceed to clear their user-space tids
    : before synchronizing in exit_mm() with the start of core dumping. This
    : misrepresents the state of process's address space at the time of the
    : SIGSEGV and makes it more difficult for someone to debug NPTL and glibc
    : problems (misleading him/her to conclude that the threads had gone away
    : before the fault).
    :
    : The fix below is to simply avoid the CLONE_CHILD_CLEARTID action if a
    : core dump has been initiated.

    The resulting patch from Roland (https://lkml.org/lkml/2006/10/26/269)
    seems to have a larger scope than the original patch asked for. It
    seems that limiting the scope of the check to core dumping should work
    for the SIGSEGV issue described above.

    [Changelog partly based on Andreas' description]
    Fixes: fec1d0115240 ("[PATCH] Disable CLONE_CHILD_CLEARTID for abnormal exit")
    Link: http://lkml.kernel.org/r/1471968749-26173-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Tested-by: William Preston
    Acked-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Andreas Schwab
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • KASAN allocates memory from the page allocator as part of
    kmem_cache_free(), and that can reference current->mempolicy through any
    number of allocation functions. It needs to be NULL'd out before the
    final reference is dropped to prevent a use-after-free bug:

    BUG: KASAN: use-after-free in alloc_pages_current+0x363/0x370 at addr ffff88010b48102c
    CPU: 0 PID: 15425 Comm: trinity-c2 Not tainted 4.8.0-rc2+ #140
    ...
    Call Trace:
    dump_stack
    kasan_object_err
    kasan_report_error
    __asan_report_load2_noabort
    alloc_pages_current
    [...]

    This patch sets current->mempolicy to NULL before dropping the final
    reference.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1608301442180.63329@chino.kir.corp.google.com
    Fixes: cd11016e5f52 ("mm, kasan: stackdepot implementation. Enable stackdepot for SLAB")
    Signed-off-by: David Rientjes
    Reported-by: Vegard Nossum
    Acked-by: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: stable@vger.kernel.org [4.6+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • __printk_nmi_flush() can be called from nmi_panic(), therefore it has to
    test whether it's executed in NMI context and thus must route the
    messages through deferred printk() or via direct printk().

    This is to avoid potential deadlocks, as described in commit
    cf9b1106c81c ("printk/nmi: flush NMI messages on the system panic").

    However there remain two places where __printk_nmi_flush() does
    unconditional direct printk() calls:

    - pr_err("printk_nmi_flush: internal error ...")
    - pr_cont("\n")

    Factor out print_nmi_seq_line() parts into a new printk_nmi_flush_line()
    function, which takes care of in_nmi(), and use it in
    __printk_nmi_flush() for printing and error-reporting.

    Link: http://lkml.kernel.org/r/20160830161354.581-1-sergey.senozhatsky@gmail.com
    Signed-off-by: Sergey Senozhatsky
    Cc: Petr Mladek
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Using "make tinyconfig" produces a couple of annoying warnings that show
    up for build test machines all the time:

    .config:966:warning: override: NOHIGHMEM changes choice state
    .config:965:warning: override: SLOB changes choice state
    .config:963:warning: override: KERNEL_XZ changes choice state
    .config:962:warning: override: CC_OPTIMIZE_FOR_SIZE changes choice state
    .config:933:warning: override: SLOB changes choice state
    .config:930:warning: override: CC_OPTIMIZE_FOR_SIZE changes choice state
    .config:870:warning: override: SLOB changes choice state
    .config:868:warning: override: KERNEL_XZ changes choice state
    .config:867:warning: override: CC_OPTIMIZE_FOR_SIZE changes choice state

    I've made a previous attempt at fixing them and we discussed a number of
    alternatives.

    I tried changing the Makefile to use "merge_config.sh -n
    $(fragment-list)" but couldn't get that to work properly.

    This is yet another approach, based on the observation that we do want
    to see a warning for conflicting 'choice' options, and that we can
    simply make them non-conflicting by listing all other options as
    disabled. This is a trivial patch that we can apply independent of
    plans for other changes.

    Link: http://lkml.kernel.org/r/20160829214952.1334674-2-arnd@arndb.de
    Link: https://storage.kernelci.org/mainline/v4.7-rc6/x86-tinyconfig/build.log
    https://patchwork.kernel.org/patch/9212749/
    Signed-off-by: Arnd Bergmann
    Reviewed-by: Josh Triplett
    Reviewed-by: Masahiro Yamada
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     
  • If kexec_apply_relocations fails, kexec_load_purgatory frees pi->sechdrs
    and pi->purgatory_buf. This is redundant, because in case of error
    kimage_file_prepare_segments calls kimage_file_post_load_cleanup, which
    will also free those buffers.

    This causes two warnings like the following, one for pi->sechdrs and the
    other for pi->purgatory_buf:

    kexec-bzImage64: Loading purgatory failed
    ------------[ cut here ]------------
    WARNING: CPU: 1 PID: 2119 at mm/vmalloc.c:1490 __vunmap+0xc1/0xd0
    Trying to vfree() nonexistent vm area (ffffc90000e91000)
    Modules linked in:
    CPU: 1 PID: 2119 Comm: kexec Not tainted 4.8.0-rc3+ #5
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
    dump_stack+0x4d/0x65
    __warn+0xcb/0xf0
    warn_slowpath_fmt+0x4f/0x60
    ? find_vmap_area+0x19/0x70
    ? kimage_file_post_load_cleanup+0x47/0xb0
    __vunmap+0xc1/0xd0
    vfree+0x2e/0x70
    kimage_file_post_load_cleanup+0x5e/0xb0
    SyS_kexec_file_load+0x448/0x680
    ? putname+0x54/0x60
    ? do_sys_open+0x190/0x1f0
    entry_SYSCALL_64_fastpath+0x13/0x8f
    ---[ end trace 158bb74f5950ca2b ]---

    Fix by setting pi->sechdrs and pi->purgatory_buf to NULL, since vfree
    won't try to free a NULL pointer.

    Link: http://lkml.kernel.org/r/1472083546-23683-1-git-send-email-bauerman@linux.vnet.ibm.com
    Signed-off-by: Thiago Jung Bauermann
    Acked-by: Baoquan He
    Cc: "Eric W. Biederman"
    Cc: Vivek Goyal
    Cc: Dave Young
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thiago Jung Bauermann
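
The "reset the pointer after freeing" fix can be sketched in user-space C, using malloc()/free() in place of the kernel's vmalloc()/vfree(). The names struct demo_purgatory, demo_load() and demo_cleanup() are hypothetical; the point is only that a later shared cleanup becomes harmless once the error path leaves NULL pointers behind.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the two buffers named in the commit. */
struct demo_purgatory {
	void *sechdrs;
	void *purgatory_buf;
};

/* Shared cleanup, called by the outer error path and on normal teardown. */
void demo_cleanup(struct demo_purgatory *pi)
{
	free(pi->sechdrs);		/* free(NULL) is a no-op */
	pi->sechdrs = NULL;
	free(pi->purgatory_buf);
	pi->purgatory_buf = NULL;
}

/*
 * Returns 0 on success, -1 on failure. fail_relocation simulates a
 * failing relocation step. On failure the buffers are released and the
 * pointers are reset, so a following demo_cleanup() cannot free them
 * a second time.
 */
int demo_load(struct demo_purgatory *pi, int fail_relocation)
{
	pi->sechdrs = malloc(64);
	pi->purgatory_buf = malloc(128);

	if (!pi->sechdrs || !pi->purgatory_buf || fail_relocation) {
		free(pi->sechdrs);
		pi->sechdrs = NULL;
		free(pi->purgatory_buf);
		pi->purgatory_buf = NULL;
		return -1;
	}
	return 0;
}
```

Calling demo_cleanup() after a failed demo_load() is then safe, which mirrors the situation where the caller always runs its own cleanup on error.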
     
  • Pull audit fixes from Paul Moore:
    "Two small patches to fix some bugs with the audit-by-executable
    functionality we introduced back in v4.3 (both patches are marked
    for the stable folks)"

    * 'stable-4.8' of git://git.infradead.org/users/pcmoore/audit:
    audit: fix exe_file access in audit_exe_compare
    mm: introduce get_task_exe_file

    Linus Torvalds
     

01 Sep, 2016

2 commits

  • Prior to the change the function would blindly dereference mm, exe_file
    and exe_file->f_inode, each of which could have been NULL or freed.

    Use get_task_exe_file to safely obtain a stable exe_file.

    Signed-off-by: Mateusz Guzik
    Acked-by: Konstantin Khlebnikov
    Acked-by: Richard Guy Briggs
    Cc: stable@vger.kernel.org # 4.3.x
    Signed-off-by: Paul Moore

    Mateusz Guzik
     
  • For more convenient access if one has a pointer to the task.

    As a minor nit take advantage of the fact that only task lock + rcu are
    needed to safely grab ->exe_file. This saves the mm refcount dance.

    Use the helper in proc_exe_link.

    Signed-off-by: Mateusz Guzik
    Acked-by: Konstantin Khlebnikov
    Acked-by: Richard Guy Briggs
    Cc: stable@vger.kernel.org # 4.3.x
    Signed-off-by: Paul Moore

    Mateusz Guzik
     

31 Aug, 2016

3 commits

  • Pull seccomp fix from Kees Cook:
    "Fix fatal signal delivery after ptrace reordering"

    * tag 'seccomp-v4.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    seccomp: Fix tracer exit notifications during fatal signals

    Linus Torvalds
     
  • This fixes a ptrace vs fatal pending signals bug as manifested in
    seccomp now that seccomp was reordered to happen after ptrace. The
    short version is that seccomp should not attempt to call do_exit()
    while fatal signals are pending under a tracer. The existing code was
    trying to be as defensively paranoid as possible, but it now ends up
    confusing ptrace. Instead, the syscall can just be skipped (which solves
    the original concern that the do_exit() was addressing) and normal signal
    handling, tracer notification, and process death can happen.

    Paraphrasing from the original bug report:

    If a tracee task is in a PTRACE_EVENT_SECCOMP trap, or has been resumed
    after such a trap but not yet been scheduled, and another task in the
    thread-group calls exit_group(), then the tracee task exits without the
    ptracer receiving a PTRACE_EVENT_EXIT notification. Test case here:
    https://gist.github.com/khuey/3c43ac247c72cef8c956ca73281c9be7

    The bug happens because when __seccomp_filter() detects
    fatal_signal_pending(), it calls do_exit() without dequeuing the fatal
    signal. When do_exit() sends the PTRACE_EVENT_EXIT notification and
    that task is descheduled, __schedule() notices that there is a fatal
    signal pending and changes its state from TASK_TRACED to TASK_RUNNING.
    That prevents the ptracer's waitpid() from returning the ptrace event.
    A more detailed analysis is here:
    https://github.com/mozilla/rr/issues/1762#issuecomment-237396255.

    Reported-by: Robert O'Callahan
    Reported-by: Kyle Huey
    Tested-by: Kyle Huey
    Fixes: 93e35efb8de4 ("x86/ptrace: run seccomp after ptrace")
    Signed-off-by: Kees Cook
    Acked-by: Oleg Nesterov
    Acked-by: James Morris

    Kees Cook
     
  • Pull cgroup fixes from Tejun Heo:
    "Two fixes for cgroup.

    - There still was a hole in enforcing cpuset rules, fixed by Li.

    - The recent switch to global percpu_rwsem for threadgroup locking
    revealed a couple issues in how percpu_rwsem is implemented and
    used by cgroup. Balbir found that the read locking section was too
    wide unnecessarily including operations which can often depend on
    IOs. With percpu_rwsem updates (coming through a different tree)
    and reduction of read locking section, all the reported locking
    latency issues, including the android one, are resolved.

    It looks like we can keep global percpu_rwsem locking for now. If
    there actually are cases which can't be resolved, we can go back to
    more complex per-signal_struct locking"

    * 'for-4.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: reduce read locked section of cgroup_threadgroup_rwsem during fork
    cpuset: make sure new tasks conform to the current config of the cpuset

    Linus Torvalds
     

30 Aug, 2016

1 commit


29 Aug, 2016

3 commits

  • Pull perf fixes from Thomas Gleixner:
    "A few fixes from the perf department

    - prevent an imbalanced preemption disable in the events teardown code
    - prevent out of bound access in perf userspace
    - make perf tools compile with UCLIBC again
    - a fix for the userspace unwinder utility"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/core: Use this_cpu_ptr() when stopping AUX events
    perf evsel: Do not access outside hw cache name arrays
    tools lib: Reinstate strlcpy() header guard with __UCLIBC__
    perf unwind: Use addr_location::addr instead of ip for entries

    Linus Torvalds
     
  • Pull irq fixes from Thomas Gleixner:
    "This lot provides:

    - plug a hotplug race in the new affinity infrastructure
    - a fix for the trigger type of chained interrupts
    - plug a potential memory leak in the core code
    - a few fixes for ARM and MIPS GICs"

    * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    irqchip/mips-gic: Implement activate op for device domain
    irqchip/mips-gic: Cleanup chip and handler setup
    genirq/affinity: Use get/put_online_cpus around cpumask operations
    genirq: Fix potential memleak when failing to get irq pm
    irqchip/gicv3-its: Disable the ITS before initializing it
    irqchip/gicv3: Remove disabling redistributor and group1 non-secure interrupts
    irqchip/gic: Allow self-SGIs for SMP on UP configurations
    genirq: Correctly configure the trigger on chained interrupts

    Linus Torvalds
     
  • Pull timer fixes from Thomas Gleixner:
    "A few updates for timers & co:

    - prevent a livelock in the timekeeping code when debugging is
    enabled

    - prevent out of bounds access in the timekeeping debug code

    - various fixes in clocksource drivers

    - a new maintainers entry"

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    clocksource/drivers/sun4i: Clear interrupts after stopping timer in probe function
    drivers/clocksource/pistachio: Fix memory corruption in init
    clocksource/drivers/timer-atmel-pit: Enable mck clock
    clocksource/drivers/pxa: Fix include files for compilation
    MAINTAINERS: Add ARM ARCHITECTED TIMER entry
    timekeeping: Cap array access in timekeeping_debug
    timekeeping: Avoid taking lock in NMI path with CONFIG_DEBUG_TIMEKEEPING

    Linus Torvalds
     

27 Aug, 2016

4 commits

  • Merge fixes from Andrew Morton:
    "11 fixes"

    * emailed patches from Andrew Morton:
    mm: silently skip readahead for DAX inodes
    dax: fix device-dax region base
    fs/seq_file: fix out-of-bounds read
    mm: memcontrol: avoid unused function warning
    mm: clarify COMPACTION Kconfig text
    treewide: replace config_enabled() with IS_ENABLED() (2nd round)
    printk: fix parsing of "brl=" option
    soft_dirty: fix soft_dirty during THP split
    sysctl: handle error writing UINT_MAX to u32 fields
    get_maintainer: quiet noisy implicit -f vcs_file_exists checking
    byteswap: don't use __builtin_bswap*() with sparse

    Linus Torvalds
     
  • Pull block fixes from Jens Axboe:
    "Here's a set of block fixes for the current 4.8-rc release. This
    contains:

    - a fix for a secure erase regression, from Adrian.

    - a fix for an mmc use-after-free bug regression, also from Adrian.

    - a fix for a potential NULL pointer dereference in bdev freezing, from Andrey.

    - a race fix for blk_set_queue_dying() from Bart.

    - a set of xen blkfront fixes from Bob Liu.

    - three small fixes for bcache, from Eric and Kent.

    - a fix for a potential invalid NVMe state transition, from Gabriel.

    - blk-mq CPU offline fix, preventing us from issuing and completing a
    request on the wrong queue. From me.

    - revert two previous floppy changes, since they caused a user-visible
    regression. A better fix is in the works.

    - ensure that we don't send down bios that have more than 256
    elements in them. Fixes a crash with bcache, for example. From
    Ming.

    - a fix for dereferencing an error pointer with cgroup writeback.
    Fixes a regression. From Vegard"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    mmc: fix use-after-free of struct request
    Revert "floppy: refactor open() flags handling"
    Revert "floppy: fix open(O_ACCMODE) for ioctl-only open"
    fs/block_dev: fix potential NULL ptr deref in freeze_bdev()
    blk-mq: improve warning for running a queue on the wrong CPU
    blk-mq: don't overwrite rq->mq_ctx
    block: make sure a big bio is split into at most 256 bvecs
    nvme: Fix nvme_get/set_features() with a NULL result pointer
    bdev: fix NULL pointer dereference
    xen-blkfront: free resources if xlvbd_alloc_gendisk fails
    xen-blkfront: introduce blkif_set_queue_limits()
    xen-blkfront: fix places not updated after introducing 64KB page granularity
    bcache: pr_err: more meaningful error message when nr_stripes is invalid
    bcache: RESERVE_PRIO is too small by one when prio_buckets() is a power of two.
    bcache: register_bcache(): call blkdev_put() when cache_alloc() fails
    block: Fix race triggered by blk_set_queue_dying()
    block: Fix secure erase
    nvme: Prevent controller state invalid transition

    Linus Torvalds
     
  • Commit bbeddf52adc1 ("printk: move braille console support into separate
    braille.[ch] files") moved the parsing of braille-related options into
    _braille_console_setup(), changing the type of variable str from char*
    to char**. In this commit, memcmp(str, "brl,", 4) was correctly updated
    to memcmp(*str, "brl,", 4) but not memcmp(str, "brl=", 4).

    Update the code to make "brl=" option work again and replace memcmp()
    with strncmp() to make the compiler able to detect such an issue.

    Fixes: bbeddf52adc1 ("printk: move braille console support into separate braille.[ch] files")
    Link: http://lkml.kernel.org/r/20160823165700.28952-1-nicolas.iooss_linux@m4x.org
    Signed-off-by: Nicolas Iooss
    Cc: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicolas Iooss
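
    The pointer-type mix-up described in the entry above can be sketched in user space. This is a minimal illustration, not the kernel's code: match_brl_equals() is a hypothetical helper, and only the char ** cursor and the "brl=" prefix come from the commit description.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for the option parser: `str` points at the
 * caller's cursor into the option string, as in the commit description.
 * Returns 1 and advances the cursor past "brl=" on a match, else 0.
 */
static int match_brl_equals(char **str)
{
	/*
	 * memcmp() takes `const void *`, so the buggy memcmp(str, "brl=", 4)
	 * compiles silently and compares the bytes of the pointer itself.
	 * strncmp() takes `const char *`, so passing `str` instead of `*str`
	 * would draw an incompatible-pointer diagnostic from the compiler.
	 */
	if (strncmp(*str, "brl=", 4) == 0) {
		*str += 4;
		return 1;
	}
	return 0;
}
```

    Given a cursor that points at "brl=ttyS0", the helper returns 1 and leaves the cursor at "ttyS0". For any other prefix it returns 0 and leaves the cursor untouched.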
     
  • We have scripts which write to certain fields on 3.18 kernels but this
    seems to be failing on 4.4 kernels. An entry which we write to here is
    xfrm_aevent_rseqth which is u32.

    echo 4294967295 > /proc/sys/net/core/xfrm_aevent_rseqth

    Commit 230633d109e3 ("kernel/sysctl.c: detect overflows when converting
    to int") prevented writing to sysctl entries when integer overflow
    occurs. However, this does not apply to unsigned integers.

    Heinrich suggested that we introduce a new option to handle 64 bit
    limits and set min as 0 and max as UINT_MAX. This might not work as it
    leads to issues similar to __do_proc_doulongvec_minmax. Alternatively,
    we would need to change the datatype of the entry to 64 bit.

    static int __do_proc_doulongvec_minmax(void *data, struct ctl_table
    {
    i = (unsigned long *) data; // This cast causes a read beyond the size of data (u32)
    vleft = table->maxlen / sizeof(unsigned long); // vleft is 0 because maxlen is sizeof(u32), which is less than sizeof(unsigned long) on x86_64

    Introduce a new proc handler proc_douintvec. Individual proc entries
    will need to be updated to use the new handler.

    [akpm@linux-foundation.org: coding-style fixes]
    Fixes: 230633d109e3 ("kernel/sysctl.c:detect overflows when converting to int")
    Link: http://lkml.kernel.org/r/1471479806-5252-1-git-send-email-subashab@codeaurora.org
    Signed-off-by: Subash Abhinov Kasiviswanathan
    Cc: Heinrich Schuchardt
    Cc: Kees Cook
    Cc: "David S. Miller"
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Subash Abhinov Kasiviswanathan
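
    The range problem described in the entry above can be sketched in user space. This is an illustration under assumptions, not the kernel's proc_douintvec: parse_as_int() and parse_as_uint() are hypothetical helpers, and a 32-bit int is assumed. The signed check rejects 4294967295, while an unsigned check accepts anything up to UINT_MAX.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: parse decimal text, rejecting values outside int. */
static int parse_as_int(const char *s, int *out)
{
	char *end;
	long long v;

	errno = 0;
	v = strtoll(s, &end, 10);
	if (errno || end == s || *end != '\0' || v < INT_MIN || v > INT_MAX)
		return -EINVAL;
	*out = (int)v;
	return 0;
}

/* Hypothetical helper: parse decimal text, accepting 0..UINT_MAX only. */
static int parse_as_uint(const char *s, unsigned int *out)
{
	char *end;
	unsigned long long v;

	if (s[0] == '-')	/* strtoull() would silently negate */
		return -EINVAL;
	errno = 0;
	v = strtoull(s, &end, 10);
	if (errno || end == s || *end != '\0' || v > UINT_MAX)
		return -EINVAL;
	*out = (unsigned int)v;
	return 0;
}
```

    With these helpers, "4294967295" fails parse_as_int() with -EINVAL but succeeds in parse_as_uint(), which is the behavior the scripts in the report rely on for a u32 entry.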
     

24 Aug, 2016

3 commits

  • When tearing down an AUX buf for an event via perf_mmap_close(),
    __perf_event_output_stop() is called on the event's CPU to ensure that
    trace generation is halted before the process of unmapping and
    freeing the buffer pages begins.

    The callback is performed via cpu_function_call(), which ensures that it
    runs with interrupts disabled and is therefore not preemptible.
    Unfortunately, the current code grabs the per-cpu context pointer using
    get_cpu_ptr(), which unnecessarily disables preemption and doesn't pair
    the call with put_cpu_ptr(), leading to a preempt_count() imbalance and
    a BUG when freeing the AUX buffer later on:

    WARNING: CPU: 1 PID: 2249 at kernel/events/ring_buffer.c:539 __rb_free_aux+0x10c/0x120
    Modules linked in:
    [...]
    Call Trace:
    dump_stack+0x4f/0x72
    __warn+0xc6/0xe0
    warn_slowpath_null+0x18/0x20
    __rb_free_aux+0x10c/0x120
    rb_free_aux+0x13/0x20
    perf_mmap_close+0x29e/0x2f0
    ? perf_iterate_ctx+0xe0/0xe0
    remove_vma+0x25/0x60
    exit_mmap+0x106/0x140
    mmput+0x1c/0xd0
    do_exit+0x253/0xbf0
    do_group_exit+0x3e/0xb0
    get_signal+0x249/0x640
    do_signal+0x23/0x640
    ? _raw_write_unlock_irq+0x12/0x30
    ? _raw_spin_unlock_irq+0x9/0x10
    ? __schedule+0x2c6/0x710
    exit_to_usermode_loop+0x74/0x90
    prepare_exit_to_usermode+0x26/0x30
    retint_user+0x8/0x10

    This patch uses this_cpu_ptr() instead of get_cpu_ptr(), since preemption is
    already disabled by the caller.

    Signed-off-by: Will Deacon
    Reviewed-by: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Fixes: 95ff4ca26c49 ("perf/core: Free AUX pages in unmap path")
    Link: http://lkml.kernel.org/r/20160824091905.GA16944@arm.com
    Signed-off-by: Ingo Molnar

    Will Deacon
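
    The preempt_count() imbalance described in the entry above can be modeled with a plain counter. This is a user-space toy, not kernel code: every model_* function and both callbacks are hypothetical stand-ins. The only behavior assumed is that get_cpu_ptr() disables preemption, put_cpu_ptr() re-enables it, and this_cpu_ptr() changes nothing.

```c
#include <assert.h>

/* User-space stand-ins; none of these are the kernel's implementations. */
static int preempt_count;	/* models preempt_count() */
static int cpu_context;		/* models one per-CPU context object */

/* Models get_cpu_ptr(): disables preemption, then returns the pointer. */
static int *model_get_cpu_ptr(void)
{
	preempt_count++;
	return &cpu_context;
}

/* Models put_cpu_ptr(): re-enables preemption. */
static void model_put_cpu_ptr(void)
{
	assert(preempt_count > 0);
	preempt_count--;
}

/* Models this_cpu_ptr(): returns the pointer, leaves the count alone. */
static int *model_this_cpu_ptr(void)
{
	return &cpu_context;
}

/* Buggy callback shape: a get with no matching put. */
static void output_stop_buggy(void)
{
	int *ctx = model_get_cpu_ptr();

	*ctx = 0;
}

/* Fixed callback shape: the caller already runs non-preemptibly. */
static void output_stop_fixed(void)
{
	int *ctx = model_this_cpu_ptr();

	*ctx = 0;
}

/* Runs a callback and reports how much the modeled count leaked. */
static int leaked_preempt_count(void (*fn)(void))
{
	int before = preempt_count;

	fn();
	return preempt_count - before;
}
```

    In this model the buggy callback leaves the counter one higher than it found it, and only a later model_put_cpu_ptr() restores it. The fixed callback leaves the counter unchanged.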
     
  • It was reported that hibernation could fail on the 2nd attempt, where the
    system hangs at hibernate() -> syscore_resume() -> i8237A_resume() ->
    claim_dma_lock(), because the lock has already been taken.

    However, no other process actually tries to grab this lock on that
    problematic platform.

    Further investigation showed that the problem is triggered by setting
    /sys/power/pm_trace to 1 before the 1st hibernation.

    Once pm_trace is enabled, the RTC value is no longer meaningful after
    suspend. Meanwhile, some BIOSes adjust an 'invalid' RTC (e.g., one earlier
    than 1970) to the release date of the motherboard during POST. After
    resume, the system may therefore appear to have slept for a very long
    time, which is a completely meaningless value.

    Then, in timekeeping_resume() -> tk_debug_account_sleep_time(), if bit 31 of
    the sleep time happens to be set, fls() returns 32 and we add 1 to
    sleep_time_bin[32], which is an out-of-bounds array access and therefore
    overwrites adjacent memory.

    As depicted by System.map:
    0xffffffff81c9d080 b sleep_time_bin
    0xffffffff81c9d100 B dma_spin_lock
    dma_spin_lock directly follows sleep_time_bin, so dma_spin_lock.val is set
    to 1, which caused this problem.

    This patch adds a sanity check in tk_debug_account_sleep_time()
    to ensure we don't index past the sleep_time_bin array.

    [jstultz: Problem diagnosed and original patch by Chen Yu, I've solved the
    issue slightly differently, but borrowed his excellent explanation of the
    issue here.]

    Fixes: 5c83545f24ab "power: Add option to log time spent in suspend"
    Reported-by: Janek Kozicki
    Reported-by: Chen Yu
    Signed-off-by: John Stultz
    Cc: linux-pm@vger.kernel.org
    Cc: Peter Zijlstra
    Cc: Xunlei Pang
    Cc: "Rafael J. Wysocki"
    Cc: stable
    Cc: Zhang Rui
    Link: http://lkml.kernel.org/r/1471993702-29148-3-git-send-email-john.stultz@linaro.org
    Signed-off-by: Thomas Gleixner

    John Stultz
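
    The out-of-bounds index described in the entry above, and a capped alternative, can be sketched in user space. This is an illustration, not the kernel patch: fls32(), sleep_bin_index() and account_sleep_time() are hypothetical names, and a 32-entry array of 4-byte counters is assumed. That assumption matches the 0x80-byte gap between sleep_time_bin and dma_spin_lock in the System.map excerpt (32 * 4 = 128 = 0x80).

```c
#include <assert.h>

#define NUM_BINS 32

static unsigned int sleep_time_bin[NUM_BINS];

/* Portable fls(): 1-based index of the highest set bit, 0 for x == 0. */
static int fls32(unsigned int x)
{
	int r = 0;

	while (x) {
		x >>= 1;
		r++;
	}
	return r;
}

/* Cap the bin so that fls32() == 32 maps to the last valid slot. */
static int sleep_bin_index(unsigned int secs)
{
	int bin = fls32(secs);

	if (bin > NUM_BINS - 1)
		bin = NUM_BINS - 1;
	return bin;
}

/* Account one sleep interval; the index is always within the array. */
static void account_sleep_time(unsigned int secs)
{
	int bin = sleep_bin_index(secs);

	assert(bin >= 0 && bin < NUM_BINS);
	sleep_time_bin[bin]++;
}
```

    A sleep time with bit 31 set, such as 0x80000000, gives fls32() == 32. Without the cap that would index one element past the end of the array. With the cap it lands in sleep_time_bin[31].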
     
  • When I added some extra sanity checking in timekeeping_get_ns() under
    CONFIG_DEBUG_TIMEKEEPING, I missed that the NMI safe __ktime_get_fast_ns()
    method was using timekeeping_get_ns().

    Thus the locking added to the debug checks broke the NMI-safety of
    __ktime_get_fast_ns().

    This patch open-codes the timekeeping_get_ns() logic for
    __ktime_get_fast_ns(), so that it avoids any deadlocks in NMI context.

    Fixes: 4ca22c2648f9 "timekeeping: Add warnings when overflows or underflows are observed"
    Reported-by: Steven Rostedt
    Reported-by: Peter Zijlstra
    Signed-off-by: John Stultz
    Cc: stable
    Link: http://lkml.kernel.org/r/1471993702-29148-2-git-send-email-john.stultz@linaro.org
    Signed-off-by: Thomas Gleixner

    John Stultz
     

22 Aug, 2016

2 commits

  • Without locking out CPU mask operations we might end up with an inconsistent
    view of the cpumask in the function.

    Fixes: 5e385a6ef31f: "genirq: Add a helper to spread an affinity mask for MSI/MSI-X vectors"
    Signed-off-by: Christoph Hellwig
    Link: http://lkml.kernel.org/r/1470924405-25728-1-git-send-email-hch@lst.de
    Signed-off-by: Thomas Gleixner

    Christoph Hellwig
     
  • Obviously we should free action here if irq_chip_pm_get failed.

    Fixes: be45beb2df69: "genirq: Add runtime power management support for IRQ chips"
    Signed-off-by: Shawn Lin
    Cc: Jon Hunter
    Cc: Marc Zyngier
    Link: http://lkml.kernel.org/r/1471854112-13006-1-git-send-email-shawn.lin@rock-chips.com
    Signed-off-by: Thomas Gleixner

    Shawn Lin