03 Jan, 2018

6 commits

  • commit 466a2b42d67644447a1765276259a3ea5531ddff upstream.

    Since the recent remote cpufreq callback work, it's possible that a cpufreq
    update is triggered from a remote CPU. For single policies, however, the current
    code uses the local CPU when trying to determine whether the remote sg_cpu entered
    idle or is busy. This is incorrect. To remedy this, compare with the nohz tick
    idle_calls counter of the remote CPU.

    Fixes: 674e75411fc2 (sched: cpufreq: Allow remote cpufreq callbacks)
    Acked-by: Viresh Kumar
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Joel Fernandes
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Greg Kroah-Hartman

    Joel Fernandes
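
    The check the fix changes can be modelled in plain C: keep a per-CPU
    idle_calls counter, snapshot it when the governor state is updated, and
    compare the snapshot against the counter of the CPU the state describes,
    not whatever CPU the callback happens to run on. The names and structures
    below are illustrative, not the kernel's.

    #include <stdbool.h>
    #include <stdio.h>

    #define NR_CPUS 4

    static unsigned long cpu_idle_calls[NR_CPUS];   /* per-CPU nohz idle_calls */

    struct gov_cpu {
        int cpu;                        /* CPU this state belongs to */
        unsigned long saved_idle_calls; /* snapshot taken at the last update */
    };

    /* Buggy variant: always looks at the CPU running the callback. */
    static bool cpu_is_busy_buggy(struct gov_cpu *gc, int local_cpu)
    {
        return cpu_idle_calls[local_cpu] == gc->saved_idle_calls;
    }

    /* Fixed variant: looks at the remote CPU the state describes. */
    static bool cpu_is_busy_fixed(struct gov_cpu *gc)
    {
        return cpu_idle_calls[gc->cpu] == gc->saved_idle_calls;
    }

    int main(void)
    {
        struct gov_cpu gc = { .cpu = 2, .saved_idle_calls = cpu_idle_calls[2] };

        cpu_idle_calls[2]++;            /* CPU 2 went idle since the snapshot */

        /* Callback runs remotely on CPU 0: the buggy check thinks CPU 2 is busy. */
        printf("buggy: busy=%d  fixed: busy=%d\n",
               cpu_is_busy_buggy(&gc, 0), cpu_is_busy_fixed(&gc));
        return 0;
    }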
     
  • commit ae415fa4c5248a8cf4faabd5a3c20576cb1ad607 upstream.

    To free the reader page that is allocated with ring_buffer_alloc_read_page(),
    ring_buffer_free_read_page() must be called. For faster performance, this
    page can be reused by the ring buffer to avoid having to free and allocate
    new pages.

    The issue arises when the page is used with a splice pipe into the
    networking code. The networking code may increment the page's reference
    count and keep the page active while it is queued to be sent out to the
    network. The elevated page ref does not prevent the page from being reused
    by the ring buffer, so the page that is being sent out to the network can
    be modified by newly read data before it is actually sent.

    Add a check to the page ref counter, and only reuse the page if it is not
    being used anywhere else.

    Fixes: 73a757e63114d ("ring-buffer: Return reader page back into existing ring buffer")
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (VMware)
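
    A small userspace model of the reuse rule the patch adds: only hand a page
    back for recycling when nothing else still holds a reference to it. The
    refcount handling below is a simplification of the kernel's struct page
    accounting; all names are made up for the sketch.

    #include <stdio.h>
    #include <stdlib.h>

    struct buf_page {
        int refcount;               /* 1 == only the ring buffer owns it */
        char data[64];
    };

    static struct buf_page *spare;  /* page kept around for reuse */

    static void free_read_page(struct buf_page *p)
    {
        if (p->refcount == 1 && !spare) {
            spare = p;              /* safe to recycle: sole owner */
            return;
        }
        p->refcount--;              /* someone (e.g. splice) still uses it */
        if (p->refcount == 0)
            free(p);
    }

    int main(void)
    {
        struct buf_page *p = calloc(1, sizeof(*p));

        p->refcount = 2;            /* ring buffer + an in-flight network splice */

        free_read_page(p);          /* ring buffer side is done with the page   */
        printf("after ring buffer drop: recycled=%s\n", spare ? "yes" : "no");

        free_read_page(p);          /* network side finally drops its reference */
        printf("after network drop:     recycled=%s\n", spare ? "yes" : "no");
        return 0;
    }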
     
  • commit 45d8b80c2ac5d21cd1e2954431fb676bc2b1e099 upstream.

    Two info bits were added to the "commit" part of the ring buffer data page
    when returned to be consumed. This was to inform the user space readers that
    events have been missed, and that the count may be stored at the end of the
    page.

    What wasn't handled was the splice code, which actually called a function to
    return the length of the data in order to zero out the rest of the page
    before sending it up to user space. These info bits were returned together
    with the length, making the value negative, and that negative value was not
    checked. It was compared to PAGE_SIZE, and only used if the size was less
    than PAGE_SIZE. Luckily, PAGE_SIZE is an unsigned long, which made the
    compare an unsigned compare, meaning the negative size value did not end up
    causing a large portion of memory to be randomly zeroed out.

    Fixes: 66a8cb95ed040 ("ring-buffer: Add place holder recording of dropped events")
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (VMware)
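
    The "lucky" unsigned comparison is easy to reproduce in a standalone
    program; PAGE_SIZE and the negative length below are hard-coded stand-ins
    for the demo only.

    #include <stdio.h>

    #define PAGE_SIZE 4096UL            /* unsigned long, as in the kernel */

    int main(void)
    {
        long commit = -64;              /* length with the info bits still set */

        /* The signed value is promoted to unsigned long for the comparison. */
        if (commit < PAGE_SIZE)
            printf("guard taken: would memset() the rest of the page\n");
        else
            printf("guard not taken: -64 compared as %lu\n",
                   (unsigned long)commit);
        return 0;
    }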
     
  • commit 24f2aaf952ee0b59f31c3a18b8b36c9e3d3c2cf5 upstream.

    Double free of the ring buffer happens when it fails to allocate a new
    ring buffer instance for max_buffer if TRACER_MAX_TRACE is configured.
    The root cause is that the pointer is not set to NULL after the buffer
    is freed in allocate_trace_buffers(), and the freeing of the ring
    buffer is invoked again later if the pointer is not NULL, as:

    instance_mkdir()
    |-allocate_trace_buffers()
    |  |-allocate_trace_buffer(tr, &tr->trace_buffer...)
    |  |-allocate_trace_buffer(tr, &tr->max_buffer...)
    |  |    // allocation fails (-ENOMEM): first free,
    |  |    // and the buffer pointer is not set to NULL
    |  |-ring_buffer_free(tr->trace_buffer.buffer)
    |
    |  // out_free_tr
    |-free_trace_buffers()
       |-free_trace_buffer(&tr->trace_buffer);
          |    // trace_buffer is not NULL, so it is freed again
          |-ring_buffer_free(buf->buffer)
             |-rb_free_cpu_buffer(buffer->buffers[cpu])
                  // ring_buffer_per_cpu is NULL, and
                  // crash in ring_buffer_per_cpu->pages

    Link: http://lkml.kernel.org/r/20171226071253.8968-1-chunyan.zhang@spreadtrum.com

    Fixes: 737223fbca3b1 ("tracing: Consolidate buffer allocation code")
    Signed-off-by: Jing Xia
    Signed-off-by: Chunyan Zhang
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Jing Xia
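
    The pattern of the fix, reduced to a standalone example: reset the pointer
    after freeing so that a later "free if non-NULL" cleanup path cannot free
    the same buffer twice. The structure name is invented for the sketch.

    #include <stdio.h>
    #include <stdlib.h>

    struct trace_buf {
        void *buffer;
    };

    static void free_buffer(struct trace_buf *b)
    {
        if (b->buffer) {
            free(b->buffer);
            b->buffer = NULL;   /* the missing step: prevents a second free() */
        }
    }

    int main(void)
    {
        struct trace_buf tb = { .buffer = malloc(128) };

        free_buffer(&tb);       /* error path frees the buffer ...            */
        free_buffer(&tb);       /* ... and the common cleanup runs again safely */
        printf("buffer pointer is now %p\n", tb.buffer);
        return 0;
    }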
     
  • commit 4397f04575c44e1440ec2e49b6302785c95fd2f8 upstream.

    Jing Xia and Chunyan Zhang reported that on failing to allocate part of the
    tracing buffer, memory is freed, but the pointers that point to them are not
    initialized back to NULL, and later paths may try to free the freed memory
    again. Jing and Chunyan fixed one of the locations that does this, but
    missed a spot.

    Link: http://lkml.kernel.org/r/20171226071253.8968-1-chunyan.zhang@spreadtrum.com

    Fixes: 737223fbca3b1 ("tracing: Consolidate buffer allocation code")
    Reported-by: Jing Xia
    Reported-by: Chunyan Zhang
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (VMware)
     
  • commit 6b7e633fe9c24682df550e5311f47fb524701586 upstream.

    ring_buffer_read_page() takes care of zeroing out any extra data in the
    page that it returns. There's no need to zero it out again in the
    consumer. The redundant zeroing was removed from one consumer of this
    function, but read_buffers_splice_read() did not remove it, and worse, it
    contained a nasty bug because of it.

    Fixes: 2711ca237a084 ("ring-buffer: Move zeroing out excess in page to ring buffer code")
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (VMware)
     

30 Dec, 2017

1 commit

  • commit c10e83f598d08046dd1ebc8360d4bb12d802d51b upstream.

    In order to sanitize the LDT initialization on x86, arch_dup_mmap() must be
    allowed to fail. Fix up all instances.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andy Lutomirski
    Cc: Andy Lutomirsky
    Cc: Boris Ostrovsky
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: David Laight
    Cc: Denys Vlasenko
    Cc: Eduardo Valentin
    Cc: Greg KH
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Juergen Gross
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Will Deacon
    Cc: aliguori@amazon.com
    Cc: dan.j.williams@intel.com
    Cc: hughd@google.com
    Cc: keescook@google.com
    Cc: kirill.shutemov@linux.intel.com
    Cc: linux-mm@kvack.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

25 Dec, 2017

12 commits

  • From: Alexei Starovoitov

    [ Upstream commit bb7f0f989ca7de1153bd128a40a71709e339fa03 ]

    There were various issues related to the limited size of integers used in
    the verifier:
    - `off + size` overflow in __check_map_access()
    - `off + reg->off` overflow in check_mem_access()
    - `off + reg->var_off.value` overflow or 32-bit truncation of
    `reg->var_off.value` in check_mem_access()
    - 32-bit truncation in check_stack_boundary()

    Make sure that any integer math cannot overflow by not allowing
    pointer math with large values.

    Also reduce the scope of "scalar op scalar" tracking.

    Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
    Reported-by: Jann Horn
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
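
    One of the listed overflows (`off + size`) can be illustrated in
    isolation: with 32-bit arithmetic the sum can wrap and slip past an
    upper-bound check, while widening the math (or bounding the operands
    first, as the verifier now does for pointer arithmetic) closes the hole.
    MAP_VALUE_SIZE and the helper names are invented for this demo.

    #include <stdint.h>
    #include <stdio.h>

    #define MAP_VALUE_SIZE 64

    /* Models the unsafe pattern: the sum wraps in 32 bits and passes the check. */
    static int check_access_broken(int32_t off, int32_t size)
    {
        int32_t end = (int32_t)((uint32_t)off + (uint32_t)size);   /* wraps */

        return end <= MAP_VALUE_SIZE ? 0 : -1;
    }

    /* Widen before adding and bound the operands, so the sum cannot wrap. */
    static int check_access_fixed(int32_t off, int32_t size)
    {
        int64_t end = (int64_t)off + size;

        if (off < 0 || size <= 0 || end > MAP_VALUE_SIZE)
            return -1;
        return 0;
    }

    int main(void)
    {
        int32_t off = INT32_MAX, size = 16;

        printf("broken: %s\n",
               check_access_broken(off, size) ? "rejected" : "allowed");
        printf("fixed : %s\n",
               check_access_fixed(off, size) ? "rejected" : "allowed");
        return 0;
    }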
     
  • From: Jann Horn

    [ Upstream commit 179d1c5602997fef5a940c6ddcf31212cbfebd14 ]

    This could be made safe by passing through a reference to env and checking
    for env->allow_ptr_leaks, but it would only work one way and is probably
    not worth the hassle - not doing it will not directly lead to program
    rejection.

    Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
    Signed-off-by: Jann Horn
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     
  • From: Jann Horn

    [ Upstream commit a5ec6ae161d72f01411169a938fa5f8baea16e8f ]

    Force strict alignment checks for stack pointers because the tracking of
    stack spills relies on it; unaligned stack accesses can lead to corruption
    of spilled registers, which is exploitable.

    Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
    Signed-off-by: Jann Horn
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     
  • From: Jann Horn

    Prevent indirect stack accesses at non-constant addresses, which would
    permit reading and corrupting spilled pointers.

    Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
    Signed-off-by: Jann Horn
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     
  • From: Jann Horn

    [ Upstream commit 468f6eafa6c44cb2c5d8aad35e12f06c240a812a ]

    32-bit ALU ops operate on 32-bit values and have 32-bit outputs.
    Adjust the verifier accordingly.

    Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
    Signed-off-by: Jann Horn
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     
  • From: Jann Horn

    [ Upstream commit 0c17d1d2c61936401f4702e1846e2c19b200f958 ]

    Properly handle register truncation to a smaller size.

    The old code first mirrors the clearing of the high 32 bits in the bitwise
    tristate representation, which is correct. But then, it computes the new
    arithmetic bounds as the intersection between the old arithmetic bounds and
    the bounds resulting from the bitwise tristate representation. Therefore,
    when coerce_reg_to_32() is called on a number with bounds
    [0xffff'fff8, 0x1'0000'0007], the verifier computes
    [0xffff'fff8, 0xffff'ffff] as bounds of the truncated number.
    This is incorrect: The truncated number could also be in the range [0, 7],
    and no meaningful arithmetic bounds can be computed in that case apart from
    the obvious [0, 0xffff'ffff].

    Starting with v4.14, this is exploitable by unprivileged users as long as
    the unprivileged_bpf_disabled sysctl isn't set.

    Debian assigned CVE-2017-16996 for this issue.

    v2:
    - flip the mask during arithmetic bounds calculation (Ben Hutchings)
    v3:
    - add CVE number (Ben Hutchings)

    Fixes: b03c9f9fdc37 ("bpf/verifier: track signed and unsigned min/max values")
    Signed-off-by: Jann Horn
    Acked-by: Edward Cree
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
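
    The bounds arithmetic from the description can be checked directly:
    enumerate the 16 values in [0xffff'fff8, 0x1'0000'0007], truncate each to
    32 bits, and observe that the only sound bounds for the result are
    [0, 0xffffffff], not the [0xfffffff8, 0xffffffff] the buggy code computed.

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t lo = 0xfffffff8ULL, hi = 0x100000007ULL;
        uint32_t min = UINT32_MAX, max = 0;

        for (uint64_t v = lo; v <= hi; v++) {
            uint32_t t = (uint32_t)v;   /* the truncation being modelled */
            if (t < min) min = t;
            if (t > max) max = t;
        }
        printf("true bounds of the truncated value: [0x%" PRIx32 ", 0x%" PRIx32 "]\n",
               min, max);
        printf("bounds the buggy verifier computed: [0xfffffff8, 0xffffffff]\n");
        return 0;
    }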
     
  • From: Jann Horn

    [ Upstream commit 95a762e2c8c942780948091f8f2a4f32fce1ac6f ]

    Distinguish between
    BPF_ALU64|BPF_MOV|BPF_K (load 32-bit immediate, sign-extended to 64-bit)
    and BPF_ALU|BPF_MOV|BPF_K (load 32-bit immediate, zero-padded to 64-bit);
    only perform sign extension in the first case.

    Starting with v4.14, this is exploitable by unprivileged users as long as
    the unprivileged_bpf_disabled sysctl isn't set.

    Debian assigned CVE-2017-16995 for this issue.

    v3:
    - add CVE number (Ben Hutchings)

    Fixes: 484611357c19 ("bpf: allow access into map value arrays")
    Signed-off-by: Jann Horn
    Acked-by: Edward Cree
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
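
    The distinction being enforced can be shown with plain C casts; an
    immediate of -1 makes the difference between sign extension and zero
    padding obvious.

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        int32_t imm = -1;

        uint64_t alu64_mov = (uint64_t)(int64_t)imm;   /* sign-extend */
        uint64_t alu32_mov = (uint64_t)(uint32_t)imm;  /* zero-pad    */

        printf("BPF_ALU64|BPF_MOV|BPF_K imm=-1 -> 0x%016" PRIx64 "\n", alu64_mov);
        printf("BPF_ALU  |BPF_MOV|BPF_K imm=-1 -> 0x%016" PRIx64 "\n", alu32_mov);
        return 0;
    }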
     
  • From: Edward Cree

    [ Upstream commit 4374f256ce8182019353c0c639bb8d0695b4c941 ]

    Incorrect signed bounds were being computed.
    If the old upper signed bound was positive and the old lower signed bound was
    negative, this could cause the new upper signed bound to be too low,
    leading to security issues.

    Fixes: b03c9f9fdc37 ("bpf/verifier: track signed and unsigned min/max values")
    Reported-by: Jann Horn
    Signed-off-by: Edward Cree
    Acked-by: Alexei Starovoitov
    [jannh@google.com: changed description to reflect bug impact]
    Signed-off-by: Jann Horn
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     
  • [ Upstream commit 283ca526a9bd75aed7350220d7b1f8027d99c3fd ]

    When tracing and networking programs are both attached in the
    system and both use event-output helpers that eventually call
    into perf_event_output(), then we could end up in a situation
    where the tracing attached program runs in user context while
    a cls_bpf program is triggered on that same CPU out of softirq
    context.

    Since both rely on the same per-cpu perf_sample_data, we could
    potentially corrupt it. This can only ever happen in a combination
    of the two types; all tracing programs use a bpf_prog_active
    counter to bail out in case a program is already running on
    that CPU out of a different context. XDP and cls_bpf programs
    by themselves don't have this issue as they run in the same
    context only. Therefore, split both perf_sample_data so they
    cannot be accessed from each other.

    Fixes: 20b9d7ac4852 ("bpf: avoid excessive stack usage for perf_sample_data")
    Reported-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Tested-by: Song Liu
    Acked-by: Alexei Starovoitov
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     
  • From: Alexei Starovoitov

    [ Upstream commit c131187db2d3fa2f8bf32fdf4e9a4ef805168467 ]

    When the verifier detects that a register contains a runtime constant
    and it's compared with another constant, it will prune exploration
    of the branch that is guaranteed not to be taken at runtime.
    This is all correct, but a malicious program may be constructed
    in such a way that it always has a constant comparison and
    the other branch is never taken under any conditions.
    In this case such a path through the program will not be explored
    by the verifier. It won't be taken at run-time either, but since
    all instructions are JITed the malicious program may cause JITs
    to complain about using reserved fields, etc.
    To fix the issue we have to track the instructions explored by
    the verifier and sanitize the instructions that are dead at run time
    with NOPs. We cannot reject such dead code, since llvm generates
    it for valid C code; llvm doesn't do as much data flow
    analysis as the verifier does.

    Fixes: 17a5267067f3 ("bpf: verifier (add verifier core)")
    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     
  • [ Upstream commit a15f7fc20389a8827d5859907568b201234d4b79 ]

    There are a small number of 'generic fields' (comm/COMM/cpu/CPU) that
    are found by trace_find_event_field() but are only meant for
    filtering. Specifically, unlike normal fields, they have a size
    of 0 and thus wreak havoc when used as a histogram key.

    Exclude these (return -EINVAL) when used as histogram keys.

    Link: http://lkml.kernel.org/r/956154cbc3e8a4f0633d619b886c97f0f0edf7b4.1506105045.git.tom.zanussi@linux.intel.com

    Signed-off-by: Tom Zanussi
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Tom Zanussi
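
    A sketch of the guard the patch describes, using a made-up field
    descriptor: a field whose size is 0 cannot be used as a histogram key and
    is rejected with -EINVAL.

    #include <errno.h>
    #include <stdio.h>

    struct event_field {
        const char *name;
        int size;               /* 0 for the filter-only generic fields */
    };

    static int create_hist_key(const struct event_field *field)
    {
        if (field->size == 0)   /* generic field: filtering only, not a key */
            return -EINVAL;
        return 0;
    }

    int main(void)
    {
        struct event_field comm = { .name = "comm", .size = 0 };
        struct event_field pid  = { .name = "pid",  .size = 4 };

        printf("comm as key: %d\n", create_hist_key(&comm));   /* -EINVAL */
        printf("pid  as key: %d\n", create_hist_key(&pid));    /* 0 */
        return 0;
    }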
     
  • commit 3382290ed2d5e275429cef510ab21889d3ccd164 upstream.

    [ Note, this is a Git cherry-pick of the following commit:

    506458efaf15 ("locking/barriers: Convert users of lockless_dereference() to READ_ONCE()")

    ... for easier x86 PTI code testing and back-porting. ]

    READ_ONCE() now has an implicit smp_read_barrier_depends() call, so it
    can be used instead of lockless_dereference() without any change in
    semantics.

    Signed-off-by: Will Deacon
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1508840570-22169-4-git-send-email-will.deacon@arm.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Will Deacon
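
    A userspace approximation of the conversion; the READ_ONCE() macro below
    is a simplified stand-in for the kernel's definition, shown only to
    illustrate that the call site changes from lockless_dereference(p) to
    READ_ONCE(p) with no other adjustment.

    #include <stdio.h>

    #define READ_ONCE(x) (*(volatile __typeof__(x) *)&(x))

    struct config {
        int value;
    };

    static struct config *shared_cfg;   /* published by another thread in real use */

    int main(void)
    {
        static struct config cfg = { .value = 42 };
        shared_cfg = &cfg;

        /* Before: p = lockless_dereference(shared_cfg);  After: */
        struct config *p = READ_ONCE(shared_cfg);

        printf("value = %d\n", p->value);
        return 0;
    }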
     

20 Dec, 2017

5 commits

  • [ Upstream commit 95b982b45122c57da2ee0b46cce70775e1d987af ]

    Problem: The events_check_enabled flag does not currently get cleared in
    the suspend or resume path in the following cases:

    * In case some driver's suspend routine returns an error.
    * Successful s2idle case
    * etc?

    Why is this a problem: What happens is that the next suspend attempt
    could fail even though the user did not enable the flag by writing to
    /sys/power/wakeup_count. This is one use case in which the issue can be
    seen (a similar use case with a driver suspend failure can be thought of):

    1. Read /sys/power/wakeup_count
    2. echo count > /sys/power/wakeup_count
    3. echo freeze > /sys/power/state
    4. Let the system suspend, and wake up the system using some wake source
    that calls pm_wakeup_event(), e.g. the power button or something.
    5. Note that the combined wakeup count would be incremented due
    to the pm_wakeup_event() in the resume path.
    6. After resuming the events_check_enabled flag is still set.

    At this point if the user attempts to freeze again (without writing to
    /sys/power/wakeup_count), the suspend would fail even though there has
    been no wake event since the past resume.

    Address that by clearing the flag just before a resume is completed,
    so that it is always cleared for the corner cases mentioned above.

    Signed-off-by: Rajat Jain
    Acked-by: Pavel Machek
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Rajat Jain
     
  • commit cef31d9af908243421258f1df35a4a644604efbe upstream.

    timer_create() specifies via sigevent->sigev_notify the signal delivery for
    the new timer. The valid modes are SIGEV_NONE, SIGEV_SIGNAL, SIGEV_THREAD
    and (SIGEV_SIGNAL | SIGEV_THREAD_ID).

    The sanity check in good_sigevent() is only checking the valid combination
    for the SIGEV_THREAD_ID bit, i.e. SIGEV_SIGNAL, but if SIGEV_THREAD_ID is
    not set it accepts any random value.

    This has no real effects on the posix timer and signal delivery code, but
    it affects show_timer() which handles the output of /proc/$PID/timers. That
    function uses a string array to pretty print sigev_notify. The access to
    that array has no bounds check, so a random sigev_notify value causes
    access beyond the array bounds.

    Add proper checks for the valid notify modes and remove the SIGEV_THREAD_ID
    masking from various code paths, as SIGEV_NONE can never be set in
    combination with SIGEV_THREAD_ID.

    Reported-by: Eric Biggers
    Reported-by: Dmitry Vyukov
    Reported-by: Alexey Dobriyan
    Signed-off-by: Thomas Gleixner
    Cc: John Stultz
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
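
    A self-contained sketch of the notify-mode check described above (the real
    good_sigevent() also validates the signal number and target thread). The
    SIGEV_* constants are defined locally with their UAPI values so the demo
    compiles on its own.

    #include <stdio.h>

    #define SIGEV_SIGNAL    0
    #define SIGEV_NONE      1
    #define SIGEV_THREAD    2
    #define SIGEV_THREAD_ID 4

    static int good_sigevent_notify(int sigev_notify)
    {
        switch (sigev_notify) {
        case SIGEV_NONE:
        case SIGEV_SIGNAL:
        case SIGEV_THREAD:
        case SIGEV_SIGNAL | SIGEV_THREAD_ID:
            return 1;
        default:
            return 0;   /* random values would index past the string array */
        }
    }

    int main(void)
    {
        const int tests[] = { SIGEV_SIGNAL, SIGEV_THREAD,
                              SIGEV_SIGNAL | SIGEV_THREAD_ID, 0x7f };

        for (unsigned i = 0; i < sizeof(tests) / sizeof(tests[0]); i++)
            printf("sigev_notify=%#x -> %s\n", tests[i],
                   good_sigevent_notify(tests[i]) ? "accepted" : "rejected");
        return 0;
    }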
     
  • commit f73c52a5bcd1710994e53fbccc378c42b97a06b6 upstream.

    Daniel Wagner reported a crash on the BeagleBone Black SoC.

    This is a single-CPU architecture and does not have a functional
    arch_send_call_function_single_ipi() implementation; the kernel can crash
    if that function is called.

    As it only has one CPU, that function shouldn't be called, but if the
    kernel is compiled for SMP, the push/pull RT scheduling logic now calls it
    for irq_work: if the one CPU is overloaded, it can use that function to
    IPI itself and crash the kernel.

    Ideally, we should disable the SCHED_FEAT(RT_PUSH_IPI) if the system
    only has a single CPU. But SCHED_FEAT is a constant if sched debugging
    is turned off. Another fix can also be used, and this should also help
    with normal SMP machines. That is, do not initiate the pull code if
    there's only one RT overloaded CPU, and that CPU happens to be the
    current CPU that is scheduling in a lower priority task.

    Even on a system with many CPUs, if there's many RT tasks waiting to
    run on a single CPU, and that CPU schedules in another RT task of lower
    priority, it will initiate the PULL logic in case there's a higher
    priority RT task on another CPU that is waiting to run. But if there is
    no other CPU with waiting RT tasks, it will initiate the RT pull logic
    on itself (as it still has RT tasks waiting to run). This is a wasted
    effort.

    Not only does this help with SMP code where the current CPU is the only
    one with RT overloaded tasks, it should also solve the issue that
    Daniel encountered, because it will prevent the PULL logic from
    executing, as there's only one CPU on the system, and the check added
    here will cause it to exit the RT pull code.

    Reported-by: Daniel Wagner
    Signed-off-by: Steven Rostedt (VMware)
    Acked-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Sebastian Andrzej Siewior
    Cc: Thomas Gleixner
    Cc: linux-rt-users
    Fixes: 4bdced5c9 ("sched/rt: Simplify the IPI based RT balancing logic")
    Link: http://lkml.kernel.org/r/20171202130454.4cbbfe8d@vmware.local.home
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt
     
  • commit 90e406f96f630c07d631a021fd4af10aac913e77 upstream.

    The default NR_CPUS can be very large, but the actual possible nr_cpu_ids
    is usually very small. For my x86 distribution, NR_CPUS is 8192 and
    nr_cpu_ids is 4. About 2 pages are wasted.

    Most machines don't have so many CPUs, so defining an array with NR_CPUS
    just wastes memory. Let's allocate the buffer dynamically when needed.

    With this change, the mutex tracing_cpumask_update_lock, which was used to
    protect mask_str, can also be removed.

    Link: http://lkml.kernel.org/r/1512013183-19107-1-git-send-email-changbin.du@intel.com

    Fixes: 36dfe9252bd4c ("ftrace: make use of tracing_cpumask")
    Signed-off-by: Changbin Du
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Changbin Du
     
  • commit bdcf0a423ea1c40bbb40e7ee483b50fc8aa3d758 upstream.

    In testing, we found that nfsd threads may call set_groups() in parallel
    for the same entry cached in auth.unix.gid, racing in the call to
    groups_sort(), corrupting the groups for that entry and leading to
    permission denials for the client.

    This patch:
    - Makes groups_sort() globally visible.
    - Moves the call to groups_sort() to the modifiers of group_info.
    - Removes the call to groups_sort() from set_groups().

    Link: http://lkml.kernel.org/r/20171211151420.18655-1-thiago.becker@gmail.com
    Signed-off-by: Thiago Rafael Becker
    Reviewed-by: Matthew Wilcox
    Reviewed-by: NeilBrown
    Acked-by: "J. Bruce Fields"
    Cc: Al Viro
    Cc: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Thiago Rafael Becker
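
    The shape of the fix, modelled in userspace: sort the group list once,
    while it is still private to the code that built it, instead of having
    every consumer sort the shared copy and race with the others. The
    groups_sort()/set_groups() below are simplified stand-ins, not the
    kernel's implementations.

    #include <stdio.h>
    #include <stdlib.h>

    struct group_info {
        int ngroups;
        unsigned int gid[8];
    };

    static int gid_cmp(const void *a, const void *b)
    {
        unsigned int x = *(const unsigned int *)a, y = *(const unsigned int *)b;

        return (x > y) - (x < y);
    }

    static void groups_sort(struct group_info *gi)
    {
        qsort(gi->gid, gi->ngroups, sizeof(gi->gid[0]), gid_cmp);
    }

    /* After the fix, this only publishes; it no longer sorts a shared object. */
    static void set_groups(const struct group_info *gi)
    {
        for (int i = 0; i < gi->ngroups; i++)
            printf("%u ", gi->gid[i]);
        printf("\n");
    }

    int main(void)
    {
        struct group_info gi = { .ngroups = 4, .gid = { 42, 7, 1000, 3 } };

        groups_sort(&gi);   /* done by the modifier, before the entry is shared */
        set_groups(&gi);
        return 0;
    }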
     

17 Dec, 2017

2 commits

  • [ Upstream commit 173743dd99a49c956b124a74c8aacb0384739a4c ]

    Prior to this patch we enabled audit in audit_init(), which is too
    late for PID 1 as the standard initcalls are run after the PID 1 task
    is forked. This means that we never allocate an audit_context (see
    audit_alloc()) for PID 1 and therefore miss a lot of audit events
    generated by PID 1.

    This patch enables audit as early as possible to help ensure that when
    PID 1 is forked it can allocate an audit_context if required.

    Reviewed-by: Richard Guy Briggs
    Signed-off-by: Paul Moore
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Paul Moore
     
  • [ Upstream commit 33e8a907804428109ce1d12301c3365d619cc4df ]

    The API to end auditing has historically been for auditd to set the
    pid to 0. This patch restores that functionality.

    See: https://github.com/linux-audit/audit-kernel/issues/69

    Reviewed-by: Richard Guy Briggs
    Signed-off-by: Steve Grubb
    Signed-off-by: Paul Moore
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Steve Grubb
     

14 Dec, 2017

5 commits

  • [ Upstream commit 92ee46efeb505ead3ab06d3c5ce695637ed5f152 ]

    Fengguang Wu reported that running the rcuperf test during boot can cause
    the jump_label_test() to hit a WARN_ON(). The issue is that the core jump
    label code relies on kernel_text_address() to detect when it can no longer
    update branches that may be contained in __init sections. The
    kernel_text_address() in turn assumes that if the system_state variable is
    greater than or equal to SYSTEM_RUNNING then __init sections are no longer
    valid (since the assumption is that they have been freed). However, when
    rcuperf is setup to run in early boot it can call kernel_power_off() which
    sets the system_state to SYSTEM_POWER_OFF.

    Since rcuperf initialization is invoked via a module_init(), we can make
    the dependency of jump_label_test() needing to complete before rcuperf
    explicit by calling it via early_initcall().

    Reported-by: Fengguang Wu
    Signed-off-by: Jason Baron
    Acked-by: Paul E. McKenney
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1510609727-2238-1-git-send-email-jbaron@akamai.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Jason Baron
     
  • [ Upstream commit 89ad2fa3f043a1e8daae193bcb5fe34d5f8caf28 ]

    pcpu_freelist_pop() needs the same lockdep awareness as
    pcpu_freelist_populate() to avoid a false positive.

    [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]

    switchto-defaul/12508 [HC0[0]:SC0[6]:HE0:SE0] is trying to acquire:
    (&htab->buckets[i].lock){......}, at: [] __htab_percpu_map_update_elem+0x1cb/0x300

    and this task is already holding:
    (dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2){+.-...}, at: [] __dev_queue_xmit+0
    x868/0x1240
    which would create a new lock dependency:
    (dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2){+.-...} -> (&htab->buckets[i].lock){......}

    but this new dependency connects a SOFTIRQ-irq-safe lock:
    (dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2){+.-...}
    ... which became SOFTIRQ-irq-safe at:
    [] __lock_acquire+0x42b/0x1f10
    [] lock_acquire+0xbc/0x1b0
    [] _raw_spin_lock+0x38/0x50
    [] __dev_queue_xmit+0x868/0x1240
    [] dev_queue_xmit+0x10/0x20
    [] ip_finish_output2+0x439/0x590
    [] ip_finish_output+0x150/0x2f0
    [] ip_output+0x7d/0x260
    [] ip_local_out+0x5e/0xe0
    [] ip_queue_xmit+0x205/0x620
    [] tcp_transmit_skb+0x5a8/0xcb0
    [] tcp_write_xmit+0x242/0x1070
    [] __tcp_push_pending_frames+0x3c/0xf0
    [] tcp_rcv_established+0x312/0x700
    [] tcp_v4_do_rcv+0x11c/0x200
    [] tcp_v4_rcv+0xaa2/0xc30
    [] ip_local_deliver_finish+0xa7/0x240
    [] ip_local_deliver+0x66/0x200
    [] ip_rcv_finish+0xdd/0x560
    [] ip_rcv+0x295/0x510
    [] __netif_receive_skb_core+0x988/0x1020
    [] __netif_receive_skb+0x21/0x70
    [] process_backlog+0x6f/0x230
    [] net_rx_action+0x229/0x420
    [] __do_softirq+0xd8/0x43d
    [] do_softirq_own_stack+0x1c/0x30
    [] do_softirq+0x55/0x60
    [] __local_bh_enable_ip+0xa8/0xb0
    [] cpu_startup_entry+0x1c7/0x500
    [] start_secondary+0x113/0x140

    to a SOFTIRQ-irq-unsafe lock:
    (&head->lock){+.+...}
    ... which became SOFTIRQ-irq-unsafe at:
    ... [] __lock_acquire+0x82f/0x1f10
    [] lock_acquire+0xbc/0x1b0
    [] _raw_spin_lock+0x38/0x50
    [] pcpu_freelist_pop+0x7a/0xb0
    [] htab_map_alloc+0x50c/0x5f0
    [] SyS_bpf+0x265/0x1200
    [] entry_SYSCALL_64_fastpath+0x12/0x17

    other info that might help us debug this:

    Chain exists of:
    dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2 --> &htab->buckets[i].lock --> &head->lock

    Possible interrupt unsafe locking scenario:

    CPU0                                CPU1
    ----                                ----
    lock(&head->lock);
                                        local_irq_disable();
                                        lock(dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2);
                                        lock(&htab->buckets[i].lock);
    <Interrupt>
      lock(dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2);

    *** DEADLOCK ***

    Fixes: e19494edab82 ("bpf: introduce percpu_freelist")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit 98159d977f71c3b3dee898d1c34e56f520b094e7 ]

    Patch series "A few round_pipe_size() and pipe-max-size fixups", v3.

    While backporting Michael's "pipe: fix limit handling" patchset to a
    distro-kernel, Mikulas noticed that current upstream pipe limit handling
    contains a few problems:

    1 - procfs signed wrap: echo'ing a large number into
    /proc/sys/fs/pipe-max-size and then cat'ing it back out shows a
    negative value.

    2 - round_pipe_size() nr_pages overflow on 32bit: this would
    subsequently try roundup_pow_of_two(0), which is undefined.

    3 - visible non-rounded pipe-max-size value: there is no mutual
    exclusion or protection between the time pipe_max_size is assigned
    a raw value from proc_dointvec_minmax() and when it is rounded.

    4 - unsigned long -> unsigned int conversion makes for potential odd
    return errors from do_proc_douintvec_minmax_conv() and
    do_proc_dopipe_max_size_conv().

    This version underwent the same testing as v1:
    https://marc.info/?l=linux-kernel&m=150643571406022&w=2

    This patch (of 4):

    pipe_max_size is defined as an unsigned int:

    unsigned int pipe_max_size = 1048576;

    but its procfs/sysctl representation is an integer:

    static struct ctl_table fs_table[] = {
        ...
        {
            .procname     = "pipe-max-size",
            .data         = &pipe_max_size,
            .maxlen       = sizeof(int),
            .mode         = 0644,
            .proc_handler = &pipe_proc_fn,
            .extra1       = &pipe_min_size,
        },
        ...

    that is signed:

    int pipe_proc_fn(struct ctl_table *table, int write, void __user *buf,
                     size_t *lenp, loff_t *ppos)
    {
        ...
        ret = proc_dointvec_minmax(table, write, buf, lenp, ppos)
    This leads to signed results via procfs for large values of pipe_max_size:

    % echo 2147483647 >/proc/sys/fs/pipe-max-size
    % cat /proc/sys/fs/pipe-max-size
    -2147483648

    Use unsigned operations on this variable to avoid such negative values.

    Link: http://lkml.kernel.org/r/1507658689-11669-2-git-send-email-joe.lawrence@redhat.com
    Signed-off-by: Joe Lawrence
    Reported-by: Mikulas Patocka
    Reviewed-by: Mikulas Patocka
    Cc: Michael Kerrisk
    Cc: Randy Dunlap
    Cc: Al Viro
    Cc: Jens Axboe
    Cc: Josh Poimboeuf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Joe Lawrence
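
    The procfs symptom is reproducible in plain C: the value is stored (and
    rounded) as an unsigned int, but reading it back through a signed
    conversion makes anything at or above 2^31 appear negative. The rounding
    helper below is a simplified stand-in for round_pipe_size(), not its
    actual code.

    #include <stdio.h>

    static unsigned int roundup_pow_of_two(unsigned int v)
    {
        unsigned int r = 1;

        while (r < v)
            r <<= 1;        /* 2147483647 rounds up to 2147483648 (2^31) */
        return r;
    }

    int main(void)
    {
        unsigned int pipe_max_size = roundup_pow_of_two(2147483647u);

        printf("stored (rounded) value : %u\n", pipe_max_size);
        /* Implementation-defined, but -2147483648 on typical systems. */
        printf("read back as signed int: %d\n", (int)pipe_max_size);
        return 0;
    }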
     
  • commit c07d35338081d107e57cf37572d8cc931a8e32e2 upstream.

    kallsyms_symbol_next() returns a boolean (true on success). Currently
    kdb_read() tests the return value with an inequality that
    unconditionally evaluates to true.

    This is fixed in the obvious way and, since the conditional branch is
    supposed to be unreachable, we also add a WARN_ON().

    Reported-by: Dan Carpenter
    Signed-off-by: Daniel Thompson
    Signed-off-by: Jason Wessel
    Signed-off-by: Greg Kroah-Hartman

    Daniel Thompson
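
    The class of bug in isolation: a function returning a C bool can never be
    negative, so an ordering test against zero is decided at compile time and
    one branch becomes unreachable. The helper below is invented for the demo.

    #include <stdbool.h>
    #include <stdio.h>

    static bool symbol_next(int i)
    {
        return i < 3;               /* true on success, false when exhausted */
    }

    int main(void)
    {
        for (int i = 0; i < 5; i++) {
            if (symbol_next(i) >= 0) {      /* always true: a bool is 0 or 1 */
                printf("iteration %d: 'success' branch taken\n", i);
                continue;
            }
            printf("iteration %d: never reached\n", i);
        }
        return 0;
    }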
     
  • commit 46febd37f9c758b05cd25feae8512f22584742fe upstream.

    Commit 31487f8328f2 ("smp/cfd: Convert core to hotplug state machine")
    accidentally put this step in the wrong place. The step should be in
    cpuhp_ap_states[] rather than cpuhp_bp_states[].

    grep smpcfd /sys/devices/system/cpu/hotplug/states
    40: smpcfd:prepare
    129: smpcfd:dying

    "smpcfd:dying" was missing before.
    So was the invocation of the function smpcfd_dying_cpu().

    Fixes: 31487f8328f2 ("smp/cfd: Convert core to hotplug state machine")
    Signed-off-by: Lai Jiangshan
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Richard Weinberger
    Cc: Sebastian Andrzej Siewior
    Cc: Boris Ostrovsky
    Link: https://lkml.kernel.org/r/20171128131954.81229-1-jiangshanlai@gmail.com
    Signed-off-by: Greg Kroah-Hartman

    Lai Jiangshan
     

10 Dec, 2017

2 commits

  • [ Upstream commit a30b85df7d599f626973e9cd3056fe755bd778e0 ]

    We want to wait for all potentially preempted kprobes trampoline
    execution to have completed. This guarantees that any freed
    trampoline memory is not in use by any task in the system anymore.
    synchronize_rcu_tasks() gives such a guarantee, so use it.

    Also, this guarantees to wait for all potentially preempted tasks
    on the instructions which will be replaced with a jump.

    Since this becomes a problem only when CONFIG_PREEMPT=y, enable
    CONFIG_TASKS_RCU=y for synchronize_rcu_tasks() in that case.

    Signed-off-by: Masami Hiramatsu
    Acked-by: Paul E. McKenney
    Cc: Ananth N Mavinakayanahalli
    Cc: Linus Torvalds
    Cc: Naveen N . Rao
    Cc: Paul E . McKenney
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/150845661962.5443.17724352636247312231.stgit@devbox
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Masami Hiramatsu
     
  • [ Upstream commit a9cd8194e1e6bd09619954721dfaf0f94fe2003e ]

    Event timestamps are serialized using ctx->lock, make sure to hold it
    over reading all values.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     

30 Nov, 2017

4 commits

  • commit 4f8413a3a799c958f7a10a6310a451e6b8aef5ad upstream.

    When requesting a shared interrupt, we assume that the firmware
    support code (DT or ACPI) has called irqd_set_trigger_type
    already, so that we can retrieve it and check that the requester
    is being reasonable.

    Unfortunately, we still have non-DT, non-ACPI systems around,
    and these guys won't call irqd_set_trigger_type before requesting
    the interrupt. The consequence is that we fail the request that
    would have worked before.

    We can either chase all these use cases (boring), or address it
    in core code (easier). Let's have a per-irq_desc flag that
    indicates whether irqd_set_trigger_type has been called, and
    let's just check it when checking for a shared interrupt.
    If it hasn't been set, just take whatever the interrupt
    requester asks.

    Fixes: 382bd4de6182 ("genirq: Use irqd_get_trigger_type to compare the trigger type for shared IRQs")
    Reported-and-tested-by: Petr Cvek
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     
  • commit 4bdced5c9a2922521e325896a7bbbf0132c94e56 upstream.

    When a CPU lowers its priority (schedules out a high priority task for a
    lower priority one), a check is made to see if any other CPU has overloaded
    RT tasks (more than one). It checks the rto_mask to determine this and if so
    it will request to pull one of those tasks to itself if the non running RT
    task is of higher priority than the new priority of the next task to run on
    the current CPU.

    When we deal with a large number of CPUs, the original pull logic suffered
    from large lock contention on a single CPU run queue, which caused a huge
    latency across all CPUs. This was caused by only having one CPU having
    overloaded RT tasks and a bunch of other CPUs lowering their priority. To
    solve this issue, commit:

    b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")

    changed the way to request a pull. Instead of grabbing the lock of the
    overloaded CPU's runqueue, it simply sent an IPI to that CPU to do the work.

    Although the IPI logic worked very well in removing the large latency build
    up, it still could suffer from a large number of IPIs being sent to a single
    CPU. On an 80 CPU box, I measured over 200us of processing IPIs. Worse yet,
    when I tested this on a 120 CPU box, with a stress test that had lots of
    RT tasks scheduling on all CPUs, it actually triggered the hard lockup
    detector! One CPU had so many IPIs sent to it, and due to the restart
    mechanism that is triggered when the source run queue has a priority status
    change, the CPU spent minutes! processing the IPIs.

    Thinking about this further, I realized there's no reason for each run queue
    to send its own IPI. All CPUs with overloaded tasks must be scanned
    regardless of whether one or many CPUs are lowering their priority, because
    there's no current way to find the CPU with the highest priority task that
    can schedule to one of these CPUs, so there really only needs to be one IPI
    being sent around at a time.

    This greatly simplifies the code!

    The new approach is to have each root domain have its own irq work, as the
    rto_mask is per root domain. The root domain has the following fields
    attached to it:

    rto_push_work  - the irq work to process each CPU set in rto_mask
    rto_lock       - the lock to protect some of the other rto fields
    rto_loop_start - an atomic that keeps contention down on rto_lock;
                     the first CPU scheduling in a lower priority task
                     is the one to kick off the process
    rto_loop_next  - an atomic that gets incremented for each CPU that
                     schedules in a lower priority task
    rto_loop       - a variable protected by rto_lock that is used to
                     compare against rto_loop_next
    rto_cpu        - the CPU to send the next IPI to, also protected by
                     the rto_lock

    When a CPU schedules in a lower priority task and wants to make sure
    overloaded CPUs know about it, it increments rto_loop_next. Then it
    atomically sets rto_loop_start with a cmpxchg. If the old value is not "0",
    then it is done, as another CPU is kicking off the IPI loop. If the old
    value is "0", then it will take the rto_lock to synchronize with a possible
    IPI being sent around to the overloaded CPUs.

    If rto_cpu is greater than or equal to nr_cpu_ids, then there's either no
    IPI being sent around, or one is about to finish. Then rto_cpu is set to the
    first CPU in rto_mask and an IPI is sent to that CPU. If there's no CPUs set
    in rto_mask, then there's nothing to be done.

    When the CPU receives the IPI, it will first try to push any RT tasks that
    are queued on the CPU but can't run because a higher priority RT task is
    currently running on that CPU.

    Then it takes the rto_lock and looks for the next CPU in the rto_mask. If it
    finds one, it simply sends an IPI to that CPU and the process continues.

    If there's no more CPUs in the rto_mask, then rto_loop is compared with
    rto_loop_next. If they match, everything is done and the process is over. If
    they do not match, then a CPU scheduled in a lower priority task as the IPI
    was being passed around, and the process needs to start again. The first CPU
    in rto_mask is sent the IPI.

    This change removes this duplication of work in the IPI logic, and greatly
    lowers the latency caused by the IPIs. This removed the lockup happening on
    the 120 CPU machine. It also simplifies the code tremendously. What else
    could anyone ask for?

    Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic and
    supplying me with the rto_start_trylock() and rto_start_unlock() helper
    functions.

    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Clark Williams
    Cc: Daniel Bristot de Oliveira
    Cc: John Kacur
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Scott Wood
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20170424114732.1aac6dc4@gandalf.local.home
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (Red Hat)
     
  • commit 7c2102e56a3f7d85b5d8f33efbd7aecc1f36fdd8 upstream.

    The current implementation of synchronize_sched_expedited() incorrectly
    assumes that resched_cpu() is unconditional, which it is not. This means
    that synchronize_sched_expedited() can hang when resched_cpu()'s trylock
    fails as follows (analysis by Neeraj Upadhyay):

    o CPU1 is waiting for expedited wait to complete:

    sync_rcu_exp_select_cpus
    rdp->exp_dynticks_snap & 0x1 // returns 1 for CPU5
    IPI sent to CPU5

    synchronize_sched_expedited_wait
    ret = swait_event_timeout(rsp->expedited_wq,
    sync_rcu_preempt_exp_done(rnp_root),
    jiffies_stall);

    expmask = 0x20, CPU 5 in idle path (in cpuidle_enter())

    o CPU5 handles IPI and fails to acquire rq lock.

    Handles IPI
    sync_sched_exp_handler
    resched_cpu
    returns while failing to try lock acquire rq->lock
    need_resched is not set

    o CPU5 calls rcu_idle_enter() and as need_resched is not set, goes to
    idle (schedule() is not called).

    o CPU 1 reports RCU stall.

    Given that resched_cpu() is now used only by RCU, this commit fixes the
    assumption by making resched_cpu() unconditional.

    Reported-by: Neeraj Upadhyay
    Suggested-by: Neeraj Upadhyay
    Signed-off-by: Paul E. McKenney
    Acked-by: Steven Rostedt (VMware)
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Greg Kroah-Hartman

    Paul E. McKenney
     
  • commit 07458f6a5171d97511dfbdf6ce549ed2ca0280c7 upstream.

    'cached_raw_freq' is used to get the next frequency quickly but should
    always be in sync with sg_policy->next_freq. There is a case where it is
    not and in such cases it should be reset to avoid switching to incorrect
    frequencies.

    Consider this case for example:

    - policy->cur is 1.2 GHz (Max)
    - New request comes for 780 MHz and we store that in cached_raw_freq.
    - Based on 780 MHz, we calculate the effective frequency as 800 MHz.
    - We then see the CPU wasn't idle recently and choose to keep the next
    freq as 1.2 GHz.
    - Now we have cached_raw_freq is 780 MHz and sg_policy->next_freq is
    1.2 GHz.
    - Now if the utilization doesn't change in the next request, then the
    next target frequency will still be 780 MHz and it will match with
    cached_raw_freq. But we will choose 1.2 GHz instead of 800 MHz here.

    Fixes: b7eaf1aab9f8 (cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely)
    Signed-off-by: Viresh Kumar
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Greg Kroah-Hartman

    Viresh Kumar
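
    A userspace model of the scenario spelled out above: the cache key is left
    in place even though the freshly computed frequency is not used, so the
    next identical request takes the fast path and returns a frequency that was
    never selected. The structure, names and arithmetic below are a
    simplification, not the schedutil code itself.

    #include <stdio.h>

    struct sg_policy {
        unsigned int cached_raw_freq;
        unsigned int next_freq;
    };

    static unsigned int get_next_freq(struct sg_policy *sg, unsigned int raw,
                                      int cpu_was_busy, int fixed)
    {
        if (raw == sg->cached_raw_freq)             /* fast path */
            return sg->next_freq;

        sg->cached_raw_freq = raw;
        unsigned int computed = raw + 20;           /* stand-in for the real map */

        if (cpu_was_busy && computed < sg->next_freq) {
            if (fixed)
                sg->cached_raw_freq = 0;            /* keep the cache in sync */
            return sg->next_freq;                   /* stay at the higher freq */
        }
        sg->next_freq = computed;
        return sg->next_freq;
    }

    int main(void)
    {
        for (int fixed = 0; fixed <= 1; fixed++) {
            struct sg_policy sg = { .cached_raw_freq = 0, .next_freq = 1200 };

            unsigned int first  = get_next_freq(&sg, 780, 1, fixed); /* busy */
            unsigned int second = get_next_freq(&sg, 780, 0, fixed);

            printf("%s: first=%u MHz, second=%u MHz\n",
                   fixed ? "fixed" : "buggy", first, second);
        }
        return 0;
    }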
     

24 Nov, 2017

1 commit

  • commit 135bd1a230bb69a68c9808a7d25467318900b80a upstream.

    The pending-callbacks check in rcu_prepare_for_idle() is backwards.
    It should accelerate if there are pending callbacks, but the check
    rather uselessly accelerates only if there are no callbacks. This commit
    therefore inverts this check.

    Fixes: 15fecf89e46a ("srcu: Abstract multi-tail callback list handling")
    Signed-off-by: Neeraj Upadhyay
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Greg Kroah-Hartman

    Neeraj Upadhyay
     

10 Nov, 2017

1 commit

  • Pull final power management fixes from Rafael Wysocki:
    "These fix a regression in the schedutil cpufreq governor introduced by
    a recent change and blacklist Dell XPS13 9360 from using the Low Power
    S0 Idle _DSM interface which triggers serious problems on one of these
    machines.

    Specifics:

    - Prevent the schedutil cpufreq governor from using the utilization
    of a wrong CPU in some cases which started to happen after one of
    the recent changes in it (Chris Redpath).

    - Blacklist Dell XPS13 9360 from using the Low Power S0 Idle _DSM
    interface, as that causes a serious issue (related to NVMe) to appear
    on one of these machines, even though other Dell XPS13 9360 units in
    somewhat different HW configurations behave correctly (Rafael
    Wysocki)"

    * tag 'pm-final-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    ACPI / PM: Blacklist Low Power S0 Idle _DSM for Dell XPS13 9360
    cpufreq: schedutil: Examine the correct CPU when we update util

    Linus Torvalds
     

09 Nov, 2017

1 commit