02 Feb, 2015

1 commit

  • Commit 8eb23b9f35aa ("sched: Debug nested sleeps") added code to report
    on nested sleep conditions, which we generally want to avoid because the
    inner sleeping operation can re-set the thread state to TASK_RUNNING,
    but that will then cause the outer sleep loop to not actually sleep
    when it calls schedule().

    However, that's actually valid traditional behavior, with the inner
    sleep being some fairly rare case (like taking a sleeping lock that
    normally doesn't actually need to sleep).
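
    For reference, a minimal sketch of the traditional pattern in
    question (the wait condition, the lock and the helper here are
    hypothetical stand-ins, not code from any particular driver):

        for (;;) {
                set_current_state(TASK_INTERRUPTIBLE);
                if (event_pending())
                        break;
                /* may sleep on contention, resetting the task state
                   to TASK_RUNNING ... */
                mutex_lock(&dev_lock);
                prepare_to_handle_event();
                mutex_unlock(&dev_lock);
                /* ... so this may return without blocking; the loop
                   simply spins once more, which is harmless */
                schedule();
        }
        __set_current_state(TASK_RUNNING);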

    And the debug code would actually change the state of the task to
    TASK_RUNNING internally, which makes that kind of traditional and
    working code not work at all, because now the nested sleep doesn't just
    sometimes cause the outer one to not block, but will cause it to happen
    every time.

    In particular, it will cause the cardbus kernel daemon (pccardd) to
    basically busy-loop doing scheduling, converting a laptop into a heater,
    as reported by Bruno Prémont. But there may be other legacy uses of
    that nested sleep model in other drivers that are also likely to never
    get converted to the new model.

    This fixes both cases:

    - don't set TASK_RUNNING when the nested condition happens (note: even
    if WARN_ONCE() only _warns_ once, the return value isn't whether the
    warning happened, but whether the condition for the warning was true.
    So despite the warning only happening once, the "if (WARN_ON(..))"
    would trigger for every nested sleep.)

    - in the cases where we knowingly disable the warning by using
    "sched_annotate_sleep()", don't change the task state (that is used
    for all core scheduling decisions), instead use '->task_state_change'
    that is used for the debugging decision itself.

    (Credit for the second part of the fix goes to Oleg Nesterov: "Can't we
    avoid this subtle change in behaviour DEBUG_ATOMIC_SLEEP adds?" with the
    suggested change to use 'task_state_change' as part of the test)

    Reported-and-bisected-by: Bruno Prémont
    Tested-by: Rafael J Wysocki
    Acked-by: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Ilya Dryomov
    Cc: Mike Galbraith
    Cc: Ingo Molnar
    Cc: Peter Hurley
    Cc: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

31 Jan, 2015

1 commit

  • Pull perf fixes from Ingo Molnar:
    "Mostly tooling fixes, but also an event groups fix, two PMU driver
    fixes and a CPU model variant addition"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf: Tighten (and fix) the grouping condition
    perf/x86/intel: Add model number for Airmont
    perf/rapl: Fix crash in rapl_scale()
    perf/x86/intel/uncore: Move uncore_box_init() out of driver initialization
    perf probe: Fix probing kretprobes
    perf symbols: Introduce 'for' method to iterate over the symbols with a given name
    perf probe: Do not rely on map__load() filter to find symbols
    perf symbols: Introduce method to iterate symbols ordered by name
    perf symbols: Return the first entry with a given name in find_by_name method
    perf annotate: Fix memory leaks in LOCK handling
    perf annotate: Handle ins parsing failures
    perf scripting perl: Force to use stdbool
    perf evlist: Remove extraneous 'was' on error message

    Linus Torvalds
     

28 Jan, 2015

2 commits

  • The fix from 9fc81d87420d ("perf: Fix events installation during
    moving group") was incomplete in that it failed to recognise that
    creating a group with events for different CPUs is semantically
    broken -- they cannot be co-scheduled.

    Furthermore, it leads to real breakage where, when we create an event
    for CPU Y and then migrate it to form a group on CPU X, the code gets
    confused about where the counter is programmed -- triggered in practice
    as well by me via the perf fuzzer.

    Fix this by tightening the rules for creating groups. Only allow
    grouping of counters that can be co-scheduled in the same context.
    This means for the same task and/or the same cpu.
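
    As a rough illustration of the tightened rule, a sketch using the raw
    perf_event_open(2) syscall (attr setup abbreviated, error handling
    elided):

        struct perf_event_attr attr = {
                .type   = PERF_TYPE_HARDWARE,
                .config = PERF_COUNT_HW_CPU_CYCLES,
                .size   = sizeof(attr),
        };
        /* group leader counting on CPU 0, any task */
        int leader = syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
        /* a sibling bound to CPU 1 under the CPU 0 leader: the two can
           never be co-scheduled, so this is now expected to fail */
        int fd = syscall(__NR_perf_event_open, &attr, -1, 1, leader, 0);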

    Fixes: 9fc81d87420d ("perf: Fix events installation during moving group")
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20150123125834.090683288@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Pull networking fixes from David Miller:

    1) Don't OOPS on socket AIO, from Christoph Hellwig.

    2) Scheduled scans should be aborted upon RFKILL, from Emmanuel
    Grumbach.

    3) Fix sleep in atomic context in kvaser_usb, from Ahmed S Darwish.

    4) Fix RCU locking across copy_to_user() in bpf code, from Alexei
    Starovoitov.

    5) Lots of crash, memory leak, short TX packet et al bug fixes in
    sh_eth from Ben Hutchings.

    6) Fix memory corruption in SCTP wrt. INIT collisions, from Daniel
    Borkmann.

    7) Fix return value logic for poll handlers in netxen, enic, and bnx2x.
    From Eric Dumazet and Govindarajulu Varadarajan.

    8) Header length calculation fix in mac80211 from Fred Chou.

    9) mv643xx_eth doesn't handle highmem correctly in non-TSO code paths.
    From Ezequiel Garcia.

    10) udp_diag has bogus logic in its hash chain skipping; copy the same
    fix tcp_diag used. From Herbert Xu.

    11) amd-xgbe programs wrong rx flow control register, from Thomas
    Lendacky.

    12) Fix race leading to use after free in ping receive path, from Subash
    Abhinov Kasiviswanathan.

    13) Cache redirect routes otherwise we can get a heavy backlog of rcu
    jobs liberating DST_NOCACHE entries. From Hannes Frederic Sowa.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (48 commits)
    net: don't OOPS on socket aio
    stmmac: prevent probe drivers to crash kernel
    bnx2x: fix napi poll return value for repoll
    ipv6: replacing a rt6_info needs to purge possible propagated rt6_infos too
    sh_eth: Fix DMA-API usage for RX buffers
    sh_eth: Check for DMA mapping errors on transmit
    sh_eth: Ensure DMA engines are stopped before freeing buffers
    sh_eth: Remove RX overflow log messages
    ping: Fix race in free in receive path
    udp_diag: Fix socket skipping within chain
    can: kvaser_usb: Fix state handling upon BUS_ERROR events
    can: kvaser_usb: Retry the first bulk transfer on -ETIMEDOUT
    can: kvaser_usb: Send correct context to URB completion
    can: kvaser_usb: Do not sleep in atomic context
    ipv4: try to cache dst_entries which would cause a redirect
    samples: bpf: relax test_maps check
    bpf: rcu lock must not be held when calling copy_to_user()
    net: sctp: fix slab corruption from use after free on INIT collisions
    net: mv643xx_eth: Fix highmem support in non-TSO egress path
    sh_eth: Fix serialisation of interrupt disable with interrupt & NAPI handlers
    ...

    Linus Torvalds
     

27 Jan, 2015

2 commits

  • BUG: sleeping function called from invalid context at mm/memory.c:3732
    in_atomic(): 0, irqs_disabled(): 0, pid: 671, name: test_maps
    1 lock held by test_maps/671:
    #0: (rcu_read_lock){......}, at: [] map_lookup_elem+0xe8/0x260
    Call Trace:
    ([] show_trace+0x12e/0x150)
    [] show_stack+0xa0/0x100
    [] dump_stack+0x74/0xc8
    [] ___might_sleep+0x23a/0x248
    [] might_fault+0x70/0xe8
    [] map_lookup_elem+0x188/0x260
    [] SyS_bpf+0x20e/0x840

    Fix it by allocating a temporary buffer to store the map element value.
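
    In sketch form (a simplified rendering of the fixed lookup path, with
    error handling trimmed):

        value = kmalloc(map->value_size, GFP_USER);
        rcu_read_lock();
        ptr = map->ops->map_lookup_elem(map, key);
        if (ptr)
                memcpy(value, ptr, map->value_size);
        rcu_read_unlock();
        /* copy_to_user() may fault and sleep -- only legal now that
           the RCU read-side critical section has ended */
        if (ptr && copy_to_user(uvalue, value, map->value_size))
                err = -EFAULT;
        kfree(value);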

    Fixes: db20fd2b0108 ("bpf: add lookup/update/delete/iterate methods to BPF maps")
    Reported-by: Michael Holzheu
    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • Pull cgroup fix from Tejun Heo:
    "The lifetime rules of cgroup hierarchies always have been somewhat
    counter-intuitive and cgroup core tried to enforce that hierarchies
    w/o userland-visible usages must die in a finite amount of time so that
    the controllers can be reused for other hierarchies; unfortunately,
    this can't be implemented reasonably for the memory controller - the
    kmemcg part doesn't have any way to forcefully drain the existing
    usages, leading to an interruptible hang if a following mount attempts
    to use the controller in any way.

    So, it seems like we're stuck with "hierarchies live on till they die
    whenever that may be" at least for now. This pretty much confines
    attaching controllers to hierarchies to before the hierarchies are
    actively used by making dynamic configurations post active usages
    unreliable. This has never been reliable and should be fine in
    practice given how cgroups are used.

    After the patch, a hierarchy isn't killed if it isn't already
    drained. A following mount attempt with the same mount options will
    reuse the existing hierarchy. Mount attempts with differing options
    will fail w/ -EBUSY"

    * 'for-3.19-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: prevent mount hang due to memory controller lifetime

    Linus Torvalds
     

26 Jan, 2015

2 commits

  • Pull x86 fixes from Thomas Gleixner:
    "Hopefully the last round of fixes for 3.19

    - regression fix for the LDT changes
    - regression fix for XEN interrupt handling caused by the APIC
    changes
    - regression fixes for the PAT changes
    - last minute fixes for the new MPX support
    - regression fix for 32bit UP
    - fix for a long standing relocation issue on 64bit tagged for stable
    - functional fix for the Hyper-V clocksource tagged for stable
    - downgrade of a pr_err which tends to confuse users

    Looks a bit on the large side, but almost half of it is valuable
    comments"

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/tsc: Change Fast TSC calibration failed from error to info
    x86/apic: Re-enable PCI_MSI support for non-SMP X86_32
    x86, mm: Change cachemode exports to non-gpl
    x86, tls: Interpret an all-zero struct user_desc as "no segment"
    x86, tls, ldt: Stop checking lm in LDT_empty
    x86, mpx: Strictly enforce empty prctl() args
    x86, mpx: Fix potential performance issue on unmaps
    x86, mpx: Explicitly disable 32-bit MPX support on 64-bit kernels
    x86, hyperv: Mark the Hyper-V clocksource as being continuous
    x86: Don't rely on VMWare emulating PAT MSR correctly
    x86, irq: Properly tag virtualization entry in /proc/interrupts
    x86, boot: Skip relocs when load address unchanged
    x86/xen: Override ACPI IRQ management callback __acpi_unregister_gsi
    ACPI: pci: Do not clear pci_dev->irq in acpi_pci_irq_disable()
    x86/xen: Treat SCI interrupt as normal GSI interrupt

    Linus Torvalds
     
  • Pull timer fixes from Thomas Gleixner:
    "A set of small fixes:

    - regression fix for exynos_mct clocksource

    - trivial build fix for kona clocksource

    - functional one liner fix for the sh_tmu clocksource

    - two validation fixes to prevent (root only) data corruption in the
    kernel via settimeofday and adjtimex. Tagged for stable"

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    time: adjtimex: Validate the ADJ_FREQUENCY values
    time: settimeofday: Validate the values of tv from user
    clocksource: sh_tmu: Set cpu_possible_mask to fix SMP broadcast
    clocksource: kona: fix __iomem annotation
    clocksource: exynos_mct: Fix bitmask regression for exynos4_mct_write

    Linus Torvalds
     

23 Jan, 2015

2 commits

  • Description from Michael Kerrisk. He suggested an identical patch
    to one I had already coded up and tested.

    commit fe3d197f8431 "x86, mpx: On-demand kernel allocation of bounds
    tables" added two new prctl() operations, PR_MPX_ENABLE_MANAGEMENT and
    PR_MPX_DISABLE_MANAGEMENT. However, no checks were included to ensure
    that unused arguments are zero, as is done in many existing prctl()s
    and as should be done for all new prctl()s. This patch adds the
    required checks.

    Suggested-by: Andy Lutomirski
    Suggested-by: Michael Kerrisk
    Signed-off-by: Dave Hansen
    Cc: Dave Hansen
    Link: http://lkml.kernel.org/r/20150108223022.7F56FD13@viggo.jf.intel.com
    Signed-off-by: Thomas Gleixner

    Dave Hansen
     
  • Pull module and param fixes from Rusty Russell:
    "Surprising number of fixes this merge window :(

    The first two are minor fallout from the param rework which went in
    this merge window.

    The next three are a series which fixes a longstanding (but never
    previously reported and unlikely, so no CC stable) race between
    kallsyms and freeing the init section.

    Finally, a minor cleanup as our module refcount will now be -1 during
    unload"

    * tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
    module: make module_refcount() a signed integer.
    module: fix race in kallsyms resolution during module load success.
    module: remove mod arg from module_free, rename module_memfree().
    module_arch_freeing_init(): new hook for archs before module->module_init freed.
    param: fix uninitialized read with CONFIG_DEBUG_LOCK_ALLOC
    param: initialize store function to NULL if not available.

    Linus Torvalds
     

22 Jan, 2015

3 commits

  • Since b2052564e66d ("mm: memcontrol: continue cache reclaim from
    offlined groups"), re-mounting the memory controller after using it is
    very likely to hang.

    The cgroup core assumes that any remaining references after deleting a
    cgroup are temporary in nature, and synchronously waits for them, but
    the above-mentioned commit has left-over page cache pin its css until
    it is reclaimed naturally. That being said, swap entries and charged
    kernel memory have been doing the same indefinite pinning forever; the
    bug is just more likely to trigger with left-over page cache.

    Reparenting kernel memory is highly impractical, which leaves changing
    the cgroup assumptions to reflect this: once a controller has been
    mounted and used, it has internal state that is independent from mount
    and cgroup lifetime. It can be unmounted and remounted, but it can't
    be reconfigured during subsequent mounts.

    Don't offline the controller root as long as there are any children,
    dead or alive. A remount will no longer wait for these old references
    to drain, it will simply mount the persistent controller state again.

    Reported-by: "Suzuki K. Poulose"
    Reported-by: Will Deacon
    Signed-off-by: Johannes Weiner
    Signed-off-by: Tejun Heo

    Johannes Weiner
     
  • …ultz/linux into timers/urgent

    Pull urgent fixes from John Stultz:

    Two urgent fixes for user triggerable time related overflow issues

    Thomas Gleixner
     
    James Bottomley points out that the module refcount will be -1 during
    unload. It's
    only used for diagnostics, so let's not hide that as it could be a
    clue as to what's gone wrong.

    Cc: Jason Wessel
    Acked-and-documentation-added-by: James Bottomley
    Reviewed-by: Masami Hiramatsu
    Signed-off-by: Rusty Russell

    Rusty Russell
     

21 Jan, 2015

1 commit

  • Pull workqueue fix from Tejun Heo:
    "The xfs folks have been running into weird and very rare lockups for
    some time now. I didn't think this could have been from workqueue
    side because no one else was reporting it. This time, Eric had a
    kdump which we looked into and it turned out this actually was a
    workqueue bug and the bug has been there since the beginning of
    concurrency managed workqueue.

    A worker pool ensures forward progress of the workqueues associated
    with it by always having at least one worker reserved from executing
    work items. When the pool is under contention, this idle worker tries
    to create more workers for the pool and, if that doesn't succeed
    quickly enough, it calls the rescuers to the pool.

    This logic had a subtle race condition in an early exit path. When a
    worker invokes this manager function, the function may return %false
    indicating that the caller may proceed to executing work items either
    because another worker is already performing the role or conditions
    have changed and the pool is no longer under contention.

    The latter part depended on the assumption that whether more workers
    are necessary or not remains stable while the pool is locked; however,
    pool->nr_running (the concurrency count) may change asynchronously, and
    its getting bumped from zero could send off the last idle
    worker to execute work items.

    The race window is fairly narrow, and, even when it gets triggered,
    the pool deadlocks iff all work items get blocked on pending work
    items of the pool, which is highly unlikely but can be triggered by
    xfs.

    The patch removes the race window by removing the early exit path,
    which doesn't serve any purpose anymore anyway"

    * 'for-3.19-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: fix subtle pool management issue which can stall whole worker_pool

    Linus Torvalds
     

20 Jan, 2015

4 commits

  • The kallsyms routines (module_symbol_name, lookup_module_* etc) disable
    preemption to walk the modules rather than taking the module_mutex:
    this is because they are used for symbol resolution during oopses.

    This works because there are synchronize_sched() and synchronize_rcu()
    in the unload and failure paths. However, there's one case which doesn't
    have that: the normal case where module loading succeeds, and we free
    the init section.

    We don't want a synchronize_rcu() there, because it would slow down
    module loading: this bug was introduced in 2009 by the change made to
    speed up module loading in the first place.

    Thus, we want to do the free in an RCU callback. We do this in the
    simplest possible way by allocating a new rcu_head: if we put it in
    the module structure we'd have to worry about that getting freed.
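
    A sketch of the shape this takes (the struct and function names here
    are assumptions for illustration, not necessarily the ones in the
    patch):

        struct mod_initfree {
                struct rcu_head rcu;
                void *module_init;
        };

        static void do_free_init(struct rcu_head *head)
        {
                struct mod_initfree *m =
                        container_of(head, struct mod_initfree, rcu);
                module_memfree(m->module_init);
                kfree(m);
        }

        /* in the load-success path, instead of freeing synchronously: */
        call_rcu_sched(&freeinit->rcu, do_free_init);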

    Reported-by: Rui Xiang
    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • Nothing needs the module pointer any more, and the next patch will
    call it from RCU, where the module itself might no longer exist.
    Removing the arg is the safest approach.

    This just codifies the use of the module_alloc/module_free pattern
    which ftrace and bpf use.
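
    Roughly, the hook's signature changes like so:

        /* was: void module_free(struct module *mod, void *module_region); */
        void module_memfree(void *module_region);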

    Signed-off-by: Rusty Russell
    Acked-by: Alexei Starovoitov
    Cc: Mikael Starvik
    Cc: Jesper Nilsson
    Cc: Ralf Baechle
    Cc: Ley Foon Tan
    Cc: Benjamin Herrenschmidt
    Cc: Chris Metcalf
    Cc: Steven Rostedt
    Cc: x86@kernel.org
    Cc: Ananth N Mavinakayanahalli
    Cc: Anil S Keshavamurthy
    Cc: Masami Hiramatsu
    Cc: linux-cris-kernel@axis.com
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mips@linux-mips.org
    Cc: nios2-dev@lists.rocketboards.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: sparclinux@vger.kernel.org
    Cc: netdev@vger.kernel.org

    Rusty Russell
     
  • Archs have been abusing module_free() to clean up their arch-specific
    allocations. Since module_free() is also (ab)used by BPF and trace code,
    let's keep it to simple allocations, and provide a hook called before
    that.

    This means that avr32, ia64, parisc and s390 no longer need to implement
    their own module_free() at all. avr32 doesn't need module_finalize()
    either.
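
    The new hook can default to a no-op that the listed archs override; a
    minimal sketch:

        /* nothing to do unless the arch overrides this */
        void __weak module_arch_freeing_init(struct module *mod)
        {
        }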

    Signed-off-by: Rusty Russell
    Cc: Chris Metcalf
    Cc: Haavard Skinnemoen
    Cc: Hans-Christian Egtvedt
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-ia64@vger.kernel.org
    Cc: linux-parisc@vger.kernel.org
    Cc: linux-s390@vger.kernel.org

    Rusty Russell
     
    ignore_lockdep is uninitialized, and sysfs_attr_init() doesn't initialize
    it, so memset the attribute to 0.

    Reported-by: Huang Ying
    Cc: Eric W. Biederman
    Signed-off-by: Rusty Russell

    Rusty Russell
     

17 Jan, 2015

3 commits

  • Avoid overflow possibility.

    [ The overflow is purely theoretical, since this is used for memory
    ranges that aren't even close to using the full 64 bits, but this is
    the right thing to do regardless. - Linus ]

    Signed-off-by: Louis Langholtz
    Cc: Yinghai Lu
    Cc: Peter Anvin
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Louis Langholtz
     
  • A worker_pool's forward progress is guaranteed by the fact that the
    last idle worker assumes the manager role to create more workers and
    summon the rescuers if creating workers doesn't succeed in a timely
    manner before proceeding to execute work items.

    This manager role is implemented in manage_workers(), which indicates
    whether the worker may proceed to work item execution with its return
    value. This is necessary because multiple workers may contend for the
    manager role, and, if there already is a manager, others should
    proceed to work item execution.

    Unfortunately, the function also indicates that the worker may proceed
    to work item execution if need_to_create_worker() is false at the head
    of the function. need_to_create_worker() tests the following
    conditions:

    pending work items && !nr_running && !nr_idle

    The first and third conditions are protected by pool->lock and thus
    won't change while holding pool->lock; however, nr_running can change
    asynchronously as other workers block and resume. While it's likely
    to be zero, as someone woke this worker up in the first place, some
    other workers could have become runnable in between, making it non-zero.

    If this happens, manage_workers() could return false even with zero
    nr_idle, making the worker, the last idle one, proceed to execute work
    items. If then all workers of the pool end up blocking on a resource
    which can only be released by a work item which is pending on that
    pool, the whole pool can deadlock as there's no one to create more
    workers or summon the rescuers.

    This patch fixes the problem by removing the early exit condition from
    maybe_create_worker() and making manage_workers() return false iff
    there's already another manager, which ensures that the last worker
    doesn't start executing work items.

    We could leave the early exit condition alone and just ignore the return
    value, but the only reason it was put there is that
    manage_workers() used to perform both creation and destruction of
    workers, and thus the function could be invoked while the pool was trying
    to reduce the number of workers. Now that manage_workers() is called
    only when more workers are needed, the only cases where this early exit
    condition triggers are rare races, rendering it pointless.

    Tested with a simulated workload and modified workqueue code which
    trigger the pool deadlock reliably without this patch.
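
    After the change, the manager logic reduces to roughly the following
    (a simplified sketch of that era's code; the early
    need_to_create_worker() bail-out at the top is gone):

        static bool manage_workers(struct worker *worker)
        {
                struct worker_pool *pool = worker->pool;

                /* another worker already holds the manager role:
                   the caller may proceed to execute work items */
                if (!mutex_trylock(&pool->manager_arb))
                        return false;

                maybe_create_worker(pool); /* until workers suffice */

                mutex_unlock(&pool->manager_arb);
                return true;    /* managed: the caller re-checks */
        }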

    Signed-off-by: Tejun Heo
    Reported-by: Eric Sandeen
    Link: http://lkml.kernel.org/g/54B019F4.8030009@sandeen.net
    Cc: Dave Chinner
    Cc: Lai Jiangshan
    Cc: stable@vger.kernel.org

    Tejun Heo
     
  • …it/rostedt/linux-trace

    Pull ftrace fixes from Steven Rostedt:
    "This holds a few fixes to the ftrace infrastructure as well as the
    mixture of function graph tracing and kprobes.

    When jprobes and function graph tracing are enabled at the same time it
    will crash the system:

    # modprobe jprobe_example
    # echo function_graph > /sys/kernel/debug/tracing/current_tracer

    After the first fork (jprobe_example probes it), the system will
    crash.

    This is due to the way jprobes copies the stack frame and does not do
    a normal function return. This messes up the function graph
    tracing accounting which hijacks the return address from the stack and
    replaces it with a hook function. It saves the return addresses in a
    separate stack to put back the correct return address when done. But
    because the jprobe functions do not do a normal return, their stack
    addresses are not put back until the function they probe is called,
    which means that the probed function will get the return address of
    the jprobe handler instead of its own.

    The simple fix here was to disable function graph tracing while the
    jprobe handler is being called.

    While debugging this I found two minor bugs with the function graph
    tracing.

    The first was about the function graph tracer sharing its function
    hash with the function tracer (they both get filtered by the same
    input). Changing set_ftrace_filter would not sync the function
    records after a change if the function tracer was
    disabled but the function graph tracer was enabled. This was due to
    the update only checking one of the ops instead of the shared ops to
    see if they were enabled and should perform the sync. This caused the
    ftrace accounting to break and a ftrace_bug() would be triggered,
    disabling ftrace until a reboot.

    The second was that the check to update records only checked one of
    the filter hashes. It needs to test both the "filter" and "notrace"
    hashes. The "filter" hash determines what functions to trace, whereas
    the "notrace" hash determines what functions not to trace (trace all
    but these). Both hashes need to be passed to the update code to find
    out what change is being done during the update. This also broke the
    ftrace record accounting and triggered a ftrace_bug().

    This patch set also includes two more fixes that were reported
    separately from the kprobe issue.

    One was that init_ftrace_syscalls() was called twice at boot up. This
    is not a major bug, but that call performed a rather large kmalloc
    (NR_syscalls * sizeof(*syscalls_metadata)). The second call made the
    first one a memory leak, and wastes memory.

    The other fix is a regression caused by an update in the v3.19 merge
    window. The move to enable events early moved the enabling before
    PID 1 was created. The syscall events require setting the
    TIF_SYSCALL_TRACEPOINT flag for all tasks. But for_each_process_thread()
    does not include the swapper task (PID 0), so the enabling ended up
    being a nop.

    A suggested fix was to have init_task's flag set, but I
    didn't really want to mess with PID 0 for this minor bug. Instead I
    disable and re-enable events again at early_initcall(), where they used
    to be enabled. This also handles any other event that might have its own
    reg function that could break at early boot up"

    * tag 'trace-fixes-v3.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: Fix enabling of syscall events on the command line
    tracing: Remove extra call to init_ftrace_syscalls()
    ftrace/jprobes/x86: Fix conflict between jprobes and function graph tracing
    ftrace: Check both notrace and filter for old hash
    ftrace: Fix updating of filters for shared global_ops filters

    Linus Torvalds
     

15 Jan, 2015

4 commits

  • Commit 5f893b2639b2 "tracing: Move enabling tracepoints to just after
    rcu_init()" broke the enabling of system call events from the command
    line. The reason was that the enabling of command line trace events
    was moved before PID 1 started, and the syscall tracepoints require
    that all tasks have the TIF_SYSCALL_TRACEPOINT flag set. But the
    swapper task (pid 0) is not part of that. Since the swapper task is the
    only task running this early in boot, no task gets the
    flag set, and the tracepoint never gets reached.

    Instead of setting the swapper task flag (there should be no reason to
    do that), re-enable trace events again after the init thread (PID 1)
    has been started. This requires disabling all command line events and
    re-enabling them, as just enabling them again will not reset the logic
    to set the TIF_SYSCALL_TRACEPOINT flag: the syscall tracepoint will
    be fooled into thinking that it was already set, and won't try setting
    it again. For this reason, we must first disable and then re-enable
    the events.

    Link: http://lkml.kernel.org/r/1421188517-18312-1-git-send-email-mpe@ellerman.id.au
    Link: http://lkml.kernel.org/r/20150115040506.216066449@goodmis.org

    Reported-by: Michael Ellerman
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     
  • trace_init() calls init_ftrace_syscalls() and then calls trace_event_init()
    which also calls init_ftrace_syscalls(). It makes more sense to only
    call it from trace_event_init().

    Calling it twice wastes memory, as it allocates the syscall events twice,
    and leaks the first copy.

    Link: http://lkml.kernel.org/r/54AF53BD.5070303@huawei.com
    Link: http://lkml.kernel.org/r/20150115040505.930398632@goodmis.org

    Reported-by: Wang Nan
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     
  • Using just the filter for checking for trampolines or regs is not enough
    when updating the code against the records that represent all functions.
    Both the filter hash and the notrace hash need to be checked.

    To trigger this bug (using trace-cmd and perf):

    # perf probe -a do_fork
    # trace-cmd start -B foo -e probe
    # trace-cmd record -p function_graph -n do_fork sleep 1

    The trace-cmd record at the end clears the filter before it disables
    function_graph tracing, which then causes the accounting of the
    ftrace function records to become incorrect and causes ftrace to bug.

    Link: http://lkml.kernel.org/r/20150114154329.358378039@goodmis.org

    Cc: stable@vger.kernel.org
    [ still need to switch old_hash_ops to old_ops_hash ]
    Reviewed-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     
  • As the set_ftrace_filter affects both the function tracer as well as the
    function graph tracer, the ops that represent each have a shared
    ftrace_ops_hash structure. This allows both to be updated when the filter
    files are updated.

    But if function graph is enabled and the global_ops (function tracing) ops
    is not, then it is possible that the filter could be changed without the
    update happening for the function graph ops. This will cause the changes
    to not take place and may even cause a ftrace_bug to occur as it could mess
    with the trampoline accounting.

    The solution is, when an ops uses the shared global_ops filter but is
    not itself enabled, to check whether there's another ops that is
    enabled and also shares the global_ops filter. In that case, the
    modification still needs to be executed.
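
    In sketch form, the added check looks something like this (a
    hypothetical shape: the macros, flag and global_ops are real kernel
    names, but the surrounding logic is illustrative only):

        /* ops itself is disabled, but does an enabled ops share the
           same global filter hash?  Then still run the update. */
        if (ops->func_hash == &global_ops.local_hash) {
                struct ftrace_ops *op;

                do_for_each_ftrace_op(op, ftrace_ops_list) {
                        if (op != ops &&
                            op->func_hash == &global_ops.local_hash &&
                            op->flags & FTRACE_OPS_FL_ENABLED)
                                return true;
                } while_for_each_ftrace_op(op);
        }
        return false;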

    Link: http://lkml.kernel.org/r/20150114154329.055980438@goodmis.org

    Cc: stable@vger.kernel.org # 3.17+
    Reviewed-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     

12 Jan, 2015

3 commits

  • Pull scheduler fixes from Ingo Molnar:
    "Misc fixes: group scheduling corner case fix, two deadline scheduler
    fixes, effective_load() overflow fix, nested sleep fix, 6144 CPUs
    system fix"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/fair: Fix RCU stall upon -ENOMEM in sched_create_group()
    sched/deadline: Avoid double-accounting in case of missed deadlines
    sched/deadline: Fix migration of SCHED_DEADLINE tasks
    sched: Fix odd values in effective_load() calculations
    sched, fanotify: Deal with nested sleeps
    sched: Fix KMALLOC_MAX_SIZE overflow during cpumask allocation

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "Mostly tooling fixes, but also some kernel side fixes: uncore PMU
    driver fix, user regs sampling fix and an instruction decoder fix that
    unbreaks PEBS precise sampling"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/x86/uncore/hsw-ep: Handle systems with only two SBOXes
    perf/x86_64: Improve user regs sampling
    perf: Move task_pt_regs sampling into arch code
    x86: Fix off-by-one in instruction decoder
    perf hists browser: Fix segfault when showing callchain
    perf callchain: Free callchains when hist entries are deleted
    perf hists: Fix children sort key behavior
    perf diff: Fix to sort by baseline field by default
    perf list: Fix --raw-dump option
    perf probe: Fix crash in dwarf_getcfi_elf
    perf probe: Fix to fall back to find probe point in symbols
    perf callchain: Append callchains only when requested
    perf ui/tui: Print backtrace symbols when segfault occurs
    perf report: Show progress bar for output resorting

    Linus Torvalds
     
  • Pull locking fixes from Ingo Molnar:
    "A liblockdep fix and a mutex_unlock() mutex-debugging fix"

    * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    mutex: Always clear owner field upon mutex_unlock()
    tools/liblockdep: Fix debug_check thinko in mutex destroy

    Linus Torvalds
     

10 Jan, 2015

1 commit

  • Pull kgdb/kdb fixes from Jason Wessel:
    "These have been around since 3.17 and in kgdb-next for the last 9
    weeks and some will go back to -stable.

    Summary of changes:

    Cleanups
    - kdb: Remove unused command flags, repeat flags and KDB_REPEAT_NONE

    Fixes
    - kgdb/kdb: Allow access on a single core, if a CPU round up is
    deemed impossible, which will allow inspection of the now "trashed"
    kernel
    - kdb: Add enable mask for the command groups
    - kdb: access controls to restrict sensitive commands"

    * tag 'for_linus-3.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb:
    kernel/debug/debug_core.c: Logging clean-up
    kgdb: timeout if secondary CPUs ignore the roundup
    kdb: Allow access to sensitive commands to be restricted by default
    kdb: Add enable mask for groups of commands
    kdb: Categorize kdb commands (similar to SysRq categorization)
    kdb: Remove KDB_REPEAT_NONE flag
    kdb: Use KDB_REPEAT_* values as flags
    kdb: Rename kdb_register_repeat() to kdb_register_flags()
    kdb: Rename kdb_repeat_t to kdb_cmdflags_t, cmd_repeat to cmd_flags
    kdb: Remove currently unused kdbtab_t->cmd_flags

    Linus Torvalds
     

09 Jan, 2015

7 commits

    Currently if DEBUG_MUTEXES is enabled, the mutex->owner field is
    cleared only if debug_locks is active. This exposes a race to other users
    of the field where mutex->owner may still be set to a stale value,
    potentially upsetting mutex_spin_on_owner() among others.
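
    In sketch form (simplified; the point is only that the clearing moves
    out of the debug_locks-guarded block):

        void debug_mutex_unlock(struct mutex *lock)
        {
                if (likely(debug_locks)) {
                        DEBUG_LOCKS_WARN_ON(lock->magic != lock);
                        /* ... further sanity checks ... */
                }
                /* now unconditional: clear the owner even after
                   debug_locks has been switched off */
                mutex_clear_owner(lock);
        }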

    References: https://bugs.freedesktop.org/show_bug.cgi?id=87955
    Signed-off-by: Chris Wilson
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Davidlohr Bueso
    Cc: Daniel Vetter
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1420540175-30204-1-git-send-email-chris@chris-wilson.co.uk
    Signed-off-by: Ingo Molnar

    Chris Wilson
     
  • When alloc_fair_sched_group() in sched_create_group() fails,
    free_sched_group() is called, and free_fair_sched_group() is called by
    free_sched_group(). Since destroy_cfs_bandwidth() is called by
    free_fair_sched_group() without init_cfs_bandwidth() ever having been
    called, an RCU stall occurs at hrtimer_cancel():

    INFO: rcu_sched self-detected stall on CPU { 1} (t=60000 jiffies g=13074 c=13073 q=0)
    Task dump for CPU 1:
    (fprintd) R running task 0 6249 1 0x00000088
    ...
    Call Trace:
    [] sched_show_task+0xa8/0x110
    [] dump_cpu_task+0x3d/0x50
    [] rcu_dump_cpu_stacks+0x90/0xd0
    [] rcu_check_callbacks+0x491/0x700
    [] update_process_times+0x4b/0x80
    [] tick_sched_handle.isra.20+0x36/0x50
    [] tick_sched_timer+0x42/0x70
    [] __run_hrtimer+0x69/0x1a0
    [] ? tick_sched_handle.isra.20+0x50/0x50
    [] hrtimer_interrupt+0xef/0x230
    [] local_apic_timer_interrupt+0x3b/0x70
    [] smp_apic_timer_interrupt+0x45/0x60
    [] apic_timer_interrupt+0x6d/0x80
    [] ? lock_hrtimer_base.isra.23+0x18/0x50
    [] ? __kmalloc+0x211/0x230
    [] hrtimer_try_to_cancel+0x22/0xd0
    [] ? __kmalloc+0x211/0x230
    [] hrtimer_cancel+0x22/0x30
    [] free_fair_sched_group+0x25/0xd0
    [] free_sched_group+0x16/0x40
    [] sched_create_group+0x4b/0x80
    [] sched_autogroup_create_attach+0x43/0x1c0
    [] sys_setsid+0x7c/0x110
    [] system_call_fastpath+0x12/0x17

    Check whether init_cfs_bandwidth() was called before calling
    destroy_cfs_bandwidth().
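
    Per the bracketed note below, the check sits in destroy_cfs_bandwidth()
    itself; roughly (a sketch exploiting the fact that the group is
    allocated zeroed, so the list head stays NULL until init runs):

        static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
        {
                /* init_cfs_bandwidth() was never called */
                if (!cfs_b->throttled_cfs_rq.next)
                        return;

                hrtimer_cancel(&cfs_b->period_timer);
                hrtimer_cancel(&cfs_b->slack_timer);
        }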

    Signed-off-by: Tetsuo Handa
    [ Move the check into destroy_cfs_bandwidth() to aid compilability. ]
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Paul Turner
    Cc: Ben Segall
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/201412252210.GCC30204.SOMVFFOtQJFLOH@I-love.SAKURA.ne.jp
    Signed-off-by: Ingo Molnar

    Tetsuo Handa
     
    The dl_runtime_exceeded() function is supposed to check if
    a SCHED_DEADLINE task must be throttled, by checking if its
    current runtime is <= 0.

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Juri Lelli
    Cc:
    Cc: Dario Faggioli
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1418813432-20797-3-git-send-email-luca.abeni@unitn.it
    Signed-off-by: Ingo Molnar

    Luca Abeni
     
  • According to global EDF, tasks should be migrated between runqueues
    without checking if their scheduling deadlines and runtimes are valid.
    However, SCHED_DEADLINE currently performs such a check:
    a migration happens doing:

    deactivate_task(rq, next_task, 0);
    set_task_cpu(next_task, later_rq->cpu);
    activate_task(later_rq, next_task, 0);

    which ends up calling dequeue_task_dl(), setting the new CPU, and then
    calling enqueue_task_dl().

    enqueue_task_dl() then calls enqueue_dl_entity(), which calls
    update_dl_entity(), which can modify scheduling deadline and runtime,
    breaking global EDF scheduling.

    As a result, some of the properties of global EDF are not respected:
    for example, a taskset {(30, 80), (40, 80), (120, 170)} scheduled on
    two cores can have unbounded response times for the third task even
    if 30/80+40/80+120/170 = 1.5809 < 2

    This can be fixed by invoking update_dl_entity() only in case of
    wakeup, or if this is a new SCHED_DEADLINE task.
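
    In sketch form, the enqueue path then refreshes the parameters only
    when that is semantically valid (simplified; flag handling
    abbreviated):

        /* in enqueue_dl_entity(): */
        if (dl_se->dl_new || (flags & ENQUEUE_WAKEUP))
                update_dl_entity(dl_se, pi_se);
        else if (flags & ENQUEUE_REPLENISH)
                replenish_dl_entity(dl_se, pi_se);
        /* plain migration: deadline and runtime are left untouched */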

    Signed-off-by: Luca Abeni
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Juri Lelli
    Cc:
    Cc: Dario Faggioli
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1418813432-20797-2-git-send-email-luca.abeni@unitn.it
    Signed-off-by: Ingo Molnar

    Luca Abeni
     
    In effective_load(), we have (long w * unsigned long tg->shares) /
    long W; when w is negative, it is cast to unsigned long and hence the
    product is insanely large. Fix this by casting tg->shares to long.
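
    The pitfall in miniature (plain C integer conversion rules, nothing
    kernel-specific):

        long w = -1024, W = 2048;
        unsigned long shares = 2048;
        /* w converts to unsigned long, so the intermediate product --
           and hence the quotient -- is a huge positive value */
        long bad  = w * shares / W;
        long good = w * (long)shares / W;       /* -1024, as intended */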

    Reported-by: Sasha Levin
    Signed-off-by: Yuyang Du
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Dave Jones
    Cc: Andrey Ryabinin
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20141219002956.GA25405@intel.com
    Signed-off-by: Ingo Molnar

    Yuyang Du
     
  • On x86_64, at least, task_pt_regs may be only partially initialized
    in many contexts, so x86_64 should not use it without extra care
    from interrupt context, let alone NMI context.

    This will allow x86_64 to override the logic and will supply some
    scratch space to use to make a cleaner copy of user regs.

    Tested-by: Jiri Olsa
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Stephane Eranian
    Cc: chenggang.qcg@taobao.com
    Cc: Wu Fengguang
    Cc: Namhyung Kim
    Cc: Mike Galbraith
    Cc: Arjan van de Ven
    Cc: David Ahern
    Cc: Arnaldo Carvalho de Melo
    Cc: Catalin Marinas
    Cc: Jean Pihet
    Cc: Linus Torvalds
    Cc: Mark Salter
    Cc: Russell King
    Cc: Will Deacon
    Cc: linux-arm-kernel@lists.infradead.org
    Link: http://lkml.kernel.org/r/e431cd4c18c2e1c44c774f10758527fb2d1025c4.1420396372.git.luto@amacapital.net
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     
    wait_consider_task() checks EXIT_ZOMBIE after EXIT_DEAD/EXIT_TRACE, and
    both checks can fail if we race with an EXIT_ZOMBIE -> EXIT_DEAD/EXIT_TRACE
    change in between and gcc reloads p->exit_state after
    security_task_wait(). In this case ->notask_error will be wrongly
    cleared and do_wait() can hang forever if it was the last eligible
    child.

    Many thanks to Arne who carefully investigated the problem.

    Note: this bug is very old but it was purely theoretical until commit
    b3ab03160dfa ("wait: completely ignore the EXIT_DEAD tasks"). Before
    this commit "-O2" was probably enough to guarantee that the compiler
    wouldn't read ->exit_state twice.
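
    The shape of the fix, sketched: take one snapshot of the field so
    both checks see the same value:

        /* one read; the compiler can no longer reload p->exit_state
           between the EXIT_DEAD/EXIT_TRACE and EXIT_ZOMBIE checks */
        int exit_state = ACCESS_ONCE(p->exit_state);

        if (exit_state == EXIT_DEAD)
                return 0;
        /* ... */
        if (exit_state == EXIT_ZOMBIE) {
                /* ... reap the child ... */
        }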

    Signed-off-by: Oleg Nesterov
    Reported-by: Arne Goedeke
    Tested-by: Arne Goedeke
    Cc: [3.15+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

08 Jan, 2015

2 commits

  • Verify that the frequency value from userspace is valid and makes sense.

    Unverified values can cause overflows later on.
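
    The bounds check amounts to roughly this (PPM_SCALE being the factor
    txc->freq is later multiplied by):

        if (txc->modes & ADJ_FREQUENCY) {
                if (LONG_MIN / PPM_SCALE > txc->freq)
                        return -EINVAL;
                if (LONG_MAX / PPM_SCALE < txc->freq)
                        return -EINVAL;
        }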

    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: stable
    Signed-off-by: Sasha Levin
    [jstultz: Fix up bug for negative values and drop redundant cap check]
    Signed-off-by: John Stultz

    Sasha Levin
     
    An unvalidated user input is multiplied by a constant, which can result
    in undefined behaviour for large values. While this is validated later,
    we should avoid triggering undefined behaviour.
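
    The validation is a simple range check before conversion, along the
    lines of timeval_valid():

        /* dates before 1970 are bogus */
        if (tv->tv_sec < 0)
                return -EINVAL;
        /* can't have more microseconds than a second */
        if (tv->tv_usec < 0 || tv->tv_usec >= USEC_PER_SEC)
                return -EINVAL;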

    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: stable
    Signed-off-by: Sasha Levin
    [jstultz: include trivial millisecond->microsecond correction noticed
    by Andy]
    Signed-off-by: John Stultz

    Sasha Levin
     

01 Jan, 2015

1 commit

  • Pull audit fix from Paul Moore:
    "One audit patch to resolve a panic/oops when recording filenames in
    the audit log, see the mail archive link below.

    The fix isn't as nice as I would like, as it involves an allocate/copy
    of the filename, but it solves the problem and the overhead should
    only affect users who have configured audit rules involving file
    names.

    We'll revisit this issue with future kernels in an attempt to make
    this suck less, but in the meantime I think this fix should go into
    the next release of v3.19-rcX.

    [ https://marc.info/?t=141986927600001&r=1&w=2 ]"

    * 'upstream' of git://git.infradead.org/users/pcmoore/audit:
    audit: create private file name copies when auditing inodes

    Linus Torvalds
     

31 Dec, 2014

1 commit

  • Pull networking fixes from David Miller:

    1) Fix double SKB free in bluetooth 6lowpan layer, from Jukka Rissanen.

    2) Fix receive checksum handling in enic driver, from Govindarajulu
    Varadarajan.

    3) Fix NAPI poll list corruption in virtio_net and caif_virtio, from
    Herbert Xu. Also, add code to detect drivers that have this mistake
    in the future.

    4) Fix doorbell endianness handling in mlx4 driver, from Amir Vadai.

    5) Don't clobber IP6CB() before xfrm6_policy_check() is called in TCP
    input path, from Nicolas Dichtel.

    6) Fix MPLS action validation in openvswitch, from Pravin B Shelar.

    7) Fix double SKB free in vxlan driver, also from Pravin.

    8) When we scrub a packet, which happens when we are switching the
    context of the packet (namespace, etc.), we should reset the
    secmark. From Thomas Graf.

    9) ->ndo_gso_check() needs to do more than return true/false; it also
    has to allow the driver to clear netdev feature bits in order for
    the caller to be able to proceed properly. From Jesse Gross.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (62 commits)
    genetlink: A genl_bind() to an out-of-range multicast group should not WARN().
    netlink/genetlink: pass network namespace to bind/unbind
    ne2k-pci: Add pci_disable_device in error handling
    bonding: change error message to debug message in __bond_release_one()
    genetlink: pass multicast bind/unbind to families
    netlink: call unbind when releasing socket
    netlink: update listeners directly when removing socket
    genetlink: pass only network namespace to genl_has_listeners()
    netlink: rename netlink_unbind() to netlink_undo_bind()
    net: Generalize ndo_gso_check to ndo_features_check
    net: incorrect use of init_completion fixup
    neigh: remove next ptr from struct neigh_table
    net: xilinx: Remove unnecessary temac_property in the driver
    net: phy: micrel: use generic config_init for KSZ8021/KSZ8031
    net/core: Handle csum for CHECKSUM_COMPLETE VXLAN forwarding
    openvswitch: fix odd_ptr_err.cocci warnings
    Bluetooth: Fix accepting connections when not using mgmt
    Bluetooth: Fix controller configuration with HCI_QUIRK_INVALID_BDADDR
    brcmfmac: Do not crash if platform data is not populated
    ipw2200: select CFG80211_WEXT
    ...

    Linus Torvalds