12 Mar, 2014

1 commit

  • Pull audit namespace fixes from Eric Biederman:
    "Starting with 3.14-rc1 the audit code is faulty (think oopses and
    races) with respect to how it computes the network namespace of which
    socket to reply to, and I happened to notice by chance when reading
    through the code.

    My testing and the automated build bots don't find any problems with
    these fixes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    audit: Update kdoc for audit_send_reply and audit_list_rules_send
    audit: Send replies in the proper network namespace.
    audit: Use struct net not pid_t to remember the network namespace to reply in

    Linus Torvalds
     

11 Mar, 2014

1 commit

  • GFP_THISNODE is for callers that implement their own clever fallback to
    remote nodes. It restricts the allocation to the specified node and
    does not invoke reclaim, assuming that the caller will take care of it
    when the fallback fails, e.g. through a subsequent allocation request
    without GFP_THISNODE set.

    However, many current GFP_THISNODE users only want the node exclusive
    aspect of the flag, without actually implementing their own fallback or
    triggering reclaim if necessary. This results in things like page
    migration failing prematurely even when there is easily reclaimable
    memory available, unless kswapd happens to be running already or a
    concurrent allocation attempt triggers the necessary reclaim.

    Convert all callsites that don't implement their own fallback strategy
    to __GFP_THISNODE. This restricts the allocation to a single node as well, but
    at the same time allows the allocator to enter the slowpath, wake
    kswapd, and invoke direct reclaim if necessary, to make the allocation
    happen when memory is full.
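
    For illustration, a node-exclusive allocation that still allows reclaim
    might look like the following minimal sketch (the wrapper function is
    hypothetical; alloc_pages_node() and the gfp flags are the real APIs):

    /* Restrict the allocation to node 'nid', but let the allocator wake
     * kswapd and enter direct reclaim if that node is short on memory. */
    static struct page *alloc_page_on_node(int nid)
    {
            return alloc_pages_node(nid, GFP_KERNEL | __GFP_THISNODE, 0);
    }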

    Signed-off-by: Johannes Weiner
    Acked-by: Rik van Riel
    Cc: Jan Stancek
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

09 Mar, 2014

2 commits

  • The kbuild test robot reported:
    > tree: git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-next
    > head: 6f285b19d09f72e801525f5eea1bdad22e559bf0
    > commit: 6f285b19d09f72e801525f5eea1bdad22e559bf0 [2/2] audit: Send replies in the proper network namespace.
    > reproduce: make htmldocs
    >
    > >> Warning(kernel/audit.c:575): No description found for parameter 'request_skb'
    > >> Warning(kernel/audit.c:575): Excess function parameter 'portid' description in 'audit_send_reply'
    > >> Warning(kernel/auditfilter.c:1074): No description found for parameter 'request_skb'
    > >> Warning(kernel/auditfilter.c:1074): Excess function parameter 'portid' description in 'audit_list_rules_send'

    Which was caused by my failure to update the kdoc annotations when I
    updated the functions. Fix that small oversight now.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Pull cgroup fixes from Tejun Heo:
    "Two cpuset locking fixes from Li. Both tagged for -stable"

    * 'for-3.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cpuset: fix a race condition in __cpuset_node_allowed_softwall()
    cpuset: fix a locking issue in cpuset_migrate_mm()

    Linus Torvalds
     

08 Mar, 2014

2 commits

  • …it/rostedt/linux-trace

    Pull tracing fix from Steven Rostedt:
    "In the past, I've had lots of reports about trace events not working.
    Developers would say they put a trace_printk() before and after the
    trace event but when they enable it (and the trace event said it was
    enabled) they would see the trace_printks but not the trace event.

    I was not able to reproduce this, but that's because I wasn't looking
    at the right location. Recently, another bug came up that showed the
    issue.

    If your kernel supports signed modules but allows for non-signed
    modules to be loaded, then when one is, the kernel will silently set
    the MODULE_FORCED taint on the module. Although this taint happens
    without the need for insmod --force or anything of the kind, it labels
    the module with that taint anyway.

    If this tainted module has tracepoints, the tracepoints will be
    ignored because of the MODULE_FORCED taint. But no error message will
    be displayed. Worse yet, the event infrastructure will still be
    created letting users enable the trace event represented by the
    tracepoint, although that event will never actually be enabled. This
    is because the tracepoint infrastructure allows for non-existing
    tracepoints to be enabled for new modules to arrive and have their
    tracepoints set.

    Although there are several things wrong with the above, this change
    only addresses the creation of the trace event files for tracepoints
    that are not created when a tainted module is loaded. With this
    change, an error message is printed about the module being tainted,
    and the trace event files are not created, because the trace event
    infrastructure is not set up for that module"

    * tag 'trace-fixes-v3.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: Do not add event files for modules that fail tracepoints

    Linus Torvalds
     
  • Pull irq fixes from Thomas Gleixner:
    - a bugfix for a long standing waitqueue race
    - a trivial fix for a missing include

    * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    genirq: Include missing header file in irqdomain.c
    genirq: Remove racy waitqueue_active check

    Linus Torvalds
     

04 Mar, 2014

2 commits

  • If a module fails to add its tracepoints due to module tainting, do not
    create the module's event infrastructure in the debugfs directory. The events
    would not work and, worse yet, would silently fail, making the user wonder
    why the events they enable do not display anything.

    Having a warning on module load and the events not visible to the users
    will make the cause of the problem much clearer.
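
    A rough sketch of the check this implies when a module's events are
    registered (the helper name and exact taint handling are assumptions,
    not the verbatim patch):

    static void trace_module_add_events(struct module *mod)
    {
            if (!mod->num_trace_events)
                    return;

            /* The module's tracepoints were never registered because of
             * the taint; warn instead of creating dead event files. */
            if (trace_module_has_bad_taint(mod)) {
                    pr_err("%s: module has bad taint, not creating trace events\n",
                           mod->name);
                    return;
            }

            /* ... otherwise create the event files as before ... */
    }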

    Link: http://lkml.kernel.org/r/20140227154923.265882695@goodmis.org

    Fixes: 6d723736e472 "tracing/events: add support for modules to TRACE_EVENT"
    Acked-by: Mathieu Desnoyers
    Cc: stable@vger.kernel.org # 2.6.31+
    Cc: Rusty Russell
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     
  • Pull scheduler fixes from Ingo Molnar:
    "Misc fixes, most of them SCHED_DEADLINE fallout"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/deadline: Prevent rt_time growth to infinity
    sched/deadline: Switch CPU's presence test order
    sched/deadline: Cleanup RT leftovers from {inc/dec}_dl_migration
    sched: Fix double normalization of vruntime

    Linus Torvalds
     

28 Feb, 2014

2 commits

  • In struct audit_netlink_list and audit_reply add a reference to the
    network namespace of the caller and remove the userspace pid of the
    caller. This cleanly remembers the caller's network namespace, and
    removes a huge class of races and nasty failure modes that can occur
    when attempting to re-look up the caller's network namespace from a
    pid_t (including the caller's network namespace changing, pid
    wraparound, and the pid simply not being present).
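
    For illustration, the reshaped reply bookkeeping looks roughly like this
    (field layout abridged; treat it as a sketch rather than the exact diff):

    struct audit_reply {
            __u32           portid;
            struct net      *net;   /* replaces the remembered pid_t of the caller */
            struct sk_buff  *skb;
    };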

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Pull filesystem fixes from Jan Kara:
    "Notification, writeback, udf, quota fixes

    The notification patches are (with one exception) fallout of my
    fsnotify rework which went into -rc1 (I've extended LTP to cover these
    corner cases to avoid similar breakage in the future).

    The UDF patch fixes a nasty data corruption Al has recently reported,
    the revert of the writeback patch is due to the possibility of violating
    sync(2) guarantees, and the quota fix addresses a bug that can lead to
    corruption of quota files in ocfs2"

    * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
    fsnotify: Allocate overflow events with proper type
    fanotify: Handle overflow in case of permission events
    fsnotify: Fix detection whether overflow event is queued
    Revert "writeback: do not sync data dirtied after sync start"
    quota: Fix race between dqput() and dquot_scan_active()
    udf: Fix data corruption on file type conversion
    inotify: Fix reporting of cookies for inotify events

    Linus Torvalds
     

27 Feb, 2014

9 commits

  • It's not safe to access a task's cpuset after releasing task_lock().
    Holding callback_mutex won't help.

    Cc:
    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     
  • I can trigger a lockdep warning:

    # mount -t cgroup -o cpuset xxx /cgroup
    # mkdir /cgroup/cpuset
    # mkdir /cgroup/tmp
    # echo 0 > /cgroup/tmp/cpuset.cpus
    # echo 0 > /cgroup/tmp/cpuset.mems
    # echo 1 > /cgroup/tmp/cpuset.memory_migrate
    # echo $$ > /cgroup/tmp/tasks
    # echo 1 > /cgroup/tmp/cpuset.mems

    ===============================
    [ INFO: suspicious RCU usage. ]
    3.14.0-rc1-0.1-default+ #32 Not tainted
    -------------------------------
    include/linux/cgroup.h:682 suspicious rcu_dereference_check() usage!
    ...
    [] dump_stack+0x72/0x86
    [] lockdep_rcu_suspicious+0x101/0x140
    [] cpuset_migrate_mm+0xb1/0xe0
    ...

    We used to hold cgroup_mutex when calling cpuset_migrate_mm(), but now
    we hold cpuset_mutex, which causes task_css() to complain.

    This is not a false-positive but a real issue.

    Holding cpuset_mutex won't prevent a task from migrating to another
    cpuset, and it won't prevent the original task->cgroup from being
    destroyed during this change.

    Fixes: 5d21cc2db040 (cpuset: replace cgroup_mutex locking with cpuset internal locking)
    Cc: # 3.9+
    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     
  • Include the appropriate header file include/linux/of_irq.h in
    kernel/irq/irqdomain.c because it contains the prototype of a
    function defined in kernel/irq/irqdomain.c.

    This eliminates the following warning in kernel/irq/irqdomain.c:
    kernel/irq/irqdomain.c:468:14: warning: no previous prototype for ‘irq_create_of_mapping’ [-Wmissing-prototypes]
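
    A minimal illustration of the change (near the top of
    kernel/irq/irqdomain.c):

    #include <linux/of_irq.h>       /* declares irq_create_of_mapping() */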

    Signed-off-by: Rashika Kheria
    Reviewed-by: Josh Triplett
    Cc: Benjamin Herrenschmidt
    Link: http://lkml.kernel.org/r/eb89aebea7ff1a46122918ac389ebecf8248be9a.1393493276.git.rashika.kheria@gmail.com
    Signed-off-by: Thomas Gleixner

    Rashika Kheria
     
  • Drew Richardson reported that he could make the kernel go *boom* when hotplugging
    while having perf events active.

    It turned out that when you have a group event, the code in
    __perf_event_exit_context() fails to remove the group siblings from
    the context.

    We then proceed with destroying and freeing the event, and when you
    re-plug the CPU and try and add another event to that CPU, things go
    *boom* because you've still got dead entries there.

    Reported-by: Drew Richardson
    Signed-off-by: Peter Zijlstra
    Cc: Will Deacon
    Cc:
    Link: http://lkml.kernel.org/n/tip-k6v5wundvusvcseqj1si0oz0@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Kirill Tkhai noted:

    Since deadline tasks share rt bandwidth, we must care about
    bandwidth timer set. Otherwise rt_time may grow up to infinity
    in update_curr_dl(), if there are no other available RT tasks
    on top level bandwidth.

    RT tasks were in fact throttled right after they got enqueued,
    and never executed again (rt_time never again went below rt_runtime).

    Peter then proposed to accrue DL execution on rt_time only when
    rt timer is active, and proposed a patch (this patch is a slight
    modification of that) to implement that behavior. While this
    solves Kirill's problem, it has a drawback.

    Indeed, Kirill noted again:

    It looks we may get into a situation, when all CPU time is shared
    between RT and DL tasks:

    rt_runtime = n
    rt_period = 2n

    | RT working, DL sleeping  | DL working, RT sleeping      |
    |--------------------------|------------------------------|
    | (1) duration = n         | (2) duration = n             | (repeat)
    | (rt_bw timer is running) | (rt_bw timer is not running) |

    No time for fair tasks at all.

    While this can happen during the first period, if rq is always backlogged,
    RT tasks won't have the opportunity to execute anymore: rt_time reached
    rt_runtime during (1), suppose after (2) RT is enqueued back, it gets
    throttled since rt timer didn't fire, replenishment is from now on eaten up
    by DL tasks that accrue their execution on rt_time (while rt timer is
    active - we have an RT task waiting for replenishment). FAIR tasks are
    not touched after this first period. Ok, this is not ideal, and the situation
    is even worse!

    What's described above (the nice case) practically never happens in
    reality, where your rt timer is not aligned to task periods, tasks are
    in general not periodic, etc. Long story short, you always risk
    overloading your system.

    This patch is based on Peter's idea, but exploits an additional fact:
    if you don't have RT tasks enqueued, it makes little sense to continue
    incrementing rt_time once you reached the upper limit (DL tasks have their
    own mechanism for throttling).

    This cures both problems:

    - no matter how many DL instances in the past, you'll have an rt_time
    slightly above rt_runtime when an RT task is enqueued, and from that
    point on (after the first replenishment), the task will normally execute;

    - you can still eat up all bandwidth during the first period, but not
    anymore after that, remember that DL execution will increment rt_time
    till the upper limit is reached.

    The situation is still not perfect! But, we have a simple solution for now,
    that limits how much you can jeopardize your system, as we keep working
    towards the right answer: RT groups scheduled using deadline servers.
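
    A sketch of the resulting accounting condition (helper name and exact
    placement are assumptions based on the description above): DL execution
    is charged to rt_time only while the rt period timer is running or
    rt_time is still below rt_runtime.

    /* in update_curr_dl(), roughly: */
    if (sched_rt_bandwidth_account(rt_rq))
            rt_rq->rt_time += delta_exec;

    /* where the helper looks something like: */
    static bool sched_rt_bandwidth_account(struct rt_rq *rt_rq)
    {
            struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);

            return hrtimer_active(&rt_b->rt_period_timer) ||
                   rt_rq->rt_time < rt_b->rt_runtime;
    }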

    Reported-by: Kirill Tkhai
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/20140225151515.617714e2f2cd6c558531ba61@gmail.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     
  • Commit 82b9580 ("sched/deadline: Test for CPU's presence explicitly")
    changed how we check if a CPU returned by cpudeadline machinery is
    valid. But, we don't want to call cpu_present() if best_cpu is
    equal to -1. So, switch the order of tests inside WARN_ON().
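
    A sketch of the reordered test (illustrative):

    /* Evaluate the cheap -1 test first so cpu_present() is never
     * called with an invalid index. */
    WARN_ON(best_cpu != -1 && !cpu_present(best_cpu));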

    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Cc: boris.ostrovsky@oracle.com
    Cc: konrad.wilk@oracle.com
    Cc: rostedt@goodmis.org
    Link: http://lkml.kernel.org/r/1393238832-9100-1-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     
  • In the deadline class we do not have group scheduling.

    So, let's remove the unnecessary 'X = X;' self-assignments left over
    from the RT code.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra
    Cc: Juri Lelli
    Link: http://lkml.kernel.org/r/1393343543.4089.5.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • dequeue_entity() is called when p->on_rq and sets se->on_rq = 0,
    which appears to guarantee that the !se->on_rq condition is met.
    If the task has done set_current_state(TASK_INTERRUPTIBLE) without
    calling schedule(), the second condition will be met and vruntime will
    be incorrectly adjusted twice.

    In certain cases this can result in the task's vruntime never increasing
    past the vruntime of other tasks on the CFS' run queue, starving them of
    CPU time.

    This patch changes switched_from_fair() to use !p->on_rq instead of
    !se->on_rq.

    I'm able to cause a task with a priority of 120 to starve all other
    tasks with the same priority on an ARM platform running 3.2.51-rt72
    PREEMPT RT by writing one character at time to a serial tty (16550 UART)
    in a tight loop. I'm also able to verify making this change corrects the
    problem on that platform and kernel version.
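
    A sketch of the change (the guard around the vruntime normalization in
    switched_from_fair(); surrounding code abridged):

    /* was: if (!se->on_rq && p->state != TASK_RUNNING) */
    if (!p->on_rq && p->state != TASK_RUNNING) {
            /* keying off p->on_rq avoids adjusting vruntime a second time
             * for a task that set TASK_INTERRUPTIBLE but never scheduled */
            se->vruntime -= cfs_rq_of(se)->min_vruntime;
    }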

    Signed-off-by: George McCollister
    Signed-off-by: Peter Zijlstra
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/1392767811-28916-1-git-send-email-george.mccollister@gmail.com
    Signed-off-by: Ingo Molnar

    George McCollister
     
  • We hit one rare case below:

    T1 calls disable_irq() but hangs at synchronize_irq() forever;
    the corresponding irq thread is sleeping;
    and all CPUs are idle.

    After analysis, we found one possible scenario which causes T1 to
    wait there forever:
    CPU0                                     CPU1
    synchronize_irq()
      wait_event()
        spin_lock()
                                             atomic_dec_and_test(&threads_active)
        insert the __wait into queue
        spin_unlock()
                                             if (waitqueue_active())
        atomic_read(&threads_active)
                                             wake_up()

    Here, after the __wait entry has been inserted into the queue on CPU0,
    and before CPU1 tests whether the queue is empty, there is no barrier,
    so the update may not yet be visible to CPU1 even though CPU0 has
    already modified the queue list.
    The same applies to CPU0's atomic_read() of threads_active.

    So we would need an smp_mb() before waitqueue_active(), but removing
    the waitqueue_active() check solves it as well and makes things simple
    and clear.
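
    With the check removed, the wakeup side reduces to something like this
    (a sketch of the resulting helper in kernel/irq/manage.c):

    static void wake_threads_waitq(struct irq_desc *desc)
    {
            if (atomic_dec_and_test(&desc->threads_active))
                    wake_up(&desc->wait_for_threads);
    }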

    Signed-off-by: Chuansheng Liu
    Cc: Xiaoming Wang
    Link: http://lkml.kernel.org/r/1393212590-32543-1-git-send-email-chuansheng.liu@intel.com
    Cc: stable@vger.kernel.org
    Signed-off-by: Thomas Gleixner

    Chuansheng Liu
     

22 Feb, 2014

9 commits

  • In deadline class we do not have group scheduling like in RT.

    dl_nr_total is the same as dl_nr_running. So, one of them should
    be removed.

    Cc: Ingo Molnar
    Cc: Juri Lelli
    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/368631392675853@web20h.yandex.ru
    Signed-off-by: Thomas Gleixner

    Kirill Tkhai
     
  • A hot-removed CPU may have an ID that is numerically larger than the number of
    existing CPUs in the system (e.g. we can unplug CPU 4 from a system that
    has CPUs 0, 1 and 4).

    Thus the WARN_ONs should check whether the CPU in question is currently
    present, not whether its ID value is less than num_present_cpus().
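
    A sketch of the corrected test (illustrative):

    /* A hot-removed CPU can keep an id >= num_present_cpus(), so test
     * presence directly instead of comparing against the count. */
    WARN_ON(!cpu_present(cpu));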

    Cc: Ingo Molnar
    Cc: Juri Lelli
    Cc: Steven Rostedt
    Reported-by: Konrad Rzeszutek Wilk
    Signed-off-by: Boris Ostrovsky
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1392646353-1874-1-git-send-email-boris.ostrovsky@oracle.com
    Signed-off-by: Thomas Gleixner

    Boris Ostrovsky
     
  • Because of a recent syscall design debate, it's deemed appropriate for
    each syscall to have a flags argument for future extension, without
    immediately requiring new syscalls.
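
    Since this entry sits among the scheduler fixes, the syscalls in
    question are presumably the newly added sched_setattr()/sched_getattr();
    an illustrative sketch of the flags-extended userspace prototypes:

    /* The extra 'flags' argument is reserved for future extension and is
     * currently expected to be zero. */
    int sched_setattr(pid_t pid, struct sched_attr *attr, unsigned int flags);
    int sched_getattr(pid_t pid, struct sched_attr *attr, unsigned int size,
                      unsigned int flags);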

    Cc: juri.lelli@gmail.com
    Cc: Ingo Molnar
    Suggested-by: Michael Kerrisk
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140214161929.GL27965@twins.programming.kicks-ass.net
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • We're copying the on-stack structure to userspace, but forgot to give
    the right number of bytes to copy. This allows the calling process to
    obtain up to PAGE_SIZE bytes from the stack (and possibly adjacent
    kernel memory).

    This fix copies only as much as we actually have on the stack
    (attr->size defaults to the size of the struct) and leaves the rest of
    the userspace-provided buffer untouched.

    Found using kmemcheck + trinity.
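
    A sketch of the fix described (in the helper that copies struct
    sched_attr back to userspace; context abridged):

    /* Copy back only what was actually filled in on the kernel stack
     * (attr->size, which defaults to sizeof(*attr)), rather than the
     * possibly much larger, userspace-supplied length. */
    if (copy_to_user(uattr, attr, attr->size))
            return -EFAULT;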

    Fixes: d50dde5a10f30 ("sched: Add new scheduler syscalls to support an extended scheduling parameters ABI")
    Cc: Dario Faggioli
    Cc: Juri Lelli
    Cc: Ingo Molnar
    Signed-off-by: Vegard Nossum
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1392585857-10725-1-git-send-email-vegard.nossum@oracle.com
    Signed-off-by: Thomas Gleixner

    Vegard Nossum
     
  • Normally task_numa_work scans over a fairly small amount of memory,
    but it is possible to run into a large unpopulated part of virtual
    memory, with no pages mapped. In that case, task_numa_work can run
    for a while, and it may make sense to reschedule as required.
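
    A minimal sketch of the change described, assuming the per-VMA scan loop
    in task_numa_work() simply gains a cond_resched():

    /* inside the scan loop of task_numa_work(), context abridged */
    cond_resched();         /* yield if needed while walking large, unpopulated ranges */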

    Cc: akpm@linux-foundation.org
    Cc: Andrea Arcangeli
    Signed-off-by: Rik van Riel
    Reported-by: Xing Gang
    Tested-by: Chegu Vinod
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1392761566-24834-2-git-send-email-riel@redhat.com
    Signed-off-by: Thomas Gleixner

    Rik van Riel
     
  • Fix this lockdep warning:

    [ 44.804600] =========================================================
    [ 44.805746] [ INFO: possible irq lock inversion dependency detected ]
    [ 44.805746] 3.14.0-rc2-test+ #14 Not tainted
    [ 44.805746] ---------------------------------------------------------
    [ 44.805746] bash/3674 just changed the state of lock:
    [ 44.805746] (&dl_b->lock){+.....}, at: [] sched_rt_handler+0x132/0x248
    [ 44.805746] but this lock was taken by another, HARDIRQ-safe lock in the past:
    [ 44.805746] (&rq->lock){-.-.-.}

    and interrupts could create inverse lock ordering between them.

    [ 44.805746]
    [ 44.805746] other info that might help us debug this:
    [ 44.805746] Possible interrupt unsafe locking scenario:
    [ 44.805746]
    [ 44.805746]        CPU0                    CPU1
    [ 44.805746]        ----                    ----
    [ 44.805746]   lock(&dl_b->lock);
    [ 44.805746]                                local_irq_disable();
    [ 44.805746]                                lock(&rq->lock);
    [ 44.805746]                                lock(&dl_b->lock);
    [ 44.805746]   <Interrupt>
    [ 44.805746]     lock(&rq->lock);

    by making dl_b->lock acquisition always IRQ safe.
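
    A sketch of the pattern this implies wherever dl_b->lock is taken
    (illustrative, not an exact diff):

    unsigned long flags;

    raw_spin_lock_irqsave(&dl_b->lock, flags);
    /* ... update the deadline bandwidth accounting ... */
    raw_spin_unlock_irqrestore(&dl_b->lock, flags);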

    Cc: Ingo Molnar
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1392107067-19907-3-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Thomas Gleixner

    Juri Lelli
     
  • Don't compare sysctl_sched_rt_runtime against sysctl_sched_rt_period if
    the former is equal to RUNTIME_INF, otherwise disabling -rt bandwidth
    management (with CONFIG_RT_GROUP_SCHED=n) fails.
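
    A sketch of the guard described (in the sysctl validation path; exact
    context assumed):

    if (sysctl_sched_rt_runtime != RUNTIME_INF &&
        sysctl_sched_rt_runtime > sysctl_sched_rt_period)
            return -EINVAL;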

    Cc: Ingo Molnar
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1392107067-19907-2-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Thomas Gleixner

    Juri Lelli
     
  • While debugging the crash with the bad nr_running accounting, I hit
    another bug where, after running my sched deadline test, I was getting
    failures to take a CPU offline. It was giving me a -EBUSY error.

    Adding a bunch of trace_printk()s around, I found that the cpu
    notifier that called sched_cpu_inactive() was returning a failure. The
    overflow value was coming up negative?

    Talking this over with Juri, the problem is that the total_bw update was
    supposed to be made by dl_overflow() which, during my tests, seemed to
    not be called. Adding more trace_printk()s, it wasn't that it wasn't
    called, but it exited out right away with the check of new_bw being
    equal to p->dl.dl_bw. The new_bw calculates the ratio between period and
    runtime. The bug is that if you set a deadline, you do not need to set
    a period if you plan on the period being equal to the deadline. That
    is, if period is zero and deadline is not, then the system call should
    set the period to be equal to the deadline. This is done elsewhere in
    the code.

    The fix is easy: check if the period is set, and if it is not, then use
    the deadline.
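
    A sketch of that fallback (where the bandwidth ratio is computed from
    the attr; exact location assumed):

    /* If no period was given, fall back to the deadline. */
    u64 period = attr->sched_period ?: attr->sched_deadline;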

    Cc: Juri Lelli
    Cc: Ingo Molnar
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Signed-off-by: Steven Rostedt
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140219135335.7e74abd4@gandalf.local.home
    Signed-off-by: Thomas Gleixner

    Steven Rostedt
     
  • Rostedt writes:

    My test suite was locking up hard when enabling mmiotracer. This was due
    to the mmiotracer placing all but one CPU offline. I found this out
    when I was able to reproduce the bug with just my stress-cpu-hotplug
    test. This bug baffled me because it would not always trigger, and
    would only trigger on the first run after boot up. The
    stress-cpu-hotplug test would crash hard the first run, or never crash
    at all. But a new reboot may cause it to crash on the first run again.

    I spent all week bisecting this, as I couldn't find a consistent
    reproducer. I finally narrowed it down to the sched deadline patches,
    and even more peculiar, to the commit that added the sched
    deadline boot up self test to the latency tracer. Then it dawned on me
    to what the bug was.

    All it took was to run a task under sched deadline to screw up the CPU
    hot plugging. This explained why it would lock up only on the first run
    of the stress-cpu-hotplug test. The bug happened when the boot up self
    test of the schedule latency tracer would test a deadline task. The
    deadline task would corrupt something that would cause CPU hotplug to
    fail. If it didn't corrupt it, the stress test would always work
    (there's no other sched deadline tasks that would run to cause
    problems). If it did corrupt on boot up, the first test would lockup
    hard.

    I proved this theory by running my deadline test program on another box,
    and then run the stress-cpu-hotplug test, and it would now consistently
    lock up. I could run stress-cpu-hotplug over and over with no problem,
    but once I ran the deadline test, the next run of the
    stress-cpu-hotplug would lock hard.

    After adding lots of tracing to the code, I found the cause. The
    function tracer showed that migrate_tasks() was stuck in an infinite
    loop, where rq->nr_running never equaled 1 to break out of it. When I
    added a trace_printk() to see what that number was, it was 335 and
    never decrementing!

    Looking at the deadline code I found:

    static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
    {
            dequeue_dl_entity(&p->dl);
            dequeue_pushable_dl_task(rq, p);
    }

    static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
    {
            update_curr_dl(rq);
            __dequeue_task_dl(rq, p, flags);

            dec_nr_running(rq);
    }

    And this:

    if (dl_runtime_exceeded(rq, dl_se)) {
            __dequeue_task_dl(rq, curr, 0);
            if (likely(start_dl_timer(dl_se, curr->dl.dl_boosted)))
                    dl_se->dl_throttled = 1;
            else
                    enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);

            if (!is_leftmost(curr, &rq->dl))
                    resched_task(curr);
    }

    Notice how we call __dequeue_task_dl() and in the else case we
    call enqueue_task_dl()? Also notice that the dequeue call has
    underscores (__dequeue_task_dl()) where the enqueue call does not.
    enqueue_task_dl() calls inc_nr_running(rq), but __dequeue_task_dl()
    does not. This is where we get nr_running out of sync.

    [snip]

    Another point where nr_running can get out of sync is when the dl_timer
    fires:

    dl_se->dl_throttled = 0;
    if (p->on_rq) {
            enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
            if (task_has_dl_policy(rq->curr))
                    check_preempt_curr_dl(rq, p, 0);
            else
                    resched_task(rq->curr);

    This patch does two things:

    - correctly accounts for throttled tasks (that are now considered
    !running);

    - fixes the bug, updating nr_running from {inc,dec}_dl_tasks(),
    since we risk updating it twice in some situations (e.g., a
    task is dequeued while it has exceeded its budget); see the sketch
    below.
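
    A rough sketch of that second point, moving the nr_running accounting
    into the dl task counters (context abridged and assumed):

    static inline void inc_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
    {
            dl_rq->dl_nr_running++;
            inc_nr_running(rq_of_dl_rq(dl_rq));     /* nr_running now follows throttling */
            /* ... migration bookkeeping elided ... */
    }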

    Cc: mingo@redhat.com
    Cc: torvalds@linux-foundation.org
    Cc: akpm@linux-foundation.org
    Reported-by: Steven Rostedt
    Reviewed-by: Steven Rostedt
    Tested-by: Steven Rostedt
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1392884379-13744-1-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Thomas Gleixner

    Juri Lelli
     

21 Feb, 2014

3 commits

  • Pull cgroup fixes from Tejun Heo:
    "Quite a few fixes this time.

    Three locking fixes, all marked for -stable. A couple error path
    fixes and some misc fixes. Hugh found a bug in memcg offlining
    sequence and we thought we could fix that from cgroup core side but
    that turned out to be insufficient and got reverted. A different fix
    has been applied to -mm"

    * 'for-3.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: update cgroup_enable_task_cg_lists() to grab siglock
    Revert "cgroup: use an ordered workqueue for cgroup destruction"
    cgroup: protect modifications to cgroup_idr with cgroup_mutex
    cgroup: fix locking in cgroup_cfts_commit()
    cgroup: fix error return from cgroup_create()
    cgroup: fix error return value in cgroup_mount()
    cgroup: use an ordered workqueue for cgroup destruction
    nfs: include xattr.h from fs/nfs/nfs3proc.c
    cpuset: update MAINTAINERS entry
    arm, pm, vmpressure: add missing slab.h includes

    Linus Torvalds
     
  • Pull workqueue fixes from Tejun Heo:
    "Two workqueue fixes. One for an unlikely but possible critical bug
    during kworker shutdown and the other to make lockdep names a bit more
    descriptive"

    * 'for-3.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: ensure @task is valid across kthread_stop()
    workqueue: add args to workqueue lockdep name

    Linus Torvalds
     
  • Signed-off-by: Brian Campbell
    Signed-off-by: Linus Torvalds

    Brian Campbell
     

20 Feb, 2014

1 commit

  • The generic sched_clock registration function was previously
    done lockless, due to the fact that it was expected to be called
    only once. However, now there are systems that may register
    multiple sched_clock sources, for which the lack of locking has
    caused problems:

    If two sched_clock sources are registered we may end up in a
    situation where a call to sched_clock() may be accessing the
    epoch cycle count for the old counter and the cycle count for the
    new counter. This can lead to confusing results where
    sched_clock() values jump and then are reset to 0 (due to the way
    the registration function forces the epoch_ns to be 0).

    Fix this by reorganizing the registration function to hold the
    seqlock for as short a time as possible while we update the
    clock_data structure for a new counter. We also put any
    accumulated time into epoch_ns instead of resetting the time to
    0 so that the clock doesn't reset after each successful
    registration.

    [jstultz: Added extra context to the commit message]

    Reported-by: Will Deacon
    Signed-off-by: Stephen Boyd
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Will Deacon
    Cc: Peter Zijlstra
    Cc: Josh Cartwright
    Link: http://lkml.kernel.org/r/1392662736-7803-2-git-send-email-john.stultz@linaro.org
    Signed-off-by: John Stultz
    Signed-off-by: Thomas Gleixner

    Stephen Boyd
     

19 Feb, 2014

2 commits

  • Currently, there's nothing preventing cgroup_enable_task_cg_lists()
    from missing a freshly set PF_EXITING and racing against cgroup_exit().
    Depending on the timing, cgroup_exit() may finish with the task still
    linked on
    css_set leading to list corruption. Fix it by grabbing siglock in
    cgroup_enable_task_cg_lists() so that PF_EXITING is guaranteed to be
    visible.

    This whole on-demand cg_list optimization is extremely fragile and has
    ample possibility to lead to bugs which can cause things like
    once-a-year oops during boot. I'm wondering whether the better
    approach would be just adding "cgroup_disable=all" handling which
    disables the whole cgroup rather than tempting fate with this
    on-demand craziness.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Cc: stable@vger.kernel.org

    Tejun Heo
     
  • When a kworker should die, the kworker is notified through the WORKER_DIE
    flag instead of kthread_should_stop(). This, IIRC, is primarily to
    keep the test synchronized inside the worker_pool lock. WORKER_DIE is
    first set while holding pool->lock, the lock is dropped and
    kthread_stop() is called.

    Unfortunately, this means that there's a slight chance that the target
    kworker may see WORKER_DIE before kthread_stop() finishes and exits
    and frees the target task before or during kthread_stop().

    Fix it by pinning the target task before setting WORKER_DIE and
    putting it after kthread_stop() is done.

    tj: Improved patch description and comment. Moved pinning above
    WORKER_DIE to better signify what it's protecting.
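
    A sketch of the destroy-side sequence described (context and locking
    abridged; illustrative only):

    struct task_struct *task = worker->task;

    get_task_struct(task);          /* pin before WORKER_DIE becomes visible */
    /* set WORKER_DIE under pool->lock, drop the lock, then: */
    kthread_stop(task);
    put_task_struct(task);          /* safe even if the worker has already exited */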

    CC: stable@vger.kernel.org
    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
     

18 Feb, 2014

2 commits

  • My rework of handling of notification events (namely commit 7053aee26a35
    "fsnotify: do not share events between notification groups") broke
    sending of cookies with inotify events. We didn't propagate the value
    passed to fsnotify() properly and passed 4 uninitialized bytes to
    userspace instead (so it is also an information leak). Sadly I didn't
    notice this during my testing because inotify cookies aren't used very
    much and LTP inotify tests ignore them.

    Fix the problem by passing the cookie value properly.

    Fixes: 7053aee26a3548ebaba046ae2e52396ccf56ac6c
    Reported-by: Vegard Nossum
    Signed-off-by: Jan Kara

    Jan Kara
     
  • This is not a buffer overflow in the traditional sense: we don't
    overflow any *kernel* buffers, but we do mis-count the amount of data we
    copy back to user space for the SYSLOG_ACTION_READ_ALL case.

    In particular, if the user buffer is too small to hold everything, and
    *if* there is a continuation line at just the right place, we can end up
    giving the user more data than he asked for.

    The reason is that we first count up the number of bytes all the log
    records contain, then we walk the records again until we've skipped the
    records at the beginning that won't fit, and then we walk the rest of
    the records and copy them to the user space buffer.

    And in between that "skip the initial records that won't fit" and the
    "copy the records that *will* fit to user space", we reset the 'prev'
    variable that contained the record information for the last record not
    copied. That meant that when we started copying to user space, we now
    had a different character count than what we had originally calculated
    in the first record walk-through.

    The fix is to simply not clear the 'prev' flags value (in both places
    where we had the same logic: syslog_print_all and kmsg_dump_get_buffer;
    the latter is used for pstore-like dumping).

    Reported-and-tested-by: Debabrata Banerjee
    Acked-by: Kay Sievers
    Cc: Greg Kroah-Hartman
    Cc: Jeff Mahoney
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

16 Feb, 2014

1 commit

  • …el.org/pub/scm/linux/kernel/git/tip/tip

    Pull irq update from Thomas Gleixner:
    "Fix from the urgent branch: a trivial oneliner adding the missing
    Kconfig dependency curing build failures which have been discovered by
    several build robots.

    The update in the irq-core branch provides a new function in the
    irq/devres code, which is a prerequisite for driver developers to get
    rid of boilerplate code all over the place.

    Not a bugfix, but it has zero impact on the current kernel due to the
    lack of users. It's simpler to provide the infrastructure to
    interested parties via your tree than fulfilling the wishlist of
    driver maintainers on which particular commit or tag this should be
    based on"

    * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    genirq: Add missing irq_to_desc export for CONFIG_SPARSE_IRQ=n

    * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    genirq: Add devm_request_any_context_irq()

    Linus Torvalds