31 May, 2011

1 commit

  • Commit cc3ce5176d83 (rcu: Start RCU kthreads in TASK_INTERRUPTIBLE
    state) fudges a sleeping task's state, resulting in the scheduler seeing
    a TASK_UNINTERRUPTIBLE task going to sleep but a TASK_INTERRUPTIBLE
    task waking up. The result is an unbalanced load calculation.

    The problem that patch tried to address is that the RCU kthreads could
    stay in UNINTERRUPTIBLE state for quite a while, triggering the hung
    task detector, because they are only woken on demand.

    Cure the problem differently by always giving the tasks at least one
    wake-up once the CPU is fully up and running; this will kick them out of
    the initial UNINTERRUPTIBLE state and into the regular INTERRUPTIBLE
    wait state.

    [ The alternative would be teaching kthread_create() to start threads as
    INTERRUPTIBLE but that needs a tad more thought. ]
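
    A minimal sketch of the cure (the helper name here is hypothetical; the
    per-CPU kthread pointer follows the RCU naming of that era, and the
    actual patch hunks may differ):

        /* Once the CPU is fully online, poke each RCU kthread once so it
         * leaves the initial TASK_UNINTERRUPTIBLE state and settles into
         * its regular TASK_INTERRUPTIBLE wait loop. */
        static void rcu_kick_kthread_once_online(int cpu)
        {
                struct task_struct *t = per_cpu(rcu_cpu_kthread_task, cpu);

                if (t)
                        wake_up_process(t);     /* one wake-up suffices */
        }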

    Reported-by: Damien Wyart
    Signed-off-by: Peter Zijlstra
    Acked-by: Paul E. McKenney
    Link: http://lkml.kernel.org/r/1306755291.1200.2872.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

30 May, 2011

2 commits

  • Thomas Gleixner reports that we now have a boot crash triggered by
    CONFIG_CPUMASK_OFFSTACK=y:

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] find_next_bit+0x55/0xb0
    Call Trace:
    [] cpumask_any_but+0x2a/0x70
    [] flush_tlb_mm+0x2b/0x80
    [] pud_populate+0x35/0x50
    [] pgd_alloc+0x9a/0xf0
    [] mm_init+0xec/0x120
    [] mm_alloc+0x53/0xd0

    which was introduced by commit de03c72cfce5 ("mm: convert
    mm->cpu_vm_cpumask into cpumask_var_t"), and is due to wrong ordering of
    mm_init() vs mm_init_cpumask().

    Thomas wrote a patch to just fix the ordering of initialization, but I
    hate the new double allocation in the fork path, so I ended up instead
    doing some more radical surgery to clean it all up.

    Reported-by: Thomas Gleixner
    Reported-by: Ingo Molnar
    Cc: KOSAKI Motohiro
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • * 'idle-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6:
    x86 idle: deprecate mwait_idle() and "idle=mwait" cmdline param
    x86 idle: deprecate "no-hlt" cmdline param
    x86 idle APM: deprecate CONFIG_APM_CPU_IDLE
    x86 idle floppy: deprecate disable_hlt()
    x86 idle: EXPORT_SYMBOL(default_idle, pm_idle) only when APM demands it
    x86 idle: clarify AMD erratum 400 workaround
    idle governor: Avoid lock acquisition to read pm_qos before entering idle
    cpuidle: menu: fixed wrapping timers at 4.294 seconds

    Linus Torvalds
     

29 May, 2011

4 commits

  • Thanks for the reviews and comments by Rafael, James, Mark and Andi.
    Here's version 2 of the patch incorporating your comments and also some
    updates to my previous patch description.

    I noticed that before entering idle state, the menu idle governor will
    look up the current pm_qos target value according to the list of qos
    requests received. This lookup currently requires acquiring a lock to
    access the list of qos requests and find the qos target value, slowing
    down entry into idle state due to contention by multiple cpus accessing
    this list. The contention is severe when many cpus are waking and going
    into idle. For example, for a simple workload that has 32 pairs of
    processes ping-ponging messages to each other, with 64 cpu cores active
    in the test system, I see the following profile, with 37.82% of cpu
    cycles spent on contention for pm_qos_lock:

    - 37.82% swapper [kernel.kallsyms] [k] _raw_spin_lock_irqsave
       - _raw_spin_lock_irqsave
          - 95.65% pm_qos_request
               menu_select
               cpuidle_idle_call
             - cpu_idle
                  99.98% start_secondary

    A better approach is to cache the updated pm_qos target value so that
    reading it does not require lock acquisition, as in the patch below.
    With this patch the contention for pm_qos_lock is removed and I saw a
    2.2X increase in throughput for my message passing workload.
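
    The idea, in sketch form (names here are illustrative, not the exact
    patch): writers still update the value under pm_qos_lock, but the idle
    hot path reads a cached copy with no locking at all.

        static atomic_t pm_qos_cached_target = ATOMIC_INIT(PM_QOS_DEFAULT_VALUE);

        /* Update side: called with pm_qos_lock held, after the request
         * list changes. */
        static void pm_qos_set_cached_target(s32 value)
        {
                atomic_set(&pm_qos_cached_target, value);
        }

        /* Idle-entry hot path: a plain atomic read, no lock contention. */
        s32 pm_qos_read_cached_target(void)
        {
                return atomic_read(&pm_qos_cached_target);
        }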

    cc: stable@kernel.org
    Signed-off-by: Tim Chen
    Acked-by: Andi Kleen
    Acked-by: James Bottomley
    Acked-by: mark gross
    Signed-off-by: Len Brown

    Tim Chen
     
  • …el/git/tip/linux-2.6-tip

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    cpuset: Fix cpuset_cpus_allowed_fallback(), don't update tsk->rt.nr_cpus_allowed
    sched: Fix ->min_vruntime calculation in dequeue_entity()
    sched: Fix ttwu() for __ARCH_WANT_INTERRUPTS_ON_CTXSW
    sched: More sched_domain iterations fixes

    Linus Torvalds
     
  • …l/git/tip/linux-2.6-tip

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    rcu: Start RCU kthreads in TASK_INTERRUPTIBLE state
    rcu: Remove waitqueue usage for cpu, node, and boost kthreads
    rcu: Avoid acquiring rcu_node locks in timer functions
    atomic: Add atomic_or()
    Documentation: Add statistics about nested locks
    rcu: Decrease memory-barrier usage based on semi-formal proof
    rcu: Make rcu_enter_nohz() pay attention to nesting
    rcu: Don't do reschedule unless in irq
    rcu: Remove old memory barriers from rcu_process_callbacks()
    rcu: Add memory barriers
    rcu: Fix unpaired rcu_irq_enter() from locking selftests

    Linus Torvalds
     
  • …l/git/tip/linux-2.6-tip

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (25 commits)
    perf: Fix SIGIO handling
    perf top: Don't stop if no kernel symtab is found
    perf top: Handle kptr_restrict
    perf top: Remove unused macro
    perf events: initialize fd array to -1 instead of 0
    perf tools: Make sure kptr_restrict warnings fit 80 col terms
    perf tools: Fix build on older systems
    perf symbols: Handle /proc/sys/kernel/kptr_restrict
    perf: Remove duplicate headers
    ftrace: Add internal recursive checks
    tracing: Update btrfs's tracepoints to use u64 interface
    tracing: Add __print_symbolic_u64 to avoid warnings on 32bit machine
    ftrace: Set ops->flag to enabled even on static function tracing
    tracing: Have event with function tracer check error return
    ftrace: Have ftrace_startup() return failure code
    jump_label: Check entries limit in __jump_label_update
    ftrace/recordmcount: Avoid STT_FUNC symbols as base on ARM
    scripts/tags.sh: Add magic for trace-events for etags too
    scripts/tags.sh: Fix ctags for DEFINE_EVENT()
    x86/ftrace: Fix compiler warning in ftrace.c
    ...

    Linus Torvalds
     

28 May, 2011

11 commits

  • Upon creation, kthreads are in TASK_UNINTERRUPTIBLE state, which can
    result in softlockup warnings. Because some of RCU's kthreads can
    legitimately be idle indefinitely, start them in TASK_INTERRUPTIBLE
    state in order to avoid those warnings.

    Suggested-by: Peter Zijlstra
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Tested-by: Yinghai Lu
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
     
  • It is not necessary to use waitqueues for the RCU kthreads because
    we always know exactly which thread is to be awakened. In addition,
    wake_up() only issues an actual wakeup when there is a thread waiting on
    the queue, which was why there was an extra explicit wake_up_process()
    to get the RCU kthreads started.

    Eliminating the waitqueues (and wake_up()) in favor of wake_up_process()
    eliminates the need for the initial wake_up_process() and also shrinks
    the data structure size a bit. The wakeup logic is placed in a new
    rcu_wait() macro.
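
    The macro is the classic open-coded wait loop; a sketch of its likely
    shape (hedged, not necessarily the verbatim patch):

        /* Wait until cond is true; the waker knows exactly which task to
         * wake, so wake_up_process() replaces the waitqueue machinery. */
        #define rcu_wait(cond)                                          \
        do {                                                            \
                for (;;) {                                              \
                        set_current_state(TASK_INTERRUPTIBLE);          \
                        if (cond)                                       \
                                break;                                  \
                        schedule();                                     \
                }                                                       \
                __set_current_state(TASK_RUNNING);                      \
        } while (0)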

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • This commit switches manipulations of the rcu_node ->wakemask field
    to atomic operations, which allows rcu_cpu_kthread_timer() to avoid
    acquiring the rcu_node lock. This should avoid the following lockdep
    splat reported by Valdis Kletnieks:

    [ 12.872150] usb 1-4: new high speed USB device number 3 using ehci_hcd
    [ 12.986667] usb 1-4: New USB device found, idVendor=413c, idProduct=2513
    [ 12.986679] usb 1-4: New USB device strings: Mfr=0, Product=0, SerialNumber=0
    [ 12.987691] hub 1-4:1.0: USB hub found
    [ 12.987877] hub 1-4:1.0: 3 ports detected
    [ 12.996372] input: PS/2 Generic Mouse as /devices/platform/i8042/serio1/input/input10
    [ 13.071471] udevadm used greatest stack depth: 3984 bytes left
    [ 13.172129]
    [ 13.172130] =======================================================
    [ 13.172425] [ INFO: possible circular locking dependency detected ]
    [ 13.172650] 2.6.39-rc6-mmotm0506 #1
    [ 13.172773] -------------------------------------------------------
    [ 13.172997] blkid/267 is trying to acquire lock:
    [ 13.173009] (&p->pi_lock){-.-.-.}, at: [] try_to_wake_up+0x29/0x1aa
    [ 13.173009]
    [ 13.173009] but task is already holding lock:
    [ 13.173009] (rcu_node_level_0){..-...}, at: [] rcu_cpu_kthread_timer+0x27/0x58
    [ 13.173009]
    [ 13.173009] which lock already depends on the new lock.
    [ 13.173009]
    [ 13.173009]
    [ 13.173009] the existing dependency chain (in reverse order) is:
    [ 13.173009]
    [ 13.173009] -> #2 (rcu_node_level_0){..-...}:
    [ 13.173009] [] check_prevs_add+0x8b/0x104
    [ 13.173009] [] validate_chain+0x36f/0x3ab
    [ 13.173009] [] __lock_acquire+0x369/0x3e2
    [ 13.173009] [] lock_acquire+0xfc/0x14c
    [ 13.173009] [] _raw_spin_lock+0x36/0x45
    [ 13.173009] [] rcu_read_unlock_special+0x8c/0x1d5
    [ 13.173009] [] __rcu_read_unlock+0x4f/0xd7
    [ 13.173009] [] rcu_read_unlock+0x21/0x23
    [ 13.173009] [] cpuacct_charge+0x6c/0x75
    [ 13.173009] [] update_curr+0x101/0x12e
    [ 13.173009] [] check_preempt_wakeup+0xf7/0x23b
    [ 13.173009] [] check_preempt_curr+0x2b/0x68
    [ 13.173009] [] ttwu_do_wakeup+0x76/0x128
    [ 13.173009] [] ttwu_do_activate.constprop.63+0x57/0x5c
    [ 13.173009] [] scheduler_ipi+0x48/0x5d
    [ 13.173009] [] smp_reschedule_interrupt+0x16/0x18
    [ 13.173009] [] reschedule_interrupt+0x13/0x20
    [ 13.173009] [] rcu_read_unlock+0x21/0x23
    [ 13.173009] [] find_get_page+0xa9/0xb9
    [ 13.173009] [] filemap_fault+0x6a/0x34d
    [ 13.173009] [] __do_fault+0x54/0x3e6
    [ 13.173009] [] handle_pte_fault+0x12c/0x1ed
    [ 13.173009] [] handle_mm_fault+0x1cd/0x1e0
    [ 13.173009] [] do_page_fault+0x42d/0x5de
    [ 13.173009] [] page_fault+0x1f/0x30
    [ 13.173009]
    [ 13.173009] -> #1 (&rq->lock){-.-.-.}:
    [ 13.173009] [] check_prevs_add+0x8b/0x104
    [ 13.173009] [] validate_chain+0x36f/0x3ab
    [ 13.173009] [] __lock_acquire+0x369/0x3e2
    [ 13.173009] [] lock_acquire+0xfc/0x14c
    [ 13.173009] [] _raw_spin_lock+0x36/0x45
    [ 13.173009] [] __task_rq_lock+0x8b/0xd3
    [ 13.173009] [] wake_up_new_task+0x41/0x108
    [ 13.173009] [] do_fork+0x265/0x33f
    [ 13.173009] [] kernel_thread+0x6b/0x6d
    [ 13.173009] [] rest_init+0x21/0xd2
    [ 13.173009] [] start_kernel+0x3bb/0x3c6
    [ 13.173009] [] x86_64_start_reservations+0xaf/0xb3
    [ 13.173009] [] x86_64_start_kernel+0xf0/0xf7
    [ 13.173009]
    [ 13.173009] -> #0 (&p->pi_lock){-.-.-.}:
    [ 13.173009] [] check_prev_add+0x68/0x20e
    [ 13.173009] [] check_prevs_add+0x8b/0x104
    [ 13.173009] [] validate_chain+0x36f/0x3ab
    [ 13.173009] [] __lock_acquire+0x369/0x3e2
    [ 13.173009] [] lock_acquire+0xfc/0x14c
    [ 13.173009] [] _raw_spin_lock_irqsave+0x44/0x57
    [ 13.173009] [] try_to_wake_up+0x29/0x1aa
    [ 13.173009] [] wake_up_process+0x10/0x12
    [ 13.173009] [] rcu_cpu_kthread_timer+0x44/0x58
    [ 13.173009] [] call_timer_fn+0xac/0x1e9
    [ 13.173009] [] run_timer_softirq+0x1aa/0x1f2
    [ 13.173009] [] __do_softirq+0x109/0x26a
    [ 13.173009] [] call_softirq+0x1c/0x30
    [ 13.173009] [] do_softirq+0x44/0xf1
    [ 13.173009] [] irq_exit+0x58/0xc8
    [ 13.173009] [] smp_apic_timer_interrupt+0x79/0x87
    [ 13.173009] [] apic_timer_interrupt+0x13/0x20
    [ 13.173009] [] get_page_from_freelist+0x2aa/0x310
    [ 13.173009] [] __alloc_pages_nodemask+0x178/0x243
    [ 13.173009] [] pte_alloc_one+0x1e/0x3a
    [ 13.173009] [] __pte_alloc+0x22/0x14b
    [ 13.173009] [] handle_mm_fault+0x17e/0x1e0
    [ 13.173009] [] do_page_fault+0x42d/0x5de
    [ 13.173009] [] page_fault+0x1f/0x30
    [ 13.173009]
    [ 13.173009] other info that might help us debug this:
    [ 13.173009]
    [ 13.173009] Chain exists of:
    [ 13.173009] &p->pi_lock --> &rq->lock --> rcu_node_level_0
    [ 13.173009]
    [ 13.173009] Possible unsafe locking scenario:
    [ 13.173009]
    [ 13.173009]        CPU0                    CPU1
    [ 13.173009]        ----                    ----
    [ 13.173009]   lock(rcu_node_level_0);
    [ 13.173009]                               lock(&rq->lock);
    [ 13.173009]                               lock(rcu_node_level_0);
    [ 13.173009]   lock(&p->pi_lock);
    [ 13.173009]
    [ 13.173009] *** DEADLOCK ***
    [ 13.173009]
    [ 13.173009] 3 locks held by blkid/267:
    [ 13.173009] #0: (&mm->mmap_sem){++++++}, at: [] do_page_fault+0x1f3/0x5de
    [ 13.173009] #1: (&yield_timer){+.-...}, at: [] call_timer_fn+0x0/0x1e9
    [ 13.173009] #2: (rcu_node_level_0){..-...}, at: [] rcu_cpu_kthread_timer+0x27/0x58
    [ 13.173009]
    [ 13.173009] stack backtrace:
    [ 13.173009] Pid: 267, comm: blkid Not tainted 2.6.39-rc6-mmotm0506 #1
    [ 13.173009] Call Trace:
    [ 13.173009] [] print_circular_bug+0xc8/0xd9
    [ 13.173009] [] check_prev_add+0x68/0x20e
    [ 13.173009] [] ? save_stack_trace+0x28/0x46
    [ 13.173009] [] check_prevs_add+0x8b/0x104
    [ 13.173009] [] validate_chain+0x36f/0x3ab
    [ 13.173009] [] __lock_acquire+0x369/0x3e2
    [ 13.173009] [] ? try_to_wake_up+0x29/0x1aa
    [ 13.173009] [] lock_acquire+0xfc/0x14c
    [ 13.173009] [] ? try_to_wake_up+0x29/0x1aa
    [ 13.173009] [] ? rcu_check_quiescent_state+0x82/0x82
    [ 13.173009] [] _raw_spin_lock_irqsave+0x44/0x57
    [ 13.173009] [] ? try_to_wake_up+0x29/0x1aa
    [ 13.173009] [] try_to_wake_up+0x29/0x1aa
    [ 13.173009] [] ? rcu_check_quiescent_state+0x82/0x82
    [ 13.173009] [] wake_up_process+0x10/0x12
    [ 13.173009] [] rcu_cpu_kthread_timer+0x44/0x58
    [ 13.173009] [] ? rcu_check_quiescent_state+0x82/0x82
    [ 13.173009] [] call_timer_fn+0xac/0x1e9
    [ 13.173009] [] ? del_timer+0x75/0x75
    [ 13.173009] [] ? rcu_check_quiescent_state+0x82/0x82
    [ 13.173009] [] run_timer_softirq+0x1aa/0x1f2
    [ 13.173009] [] __do_softirq+0x109/0x26a
    [ 13.173009] [] ? tick_dev_program_event+0x37/0xf6
    [ 13.173009] [] ? time_hardirqs_off+0x1b/0x2f
    [ 13.173009] [] call_softirq+0x1c/0x30
    [ 13.173009] [] do_softirq+0x44/0xf1
    [ 13.173009] [] irq_exit+0x58/0xc8
    [ 13.173009] [] smp_apic_timer_interrupt+0x79/0x87
    [ 13.173009] [] apic_timer_interrupt+0x13/0x20
    [ 13.173009] [] ? get_page_from_freelist+0x114/0x310
    [ 13.173009] [] ? get_page_from_freelist+0x2aa/0x310
    [ 13.173009] [] ? clear_page_c+0x7/0x10
    [ 13.173009] [] ? prep_new_page+0x14c/0x1cd
    [ 13.173009] [] get_page_from_freelist+0x2aa/0x310
    [ 13.173009] [] __alloc_pages_nodemask+0x178/0x243
    [ 13.173009] [] ? __pmd_alloc+0x87/0x99
    [ 13.173009] [] pte_alloc_one+0x1e/0x3a
    [ 13.173009] [] ? __pmd_alloc+0x87/0x99
    [ 13.173009] [] __pte_alloc+0x22/0x14b
    [ 13.173009] [] handle_mm_fault+0x17e/0x1e0
    [ 13.173009] [] do_page_fault+0x42d/0x5de
    [ 13.173009] [] ? sys_brk+0x32/0x10c
    [ 13.173009] [] ? time_hardirqs_off+0x1b/0x2f
    [ 13.173009] [] ? trace_hardirqs_off_caller+0x3f/0x9c
    [ 13.173009] [] ? trace_hardirqs_off_thunk+0x3a/0x3c
    [ 13.173009] [] page_fault+0x1f/0x30
    [ 14.010075] usb 5-1: new full speed USB device number 2 using uhci_hcd
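
    In sketch form, the change replaces the locked update of ->wakemask
    with an atomic read-modify-write (field and offset names here are
    illustrative, not the exact patch):

        /* Timer function: set this CPU's bit without taking rnp->lock. */
        atomic_or(1 << (cpu - rnp->grplo), &rnp->wakemask);
        /* ...then wake_up_process() the kthread, still without rnp->lock. */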

    Reported-by: Valdis Kletnieks
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
     
  • …ck/linux-2.6-rcu into core/urgent

    Ingo Molnar
     
  • Vince noticed that unless we mmap() a buffer, SIGIO gets lost. So
    explicitly push the wakeup (including signals) when requested.

    Reported-by: Vince Weaver
    Signed-off-by: Peter Zijlstra
    Cc:
    Link: http://lkml.kernel.org/n/tip-2euus3f3x3dyvdk52cjxw8zu@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The rule is, we have to update tsk->rt.nr_cpus_allowed whenever we
    change tsk->cpus_allowed. Otherwise the RT scheduler may get confused.
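
    In sketch form, the invariant reads (field names as in the scheduler;
    this is not the exact patch hunk):

        /* Whenever the affinity mask changes, the cached count must be
         * updated in the same step. */
        cpumask_copy(&p->cpus_allowed, new_mask);
        p->rt.nr_cpus_allowed = cpumask_weight(new_mask);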

    Signed-off-by: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/4DD4B3FA.5060901@jp.fujitsu.com
    Signed-off-by: Ingo Molnar

    KOSAKI Motohiro
     
  • Dima Zavin reported:

    "After pulling the thread off the run-queue during a cgroup change,
    the cfs_rq.min_vruntime gets recalculated. The dequeued thread's vruntime
    then gets normalized to this new value. This can then lead to the thread
    getting an unfair boost in the new group if the vruntime of the next
    task in the old run-queue was way further ahead."

    Reported-by: Dima Zavin
    Signed-off-by: John Stultz
    Recalls-having-tested-once-upon-a-time-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1305674470-23727-1-git-send-email-john.stultz@linaro.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Marc reported that e4a52bcb9 (sched: Remove rq->lock from the first
    half of ttwu()) broke his ARM-SMP machine. Now ARM is one of the few
    __ARCH_WANT_INTERRUPTS_ON_CTXSW users, so that exception in the ttwu()
    code was suspect.

    Yong found that the interrupt could hit after context_switch() changes
    current but before it clears p->on_cpu; if that interrupt were to
    attempt a wake-up of p, we would indeed find ourselves spinning in IRQ
    context.

    Fix this by reverting to the old behaviour for this situation and
    performing a full remote wake-up.

    Cc: Frank Rowand
    Cc: Yong Zhang
    Cc: Oleg Nesterov
    Reported-by: Marc Zyngier
    Tested-by: Marc Zyngier
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • sched_domain iterations need to be protected by rcu_read_lock() now.
    This patch adds another two places which need the rcu lock, spotted via
    the following suspicious rcu_dereference_check() usage warnings.

    kernel/sched_rt.c:1244 invoked rcu_dereference_check() without protection!
    kernel/sched_stats.h:41 invoked rcu_dereference_check() without protection!
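
    The fix applies the usual pattern (a sketch, not the exact hunks from
    the patch): the whole domain walk runs inside an RCU read-side critical
    section.

        rcu_read_lock();
        for_each_domain(cpu, sd) {
                /* ... read fields of sd safely here ... */
        }
        rcu_read_unlock();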

    Signed-off-by: Xiaotian Feng
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1303469634-11678-1-git-send-email-dfeng@redhat.com
    Signed-off-by: Ingo Molnar

    Xiaotian Feng
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6:
    PM: Fix PM QOS's user mode interface to work with ASCII input
    PM / Hibernate: Update kerneldoc comments in hibernate.c
    PM / Hibernate: Remove arch_prepare_suspend()
    PM / Hibernate: Update some comments in core hibernate code

    Linus Torvalds
     
  • * 'docs-move' of git://git.kernel.org/pub/scm/linux/kernel/git/rdunlap/linux-docs:
    Create Documentation/security/, move LSM-, credentials-, and keys-related files from Documentation/ to Documentation/security/, add Documentation/security/00-INDEX, and update all occurrences of Documentation/ to Documentation/security/.

    Linus Torvalds
     

27 May, 2011

14 commits

  • …rostedt/linux-2.6-trace into perf/urgent

    Ingo Molnar
     
  • profile_hits() has a common check for prof_on and prof_buffer regardless
    of SMP or !SMP. So, remove some duplicate code by splitting profile_hits
    into two.

    [akpm@linux-foundation.org: make do_profile_hits static]
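
    A sketch of the resulting shape (close to the described patch, though
    not guaranteed verbatim): the common checks live in profile_hits(),
    while the SMP and !SMP accounting lives in a static do_profile_hits().

        void profile_hits(int type, void *__pc, unsigned int nr_hits)
        {
                if (prof_on != type || prof_buffer == NULL)
                        return;
                do_profile_hits(type, __pc, nr_hits);
        }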
    Signed-off-by: Rakib Mullick
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rakib Mullick
     
  • Setup and cleanup of mm_struct->exe_file is currently done in fs/proc/.
    This was because exe_file was needed only for /proc/<pid>/exe. Since we
    will need the exe_file functionality also for core dumps (so the core
    name can contain the full binary path), build this functionality into
    the kernel unconditionally.

    To achieve that, move it out of procfs into kernel/, where in fact it
    should belong. By doing that we can make dup_mm_exe_file static. Also we
    can drop the linux/proc_fs.h inclusion in fs/exec.c and kernel/fork.c.

    Signed-off-by: Jiri Slaby
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     
  • The ns_cgroup is an annoying cgroup at the namespace / cgroup frontier and
    leads to some problems:

    * cgroup creation is out-of-control
    * cgroup name can conflict when pids are looping
    * it is not possible to have a single process handle a lot of
    namespaces without incurring exponential creation time
    * we may want to create a namespace without creating a cgroup

    The ns_cgroup was replaced by a compatibility flag 'clone_children',
    where a newly created cgroup will copy the parent cgroup values.
    The userspace has to manually create a cgroup and add a task to
    the 'tasks' file.

    This patch removes the ns_cgroup as suggested in the following thread:

    https://lists.linux-foundation.org/pipermail/containers/2009-June/018616.html

    The 'cgroup_clone' function is removed because it is no longer used.

    This is a userspace-visible change. Commit 45531757b45c ("cgroup: notify
    ns_cgroup deprecated") (merged into 2.6.27) caused the kernel to emit a
    printk warning users that the feature was planned for removal. Since
    that time we have heard from XXX users who were affected by this.

    Signed-off-by: Daniel Lezcano
    Signed-off-by: Serge E. Hallyn
    Cc: Eric W. Biederman
    Cc: Jamal Hadi Salim
    Reviewed-by: Li Zefan
    Acked-by: Paul Menage
    Acked-by: Matt Helsley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Lezcano
     
  • Convert cgroup_attach_proc to use flex_array.

    The cgroup_attach_proc implementation requires a pre-allocated array to
    store task pointers to atomically move a thread-group, but asking for a
    monolithic array with kmalloc() may be unreliable for very large groups.
    Using flex_array provides the same functionality with less risk of
    failure.

    This is a post-patch for cgroup-procs-write.patch.
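
    A sketch of the flex_array usage (helper names from lib/flex_array.h;
    the sizes and exact calls here are illustrative): the backing store is
    page-sized chunks rather than one large contiguous allocation.

        struct flex_array *group;

        group = flex_array_alloc(sizeof(struct task_struct *), group_size,
                                 GFP_KERNEL);
        if (!group)
                return -ENOMEM;
        /* store a task pointer at slot i, then read it back: */
        flex_array_put(group, i, &tsk, GFP_ATOMIC);
        tsk = *(struct task_struct **)flex_array_get(group, i);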

    Signed-off-by: Ben Blum
    Cc: "Eric W. Biederman"
    Cc: Li Zefan
    Cc: Matt Helsley
    Reviewed-by: Paul Menage
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Miao Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • Make procs file writable to move all threads by tgid at once.

    Add functionality that enables users to move all threads in a threadgroup
    at once to a cgroup by writing the tgid to the 'cgroup.procs' file. This
    current implementation makes use of a per-threadgroup rwsem that's taken
    for reading in the fork() path to prevent newly forking threads within the
    threadgroup from "escaping" while the move is in progress.

    Signed-off-by: Ben Blum
    Cc: "Eric W. Biederman"
    Cc: Li Zefan
    Cc: Matt Helsley
    Reviewed-by: Paul Menage
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Miao Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • Add cgroup subsystem callbacks for per-thread attachment in atomic contexts

    Add can_attach_task(), pre_attach(), and attach_task() as new callbacks
    for cgroups's subsystem interface. Unlike can_attach and attach, these
    are for per-thread operations, to be called potentially many times when
    attaching an entire threadgroup.

    Also, the old "bool threadgroup" interface is removed, as replaced by
    this. All subsystems are modified for the new interface - of note is
    cpuset, which requires from/to nodemasks for attach to be globally scoped
    (though per-cpuset would work too) to persist from its pre_attach to
    attach_task and attach.

    This is a pre-patch for cgroup-procs-writable.patch.
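
    In sketch form, the new per-thread callbacks look like this (hedged:
    signatures as described above, possibly not verbatim):

        struct cgroup_subsys {
                /* ... existing whole-group callbacks ... */
                int  (*can_attach_task)(struct cgroup *cgrp,
                                        struct task_struct *tsk);
                void (*pre_attach)(struct cgroup *cgrp);
                void (*attach_task)(struct cgroup *cgrp,
                                    struct task_struct *tsk);
        };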

    Signed-off-by: Ben Blum
    Cc: "Eric W. Biederman"
    Cc: Li Zefan
    Cc: Matt Helsley
    Reviewed-by: Paul Menage
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Miao Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup

    Add an rwsem that lives in a threadgroup's signal_struct that's taken for
    reading in the fork path, under CONFIG_CGROUPS. If another part of the
    kernel later wants to use such a locking mechanism, the CONFIG_CGROUPS
    ifdefs should be changed to a higher-up flag that CGROUPS and the other
    system would both depend on.

    This is a pre-patch for cgroup-procs-write.patch.
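
    A sketch of the mechanism (field and helper names here are
    illustrative): the rwsem lives in signal_struct, compiled in only when
    cgroups are; fork takes it for reading, a whole-threadgroup attach
    takes it for writing.

        struct signal_struct {
                /* ... */
        #ifdef CONFIG_CGROUPS
                struct rw_semaphore threadgroup_fork_lock;
        #endif
        };

        static inline void threadgroup_fork_read_lock(struct task_struct *tsk)
        {
        #ifdef CONFIG_CGROUPS
                down_read(&tsk->signal->threadgroup_fork_lock);
        #endif
        }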

    Signed-off-by: Ben Blum
    Cc: "Eric W. Biederman"
    Cc: Li Zefan
    Cc: Matt Helsley
    Reviewed-by: Paul Menage
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Miao Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • Make pm_qos_power_write() accept values passed to it in the ASCII hex
    format either with or without an ending newline.
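
    A sketch of the tolerant parse (not the verbatim patch): strip an
    optional trailing newline, then interpret the remainder as ASCII hex.

        static int pm_qos_parse_ascii_hex(const char *buf, size_t count,
                                          s32 *val)
        {
                char ascii[11];

                if (count == 0 || count > sizeof(ascii) - 1)
                        return -EINVAL;
                memcpy(ascii, buf, count);
                if (ascii[count - 1] == '\n')
                        count--;
                ascii[count] = '\0';
                return kstrtos32(ascii, 16, val);
        }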

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Mark Gross

    Rafael J. Wysocki
     
  • (Note: this was reverted, and is now being re-applied in pieces, with
    this being the fifth and final piece. See below for the reason that
    it is now felt to be safe to re-apply this.)

    Commit d09b62d fixed grace-period synchronization, but left some smp_mb()
    invocations in rcu_process_callbacks() that were no longer needed; sheer
    paranoia had prevented them from being removed. This commit removes
    them and provides a proof of correctness in their absence. It also adds
    a memory barrier to rcu_report_qs_rsp() immediately before the update to
    rsp->completed in order to handle the theoretical possibility that the
    compiler or CPU might move massive quantities of code into a lock-based
    critical section. This also proves that the sheer paranoia was not
    entirely unjustified, at least from a theoretical point of view.

    In addition, the old dyntick-idle synchronization depended on the fact
    that grace periods were many milliseconds in duration, so that it could
    be assumed that no dyntick-idle CPU could reorder a memory reference
    across an entire grace period. Unfortunately for this design, the
    addition of expedited grace periods breaks this assumption, which has
    the unfortunate side-effect of requiring atomic operations in the
    functions that track dyntick-idle state for RCU. (There is some hope
    that the algorithms used in user-level RCU might be applied here, but
    some work is required to handle the NMIs that user-space applications
    can happily ignore. For the short term, better safe than sorry.)

    This proof assumes that neither compiler nor CPU will allow a lock
    acquisition and release to be reordered, as doing so can result in
    deadlock. The proof is as follows:

    1. A given CPU declares a quiescent state under the protection of
    its leaf rcu_node's lock.

    2. If there is more than one level of rcu_node hierarchy, the
    last CPU to declare a quiescent state will also acquire the
    ->lock of the next rcu_node up in the hierarchy, but only
    after releasing the lower level's lock. The acquisition of this
    lock clearly cannot occur prior to the acquisition of the leaf
    node's lock.

    3. Step 2 repeats until we reach the root rcu_node structure.
    Please note again that only one lock is held at a time through
    this process. The acquisition of the root rcu_node's ->lock
    must occur after the release of that of the leaf rcu_node.

    4. At this point, we set the ->completed field in the rcu_state
    structure in rcu_report_qs_rsp(). However, if the rcu_node
    hierarchy contains only one rcu_node, then in theory the code
    preceding the quiescent state could leak into the critical
    section. We therefore precede the update of ->completed with a
    memory barrier (see the sketch after this proof). All CPUs will
    therefore agree that any updates preceding any report of a quiescent
    state will have happened before the update of ->completed.

    5. Regardless of whether a new grace period is needed, rcu_start_gp()
    will propagate the new value of ->completed to all of the leaf
    rcu_node structures, under the protection of each rcu_node's ->lock.
    If a new grace period is needed immediately, this propagation
    will occur in the same critical section that ->completed was
    set in, but courtesy of the memory barrier in #4 above, is still
    seen to follow any pre-quiescent-state activity.

    6. When a given CPU invokes __rcu_process_gp_end(), it becomes
    aware of the end of the old grace period and therefore makes
    any RCU callbacks that were waiting on that grace period eligible
    for invocation.

    If this CPU is the same one that detected the end of the grace
    period, and if there is but a single rcu_node in the hierarchy,
    we will still be in the single critical section. In this case,
    the memory barrier in step #4 guarantees that all callbacks will
    be seen to execute after each CPU's quiescent state.

    On the other hand, if this is a different CPU, it will acquire
    the leaf rcu_node's ->lock, and will again be serialized after
    each CPU's quiescent state for the old grace period.

    On the strength of this proof, this commit therefore removes the memory
    barriers from rcu_process_callbacks() and adds one to rcu_report_qs_rsp().
    The effect is to reduce the number of memory barriers by one and to
    reduce the frequency of execution from about once per scheduling tick
    per CPU to once per grace period.
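
    Step 4 above is the crux; in sketch form (not the verbatim patch):

        /* Order all pre-quiescent-state accesses before the update of
         * ->completed, so no CPU can see the new value without also
         * seeing everything that preceded the quiescent states. */
        smp_mb();
        rsp->completed = rsp->gpnum;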

    This was reverted due to hangs found during testing by Yinghai Lu and
    Ingo Molnar. Frederic Weisbecker supplied Yinghai with tracing that
    located the underlying problem, and Frederic also provided the fix.

    The underlying problem was that the HARDIRQ_ENTER() macro from
    lib/locking-selftest.c invoked irq_enter(), which in turn invokes
    rcu_irq_enter(), but HARDIRQ_EXIT() invoked __irq_exit(), which
    does not invoke rcu_irq_exit(). This situation resulted in calls
    to rcu_irq_enter() that were not balanced by the required calls to
    rcu_irq_exit(). Therefore, after these locking selftests completed,
    RCU's dyntick-idle nesting count was a large number (for example,
    72), which caused RCU to conclude that the affected CPU was not in
    dyntick-idle mode when in fact it was.

    RCU would therefore incorrectly wait for this dyntick-idle CPU, resulting
    in hangs.

    In contrast, with Frederic's patch, which replaces the irq_enter()
    in HARDIRQ_ENTER() with an __irq_enter(), these tests don't ever call
    either rcu_irq_enter() or rcu_irq_exit(), which works because the CPU
    running the test is already marked as not being in dyntick-idle mode.
    This keeps the rcu_irq_enter()/rcu_irq_exit() nesting balanced, so RCU
    then has no problem working out which CPUs are in dyntick-idle mode and
    which are not.

    The reason that the imbalance was not noticed before the barrier patch
    was applied is that the old implementation of rcu_enter_nohz() ignored
    the nesting depth. This could still result in delays, but much shorter
    ones. Whenever there was a delay, RCU would IPI the CPU with the
    unbalanced nesting level, which would eventually result in rcu_enter_nohz()
    being called, which in turn would force RCU to see that the CPU was in
    dyntick-idle mode.

    The reason that very few people noticed the problem is that the mismatched
    irq_enter() vs. __irq_exit() occurred only when the kernel was built with
    CONFIG_DEBUG_LOCKING_API_SELFTESTS.

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • The old version of rcu_enter_nohz() forced RCU into nohz mode even if
    the nesting count was non-zero. This change causes rcu_enter_nohz()
    to hold off for non-zero nesting counts.
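
    A sketch of the change (hedged, not the verbatim patch):

        void rcu_enter_nohz(void)
        {
                struct rcu_dynticks *rdtp = &__get_cpu_var(rcu_dynticks);

                if (--rdtp->dynticks_nesting)
                        return;         /* still nested: hold off */
                /* ... actually enter dyntick-idle mode ... */
        }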

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Condition the set_need_resched() in rcu_irq_exit() on in_irq(). This
    should be a no-op, because rcu_irq_exit() should only be called from irq.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Second step of partitioning of commit e59fb3120b.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Add the memory barriers added by e59fb3120b.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

26 May, 2011

8 commits

  • Merge reason: Linus applied an overlapping commit:

    5f2e8e2b0bf0: kernel/watchdog.c: Use proper ANSI C prototypes

    So merge it in to make sure we can iterate the file without conflicts.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • commit 4b06042 ("bitmap, irq: add smp_affinity_list interface to
    /proc/irq") causes the following warning:

    [ 274.239500] WARNING: at fs/proc/generic.c:850 remove_proc_entry+0x24c/0x27a()
    [ 274.251761] remove_proc_entry: removing non-empty directory 'irq/184',
    leaking at least 'smp_affinity_list'

    Remove the new file in the exit path.
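
    The fix, in sketch form (the teardown site is the irq proc exit path;
    exact variable names may differ):

        remove_proc_entry("smp_affinity_list", desc->dir);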

    Signed-off-by: Yinghai Lu
    Cc: Mike Travis
    Link: http://lkml.kernel.org/r/4DDDE094.6050505@kernel.org
    Signed-off-by: Thomas Gleixner

    Yinghai Lu
     
  • Witold reported a reboot caused by the selftests of the dynamic function
    tracer. He sent me a config and I used ktest to do a config_bisect on it
    (as my config did not cause the crash). It pointed out that the problem
    config was CONFIG_PROVE_RCU.

    What happened was that if multiple callbacks are attached to the
    function tracer, we iterate a list of callbacks. Because the list is
    managed by synchronize_sched() and preempt_disable, the access to the
    pointers uses rcu_dereference_raw().

    When PROVE_RCU is enabled, the rcu_dereference_raw() calls some
    debugging functions, which happen to be traced. The tracing of the debug
    function would then call rcu_dereference_raw() which would then call the
    debug function and then... well you get the idea.

    I first wrote two different patches to solve this bug.

    1) add a __rcu_dereference_raw() that would not do any checks.
    2) add notrace to the offending debug functions.

    Both of these patches worked.

    Talking with Paul McKenney on IRC, he suggested adding recursion
    detection instead. This seemed to be a better solution, so I decided to
    implement it. As the task_struct already has a trace_recursion field to
    detect recursion in the ring buffer, and that uses only a small number
    of bits, I decided to use the same variable to add flags that can detect
    recursion inside the function tracer infrastructure.

    I plan to change it so that the task struct bit can be checked in
    mcount, but as that requires changes to all archs, I will hold that off
    to the next merge window.
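
    A sketch of the guard (the bit and helper names follow the patch
    description but may not be verbatim): a per-task flag is set around the
    callback-list walk, and a set flag on entry means we are recursing.

        if (trace_recursion_test(TRACE_INTERNAL_BIT))
                return;
        trace_recursion_set(TRACE_INTERNAL_BIT);
        /* ... iterate the callback list via rcu_dereference_raw() ... */
        trace_recursion_clear(TRACE_INTERNAL_BIT);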

    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Frederic Weisbecker
    Cc: Paul E. McKenney
    Link: http://lkml.kernel.org/r/1306348063.1465.116.camel@gandalf.stny.rr.com
    Reported-by: Witold Baryluk
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • Filesystems, like Btrfs, have some "ULL" macros, and when these macros
    are passed to a tracepoint's __print_symbolic(), there are 64->32
    truncation warnings when compiling on a 32-bit box.

    Signed-off-by: Liu Bo
    Link: http://lkml.kernel.org/r/4DACE6E0.7000507@cn.fujitsu.com
    Signed-off-by: Steven Rostedt

    liubo
     
  • When dynamic ftrace is not configured, the ops->flags still needs
    to have its FTRACE_OPS_FL_ENABLED bit set in ftrace_startup().

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • The self tests for the event tracer do not check whether function
    tracing was successfully activated. They need to before continuing the
    tests; otherwise the wrong errors may be reported.

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • register_ftrace_function() returns an error code on failure, except
    when the call to ftrace_startup() fails. Add an error return to
    ftrace_startup() if it fails to start, allowing register_ftrace_function()
    to return a proper error value.

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/linux-2.6-nsfd:
    net: fix get_net_ns_by_fd for !CONFIG_NET_NS
    ns proc: Return -ENOENT for a nonexistent /proc/self/ns/ entry.
    ns: Declare sys_setns in syscalls.h
    net: Allow setting the network namespace by fd
    ns proc: Add support for the ipc namespace
    ns proc: Add support for the uts namespace
    ns proc: Add support for the network namespace.
    ns: Introduce the setns syscall
    ns: proc files for namespace naming policy.

    Linus Torvalds