17 Nov, 2011

1 commit


07 Nov, 2011

1 commit

  • * 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: Add a 'reason' to wb_writeback_work
    writeback: send work item to queue_io, move_expired_inodes
    writeback: trace event balance_dirty_pages
    writeback: trace event bdi_dirty_ratelimit
    writeback: fix ppc compile warnings on do_div(long long, unsigned long)
    writeback: per-bdi background threshold
    writeback: dirty position control - bdi reserve area
    writeback: control dirty pause time
    writeback: limit max dirty pause time
    writeback: IO-less balance_dirty_pages()
    writeback: per task dirty rate limit
    writeback: stabilize bdi->dirty_ratelimit
    writeback: dirty rate control
    writeback: add bg_threshold parameter to __bdi_update_bandwidth()
    writeback: dirty position control
    writeback: account per-bdi accumulated dirtied pages

    Linus Torvalds
     

26 Oct, 2011

3 commits

  • * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
    llist: Add back llist_add_batch() and llist_del_first() prototypes
    sched: Don't use tasklist_lock for debug prints
    sched: Warn on rt throttling
    sched: Unify the ->cpus_allowed mask copy
    sched: Wrap scheduler p->cpus_allowed access
    sched: Request for idle balance during nohz idle load balance
    sched: Use resched IPI to kick off the nohz idle balance
    sched: Fix idle_cpu()
    llist: Remove cpu_relax() usage in cmpxchg loops
    sched: Convert to struct llist
    llist: Add llist_next()
    irq_work: Use llist in the struct irq_work logic
    llist: Return whether list is empty before adding in llist_add()
    llist: Move cpu_relax() to after the cmpxchg()
    llist: Remove the platform-dependent NMI checks
    llist: Make some llist functions inline
    sched, tracing: Show PREEMPT_ACTIVE state in trace_sched_switch
    sched: Remove redundant test in check_preempt_tick()
    sched: Add documentation for bandwidth control
    sched: Return unused runtime on group dequeue
    ...

    Linus Torvalds
     
  • * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
    rcu: Move propagation of ->completed from rcu_start_gp() to rcu_report_qs_rsp()
    rcu: Remove rcu_needs_cpu_flush() to avoid false quiescent states
    rcu: Wire up RCU_BOOST_PRIO for rcutree
    rcu: Make rcu_torture_boost() exit loops at end of test
    rcu: Make rcu_torture_fqs() exit loops at end of test
    rcu: Permit rt_mutex_unlock() with irqs disabled
    rcu: Avoid having just-onlined CPU resched itself when RCU is idle
    rcu: Suppress NMI backtraces when stall ends before dump
    rcu: Prohibit grace periods during early boot
    rcu: Simplify unboosting checks
    rcu: Prevent early boot set_need_resched() from __rcu_pending()
    rcu: Dump local stack if cannot dump all CPUs' stacks
    rcu: Move __rcu_read_unlock()'s barrier() within if-statement
    rcu: Improve rcu_assign_pointer() and RCU_INIT_POINTER() documentation
    rcu: Make rcu_assign_pointer() unconditionally insert a memory barrier
    rcu: Make rcu_implicit_dynticks_qs() locals be correct size
    rcu: Eliminate in_irq() checks in rcu_enter_nohz()
    nohz: Remove nohz_cpu_mask
    rcu: Document interpretation of RCU-lockdep splats
    rcu: Allow rcutorture's stat_interval parameter to be changed at runtime
    ...

    Linus Torvalds
     
  • * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
    rtmutex: Add missing rcu_read_unlock() in debug_rt_mutex_print_deadlock()
    lockdep: Comment all warnings
    lib: atomic64: Change the type of local lock to raw_spinlock_t
    locking, lib/atomic64: Annotate atomic64_lock::lock as raw
    locking, x86, iommu: Annotate qi->q_lock as raw
    locking, x86, iommu: Annotate irq_2_ir_lock as raw
    locking, x86, iommu: Annotate iommu->register_lock as raw
    locking, dma, ipu: Annotate bank_lock as raw
    locking, ARM: Annotate low level hw locks as raw
    locking, drivers/dca: Annotate dca_lock as raw
    locking, powerpc: Annotate uic->lock as raw
    locking, x86: mce: Annotate cmci_discover_lock as raw
    locking, ACPI: Annotate c3_lock as raw
    locking, oprofile: Annotate oprofilefs lock as raw
    locking, video: Annotate vga console lock as raw
    locking, latencytop: Annotate latency_lock as raw
    locking, timer_stats: Annotate table_lock as raw
    locking, rwsem: Annotate inner lock as raw
    locking, semaphores: Annotate inner lock as raw
    locking, sched: Annotate thread_group_cputimer as raw
    ...

    Fix up conflicts in kernel/posix-cpu-timers.c manually: making
    cputimer->cputime a raw lock conflicted with the ABBA fix in commit
    bcd5cff7216f ("cputimer: Cure lock inversion").

    Linus Torvalds
     

25 Oct, 2011

1 commit

  • * 'usb-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (260 commits)
    usb: renesas_usbhs: fixup inconsistent return from usbhs_pkt_push()
    usb/isp1760: Allow to optionally trigger low-level chip reset via GPIOLIB.
    USB: gadget: midi: memory leak in f_midi_bind_config()
    USB: gadget: midi: fix range check in f_midi_out_open()
    QE/FHCI: fixed the CONTROL bug
    usb: renesas_usbhs: tidyup for smatch warnings
    USB: Fix USB Kconfig dependency problem on 85xx/QoirQ platforms
    EHCI: workaround for MosChip controller bug
    usb: gadget: file_storage: fix race on unloading
    USB: ftdi_sio.c: Use ftdi async_icount structure for TIOCMIWAIT, as in other drivers
    USB: ftdi_sio.c:Fill MSR fields of the ftdi async_icount structure
    USB: ftdi_sio.c: Fill LSR fields of the ftdi async_icount structure
    USB: ftdi_sio.c:Fill TX field of the ftdi async_icount structure
    USB: ftdi_sio.c: Fill the RX field of the ftdi async_icount structure
    USB: ftdi_sio.c: Basic icount infrastructure for ftdi_sio
    usb/isp1760: Let OF bindings depend on general CONFIG_OF instead of PPC_OF .
    USB: ftdi_sio: Support TI/Luminary Micro Stellaris BD-ICDI Board
    USB: Fix runtime wakeup on OHCI
    xHCI/USB: Make xHCI driver have a BOS descriptor.
    usb: gadget: add new usb gadget for ACM and mass storage
    ...

    Linus Torvalds
     

04 Oct, 2011

2 commits

  • Use the generic llist primitives.

    We had a private lockless list implementation in the scheduler, in the
    wake-list code. Now that we have a generic llist implementation that
    provides all the required operations, switch to it.

    This patch is not expected to change any behavior.
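
    As a rough sketch, the resulting wake-list pattern looks like this
    (function and field names follow the scheduler code of the time and are
    indicative, not the exact diff):

    #include <linux/llist.h>

    /* enqueue a remote wakeup; the adder of the first entry kicks the CPU */
    static void ttwu_queue_remote(struct task_struct *p, int cpu)
    {
            if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list))
                    smp_send_reschedule(cpu);
    }

    /* on the target CPU: detach the whole list at once, activate each task */
    static void sched_ttwu_pending(void)
    {
            struct rq *rq = this_rq();
            struct llist_node *llist = llist_del_all(&rq->wake_list);
            struct task_struct *p;

            while (llist) {
                    p = llist_entry(llist, struct task_struct, wake_entry);
                    llist = llist_next(llist);
                    ttwu_do_activate(rq, p, 0);
            }
    }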

    Signed-off-by: Peter Zijlstra
    Cc: Huang Ying
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/1315836353.26517.42.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Merge reason: pick up the latest fixes.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

03 Oct, 2011

1 commit

  • Add two fields to task_struct.

    1) account dirtied pages in the individual tasks, for accuracy
    2) per-task balance_dirty_pages() call intervals, for flexibility

    The balance_dirty_pages() call interval (i.e. nr_dirtied_pause) will
    scale roughly as the square root of the safety gap between the number
    of dirty pages and the threshold.

    The main problem with per-task nr_dirtied is that if 1k+ tasks start
    dirtying pages at exactly the same time, each task will be assigned a
    large initial nr_dirtied_pause, so the dirty threshold will be exceeded
    long before each task reaches its nr_dirtied_pause and hence calls
    balance_dirty_pages().

    The solution is to watch the number of pages dirtied on each CPU in
    between the calls into balance_dirty_pages(). If it exceeds
    ratelimit_pages (3% dirty threshold), force a call to
    balance_dirty_pages() for a chance to set bdi->dirty_exceeded. In
    normal situations this safeguarding condition is not expected to
    trigger at all.

    On the sqrt in dirty_poll_interval():

    It will serve as an initial guess when dirty pages are still in the
    freerun area.

    When dirty pages are floating inside the dirty control scope [freerun,
    limit], a followup patch will use some refined dirty poll interval to
    get the desired pause time.

    thresh-dirty (MB)    sqrt
                    1      16
                    2      22
                    4      32
                    8      45
                   16      64
                   32      90
                   64     128
                  128     181
                  256     256
                  512     362
                 1024     512

    The above table means that, given a 1MB (or 1GB) gap and the dd tasks
    polling balance_dirty_pages() on every 16 (or 512) pages dirtied, the
    dirty limit won't be exceeded as long as there are fewer than 16 (or
    512) concurrent dd's.

    So sqrt naturally leads to lower overhead and safely supports more
    concurrent tasks on large-memory servers, which have large
    (thresh-freerun) gaps.

    peter: keep the per-CPU ratelimit for safeguarding the 1k+ tasks case
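
    A minimal sketch of that initial guess, approximating sqrt with a power
    of two (the helper name dirty_poll_interval() comes from the text; the
    ilog2-based approximation is illustrative):

    /* poll interval ~ sqrt(gap in pages), rounded to a power of two */
    static unsigned long dirty_poll_interval(unsigned long dirty,
                                             unsigned long thresh)
    {
            if (thresh > dirty)
                    return 1UL << (ilog2(thresh - dirty) / 2);
            return 1;       /* gap exhausted: poll on every page */
    }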

    CC: Peter Zijlstra
    Reviewed-by: Andrea Righi
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

30 Sep, 2011

2 commits

  • David reported:

    Attached below is a watered-down version of rt/tst-cpuclock2.c from
    GLIBC. Just build it with "gcc -o test test.c -lpthread -lrt" or
    similar.

    Run it several times, and you will see cases where the main thread
    will measure a process clock difference before and after the nanosleep
    which is smaller than the cpu-burner thread's individual thread clock
    difference. This doesn't make any sense since the cpu-burner thread
    is part of the top-level process's thread group.

    I've reproduced this on both x86-64 and sparc64 (using both 32-bit and
    64-bit binaries).

    For example:

    [davem@boricha build-x86_64-linux]$ ./test
    process: before(0.001221967) after(0.498624371) diff(497402404)
    thread: before(0.000081692) after(0.498316431) diff(498234739)
    self: before(0.001223521) after(0.001240219) diff(16698)
    [davem@boricha build-x86_64-linux]$

    The diff of 'process' should always be >= the diff of 'thread'.

    I make sure to wrap the 'thread' clock measurements the most tightly
    around the nanosleep() call, and that the 'process' clock measurements
    are the outer-most ones.

    ---
    #include <unistd.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <fcntl.h>
    #include <string.h>
    #include <errno.h>
    #include <pthread.h>

    static pthread_barrier_t barrier;

    static void *chew_cpu(void *arg)
    {
            pthread_barrier_wait(&barrier);
            while (1)
                    __asm__ __volatile__("" : : : "memory");
            return NULL;
    }

    int main(void)
    {
            clockid_t process_clock, my_thread_clock, th_clock;
            struct timespec process_before, process_after;
            struct timespec me_before, me_after;
            struct timespec th_before, th_after;
            struct timespec sleeptime;
            unsigned long diff;
            pthread_t th;
            int err;

            err = clock_getcpuclockid(0, &process_clock);
            if (err)
                    return 1;

            err = pthread_getcpuclockid(pthread_self(), &my_thread_clock);
            if (err)
                    return 1;

            pthread_barrier_init(&barrier, NULL, 2);
            err = pthread_create(&th, NULL, chew_cpu, NULL);
            if (err)
                    return 1;

            err = pthread_getcpuclockid(th, &th_clock);
            if (err)
                    return 1;

            pthread_barrier_wait(&barrier);

            err = clock_gettime(process_clock, &process_before);
            if (err)
                    return 1;

            err = clock_gettime(my_thread_clock, &me_before);
            if (err)
                    return 1;

            err = clock_gettime(th_clock, &th_before);
            if (err)
                    return 1;

            sleeptime.tv_sec = 0;
            sleeptime.tv_nsec = 500000000;
            nanosleep(&sleeptime, NULL);

            err = clock_gettime(th_clock, &th_after);
            if (err)
                    return 1;

            err = clock_gettime(my_thread_clock, &me_after);
            if (err)
                    return 1;

            err = clock_gettime(process_clock, &process_after);
            if (err)
                    return 1;

            diff = process_after.tv_nsec - process_before.tv_nsec;
            printf("process: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
                   process_before.tv_sec, process_before.tv_nsec,
                   process_after.tv_sec, process_after.tv_nsec, diff);
            diff = th_after.tv_nsec - th_before.tv_nsec;
            printf("thread: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
                   th_before.tv_sec, th_before.tv_nsec,
                   th_after.tv_sec, th_after.tv_nsec, diff);
            diff = me_after.tv_nsec - me_before.tv_nsec;
            printf("self: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
                   me_before.tv_sec, me_before.tv_nsec,
                   me_after.tv_sec, me_after.tv_nsec, diff);

            return 0;
    }

    This is because we use p->se.sum_exec_runtime in
    thread_group_cputime(), where we iterate the thread group and sum all
    data. This does not take the time since the last schedule operation
    (tick or otherwise) into account. We can cure this by using
    task_sched_runtime() at the cost of having to take locks.

    This also means we can (and must) do away with
    thread_group_sched_runtime() since the modified thread_group_cputime()
    is now more accurate and would deadlock when called from
    thread_group_sched_runtime().

    Aside from that, it makes the function safe on 32-bit systems. The old
    code added t->se.sum_exec_runtime unprotected; sum_exec_runtime is a
    64-bit value and could be changed on another cpu at the same time.
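
    The core of the fix, sketched (the loop shape is indicative; cputime
    accounting details elided):

    /* thread_group_cputime(): sum via task_sched_runtime() so that time
     * since the last schedule operation is included, under proper locks */
    t = tsk;
    do {
            times->utime = cputime_add(times->utime, t->utime);
            times->stime = cputime_add(times->stime, t->stime);
            times->sum_exec_runtime += task_sched_runtime(t);
    } while_each_thread(tsk, t);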

    Reported-by: David Miller
    Signed-off-by: Peter Zijlstra
    Cc: stable@kernel.org
    Link: http://lkml.kernel.org/r/1314874459.7945.22.camel@twins
    Tested-by: David Miller
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • Add to the dev_state and alloc_async structures the user namespace
    corresponding to the uid and euid. Pass these to kill_pid_info_as_uid(),
    which can then implement a proper, user-namespace-aware uid check.

    Changelog:
    Sep 20: Per Oleg's suggestion: Instead of caching and passing user namespace,
    uid, and euid each separately, pass a struct cred.
    Sep 26: Address Alan Stern's comments: don't define a struct cred at
    usbdev_open(), and take and put a cred at async_completed() to
    ensure it lasts for the duration of kill_pid_info_as_cred().
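
    In code terms the pattern is roughly the following (a sketch; the
    async field names are indicative):

    /* at submit time: pin the opener's credentials */
    as->cred = get_cred(current_cred());

    /* at completion time: signal with the saved credentials, then drop */
    kill_pid_info_as_cred(as->signr, &sinfo, as->pid, as->cred, as->secid);
    put_cred(as->cred);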

    Signed-off-by: Serge Hallyn
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Serge Hallyn
     

29 Sep, 2011

2 commits

  • Commit 7765be (Fix RCU_BOOST race handling current->rcu_read_unlock_special)
    introduced a new ->rcu_boosted field in the task structure. This is
    redundant because the existing ->rcu_boost_mutex will be non-NULL at
    any time that ->rcu_boosted is nonzero. Therefore, this commit removes
    ->rcu_boosted and tests ->rcu_boost_mutex instead.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
    RCU no longer uses the nohz_cpu_mask global variable, nor does anyone
    else. This commit therefore removes it, which reduces the memory
    footprint and also removes some atomic instructions and memory
    barriers from the dyntick-idle path.

    Signed-off-by: Alex Shi
    Signed-off-by: Paul E. McKenney

    Shi, Alex
     

13 Sep, 2011

1 commit

  • The thread_group_cputimer lock can be taken in atomic context and therefore
    cannot be preempted on -rt - annotate it.

    In mainline this change documents the low level nature of
    the lock - otherwise there's no functional difference. Lockdep
    and Sparse checking will work as usual.
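
    The annotation itself is mechanical, along these lines (sketch):

    struct thread_group_cputimer {
            struct task_cputime cputime;
            int running;
            raw_spinlock_t lock;            /* was: spinlock_t lock */
    };

    /* callers switch to the matching raw variants */
    raw_spin_lock(&cputimer->lock);
    /* ... update cputimer->cputime ... */
    raw_spin_unlock(&cputimer->lock);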

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

14 Aug, 2011

1 commit

  • Account bandwidth usage on the cfs_rq level versus the task_groups to which
    they belong. Whether we are tracking bandwidth on a given cfs_rq is maintained
    under cfs_rq->runtime_enabled.

    cfs_rq's which belong to a bandwidth constrained task_group have their runtime
    accounted via the update_curr() path, which withdraws bandwidth from the global
    pool as desired. Updates involving the global pool are currently protected
    under cfs_bandwidth->lock, local runtime is protected by rq->lock.

    This patch only assigns and tracks quota, no action is taken in the case that
    cfs_rq->runtime_used exceeds cfs_rq->runtime_assigned.
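
    Sketched against the description above (names are indicative of the
    bookkeeping, not the exact diff):

    /* charge execution time against the cfs_rq's locally assigned quota */
    static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
    {
            if (!cfs_rq->runtime_enabled)
                    return;

            cfs_rq->runtime_remaining -= delta_exec;
            if (cfs_rq->runtime_remaining > 0)
                    return;

            /* local quota exhausted: withdraw more from the global pool
             * (serialized by cfs_bandwidth->lock inside the helper) */
            assign_cfs_rq_runtime(cfs_rq);
    }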

    Signed-off-by: Paul Turner
    Signed-off-by: Nikhil Rao
    Signed-off-by: Bharata B Rao
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.179386821@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     

12 Aug, 2011

1 commit

    The patch http://lkml.org/lkml/2003/7/13/226 introduced an RLIMIT_NPROC
    check in set_user() to catch the NPROC limit being exceeded via
    setuid() and similar functions.

    Before the check, an unprivileged user could greatly exceed the
    allowed number of processes if a program relied on the rlimit alone.
    But the check created a new security threat: many poorly written
    programs simply don't check the setuid() return code and believe it
    cannot fail when executed with root privileges. So this patch removes
    the check, because it has too often led to privilege escalations
    through buggy programs.

    The NPROC limit can still be enforced in the common code flow of
    daemons spawning user processes. Most daemons do
    fork()+setuid()+execve(). The check introduced in execve() (1)
    enforces the same limit as in setuid() and (2) doesn't create similar
    security issues.

    Neil Brown suggested tracking which specific process has exceeded the
    limit by setting the PF_NPROC_EXCEEDED process flag. With this change
    only that process fails execve(); other processes' execve() behaviour
    is unchanged.

    Solar Designer suggested re-checking whether the NPROC limit is still
    exceeded at the moment of execve(). If the process slept for days
    between set*uid() and execve() and the NPROC counter has meanwhile
    dropped back under the limit, a deferred execve() failure because the
    limit was exceeded days ago would be unexpected. So, if the limit is
    no longer exceeded, we clear the flag on successful calls to execve()
    and fork().

    The flag is also cleared on successful calls to set_user() as the limit
    was exceeded for the previous user, not the current one.
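
    The resulting execve()-time check looks roughly like this (a sketch of
    the logic described above):

    /* fail execve() only if the flagged task still exceeds RLIMIT_NPROC */
    if ((current->flags & PF_NPROC_EXCEEDED) &&
        atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
            retval = -EAGAIN;
            goto out_ret;
    }

    /* not exceeded (anymore): clear the flag */
    current->flags &= ~PF_NPROC_EXCEEDED;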

    A similar check was introduced in the -ow patches (without the process
    flag).

    v3 - clear PF_NPROC_EXCEEDED on successful calls to set_user().

    Reviewed-by: James Morris
    Signed-off-by: Vasiliy Kulikov
    Acked-by: NeilBrown
    Signed-off-by: Linus Torvalds

    Vasiliy Kulikov
     

26 Jul, 2011

1 commit

  • * 'for-3.1/core' of git://git.kernel.dk/linux-block: (24 commits)
    block: strict rq_affinity
    backing-dev: use synchronize_rcu_expedited instead of synchronize_rcu
    block: fix patch import error in max_discard_sectors check
    block: reorder request_queue to remove 64 bit alignment padding
    CFQ: add think time check for group
    CFQ: add think time check for service tree
    CFQ: move think time check variables to a separate struct
    fixlet: Remove fs_excl from struct task.
    cfq: Remove special treatment for metadata rqs.
    block: document blk_plug list access
    block: avoid building too big plug list
    compat_ioctl: fix make headers_check regression
    block: eliminate potential for infinite loop in blkdev_issue_discard
    compat_ioctl: fix warning caused by qemu
    block: flush MEDIA_CHANGE from drivers on close(2)
    blk-throttle: Make total_nr_queued unsigned
    block: Add __attribute__((format(printf...) and fix fallout
    fs/partitions/check.c: make local symbols static
    block:remove some spare spaces in genhd.c
    block:fix the comment error in blkdev.h
    ...

    Linus Torvalds
     

23 Jul, 2011

2 commits

  • …/git/tip/linux-2.6-tip

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (24 commits)
    sched: Cleanup duplicate local variable in [enqueue|dequeue]_task_fair
    sched: Replace use of entity_key()
    sched: Separate group-scheduling code more clearly
    sched: Reorder root_domain to remove 64 bit alignment padding
    sched: Do not attempt to destroy uninitialized rt_bandwidth
    sched: Remove unused function cpu_cfs_rq()
    sched: Fix (harmless) typo 'CONFG_FAIR_GROUP_SCHED'
    sched, cgroup: Optimize load_balance_fair()
    sched: Don't update shares twice on on_rq parent
    sched: update correct entity's runtime in check_preempt_wakeup()
    xtensa: Use generic config PREEMPT definition
    h8300: Use generic config PREEMPT definition
    m32r: Use generic PREEMPT config
    sched: Skip autogroup when looking for all rt sched groups
    sched: Simplify mutex_spin_on_owner()
    sched: Remove rcu_read_lock() from wake_affine()
    sched: Generalize sleep inside spinlock detection
    sched: Make sleeping inside spinlock detection working in !CONFIG_PREEMPT
    sched: Isolate preempt counting in its own config option
    sched: Remove pointless in_atomic() definition check
    ...

    Linus Torvalds
     
  • * 'ptrace' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc: (39 commits)
    ptrace: do_wait(traced_leader_killed_by_mt_exec) can block forever
    ptrace: fix ptrace_signal() && STOP_DEQUEUED interaction
    connector: add an event for monitoring process tracers
    ptrace: dont send SIGSTOP on auto-attach if PT_SEIZED
    ptrace: mv send-SIGSTOP from do_fork() to ptrace_init_task()
    ptrace_init_task: initialize child->jobctl explicitly
    has_stopped_jobs: s/task_is_stopped/SIGNAL_STOP_STOPPED/
    ptrace: make former thread ID available via PTRACE_GETEVENTMSG after PTRACE_EVENT_EXEC stop
    ptrace: wait_consider_task: s/same_thread_group/ptrace_reparented/
    ptrace: kill real_parent_is_ptracer() in favor of ptrace_reparented()
    ptrace: ptrace_reparented() should check same_thread_group()
    redefine thread_group_leader() as exit_signal >= 0
    do not change dead_task->exit_signal
    kill task_detached()
    reparent_leader: check EXIT_DEAD instead of task_detached()
    make do_notify_parent() __must_check, update the callers
    __ptrace_detach: avoid task_detached(), check do_notify_parent()
    kill tracehook_notify_death()
    make do_notify_parent() return bool
    ptrace: s/tracehook_tracer_task()/ptrace_parent()/
    ...

    Linus Torvalds
     

21 Jul, 2011

3 commits

  • …l/git/tip/linux-2.6-tip

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    signal: align __lock_task_sighand() irq disabling and RCU
    softirq,rcu: Inform RCU of irq_exit() activity
    sched: Add irq_{enter,exit}() to scheduler_ipi()
    rcu: protect __rcu_read_unlock() against scheduler-using irq handlers
    rcu: Streamline code produced by __rcu_read_unlock()
    rcu: Fix RCU_BOOST race handling current->rcu_read_unlock_special
    rcu: decrease rcu_report_exp_rnp coupling with scheduler

    Linus Torvalds
     
    Allow for sched_domain spans that overlap by giving such domains their
    own sched_group list instead of sharing the sched_groups amongst each
    other.

    This is needed for machines with more than 16 nodes, because
    sched_domain_node_span() will generate a node mask from the
    16 nearest nodes without regard to whether those masks overlap.

    Currently sched_domains have a sched_group that maps to their child
    sched_domain span, and since there is no overlap we share the
    sched_group between the sched_domains of the various CPUs. If however
    there is overlap, we would need to link the sched_group list in
    different ways for each cpu, and hence sharing isn't possible.

    In order to solve this, allocate private sched_groups for each CPU's
    sched_domain but have the sched_groups share a sched_group_power
    structure such that we can uniquely track the power.

    Reported-and-tested-by: Anton Blanchard
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/n/tip-08bxqw9wis3qti9u5inifh3y@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • In order to prepare for non-unique sched_groups per domain, we need to
    carry the cpu_power elsewhere, so put a level of indirection in.
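
    The indirection looks roughly like this (a trimmed sketch; field names
    follow the commit titles):

    struct sched_group_power {
            atomic_t ref;           /* shared by all groups covering the span */
            unsigned int power;     /* cpu_power, moved out of sched_group */
    };

    struct sched_group {
            struct sched_group *next;       /* balancing lists */
            struct sched_group_power *sgp;  /* was: unsigned int cpu_power */
            unsigned long cpumask[0];
    };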

    Reported-and-tested-by: Anton Blanchard
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/n/tip-qkho2byuhe4482fuknss40ad@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

20 Jul, 2011

1 commit

  • The RCU_BOOST commits for TREE_PREEMPT_RCU introduced an other-task
    write to a new RCU_READ_UNLOCK_BOOSTED bit in the task_struct structure's
    ->rcu_read_unlock_special field, but, as noted by Steven Rostedt, without
    correctly synchronizing all accesses to ->rcu_read_unlock_special.
    This could result in bits in ->rcu_read_unlock_special being spuriously
    set and cleared due to conflicting accesses, which in turn could result
    in deadlocks between the rcu_node structure's ->lock and the scheduler's
    rq and pi locks. These deadlocks would result from RCU incorrectly
    believing that the just-ended RCU read-side critical section had been
    preempted and/or boosted. If that RCU read-side critical section was
    executed with either rq or pi locks held, RCU's ensuing (incorrect)
    calls to the scheduler would cause the scheduler to attempt to once
    again acquire the rq and pi locks, resulting in deadlock. More complex
    deadlock cycles are also possible, involving multiple rq and pi locks
    as well as locks from multiple rcu_node structures.

    This commit fixes synchronization by creating ->rcu_boosted field in
    task_struct that is accessed and modified only when holding the ->lock
    in the rcu_node structure on which the task is queued (on that rcu_node
    structure's ->blkd_tasks list). This results in tasks accessing only
    their own current->rcu_read_unlock_special fields, making unsynchronized
    access once again legal, and keeping the rcu_read_unlock() fastpath free
    of atomic instructions and memory barriers.

    The reason that the rcu_read_unlock() fastpath does not need to access
    the new current->rcu_boosted field is that this new field cannot
    be non-zero unless the RCU_READ_UNLOCK_BLOCKED bit is set in the
    current->rcu_read_unlock_special field. Therefore, rcu_read_unlock()
    need only test current->rcu_read_unlock_special: if that is zero, then
    current->rcu_boosted must also be zero.

    This bug does not affect TINY_PREEMPT_RCU because this implementation
    of RCU accesses current->rcu_read_unlock_special with irqs disabled,
    thus preventing races on the !SMP systems that TINY_PREEMPT_RCU runs on.
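
    Conceptually, the fix confines the new field to the rcu_node lock,
    along these lines (sketch):

    raw_spin_lock_irqsave(&rnp->lock, flags);
    if (t->rcu_boosted) {           /* written only under rnp->lock */
            special |= RCU_READ_UNLOCK_BOOSTED;
            t->rcu_boosted = 0;
    }
    /* ... dequeue from rnp->blkd_tasks, then unlock and unboost ... */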

    Maybe-reported-by: Dave Jones
    Maybe-reported-by: Sergey Senozhatsky
    Reported-by: Steven Rostedt
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Steven Rostedt

    Paul E. McKenney
     

12 Jul, 2011

1 commit

  • fs_excl is a poor man's priority inheritance for filesystems to hint to
    the block layer that an operation is important. It was never clearly
    specified, not widely adopted, and will not prevent starvation in many
    cases (like across cgroups).

    fs_excl was introduced with the time sliced CFQ IO scheduler, to
    indicate when a process held FS exclusive resources and thus needed
    a boost.

    It doesn't cover all file systems, and it was never fully complete.
    Let's kill it.

    Signed-off-by: Justin TerAvest
    Signed-off-by: Jens Axboe

    Justin TerAvest
     

05 Jul, 2011

1 commit

  • Alex reported that commit c8b281161df ("sched: Increase
    SCHED_LOAD_SCALE resolution") caused a power usage regression
    under light load as it increases the number of load-balance
    operations and keeps idle cpus from staying idle.

    Time has run out to find the root cause for this release so
    disable the feature for v3.0 until we can figure out what
    causes the problem.

    Reported-by: "Alex, Shi"
    Signed-off-by: Peter Zijlstra
    Cc: Nikhil Rao
    Cc: Ming Lei
    Cc: Mike Galbraith
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/n/tip-m4onxn0sxnyn5iz9o88eskc3@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

28 Jun, 2011

4 commits

    Change de_thread() to set old_leader->exit_signal = -1. This is good
    for consistency: it is no longer the leader, and all sub-threads have
    exit_signal = -1 set by copy_process(CLONE_THREAD).

    And this allows us to micro-optimize thread_group_leader(), it can
    simply check exit_signal >= 0. This also makes sense because we
    should move ->group_leader from task_struct to signal_struct.
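
    The micro-optimized test then reads (sketch):

    static inline int thread_group_leader(struct task_struct *p)
    {
            /* leaders keep exit_signal >= 0; de_thread() sets it to -1 */
            return p->exit_signal >= 0;
    }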

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Tejun Heo

    Oleg Nesterov
     
    Update the last user of task_detached(), wait_task_zombie(), to use
    thread_group_leader() and kill task_detached().

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Tejun Heo

    Oleg Nesterov
     
  • Change other callers of do_notify_parent() to check the value it
    returns, this makes the subsequent task_detached() unnecessary.
    Mark do_notify_parent() as __must_check.

    Use thread_group_leader() instead of !task_detached() to check
    if we need to notify the real parent in wait_task_zombie().

    Remove the stale comment in release_task(). "just for sanity" is
    no longer true, we have to set EXIT_DEAD to avoid the races with
    do_wait().

    Signed-off-by: Oleg Nesterov
    Acked-by: Tejun Heo

    Oleg Nesterov
     
  • - change do_notify_parent() to return a boolean, true if the task should
    be reaped because its parent ignores SIGCHLD.

    - update the only caller which checks the returned value, exit_notify().

    This temporarily uglifies exit_notify() even more; it will be cleaned
    up by the next change.

    Signed-off-by: Oleg Nesterov
    Acked-by: Tejun Heo

    Oleg Nesterov
     

17 Jun, 2011

3 commits

    The previous patch implemented async notification for ptrace but it
    only worked while the tracee is running. This patch introduces
    PTRACE_LISTEN, which was suggested by Oleg Nesterov.

    It's allowed iff tracee is in STOP trap and puts tracee into
    quasi-running state - tracee never really runs but wait(2) and
    ptrace(2) consider it to be running. While ptracer is listening,
    tracee is allowed to re-enter STOP to notify an async event.
    Listening state is cleared on the first notification. Ptracer can
    also clear it by issuing INTERRUPT - tracee will re-trap into STOP
    with listening state cleared.

    This allows ptracer to monitor group stop state without running tracee
    - use INTERRUPT to put tracee into STOP trap, issue LISTEN and then
    wait(2) to wait for the next group stop event. When it happens,
    PTRACE_GETSIGINFO provides information to determine the current state.

    Test program follows.

    #include <stdio.h>
    #include <unistd.h>
    #include <signal.h>
    #include <time.h>
    #include <sys/ptrace.h>
    #include <sys/wait.h>

    /* new ptrace requests, not yet in <sys/ptrace.h> when this was written */
    #define PTRACE_SEIZE            0x4206
    #define PTRACE_INTERRUPT        0x4207
    #define PTRACE_LISTEN           0x4208

    #define PTRACE_SEIZE_DEVEL      0x80000000

    static const struct timespec ts1s = { .tv_sec = 1 };

    int main(int argc, char **argv)
    {
            pid_t tracee, tracer;
            int i;

            tracee = fork();
            if (!tracee)
                    while (1)
                            pause();

            tracer = fork();
            if (!tracer) {
                    siginfo_t si;

                    ptrace(PTRACE_SEIZE, tracee, NULL,
                           (void *)(unsigned long)PTRACE_SEIZE_DEVEL);
                    ptrace(PTRACE_INTERRUPT, tracee, NULL, NULL);
            repeat:
                    waitid(P_PID, tracee, NULL, WSTOPPED);

                    ptrace(PTRACE_GETSIGINFO, tracee, NULL, &si);
                    if (!si.si_code) {
                            printf("tracer: SIG %d\n", si.si_signo);
                            ptrace(PTRACE_CONT, tracee, NULL,
                                   (void *)(unsigned long)si.si_signo);
                            goto repeat;
                    }
                    printf("tracer: stopped=%d signo=%d\n",
                           si.si_signo != SIGTRAP, si.si_signo);
                    if (si.si_signo != SIGTRAP)
                            ptrace(PTRACE_LISTEN, tracee, NULL, NULL);
                    else
                            ptrace(PTRACE_CONT, tracee, NULL, NULL);
                    goto repeat;
            }

            for (i = 0; i < 3; i++) {
                    nanosleep(&ts1s, NULL);
                    printf("mother: SIGSTOP\n");
                    kill(tracee, SIGSTOP);
                    nanosleep(&ts1s, NULL);
                    printf("mother: SIGCONT\n");
                    kill(tracee, SIGCONT);
            }
            nanosleep(&ts1s, NULL);

            kill(tracer, SIGKILL);
            kill(tracee, SIGKILL);
            return 0;
    }

    This is identical to the program to test TRAP_NOTIFY except that
    tracee is PTRACE_LISTEN'd instead of PTRACE_CONT'd when group stopped.
    This allows ptracer to monitor when group stop ends without running
    tracee.

    # ./test-listen
    tracer: stopped=0 signo=5
    mother: SIGSTOP
    tracer: SIG 19
    tracer: stopped=1 signo=19
    mother: SIGCONT
    tracer: stopped=0 signo=5
    tracer: SIG 18
    mother: SIGSTOP
    tracer: SIG 19
    tracer: stopped=1 signo=19
    mother: SIGCONT
    tracer: stopped=0 signo=5
    tracer: SIG 18
    mother: SIGSTOP
    tracer: SIG 19
    tracer: stopped=1 signo=19
    mother: SIGCONT
    tracer: stopped=0 signo=5
    tracer: SIG 18

    -v2: Moved JOBCTL_LISTENING check in wait_task_stopped() into
    task_stopped_code() as suggested by Oleg.

    Signed-off-by: Tejun Heo
    Cc: Oleg Nesterov

    Tejun Heo
     
  • Currently there's no way for ptracer to find out whether group stop
    finished other than polling with INTERRUPT - GETSIGINFO - CONT
    sequence. This patch implements group stop notification for ptracer
    using STOP traps.

    When group stop state of a seized tracee changes, JOBCTL_TRAP_NOTIFY
    is set, which schedules a STOP trap which is sticky - it isn't cleared
    by other traps and at least one STOP trap will happen eventually.
    STOP trap is synchronization point for event notification and the
    tracer can determine the current group stop state by looking at the
    signal number portion of exit code (si_status from waitid(2) or
    si_code from PTRACE_GETSIGINFO).

    Notifications are generated both on start and end of group stops but,
    because group stop participation always happens before STOP trap, this
    doesn't cause an extra trap while tracee is participating in group
    stop. The symmetry will be useful later.

    Note that this notification works iff tracee is not trapped.
    Currently there is no way to be notified of group stop state changes
    while tracee is trapped. This will be addressed by a later patch.

    An example program follows.

    #include <stdio.h>
    #include <unistd.h>
    #include <signal.h>
    #include <time.h>
    #include <sys/ptrace.h>
    #include <sys/wait.h>

    /* new ptrace requests, not yet in <sys/ptrace.h> when this was written */
    #define PTRACE_SEIZE            0x4206
    #define PTRACE_INTERRUPT        0x4207

    #define PTRACE_SEIZE_DEVEL      0x80000000

    static const struct timespec ts1s = { .tv_sec = 1 };

    int main(int argc, char **argv)
    {
            pid_t tracee, tracer;
            int i;

            tracee = fork();
            if (!tracee)
                    while (1)
                            pause();

            tracer = fork();
            if (!tracer) {
                    siginfo_t si;

                    ptrace(PTRACE_SEIZE, tracee, NULL,
                           (void *)(unsigned long)PTRACE_SEIZE_DEVEL);
                    ptrace(PTRACE_INTERRUPT, tracee, NULL, NULL);
            repeat:
                    waitid(P_PID, tracee, NULL, WSTOPPED);

                    ptrace(PTRACE_GETSIGINFO, tracee, NULL, &si);
                    if (!si.si_code) {
                            printf("tracer: SIG %d\n", si.si_signo);
                            ptrace(PTRACE_CONT, tracee, NULL,
                                   (void *)(unsigned long)si.si_signo);
                            goto repeat;
                    }
                    printf("tracer: stopped=%d signo=%d\n",
                           si.si_signo != SIGTRAP, si.si_signo);
                    ptrace(PTRACE_CONT, tracee, NULL, NULL);
                    goto repeat;
            }

            for (i = 0; i < 3; i++) {
                    nanosleep(&ts1s, NULL);
                    printf("mother: SIGSTOP\n");
                    kill(tracee, SIGSTOP);
                    nanosleep(&ts1s, NULL);
                    printf("mother: SIGCONT\n");
                    kill(tracee, SIGCONT);
            }
            nanosleep(&ts1s, NULL);

            kill(tracer, SIGKILL);
            kill(tracee, SIGKILL);
            return 0;
    }

    In the above program, tracer keeps tracee running and gets
    notification of each group stop state changes.

    # ./test-notify
    tracer: stopped=0 signo=5
    mother: SIGSTOP
    tracer: SIG 19
    tracer: stopped=1 signo=19
    mother: SIGCONT
    tracer: stopped=0 signo=5
    tracer: SIG 18
    mother: SIGSTOP
    tracer: SIG 19
    tracer: stopped=1 signo=19
    mother: SIGCONT
    tracer: stopped=0 signo=5
    tracer: SIG 18
    mother: SIGSTOP
    tracer: SIG 19
    tracer: stopped=1 signo=19
    mother: SIGCONT
    tracer: stopped=0 signo=5
    tracer: SIG 18

    Signed-off-by: Tejun Heo
    Cc: Oleg Nesterov

    Tejun Heo
     
    do_signal_stop() implemented both the normal group stop and the trap
    for group stop while ptraced. This approach has been sufficient, but
    upcoming changes require a trap mechanism that can be used in a more
    generic manner, and reusing the group stop trap as the generic trap
    site simplifies both the userland-visible interface and the
    implementation.

    This patch adds a new jobctl flag - JOBCTL_TRAP_STOP. When set, it
    triggers a trap site, which behaves like group stop trap, in
    get_signal_to_deliver() after checking for pending signals. While
    ptraced, do_signal_stop() doesn't stop itself. It initiates group
    stop if requested and schedules JOBCTL_TRAP_STOP and returns. The
    caller - get_signal_to_deliver() - is responsible for checking whether
    TRAP_STOP is pending afterwards and handling it.

    ptrace_attach() is updated to use JOBCTL_TRAP_STOP instead of
    JOBCTL_STOP_PENDING and __ptrace_unlink() to clear all pending trap
    bits and TRAPPING so that TRAP_STOP and future trap bits don't linger
    after detach.

    While at it, add proper function comment to do_signal_stop() and make
    it return bool.

    -v2: __ptrace_unlink() updated to clear JOBCTL_TRAP_MASK and TRAPPING
    instead of JOBCTL_PENDING_MASK. This avoids accidentally
    clearing JOBCTL_STOP_CONSUME. Spotted by Oleg.

    -v3: do_signal_stop() updated to return %false without dropping
    siglock while ptraced and TRAP_STOP check moved inside for(;;)
    loop after group stop participation. This avoids unnecessary
    relocking and also will help avoiding unnecessary traps by
    consuming group stop before handling pending traps.

    -v4: Jobctl trap handling moved into a separate function -
    do_jobctl_trap().
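
    A sketch of the resulting hook in get_signal_to_deliver(), per the -v4
    note above (placement and names are indicative):

    for (;;) {
            /* ... group stop participation handled first ... */

            if (unlikely(current->jobctl & JOBCTL_TRAP_MASK)) {
                    do_jobctl_trap();       /* behaves like group stop trap */
                    spin_unlock_irq(&sighand->siglock);
                    goto relock;
            }

            /* ... dequeue_signal() and normal delivery ... */
    }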

    Signed-off-by: Tejun Heo
    Cc: Oleg Nesterov

    Tejun Heo
     

10 Jun, 2011

1 commit

    Create a new CONFIG_PREEMPT_COUNT that handles the inc/dec of the
    preempt count independently, so that the count can be updated by
    preempt_disable() and preempt_enable() even without CONFIG_PREEMPT
    being set.

    This prepares for making CONFIG_DEBUG_SPINLOCK_SLEEP work with
    !CONFIG_PREEMPT, where it currently doesn't detect code that sleeps
    inside explicit preemption-disabled sections.
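
    In effect, the preempt-count bookkeeping keys off the new symbol,
    roughly (sketch of the include/linux/preempt.h split):

    #ifdef CONFIG_PREEMPT_COUNT
    #define preempt_disable() \
    do { \
            inc_preempt_count(); \
            barrier(); \
    } while (0)
    #else
    /* no preempt-count bookkeeping at all */
    #define preempt_disable()       do { } while (0)
    #endif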

    Signed-off-by: Frederic Weisbecker
    Acked-by: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Peter Zijlstra

    Frederic Weisbecker
     

05 Jun, 2011

3 commits

  • task->jobctl currently hosts JOBCTL_STOP_PENDING and will host TRAP
    pending bits too. Setting pending conditions on a dying task may make
    the task unkillable. Currently, each setting site is responsible for
    checking for the condition but with to-be-added job control traps this
    becomes too fragile.

    This patch adds task_set_jobctl_pending(), which should be used when
    setting task->jobctl bits to schedule a stop or trap. The function
    performs the following to ease setting pending bits.

    * Sanity checks.

    * If a fatal signal is pending or PF_EXITING is set, no bit is set.

    * STOP_SIGMASK is automatically cleared if a new value is being set.

    do_signal_stop() and ptrace_attach() are updated to use
    task_set_jobctl_pending() instead of setting STOP_PENDING explicitly.
    The surrounding structures around setting are changed to fit
    task_set_jobctl_pending() better but there should be no userland
    visible behavior difference.
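
    A sketch of the helper as described (sanity checks elided):

    /* returns true iff the requested pending bits were actually set */
    bool task_set_jobctl_pending(struct task_struct *task, unsigned int mask)
    {
            /* never make a dying task unkillable */
            if (unlikely(fatal_signal_pending(task) ||
                         (task->flags & PF_EXITING)))
                    return false;

            /* a new stop signal replaces the recorded one */
            if (mask & JOBCTL_STOP_SIGMASK)
                    task->jobctl &= ~JOBCTL_STOP_SIGMASK;

            task->jobctl |= mask;
            return true;
    }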

    Signed-off-by: Tejun Heo
    Cc: Oleg Nesterov
    Signed-off-by: Oleg Nesterov

    Tejun Heo
     
  • This patch introduces JOBCTL_PENDING_MASK and replaces
    task_clear_jobctl_stop_pending() with task_clear_jobctl_pending()
    which takes an extra @mask argument.

    JOBCTL_PENDING_MASK is currently equal to JOBCTL_STOP_PENDING but
    future patches will add more bits. recalc_sigpending_tsk() is updated
    to use JOBCTL_PENDING_MASK instead.

    task_clear_jobctl_pending() takes @mask, which is a subset of
    JOBCTL_PENDING_MASK, and clears the relevant jobctl bits. If
    JOBCTL_STOP_PENDING is set, the other STOP bits are cleared together.
    All task_clear_jobctl_stop_pending() users are updated to call
    task_clear_jobctl_pending() with JOBCTL_STOP_PENDING, which is
    functionally identical to task_clear_jobctl_stop_pending().

    This patch doesn't cause any functional change.

    Signed-off-by: Tejun Heo
    Signed-off-by: Oleg Nesterov

    Tejun Heo
     
    signal->group_stop currently hosts mostly group stop related flags;
    however, it is going to be used for wider purposes and the GROUP_STOP_
    flag prefix becomes confusing. Rename signal->group_stop to
    signal->jobctl and rename all GROUP_STOP_* flags to JOBCTL_*.

    Bit position macros JOBCTL_*_BIT are defined and the JOBCTL_* flags
    are defined in terms of them to allow using bitops later.

    While at it, reassign JOBCTL_TRAPPING to bit 22 to better accommodate
    future additions.
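
    For example, following the scheme above (a sketch; only JOBCTL_TRAPPING
    at bit 22 is stated by the text, the other bit positions are
    illustrative):

    #define JOBCTL_STOP_SIGMASK      0xffff /* signr of the current group stop */
    #define JOBCTL_STOP_DEQUEUED_BIT 16     /* stop signal dequeued */
    #define JOBCTL_STOP_PENDING_BIT  17     /* task should stop for group stop */
    #define JOBCTL_STOP_CONSUME_BIT  18     /* consume group stop count */
    #define JOBCTL_TRAPPING_BIT      22     /* switching to TRACED */

    #define JOBCTL_STOP_DEQUEUED    (1 << JOBCTL_STOP_DEQUEUED_BIT)
    #define JOBCTL_STOP_PENDING     (1 << JOBCTL_STOP_PENDING_BIT)
    #define JOBCTL_STOP_CONSUME     (1 << JOBCTL_STOP_CONSUME_BIT)
    #define JOBCTL_TRAPPING         (1 << JOBCTL_TRAPPING_BIT)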

    This doesn't cause any functional change.

    -v2: JOBCTL_*_BIT macros added as suggested by Linus.

    Signed-off-by: Tejun Heo
    Cc: Linus Torvalds
    Signed-off-by: Oleg Nesterov

    Tejun Heo
     

31 May, 2011

1 commit

    While looking over the code I found that, with the ttwu rework, the
    nr_wakeups_migrate test broke, since we now switch cpus prior to
    calling ttwu_stat() and hence the test is always true.

    Cure this by passing the migration state in wake_flags. Also move the
    whole test under CONFIG_SMP; it's hard to migrate tasks on UP :-)
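
    The fix amounts to something like this (sketch; WF_MIGRATED as a new
    wake flag):

    #define WF_MIGRATED     0x04    /* internal use: task got migrated */

    /* in try_to_wake_up(), where a different CPU is chosen */
    if (task_cpu(p) != cpu) {
            wake_flags |= WF_MIGRATED;
            set_task_cpu(p, cpu);
    }

    /* ttwu_stat() tests the flag instead of comparing CPUs after the fact */
    if (wake_flags & WF_MIGRATED)
            schedstat_inc(p, se.statistics.nr_wakeups_migrate);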

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-pwwxl7gdqs5676f1d4cx6pj7@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra