13 Jan, 2012

1 commit

  • It's a very old and now unused prototype marking so just delete it.

    Neaten panic pointer argument style to keep checkpatch quiet.

    Signed-off-by: Joe Perches
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Haavard Skinnemoen
    Cc: Hans-Christian Egtvedt
    Cc: Tony Luck
    Cc: Fenghua Yu
    Acked-by: Geert Uytterhoeven
    Acked-by: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Chris Metcalf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     

11 Jan, 2012

1 commit

  • * 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: move MIN_WRITEBACK_PAGES to fs-writeback.c
    writeback: balanced_rate cannot exceed write bandwidth
    writeback: do strict bdi dirty_exceeded
    writeback: avoid tiny dirty poll intervals
    writeback: max, min and target dirty pause time
    writeback: dirty ratelimit - think time compensation
    btrfs: fix dirtied pages accounting on sub-page writes
    writeback: fix dirtied pages accounting on redirty
    writeback: fix dirtied pages accounting on sub-page writes
    writeback: charge leaked page dirties to active tasks
    writeback: Include all dirty inodes in background writeback

    Linus Torvalds
     

10 Jan, 2012

1 commit

  • * 'for-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
    cgroup: fix to allow mounting a hierarchy by name
    cgroup: move assignement out of condition in cgroup_attach_proc()
    cgroup: Remove task_lock() from cgroup_post_fork()
    cgroup: add sparse annotation to cgroup_iter_start() and cgroup_iter_end()
    cgroup: mark cgroup_rmdir_waitq and cgroup_attach_proc() as static
    cgroup: only need to check oldcgrp==newgrp once
    cgroup: remove redundant get/put of task struct
    cgroup: remove redundant get/put of old css_set from migrate
    cgroup: Remove unnecessary task_lock before fetching css_set on migration
    cgroup: Drop task_lock(parent) on cgroup_fork()
    cgroups: remove redundant get/put of css_set from css_set_check_fetched()
    resource cgroups: remove bogus cast
    cgroup: kill subsys->can_attach_task(), pre_attach() and attach_task()
    cgroup, cpuset: don't use ss->pre_attach()
    cgroup: don't use subsys->can_attach_task() or ->attach_task()
    cgroup: introduce cgroup_taskset and use it in subsys->can_attach(), cancel_attach() and attach()
    cgroup: improve old cgroup handling in cgroup_attach_proc()
    cgroup: always lock threadgroup during migration
    threadgroup: extend threadgroup_lock() to cover exit and exec
    threadgroup: rename signal->threadgroup_fork_lock to ->group_rwsem
    ...

    Fix up conflict in kernel/cgroup.c due to commit e0197aae59e5: "cgroups:
    fix a css_set not found bug in cgroup_attach_proc" that already
    mentioned that the bug is fixed (differently) in Tejun's cgroup
    patchset. This one, in other words.

    Linus Torvalds
     

09 Jan, 2012

1 commit

  • * 'pm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (76 commits)
    PM / Hibernate: Implement compat_ioctl for /dev/snapshot
    PM / Freezer: fix return value of freezable_schedule_timeout_killable()
    PM / shmobile: Allow the A4R domain to be turned off at run time
    PM / input / touchscreen: Make st1232 use device PM QoS constraints
    PM / QoS: Introduce dev_pm_qos_add_ancestor_request()
    PM / shmobile: Remove the stay_on flag from SH7372's PM domains
    PM / shmobile: Don't include SH7372's INTCS in syscore suspend/resume
    PM / shmobile: Add support for the sh7372 A4S power domain / sleep mode
    PM: Drop generic_subsys_pm_ops
    PM / Sleep: Remove forward-only callbacks from AMBA bus type
    PM / Sleep: Remove forward-only callbacks from platform bus type
    PM: Run the driver callback directly if the subsystem one is not there
    PM / Sleep: Make pm_op() and pm_noirq_op() return callback pointers
    PM/Devfreq: Add Exynos4-bus device DVFS driver for Exynos4210/4212/4412.
    PM / Sleep: Merge internal functions in generic_ops.c
    PM / Sleep: Simplify generic system suspend callbacks
    PM / Hibernate: Remove deprecated hibernation snapshot ioctls
    PM / Sleep: Fix freezer failures due to racy usermodehelper_is_disabled()
    ARM: S3C64XX: Implement basic power domain support
    PM / shmobile: Use common always on power domain governor
    ...

    Fix up trivial conflict in fs/xfs/xfs_buf.c due to removal of unused
    XBT_FORCE_SLEEP bit

    Linus Torvalds
     

07 Jan, 2012

1 commit

  • * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
    sched/tracing: Add a new tracepoint for sleeptime
    sched: Disable scheduler warnings during oopses
    sched: Fix cgroup movement of waking process
    sched: Fix cgroup movement of newly created process
    sched: Fix cgroup movement of forking process
    sched: Remove cfs bandwidth period check in tg_set_cfs_period()
    sched: Fix load-balance lock-breaking
    sched: Replace all_pinned with a generic flags field
    sched: Only queue remote wakeups when crossing cache boundaries
    sched: Add missing rcu_dereference() around ->real_parent usage
    [S390] fix cputime overflow in uptime_proc_show
    [S390] cputime: add sparse checking and cleanup
    sched: Mark parent and real_parent as __rcu
    sched, nohz: Fix missing RCU read lock
    sched, nohz: Set the NOHZ_BALANCE_KICK flag for idle load balancer
    sched, nohz: Fix the idle cpu check in nohz_idle_balance
    sched: Use jump_labels for sched_feat
    sched/accounting: Fix parameter passing in task_group_account_field
    sched/accounting: Fix user/system tick double accounting
    sched/accounting: Re-use scheduler statistics for the root cgroup
    ...

    Fix up conflicts in
    - arch/ia64/include/asm/cputime.h, include/asm-generic/cputime.h
    usecs_to_cputime64() vs the sparse cleanups
    - kernel/sched/fair.c, kernel/time/tick-sched.c
    scheduler changes in multiple branches

    Linus Torvalds
     

20 Dec, 2011

1 commit


18 Dec, 2011

1 commit

  • Compensate the task's think time when computing the final pause time,
    so that ->dirty_ratelimit can be executed accurately.

    think time := time spend outside of balance_dirty_pages()

    In the rare case that the task slept longer than the 200ms period time
    (result in negative pause time), the sleep time will be compensated in
    the following periods, too, if it's less than 1 second.

    Accumulated errors are carefully avoided as long as the max pause area
    is not hitted.

    Pseudo code:

    period = pages_dirtied / task_ratelimit;
    think = jiffies - dirty_paused_when;
    pause = period - think;

    1) normal case: period > think

    pause = period - think
    dirty_paused_when = jiffies + pause
    nr_dirtied = 0

    period time
    |===============================>|
    think time pause time
    |===============>|==============>|
    ------|----------------|---------------|------------------------
    dirty_paused_when jiffies

    2) no pause case: period |
    think time
    |===================================================>|
    ------|--------------------------------+-------------------|----
    dirty_paused_when jiffies

    Acked-by: Jan Kara
    Acked-by: Peter Zijlstra
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

15 Dec, 2011

3 commits


13 Dec, 2011

2 commits

  • threadgroup_lock() protected only protected against new addition to
    the threadgroup, which was inherently somewhat incomplete and
    problematic for its only user cgroup. On-going migration could race
    against exec and exit leading to interesting problems - the symmetry
    between various attach methods, task exiting during method execution,
    ->exit() racing against attach methods, migrating task switching basic
    properties during exec and so on.

    This patch extends threadgroup_lock() such that it protects against
    all three threadgroup altering operations - fork, exit and exec. For
    exit, threadgroup_change_begin/end() calls are added to exit_signals
    around assertion of PF_EXITING. For exec, threadgroup_[un]lock() are
    updated to also grab and release cred_guard_mutex.

    With this change, threadgroup_lock() guarantees that the target
    threadgroup will remain stable - no new task will be added, no new
    PF_EXITING will be set and exec won't happen.

    The next patch will update cgroup so that it can take full advantage
    of this change.

    -v2: beefed up comment as suggested by Frederic.

    -v3: narrowed scope of protection in exit path as suggested by
    Frederic.

    Signed-off-by: Tejun Heo
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Li Zefan
    Acked-by: Frederic Weisbecker
    Cc: Oleg Nesterov
    Cc: Andrew Morton
    Cc: Paul Menage
    Cc: Linus Torvalds

    Tejun Heo
     
  • Make the following renames to prepare for extension of threadgroup
    locking.

    * s/signal->threadgroup_fork_lock/signal->group_rwsem/
    * s/threadgroup_fork_read_lock()/threadgroup_change_begin()/
    * s/threadgroup_fork_read_unlock()/threadgroup_change_end()/
    * s/threadgroup_fork_write_lock()/threadgroup_lock()/
    * s/threadgroup_fork_write_unlock()/threadgroup_unlock()/

    This patch doesn't cause any behavior change.

    -v2: Rename threadgroup_change_done() to threadgroup_change_end() per
    KAMEZAWA's suggestion.

    Signed-off-by: Tejun Heo
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Li Zefan
    Cc: Oleg Nesterov
    Cc: Andrew Morton
    Cc: Paul Menage

    Tejun Heo
     

12 Dec, 2011

1 commit

  • Commit 908a3283 (Fix idle_cpu()) invalidated some uses of idle_cpu(),
    which used to say whether or not the CPU was running the idle task,
    but now instead says whether or not the CPU is running the idle task
    in the absence of pending wakeups. Although this new implementation
    gives a better answer to the question "is this CPU idle?", it also
    invalidates other uses that were made of idle_cpu().

    This commit therefore introduces a new is_idle_task() API member
    that determines whether or not the specified task is one of the
    idle tasks, allowing open-coded "->pid == 0" sequences to be replaced
    by something more meaningful.

    Suggested-by: Josh Triplett
    Suggested-by: Peter Zijlstra
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

07 Dec, 2011

1 commit

  • Commit 69e1e811 ("sched, nohz: Track nr_busy_cpus in the
    sched_group_power") messed up the static inline function definition.

    Signed-off-by: Peter Zijlstra
    Cc: Suresh Siddha
    Link: http://lkml.kernel.org/n/tip-abjah8ctq5qrjjtdiabe8lph@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

06 Dec, 2011

1 commit

  • Introduce nr_busy_cpus in the struct sched_group_power [Not in sched_group
    because sched groups are duplicated for the SD_OVERLAP scheduler domain]
    and for each cpu that enters and exits idle, this parameter will
    be updated in each scheduler group of the scheduler domain that this cpu
    belongs to.

    To avoid the frequent update of this state as the cpu enters
    and exits idle, the update of the stat during idle exit is
    delayed to the first timer tick that happens after the cpu becomes busy.
    This is done using NOHZ_IDLE flag in the struct rq's nohz_flags.

    Signed-off-by: Suresh Siddha
    Signed-off-by: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20111202010832.555984323@sbsiddha-desk.sc.intel.com
    Signed-off-by: Ingo Molnar

    Suresh Siddha
     

24 Nov, 2011

2 commits

  • * 'pm-freezer' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc: (24 commits)
    freezer: fix wait_event_freezable/__thaw_task races
    freezer: kill unused set_freezable_with_signal()
    dmatest: don't use set_freezable_with_signal()
    usb_storage: don't use set_freezable_with_signal()
    freezer: remove unused @sig_only from freeze_task()
    freezer: use lock_task_sighand() in fake_signal_wake_up()
    freezer: restructure __refrigerator()
    freezer: fix set_freezable[_with_signal]() race
    freezer: remove should_send_signal() and update frozen()
    freezer: remove now unused TIF_FREEZE
    freezer: make freezing() test freeze conditions in effect instead of TIF_FREEZE
    cgroup_freezer: prepare for removal of TIF_FREEZE
    freezer: clean up freeze_processes() failure path
    freezer: kill PF_FREEZING
    freezer: test freezable conditions while holding freezer_lock
    freezer: make freezing indicate freeze condition in effect
    freezer: use dedicated lock instead of task_lock() + memory barrier
    freezer: don't distinguish nosig tasks on thaw
    freezer: remove racy clear_freeze_flag() and set PF_NOFREEZE on dead tasks
    freezer: rename thaw_process() to __thaw_task() and simplify the implementation
    ...

    Rafael J. Wysocki
     
  • There's no in-kernel user of set_freezable_with_signal() left. Mixing
    TIF_SIGPENDING with kernel threads can lead to nasty corner cases as
    kernel threads never travel signal delivery path on their own.

    e.g. the current implementation is buggy in the cancelation path of
    __thaw_task(). It calls recalc_sigpending_and_wake() in an attempt to
    clear TIF_SIGPENDING but the function never clears it regardless of
    sigpending state. This means that signallable freezable kthreads may
    continue executing with !freezing() && stuck TIF_SIGPENDING, which can
    be troublesome.

    This patch removes set_freezable_with_signal() along with
    PF_FREEZER_NOSIG and recalc_sigpending*() calls in freezer. User
    tasks get TIF_SIGPENDING, kernel tasks get woken up and the spurious
    sigpending is dealt with in the usual signal delivery path.

    Signed-off-by: Tejun Heo
    Acked-by: Oleg Nesterov

    Tejun Heo
     

22 Nov, 2011

1 commit

  • With the previous changes, there's no meaningful difference between
    PF_FREEZING and PF_FROZEN. Remove PF_FREEZING and use PF_FROZEN
    instead in task_contributes_to_load().

    Signed-off-by: Tejun Heo

    Tejun Heo
     

17 Nov, 2011

2 commits


07 Nov, 2011

1 commit

  • * 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: Add a 'reason' to wb_writeback_work
    writeback: send work item to queue_io, move_expired_inodes
    writeback: trace event balance_dirty_pages
    writeback: trace event bdi_dirty_ratelimit
    writeback: fix ppc compile warnings on do_div(long long, unsigned long)
    writeback: per-bdi background threshold
    writeback: dirty position control - bdi reserve area
    writeback: control dirty pause time
    writeback: limit max dirty pause time
    writeback: IO-less balance_dirty_pages()
    writeback: per task dirty rate limit
    writeback: stabilize bdi->dirty_ratelimit
    writeback: dirty rate control
    writeback: add bg_threshold parameter to __bdi_update_bandwidth()
    writeback: dirty position control
    writeback: account per-bdi accumulated dirtied pages

    Linus Torvalds
     

26 Oct, 2011

3 commits

  • * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
    llist: Add back llist_add_batch() and llist_del_first() prototypes
    sched: Don't use tasklist_lock for debug prints
    sched: Warn on rt throttling
    sched: Unify the ->cpus_allowed mask copy
    sched: Wrap scheduler p->cpus_allowed access
    sched: Request for idle balance during nohz idle load balance
    sched: Use resched IPI to kick off the nohz idle balance
    sched: Fix idle_cpu()
    llist: Remove cpu_relax() usage in cmpxchg loops
    sched: Convert to struct llist
    llist: Add llist_next()
    irq_work: Use llist in the struct irq_work logic
    llist: Return whether list is empty before adding in llist_add()
    llist: Move cpu_relax() to after the cmpxchg()
    llist: Remove the platform-dependent NMI checks
    llist: Make some llist functions inline
    sched, tracing: Show PREEMPT_ACTIVE state in trace_sched_switch
    sched: Remove redundant test in check_preempt_tick()
    sched: Add documentation for bandwidth control
    sched: Return unused runtime on group dequeue
    ...

    Linus Torvalds
     
  • * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
    rcu: Move propagation of ->completed from rcu_start_gp() to rcu_report_qs_rsp()
    rcu: Remove rcu_needs_cpu_flush() to avoid false quiescent states
    rcu: Wire up RCU_BOOST_PRIO for rcutree
    rcu: Make rcu_torture_boost() exit loops at end of test
    rcu: Make rcu_torture_fqs() exit loops at end of test
    rcu: Permit rt_mutex_unlock() with irqs disabled
    rcu: Avoid having just-onlined CPU resched itself when RCU is idle
    rcu: Suppress NMI backtraces when stall ends before dump
    rcu: Prohibit grace periods during early boot
    rcu: Simplify unboosting checks
    rcu: Prevent early boot set_need_resched() from __rcu_pending()
    rcu: Dump local stack if cannot dump all CPUs' stacks
    rcu: Move __rcu_read_unlock()'s barrier() within if-statement
    rcu: Improve rcu_assign_pointer() and RCU_INIT_POINTER() documentation
    rcu: Make rcu_assign_pointer() unconditionally insert a memory barrier
    rcu: Make rcu_implicit_dynticks_qs() locals be correct size
    rcu: Eliminate in_irq() checks in rcu_enter_nohz()
    nohz: Remove nohz_cpu_mask
    rcu: Document interpretation of RCU-lockdep splats
    rcu: Allow rcutorture's stat_interval parameter to be changed at runtime
    ...

    Linus Torvalds
     
  • * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
    rtmutex: Add missing rcu_read_unlock() in debug_rt_mutex_print_deadlock()
    lockdep: Comment all warnings
    lib: atomic64: Change the type of local lock to raw_spinlock_t
    locking, lib/atomic64: Annotate atomic64_lock::lock as raw
    locking, x86, iommu: Annotate qi->q_lock as raw
    locking, x86, iommu: Annotate irq_2_ir_lock as raw
    locking, x86, iommu: Annotate iommu->register_lock as raw
    locking, dma, ipu: Annotate bank_lock as raw
    locking, ARM: Annotate low level hw locks as raw
    locking, drivers/dca: Annotate dca_lock as raw
    locking, powerpc: Annotate uic->lock as raw
    locking, x86: mce: Annotate cmci_discover_lock as raw
    locking, ACPI: Annotate c3_lock as raw
    locking, oprofile: Annotate oprofilefs lock as raw
    locking, video: Annotate vga console lock as raw
    locking, latencytop: Annotate latency_lock as raw
    locking, timer_stats: Annotate table_lock as raw
    locking, rwsem: Annotate inner lock as raw
    locking, semaphores: Annotate inner lock as raw
    locking, sched: Annotate thread_group_cputimer as raw
    ...

    Fix up conflicts in kernel/posix-cpu-timers.c manually: making
    cputimer->cputime a raw lock conflicted with the ABBA fix in commit
    bcd5cff7216f ("cputimer: Cure lock inversion").

    Linus Torvalds
     

25 Oct, 2011

1 commit

  • * 'usb-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (260 commits)
    usb: renesas_usbhs: fixup inconsistent return from usbhs_pkt_push()
    usb/isp1760: Allow to optionally trigger low-level chip reset via GPIOLIB.
    USB: gadget: midi: memory leak in f_midi_bind_config()
    USB: gadget: midi: fix range check in f_midi_out_open()
    QE/FHCI: fixed the CONTROL bug
    usb: renesas_usbhs: tidyup for smatch warnings
    USB: Fix USB Kconfig dependency problem on 85xx/QoirQ platforms
    EHCI: workaround for MosChip controller bug
    usb: gadget: file_storage: fix race on unloading
    USB: ftdi_sio.c: Use ftdi async_icount structure for TIOCMIWAIT, as in other drivers
    USB: ftdi_sio.c:Fill MSR fields of the ftdi async_icount structure
    USB: ftdi_sio.c: Fill LSR fields of the ftdi async_icount structure
    USB: ftdi_sio.c:Fill TX field of the ftdi async_icount structure
    USB: ftdi_sio.c: Fill the RX field of the ftdi async_icount structure
    USB: ftdi_sio.c: Basic icount infrastructure for ftdi_sio
    usb/isp1760: Let OF bindings depend on general CONFIG_OF instead of PPC_OF .
    USB: ftdi_sio: Support TI/Luminary Micro Stellaris BD-ICDI Board
    USB: Fix runtime wakeup on OHCI
    xHCI/USB: Make xHCI driver have a BOS descriptor.
    usb: gadget: add new usb gadget for ACM and mass storage
    ...

    Linus Torvalds
     

04 Oct, 2011

2 commits

  • Use the generic llist primitives.

    We had a private lockless list implementation in the scheduler in the wake-list
    code, now that we have a generic llist implementation that provides all required
    operations, switch to it.

    This patch is not expected to change any behavior.

    Signed-off-by: Peter Zijlstra
    Cc: Huang Ying
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/1315836353.26517.42.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Merge reason: pick up the latest fixes.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

03 Oct, 2011

1 commit

  • Add two fields to task_struct.

    1) account dirtied pages in the individual tasks, for accuracy
    2) per-task balance_dirty_pages() call intervals, for flexibility

    The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will
    scale near-sqrt to the safety gap between dirty pages and threshold.

    The main problem of per-task nr_dirtied is, if 1k+ tasks start dirtying
    pages at exactly the same time, each task will be assigned a large
    initial nr_dirtied_pause, so that the dirty threshold will be exceeded
    long before each task reached its nr_dirtied_pause and hence call
    balance_dirty_pages().

    The solution is to watch for the number of pages dirtied on each CPU in
    between the calls into balance_dirty_pages(). If it exceeds ratelimit_pages
    (3% dirty threshold), force call balance_dirty_pages() for a chance to
    set bdi->dirty_exceeded. In normal situations, this safeguarding
    condition is not expected to trigger at all.

    On the sqrt in dirty_poll_interval():

    It will serve as an initial guess when dirty pages are still in the
    freerun area.

    When dirty pages are floating inside the dirty control scope [freerun,
    limit], a followup patch will use some refined dirty poll interval to
    get the desired pause time.

    thresh-dirty (MB) sqrt
    1 16
    2 22
    4 32
    8 45
    16 64
    32 90
    64 128
    128 181
    256 256
    512 362
    1024 512

    The above table means, given 1MB (or 1GB) gap and the dd tasks polling
    balance_dirty_pages() on every 16 (or 512) pages, the dirty limit won't
    be exceeded as long as there are less than 16 (or 512) concurrent dd's.

    So sqrt naturally leads to less overheads and more safe concurrent tasks
    for large memory servers, which have large (thresh-freerun) gaps.

    peter: keep the per-CPU ratelimit for safeguarding the 1k+ tasks case

    CC: Peter Zijlstra
    Reviewed-by: Andrea Righi
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

30 Sep, 2011

2 commits

  • David reported:

    Attached below is a watered-down version of rt/tst-cpuclock2.c from
    GLIBC. Just build it with "gcc -o test test.c -lpthread -lrt" or
    similar.

    Run it several times, and you will see cases where the main thread
    will measure a process clock difference before and after the nanosleep
    which is smaller than the cpu-burner thread's individual thread clock
    difference. This doesn't make any sense since the cpu-burner thread
    is part of the top-level process's thread group.

    I've reproduced this on both x86-64 and sparc64 (using both 32-bit and
    64-bit binaries).

    For example:

    [davem@boricha build-x86_64-linux]$ ./test
    process: before(0.001221967) after(0.498624371) diff(497402404)
    thread: before(0.000081692) after(0.498316431) diff(498234739)
    self: before(0.001223521) after(0.001240219) diff(16698)
    [davem@boricha build-x86_64-linux]$

    The diff of 'process' should always be >= the diff of 'thread'.

    I make sure to wrap the 'thread' clock measurements the most tightly
    around the nanosleep() call, and that the 'process' clock measurements
    are the outer-most ones.

    ---
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    static pthread_barrier_t barrier;

    static void *chew_cpu(void *arg)
    {
    pthread_barrier_wait(&barrier);
    while (1)
    __asm__ __volatile__("" : : : "memory");
    return NULL;
    }

    int main(void)
    {
    clockid_t process_clock, my_thread_clock, th_clock;
    struct timespec process_before, process_after;
    struct timespec me_before, me_after;
    struct timespec th_before, th_after;
    struct timespec sleeptime;
    unsigned long diff;
    pthread_t th;
    int err;

    err = clock_getcpuclockid(0, &process_clock);
    if (err)
    return 1;

    err = pthread_getcpuclockid(pthread_self(), &my_thread_clock);
    if (err)
    return 1;

    pthread_barrier_init(&barrier, NULL, 2);
    err = pthread_create(&th, NULL, chew_cpu, NULL);
    if (err)
    return 1;

    err = pthread_getcpuclockid(th, &th_clock);
    if (err)
    return 1;

    pthread_barrier_wait(&barrier);

    err = clock_gettime(process_clock, &process_before);
    if (err)
    return 1;

    err = clock_gettime(my_thread_clock, &me_before);
    if (err)
    return 1;

    err = clock_gettime(th_clock, &th_before);
    if (err)
    return 1;

    sleeptime.tv_sec = 0;
    sleeptime.tv_nsec = 500000000;
    nanosleep(&sleeptime, NULL);

    err = clock_gettime(th_clock, &th_after);
    if (err)
    return 1;

    err = clock_gettime(my_thread_clock, &me_after);
    if (err)
    return 1;

    err = clock_gettime(process_clock, &process_after);
    if (err)
    return 1;

    diff = process_after.tv_nsec - process_before.tv_nsec;
    printf("process: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
    process_before.tv_sec, process_before.tv_nsec,
    process_after.tv_sec, process_after.tv_nsec, diff);
    diff = th_after.tv_nsec - th_before.tv_nsec;
    printf("thread: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
    th_before.tv_sec, th_before.tv_nsec,
    th_after.tv_sec, th_after.tv_nsec, diff);
    diff = me_after.tv_nsec - me_before.tv_nsec;
    printf("self: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
    me_before.tv_sec, me_before.tv_nsec,
    me_after.tv_sec, me_after.tv_nsec, diff);

    return 0;
    }

    This is due to us using p->se.sum_exec_runtime in
    thread_group_cputime() where we iterate the thread group and sum all
    data. This does not take time since the last schedule operation (tick
    or otherwise) into account. We can cure this by using
    task_sched_runtime() at the cost of having to take locks.

    This also means we can (and must) do away with
    thread_group_sched_runtime() since the modified thread_group_cputime()
    is now more accurate and would deadlock when called from
    thread_group_sched_runtime().

    Aside of that it makes the function safe on 32 bit systems. The old
    code added t->se.sum_exec_runtime unprotected. sum_exec_runtime is a
    64bit value and could be changed on another cpu at the same time.

    Reported-by: David Miller
    Signed-off-by: Peter Zijlstra
    Cc: stable@kernel.org
    Link: http://lkml.kernel.org/r/1314874459.7945.22.camel@twins
    Tested-by: David Miller
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • Add to the dev_state and alloc_async structures the user namespace
    corresponding to the uid and euid. Pass these to kill_pid_info_as_uid(),
    which can then implement a proper, user-namespace-aware uid check.

    Changelog:
    Sep 20: Per Oleg's suggestion: Instead of caching and passing user namespace,
    uid, and euid each separately, pass a struct cred.
    Sep 26: Address Alan Stern's comments: don't define a struct cred at
    usbdev_open(), and take and put a cred at async_completed() to
    ensure it lasts for the duration of kill_pid_info_as_cred().

    Signed-off-by: Serge Hallyn
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Serge Hallyn
     

29 Sep, 2011

2 commits

  • Commit 7765be (Fix RCU_BOOST race handling current->rcu_read_unlock_special)
    introduced a new ->rcu_boosted field in the task structure. This is
    redundant because the existing ->rcu_boost_mutex will be non-NULL at
    any time that ->rcu_boosted is nonzero. Therefore, this commit removes
    ->rcu_boosted and tests ->rcu_boost_mutex instead.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • RCU no longer uses this global variable, nor does anyone else. This
    commit therefore removes this variable. This reduces memory footprint
    and also removes some atomic instructions and memory barriers from
    the dyntick-idle path.

    Signed-off-by: Alex Shi
    Signed-off-by: Paul E. McKenney

    Shi, Alex
     

13 Sep, 2011

1 commit

  • The thread_group_cputimer lock can be taken in atomic context and therefore
    cannot be preempted on -rt - annotate it.

    In mainline this change documents the low level nature of
    the lock - otherwise there's no functional difference. Lockdep
    and Sparse checking will work as usual.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

14 Aug, 2011

1 commit

  • Account bandwidth usage on the cfs_rq level versus the task_groups to which
    they belong. Whether we are tracking bandwidth on a given cfs_rq is maintained
    under cfs_rq->runtime_enabled.

    cfs_rq's which belong to a bandwidth constrained task_group have their runtime
    accounted via the update_curr() path, which withdraws bandwidth from the global
    pool as desired. Updates involving the global pool are currently protected
    under cfs_bandwidth->lock, local runtime is protected by rq->lock.

    This patch only assigns and tracks quota, no action is taken in the case that
    cfs_rq->runtime_used exceeds cfs_rq->runtime_assigned.

    Signed-off-by: Paul Turner
    Signed-off-by: Nikhil Rao
    Signed-off-by: Bharata B Rao
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.179386821@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     

12 Aug, 2011

1 commit

  • The patch http://lkml.org/lkml/2003/7/13/226 introduced an RLIMIT_NPROC
    check in set_user() to check for NPROC exceeding via setuid() and
    similar functions.

    Before the check there was a possibility to greatly exceed the allowed
    number of processes by an unprivileged user if the program relied on
    rlimit only. But the check created new security threat: many poorly
    written programs simply don't check setuid() return code and believe it
    cannot fail if executed with root privileges. So, the check is removed
    in this patch because of too often privilege escalations related to
    buggy programs.

    The NPROC can still be enforced in the common code flow of daemons
    spawning user processes. Most of daemons do fork()+setuid()+execve().
    The check introduced in execve() (1) enforces the same limit as in
    setuid() and (2) doesn't create similar security issues.

    Neil Brown suggested to track what specific process has exceeded the
    limit by setting PF_NPROC_EXCEEDED process flag. With the change only
    this process would fail on execve(), and other processes' execve()
    behaviour is not changed.

    Solar Designer suggested to re-check whether NPROC limit is still
    exceeded at the moment of execve(). If the process was sleeping for
    days between set*uid() and execve(), and the NPROC counter step down
    under the limit, the defered execve() failure because NPROC limit was
    exceeded days ago would be unexpected. If the limit is not exceeded
    anymore, we clear the flag on successful calls to execve() and fork().

    The flag is also cleared on successful calls to set_user() as the limit
    was exceeded for the previous user, not the current one.

    Similar check was introduced in -ow patches (without the process flag).

    v3 - clear PF_NPROC_EXCEEDED on successful calls to set_user().

    Reviewed-by: James Morris
    Signed-off-by: Vasiliy Kulikov
    Acked-by: NeilBrown
    Signed-off-by: Linus Torvalds

    Vasiliy Kulikov
     

26 Jul, 2011

1 commit

  • * 'for-3.1/core' of git://git.kernel.dk/linux-block: (24 commits)
    block: strict rq_affinity
    backing-dev: use synchronize_rcu_expedited instead of synchronize_rcu
    block: fix patch import error in max_discard_sectors check
    block: reorder request_queue to remove 64 bit alignment padding
    CFQ: add think time check for group
    CFQ: add think time check for service tree
    CFQ: move think time check variables to a separate struct
    fixlet: Remove fs_excl from struct task.
    cfq: Remove special treatment for metadata rqs.
    block: document blk_plug list access
    block: avoid building too big plug list
    compat_ioctl: fix make headers_check regression
    block: eliminate potential for infinite loop in blkdev_issue_discard
    compat_ioctl: fix warning caused by qemu
    block: flush MEDIA_CHANGE from drivers on close(2)
    blk-throttle: Make total_nr_queued unsigned
    block: Add __attribute__((format(printf...) and fix fallout
    fs/partitions/check.c: make local symbols static
    block:remove some spare spaces in genhd.c
    block:fix the comment error in blkdev.h
    ...

    Linus Torvalds
     

23 Jul, 2011

2 commits

  • …/git/tip/linux-2.6-tip

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (24 commits)
    sched: Cleanup duplicate local variable in [enqueue|dequeue]_task_fair
    sched: Replace use of entity_key()
    sched: Separate group-scheduling code more clearly
    sched: Reorder root_domain to remove 64 bit alignment padding
    sched: Do not attempt to destroy uninitialized rt_bandwidth
    sched: Remove unused function cpu_cfs_rq()
    sched: Fix (harmless) typo 'CONFG_FAIR_GROUP_SCHED'
    sched, cgroup: Optimize load_balance_fair()
    sched: Don't update shares twice on on_rq parent
    sched: update correct entity's runtime in check_preempt_wakeup()
    xtensa: Use generic config PREEMPT definition
    h8300: Use generic config PREEMPT definition
    m32r: Use generic PREEMPT config
    sched: Skip autogroup when looking for all rt sched groups
    sched: Simplify mutex_spin_on_owner()
    sched: Remove rcu_read_lock() from wake_affine()
    sched: Generalize sleep inside spinlock detection
    sched: Make sleeping inside spinlock detection working in !CONFIG_PREEMPT
    sched: Isolate preempt counting in its own config option
    sched: Remove pointless in_atomic() definition check
    ...

    Linus Torvalds
     
  • * 'ptrace' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc: (39 commits)
    ptrace: do_wait(traced_leader_killed_by_mt_exec) can block forever
    ptrace: fix ptrace_signal() && STOP_DEQUEUED interaction
    connector: add an event for monitoring process tracers
    ptrace: dont send SIGSTOP on auto-attach if PT_SEIZED
    ptrace: mv send-SIGSTOP from do_fork() to ptrace_init_task()
    ptrace_init_task: initialize child->jobctl explicitly
    has_stopped_jobs: s/task_is_stopped/SIGNAL_STOP_STOPPED/
    ptrace: make former thread ID available via PTRACE_GETEVENTMSG after PTRACE_EVENT_EXEC stop
    ptrace: wait_consider_task: s/same_thread_group/ptrace_reparented/
    ptrace: kill real_parent_is_ptracer() in in favor of ptrace_reparented()
    ptrace: ptrace_reparented() should check same_thread_group()
    redefine thread_group_leader() as exit_signal >= 0
    do not change dead_task->exit_signal
    kill task_detached()
    reparent_leader: check EXIT_DEAD instead of task_detached()
    make do_notify_parent() __must_check, update the callers
    __ptrace_detach: avoid task_detached(), check do_notify_parent()
    kill tracehook_notify_death()
    make do_notify_parent() return bool
    ptrace: s/tracehook_tracer_task()/ptrace_parent()/
    ...

    Linus Torvalds
     

22 Jul, 2011

1 commit


21 Jul, 2011

1 commit

  • …l/git/tip/linux-2.6-tip

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    signal: align __lock_task_sighand() irq disabling and RCU
    softirq,rcu: Inform RCU of irq_exit() activity
    sched: Add irq_{enter,exit}() to scheduler_ipi()
    rcu: protect __rcu_read_unlock() against scheduler-using irq handlers
    rcu: Streamline code produced by __rcu_read_unlock()
    rcu: Fix RCU_BOOST race handling current->rcu_read_unlock_special
    rcu: decrease rcu_report_exp_rnp coupling with scheduler

    Linus Torvalds