13 Jan, 2012

1 commit

  • It's a very old and now unused prototype marking so just delete it.

    Neaten panic pointer argument style to keep checkpatch quiet.

    Signed-off-by: Joe Perches
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Haavard Skinnemoen
    Cc: Hans-Christian Egtvedt
    Cc: Tony Luck
    Cc: Fenghua Yu
    Acked-by: Geert Uytterhoeven
    Acked-by: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Chris Metcalf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     

11 Jan, 2012

1 commit

  • * 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: move MIN_WRITEBACK_PAGES to fs-writeback.c
    writeback: balanced_rate cannot exceed write bandwidth
    writeback: do strict bdi dirty_exceeded
    writeback: avoid tiny dirty poll intervals
    writeback: max, min and target dirty pause time
    writeback: dirty ratelimit - think time compensation
    btrfs: fix dirtied pages accounting on sub-page writes
    writeback: fix dirtied pages accounting on redirty
    writeback: fix dirtied pages accounting on sub-page writes
    writeback: charge leaked page dirties to active tasks
    writeback: Include all dirty inodes in background writeback

    Linus Torvalds
     

09 Jan, 2012

1 commit

  • * 'pm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (76 commits)
    PM / Hibernate: Implement compat_ioctl for /dev/snapshot
    PM / Freezer: fix return value of freezable_schedule_timeout_killable()
    PM / shmobile: Allow the A4R domain to be turned off at run time
    PM / input / touchscreen: Make st1232 use device PM QoS constraints
    PM / QoS: Introduce dev_pm_qos_add_ancestor_request()
    PM / shmobile: Remove the stay_on flag from SH7372's PM domains
    PM / shmobile: Don't include SH7372's INTCS in syscore suspend/resume
    PM / shmobile: Add support for the sh7372 A4S power domain / sleep mode
    PM: Drop generic_subsys_pm_ops
    PM / Sleep: Remove forward-only callbacks from AMBA bus type
    PM / Sleep: Remove forward-only callbacks from platform bus type
    PM: Run the driver callback directly if the subsystem one is not there
    PM / Sleep: Make pm_op() and pm_noirq_op() return callback pointers
    PM/Devfreq: Add Exynos4-bus device DVFS driver for Exynos4210/4212/4412.
    PM / Sleep: Merge internal functions in generic_ops.c
    PM / Sleep: Simplify generic system suspend callbacks
    PM / Hibernate: Remove deprecated hibernation snapshot ioctls
    PM / Sleep: Fix freezer failures due to racy usermodehelper_is_disabled()
    ARM: S3C64XX: Implement basic power domain support
    PM / shmobile: Use common always on power domain governor
    ...

    Fix up trivial conflict in fs/xfs/xfs_buf.c due to removal of unused
    XBT_FORCE_SLEEP bit

    Linus Torvalds
     

07 Jan, 2012

1 commit

  • * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
    sched/tracing: Add a new tracepoint for sleeptime
    sched: Disable scheduler warnings during oopses
    sched: Fix cgroup movement of waking process
    sched: Fix cgroup movement of newly created process
    sched: Fix cgroup movement of forking process
    sched: Remove cfs bandwidth period check in tg_set_cfs_period()
    sched: Fix load-balance lock-breaking
    sched: Replace all_pinned with a generic flags field
    sched: Only queue remote wakeups when crossing cache boundaries
    sched: Add missing rcu_dereference() around ->real_parent usage
    [S390] fix cputime overflow in uptime_proc_show
    [S390] cputime: add sparse checking and cleanup
    sched: Mark parent and real_parent as __rcu
    sched, nohz: Fix missing RCU read lock
    sched, nohz: Set the NOHZ_BALANCE_KICK flag for idle load balancer
    sched, nohz: Fix the idle cpu check in nohz_idle_balance
    sched: Use jump_labels for sched_feat
    sched/accounting: Fix parameter passing in task_group_account_field
    sched/accounting: Fix user/system tick double accounting
    sched/accounting: Re-use scheduler statistics for the root cgroup
    ...

    Fix up conflicts in
    - arch/ia64/include/asm/cputime.h, include/asm-generic/cputime.h
    usecs_to_cputime64() vs the sparse cleanups
    - kernel/sched/fair.c, kernel/time/tick-sched.c
    scheduler changes in multiple branches

    Linus Torvalds
     

05 Jan, 2012

1 commit

  • Test-case:

    int main(void)
    {
    int pid, status;

    pid = fork();
    if (!pid) {
    for (;;) {
    if (!fork())
    return 0;
    if (waitpid(-1, &status, 0) < 0) {
    printf("ERR!! wait: %m\n");
    return 0;
    }
    }
    }

    assert(ptrace(PTRACE_ATTACH, pid, 0,0) == 0);
    assert(waitpid(-1, NULL, 0) == pid);

    assert(ptrace(PTRACE_SETOPTIONS, pid, 0,
    PTRACE_O_TRACEFORK) == 0);

    do {
    ptrace(PTRACE_CONT, pid, 0, 0);
    pid = waitpid(-1, NULL, 0);
    } while (pid > 0);

    return 1;
    }

    It fails because ->real_parent sees its child in EXIT_DEAD state
    while the tracer is going to change the state back to EXIT_ZOMBIE
    in wait_task_zombie().

    The offending commit is 823b018e which moved the EXIT_DEAD check,
    but in fact we should not blame it. The original code was not
    correct as well because it didn't take ptrace_reparented() into
    account and because we can't really trust ->ptrace.

    This patch adds the additional check to close this particular
    race but it doesn't solve the whole problem. We simply can't
    rely on ->ptrace in this case, it can be cleared if the tracer
    is multithreaded by the exiting ->parent.

    I think we should kill EXIT_DEAD altogether, we should always
    remove the soon-to-be-reaped child from ->children or at least
    we should never do the DEAD->ZOMBIE transition. But this is too
    complex for 3.2.

    Reported-and-tested-by: Denys Vlasenko
    Tested-by: Lukasz Michalik
    Acked-by: Tejun Heo
    Cc: [3.0+]
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

18 Dec, 2011

1 commit

  • It's a years long problem that a large number of short-lived dirtiers
    (eg. gcc instances in a fast kernel build) may starve long-run dirtiers
    (eg. dd) as well as pushing the dirty pages to the global hard limit.

    The solution is to charge the pages dirtied by the exited gcc to the
    other random dirtying tasks. It sounds not perfect, however should
    behave good enough in practice, seeing as that throttled tasks aren't
    actually running so those that are running are more likely to pick it up
    and get throttled, therefore promoting an equal spread.

    Randy: fix compile error: 'dirty_throttle_leaks' undeclared in exit.c

    Acked-by: Jan Kara
    Acked-by: Peter Zijlstra
    Signed-off-by: Randy Dunlap
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

15 Dec, 2011

1 commit


22 Nov, 2011

1 commit


01 Nov, 2011

1 commit

  • This removes mm->oom_disable_count entirely since it's unnecessary and
    currently buggy. The counter was intended to be per-process but it's
    currently decremented in the exit path for each thread that exits, causing
    it to underflow.

    The count was originally intended to prevent oom killing threads that
    share memory with threads that cannot be killed since it doesn't lead to
    future memory freeing. The counter could be fixed to represent all
    threads sharing the same mm, but it's better to remove the count since:

    - it is possible that the OOM_DISABLE thread sharing memory with the
    victim is waiting on that thread to exit and will actually cause
    future memory freeing, and

    - there is no guarantee that a thread is disabled from oom killing just
    because another thread sharing its mm is oom disabled.

    Signed-off-by: David Rientjes
    Reported-by: Oleg Nesterov
    Reviewed-by: Oleg Nesterov
    Cc: Ying Han
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

27 Jul, 2011

1 commit

  • Add support for the shm_rmid_forced sysctl. If set to 1, all shared
    memory objects in current ipc namespace will be automatically forced to
    use IPC_RMID.

    The POSIX way of handling shmem allows one to create shm objects and
    call shmdt(), leaving shm object associated with no process, thus
    consuming memory not counted via rlimits.

    With shm_rmid_forced=1 the shared memory object is counted at least for
    one process, so OOM killer may effectively kill the fat process holding
    the shared memory.

    It obviously breaks POSIX - some programs relying on the feature would
    stop working. So set shm_rmid_forced=1 only if you're sure nobody uses
    "orphaned" memory. Use shm_rmid_forced=0 by default for compatability
    reasons.

    The feature was previously impemented in -ow as a configure option.

    [akpm@linux-foundation.org: fix documentation, per Randy]
    [akpm@linux-foundation.org: fix warning]
    [akpm@linux-foundation.org: readability/conventionality tweaks]
    [akpm@linux-foundation.org: fix shm_rmid_forced/shm_forced_rmid confusion, use standard comment layout]
    Signed-off-by: Vasiliy Kulikov
    Cc: Randy Dunlap
    Cc: "Eric W. Biederman"
    Cc: "Serge E. Hallyn"
    Cc: Daniel Lezcano
    Cc: Oleg Nesterov
    Cc: Tejun Heo
    Cc: Ingo Molnar
    Cc: Alan Cox
    Cc: Solar Designer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasiliy Kulikov
     

26 Jul, 2011

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits)
    fs: Merge split strings
    treewide: fix potentially dangerous trailing ';' in #defined values/expressions
    uwb: Fix misspelling of neighbourhood in comment
    net, netfilter: Remove redundant goto in ebt_ulog_packet
    trivial: don't touch files that are removed in the staging tree
    lib/vsprintf: replace link to Draft by final RFC number
    doc: Kconfig: `to be' -> `be'
    doc: Kconfig: Typo: square -> squared
    doc: Konfig: Documentation/power/{pm => apm-acpi}.txt
    drivers/net: static should be at beginning of declaration
    drivers/media: static should be at beginning of declaration
    drivers/i2c: static should be at beginning of declaration
    XTENSA: static should be at beginning of declaration
    SH: static should be at beginning of declaration
    MIPS: static should be at beginning of declaration
    ARM: static should be at beginning of declaration
    rcu: treewide: Do not use rcu_read_lock_held when calling rcu_dereference_check
    Update my e-mail address
    PCIe ASPM: forcedly -> forcibly
    gma500: push through device driver tree
    ...

    Fix up trivial conflicts:
    - arch/arm/mach-ep93xx/dma-m2p.c (deleted)
    - drivers/gpio/gpio-ep93xx.c (renamed and context nearby)
    - drivers/net/r8169.c (just context changes)

    Linus Torvalds
     
  • * 'for-3.1/core' of git://git.kernel.dk/linux-block: (24 commits)
    block: strict rq_affinity
    backing-dev: use synchronize_rcu_expedited instead of synchronize_rcu
    block: fix patch import error in max_discard_sectors check
    block: reorder request_queue to remove 64 bit alignment padding
    CFQ: add think time check for group
    CFQ: add think time check for service tree
    CFQ: move think time check variables to a separate struct
    fixlet: Remove fs_excl from struct task.
    cfq: Remove special treatment for metadata rqs.
    block: document blk_plug list access
    block: avoid building too big plug list
    compat_ioctl: fix make headers_check regression
    block: eliminate potential for infinite loop in blkdev_issue_discard
    compat_ioctl: fix warning caused by qemu
    block: flush MEDIA_CHANGE from drivers on close(2)
    blk-throttle: Make total_nr_queued unsigned
    block: Add __attribute__((format(printf...) and fix fallout
    fs/partitions/check.c: make local symbols static
    block:remove some spare spaces in genhd.c
    block:fix the comment error in blkdev.h
    ...

    Linus Torvalds
     

23 Jul, 2011

1 commit

  • * 'ptrace' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc: (39 commits)
    ptrace: do_wait(traced_leader_killed_by_mt_exec) can block forever
    ptrace: fix ptrace_signal() && STOP_DEQUEUED interaction
    connector: add an event for monitoring process tracers
    ptrace: dont send SIGSTOP on auto-attach if PT_SEIZED
    ptrace: mv send-SIGSTOP from do_fork() to ptrace_init_task()
    ptrace_init_task: initialize child->jobctl explicitly
    has_stopped_jobs: s/task_is_stopped/SIGNAL_STOP_STOPPED/
    ptrace: make former thread ID available via PTRACE_GETEVENTMSG after PTRACE_EVENT_EXEC stop
    ptrace: wait_consider_task: s/same_thread_group/ptrace_reparented/
    ptrace: kill real_parent_is_ptracer() in in favor of ptrace_reparented()
    ptrace: ptrace_reparented() should check same_thread_group()
    redefine thread_group_leader() as exit_signal >= 0
    do not change dead_task->exit_signal
    kill task_detached()
    reparent_leader: check EXIT_DEAD instead of task_detached()
    make do_notify_parent() __must_check, update the callers
    __ptrace_detach: avoid task_detached(), check do_notify_parent()
    kill tracehook_notify_death()
    make do_notify_parent() return bool
    ptrace: s/tracehook_tracer_task()/ptrace_parent()/
    ...

    Linus Torvalds
     

18 Jul, 2011

1 commit

  • has_stopped_jobs() naively checks task_is_stopped(group_leader). This
    was always wrong even without ptrace, group_leader can be dead. And
    given that ptrace can change the state to TRACED this is wrong even
    in the single-threaded case.

    Change the code to check SIGNAL_STOP_STOPPED and simplify the code,
    retval + break/continue doesn't make this trivial code more readable.

    We could probably add the usual "|| signal->group_stop_count" check
    but I don't think this makes sense, the task can start the group-stop
    right after the check anyway.

    Signed-off-by: Oleg Nesterov
    Acked-by: Tejun Heo

    Oleg Nesterov
     

12 Jul, 2011

1 commit

  • fs_excl is a poor man's priority inheritance for filesystems to hint to
    the block layer that an operation is important. It was never clearly
    specified, not widely adopted, and will not prevent starvation in many
    cases (like across cgroups).

    fs_excl was introduced with the time sliced CFQ IO scheduler, to
    indicate when a process held FS exclusive resources and thus needed
    a boost.

    It doesn't cover all file systems, and it was never fully complete.
    Lets kill it.

    Signed-off-by: Justin TerAvest
    Signed-off-by: Jens Axboe

    Justin TerAvest
     

11 Jul, 2011

1 commit


09 Jul, 2011

1 commit


28 Jun, 2011

6 commits

  • wait_consider_task() checks same_thread_group(parent, real_parent),
    this is the open-coded ptrace_reparented().

    __ptrace_detach() remains the only function which has to check this by
    hand, although we could reorganize the code to delay __ptrace_unlink.

    Signed-off-by: Oleg Nesterov
    Acked-by: Tejun Heo

    Oleg Nesterov
     
  • Upadate the last user of task_detached(), wait_task_zombie(), to
    use thread_group_leader() and kill task_detached().

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Tejun Heo

    Oleg Nesterov
     
  • Change reparent_leader() to check ->exit_state instead of ->exit_signal,
    this matches the similar EXIT_DEAD check in wait_consider_task() and
    allows us to cleanup the do_notify_parent/task_detached logic.

    task_detached() was really needed during reparenting before 9cd80bbb
    "do_wait() optimization: do not place sub-threads on ->children list"
    to filter out the sub-threads. After this change task_detached(p) can
    only be true if p is the dead group_leader and its parent ignores
    SIGCHLD, in this case the caller of do_notify_parent() is going to
    reap this task and it should set EXIT_DEAD.

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Tejun Heo

    Oleg Nesterov
     
  • Change other callers of do_notify_parent() to check the value it
    returns, this makes the subsequent task_detached() unnecessary.
    Mark do_notify_parent() as __must_check.

    Use thread_group_leader() instead of !task_detached() to check
    if we need to notify the real parent in wait_task_zombie().

    Remove the stale comment in release_task(). "just for sanity" is
    no longer true, we have to set EXIT_DEAD to avoid the races with
    do_wait().

    Signed-off-by: Oleg Nesterov
    Acked-by: Tejun Heo

    Oleg Nesterov
     
  • Kill tracehook_notify_death(), reimplement the logic in its caller,
    exit_notify().

    Also, change the exec_id's check to use thread_group_leader() instead
    of task_detached(), this is more clear. This logic only applies to
    the exiting leader, a sub-thread must never change its exit_signal.

    Note: when the traced group leader exits the exit_signal-or-SIGCHLD
    logic looks really strange:

    - we notify the tracer even if !thread_group_empty() but
    do_wait(WEXITED) can't work until all threads exit

    - if the tracer is real_parent, it is not clear why can't
    we use ->exit_signal event if !thread_group_empty()

    -v2: do not try to fix the 2nd oddity to avoid the subtle behavior
    change mixed with reorganization, suggested by Tejun.

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Tejun Heo

    Oleg Nesterov
     
  • - change do_notify_parent() to return a boolean, true if the task should
    be reaped because its parent ignores SIGCHLD.

    - update the only caller which checks the returned value, exit_notify().

    This temporary uglifies exit_notify() even more, will be cleanuped by
    the next change.

    Signed-off-by: Oleg Nesterov
    Acked-by: Tejun Heo

    Oleg Nesterov
     

23 Jun, 2011

2 commits

  • At this point, tracehooks aren't useful to mainline kernel and mostly
    just add an extra layer of obfuscation. Although they have comments,
    without actual in-kernel users, it is difficult to tell what are their
    assumptions and they're actually trying to achieve. To mainline
    kernel, they just aren't worth keeping around.

    This patch kills the following trivial tracehooks.

    * Ones testing whether task is ptraced. Replace with ->ptrace test.

    tracehook_expect_breakpoints()
    tracehook_consider_ignored_signal()
    tracehook_consider_fatal_signal()

    * ptrace_event() wrappers. Call directly.

    tracehook_report_exec()
    tracehook_report_exit()
    tracehook_report_vfork_done()

    * ptrace_release_task() wrapper. Call directly.

    tracehook_finish_release_task()

    * noop

    tracehook_prepare_release_task()
    tracehook_report_death()

    This doesn't introduce any behavior change.

    Signed-off-by: Tejun Heo
    Cc: Christoph Hellwig
    Cc: Martin Schwidefsky
    Signed-off-by: Oleg Nesterov

    Tejun Heo
     
  • task_ptrace(task) simply dereferences task->ptrace and isn't even used
    consistently only adding confusion. Kill it and directly access
    ->ptrace instead.

    This doesn't introduce any behavior change.

    Signed-off-by: Tejun Heo
    Signed-off-by: Oleg Nesterov

    Tejun Heo
     

17 Jun, 2011

1 commit

  • The previous patch implemented async notification for ptrace but it
    only worked while trace is running. This patch introduces
    PTRACE_LISTEN which is suggested by Oleg Nestrov.

    It's allowed iff tracee is in STOP trap and puts tracee into
    quasi-running state - tracee never really runs but wait(2) and
    ptrace(2) consider it to be running. While ptracer is listening,
    tracee is allowed to re-enter STOP to notify an async event.
    Listening state is cleared on the first notification. Ptracer can
    also clear it by issuing INTERRUPT - tracee will re-trap into STOP
    with listening state cleared.

    This allows ptracer to monitor group stop state without running tracee
    - use INTERRUPT to put tracee into STOP trap, issue LISTEN and then
    wait(2) to wait for the next group stop event. When it happens,
    PTRACE_GETSIGINFO provides information to determine the current state.

    Test program follows.

    #define PTRACE_SEIZE 0x4206
    #define PTRACE_INTERRUPT 0x4207
    #define PTRACE_LISTEN 0x4208

    #define PTRACE_SEIZE_DEVEL 0x80000000

    static const struct timespec ts1s = { .tv_sec = 1 };

    int main(int argc, char **argv)
    {
    pid_t tracee, tracer;
    int i;

    tracee = fork();
    if (!tracee)
    while (1)
    pause();

    tracer = fork();
    if (!tracer) {
    siginfo_t si;

    ptrace(PTRACE_SEIZE, tracee, NULL,
    (void *)(unsigned long)PTRACE_SEIZE_DEVEL);
    ptrace(PTRACE_INTERRUPT, tracee, NULL, NULL);
    repeat:
    waitid(P_PID, tracee, NULL, WSTOPPED);

    ptrace(PTRACE_GETSIGINFO, tracee, NULL, &si);
    if (!si.si_code) {
    printf("tracer: SIG %d\n", si.si_signo);
    ptrace(PTRACE_CONT, tracee, NULL,
    (void *)(unsigned long)si.si_signo);
    goto repeat;
    }
    printf("tracer: stopped=%d signo=%d\n",
    si.si_signo != SIGTRAP, si.si_signo);
    if (si.si_signo != SIGTRAP)
    ptrace(PTRACE_LISTEN, tracee, NULL, NULL);
    else
    ptrace(PTRACE_CONT, tracee, NULL, NULL);
    goto repeat;
    }

    for (i = 0; i < 3; i++) {
    nanosleep(&ts1s, NULL);
    printf("mother: SIGSTOP\n");
    kill(tracee, SIGSTOP);
    nanosleep(&ts1s, NULL);
    printf("mother: SIGCONT\n");
    kill(tracee, SIGCONT);
    }
    nanosleep(&ts1s, NULL);

    kill(tracer, SIGKILL);
    kill(tracee, SIGKILL);
    return 0;
    }

    This is identical to the program to test TRAP_NOTIFY except that
    tracee is PTRACE_LISTEN'd instead of PTRACE_CONT'd when group stopped.
    This allows ptracer to monitor when group stop ends without running
    tracee.

    # ./test-listen
    tracer: stopped=0 signo=5
    mother: SIGSTOP
    tracer: SIG 19
    tracer: stopped=1 signo=19
    mother: SIGCONT
    tracer: stopped=0 signo=5
    tracer: SIG 18
    mother: SIGSTOP
    tracer: SIG 19
    tracer: stopped=1 signo=19
    mother: SIGCONT
    tracer: stopped=0 signo=5
    tracer: SIG 18
    mother: SIGSTOP
    tracer: SIG 19
    tracer: stopped=1 signo=19
    mother: SIGCONT
    tracer: stopped=0 signo=5
    tracer: SIG 18

    -v2: Moved JOBCTL_LISTENING check in wait_task_stopped() into
    task_stopped_code() as suggested by Oleg.

    Signed-off-by: Tejun Heo
    Cc: Oleg Nesterov

    Tejun Heo
     

16 Jun, 2011

1 commit

  • The following crash was reported:

    > Call Trace:
    > [] mem_cgroup_from_task+0x15/0x17
    > [] __mem_cgroup_try_charge+0x148/0x4b4
    > [] ? need_resched+0x23/0x2d
    > [] ? preempt_schedule+0x46/0x4f
    > [] mem_cgroup_charge_common+0x9a/0xce
    > [] mem_cgroup_newpage_charge+0x5d/0x5f
    > [] khugepaged+0x5da/0xfaf
    > [] ? __init_waitqueue_head+0x4b/0x4b
    > [] ? add_mm_counter.constprop.5+0x13/0x13
    > [] kthread+0xa8/0xb0
    > [] ? sub_preempt_count+0xa1/0xb4
    > [] kernel_thread_helper+0x4/0x10
    > [] ? retint_restore_args+0x13/0x13
    > [] ? __init_kthread_worker+0x5a/0x5a

    What happens is that khugepaged tries to charge a huge page against an mm
    whose last possible owner has already exited, and the memory controller
    crashes when the stale mm->owner is used to look up the cgroup to charge.

    mm->owner has never been set to NULL with the last owner going away, but
    nobody cared until khugepaged came along.

    Even then it wasn't a problem because the final mmput() on an mm was
    forced to acquire and release mmap_sem in write-mode, preventing an
    exiting owner to go away while the mmap_sem was held, and until "692e0b3
    mm: thp: optimize memcg charge in khugepaged", the memory cgroup charge
    was protected by mmap_sem in read-mode.

    Instead of going back to relying on the mmap_sem to enforce lifetime of a
    task, this patch ensures that mm->owner is properly set to NULL when the
    last possible owner is exiting, which the memory controller can handle
    just fine.

    [akpm@linux-foundation.org: tweak comments]
    Signed-off-by: Hugh Dickins
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Johannes Weiner
    Reported-by: Hugh Dickins
    Reported-by: Dave Jones
    Reviewed-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

21 May, 2011

1 commit

  • * 'ptrace' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc: (41 commits)
    signal: trivial, fix the "timespec declared inside parameter list" warning
    job control: reorganize wait_task_stopped()
    ptrace: fix signal->wait_chldexit usage in task_clear_group_stop_trapping()
    signal: sys_sigprocmask() needs retarget_shared_pending()
    signal: cleanup sys_sigprocmask()
    signal: rename signandsets() to sigandnsets()
    signal: do_sigtimedwait() needs retarget_shared_pending()
    signal: introduce do_sigtimedwait() to factor out compat/native code
    signal: sys_rt_sigtimedwait: simplify the timeout logic
    signal: cleanup sys_rt_sigprocmask()
    x86: signal: sys_rt_sigreturn() should use set_current_blocked()
    x86: signal: handle_signal() should use set_current_blocked()
    signal: sigprocmask() should do retarget_shared_pending()
    signal: sigprocmask: narrow the scope of ->siglock
    signal: retarget_shared_pending: optimize while_each_thread() loop
    signal: retarget_shared_pending: consider shared/unblocked signals only
    signal: introduce retarget_shared_pending()
    ptrace: ptrace_check_attach() should not do s/STOPPED/TRACED/
    signal: Turn SIGNAL_STOP_DEQUEUED into GROUP_STOP_DEQUEUED
    signal: do_signal_stop: Remove the unneeded task_clear_group_stop_pending()
    ...

    Linus Torvalds
     

14 May, 2011

1 commit

  • wait_task_stopped() tested task_stopped_code() without acquiring
    siglock and, if stop condition existed, called wait_task_stopped() and
    directly returned the result. This patch moves the initial
    task_stopped_code() testing into wait_task_stopped() and make
    wait_consider_task() fall through to wait_task_continue() on 0 return.

    This is for the following two reasons.

    * Because the initial task_stopped_code() test is done without
    acquiring siglock, it may race against SIGCONT generation. The
    stopped condition might have been replaced by continued state by the
    time wait_task_stopped() acquired siglock. This may lead to
    unexpected failure of WNOHANG waits.

    This reorganization addresses this single race case but there are
    other cases - TASK_RUNNING -> TASK_STOPPED transition and EXIT_*
    transitions.

    * Scheduled ptrace updates require changes to the initial test which
    would fit better inside wait_task_stopped().

    Signed-off-by: Tejun Heo
    Reviewed-by: Oleg Nesterov
    Signed-off-by: Oleg Nesterov

    Tejun Heo
     

25 Apr, 2011

1 commit

  • When a task is traced and is in a stopped state, the tracer
    may execute a ptrace request to examine the tracee state and
    get its task struct. Right after, the tracee can be killed
    and thus its breakpoints released.
    This can happen concurrently when the tracer is in the middle
    of reading or modifying these breakpoints, leading to dereferencing
    a freed pointer.

    Hence, to prepare the fix, create a generic breakpoint reference
    holding API. When a reference on the breakpoints of a task is
    held, the breakpoints won't be released until the last reference
    is dropped. After that, no more ptrace request on the task's
    breakpoints can be serviced for the tracer.

    Reported-by: Oleg Nesterov
    Signed-off-by: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Will Deacon
    Cc: Prasad
    Cc: Paul Mundt
    Cc: v2.6.33..
    Link: http://lkml.kernel.org/r/1302284067-7860-2-git-send-email-fweisbec@gmail.com

    Frederic Weisbecker
     

08 Apr, 2011

1 commit


31 Mar, 2011

1 commit


23 Mar, 2011

3 commits

  • Currently a real parent can't access job control stopped/continued
    events through a ptraced child. This utterly breaks job control when
    the children are ptraced.

    For example, if a program is run from an interactive shell and then
    strace(1) attaches to it, pressing ^Z would send SIGTSTP and strace(1)
    would notice it but the shell has no way to tell whether the child
    entered job control stop and thus can't tell when to take over the
    terminal - leading to awkward lone ^Z on the terminal.

    Because the job control and ptrace stopped states are independent,
    there is no reason to prevent real parents from accessing the stopped
    state regardless of ptrace. The continued state isn't separate but
    ptracers don't have any use for them as ptracees can never resume
    without explicit command from their ptracers, so as long as ptracers
    don't consume it, it should be fine.

    Although this is a behavior change, because the previous behavior is
    utterly broken when viewed from real parents and the change is only
    visible to real parents, I don't think it's necessary to make this
    behavior optional.

    One situation to be careful about is when a task from the real
    parent's group is ptracing. The parent group is the recipient of both
    ptrace and job control stop events and one stop can be reported as
    both job control and ptrace stops. As this can break the current
    ptrace users, suppress job control stopped events for these cases.

    If a real parent ptracer wants to know about both job control and
    ptrace stops, it can create a separate process to serve the role of
    real parent.

    Note that this only updates wait(2) side of things. The real parent
    can access the states via wait(2) but still is not properly notified
    (woken up and delivered signal). Test case polls wait(2) with WNOHANG
    to work around. Notification will be updated by future patches.

    Test case follows.

    #include
    #include
    #include
    #include
    #include
    #include
    #include

    int main(void)
    {
    const struct timespec ts100ms = { .tv_nsec = 100000000 };
    pid_t tracee, tracer;
    siginfo_t si;
    int i;

    tracee = fork();
    if (tracee == 0) {
    while (1) {
    printf("tracee: SIGSTOP\n");
    raise(SIGSTOP);
    nanosleep(&ts100ms, NULL);
    printf("tracee: SIGCONT\n");
    raise(SIGCONT);
    nanosleep(&ts100ms, NULL);
    }
    }

    waitid(P_PID, tracee, &si, WSTOPPED | WNOHANG | WNOWAIT);

    tracer = fork();
    if (tracer == 0) {
    nanosleep(&ts100ms, NULL);
    ptrace(PTRACE_ATTACH, tracee, NULL, NULL);

    for (i = 0; i < 11; i++) {
    si.si_pid = 0;
    waitid(P_PID, tracee, &si, WSTOPPED);
    if (si.si_pid && si.si_code == CLD_TRAPPED)
    ptrace(PTRACE_CONT, tracee, NULL,
    (void *)(long)si.si_status);
    }
    printf("tracer: EXITING\n");
    return 0;
    }

    while (1) {
    si.si_pid = 0;
    waitid(P_PID, tracee, &si,
    WSTOPPED | WCONTINUED | WEXITED | WNOHANG);
    if (si.si_pid)
    printf("mommy : WAIT status=%02d code=%02d\n",
    si.si_status, si.si_code);
    nanosleep(&ts100ms, NULL);
    }
    return 0;
    }

    Before the patch, while ptraced, the parent can't see any job control
    events.

    tracee: SIGSTOP
    mommy : WAIT status=19 code=05
    tracee: SIGCONT
    tracee: SIGSTOP
    tracee: SIGCONT
    tracee: SIGSTOP
    tracee: SIGCONT
    tracee: SIGSTOP
    tracer: EXITING
    mommy : WAIT status=19 code=05
    ^C

    After the patch,

    tracee: SIGSTOP
    mommy : WAIT status=19 code=05
    tracee: SIGCONT
    mommy : WAIT status=18 code=06
    tracee: SIGSTOP
    mommy : WAIT status=19 code=05
    tracee: SIGCONT
    mommy : WAIT status=18 code=06
    tracee: SIGSTOP
    mommy : WAIT status=19 code=05
    tracee: SIGCONT
    mommy : WAIT status=18 code=06
    tracee: SIGSTOP
    tracer: EXITING
    mommy : WAIT status=19 code=05
    ^C

    -v2: Oleg pointed out that wait(2) should be suppressed for the real
    parent's group instead of only the real parent task itself.
    Updated accordingly.

    Signed-off-by: Tejun Heo
    Acked-by: Oleg Nesterov

    Tejun Heo
     
  • wait(2) and friends allow access to stopped/continued states through
    zombies, which is required as the states are process-wide and should
    be accessible whether the leader task is alive or undead.
    wait_consider_task() implements this by always clearing notask_error
    and going through wait_task_stopped/continued() for unreaped zombies.

    However, while ptraced, the stopped state is per-task and as such if
    the ptracee became a zombie, there's no further stopped event to
    listen to and wait(2) and friends should return -ECHILD on the tracee.

    Fix it by clearing notask_error only if WCONTINUED | WEXITED is set
    for ptraced zombies. While at it, document why clearing notask_error
    is safe for each case.

    Test case follows.

    #include
    #include
    #include
    #include
    #include
    #include
    #include

    static void *nooper(void *arg)
    {
    pause();
    return NULL;
    }

    int main(void)
    {
    const struct timespec ts1s = { .tv_sec = 1 };
    pid_t tracee, tracer;
    siginfo_t si;

    tracee = fork();
    if (tracee == 0) {
    pthread_t thr;

    pthread_create(&thr, NULL, nooper, NULL);
    nanosleep(&ts1s, NULL);
    printf("tracee exiting\n");
    pthread_exit(NULL); /* let subthread run */
    }

    tracer = fork();
    if (tracer == 0) {
    ptrace(PTRACE_ATTACH, tracee, NULL, NULL);
    while (1) {
    if (waitid(P_PID, tracee, &si, WSTOPPED) < 0) {
    perror("waitid");
    break;
    }
    ptrace(PTRACE_CONT, tracee, NULL,
    (void *)(long)si.si_status);
    }
    return 0;
    }

    waitid(P_PID, tracer, &si, WEXITED);
    kill(tracee, SIGKILL);
    return 0;
    }

    Before the patch, after the tracee becomes a zombie, the tracer's
    waitid(WSTOPPED) never returns and the program doesn't terminate.

    tracee exiting
    ^C

    After the patch, tracee exiting triggers waitid() to fail.

    tracee exiting
    waitid: No child processes

    -v2: Oleg pointed out that exited in addition to continued can happen
    for ptraced dead group leader. Clear notask_error for ptraced
    child on WEXITED too.

    Signed-off-by: Tejun Heo
    Acked-by: Oleg Nesterov

    Tejun Heo
     
  • Move EXIT_DEAD test in wait_consider_task() above ptrace check. As
    ptraced tasks can't be EXIT_DEAD, this change doesn't cause any
    behavior change. This is to prepare for further changes.

    Signed-off-by: Tejun Heo
    Acked-by: Oleg Nesterov

    Tejun Heo
     

10 Mar, 2011

1 commit

  • This patch adds support for creating a queuing context outside
    of the queue itself. This enables us to batch up pieces of IO
    before grabbing the block device queue lock and submitting them to
    the IO scheduler.

    The context is created on the stack of the process and assigned in
    the task structure, so that we can auto-unplug it if we hit a schedule
    event.

    The current queue plugging happens implicitly if IO is submitted to
    an empty device, yet callers have to remember to unplug that IO when
    they are going to wait for it. This is an ugly API and has caused bugs
    in the past. Additionally, it requires hacks in the vm (->sync_page()
    callback) to handle that logic. By switching to an explicit plugging
    scheme we make the API a lot nicer and can get rid of the ->sync_page()
    hack in the vm.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

12 Jan, 2011

1 commit

  • …/git/tip/linux-2.6-tip

    * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (28 commits)
    perf session: Fix infinite loop in __perf_session__process_events
    perf evsel: Support perf_evsel__open(cpus > 1 && threads > 1)
    perf sched: Use PTHREAD_STACK_MIN to avoid pthread_attr_setstacksize() fail
    perf tools: Emit clearer message for sys_perf_event_open ENOENT return
    perf stat: better error message for unsupported events
    perf sched: Fix allocation result check
    perf, x86: P4 PMU - Fix unflagged overflows handling
    dynamic debug: Fix build issue with older gcc
    tracing: Fix TRACE_EVENT power tracepoint creation
    tracing: Fix preempt count leak
    tracepoint: Add __rcu annotation
    tracing: remove duplicate null-pointer check in skb tracepoint
    tracing/trivial: Add missing comma in TRACE_EVENT comment
    tracing: Include module.h in define_trace.h
    x86: Save rbp in pt_regs on irq entry
    x86, dumpstack: Fix unused variable warning
    x86, NMI: Clean-up default_do_nmi()
    x86, NMI: Allow NMI reason io port (0x61) to be processed on any CPU
    x86, NMI: Remove DIE_NMI_IPI
    x86, NMI: Add priorities to handlers
    ...

    Linus Torvalds
     

07 Jan, 2011

1 commit

  • In particular this patch move perf_event_exit_task() before
    cgroup_exit() to allow for cgroup support. The cgroup_exit()
    function detaches the cgroups attached to a task.

    Other movements include hoisting some definitions and inlines
    at the top of perf_event.c

    Signed-off-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Stephane Eranian
     

17 Dec, 2010

1 commit

  • __get_cpu_var() can be replaced with this_cpu_read and will then use a
    single read instruction with implied address calculation to access the
    correct per cpu instance.

    However, the address of a per cpu variable passed to __this_cpu_read()
    cannot be determined (since it's an implied address conversion through
    segment prefixes). Therefore apply this only to uses of __get_cpu_var
    where the address of the variable is not used.

    Cc: Pekka Enberg
    Cc: Hugh Dickins
    Cc: Thomas Gleixner
    Acked-by: H. Peter Anvin
    Signed-off-by: Christoph Lameter
    Signed-off-by: Tejun Heo

    Christoph Lameter
     

03 Dec, 2010

1 commit

  • If a user manages to trigger an oops with fs set to KERNEL_DS, fs is not
    otherwise reset before do_exit(). do_exit may later (via mm_release in
    fork.c) do a put_user to a user-controlled address, potentially allowing
    a user to leverage an oops into a controlled write into kernel memory.

    This is only triggerable in the presence of another bug, but this
    potentially turns a lot of DoS bugs into privilege escalations, so it's
    worth fixing. I have proof-of-concept code which uses this bug along
    with CVE-2010-3849 to write a zero to an arbitrary kernel address, so
    I've tested that this is not theoretical.

    A more logical place to put this fix might be when we know an oops has
    occurred, before we call do_exit(), but that would involve changing
    every architecture, in multiple places.

    Let's just stick it in do_exit instead.

    [akpm@linux-foundation.org: update code comment]
    Signed-off-by: Nelson Elhage
    Cc: KOSAKI Motohiro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nelson Elhage