15 Jan, 2012

1 commit

  • * 'for-linus' of git://selinuxproject.org/~jmorris/linux-security:
    capabilities: remove __cap_full_set definition
    security: remove the security_netlink_recv hook as it is equivalent to capable()
    ptrace: do not audit capability check when outputting /proc/pid/stat
    capabilities: remove task_ns_* functions
    capabilities: ns_capable can use the cap helpers rather than lsm call
    capabilities: style only - move capable below ns_capable
    capabilities: introduce new has_ns_capabilities_noaudit
    capabilities: call has_ns_capability from has_capability
    capabilities: remove all _real_ interfaces
    capabilities: introduce security_capable_noaudit
    capabilities: reverse arguments to security_capable
    capabilities: remove the task from capable LSM hook entirely
    selinux: sparse fix: fix several warnings in the security server code
    selinux: sparse fix: fix warnings in netlink code
    selinux: sparse fix: eliminate warnings for selinuxfs
    selinux: sparse fix: declare selinux_disable() in security.h
    selinux: sparse fix: move selinux_complete_init
    selinux: sparse fix: make selinux_secmark_refcount static
    SELinux: Fix RCU deref check warning in sel_netport_insert()

    Manually fix up a semantic mis-merge wrt security_netlink_recv():

    - the interface was removed in commit fd7784615248 ("security: remove
    the security_netlink_recv hook as it is equivalent to capable()")

    - a new user of it appeared in commit a38f7907b926 ("crypto: Add
    userspace configuration API")

    causing no automatic merge conflict, but Eric Paris pointed out the
    issue.

    Linus Torvalds
     

13 Jan, 2012

1 commit

  • The mm->start_code/end_code, mm->start_data/end_data and mm->start_brk
    members are involved in the calculation of the program text/data segment
    sizes (which can be seen in /proc/<pid>/statm) and in the final address
    of the brk() call.

    For restore we need to know all these values. While mm->start_code/end_code
    are already present in /proc/$pid/stat, the remaining members are not, so
    this patch brings them in.

    The restore procedure of these members is addressed in another patch using
    prctl().

    Signed-off-by: Cyrill Gorcunov
    Acked-by: Serge Hallyn
    Reviewed-by: Kees Cook
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Alexey Dobriyan
    Cc: Tejun Heo
    Cc: Andrew Vagin
    Cc: Vasiliy Kulikov
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     

06 Jan, 2012

1 commit

  • Reading /proc/pid/stat of another process checks whether one has ptrace
    permissions on that process. If one does have permission, it outputs some
    data about the process which might have security and attack implications.
    If the current task does not have ptrace permission the read still works,
    but those fields are filled with innocuous (0) values. Since this check
    and the subsequent denial is not a violation of the security policy, we
    should not audit such denials.

    This can be quite useful when removing ptrace rights broadly across a
    system without flooding the logs when ps is run or something else
    harmlessly walks proc.
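    A sketch of what the check in fs/proc/array.c looks like after this
    change (the PTRACE_MODE_NOAUDIT flag is the "check but don't audit"
    mechanism this series introduces; treat the exact spelling as an
    assumption):

        /* may the caller see the sensitive fields? deny quietly if not */
        permitted = ptrace_may_access(task,
                                      PTRACE_MODE_READ | PTRACE_MODE_NOAUDIT);

        /* denied readers get innocuous zeroes instead of an audit record */
        eip = permitted ? KSTK_EIP(task) : 0;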

    Signed-off-by: Eric Paris
    Acked-by: Serge E. Hallyn

    Eric Paris
     

15 Dec, 2011

1 commit


23 Jun, 2011

1 commit


27 May, 2011

1 commit


24 Mar, 2011

1 commit

  • While mm->start_stack was protected from cross-uid viewing (commit
    f83ce3e6b02d5 ("proc: avoid information leaks to non-privileged
    processes")), the start_code and end_code values were not. This would
    allow the text location of a PIE binary to leak, defeating ASLR.

    Note that the value "1" is used instead of "0" for a protected value since
    "ps", "killall", and likely other readers of /proc/pid/stat, take
    start_code of "0" to mean a kernel thread and will misbehave. Thanks to
    Brad Spengler for pointing this out.
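    In do_task_stat() the protection could look like the following sketch
    (format string abbreviated; "permitted" is the existing ptrace check):

        seq_printf(m, "... %lu %lu ...",
                   mm ? (permitted ? mm->start_code : 1) : 0, /* 1, not 0: */
                   mm ? (permitted ? mm->end_code : 1) : 0);  /* 0 means kernel thread */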

    Addresses CVE-2011-0726

    Signed-off-by: Kees Cook
    Cc: Alexey Dobriyan
    Cc: David Howells
    Cc: Eugene Teo
    Cc: Martin Schwidefsky
    Cc: Brad Spengler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

15 Feb, 2011

1 commit

  • task_show_regs used to be a debugging aid in the early bringup days
    of Linux on s390. /proc/<pid>/status is a world-readable file; it
    is not a good idea to show the registers of a process there. The only
    correct fix is to remove task_show_regs.

    Reported-by: Al Viro
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Linus Torvalds

    Martin Schwidefsky
     

14 Jan, 2011

2 commits

  • For string without format specifiers, use seq_puts().
    For seq_printf("\n"), use seq_putc('\n').

    text data bss dec hex filename
    61866 488 112 62466 f402 fs/proc/proc.o
    61729 488 112 62329 f379 fs/proc/proc.o
    ----------------------------------------------------
    -139
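    An illustrative before/after (hypothetical strings; the patch touches
    many such call sites in fs/proc/):

        /* before: printf-style calls with no format specifiers */
        seq_printf(m, "Threads:\t");
        seq_printf(m, "\n");

        /* after: cheaper, skips format parsing entirely */
        seq_puts(m, "Threads:\t");
        seq_putc(m, '\n');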

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • /proc/*/statm code needlessly truncates data from unsigned long to int.
    One needs only 8+ TB of RAM to make truncation visible.
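    The fix is essentially a type change; a sketch of the task_statm()
    signature before and after (parameter names assumed):

        /* before: 8+ TB of RAM overflows these */
        int task_statm(struct mm_struct *mm, int *shared, int *text,
                       int *data, int *resident);

        /* after: full width, matching the mm counters */
        unsigned long task_statm(struct mm_struct *mm, unsigned long *shared,
                                 unsigned long *text, unsigned long *data,
                                 unsigned long *resident);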

    Signed-off-by: Alexey Dobriyan
    Reviewed-by: WANG Cong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

30 Jul, 2010

1 commit

  • It's possible for get_task_cred() as it currently stands to 'corrupt' a set of
    credentials by incrementing their usage count after their replacement by the
    task being accessed.

    What happens is that get_task_cred() can race with commit_creds():

    TASK_1                        TASK_2                      RCU_CLEANER
    -->get_task_cred(TASK_2)
    rcu_read_lock()
    __cred = __task_cred(TASK_2)
                                  -->commit_creds()
                                  old_cred = TASK_2->real_cred
                                  TASK_2->real_cred = ...
                                  put_cred(old_cred)
                                    call_rcu(old_cred)
    [__cred->usage == 0]
    get_cred(__cred)
    [__cred->usage == 1]
    rcu_read_unlock()
                                                              -->put_cred_rcu()
                                                              [__cred->usage == 1]
                                                              panic()

    However, since a task's credentials are generally not changed very often,
    we can reasonably make use of a loop involving reading the creds pointer
    and using atomic_inc_not_zero() to attempt to increment it if it hasn't
    already hit zero.

    If successful, we can safely return the credentials in the knowledge that, even
    if the task we're accessing has released them, they haven't gone to the RCU
    cleanup code.

    We then change task_state() in procfs to use get_task_cred() rather than
    calling get_cred() on the result of __task_cred(), as that suffers from the
    same problem.

    Without this change, a BUG_ON in __put_cred() or in put_cred_rcu() can be
    tripped when it is noticed that the usage count is not zero as it ought to be,
    for example:

    kernel BUG at kernel/cred.c:168!
    invalid opcode: 0000 [#1] SMP
    last sysfs file: /sys/kernel/mm/ksm/run
    CPU 0
    Pid: 2436, comm: master Not tainted 2.6.33.3-85.fc13.x86_64 #1 0HR330/OptiPlex
    745
    RIP: 0010:[] [] __put_cred+0xc/0x45
    RSP: 0018:ffff88019e7e9eb8 EFLAGS: 00010202
    RAX: 0000000000000001 RBX: ffff880161514480 RCX: 00000000ffffffff
    RDX: 00000000ffffffff RSI: ffff880140c690c0 RDI: ffff880140c690c0
    RBP: ffff88019e7e9eb8 R08: 00000000000000d0 R09: 0000000000000000
    R10: 0000000000000001 R11: 0000000000000040 R12: ffff880140c690c0
    R13: ffff88019e77aea0 R14: 00007fff336b0a5c R15: 0000000000000001
    FS: 00007f12f50d97c0(0000) GS:ffff880007400000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f8f461bc000 CR3: 00000001b26ce000 CR4: 00000000000006f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process master (pid: 2436, threadinfo ffff88019e7e8000, task ffff88019e77aea0)
    Stack:
    ffff88019e7e9ec8 ffffffff810698cd ffff88019e7e9ef8 ffffffff81069b45
    ffff880161514180 ffff880161514480 ffff880161514180 0000000000000000
    ffff88019e7e9f28 ffffffff8106aace 0000000000000001 0000000000000246
    Call Trace:
    [] put_cred+0x13/0x15
    [] commit_creds+0x16b/0x175
    [] set_current_groups+0x47/0x4e
    [] sys_setgroups+0xf6/0x105
    [] system_call_fastpath+0x16/0x1b
    Code: 48 8d 71 ff e8 7e 4e 15 00 85 c0 78 0b 8b 75 ec 48 89 df e8 ef 4a 15 00
    48 83 c4 18 5b c9 c3 55 8b 07 8b 07 48 89 e5 85 c0 74 04 0b eb fe 65 48 8b
    04 25 00 cc 00 00 48 3b b8 58 04 00 00 75
    RIP [] __put_cred+0xc/0x45
    RSP
    ---[ end trace df391256a100ebdd ]---

    Signed-off-by: David Howells
    Acked-by: Jiri Olsa
    Signed-off-by: Linus Torvalds

    David Howells
     

28 May, 2010

1 commit

  • Now that task->signal can't go away get_nr_threads() doesn't need
    ->siglock to read signal->count.

    Also, make it inline, move into sched.h, and convert 2 other proc users of
    signal->count to use this (now trivial) helper.

    Henceforth get_nr_threads() is the only valid user of signal->count; we
    are ready to turn it into "int nr_threads" or, perhaps, kill it.
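    A sketch of the now-trivial helper (assuming signal->count is still an
    atomic_t at this point):

        static inline int get_nr_threads(struct task_struct *tsk)
        {
                return atomic_read(&tsk->signal->count);
        }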

    Signed-off-by: Oleg Nesterov
    Cc: Alexey Dobriyan
    Cc: David Howells
    Cc: "Eric W. Biederman"
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

12 May, 2010

1 commit

  • Originally, commit d899bf7b ("procfs: provide stack information for
    threads") attempted to introduce a new feature for showing where the
    thread stack was located and how many pages are being utilized by the
    stack.

    Commit c44972f1 ("procfs: disable per-task stack usage on NOMMU") was
    applied to fix the NO_MMU case.

    Commit 89240ba0 ("x86, fs: Fix x86 procfs stack information for threads on
    64-bit") was applied to fix a bug in ia32 executables being loaded.

    Commit 9ebd4eba7 ("procfs: fix /proc/<pid>/stat stack pointer for kernel
    threads") was applied to fix a bug which had kernel threads printing a
    userland stack address.

    Commit 1306d603f ('proc: partially revert "procfs: provide stack
    information for threads"') was then applied to revert the stack pages
    being used to solve a significant performance regression.

    This patch nearly undoes the effect of all these patches.

    The reason for reverting these is that they provide an unusable value in
    field 28. For x86_64, a fork will result in the task->stack_start
    value being updated to the current user top of stack and not the stack
    start address. This unpredictability of the stack_start value makes
    it worthless. That includes the intended use of showing how much stack
    space a thread has.

    Other architectures will get different values. As an example, ia64
    gets 0. The do_fork() and copy_process() functions appear to treat the
    stack_start and stack_size parameters as architecture specific.

    I only partially reverted c44972f1 ("procfs: disable per-task stack usage
    on NOMMU"). If I had completely reverted it, I would have had to change
    mm/Makefile to only build pagewalk.o when CONFIG_PROC_PAGE_MONITOR is
    configured. Since I could not test the builds without significant effort,
    I decided to not change mm/Makefile.

    I only partially reverted 89240ba0 ("x86, fs: Fix x86 procfs stack
    information for threads on 64-bit"). I left the KSTK_ESP() change in
    place as that seemed worthwhile.

    Signed-off-by: Robin Holt
    Cc: Stefani Seibold
    Cc: KOSAKI Motohiro
    Cc: Michal Simek
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
     

30 Mar, 2010

1 commit

  • include cleanup: update gfp.h and slab.h includes to prepare for breaking
    implicit slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    The percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include those
    headers directly instead of assuming their availability. As this
    conversion needs to touch a large number of source files, the following
    script is used as the basis of the conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following:

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there, i.e. if only gfp is used,
    gfp.h; if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to place the new include so that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree, or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them, as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored, as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers, which should be easily discoverable on most builds of the
    specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

07 Mar, 2010

1 commit

  • Make sure compiler won't do weird things with limits. E.g. fetching them
    twice may return 2 different values after writable limits are implemented.

    I.e. either use rlimit helpers added in commit 3e10e716abf3 ("resource:
    add helpers for fetching rlimits") or ACCESS_ONCE if not applicable.
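    For example (the RLIMIT choices here are arbitrary illustrations):

        /* fetch the limit exactly once via the new helper ... */
        unsigned long cpu_limit = rlimit(RLIMIT_CPU);

        /* ... or, where no helper applies, force a single read */
        unsigned long fsize =
                ACCESS_ONCE(current->signal->rlim[RLIMIT_FSIZE].rlim_cur);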

    Signed-off-by: Jiri Slaby
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     

25 Feb, 2010

1 commit

  • Add lockdep-ified RCU primitives to alloc_fd(), files_fdtable()
    and fcheck_files().

    Cc: Alexander Viro
    Signed-off-by: Paul E. McKenney
    Cc: laijs@cn.fujitsu.com
    Cc: dipankar@in.ibm.com
    Cc: mathieu.desnoyers@polymtl.ca
    Cc: josh@joshtriplett.org
    Cc: dvhltc@us.ibm.com
    Cc: niv@us.ibm.com
    Cc: peterz@infradead.org
    Cc: rostedt@goodmis.org
    Cc: Valdis.Kletnieks@vt.edu
    Cc: dhowells@redhat.com
    Cc: Alexander Viro
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
     

12 Jan, 2010

1 commit

  • Commit d899bf7b (procfs: provide stack information for threads) introduced
    showing stack information in /proc/{pid}/status, but it caused a large
    performance regression. Unfortunately /proc/{pid}/status is used by the
    ps command too, and ps is one of the most important components. Both
    taking mmap_sem and walking the page tables are heavy operations.

    If many processes are running, ps performance looks like this:

    [before d899bf7b]

    % perf stat ps >/dev/null

    Performance counter stats for 'ps':

    4090.435806 task-clock-msecs # 0.032 CPUs
    229 context-switches # 0.000 M/sec
    0 CPU-migrations # 0.000 M/sec
    234 page-faults # 0.000 M/sec
    8587565207 cycles # 2099.425 M/sec
    9866662403 instructions # 1.149 IPC
    3789415411 cache-references # 926.409 M/sec
    30419509 cache-misses # 7.437 M/sec

    128.859521955 seconds time elapsed

    [after d899bf7b]

    % perf stat ps > /dev/null

    Performance counter stats for 'ps':

    4305.081146 task-clock-msecs # 0.028 CPUs
    480 context-switches # 0.000 M/sec
    2 CPU-migrations # 0.000 M/sec
    237 page-faults # 0.000 M/sec
    9021211334 cycles # 2095.480 M/sec
    10605887536 instructions # 1.176 IPC
    3612650999 cache-references # 839.160 M/sec
    23917502 cache-misses # 5.556 M/sec

    152.277819582 seconds time elapsed

    Thus, this patch reverts it. Fortunately /proc/{pid}/task/{tid}/smaps
    provides almost the same information; we can use that instead.

    Commit d899bf7b introduced two features:

    1) Add the [thread stack: xxxx] annotation to
    /proc/{pid}/task/{tid}/maps.
    2) Add the StackUsage field to /proc/{pid}/status.

    I only revert (2), because I haven't seen (1) cause a regression.

    Signed-off-by: KOSAKI Motohiro
    Cc: Stefani Seibold
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Cc: Randy Dunlap
    Cc: Andrew Morton
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

17 Dec, 2009

2 commits


06 Dec, 2009

1 commit

  • Merge branch 'sched-core-for-linus' of
    git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (35 commits)
    sched, cputime: Introduce thread_group_times()
    sched, cputime: Cleanups related to task_times()
    Revert "sched, x86: Optimize branch hint in __switch_to()"
    sched: Fix isolcpus boot option
    sched: Revert 498657a478c60be092208422fefa9c7b248729c2
    sched, time: Define nsecs_to_jiffies()
    sched: Remove task_{u,s,g}time()
    sched: Introduce task_times() to replace task_{u,s}time() pair
    sched: Limit the number of scheduler debug messages
    sched.c: Call debug_show_all_locks() when dumping all tasks
    sched, x86: Optimize branch hint in __switch_to()
    sched: Optimize branch hint in context_switch()
    sched: Optimize branch hint in pick_next_task_fair()
    sched_feat_write(): Update ppos instead of file->f_pos
    sched: Sched_rt_periodic_timer vs cpu hotplug
    sched, kvm: Fix race condition involving sched_in_preempt_notifers
    sched: More generic WAKE_AFFINE vs select_idle_sibling()
    sched: Cleanup select_task_rq_fair()
    sched: Fix granularity of task_u/stime()
    sched: Fix/add missing update_rq_clock() calls
    ...

    Linus Torvalds
     

03 Dec, 2009

1 commit

  • This is a real fix for the problem of utime/stime values decreasing,
    described in the thread:

    http://lkml.org/lkml/2009/11/3/522

    Now cputime is accounted in the following way:

    - {u,s}time in task_struct are increased every time the thread
    is interrupted by a tick (timer interrupt).

    - When a thread exits, its {u,s}time are added to signal->{u,s}time,
    after being adjusted by task_times().

    - When all threads in a thread_group exit, the accumulated {u,s}time
    (and also c{u,s}time) in the signal struct are added to c{u,s}time
    in the signal struct of the group's parent.

    So {u,s}time in task struct are "raw" tick count, while
    {u,s}time and c{u,s}time in signal struct are "adjusted" values.

    And accounted values are used by:

    - task_times(), to get the cputime of a thread:
    This function returns adjusted values that originate from the raw
    {u,s}time, scaled by the sum_exec_runtime accounted by CFS.

    - thread_group_cputime(), to get the cputime of a thread group:
    This function returns the sum of all {u,s}time of living threads in
    the group, plus the {u,s}time in the signal struct, which is the sum
    of adjusted cputimes of all exited threads that belonged to the group.

    The problem is the return value of thread_group_cputime(),
    because it is a mixed sum of "raw" and "adjusted" values:

    group's {u,s}time = foreach(thread){{u,s}time} + exited({u,s}time)

    This misbehavior can break {u,s}time monotonicity.
    Assume there is a thread that has raw values greater than its
    adjusted values (e.g. interrupted by 1000Hz ticks 50 times but
    having run for only 45ms); if it exits, the group's cputime will
    decrease (e.g. by 5ms).

    To fix this, we could do:

    group's {u,s}time = foreach(t){task_times(t)} + exited({u,s}time)

    But task_times() contains hard divisions, so applying it for
    every thread should be avoided.

    This patch fixes the above problem in the following way:

    - Modify thread exit (= __exit_signal()) not to use task_times().
    This means {u,s}time in the signal struct accumulate raw values
    instead of adjusted values. As a result, thread_group_cputime()
    returns a pure sum of "raw" values.

    - Introduce a new function thread_group_times(*task, *utime, *stime)
    that converts the "raw" values of thread_group_cputime() to "adjusted"
    values, using the same calculation procedure as task_times() (see the
    sketch after this list).

    - Modify group exit (= wait_task_zombie()) to use the newly introduced
    thread_group_times(). This makes c{u,s}time in the signal struct
    hold adjusted values, as before this patch.

    - Replace some thread_group_cputime() calls by thread_group_times().
    These replacements are applied only where the "adjusted" cputime is
    conveyed to users, and where task_times() is already used nearby
    (i.e. sys_times(), getrusage(), and /proc/<pid>/stat).
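    A sketch of what such a helper could look like, modeled on the
    description above and on task_times() (helper names such as
    nsecs_to_cputime and the prev_{u,s}time fields in signal_struct follow
    the v2 note below; treat the exact arithmetic as an approximation):

        void thread_group_times(struct task_struct *p,
                                cputime_t *ut, cputime_t *st)
        {
                struct signal_struct *sig = p->signal;
                struct task_cputime cputime;
                cputime_t rtime, utime, total;

                thread_group_cputime(p, &cputime); /* pure sum of raw values */

                total = cputime_add(cputime.utime, cputime.stime);
                rtime = nsecs_to_cputime(cputime.sum_exec_runtime);

                if (total) {
                        u64 temp = (u64)rtime * cputime.utime;

                        do_div(temp, total);  /* scale by runtime share */
                        utime = (cputime_t)temp;
                } else
                        utime = rtime;

                /* keep the reported values monotonic */
                sig->prev_utime = max(sig->prev_utime, utime);
                sig->prev_stime = max(sig->prev_stime,
                                      cputime_sub(rtime, sig->prev_utime));

                *ut = sig->prev_utime;
                *st = sig->prev_stime;
        }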

    This patch has a positive side effect:

    - Before this patch, if a group contained many short-lived threads
    (e.g. each running 0.9ms and never interrupted by ticks), the group's
    cputime could be invisible, since each thread's cputime was accumulated
    after being adjusted: imagine the adjustment function as
    adj(ticks, runtime); {adj(0, 0.9) + adj(0, 0.9) + ....} = {0 + 0 + ....} = 0.
    After this patch this will not happen, because the adjustment is
    applied after accumulation.

    v2:
    - remove if()s, put new variables into signal_struct.

    Signed-off-by: Hidetoshi Seto
    Acked-by: Peter Zijlstra
    Cc: Spencer Candland
    Cc: Americo Wang
    Cc: Oleg Nesterov
    Cc: Balbir Singh
    Cc: Stanislaw Gruszka
    Signed-off-by: Ingo Molnar

    Hidetoshi Seto
     

26 Nov, 2009

2 commits

  • Now all task_{u,s}time() pairs are replaced by task_times(),
    and task_gtime() is too simple to be an inline function.

    Clean them all up.

    Signed-off-by: Hidetoshi Seto
    Acked-by: Peter Zijlstra
    Cc: Stanislaw Gruszka
    Cc: Spencer Candland
    Cc: Oleg Nesterov
    Cc: Balbir Singh
    Cc: Americo Wang
    Signed-off-by: Ingo Molnar

    Hidetoshi Seto
     
  • Functions task_{u,s}time() are called in pairs in almost all
    cases. However, task_stime() is implemented to call task_utime()
    internally, so such paired calls run task_utime() twice.

    This means we do the heavy divisions (div_u64 + do_div) twice to get
    utime and stime, which could be obtained at the same time by one set
    of divisions.

    This patch introduces a function task_times(*tsk, *utime,
    *stime) to retrieve utime and stime at once, in a better, optimized
    way.
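    Callers then fetch both values in one go, e.g.:

        cputime_t utime, stime;

        task_times(current, &utime, &stime); /* one division pass, two results */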

    Signed-off-by: Hidetoshi Seto
    Acked-by: Peter Zijlstra
    Cc: Stanislaw Gruszka
    Cc: Spencer Candland
    Cc: Oleg Nesterov
    Cc: Balbir Singh
    Cc: Americo Wang
    Signed-off-by: Ingo Molnar

    Hidetoshi Seto
     

18 Nov, 2009

1 commit


26 Oct, 2009

1 commit


25 Sep, 2009

1 commit


23 Sep, 2009

1 commit

  • A patch to give a better overview of userland application stack usage,
    especially for embedded Linux.

    Currently you are only able to dump the main process/thread stack usage,
    which is shown in /proc/pid/status by the "VmStk" value. But you get no
    information about the stack memory consumed by the other threads.

    There is an enhancement in /proc/<pid>/{task/*,}/maps which marks the vm
    mapping where the thread stack pointer resides with "[thread stack:
    xxxxxxxx]". xxxxxxxx is the maximum size of the stack. This is valuable
    information, because libpthread doesn't set the start of the stack at the
    top of the mapped area, depending on the pthread usage.

    A sample output of /proc/<pid>/task/<tid>/maps looks like:

    08048000-08049000 r-xp 00000000 03:00 8312 /opt/z
    08049000-0804a000 rw-p 00001000 03:00 8312 /opt/z
    0804a000-0806b000 rw-p 00000000 00:00 0 [heap]
    a7d12000-a7d13000 ---p 00000000 00:00 0
    a7d13000-a7f13000 rw-p 00000000 00:00 0 [thread stack: 001ff4b4]
    a7f13000-a7f14000 ---p 00000000 00:00 0
    a7f14000-a7f36000 rw-p 00000000 00:00 0
    a7f36000-a8069000 r-xp 00000000 03:00 4222 /lib/libc.so.6
    a8069000-a806b000 r--p 00133000 03:00 4222 /lib/libc.so.6
    a806b000-a806c000 rw-p 00135000 03:00 4222 /lib/libc.so.6
    a806c000-a806f000 rw-p 00000000 00:00 0
    a806f000-a8083000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0
    a8083000-a8084000 r--p 00013000 03:00 14462 /lib/libpthread.so.0
    a8084000-a8085000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0
    a8085000-a8088000 rw-p 00000000 00:00 0
    a8088000-a80a4000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2
    a80a4000-a80a5000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2
    a80a5000-a80a6000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2
    afaf5000-afb0a000 rw-p 00000000 00:00 0 [stack]
    ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso]

    Also there is a new entry "Stack usage" in /proc/<pid>/{task/*,}/status
    which will give you the current stack usage in kB.

    A sample output of /proc/self/status looks like:

    Name: cat
    State: R (running)
    Tgid: 507
    Pid: 507
    .
    .
    .
    CapBnd: fffffffffffffeff
    voluntary_ctxt_switches: 0
    nonvoluntary_ctxt_switches: 0
    Stack usage: 12 kB

    I also fixed the stack base address in /proc/<pid>/{task/*,}/stat to be
    the base address of the associated thread stack and not the one of the
    main process. This makes more sense.

    [akpm@linux-foundation.org: fs/proc/array.c now needs walk_page_range()]
    Signed-off-by: Stefani Seibold
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stefani Seibold
     

21 Sep, 2009

1 commit

  • The Cpus_allowed fields in /proc/<pid>/status are currently only
    shown in case of CONFIG_CPUSETS. However their contents are also
    useful for the !CONFIG_CPUSETS case.

    So change the current behaviour and always show these fields.
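    A sketch of the unconditional version in fs/proc/array.c (the helper
    name and the _list variant are assumptions based on the status output
    of that era):

        static void task_cpus_allowed(struct seq_file *m,
                                      struct task_struct *task)
        {
                seq_printf(m, "Cpus_allowed:\t");
                seq_cpumask(m, &task->cpus_allowed);
                seq_printf(m, "\n");
                seq_printf(m, "Cpus_allowed_list:\t");
                seq_cpumask_list(m, &task->cpus_allowed);
                seq_printf(m, "\n");
        }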

    Signed-off-by: Heiko Carstens
    Cc: Andrew Morton
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Heiko Carstens
     

05 May, 2009

1 commit

  • By using the same test as is used for /proc/pid/maps and /proc/pid/smaps,
    only allow processes that can ptrace() a given process to see information
    that might be used to bypass address space layout randomization (ASLR).
    These include eip, esp, wchan, and start_stack in /proc/pid/stat as well
    as the non-symbolic output from /proc/pid/wchan.

    ASLR can be bypassed by sampling eip, as shown by the proof-of-concept
    code at http://code.google.com/p/fuzzyaslr/ As part of a presentation
    (http://www.cr0.org/paper/to-jt-linux-alsr-leak.pdf), esp and wchan were
    also noted as possibly usable information leaks. The start_stack address
    also leaks potentially useful information.
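    Conceptually, the gating in do_task_stat() looks like this (a sketch
    using the same ptrace_may_access() test as /proc/pid/maps):

        int permitted = ptrace_may_access(task, PTRACE_MODE_READ);
        unsigned long eip = 0, esp = 0, wchan = 0;

        if (permitted) {
                eip = KSTK_EIP(task);
                esp = KSTK_ESP(task);
                wchan = get_wchan(task);
        }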

    Cc: Stable Team
    Signed-off-by: Jake Edge
    Acked-by: Arjan van de Ven
    Acked-by: "Eric W. Biederman"
    Signed-off-by: Linus Torvalds

    Jake Edge
     

14 Nov, 2008

2 commits

  • Use RCU to access another task's creds and to release a task's own creds.
    This means that it will be possible for the credentials of a task to be
    replaced without another task (a) requiring a full lock to read them, and (b)
    seeing deallocated memory.
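    The read side then becomes lock-free; a minimal sketch:

        const struct cred *cred;
        uid_t euid;

        rcu_read_lock();
        cred = __task_cred(task); /* RCU-protected pointer, no lock taken */
        euid = cred->euid;
        rcu_read_unlock();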

    Signed-off-by: David Howells
    Acked-by: James Morris
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    David Howells
     
  • Separate the task security context from task_struct. At this point, the
    security data is temporarily embedded in the task_struct with two pointers
    pointing to it.

    Note that the Alpha arch is altered as it refers to (E)UID and (E)GID in
    entry.S via asm-offsets.

    With comment fixes Signed-off-by: Marc Dionne

    Signed-off-by: David Howells
    Acked-by: James Morris
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    David Howells
     

27 Oct, 2008

1 commit


20 Oct, 2008

1 commit


10 Oct, 2008

2 commits

  • After local seq_file conversion it was forgotten.

    Signed-off-by: Alexey Dobriyan

    Alexey Dobriyan
     
  • lock_task_sighand() makes sure task->sighand is protected, so we do not
    need rcu_read_lock().
    [ exec() will take task->sighand->siglock before changing task->sighand! ]

    But code using rcu_read_lock() _just_ to protect lock_task_sighand()
    appears only in procfs. (And some code in procfs uses lock_task_sighand()
    without such redundant protection.)

    Other subsystems may put lock_task_sighand() inside an rcu_read_lock()
    critical region, but those rcu_read_lock() calls are there to protect
    "for_each_process()", "find_task_by_vpid()" etc., not to protect
    lock_task_sighand().
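    So in procfs the pattern simply becomes (a sketch):

        unsigned long flags;
        int num_threads = 0;

        /* no surrounding rcu_read_lock() needed */
        if (lock_task_sighand(task, &flags)) {
                num_threads = atomic_read(&task->signal->count);
                unlock_task_sighand(task, &flags);
        }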

    Signed-off-by: Lai Jiangshan
    [ok from Oleg]
    Signed-off-by: Alexey Dobriyan

    Lai Jiangshan
     

14 Sep, 2008

1 commit

  • Overview

    This patch reworks the handling of POSIX CPU timers, including the
    ITIMER_PROF, ITIMER_VIRT timers and rlimit handling. It was put together
    with the help of Roland McGrath, the owner and original writer of this code.

    The problem we ran into, and the reason for this rework, has to do with using
    a profiling timer in a process with a large number of threads. It appears
    that the performance of the old implementation of run_posix_cpu_timers() was
    at least O(n*3) (where "n" is the number of threads in a process) or worse.
    Everything is fine with an increasing number of threads until the time taken
    for that routine to run becomes the same as or greater than the tick time, at
    which point things degrade rather quickly.

    This patch fixes bug 9906, "Weird hang with NPTL and SIGPROF."

    Code Changes

    This rework corrects the implementation of run_posix_cpu_timers() to make it
    run in constant time for a particular machine. (Performance may vary between
    one machine and another depending upon whether the kernel is built as single-
    or multiprocessor and, in the latter case, depending upon the number of
    running processors.) To do this, at each tick we now update fields in
    signal_struct as well as task_struct. The run_posix_cpu_timers() function
    uses those fields to make its decisions.

    We define a new structure, "task_cputime," to contain user, system and
    scheduler times and use these in appropriate places:

    struct task_cputime {
            cputime_t utime;
            cputime_t stime;
            unsigned long long sum_exec_runtime;
    };

    This is included in the structure "thread_group_cputime," which is a new
    substructure of signal_struct and which varies for uniprocessor versus
    multiprocessor kernels. For uniprocessor kernels, it uses "task_cputime" as
    a simple substructure, while for multiprocessor kernels it is a pointer:

    /* UP */
    struct thread_group_cputime {
            struct task_cputime totals;
    };

    /* SMP */
    struct thread_group_cputime {
            struct task_cputime *totals;
    };

    We also add a new task_cputime substructure directly to signal_struct, to
    cache the earliest expiration of process-wide timers, and task_cputime also
    replaces the it_*_expires fields of task_struct (used for earliest expiration
    of thread timers). The "thread_group_cputime" structure contains process-wide
    timers that are updated via account_user_time() and friends. In the non-SMP
    case the structure is a simple aggregator; unfortunately in the SMP case that
    simplicity was not achievable due to cache-line contention between CPUs (in
    one measured case performance was actually _worse_ on a 16-cpu system than
    the same test on a 4-cpu system, due to this contention). For SMP, the
    thread_group_cputime counters are maintained as a per-cpu structure allocated
    using alloc_percpu(). The timer functions update only the timer field in
    the structure corresponding to the running CPU, obtained using per_cpu_ptr().

    We define a set of inline functions in sched.h that we use to maintain the
    thread_group_cputime structure and hide the differences between UP and SMP
    implementations from the rest of the kernel. The thread_group_cputime_init()
    function initializes the thread_group_cputime structure for the given task.
    The thread_group_cputime_alloc() is a no-op for UP; for SMP it calls the
    out-of-line function thread_group_cputime_alloc_smp() to allocate and fill
    in the per-cpu structures and fields. The thread_group_cputime_free()
    function, also a no-op for UP, in SMP frees the per-cpu structures. The
    thread_group_cputime_clone_thread() function (also a UP no-op) for SMP calls
    thread_group_cputime_alloc() if the per-cpu structures haven't yet been
    allocated. The thread_group_cputime() function fills the task_cputime
    structure it is passed with the contents of the thread_group_cputime
    fields; in UP it's that simple, but in SMP it must also safely check that
    tsk->signal is non-NULL (if it is NULL it just uses the appropriate fields
    of task_struct) and, if so, sums the per-cpu values for each online CPU.
    Finally, the three functions account_group_user_time(),
    account_group_system_time() and account_group_exec_runtime() are used by
    timer functions to update the respective fields of the
    thread_group_cputime structure.
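    As a sketch of the SMP update path just described (NULL checks omitted;
    the signal->cputime field name and cpu-pinning helpers are assumptions):

        static inline void account_group_user_time(struct task_struct *tsk,
                                                   cputime_t cputime)
        {
                struct thread_group_cputime *tg = &tsk->signal->cputime;
                struct task_cputime *times;

                /* update only this CPU's slot of the per-cpu totals */
                times = per_cpu_ptr(tg->totals, get_cpu());
                times->utime = cputime_add(times->utime, cputime);
                put_cpu();
        }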

    Non-SMP operation is trivial and will not be mentioned further.

    The per-cpu structure is always allocated when a task creates its first new
    thread, via a call to thread_group_cputime_clone_thread() from copy_signal().
    It is freed at process exit via a call to thread_group_cputime_free() from
    cleanup_signal().

    All functions that formerly summed utime/stime/sum_sched_runtime values
    from all threads in the thread group now use thread_group_cputime() to
    snapshot the values in the thread_group_cputime structure, or the values
    in the task structure itself if the per-cpu structure hasn't been
    allocated.

    Finally, the code in kernel/posix-cpu-timers.c has changed quite a bit.
    The run_posix_cpu_timers() function has been split into a fast path and a
    slow path; the former safely checks whether there are any expired thread
    timers and, if not, just returns, while the slow path does the heavy lifting.
    With the dedicated thread group fields, timers are no longer "rebalanced" and
    the process_timer_rebalance() function and related code has gone away. All
    summing loops are gone and all code that used them now uses the
    thread_group_cputime() inline. When process-wide timers are set, the new
    task_cputime structure in signal_struct is used to cache the earliest
    expiration; this is checked in the fast path.

    Performance

    The fix appears not to add significant overhead to existing operations. It
    generally performs the same as the current code except in two cases, one in
    which it performs slightly worse (Case 5 below) and one in which it performs
    very significantly better (Case 2 below). Overall it's a wash except in those
    two cases.

    I've since done somewhat more involved testing on a dual-core Opteron system.

    Case 1: With no itimer running, for a test with 100,000 threads, the fixed
    kernel took 1428.5 seconds, 513 seconds more than the unfixed system,
    all of which was spent in the system. There were twice as many
    voluntary context switches with the fix as without it.

    Case 2: With an itimer running at .01 second ticks and 4000 threads (the most
    an unmodified kernel can handle), the fixed kernel ran the test in
    eight percent of the time (5.8 seconds as opposed to 70 seconds) and
    had better tick accuracy (.012 seconds per tick as opposed to .023
    seconds per tick).

    Case 3: A 4000-thread test with an initial timer tick of .01 second and an
    interval of 10,000 seconds (i.e. a timer that ticks only once) had
    very nearly the same performance in both cases: 6.3 seconds elapsed
    for the fixed kernel versus 5.5 seconds for the unfixed kernel.

    With fewer threads (eight in these tests), the Case 1 test ran in essentially
    the same time on both the modified and unmodified kernels (5.2 seconds versus
    5.8 seconds). The Case 2 test ran in about the same time as well, 5.9 seconds
    versus 5.4 seconds but again with much better tick accuracy, .013 seconds per
    tick versus .025 seconds per tick for the unmodified kernel.

    Since the fix affected the rlimit code, I also tested soft and hard CPU limits.

    Case 4: With a hard CPU limit of 20 seconds and eight threads (and an itimer
    running), the modified kernel was very slightly favored in that while
    it killed the process in 19.997 seconds of CPU time (5.002 seconds of
    wall time), only .003 seconds of that was system time, the rest was
    user time. The unmodified kernel killed the process in 20.001 seconds
    of CPU (5.014 seconds of wall time) of which .016 seconds was system
    time. Really, though, the results were too close to call. The results
    were essentially the same with no itimer running.

    Case 5: With a soft limit of 20 seconds and a hard limit of 2000 seconds
    (where the hard limit would never be reached) and an itimer running,
    the modified kernel exhibited worse tick accuracy than the unmodified
    kernel: .050 seconds/tick versus .028 seconds/tick. Otherwise,
    performance was almost indistinguishable. With no itimer running this
    test exhibited virtually identical behavior and times in both cases.

    In times past I did some limited performance testing. Those results are below.

    On a four-cpu Opteron system without this fix, a sixteen-thread test executed
    in 3569.991 seconds, of which user was 3568.435s and system was 1.556s. On
    the same system with the fix, user and elapsed time were about the same, but
    system time dropped to 0.007 seconds. Performance with eight, four and one
    thread were comparable. Interestingly, the timer ticks with the fix seemed
    more accurate: The sixteen-thread test with the fix received 149543 ticks
    for 0.024 seconds per tick, while the same test without the fix received 58720
    for 0.061 seconds per tick. Both cases were configured for an interval of
    0.01 seconds. Again, the other tests were comparable. Each thread in this
    test computed the primes up to 25,000,000.

    I also did a test with a large number of threads, 100,000 threads, which is
    impossible without the fix. In this case each thread computed the primes only
    up to 10,000 (to make the runtime manageable). System time dominated, at
    1546.968 seconds out of a total 2176.906 seconds (giving a user time of
    629.938s). It received 147651 ticks for 0.015 seconds per tick, still quite
    accurate. There is obviously no comparable test without the fix.

    Signed-off-by: Frank Mayhar
    Cc: Roland McGrath
    Cc: Alexey Dobriyan
    Cc: Andrew Morton
    Signed-off-by: Ingo Molnar

    Frank Mayhar
     

06 Sep, 2008

1 commit

  • Spencer reported a problem where utime and stime were going negative
    despite the fixes in commit b27f03d4bdc145a09fb7b0c0e004b29f1ee555fa. The
    suspected reason for the problem is that signal_struct maintains its own
    utime and stime (of exited tasks); these are not updated using the new
    task_utime() routine, hence sig->utime can go backwards and cause the
    same problem to occur (sig->utime adds tsk->utime and not task_utime()).
    This patch fixes the problem.

    TODO: using max(task->prev_utime, derived utime) works for now, but a more
    generic solution is to implement cputime_max() and use the cputime_gt()
    function for comparison.

    Reported-by: spencer@bluehost.com
    Signed-off-by: Balbir Singh
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Balbir Singh
     

27 Jul, 2008

1 commit

  • This adds the tracehook_tracer_task() hook to consolidate all forms of
    "Who is using ptrace on me?" logic. This is used for "TracerPid:" in
    /proc and for permission checks. We also clean up the selinux code
    that called an identical accessor.

    Signed-off-by: Roland McGrath
    Cc: Oleg Nesterov
    Reviewed-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland McGrath
     

01 Jun, 2008

1 commit

  • Source code out there hard-codes a notion of what the
    _LINUX_CAPABILITY_VERSION #define means in terms of the semantics of the
    raw capability system calls capget() and capset(). It's unfortunate, but
    true.

    Since the confusing header file has been in a released kernel, there is
    software that is erroneously using 64-bit capabilities with the semantics
    of 32-bit capabilities. These recently compiled programs may suffer
    memory corruption when sys_getcap() overwrites more memory than they are
    coded to expect, and the raising of added capabilities when using
    sys_capset().

    As such, this patch does a number of things to clean up the situation
    for all. It

    1. forces the _LINUX_CAPABILITY_VERSION define to always retain its
    legacy value.

    2. adopts a new #define strategy for the kernel's internal
    implementation of the preferred magic.

    3. deprecates v2 capability magic in favor of a new (v3) magic
    number. The functionality of v3 is entirely equivalent to v2,
    the only difference being that the v2 magic causes the kernel
    to log a "deprecated" warning so the admin can find applications
    that may be using v2 inappropriately.

    [User space code continues to be encouraged to use the libcap API which
    protects the application from details like this. libcap-2.10 is the first
    to support v3 capabilities.]
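    For reference, the resulting magic numbers in linux/capability.h look
    like this (a sketch; the values encode the dates the formats were
    introduced):

        #define _LINUX_CAPABILITY_VERSION_1  0x19980330
        #define _LINUX_CAPABILITY_VERSION_2  0x20071026 /* deprecated: warns */
        #define _LINUX_CAPABILITY_VERSION_3  0x20080522 /* same semantics as v2 */

        /* point 1 above: the legacy define keeps its legacy value */
        #define _LINUX_CAPABILITY_VERSION    _LINUX_CAPABILITY_VERSION_1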

    Fixes issue reported in https://bugzilla.redhat.com/show_bug.cgi?id=447518.
    Thanks to Bojan Smojver for the report.

    [akpm@linux-foundation.org: s/depreciate/deprecate/g]
    [akpm@linux-foundation.org: be robust about put_user size]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Andrew G. Morgan
    Cc: Serge E. Hallyn
    Cc: Bojan Smojver
    Cc: stable@kernel.org
    Signed-off-by: Andrew Morton
    Signed-off-by: Chris Wright

    Andrew G. Morgan
     

13 May, 2008

1 commit

  • There is currently no way to query the bounding set of another task. As
    there appears to be no security reason not to allow it, and as Michael
    Kerrisk points out, the following valid reasons to do so exist:

    * consistency (I can see all of the other per-thread/process sets in
    /proc/.../status)

    * debugging -- I could imagine that it would make the job of debugging an
    application that uses capabilities a little simpler.

    this patch adds the bounding set to /proc/self/status right after the
    effective set.

    Signed-off-by: Serge E. Hallyn
    Acked-by: Michael Kerrisk
    Acked-by: Andrew G. Morgan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn