12 Apr, 2013

1 commit

  • The smpboot threads rely on the park/unpark mechanism which binds per
    cpu threads on a particular core. Though the functionality is racy:

    CPU0                    CPU1                    CPU2
    unpark(T)                                       wake_up_process(T)
      clear(SHOULD_PARK)    T runs
                            leave parkme() due to !SHOULD_PARK
      bind_to(CPU2)         BUG_ON(wrong CPU)

    We cannot let the tasks move themselves to the target CPU as one of
    those tasks is actually the migration thread itself, which requires
    that it starts running on the target cpu right away.

    The solution to this problem is to prevent wakeups in park mode which
    are not from unpark(). That way we can guarantee that the association
    of the task to the target cpu is working correctly.

    Add a new task state (TASK_PARKED) which prevents other wakeups and
    use this state explicitly for the unpark wakeup.

    Peter noticed: Also, since the task state is visible to userspace and
    all the parked tasks are still in the PID space, it's a good hint in ps
    and friends that these tasks aren't really there for the moment.

    The migration thread has another related issue.

    CPU0                        CPU1
    Bring up CPU2
    create_thread(T)
    park(T)
     wait_for_completion()
                                parkme()
                                complete()
    sched_set_stop_task()
                                schedule(TASK_PARKED)

    The sched_set_stop_task() call is issued while the task is on the
    runqueue of CPU1 and that confuses the hell out of the stop_task class
    on that cpu. So we need the same synchronization before
    sched_set_stop_task().

    Reported-by: Dave Jones
    Reported-and-tested-by: Dave Hansen
    Reported-and-tested-by: Borislav Petkov
    Acked-by: Peter Zijlstra
    Cc: Srivatsa S. Bhat
    Cc: dhillf@gmail.com
    Cc: Ingo Molnar
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304091635430.21884@ionos
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
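
    The effect of a dedicated park state can be modelled in userspace as a
    wakeup that only succeeds when the caller's state mask covers the task's
    current state. The sketch below is a simplified model, not kernel code:
    the helper names mirror the kernel's, but their bodies and the numeric
    state values are assumptions made for illustration.

```c
#include <stdbool.h>

/* Illustrative state bits; the numeric values are assumptions. */
enum {
    TASK_RUNNING         = 0,
    TASK_INTERRUPTIBLE   = 1 << 0,
    TASK_UNINTERRUPTIBLE = 1 << 1,
    TASK_PARKED          = 1 << 2,
    TASK_NORMAL          = TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE,
};

struct task {
    int state;
};

/* Wake the task only if its current state is covered by the mask. */
static bool wake_up_state(struct task *t, int mask)
{
    if (!(t->state & mask))
        return false;
    t->state = TASK_RUNNING;
    return true;
}

/* Ordinary wakeup: matches only the normal sleep states. */
static bool wake_up_process(struct task *t)
{
    return wake_up_state(t, TASK_NORMAL);
}

/* Unpark wakeup: the only path that names TASK_PARKED. */
static bool unpark(struct task *t)
{
    return wake_up_state(t, TASK_PARKED);
}
```

    In this model a stray wake_up_process() on a parked task is a no-op, so
    the task cannot start running before unpark() has finished binding it.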
     

28 Jan, 2013

1 commit

  • This is in preparation for the full dynticks feature. While
    remotely reading the cputime of a task running in a full
    dynticks CPU, we'll need to do some extra-computation. This
    way we can account the time it spent tickless in userspace
    since its last cputime snapshot.

    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Ingo Molnar
    Cc: Li Zhong
    Cc: Namhyung Kim
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     

18 Dec, 2012

6 commits

  • Merge misc patches from Andrew Morton:
    "Incoming:

    - lots of misc stuff

    - backlight tree updates

    - lib/ updates

    - Oleg's percpu-rwsem changes

    - checkpatch

    - rtc

    - aoe

    - more checkpoint/restart support

    I still have a pile of MM stuff pending - Pekka should be merging
    later today after which that is good to go. A number of other things
    are twiddling thumbs awaiting maintainer merges."

    * emailed patches from Andrew Morton: (180 commits)
    scatterlist: don't BUG when we can trivially return a proper error.
    docs: update documentation about /proc/<pid>/fdinfo/<fd> fanotify output
    fs, fanotify: add @mflags field to fanotify output
    docs: add documentation about /proc/<pid>/fdinfo/<fd> output
    fs, notify: add procfs fdinfo helper
    fs, exportfs: add exportfs_encode_inode_fh() helper
    fs, exportfs: escape nil dereference if no s_export_op present
    fs, epoll: add procfs fdinfo helper
    fs, eventfd: add procfs fdinfo helper
    procfs: add ability to plug in auxiliary fdinfo providers
    tools/testing/selftests/kcmp/kcmp_test.c: print reason for failure in kcmp_test
    breakpoint selftests: print failure status instead of cause make error
    kcmp selftests: print fail status instead of cause make error
    kcmp selftests: make run_tests fix
    mem-hotplug selftests: print failure status instead of cause make error
    cpu-hotplug selftests: print failure status instead of cause make error
    mqueue selftests: print failure status instead of cause make error
    vm selftests: print failure status instead of cause make error
    ubifs: use prandom_bytes
    mtd: nandsim: use prandom_bytes
    ...

    Linus Torvalds
     
  • This allows us to print out the eventpoll target file descriptor, events and
    data. The /proc/pid/fdinfo/fd output consists of

    | pos: 0
    | flags: 02
    | tfd: 5 events: 1d data: ffffffffffffffff enabled: 1

    [avagin@: fix for uninitialized ret variable]

    Signed-off-by: Cyrill Gorcunov
    Acked-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Andrey Vagin
    Cc: Al Viro
    Cc: Alexey Dobriyan
    Cc: James Bottomley
    Cc: "Aneesh Kumar K.V"
    Cc: Alexey Dobriyan
    Cc: Matthew Helsley
    Cc: "J. Bruce Fields"
    Cc: "Aneesh Kumar K.V"
    Cc: Tvrtko Ursulin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
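
    A reader of this file could pick the target records out with sscanf().
    The sketch below assumes the record layout shown in the sample above
    (decimal tfd, hexadecimal events and data); the struct and function names
    are hypothetical, and any fields after "data" are ignored.

```c
#include <stdio.h>

struct epoll_fdinfo {
    int tfd;                  /* target file descriptor */
    unsigned int events;      /* event mask, printed in hex */
    unsigned long long data;  /* user data, printed in hex */
};

/* Returns 1 if line is a "tfd:" record and was parsed, 0 otherwise. */
static int parse_epoll_fdinfo(const char *line, struct epoll_fdinfo *out)
{
    return sscanf(line, " tfd: %d events: %x data: %llx",
                  &out->tfd, &out->events, &out->data) == 3;
}
```

    Lines such as "pos: 0" or "flags: 02" simply fail to match and can be
    skipped by the caller.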
     
  • We display a list of supplementary groups for each process in
    /proc/<pid>/status. However, we show only the first 32 groups, not all of
    them.

    Although this is rare, processes sometimes do have more than 32
    supplementary groups, and this kernel limitation breaks user-space apps
    that rely on the group list in /proc/<pid>/status.

    Number 32 comes from the internal NGROUPS_SMALL macro which defines the
    length for the internal kernel "small" groups buffer. There is no
    apparent reason to limit to this value.

    This patch removes the 32 groups printing limit.

    The Linux kernel limits the number of supplementary groups to NGROUPS_MAX,
    which is currently set to 65536, so this is the maximum number of groups
    we may possibly print.

    Signed-off-by: Artem Bityutskiy
    Acked-by: Serge E. Hallyn
    Acked-by: Kees Cook
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Artem Bityutskiy
     
  • It is currently impossible to examine the state of seccomp for a given
    process. While attaching with gdb and attempting "call
    prctl(PR_GET_SECCOMP,...)" will work in some situations, it is not
    reliable. If the process is in seccomp mode 1, this query will kill the
    process (prctl not allowed); if the process is in mode 2 with prctl not
    allowed, it will similarly be killed; and in weird cases, if prctl is
    filtered to return errno 0, it can look like seccomp is disabled.

    When reviewing the state of running processes, there should be a way to
    externally examine the seccomp mode. ("Did this build of Chrome end up
    using seccomp?" "Did my distro ship ssh with seccomp enabled?")

    This adds the "Seccomp" line to /proc/$pid/status.

    Signed-off-by: Kees Cook
    Reviewed-by: Cyrill Gorcunov
    Cc: Andrea Arcangeli
    Cc: James Morris
    Acked-by: Serge E. Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
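
    With the new line in place, an external tool can read the mode without
    touching the target process. The helper below is a minimal sketch that
    scans already-loaded status text for a "Key:" line and parses the integer
    after it; the function name is hypothetical, and reading the file into a
    buffer is left to the caller.

```c
#include <stdio.h>
#include <string.h>

/*
 * Find "<key>:" at the start of a line in text and parse the integer
 * that follows. Returns 0 on success, -1 if the key is missing or its
 * value is not an integer.
 */
static int status_int_field(const char *text, const char *key, long *value)
{
    size_t klen = strlen(key);
    const char *line = text;

    while (line && *line) {
        if (strncmp(line, key, klen) == 0 && line[klen] == ':')
            return sscanf(line + klen + 1, "%ld", value) == 1 ? 0 : -1;
        line = strchr(line, '\n');   /* advance to the next line */
        if (line)
            line++;
    }
    return -1;
}
```

    Called with the key "Seccomp", a result of 0 means seccomp is disabled,
    while 1 and 2 correspond to the two modes discussed above.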
     
  • Without this patch it is really hard to interpret a bounding set, if
    CAP_LAST_CAP is unknown for a current kernel.

    Non-existent capabilities cannot be deleted from a bounding set with the
    help of prctl.

    E.g.: Here are two examples without/with this patch.

    CapBnd: ffffffe0fdecffff
    CapBnd: 00000000fdecffff

    I suggest hiding non-existent capabilities. Here are two reasons.
    * It's logical and easier to use.
    * It helps to checkpoint-restore capabilities of tasks, because tasks
    can be restored on another kernel, where CAP_LAST_CAP is bigger.

    Signed-off-by: Andrew Vagin
    Cc: Andrew G. Morgan
    Reviewed-by: Serge E. Hallyn
    Cc: Pavel Emelyanov
    Reviewed-by: Kees Cook
    Cc: KAMEZAWA Hiroyuki
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Vagin
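
    The two CapBnd values above differ only in the bits above the highest
    capability the kernel knows about. A minimal sketch of that masking
    follows; the function name is hypothetical, and the value 36 used for the
    highest capability in the usage note is an assumption chosen to match the
    sample, since the message does not state it.

```c
#include <stdint.h>

/* Keep only bits 0..last_cap; higher bits name capabilities that do not exist. */
static uint64_t mask_valid_caps(uint64_t caps, unsigned int last_cap)
{
    if (last_cap >= 63)
        return caps;   /* every bit is a valid capability */
    return caps & ((UINT64_C(1) << (last_cap + 1)) - 1);
}
```

    For example, mask_valid_caps(0xffffffe0fdecffff, 36) yields
    0x00000000fdecffff, the second value shown above.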
     
  • Pull user namespace changes from Eric Biederman:
    "While small this set of changes is very significant with respect to
    containers in general and user namespaces in particular. The user
    space interface is now complete.

    This set of changes adds support for unprivileged users to create user
    namespaces and as a user namespace root to create other namespaces.
    The tyranny of supporting suid root preventing unprivileged users from
    using cool new kernel features is broken.

    This set of changes completes the work on setns, adding support for
    the pid, user, mount namespaces.

    This set of changes includes a bunch of basic pid namespace
    cleanups/simplifications. Of particular significance is the rework of
    the pid namespace cleanup so it no longer requires sending out
    tendrils into all kinds of unexpected cleanup paths for operation. At
    least one case of broken error handling is fixed by this cleanup.

    The files under /proc/<pid>/ns/ have been converted from regular files
    to magic symlinks which prevents incorrect caching by the VFS,
    ensuring the files always refer to the namespace the process is
    currently using and ensuring that the ptrace_may_access permission
    checks are always applied.

    The files under /proc/<pid>/ns/ have been given stable inode numbers
    so it is now possible to see if different processes share the same
    namespaces.

    Through David Miller's net tree come changes to relax many of the
    permission checks in the networking stack, allowing the user
    namespace root to usefully use the networking stack. Similar changes
    for the mount namespace and the pid namespace are coming through my
    tree.

    Two small changes to add user namespace support were committed here and
    in David Miller's -net tree so that I could complete the work on the
    /proc/<pid>/ns/ files in this tree.

    Work remains to make it safe to build user namespaces and 9p, afs,
    ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the
    Kconfig guard remains in place preventing user namespaces from
    being built when any of those filesystems are enabled.

    Future design work remains to allow root users outside of the initial
    user namespace to mount more than just /proc and /sys."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (38 commits)
    proc: Usable inode numbers for the namespace file descriptors.
    proc: Fix the namespace inode permission checks.
    proc: Generalize proc inode allocation
    userns: Allow unprivilged mounts of proc and sysfs
    userns: For /proc/self/{uid,gid}_map derive the lower userns from the struct file
    procfs: Print task uids and gids in the userns that opened the proc file
    userns: Implement unshare of the user namespace
    userns: Implent proc namespace operations
    userns: Kill task_user_ns
    userns: Make create_new_namespaces take a user_ns parameter
    userns: Allow unprivileged use of setns.
    userns: Allow unprivileged users to create new namespaces
    userns: Allow setting a userns mapping to your current uid.
    userns: Allow chown and setgid preservation
    userns: Allow unprivileged users to create user namespaces.
    userns: Ignore suid and sgid on binaries if the uid or gid can not be mapped
    userns: fix return value on mntns_install() failure
    vfs: Allow unprivileged manipulation of the mount namespace.
    vfs: Only support slave subtrees across different user namespaces
    vfs: Add a user namespace reference from struct mnt_namespace
    ...

    Linus Torvalds
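
    With stable inode numbers, checking whether two processes share a
    namespace reduces to comparing what stat() reports for their ns files.
    The helper below is a sketch with a hypothetical name; it compares device
    and inode numbers for any two paths.

```c
#include <sys/stat.h>

/*
 * Returns 1 if both paths resolve to the same inode on the same device,
 * 0 if they differ, and -1 if either stat() call fails.
 */
static int same_inode(const char *a, const char *b)
{
    struct stat sa, sb;

    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return -1;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

    A caller might compare /proc/self/ns/pid with the same file of another
    process; since the permission checks mentioned above still apply, a
    result of -1 is expected when the caller may not inspect the target.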
     

29 Nov, 2012

1 commit

  • We have thread_group_cputime() and thread_group_times(). The naming
    doesn't provide enough information about the difference between
    these two APIs.

    To lower the confusion, rename thread_group_times() to
    thread_group_cputime_adjusted(). This name better suggests that
    it's a version of thread_group_cputime() that does some stabilization
    on the raw cputime values, i.e. here: scale on top of CFS runtime
    stats and bound lower value for monotonicity.

    Signed-off-by: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Steven Rostedt
    Cc: Paul Gortmaker

    Frederic Weisbecker
     

20 Nov, 2012

1 commit


01 Jun, 2012

3 commits

  • We would like to have an ability to restore command line arguments and
    program environment pointers but first we need to obtain them somehow.
    Thus we put these values into /proc/$pid/stat. The exit_code is needed to
    restore zombie tasks.

    Signed-off-by: Cyrill Gorcunov
    Acked-by: Kees Cook
    Cc: Pavel Emelyanov
    Cc: Serge Hallyn
    Cc: KAMEZAWA Hiroyuki
    Cc: Alexey Dobriyan
    Cc: Tejun Heo
    Cc: Andrew Vagin
    Cc: Vasiliy Kulikov
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • When we checkpoint a task we need to know the list of children the
    task has, but there is no easy and fast way to generate the reverse
    parent->children chain from an arbitrary <pid> (while a parent pid is
    provided in the "PPid" field of /proc/<pid>/status).

    So instead of walking over all pids in the system (creating one big
    process tree in memory, just to figure out which children a task has),
    we add an explicit /proc/<pid>/task/<tid>/children entry, because the kernel
    already has this kind of information but it is not yet exported.

    This lists first-level children only, not the whole process tree.

    Signed-off-by: Cyrill Gorcunov
    Reviewed-by: Oleg Nesterov
    Reviewed-by: Kees Cook
    Cc: Pavel Emelyanov
    Cc: Serge Hallyn
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
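
    A checkpoint tool would read the new entry and turn it into a list of
    PIDs. The sketch below assumes the entry holds decimal PIDs separated by
    whitespace, which the message does not spell out; the function name is
    hypothetical.

```c
#include <stdlib.h>

/*
 * Parse whitespace-separated decimal PIDs from text into out[].
 * Returns the number of PIDs stored, which is at most max.
 */
static size_t parse_children(const char *text, long *out, size_t max)
{
    size_t n = 0;
    char *end;

    while (n < max) {
        long pid = strtol(text, &end, 10);

        if (end == text)   /* nothing left to convert */
            break;
        out[n++] = pid;
        text = end;
    }
    return n;
}
```

    Walking the whole tree then means repeating the read for each returned
    PID, since only first-level children are listed.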
     
  • - use int for priority and nice, since task_nice()/task_prio() return that

    - field 24: get_mm_rss() returns unsigned long

    Signed-off-by: Jan Engelhardt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Engelhardt
     

16 May, 2012

1 commit


03 May, 2012

1 commit


29 Mar, 2012

1 commit

  • bda7bad62bc4 ("procfs: speed up /proc/pid/stat, statm") broke /proc/statm
    - 'text' is printed twice by mistake.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reported-by: Ulrich Drepper
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

24 Mar, 2012

1 commit

  • Process accounting applications such as top and ps visit some files under
    /proc/<pid>. With seq_put_decimal_ull(), we can optimize the /proc/<pid>/stat
    and /proc/<pid>/statm files.

    This patch adds
    - seq_put_decimal_ll() for signed values.
    - allow delimiter == 0.
    - convert seq_printf() to seq_put_decimal_ull/ll in /proc/stat, statm.

    Test result on a system with 2000+ procs.

    Before patch:
    [kamezawa@bluextal test]$ top -b -n 1 | wc -l
    2223
    [kamezawa@bluextal test]$ time top -b -n 1 > /dev/null

    real 0m0.675s
    user 0m0.044s
    sys 0m0.121s

    [kamezawa@bluextal test]$ time ps -elf > /dev/null

    real 0m0.236s
    user 0m0.056s
    sys 0m0.176s

    After patch:
    [kamezawa@bluextal ~]$ time top -b -n 1 > /dev/null

    real 0m0.657s
    user 0m0.052s
    sys 0m0.100s

    [kamezawa@bluextal ~]$ time ps -elf > /dev/null

    real 0m0.198s
    user 0m0.050s
    sys 0m0.145s

    Considering that top and ps tend to scan /proc periodically, this will
    reduce cpu consumption by top/ps to some extent.

    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
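
    The saving comes from skipping format-string parsing and emitting digits
    directly. The function below is a userspace analogue written for
    illustration only: it writes into a caller-supplied buffer instead of a
    seq_file, and its name and signature are assumptions, not the kernel's.

```c
#include <stddef.h>

/*
 * Write an optional delimiter followed by num in decimal into buf.
 * Returns the number of bytes written (not counting the NUL),
 * or 0 if buf is too small.
 */
static size_t put_decimal_ull(char *buf, size_t size, char delimiter,
                              unsigned long long num)
{
    char tmp[20];   /* 2^64 - 1 has 20 decimal digits */
    size_t digits = 0, pos = 0;

    /* Produce digits least significant first. */
    do {
        tmp[digits++] = (char)('0' + num % 10);
        num /= 10;
    } while (num);

    if (digits + (delimiter != 0) + 1 > size)
        return 0;
    if (delimiter)
        buf[pos++] = delimiter;
    while (digits)
        buf[pos++] = tmp[--digits];   /* reverse into the output */
    buf[pos] = '\0';
    return pos;
}
```

    Passing 0 as the delimiter writes the number alone, which corresponds to
    the "allow delimiter == 0" item in the list above.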
     

15 Jan, 2012

1 commit

  • * 'for-linus' of git://selinuxproject.org/~jmorris/linux-security:
    capabilities: remove __cap_full_set definition
    security: remove the security_netlink_recv hook as it is equivalent to capable()
    ptrace: do not audit capability check when outputing /proc/pid/stat
    capabilities: remove task_ns_* functions
    capabitlies: ns_capable can use the cap helpers rather than lsm call
    capabilities: style only - move capable below ns_capable
    capabilites: introduce new has_ns_capabilities_noaudit
    capabilities: call has_ns_capability from has_capability
    capabilities: remove all _real_ interfaces
    capabilities: introduce security_capable_noaudit
    capabilities: reverse arguments to security_capable
    capabilities: remove the task from capable LSM hook entirely
    selinux: sparse fix: fix several warnings in the security server cod
    selinux: sparse fix: fix warnings in netlink code
    selinux: sparse fix: eliminate warnings for selinuxfs
    selinux: sparse fix: declare selinux_disable() in security.h
    selinux: sparse fix: move selinux_complete_init
    selinux: sparse fix: make selinux_secmark_refcount static
    SELinux: Fix RCU deref check warning in sel_netport_insert()

    Manually fix up a semantic mis-merge wrt security_netlink_recv():

    - the interface was removed in commit fd7784615248 ("security: remove
    the security_netlink_recv hook as it is equivalent to capable()")

    - a new user of it appeared in commit a38f7907b926 ("crypto: Add
    userspace configuration API")

    causing no automatic merge conflict, but Eric Paris pointed out the
    issue.

    Linus Torvalds
     

13 Jan, 2012

1 commit

  • The mm->start_code/end_code, mm->start_data/end_data and mm->start_brk values
    are involved in the calculation of program text/data segment sizes (which
    can be seen in /proc/<pid>/statm) and of the final address of a brk() call.

    For restore we need to know all these values. While
    mm->start_code/end_code are already present in /proc/$pid/stat, the
    remaining members are not, so this patch brings them in.

    The restore procedure of these members is addressed in another patch using
    prctl().

    Signed-off-by: Cyrill Gorcunov
    Acked-by: Serge Hallyn
    Reviewed-by: Kees Cook
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Alexey Dobriyan
    Cc: Tejun Heo
    Cc: Andrew Vagin
    Cc: Vasiliy Kulikov
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     

06 Jan, 2012

1 commit

  • Reading /proc/pid/stat of another process checks whether one has ptrace permissions
    on that process. If one does have permissions, it outputs some data about the
    process which might have security and attack implications. If the current
    task does not have ptrace permissions the read still works, but those fields
    are filled with innocuous (0) values. Since this check and a subsequent denial
    are not a violation of the security policy, we should not audit such denials.

    This can be quite useful for removing ptrace broadly across a system without
    flooding the logs when ps is run or something else harmlessly walks proc.

    Signed-off-by: Eric Paris
    Acked-by: Serge E. Hallyn

    Eric Paris
     

15 Dec, 2011

1 commit


23 Jun, 2011

1 commit


27 May, 2011

1 commit


24 Mar, 2011

1 commit

  • While mm->start_stack was protected from cross-uid viewing (commit
    f83ce3e6b02d5 ("proc: avoid information leaks to non-privileged
    processes")), the start_code and end_code values were not. This would
    allow the text location of a PIE binary to leak, defeating ASLR.

    Note that the value "1" is used instead of "0" for a protected value since
    "ps", "killall", and likely other readers of /proc/pid/stat, take
    start_code of "0" to mean a kernel thread and will misbehave. Thanks to
    Brad Spengler for pointing this out.

    Addresses CVE-2011-0726

    Signed-off-by: Kees Cook
    Cc:
    Cc: Alexey Dobriyan
    Cc: David Howells
    Cc: Eugene Teo
    Cc: Martin Schwidefsky
    Cc: Brad Spengler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

15 Feb, 2011

1 commit

  • task_show_regs used to be a debugging aid in the early bringup days
    of Linux on s390. /proc/<pid>/status is a world readable file, so it
    is not a good idea to show the registers of a process. The only
    correct fix is to remove task_show_regs.

    Reported-by: Al Viro
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Linus Torvalds

    Martin Schwidefsky
     

14 Jan, 2011

2 commits

  • For string without format specifiers, use seq_puts().
    For seq_printf("\n"), use seq_putc('\n').

     text    data     bss     dec     hex filename
    61866     488     112   62466    f402 fs/proc/proc.o
    61729     488     112   62329    f379 fs/proc/proc.o
    ----------------------------------------------------
    -139

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • /proc/*/statm code needlessly truncates data from unsigned long to int.
    One needs only 8+ TB of RAM to make truncation visible.

    Signed-off-by: Alexey Dobriyan
    Reviewed-by: WANG Cong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
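
    The 8 TB figure follows from simple arithmetic: statm reports sizes in
    pages, and 8 TiB of 4 KiB pages is exactly 2^31 pages, one more than a
    32-bit int can hold. The sketch below works that out; the 4 KiB page size
    and 32-bit int are assumptions, and the helper names are hypothetical.

```c
#include <limits.h>
#include <stdint.h>

/* Number of whole pages in the given byte count. */
static uint64_t bytes_to_pages(uint64_t bytes, uint64_t page_size)
{
    return bytes / page_size;
}

/* Nonzero if the page count can be stored in an int without truncation. */
static int fits_in_int(uint64_t pages)
{
    return pages <= (uint64_t)INT_MAX;
}
```

    With these assumptions, one page short of 8 TiB still fits in an int,
    while 8 TiB itself does not.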
     

30 Jul, 2010

1 commit

  • It's possible for get_task_cred() as it currently stands to 'corrupt' a set of
    credentials by incrementing their usage count after their replacement by the
    task being accessed.

    What happens is that get_task_cred() can race with commit_creds():

    TASK_1                          TASK_2                          RCU_CLEANER
    -->get_task_cred(TASK_2)
    rcu_read_lock()
    __cred = __task_cred(TASK_2)
                                    -->commit_creds()
                                    old_cred = TASK_2->real_cred
                                    TASK_2->real_cred = ...
                                    put_cred(old_cred)
                                      call_rcu(old_cred)
                    [__cred->usage == 0]
    get_cred(__cred)
                    [__cred->usage == 1]
    rcu_read_unlock()
                                                                    -->put_cred_rcu()
                                                                    [__cred->usage == 1]
                                                                    panic()

    However, since a task's credentials are generally not changed very often, we can
    reasonably make use of a loop involving reading the creds pointer and using
    atomic_inc_not_zero() to attempt to increment it if it hasn't already hit zero.

    If successful, we can safely return the credentials in the knowledge that, even
    if the task we're accessing has released them, they haven't gone to the RCU
    cleanup code.

    We then change task_state() in procfs to use get_task_cred() rather than
    calling get_cred() on the result of __task_cred(), as that suffers from the
    same problem.

    Without this change, a BUG_ON in __put_cred() or in put_cred_rcu() can be
    tripped when it is noticed that the usage count is not zero as it ought to be,
    for example:

    kernel BUG at kernel/cred.c:168!
    invalid opcode: 0000 [#1] SMP
    last sysfs file: /sys/kernel/mm/ksm/run
    CPU 0
    Pid: 2436, comm: master Not tainted 2.6.33.3-85.fc13.x86_64 #1 0HR330/OptiPlex 745
    RIP: 0010:[] [] __put_cred+0xc/0x45
    RSP: 0018:ffff88019e7e9eb8 EFLAGS: 00010202
    RAX: 0000000000000001 RBX: ffff880161514480 RCX: 00000000ffffffff
    RDX: 00000000ffffffff RSI: ffff880140c690c0 RDI: ffff880140c690c0
    RBP: ffff88019e7e9eb8 R08: 00000000000000d0 R09: 0000000000000000
    R10: 0000000000000001 R11: 0000000000000040 R12: ffff880140c690c0
    R13: ffff88019e77aea0 R14: 00007fff336b0a5c R15: 0000000000000001
    FS: 00007f12f50d97c0(0000) GS:ffff880007400000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f8f461bc000 CR3: 00000001b26ce000 CR4: 00000000000006f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process master (pid: 2436, threadinfo ffff88019e7e8000, task ffff88019e77aea0)
    Stack:
    ffff88019e7e9ec8 ffffffff810698cd ffff88019e7e9ef8 ffffffff81069b45
    ffff880161514180 ffff880161514480 ffff880161514180 0000000000000000
    ffff88019e7e9f28 ffffffff8106aace 0000000000000001 0000000000000246
    Call Trace:
    [] put_cred+0x13/0x15
    [] commit_creds+0x16b/0x175
    [] set_current_groups+0x47/0x4e
    [] sys_setgroups+0xf6/0x105
    [] system_call_fastpath+0x16/0x1b
    Code: 48 8d 71 ff e8 7e 4e 15 00 85 c0 78 0b 8b 75 ec 48 89 df e8 ef 4a 15 00
    48 83 c4 18 5b c9 c3 55 8b 07 8b 07 48 89 e5 85 c0 74 04 0b eb fe 65 48 8b
    04 25 00 cc 00 00 48 3b b8 58 04 00 00 75
    RIP [] __put_cred+0xc/0x45
    RSP
    ---[ end trace df391256a100ebdd ]---

    Signed-off-by: David Howells
    Acked-by: Jiri Olsa
    Signed-off-by: Linus Torvalds

    David Howells
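
    The increment-unless-zero step described above can be sketched in
    userspace with C11 atomics. This is an illustration of the technique, not
    the kernel's atomic_inc_not_zero() implementation; the function name is
    hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Take a reference only if the count has not already dropped to zero. */
static bool ref_get_unless_zero(atomic_int *usage)
{
    int old = atomic_load(usage);

    while (old != 0) {
        /* On failure, old is refreshed with the current value and we retry. */
        if (atomic_compare_exchange_weak(usage, &old, old + 1))
            return true;
    }
    return false;   /* object is already on its way to being freed */
}
```

    A caller that gets false must re-read the pointer and try again, which is
    the loop the message refers to.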
     

28 May, 2010

1 commit

  • Now that task->signal can't go away get_nr_threads() doesn't need
    ->siglock to read signal->count.

    Also, make it inline, move into sched.h, and convert 2 other proc users of
    signal->count to use this (now trivial) helper.

    Henceforth get_nr_threads() is the only valid user of signal->count, we
    are ready to turn it into "int nr_threads" or, perhaps, kill it.

    Signed-off-by: Oleg Nesterov
    Cc: Alexey Dobriyan
    Cc: David Howells
    Cc: "Eric W. Biederman"
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

12 May, 2010

1 commit

  • Originally, commit d899bf7b ("procfs: provide stack information for
    threads") attempted to introduce a new feature for showing where the
    thread stack was located and how many pages are being utilized by the
    stack.

    Commit c44972f1 ("procfs: disable per-task stack usage on NOMMU") was
    applied to fix the NO_MMU case.

    Commit 89240ba0 ("x86, fs: Fix x86 procfs stack information for threads on
    64-bit") was applied to fix a bug in ia32 executables being loaded.

    Commit 9ebd4eba7 ("procfs: fix /proc/<pid>/stat stack pointer for kernel
    threads") was applied to fix a bug which had kernel threads printing a
    userland stack address.

    Commit 1306d603f ('proc: partially revert "procfs: provide stack
    information for threads"') was then applied to revert the stack pages
    being used to solve a significant performance regression.

    This patch nearly undoes the effect of all these patches.

    The reason for reverting these is it provides an unusable value in
    field 28. For x86_64, a fork will result in the task->stack_start
    value being updated to the current user top of stack and not the stack
    start address. This unpredictability of the stack_start value makes
    it worthless. That includes the intended use of showing how much stack
    space a thread has.

    Other architectures will get different values. As an example, ia64
    gets 0. The do_fork() and copy_process() functions appear to treat the
    stack_start and stack_size parameters as architecture specific.

    I only partially reverted c44972f1 ("procfs: disable per-task stack usage
    on NOMMU"). If I had completely reverted it, I would have had to change
    mm/Makefile to only build pagewalk.o when CONFIG_PROC_PAGE_MONITOR is
    configured. Since I could not test the builds without significant effort,
    I decided to not change mm/Makefile.

    I only partially reverted 89240ba0 ("x86, fs: Fix x86 procfs stack
    information for threads on 64-bit") . I left the KSTK_ESP() change in
    place as that seemed worthwhile.

    Signed-off-by: Robin Holt
    Cc: Stefani Seibold
    Cc: KOSAKI Motohiro
    Cc: Michal Simek
    Cc: Ingo Molnar
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include those
    headers directly instead of assuming availability. As this conversion
    needs to touch a large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

07 Mar, 2010

1 commit

  • Make sure the compiler won't do weird things with limits. E.g. fetching them
    twice may return 2 different values after writable limits are implemented.

    I.e. either use rlimit helpers added in commit 3e10e716abf3 ("resource:
    add helpers for fetching rlimits") or ACCESS_ONCE if not applicable.

    Signed-off-by: Jiri Slaby
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
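
    The hazard is a value that is checked and then used, with the compiler
    free to load it twice. The sketch below shows the snapshot pattern with
    an ACCESS_ONCE-style macro in GNU C; the struct and the clamp helper are
    hypothetical and only stand in for a real rlimit user.

```c
/* Force a single load of x; relies on the GNU C __typeof__ extension. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

struct rlimit_pair {
    unsigned long cur;   /* soft limit, may be rewritten concurrently */
    unsigned long max;   /* hard limit */
};

/* Clamp want to the soft limit, reading the limit exactly once. */
static unsigned long clamp_to_limit(struct rlimit_pair *rl, unsigned long want)
{
    unsigned long limit = ACCESS_ONCE(rl->cur);

    /* Both the comparison and the result use the same snapshot. */
    return want > limit ? limit : want;
}
```

    Without the snapshot, the comparison and the returned value could come
    from two different reads of rl->cur.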
     

25 Feb, 2010

1 commit

  • Add lockdep-ified RCU primitives to alloc_fd(), files_fdtable()
    and fcheck_files().

    Cc: Alexander Viro
    Signed-off-by: Paul E. McKenney
    Cc: laijs@cn.fujitsu.com
    Cc: dipankar@in.ibm.com
    Cc: mathieu.desnoyers@polymtl.ca
    Cc: josh@joshtriplett.org
    Cc: dvhltc@us.ibm.com
    Cc: niv@us.ibm.com
    Cc: peterz@infradead.org
    Cc: rostedt@goodmis.org
    Cc: Valdis.Kletnieks@vt.edu
    Cc: dhowells@redhat.com
    Cc: Alexander Viro
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
     

12 Jan, 2010

1 commit

  • Commit d899bf7b (procfs: provide stack information for threads) started
    showing stack information in /proc/{pid}/status, but it causes a large
    performance regression, because both taking mmap_sem and walking the
    page tables are heavy operations. Unfortunately /proc/{pid}/status is
    also used by the ps command, and ps is one of the most important components.

    With many processes running, ps performance is:

    [before d899bf7b]

    % perf stat ps >/dev/null

    Performance counter stats for 'ps':

    4090.435806 task-clock-msecs # 0.032 CPUs
    229 context-switches # 0.000 M/sec
    0 CPU-migrations # 0.000 M/sec
    234 page-faults # 0.000 M/sec
    8587565207 cycles # 2099.425 M/sec
    9866662403 instructions # 1.149 IPC
    3789415411 cache-references # 926.409 M/sec
    30419509 cache-misses # 7.437 M/sec

    128.859521955 seconds time elapsed

    [after d899bf7b]

    % perf stat ps > /dev/null

    Performance counter stats for 'ps':

    4305.081146 task-clock-msecs # 0.028 CPUs
    480 context-switches # 0.000 M/sec
    2 CPU-migrations # 0.000 M/sec
    237 page-faults # 0.000 M/sec
    9021211334 cycles # 2095.480 M/sec
    10605887536 instructions # 1.176 IPC
    3612650999 cache-references # 839.160 M/sec
    23917502 cache-misses # 5.556 M/sec

    152.277819582 seconds time elapsed

    Thus, this patch reverts it. Fortunately /proc/{pid}/task/{tid}/smaps
    provides almost the same information, so we can use that instead.

    Commit d899bf7b introduced two features:

    1) Add the annotation of [thread stack: xxxx] mark to
    /proc/{pid}/task/{tid}/maps.
    2) Add StackUsage field to /proc/{pid}/status.

    I only revert (2), because I haven't seen (1) cause a regression.

    Signed-off-by: KOSAKI Motohiro
    Cc: Stefani Seibold
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Cc: Randy Dunlap
    Cc: Andrew Morton
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

17 Dec, 2009

2 commits


06 Dec, 2009

1 commit

  • …/git/tip/linux-2.6-tip

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (35 commits)
    sched, cputime: Introduce thread_group_times()
    sched, cputime: Cleanups related to task_times()
    Revert "sched, x86: Optimize branch hint in __switch_to()"
    sched: Fix isolcpus boot option
    sched: Revert 498657a478c60be092208422fefa9c7b248729c2
    sched, time: Define nsecs_to_jiffies()
    sched: Remove task_{u,s,g}time()
    sched: Introduce task_times() to replace task_{u,s}time() pair
    sched: Limit the number of scheduler debug messages
    sched.c: Call debug_show_all_locks() when dumping all tasks
    sched, x86: Optimize branch hint in __switch_to()
    sched: Optimize branch hint in context_switch()
    sched: Optimize branch hint in pick_next_task_fair()
    sched_feat_write(): Update ppos instead of file->f_pos
    sched: Sched_rt_periodic_timer vs cpu hotplug
    sched, kvm: Fix race condition involving sched_in_preempt_notifers
    sched: More generic WAKE_AFFINE vs select_idle_sibling()
    sched: Cleanup select_task_rq_fair()
    sched: Fix granularity of task_u/stime()
    sched: Fix/add missing update_rq_clock() calls
    ...

    Linus Torvalds
     

03 Dec, 2009

1 commit

  • This is a real fix for the problem of utime/stime values decreasing,
    described in the thread:

    http://lkml.org/lkml/2009/11/3/522

    Now cputime is accounted in the following way:

    - {u,s}time in task_struct are increased every time when the thread
    is interrupted by a tick (timer interrupt).

    - When a thread exits, its {u,s}time are added to signal->{u,s}time,
    after being adjusted by task_times().

    - When all threads in a thread_group exit, the accumulated {u,s}time
    (and also c{u,s}time) in signal struct are added to c{u,s}time
    in signal struct of the group's parent.

    So {u,s}time in task struct are "raw" tick counts, while
    {u,s}time and c{u,s}time in signal struct are "adjusted" values.

    And accounted values are used by:

    - task_times(), to get cputime of a thread:
    This function returns adjusted values that originate from the raw
    {u,s}time, scaled by the sum_exec_runtime accounted by CFS.

    - thread_group_cputime(), to get cputime of a thread group:
    This function returns the sum of {u,s}time of all living threads in
    the group, plus {u,s}time in the signal struct, which is the sum of
    adjusted cputimes of all exited threads that belonged to the group.

    The problem is the return value of thread_group_cputime(),
    because it is a mixed sum of "raw" and "adjusted" values:

    group's {u,s}time = foreach(thread){{u,s}time} + exited({u,s}time)

    This misbehavior can break {u,s}time monotonicity.
    Assume there is a thread whose raw values are greater than its
    adjusted values (e.g. interrupted by 1000Hz ticks 50 times
    but only runs 45ms). When it exits, cputime will decrease (e.g.
    by 5ms).
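
    The 50-tick / 45ms case can be reproduced with a small model. This is a
    sketch, not the scheduler code: struct demo_thread and both functions are
    hypothetical, and times are plain milliseconds rather than cputime_t.

```c
#include <stdint.h>

/* Hypothetical per-thread numbers, in milliseconds. */
struct demo_thread {
	uint64_t raw_ms;      /* tick-based count */
	uint64_t adjusted_ms; /* value an adjusting function would report */
	int exited;           /* nonzero once the thread has exited */
};

/* Old scheme: raw values for living threads, adjusted values for exited
 * ones. The mixed sum can shrink when a thread exits. */
static uint64_t demo_group_time_mixed(const struct demo_thread *t, int n)
{
	uint64_t sum = 0;

	for (int i = 0; i < n; i++)
		sum += t[i].exited ? t[i].adjusted_ms : t[i].raw_ms;
	return sum;
}

/* Scheme with raw values everywhere: the sum does not change at exit. */
static uint64_t demo_group_time_raw(const struct demo_thread *t, int n)
{
	uint64_t sum = 0;

	for (int i = 0; i < n; i++)
		sum += t[i].raw_ms;
	return sum;
}
```

    With one thread at raw_ms = 50 and adjusted_ms = 45, the mixed sum is 50
    while the thread lives and 45 after it exits, whereas the raw sum stays
    at 50.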

    To fix this, we could do:

    group's {u,s}time = foreach(t){task_times(t)} + exited({u,s}time)

    But task_times() contains hard divisions, so applying it for
    every thread should be avoided.

    This patch fixes the above problem in the following way:

    - Modify thread's exit (= __exit_signal()) not to use task_times().
    This means {u,s}time in signal struct accumulates raw values instead
    of adjusted values. As a result, thread_group_cputime() returns a
    pure sum of "raw" values.

    - Introduce a new function thread_group_times(*task, *utime, *stime)
    that converts the "raw" values of thread_group_cputime() to "adjusted"
    values, using the same calculation procedure as task_times().

    - Modify group's exit (= wait_task_zombie()) to use the newly introduced
    thread_group_times(). This makes c{u,s}time in signal struct hold
    adjusted values, as before this patch.

    - Replace some thread_group_cputime() calls with thread_group_times().
    These replacements are applied only where "adjusted" cputime is
    conveyed to users and where task_times() is already used nearby
    (i.e. sys_times(), getrusage(), and /proc//stat).

    This patch has a positive side effect:

    - Before this patch, if a group contained many short-lived threads
    (e.g. each runs 0.9ms and is not interrupted by ticks), the group's
    cputime could be invisible, since each thread's cputime was accumulated
    after being adjusted: imagine the adjustment function as adj(ticks, runtime),
    {adj(0, 0.9) + adj(0, 0.9) + ....} = {0 + 0 + ....} = 0.
    After this patch this will not happen, because the adjustment is
    applied after accumulation.
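
    The difference between adjusting before and after accumulation can be
    illustrated as follows. The model is deliberately simplified and is an
    assumption of this example: demo_adj() just truncates a runtime in
    microseconds to whole 1ms ticks, which is enough to give adj(0, 0.9) = 0
    as in the text. All names are hypothetical.

```c
#include <stdint.h>

#define DEMO_US_PER_TICK 1000u /* assumes 1000Hz ticks */

/* Simplified stand-in for the adjustment of a thread that was never hit
 * by a tick: its runtime is truncated to whole ticks. */
static uint64_t demo_adj(uint64_t runtime_us)
{
	return runtime_us / DEMO_US_PER_TICK;
}

/* Before the patch: adjust each thread, then accumulate. */
static uint64_t demo_adjust_then_sum(const uint64_t *runtime_us, int n)
{
	uint64_t ticks = 0;

	for (int i = 0; i < n; i++)
		ticks += demo_adj(runtime_us[i]);
	return ticks;
}

/* After the patch: accumulate first, then adjust once. */
static uint64_t demo_sum_then_adjust(const uint64_t *runtime_us, int n)
{
	uint64_t total_us = 0;

	for (int i = 0; i < n; i++)
		total_us += runtime_us[i];
	return demo_adj(total_us);
}
```

    For ten threads of 900us each, adjusting first yields 0 ticks in total,
    while summing first yields 9 ticks.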

    v2:
    - remove if()s, put new variables into signal_struct.

    Signed-off-by: Hidetoshi Seto
    Acked-by: Peter Zijlstra
    Cc: Spencer Candland
    Cc: Americo Wang
    Cc: Oleg Nesterov
    Cc: Balbir Singh
    Cc: Stanislaw Gruszka
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Hidetoshi Seto
     

26 Nov, 2009

2 commits

  • Now all task_{u,s}time() pairs are replaced by task_times().
    And task_gtime() is too simple to be an inline function.

    Clean them all up.

    Signed-off-by: Hidetoshi Seto
    Acked-by: Peter Zijlstra
    Cc: Stanislaw Gruszka
    Cc: Spencer Candland
    Cc: Oleg Nesterov
    Cc: Balbir Singh
    Cc: Americo Wang
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Hidetoshi Seto
     
  • Functions task_{u,s}time() are called in pairs in almost all
    cases. However, task_stime() is implemented to call task_utime()
    internally, so such paired calls run task_utime() twice.

    It means we do heavy divisions (div_u64 + do_div) twice to get
    utime and stime which can be obtained at same time by one set
    of divisions.

    This patch introduces a function task_times(*tsk, *utime,
    *stime) to retrieve utime and stime at once in a better, optimized
    way.
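
    A rough sketch of getting both values from one division follows. It is
    not the kernel's task_times(): demo_task_times() is a hypothetical
    function using plain 64-bit integers, and it assumes the split is made
    in proportion to the tick counts, with stime taken as the remainder.

```c
#include <stdint.h>

/* Splits `rtime` (precise runtime) into user and system parts in
 * proportion to the tick-based counts. Only one division is performed;
 * stime is derived by subtraction instead of a second scaling pass.
 * Note: rtime * utime_ticks may overflow for very large inputs; this
 * sketch does not guard against that. */
static void demo_task_times(uint64_t utime_ticks, uint64_t stime_ticks,
			    uint64_t rtime, uint64_t *utime, uint64_t *stime)
{
	uint64_t total = utime_ticks + stime_ticks;
	uint64_t u;

	if (total)
		u = rtime * utime_ticks / total;
	else
		u = rtime; /* no ticks seen: attribute everything to utime */

	*utime = u;
	*stime = rtime - u;
}
```

    For example, 30 user ticks, 10 system ticks and a runtime of 80 give
    utime = 60 and stime = 20.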

    Signed-off-by: Hidetoshi Seto
    Acked-by: Peter Zijlstra
    Cc: Stanislaw Gruszka
    Cc: Spencer Candland
    Cc: Oleg Nesterov
    Cc: Balbir Singh
    Cc: Americo Wang
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Hidetoshi Seto