01 Sep, 2010

1 commit

  • [ 23.584719]
    [ 23.584720] ===================================================
    [ 23.585059] [ INFO: suspicious rcu_dereference_check() usage. ]
    [ 23.585176] ---------------------------------------------------
    [ 23.585176] kernel/pid.c:419 invoked rcu_dereference_check() without protection!
    [ 23.585176]
    [ 23.585176] other info that might help us debug this:
    [ 23.585176]
    [ 23.585176]
    [ 23.585176] rcu_scheduler_active = 1, debug_locks = 1
    [ 23.585176] 1 lock held by rc.sysinit/728:
    [ 23.585176] #0: (tasklist_lock){.+.+..}, at: [] sys_setpgid+0x5f/0x193
    [ 23.585176]
    [ 23.585176] stack backtrace:
    [ 23.585176] Pid: 728, comm: rc.sysinit Not tainted 2.6.36-rc2 #2
    [ 23.585176] Call Trace:
    [ 23.585176] [] lockdep_rcu_dereference+0x99/0xa2
    [ 23.585176] [] find_task_by_pid_ns+0x50/0x6a
    [ 23.585176] [] find_task_by_vpid+0x1d/0x1f
    [ 23.585176] [] sys_setpgid+0x67/0x193
    [ 23.585176] [] system_call_fastpath+0x16/0x1b
    [ 24.959669] type=1400 audit(1282938522.956:4): avc: denied { module_request } for pid=766 comm="hwclock" kmod="char-major-10-135" scontext=system_u:system_r:hwclock_t:s0 tcontext=system_u:system_r:kernel_t:s0 tclas

    It turns out that the setpgid() system call fails to enter an RCU
    read-side critical section before doing a PID-to-task_struct translation.
    This commit therefore does rcu_read_lock() before the translation, and
    also does rcu_read_unlock() after the last use of the returned pointer.

    Reported-by: Andrew Morton
    Signed-off-by: Paul E. McKenney
    Acked-by: David Howells

    Paul E. McKenney
     

16 Jul, 2010

9 commits

  • This patch adds the code to support the sys_prlimit64 syscall which
    modifies-and-returns the rlim values of a selected process atomically.
    The first parameter, pid, being 0 means current process.

    Unlike the current implementation, it is a generic,
    architecture-independent interface, so we no longer need to handle
    compat stuff. In the future, once glibc starts to use it, we can
    deprecate sys_setrlimit and sys_getrlimit and finally clean up the code.

    It also adds the possibility of changing the limits of other processes.
    We check the user's permissions to do that and, if the check succeeds,
    the new limits are propagated online. This is good for large-scale
    applications such as SAP or databases, where administrators need to
    change limits from time to time (e.g. increase the core size after
    crashes) and where restarting the service is unacceptable.

    For safety, all rlim users now either use accessors or don't need
    them due to
    - locking
    - the fact that a process was just forked and nobody else knows about
      it yet (so nobody can read or write its limits),
    hence it is safe to modify limits now.

    The limitation is that we currently stay at ulong internal
    representation. So the rlim64_is_infinity check is used where value is
    compared against ULONG_MAX on 32-bit which is the maximum value there.

    And since internally the limits are held in struct rlimit, converters
    which are used before and after do_prlimit call in sys_prlimit64 are
    introduced.

    Signed-off-by: Jiri Slaby

    Jiri Slaby
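
A minimal userspace sketch of the modify-and-return behaviour described in the commit above, calling the syscall directly. Assumptions: the kernel exposes SYS_prlimit64 and the limit pair is two 64-bit fields. The names ex_rlimit64, ex_prlimit64 and ex_disable_core are hypothetical and exist only for this illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/resource.h>   /* RLIMIT_CORE */
#include <sys/syscall.h>    /* SYS_prlimit64 */
#include <sys/types.h>
#include <unistd.h>         /* syscall() */

/* Assumed layout of the 64-bit limit pair passed to the syscall. */
struct ex_rlimit64 {
    uint64_t rlim_cur;
    uint64_t rlim_max;
};

/* Thin wrapper: either pointer may be NULL, pid 0 means the caller. */
long ex_prlimit64(pid_t pid, int resource,
                  const struct ex_rlimit64 *new_rlim,
                  struct ex_rlimit64 *old_rlim)
{
    return syscall(SYS_prlimit64, pid, resource, new_rlim, old_rlim);
}

/* Lower the caller's soft core limit to 0 and report the previous pair. */
long ex_disable_core(struct ex_rlimit64 *old_rlim)
{
    struct ex_rlimit64 next;

    assert(old_rlim != NULL);
    if (ex_prlimit64(0, RLIMIT_CORE, NULL, &next) != 0)
        return -1;
    next.rlim_cur = 0;      /* keep the hard limit unchanged */
    /* One call installs the new pair and returns the old one. */
    return ex_prlimit64(0, RLIMIT_CORE, &next, old_rlim);
}
```

Passing a non-zero pid in place of 0 would target another process, subject to the permission check the commit message mentions.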
     
  • After we added more generic do_prlimit, switch sys_getrlimit to that.
    Also switch compat handling, so we can get rid of ugly __user casts
    and avoid setting process' address limit to kernel data and back.

    Signed-off-by: Jiri Slaby

    Jiri Slaby
     
  • It now also allows reading of limits, i.e. all reads and writes will
    later use this function.

    It takes two parameters, new and old limits, both of which can be NULL.
    If new is non-NULL, the value in it is set to rlimits.
    If old is non-NULL, current rlimits are stored there.
    If both are non-NULL, old are stored prior to setting the new ones,
    atomically.
    (Similar to sigaction.)

    Signed-off-by: Jiri Slaby

    Jiri Slaby
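
The new/old semantics listed in the commit above can be modelled in userspace as follows. This is only a sketch under stated assumptions: struct limit_slot and model_prlimit are hypothetical names, a pthread mutex stands in for the kernel's task lock, and the -EPERM branch is a simplification (the model has no notion of privilege).

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/resource.h>

/* Hypothetical stand-in for one resource slot of a task. */
struct limit_slot {
    pthread_mutex_t lock;   /* plays the role of task_lock() */
    struct rlimit rlim;
};

/*
 * Store the current pair in *old_rlim (if non-NULL), then install
 * *new_rlim (if non-NULL), both under the same lock.
 */
int model_prlimit(struct limit_slot *slot, const struct rlimit *new_rlim,
                  struct rlimit *old_rlim)
{
    int ret = 0;

    assert(slot != NULL);
    if (new_rlim && new_rlim->rlim_cur > new_rlim->rlim_max)
        return -EINVAL;

    pthread_mutex_lock(&slot->lock);
    /* Checked under the lock so a concurrent lowering cannot be missed. */
    if (new_rlim && new_rlim->rlim_max > slot->rlim.rlim_max)
        ret = -EPERM;
    if (ret == 0) {
        if (old_rlim)
            *old_rlim = slot->rlim;
        if (new_rlim)
            slot->rlim = *new_rlim;
    }
    pthread_mutex_unlock(&slot->lock);
    return ret;
}
```

Calling it with only old_rlim set acts as a read, with only new_rlim as a write, and with both as an atomic swap, which is the sigaction-like shape the message describes.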
     
  • Do security_task_setrlimit under task_lock. Other tasks may change
    limits under our hands while we are checking limits inside the
    function. From now on, they can't.

    Note that all the security work is now done under a spinlock here.
    Security hooks already cope with that: they are called from interrupt
    context (like security_task_kill) and with spinlocks already held (e.g.
    capable->security_capable).

    Signed-off-by: Jiri Slaby
    Acked-by: James Morris
    Cc: Heiko Carstens

    Jiri Slaby
     
  • Add locking to allow setrlimit to accept a task parameter other than
    current.

    Namely, lock tasklist_lock for read and check whether the task
    structure has a non-null sighand. Do all the signal processing with
    that lock still held.

    There are some points:
    1) security_task_setrlimit is now called with that lock held. This is
    not new, many security_* functions are called with this lock held
    already so it doesn't harm (all this security_* stuff does almost
    the same).
    2) task->sighand->siglock (in update_rlimit_cpu) is nested in
    tasklist_lock. This dependence is already existing.
    3) tsk->alloc_lock is nested in tasklist_lock. This is OK too, already
    existing dependence.

    Signed-off-by: Jiri Slaby
    Cc: Oleg Nesterov

    Jiri Slaby
     
  • Create do_setrlimit from sys_setrlimit and declare do_setrlimit
    in the resource header. This is the first phase towards a generic
    do_prlimit that can be called from the read, write and compat
    rlimits code.

    The new do_setrlimit also accepts a task pointer to change the limits
    of. Currently, it cannot be other than current, but this will change
    with locking later.

    Also pass tsk->group_leader to security_task_setrlimit to check
    whether current is allowed to change rlimits of the process and not
    its arbitrary thread because it makes more sense given that rlimit are
    per process and not per-thread.

    Signed-off-by: Jiri Slaby

    Jiri Slaby
     
  • Mostly preparation for Jiri's changes, but probably makes sense anyway.

    sys_setrlimit() checks new_rlim.rlim_max <= old_rlim->rlim_max, but by
    the time it takes task_lock(), old_rlim->rlim_max may already have been
    lowered. Move this check under task_lock().

    Currently this is not important, we can only race with our sub-thread,
    this means the application is stupid. But when we change the code to allow
    the update of !current task's limits, it becomes important to make sure
    ->rlim_max can be lowered "reliably" even if we race with the application
    doing sys_setrlimit().

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Jiri Slaby

    Oleg Nesterov
     
  • Add task_struct as a parameter to update_rlimit_cpu to be able to set
    rlimit_cpu of a task other than current.

    Signed-off-by: Jiri Slaby
    Acked-by: James Morris

    Jiri Slaby
     
  • Add task_struct to task_setrlimit of security_operations to be able to set
    rlimit of task other than current.

    Signed-off-by: Jiri Slaby
    Acked-by: Eric Paris
    Acked-by: James Morris

    Jiri Slaby
     

28 May, 2010

1 commit

  • About 6 months ago, I made a set of changes to how the core-dump-to-a-pipe
    feature in the kernel works. We had reports of several races, including
    some reports of apps bypassing our recursion check so that a process that
    was forked as part of a core_pattern setup could infinitely crash and
    refork until the system crashed.

    We fixed those by improving our recursion checks. The new check basically
    refuses to fork a process if its core limit is zero, which works well.

    Unfortunately, I've been getting grief from maintainers of user space
    programs that are inserted as the forked process of core_pattern. They
    contend that in order for their programs (such as abrt and apport) to
    work, all the running processes in a system must have their core limits
    set to a non-zero value, to which I say 'yes'. I did this by design, and
    think that's the right way to do things.

    But I've been asked to ease this burden on user space enough times that I
    thought I would take a look at it. The first suggestion was to make the
    recursion check fail on a non-zero 'special' number, like one. That way
    the core collector process could set its core size ulimit to 1, and enable
    the kernel's recursion detection. This isn't a bad idea on the surface,
    but I don't like it since it's opt-in, in that if a program like abrt or
    apport has a bug and fails to set such a core limit, we're left with a
    recursively crashing system again.

    So I've come up with this. What I've done is modify the
    call_usermodehelper api such that an extra parameter is added, a function
    pointer which will be called by the user helper task, after it forks, but
    before it exec's the required process. This will give the caller the
    opportunity to get a callback in the process's context, allowing it to do
    whatever it needs to do to the process in the kernel prior to exec-ing the
    user space code. In the case of do_coredump, this callback is used to set
    the core ulimit of the helper process to 1. This eliminates the opt-in
    problem that I had above, as it allows the ulimit for core sizes to be set
    to the value of 1, which is what the recursion check looks for in
    do_coredump.

    This patch:

    Create a new function call_usermodehelper_fns() and allow it to assign
    both an init and a cleanup function, as well as arbitrary data.

    The init function is called from the context of the forked process and
    allows for customization of the helper process prior to calling exec. Its
    return code gates the continuation of the process, or causes its exit.
    Also add an arbitrary data pointer to the subprocess_info struct allowing
    for data to be passed from the caller to the new process, and the
    subsequent cleanup process.

    Also, use this patch to clean up the cleanup function. It currently takes
    an argp and envp pointer for freeing, which is ugly. Let's instead just
    make the subprocess_info structure public, and pass that to the cleanup
    and init routines.

    Signed-off-by: Neil Horman
    Reviewed-by: Oleg Nesterov
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Horman
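
The fork, init hook, exec sequence described above has a simple userspace analogue, sketched below. It is not the kernel API: run_helper, helper_init_fn, limit_core_init and refuse_init are hypothetical names, and the exit codes 126 and 127 are arbitrary choices for this illustration. As in the commit message, the hook's return value decides whether the child goes on to exec.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical hook type: runs in the child, after fork and before exec. */
typedef int (*helper_init_fn)(void *data);

/* Example hook: cap the soft core limit at 1 (or the hard limit if lower). */
int limit_core_init(void *data)
{
    struct rlimit rl;

    (void)data;
    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;
    rl.rlim_cur = rl.rlim_max < 1 ? rl.rlim_max : 1;
    return setrlimit(RLIMIT_CORE, &rl);
}

/* Example hook that always refuses, to show the gating. */
int refuse_init(void *data)
{
    (void)data;
    return -1;
}

/*
 * Fork, run init() in the child, and exec only if it returned 0.
 * Returns the child's exit status, or -1 on fork/wait failure.
 */
int run_helper(char *const argv[], helper_init_fn init, void *data)
{
    int status;
    pid_t pid;

    assert(argv != NULL && argv[0] != NULL);
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (init && init(data) != 0)
            _exit(126);         /* init failed: never exec */
        execv(argv[0], argv);
        _exit(127);             /* exec itself failed */
    }
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Because the hook runs inside the child, the limit is applied by the caller's code rather than by the helper program, which is what removes the opt-in problem.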
     

06 May, 2010

1 commit


25 Apr, 2010

1 commit

  • On ppc64 you get this error:

    $ setarch ppc -R true
    setarch: ppc: Unrecognized architecture

    because uname still reports ppc64 as the machine.

    So mask off the personality flags when checking for PER_LINUX32.

    Signed-off-by: Andreas Schwab
    Reviewed-by: Christoph Hellwig
    Acked-by: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andreas Schwab
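
A small sketch of why the mask matters in the commit above. The constant values mirror the kernel's personality definitions (PER_MASK 0x00ff, PER_LINUX32 0x0008, ADDR_NO_RANDOMIZE 0x0040000); the EX_ names and the helper functions are local to this illustration and are not the kernel's code.

```c
#include <assert.h>
#include <stdbool.h>

#define EX_PER_MASK          0x00ffUL
#define EX_PER_LINUX32       0x0008UL
#define EX_ADDR_NO_RANDOMIZE 0x0040000UL   /* the flag `setarch -R` sets */

/* Post-fix comparison: ignore the flag bits above the base personality. */
bool is_linux32(unsigned long persona)
{
    return (persona & EX_PER_MASK) == EX_PER_LINUX32;
}

/* Pre-fix comparison: any extra flag bit makes the test fail. */
bool is_linux32_unmasked(unsigned long persona)
{
    return persona == EX_PER_LINUX32;
}

/* The `setarch ppc -R` case from the commit message. */
void check_setarch_case(void)
{
    unsigned long persona = EX_PER_LINUX32 | EX_ADDR_NO_RANDOMIZE;

    assert(is_linux32(persona));
    assert(!is_linux32_unmasked(persona));
}
```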
     

12 Apr, 2010

2 commits


30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include those
    headers directly instead of assuming availability. As this conversion
    needs to touch a large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to place the new include so that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers, which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

13 Mar, 2010

2 commits

  • Add generic implementations of the old and really old uname system calls.
    Note that sh only implements sys_olduname but not sys_oldolduname, but I'm
    not going to bother with another ifdef for that special case.

    m32r implemented an old uname but never wired it up, so kill it, too.

    Signed-off-by: Christoph Hellwig
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Hirokazu Takata
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc: Al Viro
    Cc: Arnd Bergmann
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: "Luck, Tony"
    Cc: James Morris
    Cc: Andreas Schwab
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • On an architecture that supports 32-bit compat we need to override the
    reported machine in uname with the 32-bit value. Instead of doing this
    separately in every architecture introduce a COMPAT_UTS_MACHINE define in
    and apply it directly in sys_newuname().

    Signed-off-by: Christoph Hellwig
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Hirokazu Takata
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc: Al Viro
    Cc: Arnd Bergmann
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: "Luck, Tony"
    Cc: James Morris
    Cc: Andreas Schwab
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

07 Mar, 2010

1 commit

  • Make sure compiler won't do weird things with limits. E.g. fetching them
    twice may return 2 different values after writable limits are implemented.

    I.e. either use rlimit helpers added in commit 3e10e716abf3 ("resource:
    add helpers for fetching rlimits") or ACCESS_ONCE if not applicable.

    Signed-off-by: Jiri Slaby
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: john stultz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
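
A sketch of the single-fetch pattern the commit above asks for. The macro has the same shape as the kernel's ACCESS_ONCE of that era; __typeof__ is a GCC/Clang extension, and fits_soft_limit is a hypothetical helper written only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>

/* Force exactly one load of x; the compiler may not re-read it later. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/* Hypothetical check: does `size` fit under the soft limit? */
int fits_soft_limit(struct rlimit *rl, rlim_t size)
{
    rlim_t limit;

    assert(rl != NULL);
    /* Snapshot once; both tests below see the same value. */
    limit = ACCESS_ONCE(rl->rlim_cur);
    if (limit == RLIM_INFINITY)
        return 1;
    return size <= limit;
}
```

Without the snapshot, the comparison against RLIM_INFINITY and the comparison against size could observe two different values once another task is allowed to write the limit.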
     

01 Mar, 2010

1 commit

  • …/git/tip/linux-2.6-tip

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (25 commits)
    sched: Fix SCHED_MC regression caused by change in sched cpu_power
    sched: Don't use possibly stale sched_class
    kthread, sched: Remove reference to kthread_create_on_cpu
    sched: cpuacct: Use bigger percpu counter batch values for stats counters
    percpu_counter: Make __percpu_counter_add an inline function on UP
    sched: Remove member rt_se from struct rt_rq
    sched: Change usage of rt_rq->rt_se to rt_rq->tg->rt_se[cpu]
    sched: Remove unused update_shares_locked()
    sched: Use for_each_bit
    sched: Queue a deboosted task to the head of the RT prio queue
    sched: Implement head queueing for sched_rt
    sched: Extend enqueue_task to allow head queueing
    sched: Remove USER_SCHED
    sched: Fix the place where group powers are updated
    sched: Assume *balance is valid
    sched: Remove load_balance_newidle()
    sched: Unify load_balance{,_newidle}()
    sched: Add a lock break for PREEMPT=y
    sched: Remove from fwd decls
    sched: Remove rq_iterator from move_one_task
    ...

    Fix up trivial conflicts in kernel/sched.c

    Linus Torvalds
     

23 Feb, 2010

1 commit


21 Jan, 2010

1 commit

  • Remove the USER_SCHED feature. It has been scheduled to be removed in
    2.6.34 as per http://marc.info/?l=linux-kernel&m=125728479022976&w=2

    Signed-off-by: Dhaval Giani
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Dhaval Giani
     

20 Dec, 2009

1 commit


16 Dec, 2009

1 commit


11 Dec, 2009

1 commit

  • commit c69e8d9 (CRED: Use RCU to access another task's creds and to
    release a task's own creds) added non rcu_read_lock() protected access
    to task creds of the target task in set_prio_one().

    The comment above the function says:
    * - the caller must hold the RCU read lock

    The calling code in sys_setpriority does read_lock(&tasklist_lock) but
    not rcu_read_lock(). This works only when CONFIG_TREE_PREEMPT_RCU=n.
    With CONFIG_TREE_PREEMPT_RCU=y the rcu_callbacks can run in the tick
    interrupt when they see no read side critical section.

    There is another instance of __task_cred() in sys_setpriority() itself
    which is equally unprotected.

    Wrap the whole code section into a rcu read side critical section to
    fix this quick and dirty.

    Will be revisited in course of the read_lock(&tasklist_lock) -> rcu
    crusade.

    Oleg noted further:

    This also fixes another bug here. find_task_by_vpid() is not safe
    without rcu_read_lock(). I do not mean it is not safe to use the
    result, just find_pid_ns() by itself is not safe.

    Usually tasklist gives enough protection, but if copy_process() fails
    it calls free_pid() lockless and does call_rcu(delayed_put_pid).
    This means, without rcu lock find_pid_ns() can't scan the hash table
    safely.

    Signed-off-by: Thomas Gleixner
    LKML-Reference:
    Acked-by: Paul E. McKenney

    Thomas Gleixner
     

10 Dec, 2009

1 commit


03 Dec, 2009

1 commit

  • This is a real fix for problem of utime/stime values decreasing
    described in the thread:

    http://lkml.org/lkml/2009/11/3/522

    Now cputime is accounted in the following way:

    - {u,s}time in task_struct are increased every time when the thread
    is interrupted by a tick (timer interrupt).

    - When a thread exits, its {u,s}time are added to signal->{u,s}time,
    after adjusted by task_times().

    - When all threads in a thread_group exit, accumulated {u,s}time
    (and also c{u,s}time) in signal struct are added to c{u,s}time
    in signal struct of the group's parent.

    So {u,s}time in task struct are "raw" tick count, while
    {u,s}time and c{u,s}time in signal struct are "adjusted" values.

    And accounted values are used by:

    - task_times(), to get cputime of a thread:
    This function returns adjusted values that originate from raw
    {u,s}time and are scaled by the sum_exec_runtime accounted by CFS.

    - thread_group_cputime(), to get cputime of a thread group:
    This function returns the sum of all {u,s}time of living threads in
    the group, plus {u,s}time in the signal struct, which is the sum of
    adjusted cputimes of all exited threads that belonged to the group.

    The problem is the return value of thread_group_cputime(),
    because it is a mixed sum of "raw" values and "adjusted" values:

    group's {u,s}time = foreach(thread){{u,s}time} + exited({u,s}time)

    This misbehavior can break {u,s}time monotonicity.
    Assume there is a thread that has raw values greater
    than adjusted values (e.g. interrupted by 1000Hz ticks 50 times
    but only runs 45ms); if it exits, cputime will decrease (e.g.
    -5ms).

    To fix this, we could do:

    group's {u,s}time = foreach(t){task_times(t)} + exited({u,s}time)

    But task_times() contains hard divisions, so applying it for
    every thread should be avoided.

    This patch fixes the above problem in the following way:

    - Modify thread's exit (= __exit_signal()) not to use task_times().
    It means {u,s}time in signal struct accumulates raw values instead
    of adjusted values. As a result, thread_group_cputime()
    returns a pure sum of "raw" values.

    - Introduce a new function thread_group_times(*task, *utime, *stime)
    that converts "raw" values of thread_group_cputime() to "adjusted"
    values, using the same calculation procedure as task_times().

    - Modify group's exit (= wait_task_zombie()) to use the newly introduced
    thread_group_times(). It makes c{u,s}time in signal struct
    hold adjusted values as before this patch.

    - Replace some thread_group_cputime() calls by thread_group_times().
    These replacements are only applied where the code conveys the "adjusted"
    cputime to users, and where it already uses task_times() nearby
    (i.e. sys_times(), getrusage(), and /proc/<pid>/stat).

    This patch has a positive side effect:

    - Before this patch, if a group contains many short-lived threads
    (e.g. each runs 0.9ms and is not interrupted by ticks), the group's
    cputime could be invisible since each thread's cputime was accumulated
    after adjustment: imagine the adjustment function as adj(ticks, runtime),
    {adj(0, 0.9) + adj(0, 0.9) + ....} = {0 + 0 + ....} = 0.
    After this patch it will not happen because the adjustment is
    applied after accumulation.

    v2:
    - remove if()s, put new variables into signal_struct.

    Signed-off-by: Hidetoshi Seto
    Acked-by: Peter Zijlstra
    Cc: Spencer Candland
    Cc: Americo Wang
    Cc: Oleg Nesterov
    Cc: Balbir Singh
    Cc: Stanislaw Gruszka
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Hidetoshi Seto
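
The 50-ticks-versus-45ms example from the commit above can be worked through numerically. The sketch below is a simplified model, not the kernel code: adjust_utime splits the measured runtime by the raw tick ratio and omits any clamping the real task_times() may apply, and all names (thread_sample, group_utime_mixed, group_utime_fixed) are hypothetical. Values are in milliseconds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified adjustment: scale measured runtime by the raw utime share. */
uint64_t adjust_utime(uint64_t raw_utime, uint64_t raw_stime, uint64_t runtime)
{
    uint64_t total = raw_utime + raw_stime;

    return total ? runtime * raw_utime / total : runtime;
}

struct thread_sample {
    uint64_t utime, stime;  /* raw tick-based values */
    uint64_t runtime;       /* measured run time */
    int exited;
};

/* Old scheme: raw values for living threads plus adjusted values for exited ones. */
uint64_t group_utime_mixed(const struct thread_sample *t, size_t n)
{
    uint64_t sum = 0;

    assert(t != NULL);
    for (size_t i = 0; i < n; i++)
        sum += t[i].exited ? adjust_utime(t[i].utime, t[i].stime, t[i].runtime)
                           : t[i].utime;
    return sum;
}

/* New scheme: accumulate raw values for every thread, adjust the total once. */
uint64_t group_utime_fixed(const struct thread_sample *t, size_t n)
{
    uint64_t u = 0, s = 0, rt = 0;

    assert(t != NULL);
    for (size_t i = 0; i < n; i++) {
        u += t[i].utime;
        s += t[i].stime;
        rt += t[i].runtime;
    }
    return adjust_utime(u, s, rt);
}
```

With a single thread of raw utime 50 and runtime 45, the mixed sum reports 50 while the thread lives and 45 after it exits, whereas the fixed scheme reports 45 in both states.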
     

26 Nov, 2009

1 commit

  • Functions task_{u,s}time() are called in pairs in almost all
    cases. However, task_stime() is implemented to call task_utime()
    internally, so such paired calls run task_utime() twice.

    It means we do heavy divisions (div_u64 + do_div) twice to get
    utime and stime, which can be obtained at the same time by one set
    of divisions.

    This patch introduces a function task_times(*tsk, *utime,
    *stime) to retrieve utime and stime at once in a better, optimized
    way.

    Signed-off-by: Hidetoshi Seto
    Acked-by: Peter Zijlstra
    Cc: Stanislaw Gruszka
    Cc: Spencer Candland
    Cc: Oleg Nesterov
    Cc: Balbir Singh
    Cc: Americo Wang
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Hidetoshi Seto
     

29 Oct, 2009

2 commits

  • * 'hwpoison-2.6.32' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6:
    HWPOISON: fix invalid page count in printk output
    HWPOISON: Allow schedule_on_each_cpu() from keventd
    HWPOISON: fix/proc/meminfo alignment
    HWPOISON: fix oops on ksm pages
    HWPOISON: Fix page count leak in hwpoison late kill in do_swap_page
    HWPOISON: return early on non-LRU pages
    HWPOISON: Add brief hwpoison description to Documentation
    HWPOISON: Clean up PR_MCE_KILL interface

    Linus Torvalds
     
  • Since commit 02b51df1b07b4e9ca823c89284e704cadb323cd1 (proc connector: add
    event for process becoming session leader) we have the following warning:

    Badness at kernel/softirq.c:143
    [...]
    Krnl PSW : 0404c00180000000 00000000001481d4 (local_bh_enable+0xb0/0xe0)
    [...]
    Call Trace:
    ([] 0x13fe04100)
    [] sk_filter+0x9a/0xd0
    [] netlink_broadcast+0x2c0/0x53c
    [] cn_netlink_send+0x272/0x2b0
    [] proc_sid_connector+0xc4/0xd4
    [] __set_special_pids+0x58/0x90
    [] sys_setsid+0xb4/0xd8
    [] sysc_noemu+0x10/0x16
    [] 0x41616cb266

    The warning is
    ---> WARN_ON_ONCE(in_irq() || irqs_disabled());

    The network code must not be called with disabled interrupts but
    sys_setsid holds the tasklist_lock with spinlock_irq while calling the
    connector.

    After a discussion we agreed that we can move proc_sid_connector from
    __set_special_pids to sys_setsid.

    We also agreed that it is sufficient to change the check from
    task_session(curr) != pid into err > 0, since if we don't change the
    session, this means we were already the leader and return -EPERM.

    One last thing:
    There is also daemonize(), and some people might want to get a
    notification in that case. Since daemonize() is only needed if a user
    space does kernel_thread this does not look important (and there seems
    to be no consensus if this connector should be called in daemonize). If
    we really want this, we can add proc_sid_connector to daemonize() in an
    additional patch (Scott?)

    Signed-off-by: Christian Borntraeger
    Cc: Scott James Remnant
    Cc: Matt Helsley
    Cc: David S. Miller
    Acked-by: Oleg Nesterov
    Acked-by: Evgeniy Polyakov
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christian Borntraeger
     

14 Oct, 2009

1 commit


04 Oct, 2009

1 commit

  • While writing the manpage I noticed some shortcomings in the
    current interface.

    - Define symbolic names for all the different values
    - Boundary check the kill mode values
    - For symmetry add a get interface too. This allows library
    code to get/set the current state.
    - For consistency define a PR_MCE_KILL_DEFAULT value

    Signed-off-by: Andi Kleen

    Andi Kleen
     

24 Sep, 2009

1 commit

  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
    HWPOISON: Enable error_remove_page on btrfs
    HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
    HWPOISON: Add madvise() based injector for hardware poisoned pages v4
    HWPOISON: Enable error_remove_page for NFS
    HWPOISON: Enable .remove_error_page for migration aware file systems
    HWPOISON: The high level memory error handler in the VM v7
    HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
    HWPOISON: shmem: call set_page_dirty() with locked page
    HWPOISON: Define a new error_remove_page address space op for async truncation
    HWPOISON: Add invalidate_inode_page
    HWPOISON: Refactor truncate to allow direct truncating of page v2
    HWPOISON: check and isolate corrupted free pages v2
    HWPOISON: Handle hardware poisoned pages in try_to_unmap
    HWPOISON: Use bitmask/action code for try_to_unmap behaviour
    HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
    HWPOISON: Add poison check to page fault handling
    HWPOISON: Add basic support for poisoned pages in fault handler v3
    HWPOISON: Add new SIGBUS error codes for hardware poison signals
    HWPOISON: Add support for poison swap entries v2
    HWPOISON: Export some rmap vma locking to outside world
    ...

    Linus Torvalds
     

23 Sep, 2009

1 commit

  • Make the ->ru_maxrss value in struct rusage be filled according to the
    rss hiwater mark. This struct is filled as a parameter to the getrusage
    syscall. The ->ru_maxrss value is set in KBs, which is the way it is done
    in BSD systems. The /usr/bin/time (gnu time) application converts
    ->ru_maxrss to KBs, which seems to be incorrect behavior. The maintainer
    of this util was notified by me with a patch which corrects it and cc'ed.

    To make this happen we extend struct signal_struct by two fields. The
    first one is ->maxrss which we use to store rss hiwater of the task. The
    second one is ->cmaxrss which we use to store the highest rss hiwater of
    all of the task's children. These values are used in k_getrusage() to
    actually fill ->ru_maxrss. k_getrusage() uses the current rss hiwater
    value directly if the mm struct exists.

    Note:
    exec() clears mm->hiwater_rss, but doesn't clear sig->maxrss.
    This is intentional behavior: *BSD getrusage has exec() inheriting.
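
A minimal userspace sketch of reading the field this commit fills in, assuming a kernel that includes the change. The helper names self_maxrss_kb and children_maxrss_kb are hypothetical; getrusage() with RUSAGE_SELF and RUSAGE_CHILDREN is the standard interface.

```c
#include <assert.h>
#include <sys/resource.h>

/* Peak RSS of the calling process in kilobytes, or -1 on error. */
long self_maxrss_kb(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    assert(ru.ru_maxrss >= 0);
    return ru.ru_maxrss;
}

/* Highest peak RSS among waited-for children, in kilobytes, or -1 on error. */
long children_maxrss_kb(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_CHILDREN, &ru) != 0)
        return -1;
    return ru.ru_maxrss;
}
```

The children value only reflects children that have been reaped with wait(), which is the behaviour the zombie test case in the programs below exercises.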

    test programs
    ========================================================

    getrusage.c
    ===========
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    #include "common.h"

    #define err(str) perror(str), exit(1)

    int main(int argc, char** argv)
    {
        int status;

        printf("allocate 100MB\n");
        consume(100);

        printf("testcase1: fork inherit? \n");
        printf(" expect: initial.self ~= child.self\n");
        show_rusage("initial");
        if (__fork()) {
            wait(&status);
        } else {
            show_rusage("fork child");
            _exit(0);
        }
        printf("\n");

        printf("testcase2: fork inherit? (cont.) \n");
        printf(" expect: initial.children ~= 100MB, but child.children = 0\n");
        show_rusage("initial");
        if (__fork()) {
            wait(&status);
        } else {
            show_rusage("child");
            _exit(0);
        }
        printf("\n");

        printf("testcase3: fork + malloc \n");
        printf(" expect: child.self ~= initial.self + 50MB\n");
        show_rusage("initial");
        if (__fork()) {
            wait(&status);
        } else {
            printf("allocate +50MB\n");
            consume(50);
            show_rusage("fork child");
            _exit(0);
        }
        printf("\n");

        printf("testcase4: grandchild maxrss\n");
        printf(" expect: post_wait.children ~= 300MB\n");
        show_rusage("initial");
        if (__fork()) {
            wait(&status);
            show_rusage("post_wait");
        } else {
            system("./child -n 0 -g 300");
            _exit(0);
        }
        printf("\n");

        printf("testcase5: zombie\n");
        printf(" expect: pre_wait ~= initial, IOW the zombie process is not accounted.\n");
        printf(" post_wait ~= 400MB, IOW wait() collect child's max_rss. \n");
        show_rusage("initial");
        if (__fork()) {
            sleep(1); /* children become zombie */
            show_rusage("pre_wait");
            wait(&status);
            show_rusage("post_wait");
        } else {
            system("./child -n 400");
            _exit(0);
        }
        printf("\n");

        printf("testcase6: SIG_IGN\n");
        printf(" expect: initial ~= after_zombie (child's 500MB alloc should be ignored).\n");
        show_rusage("initial");
        signal(SIGCHLD, SIG_IGN);
        if (__fork()) {
            sleep(1); /* children become zombie */
            show_rusage("after_zombie");
        } else {
            system("./child -n 500");
            _exit(0);
        }
        printf("\n");
        signal(SIGCHLD, SIG_DFL);

        printf("testcase7: exec (without fork) \n");
        printf(" expect: initial ~= exec \n");
        show_rusage("initial");
        execl("./child", "child", "-v", NULL);

        return 0;
    }

    child.c
    =======
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <sys/time.h>
    #include <sys/resource.h>

    #include "common.h"

    int main(int argc, char** argv)
    {
        int status;
        int c;
        long consume_size = 0;
        long grandchild_consume_size = 0;
        int show = 0;

        while ((c = getopt(argc, argv, "n:g:v")) != -1) {
            switch (c) {
            case 'n':
                consume_size = atol(optarg);
                break;
            case 'v':
                show = 1;
                break;
            case 'g':
                grandchild_consume_size = atol(optarg);
                break;
            default:
                break;
            }
        }

        if (show)
            show_rusage("exec");

        if (consume_size) {
            printf("child alloc %ldMB\n", consume_size);
            consume(consume_size);
        }

        if (grandchild_consume_size) {
            if (fork()) {
                wait(&status);
            } else {
                printf("grandchild alloc %ldMB\n", grandchild_consume_size);
                consume(grandchild_consume_size);

                exit(0);
            }
        }

        return 0;
    }

    common.c
    ========
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/time.h>
    #include <sys/resource.h>

    #include "common.h"
    #define err(str) perror(str), exit(1)

    void show_rusage(char *prefix)
    {
        int err, err2;
        struct rusage rusage_self;
        struct rusage rusage_children;

        printf("%s: ", prefix);
        err = getrusage(RUSAGE_SELF, &rusage_self);
        if (!err)
            printf("self %ld ", rusage_self.ru_maxrss);
        err2 = getrusage(RUSAGE_CHILDREN, &rusage_children);
        if (!err2)
            printf("children %ld ", rusage_children.ru_maxrss);

        printf("\n");
    }

    /* Some buggy OS need this worthless CPU waste. */
    void make_pagefault(void)
    {
        void *addr;
        int size = getpagesize();
        int i;

        for (i=0; i

    Signed-off-by: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Pirko
     

21 Sep, 2009

1 commit

  • Bye-bye Performance Counters, welcome Performance Events!

    In the past few months the perfcounters subsystem has outgrown its
    initial role of counting hardware events, and has become (and is
    becoming) a much broader generic event enumeration, reporting, logging,
    monitoring, and analysis facility.

    Naming its core object 'perf_counter' and naming the subsystem
    'perfcounters' has become more and more of a misnomer. With pending
    code like hw-breakpoints support the 'counter' name is less and
    less appropriate.

    All in all, we've decided to rename the subsystem to 'performance
    events' and to propagate this rename through all fields, variables
    and API names. (in an ABI compatible fashion)

    The word 'event' is also a bit shorter than 'counter' - which makes
    it slightly more convenient to write/handle as well.

    Thanks go to Stephane Eranian who first observed this misnomer and
    suggested a rename.

    User-space tooling and ABI compatibility are not affected - this patch
    should be function-invariant. (Also, defconfigs were not touched to
    keep the size down.)

    This patch has been generated via the following script:

    FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')

    sed -i \
        -e 's/PERF_EVENT_/PERF_RECORD_/g' \
        -e 's/PERF_COUNTER/PERF_EVENT/g' \
        -e 's/perf_counter/perf_event/g' \
        -e 's/nb_counters/nb_events/g' \
        -e 's/swcounter/swevent/g' \
        -e 's/tpcounter_event/tp_event/g' \
        $FILES

    for N in $(find . -name perf_counter.[ch]); do
        M=$(echo $N | sed 's/perf_counter/perf_event/g')
        mv $N $M
    done

    FILES=$(find . -name perf_event.*)

    sed -i \
        -e 's/COUNTER_MASK/REG_MASK/g' \
        -e 's/COUNTER/EVENT/g' \
        -e 's/\<event\>/event_id/g' \
        -e 's/counter/event/g' \
        -e 's/Counter/Event/g' \
        $FILES

    ... to keep it as correct as possible. This script can also be
    used by anyone who has pending perfcounters patches - it converts
    a Linux kernel tree over to the new naming. We tried to time this
    change to the point where the number of pending patches
    is smallest: the end of the merge window.

    Namespace clashes were fixed up in a preparatory patch - and some
    stylistic fallout will be fixed up in a subsequent patch.

    ( NOTE: 'counters' are still the proper terminology when we deal
    with hardware registers - and these sed scripts are a bit
    over-eager in renaming them. I've undone some of that, but
    in case there's something left where 'counter' would be
    better than 'event' we can undo that on an individual basis
    instead of touching an otherwise nicely automated patch. )

    Suggested-by: Stephane Eranian
    Acked-by: Peter Zijlstra
    Acked-by: Paul Mackerras
    Reviewed-by: Arjan van de Ven
    Cc: Mike Galbraith
    Cc: Arnaldo Carvalho de Melo
    Cc: Frederic Weisbecker
    Cc: Steven Rostedt
    Cc: Benjamin Herrenschmidt
    Cc: David Howells
    Cc: Kyle McMartin
    Cc: Martin Schwidefsky
    Cc: "David S. Miller"
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc:
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

16 Sep, 2009

1 commit

  • This allows processes to override their early/late kill
    behaviour on hardware memory errors.

    Typically, applications that are memory-error aware are
    better off with early kill (they see the error as soon
    as possible); all others are better off with late kill (they only
    see the error when it actually impacts execution).

    There's a global sysctl, but this way an application
    can set its specific policy.

    We're using two bits, one to signify that the process
    stated its intention and that

    I also made the prctl future-proof by enforcing
    that the unused arguments are 0.

    The state is inherited by children.

    Note this makes us officially run out of process flags
    on 32bit, but the next patch can easily add another field.

    Manpage patch will be supplied separately.

    Signed-off-by: Andi Kleen

    Andi Kleen
     

17 Jun, 2009

1 commit

  • Move supplementary groups implementation to kernel/groups.c.
    kernel/sys.c has already accumulated quite a lot of random stuff.

    Do strictly copy/paste + add required headers to compile. Compile-tested
    on many configs and archs.

    Signed-off-by: Alexey Dobriyan
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

29 Apr, 2009

1 commit


14 Apr, 2009

1 commit


06 Apr, 2009

1 commit

  • Merge reason: we have gathered quite a few conflicts, need to merge upstream

    Conflicts:
    arch/powerpc/kernel/Makefile
    arch/x86/ia32/ia32entry.S
    arch/x86/include/asm/hardirq.h
    arch/x86/include/asm/unistd_32.h
    arch/x86/include/asm/unistd_64.h
    arch/x86/kernel/cpu/common.c
    arch/x86/kernel/irq.c
    arch/x86/kernel/syscall_table_32.S
    arch/x86/mm/iomap_32.c
    include/linux/sched.h
    kernel/Makefile

    Signed-off-by: Ingo Molnar

    Ingo Molnar