13 Jan, 2012

1 commit

  • When we restore a task we need to set up text, data and data heap sizes
    from userspace to the values a task had at checkpoint time. This patch
    adds auxiliary prctl codes for that.

    While most of them are statistical in nature (their values are used in
    calculating the /proc/<pid>/statm output), the start_brk and brk values
    are used to compute the allowed size of program data segment expansion.
    This means that arbitrary changes to these values can be dangerous, so
    to restrict access the following requirements apply to the prctl calls:

    - The process has to have CAP_SYS_ADMIN capability granted.
    - For all opcodes except start_brk/brk members an appropriate
      VMA area must exist and should fit certain VMA flags,
      such as:
        - code segment must be executable but not writable;
        - data segment must not be executable.

    start_brk/brk values must not intersect with data segment and must not
    exceed RLIMIT_DATA resource limit.

    Still the main guard is CAP_SYS_ADMIN capability check.

    Note that the kernel must be compiled with CONFIG_CHECKPOINT_RESTORE
    support, otherwise these prctl calls will return -EINVAL.
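
The access rules above can be modelled as a few predicates. This is only a simplified userspace sketch of the described checks, not the kernel code: the `Vma` type and the function names are hypothetical, and the exact brk arithmetic is an assumption based on the wording "must not intersect with data segment and must not exceed RLIMIT_DATA".

```python
from dataclasses import dataclass


@dataclass
class Vma:
    """Hypothetical stand-in for a kernel VMA: [start, end) plus two flags."""
    start: int
    end: int
    executable: bool
    writable: bool


def find_vma(vmas, addr):
    """Return the VMA containing addr, or None if no mapping covers it."""
    for vma in vmas:
        if vma.start <= addr < vma.end:
            return vma
    return None


def may_set_code_addr(vmas, addr, has_cap_sys_admin):
    """Code addresses need the capability and an executable, non-writable VMA."""
    if not has_cap_sys_admin:
        return False
    vma = find_vma(vmas, addr)
    return vma is not None and vma.executable and not vma.writable


def may_set_data_addr(vmas, addr, has_cap_sys_admin):
    """Data addresses need the capability and a non-executable VMA."""
    if not has_cap_sys_admin:
        return False
    vma = find_vma(vmas, addr)
    return vma is not None and not vma.executable


def may_set_brk(start_brk, brk, start_data, end_data, rlimit_data,
                has_cap_sys_admin):
    """Assumed brk rule: stay above the data segment and within RLIMIT_DATA."""
    if not has_cap_sys_admin:
        return False
    if start_brk <= end_data or brk <= end_data:
        return False
    return (brk - start_brk) + (end_data - start_data) <= rlimit_data
```

In every predicate the capability check comes first, mirroring the statement that CAP_SYS_ADMIN is the main guard.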

    [akpm@linux-foundation.org: cache current->mm in a local, saving 200 bytes text]
    Signed-off-by: Cyrill Gorcunov
    Reviewed-by: Kees Cook
    Cc: Tejun Heo
    Cc: Andrew Vagin
    Cc: Serge Hallyn
    Cc: Pavel Emelyanov
    Cc: Vasiliy Kulikov
    Cc: KAMEZAWA Hiroyuki
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     

15 Dec, 2011

1 commit


07 Nov, 2011

1 commit

  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of <linux/module.h>
    net: inet_timewait_sock doesnt need <linux/module.h>
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     

03 Nov, 2011

1 commit

  • Adding support for poll() in sysctl fs allows userspace to receive
    notifications of changes in sysctl entries. This adds an infrastructure to
    allow files in sysctl fs to be pollable and implements it for hostname and
    domainname.
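
From userspace, waiting for such a notification might look like the sketch below. It assumes that a change is signalled as POLLERR|POLLPRI (as for other pollable procfs/sysfs files); the helper name `wait_for_change` is hypothetical.

```python
import os
import select


def wait_for_change(path, timeout_ms):
    """Return True if the file signals a change before the timeout expires.

    Assumption: the kernel reports a changed sysctl entry as POLLERR|POLLPRI.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        poller = select.poll()
        poller.register(fd, select.POLLERR | select.POLLPRI)
        # poll() returns a list of (fd, event) pairs; empty means timeout.
        return bool(poller.poll(timeout_ms))
    finally:
        os.close(fd)
```

For example, `wait_for_change("/proc/sys/kernel/hostname", 5000)` would block for up to five seconds and report whether the hostname was rewritten in the meantime.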

    [akpm@linux-foundation.org: s/declare/define/ for definitions]
    Signed-off-by: Lucas De Marchi
    Cc: Greg KH
    Cc: Kay Sievers
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lucas De Marchi
     

31 Oct, 2011

2 commits

  • These files were implicitly relying on <linux/kmod.h> coming in via
    module.h, as without it we get things like:

    kernel/power/suspend.c:100: error: implicit declaration of function ‘usermodehelper_disable’
    kernel/power/suspend.c:109: error: implicit declaration of function ‘usermodehelper_enable’
    kernel/power/user.c:254: error: implicit declaration of function ‘usermodehelper_disable’
    kernel/power/user.c:261: error: implicit declaration of function ‘usermodehelper_enable’

    kernel/sys.c:317: error: implicit declaration of function ‘usermodehelper_disable’
    kernel/sys.c:1816: error: implicit declaration of function ‘call_usermodehelper_setup’
    kernel/sys.c:1822: error: implicit declaration of function ‘call_usermodehelper_setfns’
    kernel/sys.c:1824: error: implicit declaration of function ‘call_usermodehelper_exec’

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     
  • The changed files were only including linux/module.h for the
    EXPORT_SYMBOL infrastructure, and nothing else. Revector them
    onto the isolated export header for faster compile times.

    Nothing to see here but a whole lot of instances of:

    -#include <linux/module.h>
    +#include <linux/export.h>

    This commit is only changing the kernel dir; next targets
    will probably be mm, fs, the arch dirs, etc.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

25 Oct, 2011

1 commit


17 Oct, 2011

1 commit

  • The size is always valid, but variable-length arrays generate worse code
    for no good reason (unless the function happens to be inlined and the
    compiler sees the length for the simple constant it is).

    Also, there seems to be some code generation problem on POWER, where
    Henrik Bakken reports that register r28 can get corrupted under some
    subtle circumstances (interrupt happening at the wrong time?). That all
    indicates some seriously broken compiler issues, but since variable
    length arrays are bad regardless, there's little point in trying to
    chase it down.

    "Just don't do that, then".

    Reported-by: Henrik Grindal Bakken
    Cc: Benjamin Herrenschmidt
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

29 Sep, 2011

1 commit

  • Add an event to monitor comm value changes of tasks. Such an event
    becomes vital if someone wants to control the threads of a process in
    different ways.

    A natural characteristic of a thread is its comm value, and helpfully
    application developers can change it at runtime.
    Reporting such events via the proc connector allows fine-grained
    monitoring and control; for instance, a process control daemon
    listening to the proc connector and following comm value policies can
    place specific threads into assigned cgroup partitions.

    A pale, partial, one-shot likeness might be achievable without
    this update: an application changes the comm value of a thread generator
    task beforehand, a new thread is then cloned, and the proc
    connector listener gets the fork event and reads the new thread's comm
    value from the procfs stat file. This change visibly simplifies and
    extends the matter.

    Signed-off-by: Vladimir Zapolskiy
    Acked-by: Evgeniy Polyakov
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Vladimir Zapolskiy
     

26 Aug, 2011

1 commit

  • I ran into a couple of programs which broke with the new Linux 3.0
    version. Some of those were binary only. I tried to use LD_PRELOAD to
    work around it, but it was quite difficult and in one case impossible
    because of a mix of 32bit and 64bit executables.

    For example, all kinds of management software from HP don't work unless
    we pretend to run a 2.6 kernel.

    $ uname -a
    Linux svivoipvnx001 3.0.0-08107-g97cd98f #1062 SMP Fri Aug 12 18:11:45 CEST 2011 i686 i686 i386 GNU/Linux

    $ hpacucli ctrl all show

    Error: No controllers detected.

    $ rpm -qf /usr/sbin/hpacucli
    hpacucli-8.75-12.0

    Another notable case is that Python now reports "linux3" from
    sys.platform, which in turn can break things that were checking
    sys.platform == "linux2":

    https://bugzilla.mozilla.org/show_bug.cgi?id=664564

    It seems pretty clear to me though it's a bug in the apps that are using
    '==' instead of .startswith(), but this allows us to unbreak broken
    programs.
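
A minimal sketch of the robust check mentioned above, using a prefix test instead of equality. The helper name `is_linux` is hypothetical.

```python
import sys


def is_linux(platform_name=None):
    """True for "linux", "linux2", "linux3", ...; defaults to sys.platform."""
    name = sys.platform if platform_name is None else platform_name
    return name.startswith("linux")
```

An equality test against "linux2" fails as soon as the reported string changes, while the prefix test keeps working.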

    This patch adds a UNAME26 personality that makes the kernel report a
    2.6.40+x version number instead. The x is the x in 3.x.
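
The version mapping can be illustrated as follows. This is a sketch of the arithmetic only (3.x becomes 2.6.(40+x)); the function name is hypothetical, and dropping the sublevel while keeping the trailing suffix is an assumption, not something stated in this message.

```python
import re


def uname26_release(release):
    """Map a 3.x release string to the 2.6.(40+x) form; leave others alone."""
    match = re.match(r"3\.(\d+)(?:\.\d+)?(.*)", release)
    if match is None:
        return release
    minor = int(match.group(1))
    suffix = match.group(2)  # assumption: any "-local" style suffix is kept
    return "2.6.%d%s" % (40 + minor, suffix)
```

With this mapping, a program that insists on seeing a "2.6." prefix would accept the reported release string.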

    I know this is somewhat ugly, but I didn't find a better workaround, and
    compatibility to existing programs is important.

    Some programs also read /proc/sys/kernel/osrelease. This can be worked
    around in user space with mount --bind (and a mount namespace)

    To use:

    wget ftp://ftp.kernel.org/pub/linux/kernel/people/ak/uname26/uname26.c
    gcc -o uname26 uname26.c
    ./uname26 program

    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

12 Aug, 2011

1 commit

  • The patch http://lkml.org/lkml/2003/7/13/226 introduced an RLIMIT_NPROC
    check in set_user() to check for NPROC exceeding via setuid() and
    similar functions.

    Before the check, an unprivileged user could greatly exceed the allowed
    number of processes if the program relied on rlimit only. But the check
    created a new security threat: many poorly written programs simply don't
    check the setuid() return code and believe it cannot fail if executed
    with root privileges. So the check is removed in this patch, because
    privilege escalations related to such buggy programs are too frequent.

    The NPROC can still be enforced in the common code flow of daemons
    spawning user processes. Most daemons do fork()+setuid()+execve().
    The check introduced in execve() (1) enforces the same limit as in
    setuid() and (2) doesn't create similar security issues.

    Neil Brown suggested tracking which specific process has exceeded the
    limit by setting the PF_NPROC_EXCEEDED process flag. With this change
    only that process fails on execve(), and other processes' execve()
    behaviour is not changed.

    Solar Designer suggested re-checking whether the NPROC limit is still
    exceeded at the moment of execve(). If the process was sleeping for
    days between set*uid() and execve(), and the NPROC counter dropped
    below the limit, a deferred execve() failure because the NPROC limit was
    exceeded days ago would be unexpected. If the limit is not exceeded
    anymore, we clear the flag on successful calls to execve() and fork().

    The flag is also cleared on successful calls to set_user() as the limit
    was exceeded for the previous user, not the current one.

    A similar check was introduced in the -ow patches (without the process flag).

    v3 - clear PF_NPROC_EXCEEDED on successful calls to set_user().
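
The flag handling described above can be modelled in a few lines. This is a toy model, not kernel code: the `Task` type and function names are hypothetical, per-user process counts are passed in as a plain dict, and the exact comparison operators are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Task:
    uid: int
    nproc_exceeded: bool = False  # models PF_NPROC_EXCEEDED


def set_user(task, new_uid, nproc, limit):
    """setuid() path: never fails on NPROC, only records that it was exceeded.

    The flag is recomputed for the new user, so a flag set for the previous
    user is cleared here.
    """
    task.uid = new_uid
    task.nproc_exceeded = nproc.get(new_uid, 0) >= limit


def do_execve(task, nproc, limit):
    """execve() path: fail only if flagged AND the limit is still exceeded."""
    if task.nproc_exceeded and nproc.get(task.uid, 0) > limit:
        return False  # models an -EAGAIN failure
    task.nproc_exceeded = False
    return True
```

The point of the re-check in `do_execve` is that a task flagged long ago is not penalised once the user's process count has fallen back under the limit.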

    Reviewed-by: James Morris
    Signed-off-by: Vasiliy Kulikov
    Acked-by: NeilBrown
    Signed-off-by: Linus Torvalds

    Vasiliy Kulikov
     

26 Jul, 2011

1 commit

  • It is not necessary to share the same notifier.h.

    This patch already moves register_reboot_notifier() and
    unregister_reboot_notifier() from kernel/notifier.c to kernel/sys.c.

    [amwang@redhat.com: make allyesconfig succeed on ppc64]
    Signed-off-by: WANG Cong
    Cc: David Miller
    Cc: "Rafael J. Wysocki"
    Cc: Greg KH
    Signed-off-by: WANG Cong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Amerigo Wang
     

20 May, 2011

1 commit

  • …/gregkh/driver-core-2.6

    * 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (44 commits)
    debugfs: Silence DEBUG_STRICT_USER_COPY_CHECKS=y warning
    sysfs: remove "last sysfs file:" line from the oops messages
    drivers/base/memory.c: fix warning due to "memory hotplug: Speed up add/remove when blocks are larger than PAGES_PER_SECTION"
    memory hotplug: Speed up add/remove when blocks are larger than PAGES_PER_SECTION
    SYSFS: Fix erroneous comments for sysfs_update_group().
    driver core: remove the driver-model structures from the documentation
    driver core: Add the device driver-model structures to kerneldoc
    Translated Documentation/email-clients.txt
    RAW driver: Remove call to kobject_put().
    reboot: disable usermodehelper to prevent fs access
    efivars: prevent oops on unload when efi is not enabled
    Allow setting of number of raw devices as a module parameter
    Introduce CONFIG_GOOGLE_FIRMWARE
    driver: Google Memory Console
    driver: Google EFI SMI
    x86: Better comments for get_bios_ebda()
    x86: get_bios_ebda_length()
    misc: fix ti-st build issues
    params.c: Use new strtobool function to process boolean inputs
    debugfs: move to new strtobool
    ...

    Fix up trivial conflicts in fs/debugfs/file.c due to the same patch
    being applied twice, and an unrelated cleanup nearby.

    Linus Torvalds
     

12 May, 2011

1 commit

  • Since suspend, resume and shutdown operations in struct sysdev_class
    and struct sysdev_driver are not used any more, remove them. Also
    drop sysdev_suspend(), sysdev_resume() and sysdev_shutdown() used
    for executing those operations and modify all of their users
    accordingly. This reduces kernel code size quite a bit and reduces
    its complexity.

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Greg Kroah-Hartman

    Rafael J. Wysocki
     

07 May, 2011

1 commit

  • In case CONFIG_UEVENT_HELPER_PATH is not set to "", which it
    should be on every system, the kernel forks processes during
    shutdown, which try to access the rootfs, even when the
    binary does not exist. It causes exceptions and long delays in
    the disk driver, which gets read requests at the time it tries
    to shut down the disk.

    This patch disables all kernel-forked processes during reboot to
    allow a clean poweroff.

    Cc: Tejun Heo
    Tested-By: Anton Guda
    Signed-off-by: Kay Sievers
    Signed-off-by: Greg Kroah-Hartman

    Kay Sievers
     

24 Mar, 2011

2 commits

  • This allows setuid/setgid in containers. It also fixes some corner cases
    where kernel logic foregoes capability checks when uids are equivalent.
    The latter will need to be done throughout the whole kernel.

    Changelog:
    Jan 11: Use nsown_capable() as suggested by Bastian Blank.
    Jan 11: Fix logic errors in uid checks pointed out by Bastian.
    Feb 15: allow prlimit to current (was regression in previous version)
    Feb 23: remove debugging printks, uninline set_one_prio_perm and
    make it bool, and document its return value.

    Signed-off-by: Serge E. Hallyn
    Acked-by: "Eric W. Biederman"
    Acked-by: Daniel Lezcano
    Acked-by: David Howells
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • Changelog:
    Feb 23: let clone_uts_ns() handle setting uts->user_ns
            To do so we need to pass in the task_struct who'll
            get the utsname, so we can get its user_ns.
    Feb 23: As per Oleg's comment, just pass in tsk, instead of two
            of its members.

    Signed-off-by: Serge E. Hallyn
    Acked-by: "Eric W. Biederman"
    Acked-by: Daniel Lezcano
    Acked-by: David Howells
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     

15 Mar, 2011

1 commit

  • Some subsystems need to carry out suspend/resume and shutdown
    operations with one CPU on-line and interrupts disabled. The only
    way to register such operations is to define a sysdev class and
    a sysdev specifically for this purpose which is cumbersome and
    inefficient. Moreover, the arguments taken by sysdev suspend,
    resume and shutdown callbacks are practically never necessary.

    For this reason, introduce a simpler interface allowing subsystems
    to register operations to be executed very late during system suspend
    and shutdown and very early during resume in the form of
    struct syscore_ops objects.

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Greg Kroah-Hartman

    Rafael J. Wysocki
     

31 Jan, 2011

1 commit

  • Since check_prlimit_permission always fails in the case of SUID/SGID
    processes, such processes are not able to read or set their own limits.
    This commit changes this by assuming that a process can always read/change
    its own limits.

    Signed-off-by: Kacper Kornet
    Acked-by: Jiri Slaby
    Signed-off-by: Linus Torvalds

    Kacper Kornet
     

14 Jan, 2011

1 commit

  • We need to know the reason why a system rebooted in our support service.
    However, we can't inform our customers of the reason because the final
    messages are lost on the current Linux kernel.

    This patch improves the situation by saving the final messages: it adds
    kmsg_dump() to the reboot, halt, poweroff and emergency_restart paths.

    Signed-off-by: Seiji Aguchi
    Cc: David Woodhouse
    Cc: Marco Stornelli
    Reviewed-by: Artem Bityutskiy
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Seiji Aguchi
     

30 Nov, 2010

1 commit

  • A recurring complaint from CFS users is that parallel kbuild has
    a negative impact on desktop interactivity. This patch
    implements an idea from Linus, to automatically create task
    groups. Currently, only per session autogroups are implemented,
    but the patch leaves the way open for enhancement.

    Implementation: each task's signal struct contains an inherited
    pointer to a refcounted autogroup struct containing a task group
    pointer, the default for all tasks pointing to the
    init_task_group. When a task calls setsid(), a new task group
    is created, the process is moved into the new task group, and a
    reference to the previous task group is dropped. Child
    processes inherit this task group thereafter, and increase its
    refcount. When the last thread of a process exits, the
    process's reference is dropped, such that when the last process
    referencing an autogroup exits, the autogroup is destroyed.

    At runqueue selection time, IFF a task has no cgroup assignment,
    its current autogroup is used.

    Autogroup bandwidth is controllable via setting its nice level
    through the proc filesystem:

    cat /proc/<pid>/autogroup

    Displays the task's group and the group's nice level.

    echo <nice level> > /proc/<pid>/autogroup

    Sets the task group's shares to the weight of nice <level> task.
    Setting nice level is rate limited for !admin users due to the
    abuse risk of task group locking.

    The feature is enabled from boot by default if
    CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via
    the boot option noautogroup, and can also be turned on/off on
    the fly via:

    echo [01] > /proc/sys/kernel/sched_autogroup_enabled

    ... which will automatically move tasks to/from the root task group.
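
Reading the autogroup file from a script could be sketched as below. The line format "/autogroup-<id> nice <level>" is an assumption about what the file prints (the message only says it shows the group and its nice level), and the parser name is hypothetical.

```python
import re


def parse_autogroup(line):
    """Parse an assumed "/autogroup-<id> nice <level>" line into two ints."""
    match = re.fullmatch(r"/autogroup-(\d+) nice (-?\d+)", line.strip())
    if match is None:
        raise ValueError("unexpected autogroup line: %r" % line)
    return int(match.group(1)), int(match.group(2))
```

On a kernel built with CONFIG_SCHED_AUTOGROUP=y, the input would come from reading `/proc/self/autogroup`; on other kernels the file does not exist.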

    Signed-off-by: Mike Galbraith
    Acked-by: Linus Torvalds
    Acked-by: Peter Zijlstra
    Cc: Markus Trippelsdorf
    Cc: Mathieu Desnoyers
    Cc: Paul Turner
    Cc: Oleg Nesterov
    [ Removed the task_group_path() debug code, and fixed !EVENTFD build failure. ]
    Signed-off-by: Ingo Molnar
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     

01 Sep, 2010

1 commit

  • [ 23.584719]
    [ 23.584720] ===================================================
    [ 23.585059] [ INFO: suspicious rcu_dereference_check() usage. ]
    [ 23.585176] ---------------------------------------------------
    [ 23.585176] kernel/pid.c:419 invoked rcu_dereference_check() without protection!
    [ 23.585176]
    [ 23.585176] other info that might help us debug this:
    [ 23.585176]
    [ 23.585176]
    [ 23.585176] rcu_scheduler_active = 1, debug_locks = 1
    [ 23.585176] 1 lock held by rc.sysinit/728:
    [ 23.585176] #0: (tasklist_lock){.+.+..}, at: [] sys_setpgid+0x5f/0x193
    [ 23.585176]
    [ 23.585176] stack backtrace:
    [ 23.585176] Pid: 728, comm: rc.sysinit Not tainted 2.6.36-rc2 #2
    [ 23.585176] Call Trace:
    [ 23.585176] [] lockdep_rcu_dereference+0x99/0xa2
    [ 23.585176] [] find_task_by_pid_ns+0x50/0x6a
    [ 23.585176] [] find_task_by_vpid+0x1d/0x1f
    [ 23.585176] [] sys_setpgid+0x67/0x193
    [ 23.585176] [] system_call_fastpath+0x16/0x1b
    [ 24.959669] type=1400 audit(1282938522.956:4): avc: denied { module_request } for pid=766 comm="hwclock" kmod="char-major-10-135" scontext=system_u:system_r:hwclock_t:s0 tcontext=system_u:system_r:kernel_t:s0 tclas

    It turns out that the setpgid() system call fails to enter an RCU
    read-side critical section before doing a PID-to-task_struct translation.
    This commit therefore does rcu_read_lock() before the translation, and
    also does rcu_read_unlock() after the last use of the returned pointer.

    Reported-by: Andrew Morton
    Signed-off-by: Paul E. McKenney
    Acked-by: David Howells

    Paul E. McKenney
     

16 Jul, 2010

9 commits

  • This patch adds the code to support the sys_prlimit64 syscall which
    modifies-and-returns the rlim values of a selected process atomically.
    The first parameter, pid, being 0 means current process.

    Unlike the current implementation, it is a generic,
    architecture-independent interface, so we needn't handle compat stuff
    anymore. In the future, after glibc starts to use this, we can deprecate
    sys_setrlimit and sys_getrlimit to finally clean up the code.

    It also adds a possibility of changing limits of other processes. We
    check the user's permissions to do that and if it succeeds, the new
    limits are propagated online. This is good for large scale
    applications such as SAP or databases where administrators need to
    change limits from time to time (e.g. increase core size on crashes)
    and where it is unacceptable to restart the service.

    For safety, all rlim users now either use accessors or don't need
    them due to
    - locking
    - the fact a process was just forked and nobody else knows about it
      yet (and thus nobody can read/write limits)
    hence it is safe to modify limits now.

    The limitation is that we currently stay at ulong internal
    representation. So the rlim64_is_infinity check is used where value is
    compared against ULONG_MAX on 32-bit which is the maximum value there.

    And since internally the limits are held in struct rlimit, converters
    which are used before and after do_prlimit call in sys_prlimit64 are
    introduced.
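
The conversion between the 64-bit interface and the ulong internal representation can be sketched as follows. This is a model of the described behaviour, not the kernel's converters: it assumes the internal infinity value equals ULONG_MAX (true on most, but not all, architectures) and passes the word size in explicitly as `ulong_max`.

```python
ULONG_MAX_32 = 2**32 - 1
ULONG_MAX_64 = 2**64 - 1
RLIM64_INFINITY = 2**64 - 1


def rlim64_is_infinity(value, ulong_max):
    """On 32-bit, anything that does not fit in a ulong counts as infinity."""
    if ulong_max < RLIM64_INFINITY:
        return value >= ulong_max
    return value == RLIM64_INFINITY


def rlim64_to_rlim(value, ulong_max):
    """64-bit user value -> internal ulong (assumed RLIM_INFINITY == ulong_max)."""
    return ulong_max if rlim64_is_infinity(value, ulong_max) else value


def rlim_to_rlim64(value, ulong_max):
    """Internal ulong -> 64-bit user value, widening infinity."""
    return RLIM64_INFINITY if value == ulong_max else value
```

The practical consequence on 32-bit is that a requested limit of, say, 2**40 is stored as "unlimited" rather than truncated.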

    Signed-off-by: Jiri Slaby

    Jiri Slaby
     
  • After we added more generic do_prlimit, switch sys_getrlimit to that.
    Also switch compat handling, so we can get rid of ugly __user casts
    and avoid setting process' address limit to kernel data and back.

    Signed-off-by: Jiri Slaby

    Jiri Slaby
     
  • It now also allows reading of limits, i.e. all reads and writes will
    later use this function.

    It takes two parameters, new and old limits which can be both NULL.
    If new is non-NULL, the value in it is set to rlimits.
    If old is non-NULL, current rlimits are stored there.
    If both are non-NULL, old are stored prior to setting the new ones,
    atomically.
    (Similar to sigaction.)
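
The same read, write, and read-then-write semantics are visible from userspace through Python's `resource.prlimit()` wrapper (Linux only). The helper below is a hypothetical thin wrapper for illustration; passing pid 0 is taken to mean the current process.

```python
import resource


def swap_limit(res, new=None, pid=0):
    """Return the old (soft, hard) pair; install `new` first-class if given.

    new is None      -> pure read (the "old" side only)
    new is a 2-tuple -> old value is returned and the new one is set
    """
    if new is None:
        return resource.prlimit(pid, res)
    return resource.prlimit(pid, res, new)
```

For example, `old = swap_limit(resource.RLIMIT_NOFILE, (256, old_hard))` lowers the soft limit and hands back the previous pair in a single call, so it can be restored later.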

    Signed-off-by: Jiri Slaby

    Jiri Slaby
     
  • Do security_task_setrlimit under task_lock. Other tasks may change
    limits under our hands while we are checking limits inside the
    function. From now on, they can't.

    Note that all the security work is done under a spinlock here now.
    Security hooks count on that: they are called from interrupt context
    (like security_task_kill) and with spinlocks already held (e.g.
    capable->security_capable).

    Signed-off-by: Jiri Slaby
    Acked-by: James Morris
    Cc: Heiko Carstens

    Jiri Slaby
     
  • Add locking to allow setrlimit to accept a task parameter other than
    current.

    Namely, lock tasklist_lock for read and check whether the task
    structure has sighand non-null. Do all the signal processing with
    that lock still held.

    There are some points:
    1) security_task_setrlimit is now called with that lock held. This is
    not new, many security_* functions are called with this lock held
    already so it doesn't harm (all this security_* stuff does almost
    the same).
    2) task->sighand->siglock (in update_rlimit_cpu) is nested in
    tasklist_lock. This dependence is already existing.
    3) tsk->alloc_lock is nested in tasklist_lock. This is OK too, already
    existing dependence.

    Signed-off-by: Jiri Slaby
    Cc: Oleg Nesterov

    Jiri Slaby
     
  • Create do_setrlimit from sys_setrlimit and declare do_setrlimit
    in the resource header. This is the first phase towards a generic
    do_prlimit which can be called from read, write and compat
    rlimits code.

    The new do_setrlimit also accepts a task pointer to change the limits
    of. Currently, it cannot be other than current, but this will change
    with locking later.

    Also pass tsk->group_leader to security_task_setrlimit to check
    whether current is allowed to change rlimits of the process and not
    its arbitrary thread, because it makes more sense given that rlimits
    are per-process and not per-thread.

    Signed-off-by: Jiri Slaby

    Jiri Slaby
     
  • Mostly preparation for Jiri's changes, but probably makes sense anyway.

    sys_setrlimit() checks new_rlim.rlim_max <= old_rlim->rlim_max, but when
    it takes task_lock() old_rlim->rlim_max can be already lowered. Move this
    check under task_lock().

    Currently this is not important, we can only race with our sub-thread,
    this means the application is stupid. But when we change the code to allow
    the update of !current task's limits, it becomes important to make sure
    ->rlim_max can be lowered "reliably" even if we race with the application
    doing sys_setrlimit().

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Jiri Slaby

    Oleg Nesterov
     
  • Add task_struct as a parameter to update_rlimit_cpu to be able to set
    rlimit_cpu of different task than current.

    Signed-off-by: Jiri Slaby
    Acked-by: James Morris

    Jiri Slaby
     
  • Add task_struct to task_setrlimit of security_operations to be able to set
    rlimit of task other than current.

    Signed-off-by: Jiri Slaby
    Acked-by: Eric Paris
    Acked-by: James Morris

    Jiri Slaby
     

28 May, 2010

1 commit

  • About 6 months ago, I made a set of changes to how the core-dump-to-a-pipe
    feature in the kernel works. We had reports of several races, including
    some reports of apps bypassing our recursion check so that a process that
    was forked as part of a core_pattern setup could infinitely crash and
    refork until the system crashed.

    We fixed those by improving our recursion checks. The new check basically
    refuses to fork a process if its core limit is zero, which works well.

    Unfortunately, I've been getting grief from maintainers of user space
    programs that are inserted as the forked process of core_pattern. They
    contend that in order for their programs (such as abrt and apport) to
    work, all the running processes in a system must have their core limits
    set to a non-zero value, to which I say 'yes'. I did this by design, and
    think that's the right way to do things.

    But I've been asked to ease this burden on user space enough times that I
    thought I would take a look at it. The first suggestion was to make the
    recursion check fail on a non-zero 'special' number, like one. That way
    the core collector process could set its core size ulimit to 1, and enable
    the kernel's recursion detection. This isn't a bad idea on the surface,
    but I don't like it since it's opt-in, in that if a program like abrt or
    apport has a bug and fails to set such a core limit, we're left with a
    recursively crashing system again.

    So I've come up with this. What I've done is modify the
    call_usermodehelper api such that an extra parameter is added, a function
    pointer which will be called by the user helper task, after it forks, but
    before it exec's the required process. This will give the caller the
    opportunity to get a callback in the process's context, allowing it to do
    whatever it needs to do to the process in the kernel prior to exec-ing the
    user space code. In the case of do_coredump, this callback is used to set
    the core ulimit of the helper process to 1. This eliminates the opt-in
    problem that I had above, as it allows the ulimit for core sizes to be set
    to the value of 1, which is what the recursion check looks for in
    do_coredump.

    This patch:

    Create new function call_usermodehelper_fns() and allow it to assign both
    an init and cleanup function, as well as arbitrary data.

    The init function is called from the context of the forked process and
    allows for customization of the helper process prior to calling exec. Its
    return code gates the continuation of the process, or causes its exit.
    Also add an arbitrary data pointer to the subprocess_info struct allowing
    for data to be passed from the caller to the new process, and the
    subsequent cleanup process.

    Also, use this patch to clean up the cleanup function. It currently takes
    an argp and envp pointer for freeing, which is ugly. Let's instead just
    make the subprocess_info structure public, and pass that to the cleanup
    and init routines.
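
A userspace analogue may help picture the init hook: Python's `subprocess` accepts a `preexec_fn` that runs in the child after fork() and before exec(), which is the same slot the new init function occupies. This is only an analogy, not the kernel API; the helper names are hypothetical, and the fallback for a hard limit of 0 is an addition so the sketch runs in restricted environments.

```python
import resource
import subprocess
import sys


def limit_core_to_one():
    """Init hook: runs in the child after fork() and before exec()."""
    _, hard = resource.getrlimit(resource.RLIMIT_CORE)
    # Use 1 where the hard limit permits it; otherwise keep the hard limit.
    soft = 1 if hard == resource.RLIM_INFINITY or hard >= 1 else hard
    resource.setrlimit(resource.RLIMIT_CORE, (soft, hard))


def run_helper(argv, init=None):
    """Fork, run `init` in the child, exec `argv`, and return its stdout."""
    done = subprocess.run(argv, preexec_fn=init,
                          stdout=subprocess.PIPE, check=True)
    return done.stdout.decode()


# A child that simply reports the core limit it ended up with.
PROBE = [sys.executable, "-c",
         "import resource; print(resource.getrlimit(resource.RLIMIT_CORE)[0])"]
```

Calling `run_helper(PROBE, limit_core_to_one)` shows that the limit is already in place when the helper program starts, without the helper having to opt in.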

    Signed-off-by: Neil Horman
    Reviewed-by: Oleg Nesterov
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Horman
     

06 May, 2010

1 commit


25 Apr, 2010

1 commit

  • On ppc64 you get this error:

    $ setarch ppc -R true
    setarch: ppc: Unrecognized architecture

    because uname still reports ppc64 as the machine.

    So mask off the personality flags when checking for PER_LINUX32.
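
The effect of the masking can be shown with plain integers. The constant values below are the ones from the kernel's personality header as far as I recall them; treat them, and the two helper names, as assumptions of this sketch.

```python
PER_MASK = 0x00FF             # low byte selects the personality type
PER_LINUX32 = 0x0008
ADDR_NO_RANDOMIZE = 0x0040000  # flag bit set by "setarch -R"


def unmasked_check(personality):
    """Old behaviour: any extra flag bit makes the comparison fail."""
    return personality == PER_LINUX32


def masked_check(personality):
    """Fixed behaviour: compare only the personality type bits."""
    return (personality & PER_MASK) == PER_LINUX32
```

With `setarch ppc -R`, the personality is PER_LINUX32 plus ADDR_NO_RANDOMIZE, which is exactly the case where the two checks disagree.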

    Signed-off-by: Andreas Schwab
    Reviewed-by: Christoph Hellwig
    Acked-by: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andreas Schwab
     

12 Apr, 2010

2 commits


30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include
    those headers directly instead of assuming availability. As this
    conversion needs to touch a large number of source files, the following
    script is used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.
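
The first step (deciding which header a file needs) might be sketched like this. It is not the actual slabh-sweep.py, whose contents are not shown here; the identifier patterns and the function name are hypothetical and far from complete.

```python
import re

# Hypothetical, incomplete lists of identifiers that imply each header.
SLAB_RE = re.compile(r"\b(kmalloc|kzalloc|kcalloc|kfree|kmem_cache_\w+)\s*\(")
GFP_RE = re.compile(r"\b(GFP_[A-Z]+|__GFP_[A-Z]+|gfp_t)\b")


def needed_include(source):
    """Return the header a C source text needs, or None if neither is used.

    slab.h wins when slab APIs are used; gfp.h is chosen only when gfp
    definitions are used without any slab API.
    """
    if SLAB_RE.search(source):
        return "linux/slab.h"
    if GFP_RE.search(source):
        return "linux/gfp.h"
    return None
```

Placement of the include within the existing block, as described in the second bullet, is deliberately left out of this sketch.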

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition, while adding it to the implementation
    .h or embedding .c file was more appropriate for others. This step
    added inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them, as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of the
    specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

13 Mar, 2010

2 commits

  • Add generic implementations of the old and really old uname system calls.
    Note that sh only implements sys_olduname but not sys_oldolduname, but I'm
    not going to bother with another ifdef for that special case.

    m32r implemented an old uname but never wired it up, so kill it, too.

    Signed-off-by: Christoph Hellwig
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Hirokazu Takata
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc: Al Viro
    Cc: Arnd Bergmann
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: "Luck, Tony"
    Cc: James Morris
    Cc: Andreas Schwab
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • On an architecture that supports 32-bit compat we need to override the
    reported machine in uname with the 32-bit value. Instead of doing this
    separately in every architecture introduce a COMPAT_UTS_MACHINE define in
    and apply it directly in sys_newuname().
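
    A user-space sketch of the idea, assuming a per-architecture define
    that is applied in one common place. The struct uts_sketch type, the
    report_machine() helper, the is_compat_task flag and the "i686" value
    are all hypothetical stand-ins; the kernel's own check and copy
    routine differ.

```c
#include <string.h>

/* Hypothetical example value; an architecture without 32-bit compat
 * would simply not define this. */
#define COMPAT_UTS_MACHINE "i686"

/* Hypothetical stand-in for the machine field of struct utsname. */
struct uts_sketch {
	char machine[65];
};

/*
 * Fill in the reported machine name. The native name is written first;
 * if the architecture defines COMPAT_UTS_MACHINE and the caller is a
 * 32-bit compat task, it is overridden with the 32-bit value.
 */
void report_machine(struct uts_sketch *name, const char *native,
		    int is_compat_task)
{
	strncpy(name->machine, native, sizeof(name->machine) - 1);
	name->machine[sizeof(name->machine) - 1] = '\0';
#ifdef COMPAT_UTS_MACHINE
	/* sizeof() includes the terminating NUL of the string literal. */
	if (is_compat_task)
		memcpy(name->machine, COMPAT_UTS_MACHINE,
		       sizeof(COMPAT_UTS_MACHINE));
#endif
}
```

    With this shape the override lives in one function and each
    architecture only has to supply the define.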

    Signed-off-by: Christoph Hellwig
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Hirokazu Takata
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc: Al Viro
    Cc: Arnd Bergmann
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: "Luck, Tony"
    Cc: James Morris
    Cc: Andreas Schwab
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

07 Mar, 2010

1 commit

  • Make sure the compiler won't do weird things with limits. E.g. fetching
    them twice may return 2 different values after writable limits are
    implemented.

    I.e. either use rlimit helpers added in commit 3e10e716abf3 ("resource:
    add helpers for fetching rlimits") or ACCESS_ONCE if not applicable.
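
    A minimal user-space sketch of the single-fetch pattern. The macro
    has the same shape as the kernel's ACCESS_ONCE and relies on the
    GCC/Clang __typeof__ extension; struct limit and fits_limit() are
    hypothetical stand-ins, not the rlimit helpers from the commit
    referenced above.

```c
/* Force exactly one load of x by going through a volatile-qualified
 * lvalue (GCC/Clang __typeof__ extension). */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/* Hypothetical stand-in for struct rlimit. */
struct limit {
	unsigned long cur;	/* soft limit */
	unsigned long max;	/* hard limit */
};

/*
 * Return 1 if 'want' fits under the soft limit, 0 otherwise.
 * The limit is read once into a local, so the comparison and any later
 * use of the value (reported through 'seen') agree even if another
 * thread rewrites lim->cur in the meantime.
 */
int fits_limit(struct limit *lim, unsigned long want, unsigned long *seen)
{
	unsigned long cur = ACCESS_ONCE(lim->cur);

	if (seen)
		*seen = cur;
	return want <= cur;
}
```

    Without the local copy, the compiler would be free to reload
    lim->cur for each use, which is the double fetch the commit guards
    against.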

    Signed-off-by: Jiri Slaby
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: john stultz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby