24 Mar, 2011

21 commits

  • CAP_IPC_OWNER and CAP_IPC_LOCK can be checked against current_user_ns(),
    because the resource comes from current's own ipc namespace.

    setuid/setgid are to uids in own namespace, so again checks can be against
    current_user_ns().

    Changelog:
    Jan 11: Use task_ns_capable() in place of sched_capable().
    Jan 11: Use nsown_capable() as suggested by Bastian Blank.
    Jan 11: Clarify (hopefully) some logic in futex and sched.c
    Feb 15: use ns_capable for ipc, not nsown_capable
    Feb 23: let copy_ipcs handle setting ipc_ns->user_ns
    Feb 23: pass ns down rather than taking it from current

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Serge E. Hallyn
    Acked-by: "Eric W. Biederman"
    Acked-by: Daniel Lezcano
    Acked-by: David Howells
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • Changelog:
    Feb 15: Don't set new ipc->user_ns if we didn't create a new
    ipc_ns.
    Feb 23: Move extern declaration to ipc_namespace.h, and group
    fwd declarations at top.

    Signed-off-by: Serge E. Hallyn
    Acked-by: "Eric W. Biederman"
    Acked-by: Daniel Lezcano
    Acked-by: David Howells
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • This allows setuid/setgid in containers. It also fixes some corner cases
    where kernel logic foregoes capability checks when uids are equivalent.
    The latter will need to be done throughout the whole kernel.

    Changelog:
    Jan 11: Use nsown_capable() as suggested by Bastian Blank.
    Jan 11: Fix logic errors in uid checks pointed out by Bastian.
    Feb 15: allow prlimit to current (was regression in previous version)
    Feb 23: remove debugging printks, uninline set_one_prio_perm and
    make it bool, and document its return value.

    Signed-off-by: Serge E. Hallyn
    Acked-by: "Eric W. Biederman"
    Acked-by: Daniel Lezcano
    Acked-by: David Howells
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • So we can let type safety keep things sane, and as a bonus we can remove
    the declaration of init_user_ns in capability.h.

    Signed-off-by: Serge E. Hallyn
    Cc: "Eric W. Biederman"
    Cc: Daniel Lezcano
    Cc: David Howells
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • ptrace is allowed to tasks in the same user namespace according to the
    usual rules (i.e. the same rules as for two tasks in the init user
    namespace). ptrace is also allowed to a user namespace to which the
    current task has the CAP_SYS_PTRACE capability.

    Changelog:
    Dec 31: Address feedback by Eric:
    . Correct ptrace uid check
    . Rename may_ptrace_ns to ptrace_capable
    . Also fix the cap_ptrace checks.
    Jan 1: Use const cred struct
    Jan 11: use task_ns_capable() in place of ptrace_capable().
    Feb 23: same_or_ancestore_user_ns() was not an appropriate
    check to constrain cap_issubset. Rather, cap_issubset()
    only is meaningful when both capsets are in the same
    user_ns.

    Signed-off-by: Serge E. Hallyn
    Cc: "Eric W. Biederman"
    Acked-by: Daniel Lezcano
    Acked-by: David Howells
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • Changelog:
    Dec 8: Fixed bug in my check_kill_permission pointed out by
    Eric Biederman.
    Dec 13: Apply Eric's suggestion to pass target task into kill_ok_by_cred()
    for clarity
    Dec 31: address comment by Eric Biederman:
    don't need cred/tcred in check_kill_permission.
    Jan 1: use const cred struct.
    Jan 11: Per Bastian Blank's advice, clean up kill_ok_by_cred().
    Feb 16: kill_ok_by_cred: fix bad parentheses
    Feb 23: per akpm, let compiler inline kill_ok_by_cred

    Signed-off-by: Serge E. Hallyn
    Acked-by: "Eric W. Biederman"
    Acked-by: Daniel Lezcano
    Acked-by: David Howells
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • Changelog:
    Feb 23: let clone_uts_ns() handle setting uts->user_ns
    To do so we need to pass in the task_struct who'll
    get the utsname, so we can get its user_ns.
    Feb 23: As per Oleg's comment, just pass in tsk, instead of two
    of its members.

    Signed-off-by: Serge E. Hallyn
    Acked-by: "Eric W. Biederman"
    Acked-by: Daniel Lezcano
    Acked-by: David Howells
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • - Introduce ns_capable to test for a capability in a non-default
    user namespace.
    - Teach cap_capable to handle capabilities in a non-default
    user namespace.

    The motivation is to get to the unprivileged creation of new
    namespaces. It looks like this gets us 90% of the way there, with
    only potential uid confusion issues left.

    I still need to handle getting all caps after creation but otherwise I
    think I have a good starter patch that achieves all of your goals.

    Changelog:
    11/05/2010: [serge] add apparmor
    12/14/2010: [serge] fix capabilities to created user namespaces
    Without this, if user serge creates a user_ns, he won't have
    capabilities to the user_ns he created. This is because we
    were first checking whether his effective caps had the caps
    he needed and returning -EPERM if not, and THEN checking whether
    he was the creator. Reverse those checks.
    12/16/2010: [serge] security_real_capable needs ns argument in !security case
    01/11/2011: [serge] add task_ns_capable helper
    01/11/2011: [serge] add nsown_capable() helper per Bastian Blank suggestion
    02/16/2011: [serge] fix a logic bug: the root user is always creator of
    init_user_ns, but should not always have capabilities to
    it! Fix the check in cap_capable().
    02/21/2011: Add the required user_ns parameter to security_capable,
    fixing a compile failure.
    02/23/2011: Convert some macros to functions as per akpm comments. Some
    couldn't be converted because we can't easily forward-declare
    them (they are inline if !SECURITY, extern if SECURITY). Add
    a current_user_ns function so we can use it in capability.h
    without #including cred.h. Move all forward declarations
    together to the top of the #ifdef __KERNEL__ section, and use
    kernel-doc format.
    02/23/2011: Per dhowells, clean up comment in cap_capable().
    02/23/2011: Per akpm, remove unreachable 'return -EPERM' in cap_capable.
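
    The ordering of checks described in the changelog above can be sketched
    in user space. This is only a simplified model, not the kernel's
    cap_capable(): the structs, the ns_capable_model() name, and the
    explicit init_ns argument are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct user_ns;

struct user {
    struct user_ns *ns;          /* namespace this user lives in */
};

struct user_ns {
    struct user *creator;        /* user who created this namespace */
};

struct cred {
    struct user *user;
    unsigned long cap_effective; /* bitmask of effective capabilities */
};

/*
 * Return true if 'cred' holds capability bit 'cap' with respect to 'target'.
 * 'init_ns' is the initial user namespace, the root of the hierarchy.
 */
static bool ns_capable_model(const struct cred *cred, struct user_ns *target,
                             const struct user_ns *init_ns, int cap)
{
    assert(cred != NULL && target != NULL && init_ns != NULL);

    for (;;) {
        /* The creator of a non-initial namespace has all caps in it. */
        if (target != init_ns && target->creator == cred->user)
            return true;

        /* Same namespace: fall back to the ordinary effective set. */
        if (target == cred->user->ns)
            return (cred->cap_effective >> cap) & 1UL;

        /* Reached the root without a match: no privilege. */
        if (target == init_ns)
            return false;

        /* A capability in a parent namespace covers its children. */
        target = target->creator->ns;
    }
}
```

    In this model the creator check comes first (the 12/14/2010 entry), and
    it is skipped for the initial namespace so that root does not
    automatically hold every capability there (the 02/16/2011 entry).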

    (Original written and signed off by Eric; latest, modified version
    acked by him)

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: export current_user_ns() for ecryptfs]
    [serge.hallyn@canonical.com: remove unneeded extra argument in selinux's task_has_capability]
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Serge E. Hallyn
    Acked-by: "Eric W. Biederman"
    Acked-by: Daniel Lezcano
    Acked-by: David Howells
    Cc: James Morris
    Signed-off-by: Serge E. Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • The expected course of development for user namespaces targeted
    capabilities is laid out at https://wiki.ubuntu.com/UserNamespace.

    Goals:

    - Make it safe for an unprivileged user to unshare namespaces. They
    will be privileged with respect to the new namespace, but this should
    only include resources which the unprivileged user already owns.

    - Provide separate limits and accounting for userids in different
    namespaces.

    Status:

    Currently (as of 2.6.38) you can clone with the CLONE_NEWUSER flag to
    get a new user namespace if you have the CAP_SYS_ADMIN, CAP_SETUID, and
    CAP_SETGID capabilities. What this gets you is a whole new set of
    userids, meaning that user 500 will have a different 'struct user' in
    your namespace than in other namespaces. So any accounting information
    stored in struct user will be unique to your namespace.

    However, throughout the kernel there are checks which

    - simply check for a capability. Since root in a child namespace
    has all capabilities, this means that a child namespace is not
    constrained.

    - simply compare uid1 == uid2. Since these are the integer uids,
    uid 500 in namespace 1 will be said to be equal to uid 500 in
    namespace 2.
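
    The second problem can be illustrated with a small user-space sketch.
    The struct ns_uid type and both helpers are hypothetical names used only
    here; the point is that an identity needs the namespace as well as the
    integer uid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct user_ns { int id; };      /* hypothetical namespace handle */

/* A uid is only meaningful together with the namespace it belongs to. */
struct ns_uid {
    const struct user_ns *ns;
    unsigned int uid;
};

/* Naive check: compares only the integer uids. */
static bool same_uid_naive(struct ns_uid a, struct ns_uid b)
{
    return a.uid == b.uid;
}

/* Namespace-aware check: both the namespace and the uid must match. */
static bool same_user(struct ns_uid a, struct ns_uid b)
{
    assert(a.ns != NULL && b.ns != NULL);
    return a.ns == b.ns && a.uid == b.uid;
}
```

    For uid 500 in two different namespaces, same_uid_naive() reports a
    match while same_user() does not.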

    As a result, the lxc implementation at lxc.sf.net does not use user
    namespaces. This is actually helpful because it leaves us free to
    develop user namespaces in such a way that, for some time, user
    namespaces may be unuseful.

    Bugs aside, this patchset is supposed to not at all affect systems which
    are not actively using user namespaces, and only restrict what tasks in
    child user namespace can do. They begin to limit privilege to a user
    namespace, so that root in a container cannot kill or ptrace tasks in the
    parent user namespace, and can only get world access rights to files.
    Since all files currently belong to the initial user namespace, that means
    that child user namespaces can only get world access rights to *all*
    files. While this temporarily makes user namespaces bad for system
    containers, it starts to get useful for some sandboxing.

    I've run the 'runltplite.sh' with and without this patchset and found no
    difference.

    This patch:

    copy_process() handles CLONE_NEWUSER before the rest of the namespaces.
    So in the case of clone(CLONE_NEWUSER|CLONE_NEWUTS) the new uts namespace
    will have the new user namespace as its owner. That is what we want,
    since we want root in that new userns to be able to have privilege over
    it.

    Changelog:
    Feb 15: don't set uts_ns->user_ns if we didn't create
    a new uts_ns.
    Feb 23: Move extern init_user_ns declaration from
    init/version.c to utsname.h.

    Signed-off-by: Serge E. Hallyn
    Acked-by: "Eric W. Biederman"
    Acked-by: Daniel Lezcano
    Acked-by: David Howells
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • Reorganize proc_get_sb() so it can be called before the struct pid of the
    first process is allocated.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Daniel Lezcano
    Cc: Oleg Nesterov
    Cc: Alexey Dobriyan
    Acked-by: Serge E. Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • This patchset is a cleanup and a preparation to unshare the pid namespace.
    These prerequisites prepare for Eric's patchset to give a file descriptor
    to a namespace and join an existing namespace.

    This patch:

    It turns out that the existing assignment of child_reaper in
    copy_process can handle the initial assignment of child_reaper; we just
    need to generalize the test in kernel/fork.c.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Daniel Lezcano
    Cc: Oleg Nesterov
    Cc: Alexey Dobriyan
    Acked-by: Serge E. Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • When dmesg_restrict is set to 1 CAP_SYS_ADMIN is needed to read the kernel
    ring buffer. But a root user without CAP_SYS_ADMIN is able to reset
    dmesg_restrict to 0.

    This is an issue when e.g. LXC (Linux Containers) are used and complete
    user space is running without CAP_SYS_ADMIN. An unprivileged and jailed
    root user can bypass the dmesg_restrict protection.

    With this patch writing to dmesg_restrict is only allowed when root has
    CAP_SYS_ADMIN.
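
    As a rough user-space illustration of the rule (not the actual sysctl
    handler), a write to the setting is refused unless the writer holds the
    capability. The function name and its parameters are assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Model of a write to dmesg_restrict: returns 0 on success, or -EPERM
 * when the writer lacks CAP_SYS_ADMIN. The setting is left untouched on
 * failure.
 */
static int write_dmesg_restrict(int *setting, int new_value,
                                bool has_cap_sys_admin)
{
    assert(setting != NULL);

    if (!has_cap_sys_admin)
        return -EPERM;

    *setting = new_value ? 1 : 0;
    return 0;
}
```

    A jailed root without CAP_SYS_ADMIN gets -EPERM here and the value
    stays at 1.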

    Signed-off-by: Richard Weinberger
    Acked-by: Dan Rosenberg
    Acked-by: Serge E. Hallyn
    Cc: Eric Paris
    Cc: Kees Cook
    Cc: James Morris
    Cc: Eugene Teo
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Weinberger
     
  • Add boundaries of allowed input ranges for: dirty_expire_centisecs,
    drop_caches, overcommit_memory, page-cluster and panic_on_oom.
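
    A bounded integer sysctl can be modelled as below. This is a sketch
    only: struct int_sysctl and sysctl_write_int() are hypothetical, and
    the bounds used by a caller are illustrative rather than taken from
    this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical description of one integer sysctl with inclusive bounds. */
struct int_sysctl {
    const char *name;
    int value;
    int min;
    int max;
};

/* Store new_value only if it lies within [min, max]; else -EINVAL. */
static int sysctl_write_int(struct int_sysctl *ctl, int new_value)
{
    assert(ctl != NULL && ctl->min <= ctl->max);

    if (new_value < ctl->min || new_value > ctl->max)
        return -EINVAL;

    ctl->value = new_value;
    return 0;
}
```

    For example, with an entry declared as { "overcommit_memory", 0, 0, 2 },
    writing 2 succeeds, while writing 3 or -1 is rejected and the old value
    is kept.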

    Signed-off-by: Petr Holasek
    Acked-by: Dave Young
    Cc: David Rientjes
    Cc: Wu Fengguang
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Holasek
     
  • Drop dead code.

    Signed-off-by: Denis Kirjanov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Denis Kirjanov
     
  • Since the for loop already checks table->procname, drop the useless
    table->procname checks inside the loop body.

    Signed-off-by: Denis Kirjanov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Denis Kirjanov
     
  • Changing cpuset->mems/cpuset->cpus should be protected under
    callback_mutex.

    cpuset_clone() doesn't follow this rule. It's ok because it's
    called when creating and initializing a cgroup, but we'd better
    hold the lock to avoid subtle breakage in the future.

    Signed-off-by: Li Zefan
    Acked-by: Paul Menage
    Acked-by: David Rientjes
    Cc: Miao Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Those functions that use NODEMASK_ALLOC() can't propagate errno
    to users, but will fail silently.

    Fix it by using a static nodemask_t variable for each function, and
    those variables are protected by cgroup_mutex.

    [akpm@linux-foundation.org: fix comment spelling, strengthen cgroup_lock comment]
    Signed-off-by: Li Zefan
    Cc: Paul Menage
    Acked-by: David Rientjes
    Cc: Miao Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • oldcs->mems_allowed is not modified during cpuset_attach(), so we don't
    have to copy it to a buffer allocated by NODEMASK_ALLOC(). Just pass it
    to cpuset_migrate_mm().

    Signed-off-by: Li Zefan
    Cc: Paul Menage
    Acked-by: David Rientjes
    Cc: Miao Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • It's not necessary to copy cpuset->mems_allowed to a buffer allocated by
    NODEMASK_ALLOC(). Just pass it to nodelist_scnprintf().

    As spotted by Paul, a side effect is that we fix a bug: the function can
    return -ENOMEM, but the caller doesn't expect a negative return value.
    Therefore change the return value of cpuset_sprintf_cpulist() and
    cpuset_sprintf_memlist() from int to size_t.

    Signed-off-by: Li Zefan
    Acked-by: Paul Menage
    Acked-by: David Rientjes
    Cc: Miao Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • In struct page_cgroup, we have a full word for flags but only a few are
    reserved. Use the remaining upper bits to encode, depending on
    configuration, the node or the section, to enable page_cgroup-to-page
    lookups without a direct pointer.

    This saves a full word for every page in a system with memory cgroups
    enabled.
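
    The bit packing itself can be sketched as follows. The 10-bit id width
    and the helper names are assumptions for illustration; the real width
    depends on the kernel configuration.

```c
#include <assert.h>
#include <limits.h>

#define WORD_BITS   (sizeof(unsigned long) * CHAR_BIT)
#define ID_BITS     10UL                      /* hypothetical id width */
#define ID_SHIFT    (WORD_BITS - ID_BITS)     /* id lives in the top bits */
#define ID_MASK     ((1UL << ID_BITS) - 1)
#define FLAGS_MASK  ((1UL << ID_SHIFT) - 1)   /* low bits stay as flags */

/* Store a node or section id in the upper bits, preserving the flags. */
static unsigned long set_array_id(unsigned long flags, unsigned long id)
{
    assert(id <= ID_MASK);
    return (flags & FLAGS_MASK) | (id << ID_SHIFT);
}

/* Recover the id from the upper bits of the flags word. */
static unsigned long get_array_id(unsigned long flags)
{
    return (flags >> ID_SHIFT) & ID_MASK;
}
```

    Setting or clearing low flag bits afterwards leaves the stored id
    intact, which is what lets the word serve both purposes.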

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Minchan Kim
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • res_counter_read_u64 reads a u64 value without a lock. That is dangerous
    in a 32-bit environment, where a 64-bit read is not atomic. Add locking.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

23 Mar, 2011

15 commits

  • We've been burned by regressions/bugs which we later realized could have
    been triaged quicker if only we'd paid closer attention to dmesg. To make
    it easier to audit dmesg, we'd like to make DEFAULT_MESSAGE_LEVEL
    Kconfig-settable. That way we can set it to KERN_NOTICE and audit any
    messages
    Cc: Ingo Molnar
    Cc: Joe Perches
    Cc: Olof Johansson
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mandeep Singh Baines
     
  • In an effort to reduce kernel address leaks that might be used to help
    target kernel privilege escalation exploits, this patch uses %pK when
    displaying addresses in /proc/kallsyms, /proc/modules, and
    /sys/module/*/sections/*.

    Note that this changes %x to %p, so some legitimately 0 values in
    /proc/kallsyms would have changed from 00000000 to "(null)". To avoid
    this, "(null)" is not used when using the "K" format. Anything that was
    already successfully parsing "(null)" in addition to full hex digits
    should have no problem with this change. (Thanks to Joe Perches for the
    suggestion.) Due to the %x to %p, "void *" casts are needed since these
    addresses are already "unsigned long" everywhere internally, due to their
    starting life as ELF section offsets.

    Signed-off-by: Kees Cook
    Cc: Eugene Teo
    Cc: Dan Rosenberg
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • For a platform with many consoles like:
    "console=tty1 console=ttyMFD2 console=ttyS0 earlyprintk=mrst"

    Each time a non-"selected_console" console (tty1 and ttyMFD2 here) gets
    registered, the existing kernel messages are printed again on all
    registered consoles: the "mrst" early console gets some messages three
    times, and "tty1" gets some twice.

    As suggested by Andrew Morton, every time a new console is registered, it
    will be set as the "exclusive" console which will dump the already
    existing kernel messages.

    Signed-off-by: Feng Tang
    Cc: Greg KH
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Feng Tang
     
  • On some architectures, the boot process involves de-registering the boot
    console (early boot), initializing drivers and then re-registering the
    console.

    This mechanism introduces a window in which no printk can happen on the
    console and messages are buffered and then printed once the new console is
    available.

    If a kernel crashes during this window, all that's left on the boot console
    is "console [foo] enabled, bootconsole disabled" making debug of the crash
    rather 'interesting'.

    With the "keep_bootcon" option, the boot console is not unregistered, so
    everything that happens up to the crash is still printed.

    The option is clearly meant only for debugging purposes as it introduces
    lots of duplicated info printed on console, but it will make bug reports
    from users easier as it doesn't require a kernel build just to figure out
    where we crash.

    Signed-off-by: Fabio M. Di Nitto
    Acked-by: David S. Miller
    Cc: Alan Cox
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabio M. Di Nitto
     
  • This patch addresses a couple of problems. One was the case when the
    hardlockup failed to start, it also failed to start the softlockup. There
    were valid cases when the hardlockup shouldn't start and that shouldn't
    block the softlockup (no lapic, bios controls perf counters).

    The second problem was when the hardlockup failed to start on boxes (from
    a no lapic or bios controlled perf counter case), it reported failure to
    the cpu notifier chain. This blocked the notifier from continuing to
    start other more critical pieces of cpu bring-up (in our case based on a
    2.6.32 fork, it was the mce). As a result, during soft cpu online/offline
    testing, the system would panic when a cpu was offlined because the cpu
    notifier would succeed in processing a watchdog disable cpu event and
    would panic in the mce case as a result of un-initialized variables from a
    never executed cpu up event.

    I realized the hardlockup/softlockup cases are really just debugging aids
    and should never impede the progress of a cpu up/down event. Therefore I
    modified the code to always return NOTIFY_OK and instead rely on printks
    to inform the user of problems.

    Signed-off-by: Don Zickus
    Acked-by: Peter Zijlstra
    Reviewed-by: WANG Cong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Don Zickus
     
  • When a cpu is considered stuck, instead of limping along and just printing
    a warning, it is sometimes preferred to just panic, let kdump capture the
    vmcore and reboot. This gets the machine back into a stable state quickly
    while saving the info that got it into a stuck state to begin with.

    Add a Kconfig option to allow users to set the hardlockup to panic
    by default. Also add in a 'nmi_watchdog=nopanic' to override this.
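
    The override can be modelled in user space as below. This is a sketch
    under assumptions: HARDLOCKUP_PANIC_DEFAULT stands in for whatever the
    Kconfig option selects, and parse_nmi_watchdog() is a hypothetical
    name.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the Kconfig-selected default. */
#define HARDLOCKUP_PANIC_DEFAULT true

/*
 * Return whether a hard lockup should panic, given the value passed as
 * nmi_watchdog=<arg>. Unrecognized values keep the built-in default.
 */
static bool parse_nmi_watchdog(const char *arg)
{
    bool panic_on_hardlockup = HARDLOCKUP_PANIC_DEFAULT;

    assert(arg != NULL);

    /* Each length passed to strncmp matches the literal it compares. */
    if (strncmp(arg, "nopanic", 7) == 0)
        panic_on_hardlockup = false;
    else if (strncmp(arg, "panic", 5) == 0)
        panic_on_hardlockup = true;

    return panic_on_hardlockup;
}
```

    With the default set to panic, only "nopanic" turns it off; an empty or
    unknown argument leaves the default in place.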

    [akpm@linux-foundation.org: fix strncmp length]
    Signed-off-by: Don Zickus
    Acked-by: Peter Zijlstra
    Reviewed-by: WANG Cong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Don Zickus
     
  • Cleanup: kill the dead code which does nothing but complicates the code
    and confuses the reader.

    sys_unshare(CLONE_THREAD/SIGHAND/VM) is not really implemented, and I
    doubt very much it will ever work. At least, nobody even tried since the
    original 99d1419d96d7df9cfa56 ("unshare system call -v5: system call
    handler function") was applied more than 4 years ago.

    And the code is not consistent. unshare_thread() always fails
    unconditionally, while unshare_sighand() and unshare_vm() pretend to work
    if there is nothing to unshare.

    Remove unshare_thread(), unshare_sighand(), unshare_vm() helpers and
    related variables and add a simple CLONE_THREAD | CLONE_SIGHAND| CLONE_VM
    check into check_unshare_flags().

    Also, move the "CLONE_NEWNS needs CLONE_FS" check from
    check_unshare_flags() to sys_unshare(). This looks more consistent and
    matches the similar do_sysvsem check in sys_unshare().

    Note: with or without this patch "atomic_read(mm->mm_users) > 1" can give
    a false positive due to get_task_mm().
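
    A user-space model of the simplified check might look like this. It is
    a sketch, not the code in kernel/fork.c: the function name and the
    mm_users parameter are assumptions, and only the three flags discussed
    above are modelled.

```c
#include <assert.h>
#include <errno.h>

/* Flag values as defined for clone() on Linux. */
#define CLONE_VM      0x00000100UL
#define CLONE_SIGHAND 0x00000800UL
#define CLONE_THREAD  0x00010000UL

/*
 * Unsharing the thread group, signal handlers or address space is not
 * implemented; it only "works" when there is nothing to unshare, that
 * is, when the mm has a single user.
 */
static int check_unshare_flags_model(unsigned long flags, int mm_users)
{
    assert(mm_users >= 1);

    if (flags & (CLONE_THREAD | CLONE_SIGHAND | CLONE_VM)) {
        if (mm_users > 1)
            return -EINVAL;
    }
    return 0;
}
```

    Because mm_users can be raised temporarily by get_task_mm(), a real
    caller may see -EINVAL even for a single-threaded task, which is the
    false positive noted above.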

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: Janak Desai
    Cc: Daniel Lezcano
    Cc: "Eric W. Biederman"
    Cc: KOSAKI Motohiro
    Cc: Alexey Dobriyan
    Acked-by: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Change the printk() calls to use the KERN_INFO/KERN_ERR levels, and fix
    other coding style errors. Not _all_ of them are gone, though.

    [akpm@linux-foundation.org: revert the bits I disagree with]
    Signed-off-by: Michael Rodriguez
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Rodriguez
     
  • Move setup_nr_cpu_ids(), smp_init() and some other SMP boot parameter
    setup functions from init/main.c to kernel/smp.c, saves some #ifdef
    CONFIG_SMP.

    Signed-off-by: WANG Cong
    Cc: Rakib Mullick
    Cc: David Howells
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Tejun Heo
    Cc: Arnd Bergmann
    Cc: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Amerigo Wang
     
  • The oops=panic cmdline option is not x86 specific, move it to generic code.
    Update documentation.

    Signed-off-by: Olaf Hering
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Olaf Hering
     
  • ksoftirqd, kworker, migration, and pktgend kthreads can be created with
    kthread_create_on_node(), to get proper NUMA affinities for their stack and
    task_struct.

    Signed-off-by: Eric Dumazet
    Acked-by: David S. Miller
    Reviewed-by: Andi Kleen
    Acked-by: Rusty Russell
    Acked-by: Tejun Heo
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: David Howells
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     
  • Because all kthreads are created from a single helper task, they all use
    memory from a single node for their kernel stack and task struct.

    This patch suite creates kthread_create_on_node(), adding a 'cpu' parameter
    to parameters already used by kthread_create().

    This parameter serves in allocating memory for the new kthread on its
    memory node if possible.

    Signed-off-by: Eric Dumazet
    Acked-by: David S. Miller
    Reviewed-by: Andi Kleen
    Acked-by: Rusty Russell
    Cc: Tejun Heo
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: David Howells
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     
  • Add a node parameter to alloc_thread_info(), and change its name to
    alloc_thread_info_node()

    This change is needed to allow NUMA aware kthread_create_on_cpu()

    Signed-off-by: Eric Dumazet
    Acked-by: David S. Miller
    Reviewed-by: Andi Kleen
    Acked-by: Rusty Russell
    Cc: Tejun Heo
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: David Howells
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     
  • Because all kthreads are created from a single helper task, they all use
    memory from a single node for their kernel stack and task struct.

    This patch suite creates kthread_create_on_cpu(), adding a 'cpu' parameter
    to parameters already used by kthread_create().

    This parameter serves in allocating memory for the new kthread on its
    memory node if available.

    Users of this new function are : ksoftirqd, kworker, migration, pktgend...

    This patch:

    Add a node parameter to alloc_task_struct(), and change its name to
    alloc_task_struct_node()

    This change is needed to allow NUMA aware kthread_create_on_cpu()

    Signed-off-by: Eric Dumazet
    Acked-by: David S. Miller
    Reviewed-by: Andi Kleen
    Acked-by: Rusty Russell
    Cc: Tejun Heo
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: David Howells
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     
  • list_del() leaves poison in the prev and next pointers. The next
    list_empty() will compare those poisons, and say the list isn't empty.
    Any list operations that assume the node is on a list because of such a
    check will be fooled into dereferencing poison. One needs to INIT the
    node after the del, and fortunately there's already a wrapper for that -
    list_del_init().
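
    The difference can be reproduced with a minimal user-space list. This
    is a simplified re-implementation for illustration: the kernel poisons
    with fixed invalid addresses, whereas this sketch points at two dummy
    objects so that it stays well-defined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

/* Dummy objects standing in for the kernel's poison addresses. */
static struct list_head poison1, poison2;

static void init_list_head(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

static bool list_empty(const struct list_head *h)
{
    return h->next == h;
}

static void list_add(struct list_head *n, struct list_head *head)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

static void unlink_entry(struct list_head *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
}

/* Unlink and poison: list_empty() on the node is no longer meaningful. */
static void list_del(struct list_head *e)
{
    assert(e != NULL);
    unlink_entry(e);
    e->next = &poison1;
    e->prev = &poison2;
}

/* Unlink and re-initialize: the node can be queried with list_empty(). */
static void list_del_init(struct list_head *e)
{
    assert(e != NULL);
    unlink_entry(e);
    init_list_head(e);
}
```

    After list_del() the node's next pointer is poison, so list_empty() on
    it returns false even though it is on no list; after list_del_init() it
    returns true.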

    Some of the dels are followed by deallocations, so can be ignored, and one
    can be merged with an add to make a move. Apart from that, I erred on the
    side of caution in making nodes list_empty()-queriable.

    Signed-off-by: Phil Carmody
    Reviewed-by: Paul Menage
    Cc: Li Zefan
    Acked-by: Kirill A. Shutemov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Phil Carmody
     

22 Mar, 2011

1 commit

  • Userland should be able to trust the pid and uid of the sender of a
    signal if the si_code is SI_TKILL.

    Unfortunately, the kernel has historically allowed sigqueueinfo() to
    send any si_code at all (as long as it was negative - to distinguish it
    from kernel-generated signals like SIGILL etc), so it could spoof a
    SI_TKILL with incorrect siginfo values.

    Happily, it looks like glibc has always set si_code to the appropriate
    SI_QUEUE, so there is probably no actual user code that ever uses
    anything but the appropriate SI_QUEUE flag.

    So just tighten the check for si_code (we used to allow any negative
    value), and add a (one-time) warning in case there are binaries out
    there that might depend on using other si_code values.
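
    The before/after behaviour can be sketched as below, assuming the
    tightened rule is "accept only SI_QUEUE". Both function names are
    hypothetical and the one-time warning is left out.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>

/* Old rule as described above: any negative si_code was accepted. */
static int old_check_si_code(int si_code)
{
    return si_code >= 0 ? -EPERM : 0;
}

/* Tightened rule (assumed): only SI_QUEUE may come from user space. */
static int check_queued_si_code(const int *si_code)
{
    assert(si_code != NULL);
    return *si_code == SI_QUEUE ? 0 : -EPERM;
}
```

    On Linux, SI_TKILL is a negative value, so the old rule let a caller
    pass it through together with arbitrary pid and uid fields, while the
    tightened rule rejects it.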

    Signed-off-by: Julien Tinnes
    Acked-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Julien Tinnes
     

21 Mar, 2011

1 commit

  • * 'trivial' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild-2.6: (25 commits)
    video: change to new flag variable
    scsi: change to new flag variable
    rtc: change to new flag variable
    rapidio: change to new flag variable
    pps: change to new flag variable
    net: change to new flag variable
    misc: change to new flag variable
    message: change to new flag variable
    memstick: change to new flag variable
    isdn: change to new flag variable
    ieee802154: change to new flag variable
    ide: change to new flag variable
    hwmon: change to new flag variable
    dma: change to new flag variable
    char: change to new flag variable
    fs: change to new flag variable
    xtensa: change to new flag variable
    um: change to new flag variables
    s390: change to new flag variable
    mips: change to new flag variable
    ...

    Fix up trivial conflict in drivers/hwmon/Makefile

    Linus Torvalds
     

19 Mar, 2011

2 commits

  • …rnel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    genirq: Fix incorrect unlock in __setup_irq()
    cris: Use generic show_interrupts()
    genirq: show_interrupts: Check desc->name before printing it blindly
    cris: Use accessor functions to set IRQ_PER_CPU flag
    cris: Fix irq conversion fallout

    * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    sched, kernel-doc: Fix runqueue_is_locked() description

    Linus Torvalds
     
  • …/git/tip/linux-2.6-tip

    * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (30 commits)
    trace, filters: Initialize the match variable in process_ops() properly
    trace, documentation: Fix branch profiling location in debugfs
    oprofile, s390: Cleanups
    oprofile, s390: Remove hwsampler_files.c and merge it into init.c
    perf: Fix tear-down of inherited group events
    perf: Reorder & optimize perf_event_context to remove alignment padding on 64 bit builds
    perf: Handle stopped state with tracepoints
    perf: Fix the software events state check
    perf, powerpc: Handle events that raise an exception without overflowing
    perf, x86: Use INTEL_*_CONSTRAINT() for all PEBS event constraints
    perf, x86: Clean up SandyBridge PEBS events
    perf lock: Fix sorting by wait_min
    perf tools: Version incorrect with some versions of grep
    perf evlist: New command to list the names of events present in a perf.data file
    perf script: Add support for H/W and S/W events
    perf script: Add support for dumping symbols
    perf script: Support custom field selection for output
    perf script: Move printing of 'common' data from print_event and rename
    perf tracing: Remove print_graph_cpu and print_graph_proc from trace-event-parse
    perf script: Change process_event prototype
    ...

    Linus Torvalds