02 Jun, 2012

3 commits


01 Jun, 2012

1 commit

  • Pull second pile of signal handling patches from Al Viro:
    "This one is just task_work_add() series + remaining prereqs for it.

    There probably will be another pull request from that tree this
    cycle - at least for helpers, to get them out of the way for per-arch
    fixes remaining in the tree."

    Fix trivial conflict in kernel/irq/manage.c: the merge of Andrew's pile
    had brought in commit 97fd75b7b8e0 ("kernel/irq/manage.c: use the
    pr_foo() infrastructure to prefix printks") which changed one of the
    pr_err() calls that this merge moves around.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    keys: kill task_struct->replacement_session_keyring
    keys: kill the dummy key_replace_session_keyring()
    keys: change keyctl_session_to_parent() to use task_work_add()
    genirq: reimplement exit_irq_thread() hook via task_work_add()
    task_work_add: generic process-context callbacks
    avr32: missed _TIF_NOTIFY_RESUME on one of do_notify_resume callers
    parisc: need to check NOTIFY_RESUME when exiting from syscall
    move key_repace_session_keyring() into tracehook_notify_resume()
    TIF_NOTIFY_RESUME is defined on all targets now

    Linus Torvalds
     

25 May, 2012

1 commit

  • Pull user-space probe instrumentation from Ingo Molnar:
    "The uprobes code originates from SystemTap and has been used for years
    in Fedora and RHEL kernels. This version is much rewritten, reviews
    from PeterZ, Oleg and myself shaped the end result.

    This tree includes uprobes support in 'perf probe' - but SystemTap
    (and other tools) can take advantage of user probe points as well.

    Sample usage of uprobes via perf, for example to profile malloc()
    calls without modifying user-space binaries.

    First boot a new kernel with CONFIG_UPROBE_EVENT=y enabled.

    If you don't know which function you want to probe you can pick one
    from 'perf top' or can get a list all functions that can be probed
    within libc (binaries can be specified as well):

    $ perf probe -F -x /lib/libc.so.6

    To probe libc's malloc():

    $ perf probe -x /lib64/libc.so.6 malloc
    Added new event:
    probe_libc:malloc (on 0x7eac0)

    You can now use it in all perf tools, such as:

    perf record -e probe_libc:malloc -aR sleep 1

    Make use of it to create a call graph (as the flat profile is going to
    look very boring):

    $ perf record -e probe_libc:malloc -gR make
    [ perf record: Woken up 173 times to write data ]
    [ perf record: Captured and wrote 44.190 MB perf.data (~1930712

    $ perf report | less

    32.03% git libc-2.15.so [.] malloc
    |
    --- malloc

    29.49% cc1 libc-2.15.so [.] malloc
    |
    --- malloc
    |
    |--0.95%-- 0x208eb1000000000
    |
    |--0.63%-- htab_traverse_noresize

    11.04% as libc-2.15.so [.] malloc
    |
    --- malloc
    |

    7.15% ld libc-2.15.so [.] malloc
    |
    --- malloc
    |

    5.07% sh libc-2.15.so [.] malloc
    |
    --- malloc
    |
    4.99% python-config libc-2.15.so [.] malloc
    |
    --- malloc
    |
    4.54% make libc-2.15.so [.] malloc
    |
    --- malloc
    |
    |--7.34%-- glob
    | |
    | |--93.18%-- 0x41588f
    | |
    | --6.82%-- glob
    | 0x41588f

    ...

    Or:

    $ perf report -g flat | less

    # Overhead Command Shared Object Symbol
    # ........ ............. ............. ..........
    #
    32.03% git libc-2.15.so [.] malloc
    27.19%
    malloc

    29.49% cc1 libc-2.15.so [.] malloc
    24.77%
    malloc

    11.04% as libc-2.15.so [.] malloc
    11.02%
    malloc

    7.15% ld libc-2.15.so [.] malloc
    6.57%
    malloc

    ...

    The core uprobes design is fairly straightforward: uprobes probe
    points register themselves at (inode:offset) addresses of
    libraries/binaries, after which all existing (or new) vmas that map
    that address will have a software breakpoint injected at that address.
    vmas are COW-ed to preserve original content. The probe points are
    kept in an rbtree.

    If user-space executes the probed inode:offset instruction address
    then an event is generated which can be recovered from the regular
    perf event channels and mmap-ed ring-buffer.

    Multiple probes at the same address are supported, they create a
    dynamic callback list of event consumers.

    The basic model is further complicated by the XOL speedup: the
    original instruction that is probed is copied (in an architecture
    specific fashion) and executed out of line when the probe triggers.
    The XOL area is a single vma per process, with a fixed number of
    entries (which limits probe execution parallelism).

    The API: uprobes are installed/removed via
    /sys/kernel/debug/tracing/uprobe_events, the API is integrated to
    align with the kprobes interface as much as possible, but is separate
    to it.

    Injecting a probe point is privileged operation, which can be relaxed
    by setting perf_paranoid to -1.

    You can use multiple probes as well and mix them with kprobes and
    regular PMU events or tracepoints, when instrumenting a task."

    Fix up trivial conflicts in mm/memory.c due to previous cleanup of
    unmap_single_vma().

    * 'perf-uprobes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
    perf probe: Detect probe target when m/x options are absent
    perf probe: Provide perf interface for uprobes
    tracing: Fix kconfig warning due to a typo
    tracing: Provide trace events interface for uprobes
    tracing: Extract out common code for kprobes/uprobes trace events
    tracing: Modify is_delete, is_return from int to bool
    uprobes/core: Decrement uprobe count before the pages are unmapped
    uprobes/core: Make background page replacement logic account for rss_stat counters
    uprobes/core: Optimize probe hits with the help of a counter
    uprobes/core: Allocate XOL slots for uprobes use
    uprobes/core: Handle breakpoint and singlestep exceptions
    uprobes/core: Rename bkpt to swbp
    uprobes/core: Make order of function parameters consistent across functions
    uprobes/core: Make macro names consistent
    uprobes: Update copyright notices
    uprobes/core: Move insn to arch specific structure
    uprobes/core: Remove uprobe_opcode_sz
    uprobes/core: Make instruction tables volatile
    uprobes: Move to kernel/events/
    uprobes/core: Clean up, refactor and improve the code
    ...

    Linus Torvalds
     

24 May, 2012

4 commits

  • Kill the no longer used task_struct->replacement_session_keyring, update
    copy_creds() and exit_creds().

    Signed-off-by: Oleg Nesterov
    Acked-by: David Howells
    Cc: Thomas Gleixner
    Cc: Richard Kuo
    Cc: Linus Torvalds
    Cc: Alexander Gordeev
    Cc: Chris Zankel
    Cc: David Smith
    Cc: "Frank Ch. Eigler"
    Cc: Geert Uytterhoeven
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Tejun Heo
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Oleg Nesterov
     
  • exit_irq_thread() and task->irq_thread are needed to handle the unexpected
    (and unlikely) exit of irq-thread.

    We can use task_work instead and make this all private to
    kernel/irq/manage.c, cleanup plus micro-optimization.

    1. rename exit_irq_thread() to irq_thread_dtor(), make it
    static, and move it up before irq_thread().

    2. change irq_thread() to do task_work_add(irq_thread_dtor)
    at the start and task_work_cancel() before return.

    tracehook_notify_resume() can never play with kthreads,
    only do_exit()->exit_task_work() can call the callback
    and this is what we want.

    3. remove task_struct->irq_thread and the special hook
    in do_exit().

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Thomas Gleixner
    Cc: David Howells
    Cc: Richard Kuo
    Cc: Linus Torvalds
    Cc: Alexander Gordeev
    Cc: Chris Zankel
    Cc: David Smith
    Cc: "Frank Ch. Eigler"
    Cc: Geert Uytterhoeven
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Tejun Heo
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Oleg Nesterov
     
  • Provide a simple mechanism that allows running code in the (nonatomic)
    context of the arbitrary task.

    The caller does task_work_add(task, task_work) and this task executes
    task_work->func() either from do_notify_resume() or from do_exit(). The
    callback can rely on PF_EXITING to detect the latter case.

    "struct task_work" can be embedded in another struct, still it has "void
    *data" to handle the most common/simple case.

    This allows us to kill the ->replacement_session_keyring hack, and
    potentially this can have more users.

    Performance-wise, this adds 2 "unlikely(!hlist_empty())" checks into
    tracehook_notify_resume() and do_exit(). But at the same time we can
    remove the "replacement_session_keyring != NULL" checks from
    arch/*/signal.c and exit_creds().

    Note: task_work_add/task_work_run abuses ->pi_lock. This is only because
    this lock is already used by lookup_pi_state() to synchronize with
    do_exit() setting PF_EXITING. Fortunately the scope of this lock in
    task_work.c is really tiny, and the code is unlikely anyway.

    Signed-off-by: Oleg Nesterov
    Acked-by: David Howells
    Cc: Thomas Gleixner
    Cc: Richard Kuo
    Cc: Linus Torvalds
    Cc: Alexander Gordeev
    Cc: Chris Zankel
    Cc: David Smith
    Cc: "Frank Ch. Eigler"
    Cc: Geert Uytterhoeven
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Tejun Heo
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Oleg Nesterov
     
  • Pull user namespace enhancements from Eric Biederman:
    "This is a course correction for the user namespace, so that we can
    reach an inexpensive, maintainable, and reasonably complete
    implementation.

    Highlights:
    - Config guards make it impossible to enable the user namespace and
    code that has not been converted to be user namespace safe.

    - Use of the new kuid_t type ensures the if you somehow get past the
    config guards the kernel will encounter type errors if you enable
    user namespaces and attempt to compile in code whose permission
    checks have not been updated to be user namespace safe.

    - All uids from child user namespaces are mapped into the initial
    user namespace before they are processed. Removing the need to add
    an additional check to see if the user namespace of the compared
    uids remains the same.

    - With the user namespaces compiled out the performance is as good or
    better than it is today.

    - For most operations absolutely nothing changes performance or
    operationally with the user namespace enabled.

    - The worst case performance I could come up with was timing 1
    billion cache cold stat operations with the user namespace code
    enabled. This went from 156s to 164s on my laptop (or 156ns to
    164ns per stat operation).

    - (uid_t)-1 and (gid_t)-1 are reserved as an internal error value.
    Most uid/gid setting system calls treat these value specially
    anyway so attempting to use -1 as a uid would likely cause
    entertaining failures in userspace.

    - If setuid is called with a uid that can not be mapped setuid fails.
    I have looked at sendmail, login, ssh and every other program I
    could think of that would call setuid and they all check for and
    handle the case where setuid fails.

    - If stat or a similar system call is called from a context in which
    we can not map a uid we lie and return overflowuid. The LFS
    experience suggests not lying and returning an error code might be
    better, but the historical precedent with uids is different and I
    can not think of anything that would break by lying about a uid we
    can't map.

    - Capabilities are localized to the current user namespace making it
    safe to give the initial user in a user namespace all capabilities.

    My git tree covers all of the modifications needed to convert the core
    kernel and enough changes to make a system bootable to runlevel 1."

    Fix up trivial conflicts due to nearby independent changes in fs/stat.c

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits)
    userns: Silence silly gcc warning.
    cred: use correct cred accessor with regards to rcu read lock
    userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq
    userns: Convert cgroup permission checks to use uid_eq
    userns: Convert tmpfs to use kuid and kgid where appropriate
    userns: Convert sysfs to use kgid/kuid where appropriate
    userns: Convert sysctl permission checks to use kuid and kgids.
    userns: Convert proc to use kuid/kgid where appropriate
    userns: Convert ext4 to user kuid/kgid where appropriate
    userns: Convert ext3 to use kuid/kgid where appropriate
    userns: Convert ext2 to use kuid/kgid where appropriate.
    userns: Convert devpts to use kuid/kgid where appropriate
    userns: Convert binary formats to use kuid/kgid where appropriate
    userns: Add negative depends on entries to avoid building code that is userns unsafe
    userns: signal remove unnecessary map_cred_ns
    userns: Teach inode_capable to understand inodes whose uids map to other namespaces.
    userns: Fail exec for suid and sgid binaries with ids outside our user namespace.
    userns: Convert stat to return values mapped from kuids and kgids
    userns: Convert user specfied uids and gids in chown into kuids and kgid
    userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs
    ...

    Linus Torvalds
     

23 May, 2012

1 commit

  • Pull scheduler changes from Ingo Molnar:
    "The biggest change is the cleanup/simplification of the load-balancer:
    instead of the current practice of architectures twiddling scheduler
    internal data structures and providing the scheduler domains in
    colorfully inconsistent ways, we now have generic scheduler code in
    kernel/sched/core.c:sched_init_numa() that looks at the architecture's
    node_distance() parameters and (while not fully trusting it) deducts a
    NUMA topology from it.

    This inevitably changes balancing behavior - hopefully for the better.

    There are various smaller optimizations, cleanups and fixlets as well"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched: Taint kernel with TAINT_WARN after sleep-in-atomic bug
    sched: Remove stale power aware scheduling remnants and dysfunctional knobs
    sched/debug: Fix printing large integers on 32-bit platforms
    sched/fair: Improve the ->group_imb logic
    sched/nohz: Fix rq->cpu_load[] calculations
    sched/numa: Don't scale the imbalance
    sched/fair: Revert sched-domain iteration breakage
    sched/x86: Rewrite set_cpu_sibling_map()
    sched/numa: Fix the new NUMA topology bits
    sched/numa: Rewrite the CONFIG_NUMA sched domain support
    sched/fair: Propagate 'struct lb_env' usage into find_busiest_group
    sched/fair: Add some serialization to the sched_domain load-balance walk
    sched/fair: Let minimally loaded cpu balance the group
    sched: Change rq->nr_running to unsigned int
    x86/numa: Check for nonsensical topologies on real hw as well
    x86/numa: Hard partition cpu topology masks on node boundaries
    x86/numa: Allow specifying node_distance() for numa=fake
    x86/sched: Make mwait_usable() heed to "idle=" kernel parameters properly
    sched: Update documentation and comments
    sched_rt: Avoid unnecessary dequeue and enqueue of pushable tasks in set_cpus_allowed_rt()

    Linus Torvalds
     

22 May, 2012

1 commit

  • Pull security subsystem updates from James Morris:
    "New notable features:
    - The seccomp work from Will Drewry
    - PR_{GET,SET}_NO_NEW_PRIVS from Andy Lutomirski
    - Longer security labels for Smack from Casey Schaufler
    - Additional ptrace restriction modes for Yama by Kees Cook"

    Fix up trivial context conflicts in arch/x86/Kconfig and include/linux/filter.h

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (65 commits)
    apparmor: fix long path failure due to disconnected path
    apparmor: fix profile lookup for unconfined
    ima: fix filename hint to reflect script interpreter name
    KEYS: Don't check for NULL key pointer in key_validate()
    Smack: allow for significantly longer Smack labels v4
    gfp flags for security_inode_alloc()?
    Smack: recursive tramsmute
    Yama: replace capable() with ns_capable()
    TOMOYO: Accept manager programs which do not start with / .
    KEYS: Add invalidation support
    KEYS: Do LRU discard in full keyrings
    KEYS: Permit in-place link replacement in keyring list
    KEYS: Perform RCU synchronisation on keys prior to key destruction
    KEYS: Announce key type (un)registration
    KEYS: Reorganise keys Makefile
    KEYS: Move the key config into security/keys/Kconfig
    KEYS: Use the compat keyctl() syscall wrapper on Sparc64 for Sparc32 compat
    Yama: remove an unused variable
    samples/seccomp: fix dependencies on arch macros
    Yama: add additional ptrace scopes
    ...

    Linus Torvalds
     

17 May, 2012

1 commit

  • It's been broken forever (i.e. it's not scheduling in a power
    aware fashion), as reported by Suresh and others sending
    patches, and nobody cares enough to fix it properly ...
    so remove it to make space free for something better.

    There's various problems with the code as it stands today, first
    and foremost the user interface which is bound to topology
    levels and has multiple values per level. This results in a
    state explosion which the administrator or distro needs to
    master and almost nobody does.

    Furthermore large configuration state spaces aren't good, it
    means the thing doesn't just work right because it's either
    under so many impossibe to meet constraints, or even if
    there's an achievable state workloads have to be aware of
    it precisely and can never meet it for dynamic workloads.

    So pushing this kind of decision to user-space was a bad idea
    even with a single knob - it's exponentially worse with knobs
    on every node of the topology.

    There is a proposal to replace the user interface with a single
    3 state knob:

    sched_balance_policy := { performance, power, auto }

    where 'auto' would be the preferred default which looks at things
    like Battery/AC mode and possible cpufreq state or whatever the hw
    exposes to show us power use expectations - but there's been no
    progress on it in the past many months.

    Aside from that, the actual implementation of the various knobs
    is known to be broken. There have been sporadic attempts at
    fixing things but these always stop short of reaching a mergable
    state.

    Therefore this wholesale removal with the hopes of spurring
    people who care to come forward once again and work on a
    coherent replacement.

    Signed-off-by: Peter Zijlstra
    Cc: Suresh Siddha
    Cc: Arjan van de Ven
    Cc: Vincent Guittot
    Cc: Vaidyanathan Srinivasan
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/1326104915.2442.53.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

14 May, 2012

1 commit

  • Patches c22402a2f ("sched/fair: Let minimally loaded cpu balance the
    group") and 0ce90475 ("sched/fair: Add some serialization to the
    sched_domain load-balance walk") are horribly broken so revert them.

    The problem is that while it sounds good to have the minimally loaded
    cpu do the pulling of more load, the way we walk the domains there is
    absolutely no guarantee this cpu will actually get to the domain. In
    fact its very likely it wont. Therefore the higher up the tree we get,
    the less likely it is we'll balance at all.

    The first of mask always walks up, while sucky in that it accumulates
    load on the first cpu and needs extra passes to spread it out at least
    guarantees a cpu gets up that far and load-balancing happens at all.

    Since its now always the first and idle cpus should always be able to
    balance so they get a task as fast as possible we can also do away
    with the added serialization.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-rpuhs5s56aiv1aw7khv9zkw6@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

09 May, 2012

1 commit


07 May, 2012

1 commit


03 May, 2012

1 commit

  • Currently, PREEMPT_RCU readers are enqueued upon entry to the scheduler.
    This is inefficient because enqueuing is required only if there is a
    context switch, and entry to the scheduler does not guarantee a context
    switch.

    The commit therefore moves the enqueuing to immediately precede the
    call to switch_to() from the scheduler.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Tested-by: Linus Torvalds

    Paul E. McKenney
     

14 Apr, 2012

3 commits

  • Merge in latest upstream (and the latest perf development tree),
    to prepare for tooling changes, and also to pick up v3.4 MM
    changes that the uprobes code needs to take care of.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Replaces the seccomp_t typedef with struct seccomp to match modern
    kernel style.

    Signed-off-by: Will Drewry
    Reviewed-by: James Morris
    Acked-by: Serge Hallyn
    Acked-by: Eric Paris

    v18: rebase
    ...
    v14: rebase/nochanges
    v13: rebase on to 88ebdda6159ffc15699f204c33feb3e431bf9bdc
    v12: rebase on to linux-next
    v8-v11: no changes
    v7: struct seccomp_struct -> struct seccomp
    v6: original inclusion in this series.
    Signed-off-by: James Morris

    Will Drewry
     
  • With this change, calling
    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)
    disables privilege granting operations at execve-time. For example, a
    process will not be able to execute a setuid binary to change their uid
    or gid if this bit is set. The same is true for file capabilities.

    Additionally, LSM_UNSAFE_NO_NEW_PRIVS is defined to ensure that
    LSMs respect the requested behavior.

    To determine if the NO_NEW_PRIVS bit is set, a task may call
    prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
    It returns 1 if set and 0 if it is not set. If any of the arguments are
    non-zero, it will return -1 and set errno to -EINVAL.
    (PR_SET_NO_NEW_PRIVS behaves similarly.)

    This functionality is desired for the proposed seccomp filter patch
    series. By using PR_SET_NO_NEW_PRIVS, it allows a task to modify the
    system call behavior for itself and its child tasks without being
    able to impact the behavior of a more privileged task.

    Another potential use is making certain privileged operations
    unprivileged. For example, chroot may be considered "safe" if it cannot
    affect privileged tasks.

    Note, this patch causes execve to fail when PR_SET_NO_NEW_PRIVS is
    set and AppArmor is in use. It is fixed in a subsequent patch.

    Signed-off-by: Andy Lutomirski
    Signed-off-by: Will Drewry
    Acked-by: Eric Paris
    Acked-by: Kees Cook

    v18: updated change desc
    v17: using new define values as per 3.4
    Signed-off-by: James Morris

    Andy Lutomirski
     

08 Apr, 2012

2 commits

  • Modify alloc_uid to take a kuid and make the user hash table global.
    Stop holding a reference to the user namespace in struct user_struct.

    This simplifies the code and makes the per user accounting not
    care about which user namespace a uid happens to appear in.

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • With a user_ns reference in struct cred the only user of the user namespace
    reference in struct user_struct is to keep the uid hash table alive.

    The user_namespace reference in struct user_struct will be going away soon, and
    I have removed all of the references. Rename the field from user_ns to _user_ns
    so that the compiler can verify nothing follows the user struct to the user
    namespace anymore.

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

03 Apr, 2012

1 commit


29 Mar, 2012

2 commits

  • …m/linux/kernel/git/dhowells/linux-asm_system

    Pull "Disintegrate and delete asm/system.h" from David Howells:
    "Here are a bunch of patches to disintegrate asm/system.h into a set of
    separate bits to relieve the problem of circular inclusion
    dependencies.

    I've built all the working defconfigs from all the arches that I can
    and made sure that they don't break.

    The reason for these patches is that I recently encountered a circular
    dependency problem that came about when I produced some patches to
    optimise get_order() by rewriting it to use ilog2().

    This uses bitops - and on the SH arch asm/bitops.h drags in
    asm-generic/get_order.h by a circuituous route involving asm/system.h.

    The main difficulty seems to be asm/system.h. It holds a number of
    low level bits with no/few dependencies that are commonly used (eg.
    memory barriers) and a number of bits with more dependencies that
    aren't used in many places (eg. switch_to()).

    These patches break asm/system.h up into the following core pieces:

    (1) asm/barrier.h

    Move memory barriers here. This already done for MIPS and Alpha.

    (2) asm/switch_to.h

    Move switch_to() and related stuff here.

    (3) asm/exec.h

    Move arch_align_stack() here. Other process execution related bits
    could perhaps go here from asm/processor.h.

    (4) asm/cmpxchg.h

    Move xchg() and cmpxchg() here as they're full word atomic ops and
    frequently used by atomic_xchg() and atomic_cmpxchg().

    (5) asm/bug.h

    Move die() and related bits.

    (6) asm/auxvec.h

    Move AT_VECTOR_SIZE_ARCH here.

    Other arch headers are created as needed on a per-arch basis."

    Fixed up some conflicts from other header file cleanups and moving code
    around that has happened in the meantime, so David's testing is somewhat
    weakened by that. We'll find out anything that got broken and fix it..

    * tag 'split-asm_system_h-for-linus-20120328' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-asm_system: (38 commits)
    Delete all instances of asm/system.h
    Remove all #inclusions of asm/system.h
    Add #includes needed to permit the removal of asm/system.h
    Move all declarations of free_initmem() to linux/mm.h
    Disintegrate asm/system.h for OpenRISC
    Split arch_align_stack() out from asm-generic/system.h
    Split the switch_to() wrapper out of asm-generic/system.h
    Move the asm-generic/system.h xchg() implementation to asm-generic/cmpxchg.h
    Create asm-generic/barrier.h
    Make asm-generic/cmpxchg.h #include asm-generic/cmpxchg-local.h
    Disintegrate asm/system.h for Xtensa
    Disintegrate asm/system.h for Unicore32 [based on ver #3, changed by gxt]
    Disintegrate asm/system.h for Tile
    Disintegrate asm/system.h for Sparc
    Disintegrate asm/system.h for SH
    Disintegrate asm/system.h for Score
    Disintegrate asm/system.h for S390
    Disintegrate asm/system.h for PowerPC
    Disintegrate asm/system.h for PA-RISC
    Disintegrate asm/system.h for MN10300
    ...

    Linus Torvalds
     
  • Remove all #inclusions of asm/system.h preparatory to splitting and killing
    it. Performed with the following command:

    perl -p -i -e 's!^#\s*include\s*.*\n!!' `grep -Irl '^#\s*include\s*' *`

    Signed-off-by: David Howells

    David Howells
     

24 Mar, 2012

1 commit

  • Userspace service managers/supervisors need to track their started
    services. Many services daemonize by double-forking and get implicitly
    re-parented to PID 1. The service manager will no longer be able to
    receive the SIGCHLD signals for them, and is no longer in charge of
    reaping the children with wait(). All information about the children is
    lost at the moment PID 1 cleans up the re-parented processes.

    With this prctl, a service manager process can mark itself as a sort of
    'sub-init', able to stay as the parent for all orphaned processes
    created by the started services. All SIGCHLD signals will be delivered
    to the service manager.

    Receiving SIGCHLD and doing wait() is in cases of a service-manager much
    preferred over any possible asynchronous notification about specific
    PIDs, because the service manager has full access to the child process
    data in /proc and the PID can not be re-used until the wait(), the
    service-manager itself is in charge of, has happened.

    As a side effect, the relevant parent PID information does not get lost
    by a double-fork, which results in a more elaborate process tree and
    'ps' output:

    before:
    # ps afx
    253 ? Ss 0:00 /bin/dbus-daemon --system --nofork
    294 ? Sl 0:00 /usr/libexec/polkit-1/polkitd
    328 ? S 0:00 /usr/sbin/modem-manager
    608 ? Sl 0:00 /usr/libexec/colord
    658 ? Sl 0:00 /usr/libexec/upowerd
    819 ? Sl 0:00 /usr/libexec/imsettings-daemon
    916 ? Sl 0:00 /usr/libexec/udisks-daemon
    917 ? S 0:00 \_ udisks-daemon: not polling any devices

    after:
    # ps afx
    294 ? Ss 0:00 /bin/dbus-daemon --system --nofork
    426 ? Sl 0:00 \_ /usr/libexec/polkit-1/polkitd
    449 ? S 0:00 \_ /usr/sbin/modem-manager
    635 ? Sl 0:00 \_ /usr/libexec/colord
    705 ? Sl 0:00 \_ /usr/libexec/upowerd
    959 ? Sl 0:00 \_ /usr/libexec/udisks-daemon
    960 ? S 0:00 | \_ udisks-daemon: not polling any devices
    977 ? Sl 0:00 \_ /usr/libexec/packagekitd

    This prctl is orthogonal to PID namespaces. PID namespaces are isolated
    from each other, while a service management process usually requires the
    services to live in the same namespace, to be able to talk to each
    other.

    Users of this will be the systemd per-user instance, which provides
    init-like functionality for the user's login session and D-Bus, which
    activates bus services on-demand. Both need init-like capabilities to
    be able to properly keep track of the services they start.

    Many thanks to Oleg for several rounds of review and insights.

    [akpm@linux-foundation.org: fix comment layout and spelling]
    [akpm@linux-foundation.org: add lengthy code comment from Oleg]
    Reviewed-by: Oleg Nesterov
    Signed-off-by: Lennart Poettering
    Signed-off-by: Kay Sievers
    Acked-by: Valdis Kletnieks
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lennart Poettering
     

22 Mar, 2012

1 commit

  • Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
    changing cpuset's mems") wins a super prize for the largest number of
    memory barriers entered into fast paths for one commit.

    [get|put]_mems_allowed is incredibly heavy with pairs of full memory
    barriers inserted into a number of hot paths. This was detected while
    investigating at large page allocator slowdown introduced some time
    after 2.6.32. The largest portion of this overhead was shown by
    oprofile to be at an mfence introduced by this commit into the page
    allocator hot path.

    For extra style points, the commit introduced the use of yield() in an
    implementation of what looks like a spinning mutex.

    This patch replaces the full memory barriers on both read and write
    sides with a sequence counter with just read barriers on the fast path
    side. This is much cheaper on some architectures, including x86. The
    main bulk of the patch is the retry logic if the nodemask changes in a
    manner that can cause a false failure.

    While updating the nodemask, a check is made to see if a false failure
    is a risk. If it is, the sequence number gets bumped and parallel
    allocators will briefly stall while the nodemask update takes place.

    In a page fault test microbenchmark, oprofile samples from
    __alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The
    actual results were

    3.3.0-rc3 3.3.0-rc3
    rc3-vanilla nobarrier-v2r1
    Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%)
    Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%)
    Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%)
    Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%)
    Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%)
    Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%)
    Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%)
    Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%)
    Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%)
    Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%)
    Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%)
    Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%)
    Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%)
    Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%)
    Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%)
    MMTests Statistics: duration
    Sys Time Running Test (seconds) 135.68 132.17
    User+Sys Time Running Test (seconds) 164.2 160.13
    Total Elapsed Time (seconds) 123.46 120.87

    The overall improvement is small but the System CPU time is much
    improved and roughly in correlation to what oprofile reported (these
    performance figures are without profiling so skew is expected). The
    actual number of page faults is noticeably improved.

    For benchmarks like kernel builds, the overall benefit is marginal but
    the system CPU time is slightly reduced.

    To test the actual bug the commit fixed I opened two terminals. The
    first ran within a cpuset and continually ran a small program that
    faulted 100M of anonymous data. In a second window, the nodemask of the
    cpuset was continually randomised in a loop.

    Without the commit, the program would fail every so often (usually
    within 10 seconds) and obviously with the commit everything worked fine.
    With this patch applied, it also worked fine so the fix should be
    functionally equivalent.

    Signed-off-by: Mel Gorman
    Cc: Miao Xie
    Cc: David Rientjes
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

21 Mar, 2012

3 commits

  • Pull scheduler changes for v3.4 from Ingo Molnar

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
    printk: Make it compile with !CONFIG_PRINTK
    sched/x86: Fix overflow in cyc2ns_offset
    sched: Fix nohz load accounting -- again!
    sched: Update yield() docs
    printk/sched: Introduce special printk_sched() for those awkward moments
    sched/nohz: Correctly initialize 'next_balance' in 'nohz' idle balancer
    sched: Cleanup cpu_active madness
    sched: Fix load-balance wreckage
    sched: Clean up parameter passing of proc_sched_autogroup_set_nice()
    sched: Ditch per cgroup task lists for load-balancing
    sched: Rename load-balancing fields
    sched: Move load-balancing arguments into helper struct
    sched/rt: Do not submit new work when PI-blocked
    sched/rt: Prevent idle task boosting
    sched/wait: Add __wake_up_all_locked() API
    sched/rt: Document scheduler related skip-resched-check sites
    sched/rt: Use schedule_preempt_disabled()
    sched/rt: Add schedule_preempt_disabled()
    sched/rt: Do not throttle when PI boosting
    sched/rt: Keep period timer ticking when rt throttling is active
    ...

    Linus Torvalds
     
  • Pull irq/core changes for v3.4 from Ingo Molnar

    * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    genirq: Remove paranoid warnons and bogus fixups
    genirq: Flush the irq thread on synchronization
    genirq: Get rid of unnecessary IRQTF_DIED flag
    genirq: No need to check IRQTF_DIED before stopping a thread handler
    genirq: Get rid of unnecessary irqaction field in task_struct
    genirq: Fix incorrect check for forced IRQ thread handler
    softirq: Reduce invoke_softirq() code duplication
    genirq: Fix long-term regression in genirq irq_set_irq_type() handling
    x86-32/irq: Don't switch to irq stack for a user-mode irq

    Linus Torvalds
     
  • Pull RCU changes for v3.4 from Ingo Molnar. The major features of this
    series are:

    - making RCU more aggressive about entering dyntick-idle mode in order
    to improve energy efficiency

    - converting a few more call_rcu()s to kfree_rcu()s

    - applying a number of rcutree fixes and cleanups to rcutiny

    - removing CONFIG_SMP #ifdefs from treercu

    - allowing RCU CPU stall times to be set via sysfs

    - adding CPU-stall capability to rcutorture

    - adding more RCU-abuse diagnostics

    - updating documentation

    - fixing yet more issues located by the still-ongoing top-to-bottom
    inspection of RCU, this time with a special focus on the CPU-hotplug
    code path.

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
    rcu: Stop spurious warnings from synchronize_sched_expedited
    rcu: Hold off RCU_FAST_NO_HZ after timer posted
    rcu: Eliminate softirq-mediated RCU_FAST_NO_HZ idle-entry loop
    rcu: Add RCU_NONIDLE() for idle-loop RCU read-side critical sections
    rcu: Allow nesting of rcu_idle_enter() and rcu_idle_exit()
    rcu: Remove redundant check for rcu_head misalignment
    PTR_ERR should be called before its argument is cleared.
    rcu: Convert WARN_ON_ONCE() in rcu_lock_acquire() to lockdep
    rcu: Trace only after NULL-pointer check
    rcu: Call out dangers of expedited RCU primitives
    rcu: Rework detection of use of RCU by offline CPUs
    lockdep: Add CPU-idle/offline warning to lockdep-RCU splat
    rcu: No interrupt disabling for rcu_prepare_for_idle()
    rcu: Move synchronize_sched_expedited() to rcutree.c
    rcu: Check for illegal use of RCU from offlined CPUs
    rcu: Update stall-warning documentation
    rcu: Add CPU-stall capability to rcutorture
    rcu: Make documentation give more realistic rcutorture duration
    rcutorture: Permit holding off CPU-hotplug operations during boot
    rcu: Print scheduling-clock information on RCU CPU stall-warning messages
    ...

    Linus Torvalds
     

14 Mar, 2012

1 commit

  • Uprobes uses exception notifiers to get to know if a thread hit
    a breakpoint or a singlestep exception.

    When a thread hits a uprobe or is singlestepping post a uprobe
    hit, the uprobe exception notifier sets its TIF_UPROBE bit,
    which will then be checked on its return to userspace path
    (do_notify_resume() ->uprobe_notify_resume()), where the
    consumers handlers are run (in task context) based on the
    defined filters.

    Uprobe hits are thread specific and hence we need to maintain
    information about if a task hit a uprobe, what uprobe was hit,
    the slot where the original instruction was copied for xol so
    that it can be singlestepped with appropriate fixups.

    In some cases, special care is needed for instructions that are
    executed out of line (xol). These are architecture specific
    artefacts, such as handling RIP relative instructions on x86_64.

    Since the instruction at which the uprobe was inserted is
    executed out of line, architecture specific fixups are added so
    that the thread continues normal execution in the presence of a
    uprobe.

    Postpone the signals until we execute the probed insn.
    post_xol() path does a recalc_sigpending() before return to
    user-mode, this ensures the signal can't be lost.

    Uprobes relies on DIE_DEBUG notification to notify if a
    singlestep is complete.

    Adds x86 specific uprobe exception notifiers and appropriate
    hooks needed to determine a uprobe hit and subsequent post
    processing.

    Add requisite x86 fixups for xol for uprobes. Specific cases
    needing fixups include relative jumps (x86_64), calls, etc.

    Where possible, we check and skip singlestepping the
    breakpointed instructions. For now we skip single byte as well
    as few multibyte nop instructions. However this can be extended
    to other instructions too.

    Credits to Oleg Nesterov for suggestions/patches related to
    signal, breakpoint, singlestep handling code.

    Signed-off-by: Srikar Dronamraju
    Cc: Linus Torvalds
    Cc: Ananth N Mavinakayanahalli
    Cc: Jim Keniston
    Cc: Linux-mm
    Cc: Oleg Nesterov
    Cc: Andi Kleen
    Cc: Christoph Hellwig
    Cc: Steven Rostedt
    Cc: Arnaldo Carvalho de Melo
    Cc: Masami Hiramatsu
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20120313180011.29771.89027.sendpatchset@srdronam.in.ibm.com
    [ Performed various cleanliness edits ]
    Signed-off-by: Ingo Molnar

    Srikar Dronamraju
     

13 Mar, 2012

2 commits


10 Mar, 2012

1 commit

  • When a new thread handler is created, an irqaction is passed to it as
    data. Not only that irqaction is stored in task_struct by the handler
    for later use, but also a structure associated with the kernel thread
    keeps this value as long as the thread exists.

    This fix kicks irqaction out off task_struct. Yes, I introduce new bit
    field. But it allows not only to eliminate the duplicate, but also
    shortens size of task_struct.

    Reported-by: Oleg Nesterov
    Signed-off-by: Alexander Gordeev
    Link: http://lkml.kernel.org/r/20120309135925.GB2114@dhcp-26-207.brq.redhat.com
    Signed-off-by: Thomas Gleixner

    Alexander Gordeev
     

06 Mar, 2012

4 commits

  • Previously it was (ab)used by utrace. Then it was wrongly used by the
    scheduler code.

    Currently it is not used, kill it before it finds the new erroneous user.

    Signed-off-by: Oleg Nesterov
    Acked-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Now that CLONE_VFORK is killable, coredump_wait() no longer needs
    complete_vfork_done(). zap_threads() should find and kill all tasks with
    the same ->mm, this includes our parent if ->vfork_done is set.

    mm_release() becomes the only caller, unexport complete_vfork_done().

    Signed-off-by: Oleg Nesterov
    Acked-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Make vfork() killable.

    Change do_fork(CLONE_VFORK) to do wait_for_completion_killable(). If it
    fails we do not return to the user-mode and never touch the memory shared
    with our child.

    However, in this case we should clear child->vfork_done before return, we
    use task_lock() in do_fork()->wait_for_vfork_done() and
    complete_vfork_done() to serialize with each other.

    Note: now that we use task_lock() we don't really need completion, we
    could turn task->vfork_done into "task_struct *wake_up_me" but this needs
    some complications.

    NOTE: this and the next patches do not affect in-kernel users of
    CLONE_VFORK, kernel threads run with all signals ignored including
    SIGKILL/SIGSTOP.

    However this is obviously the user-visible change. Not only a fatal
    signal can kill the vforking parent, a sub-thread can do execve or
    exit_group() and kill the thread sleeping in vfork().

    Signed-off-by: Oleg Nesterov
    Acked-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • No functional changes.

    Move the clear-and-complete-vfork_done code into the new trivial helper,
    complete_vfork_done().

    Signed-off-by: Oleg Nesterov
    Acked-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

02 Mar, 2012

1 commit


01 Mar, 2012

2 commits