09 Jun, 2012

4 commits

  • Pull scheduler fixes from Ingo Molnar.

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched: Fix the relax_domain_level boot parameter
    sched: Validate assumptions in sched_init_numa()
    sched: Always initialize cpu-power
    sched: Fix domain iteration
    sched/rt: Fix lockdep annotation within find_lock_lowest_rq()
    sched/numa: Load balance between remote nodes
    sched/x86: Calculate booted cores after construction of sibling_mask

    Linus Torvalds
     
  • Fix lots of new kernel-doc warnings in kernel/sched/fair.c:

    Warning(kernel/sched/fair.c:3625): No description found for parameter 'env'
    Warning(kernel/sched/fair.c:3625): Excess function parameter 'sd' description in 'update_sg_lb_stats'
    Warning(kernel/sched/fair.c:3735): No description found for parameter 'env'
    Warning(kernel/sched/fair.c:3735): Excess function parameter 'sd' description in 'update_sd_pick_busiest'
    Warning(kernel/sched/fair.c:3735): Excess function parameter 'this_cpu' description in 'update_sd_pick_busiest'
    .. more warnings

    Signed-off-by: Randy Dunlap
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Pull perf fixes from Ingo Molnar:
    "A bit larger than what I'd wish for - half of it is due to hw driver
    updates to Intel Ivy-Bridge which info got recently released,
    cycles:pp should work there now too, amongst other things. (but we
    are generally making exceptions for hardware enablement of this type.)

    There are also callchain fixes in it - responding to mostly
    theoretical (but valid) concerns. The tooling side sports perf.data
    endianness/portability fixes which did not make it for the merge
    window - and various other fixes as well."

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
    perf/x86: Check user address explicitly in copy_from_user_nmi()
    perf/x86: Check if user fp is valid
    perf: Limit callchains to 127
    perf/x86: Allow multiple stacks
    perf/x86: Update SNB PEBS constraints
    perf/x86: Enable/Add IvyBridge hardware support
    perf/x86: Implement cycles:p for SNB/IVB
    perf/x86: Fix Intel shared extra MSR allocation
    x86/decoder: Fix bsr/bsf/jmpe decoding with operand-size prefix
    perf: Remove duplicate invocation on perf_event_for_each
    perf uprobes: Remove unnecessary check before strlist__delete
    perf symbols: Check for valid dso before creating map
    perf evsel: Fix 32 bit values endianity swap for sample_id_all header
    perf session: Handle endianity swap on sample_id_all header data
    perf symbols: Handle different endians properly during symbol load
    perf evlist: Pass third argument to ioctl explicitly
    perf tools: Update ioctl documentation for PERF_IOC_FLAG_GROUP
    perf tools: Make --version show kernel version instead of pull req tag
    perf tools: Check if callchain is corrupted
    perf callchain: Make callchain cursors TLS
    ...

    Linus Torvalds
     
  • Pull leap second timer fix from Thomas Gleixner.

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    timekeeping: Fix CLOCK_MONOTONIC inconsistency during leapsecond

    Linus Torvalds
     

08 Jun, 2012

6 commits

  • This reverts commit 40af1bbdca47e5c8a2044039bb78ca8fd8b20f94.

    It's horribly and utterly broken for at least the following reasons:

    - calling sync_mm_rss() from mmput() is fundamentally wrong, because
    there's absolutely no reason to believe that the task that does the
    mmput() always does it on its own VM. Example: fork, ptrace, /proc -
    you name it.

    - calling it *after* having done mmdrop() on it is doubly insane, since
    the mm struct may well be gone now.

    - testing mm against NULL before you call it is insane too, since a
    NULL mm there would have caused oopses long before.

    .. and those are just the three bugs I found before I decided to give up
    looking for more and revert it asap. I should have caught it before I
    even took it, but I trusted Andrew too much.

    Cc: Konstantin Khlebnikov
    Cc: Markus Trippelsdorf
    Cc: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Oleg Nesterov
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • mm->rss_stat counters have per-task delta: task->rss_stat. Before
    changing task->mm pointer the kernel must flush this delta with
    sync_mm_rss().

    do_exit() already calls sync_mm_rss() to flush the rss-counters before
    committing the rss statistics into task->signal->maxrss, taskstats,
    audit and other stuff. Unfortunately the kernel does this before
    calling mm_release(), which can call put_user() for processing
    task->clear_child_tid. So at this point we can trigger page-faults and
    task->rss_stat becomes non-zero again. As a result mm->rss_stat becomes
    inconsistent and check_mm() will print something like this:

    | BUG: Bad rss-counter state mm:ffff88020813c380 idx:1 val:-1
    | BUG: Bad rss-counter state mm:ffff88020813c380 idx:2 val:1

    This patch moves sync_mm_rss() into mm_release(), and moves mm_release()
    out of do_exit() and calls it earlier. After mm_release() there should
    be no pagefaults.

    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: Konstantin Khlebnikov
    Reported-by: Markus Trippelsdorf
    Cc: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Oleg Nesterov
    Cc: [3.4.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
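
The ordering problem described in the entry above can be illustrated with a small user-space model. This is only a sketch: the struct and function names are hypothetical stand-ins for mm->rss_stat, task->rss_stat and sync_mm_rss(), and the page counts are made up. It shows why a fault that happens after the flush leaves the shared counter off by one at teardown.

```c
#include <assert.h>
#include <stddef.h>

enum { MODEL_RSS_FILE, MODEL_RSS_ANON, MODEL_RSS_NR };  /* hypothetical indices */

struct model_mm   { long rss[MODEL_RSS_NR]; };                       /* shared counters */
struct model_task { struct model_mm *mm; long delta[MODEL_RSS_NR]; }; /* per-task delta */

/* A page fault only touches the per-task delta. */
void model_fault_in(struct model_task *t, int idx)
{
    assert(t != NULL && idx >= 0 && idx < MODEL_RSS_NR);
    t->delta[idx]++;
}

/* Flush the per-task delta into the shared counters (models sync_mm_rss()). */
void model_sync_rss(struct model_task *t)
{
    int i;

    assert(t != NULL && t->mm != NULL);
    for (i = 0; i < MODEL_RSS_NR; i++) {
        t->mm->rss[i] += t->delta[i];
        t->delta[i] = 0;
    }
}

/*
 * Simulate an exit in which one more page is faulted in late (as put_user()
 * for clear_child_tid can do). Returns the anon counter left over after all
 * three pages are unmapped; anything other than 0 is a "bad rss-counter".
 */
long model_exit_leftover(int flush_after_last_fault)
{
    struct model_mm mm = { { 0 } };
    struct model_task t = { &mm, { 0 } };

    model_fault_in(&t, MODEL_RSS_ANON);
    model_fault_in(&t, MODEL_RSS_ANON);
    if (!flush_after_last_fault)
        model_sync_rss(&t);             /* old order: flush too early */
    model_fault_in(&t, MODEL_RSS_ANON); /* late fault */
    if (flush_after_last_fault)
        model_sync_rss(&t);             /* new order: flush after it */

    mm.rss[MODEL_RSS_ANON] -= 3;        /* teardown unmaps all three pages */
    return mm.rss[MODEL_RSS_ANON];
}
```

In this model the early flush leaves a leftover of -1, while flushing after the last possible fault leaves 0.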
     
  • In commit b76437579d13 ("procfs: mark thread stack correctly in
    proc/<pid>/maps") the stack allocated via clone() is marked in
    /proc/<pid>/maps as [stack:%d], so it might lie outside the former
    mm->start_stack/end_stack values (and may even have some custom VMA
    flags set).

    So to be able to restore mm->start_stack/end_stack, drop the VMA flags
    test, but still require the underlying VMA to exist.

    As always, note that this feature is under CONFIG_CHECKPOINT_RESTORE and
    requires CAP_SYS_RESOURCE to be granted.

    Signed-off-by: Cyrill Gorcunov
    Cc: Oleg Nesterov
    Acked-by: Kees Cook
    Cc: Pavel Emelyanov
    Cc: Serge Hallyn
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Zero is written at clear_tid_address when the process exits. This
    functionality is used by pthread_join().

    We already have sys_set_tid_address() to change this address for the
    current task but there is no way to obtain it from user space.

    Without the ability to find this address and dump it we can't restore
    pthread'ed apps which call pthread_join() once they have been restored.

    This patch introduces the PR_GET_TID_ADDRESS prctl option, which allows
    the current process to obtain its own clear_tid_address.

    This feature is available only if CONFIG_CHECKPOINT_RESTORE is set.

    [akpm@linux-foundation.org: fix prctl numbering]
    Signed-off-by: Andrew Vagin
    Signed-off-by: Cyrill Gorcunov
    Cc: Pedro Alves
    Cc: Oleg Nesterov
    Cc: Pavel Emelyanov
    Cc: Tejun Heo
    Acked-by: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
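
A minimal user-space sketch of querying this address follows. It assumes the option is reachable through prctl(2) with the numeric value 40 (the value in linux/prctl.h); the wrapper name get_clear_tid_address() is made up for illustration. On kernels built without the feature the call is expected to fail, so the wrapper reports a negative errno value instead of aborting.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>

#ifndef PR_GET_TID_ADDRESS
#define PR_GET_TID_ADDRESS 40   /* fallback for older userspace headers */
#endif

/*
 * Ask the kernel for the calling thread's clear_tid_address.
 * Returns 0 and stores the address in *out on success, or a negative
 * errno value if the kernel rejects the request.
 */
int get_clear_tid_address(int **out)
{
    int *addr = NULL;

    assert(out != NULL);
    if (prctl(PR_GET_TID_ADDRESS, (unsigned long)&addr, 0UL, 0UL, 0UL) != 0)
        return -errno;
    *out = addr;
    return 0;
}
```

A checkpoint tool would call this from inside the thread being dumped and save the returned pointer next to the rest of the thread state.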
     
  • Make sure the address being set is greater than mmap_min_addr (as
    suggested by Kees Cook).

    Signed-off-by: Cyrill Gorcunov
    Acked-by: Kees Cook
    Cc: Serge Hallyn
    Cc: Tejun Heo
    Cc: Pavel Emelyanov
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • A fix for commit b32dfe377102 ("c/r: prctl: add ability to set new
    mm_struct::exe_file").

    After removing mm->num_exe_file_vmas, the kernel keeps mm->exe_file until
    the final mmput(); it never becomes NULL while the task is alive.

    We can check for other mapped files in mm instead of checking
    mm->num_exe_file_vmas, and mark mm with the MMF_EXE_FILE_CHANGED flag in
    order to forbid changing mm->exe_file a second time.

    Signed-off-by: Konstantin Khlebnikov
    Reviewed-by: Cyrill Gorcunov
    Cc: Oleg Nesterov
    Cc: Matt Helsley
    Cc: Kees Cook
    Cc: KOSAKI Motohiro
    Cc: Tejun Heo
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

06 Jun, 2012

9 commits

  • It does not get processed because sched_domain_level_max is 0 at the
    time that setup_relax_domain_level() is run.

    Simply accept the value as it is, as we don't know the value of
    sched_domain_level_max until sched domain construction is completed.

    Fix sched_relax_domain_level in cpuset. The build_sched_domain() routine calls
    the set_domain_attribute() routine prior to setting the sd->level, however,
    the set_domain_attribute() routine relies on the sd->level to decide whether
    idle load balancing will be off/on.

    Signed-off-by: Dimitri Sivanich
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20120605184436.GA15668@sgi.com
    Signed-off-by: Ingo Molnar

    Dimitri Sivanich
     
  • Add some code to validate assumptions we're making and output
    warnings if they do not hold.

    If this triggers, we want to know about it.

    Signed-off-by: Peter Zijlstra
    Cc: Alex Shi
    Link: http://lkml.kernel.org/n/tip-6uc3wk5s9udxtdl9cnku0vtt@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Often when we run into misshapen topologies the balance iteration
    fails to update the cpu power properly and we end up in divide-by-zero
    traps.

    Always initialize the cpu-power to a semi-sane value so that we can
    at least boot the machine, even if the load-balancer might not
    function correctly.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-3lbhyj25sr169ha7z3qht5na@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Weird topologies can lead to asymmetric domain setups. This needs
    further consideration since these setups are typically non-minimal
    too.

    For now, make it work by adding an extra mask selecting which CPUs
    are allowed to iterate up.

    The topology that triggered it is the one from David Rientjes:

    10 20 20 30
    20 10 20 20
    20 20 10 20
    30 20 20 10

    resulting in boxes that wouldn't even boot.

    Reported-by: David Rientjes
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-3p86l9cuaqnxz7uxsojmz5rm@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
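
To see why the distance table in the entry above is awkward, it helps to compute which nodes each node can reach within a given distance. The sketch below does only that; it is not the scheduler's domain-building code, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

#define MODEL_NR_NODES 4

/* Distance table quoted in the commit message above. */
static const int model_distance[MODEL_NR_NODES][MODEL_NR_NODES] = {
    { 10, 20, 20, 30 },
    { 20, 10, 20, 20 },
    { 20, 20, 10, 20 },
    { 30, 20, 20, 10 },
};

/* Bitmask of nodes whose distance from `node` is at most `limit`. */
unsigned int model_nodes_within(int node, int limit)
{
    unsigned int mask = 0;
    int other;

    assert(node >= 0 && node < MODEL_NR_NODES);
    for (other = 0; other < MODEL_NR_NODES; other++)
        if (model_distance[node][other] <= limit)
            mask |= 1u << other;
    return mask;
}

/* Print the set of nodes each node sees at one distance level. */
void model_print_spans(int limit)
{
    int node;

    for (node = 0; node < MODEL_NR_NODES; node++)
        printf("node %d, distance <= %d: mask 0x%x\n",
               node, limit, model_nodes_within(node, limit));
}
```

At distance 20, nodes 0 and 3 each see three nodes (masks 0x7 and 0xe) while nodes 1 and 2 see all four (0xf), so the spans at that level differ from node to node.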
     
  • Roland Dreier reported spurious, hard to trigger lockdep warnings
    within the scheduler - without any real lockup.

    This bit gives us the right clue:

    > [89945.640512] [] double_lock_balance+0x5a/0x90
    > [89945.640568] [] push_rt_task+0xc6/0x290

    If you look at that code you'll find the double_lock_balance() in
    question is the one in find_lock_lowest_rq() [yay for inlining].

    Now find_lock_lowest_rq() has a bug: it fails to use
    double_unlock_balance() in one exit path. If this results in a retry in
    push_rt_task(), we'll call double_lock_balance() again, at which point
    we'll run into said lockdep confusion.

    Reported-by: Roland Dreier
    Acked-by: Steven Rostedt
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1337282386.4281.77.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Commit cb83b629b ("sched/numa: Rewrite the CONFIG_NUMA sched
    domain support") removed the NODE sched domain and started checking
    whether the node distance in the SLIT table is farther than
    REMOTE_DISTANCE; if so, the load-balance chance at the
    exec/fork/wake_affine points is lost.

    But even when the node distance is farther than REMOTE_DISTANCE,
    modern CPUs have QPI-like connections, which ensure that memory
    access between nodes is not too slow. So the above change in behavior
    on NUMA machines causes a performance regression on various benchmarks:
    hackbench, tbench, netperf, oltp, etc.

    This patch restores the old scheduler behavior on all my
    Intel platforms: NHM EP/EX, WSM EP, SNB EP/EP4S, and thus fixes the
    performance regressions. (All of them have just two distances: 10 and 21.)

    Signed-off-by: Alex Shi
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1338965571-9812-1-git-send-email-alex.shi@intel.com
    Signed-off-by: Ingo Molnar

    Alex Shi
     
  • …it/acme/linux into perf/urgent

    Pull perf fixes from Arnaldo Carvalho de Melo:

    * Endianness fixes from Jiri Olsa

    * Fixes for make perf tarball

    * Fix for DSO name in perf script callchains, from David Ahern

    * Segfault fixes for perf top --callchain, from Namhyung Kim

    * Minor function result fixes from Srikar Dronamraju

    * Add missing 3rd ioctl parameter, from Namhyung Kim

    * Fix pager usage in minimal embedded systems, from Avik Sil

    Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     
  • Pull cgroup fix from Tejun Heo:
    "This fixes the possible premature superblock release on umount bug
    mentioned during v3.5-rc1 pull request.

    Originally, cgroup dentry destruction path assumed that cgroup dentry
    didn't have any reference left after cgroup removal thus put super
    during dentry removal. Now that there can be lingering dentry
    references, this led to super being put with live dentries. This
    patch fixes the problem by putting super ref on dentry release instead
    of removal."

    * 'for-3.5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: superblock can't be released with active dentries

    Linus Torvalds
     
  • Pull scheduler fixes from Ingo Molnar.

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched: Remove NULL assignment of dattr_cur
    sched: Remove the last NULL entry from sched_feat_names
    sched: Make sched_feat_names const
    sched/rt: Fix SCHED_RR across cgroups
    sched: Move nr_cpus_allowed out of 'struct sched_rt_entity'
    sched: Make sure to not re-read variables after validation
    sched: Fix SD_OVERLAP
    sched: Don't try allocating memory from offline nodes
    sched/nohz: Fix rq->cpu_load calculations some more
    sched/x86: Use cpu_llc_shared_mask(cpu) for coregroup_mask

    Linus Torvalds
     

05 Jun, 2012

3 commits

  • Commit 6b43ae8a61 (ntp: Fix leap-second hrtimer livelock) broke the
    leapsecond update of CLOCK_MONOTONIC. The missing leapsecond update to
    wall_to_monotonic causes discontinuities in CLOCK_MONOTONIC.

    Adjust wall_to_monotonic when NTP inserted a leapsecond.

    Reported-by: Richard Cochran
    Signed-off-by: John Stultz
    Tested-by: Richard Cochran
    Cc: stable@kernel.org
    Link: http://lkml.kernel.org/r/1338400497-12420-1-git-send-email-john.stultz@linaro.org
    Signed-off-by: Thomas Gleixner

    John Stultz
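
The arithmetic behind this fix can be shown with a seconds-only model. It assumes that the monotonic clock is derived as wall time plus an offset (the role wall_to_monotonic plays in the entry above); the struct, the function names and the starting values are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, seconds-only model of the two timekeeping values. */
struct model_clock {
    long long wall_sec;          /* wall-clock seconds */
    long long wall_to_mono_sec;  /* offset so that mono = wall + offset */
};

long long model_monotonic(const struct model_clock *c)
{
    assert(c != NULL);
    return c->wall_sec + c->wall_to_mono_sec;
}

/* Insert a leap second: wall time steps back by one second. */
void model_insert_leap(struct model_clock *c, int adjust_offset)
{
    assert(c != NULL);
    c->wall_sec -= 1;
    if (adjust_offset)
        c->wall_to_mono_sec += 1;   /* keeps the sum unchanged */
}

/* Change in the modeled monotonic clock across the leap second. */
long long model_monotonic_jump(int adjust_offset)
{
    struct model_clock c = { 1341100800LL, -1341000000LL }; /* arbitrary values */
    long long before = model_monotonic(&c);

    model_insert_leap(&c, adjust_offset);
    return model_monotonic(&c) - before;
}
```

Without the offset adjustment the modeled monotonic clock steps back by one second; with it the clock is continuous.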
     
  • …ernel.org/pub/scm/linux/kernel/git/tip/tip

    Pull irq and smpboot updates from Thomas Gleixner:
    "Just cleanup patches with no functional change and a fix for suspend
    issues."

    * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    genirq: Introduce irq_do_set_affinity() to reduce duplicated code
    genirq: Add IRQS_PENDING for nested and simple irq

    * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    smpboot, idle: Fix comment mismatch over idle_threads_init()
    smpboot, idle: Optimize calls to smp_processor_id() in idle_threads_init()

    Linus Torvalds
     
  • Pull timer updates from Thomas Gleixner:
    "The clocksource driver is pure hardware enablement and the skew option
    is default off, well tested and non dangerous."

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    tick: Move skew_tick option into the HIGH_RES_TIMER section
    clocksource: em_sti: Add DT support
    clocksource: em_sti: Emma Mobile STI driver
    clockevents: Make clockevents_config() a global symbol
    tick: Add tick skew boot option

    Linus Torvalds
     

02 Jun, 2012

6 commits

  • Pull third pile of signal handling patches from Al Viro:
    "This time it's mostly helpers and conversions to them; there's a lot
    of stuff remaining in the tree, but that'll either go in -rc2
    (isolated bug fixes, ideally via arch maintainers' trees) or will sit
    there until the next cycle."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    x86: get rid of calling do_notify_resume() when returning to kernel mode
    blackfin: check __get_user() return value
    whack-a-mole with TIF_FREEZE
    FRV: Optimise the system call exit path in entry.S [ver #2]
    FRV: Shrink TIF_WORK_MASK [ver #2]
    FRV: Prevent syscall exit tracing and notify_resume at end of kernel exceptions
    new helper: signal_delivered()
    powerpc: get rid of restore_sigmask()
    most of set_current_blocked() callers want SIGKILL/SIGSTOP removed from set
    set_restore_sigmask() is never called without SIGPENDING (and never should be)
    TIF_RESTORE_SIGMASK can be set only when TIF_SIGPENDING is set
    don't call try_to_freeze() from do_signal()
    pull clearing RESTORE_SIGMASK into block_sigmask()
    sh64: failure to build sigframe != signal without handler
    openrisc: tracehook_signal_handler() is supposed to be called on success
    new helper: sigmask_to_save()
    new helper: restore_saved_sigmask()
    new helpers: {clear,test,test_and_clear}_restore_sigmask()
    HAVE_RESTORE_SIGMASK is defined on all architectures now

    Linus Torvalds
     
  • Pull vfs changes from Al Viro.
    "A lot of misc stuff. The obvious groups:
    * Miklos' atomic_open series; kills the damn abuse of
    ->d_revalidate() by NFS, which was the major stumbling block for
    all work in that area.
    * ripping security_file_mmap() and dealing with deadlocks in the
    area; sanitizing the neighborhood of vm_mmap()/vm_munmap() in
    general.
    * ->encode_fh() switched to saner API; insane fake dentry in
    mm/cleancache.c gone.
    * assorted annotations in fs (endianness, __user)
    * parts of Artem's ->s_dirty work (jffs2 and reiserfs parts)
    * ->update_time() work from Josef.
    * other bits and pieces all over the place.

    Normally it would've been in two or three pull requests, but
    signal.git stuff had eaten a lot of time during this cycle ;-/"

    Fix up trivial conflicts in Documentation/filesystems/vfs.txt (the
    'truncate_range' inode method was removed by the VM changes, the VFS
    update adds an 'update_time()' method), and in fs/btrfs/ulist.[ch] (due
    to sparse fix added twice, with other changes nearby).

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (95 commits)
    nfs: don't open in ->d_revalidate
    vfs: retry last component if opening stale dentry
    vfs: nameidata_to_filp(): don't throw away file on error
    vfs: nameidata_to_filp(): inline __dentry_open()
    vfs: do_dentry_open(): don't put filp
    vfs: split __dentry_open()
    vfs: do_last() common post lookup
    vfs: do_last(): add audit_inode before open
    vfs: do_last(): only return EISDIR for O_CREAT
    vfs: do_last(): check LOOKUP_DIRECTORY
    vfs: do_last(): make ENOENT exit RCU safe
    vfs: make follow_link check RCU safe
    vfs: do_last(): use inode variable
    vfs: do_last(): inline walk_component()
    vfs: do_last(): make exit RCU safe
    vfs: split do_lookup()
    Btrfs: move over to use ->update_time
    fs: introduce inode operation ->update_time
    reiserfs: get rid of resierfs_sync_super
    reiserfs: mark the superblock as dirty a bit later
    ...

    Linus Torvalds
     
  • Does block_sigmask() + tracehook_signal_handler(); called when
    sigframe has been successfully built. All architectures converted
    to it; block_sigmask() itself is gone now (merged into this one).

    I'm still not too happy with the signature, but that's a separate
    story (IMO we need a structure that would contain signal number +
    siginfo + k_sigaction, so that get_signal_to_deliver() would fill one,
    signal_delivered(), handle_signal() and probably setup...frame() -
    take one).

    Signed-off-by: Al Viro

    Al Viro
     
  • Only 3 out of 63 do not. Renamed the current variant to __set_current_blocked(),
    added set_current_blocked() that will exclude unblockable signals, switched
    open-coded instances to it.

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Everyone either defines it in arch thread_info.h or has TIF_RESTORE_SIGMASK
    and picks default set_restore_sigmask() in linux/thread_info.h. Kill the
    ifdefs, slap #error in linux/thread_info.h to catch breakage when new ones
    get merged.

    Signed-off-by: Al Viro

    Al Viro
     

01 Jun, 2012

12 commits

  • Pull second pile of signal handling patches from Al Viro:
    "This one is just task_work_add() series + remaining prereqs for it.

    There probably will be another pull request from that tree this
    cycle - at least for helpers, to get them out of the way for per-arch
    fixes remaining in the tree."

    Fix trivial conflict in kernel/irq/manage.c: the merge of Andrew's pile
    had brought in commit 97fd75b7b8e0 ("kernel/irq/manage.c: use the
    pr_foo() infrastructure to prefix printks") which changed one of the
    pr_err() calls that this merge moves around.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    keys: kill task_struct->replacement_session_keyring
    keys: kill the dummy key_replace_session_keyring()
    keys: change keyctl_session_to_parent() to use task_work_add()
    genirq: reimplement exit_irq_thread() hook via task_work_add()
    task_work_add: generic process-context callbacks
    avr32: missed _TIF_NOTIFY_RESUME on one of do_notify_resume callers
    parisc: need to check NOTIFY_RESUME when exiting from syscall
    move key_repace_session_keyring() into tracehook_notify_resume()
    TIF_NOTIFY_RESUME is defined on all targets now

    Linus Torvalds
     
  • Merge misc patches from Andrew Morton:

    - the "misc" tree - stuff from all over the map

    - checkpatch updates

    - fatfs

    - kmod changes

    - procfs

    - cpumask

    - UML

    - kexec

    - mqueue

    - rapidio

    - pidns

    - some checkpoint-restore feature work. Reluctantly. Most of it
    delayed a release. I'm still rather worried that we don't have a
    clear roadmap to completion for this work.

    * emailed from Andrew Morton: (78 patches)
    kconfig: update compression algorithm info
    c/r: prctl: add ability to set new mm_struct::exe_file
    c/r: prctl: extend PR_SET_MM to set up more mm_struct entries
    c/r: procfs: add arg_start/end, env_start/end and exit_code members to /proc/$pid/stat
    syscalls, x86: add __NR_kcmp syscall
    fs, proc: introduce /proc/<pid>/task/<tid>/children entry
    sysctl: make kernel.ns_last_pid control dependent on CHECKPOINT_RESTORE
    aio/vfs: cleanup of rw_copy_check_uvector() and compat_rw_copy_check_uvector()
    eventfd: change int to __u64 in eventfd_signal()
    fs/nls: add Apple NLS
    pidns: make killed children autoreap
    pidns: use task_active_pid_ns in do_notify_parent
    rapidio/tsi721: add DMA engine support
    rapidio: add DMA engine support for RIO data transfers
    ipc/mqueue: add rbtree node caching support
    tools/selftests: add mq_perf_tests
    ipc/mqueue: strengthen checks on mqueue creation
    ipc/mqueue: correct mq_attr_ok test
    ipc/mqueue: improve performance of send/recv
    selftests: add mq_open_tests
    ...

    Linus Torvalds
     
  • When we restore, we would like to have a way to set up the former
    mm_struct::exe_file so that /proc/pid/exe points to the original
    executable file the process had at checkpoint time.

    For this the PR_SET_MM_EXE_FILE code is introduced. This option takes a
    file descriptor which will be set as the source for the new /proc/$pid/exe
    symlink.

    Note that it allows changing /proc/$pid/exe if there are no VM_EXECUTABLE
    vmas present for the current process, simply because this feature is
    specific to C/R and mm::num_exe_file_vmas becomes meaningless after that.

    To minimize the number of transitions the /proc/pid/exe symlink might have,
    this feature is implemented in a one-shot manner: once changed, the
    symlink can't be changed again. This should help sysadmins monitor the
    symlinks of all processes running in a system.

    In particular, one could take a snapshot of the processes and raise an
    alarm if there are unexpected changes to /proc/pid/exe in a system.

    Note -- this feature is available only if CONFIG_CHECKPOINT_RESTORE is set
    and the caller must have the CAP_SYS_RESOURCE capability granted; otherwise
    the request to change the symlink will be rejected.

    Signed-off-by: Cyrill Gorcunov
    Reviewed-by: Oleg Nesterov
    Cc: KOSAKI Motohiro
    Cc: Pavel Emelyanov
    Cc: Kees Cook
    Cc: Tejun Heo
    Cc: Matt Helsley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
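
From user space, the request described above would look roughly like the sketch below. It assumes the numeric values 35 for PR_SET_MM and 13 for PR_SET_MM_EXE_FILE (as in linux/prctl.h); the wrapper name and the path-based interface are made up for illustration. In most environments the call is expected to fail (missing capability, feature not configured, or the old executable still mapped), so the wrapper just reports the error.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <unistd.h>

#ifndef PR_SET_MM
#define PR_SET_MM 35            /* fallback for older userspace headers */
#endif
#ifndef PR_SET_MM_EXE_FILE
#define PR_SET_MM_EXE_FILE 13   /* fallback for older userspace headers */
#endif

/*
 * Try to make /proc/self/exe point at `path`.
 * Returns 0 on success or a negative errno value on failure.
 */
int set_exe_file_from_path(const char *path)
{
    int fd, rc = 0;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -errno;
    if (prctl(PR_SET_MM, (unsigned long)PR_SET_MM_EXE_FILE,
              (unsigned long)fd, 0UL, 0UL) != 0)
        rc = -errno;
    close(fd);
    return rc;
}
```

A restore tool would call this once, early, with the path of the original executable recorded at checkpoint time.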
     
  • During checkpoint we dump the whole process memory to a file, and the dump
    includes the process stack memory. Besides the stack data itself, the stack
    carries additional parameters such as command line arguments, environment
    data and the auxiliary vector.

    So during the restore procedure, once we've restored the stack data itself,
    we need to set up mm_struct::arg_start/end and env_start/end so that the
    restored process is able to find the command line arguments and
    environment data it had at checkpoint time. The same applies to the
    auxiliary vector.

    For this reason additional PR_SET_MM_(ARG_START | ARG_END | ENV_START |
    ENV_END | AUXV) codes are introduced.

    Signed-off-by: Cyrill Gorcunov
    Acked-by: Kees Cook
    Cc: Tejun Heo
    Cc: Andrew Vagin
    Cc: Serge Hallyn
    Cc: Pavel Emelyanov
    Cc: Vasiliy Kulikov
    Cc: KAMEZAWA Hiroyuki
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • While doing checkpoint-restore in user space, one needs to determine
    whether various kernel objects (like mm_struct-s or file_struct-s) are
    shared between tasks and restore this state.

    The second step can be solved by using the appropriate CLONE_ flags and
    the unshare syscall, while there is currently no way to solve the first.

    One way of checking whether two tasks share e.g. an mm_struct is to
    provide some mm_struct ID of a task in its proc file, but showing such
    info is considered to be not that good for security reasons.

    Thus after some debate we came to the conclusion that a so-called
    'comparison' syscall might be the best candidate. So here it is --
    __NR_kcmp.

    It takes up to 5 arguments - the pids of the two tasks (whose
    characteristics should be compared), the comparison type and (in case of
    comparison of files) two file descriptors.

    Lookups for pids are done in the caller's PID namespace only.

    At the moment only x86 is supported and tested.

    [akpm@linux-foundation.org: fix up selftests, warnings]
    [akpm@linux-foundation.org: include errno.h]
    [akpm@linux-foundation.org: tweak comment text]
    Signed-off-by: Cyrill Gorcunov
    Acked-by: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Andrey Vagin
    Cc: KOSAKI Motohiro
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc: Thomas Gleixner
    Cc: Glauber Costa
    Cc: Andi Kleen
    Cc: Tejun Heo
    Cc: Matt Helsley
    Cc: Pekka Enberg
    Cc: Eric Dumazet
    Cc: Vasiliy Kulikov
    Cc: Alexey Dobriyan
    Cc: Valdis.Kletnieks@vt.edu
    Cc: Michal Marek
    Cc: Frederic Weisbecker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
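
A minimal sketch of calling the new syscall from user space follows, comparing two file descriptors of the calling process. It assumes the libc headers expose SYS_kcmp and that the file comparison type has the value 0 (the first entry of enum kcmp_type in linux/kcmp.h); the wrapper name same_open_file() is hypothetical. The syscall may be unavailable or filtered, so errors are passed back as negative errno values.

```c
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed to match KCMP_FILE, the first value of enum kcmp_type. */
#define EXAMPLE_KCMP_FILE 0

/*
 * Returns 1 if fd_a and fd_b in the calling process refer to the same
 * open file, 0 if they do not, or a negative errno value if kcmp() is
 * unavailable or refuses the request.
 */
int same_open_file(int fd_a, int fd_b)
{
#ifdef SYS_kcmp
    long self = (long)getpid();
    long rc;

    assert(self > 0);
    rc = syscall(SYS_kcmp, self, self, (long)EXAMPLE_KCMP_FILE,
                 (unsigned long)fd_a, (unsigned long)fd_b);
    if (rc < 0)
        return -errno;
    return rc == 0;     /* 0 means "equal"; other values are orderings */
#else
    (void)fd_a;
    (void)fd_b;
    return -ENOSYS;
#endif
}
```

A descriptor and its dup() should compare as the same open file, while the two ends of a pipe should not.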
     
  • For those who don't need C/R functionality there is no need to control
    the last pid, i.e. the pid for the next fork() call.

    Signed-off-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Tejun Heo
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Force SIGCHLD handling to SIG_IGN so that signals are not generated and so
    that the children autoreap. This increases the parallelism and in general
    the speed of network namespace shutdown.

    Note that self-reaping children can exist past zap_pid_ns_processes(), but
    they will all be reaped before we allow the pid namespace init task with
    pid == 1 to be reaped.

    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Eric W. Biederman
    Cc: Oleg Nesterov
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Cc: Louis Rilling
    Cc: Mike Galbraith
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Using task_active_pid_ns is more robust because it works even after we
    have called exit_namespaces. This change allows us to have parent
    processes that are zombies. Normally a zombie parent process is crazy
    and the last thing you would want to have, but when we must not let the
    init process of a pid namespace be reaped until all of its children are
    dead and reaped, a zombie parent process is exactly what we want.

    Signed-off-by: Eric W. Biederman
    Cc: Oleg Nesterov
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Cc: Louis Rilling
    Cc: Mike Galbraith
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Add more comments on clear_tasks_mm_cpumask, plus add a runtime check:
    the function is only suitable for offlined CPUs, and if called
    inappropriately, the kernel should scream aloud.

    [akpm@linux-foundation.org: tweak comment: s/walks up/walks/, use 80 cols]
    Suggested-by: Andrew Morton
    Suggested-by: Peter Zijlstra
    Signed-off-by: Anton Vorontsov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Vorontsov
     
  • Many architectures clear tasks' mm_cpumask like this:

    read_lock(&tasklist_lock);
    for_each_process(p) {
            if (p->mm)
                    cpumask_clear_cpu(cpu, mm_cpumask(p->mm));
    }
    read_unlock(&tasklist_lock);

    Depending on the context, the code above may have several problems,
    such as:

    1. Working with task->mm without getting the mm or grabbing the task
    lock is dangerous as ->mm might disappear (exit_mm() assigns NULL under
    task_lock(), so tasklist lock is not enough).

    2. Checking for process->mm is not enough because process' main
    thread may exit or detach its mm via use_mm(), but other threads
    may still have a valid mm.

    This patch implements a small helper function that does things
    correctly, i.e.:

    1. We take the task's lock while we handle its mm (we can't use
    get_task_mm()/mmput() pair as mmput() might sleep);

    2. To catch exited main thread case, we use find_lock_task_mm(),
    which walks up all threads and returns an appropriate task
    (with task lock held).

    Also, per Peter Zijlstra's idea, we no longer grab tasklist_lock in
    the new helper; instead we take the rcu read lock. We can do this
    because the function is called after the cpu is taken down and marked
    offline, so no new tasks will get this cpu set in their mm mask.

    Signed-off-by: Anton Vorontsov
    Cc: Richard Weinberger
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Russell King
    Cc: Benjamin Herrenschmidt
    Cc: Mike Frysinger
    Cc: Paul Mundt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Vorontsov
     
  • The child should wake up the parent from vfork() only after finishing
    all operations with the shared mm. There is no sense in using
    CLONE_CHILD_CLEARTID together with CLONE_VFORK, but it looks more accurate
    now.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Konstantin Khlebnikov
    Cc: Markus Trippelsdorf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • In embedded systems, sometimes the same program (busybox) is the cause of
    multiple warnings. Outputting the pid with the program name in the
    warning printk helps distinguish which instances of a program are using
    the stack most.

    This is a small patch, but useful.

    Signed-off-by: Tim Bird
    Cc: Oleg Nesterov
    Cc: Frederic Weisbecker
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tim Bird