03 Apr, 2009

1 commit

  • Fix a number of issues with the per-MM VMA patch:

    (1) Make mmap_pages_allocated an atomic_long_t, just in case this is used on
    a NOMMU system with more than 2G pages. Makes no difference on a 32-bit
    system.

    (2) Report vma->vm_pgoff * PAGE_SIZE as a 64-bit value, not a 32-bit value,
    lest it overflow.

    (3) Move the allocation of the vm_area_struct slab back to fork.c.

    (4) Use KMEM_CACHE() for both the vm_area_struct and vm_region slabs (see
    the sketch after this list).

    (5) Use BUG_ON() rather than if () BUG().

    (6) Make the default validate_nommu_regions() a static inline rather than a
    #define.

    (7) Make free_page_series()'s objection to pages with a refcount != 1 more
    informative.

    (8) Adjust the __put_nommu_region() banner comment to indicate that the
    semaphore must be held for writing.

    (9) Limit the number of warnings about munmaps of non-mmapped regions.
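
    Taken together, cleanups (4)-(6) amount to something like the following (a
    hedged sketch, not the actual fork.c/nommu.c hunks; the slab variable names
    are assumptions):

        /* (4) slab creation via the KMEM_CACHE() helper */
        vm_region_jar = KMEM_CACHE(vm_region, SLAB_PANIC);
        vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC);

        /* (5) prefer BUG_ON() over an open-coded conditional */
        BUG_ON(!region);                /* instead of: if (!region) BUG(); */

        /* (6) a type-checked no-op instead of an empty #define */
        static inline void validate_nommu_regions(void) {}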

    Reported-by: Andrew Morton
    Signed-off-by: David Howells
    Cc: Greg Ungerer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     

30 Mar, 2009

1 commit


28 Mar, 2009

1 commit


10 Mar, 2009

1 commit

  • CLONE_PARENT can fool the ->self_exec_id/parent_exec_id logic. If we
    re-use the old parent, we must also re-use ->parent_exec_id to make
    sure exit_notify() sees the right ->xxx_exec_id's when the CLONE_PARENT'ed
    task exits.

    Also, move the "p->parent_exec_id = p->self_exec_id" assignment down, so
    that the two different cases sit together.
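
    Roughly, the copy_process() logic after the fix looks like this (a sketch of
    the intent, not the verbatim hunk):

        if (clone_flags & (CLONE_PARENT | CLONE_THREAD)) {
                p->real_parent = current->real_parent;
                p->parent_exec_id = current->parent_exec_id;
        } else {
                p->real_parent = current;
                p->parent_exec_id = current->self_exec_id;
        }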

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Andrew Morton
    Cc: David Howells
    Cc: Serge E. Hallyn
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

13 Feb, 2009

1 commit


12 Feb, 2009

2 commits

  • …el/git/tip/linux-2.6-tip

    * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    timers: fix TIMER_ABSTIME for process wide cpu timers
    timers: split process wide cpu clocks/timers, fix
    x86: clean up hpet timer reinit
    timers: split process wide cpu clocks/timers, remove spurious warning
    timers: split process wide cpu clocks/timers
    signal: re-add dead task accumulation stats.
    x86: fix hpet timer reinit for x86_64
    sched: fix nohz load balancer on cpu offline

    Linus Torvalds
     
  • …git/tip/linux-2.6-tip

    * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    ptrace, x86: fix the usage of ptrace_fork()
    i8327: fix outb() parameter order
    x86: fix math_emu register frame access
    x86: math_emu info cleanup
    x86: include correct %gs in a.out core dump
    x86, vmi: put a missing paravirt_release_pmd in pgd_dtor
    x86: find nr_irqs_gsi with mp_ioapic_routing
    x86: add clflush before monitor for Intel 7400 series
    x86: disable intel_iommu support by default
    x86: don't apply __supported_pte_mask to non-present ptes
    x86: fix grammar in user-visible BIOS warning
    x86/Kconfig.cpu: make Kconfig help readable in the console
    x86, 64-bit: print DMI info in the oops trace

    Linus Torvalds
     

11 Feb, 2009

1 commit

  • I noticed by pure accident we have ptrace_fork() and friends. This was
    added by "x86, bts: add fork and exit handling", commit
    bf53de907dfdaac178c92d774aae7370d7b97d20.

    I can't test this, ds_request_bts() returns -EOPNOTSUPP, but I strongly
    believe this needs the fix. I think something like this program

    #include <signal.h>
    #include <unistd.h>
    #include <sys/ptrace.h>
    #include <sys/wait.h>
    /* PTRACE_BTS_CONFIG, PTRACE_BTS_O_ALLOC and struct ptrace_bts_config come
       from the x86 ptrace ABI headers (<asm/ptrace-abi.h>) */

    int main(void)
    {
        int pid = fork();

        if (!pid) {
            /* child: let the parent trace us, stop, then fork again */
            ptrace(PTRACE_TRACEME, 0, NULL, NULL);
            kill(getpid(), SIGSTOP);
            fork();
        } else {
            struct ptrace_bts_config bts = {
                .flags = PTRACE_BTS_O_ALLOC,
                .size = 4 * 4096,
            };

            wait(NULL);

            /* trace forks, allocate a BTS buffer, then let the child run */
            ptrace(PTRACE_SETOPTIONS, pid, NULL, PTRACE_O_TRACEFORK);
            ptrace(PTRACE_BTS_CONFIG, pid, &bts, sizeof(bts));
            ptrace(PTRACE_CONT, pid, NULL, NULL);

            sleep(1);
        }

        return 0;
    }

    should crash the kernel.

    If the task is traced by its natural parent, ptrace_reparented() returns 0,
    but we should clear ->btsxxx anyway.

    Signed-off-by: Oleg Nesterov
    Acked-by: Markus Metzger
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     

07 Feb, 2009

1 commit

    I happened to fork lots of processes and hit a NULL pointer dereference. It
    is because copy_process() returns 0 rather than -EAGAIN when the max_threads
    check fails.

    The bug is introduced by "CRED: Detach the credentials from task_struct"
    (commit f1752eec6145c97163dbce62d17cf5d928e28a27).
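
    The fix amounts to re-setting the error code before the limit check, since a
    successful copy_creds() has just left retval at 0 (a sketch of the change,
    not the verbatim hunk):

        retval = -EAGAIN;
        if (nr_threads >= max_threads)
                goto bad_fork_cleanup_count;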

    Signed-off-by: Li Zefan
    Signed-off-by: David Howells
    Acked-by: James Morris
    Signed-off-by: Linus Torvalds

    Li Zefan
     

05 Feb, 2009

1 commit

  • We're going to split the process wide cpu accounting into two parts:

    - clocks; which can take all the time they want since they run
    from user context.

    - timers; which need constant time tracing but can afford the overhead
    because they're default off -- and rare.

    The clock readout will go back to a full sum of the thread group; for this
    we need to re-add the exit stats that were removed in the initial itimer
    rework (f06febc9: timers: fix itimer/many thread hang).

    Furthermore, since that full sum can be rather slow for large thread groups
    and we have the complete dead task stats, revert the do_notify_parent time
    computation.

    Signed-off-by: Peter Zijlstra
    Reviewed-by: Ingo Molnar
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

27 Jan, 2009

2 commits


21 Jan, 2009

1 commit


19 Jan, 2009

1 commit


14 Jan, 2009

2 commits


11 Jan, 2009

2 commits


10 Jan, 2009

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-nommu:
    NOMMU: Support XIP on initramfs
    NOMMU: Teach kobjsize() about VMA regions.
    FLAT: Don't attempt to expand the userspace stack to fill the space allocated
    FDPIC: Don't attempt to expand the userspace stack to fill the space allocated
    NOMMU: Improve procfs output using per-MM VMAs
    NOMMU: Make mmap allocation page trimming behaviour configurable.
    NOMMU: Make VMAs per MM as for MMU-mode linux
    NOMMU: Delete askedalloc and realalloc variables
    NOMMU: Rename ARM's struct vm_region
    NOMMU: Fix cleanup handling in ramfs_nommu_get_umapped_area()

    Linus Torvalds
     

09 Jan, 2009

1 commit

  • Currently task_active_pid_ns is not safe to call after a task becomes a
    zombie and exit_task_namespaces is called, as nsproxy becomes NULL. By
    reading the pid namespace from the pid of the task we can trivially solve
    this problem at the cost of one extra memory read in what should be the
    same cacheline as we read the namespace from.

    When moving things around I have made task_active_pid_ns out of line
    because keeping it in pid_namespace.h would require adding includes of
    pid.h and sched.h that I don't think we want.

    This change does make task_active_pid_ns unsafe to call during copy_process
    until we attach a pid to the task_struct, which seems a reasonable
    trade-off.
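
    The resulting out-of-line helper is essentially a pid-based lookup along
    these lines (a sketch; ns_of_pid() just returns pid->numbers[pid->level].ns
    for a non-NULL pid):

        struct pid_namespace *task_active_pid_ns(struct task_struct *tsk)
        {
                return ns_of_pid(task_pid(tsk));
        }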

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Sukadev Bhattiprolu
    Cc: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Bastian Blank
    Cc: Pavel Emelyanov
    Cc: Nadia Derbey
    Acked-by: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

08 Jan, 2009

2 commits

  • Make VMAs per mm_struct as for MMU-mode linux. This solves two problems:

    (1) In SYSV SHM where nattch for a segment does not reflect the number of
    shmat's (and forks) done.

    (2) In mmap() where the VMA's vm_mm is set to point to the parent mm by an
    exec'ing process when VM_EXECUTABLE is specified, regardless of the fact
    that a VMA might be shared and already have its vm_mm assigned to another
    process or a dead process.

    A new struct (vm_region) is introduced to track a mapped region and to remember
    the circumstances under which it may be shared and the vm_list_struct structure
    is discarded as it's no longer required.

    This patch makes the following additional changes:

    (1) Regions are now allocated with alloc_pages() rather than kmalloc() and
    with no recourse to __GFP_COMP, so the pages are not composite. Instead,
    each page has a reference on it held by the region. Anything else that is
    interested in such a page will have to get a reference on it to retain it.
    When the pages are released due to unmapping, each page is passed to
    put_page() and will be freed when the page usage count reaches zero (see the
    sketch below).

    (2) Excess pages are trimmed after an allocation as the allocation must be
    made as a power-of-2 quantity of pages.

    (3) VMAs are added to the parent MM's R/B tree and mmap lists. As an MM may
    end up with overlapping VMAs within the tree, the VMA struct address is
    appended to the sort key.

    (4) Non-anonymous VMAs are now added to the backing inode's prio list.

    (5) Holes may be punched in anonymous VMAs with munmap(), releasing parts of
    the backing region. The VMA and region structs will be split if
    necessary.

    (6) sys_shmdt() only releases one attachment to a SYSV IPC shared memory
    segment instead of all the attachments at that address. Multiple shmat()
    calls return the same address under NOMMU-mode instead of different virtual
    addresses as under MMU-mode.

    (7) Core dumping for ELF-FDPIC requires fewer exceptions for NOMMU-mode.

    (8) /proc/maps is now the global list of mapped regions, and may list bits
    that aren't actually mapped anywhere.

    (9) /proc/meminfo gains a line (tagged "MmapCopy") that indicates the amount
    of RAM currently allocated by mmap to hold mappable regions that can't be
    mapped directly. These are copies of the backing device or file if not
    anonymous.

    These changes make NOMMU mode more similar to MMU mode. The downside is that
    NOMMU mode now requires some extra memory to track things compared to NOMMU
    without this patch (VMAs are no longer shared, and there are now region
    structs).
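
    A hedged sketch of the page release behaviour described in change (1) of the
    list above (not the verbatim mm/nommu.c code):

        /* each constituent page holds one reference from the region; on unmap
         * the pages are returned individually rather than as a compound unit */
        static void free_page_series(unsigned long from, unsigned long to)
        {
                for (; from < to; from += PAGE_SIZE)
                        put_page(virt_to_page(from));
        }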

    Signed-off-by: David Howells
    Tested-by: Mike Frysinger
    Acked-by: Paul Mundt

    David Howells
     
    Either we bounce one cacheline per cpu per tick, yielding n^2 bounces, or we
    just bounce a single one.

    Also, using per-cpu allocations for the thread-groups complicates the
    per-cpu allocator in that it's currently aimed to be a fixed-size allocator
    and the only possible extension to that would be vmap based, which is
    seriously constrained on 32 bit archs.

    So making the per-cpu memory requirement depend on the number of
    processes is an issue.

    Lastly, it didn't deal with cpu-hotplug, although admittedly that might
    be fixable.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

07 Jan, 2009

2 commits

  • Introduce a new kernel parameter `coredump_filter'. Setting a value to
    this parameter causes the default bitmask of coredump_filter to be
    changed.

    It is useful for users to change coredump_filter settings for the whole
    system at boot time. Without this parameter, users have to change the
    /proc/<pid>/coredump_filter setting for each process in an initializing
    script.
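
    A minimal sketch of how such a boot parameter hook could look (the variable
    and symbol names follow what one would expect in kernel/fork.c, but are
    assumptions here, not a quote of the patch):

        static unsigned long default_dump_filter = MMF_DUMP_FILTER_DEFAULT;

        static int __init coredump_filter_setup(char *s)
        {
                /* parse the boot-time value into the default dump bitmask */
                default_dump_filter =
                        (simple_strtoul(s, NULL, 0) << MMF_DUMP_FILTER_SHIFT) &
                        MMF_DUMP_FILTER_MASK;
                return 1;
        }
        __setup("coredump_filter=", coredump_filter_setup);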

    Signed-off-by: Hidehiro Kawai
    Cc: Roland McGrath
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hidehiro Kawai
     
    Checking CLONE_SIGHAND alone is enough, because the combination of
    CLONE_THREAD and CLONE_SIGHAND is already checked in copy_process().

    Impact: cleanup, no functionality changed
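
    The resulting test in copy_sighand() is roughly (a sketch of the intent):

        if (clone_flags & CLONE_SIGHAND) {
                /* CLONE_THREAD implies CLONE_SIGHAND, so this covers both */
                atomic_inc(&current->sighand->count);
                return 0;
        }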

    Signed-off-by: Zhao Lei
    Cc: Roland McGrath
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhaolei
     

31 Dec, 2008

1 commit


29 Dec, 2008

2 commits

  • The mm->ioctx_list is currently protected by a reader-writer lock,
    so we always grab that lock on the read side for doing ioctx
    lookups. As the workload is extremely reader biased, turn this into
    an rcu hlist so we can make lookup_ioctx() lockless. Get rid of
    the rwlock and use a spinlock for providing update side exclusion.

    There's usually only 1 entry on this list, so it doesn't make sense
    to look into fancier data structures.
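
    The read side then follows the usual RCU list pattern, roughly as below (a
    sketch; the field and helper names are illustrative of the description
    above, not a quote of fs/aio.c):

        struct kioctx *ctx, *ret = NULL;
        struct hlist_node *n;

        rcu_read_lock();
        hlist_for_each_entry_rcu(ctx, n, &mm->ioctx_list, list) {
                if (ctx->user_id == ctx_id && !ctx->dead) {
                        get_ioctx(ctx);         /* take a reference before use */
                        ret = ctx;
                        break;
                }
        }
        rcu_read_unlock();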

    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • …el/git/tip/linux-2.6-tip

    * 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (241 commits)
    sched, trace: update trace_sched_wakeup()
    tracing/ftrace: don't trace on early stage of a secondary cpu boot, v3
    Revert "x86: disable X86_PTRACE_BTS"
    ring-buffer: prevent false positive warning
    ring-buffer: fix dangling commit race
    ftrace: enable format arguments checking
    x86, bts: memory accounting
    x86, bts: add fork and exit handling
    ftrace: introduce tracing_reset_online_cpus() helper
    tracing: fix warnings in kernel/trace/trace_sched_switch.c
    tracing: fix warning in kernel/trace/trace.c
    tracing/ring-buffer: remove unused ring_buffer size
    trace: fix task state printout
    ftrace: add not to regex on filtering functions
    trace: better use of stack_trace_enabled for boot up code
    trace: add a way to enable or disable the stack tracer
    x86: entry_64 - introduce FTRACE_ frame macro v2
    tracing/ftrace: add the printk-msg-only option
    tracing/ftrace: use preempt_enable_no_resched_notrace in ring_buffer_time_stamp()
    x86, bts: correctly report invalid bts records
    ...

    Fixed up trivial conflict in scripts/recordmcount.pl due to SH bits
    being already partly merged by the SH merge.

    Linus Torvalds
     

25 Dec, 2008

1 commit


20 Dec, 2008

1 commit

  • Impact: introduce new ptrace facility

    Add arch_ptrace_untrace() function that is called when the tracer
    detaches (either voluntarily or when the tracing task dies);
    ptrace_disable() is only called on a voluntary detach.

    Add ptrace_fork() and arch_ptrace_fork(). They are called when a
    traced task is forked.

    Clear DS and BTS related fields on fork.

    Release DS resources and reclaim memory in ptrace_untrace(). This now
    releases the resources as soon as the tracing task dies; we used to do that
    only when the traced task died.
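
    The fork-side plumbing presumably reduces to a thin arch hook (a sketch
    under that assumption, not the verbatim patch):

        /* called from copy_process() for a traced child */
        static inline void ptrace_fork(struct task_struct *child,
                                       unsigned long clone_flags)
        {
                arch_ptrace_fork(child, clone_flags);   /* clears DS/BTS state */
        }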

    Signed-off-by: Markus Metzger
    Signed-off-by: Ingo Molnar

    Markus Metzger
     

19 Dec, 2008

1 commit


11 Dec, 2008

1 commit

  • Lee Schermerhorn noticed yesterday that I broke the mapping_writably_mapped
    test in 2.6.7! Bad bad bug, good good find.

    The i_mmap_writable count must be incremented for VM_SHARED (just as
    i_writecount is for VM_DENYWRITE, but while holding the i_mmap_lock)
    when dup_mmap() copies the vma for fork: it has its own more optimal
    version of __vma_link_file(), and I missed this out. So the count
    was later going down to 0 (dangerous) when one end unmapped, then
    wrapping negative (inefficient) when the other end unmapped.

    The only impact on x86 would have been that setting a mandatory lock on
    a file which has at some time been opened O_RDWR and mapped MAP_SHARED
    (but not necessarily PROT_WRITE) across a fork, might fail with -EAGAIN
    when it should succeed, or succeed when it should fail.

    But those architectures which rely on flush_dcache_page() to flush
    userspace modifications back into the page before the kernel reads it,
    may in some cases have skipped the flush after such a fork - though any
    repetitive test will soon wrap the count negative, in which case it will
    flush_dcache_page() unnecessarily.

    Fix would be a two-liner, but mapping variable added, and comment moved.
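
    A sketch of the dup_mmap() hunk the description implies (context abridged,
    not the verbatim fix):

        /* with mapping->i_mmap_lock held while copying a file-backed vma */
        if (tmp->vm_flags & VM_SHARED)
                mapping->i_mmap_writable++;     /* balances the drop at unmap */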

    Reported-by: Lee Schermerhorn
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

09 Dec, 2008

1 commit


08 Dec, 2008

1 commit

  • While ideally CLONE_NEWUSER will eventually require no
    privilege, the required permission checks are currently
    not there. As a result, CLONE_NEWUSER has the same effect
    as a setuid(0)+setgroups(1,"0"). While we already require
    CAP_SYS_ADMIN, requiring CAP_SETUID and CAP_SETGID seems
    appropriate.
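
    The check presumably ends up looking something like this early in do_fork()
    (placement and exact form are assumptions, not a quote of the patch):

        if (clone_flags & CLONE_NEWUSER) {
                if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SETUID) ||
                    !capable(CAP_SETGID))
                        return -EPERM;
        }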

    Signed-off-by: Serge E. Hallyn
    Acked-by: "Eric W. Biederman"
    Signed-off-by: James Morris

    Serge E. Hallyn
     

04 Dec, 2008

1 commit

  • Impact: graph tracer race/crash fix

    There is a nasty race in the startup of a new process running the
    function graph tracer. In fork.c:

        total_forks++;
        spin_unlock(&current->sighand->siglock);
        write_unlock_irq(&tasklist_lock);
        ftrace_graph_init_task(p);
        proc_fork_connector(p);
        cgroup_post_fork(p);
        return p;

    The new task is free to run as soon as the tasklist_lock is released. This
    is before ftrace_graph_init_task() is called. If the task does run, it will
    be using the same ret_stack and curr_ret_stack as the parent. This will
    cause crashes that are difficult to debug.

    This patch moves the ftrace_graph_init_task to just after the alloc_pid
    code. This fixes the above race.

    Signed-off-by: Steven Rostedt
    Signed-off-by: Ingo Molnar

    Steven Rostedt
     

26 Nov, 2008

1 commit


25 Nov, 2008

1 commit

  • The user_ns is moved from nsproxy to user_struct, so that a struct
    cred by itself is sufficient to determine access (which it otherwise
    would not be). Corresponding ecryptfs fixes (by David Howells) are
    here as well.

    Fix refcounting. The following rules now apply:
    1. The task pins the user struct.
    2. The user struct pins its user namespace.
    3. The user namespace pins the struct user which created it.
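
    Under these rules a task's user namespace can be reached purely through its
    credentials, e.g. (a hedged sketch; the exact macro body is an assumption):

        #define current_user_ns()       (current_cred()->user->user_ns)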

    User namespaces are cloned during copy_creds(). Unsharing a new user_ns
    is no longer possible. (We could re-add that, but it'll cause code
    duplication and doesn't seem useful if PAM doesn't need to clone user
    namespaces).

    When a user namespace is created, its first user (uid 0) gets empty
    keyrings and a clean group_info.

    This incorporates a previous patch by David Howells. Here
    is his original patch description:

    >I suggest adding the attached incremental patch. It makes the following
    >changes:
    >
    > (1) Provides a current_user_ns() macro to wrap accesses to current's user
    > namespace.
    >
    > (2) Fixes eCryptFS.
    >
    > (3) Renames create_new_userns() to create_user_ns() to be more consistent
    > with the other associated functions and because the 'new' in the name is
    > superfluous.
    >
    > (4) Moves the argument and permission checks made for CLONE_NEWUSER to the
    > beginning of do_fork() so that they're done prior to making any attempts
    > at allocation.
    >
    > (5) Calls create_user_ns() after prepare_creds(), and gives it the new creds
    > to fill in rather than have it return the new root user. I don't imagine
    > the new root user being used for anything other than filling in a cred
    > struct.
    >
    > This also permits me to get rid of a get_uid() and a free_uid(), as the
    > reference the creds were holding on the old user_struct can just be
    > transferred to the new namespace's creator pointer.
    >
    > (6) Makes create_user_ns() reset the UIDs and GIDs of the creds under
    > preparation rather than doing it in copy_creds().
    >
    >David

    >Signed-off-by: David Howells

    Changelog:
    Oct 20: integrate dhowells comments
    1. leave thread_keyring alone
    2. use current_user_ns() in set_user()

    Signed-off-by: Serge Hallyn

    Serge Hallyn
     

24 Nov, 2008

1 commit


23 Nov, 2008

2 commits


19 Nov, 2008

1 commit