10 Sep, 2009

1 commit

  • In fs/binfmt_elf.c, load_elf_interp() calls padzero() for .bss even if
    the PT_LOAD has no PROT_WRITE and no .bss. This generates EFAULT.

    Here is a small test case. (Yes, there are other, useful PT_INTERP
    which have only .text and no .data/.bss.)

    ----- ptinterp.S
    _start: .globl _start
    nop
    int3
    -----
    $ gcc -m32 -nostartfiles -nostdlib -o ptinterp ptinterp.S
    $ gcc -m32 -Wl,--dynamic-linker=ptinterp -o hello hello.c
    $ ./hello
    Segmentation fault # during execve() itself

    After applying the patch:
    $ ./hello
    Trace trap # user-mode execution after execve() finishes

    If the ELF headers are actually self-inconsistent, then dying is fine.
    But having no PROT_WRITE segment is perfectly normal and correct if
    there is no segment with p_memsz > p_filesz (i.e. bss). John Reiser
    suggested checking for PROT_WRITE in the bss logic. I think it makes
    most sense to simply apply the bss logic only when there is bss.

    This patch looks less trivial than it is due to some reindentation.
    It just moves the "if (last_bss > elf_bss) {" test up to include the
    partial-page bss logic as well as the more-pages bss logic.

    Reported-by: John Reiser
    Signed-off-by: Roland McGrath
    Signed-off-by: Linus Torvalds

    Roland McGrath
     

02 Jul, 2009

1 commit


01 Jul, 2009

1 commit

  • When an ELF core dump is generated, a few program headers are written in
    addition to those for the dumped vmas.

    When max_map_count == 65536, a core generated by the following kind of
    code can be unreadable, because the number of ELF program headers is
    stored in a 16-bit field of the Ehdr (see elf.h) and the count overflows.

    ==
    ... = mmap(); (munmap, mprotect, etc...)
    if (failed)
    abort();
    ==

    This can happen in mmap/munmap/mprotect/etc., any of which may call
    split_vma().

    I think 65536 is not safe as the _default_, and reducing it to 65530 is a
    good way to avoid unexpectedly corrupted cores.

    Anyway, max_map_count can still be enlarged via sysctl if a user is brave.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Jakub Jelinek
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

19 Jun, 2009

1 commit


01 Apr, 2009

2 commits


07 Feb, 2009

1 commit

  • The elf_core_dump() code does its work with set_fs(KERNEL_DS) in force,
    so vma_dump_size() needs to switch back with set_fs(USER_DS) to safely
    use get_user() for a normal user-space address.

    Checking for VM_READ optimizes out the case where get_user() would fail
    anyway. The vm_file check here was already superfluous given the control
    flow earlier in the function, so that is a cleanup/optimization unrelated
    to other changes but an obvious and trivial one.

    Reported-by: Gerald Schaefer
    Signed-off-by: Roland McGrath

    Roland McGrath
     

09 Jan, 2009

1 commit

  • While discussing[1] the need for glibc to have access to random bytes
    during program load, it seems that an earlier attempt to implement
    AT_RANDOM got stalled. This implements a random 16 byte string, available
    to every ELF program via a new auxv AT_RANDOM vector.

    [1] http://sourceware.org/ml/libc-alpha/2008-10/msg00006.html

    Ulrich said:

    glibc needs right after startup a bit of random data for internal
    protections (stack canary etc). What is now in upstream glibc is that we
    always unconditionally open /dev/urandom, read some data, and use it. For
    every process startup. That's slow.

    ...

    The solution is to provide a limited amount of random data to the
    starting process in the aux vector. I suggested 16 bytes and this is
    what the patch implements. If we need only 16 bytes or less we use the
    data directly. If we need more we'll use the 16 bytes to seed a PRNG.
    This avoids the costly /dev/urandom use and it allows the kernel to use
    the most adequate source of random data for this purpose. It might not
    be the same pool as that for /dev/urandom.

    Concerns were expressed about the depletion of the randomness pool. But
    this patch doesn't make the situation worse, it doesn't deplete entropy
    more than happens now.

    Signed-off-by: Kees Cook
    Cc: Jakub Jelinek
    Cc: Andi Kleen
    Cc: Ulrich Drepper
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

29 Dec, 2008

1 commit

  • * 'for-linus' of git://git390.osdl.marist.edu/pub/scm/linux-2.6: (85 commits)
    [S390] provide documentation for hvc_iucv kernel parameter.
    [S390] convert ctcm printks to dev_xxx and pr_xxx macros.
    [S390] convert zfcp printks to pr_xxx macros.
    [S390] convert vmlogrdr printks to pr_xxx macros.
    [S390] convert zfcp dumper printks to pr_xxx macros.
    [S390] convert cpu related printks to pr_xxx macros.
    [S390] convert qeth printks to dev_xxx and pr_xxx macros.
    [S390] convert sclp printks to pr_xxx macros.
    [S390] convert iucv printks to dev_xxx and pr_xxx macros.
    [S390] convert ap_bus printks to pr_xxx macros.
    [S390] convert dcssblk and extmem printks messages to pr_xxx macros.
    [S390] convert monwriter printks to pr_xxx macros.
    [S390] convert s390 debug feature printks to pr_xxx macros.
    [S390] convert monreader printks to pr_xxx macros.
    [S390] convert appldata printks to pr_xxx macros.
    [S390] convert setup printks to pr_xxx macros.
    [S390] convert hypfs printks to pr_xxx macros.
    [S390] convert time printks to pr_xxx macros.
    [S390] convert cpacf printks to pr_xxx macros.
    [S390] convert cio printks to pr_xxx macros.
    ...

    Linus Torvalds
     

25 Dec, 2008

1 commit

  • arch_setup_additional_pages currently gets two arguments, the binary
    format description and an indication if the process uses an executable
    stack or not. The second argument is not used by anybody, it could
    be removed without replacement.

    What actually does make sense is to pass an indication if the process
    uses the elf interpreter or not. The glibc code will not use anything
    from the vdso if the process does not use the dynamic linker, so for
    statically linked binaries the architecture backend can choose not
    to map the vdso.

    Acked-by: Ingo Molnar
    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     

14 Nov, 2008

4 commits

  • Make execve() take advantage of copy-on-write credentials, allowing it to set
    up the credentials in advance, and then commit the whole lot after the point
    of no return.

    This patch and the preceding patches have been tested with the LTP SELinux
    testsuite.

    This patch makes several logical sets of alteration:

    (1) execve().

    The credential bits from struct linux_binprm are, for the most part,
    replaced with a single credentials pointer (bprm->cred). This means that
    all the creds can be calculated in advance and then applied at the point
    of no return with no possibility of failure.

    I would like to replace bprm->cap_effective with:

    cap_isclear(bprm->cap_effective)

    but this seems impossible due to special behaviour for processes of pid 1
    (they always retain their parent's capability masks where normally they'd
    be changed - see cap_bprm_set_creds()).

    The following sequence of events now happens:

    (a) At the start of do_execve, the current task's cred_exec_mutex is
    locked to prevent PTRACE_ATTACH from obsoleting the calculation of
    creds that we make.

    (a) prepare_exec_creds() is then called to make a copy of the current
    task's credentials and prepare it. This copy is then assigned to
    bprm->cred.

    This renders security_bprm_alloc() and security_bprm_free()
    unnecessary, and so they've been removed.

    (b) The determination of unsafe execution is now performed immediately
    after (a) rather than later on in the code. The result is stored in
    bprm->unsafe for future reference.

    (c) prepare_binprm() is called, possibly multiple times.

    (i) This applies the result of set[ug]id binaries to the new creds
    attached to bprm->cred. Personality bit clearance is recorded,
    but now deferred on the basis that the exec procedure may yet
    fail.

    (ii) This then calls the new security_bprm_set_creds(). This should
    calculate the new LSM and capability credentials into *bprm->cred.

    This folds together security_bprm_set() and parts of
    security_bprm_apply_creds() (these two have been removed).
    Anything that might fail must be done at this point.

    (iii) bprm->cred_prepared is set to 1.

    bprm->cred_prepared is 0 on the first pass of the security
    calculations, and 1 on all subsequent passes. This allows SELinux
    in (ii) to base its calculations only on the initial script and
    not on the interpreter.

    (d) flush_old_exec() is called to commit the task to execution. This
    performs the following steps with regard to credentials:

    (i) Clear pdeath_signal and set dumpable in certain circumstances that
    may not be covered by commit_creds().

    (ii) Clear any bits in current->personality that were deferred from
    (c.i).

    (e) install_exec_creds() [compute_creds() as was] is called to install the
    new credentials. This performs the following steps with regard to
    credentials:

    (i) Calls security_bprm_committing_creds() to apply any security
    requirements, such as flushing unauthorised files in SELinux, that
    must be done before the credentials are changed.

    This is made up of bits of security_bprm_apply_creds() and
    security_bprm_post_apply_creds(), both of which have been removed.
    This function is not allowed to fail; anything that might fail
    must have been done in (c.ii).

    (ii) Calls commit_creds() to apply the new credentials in a single
    assignment (more or less). Possibly pdeath_signal and dumpable
    should be part of struct creds.

    (iii) Unlocks the task's cred_replace_mutex, thus allowing
    PTRACE_ATTACH to take place.

    (iv) Clears the bprm->cred pointer as the credentials it was holding
    are now immutable.

    (v) Calls security_bprm_committed_creds() to apply any security
    alterations that must be done after the creds have been changed.
    SELinux uses this to flush signals and signal handlers.

    (f) If an error occurs before (d.i), bprm_free() will call abort_creds()
    to destroy the proposed new credentials and will then unlock
    cred_replace_mutex. No changes to the credentials will have been
    made.

    (2) LSM interface.

    A number of functions have been changed, added or removed:

    (*) security_bprm_alloc(), ->bprm_alloc_security()
    (*) security_bprm_free(), ->bprm_free_security()

    Removed in favour of preparing new credentials and modifying those.

    (*) security_bprm_apply_creds(), ->bprm_apply_creds()
    (*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()

    Removed; split between security_bprm_set_creds(),
    security_bprm_committing_creds() and security_bprm_committed_creds().

    (*) security_bprm_set(), ->bprm_set_security()

    Removed; folded into security_bprm_set_creds().

    (*) security_bprm_set_creds(), ->bprm_set_creds()

    New. The new credentials in bprm->cred should be checked and set up
    as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
    second and subsequent calls.

    (*) security_bprm_committing_creds(), ->bprm_committing_creds()
    (*) security_bprm_committed_creds(), ->bprm_committed_creds()

    New. Apply the security effects of the new credentials. This
    includes closing unauthorised files in SELinux. This function may not
    fail. When the former is called, the creds haven't yet been applied
    to the process; when the latter is called, they have.

    The former may access bprm->cred, the latter may not.

    (3) SELinux.

    SELinux has a number of changes, in addition to those to support the LSM
    interface changes mentioned above:

    (a) The bprm_security_struct struct has been removed in favour of using
    the credentials-under-construction approach.

    (c) flush_unauthorized_files() now takes a cred pointer and passes it on
    to inode_has_perm(), file_has_perm() and dentry_open().

    Signed-off-by: David Howells
    Acked-by: James Morris
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    David Howells
     
  • Use RCU to access another task's creds and to release a task's own creds.
    This means that it will be possible for the credentials of a task to be
    replaced without another task (a) requiring a full lock to read them, and (b)
    seeing deallocated memory.

    Signed-off-by: David Howells
    Acked-by: James Morris
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    David Howells
     
  • Wrap current->cred and a few other accessors to hide their actual
    implementation.

    Signed-off-by: David Howells
    Acked-by: James Morris
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    David Howells
     
  • Separate the task security context from task_struct. At this point, the
    security data is temporarily embedded in the task_struct with two pointers
    pointing to it.

    Note that the Alpha arch is altered as it refers to (E)UID and (E)GID in
    entry.S via asm-offsets.

    With comment fixes Signed-off-by: Marc Dionne

    Signed-off-by: David Howells
    Acked-by: James Morris
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    David Howells
     

21 Oct, 2008

1 commit

  • …/git/tip/linux-2.6-tip

    * 'v28-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (36 commits)
    fix documentation of sysrq-q really
    Fix documentation of sysrq-q
    timer_list: add base address to clock base
    timer_list: print cpu number of clockevents device
    timer_list: print real timer address
    NOHZ: restart tick device from irq_enter()
    NOHZ: split tick_nohz_restart_sched_tick()
    NOHZ: unify the nohz function calls in irq_enter()
    timers: fix itimer/many thread hang, fix
    timers: fix itimer/many thread hang, v3
    ntp: improve adjtimex frequency rounding
    timekeeping: fix rounding problem during clock update
    ntp: let update_persistent_clock() sleep
    hrtimer: reorder struct hrtimer to save 8 bytes on 64bit builds
    posix-timers: lock_timer: make it readable
    posix-timers: lock_timer: kill the bogus ->it_id check
    posix-timers: kill ->it_sigev_signo and ->it_sigev_value
    posix-timers: sys_timer_create: cleanup the error handling
    posix-timers: move the initialization of timer->sigq from send to create path
    posix-timers: sys_timer_create: simplify and s/tasklist/rcu/
    ...

    Fix trivial conflicts due to sysrq-q description clashes in
    Documentation/sysrq.txt and drivers/char/sysrq.c

    Linus Torvalds
     

20 Oct, 2008

2 commits

  • Presently a hugepage vma has the VM_RESERVED flag set so that it is not
    swapped. But a VM_RESERVED vma isn't core dumped, because this flag is
    often used for kernel vmas (e.g. vmalloc, sound related).

    Thus hugepages are never dumped and can't be debugged easily. Many
    developers want hugepages to be included in the core dump.

    However, we can't read a generic VM_RESERVED area, because such an area is
    often an I/O mapping and reading it may change device state. That is
    definitely an undesirable side effect.

    So adding a hugepage-specific bit to the coredump filter is better. It
    enables hugepage core dumping without causing any side effects on
    I/O devices.

    In addition, libhugetlb uses hugetlb private mapping pages as anonymous
    pages, so hugepage private mapping pages should be core dumped by
    default.

    So /proc/[pid]/coredump_filter gets two new bits:

    - bit 5: whether hugetlb private mapping pages are dumped (default: yes)
    - bit 6: whether hugetlb shared mapping pages are dumped (default: no)

    I tested it with the following method.

    % ulimit -c unlimited
    % ./crash_hugepage 50
    % ./crash_hugepage 50 -p
    % ls -lh
    % gdb ./crash_hugepage core
    %
    % echo 0x43 > /proc/self/coredump_filter
    % ./crash_hugepage 50
    % ./crash_hugepage 50 -p
    % ls -lh
    % gdb ./crash_hugepage core

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>

    #include "hugetlbfs.h"

    int main(int argc, char** argv){
        char* p;
        int ch;
        int mmap_flags = MAP_SHARED;
        int fd;
        int nr_pages;

        /* -p selects a private mapping instead of the default shared one */
        while((ch = getopt(argc, argv, "p")) != -1) {
            switch (ch) {
            case 'p':
                mmap_flags &= ~MAP_SHARED;
                mmap_flags |= MAP_PRIVATE;
                break;
            default:
                /* nothing*/
                break;
            }
        }
        argc -= optind;
        argv += optind;

        if (argc == 0){
            printf("need # of pages\n");
            exit(1);
        }

        nr_pages = atoi(argv[0]);
        if (nr_pages < 2) {
            printf("nr_pages must be >= 2\n");
            exit(1);
        }

        fd = hugetlbfs_unlinked_fd();
        p = mmap(NULL, nr_pages * gethugepagesize(),
                 PROT_READ|PROT_WRITE, mmap_flags, fd, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            exit(1);
        }

        sleep(2);

        *(p + gethugepagesize()) = 1; /* COW */
        sleep(2);

        /* crash! */
        *(int*)0 = 1;

        return 0;
    }

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Kawai Hidehiro
    Cc: Hugh Dickins
    Cc: William Irwin
    Cc: Adam Litke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • …tp', 'timers/posixtimers' and 'timers/debug' into v28-timers-for-linus

    Thomas Gleixner
     

16 Oct, 2008

1 commit


14 Sep, 2008

1 commit

  • Overview

    This patch reworks the handling of POSIX CPU timers, including the
    ITIMER_PROF, ITIMER_VIRTUAL timers and rlimit handling. It was put together
    with the help of Roland McGrath, the owner and original writer of this code.

    The problem we ran into, and the reason for this rework, has to do with using
    a profiling timer in a process with a large number of threads. It appears
    that the performance of the old implementation of run_posix_cpu_timers() was
    at least O(n*3) (where "n" is the number of threads in a process) or worse.
    Everything is fine with an increasing number of threads until the time taken
    for that routine to run becomes the same as or greater than the tick time, at
    which point things degrade rather quickly.

    This patch fixes bug 9906, "Weird hang with NPTL and SIGPROF."

    Code Changes

    This rework corrects the implementation of run_posix_cpu_timers() to make it
    run in constant time for a particular machine. (Performance may vary between
    one machine and another depending upon whether the kernel is built as single-
    or multiprocessor and, in the latter case, depending upon the number of
    running processors.) To do this, at each tick we now update fields in
    signal_struct as well as task_struct. The run_posix_cpu_timers() function
    uses those fields to make its decisions.

    We define a new structure, "task_cputime," to contain user, system and
    scheduler times and use these in appropriate places:

    struct task_cputime {
            cputime_t utime;
            cputime_t stime;
            unsigned long long sum_exec_runtime;
    };

    This is included in the structure "thread_group_cputime," which is a new
    substructure of signal_struct and which varies for uniprocessor versus
    multiprocessor kernels. For uniprocessor kernels, it uses "task_cputime" as
    a simple substructure, while for multiprocessor kernels it is a pointer:

    /* uniprocessor */
    struct thread_group_cputime {
            struct task_cputime totals;
    };

    /* multiprocessor */
    struct thread_group_cputime {
            struct task_cputime *totals;
    };

    We also add a new task_cputime substructure directly to signal_struct, to
    cache the earliest expiration of process-wide timers, and task_cputime also
    replaces the it_*_expires fields of task_struct (used for earliest expiration
    of thread timers). The "thread_group_cputime" structure contains process-wide
    timers that are updated via account_user_time() and friends. In the non-SMP
    case the structure is a simple aggregator; unfortunately in the SMP case that
    simplicity was not achievable due to cache-line contention between CPUs (in
    one measured case performance was actually _worse_ on a 16-cpu system than
    the same test on a 4-cpu system, due to this contention). For SMP, the
    thread_group_cputime counters are maintained as a per-cpu structure allocated
    using alloc_percpu(). The timer functions update only the timer field in
    the structure corresponding to the running CPU, obtained using per_cpu_ptr().

    We define a set of inline functions in sched.h that we use to maintain the
    thread_group_cputime structure and hide the differences between UP and SMP
    implementations from the rest of the kernel. The thread_group_cputime_init()
    function initializes the thread_group_cputime structure for the given task.
    The thread_group_cputime_alloc() is a no-op for UP; for SMP it calls the
    out-of-line function thread_group_cputime_alloc_smp() to allocate and fill
    in the per-cpu structures and fields. The thread_group_cputime_free()
    function, also a no-op for UP, in SMP frees the per-cpu structures. The
    thread_group_cputime_clone_thread() function (also a UP no-op) for SMP calls
    thread_group_cputime_alloc() if the per-cpu structures haven't yet been
    allocated. The thread_group_cputime() function fills the task_cputime
    structure it is passed with the contents of the thread_group_cputime fields;
    in UP it's that simple, but in SMP it must also safely check that tsk->signal
    is non-NULL (if it is NULL, it just uses the appropriate fields of task_struct)
    and, if it is non-NULL, sums the per-cpu values for each online CPU. Finally, the three
    functions account_group_user_time(), account_group_system_time() and
    account_group_exec_runtime() are used by timer functions to update the
    respective fields of the thread_group_cputime structure.

    Non-SMP operation is trivial and will not be mentioned further.

    The per-cpu structure is always allocated when a task creates its first new
    thread, via a call to thread_group_cputime_clone_thread() from copy_signal().
    It is freed at process exit via a call to thread_group_cputime_free() from
    cleanup_signal().

    All functions that formerly summed utime/stime/sum_sched_runtime values
    from all threads in the thread group now use thread_group_cputime() to
    snapshot the values in the thread_group_cputime structure or the values in
    the task structure itself if the per-cpu structure hasn't been allocated.

    Finally, the code in kernel/posix-cpu-timers.c has changed quite a bit.
    The run_posix_cpu_timers() function has been split into a fast path and a
    slow path; the former safely checks whether there are any expired thread
    timers and, if not, just returns, while the slow path does the heavy lifting.
    With the dedicated thread group fields, timers are no longer "rebalanced" and
    the process_timer_rebalance() function and related code has gone away. All
    summing loops are gone and all code that used them now uses the
    thread_group_cputime() inline. When process-wide timers are set, the new
    task_cputime structure in signal_struct is used to cache the earliest
    expiration; this is checked in the fast path.

    Performance

    The fix appears not to add significant overhead to existing operations. It
    generally performs the same as the current code except in two cases, one in
    which it performs slightly worse (Case 5 below) and one in which it performs
    very significantly better (Case 2 below). Overall it's a wash except in those
    two cases.

    I've since done somewhat more involved testing on a dual-core Opteron system.

    Case 1: With no itimer running, for a test with 100,000 threads, the fixed
    kernel took 1428.5 seconds, 513 seconds more than the unfixed system,
    all of which was spent in the system. There were twice as many
    voluntary context switches with the fix as without it.

    Case 2: With an itimer running at .01 second ticks and 4000 threads (the most
    an unmodified kernel can handle), the fixed kernel ran the test in
    eight percent of the time (5.8 seconds as opposed to 70 seconds) and
    had better tick accuracy (.012 seconds per tick as opposed to .023
    seconds per tick).

    Case 3: A 4000-thread test with an initial timer tick of .01 second and an
    interval of 10,000 seconds (i.e. a timer that ticks only once) had
    very nearly the same performance in both cases: 6.3 seconds elapsed
    for the fixed kernel versus 5.5 seconds for the unfixed kernel.

    With fewer threads (eight in these tests), the Case 1 test ran in essentially
    the same time on both the modified and unmodified kernels (5.2 seconds versus
    5.8 seconds). The Case 2 test ran in about the same time as well, 5.9 seconds
    versus 5.4 seconds but again with much better tick accuracy, .013 seconds per
    tick versus .025 seconds per tick for the unmodified kernel.

    Since the fix affected the rlimit code, I also tested soft and hard CPU limits.

    Case 4: With a hard CPU limit of 20 seconds and eight threads (and an itimer
    running), the modified kernel was very slightly favored in that while
    it killed the process in 19.997 seconds of CPU time (5.002 seconds of
    wall time), only .003 seconds of that was system time, the rest was
    user time. The unmodified kernel killed the process in 20.001 seconds
    of CPU (5.014 seconds of wall time) of which .016 seconds was system
    time. Really, though, the results were too close to call. The results
    were essentially the same with no itimer running.

    Case 5: With a soft limit of 20 seconds and a hard limit of 2000 seconds
    (where the hard limit would never be reached) and an itimer running,
    the modified kernel exhibited worse tick accuracy than the unmodified
    kernel: .050 seconds/tick versus .028 seconds/tick. Otherwise,
    performance was almost indistinguishable. With no itimer running this
    test exhibited virtually identical behavior and times in both cases.

    In times past I did some limited performance testing. Those results are below.

    On a four-cpu Opteron system without this fix, a sixteen-thread test executed
    in 3569.991 seconds, of which user was 3568.435s and system was 1.556s. On
    the same system with the fix, user and elapsed time were about the same, but
    system time dropped to 0.007 seconds. Performance with eight, four and one
    thread were comparable. Interestingly, the timer ticks with the fix seemed
    more accurate: The sixteen-thread test with the fix received 149543 ticks
    for 0.024 seconds per tick, while the same test without the fix received 58720
    for 0.061 seconds per tick. Both cases were configured for an interval of
    0.01 seconds. Again, the other tests were comparable. Each thread in this
    test computed the primes up to 25,000,000.

    I also did a test with a large number of threads, 100,000 threads, which is
    impossible without the fix. In this case each thread computed the primes only
    up to 10,000 (to make the runtime manageable). System time dominated, at
    1546.968 seconds out of a total 2176.906 seconds (giving a user time of
    629.938s). It received 147651 ticks for 0.015 seconds per tick, still quite
    accurate. There is obviously no comparable test without the fix.

    Signed-off-by: Frank Mayhar
    Cc: Roland McGrath
    Cc: Alexey Dobriyan
    Cc: Andrew Morton
    Signed-off-by: Ingo Molnar

    Frank Mayhar
     

27 Jul, 2008

1 commit

  • This moves all the ptrace hooks related to exec into tracehook.h inlines.

    This also lifts the calls for tracing out of the binfmt load_binary hooks
    into search_binary_handler() after it calls into the binfmt module. This
    change has no effect, since all the binfmt modules' load_binary functions
    did the call at the end on success, and now search_binary_handler() does
    it immediately after return if successful. We consolidate the repeated
    code, and binfmt modules no longer need to import ptrace_notify().

    Signed-off-by: Roland McGrath
    Cc: Oleg Nesterov
    Reviewed-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland McGrath
     

26 Jul, 2008

3 commits

  • * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (34 commits)
    powerpc: Wireup new syscalls
    Move update_mmu_cache() declaration from tlbflush.h to pgtable.h
    powerpc/pseries: Remove kmalloc call in handling writes to lparcfg
    powerpc/pseries: Update arch vector to indicate support for CMO
    ibmvfc: Add support for collaborative memory overcommit
    ibmvscsi: driver enablement for CMO
    ibmveth: enable driver for CMO
    ibmveth: Automatically enable larger rx buffer pools for larger mtu
    powerpc/pseries: Verify CMO memory entitlement updates with virtual I/O
    powerpc/pseries: vio bus support for CMO
    powerpc/pseries: iommu enablement for CMO
    powerpc/pseries: Add CMO paging statistics
    powerpc/pseries: Add collaborative memory manager
    powerpc/pseries: Utilities to set firmware page state
    powerpc/pseries: Enable CMO feature during platform setup
    powerpc/pseries: Split retrieval of processor entitlement data into a helper routine
    powerpc/pseries: Add memory entitlement capabilities to /proc/ppc64/lparcfg
    powerpc/pseries: Split processor entitlement retrieval and gathering to helper routines
    powerpc/pseries: Remove extraneous error reporting for hcall failures in lparcfg
    powerpc: Fix compile error with binutils 2.15
    ...

    Fixed up conflict in arch/powerpc/platforms/52xx/Kconfig manually.

    Linus Torvalds
     
  • Kill the nasty rcu_read_lock() + do_each_thread() loop, use the list
    encoded in mm->core_state instead, s/GFP_ATOMIC/GFP_KERNEL/.

    This patch allows further cleanups in binfmt_elf.c, in particular we can
    kill the parallel info->threads list.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • linux_binfmt->core_dump() runs before the process does exit_aio(), this
    means that we can hit the kernel thread which shares the same ->mm.
    Afaics, nothing really bad can happen, but perhaps it makes sense to fix
    this minor bug.

    It is sad we have to iterate over all threads in system and use
    GFP_ATOMIC. Hopefully we can kill these ugly do_each_thread()s, but this
    needs some nontrivial changes in mm_struct and do_coredump.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

25 Jul, 2008

1 commit

  • Some IBM POWER-based platforms have the ability to run in a
    mode which mostly appears to the OS as a different processor from the
    actual hardware. For example, a Power6 system may appear to be a
    Power5+, which makes the AT_PLATFORM value "power5+". This means that
    programs are restricted to the ISA supported by Power5+;
    Power6-specific instructions are treated as illegal.

    However, some applications (virtual machines, optimized libraries) can
    benefit from knowledge of the underlying CPU model. A new aux vector
    entry, AT_BASE_PLATFORM, will denote the actual hardware. For
    example, on a Power6 system in Power5+ compatibility mode, AT_PLATFORM
    will be "power5+" and AT_BASE_PLATFORM will be "power6". The idea is
    that AT_PLATFORM indicates the instruction set supported, while
    AT_BASE_PLATFORM indicates the underlying microarchitecture.

    If the architecture has defined ELF_BASE_PLATFORM, copy that value to
    the user stack in the same manner as ELF_PLATFORM.

    Signed-off-by: Nathan Lynch
    Acked-by: Andrew Morton
    Signed-off-by: Benjamin Herrenschmidt

    Nathan Lynch
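
To illustrate how user space might read the two entries described in the commit above, here is a minimal sketch. It assumes glibc's getauxval() (available since glibc 2.16) and the AT_PLATFORM / AT_BASE_PLATFORM constants from <elf.h>; the helper names auxv_string() and print_platforms() are hypothetical and not part of the patch.

```c
#include <stdio.h>
#include <elf.h>        /* AT_PLATFORM, AT_BASE_PLATFORM */
#include <sys/auxv.h>   /* getauxval(), assumed glibc >= 2.16 */

/* Return the string an auxv entry points at, or the fallback when the
 * kernel did not supply that entry (getauxval() returns 0 in that case). */
static const char *auxv_string(unsigned long type, const char *fallback)
{
	unsigned long val = getauxval(type);

	return val ? (const char *)val : fallback;
}

/* Hypothetical helper: print the ISA-level and hardware-level names. */
static void print_platforms(void)
{
	printf("AT_PLATFORM      = %s\n",
	       auxv_string(AT_PLATFORM, "(absent)"));
	printf("AT_BASE_PLATFORM = %s\n",
	       auxv_string(AT_BASE_PLATFORM, "(absent)"));
}
```

On architectures that do not define ELF_BASE_PLATFORM the second entry is simply missing, so callers should be prepared for the fallback string.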
     

23 Jul, 2008

1 commit

  • The Linux kernel puts the filename argument of execve() into the new
    address space. Many developers are surprised to learn this. Those who
    know and could use it, object "But it's not documented."

    Those who want to use it dislike the expression
    (char *)(1+ strlen(env[-1+ n_env]) + env[-1+ n_env])
    because it requires locating the last original environment variable,
    and assumes that the filename follows the characters.

    This patch documents the insertion of the filename, and makes it easier
    to find by adding a new tag AT_EXECFN in the ElfXX_auxv_t; see .

    In many cases readlink("/proc/self/exe",) gives the same answer. But if
    all the original pages get unmapped, then the kernel erases the symlink
    for /proc/self/exe. This can happen when a program decompressor does a
    good job of cleaning up after uncompressing directly to memory, so that
    the address space of the target program looks the same as if compression
    had never happened. One example is http://upx.sourceforge.net .

    One notable use of the underlying concept (what path containED the
    executable) is glibc expanding $ORIGIN in DT_RUNPATH. In practice for
    the near term, it may be a good idea for user-mode code to use both
    /proc/self/exe and AT_EXECFN as fall-back methods for each other.
    /proc/self/exe can fail due to unmapping, AT_EXECFN can fail because it
    won't be present on non-new systems. The auxvec or {AT_EXECFN}.d_val
    also can get overwritten, although in nearly all cases this would be the
    result of a bug.

    The runtime cost is one NEW_AUX_ENT using two words of stack space. The
    underlying value is maintained already as bprm->exec; setup_arg_pages()
    in fs/exec.c slides it for stack_shift, etc.

    Signed-off-by: John Reiser
    Cc: Roland McGrath
    Cc: Jakub Jelinek
    Cc: Ulrich Drepper
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Reiser
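
The commit above suggests that user-mode code use /proc/self/exe and AT_EXECFN as fall-backs for each other. A rough sketch of that idea follows. It assumes glibc's getauxval(); the helper name exec_path() and the choice to try the symlink first are illustrative assumptions, not something the patch prescribes.

```c
#include <stdio.h>
#include <unistd.h>     /* readlink() */
#include <elf.h>        /* AT_EXECFN */
#include <sys/auxv.h>   /* getauxval(), assumed glibc >= 2.16 */

/* Hypothetical helper: copy the best available guess at the executable
 * path into buf. Returns 0 on success, -1 if neither source worked. */
static int exec_path(char *buf, size_t len)
{
	const char *fn;
	ssize_t n;

	if (len == 0)
		return -1;

	/* First try the symlink; it can fail once the original pages
	 * have been unmapped, as the commit message explains. */
	n = readlink("/proc/self/exe", buf, len - 1);
	if (n > 0) {
		buf[n] = '\0';   /* readlink() does not NUL-terminate */
		return 0;
	}

	/* Fall back to the filename string that execve() was given.
	 * The entry is absent on kernels that predate AT_EXECFN. */
	fn = (const char *)getauxval(AT_EXECFN);
	if (fn && snprintf(buf, len, "%s", fn) < (int)len)
		return 0;

	return -1;
}
```

Note that the two sources need not agree: AT_EXECFN holds the name as it was passed to execve(), which may be a relative path.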
     

17 Jun, 2008

1 commit

  • In commit d20894a23708c2af75966534f8e4dedb46d48db2 ("Remove a.out
    interpreter support in ELF loader"), Andi removed support for a.out
    interpreters from the ELF loader, which was only ever needed for the
    transition from a.out to ELF.

    This removes the last traces of that support, in particular the
    inclusion of .

    Signed-off-by: David Woodhouse
    Acked-by: Peter Korsgaard
    Signed-off-by: Linus Torvalds

    David Woodhouse
     

17 May, 2008

2 commits


29 Apr, 2008

2 commits

  • Fix these sparse warnings:
    fs/binfmt_elf.c:1749:29: warning: symbol 'tmp' shadows an earlier one
    fs/binfmt_elf.c:1734:28: originally declared here
    fs/binfmt_elf.c:2009:26: warning: symbol 'vma' shadows an earlier one
    fs/binfmt_elf.c:1892:24: originally declared here

    [akpm@linux-foundation.org: chose better variable name]
    Signed-off-by: WANG Cong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    WANG Cong
     
  • This patch simplifies the fill_elf_header() function by zeroing
    the whole ELF header first, so that we fill in only the fields
    we really need.

    before:
       text    data     bss     dec     hex filename
      11735      80       0   11815    2e27 fs/binfmt_elf.o

    after:
       text    data     bss     dec     hex filename
      11710      80       0   11790    2e0e fs/binfmt_elf.o

    voila, 25 bytes of text are freed

    Signed-off-by: Cyrill Gorcunov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
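
The "zero first, then fill" pattern from the commit above can be sketched in user space as follows. This is only an illustration under stated assumptions: it uses Elf64_Ehdr from glibc's <elf.h>, the function name fill_core_header() is hypothetical, and the fields set here are not claimed to match the kernel's fill_elf_header() exactly.

```c
#include <string.h>
#include <elf.h>    /* Elf64_Ehdr, Elf64_Phdr, ELFMAG, ET_CORE, ... */

/* Userspace sketch: clear the whole header once, then store only the
 * fields that need a non-zero value. */
static void fill_core_header(Elf64_Ehdr *eh, unsigned short machine,
			     unsigned short segs)
{
	memset(eh, 0, sizeof(*eh));     /* every field starts out as zero */

	memcpy(eh->e_ident, ELFMAG, SELFMAG);
	eh->e_ident[EI_CLASS] = ELFCLASS64;
	eh->e_ident[EI_DATA] = ELFDATA2LSB;   /* little-endian assumed */
	eh->e_ident[EI_VERSION] = EV_CURRENT;

	eh->e_type = ET_CORE;
	eh->e_machine = machine;
	eh->e_version = EV_CURRENT;
	eh->e_phoff = sizeof(Elf64_Ehdr);
	eh->e_ehsize = sizeof(Elf64_Ehdr);
	eh->e_phentsize = sizeof(Elf64_Phdr);
	eh->e_phnum = segs;
	/* e_entry, e_shoff, e_flags and the e_sh* fields stay zero,
	 * so no explicit stores are needed for them. */
}
```

Dropping the per-field zero assignments is presumably where the small text-size saving reported in the commit comes from.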
     

25 Apr, 2008

1 commit

  • * unshare_files() can fail; doing it after irreversible actions is wrong
    and de_thread() is certainly irreversible.
    * since we do it unconditionally anyway, we might as well do it in do_execve()
    and save ourselves the PITA in binfmt handlers, etc.
    * while we are at it, binfmt_som actually leaked files_struct on failure.

    As a side benefit, unshare_files(), put_files_struct() and reset_files_struct()
    become unexported.

    Signed-off-by: Al Viro

    Al Viro
     

05 Mar, 2008

1 commit

  • This makes the user_regset-based core dump code call user_regset writeback
    hooks when available. This is necessary groundwork to allow IA64 to set
    CORE_DUMP_USE_REGSET.

    Cc: Shaohua Li
    Signed-off-by: Roland McGrath
    Cc: "Luck, Tony"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland McGrath
     

09 Feb, 2008

2 commits

  • Following the deprecation schedule, the a.out ELF interpreter support
    is removed with this patch. a.out ELF interpreters were a transition
    feature for moving a.out systems to ELF, but they're unlikely to be still
    needed. Pure a.out systems will still work, of course. This allows
    simplifying the hairy ELF loader.

    Signed-off-by: Andi Kleen
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Suppress A.OUT library support if CONFIG_ARCH_SUPPORTS_AOUT is not set.

    Not all architectures support the A.OUT binfmt, so the ELF binfmt should not
    be permitted to go looking for A.OUT libraries to load in such a case. Not
    only that, but under such conditions A.OUT core dumps are not produced either.

    To make this work, this patch also does the following:

    (1) Makes the existence of the contents of linux/a.out.h contingent on
    CONFIG_ARCH_SUPPORTS_AOUT.

    (2) Renames dump_thread() to aout_dump_thread() as it's only called by A.OUT
    core dumping code.

    (3) Moves aout_dump_thread() into asm/a.out-core.h and makes it inline. This
    is then included only where needed. This means that this bit of arch
    code will be stored in the appropriate A.OUT binfmt module rather than
    the core kernel.

    (4) Drops A.OUT support for Blackfin (according to Mike Frysinger it's not
    needed) and FRV.

    This patch depends on the previous patch to move STACK_TOP[_MAX] out of
    asm/a.out.h and into asm/processor.h as they're required whether or not A.OUT
    format is available.

    [jdike@addtoit.com: uml: re-remove accidentally restored code]
    Signed-off-by: David Howells
    Cc:
    Signed-off-by: Jeff Dike
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     

07 Feb, 2008

1 commit

  • based on similar patch from: Pavel Machek

    Introduce CONFIG_COMPAT_BRK. If disabled, the kernel is free
    (but not obliged) to randomize the brk area.

    Heap randomization breaks ancient binaries, so we keep COMPAT_BRK
    enabled by default.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

04 Feb, 2008

1 commit


30 Jan, 2008

4 commits

  • ibcs2 has never been supported on 2.6 kernels as far as I know,
    and if it has, it must have been through an external patch. Anyway, if
    anybody applies an external patch, they could just as well re-add the
    ibcs checking code to the ELF loader in the same patch. But there is no
    reason to keep this code running in all Linux kernels. This will save at
    least two strcmps on each ELF execution.

    No deprecation period because it could not have been used anyway.

    Signed-off-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner

    Andi Kleen
     
  • This modifies the ELF core dump code under #ifdef CORE_DUMP_USE_REGSET.
    It changes nothing when this macro is not defined. When it's #define'd
    by some arch header (e.g. asm/elf.h), the arch must support the
    user_regset (linux/regset.h) interface for reading thread state.

    This provides an alternate version of note segment writing that is based
    purely on the user_regset interfaces. When CORE_DUMP_USE_REGSET is set,
    the arch need not define macros such as ELF_CORE_COPY_REGS and ELF_ARCH.
    All that information is taken from the user_regset data structures.
    The core dumps come out exactly the same if the arch's definitions for
    its user_regset details are correct.

    Signed-off-by: Roland McGrath
    Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner

    Roland McGrath
     
  • This pulls out the code for writing the notes segment of an ELF core dump
    into separate functions. This cleanly isolates into one cluster of
    functions everything that deals with the note formats and the hooks into
    arch code to fill them. The top-level elf_core_dump function itself now
    deals purely with the generic ELF format and the memory segments.

    This only moves code around into functions that can be inlined away.
    It should not change any behavior at all.

    Signed-off-by: Roland McGrath
    Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner

    Roland McGrath
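
For readers unfamiliar with the note format that the commit above isolates, the following sketch computes how many bytes a single note occupies in the notes segment. It assumes the layout used in Linux core files (a three-word header, then the NUL-terminated name and the payload, each padded to 4 bytes); align4() and note_size() are hypothetical helper names, not the kernel's functions.

```c
#include <stddef.h>
#include <string.h>
#include <elf.h>    /* Elf64_Nhdr */

/* Round n up to the next multiple of 4 (assumed note alignment). */
static size_t align4(size_t n)
{
	return (n + 3) & ~(size_t)3;
}

/* Hypothetical helper: bytes one note takes up in the notes segment. */
static size_t note_size(const char *name, size_t descsz)
{
	return sizeof(Elf64_Nhdr)           /* n_namesz, n_descsz, n_type */
	     + align4(strlen(name) + 1)     /* name plus NUL, padded */
	     + align4(descsz);              /* payload, padded */
}
```

For example, a note named "CORE" carrying a 336-byte payload would take 12 + 8 + 336 = 356 bytes under these assumptions.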
     
  • #39: FILE: arch/ia64/ia32/binfmt_elf32.c:229:
    +elf32_map (struct file *filep, unsigned long addr, struct elf_phdr *eppnt, int prot, int type, unsigned long unused)

    WARNING: no space between function name and open parenthesis '('
    #39: FILE: arch/ia64/ia32/binfmt_elf32.c:229:
    +elf32_map (struct file *filep, unsigned long addr, struct elf_phdr *eppnt, int prot, int type, unsigned long unused)

    WARNING: line over 80 characters
    #67: FILE: arch/x86/kernel/sys_x86_64.c:80:
    + new_begin = randomize_range(*begin, *begin + 0x02000000, 0);

    ERROR: use tabs not spaces
    #110: FILE: arch/x86/kernel/sys_x86_64.c:185:
    + ^I mm->cached_hole_size = 0;$

    ERROR: use tabs not spaces
    #111: FILE: arch/x86/kernel/sys_x86_64.c:186:
    + ^I^Imm->free_area_cache = mm->mmap_base;$

    ERROR: use tabs not spaces
    #112: FILE: arch/x86/kernel/sys_x86_64.c:187:
    + ^I}$

    ERROR: use tabs not spaces
    #141: FILE: arch/x86/kernel/sys_x86_64.c:216:
    + ^I^I/* remember the largest hole we saw so far */$

    ERROR: use tabs not spaces
    #142: FILE: arch/x86/kernel/sys_x86_64.c:217:
    + ^I^Iif (addr + mm->cached_hole_size < vma->vm_start)$

    ERROR: use tabs not spaces
    #143: FILE: arch/x86/kernel/sys_x86_64.c:218:
    + ^I^I mm->cached_hole_size = vma->vm_start - addr;$

    ERROR: use tabs not spaces
    #157: FILE: arch/x86/kernel/sys_x86_64.c:232:
    + ^Imm->free_area_cache = TASK_UNMAPPED_BASE;$

    ERROR: need a space before the open parenthesis '('
    #291: FILE: arch/x86/mm/mmap_64.c:101:
    + } else if(mmap_is_legacy()) {

    WARNING: braces {} are not necessary for single statement blocks
    #302: FILE: arch/x86/mm/mmap_64.c:112:
    + if (current->flags & PF_RANDOMIZE) {
    + mm->mmap_base += ((long)rnd) << PAGE_SHIFT;
    + }

    WARNING: line over 80 characters
    #314: FILE: fs/binfmt_elf.c:48:
    +static unsigned long elf_map (struct file *, unsigned long, struct elf_phdr *, int, int, unsigned long);

    WARNING: no space between function name and open parenthesis '('
    #314: FILE: fs/binfmt_elf.c:48:
    +static unsigned long elf_map (struct file *, unsigned long, struct elf_phdr *, int, int, unsigned long);

    WARNING: line over 80 characters
    #429: FILE: fs/binfmt_elf.c:438:
    + eppnt, elf_prot, elf_type, total_size);

    ERROR: need space after that ',' (ctx:VxV)
    #480: FILE: fs/binfmt_elf.c:939:
    + elf_prot, elf_flags,0);
    ^

    total: 9 errors, 7 warnings, 461 lines checked
    Your patch has style problems, please review. If any of these errors
    are false positives report them to the maintainer, see
    CHECKPATCH in MAINTAINERS.

    Please run checkpatch prior to sending patches

    Cc: "Luck, Tony"
    Cc: Arjan van de Ven
    Cc: Jakub Jelinek
    Cc: Jiri Kosina
    Cc: KAMEZAWA Hiroyuki
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner

    Andrew Morton