01 Oct, 2006

2 commits

  • Using the infrastructure created in previous patches implement support to
    pipe core dumps into programs.

    This is done by overloading the existing core_pattern sysctl
    with a new syntax:

    |program

    When the first character of the pattern is a '|' the kernel will instead
    treat the rest of the pattern as a command to run. The core dump will be
    written to the standard input of that program instead of to a file.

    This is useful for having automatic core dump analysis without filling up
    disks. The program can do some simple analysis and save only a summary of
    the core dump.
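    A pipe handler of the kind described here could start out like the sketch below. It assumes a handler that only records whether the stream begins with the ELF magic and how many bytes arrived. The names summarize_core and struct core_summary are hypothetical, not part of any kernel or library API. Because the input is a pipe, the code reads strictly sequentially and never seeks.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical summary a core pipe handler might keep instead of the dump. */
struct core_summary {
    int is_elf;                /* first four bytes were "\177ELF" */
    unsigned long long bytes;  /* total size of the dump read from the stream */
};

/*
 * Read a core dump from a stream such as stdin, strictly sequentially
 * (a pipe cannot seek), keeping only a small summary.
 * Returns 0 on success, -1 on a read error.
 */
static int summarize_core(FILE *in, struct core_summary *out)
{
    unsigned char buf[4096];
    size_t n;

    assert(in != NULL && out != NULL);
    out->is_elf = 0;
    out->bytes = 0;
    /* fread() only returns a short count at EOF or on error */
    while ((n = fread(buf, 1, sizeof(buf), in)) > 0) {
        if (out->bytes == 0)
            out->is_elf = n >= 4 && memcmp(buf, "\177ELF", 4) == 0;
        out->bytes += n;
    }
    return ferror(in) ? -1 : 0;
}
```

    A handler's main() would call summarize_core(stdin, &s) and append s to a log, so the dump itself never touches the disk.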

    The core dump process will run with the privileges and in the name space of
    the process that caused the core dump.

    I also increased the core pattern size to 128 bytes so that longer command
    lines fit.
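    The dispatch on the first character of the pattern can be modelled in userspace as follows. This is a sketch, not the kernel's code: classify_core_pattern and CORE_PATTERN_MAX are hypothetical names, with the limit taken from the 128 bytes mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical limit mirroring the 128-byte core_pattern buffer. */
#define CORE_PATTERN_MAX 128

enum core_target { CORE_TO_FILE, CORE_TO_PIPE, CORE_PATTERN_TOO_LONG };

/*
 * Decide where a core dump goes. For "|cmd" the command line (the text
 * after the '|') is returned in *cmd; otherwise *cmd is the file pattern.
 */
static enum core_target classify_core_pattern(const char *pattern,
                                              const char **cmd)
{
    assert(pattern != NULL && cmd != NULL);
    /* the pattern must fit in the buffer including the trailing NUL */
    if (strlen(pattern) >= CORE_PATTERN_MAX)
        return CORE_PATTERN_TOO_LONG;
    if (pattern[0] == '|') {
        *cmd = pattern + 1;
        return CORE_TO_PIPE;
    }
    *cmd = pattern;
    return CORE_TO_FILE;
}
```

    With this syntax, writing a line such as |/usr/local/bin/core-handler (a hypothetical path) to /proc/sys/kernel/core_pattern selects the pipe, while a pattern like core.%p keeps the file behaviour.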

    Most of the changes come from allowing core dumps without seeks. They are
    fairly straightforward though.
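    The core of "dumping without seeks" is that a skip over an unmapped or empty range must become explicit zero bytes when the output is a pipe. A minimal sketch of that idea, assuming a stdio stream and a caller that already knows whether the output is seekable (dump_skip is a hypothetical name, not the kernel's function):

```c
#include <assert.h>
#include <stdio.h>

/*
 * Advance the output position by 'off' bytes. A regular file can simply
 * seek and leave a hole; a pipe cannot, so zeros are written instead.
 * Returns 0 on success, -1 on error.
 */
static int dump_skip(FILE *out, long off, int seekable)
{
    static const char zeros[512];   /* static storage: all bytes are zero */

    assert(off >= 0);
    if (seekable)
        return fseek(out, off, SEEK_CUR) == 0 ? 0 : -1;
    while (off > 0) {
        size_t chunk = off > (long)sizeof(zeros) ? sizeof(zeros) : (size_t)off;
        if (fwrite(zeros, 1, chunk, out) != chunk)
            return -1;
        off -= (long)chunk;
    }
    return 0;
}
```

    The non-seekable branch is also why holes cost real space over a pipe: every skipped byte is transmitted.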

    One small incompatibility is that if someone previously had a core pattern
    that started with '|', they will suddenly get new behaviour. I think that's
    unlikely to be a real problem though.

    Additional background:

    > Very nice, do you happen to have a program that can accept this kind of
    > input for crash dumps? I'm guessing that the embedded people will
    > really want this functionality.

    I had a cheesy demo/prototype. Basically it wrote the dump to a file again,
    ran gdb on it to get a backtrace and wrote the summary to a shared directory.
    Then there was a simple CGI script to generate a "top 10" crashes HTML
    listing.

    Unfortunately this still had the disadvantage of needing full disk space for a
    dump, only to delete it afterwards (in fact it was worse, because holes didn't
    work over the pipe, so a holey address map would require even more space).

    Fortunately gdb seems to be happy to handle /proc/pid/fd/xxx input pipes as
    cores (at least it worked with zsh's =(cat core) syntax), so it would
    likely be possible to do it without temporary space with a simple wrapper that
    calls it in the right way. I ran out of time before doing that though.

    The demo prototype scripts weren't very good. If there is really interest I
    can dig them out (they are currently on a laptop disk on the desk with the
    laptop itself being in service), but I would recommend rewriting them for any
    serious application of this and fixing the disk space problem.

    Also to be really useful it should probably find a way to automatically fetch
    the debuginfos (I cheated and just installed them in advance). If nobody else
    does it I can probably do the rewrite myself again at some point.

    My hope at some point was that desktops would support it in their builtin
    crash reporters, but at least the KDE people I talked to seemed to be happy
    with their user space only solution.

    Alan sayeth:

    I don't believe that piping as such is necessarily the right model, but
    the ability to intercept and process core dumps from user space is asked
    for by many enterprise users as well. They want to know about, capture,
    analyse and process core dumps, often centrally and in automated form.

    [akpm@osdl.org: loff_t != unsigned long]
    Signed-off-by: Andi Kleen
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Create a new header file, fs/internal.h, for common definitions local to the
    sources in the fs/ directory.

    Move extern definitions that should be in header files from fs/*.c to
    fs/internal.h or other main header files where they span directories.

    Signed-Off-By: David Howells
    Signed-off-by: Jens Axboe

    David Howells
     

30 Sep, 2006

2 commits

  • do_each_thread() is rcu-safe, and all tasks which use this ->mm must sleep
    in wait_for_completion(&mm->core_done) at this point, so we can use RCU
    locks.

    Also, remove unneeded INIT_LIST_HEAD(new) before list_add(new, head).
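    INIT_LIST_HEAD(new) is redundant because list_add() overwrites both of new's pointers anyway. The sketch below is a userspace copy modelled on the kernel's circular list helpers; init_list_head is a hypothetical lowercase stand-in for INIT_LIST_HEAD.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace copy of a circular doubly linked list. */
struct list_head { struct list_head *next, *prev; };

static void init_list_head(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

/* Insert 'new' right after 'head'. Both of new's pointers are written
 * here, so initializing 'new' beforehand changes nothing. */
static void list_add(struct list_head *new, struct list_head *head)
{
    struct list_head *next = head->next;

    next->prev = new;
    new->next = next;
    new->prev = head;
    head->next = new;
}
```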

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Fix a race in put_files_struct() between exec and proc. Restoring ->files
    on current in the error path may leave proc holding a pointer to an
    already kfree()d files_struct.

    ->files changes in exit.c and kthread.c are safe, as exit_files() does
    everything under the lock.

    Found during OpenVZ stress testing.

    [akpm@osdl.org: add export]
    Signed-off-by: Pavel Emelianov
    Signed-off-by: Kirill Korotaev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Korotaev
     

27 Sep, 2006

1 commit

  • * 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6: (225 commits)
    [PATCH] Don't set calgary iommu as default y
    [PATCH] i386/x86-64: New Intel feature flags
    [PATCH] x86: Add a cumulative thermal throttle event counter.
    [PATCH] i386: Make the jiffies compares use the 64bit safe macros.
    [PATCH] x86: Refactor thermal throttle processing
    [PATCH] Add 64bit jiffies compares (for use with get_jiffies_64)
    [PATCH] Fix unwinder warning in traps.c
    [PATCH] x86: Allow disabling early pci scans with pci=noearly or disallowing conf1
    [PATCH] x86: Move direct PCI scanning functions out of line
    [PATCH] i386/x86-64: Make all early PCI scans dependent on CONFIG_PCI
    [PATCH] Don't leak NT bit into next task
    [PATCH] i386/x86-64: Work around gcc bug with noreturn functions in unwinder
    [PATCH] Fix some broken white space in ia32_signal.c
    [PATCH] Initialize argument registers for 32bit signal handlers.
    [PATCH] Remove all traces of signal number conversion
    [PATCH] Don't synchronize time reading on single core AMD systems
    [PATCH] Remove outdated comment in x86-64 mmconfig code
    [PATCH] Use string instructions for Core2 copy/clear
    [PATCH] x86: - restore i8259A eoi status on resume
    [PATCH] i386: Split multi-line printk in oops output.
    ...

    Linus Torvalds
     

26 Sep, 2006

2 commits


11 Jul, 2006

1 commit


04 Jul, 2006

1 commit

  • Fix check for bad address; use macro instead of open-coding two checks.

    Taken from RHEL4 kernel update.

    From: Ernie Petrides

    For background, the BAD_ADDR() macro should return TRUE if the address is
    TASK_SIZE or above, because TASK_SIZE is the lowest address that is *not* valid for
    user-space mappings. The macro was correct in binfmt_aout.c but was wrong
    for the "equal to" case in binfmt_elf.c. There were two in-line validations
    of user-space addresses in binfmt_elf.c, which have been appropriately
    converted to use the corrected BAD_ADDR() macro in the patch you posted
    yesterday. Note that the size checks against TASK_SIZE are okay as coded.
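    The off-by-one is easiest to see side by side. The TASK_SIZE value below is an assumption (a 3 GiB user/kernel split) chosen only for illustration; BAD_ADDR_BUGGY is a hypothetical name for the strict comparison that mishandles the "equal to" case.

```c
#include <assert.h>

/* Hypothetical value for illustration: a 3 GiB user/kernel split. */
#define TASK_SIZE 0xC0000000UL

/* TASK_SIZE itself is the first address user space may NOT map,
 * so the test has to be >=. */
#define BAD_ADDR(x) ((unsigned long)(x) >= TASK_SIZE)

/* The strict form lets exactly TASK_SIZE slip through. */
#define BAD_ADDR_BUGGY(x) ((unsigned long)(x) > TASK_SIZE)
```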

    The additional changes that I propose are below. These are in the error
    paths for bad ELF entry addresses once load_elf_binary() has already
    committed to exec'ing the new image (following the tearing down of the
    task's original address space).

    The 1st hunk deals with the interp-side of the outer "if". There were two
    problems here. The printk() should be removed because this path can be
    triggered at will by a bogus interpreter image created and used by a
    malicious user. Further, the error code should not be ENOEXEC, because that
    causes the loop in search_binary_handler() to continue trying other exec
    handlers (twice, in fact). But it's too late for this to work correctly,
    because the user address space has already been torn down, and an exec()
    failure cannot be returned to the user code because the code no longer
    exists. The only recovery is to force a SIGSEGV, but it's best to terminate
    the search loop immediately. I somewhat arbitrarily chose EINVAL as a
    fallback error code, but any error returned by load_elf_interp() will
    override that (but this value will never be seen by user-space).

    The 2nd hunk deals with the non-interp-side of the outer "if". There were
    two problems here as well. The SIGSEGV needs to be forced, because a prior
    sigaction() syscall might have set the associated disposition to SIG_IGN.
    And the ENOEXEC should be changed to EINVAL as described above.
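    Why the error code matters can be shown with a simplified model of the handler search loop: -ENOEXEC means "not my format, try the next handler", and any other error ends the search at once. This is a sketch under that assumption, not search_binary_handler() itself; search_handlers and both handler functions are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef int (*load_fn)(void);

/*
 * Try each handler in turn. Keep going only while handlers answer
 * -ENOEXEC; *tried reports how many were called.
 */
static int search_handlers(load_fn *handlers, size_t n, size_t *tried)
{
    int ret = -ENOEXEC;

    *tried = 0;
    for (size_t i = 0; i < n; i++) {
        (*tried)++;
        ret = handlers[i]();
        if (ret != -ENOEXEC)
            break;
    }
    return ret;
}

/* Hypothetical handlers for illustration. */
static int not_my_format(void) { return -ENOEXEC; }
static int committed_but_failed(void) { return -EINVAL; }
```

    A loader that has already torn down the old address space returns -EINVAL, so no further handler is consulted.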

    Signed-off-by: Chuck Ebbert
    Signed-off-by: Ernie Petrides
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chuck Ebbert
     

23 Jun, 2006

3 commits

  • Remove redundant casts from NEW_AUX_ENT() arguments in fs/binfmt_elf.c

    Signed-off-by: Jesper Juhl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Juhl
     
  • Do a CodingStyle cleanup of fs/binfmt_elf.c and also remove some pointless
    casts of kmalloc() return values in the same file.

    Signed-off-by: Jesper Juhl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Juhl
     
  • This patch removes the steal_locks() function.

    steal_locks() doesn't work correctly with any filesystem that does its own
    lock management, including NFS, CIFS, etc.

    In addition it has weird semantics on local filesystems in case tasks
    sharing file-descriptor tables are doing POSIX locking operations in
    parallel to execve().

    The steal_locks() function has an effect on applications doing:

    clone(CLONE_FILES)
    /* in child */
    lock
    execve
    lock

    POSIX locks acquired before execve (by "child", "parent" or any further
    task sharing files_struct) will after the execve be owned exclusively by
    "child".

    According to Chris Wright, some LSB/LTP-style test suite fails without the
    stealing behavior, but there's no known real-world application that would
    also fail.

    Apps using NPTL are not affected, since all other threads are killed before
    execve.

    Apps using LinuxThreads are only affected if they

    - have multiple threads during exec (LinuxThreads doesn't kill other
    threads, the app may do it with pthread_kill_other_threads_np())
    - rely on POSIX locks being inherited across exec

    Both conditions are documented, but not their interaction.

    Apps using clone() natively are affected if they

    - use clone(CLONE_FILES)
    - rely on POSIX locks being inherited across exec

    The above scenarios are unlikely, but possible.

    If the patch is vetoed, there's a plan B that involves mostly keeping the
    weird stealing semantics, but changing the way lock ownership is handled so
    that network and local filesystems work consistently.

    That would add more complexity though, so this solution seems to be
    preferred by most people.

    Signed-off-by: Miklos Szeredi
    Cc: Trond Myklebust
    Cc: Matthew Wilcox
    Cc: Chris Wright
    Cc: Christoph Hellwig
    Cc: Steven French
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

26 Mar, 2006

3 commits


27 Feb, 2006

1 commit


15 Jan, 2006

1 commit


11 Jan, 2006

2 commits


09 Jan, 2006

2 commits

  • configurable support for ELF core dumps

    text     data    bss     dec      hex     filename
    3330172  529036  190556  4049764  3dcb64  vmlinux-baseline
    3325552  528912  190556  4045020  3db8dc  vmlinux-no-elf

    add/remove: 0/8 grow/shrink: 0/0 up/down: 0/-4424 (-4424)
    function                 old   new   delta
    fill_note                 32     -     -32
    maydump                   58     -     -58
    dump_seek                 67     -     -67
    writenote                180     -    -180
    elf_dump_thread_status   274     -    -274
    fill_psinfo              308     -    -308
    fill_prstatus            466     -    -466
    elf_core_dump           3039     -   -3039
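    A common kernel pattern for making such a feature optional is a compile-time guard that leaves the handler pointer NULL when the option is off, with callers checking for NULL. The sketch assumes the option is named CONFIG_ELF_CORE; struct binfmt_sketch and try_core_dump are hypothetical, trimmed stand-ins for the real binary-format structure and its caller.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily trimmed stand-in for a binary-format handler. */
struct binfmt_sketch {
    const char *name;
    int (*core_dump)(long signr);
};

#ifdef CONFIG_ELF_CORE
static int elf_core_dump(long signr)
{
    (void)signr;
    return 1;   /* a real implementation would write the dump here */
}
#else
/* Compiled out: the handler slot is simply left empty. */
#define elf_core_dump NULL
#endif

static struct binfmt_sketch elf_format = {
    .name = "elf",
    .core_dump = elf_core_dump,
};

/* The caller must tolerate a missing handler. */
static int try_core_dump(const struct binfmt_sketch *fmt, long signr)
{
    if (fmt == NULL || fmt->core_dump == NULL)
        return 0;   /* no core dump support: nothing written */
    return fmt->core_dump(signr);
}
```

    Built without CONFIG_ELF_CORE, none of the dump helpers are compiled, which is where the -4424 bytes in the table come from.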

    Signed-off-by: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
     
  • mmap() returns -EINVAL if given a zero length, and thus elf_map() in
    binfmt_elf.c does likewise if it attempts to map a (page-aligned) ELF
    segment with zero filesize. Such a situation never arises with the default
    linker scripts, but there's nothing inherently wrong with zero-filesize
    (but non-zero memsize) ELF segments. Custom linker scripts can generate
    them, and the kernel should be able to map them; this patch makes it so.
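    The fix amounts to not calling mmap() at all when the file-backed part of a segment is empty. The userspace sketch below assumes anonymous memory as a stand-in for the file mapping; map_segment is a hypothetical name, not elf_map() itself.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map the file-backed part of a segment. A zero filesize is legal (the
 * segment may be all bss), so skip mmap() instead of failing with EINVAL.
 * Returns 0 on success, -1 on failure; *mapped receives the mapping or NULL.
 */
static int map_segment(size_t filesz, void **mapped)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = (filesz + page - 1) & ~(page - 1);   /* round up to pages */

    *mapped = NULL;
    if (len == 0)
        return 0;   /* nothing file-backed to map */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    *mapped = p;
    return 0;
}
```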

    Signed-off-by: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
     

07 Nov, 2005

1 commit

  • This is the fs/ part of the big kfree cleanup patch.

    Remove pointless checks for NULL prior to calling kfree() in fs/.
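    The cleanup relies on kfree(NULL) being a no-op, just as free(NULL) is in ISO C. A userspace analogue of the before/after shape, with struct buf and both release functions as hypothetical names:

```c
#include <assert.h>
#include <stdlib.h>

struct buf { char *data; };

/* Before: a redundant guard. */
static void release_old(struct buf *b)
{
    if (b->data)
        free(b->data);
    b->data = NULL;
}

/* After: free(NULL) is defined to do nothing, so the guard goes. */
static void release_new(struct buf *b)
{
    free(b->data);
    b->data = NULL;
}
```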

    Signed-off-by: Jesper Juhl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Juhl
     

31 Oct, 2005

1 commit

  • task_struct is an internal structure to the kernel with a lot of good
    information, that is probably interesting in core dumps. However there is
    no way for user space to know what format that information is in, making it
    useless.

    I grepped the GDB 6.3 source code, and NT_TASKSTRUCT, while defined, is not
    used anywhere else. So I would be surprised if anyone notices it is
    missing.

    In addition exporting kernel pointers to all the interesting kernel data
    structures sounds like the very definition of an information leak. I
    haven't a clue what someone with evil intentions could do with that
    information, but in any attack against the kernel it looks like this is the
    perfect tool for aiming that attack.

    So since NT_TASKSTRUCT is useless as currently defined and is potentially
    dangerous, let's just not export it.

    (akpm: Daniel Jacobowitz "would be amazed" if anything was
    using NT_TASKSTRUCT).

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

30 Oct, 2005

1 commit

  • How is anon_rss initialized? In dup_mmap, and by mm_alloc's memset; but
    that's not so good if an mm_counter_t is a special type. And how is rss
    initialized? By set_mm_counter, all over the place. Come on, we just need to
    initialize them both at once by set_mm_counter in mm_init (which follows the
    memcpy when forking).

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

12 Oct, 2005

1 commit

  • Nir Tzachar points out that if an ELF file specifies a
    zero-length bss at a whacky address, we cannot load that binary because
    padzero() tries to zero out the end of the page at the whacky address, and
    that may not be writeable.

    See also http://bugzilla.kernel.org/show_bug.cgi?id=5411

    So teach load_elf_binary() to skip the bss setting altogether if the elf file
    has a zero-length bss segment.
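    A simplified model of the check, borrowing the variable names elf_bss and elf_brk for the start and end of the bss: the page tail is zeroed only when the bss is non-empty, so a zero-length bss at an odd address never touches that page. setup_bss_tail, PAGE_SZ and the buffer standing in for the page are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SZ 4096u   /* hypothetical page size for the sketch */

/*
 * 'page' stands in for the page holding elf_bss. Zero from elf_bss to
 * the end of that page, but only if the bss is non-empty.
 * Returns 1 if zeroing happened, 0 if it was skipped.
 */
static int setup_bss_tail(unsigned char *page, size_t elf_bss, size_t elf_brk)
{
    size_t off = elf_bss % PAGE_SZ;

    if (elf_bss == elf_brk)
        return 0;            /* zero-length bss: leave the page alone */
    if (off != 0)
        memset(page + off, 0, PAGE_SZ - off);
    return 1;
}
```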

    Cc: Roland McGrath
    Cc: Daniel Jacobowitz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    akpm@osdl.org
     

22 Jun, 2005

1 commit

  • Ingo recently introduced a great speedup for allocating new mmaps using the
    free_area_cache pointer which boosts the specweb SSL benchmark by 4-5% and
    causes huge performance increases in thread creation.

    The downside of this patch is that it does lead to fragmentation in the
    mmap-ed areas (visible via /proc/self/maps), such that some applications
    that work fine under 2.4 kernels quickly run out of memory on any 2.6
    kernel.

    The problem is twofold:

    1) the free_area_cache is used to continue a search for memory where
    the last search ended. Before the change new areas were always
    searched from the base address on.

    So now new small areas are cluttering holes of all sizes
    throughout the whole mmap-able region whereas before small holes
    tended to close holes near the base leaving holes far from the base
    large and available for larger requests.

    2) the free_area_cache also is set to the location of the last
    munmap-ed area so in scenarios where we allocate e.g. five regions of
    1K each, then free regions 4 2 3 in this order the next request for 1K
    will be placed in the position of the old region 3, whereas before we
    appended it to the still active region 1, placing it at the location
    of the old region 2. Before we had 1 free region of 2K, now we only
    get two free regions of 1K -> fragmentation.

    The patch addresses these issues by introducing yet another cache descriptor
    cached_hole_size that contains the largest known hole size below the
    current free_area_cache. If a new request comes in the size is compared
    against the cached_hole_size and if the request can be filled with a hole
    below free_area_cache the search is started from the base instead.
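    The cached_hole_size idea can be tried on a toy address space. The sketch below is a simplified model, not the kernel's allocator: regions live in a small sorted array, addresses are abstract units starting at 0, and as a shortcut (an assumption) unmapping below free_area_cache marks the hole size as unknown, which forces the next search back to the base. toy_mm, toy_get_area and toy_unmap are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SPACE_LIMIT 100u   /* hypothetical size of the mmap-able region */
#define MAX_REGIONS 32

struct region { size_t start, end; };   /* [start, end) */

struct toy_mm {
    struct region r[MAX_REGIONS];   /* kept sorted by start */
    size_t n;
    size_t free_area_cache;    /* where the last search ended */
    size_t cached_hole_size;   /* largest known hole below the cache */
};

static void insert_region(struct toy_mm *mm, size_t start, size_t len)
{
    size_t i = mm->n;

    assert(mm->n < MAX_REGIONS);
    while (i > 0 && mm->r[i - 1].start > start) {
        mm->r[i] = mm->r[i - 1];
        i--;
    }
    mm->r[i].start = start;
    mm->r[i].end = start + len;
    mm->n++;
}

/* First region ending above addr, or NULL. */
static const struct region *next_region(const struct toy_mm *mm, size_t addr)
{
    for (size_t i = 0; i < mm->n; i++)
        if (mm->r[i].end > addr)
            return &mm->r[i];
    return NULL;
}

/* Returns the start of a new region of 'len' units, or SIZE_MAX. */
static size_t toy_get_area(struct toy_mm *mm, size_t len)
{
    size_t addr, start;

    if (len <= mm->cached_hole_size) {
        addr = 0;                  /* a big enough hole lies below: use it */
        mm->cached_hole_size = 0;
    } else {
        addr = mm->free_area_cache;
    }
    start = addr;

    for (;;) {
        const struct region *vma = next_region(mm, addr);

        if (addr + len > SPACE_LIMIT) {
            if (start != 0) {      /* retry once from the base */
                addr = start = 0;
                mm->cached_hole_size = 0;
                continue;
            }
            return SIZE_MAX;
        }
        if (vma == NULL || addr + len <= vma->start) {
            mm->free_area_cache = addr + len;
            insert_region(mm, addr, len);
            return addr;
        }
        /* remember the largest hole skipped on the way */
        if (addr + mm->cached_hole_size < vma->start)
            mm->cached_hole_size = vma->start - addr;
        addr = vma->end;
    }
}

static void toy_unmap(struct toy_mm *mm, size_t start)
{
    for (size_t i = 0; i < mm->n; i++) {
        if (mm->r[i].start == start) {
            for (size_t j = i + 1; j < mm->n; j++)
                mm->r[j - 1] = mm->r[j];
            mm->n--;
            break;
        }
    }
    /* simplification: a new hole below the cache has unknown size */
    if (start < mm->free_area_cache)
        mm->cached_hole_size = SIZE_MAX;
}
```

    For example, after five 1-unit allocations at 0 to 4 and unmapping the regions at 1 and 2, the next 1-unit request lands at 1, the lowest hole; a following 5-unit request goes to 5, and the 1-unit request after that fills the remaining hole at 2.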

    The results look promising. 2.6.12-rc4 fragments quickly: my (earlier
    posted) leakme.c test program terminates after 50000+ iterations with 96
    distinct and fragmented maps in /proc/self/maps. It does, however, perform
    nicely (as expected) with thread creation: Ingo's test_str02 with 20000
    threads requires 0.7s system time.

    Taking out Ingo's patch (un-patch available per request) by basically
    deleting all mentions of free_area_cache from the kernel and starting the
    search for new memory always at the respective bases we observe: leakme
    terminates successfully with 11 distinctive hardly fragmented areas in
    /proc/self/maps, but thread creation is grindingly slow: 30+s(!) system
    time for Ingo's test_str02 with 20000 threads.

    Now - drumroll ;-) the appended patch works fine with leakme: it ends with
    only 7 distinct areas in /proc/self/maps and also thread creation seems
    sufficiently fast with 0.71s for 20000 threads.

    Signed-off-by: Wolfgang Wander
    Credit-to: "Richard Purdie"
    Signed-off-by: Ken Chen
    Acked-by: Ingo Molnar (partly)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wolfgang Wander
     

17 Jun, 2005

1 commit

  • The ELF core dump code has one use of off_t when writing out segments.
    Some of the segments may lie past the 2GB limit of an off_t, even on a
    32-bit system, so it's important to use loff_t instead. This fixes a
    corrupted core dump in the bigcore test in GDB's testsuite.
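    The arithmetic behind the bug: once the running file offset passes 2^31 - 1 it no longer fits in a 32-bit signed off_t. The sketch assumes a 32-bit off_t and a 64-bit loff_t, and uses int64_t to accumulate offsets; fits_in_off32 and end_offset are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Would this file offset survive being stored in a 32-bit off_t? */
static int fits_in_off32(int64_t pos)
{
    return pos >= INT32_MIN && pos <= INT32_MAX;
}

/* Accumulate segment sizes in a 64-bit (loff_t-like) offset. */
static int64_t end_offset(const int64_t *sizes, int n)
{
    int64_t pos = 0;

    for (int i = 0; i < n; i++)
        pos += sizes[i];
    return pos;
}
```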

    Signed-off-by: Daniel Jacobowitz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Jacobowitz
     

17 May, 2005

1 commit


29 Apr, 2005

1 commit


17 Apr, 2005

2 commits

  • This patch reworks the way the ppc64 vDSO is mapped in user memory by the kernel
    to make it more robust against possible collisions with executable
    segments. Instead of just whacking a VMA at 1Mb, I now use
    get_unmapped_area() with a hint, and I moved the mapping of the vDSO to
    after the mapping of the various ELF segments and of the interpreter, so
    that conflicts get caught properly (it still has to be before
    create_elf_tables since the latter will fill the AT_SYSINFO_EHDR with the
    proper address).
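    Mapping "with a hint" behaves much like userspace mmap() without MAP_FIXED: the kernel uses the hinted address if it is free and otherwise picks another, so an existing mapping is never clobbered. This is a userspace analogue only; map_near is hypothetical and the 1 MiB hint just echoes the text.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Ask for one page near 'hint' without MAP_FIXED. If the hinted range
 * is taken, the kernel chooses a different address instead of
 * overwriting what is there. Returns MAP_FAILED on error.
 */
static void *map_near(uintptr_t hint)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);

    return mmap((void *)hint, len, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}
```

    Calling map_near(0x100000) twice yields two distinct addresses, which is the property that lets conflicts with ELF segments be caught instead of silently overlapping.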

    While I was at it, I also changed the 32-bit and 64-bit vDSOs to link at
    their "natural" address of 1Mb instead of 0. This is the address where
    they are normally mapped in the absence of conflict. By doing so, it should be
    possible to properly prelink them once it's been verified to work with glibc.

    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     
  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds