02 Oct, 2006

8 commits

  • Currently proc_pident_lookup gets the names and types from a table and then
    has a huge switch statement to get the inode and file operations it needs.
    That is silly and is becoming increasingly hard to maintain so I just put all
    of the information in the table.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
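
    The table-driven approach described in the commit above can be sketched in
    user space as follows. This is a minimal illustration, not the kernel's
    code: struct demo_pid_entry, struct demo_ops, and demo_lookup() are
    hypothetical names, and the real table carries inode and file operations
    rather than a label.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's inode and file operation tables. */
struct demo_ops {
    const char *label;
};

static const struct demo_ops demo_status_ops  = { "status" };
static const struct demo_ops demo_cmdline_ops = { "cmdline" };

/* One row carries everything the lookup needs: name, mode, and operations. */
struct demo_pid_entry {
    const char *name;
    unsigned int mode;
    const struct demo_ops *ops;
};

static const struct demo_pid_entry demo_entries[] = {
    { "status",  0444, &demo_status_ops  },
    { "cmdline", 0444, &demo_cmdline_ops },
};

/* Walk the table by name; no per-type switch statement is required. */
static const struct demo_pid_entry *demo_lookup(const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < sizeof(demo_entries) / sizeof(demo_entries[0]); i++) {
        if (strcmp(demo_entries[i].name, name) == 0)
            return &demo_entries[i];
    }
    return NULL;
}
```

    Adding a new attribute then means adding one row to demo_entries; the
    lookup itself does not change.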
     
  • There were enough changes in my last round of cleaning up proc I had to break
    up the patch series into smaller chunks, and my last chunk never got resent.

    This patchset gives proc dynamic inode numbers (the static inode numbers were
    a pain to maintain and prevent all kinds of things), and removes the horrible
    switch statements that had to be kept in sync with everything else. Being
    fully table driven takes us 90% of the way to being able to register new
    process specific attributes in proc.

    This patch:

    Group the functions by what they implement instead of by type of operation.
    As it existed base.c was quickly approaching the point where it could not be
    followed.

    No functionality or code changes aside from adding/removing forward
    declarations are implemented in this patch.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • The problem: An opendir, readdir, closedir sequence can fail to report
    process ids that are continually in use throughout the sequence of system
    calls. For this race to trigger the process that proc_pid_readdir stops at
    must exit before readdir is called again.

    This can cause ps to fail to report processes, and it is in violation of
    posix guarantees and normal application expectations with respect to
    readdir.

    Currently there is no way to work around this problem in user space short
    of providing a gargantuan buffer so that the directory read all happens
    in one system call.

    This patch implements the normal directory semantics for proc, which
    guarantee that a directory entry that is neither created nor destroyed
    while reading the directory will be returned. For directory entries that
    are either created or destroyed during the readdir you may or may not see
    them. Furthermore you may seek to a directory offset you have previously
    seen.

    These are the guarantees that ext[23] provides and that posix requires, and
    more importantly that user space expects. Plus it is a simple semantic to
    implement reliable service. It is just a matter of calling readdir a
    second time if you are wondering if something new has shown up.

    These better semantics are implemented by scanning through the pids in
    numerical order and by making the file offset a pid plus a fixed offset.

    The pid scan happens on the pid bitmap, which when you look at it is
    remarkably efficient for a brute force algorithm. A typical cache line
    is 64 bytes and thus covers space for 64*8 == 512 (0x200) pids, so there
    are only 64 (0x40) cache lines for the entire 32K pid space. A typical
    system will have 100 pids or more, so this is actually fewer cache lines
    than we would have to look at to scan a linked list, and the worst case of
    having to scan the entire pid bitmap is pretty reasonable.

    If we need something more efficient we can go to a more efficient data
    structure for indexing the pids, but for now what we have should be
    sufficient.

    In addition this takes no additional locks and is actually less code than
    what we are doing now.

    Also another very subtle bug in this area has been fixed. It is possible
    to catch a task in the middle of de_thread where a thread is assuming the
    identity of its thread group leader. This patch carefully handles that case
    so if we hit it we don't fail to return the pid that is undergoing the
    de_thread dance.

    Thanks to KAMEZAWA Hiroyuki for
    providing the first fix, pointing this out and working on it.

    [oleg@tv-sign.ru: fix it]
    Signed-off-by: Eric W. Biederman
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Oleg Nesterov
    Cc: Jean Delvare
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
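
    The scan described in the commit above (walk the pid bitmap in numerical
    order, and make the file offset a pid plus a fixed offset) can be sketched
    in user space like this. The names and the value of DEMO_FIXED_OFFSET are
    assumptions made for illustration; the kernel's bitmap layout and offset
    constant are not reproduced here.

```c
#include <assert.h>
#include <limits.h>

#define DEMO_PID_MAX      32768   /* the 32K pid space from the commit text */
#define DEMO_FIXED_OFFSET 2       /* hypothetical: room for "." and ".." */

/* One bit per pid; 32768 / 8 == 4096 bytes, i.e. 64 cache lines of 64 bytes. */
static unsigned char demo_pidmap[DEMO_PID_MAX / CHAR_BIT];

static void demo_set_pid(int pid)
{
    demo_pidmap[pid / CHAR_BIT] |= (unsigned char)(1u << (pid % CHAR_BIT));
}

static void demo_clear_pid(int pid)
{
    demo_pidmap[pid / CHAR_BIT] &= (unsigned char)~(1u << (pid % CHAR_BIT));
}

/* Return the smallest in-use pid >= start, or -1 when none remains. */
static int demo_next_pid(int start)
{
    int pid;

    assert(start >= 0);
    for (pid = start; pid < DEMO_PID_MAX; pid++) {
        if (demo_pidmap[pid / CHAR_BIT] & (1u << (pid % CHAR_BIT)))
            return pid;
    }
    return -1;
}

/* The offset encodes a pid, so resuming never depends on a task still existing. */
static long demo_pid_to_offset(int pid)
{
    return (long)pid + DEMO_FIXED_OFFSET;
}

static int demo_offset_to_pid(long offset)
{
    return (int)(offset - DEMO_FIXED_OFFSET);
}
```

    If the process at the saved offset exits between two readdir calls, the
    next call converts the offset back to a pid and demo_next_pid() simply
    finds the next pid still in use, so no long-lived entry is skipped.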
     
  • When listing loaded modules during an oops or panic, also list each
    module's Tainted flags if non-zero (P: Proprietary or F: Forced load only).

    If a module did not taint the kernel, it is just listed like
    usbcore
    but if it did taint the kernel, it is listed like
    wizmodem(PF)

    Example:
    [ 3260.121718] Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
    [ 3260.121729] [] :dump_test:proc_dump_test+0x99/0xc8
    [ 3260.121742] PGD fe8d067 PUD 264a6067 PMD 0
    [ 3260.121748] Oops: 0002 [1] SMP
    [ 3260.121753] CPU 1
    [ 3260.121756] Modules linked in: dump_test(P) snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device ide_cd generic ohci1394 snd_hda_intel snd_hda_codec snd_pcm snd_timer snd ieee1394 snd_page_alloc piix ide_core arcmsr aic79xx scsi_transport_spi usblp
    [ 3260.121785] Pid: 5556, comm: bash Tainted: P 2.6.18-git10 #1

    [Alternatively, I can look into listing tainted flags with 'lsmod',
    but that won't help in oopsen/panics so much.]

    [akpm@osdl.org: cleanup]
    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
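
    The listing format described in the commit above (plain name, or name
    followed by the taint letters in parentheses) can be sketched as below.
    The flag bit values and demo_format_module() are hypothetical; the commit
    only specifies the letters P and F and the printed form.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical flag bits; the commit only names the letters P and F. */
#define DEMO_TAINT_PROPRIETARY 0x1u
#define DEMO_TAINT_FORCED      0x2u

/* Write "name" or "name(PF)" into buf, matching the listing format above. */
static void demo_format_module(char *buf, size_t len, const char *name,
                               unsigned int taints)
{
    char flags[3];
    size_t n = 0;

    assert(buf != NULL && name != NULL);
    if (taints & DEMO_TAINT_PROPRIETARY)
        flags[n++] = 'P';
    if (taints & DEMO_TAINT_FORCED)
        flags[n++] = 'F';
    flags[n] = '\0';

    if (n > 0)
        snprintf(buf, len, "%s(%s)", name, flags);
    else
        snprintf(buf, len, "%s", name);
}
```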
     
  • The exported kernel interfaces of genpool allocator need to adhere to
    the requirements of kernel-doc.

    Signed-off-by: Dean Nelson
    Cc: Steve Wise
    Acked-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dean Nelson
     
  • Modules using the genpool allocator need to be able to destroy the data
    structure when unloading.

    Signed-off-by: Steve Wise
    Cc: Randy Dunlap
    Cc: Dean Nelson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steve Wise
     
  • The test for the error from pcmcia_replace_cis() was incorrect, and
    would always trigger (because if an error didn't happen, the "ret" value
    would not be zero, it would be the passed-in count).

    Reported and debugged by Fabrice Bellet

    Rather than just fix the single broken test, make the code in question
    use an understandable code-sequence instead, fixing the whole function
    to be more readable.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
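
    The class of bug described in the commit above can be illustrated with a
    small sketch. This is not the driver's code: demo_store_broken() and
    demo_store_fixed() are hypothetical, and the only assumption taken from the
    commit text is that "ret" holds the passed-in count when nothing went
    wrong, so testing it for non-zero always reports an error.

```c
#include <assert.h>

#define DEMO_EIO 5   /* mimics the kernel's EIO value for this sketch */

/* Broken pattern: a success count is non-zero, so it is mistaken for an error. */
static long demo_store_broken(int replace_failed, long count)
{
    long ret = replace_failed ? -DEMO_EIO : count;

    assert(count > 0);
    if (ret)
        return -DEMO_EIO;
    return count;
}

/* Fixed pattern: only a negative value is treated as an error. */
static long demo_store_fixed(int replace_failed, long count)
{
    long ret = replace_failed ? -DEMO_EIO : count;

    assert(count > 0);
    if (ret < 0)
        return ret;
    return count;
}
```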
     
  • It's not clear how this thinko got through..

    Cc: Olaf Hering
    Cc: David Brownell
    Cc: Alessandro Zummo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

01 Oct, 2006

32 commits

  • * master.kernel.org:/pub/scm/linux/kernel/git/davej/agpgart:
    [AGPGART] printk fixups.
    [AGPGART] Use pci_get_slot not pci_find_slot

    Linus Torvalds
     
  • * master.kernel.org:/pub/scm/linux/kernel/git/davej/cpufreq:
    [CPUFREQ] Make acpi-cpufreq unsticky again.
    [CPUFREQ] longhaul: remove duplicated code.
    [CPUFREQ] Longhaul - Disable arbiter CLE266
    [CPUFREQ] Fix section mismatch warning
    [CPUFREQ] Fix cut-n-paste bug in suspend printk

    Linus Torvalds
     
  • While tracking down a PAE compile failure, I found that config.h was being
    included in a bunch of places in i386 code. It is no longer necessary, so
    drop it.

    Signed-off-by: Zachary Amsden
    Cc: Rusty Russell
    Cc: Jeremy Fitzhardinge
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
  • Add a pte_update_hook which notifies about pte changes that have been made
    without using the set_pte / clear_pte interfaces. This allows shadow mode
    hypervisors which do not trap on page table access to maintain synchronized
    shadows.

    It also turns out there was one pte update in PAE mode that wasn't using any
    accessor interface at all for setting NX protection. Considering it is PAE
    specific, and the accessor is i386 specific, I didn't want to add a generic
    encapsulation of this behavior yet.

    Signed-off-by: Zachary Amsden
    Cc: Rusty Russell
    Cc: Jeremy Fitzhardinge
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
  • Now that ptep_establish has a definition in PAE i386 3-level paging code, the
    only paging model which is insane enough to have multi-word hardware PTEs
    which are not efficient to set atomically, we can remove the ghost of
    set_pte_atomic from other architectures which falsely duplicated it, and
    remove all knowledge of it from the generic pgtable code.

    set_pte_atomic is now a private pte operator which is specific to i386.

    Signed-off-by: Zachary Amsden
    Cc: Rusty Russell
    Cc: Jeremy Fitzhardinge
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
  • The ptep_establish macro is only used on user-level PTEs, for P->P mapping
    changes. Since these always happen under protection of the pagetable lock,
    the strong synchronization of a 64-bit cmpxchg is not needed; in fact, not
    even a lock prefix needs to be used. We can instead simply clear the P-bit,
    followed by a normal set. The write ordering is still important to avoid the
    possibility of the TLB snooping a partially written PTE and getting a bad
    mapping installed.

    Signed-off-by: Zachary Amsden
    Cc: Rusty Russell
    Cc: Jeremy Fitzhardinge
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
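
    The write ordering described in the commit above (clear the P-bit, then do
    a normal set) can be sketched for a two-word PTE as follows. The struct and
    function names are hypothetical, the barrier is a compiler-only barrier
    written with GCC/Clang inline asm syntax, and the sketch assumes the
    present bit lives in bit 0 of the low word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical two-word PTE; bit 0 of the low word is the present (P) bit. */
struct demo_pte {
    volatile uint32_t low;
    volatile uint32_t high;
};

#define DEMO_PTE_PRESENT 0x1u

/* Compiler-only barrier (GCC/Clang syntax); a kernel would use its own. */
#define demo_wmb() __asm__ __volatile__("" ::: "memory")

static void demo_set_pte_present(struct demo_pte *ptep,
                                 uint32_t low, uint32_t high)
{
    assert(ptep != NULL);
    ptep->low = 0;       /* clear P first so the entry is never half-valid */
    demo_wmb();
    ptep->high = high;   /* safe: the entry is not present at this point */
    demo_wmb();
    ptep->low = low;     /* writing P last publishes the complete entry */
}
```

    A reader that observes the entry mid-update sees either the old mapping, a
    not-present entry, or the complete new mapping, never a mix of old and new
    halves marked present.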
     
  • Create a new PTE function which combines clearing a kernel PTE with the
    subsequent flush. This allows the two to be easily combined into a single
    hypercall or paravirt-op. More subtly, reverse the order of the flush for
    kmap_atomic. Instead of flushing on establishing a mapping, flush on clearing
    a mapping. This eliminates the possibility of leaving stale kmap entries
    which may still have valid TLB mappings. This is required for direct mode
    hypervisors, which need to reprotect all mappings of a given page when
    changing the page type from a normal page to a protected page (such as a page
    table or descriptor table page). But it also provides some nicer semantics
    for real hardware, by providing extra debug-proofing against using stale
    mappings, as well as ensuring that no stale mappings exist when changing the
    cacheability attributes of a page, which could lead to cache conflicts when
    two different types of mappings exist for the same page.

    Signed-off-by: Zachary Amsden
    Cc: Rusty Russell
    Cc: Jeremy Fitzhardinge
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
  • Remove ptep_test_and_clear_{dirty|young} from i386, and instead use the
    dominating functions, ptep_clear_flush_{dirty|young}. This allows the TLB
    page flush to be contained in the same macro, and allows for an eager
    optimization - if reading the PTE initially returned dirty/accessed, we can
    assume that no subsequent update to the PTE which cleared accessed /
    dirty has occurred, as the only way A/D bits can change without holding the
    page table lock is if a remote processor clears them. This eliminates an
    extra branch which came from the generic version of the code, as we know that
    no other CPU could have cleared the A/D bit, so the flush will always be
    needed.

    We still export these two defines, even though we do not actually define
    the macros in the i386 code:

    #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
    #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY

    The reason for this is that the only use of these functions is within the
    generic clear_flush functions, and we want a strong guarantee that there
    are no other users of these functions, so we want to prevent the generic
    code from defining them for us.

    Signed-off-by: Zachary Amsden
    Cc: Rusty Russell
    Cc: Jeremy Fitzhardinge
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
  • Implement lazy MMU update hooks which are SMP safe for both direct and shadow
    page tables. The idea is that PTE updates and page invalidations while in
    lazy mode can be batched into a single hypercall. We use this in VMI for
    shadow page table synchronization, and it is a win. It also can be used by
    PPC and for direct page tables on Xen.

    For SMP, the enter / leave must happen under protection of the page table
    locks for page tables which are being modified. This is because otherwise,
    you end up with stale state in the batched hypercall, which other CPUs can
    race ahead of. Doing this under the protection of the locks guarantees the
    synchronization is correct, and also means that spurious faults which are
    generated during this window by remote CPUs are properly handled, as the page
    fault handler must re-check the PTE under protection of the same lock.

    Signed-off-by: Zachary Amsden
    Signed-off-by: Jeremy Fitzhardinge
    Cc: Rusty Russell
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
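
    The batching idea in the commit above can be sketched in user space as
    follows. Everything here is hypothetical (struct demo_mmu, the queue size,
    and the hypercall counter standing in for hypervisor round trips), and the
    page table locking that the commit requires around enter / leave is
    omitted.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_BATCH_MAX 8

struct demo_update {
    unsigned long *ptep;
    unsigned long val;
};

struct demo_mmu {
    int lazy;                                /* non-zero while in lazy mode */
    size_t queued;
    struct demo_update batch[DEMO_BATCH_MAX];
    unsigned int hypercalls;                 /* simulated hypervisor round trips */
};

/* Apply every queued update in one simulated hypercall. */
static void demo_flush(struct demo_mmu *mmu)
{
    size_t i;

    if (mmu->queued == 0)
        return;
    for (i = 0; i < mmu->queued; i++)
        *mmu->batch[i].ptep = mmu->batch[i].val;
    mmu->queued = 0;
    mmu->hypercalls++;
}

static void demo_enter_lazy(struct demo_mmu *mmu)
{
    mmu->lazy = 1;
}

static void demo_leave_lazy(struct demo_mmu *mmu)
{
    demo_flush(mmu);
    mmu->lazy = 0;
}

/* Outside lazy mode each update costs one round trip; inside, it is queued. */
static void demo_set_pte(struct demo_mmu *mmu, unsigned long *ptep,
                         unsigned long val)
{
    assert(mmu != NULL && ptep != NULL);
    if (!mmu->lazy) {
        *ptep = val;
        mmu->hypercalls++;
        return;
    }
    if (mmu->queued == DEMO_BATCH_MAX)
        demo_flush(mmu);
    mmu->batch[mmu->queued].ptep = ptep;
    mmu->batch[mmu->queued].val = val;
    mmu->queued++;
}
```

    Note that while updates are queued, reading the PTE memory directly still
    returns the old value, which is why code must not read PTEs back directly
    after modifying them in lazy mode.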
     
  • Change pte_clear_full to a more appropriately named pte_clear_not_present,
    allowing optimizations when not-present mapping changes need not be reflected
    in the hardware TLB for protected page table modes. There is also another
    case that can use it in the fremap code.

    Signed-off-by: Zachary Amsden
    Signed-off-by: Jeremy Fitzhardinge
    Cc: Rusty Russell
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
  • We don't want to read PTEs directly like this after they have been modified,
    as a lazy MMU implementation of direct page tables may not have written the
    updated PTE back to memory yet.

    Signed-off-by: Zachary Amsden
    Signed-off-by: Jeremy Fitzhardinge
    Cc: Rusty Russell
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
  • The recent fix to invalidate_inode_pages() (git commit 016eb4a) managed to
    unfix invalidate_inode_pages2().

    The problem is that various bits of code in the kernel can take transient refs
    on pages: the page scanner will do this when inspecting a batch of pages, and
    the lru_cache_add() batching pagevecs also hold a ref.

    Net result is transient failures in invalidate_inode_pages2(). This affects
    NFS directory invalidation (observed) and presumably also block-backed
    direct-io (not yet reported).

    Fix it by reverting invalidate_inode_pages2() back to the old version which
    ignores the page refcounts.

    We may come up with something more clever later, but for now we need a 2.6.18
    fix for NFS.

    Cc: Chuck Lever
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Using the infrastructure created in previous patches implement support to
    pipe core dumps into programs.

    This is done by overloading the existing core_pattern sysctl
    with a new syntax:

    |program

    When the first character of the pattern is a '|' the kernel will instead
    treat the rest of the pattern as a command to run. The core dump will be
    written to the standard input of that program instead of to a file.

    This is useful for having automatic core dump analysis without filling up
    disks. The program can do some simple analysis and save only a summary of
    the core dump.

    The core dump process will run with the privileges and in the name space of
    the process that caused the core dump.

    I also increased the core pattern size to 128 bytes so that longer command
    lines fit.

    Most of the changes come from allowing core dumps without seeks. They are
    fairly straightforward though.

    One small incompatibility is that if someone had a core pattern previously
    that started with '|' they will suddenly get new behaviour. I think that's
    unlikely to be a real problem though.

    Additional background:

    > Very nice, do you happen to have a program that can accept this kind of
    > input for crash dumps? I'm guessing that the embedded people will
    > really want this functionality.

    I had a cheesy demo/prototype. Basically it wrote the dump to a file again,
    ran gdb on it to get a backtrace and wrote the summary to a shared directory.
    Then there was a simple CGI script to generate a "top 10" crashes HTML
    listing.

    Unfortunately this still had the disadvantage of needing full disk space for a
    dump except for deleting it afterwards (in fact it was worse because over the
    pipe holes didn't work so if you have a holey address map it would require
    more space).

    Fortunately gdb seems to be happy to handle /proc/pid/fd/xxx input pipes as
    cores (at least it worked with zsh's =(cat core) syntax), so it would
    likely be possible to do it without temporary space with a simple wrapper that
    calls it in the right way. I ran out of time before doing that though.

    The demo prototype scripts weren't very good. If there is really interest I
    can dig them out (they are currently on a laptop disk on the desk with the
    laptop itself being in service), but I would recommend rewriting them for any
    serious application of this and fix the disk space problem.

    Also to be really useful it should probably find a way to automatically fetch
    the debuginfos (I cheated and just installed them in advance). If nobody else
    does it I can probably do the rewrite myself again at some point.

    My hope at some point was that desktops would support it in their builtin
    crash reporters, but at least the KDE people I talked to seemed to be happy
    with their user space only solution.

    Alan sayeth:

    I don't believe that piping as such is necessarily the right model, but
    the ability to intercept and process core dumps from user space is asked
    for by many enterprise users as well. They want to know about, capture,
    analyse and process core dumps, often centrally and in automated form.

    [akpm@osdl.org: loff_t != unsigned long]
    Signed-off-by: Andi Kleen
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
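
    The "|program" rule described in the commit above (a leading '|' means the
    rest of the pattern is a command that receives the dump on stdin) can be
    sketched as a small classifier. demo_core_pipe_command() is a hypothetical
    helper for illustration only; it does not reproduce the kernel's parsing
    or its handling of the rest of the pattern.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the command to run if the pattern requests a pipe, or NULL if the
 * pattern should be treated as a file name as before.
 */
static const char *demo_core_pipe_command(const char *pattern)
{
    assert(pattern != NULL);
    if (pattern[0] == '|')
        return pattern + 1;
    return NULL;
}
```

    For example, a pattern of "|/usr/local/bin/demo-core-handler" (a made-up
    path) yields the command "/usr/local/bin/demo-core-handler", while a
    pattern of "core" yields NULL and the dump goes to a file.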
     
  • A new member in the ever growing family of call_usermode* functions is
    born. The new call_usermodehelper_pipe() function allows data to be piped to
    the stdin of the called user mode program and otherwise behaves like the
    normal call_usermodehelper() (except that it always waits for the child to
    finish).

    Signed-off-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Split the big and hard to read do_pipe function into smaller pieces.

    This creates new create_write_pipe/free_write_pipe/create_read_pipe
    functions. These functions are made global so that they can be used by
    other parts of the kernel.

    The resulting code is more generic and easier to read and has cleaner error
    handling and fewer gotos.

    [akpm@osdl.org: cleanup]
    Signed-off-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • ioremap must be balanced by an iounmap and failing to do so can result
    in a memory leak.

    Signed-off-by: Amol Lad
    Cc: Alan Cox
    Cc: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Amol Lad
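
    The balancing rule stated in the commit above, and in the similar commits
    that follow, can be illustrated with a user-space mock. demo_ioremap() and
    demo_iounmap() are hypothetical stand-ins that only allocate and count;
    they are not the kernel functions, and demo_probe() is an invented
    probe-style routine showing the usual unwind-on-error shape.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

static int demo_live_mappings;   /* mappings not yet unmapped in this mock */

/* Mock stand-ins for ioremap()/iounmap(); they only allocate and count. */
static void *demo_ioremap(size_t len)
{
    void *p = malloc(len);

    if (p != NULL)
        demo_live_mappings++;
    return p;
}

static void demo_iounmap(void *p)
{
    if (p != NULL) {
        free(p);
        demo_live_mappings--;
    }
}

/* Every failure after the mapping succeeds must unmap before returning. */
static int demo_probe(void **regs_out, int fail_late)
{
    void *regs;

    assert(regs_out != NULL);
    regs = demo_ioremap(64);
    if (regs == NULL)
        return -12;              /* mimics -ENOMEM */
    if (fail_late)
        goto err_unmap;          /* simulated later initialisation failure */
    *regs_out = regs;
    return 0;

err_unmap:
    demo_iounmap(regs);
    return -5;                   /* mimics -EIO */
}

/* The removal path releases the mapping taken by a successful probe. */
static void demo_remove(void *regs)
{
    demo_iounmap(regs);
}
```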
     
  • ioremap must be balanced by an iounmap and failing to do so can result
    in a memory leak.

    Signed-off-by: Amol Lad
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Amol Lad
     
  • ioremap must be balanced by an iounmap and failing to do so can result
    in a memory leak.

    Signed-off-by: Amol Lad
    Cc: Alan Cox
    Cc: Mark A. Greer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Amol Lad
     
  • ioremap must be balanced by an iounmap and failing to do so can result
    in a memory leak.

    Signed-off-by: Amol Lad
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Amol Lad
     
  • ioremap must be balanced by an iounmap and failing to do so can result
    in a memory leak.

    Signed-off-by: Amol Lad
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Amol Lad
     
  • ioremap must be balanced by an iounmap and failing to do so can result
    in a memory leak.

    Signed-off-by: Amol Lad
    Cc: Alan Cox
    Cc: Brent Casavant
    Cc: Pat Gefre
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Amol Lad
     
  • ioremap must be balanced by an iounmap and failing to do so can result
    in a memory leak.

    Signed-off-by: Amol Lad
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Amol Lad
     
  • ioremap must be balanced by an iounmap and failing to do so can result
    in a memory leak.

    Signed-off-by: Amol Lad
    Cc: Alan Cox

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Amol Lad
     
  • Convert x86_64 to use generic ioremap_page_range()

    [akpm@osdl.org: build fix]
    Signed-off-by: Haavard Skinnemoen
    Acked-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Haavard Skinnemoen
     
  • Convert m32r to use generic ioremap_page_range()

    Signed-off-by: Haavard Skinnemoen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Haavard Skinnemoen
     
  • Convert i386 to use generic ioremap_page_range()

    [bunk@stusta.de: build fix]
    Signed-off-by: Haavard Skinnemoen
    Acked-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Haavard Skinnemoen
     
  • Convert CRIS to use generic ioremap_page_range()

    Signed-off-by: Haavard Skinnemoen
    Acked-by: Mikael Starvik
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Haavard Skinnemoen
     
  • Convert AVR32 to use generic ioremap_page_range()

    Signed-off-by: Haavard Skinnemoen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Haavard Skinnemoen
     
  • Convert Alpha to use generic ioremap_page_range() by turning
    __alpha_remap_area_pages() into an inline wrapper around ioremap_page_range().

    Signed-off-by: Haavard Skinnemoen
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Haavard Skinnemoen
     
  • The existing implementation of ioremap_page_range(), which was taken
    from i386, does this:

    flush_cache_all();
    /* modify page tables */
    flush_tlb_all();

    I think this is a bit defensive, so this patch changes the generic
    implementation to do:

    /* modify page tables */
    flush_cache_vmap(start, end);

    instead, which is similar to what vmalloc() does. This should still
    be correct because we never modify existing PTEs. According to
    James Bottomley:

    The problem the flush_tlb_all() is trying to solve is to avoid stale tlb
    entries in the ioremap area. We're just being conservative by flushing
    on both map and unmap. Technically what vmalloc/vfree does (only flush
    the tlb on unmap) is just fine because it means that the only tlb
    entries in the remap area must belong to in-use mappings.

    Signed-off-by: Haavard Skinnemoen
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Russell King
    Cc: Mikael Starvik
    Cc: Andi Kleen
    Cc: Ralf Baechle
    Cc: Kyle McMartin
    Cc: Martin Schwidefsky
    Cc: Paul Mundt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Haavard Skinnemoen
     
  • This patch adds a generic implementation of ioremap_page_range() in
    lib/ioremap.c based on the i386 implementation. It differs from the
    i386 version in the following ways:

    * The PTE flags are passed as a pgprot_t argument and must be
      determined up front by the arch-specific code. No additional
      PTE flags are added.
    * Uses set_pte_at() instead of set_pte()

    [bunk@stusta.de: warning fix]
    [dhowells@redhat.com: nommu build fix]
    Signed-off-by: Haavard Skinnemoen
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Russell King
    Cc: Mikael Starvik
    Cc: Andi Kleen
    Cc: Ralf Baechle
    Cc: Kyle McMartin
    Cc: Martin Schwidefsky
    Cc: Paul Mundt
    Signed-off-by: Adrian Bunk
    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Haavard Skinnemoen
     
  • Re-implement smp_send_nmi_allbutself() so that calls to smp_processor_id
    (through send_IPI_allbutself) can be replaced with safe_smp_processor_id
    without affecting other parts of the kernel (as suggested by Eric Biederman).

    Signed-off-by: Fernando Vazquez
    Looks-reasonable-to: Andi Kleen
    Acked-by: "Eric W. Biederman"
    Cc: Vivek Goyal
    Cc: James Bottomley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fernando Vazquez