29 Jun, 2005

1 commit


28 Jun, 2005

1 commit

  • I spotted this issue while in memmap_init last week. I can't say the
    change has any test coverage by me. start_pfn was formerly used in the
    main "for" loop. The fix is to replace start_pfn with pfn.

    Signed-off-by: Bob Picco
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Picco
     

26 Jun, 2005

8 commits

  • Linus Torvalds
     
  • 1. Establish a simple API for process freezing defined in include/linux/sched.h:

    frozen(process) Check for frozen process
    freezing(process) Check if a process is being frozen
    freeze(process) Tell a process to freeze (go to refrigerator)
    thaw_process(process) Restart process
    frozen_process(process) Process is frozen now

    2. Remove all references to PF_FREEZE and PF_FROZEN from all
    kernel sources except sched.h

    3. Fix numerous locations where try_to_freeze is manually done by a driver

    4. Remove the argument that is no longer necessary from two function calls.

    5. Some whitespace cleanup

    6. Close a potential race in the refrigerator (there was an open window with
    PF_FREEZE cleared before PF_FROZEN was set, and recalc_sigpending does not
    check PF_FROZEN).

    This patch does not address the problem that freeze_processes(), by setting
    PF_FREEZE on other tasks, violates the rule that a task may only modify its own
    flags. This is not clean in an SMP environment; freeze(process) is therefore not
    SMP safe!
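
    A minimal sketch of the intended call pattern, assuming a no-argument
    try_to_freeze() helper built on the predicates listed above (the thread and
    its loop body are invented for illustration, and exact signatures vary
    across kernel versions):

    #include <linux/kthread.h>
    #include <linux/sched.h>

    static int example_kthread(void *unused)
    {
            while (!kthread_should_stop()) {
                    try_to_freeze();  /* if freezing(current), park in the refrigerator */

                    /* ... the thread's normal periodic work ... */

                    set_current_state(TASK_INTERRUPTIBLE);
                    schedule_timeout(HZ);
            }
            return 0;
    }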

    Signed-off-by: Christoph Lameter
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This patch makes use of ALIGN() to remove duplicate round-up code.
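
    For context, a hedged example of the kind of substitution this enables (the
    variable is invented; ALIGN(x, a) rounds x up to the next multiple of a
    power-of-two a):

    /* open-coded round-up ... */
    len = (len + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);

    /* ... becomes */
    len = ALIGN(len, PAGE_SIZE);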

    Signed-off-by: Nick Wilson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Wilson
     
  • This patch retrieves the max_pfn being used by the previous kernel and stores
    it in a safe location (saved_max_pfn) before it is overwritten by the
    user-defined memory map. This pfn is used to make sure that the user does not
    try to read physical memory beyond saved_max_pfn.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vivek Goyal
     
  • Here is the fix for the problem described in

    http://bugzilla.kernel.org/show_bug.cgi?id=4721

    Basically, the problem is that generic_file_buffered_write() is accessing beyond
    the end of the iov[] vector after handling the last vector. If we happen to cross
    a page boundary, we get a fault.

    I think this simple patch is good enough. If we really don't want to
    depend on the "count", then we need pass nr_segs to
    filemap_set_next_iovec() and decrement it and check it.
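
    A self-contained illustration of the failure mode (plain C, not the kernel
    code): a segment walker that consults only iov_len can step one element past
    the end of the array once the final segment is consumed, and dereferencing
    that stale pointer may cross into an unmapped page.

    #include <stddef.h>
    #include <sys/uio.h>

    static void advance(const struct iovec **iovp, size_t *basep, size_t bytes)
    {
            const struct iovec *iov = *iovp;
            size_t base = *basep + bytes;

            /* Unsafe on the last segment: after consuming it exactly, iov is
             * advanced past the end of the array and then dereferenced by the
             * loop condition.  Tracking the remaining count (or nr_segs, as
             * suggested above) avoids the overrun. */
            while (base >= iov->iov_len) {
                    base -= iov->iov_len;
                    iov++;
            }
            *iovp = iov;
            *basep = base;
    }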

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Badari Pulavarty
     
  • CONFIG_PM_DISK is long gone, but it still managed to survive in a few
    places.

    Signed-off-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Machek
     
  • Out-of-tree user of remap_pfn_range hit kernel BUG at mm/memory.c:1112! It
    passes an unrounded size to remap_pfn_range, which was okay before 2.6.12,
    but misses remap_pte_range's new end condition. An audit of all the other
    ptwalks confirms that this is the only one so exposed.
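
    A hedged sketch of the rounding involved, assuming the fix simply page-aligns
    the end of the range inside remap_pfn_range() so that an unrounded
    caller-supplied size can no longer overshoot remap_pte_range()'s end
    condition:

    /* inside remap_pfn_range(); 'size' comes straight from the caller */
    unsigned long end = addr + PAGE_ALIGN(size);    /* was: addr + size */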

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Fix a bug on error handling in the direct I/O function.

    Currently, if a file is opened with the O_DIRECT|O_SYNC flags, the write()
    syscall does not receive the EIO error after an I/O error (e.g. the SCSI cable
    is disconnected).

    Return values of other points that call generic_osync_inode() are treated
    appropriately.
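
    A hedged sketch of the pattern involved (surrounding variable names are
    invented): the status returned by generic_osync_inode() on the O_SYNC path
    must be propagated instead of dropped, otherwise write() reports success even
    after the sync hit an I/O error.

    if (status >= 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
            int err = generic_osync_inode(inode, mapping,
                                          OSYNC_METADATA | OSYNC_DATA);
            if (err < 0)
                    status = err;   /* let write() return -EIO etc. */
    }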

    Signed-off-by: Hisashi Hifumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hifumi Hisashi
     

24 Jun, 2005

20 commits

  • Make sys_madvise/fadvise return sane results with xip.

    Signed-off-by: Carsten Otte
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Carsten Otte
     
  • This patch reworks filemap_xip.c with the goal of reducing code duplication
    from mm/filemap.c. It applies against 2.6.12-rc6-mm1. Instead of
    implementing the aio functions, this one implements the synchronous
    read/write functions only. For readv and writev, the generic fallback is
    used. For aio, we rely on the application doing the fallback. Since our
    "synchronous" function does memcpy immediately anyway, there is no
    performance difference between using the fallbacks or implementing each
    operation.

    Signed-off-by: Carsten Otte
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Carsten Otte
     
  • - generic_file* file operations no longer have an xip/non-xip split
    - filemap_xip.c implements a new set of fops that require the get_xip_page
    aop to work properly. All new fops are exported GPL-only (we don't want
    non-GPL code using them)
    - __xip_unmap now uses page_check_address, which is no longer static
    in rmap.c and is declared in linux/rmap.h
    - mm/filemap.h is now much cleaner, containing just Linus'
    inline funcs moved there from filemap.c
    - fix includes in filemap_xip so that it builds cleanly on i386

    Signed-off-by: Carsten Otte
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Carsten Otte
     
  • This patch updates some comments to match code changes.

    Signed-off-by: Martin Waitz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Waitz
     
  • The following patch removes the f_error field and all checks of f_error.

    Trond said:

    f_error was introduced for NFS, and made sense when we were guaranteed
    always to have a file pointer around when write errors occurred. Since
    then, we have (for various reasons) had to introduce the nfs_open_context in
    order to track the file read/write state, and it made sense to move our
    f_error tracking there too.

    Signed-off-by: Christoph Lameter
    Acked-by: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Here's a small patch to improve the performance of mempool_alloc by only
    initializing the wait queue when we're about to wait.
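
    A hedged sketch of the idea (helper names follow the 2.6 mempool/wait-queue
    code, but this is not quoted from the patch): the wait-queue entry is only
    constructed on the slow path, after the fast-path allocation has failed.

    element = pool->alloc(gfp_mask, pool->pool_data);       /* fast path */
    if (element != NULL)
            return element;                 /* no wait-queue setup at all */

    /* slow path: only now pay for initializing the wait-queue entry */
    {
            DEFINE_WAIT(wait);

            prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
            io_schedule();
            finish_wait(&pool->wait, &wait);
    }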

    Signed-off-by: Benjamin LaHaise
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin LaHaise
     
  • This patch removes the redundant VM_ClearReadHint from mm/madvise.c, which was
    left there by Prasanna's patch.

    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     
  • This patch creates a new kstrdup library function and changes the "local"
    implementations in several places to use this function.

    Most of the changes come from the sound and net subsystems. The sound part
    had already been acknowledged by Takashi Iwai and the net part by David S.
    Miller.

    I left UML alone for now because I would need more time to read the code
    carefully before making changes there.
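
    A minimal sketch of such a helper, assuming the obvious
    strlen/kmalloc/memcpy implementation (the exact mainline prototype, in
    particular the type of the gfp argument, may differ):

    #include <linux/slab.h>
    #include <linux/string.h>

    char *kstrdup(const char *s, unsigned int gfp)
    {
            size_t len;
            char *buf;

            if (!s)
                    return NULL;

            len = strlen(s) + 1;            /* include the terminating NUL */
            buf = kmalloc(len, gfp);
            if (buf)
                    memcpy(buf, s, len);
            return buf;
    }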

    Signed-off-by: Paulo Marques
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paulo Marques
     
  • Patch to allocate the control structures for ide devices on the node of
    the device itself (for NUMA systems). The patch depends on the Slab API
    change patch by Manfred and me (in mm) and the pcidev_to_node patch that I
    posted today.

    Does some realignment too.

    Signed-off-by: Justin M. Forbes
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pravin Shelar
    Signed-off-by: Shobhit Dayal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Make sparse's initialization accessible at runtime. This allows sparse
    mappings to be created after boot in a hotplug situation.

    This patch is separated from the previous one just to give an indication how
    much of the sparse infrastructure is *just* for hotplug memory.

    The section_mem_map doesn't really store a pointer. It stores something that
    is convenient to do some math against to get a pointer. It isn't valid to
    just do *section_mem_map, so I don't think it should be stored as a pointer.

    There are a couple of things I'd like to store about a section. First of all,
    the fact that it is !NULL does not mean that it is present. There could be
    such a combination where section_mem_map *is* NULL, but the math gets you
    properly to a real mem_map. So, I don't think that check is safe.

    Since we're storing 32-bit-aligned structures, we have a few bits in the
    bottom of the pointer to play with. Use one bit to encode whether there's
    really a mem_map there, and the other one to tell whether there's a valid
    section there. We need to distinguish between the two because sometimes
    there's a gap between when a section is discovered to be present and when we
    can get the mem_map for it.
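
    A hedged sketch of that encoding (constant and helper names are illustrative;
    the stored value is the "convenient to do math against" quantity described
    above, with the two low bits carved out as flags):

    #define SECTION_MARKED_PRESENT  (1UL << 0)  /* a section exists here        */
    #define SECTION_HAS_MEM_MAP     (1UL << 1)  /* ... and its mem_map is valid */
    #define SECTION_MAP_MASK        (~(SECTION_MARKED_PRESENT | SECTION_HAS_MEM_MAP))

    struct mem_section {
            unsigned long section_mem_map;      /* encoded value, not a raw pointer */
    };

    static inline int present_section(struct mem_section *ms)
    {
            return ms && (ms->section_mem_map & SECTION_MARKED_PRESENT);
    }

    static inline int valid_section(struct mem_section *ms)
    {
            return ms && (ms->section_mem_map & SECTION_HAS_MEM_MAP);
    }

    static inline unsigned long section_map_bits(struct mem_section *ms)
    {
            return ms->section_mem_map & SECTION_MAP_MASK;  /* feed into the pfn math */
    }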

    Signed-off-by: Dave Hansen
    Signed-off-by: Andy Whitcroft
    Signed-off-by: Jack Steiner
    Signed-off-by: Bob Picco
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • The part of the sparsemem patch which modifies memmap_init_zone() has recently
    become a problem. It changes behavior so that there is a call to
    pfn_to_page() for each individual page inside of a node's range:
    node_start_pfn through node_end_pfn. It used to simply do this once, at the
    beginning of the node, but having sparsemem's non-contiguous mem_map[]s inside
    of a node made it necessary to change.

    Mike Kravetz recently wrote a patch which made the NUMA code accept some new
    kinds of layouts. The system's memory was laid out like this, with node 0's
    memory in two pieces: one before and one after node 1's memory:

    Node 0: +++++ +++++
    Node 1: +++++

    Previous behavior before Mike's patch was to assign nodes like this:

    Node 0: 00000 XXXXX
    Node 1: 11111

    Where the 'X' areas were simply thrown away. The new behavior was to make the
    pg_data_t span node 0 across all of its areas, including areas that are really
    node 1's:

    Node 0: 000000000000000
    Node 1: 11111

    This wastes a little bit of mem_map space, but ends up being OK, and more
    fully utilizes the system's memory. memmap_init_zone() initializes all of the
    "struct page"s for node 0, even for the "hole", but those never get used,
    because there is no pfn_to_page() that resolves to those pages. However, since
    it only calls pfn_to_page() once, memmap_init_zone() always uses the pages that
    were allocated for node0->node_mem_map:

    struct page *start = pfn_to_page(start_pfn);
    /* effectively start = &node->node_mem_map[0] */
    for (page = start; page < (start + size); page++) {
            init_page_here();...
    }

    Slow, and wasteful, but generally harmless.

    But, modify that to call pfn_to_page() for each loop iteration (like sparsemem
    does):

    for (pfn = start_pfn; pfn < (start_pfn + size); pfn++) {
            page = pfn_to_page(pfn);
    }

    And you end up trying to initialize node 1's pages too early, along with bogus
    data from node 0. This patch checks for those weird layouts and declines to
    touch the pages, making the more frequent pfn_to_page() calls OK to do.
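
    A hedged sketch of what "declines to touch the pages" means in the loop above
    (the helper name early_pfn_in_nid() is an assumption about the implementation,
    not quoted from the patch):

    for (pfn = start_pfn; pfn < (start_pfn + size); pfn++) {
            if (!early_pfn_in_nid(pfn, nid))
                    continue;       /* this pfn really belongs to another node */
            page = pfn_to_page(pfn);
            /* ... initialize the struct page as before ... */
    }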

    Signed-off-by: Dave Hansen
    Signed-off-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • Sparsemem abstracts the use of discontiguous mem_maps[]. This kind of
    mem_map[] is needed by discontiguous memory machines (like in the old
    CONFIG_DISCONTIGMEM case) as well as memory hotplug systems. Sparsemem
    replaces DISCONTIGMEM when enabled, and it is hoped that it can eventually
    become a complete replacement.

    A significant advantage over DISCONTIGMEM is that it's completely separated
    from CONFIG_NUMA. When producing this patch, it became apparent that NUMA
    and DISCONTIG are often confused.

    Another advantage is that sparse doesn't require each NUMA node's ranges to be
    contiguous. It can handle overlapping ranges between nodes with no problems,
    where DISCONTIGMEM currently throws away that memory.

    Sparsemem uses an array to provide different pfn_to_page() translations for
    each SECTION_SIZE area of physical memory. This is what allows the mem_map[]
    to be chopped up.

    In order to do quick pfn_to_page() operations, the section number of the page
    is encoded in page->flags. Part of the sparsemem infrastructure enables
    sharing of these bits more dynamically (at compile-time) between the
    page_zone() and sparsemem operations. However, on 32-bit architectures, the
    number of bits is quite limited, and may require growing the size of the
    page->flags type in certain conditions. Several things might force this to
    occur: a decrease in the SECTION_SIZE (if you want to hotplug smaller areas of
    memory), an increase in the physical address space, or an increase in the
    number of used page->flags.

    One thing to note is that, once sparsemem is present, the NUMA node
    information no longer needs to be stored in the page->flags. It might provide
    speed increases on certain platforms and will be stored there if there is
    room. But, if out of room, an alternate (theoretically slower) mechanism is
    used.

    This patch introduces CONFIG_FLATMEM. It is used in almost all cases where
    there used to be an #ifndef DISCONTIG, because SPARSEMEM and DISCONTIGMEM
    often have to compile out the same areas of code.

    Signed-off-by: Andy Whitcroft
    Signed-off-by: Dave Hansen
    Signed-off-by: Martin Bligh
    Signed-off-by: Adrian Bunk
    Signed-off-by: Yasunori Goto
    Signed-off-by: Bob Picco
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • Allow architectures to indicate that they will be providing hooks to mark
    installed memory areas, memory_present(). Provide prototypes for the i386
    implementation.

    Signed-off-by: Andy Whitcroft
    Signed-off-by: Dave Hansen
    Signed-off-by: Martin Bligh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • This gives DISCONTIGMEM a bit more help text to explain what it does, not just
    when to choose it.

    Signed-off-by: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • I got some feedback from users who think that the new "Memory Model" menu is a
    little invasive. This patch will hide that menu, except when
    CONFIG_EXPERIMENTAL is enabled *or* when an individual architecture wants it.

    An individual arch may want to enable it because they've removed their
    arch-specific DISCONTIG prompt in favor of the mm/Kconfig one.

    Signed-off-by: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • The following patch applies on top of 2.6.12-rc2-mm1. It fixes a minor
    user interaction issue, and an early reference to SPARSEMEM.

    This "choice" menu would always default to FLATMEM, as it was listed first.
    Move it to the end so that the other defaults have a chance first.

    Signed-off-by: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • Some confusion arose while working on the SPARSEMEM patch about what is
    needed for DISCONTIG vs. NUMA.

    Multiple pg_data_t's are needed for DISCONTIGMEM or NUMA, independently.
    All of the current NUMA implementations require an implementation of
    DISCONTIG. Because of this, quite a lot of code which is really needed for
    NUMA is actually under DISCONTIG #ifdefs. For SPARSEMEM, we changed some
    of these #ifdefs to CONFIG_NUMA, but that broke the DISCONTIG=y and NUMA=n
    case.

    Introducing this new NEED_MULTIPLE_NODES config option allows code that is
    needed for both NUMA or DISCONTIG to be separated out from code that is
    specific to DISCONTIG.

    One great advantage of this approach is that it doesn't require every
    architecture to be converted over. All of the current implementations
    should "just work"; only the ones implementing SPARSEMEM will have to be
    fixed up.

    The change to free_area_init() makes it work both inside and outside of the
    new config option.
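
    A hedged illustration of how the new symbol separates the two concerns
    (modelled on include/linux/mmzone.h; the details may differ from the actual
    patch):

    #ifdef CONFIG_NEED_MULTIPLE_NODES       /* DISCONTIGMEM or NUMA */
    extern struct pglist_data *node_data[];
    #define NODE_DATA(nid)  (node_data[nid])
    #else                                   /* single, flat pg_data_t */
    extern struct pglist_data contig_page_data;
    #define NODE_DATA(nid)  (&contig_page_data)
    #endif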

    Signed-off-by: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • With sparsemem being introduced, we need a central place for new
    memory-related .config options: mm/Kconfig. This allows us to remove many
    of the duplicated arch-specific options.

    The new option, CONFIG_FLATMEM, is there to enable us to detangle NUMA and
    DISCONTIGMEM. This is a requirement for sparsemem because sparsemem uses
    the NUMA code without the presence of DISCONTIGMEM. The sparsemem patches
    use CONFIG_FLATMEM in generic code, so this patch is a requirement before
    applying them.

    Almost all places that used to do '#ifndef CONFIG_DISCONTIGMEM' should use
    '#ifdef CONFIG_FLATMEM' instead.

    Signed-off-by: Andy Whitcroft
    Signed-off-by: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • Generify the value fields in the page_flags. The aim is to allow the location
    and size of these fields to be varied. Additionally we want to move away from
    fixed allocations per field whilst still enforcing the overall bit utilisation
    limits. We rely on the compiler to spot and optimise the accessor functions.
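
    A hedged sketch of the sort of accessor being generated (field widths and
    names here are purely illustrative): the widths are defined in one place,
    every shift and mask is derived from them, and the compiler constant-folds
    the result.

    #define NODES_WIDTH     6                               /* example width */
    #define ZONES_WIDTH     2                               /* example width */

    #define NODES_PGSHIFT   (BITS_PER_LONG - NODES_WIDTH)
    #define ZONES_PGSHIFT   (NODES_PGSHIFT - ZONES_WIDTH)
    #define ZONES_MASK      ((1UL << ZONES_WIDTH) - 1)

    static inline unsigned long page_zonenum_sketch(struct page *page)
    {
            return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
    }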

    Signed-off-by: Andy Whitcroft
    Signed-off-by: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • Introduce a simple allocator for the NUMA remap space. This space is very
    scarce, used for structures which are best allocated node local.

    This mechanism is also used on non-NUMA ia64 systems with a vmem_map to keep
    the pgdat->node_mem_map initialized in a consistent place for all
    architectures.

    Issues:
    o alloc_remap takes a node_id where we might expect a pgdat; this was intended
    to allow us to allocate the pgdats themselves using this mechanism, which we do
    not yet do. We could have alloc_remap_node() and alloc_remap_nid() for this
    purpose.

    Signed-off-by: Andy Whitcroft
    Signed-off-by: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

23 Jun, 2005

1 commit

  • The boot_pageset needs to be preserved for hotplugging and for offline
    processors and nodes. Otherwise pointers will point into memory that now
    has a different use. /proc/zoneinfo currently shows strange results
    if processors / nodes are not present.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

22 Jun, 2005

9 commits

  • OOM killer prints a stray newline.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Denis Vlasenko
     
  • It's common practice to msync a large address range regularly, in which
    often only a few ptes have actually been dirtied since the previous pass.

    sync_pte_range then goes much faster if it tests whether pte is dirty
    before locating and accessing each struct page cacheline; and it is hardly
    slowed by ptep_clear_flush_dirty repeating that test in the opposite case,
    when every pte actually is dirty.

    But beware, s390's pte_dirty always says false, since its dirty bit is kept
    in the storage key, located via the struct page address. So skip this
    optimization in its case: use a pte_maybe_dirty macro which just says true
    if page_test_and_clear_dirty is implemented.
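
    A hedged sketch of the fast path inside sync_pte_range (surrounding details
    such as pfn validation are omitted): the struct page is only looked up once
    the cheap pte test says there may be something to write back.

    if (!pte_maybe_dirty(pte))
            continue;               /* skip the struct page cacheline entirely */

    page = pfn_to_page(pte_pfn(pte));
    if (ptep_clear_flush_dirty(vma, addr, ptep))
            set_page_dirty(page);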

    Signed-off-by: Abhijit Karmarkar
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Abhijit Karmarkar
     
  • Remember that ironic get_user_pages race? When the raised page_count on a
    swapped-out page led do_wp_page to decide that it had to copy on write, it
    substituted a different page into userspace. 2.6.7 onwards have Andrea's
    solution, where try_to_unmap_one backs out if it finds page_count raised.

    Which works, but is unsatisfying (rmap.c has no other page_count heuristics),
    and was found a few months ago to hang an intensive page migration test. A
    year ago I was hesitant to engage page_mapcount, now it seems the right fix.

    So remove the page_count hack from try_to_unmap_one; and use activate_page in
    unuse_mm when dropping lock, to replace its secondary effect of helping
    swapoff to make progress in that case.

    Simplify can_share_swap_page (now called only on anonymous pages) to check
    page_mapcount + page_swapcount == 1: still needs the page lock to stabilize
    their (pessimistic) sum, but does not need swapper_space.tree_lock for that.

    In do_swap_page, move swap_free and unlock_page below page_add_anon_rmap, to
    keep the sum on the high side, so that it is correct when can_share_swap_page
    is called.
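
    A hedged sketch of the simplified test (the helper's shape is assumed from
    the description above, not quoted from the patch):

    /* page must be locked and anonymous; the lock stabilizes the sum */
    static inline int can_share_swap_page_sketch(struct page *page)
    {
            return page_mapcount(page) + page_swapcount(page) == 1;
    }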

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • A small optimization to do_wp_page's check for whether to avoid copy by
    reusing the page already mapped. It can never share a cached file page,
    nor can it share a reserved page (often the empty zero page), so it's a
    waste of time to lock and unlock in those cases. Both can nowadays be neatly
    excluded by a preliminary PageAnon test.

    Christoph has reported that a preliminary page_count test proved valuable
    for scalability here, but PageAnon covers more common cases all at once.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Since its birth, get_user_pages has been calling a misguided get_page_map
    function. follow_page has already returned NULL if the pfn is invalid; we
    cannot reach an invalid pfn from a validated struct page.

    Remove get_page_map, and the messy rewind in get_user_pages to cope with
    its failure. Oh, and could we please call that "struct page *page" like
    everywhere else, instead of "struct page *map"?

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Since free_pages_check complains if PG_reclaim or PG_slab is set, bad_page
    ought to clear them to avoid repetitive reports (Nikita noticed this too).
    Let prep_new_page check page_count and PG_slab as free_pages_check does.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Strict mbind's check for currently mapped pages being on node has been
    using a slow loop which re-evaluates pgd, pud, pmd, pte for each entry:
    replace that by a standard four-level page table walk like others in mm.
    Since mmap_sem is held for writing, page_table_lock can be taken at the
    inner level to limit latency.

    Signed-off-by: Hugh Dickins
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Strict mbind's check that pages already mapped are on right node has been
    using pte_page without checking if pfn_valid, and without page_table_lock
    to prevent spurious failures when try_to_unmap_one intervenes between the
    pte_present and the pte_page.

    Signed-off-by: Hugh Dickins
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • To improve shmem scalability, we allowed tmpfs instances which don't need
    their blocks or inodes limited not to count them, and not to allocate any
    sbinfo. Which was okay when the only use for the sbinfo was accounting
    blocks and inodes; but since then a couple of unrelated projects extending
    tmpfs want to store other data in the sbinfo. Whether either extension
    reaches mainline is beside the point: I'm guilty of a bad design decision,
    and should restore sbinfo to make any such future extensions easier.

    So, once again allocate a shmem_sb_info for every shmem/tmpfs instance, and
    now let max_blocks 0 indicate unlimited blocks, and max_inodes 0 unlimited
    inodes. Brent Casavant verified (many months ago) that this does not
    perceptibly impact the scalability (since the unlimited sbinfo cacheline is
    repeatedly accessed but only once dirtied).

    And merge shmem_set_size into its sole caller shmem_remount_fs.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins