09 Oct, 2007

3 commits


07 Oct, 2007

1 commit

  • When pinning and unpinning pagetables, we must protect them against
    being used by other CPUs, lest they see the pagetable in an
    intermediate read-only-but-not-pinned state.

    When using split pte locks, doing this properly would require taking
    all the pte locks for the pagetable while pinning, but this may overflow
    the PREEMPT_BITS part of the preempt counter if the process has mapped
    more than about 512M of memory (a back-of-the-envelope sketch of that
    arithmetic follows this entry).

    However, failing to take the pte locks causes write-protect faults when
    the pageout code is trying to clear the Access bit on a pte which is part
    of a freshly created and still-being-pinned process after fork.

    This is a short-term fix until the problem is solved properly.

    Signed-off-by: Jeremy Fitzhardinge
    Acked-by: Rik van Riel
    Acked-by: Hugh Dickins
    Cc: David Rientjes
    Cc: Andrew Morton
    Cc: Andi Kleen
    Cc: Keir Fraser
    Cc: Jan Beulich
    Signed-off-by: Linus Torvalds

    Jeremy Fitzhardinge
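
    A back-of-the-envelope sketch of the overflow mentioned above, written as a
    small standalone C program. The numbers (4 KB pages, 512 pte entries per
    pte page with PAE, an 8-bit PREEMPT_BITS field) are assumptions used for
    illustration, not values taken from the patch:

        /* Why holding every pte lock can overflow the preempt counter:
         * each held spinlock bumps the count by one, and the count field
         * is only PREEMPT_BITS wide. */
        #include <stdio.h>

        int main(void)
        {
            const long page_size      = 4096;          /* bytes per page              */
            const long ptes_per_table = 512;           /* pte entries per pte page    */
            const long preempt_bits   = 8;             /* assumed PREEMPT_BITS width  */
            const long mapped         = 512L << 20;    /* ~512M of mapped memory      */

            long pte_locks = mapped / (page_size * ptes_per_table);
            long limit     = (1L << preempt_bits) - 1; /* max nesting before wrapping */

            printf("pte locks needed: %ld, preempt count limit: %ld\n",
                   pte_locks, limit);
            printf("holding them all %s the counter\n",
                   pte_locks >= limit ? "would overflow" : "fits within");
            return 0;
        }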
     

05 Oct, 2007

1 commit

  • Gurudas Pai reports kernel BUG at arch/i386/mm/highmem.c:15! below
    sys_remap_file_pages, while running an Oracle database test on x86 with
    6GB RAM: kunmap thinks we're in_interrupt because the preempt count has
    wrapped (a toy model of the wrap follows this entry).

    That's because __do_fault expected to unmap page_table, but one of its
    two callers, do_nonlinear_fault, had already unmapped it: let
    do_linear_fault unmap it first too, and then there's no need to pass the
    page_table arg down.

    Why have we been so slow to notice this? Probably through forgetting
    that the mapping_cap_account_dirty test means that sys_remap_file_pages
    nowadays only goes the full nonlinear vma route on a few memory-backed
    filesystems like ramfs, tmpfs and hugetlbfs.

    [ It also depends on CONFIG_HIGHPTE, so it becomes even harder to
    trigger in practice. Many who have need of large memory have probably
    migrated to x86-64..

    Problem introduced by commit d0217ac04ca6591841e5665f518e38064f4e65bd
    ("mm: fault feedback #1") -- Linus ]

    Signed-off-by: Hugh Dickins
    Cc: gurudas pai
    Cc: Nick Piggin
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
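
    A toy model of the wrap described above, as a small standalone C program.
    The counter layout, mask value and *_sim helper names are simplified
    assumptions, not the real i386 encoding:

        /* kmap_atomic() bumps a preempt count and kunmap_atomic() drops it,
         * so an extra unmap underflows the counter and an in_interrupt()-style
         * test (masking the "interrupt" bits) suddenly reads non-zero. */
        #include <stdio.h>

        static unsigned int preempt_count;      /* 0 = plain process context */
        #define IRQ_MASK 0xffff0000u            /* assumed upper-bit layout   */

        static void kmap_atomic_sim(void)   { preempt_count++; }
        static void kunmap_atomic_sim(void) { preempt_count--; }
        static int  in_interrupt_sim(void)  { return (preempt_count & IRQ_MASK) != 0; }

        int main(void)
        {
            kmap_atomic_sim();          /* the fault path maps the page table   */
            kunmap_atomic_sim();        /* do_nonlinear_fault() unmaps it...    */
            kunmap_atomic_sim();        /* ...and __do_fault() unmaps it again  */

            printf("preempt_count = 0x%08x, in_interrupt() = %d\n",
                   preempt_count, in_interrupt_sim());
            return 0;
        }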
     

01 Oct, 2007

1 commit

  • The virtual address space argument of clear_user_highpage is supposed to be
    the virtual address where the page being cleared will eventually be mapped.
    This allows architectures with virtually indexed caches a few clever
    tricks. That sort of trick falls over in painful ways if the virtual
    address argument is wrong.

    Signed-off-by: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ralf Baechle
     

20 Sep, 2007

1 commit

  • This patch proposes fixes to the reference counting of memory policy in the
    page allocation paths and in show_numa_map(). Extracted from my "Memory
    Policy Cleanups and Enhancements" series as stand-alone.

    Shared policy lookup [shmem] has always added a reference to the policy,
    but this was never unrefed after page allocation or after formatting the
    numa map data.

    Default system policy should not require additional ref counting, nor
    should the current task's task policy. However, show_numa_map() calls
    get_vma_policy() to examine what may be [likely is] another task's policy.
    The latter case needs protection against freeing of the policy.

    This patch adds a reference count to a mempolicy returned by
    get_vma_policy() when the policy is a vma policy or another task's
    mempolicy. Again, shared policy is already reference counted on lookup. A
    matching "unref" [__mpol_free()] is performed in alloc_page_vma() for
    shared and vma policies, and in show_numa_map() for shared and another
    task's mempolicy. We can call __mpol_free() directly, saving an admittedly
    inexpensive inline NULL test, because we know we have a non-NULL policy.

    Handling policy ref counts for hugepages is a bit trickier.
    huge_zonelist() returns a zone list that might come from a shared or vma
    'BIND' policy. In this case, we should hold the reference until after the
    huge page allocation in dequeue_huge_page(). The patch modifies
    huge_zonelist() to return a pointer to the mempolicy if it needs to be
    unref'd after allocation. (A simplified sketch of the conditional
    ref/unref pattern follows this entry.)

    Kernel Build [16cpu, 32GB, ia64] - average of 10 runs:

                    w/o patch             w/ refcount patch
                    Avg       Std Devn    Avg       Std Devn
        Real:       100.59    0.38        100.63    0.43
        User:       1209.60   0.37        1209.91   0.31
        System:     81.52     0.42        81.64     0.34

    Signed-off-by: Lee Schermerhorn
    Acked-by: Andi Kleen
    Cc: Christoph Lameter
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
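
    A simplified sketch of the conditional ref/unref pattern described above,
    as a toy userspace program. The structure, fields and *_sim helpers are
    illustrative assumptions, not the kernel's mempolicy code:

        /* The lookup takes a reference only when the policy can outlive the
         * caller (shared policy or another task's policy); the caller drops
         * that reference once the allocation is done. */
        #include <stdio.h>
        #include <stdlib.h>

        struct mempolicy_sim {
            int refcnt;
            int shared;     /* 1 if looked up from a shared (shmem) policy */
        };

        static void mpol_get(struct mempolicy_sim *pol) { pol->refcnt++; }
        static void mpol_put(struct mempolicy_sim *pol)
        {
            if (--pol->refcnt == 0)
                free(pol);
        }

        /* Return the policy to use; take a reference iff the caller must unref. */
        static struct mempolicy_sim *get_vma_policy_sim(struct mempolicy_sim *pol,
                                                        int other_task)
        {
            if (pol->shared || other_task)
                mpol_get(pol);
            return pol;
        }

        int main(void)
        {
            struct mempolicy_sim *shared = calloc(1, sizeof(*shared));
            shared->refcnt = 1;
            shared->shared = 1;

            struct mempolicy_sim *pol = get_vma_policy_sim(shared, 0);
            /* ... allocate the page under 'pol' here ... */
            if (pol->shared)
                mpol_put(pol);          /* matching unref after allocation */

            printf("refcnt back to %d\n", shared->refcnt);
            mpol_put(shared);           /* drop the original reference */
            return 0;
        }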
     

12 Sep, 2007

1 commit

  • This was posted on Aug 28 and fixes an issue that could cause trouble
    when slab caches >=128k are created.

    http://marc.info/?l=linux-mm&m=118798149918424&w=2

    Currently we simply add the debug flags unconditionally when checking for a
    matching slab. This creates issues for sysfs processing when slabs exist
    that are exempt from debugging due to their huge size or because only a
    subset of slabs was selected for debugging.

    We need to only add the flags if kmem_cache_open() would also add them.

    Create a function to calculate the flags that would be set if the cache
    were opened, and use that function to determine the flags before looking
    for a compatible slab (a small illustration of this ordering follows this
    entry).

    [akpm@linux-foundation.org: fixlets]
    Signed-off-by: Christoph Lameter
    Cc: Chuck Ebbert
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
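
    A small illustration of the ordering described above: compute the flags a
    cache would actually get, then look for a compatible existing cache. The
    helper names, flag value and 128k cutoff are assumptions made for the
    example, not SLUB's real code:

        #include <stdio.h>

        #define FLAG_DEBUG      0x1u
        #define HUGE_SIZE_LIMIT (128u * 1024)   /* assumed "too big to debug" cutoff */

        static int debug_selected(const char *name) { (void)name; return 1; }

        /* The flags a kmem_cache_open()-style setup would really apply. */
        static unsigned int effective_flags(const char *name, unsigned long size,
                                            unsigned int flags)
        {
            if (size >= HUGE_SIZE_LIMIT || !debug_selected(name))
                return flags & ~FLAG_DEBUG;     /* debugging would be left off */
            return flags | FLAG_DEBUG;          /* global debug flags added    */
        }

        int main(void)
        {
            /* Compare against existing caches using the *effective* flags,
             * not the requested flags plus unconditional debug bits. */
            printf("128k cache flags:  0x%x\n", effective_flags("big", 128u * 1024, 0));
            printf("small cache flags: 0x%x\n", effective_flags("small", 256, 0));
            return 0;
        }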
     

31 Aug, 2007

4 commits

  • Page migration currently does not check whether the target of the move
    contains invalid nodes (if root attempts to migrate pages), and may try to
    allocate from invalid nodes if these are specified, leading to oopses.

    Return -EINVAL if an offline node is specified (a minimal sketch of the
    check follows this entry).

    Signed-off-by: Christoph Lameter
    Cc: Shaohua Li
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
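
    A minimal sketch of the node validation mentioned above, as a small
    userspace program. The bitmap and helper names are illustrative stand-ins
    for the kernel's node-online state, not the migration code itself:

        #include <stdio.h>
        #include <errno.h>

        #define MAX_NODES 8

        /* Toy "node online" map: only nodes 0 and 1 exist. */
        static const int node_online_map[MAX_NODES] = { 1, 1, 0, 0, 0, 0, 0, 0 };

        static int check_target_nodes(const int *requested, int count)
        {
            for (int i = 0; i < count; i++) {
                int nid = requested[i];
                if (nid < 0 || nid >= MAX_NODES || !node_online_map[nid])
                    return -EINVAL;     /* offline or bogus node: refuse the move */
            }
            return 0;
        }

        int main(void)
        {
            int ok[]  = { 0, 1 };
            int bad[] = { 1, 5 };       /* node 5 is offline in this toy map */

            printf("move to {0,1}: %d\n", check_target_nodes(ok, 2));
            printf("move to {1,5}: %d\n", check_target_nodes(bad, 2));
            return 0;
        }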
     
  • Do not BUG() if we cannot register a slab with sysfs. Just print an error.
    The only consequence of not registering is that the slab cache is not
    visible via /sys/slab. A BUG() may not be visible that early during boot
    and we have had multiple issues here already.

    Signed-off-by: Christoph Lameter
    Acked-by: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • In the migration fallback path, write_page() or lock_page() will be called,
    which causes sleeping while holding rcu_read_lock().
    To avoid that, only take the rcu lock if the page is Anon (this is enough).

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Don't try to free memory which we didn't allocate.

    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

23 Aug, 2007

9 commits

  • The NUMA layer only supports NUMA policies for the highest zone. When
    ZONE_MOVABLE is configured with kernelcore=, the highest zone becomes
    ZONE_MOVABLE. The result is that policies are only applied to allocations
    like anonymous pages and page cache allocated from ZONE_MOVABLE when the
    zone is used.

    This patch applies policies to the two highest zones when the highest zone
    is ZONE_MOVABLE. As ZONE_MOVABLE consists of pages from the highest "real"
    zone, it's always functionally equivalent.

    The patch has been tested on a variety of machines both NUMA and non-NUMA
    covering x86, x86_64 and ppc64. No abnormal results were seen in
    kernbench, tbench, dbench or hackbench. It passes regression tests from
    the numactl package with and without kernelcore= once numactl tests are
    patched to wait for vmstat counters to update.

    akpm: this is the nasty hack to fix NUMA mempolicies in the presence of
    ZONE_MOVABLE and kernelcore= in 2.6.23. Christoph says "For .24 either merge
    the mobility or get the other solution that Mel is working on. That solution
    would only use a single zonelist per node and filter on the fly. That may
    help performance and also help to make memory policies work better."

    Signed-off-by: Mel Gorman
    Acked-by: Lee Schermerhorn
    Tested-by: Lee Schermerhorn
    Acked-by: Christoph Lameter
    Cc: Andi Kleen
    Cc: Paul Mundt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Print a big fat warning and do what is necessary to continue if a node is
    marked as up (meaning either node is online (upstream) or node has memory
    (Andrew's tree)) but allocations from the node do not succeed.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • SLUB is using atomic_read() for variables declared atomic_long_t.
    Switch to atomic_long_read() (a minimal illustration of the mismatch
    follows this entry).

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
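
    A minimal userspace illustration of why the accessor width matters,
    assuming a 64-bit (LP64) build; the *_sim helpers are stand-ins for the
    idea, not the kernel's atomic API:

        #include <stdio.h>

        struct atomic_long_sim { long counter; };

        /* Wrong: reads the long-sized counter through an int-sized accessor,
         * silently truncating the upper half on 64-bit. */
        static int  atomic_read_sim(const struct atomic_long_sim *v)      { return (int)v->counter; }
        /* Right: accessor width matches the declaration. */
        static long atomic_long_read_sim(const struct atomic_long_sim *v) { return v->counter; }

        int main(void)
        {
            struct atomic_long_sim total = { .counter = 5L * 1024 * 1024 * 1024 };

            printf("int-sized read:  %d\n",  atomic_read_sim(&total));
            printf("long-sized read: %ld\n", atomic_long_read_sim(&total));
            return 0;
        }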
     
  • It seems a simple mistake was made when converting follow_hugetlb_page()
    over to the VM_FAULT flags bitmasks (in "mm: fault feedback #2", commit
    83c54070ee1a2d05c89793884bea1a03f2851ed4).

    By using the wrong bitmask, hugetlb_fault() failures are not being
    recognized. This results in an infinite loop whenever follow_hugetlb_page
    is involved in a failed fault.

    Signed-off-by: Adam Litke
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     
  • Skip calling cache_free_alien() when the platform is not NUMA capable.
    This will avoid the cache misses that happen while accessing slabp (a
    per-page memory reference) to get the nodeid. Instead use a global
    variable to decide whether to skip the call; that variable is most likely
    to already be present in the cache. (A rough sketch of the idea follows
    this entry.)

    This gives a 0.8% performance boost with the database OLTP workload on a
    quad-core SMP platform, and by no means is that number small :)

    Signed-off-by: Suresh Siddha
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Siddha, Suresh B
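
    A rough sketch of the idea described above: consult a single global (very
    likely already cache-hot) before touching per-page data. Names and the
    "node 0 is local" convention are illustrative assumptions, not mm/slab.c:

        #include <stdio.h>

        static int use_alien_caches;            /* set once at boot on NUMA systems */

        struct slab_page_sim { int nodeid; };   /* touching this may cost a cache miss */

        static int cache_free_alien_sim(struct slab_page_sim *slabp)
        {
            /* Only here do we need the object's home node. */
            return slabp->nodeid != 0;          /* pretend node 0 is the local node */
        }

        static void kfree_sim(struct slab_page_sim *slabp)
        {
            if (use_alien_caches && cache_free_alien_sim(slabp)) {
                printf("freed to alien cache (remote node %d)\n", slabp->nodeid);
                return;
            }
            printf("freed to local cache\n");
        }

        int main(void)
        {
            struct slab_page_sim obj = { .nodeid = 1 };

            use_alien_caches = 0;               /* non-NUMA: the global short-circuits */
            kfree_sim(&obj);

            use_alien_caches = 1;               /* NUMA: the per-page nodeid is read   */
            kfree_sim(&obj);
            return 0;
        }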
     
  • The new exec code inserts an accounted vma into an mm struct which is not
    current->mm. The existing memory check code has a hard-coded assumption
    that this does not happen, as does the security code.

    As the correct mm is known we pass the mm to the security method and the
    helper function. A new security test is added for the case where we need
    to pass the mm and the existing one is modified to pass current->mm to
    avoid the need to change large amounts of code.

    (Thanks to Tobias for fixing rejects and testing)

    Signed-off-by: Alan Cox
    Cc: WU Fengguang
    Cc: James Morris
    Cc: Tobias Diedrich
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Cox
     
  • Lumpy reclaim works by selecting a lead page from the LRU list and then
    selecting pages for reclaim from the order-aligned area of pages. In the
    situation where all pages in that region are inactive and not referenced
    by any process over time, it works well.

    In the situation where there is even light load on the system, the pages may
    not free quickly. Out of an area of 1024 pages, maybe only 950 of them are
    freed when the allocation attempt occurs because lumpy reclaim returned early.
    This patch alters the behaviour of direct reclaim for large contiguous
    blocks.

    The first attempt to call shrink_page_list() is asynchronous but if it fails,
    the pages are submitted a second time and the calling process waits for the IO
    to complete. This may stall allocators waiting for contiguous memory but that
    should be expected behaviour for high-order users. It is preferable behaviour
    to potentially queueing unnecessary areas for IO. Note that kswapd will not
    stall in this fashion. (A simplified sketch of the async-then-sync retry
    follows this entry.)

    [apw@shadowen.org: update to version 2]
    [apw@shadowen.org: update to version 3]
    Signed-off-by: Mel Gorman
    Signed-off-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
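
    A simplified sketch of the async-then-sync retry described above, as a toy
    userspace program. The page model (a clean page frees immediately, a dirty
    one only after waiting for its IO) and all names are illustrative
    assumptions, not mm/vmscan.c:

        #include <stdio.h>

        enum pageout_mode { PAGEOUT_ASYNC, PAGEOUT_SYNC };

        /* Pretend a page frees right away only if it was already clean. */
        static int try_to_free(int dirty, enum pageout_mode mode)
        {
            if (!dirty)
                return 1;
            return mode == PAGEOUT_SYNC;    /* waiting for the IO drains it */
        }

        static int shrink_pages(const int *dirty, int n, enum pageout_mode mode)
        {
            int freed = 0;
            for (int i = 0; i < n; i++)
                freed += try_to_free(dirty[i], mode);
            return freed;
        }

        int main(void)
        {
            int dirty[8] = { 0, 1, 0, 0, 1, 1, 0, 0 };  /* toy order-3 block */
            int n = 8;

            int freed = shrink_pages(dirty, n, PAGEOUT_ASYNC);  /* first pass  */
            if (freed < n)                                      /* still busy? */
                freed = shrink_pages(dirty, n, PAGEOUT_SYNC);   /* wait for IO */

            printf("freed %d of %d pages in the block\n", freed, n);
            return 0;
        }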
     
  • As pointed out by Mel, when reclaim is applied at higher orders a significant
    amount of IO may be started. As this takes finite time to drain, reclaim will
    consider more areas than ultimately needed to satisfy the request. This leads
    to more reclaim than strictly required and reduced success rates.

    I was able to confirm Mel's test results on systems locally. These show that
    even under light load the success rates drop off far more than expected.
    Testing with a modified version of his patch (which follows) I was able to
    allocate almost all of ZONE_MOVABLE with a near idle system. I ran 5 test
    passes sequentially following system boot (the system has 29 hugepages in
    ZONE_MOVABLE):

    2.6.23-rc1   11  8  6  7  7
    sync_lumpy   28 28 29 29 26

    These show that, although hugely better than the near-0% success normally
    expected, we can only allocate about 1/4 of the zone. Using synchronous
    reclaim for these allocations we get close to 100% as expected.

    I have also run our standard high order tests and these show no regressions in
    allocation success rates at rest, and some significant improvements under
    load.

    This patch:

    We are transitioning pages from active to inactive in clear_active_flags,
    those need counting as PGDEACTIVATE vm events.

    Signed-off-by: Andy Whitcroft
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • Booting SPARSEMEM on NUMA systems trips a BUG in page_alloc.c:

    Initializing HighMem for node 0 (00038000:00100000)
    Initializing HighMem for node 1 (00100000:001ffe00)
    ------------[ cut here ]------------
    kernel BUG at /home/apw/git/linux-2.6/mm/page_alloc.c:456!
    [...]

    This occurs because the section to node id mapping is not being
    set up correctly during init under SPARSEMEM_STATIC, leading to an
    attempt to free pages from all nodes into the zones on node 0.

    When the zone_table[] was removed in the following commit, a new
    section to node mapping table was introduced:

    commit 89689ae7f95995723fbcd5c116c47933a3bb8b13
    [PATCH] Get rid of zone_table[]

    That conversion inadvertently only initialised the node mapping in
    SPARSEMEM_EXTREME. Ensure we initialise the node mapping in
    SPARSEMEM_STATIC.

    [akpm@linux-foundation.org: make the stubs static inline]
    Signed-off-by: Andy Whitcroft
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     

12 Aug, 2007

3 commits


10 Aug, 2007

2 commits

  • The dynamic dma kmalloc creation can run into trouble if a
    GFP_ATOMIC allocation is the first one performed for a certain size
    of dma kmalloc slab.

    - Move the adding of the slab to sysfs into a workqueue
    (sysfs does GFP_KERNEL allocations)
    - Do not call kmem_cache_destroy() (uses slub_lock)
    - Only acquire the slub_lock once and--if we cannot wait--do a trylock.

    This introduces a slight risk of the first kmalloc(x, GFP_DMA|GFP_ATOMIC)
    for a range of sizes failing due to another process holding the slub_lock.
    However, we only need to acquire the spinlock once in order to establish
    each power of two DMA kmalloc cache. The possible conflict is with the
    slub_lock taken during slab management actions (create / remove slab cache).

    It is rather typical that a driver will first fill its buffers using
    GFP_KERNEL allocations which will wait until the slub_lock can be acquired.
    Drivers will also create their slab caches first, outside of an atomic
    context, before starting to use atomic kmalloc from an interrupt context.

    If there are any failures then they will occur early after boot or when
    multiple drivers are loaded concurrently. Drivers can already accommodate
    failures of GFP_ATOMIC for other reasons. Retries will then create the slab.
    (A sketch of the trylock-when-atomic pattern follows this entry.)

    Signed-off-by: Christoph Lameter

    Christoph Lameter
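
    A sketch of the lock-or-trylock split described above, using a pthread
    mutex as a stand-in for slub_lock. This is a userspace analogy of the
    pattern, not the SLUB code itself:

        #include <stdio.h>
        #include <stdbool.h>
        #include <pthread.h>

        static pthread_mutex_t slub_lock_sim = PTHREAD_MUTEX_INITIALIZER;

        static bool create_dma_cache(bool can_wait)
        {
            if (can_wait)
                pthread_mutex_lock(&slub_lock_sim);         /* GFP_KERNEL: sleeping is fine */
            else if (pthread_mutex_trylock(&slub_lock_sim)) /* GFP_ATOMIC: must not sleep   */
                return false;                               /* busy: caller retries later   */

            /* ... set up the kmalloc-dma cache here ... */
            pthread_mutex_unlock(&slub_lock_sim);
            return true;
        }

        int main(void)
        {
            printf("atomic attempt:   %s\n", create_dma_cache(false) ? "ok" : "deferred");
            printf("sleeping attempt: %s\n", create_dma_cache(true)  ? "ok" : "deferred");
            return 0;
        }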
     
  • The MAX_PARTIAL checks were supposed to be an optimization. However, slab
    shrinking is a manually triggered process either through running slabinfo
    or by the kernel calling kmem_cache_shrink.

    If one really wants to shrink a slab then all operations should be done
    regardless of the size of the partial list. This also fixes an issue that
    could surface if the number of partial slabs was initially above MAX_PARTIAL
    in kmem_cache_shrink and later dropped below MAX_PARTIAL through the
    elimination of empty slabs on the partial list (rare). In that case a few
    slabs may be left off the partial list (and only be put back when they
    are empty).

    Signed-off-by: Christoph Lameter

    Christoph Lameter
     

01 Aug, 2007

3 commits

  • Fix kernel-doc warning:
    Warning(linux-2.6.23-rc1-mm1//mm/filemap.c:864): No description found for parameter 'ra'

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • In badness(), the automatic variable 'points' is unsigned long. Print it
    as such.

    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • out_of_memory() may be called when an allocation is failing and the direct
    reclaim is not making any progress. This does not take into account the
    requested order of the allocation. If the request is for an order larger
    than PAGE_ALLOC_COSTLY_ORDER, it is reasonable to fail the allocation
    because the kernel makes no guarantees about those allocations succeeding.

    This false OOM situation can occur if a user is trying to grow the hugepage
    pool in a script like:

    #!/bin/bash
    REQUIRED=$1
    echo 1 > /proc/sys/vm/hugepages_treat_as_movable
    echo $REQUIRED > /proc/sys/vm/nr_hugepages
    ACTUAL=`cat /proc/sys/vm/nr_hugepages`
    while [ $REQUIRED -ne $ACTUAL ]; do
        echo Huge page pool at $ACTUAL growing to $REQUIRED
        echo $REQUIRED > /proc/sys/vm/nr_hugepages
        ACTUAL=`cat /proc/sys/vm/nr_hugepages`
        sleep 1
    done

    This is a reasonable scenario when ZONE_MOVABLE is in use but triggers OOM
    easily on 2.6.23-rc1. This patch will fail an allocation for an order above
    PAGE_ALLOC_COSTLY_ORDER instead of killing processes and retrying (a minimal
    sketch of the order check follows this entry).

    Signed-off-by: Mel Gorman
    Acked-by: Andy Whitcroft
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
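
    A minimal sketch of the order check described above. PAGE_ALLOC_COSTLY_ORDER
    is the kernel's name for the threshold (value 3); the helper and its
    arguments are illustrative assumptions:

        #include <stdio.h>
        #include <stdbool.h>

        #define PAGE_ALLOC_COSTLY_ORDER 3

        static bool should_invoke_oom(unsigned int order, bool reclaim_progress)
        {
            if (reclaim_progress)
                return false;               /* keep retrying the allocation      */
            if (order > PAGE_ALLOC_COSTLY_ORDER)
                return false;               /* just fail it: no OOM kill here    */
            return true;                    /* small order, no progress: go OOM  */
        }

        int main(void)
        {
            /* A hugepage-sized request (e.g. order 10) that reclaim cannot
             * satisfy should fail quietly instead of killing processes. */
            printf("order 10, no progress -> OOM? %d\n", should_invoke_oom(10, false));
            printf("order 0,  no progress -> OOM? %d\n", should_invoke_oom(0, false));
            return 0;
        }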
     

31 Jul, 2007

2 commits


30 Jul, 2007

3 commits

  • Remove fs.h from mm.h. For this,
    1) Uninline vma_wants_writenotify(). It's pretty huge anyway.
    2) Add back fs.h or less bloated headers (err.h) to files that need it.

    As a result, on x86_64 allyesconfig, the number of files rebuilt because of
    fs.h dependencies drops from 3929 to 3444 (-12.3%).

    Cross-compile tested without regressions on my two usual configs and (sigh):

    alpha arm-mx1ads mips-bigsur powerpc-ebony
    alpha-allnoconfig arm-neponset mips-capcella powerpc-g5
    alpha-defconfig arm-netwinder mips-cobalt powerpc-holly
    alpha-up arm-netx mips-db1000 powerpc-iseries
    arm arm-ns9xxx mips-db1100 powerpc-linkstation
    arm-assabet arm-omap_h2_1610 mips-db1200 powerpc-lite5200
    arm-at91rm9200dk arm-onearm mips-db1500 powerpc-maple
    arm-at91rm9200ek arm-picotux200 mips-db1550 powerpc-mpc7448_hpc2
    arm-at91sam9260ek arm-pleb mips-ddb5477 powerpc-mpc8272_ads
    arm-at91sam9261ek arm-pnx4008 mips-decstation powerpc-mpc8313_rdb
    arm-at91sam9263ek arm-pxa255-idp mips-e55 powerpc-mpc832x_mds
    arm-at91sam9rlek arm-realview mips-emma2rh powerpc-mpc832x_rdb
    arm-ateb9200 arm-realview-smp mips-excite powerpc-mpc834x_itx
    arm-badge4 arm-rpc mips-fulong powerpc-mpc834x_itxgp
    arm-carmeva arm-s3c2410 mips-ip22 powerpc-mpc834x_mds
    arm-cerfcube arm-shannon mips-ip27 powerpc-mpc836x_mds
    arm-clps7500 arm-shark mips-ip32 powerpc-mpc8540_ads
    arm-collie arm-simpad mips-jazz powerpc-mpc8544_ds
    arm-corgi arm-spitz mips-jmr3927 powerpc-mpc8560_ads
    arm-csb337 arm-trizeps4 mips-malta powerpc-mpc8568mds
    arm-csb637 arm-versatile mips-mipssim powerpc-mpc85xx_cds
    arm-ebsa110 i386 mips-mpc30x powerpc-mpc8641_hpcn
    arm-edb7211 i386-allnoconfig mips-msp71xx powerpc-mpc866_ads
    arm-em_x270 i386-defconfig mips-ocelot powerpc-mpc885_ads
    arm-ep93xx i386-up mips-pb1100 powerpc-pasemi
    arm-footbridge ia64 mips-pb1500 powerpc-pmac32
    arm-fortunet ia64-allnoconfig mips-pb1550 powerpc-ppc64
    arm-h3600 ia64-bigsur mips-pnx8550-jbs powerpc-prpmc2800
    arm-h7201 ia64-defconfig mips-pnx8550-stb810 powerpc-ps3
    arm-h7202 ia64-gensparse mips-qemu powerpc-pseries
    arm-hackkit ia64-sim mips-rbhma4200 powerpc-up
    arm-integrator ia64-sn2 mips-rbhma4500 s390
    arm-iop13xx ia64-tiger mips-rm200 s390-allnoconfig
    arm-iop32x ia64-up mips-sb1250-swarm s390-defconfig
    arm-iop33x ia64-zx1 mips-sead s390-up
    arm-ixp2000 m68k mips-tb0219 sparc
    arm-ixp23xx m68k-amiga mips-tb0226 sparc-allnoconfig
    arm-ixp4xx m68k-apollo mips-tb0287 sparc-defconfig
    arm-jornada720 m68k-atari mips-workpad sparc-up
    arm-kafa m68k-bvme6000 mips-wrppmc sparc64
    arm-kb9202 m68k-hp300 mips-yosemite sparc64-allnoconfig
    arm-ks8695 m68k-mac parisc sparc64-defconfig
    arm-lart m68k-mvme147 parisc-allnoconfig sparc64-up
    arm-lpd270 m68k-mvme16x parisc-defconfig um-x86_64
    arm-lpd7a400 m68k-q40 parisc-up x86_64
    arm-lpd7a404 m68k-sun3 powerpc x86_64-allnoconfig
    arm-lubbock m68k-sun3x powerpc-cell x86_64-defconfig
    arm-lusl7200 mips powerpc-celleb x86_64-up
    arm-mainstone mips-atlas powerpc-chrp32

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Introduce CONFIG_SUSPEND representing the ability to enter system sleep
    states, such as the ACPI S3 state, and allow the user to choose SUSPEND
    and HIBERNATION independently of each other.

    Make HOTPLUG_CPU be selected automatically if SUSPEND or HIBERNATION has
    been chosen and the kernel is intended for SMP systems.

    Also, introduce CONFIG_PM_SLEEP which is automatically selected if
    CONFIG_SUSPEND or CONFIG_HIBERNATION is set and use it to select the
    code needed for both suspend and hibernation.

    The top-level power management headers and the ACPI code related to
    suspend and hibernation are modified to use the new definitions (the
    changes in drivers/acpi/sleep/main.c are, mostly, moving code to reduce
    the number of ifdefs).

    There are many other files in which CONFIG_PM can be replaced with
    CONFIG_PM_SLEEP or even with CONFIG_SUSPEND, but they can be updated in
    the future.

    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • Replace CONFIG_SOFTWARE_SUSPEND with CONFIG_HIBERNATION to avoid
    confusion (among other things, with CONFIG_SUSPEND introduced in the
    next patch).

    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     

27 Jul, 2007

3 commits

  • With the introduction of kernelcore=, a configurable zone is created on
    request. In some cases, this value will be small enough that some nodes
    contain only ZONE_MOVABLE. On some NUMA configurations when this occurs,
    arch-independent zone-sizing will get the size of the memory holes within
    the node incorrect. The value of present_pages goes negative and the boot
    fails.

    This patch fixes the bug in the calculation of the size of the hole. The
    test case is to boot test a NUMA machine with a low value of kernelcore=
    before and after the patch is applied. While this bug exists in earlier
    kernels, it cannot be triggered in practice.

    This patch has been boot-tested on a variety of machines with and without
    kernelcore= set.

    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • release_pages() in mm/swap.c changes page_count() to 0 without clearing
    the PageLRU flag...

    This means isolate_lru_page() can see a page with PageLRU() set and
    page_count(page)==0. This is a BUG (get_page() will be called against a
    count=0 page).

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Usually, migrate_pages(page,,) is called while holding mm->sem via a system
    call (mm here is the mm_struct which maps the migration target page).
    This semaphore helps avoid some race conditions.

    But if we want to migrate a page from kernel code, we have to avoid
    some races. This patch adds checks for the following race conditions.

    1. A page whose page->mapping==NULL can be a target of migration. So we
    have to check page->mapping before calling try_to_unmap().

    2. anon_vma can be freed while the page is unmapped, but page->mapping
    remains as it was: page->mapcount drops to 0, and then we cannot trust
    page->mapping. So use rcu_read_lock() to prevent the anon_vma pointed to
    by page->mapping from being freed during migration.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

25 Jul, 2007

3 commits

  • * 'request-queue-t' of git://git.kernel.dk/linux-2.6-block:
    [BLOCK] Add request_queue_t and mark it deprecated
    [BLOCK] Get rid of request_queue_t typedef

    Linus Torvalds
     
  • Use the correct local variable when calling into the page allocator. Local
    `flags' can have __GFP_ZERO set, which causes us to pass __GFP_ZERO into the
    page allocator, possibly from illegal contexts. The page allocator will later
    do prep_zero_page()->kmap_atomic(..., KM_USER0) from irq contexts and will
    then go BUG.

    Cc: Mike Galbraith
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • dequeue_huge_page() has a serious memory leak upon hugetlb page
    allocation. The for loop continues allocating hugetlb pages out of
    all allowable zones, whereas this function is supposed to dequeue one
    and only one page.

    Fix it by breaking out of the for loop once a hugetlb page is found
    (a minimal sketch of the corrected loop follows this entry).

    Signed-off-by: Ken Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Chen
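
    A minimal sketch of the corrected loop, as a toy program: stop scanning as
    soon as one hugepage has been dequeued instead of taking a page from every
    allowable zone and leaking all but the returned one. The data structures
    are illustrative stand-ins, not mm/hugetlb.c:

        #include <stdio.h>

        #define NR_ZONES 3

        static int free_huge_pages_in_zone[NR_ZONES] = { 2, 3, 1 };

        static int dequeue_huge_page_sim(void)
        {
            int page = -1;                   /* -1 means nothing dequeued yet  */

            for (int z = 0; z < NR_ZONES; z++) {
                if (free_huge_pages_in_zone[z] > 0) {
                    free_huge_pages_in_zone[z]--;
                    page = z;                /* "the" page, named by its zone  */
                    break;                   /* the fix: take one page, stop   */
                }
            }
            return page;
        }

        int main(void)
        {
            int page = dequeue_huge_page_sim();

            printf("dequeued a hugepage from zone %d\n", page);
            printf("remaining per zone: %d %d %d\n",
                   free_huge_pages_in_zone[0],
                   free_huge_pages_in_zone[1],
                   free_huge_pages_in_zone[2]);
            return 0;
        }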