27 Jul, 2008

2 commits

  • Fixes a build failure reported by Alan Cox:

    mm/hugetlb.c: In function `hugetlb_acct_memory': mm/hugetlb.c:1507:
    error: implicit declaration of function `cpuset_mems_nr'

    Also reverts Ingo's

    commit e44d1b2998d62a1f2f4d7eb17b56ba396535509f
    Author: Ingo Molnar
    Date: Fri Jul 25 12:57:41 2008 +0200

    mm/hugetlb.c: fix build failure with !CONFIG_SYSCTL

    which fixed the build error but added some unused-static-function warnings.

    Signed-off-by: Nishanth Aravamudan
    Cc: Alan Cox
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     
  • Fix this build error, seen on avr32:

    include/linux/utsname.h:35,
    from init/main.c:20:
    include/linux/sched.h: In function 'arch_pick_mmap_layout':
    include/linux/sched.h:2149: error: implicit declaration of function 'PAGE_ALIGN'

    Reported-by: Adrian Bunk
    Cc: Haavard Skinnemoen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

26 Jul, 2008

14 commits

  • With !CONFIG_SYSCTL on x86 and latest -git I get:

    mm/hugetlb.c: In function 'decrement_hugepage_resv_vma':
    mm/hugetlb.c:83: error: 'reserve' undeclared (first use in this function)
    mm/hugetlb.c:83: error: (Each undeclared identifier is reported only once
    mm/hugetlb.c:83: error: for each function it appears in.)

    Signed-off-by: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Under heavy memory load, application response can become bad because
    applications take time reclaiming memory. Statistics on how long
    memory reclaim takes are useful for measuring memory behaviour.

    This patch adds memory-reclaim accounting to per-task delay accounting,
    recording the time spent in do_try_to_free_pages().

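    As a sketch of where the accounting hooks in (delayacct_freepages_start()
    and delayacct_freepages_end() are the helpers this patch adds; the
    surrounding function is abridged, not a hunk from the patch):

    /* mm/vmscan.c (abridged): bracket the synchronous reclaim entry point */
    unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
                                    gfp_t gfp_mask)
    {
            unsigned long nr_reclaimed;

            delayacct_freepages_start();    /* timestamp reclaim entry */
            nr_reclaimed = do_try_to_free_pages(zonelist, order, gfp_mask);
            delayacct_freepages_end();      /* charge elapsed time to the
                                               task's RECLAIM delay total */
            return nr_reclaimed;
    }
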
    - When the system is under low memory load,
    memory reclaim may not occur.

    $ free
                 total       used       free     shared    buffers     cached
    Mem:       8197800    1577300    6620500          0       4808    1516724
    -/+ buffers/cache:      55768    8142032
    Swap:     16386292          0   16386292

    $ vmstat 1
    procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
     r  b   swpd    free   buff    cache   si   so    bi    bo   in   cs us sy id wa
     0  0      0 5069748  10612  3014060    0    0     0     0    3   26  0  0 100  0
     0  0      0 5069748  10612  3014060    0    0     0     0    4   22  0  0 100  0
     0  0      0 5069748  10612  3014060    0    0     0     0    3   18  0  0 100  0

    Measure the time of tar command.

    $ ls -s test.dat
    1501472 test.dat

    $ time tar cvf test.tar test.dat
    real 0m13.388s
    user 0m0.116s
    sys 0m5.304s

    $ ./delayget -d -p
    CPU      count     real total  virtual total    delay total
               428     5528345500     5477116080       62749891
    IO       count    delay total
               338     8078977189
    SWAP     count    delay total
                 0              0
    RECLAIM  count    delay total
                 0              0

    - When the system is under heavy memory load,
    memory reclaim may occur.

    $ vmstat 1
    procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
     r  b    swpd    free   buff   cache   si   so    bi    bo   in   cs us sy id wa
     0  0 7159032   49724   1812    3012    0    0     0     0    3   24  0  0 100  0
     0  0 7159032   49724   1812    3012    0    0     0     0    4   24  0  0 100  0
     0  0 7159032   49848   1812    3012    0    0     0     0    3   22  0  0 100  0

    In this case, one process is using more than 8GB of memory
    via malloc() and memset().

    $ time tar cvf test.tar test.dat
    real 1m38.563s

    CPU      count     real total  virtual total    delay total
              9021     7140446250     7315277975      923201824
    IO       count    delay total
              8965    90466349669
    SWAP     count    delay total
                 3       21036367
    RECLAIM  count    delay total
               740    61011951153

    In the latter case, the RECLAIM value increases, so taskstats can show
    how much memory reclaim influences turnaround time (TAT).

    Signed-off-by: Keika Kobayashi
    Acked-by: Balbir Singh
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Keika Kobayashi
     
  • Shrink a memory cgroup's usage when its limit is changed.

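    A hedged sketch of the retry shape (res_counter_set_limit() and the
    retry bound are assumptions about this patch, not quotes from it):

    /* Lower the limit; reclaim from the group until the counter accepts
     * the new value or the retry budget runs out. (Assumed helpers.) */
    int retry = MEM_CGROUP_RECLAIM_RETRIES;
    int ret = 0;

    while (res_counter_set_limit(&memcg->res, val)) {
            if (!retry-- || signal_pending(current)) {
                    ret = -EBUSY;
                    break;
            }
            try_to_free_mem_cgroup_pages(memcg, GFP_KERNEL);
    }
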
    [akpm@linux-foundation.org: coding-style fixes]
    Acked-by: Balbir Singh
    Acked-by: Pavel Emelyanov
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • These checks are unnecessary: when the subsystem is disabled it can't
    be mounted, so those functions won't be called.

    The check is only needed in functions that are called from places
    other than the cgroup core.

    [hugh@veritas.com: further checking of disabled flag]
    Signed-off-by: Li Zefan
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Because of the refcnt-removal patch, it is now a very rare case that
    mem_cgroup_charge_common() is called against a page which is already
    accounted.

    mem_cgroup_charge_common() is called when:
    1. a page is added into the file cache.
    2. an anon page is _newly_ mapped.

    A racy case is a newly-swapped-in anonymous page being referenced by
    plural threads in do_swap_page() at the same time (the page is not
    locked when mem_cgroup_charge() is called from do_swap_page()).

    Another case is shmem, which charges its page before calling
    add_to_page_cache(); mem_cgroup_cache_charge() is then called twice.
    This double call is handled inside mem_cgroup_cache_charge() itself,
    but the check may be too hacky...

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Li Zefan
    Cc: Hugh Dickins
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Show branch direction (likely/unlikely) for obvious conditions.

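    For instance (an illustrative snippet, not a hunk from the patch),
    annotating a branch whose direction is obvious lets gcc lay out the
    hot path first:

    /* The charge almost never fails, so mark the error branch cold. */
    if (unlikely(mem_cgroup_charge(page, mm, gfp_mask)))
            goto fail;
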
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Li Zefan
    Cc: Hugh Dickins
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • A new call, mem_cgroup_shrink_usage(), is added for shmem handling,
    replacing non-standard usage of mem_cgroup_charge/uncharge.

    Now, shmem calls mem_cgroup_charge() just to reclaim some pages from
    the mem_cgroup. In general, shmem is used by some process group rather
    than as a global resource (like file caches), so it's reasonable to
    reclaim pages from the mem_cgroup where shmem is mainly used.

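    A hedged sketch of the new call (the shape follows the usual memcg
    reclaim-retry pattern; treat the details as assumptions rather than
    the patch's exact code):

    /* Reclaim some pages from the cgroup that mm's owner belongs to. */
    int mem_cgroup_shrink_usage(struct mm_struct *mm, gfp_t gfp_mask)
    {
            struct mem_cgroup *mem;
            int retry = MEM_CGROUP_RECLAIM_RETRIES;

            rcu_read_lock();
            mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
            css_get(&mem->css);     /* pin the group across reclaim */
            rcu_read_unlock();

            do {
                    if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
                            break;  /* made progress */
            } while (--retry);

            css_put(&mem->css);
            return retry ? 0 : -ENOMEM;
    }
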
    [hugh@veritas.com: shmem_getpage release page sooner]
    [hugh@veritas.com: mem_cgroup_shrink_usage css_put]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Li Zefan
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Cc: David Rientjes
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • memcg: performance improvements

    Patch Description
    1/5 ... remove refcnt from page_cgroup patch (shmem handling is fixed)
    2/5 ... swapcache handling patch
    3/5 ... add helper function for shmem's memory reclaim patch
    4/5 ... optimize by likely/unlikely patch
    5/5 ... remove redundant check patch (shmem handling is fixed.)

    Unix bench results:

    == 2.6.26-rc2-mm1 + memory resource controller ==
    Execl Throughput 2915.4 lps (29.6 secs, 3 samples)
    C Compiler Throughput 1019.3 lpm (60.0 secs, 3 samples)
    Shell Scripts (1 concurrent) 5796.0 lpm (60.0 secs, 3 samples)
    Shell Scripts (8 concurrent) 1097.7 lpm (60.0 secs, 3 samples)
    Shell Scripts (16 concurrent) 565.3 lpm (60.0 secs, 3 samples)
    File Read 1024 bufsize 2000 maxblocks 1022128.0 KBps (30.0 secs, 3 samples)
    File Write 1024 bufsize 2000 maxblocks 544057.0 KBps (30.0 secs, 3 samples)
    File Copy 1024 bufsize 2000 maxblocks 346481.0 KBps (30.0 secs, 3 samples)
    File Read 256 bufsize 500 maxblocks 319325.0 KBps (30.0 secs, 3 samples)
    File Write 256 bufsize 500 maxblocks 148788.0 KBps (30.0 secs, 3 samples)
    File Copy 256 bufsize 500 maxblocks 99051.0 KBps (30.0 secs, 3 samples)
    File Read 4096 bufsize 8000 maxblocks 2058917.0 KBps (30.0 secs, 3 samples)
    File Write 4096 bufsize 8000 maxblocks 1606109.0 KBps (30.0 secs, 3 samples)
    File Copy 4096 bufsize 8000 maxblocks 854789.0 KBps (30.0 secs, 3 samples)
    Dc: sqrt(2) to 99 decimal places 126145.2 lpm (30.0 secs, 3 samples)

    INDEX VALUES
    TEST BASELINE RESULT INDEX

    Execl Throughput 43.0 2915.4 678.0
    File Copy 1024 bufsize 2000 maxblocks 3960.0 346481.0 875.0
    File Copy 256 bufsize 500 maxblocks 1655.0 99051.0 598.5
    File Copy 4096 bufsize 8000 maxblocks 5800.0 854789.0 1473.8
    Shell Scripts (8 concurrent) 6.0 1097.7 1829.5
    =========
    FINAL SCORE 991.3

    == 2.6.26-rc2-mm1 + this set ==
    Execl Throughput 3012.9 lps (29.9 secs, 3 samples)
    C Compiler Throughput 981.0 lpm (60.0 secs, 3 samples)
    Shell Scripts (1 concurrent) 5872.0 lpm (60.0 secs, 3 samples)
    Shell Scripts (8 concurrent) 1120.3 lpm (60.0 secs, 3 samples)
    Shell Scripts (16 concurrent) 578.0 lpm (60.0 secs, 3 samples)
    File Read 1024 bufsize 2000 maxblocks 1003993.0 KBps (30.0 secs, 3 samples)
    File Write 1024 bufsize 2000 maxblocks 550452.0 KBps (30.0 secs, 3 samples)
    File Copy 1024 bufsize 2000 maxblocks 347159.0 KBps (30.0 secs, 3 samples)
    File Read 256 bufsize 500 maxblocks 314644.0 KBps (30.0 secs, 3 samples)
    File Write 256 bufsize 500 maxblocks 151852.0 KBps (30.0 secs, 3 samples)
    File Copy 256 bufsize 500 maxblocks 101000.0 KBps (30.0 secs, 3 samples)
    File Read 4096 bufsize 8000 maxblocks 2033256.0 KBps (30.0 secs, 3 samples)
    File Write 4096 bufsize 8000 maxblocks 1611814.0 KBps (30.0 secs, 3 samples)
    File Copy 4096 bufsize 8000 maxblocks 847979.0 KBps (30.0 secs, 3 samples)
    Dc: sqrt(2) to 99 decimal places 128148.7 lpm (30.0 secs, 3 samples)

    INDEX VALUES
    TEST BASELINE RESULT INDEX

    Execl Throughput 43.0 3012.9 700.7
    File Copy 1024 bufsize 2000 maxblocks 3960.0 347159.0 876.7
    File Copy 256 bufsize 500 maxblocks 1655.0 101000.0 610.3
    File Copy 4096 bufsize 8000 maxblocks 5800.0 847979.0 1462.0
    Shell Scripts (8 concurrent) 6.0 1120.3 1867.2
    =========
    FINAL SCORE 1004.6

    This patch:

    Remove refcnt from page_cgroup.

    After this:

    * A page is charged only when !page_mapped() && no page_cgroup is
      assigned:
      * an anon page is newly mapped, or
      * a file page is added to mapping->tree.

    * A page is uncharged only when:
      * an anon page is fully unmapped, or
      * a file page is removed from the LRU.

    There is no change in behavior from the user's point of view.

    This patch also removes unnecessary calls in rmap.c which were used
    only for refcnt management.

    [akpm@linux-foundation.org: fix warning]
    [hugh@veritas.com: fix shmem_unuse_inode charging]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Li Zefan
    Cc: Hugh Dickins
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Cc: David Rientjes
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch changes page migration under the memory controller to use
    a different algorithm. (Thanks to Christoph for the new idea.)

    Before:
    - page_cgroup is migrated from an old page to a new page.
    After:
    - a new page is accounted; page_cgroup is not reused.

    Pros:

    - We can avoid complicated lock dependencies and races in migration.

    Cons:

    - new param to mem_cgroup_charge_common().

    - mem_cgroup_getref() is added for handling ref_cnt ping-pong.

    This version simplifies the complicated lock dependency in page
    migration under the memory resource controller.

    The new refcnt sequence is as follows:

    a mapped page:
    prepare_migration() ..... +1 to NEW page
    try_to_unmap() ..... all refs to OLD page are gone.
    move_pages() ..... +1 to NEW page if page cache.
    remap... ..... all refs from *map* are added to NEW one.
    end_migration() ..... -1 to NEW page.

    The page's mapcount + (page_is_cache) refs are added to the NEW one.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Cc: Li Zefan
    Cc: YAMAMOTO Takashi
    Cc: Hugh Dickins
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • * Remove overkill initialization (in the fast path).
    * Make the condition for PAGE_CGROUP_FLAG_ACTIVE more obvious.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Li Zefan
    Acked-by: Balbir Singh
    Acked-by: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • mem_cgroup_subsys and page_cgroup_cache should be read_mostly and
    MEM_CGROUP_RECLAIM_RETRIES can be just a fixed number.

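    Illustratively (the exact retry value is the patch author's choice):

    struct cgroup_subsys mem_cgroup_subsys __read_mostly;
    static struct kmem_cache *page_cgroup_cache __read_mostly;

    /* a fixed retry count is sufficient */
    #define MEM_CGROUP_RECLAIM_RETRIES 5
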
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Acked-by: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Currently res_counter_write() is a raw file handler even though it's
    ultimately taking a number, since in some cases it wants to
    pre-process the string when converting it to a number.

    This patch converts res_counter_write() from a raw file handler to a
    write_string() handler; this allows some of the boilerplate
    copying/locking/checking to be removed and simplifies the cleanup
    path, since these functions are now performed by the cgroups
    framework.

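    For reference, a sketch of the two handler shapes in the cgroups
    framework (member names from struct cftype; treat the signatures as
    approximate for this era):

    /* raw handler: receives the raw userspace buffer and must do its
     * own copying and checking */
    ssize_t (*write)(struct cgroup *cgrp, struct cftype *cft,
                     struct file *file, const char __user *buf,
                     size_t nbytes, loff_t *ppos);

    /* write_string handler: the framework hands over a copied,
     * NUL-terminated string, so the boilerplate disappears */
    int (*write_string)(struct cgroup *cgrp, struct cftype *cft,
                        const char *buffer);
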
    [lizf@cn.fujitsu.com: build fix]
    Signed-off-by: Paul Menage
    Cc: Paul Jackson
    Cc: Pavel Emelyanov
    Cc: Balbir Singh
    Cc: Serge Hallyn
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • journal_try_to_free_buffers() could race with the jbd commit
    transaction when the latter is holding the buffer reference while
    waiting for the data buffer to flush to disk. If the caller of
    journal_try_to_free_buffers() tries hard to release the buffers, it
    will treat the failure as an error and return to the caller. We have
    seen direct IO fail due to this race. Some callers of releasepage()
    also expect the buffer to be dropped when called with the GFP_KERNEL
    mask via releasepage()->journal_try_to_free_buffers().

    With this patch, if the caller passes __GFP_WAIT and __GFP_FS to
    indicate that the call may wait, then when try_to_free_buffers()
    fails we wait for journal_commit_transaction() to finish committing
    the current transaction and then try to free those buffers again.

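    A rough sketch of the new logic (journal_wait_for_transaction_sync_data()
    names the wait step; the exact structure is illustrative):

    /* in journal_try_to_free_buffers(), after walking the buffers */
    ret = try_to_free_buffers(page);
    if (!ret && (gfp_mask & __GFP_WAIT) && (gfp_mask & __GFP_FS)) {
            /* The caller may block: wait for the committing transaction
             * to finish flushing its data buffers, then retry once. */
            journal_wait_for_transaction_sync_data(journal);
            ret = try_to_free_buffers(page);
    }
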
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Mingming Cao
    Reviewed-by: Badari Pulavarty
    Acked-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mingming Cao
     
  • Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
     

25 Jul, 2008

24 commits

  • Vegard Nossum has noticed the ever-decreasing negative priority in a
    swapon/swapoff loop, which eventually would misprioritize when the int wraps
    positive. Not worth spending much code on, but probably better fixed.

    It's easy to handle the swapping on and off of just one area, but there's
    not much point if a pair or more still misbehave. To handle the general
    case, swapoff should compact negative priorities, keeping them always from
    -1 to -MAX_SWAPFILES. That's a change, but should cause no regression,
    since these negative (unspecified) priorities are disjoint from the
    positive specified priorities 0 to 32767.

    One small functional difference, which seems appropriate: when swapoff
    fails to free all swap from a negative priority area, that area is now
    reinserted at lowest priority, rather than at its original priority.

    In moving down swapon's setting of priority, I notice that an area is
    visible to /proc/swaps when it has swap_map set, yet that was being set
    before all the visible fields were properly filled in: corrected.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Reported-by: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • We'd like to support CONFIG_MEMORY_HOTREMOVE on s390, which depends on
    CONFIG_MIGRATION. So far, CONFIG_MIGRATION is only available with NUMA
    support.

    This patch makes CONFIG_MIGRATION selectable for architectures that define
    ARCH_ENABLE_MEMORY_HOTREMOVE. When MIGRATION is enabled w/o NUMA, the
    kernel won't compile because migrate_vmas() does not know about
    vm_ops->migrate() and vma_migratable() does not know about policy_zone.
    To fix this, those two functions can be restricted to '#ifdef CONFIG_NUMA'
    because they are not being used w/o NUMA. vma_migratable() is moved over
    from migrate.h to mempolicy.h.

    [kosaki.motohiro@jp.fujitsu.com: build fix]
    Acked-by: Christoph Lameter
    Signed-off-by: Gerald Schaefer
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     
  • Memory may be hot-removed on a per-memory-block basis, particularly on
    POWER where the SPARSEMEM section size often matches the memory-block
    size. A user-level agent must be able to identify which sections of
    memory are likely to be removable before attempting the potentially
    expensive operation. This patch adds a file called "removable" to the
    memory directory in sysfs to help such an agent. In this patch, a memory
    block is considered removable if:

    o It contains only MOVABLE pageblocks
    o It contains only pageblocks with free pages regardless of pageblock type

    On the other hand, a memory block starting with a PageReserved() page
    will never be considered removable. Without this patch, the
    user-level agent is forced to choose a memory block to remove at
    random. A sketch of the removability check follows the sample output
    below.

    Sample output of the sysfs files:

    ./memory/memory0/removable: 0
    ./memory/memory1/removable: 0
    ./memory/memory2/removable: 0
    ./memory/memory3/removable: 0
    ./memory/memory4/removable: 0
    ./memory/memory5/removable: 0
    ./memory/memory6/removable: 0
    ./memory/memory7/removable: 1
    ./memory/memory8/removable: 0
    ./memory/memory9/removable: 0
    ./memory/memory10/removable: 0
    ./memory/memory11/removable: 0
    ./memory/memory12/removable: 0
    ./memory/memory13/removable: 0
    ./memory/memory14/removable: 0
    ./memory/memory15/removable: 0
    ./memory/memory16/removable: 0
    ./memory/memory17/removable: 1
    ./memory/memory18/removable: 1
    ./memory/memory19/removable: 1
    ./memory/memory20/removable: 1
    ./memory/memory21/removable: 1
    ./memory/memory22/removable: 1

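    A sketch of the removability check (hypothetical helper name; the
    real code walks the block's pageblocks with checks along these lines):

    /* Illustrative: removable iff every pageblock is either MOVABLE or
     * currently free in the buddy allocator. */
    static int block_is_removable(struct page *first_page,
                                  unsigned long nr_pageblocks)
    {
            unsigned long i;

            for (i = 0; i < nr_pageblocks; i++) {
                    struct page *page = first_page + i * pageblock_nr_pages;

                    if (PageReserved(page))
                            return 0;       /* never removable */
                    if (get_pageblock_migratetype(page) != MIGRATE_MOVABLE &&
                        !PageBuddy(page))
                            return 0;
            }
            return 1;
    }
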
    Signed-off-by: Badari Pulavarty
    Signed-off-by: Mel Gorman
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Badari Pulavarty
     
  • If the zonelist is required to be rebuilt in online_pages(), there is
    no need to recalculate vm_total_pages in that function, as it has
    already been updated by the call to build_all_zonelists().

    Signed-off-by: Kent Liu
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Yasunori Goto
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kent Liu
     
  • - Change some naming:
      * Magic -> types
      * MIX_INFO -> MIX_SECTION_INFO
      * Change the definition of bootmem types away from direct hex
        values.

    - __free_pages_bootmem() becomes __meminit.

    Signed-off-by: Yasunori Goto
    Cc: Andy Whitcroft
    Cc: Badari Pulavarty
    Cc: Yinghai Lu
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     
  • With this patch, usemaps are allocated on the section which holds the
    pgdat.

    Because a usemap is very small, the usemaps of many sections are
    allocated within a single page. If a section holds a usemap, it can't
    be removed until the other sections using that usemap are removed.
    This dependency is not desirable for memory removal.

    The pgdat has a similar property: the section holding the pgdat area
    must be the last section removed on the node. So, if section A holds
    the pgdat and section B holds the usemap for section A, neither
    section can be removed due to the mutual dependency.

    To solve this issue, this patch collects usemaps on the same section
    as the pgdat as much as possible. If the other sections then have no
    dependency on it, that section can finally be removed.

    Signed-off-by: Yasunori Goto
    Cc: Mel Gorman
    Cc: Andy Whitcroft
    Cc: David Miller
    Cc: Badari Pulavarty
    Cc: Heiko Carstens
    Cc: Hiroyuki KAMEZAWA
    Cc: Tony Breeds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     
  • This was required by some old, no-longer-used gcc on sparc.

    Signed-off-by: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vegard Nossum
     
  • Make the needlessly global register_page_bootmem_info_section() static.

    Signed-off-by: Adrian Bunk
    Acked-by: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • This patch contains the following cleanups:
    - make the following needlessly global variables static:
      * required_kernelcore
      * zone_movable_pfn[]
    - make the following needlessly global functions static:
      * move_freepages()
      * move_freepages_block()
      * setup_pageset()
      * find_usable_zone_for_movable()
      * adjust_zone_range_for_zone_movable()
      * __absent_pages_in_range()
      * find_min_pfn_for_node()
      * find_zone_movable_pfns_for_nodes()

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • alloc_pages_exact() is similar to alloc_pages(), except that it allocates
    the minimum number of pages to fulfill the request. This is useful if you
    want to allocate a very large buffer that is slightly larger than an even
    power-of-two number of pages. In that case, alloc_pages() will waste a
    lot of memory.

    I have a video driver that wants to allocate a 5MB buffer;
    alloc_pages() will waste 3MB of physically-contiguous memory.

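    A usage sketch (alloc_pages_exact()/free_pages_exact() as introduced
    by this patch; the 5MB figure mirrors the driver example above):

    #define VIDEO_BUF_SIZE (5 * 1024 * 1024)        /* not a power of two */

    /* Gets exactly enough pages for 5MB; the tail of the order-11 (8MB)
     * block that alloc_pages() would have tied up is given back. */
    void *buf = alloc_pages_exact(VIDEO_BUF_SIZE, GFP_KERNEL);
    if (!buf)
            return -ENOMEM;

    /* later, on teardown: */
    free_pages_exact(buf, VIDEO_BUF_SIZE);
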
    Signed-off-by: Timur Tabi
    Cc: Andi Kleen
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Timur Tabi
     
  • Almost all users of this field need a PFN instead of a physical address,
    so replace node_boot_start with node_min_pfn.

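    At call sites the conversion is a straight unit change (illustrative):

    /* before: node_boot_start held a physical address */
    start_pfn = PFN_DOWN(bdata->node_boot_start);

    /* after: the field is a PFN already */
    start_pfn = bdata->node_min_pfn;
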
    [Lee.Schermerhorn@hp.com: fix spurious BUG_ON() in mark_bootmem()]
    Signed-off-by: Johannes Weiner
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Since alloc_bootmem_core does no goal-fallback anymore and just returns
    NULL if the allocation fails, we might now use it in alloc_bootmem_section
    without all the fixup code for a misplaced allocation.

    Also, the limit can be the first PFN of the next section as the semantics
    is that the limit is _above_ the allocated region, not within.

    Signed-off-by: Johannes Weiner
    Cc: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • __alloc_bootmem_node already does this, make the interface consistent.

    Signed-off-by: Johannes Weiner
    Cc: Ingo Molnar
    Cc: Yinghai Lu
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The old node-agnostic code tried allocating on all nodes starting from the
    one with the lowest range. alloc_bootmem_core retried without the goal if
    it could not satisfy it and so the goal was only respected at all when it
    happened to be on the first (lowest page numbers) node (or theoretically
    if allocations failed on all nodes before the one holding the goal).

    Introduce a non-panicking helper that starts allocating from the node
    holding the goal and falls back only after all these tries have failed, thus
    moving the goal fallback code out of alloc_bootmem_core.

    Make all other allocation functions benefit from this new helper.

    Signed-off-by: Johannes Weiner
    Cc: Ingo Molnar
    Cc: Yinghai Lu
    Cc: Andi Kleen
    Cc: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Introduce new helpers that mark a range that resides completely on a node
    or node-agnostic ranges that might also span node boundaries.

    The free/reserve API functions will then directly use these helpers.

    Note that the free/reserve semantics become more strict: while the prior
    code took basically arbitrary range arguments and marked the PFNs that
    happen to fall into that range, the new code requires node-specific ranges
    to be completely on the node. The node-agnostic requests might span node
    boundaries as long as the nodes are contiguous.

    Passing ranges that do not satisfy these criteria is a bug.

    [akpm@linux-foundation.org: fix printk warnings]
    Signed-off-by: Johannes Weiner
    Cc: Ingo Molnar
    Cc: Yinghai Lu
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Factor out the common operation of marking a range on the bitmap.

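    A minimal sketch of such a helper (hypothetical name __mark_range();
    the real patch keeps separate free and reserve flavours):

    /* Mark [sidx, eidx) in the node's bootmem bitmap: set bits reserve
     * pages, cleared bits free them. */
    static void __init __mark_range(bootmem_data_t *bdata,
                                    unsigned long sidx, unsigned long eidx,
                                    int reserve)
    {
            unsigned long idx;

            for (idx = sidx; idx < eidx; idx++) {
                    if (reserve)
                            set_bit(idx, bdata->node_bootmem_map);
                    else
                            clear_bit(idx, bdata->node_bootmem_map);
            }
    }
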
    [akpm@linux-foundation.org: fix various warnings]
    Signed-off-by: Johannes Weiner
    Cc: Ingo Molnar
    Cc: Yinghai Lu
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • alloc_bootmem_core has become quite nasty to read over time. This is a
    clean rewrite that keeps the semantics.

    bdata->last_pos has been dropped.

    bdata->last_success has been renamed to hint_idx and it is now an index
    relative to the node's range. Since further block searching might start
    at this index, it is now set to the end of a successful allocation rather
    than its beginning.

    bdata->last_offset has been renamed to last_end_off to be more clear that
    it represents the ending address of the last allocation relative to the
    node.

    [y-goto@jp.fujitsu.com: fix new alloc_bootmem_core()]
    Signed-off-by: Johannes Weiner
    Signed-off-by: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Rewrite the code in a more concise way using fewer variables.

    [akpm@linux-foundation.org: fix printk warnings]
    Signed-off-by: Johannes Weiner
    Cc: Ingo Molnar
    Cc: Yinghai Lu
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • link_bootmem handles the insertion of a new descriptor into the
    sorted list in three more or less explicit branches: empty list,
    insert in between, and append. These cases can be expressed
    implicitly.

    Also mark the sorted list as initdata as it can be thrown away after boot
    as well.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Reincarnate get_mapsize as bootmap_bytes and implement
    bootmem_bootmap_pages on top of it.

    Adjust users of these helpers and make free_all_bootmem_core use
    bootmem_bootmap_pages instead of open-coding it.

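    A sketch of the pair (one bit per page, rounded up to word granularity
    and then to whole pages; treat the rounding details as per the patch):

    static unsigned long __init bootmap_bytes(unsigned long pages)
    {
            unsigned long bytes = (pages + 7) / 8;  /* one bit per page */

            return ALIGN(bytes, sizeof(long));
    }

    /* Number of whole pages needed to hold the bitmap for @pages pages. */
    unsigned long __init bootmem_bootmap_pages(unsigned long pages)
    {
            return PAGE_ALIGN(bootmap_bytes(pages)) >> PAGE_SHIFT;
    }
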
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Introduce the bootmem_debug kernel parameter that enables very verbose
    diagnostics regarding all range operations of bootmem as well as the
    initialization and release of nodes.

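    A sketch of how such a flag is typically wired up (early_param() is
    the standard mechanism; the parameter name matches the patch):

    static int bootmem_debug __initdata;    /* off unless requested */

    static int __init bootmem_debug_setup(char *buf)
    {
            bootmem_debug = 1;      /* "bootmem_debug" was on the command line */
            return 0;
    }
    early_param("bootmem_debug", bootmem_debug_setup);
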
    [akpm@linux-foundation.org: fix printk warnings]
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Change the description, move a misplaced comment about the allocator
    itself and add me to the list of copyright holders.

    Signed-off-by: Johannes Weiner
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • This only reorders functions so that further patches will be easier to
    read. No code changed.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner