30 Oct, 2005

2 commits

  • Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with
    a many-threaded application which concurrently initializes different parts of
    a large anonymous area.

    This patch corrects that, by using a separate spinlock per page table page, to
    guard the page table entries in that page, instead of using the mm's single
    page_table_lock. (But even then, page_table_lock is still used to guard page
    table allocation, and anon_vma allocation.)

    In this implementation, the spinlock is tucked inside the struct page of the
    page table page, with a BUILD_BUG_ON in case it overflows - which it would in
    the case of 32-bit PA-RISC with spinlock debugging enabled. (A rough sketch of
    the lock selection follows this entry.)

    Splitting the lock is not quite for free: another cacheline access. Ideally,
    I suppose we would use split ptlock only for multi-threaded processes on
    multi-cpu machines; but deciding that dynamically would have its own costs.
    So for now enable it by config, at some number of cpus - since the Kconfig
    language doesn't support inequalities, let the preprocessor compare that with
    NR_CPUS. But I don't think it's worth being user-configurable: for good
    testing of both split and unsplit configs, split now at 4 cpus, and perhaps
    change that to 8 later.

    There is a benefit even for singly threaded processes: kswapd can be attacking
    one part of the mm while another part is busy faulting.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
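
    A rough sketch (not the exact kernel code) of the lock selection described
    above; the ptl field name and the helper macros here are assumptions:

        /* Sketch only: pick the spinlock guarding the ptes in one page table
         * page.  Above the configured CPU count, each page table page carries
         * its own lock inside its struct page; below it, fall back to the
         * single mm-wide page_table_lock. */
        #if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
        #define pte_lock_init(page)   spin_lock_init(&(page)->ptl)
        #define pte_lockptr(mm, pmd)  ({ (void)(mm); &pmd_page(*(pmd))->ptl; })
        #else
        #define pte_lock_init(page)   do { } while (0)
        #define pte_lockptr(mm, pmd)  ({ (void)(pmd); &(mm)->page_table_lock; })
        #endif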
     
  • Martin Hicks' page cache reclaim patch added the 'may_swap' flag to the
    scan_control struct; and modified shrink_list() not to add anon pages to
    the swap cache if may_swap is not asserted.

    Ref: http://marc.theaimsgroup.com/?l=linux-mm&m=111461480725322&w=4

    However, further down, if the page is mapped, shrink_list() calls
    try_to_unmap(), which will call try_to_unmap_one() via try_to_unmap_anon().
    try_to_unmap_one() will BUG_ON() an anon page that is NOT in the swap
    cache. Martin says he never encountered this path in his testing, but
    agrees that it might happen.

    This patch modifies shrink_list() to skip anon pages that are not already
    in the swap cache when !may_swap, rather than just not adding them to the
    cache (see the sketch after this entry).

    Signed-off-by: Lee Schermerhorn
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
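
    A minimal sketch of the fix, assuming the shrink_list() labels of that era
    (keep_locked, activate_locked); an anon page not already in the swap cache
    is now skipped when !may_swap, instead of falling through to try_to_unmap():

        /* shrink_list(), sketch: anon page not yet in the swap cache */
        if (PageAnon(page) && !PageSwapCache(page)) {
                if (!sc->may_swap)
                        goto keep_locked;       /* skip rather than unmap */
                if (!add_to_swap(page))
                        goto activate_locked;
        }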
     

28 Oct, 2005

1 commit


17 Oct, 2005

1 commit

  • As noticed by Nick Piggin, we need to make sure that we check the page
    count before we check for PageDirty, since the dirty check is only valid
    if the count implies that we're the only possible ones holding the page.

    We always did do this, but the code needs a read memory barrier to make sure
    that the ordering is also honored by the CPU (a sketch of the pattern follows
    this entry).

    (The writer side is ordered due to the atomic decrement and test on the
    page count; see the discussion on linux-kernel.)

    Signed-off-by: Linus Torvalds

    Linus Torvalds
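
    A hedged sketch of the pattern; the surrounding function and the expected
    reference count shown here are illustrative, not the exact reclaim code:

        /* Reader side: only trust PageDirty() if the refcount says we are the
         * last possible holder, and stop the CPU reordering the dirty read
         * before the count read. */
        if (page_count(page) != 2)      /* illustrative expected count */
                return 0;               /* someone else may still dirty it */
        smp_rmb();                      /* order count read before dirty read */
        if (PageDirty(page))
                return 0;
        /* Writer side: already ordered by the atomic decrement-and-test on
         * the page count, as noted above. */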
     

13 Sep, 2005

1 commit


08 Sep, 2005

1 commit

  • This patch makes use of the previously underutilized cpuset flag
    'mem_exclusive' to provide what amounts to another layer of memory placement
    resolution. With this patch, there are now the following four layers of
    memory placement available:

    1) The whole system (interrupt and GFP_ATOMIC allocations can use this),
    2) The nearest enclosing mem_exclusive cpuset (GFP_KERNEL allocations can use),
    3) The current task's cpuset (GFP_USER allocations constrained to here), and
    4) Specific node placement, using mbind and set_mempolicy.

    These nest - each layer is a subset of (the same as, or contained within) the
    previous one.

    Layer (2) above is new with this patch. The call used to check whether a
    zone (its node, actually) is in a cpuset (in its mems_allowed, actually) is
    extended to take a gfp_mask argument. When __GFP_HARDWALL is not set in the
    flag bits, its logic now walks up the cpuset hierarchy to the nearest
    enclosing mem_exclusive cpuset to determine whether placement is allowed
    (sketched after this entry). The definition of GFP_USER, which used to be
    identical to GFP_KERNEL, was changed in the previous cpuset_gfp_hardwall_flag
    patch to also set the __GFP_HARDWALL bit.

    GFP_ATOMIC and GFP_KERNEL allocations will stay within the current task's
    cpuset, so long as any node therein is not too tight on memory, but will
    escape to the larger layer if need be.

    The intended use is to allow something like a batch manager to handle several
    jobs, each job in its own cpuset, but using common kernel memory for caches
    and such. Swapper and oom_kill activity is also constrained to Layer (2). A
    task in or below one mem_exclusive cpuset should not cause swapping on nodes
    in another non-overlapping mem_exclusive cpuset, nor provoke oom_killing of a
    task in another such cpuset. Heavy use of kernel memory for i/o caching and
    such by one job should not impact the memory available to jobs in other
    non-overlapping mem_exclusive cpusets.

    This patch enables providing hardwall, inescapable cpusets for memory
    allocations of each job, while sharing kernel memory allocations between
    several jobs, in an enclosing mem_exclusive cpuset.

    Like Dinakar's patch earlier to enable administering sched domains using the
    cpu_exclusive flag, this patch also provides a useful meaning to a cpuset flag
    that had previously done nothing much useful other than restrict what cpuset
    configurations were allowed.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
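
    A simplified sketch of the extended check; locking is omitted and the
    nearest_exclusive_ancestor() helper name is an assumption:

        /* Sketch: may the current task allocate from zone z with this gfp_mask? */
        int cpuset_zone_allowed(struct zone *z, unsigned int gfp_mask)
        {
                int node = z->zone_pgdat->node_id;
                const struct cpuset *cs;

                if (in_interrupt())
                        return 1;               /* layer 1: whole system */
                if (node_isset(node, current->mems_allowed))
                        return 1;               /* layer 3: task's cpuset */
                if (gfp_mask & __GFP_HARDWALL)
                        return 0;               /* GFP_USER: no escape hatch */
                /* layer 2: nearest enclosing mem_exclusive cpuset */
                cs = nearest_exclusive_ancestor(current->cpuset);
                return node_isset(node, cs->mems_allowed);
        }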
     

05 Sep, 2005

2 commits


29 Jun, 2005

1 commit


26 Jun, 2005

1 commit

  • 1. Establish a simple API for process freezing, defined in
    include/linux/sched.h (sketched after this entry):

    frozen(process)          Check for frozen process
    freezing(process)        Check if a process is being frozen
    freeze(process)          Tell a process to freeze (go to refrigerator)
    thaw_process(process)    Restart process
    frozen_process(process)  Process is frozen now

    2. Remove all references to PF_FREEZE and PF_FROZEN from all
    kernel sources except sched.h

    3. Fix numerous locations where try_to_freeze is manually done by a driver

    4. Remove the argument that is no longer necessary from two function calls.

    5. Some whitespace cleanup

    6. Clear a potential race in the refrigerator (there was a window in which
    PF_FREEZE had been cleared before PF_FROZEN was set, and recalc_sigpending()
    does not check PF_FROZEN).

    This patch does not address the problem of freeze_processes() violating the
    rule that a task may only modify its own flags, by setting PF_FREEZE on other
    tasks. This is not clean in an SMP environment; freeze(process) is therefore
    not SMP safe!

    Signed-off-by: Christoph Lameter
    Signed-off-by: Linus Torvalds

    Christoph Lameter
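
    A sketch of what these helpers amount to in terms of the two task flags
    (the exact definitions in sched.h may differ slightly):

        #define frozen(p)          ((p)->flags & PF_FROZEN)
        #define freezing(p)        ((p)->flags & PF_FREEZE)
        #define freeze(p)          do { (p)->flags |= PF_FREEZE; } while (0)
        /* the refrigerator marks the process frozen: clear the request flag
         * and set the frozen flag */
        #define frozen_process(p)  \
                do { (p)->flags = ((p)->flags & ~PF_FREEZE) | PF_FROZEN; } while (0)

        /* wake a frozen process back up; returns nonzero if it was frozen */
        static inline int thaw_process(struct task_struct *p)
        {
                if (frozen(p)) {
                        p->flags &= ~PF_FROZEN;
                        wake_up_process(p);
                        return 1;
                }
                return 0;
        }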
     

22 Jun, 2005

5 commits

  • try_to_free_pages accepts a third argument, order, but hasn't used it since
    before 2.6.0. The following patch removes the argument and updates all the
    calls to try_to_free_pages (prototypes sketched after this entry).

    Signed-off-by: Darren Hart
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darren Hart
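
    Roughly, the prototype change (argument types as they looked around 2.6.12;
    details are approximate):

        /* before: order was accepted but never used */
        int try_to_free_pages(struct zone **zones, unsigned int gfp_mask,
                              unsigned int order);
        /* after */
        int try_to_free_pages(struct zone **zones, unsigned int gfp_mask);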
     
  • When early zone reclaim is turned on, the LRU is scanned more frequently
    when a zone is low on memory. This patch limits how often zone reclaim can
    run by skipping the scan if another thread (either kswapd or synchronous
    reclaim) is already reclaiming from the zone (sketched after this entry).

    Signed-off-by: Martin Hicks
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Hicks
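
    A sketch of the throttle, assuming a per-zone counter (the field name
    reclaim_in_progress is an assumption) that kswapd and synchronous reclaim
    bump while they work on a zone:

        /* Sketch: bail out of early zone reclaim if someone is already
         * reclaiming from this zone. */
        static int zone_reclaim(struct zone *zone, unsigned int gfp_mask,
                                unsigned int order)
        {
                int nr_reclaimed = 0;

                if (atomic_read(&zone->reclaim_in_progress) > 0)
                        return 0;       /* kswapd or sync reclaim already here */

                atomic_inc(&zone->reclaim_in_progress);
                /* ... scan this zone's LRU with may_swap disabled,
                 *     accumulating freed pages into nr_reclaimed ... */
                atomic_dec(&zone->reclaim_in_progress);

                return nr_reclaimed;
        }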
     
  • This is the core of the (much simplified) early reclaim. The goal of this
    patch is to reclaim some easily-freed pages from a zone before falling back
    onto another zone.

    One of the major uses of this is NUMA machines. With the default allocator
    behavior the allocator would look for memory in another zone, which might be
    off-node, before trying to reclaim from the current zone.

    This adds a zone tuneable to enable early zone reclaim. It is selected on a
    per-zone basis and is turned on/off via a syscall (the allocator fallback is
    sketched after this entry).

    Adding some extra throttling on the reclaim was also required (patch
    4/4). Without it, the machine would grind to a crawl when doing a "make -j"
    kernel build. Even with this patch the System Time is higher on
    average, but it seems tolerable. Here are some numbers for kernbench
    runs on a 2-node, 4cpu, 8Gig RAM Altix in the "make -j" run:

                          wall  user  sys  %cpu  ctx sw.  sleeps
                          ----  ----  ---  ----  -------  ------
    No patch              1009  1384  847   258   298170  504402
    w/patch, no reclaim    880  1376  667   288   254064  396745
    w/patch & reclaim     1079  1385  926   252   291625  548873

    These numbers are the average of 2 runs of 3 "make -j" runs done right
    after system boot. Run-to-run variability for "make -j" is huge, so
    these numbers aren't terribly useful except to see that with reclaim
    the benchmark still finishes in a reasonable amount of time.

    I also looked at the NUMA hit/miss stats for the "make -j" runs and the
    reclaim doesn't make any difference when the machine is thrashing away.

    Doing a "make -j8" on a single node that is filled with page cache pages
    takes 700 seconds with reclaim turned on and 735 seconds without reclaim
    (due to remote memory accesses).

    The simple zone_reclaim syscall program is at
    http://www.bork.org/~mort/sgi/zone_reclaim.c

    Signed-off-by: Martin Hicks
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Hicks
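
    A simplified sketch of the fallback order; the per-zone switch is called
    reclaim_pages here, and enough_free_pages()/allocate_from() stand in for the
    real watermark and allocation calls - all three names are assumptions:

        /* Sketch: walk the zonelist, trying a light in-zone reclaim before
         * spilling into the next (possibly off-node) zone. */
        for (i = 0; zones[i] != NULL; i++) {
                struct zone *z = zones[i];

                if (enough_free_pages(z, order))
                        return allocate_from(z, order);

                /* zone is low: try to reclaim easily-freed pages here first */
                if (z->reclaim_pages && zone_reclaim(z, gfp_mask, order))
                        return allocate_from(z, order);

                /* otherwise fall back to the next zone in the list */
        }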
     
  • Here's the next round of these patches. These are totally different in
    an attempt to meet the "simpler" request after the last patches. For
    reference the earlier threads are:

    http://marc.theaimsgroup.com/?l=linux-kernel&m=110839604924587&w=2
    http://marc.theaimsgroup.com/?l=linux-mm&m=111461480721249&w=2

    This set of patches replaces my other vm- patches that are currently in
    -mm. So they're against 2.6.12-rc5-mm1 about half way through the -mm
    patchset.

    As I said already this patch is a lot simpler. The reclaim is turned on
    or off on a per-zone basis using a syscall. I haven't tested the x86
    syscall, so it might be wrong. It uses the existing reclaim/pageout
    code with the small addition of a may_swap flag to scan_control
    (patch 1/4).

    I also added __GFP_NORECLAIM (patch 3/4) so that certain allocation
    types can be flagged to never cause reclaim. This was a deficiency
    that was in all of my earlier patch sets. Previously, doing a big
    buffered read would fill one zone with page cache and then start to
    reclaim from that same zone, leaving the other zones untouched.

    Adding some extra throttling on the reclaim was also required (patch
    4/4). Without it, the machine would grind to a crawl when doing a "make -j"
    kernel build. Even with this patch the System Time is higher on
    average, but it seems tolerable. Here are some numbers for kernbench
    runs on a 2-node, 4cpu, 8Gig RAM Altix in the "make -j" run:

                          wall  user  sys  %cpu  ctx sw.  sleeps
                          ----  ----  ---  ----  -------  ------
    No patch              1009  1384  847   258   298170  504402
    w/patch, no reclaim    880  1376  667   288   254064  396745
    w/patch & reclaim     1079  1385  926   252   291625  548873

    These numbers are the average of 2 runs of 3 "make -j" runs done right
    after system boot. Run-to-run variability for "make -j" is huge, so
    these numbers aren't terribly useful except to see that with reclaim
    the benchmark still finishes in a reasonable amount of time.

    I also looked at the NUMA hit/miss stats for the "make -j" runs and the
    reclaim doesn't make any difference when the machine is thrashing away.

    Doing a "make -j8" on a single node that is filled with page cache pages
    takes 700 seconds with reclaim turned on and 735 seconds without reclaim
    (due to remote memory accesses).

    The simple zone_reclaim syscall program is at
    http://www.bork.org/~mort/sgi/zone_reclaim.c

    This patch:

    This adds an extra switch to the scan_control struct. It simply lets the
    reclaim code know whether it's allowed to swap pages out (sketched after this
    entry).

    This was required for a simple per-zone reclaimer. Without this addition
    pages would be swapped out as soon as a zone ran out of memory and the early
    reclaim kicked in.

    Signed-off-by: Martin Hicks
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Hicks
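
    A sketch of this patch's addition (other scan_control fields omitted). Note
    that the anon-page check shown is the original form whose not-in-swap-cache
    hole is closed by the 30 Oct shrink_list() fix above:

        struct scan_control {
                /* ... existing fields omitted ... */
                int may_writepage;
                int may_swap;   /* may pages be swapped out during the scan? */
        };

        /* shrink_list(), sketch: only start swap I/O when allowed */
        if (PageAnon(page) && !PageSwapCache(page) && sc->may_swap) {
                if (!add_to_swap(page))
                        goto activate_locked;
        }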
     
  • Fix a problem identified by Andrea Arcangeli

    kswapd will set a zone into all_unreclaimable state if it sees that we're not
    successfully reclaiming LRU pages. But that fails to notice that we're
    successfully reclaiming slab objects, so we can set all_unreclaimable too soon.

    So change shrink_slab() to return a success indication if it actually
    reclaimed some objects, and don't mark the zone all_unreclaimable if slab
    objects were in fact reclaimed (sketched after this entry). This means that
    we won't enter all_unreclaimable state if we are successfully freeing slab
    objects but we're not yet actually freeing slab pages, due to internal
    fragmentation.

    (hm, this has a shortcoming. We could be successfully freeing ZONE_NORMAL
    slab objects while being really oom on ZONE_DMA. If that happens then kswapd
    might burn a lot of CPU. But given that there might be some slab objects in
    ZONE_DMA, perhaps that is appropriate.)

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    akpm@osdl.org
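
    A sketch of the kswapd-side change, with an illustrative scan threshold; the
    point is only that all_unreclaimable is not set while shrink_slab() reports
    progress:

        /* balance_pgdat() per-zone loop, sketch */
        reclaim_state->reclaimed_slab = 0;
        nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL, lru_pages);
        sc.nr_reclaimed += reclaim_state->reclaimed_slab;

        if (!zone->all_unreclaimable &&
            nr_slab == 0 &&                     /* no slab progress either */
            zone->pages_scanned >= SCAN_LIMIT)  /* illustrative threshold */
                zone->all_unreclaimable = 1;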
     

17 Apr, 2005

2 commits

  • We only call pageout() for dirty pages, so this test is redundant.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    akpm@osdl.org
     
  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds