10 Sep, 2005

4 commits

  • This patch clarifies NULL handling of kfree() and vfree(). In addition,
    the wording of the calling context restriction for vfree() and vunmap() is changed
    from "may not" to "must not."

    Signed-off-by: Pekka Enberg
    Acked-by: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     
  • The NUMA API change that introduced kmalloc_node was accepted for
    2.6.12-rc3. Now it is possible to do slab allocations on a node to
    localize memory structures. This API was used by the pageset localization
    patch and the block layer localization patch now in mm. The existing
    kmalloc_node is slow since it simply searches through all pages of the slab
    to find a page that is on the requested node. The two patches do a one-time
    allocation of slab structures at initialization, and therefore the speed of
    kmalloc_node does not matter.

    This patch allows kmalloc_node to be as fast as kmalloc by introducing node
    specific page lists for partial, free and full slabs. Slab allocation
    improves in a NUMA system so that we are seeing a performance gain in AIM7
    of about 5% with this patch alone.

    More NUMA localizations are possible if kmalloc_node operates as fast as
    kmalloc.
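
    As a hedged illustration of the API this speeds up (the struct and function
    names here are hypothetical), a node-local allocation looks like an ordinary
    kmalloc() with an extra node argument:

        #include <linux/slab.h>

        /* Allocate a per-node control structure on the given node so that
         * later accesses from CPUs on that node stay local. */
        static struct example_pageset *alloc_example_pageset(int node)
        {
                return kmalloc_node(sizeof(struct example_pageset),
                                    GFP_KERNEL, node);
        }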

    Test run on a 32p system with 32G RAM.

    w/o patch
    Tasks jobs/min jti jobs/min/task real cpu
    1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005
    100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005
    200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005
    300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005
    400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005
    500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005
    600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005
    700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005
    800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005
    900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005
    1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005

    w/patch
    Tasks jobs/min jti jobs/min/task real cpu
    1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005
    100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005
    200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005
    300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005
    400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005
    500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005
    600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005
    700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005
    800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005
    900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005
    1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005

    These are measurements taken directly after boot and show a greater
    improvement than 5%. However, the performance improvements become less
    over time if the AIM7 runs are repeated and settle down at around 5%.

    Links to earlier discussions:
    http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2
    http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2

    Changelog V4-V5:
    - alloc_arraycache and alloc_aliencache take node parameter instead of cpu
    - fix initialization so that nodes without cpus are properly handled.
    - simplify code in kmem_cache_init
    - patch against Andrew's temp mm3 release
    - Add Shai to credits
    - fallback to __cache_alloc from __cache_alloc_node if the node's cache
    is not available yet.

    Changelog V3-V4:
    - Patch against 2.6.12-rc5-mm1
    - Cleanup patch integrated
    - More and better use of for_each_node and for_each_cpu
    - GCC 2.95 fix (do not use [] use [0])
    - Correct determination of INDEX_AC
    - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes.
    - Remove list3_data and list3_data_ptr macros for better readability

    Changelog V2-V3:
    - Made to patch against 2.6.12-rc4-mm1
    - Revised bootstrap mechanism so that larger size kmem_list3 structs can be
    supported. Do a generic solution so that the right slab can be found
    for the internal structs.
    - use for_each_online_node

    Changelog V1-V2:
    - Batching for freeing of wrong-node objects (alien caches)
    - Locking changes and NUMA #ifdefs as requested by Manfred

    Signed-off-by: Alok N Kataria
    Signed-off-by: Shobhit Dayal
    Signed-off-by: Shai Fultheim
    Signed-off-by: Christoph Lameter
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This patch modifies tmpfs to call the inode_init_security LSM hook to set
    up the incore inode security state for new inodes before the inode becomes
    accessible via the dcache.

    As there is no underlying storage of security xattrs in this case, it is
    not necessary for the hook to return the (name, value, len) triple to the
    tmpfs code, so this patch also modifies the SELinux hook function to
    correctly handle the case where the (name, value, len) pointers are NULL.

    The hook call is needed in tmpfs in order to support proper security
    labeling of tmpfs inodes (e.g. for udev with tmpfs /dev in Fedora). With
    this change in place, we should then be able to remove the
    security_inode_post_create/mkdir/... hooks safely.
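
    A hedged sketch of the call pattern described, for a filesystem with no
    xattr storage (error handling abbreviated; the surrounding function is not
    shown):

        /* No on-disk xattrs, so the (name, value, len) triple is not needed:
         * pass NULL and let the LSM set up the incore security state only. */
        error = security_inode_init_security(inode, dir, NULL, NULL, NULL);
        if (error && error != -EOPNOTSUPP)
                return error;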

    Signed-off-by: Stephen Smalley
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Smalley
     
  • Update the file systems in fs/ implementing a delete_inode() callback to
    call truncate_inode_pages(). One implementation note: In developing this
    patch I put the calls to truncate_inode_pages() at the very top of those
    filesystems' delete_inode() callbacks in order to retain the previous
    behavior. I'm guessing that some of those could probably be optimized.
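
    A minimal sketch of the pattern described, using a hypothetical filesystem
    name:

        /* Call truncate_inode_pages() first, preserving the old behaviour,
         * then do the filesystem-specific work. */
        static void examplefs_delete_inode(struct inode *inode)
        {
                truncate_inode_pages(&inode->i_data, 0);
                /* ... examplefs-specific cleanup ... */
                clear_inode(inode);
        }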

    Signed-off-by: Mark Fasheh
    Acked-by: Christoph Hellwig
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Fasheh
     

09 Sep, 2005

1 commit

  • Run PCI driver initialization on local node

    Instead of adding messy kmalloc_node()s everywhere run the
    PCI driver probe on the node local to the device.

    This would not have helped for IDE, but should for
    other, cleaner drivers that do more initialization in probe().
    It won't help for drivers that do most of the work
    on first open (like many network drivers).
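
    Roughly, the idea can be sketched as below (a simplification of the actual
    change; error handling and restoring the previous cpu mask are omitted):

        /* Run the driver's probe() on a CPU belonging to the device's node
         * so that GFP_KERNEL allocations in probe() default to local memory. */
        int node = pcibus_to_node(dev->bus);
        if (node >= 0)
                set_cpus_allowed(current, node_to_cpumask(node));
        error = drv->probe(dev, id);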

    Signed-off-by: Andi Kleen
    Signed-off-by: Greg Kroah-Hartman

    Andi Kleen
     

08 Sep, 2005

8 commits

  • This patch introduces a kzalloc wrapper and converts kernel/ to use it. It
    saves a little program text.
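
    A minimal sketch of the wrapper's semantics (the real implementation lives
    with the slab allocator):

        #include <linux/slab.h>
        #include <linux/string.h>

        /* Allocate zeroed memory: kmalloc() followed by memset(). */
        void *kzalloc(size_t size, unsigned int flags)
        {
                void *ret = kmalloc(size, flags);
                if (ret)
                        memset(ret, 0, size);
                return ret;
        }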

    Signed-off-by: Pekka Enberg
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka J Enberg
     
  • Now the real motivation for this cpuset mem_exclusive patch series seems
    trivial.

    This patch keeps a task in or under one mem_exclusive cpuset from provoking an
    oom kill of a task under a non-overlapping mem_exclusive cpuset. Since only
    interrupt and GFP_ATOMIC allocations are allowed to escape mem_exclusive
    containment, there is little to gain from oom killing a task under a
    non-overlapping mem_exclusive cpuset, as almost all kernel and user memory
    allocation must come from disjoint memory nodes.

    This patch enables configuring a system so that a runaway job under one
    mem_exclusive cpuset cannot cause the killing of a job in another such cpuset
    that might be using very high compute and memory resources for a prolonged
    time.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • This patch makes use of the previously underutilized cpuset flag
    'mem_exclusive' to provide what amounts to another layer of memory placement
    resolution. With this patch, there are now the following four layers of
    memory placement available:

    1) The whole system (interrupt and GFP_ATOMIC allocations can use this),
    2) The nearest enclosing mem_exclusive cpuset (GFP_KERNEL allocations can use),
    3) The current task's cpuset (GFP_USER allocations constrained to here), and
    4) Specific node placement, using mbind and set_mempolicy.

    These nest - each layer is a subset (same or within) of the previous.

    Layer (2) above is new, with this patch. The call used to check whether a
    zone (its node, actually) is in a cpuset (in its mems_allowed, actually) is
    extended to take a gfp_mask argument, and its logic is extended, in the case
    that __GFP_HARDWALL is not set in the flag bits, to look up the cpuset
    hierarchy for the nearest enclosing mem_exclusive cpuset, to determine if
    placement is allowed. The definition of GFP_USER, which used to be identical
    to GFP_KERNEL, is changed to also set the __GFP_HARDWALL bit, in the previous
    cpuset_gfp_hardwall_flag patch.

    GFP_ATOMIC and GFP_KERNEL allocations will stay within the current task's
    cpuset, so long as any node therein is not too tight on memory, but will
    escape to the larger layer, if need be.
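
    A hedged sketch of the extended check (the ancestor-lookup helper is a
    stand-in for the real cpuset code):

        /* Can this zone's node be used for an allocation with gfp_mask? */
        int cpuset_zone_allowed(struct zone *z, unsigned int gfp_mask)
        {
                int node = z->zone_pgdat->node_id;

                if (in_interrupt())
                        return 1;               /* layer 1: whole system */
                if (node_isset(node, current->mems_allowed))
                        return 1;               /* layer 3: task's cpuset */
                if (gfp_mask & __GFP_HARDWALL)
                        return 0;               /* GFP_USER: no escape */
                /* layer 2: nearest enclosing mem_exclusive cpuset */
                return nearest_exclusive_ancestor_allows(node); /* hypothetical */
        }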

    The intended use is to allow something like a batch manager to handle several
    jobs, each job in its own cpuset, but using common kernel memory for caches
    and such. Swapper and oom_kill activity is also constrained to Layer (2). A
    task in or below one mem_exclusive cpuset should not cause swapping on nodes
    in another non-overlapping mem_exclusive cpuset, nor provoke oom_killing of a
    task in another such cpuset. Heavy use of kernel memory for i/o caching and
    such by one job should not impact the memory available to jobs in other
    non-overlapping mem_exclusive cpusets.

    This patch enables providing hardwall, inescapable cpusets for memory
    allocations of each job, while sharing kernel memory allocations between
    several jobs, in an enclosing mem_exclusive cpuset.

    Like Dinakar's patch earlier to enable administering sched domains using the
    cpu_exclusive flag, this patch also provides a useful meaning to a cpuset flag
    that had previously done nothing much useful other than restrict what cpuset
    configurations were allowed.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • This patch series extends the use of the cpuset attribute 'mem_exclusive'
    to support cpuset configurations that:
    1) allow GFP_KERNEL allocations to come from a potentially larger
    set of memory nodes than GFP_USER allocations, and
    2) can constrain the oom killer to tasks running in cpusets in
    a specified subtree of the cpuset hierarchy.

    Here's an example usage scenario. For a few hours or more, a large NUMA
    system at a University is to be divided in two halves, with a bunch of student
    jobs running in half the system under some form of batch manager, and with a
    big research project running in the other half. Each of the student jobs is
    placed in a small cpuset, but should share the classic Unix time share
    facilities, such as buffered pages of files in /bin and /usr/lib. The big
    research project wants no interference whatsoever from the student jobs, and
    has highly tuned, unusual memory and i/o patterns that intend to make full use
    of all the main memory on the nodes available to it.

    In this example, we have two big sibling cpusets, one of which is further
    divided into a more dynamic set of child cpusets.

    We want kernel memory allocations constrained by the two big cpusets, and user
    allocations constrained by the smaller child cpusets where present. And we
    require that the oom killer not operate across the two halves of this system,
    or else the first time a student job runs amuck, the big research project will
    likely be first in line to get shot.

    Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project
    really does run amuck allocating memory, it should be shot, not some other
    task outside the research project's mem_exclusive cpuset.

    I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage
    such scenarios. Let memory allocations for user space (GFP_USER) be
    constrained by a task's current cpuset, but memory allocations for kernel space
    (GFP_KERNEL) be constrained by the nearest mem_exclusive ancestor of the
    current cpuset, even though kernel space allocations will still _prefer_ to
    remain within the current task's cpuset, if memory is easily available.

    Let the oom killer be constrained to consider only tasks that are in
    overlapping mem_exclusive cpusets (it won't help much to kill a task that
    normally cannot allocate memory on any of the same nodes as the ones on which
    the current task can allocate.)

    The current constraints imposed on setting mem_exclusive are unchanged. A
    cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a
    mem_exclusive cpuset may not overlap any of its siblings' memory nodes.

    This patch was presented on linux-mm in early July 2005, though did not
    generate much feedback at that time. It has been built for a variety of
    arch's using cross tools, and built, booted and tested for function on SN2
    (ia64).

    There are 4 patches in this set:
    1) Some minor cleanup, and some improvements to the code layout
    of one routine to make subsequent patches cleaner.
    2) Add another GFP flag - __GFP_HARDWALL. It marks memory
    requests for USER space, which are tightly confined by the
    current task's cpuset.
    3) Now memory requests (such as KERNEL) that are not marked HARDWALL can,
    if short on memory, look in the potentially larger pool of memory
    defined by the nearest mem_exclusive ancestor cpuset of the current
    task's cpuset.
    4) Finally, modify the oom killer to skip any task whose mem_exclusive
    cpuset doesn't overlap ours.

    Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32
    bytes of kernel text space. Patch (2) has no effect on the size of kernel
    text space (it just adds a preprocessor flag). Patches (3) and (4) added
    about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which
    matters only if CONFIG_CPUSET is enabled.

    This patch:

    This patch applies a few comment and code cleanups to mm/oom_kill.c prior to
    applying a few small patches to improve cpuset management of memory placement.

    The comment changed in oom_kill.c was seriously misleading. The code layout
    change in select_bad_process() makes room for adding another condition on
    which a process can be spared the oom killer (see the subsequent
    cpuset_nodes_overlap patch for this addition).
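
    A hedged sketch of the select_bad_process() condition added by that later
    patch (the helper name follows the description; details differ):

        /* Skip tasks whose mem_exclusive cpuset does not overlap ours:
         * killing them would not free memory on nodes we can allocate from. */
        do_each_thread(g, p) {
                if (!cpuset_excl_nodes_overlap(p))
                        continue;
                /* ... existing badness() scoring of p ... */
        } while_each_thread(g, p);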

    Also fixed a couple of typos and spellos that bugged me while I was here.

    This patch should have no material effect.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Mark variables which are usually accessed for reads with __read_mostly.
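
    For illustration (the variable is hypothetical), the annotation places the
    object in the read-mostly data section so it does not share a cacheline
    with frequently written data:

        static int example_tunable __read_mostly = 1;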

    Signed-off-by: Alok N Kataria
    Signed-off-by: Shai Fultheim
    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ravikiran G Thirumalai
     
  • We don't reset the cache hit count until after readahead does a successful
    readahead. This seems to leave a corner case open where we miss in cache,
    but don't restart the readahead right away.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Pratt
     
  • Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Move some more frequently read variables that showed up during some of our
    performance tests as sometimes ending up in hot cachelines to the
    read_mostly section.

    Fix: Move the __read_mostly from before hpet_usec_quotient to follow the
    variable like the other uses of __read_mostly.
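
    That is, the attribute follows the declarator (the type shown here is only
    illustrative):

        unsigned long hpet_usec_quotient __read_mostly;     /* preferred */
        /* rather than: __read_mostly unsigned long hpet_usec_quotient; */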

    Signed-off-by: Alok N Kataria
    Signed-off-by: Christoph Lameter
    Signed-off-by: Shai Fultheim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

05 Sep, 2005

27 commits

  • This patch modifies the VFS setxattr, getxattr, and listxattr code to fall
    back to the security module for security xattrs if the filesystem does not
    support xattrs natively. This allows security modules to export the incore
    inode security label information to userspace even if the filesystem does
    not provide xattr storage, and eliminates the need to individually patch
    various pseudo filesystem types to provide such access. The patch removes
    the existing xattr code from devpts and tmpfs as it is then no longer
    needed.

    The patch restructures the code flow slightly to reduce duplication between
    the normal path and the fallback path, but this should only have one
    user-visible side effect - a program may get -EACCES rather than
    -EOPNOTSUPP if policy denied access but the filesystem didn't support the
    operation anyway. Note that the post_setxattr hook call is not needed in
    the fallback case, as the inode_setsecurity hook call handles the incore
    inode security state update directly. In contrast, we do call fsnotify in
    both cases.
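
    A hedged sketch of the fallback path described (variable names follow the
    fs/xattr.c setxattr conventions; the native-filesystem branch and its
    security hooks are abbreviated):

        if (inode->i_op && inode->i_op->setxattr) {
                /* normal path: the filesystem stores the xattr itself */
                error = inode->i_op->setxattr(dentry, kname, kvalue, size, flags);
                if (!error)
                        fsnotify_xattr(dentry);
        } else if (!strncmp(kname, XATTR_SECURITY_PREFIX,
                            sizeof(XATTR_SECURITY_PREFIX) - 1)) {
                /* fallback: let the security module hold the incore value */
                const char *suffix = kname + sizeof(XATTR_SECURITY_PREFIX) - 1;
                error = security_inode_setsecurity(inode, suffix,
                                                   kvalue, size, flags);
                if (!error)
                        fsnotify_xattr(dentry);
        }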

    Signed-off-by: Stephen Smalley
    Acked-by: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Smalley
     
  • Add page_state info to the per-node meminfo file in sysfs. This is mostly
    just for informational purposes.

    The lack of this information was brought up recently during a discussion
    regarding pagecache clearing, and I put this patch together to test out one
    of the suggestions.

    It seems like interesting info to have, so I'm submitting the patch.

    Signed-off-by: Martin Hicks
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Hicks
     
    Proposed by and based on a patch from Eric Dumazet:
    This patch removes an unnecessary critical section in the ksize() function,
    as cli/sti are rather expensive on modern CPUs.

    It additionally adds a docbook entry for ksize() and further simplifies the
    code.
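
    A hedged sketch of the resulting ksize(), with the helpers that map an
    object back to its cache and object size named only illustratively:

        unsigned int ksize(const void *objp)
        {
                if (unlikely(objp == NULL))
                        return 0;
                /* no cli/sti: the object's cache, and hence its size, cannot
                 * change while the caller still owns the object */
                return cache_object_size(cache_for_object(objp));
        }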

    Signed-Off-By: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • Most objects returned by __cache_alloc() will be written by the caller
    (though not all callers want to write the whole object - often just the
    beginning). prefetchw() tells a modern CPU to anticipate the coming writes,
    i.e. to start some memory transactions in advance.
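
    Roughly (the allocation step shown is a stand-in for the real slab fast
    path):

        #include <linux/prefetch.h>

        objp = cache_alloc_fastpath(cachep, flags);     /* placeholder step */
        if (likely(objp))
                prefetchw(objp);   /* hint: the caller will write this soon */
        return objp;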

    Signed-off-by: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     
  • Add a new accessor for PTEs, which passes the full hint from the mmu_gather
    struct; this allows architectures with hardware pagetables to optimize away
    atomic PTE operations when destroying an address space. Removing the
    locked operation should allow better pipelining of memory access in this
    loop. I measured an average savings of 30-35 cycles per zap_pte_range on
    the first 500 destructions on Pentium-M, but I believe the optimization
    would win more on older processors which still assert the bus lock on xchg
    for an exclusive cacheline.

    Update: I made some new measurements, and this saves exactly 26 cycles over
    ptep_get_and_clear on Pentium M. On P4, with a PAE kernel, this saves 180
    cycles per ptep_get_and_clear, for a whopping 92160 cycles savings for a
    full address space destruction.

    pte_clear_full is not yet used, but is provided for future optimizations
    (in particular, when running inside of a hypervisor that queues page table
    updates, the full hint allows us to avoid queueing unnecessary page table
    updates for an address space in the process of being destroyed).

    This is not a huge win, but it does help a bit, and sets the stage for
    further hypervisor optimization of the mm layer on all architectures.
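
    A hedged sketch of the generic fallback and a typical call site in
    zap_pte_range() (simplified):

        /* Architectures without a cheaper "full" variant fall back to the
         * ordinary atomic accessor. */
        #ifndef ptep_get_and_clear_full
        #define ptep_get_and_clear_full(mm, addr, ptep, full)   \
                ptep_get_and_clear(mm, addr, ptep)
        #endif

        /* In zap_pte_range(): tlb->fullmm hints that the whole mm is being
         * torn down, so no other user can observe the PTE. */
        ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);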

    Signed-off-by: Zachary Amsden
    Cc: Christoph Lameter
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
  • This is used only in slab.c and each architecture gets to define which
    underlying type is to be used.

    Seems a bit silly - move it to slab.c and use the same type for all
    architectures: unsigned int.
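
    With the move, the single definition in mm/slab.c is simply:

        typedef unsigned int kmem_bufctl_t;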

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kyle Moffett
     
  • Initial Post (Wed, 17 Aug 2005)

    This patch moves the

        if (!pte_none(*pte))
                hugetlb_clean_stale_pgtable(pte);

    logic into huge_pte_alloc() so all of its callers can be immune to the bug
    described by Kenneth Chen at http://lkml.org/lkml/2004/6/16/246

    > It turns out there is a bug in hugetlb_prefault(): with 3 level page table,
    > huge_pte_alloc() might return a pmd that points to a PTE page. It happens
    > if the virtual address for hugetlb mmap is recycled from previously used
    > normal page mmap. free_pgtables() might not scrub the pmd entry on
    > munmap and hugetlb_prefault skips on any pmd presence regardless what type
    > it is.

    Unless I am missing something, it seems more correct to place the check inside
    huge_pte_alloc() to prevent the same bug wherever a huge pte is allocated.
    It also allows checking for this condition when lazily faulting huge pages
    later in the series.
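
    As a hedged sketch (modeled loosely on the i386 layout; the real code
    differs per architecture), the check sits at the end of the allocation:

        pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
        {
                pgd_t *pgd = pgd_offset(mm, addr);
                pud_t *pud = pud_alloc(mm, pgd, addr);
                pte_t *pte = NULL;

                if (pud)
                        pte = (pte_t *)pmd_alloc(mm, pud, addr);

                /* A recycled address may leave a stale PTE page behind the
                 * pmd; scrub it here so every caller is covered. */
                if (pte && !pte_none(*pte))
                        hugetlb_clean_stale_pgtable(pte);
                return pte;
        }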

    Signed-off-by: Adam Litke
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     
  • Version 6 of the ARM architecture introduces the concept of 16MB pages
    (supersections) and 36-bit (40-bit actually, but nobody uses this) physical
    addresses. 36-bit addressed memory and I/O and ARMv6 can only be mapped
    using supersections and the requirement on these is that both virtual and
    physical addresses be 16MB aligned. In trying to add support for ioremap()
    of 36-bit I/O, we run into the issue that get_vm_area() allows for a
    maximum of 512K alignment via the IOREMAP_MAX_ORDER constant. To work
    around this, we can:

    - Allocate a larger VM area than needed (size + (1ul << IOREMAP_MAX_ORDER))
    and then align the pointer ourselves, but this ends up with 512K of
    wasted VM per ioremap().

    - Provide a new __get_vm_area_aligned() API and make __get_vm_area() sit
    on top of this. I did this and it works, but I don't like the idea of
    adding another VM API just for this one case.

    - My preferred solution, which is to allow the architecture to override
    the IOREMAP_MAX_ORDER constant with its own version (sketched below).
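
    A hedged sketch of that override (the generic default corresponds to the
    512K limit mentioned above; the ARM value is what 16MB supersections
    require):

        /* generic fallback, e.g. in the vmalloc header */
        #ifndef IOREMAP_MAX_ORDER
        #define IOREMAP_MAX_ORDER       (7 + PAGE_SHIFT)        /* 128 pages */
        #endif

        /* ARM override for supersection-aligned ioremap() */
        #define IOREMAP_MAX_ORDER       24                      /* 16MB */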

    Signed-off-by: Deepak Saxena
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Deepak Saxena
     
  • If !vma->vm_ops we already BUG above, so retesting it is useless. The
    compiler cannot optimize this because BUG is a macro and is thus not marked
    noreturn; that should possibly be fixed.

    Signed-off-by: Paolo 'Blaisorblade' Giarrusso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paolo 'Blaisorblade' Giarrusso
     
  • Either shmem_getpage returns a failure, or it found a page, or it was told
    it couldn't do any I/O. So it's useless to check nonblock in the else
    branch. We could add a BUG() there but I preferred to comment the
    offending function.

    This was taken out from one Ingo Molnar's old patch I'm resurrecting.

    Signed-off-by: Paolo 'Blaisorblade' Giarrusso
    Cc: Ingo Molnar
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paolo 'Blaisorblade' Giarrusso
     
  • Fix a small spelling mistake. subtile->subtle

    Signed-off-by: Martin Hicks
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Hicks
     
  • Better late than never, I've at last reviewed the madvise vma merging
    going into 2.6.13. Remove a pointless check and fix two little bugs -
    a simple test (with /proc/<pid>/maps hacked to show ReadHints) showed
    both mismerges in practice: though being madvise, neither was disastrous.

    1. Correct placement of the success label in madvise_behavior: as in
    mprotect_fixup and mlock_fixup, it is necessary to update vm_flags
    when vma_merge succeeds (to handle the exceptional Case 8 noted in
    the comments above vma_merge itself).

    2. Correct initial value of prev when starting part way into a vma: as
    in sys_mprotect and do_mlock, it needs to be set to vma in this case
    (vma_merge handles only that minimum of cases shown in its comments).

    3. If find_vma_prev sets prev, then the vma it returns is prev->vm_next,
    so it's pointless to make that same assignment again in sys_madvise.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Christoph Lameter and Marcelo Tosatti asked to get rid of the
    atomic_inc_and_test() to cleanup the atomic ops in the zone reclaim code.

    Signed-off-by: Martin Hicks
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Hicks
     
  • Add a capability check to sys_set_zone_reclaim(). This syscall is not
    something that should be available to a user.
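
    The check described amounts to the following near the top of the syscall
    (a sketch; the surrounding body is omitted and the exact error value may
    differ):

        if (!capable(CAP_SYS_ADMIN))
                return -EACCES;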

    Signed-off-by: Martin Hicks
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Hicks
     
  • This bitop does not need to be atomic because it is performed when there will
    be no references to the page (ie. the page is being freed).

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • filemap_xip's nopage routine maps the ZERO_PAGE into readonly mappings, if it
    has no data page to map there: then if the hole in the file is later filled,
    __xip_unmap uses an rmap technique to replace the ZERO_PAGEs mapped for that
    offset by the newly allocated file page, so that established mappings will see
    the newly written data.

    However, on MIPS (alone) there's not one but as many as eight ZERO_PAGEs,
    chosen for coloring by user virtual address; and if mremap has meanwhile been
    used to move a mapping containing a ZERO_PAGE, it will generally not match the
    ZERO_PAGE(address) __xip_unmap is looking for.

    To maintain XIP's established mappings correctly on MIPS, we need Nick's fix
    to mremap's move_one_page (originally presented as an optimization), to
    replace the ZERO_PAGE appropriate to the old address by the ZERO_PAGE
    appropriate to the new address.

    (But when I first saw this, I was thinking the ZERO_PAGEs themselves would get
    corrupted, very bad. Now I think it's the other way round, that the
    established mappings will fail to see the newly written data: incorrect, but
    not corrupting everything else. Whether filemap_xip's technique is generally
    safe, I'd hesitate to say in a hurry: it's interesting, but we've never tried
    to do that in tmpfs.)

    Signed-off-by: Hugh Dickins
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Thanks to Bill Irwin for pointing this out.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Microoptimise page_add_anon_rmap. Although these expressions are used only in
    the taken branch of the if() statement, the compiler can't reorder them inside
    because atomic_inc_and_test is a barrier.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Just be clear that VM_RESERVED pages here are a bug, and the test is not there
    because they are expected.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • This patch was recently discussed on linux-mm:
    http://marc.theaimsgroup.com/?t=112085728500002&r=1&w=2

    I inherited a large code base from Ray for page migration. There was a
    small patch in there that I find to be very useful since it allows the
    display of the locality of the pages in use by a process. I reworked that
    patch and came up with a /proc/<pid>/numa_maps that gives more information
    about the vmas of a process. numa_maps is indexed by the start address
    found in /proc/<pid>/maps. For example, with this patch you can see the page
    use of the "getty" process:

    margin:/proc/12008 # cat maps
    00000000-00004000 r--p 00000000 00:00 0
    2000000000000000-200000000002c000 r-xp 00000000 08:04 516 /lib/ld-2.3.3.so
    2000000000038000-2000000000040000 rw-p 00028000 08:04 516 /lib/ld-2.3.3.so
    2000000000040000-2000000000044000 rw-p 2000000000040000 00:00 0
    2000000000058000-2000000000260000 r-xp 00000000 08:04 54707842 /lib/tls/libc.so.6.1
    2000000000260000-2000000000268000 ---p 00208000 08:04 54707842 /lib/tls/libc.so.6.1
    2000000000268000-2000000000274000 rw-p 00200000 08:04 54707842 /lib/tls/libc.so.6.1
    2000000000274000-2000000000280000 rw-p 2000000000274000 00:00 0
    2000000000280000-20000000002b4000 r--p 00000000 08:04 9126923 /usr/lib/locale/en_US.utf8/LC_CTYPE
    2000000000300000-2000000000308000 r--s 00000000 08:04 60071467 /usr/lib/gconv/gconv-modules.cache
    2000000000318000-2000000000328000 rw-p 2000000000318000 00:00 0
    4000000000000000-4000000000008000 r-xp 00000000 08:04 29576399 /sbin/mingetty
    6000000000004000-6000000000008000 rw-p 00004000 08:04 29576399 /sbin/mingetty
    6000000000008000-600000000002c000 rw-p 6000000000008000 00:00 0 [heap]
    60000fff7fffc000-60000fff80000000 rw-p 60000fff7fffc000 00:00 0
    60000ffffff44000-60000ffffff98000 rw-p 60000ffffff44000 00:00 0 [stack]
    a000000000000000-a000000000020000 ---p 00000000 00:00 0 [vdso]

    cat numa_maps
    2000000000000000 default MaxRef=43 Pages=11 Mapped=11 N0=4 N1=3 N2=2 N3=2
    2000000000038000 default MaxRef=1 Pages=2 Mapped=2 Anon=2 N0=2
    2000000000040000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1
    2000000000058000 default MaxRef=43 Pages=61 Mapped=61 N0=14 N1=15 N2=16 N3=16
    2000000000268000 default MaxRef=1 Pages=2 Mapped=2 Anon=2 N0=2
    2000000000274000 default MaxRef=1 Pages=3 Mapped=3 Anon=3 N0=3
    2000000000280000 default MaxRef=8 Pages=3 Mapped=3 N0=3
    2000000000300000 default MaxRef=8 Pages=2 Mapped=2 N0=2
    2000000000318000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N2=1
    4000000000000000 default MaxRef=6 Pages=2 Mapped=2 N1=2
    6000000000004000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1
    6000000000008000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1
    60000fff7fffc000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1
    60000ffffff44000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1

    getty uses ld.so. The first vma is the code segment which is used by 43
    other processes and the pages are evenly distributed over the 4 nodes.

    The second vma is the process specific data portion for ld.so. This is
    only one page.

    The display format is:

    <start address>  Links to information in /proc/<pid>/maps
    <policy>         This can be "default", "interleave={}", "prefer=" or "bind={}"
    MaxRef=          <maximum reference count to any page in the vma>
    Pages=           <number of pages present>
    Mapped=          <number of pages mapped>
    Anon=            <number of anonymous pages>
    Nx=              <number of pages on node x>

    The content of the proc-file is self-evident. If this would be tied into
    the sparsemem system then the contents of this file would not be too
    useful.

    Signed-off-by: Christoph Lameter
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Remove the three get_mm_counter(mm, rss) tests from rmap.c: there was a
    time when testing rss was important to avoid a particular race between
    dup_mmap and the anonmm rmap; but now it's just a rather silly pseudo-
    optimization, made even more obscure by the get_mm_counter macro.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Three of the four BUG_ONs in delete_from_swap_cache are immediately
    repeated in __delete_from_swap_cache: delete those and add the one. But
    perhaps mm/ is altogether overprovisioned with historic BUGs?

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The idea of a swap_device_lock per device, and a swap_list_lock over them all,
    is appealing; but in practice almost every holder of swap_device_lock must
    already hold swap_list_lock, which defeats the purpose of the split.

    The only exceptions have been swap_duplicate, valid_swaphandles and an
    untrodden path in try_to_unuse (plus a few places added in this series).
    valid_swaphandles doesn't show up high in profiles, but swap_duplicate does
    demand attention. However, with the hold time in get_swap_pages so much
    reduced, I've not yet found a load and set of swap device priorities to show
    even swap_duplicate benefitting from the split. Certainly the split is mere
    overhead in the common case of a single swap device.

    So, replace swap_list_lock and swap_device_lock by spinlock_t swap_lock
    (generally we seem to prefer an _ in the name, and not hide in a macro).

    If someone can show a regression in swap_duplicate, then probably we should
    add a hashlock for the swap_map entries alone (shorts being anatomic), so as
    to help the case of the single swap device too.
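
    In effect, callers go from nesting two locks to taking one (a sketch; the
    old helpers were thin macros over the underlying spinlocks):

        /* before: per-device lock nested inside the list lock */
        swap_list_lock();
        swap_device_lock(si);
        /* ... examine/modify si->swap_map ... */
        swap_device_unlock(si);
        swap_list_unlock();

        /* after: a single lock covers both */
        spin_lock(&swap_lock);
        /* ... */
        spin_unlock(&swap_lock);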

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The get_swap_page/scan_swap_map latency can be so bad that even those without
    preemption configured deserve relief: periodically cond_resched.
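
    A hedged sketch of the relief inside the scan loop (the counter name and
    limit are illustrative; which lock is dropped depends on the surrounding
    patches):

        if (unlikely(--latency_ration < 0)) {
                /* drop the lock, give others a chance to run, retake it */
                swap_device_unlock(si);
                cond_resched();
                swap_device_lock(si);
                latency_ration = LATENCY_LIMIT;
        }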

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • get_swap_page has often shown up on latency traces, doing lengthy scans while
    holding two spinlocks. swap_list_lock is already dropped; now scan_swap_map
    drops swap_device_lock before scanning the swap_map.

    While scanning for an empty cluster, don't worry that racing tasks may
    allocate what was free and free what was allocated; but when allocating an
    entry, check it's still free after retaking the lock. Avoid dropping the lock
    in the expected common path. No barriers beyond the locks, just let the
    cookie crumble; highest_bit limit is volatile, but benign.

    Guard against swapoff: must check SWP_WRITEOK before allocating, must raise
    the SWP_SCANNING reference count while in scan_swap_map; swapoff waits for
    that to fall - just using schedule_timeout, as we don't want to burden scan_swap_map
    itself, and it's very unlikely that anyone can really still be in
    scan_swap_map once swapoff gets this far.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Rewrite scan_swap_map to allocate in just the same way as before (taking the
    next free entry SWAPFILE_CLUSTER-1 times, then restarting at the lowest wholly
    empty cluster, falling back to lowest entry if none), but with a view towards
    dropping the lock in the next patch.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Rewrite get_swap_page to allocate in just the same sequence as before, but
    without holding swap_list_lock across its scan_swap_map. Decrement
    nr_swap_pages and update swap_list.next in advance, while still holding
    swap_list_lock. Skip full devices by testing highest_bit. Swapoff holds
    swap_device_lock as well as swap_list_lock to clear SWP_WRITEOK. This reduces
    lock contention when there are parallel swap devices of the same priority.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins