09 Jan, 2009

4 commits

  • Now, the following can happen even when swap accounting is enabled:

    1. Create groups 01 and 02.
    2. Allocate a "file" on tmpfs from a task in group 01.
    3. Swap the "file" out (under memory pressure).
    4. Read the "file" from a task in group 02.
    5. The charge for the "file" is moved to group 02.

    This is not ideal behavior. It happens because SwapCache loaded by
    read-ahead is not taken into account.
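
    For reference, the sequence above maps onto userspace roughly as in the
    sketch below. It is only illustrative: it assumes a cgroup-v1 memory
    controller mounted at /cgroup (the path is an assumption), swap accounting
    enabled, and /dev/shm backed by tmpfs; step 3 needs genuine memory
    pressure (for example a separate memory hog), which the sketch does not
    generate.

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <sys/stat.h>
        #include <sys/types.h>
        #include <sys/wait.h>
        #include <unistd.h>

        static void join_group(const char *group)
        {
            char path[256], pid[32];
            int fd;

            snprintf(path, sizeof(path), "/cgroup/%s/tasks", group);
            fd = open(path, O_WRONLY);
            if (fd < 0) { perror(path); exit(1); }
            snprintf(pid, sizeof(pid), "%d\n", (int)getpid());
            if (write(fd, pid, strlen(pid)) < 0)
                perror("write tasks");
            close(fd);
        }

        int main(void)
        {
            const char *file = "/dev/shm/memcg-demo";  /* a "file" on tmpfs */
            char buf[4096];
            int fd;

            mkdir("/cgroup/01", 0755);                 /* 1. create groups 01 and 02 */
            mkdir("/cgroup/02", 0755);

            if (fork() == 0) {                         /* 2. allocate the file from 01 */
                join_group("01");
                fd = open(file, O_RDWR | O_CREAT | O_TRUNC, 0600);
                if (fd < 0) { perror(file); exit(1); }
                memset(buf, 0x5a, sizeof(buf));
                for (int i = 0; i < 1024; i++)         /* ~4MB charged to group 01 */
                    if (write(fd, buf, sizeof(buf)) < 0)
                        break;
                close(fd);
                exit(0);
            }
            wait(NULL);

            /* 3. swap the "file" out -- needs real memory pressure, not forced here */

            join_group("02");                          /* 4. read the file from 02 */
            fd = open(file, O_RDONLY);
            while (fd >= 0 && read(fd, buf, sizeof(buf)) > 0)
                ;                                      /* 5. pre-fix, charges move to 02 */
            close(fd);
            return 0;
        }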

    This patch fixes shmem's SwapCache behavior:
    - Remove mem_cgroup_cache_charge_swapin().
    - Add a SwapCache handler routine to mem_cgroup_cache_charge(). With this,
      shmem's file cache is charged at add_to_page_cache() with GFP_NOWAIT.
    - Pass the SwapCache page to shrink_mem_cgroup.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • My patch, memcg-fix-gfp_mask-of-callers-of-charge.patch, changed the
    gfp_mask of charge callers to GFP_HIGHUSER_MOVABLE to show what will
    happen at memory reclaim.

    But in recent discussion it was NACKed because it sounds ugly.

    This patch reverts that change and adds some cleanup to the gfp_mask of
    charge callers. There is no behavior change, but it needs review before
    it generates hunks against patches deeper in the queue.

    This patch also adds an explanation of the meaning of the gfp_mask passed
    to the charge functions in memcontrol.h.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • SwapCache support for memory resource controller (memcg)

    Before the mem+swap controller, memcg itself should handle SwapCache
    properly. This is cut out from that work.

    In the current memcg, SwapCache is simply leaked and the user can create
    tons of SwapCache. This is an accounting leak and should be handled.

    SwapCache accounting is done as follows:

    charge (anon)
    - Charged when it is mapped.
      (Because of readahead, charging at add_to_swap_cache() is not sane.)
    uncharge (anon)
    - Uncharged when it is dropped from the swap cache and fully unmapped,
      i.e. it is not uncharged at unmap.
      Note: deletion from the swap cache at swap-in is done after rmap
      information is established.
    charge (shmem)
    - Charged at swap-in; this prevents a charge at add_to_page_cache().

    uncharge (shmem)
    - Uncharged when it is dropped from the swap cache and not on shmem's
      radix-tree.

    At migration, the check against the 'old page' is modified to handle shmem.

    Compared with the old version discussed earlier (which caused trouble), we
    have the advantages of:
    - the PCG_USED bit.
    - simple migration handling.

    So the situation is much easier than several months ago, maybe.

    [hugh@veritas.com: memcg: handle swap caches build fix]
    Reviewed-by: Daisuke Nishimura
    Tested-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Fix misuse of GFP_KERNEL.

    Currently, most callers of the mem_cgroup_charge_xxx functions use
    GFP_KERNEL.

    I think this stems from the fact that page_cgroup *was* dynamically
    allocated.

    But now we allocate all page_cgroups at boot, and
    mem_cgroup_try_to_free_pages() reclaims memory with GFP_HIGHUSER_MOVABLE
    plus the specified GFP_RECLAIM_MASK.

    * This is because we just want to reduce memory usage; "where should we
      reclaim from?" is not a question for memcg.

    This patch modifies the gfp masks to be GFP_HIGHUSER_MOVABLE where
    possible.

    Note: this patch does not change behavior; it is about showing sane
    information in the source code.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

07 Jan, 2009

2 commits

  • tiny-shmem shares most of its 130 lines of code with shmem and tends to
    break when particular bits of shmem get modified. Unifying saves code and
    makes keeping these two in sync much easier.

    before:
       text    data     bss     dec     hex filename
      14367     392      24   14783    39bf mm/shmem.o
        396      72       8     476     1dc mm/tiny-shmem.o

    after:
       text    data     bss     dec     hex filename
      14367     392      24   14783    39bf mm/shmem.o
        412      72       8     492     1ec mm/shmem.o tiny

    Signed-off-by: Matt Mackall
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
     
  • Following "mm: don't mark_page_accessed in fault path", which now
    places a mark_page_accessed() in zap_pte_range(), we should remove
    the mark_page_accessed() from shmem_fault().

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

14 Nov, 2008

1 commit

  • Wrap access to task credentials so that they can be separated more easily from
    the task_struct during the introduction of COW creds.

    Change most current->(|e|s|fs)[ug]id to current_(|e|s|fs)[ug]id().

    Change some task->e?[ug]id to task_e?[ug]id(). In some places it makes more
    sense to use RCU directly rather than a convenient wrapper; these will be
    addressed by later patches.

    Signed-off-by: David Howells
    Reviewed-by: James Morris
    Acked-by: Serge Hallyn
    Cc: Al Viro
    Cc: linux-audit@redhat.com
    Cc: containers@lists.linux-foundation.org
    Cc: linux-mm@kvack.org
    Signed-off-by: James Morris

    David Howells
     

31 Oct, 2008

1 commit

  • Junjiro R. Okajima reported a problem where knfsd crashes if you are
    using it to export shmemfs objects and run strict overcommit. In this
    situation the current->mm based modifier to the overcommit goes through a
    NULL pointer.

    We could simply check for NULL and skip the modifier but we've caught
    other real bugs in the past from mm being NULL here - cases where we did
    need a valid mm set up (eg the exec bug about a year ago).

    To preserve the checks and get the logic we want, shuffle the checking
    around and add a new helper to the vm_ security wrappers.

    Also fix a current->mm reference in nommu that should use the passed mm.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix build]
    Reported-by: Junjiro R. Okajima
    Acked-by: James Morris
    Signed-off-by: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Cox
     

20 Oct, 2008

3 commits

  • Shmem segments locked into memory via shmctl(SHM_LOCKED) should not be
    kept on the normal LRU, since scanning them is a waste of time and might
    throw off kswapd's balancing algorithms. Place them on the unevictable
    LRU list instead.

    Use the AS_UNEVICTABLE flag to mark address_space of SHM_LOCKed shared
    memory regions as unevictable. Then these pages will be culled off the
    normal LRU lists during vmscan.

    Add a new wrapper function to clear the mapping's unevictable state when/if
    the shared memory segment is unlocked.

    Add 'scan_mapping_unevictable_page()' to mm/vmscan.c to scan all pages in
    the shmem segment's mapping [struct address_space] for evictability now
    that they're no longer locked. If evictable, move them to the appropriate
    zone LRU list.

    Changes depend on [CONFIG_]UNEVICTABLE_LRU.
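
    For reference, a minimal userspace sketch of the SHM_LOCK/SHM_UNLOCK
    transitions this patch is about (a sketch only; it assumes CAP_IPC_LOCK or
    a sufficient RLIMIT_MEMLOCK, and on unevictable-LRU kernels the locked
    pages should be visible under "Unevictable:" in /proc/meminfo):

        #include <stdio.h>
        #include <string.h>
        #include <sys/ipc.h>
        #include <sys/shm.h>

        int main(void)
        {
            size_t size = 4 * 1024 * 1024;
            int id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
            if (id < 0) { perror("shmget"); return 1; }

            if (shmctl(id, SHM_LOCK, NULL) != 0)   /* segment becomes unevictable */
                perror("shmctl(SHM_LOCK)");

            char *p = shmat(id, NULL, 0);
            if (p != (char *)-1)
                memset(p, 0x5a, size);             /* fault the pages in */

            /* inspect /proc/meminfo here if desired */

            if (shmctl(id, SHM_UNLOCK, NULL) != 0) /* back onto the normal LRUs */
                perror("shmctl(SHM_UNLOCK)");

            if (p != (char *)-1)
                shmdt(p);
            shmctl(id, IPC_RMID, NULL);            /* clean up the segment */
            return 0;
        }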

    [kosaki.motohiro@jp.fujitsu.com: revert shm change]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: Kosaki Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Split the LRU lists in two, one set for pages that are backed by real file
    systems ("file") and one for pages that are backed by memory and swap
    ("anon"). The latter includes tmpfs.

    The advantage of doing this is that the VM will not have to scan over lots
    of anonymous pages (which we generally do not want to swap out), just to
    find the page cache pages that it should evict.

    This patch has the infrastructure and a basic policy to balance how much
    we scan the anon lists and how much we scan the file lists. The big
    policy changes are in separate patches.

    [lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
    [kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
    [kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
    [hugh@veritas.com: memcg swapbacked pages active]
    [hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
    [akpm@linux-foundation.org: fix /proc/vmstat units]
    [nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
    [kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
    [kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Define page_file_cache() function to answer the question:
    is page backed by a file?

    Originally part of Rik van Riel's split-lru patch. Extracted to make
    available for other, independent reclaim patches.

    Moved inline function to linux/mm_inline.h where it will be needed by
    subsequent "split LRU" and "noreclaim" patches.

    Unfortunately this needs to use a page flag, since the PG_swapbacked state
    needs to be preserved all the way to the point where the page is last
    removed from the LRU. Trying to derive the status from other info in the
    page resulted in wrong VM statistics in earlier split VM patchsets.

    The total number of page flags in use on a 32 bit machine after this patch
    is 19.

    [akpm@linux-foundation.org: fix up out-of-order merge fallout]
    [hugh@veritas.com: splitlru: shmem_getpage SetPageSwapBacked sooner]
    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: MinChan Kim
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

18 Oct, 2008

1 commit

  • GEM needs to create shmem files to back buffer objects. Though currently
    creation of files for objects could have been driven from userland, the
    modesetting work will require allocation of buffer objects before userland
    is running, for boot-time message display.

    Signed-off-by: Eric Anholt
    Cc: Nick Piggin
    Signed-off-by: Dave Airlie

    Keith Packard
     

13 Oct, 2008

1 commit

  • Discussion on the mailing list questioned the use of these
    magic values in userspace, concluding these values are already
    exported to userspace via statfs and their correct/incorrect
    usage is left up to the userspace application.

    - Move special fs magic number definitions to magic.h
    - Add magic.h include
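
    For illustration, a small userspace sketch of the statfs route mentioned
    above; TMPFS_MAGIC comes from <linux/magic.h>, and the fallback value is
    an assumption for older headers:

        #include <stdio.h>
        #include <sys/vfs.h>
        #include <linux/magic.h>         /* fs magic numbers now live here */

        #ifndef TMPFS_MAGIC
        #define TMPFS_MAGIC 0x01021994   /* fallback; value assumed from magic.h */
        #endif

        int main(int argc, char **argv)
        {
            const char *path = argc > 1 ? argv[1] : "/dev/shm";
            struct statfs sb;

            if (statfs(path, &sb) != 0) { perror("statfs"); return 1; }
            printf("%s: f_type = 0x%lx (%s)\n", path, (unsigned long)sb.f_type,
                   sb.f_type == TMPFS_MAGIC ? "tmpfs" : "not tmpfs");
            return 0;
        }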

    Signed-off-by: Mimi Zohar
    Reviewed-by: James Morris
    Signed-off-by: James Morris

    Mimi Zohar
     

05 Aug, 2008

1 commit

  • Converting page lock to new locking bitops requires a change of page flag
    operation naming, so we might as well convert it to something nicer
    (!TestSetPageLocked_Lock => trylock_page, SetPageLocked => set_page_locked).

    This also facilitates lockdep annotation of the page lock.

    Signed-off-by: Nick Piggin
    Acked-by: KOSAKI Motohiro
    Acked-by: Peter Zijlstra
    Acked-by: Andrew Morton
    Acked-by: Benjamin Herrenschmidt
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

29 Jul, 2008

1 commit

  • SuSE's insserv initscript ordering program hits a kernel BUG at mm/shmem.c:814
    on 2.6.26. It uses posix_fadvise on directories, and the shmem_readpage
    method added in 2.6.23 lets POSIX_FADV_WILLNEED allocate useless pages
    in a tmpfs directory, incrementing the i_blocks count but never decrementing it.

    Fix this by assigning shmem_aops (pointing to readpage, writepage and
    set_page_dirty) only when it's needed, on a regular file or a long symlink.

    Many thanks to Kel for an outstanding bug report and steps to reproduce it.
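
    A minimal sketch of the trigger described above (harmless on fixed
    kernels; the default path /dev/shm is assumed to be tmpfs):

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        int main(int argc, char **argv)
        {
            const char *dir = argc > 1 ? argv[1] : "/dev/shm";
            int fd = open(dir, O_RDONLY | O_DIRECTORY);
            if (fd < 0) { perror("open"); return 1; }

            /* On unfixed 2.6.23..2.6.26 kernels this reached shmem_readpage and
             * leaked i_blocks; on fixed kernels it is a harmless no-op. */
            int err = posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
            if (err)
                fprintf(stderr, "posix_fadvise: %s\n", strerror(err));

            close(fd);
            return 0;
        }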

    Reported-by: Kel Modderman
    Tested-by: Kel Modderman
    Signed-off-by: Hugh Dickins
    Cc: [2.6.25.x, 2.6.26.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

27 Jul, 2008

2 commits

  • The kmem cache passed to the constructor is only needed for constructors that
    are themselves multiplexers. Nobody uses this "feature", nor does anybody use
    the passed kmem cache in a non-trivial way, so pass only a pointer to the object.

    Non-trivial places are:
    arch/powerpc/mm/init_64.c
    arch/powerpc/mm/hugetlbpage.c

    This is flag day, yes.

    Signed-off-by: Alexey Dobriyan
    Acked-by: Pekka Enberg
    Acked-by: Christoph Lameter
    Cc: Jon Tollefson
    Cc: Nick Piggin
    Cc: Matt Mackall
    [akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c]
    [akpm@linux-foundation.org: fix mm/slab.c]
    [akpm@linux-foundation.org: fix ubifs]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • If we can be sure that elevating the page_count on a pagecache page will
    pin it, we can speculatively run this operation, and subsequently check to
    see if we hit the right page rather than relying on holding a lock or
    otherwise pinning a reference to the page.

    This can be done if get_page/put_page behaves consistently throughout the
    whole tree (ie. if we "get" the page after it has been used for something
    else, we must be able to free it with a put_page).

    Actually, there is a period where the count behaves differently: when the
    page is free or if it is a constituent page of a compound page. We need
    an atomic_inc_not_zero operation to ensure we don't try to grab the page
    in either case.

    This patch introduces the core locking protocol to the pagecache (ie.
    adds page_cache_get_speculative, and tweaks some update-side code to make
    it work).

    Thanks to Hugh for pointing out an improvement to the algorithm setting
    page_count to zero when we have control of all references, in order to
    hold off speculative getters.
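
    As an aside, the core primitive here is "take a reference only if the
    count is not already zero". Below is a userland C11 sketch of that idea,
    not the kernel's implementation: the real lockless pagecache must also
    re-check, after the speculative get, that the slot still points at the
    same page, and drop the reference and retry the lookup if it does not.

        #include <stdatomic.h>
        #include <stdbool.h>
        #include <stdio.h>

        /* "Increment the refcount only if it is non-zero", the heart of
         * the atomic_inc_not_zero / page_cache_get_speculative idea. */
        static bool get_ref_unless_zero(atomic_int *refcount)
        {
            int old = atomic_load(refcount);

            while (old != 0) {
                /* On failure the CAS reloads 'old', so we re-test it. */
                if (atomic_compare_exchange_weak(refcount, &old, old + 1))
                    return true;    /* reference taken */
            }
            return false;           /* object is free(ing); caller must retry lookup */
        }

        int main(void)
        {
            atomic_int refcount = 1;            /* object currently has a user */

            printf("get: %d\n", get_ref_unless_zero(&refcount));  /* prints 1 */
            atomic_store(&refcount, 0);         /* object has been freed */
            printf("get: %d\n", get_ref_unless_zero(&refcount));  /* prints 0 */
            return 0;
        }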

    [kamezawa.hiroyu@jp.fujitsu.com: fix migration_entry_wait()]
    [hugh@veritas.com: fix add_to_page_cache]
    [akpm@linux-foundation.org: repair a comment]
    Signed-off-by: Nick Piggin
    Cc: Jeff Garzik
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Hugh Dickins
    Cc: "Paul E. McKenney"
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

26 Jul, 2008

2 commits

  • A new call, mem_cgroup_shrink_usage(), is added for shmem handling,
    replacing non-standard usage of mem_cgroup_charge/uncharge.

    Currently, shmem calls mem_cgroup_charge() just to reclaim some pages from
    a mem_cgroup. In general, shmem is used by some process group, not as a
    global resource (like file caches), so it's reasonable to reclaim pages
    from the mem_cgroup where shmem is mainly used.

    [hugh@veritas.com: shmem_getpage release page sooner]
    [hugh@veritas.com: mem_cgroup_shrink_usage css_put]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Li Zefan
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Cc: David Rientjes
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • memcg: performance improvements

    Patch description:
    1/5 ... remove refcnt from page_cgroup patch (shmem handling is fixed)
    2/5 ... swapcache handling patch
    3/5 ... add helper function for shmem's memory reclaim patch
    4/5 ... optimize by likely/unlikely patch
    5/5 ... remove redundant check patch (shmem handling is fixed)

    UnixBench results:

    == 2.6.26-rc2-mm1 + memory resource controller
    Execl Throughput 2915.4 lps (29.6 secs, 3 samples)
    C Compiler Throughput 1019.3 lpm (60.0 secs, 3 samples)
    Shell Scripts (1 concurrent) 5796.0 lpm (60.0 secs, 3 samples)
    Shell Scripts (8 concurrent) 1097.7 lpm (60.0 secs, 3 samples)
    Shell Scripts (16 concurrent) 565.3 lpm (60.0 secs, 3 samples)
    File Read 1024 bufsize 2000 maxblocks 1022128.0 KBps (30.0 secs, 3 samples)
    File Write 1024 bufsize 2000 maxblocks 544057.0 KBps (30.0 secs, 3 samples)
    File Copy 1024 bufsize 2000 maxblocks 346481.0 KBps (30.0 secs, 3 samples)
    File Read 256 bufsize 500 maxblocks 319325.0 KBps (30.0 secs, 3 samples)
    File Write 256 bufsize 500 maxblocks 148788.0 KBps (30.0 secs, 3 samples)
    File Copy 256 bufsize 500 maxblocks 99051.0 KBps (30.0 secs, 3 samples)
    File Read 4096 bufsize 8000 maxblocks 2058917.0 KBps (30.0 secs, 3 samples)
    File Write 4096 bufsize 8000 maxblocks 1606109.0 KBps (30.0 secs, 3 samples)
    File Copy 4096 bufsize 8000 maxblocks 854789.0 KBps (30.0 secs, 3 samples)
    Dc: sqrt(2) to 99 decimal places 126145.2 lpm (30.0 secs, 3 samples)

    INDEX VALUES
    TEST BASELINE RESULT INDEX

    Execl Throughput 43.0 2915.4 678.0
    File Copy 1024 bufsize 2000 maxblocks 3960.0 346481.0 875.0
    File Copy 256 bufsize 500 maxblocks 1655.0 99051.0 598.5
    File Copy 4096 bufsize 8000 maxblocks 5800.0 854789.0 1473.8
    Shell Scripts (8 concurrent) 6.0 1097.7 1829.5
    =========
    FINAL SCORE 991.3

    == 2.6.26-rc2-mm1 + this set ==
    Execl Throughput 3012.9 lps (29.9 secs, 3 samples)
    C Compiler Throughput 981.0 lpm (60.0 secs, 3 samples)
    Shell Scripts (1 concurrent) 5872.0 lpm (60.0 secs, 3 samples)
    Shell Scripts (8 concurrent) 1120.3 lpm (60.0 secs, 3 samples)
    Shell Scripts (16 concurrent) 578.0 lpm (60.0 secs, 3 samples)
    File Read 1024 bufsize 2000 maxblocks 1003993.0 KBps (30.0 secs, 3 samples)
    File Write 1024 bufsize 2000 maxblocks 550452.0 KBps (30.0 secs, 3 samples)
    File Copy 1024 bufsize 2000 maxblocks 347159.0 KBps (30.0 secs, 3 samples)
    File Read 256 bufsize 500 maxblocks 314644.0 KBps (30.0 secs, 3 samples)
    File Write 256 bufsize 500 maxblocks 151852.0 KBps (30.0 secs, 3 samples)
    File Copy 256 bufsize 500 maxblocks 101000.0 KBps (30.0 secs, 3 samples)
    File Read 4096 bufsize 8000 maxblocks 2033256.0 KBps (30.0 secs, 3 samples)
    File Write 4096 bufsize 8000 maxblocks 1611814.0 KBps (30.0 secs, 3 samples)
    File Copy 4096 bufsize 8000 maxblocks 847979.0 KBps (30.0 secs, 3 samples)
    Dc: sqrt(2) to 99 decimal places 128148.7 lpm (30.0 secs, 3 samples)

    INDEX VALUES
    TEST BASELINE RESULT INDEX

    Execl Throughput 43.0 3012.9 700.7
    File Copy 1024 bufsize 2000 maxblocks 3960.0 347159.0 876.7
    File Copy 256 bufsize 500 maxblocks 1655.0 101000.0 610.3
    File Copy 4096 bufsize 8000 maxblocks 5800.0 847979.0 1462.0
    Shell Scripts (8 concurrent) 6.0 1120.3 1867.2
    =========
    FINAL SCORE 1004.6

    This patch:

    Remove the refcnt from page_cgroup.

    After this, a page is charged only when !page_mapped() and no page_cgroup
    is assigned, i.e. when:
    * an anon page is newly mapped, or
    * a file page is added to mapping->tree.

    A page is uncharged only when:
    * an anon page is fully unmapped, or
    * a file page is removed from the LRU.

    There is no change in behavior from the user's point of view.

    This patch also removes unnecessary calls in rmap.c which were used only
    for refcnt management.

    [akpm@linux-foundation.org: fix warning]
    [hugh@veritas.com: fix shmem_unuse_inode charging]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Li Zefan
    Cc: Hugh Dickins
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Cc: David Rientjes
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

25 Jul, 2008

1 commit

  • We have a request for tmpfs to support the AIO interface: easily done, no
    more than replacing the old shmem_file_read by shmem_file_aio_read,
    cribbed from generic_file_aio_read. (In 2.6.25 its write side was already
    changed to use generic_file_aio_write.)

    Incorporate cleanups from Andrew Morton and Harvey Harrison.

    Tests out fine with LTP's ltp-aiodio.sh, given hacks (not included) to
    support O_DIRECT. tmpfs cannot honestly support O_DIRECT: its
    cache-avoiding-IO nature is at odds with direct IO-avoiding-cache.
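
    For illustration, a minimal userspace use of the interface this enables:
    a POSIX AIO read from a file on tmpfs (a sketch only; /dev/shm is assumed
    to be tmpfs, and older glibc needs -lrt for the AIO functions):

        #include <aio.h>
        #include <errno.h>
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
            const char *path = "/dev/shm/aio-demo";      /* assumed tmpfs mount */
            char buf[64] = { 0 };
            struct aiocb cb = { 0 };
            const struct aiocb *const list[1] = { &cb };

            int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
            if (fd < 0) { perror("open"); return 1; }
            if (write(fd, "hello tmpfs aio\n", 16) != 16) { perror("write"); return 1; }

            cb.aio_fildes = fd;
            cb.aio_buf    = buf;
            cb.aio_nbytes = sizeof(buf) - 1;
            cb.aio_offset = 0;

            if (aio_read(&cb) != 0) { perror("aio_read"); return 1; }
            while (aio_error(&cb) == EINPROGRESS)        /* wait for completion */
                aio_suspend(list, 1, NULL);

            printf("aio_read returned %zd bytes: %s", aio_return(&cb), buf);
            close(fd);
            unlink(path);
            return 0;
        }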

    Signed-off-by: Hugh Dickins
    Tested-by: Lawrence Greenfield
    Cc: Christoph Rohland
    Cc: Badari Pulavarty
    Cc: Zach Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

30 Apr, 2008

1 commit

  • Add a new BDI capability flag: BDI_CAP_NO_ACCT_WB. If this flag is
    set, then don't update the per-bdi writeback stats from
    test_set_page_writeback() and test_clear_page_writeback().

    Misc cleanups:

    - convert bdi_cap_writeback_dirty() and friends to static inline functions
    - create a flag that includes all three dirty/writeback related flags,
      since almost all users will want to have them together

    Signed-off-by: Miklos Szeredi
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

28 Apr, 2008

9 commits

  • This patch replaces the mempolicy mode, mode_flags, and nodemask in the
    shmem_sb_info struct with a struct mempolicy pointer, initialized to NULL.
    This removes dependency on the details of mempolicy from shmem.c and hugetlbfs
    inode.c and simplifies the interfaces.

    mpol_parse_str() in mempolicy.c is changed to return, via a pointer to a
    pointer arg, a struct mempolicy pointer on success. For MPOL_DEFAULT, the
    returned pointer is NULL. Further, mpol_parse_str() now takes a 'no_context'
    argument that causes the input nodemask to be stored in the w.user_nodemask of
    the created mempolicy for use when the mempolicy is installed in a tmpfs inode
    shared policy tree. At that time, any cpuset contextualization is applied to
    the original input nodemask. This preserves the previous behavior where the
    input nodemask was stored in the superblock. We can think of the returned
    mempolicy as "context free".

    Because mpol_parse_str() is now calling mpol_new(), we can remove from
    mpol_to_str() the semantic checks that mpol_new() already performs.

    Add 'no_context' parameter to mpol_to_str() to specify that it should format
    the nodemask in w.user_nodemask for 'bind' and 'interleave' policies.

    Change mpol_shared_policy_init() to take a pointer to a "context free" struct
    mempolicy and to create a new, "contextualized" mempolicy using the mode,
    mode_flags and user_nodemask from the input mempolicy.

    Note: we know that the mempolicy passed to mpol_to_str() or
    mpol_shared_policy_init() from a tmpfs superblock is "context free". This
    is currently the only instance thereof. However, if we found more uses for
    this concept, and introduced any ambiguity as to whether a mempolicy was
    context free or not, we could add another internal mode flag to identify
    context free mempolicies. Then, we could remove the 'no_context' argument
    from mpol_to_str().

    Added shmem_get_sbmpol() to return a reference counted superblock mempolicy,
    if one exists, to pass to mpol_shared_policy_init(). We must add the
    reference under the sb stat_lock to prevent races with replacement of the mpol
    by remount. This reference is removed in mpol_shared_policy_init().

    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: another build fix]
    [akpm@linux-foundation.org: yet another build fix]
    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • mm/shmem.c currently contains functions to parse and display memory policy
    strings for the tmpfs 'mpol' mount option. Move this to mm/mempolicy.c with
    the rest of the mempolicy support. With subsequent patches, we'll be able to
    remove knowledge of the details [mode, flags, policy, ...] completely from
    shmem.c

    1) replace shmem_parse_mpol() in mm/shmem.c with mpol_parse_str() in
    mm/mempolicy.c. Rework to use the policy_types[] array [used by
    mpol_to_str()] to look up mode by name.

    2) use mpol_to_str() to format policy for shmem_show_mpol(). mpol_to_str()
    expects a pointer to a struct mempolicy, so temporarily construct one.
    This will be replaced with a reference to a struct mempolicy in the tmpfs
    superblock in a subsequent patch.

    NOTE 1: I changed mpol_to_str() to use a colon ':' rather than an equal
    sign '=' as the nodemask delimiter to match mpol_parse_str() and the
    tmpfs/shmem mpol mount option formatting that now uses mpol_to_str(). This
    is a user visible change to numa_maps, but then the addition of the mode
    flags already changed the display. It makes sense to me to have the mounts
    and numa_maps display the policy in the same format. However, if anyone
    objects strongly, I can pass the desired nodemask delimiter as an arg to
    mpol_to_str().

    Note 2: Like show_numa_map(), I don't check the return code from
    mpol_to_str(). I do use a longer buffer than the one provided by
    show_numa_map(), which seems to have sufficed so far.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • After further discussion with Christoph Lameter, it has become clear that my
    earlier attempts to clean up the mempolicy reference counting were a bit of
    overkill in some areas, resulting in superfluous ref/unref in what are usually
    fast paths. In other areas, further inspection reveals that I botched the
    unref for interleave policies.

    A separate patch, suitable for upstream/stable trees, fixes up the known
    errors in the previous attempt to fix reference counting.

    This patch reworks the memory policy referencing counting and, one hopes,
    simplifies the code. Maybe I'll get it right this time.

    See the update to the numa_memory_policy.txt document for a discussion of
    memory policy reference counting that motivates this patch.

    Summary:

    Lookup of mempolicy, based on (vma, address) need only add a reference for
    shared policy, and we need only unref the policy when finished for shared
    policies. So, this patch backs out all of the unneeded extra reference
    counting added by my previous attempt. It then unrefs only shared policies
    when we're finished with them, using the mpol_cond_put() [conditional put]
    helper function introduced by this patch.

    Note that shmem_swapin() calls read_swap_cache_async() with a dummy vma
    containing just the policy. read_swap_cache_async() can call alloc_page_vma()
    multiple times, so we can't let alloc_page_vma() unref the shared policy in
    this case. To avoid this, we make a copy of any non-null shared policy and
    remove the MPOL_F_SHARED flag from the copy. This copy occurs before reading
    a page [or multiple pages] from swap, so the overhead should not be an issue
    here.

    I introduced a new static inline function "mpol_cond_copy()" to copy the
    shared policy to an on-stack policy and remove the flags that would require a
    conditional free. The current implementation of mpol_cond_copy() assumes that
    the struct mempolicy contains no pointers to dynamically allocated structures
    that must be duplicated or reference counted during copy.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • This is a change that was requested some time ago by Mel Gorman. Makes sense
    to me, so here it is.

    Note: I retain the name "mpol_free_shared_policy()" because it actually does
    free the shared_policy, which is NOT a reference counted object. However, ...

    The mempolicy object[s] referenced by the shared_policy are reference counted,
    so mpol_put() is used to release the reference held by the shared_policy. The
    mempolicy might not be freed at this time, because some task attached to the
    shared object associated with the shared policy may be in the process of
    allocating a page based on the mempolicy. In that case, the task performing
    the allocation will hold a reference on the mempolicy, obtained via
    mpol_shared_policy_lookup(). The mempolicy will be freed when all tasks
    holding such a reference have called mpol_put() for the mempolicy.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Parsing of new mode flags in the tmpfs mpol mount option is slightly broken:

    Setting a valid flag works OK:
    #mount -o remount,mpol=bind=static:1-2 /dev/shm
    #mount
    ...
    tmpfs on /dev/shm type tmpfs (rw,mpol=bind=static:1-2)
    ...

    However, we can't remove them or change them, once we've
    set a valid flag:

    #mount -o remount,mpol=bind:1-2 /dev/shm
    #mount
    ...
    tmpfs on /dev/shm type tmpfs (rw,mpol=bind:1-2)
    ...

    It SAYS it removed it, but that's just a copy of the input
    string. If we now try to set it to a different flag, we
    get:

    #mount -o remount,mpol=bind=relative:1-2 /dev/shm
    mount: /dev/shm not mounted already, or bad option

    And on the console, we see:
    tmpfs: Bad value 'bind' for mount option 'mpol'
    ^ lost remainder of string

    Furthermore, bogus flags are accepted without error.
    Granted, they are a no-op:

    #mount -o remount,mpol=interleave=foo:0-3 /dev/shm
    #mount
    ...
    tmpfs on /dev/shm type tmpfs (rw,mpol=interleave=foo:0-3)

    Again, that's just a copy of the input string shown by the mount command.

    This patch fixes the behavior by pre-zeroing the flags so that only one of the
    mutually exclusive flags can be set at one time. It also reports an error
    when an unrecognized flag is specified.

    The check for both flags being set is removed because it can't happen with
    this implementation. If we ever want to support multiple non-exclusive flags,
    this area will need rework and we will need to check that any mutually
    exclusive flags aren't specified.
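
    For reference, the same remounts issued programmatically through mount(2);
    this is only a sketch and assumes root and a NUMA kernel with nodes 1-2
    online, since the mpol= string is handed to tmpfs as the mount data
    argument:

        #include <stdio.h>
        #include <sys/mount.h>

        int main(void)
        {
            if (mount("tmpfs", "/dev/shm", "tmpfs", MS_REMOUNT,
                      "mpol=bind=static:1-2") != 0)
                perror("remount mpol=bind=static:1-2");

            /* With the fix, an unknown flag such as "foo" is rejected with an
             * error instead of being silently accepted: */
            if (mount("tmpfs", "/dev/shm", "tmpfs", MS_REMOUNT,
                      "mpol=interleave=foo:0-3") != 0)
                perror("remount mpol=interleave=foo:0-3 (expected to fail)");

            return 0;
        }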

    Signed-off-by: Lee Schermerhorn
    Cc: David Rientjes
    Cc: Paul Jackson
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Cc: Eric Whitney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Adds another optional mode flag, MPOL_F_RELATIVE_NODES, that specifies
    nodemasks passed via set_mempolicy() or mbind() should be considered relative
    to the current task's mems_allowed.

    When the mempolicy is created, the passed nodemask is folded and mapped onto
    the current task's mems_allowed. For example, consider a task using
    set_mempolicy() to pass MPOL_INTERLEAVE | MPOL_F_RELATIVE_NODES with a
    nodemask of 1-3. If current's mems_allowed is 4-7, the resulting nodemask is
    5-7 (the second, third, and fourth node of mems_allowed).

    If the same task is attached to a cpuset, the mempolicy nodemask is rebound
    each time the mems are changed. Some possible rebinds and results are:

    mems result
    1-3 1-3
    1-7 2-4
    1,5-6 1,5-6
    1,5-7 5-7

    Likewise, the zonelist built for MPOL_BIND acts on the set of zones assigned
    to the resultant nodemask from the relative remap.

    In the MPOL_PREFERRED case, the preferred node is remapped from the currently
    effective nodemask to the relative nodemask.

    This mempolicy mode flag was conceived of by Paul Jackson.
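
    A minimal userspace sketch of requesting such a policy follows. The MPOL_*
    values mirror the kernel headers of this era and should be treated as
    assumptions (prefer <numaif.h> where it provides them), and the call needs
    a CONFIG_NUMA kernel that understands the flag:

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        /* Constants assumed from the kernel's mempolicy.h of this era. */
        #define MPOL_INTERLEAVE        3
        #define MPOL_F_RELATIVE_NODES  (1 << 14)

        int main(void)
        {
            /* Interleave over the 2nd..4th nodes of whatever mems_allowed the
             * task ends up with, as in the example above. */
            unsigned long nodes = (1UL << 1) | (1UL << 2) | (1UL << 3);

            if (syscall(SYS_set_mempolicy,
                        MPOL_INTERLEAVE | MPOL_F_RELATIVE_NODES,
                        &nodes, 8 * sizeof(nodes)) != 0)
                perror("set_mempolicy");
            return 0;
        }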

    Cc: Paul Jackson
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Add an optional mempolicy mode flag, MPOL_F_STATIC_NODES, that suppresses the
    node remap when the policy is rebound.

    Adds another member to struct mempolicy, nodemask_t user_nodemask, as part of
    a union with cpuset_mems_allowed:

    struct mempolicy {
        ...
        union {
            nodemask_t cpuset_mems_allowed;
            nodemask_t user_nodemask;
        } w;
    };

    that stores the nodemask that the user passed when he or she created the
    mempolicy via set_mempolicy() or mbind(). When using MPOL_F_STATIC_NODES,
    which is passed with any mempolicy mode, the user's passed nodemask
    intersected with the VMA or task's allowed nodes is always used when
    determining the preferred node, setting the MPOL_BIND zonelist, or creating
    the interleave nodemask. This happens whenever the policy is rebound,
    including when a task's cpuset assignment changes or the cpuset's mems are
    changed.

    This creates an interesting side-effect in that it allows the mempolicy
    "intent" to lie dormant and unaffected until it has access to the node(s)
    that it desires. For example, if you currently ask for an interleaved policy
    over a set of nodes that you do not have access to, the mempolicy is not
    created and the task continues to use the previous policy. With this change,
    however, it is possible to create the same mempolicy; it only takes effect
    when access to nodes in the nodemask is acquired.

    It is also possible to mount tmpfs with the static nodemask behavior when
    specifying a node or nodemask. To do this, simply add "=static" immediately
    following the mempolicy mode at mount time:

    mount -o remount mpol=interleave=static:1-3

    Also removes mpol_check_policy() and folds its logic into mpol_new(), since
    it is now obsolete. The unused vma_mpol_equal() is also removed.

    Cc: Paul Jackson
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • With the evolution of mempolicies, it is necessary to support mempolicy mode
    flags that specify how the policy shall behave in certain circumstances. The
    most immediate need for mode flag support is to suppress remapping the
    nodemask of a policy at the time of rebind.

    Both the mempolicy mode and flags are passed by the user in the 'int policy'
    formal of either the set_mempolicy() or mbind() syscall. A new constant,
    MPOL_MODE_FLAGS, represents the union of legal optional flags that may be
    passed as part of this int. Mempolicies that include illegal flags as part of
    their policy are rejected as invalid.

    An additional member to struct mempolicy is added to support the mode flags:

    struct mempolicy {
        ...
        unsigned short policy;
        unsigned short flags;
    };

    The splitting of the 'int' actual passed by the user is done in
    sys_set_mempolicy() and sys_mbind() for their respective syscalls. This is
    done by intersecting the actual with MPOL_MODE_FLAGS, rejecting the syscall if
    there are additional flags, and storing it in the new 'flags' member of struct
    mempolicy. The intersection of the actual with ~MPOL_MODE_FLAGS is stored in
    the 'policy' member of the struct and all current users of pol->policy remain
    unchanged.

    The union of the policy mode and optional mode flags is passed back to the
    user in get_mempolicy().

    This combination of mode and flags within the same actual does not break
    userspace code that relies on get_mempolicy(&policy, ...) and either

    switch (policy) {
    case MPOL_BIND:
        ...
    case MPOL_INTERLEAVE:
        ...
    };

    statements or

    if (policy == MPOL_INTERLEAVE) {
        ...
    }

    statements. Such applications would need to use optional mode flags when
    calling set_mempolicy() or mbind() for these previously implemented statements
    to stop working. If an application does start using optional mode flags, it
    will need to mask the optional flags off the policy in switch and conditional
    statements that only test mode.

    An additional member is also added to struct shmem_sb_info to store the
    optional mode flags.
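
    A small userspace sketch of the masking described above, reading back the
    combined mode+flags value with get_mempolicy(); the constants mirror the
    kernel headers and are assumptions here (MPOL_MODE_FLAGS is taken as the
    union of the two flags defined so far):

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        /* Constants assumed from the kernel's mempolicy.h of this era. */
        #define MPOL_BIND              2
        #define MPOL_INTERLEAVE        3
        #define MPOL_F_RELATIVE_NODES  (1 << 14)
        #define MPOL_F_STATIC_NODES    (1 << 15)
        #define MPOL_MODE_FLAGS        (MPOL_F_RELATIVE_NODES | MPOL_F_STATIC_NODES)

        int main(void)
        {
            int policy = 0;

            /* flags = 0, addr = NULL: return the calling thread's task policy */
            if (syscall(SYS_get_mempolicy, &policy, NULL, 0UL, NULL, 0UL) != 0) {
                perror("get_mempolicy");
                return 1;
            }

            int flags = policy & MPOL_MODE_FLAGS;    /* optional mode flags        */
            int mode  = policy & ~MPOL_MODE_FLAGS;   /* what switch () should test */

            switch (mode) {
            case MPOL_BIND:
                puts("bind");
                break;
            case MPOL_INTERLEAVE:
                puts("interleave");
                break;
            default:
                printf("mode %d\n", mode);
            }
            printf("flags 0x%x\n", flags);
            return 0;
        }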

    [hugh@veritas.com: shmem mpol: fix build warning]
    Cc: Paul Jackson
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Signed-off-by: David Rientjes
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The mempolicy mode constants, MPOL_DEFAULT, MPOL_PREFERRED, MPOL_BIND, and
    MPOL_INTERLEAVE, are better declared as part of an enum since they are
    sequentially numbered and cannot be combined.

    The policy member of struct mempolicy is also converted from type short to
    type unsigned short. A negative policy does not have any legitimate meaning,
    so it is possible to change its type in preparation for adding optional mode
    flags later.

    The equivalent member of struct shmem_sb_info is also changed from int to
    unsigned short.

    For compatibility, the policy formal to get_mempolicy() remains as a pointer
    to an int:

    int get_mempolicy(int *policy, unsigned long *nmask,
                      unsigned long maxnode, unsigned long addr,
                      unsigned long flags);

    although the only possible values are in the range of type unsigned short.

    Cc: Paul Jackson
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

20 Mar, 2008

1 commit


05 Mar, 2008

1 commit

  • My memcgroup patch to fix hang with shmem/tmpfs added NULL page handling to
    mem_cgroup_charge_common. It seemed convenient at the time, but hard to
    justify now: there's a perfectly appropriate swappage to charge and uncharge
    instead, this is not on any hot path through shmem_getpage, and no performance
    hit was observed from the slight extra overhead.

    So revert that NULL page handling from mem_cgroup_charge_common; and make it
    clearer by bringing page_cgroup_assign_new_page_cgroup into its body - that
    was a helper I found more of a hindrance to understanding.

    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

09 Feb, 2008

2 commits


08 Feb, 2008

1 commit

  • The memcgroup regime relies upon a cgroup reclaiming pages from itself within
    add_to_page_cache: which may involve some waiting. Whereas shmem and tmpfs
    rely upon using add_to_page_cache while holding a spinlock: when it cannot
    wait. The consequence is that when a cgroup reaches its limit, shmem_getpage
    just hangs - unless there is outside memory pressure too, neither kswapd nor
    radix_tree_preload get it out of the retry loop.

    In most cases we can mem_cgroup_cache_charge the page waitably first, to
    attach the page_cgroup in advance, so add_to_page_cache will do no more than
    increment a count; then mem_cgroup_uncharge_page after (in both success and
    failure cases) to balance the books again.

    And where there used to be a congestion_wait for kswapd (recently made
    redundant by radix_tree_preload), use mem_cgroup_cache_charge with NULL page
    to go through a cycle of allocation and freeing, without accounting to any
    particular page, and without updating the statistics vector. This brings the
    cgroup below its limit so the next try usually succeeds.

    Signed-off-by: Hugh Dickins
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

06 Feb, 2008

5 commits

  • This patch modifies the interface to inode_getsecurity to have the function
    return a buffer containing the security blob and its length via parameters
    instead of relying on the calling function to give it an appropriately sized
    buffer.

    Security blobs obtained with this function should be freed using the
    release_secctx LSM hook. This alleviates the problem of the caller having to
    guess a length and preallocate a buffer for this function allowing it to be
    used elsewhere for Labeled NFS.

    The patch also removed the unused err parameter. The conversion is similar to
    the one performed by Al Viro for the security_getprocattr hook.

    Signed-off-by: David P. Quigley
    Cc: Stephen Smalley
    Cc: Chris Wright
    Acked-by: James Morris
    Acked-by: Serge Hallyn
    Cc: Casey Schaufler
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David P. Quigley
     
  • Intensive swapoff testing shows shmem_unuse spinning on an entry in
    shmem_swaplist pointing to itself: how does that come about? Days pass...

    First guess is this: shmem_delete_inode tests list_empty without taking the
    global mutex (so the swapping case doesn't slow down the common case); but
    there's an instant in shmem_unuse_inode's list_move_tail when the list entry
    may appear empty (a rare case, because it's actually moving the head not the
    the list member). So there's a danger of leaving the inode on the swaplist
    when it's freed, then reinitialized to point to itself when reused. Fix that
    by skipping the list_move_tail when it's a no-op, which happens to plug this.

    But this same spinning then surfaces on another machine. Ah, I'd never
    suspected it, but shmem_writepage's swaplist manipulation is unsafe: though we
    still hold page lock, which would hold off inode deletion if the page were in
    pagecache, it doesn't hold off once it's in swapcache (free_swap_and_cache
    doesn't wait on locked pages). Hmm: we could put the inode on the swaplist
    earlier, but then shmem_unuse_inode could never prune unswapped inodes.

    Fix this with an igrab before dropping info->lock, as in shmem_unuse_inode;
    though I am a little uneasy about the iput which has to follow - it works, and
    I see nothing wrong with it, but it is surprising that shmem inode deletion
    may now occur below shmem_writepage. Revisit this fix later?

    And while we're looking at these races: the way shmem_unuse tests swapped
    without holding info->lock looks unsafe, if we've more than one swap area: a
    racing shmem_writepage on another page of the same inode could be putting it
    in swapcache, just as we're deciding to remove the inode from swaplist -
    there's a danger of going on swap without being listed, so a later swapoff
    would hang, being unable to locate the entry. Move that test and removal down
    into shmem_unuse_inode, once info->lock is held.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Nick has observed that shmem.c still uses GFP_ATOMIC when adding to page cache
    or swap cache, without any radix tree preload: so tending to deplete emergency
    reserves of memory.

    GFP_ATOMIC remains appropriate in shmem_writepage's add_to_swap_cache: it's
    being called under memory pressure, so must not wait for more memory to become
    available. But shmem_unuse_inode now has a window in which it can and should
    preload with GFP_KERNEL, and say GFP_NOWAIT instead of GFP_ATOMIC in its
    add_to_page_cache.

    shmem_getpage is not so straightforward: its filepage/swappage integrity
    relies upon exchanging between caches under spinlock, and it would need a lot
    of restructuring to place the preloads correctly. Instead, follow its pattern
    of retrying on races: use GFP_NOWAIT instead of GFP_ATOMIC in
    add_to_page_cache, and begin each circuit of the repeat loop with a sleeping
    radix_tree_preload, followed immediately by radix_tree_preload_end - that
    won't guarantee success in the next add_to_page_cache, but doesn't need to.

    And we can then remove that bothersome congestion_wait: when needed, it'll
    automatically get done in the course of the radix_tree_preload.

    Signed-off-by: Hugh Dickins
    Looks-good-to: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • There are a couple of reasons (patches follow) why it would be good to open a
    window for sleep in shmem_unuse_inode, between its search for a matching swap
    entry, and its handling of the entry found.

    shmem_unuse_inode must then use igrab to hold the inode against deletion in
    that window, and its corresponding iput might result in deletion: so it had
    better unlock_page before the iput, and might as well release the page too.

    Nor is there any need to hold on to shmem_swaplist_mutex once we know we'll
    leave the loop. So this unwinding moves from try_to_unuse and shmem_unuse
    into shmem_unuse_inode, in the case when it finds a match.

    Let try_to_unuse break on error in the shmem_unuse case, as it does in the
    unuse_mm case: though at this point in the series, no error to break on.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • shmem_unuse is at present an unbroken search through every swap vector page of
    every tmpfs file which might be swapped, all under shmem_swaplist_lock. This
    dates from long ago, when the caller held mmlist_lock over it all too: long
    gone, but there's never been much pressure for preemptible swapoff.

    Make it a little more preemptible, replacing shmem_swaplist_lock by
    shmem_swaplist_mutex, inserting a cond_resched in the main loop, and a
    cond_resched_lock (on info->lock) at one convenient point in the
    shmem_unuse_inode loop, where it has no outstanding kmap_atomic.

    If we're serious about preemptible swapoff, there's much further to go e.g.
    I'm stupid to let the kmap_atomics of the decreasingly significant HIGHMEM
    case dictate preemptibility for other configs. But as in the earlier patch
    to make swapoff scan ptes preemptibly, my hidden agenda is really towards
    making memcgroups work, hardly about preemptibility at all.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins