16 Sep, 2009

1 commit

  • Devtmpfs lets the kernel create a tmpfs instance called devtmpfs
    very early at kernel initialization, before any driver-core device
    is registered. Every device with a major/minor will provide a
    device node in devtmpfs.

    Devtmpfs can be changed and altered by userspace at any time,
    and in any way needed - just like today's udev-mounted tmpfs.
    Unmodified udev versions will run just fine on top of it, and will
    recognize an already existing kernel-created device node and use it.
    The default node permissions are root:root 0600. Proper permissions
    and user/group ownership, meaningful symlinks, all other policy still
    needs to be applied by userspace.

    If a node is created by devtmpfs, devtmpfs will remove the device node
    when the device goes away. If the device node was created by
    userspace, or the devtmpfs-created node was replaced by userspace, it
    will no longer be removed by devtmpfs.

    If devtmpfs is requested to auto-mount, it makes init=/bin/sh work
    without any further userspace support. /dev will be fully populated
    and dynamic, always reflecting the current device state of the kernel.
    With the commonly used dynamic device numbers, this solves the problem
    where static device nodes may point to the wrong devices.

    It is intended to make the initial bootup logic simpler and more robust,
    by de-coupling the creation of the initial environment needed to
    reliably run userspace processes from the complex userspace bootstrap
    logic that provides a working /dev.
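
    The driver-core hook pair this adds, in sketch form (names from
    drivers/base/devtmpfs.c; treat the exact signatures as illustrative):

        /* called from device_add(): create the node for devices with a dev_t */
        int devtmpfs_create_node(struct device *dev);

        /* called from device_del(): remove the node again, unless userspace
         * created or replaced it in the meantime */
        int devtmpfs_delete_node(struct device *dev);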

    Signed-off-by: Kay Sievers
    Signed-off-by: Jan Blunck
    Tested-By: Harald Hoyer
    Tested-By: Scott James Remnant
    Signed-off-by: Greg Kroah-Hartman

    Kay Sievers
     

17 Jun, 2009

2 commits

  • Since shmem_file_setup() does not modify, allocate, free, or pass on
    the given filename, mark it as const.
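
    A minimal sketch of the resulting change (the actual prototype is
    declared in include/linux/mm.h; treat this as illustrative):

        /* before */
        struct file *shmem_file_setup(char *name, loff_t size, unsigned long flags);

        /* after: the name is only read, so it can be const */
        struct file *shmem_file_setup(const char *name, loff_t size,
                                      unsigned long flags);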

    Signed-off-by: Sergei Trofimovich
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergei Trofimovich
     
  • A following patch will record the usage of swap cache in swap_map.
    This patch makes the necessary interface changes for that.

    2 interfaces:

    - swapcache_prepare()
    - swapcache_free()

    are added for allocating/freeing a refcnt from swap-cache on existing
    swap entries, but the implementation itself is not changed by this
    patch. While adding swapcache_free(), memcg's hook code is moved under
    swapcache_free(); this is better than using scattered hooks.
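
    A sketch of the two interfaces (the exact bodies in mm/swapfile.c may
    differ; this only illustrates the consolidation described above):

        /* take a swap-cache reference on an existing swap entry */
        int swapcache_prepare(swp_entry_t entry)
        {
                return swap_duplicate(entry);
        }

        /* drop it; the memcg hook now lives here instead of being scattered */
        void swapcache_free(swp_entry_t entry, struct page *page)
        {
                if (page)
                        mem_cgroup_uncharge_swapcache(page, entry);
                swap_free(entry);
        }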

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Acked-by: Balbir Singh
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Li Zefan
    Cc: Dhaval Giani
    Cc: YAMAMOTO Takashi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

22 May, 2009

2 commits

  • Based on discussion on lkml (Andrew Morton and Eric Paris), move
    ima_counts_get() down a layer into shmem_file_setup()/hugetlb_file_setup().
    This also resolves the drm shmem_file_setup() usage case.

    HD comment:
    I still think you're doing this at the wrong level, but recognize
    that you probably won't be persuaded until a few more users of
    alloc_file() emerge, all wanting your ima_counts_get().

    Resolving GEM's shmem_file_setup() is an improvement, so I'll say

    Acked-by: Hugh Dickins
    Signed-off-by: Mimi Zohar
    Signed-off-by: James Morris

    Mimi Zohar
     
  • - Add support in ima_path_check() for integrity checking without
    incrementing the counts. (Required for nfsd.)
    - rename and export opencount_get to ima_counts_get
    - replace ima_shm_check calls with ima_counts_get
    - export ima_path_check

    Signed-off-by: Mimi Zohar
    Signed-off-by: James Morris

    Mimi Zohar
     

03 May, 2009

1 commit

  • Current mem_cgroup_shrink_usage() has two problems.

    1. It doesn't call mem_cgroup_out_of_memory and doesn't update
    last_oom_jiffies, so pagefault_out_of_memory invokes global OOM.

    2. Considering hierarchy, shrinking has to be done from the
    mem_over_limit, not from the memcg which the page would be charged to.

    mem_cgroup_try_charge_swapin() does all of these things properly, so we
    use it and call cancel_charge_swapin when it succeeds.

    The name of "shrink_usage" is not appropriate for this behavior, so we
    change it too.
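
    The resulting try/cancel pairing, roughly (a sketch; the real call site
    and gfp mask are in mm/shmem.c):

        struct mem_cgroup *mem;

        /* charges against mem_over_limit and handles memcg OOM properly */
        if (mem_cgroup_try_charge_swapin(current->mm, page, GFP_KERNEL, &mem))
                return -ENOMEM;
        /* only the reclaim side effect was wanted, so give the charge back */
        mem_cgroup_cancel_charge_swapin(mem);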

    Signed-off-by: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Paul Menage
    Cc: Dhaval Giani
    Cc: Daisuke Nishimura
    Cc: YAMAMOTO Takashi
    Cc: KOSAKI Motohiro
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     

14 Apr, 2009

2 commits

  • SHMEM_MAX_BYTES was derived from the maximum size of its triple-indirect
    swap vector, forgetting to take the MAX_LFS_FILESIZE limit into account.
    Never mind 256kB pages, even 8kB pages on 32-bit kernels allowed files to
    grow slightly bigger than that supposed maximum.

    Fix this by using the min of both (at build time not run time). And it
    happens that this calculation is good as far as 8MB pages on 32-bit or
    16MB pages on 64-bit: though SHMSWP_MAX_INDEX gets truncated before that,
    it's truncated to such large numbers that we don't need to care.
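
    The fix, roughly as merged (a sketch of the constants in mm/shmem.c):

        #define SHMSWP_MAX_INDEX  (SHMEM_NR_DIRECT + \
                (ENTRIES_PER_PAGEPAGE / 2) * (ENTRIES_PER_PAGE + 1))
        #define SHMSWP_MAX_BYTES  (SHMSWP_MAX_INDEX << PAGE_CACHE_SHIFT)

        /* build-time min of the swap-vector limit and the LFS limit */
        #define SHMEM_MAX_BYTES \
                min_t(unsigned long long, SHMSWP_MAX_BYTES, MAX_LFS_FILESIZE)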

    [akpm@linux-foundation.org: it needs pagemap.h]
    [akpm@linux-foundation.org: fix sparc64 min() warnings]
    Signed-off-by: Hugh Dickins
    Cc: Yuri Tikhonov
    Cc: Paul Mackerras
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Fix a division by zero which we have in shmem_truncate_range() and
    shmem_unuse_inode() when using big PAGE_SIZE values (e.g. 256kB on
    ppc44x).

    With 256kB PAGE_SIZE, the ENTRIES_PER_PAGEPAGE constant becomes too large
    (0x1.0000.0000) on a 32-bit kernel, so this patch just changes its type
    from 'unsigned long' to 'unsigned long long'.

    Hugh: reverted its unsigned long longs in shmem_truncate_range() and
    shmem_getpage(): the pagecache index cannot be more than an unsigned long,
    so the divisions by zero occurred in unreached code. It's a pity we need
    any ULL arithmetic here, but I found no pretty way to avoid it.
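
    In code terms (a sketch of the type change in mm/shmem.c):

        #define ENTRIES_PER_PAGE (PAGE_CACHE_SIZE / sizeof(unsigned long))

        /* was plain 'unsigned long', which wraps to 0 at 0x1.0000.0000
         * when ENTRIES_PER_PAGE is 65536 (256kB pages on 32-bit) */
        #define ENTRIES_PER_PAGEPAGE \
                ((unsigned long long)ENTRIES_PER_PAGE * ENTRIES_PER_PAGE)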

    Signed-off-by: Yuri Tikhonov
    Signed-off-by: Hugh Dickins
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yuri Tikhonov
     

01 Apr, 2009

1 commit

  • Synopsis: if shmem_writepage calls swap_writepage directly, most shmem
    swap loads benefit, and a catastrophic interaction between SLUB and some
    flash storage is avoided.

    shmem_writepage() has always been peculiar in making no attempt to write:
    it has just transferred a shmem page from file cache to swap cache, then
    let that page make its way around the LRU again before being written and
    freed.

    The idea was that people use tmpfs because they want those pages to stay
    in RAM; so although we give it an overflow to swap, we should resist
    writing too soon, giving those pages a second chance before they can be
    reclaimed.

    That was always questionable, and I've toyed with this patch for years;
    but never had a clear justification to depart from the original design.

    It became more questionable in 2.6.28, when the split LRU patches classed
    shmem and tmpfs pages as SwapBacked rather than as file_cache: that in
    itself gives them more resistance to reclaim than normal file pages. I
    prepared this patch for 2.6.29, but the merge window arrived before I'd
    completed gathering statistics to justify sending it in.

    Then while comparing SLQB against SLUB, running SLUB on a laptop I'd
    habitually used with SLAB, I found SLUB to run my tmpfs kbuild swapping
    tests five times slower than SLAB or SLQB - other machines slower too, but
    nowhere near so bad. Simpler "cp -a" swapping tests showed the same.

    slub_max_order=0 brings sanity to all, but heavy swapping is too far from
    normal to justify such a tuning. The crucial factor on that laptop turns
    out to be that I'm using an SD card for swap. What happens is this:

    By default, SLUB uses order-2 pages for shmem_inode_cache (and many other
    fs inodes), so creating tmpfs files under memory pressure brings lumpy
    reclaim into play. One subpage of the order is chosen from the bottom of
    the LRU as usual, then the other three are picked out from their random
    positions on the LRUs.

    In a tmpfs load, many of these pages will be ones which already passed
    through shmem_writepage, so already have swap allocated. And though their
    offsets on swap were probably allocated sequentially, now that the pages
    are picked off at random, their swap offsets are scattered.

    But the flash storage on the SD card is very sensitive to having its
    writes merged: once swap is written at scattered offsets, performance
    falls apart. Rotating disk seeks increase too, but less disastrously.

    So: stop giving shmem/tmpfs pages a second pass around the LRU, write them
    out to swap as soon as their swap has been allocated.
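
    In code, the change amounts to this (a sketch; the swap-entry
    bookkeeping surrounding it in shmem_writepage() is omitted):

        /* the page has just moved from file cache to swap cache;
         * previously: set_page_dirty(page); unlock_page(page); return 0;
         * which sent it around the LRU again before being written */
        return swap_writepage(page, wbc);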

    It's surely possible to devise an artificial load which runs faster the
    old way, one whose sizing is such that the tmpfs pages on their second
    pass are the ones that are wanted again, and other pages not.

    But I've not yet found such a load: on all machines, under the loads I've
    tried, immediate swap_writepage speeds up shmem swapping: especially when
    using the SLUB allocator (and more effectively than slub_max_order=0), but
    also with the others; and it also reduces the variance between runs. How
    much faster varies widely: a factor of five is rare, 5% is common.

    One load which might have suffered: imagine a swapping shmem load in a
    limited mem_cgroup on a machine with plenty of memory. Before 2.6.29 the
    swapcache was not charged, and such a load would have run quickest with
    the shmem swapcache never written to swap. But now swapcache is charged,
    so even this load benefits from shmem_writepage directly to swap.

    Apologies for the #ifndef CONFIG_SWAP swap_writepage() stub in swap.h:
    it's silly because that will never get called; but refactoring shmem.c
    sensibly according to CONFIG_SWAP will be a separate task.

    Signed-off-by: Hugh Dickins
    Acked-by: Pekka Enberg
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

26 Feb, 2009

1 commit

  • Each time I exit Firefox, /proc/meminfo's Committed_AS goes down almost
    400 kB: OVERCOMMIT_NEVER would be allowing overcommits it should
    prohibit.

    Commit fc8744adc870a8d4366908221508bb113d8b72ee "Stop playing silly
    games with the VM_ACCOUNT flag" changed shmem_file_setup() to set the
    shmem file's VM_ACCOUNT flag according to VM_NORESERVE not being set in
    the vma flags; but did so only _after_ the shmem_acct_size(flags, size)
    call which is expected to pre-account a shared anonymous object.

    It's all clearer if we switch shmem.c over to use VM_NORESERVE
    throughout in place of !VM_ACCOUNT.

    But I very nearly sent in a patch which mistakenly removed the
    accounting from tmpfs files: shmem_get_inode()'s memset was good for not
    setting VM_ACCOUNT, but now it needs to set VM_NORESERVE.

    Rather than setting that by default, then perhaps clearing it again in
    shmem_file_setup(), let's pass it as a flag to shmem_get_inode(): that
    allows us to remove the #ifdef CONFIG_SHMEM from shmem_file_setup().
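
    A sketch of the interface change (illustrative call; the real one is in
    mm/shmem.c):

        /* shmem_get_inode() now receives the reservation policy explicitly,
         * instead of inferring it under #ifdef CONFIG_SHMEM afterwards */
        inode = shmem_get_inode(sb, S_IFREG | S_IRWXUGO, 0,
                                flags & VM_NORESERVE);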

    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

11 Feb, 2009

1 commit

  • Based on comments from Mike Frysinger and Randy Dunlap:
    (http://lkml.org/lkml/2009/2/9/262)
    - moved ima.h include before CONFIG_SHMEM test to fix compiler error
    on Blackfin:
    mm/shmem.c: In function 'shmem_zero_setup':
    mm/shmem.c:2670: error: implicit declaration of function 'ima_shm_check'

    - added 'struct linux_binprm' in ima.h to fix compiler warning on Blackfin:
    In file included from mm/shmem.c:32:
    include/linux/ima.h:25: warning: 'struct linux_binprm' declared inside
    parameter list
    include/linux/ima.h:25: warning: its scope is only this definition or
    declaration, which is probably not what you want

    - moved fs.h include within _LINUX_IMA_H definition
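
    The second fix is just a forward declaration (a sketch of the ima.h
    change; the prototype shown is an assumed example of the one at line 25):

        /* give struct linux_binprm file scope, so the prototype below no
         * longer declares it inside its own parameter list */
        struct linux_binprm;

        int ima_bprm_check(struct linux_binprm *bprm);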

    Signed-off-by: Mimi Zohar
    Signed-off-by: Mike Frysinger
    Signed-off-by: James Morris

    Mimi Zohar
     

06 Feb, 2009

2 commits

  • Conflicts:
    fs/namei.c

    Manually merged per:

    diff --cc fs/namei.c
    index 734f2b5,bbc15c2..0000000
    --- a/fs/namei.c
    +++ b/fs/namei.c
    @@@ -860,9 -848,8 +849,10 @@@ static int __link_path_walk(const char
               nd->flags |= LOOKUP_CONTINUE;
               err = exec_permission_lite(inode);
               if (err == -EAGAIN)
    -                  err = vfs_permission(nd, MAY_EXEC);
    +                  err = inode_permission(nd->path.dentry->d_inode,
    +                                         MAY_EXEC);
    +          if (!err)
    +                  err = ima_path_check(&nd->path, MAY_EXEC);
               if (err)
                       break;

    @@@ -1525,14 -1506,9 +1509,14 @@@ int may_open(struct path *path, int acc
                       flag &= ~O_TRUNC;
               }

    -          error = vfs_permission(nd, acc_mode);
    +          error = inode_permission(inode, acc_mode);
               if (error)
                       return error;
    +
    -          error = ima_path_check(&nd->path,
    ++         error = ima_path_check(path,
    +                  acc_mode & (MAY_READ | MAY_WRITE | MAY_EXEC));
    +          if (error)
    +                  return error;
               /*
                * An append-only file must be opened in append mode for writing.
                */

    Signed-off-by: James Morris

    James Morris
     
  • The number of calls to ima_path_check()/ima_file_free() should be
    balanced. An extra call to fput() indicates the file could have been
    accessed without first being measured.

    Although f_count is incremented/decremented in places other than
    fget/fput, like fget_light/fput_light and get_file, the current task
    must already hold a file refcnt. The call to __fput() is delayed until
    the refcnt becomes 0, resulting in ima_file_free() flagging any changes.

    - add hook to increment opencount for IPC shared memory (SYSV),
    shmat files, and /dev/zero
    - moved NULL iint test in opencount_get()

    Signed-off-by: Mimi Zohar
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    Mimi Zohar
     

01 Feb, 2009

1 commit

  • The mmap_region() code would temporarily set the VM_ACCOUNT flag for
    anonymous shared mappings just to inform shmem_zero_setup() that it
    should enable accounting for the resulting shm object. It would then
    clear the flag after calling ->mmap (for the /dev/zero case) or doing
    shmem_zero_setup() (for the MAP_ANON case).

    This not only resulted in vma merge issues, but also made for
    unnecessary confusion. Use the already-existing VM_NORESERVE flag for
    this instead, and let shmem_{zero|file}_setup() just figure it out from
    that.
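
    A sketch of the resulting setup path (assumed shape; the real code is
    in mm/shmem.c):

        int shmem_zero_setup(struct vm_area_struct *vma)
        {
                loff_t size = vma->vm_end - vma->vm_start;
                struct file *file;

                /* vm_flags now carries VM_NORESERVE; shmem derives the
                 * accounting decision from it, no flag games needed */
                file = shmem_file_setup("dev/zero", size, vma->vm_flags);
                if (IS_ERR(file))
                        return PTR_ERR(file);

                if (vma->vm_file)
                        fput(vma->vm_file);
                vma->vm_file = file;
                vma->vm_ops = &shmem_vm_ops;
                return 0;
        }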

    This also happens to make it obvious that the new DRI2 GEM layer uses a
    non-reserving backing store for its object allocation - which is quite
    possibly not intentional. But since I didn't want to change semantics
    in this patch, I left it alone, and just updated the caller to use the
    new flag semantics.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

09 Jan, 2009

4 commits

  • Now, you can see following even when swap accounting is enabled.

    1. Create Group 01, and 02.
    2. allocate a "file" on tmpfs by a task under 01.
    3. swap out the "file" (by memory pressure)
    4. Read "file" from a task in group 02.
    5. the charge of "file" is moved to group 02.

    This is not ideal behavior. It happens because SwapCache loaded by
    read-ahead is not taken into account.

    This is a patch to fix shmem's swapcache behavior.
    - remove mem_cgroup_cache_charge_swapin().
    - Add SwapCache handler routine to mem_cgroup_cache_charge().
    By this, shmem's file cache is charged at add_to_page_cache()
    with GFP_NOWAIT.
    - pass the page of swapcache to shrink_mem_cgroup.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • My patch, memcg-fix-gfp_mask-of-callers-of-charge.patch changed gfp_mask
    of callers of charge to be GFP_HIGHUSER_MOVABLE for showing what will
    happen at memory reclaim.

    But in recent discussion it was NACKed because it sounds ugly.

    This patch reverts it and adds some cleanup to the gfp_mask of callers
    of charge. There is no behavior change, but it needs review before it
    generates hunks deep in the patch queue.

    This patch also adds explanation to meaning of gfp_mask passed to charge
    functions in memcontrol.h.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • SwapCache support for memory resource controller (memcg)

    Before the mem+swap controller, memcg itself should handle SwapCache
    properly. This is cut out from that work.

    In current memcg, SwapCache is simply leaked and the user can create
    tons of SwapCache. This is an accounting leak and should be handled.

    SwapCache accounting is done as follows.

    charge (anon)
    - charged when it's mapped.
    (because of readahead, charging at add_to_swap_cache() is not sane)

    uncharge (anon)
    - uncharged when it's dropped from swapcache and fully unmapped;
    that means it's not uncharged at unmap.
    Note: deletion from swap cache at swap-in is done after rmap information
    is established.

    charge (shmem)
    - charged at swap-in. This prevents charging at add_to_page_cache().

    uncharge (shmem)
    - uncharged when it's dropped from swapcache and not on shmem's
    radix-tree.

    At migration, the check against the 'old page' is modified to handle
    shmem.

    Compared to the old version discussed (which caused trouble), we have
    the advantages of:
    - the PCG_USED bit.
    - simple migration handling.

    So the situation is much easier than several months ago, maybe.

    [hugh@veritas.com: memcg: handle swap caches build fix]
    Reviewed-by: Daisuke Nishimura
    Tested-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Fix misuse of GFP_KERNEL.

    Now, most callers of mem_cgroup_charge_xxx functions use GFP_KERNEL.

    I think that this is from the fact that page_cgroup *was* dynamically
    allocated.

    But now, we allocate all page_cgroup at boot. And
    mem_cgroup_try_to_free_pages() reclaim memory from GFP_HIGHUSER_MOVABLE +
    specified GFP_RECLAIM_MASK.

    * This is because we just want to reduce memory usage;
    "where should we reclaim from?" is not a problem in memcg.

    This patch modifies gfp masks to be GFP_HIGHUSER_MOVABLE where possible.

    Note: This patch is not for fixing behavior but for showing sane information
    in source code.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

07 Jan, 2009

2 commits

  • tiny-shmem shares most of its 130 lines of code with shmem and tends to
    break when particular bits of shmem get modified. Unifying saves code and
    makes keeping these two in sync much easier.

    before:
        text    data     bss     dec     hex  filename
       14367     392      24   14783    39bf  mm/shmem.o
         396      72       8     476     1dc  mm/tiny-shmem.o

    after:
        text    data     bss     dec     hex  filename
       14367     392      24   14783    39bf  mm/shmem.o
         412      72       8     492     1ec  mm/shmem.o (tiny config)

    Signed-off-by: Matt Mackall
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
     
  • Following "mm: don't mark_page_accessed in fault path", which now
    places a mark_page_accessed() in zap_pte_range(), we should remove
    the mark_page_accessed() from shmem_fault().

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

14 Nov, 2008

1 commit

  • Wrap access to task credentials so that they can be separated more easily from
    the task_struct during the introduction of COW creds.

    Change most current->(|e|s|fs)[ug]id to current_(|e|s|fs)[ug]id().

    Change some task->e?[ug]id to task_e?[ug]id(). In some places it makes more
    sense to use RCU directly rather than a convenient wrapper; these will be
    addressed by later patches.
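
    For instance (a sketch of the wrapper style; at this stage the wrappers
    are simple accessors, to become RCU-based once COW creds land):

        #define current_uid()    (current->uid)
        #define current_fsuid()  (current->fsuid)

        /* call sites change like this: */
        inode->i_uid = current_fsuid();   /* was: current->fsuid */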

    Signed-off-by: David Howells
    Reviewed-by: James Morris
    Acked-by: Serge Hallyn
    Cc: Al Viro
    Cc: linux-audit@redhat.com
    Cc: containers@lists.linux-foundation.org
    Cc: linux-mm@kvack.org
    Signed-off-by: James Morris

    David Howells
     

31 Oct, 2008

1 commit

  • Junjiro R. Okajima reported a problem where knfsd crashes if you use it
    to export shmemfs objects and run strict overcommit. In this situation
    the current->mm based modifier to the overcommit goes through a NULL
    pointer.

    We could simply check for NULL and skip the modifier, but we've caught
    other real bugs in the past from mm being NULL here - cases where we
    did need a valid mm set up (e.g. the exec bug about a year ago).

    To preserve the checks and get the logic we want, shuffle the checking
    around and add a new helper to the vm_ security wrappers.

    Also fix a current->mm reference in nommu that should use the passed mm.
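
    The new helper, sketched (it lives alongside the other vm_ security
    wrappers):

        /* like security_vm_enough_memory(), but the mm is passed explicitly,
         * so callers such as knfsd, which have no current->mm, are safe */
        int security_vm_enough_memory_mm(struct mm_struct *mm, long pages);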

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix build]
    Reported-by: Junjiro R. Okajima
    Acked-by: James Morris
    Signed-off-by: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Cox
     

20 Oct, 2008

3 commits

  • Shmem segments locked into memory via shmctl(SHM_LOCKED) should not be
    kept on the normal LRU, since scanning them is a waste of time and might
    throw off kswapd's balancing algorithms. Place them on the unevictable
    LRU list instead.

    Use the AS_UNEVICTABLE flag to mark address_space of SHM_LOCKed shared
    memory regions as unevictable. Then these pages will be culled off the
    normal LRU lists during vmscan.

    Add new wrapper function to clear the mapping's unevictable state when/if
    shared memory segment is munlocked.

    Add scan_mapping_unevictable_pages() to mm/vmscan.c to scan all pages
    in the shmem segment's mapping [struct address_space] for evictability
    now that they're no longer locked, moving any newly evictable pages to
    the appropriate zone LRU list.

    Changes depend on [CONFIG_]UNEVICTABLE_LRU.
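
    A sketch of how shmem_lock() uses the new state (assumed shape; the
    real function also handles the RLIMIT_MEMLOCK accounting):

        if (lock && !(info->flags & VM_LOCKED)) {
                info->flags |= VM_LOCKED;
                mapping_set_unevictable(file->f_mapping);
        }
        if (!lock && (info->flags & VM_LOCKED) && user) {
                info->flags &= ~VM_LOCKED;
                mapping_clear_unevictable(file->f_mapping);
                scan_mapping_unevictable_pages(file->f_mapping);
        }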

    [kosaki.motohiro@jp.fujitsu.com: revert shm change]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: Kosaki Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Split the LRU lists in two, one set for pages that are backed by real file
    systems ("file") and one for pages that are backed by memory and swap
    ("anon"). The latter includes tmpfs.

    The advantage of doing this is that the VM will not have to scan over lots
    of anonymous pages (which we generally do not want to swap out), just to
    find the page cache pages that it should evict.

    This patch has the infrastructure and a basic policy to balance how much
    we scan the anon lists and how much we scan the file lists. The big
    policy changes are in separate patches.

    [lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
    [kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
    [kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
    [hugh@veritas.com: memcg swapbacked pages active]
    [hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
    [akpm@linux-foundation.org: fix /proc/vmstat units]
    [nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
    [kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
    [kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Define page_file_cache() function to answer the question:
    is page backed by a file?

    Originally part of Rik van Riel's split-lru patch. Extracted to make
    available for other, independent reclaim patches.

    Moved inline function to linux/mm_inline.h where it will be needed by
    subsequent "split LRU" and "noreclaim" patches.

    Unfortunately this needs to use a page flag, since the PG_swapbacked state
    needs to be preserved all the way to the point where the page is last
    removed from the LRU. Trying to derive the status from other info in the
    page resulted in wrong VM statistics in earlier split VM patchsets.

    The total number of page flags in use on a 32 bit machine after this patch
    is 19.
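
    A minimal sketch of the idea (the in-tree helper differs in detail):

        /* file-backed pages can simply be dropped and re-read later;
         * swap-backed pages (anon, shmem, tmpfs) must go to swap */
        static inline int page_file_cache(struct page *page)
        {
                return !PageSwapBacked(page);
        }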

    [akpm@linux-foundation.org: fix up out-of-order merge fallout]
    [hugh@veritas.com: splitlru: shmem_getpage SetPageSwapBacked sooner]
    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: MinChan Kim
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

18 Oct, 2008

1 commit

  • GEM needs to create shmem files to back buffer objects. Though currently
    creation of files for objects could have been driven from userland, the
    modesetting work will require allocation of buffer objects before userland
    is running, for boot-time message display.

    Signed-off-by: Eric Anholt
    Cc: Nick Piggin
    Signed-off-by: Dave Airlie

    Keith Packard
     

13 Oct, 2008

1 commit

  • Discussion on the mailing list questioned the use of these
    magic values in userspace, concluding these values are already
    exported to userspace via statfs and their correct/incorrect
    usage is left up to the userspace application.

    - Move special fs magic number definitions to magic.h
    - Add magic.h include
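
    For reference, the tmpfs value involved (now kept in
    include/linux/magic.h):

        #define TMPFS_MAGIC  0x01021994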

    Signed-off-by: Mimi Zohar
    Reviewed-by: James Morris
    Signed-off-by: James Morris

    Mimi Zohar
     

05 Aug, 2008

1 commit

  • Converting the page lock to the new locking bitops requires a change of
    page flag operation naming, so we might as well convert it to something
    nicer (!TestSetPageLocked_Lock => trylock_page, SetPageLocked =>
    set_page_locked).

    This also facilitates lockdep annotation of the page lock.
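
    A sketch of the renamed primitive, built on the new locking bitops:

        static inline int trylock_page(struct page *page)
        {
                return !test_and_set_bit_lock(PG_locked, &page->flags);
        }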

    Signed-off-by: Nick Piggin
    Acked-by: KOSAKI Motohiro
    Acked-by: Peter Zijlstra
    Acked-by: Andrew Morton
    Acked-by: Benjamin Herrenschmidt
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

29 Jul, 2008

1 commit

  • SuSE's insserv initscript ordering program hits kernel BUG at
    mm/shmem.c:814 on 2.6.26. It's using posix_fadvise on directories, and
    the shmem_readpage method added in 2.6.23 is letting POSIX_FADV_WILLNEED
    allocate useless pages to a tmpfs directory, incrementing the i_blocks
    count but never decrementing it.

    Fix this by assigning shmem_aops (pointing to readpage, writepage and
    set_page_dirty) only when it's needed: on a regular file or a long
    symlink.

    Many thanks to Kel for the outstanding bug report and steps to
    reproduce it.

    Reported-by: Kel Modderman
    Tested-by: Kel Modderman
    Signed-off-by: Hugh Dickins
    Cc: [2.6.25.x, 2.6.26.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

27 Jul, 2008

2 commits

  • The kmem cache passed to a constructor is only needed for constructors
    that are themselves multiplexers. Nobody uses this "feature", nor does
    anybody use the passed kmem cache in a non-trivial way, so pass only a
    pointer to the object.

    Non-trivial places are:
    arch/powerpc/mm/init_64.c
    arch/powerpc/mm/hugetlbpage.c

    This is flag day, yes.
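
    The signature change, in sketch form (the constructor name is
    illustrative):

        /* before: constructors received the cache they were invoked for */
        void init_once(struct kmem_cache *cachep, void *foo);

        /* after: only the object pointer is passed */
        void init_once(void *foo);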

    Signed-off-by: Alexey Dobriyan
    Acked-by: Pekka Enberg
    Acked-by: Christoph Lameter
    Cc: Jon Tollefson
    Cc: Nick Piggin
    Cc: Matt Mackall
    [akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c]
    [akpm@linux-foundation.org: fix mm/slab.c]
    [akpm@linux-foundation.org: fix ubifs]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • If we can be sure that elevating the page_count on a pagecache page will
    pin it, we can speculatively run this operation, and subsequently check to
    see if we hit the right page rather than relying on holding a lock or
    otherwise pinning a reference to the page.

    This can be done if get_page/put_page behaves consistently throughout the
    whole tree (ie. if we "get" the page after it has been used for something
    else, we must be able to free it with a put_page).

    Actually, there is a period where the count behaves differently: when the
    page is free or if it is a constituent page of a compound page. We need
    an atomic_inc_not_zero operation to ensure we don't try to grab the page
    in either case.

    This patch introduces the core locking protocol to the pagecache (ie.
    adds page_cache_get_speculative, and tweaks some update-side code to make
    it work).

    Thanks to Hugh for pointing out an improvement to the algorithm setting
    page_count to zero when we have control of all references, in order to
    hold off speculative getters.
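
    A minimal sketch of the lookup side of the protocol (close to what
    find_get_page() becomes; details differ):

        rcu_read_lock();
        repeat:
        page = radix_tree_lookup(&mapping->page_tree, offset);
        if (page) {
                if (!page_cache_get_speculative(page))
                        goto repeat;    /* page was being freed under us */
                /* the speculative ref may have pinned a reused page:
                 * verify it is still the one we looked up */
                if (unlikely(page != radix_tree_lookup(&mapping->page_tree,
                                                       offset))) {
                        page_cache_release(page);
                        goto repeat;
                }
        }
        rcu_read_unlock();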

    [kamezawa.hiroyu@jp.fujitsu.com: fix migration_entry_wait()]
    [hugh@veritas.com: fix add_to_page_cache]
    [akpm@linux-foundation.org: repair a comment]
    Signed-off-by: Nick Piggin
    Cc: Jeff Garzik
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Hugh Dickins
    Cc: "Paul E. McKenney"
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

26 Jul, 2008

2 commits

  • A new call, mem_cgroup_shrink_usage(), is added for shmem handling,
    replacing non-standard usage of mem_cgroup_charge/uncharge.

    Now, shmem calls mem_cgroup_charge() just to reclaim some pages from a
    mem_cgroup. In general, shmem is used by some process group and not as
    a global resource (like file caches). So, it's reasonable to reclaim
    pages from the mem_cgroup where shmem is mainly used.

    [hugh@veritas.com: shmem_getpage release page sooner]
    [hugh@veritas.com: mem_cgroup_shrink_usage css_put]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Li Zefan
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Cc: David Rientjes
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • memcg: performance improvements

    Patch description:
    1/5 ... remove refcnt from page_cgroup patch (shmem handling is fixed)
    2/5 ... swapcache handling patch
    3/5 ... add helper function for shmem's memory reclaim patch
    4/5 ... optimize by likely/unlikely patch
    5/5 ... remove redundant check patch (shmem handling is fixed.)

    Unix bench result.

    == 2.6.26-rc2-mm1 + memory resource controller
    Execl Throughput 2915.4 lps (29.6 secs, 3 samples)
    C Compiler Throughput 1019.3 lpm (60.0 secs, 3 samples)
    Shell Scripts (1 concurrent) 5796.0 lpm (60.0 secs, 3 samples)
    Shell Scripts (8 concurrent) 1097.7 lpm (60.0 secs, 3 samples)
    Shell Scripts (16 concurrent) 565.3 lpm (60.0 secs, 3 samples)
    File Read 1024 bufsize 2000 maxblocks 1022128.0 KBps (30.0 secs, 3 samples)
    File Write 1024 bufsize 2000 maxblocks 544057.0 KBps (30.0 secs, 3 samples)
    File Copy 1024 bufsize 2000 maxblocks 346481.0 KBps (30.0 secs, 3 samples)
    File Read 256 bufsize 500 maxblocks 319325.0 KBps (30.0 secs, 3 samples)
    File Write 256 bufsize 500 maxblocks 148788.0 KBps (30.0 secs, 3 samples)
    File Copy 256 bufsize 500 maxblocks 99051.0 KBps (30.0 secs, 3 samples)
    File Read 4096 bufsize 8000 maxblocks 2058917.0 KBps (30.0 secs, 3 samples)
    File Write 4096 bufsize 8000 maxblocks 1606109.0 KBps (30.0 secs, 3 samples)
    File Copy 4096 bufsize 8000 maxblocks 854789.0 KBps (30.0 secs, 3 samples)
    Dc: sqrt(2) to 99 decimal places 126145.2 lpm (30.0 secs, 3 samples)

    INDEX VALUES
    TEST BASELINE RESULT INDEX

    Execl Throughput 43.0 2915.4 678.0
    File Copy 1024 bufsize 2000 maxblocks 3960.0 346481.0 875.0
    File Copy 256 bufsize 500 maxblocks 1655.0 99051.0 598.5
    File Copy 4096 bufsize 8000 maxblocks 5800.0 854789.0 1473.8
    Shell Scripts (8 concurrent) 6.0 1097.7 1829.5
    =========
    FINAL SCORE 991.3

    == 2.6.26-rc2-mm1 + this set ==
    Execl Throughput 3012.9 lps (29.9 secs, 3 samples)
    C Compiler Throughput 981.0 lpm (60.0 secs, 3 samples)
    Shell Scripts (1 concurrent) 5872.0 lpm (60.0 secs, 3 samples)
    Shell Scripts (8 concurrent) 1120.3 lpm (60.0 secs, 3 samples)
    Shell Scripts (16 concurrent) 578.0 lpm (60.0 secs, 3 samples)
    File Read 1024 bufsize 2000 maxblocks 1003993.0 KBps (30.0 secs, 3 samples)
    File Write 1024 bufsize 2000 maxblocks 550452.0 KBps (30.0 secs, 3 samples)
    File Copy 1024 bufsize 2000 maxblocks 347159.0 KBps (30.0 secs, 3 samples)
    File Read 256 bufsize 500 maxblocks 314644.0 KBps (30.0 secs, 3 samples)
    File Write 256 bufsize 500 maxblocks 151852.0 KBps (30.0 secs, 3 samples)
    File Copy 256 bufsize 500 maxblocks 101000.0 KBps (30.0 secs, 3 samples)
    File Read 4096 bufsize 8000 maxblocks 2033256.0 KBps (30.0 secs, 3 samples)
    File Write 4096 bufsize 8000 maxblocks 1611814.0 KBps (30.0 secs, 3 samples)
    File Copy 4096 bufsize 8000 maxblocks 847979.0 KBps (30.0 secs, 3 samples)
    Dc: sqrt(2) to 99 decimal places 128148.7 lpm (30.0 secs, 3 samples)

    INDEX VALUES
    TEST BASELINE RESULT INDEX

    Execl Throughput 43.0 3012.9 700.7
    File Copy 1024 bufsize 2000 maxblocks 3960.0 347159.0 876.7
    File Copy 256 bufsize 500 maxblocks 1655.0 101000.0 610.3
    File Copy 4096 bufsize 8000 maxblocks 5800.0 847979.0 1462.0
    Shell Scripts (8 concurrent) 6.0 1120.3 1867.2
    =========
    FINAL SCORE 1004.6

    This patch:

    Remove refcnt from page_cgroup.

    After this:

    * A page is charged only when !page_mapped() && no page_cgroup is
    assigned:
      * an anon page is newly mapped, or
      * a file page is added to mapping->tree.

    * A page is uncharged only when:
      * an anon page is fully unmapped, or
      * a file page is removed from the LRU.

    There is no change in behavior from user's view.

    This patch also removes unnecessary calls in rmap.c which were used
    only for refcnt management.

    [akpm@linux-foundation.org: fix warning]
    [hugh@veritas.com: fix shmem_unuse_inode charging]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Li Zefan
    Cc: Hugh Dickins
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Cc: David Rientjes
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

25 Jul, 2008

1 commit

  • We have a request for tmpfs to support the AIO interface: easily done, no
    more than replacing the old shmem_file_read by shmem_file_aio_read,
    cribbed from generic_file_aio_read. (In 2.6.25 its write side was already
    changed to use generic_file_aio_write.)

    Incorporate cleanups from Andrew Morton and Harvey Harrison.

    Tests out fine with LTP's ltp-aiodio.sh, given hacks (not included) to
    support O_DIRECT. tmpfs cannot honestly support O_DIRECT: its
    cache-avoiding-IO nature is at odds with direct IO-avoiding-cache.
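
    The visible effect on tmpfs's file_operations, sketched (the real table
    has more entries and CONFIG_TMPFS guards):

        static const struct file_operations shmem_file_operations = {
                .mmap       = shmem_mmap,
                .llseek     = generic_file_llseek,
                .read       = do_sync_read,
                .write      = do_sync_write,
                .aio_read   = shmem_file_aio_read,   /* new: AIO reads */
                .aio_write  = generic_file_aio_write,
        };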

    Signed-off-by: Hugh Dickins
    Tested-by: Lawrence Greenfield
    Cc: Christoph Rohland
    Cc: Badari Pulavarty
    Cc: Zach Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

30 Apr, 2008

1 commit

  • Add a new BDI capability flag: BDI_CAP_NO_ACCT_WB. If this flag is
    set, then don't update the per-bdi writeback stats from
    test_set_page_writeback() and test_clear_page_writeback().

    Misc cleanups:

    - convert bdi_cap_writeback_dirty() and friends to static inline functions
    - create a flag that includes all three dirty/writeback related flags,
    since almost all users will want to have them together
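
    The combined flag, roughly as described (see
    include/linux/backing-dev.h):

        #define BDI_CAP_NO_ACCT_AND_WRITEBACK \
                (BDI_CAP_NO_WRITEBACK | BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_ACCT_WB)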

    Signed-off-by: Miklos Szeredi
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

28 Apr, 2008

1 commit

  • This patch replaces the mempolicy mode, mode_flags, and nodemask in the
    shmem_sb_info struct with a struct mempolicy pointer, initialized to NULL.
    This removes dependency on the details of mempolicy from shmem.c and hugetlbfs
    inode.c and simplifies the interfaces.

    mpol_parse_str() in mempolicy.c is changed to return, via a pointer to a
    pointer arg, a struct mempolicy pointer on success. For MPOL_DEFAULT, the
    returned pointer is NULL. Further, mpol_parse_str() now takes a 'no_context'
    argument that causes the input nodemask to be stored in the w.user_nodemask of
    the created mempolicy for use when the mempolicy is installed in a tmpfs inode
    shared policy tree. At that time, any cpuset contextualization is applied to
    the original input nodemask. This preserves the previous behavior where the
    input nodemask was stored in the superblock. We can think of the returned
    mempolicy as "context free".

    Because mpol_parse_str() is now calling mpol_new(), we can remove from
    mpol_to_str() the semantic checks that mpol_new() already performs.

    Add 'no_context' parameter to mpol_to_str() to specify that it should format
    the nodemask in w.user_nodemask for 'bind' and 'interleave' policies.

    Change mpol_shared_policy_init() to take a pointer to a "context free" struct
    mempolicy and to create a new, "contextualized" mempolicy using the mode,
    mode_flags and user_nodemask from the input mempolicy.

    Note: we know that the mempolicy passed to mpol_to_str() or
    mpol_shared_policy_init() from a tmpfs superblock is "context free". This
    is currently the only instance thereof. However, if we found more uses for
    this concept, and introduced any ambiguity as to whether a mempolicy was
    context free or not, we could add another internal mode flag to identify
    context free mempolicies. Then, we could remove the 'no_context' argument
    from mpol_to_str().

    Added shmem_get_sbmpol() to return a reference counted superblock mempolicy,
    if one exists, to pass to mpol_shared_policy_init(). We must add the
    reference under the sb stat_lock to prevent races with replacement of the mpol
    by remount. This reference is removed in mpol_shared_policy_init().
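
    The reworked interfaces, in sketch form (parameter names assumed):

        /* parse a mempolicy spec, e.g. a tmpfs mount option; on success
         * *mpol is a "context free" mempolicy (NULL for MPOL_DEFAULT) */
        int mpol_parse_str(char *str, struct mempolicy **mpol, int no_context);

        /* format a mempolicy; with no_context, emit w.user_nodemask for
         * 'bind' and 'interleave' policies */
        int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol,
                        int no_context);

        /* create a contextualized shared policy from a context-free one */
        void mpol_shared_policy_init(struct shared_policy *sp,
                                     struct mempolicy *mpol);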

    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: another build fix]
    [akpm@linux-foundation.org: yet another build fix]
    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn