28 Sep, 2009

1 commit


26 Sep, 2009

2 commits

  • * 'writeback' of git://git.kernel.dk/linux-2.6-block:
    writeback: writeback_inodes_sb() should use bdi_start_writeback()
    writeback: don't delay inodes redirtied by a fast dirtier
    writeback: make the super_block pinning more efficient
    writeback: don't resort for a single super_block in move_expired_inodes()
    writeback: move inodes from one super_block together
    writeback: get rid of incorrect references to pdflush in comments
    writeback: improve readability of the wb_writeback() continue/break logic
    writeback: cleanup writeback_single_inode()
    writeback: kupdate writeback shall not stop when more io is possible
    writeback: stop background writeback when below background threshold
    writeback: balance_dirty_pages() shall write more than dirtied pages
    fs: Fix busyloop in wb_writeback()

    Linus Torvalds
     
  • Signed-off-by: Jens Axboe

    Jens Axboe
     

24 Sep, 2009

1 commit

  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
    HWPOISON: Enable error_remove_page on btrfs
    HWPOISON: Add simple debugfs interface to inject hwpoison on arbitrary PFNs
    HWPOISON: Add madvise() based injector for hardware poisoned pages v4
    HWPOISON: Enable error_remove_page for NFS
    HWPOISON: Enable .remove_error_page for migration aware file systems
    HWPOISON: The high level memory error handler in the VM v7
    HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
    HWPOISON: shmem: call set_page_dirty() with locked page
    HWPOISON: Define a new error_remove_page address space op for async truncation
    HWPOISON: Add invalidate_inode_page
    HWPOISON: Refactor truncate to allow direct truncating of page v2
    HWPOISON: check and isolate corrupted free pages v2
    HWPOISON: Handle hardware poisoned pages in try_to_unmap
    HWPOISON: Use bitmask/action code for try_to_unmap behaviour
    HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
    HWPOISON: Add poison check to page fault handling
    HWPOISON: Add basic support for poisoned pages in fault handler v3
    HWPOISON: Add new SIGBUS error codes for hardware poison signals
    HWPOISON: Add support for poison swap entries v2
    HWPOISON: Export some rmap vma locking to outside world
    ...

    Linus Torvalds
     

22 Sep, 2009

4 commits

  • Fixes the following kmemcheck false positive (the compiler is using
    a 32-bit mov to load the 16-bit sbinfo->mode in shmem_fill_super):

    [ 0.337000] Total of 1 processors activated (3088.38 BogoMIPS).
    [ 0.352000] CPU0 attaching NULL sched-domain.
    [ 0.360000] WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (9f8020fc)
    [ 0.361000] a44240820000000041f6998100000000000000000000000000000000ff030000
    [ 0.368000] i i i i i i i i i i i i i i i i u u u u i i i i i i i i i i u u
    [ 0.375000] ^
    [ 0.376000]
    [ 0.377000] Pid: 9, comm: khelper Not tainted (2.6.31-tip #206) P4DC6
    [ 0.378000] EIP: 0060:[] EFLAGS: 00010246 CPU: 0
    [ 0.379000] EIP is at shmem_fill_super+0xb5/0x120
    [ 0.380000] EAX: 00000000 EBX: 9f845400 ECX: 824042a4 EDX: 8199f641
    [ 0.381000] ESI: 9f8020c0 EDI: 9f845400 EBP: 9f81af68 ESP: 81cd6eec
    [ 0.382000] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
    [ 0.383000] CR0: 8005003b CR2: 9f806200 CR3: 01ccd000 CR4: 000006d0
    [ 0.384000] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
    [ 0.385000] DR6: ffff4ff0 DR7: 00000400
    [ 0.386000] [] get_sb_nodev+0x3c/0x80
    [ 0.388000] [] shmem_get_sb+0x14/0x20
    [ 0.390000] [] vfs_kern_mount+0x4f/0x120
    [ 0.392000] [] init_tmpfs+0x7e/0xb0
    [ 0.394000] [] do_basic_setup+0x17/0x30
    [ 0.396000] [] kernel_init+0x57/0xa0
    [ 0.398000] [] kernel_thread_helper+0x7/0x10
    [ 0.400000] [] 0xffffffff
    [ 0.402000] khelper used greatest stack depth: 2820 bytes left
    [ 0.407000] calling init_mmap_min_addr+0x0/0x10 @ 1
    [ 0.408000] initcall init_mmap_min_addr+0x0/0x10 returned 0 after 0 usecs
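
    The warning points at an uninitialized field of struct shmem_sb_info. A
    minimal sketch of the kind of fix this implies (assuming the superblock
    info is allocated in shmem_fill_super(); not the verbatim patch) is to
    zero the whole structure at allocation time:

    struct shmem_sb_info *sbinfo;

    /* zero the structure so no field (e.g. the 16-bit mode) is left uninitialized */
    sbinfo = kzalloc(sizeof(struct shmem_sb_info), GFP_KERNEL);
    if (!sbinfo)
            return -ENOMEM;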

    Reported-by: Ingo Molnar
    Analysed-by: Vegard Nossum
    Signed-off-by: Pekka Enberg
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     
  • CONFIG_SHMEM off gives you (ramfs masquerading as) tmpfs, even when
    CONFIG_TMPFS is off: that's a little anomalous, and I'd intended to make
    more sense of it by removing CONFIG_TMPFS altogether, always enabling its
    code when CONFIG_SHMEM is on; but so many defconfigs have CONFIG_SHMEM on
    and CONFIG_TMPFS off that we'd better leave that as is.

    But there is no point in asking for CONFIG_TMPFS if CONFIG_SHMEM is off:
    make TMPFS depend on SHMEM, which also prevents the TMPFS_POSIX_ACL
    shmem_acl.o from being pointlessly built into the kernel when SHMEM is off.

    And a selfish change, to prevent the world from being rebuilt when I
    switch between CONFIG_SHMEM on and off: the only use of CONFIG_SHMEM in
    the header files is the mm.h shmem_lock() declaration - give that a
    shmem.c stub instead.

    Signed-off-by: Hugh Dickins
    Acked-by: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Fix the following 'make includecheck' warning:

    mm/shmem.c: linux/vfs.h is included more than once.

    Signed-off-by: Jaswinder Singh Rajput
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jaswinder Singh Rajput
     
  • After commit 355cfa73 ("mm: modify swap_map and add SWAP_HAS_CACHE flag"),
    only contexts that have set the SWAP_HAS_CACHE flag via swapcache_prepare()
    or get_swap_page() call add_to_swap_cache(), so add_to_swap_cache() no
    longer returns -EEXIST.

    Even though it doesn't return -EEXIST, it is conceptually bad behaviour to
    call swapcache_prepare() in the -EEXIST case, because it means clearing the
    SWAP_HAS_CACHE flag while the entry is in the swap cache.

    This patch removes the now-redundant code and comments from callers of
    add_to_swap_cache(), and adds a VM_BUG_ON() plus some comments to its error path.
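
    A hedged illustration of the simplified caller contract (the wrapper name
    below is made up for illustration; the real change is in mm/swap_state.c
    and its callers):

    /* illustration only: with SWAP_HAS_CACHE already set by
     * swapcache_prepare()/get_swap_page(), callers need no -EEXIST retry */
    static int add_to_swap_cache_checked(struct page *page, swp_entry_t entry,
                                         gfp_t gfp_mask)
    {
            int error = add_to_swap_cache(page, entry, gfp_mask);

            VM_BUG_ON(error == -EEXIST);    /* can no longer happen */
            return error;
    }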

    Signed-off-by: Daisuke Nishimura
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     

16 Sep, 2009

3 commits

  • Enable removal of corrupted pages through truncation
    for a bunch of file systems: ext*, xfs, gfs2, ocfs2, ntfs
    These should cover most server needs.

    I chose the set of migration aware file systems for this
    for now, assuming they have been especially audited.
    But in general it should be safe for all file systems
    on the data area that support read/write and truncate.

    Caveat: the hardware error handler does not take i_mutex
    for now before calling the truncate function. Is that ok?
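
    For illustration, wiring up the new operation in a filesystem's
    address_space_operations looks roughly like this (a sketch assuming the
    generic helper added by this series; example_aops is hypothetical):

    static const struct address_space_operations example_aops = {
            .readpage          = simple_readpage,
            /* allow hwpoison to truncate a corrupted, clean pagecache page */
            .error_remove_page = generic_error_remove_page,
    };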

    Cc: tytso@mit.edu
    Cc: hch@infradead.org
    Cc: mfasheh@suse.com
    Cc: aia21@cantab.net
    Cc: hugh.dickins@tiscali.co.uk
    Cc: swhiteho@redhat.com
    Signed-off-by: Andi Kleen

    Andi Kleen
     
  • The dirtying of the page and the set_page_dirty() call can both be moved
    inside the page lock.

    - In shmem_write_end(), the page is dirtied while the page lock is held,
    but it is only marked dirty just after dropping the page lock.
    - In shmem_symlink(), both the dirtying and the marking can be moved inside
    the page lock.

    It's valuable for the hwpoison code to know whether a bad page can be dropped
    without losing data. It mainly judges this by testing the PG_dirty bit after
    taking the page lock, so it becomes important that dirtying the page and
    marking it dirty are both done inside the page lock - which is common
    practice, but sadly not a rule.

    The noticeable exceptions are
    - mapped pages
    - pages with buffer_heads
    These pages can go dirty at any time. Fortunately the hwpoison code will
    unmap the page and release the buffer_heads beforehand anyway.

    Many other types of pages (e.g. metadata pages) can also be dirtied at will by
    their owners; the hwpoison code cannot do anything meaningful with them anyway.
    Only the dirtiness of pagecache pages owned by regular files is of interest here.
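
    A minimal sketch of the ordering this patch enforces (names are
    illustrative, not the verbatim shmem diff):

    static void example_finish_write(struct page *page)
    {
            /* dirty the page and record the dirtiness while still locked */
            flush_dcache_page(page);
            SetPageUptodate(page);
            set_page_dirty(page);           /* previously done after unlock_page() */
            unlock_page(page);
            page_cache_release(page);
    }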

    v2: AK: Add comment about set_page_dirty rules (suggested by Peter Zijlstra)

    Acked-by: Hugh Dickins
    Reviewed-by: WANG Cong
    Signed-off-by: Wu Fengguang
    Signed-off-by: Andi Kleen

    Wu Fengguang
     
  • Devtmpfs lets the kernel create a tmpfs instance called devtmpfs
    very early at kernel initialization, before any driver-core device
    is registered. Every device with a major/minor will provide a
    device node in devtmpfs.

    Devtmpfs can be changed and altered by userspace at any time,
    and in any way needed - just like today's udev-mounted tmpfs.
    Unmodified udev versions will run just fine on top of it, and will
    recognize an already existing kernel-created device node and use it.
    The default node permissions are root:root 0600. Proper permissions
    and user/group ownership, meaningful symlinks, all other policy still
    needs to be applied by userspace.

    If a node is created by devtmpfs, devtmpfs will remove the device node
    when the device goes away. If the device node was created by
    userspace, or the devtmpfs-created node was replaced by userspace, it
    will no longer be removed by devtmpfs.

    If the kernel is asked to auto-mount it, init=/bin/sh works without any
    further userspace support: /dev will be fully populated and dynamic, and
    will always reflect the current device state of the kernel. With the
    commonly used dynamic device numbers, it solves the problem where static
    device nodes may point to the wrong devices.

    It is intended to make the initial bootup logic simpler and more robust,
    by decoupling the creation of the initial environment needed to reliably
    run userspace processes from a complex userspace bootstrap logic that
    provides a working /dev.

    Signed-off-by: Kay Sievers
    Signed-off-by: Jan Blunck
    Tested-By: Harald Hoyer
    Tested-By: Scott James Remnant
    Signed-off-by: Greg Kroah-Hartman

    Kay Sievers
     

09 Sep, 2009

1 commit


25 Jun, 2009

1 commit


24 Jun, 2009

1 commit


17 Jun, 2009

2 commits

  • Since shmem_file_setup() does not modify, allocate, free, or pass on the
    given filename, mark it as const.
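
    The resulting prototype looks like this (a sketch; the exact declaration
    lives in the mm headers):

    struct file *shmem_file_setup(const char *name, loff_t size,
                                  unsigned long flags);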

    Signed-off-by: Sergei Trofimovich
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergei Trofimovich
     
  • In a following patch, the usage of swap cache will be recorded in swap_map.
    This patch makes the necessary interface changes for that.

    Two interfaces:

    - swapcache_prepare()
    - swapcache_free()

    are added for taking/releasing a swap-cache refcount on existing swap
    entries. The implementation itself is not changed by this patch. While
    adding swapcache_free(), memcg's hook code is moved under
    swapcache_free(), which is better than scattered hooks.
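
    The new interfaces have roughly these shapes (a sketch based on the
    description above; see include/linux/swap.h for the real declarations):

    int  swapcache_prepare(swp_entry_t entry);            /* take a swap-cache ref */
    void swapcache_free(swp_entry_t entry, struct page *page);   /* release it */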

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Acked-by: Balbir Singh
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Li Zefan
    Cc: Dhaval Giani
    Cc: YAMAMOTO Takashi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

22 May, 2009

2 commits

  • Based on discussion on lkml (Andrew Morton and Eric Paris),
    move ima_counts_get() down a layer into shmem_file_setup()/hugetlb_file_setup().
    This resolves the drm shmem_file_setup() usage case as well.

    HD comment:
    I still think you're doing this at the wrong level, but recognize
    that you probably won't be persuaded until a few more users of
    alloc_file() emerge, all wanting your ima_counts_get().

    Resolving GEM's shmem_file_setup() is an improvement, so I'll say

    Acked-by: Hugh Dickins
    Signed-off-by: Mimi Zohar
    Signed-off-by: James Morris

    Mimi Zohar
     
  • - Add support in ima_path_check() for integrity checking without
    incrementing the counts. (Required for nfsd.)
    - rename and export opencount_get to ima_counts_get
    - replace ima_shm_check calls with ima_counts_get
    - export ima_path_check

    Signed-off-by: Mimi Zohar
    Signed-off-by: James Morris

    Mimi Zohar
     

03 May, 2009

1 commit

  • Current mem_cgroup_shrink_usage() has two problems.

    1. It doesn't call mem_cgroup_out_of_memory and doesn't update
    last_oom_jiffies, so pagefault_out_of_memory invokes global OOM.

    2. Considering hierarchy, shrinking has to be done from the
    mem_over_limit, not from the memcg which the page would be charged to.

    mem_cgroup_try_charge_swapin() does all of these things properly, so use
    it and call cancel_charge_swapin() when it succeeds.

    The name "shrink_usage" is not appropriate for this behaviour, so change
    it too.

    Signed-off-by: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Paul Menage
    Cc: Dhaval Giani
    Cc: Daisuke Nishimura
    Cc: YAMAMOTO Takashi
    Cc: KOSAKI Motohiro
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     

14 Apr, 2009

2 commits

  • SHMEM_MAX_BYTES was derived from the maximum size of its triple-indirect
    swap vector, forgetting to take the MAX_LFS_FILESIZE limit into account.
    Never mind 256kB pages, even 8kB pages on 32-bit kernels allowed files to
    grow slightly bigger than that supposed maximum.

    Fix this by using the min of both (at build time not run time). And it
    happens that this calculation is good as far as 8MB pages on 32-bit or
    16MB pages on 64-bit: though SHMSWP_MAX_INDEX gets truncated before that,
    it's truncated to such large numbers that we don't need to care.
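
    A sketch of the build-time clamp described above (assuming the index macro
    is SHMEM_MAX_INDEX; see mm/shmem.c for the real definitions):

    #define SHMEM_MAX_BYTES  min_t(unsigned long long,                   \
                             (SHMEM_MAX_INDEX << PAGE_CACHE_SHIFT),      \
                             MAX_LFS_FILESIZE)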

    [akpm@linux-foundation.org: it needs pagemap.h]
    [akpm@linux-foundation.org: fix sparc64 min() warnings]
    Signed-off-by: Hugh Dickins
    Cc: Yuri Tikhonov
    Cc: Paul Mackerras
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Fix a division by zero which we have in shmem_truncate_range() and
    shmem_unuse_inode() when using big PAGE_SIZE values (e.g. 256kB on
    ppc44x).

    With 256kB PAGE_SIZE, the ENTRIES_PER_PAGEPAGE constant becomes too large
    (0x1.0000.0000) on a 32-bit kernel, so this patch just changes its type
    from 'unsigned long' to 'unsigned long long'.

    Hugh: reverted its unsigned long longs in shmem_truncate_range() and
    shmem_getpage(): the pagecache index cannot be more than an unsigned long,
    so the divisions by zero occurred in unreached code. It's a pity we need
    any ULL arithmetic here, but I found no pretty way to avoid it.
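
    The type change amounts to something like this (a sketch of the mm/shmem.c
    constants):

    #define ENTRIES_PER_PAGE      (PAGE_CACHE_SIZE / sizeof(unsigned long))
    /* was plain unsigned long, which overflows to 0 with 256kB pages on 32-bit */
    #define ENTRIES_PER_PAGEPAGE  ((unsigned long long)ENTRIES_PER_PAGE * ENTRIES_PER_PAGE)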

    Signed-off-by: Yuri Tikhonov
    Signed-off-by: Hugh Dickins
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yuri Tikhonov
     

01 Apr, 2009

1 commit

  • Synopsis: if shmem_writepage calls swap_writepage directly, most shmem
    swap loads benefit, and a catastrophic interaction between SLUB and some
    flash storage is avoided.

    shmem_writepage() has always been peculiar in making no attempt to write:
    it has just transferred a shmem page from file cache to swap cache, then
    let that page make its way around the LRU again before being written and
    freed.

    The idea was that people use tmpfs because they want those pages to stay
    in RAM; so although we give it an overflow to swap, we should resist
    writing too soon, giving those pages a second chance before they can be
    reclaimed.

    That was always questionable, and I've toyed with this patch for years;
    but never had a clear justification to depart from the original design.

    It became more questionable in 2.6.28, when the split LRU patches classed
    shmem and tmpfs pages as SwapBacked rather than as file_cache: that in
    itself gives them more resistance to reclaim than normal file pages. I
    prepared this patch for 2.6.29, but the merge window arrived before I'd
    completed gathering statistics to justify sending it in.

    Then while comparing SLQB against SLUB, running SLUB on a laptop I'd
    habitually used with SLAB, I found SLUB to run my tmpfs kbuild swapping
    tests five times slower than SLAB or SLQB - other machines slower too, but
    nowhere near so bad. Simpler "cp -a" swapping tests showed the same.

    slub_max_order=0 brings sanity to all, but heavy swapping is too far from
    normal to justify such a tuning. The crucial factor on that laptop turns
    out to be that I'm using an SD card for swap. What happens is this:

    By default, SLUB uses order-2 pages for shmem_inode_cache (and many other
    fs inodes), so creating tmpfs files under memory pressure brings lumpy
    reclaim into play. One subpage of the order is chosen from the bottom of
    the LRU as usual, then the other three picked out from their random
    positions on the LRUs.

    In a tmpfs load, many of these pages will be ones which already passed
    through shmem_writepage, so already have swap allocated. And though their
    offsets on swap were probably allocated sequentially, now that the pages
    are picked off at random, their swap offsets are scattered.

    But the flash storage on the SD card is very sensitive to having its
    writes merged: once swap is written at scattered offsets, performance
    falls apart. Rotating disk seeks increase too, but less disastrously.

    So: stop giving shmem/tmpfs pages a second pass around the LRU, write them
    out to swap as soon as their swap has been allocated.

    It's surely possible to devise an artificial load which runs faster the
    old way, one whose sizing is such that the tmpfs pages on their second
    pass are the ones that are wanted again, and other pages not.

    But I've not yet found such a load: on all machines, under the loads I've
    tried, immediate swap_writepage speeds up shmem swapping: especially when
    using the SLUB allocator (and more effectively than slub_max_order=0), but
    also with the others; and it also reduces the variance between runs. How
    much faster varies widely: a factor of five is rare, 5% is common.

    One load which might have suffered: imagine a swapping shmem load in a
    limited mem_cgroup on a machine with plenty of memory. Before 2.6.29 the
    swapcache was not charged, and such a load would have run quickest with
    the shmem swapcache never written to swap. But now swapcache is charged,
    so even this load benefits from shmem_writepage directly to swap.

    Apologies for the #ifndef CONFIG_SWAP swap_writepage() stub in swap.h:
    it's silly because that will never get called; but refactoring shmem.c
    sensibly according to CONFIG_SWAP will be a separate task.
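
    A hedged sketch of the behavioural change (not the verbatim patch; the
    function and argument names are illustrative):

    static int shmem_writepage_tail(struct page *page,
                                    struct writeback_control *wbc,
                                    swp_entry_t swap)
    {
            if (add_to_swap_cache(page, swap, GFP_ATOMIC) == 0) {
                    BUG_ON(page_mapped(page));
                    return swap_writepage(page, wbc);  /* write now, don't redirty */
            }
            /* old behaviour kept only as the fallback path */
            set_page_dirty(page);
            unlock_page(page);
            return 0;
    }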

    Signed-off-by: Hugh Dickins
    Acked-by: Pekka Enberg
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

24 Mar, 2009

1 commit


26 Feb, 2009

1 commit

  • Each time I exit Firefox, /proc/meminfo's Committed_AS goes down almost
    400 kB: OVERCOMMIT_NEVER would be allowing overcommits it should
    prohibit.

    Commit fc8744adc870a8d4366908221508bb113d8b72ee "Stop playing silly
    games with the VM_ACCOUNT flag" changed shmem_file_setup() to set the
    shmem file's VM_ACCOUNT flag according to VM_NORESERVE not being set in
    the vma flags; but did so only _after_ the shmem_acct_size(flags, size)
    call which is expected to pre-account a shared anonymous object.

    It's all clearer if we switch shmem.c over to use VM_NORESERVE
    throughout in place of !VM_ACCOUNT.

    But I very nearly sent in a patch which mistakenly removed the
    accounting from tmpfs files: shmem_get_inode()'s memset was good for not
    setting VM_ACCOUNT, but now it needs to set VM_NORESERVE.

    Rather than setting that by default, then perhaps clearing it again in
    shmem_file_setup(), let's pass it as a flag to shmem_get_inode(): that
    allows us to remove the #ifdef CONFIG_SHMEM from shmem_file_setup().
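
    The resulting helper shape is roughly (a sketch; see mm/shmem.c after this
    change):

    /* flags carries VM_NORESERVE, so the inode knows whether to account */
    static struct inode *shmem_get_inode(struct super_block *sb, int mode,
                                         dev_t dev, unsigned long flags);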

    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

11 Feb, 2009

1 commit

  • Based on comments from Mike Frysinger and Randy Dunlap:
    (http://lkml.org/lkml/2009/2/9/262)
    - moved ima.h include before CONFIG_SHMEM test to fix compiler error
    on Blackfin:
    mm/shmem.c: In function 'shmem_zero_setup':
    mm/shmem.c:2670: error: implicit declaration of function 'ima_shm_check'

    - added 'struct linux_binprm' in ima.h to fix compiler warning on Blackfin:
    In file included from mm/shmem.c:32:
    include/linux/ima.h:25: warning: 'struct linux_binprm' declared inside
    parameter list
    include/linux/ima.h:25: warning: its scope is only this definition or
    declaration, which is probably not what you want

    - moved fs.h include within _LINUX_IMA_H definition

    Signed-off-by: Mimi Zohar
    Signed-off-by: Mike Frysinger
    Signed-off-by: James Morris

    Mimi Zohar
     

06 Feb, 2009

2 commits

  • Conflicts:
    fs/namei.c

    Manually merged per:

    diff --cc fs/namei.c
    index 734f2b5,bbc15c2..0000000
    --- a/fs/namei.c
    +++ b/fs/namei.c
    @@@ -860,9 -848,8 +849,10 @@@ static int __link_path_walk(const char
    nd->flags |= LOOKUP_CONTINUE;
    err = exec_permission_lite(inode);
    if (err == -EAGAIN)
    - err = vfs_permission(nd, MAY_EXEC);
    + err = inode_permission(nd->path.dentry->d_inode,
    + MAY_EXEC);
    + if (!err)
    + err = ima_path_check(&nd->path, MAY_EXEC);
    if (err)
    break;

    @@@ -1525,14 -1506,9 +1509,14 @@@ int may_open(struct path *path, int acc
    flag &= ~O_TRUNC;
    }

    - error = vfs_permission(nd, acc_mode);
    + error = inode_permission(inode, acc_mode);
    if (error)
    return error;
    +
    - error = ima_path_check(&nd->path,
    ++ error = ima_path_check(path,
    + acc_mode & (MAY_READ | MAY_WRITE | MAY_EXEC));
    + if (error)
    + return error;
    /*
    * An append-only file must be opened in append mode for writing.
    */

    Signed-off-by: James Morris

    James Morris
     
  • The number of calls to ima_path_check()/ima_file_free()
    should be balanced. An extra call to fput() indicates that
    the file could have been accessed without first being
    measured.

    Although f_count is incremented/decremented in places other
    than fget/fput, like fget_light/fput_light and get_file, the
    current task must already hold a file refcnt. The call to
    __fput() is delayed until the refcnt becomes 0, resulting
    in ima_file_free() flagging any changes.

    - add hook to increment opencount for IPC shared memory(SYSV),
    shmat files, and /dev/zero
    - moved NULL iint test in opencount_get()

    Signed-off-by: Mimi Zohar
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    Mimi Zohar
     

01 Feb, 2009

1 commit

  • The mmap_region() code would temporarily set the VM_ACCOUNT flag for
    anonymous shared mappings just to inform shmem_zero_setup() that it
    should enable accounting for the resulting shm object. It would then
    clear the flag after calling ->mmap (for the /dev/zero case) or doing
    shmem_zero_setup() (for the MAP_ANON case).

    This not only resulted in vma merge issues, but also made for unnecessary
    confusion. Use the already-existing VM_NORESERVE flag for
    this instead, and let shmem_{zero|file}_setup() just figure it out from
    that.

    This also happens to make it obvious that the new DRI2 GEM layer uses a
    non-reserving backing store for its object allocation - which is quite
    possibly not intentional. But since I didn't want to change semantics
    in this patch, I left it alone, and just updated the caller to use the
    new flag semantics.
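
    For reference, a sketch close to the resulting shmem_zero_setup(), which
    now simply forwards the vma flags so shmem_file_setup() can check
    VM_NORESERVE itself:

    int shmem_zero_setup(struct vm_area_struct *vma)
    {
            struct file *file;
            loff_t size = vma->vm_end - vma->vm_start;

            /* vm_flags (including VM_NORESERVE) decide whether to account */
            file = shmem_file_setup("dev/zero", size, vma->vm_flags);
            if (IS_ERR(file))
                    return PTR_ERR(file);

            if (vma->vm_file)
                    fput(vma->vm_file);
            vma->vm_file = file;
            vma->vm_ops = &shmem_vm_ops;
            return 0;
    }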

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

09 Jan, 2009

4 commits

  • Currently, you can see the following even when swap accounting is enabled.

    1. Create groups 01 and 02.
    2. Allocate a "file" on tmpfs from a task under 01.
    3. Swap out the "file" (by memory pressure).
    4. Read the "file" from a task in group 02.
    5. The charge for the "file" is moved to group 02.

    This is not ideal behaviour; it happens because SwapCache loaded by
    read-ahead is not taken into account.

    This patch fixes shmem's swapcache behaviour:
    - remove mem_cgroup_cache_charge_swapin().
    - add a SwapCache handler routine to mem_cgroup_cache_charge();
    with this, shmem's file cache is charged at add_to_page_cache()
    with GFP_NOWAIT.
    - pass the swapcache page to shrink_mem_cgroup.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • My earlier patch, memcg-fix-gfp_mask-of-callers-of-charge.patch, changed
    the gfp_mask of charge callers to GFP_HIGHUSER_MOVABLE to show what will
    happen at memory reclaim.

    But in recent discussion it was NACKed because it looks ugly.

    This patch reverts that and adds some cleanup to the gfp_mask of charge
    callers. There is no behaviour change, but it needs review before it
    generates conflicts with patches deeper in the queue.

    This patch also adds an explanation of the meaning of the gfp_mask passed
    to the charge functions in memcontrol.h.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • SwapCache support for the memory resource controller (memcg)

    Before the mem+swap controller, memcg itself should handle SwapCache in a
    proper way; this is cut out from that work.

    In the current memcg, SwapCache is simply leaked and the user can create
    tons of SwapCache. This is an accounting leak and should be handled.

    SwapCache accounting is done as follows.

    charge (anon)
    - charged when it's mapped.
    (because of readahead, charge at add_to_swap_cache() is not sane)
    uncharge (anon)
    - uncharged when it's dropped from swapcache and fully unmapped.
    means it's not uncharged at unmap.
    Note: delete from swap cache at swap-in is done after rmap information
    is established.
    charge (shmem)
    - charged at swap-in. this prevents charge at add_to_page_cache().

    uncharge (shmem)
    - uncharged when it's dropped from swapcache and not on shmem's
    radix-tree.

    At migration, the check against the 'old page' is modified to handle shmem.

    Compared with the old version discussed (which caused troubles), this has
    the advantages of
    - the PCG_USED bit.
    - simpler migration handling.

    So the situation is much easier than several months ago, maybe.

    [hugh@veritas.com: memcg: handle swap caches build fix]
    Reviewed-by: Daisuke Nishimura
    Tested-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Fix misuse of GFP_KERNEL.

    Currently, most callers of the mem_cgroup_charge_xxx functions use GFP_KERNEL.

    I think this comes from the fact that page_cgroup *was* dynamically
    allocated.

    But now, all page_cgroup structures are allocated at boot, and
    mem_cgroup_try_to_free_pages() reclaims memory with GFP_HIGHUSER_MOVABLE
    plus the specified GFP_RECLAIM_MASK.

    * This is because we just want to reduce memory usage.
    "Where should we reclaim from?" is not a problem for memcg.

    This patch modifies the gfp masks to be GFP_HIGHUSER_MOVABLE where possible.

    Note: this patch is not for fixing behaviour but for showing sane information
    in the source code.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

07 Jan, 2009

2 commits

  • tiny-shmem shares most of its 130 lines of code with shmem and tends to
    break when particular bits of shmem get modified. Unifying saves code and
    makes keeping these two in sync much easier.

    before:
       text    data     bss     dec     hex filename
      14367     392      24   14783    39bf mm/shmem.o
        396      72       8     476     1dc mm/tiny-shmem.o

    after:
       text    data     bss     dec     hex filename
      14367     392      24   14783    39bf mm/shmem.o
        412      72       8     492     1ec mm/shmem.o tiny

    Signed-off-by: Matt Mackall
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
     
  • Following "mm: don't mark_page_accessed in fault path", which now
    places a mark_page_accessed() in zap_pte_range(), we should remove
    the mark_page_accessed() from shmem_fault().

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

14 Nov, 2008

1 commit

  • Wrap access to task credentials so that they can be separated more easily from
    the task_struct during the introduction of COW creds.

    Change most current->(|e|s|fs)[ug]id to current_(|e|s|fs)[ug]id().

    Change some task->e?[ug]id to task_e?[ug]id(). In some places it makes more
    sense to use RCU directly rather than a convenient wrapper; these will be
    addressed by later patches.
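
    An example of the mechanical conversion described above (illustrative;
    example_owner_check is not a real kernel function):

    static int example_owner_check(const struct inode *inode)
    {
            /* was: return current->fsuid == inode->i_uid; */
            return current_fsuid() == inode->i_uid;
    }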

    Signed-off-by: David Howells
    Reviewed-by: James Morris
    Acked-by: Serge Hallyn
    Cc: Al Viro
    Cc: linux-audit@redhat.com
    Cc: containers@lists.linux-foundation.org
    Cc: linux-mm@kvack.org
    Signed-off-by: James Morris

    David Howells
     

31 Oct, 2008

1 commit

  • Junjiro R. Okajima reported a problem where knfsd crashes if you are
    using it to export shmemfs objects and run strict overcommit. In this
    situation the current->mm based modifier to the overcommit goes through a
    NULL pointer.

    We could simply check for NULL and skip the modifier but we've caught
    other real bugs in the past from mm being NULL here - cases where we did
    need a valid mm set up (eg the exec bug about a year ago).

    To preserve the checks and get the logic we want, shuffle the checking
    around and add a new helper to the vm_ security wrappers.

    Also fix a current->mm reference in nommu that should use the passed mm.
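
    A hedged sketch of how a caller uses the new mm-aware wrapper (the caller
    name is illustrative; the helper is assumed to be
    security_vm_enough_memory_mm(), which takes the mm explicitly):

    static int example_account(struct mm_struct *mm, long pages)
    {
            /* pass the mm explicitly instead of dereferencing current->mm,
             * which may be NULL for kernel threads such as knfsd */
            return security_vm_enough_memory_mm(mm, pages);
    }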

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix build]
    Reported-by: Junjiro R. Okajima
    Acked-by: James Morris
    Signed-off-by: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Cox
     

20 Oct, 2008

3 commits

  • Shmem segments locked into memory via shmctl(SHM_LOCKED) should not be
    kept on the normal LRU, since scanning them is a waste of time and might
    throw off kswapd's balancing algorithms. Place them on the unevictable
    LRU list instead.

    Use the AS_UNEVICTABLE flag to mark address_space of SHM_LOCKed shared
    memory regions as unevictable. Then these pages will be culled off the
    normal LRU lists during vmscan.

    Add new wrapper function to clear the mapping's unevictable state when/if
    shared memory segment is munlocked.

    Add 'scan_mapping_unevictable_page()' to mm/vmscan.c to scan all pages in
    the shmem segment's mapping [struct address_space] for evictability now
    that they're no longer locked. If evictable, move them to the appropriate
    zone lru list.

    Changes depend on [CONFIG_]UNEVICTABLE_LRU.
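
    A sketch of the mapping-level marking described above (shm_example_lock is
    illustrative; the mapping helpers come from pagemap.h):

    static void shm_example_lock(struct address_space *mapping, int lock)
    {
            if (lock)
                    mapping_set_unevictable(mapping);     /* SHM_LOCK   */
            else
                    mapping_clear_unevictable(mapping);   /* SHM_UNLOCK */
    }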

    [kosaki.motohiro@jp.fujitsu.com: revert shm change]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: Kosaki Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Split the LRU lists in two, one set for pages that are backed by real file
    systems ("file") and one for pages that are backed by memory and swap
    ("anon"). The latter includes tmpfs.

    The advantage of doing this is that the VM will not have to scan over lots
    of anonymous pages (which we generally do not want to swap out), just to
    find the page cache pages that it should evict.

    This patch has the infrastructure and a basic policy to balance how much
    we scan the anon lists and how much we scan the file lists. The big
    policy changes are in separate patches.

    [lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
    [kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
    [kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
    [hugh@veritas.com: memcg swapbacked pages active]
    [hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
    [akpm@linux-foundation.org: fix /proc/vmstat units]
    [nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
    [kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
    [kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Define page_file_cache() function to answer the question:
    is page backed by a file?

    Originally part of Rik van Riel's split-lru patch. Extracted to make
    available for other, independent reclaim patches.

    Moved inline function to linux/mm_inline.h where it will be needed by
    subsequent "split LRU" and "noreclaim" patches.

    Unfortunately this needs to use a page flag, since the PG_swapbacked state
    needs to be preserved all the way to the point where the page is last
    removed from the LRU. Trying to derive the status from other info in the
    page resulted in wrong VM statistics in earlier split VM patchsets.

    The total number of page flags in use on a 32 bit machine after this patch
    is 19.
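
    The resulting test is essentially the inverse of the new page flag (a
    sketch; the in-tree helper in linux/mm_inline.h may differ in detail):

    static inline int page_is_file_cache(struct page *page)
    {
            /* file-backed pages are the ones not marked swap-backed */
            return !PageSwapBacked(page);
    }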

    [akpm@linux-foundation.org: fix up out-of-order merge fallout]
    [hugh@veritas.com: splitlru: shmem_getpage SetPageSwapBacked sooner]
    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: MinChan Kim
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

18 Oct, 2008

1 commit

  • GEM needs to create shmem files to back buffer objects. Though currently
    creation of files for objects could have been driven from userland, the
    modesetting work will require allocation of buffer objects before userland
    is running, for boot-time message display.
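
    In practice this amounts to exporting the existing helper so GEM can call
    it from the drm module (a sketch; the export modifier may differ):

    /* mm/shmem.c */
    EXPORT_SYMBOL_GPL(shmem_file_setup);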

    Signed-off-by: Eric Anholt
    Cc: Nick Piggin
    Signed-off-by: Dave Airlie

    Keith Packard