20 Nov, 2015

1 commit

  • NOT FOR STAGING
    This patch re-adds the original shmem_set_file to mm/shmem.c
    and converts ashmem.c back to using it.

    CC: Brian Swetland
    CC: Colin Cross
    CC: Arve Hjønnevåg
    CC: Dima Zavin
    CC: Robert Love
    CC: Greg KH
    Signed-off-by: John Stultz

    John Stultz
     

18 Jun, 2015

1 commit

  • It appears that, at some point last year, XFS made directory handling
    changes which bring it into lockdep conflict with shmem_zero_setup():
    it is surprising that mmap() can clone an inode while holding mmap_sem,
    but that has been so for many years.

    Since those few lockdep traces that I've seen all implicated selinux,
    I'm hoping that we can use the __shmem_file_setup(,,,S_PRIVATE) which
    v3.13's commit c7277090927a ("security: shmem: implement kernel private
    shmem inodes") introduced to avoid LSM checks on kernel-internal inodes:
    the mmap("/dev/zero") cloned inode is indeed a kernel-internal detail.

    This also covers the !CONFIG_SHMEM use of ramfs to support /dev/zero
    (and MAP_SHARED|MAP_ANONYMOUS). I thought there were also drivers
    which cloned inode in mmap(), but if so, I cannot locate them now.
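
    For reference, a minimal userspace sketch (not part of the patch) of the two
    mmap() flavours in question, a shared mapping of /dev/zero and
    MAP_SHARED|MAP_ANONYMOUS, both of which are set up through shmem_zero_setup()
    and so are backed by a kernel-internal shmem inode; the 4096-byte length
    simply stands in for the page size:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* Shared mapping of /dev/zero: the kernel clones a shmem inode for it. */
        int fd = open("/dev/zero", O_RDWR);
        if (fd == -1)
            return perror("open(/dev/zero)"), 1;

        char *a = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        /* Shared anonymous memory takes the same shmem-backed path. */
        char *b = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (a == MAP_FAILED || b == MAP_FAILED)
            return perror("mmap"), 1;

        strcpy(a, "backed by a kernel-internal shmem inode");
        strcpy(b, "likewise for shared anonymous memory");
        puts(a);
        puts(b);

        munmap(a, 4096);
        munmap(b, 4096);
        close(fd);
        return 0;
    }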

    Reported-and-tested-by: Prarit Bhargava
    Reported-and-tested-by: Daniel Wagner
    Reported-and-tested-by: Morten Stevens
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

24 Feb, 2015

1 commit

  • mapping->host can be NULL and shouldn't be dereferenced before being checked.

    [ 1295.741844] GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] SMP KASAN
    [ 1295.746387] Dumping ftrace buffer:
    [ 1295.748217] (ftrace buffer empty)
    [ 1295.749527] Modules linked in:
    [ 1295.750268] CPU: 62 PID: 23410 Comm: trinity-c70 Not tainted 3.19.0-next-20150219-sasha-00045-g9130270f #1939
    [ 1295.750268] task: ffff8803a49db000 ti: ffff8803a4dc8000 task.ti: ffff8803a4dc8000
    [ 1295.750268] RIP: shmem_mapping (mm/shmem.c:1458)
    [ 1295.750268] RSP: 0000:ffff8803a4dcfbf8 EFLAGS: 00010206
    [ 1295.750268] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 00000000000f2804
    [ 1295.750268] RDX: 0000000000000005 RSI: 0400000000000794 RDI: 0000000000000028
    [ 1295.750268] RBP: ffff8803a4dcfc08 R08: 0000000000000000 R09: 00000000031de000
    [ 1295.750268] R10: dffffc0000000000 R11: 00000000031c1000 R12: 0400000000000794
    [ 1295.750268] R13: 00000000031c2000 R14: 00000000031de000 R15: ffff880e3bdc1000
    [ 1295.750268] FS: 00007f8703c7e700(0000) GS:ffff881164800000(0000) knlGS:0000000000000000
    [ 1295.750268] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 1295.750268] CR2: 0000000004e58000 CR3: 00000003a9f3c000 CR4: 00000000000007a0
    [ 1295.750268] DR0: ffffffff81000000 DR1: 0000009494949494 DR2: 0000000000000000
    [ 1295.750268] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 00000000000d0602
    [ 1295.750268] Stack:
    [ 1295.750268] ffff8803a4dcfec8 ffffffffbb1dc770 ffff8803a4dcfc38 ffffffffad6f230b
    [ 1295.750268] ffffffffad6f2b0d 0000014100000000 ffff88001e17c08b ffff880d9453fe08
    [ 1295.750268] ffff8803a4dcfd18 ffffffffad6f2ce2 ffff8803a49dbcd8 ffff8803a49dbce0
    [ 1295.750268] Call Trace:
    [ 1295.750268] mincore_page (mm/mincore.c:61)
    [ 1295.750268] ? mincore_pte_range (include/linux/spinlock.h:312 mm/mincore.c:131)
    [ 1295.750268] mincore_pte_range (mm/mincore.c:151)
    [ 1295.750268] ? mincore_unmapped_range (mm/mincore.c:113)
    [ 1295.750268] __walk_page_range (mm/pagewalk.c:51 mm/pagewalk.c:90 mm/pagewalk.c:116 mm/pagewalk.c:204)
    [ 1295.750268] walk_page_range (mm/pagewalk.c:275)
    [ 1295.750268] SyS_mincore (mm/mincore.c:191 mm/mincore.c:253 mm/mincore.c:220)
    [ 1295.750268] ? mincore_pte_range (mm/mincore.c:220)
    [ 1295.750268] ? mincore_unmapped_range (mm/mincore.c:113)
    [ 1295.750268] ? __mincore_unmapped_range (mm/mincore.c:105)
    [ 1295.750268] ? ptlock_free (mm/mincore.c:24)
    [ 1295.750268] ? syscall_trace_enter (arch/x86/kernel/ptrace.c:1610)
    [ 1295.750268] ia32_do_call (arch/x86/ia32/ia32entry.S:446)
    [ 1295.750268] Code: e5 48 c1 ea 03 53 48 89 fb 48 83 ec 08 80 3c 02 00 75 4f 48 b8 00 00 00 00 00 fc ff df 48 8b 1b 48 8d 7b 28 48 89 fa 48 c1 ea 03 3c 02 00 75 3f 48 b8 00 00 00 00 00 fc ff df 48 8b 5b 28 48

    All code
    ========
    0: e5 48 in $0x48,%eax
    2: c1 ea 03 shr $0x3,%edx
    5: 53 push %rbx
    6: 48 89 fb mov %rdi,%rbx
    9: 48 83 ec 08 sub $0x8,%rsp
    d: 80 3c 02 00 cmpb $0x0,(%rdx,%rax,1)
    11: 75 4f jne 0x62
    13: 48 b8 00 00 00 00 00 movabs $0xdffffc0000000000,%rax
    1a: fc ff df
    1d: 48 8b 1b mov (%rbx),%rbx
    20: 48 8d 7b 28 lea 0x28(%rbx),%rdi
    24: 48 89 fa mov %rdi,%rdx
    27: 48 c1 ea 03 shr $0x3,%rdx
    2b:* 80 3c 02 00 cmpb $0x0,(%rdx,%rax,1)

    Fixes: 97b713ba3e ("fs: kill BDI_CAP_SWAP_BACKED")
    Signed-off-by: Sasha Levin
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Sasha Levin
     

23 Feb, 2015

1 commit

  • Convert the following where appropriate:

    (1) S_ISLNK(dentry->d_inode->i_mode) to d_is_symlink(dentry).

    (2) S_ISREG(dentry->d_inode->i_mode) to d_is_reg(dentry).

    (3) S_ISDIR(dentry->d_inode->i_mode) to d_is_dir(dentry). This is actually more
    complicated than it appears as some calls should be converted to
    d_can_lookup() instead. The difference is whether the directory in
    question is a real dir with a ->lookup op or whether it's a fake dir with
    a ->d_automount op.

    In some circumstances, we can subsume checks for dentry->d_inode not being
    NULL into this, provided the code isn't in a filesystem that expects
    d_inode to be NULL if the dirent really *is* negative (ie. if we're going to
    use d_inode() rather than d_backing_inode() to get the inode pointer).

    Note that the dentry type field may be set to something other than
    DCACHE_MISS_TYPE when d_inode is NULL in the case of unionmount, where the VFS
    manages the fall-through from a negative dentry to a lower layer. In such a
    case, the dentry type of the negative union dentry is set to the same as the
    type of the lower dentry.

    However, if you know d_inode is not NULL at the call site, then you can use
    the d_is_xxx() functions even in a filesystem.

    There is one further complication: a 0,0 chardev dentry may be labelled
    DCACHE_WHITEOUT_TYPE rather than DCACHE_SPECIAL_TYPE. Strictly, this was
    intended for special directory entry types that don't have attached inodes.

    The following perl+coccinelle script was used:

    use strict;

    my $fd;
    my @callers;
    open($fd, 'git grep -l \'S_IS[A-Z].*->d_inode\' |') ||
        die "Can't grep for S_ISDIR and co. callers";
    @callers = <$fd>;
    close($fd);
    unless (@callers) {
        print "No matches\n";
        exit(0);
    }

    my @cocci = (
        '@@',
        'expression E;',
        '@@',
        '',
        '- S_ISLNK(E->d_inode->i_mode)',
        '+ d_is_symlink(E)',
        '',
        '@@',
        'expression E;',
        '@@',
        '',
        '- S_ISDIR(E->d_inode->i_mode)',
        '+ d_is_dir(E)',
        '',
        '@@',
        'expression E;',
        '@@',
        '',
        '- S_ISREG(E->d_inode->i_mode)',
        '+ d_is_reg(E)' );

    my $coccifile = "tmp.sp.cocci";
    open($fd, ">$coccifile") || die $coccifile;
    print($fd "$_\n") || die $coccifile foreach (@cocci);
    close($fd);

    foreach my $file (@callers) {
        chomp $file;
        print "Processing ", $file, "\n";
        system("spatch", "--sp-file", $coccifile, $file, "--in-place", "--no-show-diff") == 0 ||
            die "spatch failed";
    }

    [AV: overlayfs parts skipped]

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     

13 Feb, 2015

1 commit

  • Pull backing device changes from Jens Axboe:
    "This contains a cleanup of how the backing device is handled, in
    preparation for a rework of the life time rules. In this part, the
    most important change is to split the unrelated nommu mmap flags from
    it, but also removing a backing_dev_info pointer from the
    address_space (and inode), and a cleanup of other various minor bits.

    Christoph did all the work here, I just fixed an oops with pages that
    have a swap backing. Arnd fixed a missing export, and Oleg killed the
    lustre backing_dev_info from staging. Last patch was from Al,
    unexporting parts that are now no longer needed outside"

    * 'for-3.20/bdi' of git://git.kernel.dk/linux-block:
    Make super_blocks and sb_lock static
    mtd: export new mtd_mmap_capabilities
    fs: make inode_to_bdi() handle NULL inode
    staging/lustre/llite: get rid of backing_dev_info
    fs: remove default_backing_dev_info
    fs: don't reassign dirty inodes to default_backing_dev_info
    nfs: don't call bdi_unregister
    ceph: remove call to bdi_unregister
    fs: remove mapping->backing_dev_info
    fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info
    nilfs2: set up s_bdi like the generic mount_bdev code
    block_dev: get bdev inode bdi directly from the block device
    block_dev: only write bdev inode on close
    fs: introduce f_op->mmap_capabilities for nommu mmap support
    fs: kill BDI_CAP_SWAP_BACKED
    fs: deduplicate noop_backing_dev_info

    Linus Torvalds
     

06 Feb, 2015

1 commit

  • It has been reported that 965GM might trigger

    VM_BUG_ON_PAGE(!lrucare && PageLRU(oldpage), oldpage)

    in mem_cgroup_migrate when shmem wants to replace a swap cache page
    because of shmem_should_replace_page (the page is allocated from an
    inappropriate zone). shmem_replace_page expects that the oldpage is not
    on LRU list and calls mem_cgroup_migrate without lrucare. This is
    obviously incorrect because swapcache pages might be on the LRU list
    (e.g. swapin readahead page).

    Fix this by enabling lrucare for the migration in shmem_replace_page.
    Also clarify that lrucare should be used even if one of the pages might
    be on LRU list.

    The BUG_ON will trigger only when CONFIG_DEBUG_VM is enabled but even
    without that the migration code might leave the old page on an
    inappropriate memcg's LRU which is not that critical because the page
    would get removed with its last reference but it is still confusing.

    Fixes: 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API")
    Signed-off-by: Michal Hocko
    Reported-by: Chris Wilson
    Reported-by: Dave Airlie
    Acked-by: Hugh Dickins
    Acked-by: Johannes Weiner
    Cc: [3.17+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

21 Jan, 2015

2 commits

  • Now that we never use the backing_dev_info pointer in struct address_space
    we can simply remove it and save 4 to 8 bytes in every inode.

    Signed-off-by: Christoph Hellwig
    Acked-by: Ryusuke Konishi
    Reviewed-by: Tejun Heo
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • This bdi flag isn't too useful - we can determine that a vma is backed by
    either swap or shmem trivially in the caller.

    This also allows removing the backing_dev_info instances for swap and shmem
    in favor of noop_backing_dev_info.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Tejun Heo
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

24 Oct, 2014

1 commit

  • Allocate a dentry, initialize it with a whiteout and hash it in the place
    of the old dentry. Later the old dentry will be moved away and the
    whiteout will remain.

    i_mutex protects against concurrent readdir.

    Signed-off-by: Miklos Szeredi
    Cc: Hugh Dickins

    Miklos Szeredi
     

10 Oct, 2014

2 commits

  • Pull percpu updates from Tejun Heo:
    "A lot of activities on percpu front. Notable changes are...

    - percpu allocator now can take @gfp. If @gfp doesn't contain
    GFP_KERNEL, it tries to allocate from what's already available to
    the allocator and a work item tries to keep the reserve around
    certain level so that these atomic allocations usually succeed.

    This will replace the ad-hoc percpu memory pool used by
    blk-throttle and also be used by the planned blkcg support for
    writeback IOs.

    Please note that I noticed a bug in how @gfp is interpreted while
    preparing this pull request and applied the fix 6ae833c7fe0c
    ("percpu: fix how @gfp is interpreted by the percpu allocator")
    just now.

    - percpu_ref now uses longs for percpu and global counters instead of
    ints. It leads to more sparse packing of the percpu counters on
    64bit machines but the overhead should be negligible and this
    allows using percpu_ref for refcnting pages and in-memory objects
    directly.

    - The switching between percpu and single counter modes of a
    percpu_ref is made independent of putting the base ref and a
    percpu_ref can now optionally be initialized in single or killed
    mode. This allows avoiding percpu shutdown latency for cases where
    the refcounted objects may be synchronously created and destroyed
    in rapid succession with only a fraction of them reaching fully
    operational status (SCSI probing does this when combined with
    blk-mq support). It's also planned to be used to implement forced
    single mode to detect underflow more timely for debugging.

    There's a separate branch percpu/for-3.18-consistent-ops which cleans
    up the duplicate percpu accessors. That branch causes a number of
    conflicts with s390 and other trees. I'll send a separate pull
    request w/ resolutions once other branches are merged"

    * 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (33 commits)
    percpu: fix how @gfp is interpreted by the percpu allocator
    blk-mq, percpu_ref: start q->mq_usage_counter in atomic mode
    percpu_ref: make INIT_ATOMIC and switch_to_atomic() sticky
    percpu_ref: add PERCPU_REF_INIT_* flags
    percpu_ref: decouple switching to percpu mode and reinit
    percpu_ref: decouple switching to atomic mode and killing
    percpu_ref: add PCPU_REF_DEAD
    percpu_ref: rename things to prepare for decoupling percpu/atomic mode switch
    percpu_ref: replace pcpu_ prefix with percpu_
    percpu_ref: minor code and comment updates
    percpu_ref: relocate percpu_ref_reinit()
    Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe"
    Revert "percpu: free percpu allocation info for uniprocessor system"
    percpu-refcount: make percpu_ref based on longs instead of ints
    percpu-refcount: improve WARN messages
    percpu: fix locking regression in the failure path of pcpu_alloc()
    percpu-refcount: add @gfp to percpu_ref_init()
    proportions: add @gfp to init functions
    percpu_counter: add @gfp to percpu_counter_init()
    percpu_counter: make percpu_counters_lock irq-safe
    ...

    Linus Torvalds
     
  • This is designed to avoid a few ifdefs in .c files but it's obnoxious
    because it can cause unsuspecting "migrate_page" symbols to get turned into
    "NULL".

    Just nuke it and use the ifdefs.

    Cc: Konstantin Khlebnikov
    Cc: Rafael Aquini
    Cc: Andrey Ryabinin
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

27 Sep, 2014

1 commit

  • If overwriting an empty directory with rename, then we need to drop the extra
    nlink.

    Test prog:

    #include <stdio.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <err.h>

    int main(void)
    {
        const char *test_dir1 = "test-dir1";
        const char *test_dir2 = "test-dir2";
        int res;
        int fd;
        struct stat statbuf;

        res = mkdir(test_dir1, 0777);
        if (res == -1)
            err(1, "mkdir(\"%s\")", test_dir1);

        res = mkdir(test_dir2, 0777);
        if (res == -1)
            err(1, "mkdir(\"%s\")", test_dir2);

        fd = open(test_dir2, O_RDONLY);
        if (fd == -1)
            err(1, "open(\"%s\")", test_dir2);

        res = rename(test_dir1, test_dir2);
        if (res == -1)
            err(1, "rename(\"%s\", \"%s\")", test_dir1, test_dir2);

        res = fstat(fd, &statbuf);
        if (res == -1)
            err(1, "fstat(%i)", fd);

        if (statbuf.st_nlink != 0) {
            fprintf(stderr, "nlink is %lu, should be 0\n", statbuf.st_nlink);
            return 1;
        }

        return 0;
    }

    Signed-off-by: Miklos Szeredi
    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Miklos Szeredi
     

08 Sep, 2014

1 commit

  • Percpu allocator now supports allocation mask. Add @gfp to
    percpu_counter_init() so that !GFP_KERNEL allocation masks can be used
    with percpu_counters too.

    We could have left percpu_counter_init() alone and added
    percpu_counter_init_gfp(); however, the number of users isn't that
    high and introducing _gfp variants to all percpu data structures would
    be quite ugly, so let's just do the conversion. This is the one with
    the most users. Other percpu data structures are a lot easier to
    convert.

    This patch doesn't make any functional difference.
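
    As a rough sketch (not from this patch) of what a caller looks like after
    the conversion, assuming a trivial kernel module as the context:

    /*
     * Minimal sketch of the new three-argument form:
     * percpu_counter_init(&counter, initial_value, gfp).
     */
    #include <linux/module.h>
    #include <linux/percpu_counter.h>

    static struct percpu_counter demo_counter;

    static int __init demo_init(void)
    {
        int err;

        /* GFP_KERNEL is what the existing callers are converted to pass. */
        err = percpu_counter_init(&demo_counter, 0, GFP_KERNEL);
        if (err)
            return err;
        percpu_counter_add(&demo_counter, 42);
        pr_info("demo counter: %lld\n", percpu_counter_sum(&demo_counter));
        return 0;
    }

    static void __exit demo_exit(void)
    {
        percpu_counter_destroy(&demo_counter);
    }

    module_init(demo_init);
    module_exit(demo_exit);
    MODULE_LICENSE("GPL");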

    Signed-off-by: Tejun Heo
    Acked-by: Jan Kara
    Acked-by: "David S. Miller"
    Cc: x86@kernel.org
    Cc: Jens Axboe
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andrew Morton

    Tejun Heo
     

12 Aug, 2014

1 commit

  • Pull vfs updates from Al Viro:
    "Stuff in here:

    - acct.c fixes and general rework of mnt_pin mechanism. That allows
    to go for delayed-mntput stuff, which will permit mntput() on deep
    stack without worrying about stack overflows - fs shutdown will
    happen on shallow stack. IOW, we can do Eric's umount-on-rmdir
    series without introducing tons of stack overflows on new mntput()
    call chains it introduces.
    - Bruce's d_splice_alias() patches
    - more Miklos' rename() stuff.
    - a couple of regression fixes (stable fodder, in the end of branch)
    and a fix for API idiocy in iov_iter.c.

    There definitely will be another pile, maybe even two. I'd like to
    get Eric's series in this time, but even if we miss it, it'll go right
    in the beginning of for-next in the next cycle - the tricky part of
    prereqs is in this pile"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
    fix copy_tree() regression
    __generic_file_write_iter(): fix handling of sync error after DIO
    switch iov_iter_get_pages() to passing maximal number of pages
    fs: mark __d_obtain_alias static
    dcache: d_splice_alias should detect loops
    exportfs: update Exporting documentation
    dcache: d_find_alias needn't recheck IS_ROOT && DCACHE_DISCONNECTED
    dcache: remove unused d_find_alias parameter
    dcache: d_obtain_alias callers don't all want DISCONNECTED
    dcache: d_splice_alias should ignore DCACHE_DISCONNECTED
    dcache: d_splice_alias mustn't create directory aliases
    dcache: close d_move race in d_splice_alias
    dcache: move d_splice_alias
    namei: trivial fix to vfs_rename_dir comment
    VFS: allow ->d_manage() to declare -EISDIR in rcu_walk mode.
    cifs: support RENAME_NOREPLACE
    hostfs: support rename flags
    shmem: support RENAME_EXCHANGE
    shmem: support RENAME_NOREPLACE
    btrfs: add RENAME_NOREPLACE
    ...

    Linus Torvalds
     

09 Aug, 2014

5 commits

  • If we set SEAL_WRITE on a file, we must make sure there cannot be any
    ongoing write-operations on the file. For write() calls, we simply lock
    the inode mutex; for mmap() we simply verify there are no writable
    mappings. However, there might be pages pinned by AIO, Direct-IO and
    similar operations via GUP. We must make sure those do not write to the
    memfd file after we set SEAL_WRITE.

    As there is no way to notify GUP users to drop pages or to wait for them
    to be done, we implement the wait ourselves: when setting SEAL_WRITE, we
    check all pages for their ref-count. If it's bigger than 1, we know
    there's some user of the page. We then mark the page and wait for up to
    150ms for those ref-counts to be dropped. If the ref-counts are not
    dropped in time, we refuse the seal operation.
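
    A minimal userspace sketch (not part of the patch) of the mmap() side of
    this check, assuming a libc that exposes memfd_create() and the F_SEAL_*
    constants (older systems need syscall(2) and <linux/memfd.h>): sealing
    writes is refused while a writable shared mapping exists, and succeeds once
    it is gone and no pages remain pinned:

    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = memfd_create("seal-demo", MFD_ALLOW_SEALING);
        if (fd == -1 || ftruncate(fd, 4096) == -1)
            return perror("memfd"), 1;

        void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            return perror("mmap"), 1;

        /* Refused while a writable shared mapping exists. */
        if (fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE) == -1)
            printf("sealing while mapped failed: %s\n", strerror(errno));

        munmap(p, 4096);

        /* With the mapping gone (and no pages pinned by AIO/Direct-IO),
         * the seal can be taken. */
        if (fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE) == 0)
            printf("F_SEAL_WRITE set\n");

        close(fd);
        return 0;
    }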

    Signed-off-by: David Herrmann
    Acked-by: Hugh Dickins
    Cc: Michael Kerrisk
    Cc: Ryan Lortie
    Cc: Lennart Poettering
    Cc: Daniel Mack
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Herrmann
     
  • memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
    that you can pass to mmap(). It can support sealing and avoids any
    connection to user-visible mount-points. Thus, it's not subject to quotas
    on mounted file-systems; it can be used like malloc()'ed memory, but with
    a file-descriptor to it.

    memfd_create() returns the raw shmem file, so calls like ftruncate() can
    be used to modify the underlying inode. Also calls like fstat() will
    return proper information and mark the file as a regular file. If you want
    sealing, you can specify MFD_ALLOW_SEALING. Otherwise, sealing is not
    supported (like on all other regular files).

    Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
    subject to a filesystem size limit. It is still properly accounted to
    memcg limits, though, and to the same overcommit or no-overcommit
    accounting as all user memory.
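
    A minimal usage sketch (not part of the patch; assumes a libc that provides
    the memfd_create() wrapper, otherwise invoke it via syscall(2)):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        /* No tmpfs mount point is involved; the fd refers to a raw shmem file. */
        int fd = memfd_create("demo", MFD_ALLOW_SEALING);
        if (fd == -1)
            return perror("memfd_create"), 1;

        /* Ordinary file operations act on the underlying inode. */
        if (ftruncate(fd, 2 * 4096) == -1)
            return perror("ftruncate"), 1;

        struct stat st;
        fstat(fd, &st);
        printf("regular file? %s, size %lld\n",
               S_ISREG(st.st_mode) ? "yes" : "no", (long long)st.st_size);

        char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            return perror("mmap"), 1;
        p[0] = 'x';    /* behaves like malloc()'ed memory, but with an fd */

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }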

    Signed-off-by: David Herrmann
    Acked-by: Hugh Dickins
    Cc: Michael Kerrisk
    Cc: Ryan Lortie
    Cc: Lennart Poettering
    Cc: Daniel Mack
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Herrmann
     
  • If two processes share a common memory region, they usually want some
    guarantees to allow safe access. This often includes:
    - one side cannot overwrite data while the other reads it
    - one side cannot shrink the buffer while the other accesses it
    - one side cannot grow the buffer beyond previously set boundaries

    If there is a trust-relationship between both parties, there is no need
    for policy enforcement. However, if there's no trust relationship (e.g.,
    for general-purpose IPC), sharing memory-regions is highly fragile and
    often not possible without local copies. Look at the following two
    use-cases:

    1) A graphics client wants to share its rendering-buffer with a
    graphics-server. The memory-region is allocated by the client for
    read/write access and a second FD is passed to the server. While
    scanning out from the memory region, the server has no guarantee that
    the client doesn't shrink the buffer at any time, requiring rather
    cumbersome SIGBUS handling.
    2) A process wants to perform an RPC on another process. To avoid huge
    bandwidth consumption, zero-copy is preferred. After a message is
    assembled in-memory and a FD is passed to the remote side, both sides
    want to be sure that neither modifies this shared copy, anymore. The
    source may have put sensitive data into the message without a separate
    copy and the target may want to parse the message inline, to avoid a
    local copy.

    While SIGBUS handling, POSIX mandatory locking and MAP_DENYWRITE provide
    ways to achieve most of this, the first one is disproportionately ugly to
    use in libraries and the latter two are broken/racy or even disabled due
    to denial of service attacks.

    This patch introduces the concept of SEALING. If you seal a file, a
    specific set of operations is blocked on that file forever. Unlike locks,
    seals can only be set, never removed. Hence, once you verified a specific
    set of seals is set, you're guaranteed that no-one can perform the blocked
    operations on this file, anymore.

    An initial set of SEALS is introduced by this patch:
    - SHRINK: If SEAL_SHRINK is set, the file in question cannot be reduced
    in size. This affects ftruncate() and open(O_TRUNC).
    - GROW: If SEAL_GROW is set, the file in question cannot be increased
    in size. This affects ftruncate(), fallocate() and write().
    - WRITE: If SEAL_WRITE is set, no write operations (besides resizing)
    are possible. This affects fallocate(PUNCH_HOLE), mmap() and
    write().
    - SEAL: If SEAL_SEAL is set, no further seals can be added to a file.
    This basically prevents the F_ADD_SEAL operation on a file and
    can be set to prevent others from adding further seals that you
    don't want.

    The described use-cases can easily use these seals to provide safe use
    without any trust-relationship:

    1) The graphics server can verify that a passed file-descriptor has
    SEAL_SHRINK set. This allows safe scanout, while the client is
    allowed to increase buffer size for window-resizing on-the-fly.
    Concurrent writes are explicitly allowed.
    2) For general-purpose IPC, both processes can verify that SEAL_SHRINK,
    SEAL_GROW and SEAL_WRITE are set. This guarantees that neither
    process can modify the data while the other side parses it.
    Furthermore, it guarantees that even with writable FDs passed to the
    peer, it cannot increase the size to hit memory-limits of the source
    process (in case the file-storage is accounted to the source).

    The new API is an extension to fcntl(), adding two new commands:
    F_GET_SEALS: Return a bitset describing the seals on the file. This
    can be called on any FD if the underlying file supports
    sealing.
    F_ADD_SEALS: Change the seals of a given file. This requires WRITE
    access to the file and F_SEAL_SEAL may not already be set.
    Furthermore, the underlying file must support sealing and
    there may not be any existing shared mapping of that file.
    Otherwise, EBADF/EPERM is returned.
    The given seals are _added_ to the existing set of seals
    on the file. You cannot remove seals again.

    The fcntl() handler is currently specific to shmem and disabled on all
    files. A file needs to explicitly support sealing for this interface to
    work. A separate syscall is added in a follow-up, which creates files that
    support sealing. There is no intention to support this on other
    file-systems. Semantics are unclear for non-volatile files and we lack any
    use-case right now. Therefore, the implementation is specific to shmem.
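
    A minimal sketch of the two new fcntl() commands from userspace (assuming a
    memfd created with MFD_ALLOW_SEALING; the constants require a sufficiently
    recent libc or <linux/fcntl.h>):

    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = memfd_create("ipc-buf", MFD_ALLOW_SEALING);
        if (fd == -1 || ftruncate(fd, 4096) == -1)
            return perror("setup"), 1;

        /* Producer: freeze the size before handing the fd to a peer. */
        if (fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_SEAL) == -1)
            return perror("F_ADD_SEALS"), 1;

        /* Consumer: verify the guarantees before trusting the buffer. */
        int seals = fcntl(fd, F_GET_SEALS);
        if (seals & F_SEAL_SHRINK)
            printf("buffer cannot shrink under us\n");

        /* Resizing is now refused, and so is adding further seals. */
        if (ftruncate(fd, 0) == -1)
            printf("shrink refused: %s\n", strerror(errno));
        if (fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE) == -1)
            printf("further seals refused: %s\n", strerror(errno));

        close(fd);
        return 0;
    }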

    Signed-off-by: David Herrmann
    Acked-by: Hugh Dickins
    Cc: Michael Kerrisk
    Cc: Ryan Lortie
    Cc: Lennart Poettering
    Cc: Daniel Mack
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Herrmann
     
  • The memcg uncharging code that is involved towards the end of a page's
    lifetime - truncation, reclaim, swapout, migration - is impressively
    complicated and fragile.

    Because anonymous and file pages were always charged before they had their
    page->mapping established, uncharges had to happen when the page type
    could still be known from the context; as in unmap for anonymous, page
    cache removal for file and shmem pages, and swap cache truncation for swap
    pages. However, these operations happen well before the page is actually
    freed, and so a lot of synchronization is necessary:

    - Charging, uncharging, page migration, and charge migration all need
    to take a per-page bit spinlock as they could race with uncharging.

    - Swap cache truncation happens during both swap-in and swap-out, and
    possibly repeatedly before the page is actually freed. This means
    that the memcg swapout code is called from many contexts that make
    no sense and it has to figure out the direction from page state to
    make sure memory and memory+swap are always correctly charged.

    - On page migration, the old page might be unmapped but then reused,
    so memcg code has to prevent untimely uncharging in that case.
    Because this code - which should be a simple charge transfer - is so
    special-cased, it is not reusable for replace_page_cache().

    But now that charged pages always have a page->mapping, introduce
    mem_cgroup_uncharge(), which is called after the final put_page(), when we
    know for sure that nobody is looking at the page anymore.

    For page migration, introduce mem_cgroup_migrate(), which is called after
    the migration is successful and the new page is fully rmapped. Because
    the old page is no longer uncharged after migration, prevent double
    charges by decoupling the page's memcg association (PCG_USED and
    pc->mem_cgroup) from the page holding an actual charge. The new bits
    PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
    to the new page during migration.

    mem_cgroup_migrate() is suitable for replace_page_cache() as well,
    which gets rid of mem_cgroup_replace_page_cache(). However, care
    needs to be taken because both the source and the target page can
    already be charged and on the LRU when fuse is splicing: grab the page
    lock on the charge moving side to prevent changing pc->mem_cgroup of a
    page under migration. Also, the lruvecs of both pages change as we
    uncharge the old and charge the new during migration, and putback may
    race with us, so grab the lru lock and isolate the pages iff on LRU to
    prevent races and ensure the pages are on the right lruvec afterward.

    Swap accounting is massively simplified: because the page is no longer
    uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
    transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
    before the final put_page() in page reclaim.

    Finally, page_cgroup changes are now protected by whatever protection the
    page itself offers: anonymous pages are charged under the page table lock,
    whereas page cache insertions, swapin, and migration hold the page lock.
    Uncharging happens under full exclusion with no outstanding references.
    Charging and uncharging also ensure that the page is off-LRU, which
    serializes against charge migration. Remove the very costly page_cgroup
    lock and set pc->flags non-atomically.

    [mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable]
    [vdavydov@parallels.com: fix flags definition]
    Signed-off-by: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Tested-by: Jet Chen
    Acked-by: Michal Hocko
    Tested-by: Felipe Balbi
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • These patches rework memcg charge lifetime to integrate more naturally
    with the lifetime of user pages. This drastically simplifies the code and
    reduces charging and uncharging overhead. The most expensive part of
    charging and uncharging is the page_cgroup bit spinlock, which is removed
    entirely after this series.

    Here are the top-10 profile entries of a stress test that reads a 128G
    sparse file on a freshly booted box, without even a dedicated cgroup (i.e.
    executing in the root memcg). Before:

    15.36% cat [kernel.kallsyms] [k] copy_user_generic_string
    13.31% cat [kernel.kallsyms] [k] memset
    11.48% cat [kernel.kallsyms] [k] do_mpage_readpage
    4.23% cat [kernel.kallsyms] [k] get_page_from_freelist
    2.38% cat [kernel.kallsyms] [k] put_page
    2.32% cat [kernel.kallsyms] [k] __mem_cgroup_commit_charge
    2.18% kswapd0 [kernel.kallsyms] [k] __mem_cgroup_uncharge_common
    1.92% kswapd0 [kernel.kallsyms] [k] shrink_page_list
    1.86% cat [kernel.kallsyms] [k] __radix_tree_lookup
    1.62% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn

    After:

    15.67% cat [kernel.kallsyms] [k] copy_user_generic_string
    13.48% cat [kernel.kallsyms] [k] memset
    11.42% cat [kernel.kallsyms] [k] do_mpage_readpage
    3.98% cat [kernel.kallsyms] [k] get_page_from_freelist
    2.46% cat [kernel.kallsyms] [k] put_page
    2.13% kswapd0 [kernel.kallsyms] [k] shrink_page_list
    1.88% cat [kernel.kallsyms] [k] __radix_tree_lookup
    1.67% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn
    1.39% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk
    1.30% cat [kernel.kallsyms] [k] kfree

    As you can see, the memcg footprint has shrunk quite a bit.

    text data bss dec hex filename
    37970 9892 400 48262 bc86 mm/memcontrol.o.old
    35239 9892 400 45531 b1db mm/memcontrol.o

    This patch (of 4):

    The memcg charge API charges pages before they are rmapped - i.e. have an
    actual "type" - and so every callsite needs its own set of charge and
    uncharge functions to know what type is being operated on. Worse,
    uncharge has to happen from a context that is still type-specific, rather
    than at the end of the page's lifetime with exclusive access, and so
    requires a lot of synchronization.

    Rewrite the charge API to provide a generic set of try_charge(),
    commit_charge() and cancel_charge() transaction operations, much like
    what's currently done for swap-in:

    mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
    pages from the memcg if necessary.

    mem_cgroup_commit_charge() commits the page to the charge once it
    has a valid page->mapping and PageAnon() reliably tells the type.

    mem_cgroup_cancel_charge() aborts the transaction.

    This reduces the charge API and enables subsequent patches to
    drastically simplify uncharging.

    As pages need to be committed after rmap is established but before they
    are added to the LRU, page_add_new_anon_rmap() must stop doing LRU
    additions again. Revive lru_cache_add_active_or_unevictable().

    [hughd@google.com: fix shmem_unuse]
    [hughd@google.com: Add comments on the private use of -EAGAIN]
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Signed-off-by: Hugh Dickins
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

07 Aug, 2014

4 commits

  • The gfp arg is not used in shmem_add_to_page_cache. Remove this unused
    arg.

    Signed-off-by: Wang Sheng-Hui
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Sheng-Hui
     
  • Do we really need an exported alias for __SetPageReferenced()? Its
    callers better know what they're doing, in which case the page would not
    be already marked referenced. Kill init_page_accessed(), and just use
    __SetPageReferenced() inline.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Dave Hansen
    Cc: Prabhakar Lad
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • A shared anonymous mapping created without MAP_NORESERVE holds memory
    reservation for the whole range of the shmem segment. Usually there is no
    way to change its size, but /proc/<pid>/map_files/... (available if
    CONFIG_CHECKPOINT_RESTORE=y) allows that.

    This patch adjusts the memory reservation in shmem_setattr().

    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • If __shmem_file_setup() fails on struct file allocation it uncharges
    memory commitment twice: first by shmem_unacct_size() and second time
    implicitly in shmem_evict_inode() when it kills the newly created inode.

    This patch removes shmem_unacct_size() from error path if the inode was
    already there.

    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

24 Jul, 2014

2 commits

  • shmem_fault() is the actual culprit in trinity's hole-punch starvation,
    and the most significant cause of such problems: since a page faulted is
    one that then appears page_mapped(), needing unmap_mapping_range() and
    i_mmap_mutex to be unmapped again.

    But it is not the only way in which a page can be brought into a hole in
    the radix_tree while that hole is being punched; and Vlastimil's testing
    implies that if enough other processors are busy filling in the hole,
    then shmem_undo_range() can be kept from completing indefinitely.

    shmem_file_splice_read() is the main other user of SGP_CACHE, which can
    instantiate shmem pagecache pages in the read-only case (without holding
    i_mutex, so perhaps concurrently with a hole-punch). Probably it's
    silly not to use SGP_READ already (using the ZERO_PAGE for holes): which
    ought to be safe, but might bring surprises - not a change to be rushed.

    shmem_read_mapping_page_gfp() is an internal interface used by
    drivers/gpu/drm GEM (and next by uprobes): it should be okay. And
    shmem_file_read_iter() uses the SGP_DIRTY variant of SGP_CACHE, when
    called internally by the kernel (perhaps for a stacking filesystem,
    which might rely on holes to be reserved): it's unclear whether it could
    be provoked to keep hole-punch busy or not.

    We could apply the same umbrella as now used in shmem_fault() to
    shmem_file_splice_read() and the others; but it looks ugly, and use over
    a range raises questions - should it actually be per page? can these get
    starved themselves?

    The origin of this part of the problem is my v3.1 commit d0823576bf4b
    ("mm: pincer in truncate_inode_pages_range"), once it was duplicated
    into shmem.c. It seemed like a nice idea at the time, to ensure
    (barring RCU lookup fuzziness) that there's an instant when the entire
    hole is empty; but the indefinitely repeated scans to ensure that make
    it vulnerable.

    Revert that "enhancement" to hole-punch from shmem_undo_range(), but
    retain the unproblematic rescanning when it's truncating; add a couple
    of comments there.

    Remove the "indices[0] >= end" test: that is now handled satisfactorily
    by the inner loop, and mem_cgroup_uncharge_start()/end() are too light
    to be worth avoiding here.

    But if we do not always loop indefinitely, we do need to handle the case
    of swap swizzled back to page before shmem_free_swap() gets it: add a
    retry for that case, as suggested by Konstantin Khlebnikov; and for the
    case of page swizzled back to swap, as suggested by Johannes Weiner.

    Signed-off-by: Hugh Dickins
    Reported-by: Sasha Levin
    Suggested-by: Vlastimil Babka
    Cc: Konstantin Khlebnikov
    Cc: Johannes Weiner
    Cc: Lukas Czerner
    Cc: Dave Jones
    Cc: [3.1+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Commit f00cdc6df7d7 ("shmem: fix faulting into a hole while it's
    punched") was buggy: Sasha sent a lockdep report to remind us that
    grabbing i_mutex in the fault path is a no-no (write syscall may already
    hold i_mutex while faulting user buffer).

    We tried a completely different approach (see following patch) but that
    proved inadequate: good enough for a rational workload, but not good
    enough against trinity - which forks off so many mappings of the object
    that contention on i_mmap_mutex while hole-puncher holds i_mutex builds
    into serious starvation when concurrent faults force the puncher to fall
    back to single-page unmap_mapping_range() searches of the i_mmap tree.

    So return to the original umbrella approach, but keep away from i_mutex
    this time. We really don't want to bloat every shmem inode with a new
    mutex or completion, just to protect this unlikely case from trinity.
    So extend the original with wait_queue_head on stack at the hole-punch
    end, and wait_queue item on the stack at the fault end.

    This involves further use of i_lock to guard against the races: lockdep
    has been happy so far, and I see fs/inode.c:unlock_new_inode() holds
    i_lock around wake_up_bit(), which is comparable to what we do here.
    i_lock is more convenient, but we could switch to shmem's info->lock.

    This issue has been tagged with CVE-2014-4171, which will require commit
    f00cdc6df7d7 and this and the following patch to be backported: we
    suggest to 3.1+, though in fact the trinity forkbomb effect might go
    back as far as 2.6.16, when madvise(,,MADV_REMOVE) came in - or might
    not, since much has changed, with i_mmap_mutex a spinlock before 3.0.
    Anyone running trinity on 3.0 and earlier? I don't think we need care.

    Signed-off-by: Hugh Dickins
    Reported-by: Sasha Levin
    Tested-by: Sasha Levin
    Cc: Vlastimil Babka
    Cc: Konstantin Khlebnikov
    Cc: Johannes Weiner
    Cc: Lukas Czerner
    Cc: Dave Jones
    Cc: [3.1+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

04 Jul, 2014

1 commit

  • Under shmem swapping load, I sometimes hit the VM_BUG_ON_PAGE(!PageLRU)
    in isolate_lru_pages() at mm/vmscan.c:1281!

    Commit 2457aec63745 ("mm: non-atomically mark page accessed during page
    cache allocation where possible") looks like interrupted work-in-progress.

    mm/filemap.c's call to init_page_accessed() is fine, but not mm/shmem.c's
    - shmem_write_begin() is clearly wrong to use it after shmem_getpage(),
    when the page is always visible in radix_tree, and often already on LRU.

    Revert change to shmem_write_begin(), and use init_page_accessed() or
    mark_page_accessed() appropriately for SGP_WRITE in shmem_getpage_gfp().

    SGP_WRITE also covers shmem_symlink(), which did not mark_page_accessed()
    before; but since many other filesystems use [__]page_symlink(), which did
    and does mark the page accessed, consider this as rectifying an oversight.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Dave Hansen
    Cc: Prabhakar Lad
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

24 Jun, 2014

2 commits

  • Trinity finds that mmap access to a hole while it's punched from shmem
    can prevent the madvise(MADV_REMOVE) or fallocate(FALLOC_FL_PUNCH_HOLE)
    from completing, until the reader chooses to stop; with the puncher's
    hold on i_mutex locking out all other writers until it can complete.

    It appears that the tmpfs fault path is too light in comparison with its
    hole-punching path, lacking an i_data_sem to obstruct it; but we don't
    want to slow down the common case.

    Extend shmem_fallocate()'s existing range notification mechanism, so
    shmem_fault() can refrain from faulting pages into the hole while it's
    punched, waiting instead on i_mutex (when safe to sleep; or repeatedly
    faulting when not).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Hugh Dickins
    Reported-by: Sasha Levin
    Tested-by: Sasha Levin
    Cc: Dave Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • I was well aware of FALLOC_FL_ZERO_RANGE and FALLOC_FL_COLLAPSE_RANGE
    support being added to fallocate(); but didn't realize until now that I
    had been too stupid to future-proof shmem_fallocate() against new
    additions. Return -EOPNOTSUPP for unsupported modes, instead of going on
    to ordinary fallocation.
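
    For illustration, a small userspace sketch (not from the patch) of what
    tmpfs does and does not support after this fix; the /dev/shm path is just
    an assumed tmpfs mount, and the FALLOC_FL_* constants may need
    <linux/falloc.h> on older libcs:

    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/shm/fallocate-demo", O_RDWR | O_CREAT | O_TRUNC, 0600);
        if (fd == -1 || ftruncate(fd, 8 * 4096) == -1)
            return perror("setup"), 1;

        /* Ordinary fallocation and hole-punching are supported by tmpfs. */
        if (fallocate(fd, 0, 0, 8 * 4096) == -1)
            perror("fallocate");
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      4096, 2 * 4096) == -1)
            perror("punch hole");

        /* Newer modes such as FALLOC_FL_COLLAPSE_RANGE are not: with this
         * fix tmpfs reports EOPNOTSUPP instead of doing ordinary allocation. */
        if (fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 0, 4096) == -1)
            printf("collapse range: %s\n", strerror(errno));

        close(fd);
        unlink("/dev/shm/fallocate-demo");
        return 0;
    }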

    Signed-off-by: Hugh Dickins
    Reviewed-by: Lukas Czerner
    Cc: [3.15]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

13 Jun, 2014

1 commit

  • Pull vfs updates from Al Viro:
    "This the bunch that sat in -next + lock_parent() fix. This is the
    minimal set; there's more pending stuff.

    In particular, I really hope to get acct.c fixes merged this cycle -
    we need that to deal sanely with delayed-mntput stuff. In the next
    pile, hopefully - that series is fairly short and localized
    (kernel/acct.c, fs/super.c and fs/namespace.c). In this pile: more
    iov_iter work. Most of prereqs for ->splice_write with sane locking
    order are there and Kent's dio rewrite would also fit nicely on top of
    this pile"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits)
    lock_parent: don't step on stale ->d_parent of all-but-freed one
    kill generic_file_splice_write()
    ceph: switch to iter_file_splice_write()
    shmem: switch to iter_file_splice_write()
    nfs: switch to iter_splice_write_file()
    fs/splice.c: remove unneeded exports
    ocfs2: switch to iter_file_splice_write()
    ->splice_write() via ->write_iter()
    bio_vec-backed iov_iter
    optimize copy_page_{to,from}_iter()
    bury generic_file_aio_{read,write}
    lustre: get rid of messing with iovecs
    ceph: switch to ->write_iter()
    ceph_sync_direct_write: stop poking into iov_iter guts
    ceph_sync_read: stop poking into iov_iter guts
    new helper: copy_page_from_iter()
    fuse: switch to ->write_iter()
    btrfs: switch to ->write_iter()
    ocfs2: switch to ->write_iter()
    xfs: switch to ->write_iter()
    ...

    Linus Torvalds
     

05 Jun, 2014

1 commit

  • aops->write_begin may allocate a new page and make it visible only to have
    mark_page_accessed called almost immediately after. Once the page is
    visible, the atomic operations are necessary, which is noticeable overhead
    when writing to an in-memory filesystem like tmpfs but should also be
    noticeable with fast storage. The objective of the patch is to initialise
    the accessed information with non-atomic operations before the page is
    visible.

    The bulk of filesystems directly or indirectly use
    grab_cache_page_write_begin or find_or_create_page for the initial
    allocation of a page cache page. This patch adds an init_page_accessed()
    helper which behaves like the first call to mark_page_accessed() but may be
    called before the page is visible and can be done non-atomically.

    The primary APIs of concern in this case are the following and are used
    by most filesystems.

    find_get_page
    find_lock_page
    find_or_create_page
    grab_cache_page_nowait
    grab_cache_page_write_begin

    All of them are very similar in detail, so the patch creates a core helper
    pagecache_get_page() which takes a flags parameter that affects its
    behavior, such as whether the page should be marked accessed or not. The
    old API is preserved but is basically a thin wrapper around this core
    function.

    Each of the filesystems is then updated to avoid calling
    mark_page_accessed when it is known that the VM interfaces have already
    done the job. There is a slight snag in that the timing of the
    mark_page_accessed() has now changed, so in rare cases it's possible a page
    gets to the end of the LRU as PageReferenced whereas previously it might
    have been repromoted. This is expected to be rare but it's worth the
    filesystem people thinking about it in case they see a problem with the
    timing change. It is also the case that some filesystems may be marking
    pages accessed that previously did not but it makes sense that filesystems
    have consistent behaviour in this regard.

    The test case used to evaluate this is a simple dd of a large file done
    multiple times with the file deleted on each iterations. The size of the
    file is 1/10th physical memory to avoid dirty page balancing. In the
    async case it will be possible that the workload completes without even
    hitting the disk and will have variable results but highlight the impact
    of mark_page_accessed for async IO. The sync results are expected to be
    more stable. The exception is tmpfs where the normal case is for the "IO"
    to not hit the disk.

    The test machine was single socket and UMA to avoid any scheduling or NUMA
    artifacts. Throughput and wall times are presented for sync IO, only wall
    times are shown for async as the granularity reported by dd and the
    variability is unsuitable for comparison. As async results were variable
    due to writeback timings, I'm only reporting the maximum figures. The sync
    results were stable enough to make the mean and stddev uninteresting.

    The performance results are reported based on a run with no profiling.
    Profile data is based on a separate run with oprofile running.

    async dd
    3.15.0-rc3 3.15.0-rc3
    vanilla accessed-v2
    ext3 Max elapsed 13.9900 ( 0.00%) 11.5900 ( 17.16%)
    tmpfs Max elapsed 0.5100 ( 0.00%) 0.4900 ( 3.92%)
    btrfs Max elapsed 12.8100 ( 0.00%) 12.7800 ( 0.23%)
    ext4 Max elapsed 18.6000 ( 0.00%) 13.3400 ( 28.28%)
    xfs Max elapsed 12.5600 ( 0.00%) 2.0900 ( 83.36%)

    The XFS figure is a bit strange as it managed to avoid a worst case by
    sheer luck but the average figures looked reasonable.

    samples percentage
    ext3 86107 0.9783 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    ext3 23833 0.2710 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    ext3 5036 0.0573 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    ext4 64566 0.8961 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    ext4 5322 0.0713 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    ext4 2869 0.0384 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    xfs 62126 1.7675 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    xfs 1904 0.0554 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    xfs 103 0.0030 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    btrfs 10655 0.1338 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    btrfs 2020 0.0273 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    btrfs 587 0.0079 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    tmpfs 59562 3.2628 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    tmpfs 1210 0.0696 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    tmpfs 94 0.0054 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed

    [akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Tested-by: Prabhakar Lad
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman