18 Aug, 2010

1 commit

  • list_add() corruption messages were reported from shmem_fill_super()'s
    recently introduced percpu_counter_init(): shmem_put_super() needs to
    remember to call percpu_counter_destroy(). Also check the error return
    from percpu_counter_init() (a sketch of the pairing follows this entry).

    Reported-bisected-and-tested-by: Tetsuo Handa
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
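
    A minimal sketch of the pairing described above, in kernel C. The
    used_blocks field, SHMEM_SB() helper and error label follow the tmpfs code
    of that era, but the bodies are illustrative and the rest of the
    superblock setup is elided:

        #include <linux/fs.h>
        #include <linux/slab.h>
        #include <linux/percpu_counter.h>

        static void shmem_put_super(struct super_block *sb)
        {
                struct shmem_sb_info *sbinfo = SHMEM_SB(sb);

                /*
                 * Forgetting this leaves the counter linked on the global
                 * percpu_counters list, hence the list_add() corruption.
                 */
                percpu_counter_destroy(&sbinfo->used_blocks);
                kfree(sbinfo);
                sb->s_fs_info = NULL;
        }

        static int shmem_fill_super(struct super_block *sb, void *data, int silent)
        {
                struct shmem_sb_info *sbinfo;

                sbinfo = kzalloc(sizeof(*sbinfo), GFP_KERNEL);
                if (!sbinfo)
                        return -ENOMEM;
                sb->s_fs_info = sbinfo;

                /* Check the return value: percpu_counter_init() can fail. */
                if (percpu_counter_init(&sbinfo->used_blocks, 0))
                        goto failed;

                /* ... rest of the superblock setup ... */
                return 0;

        failed:
                shmem_put_super(sb);
                return -ENOMEM;
        }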
     

11 Aug, 2010

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (96 commits)
    no need for list_for_each_entry_safe()/resetting with superblock list
    Fix sget() race with failing mount
    vfs: don't hold s_umount over close_bdev_exclusive() call
    sysv: do not mark superblock dirty on remount
    sysv: do not mark superblock dirty on mount
    btrfs: remove junk sb_dirt change
    BFS: clean up the superblock usage
    AFFS: wait for sb synchronization when needed
    AFFS: clean up dirty flag usage
    cifs: truncate fallout
    mbcache: fix shrinker function return value
    mbcache: Remove unused features
    add f_flags to struct statfs(64)
    pass a struct path to vfs_statfs
    update VFS documentation for method changes.
    All filesystems that need invalidate_inode_buffers() are doing that explicitly
    convert remaining ->clear_inode() to ->evict_inode()
    Make ->drop_inode() just return whether inode needs to be dropped
    fs/inode.c:clear_inode() is gone
    fs/inode.c:evict() doesn't care about delete vs. non-delete paths now
    ...

    Fix up trivial conflicts in fs/nilfs2/super.c

    Linus Torvalds
     

10 Aug, 2010

6 commits

  • I'm running a shmem pagefault test case (see attached file) on a 64 CPU
    system. Profiling shows shmem_inode_info->lock is heavily contended and
    100% of CPU time is spent trying to acquire the lock. In the pagefault
    (no swap) case, shmem_getpage takes the lock twice; the second acquisition
    is avoidable if we preallocate a page, saving one round of locking. That
    is what the patch below does (a sketch of the pattern follows this entry).

    The result of the test case:
    2.6.35-rc3: ~20s
    2.6.35-rc3 + patch: ~12s
    so this is 40% improvement.

    One might argue that we could have better locking for shmem. But even if
    shmem were lockless, the pagefault path would soon have the pagecache lock
    heavily contended, because shmem must add each new page to the pagecache.
    So until we have better locking for the pagecache, improving shmem locking
    doesn't buy much more. I ran a similar pagefault test against a ramfs
    file; that test result is ~10.5s.

    [akpm@linux-foundation.org: fix comment, clean up code layout, eliminate code duplication]
    Signed-off-by: Shaohua Li
    Cc: Hugh Dickins
    Cc: "Zhang, Yanmin"
    Cc: Tim Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
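
    A sketch of the "allocate before locking" pattern the entry describes
    (illustrative only: shmem_acct_and_insert() is a hypothetical stand-in for
    the accounting and pagecache insertion that shmem_getpage() performs under
    info->lock):

        #include <linux/gfp.h>
        #include <linux/spinlock.h>
        #include <linux/shmem_fs.h>

        /* Hypothetical stand-in for the work done under info->lock. */
        static void shmem_acct_and_insert(struct shmem_inode_info *info,
                                          struct page *page)
        {
                /* elided */
        }

        static struct page *alloc_then_account(struct shmem_inode_info *info)
        {
                struct page *page;

                /* Do the expensive, sleeping allocation outside the lock. */
                page = alloc_page(GFP_HIGHUSER_MOVABLE);
                if (!page)
                        return NULL;

                /* A single lock round trip now covers account + insert. */
                spin_lock(&info->lock);
                shmem_acct_and_insert(info, page);
                spin_unlock(&info->lock);

                return page;
        }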
     
  • The current implementation of tmpfs is not scalable. We found that
    stat_lock is contended by multiple threads when we need to get a new page,
    leading to useless spinning inside this spin lock.

    This patch makes use of the percpu_counter library to maintain a local
    count of used blocks, to speed up getting and returning of pages. The
    acquisition of stat_lock then becomes unnecessary for getting and
    returning blocks, improving the performance of tmpfs on systems with a
    large number of cpus. On a 4 socket 32 core NHM-EX system, we saw an
    improvement of 270%. (A sketch of the accounting pattern follows this
    entry.)

    The implementation below has a slight chance of a race between threads
    causing a small overshoot of the maximum configured blocks. However, any
    overshoot is small, and is bounded by the number of cpus. It happens
    when the number of used blocks is slightly below the maximum configured
    blocks when a thread checks the used block count, and another thread
    allocates the last block before the current thread does. This should not
    be a problem for tmpfs, as the overshoot is most likely a few blocks and
    bounded. If a strict limit is really desired, configure the max blocks to
    be the limit less the number of cpus in the system.

    Signed-off-by: Tim Chen
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tim Chen
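
    A sketch of the block accounting this moves to. The used_blocks and
    max_blocks names follow the patch description, but the helpers shown are
    illustrative rather than the literal diff; percpu_counter_compare() is
    the library's compare helper:

        #include <linux/percpu_counter.h>

        static int shmem_acct_block(struct shmem_sb_info *sbinfo)
        {
                if (!sbinfo->max_blocks)
                        return 0;       /* unlimited: nothing to account */

                /*
                 * A cheap per-cpu increment replaces taking stat_lock; the
                 * check-then-increment is not atomic, which is where the
                 * small, cpu-bounded overshoot described above comes from.
                 */
                if (percpu_counter_compare(&sbinfo->used_blocks,
                                           sbinfo->max_blocks) >= 0)
                        return -ENOSPC;
                percpu_counter_inc(&sbinfo->used_blocks);
                return 0;
        }

        static void shmem_unacct_block(struct shmem_sb_info *sbinfo)
        {
                if (sbinfo->max_blocks)
                        percpu_counter_dec(&sbinfo->used_blocks);
        }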
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Make sure we check the truncate constraints early on in ->setattr by adding
    those checks to inode_change_ok. Also clean up and document inode_change_ok
    to make this obvious.

    As a fallout, we don't have to call inode_newsize_ok from simple_setsize,
    and can simplify it down to a truncate_setsize which doesn't return an
    error. This simplifies a lot of setattr implementations and means we use
    truncate_setsize almost everywhere. Get rid of fat_setsize now that it's
    trivial, and mark ext2_setsize static to make the calling convention
    obvious.

    Keep the inode_newsize_ok in vmtruncate for now as all callers need an
    audit for its removal anyway.

    Note: setattr code in ecryptfs doesn't call inode_change_ok at all and
    needs a deeper audit, but that is left for later.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
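
    A sketch of the ->setattr shape the series converges on for a simple
    in-memory filesystem (example_setattr is a hypothetical method; the
    helpers it calls are the ones this series introduces or reworks):

        #include <linux/fs.h>
        #include <linux/mm.h>

        static int example_setattr(struct dentry *dentry, struct iattr *attr)
        {
                struct inode *inode = dentry->d_inode;
                int error;

                /* Size and permission checks now all happen up front here. */
                error = inode_change_ok(inode, attr);
                if (error)
                        return error;

                if ((attr->ia_valid & ATTR_SIZE) &&
                    attr->ia_size != inode->i_size)
                        truncate_setsize(inode, attr->ia_size); /* can't fail */

                setattr_copy(inode, attr);
                mark_inode_dirty(inode);
                return 0;
        }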
     
  • Make sure we call inode_change_ok before doing any changes in ->setattr,
    and make sure to call it even if our fs wants to ignore normal UNIX
    permissions, but use the ATTR_FORCE to skip those.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Despite its name, it's not a generic implementation of ->setattr, but
    rather a helper to copy attributes from a struct iattr to the inode.
    Rename it to setattr_copy to reflect this fact.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

05 Jun, 2010

1 commit

  • mtime and ctime should be changed only if the file size has actually
    changed. The patches changing ext2 and tmpfs from vmtruncate to the new
    truncate sequence caused regressions where they always update timestamps.

    There are some strange cases in POSIX where truncate(2) must not update
    times unless the size has actually changed, see 6e656be89.

    This area is all still rather buggy in different ways in a lot of
    filesystems and needs a cleanup and audit (ideally the vfs will provide
    a simple attribute or call to direct all filesystems exactly which
    attributes to change). But coming up with the best solution will take a
    while and is not appropriate for rc anyway.

    So fix the recent regression for now (a sketch of the restored check
    follows this entry).

    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    Nick Piggin
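
    A sketch of the restored check as it applies to a 2.6.35-era setattr path
    (the function is illustrative, not the literal ext2/tmpfs diff;
    simple_setsize() is that kernel's truncate helper):

        #include <linux/fs.h>

        static int example_truncate(struct inode *inode, struct iattr *attr)
        {
                int error;

                /* Only a real size change truncates and bumps the times. */
                if ((attr->ia_valid & ATTR_SIZE) &&
                    attr->ia_size != inode->i_size) {
                        error = simple_setsize(inode, attr->ia_size);
                        if (error)
                                return error;
                        inode->i_mtime = inode->i_ctime = CURRENT_TIME;
                }
                return 0;
        }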
     

28 May, 2010

3 commits

  • Cc: Christoph Hellwig
    Acked-by: Hugh Dickins
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    npiggin@suse.de
     
  • We don't name our generic fsync implementations very well currently.
    The no-op implementation for in-memory filesystems is currently called
    simple_sync_file, which doesn't make much sense to start with, and the
    generic one for simple filesystems is called simple_fsync, which can lead
    to some confusion.

    This patch renames the generic file fsync method to generic_file_fsync,
    to match the other generic_file_* routines it is supposed to be used
    with, and the no-op implementation to noop_fsync, to make it obvious
    what to expect. In addition, add some documentation for both methods.
    (A wiring sketch follows this entry.)

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
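
    A sketch of how the renamed helpers get wired up; both file_operations
    tables are hypothetical and abbreviated:

        #include <linux/fs.h>

        /* In-memory filesystem: data never needs to be written back. */
        static const struct file_operations inmem_file_operations = {
                /* other methods elided */
                .fsync  = noop_fsync,
        };

        /* Simple disk filesystem: flush dirty data, then sync the inode. */
        static const struct file_operations simplefs_file_operations = {
                /* other methods elided */
                .fsync  = generic_file_fsync,
        };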
     
  • This patch adds support for moving the charge of file pages, which
    includes normal files, tmpfs files and swaps of tmpfs files. It's enabled
    by setting bit 1 of /memory.move_charge_at_immigrate.

    Unlike the case of anonymous pages, file pages (and swaps) in the range
    mmapped by the task will be moved even if the task hasn't faulted them in,
    i.e. they might not be the task's "RSS", but another task's "RSS" that
    maps the same file. And the mapcount of the page is ignored (the page can
    be moved even if page_mapcount(page) > 1). So the conditions a page/swap
    must meet to be moved are that it is in the range mmapped by the target
    task and that it is charged to the old cgroup.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix warning]
    Signed-off-by: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     

25 May, 2010

1 commit

  • prep_new_page() will call set_page_private(page, 0) to initialise the
    page, so the code is redundant.

    Signed-off-by: Huang Shijie
    Reviewed-by: Minchan Kim
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie
     

22 May, 2010

2 commits


17 Dec, 2009

6 commits

  • Replacing

        error = 0;
        if (error)
                op

    with nothing is not quite an equivalent transformation ;-)

    Signed-off-by: Al Viro

    Al Viro
     
  • Now that we cache the ACL pointers in the generic inode all the generic_acl
    cruft can go away and generic_acl.c can directly implement xattr handlers
    dealing with the full Posix ACL semantics for in-memory filesystems.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Add a flags argument to struct xattr_handler and pass it to all xattr
    handler methods. This allows using the same methods for multiple
    handlers, e.g. for the ACL methods which perform exactly the same action
    for the access and default ACLs, just using a different underlying
    attribute. With a little more groundwork it'll also allow sharing the
    methods for the regular user/trusted/secure handlers in extN, ocfs2 and
    jffs2 like it's already done for xfs in this patch.

    Also change the inode argument to the handlers to a dentry, to allow
    using the handler mechanism for filesystems that require it later,
    e.g. cifs.

    [with GFS2 bits updated by Steven Whitehouse ]

    Signed-off-by: Christoph Hellwig
    Reviewed-by: James Morris
    Acked-by: Joel Becker
    Signed-off-by: Al Viro

    Christoph Hellwig
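
    A sketch of the sharing this enables: one ->list method serving both ACL
    handlers, keyed off the new flags field. The handler layout mirrors the
    description above; the method body is illustrative:

        #include <linux/xattr.h>
        #include <linux/posix_acl.h>
        #include <linux/posix_acl_xattr.h>
        #include <linux/string.h>

        static size_t shared_acl_list(struct dentry *dentry, char *list,
                                      size_t list_size, const char *name,
                                      size_t name_len, int handler_flags)
        {
                /* handler_flags carries handler->flags: which ACL this is. */
                const char *xname = handler_flags == ACL_TYPE_ACCESS ?
                        POSIX_ACL_XATTR_ACCESS : POSIX_ACL_XATTR_DEFAULT;
                size_t len = strlen(xname) + 1;

                if (list && len <= list_size)
                        memcpy(list, xname, len);
                return len;
        }

        static struct xattr_handler acl_access_handler = {
                .prefix = POSIX_ACL_XATTR_ACCESS,
                .flags  = ACL_TYPE_ACCESS,
                .list   = shared_acl_list,
        };

        static struct xattr_handler acl_default_handler = {
                .prefix = POSIX_ACL_XATTR_DEFAULT,
                .flags  = ACL_TYPE_DEFAULT,
                .list   = shared_acl_list,
        };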
     
  • There are 2 groups of alloc_file() callers:
    * ones that are followed by ima_counts_get
    * ones giving non-regular files
    So let's pull that ima_counts_get() into alloc_file();
    it's a no-op in case of non-regular files.

    Signed-off-by: Al Viro

    Al Viro
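
    A sketch of the consolidation, with alloc_file() abbreviated to the
    relevant tail (the three-argument, struct path form is the one this
    series moves to; surrounding details are illustrative):

        #include <linux/file.h>
        #include <linux/fs.h>
        #include <linux/ima.h>

        struct file *alloc_file(struct path *path, fmode_t mode,
                                const struct file_operations *fop)
        {
                struct file *file = get_empty_filp();

                if (!file)
                        return NULL;

                file->f_path = *path;
                file->f_mapping = path->dentry->d_inode->i_mapping;
                file->f_mode = mode;
                file->f_op = fop;

                /* A no-op for non-regular files, so call unconditionally. */
                ima_counts_get(file);
                return file;
        }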
     
  • ... and have the caller grab both mnt and dentry; kill
    leak in infiniband, while we are at it.

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     

16 Dec, 2009

1 commit

  • While we're fiddling with the swap_map values, let's assign a particular
    value to shmem/tmpfs swap pages: their swap counts are never incremented,
    and it helps swapoff's try_to_unuse() a little if it can immediately
    distinguish those pages from process pages.

    Since we've no use for SWAP_MAP_BAD | COUNT_CONTINUED,
    we might as well use that 0xbf value for SWAP_MAP_SHMEM.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
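
    A sketch of the resulting swap_map value layout (constants as in the
    swap.h of that series; the helper is illustrative):

        #define SWAP_MAP_MAX            0x3e    /* max ordinary swap count */
        #define SWAP_MAP_BAD            0x3f    /* bad block */
        #define COUNT_CONTINUED         0x80    /* count continued elsewhere */
        #define SWAP_MAP_SHMEM          0xbf    /* owned by shmem/tmpfs */

        /* The kind of early check this lets swapoff's try_to_unuse() make. */
        static inline int swap_count_is_shmem(unsigned char count)
        {
                return count == SWAP_MAP_SHMEM;
        }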
     

28 Sep, 2009

1 commit


26 Sep, 2009

2 commits

  • * 'writeback' of git://git.kernel.dk/linux-2.6-block:
    writeback: writeback_inodes_sb() should use bdi_start_writeback()
    writeback: don't delay inodes redirtied by a fast dirtier
    writeback: make the super_block pinning more efficient
    writeback: don't resort for a single super_block in move_expired_inodes()
    writeback: move inodes from one super_block together
    writeback: get rid to incorrect references to pdflush in comments
    writeback: improve readability of the wb_writeback() continue/break logic
    writeback: cleanup writeback_single_inode()
    writeback: kupdate writeback shall not stop when more io is possible
    writeback: stop background writeback when below background threshold
    writeback: balance_dirty_pages() shall write more than dirtied pages
    fs: Fix busyloop in wb_writeback()

    Linus Torvalds
     
  • Signed-off-by: Jens Axboe

    Jens Axboe
     

24 Sep, 2009

1 commit

  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
    HWPOISON: Enable error_remove_page on btrfs
    HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
    HWPOISON: Add madvise() based injector for hardware poisoned pages v4
    HWPOISON: Enable error_remove_page for NFS
    HWPOISON: Enable .remove_error_page for migration aware file systems
    HWPOISON: The high level memory error handler in the VM v7
    HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
    HWPOISON: shmem: call set_page_dirty() with locked page
    HWPOISON: Define a new error_remove_page address space op for async truncation
    HWPOISON: Add invalidate_inode_page
    HWPOISON: Refactor truncate to allow direct truncating of page v2
    HWPOISON: check and isolate corrupted free pages v2
    HWPOISON: Handle hardware poisoned pages in try_to_unmap
    HWPOISON: Use bitmask/action code for try_to_unmap behaviour
    HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
    HWPOISON: Add poison check to page fault handling
    HWPOISON: Add basic support for poisoned pages in fault handler v3
    HWPOISON: Add new SIGBUS error codes for hardware poison signals
    HWPOISON: Add support for poison swap entries v2
    HWPOISON: Export some rmap vma locking to outside world
    ...

    Linus Torvalds
     

22 Sep, 2009

4 commits

  • Fixes the following kmemcheck false positive (the compiler is using
    a 32-bit mov to load the 16-bit sbinfo->mode in shmem_fill_super):

    [ 0.337000] Total of 1 processors activated (3088.38 BogoMIPS).
    [ 0.352000] CPU0 attaching NULL sched-domain.
    [ 0.360000] WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (9f8020fc)
    [ 0.361000] a44240820000000041f6998100000000000000000000000000000000ff030000
    [ 0.368000] i i i i i i i i i i i i i i i i u u u u i i i i i i i i i i u u
    [ 0.375000] ^
    [ 0.376000]
    [ 0.377000] Pid: 9, comm: khelper Not tainted (2.6.31-tip #206) P4DC6
    [ 0.378000] EIP: 0060:[] EFLAGS: 00010246 CPU: 0
    [ 0.379000] EIP is at shmem_fill_super+0xb5/0x120
    [ 0.380000] EAX: 00000000 EBX: 9f845400 ECX: 824042a4 EDX: 8199f641
    [ 0.381000] ESI: 9f8020c0 EDI: 9f845400 EBP: 9f81af68 ESP: 81cd6eec
    [ 0.382000] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
    [ 0.383000] CR0: 8005003b CR2: 9f806200 CR3: 01ccd000 CR4: 000006d0
    [ 0.384000] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
    [ 0.385000] DR6: ffff4ff0 DR7: 00000400
    [ 0.386000] [] get_sb_nodev+0x3c/0x80
    [ 0.388000] [] shmem_get_sb+0x14/0x20
    [ 0.390000] [] vfs_kern_mount+0x4f/0x120
    [ 0.392000] [] init_tmpfs+0x7e/0xb0
    [ 0.394000] [] do_basic_setup+0x17/0x30
    [ 0.396000] [] kernel_init+0x57/0xa0
    [ 0.398000] [] kernel_thread_helper+0x7/0x10
    [ 0.400000] [] 0xffffffff
    [ 0.402000] khelper used greatest stack depth: 2820 bytes left
    [ 0.407000] calling init_mmap_min_addr+0x0/0x10 @ 1
    [ 0.408000] initcall init_mmap_min_addr+0x0/0x10 returned 0 after 0 usecs

    Reported-by: Ingo Molnar
    Analysed-by: Vegard Nossum
    Signed-off-by: Pekka Enberg
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
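
    A hedged sketch of the usual remedy for this class of report: zero the
    whole shmem_sb_info allocation so the short mode field and its padding
    start initialized. Illustrative only, assuming the mm/shmem.c definitions
    are in scope; it is not necessarily the literal fix:

        #include <linux/slab.h>
        #include <linux/cache.h>
        #include <linux/kernel.h>

        static struct shmem_sb_info *shmem_alloc_sbinfo(void)
        {
                /* kzalloc rather than kmalloc: every byte starts zeroed. */
                return kzalloc(max((int)sizeof(struct shmem_sb_info),
                                   L1_CACHE_BYTES), GFP_KERNEL);
        }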
     
  • CONFIG_SHMEM off gives you (ramfs masquerading as) tmpfs, even when
    CONFIG_TMPFS is off: that's a little anomalous, and I'd intended to make
    more sense of it by removing CONFIG_TMPFS altogether, always enabling its
    code when CONFIG_SHMEM; but so many defconfigs have CONFIG_SHMEM on with
    CONFIG_TMPFS off that we'd better leave that as is.

    But there is no point in asking for CONFIG_TMPFS if CONFIG_SHMEM is off:
    make TMPFS depend on SHMEM, which also prevents TMPFS_POSIX_ACL's
    shmem_acl.o from being pointlessly built into the kernel when SHMEM is off.

    And a selfish change, to prevent the world from being rebuilt when I
    switch between CONFIG_SHMEM on and off: the only use of CONFIG_SHMEM in
    the header files is mm.h's shmem_lock() - give that a shmem.c stub instead.

    Signed-off-by: Hugh Dickins
    Acked-by: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Fix the following 'make includecheck' warning:

    mm/shmem.c: linux/vfs.h is included more than once.

    Signed-off-by: Jaswinder Singh Rajput
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jaswinder Singh Rajput
     
  • After commit 355cfa73 ("mm: modify swap_map and add SWAP_HAS_CACHE flag"),
    only contexts which have set the SWAP_HAS_CACHE flag via swapcache_prepare()
    or get_swap_page() call add_to_swap_cache(). So add_to_swap_cache()
    doesn't return -EEXIST any more.

    Even though it doesn't return -EEXIST, it's not good behavior conceptually
    to call swapcache_prepare() in the -EEXIST case, because that means
    clearing the SWAP_HAS_CACHE flag while the entry is in the swap cache.

    This patch removes redundant code and comments from its callers, and adds
    a VM_BUG_ON() in the error path of add_to_swap_cache(), plus some
    comments.

    Signed-off-by: Daisuke Nishimura
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     

16 Sep, 2009

3 commits

  • Enable removing of corrupted pages through truncation
    for a bunch of file systems: ext*, xfs, gfs2, ocfs2, ntfs.
    These should cover most server needs.

    I chose the set of migration-aware file systems for this
    for now, assuming they have been especially audited.
    But in general it should be safe for all file systems
    on the data area that support read/write and truncate.

    Caveat: the hardware error handler does not take i_mutex
    for now before calling the truncate function. Is that ok?

    Cc: tytso@mit.edu
    Cc: hch@infradead.org
    Cc: mfasheh@suse.com
    Cc: aia21@cantab.net
    Cc: hugh.dickins@tiscali.co.uk
    Cc: swhiteho@redhat.com
    Signed-off-by: Andi Kleen

    Andi Kleen
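
    A sketch of what enabling this looks like per filesystem: one new entry in
    its address_space_operations (the aops table shown is for a hypothetical
    filesystem built on the libfs helpers):

        #include <linux/fs.h>
        #include <linux/mm.h>

        static const struct address_space_operations examplefs_aops = {
                .readpage       = simple_readpage,
                .write_begin    = simple_write_begin,
                .write_end      = simple_write_end,
                /* let hwpoison drop corrupted pagecache pages by truncation */
                .error_remove_page = generic_error_remove_page,
        };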
     
  • The dirtying of the page and the set_page_dirty() call can both be moved
    inside the page lock.

    - In shmem_write_end(), the page was dirtied while the page lock was held,
    but it's being marked dirty just after dropping the page lock.
    - In shmem_symlink(), both dirtying and marking can be moved into page lock.

    It's valuable for the hwpoison code to know whether one bad page can be
    dropped without losing data. It mainly judges by testing the PG_dirty bit
    after taking the page lock. So it becomes important that the dirtying of
    the page and the marking of dirtiness are both done inside the page lock.
    This is common practice, but sadly not a rule.

    The noticeable exceptions are
    - mapped pages
    - pages with buffer_heads
    The above pages could go dirty at any time. Fortunately the hwpoison will
    unmap the page and release the buffer_heads beforehand anyway.

    Many other types of pages (eg. metadata pages) can also be dirtied at will
    by their owners; the hwpoison code cannot do meaningful things to them
    anyway. Only the dirtiness of pagecache pages owned by regular files is of
    interest here.

    v2: AK: Add comment about set_page_dirty rules (suggested by Peter Zijlstra)

    Acked-by: Hugh Dickins
    Reviewed-by: WANG Cong
    Signed-off-by: Wu Fengguang
    Signed-off-by: Andi Kleen

    Wu Fengguang
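
    A sketch of the ordering in a shmem_write_end()-style path after the
    change (the function and its arguments are illustrative):

        #include <linux/mm.h>
        #include <linux/pagemap.h>

        static int example_write_end(struct page *page, unsigned int copied)
        {
                /* Mark dirty while still holding the page lock ... */
                set_page_dirty(page);
                /* ... and only then drop the lock and the reference. */
                unlock_page(page);
                page_cache_release(page);

                return copied;
        }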
     
  • Devtmpfs lets the kernel create a tmpfs instance called devtmpfs
    very early at kernel initialization, before any driver-core device
    is registered. Every device with a major/minor will provide a
    device node in devtmpfs.

    Devtmpfs can be changed and altered by userspace at any time,
    and in any way needed - just like today's udev-mounted tmpfs.
    Unmodified udev versions will run just fine on top of it, and will
    recognize an already existing kernel-created device node and use it.
    The default node permissions are root:root 0600. Proper permissions
    and user/group ownership, meaningful symlinks, all other policy still
    needs to be applied by userspace.

    If a node is created by devtmpfs, devtmpfs will remove the device node
    when the device goes away. If the device node was created by
    userspace, or the devtmpfs-created node was replaced by userspace, it
    will no longer be removed by devtmpfs.

    If it is requested to auto-mount it, it makes init=/bin/sh work
    without any further userspace support. /dev will be fully populated
    and dynamic, and will always reflect the current device state of the
    kernel. With the commonly used dynamic device numbers, it solves the
    problem where static device nodes may point to the wrong devices.

    It is intended to make the initial bootup logic simpler and more robust,
    by de-coupling the creation of the initial environment (to reliably run
    userspace processes) from the complex userspace bootstrap logic needed to
    provide a working /dev.

    Signed-off-by: Kay Sievers
    Signed-off-by: Jan Blunck
    Tested-By: Harald Hoyer
    Tested-By: Scott James Remnant
    Signed-off-by: Greg Kroah-Hartman

    Kay Sievers
     

09 Sep, 2009

1 commit


25 Jun, 2009

1 commit


24 Jun, 2009

1 commit


17 Jun, 2009

2 commits

  • As shmem_file_setup() does not modify, allocate, free, or pass on the
    given filename, mark it as const.

    Signed-off-by: Sergei Trofimovich
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergei Trofimovich
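
    The resulting declaration (placement of the prototype is illustrative):

        struct file *shmem_file_setup(const char *name, loff_t size,
                                      unsigned long flags);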
     
  • In a following patch, the usage of the swap cache is recorded in swap_map.
    This patch makes the necessary interface changes to do that.

    2 interfaces:

    - swapcache_prepare()
    - swapcache_free()

    are added for allocating/freeing a refcount from the swap cache on
    existing swap entries. The implementation itself is not changed by this
    patch. While adding swapcache_free(), memcg's hook code is moved under
    swapcache_free(). This is better than using scattered hooks.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Acked-by: Balbir Singh
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Li Zefan
    Cc: Dhaval Giani
    Cc: YAMAMOTO Takashi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
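
    A sketch of how a swap-cache user employs the new pair; the caller is a
    condensed, illustrative version of the read_swap_cache_async-style flow:

        #include <linux/swap.h>
        #include <linux/gfp.h>

        static int cache_one_entry(struct page *page, swp_entry_t entry)
        {
                /* Take a swap-cache reference; fails if the entry is gone. */
                if (!swapcache_prepare(entry))
                        return -ENOENT;

                if (add_to_swap_cache(page, entry, GFP_KERNEL)) {
                        /* Drop the reference; the memcg hook now lives here. */
                        swapcache_free(entry, NULL);
                        return -ENOMEM;
                }
                return 0;
        }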
     

22 May, 2009

2 commits

  • Based on discussion on lkml (Andrew Morton and Eric Paris),
    move ima_counts_get down a layer into shmem/hugetlb__file_setup().
    Resolves drm shmem_file_setup() usage case as well.

    HD comment:
    I still think you're doing this at the wrong level, but recognize
    that you probably won't be persuaded until a few more users of
    alloc_file() emerge, all wanting your ima_counts_get().

    Resolving GEM's shmem_file_setup() is an improvement, so I'll say

    Acked-by: Hugh Dickins
    Signed-off-by: Mimi Zohar
    Signed-off-by: James Morris

    Mimi Zohar
     
  • - Add support in ima_path_check() for integrity checking without
    incrementing the counts. (Required for nfsd.)
    - rename and export opencount_get to ima_counts_get
    - replace ima_shm_check calls with ima_counts_get
    - export ima_path_check

    Signed-off-by: Mimi Zohar
    Signed-off-by: James Morris

    Mimi Zohar