29 May, 2011

1 commit

  • Some recent benchmarking on btrfs showed that a major scaling bottleneck
    on large systems on btrfs is currently the xattr lookup on every write.

    Why xattr lookup on every write I hear you ask?

    write wants to drop suid and security related xattrs that could set o
    capabilities for executables. To do that it currently looks up
    security.capability on EVERY write (even for non executables) to decide
    whether to drop it or not.

    In btrfs this causes an additional tree walk, hitting some per file system
    locks and quite bad scalability. In a simple read workload on a 8S
    system I saw over 90% CPU time in spinlocks related to that.

    Chris Mason tells me this is also a problem in ext4, where it hits
    the global mbcache lock.

    This patch adds a simple per inode to avoid this problem. We only
    do the lookup once per file and then if there is no xattr cache
    the decision. All xattr changes clear the flag.

    I also used the same flag to avoid the suid check, although
    that one is pretty cheap.

    A file system can also set this flag when it creates the inode,
    if it has a cheap way to do so. This is done for some common file systems
    in followon patches.

    With this patch a major part of the lock contention disappears
    for btrfs. Some testing on smaller systems didn't show significant
    performance changes, but at least it helps the larger systems
    and is generally more efficient.

    v2: Rename is_sgid. add file system helper.
    Cc: chris.mason@oracle.com
    Cc: josef@redhat.com
    Cc: viro@zeniv.linux.org.uk
    Cc: agruen@linbit.com
    Cc: Serge E. Hallyn
    Signed-off-by: Andi Kleen
    Signed-off-by: Al Viro

    Andi Kleen
     

27 May, 2011

3 commits

  • Tell the filesystem if we just updated timestamp (I_DIRTY_SYNC) or
    anything else, so that the filesystem can track internally if it
    needs to push out a transaction for fdatasync or not.

    This is just the prototype change with no user for it yet. I plan
    to push large XFS changes for the next merge window, and getting
    this trivial infrastructure in this window would help a lot to avoid
    tree interdependencies.

    Also remove incorrect comments that ->dirty_inode can't block. That
    has been changed a long time ago, and many implementations rely on it.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/djm/tmem:
    xen: cleancache shim to Xen Transcendent Memory
    ocfs2: add cleancache support
    ext4: add cleancache support
    btrfs: add cleancache support
    ext3: add cleancache support
    mm/fs: add hooks to support cleancache
    mm: cleancache core ops functions and config
    fs: add field to superblock to support cleancache
    mm/fs: cleancache documentation

    Fix up trivial conflict in fs/btrfs/extent_io.c due to includes

    Linus Torvalds
     
  • This second patch of eight in this cleancache series adds a field to
    the generic superblock to squirrel away a pool identifier that is
    dynamically provided by cleancache-enabled filesystems at mount time
    to uniquely identify files and pages belonging to this mounted filesystem.

    Details and a FAQ can be found in Documentation/vm/cleancache.txt

    [v8: trivial merge conflict update]
    Signed-off-by: Dan Magenheimer
    Reviewed-by: Jeremy Fitzhardinge
    Reviewed-by: Konrad Rzeszutek Wilk
    Cc: Andrew Morton
    Cc: Al Viro
    Cc: Matthew Wilcox
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Rik Van Riel
    Cc: Jan Beulich
    Cc: Chris Mason
    Cc: Andreas Dilger
    Cc: Ted Ts'o
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Nitin Gupta

    Dan Magenheimer
     

25 May, 2011

3 commits

  • Apps are increasingly using more than 1024 file descriptors. See
    discussion in several distro bug trackers, e.g. BugLink:
    http://bugs.launchpad.net/bugs/663090
    https://issues.rpath.com/browse/RPL-2054

    You don't want to raise the default soft limit, since that might break
    apps that use select(), but it's safe to raise the default hard limit;
    that way, apps that know they need lots of file descriptors can raise
    their soft limit without needing root, and without user intervention.

    Ubuntu is doing this with a kernel change because they have a policy of
    not changing kernel defaults in userland.

    While 4096 might not be enough for *all* apps, it seems to be plenty for
    the apps I've seen lately that are unhappy with 1024.

    Signed-off-by: Tim Gardner
    Cc: Dan Kegel
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tim Gardner
     
  • Straightforward conversion of i_mmap_lock to a mutex.

    Signed-off-by: Peter Zijlstra
    Acked-by: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Hugh says:
    "The only significant loser, I think, would be page reclaim (when
    concurrent with truncation): could spin for a long time waiting for
    the i_mmap_mutex it expects would soon be dropped? "

    Counter points:
    - cpu contention makes the spin stop (need_resched())
    - zap pages should be freeing pages at a higher rate than reclaim
    ever can

    I think the simplification of the truncate code is definitely worth it.

    Effectively reverts: 2aa15890f3c ("mm: prevent concurrent
    unmap_mapping_range() on the same inode") and takes out the code that
    caused its problem.

    Signed-off-by: Peter Zijlstra
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

16 May, 2011

1 commit


15 May, 2011

1 commit

  • FS_COW_FL and FS_NOCOW_FL were newly introduced to control per file
    COW in btrfs, but FS_NOCOW_FL is sufficient.

    The fact is we don't have corresponding BTRFS_INODE_COW flag.

    COW is default, and FS_NOCOW_FL can be used to switch off COW for
    a single file.

    If we mount btrfs with nodatacow, a newly created file will be set with
    the FS_NOCOW_FL flag. So to turn on COW for it, we can just clear the
    FS_NOCOW_FL flag.

    Signed-off-by: Li Zefan
    Signed-off-by: Chris Mason

    Li Zefan
     

13 Apr, 2011

2 commits

  • Gaah. When commit be85bccaa5aa reverted the export of file system uuid
    via /proc//mountinfo, it also unintentionally removed the s_uuid
    field in struct super_block.

    I didn't mean to do that, since filesystems have been taught to fill it
    in (and we want to keep it for future re-introduction in the mountinfo
    file).

    Stupid of me. This adds it back in.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • This reverts commit 93f1c20bc8cdb757be50566eff88d65c3b26881f.

    It turns out that libmount misparses it because it adds a '-' character
    in the uuid string, which libmount then incorrectly confuses with the
    separator string (" - ") at the end of all the optional arguments.

    Upstream libmount (in the util-linux tree) has been fixed, but until
    that fix actually percolates up to users, we'd better not expose this
    change in the kernel.

    Let's revisit this later (possibly by exposing the UUID without any '-'
    characters in it, avoiding the user-space bug).

    Reported-by: Dave Jones
    Cc: Aneesh Kumar K.V
    Cc: Al Viro
    Cc: Karel Zak
    Cc: Ram Pai
    Cc: Miklos Szeredi
    Cc: Eric Sandeen
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

08 Apr, 2011

1 commit


06 Apr, 2011

1 commit

  • With the ->sync_page() hook gone, we have a few users that
    add their own static address_space_operations without any
    functions defined.

    fs/inode.c already has an empty_aops that it uses for init
    purposes. Lets export that and use it in the places where
    an otherwise empty aops was defined.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

31 Mar, 2011

1 commit


29 Mar, 2011

1 commit

  • …it/mason/btrfs-unstable

    * 'for-linus-unmerged' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (45 commits)
    Btrfs: fix __btrfs_map_block on 32 bit machines
    btrfs: fix possible deadlock by clearing __GFP_FS flag
    btrfs: check link counter overflow in link(2)
    btrfs: don't mess with i_nlink of unlocked inode in rename()
    Btrfs: check return value of btrfs_alloc_path()
    Btrfs: fix OOPS of empty filesystem after balance
    Btrfs: fix memory leak of empty filesystem after balance
    Btrfs: fix return value of setflags ioctl
    Btrfs: fix uncheck memory allocations
    btrfs: make inode ref log recovery faster
    Btrfs: add btrfs_trim_fs() to handle FITRIM
    Btrfs: adjust btrfs_discard_extent() return errors and trimmed bytes
    Btrfs: make btrfs_map_block() return entire free extent for each device of RAID0/1/10/DUP
    Btrfs: make update_reserved_bytes() public
    btrfs: return EXDEV when linking from different subvolumes
    Btrfs: Per file/directory controls for COW and compression
    Btrfs: add datacow flag in inode flag
    btrfs: use GFP_NOFS instead of GFP_KERNEL
    Btrfs: check return value of read_tree_block()
    btrfs: properly access unaligned checksum buffer
    ...

    Fix up trivial conflicts in fs/btrfs/volumes.c due to plug removal in
    the block layer.

    Linus Torvalds
     

28 Mar, 2011

1 commit

  • For datacow control, the corresponding inode flags are needed.
    This is for btrfs use.

    v1->v2:
    Change FS_COW_FL to another bit due to conflict with the upstream e2fsprogs

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    liubo
     

25 Mar, 2011

3 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    fs: simplify iget & friends
    fs: pull inode->i_lock up out of writeback_single_inode
    fs: rename inode_lock to inode_hash_lock
    fs: move i_wb_list out from under inode_lock
    fs: move i_sb_list out from under inode_lock
    fs: remove inode_lock from iput_final and prune_icache
    fs: Lock the inode LRU list separately
    fs: factor inode disposal
    fs: protect inode->i_state with inode->i_lock
    autofs4: Do not potentially dereference NULL pointer returned by fget() in autofs_dev_ioctl_setpipefd()
    autofs4 - remove autofs4_lock
    autofs4 - fix d_manage() return on rcu-walk
    autofs4 - fix autofs4_expire_indirect() traversal
    autofs4 - fix dentry leak in autofs4_expire_direct()
    autofs4 - reinstate last used update on access
    vfs - check non-mountpoint dentry might block in __follow_mount_rcu()

    Linus Torvalds
     
  • Protect inode state transitions and validity checks with the
    inode->i_lock. This enables us to make inode state transitions
    independently of the inode_lock and is the first step to peeling
    away the inode_lock from the code.

    This requires that __iget() is done atomically with i_state checks
    during list traversals so that we don't race with another thread
    marking the inode I_FREEING between the state check and grabbing the
    reference.

    Also remove the unlock_new_inode() memory barrier optimisation
    required to avoid taking the inode_lock when clearing I_NEW.
    Simplify the code by simply taking the inode->i_lock around the
    state change and wakeup. Because the wakeup is no longer tricky,
    remove the wake_up_inode() function and open code the wakeup where
    necessary.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • * 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
    Documentation/iostats.txt: bit-size reference etc.
    cfq-iosched: removing unnecessary think time checking
    cfq-iosched: Don't clear queue stats when preempt.
    blk-throttle: Reset group slice when limits are changed
    blk-cgroup: Only give unaccounted_time under debug
    cfq-iosched: Don't set active queue in preempt
    block: fix non-atomic access to genhd inflight structures
    block: attempt to merge with existing requests on plug flush
    block: NULL dereference on error path in __blkdev_get()
    cfq-iosched: Don't update group weights when on service tree
    fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
    block: Require subsystems to explicitly allocate bio_set integrity mempool
    jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    fs: make fsync_buffers_list() plug
    mm: make generic_writepages() use plugging
    blk-cgroup: Add unaccounted time to timeslice_used.
    block: fixup plugging stubs for !CONFIG_BLOCK
    block: remove obsolete comments for blkdev_issue_zeroout.
    blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
    ...

    Fix up conflicts in fs/{aio.c,super.c}

    Linus Torvalds
     

24 Mar, 2011

2 commits

  • And give it a kernel-doc comment.

    [akpm@linux-foundation.org: btrfs changed in linux-next]
    Signed-off-by: Serge E. Hallyn
    Cc: "Eric W. Biederman"
    Cc: Daniel Lezcano
    Acked-by: David Howells
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • Cheat for now and say all files belong to init_user_ns. Next step will be
    to let superblocks belong to a user_ns, and derive inode_userns(inode)
    from inode->i_sb->s_user_ns. Finally we'll introduce more flexible
    arrangements.

    Changelog:
    Feb 15: make is_owner_or_cap take const struct inode
    Feb 23: make is_owner_or_cap bool

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Serge E. Hallyn
    Acked-by: "Eric W. Biederman"
    Acked-by: Daniel Lezcano
    Acked-by: David Howells
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     

23 Mar, 2011

1 commit


18 Mar, 2011

1 commit


17 Mar, 2011

4 commits

  • * 'mnt_devname' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    vfs: bury ->get_sb()
    nfs: switch NFS from ->get_sb() to ->mount()
    nfs: stop mangling ->mnt_devname on NFS
    vfs: new superblock methods to override /proc/*/mount{s,info}
    nfs: nfs_do_{ref,sub}mount() superblock argument is redundant
    nfs: make nfs_path() work without vfsmount
    nfs: store devname at disconnected NFS roots
    nfs: propagate devname to nfs{,4}_get_root()

    Linus Torvalds
     
  • This is an ex-parrot.

    Signed-off-by: Al Viro

    Al Viro
     
  • a) ->show_devname(m, mnt) - what to put into devname columns in mounts,
    mountinfo and mountstats
    b) ->show_path(m, mnt) - what to put into relative path column in mountinfo

    Leaving those NULL gives old behaviour. NFS switched to using those.

    Signed-off-by: Al Viro

    Al Viro
     
  • …s/security-testing-2.6

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (33 commits)
    AppArmor: kill unused macros in lsm.c
    AppArmor: cleanup generated files correctly
    KEYS: Add an iovec version of KEYCTL_INSTANTIATE
    KEYS: Add a new keyctl op to reject a key with a specified error code
    KEYS: Add a key type op to permit the key description to be vetted
    KEYS: Add an RCU payload dereference macro
    AppArmor: Cleanup make file to remove cruft and make it easier to read
    SELinux: implement the new sb_remount LSM hook
    LSM: Pass -o remount options to the LSM
    SELinux: Compute SID for the newly created socket
    SELinux: Socket retains creator role and MLS attribute
    SELinux: Auto-generate security_is_socket_class
    TOMOYO: Fix memory leak upon file open.
    Revert "selinux: simplify ioctl checking"
    selinux: drop unused packet flow permissions
    selinux: Fix packet forwarding checks on postrouting
    selinux: Fix wrong checks for selinux_policycap_netpeer
    selinux: Fix check for xfrm selinux context algorithm
    ima: remove unnecessary call to ima_must_measure
    IMA: remove IMA imbalance checking
    ...

    Linus Torvalds
     

15 Mar, 2011

3 commits

  • New flag for open(2) - O_PATH. Semantics:
    * pathname is resolved, but the file itself is _NOT_ opened
    as far as filesystem is concerned.
    * almost all operations on the resulting descriptors shall
    fail with -EBADF. Exceptions are:
    1) operations on descriptors themselves (i.e.
    close(), dup(), dup2(), dup3(), fcntl(fd, F_DUPFD),
    fcntl(fd, F_DUPFD_CLOEXEC, ...), fcntl(fd, F_GETFD),
    fcntl(fd, F_SETFD, ...))
    2) fcntl(fd, F_GETFL), for a common non-destructive way to
    check if descriptor is open
    3) "dfd" arguments of ...at(2) syscalls, i.e. the starting
    points of pathname resolution
    * closing such descriptor does *NOT* affect dnotify or
    posix locks.
    * permissions are checked as usual along the way to file;
    no permission checks are applied to the file itself. Of course,
    giving such thing to syscall will result in permission checks (at
    the moment it means checking that starting point of ....at() is
    a directory and caller has exec permissions on it).

    fget() and fget_light() return NULL on such descriptors; use of
    fget_raw() and fget_raw_light() is needed to get them. That protects
    existing code from dealing with those things.

    There are two things still missing (they come in the next commits):
    one is handling of symlinks (right now we refuse to open them that
    way; see the next commit for semantics related to those) and another
    is descriptor passing via SCM_RIGHTS datagrams.

    Signed-off-by: Al Viro

    Al Viro
     
  • We add a per superblock uuid field. File systems should
    update the uuid in the fill_super callback

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Al Viro

    Aneesh Kumar K.V
     
  • The syscall also return mount id which can be used
    to lookup file system specific information such as uuid
    in /proc//mountinfo

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Al Viro

    Aneesh Kumar K.V
     

14 Mar, 2011

3 commits

  • New helpers: user_statfs() and fd_statfs(), taking userland pathname and
    descriptor resp. and filling struct kstatfs. Syscalls of statfs family
    (native, compat and foreign - osf and hpux on alpha and parisc resp.)
    switched to those. Removes some boilerplate code, simplifies cleanup
    on errors...

    Signed-off-by: Al Viro

    Al Viro
     
  • new function: file_open_root(dentry, mnt, name, flags) opens the file
    vfs_path_lookup would arrive to.

    Note that name can be empty; in that case the usual requirement that
    dentry should be a directory is lifted.

    open-coded equivalents switched to it, may_open() got down exactly
    one caller and became static.

    Signed-off-by: Al Viro

    Al Viro
     
  • take calculation of open_flags by open(2) arguments into new helper
    in fs/open.c, move filp_open() over there, have it and do_sys_open()
    use that helper, switch exec.c callers of do_filp_open() to explicit
    (and constant) struct open_flags.

    Signed-off-by: Al Viro

    Al Viro
     

10 Mar, 2011

3 commits


08 Mar, 2011

1 commit


26 Feb, 2011

1 commit

  • * 'for-linus' of git://neil.brown.name/md:
    md: Fix - again - partition detection when array becomes active
    Fix over-zealous flush_disk when changing device size.
    md: avoid spinlock problem in blk_throtl_exit
    md: correctly handle probe of an 'mdp' device.
    md: don't set_capacity before array is active.
    md: Fix raid1->raid0 takeover

    Linus Torvalds
     

24 Feb, 2011

2 commits

  • There are two cases when we call flush_disk.
    In one, the device has disappeared (check_disk_change) so any
    data will hold becomes irrelevant.
    In the oter, the device has changed size (check_disk_size_change)
    so data we hold may be irrelevant.

    In both cases it makes sense to discard any 'clean' buffers,
    so they will be read back from the device if needed.

    In the former case it makes sense to discard 'dirty' buffers
    as there will never be anywhere safe to write the data. In the
    second case it *does*not* make sense to discard dirty buffers
    as that will lead to file system corruption when you simply enlarge
    the containing devices.

    flush_disk calls __invalidate_devices.
    __invalidate_device calls both invalidate_inodes and invalidate_bdev.

    invalidate_inodes *does* discard I_DIRTY inodes and this does lead
    to fs corruption.

    invalidate_bev *does*not* discard dirty pages, but I don't really care
    about that at present.

    So this patch adds a flag to __invalidate_device (calling it
    __invalidate_device2) to indicate whether dirty buffers should be
    killed, and this is passed to invalidate_inodes which can choose to
    skip dirty inodes.

    flusk_disk then passes true from check_disk_change and false from
    check_disk_size_change.

    dm avoids tripping over this problem by calling i_size_write directly
    rathher than using check_disk_size_change.

    md does use check_disk_size_change and so is affected.

    This regression was introduced by commit 608aeef17a which causes
    check_disk_size_change to call flush_disk, so it is suitable for any
    kernel since 2.6.27.

    Cc: stable@kernel.org
    Acked-by: Jeff Moyer
    Cc: Andrew Patterson
    Cc: Jens Axboe
    Signed-off-by: NeilBrown

    NeilBrown
     
  • Michael Leun reported that running parallel opens on a fuse filesystem
    can trigger a "kernel BUG at mm/truncate.c:475"

    Gurudas Pai reported the same bug on NFS.

    The reason is, unmap_mapping_range() is not prepared for more than
    one concurrent invocation per inode. For example:

    thread1: going through a big range, stops in the middle of a vma and
    stores the restart address in vm_truncate_count.

    thread2: comes in with a small (e.g. single page) unmap request on
    the same vma, somewhere before restart_address, finds that the
    vma was already unmapped up to the restart address and happily
    returns without doing anything.

    Another scenario would be two big unmap requests, both having to
    restart the unmapping and each one setting vm_truncate_count to its
    own value. This could go on forever without any of them being able to
    finish.

    Truncate and hole punching already serialize with i_mutex. Other
    callers of unmap_mapping_range() do not, and it's difficult to get
    i_mutex protection for all callers. In particular ->d_revalidate(),
    which calls invalidate_inode_pages2_range() in fuse, may be called
    with or without i_mutex.

    This patch adds a new mutex to 'struct address_space' to prevent
    running multiple concurrent unmap_mapping_range() on the same mapping.

    [ We'll hopefully get rid of all this with the upcoming mm
    preemptibility series by Peter Zijlstra, the "mm: Remove i_mmap_mutex
    lockbreak" patch in particular. But that is for 2.6.39 ]

    Signed-off-by: Miklos Szeredi
    Reported-by: Michael Leun
    Reported-by: Gurudas Pai
    Tested-by: Gurudas Pai
    Acked-by: Hugh Dickins
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Miklos Szeredi