04 Jun, 2011

1 commit

  • Caching "we have already removed suid/caps" was overenthusiastic as merged.
    On network filesystems we might have had suid/caps set on another client,
    silently picked by this client on revalidate, all of that *without* clearing
    the S_NOSEC flag.

    AFAICS, the only reasonably sane way to deal with that is
    * new superblock flag; unless set, S_NOSEC is not going to be set.
    * local block filesystems set it in their ->mount() (more accurately,
    mount_bdev() does, so does btrfs ->mount(), users of mount_bdev() other than
    local block ones clear it)
    * if any network filesystem (or a cluster one) wants to use S_NOSEC,
    it'll need to set MS_NOSEC in sb->s_flags *AND* take care to clear S_NOSEC when
    inode attribute changes are picked from other clients.

    It's not an earth-shattering hole (anybody that can set suid on another client
    will almost certainly be able to write to the file before doing that anyway),
    but it's a bug that needs fixing.
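
    The gating rule described above can be sketched in plain userspace C.
    This is only an illustration, not the kernel code: the sketch_* types,
    helpers and flag values are hypothetical stand-ins for the superblock
    flag (MS_NOSEC) and the inode flag (S_NOSEC) named in the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel flags; the values are illustrative. */
#define SKETCH_MS_NOSEC (1u << 0) /* superblock opts in to S_NOSEC caching */
#define SKETCH_S_NOSEC  (1u << 0) /* inode: "no suid/caps to strip" is cached */

struct sketch_sb    { unsigned int s_flags; };
struct sketch_inode { struct sketch_sb *i_sb; unsigned int i_flags; };

/* Cache "nothing to strip" only when the superblock allows it. */
static void sketch_mark_nosec(struct sketch_inode *inode)
{
    assert(inode != NULL && inode->i_sb != NULL);
    if (inode->i_sb->s_flags & SKETCH_MS_NOSEC)
        inode->i_flags |= SKETCH_S_NOSEC;
}

/* A filesystem that opts in must drop the cached state whenever
 * attribute changes are picked up from another client. */
static void sketch_remote_attr_change(struct sketch_inode *inode)
{
    assert(inode != NULL);
    inode->i_flags &= ~SKETCH_S_NOSEC;
}

static bool sketch_is_nosec(const struct sketch_inode *inode)
{
    return (inode->i_flags & SKETCH_S_NOSEC) != 0;
}
```

    In this sketch an inode on a superblock without the opt-in flag never
    gets the cached state, which is the safe default for network filesystems.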

    Signed-off-by: Al Viro

    Al Viro
     

27 May, 2011

2 commits

  • * 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (28 commits)
    Ocfs2: Teach local-mounted ocfs2 to handle unwritten_extents correctly.
    ocfs2/dlm: Do not migrate resource to a node that is leaving the domain
    ocfs2/dlm: Add new dlm message DLM_BEGIN_EXIT_DOMAIN_MSG
    Ocfs2/move_extents: Set several trivial constraints for threshold.
    Ocfs2/move_extents: Let defrag handle partial extent moving.
    Ocfs2/move_extents: move/defrag extents within a certain range.
    Ocfs2/move_extents: helper to calculate the defraging length in one run.
    Ocfs2/move_extents: move entire/partial extent.
    Ocfs2/move_extents: helpers to update the group descriptor and global bitmap inode.
    Ocfs2/move_extents: helper to probe a proper region to move in an alloc group.
    Ocfs2/move_extents: helper to validate and adjust moving goal.
    Ocfs2/move_extents: find the victim alloc group, where the given #blk fits.
    Ocfs2/move_extents: defrag a range of extent.
    Ocfs2/move_extents: move a range of extent.
    Ocfs2/move_extents: lock allocators and reserve metadata blocks and data clusters for extents moving.
    Ocfs2/move_extents: Add basic framework and source files for extent moving.
    Ocfs2/move_extents: Adding new ioctl code 'OCFS2_IOC_MOVE_EXT' to ocfs2.
    Ocfs2/refcounttree: Publicize couple of funcs from refcounttree.c
    Ocfs2: Add a new code 'OCFS2_INFO_FREEFRAG' for o2info ioctl.
    Ocfs2: Add a new code 'OCFS2_INFO_FREEINODE' for o2info ioctl.
    ...

    Linus Torvalds
     
  • This eighth patch of eight in this cleancache series "opts-in"
    cleancache for ocfs2. Clustered filesystems must explicitly enable
    cleancache by calling cleancache_init_shared_fs anytime an instance
    of the filesystem is mounted. Ocfs2 is currently the only user of
    the clustered filesystem interface but nevertheless, the cleancache
    hooks in the VFS layer are sufficient for ocfs2 including the matching
    cleancache_flush_fs hook which must be called on unmount.

    Details and a FAQ can be found in Documentation/vm/cleancache.txt

    [v8: trivial merge conflict update]
    [v5: jeremy@goop.org: simplify init hook and any future fs init changes]
    Signed-off-by: Dan Magenheimer
    Signed-off-by: Joel Becker
    Reviewed-by: Jeremy Fitzhardinge
    Reviewed-by: Konrad Rzeszutek Wilk
    Cc: Mark Fasheh
    Cc: Andrew Morton
    Cc: Al Viro
    Cc: Matthew Wilcox
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Rik Van Riel
    Cc: Jan Beulich
    Cc: Chris Mason
    Cc: Andreas Dilger
    Cc: Ted Tso
    Cc: Nitin Gupta

    Dan Magenheimer
     

24 May, 2011

1 commit


31 Mar, 2011

1 commit


29 Mar, 2011

1 commit

  • * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (39 commits)
    Treat writes as new when holes span across page boundaries
    fs,ocfs2: Move o2net_get_func_run_time under CONFIG_OCFS2_FS_STATS.
    ocfs2/dlm: Move kmalloc() outside the spinlock
    ocfs2: Make the left masklogs compat.
    ocfs2: Remove masklog ML_AIO.
    ocfs2: Remove masklog ML_UPTODATE.
    ocfs2: Remove masklog ML_BH_IO.
    ocfs2: Remove masklog ML_JOURNAL.
    ocfs2: Remove masklog ML_EXPORT.
    ocfs2: Remove masklog ML_DCACHE.
    ocfs2: Remove masklog ML_NAMEI.
    ocfs2: Remove mlog(0) from fs/ocfs2/dir.c
    ocfs2: remove NAMEI from symlink.c
    ocfs2: Remove masklog ML_QUOTA.
    ocfs2: Remove mlog(0) from quota_local.c.
    ocfs2: Remove masklog ML_RESERVATIONS.
    ocfs2: Remove masklog ML_XATTR.
    ocfs2: Remove masklog ML_SUPER.
    ocfs2: Remove mlog(0) from fs/ocfs2/heartbeat.c
    ocfs2: Remove mlog(0) from fs/ocfs2/slot_map.c
    ...

    Fix up trivial conflict in fs/ocfs2/super.c

    Linus Torvalds
     

16 Mar, 2011

1 commit

  • * 'for-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: fix build failure introduced by s/freezeable/freezable/
    workqueue: add system_freezeable_wq
    rds/ib: use system_wq instead of rds_ib_fmr_wq
    net/9p: replace p9_poll_task with a work
    net/9p: use system_wq instead of p9_mux_wq
    xfs: convert to alloc_workqueue()
    reiserfs: make commit_wq use the default concurrency level
    ocfs2: use system_wq instead of ocfs2_quota_wq
    ext4: convert to alloc_workqueue()
    scsi/scsi_tgt_lib: scsi_tgtd isn't used in memory reclaim path
    scsi/be2iscsi,qla2xxx: convert to alloc_workqueue()
    misc/iwmc3200top: use system_wq instead of dedicated workqueues
    i2o: use alloc_workqueue() instead of create_workqueue()
    acpi: kacpi*_wq don't need WQ_MEM_RECLAIM
    fs/aio: aio_wq isn't used in memory reclaim path
    input/tps6507x-ts: use system_wq instead of dedicated workqueue
    cpufreq: use system_wq instead of dedicated workqueues
    wireless/ipw2x00: use system_wq instead of dedicated workqueues
    arm/omap: use system_wq in mailbox
    workqueue: use WQ_MEM_RECLAIM instead of WQ_RESCUER

    Linus Torvalds
     

07 Mar, 2011

1 commit

  • mlog_exit is used to record the exit status of a function.
    But because it is added in so many functions, if we enable it,
    the system logs get filled up quickly and cause too much I/O.
    So in practice no one can enable it on a production system or even
    for a test.

    This patch just tries to remove it or change it. So:
    1. if all the error paths already use mlog_errno, it is just removed.
    Otherwise, it will be replaced by mlog_errno.
    2. if it is used to print some return value, it is replaced with
    mlog(0,...).
    mlog_exit_ptr is changed to mlog(0,...).
    All those mlog(0,...) will be replaced with trace events later.

    Signed-off-by: Tao Ma

    Tao Ma
     

23 Feb, 2011

1 commit


21 Feb, 2011

2 commits

  • About one year ago, Wengang Wang tried some first steps
    to add tracepoints to ocfs2. His original patch is here:
    http://oss.oracle.com/pipermail/ocfs2-devel/2009-November/005512.html

    But as Steven Rostedt indicated in his article
    http://lwn.net/Articles/383362/, we'd better have our trace
    files reside in fs/ocfs2, so I rewrote the patch using the
    method Steven mentioned in that article.

    Signed-off-by: Wengang Wang
    Signed-off-by: Tao Ma

    Wengang Wang
     
  • ENTRY is used to record the entry of a function.
    But because it is added in so many functions, if we enable it,
    the system logs get filled up quickly and cause too much I/O.
    So in practice no one can enable it on a production system or even
    for a test.

    So for mlog_entry_void, we just remove it.
    For mlog_entry(...), we replace it with mlog(0,...), and these
    will be replaced by trace events later.

    Signed-off-by: Tao Ma

    Tao Ma
     

20 Feb, 2011

1 commit

  • Commit 2c442719e90a44a6982c033d69df4aae4b167cfa added some checks for proper
    heartbeat mode when the o2cb stack is running. Unfortunately, it didn't
    take into account that a userspace stack could be running. Fix this by only
    doing the check if o2cb is in use. This patch allows userspace stacks to
    mount the fs again.

    Cc: stable@kernel.org
    Signed-off-by: Mark Fasheh
    Signed-off-by: Joel Becker

    Mark Fasheh
     

01 Feb, 2011

1 commit

  • ocfs2_quota_wq is not depended upon during memory reclaim and, with
    cmwq, there's no reason to use a dedicated workqueue. Drop
    ocfs2_quota_wq and use system_wq instead. dqi_sync_work is already
    sync canceled on quota disable and no further synchronization is
    necessary.

    This change makes ocfs2_quota_setup/shutdown() noops. Both functions
    are removed.

    Signed-off-by: Tejun Heo
    Cc: Mark Fasheh
    Cc: Joel Becker

    Tejun Heo
     

21 Jan, 2011

1 commit


13 Jan, 2011

2 commits

  • Signed-off-by: Al Viro

    Al Viro
     
  • As Al Viro pointed out, path resolution during Q_QUOTAON calls to quotactl
    is prone to deadlocks. We hold the s_umount semaphore for reading during the
    path resolution, and the resolution itself may need to acquire the semaphore
    for writing when e.g. an autofs mountpoint is passed.

    Solve the problem by performing the resolution before we get hold of the
    superblock (and thus s_umount semaphore). The whole thing is complicated
    by the fact that some filesystems (OCFS2) ignore the path argument. So to
    distinguish between filesystems which want the path and those which do not,
    we introduce a new .quota_on_meta callback which does not get the path. OCFS2
    then uses this callback instead of the old .quota_on.

    CC: Al Viro
    CC: Christoph Hellwig
    CC: Ted Ts'o
    CC: Joel Becker
    Signed-off-by: Jan Kara

    Jan Kara
     

07 Jan, 2011

1 commit

  • RCU free the struct inode. This will allow:

    - Subsequent store-free path walking patch. The inode must be consulted for
    permissions when walking, so an RCU inode reference is a must.
    - sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
    to take i_lock no longer need to take sb_inode_list_lock to walk the list in
    the first place. This will simplify and optimize locking.
    - Could remove some nested trylock loops in dcache code
    - Could potentially simplify things a bit in VM land. Do not need to take the
    page lock to follow page->mapping.

    The downside of this is the performance cost of using RCU. In a simple
    creat/unlink microbenchmark, performance drops by about 10% due to inability to
    reuse cache-hot slab objects. As iterations increase and RCU freeing starts
    kicking over, this increases to about 20%.

    In cases where inode lifetimes are longer (ie. many inodes may be allocated
    during the average life span of a single inode), a lot of this cache reuse is
    not applicable, so the regression caused by this patch is smaller.

    The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
    however this adds some complexity to list walking and store-free path walking,
    so I prefer to implement this at a later date, if it is shown to be a win in
    real situations. I haven't found a regression in any non-micro benchmark so I
    doubt it will be a problem.

    Signed-off-by: Nick Piggin

    Nick Piggin
     

18 Nov, 2010

1 commit


29 Oct, 2010

1 commit


23 Oct, 2010

1 commit

  • * 'vfs' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl: (30 commits)
    BKL: remove BKL from freevxfs
    BKL: remove BKL from qnx4
    autofs4: Only declare function when CONFIG_COMPAT is defined
    autofs: Only declare function when CONFIG_COMPAT is defined
    ncpfs: Lock socket in ncpfs while setting its callbacks
    fs/locks.c: prepare for BKL removal
    BKL: Remove BKL from ncpfs
    BKL: Remove BKL from OCFS2
    BKL: Remove BKL from squashfs
    BKL: Remove BKL from jffs2
    BKL: Remove BKL from ecryptfs
    BKL: Remove BKL from afs
    BKL: Remove BKL from USB gadgetfs
    BKL: Remove BKL from autofs4
    BKL: Remove BKL from isofs
    BKL: Remove BKL from fat
    BKL: Remove BKL from ext2 filesystem
    BKL: Remove BKL from do_new_mount()
    BKL: Remove BKL from cgroup
    BKL: Remove BKL from NTFS
    ...

    Linus Torvalds
     

16 Oct, 2010

1 commit


12 Oct, 2010

2 commits

  • Currently, the default behavior of O_DIRECT writes is to allow
    concurrent writing among nodes to the same file, with no cluster
    coherency guaranteed (no EX lock held). This can leave stale data in
    the cache for buffered reads on other nodes.

    The new mount option introduces a choice between two different
    behaviors for O_DIRECT writes:

    * coherency=full, the default value, disallows concurrent O_DIRECT
    writes by taking EX locks.

    * coherency=buffered allows concurrent O_DIRECT writes among nodes
    without the EX lock, which gains high performance at the risk of
    getting stale data on other nodes.
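
    As a rough illustration of the two modes, the following userspace
    sketch parses the option value and decides whether an O_DIRECT write
    would take the EX lock. The sketch_* names are hypothetical and are not
    the identifiers used in fs/ocfs2; only the option values "full" and
    "buffered" come from the message above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum sketch_coherency {
    SKETCH_COHERENCY_FULL,     /* default: O_DIRECT writes take the EX lock */
    SKETCH_COHERENCY_BUFFERED  /* concurrent O_DIRECT writes, no EX lock */
};

/* Parse the value of a coherency= option. A missing value (NULL) selects
 * the default, full. Returns 0 on success and -1 for an unknown value. */
static int sketch_parse_coherency(const char *value, enum sketch_coherency *out)
{
    assert(out != NULL);
    if (value == NULL || strcmp(value, "full") == 0) {
        *out = SKETCH_COHERENCY_FULL;
        return 0;
    }
    if (strcmp(value, "buffered") == 0) {
        *out = SKETCH_COHERENCY_BUFFERED;
        return 0;
    }
    return -1;
}

/* Whether an O_DIRECT write takes the cluster-wide EX lock in this mode. */
static bool sketch_needs_ex_lock(enum sketch_coherency mode)
{
    return mode == SKETCH_COHERENCY_FULL;
}
```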

    Signed-off-by: Tristan Ye
    Signed-off-by: Joel Becker

    Tristan Ye
     
  • Functions such as ocfs2_recovery_init() make use of osb->max_slots.
    Initialize osb->max_slots early so the functions may use the correct
    value.

    Signed-off-by: Goldwyn Rodrigues
    Signed-off-by: Joel Becker

    Goldwyn Rodrigues
     

10 Oct, 2010

1 commit

  • OCFS2_FEATURE_INCOMPAT_CLUSTERINFO allows us to use sb->s_cluster_info for
    both userspace and o2cb cluster stacks. It also allows us to extend cluster
    info to include stack flags.

    This patch also adds stackflags to sb->s_clusterinfo. It also introduces a
    clusterinfo flag OCFS2_CLUSTER_O2CB_GLOBAL_HEARTBEAT to denote the enabled
    global heartbeat mode.

    This incompat flag can be set/cleared using tunefs.ocfs2 --fs-features. The
    clusterinfo flag is set/cleared using tunefs.ocfs2 --update-cluster-stack.

    Signed-off-by: Sunil Mushran

    Sunil Mushran
     

08 Oct, 2010

1 commit


05 Oct, 2010

2 commits

  • The BKL in ocfs2/dlmfs is used in put_super, fill_super and remount_fs,
    all three of which are protected by the superblock's s_umount rw_semaphore.

    The use in ocfs2_control_open is evidently unrelated and the function
    is protected by ocfs2_control_lock.

    Therefore it is safe to remove the BKL entirely.

    Signed-off-by: Arnd Bergmann
    Cc: Mark Fasheh
    Cc: Joel Becker

    Arnd Bergmann
     
  • This patch is a preparation necessary to remove the BKL from do_new_mount().
    It explicitly adds calls to lock_kernel()/unlock_kernel() around
    get_sb/fill_super operations for filesystems that still use the BKL.

    I've read through all the code formerly covered by the BKL inside
    do_kern_mount() and have satisfied myself that it doesn't need the BKL
    any more.

    do_kern_mount() is already called without the BKL when mounting the rootfs
    and in nfsctl. do_kern_mount() calls vfs_kern_mount(), which is called
    from various places without BKL: simple_pin_fs(), nfs_do_clone_mount()
    through nfs_follow_mountpoint(), afs_mntpt_do_automount() through
    afs_mntpt_follow_link(). Both of the latter functions are actually the
    filesystem's follow_link inode operation. vfs_kern_mount() calls the specified
    get_sb function and lets the filesystem do its job by calling the given
    fill_super function.

    Therefore I think it is safe to push down the BKL from the VFS to the
    low-level filesystems' get_sb/fill_super operations.

    [arnd: do not add the BKL to those file systems that already
    don't use it elsewhere]

    Signed-off-by: Jan Blunck
    Signed-off-by: Arnd Bergmann
    Cc: Matthew Wilcox
    Cc: Christoph Hellwig

    Jan Blunck
     

10 Sep, 2010

2 commits

  • During orphan scan, if we are slot 0, and we are replaying
    orphan_dir:0001, the general process is that for every file
    in this dir:
    1. we will iget orphan_dir:0001; since there is no inode for it,
    we will have to create an inode and read it from the disk.
    2. do the normal work, such as delete_inode and remove it from
    the dir if it is allowed.
    3. call iput orphan_dir:0001 when we are done. In this case,
    since we have no dcache for this inode, i_count will
    reach 0, and VFS will have to call clear_inode and in
    ocfs2_clear_inode we will checkpoint the inode which will let
    ocfs2_cmt and journald begin to work.
    4. We loop back to 1 for the next file.

    So you see, for every deleted file we actually have to read the
    orphan dir from the disk and checkpoint the journal. It is very
    time consuming and causes a lot of journal checkpoint I/O.
    A better solution is to hold another reference to these
    inodes in ocfs2_super. So if there is no other race among
    nodes (which will let dlmglue checkpoint the inode), for step 3,
    clear_inode won't be called and for step 1, we may only need to
    read the inode the first time. This is a big win for us.

    So this patch will try to cache system inodes of other slots so
    that we will have one more reference for these inodes and avoid
    the extra inode read and journal checkpoint.
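
    The effect of holding one extra reference per slot can be sketched in
    userspace as follows. All names here (sketch_super, sketch_get_orphan_dir
    and so on) are hypothetical; the real patch works on ocfs2_super and VFS
    inodes, and the disk read is simulated by a counter.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_SLOTS 8

struct sketch_inode { int slot; int refcount; };

struct sketch_super {
    struct sketch_inode *orphan_dir[SKETCH_MAX_SLOTS]; /* cached, one ref each */
    struct sketch_inode storage[SKETCH_MAX_SLOTS];     /* simulated on-disk inodes */
    int disk_reads;                                    /* simulated disk reads */
};

/* Simulated expensive path: create the inode and read it from disk. */
static struct sketch_inode *sketch_read_from_disk(struct sketch_super *sb, int slot)
{
    sb->disk_reads++;
    sb->storage[slot].slot = slot;
    sb->storage[slot].refcount = 0;
    return &sb->storage[slot];
}

/* Return the orphan dir of a slot with a reference for the caller.
 * The cache keeps its own reference, so the count never drops to zero
 * between calls and the inode is read from disk only once. */
static struct sketch_inode *sketch_get_orphan_dir(struct sketch_super *sb, int slot)
{
    if (slot < 0 || slot >= SKETCH_MAX_SLOTS)
        return NULL;
    if (sb->orphan_dir[slot] == NULL) {
        sb->orphan_dir[slot] = sketch_read_from_disk(sb, slot);
        sb->orphan_dir[slot]->refcount = 1; /* the cache's reference */
    }
    sb->orphan_dir[slot]->refcount++;       /* the caller's reference */
    return sb->orphan_dir[slot];
}

/* Drop the caller's reference (the iput in step 3). */
static void sketch_put(struct sketch_inode *inode)
{
    assert(inode != NULL && inode->refcount > 0);
    inode->refcount--;
}
```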

    Signed-off-by: Tao Ma
    Signed-off-by: Joel Becker

    Tao Ma
     
  • The OCFS2 developers have already done all of the hard work to allow
    volumes larger than 16 TiB. But there is still a "sanity check" in
    fs/ocfs2/super.c that prevents the mounting of such volumes, even when
    the cluster size and journal options would allow it.

    This patch replaces that sanity check with a more sophisticated one to
    mount a huge volume provided that (a) it is addressable by the raw
    word/address size of the system (borrowing a test from ext4); (b) the
    volume is using JBD2; and (c) the JBD2_FEATURE_INCOMPAT_64BIT flag is
    set on the journal.
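
    The three conditions can be written down as a small predicate. This is
    a userspace sketch under stated assumptions, not the function added by
    the patch: the name sketch_huge_volume_ok and all of its parameters are
    hypothetical, and the addressability test (the page index of the last
    block must fit in a machine word) is one plausible reading of
    condition (a).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* total_blocks: volume size in blocks; block_bits/page_bits: log2 of the
 * block and page sizes; word_bits: width of the machine word in bits. */
static bool sketch_huge_volume_ok(uint64_t total_blocks,
                                  unsigned int block_bits,
                                  unsigned int page_bits,
                                  unsigned int word_bits,
                                  bool uses_jbd2,
                                  bool journal_64bit)
{
    assert(word_bits > 0);
    if (total_blocks == 0 || block_bits > page_bits)
        return false;

    uint64_t last_block = total_blocks - 1;
    uint64_t max_index = word_bits >= 64
        ? UINT64_MAX
        : (UINT64_C(1) << word_bits) - 1;

    /* (a) the page index of the last block must fit in a machine word */
    if ((last_block >> (page_bits - block_bits)) > max_index)
        return false;

    /* Volumes whose block numbers fit in 32 bits need nothing more. */
    if (last_block <= UINT32_MAX)
        return true;

    /* (b) JBD2 in use and (c) 64-bit block support set on the journal */
    return uses_jbd2 && journal_64bit;
}
```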

    I factored out the sanity check into its own function. I also moved it
    from ocfs2_initialize_super() down to ocfs2_check_volume(); any earlier,
    and the journal will not have been initialized yet.

    This patch is one of a pair, and it depends on the other ("JBD2: Allow
    feature checks before journal recovery").

    I have tested this patch on small volumes, huge volumes, and huge
    volumes without 64-bit block support in the journal. All of them appear
    to work or to fail gracefully, as appropriate.

    Signed-off-by: Patrick LoPresti
    Signed-off-by: Joel Becker

    Patrick J. LoPresti
     

11 Aug, 2010

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (96 commits)
    no need for list_for_each_entry_safe()/resetting with superblock list
    Fix sget() race with failing mount
    vfs: don't hold s_umount over close_bdev_exclusive() call
    sysv: do not mark superblock dirty on remount
    sysv: do not mark superblock dirty on mount
    btrfs: remove junk sb_dirt change
    BFS: clean up the superblock usage
    AFFS: wait for sb synchronization when needed
    AFFS: clean up dirty flag usage
    cifs: truncate fallout
    mbcache: fix shrinker function return value
    mbcache: Remove unused features
    add f_flags to struct statfs(64)
    pass a struct path to vfs_statfs
    update VFS documentation for method changes.
    All filesystems that need invalidate_inode_buffers() are doing that explicitly
    convert remaining ->clear_inode() to ->evict_inode()
    Make ->drop_inode() just return whether inode needs to be dropped
    fs/inode.c:clear_inode() is gone
    fs/inode.c:evict() doesn't care about delete vs. non-delete paths now
    ...

    Fix up trivial conflicts in fs/nilfs2/super.c

    Linus Torvalds
     

10 Aug, 2010

1 commit


17 Jun, 2010

2 commits


24 May, 2010

4 commits


22 May, 2010

1 commit


19 May, 2010

1 commit


11 May, 2010

1 commit

  • ocfs2 sometimes needs to block signals around dlm operations, but it
    currently does it with sigprocmask(). Even worse, it's checking the
    error code of sigprocmask(). The in-kernel sigprocmask() can only error
    if you get the SIG_* argument wrong. We don't.

    Wrap the sigprocmask() calls with ocfs2_[un]block_signals(). These
    functions are void, but they will BUG() if somehow sigprocmask() returns
    an error.
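
    A userspace analogue of such wrappers, using the POSIX sigprocmask()
    rather than the in-kernel one, might look like the sketch below. The
    sketch_* names are hypothetical, abort() stands in for BUG(), and
    blocking every signal is an assumption made for the illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdlib.h>

/* Block all blockable signals and save the previous mask in *oldset.
 * sigprocmask() can only fail on a bad "how" argument, so a failure
 * here is a programming error and the process is aborted. */
static void sketch_block_signals(sigset_t *oldset)
{
    sigset_t blocked;

    assert(oldset != NULL);
    sigfillset(&blocked);
    if (sigprocmask(SIG_BLOCK, &blocked, oldset) != 0)
        abort();
}

/* Restore the mask saved by sketch_block_signals(). */
static void sketch_unblock_signals(const sigset_t *oldset)
{
    assert(oldset != NULL);
    if (sigprocmask(SIG_SETMASK, oldset, NULL) != 0)
        abort();
}
```

    Callers pair the two functions around the critical operation and never
    look at a return value, which is the point of making them void.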

    Signed-off-by: Joel Becker

    Joel Becker