21 Dec, 2012

1 commit

  • Pull Ceph update from Sage Weil:
    "There are a few different groups of commits here. The largest is
    Alex's ongoing work to enable the coming RBD features (cloning,
    striping). There is some cleanup in libceph that goes along with it.

    Cyril and David have fixed some problems with NFS reexport (leaking
    dentries and page locks), and there is a batch of patches from Yan
    fixing problems with the fs client when running against a clustered
    MDS. There are a few bug fixes mixed in for good measure, many of
    which will be going to the stable trees once they're upstream.

    My apologies for the late pull. There is still a gremlin in the rbd
    map/unmap code and I was hoping to include the fix for that as well,
    but we haven't been able to confirm the fix is correct yet; I'll send
    that in a separate pull once it's nailed down."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (68 commits)
    rbd: get rid of rbd_{get,put}_dev()
    libceph: register request before unregister linger
    libceph: don't use rb_init_node() in ceph_osdc_alloc_request()
    libceph: init event->node in ceph_osdc_create_event()
    libceph: init osd->o_node in create_osd()
    libceph: report connection fault with warning
    libceph: socket can close in any connection state
    rbd: don't use ENOTSUPP
    rbd: remove linger unconditionally
    rbd: get rid of RBD_MAX_SEG_NAME_LEN
    libceph: avoid using freed osd in __kick_osd_requests()
    ceph: don't reference req after put
    rbd: do not allow remove of mounted-on image
    libceph: Unlock unprocessed pages in start_read() error path
    ceph: call handle_cap_grant() for cap import message
    ceph: Fix __ceph_do_pending_vmtruncate
    ceph: Don't add dirty inode to dirty list if caps is in migration
    ceph: Fix infinite loop in __wake_requests
    ceph: Don't update i_max_size when handling non-auth cap
    bdi_register: add __printf verification, fix arg mismatch
    ...

    Linus Torvalds
     

19 Dec, 2012

1 commit


18 Dec, 2012

1 commit


13 Dec, 2012

9 commits


06 Nov, 2012

1 commit

  • ceph_aio_write() has an optimization that marks cap CEPH_CAP_FILE_WR
    dirty before data is copied to page cache and inode size is updated.
    If ceph_check_caps() flushes the dirty cap before the inode size is
    updated, the MDS can miss the new inode size. The fix is to move
    ceph_{get,put}_cap_refs() into ceph_write_{begin,end}() and call
    __ceph_mark_dirty_caps() after the inode size is updated.

    Signed-off-by: Yan, Zheng
    Signed-off-by: Sage Weil

    Sage Weil
     

04 Nov, 2012

1 commit


29 Oct, 2012

2 commits


27 Oct, 2012

2 commits


10 Oct, 2012

1 commit

  • Fuzzing with trinity oopsed on the 1st instruction of shmem_fh_to_dentry(),
    u64 inum = fid->raw[2];
    which is unhelpfully reported as at the end of shmem_alloc_inode():

    BUG: unable to handle kernel paging request at ffff880061cd3000
    IP: [] shmem_alloc_inode+0x40/0x40
    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Call Trace:
    [] ? exportfs_decode_fh+0x79/0x2d0
    [] do_handle_open+0x163/0x2c0
    [] sys_open_by_handle_at+0xc/0x10
    [] tracesys+0xe1/0xe6

    Right, tmpfs is being stupid to access fid->raw[2] before validating that
    fh_len includes it: the buffer kmalloc'ed by do_sys_name_to_handle() may
    fall at the end of a page, and the next page not be present.

    But some other filesystems (ceph, gfs2, isofs, reiserfs, xfs) are being
    careless about fh_len too, in fh_to_dentry() and/or fh_to_parent(), and
    could oops in the same way: add the missing fh_len checks to those.

    Reported-by: Sasha Levin
    Signed-off-by: Hugh Dickins
    Cc: Al Viro
    Cc: Sage Weil
    Cc: Steven Whitehouse
    Cc: Christoph Hellwig
    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Hugh Dickins
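
The missing check described above can be sketched in user space as follows. This is only an illustration under stated assumptions: `struct demo_fid` and `demo_decode_inum()` are hypothetical stand-ins, not the kernel's `struct fid` or the actual tmpfs code, and `fh_len` is taken to count the 32-bit words the caller supplied.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a file handle made of 32-bit words. */
struct demo_fid {
    uint32_t raw[3];
};

/*
 * Decode a 64-bit inode number from raw[1] (low half) and raw[2] (high half).
 * fh_len is the number of 32-bit words actually present in the handle.
 * Returns 0 on success, -1 if the handle is too short to contain raw[2].
 */
static int demo_decode_inum(const struct demo_fid *fid, int fh_len,
                            uint64_t *inum)
{
    assert(fid != NULL && inum != NULL);

    /* Validate the length before touching raw[2]. */
    if (fh_len < 3)
        return -1;

    *inum = ((uint64_t)fid->raw[2] << 32) | fid->raw[1];
    return 0;
}
```

The point of the ordering is that a short handle is rejected before any word beyond its declared length is read.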
     

09 Oct, 2012

1 commit

  • Move actual pte filling for non-linear file mappings into the new special
    vma operation: ->remap_pages().

    Filesystems must implement this method to get non-linear mapping support;
    if a filesystem uses filemap_fault(), then generic_file_remap_pages() can be used.

    Now device drivers can implement this method and obtain nonlinear vma support.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf #arch/tile
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

08 Oct, 2012

1 commit

  • Pull ceph updates from Sage Weil:
    "The bulk of this pull is a series from Alex that refactors and cleans
    up the RBD code to lay the groundwork for supporting the new image
    format and evolving feature set. There are also some cleanups in
    libceph, and for ceph there's fixed validation of file striping
    layouts and a bugfix in the code handling a shrinking MDS cluster."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (71 commits)
    ceph: avoid 32-bit page index overflow
    ceph: return EIO on invalid layout on GET_DATALOC ioctl
    rbd: BUG on invalid layout
    ceph: propagate layout error on osd request creation
    libceph: check for invalid mapping
    ceph: convert to use le32_add_cpu()
    ceph: Fix oops when handling mdsmap that decreases max_mds
    rbd: update remaining header fields for v2
    rbd: get snapshot name for a v2 image
    rbd: get the snapshot context for a v2 image
    rbd: get image features for a v2 image
    rbd: get the object prefix for a v2 rbd image
    rbd: add code to get the size of a v2 rbd image
    rbd: lay out header probe infrastructure
    rbd: encapsulate code that gets snapshot info
    rbd: add an rbd features field
    rbd: don't use index in __rbd_add_snap_dev()
    rbd: kill create_snap sysfs entry
    rbd: define rbd_dev_image_id()
    rbd: define some new format constants
    ...

    Linus Torvalds
     

03 Oct, 2012

3 commits

  • A pgoff_t is defined (by default) to have type (unsigned long). On
    architectures such as i686 that's a 32-bit type. The ceph address
    space code was attempting to produce 64 bit offsets by shifting a
    page's index by PAGE_CACHE_SHIFT, but the result was not what was
    desired because the shift occurred before the result got promoted
    to 64 bits.

    Fix this by converting all uses of page->index used in this way to
    use the page_offset() macro, which ensures the 64-bit result has the
    intended value.

    This fixes http://tracker.newdream.net/issues/3112

    Reported-by: Mohamed Pakkeer
    Signed-off-by: Alex Elder
    Reviewed-by: Sage Weil

    Alex Elder
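
The promotion problem described above can be reproduced in user space. The sketch below assumes a 32-bit index type and a page shift of 12; `DEMO_PAGE_SHIFT` and both function names are hypothetical and only mimic the pattern, they are not the ceph code.

```c
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12   /* assumed 4 KiB pages */

/* Mimics the bug: the shift happens in 32 bits, then the result is widened. */
static uint64_t demo_offset_truncated(uint32_t index)
{
    uint32_t shifted = index << DEMO_PAGE_SHIFT;  /* high bits are lost here */
    return shifted;
}

/* Mimics the fix: widen to 64 bits first, then shift. */
static uint64_t demo_offset_widened(uint32_t index)
{
    return (uint64_t)index << DEMO_PAGE_SHIFT;
}
```

For any index at or beyond 2^20 (a byte offset of 4 GiB with this shift), the first function silently wraps while the second preserves the full offset.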
     
  • If the user calls GET_DATALOC on a file with an invalid (e.g.,
    zeroed) layout, return EIO to userland.

    Signed-off-by: Sage Weil
    Reviewed-by: Alex Elder

    Sage Weil
     
  • There's no reason to call rcu_barrier() on every
    deactivate_locked_super(). We only need to make sure that all delayed rcu
    free inodes are flushed before we destroy related cache.

    Removing rcu_barrier() from deactivate_locked_super() affects some fast
    paths. E.g. on my machine exit_group() of the last process in an IPC
    namespace takes 0.07538s, and rcu_barrier() accounts for 0.05188s of that time.

    Signed-off-by: Kirill A. Shutemov
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Kirill A. Shutemov
     

02 Oct, 2012

4 commits

  • If we are creating an osd request and get an invalid layout, return
    an EINVAL to the caller. We switch up the return to have an error
    code instead of NULL implying -ENOMEM.

    Signed-off-by: Sage Weil
    Reviewed-by: Alex Elder

    Sage Weil
     
  • Convert cpu_to_le32(le32_to_cpu(E1) + E2) to use le32_add_cpu().

    dpatch engine is used to auto generate this patch.
    (https://github.com/weiyj/dpatch)

    Signed-off-by: Wei Yongjun
    Signed-off-by: Sage Weil

    Wei Yongjun
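
As a rough user-space illustration of the conversion above, the sketch below models a little-endian 32-bit value as four bytes so that it behaves the same on any host. The `demo_` types and helpers are hypothetical; they only imitate the shape of the kernel's `le32_to_cpu()`, `cpu_to_le32()` and `le32_add_cpu()`.

```c
#include <stdint.h>

/* A 32-bit value stored in little-endian byte order. */
typedef struct { uint8_t b[4]; } demo_le32;

static uint32_t demo_le32_to_cpu(demo_le32 v)
{
    return (uint32_t)v.b[0] | ((uint32_t)v.b[1] << 8) |
           ((uint32_t)v.b[2] << 16) | ((uint32_t)v.b[3] << 24);
}

static demo_le32 demo_cpu_to_le32(uint32_t x)
{
    demo_le32 v = { { (uint8_t)x, (uint8_t)(x >> 8),
                      (uint8_t)(x >> 16), (uint8_t)(x >> 24) } };
    return v;
}

/* One call replaces the open-coded convert, add, convert-back sequence. */
static void demo_le32_add_cpu(demo_le32 *var, uint32_t val)
{
    *var = demo_cpu_to_le32(demo_le32_to_cpu(*var) + val);
}
```

The helper does the same work as the open-coded expression; the gain is readability and one less place to mix up the conversion direction.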
     
  • When i >= newmap->m_max_mds, ceph_mdsmap_get_addr(newmap, i) returns
    NULL. Passing NULL to memcmp() triggers an oops.

    Signed-off-by: Yan, Zheng
    Signed-off-by: Sage Weil

    Yan, Zheng
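
A minimal sketch of the guard implied above, assuming a lookup that returns NULL for an out-of-range index. `struct demo_map`, `demo_get_addr()` and `demo_addr_changed()` are hypothetical names for illustration and do not reproduce the actual mdsmap handling.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct demo_addr {
    unsigned char bytes[8];
};

struct demo_map {
    int max;                        /* number of valid entries */
    const struct demo_addr *addrs;
};

/* Returns NULL when i is outside the map, like the lookup described above. */
static const struct demo_addr *demo_get_addr(const struct demo_map *m, int i)
{
    assert(m != NULL);
    if (i < 0 || i >= m->max)
        return NULL;
    return &m->addrs[i];
}

/* Returns 1 if entry i differs between the maps, 0 if it is unchanged. */
static int demo_addr_changed(const struct demo_map *oldmap,
                             const struct demo_map *newmap, int i)
{
    const struct demo_addr *a = demo_get_addr(oldmap, i);
    const struct demo_addr *b = demo_get_addr(newmap, i);

    /* Never hand NULL to memcmp(): a missing entry counts as a change. */
    if (a == NULL || b == NULL)
        return a != b;
    return memcmp(a, b, sizeof(*a)) != 0;
}
```

When the new map is smaller than the old one, the out-of-range entry is reported as changed without ever being dereferenced.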
     
  • A recent change to /sbin/mountall causes any trailing '/' character
    in the "device" (or fs_spec) field in /etc/fstab to be stripped. As
    a result, an entry for a ceph mount that intends to mount the root
    of the name space ends up with no path portion, and the ceph mount
    option processing code rejects this.

    That is, an entry in /etc/fstab like:
    cephserver:port:/ /mnt ceph defaults 0 0
    provides to the ceph code just "cephserver:port:" as the "device,"
    and that gets rejected.

    Although this is a bug in /sbin/mountall, we can have the ceph mount
    code support an empty/nonexistent path, interpreting it to mean the
    root of the name space.

    RFC 5952 offers recommendations for how to express IPv6 addresses,
    and recommends the usage found in RFC 3986 (which specifies the
    format for URIs) for representing both IPv4 and IPv6 addresses that
    include port numbers. (See in particular the definition of
    "authority" found in the Appendix of RFC 3986.)

    According to those standards, no host specification will ever
    contain a '/' character. As a result, it is sufficient to scan a
    provided "device" from an /etc/fstab entry for the first '/'
    character, and if it's found, treat that as the beginning of the
    path. If no '/' character is present, we can treat the entire
    string as the monitor host specification(s), and assume the path
    to be the root of the name space. We'll still require a ':' to
    separate the host portion from the (possibly empty) path portion.

    This means that we can more formally define how ceph will interpret
    the "device" it's provided when processing a mount request:

    "device" will look like:
    <server_spec>[,<server_spec>...]:[<path>]
    where
    <server_spec> is <ip>[:<port>]
    <path> is optional, but if present must begin with '/'

    This addresses http://tracker.newdream.net/issues/2919

    Signed-off-by: Alex Elder
    Reviewed-by: Dan Mick

    Alex Elder
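
The scanning rule described above (find the first '/', require a ':' just before the path, treat a missing path as the root) can be sketched in user space as follows. `demo_split_device()` is a hypothetical helper written for illustration, not the function in the ceph mount code, and it does not validate the host specification itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Split a mount "device" string into its host part and its path.
 * On success returns 0, stores the length of the host specification
 * (without the ':' separator) in *host_len, and points *path at the
 * path, or at "/" when no path was given.
 * Returns -1 when the ':' separator or the host part is missing.
 */
static int demo_split_device(const char *dev, size_t *host_len,
                             const char **path)
{
    const char *slash, *end;

    assert(dev != NULL && host_len != NULL && path != NULL);

    /* A host specification never contains '/', so the first one starts the path. */
    slash = strchr(dev, '/');
    end = slash ? slash : dev + strlen(dev);

    /* Require a non-empty host followed by the ':' separator. */
    if (end - dev < 2 || end[-1] != ':')
        return -1;

    *host_len = (size_t)(end - dev) - 1;
    *path = slash ? slash : "/";
    return 0;
}
```

With this rule, "cephserver:6789:/" and "cephserver:6789:" both yield the same host part and the root path, which is the behavior the change is after.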
     

27 Sep, 2012

1 commit


22 Aug, 2012

2 commits


21 Aug, 2012

1 commit

  • The debugfs directory includes the cluster fsid and our unique global_id.
    We need to delay the initialization of the debug entry until we have
    learned both the fsid and our global_id from the monitor or else the
    second client can't create its debugfs entry and will fail (and multiple
    client instances aren't properly reflected in debugfs).

    Reported-by: Yan, Zheng
    Signed-off-by: Sage Weil
    Reviewed-by: Yehuda Sadeh

    Sage Weil
     

03 Aug, 2012

1 commit

  • The initial ->atomic_open op was carried over from the old intent code,
    which was incomplete and didn't really work. Replace it with a fresh
    method. In particular:

    * always attempt to do an atomic open+lookup, both for the create case
    and for lookups of existing files.
    * fix symlink handling by returning 1 to the VFS so that we can follow
    the link to its destination. This fixes a longstanding ceph bug (#2392).

    Signed-off-by: Sage Weil

    Sage Weil
     

02 Aug, 2012

1 commit

  • Pull second vfs pile from Al Viro:
    "The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the
    deadlock reproduced by xfstests 068), symlink and hardlink restriction
    patches, plus assorted cleanups and fixes.

    Note that another fsfreeze deadlock (emergency thaw one) is *not*
    dealt with - the series by Fernando conflicts a lot with Jan's, breaks
    userland ABI (FIFREEZE semantics gets changed) and trades the deadlock
    for massive vfsmount leak; this is going to be handled next cycle.
    There probably will be another pull request, but that stuff won't be
    in it."

    Fix up trivial conflicts due to unrelated changes next to each other in
    drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c}

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
    delousing target_core_file a bit
    Documentation: Correct s_umount state for freeze_fs/unfreeze_fs
    fs: Remove old freezing mechanism
    ext2: Implement freezing
    btrfs: Convert to new freezing mechanism
    nilfs2: Convert to new freezing mechanism
    ntfs: Convert to new freezing mechanism
    fuse: Convert to new freezing mechanism
    gfs2: Convert to new freezing mechanism
    ocfs2: Convert to new freezing mechanism
    xfs: Convert to new freezing code
    ext4: Convert to new freezing mechanism
    fs: Protect write paths by sb_start_write - sb_end_write
    fs: Skip atime update on frozen filesystem
    fs: Add freezing handling to mnt_want_write() / mnt_drop_write()
    fs: Improve filesystem freezing handling
    switch the protection of percpu_counter list to spinlock
    nfsd: Push mnt_want_write() outside of i_mutex
    btrfs: Push mnt_want_write() outside of i_mutex
    fat: Push mnt_want_write() outside of i_mutex
    ...

    Linus Torvalds
     

01 Aug, 2012

1 commit

  • Pull Ceph changes from Sage Weil:
    "Lots of stuff this time around:

    - lots of cleanup and refactoring in the libceph messenger code, and
    many hard to hit races and bugs closed as a result.
    - lots of cleanup and refactoring in the rbd code from Alex Elder,
    mostly in preparation for the layering functionality that will be
    coming in 3.7.
    - some misc rbd cleanups from Josh Durgin that are finally going
    upstream
    - support for CRUSH tunables (used by newer clusters to improve the
    data placement)
    - some cleanup in our use of d_parent that Al brought up a while back
    - a random collection of fixes across the tree

    There is another patch coming that fixes up our ->atomic_open()
    behavior, but I'm going to hammer on it a bit more before sending it."

    Fix up conflicts due to commits that were already committed earlier in
    drivers/block/rbd.c, net/ceph/{messenger.c, osd_client.c}

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (132 commits)
    rbd: create rbd_refresh_helper()
    rbd: return obj version in __rbd_refresh_header()
    rbd: fixes in rbd_header_from_disk()
    rbd: always pass ops array to rbd_req_sync_op()
    rbd: pass null version pointer in add_snap()
    rbd: make rbd_create_rw_ops() return a pointer
    rbd: have __rbd_add_snap_dev() return a pointer
    libceph: recheck con state after allocating incoming message
    libceph: change ceph_con_in_msg_alloc convention to be less weird
    libceph: avoid dropping con mutex before fault
    libceph: verify state after retaking con lock after dispatch
    libceph: revoke mon_client messages on session restart
    libceph: fix handling of immediate socket connect failure
    ceph: update MAINTAINERS file
    libceph: be less chatty about stray replies
    libceph: clear all flags on con_close
    libceph: clean up con flags
    libceph: replace connection state bits with states
    libceph: drop unnecessary CLOSED check in socket state change callback
    libceph: close socket directly from ceph_con_close()
    ...

    Linus Torvalds
     

31 Jul, 2012

5 commits

  • There are two structures in which a count of snapshots is
    maintained:

    struct ceph_snap_context {
    ...
    u32 num_snaps;
    ...
    }
    and
    struct ceph_snap_realm {
    ...
    u32 num_prior_parent_snaps; /* had prior to parent_since */
    ...
    u32 num_snaps;
    ...
    }

    These fields never take on negative values (e.g., to hold special
    meaning), and so are really inherently unsigned. Furthermore they
    take their value from over-the-wire or on-disk formatted 32-bit
    values.

    So change their definition to have type u32, and change some spots
    elsewhere in the code to account for this change.

    Signed-off-by: Alex Elder
    Reviewed-by: Josh Durgin

    Alex Elder
     
  • We re-run the loop but we don't re-set the attrs pointer back to NULL.

    Signed-off-by: Alan Cox
    Reviewed-by: Alex Elder

    Alan Cox
     
  • When we detect a mds session reset, close the old ceph_connection before
    reopening it. This ensures we clean up the old socket properly and keep
    the ceph_connection state correct.

    Signed-off-by: Sage Weil
    Reviewed-by: Alex Elder
    Reviewed-by: Yehuda Sadeh

    Sage Weil
     
  • This is simply cleanup that will keep things more closely synced with the
    userland code.

    Signed-off-by: Sage Weil
    Reviewed-by: Alex Elder
    Reviewed-by: Yehuda Sadeh

    Sage Weil
     
  • CC: Sage Weil
    CC: ceph-devel@vger.kernel.org
    Acked-by: Sage Weil
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara