10 Oct, 2012

1 commit

  • Fuzzing with trinity oopsed on the 1st instruction of shmem_fh_to_dentry(),
    u64 inum = fid->raw[2];
    which is unhelpfully reported as at the end of shmem_alloc_inode():

    BUG: unable to handle kernel paging request at ffff880061cd3000
    IP: [] shmem_alloc_inode+0x40/0x40
    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Call Trace:
    [] ? exportfs_decode_fh+0x79/0x2d0
    [] do_handle_open+0x163/0x2c0
    [] sys_open_by_handle_at+0xc/0x10
    [] tracesys+0xe1/0xe6

    Right, tmpfs is being stupid to access fid->raw[2] before validating that
    fh_len includes it: the buffer kmalloc'ed by do_sys_name_to_handle() may
    fall at the end of a page, and the next page not be present.

    But some other filesystems (ceph, gfs2, isofs, reiserfs, xfs) are being
    careless about fh_len too, in fh_to_dentry() and/or fh_to_parent(), and
    could oops in the same way: add the missing fh_len checks to those.

    Reported-by: Sasha Levin
    Signed-off-by: Hugh Dickins
    Cc: Al Viro
    Cc: Sage Weil
    Cc: Steven Whitehouse
    Cc: Christoph Hellwig
    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Hugh Dickins
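
The ordering problem described in the commit above can be sketched in user-space C. This is only an illustration, not the kernel patch itself: `decode_inum()` is a hypothetical helper, and the handle is modelled as a plain array of 32-bit words rather than the kernel's `struct fid`. The point is that the length check comes before any read of `raw[2]`.

```c
#include <stdint.h>

/*
 * Hypothetical helper: decode a 64-bit inode number from a file handle
 * that is fh_len 32-bit words long.
 * Returns 0 and stores the value in *inum on success, or -1 if the
 * handle is too short to contain raw[1] and raw[2].
 */
int decode_inum(const uint32_t *raw, int fh_len, uint64_t *inum)
{
    /* Validate the length first; only then is raw[2] safe to read. */
    if (fh_len < 3)
        return -1;

    /* Assumed layout for this sketch: raw[2] is the high word, raw[1] the low word. */
    *inum = ((uint64_t)raw[2] << 32) | raw[1];
    return 0;
}
```

A caller that passes a two-word handle gets -1 back, and the buffer is never read past its end.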
     

09 Oct, 2012

1 commit

  • Move actual pte filling for non-linear file mappings into the new special
    vma operation: ->remap_pages().

    Filesystems must implement this method to get non-linear mapping support;
    a filesystem that uses filemap_fault() can use generic_file_remap_pages().

    Now device drivers can implement this method and obtain nonlinear vma support.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf #arch/tile
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

08 Oct, 2012

1 commit

  • Pull ceph updates from Sage Weil:
    "The bulk of this pull is a series from Alex that refactors and cleans
    up the RBD code to lay the groundwork for supporting the new image
    format and evolving feature set. There are also some cleanups in
    libceph, and for ceph there's fixed validation of file striping
    layouts and a bugfix in the code handling a shrinking MDS cluster."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (71 commits)
    ceph: avoid 32-bit page index overflow
    ceph: return EIO on invalid layout on GET_DATALOC ioctl
    rbd: BUG on invalid layout
    ceph: propagate layout error on osd request creation
    libceph: check for invalid mapping
    ceph: convert to use le32_add_cpu()
    ceph: Fix oops when handling mdsmap that decreases max_mds
    rbd: update remaining header fields for v2
    rbd: get snapshot name for a v2 image
    rbd: get the snapshot context for a v2 image
    rbd: get image features for a v2 image
    rbd: get the object prefix for a v2 rbd image
    rbd: add code to get the size of a v2 rbd image
    rbd: lay out header probe infrastructure
    rbd: encapsulate code that gets snapshot info
    rbd: add an rbd features field
    rbd: don't use index in __rbd_add_snap_dev()
    rbd: kill create_snap sysfs entry
    rbd: define rbd_dev_image_id()
    rbd: define some new format constants
    ...

    Linus Torvalds
     

03 Oct, 2012

3 commits

  • A pgoff_t is defined (by default) to have type (unsigned long). On
    architectures such as i686 that's a 32-bit type. The ceph address
    space code was attempting to produce 64 bit offsets by shifting a
    page's index by PAGE_CACHE_SHIFT, but the result was not what was
    desired because the shift occurred before the result got promoted
    to 64 bits.

    Fix this by converting all uses of page->index used in this way to
    use the page_offset() macro, which ensures the 64-bit result has the
    intended value.

    This fixes http://tracker.newdream.net/issues/3112

    Reported-by: Mohamed Pakkeer
    Signed-off-by: Alex Elder
    Reviewed-by: Sage Weil

    Alex Elder
     
  • If the user calls GET_DATALOC on a file with an invalid (e.g.,
    zeroed) layout, return EIO to userland.

    Signed-off-by: Sage Weil
    Reviewed-by: Alex Elder

    Sage Weil
     
  • There's no reason to call rcu_barrier() on every
    deactivate_locked_super(). We only need to make sure that all delayed
    RCU-freed inodes are flushed before we destroy the related cache.

    Removing rcu_barrier() from deactivate_locked_super() affects some fast
    paths. E.g. on my machine exit_group() of the last process in an IPC
    namespace takes 0.07538s. rcu_barrier() takes 0.05188s of that time.

    Signed-off-by: Kirill A. Shutemov
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Kirill A. Shutemov
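
The page index overflow described in the first commit of this group comes down to where the widening happens relative to the shift. A minimal sketch, assuming 4 KiB pages and using `uint32_t` to stand in for a 32-bit `pgoff_t`; the function names and `PAGE_SHIFT_SKETCH` are made up for the illustration:

```c
#include <stdint.h>

#define PAGE_SHIFT_SKETCH 12  /* assumption: 4 KiB pages */

/* Buggy form: the shift is evaluated in 32 bits, so the high bits are
 * already lost by the time the result is widened to 64 bits. */
uint64_t offset_truncated(uint32_t index)
{
    return index << PAGE_SHIFT_SKETCH;
}

/* Fixed form: widen to 64 bits first, then shift. */
uint64_t offset_widened(uint32_t index)
{
    return (uint64_t)index << PAGE_SHIFT_SKETCH;
}
```

For any page index at or beyond 4 GiB worth of pages (index 0x100000 and up with this shift), the two functions disagree, which is the case the commit fixes by going through page_offset().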
     

02 Oct, 2012

4 commits

  • If we are creating an osd request and get an invalid layout, return
    -EINVAL to the caller. The return value now carries an error code
    instead of NULL, which implied -ENOMEM.

    Signed-off-by: Sage Weil
    Reviewed-by: Alex Elder

    Sage Weil
     
  • Convert cpu_to_le32(le32_to_cpu(E1) + E2) to use le32_add_cpu().

    This patch was auto-generated with the dpatch engine.
    (https://github.com/weiyj/dpatch)

    Signed-off-by: Wei Yongjun
    Signed-off-by: Sage Weil

    Wei Yongjun
     
  • When i >= newmap->m_max_mds, ceph_mdsmap_get_addr(newmap, i) returns
    NULL. Passing NULL to memcmp() triggers an oops.

    Signed-off-by: Yan, Zheng
    Signed-off-by: Sage Weil

    Yan, Zheng
     
  • A recent change to /sbin/mountall causes any trailing '/' character
    in the "device" (or fs_spec) field in /etc/fstab to be stripped. As
    a result, an entry for a ceph mount that intends to mount the root
    of the name space ends up with no path portion, and the ceph mount
    option processing code rejects this.

    That is, an entry in /etc/fstab like:
    cephserver:port:/ /mnt ceph defaults 0 0
    provides to the ceph code just "cephserver:port:" as the "device,"
    and that gets rejected.

    Although this is a bug in /sbin/mountall, we can have the ceph mount
    code support an empty/nonexistent path, interpreting it to mean the
    root of the name space.

    RFC 5952 offers recommendations for how to express IPv6 addresses,
    and recommends the usage found in RFC 3986 (which specifies the
    format for URI's) for representing both IPv4 and IPv6 addresses that
    include port numbers. (See in particular the definition of
    "authority" found in the Appendix of RFC 3986.)

    According to those standards, no host specification will ever
    contain a '/' character. As a result, it is sufficient to scan a
    provided "device" from an /etc/fstab entry for the first '/'
    character, and if it's found, treat that as the beginning of the
    path. If no '/' character is present, we can treat the entire
    string as the monitor host specification(s), and assume the path
    to be the root of the name space. We'll still require a ':' to
    separate the host portion from the (possibly empty) path portion.

    This means that we can more formally define how ceph will interpret
    the "device" it's provided when processing a mount request:

    "device" will look like:
        <server_spec>[,<server_spec>...]:[<path>]
    where
        <server_spec> is <ip>[:<port>]
        <path> is optional, but if present must begin with '/'

    This addresses http://tracker.newdream.net/issues/2919

    Signed-off-by: Alex Elder
    Reviewed-by: Dan Mick

    Alex Elder
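
The parsing rule from the mount "device" commit above (scan for the first '/', otherwise treat the whole string as the host list, and still require a ':' separator) can be sketched in plain C. `split_device()` is a hypothetical name and this is not the code from the ceph tree, only an illustration of the rule as the commit message states it:

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: split a mount "device" string into its host list
 * and path portion.
 * On success returns 0, sets *hosts_len to the length of the host list
 * (not counting the ':' separator) and *path to the path portion, which
 * is either empty or begins with '/'.
 * Returns -1 if the host list is empty or the ':' separator is missing.
 */
int split_device(const char *dev, size_t *hosts_len, const char **path)
{
    /* No host specification contains '/', so the first one starts the path. */
    const char *slash = strchr(dev, '/');
    const char *end = slash ? slash : dev + strlen(dev);

    /* Need at least one host character followed by the ':' separator. */
    if (end - dev < 2 || end[-1] != ':')
        return -1;

    *hosts_len = (size_t)(end - dev) - 1;
    *path = end;  /* an empty path means the root of the name space */
    return 0;
}
```

With this rule, "cephserver:6789:/" and "cephserver:6789:" both yield the same host list; the second simply has an empty path.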
     

27 Sep, 2012

1 commit


22 Aug, 2012

2 commits


21 Aug, 2012

1 commit

  • The debugfs directory includes the cluster fsid and our unique global_id.
    We need to delay the initialization of the debug entry until we have
    learned both the fsid and our global_id from the monitor or else the
    second client can't create its debugfs entry and will fail (and multiple
    client instances aren't properly reflected in debugfs).

    Reported-by: Yan, Zheng
    Signed-off-by: Sage Weil
    Reviewed-by: Yehuda Sadeh

    Sage Weil
     

03 Aug, 2012

1 commit

  • The initial ->atomic_open op was carried over from the old intent code,
    which was incomplete and didn't really work. Replace it with a fresh
    method. In particular:

    * always attempt to do an atomic open+lookup, both for the create case
    and for lookups of existing files.
    * fix symlink handling by returning 1 to the VFS so that we can follow
    the link to its destination. This fixes a longstanding ceph bug (#2392).

    Signed-off-by: Sage Weil

    Sage Weil
     

02 Aug, 2012

1 commit

  • Pull second vfs pile from Al Viro:
    "The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the
    deadlock reproduced by xfstests 068), symlink and hardlink restriction
    patches, plus assorted cleanups and fixes.

    Note that another fsfreeze deadlock (emergency thaw one) is *not*
    dealt with - the series by Fernando conflicts a lot with Jan's, breaks
    userland ABI (FIFREEZE semantics gets changed) and trades the deadlock
    for massive vfsmount leak; this is going to be handled next cycle.
    There probably will be another pull request, but that stuff won't be
    in it."

    Fix up trivial conflicts due to unrelated changes next to each other in
    drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c}

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
    delousing target_core_file a bit
    Documentation: Correct s_umount state for freeze_fs/unfreeze_fs
    fs: Remove old freezing mechanism
    ext2: Implement freezing
    btrfs: Convert to new freezing mechanism
    nilfs2: Convert to new freezing mechanism
    ntfs: Convert to new freezing mechanism
    fuse: Convert to new freezing mechanism
    gfs2: Convert to new freezing mechanism
    ocfs2: Convert to new freezing mechanism
    xfs: Convert to new freezing code
    ext4: Convert to new freezing mechanism
    fs: Protect write paths by sb_start_write - sb_end_write
    fs: Skip atime update on frozen filesystem
    fs: Add freezing handling to mnt_want_write() / mnt_drop_write()
    fs: Improve filesystem freezing handling
    switch the protection of percpu_counter list to spinlock
    nfsd: Push mnt_want_write() outside of i_mutex
    btrfs: Push mnt_want_write() outside of i_mutex
    fat: Push mnt_want_write() outside of i_mutex
    ...

    Linus Torvalds
     

01 Aug, 2012

1 commit

  • Pull Ceph changes from Sage Weil:
    "Lots of stuff this time around:

    - lots of cleanup and refactoring in the libceph messenger code, and
    many hard to hit races and bugs closed as a result.
    - lots of cleanup and refactoring in the rbd code from Alex Elder,
    mostly in preparation for the layering functionality that will be
    coming in 3.7.
    - some misc rbd cleanups from Josh Durgin that are finally going
    upstream
    - support for CRUSH tunables (used by newer clusters to improve the
    data placement)
    - some cleanup in our use of d_parent that Al brought up a while back
    - a random collection of fixes across the tree

    There is another patch coming that fixes up our ->atomic_open()
    behavior, but I'm going to hammer on it a bit more before sending it."

    Fix up conflicts due to commits that were already committed earlier in
    drivers/block/rbd.c, net/ceph/{messenger.c, osd_client.c}

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (132 commits)
    rbd: create rbd_refresh_helper()
    rbd: return obj version in __rbd_refresh_header()
    rbd: fixes in rbd_header_from_disk()
    rbd: always pass ops array to rbd_req_sync_op()
    rbd: pass null version pointer in add_snap()
    rbd: make rbd_create_rw_ops() return a pointer
    rbd: have __rbd_add_snap_dev() return a pointer
    libceph: recheck con state after allocating incoming message
    libceph: change ceph_con_in_msg_alloc convention to be less weird
    libceph: avoid dropping con mutex before fault
    libceph: verify state after retaking con lock after dispatch
    libceph: revoke mon_client messages on session restart
    libceph: fix handling of immediate socket connect failure
    ceph: update MAINTAINERS file
    libceph: be less chatty about stray replies
    libceph: clear all flags on con_close
    libceph: clean up con flags
    libceph: replace connection state bits with states
    libceph: drop unnecessary CLOSED check in socket state change callback
    libceph: close socket directly from ceph_con_close()
    ...

    Linus Torvalds
     

31 Jul, 2012

6 commits


14 Jul, 2012

10 commits


06 Jul, 2012

1 commit


20 Jun, 2012

1 commit

  • I got lots of NULL pointer dereference oopses when compiling a kernel on
    ceph. The cause is that the kernel page migration routine replaces some
    pages in the page cache with new pages, and the private field of these
    new pages can be non-zero.

    Signed-off-by: Zheng Yan
    Signed-off-by: Sage Weil
    (cherry picked from commit 28c0254ede13ab575d2df5c6585ed3d4817c3e6b)

    Yan, Zheng
     

16 Jun, 2012

1 commit


06 Jun, 2012

1 commit

  • Move the initialization of a ceph connection's private pointer,
    operations vector pointer, and peer name information into
    ceph_con_init(). Rearrange the arguments so the connection pointer
    is first. Hide the byte-swapping of the peer entity number inside
    ceph_con_init()

    Signed-off-by: Alex Elder
    Reviewed-by: Sage Weil

    Alex Elder
     

02 Jun, 2012

1 commit

  • Pull vfs changes from Al Viro.
    "A lot of misc stuff. The obvious groups:
    * Miklos' atomic_open series; kills the damn abuse of
    ->d_revalidate() by NFS, which was the major stumbling block for
    all work in that area.
    * ripping security_file_mmap() and dealing with deadlocks in the
    area; sanitizing the neighborhood of vm_mmap()/vm_munmap() in
    general.
    * ->encode_fh() switched to saner API; insane fake dentry in
    mm/cleancache.c gone.
    * assorted annotations in fs (endianness, __user)
    * parts of Artem's ->s_dirty work (jffs2 and reiserfs parts)
    * ->update_time() work from Josef.
    * other bits and pieces all over the place.

    Normally it would've been in two or three pull requests, but
    signal.git stuff had eaten a lot of time during this cycle ;-/"

    Fix up trivial conflicts in Documentation/filesystems/vfs.txt (the
    'truncate_range' inode method was removed by the VM changes, the VFS
    update adds an 'update_time()' method), and in fs/btrfs/ulist.[ch] (due
    to sparse fix added twice, with other changes nearby).

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (95 commits)
    nfs: don't open in ->d_revalidate
    vfs: retry last component if opening stale dentry
    vfs: nameidata_to_filp(): don't throw away file on error
    vfs: nameidata_to_filp(): inline __dentry_open()
    vfs: do_dentry_open(): don't put filp
    vfs: split __dentry_open()
    vfs: do_last() common post lookup
    vfs: do_last(): add audit_inode before open
    vfs: do_last(): only return EISDIR for O_CREAT
    vfs: do_last(): check LOOKUP_DIRECTORY
    vfs: do_last(): make ENOENT exit RCU safe
    vfs: make follow_link check RCU safe
    vfs: do_last(): use inode variable
    vfs: do_last(): inline walk_component()
    vfs: do_last(): make exit RCU safe
    vfs: split do_lookup()
    Btrfs: move over to use ->update_time
    fs: introduce inode operation ->update_time
    reiserfs: get rid of resierfs_sync_super
    reiserfs: mark the superblock as dirty a bit later
    ...

    Linus Torvalds
     

01 Jun, 2012

2 commits

  • A ceph client has a pointer to a ceph messenger structure in it.
    There is always exactly one ceph messenger for a ceph client, so
    there is no need to allocate it separate from the ceph client
    structure.

    Switch the ceph_client structure to embed its ceph_messenger
    structure.

    Signed-off-by: Alex Elder
    Reviewed-by: Yehuda Sadeh
    Reviewed-by: Sage Weil

    Alex Elder
     
  • I got lots of NULL pointer dereference oopses when compiling a kernel on
    ceph. The cause is that the kernel page migration routine replaces some
    pages in the page cache with new pages, and the private field of these
    new pages can be non-zero.

    Signed-off-by: Zheng Yan
    Signed-off-by: Sage Weil

    Yan, Zheng
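
The embedding change in the first commit of this group can be pictured with a small sketch. The struct and function names here are invented for the illustration and the fields are placeholders, not the real ceph_client or ceph_messenger layout. Embedding removes the separate allocation, and it also allows the owner to be recovered from a pointer to the member (a side effect of the layout, not something the commit message claims):

```c
#include <stddef.h>

/* Placeholder for the messenger; the real structure has many more fields. */
struct messenger_sketch {
    int nonce;
};

/* The messenger is embedded rather than referenced through a pointer,
 * so one allocation covers both and their lifetimes are tied together. */
struct client_sketch {
    int id;
    struct messenger_sketch msgr;
};

/* Recover the owning client from a pointer to its embedded messenger. */
struct client_sketch *client_of(struct messenger_sketch *m)
{
    return (struct client_sketch *)((char *)m - offsetof(struct client_sketch, msgr));
}
```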