14 Aug, 2014

1 commit

  • Pull Ceph updates from Sage Weil:
    "There is a lot of refactoring and hardening of the libceph and rbd
    code here from Ilya that fix various smaller bugs, and a few more
    important fixes with clone overlap. The main fix is a critical change
    to the request_fn handling to not sleep that was exposed by the recent
    mutex changes (which will also go to the 3.16 stable series).

    Yan Zheng has several fixes in here for CephFS fixing ACL handling,
    time stamps, and request resends when the MDS restarts.

    Finally, there are a few cleanups from Himangi Saraogi based on
    Coccinelle"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (39 commits)
    libceph: set last_piece in ceph_msg_data_pages_cursor_init() correctly
    rbd: remove extra newlines from rbd_warn() messages
    rbd: allocate img_request with GFP_NOIO instead GFP_ATOMIC
    rbd: rework rbd_request_fn()
    ceph: fix kick_requests()
    ceph: fix append mode write
    ceph: fix sizeof(struct tYpO *) typo
    ceph: remove redundant memset(0)
    rbd: take snap_id into account when reading in parent info
    rbd: do not read in parent info before snap context
    rbd: update mapping size only on refresh
    rbd: harden rbd_dev_refresh() and callers a bit
    rbd: split rbd_dev_spec_update() into two functions
    rbd: remove unnecessary asserts in rbd_dev_image_probe()
    rbd: introduce rbd_dev_header_info()
    rbd: show the entire chain of parent images
    ceph: replace comma with a semicolon
    rbd: use rbd_segment_name_free() instead of kfree()
    ceph: check zero length in ceph_sync_read()
    ceph: reset r_resend_mds after receiving -ESTALE
    ...

    Linus Torvalds
     

09 Aug, 2014

1 commit

  • Determining ->last_piece based on the value of ->page_offset + length
    is incorrect because length here is the length of the entire message.
    ->last_piece is set to false even if the page array data item length is
    <= PAGE_SIZE.

    ... /dev/null
    rbd snap create foo@snap
    rbd snap protect foo@snap
    rbd clone foo@snap bar
    # rbd_resize calls librbd rbd_resize(), size is in bytes
    ./rbd_resize bar $(((4 << 20) + 512))
    rbd resize --size 10 bar
    BAR_DEV=$(rbd map bar)
    # trigger a 512-byte copyup -- 512-byte page array data item
    dd if=/dev/urandom of=$BAR_DEV bs=1M count=1 seek=5

    The problem exists only in ceph_msg_data_pages_cursor_init(),
    ceph_msg_data_pages_advance() does the right thing. The size_t cast is
    unnecessary.
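
    The difference between the two computations can be sketched in plain C.
    This is a userspace illustration only: struct pages_cursor,
    SKETCH_PAGE_SIZE and both init helpers are hypothetical stand-ins, not
    the libceph definitions, and the "fixed" variant shows one plausible way
    to derive ->last_piece from the data item's own length.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed page size for this sketch */

/* Hypothetical, cut-down cursor; not the kernel's cursor structure. */
struct pages_cursor {
    size_t resid;       /* bytes left in this data item */
    size_t page_offset; /* offset of the data within its first page */
    bool last_piece;    /* true if the current page is the final piece */
};

/* Buggy variant: decides last_piece from the length of the whole message. */
static void cursor_init_buggy(struct pages_cursor *c, size_t item_len,
                              size_t page_offset, size_t msg_length)
{
    c->resid = item_len < msg_length ? item_len : msg_length;
    c->page_offset = page_offset;
    c->last_piece = c->page_offset + msg_length <= SKETCH_PAGE_SIZE;
}

/* Fixed variant: decides last_piece from what is left of this data item. */
static void cursor_init_fixed(struct pages_cursor *c, size_t item_len,
                              size_t page_offset, size_t msg_length)
{
    c->resid = item_len < msg_length ? item_len : msg_length;
    c->page_offset = page_offset;
    c->last_piece = c->page_offset + c->resid <= SKETCH_PAGE_SIZE;
}
```

    With a 512-byte data item inside a longer message, the buggy variant
    leaves last_piece false although the item fits in one page; the fixed
    variant sets it to true.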

    Cc: stable@vger.kernel.org # 3.10+
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil
    Reviewed-by: Alex Elder

    Ilya Dryomov
     

23 Jul, 2014

2 commits


08 Jul, 2014

11 commits


13 Jun, 2014

2 commits

  • Pull Ceph updates from Sage Weil:
    "This has a mix of bug fixes and cleanups.

    Alex's patch fixes a rare race in RBD. Ilya's patches fix an ENOENT
    check when a second rbd image is mapped and a couple memory leaks.
    Zheng fixes several issues with fragmented directories and multiple
    MDSs. Josh fixes a spin/sleep issue, and Josh and Guangliang's
    patches fix setting and unsetting RBD images read-only.

    Naturally there are several other cleanups mixed in for good measure"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits)
    rbd: only set disk to read-only once
    rbd: move calls that may sleep out of spin lock range
    rbd: add ioctl for rbd
    ceph: use truncate_pagecache() instead of truncate_inode_pages()
    ceph: include time stamp in every MDS request
    rbd: fix ida/idr memory leak
    rbd: use reference counts for image requests
    rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync()
    rbd: make sure we have latest osdmap on 'rbd map'
    libceph: add ceph_monc_wait_osdmap()
    libceph: mon_get_version request infrastructure
    libceph: recognize poolop requests in debugfs
    ceph: refactor readpage_nounlock() to make the logic clearer
    mds: check cap ID when handling cap export message
    ceph: remember subtree root dirfrag's auth MDS
    ceph: introduce ceph_fill_fragtree()
    ceph: handle cap import atomically
    ceph: pre-allocate ceph_cap struct for ceph_add_cap()
    ceph: update inode fields according to issued caps
    rbd: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
    ...

    Linus Torvalds
     
  • Pull networking updates from David Miller:

    1) Seccomp BPF filters can now be JIT'd, from Alexei Starovoitov.

    2) Multiqueue support in xen-netback and xen-netfront, from Andrew J
    Bennieston.

    3) Allow tweaking of aggregation settings in cdc_ncm driver, from Bjørn
    Mork.

    4) BPF now has a "random" opcode, from Chema Gonzalez.

    5) Add more BPF documentation and improve test framework, from Daniel
    Borkmann.

    6) Support TCP fastopen over ipv6, from Daniel Lee.

    7) Add software TSO helper functions and use them to support software
    TSO in mvneta and mv643xx_eth drivers. From Ezequiel Garcia.

    8) Support software TSO in fec driver too, from Nimrod Andy.

    9) Add Broadcom SYSTEMPORT driver, from Florian Fainelli.

    10) Handle broadcasts more gracefully over macvlan when there are large
    numbers of interfaces configured, from Herbert Xu.

    11) Allow more control over fwmark used for non-socket based responses,
    from Lorenzo Colitti.

    12) Do TCP congestion window limiting based upon measurements, from Neal
    Cardwell.

    13) Support busy polling in SCTP, from Neil Horman.

    14) Allow RSS key to be configured via ethtool, from Venkata Duvvuru.

    15) Bridge promisc mode handling improvements from Vlad Yasevich.

    16) Don't use inetpeer entries to implement ID generation any more, it
    performs poorly, from Eric Dumazet.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1522 commits)
    rtnetlink: fix userspace API breakage for iproute2 < v3.9.0
    tcp: fixing TLP's FIN recovery
    net: fec: Add software TSO support
    net: fec: Add Scatter/gather support
    net: fec: Increase buffer descriptor entry number
    net: fec: Factorize feature setting
    net: fec: Enable IP header hardware checksum
    net: fec: Factorize the .xmit transmit function
    bridge: fix compile error when compiling without IPv6 support
    bridge: fix smatch warning / potential null pointer dereference
    via-rhine: fix full-duplex with autoneg disable
    bnx2x: Enlarge the dorq threshold for VFs
    bnx2x: Check for UNDI in uncommon branch
    bnx2x: Fix 1G-baseT link
    bnx2x: Fix link for KR with swapped polarity lane
    sctp: Fix sk_ack_backlog wrap-around problem
    net/core: Add VF link state control policy
    net/fsl: xgmac_mdio is dependent on OF_MDIO
    net/fsl: Make xgmac_mdio read error message useful
    net_sched: drr: warn when qdisc is not work conserving
    ...

    Linus Torvalds
     

12 Jun, 2014

2 commits


06 Jun, 2014

3 commits

  • Add ceph_monc_wait_osdmap(), which will block until the osdmap with the
    specified epoch is received or timeout occurs.

    Export both of these as they are going to be needed by rbd.
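
    The "block until the epoch arrives or the timeout expires" behaviour can
    be sketched as a loop over simulated time. Everything here is a
    hypothetical userspace stand-in: struct mon_state, tick_fn and
    wait_osdmap() are illustration names, a timeout of 0 is assumed to mean
    "no timeout", and the kernel sleeps on a wait queue rather than polling.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the monitor client state. */
struct mon_state {
    uint32_t have_osdmap; /* newest osdmap epoch received so far */
};

/* Hypothetical hook: one unit of time passes; a new map may arrive. */
typedef void (*tick_fn)(struct mon_state *st, unsigned long now);

/* Sample hook for experiments: every tick delivers the next epoch. */
static void tick_bump(struct mon_state *st, unsigned long now)
{
    (void)now;
    st->have_osdmap++;
}

/*
 * Return 0 once st->have_osdmap >= epoch, or -ETIMEDOUT if `timeout`
 * ticks pass first. timeout == 0 waits indefinitely (an assumption).
 */
static int wait_osdmap(struct mon_state *st, uint32_t epoch,
                       unsigned long timeout, tick_fn tick)
{
    unsigned long now = 0;

    while (st->have_osdmap < epoch) {
        if (timeout != 0 && now >= timeout)
            return -ETIMEDOUT;
        tick(st, now); /* stands in for sleeping until woken */
        now++;
    }
    return 0;
}
```

    A caller that needs a particular epoch before proceeding would check the
    return value and give up on -ETIMEDOUT.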

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     
  • Add support for mon_get_version requests to libceph. This reuses much
    of the ceph_mon_generic_request infrastructure, with one exception.
    Older OSDs don't set mon_get_version reply hdr->tid even if the
    original request had a non-zero tid, which makes it impossible to
    lookup ceph_mon_generic_request contexts by tid in get_generic_reply()
    for such replies. As a workaround, we allocate a reply message on the
    reply path. This can probably interfere with revoke, but I don't see
    a better way.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     
  • Recognize poolop requests in debugfs monc dump, fix printk format
    specifiers - tid is unsigned.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     

17 May, 2014

2 commits

  • Commit e2b149cc4ba0 ("crush: add chooseleaf_vary_r tunable") added the
    crush_map::chooseleaf_vary_r field but missed the decode part. This
    led to misdirected requests caused by incorrect raw crush mapping
    sets.
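
    A field that exists in the structure but is never decoded keeps its
    default, so client and server compute different mappings. The sketch
    below shows the shape of such a decode step. The two-field layout,
    struct tunables and decode_tunables() are hypothetical and much simpler
    than the real crush map encoding.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down tunables; the real crush_map has more fields. */
struct tunables {
    uint32_t choose_total_tries;
    uint8_t chooseleaf_vary_r;
};

/*
 * Decode tunables from a little-endian buffer. Returns 0 on success and
 * -1 if the mandatory part is truncated. The trailing field is optional
 * because older encoders do not emit it.
 */
static int decode_tunables(const uint8_t *p, size_t len, struct tunables *t)
{
    t->choose_total_tries = 0;
    t->chooseleaf_vary_r = 0; /* default when the encoder omits it */

    if (len < 4)
        return -1;
    t->choose_total_tries = (uint32_t)p[0] | (uint32_t)p[1] << 8 |
                            (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
    p += 4;
    len -= 4;

    /* The kind of step that was missed: without it the field stays 0. */
    if (len >= 1)
        t->chooseleaf_vary_r = p[0];

    return 0;
}
```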

    Fixes: http://tracker.ceph.com/issues/8226

    Reported-and-Tested-by: Dmitry Smirnov
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     
  • It has been reported that using ZFSonLinux on rbd will result in memory
    corruption. The bug report can be found here:

    https://github.com/zfsonlinux/spl/issues/241
    http://tracker.ceph.com/issues/7790

    The reason is that ZFS will send pages with page_count 0 into rbd, which in
    turn sends them to tcp_sendpage. However, tcp_sendpage cannot deal with
    page_count 0, as it will do get_page and put_page, and erroneously free the
    page.

    This type of issue has been noted before, and handled in iscsi, drbd,
    etc. So, rbd should also handle this. This fix addresses the issue by falling
    back to the slower sendmsg when page_count 0 is detected.
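
    The decision described above reduces to a check on the page's reference
    count before choosing the send path. A minimal sketch, assuming a
    hypothetical struct fake_page in place of the kernel's struct page and a
    hypothetical choose_send_path() helper:

```c
/* Hypothetical stand-in for struct page; only the count matters here. */
struct fake_page {
    int refcount;
};

enum send_path {
    SEND_VIA_SENDPAGE, /* zero-copy path, takes and drops a page reference */
    SEND_VIA_SENDMSG   /* slower path that copies the data */
};

/*
 * The zero-copy path does a get/put pair on the page. If the count starts
 * at 0, the put drops it back to 0 and the page is freed by mistake, so
 * such pages are routed to the copying path instead.
 */
static enum send_path choose_send_path(const struct fake_page *page)
{
    return page->refcount >= 1 ? SEND_VIA_SENDPAGE : SEND_VIA_SENDMSG;
}
```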

    Cc: Sage Weil
    Cc: Yehuda Sadeh
    Cc: stable@vger.kernel.org
    Signed-off-by: Chunwei Chen
    Reviewed-by: Ilya Dryomov

    Chunwei Chen
     

07 May, 2014

1 commit


06 May, 2014

1 commit

  • Pull Ceph fixes from Sage Weil:
    "First, there is a critical fix for the new primary-affinity function
    that went into -rc1.

    The second batch of patches from Zheng fix a range of problems with
    directory fragmentation, readdir, and a few odds and ends for cephfs"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
    ceph: reserve caps for file layout/lock MDS requests
    ceph: avoid releasing caps that are being used
    ceph: clear directory's completeness when creating file
    libceph: fix non-default values check in apply_primary_affinity()
    ceph: use fpos_cmp() to compare dentry positions
    ceph: check directory's completeness before emitting directory entry

    Linus Torvalds
     

29 Apr, 2014

1 commit

  • osd_primary_affinity array is indexed into incorrectly when checking
    for non-default primary-affinity values. This nullifies the impact of
    the rest of the apply_primary_affinity() and results in misdirected
    requests.

    if (osds[i] != CRUSH_ITEM_NONE &&
        osdmap->osd_primary_affinity[i] !=
                                    ^^^
            CEPH_OSD_DEFAULT_PRIMARY_AFFINITY) {

    For a pool with size 2, this always ends up checking osd0 and osd1
    primary_affinity values, instead of the values that correspond to the
    osds in question. E.g., given a [2,3] up set and a [max,max,0,max]
    primary affinity vector, requests are still sent to osd2, because both
    osd0 and osd1 happen to have max primary_affinity values and therefore
    we return from apply_primary_affinity() early on the premise that all
    osds in the given set have max (default) values. Fix it.
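
    The [2,3] example above can be reproduced in a few lines of C. This is
    a sketch, not the libceph code: ITEM_NONE and DEFAULT_AFFINITY are local
    stand-ins for CRUSH_ITEM_NONE and CEPH_OSD_DEFAULT_PRIMARY_AFFINITY
    (their values here are assumptions), both helper names are hypothetical,
    and every osd id is assumed to be a valid index into the affinity array.

```c
#include <stdbool.h>
#include <stdint.h>

#define ITEM_NONE        0x7fffffff /* stand-in for CRUSH_ITEM_NONE */
#define DEFAULT_AFFINITY 0x10000u   /* stand-in for the default (max) value */

/* Buggy check: indexes the affinity array by position in the set. */
static bool has_non_default_buggy(const int *osds, int len,
                                  const uint32_t *affinity)
{
    for (int i = 0; i < len; i++)
        if (osds[i] != ITEM_NONE && affinity[i] != DEFAULT_AFFINITY)
            return true;
    return false;
}

/* Fixed check: indexes the affinity array by the osd id itself. */
static bool has_non_default_fixed(const int *osds, int len,
                                  const uint32_t *affinity)
{
    for (int i = 0; i < len; i++)
        if (osds[i] != ITEM_NONE &&
            affinity[osds[i]] != DEFAULT_AFFINITY)
            return true;
    return false;
}
```

    For an up set of {2, 3} and an affinity vector of {max, max, 0, max},
    the buggy check looks at entries 0 and 1 and reports that everything is
    default, while the fixed check looks at entries 2 and 3 and notices the
    zero affinity of osd2.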

    Fixes: http://tracker.ceph.com/issues/7954

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     

12 Apr, 2014

1 commit

  • Several spots in the kernel perform a sequence like:

    skb_queue_tail(&sk->sk_receive_queue, skb);
    sk->sk_data_ready(sk, skb->len);

    But at the moment we place the SKB onto the socket receive queue it
    can be consumed and freed up. So this skb->len access is potentially
    an access to freed memory.

    Furthermore, the skb->len can be modified by the consumer so it is
    possible that the value isn't accurate.

    And finally, no actual implementation of this callback uses
    the length argument. And since nobody actually cared about its
    value, lots of call sites pass arbitrary values in such as '0' and
    even '1'.

    So just remove the length argument from the callback, that way there
    is no confusion whatsoever and all of these use-after-free cases get
    fixed as a side effect.
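
    The resulting pattern can be sketched in userspace C. The names below
    (struct buf, struct sock_sketch, queue_tail(), deliver(),
    count_wakeup()) are hypothetical and only mirror the shape of the
    kernel code: the notification callback takes no length, so the producer
    has no reason to touch the buffer after queueing it.

```c
#include <stddef.h>

/* Hypothetical buffer and socket types for illustration only. */
struct buf {
    size_t len;
    struct buf *next;
};

struct sock_sketch {
    struct buf *head, *tail;
    int wakeups;
    /* Callback without a length argument, as in the change above. */
    void (*data_ready)(struct sock_sketch *sk);
};

/* Append a buffer to the receive queue. */
static void queue_tail(struct sock_sketch *sk, struct buf *b)
{
    b->next = NULL;
    if (sk->tail)
        sk->tail->next = b;
    else
        sk->head = b;
    sk->tail = b;
}

/* Sample callback: just records that the consumer was notified. */
static void count_wakeup(struct sock_sketch *sk)
{
    sk->wakeups++;
}

/* Producer side: queue, then notify without dereferencing b again. */
static void deliver(struct sock_sketch *sk, struct buf *b)
{
    queue_tail(sk, b);
    /* From here on b may belong to the consumer; b->len is off limits. */
    sk->data_ready(sk);
}
```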

    Based upon a patch by Eric Dumazet and his suggestion to audit this
    issue tree-wide.

    Signed-off-by: David S. Miller

    David S. Miller
     

08 Apr, 2014

1 commit

  • Pull Ceph updates from Sage Weil:
    "The biggest chunk is a series of patches from Ilya that add support
    for new Ceph osd and crush map features, including some new tunables,
    primary affinity, and the new encoding that is needed for erasure
    coding support. This brings things into parity with the server side
    and the looming firefly release. There is also support for allocation
    hints in RBD that help limit fragmentation on the server side.

    There is also a series of patches from Zheng fixing NFS reexport,
    directory fragmentation support, flock vs fcntl behavior, and some
    issues with clustered MDS.

    Finally, there are some miscellaneous fixes from Yunchuan Wen for
    fscache, Fabian Frederick for ACLs, and from me for fsync(dirfd)
    behavior"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (79 commits)
    ceph: skip invalid dentry during dcache readdir
    libceph: dump pool {read,write}_tier to debugfs
    libceph: output primary affinity values on osdmap updates
    ceph: flush cap release queue when trimming session caps
    ceph: don't grabs open file reference for aborted request
    ceph: drop extra open file reference in ceph_atomic_open()
    ceph: preallocate buffer for readdir reply
    libceph: enable PRIMARY_AFFINITY feature bit
    libceph: redo ceph_calc_pg_primary() in terms of ceph_calc_pg_acting()
    libceph: add support for osd primary affinity
    libceph: add support for primary_temp mappings
    libceph: return primary from ceph_calc_pg_acting()
    libceph: switch ceph_calc_pg_acting() to new helpers
    libceph: introduce apply_temps() helper
    libceph: introduce pg_to_raw_osds() and raw_to_up_osds() helpers
    libceph: ceph_can_shift_osds(pool) and pool type defines
    libceph: ceph_osd_{exists,is_up,is_down}(osd) definitions
    libceph: enable OSDMAP_ENC feature bit
    libceph: primary_affinity decode bits
    libceph: primary_affinity infrastructure
    ...

    Linus Torvalds
     

05 Apr, 2014

11 commits