14 Aug, 2014
1 commit
-
Pull Ceph updates from Sage Weil:
"There is a lot of refactoring and hardening of the libceph and rbd
code here from Ilya that fix various smaller bugs, and a few more
important fixes with clone overlap. The main fix is a critical change
to the request_fn handling to not sleep that was exposed by the recent
mutex changes (which will also go to the 3.16 stable series).Yan Zheng has several fixes in here for CephFS fixing ACL handling,
time stamps, and request resends when the MDS restarts.Finally, there are a few cleanups from Himangi Saraogi based on
Coccinelle"* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (39 commits)
libceph: set last_piece in ceph_msg_data_pages_cursor_init() correctly
rbd: remove extra newlines from rbd_warn() messages
rbd: allocate img_request with GFP_NOIO instead GFP_ATOMIC
rbd: rework rbd_request_fn()
ceph: fix kick_requests()
ceph: fix append mode write
ceph: fix sizeof(struct tYpO *) typo
ceph: remove redundant memset(0)
rbd: take snap_id into account when reading in parent info
rbd: do not read in parent info before snap context
rbd: update mapping size only on refresh
rbd: harden rbd_dev_refresh() and callers a bit
rbd: split rbd_dev_spec_update() into two functions
rbd: remove unnecessary asserts in rbd_dev_image_probe()
rbd: introduce rbd_dev_header_info()
rbd: show the entire chain of parent images
ceph: replace comma with a semicolon
rbd: use rbd_segment_name_free() instead of kfree()
ceph: check zero length in ceph_sync_read()
ceph: reset r_resend_mds after receiving -ESTALE
...
09 Aug, 2014
1 commit
-
Determining ->last_piece based on the value of ->page_offset + length
is incorrect because length here is the length of the entire message.
->last_piece set to false even if page array data item length is /dev/null
rbd snap create foo@snap
rbd snap protect foo@snap
rbd clone foo@snap bar
# rbd_resize calls librbd rbd_resize(), size is in bytes
./rbd_resize bar $(((4 << 20) + 512))
rbd resize --size 10 bar
BAR_DEV=$(rbd map bar)
# trigger a 512-byte copyup -- 512-byte page array data item
dd if=/dev/urandom of=$BAR_DEV bs=1M count=1 seek=5The problem exists only in ceph_msg_data_pages_cursor_init(),
ceph_msg_data_pages_advance() does the right thing. The size_t cast is
unnecessary.Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Ilya Dryomov
Reviewed-by: Sage Weil
Reviewed-by: Alex Elder
23 Jul, 2014
2 commits
-
Ceph can use user_match() instead of defining its own identical function.
Signed-off-by: David Howells
Acked-by: Steve Dickson
Reviewed-by: Sage Weil
cc: Tommi Virtanen -
Make use of key preparsing in Ceph so that quota size determination can take
place prior to keyring locking when a key is being added.Signed-off-by: David Howells
Acked-by: Steve Dickson
Reviewed-by: Sage Weil
cc: Tommi Virtanen
08 Jul, 2014
11 commits
-
queue_con() bumps osd ref count. We should do the reverse when
canceling con work.Signed-off-by: Ilya Dryomov
Reviewed-by: Alex Elder -
Remove now unused ceph_osdc_unregister_linger_request().
Signed-off-by: Ilya Dryomov
Reviewed-by: Alex Elder -
Introduce ceph_osdc_cancel_request() intended for canceling requests
from the higher layers (rbd and cephfs). Because higher layers are in
charge and are supposed to know what and when they are canceling, the
request is not completed, only unref'ed and removed from the libceph
data structures.__cancel_request() is no longer called before __unregister_request(),
because __unregister_request() unconditionally revokes r_request and
there is no point in trying to do it twice.Signed-off-by: Ilya Dryomov
Reviewed-by: Alex Elder -
We should check if request is on the linger request list of any of the
OSDs, not whether request is registered or not.Signed-off-by: Ilya Dryomov
Reviewed-by: Alex Elder -
Linger requests that have not yet been registered should not be
unregistered by __unregister_linger_request(). This messes up ref
count and leads to use-after-free.Signed-off-by: Ilya Dryomov
Reviewed-by: Alex Elder -
It is important that both regular and lingering requests lists are
empty when the OSD is removed.Signed-off-by: Ilya Dryomov
Reviewed-by: Alex Elder -
Add some WARN_ONs to alert us when we try to destroy requests that are
still registered.Signed-off-by: Ilya Dryomov
Reviewed-by: Alex Elder -
Add dout()s to ceph_osdc_request_{get,put}(). Also move them to .c and
turn kref release callback into a static function.Signed-off-by: Ilya Dryomov
Reviewed-by: Alex Elder -
Add dout()s to ceph_msg_{get,put}(). Also move them to .c and turn
kref release callback into a static function.Signed-off-by: Ilya Dryomov
Reviewed-by: Alex Elder -
Abstract out __move_osd_to_lru() logic from __unregister_request() and
__unregister_linger_request().Signed-off-by: Ilya Dryomov
Reviewed-by: Alex Elder -
So that:
req->r_osd_item --> osd->o_requests list
req->r_linger_osd_item --> osd->o_linger_requests listSigned-off-by: Ilya Dryomov
Reviewed-by: Alex Elder
13 Jun, 2014
2 commits
-
Pull Ceph updates from Sage Weil:
"This has a mix of bug fixes and cleanups.Alex's patch fixes a rare race in RBD. Ilya's patches fix an ENOENT
check when a second rbd image is mapped and a couple memory leaks.
Zheng fixes several issues with fragmented directories and multiple
MDSs. Josh fixes a spin/sleep issue, and Josh and Guangliang's
patches fix setting and unsetting RBD images read-only.Naturally there are several other cleanups mixed in for good measure"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits)
rbd: only set disk to read-only once
rbd: move calls that may sleep out of spin lock range
rbd: add ioctl for rbd
ceph: use truncate_pagecache() instead of truncate_inode_pages()
ceph: include time stamp in every MDS request
rbd: fix ida/idr memory leak
rbd: use reference counts for image requests
rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync()
rbd: make sure we have latest osdmap on 'rbd map'
libceph: add ceph_monc_wait_osdmap()
libceph: mon_get_version request infrastructure
libceph: recognize poolop requests in debugfs
ceph: refactor readpage_nounlock() to make the logic clearer
mds: check cap ID when handling cap export message
ceph: remember subtree root dirfrag's auth MDS
ceph: introduce ceph_fill_fragtree()
ceph: handle cap import atomically
ceph: pre-allocate ceph_cap struct for ceph_add_cap()
ceph: update inode fields according to issued caps
rbd: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
... -
Pull networking updates from David Miller:
1) Seccomp BPF filters can now be JIT'd, from Alexei Starovoitov.
2) Multiqueue support in xen-netback and xen-netfront, from Andrew J
Benniston.3) Allow tweaking of aggregation settings in cdc_ncm driver, from Bjørn
Mork.4) BPF now has a "random" opcode, from Chema Gonzalez.
5) Add more BPF documentation and improve test framework, from Daniel
Borkmann.6) Support TCP fastopen over ipv6, from Daniel Lee.
7) Add software TSO helper functions and use them to support software
TSO in mvneta and mv643xx_eth drivers. From Ezequiel Garcia.8) Support software TSO in fec driver too, from Nimrod Andy.
9) Add Broadcom SYSTEMPORT driver, from Florian Fainelli.
10) Handle broadcasts more gracefully over macvlan when there are large
numbers of interfaces configured, from Herbert Xu.11) Allow more control over fwmark used for non-socket based responses,
from Lorenzo Colitti.12) Do TCP congestion window limiting based upon measurements, from Neal
Cardwell.13) Support busy polling in SCTP, from Neal Horman.
14) Allow RSS key to be configured via ethtool, from Venkata Duvvuru.
15) Bridge promisc mode handling improvements from Vlad Yasevich.
16) Don't use inetpeer entries to implement ID generation any more, it
performs poorly, from Eric Dumazet.* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1522 commits)
rtnetlink: fix userspace API breakage for iproute2 < v3.9.0
tcp: fixing TLP's FIN recovery
net: fec: Add software TSO support
net: fec: Add Scatter/gather support
net: fec: Increase buffer descriptor entry number
net: fec: Factorize feature setting
net: fec: Enable IP header hardware checksum
net: fec: Factorize the .xmit transmit function
bridge: fix compile error when compiling without IPv6 support
bridge: fix smatch warning / potential null pointer dereference
via-rhine: fix full-duplex with autoneg disable
bnx2x: Enlarge the dorq threshold for VFs
bnx2x: Check for UNDI in uncommon branch
bnx2x: Fix 1G-baseT link
bnx2x: Fix link for KR with swapped polarity lane
sctp: Fix sk_ack_backlog wrap-around problem
net/core: Add VF link state control policy
net/fsl: xgmac_mdio is dependent on OF_MDIO
net/fsl: Make xgmac_mdio read error message useful
net_sched: drr: warn when qdisc is not work conserving
...
12 Jun, 2014
2 commits
-
Backmerge of dcache.c changes from mainline. It's that, or complete
rebase...Conflicts:
fs/splice.cSigned-off-by: Al Viro
-
Sparse complained about this bogus extern on definition of
a function.Signed-off-by: Stephen Hemminger
Signed-off-by: David S. Miller
06 Jun, 2014
3 commits
-
Add ceph_monc_wait_osdmap(), which will block until the osdmap with the
specified epoch is received or timeout occurs.Export both of these as they are going to be needed by rbd.
Signed-off-by: Ilya Dryomov
Reviewed-by: Sage Weil -
Add support for mon_get_version requests to libceph. This reuses much
of the ceph_mon_generic_request infrastructure, with one exception.
Older OSDs don't set mon_get_version reply hdr->tid even if the
original request had a non-zero tid, which makes it impossible to
lookup ceph_mon_generic_request contexts by tid in get_generic_reply()
for such replies. As a workaround, we allocate a reply message on the
reply path. This can probably interfere with revoke, but I don't see
a better way.Signed-off-by: Ilya Dryomov
Reviewed-by: Sage Weil -
Recognize poolop requests in debugfs monc dump, fix prink format
specifiers - tid is unsigned.Signed-off-by: Ilya Dryomov
Reviewed-by: Sage Weil
17 May, 2014
2 commits
-
Commit e2b149cc4ba0 ("crush: add chooseleaf_vary_r tunable") added the
crush_map::chooseleaf_vary_r field but missed the decode part. This
lead to misdirected requests caused by incorrect raw crush mapping
sets.Fixes: http://tracker.ceph.com/issues/8226
Reported-and-Tested-by: Dmitry Smirnov
Signed-off-by: Ilya Dryomov
Reviewed-by: Sage Weil -
It has been reported that using ZFSonLinux on rbd will result in memory
corruption. The bug report can be found here:https://github.com/zfsonlinux/spl/issues/241
http://tracker.ceph.com/issues/7790The reason is that ZFS will send pages with page_count 0 into rbd, which in
turns send them to tcp_sendpage. However, tcp_sendpage cannot deal with
page_count 0, as it will do get_page and put_page, and erroneously free the
page.This type of issue has been noted before, and handled in iscsi, drbd,
etc. So, rbd should also handle this. This fix address this issue by fall back
to slower sendmsg when page_count 0 detected.Cc: Sage Weil
Cc: Yehuda Sadeh
Cc: stable@vger.kernel.org
Signed-off-by: Chunwei Chen
Reviewed-by: Ilya Dryomov
07 May, 2014
1 commit
-
Signed-off-by: Al Viro
06 May, 2014
1 commit
-
Pull Ceph fixes from Sage Weil:
"First, there is a critical fix for the new primary-affinity function
that went into -rc1.The second batch of patches from Zheng fix a range of problems with
directory fragmentation, readdir, and a few odds and ends for cephfs"* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: reserve caps for file layout/lock MDS requests
ceph: avoid releasing caps that are being used
ceph: clear directory's completeness when creating file
libceph: fix non-default values check in apply_primary_affinity()
ceph: use fpos_cmp() to compare dentry positions
ceph: check directory's completeness before emitting directory entry
29 Apr, 2014
1 commit
-
osd_primary_affinity array is indexed into incorrectly when checking
for non-default primary-affinity values. This nullifies the impact of
the rest of the apply_primary_affinity() and results in misdirected
requests.if (osds[i] != CRUSH_ITEM_NONE &&
osdmap->osd_primary_affinity[i] !=
^^^
CEPH_OSD_DEFAULT_PRIMARY_AFFINITY) {For a pool with size 2, this always ends up checking osd0 and osd1
primary_affinity values, instead of the values that correspond to the
osds in question. E.g., given a [2,3] up set and a [max,max,0,max]
primary affinity vector, requests are still sent to osd2, because both
osd0 and osd1 happen to have max primary_affinity values and therefore
we return from apply_primary_affinity() early on the premise that all
osds in the given set have max (default) values. Fix it.Fixes: http://tracker.ceph.com/issues/7954
Signed-off-by: Ilya Dryomov
Reviewed-by: Sage Weil
12 Apr, 2014
1 commit
-
Several spots in the kernel perform a sequence like:
skb_queue_tail(&sk->s_receive_queue, skb);
sk->sk_data_ready(sk, skb->len);But at the moment we place the SKB onto the socket receive queue it
can be consumed and freed up. So this skb->len access is potentially
to freed up memory.Furthermore, the skb->len can be modified by the consumer so it is
possible that the value isn't accurate.And finally, no actual implementation of this callback actually uses
the length argument. And since nobody actually cared about it's
value, lots of call sites pass arbitrary values in such as '0' and
even '1'.So just remove the length argument from the callback, that way there
is no confusion whatsoever and all of these use-after-free cases get
fixed as a side effect.Based upon a patch by Eric Dumazet and his suggestion to audit this
issue tree-wide.Signed-off-by: David S. Miller
08 Apr, 2014
1 commit
-
Pull Ceph updates from Sage Weil:
"The biggest chunk is a series of patches from Ilya that add support
for new Ceph osd and crush map features, including some new tunables,
primary affinity, and the new encoding that is needed for erasure
coding support. This brings things into parity with the server side
and the looming firefly release. There is also support for allocation
hints in RBD that help limit fragmentation on the server side.There is also a series of patches from Zheng fixing NFS reexport,
directory fragmentation support, flock vs fnctl behavior, and some
issues with clustered MDS.Finally, there are some miscellaneous fixes from Yunchuan Wen for
fscache, Fabian Frederick for ACLs, and from me for fsync(dirfd)
behavior"* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (79 commits)
ceph: skip invalid dentry during dcache readdir
libceph: dump pool {read,write}_tier to debugfs
libceph: output primary affinity values on osdmap updates
ceph: flush cap release queue when trimming session caps
ceph: don't grabs open file reference for aborted request
ceph: drop extra open file reference in ceph_atomic_open()
ceph: preallocate buffer for readdir reply
libceph: enable PRIMARY_AFFINITY feature bit
libceph: redo ceph_calc_pg_primary() in terms of ceph_calc_pg_acting()
libceph: add support for osd primary affinity
libceph: add support for primary_temp mappings
libceph: return primary from ceph_calc_pg_acting()
libceph: switch ceph_calc_pg_acting() to new helpers
libceph: introduce apply_temps() helper
libceph: introduce pg_to_raw_osds() and raw_to_up_osds() helpers
libceph: ceph_can_shift_osds(pool) and pool type defines
libceph: ceph_osd_{exists,is_up,is_down}(osd) definitions
libceph: enable OSDMAP_ENC feature bit
libceph: primary_affinity decode bits
libceph: primary_affinity infrastructure
...
05 Apr, 2014
11 commits
-
Dump pool {read,write}_tier to debugfs. While at it, fixup printk type
specifiers and remove the unnecessary cast to unsigned long long.Signed-off-by: Ilya Dryomov
-
Similar to osd weights, output primary affinity values on incremental
osdmap updates.Signed-off-by: Ilya Dryomov
-
Reimplement ceph_calc_pg_primary() in terms of ceph_calc_pg_acting()
and get rid of the now unused calc_pg_raw().Signed-off-by: Ilya Dryomov
Reviewed-by: Alex Elder -
Respond to non-default primary_affinity values accordingly. (Primary
affinity allows the admin to shift 'primary responsibility' away from
specific osds, effectively shifting around the read side of the
workload and whatever overhead is incurred by peering and writes by
virtue of being the primary).Signed-off-by: Ilya Dryomov
Reviewed-by: Alex Elder -
Change apply_temp() to override primary in the same way pg_temp
overrides osd set. primary_temp overrides pg_temp primary too.Signed-off-by: Ilya Dryomov
Reviewed-by: Alex Elder -
In preparation for adding support for primary_temp, stop assuming
primaryness: add a primary out parameter to ceph_calc_pg_acting() and
change call sites accordingly. Primary is now specified separately
from the order of osds in the set.Signed-off-by: Ilya Dryomov
Reviewed-by: Alex Elder -
Switch ceph_calc_pg_acting() to new helpers: pg_to_raw_osds(),
raw_to_up_osds() and apply_temps().Signed-off-by: Ilya Dryomov
Reviewed-by: Alex Elder -
apply_temp() helper for applying various temporary mappings (at this
point only pg_temp mappings) to the up set, therefore transforming it
into an acting set.Signed-off-by: Ilya Dryomov
Reviewed-by: Alex Elder -
pg_to_raw_osds() helper for computing a raw (crush) set, which can
contain non-existant and down osds.raw_to_up_osds() helper for pruning non-existant and down osds from the
raw set, therefore transforming it into an up set, and determining up
primary.Signed-off-by: Ilya Dryomov
Reviewed-by: Alex Elder -
Add two helpers to decode primary_affinity (full map, vector) and
new_primary_affinity (inc map, map) and switch to them.Signed-off-by: Ilya Dryomov
Reviewed-by: Alex Elder -
Add primary_affinity infrastructure. primary_affinity values are
stored in an max_osd-sized array, hanging off ceph_osdmap, similar to
a osd_weight array.Introduce {get,set}_primary_affinity() helpers, primarily to return
CEPH_OSD_DEFAULT_PRIMARY_AFFINITY when no affinity has been set and to
abstract out osd_primary_affinity array allocation and initialization.Signed-off-by: Ilya Dryomov
Reviewed-by: Alex Elder