30 May, 2018
1 commit
-
[ Upstream commit 937441f3a3158d5510ca8cc78a82453f57a96365 ]
When parsing string option, in order to avoid memory leak we need to
carefully free it first in case of specifying same option several times.Signed-off-by: Chengguang Xu
Reviewed-by: Ilya Dryomov
Signed-off-by: Ilya Dryomov
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman
02 May, 2018
3 commits
-
commit 9c55ad1c214d9f8c4594ac2c3fa392c1c32431a7 upstream.
ceph_con_workfn() validates con->state before calling try_read() and
then try_write(). However, try_read() temporarily releases con->mutex,
notably in process_message() and ceph_con_in_msg_alloc(), opening the
window for ceph_con_close() to sneak in, close the connection and
release con->sock. When try_write() is called on the assumption that
con->state is still valid (i.e. not STANDBY or CLOSED), a NULL sock
gets passed to the networking stack:BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
IP: selinux_socket_sendmsg+0x5/0x20Make sure con->state is valid at the top of try_write() and add an
explicit BUG_ON for this, similar to try_read().Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/23706
Signed-off-by: Ilya Dryomov
Reviewed-by: Jason Dillaman
Signed-off-by: Greg Kroah-Hartman -
commit 7b4c443d139f1d2b5570da475f7a9cbcef86740c upstream.
If we go without an established session for a while, backoff delay will
climb to 30 seconds. The keepalive timeout is also 30 seconds, so it's
pretty easily hit after a prolonged hunting for a monitor: we don't get
a chance to send out a keepalive in time, which means we never get back
a keepalive ack in time, cutting an established session and attempting
to connect to a different monitor every 30 seconds:[Sun Apr 1 23:37:05 2018] libceph: mon0 10.80.20.99:6789 session established
[Sun Apr 1 23:37:36 2018] libceph: mon0 10.80.20.99:6789 session lost, hunting for new mon
[Sun Apr 1 23:37:36 2018] libceph: mon2 10.80.20.103:6789 session established
[Sun Apr 1 23:38:07 2018] libceph: mon2 10.80.20.103:6789 session lost, hunting for new mon
[Sun Apr 1 23:38:07 2018] libceph: mon1 10.80.20.100:6789 session established
[Sun Apr 1 23:38:37 2018] libceph: mon1 10.80.20.100:6789 session lost, hunting for new mon
[Sun Apr 1 23:38:37 2018] libceph: mon2 10.80.20.103:6789 session established
[Sun Apr 1 23:39:08 2018] libceph: mon2 10.80.20.103:6789 session lost, hunting for new monThe regular keepalive interval is 10 seconds. After ->hunting is
cleared in finish_hunting(), call __schedule_delayed() to ensure we
send out a keepalive after 10 seconds.Cc: stable@vger.kernel.org # 4.7+
Link: http://tracker.ceph.com/issues/23537
Signed-off-by: Ilya Dryomov
Reviewed-by: Jason Dillaman
Signed-off-by: Greg Kroah-Hartman -
commit facb9f6eba3df4e8027301cc0e514dc582a1b366 upstream.
This means that if we do some backoff, then authenticate, and are
healthy for an extended period of time, a subsequent failure won't
leave us starting our hunting sequence with a large backoff.Mirrors ceph.git commit d466bc6e66abba9b464b0b69687cf45c9dccf383.
Cc: stable@vger.kernel.org # 4.7+
Signed-off-by: Ilya Dryomov
Reviewed-by: Jason Dillaman
Signed-off-by: Greg Kroah-Hartman
30 Nov, 2017
1 commit
-
commit b11270853fa3654f08d4a6a03b23ddb220512d8d upstream.
The WARN_ON(!key->len) in set_secret() in net/ceph/crypto.c is hit if a
user tries to add a key of type "ceph" with an invalid payload as
follows (assuming CONFIG_CEPH_LIB=y):echo -e -n '\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00' \
| keyctl padd ceph desc @sThis can be hit by fuzzers. As this is merely bad input and not a
kernel bug, replace the WARN_ON() with return -EINVAL.Fixes: 7af3ea189a9a ("libceph: stop allocating a new cipher on every crypto request")
Signed-off-by: Eric Biggers
Reviewed-by: Ilya Dryomov
Signed-off-by: Ilya Dryomov
Signed-off-by: Greg Kroah-Hartman
02 Nov, 2017
1 commit
-
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.By default all files without license information are under the default
license of the kernel, which is GPL version 2.Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if
Reviewed-by: Philippe Ombredanne
Reviewed-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman
20 Sep, 2017
1 commit
-
This reverts most of commit f53b7665c8ce ("libceph: upmap semantic
changes").We need to prevent duplicates in the final result. For example, we
can currently take[1,2,3] and apply [(1,2)] and get [2,2,3]
or
[1,2,3] and apply [(3,2)] and get [1,2,2]
The rest of the system is not prepared to handle duplicates in the
result set like this.The reverted piece was intended to allow
[1,2,3] and [(1,2),(2,1)] to get [2,1,3]
to reorder primaries. First, this bidirectional swap is hard to
implement in a way that also prevents dups. For example, [1,2,3] and
[(1,4),(2,3),(3,4)] would give [4,3,4] but would we just drop the last
step we'd have [4,3,3] which is also invalid, etc. Simpler to just not
handle bidirectional swaps. In practice, they are not needed: if you
just want to choose a different primary then use primary_affinity, or
pg_upmap (not pg_upmap_items).Cc: stable@vger.kernel.org # 4.13
Link: http://tracker.ceph.com/issues/21410
Signed-off-by: Ilya Dryomov
Reviewed-by: Sage Weil
07 Sep, 2017
2 commits
-
Improve accuracy of statfs reporting for Ceph filesystems comprising
exactly one data pool. In this case, the Ceph monitor can now report
the space usage for the single data pool instead of the global data
for the entire Ceph cluster. Include support for this message in
mon_client and leverage it in ceph/super.Signed-off-by: Douglas Fuller
Reviewed-by: Yan, Zheng
Reviewed-by: Ilya Dryomov
Signed-off-by: Ilya Dryomov -
startsync is a no-op, has been for years. Remove it.
Link: http://tracker.ceph.com/issues/20604
Signed-off-by: Yanhu Cao
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov
01 Aug, 2017
6 commits
-
This is needed so that the OSDs can regenerate the missing set at the
start of a new interval where support for recovery deletes changed.Signed-off-by: Ilya Dryomov
Reviewed-by: Sage Weil -
- apply both pg_upmap and pg_upmap_items
- allow bidirectional swap of pg-upmap-itemsSigned-off-by: Ilya Dryomov
Reviewed-by: Sage Weil -
Reflects ceph.git commit 5e8fa3e06b68fae1582c9230a3a8d1abc6146286.
Signed-off-by: Ilya Dryomov
Reviewed-by: Sage Weil -
There is now a fallback to a choose_arg index of -1 if there isn't
a pool-specific choose_arg set. If you create a per-pool weight-set,
that works for that pool. Otherwise we try the compat/default one. If
that doesn't exist either, then we use the normal CRUSH weights.Signed-off-by: Ilya Dryomov
Reviewed-by: Sage Weil -
Reencoding an already reencoded message is a bad idea. This could
happen on Policy::stateful_server connections (!CEPH_MSG_CONNECT_LOSSY),
such as MDS sessions.This didn't pop up in testing because currently only OSD requests are
reencoded and OSD sessions are always lossy.Fixes: 98ad5ebd1505 ("libceph: ceph_connection_operations::reencode_message() method")
Signed-off-by: Ilya Dryomov
Reviewed-by: "Yan, Zheng" -
Messages allocated out of ceph_msgpool have a fixed front length
(pool->front_len). Asserting that the entire front has been filled
while encoding is thus wrong.Fixes: 8cb441c0545d ("libceph: MOSDOp v8 encoding (actual spgid + full hash)")
Reported-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov
Reviewed-by: "Yan, Zheng"
17 Jul, 2017
5 commits
-
If kmem_cache_zalloc() returns NULL then the INIT_LIST_HEAD(&data->links);
will Oops. The callers aren't really prepared for NULL returns so it
doesn't make a lot of difference in real life.Fixes: 5240d9f95dfe ("libceph: replace message data pointer with list")
Signed-off-by: Dan Carpenter
Signed-off-by: Ilya Dryomov -
encode_request_finish() is for MOSDOp messages. Calling it on
MOSDBackoff ack-block messages corrupts them.Fixes: a02a946dfe96 ("libceph: respect RADOS_BACKOFF backoffs")
Signed-off-by: Ilya Dryomov -
... otherwise we die in insert_pg_mapping(), which wants pg->node to be
empty, i.e. initialized with RB_CLEAR_NODE.Fixes: 6f428df47dae ("libceph: pg_upmap[_items] infrastructure")
Signed-off-by: Ilya Dryomov -
No sooner than Dan had fixed this issue in commit 293dffaad8d5
("libceph: NULL deref on crush_decode() error path"), I brought it
back. Add a new label and set -EINVAL once, right before failing.Fixes: 278b1d709c6a ("libceph: ceph_decode_skip_* helpers")
Reported-by: Dan Carpenter
Signed-off-by: Ilya Dryomov -
There are hidden gotos in the ceph_decode_* macros. We need to set the
"err" variable on these error paths otherwise we end up returning
ERR_PTR(0) which is NULL. It causes NULL dereferences in the callers.Fixes: 6f428df47dae ("libceph: pg_upmap[_items] infrastructure")
Signed-off-by: Dan Carpenter
[idryomov@gmail.com: similar bug in osdmap_decode(), changelog tweak]
Signed-off-by: Ilya Dryomov
16 Jul, 2017
1 commit
-
Pull random updates from Ted Ts'o:
"Add wait_for_random_bytes() and get_random_*_wait() functions so that
callers can more safely get random bytes if they can block until the
CRNG is initialized.Also print a warning if get_random_*() is called before the CRNG is
initialized. By default, only one single-line warning will be printed
per boot. If CONFIG_WARN_ALL_UNSEEDED_RANDOM is defined, then a
warning will be printed for each function which tries to get random
bytes before the CRNG is initialized. This can get spammy for certain
architecture types, so it is not enabled by default"* tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random:
random: reorder READ_ONCE() in get_random_uXX
random: suppress spammy warnings about unseeded randomness
random: warn when kernel uses unseeded randomness
net/route: use get_random_int for random counter
net/neighbor: use get_random_u32 for 32-bit hash random
rhashtable: use get_random_u32 for hash_rnd
ceph: ensure RNG is seeded before using
iscsi: ensure RNG is seeded before use
cifs: use get_random_u32 for 32-bit lock random
random: add get_random_{bytes,u32,u64,int,long,once}_wait family
random: add wait_for_random_bytes() API
07 Jul, 2017
19 commits
-
Signed-off-by: Ilya Dryomov
-
Reflects ceph.git commit dca1ae1e0a6b02029c3a7f9dec4114972be26d50.
Signed-off-by: Ilya Dryomov
-
It is not just a pointer to crush_work, it is the whole structure.
That is not a problem since it only contains a pointer. But it will
be a problem if new data members are added to crush_work.Reflects ceph.git commit ee957dd431bfbeb6dadaf77764db8e0757417328.
Signed-off-by: Ilya Dryomov
-
If there is no crush_choose_arg_map for a given pool, a NULL pointer is
passed to preserve existing crush_do_rule() behavior.Reflects ceph.git commits 55fb91d64071552ea1bc65ab4ea84d3c8b73ab4b,
dbe36e08be00c6519a8c89718dd47b0219c20516.Signed-off-by: Ilya Dryomov
-
bucket_straw2_choose needs to use weights that may be different from
weight_items. For instance to compensate for an uneven distribution
caused by a low number of values. Or to fix the probability biais
introduced by conditional probabilities (see
http://tracker.ceph.com/issues/15653 for more information).We introduce a weight_set for each straw2 bucket to set the desired
weight for a given item at a given position. The weight of a given item
when picking the first replica (first position) may be different from
the weight the second replica (second position). For instance the weight
matrix for a given bucket containing items 3, 7 and 13 could be as
follows:position 0 position 1
item 3 0x10000 0x100000
item 7 0x40000 0x10000
item 13 0x40000 0x10000When crush_do_rule picks the first of two replicas (position 0), item 7,
3 are four times more likely to be choosen by bucket_straw2_choose than
item 13. When choosing the second replica (position 1), item 3 is ten
times more likely to be choosen than item 7, 13.By default the weight_set of each bucket exactly matches the content of
item_weights for each position to ensure backward compatibility.bucket_straw2_choose compares items by using their id. The same ids are
also used to index buckets and they must be unique. For each item in a
bucket an array of ids can be provided for placement purposes and they
are used instead of the ids. If no replacement ids are provided, the
legacy behavior is preserved.Reflects ceph.git commit 19537a450fd5c5a0bb8b7830947507a76db2ceca.
Signed-off-by: Ilya Dryomov
-
Previously, pg_to_raw_osds() didn't filter for existent OSDs because
raw_to_up_osds() would filter for "up" ("up" is predicated on "exists")
and raw_to_up_osds() was called directly after pg_to_raw_osds(). Now,
with apply_upmap() call in there, nonexistent OSDs in pg_to_raw_osds()
output can affect apply_upmap(). Introduce remove_nonexistent_osds()
to deal with that.Signed-off-by: Ilya Dryomov
-
Move raw_pg_to_pg() call out of get_temp_osds() and into
ceph_pg_to_up_acting_osds(), for upcoming apply_upmap().Signed-off-by: Ilya Dryomov
-
pg_temp and pg_upmap encodings are the same (PG -> array of osds),
except for the incremental remove: it's an empty mapping in new_pg_temp
for pg_temp and a separate old_pg_upmap set for pg_upmap. (This isn't
to allow for empty pg_upmap mappings -- apparently, pg_temp just wasn't
looked at as an example for pg_upmap encoding.)Reuse __decode_pg_temp() for decoding pg_upmap and new_pg_upmap.
__decode_pg_temp() stores into pg_temp union member, but since pg_upmap
union member is identical, reading through pg_upmap later is OK.Signed-off-by: Ilya Dryomov
-
Some of these won't be as efficient as they could be (e.g.
ceph_decode_skip_set(... 32 ...) could advance by len * sizeof(u32)
once instead of advancing by sizeof(u32) len times), but that's fine
and not worth a bunch of extra macro code.Replace skip_name_map() with ceph_decode_skip_map as an example.
Signed-off-by: Ilya Dryomov
-
Switch to DEFINE_RB_FUNCS2-generated {insert,lookup,erase}_pg_mapping().
Signed-off-by: Ilya Dryomov
-
Signed-off-by: Ilya Dryomov
-
Make __{lookup,remove}_pg_mapping() look like their ceph_spg_mapping
counterparts: take const struct ceph_pg *.Signed-off-by: Ilya Dryomov
-
Signed-off-by: Ilya Dryomov
-
Signed-off-by: Ilya Dryomov
-
For luminous and beyond we are encoding the actual spgid, which
requires operating with the correct pg_num, i.e. that of the target
pool.Signed-off-by: Ilya Dryomov
-
need_check_tiering logic doesn't make a whole lot of sense. Drop it
and apply tiering unconditionally on every calc_target() call instead.Signed-off-by: Ilya Dryomov
-
Otherwise we may miss events like PG splits, pool deletions, etc when
we get multiple incremental maps at once. Because check_pool_dne() can
now be fed an unlinked request, finish_request() needed to be taught to
handle unlinked requests.Signed-off-by: Ilya Dryomov
-
When processing a map update consisting of multiple incrementals, we
may end up running check_linger_pool_dne() on a lingering request that
was previously added to need_resend_linger list. If it is concluded
that the target pool doesn't exist, the request is killed off while
still on need_resend_linger list, which leads to a crash on a NULL
lreq->osd in kick_requests():libceph: linger_id 18446462598732840961 pool does not exist
BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
IP: ceph_osdc_handle_map+0x4ae/0x870Signed-off-by: Ilya Dryomov
-
Note that ceph_osd_request_target fields are updated regardless of
RESEND_ON_SPLIT.Signed-off-by: Ilya Dryomov