12 Oct, 2020

1 commit

  • Replace a global map->crush_workspace (protected by a global mutex)
    with a list of workspaces, up to the number of CPUs + 1.

    This is based on a patch from Robin Geuze.
    Robin and his team have observed a 10-20% increase in IOPS across all
    queue depths, as well as lower CPU usage, on a high-end all-NVMe
    100GbE cluster.
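
    A minimal sketch of the idea, assuming hypothetical names (the actual
    fields and helpers in the patch may differ): idle workspaces sit on a
    lock-protected list and a mapping operation grabs one, instead of
    every mapping serializing on a single global workspace.

        struct crush_ws {                        /* hypothetical wrapper */
            struct list_head node;
            struct crush_work *work;             /* the actual CRUSH scratch space */
        };

        struct workspace_manager {               /* hypothetical */
            struct list_head idle_ws;            /* unused workspaces */
            spinlock_t ws_lock;
            int total_ws;                        /* allocated so far, capped */
        };

        static struct crush_ws *get_workspace(struct workspace_manager *wsm)
        {
            struct crush_ws *ws = NULL;

            spin_lock(&wsm->ws_lock);
            if (!list_empty(&wsm->idle_ws)) {
                ws = list_first_entry(&wsm->idle_ws, struct crush_ws, node);
                list_del(&ws->node);
            } else if (wsm->total_ws < num_possible_cpus() + 1) {
                wsm->total_ws++;                 /* reserve a slot for a new one */
            }
            spin_unlock(&wsm->ws_lock);
            /* if ws is NULL: either allocate the reserved workspace outside
             * the lock, or sleep until another user puts one back */
            return ws;
        }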

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     

01 Jun, 2020

4 commits

  • OSD-side issues with reads from replica have been resolved in
    Octopus. Reading from replica should be safe wrt. unstable or
    uncommitted state now, so add support for balanced and localized
    reads.

    There are two cases when a read from replica can't be served:

    - OSD may silently drop the request, expecting the client to
    notice that the acting set has changed and resend via the usual
    means (handled with t->used_replica)

    - OSD may return EAGAIN, expecting the client to resend to the
    primary, ignoring replica read flags (see handle_reply())
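
    A rough sketch of how the second case could look in the reply path
    (resend_to_primary() is a hypothetical helper; the real handle_reply()
    logic is more involved):

        /* On EAGAIN for a replica read, drop the replica-read flags and
         * resend the request to the primary OSD. */
        if (result == -EAGAIN && req->r_t.used_replica) {
            req->r_flags &= ~(CEPH_OSD_FLAG_BALANCE_READS |
                              CEPH_OSD_FLAG_LOCALIZE_READS);
            resend_to_primary(req);              /* hypothetical helper */
            return;
        }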

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Jeff Layton

    Ilya Dryomov
     
  • Allow expressing client's location in terms of CRUSH hierarchy as
    a set of (bucket type name, bucket name) pairs. The userspace syntax
    "crush_location = key1=value1 key2=value2" is incompatible with mount
    options and needed adaptation. Key-value pairs are separated by '|'
    and we use ':' instead of '=' to separate keys from values. So for:

    crush_location = host=foo rack=bar

    one would write:

    crush_location=host:foo|rack:bar

    As in userspace, "multipath" locations are supported, so indicating
    locality for parallel hierarchies is possible:

    crush_location=rack:foo1|rack:foo2|datacenter:bar
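
    A self-contained userspace sketch of parsing the adapted syntax
    (illustrative only, not the kernel's actual option parser; the
    function name is made up):

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <string.h>

        /* Split "host:foo|rack:bar" into (bucket type, bucket name) pairs.
         * The buffer is modified in place, so pass a writable copy. */
        static void parse_crush_location(char *s)
        {
            char *pair;

            while ((pair = strsep(&s, "|")) != NULL) {
                char *name = strchr(pair, ':');

                if (!name)
                    continue;                    /* malformed pair, skip */
                *name++ = '\0';
                printf("bucket type '%s', bucket name '%s'\n", pair, name);
            }
        }

    For example, parse_crush_location(strdup("host:foo|rack:bar")) would
    yield the pairs (host, foo) and (rack, bar).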

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Jeff Layton

    Ilya Dryomov
     
  • These would be matched with the provided client location to calculate
    the locality value.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Jeff Layton

    Ilya Dryomov
     
  • Needed for the next commit and useful for ceph_pg_pool_info tree as
    well. I'm leaving the asserting helper in for now, but we should look
    at getting rid of it in the future.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Jeff Layton

    Ilya Dryomov
     

23 Mar, 2020

1 commit

  • CEPH_OSDMAP_FULL/NEARFULL aren't set since mimic, so we need to consult
    per-pool flags as well. Unfortunately the backwards compatibility here
    is lacking:

    - the change that deprecated OSDMAP_FULL/NEARFULL went into mimic, but
    was guarded by require_osd_release >= RELEASE_LUMINOUS
    - it was subsequently backported to luminous in v12.2.2, but that makes
    no difference to clients that only check OSDMAP_FULL/NEARFULL because
    require_osd_release is not client-facing -- it is for OSDs

    Since all kernels are affected, the best we can do here is just start
    checking both map flags and pool flags and send that to stable.

    These checks are best effort, so take osdc->lock and look up pool flags
    just once. Remove the FIXME, since filesystem quotas are checked above
    and RADOS quotas are reflected in POOL_FLAG_FULL: when the pool reaches
    its quota, both POOL_FLAG_FULL and POOL_FLAG_FULL_QUOTA are set.
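
    A condensed sketch of the resulting check (field and helper names
    approximate the kernel ones; NEARFULL and the surrounding locking are
    omitted):

        /* The target counts as full if either the legacy map-wide flag or
         * the per-pool flag is set. */
        static bool target_pool_full(struct ceph_osd_client *osdc,
                                     struct ceph_pg_pool_info *pi)
        {
            return ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL) ||
                   (pi->flags & CEPH_POOL_FLAG_FULL);
        }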

    Cc: stable@vger.kernel.org
    Reported-by: Yanhu Cao
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Jeff Layton
    Acked-by: Sage Weil

    Ilya Dryomov
     

16 Sep, 2019

1 commit

  • osdmap has a bunch of arrays that grow linearly with the number of
    OSDs. osd_state, osd_weight and osd_primary_affinity take 4 bytes per
    OSD. osd_addr takes 136 bytes per OSD because of sockaddr_storage.
    The CRUSH workspace area also grows linearly with the number of OSDs.

    Normally these arrays are allocated at client startup. The osdmap is
    usually updated in small incrementals, but once in a while a full map
    may need to be processed. For a cluster with 10000 OSDs, this means
    a bunch of 40K allocations followed by a 1.3M allocation, all of which
    are currently required to be physically contiguous. This results in
    sporadic ENOMEM errors, hanging the client.

    Go back to manually (re)allocating arrays and use ceph_kvmalloc() to
    fall back to non-contiguous allocation when necessary.
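
    A sketch of the reallocation pattern, assuming a hypothetical helper
    (the actual code reallocates each osdmap array explicitly): the point
    is that ceph_kvmalloc() can fall back to vmalloc, so a large per-OSD
    array no longer has to be physically contiguous.

        /* Grow (or shrink) a per-OSD array without requiring physically
         * contiguous memory. */
        static void *resize_osd_array(void *old, u32 old_count, u32 new_count,
                                      size_t elem_size)
        {
            void *p = ceph_kvmalloc(array_size(new_count, elem_size), GFP_NOIO);

            if (!p)
                return NULL;

            if (old) {
                memcpy(p, old, (size_t)min(old_count, new_count) * elem_size);
                kvfree(old);
            }
            return p;
        }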

    Link: https://tracker.ceph.com/issues/40481
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Jeff Layton

    Ilya Dryomov
     

08 Jul, 2019

2 commits


06 Mar, 2019

1 commit

  • One of the more common cases of allocation size calculations is finding
    the size of a structure that has a zero-sized array at the end, along
    with memory for some number of elements for that array. For example:

    struct foo {
            int stuff;
            struct boo entry[];
    };

    instance = kmalloc(sizeof(struct foo) + count * sizeof(struct boo), GFP_KERNEL);

    Instead of leaving these open-coded and prone to type mistakes, we can
    now use the new struct_size() helper:

    instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);
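
    The helper computes essentially the same size, but with overflow
    checking: on overflow it saturates to SIZE_MAX so the allocation fails
    instead of being undersized. Schematically:

        struct_size(instance, entry, count)
            == sizeof(*instance) + count * sizeof(*instance->entry)   /* absent overflow */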

    This code was detected with the help of Coccinelle.

    Signed-off-by: Gustavo A. R. Silva
    Reviewed-by: Ilya Dryomov
    Signed-off-by: Ilya Dryomov

    Gustavo A. R. Silva
     

15 Jun, 2018

1 commit

  • Pull ceph updates from Ilya Dryomov:
    "The main piece is a set of libceph changes that revamps how OSD
    requests are aborted, improving CephFS ENOSPC handling and making
    "umount -f" actually work (Zheng and myself).

    The rest is mostly mount option handling cleanups from Chengguang and
    assorted fixes from Zheng, Luis and Dongsheng.

    * tag 'ceph-for-4.18-rc1' of git://github.com/ceph/ceph-client: (31 commits)
    rbd: flush rbd_dev->watch_dwork after watch is unregistered
    ceph: update description of some mount options
    ceph: show ino32 if the value is different with default
    ceph: strengthen rsize/wsize/readdir_max_bytes validation
    ceph: fix alignment of rasize
    ceph: fix use-after-free in ceph_statfs()
    ceph: prevent i_version from going back
    ceph: fix wrong check for the case of updating link count
    libceph: allocate the locator string with GFP_NOFAIL
    libceph: make abort_on_full a per-osdc setting
    libceph: don't abort reads in ceph_osdc_abort_on_full()
    libceph: avoid a use-after-free during map check
    libceph: don't warn if req->r_abort_on_full is set
    libceph: use for_each_request() in ceph_osdc_abort_on_full()
    libceph: defer __complete_request() to a workqueue
    libceph: move more code into __complete_request()
    libceph: no need to call flush_workqueue() before destruction
    ceph: flush pending works before shutdown super
    ceph: abort osd requests on force umount
    libceph: introduce ceph_osdc_abort_requests()
    ...

    Linus Torvalds
     

13 Jun, 2018

1 commit

  • The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
    patch replaces cases of:

    kmalloc(a * b, gfp)

    with:
    kmalloc_array(a, b, gfp)

    as well as handling cases of:

    kmalloc(a * b * c, gfp)

    with:

    kmalloc(array3_size(a, b, c), gfp)

    as it's slightly less ugly than:

    kmalloc_array(array_size(a, b), c, gfp)

    This does, however, attempt to ignore constant size factors like:

    kmalloc(4 * 1024, gfp)

    though any constants defined via macros get caught up in the conversion.

    Any factors with a sizeof() of "unsigned char", "char", and "u8" were
    dropped, since they're redundant.

    The tools/ directory was manually excluded, since it has its own
    implementation of kmalloc().
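
    A minimal illustration of the hazard the 2-factor form guards against
    (hypothetical values): the open-coded multiplication can wrap around,
    while kmalloc_array() detects the overflow and fails cleanly.

        size_t count = SIZE_MAX / 2 + 2, size = 2;
        void *buf;

        buf = kmalloc(count * size, GFP_KERNEL);       /* product wraps to 2 bytes */
        buf = kmalloc_array(count, size, GFP_KERNEL);  /* overflow detected, returns NULL */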

    The Coccinelle script used for this was:

    // Fix redundant parens around sizeof().
    @@
    type TYPE;
    expression THING, E;
    @@

    (
    kmalloc(
    - (sizeof(TYPE)) * E
    + sizeof(TYPE) * E
    , ...)
    |
    kmalloc(
    - (sizeof(THING)) * E
    + sizeof(THING) * E
    , ...)
    )

    // Drop single-byte sizes and redundant parens.
    @@
    expression COUNT;
    typedef u8;
    typedef __u8;
    @@

    (
    kmalloc(
    - sizeof(u8) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(__u8) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(char) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(unsigned char) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(u8) * COUNT
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(__u8) * COUNT
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(char) * COUNT
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(unsigned char) * COUNT
    + COUNT
    , ...)
    )

    // 2-factor product with sizeof(type/expression) and identifier or constant.
    @@
    type TYPE;
    expression THING;
    identifier COUNT_ID;
    constant COUNT_CONST;
    @@

    (
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * (COUNT_ID)
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * COUNT_ID
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * (COUNT_CONST)
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * COUNT_CONST
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * (COUNT_ID)
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * COUNT_ID
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * (COUNT_CONST)
    + COUNT_CONST, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * COUNT_CONST
    + COUNT_CONST, sizeof(THING)
    , ...)
    )

    // 2-factor product, only identifiers.
    @@
    identifier SIZE, COUNT;
    @@

    - kmalloc
    + kmalloc_array
    (
    - SIZE * COUNT
    + COUNT, SIZE
    , ...)

    // 3-factor product with 1 sizeof(type) or sizeof(expression), with
    // redundant parens removed.
    @@
    expression THING;
    identifier STRIDE, COUNT;
    type TYPE;
    @@

    (
    kmalloc(
    - sizeof(TYPE) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    )

    // 3-factor product with 2 sizeof(variable), with redundant parens removed.
    @@
    expression THING1, THING2;
    identifier COUNT;
    type TYPE1, TYPE2;
    @@

    (
    kmalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kmalloc(
    - sizeof(THING1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kmalloc(
    - sizeof(THING1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    )

    // 3-factor product, only identifiers, with redundant parens removed.
    @@
    identifier STRIDE, SIZE, COUNT;
    @@

    (
    kmalloc(
    - (COUNT) * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - (COUNT) * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - (COUNT) * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - (COUNT) * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    )

    // Any remaining multi-factor products, first at least 3-factor products,
    // when they're not all constants...
    @@
    expression E1, E2, E3;
    constant C1, C2, C3;
    @@

    (
    kmalloc(C1 * C2 * C3, ...)
    |
    kmalloc(
    - (E1) * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kmalloc(
    - (E1) * (E2) * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kmalloc(
    - (E1) * (E2) * (E3)
    + array3_size(E1, E2, E3)
    , ...)
    |
    kmalloc(
    - E1 * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    )

    // And then all remaining 2 factors products when they're not all constants,
    // keeping sizeof() as the second factor argument.
    @@
    expression THING, E1, E2;
    type TYPE;
    constant C1, C2, C3;
    @@

    (
    kmalloc(sizeof(THING) * C2, ...)
    |
    kmalloc(sizeof(TYPE) * C2, ...)
    |
    kmalloc(C1 * C2 * C3, ...)
    |
    kmalloc(C1 * C2, ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * (E2)
    + E2, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * E2
    + E2, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * (E2)
    + E2, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * E2
    + E2, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - (E1) * E2
    + E1, E2
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - (E1) * (E2)
    + E1, E2
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - E1 * E2
    + E1, E2
    , ...)
    )

    Signed-off-by: Kees Cook

    Kees Cook
     

05 Jun, 2018

1 commit

  • calc_target() isn't supposed to fail with anything but POOL_DNE, in
    which case we report that the pool doesn't exist and fail the request
    with -ENOENT. Doing this for -ENOMEM is at the very least confusing
    and also harmful -- as the preceding requests complete, a short-lived
    locator string allocation is likely to succeed after a wait.

    (We used to call ceph_object_locator_to_pg() for a pi lookup. In
    theory that could fail with -ENOENT, hence the "ret != -ENOENT" warning
    being removed.)

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     

02 Apr, 2018

3 commits


02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to apply to
    a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing
    SPDX tag:value files created by Philippe Ombredanne. Philippe
    prepared the base worksheet, and did an initial spot review of a few
    thousand files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

20 Sep, 2017

1 commit

  • This reverts most of commit f53b7665c8ce ("libceph: upmap semantic
    changes").

    We need to prevent duplicates in the final result. For example, we
    can currently take

    [1,2,3] and apply [(1,2)] and get [2,2,3]

    or

    [1,2,3] and apply [(3,2)] and get [1,2,2]

    The rest of the system is not prepared to handle duplicates in the
    result set like this.

    The reverted piece was intended to allow

    [1,2,3] and [(1,2),(2,1)] to get [2,1,3]

    to reorder primaries. First, this bidirectional swap is hard to
    implement in a way that also prevents dups. For example, [1,2,3] and
    [(1,4),(2,3),(3,4)] would give [4,3,4], but if we just drop the last
    step we'd have [4,3,3], which is also invalid, etc. Simpler to just not
    handle bidirectional swaps. In practice, they are not needed: if you
    just want to choose a different primary then use primary_affinity, or
    pg_upmap (not pg_upmap_items).
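
    A rough sketch of the restored, unidirectional semantics (hypothetical
    helper operating on a plain array; the real apply_upmap() walks the
    osdmap's pg_upmap_items mappings): a (from, to) item is applied only
    if "from" is present and "to" is not already in the set, which is what
    rules out duplicates.

        static void apply_upmap_items(int *osds, int len,
                                      const int (*pairs)[2], int npairs)
        {
            int i, j;

            for (i = 0; i < npairs; i++) {
                int from = pairs[i][0], to = pairs[i][1];
                int pos = -1;
                bool dup = false;

                for (j = 0; j < len; j++) {
                    if (osds[j] == to)
                        dup = true;              /* would create a duplicate */
                    if (osds[j] == from)
                        pos = j;
                }
                if (pos >= 0 && !dup)
                    osds[pos] = to;
            }
        }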

    Cc: stable@vger.kernel.org # 4.13
    Link: http://tracker.ceph.com/issues/21410
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     

01 Aug, 2017

4 commits


17 Jul, 2017

3 commits


07 Jul, 2017

15 commits

  • Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • If there is no crush_choose_arg_map for a given pool, a NULL pointer is
    passed to preserve existing crush_do_rule() behavior.

    Reflects ceph.git commits 55fb91d64071552ea1bc65ab4ea84d3c8b73ab4b,
    dbe36e08be00c6519a8c89718dd47b0219c20516.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • bucket_straw2_choose needs to use weights that may be different from
    item_weights, for instance to compensate for an uneven distribution
    caused by a low number of values, or to fix the probability bias
    introduced by conditional probabilities (see
    http://tracker.ceph.com/issues/15653 for more information).

    We introduce a weight_set for each straw2 bucket to set the desired
    weight for a given item at a given position. The weight of a given
    item when picking the first replica (first position) may be different
    from its weight when picking the second replica (second position).
    For instance, the weight matrix for a given bucket containing items
    3, 7 and 13 could be as follows:

                   position 0   position 1

    item  3          0x10000     0x100000
    item  7          0x40000      0x10000
    item 13          0x40000      0x10000

    When crush_do_rule picks the first of two replicas (position 0),
    items 7 and 13 are four times more likely to be chosen by
    bucket_straw2_choose than item 3. When choosing the second replica
    (position 1), item 3 is sixteen times more likely to be chosen than
    items 7 and 13.

    By default the weight_set of each bucket exactly matches the content of
    item_weights for each position to ensure backward compatibility.

    bucket_straw2_choose compares items by using their id. The same ids are
    also used to index buckets and they must be unique. For each item in a
    bucket an array of ids can be provided for placement purposes and they
    are used instead of the ids. If no replacement ids are provided, the
    legacy behavior is preserved.
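
    A simplified sketch of how the override is consulted (field names
    approximate the new crush_choose_arg / weight_set structures; the
    real lookup also handles the id remapping mentioned above):

        static __u32 item_weight(const struct crush_bucket_straw2 *b,
                                 const struct crush_choose_arg *arg,
                                 int item_index, int position)
        {
            if (arg && arg->weight_set && position < arg->weight_set_size)
                return arg->weight_set[position].weights[item_index];
            return b->item_weights[item_index];  /* legacy behaviour */
        }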

    Reflects ceph.git commit 19537a450fd5c5a0bb8b7830947507a76db2ceca.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Previously, pg_to_raw_osds() didn't filter for existent OSDs because
    raw_to_up_osds() would filter for "up" ("up" is predicated on "exists")
    and raw_to_up_osds() was called directly after pg_to_raw_osds(). Now,
    with the apply_upmap() call in there, nonexistent OSDs in pg_to_raw_osds()
    output can affect apply_upmap(). Introduce remove_nonexistent_osds()
    to deal with that.
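
    Roughly, the new step compacts the raw OSD vector in place, dropping
    entries for OSDs that no longer exist (a sketch; the actual helper
    also keeps the remaining entries in order, as done here):

        static void remove_nonexistent_osds(struct ceph_osdmap *map,
                                            struct ceph_osds *raw)
        {
            int i, len = 0;

            for (i = 0; i < raw->size; i++) {
                if (ceph_osd_exists(map, raw->osds[i]))
                    raw->osds[len++] = raw->osds[i];
            }
            raw->size = len;
        }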

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Move raw_pg_to_pg() call out of get_temp_osds() and into
    ceph_pg_to_up_acting_osds(), for upcoming apply_upmap().

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • pg_temp and pg_upmap encodings are the same (PG -> array of osds),
    except for the incremental remove: it's an empty mapping in new_pg_temp
    for pg_temp and a separate old_pg_upmap set for pg_upmap. (This isn't
    to allow for empty pg_upmap mappings -- apparently, pg_temp just wasn't
    looked at as an example for pg_upmap encoding.)

    Reuse __decode_pg_temp() for decoding pg_upmap and new_pg_upmap.
    __decode_pg_temp() stores into pg_temp union member, but since pg_upmap
    union member is identical, reading through pg_upmap later is OK.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Some of these won't be as efficient as they could be (e.g.
    ceph_decode_skip_set(... 32 ...) could advance by len * sizeof(u32)
    once instead of advancing by sizeof(u32) len times), but that's fine
    and not worth a bunch of extra macro code.

    Replace skip_name_map() with ceph_decode_skip_map as an example.
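
    For illustration, the per-element pattern looks roughly like this
    (simplified, with a hypothetical decode of the element count; the
    real macros also bounds-check against the end of the buffer):

        u32 i, n = decode_count(&p);             /* hypothetical; p is the decode cursor */

        for (i = 0; i < n; i++)
            p += sizeof(u32);                    /* a single p += n * sizeof(u32) would do */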

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Switch to DEFINE_RB_FUNCS2-generated {insert,lookup,erase}_pg_mapping().

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Make __{lookup,remove}_pg_mapping() look like their ceph_spg_mapping
    counterparts: take const struct ceph_pg *.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Note that ceph_osd_request_target fields are updated regardless of
    RESEND_ON_SPLIT.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Store both raw pgid and actual spgid in ceph_osd_request_target.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • The old (v15) pi->last_force_request_resend has been repurposed to
    make pre-RESEND_ON_SPLIT clients that don't check for PG splits but do
    obey pi->last_force_request_resend resend on splits. See ceph.git
    commit 189ca7ec6420 ("mon/OSDMonitor: make pre-luminous clients resend
    ops on split").

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov