06 Oct, 2016

2 commits

  • Remove the extra x1 variable; it's just a temporary placeholder that
    clutters the code unnecessarily.

    Reflects ceph.git commit 0d19408d91dd747340d70287b4ef9efd89e95c6b.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Use __builtin_clz() supported by GCC and Clang to figure out
    how many bits we should shift instead of shifting by a bit
    in a loop until the value gets normalized. Improves performance
    of this function by up to 3x in worst-case scenario and overall
    straw2 performance by ~10%.

    Reflects ceph.git commit 110de33ca497d94fc4737e5154d3fe781fa84a0a.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
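
As a rough illustration of the __builtin_clz() change described above, the sketch below normalizes a 32-bit value so that its top bit is set, once with a bit-by-bit loop and once with a single shift. The helper names and the choice of bit 31 as the normalization target are assumptions made for this example; the real function normalizes to its own fixed-point layout.

```c
#include <stdint.h>

/* Loop version: shift left one bit at a time until bit 31 is set. */
static uint32_t normalize_loop(uint32_t x, int *shift)
{
	int n = 0;

	if (x == 0) {		/* nothing to normalize */
		*shift = 0;
		return 0;
	}
	while (!(x & 0x80000000u)) {
		x <<= 1;
		n++;
	}
	*shift = n;
	return x;
}

/* clz version: compute the shift count once, then shift in one step. */
static uint32_t normalize_clz(uint32_t x, int *shift)
{
	int n;

	if (x == 0) {		/* __builtin_clz(0) is undefined */
		*shift = 0;
		return 0;
	}
	n = __builtin_clz(x);	/* number of leading zero bits, 0..31 */
	*shift = n;
	return x << n;
}
```

Both helpers return the same value and shift count for every input; the loop version costs one iteration per leading zero bit, which is where the worst-case saving comes from.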
     

05 Feb, 2016

3 commits

  • Add a tunable to fix the bug that chooseleaf may cause unnecessary pg
    migrations when some device fails.

    Reflects ceph.git commit fdb3f664448e80d984470f32f04e2e6f03ab52ec.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     
  • Ensure that the take argument is a valid bucket ID before indexing the
    buckets array.

    Reflects ceph.git commit 93ec538e8a667699876b72459b8ad78966d89c61.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     
  • We were indexing the buckets array without verifying the index was
    within the [0,max_buckets) range. This could happen because
    a multistep rule does not have enough buckets and has CRUSH_ITEM_NONE
    for an intermediate result, which would feed in CRUSH_ITEM_NONE and
    make us crash.

    Reflects ceph.git commit 976a24a326da8931e689ee22fce35feab5b67b76.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
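
A minimal sketch of the range check described above, assuming the usual CRUSH convention that bucket IDs are negative and map to array index -1 - id. The helper name bucket_index() is hypothetical, and the CRUSH_ITEM_NONE value is stated here as an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

#define CRUSH_ITEM_NONE 0x7fffffff	/* assumed sentinel value */

/*
 * Translate an item ID into an index into the buckets array.
 * Returns false when the ID does not name a slot in [0, max_buckets),
 * which covers devices (id >= 0) and CRUSH_ITEM_NONE.
 */
static bool bucket_index(int32_t item, int32_t max_buckets, int32_t *bno)
{
	int64_t idx = -1 - (int64_t)item;	/* 64-bit to avoid overflow */

	if (idx < 0 || idx >= max_buckets)
		return false;
	*bno = (int32_t)idx;
	return true;
}
```

A caller would skip the item (rather than index the array) whenever bucket_index() returns false.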
     

25 Jun, 2015

2 commits

  • .. up to ceph.git commit 1db1abc8328d ("crush: eliminate ad hoc diff
    between kernel and userspace"). This fixes a bunch of recently pulled
    coding style issues and makes includes a bit cleaner.

    A patch "crush:Make the function crush_ln static" from Nicholas Krause
    is folded in, as crush_ln() has been made static in userspace as well.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Verify that the 'take' argument is a valid device or bucket.
    Otherwise ignore it (do not add the value to the working vector).

    Reflects ceph.git commit 9324d0a1af61e1c234cc48e2175b4e6320fff8f4.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     

22 Apr, 2015

3 commits

  • This is an improved straw bucket that correctly avoids any data movement
    between items A and B when neither A nor B's weights are changed. Said
    differently, if we adjust the weight of item C (including adding it anew
    or removing it completely), we will only see inputs move to or from C,
    never between other items in the bucket.

    Notably, there is no intermediate scaling factor that needs to be
    calculated. The mapping function is a simple function of the item weights.

    The commits below were squashed together into this one (mostly to avoid
    adding and then yanking ~6000 lines' worth of crush_ln_table):

    - crush: add a straw2 bucket type
    - crush: add crush_ln to calculate nature log efficently
    - crush: improve straw2 adjustment slightly
    - crush: change crush_ln to provide 32 more digits
    - crush: fix crush_get_bucket_item_weight and bucket destroy for straw2
    - crush/mapper: fix divide-by-0 in straw2
    (with div64_s64() for draw = ln / w and INT64_MIN -> S64_MIN - need
    to create a proper compat.h in ceph.git)

    Reflects ceph.git commits 242293c908e923d474910f2b8203fa3b41eb5a53,
    32a1ead92efcd351822d22a5fc37d159c65c1338,
    6289912418c4a3597a11778bcf29ed5415117ad9,
    35fcb04e2945717cf5cfe150b9fa89cb3d2303a1,
    6445d9ee7290938de1e4ee9563912a6ab6d8ee5f,
    b5921d55d16796e12d66ad2c4add7305f9ce2353.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
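
The property described above (changing C's weight only moves inputs to or from C) can be sketched in floating point. This is not the kernel's fixed-point code: mix32() is a hypothetical stand-in for the CRUSH hash, ln_unit() is a small series approximation standing in for crush_ln(), and item IDs are assumed to be non-negative. Each item draws ln(u) / weight with u in (0, 1], and the largest draw wins.

```c
#include <stdint.h>

/* Hypothetical 32-bit mixer; NOT the hash CRUSH actually uses. */
static uint32_t mix32(uint32_t a, uint32_t b)
{
	uint32_t h = (a * 0x9e3779b1u) ^ (b + 0x7f4a7c15u);

	h ^= h >> 16;
	h *= 0x85ebca6bu;
	h ^= h >> 13;
	h *= 0xc2b2ae35u;
	h ^= h >> 16;
	return h;
}

/* ln(k / 65536) for k in 1..65536, via ln(m) = 2 * atanh((m-1)/(m+1)). */
static double ln_unit(uint32_t k)
{
	double m = (double)k, t, t2, term, sum = 0.0;
	int e = 0, i;

	while (m >= 2.0) {	/* reduce to m in [1, 2) */
		m /= 2.0;
		e++;
	}
	t = (m - 1.0) / (m + 1.0);
	t2 = t * t;
	term = t;
	for (i = 1; i < 40; i += 2) {
		sum += term / i;
		term *= t2;
	}
	return 2.0 * sum + (e - 16) * 0.6931471805599453;	/* ln 2 */
}

/* Return the ID with the largest draw, or -1 if no item has weight. */
static int straw2_choose(const int *ids, const double *weights, int n,
			 uint32_t x)
{
	int i, best = -1;
	double draw, best_draw = 0.0;

	for (i = 0; i < n; i++) {
		if (weights[i] <= 0.0)
			continue;	/* zero-weight items never win */
		/* draw depends only on (x, id, weight of this item) */
		draw = ln_unit((mix32(x, (uint32_t)ids[i]) & 0xffff) + 1)
			/ weights[i];
		if (best < 0 || draw > best_draw) {
			best = i;
			best_draw = draw;
		}
	}
	return best < 0 ? -1 : ids[best];
}
```

Because an item's draw depends only on the input, its own ID, and its own weight, raising C's weight leaves every other draw untouched, so an input either stays where it was or moves to C.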
     
  • CRUSH temporary buffers are allocated according to the replica size
    configured by the user. When the rule would select more final OSDs
    than there are replicas, the buffers overlap, which causes a crash.
    Now at most num-rep OSDs are selected, even if the rule allows more.

    Reflects ceph.git commits 6b4d1aa99718e3b367496326c1e64551330fabc0,
    234b066ba04976783d15ff2abc3e81b6cc06fb10.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     

05 Apr, 2014

4 commits

  • This lets you adjust the vary_r tunable on a per-rule basis.

    Reflects ceph.git commit f944ccc20aee60a7d8da7e405ec75ad1cd449fac.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Josh Durgin

    Ilya Dryomov
     
  • The current crush_choose_firstn code will re-use the same 'r' value for
    the recursive call. That means that if we are hitting a collision or
    rejection for some reason (say, an OSD that is marked out) and need to
    retry, we will keep making the same (bad) choice in that recursive
    selection.

    Introduce a tunable that fixes that behavior by incorporating the parent
    'r' value into the recursive starting point, so that a different path
    will be taken in subsequent placement attempts.

    Note that this was done from the get-go for the new crush_choose_indep
    algorithm.

    This was exposed by a user who was seeing PGs stuck in active+remapped
    after reweight-by-utilization because the up set mapped to a single OSD.

    Reflects ceph.git commit a8e6c9fbf88bad056dd05d3eb790e98a5e43451a.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Josh Durgin

    Ilya Dryomov
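
One way to picture the tunable described above is as a function from the parent's 'r' to the starting 'r' of the recursive call. The sketch below is an assumption about the shape of that mapping, not a copy of the mapper code: with the tunable off the recursion always starts at 0, and with it on the parent's 'r' is shifted right by (vary_r - 1). The helper name is hypothetical.

```c
/* Starting 'r' for the recursive descent, given the parent's 'r'. */
static unsigned int recursive_start_r(unsigned int r, unsigned int vary_r)
{
	if (vary_r == 0)
		return 0;		/* legacy: same start on every retry */
	if (vary_r > 32)
		return 0;		/* avoid an out-of-range shift */
	return r >> (vary_r - 1);	/* vary_r == 1 passes r through */
}
```

With the tunable off, every parent retry repeats the same recursive choice; with it on, successive parent retries start the recursion from different values.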
     
  • These two fields are misnomers; they are *retry* counts.

    Reflects ceph.git commit f17caba8ae0cad7b6f8f35e53e5f73b444696835.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Josh Durgin

    Ilya Dryomov
     
  • Back in 27f4d1f6bc32c2ed7b2c5080cbd58b14df622607 we refactored the CRUSH
    code to allow adjustment of the retry counts on a per-pool basis. That
    commit had an off-by-one bug: the previous "tries" counter was a *retry*
    count, not a *try* count, but the new code was passing in 1 meaning
    there should be no retries.

    Fix the ftotal vs tries comparison to use < instead of <=.

    Reviewed-by: Josh Durgin

    Ilya Dryomov
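
The off-by-one can be reproduced with a toy retry loop in which every attempt fails. The function name and the loop shape are assumptions for illustration, and the "inclusive" variant assumes the comparison being replaced was <=.

```c
/*
 * Count how many attempts are made when every attempt fails.
 * inclusive != 0 retries while ftotal <= tries (treats 'tries' as a
 * retry count); inclusive == 0 retries while ftotal < tries (treats
 * 'tries' as a total try count).
 */
static unsigned int attempts_made(unsigned int tries, int inclusive)
{
	unsigned int ftotal = 0;	/* failures so far */
	unsigned int attempts = 0;

	for (;;) {
		attempts++;
		ftotal++;		/* model: every attempt fails */
		if (inclusive ? ftotal <= tries : ftotal < tries)
			continue;	/* retry */
		break;			/* give up */
	}
	return attempts;
}
```

With tries = 1 the strict comparison makes exactly one attempt, while the inclusive one makes two, which is the mismatch between a try count and a retry count.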
     

01 Jan, 2014

19 commits

  • Reflects ceph.git commit 8b38f10bc2ee3643a33ea5f9545ad5c00e4ac5b4.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     
  • Reflects ceph.git commit ea3a0bb8b773360d73b8b77fa32115ef091c9857.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     
  • This allows all of the tunables to be overridden by a specific rule.

    Reflects ceph.git commits d129e09e57fbc61cfd4f492e3ee77d0750c9d292,
    0497db49e5973b50df26251ed0e3f4ac7578e66e.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     
  • The legacy behavior is to make the normal number of tries for the
    recursive chooseleaf call. The descend_once tunable changed this to
    make a single try and bail if we get a reject (note that it is
    impossible to collide in the recursive case).

    The new set_chooseleaf_tries lets you select the number of recursive
    chooseleaf attempts for indep mode, or default to 1. Use the same
    behavior for firstn, except default to total_tries when the legacy
    tunables are set (for compatibility). This makes the rule step
    override the (new) default of 1 recursive attempt, keeping behavior
    consistent with indep mode.

    Reflects ceph.git commit 685c6950ef3df325ef04ce7c986e36ca2514c5f1.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
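
The defaults described above can be written out as two small selection functions. The function and parameter names are hypothetical; a value of 0 for choose_leaf_tries is assumed to mean "not set by the rule".

```c
/* Recursive chooseleaf attempts for firstn mode. */
static int recurse_tries_firstn(int choose_leaf_tries, int descend_once,
				int choose_tries)
{
	if (choose_leaf_tries > 0)
		return choose_leaf_tries;	/* rule step overrides */
	if (descend_once)
		return 1;			/* new default */
	return choose_tries;			/* legacy tunables */
}

/* Recursive chooseleaf attempts for indep mode. */
static int recurse_tries_indep(int choose_leaf_tries)
{
	return choose_leaf_tries > 0 ? choose_leaf_tries : 1;
}
```

The only difference between the two modes is the legacy fallback in firstn, kept for compatibility.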
     
  • This aligns the internal identifier names with the user-visible names in
    the decompiled crush map language.

    Reflects ceph.git commit caa0e22e15e4226c3671318ba1f61314bf6da2a6.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     
  • Since we can specify the recursive retries in a rule, we may as well also
    specify the non-recursive tries too for completeness.

    Reflects ceph.git commit d1b97462cffccc871914859eaee562f2786abfd1.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     
  • Parameterize the attempts for the _firstn choose method, and apply the
    rule-specified tries count to firstn mode as well. Note that we have
    slightly different behavior here than with indep:

    If the tries value is not specified for firstn, we pass through the
    normal attempt count. This maintains compatibility with legacy behavior.
    Note that this is usually *not* actually N^2 work, though, because of the
    descend_once tunable. However, descend_once is unfortunately *not* the
    same thing as 1 chooseleaf try because it is only checked on a reject but
    not on a collision. Sigh.

    In contrast, for indep, if tries is not specified we default to 1
    recursive attempt, because that is simply more sane, and we have the
    option to do so. The descend_once tunable has no effect for indep.

    Reflects ceph.git commit 64aeded50d80942d66a5ec7b604ff2fcbf5d7b63.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     
  • Explicitly control the number of sample attempts, and allow the number of
    tries in the recursive call to be explicitly controlled via the rule. This
    is important because the amount of time we want to spend looking for a
    solution may be rule dependent (e.g., higher for the wide indep pool than
    the rep pools).

    (We should do the same for the other tunables, by the way!)

    Reflects ceph.git commit c43c893be872f709c787bc57f46c0e97876ff681.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     
  • Pass down the parent's 'r' value so that we will sample different values in
    the recursive call when the parent tries multiple times. This avoids doing
    useless work (calling multiple times and trying the same values).

    Reflects ceph.git commit 2731d3030d7a3e80922b7f1b7756f9a4a124bac5.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     
  • Pass numrep (the width of the result) separately from the number of results
    we want *this* iteration. This makes things less awkward when we do a
    recursive call (for chooseleaf) and want only one item.

    Reflects ceph.git commit 1b567ee08972f268c11b43fc881e57b5984dd08b.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     
  • Now that indep is handled by crush_choose_indep, rename crush_choose to
    crush_choose_firstn and remove all the conditionals. This ends up
    stripping out *lots* of code.

    Note that it *also* makes it obvious that the shenanigans we were playing
    with r' for uniform buckets were broken for firstn mode. This appears to
    have happened waaaay back in commit dae8bec9 (or earlier)... 2007.

    Reflects ceph.git commit 94350996cb2035850bcbece6a77a9b0394177ec9.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     
  • Reflects ceph.git commit 4551fee9ad89d0427ed865d766d0d44004d3e3e1.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     
  • Reflects ceph.git commit 86e978036a4ecbac4c875e7c00f6c5bbe37282d3.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     
  • For firstn mode, if we fail to make a valid placement choice, we just
    continue and return a short result to the caller. For indep mode, however,
    we need to make the position stable, and return an undefined value on
    failed placements to avoid shifting later results to the left.

    Reflects ceph.git commit b1d4dd4eb044875874a1d01c01c7d766db5d0a80.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
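
The difference between the two result shapes can be sketched as follows. The input convention (a negative entry marks a failed placement), the helper names, and the CRUSH_ITEM_NONE value used as the placeholder are all assumptions made for this example.

```c
#define CRUSH_ITEM_NONE 0x7fffffff	/* assumed placeholder value */

/* firstn: drop failed placements, so the result may be short. */
static int fill_firstn(const int *picks, int n, int *out)
{
	int i, len = 0;

	for (i = 0; i < n; i++)
		if (picks[i] >= 0)
			out[len++] = picks[i];
	return len;
}

/* indep: keep every position, marking failures with a placeholder. */
static int fill_indep(const int *picks, int n, int *out)
{
	int i;

	for (i = 0; i < n; i++)
		out[i] = picks[i] >= 0 ? picks[i] : CRUSH_ITEM_NONE;
	return n;
}
```

For picks {3, failed, 7}, firstn yields {3, 7} while indep yields {3, NONE, 7}, so the third position still holds 7.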
     
  • This is only present to size the temporary scratch arrays that we put on
    the stack. Let the caller allocate them as they wish and remove the
    limitation.

    Reflects ceph.git commit 1cfe140bf2dab99517589a82a916f4c75b9492d1.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     
  • Reflects ceph.git commit 3cef755428761f2481b1dd0e0fbd0464ac483fc5.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     
  • Reflects ceph.git commit e7d47827f0333c96ad43d257607fb92ed4176550.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     
  • Reflects ceph.git commit 43a01c9973c4b83f2eaa98be87429941a227ddde.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     
  • Pass the size of the weight vector into crush_do_rule() to ensure that we
    don't access values past the end. This can happen if the caller misbehaves
    and passes a weight vector that is smaller than max_devices.

    Currently the monitor tries to prevent that from happening, but this will
    gracefully tolerate previous bad osdmaps that got into this state. It's
    also a bit more defensive.

    Reflects ceph.git commit 5922e2c2b8335b5e46c9504349c3a55b7434c01a.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
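
A minimal sketch of the defensive check described above: a device whose ID falls outside the supplied weight vector is treated as out instead of being looked up. The helper name is hypothetical, and the probabilistic handling of partial weights is deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified "is this device out?" test. Only the bounds check and
 * the zero-weight case are modeled here.
 */
static bool is_out_simple(const uint32_t *weight, int weight_max, int item)
{
	if (item < 0 || item >= weight_max)
		return true;	/* never read past the end of the vector */
	return weight[item] == 0;
}
```

A weight vector shorter than max_devices then simply makes the missing devices ineligible.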
     

18 Jan, 2013

2 commits

  • This saves us some cycles, but does not affect the placement result at
    all.

    This corresponds to ceph.git commit 4abb53d4f.

    Signed-off-by: Sage Weil

    Sage Weil
     
  • Add libceph support for a new CRUSH tunable recently added to Ceph servers.

    Consider the CRUSH rule
    step chooseleaf firstn 0 type <node_type>

    This rule means that replicas will be chosen in a manner such that
    each chosen leaf's branch will contain a unique instance of <node_type>.

    When an object is re-replicated after a leaf failure, if the CRUSH map uses
    a chooseleaf rule the remapped replica ends up under the <node_type> bucket
    that held the failed leaf. This causes uneven data distribution across the
    storage cluster, to the point that when all the leaves but one fail under a
    particular <node_type> bucket, that remaining leaf holds all the data from
    its failed peers.

    This behavior also limits the number of peers that can participate in the
    re-replication of the data held by the failed leaf, which increases the
    time required to re-replicate after a failure.

    For a chooseleaf CRUSH rule, the tree descent has two steps: call them the
    inner and outer descents.

    If the tree descent down to <node_type> is the outer descent, and the descent
    from <node_type> down to a leaf is the inner descent, the issue is that a
    down leaf is detected on the inner descent, so only the inner descent is
    retried.

    In order to disperse re-replicated data as widely as possible across a
    storage cluster after a failure, we want to retry the outer descent. So,
    fix up crush_choose() to allow the inner descent to return immediately on
    choosing a failed leaf. Wire this up as a new CRUSH tunable.

    Note that after this change, for a chooseleaf rule, if the primary OSD
    in a placement group has failed, choosing a replacement may result in
    one of the other OSDs in the PG colliding with the new primary. That
    OSD's data for that PG then needs to be moved as well. This
    seems unavoidable but should be relatively rare.

    This corresponds to ceph.git commit 88f218181a9e6d2292e2697fc93797d0f6d6e5dc.

    Signed-off-by: Jim Schutt
    Reviewed-by: Sage Weil

    Jim Schutt
     

31 Jul, 2012

1 commit

  • The server side recently added support for tuning some magic
    crush variables. Decode these variables if they are present, or use the
    default values if they are not present.

    Corresponds to ceph.git commit 89af369c25f274fe62ef730e5e8aad0c54f1e5a5.

    Signed-off-by: caleb miles
    Reviewed-by: Sage Weil
    Reviewed-by: Alex Elder
    Reviewed-by: Yehuda Sadeh

    Sage Weil
     

31 May, 2012

1 commit

  • Pull ceph updates from Sage Weil:
    "There are some updates and cleanups to the CRUSH placement code, a bug
    fix with incremental maps, several cleanups and fixes from Josh Durgin
    in the RBD block device code, a series of cleanups and bug fixes from
    Alex Elder in the messenger code, and some miscellaneous bounds
    checking and gfp cleanups/fixes."

    Fix up trivial conflicts in net/ceph/{messenger.c,osdmap.c} due to the
    networking people preferring "unsigned int" over just "unsigned".

    * git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (45 commits)
    libceph: fix pg_temp updates
    libceph: avoid unregistering osd request when not registered
    ceph: add auth buf in prepare_write_connect()
    ceph: rename prepare_connect_authorizer()
    ceph: return pointer from prepare_connect_authorizer()
    ceph: use info returned by get_authorizer
    ceph: have get_authorizer methods return pointers
    ceph: ensure auth ops are defined before use
    ceph: messenger: reduce args to create_authorizer
    ceph: define ceph_auth_handshake type
    ceph: messenger: check return from get_authorizer
    ceph: messenger: rework prepare_connect_authorizer()
    ceph: messenger: check prepare_write_connect() result
    ceph: don't set WRITE_PENDING too early
    ceph: drop msgr argument from prepare_write_connect()
    ceph: messenger: send banner in process_connect()
    ceph: messenger: reset connection kvec caller
    libceph: don't reset kvec in prepare_write_banner()
    ceph: ignore preferred_osd field
    ceph: fully initialize new layout
    ...

    Linus Torvalds
     

08 May, 2012

3 commits