12 Oct, 2020

6 commits

  • con->out_msg must be cleared on Policy::stateful_server
    (!CEPH_MSG_CONNECT_LOSSY) faults. Not doing so botches the
    reconnection attempt, because after writing the banner the
    messenger moves on to writing the data section of that message
    (either from where it got interrupted by the connection reset or
    from the beginning) instead of writing struct ceph_msg_connect.
    This results in a bizarre error message because the server
    sends CEPH_MSGR_TAG_BADPROTOVER but we think we wrote struct
    ceph_msg_connect:

    libceph: mds0 (1)172.21.15.45:6828 socket error on write
    ceph: mds0 reconnect start
    libceph: mds0 (1)172.21.15.45:6829 socket closed (con state OPEN)
    libceph: mds0 (1)172.21.15.45:6829 protocol version mismatch, my 32 != server's 32
    libceph: mds0 (1)172.21.15.45:6829 protocol version mismatch

    AFAICT this bug goes back to the dawn of the kernel client.
    The reason it survived for so long is that only MDS sessions
    are stateful and only two MDS messages have a data section:
    CEPH_MSG_CLIENT_RECONNECT (always, but reconnecting is rare)
    and CEPH_MSG_CLIENT_REQUEST (only when xattrs are involved).
    The connection has to get reset precisely when such a message
    is being sent -- in this case it was the former.

    Cc: stable@vger.kernel.org
    Link: https://tracker.ceph.com/issues/47723
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Jeff Layton

    Ilya Dryomov
     
  • Match the server side logs.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • The queued con->work can start executing (and therefore logging)
    before we get to this "con->work has been queued" message, making
    the logs confusing. Move it up, with the meaning of "con->work
    is about to be queued".

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Replace a global map->crush_workspace (protected by a global mutex)
    with a list of workspaces, up to the number of CPUs + 1.

    This is based on a patch from Robin Geuze.
    Robin and his team have observed a 10-20% increase in IOPS on all
    queue depths and lower CPU usage as well on a high-end all-NVMe
    100GbE cluster.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     

03 Oct, 2020

1 commit

  • In libceph, ceph_tcp_sendpage() performs the following check before
    handing the page to the network layer's zero-copy sendpage method:

    if (page_count(page) >= 1 && !PageSlab(page))

    This check is exactly what sendpage_ok() does. This patch replaces the
    open-coded check with sendpage_ok() as a code cleanup.

    Signed-off-by: Coly Li
    Acked-by: Jeff Layton
    Cc: Ilya Dryomov
    Signed-off-by: David S. Miller

    Coly Li
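
The check described above can be illustrated in userspace with a small sketch. It does not use the kernel's struct page: mock_page, mock_sendpage_ok() and choose_send_path() are hypothetical names, and the copying fallback is an assumption about what a caller would do when the check fails.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct page; only the two
 * properties the check looks at are modelled. */
struct mock_page {
    int refcount;   /* what page_count() would return */
    bool is_slab;   /* what PageSlab() would return */
};

/* Mirrors the open-coded condition quoted in the commit message. */
static bool mock_sendpage_ok(const struct mock_page *page)
{
    assert(page != NULL);
    return page->refcount >= 1 && !page->is_slab;
}

/* Assumed caller behaviour: zero-copy when allowed, copy otherwise. */
enum send_path { SEND_ZEROCOPY, SEND_COPY };

static enum send_path choose_send_path(const struct mock_page *page)
{
    return mock_sendpage_ok(page) ? SEND_ZEROCOPY : SEND_COPY;
}
```

A page with a zero reference count or a slab-backed page takes the copying path in this sketch; everything else is eligible for zero-copy.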
     

24 Aug, 2020

1 commit

  • Replace the existing /* fall through */ comments and their variants with
    the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
    fall-through markings where applicable.

    [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

    Signed-off-by: Gustavo A. R. Silva

    Gustavo A. R. Silva
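
As a rough userspace illustration of the conversion, the sketch below defines a stand-in for the fallthrough pseudo-keyword and uses it in a switch. The macro definition assumes GCC 7+ or a recent Clang and degrades to an empty statement elsewhere; cleanup_steps_from() is a hypothetical function invented for the example.

```c
#include <assert.h>

/* Userspace stand-in for the kernel's fallthrough pseudo-keyword.
 * Assumption: the compiler understands the __fallthrough__ statement
 * attribute; otherwise the macro expands to an empty statement. */
#if defined(__has_attribute)
# if __has_attribute(__fallthrough__)
#  define fallthrough __attribute__((__fallthrough__))
# endif
#endif
#ifndef fallthrough
# define fallthrough do {} while (0)
#endif

/* Hypothetical example: count how many cleanup steps run from a stage. */
static int cleanup_steps_from(int stage)
{
    int steps = 0;

    assert(stage >= 0);
    switch (stage) {
    case 3:
        steps++;
        fallthrough;    /* intentional: stage 3 also does stage 2 work */
    case 2:
        steps++;
        fallthrough;    /* intentional: stage 2 also does stage 1 work */
    case 1:
        steps++;
        break;          /* a case ending in break needs no marking */
    default:
        break;
    }
    return steps;
}
```

Cases that end in break or return need no annotation, which is the kind of unnecessary marking the commit removes.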
     

03 Aug, 2020

4 commits

  • Rationale:
    Reduces attack surface on kernel devs opening the links for MITM
    as HTTPS traffic is much harder to manipulate.

    Deterministic algorithm:
    For each file:
      If not .svg:
        For each line:
          If doesn't contain `\bxmlns\b`:
            For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
              If neither `\bgnu\.org/license`, nor `\bmozilla\.org/MPL\b`:
                If both the HTTP and HTTPS versions
                return 200 OK and serve the same content:
                  Replace HTTP with HTTPS.

    [ idryomov: Do the same for the CRUSH paper and replace
    ceph.newdream.net with ceph.io. ]

    Signed-off-by: Alexander A. Klimov
    Reviewed-by: Ilya Dryomov
    Signed-off-by: Ilya Dryomov

    Alexander A. Klimov
     
  • The caller can just ignore the return. No need for this wrapper that
    just casts the other function to void.

    [ idryomov: argument alignment ]

    Signed-off-by: Jeff Layton
    Reviewed-by: Ilya Dryomov
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Instead of copying just oloc, oid and flags, copy the entire
    linger target. This is more for consistency than anything else,
    as send_linger() -> submit_request() -> __submit_request() sends
    the request regardless of what calc_target() says (i.e. both on
    CALC_TARGET_NO_ACTION and CALC_TARGET_NEED_RESEND).

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Jeff Layton

    Ilya Dryomov
     

16 Jun, 2020

3 commits

  • Currently target_copy() is used only for sending linger pings, so
    this doesn't come up, but generally omitting used_replica can hang
    the client as we wouldn't notice the acting set change (legacy_change
    in calc_target()) or trigger a warning in handle_reply().

    Fixes: 117d96a04f00 ("libceph: support for balanced and localized reads")
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Jeff Layton

    Ilya Dryomov
     
  • Currently target_copy() is used only for sending linger pings, so
    this doesn't come up, but generally omitting recovery_deletes can
    result in unneeded resends (force_resend in calc_target()).

    Fixes: ae78dd8139ce ("libceph: make RECOVERY_DELETES feature create a new interval")
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Jeff Layton

    Ilya Dryomov
     
  • osd_req_flags is overly general and doesn't suit its only user
    (read_from_replica option) well:

    - applying osd_req_flags in account_request() affects all OSD
    requests, including linger (i.e. watch and notify). However,
    linger requests should always go to the primary even though
    some of them are reads (e.g. notify has side effects but it
    is a read because it doesn't result in mutation on the OSDs).

    - calls to class methods that are reads are allowed to go to
    the replica, but most such calls issued for "rbd map" and/or
    exclusive lock transitions are requested to be resent to the
    primary via EAGAIN, doubling the latency.

    Get rid of global osd_req_flags and set read_from_replica flag
    only on specific OSD requests instead.

    Fixes: 8ad44d5e0d1e ("libceph: read_from_replica option")
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Jeff Layton

    Ilya Dryomov
     

09 Jun, 2020

1 commit

  • Pull ceph updates from Ilya Dryomov:
    "The highlights are:

    - OSD/MDS latency and caps cache metrics infrastructure for the
    filesystem (Xiubo Li). Currently available through debugfs and will
    be periodically sent to the MDS in the future.

    - support for replica reads (balanced and localized reads) for rbd
    and the filesystem (myself). The default remains to always read
    from primary, users can opt-in with the new crush_location and
    read_from_replica options. Note that reading from replica is safe
    for general use only since Octopus.

    - support for RADOS allocation hint flags (myself). Currently used by
    rbd to propagate the compressible/incompressible hint given with
    the new compression_hint map option and ready for passing on more
    advanced hints, e.g. based on fadvise() from the filesystem.

    - support for efficient cross-quota-realm renames (Luis Henriques)

    - assorted cap handling improvements and cleanups, particularly
    untangling some of the locking (Jeff Layton)"

    * tag 'ceph-for-5.8-rc1' of git://github.com/ceph/ceph-client: (29 commits)
    rbd: compression_hint option
    libceph: support for alloc hint flags
    libceph: read_from_replica option
    libceph: support for balanced and localized reads
    libceph: crush_location infrastructure
    libceph: decode CRUSH device/bucket types and names
    libceph: add non-asserting rbtree insertion helper
    ceph: skip checking caps when session reconnecting and releasing reqs
    ceph: make sure mdsc->mutex is nested in s->s_mutex to fix dead lock
    ceph: don't return -ESTALE if there's still an open file
    libceph, rbd: replace zero-length array with flexible-array
    ceph: allow rename operation under different quota realms
    ceph: normalize 'delta' parameter usage in check_quota_exceeded
    ceph: ceph_kick_flushing_caps needs the s_mutex
    ceph: request expedited service on session's last cap flush
    ceph: convert mdsc->cap_dirty to a per-session list
    ceph: reset i_requested_max_size if file write is not wanted
    ceph: throw a warning if we destroy session with mutex still locked
    ceph: fix potential race in ceph_check_caps
    ceph: document what protects i_dirty_item and i_flushing_item
    ...

    Linus Torvalds
     

04 Jun, 2020

1 commit

  • Pull networking updates from David Miller:

    1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
    Augusto von Dentz.

    2) Add GSO partial support to igc, from Sasha Neftin.

    3) Several cleanups and improvements to r8169 from Heiner Kallweit.

    4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
    device self-test. From Andrew Lunn.

    5) Start moving away from custom driver versions, use the globally
    defined kernel version instead, from Leon Romanovsky.

    6) Support GRO via gro_cells in DSA layer, from Alexander Lobakin.

    7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.

    8) Add sriov and vf support to hinic, from Luo bin.

    9) Support Media Redundancy Protocol (MRP) in the bridging code, from
    Horatiu Vultur.

    10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.

    11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
    Dubroca. Also add ipv6 support for espintcp.

    12) Lots of ReST conversions of the networking documentation, from Mauro
    Carvalho Chehab.

    13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
    from Doug Berger.

    14) Allow to dump cgroup id and filter by it in inet_diag code, from
    Dmitry Yakunin.

    15) Add infrastructure to export netlink attribute policies to
    userspace, from Johannes Berg.

    16) Several optimizations to sch_fq scheduler, from Eric Dumazet.

    17) Fallback to the default qdisc if qdisc init fails because otherwise
    a packet scheduler init failure will make a device inoperative. From
    Jesper Dangaard Brouer.

    18) Several RISCV bpf jit optimizations, from Luke Nelson.

    19) Correct the return type of the ->ndo_start_xmit() method in several
    drivers, it's netdev_tx_t but many drivers were using
    'int'. From Yunjian Wang.

    20) Add an ethtool interface for PHY master/slave config, from Oleksij
    Rempel.

    21) Add BPF iterators, from Yonghong Song.

    22) Add cable test infrastructure, including ethtool interfaces, from
    Andrew Lunn. Marvell PHY driver is the first to support this
    facility.

    23) Remove zero-length arrays all over, from Gustavo A. R. Silva.

    24) Calculate and maintain an explicit frame size in XDP, from Jesper
    Dangaard Brouer.

    25) Add CAP_BPF, from Alexei Starovoitov.

    26) Support terse dumps in the packet scheduler, from Vlad Buslov.

    27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.

    28) Add devm_register_netdev(), from Bartosz Golaszewski.

    29) Minimize qdisc resets, from Cong Wang.

    30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
    eliminate set_fs/get_fs calls. From Christoph Hellwig.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
    selftests: net: ip_defrag: ignore EPERM
    net_failover: fixed rollback in net_failover_open()
    Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
    Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
    vmxnet3: allow rx flow hash ops only when rss is enabled
    hinic: add set_channels ethtool_ops support
    selftests/bpf: Add a default $(CXX) value
    tools/bpf: Don't use $(COMPILE.c)
    bpf, selftests: Use bpf_probe_read_kernel
    s390/bpf: Use bcr 0,%0 as tail call nop filler
    s390/bpf: Maintain 8-byte stack alignment
    selftests/bpf: Fix verifier test
    selftests/bpf: Fix sample_cnt shared between two threads
    bpf, selftests: Adapt cls_redirect to call csum_level helper
    bpf: Add csum_level helper for fixing up csum levels
    bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
    sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
    crypto/chtls: IPv6 support for inline TLS
    Crypto/chcr: Fixes a coccinile check error
    Crypto/chcr: Fixes compilations warnings
    ...

    Linus Torvalds
     

03 Jun, 2020

1 commit

  • Switch all callers to map_kernel_range, which is symmetric to the unmap
    side (as well as the _noflush versions).

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Acked-by: Peter Zijlstra (Intel)
    Cc: Christian Borntraeger
    Cc: Christophe Leroy
    Cc: Daniel Vetter
    Cc: David Airlie
    Cc: Gao Xiang
    Cc: Greg Kroah-Hartman
    Cc: Haiyang Zhang
    Cc: Johannes Weiner
    Cc: "K. Y. Srinivasan"
    Cc: Laura Abbott
    Cc: Mark Rutland
    Cc: Michael Kelley
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Robin Murphy
    Cc: Sakari Ailus
    Cc: Stephen Hemminger
    Cc: Sumit Semwal
    Cc: Wei Liu
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Heiko Carstens
    Cc: Paul Mackerras
    Cc: Vasily Gorbik
    Cc: Will Deacon
    Link: http://lkml.kernel.org/r/20200414131348.444715-17-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

02 Jun, 2020

1 commit


01 Jun, 2020

7 commits

  • Expose replica reads through read_from_replica=balance and
    read_from_replica=localize. The default is to read from primary
    (read_from_replica=no).

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Jeff Layton

    Ilya Dryomov
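
The three option values named above could be mapped to a read policy roughly as follows. This is a sketch only: the enum and parse_read_from_replica() are hypothetical names, not the kernel's option parser, and anything other than the three documented strings is treated as invalid here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical policy values, one per documented option string. */
enum demo_read_policy {
    DEMO_READ_PRIMARY,   /* read_from_replica=no (the default) */
    DEMO_READ_BALANCE,   /* read_from_replica=balance */
    DEMO_READ_LOCALIZE,  /* read_from_replica=localize */
    DEMO_READ_INVALID    /* any other value */
};

/* Translate the option value (the part after '=') into a policy. */
static enum demo_read_policy parse_read_from_replica(const char *val)
{
    assert(val != NULL);
    if (strcmp(val, "no") == 0)
        return DEMO_READ_PRIMARY;
    if (strcmp(val, "balance") == 0)
        return DEMO_READ_BALANCE;
    if (strcmp(val, "localize") == 0)
        return DEMO_READ_LOCALIZE;
    return DEMO_READ_INVALID;
}
```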
     
  • OSD-side issues with reads from replica have been resolved in
    Octopus. Reading from replica should be safe wrt. unstable or
    uncommitted state now, so add support for balanced and localized
    reads.

    There are two cases when a read from replica can't be served:

    - OSD may silently drop the request, expecting the client to
    notice that the acting set has changed and resend via the usual
    means (handled with t->used_replica)

    - OSD may return EAGAIN, expecting the client to resend to the
    primary, ignoring replica read flags (see handle_reply())

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Jeff Layton

    Ilya Dryomov
     
  • Allow expressing client's location in terms of CRUSH hierarchy as
    a set of (bucket type name, bucket name) pairs. The userspace syntax
    "crush_location = key1=value1 key2=value2" is incompatible with mount
    options and needed adaptation. Key-value pairs are separated by '|'
    and we use ':' instead of '=' to separate keys from values. So for:

    crush_location = host=foo rack=bar

    one would write:

    crush_location=host:foo|rack:bar

    As in userspace, "multipath" locations are supported, so indicating
    locality for parallel hierarchies is possible:

    crush_location=rack:foo1|rack:foo2|datacenter:bar

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Jeff Layton

    Ilya Dryomov
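
To make the adapted syntax concrete, here is a minimal userspace sketch that splits a crush_location value on '|' and ':'. The struct, the size limits and parse_crush_location() are assumptions made for the example; the kernel's own parser may validate and store the pairs differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_LOCS 8   /* illustrative limit, not taken from the kernel */
#define MAX_NAME 32  /* illustrative limit, not taken from the kernel */

/* Hypothetical container for one (bucket type name, bucket name) pair. */
struct crush_loc_pair {
    char type_name[MAX_NAME];
    char name[MAX_NAME];
};

/*
 * Split "type:name|type:name|..." into pairs.
 * Returns the number of pairs parsed, or -1 on malformed input
 * (missing ':', empty key or value, too many or too long items).
 * A single trailing '|' is tolerated in this sketch.
 */
static int parse_crush_location(const char *opt,
                                struct crush_loc_pair *out, int max)
{
    int n = 0;

    assert(opt != NULL && out != NULL);
    while (*opt) {
        size_t item_len = strcspn(opt, "|");
        const char *colon = memchr(opt, ':', item_len);
        size_t klen, vlen;

        if (!colon || n == max)
            return -1;
        klen = (size_t)(colon - opt);
        vlen = item_len - klen - 1;
        if (klen == 0 || vlen == 0 || klen >= MAX_NAME || vlen >= MAX_NAME)
            return -1;

        memcpy(out[n].type_name, opt, klen);
        out[n].type_name[klen] = '\0';
        memcpy(out[n].name, colon + 1, vlen);
        out[n].name[vlen] = '\0';
        n++;

        opt += item_len;
        if (*opt == '|')
            opt++;      /* skip the separator */
    }
    return n;
}
```

The same bucket type may appear more than once, which is how the "multipath" form with two rack entries is represented in this sketch.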
     
  • These would be matched with the provided client location to calculate
    the locality value.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Jeff Layton

    Ilya Dryomov
     
  • Needed for the next commit and useful for ceph_pg_pool_info tree as
    well. I'm leaving the asserting helper in for now, but we should look
    at getting rid of it in the future.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Jeff Layton

    Ilya Dryomov
     
  • Calculate the latency for OSD read requests. Add a new r_end_stamp
    field to struct ceph_osd_request that will hold the time at which
    the reply was received. Use that to calculate the RTT for each call,
    and divide the sum of those by the number of calls to get the average RTT.

    Keep a tally of RTT for OSD writes and number of calls to track average
    latency of OSD writes.

    URL: https://tracker.ceph.com/issues/43215
    Signed-off-by: Xiubo Li
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Xiubo Li
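
The averaging described above amounts to keeping a running sum and a call count. A minimal sketch, assuming nanosecond timestamps and using hypothetical names (latency_tally, latency_update(), latency_avg_ns()) rather than the kernel's metric structures:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-direction tally (one for reads, one for writes). */
struct latency_tally {
    uint64_t total_calls;
    uint64_t latency_sum_ns;
};

/* Account one completed request given its start and end timestamps. */
static void latency_update(struct latency_tally *t,
                           uint64_t start_ns, uint64_t end_ns)
{
    assert(t != NULL && end_ns >= start_ns);
    t->total_calls++;
    t->latency_sum_ns += end_ns - start_ns;
}

/* Average RTT in nanoseconds; 0 when nothing has been recorded yet. */
static uint64_t latency_avg_ns(const struct latency_tally *t)
{
    assert(t != NULL);
    return t->total_calls ? t->latency_sum_ns / t->total_calls : 0;
}
```

Storing the sum and the count, instead of a precomputed average, keeps the update cheap and leaves the division to whoever reads the metric.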
     
  • xdp_umem.c had overlapping changes between the 64-bit math fix
    for the calculation of npgs and the removal of the zerocopy
    memory type which got rid of the chunk_size_nohdr member.

    The mlx5 Kconfig conflict is a case where we just take the
    net-next copy of the Kconfig entry dependency as it takes on
    the ESWITCH dependency by one level of indirection which is
    what the 'net' conflicting change is trying to ensure.

    Signed-off-by: David S. Miller

    David S. Miller
     

29 May, 2020

1 commit

  • Add a helper to directly set the TCP_NODELAY sockopt from kernel space
    without going through a fake uaccess. Cleanup the callers to avoid
    pointless wrappers now that this is a simple function call.

    Signed-off-by: Christoph Hellwig
    Acked-by: Sagi Grimberg
    Acked-by: Jason Gunthorpe
    Signed-off-by: David S. Miller

    Christoph Hellwig
     

27 May, 2020

1 commit

  • OSD client should ignore cache/overlay flag if it got a redirect reply.
    Otherwise, the client hangs when the cache tier is in forward mode.

    [ idryomov: Redirects are effectively deprecated and no longer
    used or tested. The original tiering modes based on redirects
    are inherently flawed because redirects can race and reorder,
    potentially resulting in data corruption. The new proxy and
    readproxy tiering modes should be used instead of forward and
    readforward. Still marking for stable as obviously correct,
    though. ]

    Cc: stable@vger.kernel.org
    URL: https://tracker.ceph.com/issues/23296
    URL: https://tracker.ceph.com/issues/36406
    Signed-off-by: Jerry Lee
    Reviewed-by: Ilya Dryomov
    Signed-off-by: Ilya Dryomov

    Jerry Lee
     

29 Apr, 2020

1 commit


30 Mar, 2020

4 commits


23 Mar, 2020

2 commits

  • Make it so that CEPH_MSG_DATA_PAGES data item can own pages,
    fixing a bunch of memory leaks for a page vector allocated in
    alloc_msg_with_page_vector(). Currently, only watch-notify
    messages trigger this allocation, and normally the page vector
    is freed either in handle_watch_notify() or by the caller of
    ceph_osdc_notify(). But if the message is freed before that
    (e.g. if the session faults while reading in the message or
    if the notify is stale), we leak the page vector.

    This was supposed to be fixed by switching to a message-owned
    pagelist, but that never happened.

    Fixes: 1907920324f1 ("libceph: support for sending notifies")
    Reported-by: Roman Penyaev
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Roman Penyaev

    Ilya Dryomov
     
  • CEPH_OSDMAP_FULL/NEARFULL aren't set since mimic, so we need to consult
    per-pool flags as well. Unfortunately the backwards compatibility here
    is lacking:

    - the change that deprecated OSDMAP_FULL/NEARFULL went into mimic, but
    was guarded by require_osd_release >= RELEASE_LUMINOUS
    - it was subsequently backported to luminous in v12.2.2, but that makes
    no difference to clients that only check OSDMAP_FULL/NEARFULL because
    require_osd_release is not client-facing -- it is for OSDs

    Since all kernels are affected, the best we can do here is just start
    checking both map flags and pool flags and send that to stable.

    These checks are best effort, so take osdc->lock and look up pool flags
    just once. Remove the FIXME, since filesystem quotas are checked above
    and RADOS quotas are reflected in POOL_FLAG_FULL: when the pool reaches
    its quota, both POOL_FLAG_FULL and POOL_FLAG_FULL_QUOTA are set.

    Cc: stable@vger.kernel.org
    Reported-by: Yanhu Cao
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Jeff Layton
    Acked-by: Sage Weil

    Ilya Dryomov
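
The "check both map flags and pool flags" idea can be sketched as below. The bit values are illustrative placeholders, not the real CEPH_OSDMAP_* and CEPH_POOL_FLAG_* constants, and demo_flag_snapshot stands in for the two values that would be read once under osdc->lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit values only; the real constants live in the Ceph
 * headers and may differ. */
#define DEMO_OSDMAP_NEARFULL     (1u << 0)
#define DEMO_OSDMAP_FULL         (1u << 1)
#define DEMO_POOL_FLAG_NEARFULL  (1u << 2)
#define DEMO_POOL_FLAG_FULL      (1u << 3)

/* Hypothetical snapshot of both flag words, taken in one locked section. */
struct demo_flag_snapshot {
    uint32_t map_flags;
    uint32_t pool_flags;
};

/* Full if either the legacy map flag or the per-pool flag says so. */
static bool demo_is_full(const struct demo_flag_snapshot *s)
{
    assert(s != NULL);
    return (s->map_flags & DEMO_OSDMAP_FULL) ||
           (s->pool_flags & DEMO_POOL_FLAG_FULL);
}

/* Same idea for the nearfull condition. */
static bool demo_is_nearfull(const struct demo_flag_snapshot *s)
{
    assert(s != NULL);
    return (s->map_flags & DEMO_OSDMAP_NEARFULL) ||
           (s->pool_flags & DEMO_POOL_FLAG_NEARFULL);
}
```

Checking both words covers older clusters that still set the map-wide flag as well as newer ones that only set the per-pool flag.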
     

09 Feb, 2020

1 commit

  • Pull vfs file system parameter updates from Al Viro:
    "Saner fs_parser.c guts and data structures. The system-wide registry
    of syntax types (string/enum/int32/oct32/.../etc.) is gone and so is
    the horror switch() in fs_parse() that would have to grow another case
    every time something got added to that system-wide registry.

    New syntax types can be added by filesystems easily now, and their
    namespace is that of functions - not of system-wide enum members. IOW,
    they can be shared or kept private and if some turn out to be widely
    useful, we can make them common library helpers, etc., without having
    to do anything whatsoever to fs_parse() itself.

    And we already get that kind of requests - the thing that finally
    pushed me into doing that was "oh, and let's add one for timeouts -
    things like 15s or 2h". If some filesystem really wants that, let them
    do it. Without somebody having to play gatekeeper for the variants
    blessed by direct support in fs_parse(), TYVM.

    Quite a bit of boilerplate is gone. And IMO the data structures make a
    lot more sense now. -200LoC, while we are at it"

    * 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (25 commits)
    tmpfs: switch to use of invalfc()
    cgroup1: switch to use of errorfc() et.al.
    procfs: switch to use of invalfc()
    hugetlbfs: switch to use of invalfc()
    cramfs: switch to use of errofc() et.al.
    gfs2: switch to use of errorfc() et.al.
    fuse: switch to use errorfc() et.al.
    ceph: use errorfc() and friends instead of spelling the prefix out
    prefix-handling analogues of errorf() and friends
    turn fs_param_is_... into functions
    fs_parse: handle optional arguments sanely
    fs_parse: fold fs_parameter_desc/fs_parameter_spec
    fs_parser: remove fs_parameter_description name field
    add prefix to fs_context->log
    ceph_parse_param(), ceph_parse_mon_ips(): switch to passing fc_log
    new primitive: __fs_parse()
    switch rbd and libceph to p_log-based primitives
    struct p_log, variants of warnf() et.al. taking that one instead
    teach logfc() to handle prefices, give it saner calling conventions
    get rid of cg_invalf()
    ...

    Linus Torvalds
     

08 Feb, 2020

4 commits