12 Oct, 2020

3 commits

  • con->out_msg must be cleared on Policy::stateful_server
    (!CEPH_MSG_CONNECT_LOSSY) faults. Not doing so botches the
    reconnection attempt, because after writing the banner the
    messenger moves on to writing the data section of that message
    (either from where it got interrupted by the connection reset or
    from the beginning) instead of writing struct ceph_msg_connect.
    This results in a bizarre error message because the server
    sends CEPH_MSGR_TAG_BADPROTOVER but we think we wrote struct
    ceph_msg_connect:

    libceph: mds0 (1)172.21.15.45:6828 socket error on write
    ceph: mds0 reconnect start
    libceph: mds0 (1)172.21.15.45:6829 socket closed (con state OPEN)
    libceph: mds0 (1)172.21.15.45:6829 protocol version mismatch, my 32 != server's 32
    libceph: mds0 (1)172.21.15.45:6829 protocol version mismatch

    AFAICT this bug goes back to the dawn of the kernel client.
    The reason it survived for so long is that only MDS sessions
    are stateful and only two MDS messages have a data section:
    CEPH_MSG_CLIENT_RECONNECT (always, but reconnecting is rare)
    and CEPH_MSG_CLIENT_REQUEST (only when xattrs are involved).
    The connection has to get reset precisely when such a message
    is being sent -- in this case it was the former.
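
    For illustration only, the shape of the cleanup this calls for on a
    non-lossy fault might look like the sketch below (field names follow
    net/ceph/messenger.c; the exact placement in con_fault() and the
    reference counting around out_msg are assumptions and are elided):

        /* forget the partially written message so that the next
         * connect attempt starts from struct ceph_msg_connect
         * rather than from this message's data section */
        if (con->out_msg) {
                BUG_ON(con->out_msg->con != con);
                con->out_msg = NULL;
        }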

    Cc: stable@vger.kernel.org
    Link: https://tracker.ceph.com/issues/47723
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Jeff Layton

    Ilya Dryomov
     
  • Match the server side logs.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • The queued con->work can start executing (and therefore logging)
    before we get to this "con->work has been queued" message, making
    the logs confusing. Move it up, with the meaning of "con->work
    is about to be queued".

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     

03 Oct, 2020

1 commit

  • In libceph, ceph_tcp_sendpage() does the following check before handing
    the page to the network layer's zero-copy sendpage method:

    if (page_count(page) >= 1 && !PageSlab(page))

    This check is exactly what sendpage_ok() does. This patch replaces the
    open-coded check with sendpage_ok() as a code cleanup.
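
    For reference, the helper and the resulting call site look roughly
    like this (a sketch; the surrounding ceph_tcp_sendpage() context is
    abbreviated, and sock_no_sendpage() is the sendmsg-based fallback):

        /* include/linux/net.h */
        static inline bool sendpage_ok(struct page *page)
        {
                return !PageSlab(page) && page_count(page) >= 1;
        }

        /* in ceph_tcp_sendpage(): use zero-copy sendpage only when the
         * page is safe to hand to the network layer */
        ssize_t (*sendpage)(struct socket *sock, struct page *page,
                            int offset, size_t size, int flags);

        if (sendpage_ok(page))
                sendpage = sock->ops->sendpage;
        else
                sendpage = sock_no_sendpage;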

    Signed-off-by: Coly Li
    Acked-by: Jeff Layton
    Cc: Ilya Dryomov
    Signed-off-by: David S. Miller

    Coly Li
     

24 Aug, 2020

1 commit

  • Replace the existing /* fall through */ comments and their variants with
    the new pseudo-keyword macro fallthrough [1]. Also, remove fall-through
    markings where they are unnecessary.

    [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through
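
    A before/after sketch of the conversion (the case labels and helpers
    are made up for illustration; fallthrough expands to the compiler's
    fallthrough attribute where supported, see
    include/linux/compiler_attributes.h):

        /* before */
        switch (cmd) {
        case CMD_PREPARE:
                prepare();
                /* fall through */
        case CMD_RUN:
                run();
                break;
        }

        /* after */
        switch (cmd) {
        case CMD_PREPARE:
                prepare();
                fallthrough;
        case CMD_RUN:
                run();
                break;
        }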

    Signed-off-by: Gustavo A. R. Silva

    Gustavo A. R. Silva
     

29 May, 2020

1 commit

  • Add a helper to directly set the TCP_NODELAY sockopt from kernel space
    without going through a fake uaccess. Clean up the callers to avoid
    pointless wrappers now that this is a simple function call.
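
    The helper is tcp_sock_set_nodelay(); roughly, callers go from a
    kernel_setsockopt() call with a fake user pointer to a plain function
    call (a sketch; the ceph call site shown is an assumption):

        int optval = 1;

        /* before */
        kernel_setsockopt(sock, SOL_TCP, TCP_NODELAY,
                          (char *)&optval, sizeof(optval));

        /* after */
        tcp_sock_set_nodelay(sock->sk);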

    Signed-off-by: Christoph Hellwig
    Acked-by: Sagi Grimberg
    Acked-by: Jason Gunthorpe
    Signed-off-by: David S. Miller

    Christoph Hellwig
     

23 Mar, 2020

1 commit

  • Make it so that a CEPH_MSG_DATA_PAGES data item can own pages,
    fixing a bunch of memory leaks for a page vector allocated in
    alloc_msg_with_page_vector(). Currently, only watch-notify
    messages trigger this allocation, and normally the page vector
    is freed either in handle_watch_notify() or by the caller of
    ceph_osdc_notify(). But if the message is freed before that
    (e.g. if the session faults while reading in the message or
    if the notify is stale), we leak the page vector.

    This was supposed to be fixed by switching to a message-owned
    pagelist, but that never happened.
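
    The interface change this implies, as a sketch (treat the exact
    signature as my reading of the patch): ceph_msg_data_add_pages()
    grows an own_pages flag, and the message frees the vector when it is
    destroyed:

        void ceph_msg_data_add_pages(struct ceph_msg *msg, struct page **pages,
                                     size_t length, size_t alignment,
                                     bool own_pages);

        /* e.g. a notify reply buffer is now owned by the message */
        ceph_msg_data_add_pages(msg, pages, length, 0, true);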

    Fixes: 1907920324f1 ("libceph: support for sending notifies")
    Reported-by: Roman Penyaev
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Roman Penyaev

    Ilya Dryomov
     

28 Nov, 2019

1 commit

  • Convert the ceph filesystem to the new internal mount API as the old
    one will be obsoleted and removed. This allows greater flexibility in
    communication of mount parameters between userspace, the VFS and the
    filesystem.

    See Documentation/filesystems/mount_api.txt for more information.

    [ Numerous string handling, leak and regression fixes; rbd conversion
    was particularly broken and had to be redone almost from scratch. ]

    Signed-off-by: David Howells
    Signed-off-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    David Howells
     

16 Sep, 2019

1 commit


19 Jul, 2019

1 commit

  • Pull ceph updates from Ilya Dryomov:
    "Lots of exciting things this time!

    - support for rbd object-map and fast-diff features (myself). This
    will speed up reads, discards and things like snap diffs on sparse
    images.

    - ceph.snap.btime vxattr to expose snapshot creation time (David
    Disseldorp). This will be used to integrate with "Restore Previous
    Versions" feature added in Windows 7 for folks who reexport ceph
    through SMB.

    - security xattrs for ceph (Zheng Yan). Only selinux is supported for
    now due to the limitations of ->dentry_init_security().

    - support for MSG_ADDR2, FS_BTIME and FS_CHANGE_ATTR features (Jeff
    Layton). This is actually a single feature bit which was missing
    because of the filesystem pieces. With this in, the kernel client
    will finally be reported as "luminous" by "ceph features" -- it is
    still being reported as "jewel" even though all required Luminous
    features were implemented in 4.13.

    - stop NULL-terminating ceph vxattrs (Jeff Layton). The convention
    with xattrs is to not terminate and this was causing
    inconsistencies with ceph-fuse.

    - change filesystem time granularity from 1 us to 1 ns, again fixing
    an inconsistency with ceph-fuse (Luis Henriques).

    On top of this there are some additional dentry name handling and cap
    flushing fixes from Zheng. Finally, Jeff is formally taking over for
    Zheng as the filesystem maintainer"

    * tag 'ceph-for-5.3-rc1' of git://github.com/ceph/ceph-client: (71 commits)
    ceph: fix end offset in truncate_inode_pages_range call
    ceph: use generic_delete_inode() for ->drop_inode
    ceph: use ceph_evict_inode to cleanup inode's resource
    ceph: initialize superblock s_time_gran to 1
    MAINTAINERS: take over for Zheng as CephFS kernel client maintainer
    rbd: setallochint only if object doesn't exist
    rbd: support for object-map and fast-diff
    rbd: call rbd_dev_mapping_set() from rbd_dev_image_probe()
    libceph: export osd_req_op_data() macro
    libceph: change ceph_osdc_call() to take page vector for response
    libceph: bump CEPH_MSG_MAX_DATA_LEN (again)
    rbd: new exclusive lock wait/wake code
    rbd: quiescing lock should wait for image requests
    rbd: lock should be quiesced on reacquire
    rbd: introduce copyup state machine
    rbd: rename rbd_obj_setup_*() to rbd_obj_init_*()
    rbd: move OSD request allocation into object request state machines
    rbd: factor out __rbd_osd_setup_discard_ops()
    rbd: factor out rbd_osd_setup_copyup()
    rbd: introduce obj_req->osd_reqs list
    ...

    Linus Torvalds
     

08 Jul, 2019

3 commits


28 Jun, 2019

1 commit

  • Create a request_key_net() function and use it to pass the network
    namespace domain tag into DNS resolver keys and rxrpc/AFS keys so that keys
    for different domains can coexist in the same keyring.

    Signed-off-by: David Howells
    cc: netdev@vger.kernel.org
    cc: linux-nfs@vger.kernel.org
    cc: linux-cifs@vger.kernel.org
    cc: linux-afs@lists.infradead.org

    David Howells
     

17 May, 2019

1 commit

  • Pull misc AFS fixes from David Howells:
    "This fixes a set of miscellaneous issues in the afs filesystem,
    including:

    - leak of keys on file close.

    - broken error handling in xattr functions.

    - missing locking when updating VL server list.

    - volume location server DNS lookup whereby preloaded cells may not
    ever get a lookup and regular DNS lookups to maintain server lists
    consume power unnecessarily.

    - incorrect error propagation and handling in the fileserver
    iteration code causes operations to sometimes apparently succeed.

    - interruption of server record check/update side op during
    fileserver iteration causes uninterruptible main operations to fail
    unexpectedly.

    - callback promise expiry time miscalculation.

    - over invalidation of the callback promise on directories.

    - double locking on callback break waking up file locking waiters.

    - double increment of the vnode callback break counter.

    Note that it makes some changes outside of the afs code, including:

    - an extra parameter to dns_query() to allow the dns_resolver key
    just accessed to be immediately invalidated. AFS is caching the
    results itself, so the key can be discarded.

    - an interruptible version of wait_var_event().

    - an rxrpc function to allow the maximum lifespan to be set on a
    call.

    - a way for an rxrpc call to be marked as non-interruptible"

    * tag 'afs-fixes-20190516' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
    afs: Fix double inc of vnode->cb_break
    afs: Fix lock-wait/callback-break double locking
    afs: Don't invalidate callback if AFS_VNODE_DIR_VALID not set
    afs: Fix calculation of callback expiry time
    afs: Make dynamic root population wait uninterruptibly for proc_cells_lock
    afs: Make some RPC operations non-interruptible
    rxrpc: Allow the kernel to mark a call as being non-interruptible
    afs: Fix error propagation from server record check/update
    afs: Fix the maximum lifespan of VL and probe calls
    rxrpc: Provide kernel interface to set max lifespan on a call
    afs: Fix "kAFS: AFS vnode with undefined type 0"
    afs: Fix cell DNS lookup
    Add wait_var_event_interruptible()
    dns_resolver: Allow used keys to be invalidated
    afs: Fix afs_cell records to always have a VL server list record
    afs: Fix missing lock when replacing VL server list
    afs: Fix afs_xattr_get_yfs() to not try freeing an error value
    afs: Fix incorrect error handling in afs_xattr_get_acl()
    afs: Fix key leak in afs_release() and afs_evict_inode()

    Linus Torvalds
     

16 May, 2019

1 commit

  • Allow used DNS resolver keys to be invalidated after use if the caller is
    doing its own caching of the results. This reduces the amount of resources
    required.

    Fix AFS to invalidate DNS results to kill off permanent failure records
    that get lodged in the resolver keyring and prevent future lookups from
    happening.

    Fixes: 0a5143f2f89c ("afs: Implement VL server rotation")
    Signed-off-by: David Howells

    David Howells
     

08 May, 2019

2 commits

  • GCC9 is throwing a lot of warnings about unaligned accesses by
    callers of ceph_pr_addr. All of the current callers are passing a
    pointer to the sockaddr inside struct ceph_entity_addr.

    Fix it to take a pointer to a struct ceph_entity_addr instead,
    and then have the function make a copy of the sockaddr before
    printing it.
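
    The resulting interface, roughly (the "before" prototype and the
    usage line are illustrative assumptions):

        /* before */
        const char *ceph_pr_addr(const struct sockaddr_storage *ss);

        /* after: callers hand over the entity_addr itself and the
         * function copies the sockaddr to an aligned buffer internally */
        const char *ceph_pr_addr(const struct ceph_entity_addr *addr);

        pr_info("%s connection reset\n", ceph_pr_addr(&con->peer_addr));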

    Signed-off-by: Jeff Layton
    Reviewed-by: Ilya Dryomov
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • GCC9 is throwing a lot of warnings about unaligned access. This patch
    fixes some of them by changing most of the sockaddr handling functions
    to take a pointer to struct ceph_entity_addr instead of struct
    sockaddr_storage. The lower functions can then make copies or do
    unaligned accesses as needed.

    Signed-off-by: Jeff Layton
    Reviewed-by: Ilya Dryomov
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     

26 Mar, 2019

1 commit

  • A bvec can now consist of multiple physically contiguous pages.
    This means that bvec_iter_advance() can move to a different page while
    staying in the same bvec (i.e. ->bi_bvec_done != 0).

    The messenger works in terms of segments which can now be defined as
    the smaller of a bvec and a page. The "more bytes to process in this
    segment" condition holds only if bvec_iter_advance() leaves us in the
    same bvec _and_ in the same page. On next bvec (possibly in the same
    page) and on next page (possibly in the same bvec) we may need to set
    ->last_piece.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     

19 Feb, 2019

1 commit

  • The authorize reply can be empty, for example when the ticket used to
    build the authorizer is too old and TAG_BADAUTHORIZER is returned from
    the service. Calling ->verify_authorizer_reply() results in an attempt
    to decrypt and validate (somewhat) random data in au->buf (most likely
    the signature block from calc_signature()), which fails and ends up in
    con_fault_finish() with !con->auth_retry. The ticket isn't invalidated
    and the connection is retried again and again until a new ticket is
    obtained from the monitor:

    libceph: osd2 192.168.122.1:6809 bad authorize reply
    libceph: osd2 192.168.122.1:6809 bad authorize reply
    libceph: osd2 192.168.122.1:6809 bad authorize reply
    libceph: osd2 192.168.122.1:6809 bad authorize reply

    Let TAG_BADAUTHORIZER handler kick in and increment con->auth_retry.

    Cc: stable@vger.kernel.org
    Fixes: 5c056fdc5b47 ("libceph: verify authorize reply on connect")
    Link: https://tracker.ceph.com/issues/20164
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     

21 Jan, 2019

1 commit

  • con_fault() can transition the connection into STANDBY right after
    ceph_con_keepalive() clears STANDBY in clear_standby():

    libceph user thread                       ceph-msgr worker

    ceph_con_keepalive()
      mutex_lock(&con->mutex)
      clear_standby(con)
      mutex_unlock(&con->mutex)
                                              mutex_lock(&con->mutex)
                                              con_fault()
                                                ...
                                                if KEEPALIVE_PENDING isn't set
                                                  set state to STANDBY
                                                ...
                                              mutex_unlock(&con->mutex)
      set KEEPALIVE_PENDING
      set WRITE_PENDING

    This triggers warnings in clear_standby() when either ceph_con_send()
    or ceph_con_keepalive() gets to clearing STANDBY next time.

    I don't see a reason to condition the queue_con() call on the previous
    value of KEEPALIVE_PENDING, so move the setting of KEEPALIVE_PENDING
    into the critical section -- unlike WRITE_PENDING, KEEPALIVE_PENDING
    could have been a non-atomic flag.
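
    A sketch of ceph_con_keepalive() with the flag moved under con->mutex
    (helper and flag names as in net/ceph/messenger.c; treat this as an
    approximation of the fix, not the exact diff):

        void ceph_con_keepalive(struct ceph_connection *con)
        {
                dout("con_keepalive %p\n", con);
                mutex_lock(&con->mutex);
                clear_standby(con);
                /* set the flag while still holding the mutex */
                con_flag_set(con, CON_FLAG_KEEPALIVE_PENDING);
                mutex_unlock(&con->mutex);

                if (con_flag_test_and_set(con, CON_FLAG_WRITE_PENDING) == 0)
                        queue_con(con);
        }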

    Reported-by: syzbot+acdeb633f6211ccdf886@syzkaller.appspotmail.com
    Signed-off-by: Ilya Dryomov
    Tested-by: Myungho Jung

    Ilya Dryomov
     

26 Dec, 2018

4 commits

  • Unlike in ceph_tcp_sendpage(), it's a bool.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Prevent do_tcp_sendpages() from calling tcp_push() (at least) once per
    page. Instead, arrange for tcp_push() to be called (at least) once per
    data payload. This results in more MSS-sized packets and fewer packets
    overall (5-10% reduction in my tests with typical OSD request sizes).
    See commits 2f5338442425 ("tcp: allow splice() to build full TSO
    packets"), 35f9c09fe9c7 ("tcp: tcp_sendpages() should call tcp_push()
    once") and ae62ca7b0321 ("tcp: fix MSG_SENDPAGE_NOTLAST logic") for
    details.

    Here is an example of a packet size histogram for 128K OSD requests
    (MSS = 1448, top 5):

    Before:

    SIZE COUNT
    1448 777700
    952 127915
    1200 39238
    1219 9806
    21 5675

    After:

    SIZE COUNT
    1448 897280
    21 6201
    1019 2797
    643 2739
    376 2479

    We could do slightly better by explicitly corking the socket but it's
    not clear it's worth it.
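
    In ceph_tcp_sendpage() the flag handling then looks roughly like this
    (a sketch; MSG_SENDPAGE_NOTLAST is the hint that tells
    do_tcp_sendpages() not to push yet):

        int flags = MSG_DONTWAIT | MSG_NOSIGNAL;

        /* "more" means more pages of the same data payload will follow */
        if (more)
                flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;

        ret = sock->ops->sendpage(sock, page, offset, size, flags);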

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • sock_no_sendpage() makes the code cleaner.

    Also, don't set MSG_EOR. sendpage doesn't act on MSG_EOR on its own,
    it just honors the setting from the preceding sendmsg call by looking
    at ->eor in tcp_skb_can_collapse_to().

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • last_piece is for the last piece in the current data item, not in the
    entire data payload of the message. This is harmful for messages with
    multiple data items. On top of that, we don't need to signal the end
    of a data payload either because it is always followed by a footer.

    We used to signal "more" unconditionally, until commit fe38a2b67bc6
    ("libceph: start defining message data cursor"). Part of a large
    series, it introduced cursor->last_piece and also mistakenly inverted
    the hint by passing last_piece for "more". This was corrected with
    commit c2cfa1940097 ("libceph: Fix ceph_tcp_sendpage()'s more boolean
    usage").

    As it is, last_piece is not helping at all: because the Nagle algorithm is
    disabled, for a simple message with two 512-byte data items we end up
    emitting three packets: front + first data item, second data item and
    footer. Go back to the original pre-fe38a2b67bc6 behavior -- a single
    packet in most cases.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     

20 Nov, 2018

1 commit

  • skb_can_coalesce() allows coalescing neighboring slab objects into
    a single frag:

    return page == skb_frag_page(frag) &&
    off == frag->page_offset + skb_frag_size(frag);

    ceph_tcp_sendpage() can be handed slab pages. One example of this is
    XFS: it passes down sector sized slab objects for its metadata I/O. If
    the kernel client is co-located on the OSD node, the skb may go through
    loopback and pop on the receive side with the exact same set of frags.
    When tcp_recvmsg() attempts to copy out such a frag, hardened usercopy
    complains because the size exceeds the object's allocated size:

    usercopy: kernel memory exposure attempt detected from ffff9ba917f20a00 (kmalloc-512) (1024 bytes)

    Although skb_can_coalesce() could be taught to return false if the
    resulting frag would cross a slab object boundary, we already have
    a fallback for non-refcounted pages. Utilize it for slab pages too.
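
    The guard in ceph_tcp_sendpage() then becomes, roughly (sketch):

        /*
         * sendpage cannot properly handle pages with page_count == 0,
         * we need to fall back to sendmsg if that's the case.
         *
         * Same goes for slab pages: skb_can_coalesce() allows
         * coalescing neighboring slab objects into a single frag,
         * which trips the hardened usercopy check on the receive side.
         */
        if (page_count(page) >= 1 && !PageSlab(page))
                return __ceph_tcp_sendpage(sock, page, offset, size, more);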

    Cc: stable@vger.kernel.org # 4.8+
    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     

02 Nov, 2018

1 commit

  • Pull AFS updates from Al Viro:
    "AFS series, with some iov_iter bits included"

    * 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
    missing bits of "iov_iter: Separate type from direction and use accessor functions"
    afs: Probe multiple fileservers simultaneously
    afs: Fix callback handling
    afs: Eliminate the address pointer from the address list cursor
    afs: Allow dumping of server cursor on operation failure
    afs: Implement YFS support in the fs client
    afs: Expand data structure fields to support YFS
    afs: Get the target vnode in afs_rmdir() and get a callback on it
    afs: Calc callback expiry in op reply delivery
    afs: Fix FS.FetchStatus delivery from updating wrong vnode
    afs: Implement the YFS cache manager service
    afs: Remove callback details from afs_callback_break struct
    afs: Commit the status on a new file/dir/symlink
    afs: Increase to 64-bit volume ID and 96-bit vnode ID for YFS
    afs: Don't invoke the server to read data beyond EOF
    afs: Add a couple of tracepoints to log I/O errors
    afs: Handle EIO from delivery function
    afs: Fix TTL on VL server and address lists
    afs: Implement VL server rotation
    afs: Improve FS server rotation error handling
    ...

    Linus Torvalds
     

24 Oct, 2018

1 commit

  • In the iov_iter struct, separate the iterator type from the iterator
    direction and use accessor functions to access them in most places.

    Convert a bunch of places to use switch-statements to access them rather
    than chains of bitwise-AND statements. This makes it easier to add further
    iterator types. It can also be more efficient: to implement a switch over
    small contiguous integers, the compiler can use ~50% fewer compare
    instructions than it needs for chains of bitwise-AND tests.

    Further, cease passing the iterator type into the iterator setup function.
    The setup function can set that itself. Only the direction is required.
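
    An illustrative helper in the new style (iov_iter_type(), iov_iter_rw()
    and the ITER_* values are from include/linux/uio.h; the helper itself
    is made up):

        static bool iter_needs_copy(const struct iov_iter *iter)
        {
                /* direction via its accessor instead of masking iter->type */
                if (iov_iter_rw(iter) != WRITE)
                        return true;

                /* a switch over the type instead of bitwise-AND chains */
                switch (iov_iter_type(iter)) {
                case ITER_BVEC:
                case ITER_KVEC:
                        return false;
                default:
                        return true;
                }
        }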

    Signed-off-by: David Howells

    David Howells
     

22 Oct, 2018

2 commits

  • Currently message data items are allocated with ceph_msg_data_create()
    in setup_request_data() inside send_request(). send_request() has never
    been allowed to fail, so each allocation is followed by a BUG_ON:

    data = ceph_msg_data_create(...);
    BUG_ON(!data);

    It's been this way since support for multiple message data items was
    added in commit 6644ed7b7e04 ("libceph: make message data be a pointer")
    in 3.10.

    There is no reason to delay the allocation of message data items until
    the last possible moment and we certainly don't need a linked list of
    them as they are only ever appended to the end and never erased. Make
    ceph_msg_new2() take max_data_items and adapt the rest of the code.
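
    The resulting allocation interface, as a sketch (signature per my
    reading of the patch; the call site is illustrative):

        struct ceph_msg *ceph_msg_new2(int type, int front_len,
                                       int max_data_items, gfp_t flags,
                                       bool can_fail);

        /* data item slots are now allocated together with the message */
        msg = ceph_msg_new2(CEPH_MSG_OSD_OP, front_len, num_data_items,
                            GFP_NOIO, true);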

    Reported-by: Jerry Lee
    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Because send_mds_reconnect() wants to send a message with a pagelist
    and pass the ownership to the messenger, ceph_msg_data_add_pagelist()
    consumes a ref which is then put in ceph_msg_data_destroy(). This
    makes managing pagelists in the OSD client (where they are wrapped in
    ceph_osd_data) unnecessarily hard because the handoff only happens in
    ceph_osdc_start_request() instead of when the pagelist is passed to
    ceph_osd_data_pagelist_init(). I counted several memory leaks on
    various error paths.

    Fix up ceph_msg_data_add_pagelist() and carry a pagelist ref in
    ceph_osd_data.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     

03 Aug, 2018

5 commits

  • Avoid scribbling over memory if the received reply/challenge is larger
    than the buffer supplied with the authorizer.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     
  • When a client authenticates with a service, an authorizer is sent with
    a nonce to the service (ceph_x_authorize_[ab]) and the service responds
    with a mutation of that nonce (ceph_x_authorize_reply). This lets the
    client verify the service is who it says it is but it doesn't protect
    against a replay: someone can trivially capture the exchange and reuse
    the same authorizer to authenticate themselves.

    Allow the service to reject an initial authorizer with a random
    challenge (ceph_x_authorize_challenge). The client then has to respond
    with an updated authorizer proving they are able to decrypt the
    service's challenge and that the new authorizer was produced for this
    specific connection instance.

    The accepting side requires this challenge and response unconditionally
    if the client side advertises the CEPHX_V2 feature bit.

    This addresses CVE-2018-1128.

    Link: http://tracker.ceph.com/issues/24836
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     
  • Will be used for sending ceph_msg_connect with an updated authorizer,
    after the server challenges the initial authorizer.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     
  • We already copy authorizer_reply_buf and authorizer_reply_buf_len into
    ceph_connection. Factoring out __prepare_write_connect() requires two
    more: authorizer_buf and authorizer_buf_len. Store the pointer to the
    handshake in con->auth rather than piling on.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     
  • ceph_con_keepalive_expired() is the last user of timespec_add() and some
    of the last uses of ktime_get_real_ts(). Replacing this with timespec64
    based interfaces lets us remove that deprecated API.

    I'm introducing new ceph_encode_timespec64()/ceph_decode_timespec64()
    here that take timespec64 structures and convert to/from ceph_timespec,
    which is defined to have an unsigned 32-bit tv_sec member. This extends
    the range of valid times to year 2106, avoiding the year 2038 overflow.

    The ceph file system portion still uses the old functions for inode
    timestamps; this will be done separately after the VFS layer is converted.
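
    The new helpers look roughly like this (ceph_timespec carries an
    unsigned 32-bit tv_sec on the wire, hence the 2106 limit; treat the
    exact definitions as a sketch of include/linux/ceph/decode.h):

        static inline void ceph_encode_timespec64(struct ceph_timespec *tv,
                                                  const struct timespec64 *ts)
        {
                tv->tv_sec = cpu_to_le32((u32)ts->tv_sec);
                tv->tv_nsec = cpu_to_le32((u32)ts->tv_nsec);
        }

        static inline void ceph_decode_timespec64(struct timespec64 *ts,
                                                  const struct ceph_timespec *tv)
        {
                ts->tv_sec = (time64_t)le32_to_cpu(tv->tv_sec);
                ts->tv_nsec = (long)le32_to_cpu(tv->tv_nsec);
        }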

    Signed-off-by: Arnd Bergmann
    Reviewed-by: Ilya Dryomov
    Signed-off-by: Ilya Dryomov

    Arnd Bergmann
     

05 Jun, 2018

2 commits


26 Apr, 2018

1 commit

  • ceph_con_workfn() validates con->state before calling try_read() and
    then try_write(). However, try_read() temporarily releases con->mutex,
    notably in process_message() and ceph_con_in_msg_alloc(), opening the
    window for ceph_con_close() to sneak in, close the connection and
    release con->sock. When try_write() is called on the assumption that
    con->state is still valid (i.e. not STANDBY or CLOSED), a NULL sock
    gets passed to the networking stack:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
    IP: selinux_socket_sendmsg+0x5/0x20

    Make sure con->state is valid at the top of try_write() and add an
    explicit BUG_ON for this, similar to try_read().
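
    A sketch of the added guard (connection state names as in
    net/ceph/messenger.c; the exact set of states checked here is an
    assumption):

        static int try_write(struct ceph_connection *con)
        {
                int ret = 1;

                dout("try_write start %p state %lu\n", con, con->state);
                if (con->state != CON_STATE_PREOPEN &&
                    con->state != CON_STATE_CONNECTING &&
                    con->state != CON_STATE_NEGOTIATING &&
                    con->state != CON_STATE_OPEN)
                        return 0;

                /* ... rest of try_write() unchanged ... */
                return ret;
        }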

    Cc: stable@vger.kernel.org
    Link: https://tracker.ceph.com/issues/23706
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Jason Dillaman

    Ilya Dryomov
     

02 Apr, 2018

2 commits