12 Oct, 2020

3 commits

  • con->out_msg must be cleared on Policy::stateful_server
    (!CEPH_MSG_CONNECT_LOSSY) faults. Not doing so botches the
    reconnection attempt, because after writing the banner the
    messenger moves on to writing the data section of that message
    (either from where it got interrupted by the connection reset or
    from the beginning) instead of writing struct ceph_msg_connect.
    This results in a bizarre error message because the server
    sends CEPH_MSGR_TAG_BADPROTOVER but we think we wrote struct
    ceph_msg_connect:

    libceph: mds0 (1)172.21.15.45:6828 socket error on write
    ceph: mds0 reconnect start
    libceph: mds0 (1)172.21.15.45:6829 socket closed (con state OPEN)
    libceph: mds0 (1)172.21.15.45:6829 protocol version mismatch, my 32 != server's 32
    libceph: mds0 (1)172.21.15.45:6829 protocol version mismatch

    AFAICT this bug goes back to the dawn of the kernel client.
    The reason it survived for so long is that only MDS sessions
    are stateful and only two MDS messages have a data section:
    CEPH_MSG_CLIENT_RECONNECT (always, but reconnecting is rare)
    and CEPH_MSG_CLIENT_REQUEST (only when xattrs are involved).
    The connection has to get reset precisely when such a message
    is being sent -- in this case it was the former.
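
    For illustration only, the shape of the cleanup this calls for on a
    non-lossy fault might look like the sketch below (field names follow
    net/ceph/messenger.c; the exact placement in con_fault() and the
    reference counting around out_msg are assumptions and are elided):

        /* forget the partially written message so that the next
         * connect attempt starts from struct ceph_msg_connect
         * rather than from this message's data section */
        if (con->out_msg) {
                BUG_ON(con->out_msg->con != con);
                con->out_msg = NULL;
        }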

    Cc: stable@vger.kernel.org
    Link: https://tracker.ceph.com/issues/47723
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Jeff Layton

    Ilya Dryomov
     
  • Match the server side logs.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • The queued con->work can start executing (and therefore logging)
    before we get to this "con->work has been queued" message, making
    the logs confusing. Move it up, with the meaning of "con->work
    is about to be queued".

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     

03 Oct, 2020

1 commit

  • In libceph, ceph_tcp_sendpage() does the following check before handing
    the page to the network layer's zero-copy sendpage method:

    if (page_count(page) >= 1 && !PageSlab(page))

    This check is exactly what sendpage_ok() does. This patch replaces the
    open-coded check with sendpage_ok() as a code cleanup.
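
    For reference, the helper and the resulting call site look roughly
    like this (a sketch; the surrounding ceph_tcp_sendpage() context is
    abbreviated, and sock_no_sendpage() is the sendmsg-based fallback):

        /* include/linux/net.h */
        static inline bool sendpage_ok(struct page *page)
        {
                return !PageSlab(page) && page_count(page) >= 1;
        }

        /* in ceph_tcp_sendpage(): use zero-copy sendpage only when the
         * page is safe to hand to the network layer */
        ssize_t (*sendpage)(struct socket *sock, struct page *page,
                            int offset, size_t size, int flags);

        if (sendpage_ok(page))
                sendpage = sock->ops->sendpage;
        else
                sendpage = sock_no_sendpage;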

    Signed-off-by: Coly Li
    Acked-by: Jeff Layton
    Cc: Ilya Dryomov
    Signed-off-by: David S. Miller

    Coly Li
     

24 Aug, 2020

1 commit

  • Replace the existing /* fall through */ comments and their variants with
    the new pseudo-keyword macro fallthrough [1]. Also, remove fall-through
    markings where they are unnecessary.

    [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through
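
    A before/after sketch of the conversion (the case labels and helpers
    are made up for illustration; fallthrough expands to the compiler's
    fallthrough attribute where supported, see
    include/linux/compiler_attributes.h):

        /* before */
        switch (cmd) {
        case CMD_PREPARE:
                prepare();
                /* fall through */
        case CMD_RUN:
                run();
                break;
        }

        /* after */
        switch (cmd) {
        case CMD_PREPARE:
                prepare();
                fallthrough;
        case CMD_RUN:
                run();
                break;
        }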

    Signed-off-by: Gustavo A. R. Silva

    Gustavo A. R. Silva
     

29 May, 2020

1 commit

  • Add a helper to directly set the TCP_NODELAY sockopt from kernel space
    without going through a fake uaccess. Clean up the callers to avoid
    pointless wrappers now that this is a simple function call.
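
    The helper is tcp_sock_set_nodelay(); roughly, callers go from a
    kernel_setsockopt() call with a fake user pointer to a plain function
    call (a sketch; the ceph call site shown is an assumption):

        int optval = 1;

        /* before */
        kernel_setsockopt(sock, SOL_TCP, TCP_NODELAY,
                          (char *)&optval, sizeof(optval));

        /* after */
        tcp_sock_set_nodelay(sock->sk);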

    Signed-off-by: Christoph Hellwig
    Acked-by: Sagi Grimberg
    Acked-by: Jason Gunthorpe
    Signed-off-by: David S. Miller

    Christoph Hellwig
     

23 Mar, 2020

1 commit

  • Make it so that a CEPH_MSG_DATA_PAGES data item can own pages,
    fixing a bunch of memory leaks for a page vector allocated in
    alloc_msg_with_page_vector(). Currently, only watch-notify
    messages trigger this allocation, and normally the page vector
    is freed either in handle_watch_notify() or by the caller of
    ceph_osdc_notify(). But if the message is freed before that
    (e.g. if the session faults while reading in the message or
    if the notify is stale), we leak the page vector.

    This was supposed to be fixed by switching to a message-owned
    pagelist, but that never happened.
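
    The interface change this implies, as a sketch (treat the exact
    signature as my reading of the patch): ceph_msg_data_add_pages()
    grows an own_pages flag, and the message frees the vector when it is
    destroyed:

        void ceph_msg_data_add_pages(struct ceph_msg *msg, struct page **pages,
                                     size_t length, size_t alignment,
                                     bool own_pages);

        /* e.g. a notify reply buffer is now owned by the message */
        ceph_msg_data_add_pages(msg, pages, length, 0, true);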

    Fixes: 1907920324f1 ("libceph: support for sending notifies")
    Reported-by: Roman Penyaev
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Roman Penyaev

    Ilya Dryomov
     

28 Nov, 2019

1 commit

  • Convert the ceph filesystem to the new internal mount API as the old
    one will be obsoleted and removed. This allows greater flexibility in
    communication of mount parameters between userspace, the VFS and the
    filesystem.

    See Documentation/filesystems/mount_api.txt for more information.

    [ Numerous string handling, leak and regression fixes; rbd conversion
    was particularly broken and had to be redone almost from scratch. ]

    Signed-off-by: David Howells
    Signed-off-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    David Howells
     

16 Sep, 2019

1 commit


19 Jul, 2019

1 commit

  • Pull ceph updates from Ilya Dryomov:
    "Lots of exciting things this time!

    - support for rbd object-map and fast-diff features (myself). This
    will speed up reads, discards and things like snap diffs on sparse
    images.

    - ceph.snap.btime vxattr to expose snapshot creation time (David
    Disseldorp). This will be used to integrate with "Restore Previous
    Versions" feature added in Windows 7 for folks who reexport ceph
    through SMB.

    - security xattrs for ceph (Zheng Yan). Only selinux is supported for
    now due to the limitations of ->dentry_init_security().

    - support for MSG_ADDR2, FS_BTIME and FS_CHANGE_ATTR features (Jeff
    Layton). This is actually a single feature bit which was missing
    because of the filesystem pieces. With this in, the kernel client
    will finally be reported as "luminous" by "ceph features" -- it is
    still being reported as "jewel" even though all required Luminous
    features were implemented in 4.13.

    - stop NULL-terminating ceph vxattrs (Jeff Layton). The convention
    with xattrs is to not terminate and this was causing
    inconsistencies with ceph-fuse.

    - change filesystem time granularity from 1 us to 1 ns, again fixing
    an inconsistency with ceph-fuse (Luis Henriques).

    On top of this there are some additional dentry name handling and cap
    flushing fixes from Zheng. Finally, Jeff is formally taking over for
    Zheng as the filesystem maintainer"

    * tag 'ceph-for-5.3-rc1' of git://github.com/ceph/ceph-client: (71 commits)
    ceph: fix end offset in truncate_inode_pages_range call
    ceph: use generic_delete_inode() for ->drop_inode
    ceph: use ceph_evict_inode to cleanup inode's resource
    ceph: initialize superblock s_time_gran to 1
    MAINTAINERS: take over for Zheng as CephFS kernel client maintainer
    rbd: setallochint only if object doesn't exist
    rbd: support for object-map and fast-diff
    rbd: call rbd_dev_mapping_set() from rbd_dev_image_probe()
    libceph: export osd_req_op_data() macro
    libceph: change ceph_osdc_call() to take page vector for response
    libceph: bump CEPH_MSG_MAX_DATA_LEN (again)
    rbd: new exclusive lock wait/wake code
    rbd: quiescing lock should wait for image requests
    rbd: lock should be quiesced on reacquire
    rbd: introduce copyup state machine
    rbd: rename rbd_obj_setup_*() to rbd_obj_init_*()
    rbd: move OSD request allocation into object request state machines
    rbd: factor out __rbd_osd_setup_discard_ops()
    rbd: factor out rbd_osd_setup_copyup()
    rbd: introduce obj_req->osd_reqs list
    ...

    Linus Torvalds
     

08 Jul, 2019

3 commits


28 Jun, 2019

1 commit

  • Create a request_key_net() function and use it to pass the network
    namespace domain tag into DNS resolver keys and rxrpc/AFS keys so that keys
    for different domains can coexist in the same keyring.

    Signed-off-by: David Howells
    cc: netdev@vger.kernel.org
    cc: linux-nfs@vger.kernel.org
    cc: linux-cifs@vger.kernel.org
    cc: linux-afs@lists.infradead.org

    David Howells
     

17 May, 2019

1 commit

  • Pull misc AFS fixes from David Howells:
    "This fixes a set of miscellaneous issues in the afs filesystem,
    including:

    - leak of keys on file close.

    - broken error handling in xattr functions.

    - missing locking when updating VL server list.

    - volume location server DNS lookup whereby preloaded cells may not
    ever get a lookup and regular DNS lookups to maintain server lists
    consume power unnecessarily.

    - incorrect error propagation and handling in the fileserver
    iteration code causes operations to sometimes apparently succeed.

    - interruption of server record check/update side op during
    fileserver iteration causes uninterruptible main operations to fail
    unexpectedly.

    - callback promise expiry time miscalculation.

    - over invalidation of the callback promise on directories.

    - double locking on callback break waking up file locking waiters.

    - double increment of the vnode callback break counter.

    Note that it makes some changes outside of the afs code, including:

    - an extra parameter to dns_query() to allow the dns_resolver key
    just accessed to be immediately invalidated. AFS is caching the
    results itself, so the key can be discarded.

    - an interruptible version of wait_var_event().

    - an rxrpc function to allow the maximum lifespan to be set on a
    call.

    - a way for an rxrpc call to be marked as non-interruptible"

    * tag 'afs-fixes-20190516' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
    afs: Fix double inc of vnode->cb_break
    afs: Fix lock-wait/callback-break double locking
    afs: Don't invalidate callback if AFS_VNODE_DIR_VALID not set
    afs: Fix calculation of callback expiry time
    afs: Make dynamic root population wait uninterruptibly for proc_cells_lock
    afs: Make some RPC operations non-interruptible
    rxrpc: Allow the kernel to mark a call as being non-interruptible
    afs: Fix error propagation from server record check/update
    afs: Fix the maximum lifespan of VL and probe calls
    rxrpc: Provide kernel interface to set max lifespan on a call
    afs: Fix "kAFS: AFS vnode with undefined type 0"
    afs: Fix cell DNS lookup
    Add wait_var_event_interruptible()
    dns_resolver: Allow used keys to be invalidated
    afs: Fix afs_cell records to always have a VL server list record
    afs: Fix missing lock when replacing VL server list
    afs: Fix afs_xattr_get_yfs() to not try freeing an error value
    afs: Fix incorrect error handling in afs_xattr_get_acl()
    afs: Fix key leak in afs_release() and afs_evict_inode()

    Linus Torvalds
     

16 May, 2019

1 commit

  • Allow used DNS resolver keys to be invalidated after use if the caller is
    doing its own caching of the results. This reduces the amount of resources
    required.

    Fix AFS to invalidate DNS results to kill off permanent failure records
    that get lodged in the resolver keyring and prevent future lookups from
    happening.

    Fixes: 0a5143f2f89c ("afs: Implement VL server rotation")
    Signed-off-by: David Howells

    David Howells
     

08 May, 2019

2 commits

  • GCC9 is throwing a lot of warnings about unaligned accesses by
    callers of ceph_pr_addr. All of the current callers are passing a
    pointer to the sockaddr inside struct ceph_entity_addr.

    Fix it to take a pointer to a struct ceph_entity_addr instead,
    and then have the function make a copy of the sockaddr before
    printing it.
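
    The resulting interface, roughly (the "before" prototype and the
    usage line are illustrative assumptions):

        /* before */
        const char *ceph_pr_addr(const struct sockaddr_storage *ss);

        /* after: callers hand over the entity_addr itself and the
         * function copies the sockaddr to an aligned buffer internally */
        const char *ceph_pr_addr(const struct ceph_entity_addr *addr);

        pr_info("%s connection reset\n", ceph_pr_addr(&con->peer_addr));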

    Signed-off-by: Jeff Layton
    Reviewed-by: Ilya Dryomov
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • GCC9 is throwing a lot of warnings about unaligned access. This patch
    fixes some of them by changing most of the sockaddr handling functions
    to take a pointer to struct ceph_entity_addr instead of struct
    sockaddr_storage. The lower functions can then make copies or do
    unaligned accesses as needed.

    Signed-off-by: Jeff Layton
    Reviewed-by: Ilya Dryomov
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     

26 Mar, 2019

1 commit

  • A bvec can now consist of multiple physically contiguous pages.
    This means that bvec_iter_advance() can move to a different page while
    staying in the same bvec (i.e. ->bi_bvec_done != 0).

    The messenger works in terms of segments which can now be defined as
    the smaller of a bvec and a page. The "more bytes to process in this
    segment" condition holds only if bvec_iter_advance() leaves us in the
    same bvec _and_ in the same page. On next bvec (possibly in the same
    page) and on next page (possibly in the same bvec) we may need to set
    ->last_piece.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     

19 Feb, 2019

1 commit

  • The authorize reply can be empty, for example when the ticket used to
    build the authorizer is too old and TAG_BADAUTHORIZER is returned from
    the service. Calling ->verify_authorizer_reply() results in an attempt
    to decrypt and validate (somewhat) random data in au->buf (most likely
    the signature block from calc_signature()), which fails and ends up in
    con_fault_finish() with !con->auth_retry. The ticket isn't invalidated
    and the connection is retried again and again until a new ticket is
    obtained from the monitor:

    libceph: osd2 192.168.122.1:6809 bad authorize reply
    libceph: osd2 192.168.122.1:6809 bad authorize reply
    libceph: osd2 192.168.122.1:6809 bad authorize reply
    libceph: osd2 192.168.122.1:6809 bad authorize reply

    Let TAG_BADAUTHORIZER handler kick in and increment con->auth_retry.

    Cc: stable@vger.kernel.org
    Fixes: 5c056fdc5b47 ("libceph: verify authorize reply on connect")
    Link: https://tracker.ceph.com/issues/20164
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     

21 Jan, 2019

1 commit

  • con_fault() can transition the connection into STANDBY right after
    ceph_con_keepalive() clears STANDBY in clear_standby():

    libceph user thread                       ceph-msgr worker

    ceph_con_keepalive()
      mutex_lock(&con->mutex)
      clear_standby(con)
      mutex_unlock(&con->mutex)
                                              mutex_lock(&con->mutex)
                                              con_fault()
                                                ...
                                                if KEEPALIVE_PENDING isn't set
                                                  set state to STANDBY
                                                ...
                                              mutex_unlock(&con->mutex)
      set KEEPALIVE_PENDING
      set WRITE_PENDING

    This triggers warnings in clear_standby() when either ceph_con_send()
    or ceph_con_keepalive() gets to clearing STANDBY next time.

    I don't see a reason to condition the queue_con() call on the previous
    value of KEEPALIVE_PENDING, so move the setting of KEEPALIVE_PENDING
    into the critical section -- unlike WRITE_PENDING, KEEPALIVE_PENDING
    could have been a non-atomic flag.
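
    A sketch of ceph_con_keepalive() with the flag moved under con->mutex
    (helper and flag names as in net/ceph/messenger.c; treat this as an
    approximation of the fix, not the exact diff):

        void ceph_con_keepalive(struct ceph_connection *con)
        {
                dout("con_keepalive %p\n", con);
                mutex_lock(&con->mutex);
                clear_standby(con);
                /* set the flag while still holding the mutex */
                con_flag_set(con, CON_FLAG_KEEPALIVE_PENDING);
                mutex_unlock(&con->mutex);

                if (con_flag_test_and_set(con, CON_FLAG_WRITE_PENDING) == 0)
                        queue_con(con);
        }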

    Reported-by: syzbot+acdeb633f6211ccdf886@syzkaller.appspotmail.com
    Signed-off-by: Ilya Dryomov
    Tested-by: Myungho Jung

    Ilya Dryomov
     

26 Dec, 2018

4 commits

  • Unlike in ceph_tcp_sendpage(), it's a bool.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Prevent do_tcp_sendpages() from calling tcp_push() (at least) once per
    page. Instead, arrange for tcp_push() to be called (at least) once per
    data payload. This results in more MSS-sized packets and fewer packets
    overall (5-10% reduction in my tests with typical OSD request sizes).
    See commits 2f5338442425 ("tcp: allow splice() to build full TSO
    packets"), 35f9c09fe9c7 ("tcp: tcp_sendpages() should call tcp_push()
    once") and ae62ca7b0321 ("tcp: fix MSG_SENDPAGE_NOTLAST logic") for
    details.

    Here is an example of a packet size histogram for 128K OSD requests
    (MSS = 1448, top 5):

    Before:

    SIZE COUNT
    1448 777700
    952 127915
    1200 39238
    1219 9806
    21 5675

    After:

    SIZE COUNT
    1448 897280
    21 6201
    1019 2797
    643 2739
    376 2479

    We could do slightly better by explicitly corking the socket but it's
    not clear it's worth it.
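
    In ceph_tcp_sendpage() the flag handling then looks roughly like this
    (a sketch; MSG_SENDPAGE_NOTLAST is the hint that tells
    do_tcp_sendpages() not to push yet):

        int flags = MSG_DONTWAIT | MSG_NOSIGNAL;

        /* "more" means more pages of the same data payload will follow */
        if (more)
                flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;

        ret = sock->ops->sendpage(sock, page, offset, size, flags);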

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • sock_no_sendpage() makes the code cleaner.

    Also, don't set MSG_EOR. sendpage doesn't act on MSG_EOR on its own,
    it just honors the setting from the preceding sendmsg call by looking
    at ->eor in tcp_skb_can_collapse_to().

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • last_piece is for the last piece in the current data item, not in the
    entire data payload of the message. This is harmful for messages with
    multiple data items. On top of that, we don't need to signal the end
    of a data payload either because it is always followed by a footer.

    We used to signal "more" unconditionally, until commit fe38a2b67bc6
    ("libceph: start defining message data cursor"). Part of a large
    series, it introduced cursor->last_piece and also mistakenly inverted
    the hint by passing last_piece for "more". This was corrected with
    commit c2cfa1940097 ("libceph: Fix ceph_tcp_sendpage()'s more boolean
    usage").

    As it is, last_piece is not helping at all: because the Nagle algorithm is
    disabled, for a simple message with two 512-byte data items we end up
    emitting three packets: front + first data item, second data item and
    footer. Go back to the original pre-fe38a2b67bc6 behavior -- a single
    packet in most cases.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     

20 Nov, 2018

1 commit

  • skb_can_coalesce() allows coalescing neighboring slab objects into
    a single frag:

    return page == skb_frag_page(frag) &&
    off == frag->page_offset + skb_frag_size(frag);

    ceph_tcp_sendpage() can be handed slab pages. One example of this is
    XFS: it passes down sector sized slab objects for its metadata I/O. If
    the kernel client is co-located on the OSD node, the skb may go through
    loopback and pop on the receive side with the exact same set of frags.
    When tcp_recvmsg() attempts to copy out such a frag, hardened usercopy
    complains because the size exceeds the object's allocated size:

    usercopy: kernel memory exposure attempt detected from ffff9ba917f20a00 (kmalloc-512) (1024 bytes)

    Although skb_can_coalesce() could be taught to return false if the
    resulting frag would cross a slab object boundary, we already have
    a fallback for non-refcounted pages. Utilize it for slab pages too.
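
    The guard in ceph_tcp_sendpage() then becomes, roughly (sketch):

        /*
         * sendpage cannot properly handle pages with page_count == 0,
         * we need to fall back to sendmsg if that's the case.
         *
         * Same goes for slab pages: skb_can_coalesce() allows
         * coalescing neighboring slab objects into a single frag,
         * which trips the hardened usercopy check on the receive side.
         */
        if (page_count(page) >= 1 && !PageSlab(page))
                return __ceph_tcp_sendpage(sock, page, offset, size, more);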

    Cc: stable@vger.kernel.org # 4.8+
    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     

02 Nov, 2018

1 commit

  • Pull AFS updates from Al Viro:
    "AFS series, with some iov_iter bits included"

    * 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
    missing bits of "iov_iter: Separate type from direction and use accessor functions"
    afs: Probe multiple fileservers simultaneously
    afs: Fix callback handling
    afs: Eliminate the address pointer from the address list cursor
    afs: Allow dumping of server cursor on operation failure
    afs: Implement YFS support in the fs client
    afs: Expand data structure fields to support YFS
    afs: Get the target vnode in afs_rmdir() and get a callback on it
    afs: Calc callback expiry in op reply delivery
    afs: Fix FS.FetchStatus delivery from updating wrong vnode
    afs: Implement the YFS cache manager service
    afs: Remove callback details from afs_callback_break struct
    afs: Commit the status on a new file/dir/symlink
    afs: Increase to 64-bit volume ID and 96-bit vnode ID for YFS
    afs: Don't invoke the server to read data beyond EOF
    afs: Add a couple of tracepoints to log I/O errors
    afs: Handle EIO from delivery function
    afs: Fix TTL on VL server and address lists
    afs: Implement VL server rotation
    afs: Improve FS server rotation error handling
    ...

    Linus Torvalds
     

24 Oct, 2018

1 commit

  • In the iov_iter struct, separate the iterator type from the iterator
    direction and use accessor functions to access them in most places.

    Convert a bunch of places to use switch-statements to access them rather
    than chains of bitwise-AND statements. This makes it easier to add further
    iterator types. It can also be more efficient: to implement a switch over
    small contiguous integers, the compiler can use ~50% fewer compare
    instructions than it needs for chains of bitwise-AND tests.

    Further, cease passing the iterator type into the iterator setup function.
    The setup function can set that itself. Only the direction is required.
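
    An illustrative helper in the new style (iov_iter_type(), iov_iter_rw()
    and the ITER_* values are from include/linux/uio.h; the helper itself
    is made up):

        static bool iter_needs_copy(const struct iov_iter *iter)
        {
                /* direction via its accessor instead of masking iter->type */
                if (iov_iter_rw(iter) != WRITE)
                        return true;

                /* a switch over the type instead of bitwise-AND chains */
                switch (iov_iter_type(iter)) {
                case ITER_BVEC:
                case ITER_KVEC:
                        return false;
                default:
                        return true;
                }
        }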

    Signed-off-by: David Howells

    David Howells
     

22 Oct, 2018

2 commits

  • Currently message data items are allocated with ceph_msg_data_create()
    in setup_request_data() inside send_request(). send_request() has never
    been allowed to fail, so each allocation is followed by a BUG_ON:

    data = ceph_msg_data_create(...);
    BUG_ON(!data);

    It's been this way since support for multiple message data items was
    added in commit 6644ed7b7e04 ("libceph: make message data be a pointer")
    in 3.10.

    There is no reason to delay the allocation of message data items until
    the last possible moment and we certainly don't need a linked list of
    them as they are only ever appended to the end and never erased. Make
    ceph_msg_new2() take max_data_items and adapt the rest of the code.
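
    The resulting allocation interface, as a sketch (signature per my
    reading of the patch; the call site is illustrative):

        struct ceph_msg *ceph_msg_new2(int type, int front_len,
                                       int max_data_items, gfp_t flags,
                                       bool can_fail);

        /* data item slots are now allocated together with the message */
        msg = ceph_msg_new2(CEPH_MSG_OSD_OP, front_len, num_data_items,
                            GFP_NOIO, true);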

    Reported-by: Jerry Lee
    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Because send_mds_reconnect() wants to send a message with a pagelist
    and pass the ownership to the messenger, ceph_msg_data_add_pagelist()
    consumes a ref which is then put in ceph_msg_data_destroy(). This
    makes managing pagelists in the OSD client (where they are wrapped in
    ceph_osd_data) unnecessarily hard because the handoff only happens in
    ceph_osdc_start_request() instead of when the pagelist is passed to
    ceph_osd_data_pagelist_init(). I counted several memory leaks on
    various error paths.

    Fix up ceph_msg_data_add_pagelist() and carry a pagelist ref in
    ceph_osd_data.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     

03 Aug, 2018

5 commits

  • Avoid scribbling over memory if the received reply/challenge is larger
    than the buffer supplied with the authorizer.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     
  • When a client authenticates with a service, an authorizer is sent with
    a nonce to the service (ceph_x_authorize_[ab]) and the service responds
    with a mutation of that nonce (ceph_x_authorize_reply). This lets the
    client verify the service is who it says it is but it doesn't protect
    against a replay: someone can trivially capture the exchange and reuse
    the same authorizer to authenticate themselves.

    Allow the service to reject an initial authorizer with a random
    challenge (ceph_x_authorize_challenge). The client then has to respond
    with an updated authorizer proving they are able to decrypt the
    service's challenge and that the new authorizer was produced for this
    specific connection instance.

    The accepting side requires this challenge and response unconditionally
    if the client side advertises the CEPHX_V2 feature bit.

    This addresses CVE-2018-1128.

    Link: http://tracker.ceph.com/issues/24836
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     
  • Will be used for sending ceph_msg_connect with an updated authorizer,
    after the server challenges the initial authorizer.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     
  • We already copy authorizer_reply_buf and authorizer_reply_buf_len into
    ceph_connection. Factoring out __prepare_write_connect() requires two
    more: authorizer_buf and authorizer_buf_len. Store the pointer to the
    handshake in con->auth rather than piling on.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     
  • ceph_con_keepalive_expired() is the last user of timespec_add() and some
    of the last uses of ktime_get_real_ts(). Replacing this with timespec64
    based interfaces lets us remove that deprecated API.

    I'm introducing new ceph_encode_timespec64()/ceph_decode_timespec64()
    here that take timespec64 structures and convert to/from ceph_timespec,
    which is defined to have an unsigned 32-bit tv_sec member. This extends
    the range of valid times to year 2106, avoiding the year 2038 overflow.

    The ceph file system portion still uses the old functions for inode
    timestamps; this will be done separately after the VFS layer is converted.
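
    The new helpers look roughly like this (ceph_timespec carries an
    unsigned 32-bit tv_sec on the wire, hence the 2106 limit; treat the
    exact definitions as a sketch of include/linux/ceph/decode.h):

        static inline void ceph_encode_timespec64(struct ceph_timespec *tv,
                                                  const struct timespec64 *ts)
        {
                tv->tv_sec = cpu_to_le32((u32)ts->tv_sec);
                tv->tv_nsec = cpu_to_le32((u32)ts->tv_nsec);
        }

        static inline void ceph_decode_timespec64(struct timespec64 *ts,
                                                  const struct ceph_timespec *tv)
        {
                ts->tv_sec = (time64_t)le32_to_cpu(tv->tv_sec);
                ts->tv_nsec = (long)le32_to_cpu(tv->tv_nsec);
        }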

    Signed-off-by: Arnd Bergmann
    Reviewed-by: Ilya Dryomov
    Signed-off-by: Ilya Dryomov

    Arnd Bergmann
     

05 Jun, 2018

2 commits


26 Apr, 2018

1 commit

  • ceph_con_workfn() validates con->state before calling try_read() and
    then try_write(). However, try_read() temporarily releases con->mutex,
    notably in process_message() and ceph_con_in_msg_alloc(), opening the
    window for ceph_con_close() to sneak in, close the connection and
    release con->sock. When try_write() is called on the assumption that
    con->state is still valid (i.e. not STANDBY or CLOSED), a NULL sock
    gets passed to the networking stack:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
    IP: selinux_socket_sendmsg+0x5/0x20

    Make sure con->state is valid at the top of try_write() and add an
    explicit BUG_ON for this, similar to try_read().
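
    A sketch of the added guard (connection state names as in
    net/ceph/messenger.c; the exact set of states checked here is an
    assumption):

        static int try_write(struct ceph_connection *con)
        {
                int ret = 1;

                dout("try_write start %p state %lu\n", con, con->state);
                if (con->state != CON_STATE_PREOPEN &&
                    con->state != CON_STATE_CONNECTING &&
                    con->state != CON_STATE_NEGOTIATING &&
                    con->state != CON_STATE_OPEN)
                        return 0;

                /* ... rest of try_write() unchanged ... */
                return ret;
        }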

    Cc: stable@vger.kernel.org
    Link: https://tracker.ceph.com/issues/23706
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Jason Dillaman

    Ilya Dryomov
     

02 Apr, 2018

2 commits