22 Sep, 2015

1 commit

  • commit a068acf2ee77693e0bf39d6e07139ba704f461c3 upstream.

    Many file systems that implement the show_options hook fail to correctly
    escape their output which could lead to unescaped characters (e.g. new
    lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files. This
    could lead to confusion, spoofed entries (resulting in things like
    systemd issuing false d-bus "mount" notifications), and who knows what
    else. This looks like it would only be the root user stepping on
    themselves, but it's possible weird things could happen in containers or
    in other situations with delegated mount privileges.

    Here's an example using overlay with setuid fusermount trusting the
    contents of /proc/mounts (via the /etc/mtab symlink). Imagine the use
    of "sudo" is something more sneaky:

    $ BASE="ovl"
    $ MNT="$BASE/mnt"
    $ LOW="$BASE/lower"
    $ UP="$BASE/upper"
    $ WORK="$BASE/work/ 0 0
    none /proc fuse.pwn user_id=1000"
    $ mkdir -p "$LOW" "$UP" "$WORK"
    $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
    $ cat /proc/mounts
    none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0
    none /proc fuse.pwn user_id=1000 0 0
    $ fusermount -u /proc
    $ cat /proc/mounts
    cat: /proc/mounts: No such file or directory

    This fixes the problem by adding new seq_show_option and
    seq_show_option_n helpers, and updating the vulnerable show_option
    handlers to use them as needed. Some, like SELinux, need to be open
    coded due to unusual existing escape mechanisms.

    [akpm@linux-foundation.org: add lost chunk, per Kees]
    [keescook@chromium.org: seq_show_option should be using const parameters]
    Signed-off-by: Kees Cook
    Acked-by: Serge Hallyn
    Acked-by: Jan Kara
    Acked-by: Paul Moore
    Cc: J. R. Okajima
    Signed-off-by: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
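
As an illustration of the escaping idea in the commit above, here is a minimal userspace sketch. It is not the kernel helper: the function name show_option_escaped, the set of characters treated as special, and the backslash-plus-three-octal-digits output form are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Characters that would break the "name=value,name=value" layout of an
 * options field or the one-mount-per-line layout of /proc/mounts.
 * This set is an assumption of the sketch. */
static const char OPT_SPECIALS[] = ",= \t\n\\";

/*
 * Hypothetical helper: write ",name=value" into out, emitting every special
 * character of value as a backslash followed by three octal digits.
 * Returns the number of bytes written (excluding the NUL), or -1 if out is
 * too small.
 */
static int show_option_escaped(char *out, size_t outsz,
                               const char *name, const char *value)
{
    size_t pos;
    int n;

    assert(out != NULL && name != NULL && value != NULL);

    n = snprintf(out, outsz, ",%s=", name);
    if (n < 0 || (size_t)n >= outsz)
        return -1;
    pos = (size_t)n;

    for (const char *p = value; *p != '\0'; p++) {
        unsigned char c = (unsigned char)*p;

        if (strchr(OPT_SPECIALS, c) != NULL) {
            /* Four bytes for the escape plus the terminating NUL. */
            if (outsz - pos < 5)
                return -1;
            snprintf(out + pos, outsz - pos, "\\%03o", (unsigned int)c);
            pos += 4;
        } else {
            if (outsz - pos < 2)
                return -1;
            out[pos++] = (char)c;
            out[pos] = '\0';
        }
    }
    return (int)pos;
}
```

With this kind of escaping, the embedded newline in the workdir value from the example above can no longer start a fresh line in the output, so the spoofed "none /proc fuse.pwn" entry would not appear as a separate mount.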
     

04 Aug, 2015

1 commit

  • commit 82cd003a77173c91b9acad8033fb7931dac8d751 upstream.

    struct crush_bucket_tree::num_nodes is u8, so ceph_decode_8_safe()
    should be used. -Wconversion catches this, but I guess it went
    unnoticed in all the noise it spews. The actual problem (at least for
    common crushmaps) isn't the u32 -> u8 truncation though - it's the
    advancement by 4 bytes instead of 1 in the crushmap buffer.

    Fixes: http://tracker.ceph.com/issues/2759

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Josh Durgin
    Signed-off-by: Greg Kroah-Hartman

    Ilya Dryomov
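
The point about buffer advancement in the commit above can be sketched in userspace C. The cursor struct and the decode_u8/decode_u32 helpers below are hypothetical stand-ins for the kernel's decode macros, and the sketch assumes little-endian encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical read cursor over an encoded buffer. */
struct cursor {
    const uint8_t *p;   /* next byte to read */
    const uint8_t *end; /* one past the last valid byte */
};

/* Decode one byte and advance the cursor by 1. Returns 0, or -1 if short. */
static int decode_u8(struct cursor *c, uint8_t *out)
{
    assert(c != NULL && out != NULL);
    if (c->end - c->p < 1)
        return -1;
    *out = c->p[0];
    c->p += 1;
    return 0;
}

/* Decode a little-endian u32 and advance the cursor by 4. Returns 0 or -1. */
static int decode_u32(struct cursor *c, uint32_t *out)
{
    assert(c != NULL && out != NULL);
    if (c->end - c->p < 4)
        return -1;
    *out = (uint32_t)c->p[0] |
           ((uint32_t)c->p[1] << 8) |
           ((uint32_t)c->p[2] << 16) |
           ((uint32_t)c->p[3] << 24);
    c->p += 4;
    return 0;
}
```

For a buffer holding a one-byte count followed by other fields, decoding the count with decode_u32() and truncating can still yield the right count, but the cursor ends up three bytes too far, so every later field is read from the wrong offset.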
     

21 May, 2015

2 commits

  • This reverts commit ba9d114ec5578e6e99a4dfa37ff8ae688040fd64.

    .. which introduced a regression that prevented all lingering requests
    requeued in kick_requests() from ever being sent to the OSDs, resulting
    in a lot of missed notifies. In retrospect it's pretty obvious that
    r_req_lru_item item in the case of lingering requests can be used not
    only for notarget, but also for unsent linkage due to how tightly
    actual map and enqueue operations are coupled in __map_request().

    The assertion that was being silenced is taken care of in the previous
    ("libceph: request a new osdmap if lingering request maps to no osd")
    commit: by always kicking homeless lingering requests we ensure that
    none of them ends up on the notarget list outside of the critical
    section guarded by request_mutex.

    Cc: stable@vger.kernel.org # 3.18+, needs b0494532214b "libceph: request a new osdmap if lingering request maps to no osd"
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     
  • This commit does two things. First, if there are any homeless
    lingering requests, we now request a new osdmap even if the osdmap that
    is being processed brought no changes, i.e. if a given lingering
    request turned homeless in one of the previous epochs and remained
    homeless in the current epoch. Not doing so leaves us with a stale
    osdmap and as a result we may miss our window for reestablishing the
    watch and lose notifies.

    MON=1 OSD=1:

    # cat linger-needmap.sh
    #!/bin/bash
    rbd create --size 1 test
    DEV=$(rbd map test)
    ceph osd out 0
    rbd map dne/dne # obtain a new osdmap as a side effect (!)
    sleep 1
    ceph osd in 0
    rbd resize --size 2 test
    # rbd info test | grep size -> 2M
    # blockdev --getsize $DEV -> 1M

    N.B.: Not obtaining a new osdmap in between "osd out" and "osd in"
    above is enough to make it miss that resize notify, but that is a
    bug^Wlimitation of ceph watch/notify v1.

    Second, homeless lingering requests are now kicked just like those
    lingering requests whose mapping has changed. This is mainly to
    recognize that a homeless lingering request makes no sense and to
    preserve the invariant that a registered lingering request is not
    sitting on any of r_req_lru_item lists. This spares us a WARN_ON,
    which commit ba9d114ec557 ("libceph: clear r_req_lru_item in
    __unregister_linger_request()") tried to fix the _wrong_ way.

    Cc: stable@vger.kernel.org # 3.10+
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     

22 Apr, 2015

3 commits

  • This is an improved straw bucket that correctly avoids any data movement
    between items A and B when neither A nor B's weights are changed. Said
    differently, if we adjust the weight of item C (including adding it anew
    or removing it completely), we will only see inputs move to or from C,
    never between other items in the bucket.

    Notably, there is no intermediate scaling factor that needs to be
    calculated. The mapping function is a simple function of the item weights.

    The commits below were squashed together into this one (mostly to avoid
    adding and then yanking ~6000 lines' worth of crush_ln_table):

    - crush: add a straw2 bucket type
    - crush: add crush_ln to calculate nature log efficently
    - crush: improve straw2 adjustment slightly
    - crush: change crush_ln to provide 32 more digits
    - crush: fix crush_get_bucket_item_weight and bucket destroy for straw2
    - crush/mapper: fix divide-by-0 in straw2
    (with div64_s64() for draw = ln / w and INT64_MIN -> S64_MIN - need
    to create a proper compat.h in ceph.git)

    Reflects ceph.git commits 242293c908e923d474910f2b8203fa3b41eb5a53,
    32a1ead92efcd351822d22a5fc37d159c65c1338,
    6289912418c4a3597a11778bcf29ed5415117ad9,
    35fcb04e2945717cf5cfe150b9fa89cb3d2303a1,
    6445d9ee7290938de1e4ee9563912a6ab6d8ee5f,
    b5921d55d16796e12d66ad2c4add7305f9ce2353.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • CRUSH temporary buffers are sized according to the replica count
    configured by the user. When a rule would select more final OSDs than
    there are replicas, the buffers overlap, which causes a crash. The code
    now ensures that at most num-rep OSDs are selected, even if the rule
    allows more.

    Reflects ceph.git commits 6b4d1aa99718e3b367496326c1e64551330fabc0,
    234b066ba04976783d15ff2abc3e81b6cc06fb10.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Signed-off-by: Ilya Dryomov

    Ilya Dryomov
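
The straw2 commit in this group picks the item with the largest draw = ln(u) / w (see the squashed "fix divide-by-0" note). The following is a floating-point sketch of that idea only. The mix32 hash, the ln_unit series approximation (used here in place of the table-driven crush_ln), and the straw2_choose signature are assumptions for illustration, not the kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 32-bit mixer standing in for the CRUSH hash. */
static uint32_t mix32(uint32_t x, uint32_t item)
{
    uint32_t h = (x * 0x9E3779B1u) ^ (item + 0x85EBCA6Bu);

    h ^= h >> 16;
    h *= 0x7FEB352Du;
    h ^= h >> 15;
    h *= 0x846CA68Bu;
    h ^= h >> 16;
    return h;
}

/* Natural log for 0 < x <= 1 without libm: halve the range, then use
 * ln(x) = 2 * (t + t^3/3 + t^5/5 + ...) with t = (x - 1) / (x + 1). */
static double ln_unit(double x)
{
    double acc = 0.0;

    assert(x > 0.0 && x <= 1.0);
    while (x < 0.5) {
        x *= 2.0;
        acc -= 0.6931471805599453; /* ln 2 */
    }
    double t = (x - 1.0) / (x + 1.0);
    double t2 = t * t, term = t, sum = 0.0;
    for (int k = 1; k < 40; k += 2) {
        sum += term / k;
        term *= t2;
    }
    return acc + 2.0 * sum;
}

/* Return the index of the chosen item for input x, or -1 if all weights
 * are zero. Each item's draw depends only on x, the item and its weight. */
static int straw2_choose(uint32_t x, const uint32_t *weights, int n)
{
    int best = -1;
    double best_draw = 0.0;

    for (int i = 0; i < n; i++) {
        if (weights[i] == 0)
            continue; /* zero weight never wins; also avoids dividing by 0 */
        double u = ((mix32(x, (uint32_t)i) & 0xFFFFu) + 1) / 65536.0;
        double draw = ln_unit(u) / (double)weights[i]; /* always <= 0 */
        if (best < 0 || draw > best_draw) {
            best = i;
            best_draw = draw;
        }
    }
    return best;
}
```

Because an item's draw does not depend on any other item's weight, changing only the weight of item C leaves the draws of A and B untouched, so an input can change owner only if C is the old or the new winner.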
     

20 Apr, 2015

3 commits


08 Apr, 2015

1 commit

  • This reverts commit 89baaa570ab0b476db09408d209578cfed700e9f.

    Dirty page throttling should be sufficient for us in the general case
    so there is no need to use __GFP_MEMALLOC - it would be needed only in
    the swap-over-rbd case, which we currently don't support. (It would
    probably take approximately the commit that is being reverted to add
    that support, but we would also need the "swap" option to distinguish
    from the general case and make sure swap ceph_client-s aren't shared
    with anything else.) See ceph-devel threads [1] and [2] for the
    details of why enabling pfmemalloc reserves for all cases is a bad
    thing.

    On top of potential system lockups related to drained emergency
    reserves, this turned out to cause ceph lockups in case peers are on
    the same host and communicating via loopback due to sk_filter()
    dropping pfmemalloc skbs on the receiving side because the receiving
    loopback socket is not tagged with SOCK_MEMALLOC.

    [1] "SOCK_MEMALLOC vs loopback"
    http://www.spinics.net/lists/ceph-devel/msg22998.html
    [2] "[PATCH] libceph: don't set memalloc flags in loopback case"
    http://www.spinics.net/lists/ceph-devel/msg23392.html

    Conflicts:
    net/ceph/messenger.c [ context: tcp_nodelay option ]

    Cc: Mike Christie
    Cc: Mel Gorman
    Cc: Sage Weil
    Cc: stable@vger.kernel.org # 3.18+, needs backporting
    Signed-off-by: Ilya Dryomov
    Acked-by: Mike Christie
    Acked-by: Mel Gorman

    Ilya Dryomov
     

20 Feb, 2015

1 commit

  • Pull Ceph changes from Sage Weil:
    "On the RBD side, there is a conversion to blk-mq from Christoph,
    several long-standing bug fixes from Ilya, and some cleanup from
    Rickard Strandqvist.

    On the CephFS side there is a long list of fixes from Zheng, including
    improved session handling, a few IO path fixes, some dcache management
    correctness fixes, and several blocking while !TASK_RUNNING fixes.

    The core code gets a few cleanups and Chaitanya has added support for
    TCP_NODELAY (which has been used on the server side for ages but we
    somehow missed on the kernel client).

    There is also an update to MAINTAINERS to fix up some email addresses
    and reflect that Ilya and Zheng are doing most of the maintenance for
    RBD and CephFS these days. Do not be surprised to see a pull request
    come from one of them in the future if I am unavailable for some
    reason"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (27 commits)
    MAINTAINERS: update Ceph and RBD maintainers
    libceph: kfree() in put_osd() shouldn't depend on authorizer
    libceph: fix double __remove_osd() problem
    rbd: convert to blk-mq
    ceph: return error for traceless reply race
    ceph: fix dentry leaks
    ceph: re-send requests when MDS enters reconnecting stage
    ceph: show nocephx_require_signatures and notcp_nodelay options
    libceph: tcp_nodelay support
    rbd: do not treat standalone as flatten
    ceph: fix atomic_open snapdir
    ceph: properly mark empty directory as complete
    client: include kernel version in client metadata
    ceph: provide seperate {inode,file}_operations for snapdir
    ceph: fix request time stamp encoding
    ceph: fix reading inline data when i_size > PAGE_SIZE
    ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_close_sessions)
    ceph: avoid block operation when !TASK_RUNNING (ceph_get_caps)
    ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_sync)
    rbd: fix error paths in rbd_dev_refresh()
    ...

    Linus Torvalds
     

19 Feb, 2015

5 commits

  • a255651d4cad ("ceph: ensure auth ops are defined before use") made
    kfree() in put_osd() conditional on the authorizer. A mechanical
    mistake most likely - fix it.

    Cc: Alex Elder
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil
    Reviewed-by: Alex Elder

    Ilya Dryomov
     
  • It turns out it's possible to get __remove_osd() called twice on the
    same OSD. That doesn't sit well with rb_erase() - depending on the
    shape of the tree we can get a NULL dereference, a soft lockup or
    a random crash at some point in the future as we end up touching freed
    memory. One scenario that I was able to reproduce is as follows:

    con_fault_finish()
        osd_reset()

    ceph_osdc_handle_map()
        kick_requests()
            reset_changed_osds()
                __reset_osd()
                    __remove_osd()

    __kick_osd_requests()
        __reset_osd()
            __remove_osd()

    Cc: stable@vger.kernel.org # 3.9+: 7c6e6fc53e73: libceph: assert both regular and lingering lists in __remove_osd()
    Cc: stable@vger.kernel.org # 3.9+: cc9f1f518cec: libceph: change from BUG to WARN for __remove_osd() asserts
    Cc: stable@vger.kernel.org # 3.9+
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil
    Reviewed-by: Alex Elder

    Ilya Dryomov
     
  • Set the TCP_NODELAY socket option on connection sockets. This
    disables Nagle’s algorithm and improves latency characteristics.
    The tcp_nodelay (default) and notcp_nodelay option flags are provided
    to enable or disable setting the socket option.

    Signed-off-by: Chaitanya Huilgol
    [idryomov@redhat.com: NO_TCP_NODELAY -> TCP_NODELAY, minor adjustments]
    Signed-off-by: Ilya Dryomov

    Chaitanya Huilgol
     
  • Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • On Mon, Dec 22, 2014 at 5:35 PM, Sage Weil wrote:
    > On Mon, 22 Dec 2014, Ilya Dryomov wrote:
    >> Actually, pool op stuff has been unused for over two years - looks like
    >> it was added for rbd create_snap and that got ripped out in 2012. It's
    >> unlikely we'd ever need to manage pools or snaps from the kernel client
    >> so I think it makes sense to nuke it. Sage?
    >
    > Yep!

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
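
For the TCP_NODELAY commit in this group, the userspace equivalent of the socket option can be sketched as follows. This is an illustration using the standard sockets API, not the libceph messenger code. The helper names are hypothetical.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Enable (on != 0) or disable TCP_NODELAY on fd. Returns 0 or -1. */
static int set_tcp_nodelay(int fd, int on)
{
    int val = on ? 1 : 0;

    assert(fd >= 0);
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &val, sizeof(val));
}

/* Return 1 if TCP_NODELAY is set on fd, 0 if not, -1 on error. */
static int get_tcp_nodelay(int fd)
{
    int val = 0;
    socklen_t len = sizeof(val);

    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &val, &len) != 0)
        return -1;
    return val != 0;
}

/* Create a TCP socket with the option applied up front, mirroring an
 * "option on by default, flag to turn it off" policy. Returns fd or -1. */
static int open_tcp_socket(int nodelay)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    if (set_tcp_nodelay(fd, nodelay) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

With the option set, small writes are sent immediately instead of being held back for coalescing, which is the latency improvement the commit message refers to.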
     

12 Feb, 2015

1 commit

  • This allows those get_user_pages calls to pass FAULT_FLAG_ALLOW_RETRY to
    the page fault in order to release the mmap_sem during the I/O.

    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Kirill A. Shutemov
    Cc: Andres Lagar-Cavilla
    Cc: Peter Feiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

09 Jan, 2015

1 commit


18 Dec, 2014

7 commits


14 Nov, 2014

4 commits

  • No reason to use BUG_ON for osd request list assertions.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Alex Elder

    Ilya Dryomov
     
  • kick_requests() can put linger requests on the notarget list. This
    means we need to clear the much-overloaded req->r_req_lru_item in
    __unregister_linger_request() as well, or we get an assertion failure
    in ceph_osdc_release_request() - !list_empty(&req->r_req_lru_item).

    AFAICT the assumption was that registered linger requests cannot be on
    any of req->r_req_lru_item lists, but that's clearly not the case.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Alex Elder

    Ilya Dryomov
     
  • Requests have to be unlinked from both osd->o_requests (normal
    requests) and osd->o_linger_requests (linger requests) lists when
    clearing req->r_osd. Otherwise __unregister_linger_request() gets
    confused and we trip over a !list_empty(&osd->o_linger_requests)
    assert in __remove_osd().

    MON=1 OSD=1:

    # cat remove-osd.sh
    #!/bin/bash
    rbd create --size 1 test
    DEV=$(rbd map test)
    ceph osd out 0
    sleep 3
    rbd map dne/dne # obtain a new osdmap as a side effect
    rbd unmap $DEV & # will block
    sleep 3
    ceph osd in 0

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Alex Elder

    Ilya Dryomov
     
  • Large auth tickets (greater than 32k, the allocation size that
    corresponds to PAGE_ALLOC_COSTLY_ORDER with 4k pages) will have their
    buffers vmalloc'ed, which leads to the following crash in crypto:

    [ 28.685082] BUG: unable to handle kernel paging request at ffffeb04000032c0
    [ 28.686032] IP: [] scatterwalk_pagedone+0x22/0x80
    [ 28.686032] PGD 0
    [ 28.688088] Oops: 0000 [#1] PREEMPT SMP
    [ 28.688088] Modules linked in:
    [ 28.688088] CPU: 0 PID: 878 Comm: kworker/0:2 Not tainted 3.17.0-vm+ #305
    [ 28.688088] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
    [ 28.688088] Workqueue: ceph-msgr con_work
    [ 28.688088] task: ffff88011a7f9030 ti: ffff8800d903c000 task.ti: ffff8800d903c000
    [ 28.688088] RIP: 0010:[] [] scatterwalk_pagedone+0x22/0x80
    [ 28.688088] RSP: 0018:ffff8800d903f688 EFLAGS: 00010286
    [ 28.688088] RAX: ffffeb04000032c0 RBX: ffff8800d903f718 RCX: ffffeb04000032c0
    [ 28.688088] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8800d903f750
    [ 28.688088] RBP: ffff8800d903f688 R08: 00000000000007de R09: ffff8800d903f880
    [ 28.688088] R10: 18df467c72d6257b R11: 0000000000000000 R12: 0000000000000010
    [ 28.688088] R13: ffff8800d903f750 R14: ffff8800d903f8a0 R15: 0000000000000000
    [ 28.688088] FS: 00007f50a41c7700(0000) GS:ffff88011fc00000(0000) knlGS:0000000000000000
    [ 28.688088] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [ 28.688088] CR2: ffffeb04000032c0 CR3: 00000000da3f3000 CR4: 00000000000006b0
    [ 28.688088] Stack:
    [ 28.688088] ffff8800d903f698 ffffffff81392ca8 ffff8800d903f6e8 ffffffff81395d32
    [ 28.688088] ffff8800dac96000 ffff880000000000 ffff8800d903f980 ffff880119b7e020
    [ 28.688088] ffff880119b7e010 0000000000000000 0000000000000010 0000000000000010
    [ 28.688088] Call Trace:
    [ 28.688088] [] scatterwalk_done+0x38/0x40
    [ 28.688088] [] scatterwalk_done+0x38/0x40
    [ 28.688088] [] blkcipher_walk_done+0x182/0x220
    [ 28.688088] [] crypto_cbc_encrypt+0x15f/0x180
    [ 28.688088] [] ? crypto_aes_set_key+0x30/0x30
    [ 28.688088] [] ceph_aes_encrypt2+0x29c/0x2e0
    [ 28.688088] [] ceph_encrypt2+0x93/0xb0
    [ 28.688088] [] ceph_x_encrypt+0x4a/0x60
    [ 28.688088] [] ? ceph_buffer_new+0x5d/0xf0
    [ 28.688088] [] ceph_x_build_authorizer.isra.6+0x297/0x360
    [ 28.688088] [] ? kmem_cache_alloc_trace+0x11b/0x1c0
    [ 28.688088] [] ? ceph_auth_create_authorizer+0x36/0x80
    [ 28.688088] [] ceph_x_create_authorizer+0x63/0xd0
    [ 28.688088] [] ceph_auth_create_authorizer+0x54/0x80
    [ 28.688088] [] get_authorizer+0x80/0xd0
    [ 28.688088] [] prepare_write_connect+0x18b/0x2b0
    [ 28.688088] [] try_read+0x1e59/0x1f10

    This is because we set up crypto scatterlists as if all buffers were
    kmalloc'ed. Fix it.

    Cc: stable@vger.kernel.org
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     

01 Nov, 2014

1 commit

  • Commit c27a3e4d667f ("libceph: do not hard code max auth ticket len"),
    while fixing a buffer overflow, tried to keep as much of the
    surrounding code unchanged as possible and introduced an unnecessary
    kmalloc() in the unencrypted ticket path. It is likely to fail on huge
    tickets, so get rid of it.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     

30 Oct, 2014

1 commit

  • This patch has ceph's lib code use the memalloc flags.

    If the VM layer needs to write data out to free up memory to handle new
    allocation requests, the block layer must be able to make forward progress.
    To handle that requirement we use structs like mempools to reserve memory for
    objects like bios and requests.

    The problem is that when we send/receive block layer requests over the
    network layer, net skb allocations can fail and the system can lock up.
    To solve this, the memalloc related flags were added. NBD, iSCSI
    and NFS use these flags to tell the network/vm layer that it should
    use memory reserves to fulfill allocation requests for structs like
    skbs.

    I am running ceph in a bunch of VMs on my laptop, so this patch was
    not tested very harshly.

    Signed-off-by: Mike Christie
    Reviewed-by: Ilya Dryomov

    Mike Christie
     

15 Oct, 2014

8 commits

  • Pull Ceph updates from Sage Weil:
    "There is the long-awaited discard support for RBD (Guangliang Zhao,
    Josh Durgin), a pile of RBD bug fixes that didn't belong in late -rc's
    (Ilya Dryomov, Li RongQing), a pile of fs/ceph bug fixes and
    performance and debugging improvements (Yan, Zheng, John Spray), and a
    smattering of cleanups (Chao Yu, Fabian Frederick, Joe Perches)"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (40 commits)
    ceph: fix divide-by-zero in __validate_layout()
    rbd: rbd workqueues need a resque worker
    libceph: ceph-msgr workqueue needs a resque worker
    ceph: fix bool assignments
    libceph: separate multiple ops with commas in debugfs output
    libceph: sync osd op definitions in rados.h
    libceph: remove redundant declaration
    ceph: additional debugfs output
    ceph: export ceph_session_state_name function
    ceph: include the initial ACL in create/mkdir/mknod MDS requests
    ceph: use pagelist to present MDS request data
    libceph: reference counting pagelist
    ceph: fix llistxattr on symlink
    ceph: send client metadata to MDS
    ceph: remove redundant code for max file size verification
    ceph: remove redundant io_iter_advance()
    ceph: move ceph_find_inode() outside the s_mutex
    ceph: request xattrs if xattr_version is zero
    rbd: set the remaining discard properties to enable support
    rbd: use helpers to handle discard for layered images correctly
    ...

    Linus Torvalds
     
  • Commit f363e45fd118 ("net/ceph: make ceph_msgr_wq non-reentrant")
    effectively removed WQ_MEM_RECLAIM flag from ceph_msgr_wq. This is
    wrong - libceph is very much a memory reclaim path, so restore it.

    Cc: stable@vger.kernel.org # needs backporting for < 3.12
    Signed-off-by: Ilya Dryomov
    Tested-by: Micha Krause
    Reviewed-by: Sage Weil

    Ilya Dryomov
     
  • For requests with multiple ops, separate ops with commas instead of \t,
    which is a field separator here.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     
  • Bring in missing osd ops and strings, use macros to eliminate multiple
    points of maintenance.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     
  • This allows a pagelist to present data that may be sent multiple times.

    Signed-off-by: Yan, Zheng
    Reviewed-by: Sage Weil

    Yan, Zheng
     
  • queue_work() doesn't "fail to queue", it returns false if work was
    already on a queue, which can't happen here since we allocate
    event_work right before we queue it. So don't bother at all.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Alex Elder

    Ilya Dryomov
     
  • Use the more common pr_warn.

    Other miscellanea:

    o Coalesce formats
    o Realign arguments

    Signed-off-by: Joe Perches
    Signed-off-by: Ilya Dryomov

    Joe Perches
     
  • If the state variable is krealloc'ed successfully, the old
    map->osd_state is freed. If one of the following two reallocations
    then fails, the function exits without resetting map->osd_state, and
    map->osd_state becomes a wild pointer.

    Fix it by resetting each pointer as soon as its krealloc succeeds.

    Signed-off-by: Li RongQing
    Signed-off-by: Ilya Dryomov

    Li RongQing
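
The krealloc fix in the last commit above follows a pattern that applies equally to realloc() in userspace: store each new pointer back as soon as its reallocation succeeds, so an early return never leaves the struct pointing at freed memory. The sketch below uses a hypothetical demo_map struct with three parallel arrays. It is an analogy, not the osdmap code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical map with three parallel arrays of length max. */
struct demo_map {
    uint8_t *state;
    uint32_t *weight;
    uint32_t *addr;
    size_t max;
};

/* Resize all three arrays to new_max entries. Returns 0 on success, -1 on
 * allocation failure. On failure every pointer in map is still valid. */
static int demo_map_set_max(struct demo_map *map, size_t new_max)
{
    uint8_t *state;
    uint32_t *weight, *addr;

    assert(map != NULL && new_max > 0);

    state = realloc(map->state, new_max * sizeof(*state));
    if (state == NULL)
        return -1;
    map->state = state; /* the old block may be gone: record the new one now */

    weight = realloc(map->weight, new_max * sizeof(*weight));
    if (weight == NULL)
        return -1;      /* map->state already points at live memory */
    map->weight = weight;

    addr = realloc(map->addr, new_max * sizeof(*addr));
    if (addr == NULL)
        return -1;
    map->addr = addr;

    map->max = new_max;
    return 0;
}
```

Had the assignments been deferred until all three calls succeeded, a failure in the second or third call would have returned with map->state still holding the address that the first realloc() may already have released.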