10 May, 2017

1 commit

  • Pull ceph updates from Ilya Dryomov:
    "The two main items are support for disabling automatic rbd exclusive
    lock transfers from myself and the long awaited -ENOSPC handling
    series from Jeff.

    The former will allow rbd users to take advantage of exclusive lock's
    built-in blacklist/break-lock functionality while staying in control
    of who owns the lock. With the latter in place, we will abort
    filesystem writes on -ENOSPC instead of having them block
    indefinitely.

    Beyond that we've got the usual pile of filesystem fixes from Zheng,
    some refcount_t conversion patches from Elena and a patch for an
    ancient open() flags handling bug from Alexander"

    * tag 'ceph-for-4.12-rc1' of git://github.com/ceph/ceph-client: (31 commits)
    ceph: fix memory leak in __ceph_setxattr()
    ceph: fix file open flags on ppc64
    ceph: choose readdir frag based on previous readdir reply
    rbd: exclusive map option
    rbd: return ResponseMessage result from rbd_handle_request_lock()
    rbd: kill rbd_is_lock_supported()
    rbd: support updating the lock cookie without releasing the lock
    rbd: store lock cookie
    rbd: ignore unlock errors
    rbd: fix error handling around rbd_init_disk()
    rbd: move rbd_unregister_watch() call into rbd_dev_image_release()
    rbd: move rbd_dev_destroy() call out of rbd_dev_image_release()
    ceph: when seeing write errors on an inode, switch to sync writes
    Revert "ceph: SetPageError() for writeback pages if writepages fails"
    ceph: handle epoch barriers in cap messages
    libceph: add an epoch_barrier field to struct ceph_osd_client
    libceph: abort already submitted but abortable requests when map or pool goes full
    libceph: allow requests to return immediately on full conditions if caller wishes
    libceph: remove req->r_replay_version
    ceph: make seeky readdir more efficient
    ...

    Linus Torvalds
     

09 May, 2017

2 commits

  • CURRENT_TIME is not y2038-safe. The macro will be deleted and all
    references to it replaced by the ktime_get_* APIs.

    struct timespec is also not y2038-safe. Retain timespec for timestamp
    representation here, since ceph uses it internally everywhere; these
    references will be changed to use struct timespec64 in a separate patch.

    The current_fs_time() API is being changed to take a vfs struct inode *
    argument instead of a struct super_block *.

    Set the new mds client request r_stamp field using ktime_get_real_ts()
    instead of using current_fs_time().

    Also, since r_stamp is used as mtime on the server, use timespec_trunc()
    to truncate the timestamp, using the right granularity from the
    superblock.

    This API will be transitioned to be y2038-safe along with the vfs.
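
    The truncation step can be sketched in userspace Python (the helper name
    is illustrative; it mirrors what timespec_trunc() does with the
    superblock's s_time_gran, which is what the patch applies to r_stamp):

```python
def truncate_timestamp(sec, nsec, granularity_ns):
    """Truncate a (seconds, nanoseconds) timestamp to a filesystem's
    granularity, mirroring timespec_trunc(): granularity 1 keeps full
    precision, a whole-second granularity zeroes nsec, and anything in
    between rounds nsec down to a multiple of the granularity."""
    if granularity_ns <= 1:
        return sec, nsec
    if granularity_ns >= 1_000_000_000:
        return sec, 0
    return sec, nsec - (nsec % granularity_ns)
```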

    Link: http://lkml.kernel.org/r/1491613030-11599-5-git-send-email-deepa.kernel@gmail.com
    Signed-off-by: Deepa Dinamani
    Reviewed-by: Arnd Bergmann
    M: Ilya Dryomov
    M: "Yan, Zheng"
    M: Sage Weil
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Deepa Dinamani
     
  • __vmalloc* allows users to provide gfp flags for the underlying
    allocation. This API is quite popular:

    $ git grep "=[[:space:]]__vmalloc\|return[[:space:]]*__vmalloc" | wc -l
    77

    The only problem is that many users are not aware that they really want
    to pass __GFP_HIGHMEM along with the other flags, because there is no
    reason to consume precious low memory on CONFIG_HIGHMEM systems for pages
    that are mapped into the kernel vmalloc space. About half of the users
    don't use this flag, which suggests that the API is unnecessarily
    complex.

    This patch simply uses __GFP_HIGHMEM implicitly when allocating pages to
    be mapped to the vmalloc space. Current users which add __GFP_HIGHMEM
    are simplified and drop the flag.
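
    The shape of the change can be sketched like this (the flag bit values
    are illustrative, not the kernel's actual GFP constants):

```python
# Illustrative flag bits, not the kernel's real GFP values.
GFP_KERNEL = 0x01
GFP_HIGHMEM = 0x02

def vmalloc_gfp(caller_mask):
    """After the change the allocator ORs in __GFP_HIGHMEM itself, so
    callers that forgot the flag get it anyway, and callers that passed
    it explicitly can simply drop it."""
    return caller_mask | GFP_HIGHMEM
```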

    Link: http://lkml.kernel.org/r/20170307141020.29107-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Matthew Wilcox
    Cc: Al Viro
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

04 May, 2017

10 commits

  • As we no longer release the lock before potentially raising BLACKLISTED
    in rbd_reregister_watch(), the "either locked or blacklisted" assert in
    rbd_queue_workfn() needs to go: we can be both locked and blacklisted
    at that point now.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Jason Dillaman

    Ilya Dryomov
     
  • Cephfs can get cap update requests that contain a new epoch barrier in
    them. When that happens we want to pause all OSD traffic until the right
    map epoch arrives.

    Add an epoch_barrier field to ceph_osd_client that is protected by the
    osdc->lock rwsem. When the barrier is set, and the current OSD map
    epoch is below that, pause the request target when submitting the
    request or when revisiting it. Add a way for upper layers (cephfs)
    to update the epoch_barrier as well.

    If we get a new map, compare the new epoch against the barrier before
    kicking requests and request another map if the map epoch is still lower
    than the one we want.

    If we get a map with a full pool, or at quota condition, then set the
    barrier to the current epoch value.
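
    A minimal userspace sketch of that barrier logic (class and method names
    are illustrative, not the kernel structures):

```python
class OsdClient:
    """Sketch of the epoch_barrier bookkeeping described above."""
    def __init__(self, epoch=0):
        self.epoch = epoch        # current OSD map epoch
        self.epoch_barrier = 0

    def update_epoch_barrier(self, barrier):
        # The barrier only ever moves forward.
        if barrier > self.epoch_barrier:
            self.epoch_barrier = barrier

    def handle_full_map(self):
        # Full pool / at-quota map: raise the barrier to the current epoch.
        self.update_epoch_barrier(self.epoch)

    def should_pause(self):
        # Pause request targeting until the map catches up to the barrier.
        return self.epoch < self.epoch_barrier
```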

    Signed-off-by: Jeff Layton
    Reviewed-by: Ilya Dryomov
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • When a Ceph volume hits capacity, a flag is set in the OSD map to
    indicate that, and a new map is sprayed around the cluster. With cephfs
    we want it to shut down any abortable requests that are in progress with
    an -ENOSPC error as they'd just hang otherwise.

    Add a new ceph_osdc_abort_on_full helper function to handle this. It
    will first check whether there is an out-of-space condition in the
    cluster and then walk the tree and abort any request that has
    r_abort_on_full set with a -ENOSPC error. Call this new function
    directly whenever we get a new OSD map.
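
    The walk-and-abort step can be sketched as follows (request dicts stand
    in for struct ceph_osd_request; field names are illustrative):

```python
ENOSPC = 28

def abort_on_full(requests, cluster_full):
    """Walk in-flight requests and complete those that opted in with
    -ENOSPC when the cluster (or their pool) is full."""
    aborted = []
    if not cluster_full:
        return aborted
    for req in requests:
        if req["abort_on_full"] and req["result"] is None:
            req["result"] = -ENOSPC
            aborted.append(req["tid"])
    return aborted
```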

    Signed-off-by: Jeff Layton
    Reviewed-by: Ilya Dryomov
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • Usually, when the osd map is flagged as full or the pool is at quota,
    write requests just hang. This is not what we want for cephfs, where
    it would be better to simply report -ENOSPC back to userland instead
    of stalling.

    If the caller knows that it will want an immediate error return instead
    of blocking on a full or at-quota error condition then allow it to set a
    flag to request that behavior.

    Set that flag in ceph_osdc_new_request (since ceph.ko is the only caller),
    and on any other write request from ceph.ko.

    A later patch will deal with requests that were submitted before the new
    map showing the full condition came in.
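
    The submission-time decision reduces to something like this sketch
    (names are illustrative):

```python
ENOSPC = 28

def submit_write(map_full, pool_full, abort_on_full):
    """With the new flag set, a full condition fails fast with -ENOSPC;
    without it, the request stays queued until a map clears the full
    condition."""
    if (map_full or pool_full) and abort_on_full:
        return -ENOSPC
    return 0  # accepted (may still wait on a full condition later)
```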

    Signed-off-by: Jeff Layton
    Reviewed-by: Ilya Dryomov
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • Nothing uses this anymore with the removal of the ack vs. commit code.
    Remove the field and just encode zeroes into place in the request
    encoding.

    Signed-off-by: Jeff Layton
    Reviewed-by: Ilya Dryomov
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • The refcount_t type and its corresponding API should be used instead
    of atomic_t when a variable serves as a reference counter. This helps
    avoid accidental refcounter overflows that can lead to use-after-free
    situations.
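
    The protection comes from saturating arithmetic, sketched here in
    userspace Python (the saturation value is illustrative):

```python
REFCOUNT_MAX = 2**32 - 1   # illustrative saturation point

def refcount_inc(count):
    """Unlike atomic_t, an overflowing refcount_t saturates instead of
    wrapping around to zero; a wrapped counter is what makes premature
    frees (and therefore use-after-free) possible."""
    if count >= REFCOUNT_MAX:
        return REFCOUNT_MAX   # saturated: leak the object instead of UAF
    return count + 1
```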

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: Ilya Dryomov

    Elena Reshetova
     
  • The refcount_t type and its corresponding API should be used instead
    of atomic_t when a variable serves as a reference counter. This helps
    avoid accidental refcounter overflows that can lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: Ilya Dryomov

    Elena Reshetova
     
  • The refcount_t type and its corresponding API should be used instead
    of atomic_t when a variable serves as a reference counter. This helps
    avoid accidental refcounter overflows that can lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: Ilya Dryomov

    Elena Reshetova
     
  • Add a read-only module parameter, exported via sysfs, so that userspace
    can generate meaningful error messages. It's a bit funky, but there is
    no other libceph-specific place for it.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • No reason to hide CephFS-specific features in the rbd case. Recent
    feature bits mix RADOS and CephFS-specific stuff together anyway.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     

23 Mar, 2017

1 commit

  • sock_alloc_inode() allocates socket+inode and socket_wq with
    GFP_KERNEL, which is not allowed on the writeback path:

    Workqueue: ceph-msgr con_work [libceph]
    ffff8810871cb018 0000000000000046 0000000000000000 ffff881085d40000
    0000000000012b00 ffff881025cad428 ffff8810871cbfd8 0000000000012b00
    ffff880102fc1000 ffff881085d40000 ffff8810871cb038 ffff8810871cb148
    Call Trace:
    [] schedule+0x29/0x70
    [] schedule_timeout+0x1bd/0x200
    [] ? ttwu_do_wakeup+0x2c/0x120
    [] ? ttwu_do_activate.constprop.135+0x66/0x70
    [] wait_for_completion+0xbf/0x180
    [] ? try_to_wake_up+0x390/0x390
    [] flush_work+0x165/0x250
    [] ? worker_detach_from_pool+0xd0/0xd0
    [] xlog_cil_force_lsn+0x81/0x200 [xfs]
    [] ? __slab_free+0xee/0x234
    [] _xfs_log_force_lsn+0x4d/0x2c0 [xfs]
    [] ? lookup_page_cgroup_used+0xe/0x30
    [] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
    [] xfs_log_force_lsn+0x3f/0xf0 [xfs]
    [] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
    [] xfs_iunpin_wait+0xc6/0x1a0 [xfs]
    [] ? wake_atomic_t_function+0x40/0x40
    [] xfs_reclaim_inode+0xa3/0x330 [xfs]
    [] xfs_reclaim_inodes_ag+0x257/0x3d0 [xfs]
    [] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
    [] xfs_fs_free_cached_objects+0x15/0x20 [xfs]
    [] super_cache_scan+0x178/0x180
    [] shrink_slab_node+0x14e/0x340
    [] ? mem_cgroup_iter+0x16b/0x450
    [] shrink_slab+0x100/0x140
    [] do_try_to_free_pages+0x335/0x490
    [] try_to_free_pages+0xb9/0x1f0
    [] ? __alloc_pages_direct_compact+0x69/0x1be
    [] __alloc_pages_nodemask+0x69a/0xb40
    [] alloc_pages_current+0x9e/0x110
    [] new_slab+0x2c5/0x390
    [] __slab_alloc+0x33b/0x459
    [] ? sock_alloc_inode+0x2d/0xd0
    [] ? inet_sendmsg+0x71/0xc0
    [] ? sock_alloc_inode+0x2d/0xd0
    [] kmem_cache_alloc+0x1a2/0x1b0
    [] sock_alloc_inode+0x2d/0xd0
    [] alloc_inode+0x26/0xa0
    [] new_inode_pseudo+0x1a/0x70
    [] sock_alloc+0x1e/0x80
    [] __sock_create+0x95/0x220
    [] sock_create_kern+0x24/0x30
    [] con_work+0xef9/0x2050 [libceph]
    [] ? rbd_img_request_submit+0x4c/0x60 [rbd]
    [] process_one_work+0x159/0x4f0
    [] worker_thread+0x11b/0x530
    [] ? create_worker+0x1d0/0x1d0
    [] kthread+0xc9/0xe0
    [] ? flush_kthread_worker+0x90/0x90
    [] ret_from_fork+0x58/0x90
    [] ? flush_kthread_worker+0x90/0x90

    Use memalloc_noio_{save,restore}() to temporarily force GFP_NOIO here.
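
    The save/restore pattern can be sketched in userspace like this (the
    flag bits and globals are illustrative; the kernel keeps the flag
    per-task and masks __GFP_IO/__GFP_FS from nested allocations):

```python
GFP_IO = 0x40   # illustrative bits
GFP_FS = 0x80

_noio_flag = False   # stands in for the per-task PF_MEMALLOC_NOIO flag

def memalloc_noio_save():
    global _noio_flag
    saved = _noio_flag
    _noio_flag = True
    return saved

def memalloc_noio_restore(saved):
    global _noio_flag
    _noio_flag = saved

def effective_gfp(requested):
    """Inside the save/restore window, nested GFP_KERNEL allocations
    implicitly lose __GFP_IO and __GFP_FS, i.e. behave as GFP_NOIO."""
    return requested & ~(GFP_IO | GFP_FS) if _noio_flag else requested
```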

    Cc: stable@vger.kernel.org # 3.10+, needs backporting
    Link: http://tracker.ceph.com/issues/19309
    Reported-by: Sergey Jerusalimov
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Jeff Layton

    Ilya Dryomov
     

07 Mar, 2017

3 commits

  • osd_request_timeout specifies how many seconds to wait for a response
    from OSDs before returning -ETIMEDOUT from an OSD request. 0 (default)
    means no limit.

    osd_request_timeout is osdkeepalive-precise -- in-flight requests are
    swept through every osdkeepalive seconds. With ack vs commit behaviour
    gone, abort_request() is really simple.
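
    The periodic sweep reduces to something like this sketch (request dicts
    and field names are illustrative):

```python
ETIMEDOUT = 110

def sweep_requests(requests, now, osd_request_timeout):
    """Each osdkeepalive tick, abort in-flight requests older than
    osd_request_timeout seconds; 0 means no limit."""
    timed_out = []
    if osd_request_timeout == 0:
        return timed_out
    for req in requests:
        if req["result"] is None and now - req["start"] >= osd_request_timeout:
            req["result"] = -ETIMEDOUT
            timed_out.append(req["tid"])
    return timed_out
```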

    This is based on a patch from Artur Molchanov.

    Tested-by: Artur Molchanov
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     
  • Since ceph.git commit 4e28f9e63644 ("osd/OSDMap: clear osd_info,
    osd_xinfo on osd deletion"), weight is set to IN when OSD is deleted.
    This changes the result of applying an incremental for clients, not
    just OSDs. Because CRUSH computations are obviously affected,
    pre-4e28f9e63644 servers disagree with post-4e28f9e63644 clients on
    object placement, resulting in misdirected requests.

    Mirrors ceph.git commit a6009d1039a55e2c77f431662b3d6cc5a8e8e63f.

    Fixes: 930c53286977 ("libceph: apply new_state before new_up_client on incrementals")
    Link: http://tracker.ceph.com/issues/19122
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     
  • Older (shorter) CRUSH maps need to be finalized too.

    Fixes: 66a0e2d579db ("crush: remove mutable part of CRUSH map")
    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     

04 Mar, 2017

1 commit

  • Pull sched.h split-up from Ingo Molnar:
    "The point of these changes is to significantly reduce the
    header footprint, to speed up the kernel build and to
    have a cleaner header structure.

    After these changes the new <linux/sched.h>'s typical preprocessed
    size goes down from a previous ~0.68 MB (~22K lines) to ~0.45 MB (~15K
    lines), which is around 40% faster to build on typical configs.

    Not much changed from the last version (-v2) posted three weeks ago: I
    eliminated quirks, backmerged fixes plus I rebased it to an upstream
    SHA1 from yesterday that includes most changes queued up in -next plus
    all sched.h changes that were pending from Andrew.

    I've re-tested the series both on x86 and on cross-arch defconfigs,
    and did a bisectability test at a number of random points.

    I tried to test as many build configurations as possible, but some
    build breakage is probably still left - but it should be mostly
    limited to architectures that have no cross-compiler binaries
    available on kernel.org, and non-default configurations"

    * 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (146 commits)
    sched/headers: Clean up
    sched/headers: Remove #ifdefs from
    sched/headers: Remove the include from
    sched/headers, hrtimer: Remove the include from
    sched/headers, x86/apic: Remove the header inclusion from
    sched/headers, timers: Remove the include from
    sched/headers: Remove from
    sched/headers: Remove from
    sched/core: Remove unused prefetch_stack()
    sched/headers: Remove from
    sched/headers: Remove the 'init_pid_ns' prototype from
    sched/headers: Remove from
    sched/headers: Remove from
    sched/headers: Remove the runqueue_is_locked() prototype
    sched/headers: Remove from
    sched/headers: Remove from
    sched/headers: Remove from
    sched/headers: Remove from
    sched/headers: Remove the include from
    sched/headers: Remove from
    ...

    Linus Torvalds
     

03 Mar, 2017

1 commit

  • Pull vfs sendmsg updates from Al Viro:
    "More sendmsg work.

    This is fairly separate, isolated stuff (there's a continuation
    around lustre, but that one was too late to soak in -next), hence the
    separate pull request"

    * 'work.sendmsg' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    ncpfs: switch to sock_sendmsg()
    ncpfs: don't mess with manually advancing iovec on send
    ncpfs: sendmsg does *not* bugger iovec these days
    ceph_tcp_sendpage(): use ITER_BVEC sendmsg
    afs_send_pages(): use ITER_BVEC
    rds: remove dead code
    ceph: switch to sock_recvmsg()
    usbip_recv(): switch to sock_recvmsg()
    iscsi_target: deal with short writes on the tx side
    [nbd] pass iov_iter to nbd_xmit()
    [nbd] switch sock_xmit() to sock_{send,recv}msg()
    [drbd] use sock_sendmsg()

    Linus Torvalds
     

01 Mar, 2017

1 commit

  • Pull ceph updates from Ilya Dryomov:
    "This time around we have:

    - support for rbd data-pool feature, which enables rbd images on
    erasure-coded pools (myself). CEPH_PG_MAX_SIZE has been bumped to
    allow erasure-coded profiles with k+m up to 32.

    - a patch for ceph_d_revalidate() performance regression introduced
    in 4.9, along with some cleanups in the area (Jeff Layton)

    - a set of fixes for unsafe ->d_parent accesses in CephFS (Jeff
    Layton)

    - buffered reads are now processed in rsize windows instead of rasize
    windows (Andreas Gerstmayr). The new default for rsize mount option
    is 64M.

    - ack vs commit distinction is gone, greatly simplifying ->fsync()
    and MOSDOpReply handling code (myself)

    ... also a few filesystem bug fixes from Zheng, a CRUSH sync up (CRUSH
    computations are still serialized though) and several minor fixes and
    cleanups all over"

    * tag 'ceph-for-4.11-rc1' of git://github.com/ceph/ceph-client: (52 commits)
    libceph, rbd, ceph: WRITE | ONDISK -> WRITE
    libceph: get rid of ack vs commit
    ceph: remove special ack vs commit behavior
    ceph: tidy some white space in get_nonsnap_parent()
    crush: fix dprintk compilation
    crush: do is_out test only if we do not collide
    ceph: remove req from unsafe list when unregistering it
    rbd: constify device_type structure
    rbd: kill obj_request->object_name and rbd_segment_name_cache
    rbd: store and use obj_request->object_no
    rbd: RBD_V{1,2}_DATA_FORMAT macros
    rbd: factor out __rbd_osd_req_create()
    rbd: set offset and length outside of rbd_obj_request_create()
    rbd: support for data-pool feature
    rbd: introduce rbd_init_layout()
    rbd: use rbd_obj_bytes() more
    rbd: remove now unused rbd_obj_request_wait() and helpers
    rbd: switch rbd_obj_method_sync() to ceph_osdc_call()
    libceph: pass reply buffer length through ceph_osdc_call()
    rbd: do away with obj_request in rbd_obj_read_sync()
    ...

    Linus Torvalds
     

25 Feb, 2017

2 commits

  • CEPH_OSD_FLAG_ONDISK is set in account_request().

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Jeff Layton
    Reviewed-by: Sage Weil

    Ilya Dryomov
     
  • - CEPH_OSD_FLAG_ACK shouldn't be set anymore, so assert on it
    - remove support for handling ack replies (OSDs will send ack replies
    only if clients request them)
    - drop the "do lingering callbacks under osd->lock" logic from
    handle_reply() -- lreq->lock is sufficient in all three cases

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Jeff Layton
    Reviewed-by: Sage Weil

    Ilya Dryomov
     

21 Feb, 2017

1 commit

  • Pull locking updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Implement wraparound-safe refcount_t and kref_t types based on
    generic atomic primitives (Peter Zijlstra)

    - Improve and fix the ww_mutex code (Nicolai Hähnle)

    - Add self-tests to the ww_mutex code (Chris Wilson)

    - Optimize percpu-rwsems with the 'rcuwait' mechanism (Davidlohr
    Bueso)

    - Micro-optimize the current-task logic all around the core kernel
    (Davidlohr Bueso)

    - Tidy up after recent optimizations: remove stale code and APIs,
    clean up the code (Waiman Long)

    - ... plus misc fixes, updates and cleanups"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
    fork: Fix task_struct alignment
    locking/spinlock/debug: Remove spinlock lockup detection code
    lockdep: Fix incorrect condition to print bug msgs for MAX_LOCKDEP_CHAIN_HLOCKS
    lkdtm: Convert to refcount_t testing
    kref: Implement 'struct kref' using refcount_t
    refcount_t: Introduce a special purpose refcount type
    sched/wake_q: Clarify queue reinit comment
    sched/wait, rcuwait: Fix typo in comment
    locking/mutex: Fix lockdep_assert_held() fail
    locking/rtmutex: Flip unlikely() branch to likely() in __rt_mutex_slowlock()
    locking/rwsem: Reinit wake_q after use
    locking/rwsem: Remove unnecessary atomic_long_t casts
    jump_labels: Move header guard #endif down where it belongs
    locking/atomic, kref: Implement kref_put_lock()
    locking/ww_mutex: Turn off __must_check for now
    locking/atomic, kref: Avoid more abuse
    locking/atomic, kref: Use kref_get_unless_zero() more
    locking/atomic, kref: Kill kref_sub()
    locking/atomic, kref: Add kref_read()
    locking/atomic, kref: Add KREF_INIT()
    ...

    Linus Torvalds
     

14 Jan, 2017

1 commit

  • Since we need to change the implementation, stop exposing internals.

    Provide kref_read() to read the current reference count; typically
    used for debug messages.

    Kills two anti-patterns:

    atomic_read(&kref->refcount)
    kref->refcount.counter
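
    The resulting API surface can be sketched as (a userspace illustration,
    not the kernel implementation):

```python
class Kref:
    """Sketch of the kref API once kref_read() hides the counter: callers
    read the count through an accessor instead of poking
    kref->refcount.counter directly."""
    def __init__(self):
        self._refcount = 1
    def get(self):
        self._refcount += 1
    def put(self):
        self._refcount -= 1
        return self._refcount == 0   # True: last reference dropped
    def read(self):
        # Debug/diagnostic accessor, replacing atomic_read(&kref->refcount).
        return self._refcount
```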

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Greg Kroah-Hartman
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

15 Dec, 2016

2 commits

  • Kill the wrapper and rename __finish_request() to finish_request().

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • r_safe_completion is currently, and has always been, signaled only if
    on-disk ack was requested. It's there for fsync and syncfs, which wait
    for in-flight writes to flush - all data write requests set ONDISK.

    However, the pool perm check code introduced in 4.2 sends a write
    request with only ACK set. An unfortunately timed syncfs can then hang
    forever: r_safe_completion won't be signaled because only an unsafe
    reply was requested.

    We could patch ceph_osdc_sync() to skip !ONDISK write requests, but
    that is somewhat incomplete and yet another special case. Instead,
    rename this completion to r_done_completion and always signal it when
    the OSD client is done with the request, whether unsafe, safe, or
    error. This is a bit cleaner and helps with the cancellation code.

    Reported-by: Yan, Zheng
    Signed-off-by: Ilya Dryomov

    Ilya Dryomov