03 Oct, 2016

1 commit


28 Jul, 2016

6 commits


26 May, 2016

4 commits


05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement the
    page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE, and it's a constant source of confusion as to whether
    PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
    especially on the border between fs and mm.

    Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in the page cache are special. They
    are not.

    The changes are pretty straightforward:

    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    the script below. For some reason, coccinelle doesn't patch header
    files. I've called spatch for them manually.

    The only adjustment after coccinelle is reverting the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds
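The shift rules in the semantic patch only make sense because the two macro families were already identical. A minimal userspace sketch (all macros reimplemented here purely for illustration, assuming a 4 KiB page) shows why rewriting `E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)` to plain `E` is behavior-preserving:

```c
#include <assert.h>

/* Hypothetical userspace reimplementation of the kernel macros involved,
 * for illustration only; values mirror a 4 KiB page. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))
#define PAGE_ALIGN(x) (((x) + PAGE_SIZE - 1) & PAGE_MASK)

/* Since the page cache uses plain pages, PAGE_CACHE_SHIFT == PAGE_SHIFT,
 * so the shift amount in E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) is zero and
 * the expression is just E -- exactly what the coccinelle rules emit. */
#define PAGE_CACHE_SHIFT PAGE_SHIFT
```

The same reasoning covers the `>>` rule, and the remaining rules are pure renames of equal-valued macros and refcounting helpers.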


05 Mar, 2016

1 commit


03 Nov, 2015

1 commit


09 Sep, 2015

1 commit


25 Jun, 2015

8 commits

  • Previously our dcache readdir code relied on child dentries in a
    directory dentry's d_subdir list being sorted by dentry offset in
    descending order. When adding dentries to the dcache, if a dentry
    already exists, our readdir code moves it to the head of the directory
    dentry's d_subdir list. This design relies on dcache internals.
    Al Viro suggested using ncpfs's approach: keep an array of pointers
    to dentries in the page cache of the directory inode. The validity of
    those pointers is indicated by the directory inode's complete and
    ordered flags. When a dentry gets pruned, we clear the directory
    inode's complete flag in the d_prune() callback. Before moving a
    dentry to another directory, we clear the ordered flag for both the
    old and new directories.

    Signed-off-by: Yan, Zheng
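A minimal userspace sketch of the flag scheme described above (struct layout and names are hypothetical, not the actual fs/ceph code): a directory caches an array of child pointers, and readdir may consult it only while both validity flags hold.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct dentry { const char *name; };

/* Hypothetical directory state: a cached array of child pointers whose
 * validity is gated by two flags. */
struct dir_inode {
    struct dentry *cache[16];  /* in the real code, kept in the dir's page cache */
    size_t n;
    bool complete;             /* every child is present in the array */
    bool ordered;              /* array order matches readdir order */
};

/* d_prune() callback: a cached dentry is going away, so the array is no
 * longer a complete view of the directory. */
static void on_prune(struct dir_inode *dir)
{
    dir->complete = false;
}

/* Cross-directory rename: ordering can no longer be trusted in either dir. */
static void on_move(struct dir_inode *from, struct dir_inode *to)
{
    from->ordered = false;
    to->ordered = false;
}

/* Readdir may use the cached array only while both flags hold; otherwise it
 * falls back to asking the MDS. */
static bool can_use_cache(const struct dir_inode *dir)
{
    return dir->complete && dir->ordered;
}
```

The point of the design is that validity is tracked explicitly in the inode rather than inferred from d_subdir ordering, so it no longer depends on dcache internals.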

  • So we know the TID of the oldest pending cap flush. A later patch will
    send this information to the MDS, so that the MDS can trim its
    completed cap flush list.

    Tracking pending cap flushes globally also simplifies the syncfs code.

    Signed-off-by: Yan, Zheng

  • Previously we did not track an accurate TID for flushing caps. When
    the MDS fails over, we have no choice but to re-send all flushing caps
    with a new TID. This can cause problems because the MDS may have
    already flushed some caps and issued the same caps to another client.
    The re-sent cap flush has a new TID, which makes the MDS unable to
    detect whether it has already processed the cap flush.

    This patch adds code to track pending cap flushes accurately. When
    re-sending a cap flush is needed, we use its original flush TID.

    Signed-off-by: Yan, Zheng
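The idea in the two commits above can be sketched in a few lines of userspace C (all names and structures hypothetical): a TID is assigned once when a flush is first sent, a resend reuses that TID so the MDS can deduplicate, and the oldest pending TID is what gets reported to the MDS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_FLUSHES 8

/* One pending cap flush, remembered with the TID it was first sent under. */
struct cap_flush { uint64_t tid; int caps; };

struct client {
    uint64_t next_tid;
    struct cap_flush pending[MAX_FLUSHES];  /* no bounds handling: sketch only */
    size_t n;
};

static uint64_t start_flush(struct client *c, int caps)
{
    struct cap_flush *f = &c->pending[c->n++];
    f->tid = ++c->next_tid;  /* TID assigned exactly once, at first send */
    f->caps = caps;
    return f->tid;
}

/* Re-send after MDS failover: the original TID is reused, so the MDS can
 * recognize a flush it has already processed. */
static uint64_t resend_flush(const struct cap_flush *f)
{
    return f->tid;
}

/* The oldest pending TID -- the value the later patch sends to the MDS so
 * it can trim its completed-flush list. */
static uint64_t oldest_pending_tid(const struct client *c)
{
    uint64_t oldest = 0;
    for (size_t i = 0; i < c->n; i++)
        if (oldest == 0 || c->pending[i].tid < oldest)
            oldest = c->pending[i].tid;
    return oldest;
}
```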

  • There are currently three libceph-level timeouts that the user can
    specify on mount: mount_timeout, osd_idle_ttl and osdkeepalive. All of
    these are in seconds and no checking is done on user input: negative
    values are accepted, we multiply them all by HZ which may or may not
    overflow, arbitrarily large jiffies then get added together, etc.

    There is also a bug in the way mount_timeout=0 is handled. It's
    supposed to mean "infinite timeout", but that's not how wait.h APIs
    treat it and so __ceph_open_session() for example will busy loop
    without much chance of being interrupted if none of ceph-mons are
    there.

    Fix all this by verifying user input, storing timeouts capped by
    msecs_to_jiffies() in jiffies and using the new ceph_timeout_jiffies()
    helper for all user-specified waits to handle infinite timeouts
    correctly.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Alex Elder
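The two fixes can be illustrated with a small userspace sketch. The helper names and HZ value here are illustrative only; the real code stores timeouts via msecs_to_jiffies() and uses the new ceph_timeout_jiffies() helper.

```c
#include <assert.h>
#include <limits.h>

/* Illustrative stand-ins for kernel constants. */
#define HZ 250
#define MAX_JIFFY_OFFSET ((LONG_MAX >> 1) - 1)

/* Convert user-supplied seconds to jiffies, saturating instead of
 * overflowing -- the problem with the old "multiply by HZ" approach. */
static long secs_to_jiffies_capped(unsigned long secs)
{
    if (secs >= (unsigned long)(MAX_JIFFY_OFFSET / HZ))
        return MAX_JIFFY_OFFSET;
    return (long)(secs * HZ);
}

/* Mirrors the role of ceph_timeout_jiffies(): a stored 0 means "infinite",
 * expressed as the longest timeout the wait APIs accept, so waiters sleep
 * interruptibly instead of busy-looping. */
static long timeout_jiffies(long timeout)
{
    return timeout ? timeout : MAX_JIFFY_OFFSET;
}
```

Capping at conversion time means every later addition of jiffies values stays within range, and routing all waits through one helper gives mount_timeout=0 a single, correct meaning.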

  • setfilelock requests can block for a long time, which can prevent the
    client from advancing its oldest TID.

    Signed-off-by: Yan, Zheng

  • Previously we pre-allocated cap release messages for each cap. This
    wastes a lot of memory when there is a large number of caps. This
    patch makes the code not pre-allocate cap release messages. Instead,
    we add the corresponding ceph_cap struct to a list when releasing a
    cap. Later, when flushing cap releases is needed, we allocate the cap
    release messages dynamically.

    Signed-off-by: Yan, Zheng
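A userspace sketch of the deferred scheme described above (struct layout and function names are hypothetical): releasing a cap only links it onto a per-session list, and messages are built later, when the list is flushed.

```c
#include <assert.h>
#include <stddef.h>

struct ceph_cap { int id; struct ceph_cap *next; };

struct session {
    struct ceph_cap *release_list;  /* caps waiting to be released */
    int num_releases;
};

/* Releasing a cap allocates nothing: the cap struct itself is queued. */
static void queue_cap_release(struct session *s, struct ceph_cap *cap)
{
    cap->next = s->release_list;
    s->release_list = cap;
    s->num_releases++;
}

/* Flush time: drain the list, building release messages on demand.
 * Returns how many caps were drained. */
static int flush_cap_releases(struct session *s)
{
    int sent = 0;
    while (s->release_list) {
        struct ceph_cap *cap = s->release_list;
        s->release_list = cap->next;
        s->num_releases--;
        /* ...allocate/append to a release message here (omitted)... */
        sent++;
    }
    return sent;
}
```

The memory win is that idle caps cost nothing: allocation happens only for caps actually being released, and only at flush time.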

  • Signed-off-by: Yan, Zheng

  • Signed-off-by: Yan, Zheng


19 Feb, 2015

2 commits


18 Dec, 2014

3 commits


15 Oct, 2014

2 commits


06 Jun, 2014

1 commit

  • We recently modified the client/MDS protocol to include a timestamp in the
    client request. This allows ctime updates to follow the client's clock
    in most cases, which avoids subtle problems when clocks are out of sync
    and timestamps are updated sometimes by the MDS clock (for most requests)
    and sometimes by the client clock (for cap writeback).

    Signed-off-by: Sage Weil


05 Apr, 2014

1 commit


21 Jan, 2014

1 commit


24 Nov, 2013

1 commit


01 Mar, 2013

1 commit

  • Pull Ceph updates from Sage Weil:
    "A few groups of patches here. Alex has been hard at work improving
    the RBD code, laying groundwork for understanding the new formats and
    doing layering. Most of the infrastructure is now in place for the
    final bits that will come with the next window.

    There are a few changes to the data layout. Jim Schutt's patch fixes
    some non-ideal CRUSH behavior, and a set of patches from me updates
    the client to speak a newer version of the protocol and implement an
    improved hashing strategy across storage nodes (when the server side
    supports it too).

    A pair of patches from Sam Lang fix the atomicity of open+create
    operations. Several patches from Yan, Zheng fix various mds/client
    issues that turned up during multi-mds torture tests.

    A final set of patches expose file layouts via virtual xattrs, and
    allow the policies to be set on directories via xattrs as well
    (avoiding the awkward ioctl interface and providing a consistent
    interface for both kernel mount and ceph-fuse users)."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (143 commits)
    libceph: add support for HASHPSPOOL pool flag
    libceph: update osd request/reply encoding
    libceph: calculate placement based on the internal data types
    ceph: update support for PGID64, PGPOOL3, OSDENC protocol features
    ceph: update "ceph_features.h"
    libceph: decode into cpu-native ceph_pg type
    libceph: rename ceph_pg -> ceph_pg_v1
    rbd: pass length, not op for osd completions
    rbd: move rbd_osd_trivial_callback()
    libceph: use a do..while loop in con_work()
    libceph: use a flag to indicate a fault has occurred
    libceph: separate non-locked fault handling
    libceph: encapsulate connection backoff
    libceph: eliminate sparse warnings
    ceph: eliminate sparse warnings in fs code
    rbd: eliminate sparse warnings
    libceph: define connection flag helpers
    rbd: normalize dout() calls
    rbd: barriers are hard
    rbd: ignore zero-length requests
    ...

    Linus Torvalds
     

12 Feb, 2013

1 commit


18 Jan, 2013

1 commit

  • The mds now sends back a created inode if the create request
    performed the create. If the file already existed, no inode is
    returned in the reply. This allows ceph to set the created flag
    in atomic_open so that permissions are properly checked in the case
    that the file wasn't created by the create call to the mds.

    To ensure compatibility with previous kernels, a feature for sending
    back the inode in the create reply was added, so that the mds will
    only send back the inode if the client indicates it supports the
    feature.

    Signed-off-by: Sam Lang
    Reviewed-by: Sage Weil
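The feature negotiation described above amounts to a simple gate on the server side. In this sketch the flag name and bit value are purely illustrative; the real client advertises a protocol feature bit telling the MDS it can decode an inode in the create reply.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bit -- illustrative value only. */
#define FEATURE_REPLY_CREATE_INODE (1ULL << 27)

struct create_reply { bool has_inode; };

/* Include the created inode only when the create actually happened AND the
 * client advertised that it can decode it. */
static void fill_create_reply(uint64_t client_features, bool did_create,
                              struct create_reply *r)
{
    r->has_inode = did_create &&
                   (client_features & FEATURE_REPLY_CREATE_INODE) != 0;
}
```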


17 May, 2012

1 commit

  • The definitions for the ceph_mds_session and ceph_osd both contain
    five fields related only to "authorizers." Encapsulate those fields
    into their own struct type, allowing for better isolation in some
    upcoming patches.

    Fix the #includes in "linux/ceph/osd_client.h" to lay out their more
    complete canonical path.

    Signed-off-by: Alex Elder
    Reviewed-by: Sage Weil


03 Feb, 2012

1 commit

  • Lockdep was reporting a possible circular lock dependency in
    dentry_lease_is_valid(). That function needs to sample the
    session's s_cap_gen and s_cap_ttl fields coherently, but needs
    to do so while holding a dentry lock. The s_cap_lock field was
    being used to protect the two fields, but that can't be taken while
    holding a lock on a dentry within the session.

    In most cases, the s_cap_gen and s_cap_ttl fields only get operated
    on separately. But in three cases they need to be updated together.
    Implement a new lock to protect the spots where updating both fields
    atomically is required.

    Signed-off-by: Alex Elder
    Reviewed-by: Sage Weil
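A userspace sketch of the fix (field and function names hypothetical, with a pthread mutex standing in for the kernel spinlock): one dedicated lock guards exactly the pair of fields that must be read and written coherently, and because it protects nothing else, it can safely nest inside a dentry lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

struct session {
    pthread_mutex_t gen_ttl_lock;  /* guards cap_gen + cap_ttl only */
    uint32_t cap_gen;
    unsigned long cap_ttl;
};

/* One of the few places that must update both fields together. */
static void renew_caps(struct session *s, unsigned long new_ttl)
{
    pthread_mutex_lock(&s->gen_ttl_lock);
    s->cap_gen++;
    s->cap_ttl = new_ttl;
    pthread_mutex_unlock(&s->gen_ttl_lock);
}

/* Safe to call while holding a dentry lock: the dedicated lock has no
 * other dependencies, so no circular ordering can arise. */
static void sample_gen_ttl(struct session *s, uint32_t *gen,
                           unsigned long *ttl)
{
    pthread_mutex_lock(&s->gen_ttl_lock);
    *gen = s->cap_gen;
    *ttl = s->cap_ttl;
    pthread_mutex_unlock(&s->gen_ttl_lock);
}
```

The design choice here is to shrink the lock's footprint rather than rework the wider lock ordering: a lock that covers only these two fields cannot participate in the cycle lockdep found.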


08 Dec, 2011

1 commit

  • We have been using i_lock to protect all kinds of data structures in the
    ceph_inode_info struct, including lists of inodes that we need to iterate
    over while avoiding races with inode destruction. That requires grabbing
    a reference to the inode with the list lock protected, but igrab() now
    takes i_lock to check the inode flags.

    Changing the list lock ordering would be a painful process.

    However, using a ceph-specific i_ceph_lock in the ceph inode instead of
    i_lock is a simple mechanical change and avoids the ordering constraints
    imposed by igrab().

    Reported-by: Amon Ott
    Signed-off-by: Sage Weil
