05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement the
    page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely that it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion whether a
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();
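
    The substitutions listed above can be sketched in user space as
    follows. This is an illustration only: the 4 KiB page size and the
    index_before(), index_after() and offset_to_index() helpers are
    assumptions made for the example, not code from the patch. It shows
    that the removed shift is a shift by zero once PAGE_CACHE_SHIFT is
    the same value as PAGE_SHIFT.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed for illustration: 4 KiB pages. The kernel takes these from arch headers. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))
#define PAGE_ALIGN(x) (((x) + PAGE_SIZE - 1) & PAGE_MASK)

/* Old-style alias; in this sketch it is simply the same value. */
#define PAGE_CACHE_SHIFT PAGE_SHIFT

/* Before: "convert" a page-cache index to a page index (a shift by zero). */
static unsigned long index_before(unsigned long pgoff)
{
	return pgoff << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
}

/* After: the conversion disappears. */
static unsigned long index_after(unsigned long pgoff)
{
	return pgoff;
}

/* Byte offset to page index, written with the plain PAGE_* names. */
static unsigned long offset_to_index(uint64_t pos)
{
	assert(PAGE_CACHE_SHIFT == PAGE_SHIFT);
	return (unsigned long)(pos >> PAGE_SHIFT);
}
```

    Calling index_before() and index_after() with the same argument
    yields the same value, which is why the first two rules can drop
    the shift expression entirely.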

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

26 Mar, 2016

4 commits

  • Now __ceph_open_session() only accepts a closed client. An opened
    client will trigger BUG_ON().

    Signed-off-by: Yan, Zheng

    Yan, Zheng
     
  • ceph_empty_snapc->num_snaps == 0 at all times. Passing such a snapc to
    ceph_osdc_alloc_request() (possibly through ceph_osdc_new_request()) is
    equivalent to passing NULL, as ceph_osdc_alloc_request() uses it only
    for sizing the request message.

    Further, in all four cases the subsequent ceph_osdc_build_request() is
    passed NULL for snapc, meaning that 0 is encoded for seq and num_snaps
    and making ceph_empty_snapc entirely useless. The two cases where it
    actually mattered were removed in commits 860560904962 ("ceph: avoid
    sending unnessesary FLUSHSNAP message") and 23078637e054 ("ceph: fix
    queuing inode to mdsdir's snaprealm").

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Yan, Zheng

    Ilya Dryomov
     
  • When the rbytes mount option is enabled, directory size is the
    recursive size. Recursive size is not updated instantly. This can cause
    directory size to change between successive stat(1) calls.

    Signed-off-by: Yan, Zheng

    Yan, Zheng
     
  • It is currently hard-coded in the mon_client that mdsmap and monmap
    subs are continuous, while osdmap sub is always "onetime". To better
    handle full clusters/pools in the osd_client, we need to be able to
    issue continuous osdmap subs. Revamp subs code to allow us to specify
    for each sub whether it should be continuous or not.

    Although not strictly required for the above, switch to SUBSCRIBE2
    protocol while at it, eliminating the ambiguity between a request for
    "every map since X" and a request for "just the latest" when we don't
    have a map yet (i.e. have epoch 0). SUBSCRIBE2 feature bit is now
    required - it's been supported since pre-argonaut (2010).

    Move "got mdsmap" call to the end of ceph_mdsc_handle_map() - calling
    it before we validate the epoch and successfully install the new map
    can mess up mon_client sub state.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     

15 Jan, 2016

1 commit

  • Mark those kmem allocations that are known to be easily triggered from
    userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
    memcg. For the list, see below:

    - threadinfo
    - task_struct
    - task_delay_info
    - pid
    - cred
    - mm_struct
    - vm_area_struct and vm_region (nommu)
    - anon_vma and anon_vma_chain
    - signal_struct
    - sighand_struct
    - fs_struct
    - files_struct
    - fdtable and fdtable->full_fds_bits
    - dentry and external_name
    - inode for all filesystems. This is the most tedious part, because
    most filesystems overwrite the alloc_inode method.

    The list is far from complete, so feel free to add more objects.
    Nevertheless, it should be close to "account everything" approach and
    keep most workloads within bounds. Malevolent users will be able to
    breach the limit, but this was possible even with the former "account
    everything" approach (simply because it did not account everything in
    fact).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

12 Sep, 2015

1 commit

  • Pull Ceph update from Sage Weil:
    "There are a few fixes for snapshot behavior with CephFS and support
    for the new keepalive protocol from Zheng, a libceph fix that affects
    both RBD and CephFS, a few bug fixes and cleanups for RBD from Ilya,
    and several small fixes and cleanups from Jianpeng and others"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
    ceph: improve readahead for file holes
    ceph: get inode size for each append write
    libceph: check data_len in ->alloc_msg()
    libceph: use keepalive2 to verify the mon session is alive
    rbd: plug rbd_dev->header.object_prefix memory leak
    rbd: fix double free on rbd_dev->header_name
    libceph: set 'exists' flag for newly up osd
    ceph: cleanup use of ceph_msg_get
    ceph: no need to get parent inode in ceph_open
    ceph: remove the useless judgement
    ceph: remove redundant test of head->safe and silence static analysis warnings
    ceph: fix queuing inode to mdsdir's snaprealm
    libceph: rename con_work() to ceph_con_workfn()
    libceph: Avoid holding the zero page on ceph_msgr_slab_init errors
    libceph: remove the unused macro AES_KEY_SIZE
    ceph: invalidate dirty pages after forced umount
    ceph: EIO all operations after forced umount

    Linus Torvalds
     

09 Sep, 2015

1 commit


05 Sep, 2015

1 commit

  • Many file systems that implement the show_options hook fail to correctly
    escape their output which could lead to unescaped characters (e.g. new
    lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files. This
    could lead to confusion, spoofed entries (resulting in things like
    systemd issuing false d-bus "mount" notifications), and who knows what
    else. This looks like it would only be the root user stepping on
    themselves, but it's possible weird things could happen in containers or
    in other situations with delegated mount privileges.

    Here's an example using overlay with setuid fusermount trusting the
    contents of /proc/mounts (via the /etc/mtab symlink). Imagine the use
    of "sudo" is something more sneaky:

    $ BASE="ovl"
    $ MNT="$BASE/mnt"
    $ LOW="$BASE/lower"
    $ UP="$BASE/upper"
    $ WORK="$BASE/work/ 0 0
    none /proc fuse.pwn user_id=1000"
    $ mkdir -p "$LOW" "$UP" "$WORK"
    $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
    $ cat /proc/mounts
    none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0
    none /proc fuse.pwn user_id=1000 0 0
    $ fusermount -u /proc
    $ cat /proc/mounts
    cat: /proc/mounts: No such file or directory

    This fixes the problem by adding new seq_show_option and
    seq_show_option_n helpers, and updating the vulnerable show_option
    handlers to use them as needed. Some, like SELinux, need to be open
    coded due to unusual existing escape mechanisms.
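
    The kind of escaping involved can be sketched in user space as
    follows. This is not the kernel's seq_show_option(); the
    escape_option() helper, its signature, and the choice of octal
    escapes are assumptions made for the illustration. The caller
    passes the set of characters to escape, for example comma, space,
    tab, newline and backslash.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical user-space helper: copy src into dst, replacing every
 * character listed in esc with a backslash followed by three octal digits.
 * Returns the length written, or -1 if dst is too small.
 */
static int escape_option(char *dst, size_t size, const char *src, const char *esc)
{
	size_t used = 0;

	assert(dst && src && esc);
	for (; *src; src++) {
		unsigned char c = (unsigned char)*src;

		if (strchr(esc, c)) {
			/* Need room for 4 characters plus the terminator. */
			if (used + 4 >= size)
				return -1;
			snprintf(dst + used, size - used, "\\%03o", (unsigned int)c);
			used += 4;
		} else {
			if (used + 1 >= size)
				return -1;
			dst[used++] = (char)c;
		}
	}
	if (used >= size)
		return -1;
	dst[used] = '\0';
	return (int)used;
}
```

    With such escaping, the embedded newline in the workdir value from
    the example above would be emitted as a literal backslash sequence
    and could no longer start a second, spoofed line.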

    [akpm@linux-foundation.org: add lost chunk, per Kees]
    [keescook@chromium.org: seq_show_option should be using const parameters]
    Signed-off-by: Kees Cook
    Acked-by: Serge Hallyn
    Acked-by: Jan Kara
    Acked-by: Paul Moore
    Cc: J. R. Okajima
    Signed-off-by: Kees Cook
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

25 Jun, 2015

3 commits

  • Signed-off-by: Yan, Zheng

    Yan, Zheng
     
  • There are currently three libceph-level timeouts that the user can
    specify on mount: mount_timeout, osd_idle_ttl and osdkeepalive. All of
    these are in seconds and no checking is done on user input: negative
    values are accepted, we multiply them all by HZ which may or may not
    overflow, arbitrarily large jiffies then get added together, etc.

    There is also a bug in the way mount_timeout=0 is handled. It's
    supposed to mean "infinite timeout", but that's not how wait.h APIs
    treat it and so __ceph_open_session() for example will busy loop
    without much chance of being interrupted if none of ceph-mons are
    there.

    Fix all this by verifying user input, storing timeouts capped by
    msecs_to_jiffies() in jiffies and using the new ceph_timeout_jiffies()
    helper for all user-specified waits to handle infinite timeouts
    correctly.
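
    A minimal user-space sketch of the validation and of the "0 means
    infinite" handling described above. The names parse_timeout(),
    timeout_for_wait(), DEMO_HZ and DEMO_INFINITE, the tick rate, and
    the exact upper bound are assumptions for the illustration and are
    not taken from the patch.

```c
#include <assert.h>
#include <limits.h>

#define DEMO_HZ 250                 /* assumed tick rate, hypothetical */
#define DEMO_INFINITE LONG_MAX      /* stand-in for "wait forever" */

/*
 * Validate a timeout given in seconds and convert it to ticks.
 * Returns 0 on success, -1 for negative or out-of-range input.
 * A value of 0 is stored as 0 and means "no timeout".
 */
static int parse_timeout(long seconds, unsigned long *ticks)
{
	assert(ticks);
	if (seconds < 0 || seconds > INT_MAX / 1000)
		return -1;
	*ticks = (unsigned long)seconds * DEMO_HZ;
	return 0;
}

/* Map the stored value to what a wait primitive should receive. */
static long timeout_for_wait(unsigned long ticks)
{
	return ticks ? (long)ticks : DEMO_INFINITE;
}
```

    Keeping the conversion in one helper means every waiter treats a
    stored 0 the same way instead of passing it on as a zero-length
    wait.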

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Alex Elder

    Ilya Dryomov
     
  • Signed-off-by: Yan, Zheng

    Yan, Zheng
     

27 Apr, 2015

1 commit

  • Pull fourth vfs update from Al Viro:
    "d_inode() annotations from David Howells (sat in for-next since before
    the beginning of merge window) + four assorted fixes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    RCU pathwalk breakage when running into a symlink overmounting something
    fix I_DIO_WAKEUP definition
    direct-io: only inc/dec inode->i_dio_count for file systems
    fs/9p: fix readdir()
    VFS: assorted d_backing_inode() annotations
    VFS: fs/inode.c helpers: d_inode() annotations
    VFS: fs/cachefiles: d_backing_inode() annotations
    VFS: fs library helpers: d_inode() annotations
    VFS: assorted weird filesystems: d_inode() annotations
    VFS: normal filesystems (and lustre): d_inode() annotations
    VFS: security/: d_inode() annotations
    VFS: security/: d_backing_inode() annotations
    VFS: net/: d_inode() annotations
    VFS: net/unix: d_backing_inode() annotations
    VFS: kernel/: d_inode() annotations
    VFS: audit: d_backing_inode() annotations
    VFS: Fix up some ->d_inode accesses in the chelsio driver
    VFS: Cachefiles should perform fs modifications on the top layer only
    VFS: AF_UNIX sockets should call mknod on the top layer only

    Linus Torvalds
     

20 Apr, 2015

3 commits

  • Don't pollute /proc/mounts with default options (presently these are
    dcache, nofsc and acl). Leave the acl/noacl however - it's a bit of
    a special case due to CONFIG_CEPH_FS_POSIX_ACL.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Split ceph_show_options() into two pieces and move the piece
    responsible for printing client (libceph) options into net/ceph. This
    way people adding a libceph option wouldn't have to remember to update
    code in fs/ceph.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Currently, the results of kstrdup() for r_path2, r_path1 and
    snapdir_name are not checked at various locations, although the
    allocation can fail under memory pressure. Return ENOMEM where
    the checks were missing.
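
    The pattern can be sketched in user space with malloc() standing
    in for kstrdup(). The demo_request structure and the demo_strdup()
    and demo_set_paths() helpers are hypothetical; the negative errno
    return follows the usual kernel convention.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

struct demo_request {       /* hypothetical stand-in for a request with two paths */
	char *path1;
	char *path2;
};

/* User-space stand-in for kstrdup(): returns NULL when allocation fails. */
static char *demo_strdup(const char *s)
{
	size_t len = strlen(s) + 1;
	char *copy = malloc(len);

	if (copy)
		memcpy(copy, s, len);
	return copy;
}

/* Duplicate both paths; on allocation failure undo and return -ENOMEM. */
static int demo_set_paths(struct demo_request *req, const char *p1, const char *p2)
{
	assert(req && p1 && p2);
	req->path1 = demo_strdup(p1);
	if (!req->path1)
		return -ENOMEM;
	req->path2 = demo_strdup(p2);
	if (!req->path2) {
		free(req->path1);
		req->path1 = NULL;
		return -ENOMEM;
	}
	return 0;
}
```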

    Signed-off-by: Sanidhya Kashyap
    Signed-off-by: Yan, Zheng

    Sanidhya Kashyap
     

16 Apr, 2015

1 commit


20 Feb, 2015

1 commit

  • Pull Ceph changes from Sage Weil:
    "On the RBD side, there is a conversion to blk-mq from Christoph,
    several long-standing bug fixes from Ilya, and some cleanup from
    Rickard Strandqvist.

    On the CephFS side there is a long list of fixes from Zheng, including
    improved session handling, a few IO path fixes, some dcache management
    correctness fixes, and several blocking while !TASK_RUNNING fixes.

    The core code gets a few cleanups and Chaitanya has added support for
    TCP_NODELAY (which has been used on the server side for ages but we
    somehow missed on the kernel client).

    There is also an update to MAINTAINERS to fix up some email addresses
    and reflect that Ilya and Zheng are doing most of the maintenance for
    RBD and CephFS these days. Do not be surprised to see a pull request
    come from one of them in the future if I am unavailable for some
    reason"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (27 commits)
    MAINTAINERS: update Ceph and RBD maintainers
    libceph: kfree() in put_osd() shouldn't depend on authorizer
    libceph: fix double __remove_osd() problem
    rbd: convert to blk-mq
    ceph: return error for traceless reply race
    ceph: fix dentry leaks
    ceph: re-send requests when MDS enters reconnecting stage
    ceph: show nocephx_require_signatures and notcp_nodelay options
    libceph: tcp_nodelay support
    rbd: do not treat standalone as flatten
    ceph: fix atomic_open snapdir
    ceph: properly mark empty directory as complete
    client: include kernel version in client metadata
    ceph: provide seperate {inode,file}_operations for snapdir
    ceph: fix request time stamp encoding
    ceph: fix reading inline data when i_size > PAGE_SIZE
    ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_close_sessions)
    ceph: avoid block operation when !TASK_RUNNING (ceph_get_caps)
    ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_sync)
    rbd: fix error paths in rbd_dev_refresh()
    ...

    Linus Torvalds
     

19 Feb, 2015

1 commit


21 Jan, 2015

2 commits

  • Now that default_backing_dev_info is not used for writeback purposes we can
    get rid of it easily:

    - instead of using its name for tracing unregistered bdi we just use
    "unknown"
    - btrfs and ceph can just assign the default read ahead window themselves
    like several other filesystems already do.
    - we can assign noop_backing_dev_info as the default one in alloc_super.
    All filesystems already either assigned their own or
    noop_backing_dev_info.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Tejun Heo
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • bdi_destroy already does all the work, and if we delay freeing the
    anon bdev we can get away with just that single call.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

18 Dec, 2014

3 commits


08 Aug, 2014

1 commit

  • There are a few d_obtain_alias callers that are using it to get the
    root of a filesystem which may already have an alias somewhere else.

    This is not the same as the filehandle-lookup case, and none of them
    actually need DCACHE_DISCONNECTED set.

    It isn't really a serious problem, but it would really be clearer if we
    reserved DCACHE_DISCONNECTED for those cases where it's actually needed.

    In the btrfs case this was causing a spurious printk from
    nfsd/nfsfh.c:fh_verify when it found an unexpected DCACHE_DISCONNECTED
    dentry. Josef worked around this by unsetting DCACHE_DISCONNECTED
    manually in 3a0dfa6a12e "Btrfs: unset DCACHE_DISCONNECTED when mounting
    default subvol", and this replaces that workaround.

    Cc: Josef Bacik
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Al Viro

    J. Bruce Fields
     

05 Apr, 2014

1 commit

  • flock and posix lock should use fl->fl_file instead of process ID
    as owner identifier. (posix lock uses fl->fl_owner. fl->fl_owner
    is usually equal to fl->fl_file, but it also can be a customized
    value). The process ID of who holds the lock is just for F_GETLK
    fcntl(2).

    The fix is to rename the 'pid' fields of struct ceph_mds_request_args
    and struct ceph_filelock to 'owner', rename 'pid_namespace' fields
    to 'pid'. Assign fl->fl_file to the 'owner' field of lock messages.
    We also set the most significant bit of the 'owner' field. MDS can
    use that bit to distinguish between old and new clients.

    The MDS counterpart of this patch modifies the flock code to not
    take the 'pid_namespace' into consideration when checking conflict
    locks.
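
    The owner encoding described above can be sketched as follows.
    The DEMO_OWNER_NEW_CLIENT name and both helpers are hypothetical;
    only the idea of using the file pointer as owner and setting the
    most significant bit comes from the description.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_OWNER_NEW_CLIENT (1ULL << 63)   /* hypothetical name for the marker bit */

/* Build the 64-bit owner value that would be sent with a lock message. */
static uint64_t demo_lock_owner(const void *file)
{
	assert(file);
	return (uint64_t)(uintptr_t)file | DEMO_OWNER_NEW_CLIENT;
}

/* What a server could check to tell old and new clients apart. */
static int demo_is_new_client(uint64_t owner)
{
	return (owner & DEMO_OWNER_NEW_CLIENT) != 0;
}
```

    Two locks taken through the same open file get the same owner
    value, while a small process ID sent by an old client never has
    the top bit set.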

    Signed-off-by: Yan, Zheng
    Reviewed-by: Sage Weil

    Yan, Zheng
     

18 Feb, 2014

1 commit

  • Make the 'acl' option dependent on having ACL support compiled in. Make
    the 'noacl' option work even without it so that one can always ask it to
    be off and not error out on mount when it is not supported.
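
    A sketch of the resulting behaviour, assuming a hypothetical
    DEMO_HAVE_ACL build-time switch in place of the real kernel config
    option and a hypothetical demo_parse_acl_option() helper: 'acl' is
    only recognized when support is built in, 'noacl' always is.

```c
#include <assert.h>
#include <string.h>

/* Define DEMO_HAVE_ACL at build time to model a build with ACL support. */

#define DEMO_FLAG_ACL 0x1u

/* Returns 0 if the option was accepted, -1 if it is unknown. */
static int demo_parse_acl_option(const char *opt, unsigned int *flags)
{
	assert(opt && flags);
#ifdef DEMO_HAVE_ACL
	if (strcmp(opt, "acl") == 0) {
		*flags |= DEMO_FLAG_ACL;
		return 0;
	}
#endif
	if (strcmp(opt, "noacl") == 0) {
		*flags &= ~DEMO_FLAG_ACL;
		return 0;
	}
	return -1;
}
```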

    Signed-off-by: Guangliang Zhao
    Signed-off-by: Sage Weil

    Sage Weil
     

01 Jan, 2014

2 commits


14 Dec, 2013

1 commit

  • A positive dentry and its corresponding inode always accompany each other
    in an MDS reply, so there is no need to keep the inode in the cache after
    dropping all its aliases.

    Signed-off-by: Yan, Zheng
    Reviewed-by: Sage Weil

    Yan, Zheng
     

07 Sep, 2013

1 commit

  • Add support for fscache to the Ceph filesystem. This brings it on par
    with some of the other network filesystems in Linux (like NFS, AFS, etc.).

    In order to mount the filesystem with fscache the 'fsc' mount option must be
    passed.

    Signed-off-by: Milosz Tanski
    Signed-off-by: Sage Weil

    Milosz Tanski
     

04 Jul, 2013

1 commit

  • When mounting ceph with a dev name that starts with a slash, ceph
    would attempt to access the character before that slash. Since we
    don't actually own that byte of memory, we would trigger an
    invalid access:

    [ 43.499934] BUG: unable to handle kernel paging request at ffff880fa3a97fff
    [ 43.500984] IP: [] parse_mount_options+0x1a4/0x300
    [ 43.501491] PGD 743b067 PUD 10283c4067 PMD 10282a6067 PTE 8000000fa3a97060
    [ 43.502301] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    [ 43.503006] Dumping ftrace buffer:
    [ 43.503596] (ftrace buffer empty)
    [ 43.504046] CPU: 0 PID: 10879 Comm: mount Tainted: G W 3.10.0-sasha #1129
    [ 43.504851] task: ffff880fa625b000 ti: ffff880fa3412000 task.ti: ffff880fa3412000
    [ 43.505608] RIP: 0010:[] [] parse_mount_options$
    [ 43.506552] RSP: 0018:ffff880fa3413d08 EFLAGS: 00010286
    [ 43.507133] RAX: ffff880fa3a98000 RBX: ffff880fa3a98000 RCX: 0000000000000000
    [ 43.507893] RDX: ffff880fa3a98001 RSI: 000000000000002f RDI: ffff880fa3a98000
    [ 43.508610] RBP: ffff880fa3413d58 R08: 0000000000001f99 R09: ffff880fa3fe64c0
    [ 43.509426] R10: ffff880fa3413d98 R11: ffff880fa38710d8 R12: ffff880fa3413da0
    [ 43.509792] R13: ffff880fa3a97fff R14: 0000000000000000 R15: ffff880fa3413d90
    [ 43.509792] FS: 00007fa9c48757e0(0000) GS:ffff880fd2600000(0000) knlGS:000000000000$
    [ 43.509792] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [ 43.509792] CR2: ffff880fa3a97fff CR3: 0000000fa3bb9000 CR4: 00000000000006b0
    [ 43.509792] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 43.509792] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    [ 43.509792] Stack:
    [ 43.509792] 0000e5180000000e ffffffff85ca1900 ffff880fa38710d8 ffff880fa3413d98
    [ 43.509792] 0000000000000120 0000000000000000 ffff880fa3a98000 0000000000000000
    [ 43.509792] ffffffff85cf32a0 0000000000000000 ffff880fa3413dc8 ffffffff818f3c72
    [ 43.509792] Call Trace:
    [ 43.509792] [] ceph_mount+0xa2/0x390
    [ 43.509792] [] ? pcpu_alloc+0x334/0x3c0
    [ 43.509792] [] mount_fs+0x8d/0x1a0
    [ 43.509792] [] ? __alloc_percpu+0x10/0x20
    [ 43.509792] [] vfs_kern_mount+0x79/0x100
    [ 43.509792] [] do_new_mount+0xcd/0x1c0
    [ 43.509792] [] do_mount+0x15d/0x210
    [ 43.509792] [] ? strndup_user+0x45/0x60
    [ 43.509792] [] SyS_mount+0x9d/0xe0
    [ 43.509792] [] tracesys+0xdd/0xe2
    [ 43.509792] Code: 4c 8b 5d c0 74 0a 48 8d 50 01 49 89 14 24 eb 17 31 c0 48 83 c9 ff $
    [ 43.509792] RIP [] parse_mount_options+0x1a4/0x300
    [ 43.509792] RSP
    [ 43.509792] CR2: ffff880fa3a97fff
    [ 43.509792] ---[ end trace 22469cd81e93af51 ]---
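
    The class of bug can be sketched in user space as follows. The
    demo_split_dev_name() helper is hypothetical and simplified (it
    requires a path to be present); it is not the kernel's
    parse_mount_options(). The point is the bounds check before
    reading the byte in front of the slash.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: split "hosts:/path" into the length of the host list
 * and a pointer to the path. Returns 0 on success, -1 on malformed input.
 */
static int demo_split_dev_name(const char *dev_name, size_t *hosts_len,
			       const char **path)
{
	const char *slash;

	assert(dev_name && hosts_len && path);
	slash = strchr(dev_name, '/');
	if (!slash)
		return -1;
	/*
	 * The ':' separator must sit just before the '/'. Make sure such a
	 * byte exists inside the string before reading it.
	 */
	if (slash == dev_name || slash[-1] != ':')
		return -1;
	*hosts_len = (size_t)(slash - 1 - dev_name);
	*path = slash;
	return 0;
}
```

    Without the slash == dev_name test, a name such as "/mnt" would
    make the code read the byte before the start of the buffer, which
    is the access reported in the trace above.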

    Signed-off-by: Sasha Levin
    Reviewed-by: Sage Weil

    Sasha Levin
     

02 May, 2013

1 commit

  • In create_fs_client() a memory pool is set up to be used for arrays of
    pages that might be needed in ceph_writepages_start() if memory is
    tight. There are two problems with the way it's initialized:
    - The size provided is the number of pages we want in the
    array, but it should be the number of bytes required for
    that many page pointers.
    - The number of pages computed can end up being 0, while we
    will always need at least one page.

    This patch fixes both of these problems.

    This resolves the two simple problems defined in:
    http://tracker.ceph.com/issues/4603
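
    The two fixes amount to the following arithmetic, shown here as a
    user-space sketch. The helper names, the wsize parameter and the
    4 KiB page size are assumptions for the illustration.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SHIFT 12     /* assumed 4 KiB pages */

/* Number of page pointers needed to cover wsize bytes, never less than one. */
static size_t demo_page_count(size_t wsize)
{
	size_t n = wsize >> DEMO_PAGE_SHIFT;

	return n ? n : 1;
}

/* Bytes needed for an array holding that many page pointers. */
static size_t demo_pagevec_bytes(size_t wsize)
{
	size_t n = demo_page_count(wsize);

	assert(n <= (size_t)-1 / sizeof(void *));
	return n * sizeof(void *);
}
```

    Passing demo_page_count() where demo_pagevec_bytes() is expected
    reproduces the first problem: the pool element would be sized in
    pages rather than in bytes of pointer storage.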

    Signed-off-by: Alex Elder
    Reviewed-by: Josh Durgin

    Alex Elder
     

04 Mar, 2013

1 commit

  • Modify the request_module to prefix the file system type with "fs-"
    and add aliases to all of the filesystems that can be built as modules
    to match.
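
    The naming scheme can be sketched in user space as follows. The
    demo_fs_module_name() helper is hypothetical; in the kernel the
    prefixed name is handed to request_module().

```c
#include <assert.h>
#include <stdio.h>

/*
 * Build the module name to request for a filesystem type.
 * Returns 0 on success, -1 if the result does not fit in buf.
 */
static int demo_fs_module_name(char *buf, size_t size, const char *fstype)
{
	int n;

	assert(buf && fstype);
	n = snprintf(buf, size, "fs-%s", fstype);
	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

    For the type "ceph" this produces "fs-ceph", so only a module that
    declares a matching alias can be loaded by a mount request.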

    A common practice is to build all of the kernel code and leave code
    that is not commonly needed as modules, with the result that many
    users are exposed to any bug anywhere in the kernel.

    Looking for filesystems with a fs- prefix limits the pool of possible
    modules that can be loaded by mount to just filesystems, trivially
    making things safer with no real cost.

    Using aliases means user space can control the policy of which
    filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
    with blacklist and alias directives. This allows simple, safe,
    well-understood workarounds for known problematic software.

    This also addresses a rare but unfortunate problem where the filesystem
    name is not the same as its module name and module auto-loading
    would not work. While writing this patch I saw a handful of such
    cases, the most significant being autofs, which lives in the module
    autofs4.

    This is relevant to user namespaces because we can reach the request
    module in get_fs_type() without having any special permissions, and
    people get uncomfortable when a user specified string (in this case
    the filesystem type) goes all of the way to request_module.

    After having looked at this issue I don't think there is any
    particular reason to perform any filtering or permission checks beyond
    making it clear in the module request that we want a filesystem
    module. The common pattern in the kernel is to call request_module()
    without regard to the user's permissions. In general all a filesystem
    module does once loaded is call register_filesystem() and go to sleep,
    which means there is not much attack surface exposed by loading a
    filesystem module unless the filesystem is mounted. In a user
    namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
    which most filesystems do not set today.

    Acked-by: Serge Hallyn
    Acked-by: Kees Cook
    Reported-by: Kees Cook
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

23 Feb, 2013

1 commit

  • Different versions of glibc are broken in different ways, but the short of
    it is that for the time being, frsize should == bsize, and be used as the
    multiple for the blocks, free, and available fields. This mirrors what is
    done for NFS. The previous reporting of the page size for frsize meant
    that newer glibc and df would report a very small value for the fs size.
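
    The reporting rule can be sketched as follows. The demo_statfs
    structure, the demo_fill_statfs() helper and the 1 MiB block size
    are assumptions for the illustration; only "frsize == bsize, and
    all block counts use that unit" comes from the description.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_BLOCK_SHIFT 20    /* assumed 1 MiB reporting unit, hypothetical */

struct demo_statfs {           /* the subset of statfs fields discussed above */
	unsigned long bsize;
	unsigned long frsize;
	uint64_t blocks;
	uint64_t bfree;
	uint64_t bavail;
};

/* Fill the structure from cluster totals given in KiB. */
static void demo_fill_statfs(struct demo_statfs *buf, uint64_t total_kb,
			     uint64_t avail_kb)
{
	assert(buf);
	buf->bsize  = 1UL << DEMO_BLOCK_SHIFT;
	buf->frsize = buf->bsize;                       /* frsize == bsize */
	buf->blocks = total_kb >> (DEMO_BLOCK_SHIFT - 10);
	buf->bfree  = avail_kb >> (DEMO_BLOCK_SHIFT - 10);
	buf->bavail = buf->bfree;
}
```

    A consumer that computes blocks * frsize then gets the real
    capacity, regardless of whether it multiplies by frsize or by
    bsize.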

    Fixes http://tracker.ceph.com/issues/3793.

    Signed-off-by: Sage Weil
    Reviewed-by: Greg Farnum

    Sage Weil
     

21 Dec, 2012

1 commit

  • Pull Ceph update from Sage Weil:
    "There are a few different groups of commits here. The largest is
    Alex's ongoing work to enable the coming RBD features (cloning,
    striping). There is some cleanup in libceph that goes along with it.

    Cyril and David have fixed some problems with NFS reexport (leaking
    dentries and page locks), and there is a batch of patches from Yan
    fixing problems with the fs client when running against a clustered
    MDS. There are a few bug fixes mixed in for good measure, many of
    which will be going to the stable trees once they're upstream.

    My apologies for the late pull. There is still a gremlin in the rbd
    map/unmap code and I was hoping to include the fix for that as well,
    but we haven't been able to confirm the fix is correct yet; I'll send
    that in a separate pull once it's nailed down."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (68 commits)
    rbd: get rid of rbd_{get,put}_dev()
    libceph: register request before unregister linger
    libceph: don't use rb_init_node() in ceph_osdc_alloc_request()
    libceph: init event->node in ceph_osdc_create_event()
    libceph: init osd->o_node in create_osd()
    libceph: report connection fault with warning
    libceph: socket can close in any connection state
    rbd: don't use ENOTSUPP
    rbd: remove linger unconditionally
    rbd: get rid of RBD_MAX_SEG_NAME_LEN
    libceph: avoid using freed osd in __kick_osd_requests()
    ceph: don't reference req after put
    rbd: do not allow remove of mounted-on image
    libceph: Unlock unprocessed pages in start_read() error path
    ceph: call handle_cap_grant() for cap import message
    ceph: Fix __ceph_do_pending_vmtruncate
    ceph: Don't add dirty inode to dirty list if caps is in migration
    ceph: Fix infinite loop in __wake_requests
    ceph: Don't update i_max_size when handling non-auth cap
    bdi_register: add __printf verification, fix arg mismatch
    ...

    Linus Torvalds
     

13 Dec, 2012

2 commits

  • __printf is useful to verify format and arguments.
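
    What the annotation buys can be sketched in user space with the
    compiler attribute that the kernel macro wraps. The demo_printf()
    macro and demo_format() function are hypothetical names; GCC and
    Clang are assumed.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Same shape as the kernel macro; GCC and Clang understand the attribute. */
#define demo_printf(fmt_idx, first_arg) \
	__attribute__((format(printf, fmt_idx, first_arg)))

/* Format into buf; the attribute lets the compiler check every caller. */
static demo_printf(3, 4) int demo_format(char *buf, size_t size,
					  const char *fmt, ...)
{
	va_list ap;
	int n;

	assert(buf && fmt);
	va_start(ap, fmt);
	n = vsnprintf(buf, size, fmt, ap);
	va_end(ap);
	return n;
}
```

    With the attribute in place, a call such as demo_format(buf, size,
    "%s", 42) draws a format warning at compile time when format
    checking is enabled, instead of misbehaving at run time.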

    Signed-off-by: Joe Perches
    Reviewed-by: Alex Elder

    Joe Perches
     
  • This would reset a connection with any OSD that had an outstanding
    request that was taking more than N seconds. The idea was that if the
    OSD was buggy, the client could compensate by resending the request.

    In reality, this only served to hide server bugs, and we haven't
    actually seen such a bug in quite a while. Moreover, the userspace
    client code never did this.

    More importantly, often the request is taking a long time because the
    OSD is trying to recover, or overloaded, and killing the connection
    and retrying would only make the situation worse by giving the OSD
    more work to do.

    Signed-off-by: Sage Weil
    Reviewed-by: Alex Elder

    Sage Weil
     

08 Oct, 2012

1 commit

  • Pull ceph updates from Sage Weil:
    "The bulk of this pull is a series from Alex that refactors and cleans
    up the RBD code to lay the groundwork for supporting the new image
    format and evolving feature set. There are also some cleanups in
    libceph, and for ceph there's fixed validation of file striping
    layouts and a bugfix in the code handling a shrinking MDS cluster."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (71 commits)
    ceph: avoid 32-bit page index overflow
    ceph: return EIO on invalid layout on GET_DATALOC ioctl
    rbd: BUG on invalid layout
    ceph: propagate layout error on osd request creation
    libceph: check for invalid mapping
    ceph: convert to use le32_add_cpu()
    ceph: Fix oops when handling mdsmap that decreases max_mds
    rbd: update remaining header fields for v2
    rbd: get snapshot name for a v2 image
    rbd: get the snapshot context for a v2 image
    rbd: get image features for a v2 image
    rbd: get the object prefix for a v2 rbd image
    rbd: add code to get the size of a v2 rbd image
    rbd: lay out header probe infrastructure
    rbd: encapsulate code that gets snapshot info
    rbd: add an rbd features field
    rbd: don't use index in __rbd_add_snap_dev()
    rbd: kill create_snap sysfs entry
    rbd: define rbd_dev_image_id()
    rbd: define some new format constants
    ...

    Linus Torvalds
     

03 Oct, 2012

1 commit

  • There's no reason to call rcu_barrier() on every
    deactivate_locked_super(). We only need to make sure that all delayed rcu
    free inodes are flushed before we destroy related cache.

    Removing rcu_barrier() from deactivate_locked_super() affects some fast
    paths. E.g. on my machine exit_group() of a last process in IPC
    namespace takes 0.07538s. rcu_barrier() takes 0.05188s of that time.

    Signed-off-by: Kirill A. Shutemov
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Kirill A. Shutemov