28 May, 2016

1 commit

  • smack ->d_instantiate() uses ->setxattr(), so to be able to call it before
    we'd hashed the new dentry and attached it to inode, we need ->setxattr()
    instances getting the inode as an explicit argument rather than obtaining
    it from dentry.

    Similar change for ->getxattr() had been done in commit ce23e64. Unlike
    ->getxattr() (which is used by both selinux and smack instances of
    ->d_instantiate()) ->setxattr() is used only by smack one and unfortunately
    it got missed back then.

    Reported-by: Seung-Woo Kim
    Tested-by: Casey Schaufler
    Signed-off-by: Al Viro

    Al Viro
     

18 May, 2016

1 commit

  • Pull vfs cleanups from Al Viro:
    "More cleanups from Christoph"

    * 'work.preadv2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    nfsd: use RWF_SYNC
    fs: add RWF_DSYNC aand RWF_SYNC
    ceph: use generic_write_sync
    fs: simplify the generic_write_sync prototype
    fs: add IOCB_SYNC and IOCB_DSYNC
    direct-io: remove the offset argument to dio_complete
    direct-io: eliminate the offset argument to ->direct_IO
    xfs: eliminate the pos variable in xfs_file_dio_aio_write
    filemap: remove the pos argument to generic_file_direct_write
    filemap: remove pos variables in generic_file_read_iter

    Linus Torvalds
     

17 May, 2016

1 commit


03 May, 2016

2 commits


02 May, 2016

2 commits


25 Apr, 2016

1 commit

  • fuse_get_user_pages() should return error or 0. Otherwise fuse_direct_io
    read will not return 0 to indicate that read has completed.

    Fixes: 742f992708df ("fuse: return patrial success from fuse_direct_io()")
    Signed-off-by: Ashish Samant
    Signed-off-by: Seth Forshee
    Signed-off-by: Miklos Szeredi

    Ashish Samant
     

11 Apr, 2016

1 commit


05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement the
    page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely that it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion on whether a
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();
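
A minimal userspace sketch of what the renames above mean for typical index and offset arithmetic. It assumes 4 KiB pages and redefines the PAGE_* macros locally only so that it builds outside the kernel; the helper names are hypothetical and not taken from the patch.

```c
/* Assumption: 4 KiB pages. In the kernel these come from <asm/page.h>;
 * they are redefined here only so the sketch builds in userspace. */
#undef PAGE_SHIFT
#undef PAGE_SIZE
#undef PAGE_MASK
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))

/* Was: pos >> PAGE_CACHE_SHIFT. Page index of a file position. */
static unsigned long demo_page_index(unsigned long long pos)
{
    return (unsigned long)(pos >> PAGE_SHIFT);
}

/* Was: pos & ~PAGE_CACHE_MASK. Offset of a file position within its page. */
static unsigned long demo_page_offset(unsigned long long pos)
{
    return (unsigned long)(pos & ~PAGE_MASK);
}
```

Since PAGE_CACHE_SHIFT was always equal to PAGE_SHIFT, the results are the same before and after the rename.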

    This patch contains automated changes generated with coccinelle using
    the script below. For some reason, coccinelle doesn't patch header
    files. I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

16 Mar, 2016

1 commit

  • If a user calls writev/readv in direct io mode with partially valid data
    in the iovec array such that any vector other than the first one in the
    array contains invalid data, we currently return the error for the invalid
    iovec.

    Instead, we should return the number of bytes already written/read and not
    the error as we do in the non direct_io case.
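
The intended return-value rule can be sketched in userspace as follows. This is a simplified model under stated assumptions, not the fuse code: `pr_seg` and `pr_direct_io` are hypothetical names, and the `valid` field stands in for whether a user address is accessible.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one iovec entry. */
struct pr_seg {
    size_t len;
    bool valid;   /* models whether the user address can be accessed */
};

/* Process segments in order; on failure report the bytes already
 * transferred, and report the error only if nothing was transferred. */
static long pr_direct_io(const struct pr_seg *segs, size_t nsegs)
{
    long done = 0;
    long err = 0;

    for (size_t i = 0; i < nsegs; i++) {
        if (!segs[i].valid) {
            err = -EFAULT;
            break;
        }
        done += (long)segs[i].len;
    }
    return done > 0 ? done : err;
}
```

With a valid 4-byte segment followed by an invalid one the sketch returns 4; with an invalid first segment it returns -EFAULT.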

    Reported-by: Alexey Kodanev
    Signed-off-by: Ashish Samant
    Signed-off-by: Miklos Szeredi

    Ashish Samant
     

14 Mar, 2016

2 commits

  • The 'reqs' member of fuse_io_priv serves two purposes. First is to track
    the number of outstanding async requests to the server and to signal that
    the io request is completed. The second is to be a reference count on the
    structure to know when it can be freed.

    For sync io requests these purposes can be at odds. fuse_direct_IO() wants
    to block until the request is done, and since the signal is sent when
    'reqs' reaches 0 it cannot keep a reference to the object. Yet it needs to
    use the object after the userspace server has completed processing
    requests. This leads to some handshaking and special casing that is
    needlessly complicated and responsible for at least one race condition.

    It's much cleaner and safer to maintain a separate reference count for the
    object lifecycle and to let 'reqs' just be a count of outstanding requests
    to the userspace server. Then we can know for sure when it is safe to free
    the object without any handshaking or special cases.

    The catch here is that most of the time these objects are stack allocated
    and should not be freed. Initializing these objects with a single reference
    that is never released prevents accidental attempts to free the objects.
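
The split between the two counters can be illustrated with a single-threaded userspace sketch. Everything here is an assumption made for illustration: the `rc_*` names are hypothetical, plain ints replace whatever counting primitive the kernel code uses, and there is no locking.

```c
#include <stdlib.h>

/* Hypothetical model of the structure described above. */
struct rc_io_priv {
    int refcount;   /* object lifetime only */
    int reqs;       /* outstanding requests only */
};

static int rc_frees;    /* counts frees, for illustration */

/* Every object starts with one reference held by its creator. */
static void rc_init(struct rc_io_priv *io)
{
    io->refcount = 1;
    io->reqs = 0;
}

static void rc_put(struct rc_io_priv *io)
{
    if (--io->refcount == 0) {
        rc_frees++;
        free(io);
    }
}

/* Each in-flight request holds its own reference. */
static void rc_submit(struct rc_io_priv *io)
{
    io->refcount++;
    io->reqs++;
}

static void rc_complete(struct rc_io_priv *io)
{
    io->reqs--;
    rc_put(io);
}
```

A stack-allocated object never drops its initial reference, so completions cannot bring the count to zero and free() is never reached for it. A heap-allocated object is freed when its creator calls rc_put() after the last completion.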

    Fixes: 9d5722b7777e ("fuse: handle synchronous iocbs internally")
    Cc: stable@vger.kernel.org # v4.1+
    Signed-off-by: Seth Forshee
    Signed-off-by: Miklos Szeredi

    Seth Forshee
     
  • There's a race in fuse_direct_IO(), whereby is_sync_kiocb() is called on an
    iocb that could have been freed if async io has already completed. The fix
    in this case is simple and obvious: cache the result before starting io.

    It was discovered by KASan:

    kernel: ==================================================================
    kernel: BUG: KASan: use after free in fuse_direct_IO+0xb1a/0xcc0 at addr ffff88036c414390
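
The pattern of the fix, reading the flag before I/O starts, can be sketched in userspace like this. It is a model only: `cs_kiocb`, `cs_submit` and `cs_direct_io` are hypothetical, and `CS_EIOCBQUEUED` is a local stand-in for the kernel-internal EIOCBQUEUED value.

```c
#include <stdbool.h>
#include <stdlib.h>

#define CS_EIOCBQUEUED 529  /* local stand-in for the kernel-internal code */

/* Hypothetical model of an iocb that carries a sync/async flag. */
struct cs_kiocb {
    bool sync;
};

/* Models submission where async completion may already have run and
 * freed the iocb by the time the call returns. */
static void cs_submit(struct cs_kiocb *iocb, bool completes_early)
{
    if (!iocb->sync && completes_early)
        free(iocb);
}

static long cs_direct_io(struct cs_kiocb *iocb, bool completes_early)
{
    bool is_sync = iocb->sync;   /* cached before I/O starts */

    cs_submit(iocb, completes_early);

    /* In the async case iocb must not be dereferenced from here on. */
    if (!is_sync)
        return -CS_EIOCBQUEUED;
    return 0;
}
```

Reading `iocb->sync` after cs_submit() instead would touch freed memory whenever the async request completes early, which is the use after free reported above.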

    Signed-off-by: Robert Doebbelin
    Signed-off-by: Miklos Szeredi
    Fixes: bcba24ccdc82 ("fuse: enable asynchronous processing direct IO")
    Cc: stable@vger.kernel.org # 3.10+

    Robert Doebbelin
     

23 Jan, 2016

1 commit

  • Add inode_{lock,unlock,trylock,is_locked,lock_nested} wrappers, parallel
    to mutex_{lock,unlock,trylock,is_locked,lock_nested}, with
    inode_foo(inode) being mutex_foo(&inode->i_mutex).

    Please, use those for access to ->i_mutex; over the coming cycle
    ->i_mutex will become rwsem, with ->lookup() done with it held
    only shared.

    Signed-off-by: Al Viro

    Al Viro
     

22 Jan, 2016

1 commit


15 Jan, 2016

1 commit

  • Mark those kmem allocations that are known to be easily triggered from
    userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
    memcg. For the list, see below:

    - threadinfo
    - task_struct
    - task_delay_info
    - pid
    - cred
    - mm_struct
    - vm_area_struct and vm_region (nommu)
    - anon_vma and anon_vma_chain
    - signal_struct
    - sighand_struct
    - fs_struct
    - files_struct
    - fdtable and fdtable->full_fds_bits
    - dentry and external_name
    - inode for all filesystems. This is the most tedious part, because
      most filesystems overwrite the alloc_inode method.

    The list is far from complete, so feel free to add more objects.
    Nevertheless, it should be close to "account everything" approach and
    keep most workloads within bounds. Malevolent users will be able to
    breach the limit, but this was possible even with the former "account
    everything" approach (simply because it did not account everything in
    fact).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

12 Jan, 2016

1 commit

  • Pull vfs RCU symlink updates from Al Viro:
    "Replacement of ->follow_link/->put_link, allowing to stay in RCU mode
    even if the symlink is not an embedded one.

    No changes since the mailbomb on Jan 1"

    * 'work.symlinks' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    switch ->get_link() to delayed_call, kill ->put_link()
    kill free_page_put_link()
    teach nfs_get_link() to work in RCU mode
    teach proc_self_get_link()/proc_thread_self_get_link() to work in RCU mode
    teach shmem_get_link() to work in RCU mode
    teach page_get_link() to work in RCU mode
    replace ->follow_link() with new method that could stay in RCU mode
    don't put symlink bodies in pagecache into highmem
    namei: page_getlink() and page_follow_link_light() are the same thing
    ufs: get rid of ->setattr() for symlinks
    udf: don't duplicate page_symlink_inode_operations
    logfs: don't duplicate page_symlink_inode_operations
    switch befs long symlinks to page_symlink_operations

    Linus Torvalds
     

31 Dec, 2015

1 commit


30 Dec, 2015

1 commit


12 Dec, 2015

1 commit


09 Dec, 2015

1 commit

  • new method: ->get_link(); replacement of ->follow_link(). The differences
    are:
    * inode and dentry are passed separately
    * might be called both in RCU and non-RCU mode;
      the former is indicated by passing it a NULL dentry.
    * when called that way it isn't allowed to block
      and should return ERR_PTR(-ECHILD) if it needs to be called
      in non-RCU mode.

    It's a flagday change - the old method is gone, all in-tree instances
    converted. Conversion isn't hard; that said, so far very few instances
    do not immediately bail out when called in RCU mode. That'll change
    in the next commits.
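
The calling convention listed above can be modelled in userspace as follows. This is a sketch under assumptions: the signature is simplified, all `gl_*` types and helpers are hypothetical stand-ins (including the ERR_PTR-style helpers), and filling a cache field stands in for work that would block.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for kernel types. */
struct gl_dentry { int unused; };
struct gl_inode {
    const char *cached_link;   /* link body already available in memory */
    const char *slow_link;     /* body that would need blocking I/O */
};

/* Local equivalents of the kernel's ERR_PTR()/PTR_ERR() helpers. */
static const char *gl_err_ptr(long err) { return (const char *)(intptr_t)err; }
static long gl_ptr_err(const char *p) { return (long)(intptr_t)p; }

/* A NULL dentry means RCU mode: blocking is not allowed. */
static const char *gl_get_link(struct gl_dentry *dentry, struct gl_inode *inode)
{
    if (inode->cached_link)
        return inode->cached_link;        /* fine in either mode */
    if (!dentry)
        return gl_err_ptr(-ECHILD);       /* ask for a non-RCU retry */
    inode->cached_link = inode->slow_link; /* stands in for blocking work */
    return inode->cached_link;
}
```

A caller in RCU mode that gets -ECHILD back is expected to repeat the call in non-RCU mode with a real dentry.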

    Signed-off-by: Al Viro

    Al Viro
     

10 Nov, 2015

3 commits

  • A useful performance improvement for accessing virtual machine images
    via FUSE mount.

    See https://bugzilla.redhat.com/show_bug.cgi?id=1220173 for a use-case
    for glusterFS.

    Signed-off-by: Ravishankar N
    Signed-off-by: Miklos Szeredi

    Ravishankar N
     
  • I got a report about an unkillable task eating CPU. Further
    investigation shows that the problem is in the fuse_fill_write_pages()
    function. If iov's first segment has zero length, we get an infinite
    loop, because we never reach the iov_iter_advance() call.

    Fix this by calling iov_iter_advance() before repeating an attempt to
    copy data from userspace.

    A similar problem is described in 124d3b7041f ("fix writev regression:
    pan hanging unkillable and un-straceable"). If a zero-length segment
    is followed by a segment with an invalid address,
    iov_iter_fault_in_readable() checks only the first segment (zero-length),
    iov_iter_copy_from_user_atomic() skips it, fails at the second and
    returns zero -> goto again without skipping the zero-length segment.

    The patch calls iov_iter_advance() before goto again: we'll skip the
    zero-length segment at the second iteration and
    iov_iter_fault_in_readable() will detect the invalid address.
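
The loop behaviour described above can be reproduced with a small userspace model. It is a simplified sketch, not the fuse code: all `wp_*` names are hypothetical, the `valid` flag stands in for a readable user address, and `max_tries` exists only so the buggy variant terminates.

```c
#include <stdbool.h>
#include <stddef.h>

struct wp_seg { size_t len; bool valid; };
struct wp_iter { const struct wp_seg *segs; size_t nsegs, idx, off; };

/* Like the fault-in check in the text: looks at the first segment only. */
static int wp_fault_in_first(const struct wp_iter *it)
{
    if (it->idx >= it->nsegs)
        return 0;
    const struct wp_seg *s = &it->segs[it->idx];
    return (s->len - it->off == 0 || s->valid) ? 0 : -1;
}

/* Copies up to max bytes, skipping empty segments and stopping at an
 * invalid one. Does not move the iterator. */
static size_t wp_copy_atomic(const struct wp_iter *it, size_t max)
{
    size_t copied = 0, idx = it->idx, off = it->off;

    while (idx < it->nsegs && copied < max) {
        size_t left = it->segs[idx].len - off;
        if (left && !it->segs[idx].valid)
            break;
        size_t n = left < max - copied ? left : max - copied;
        copied += n;
        if (n < left)
            break;
        idx++;
        off = 0;
    }
    return copied;
}

/* Consumes n bytes and steps past empty segments at the head. */
static void wp_advance(struct wp_iter *it, size_t n)
{
    while (it->idx < it->nsegs) {
        size_t left = it->segs[it->idx].len - it->off;
        if (n < left) {
            it->off += n;
            return;
        }
        n -= left;
        it->idx++;
        it->off = 0;
    }
}

/* Returns bytes copied, -1 on fault, -2 if the retry limit is hit. */
static long wp_fill(struct wp_iter *it, bool fixed, int max_tries)
{
    long total = 0;
    int tries = 0;

    while (it->idx < it->nsegs) {
        if (++tries > max_tries)
            return -2;               /* would spin forever */
        if (wp_fault_in_first(it))
            return total ? total : -1;
        size_t copied = wp_copy_atomic(it, 4096);
        if (fixed)
            wp_advance(it, copied);  /* the fix: advance before retrying */
        if (!copied)
            continue;                /* "goto again" */
        if (!fixed)
            wp_advance(it, copied);
        total += (long)copied;
    }
    return total;
}
```

With a zero-length segment followed by an invalid one, the unfixed variant keeps retrying until the limit, while the fixed variant steps past the empty segment and reports the fault on the next pass.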

    Special thanks to Konstantin Khlebnikov, who helped a lot with the commit
    description.

    Cc: Andrew Morton
    Cc: Maxim Patlasov
    Cc: Konstantin Khlebnikov
    Signed-off-by: Roman Gushchin
    Signed-off-by: Miklos Szeredi
    Fixes: ea9b9907b82a ("fuse: implement perform_write")
    Cc: stable@vger.kernel.org

    Roman Gushchin
     
  • The problem is that fuse_dev_alloc() acquires an extra reference to cc.fc,
    and the original ref count is never dropped.

    Reported-by: Colin Ian King
    Signed-off-by: Miklos Szeredi
    Fixes: cc080e9e9be1 ("fuse: introduce per-instance fuse_dev structure")
    Cc: stable@vger.kernel.org # v4.2+

    Miklos Szeredi
     

23 Oct, 2015

1 commit


17 Aug, 2015

1 commit

  • fuse_dev_ioctl() performed fuse_get_dev() on a user-supplied fd,
    leading to a type confusion issue. Fix it by checking file->f_op.
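
The kind of check described above can be sketched in userspace as follows. All `td_*` names are hypothetical stand-ins, and the comparison against one known operations table is an assumption about the shape of the check rather than the exact kernel code.

```c
#include <stddef.h>

/* Hypothetical stand-ins for kernel structures. */
struct td_file_ops { int id; };
struct td_file {
    const struct td_file_ops *f_op;
    void *private_data;   /* meaning depends on which ops own the file */
};
struct td_fuse_dev { int conn_id; };

static const struct td_file_ops td_fuse_dev_ops = { 1 };
static const struct td_file_ops td_other_ops = { 2 };

/* Interpret private_data as a fuse device only when the file really
 * belongs to the fuse device operations. */
static struct td_fuse_dev *td_get_dev_checked(const struct td_file *file)
{
    if (file->f_op != &td_fuse_dev_ops)
        return NULL;
    return file->private_data;
}
```

A file opened on some other driver carries an unrelated object in `private_data`, so returning NULL for it avoids treating that object as a fuse device.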

    Signed-off-by: Jann Horn
    Acked-by: Miklos Szeredi
    Signed-off-by: Linus Torvalds

    Jann Horn
     

05 Jul, 2015

1 commit

  • Pull more vfs updates from Al Viro:
    "Assorted VFS fixes and related cleanups (IMO the most interesting in
    that part are f_path-related things and Eric's descriptor-related
    stuff). UFS regression fixes (it got broken last cycle). 9P fixes.
    fs-cache series, DAX patches, Jan's file_remove_suid() work"

    [ I'd say this is much more than "fixes and related cleanups". The
    file_table locking rule change by Eric Dumazet is a rather big and
    fundamental update even if the patch isn't huge. - Linus ]

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
    9p: cope with bogus responses from server in p9_client_{read,write}
    p9_client_write(): avoid double p9_free_req()
    9p: forgetting to cancel request on interrupted zero-copy RPC
    dax: bdev_direct_access() may sleep
    block: Add support for DAX reads/writes to block devices
    dax: Use copy_from_iter_nocache
    dax: Add block size note to documentation
    fs/file.c: __fget() and dup2() atomicity rules
    fs/file.c: don't acquire files->file_lock in fd_install()
    fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
    vfs: avoid creation of inode number 0 in get_next_ino
    namei: make set_root_rcu() return void
    make simple_positive() public
    ufs: use dir_pages instead of ufs_dir_pages()
    pagemap.h: move dir_pages() over there
    remove the pointless include of lglock.h
    fs: cleanup slight list_entry abuse
    xfs: Correctly lock inode when removing suid and file capabilities
    fs: Call security_ops->inode_killpriv on truncate
    fs: Provide function telling whether file_remove_privs() will do anything
    ...

    Linus Torvalds
     

04 Jul, 2015

1 commit

  • Pull user namespace updates from Eric Biederman:
    "Long ago and far away when user namespaces were young it was realized
    that allowing fresh mounts of proc and sysfs with only user namespace
    permissions could violate the basic rule that only root gets to decide
    if proc or sysfs should be mounted at all.

    Some hacks were put in place to reduce the worst of the damage that could
    be done, and the common sense rule was adopted that fresh mounts of
    proc and sysfs should allow no more than bind mounts of proc and
    sysfs. Unfortunately that rule has not been fully enforced.

    There are two kinds of gaps in that enforcement. Only filesystems
    mounted on empty directories of proc and sysfs should be ignored but
    the test for empty directories was insufficient. So in my tree
    directories on proc, sysctl and sysfs that will always be empty are
    created specially. Every other technique is imperfect as an ordinary
    directory can have entries added even after a readdir returns and
    shows that the directory is empty. Special creation of directories
    for mount points makes the code in the kernel a smidge clearer about
    its purpose. I asked container developers from the various container
    projects to help test this and no holes were found in the set of mount
    points on proc and sysfs that are created specially.

    This set of changes also starts enforcing that the mount flags of fresh
    mounts of proc and sysfs are consistent with the existing mount of
    proc and sysfs. I expected this to be the boring part of the work but
    unfortunately unprivileged userspace winds up mounting fresh copies of
    proc and sysfs with noexec and nosuid clear when root set those flags
    on the previous mount of proc and sysfs. So for now only the atime,
    read-only and nodev attributes which userspace happens to keep
    consistent are enforced. Dealing with the noexec and nosuid
    attributes remains for another time.

    This set of changes also addresses an issue with how open file
    descriptors from /proc/<pid>/ns/* are displayed. Recently readlink of
    /proc/<pid>/fd has been triggering a WARN_ON that has not been
    meaningful since it was added (as all of the code in the kernel was
    converted) and is now actively wrong.

    There is also a short list of issues that have not been fixed yet that
    I will mention briefly.

    It is possible to rename a directory from below to above a bind mount.
    At which point any directory pointers below the renamed directory can
    be walked up to the root directory of the filesystem. With user
    namespaces enabled a bind mount of the bind mount can be created
    allowing the user to pick a directory whose children they can rename
    to outside of the bind mount. This is challenging to fix and doubly
    so because all obvious solutions must touch code that is in the
    performance part of pathname resolution.

    As mentioned above there is also a question of how to ensure that
    developers by accident or with purpose do not introduce executable
    files on sysfs and proc and in doing so introduce security regressions
    in the current userspace that will not be immediately obvious and as
    such are likely to require breaking userspace in painful ways once
    they are recognized"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    vfs: Remove incorrect debugging WARN in prepend_path
    mnt: Update fs_fully_visible to test for permanently empty directories
    sysfs: Create mountpoints with sysfs_create_mount_point
    sysfs: Add support for permanently empty directories to serve as mount points.
    kernfs: Add support for always empty directories.
    proc: Allow creating permanently empty directories that serve as mount points
    sysctl: Allow creating permanently empty directories that serve as mountpoints.
    fs: Add helper functions for permanently empty directories.
    vfs: Ignore unlocked mounts in fs_fully_visible
    mnt: Modify fs_fully_visible to deal with locked ro nodev and atime
    mnt: Refactor the logic for mounting sysfs and proc in a user namespace

    Linus Torvalds
     

03 Jul, 2015

1 commit

  • Pull fuse updates from Miklos Szeredi:
    "This is the start of improving fuse scalability.

    An input queue and a processing queue are split out from the monolithic
    fuse connection, each of those having their own spinlock. The end of
    the patchset adds the ability to clone a fuse connection. This means
    that instead of having to read/write requests/answers on a single fuse
    device fd, the fuse daemon can have multiple distinct file descriptors
    open. Each of those can be used to receive requests and send answers;
    currently the only constraint is that a request must be answered on
    the same fd as it was read from.

    This can be extended further to allow binding a device clone to a
    specific CPU or NUMA node.

    Based on a patchset by Srinivas Eeda and Ashish Samant. Thanks to
    Ashish for the review of this series"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (40 commits)
    fuse: update MAINTAINERS entry
    fuse: separate pqueue for clones
    fuse: introduce per-instance fuse_dev structure
    fuse: device fd clone
    fuse: abort: no fc->lock needed for request ending
    fuse: no fc->lock for pqueue parts
    fuse: no fc->lock in request_end()
    fuse: cleanup request_end()
    fuse: request_end(): do once
    fuse: add req flag for private list
    fuse: pqueue locking
    fuse: abort: group pqueue accesses
    fuse: cleanup fuse_dev_do_read()
    fuse: move list_del_init() from request_end() into callers
    fuse: duplicate ->connected in pqueue
    fuse: separate out processing queue
    fuse: simplify request_wait()
    fuse: no fc->lock for iqueue parts
    fuse: allow interrupt queuing without fc->lock
    fuse: iqueue locking
    ...

    Linus Torvalds
     

01 Jul, 2015

11 commits

  • This allows for better documentation in the code and
    it allows for a simpler and fully correct version of
    fs_fully_visible to be written.

    The mount points converted and their filesystems are:
    /sys/hypervisor/s390/ s390_hypfs
    /sys/kernel/config/ configfs
    /sys/kernel/debug/ debugfs
    /sys/firmware/efi/efivars/ efivarfs
    /sys/fs/fuse/connections/ fusectl
    /sys/fs/pstore/ pstore
    /sys/kernel/tracing/ tracefs
    /sys/fs/cgroup/ cgroup
    /sys/kernel/security/ securityfs
    /sys/fs/selinux/ selinuxfs
    /sys/fs/smackfs/ smackfs

    Cc: stable@vger.kernel.org
    Acked-by: Greg Kroah-Hartman
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Make each fuse device clone refer to a separate processing queue. The only
    constraint on userspace code is that the request answer must be written to
    the same device clone as it was read off.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Allow fuse device clones to be distinguished. This patch just
    adds the infrastructure by associating a separate "struct fuse_dev" with
    each clone.

    Signed-off-by: Miklos Szeredi
    Reviewed-by: Ashish Samant

    Miklos Szeredi
     
  • Allow an open fuse device to be "cloned". Userspace can create a clone by:

    newfd = open("/dev/fuse", O_RDWR);
    ioctl(newfd, FUSE_DEV_IOC_CLONE, &oldfd);

    At this point newfd will refer to the same fuse connection as oldfd.

    Signed-off-by: Miklos Szeredi
    Reviewed-by: Ashish Samant

    Miklos Szeredi
     
  • In fuse_abort_conn() when all requests are on private lists we no longer
    need fc->lock protection.

    Signed-off-by: Miklos Szeredi
    Reviewed-by: Ashish Samant

    Miklos Szeredi
     
  • Remove fc->lock protection from processing queue members, now protected by
    fpq->lock.

    Signed-off-by: Miklos Szeredi
    Reviewed-by: Ashish Samant

    Miklos Szeredi
     
  • No longer need to call request_end() with the connection lock held. We
    still protect the background counters and queue with fc->lock, so acquire
    it if necessary.

    Signed-off-by: Miklos Szeredi
    Reviewed-by: Ashish Samant

    Miklos Szeredi
     
  • Now that we atomically test having already done everything we no longer
    need other protection.

    Signed-off-by: Miklos Szeredi
    Reviewed-by: Ashish Samant

    Miklos Szeredi
     
  • When the connection is aborted it is possible that request_end() will be
    called twice. Use atomic test and set to do the actual ending only once.

    test_and_set_bit() also provides the necessary barrier semantics so no
    explicit smp_wmb() is necessary.
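
The "do it only once" idea can be sketched in userspace with C11 atomics. This is an analogy, not the kernel code: `atomic_flag_test_and_set()` plays the role of the kernel's test_and_set_bit(), and the `once_*` names are hypothetical.

```c
#include <stdatomic.h>

/* Hypothetical request with a "finished" flag and a counter that
 * records how many times the real ending work ran. */
struct once_req {
    atomic_flag finished;
    int end_count;
};

/* Safe to call more than once: only the first caller sees the flag
 * clear and performs the ending work. */
static void once_request_end(struct once_req *req)
{
    if (atomic_flag_test_and_set(&req->finished))
        return;             /* somebody already ended this request */
    req->end_count++;       /* stands in for the actual ending work */
}
```

Calling once_request_end() twice on the same request runs the ending work once.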

    Signed-off-by: Miklos Szeredi
    Reviewed-by: Ashish Samant

    Miklos Szeredi
     
  • When an unlocked request is aborted, it is moved from fpq->io to a private
    list. Then, after unlocking fpq->lock, the private list is processed and
    the requests are finished off.

    To protect the private list, we need to mark the request with a flag, so if
    in the meantime the request is unlocked the list is not corrupted.

    Signed-off-by: Miklos Szeredi
    Reviewed-by: Ashish Samant

    Miklos Szeredi
     
  • Add a fpq->lock for protecting members of struct fuse_pqueue and FR_LOCKED
    request flag.

    Signed-off-by: Miklos Szeredi
    Reviewed-by: Ashish Samant

    Miklos Szeredi