05 Sep, 2018

3 commits

  • commit e8f3bd773d22f488724dffb886a1618da85c2966 upstream.

    syzbot is hitting a NULL pointer dereference at process_init_reply().
    This is because deactivate_locked_super() is called before the response
    to the initial request is processed.

    Fix this by aborting and waiting for all requests (including FUSE_INIT)
    before resetting fc->sb.

    Original patch by Tetsuo Handa.

    Reported-by: syzbot
    Fixes: e27c9d3877a0 ("fuse: fuse: add time_gran to INIT_OUT")
    Cc: # v3.19
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Greg Kroah-Hartman

    Miklos Szeredi
     
  • commit b8f95e5d13f5f0191dcb4b9113113d241636e7cb upstream.

    fuse_abort_conn() does not guarantee that all async requests have actually
    finished aborting (i.e. that their ->end() function has been called). This
    could result in inodes still being in use after umount.

    Add a helper to wait until all requests are fully done. This is done by
    looking at the "num_waiting" counter. When this counter drops to zero, we
    can be sure that no more requests are outstanding.

    Fixes: 0d8e84b0432b ("fuse: simplify request abort")
    Cc: # v4.2
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Greg Kroah-Hartman

    Miklos Szeredi
     
  • commit 63576c13bd17848376c8ba4a98f5d5151140c4ac upstream.

    If parallel dirops are enabled in the FUSE_INIT reply, then the first
    operation may leave fi->mutex held.

    Reported-by: syzbot
    Fixes: 5c672ab3f0ee ("fuse: serialize dirops by default")
    Cc: # v4.7
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Greg Kroah-Hartman

    Miklos Szeredi
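
    The lock imbalance described in this commit can be sketched in userspace C.
    This is a simplified model, not the kernel code: struct conn, struct
    inode_info and all four functions are hypothetical stand-ins, and a plain
    counter takes the place of fi->mutex. It assumes the flag flips (because
    the FUSE_INIT reply arrives) between the lock and the unlock of the first
    operation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; lock_depth models fi->mutex being held. */
struct conn { bool parallel_dirops; };
struct inode_info { struct conn *fc; int lock_depth; };

/* Racy pairing: lock and unlock each consult the flag on their own. */
static void lock_inode_racy(struct inode_info *fi)
{
	if (!fi->fc->parallel_dirops)
		fi->lock_depth++;
}

static void unlock_inode_racy(struct inode_info *fi)
{
	if (!fi->fc->parallel_dirops)
		fi->lock_depth--;
}

/* Balanced pairing: lock reports what it did and unlock trusts that. */
static bool lock_inode(struct inode_info *fi)
{
	bool locked = false;

	if (!fi->fc->parallel_dirops) {
		fi->lock_depth++;
		locked = true;
	}
	return locked;
}

static void unlock_inode(struct inode_info *fi, bool locked)
{
	if (locked) {
		assert(fi->lock_depth > 0);
		fi->lock_depth--;
	}
}
```

    With the racy pair, a flag change between the two calls leaves lock_depth
    at 1; with the balanced pair it returns to 0 regardless of when the flag
    changes.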
     

03 Jul, 2018

1 commit

  • commit 543b8f8662fe6d21f19958b666ab0051af9db21a upstream.

    syzbot is reporting use-after-free at fuse_kill_sb_blk() [1].
    Since the sb->s_fs_info field is not cleared after fc was released by
    fuse_conn_put() when initialization failed, fuse_kill_sb_blk() finds the
    already released fc and tries to hold the lock. Fix this by clearing
    the sb->s_fs_info field after calling fuse_conn_put().

    [1] https://syzkaller.appspot.com/bug?id=a07a680ed0a9290585ca424546860464dd9658db

    Signed-off-by: Tetsuo Handa
    Reported-by: syzbot
    Fixes: 3b463ae0c626 ("fuse: invalidation reverse calls")
    Cc: John Muir
    Cc: Csaba Henk
    Cc: Anand Avati
    Cc: # v2.6.31
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Greg Kroah-Hartman

    Tetsuo Handa
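
    A minimal userspace sketch of the pattern this fix relies on: once the
    last reference is dropped on the failure path, the back-pointer is cleared
    so that the later teardown sees NULL instead of freed memory. All names
    here (struct conn, struct super, conn_put, fill_super_failing, kill_sb)
    are hypothetical stand-ins for the kernel objects, not the real API.

```c
#include <assert.h>
#include <stdlib.h>

struct conn { int refs; };
struct super { void *s_fs_info; };

static int conns_freed;	/* counts how many connections were released */

static void conn_put(struct conn *fc)
{
	assert(fc->refs > 0);
	if (--fc->refs == 0) {
		free(fc);
		conns_freed++;
	}
}

/* Models a mount whose initialization fails after fc was attached. */
static int fill_super_failing(struct super *sb)
{
	struct conn *fc = malloc(sizeof(*fc));

	if (!fc)
		return -1;
	fc->refs = 1;
	sb->s_fs_info = fc;

	/* ... initialization fails here ... */
	conn_put(fc);
	sb->s_fs_info = NULL;	/* the fix: do not leave a dangling pointer */
	return -1;
}

/* Models the later teardown; returns 1 if it had a connection to drop. */
static int kill_sb(struct super *sb)
{
	struct conn *fc = sb->s_fs_info;

	if (!fc)
		return 0;
	conn_put(fc);
	sb->s_fs_info = NULL;
	return 1;
}
```

    Without the NULL assignment in fill_super_failing(), kill_sb() would
    dereference the freed connection.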
     

19 Oct, 2017

1 commit


21 May, 2017

1 commit

  • Pull block fixes from Jens Axboe:
    "A small collection of fixes that should go into this cycle.

    - a pull request from Christoph for NVMe, which ended up being
    manually applied to avoid pulling in newer bits in master. Mostly
    fibre channel fixes from James, but also a few fixes from Jon and
    Vijay

    - a pull request from Konrad, with just a single fix for xen-blkback
    from Gustavo.

    - a fuseblk bdi fix from Jan, fixing a regression in this series with
    the dynamic backing devices.

    - a blktrace fix from Shaohua, replacing sscanf() with kstrtoull().

    - a request leak fix for drbd from Lars, fixing a regression in the
    last series with the kref changes. This will go to stable as well"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    nvmet: release the sq ref on rdma read errors
    nvmet-fc: remove target cpu scheduling flag
    nvme-fc: stop queues on error detection
    nvme-fc: require target or discovery role for fc-nvme targets
    nvme-fc: correct port role bits
    nvme: unmap CMB and remove sysfs file in reset path
    blktrace: fix integer parse
    fuseblk: Fix warning in super_setup_bdi_name()
    block: xen-blkback: add null check to avoid null pointer dereference
    drbd: fix request leak introduced by locking/atomic, kref: Kill kref_sub()

    Linus Torvalds
     

17 May, 2017

1 commit

  • Commit 5f7f7543f52e "fuse: Convert to separately allocated bdi" didn't
    properly handle the fuseblk filesystem. When fuse_bdi_init() is called
    for that filesystem type, sb->s_bdi is already initialized (by
    set_bdev_super()) to point to the block device's bdi and consequently
    super_setup_bdi_name() complains about this fact when resetting bdi to
    the private one.

    Fix the problem by properly dropping bdi reference in fuse_bdi_init()
    before creating a private bdi in super_setup_bdi_name().

    Fixes: 5f7f7543f52e ("fuse: Convert to separately allocated bdi")
    Reported-by: Rakesh Pandit
    Tested-by: Rakesh Pandit
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

10 May, 2017

1 commit


21 Apr, 2017

2 commits

  • It is not needed anymore since bdi is initialized whenever superblock
    exists.

    CC: Miklos Szeredi
    CC: linux-fsdevel@vger.kernel.org
    Suggested-by: Miklos Szeredi
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • Allocate struct backing_dev_info separately instead of embedding it
    inside the superblock. This unifies handling of bdi among users.

    CC: Miklos Szeredi
    CC: linux-fsdevel@vger.kernel.org
    Acked-by: Miklos Szeredi
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

18 Apr, 2017

2 commits

  • When the userspace process servicing fuse requests is running in
    a pid namespace then pids passed via the fuse fd are not being
    translated into that process' namespace. Translation is necessary
    for the pid to be useful to that process.

    Since no use case currently exists for changing namespaces all
    translations can be done relative to the pid namespace in use
    when fuse_conn_init() is called. For fuse this translates to
    mount time, and for cuse this is when /dev/cuse is opened. IO for
    this connection from another namespace will return errors.

    Requests from processes whose pid cannot be translated into the
    target namespace will have a value of 0 for in.h.pid.

    File locking changes based on previous work done by Eric
    Biederman.

    Signed-off-by: Seth Forshee
    Signed-off-by: Miklos Szeredi

    Seth Forshee
     
  • The refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This helps avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: Miklos Szeredi

    Elena Reshetova
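
    To illustrate why a dedicated reference-count type helps, here is a
    simplified, single-threaded sketch of a saturating counter. It is not the
    kernel's refcount_t implementation (which is atomic); struct ref, ref_inc
    and ref_dec_and_test are hypothetical names. The idea shown is that a
    counter pinned at its maximum leaks the object instead of wrapping to
    zero and freeing it while still in use.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

struct ref { unsigned int count; };

/* Take a reference; saturate at UINT_MAX instead of wrapping to 0. */
static void ref_inc(struct ref *r)
{
	assert(r->count != 0);	/* taking a ref on a dead object is a bug */
	if (r->count != UINT_MAX)
		r->count++;
}

/* Drop a reference; returns true only when the last one went away. */
static bool ref_dec_and_test(struct ref *r)
{
	if (r->count == UINT_MAX)
		return false;	/* saturated: stay pinned, never free */
	if (r->count == 0)
		return false;	/* underflow refused */
	return --r->count == 0;
}
```

    A plain unsigned counter incremented past UINT_MAX wraps to 0, which a
    release path would read as "no users left".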
     

18 Oct, 2016

1 commit


01 Oct, 2016

4 commits

  • Only two flags: "default_permissions" and "allow_other". All other flags
    are handled via bitfields. So convert these two as well. They don't
    change during the lifetime of the filesystem, so this is quite safe.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Add a new INIT flag, FUSE_POSIX_ACL, for negotiating ACL support with
    userspace. When it is set in the INIT response, ACL support will be
    enabled. ACL support also implies "default_permissions".

    When ACL support is enabled, the kernel will cache and have responsibility
    for enforcing ACLs. ACL xattrs will be passed to userspace, which is
    responsible for updating the ACLs in the filesystem, keeping the file mode
    in sync, and inheritance of default ACLs when new filesystem nodes are
    created.

    Signed-off-by: Seth Forshee
    Signed-off-by: Miklos Szeredi

    Seth Forshee
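
    The INIT-flag negotiation described in this group of commits can be
    sketched as follows. This is a userspace model under stated assumptions:
    the flag values are the ones I recall from the FUSE uapi header and should
    be checked against it, and struct conn and apply_init_flags() are
    hypothetical names rather than kernel code. It shows the one-bit fields
    and the rule that ACL support implies "default_permissions".

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values; verify against include/uapi/linux/fuse.h. */
#define FUSE_PARALLEL_DIROPS	(1u << 18)
#define FUSE_HANDLE_KILLPRIV	(1u << 19)
#define FUSE_POSIX_ACL		(1u << 20)

/* Per-connection options kept as one-bit fields. */
struct conn {
	unsigned default_permissions:1;
	unsigned allow_other:1;
	unsigned parallel_dirops:1;
	unsigned handle_killpriv:1;
	unsigned posix_acl:1;
};

/* Apply the flags userspace returned in its INIT reply. */
static void apply_init_flags(struct conn *fc, uint32_t flags)
{
	assert(fc);
	if (flags & FUSE_PARALLEL_DIROPS)
		fc->parallel_dirops = 1;
	if (flags & FUSE_HANDLE_KILLPRIV)
		fc->handle_killpriv = 1;
	if (flags & FUSE_POSIX_ACL) {
		fc->default_permissions = 1;	/* implied by ACL support */
		fc->posix_acl = 1;
	}
}
```

    Features stay off unless the reply carries the corresponding flag, which
    matches the "disabled by default" stance taken for parallel dirops.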
     
  • Only the userspace filesystem can kill suid/sgid without races.
    So introduce an INIT flag and negotiate support for this.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • In preparation for posix acl support, rework fuse to use xattr handlers and
    the generic setxattr/getxattr/listxattr callbacks. Split the xattr code
    out into its own file, and promote symbols to module-global scope as
    needed.

    Functionally these changes have no impact, as fuse still uses a single
    handler for all xattrs which uses the old callbacks.

    Signed-off-by: Seth Forshee
    Signed-off-by: Miklos Szeredi

    Seth Forshee
     

06 Aug, 2016

1 commit

  • Pull qstr constification updates from Al Viro:
    "Fairly self-contained bunch - surprising lot of places passes struct
    qstr * as an argument when const struct qstr * would suffice; it
    complicates analysis for no good reason.

    I'd prefer to feed that separately from the assorted fixes (those are
    in #for-linus and with somewhat trickier topology)"

    * 'work.const-qstr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    qstr: constify instances in adfs
    qstr: constify instances in lustre
    qstr: constify instances in f2fs
    qstr: constify instances in ext2
    qstr: constify instances in vfat
    qstr: constify instances in procfs
    qstr: constify instances in fuse
    qstr constify instances in fs/dcache.c
    qstr: constify instances in nfs
    qstr: constify instances in ocfs2
    qstr: constify instances in autofs4
    qstr: constify instances in hfs
    qstr: constify instances in hfsplus
    qstr: constify instances in logfs
    qstr: constify dentry_init_security

    Linus Torvalds
     

31 Jul, 2016

1 commit


29 Jul, 2016

1 commit


30 Jun, 2016

1 commit

  • Negotiate with userspace filesystems whether they support parallel readdir
    and lookup. Disable parallelism by default for fear of breaking fuse
    filesystems.

    Signed-off-by: Miklos Szeredi
    Fixes: 9902af79c01a ("parallel lookups: actual switch to rwsem")
    Fixes: d9b3dbdcfd62 ("fuse: switch to ->iterate_shared()")

    Miklos Szeredi
     

05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement the
    page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely that it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion on whether
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    the script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
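
    The reason the shift conversions in this commit are safe can be shown in a
    few lines of C: when PAGE_CACHE_SHIFT equals PAGE_SHIFT, shifting by their
    difference is a shift by zero. This is a standalone sketch; the 4 KiB page
    size and the two helper names are assumptions made for illustration.

```c
#include <assert.h>

#define PAGE_SHIFT		12	/* assumption: 4 KiB pages */
#define PAGE_SIZE		(1UL << PAGE_SHIFT)

/* The old aliases, defined the only way they were ever defined. */
#define PAGE_CACHE_SHIFT	PAGE_SHIFT
#define PAGE_CACHE_SIZE		PAGE_SIZE

static_assert(PAGE_CACHE_SIZE == PAGE_SIZE, "the aliases never diverged");

/* Old style: convert a page-cache index to a page index. */
static unsigned long index_old_style(unsigned long index)
{
	return index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
}

/* New style: the conversion is the identity, so it is dropped. */
static unsigned long index_new_style(unsigned long index)
{
	return index;
}
```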
     

15 Jan, 2016

1 commit

  • Mark those kmem allocations that are known to be easily triggered from
    userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
    memcg. For the list, see below:

    - threadinfo
    - task_struct
    - task_delay_info
    - pid
    - cred
    - mm_struct
    - vm_area_struct and vm_region (nommu)
    - anon_vma and anon_vma_chain
    - signal_struct
    - sighand_struct
    - fs_struct
    - files_struct
    - fdtable and fdtable->full_fds_bits
    - dentry and external_name
    - inode for all filesystems. This is the most tedious part, because
    most filesystems overwrite the alloc_inode method.

    The list is far from complete, so feel free to add more objects.
    Nevertheless, it should be close to "account everything" approach and
    keep most workloads within bounds. Malevolent users will be able to
    breach the limit, but this was possible even with the former "account
    everything" approach (simply because it did not account everything in
    fact).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

04 Jul, 2015

1 commit

  • Pull user namespace updates from Eric Biederman:
    "Long ago and far away when user namespaces were young it was realized
    that allowing fresh mounts of proc and sysfs with only user namespace
    permissions could violate the basic rule that only root gets to decide
    if proc or sysfs should be mounted at all.

    Some hacks were put in place to reduce the worst of the damage that could
    be done, and the common sense rule was adopted that fresh mounts of
    proc and sysfs should allow no more than bind mounts of proc and
    sysfs. Unfortunately that rule has not been fully enforced.

    There are two kinds of gaps in that enforcement. Only filesystems
    mounted on empty directories of proc and sysfs should be ignored but
    the test for empty directories was insufficient. So in my tree
    directories on proc, sysctl and sysfs that will always be empty are
    created specially. Every other technique is imperfect as an ordinary
    directory can have entries added even after a readdir returns and
    shows that the directory is empty. Special creation of directories
    for mount points makes the code in the kernel a smidge clearer about
    its purpose. I asked container developers from the various container
    projects to help test this and no holes were found in the set of mount
    points on proc and sysfs that are created specially.

    This set of changes also starts enforcing the mount flags of fresh
    mounts of proc and sysfs are consistent with the existing mount of
    proc and sysfs. I expected this to be the boring part of the work but
    unfortunately unprivileged userspace winds up mounting fresh copies of
    proc and sysfs with noexec and nosuid clear when root set those flags
    on the previous mount of proc and sysfs. So for now only the atime,
    read-only and nodev attributes which userspace happens to keep
    consistent are enforced. Dealing with the noexec and nosuid
    attributes remains for another time.

    This set of changes also addresses an issue with how open file
    descriptors from /proc/<pid>/ns/* are displayed. Recently readlink of
    /proc/<pid>/fd has been triggering a WARN_ON that has not been
    meaningful since it was added (as all of the code in the kernel was
    converted) and is now actively wrong.

    There is also a short list of issues that have not been fixed yet that
    I will mention briefly.

    It is possible to rename a directory from below to above a bind mount.
    At which point any directory pointers below the renamed directory can
    be walked up to the root directory of the filesystem. With user
    namespaces enabled a bind mount of the bind mount can be created
    allowing the user to pick a directory whose children they can rename
    to outside of the bind mount. This is challenging to fix and doubly
    so because all obvious solutions must touch code that is in the
    performance part of pathname resolution.

    As mentioned above there is also a question of how to ensure that
    developers by accident or with purpose do not introduce executable
    files on sysfs and proc and in doing so introduce security regressions
    in the current userspace that will not be immediately obvious and as
    such are likely to require breaking userspace in painful ways once
    they are recognized"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    vfs: Remove incorrect debugging WARN in prepend_path
    mnt: Update fs_fully_visible to test for permanently empty directories
    sysfs: Create mountpoints with sysfs_create_mount_point
    sysfs: Add support for permanently empty directories to serve as mount points.
    kernfs: Add support for always empty directories.
    proc: Allow creating permanently empty directories that serve as mount points
    sysctl: Allow creating permanently empty directories that serve as mountpoints.
    fs: Add helper functions for permanently empty directories.
    vfs: Ignore unlocked mounts in fs_fully_visible
    mnt: Modify fs_fully_visible to deal with locked ro nodev and atime
    mnt: Refactor the logic for mounting sysfs and proc in a user namespace

    Linus Torvalds
     

01 Jul, 2015

11 commits

  • This allows for better documentation in the code and
    it allows for a simpler and fully correct version of
    fs_fully_visible to be written.

    The mount points converted and their filesystems are:
    /sys/hypervisor/s390/ s390_hypfs
    /sys/kernel/config/ configfs
    /sys/kernel/debug/ debugfs
    /sys/firmware/efi/efivars/ efivarfs
    /sys/fs/fuse/connections/ fusectl
    /sys/fs/pstore/ pstore
    /sys/kernel/tracing/ tracefs
    /sys/fs/cgroup/ cgroup
    /sys/kernel/security/ securityfs
    /sys/fs/selinux/ selinuxfs
    /sys/fs/smackfs/ smackfs

    Cc: stable@vger.kernel.org
    Acked-by: Greg Kroah-Hartman
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Make each fuse device clone refer to a separate processing queue. The only
    constraint on userspace code is that the request answer must be written to
    the same device clone as it was read off.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Allow fuse device clones to be distinguished. This patch just
    adds the infrastructure by associating a separate "struct fuse_dev" with
    each clone.

    Signed-off-by: Miklos Szeredi
    Reviewed-by: Ashish Samant

    Miklos Szeredi
     
  • Add a fpq->lock for protecting members of struct fuse_pqueue and FR_LOCKED
    request flag.

    Signed-off-by: Miklos Szeredi
    Reviewed-by: Ashish Samant

    Miklos Szeredi
     
  • This will allow checking ->connected just with the processing queue lock.

    Signed-off-by: Miklos Szeredi
    Reviewed-by: Ashish Samant

    Miklos Szeredi
     
  • This is just two fields: fc->io and fc->processing.

    This patch just rearranges the fields, no functional change.

    Signed-off-by: Miklos Szeredi
    Reviewed-by: Ashish Samant

    Miklos Szeredi
     
  • This will allow checking ->connected just with the input queue lock.

    Signed-off-by: Miklos Szeredi
    Reviewed-by: Ashish Samant

    Miklos Szeredi
     
  • The input queue contains normal requests (fc->pending), forgets
    (fc->forget_*) and interrupts (fc->interrupts). There's also fc->waitq and
    fc->fasync for waking up the readers of the fuse device when a request is
    available.

    The fc->reqctr is also moved to the input queue (assigned to the request
    when the request is added to the input queue).

    This patch just rearranges the fields, no functional change.

    Signed-off-by: Miklos Szeredi
    Reviewed-by: Ashish Samant

    Miklos Szeredi
     
  • Since it's a 64bit counter, it's never gonna wrap around. Remove code
    dealing with that possibility.

    Signed-off-by: Miklos Szeredi
    Reviewed-by: Ashish Samant

    Miklos Szeredi
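
    The claim that a 64-bit counter will not wrap can be checked with a short
    calculation. The helper below is a sketch, and the request rate passed to
    it is an assumed figure chosen for illustration, not a measured one.

```c
#include <assert.h>
#include <stdint.h>

/* Years needed to exhaust a 64-bit counter at a constant request rate. */
static double years_until_wrap(double requests_per_second)
{
	const double seconds_per_year = 365.25 * 24 * 60 * 60;

	assert(requests_per_second > 0.0);
	/* UINT64_MAX rounds to 2^64 when converted to double. */
	return (double)UINT64_MAX / requests_per_second / seconds_per_year;
}
```

    Even at an assumed one billion requests per second, the result is several
    centuries, so wraparound handling is dead code in practice.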
     
  • Finer grained locking will mean there's no single lock to protect
    modification of bitfields in fuse_req.

    So move to using bitops. Can use the non-atomic variants for those which
    happen while the request definitely has only one reference.

    Signed-off-by: Miklos Szeredi
    Reviewed-by: Ashish Samant

    Miklos Szeredi
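
    The move from bitfields to bitops can be sketched with C11 atomics. This is
    a userspace analogue, not the kernel's set_bit()/test_bit() API: the req_*
    helpers are hypothetical, and the enum lists only a few flag names with
    illustrative bit positions. The non-atomic variant is only safe while a
    single owner holds the request, as the commit message notes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative subset of request flags; positions are assumptions. */
enum req_flag { FR_LOCKED, FR_PENDING, FR_SENT, FR_FINISHED, FR_NBITS };

struct req { atomic_ulong flags; };

/* Atomic read-modify-write: safe without a lock shared by all writers. */
static void req_set_bit(struct req *r, enum req_flag bit)
{
	assert(bit < FR_NBITS);
	atomic_fetch_or(&r->flags, 1UL << bit);
}

static void req_clear_bit(struct req *r, enum req_flag bit)
{
	assert(bit < FR_NBITS);
	atomic_fetch_and(&r->flags, ~(1UL << bit));
}

static bool req_test_bit(struct req *r, enum req_flag bit)
{
	return (atomic_load(&r->flags) & (1UL << bit)) != 0;
}

/* Separate load and store: only valid while the request has one owner. */
static void req_set_bit_nonatomic(struct req *r, enum req_flag bit)
{
	unsigned long v = atomic_load_explicit(&r->flags, memory_order_relaxed);

	atomic_store_explicit(&r->flags, v | (1UL << bit),
			      memory_order_relaxed);
}
```

    Adjacent C bitfields share a storage unit, so two writers updating
    different fields under different locks could overwrite each other; per-bit
    atomic operations avoid that.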
     
  • fc->release is called from fuse_conn_put() which was used in the error
    cleanup before fc->release was initialized.

    [Jeremiah Mahler: assign fc->release after calling
    fuse_conn_init(fc) instead of before.]

    Signed-off-by: Miklos Szeredi
    Fixes: a325f9b92273 ("fuse: update fuse_conn_init() and separate out fuse_conn_kill()")
    Cc: #v2.6.31+

    Miklos Szeredi
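
    Why the assignment order matters can be sketched as below. The sketch
    assumes the init function zeroes the whole structure, which is what makes
    an earlier assignment disappear; struct conn and the conn_* functions are
    hypothetical stand-ins. Unlike real kernel code, conn_put() here tolerates
    a missing callback so the example can run safely.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct conn {
	int refs;
	void (*release)(struct conn *fc);
};

static int released;	/* counts release callbacks that actually ran */

static void conn_release(struct conn *fc)
{
	(void)fc;
	released++;
}

/* Assumption: init wipes the structure, including any callback set earlier. */
static void conn_init(struct conn *fc)
{
	memset(fc, 0, sizeof(*fc));
	fc->refs = 1;
}

/* Drop a reference and run the release callback on the last put. */
static void conn_put(struct conn *fc)
{
	assert(fc->refs > 0);
	if (--fc->refs == 0 && fc->release)
		fc->release(fc);
}
```

    Setting the callback before conn_init() leaves it NULL afterwards, so an
    error path that calls conn_put() never releases the connection; setting it
    right after conn_init() makes every later put safe.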
     

16 Apr, 2015

1 commit


21 Jan, 2015

1 commit

  • Now that we never use the backing_dev_info pointer in struct address_space
    we can simply remove it and save 4 to 8 bytes in every inode.

    Signed-off-by: Christoph Hellwig
    Acked-by: Ryusuke Konishi
    Reviewed-by: Tejun Heo
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

06 Jan, 2015

2 commits

  • Theoretically we need to order setting of various fields in fc with
    fc->initialized.

    No known bug reports related to this yet.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Analysis from Marc:

    "Commit 7078187a795f ("fuse: introduce fuse_simple_request() helper")
    from the above pull request triggers some EIO errors for me in some tests
    that rely on fuse

    Looking at the code changes and a bit of debugging info I think there's a
    general problem here that fuse_get_req checks and possibly waits for
    fc->initialized, and this was always called first. But this commit
    changes the ordering and in many places fc->minor is now possibly used
    before fuse_get_req, and we can't be sure that fc has been initialized.
    In my case fuse_lookup_init sets req->out.args[0].size to the wrong size
    because fc->minor at that point is still 0, leading to the EIO error."

    Fix by moving the compat adjustments into fuse_simple_request() to after
    fuse_get_req().

    This is also more readable than the original, since now compatibility is
    handled in a single function instead of cluttering each operation.

    Reported-by: Marc Dionne
    Tested-by: Marc Dionne
    Signed-off-by: Miklos Szeredi
    Fixes: 7078187a795f ("fuse: introduce fuse_simple_request() helper")

    Miklos Szeredi
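
    The ordering problem Marc describes can be modelled in a few lines. This is
    a sketch under assumptions: the two sizes are illustrative placeholders,
    wait_for_init() stands in for the wait inside fuse_get_req(), and its
    negotiated_minor parameter exists only so the model can simulate the INIT
    reply arriving. None of these helpers are the kernel functions themselves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ENTRY_OUT_SIZE		128	/* illustrative current reply size */
#define COMPAT_ENTRY_OUT_SIZE	120	/* illustrative pre-minor-9 size */

struct conn { bool initialized; unsigned minor; };
struct args { size_t out_size; };

/* Buggy order: reads fc->minor before anyone waited for initialization. */
static size_t early_out_size(const struct conn *fc)
{
	return fc->minor < 9 ? COMPAT_ENTRY_OUT_SIZE : ENTRY_OUT_SIZE;
}

/* Models the wait for the INIT reply, after which minor is valid. */
static void wait_for_init(struct conn *fc, unsigned negotiated_minor)
{
	if (!fc->initialized) {
		fc->minor = negotiated_minor;
		fc->initialized = true;
	}
}

/* All compatibility adjustments live in one place. */
static void adjust_compat(const struct conn *fc, struct args *a)
{
	assert(fc->initialized);
	if (fc->minor < 9)
		a->out_size = COMPAT_ENTRY_OUT_SIZE;
}

/* Fixed order: wait first, then adjust for the negotiated version. */
static size_t simple_request(struct conn *fc, struct args *a,
			     unsigned negotiated_minor)
{
	wait_for_init(fc, negotiated_minor);
	adjust_compat(fc, a);
	return a->out_size;
}
```

    On a fresh connection minor is still 0, so the early check picks the
    compat size even when the server speaks a current protocol version.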
     

12 Dec, 2014

1 commit

  • The following pattern is repeated many times:

    req = fuse_get_req_nopages(fc);
    /* Initialize req->(in|out).args */
    fuse_request_send(fc, req);
    err = req->out.h.error;
    fuse_put_request(req);

    Create a new replacement helper:

    /* Initialize args */
    err = fuse_simple_request(fc, &args);

    In addition to reducing the code size, this will ease moving from the
    complex arg-based to a simpler page-based I/O on the fuse device.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
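
    The refactoring in this commit, folding a repeated get/send/read-error/put
    sequence into one helper, can be sketched with a userspace mock. Every type
    and function below is a hypothetical stand-in that mirrors the shape of the
    pattern quoted above; the mock "send" simply copies a preset error code.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct conn { int live_requests; int next_error; };
struct req { struct conn *fc; int error; };
struct args { int opcode; };

static struct req *get_req(struct conn *fc)
{
	struct req *req = malloc(sizeof(*req));

	if (!req)
		return NULL;
	req->fc = fc;
	req->error = 0;
	fc->live_requests++;
	return req;
}

/* Mock transport: the "reply" is whatever error the test preset. */
static void request_send(struct conn *fc, struct req *req,
			 const struct args *a)
{
	(void)a;
	req->error = fc->next_error;
}

static void put_request(struct req *req)
{
	assert(req->fc->live_requests > 0);
	req->fc->live_requests--;
	free(req);
}

/* The helper: callers fill in args and get back a single error code. */
static int simple_request(struct conn *fc, const struct args *a)
{
	struct req *req = get_req(fc);
	int err;

	if (!req)
		return -ENOMEM;
	request_send(fc, req, a);
	err = req->error;
	put_request(req);
	return err;
}
```

    Callers no longer touch the request object at all, which is what lets the
    request lifetime rules be changed in one place later.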