18 Sep, 2020

1 commit

  • We want to allow submounts for the same fuse_conn, but with different
    superblocks so that each of the submounts has its own device ID. To do
    so, we need to split all mount-specific information off of fuse_conn
    into a new fuse_mount structure, so that multiple mounts can share a
    single fuse_conn.

    We need to take care only to perform connection-level actions once (i.e.
    when the fuse_conn and thus the first fuse_mount are established, or
    when the last fuse_mount and thus the fuse_conn are destroyed). For
    example, fuse_sb_destroy() must not invoke fuse_send_destroy() until the
    last superblock is released.

    To do so, we keep track of which fuse_mount is the root mount and
    perform all fuse_conn-level actions only when this fuse_mount is
    involved.
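
    The bookkeeping described above can be pictured with a small userspace
    sketch. It is not the kernel code: conn_sketch, mount_sketch,
    mount_attach() and mount_release() are hypothetical stand-ins for
    fuse_conn, fuse_mount and their setup/teardown paths, and the counters
    only record how often a connection-level action would have run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for fuse_conn. */
struct conn_sketch {
	int mounts;        /* live mounts sharing this connection */
	int init_sent;     /* times connection-level setup ran */
	int destroy_sent;  /* times connection-level teardown ran */
};

/* Hypothetical, simplified stand-in for fuse_mount. */
struct mount_sketch {
	struct conn_sketch *conn;
	bool is_root;      /* true only for the first mount of the connection */
};

/* Attach a mount; only the first (root) mount sets up the connection. */
static void mount_attach(struct mount_sketch *m, struct conn_sketch *c)
{
	m->conn = c;
	m->is_root = (c->mounts == 0);
	if (m->is_root)
		c->init_sent++;
	c->mounts++;
}

/* Release a mount; only the last release tears the connection down. */
static bool mount_release(struct mount_sketch *m)
{
	struct conn_sketch *c = m->conn;
	bool last;

	assert(c != NULL && c->mounts > 0);
	c->mounts--;
	last = (c->mounts == 0);
	if (last)
		c->destroy_sent++;
	m->conn = NULL;
	return last;
}
```

    In this sketch, releasing a submount while the root mount is still
    attached leaves destroy_sent at zero; the teardown runs exactly once,
    when the final mount goes away.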

    Signed-off-by: Max Reitz
    Reviewed-by: Stefan Hajnoczi
    Signed-off-by: Miklos Szeredi

    Max Reitz
     

10 Sep, 2020

9 commits

  • Add logic to free up a busy memory range. A freed memory range is
    returned to the free pool. Add a worker which can be started to select
    and free some busy memory ranges.

    A process can also steal one of its own busy dax ranges if no free range
    is available. I will refer to this as direct reclaim.

    If no free range is available and nothing can be stolen from the same
    inode, the caller waits on a waitq for a free range to become available.

    To reclaim a range, as of now we need to hold the following locks in
    the specified order.

    down_write(&fi->i_mmap_sem);
    down_write(&fi->dax->sem);

    We look for a free range in the following order.

    A. Try to get a free range.
    B. If not, try direct reclaim.
    C. If not, wait for a memory range to become free
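
    The A/B/C order above can be sketched as a plain decision function.
    This is an illustration only: pool_sketch, inode_sketch and
    acquire_range() are hypothetical names, locking and the waitq are left
    out, and "must wait" is reported to the caller instead of sleeping.

```c
#include <assert.h>
#include <stddef.h>

/* Where the range came from, or that the caller has to wait. */
enum range_source {
	RANGE_FROM_FREE_POOL,      /* step A */
	RANGE_FROM_DIRECT_RECLAIM, /* step B */
	RANGE_MUST_WAIT            /* step C */
};

/* Hypothetical stand-in for the connection-wide free pool. */
struct pool_sketch {
	int nr_free;
};

/* Hypothetical stand-in for the per-inode set of busy ranges. */
struct inode_sketch {
	int nr_busy_reclaimable;
};

static enum range_source acquire_range(struct pool_sketch *pool,
				       struct inode_sketch *inode)
{
	assert(pool != NULL && inode != NULL);

	/* A. Try to get a free range. */
	if (pool->nr_free > 0) {
		pool->nr_free--;
		return RANGE_FROM_FREE_POOL;
	}

	/*
	 * B. Direct reclaim: reuse one of the inode's own busy ranges.
	 * The range stays with the same inode, so the count is unchanged.
	 */
	if (inode->nr_busy_reclaimable > 0)
		return RANGE_FROM_DIRECT_RECLAIM;

	/* C. Nothing available; the caller would wait for a free range. */
	return RANGE_MUST_WAIT;
}
```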

    Signed-off-by: Vivek Goyal
    Signed-off-by: Liu Bo
    Signed-off-by: Miklos Szeredi

    Vivek Goyal
     
  • This list will be used for selecting a fuse_dax_mapping to free when the
    number of free mappings drops below a threshold.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Miklos Szeredi

    Vivek Goyal
     
  • Currently in fuse we don't seem to have any lock which can serialize the
    fault path with the truncate/punch_hole path. With dax support I need one
    for the following reasons.

    1. Dax requirement

    DAX fault code relies on the inode size being stable for the duration of
    the fault and wants to serialize with truncate/punch_hole; the code
    explicitly mentions it.

    static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
                                          const struct iomap_ops *ops)
            /*
             * Check whether offset isn't beyond end of file now. Caller is
             * supposed to hold locks serializing us with truncate / punch hole so
             * this is a reliable test.
             */
            max_pgoff = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);

    2. Make sure there are no users of pages being truncated/punch_hole

    get_user_pages() might take references to pages and then do some DMA
    to said pages. The filesystem might truncate those pages without knowing
    that a DMA or some other I/O is in progress. So use
    dax_layout_busy_page() to make sure there are no such references
    and no I/O is in progress on said pages before moving ahead with
    truncation.

    3. Limitation of kvm page fault error reporting

    If we truncate the file on the host first and then remove mappings in
    the guest later (truncate page cache etc), then this could lead to a
    problem with KVM. Say a mapping is in place in the guest and truncation
    happens on the host. Now if the guest accesses that mapping, then the
    host will take a fault and kvm will either exit to qemu or spin infinitely.

    IOW, before we do truncation on the host, we need to make sure that the
    guest inode does not have any mapping in that region or in the whole file.

    4. virtiofs memory range reclaim

    Soon I will introduce the notion of being able to reclaim dax memory
    ranges from a fuse dax inode. There also I need to make sure that
    no I/O or fault is going on in the reclaimed range and nobody is using
    it so that range can be reclaimed without issues.

    Currently if we take the inode lock, that serializes read/write. But it
    does not do anything for faults. So I add another semaphore,
    fuse_inode->i_mmap_sem, for this purpose. It can be used to serialize
    with faults.

    As of now, I take this semaphore only in the dax fault path and not in
    the regular fault path, because existing code does not have one. Maybe
    existing code can benefit from it as well to take care of some
    races, but that we can fix later if need be. For now, I am focussing
    only on the DAX path, which is the new path.

    Also added logic to take fuse_inode->i_mmap_sem in the
    truncate/punch_hole/open(O_TRUNC) path to make sure file truncation and
    fuse dax fault are mutually exclusive and avoid all the above problems.

    Signed-off-by: Vivek Goyal
    Cc: Dave Chinner
    Signed-off-by: Miklos Szeredi

    Vivek Goyal
     
  • This is done along the lines of ext4 and xfs. I primarily wanted
    ->writepages hook at this time so that I could call into
    dax_writeback_mapping_range(). This in turn will decide which pfns need to
    be written back.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Miklos Szeredi

    Vivek Goyal
     
  • Add DAX mmap() support.

    Signed-off-by: Stefan Hajnoczi
    Signed-off-by: Miklos Szeredi

    Stefan Hajnoczi
     
  • This patch implements basic DAX support. mmap() is not implemented
    yet and will come in later patches. This patch implements
    read/write.

    We make use of an interval tree to keep track of per-inode dax mappings.

    Do not use dax for file extending writes, instead just send a WRITE
    message to the daemon (like we do for the direct I/O path). This will
    keep the write and the i_size change atomic w.r.t. a crash.
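
    The "file extending write" test above amounts to comparing the end of
    the write against the current file size. A minimal sketch, assuming
    byte offsets held in 64-bit integers; write_extends_file() is a
    hypothetical helper name, not a function taken from the patch:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if writing len bytes at offset pos would grow a file whose
 * current size is i_size. Such a write would be sent as a WRITE message
 * instead of going through DAX.
 */
static bool write_extends_file(uint64_t pos, uint64_t len, uint64_t i_size)
{
	/* A zero-length write never changes the file size. */
	if (len == 0)
		return false;

	/* Treat an offset overflow as extending, so DAX is never used. */
	if (len > UINT64_MAX - pos)
		return true;

	return pos + len > i_size;
}
```

    A write that ends exactly at i_size is not extending in this sketch,
    so it would remain eligible for the DAX path.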

    Signed-off-by: Stefan Hajnoczi
    Signed-off-by: Dr. David Alan Gilbert
    Signed-off-by: Vivek Goyal
    Signed-off-by: Liu Bo
    Signed-off-by: Peng Tao
    Cc: Dave Chinner
    Signed-off-by: Miklos Szeredi

    Vivek Goyal
     
  • The device communicates FUSE_SETUPMAPPING/FUSE_REMOVEMAPPING alignment
    constraints via the FUSE_INIT map_alignment field. Parse this field and
    ensure our DAX mappings meet the alignment constraints.

    We don't actually align anything differently since our mappings are
    already 2MB aligned. Just check the value when the connection is
    established. If it becomes necessary to honor arbitrary alignments in
    the future we'll have to adjust how mappings are sized.

    The upshot of this commit is that we can be confident that mappings will
    work even when emulating x86 on Power and similar combinations where the
    host page sizes are different.
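
    A check of this kind could look like the sketch below. It assumes that
    map_alignment carries the log2 of the required alignment in bytes and
    that mappings are 2MB (1 << 21) aligned; DAX_RANGE_SHIFT and
    dax_alignment_ok() are hypothetical names used for illustration only.

```c
#include <stdbool.h>

/* Assumed: DAX mappings are 2MB aligned, i.e. aligned to 1 << 21 bytes. */
#define DAX_RANGE_SHIFT 21u

/*
 * Assumed encoding: map_alignment_log2 is the log2 of the alignment the
 * device requires. A 2MB-aligned mapping satisfies any power-of-two
 * alignment up to and including 2MB, so only larger values are rejected.
 */
static bool dax_alignment_ok(unsigned int map_alignment_log2)
{
	return map_alignment_log2 <= DAX_RANGE_SHIFT;
}
```

    Under these assumptions a host with 4KB pages (12) or 64KB pages (16)
    passes the check, while a requirement above 2MB (for example 22) would
    be refused when the connection is established.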

    Signed-off-by: Stefan Hajnoczi
    Signed-off-by: Vivek Goyal
    Signed-off-by: Miklos Szeredi

    Stefan Hajnoczi
     
  • Divide the dax memory range into fixed size ranges (2MB for now) and put
    them in a list. This list tracks free ranges. Once an inode requires a
    free range, we take one from here and put it in the interval tree
    of ranges assigned to the inode.
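
    The carving step can be sketched in userspace as follows. The names
    range_sketch and build_free_list() are hypothetical, a singly linked
    list stands in for the kernel list, and dropping a trailing partial
    range is an assumption of this sketch rather than something the commit
    states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RANGE_SIZE (2ULL * 1024 * 1024) /* 2MB per range */

/* Hypothetical stand-in for one fixed-size dax range. */
struct range_sketch {
	uint64_t window_offset;    /* byte offset inside the dax window */
	struct range_sketch *next; /* next free range, or NULL */
};

/*
 * Split a window of window_bytes into RANGE_SIZE pieces backed by
 * storage[0..capacity) and chain them into a free list ordered by
 * ascending offset. Returns the number of ranges created.
 */
static size_t build_free_list(uint64_t window_bytes,
			      struct range_sketch *storage, size_t capacity,
			      struct range_sketch **head)
{
	uint64_t nr = window_bytes / RANGE_SIZE; /* partial tail dropped */
	size_t count = nr < capacity ? (size_t)nr : capacity;
	size_t i;

	assert(storage != NULL && head != NULL);

	*head = NULL;
	/* Push from the back so the list head ends up at offset 0. */
	for (i = count; i > 0; i--) {
		storage[i - 1].window_offset = (uint64_t)(i - 1) * RANGE_SIZE;
		storage[i - 1].next = *head;
		*head = &storage[i - 1];
	}
	return count;
}
```

    Taking a range for an inode would then mean unlinking the list head
    and inserting it into that inode's interval tree, which is outside the
    scope of this sketch.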

    Signed-off-by: Vivek Goyal
    Signed-off-by: Peng Tao
    Signed-off-by: Miklos Szeredi

    Vivek Goyal
     
  • Add a mount option to allow using dax with virtio_fs.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Miklos Szeredi

    Vivek Goyal