13 Sep, 2021

1 commit

  • Pull smbfs updates from Steve French:
    "cifs/smb3 updates:

    - DFS reconnect fix

    - begin creating common headers for server and client

    - rename the cifs_common directory to smbfs_common to be more
    consistent, i.e. change use of the name cifs to smb (smb3 or smbfs is
    more accurate, as the very old cifs dialect has long been
    superseded by smb3 dialects).

    In the future we can rename the fs/cifs directory to fs/smbfs.

    This does not include the set of multichannel fixes nor the two
    deferred close fixes (they are still being reviewed and tested)"

    * tag '5.15-rc-cifs-part2' of git://git.samba.org/sfrench/cifs-2.6:
    cifs: properly invalidate cached root handle when closing it
    cifs: move SMB FSCTL definitions to common code
    cifs: rename cifs_common to smbfs_common
    cifs: update FSCTL definitions

    Linus Torvalds
     

12 Sep, 2021

1 commit

  • Pull block fixes from Jens Axboe:

    - NVMe pull request from Christoph:
       - fix nvmet command set reporting for passthrough controllers (Adam Manzanares)
       - update a MAINTAINERS email address (Chaitanya Kulkarni)
       - set QUEUE_FLAG_NOWAIT for nvme-multipath (me)
       - handle errors from add_disk() (Luis Chamberlain)
       - update the keep alive interval when kato is modified (Tatsuya Sasaki)
       - fix a buffer overrun in nvmet_subsys_attr_serial (Hannes Reinecke)
       - do not reset transport on data digest errors in nvme-tcp (Daniel Wagner)
       - only call synchronize_srcu when clearing current path (Daniel Wagner)
       - revalidate paths during rescan (Hannes Reinecke)

    - Split out the fs/block_dev into block/fops.c and block/bdev.c, which
    has been long overdue. Do this now before -rc1, to avoid annoying
    conflicts due to this (Christoph)

    - blk-throtl use-after-free fix (Li)

    - Improve plug depth for multi-device plugs, greatly increasing md
    resync performance (Song)

    - blkdev_show() locking fix (Tetsuo)

    - n64cart error check fix (Yang)

    * tag 'block-5.15-2021-09-11' of git://git.kernel.dk/linux-block:
    n64cart: fix return value check in n64cart_probe()
    blk-mq: allow 4x BLK_MAX_REQUEST_COUNT at blk_plug for multiple_queues
    block: move fs/block_dev.c to block/bdev.c
    block: split out operations on block special files
    blk-throttle: fix UAF by deleteing timer in blk_throtl_exit()
    block: genhd: don't call blkdev_show() with major_names_lock held
    nvme: update MAINTAINERS email address
    nvme: add error handling support for add_disk()
    nvme: only call synchronize_srcu when clearing current path
    nvme: update keep alive interval when kato is modified
    nvme-tcp: Do not reset transport on data digest errors
    nvmet: fixup buffer overrun in nvmet_subsys_attr_serial()
    nvmet: return bool from nvmet_passthru_ctrl and nvmet_is_passthru_req
    nvmet: looks at the passthrough controller when initializing CAP
    nvme: move nvme_multi_css into nvme.h
    nvme-multipath: revalidate paths during rescan
    nvme-multipath: set QUEUE_FLAG_NOWAIT

    Linus Torvalds
     

09 Sep, 2021

1 commit

  • As we move to common code between client and server, we have
    been asked to make the names less confusing by referring less
    to "cifs" and more to names that include "smb", e.g. "smbfs"
    for the client (we already have "ksmbd" for the kernel server
    and "smbd" for the user space Samba daemon).
    So, to be more consistent in the naming of common code between
    client and server and to reduce the risk of merge conflicts as
    more common code is added, rename "cifs_common" to
    "smbfs_common" (in future releases we will also rename
    the fs/cifs directory to fs/smbfs).

    Reviewed-by: Ronnie Sahlberg
    Signed-off-by: Steve French

    Steve French
     

07 Sep, 2021

1 commit


05 Sep, 2021

1 commit

  • Merge NTFSv3 filesystem from Konstantin Komarov:
    "This patch adds NTFS Read-Write driver to fs/ntfs3.

    Having decades of expertise in commercial file systems development and
    huge test coverage, we at Paragon Software GmbH want to make our
    contribution to the Open Source Community by providing implementation
    of NTFS Read-Write driver for the Linux Kernel.

    This is a fully functional NTFS Read-Write driver. The current version
    works with NTFS (including v3.1) and normal/compressed/sparse files and
    supports journal replaying.

    We plan to support this version once the codebase is merged, and to
    add new features and fix bugs. For example, full journaling support
    over JBD will be added in later updates"

    Link: https://lore.kernel.org/lkml/20210729134943.778917-1-almaz.alexandrovich@paragon-software.com/
    Link: https://lore.kernel.org/lkml/aa4aa155-b9b2-9099-b7a2-349d8d9d8fbd@paragon-software.com/

    * git://github.com/Paragon-Software-Group/linux-ntfs3: (35 commits)
    fs/ntfs3: Change how module init/info messages are displayed
    fs/ntfs3: Remove GPL boilerplates from decompress lib files
    fs/ntfs3: Remove unnecessary condition checking from ntfs_file_read_iter
    fs/ntfs3: Fix integer overflow in ni_fiemap with fiemap_prep()
    fs/ntfs3: Restyle comments to better align with kernel-doc
    fs/ntfs3: Rework file operations
    fs/ntfs3: Remove fat ioctl's from ntfs3 driver for now
    fs/ntfs3: Restyle comments to better align with kernel-doc
    fs/ntfs3: Fix error handling in indx_insert_into_root()
    fs/ntfs3: Potential NULL dereference in hdr_find_split()
    fs/ntfs3: Fix error code in indx_add_allocate()
    fs/ntfs3: fix an error code in ntfs_get_acl_ex()
    fs/ntfs3: add checks for allocation failure
    fs/ntfs3: Use kcalloc/kmalloc_array over kzalloc/kmalloc
    fs/ntfs3: Do not use driver own alloc wrappers
    fs/ntfs3: Use kernel ALIGN macros over driver specific
    fs/ntfs3: Restyle comment block in ni_parse_reparse()
    fs/ntfs3: Remove unused including
    fs/ntfs3: Fix fall-through warnings for Clang
    fs/ntfs3: Fix one none utf8 char in source file
    ...

    Linus Torvalds
     

01 Sep, 2021

2 commits

  • Pull cifs client updates from Steve French:
    "Eleven cifs/smb3 client fixes:

    - mostly restructuring to allow disabling less secure algorithms
    (this will allow eventually removing rc4 and md4 from general use in
    the kernel)

    - four fixes, including two for stable

    - enable r/w support with fscache and cifs.ko

    I am working on a larger set of changes (the usual ... multichannel,
    auth and signing improvements), but wanted to get these in earlier to
    reduce chance of merge conflicts later in the merge window"

    * tag '5.15-rc-smb3-fixes-part1' of git://git.samba.org/sfrench/cifs-2.6:
    cifs: Do not leak EDEADLK to dgetents64 for STATUS_USER_SESSION_DELETED
    cifs: add cifs_common directory to MAINTAINERS file
    cifs: cifs_md4 convert to SPDX identifier
    cifs: create a MD4 module and switch cifs.ko to use it
    cifs: fork arc4 and create a separate module for it for cifs and other users
    cifs: remove support for NTLM and weaker authentication algorithms
    cifs: enable fscache usage even for files opened as rw
    oid_registry: Add OIDs for missing Spnego auth mechanisms to Macs
    smb3: fix posix extensions mount option
    cifs: fix wrong release in sess_alloc_buffer() failed path
    CIFS: Fix a potencially linear read overflow

    Linus Torvalds
     
  • Pull initial ksmbd implementation from Steve French:
    "Initial merge of kernel smb3 file server, ksmbd.

    The SMB family of protocols is the most widely deployed network
    filesystem protocol, the default on Windows and Macs (and even on many
    phones and tablets), with clients and servers on all major operating
    systems, but lacked a kernel server for Linux. For many cases the
    current userspace server choices were suboptimal either due to memory
    footprint, performance or difficulty integrating well with advanced
    Linux features.

    ksmbd is a new kernel module which implements the server-side of the
    SMB3 protocol. The target is to provide optimized performance, GPLv2
    SMB server, and better lease handling (distributed caching). The
    bigger goal is to add new features more rapidly (e.g. RDMA aka
    "smbdirect", and recent encryption and signing improvements to the
    protocol) which are easier to develop on a smaller, more tightly
    optimized kernel server than for example in Samba.

    The Samba project is much broader in scope (tools, security services,
    LDAP, Active Directory Domain Controller, and a cross platform file
    server for a wider variety of purposes) but the user space file server
    portion of Samba has proved hard to optimize for some Linux workloads,
    including for smaller devices.

    This is not meant to replace Samba, but rather be an extension to
    allow better optimizing for Linux, and will continue to integrate well
    with Samba user space tools and libraries where appropriate. Working
    with the Samba team we have already made sure that the configuration
    files and xattrs are in a compatible format between the kernel and
    user space server.

    Various types of functional and regression tests are regularly run
    against it. One example is the automated 'buildbot' regression tests
    which use the Linux client to test against ksmbd, e.g.

    http://smb3-test-rhel-75.southcentralus.cloudapp.azure.com/#/builders/8/builds/56

    but other test suites, including Samba's smbtorture functional test
    suite are also used regularly"

    * tag '5.15-rc-first-ksmbd-merge' of git://git.samba.org/ksmbd: (219 commits)
    ksmbd: fix __write_overflow warning in ndr_read_string
    MAINTAINERS: ksmbd: add cifs_common directory to ksmbd entry
    MAINTAINERS: ksmbd: update my email address
    ksmbd: fix permission check issue on chown and chmod
    ksmbd: don't set FILE DELETE and FILE_DELETE_CHILD in access mask by default
    MAINTAINERS: add git adddress of ksmbd
    ksmbd: update SMB3 multi-channel support in ksmbd.rst
    ksmbd: smbd: fix kernel oops during server shutdown
    ksmbd: remove select FS_POSIX_ACL in Kconfig
    ksmbd: use proper errno instead of -1 in smb2_get_ksmbd_tcon()
    ksmbd: update the comment for smb2_get_ksmbd_tcon()
    ksmbd: change int data type to boolean
    ksmbd: Fix multi-protocol negotiation
    ksmbd: fix an oops in error handling in smb2_open()
    ksmbd: add ipv6_addr_v4mapped check to know if connection from client is ipv4
    ksmbd: fix missing error code in smb2_lock
    ksmbd: use channel signingkey for binding SMB2 session setup
    ksmbd: don't set RSS capable in FSCTL_QUERY_NETWORK_INTERFACE_INFO
    ksmbd: Return STATUS_OBJECT_PATH_NOT_FOUND if smb2_creat() returns ENOENT
    ksmbd: fix -Wstringop-truncation warnings
    ...

    Linus Torvalds
     

26 Aug, 2021

1 commit


13 Aug, 2021

1 commit


26 Jul, 2021

1 commit

  • We have a fairly specific alpha binary loader in Linux: running x86
    (i386, i486) binaries via the em86 [1] emulator. As noted in the Kconfig
    option, the same behavior can be achieved via binfmt_misc, which is
    nowadays more commonly used, for example, for running qemu-user.

    An example on how to get binfmt_misc running with em86 can be found in
    Documentation/admin-guide/binfmt-misc.rst
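
    As a rough illustration of what such a binfmt_misc registration looks
    like, the Python sketch below only formats a rule string in the
    :name:type:offset:magic:mask:interpreter:flags layout used by
    binfmt_misc. The rule name, the 4-byte ELF signature used as magic and
    the /bin/em86 interpreter path are illustrative assumptions; the
    authoritative magic and mask values for em86 are the ones given in the
    document referenced above.

```python
def binfmt_misc_rule(name, magic, interpreter, mask=b"", offset="", flags=""):
    """Format a magic-type ('M') rule as :name:type:offset:magic:mask:interpreter:flags."""
    def esc(data):
        # Write every byte as a \xNN escape so the rule never contains ':' or NUL.
        return "".join(f"\\x{b:02x}" for b in data)
    return f":{name}:M:{offset}:{esc(magic)}:{esc(mask)}:{interpreter}:{flags}"

# Hypothetical rule: match the generic ELF signature and hand the file to /bin/em86.
rule = binfmt_misc_rule("em86-demo", b"\x7fELF", "/bin/em86")
print(rule)
```

    Registering a handler means writing such a string to the "register"
    file of the mounted binfmt_misc filesystem, which needs root and is
    deliberately left out of the sketch.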

    The defconfig does not have CONFIG_BINFMT_EM86=y set. And doing a
    make defconfig && make olddefconfig
    results in
    # CONFIG_BINFMT_EM86 is not set

    ... as we don't seem to have any supported Linux distribution for alpha
    anymore, there isn't really any "default" user of that feature anymore.

    Searching for "CONFIG_BINFMT_EM86=y" reveals mostly discussions from
    around 20 years ago, like [2] describing how to get netscape running
    via em86, or [3] discussing that running wine or installing
    Win 3.11 through em86 would be a nice feature.

    The latest binaries available for em86 are from 2000, version 2.2.1 [4] --
    which translates to "unsupported"; further, em86 doesn't even work with
    glibc-2.x but only with glibc-2.0 [4, 5]. These are clear signs that
    there might not be too many em86 users out there, especially users
    relying on modern Linux kernels.

    Even though the code footprint is relatively small, let's just get rid
    of this blast from the past that's effectively unused.

    [1] http://ftp.dreamtime.org/pub/linux/Linux-Alpha/em86/v0.4/docs/em86.html
    [2] https://static.lwn.net/1998/1119/a/alpha-netscape.html
    [3] https://groups.google.com/g/linux.debian.alpha/c/AkGuQHeCe0Y
    [4] http://zeniv.linux.org.uk/pub/linux/alpha/em86/v2.2-1/relnotes.2.2.1.html
    [5] https://forum.teamspeak.com/archive/index.php/t-1477.html

    Cc: Alexander Viro
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Linus Torvalds
    Cc: Greg Kroah-Hartman
    Cc: Jonathan Corbet
    Cc: linux-fsdevel@vger.kernel.org
    Cc: linux-api@vger.kernel.org
    Cc: linux-alpha@vger.kernel.org
    Signed-off-by: David Hildenbrand
    Signed-off-by: Matt Turner

    David Hildenbrand
     

28 Jun, 2021

1 commit


11 May, 2021

1 commit


23 Apr, 2021

1 commit

  • Add a pair of helper functions:

    (*) netfs_readahead()
    (*) netfs_readpage()

    to do the work of handling a readahead or a readpage, where the page(s)
    that form part of the request may be split between the local cache, the
    server or just require clearing, and may be single pages and transparent
    huge pages. This is all handled within the helper.

    Note that while both will read from the cache if there is data present,
    only netfs_readahead() will expand the request beyond what it was asked to
    do, and only netfs_readahead() will write back to the cache.

    netfs_readpage(), on the other hand, is synchronous and only fetches the
    page (which might be a THP) it is asked for.

    The netfs gives the helper parameters from the VM, the cache cookie it
    wants to use (or NULL) and a table of operations (only one of which is
    mandatory):

    (*) expand_readahead() [optional]

    Called to allow the netfs to request an expansion of a readahead
    request to meet its own alignment requirements. This is done by
    changing rreq->start and rreq->len.

    (*) clamp_length() [optional]

    Called to allow the netfs to cut down a subrequest to meet its own
    boundary requirements. If it does this, the helper will generate
    additional subrequests until the full request is satisfied.

    (*) is_still_valid() [optional]

    Called to find out if the data just read from the cache has been
    invalidated and must be reread from the server.

    (*) issue_op() [required]

    Called to ask the netfs to issue a read to the server. The subrequest
    describes the read. The read request holds information about the file
    being accessed.

    The netfs can cache information in rreq->netfs_priv.

    Upon completion, the netfs should set the error and transferred
    values, can also set FSCACHE_SREQ_CLEAR_TAIL, and should then call
    fscache_subreq_terminated().

    (*) done() [optional]

    Called after the pages have been unlocked. The read request is still
    pinning the file and mapping and may still be pinning pages with
    PG_fscache. rreq->error indicates any error that has been
    accumulated.

    (*) cleanup() [optional]

    Called when the helper is disposing of a finished read request. This
    allows the netfs to clear rreq->netfs_priv.

    Netfs support is enabled with CONFIG_NETFS_SUPPORT=y. It will be built
    even if CONFIG_FSCACHE=n and in this case much of it should be optimised
    away, allowing the filesystem to use it even when caching is disabled.
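
    The way such an operations table is driven can be pictured with a
    small user-space Python model. This is only a sketch under stated
    assumptions, not the kernel helper: the class names, run_readahead()
    and the three example callbacks are hypothetical, only the callback
    names come from the list above, and the cache, page handling,
    is_still_valid() and asynchronous completion are left out entirely.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

PAGE = 4096  # assumed alignment unit for this example


@dataclass
class ReadRequest:      # hypothetical stand-in for the read request (rreq)
    start: int
    len: int
    error: int = 0


@dataclass
class Subrequest:       # hypothetical stand-in for one subrequest
    start: int
    len: int
    transferred: int = 0
    error: int = 0


@dataclass
class ReadOps:          # only issue_op is mandatory, as in the list above
    issue_op: Callable[[ReadRequest, Subrequest], None]
    expand_readahead: Optional[Callable[[ReadRequest], None]] = None
    clamp_length: Optional[Callable[[Subrequest], None]] = None
    done: Optional[Callable[[ReadRequest], None]] = None
    cleanup: Optional[Callable[[ReadRequest], None]] = None


def run_readahead(rreq: ReadRequest, ops: ReadOps) -> List[Subrequest]:
    """Expand the request, then issue subrequests until it is covered."""
    if ops.expand_readahead:
        ops.expand_readahead(rreq)          # may change rreq.start / rreq.len
    subreqs = []
    pos, end = rreq.start, rreq.start + rreq.len
    while pos < end:
        sub = Subrequest(start=pos, len=end - pos)
        if ops.clamp_length:
            ops.clamp_length(sub)           # may shorten the subrequest
        if sub.len <= 0:
            raise ValueError("clamp_length() must leave a non-empty subrequest")
        ops.issue_op(rreq, sub)
        subreqs.append(sub)
        if sub.error:
            rreq.error = sub.error          # accumulate the error and stop
            break
        pos += sub.len
    if ops.done:
        ops.done(rreq)
    if ops.cleanup:
        ops.cleanup(rreq)
    return subreqs


def expand_to_pages(rreq):                  # example: page-align the request
    end = rreq.start + rreq.len
    rreq.start -= rreq.start % PAGE
    rreq.len = -(-end // PAGE) * PAGE - rreq.start


def clamp_to_two_pages(sub):                # example: at most 8 KiB per read
    sub.len = min(sub.len, 2 * PAGE)


def fake_server_read(rreq, sub):            # example: pretend the read succeeded
    sub.transferred = sub.len


ops = ReadOps(issue_op=fake_server_read,
              expand_readahead=expand_to_pages,
              clamp_length=clamp_to_two_pages)
rreq = ReadRequest(start=1000, len=10000)
subreqs = run_readahead(rreq, ops)
print([(s.start, s.len) for s in subreqs])
```

    In this model a request for bytes 1000..10999 is first widened to
    whole pages and then split into two subrequests because of the
    8 KiB clamp.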

    Changes:
    v5:
    - Comment why netfs_readahead() is putting pages[2].
    - Use page_file_mapping() rather than page->mapping[2].
    - Use page_index() rather than page->index[2].
    - Use set_page_fscache()[3] rather than SetPageFsCache() as this takes an
    appropriate ref too[4].

    v4:
    - Folded in a kerneldoc comment fix.
    - Folded in a fix for the error handling in the case that ENOMEM occurs.
    - Added flag to netfs_subreq_terminated() to indicate that the caller may
    have been running async and stuff that might sleep needs punting to a
    workqueue (can't use in_softirq()[1]).

    Signed-off-by: David Howells
    Reviewed-and-tested-by: Jeff Layton
    Tested-by: Dave Wysochanski
    Tested-By: Marc Dionne
    cc: Matthew Wilcox
    cc: linux-mm@kvack.org
    cc: linux-cachefs@redhat.com
    cc: linux-afs@lists.infradead.org
    cc: linux-nfs@vger.kernel.org
    cc: linux-cifs@vger.kernel.org
    cc: ceph-devel@vger.kernel.org
    cc: v9fs-developer@lists.sourceforge.net
    cc: linux-fsdevel@vger.kernel.org
    Link: https://lore.kernel.org/r/20210216084230.GA23669@lst.de/ [1]
    Link: https://lore.kernel.org/r/20210321014202.GF3420@casper.infradead.org/ [2]
    Link: https://lore.kernel.org/r/2499407.1616505440@warthog.procyon.org.uk/ [3]
    Link: https://lore.kernel.org/r/CAHk-=wh+2gbF7XEjYc=HV9w_2uVzVf7vs60BPz0gFA=+pUm3ww@mail.gmail.com/ [4]
    Link: https://lore.kernel.org/r/160588497406.3465195.18003475695899726222.stgit@warthog.procyon.org.uk/ # rfc
    Link: https://lore.kernel.org/r/161118136849.1232039.8923686136144228724.stgit@warthog.procyon.org.uk/ # rfc
    Link: https://lore.kernel.org/r/161161032290.2537118.13400578415247339173.stgit@warthog.procyon.org.uk/ # v2
    Link: https://lore.kernel.org/r/161340394873.1303470.6237319335883242536.stgit@warthog.procyon.org.uk/ # v3
    Link: https://lore.kernel.org/r/161539537375.286939.16642940088716990995.stgit@warthog.procyon.org.uk/ # v4
    Link: https://lore.kernel.org/r/161653795430.2770958.4947584573720000554.stgit@warthog.procyon.org.uk/ # v5
    Link: https://lore.kernel.org/r/161789076581.6155.6745849361504760209.stgit@warthog.procyon.org.uk/ # v6

    David Howells
     

29 Jan, 2021

1 commit

  • The dcookies stuff was only used by the kernel's old oprofile code. Now
    that oprofile's support is removed from the kernel, there is no need for
    dcookies either. Remove it.

    Suggested-by: Christoph Hellwig
    Suggested-by: Linus Torvalds
    Signed-off-by: Viresh Kumar
    Acked-by: Robert Richter
    Acked-by: William Cohen
    Acked-by: Al Viro
    Acked-by: Thomas Gleixner

    Viresh Kumar
     

24 Oct, 2020

1 commit

  • Pull clone/dedupe/remap code refactoring from Darrick Wong:
    "Move the generic file range remap (aka reflink and dedupe) functions
    out of mm/filemap.c and fs/read_write.c and into fs/remap_range.c to
    reduce clutter in the first two files"

    * tag 'vfs-5.10-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
    vfs: move the generic write and copy checks out of mm
    vfs: move the remap range helpers to remap_range.c
    vfs: move generic_remap_checks out of mm

    Linus Torvalds
     

16 Oct, 2020

1 commit

  • Pull char/misc driver updates from Greg KH:
    "Here is the big set of char, misc, and other assorted driver subsystem
    patches for 5.10-rc1.

    There's a lot of different things in here, all over the drivers/
    directory. Some summaries:

    - soundwire driver updates

    - habanalabs driver updates

    - extcon driver updates

    - nitro_enclaves new driver

    - fsl-mc driver and core updates

    - mhi core and bus updates

    - nvmem driver updates

    - eeprom driver updates

    - binder driver updates and fixes

    - vbox minor bugfixes

    - fsi driver updates

    - w1 driver updates

    - coresight driver updates

    - interconnect driver updates

    - misc driver updates

    - other minor driver updates

    All of these have been in linux-next for a while with no reported
    issues"

    * tag 'char-misc-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (396 commits)
    binder: fix UAF when releasing todo list
    docs: w1: w1_therm: Fix broken xref, mistakes, clarify text
    misc: Kconfig: fix a HISI_HIKEY_USB dependency
    LSM: Fix type of id parameter in kernel_post_load_data prototype
    misc: Kconfig: add a new dependency for HISI_HIKEY_USB
    firmware_loader: fix a kernel-doc markup
    w1: w1_therm: make w1_poll_completion static
    binder: simplify the return expression of binder_mmap
    test_firmware: Test partial read support
    firmware: Add request_partial_firmware_into_buf()
    firmware: Store opt_flags in fw_priv
    fs/kernel_file_read: Add "offset" arg for partial reads
    IMA: Add support for file reads without contents
    LSM: Add "contents" flag to kernel_read_file hook
    module: Call security_kernel_post_load_data()
    firmware_loader: Use security_post_load_data()
    LSM: Introduce kernel_post_load_data() hook
    fs/kernel_read_file: Add file_size output argument
    fs/kernel_read_file: Switch buffer size arg to size_t
    fs/kernel_read_file: Remove redundant size argument
    ...

    Linus Torvalds
     

15 Oct, 2020

1 commit

  • I would like to move all the generic helpers for the vfs remap range
    functionality (aka clonerange and dedupe) into a separate file so that
    they won't be scattered across the vfs and the mm subsystems. The
    eventual goal is to be able to deselect remap_range.c if none of the
    filesystems need that code, but the tricky part here is picking a
    stable(ish) part of the merge window to rearrange code.

    Signed-off-by: Darrick J. Wong

    Darrick J. Wong
     

05 Oct, 2020

1 commit

  • These routines are used in places outside of exec(2), so in preparation
    for refactoring them, move them into a separate source file,
    fs/kernel_read_file.c.

    Signed-off-by: Kees Cook
    Reviewed-by: Mimi Zohar
    Reviewed-by: Luis Chamberlain
    Acked-by: Scott Branden
    Link: https://lore.kernel.org/r/20201002173828.2099543-5-keescook@chromium.org
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     

23 Sep, 2020

1 commit


31 Jul, 2020

1 commit

  • Like do_mount, but takes a kernel pointer for the destination path.
    Switch over the mounts in the init code and devtmpfs to it, which
    just happen to work due to the implicit set_fs(KERNEL_DS) during early
    init right now.

    Signed-off-by: Christoph Hellwig

    Christoph Hellwig
     

06 Mar, 2020

1 commit


10 Feb, 2020

2 commits

  • Pull new zonefs file system from Damien Le Moal:
    "Zonefs is a very simple file system exposing each zone of a zoned
    block device as a file.

    Unlike a regular file system with native zoned block device support
    (e.g. f2fs or the on-going btrfs effort), zonefs does not hide the
    sequential write constraint of zoned block devices from the user. As a
    result, zonefs is not a POSIX compliant file system. Its goal is to
    simplify the implementation of zoned block devices support in
    applications by replacing raw block device file accesses with a richer
    file based API, avoiding relying on direct block device file ioctls
    which may be more obscure to developers.

    One example of this approach is the implementation of LSM
    (log-structured merge) tree structures (such as used in RocksDB and
    LevelDB) on zoned block devices by allowing SSTables to be stored in a
    zone file similarly to a regular file system rather than as a range of
    sectors of a zoned device. The introduction of the higher level
    construct "one file is one zone" can help reduce the amount of
    changes needed in the application while at the same time allowing the
    use of zoned block devices with various programming languages other
    than C.

    Zonefs IO management implementation uses the new iomap generic code.
    Zonefs has been successfully tested using a functional test suite
    (available with zonefs userland format tool on github) and a prototype
    implementation of LevelDB on top of zonefs"

    * tag 'zonefs-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
    zonefs: Add documentation
    fs: New zonefs file system

    Linus Torvalds
     
  • Pull vboxfs from Al Viro:
    "This is the VirtualBox guest shared folder support by Hans de Goede,
    with fixups for fs_parse folded in to avoid bisection hazards from
    those API changes..."

    * 'work.vboxsf' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: Add VirtualBox guest shared folder (vboxsf) support

    Linus Torvalds
     

09 Feb, 2020

1 commit

  • VirtualBox hosts can share folders with guests, this commit adds a
    VFS driver implementing the Linux-guest side of this, allowing folders
    exported by the host to be mounted under Linux.

    This driver depends on the guest host IPC functions exported by
    the vboxguest driver.

    Acked-by: Christoph Hellwig
    Signed-off-by: Hans de Goede
    Signed-off-by: Al Viro

    Hans de Goede
     

07 Feb, 2020

1 commit

  • zonefs is a very simple file system exposing each zone of a zoned block
    device as a file. Unlike a regular file system with zoned block device
    support (e.g. f2fs), zonefs does not hide the sequential write
    constraint of zoned block devices from the user. Files representing
    sequential write zones of the device must be written sequentially
    starting from the end of the file (append only writes).

    As such, zonefs is in essence closer to a raw block device access
    interface than to a full featured POSIX file system. The goal of zonefs
    is to simplify the implementation of zoned block device support in
    applications by replacing raw block device file accesses with a richer
    file API, avoiding relying on direct block device file ioctls which may
    be more obscure to developers. One example of this approach is the
    implementation of LSM (log-structured merge) tree structures (such as
    used in RocksDB and LevelDB) on zoned block devices by allowing SSTables
    to be stored in a zone file similarly to a regular file system rather
    than as a range of sectors of a zoned device. The introduction of the
    higher level construct "one file is one zone" can help reduce the
    amount of changes needed in the application and help introduce
    support for different application programming languages.

    Zonefs on-disk metadata is reduced to an immutable super block to
    persistently store a magic number and optional feature flags and
    values. On mount, zonefs uses blkdev_report_zones() to obtain the device
    zone configuration and populates the mount point with a static file tree
    solely based on this information. E.g. file sizes come from the device
    zone type and write pointer offset managed by the device itself.

    The zone files created on mount have the following characteristics.
    1) Files representing zones of the same type are grouped together
    under a common sub-directory:
    * For conventional zones, the sub-directory "cnv" is used.
    * For sequential write zones, the sub-directory "seq" is used.
    These two directories are the only directories that exist in zonefs.
    Users cannot create other directories and cannot rename nor delete
    the "cnv" and "seq" sub-directories.
    2) The name of zone files is the number of the file within the zone
    type sub-directory, in order of increasing zone start sector.
    3) The size of conventional zone files is fixed to the device zone size.
    Conventional zone files cannot be truncated.
    4) The size of sequential zone files represents the file's zone write
    pointer position relative to the zone start sector. Truncating these
    files is allowed only down to 0, in which case, the zone is reset to
    rewind the zone write pointer position to the start of the zone, or
    up to the zone size, in which case the file's zone is transitioned
    to the FULL state (finish zone operation).
    5) Read and write operations beyond the file zone size are not
    allowed. Any access exceeding the zone size fails with
    the -EFBIG error.
    6) Creating, deleting, renaming or modifying any attribute of files and
    sub-directories is not allowed.
    7) There are no restrictions on the type of read and write operations
    that can be issued to conventional zone files. Buffered, direct and
    mmap read & write operations are accepted. For sequential zone files,
    there are no restrictions on read operations, but all write
    operations must be direct IO append writes. mmap write of sequential
    files is not allowed.
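
    The rules for sequential zone files in points 4), 5) and 7) can be
    summarised with a small Python model. It is a sketch of the described
    behaviour only, not of the zonefs code: the SeqZoneFile class is
    hypothetical, and where the text above does not name an error code
    (misplaced writes, truncation to other sizes) the model simply raises
    ValueError as a placeholder.

```python
import errno


class SeqZoneFile:
    """Toy model of one sequential zone file (hypothetical class)."""

    def __init__(self, zone_size):
        self.zone_size = zone_size
        self.size = 0  # mirrors the zone write pointer offset

    def write(self, offset, data):
        # Writes must append exactly at the current end of the file.
        if offset != self.size:
            raise ValueError("sequential zone files only accept append writes")
        # Nothing may go past the zone size; the text names -EFBIG for this.
        if offset + len(data) > self.zone_size:
            raise OSError(errno.EFBIG, "write exceeds the zone size")
        self.size += len(data)
        return len(data)

    def truncate(self, new_size):
        # Only two targets are allowed: 0 (zone reset) or the zone size
        # (zone finish, after which no further write fits).
        if new_size not in (0, self.zone_size):
            raise ValueError("truncate only to 0 or to the zone size")
        self.size = new_size


zone = SeqZoneFile(zone_size=256 * 1024 * 1024)
zone.write(0, b"\0" * 4096)       # append 4 KiB at the start of the zone
print(zone.size)
zone.truncate(zone.zone_size)     # finish the zone
zone.truncate(0)                  # reset the zone, appends can restart
print(zone.size)
```

    The same sequence (a 4 KiB append, truncation to the zone size, then
    truncation to 0) is what the shell session further down performs on a
    real device.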

    Several optional features of zonefs can be enabled at format time.
    * Conventional zone aggregation: ranges of contiguous conventional
    zones can be aggregated into a single larger file instead of the
    default one file per zone.
    * File ownership: The owner UID and GID of zone files is by default 0
    (root) but can be changed to any valid UID/GID.
    * File access permissions: the default 640 access permissions can be
    changed.

    The mkzonefs tool is used to format zoned block devices for use with
    zonefs. This tool is available on GitHub at:

    git@github.com:damien-lemoal/zonefs-tools.git.

    zonefs-tools also includes a test suite which can be run against any
    zoned block device, including null_blk block device created with zoned
    mode.

    Example: the following formats a 15TB host-managed SMR HDD with 256 MB
    zones with the conventional zones aggregation feature enabled.

    $ sudo mkzonefs -o aggr_cnv /dev/sdX
    $ sudo mount -t zonefs /dev/sdX /mnt
    $ ls -l /mnt/
    total 0
    dr-xr-xr-x 2 root root 1 Nov 25 13:23 cnv
    dr-xr-xr-x 2 root root 55356 Nov 25 13:23 seq

    The size of the zone files sub-directories indicates the number of files
    existing for each type of zones. In this example, there is only one
    conventional zone file (all conventional zones are aggregated under a
    single file).

    $ ls -l /mnt/cnv
    total 137101312
    -rw-r----- 1 root root 140391743488 Nov 25 13:23 0

    This aggregated conventional zone file can be used as a regular file.

    $ sudo mkfs.ext4 /mnt/cnv/0
    $ sudo mount -o loop /mnt/cnv/0 /data

    The "seq" sub-directory grouping files for sequential write zones has
    in this example 55356 zones.

    $ ls -lv /mnt/seq
    total 14511243264
    -rw-r----- 1 root root 0 Nov 25 13:23 0
    -rw-r----- 1 root root 0 Nov 25 13:23 1
    -rw-r----- 1 root root 0 Nov 25 13:23 2
    ...
    -rw-r----- 1 root root 0 Nov 25 13:23 55354
    -rw-r----- 1 root root 0 Nov 25 13:23 55355

    For sequential write zone files, the file size changes as data is
    appended at the end of the file, similarly to any regular file system.

    $ dd if=/dev/zero of=/mnt/seq/0 bs=4K count=1 conv=notrunc oflag=direct
    1+0 records in
    1+0 records out
    4096 bytes (4.1 kB, 4.0 KiB) copied, 0.000452219 s, 9.1 MB/s

    $ ls -l /mnt/seq/0
    -rw-r----- 1 root root 4096 Nov 25 13:23 /mnt/seq/0

    The written file can be truncated to the zone size, preventing any
    further write operation.

    $ truncate -s 268435456 /mnt/seq/0
    $ ls -l /mnt/seq/0
    -rw-r----- 1 root root 268435456 Nov 25 13:49 /mnt/seq/0

    Truncation to 0 size allows freeing the file zone storage space and
    restarting append-writes to the file.

    $ truncate -s 0 /mnt/seq/0
    $ ls -l /mnt/seq/0
    -rw-r----- 1 root root 0 Nov 25 13:49 /mnt/seq/0

    Since files are statically mapped to zones on the disk, the number of
    blocks of a file as reported by stat() and fstat() indicates the size
    of the file zone.

    $ stat /mnt/seq/0
    File: /mnt/seq/0
    Size: 0 Blocks: 524288 IO Block: 4096 regular empty file
    Device: 870h/2160d Inode: 50431 Links: 1
    Access: (0640/-rw-r-----) Uid: ( 0/ root) Gid: ( 0/ root)
    Access: 2019-11-25 13:23:57.048971997 +0900
    Modify: 2019-11-25 13:52:25.553805765 +0900
    Change: 2019-11-25 13:52:25.553805765 +0900
    Birth: -

    The number of blocks of the file ("Blocks") in units of 512B blocks
    gives the maximum file size of 524288 * 512 B = 256 MiB, corresponding
    to the device zone size in this example. Of note is that the "IO Block"
    field always indicates the minimum IO size for writes and corresponds
    to the device physical sector size.
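
    The arithmetic above can also be done from a program. The following is
    a minimal sketch, assuming st_blocks is reported in 512-byte units as
    described; the helper names zone_file_max_size() and
    zone_file_max_size_of() are hypothetical and not part of zonefs.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/stat.h>
#include <sys/types.h>

/* stat() reports st_blocks in units of 512 bytes. */
#define STAT_BLOCK_SIZE 512ULL

/*
 * Hypothetical helper: convert a block count, as reported in st_blocks,
 * into the maximum size in bytes of a zonefs file.
 */
static uint64_t zone_file_max_size(uint64_t nr_blocks)
{
    return nr_blocks * STAT_BLOCK_SIZE;
}

/*
 * Hypothetical helper: return the maximum size in bytes of the file at
 * 'path', or 0 if stat() fails.
 */
static uint64_t zone_file_max_size_of(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return 0;
    assert(st.st_blocks >= 0);
    return zone_file_max_size((uint64_t)st.st_blocks);
}
```

    With the values of this example, zone_file_max_size(524288) gives
    268435456 bytes, the same size that was passed to truncate above.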

    This code contains contributions from:
    * Johannes Thumshirn,
    * Darrick J. Wong,
    * Christoph Hellwig,
    * Chaitanya Kulkarni and
    * Ting Yao.

    Signed-off-by: Damien Le Moal
    Reviewed-by: Dave Chinner

    Damien Le Moal
     

03 Jan, 2020

1 commit


30 Oct, 2019

1 commit

  • This adds support for io-wq, a smaller and specialized thread pool
    implementation. This is meant to replace workqueues for io_uring. Among
    the reasons for this addition are:

    - We can assign memory context smarter and more persistently if we
    manage the life time of threads.

    - We can drop various work-arounds we have in io_uring, like the
    async_list.

    - We can implement hashed work insertion, to manage concurrency of
    buffered writes without needing a) an extra workqueue, or b)
    needlessly making the concurrency of said workqueue very low
    which hurts performance of multiple buffered file writers.

    - We can implement cancel through signals, for cancelling
    interruptible work like read/write (or send/recv) to/from sockets.

    - We need the above cancel for being able to assign and use file tables
    from a process.

    - We can implement a more thorough cancel operation in general.

    - We need it to move towards a syslet/threadlet model for even faster
    async execution. For that we need to take ownership of the used
    threads.

    This list is just off the top of my head. Performance should be the
    same, or better, at least that's what I've seen in my testing. io-wq
    supports basic NUMA functionality, setting up a pool per node.

    io-wq hooks up to the scheduler schedule in/out just like workqueue
    and uses that to drive the need for more/less workers.
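
    The hashed work insertion mentioned in the list above can be pictured
    with a small user-space model. This is only a sketch of the idea (work
    items that hash to the same bucket run one at a time), not the io-wq
    implementation; struct hashed_pool and the helper names are
    hypothetical, and the model is single-threaded with no locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 64 buckets, one bit per bucket in a 64-bit busy map. */
#define NR_HASH_BUCKETS 64u

/* Hypothetical model of a pool tracking which buckets have running work. */
struct hashed_pool {
    uint64_t busy_map;
};

/* Map an arbitrary key (for example an inode address) to a bucket. */
static unsigned int hash_bucket(uintptr_t key)
{
    /* Multiplicative hash; keep the top 6 bits of the 64-bit product. */
    uint64_t h = (uint64_t)key * 0x9E3779B97F4A7C15ULL;
    unsigned int bucket = (unsigned int)(h >> 58);

    assert(bucket < NR_HASH_BUCKETS);
    return bucket;
}

/*
 * Try to start a work item for 'key'. Returns false if another item in
 * the same bucket is still running; the caller would then queue the item
 * behind it instead of running it concurrently.
 */
static bool hashed_work_try_start(struct hashed_pool *pool, uintptr_t key)
{
    uint64_t bit = 1ULL << hash_bucket(key);

    if (pool->busy_map & bit)
        return false;
    pool->busy_map |= bit;
    return true;
}

/* Mark the work item for 'key' as finished, releasing its bucket. */
static void hashed_work_done(struct hashed_pool *pool, uintptr_t key)
{
    pool->busy_map &= ~(1ULL << hash_bucket(key));
}
```

    Two buffered writes to the same file would use the same key and so be
    serialized, while writes to files that hash to different buckets can
    proceed in parallel.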

    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Jens Axboe

    Jens Axboe
     

19 Sep, 2019

1 commit

  • Pull fs-verity support from Eric Biggers:
    "fs-verity is a filesystem feature that provides Merkle tree based
    hashing (similar to dm-verity) for individual readonly files, mainly
    for the purpose of efficient authenticity verification.

    This pull request includes:

    (a) The fs/verity/ support layer and documentation.

    (b) fs-verity support for ext4 and f2fs.

    Compared to the original fs-verity patchset from last year, the UAPI
    to enable fs-verity on a file has been greatly simplified. Lots of
    other things were cleaned up too.

    fs-verity is planned to be used by two different projects on Android;
    most of the userspace code is in place already. Another userspace tool
    ("fsverity-utils"), and xfstests, are also available. e2fsprogs and
    f2fs-tools already have fs-verity support. Other people have shown
    interest in using fs-verity too.

    I've tested this on ext4 and f2fs with xfstests, both the existing
    tests and the new fs-verity tests. This has also been in linux-next
    since July 30 with no reported issues except a couple minor ones I
    found myself and folded in fixes for.

    Ted and I will be co-maintaining fs-verity"
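
    The Merkle tree based hashing described in this pull request can be
    illustrated with a toy example. The sketch below is not the fs-verity
    algorithm or on-disk format: it assumes a binary tree, 4096-byte
    blocks, and uses FNV-1a as a non-cryptographic stand-in for a real
    hash such as SHA-256. All names in it are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 4096u
#define MAX_BLOCKS 64u

/* FNV-1a: a NON-cryptographic stand-in for a real hash like SHA-256. */
static uint64_t toy_hash(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint64_t h = 0xcbf29ce484222325ULL;

    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* Compute the root hash of 'data'; it must span 1..MAX_BLOCKS blocks. */
static uint64_t merkle_root(const unsigned char *data, size_t len)
{
    uint64_t level[MAX_BLOCKS];
    size_t n = (len + BLOCK_SIZE - 1) / BLOCK_SIZE;

    assert(n >= 1 && n <= MAX_BLOCKS);

    /* Leaf level: one hash per data block (the last block may be short). */
    for (size_t i = 0; i < n; i++) {
        size_t off = i * BLOCK_SIZE;
        size_t chunk = len - off < BLOCK_SIZE ? len - off : BLOCK_SIZE;

        level[i] = toy_hash(data + off, chunk);
    }

    /* Hash pairs of nodes until a single root remains. */
    while (n > 1) {
        size_t out = 0;

        for (size_t i = 0; i < n; i += 2) {
            /* An unpaired last node is combined with itself. */
            uint64_t pair[2] = { level[i],
                                 i + 1 < n ? level[i + 1] : level[i] };

            level[out++] = toy_hash(pair, sizeof(pair));
        }
        n = out;
    }
    return level[0];
}
```

    Because the root depends on every leaf, changing any byte of the data
    changes the root, and a single block can be checked against the root
    using only the hashes along its path rather than rehashing the whole
    file.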

    * tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
    f2fs: add fs-verity support
    ext4: update on-disk format documentation for fs-verity
    ext4: add fs-verity read support
    ext4: add basic fs-verity support
    fs-verity: support builtin file signatures
    fs-verity: add SHA-512 support
    fs-verity: implement FS_IOC_MEASURE_VERITY ioctl
    fs-verity: implement FS_IOC_ENABLE_VERITY ioctl
    fs-verity: add data verification hooks for ->readpages()
    fs-verity: add the hook for file ->setattr()
    fs-verity: add the hook for file ->open()
    fs-verity: add inode and superblock fields
    fs-verity: add Kconfig and the helper functions for hashing
    fs: uapi: define verity bit for FS_IOC_GETFLAGS
    fs-verity: add UAPI header
    fs-verity: add MAINTAINERS file entry
    fs-verity: add a documentation file

    Linus Torvalds
     

24 Aug, 2019

1 commit

  • EROFS filesystem has been merged into linux-staging for a year.

    EROFS is designed to be a better solution for saving storage space
    with guaranteed end-to-end performance for read-only files, with the
    help of reduced metadata, fixed-sized output compression and in-place
    decompression technologies.

    In the past year, EROFS was greatly improved by many people as
    a staging driver, self-tested, betaed by a large number of our
    internal users, successfully applied to almost all in-service
    HUAWEI smartphones as the part of EMUI 9.1 and proven to be stable
    enough to be moved out of staging.

    EROFS is a self-contained filesystem driver. Although there are
    still some TODOs to make it more generic, we have a dedicated team
    that keeps actively working on EROFS so that it improves along with
    the Linux kernel, like the other in-kernel filesystems.

    As Pavel suggested, it's better to do as one commit since git
    can do moves and all histories will be saved in this way.

    Let's promote it from staging and enhance it more actively as
    a "real" part of the kernel for wider scenarios!

    Cc: Greg Kroah-Hartman
    Cc: Alexander Viro
    Cc: Andrew Morton
    Cc: Stephen Rothwell
    Cc: Theodore Ts'o
    Cc: Pavel Machek
    Cc: David Sterba
    Cc: Amir Goldstein
    Cc: Christoph Hellwig
    Cc: Darrick J . Wong
    Cc: Dave Chinner
    Cc: Jaegeuk Kim
    Cc: Jan Kara
    Cc: Richard Weinberger
    Cc: Linus Torvalds
    Cc: Chao Yu
    Cc: Miao Xie
    Cc: Li Guifu
    Cc: Fang Wei
    Signed-off-by: Gao Xiang
    Link: https://lore.kernel.org/r/20190822213659.5501-1-hsiangkao@aol.com
    Signed-off-by: Greg Kroah-Hartman

    Gao Xiang
     

29 Jul, 2019

1 commit


17 Jul, 2019

1 commit


15 Jul, 2019

1 commit


08 May, 2019

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "Add as a feature case-insensitive directories (the casefold feature)
    using Unicode 12.1.

    Also, the usual largish number of cleanups and bug fixes"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (25 commits)
    ext4: export /sys/fs/ext4/feature/casefold if Unicode support is present
    ext4: fix ext4_show_options for file systems w/o journal
    unicode: refactor the rule for regenerating utf8data.h
    docs: ext4.rst: document case-insensitive directories
    ext4: Support case-insensitive file name lookups
    ext4: include charset encoding information in the superblock
    MAINTAINERS: add Unicode subsystem entry
    unicode: update unicode database unicode version 12.1.0
    unicode: introduce test module for normalized utf8 implementation
    unicode: implement higher level API for string handling
    unicode: reduce the size of utf8data[]
    unicode: introduce code for UTF-8 normalization
    unicode: introduce UTF-8 character database
    ext4: actually request zeroing of inode table after grow
    ext4: cond_resched in work-heavy group loops
    ext4: fix use-after-free race with debug_want_extra_isize
    ext4: avoid drop reference to iloc.bh twice
    ext4: ignore e_value_offs for xattrs with value-in-ea-inode
    ext4: protect journal inode's blocks using block_validity
    ext4: use BUG() instead of BUG_ON(1)
    ...

    Linus Torvalds
     

26 Apr, 2019

1 commit

  • The decomposition and casefolding of UTF-8 characters are described in a
    prefix tree in utf8data.h, which is generated from the Unicode
    Character Database (UCD), published by the Unicode Consortium, and
    should not be edited by hand. The structures in utf8data.h are meant to
    be used for lookup operations by the unicode subsystem, when decoding a
    utf-8 string.

    mkutf8data.c is the source for a program that generates utf8data.h. It
    was written by Olaf Weber from SGI and originally proposed to be merged
    into Linux in 2014. The original proposal performed the compatibility
    decomposition, NFKD, but the current version was modified by me to do
    canonical decomposition, NFD, as suggested by the community. The
    changes from the original submission are:

    * Rebase to mainline.
    * Fix out-of-tree-build.
    * Update makefile to build 11.0.0 ucd files.
    * drop references to xfs.
    * Convert NFKD to NFD.
    * Merge back robustness fixes from original patch. Requested by
    Dave Chinner.

    The original submission is archived at:

    The utf8data.h file can be regenerated using the instructions in
    fs/unicode/README.utf8data.

    - Notes on the update from 8.0.0 to 11.0.0:

    The structure of the ucd files and special cases have not experienced
    any changes between versions 8.0.0 and 11.0.0. 8.0.0 saw the addition
    of Cherokee LC characters, which is an interesting case for
    case-folding. The update is accompanied by new tests on the test_ucd
    module to catch specific cases. No changes to mkutf8data script were
    required for the updates.

    Signed-off-by: Gabriel Krisman Bertazi
    Signed-off-by: Theodore Ts'o

    Gabriel Krisman Bertazi
     

21 Mar, 2019

2 commits

  • Provide an fsopen() system call that starts the process of preparing to
    create a superblock that will then be mountable, using an fd as a context
    handle. fsopen() is given the name of the filesystem that will be used:

    int mfd = fsopen(const char *fsname, unsigned int flags);

    where flags can be 0 or FSOPEN_CLOEXEC.

    For example:

    sfd = fsopen("ext4", FSOPEN_CLOEXEC);
    fsconfig(sfd, FSCONFIG_SET_PATH, "source", "/dev/sda1", AT_FDCWD);
    fsconfig(sfd, FSCONFIG_SET_FLAG, "noatime", NULL, 0);
    fsconfig(sfd, FSCONFIG_SET_FLAG, "acl", NULL, 0);
    fsconfig(sfd, FSCONFIG_SET_FLAG, "user_xattr", NULL, 0);
    fsconfig(sfd, FSCONFIG_SET_STRING, "sb", "1", 0);
    fsconfig(sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
    fsinfo(sfd, NULL, ...); // query new superblock attributes
    mfd = fsmount(sfd, FSMOUNT_CLOEXEC, MS_RELATIME);
    move_mount(mfd, "", sfd, AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);

    sfd = fsopen("afs", -1);
    fsconfig(sfd, FSCONFIG_SET_STRING, "source",
    "#grand.central.org:root.cell", 0);
    fsconfig(sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
    mfd = fsmount(sfd, 0, MS_NODEV);
    move_mount(mfd, "", sfd, AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);

    If an error is reported at any step, an error message may be available to be
    read() back (ENODATA will be reported if there isn't an error available) in
    the form:

    "e <subsys>:<problem>"
    "e SELinux:Mount on mountpoint not permitted"

    Once fsmount() has been called, further fsconfig() calls will incur EBUSY,
    even if the fsmount() fails. read() is still possible to retrieve error
    information.
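
    A caller that reads such a message back could split it into its parts.
    The following is a minimal sketch, assuming each message is a single
    line starting with "e ", followed by a subsystem name, a colon and the
    message text, as in the SELinux example above. struct fs_error_msg and
    parse_fs_error() are hypothetical names, not part of the new API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical container for one parsed error message. */
struct fs_error_msg {
    char subsys[32];
    char text[128];
};

/*
 * Parse one line of the assumed form "e <subsystem>:<message>".
 * Returns 0 on success, -1 if the line does not match or does not fit.
 */
static int parse_fs_error(const char *line, struct fs_error_msg *out)
{
    const char *colon;
    size_t len;

    assert(line != NULL && out != NULL);

    /* The message must start with the "e " error prefix. */
    if (strncmp(line, "e ", 2) != 0)
        return -1;
    line += 2;

    /* The subsystem name runs up to the first colon. */
    colon = strchr(line, ':');
    if (colon == NULL)
        return -1;
    len = (size_t)(colon - line);
    if (len >= sizeof(out->subsys))
        return -1;
    memcpy(out->subsys, line, len);
    out->subsys[len] = '\0';

    /* Everything after the colon is the message text. */
    len = strlen(colon + 1);
    if (len >= sizeof(out->text))
        return -1;
    memcpy(out->text, colon + 1, len + 1);
    return 0;
}
```

    The input line would come from a read() on the fd returned by fsopen()
    after one of the calls has failed.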

    The fsopen() syscall creates a mount context and hangs it off the fd that it
    returns.

    Netlink is not used because it is optional and would make the core VFS
    dependent on the networking layer and also potentially add network
    namespace issues.

    Note that, for the moment, the caller must have CAP_SYS_ADMIN to use
    fsopen().

    Signed-off-by: David Howells
    cc: linux-api@vger.kernel.org
    Signed-off-by: Al Viro

    David Howells
     
  • Make the anon_inodes facility unconditional so that it can be used by core
    VFS code.

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     

13 Mar, 2019

1 commit

  • Pull vfs mount infrastructure updates from Al Viro:
    "The rest of core infrastructure; no new syscalls in that pile, but the
    old parts are switched to new infrastructure. At that point
    conversions of individual filesystems can happen independently; some
    are done here (afs, cgroup, procfs, etc.), there's also a large series
    outside of that pile dealing with NFS (quite a bit of option-parsing
    stuff is getting used there - it's one of the most convoluted
    filesystems in terms of mount-related logics), but NFS bits are the
    next cycle fodder.

    It got seriously simplified since the last cycle; documentation is
    probably the weakest bit at the moment - I considered dropping the
    commit introducing Documentation/filesystems/mount_api.txt (cutting
    the size increase by quarter ;-), but decided that it would be better
    to fix it up after -rc1 instead.

    That pile allows to do followup work in independent branches, which
    should make life much easier for the next cycle. fs/super.c size
    increase is unpleasant; there's a followup series that allows to
    shrink it considerably, but I decided to leave that until the next
    cycle"

    * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (41 commits)
    afs: Use fs_context to pass parameters over automount
    afs: Add fs_context support
    vfs: Add some logging to the core users of the fs_context log
    vfs: Implement logging through fs_context
    vfs: Provide documentation for new mount API
    vfs: Remove kern_mount_data()
    hugetlbfs: Convert to fs_context
    cpuset: Use fs_context
    kernfs, sysfs, cgroup, intel_rdt: Support fs_context
    cgroup: store a reference to cgroup_ns into cgroup_fs_context
    cgroup1_get_tree(): separate "get cgroup_root to use" into a separate helper
    cgroup_do_mount(): massage calling conventions
    cgroup: stash cgroup_root reference into cgroup_fs_context
    cgroup2: switch to option-by-option parsing
    cgroup1: switch to option-by-option parsing
    cgroup: take options parsing into ->parse_monolithic()
    cgroup: fold cgroup1_mount() into cgroup1_get_tree()
    cgroup: start switching to fs_context
    ipc: Convert mqueue fs to fs_context
    proc: Add fs_context support to procfs
    ...

    Linus Torvalds
     

10 Mar, 2019

1 commit

  • Pull SCSI updates from James Bottomley:
    "This is mostly update of the usual drivers: arcmsr, qla2xxx, lpfc,
    hisi_sas, target/iscsi and target/core.

    Additionally Christoph refactored gdth as part of the dma changes. The
    major mid-layer change this time is the removal of bidi commands and
    with them the whole of the osd/exofs driver and filesystem. This is a
    major simplification for block and mq in particular"

    * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (240 commits)
    scsi: cxgb4i: validate tcp sequence number only if chip version pf
    scsi: core: replace GFP_ATOMIC with GFP_KERNEL in scsi_scan.c
    scsi: mpt3sas: Add missing breaks in switch statements
    scsi: aacraid: Fix missing break in switch statement
    scsi: kill command serial number
    scsi: csiostor: drop serial_number usage
    scsi: mvumi: use request tag instead of serial_number
    scsi: dpt_i2o: remove serial number usage
    scsi: st: osst: Remove negative constant left-shifts
    scsi: ufs-bsg: Allow reading descriptors
    scsi: ufs: Allow reading descriptor via raw upiu
    scsi: ufs-bsg: Change the calling convention for write descriptor
    scsi: ufs: Remove unused device quirks
    Revert "scsi: ufs: disable vccq if it's not needed by UFS device"
    scsi: megaraid_sas: Remove a bunch of set but not used variables
    scsi: clean obsolete return values of eh_timed_out
    scsi: sd: Optimal I/O size should be a multiple of physical block size
    scsi: MAINTAINERS: SCSI initiator and target tweaks
    scsi: fcoe: make use of fip_mode enum complete
    ...

    Linus Torvalds
     

09 Mar, 2019

1 commit

  • Pull io_uring IO interface from Jens Axboe:
    "Second attempt at adding the io_uring interface.

    Since the first one, we've added basic unit testing of the three
    system calls, that resides in liburing like the other unit tests that
    we have so far. It'll take a while to get full coverage of it, but
    we're working towards it. I've also added two basic test programs to
    tools/io_uring. One uses the raw interface and has support for all the
    various features that io_uring supports outside of standard IO, like
    fixed files, fixed IO buffers, and polled IO. The other uses the
    liburing API, and is a simplified version of cp(1).

    This adds support for a new IO interface, io_uring.

    io_uring allows an application to communicate with the kernel through
    two rings, the submission queue (SQ) and completion queue (CQ) ring.
    This allows for very efficient handling of IOs, see the v5 posting for
    some basic numbers:

    https://lore.kernel.org/linux-block/20190116175003.17880-1-axboe@kernel.dk/

    Outside of just efficiency, the interface is also flexible and
    extendable, and allows for future use cases like the upcoming NVMe
    key-value store API, networked IO, and so on. It also supports async
    buffered IO, something that we've always failed to support in the
    kernel.

    Outside of basic IO features, it supports async polled IO as well.
    This particular feature has already been tested at Facebook months ago
    for flash storage boxes, with 25-33% improvements. It makes polled IO
    actually useful for real world use cases, where even basic flash sees
    a nice win in terms of efficiency, latency, and performance. These
    boxes were IOPS bound before, now they are not.

    This series adds three new system calls. One for setting up an
    io_uring instance (io_uring_setup(2)), one for submitting/completing
    IO (io_uring_enter(2)), and one for aux functions like registering
    file sets, buffers, etc (io_uring_register(2)). Through the help of
    Arnd, I've coordinated the syscall numbers so merge on that front
    should be painless.

    Jon did a writeup of the interface a while back, which (except for
    minor details that have been tweaked) is still accurate. Find that
    here:

    https://lwn.net/Articles/776703/

    Huge thanks to Al Viro for helping getting the reference cycle code
    correct, and to Jann Horn for his extensive reviews focused on both
    security and bugs in general.

    There's a userspace library that provides basic functionality for
    applications that don't need or want to care about how to fiddle with
    the rings directly. It has helpers to allow applications to easily set
    up an io_uring instance, and submit/complete IO through it without
    knowing about the intricacies of the rings. It also includes man pages
    (thanks to Jeff Moyer), and will continue to grow support helper
    functions and features as time progresses. Find it here:

    git://git.kernel.dk/liburing

    Fio has full support for the raw interface, both in the form of an IO
    engine (io_uring), but also with a small test application (t/io_uring)
    that can exercise and benchmark the interface"

    * tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-block:
    io_uring: add a few test tools
    io_uring: allow workqueue item to handle multiple buffered requests
    io_uring: add support for IORING_OP_POLL
    io_uring: add io_kiocb ref count
    io_uring: add submission polling
    io_uring: add file set registration
    net: split out functions related to registering inflight socket files
    io_uring: add support for pre-mapped user IO buffers
    block: implement bio helper to add iter bvec pages to bio
    io_uring: batch io_kiocb allocation
    io_uring: use fget/fput_many() for file references
    fs: add fget_many() and fput_many()
    io_uring: support for IO polling
    io_uring: add fsync support
    Add io_uring IO interface

    Linus Torvalds
     

28 Feb, 2019

1 commit

  • The submission queue (SQ) and completion queue (CQ) rings are shared
    between the application and the kernel. This eliminates the need to
    copy data back and forth to submit and complete IO.

    IO submissions use the io_uring_sqe data structure, and completions
    are generated in the form of io_uring_cqe data structures. The SQ
    ring is an index into the io_uring_sqe array, which makes it possible
    to submit a batch of IOs without them being contiguous in the ring.
    The CQ ring is always contiguous, as completion events are inherently
    unordered, and hence any io_uring_cqe entry can point back to an
    arbitrary submission.
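
    The indirection described above, where the SQ ring holds indices into
    the io_uring_sqe array, can be modelled in a few lines. This is a
    single-threaded sketch with hypothetical names (struct demo_sq,
    demo_sq_push(), demo_sq_pop()) and a stand-in for io_uring_sqe; it is
    not the kernel layout, and it leaves out the memory barriers that a
    ring shared between the application and the kernel requires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_ENTRIES 8u                 /* must be a power of two */
#define RING_MASK    (RING_ENTRIES - 1)

/* Simplified stand-in for io_uring_sqe; the real one has more fields. */
struct demo_sqe {
    uint64_t user_data;
};

struct demo_sq {
    uint32_t head;                      /* consumer position */
    uint32_t tail;                      /* producer position */
    uint32_t array[RING_ENTRIES];       /* ring of indices into sqes[] */
    struct demo_sqe sqes[RING_ENTRIES]; /* submission entries */
};

/* Producer: publish sqes[sqe_index]. Returns 0, or -1 if the ring is full. */
static int demo_sq_push(struct demo_sq *sq, uint32_t sqe_index)
{
    assert(sqe_index < RING_ENTRIES);
    if (sq->tail - sq->head == RING_ENTRIES)
        return -1;
    sq->array[sq->tail & RING_MASK] = sqe_index;
    sq->tail++;
    return 0;
}

/* Consumer: return the next submitted entry, or NULL if the ring is empty. */
static struct demo_sqe *demo_sq_pop(struct demo_sq *sq)
{
    uint32_t index;

    if (sq->head == sq->tail)
        return NULL;
    index = sq->array[sq->head & RING_MASK];
    sq->head++;
    return &sq->sqes[index];
}
```

    Because the ring stores indices, the producer can fill sqes[5] and
    sqes[2] and submit them in that order, even though they are not
    adjacent in the array.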

    Two new system calls are added for this:

    io_uring_setup(entries, params)
    Sets up an io_uring instance for doing async IO. On success,
    returns a file descriptor that the application can mmap to
    gain access to the SQ ring, CQ ring, and io_uring_sqes.

    io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
    Initiates IO against the rings mapped to this fd, or waits for
    them to complete, or both. The behavior is controlled by the
    parameters passed in. If 'to_submit' is non-zero, then we'll
    try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
    kernel will wait for 'min_complete' events, if they aren't
    already available. It's valid to set IORING_ENTER_GETEVENTS
    and 'min_complete' == 0 at the same time, this allows the
    kernel to return already completed events without waiting
    for them. This is useful only for polling, as for IRQ
    driven IO, the application can just check the CQ ring
    without entering the kernel.

    With this setup, it's possible to do async IO with a single system
    call. Future developments will enable polled IO with this interface,
    and polled submission as well. The latter will enable an application
    to do IO without doing ANY system calls at all.

    For IRQ driven IO, an application only needs to enter the kernel for
    completions if it wants to wait for them to occur.

    Each io_uring is backed by a workqueue, to support buffered async IO
    as well. We will only punt to an async context if the command would
    need to wait for IO on the device side. Any data that can be accessed
    directly in the page cache is done inline. This avoids the slowness
    issue of usual threadpools, since cached data is accessed as quickly
    as a sync interface.

    Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c

    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Jens Axboe