10 Jan, 2019

1 commit

  • commit fb265c9cb49e2074ddcdd4de99728aefdd3b3592 upstream.

    Today, when sb_bread() returns NULL, this can either be because of an
    I/O error or because the system failed to allocate the buffer. Since
    it's an old interface, changing would require changing many call
    sites.

    So instead we create our own ext4_sb_bread(), which also allows us to
    set the REQ_META flag.

    Also fixed a problem in the xattr code where a NULL return in a
    function could also mean that the xattr was not found, which could
    lead to the wrong error getting returned to userspace.

    Fixes: ac27a0ec112a ("ext4: initial copy of files from ext3")
    Cc: stable@kernel.org # 2.6.19
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Theodore Ts'o
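
    The idea of returning an encoded error instead of a bare NULL can be
    sketched in userspace C. This is only an illustration, not the ext4
    code: the demo_* helpers imitate the kernel's ERR_PTR()/IS_ERR()/PTR_ERR()
    macros, and demo_bread() is a hypothetical reader that fails on request.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace stand-ins for the kernel's ERR_PTR helpers (illustrative only). */
#define DEMO_MAX_ERRNO 4095

static inline void *demo_err_ptr(long err) { return (void *)(intptr_t)err; }
static inline long demo_ptr_err(const void *p) { return (long)(intptr_t)p; }
static inline int demo_is_err(const void *p)
{
    /* The top DEMO_MAX_ERRNO addresses are reserved for encoded errors. */
    return (uintptr_t)p >= (uintptr_t)(-DEMO_MAX_ERRNO);
}

struct demo_buf { char data[64]; };

enum demo_fault { DEMO_OK, DEMO_NO_MEMORY, DEMO_IO_ERROR };

/*
 * Hypothetical reader: it never returns NULL, so the caller can tell an
 * allocation failure (-ENOMEM) apart from a read failure (-EIO).
 */
static struct demo_buf *demo_bread(enum demo_fault fault)
{
    struct demo_buf *b;

    if (fault == DEMO_NO_MEMORY)
        return demo_err_ptr(-ENOMEM);
    b = calloc(1, sizeof(*b));
    if (!b)
        return demo_err_ptr(-ENOMEM);
    if (fault == DEMO_IO_ERROR) {
        free(b);
        return demo_err_ptr(-EIO);
    }
    return b;
}
```

    A caller would check demo_is_err() on the result and pass
    demo_ptr_err() up the stack unchanged, so the original cause reaches
    userspace.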
     

14 Nov, 2018

1 commit

  • commit 33458eaba4dfe778a426df6a19b7aad2ff9f7eec upstream.

    It's possible for ext4_show_quota_options() to try reading
    s_qf_names[i] while it is being modified by ext4_remount() --- most
    notably, in ext4_remount's error path when the original values of the
    quota file names get restored.

    Reported-by: syzbot+a2872d6feea6918008a9@syzkaller.appspotmail.com
    Signed-off-by: Theodore Ts'o
    Cc: stable@kernel.org # 3.2+
    Signed-off-by: Greg Kroah-Hartman

    Theodore Ts'o
     

29 Sep, 2018

1 commit

  • commit bcd8e91f98c156f4b1ebcfacae675f9cfd962441 upstream.

    A maliciously crafted file system can cause an overflow when the
    result of a 64-bit calculation is stored into a 32-bit length
    parameter.

    https://bugzilla.kernel.org/show_bug.cgi?id=200623

    Signed-off-by: Theodore Ts'o
    Reported-by: Wen Xu
    Cc: stable@vger.kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Theodore Ts'o
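
    The class of bug described above can be illustrated with a small
    range check. This is a sketch under assumptions, not the upstream
    patch: demo_checked_len() is a hypothetical helper, and -ERANGE stands
    in for whatever error the file system would actually report.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper: compute the length of the inclusive block range
 * [first, last] in 64 bits and refuse to narrow it if it does not fit
 * into the 32-bit output.
 */
static int demo_checked_len(uint64_t first, uint64_t last, uint32_t *len_out)
{
    uint64_t span;

    if (last < first)
        return -EINVAL;          /* malformed range */
    span = last - first;         /* length minus one; cannot wrap here */
    if (span > UINT32_MAX - 1)
        return -ERANGE;          /* span + 1 would not fit in 32 bits */
    *len_out = (uint32_t)(span + 1);
    return 0;
}
```

    Doing the comparison before the cast is the point: once the value has
    been truncated, the overflow can no longer be detected.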
     

11 Jul, 2018

2 commits

  • commit c37e9e013469521d9adb932d17a1795c139b36db upstream.

    If there is a directory entry pointing to a system inode (such as a
    journal inode), complain and declare the file system to be corrupted.

    Also, if the superblock's first inode number field is too small,
    refuse to mount the file system.

    This addresses CVE-2018-10882.

    https://bugzilla.kernel.org/show_bug.cgi?id=200069

    Signed-off-by: Theodore Ts'o
    Cc: stable@kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Theodore Ts'o
     
  • commit 8bc1379b82b8e809eef77a9fedbb75c6c297be19 upstream.

    Use a separate journal transaction if it turns out that we need to
    convert an inline file to use a data block. Otherwise we could end
    up failing due to not having journal credits.

    This addresses CVE-2018-10883.

    https://bugzilla.kernel.org/show_bug.cgi?id=200071

    Signed-off-by: Theodore Ts'o
    Cc: stable@kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Theodore Ts'o
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

12 Sep, 2017

1 commit

  • Pull libnvdimm from Dan Williams:
    "A rework of media error handling in the BTT driver and other updates.
    It has appeared in a few -next releases and collected some late-
    breaking build-error and warning fixups as a result.

    Summary:

    - Media error handling support in the Block Translation Table (BTT)
    driver is reworked to address sleeping-while-atomic locking and
    memory-allocation-context conflicts.

    - The dax_device lookup overhead for xfs and ext4 is moved out of the
    iomap hot-path to a mount-time lookup.

    - A new 'ecc_unit_size' sysfs attribute is added to advertise the
    read-modify-write boundary property of a persistent memory range.

    - Preparatory fix-ups for arm and powerpc pmem support are included
    along with other miscellaneous fixes"

    * tag 'libnvdimm-for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (26 commits)
    libnvdimm, btt: fix format string warnings
    libnvdimm, btt: clean up warning and error messages
    ext4: fix null pointer dereference on sbi
    libnvdimm, nfit: move the check on nd_reserved2 to the endpoint
    dax: fix FS_DAX=n BLOCK=y compilation
    libnvdimm: fix integer overflow static analysis warning
    libnvdimm, nd_blk: remove mmio_flush_range()
    libnvdimm, btt: rework error clearing
    libnvdimm: fix potential deadlock while clearing errors
    libnvdimm, btt: cache sector_size in arena_info
    libnvdimm, btt: ensure that flags were also unchanged during a map_read
    libnvdimm, btt: refactor map entry operations with macros
    libnvdimm, btt: fix a missed NVDIMM_IO_ATOMIC case in the write path
    libnvdimm, nfit: export an 'ecc_unit_size' sysfs attribute
    ext4: perform dax_device lookup at mount
    ext2: perform dax_device lookup at mount
    xfs: perform dax_device lookup at mount
    dax: introduce a fs_dax_get_by_bdev() helper
    libnvdimm, btt: check memory allocation failure
    libnvdimm, label: fix index block size calculation
    ...

    Linus Torvalds
     

01 Sep, 2017

1 commit

  • The ->iomap_begin() operation is a hot path, so cache the
    fs_dax_get_by_host() result at mount time to avoid incurring the
    hash lookup overhead on a per-i/o basis.

    Cc: "Theodore Ts'o"
    Cc: Andreas Dilger
    Reviewed-by: Jan Kara
    Reported-by: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
     

25 Aug, 2017

2 commits

  • Original Lustre ea_inode feature did not have ref counts on xattr inodes
    because there was always one parent that referenced it. New
    implementation expects ref count to be initialized which is not true for
    Lustre case. Handle this by detecting Lustre created xattr inode and set
    its ref count to 1.

    The quota handling of xattr inodes has also changed with deduplication
    support. New implementation manually manages quotas to support sharing
    across multiple users. A consequence is that a referencing inode
    incorporates the blocks of xattr inode into its own i_block field.

    We need to know how a xattr inode was created so that we can reverse the
    block charges during reference removal. This is handled by introducing a
    EXT4_STATE_LUSTRE_EA_INODE flag. The flag is set on a xattr inode if
    inode appears to have been created by Lustre. During xattr inode reference
    removal, the manual quota uncharge is skipped if the flag is set.

    Signed-off-by: Tahsin Erdogan
    Signed-off-by: Theodore Ts'o

    Tahsin Erdogan
     
  • Changing behavior based on the version code is a timebomb waiting to
    happen, and not easily bisectable. Drop it and leave any removal
    to explicit developer action. (And I don't think a file system
    should _ever_ remove backwards compatibility that has no explicit
    flag, but I'll leave that to the ext4 folks).

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Eric Biggers

    Christoph Hellwig
     

06 Aug, 2017

3 commits

  • ext4_xattr_inode_read() currently reads each block sequentially while
    waiting for io operation to complete before moving on to the next
    block. This prevents request merging in block layer.

    Add an ext4_bread_batch() function that starts reads for all blocks
    then optionally waits for them to complete. Similar logic is used
    in ext4_find_entry(), so update that code to use the new function.

    Signed-off-by: Tahsin Erdogan
    Signed-off-by: Theodore Ts'o

    Tahsin Erdogan
     
  • The dir_nlink feature has been enabled by default for new ext4
    filesystems since e2fsprogs-1.41 in 2008, and was automatically
    enabled by the kernel for older ext4 filesystems since the
    dir_nlink feature was added with ext4 in kernel 2.6.28+ when
    the subdirectory count exceeded EXT4_LINK_MAX-1.

    Automatically adding the file system features such as dir_nlink is
    generally frowned upon, since it could cause the file system to not be
    mountable on older kernels, thus preventing the administrator from
    rolling back to an older kernel if necessary.

    In this case, the administrator might also want to disable the feature
    because glibc's fts_read() function does not correctly optimize
    directory traversal for directories that use st_nlinks field of 1 to
    indicate that the number of links in the directory are not tracked by
    the file system, and could fail to traverse the full directory
    hierarchy. Fortunately, in the past ten years very few users have
    complained about incomplete file system traversal by glibc's
    fts_read().

    This commit also changes ext4_inc_count() to allow i_nlinks to reach
    the full EXT4_LINK_MAX links on the parent directory (including "."
    and "..") before changing i_links_count to be 1.

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=196405
    Signed-off-by: Andreas Dilger
    Signed-off-by: Theodore Ts'o

    Andreas Dilger
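
    The link-count rule described above can be sketched as a pure
    function. This is a simplified model, not the kernel's
    ext4_inc_count(): demo_inc_dir_nlink() is a hypothetical name, and
    the value 65000 for EXT4_LINK_MAX is an assumption stated here rather
    than taken from the message.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_LINK_MAX 65000u   /* assumed value of EXT4_LINK_MAX */

/*
 * Model of incrementing a directory's link count. A count of 1 means
 * "not tracked": a real directory always has at least 2 links ("." and
 * its entry in the parent), so 1 is unambiguous as a marker.
 */
static uint32_t demo_inc_dir_nlink(uint32_t nlink)
{
    if (nlink == 1)
        return 1;              /* already untracked, stays untracked */
    nlink++;
    if (nlink > DEMO_LINK_MAX)
        return 1;              /* exceeded the limit: stop tracking */
    return nlink;
}
```

    With this rule the count is allowed to reach the full limit, and only
    the increment past it switches the directory to the untracked value.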
     
  • I get a static checker warning:

    fs/ext4/ext4.h:3091 ext4_set_de_type()
    error: buffer overflow 'ext4_type_by_mode' 15

    Dan Carpenter
     

31 Jul, 2017

1 commit

  • Two variables in ext4_inode_info, i_reserved_meta_blocks and
    i_allocated_meta_blocks, are unused. Removing them saves a little
    memory per in-memory inode and cleans up clutter in several tracepoints.
    Adjust tracepoint output from ext4_alloc_da_blocks() for consistency
    and fix a typo and whitespace near these changes.

    Signed-off-by: Eric Whitney
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara

    Eric Whitney
     

23 Jun, 2017

1 commit

  • Currently, when we mount an ext4 filesystem with the '-o discard' option,
    we have to issue all the discard commands for the blocks to be deallocated
    and wait for the completion of the commands in the commit complete phase.
    Because this procedure might involve a lot of sequential combinations of
    issuing discard commands and waiting for them, the delay of this
    procedure might be far too long, up to 17.0s in our test,
    and it results in long commit delays and fsync() performance degradation.

    To reduce this kind of delay, instead of adding callback for each
    extent and handling all of them in a sequential manner on commit phase,
    we instead add a separate list of extents to free to the superblock and
    then process this list at once after transaction commits so that
    we can issue all the discard commands in a parallel manner like XFS
    filesystem.

    As a result, discard command handling performance improved: the
    worst-case delay of a single commit dropped from 17.0s to 4.8s.

    Signed-off-by: Daeho Jeong
    Signed-off-by: Theodore Ts'o
    Tested-by: Hobin Woo
    Tested-by: Kitae Lee
    Reviewed-by: Jan Kara

    Daeho Jeong
     

22 Jun, 2017

10 commits

  • The main purpose of mb cache is to achieve deduplication in
    extended attributes. In use cases where opportunity for deduplication
    is unlikely, it only adds overhead.

    Add a mount option to explicitly turn off mb cache.

    Suggested-by: Andreas Dilger
    Signed-off-by: Tahsin Erdogan
    Signed-off-by: Theodore Ts'o

    Tahsin Erdogan
     
  • Ext4 now supports xattr values that are up to 64k in size (vfs limit).
    Large xattr values are stored in external inodes each one holding a
    single value. Once written the data blocks of these inodes are immutable.

    The real world use cases are expected to have a lot of value duplication
    such as inherited acls etc. To reduce data duplication on disk, this patch
    implements a deduplicator that allows sharing of xattr inodes.

    The deduplication is based on an in-memory hash lookup that is a best
    effort sharing scheme. When a xattr inode is read from disk (i.e.
    getxattr() call), its crc32c hash is added to a hash table. Before
    creating a new xattr inode for a value being set, the hash table is
    checked to see if an existing inode holds an identical value. If such an
    inode is found, the ref count on that inode is incremented. On value
    removal the ref count is decremented and if it reaches zero the inode is
    deleted.

    The quota charging for such inodes is manually managed. Every reference
    holder is charged the full size as if there was no sharing happening.
    This is consistent with how xattr blocks are also charged.

    [ Fixed up journal credits calculation to handle inline data and the
    rare case where a shared xattr block can get freed when two threads
    race on breaking the xattr block sharing. --tytso ]

    Signed-off-by: Tahsin Erdogan
    Signed-off-by: Theodore Ts'o

    Tahsin Erdogan
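
    The lookup-then-share scheme described above might look roughly like
    the following userspace model. It is a sketch with several stated
    simplifications: the table is a small fixed array rather than an
    mb_cache, FNV-1a stands in for crc32c, entries are added when a value
    is created rather than when it is read, and all demo_* names are
    hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_SLOTS 8
#define DEMO_MAX_VALUE 32

struct demo_ea_inode {
    int in_use;
    uint32_t hash;
    uint32_t refcount;
    size_t len;
    unsigned char value[DEMO_MAX_VALUE];
};

static struct demo_ea_inode demo_table[DEMO_SLOTS];

/* FNV-1a, used here only as a stand-in for the crc32c hash. */
static uint32_t demo_hash(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t h = 2166136261u;

    while (len--) {
        h ^= *p++;
        h *= 16777619u;
    }
    return h;
}

/*
 * Return the slot holding `value`, sharing an existing slot when an
 * identical value is already stored. Returns -1 if the value is too
 * large or the table is full.
 */
static int demo_ea_get(const void *value, size_t len)
{
    uint32_t h;
    int i, free_slot = -1;

    if (len > DEMO_MAX_VALUE)
        return -1;
    h = demo_hash(value, len);
    for (i = 0; i < DEMO_SLOTS; i++) {
        struct demo_ea_inode *e = &demo_table[i];

        if (!e->in_use) {
            if (free_slot < 0)
                free_slot = i;
            continue;
        }
        /* The hash only narrows the search; the bytes must match too. */
        if (e->hash == h && e->len == len &&
            memcmp(e->value, value, len) == 0) {
            e->refcount++;
            return i;
        }
    }
    if (free_slot < 0)
        return -1;
    demo_table[free_slot] = (struct demo_ea_inode){
        .in_use = 1, .hash = h, .refcount = 1, .len = len,
    };
    memcpy(demo_table[free_slot].value, value, len);
    return free_slot;
}

/* Drop one reference; the slot is released when the count reaches zero. */
static void demo_ea_put(int slot)
{
    struct demo_ea_inode *e = &demo_table[slot];

    assert(e->in_use && e->refcount > 0);
    if (--e->refcount == 0)
        e->in_use = 0;
}
```

    Because the scheme is best effort, a miss in the table only costs
    disk space: a second copy is stored and nothing breaks.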
     
  • IS_NOQUOTA() indicates whether quota is disabled for an inode. Ext4
    also uses it to check whether an inode is for a quota file. The
    distinction currently doesn't matter because quota is disabled only
    for the quota files. When we start disabling quota for other inodes
    in the future, we will want to make the distinction clear.

    Replace IS_NOQUOTA() call with ext4_is_quota_file() at places where
    we are checking for quota files.

    Signed-off-by: Tahsin Erdogan
    Signed-off-by: Theodore Ts'o

    Tahsin Erdogan
     
  • There will be a second mb_cache instance that tracks ea_inodes. Make
    existing names more explicit so that it is clear that they refer to
    xattr block cache.

    Signed-off-by: Tahsin Erdogan
    Signed-off-by: Theodore Ts'o

    Tahsin Erdogan
     
  • Since this is a xattr specific data structure it is cleaner to keep it in
    xattr header file.

    Signed-off-by: Tahsin Erdogan
    Signed-off-by: Theodore Ts'o

    Tahsin Erdogan
     
  • Tracking struct inode * rather than the inode number eliminates the
    repeated ext4_xattr_inode_iget() call later. The second call cannot
    fail in practice but still requires explanation when it wants to ignore
    the return value. Avoid the trouble and make things simple.

    Signed-off-by: Tahsin Erdogan
    Signed-off-by: Theodore Ts'o

    Tahsin Erdogan
     
  • EXT4_XATTR_MAX_LARGE_EA_SIZE definition in ext4 is currently unused.
    Besides, vfs enforces its own 64k limit which makes the 1MB limit in
    ext4 redundant. Remove it.

    Signed-off-by: Tahsin Erdogan
    Signed-off-by: Theodore Ts'o

    Tahsin Erdogan
     
  • We don't need acls on xattr inodes because they are not directly
    accessible from user mode.

    Besides, lockdep complains about recursive locking of xattr_sem as seen
    below.

    =============================================
    [ INFO: possible recursive locking detected ]
    4.11.0-rc8+ #402 Not tainted
    ---------------------------------------------
    python/1894 is trying to acquire lock:
    (&ei->xattr_sem){++++..}, at: [] ext4_xattr_get+0x66/0x270

    but task is already holding lock:
    (&ei->xattr_sem){++++..}, at: [] ext4_xattr_set_handle+0xa0/0x5d0

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(&ei->xattr_sem);
    lock(&ei->xattr_sem);

    *** DEADLOCK ***

    May be due to missing lock nesting notation

    3 locks held by python/1894:
    #0: (sb_writers#10){.+.+.+}, at: [] mnt_want_write+0x1f/0x50
    #1: (&sb->s_type->i_mutex_key#15){+.+...}, at: [] vfs_setxattr+0x57/0xb0
    #2: (&ei->xattr_sem){++++..}, at: [] ext4_xattr_set_handle+0xa0/0x5d0

    stack backtrace:
    CPU: 0 PID: 1894 Comm: python Not tainted 4.11.0-rc8+ #402
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
    dump_stack+0x67/0x99
    __lock_acquire+0x5f3/0x1830
    lock_acquire+0xb5/0x1d0
    down_read+0x2f/0x60
    ext4_xattr_get+0x66/0x270
    ext4_get_acl+0x43/0x1e0
    get_acl+0x72/0xf0
    posix_acl_create+0x5e/0x170
    ext4_init_acl+0x21/0xc0
    __ext4_new_inode+0xffd/0x16b0
    ext4_xattr_set_entry+0x5ea/0xb70
    ext4_xattr_block_set+0x1b5/0x970
    ext4_xattr_set_handle+0x351/0x5d0
    ext4_xattr_set+0x124/0x180
    ext4_xattr_user_set+0x34/0x40
    __vfs_setxattr+0x66/0x80
    __vfs_setxattr_noperm+0x69/0x1c0
    vfs_setxattr+0xa2/0xb0
    setxattr+0x129/0x160
    path_setxattr+0x87/0xb0
    SyS_setxattr+0xf/0x20
    entry_SYSCALL_64_fastpath+0x18/0xad

    Signed-off-by: Tahsin Erdogan
    Signed-off-by: Theodore Ts'o

    Tahsin Erdogan
     
  • Large xattr support is implemented for EXT4_FEATURE_INCOMPAT_EA_INODE.

    If the size of an xattr value is larger than will fit in a single
    external block, then the xattr value will be saved into the body
    of an external xattr inode.

    This also helps support a larger number of xattrs, since only the headers
    will be stored in the in-inode space or the single external block.

    The inode is referenced from the xattr header via "e_value_inum",
    which was formerly "e_value_block", but that field was never used.
    The e_value_size still contains the xattr size so that listing
    xattrs does not need to look up the inode if the data is not accessed.

    struct ext4_xattr_entry {
    __u8 e_name_len; /* length of name */
    __u8 e_name_index; /* attribute name index */
    __le16 e_value_offs; /* offset in disk block of value */
    __le32 e_value_inum; /* inode in which value is stored */
    __le32 e_value_size; /* size of attribute value */
    __le32 e_hash; /* hash value of name and value */
    char e_name[0]; /* attribute name */
    };

    The xattr inode is marked with the EXT4_EA_INODE_FL flag and also
    holds a back-reference to the owning inode in its i_mtime field,
    allowing the ext4/e2fsck to verify the correct inode is accessed.

    [ Applied fix by Dan Carpenter to avoid freeing an ERR_PTR. ]

    Lustre-Jira: https://jira.hpdd.intel.com/browse/LU-80
    Lustre-bugzilla: https://bugzilla.lustre.org/show_bug.cgi?id=4424
    Signed-off-by: Kalpak Shah
    Signed-off-by: James Simmons
    Signed-off-by: Andreas Dilger
    Signed-off-by: Tahsin Erdogan
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Dan Carpenter

    Andreas Dilger
     
  • This INCOMPAT_LARGEDIR feature allows larger directories to be created
    in ldiskfs, both with directory sizes over 2GB and a maximum htree
    depth of 3 instead of the current limit of 2. These features are needed
    in order to exceed the current limit of approximately 10M entries in a
    single directory.

    This patch was originally written by Yang Sheng to support the Lustre server.

    [ Bumped the credits needed to update an indexed directory -- tytso ]

    Signed-off-by: Liang Zhen
    Signed-off-by: Yang Sheng
    Signed-off-by: Artem Blagodarenko
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Andreas Dilger

    Artem Blagodarenko
     

25 May, 2017

1 commit


09 May, 2017

1 commit

  • Pull ext4 updates from Ted Ts'o:

    - add GETFSMAP support

    - some performance improvements for very large file systems and for
    random write workloads into a preallocated file

    - bug fixes and cleanups.

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    jbd2: cleanup write flags handling from jbd2_write_superblock()
    ext4: mark superblock writes synchronous for nobarrier mounts
    ext4: inherit encryption xattr before other xattrs
    ext4: replace BUG_ON with WARN_ONCE in ext4_end_bio()
    ext4: avoid unnecessary transaction stalls during writeback
    ext4: preload block group descriptors
    ext4: make ext4_shutdown() static
    ext4: support GETFSMAP ioctls
    vfs: add common GETFSMAP ioctl definitions
    ext4: evict inline data when writing to memory map
    ext4: remove ext4_xattr_check_entry()
    ext4: rename ext4_xattr_check_names() to ext4_xattr_check_entries()
    ext4: merge ext4_xattr_list() into ext4_listxattr()
    ext4: constify static data that is never modified
    ext4: trim return value and 'dir' argument from ext4_insert_dentry()
    jbd2: fix dbench4 performance regression for 'nobarrier' mounts
    jbd2: Fix lockdep splat with generic/270 test
    mm: retry writepages() on ENOMEM when doing an data integrity writeback

    Linus Torvalds
     

04 May, 2017

1 commit

  • Pull quota, reiserfs, udf and ext2 updates from Jan Kara:
    "The branch contains changes to quota code so that it does not modify
    persistent flags in inode->i_flags (it was the only place in kernel
    doing that) and handle it inside filesystem's quotaon/off handlers
    instead.

    The branch also contains two UDF cleanups, a couple of reiserfs fixes
    and one fix for ext2 quota locking"

    * 'generic' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
    ext4: Improve comments in ext4_quota_{on|off}()
    udf: use kmap_atomic for memcpy copying
    udf: use octal for permissions
    quota: Remove dquot_quotactl_ops
    reiserfs: Remove i_attrs_to_sd_attrs()
    reiserfs: Remove useless setting of i_flags
    jfs: Remove jfs_get_inode_flags()
    ext2: Remove ext2_get_inode_flags()
    ext4: Remove ext4_get_inode_flags()
    quota: Stop setting IMMUTABLE and NOATIME flags on quota files
    jfs: Set flags on quota files directly
    ext2: Set flags on quota files directly
    reiserfs: Set flags on quota files directly
    ext4: Set flags on quota files directly
    reiserfs: Protect dquot_writeback_dquots() by s_umount semaphore
    reiserfs: Make cancel_old_flush() reliable
    ext2: Call dquot_writeback_dquots() with s_umount held
    reiserfs: avoid a -Wmaybe-uninitialized warning

    Linus Torvalds
     

30 Apr, 2017

2 commits


19 Apr, 2017

1 commit


03 Apr, 2017

1 commit

  • Return enhanced file attributes from the Ext4 filesystem. This includes
    the following:

    (1) The inode creation time (i_crtime) as stx_btime, setting STATX_BTIME.

    (2) Certain FS_xxx_FL flags are mapped to stx_attribute flags.

    This requires that all ext4 inodes have a getattr call, not just some of
    them, so to this end, split the ext4_getattr() function and only call part
    of it where appropriate.

    Example output:

    [root@andromeda ~]# touch foo
    [root@andromeda ~]# chattr +ai foo
    [root@andromeda ~]# /tmp/test-statx foo
    statx(foo) = 0
    results=fff
    Size: 0 Blocks: 0 IO Block: 4096 regular file
    Device: 08:12 Inode: 2101950 Links: 1
    Access: (0644/-rw-r--r--) Uid: 0 Gid: 0
    Access: 2016-02-11 17:08:29.031795451+0000
    Modify: 2016-02-11 17:08:29.031795451+0000
    Change: 2016-02-11 17:11:11.987790114+0000
    Birth: 2016-02-11 17:08:29.031795451+0000
    Attributes: 0000000000000030 (-------- -------- -------- -------- -------- -------- -------- --ai----)

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
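
    The flag translation in point (2) can be sketched for the two flags
    shown in the example output. This is an illustration only: the
    constants are local copies (prefixed DEMO_) of the immutable and
    append-only flag values, and demo_flags_to_stx_attributes() is a
    hypothetical name, not the ext4 function.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the flag values, so the sketch needs no kernel headers. */
#define DEMO_FS_IMMUTABLE_FL      0x00000010u
#define DEMO_FS_APPEND_FL         0x00000020u
#define DEMO_STATX_ATTR_IMMUTABLE 0x00000010ull
#define DEMO_STATX_ATTR_APPEND    0x00000020ull

/* Translate on-disk style inode flags into stx_attributes style bits. */
static uint64_t demo_flags_to_stx_attributes(uint32_t fs_flags)
{
    uint64_t attrs = 0;

    if (fs_flags & DEMO_FS_IMMUTABLE_FL)
        attrs |= DEMO_STATX_ATTR_IMMUTABLE;
    if (fs_flags & DEMO_FS_APPEND_FL)
        attrs |= DEMO_STATX_ATTR_APPEND;
    return attrs;
}
```

    For a file marked with "chattr +ai" this yields 0x30, which matches
    the Attributes line in the example output above. Flags without a
    mapping are simply not reported.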
     

04 Mar, 2017

1 commit

  • Pull vfs 'statx()' update from Al Viro.

    This adds the new extended stat() interface that internally subsumes our
    previous stat interfaces, and allows user mode to specify in more detail
    what kind of information it wants.

    It also allows for some explicit synchronization information to be
    passed to the filesystem, which can be relevant for network filesystems:
    is the cached value ok, or do you need open/close consistency, or what?

    From David Howells.

    Andreas Dilger points out that the first version of the extended statx
    interface was posted June 29, 2010:

    https://www.spinics.net/lists/linux-fsdevel/msg33831.html

    * 'rebased-statx' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    statx: Add a system call to make enhanced file info available

    Linus Torvalds
     

03 Mar, 2017

1 commit

  • Add a system call to make extended file information available, including
    file creation and some attribute flags where available through the
    underlying filesystem.

    The getattr inode operation is altered to take two additional arguments: a
    u32 request_mask and an unsigned int flags that indicate the
    synchronisation mode. This change is propagated to the vfs_getattr*()
    function.

    Functions like vfs_stat() are now inline wrappers around new functions
    vfs_statx() and vfs_statx_fd() to reduce stack usage.

    ========
    OVERVIEW
    ========

    The idea was initially proposed as a set of xattrs that could be retrieved
    with getxattr(), but the general preference proved to be for a new syscall
    with an extended stat structure.

    A number of requests were gathered for features to be included. The
    following have been included:

    (1) Make the fields a consistent size on all arches and make them large.

    (2) Spare space, request flags and information flags are provided for
    future expansion.

    (3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
    __s64).

    (4) Creation time: The SMB protocol carries the creation time, which could
    be exported by Samba, which will in turn help CIFS make use of
    FS-Cache as that can be used for coherency data (stx_btime).

    This is also specified in NFSv4 as a recommended attribute and could
    be exported by NFSD [Steve French].

    (5) Lightweight stat: Ask for just those details of interest, and allow a
    netfs (such as NFS) to approximate anything not of interest, possibly
    without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
    Dilger] (AT_STATX_DONT_SYNC).

    (6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
    its cached attributes are up to date [Trond Myklebust]
    (AT_STATX_FORCE_SYNC).

    And the following have been left out for future extension:

    (7) Data version number: Could be used by userspace NFS servers [Aneesh
    Kumar].

    Can also be used to modify fill_post_wcc() in NFSD which retrieves
    i_version directly, but has just called vfs_getattr(). It could get
    it from the kstat struct if it used vfs_xgetattr() instead.

    (There's disagreement on the exact semantics of a single field, since
    not all filesystems do this the same way).

    (8) BSD stat compatibility: Including more fields from the BSD stat such
    as creation time (st_btime) and inode generation number (st_gen)
    [Jeremy Allison, Bernd Schubert].

    (9) Inode generation number: Useful for FUSE and userspace NFS servers
    [Bernd Schubert].

    (This was asked for but later deemed unnecessary with the
    open-by-handle capability available and caused disagreement as to
    whether it's a security hole or not).

    (10) Extra coherency data may be useful in making backups [Andreas Dilger].

    (No particular data were offered, but things like last backup
    timestamp, the data version number and the DOS archive bit would come
    into this category).

    (11) Allow the filesystem to indicate what it can/cannot provide: A
    filesystem can now say it doesn't support a standard stat feature if
    that isn't available, so if, for instance, inode numbers or UIDs don't
    exist or are fabricated locally...

    (This requires a separate system call - I have an fsinfo() call idea
    for this).

    (12) Store a 16-byte volume ID in the superblock that can be returned in
    struct xstat [Steve French].

    (Deferred to fsinfo).

    (13) Include granularity fields in the time data to indicate the
    granularity of each of the times (NFSv4 time_delta) [Steve French].

    (Deferred to fsinfo).

    (14) FS_IOC_GETFLAGS value. These could be translated to BSD's st_flags.
    Note that the Linux IOC flags are a mess and filesystems such as Ext4
    define flags that aren't in linux/fs.h, so translation in the kernel
    may be a necessity (or, possibly, we provide the filesystem type too).

    (Some attributes are made available in stx_attributes, but the general
    feeling was that the IOC flags were too ext[234]-specific and shouldn't
    be exposed through statx this way).

    (15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
    Michael Kerrisk].

    (Deferred, probably to fsinfo. Finding out if there's an ACL or
    seclabel might require extra filesystem operations).

    (16) Femtosecond-resolution timestamps [Dave Chinner].

    (A __reserved field has been left in the statx_timestamp struct for
    this - if there proves to be a need).

    (17) A set multiple attributes syscall to go with this.

    ===============
    NEW SYSTEM CALL
    ===============

    The new system call is:

    int ret = statx(int dfd,
    const char *filename,
    unsigned int flags,
    unsigned int mask,
    struct statx *buffer);

    The dfd, filename and flags parameters indicate the file to query, in a
    similar way to fstatat(). There is no equivalent of lstat() as that can be
    emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags. There is
    also no equivalent of fstat() as that can be emulated by passing a NULL
    filename to statx() with the fd of interest in dfd.

    Whether or not statx() synchronises the attributes with the backing store
    can be controlled by OR'ing a value into the flags argument (this typically
    only affects network filesystems):

    (1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
    respect.

    (2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
    its attributes with the server - which might require data writeback to
    occur to get the timestamps correct.

    (3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
    network filesystem. The resulting values should be considered
    approximate.

    mask is a bitmask indicating the fields in struct statx that are of
    interest to the caller. The user should set this to STATX_BASIC_STATS to
    get the basic set returned by stat(). It should be noted that asking for
    more information may entail extra I/O operations.

    buffer points to the destination for the data. This must be 256 bytes in
    size.
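
    The mask, sync flag and buffer size rules above can be sketched as
    follows. This assumes the glibc statx() wrapper and its struct statx
    definition; the helper name cached_size is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>      /* AT_FDCWD, AT_STATX_DONT_SYNC */
#include <string.h>     /* memset */
#include <sys/stat.h>   /* statx(), struct statx, STATX_SIZE */

/* The text above requires the destination buffer to be 256 bytes. */
static_assert(sizeof(struct statx) == 256, "struct statx must be 256 bytes");

/*
 * Hypothetical helper: request only the file size and tell a network
 * filesystem not to contact the server, so the value may be approximate.
 * Returns 1 if *size was filled in, 0 if the filesystem did not supply
 * a size, and -1 (errno set) if the call itself failed.
 */
static int cached_size(const char *path, unsigned long long *size)
{
    struct statx stx;

    assert(path != NULL && size != NULL);
    memset(&stx, 0, sizeof(stx));
    if (statx(AT_FDCWD, path, AT_STATX_DONT_SYNC, STATX_SIZE, &stx) != 0)
        return -1;
    if (!(stx.stx_mask & STATX_SIZE))
        return 0;
    *size = stx.stx_size;
    return 1;
}
```

    Swapping AT_STATX_DONT_SYNC for AT_STATX_FORCE_SYNC would trade
    latency for freshness on a network filesystem.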

    ======================
    MAIN ATTRIBUTES RECORD
    ======================

    The following structures are defined in which to return the main attribute
    set:

    struct statx_timestamp {
            __s64   tv_sec;
            __s32   tv_nsec;
            __s32   __reserved;
    };

    struct statx {
            __u32   stx_mask;
            __u32   stx_blksize;
            __u64   stx_attributes;
            __u32   stx_nlink;
            __u32   stx_uid;
            __u32   stx_gid;
            __u16   stx_mode;
            __u16   __spare0[1];
            __u64   stx_ino;
            __u64   stx_size;
            __u64   stx_blocks;
            __u64   __spare1[1];
            struct statx_timestamp  stx_atime;
            struct statx_timestamp  stx_btime;
            struct statx_timestamp  stx_ctime;
            struct statx_timestamp  stx_mtime;
            __u32   stx_rdev_major;
            __u32   stx_rdev_minor;
            __u32   stx_dev_major;
            __u32   stx_dev_minor;
            __u64   __spare2[14];
    };

    The defined bits in the mask argument and in stx_mask are:

    STATX_TYPE          Want/got stx_mode & S_IFMT
    STATX_MODE          Want/got stx_mode & ~S_IFMT
    STATX_NLINK         Want/got stx_nlink
    STATX_UID           Want/got stx_uid
    STATX_GID           Want/got stx_gid
    STATX_ATIME         Want/got stx_atime
    STATX_MTIME         Want/got stx_mtime
    STATX_CTIME         Want/got stx_ctime
    STATX_INO           Want/got stx_ino
    STATX_SIZE          Want/got stx_size
    STATX_BLOCKS        Want/got stx_blocks
    STATX_BASIC_STATS   [The stuff in the normal stat struct]
    STATX_BTIME         Want/got stx_btime
    STATX_ALL           [All currently available stuff]

    stx_btime is the file creation time, stx_mask is a bitmask indicating the
    data provided, and __spare*[] are where as-yet undefined fields can be
    placed.

    Time fields are structures with separate seconds and nanoseconds fields
    plus a reserved field in case we want to add even finer resolution. Note
    that times will be negative if before 1970; in such a case, the nanosecond
    fields will also be negative if not zero.
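
    As a worked example of that sign convention, the sketch below
    flattens a timestamp into nanoseconds relative to the epoch. The
    struct is a local mirror of statx_timestamp so the example builds
    without kernel headers; the names ts_sketch and ts_to_ns are
    hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of struct statx_timestamp as laid out in the text above. */
struct ts_sketch {
    int64_t tv_sec;
    int32_t tv_nsec;
    int32_t reserved;
};

/*
 * Under the convention described above, tv_sec and tv_nsec carry the
 * same sign for times before 1970, so the two parts can simply be added.
 * Note: the multiplication overflows for |tv_sec| beyond roughly 292
 * years; a production version would need to check for that.
 */
static int64_t ts_to_ns(const struct ts_sketch *ts)
{
    assert(ts != NULL);
    return ts->tv_sec * INT64_C(1000000000) + ts->tv_nsec;
}
```

    For instance, a time 1.5 seconds before the epoch is stored as
    tv_sec = -1 and tv_nsec = -500000000, which ts_to_ns() turns into
    -1500000000.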

    The bits defined in the stx_attributes field convey information about a
    file, how it is accessed, where it is and what it does. The following
    attributes map to FS_*_FL flags and are the same numerical value:

    STATX_ATTR_COMPRESSED   File is compressed by the fs
    STATX_ATTR_IMMUTABLE    File is marked immutable
    STATX_ATTR_APPEND       File is append-only
    STATX_ATTR_NODUMP       File is not to be dumped
    STATX_ATTR_ENCRYPTED    File requires key to decrypt in fs

    Within the kernel, the supported flags are listed by:

    KSTAT_ATTR_FS_IOC_FLAGS

    [Are any other IOC flags of sufficient general interest to be exposed
    through this interface?]

    New flags include:

    STATX_ATTR_AUTOMOUNT    Object is an automount trigger

    These are for the use of GUI tools that might want to mark files specially,
    depending on what they are.
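
    A sketch of how such a tool might decode stx_attributes. It assumes
    the STATX_ATTR_* constants are available from <linux/stat.h>; the
    helper name describe_attrs and the label strings are hypothetical.

```c
#include <assert.h>
#include <linux/stat.h>   /* STATX_ATTR_* */
#include <stdio.h>        /* snprintf, size_t */

static const struct {
    unsigned long long bit;
    const char *name;
} attr_names[] = {
    { STATX_ATTR_COMPRESSED, "compressed"  },
    { STATX_ATTR_IMMUTABLE,  "immutable"   },
    { STATX_ATTR_APPEND,     "append-only" },
    { STATX_ATTR_NODUMP,     "nodump"      },
    { STATX_ATTR_ENCRYPTED,  "encrypted"   },
    { STATX_ATTR_AUTOMOUNT,  "automount"   },
};

/*
 * Hypothetical helper: write a comma-separated list of the attribute
 * flags set in attrs into buf and return how many were written.
 * Flags that do not fit in buf are silently dropped.
 */
static int describe_attrs(unsigned long long attrs, char *buf, size_t len)
{
    size_t used = 0;
    int found = 0;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof(attr_names) / sizeof(attr_names[0]); i++) {
        int n;

        if (!(attrs & attr_names[i].bit))
            continue;
        n = snprintf(buf + used, len - used, "%s%s",
                     found ? "," : "", attr_names[i].name);
        if (n < 0 || (size_t)n >= len - used) {
            buf[used] = '\0';   /* undo the partial write and stop */
            break;
        }
        used += (size_t)n;
        found++;
    }
    return found;
}
```

    A file manager could call describe_attrs(stx.stx_attributes, ...) and
    pick an icon overlay from the resulting list.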

    Fields in struct statx come in a number of classes:

    (0) stx_dev_*, stx_blksize.

    These are local system information and are always available.

    (1) stx_mode, stx_nlink, stx_uid, stx_gid, stx_[amc]time, stx_ino,
    stx_size, stx_blocks.

    These will be returned whether the caller asks for them or not. The
    corresponding bits in stx_mask will be set to indicate whether they
    actually have valid values.

    If the caller didn't ask for them, then they may be approximated. For
    example, NFS won't waste any time updating them from the server,
    unless as a byproduct of updating something requested.

    If the values don't actually exist for the underlying object (such as
    UID or GID on a DOS file), then the bit won't be set in the stx_mask,
    even if the caller asked for the value. In such a case, the returned
    value will be a fabrication.

    Note that there are instances where the type might not be valid, for
    instance Windows reparse points.

    (2) stx_rdev_*.

    This will be set only if stx_mode indicates we're looking at a
    blockdev or a chardev; otherwise it will be 0.

    (3) stx_btime.

    Similar to (1), except this will be set to 0 if it doesn't exist.
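
    The practical consequence of these classes is that a caller should
    consult stx_mask before trusting a field. A minimal sketch, assuming
    struct statx and the STATX_* bits come from <linux/stat.h>; the
    helper names field_valid and get_btime_sec are hypothetical.

```c
#include <assert.h>
#include <linux/stat.h>   /* struct statx, STATX_* */
#include <stddef.h>

/* Returns 1 if the filesystem marked the given STATX_* bit as valid. */
static int field_valid(const struct statx *stx, unsigned int bit)
{
    assert(stx != NULL);
    return (stx->stx_mask & bit) == bit;
}

/*
 * Hypothetical helper: fetch the creation time in seconds.  Returns 1
 * and fills *sec when STATX_BTIME is set in stx_mask, and 0 when the
 * underlying object has no creation time (the field is then 0).
 */
static int get_btime_sec(const struct statx *stx, long long *sec)
{
    assert(sec != NULL);
    if (!field_valid(stx, STATX_BTIME))
        return 0;
    *sec = stx->stx_btime.tv_sec;
    return 1;
}
```

    The same field_valid() test applies to class (1) fields such as
    stx_uid, where an unset bit means the returned value is a
    fabrication.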

    =======
    TESTING
    =======

    The following test program can be used to test the statx system call:

    samples/statx/test-statx.c

    Just compile and run, passing it paths to the files you want to examine.
    The file is built automatically if CONFIG_SAMPLES is enabled.

    Here's some example output. Firstly, an NFS directory that crosses to
    another FSID. Note that the AUTOMOUNT attribute is set because transiting
    this directory will cause d_automount to be invoked by the VFS.

    [root@andromeda ~]# /tmp/test-statx -A /warthog/data
    statx(/warthog/data) = 0
    results=7ff
    Size: 4096 Blocks: 8 IO Block: 1048576 directory
    Device: 00:26 Inode: 1703937 Links: 125
    Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
    Access: 2016-11-24 09:02:12.219699527+0000
    Modify: 2016-11-17 10:44:36.225653653+0000
    Change: 2016-11-17 10:44:36.225653653+0000
    Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)

    Secondly, the result of automounting on that directory.

    [root@andromeda ~]# /tmp/test-statx /warthog/data
    statx(/warthog/data) = 0
    results=7ff
    Size: 4096 Blocks: 8 IO Block: 1048576 directory
    Device: 00:27 Inode: 2 Links: 125
    Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
    Access: 2016-11-24 09:02:12.219699527+0000
    Modify: 2016-11-17 10:44:36.225653653+0000
    Change: 2016-11-17 10:44:36.225653653+0000

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     

02 Mar, 2017

1 commit


25 Feb, 2017

1 commit

  • ->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to
    take a vma and vmf parameter when the vma already resides in vmf.

    Remove the vma parameter to simplify things.
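
    The shape of the change can be illustrated with a userspace mock.
    The demo_* types and functions below are hypothetical stand-ins, not
    the kernel's struct vm_area_struct, struct vm_fault or any real
    handler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's vma and vmf structures. */
struct demo_vma {
    unsigned long vm_start;
};

struct demo_vmf {
    struct demo_vma *vma;     /* the fault descriptor already carries the vma */
    unsigned long address;
};

/* Old shape: the vma is passed separately even though vmf points to it. */
static unsigned long demo_fault_old(struct demo_vma *vma, struct demo_vmf *vmf)
{
    assert(vma != NULL && vmf != NULL);
    return vmf->address - vma->vm_start;
}

/* New shape: one parameter, with the vma reached through vmf->vma. */
static unsigned long demo_fault_new(struct demo_vmf *vmf)
{
    assert(vmf != NULL && vmf->vma != NULL);
    return vmf->address - vmf->vma->vm_start;
}
```

    Both variants compute the same offset; the second simply removes the
    chance of the two arguments disagreeing.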

    [arnd@arndb.de: fix ARM build]
    Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.com
    Signed-off-by: Dave Jiang
    Signed-off-by: Arnd Bergmann
    Reviewed-by: Ross Zwisler
    Cc: Theodore Ts'o
    Cc: Darrick J. Wong
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Christoph Hellwig
    Cc: Jan Kara
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jiang
     

23 Feb, 2017

1 commit

  • Pull xfs updates from Darrick Wong:
    "Here are the XFS changes for 4.11. We aren't introducing any major
    features in this release cycle except for this being the first merge
    window I've managed on my own. :)

    Changes since last update:

    - Various cleanups

    - Livelock fixes for eofblocks scanning

    - Improved input verification for on-disk metadata

    - Fix races in the copy on write remap mechanism

    - Fix buffer io error timeout controls

    - Streamlining of directio copy on write

    - Asynchronous discard support

    - Fix asserts when splitting delalloc reservations

    - Don't bloat bmbt when right shifting extents

    - Inode alignment fixes for 32k block sizes"

    * tag 'xfs-4.11-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (39 commits)
    xfs: remove XFS_ALLOCTYPE_ANY_AG and XFS_ALLOCTYPE_START_AG
    xfs: simplify xfs_rtallocate_extent
    xfs: tune down agno asserts in the bmap code
    xfs: Use xfs_icluster_size_fsb() to calculate inode chunk alignment
    xfs: don't reserve blocks for right shift transactions
    xfs: fix len comparison in xfs_extent_busy_trim
    xfs: fix uninitialized variable in _reflink_convert_cow
    xfs: split indlen reservations fairly when under reserved
    xfs: handle indlen shortage on delalloc extent merge
    xfs: resurrect debug mode drop buffered writes mechanism
    xfs: clear delalloc and cache on buffered write failure
    xfs: don't block the log commit handler for discards
    xfs: improve busy extent sorting
    xfs: improve handling of busy extents in the low-level allocator
    xfs: don't fail xfs_extent_busy allocation
    xfs: correct null checks and error processing in xfs_initialize_perag
    xfs: update ctime and mtime on clone destinatation inodes
    xfs: allocate direct I/O COW blocks in iomap_begin
    xfs: go straight to real allocations for direct I/O COW writes
    xfs: return the converted extent in __xfs_reflink_convert_cow
    ...

    Linus Torvalds
     

21 Feb, 2017

2 commits

  • Pull ext4 updates from Ted Ts'o:
    "For this cycle we add support for the shutdown ioctl, which is
    primarily used for testing, but which can be useful on production
    systems when a scratch volume is being destroyed and the data on it
    doesn't need to be saved.

    This found (and we fixed) a number of bugs with ext4's recovery to
    corrupted file system --- the bugs increased the amount of data that
    could be potentially lost, and in the case of the inline data feature,
    could cause the kernel to BUG.

    Also included are a number of other bug fixes, including in ext4's
    fscrypt, DAX, inline data support"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (26 commits)
    ext4: rename EXT4_IOC_GOINGDOWN to EXT4_IOC_SHUTDOWN
    ext4: fix fencepost in s_first_meta_bg validation
    ext4: don't BUG when truncating encrypted inodes on the orphan list
    ext4: do not use stripe_width if it is not set
    ext4: fix stripe-unaligned allocations
    dax: assert that i_rwsem is held exclusive for writes
    ext4: fix DAX write locking
    ext4: add EXT4_IOC_GOINGDOWN ioctl
    ext4: add shutdown bit and check for it
    ext4: rename s_resize_flags to s_ext4_flags
    ext4: return EROFS if device is r/o and journal replay is needed
    ext4: preserve the needs_recovery flag when the journal is aborted
    jbd2: don't leak modified metadata buffers on an aborted journal
    ext4: fix inline data error paths
    ext4: move halfmd4 into hash.c directly
    ext4: fix use-after-iput when fscrypt contexts are inconsistent
    jbd2: fix use after free in kjournald2()
    ext4: fix data corruption in data=journal mode
    ext4: trim allocation requests to group size
    ext4: replace BUG_ON with WARN_ON in mb_find_extent()
    ...

    Linus Torvalds
     
  • It's very likely the file system independent ioctl name will be
    FS_IOC_SHUTDOWN, so let's use the same name for the ext4 ioctl name.

    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     

07 Feb, 2017

1 commit

  • Previously, each filesystem configured without encryption support would
    define all the public fscrypt functions to their notsupp_* stubs. This
    list of #defines had to be updated in every filesystem whenever a change
    was made to the public fscrypt functions. To make things more
    maintainable now that we have three filesystems using fscrypt, split the
    old header fscrypto.h into several new headers. fscrypt_supp.h contains
    the real declarations and is included by filesystems when configured
    with encryption support, whereas fscrypt_notsupp.h contains the inline
    stubs and is included by filesystems when configured without encryption
    support. fscrypt_common.h contains common declarations needed by both.
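
    The supp/notsupp split follows a common pattern that can be sketched
    in a single file. Everything here is hypothetical: DEMO_FS_ENCRYPTION
    stands in for a filesystem's config option, demo_crypt_get_key is not
    a real fscrypt function, and the error code is only an assumption
    about what a stub might return.

```c
#include <assert.h>
#include <errno.h>    /* EOPNOTSUPP */
#include <stddef.h>

/* Uncomment to select the "supported" side (a definition must then be linked in). */
/* #define DEMO_FS_ENCRYPTION 1 */

#ifdef DEMO_FS_ENCRYPTION
/* "supp" side: a real declaration; the definition lives in shared code. */
int demo_crypt_get_key(const void *inode);
#else
/* "notsupp" side: an inline stub, so callers compile unchanged. */
static inline int demo_crypt_get_key(const void *inode)
{
    (void)inode;
    return -EOPNOTSUPP;
}
#endif

/* Filesystem code is identical in both configurations. */
static int demo_open(const void *inode)
{
    assert(inode != NULL);
    return demo_crypt_get_key(inode);
}
```

    With this layout, adding a public function means touching the two
    headers once rather than a list of #defines in every filesystem.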

    Signed-off-by: Eric Biggers
    Signed-off-by: Theodore Ts'o

    Eric Biggers