15 Dec, 2015

3 commits

  • commit 4327ba52afd03fc4b5afa0ee1d774c9c5b0e85c5 upstream.

    If an EXT4 filesystem uses JBD2 journaling and an error occurs, the
    journal is aborted first, the error number is recorded in the JBD2
    superblock and, finally, the system panics if the "errors=panic"
    option is set. In rare cases, however, this sequence is reordered as
    shown in the figure below: the system panics, which on a mobile
    device means a system reset, before the error has been recorded in
    the journal superblock. In this case, e2fsck cannot recognize that a
    filesystem failure occurred in the previous run, and the corruption
    is not fixed.

    Task A                        Task B
    ext4_handle_error()
    -> jbd2_journal_abort()
      -> __journal_abort_soft()
        -> __jbd2_journal_abort_hard()
        | -> journal->j_flags |= JBD2_ABORT;
        |
        |                         __ext4_abort()
        |                         -> jbd2_journal_abort()
        |                         | -> __journal_abort_soft()
        |                         |   -> if (journal->j_flags & JBD2_ABORT)
        |                         |        return;
        |                         -> panic()
        |
        -> jbd2_journal_update_sb_errno()

    Tested-by: Hobin Woo
    Signed-off-by: Daeho Jeong
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Daeho Jeong
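
The interleaving in the commit above can be replayed with a small toy model. This is only a sketch: the `Journal` class, the step names and the `wait_for_record` guard are hypothetical and not taken from the patch; the guard shows one possible way to keep a task from panicking before the error number has reached the journal superblock.

```python
from dataclasses import dataclass


@dataclass
class Journal:
    aborted: bool = False   # stands in for JBD2_ABORT in j_flags
    sb_errno: int = 0       # error number stored in the journal superblock


def run(steps, wait_for_record):
    """Replay one interleaving; return the recorded errno at panic time.

    A return value of 0 means the system went down before the error
    reached the journal superblock. None means no panic happened.
    """
    journal = Journal()
    for step in steps:
        if step == "A:set_abort":
            journal.aborted = True
        elif step == "A:record_errno":
            journal.sb_errno = 5            # EIO, an arbitrary demo value
        elif step == "A:panic":
            return journal.sb_errno
        elif step == "B:abort_then_panic":
            # Task B finds the journal already aborted and returns early
            # from the abort path, so nothing is recorded on its behalf.
            if wait_for_record and journal.sb_errno == 0:
                continue                    # hypothetical guard: do not panic yet
            return journal.sb_errno
    return None


RACY = ["A:set_abort", "B:abort_then_panic", "A:record_errno", "A:panic"]
print(run(RACY, wait_for_record=False))     # 0: error lost
print(run(RACY, wait_for_record=True))      # 5: error recorded before the panic
```

With the guard in place, the panic is left to the task that actually records the error, so e2fsck would find the error number on the next run.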
     
  • commit 6934da9238da947628be83635e365df41064b09b upstream.

    There is a use-after-free possibility in __ext4_journal_stop() in the
    case that we free the handle in the first jbd2_journal_stop() because
    we're referencing handle->h_err afterwards. This was introduced in
    9705acd63b125dee8b15c705216d7186daea4625 and it is wrong. Fix it by
    storing the handle->h_err value beforehand and avoid referencing
    potentially freed handle.

    Fixes: 9705acd63b125dee8b15c705216d7186daea4625
    Signed-off-by: Lukas Czerner
    Reviewed-by: Andreas Dilger
    Signed-off-by: Greg Kroah-Hartman

    Lukas Czerner
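
The pattern behind this fix, copying a field out of an object before calling something that may release it, can be illustrated with a toy model. Python has no real use-after-free, so the hypothetical `Handle` class below raises an error when it is read after being "freed"; the function names are illustrative and are not the kernel's.

```python
class Handle:
    """Toy journal handle; `freed` mimics memory returned to the allocator."""

    def __init__(self, err):
        self._err = err
        self.freed = False

    @property
    def h_err(self):
        if self.freed:
            raise RuntimeError("use-after-free: handle was already released")
        return self._err


def journal_stop(handle):
    """Stand-in for the stop call: may release the handle, returns its own status."""
    handle.freed = True
    return 0


def stop_buggy(handle):
    rc = journal_stop(handle)
    return handle.h_err or rc       # reads the handle after it may be freed


def stop_fixed(handle):
    err = handle.h_err              # copy the value while the handle is valid
    rc = journal_stop(handle)
    return err or rc                # prefer the handle's error, else the stop status


print(stop_fixed(Handle(-5)))
```

Calling `stop_buggy(Handle(-5))` raises `RuntimeError` in this model, which is the toy equivalent of touching freed memory.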
     
  • commit 937d7b84dca58f2565715f2c8e52f14c3d65fb22 upstream.

    There are times when ext4_bio_write_page() is called even though we
    don't actually need to do any I/O. This happens when ext4_writepage()
    gets called by the jbd2 commit path when an inode needs to force its
    pages written out in order to provide data=ordered guarantees --- and
    a page is backed by an unwritten (e.g., uninitialized) block on disk,
    or if delayed allocation means the page's backing store hasn't been
    allocated yet. In that case, we need to skip the call to
    ext4_encrypt_page(), since in addition to wasting CPU, it leads to a
    bounce page and an ext4 crypto context getting leaked.

    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Theodore Ts'o
     

30 Sep, 2015

2 commits

  • commit bdfe0cbd746aa9b2509c2f6d6be17193cf7facd7 upstream.

    This reverts commit 08439fec266c3cc5702953b4f54bdf5649357de0.

    Unfortunately we still need to test for bdi->dev to avoid a crash when a
    USB stick is yanked out while a file system is mounted:

    usb 2-2: USB disconnect, device number 2
    Buffer I/O error on dev sdb1, logical block 15237120, lost sync page write
    JBD2: Error -5 detected when updating journal superblock for sdb1-8.
    BUG: unable to handle kernel paging request at 34beb000
    IP: [] __percpu_counter_add+0x18/0xc0
    *pdpt = 0000000023db9001 *pde = 0000000000000000
    Oops: 0000 [#1] SMP
    CPU: 0 PID: 4083 Comm: umount Tainted: G U OE 4.1.1-040101-generic #201507011435
    Hardware name: LENOVO 7675CTO/7675CTO, BIOS 7NETC2WW (2.22 ) 03/22/2011
    task: ebf06b50 ti: ebebc000 task.ti: ebebc000
    EIP: 0060:[] EFLAGS: 00010082 CPU: 0
    EIP is at __percpu_counter_add+0x18/0xc0
    EAX: f21c8e88 EBX: f21c8e88 ECX: 00000000 EDX: 00000001
    ESI: 00000001 EDI: 00000000 EBP: ebebde60 ESP: ebebde40
    DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
    CR0: 8005003b CR2: 34beb000 CR3: 33354200 CR4: 000007f0
    Stack:
    c1abe100 edcb0098 edcb00ec ffffffff f21c8e68 ffffffff f21c8e68 f286d160
    ebebde84 c1160454 00000010 00000282 f72a77f8 00000984 f72a77f8 f286d160
    f286d170 ebebdea0 c11e613f 00000000 00000282 f72a77f8 edd7f4d0 00000000
    Call Trace:
    [] account_page_dirtied+0x74/0x110
    [] __set_page_dirty+0x3f/0xb0
    [] mark_buffer_dirty+0x53/0xc0
    [] ext4_commit_super+0x17b/0x250
    [] ext4_put_super+0xc1/0x320
    [] ? fsnotify_unmount_inodes+0x1aa/0x1c0
    [] ? evict_inodes+0xca/0xe0
    [] generic_shutdown_super+0x6a/0xe0
    [] ? prepare_to_wait_event+0xd0/0xd0
    [] ? unregister_shrinker+0x40/0x50
    [] kill_block_super+0x26/0x70
    [] deactivate_locked_super+0x45/0x80
    [] deactivate_super+0x47/0x60
    [] cleanup_mnt+0x39/0x80
    [] __cleanup_mnt+0x10/0x20
    [] task_work_run+0x91/0xd0
    [] do_notify_resume+0x7c/0x90
    [] work_notify
    Code: 8b 55 e8 e9 f4 fe ff ff 90 90 90 90 90 90 90 90 90 90 90 55 89 e5 83 ec 20 89 5d f4 89 c3 89 75 f8 89 d6 89 7d fc 89 cf 8b 48 14 8b 01 89 45 ec 89 c2 8b 45 08 c1 fa 1f 01 75 ec 89 55 f0 89
    EIP: [] __percpu_counter_add+0x18/0xc0 SS:ESP 0068:ebebde40
    CR2: 0000000034beb000
    ---[ end trace dd564a7bea834ecd ]---

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=101011

    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Theodore Ts'o
     
  • commit c642dc9e1aaed953597e7092d7df329e6234096e upstream.

    At some point along this sequence of changes:

    f6e63f9 ext4: fold ext4_nojournal_sops into ext4_sops
    bb04457 ext4: support freezing ext2 (nojournal) file systems
    9ca9238 ext4: Use separate super_operations structure for no_journal filesystems

    ext4 started setting needs_recovery on filesystems without journals
    when they are unfrozen. This makes no sense, and in fact confuses
    blkid to the point where it doesn't recognize the filesystem at all.

    (freeze ext2; unfreeze ext2; run blkid; see no output; run dumpe2fs,
    see needs_recovery set on fs w/ no journal).

    To fix this, don't manipulate the INCOMPAT_RECOVER feature on
    filesystems without journals.

    Reported-by: Stu Mark
    Reviewed-by: Jan Kara
    Signed-off-by: Eric Sandeen
    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Eric Sandeen
     

22 Sep, 2015

1 commit

  • commit a068acf2ee77693e0bf39d6e07139ba704f461c3 upstream.

    Many file systems that implement the show_options hook fail to correctly
    escape their output which could lead to unescaped characters (e.g. new
    lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files. This
    could lead to confusion, spoofed entries (resulting in things like
    systemd issuing false d-bus "mount" notifications), and who knows what
    else. This looks like it would only be the root user stepping on
    themselves, but it's possible weird things could happen in containers or
    in other situations with delegated mount privileges.

    Here's an example using overlay with setuid fusermount trusting the
    contents of /proc/mounts (via the /etc/mtab symlink). Imagine the use
    of "sudo" is something more sneaky:

    $ BASE="ovl"
    $ MNT="$BASE/mnt"
    $ LOW="$BASE/lower"
    $ UP="$BASE/upper"
    $ WORK="$BASE/work/ 0 0
    none /proc fuse.pwn user_id=1000"
    $ mkdir -p "$LOW" "$UP" "$WORK"
    $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
    $ cat /proc/mounts
    none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0
    none /proc fuse.pwn user_id=1000 0 0
    $ fusermount -u /proc
    $ cat /proc/mounts
    cat: /proc/mounts: No such file or directory

    This fixes the problem by adding new seq_show_option and
    seq_show_option_n helpers, and updating the vulnerable show_option
    handlers to use them as needed. Some, like SELinux, need to be open
    coded due to unusual existing escape mechanisms.

    [akpm@linux-foundation.org: add lost chunk, per Kees]
    [keescook@chromium.org: seq_show_option should be using const parameters]
    Signed-off-by: Kees Cook
    Acked-by: Serge Hallyn
    Acked-by: Jan Kara
    Acked-by: Paul Moore
    Cc: J. R. Okajima
    Signed-off-by: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
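
The kind of escaping such a helper needs can be sketched as follows. This is an illustration, not the kernel implementation: the set of characters treated as unsafe and the backslash-plus-octal output format are assumptions, and `escape_option` and `show_option` are hypothetical names.

```python
def escape_option(text, unsafe=",= \t\n\\"):
    """Replace each unsafe character with a backslash and its 3-digit octal code."""
    return "".join(
        "\\%03o" % ord(ch) if ch in unsafe else ch
        for ch in text
    )


def show_option(name, value=None):
    """Format one ',name=value' fragment with both parts escaped."""
    out = "," + escape_option(name)
    if value is not None:
        out += "=" + escape_option(value)
    return out


# The workdir value from the example above, with its embedded newline.
workdir = "ovl/work/ 0 0\nnone /proc fuse.pwn user_id=1000"
print(show_option("workdir", workdir))
```

Because the newline and the spaces are emitted as escape sequences, the value can no longer start a second, spoofed line in a mounts-style listing.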
     

04 Aug, 2015

10 commits

  • commit 7444a072c387a93ebee7066e8aee776954ab0e41 upstream.

    ext4_free_blocks is looping around the allocation request and mimics
    __GFP_NOFAIL behavior without any allocation fallback strategy. Let's
    remove the open coded loop and replace it with __GFP_NOFAIL. Without the
    flag the allocator has no way to learn of the never-fail requirement and
    cannot help in any way.

    Signed-off-by: Michal Hocko
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     
  • commit 8974fec7d72e3e02752fe0f27b4c3719c78d9a15 upstream.

    Currently ext4_ind_migrate() doesn't correctly handle a file which
    contains a hole at the beginning of the file. This causes the migration
    to be done incorrectly; a subsequent delayed allocation write to the
    "hole" then reclaims the same data blocks again, resulting in fs
    corruption.

    # assuming 4k block size ext4, with delalloc enabled
    # skip the first block and write to the second block
    xfs_io -fc "pwrite 4k 4k" -c "fsync" /mnt/ext4/testfile

    # converting to indirect-mapped file, which would move the data blocks
    # to the beginning of the file, but extent status cache still marks
    # that region as a hole
    chattr -e /mnt/ext4/testfile

    # delayed allocation writes to the "hole", reclaim the same data block
    # again, results in i_blocks corruption
    xfs_io -c "pwrite 0 4k" /mnt/ext4/testfile
    umount /mnt/ext4
    e2fsck -nf /dev/sda6
    ...
    Inode 53, i_blocks is 16, should be 8. Fix? no
    ...

    Signed-off-by: Eryu Guan
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Eryu Guan
     
  • commit d6f123a9297496ad0b6335fe881504c4b5b2a5e5 upstream.

    Currently the check in ext4_ind_migrate() is not enough before doing the
    real conversion:

    a) delayed allocated extents could bypass the check on eh->eh_entries
    and eh->eh_depth

    This can be demonstrated by this script

    xfs_io -fc "pwrite 0 4k" -c "pwrite 8k 4k" /mnt/ext4/testfile
    chattr -e /mnt/ext4/testfile

    where testfile has two extents but is still converted to the
    non-extent based file format.

    b) only the extent length is checked but not the offset, which results
    in data loss (delalloc) or fs corruption (nodelalloc), because a
    non-extent based file supports at most (12 + 2^10 + 2^20 + 2^30)
    blocks

    This can be demonstrated by

    xfs_io -fc "pwrite 5T 4k" /mnt/ext4/testfile
    chattr -e /mnt/ext4/testfile
    sync

    If delalloc is enabled, dmesg prints
    EXT4-fs warning (device dm-4): ext4_block_to_path:105: block 1342177280 > max in inode 53
    EXT4-fs (dm-4): Delayed block allocation failed for inode 53 at logical offset 1342177280 with max blocks 1 with error 5
    EXT4-fs (dm-4): This should not happen!! Data will be lost

    If delalloc is disabled, e2fsck -nf shows corruption
    Inode 53, i_size is 5497558142976, should be 4096. Fix? no

    Fix the two issues by

    a) forcing all delayed allocation blocks to be allocated before checking
    eh->eh_depth and eh->eh_entries
    b) requiring that the last logical block of the extent lies within the direct map

    Signed-off-by: Eryu Guan
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Eryu Guan
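
The block limit quoted in the commit above can be checked against the reproducer with a little arithmetic. The sketch assumes 4-byte block pointers in indirect blocks, a 4 KiB block size, and that the "5T" offset means 5 TiB; the function names are made up for the example.

```python
def max_indirect_blocks(block_size=4096):
    """Blocks reachable through direct plus single/double/triple indirect maps."""
    per_block = block_size // 4            # assumed 32-bit block pointers
    return 12 + per_block + per_block**2 + per_block**3


def logical_block(byte_offset, block_size=4096):
    """Logical block number that contains the given byte offset."""
    return byte_offset // block_size


TIB = 1 << 40
limit = max_indirect_blocks()              # 12 + 2^10 + 2^20 + 2^30 at 4 KiB
blk = logical_block(5 * TIB)               # first block touched by "pwrite 5T 4k"
print(limit, blk, blk > limit)             # 1074791436 1342177280 True
```

The computed block number matches the one in the dmesg warning, and 5 TiB plus one 4 KiB block equals the i_size of 5497558142976 reported by e2fsck.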
     
  • commit 9705acd63b125dee8b15c705216d7186daea4625 upstream.

    On a delalloc-enabled file system, during the invalidatepage operation
    in ext4_da_page_release_reservation() we want to clear the delayed
    buffer and remove the extent covering the delayed buffer from the extent
    status tree.

    However, there is currently a bug where on systems with page size >
    block size we always remove extents from the start of the page,
    regardless of where the actual delayed buffers are positioned in the page.
    This leads to errors like this:

    EXT4-fs warning (device loop0): ext4_da_release_space:1225:
    ext4_da_release_space: ino 13, to_free 1 with only 0 reserved data
    blocks

    However, this can cause data loss at writeback time if the file system is
    in an ENOSPC condition, because we're releasing the reservation for
    someone else's delayed buffer.

    Fix this by only removing extents that correspond to the part of the
    page we want to invalidate.

    This problem is reproducible with the following fio recipe (however, I was
    only able to reproduce it with fio-2.1 or older).

    [global]
    bs=8k
    iodepth=1024
    iodepth_batch=60
    randrepeat=1
    size=1m
    directory=/mnt/test
    numjobs=20
    [job1]
    ioengine=sync
    bs=1k
    direct=1
    rw=randread
    filename=file1:file2
    [job2]
    ioengine=libaio
    rw=randwrite
    direct=1
    filename=file1:file2
    [job3]
    bs=1k
    ioengine=posixaio
    rw=randwrite
    direct=1
    filename=file1:file2
    [job5]
    bs=1k
    ioengine=sync
    rw=randread
    filename=file1:file2
    [job7]
    ioengine=libaio
    rw=randwrite
    filename=file1:file2
    [job8]
    ioengine=posixaio
    rw=randwrite
    filename=file1:file2
    [job10]
    ioengine=mmap
    rw=randwrite
    bs=1k
    filename=file1:file2
    [job11]
    ioengine=mmap
    rw=randwrite
    direct=1
    filename=file1:file2

    Signed-off-by: Lukas Czerner
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Lukas Czerner
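
To make the "part of the page we want to invalidate" concrete, the sketch below maps an invalidated byte range within a page to the logical blocks that lie entirely inside it. It is a simplified model: the function name and default sizes are assumptions, and it ignores whether each buffer is actually delayed, which the real code also has to check.

```python
def blocks_in_range(page_index, offset, length, page_size=65536, block_size=4096):
    """Logical block numbers lying entirely inside [offset, offset + length) of a page."""
    per_page = page_size // block_size
    first_lblk = page_index * per_page     # logical block at the start of the page
    stop = offset + length
    covered = []
    for i in range(per_page):
        start = i * block_size
        if start + block_size > stop:      # block reaches past the invalidated range
            break
        if start >= offset:                # block begins inside the range
            covered.append(first_lblk + i)
    return covered


# Invalidate the third 4 KiB block of page 0 on a 64 KiB page system.
print(blocks_in_range(0, 8192, 4096))      # [2]
```

Starting from the beginning of the page instead, as the bug description says happened, would have touched block 0 here even though only block 2 was being invalidated.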
     
  • commit c45653c341f5c8a0ce19c8f0ad4678640849cb86 upstream.

    Switch ext4 to using sb_getblk_gfp with GFP_NOFS added to fix possible
    deadlocks in the page writeback path.

    Signed-off-by: Nikolay Borisov
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Nikolay Borisov
     
  • commit 0f0ff9a9f3fa2ec6f427603fd521d5f3a0b076d1 upstream.

    Commit 8f4d8558391: "ext4: fix lazytime optimization" was not a
    complete fix. In the case where the inode number is a multiple of 16,
    we could still end up with an inode's dirty timestamps written to the
    wrong inode on disk. Oops.

    This can be easily reproduced by using generic/005 with a file system
    with metadata_csum and lazytime enabled.

    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Theodore Ts'o
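
A fencepost error of this kind typically comes from mixing 1-based inode numbers with power-of-two rounding. The sketch below is an illustration under assumptions, not the patched code: it assumes 16 inodes per inode-table block (for example 256-byte inodes in a 4 KiB block), and both formulas are hypothetical reconstructions of how the first inode sharing a block might be computed.

```python
INODES_PER_BLOCK = 16   # assumed: 256-byte inodes in a 4 KiB block


def first_ino_incomplete(ino, n=INODES_PER_BLOCK):
    """Round the raw inode number down, then shift to 1-based numbering."""
    return (ino & ~(n - 1)) + 1


def first_ino_correct(ino, n=INODES_PER_BLOCK):
    """Convert to a 0-based index first, round down, then shift back."""
    return ((ino - 1) & ~(n - 1)) + 1


for ino in (15, 16, 17):
    print(ino, first_ino_incomplete(ino), first_ino_correct(ino))
```

Inodes 1 through 16 share the first block in this model, so inode 16 should map to a first inode of 1. The incomplete formula yields 17 for it and agrees with the correct one everywhere except at multiples of 16.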
     
  • commit a2fd66d069d86d793e9d39d4079b96f46d13f237 upstream.

    Newer versions of mount parse the lazytime option and pass it to the
    mount system call via its flags field, removing the lazytime string
    from the mount options list. So we need to check for the presence of
    MS_LAZYTIME and set it in sb->s_flags in order for this flag to be
    set on a remount.

    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Theodore Ts'o
     
  • commit 292db1bc6c105d86111e858859456bcb11f90f91 upstream.

    ext4 isn't willing to map clusters to a non-extent file. Don't signal
    this with an out of space error, since the FS will retry the
    allocation (which didn't fail) forever. Instead, return EUCLEAN so
    that the operation will fail immediately all the way back to userspace.

    (The fix is either to run e2fsck -E bmap2extent, or to chattr +e the file.)

    Signed-off-by: Darrick J. Wong
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Darrick J. Wong
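
Why the choice of error code matters can be seen in a toy retry loop. The sketch assumes a caller that retries for as long as it sees an out-of-space error; `allocate_with_retry`, `refuse_with` and the attempt cap are hypothetical and exist only so the demo terminates.

```python
import errno

# EUCLEAN is Linux-specific; fall back to its Linux value where it is missing.
EUCLEAN = getattr(errno, "EUCLEAN", 117)


def allocate_with_retry(try_alloc, max_attempts=1000):
    """Call try_alloc() until it stops reporting ENOSPC.

    max_attempts only bounds this demo; the commit describes a caller that
    keeps retrying for as long as it sees an out-of-space error.
    """
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        err = try_alloc()
        if err != errno.ENOSPC:
            return err, attempts      # success (0) or a hard error: stop here
    return errno.ENOSPC, attempts


def refuse_with(code):
    """Model an allocation that is always refused with the given error code."""
    return lambda: code


print(allocate_with_retry(refuse_with(errno.ENOSPC))[1])   # 1000: retried until the cap
print(allocate_with_retry(refuse_with(EUCLEAN))[1])        # 1: failed immediately
```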
     
  • commit 89d96a6f8e6491f24fc8f99fd6ae66820e85c6c1 upstream.

    Normally all of the buffers will have been forced out to disk before
    we call invalidate_bdev(), but there will be some cases, where a file
    system operation was aborted due to an ext4_error(), where there may
    still be some dirty buffers in the buffer cache for the device. So
    try to force them out to disk before calling invalidate_bdev().

    This fixes a warning triggered by generic/081:

    WARNING: CPU: 1 PID: 3473 at /usr/projects/linux/ext4/fs/block_dev.c:56 __blkdev_put+0xb5/0x16f()

    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Theodore Ts'o
     
  • commit bdf96838aea6a265f2ae6cbcfb12a778c84a0b8e upstream.

    The commit cf108bca465d: "ext4: Invert the locking order of page_lock
    and transaction start" caused __ext4_journalled_writepage() to drop
    the page lock before the page was written back, as part of changing
    the locking order to jbd2_journal_start -> page_lock. However, this
    introduced a potential race if there was a truncate racing with the
    data=journalled writeback mode.

    Fix this by grabbing the page lock after starting the journal handle,
    and then checking to see if page had gotten truncated out from under
    us.

    This fixes a number of different warnings or BUG_ON's when running
    xfstests generic/086 in data=journalled mode, including:

    jbd2_journal_dirty_metadata: vdc-8: bad jh for block 115643: transaction (ee3fe7
    c0, 164), jh->b_transaction ( (null), 0), jh->b_next_transaction ( (null), 0), jlist 0

    - and -

    kernel BUG at /usr/projects/linux/ext4/fs/jbd2/transaction.c:2200!
    ...
    Call Trace:
    [] ? __ext4_journalled_invalidatepage+0x117/0x117
    [] __ext4_journalled_invalidatepage+0x10f/0x117
    [] ? __ext4_journalled_invalidatepage+0x117/0x117
    [] ? lock_buffer+0x36/0x36
    [] ext4_journalled_invalidatepage+0xd/0x22
    [] do_invalidatepage+0x22/0x26
    [] truncate_inode_page+0x5b/0x85
    [] truncate_inode_pages_range+0x156/0x38c
    [] truncate_inode_pages+0x11/0x15
    [] truncate_pagecache+0x55/0x71
    [] ext4_setattr+0x4a9/0x560
    [] ? current_kernel_time+0x10/0x44
    [] notify_change+0x1c7/0x2be
    [] do_truncate+0x65/0x85
    [] ? file_ra_state_init+0x12/0x29

    - and -

    WARNING: CPU: 1 PID: 1331 at /usr/projects/linux/ext4/fs/jbd2/transaction.c:1396
    irty_metadata+0x14a/0x1ae()
    ...
    Call Trace:
    [] ? console_unlock+0x3a1/0x3ce
    [] dump_stack+0x48/0x60
    [] warn_slowpath_common+0x89/0xa0
    [] ? jbd2_journal_dirty_metadata+0x14a/0x1ae
    [] warn_slowpath_null+0x14/0x18
    [] jbd2_journal_dirty_metadata+0x14a/0x1ae
    [] __ext4_handle_dirty_metadata+0xd4/0x19d
    [] write_end_fn+0x40/0x53
    [] ext4_walk_page_buffers+0x4e/0x6a
    [] ext4_writepage+0x354/0x3b8
    [] ? mpage_release_unused_pages+0xd4/0xd4
    [] ? wait_on_buffer+0x2c/0x2c
    [] ? ext4_writepage+0x3b8/0x3b8
    [] __writepage+0x10/0x2e
    [] write_cache_pages+0x22d/0x32c
    [] ? ext4_writepage+0x3b8/0x3b8
    [] ext4_writepages+0x102/0x607
    [] ? sched_clock_local+0x10/0x10e
    [] ? __lock_is_held+0x2e/0x44
    [] ? lock_is_held+0x43/0x51
    [] do_writepages+0x1c/0x29
    [] __writeback_single_inode+0xc3/0x545
    [] writeback_sb_inodes+0x21f/0x36d
    ...

    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Theodore Ts'o
     

15 May, 2015

6 commits

  • The xfstests test suite assumes that an attempt to collapse range on
    the range (0, 1) will return EOPNOTSUPP if the file system does not
    support collapse range. Commit 280227a75b56: "ext4: move check under
    lock scope to close a race" broke this, which caused xfstests to
    fail when testing file systems that did not have the extents
    feature enabled.

    Reported-by: Eric Whitney
    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     
  • The following commit introduced a bug when checking for zero length extent

    5946d08 ext4: check for overlapping extents in ext4_valid_extent_entries()

    A zero-length extent could pass the check if lblock is zero.

    Add the explicit check for zero length back.

    Signed-off-by: Eryu Guan
    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Eryu Guan
     
  • Currently when journal restart fails, we'll have the h_transaction of
    the handle set to NULL to indicate that the handle has been effectively
    aborted. We handle this situation quietly in the jbd2_journal_stop() and just
    free the handle and exit because everything else has been done before we
    attempted (and failed) to restart the journal.

    Unfortunately there are a number of problems with that approach
    introduced with commit

    41a5b913197c "jbd2: invalidate handle if jbd2_journal_restart()
    fails"

    First of all in ext4 jbd2_journal_stop() will be called through
    __ext4_journal_stop() where we would try to get a hold of the superblock
    by dereferencing h_transaction which in this case would lead to NULL
    pointer dereference and crash.

    In addition we're going to free the handle regardless of the refcount
    which is bad as well, because others up the call chain will still
    reference the handle so we might potentially reference already freed
    memory.

    Moreover, it's expected that we'll get an aborted handle as well as a
    detached handle in some of the journalling functions as the error
    propagates up the stack, so it's unnecessary to call WARN_ON every time
    we get a detached handle.

    And finally we might leak some memory by forgetting to free reserved
    handle in jbd2_journal_stop() in the case where handle was detached from
    the transaction (h_transaction is NULL).

    Fix the NULL pointer dereference in __ext4_journal_stop() by just
    calling jbd2_journal_stop() quietly as suggested by Jan Kara. Also fix
    the potential memory leak in jbd2_journal_stop() and use proper
    handle refcounting before we attempt to free it to avoid use-after-free
    issues.

    And finally remove all WARN_ON(!transaction) from the code so that we do
    not get random traces when something goes wrong because when journal
    restart fails we will get to some of those functions.

    Cc: stable@vger.kernel.org
    Signed-off-by: Lukas Czerner
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara

    Lukas Czerner
     
  • The ext4_extent_tree_init() function has not been in the ext4 code for
    a long time, except as an unused function prototype in ext4.h.

    Google-Bug-Id: 4530137
    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     
  • Google-Bug-Id: 20939131
    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     
  • We had a fencepost error in the lazytime optimization which means that
    timestamp would get written to the wrong inode.

    Cc: stable@vger.kernel.org
    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     

04 May, 2015

1 commit

  • Pull ext4 fixes from Ted Ts'o:
    "Some miscellaneous bug fixes and some final on-disk and ABI changes
    for ext4 encryption which provide better security and performance"

    * tag 'for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: fix growing of tiny filesystems
    ext4: move check under lock scope to close a race.
    ext4: fix data corruption caused by unwritten and delayed extents
    ext4 crypto: remove duplicated encryption mode definitions
    ext4 crypto: do not select from EXT4_FS_ENCRYPTION
    ext4 crypto: add padding to filenames before encrypting
    ext4 crypto: simplify and speed up filename encryption

    Linus Torvalds
     

03 May, 2015

3 commits

  • The estimate of necessary transaction credits in ext4_flex_group_add()
    is too pessimistic. It reserves credit for sb, resize inode, and resize
    inode dindirect block for each group added in a flex group although they
    are always the same block and thus it is enough to account them only
    once. Also the number of modified GDT blocks is overestimated since we
    fit EXT4_DESC_PER_BLOCK(sb) descriptors in one block.

    Make the estimation more precise. That reduces the number of requested
    credits enough that we can grow a 20 MB filesystem (which has a 1 MB
    journal, 79 reserved GDT blocks, and flex group size 16 by default).

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Eric Sandeen

    Jan Kara
     
  • fallocate() checks that the file is extent-based and returns
    EOPNOTSUPP if it is not. Other tasks can convert the file between
    the indirect and extent formats, so the check is only safe after
    grabbing the inode mutex.

    Signed-off-by: Davide Italiano
    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Davide Italiano
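
The "check only after taking the lock" rule from the commit above can be sketched with an ordinary mutex. The `Inode` class and both functions are hypothetical stand-ins; the point is only that the format test and the work that depends on it happen under the same lock that a converting task must also take.

```python
import errno
import threading


class Inode:
    def __init__(self, extent_based):
        self.mutex = threading.Lock()     # plays the role of the inode mutex
        self.extent_based = extent_based  # may be flipped by a concurrent migration


def fallocate(inode, do_work):
    """Check the file format only once the lock is held, then do the work."""
    with inode.mutex:
        if not inode.extent_based:
            return -errno.EOPNOTSUPP
        do_work(inode)                    # format cannot change while we hold the mutex
        return 0


def migrate_to_indirect(inode):
    """A competing task converting the file; it takes the same lock."""
    with inode.mutex:
        inode.extent_based = False


ino = Inode(extent_based=True)
print(fallocate(ino, lambda i: None))     # 0
migrate_to_indirect(ino)
print(fallocate(ino, lambda i: None) == -errno.EOPNOTSUPP)   # True
```

Had the check been made before acquiring the mutex, a migration could slip in between the check and the work.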
     
  • Currently it is possible to lose a whole file system block worth of data
    when we hit a specific interaction between unwritten and delayed extents
    in the extent status tree.

    The problem is that when we insert a delayed extent into the extent status
    tree, the only way to get rid of it is to write out the delayed buffer.
    However, there is a limitation in the extent status tree implementation:
    when inserting an unwritten extent, should there be even a single
    delayed block, the whole unwritten extent is marked as delayed.

    At this point, there is no way to get rid of the delayed extents,
    because there are no delayed buffers to write out. So when we write
    into said unwritten extent we will convert it to written, but it still
    remains delayed.

    When we try to write into that block later, ext4_da_map_blocks() will set
    the buffer new and delayed and map it to an invalid block, which causes
    the rest of the block to be zeroed, losing already written data.

    For now we can fix this by simply not allowing the delayed status to be
    set on a written extent in the extent status tree. Also add WARN_ON() to
    make sure that we notice if this happens in the future.

    This problem can be easily reproduced by running the following xfs_io.

    xfs_io -f -c "pwrite -S 0xaa 4096 2048" \
    -c "falloc 0 131072" \
    -c "pwrite -S 0xbb 65536 2048" \
    -c "fsync" /mnt/test/fff

    echo 3 > /proc/sys/vm/drop_caches
    xfs_io -c "pwrite -S 0xdd 67584 2048" /mnt/test/fff

    This can theoretically also be reproduced at random by running fsx,
    but that is not very reliable, though on machines with a bigger page size
    (like ppc) it can be seen more often (especially xfstest generic/127).

    Signed-off-by: Lukas Czerner
    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Lukas Czerner
     

02 May, 2015

4 commits


27 Apr, 2015

1 commit

  • Pull fourth vfs update from Al Viro:
    "d_inode() annotations from David Howells (sat in for-next since before
    the beginning of merge window) + four assorted fixes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    RCU pathwalk breakage when running into a symlink overmounting something
    fix I_DIO_WAKEUP definition
    direct-io: only inc/dec inode->i_dio_count for file systems
    fs/9p: fix readdir()
    VFS: assorted d_backing_inode() annotations
    VFS: fs/inode.c helpers: d_inode() annotations
    VFS: fs/cachefiles: d_backing_inode() annotations
    VFS: fs library helpers: d_inode() annotations
    VFS: assorted weird filesystems: d_inode() annotations
    VFS: normal filesystems (and lustre): d_inode() annotations
    VFS: security/: d_inode() annotations
    VFS: security/: d_backing_inode() annotations
    VFS: net/: d_inode() annotations
    VFS: net/unix: d_backing_inode() annotations
    VFS: kernel/: d_inode() annotations
    VFS: audit: d_backing_inode() annotations
    VFS: Fix up some ->d_inode accesses in the chelsio driver
    VFS: Cachefiles should perform fs modifications on the top layer only
    VFS: AF_UNIX sockets should call mknod on the top layer only

    Linus Torvalds
     

25 Apr, 2015

1 commit

  • do_blockdev_direct_IO() increments and decrements the inode
    ->i_dio_count for each IO operation. It does this to protect against
    truncate of a file. Block devices don't need this sort of protection.

    For a capable multiqueue setup, this atomic int is the only shared
    state between applications accessing the device for O_DIRECT, and it
    presents a scaling wall for that. In my testing, as much as 30% of
    system time is spent incrementing and decrementing this value. A mixed
    read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
    better latencies too. Before:

    clat percentiles (usec):
    | 1.00th=[ 33], 5.00th=[ 34], 10.00th=[ 34], 20.00th=[ 34],
    | 30.00th=[ 34], 40.00th=[ 34], 50.00th=[ 35], 60.00th=[ 35],
    | 70.00th=[ 35], 80.00th=[ 35], 90.00th=[ 37], 95.00th=[ 80],
    | 99.00th=[ 98], 99.50th=[ 151], 99.90th=[ 155], 99.95th=[ 155],
    | 99.99th=[ 165]

    After:

    clat percentiles (usec):
    | 1.00th=[ 95], 5.00th=[ 108], 10.00th=[ 129], 20.00th=[ 149],
    | 30.00th=[ 155], 40.00th=[ 161], 50.00th=[ 167], 60.00th=[ 171],
    | 70.00th=[ 177], 80.00th=[ 185], 90.00th=[ 201], 95.00th=[ 270],
    | 99.00th=[ 390], 99.50th=[ 398], 99.90th=[ 418], 99.95th=[ 422],
    | 99.99th=[ 438]

    In other setups, Robert Elliott reported seeing good performance
    improvements:

    https://lkml.org/lkml/2015/4/3/557

    The more applications accessing the device, the worse it gets.

    Add a new direct-io flag, DIO_SKIP_DIO_COUNT, which tells
    do_blockdev_direct_IO() that it need not worry about incrementing
    or decrementing the inode i_dio_count for this caller.

    Cc: Andrew Morton
    Cc: Christoph Hellwig
    Cc: Theodore Ts'o
    Cc: Elliott, Robert (Server Storage)
    Cc: Al Viro
    Signed-off-by: Jens Axboe
    Signed-off-by: Al Viro

    Jens Axboe
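
The effect of such an opt-out flag can be sketched with a small model. Everything here is illustrative: the flag's bit value is chosen for the example only, and the `Inode` class with its `updates` tally is a hypothetical stand-in for the contended atomic counter.

```python
DIO_SKIP_DIO_COUNT = 0x08   # bit value chosen for this sketch only


class Inode:
    def __init__(self):
        self.i_dio_count = 0
        self.updates = 0          # how many times the shared counter was touched

    def dio_begin(self):
        self.i_dio_count += 1
        self.updates += 1

    def dio_end(self):
        self.i_dio_count -= 1
        self.updates += 1


def direct_io(inode, submit, flags=0):
    """Run one direct I/O, touching i_dio_count only when the caller needs it."""
    count = not (flags & DIO_SKIP_DIO_COUNT)
    if count:
        inode.dio_begin()
    try:
        return submit()
    finally:
        if count:
            inode.dio_end()


regular_file, block_device = Inode(), Inode()
direct_io(regular_file, lambda: "done")
direct_io(block_device, lambda: "done", flags=DIO_SKIP_DIO_COUNT)
print(regular_file.updates, block_device.updates)   # 2 0
```

A caller that passes the flag never writes to the shared counter, which is what removes the contention point between applications in this model.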
     

20 Apr, 2015

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "A few bug fixes and add support for file-system level encryption in
    ext4"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (31 commits)
    ext4 crypto: enable encryption feature flag
    ext4 crypto: add symlink encryption
    ext4 crypto: enable filename encryption
    ext4 crypto: filename encryption modifications
    ext4 crypto: partial update to namei.c for fname crypto
    ext4 crypto: insert encrypted filenames into a leaf directory block
    ext4 crypto: teach ext4_htree_store_dirent() to store decrypted filenames
    ext4 crypto: filename encryption facilities
    ext4 crypto: implement the ext4 decryption read path
    ext4 crypto: implement the ext4 encryption write path
    ext4 crypto: inherit encryption policies on inode and directory create
    ext4 crypto: enforce context consistency
    ext4 crypto: add encryption key management facilities
    ext4 crypto: add ext4 encryption facilities
    ext4 crypto: add encryption policy and password salt support
    ext4 crypto: add encryption xattr support
    ext4 crypto: export ext4_empty_dir()
    ext4 crypto: add ext4 encryption Kconfig
    ext4 crypto: reserve codepoints used by the ext4 encryption feature
    ext4 crypto: add ext4_mpage_readpages()
    ...

    Linus Torvalds
     

17 Apr, 2015

2 commits

  • Pull third hunk of vfs changes from Al Viro:
    "This contains the ->direct_IO() changes from Omar + saner
    generic_write_checks() + dealing with fcntl()/{read,write}() races
    (mirroring O_APPEND/O_DIRECT into iocb->ki_flags and instead of
    repeatedly looking at ->f_flags, which can be changed by fcntl(2),
    check ->ki_flags - which cannot) + infrastructure bits for dhowells'
    d_inode annotations + Christoph's switch of /dev/loop to
    vfs_iter_write()"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (30 commits)
    block: loop: switch to VFS ITER_BVEC
    configfs: Fix inconsistent use of file_inode() vs file->f_path.dentry->d_inode
    VFS: Make pathwalk use d_is_reg() rather than S_ISREG()
    VFS: Fix up debugfs to use d_is_dir() in place of S_ISDIR()
    VFS: Combine inode checks with d_is_negative() and d_is_positive() in pathwalk
    NFS: Don't use d_inode as a variable name
    VFS: Impose ordering on accesses of d_inode and d_flags
    VFS: Add owner-filesystem positive/negative dentry checks
    nfs: generic_write_checks() shouldn't be done on swapout...
    ocfs2: use __generic_file_write_iter()
    mirror O_APPEND and O_DIRECT into iocb->ki_flags
    switch generic_write_checks() to iocb and iter
    ocfs2: move generic_write_checks() before the alignment checks
    ocfs2_file_write_iter: stop messing with ppos
    udf_file_write_iter: reorder and simplify
    fuse: ->direct_IO() doesn't need generic_write_checks()
    ext4_file_write_iter: move generic_write_checks() up
    xfs_file_aio_write_checks: switch to iocb/iov_iter
    generic_write_checks(): drop isblk argument
    blkdev_write_iter: expand generic_file_checks() call in there
    ...

    Linus Torvalds
     
  • Pull quota and udf updates from Jan Kara:
    "The pull contains quota changes which complete unification of XFS and
    VFS quota interfaces (so tools can use either interface to manipulate
    any filesystem). There's also a patch to support project quotas in
    VFS quota subsystem from Li Xi.

    Finally there's a bunch of UDF fixes and cleanups and a tiny cleanup in
    reiserfs & ext3"

    * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (21 commits)
    udf: Update ctime and mtime when directory is modified
    udf: return correct errno for udf_update_inode()
    ext3: Remove useless condition in if statement.
    vfs: Add general support to enforce project quota limits
    reiserfs: fix __RASSERT format string
    udf: use int for allocated blocks instead of sector_t
    udf: remove redundant buffer_head.h includes
    udf: remove else after return in __load_block_bitmap()
    udf: remove unused variable in udf_table_free_blocks()
    quota: Fix maximum quota limit settings
    quota: reorder flags in quota state
    quota: paranoia: check quota tree root
    quota: optimize i_dquot access
    quota: Hook up Q_XSETQLIM for id 0 to ->set_info
    xfs: Add support for Q_SETINFO
    quota: Make ->set_info use structure with neccesary info to VFS and XFS
    quota: Remove ->get_xstate and ->get_xstatev callbacks
    gfs2: Convert to using ->get_state callback
    xfs: Convert to using ->get_state callback
    quota: Wire up Q_GETXSTATE and Q_GETXSTATV calls to work with ->get_state
    ...

    Linus Torvalds
     

16 Apr, 2015

5 commits

  • Also add the test dummy encryption mode flag so we can more easily
    test the encryption patches using xfstests.

    Signed-off-by: Michael Halcrow
    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     
  • Signed-off-by: Uday Savagaonkar
    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     
  • Merge second patchbomb from Andrew Morton:

    - the rest of MM

    - various misc bits

    - add ability to run /sbin/reboot at reboot time

    - printk/vsprintf changes

    - fiddle with seq_printf() return value

    * akpm: (114 commits)
    parisc: remove use of seq_printf return value
    lru_cache: remove use of seq_printf return value
    tracing: remove use of seq_printf return value
    cgroup: remove use of seq_printf return value
    proc: remove use of seq_printf return value
    s390: remove use of seq_printf return value
    cris fasttimer: remove use of seq_printf return value
    cris: remove use of seq_printf return value
    openrisc: remove use of seq_printf return value
    ARM: plat-pxa: remove use of seq_printf return value
    nios2: cpuinfo: remove use of seq_printf return value
    microblaze: mb: remove use of seq_printf return value
    ipc: remove use of seq_printf return value
    rtc: remove use of seq_printf return value
    power: wakeup: remove use of seq_printf return value
    x86: mtrr: if: remove use of seq_printf return value
    linux/bitmap.h: improve BITMAP_{LAST,FIRST}_WORD_MASK
    MAINTAINERS: CREDITS: remove Stefano Brivio from B43
    .mailmap: add Ricardo Ribalda
    CREDITS: add Ricardo Ribalda Delgado
    ...

    Linus Torvalds
     
  • The original dax patchset split the ext2/4_file_operations because of the
    two NULL splice_read/splice_write in the dax case.

    In the vfs if splice_read/splice_write are NULL we then call
    default_splice_read/write.

    What we do here is make generic_file_splice_read aware of IS_DAX() so the
    original ext2/4_file_operations can be used as is.

    For write it appears that iter_file_splice_write is just fine. It uses
    the regular f_op->write(file,..) or new_sync_write(file, ...).
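
    The dispatch described above can be sketched as a userspace model. Everything here is hypothetical (the model_* types and functions are not kernel code), and the sketch assumes that "aware of IS_DAX()" means the generic read helper hands DAX inodes to the default path instead of going through the page cache; the commit message does not spell that out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Which path served the splice read in this model. */
enum model_path { MODEL_PATH_DEFAULT, MODEL_PATH_PAGE_CACHE };

struct model_inode {
    bool is_dax; /* stands in for the IS_DAX() check */
};

struct model_fops {
    /* NULL means "no filesystem-specific splice_read". */
    enum model_path (*splice_read)(const struct model_inode *inode);
};

/* Fallback used when no splice_read method is provided. */
enum model_path model_default_splice_read(const struct model_inode *inode)
{
    (void)inode;
    return MODEL_PATH_DEFAULT;
}

/* Generic helper made DAX-aware: DAX inodes skip the page-cache path
 * (assumed behavior, see the lead-in). */
enum model_path model_generic_splice_read(const struct model_inode *inode)
{
    assert(inode != NULL);
    if (inode->is_dax)
        return model_default_splice_read(inode);
    return MODEL_PATH_PAGE_CACHE;
}

/* VFS-style dispatch: use the method if set, else the default. */
enum model_path model_do_splice_read(const struct model_fops *fops,
                                     const struct model_inode *inode)
{
    if (fops->splice_read != NULL)
        return fops->splice_read(inode);
    return model_default_splice_read(inode);
}
```

    With the check inside the generic helper, one model_fops table whose splice_read points at model_generic_splice_read serves both DAX and non-DAX inodes, so no separate table with a NULL method is needed.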

    Signed-off-by: Boaz Harrosh
    Reviewed-by: Jan Kara
    Cc: Dave Chinner
    Cc: Matthew Wilcox
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Boaz Harrosh
     
  • From: Yigal Korman

    [v1]
    Without this patch, c/mtime is not updated correctly when an mmap'ed page
    is first read from and then written to.

    A new xfstest is submitted for testing this (generic/080)

    [v2]
    Jan Kara pointed out that adding the sb_start/end_pagefault
    pair to the new pfn_mkwrite also fixes another bug: a user
    could start writing to the page while the filesystem is
    frozen.
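
    The bracketing described in [v2] can be illustrated with a userspace model. All names below (model_sb, model_inode, model_pfn_mkwrite and the start/end helpers) are hypothetical. One deliberate simplification: the model refuses the fault while frozen and asks the caller to retry, whereas a real freeze-protection helper would be expected to wait for the thaw.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical superblock model: frozen state plus a fault counter. */
struct model_sb {
    bool frozen;
    int faults_in_flight;
};

struct model_inode {
    struct model_sb *sb;
    long ctime;
    long mtime;
};

enum model_fault { MODEL_FAULT_DONE, MODEL_FAULT_RETRY };

/* Enter the protected section; fails (instead of sleeping) if frozen. */
bool model_start_pagefault(struct model_sb *sb)
{
    if (sb->frozen)
        return false;
    sb->faults_in_flight++;
    return true;
}

/* Leave the protected section. */
void model_end_pagefault(struct model_sb *sb)
{
    assert(sb->faults_in_flight > 0);
    sb->faults_in_flight--;
}

/* Write fault on an already-mapped page: update the timestamps, but
 * only inside the start/end bracket so a frozen filesystem is never
 * modified. */
enum model_fault model_pfn_mkwrite(struct model_inode *inode, long now)
{
    if (!model_start_pagefault(inode->sb))
        return MODEL_FAULT_RETRY;
    inode->ctime = now;
    inode->mtime = now;
    model_end_pagefault(inode->sb);
    return MODEL_FAULT_DONE;
}
```

    The model covers both points from the message: the timestamps change on the first write after a read fault, and nothing is modified while the frozen flag is set.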

    Signed-off-by: Yigal Korman
    Signed-off-by: Boaz Harrosh
    Reviewed-by: Jan Kara
    Cc: Matthew Wilcox
    Cc: Dave Chinner
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Kirill A. Shutemov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Boaz Harrosh