08 Dec, 2015

1 commit

  • Pull ext4 fixes from Ted Ts'o:
    "Ext4 bug fixes for v4.4, including fixes for post-2038 time encodings,
    some endian conversion problems with ext4 encryption, potential memory
    leaks after truncate in data=journal mode, and an ocfs2 regression
    caused by a jbd2 performance improvement"

    * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    jbd2: fix null committed data return in undo_access
    ext4: add "static" to ext4_seq_##name##_fops struct
    ext4: fix an endianness bug in ext4_encrypted_follow_link()
    ext4: fix an endianness bug in ext4_encrypted_zeroout()
    jbd2: Fix unreclaimed pages after truncate in data=journal mode
    ext4: Fix handling of extended tv_sec

    Linus Torvalds
     

05 Dec, 2015

1 commit

  • Commit de92c8caf16c ("jbd2: speedup jbd2_journal_get_[write|undo]_access()")
    introduced jbd2_write_access_granted() to speed up write/undo access,
    but failed to check the status of b_committed_data, which caused a
    kernel panic on ocfs2.

    [ 6538.405938] ------------[ cut here ]------------
    [ 6538.406686] kernel BUG at fs/ocfs2/suballoc.c:2400!
    [ 6538.406686] invalid opcode: 0000 [#1] SMP
    [ 6538.406686] Modules linked in: ocfs2 nfsd lockd grace nfs_acl auth_rpcgss sunrpc autofs4 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sd_mod sg ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev xen_kbdfront xen_netfront xen_fbfront parport_pc parport pcspkr i2c_piix4 acpi_cpufreq ext4 jbd2 mbcache xen_blkfront floppy pata_acpi ata_generic ata_piix cirrus ttm drm_kms_helper drm fb_sys_fops sysimgblt sysfillrect i2c_core syscopyarea dm_mirror dm_region_hash dm_log dm_mod
    [ 6538.406686] CPU: 1 PID: 16265 Comm: mmap_truncate Not tainted 4.3.0 #1
    [ 6538.406686] Hardware name: Xen HVM domU, BIOS 4.3.1OVM 05/14/2014
    [ 6538.406686] task: ffff88007c2bab00 ti: ffff880075b78000 task.ti: ffff880075b78000
    [ 6538.406686] RIP: 0010:[] [] ocfs2_block_group_clear_bits+0x23b/0x250 [ocfs2]
    [ 6538.406686] RSP: 0018:ffff880075b7b7f8 EFLAGS: 00010246
    [ 6538.406686] RAX: ffff8800760c5b40 RBX: ffff88006c06a000 RCX: ffffffffa06e6df0
    [ 6538.406686] RDX: 0000000000000000 RSI: ffff88007a6f6ea0 RDI: ffff88007a760430
    [ 6538.406686] RBP: ffff880075b7b878 R08: 0000000000000002 R09: 0000000000000001
    [ 6538.406686] R10: ffffffffa06769be R11: 0000000000000000 R12: 0000000000000001
    [ 6538.406686] R13: ffffffffa06a1750 R14: 0000000000000001 R15: ffff88007a6f6ea0
    [ 6538.406686] FS: 00007f17fde30720(0000) GS:ffff88007f040000(0000) knlGS:0000000000000000
    [ 6538.406686] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 6538.406686] CR2: 0000000000601730 CR3: 000000007aea0000 CR4: 00000000000406e0
    [ 6538.406686] Stack:
    [ 6538.406686] ffff88007c2bb5b0 ffff880075b7b8e0 ffff88007a7604b0 ffff88006c640800
    [ 6538.406686] ffff88007a7604b0 ffff880075d77390 0000000075b7b878 ffffffffa06a309d
    [ 6538.406686] ffff880075d752d8 ffff880075b7b990 ffff880075b7b898 0000000000000000
    [ 6538.406686] Call Trace:
    [ 6538.406686] [] ? ocfs2_read_group_descriptor+0x6d/0xa0 [ocfs2]
    [ 6538.406686] [] _ocfs2_free_suballoc_bits+0xe4/0x320 [ocfs2]
    [ 6538.406686] [] ? ocfs2_put_slot+0xf0/0xf0 [ocfs2]
    [ 6538.406686] [] _ocfs2_free_clusters+0xee/0x210 [ocfs2]
    [ 6538.406686] [] ? ocfs2_put_slot+0xf0/0xf0 [ocfs2]
    [ 6538.406686] [] ? ocfs2_put_slot+0xf0/0xf0 [ocfs2]
    [ 6538.406686] [] ? ocfs2_extend_trans+0x50/0x1a0 [ocfs2]
    [ 6538.406686] [] ocfs2_free_clusters+0x15/0x20 [ocfs2]
    [ 6538.406686] [] ocfs2_replay_truncate_records+0xfc/0x290 [ocfs2]
    [ 6538.406686] [] ? ocfs2_start_trans+0xec/0x1d0 [ocfs2]
    [ 6538.406686] [] __ocfs2_flush_truncate_log+0x140/0x2d0 [ocfs2]
    [ 6538.406686] [] ? ocfs2_reserve_blocks_for_rec_trunc.clone.0+0x44/0x170 [ocfs2]
    [ 6538.406686] [] ocfs2_remove_btree_range+0x374/0x630 [ocfs2]
    [ 6538.406686] [] ? jbd2_journal_stop+0x25b/0x470 [jbd2]
    [ 6538.406686] [] ocfs2_commit_truncate+0x305/0x670 [ocfs2]
    [ 6538.406686] [] ? ocfs2_journal_access_eb+0x20/0x20 [ocfs2]
    [ 6538.406686] [] ocfs2_truncate_file+0x297/0x380 [ocfs2]
    [ 6538.406686] [] ? jbd2_journal_begin_ordered_truncate+0x64/0xc0 [jbd2]
    [ 6538.406686] [] ocfs2_setattr+0x572/0x860 [ocfs2]
    [ 6538.406686] [] ? current_fs_time+0x3f/0x50
    [ 6538.406686] [] notify_change+0x1d7/0x340
    [ 6538.406686] [] ? generic_getxattr+0x79/0x80
    [ 6538.406686] [] do_truncate+0x66/0x90
    [ 6538.406686] [] ? __audit_syscall_entry+0xb0/0x110
    [ 6538.406686] [] do_sys_ftruncate.clone.0+0xf3/0x120
    [ 6538.406686] [] SyS_ftruncate+0xe/0x10
    [ 6538.406686] [] entry_SYSCALL_64_fastpath+0x12/0x71
    [ 6538.406686] Code: 28 48 81 ee b0 04 00 00 48 8b 92 50 fb ff ff 48 8b 80 b0 03 00 00 48 39 90 88 00 00 00 0f 84 30 fe ff ff 0f 0b eb fe 0f 0b eb fe 0b 0f 1f 00 eb fb 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00
    [ 6538.406686] RIP [] ocfs2_block_group_clear_bits+0x23b/0x250 [ocfs2]
    [ 6538.406686] RSP
    [ 6538.691128] ---[ end trace 31cd7011d6770d7e ]---
    [ 6538.694492] Kernel panic - not syncing: Fatal exception
    [ 6538.695484] Kernel Offset: disabled
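
    The missing check can be illustrated with a small userspace model. This
    is only a sketch: the sketch_* structures and function are hypothetical
    stand-ins, and the real fast path checks more state than is shown here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the jbd2 structures. */
struct sketch_transaction {
	int tid;
};

struct sketch_journal_head {
	struct sketch_transaction *b_transaction; /* owning transaction */
	void *b_committed_data;                   /* copy needed for undo access */
};

/*
 * Return true only when the caller may skip the slow path. For undo
 * access the committed-data copy must already exist.
 */
bool sketch_write_access_granted(const struct sketch_journal_head *jh,
				 const struct sketch_transaction *running,
				 bool undo)
{
	if (jh == NULL || jh->b_transaction != running)
		return false;
	if (undo && jh->b_committed_data == NULL)
		return false; /* the check that was missing */
	return true;
}
```

    With this model, a buffer that belongs to the running transaction but
    has no committed-data copy is granted write access, while undo access
    falls back to the slow path.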

    Fixes: de92c8caf16c("jbd2: speedup jbd2_journal_get_[write|undo]_access()")
    Cc:
    Signed-off-by: Junxiao Bi
    Signed-off-by: Theodore Ts'o

    Junxiao Bi
     

25 Nov, 2015

1 commit

  • Ted and Namjae have reported that truncated pages don't get timely
    reclaimed after being truncated in data=journal mode. The following test
    triggers the issue easily:

    for (i = 0; i < 1000; i++) {
            pwrite(fd, buf, 1024*1024, 0);
            fsync(fd);
            fsync(fd);
            ftruncate(fd, 0);
    }

    The reason is that journal_unmap_buffer() finds that truncated buffers
    are not journalled (jh->b_transaction == NULL), they are part of
    checkpoint list of a transaction (jh->b_cp_transaction != NULL) and have
    been already written out (!buffer_dirty(bh)). We clean such buffers but
    we leave them in the checkpoint list. Since checkpoint transaction holds
    a reference to the journal head, these buffers cannot be released until
    the checkpoint transaction is cleaned up. And at that point we don't
    call release_buffer_page() anymore so pages detached from mapping are
    lingering in the system waiting for reclaim to find them and free them.

    Fix the problem by removing buffers from transaction checkpoint lists
    when journal_unmap_buffer() finds out they don't have to be there
    anymore.
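
    The decision described above can be modelled in userspace roughly as
    follows. This is a sketch under stated assumptions: the sketch_* names
    are hypothetical, and the "dirty" field merely stands in for
    buffer_dirty(bh).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the jbd2 structures. */
struct sketch_txn {
	int tid;
};

struct sketch_jh {
	struct sketch_txn *b_transaction;    /* NULL: not journalled */
	struct sketch_txn *b_cp_transaction; /* non-NULL: on a checkpoint list */
	bool dirty;                          /* models buffer_dirty(bh) */
};

/*
 * Model of the truncate path: a buffer that is not journalled, sits on a
 * checkpoint list and has already been written out is removed from that
 * list so that its page can be released. Returns true if it was removed.
 */
bool sketch_unmap_clean_buffer(struct sketch_jh *jh)
{
	if (jh->b_transaction != NULL)
		return false; /* still journalled */
	if (jh->b_cp_transaction == NULL)
		return false; /* not on any checkpoint list */
	if (jh->dirty)
		return false; /* not written out yet */
	jh->b_cp_transaction = NULL; /* drop it from the checkpoint list */
	return true;
}
```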

    Reported-and-tested-by: Namjae Jeon
    Fixes: de1b794130b130e77ffa975bb58cb843744f9ae5
    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Jan Kara
     

08 Nov, 2015

1 commit

  • Merge second patch-bomb from Andrew Morton:

    - most of the rest of MM

    - procfs

    - lib/ updates

    - printk updates

    - bitops infrastructure tweaks

    - checkpatch updates

    - nilfs2 update

    - signals

    - various other misc bits: coredump, seqfile, kexec, pidns, zlib, ipc,
    dma-debug, dma-mapping, ...

    * emailed patches from Andrew Morton : (102 commits)
    ipc,msg: drop dst nil validation in copy_msg
    include/linux/zutil.h: fix usage example of zlib_adler32()
    panic: release stale console lock to always get the logbuf printed out
    dma-debug: check nents in dma_sync_sg*
    dma-mapping: tidy up dma_parms default handling
    pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode
    kexec: use file name as the output message prefix
    fs, seqfile: always allow oom killer
    seq_file: reuse string_escape_str()
    fs/seq_file: use seq_* helpers in seq_hex_dump()
    coredump: change zap_threads() and zap_process() to use for_each_thread()
    coredump: ensure all coredumping tasks have SIGNAL_GROUP_COREDUMP
    signal: remove jffs2_garbage_collect_thread()->allow_signal(SIGCONT)
    signal: introduce kernel_signal_stop() to fix jffs2_garbage_collect_thread()
    signal: turn dequeue_signal_lock() into kernel_dequeue_signal()
    signals: kill block_all_signals() and unblock_all_signals()
    nilfs2: fix gcc uninitialized-variable warnings in powerpc build
    nilfs2: fix gcc unused-but-set-variable warnings
    MAINTAINERS: nilfs2: add header file for tracing
    nilfs2: add tracepoints for analyzing reading and writing metadata files
    ...

    Linus Torvalds
     

07 Nov, 2015

1 commit

  • …d avoiding waking kswapd

    __GFP_WAIT has been used to identify atomic context in callers that hold
    spinlocks or are in interrupts. They are expected to be high priority and
    have access to one of two watermarks lower than "min", which can be
    referred to as the "atomic reserve". __GFP_HIGH users get access to the
    first lower watermark, which can be called the "high priority reserve".

    Over time, callers had a requirement to not block when fallback options
    were available. Some have abused __GFP_WAIT, leading to a situation where
    an optimistic allocation with a fallback option can access atomic
    reserves.

    This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
    cannot sleep and have no alternative. High priority users continue to use
    __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
    are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies
    callers that want to wake kswapd for background reclaim. __GFP_WAIT is
    redefined as a caller that is willing to enter direct reclaim and wake
    kswapd for background reclaim.

    This patch then converts a number of sites:

    o __GFP_ATOMIC is used by callers that are high priority and have memory
    pools for those requests. GFP_ATOMIC uses this flag.

    o Callers that have a limited mempool to guarantee forward progress clear
    __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
    into this category where kswapd will still be woken but atomic reserves
    are not used as there is a one-entry mempool to guarantee progress.

    o Callers that are checking if they are non-blocking should use the
    helper gfpflags_allow_blocking() where possible. This is because
    checking for __GFP_WAIT as was done historically now can trigger false
    positives. Some exceptions like dm-crypt.c exist where the code intent
    is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
    flag manipulations.

    o Callers that built their own GFP flags instead of starting with GFP_KERNEL
    and friends now also need to specify __GFP_KSWAPD_RECLAIM.

    The first key hazard to watch out for is callers that removed __GFP_WAIT
    and were depending on access to atomic reserves for inconspicuous reasons.
    In some cases it may be appropriate for them to use __GFP_HIGH.

    The second key hazard is callers that assembled their own combination of
    GFP flags instead of starting with something like GFP_KERNEL. They may
    now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
    if it's missed in most cases as other activity will wake kswapd.
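
    The flag semantics described above can be sketched in userspace as
    follows. The SK_* bit values and sk_* helpers are illustrative
    assumptions only; the kernel's real values differ, and the real helper
    is gfpflags_allow_blocking().

```c
#include <stdbool.h>

/* Illustrative bit values only; they are not the kernel's. */
#define SK_GFP_HIGH            0x01u
#define SK_GFP_ATOMIC          0x02u
#define SK_GFP_DIRECT_RECLAIM  0x04u
#define SK_GFP_KSWAPD_RECLAIM  0x08u

/* __GFP_WAIT as redefined: direct reclaim plus waking kswapd. */
#define SK_GFP_WAIT \
	(SK_GFP_DIRECT_RECLAIM | SK_GFP_KSWAPD_RECLAIM)

/* A GFP_ATOMIC-style mask: high priority, atomic, may wake kswapd. */
#define SK_GFP_ATOMIC_MASK \
	(SK_GFP_HIGH | SK_GFP_ATOMIC | SK_GFP_KSWAPD_RECLAIM)

/* Model of the blocking test: only direct reclaim implies sleeping. */
bool sk_allow_blocking(unsigned int gfp)
{
	return (gfp & SK_GFP_DIRECT_RECLAIM) != 0;
}

/* True when the allocation may wake kswapd for background reclaim. */
bool sk_wakes_kswapd(unsigned int gfp)
{
	return (gfp & SK_GFP_KSWAPD_RECLAIM) != 0;
}

/*
 * Mempool-backed callers (the bio case) clear direct reclaim but keep
 * the kswapd wakeup, so they never sleep in the allocator.
 */
unsigned int sk_mempool_mask(unsigned int gfp)
{
	return gfp & ~SK_GFP_DIRECT_RECLAIM;
}
```

    In this model a mempool-style mask derived from SK_GFP_WAIT still wakes
    kswapd but no longer reports that it may block, which is the false
    positive that a plain test for the old __GFP_WAIT would now hit.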

    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Vitaly Wool <vitalywool@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     

19 Oct, 2015

1 commit

  • If an EXT4 filesystem uses JBD2 journaling and an error occurs, the
    journal is aborted first, the error number is recorded in the JBD2
    superblock and, finally, the system panics if the "errors=panic"
    option is set. In rare cases, however, this sequence is reordered as
    shown in the figure below, and the system panics (which on mobile
    devices means a system reset) before the error has been recorded in
    the journal superblock. In this case, e2fsck cannot tell that a
    filesystem failure occurred in the previous run, and the corruption
    is not fixed.

    Task A                        Task B
    ext4_handle_error()
    -> jbd2_journal_abort()
      -> __journal_abort_soft()
        -> __jbd2_journal_abort_hard()
        | -> journal->j_flags |= JBD2_ABORT;
        |
        |                         __ext4_abort()
        |                         -> jbd2_journal_abort()
        |                         | -> __journal_abort_soft()
        |                         |   -> if (journal->j_flags & JBD2_ABORT)
        |                         |           return;
        |                         -> panic()
        |
        -> jbd2_journal_update_sb_errno()

    Tested-by: Hobin Woo
    Signed-off-by: Daeho Jeong
    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Daeho Jeong
     

18 Oct, 2015

3 commits

  • Contrary to the comments and the expectations of callers,
    journal_clean_one_cp_list() returned 1 not only if it freed the
    transaction but also if it freed some buffers in the transaction. That
    could make __jbd2_journal_clean_checkpoint_list() skip processing
    t_checkpoint_io_list and continue with processing the next transaction.
    This is mostly a cosmetic issue since the only result is that we can
    sometimes free less memory than we could. But it's still worth fixing.
    Fix journal_clean_one_cp_list() to return 1 only if the transaction was
    really freed.
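
    The intended return convention can be modelled as below. This is a
    heavily simplified sketch: the sk_* names are hypothetical, a single
    fixed-size list stands in for the checkpoint list, and "transaction
    freed" is assumed to mean that no buffer is left on that list.

```c
#include <stdbool.h>
#include <stddef.h>

#define SK_CP_MAX 8

/* Hypothetical model of one checkpoint list. */
struct sk_cp_list {
	bool busy[SK_CP_MAX]; /* busy[i]: buffer i cannot be released yet */
	size_t count;         /* buffers still on the list */
};

/*
 * Release buffers from the head of the list until a busy one is found.
 * Returns 1 only when the list became empty (the transaction is gone),
 * and 0 otherwise, even if some buffers were released.
 */
int sk_clean_one_cp_list(struct sk_cp_list *list)
{
	size_t released = 0;
	size_t i;

	while (released < list->count && !list->busy[released])
		released++;

	/* Keep the remaining buffers, shifted to the front. */
	for (i = released; i < list->count; i++)
		list->busy[i - released] = list->busy[i];
	list->count -= released;

	return list->count == 0;
}
```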

    Fixes: 50849db32a9f529235a84bcc84a6b8e631b1d0ec
    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Jan Kara
     
  • Create separate predicate functions to test/set/clear feature flags,
    thereby replacing the wordy old macros. Furthermore, clean out the
    places where we open-coded feature tests.
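
    One way such predicate functions can be generated is with a small
    macro, sketched here in userspace. The sk_* names, the struct and the
    flag values are hypothetical and chosen only for illustration; they are
    not the identifiers used by the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical superblock with a single feature word. */
struct sk_sb {
	uint32_t s_feature_compat;
};

/* Hypothetical feature bits. */
#define SK_FEATURE_ALPHA 0x0001u
#define SK_FEATURE_BETA  0x0002u

/* Generate has/set/clear helpers for one feature flag. */
#define SK_FEATURE_FUNCS(name, flag)                            \
bool sk_has_feature_##name(const struct sk_sb *sb)              \
{                                                               \
	return (sb->s_feature_compat & (flag)) != 0;            \
}                                                               \
void sk_set_feature_##name(struct sk_sb *sb)                    \
{                                                               \
	sb->s_feature_compat |= (flag);                         \
}                                                               \
void sk_clear_feature_##name(struct sk_sb *sb)                  \
{                                                               \
	sb->s_feature_compat &= ~(uint32_t)(flag);              \
}

SK_FEATURE_FUNCS(alpha, SK_FEATURE_ALPHA)
SK_FEATURE_FUNCS(beta, SK_FEATURE_BETA)
```

    Callers then write sk_has_feature_alpha(sb) instead of open-coding the
    mask test, which is the readability gain this change is after.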

    Signed-off-by: Darrick J. Wong
    Signed-off-by: Theodore Ts'o

    Darrick J. Wong
     
  • Instead of overloading EIO for CRC errors and corrupt structures,
    return the same error codes that XFS returns for the same issues.
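
    As a rough illustration, XFS aliases EFSCORRUPTED to EUCLEAN and
    EFSBADCRC to EBADMSG. The sketch below maps check results to those
    codes; the SK_* names and the sk_check enum are hypothetical, and the
    EUCLEAN fallback assumes the generic Linux errno value.

```c
#include <errno.h>

/* EUCLEAN is Linux-specific; fall back to the generic Linux value. */
#ifndef EUCLEAN
#define EUCLEAN 117
#endif

/* XFS-style aliases (prefixed to mark them as part of this sketch). */
#define SK_EFSCORRUPTED EUCLEAN
#define SK_EFSBADCRC    EBADMSG

/* Hypothetical outcome of validating an on-disk structure. */
enum sk_check {
	SK_CHECK_OK,
	SK_CHECK_BAD_CRC,
	SK_CHECK_BAD_STRUCT,
	SK_CHECK_IO_FAILURE
};

/* Map a validation outcome to a negative errno value. */
int sk_check_to_errno(enum sk_check result)
{
	switch (result) {
	case SK_CHECK_OK:
		return 0;
	case SK_CHECK_BAD_CRC:
		return -SK_EFSBADCRC;    /* checksum mismatch */
	case SK_CHECK_BAD_STRUCT:
		return -SK_EFSCORRUPTED; /* structure fails sanity checks */
	case SK_CHECK_IO_FAILURE:
	default:
		return -EIO;             /* genuine I/O errors keep EIO */
	}
}
```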

    Signed-off-by: Darrick J. Wong
    Signed-off-by: Theodore Ts'o

    Darrick J. Wong
     

15 Oct, 2015

1 commit

  • Change the journal's checksum functions to gate on whether or not the
    crc32c driver is loaded, and gate the loading on the superblock bits.
    This prevents a journal crash if someone loads a journal in no-csum
    mode and then randomizes the superblock, thus flipping on the feature
    bits.

    Tested-By: Nikolay Borisov
    Reported-by: Nikolay Borisov
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Theodore Ts'o

    Darrick J. Wong
     

04 Aug, 2015

1 commit

  • Currently there is no limitation on number of reserved credits we can
    ask for. If we ask for more reserved credits than 1/2 of maximum
    transaction size, or if total number of credits exceeds the maximum
    transaction size per operation (which is currently only possible with
    the former) we will spin forever in start_this_handle().

    Fix this by adding this limitation at the start of start_this_handle().

    This patch also removes the credit limitation 1/2 of maximum transaction
    size, since we really only want to limit the number of reserved credits.
    There is not much point to limit the credits if there is still space in
    the journal.
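
    The resulting limits can be sketched as a standalone check. The
    function name, the parameter names and the choice of -ENOSPC as the
    error code are assumptions made for this illustration.

```c
#include <errno.h>

/*
 * Reject a request up front instead of spinning: reserved credits may
 * not exceed half of the maximum transaction size, and regular plus
 * reserved credits may not exceed the maximum transaction size.
 */
int sk_check_credits(int blocks, int rsv_blocks, int max_transaction_buffers)
{
	if (rsv_blocks > max_transaction_buffers / 2)
		return -ENOSPC;
	if (blocks + rsv_blocks > max_transaction_buffers)
		return -ENOSPC;
	return 0;
}
```

    Note that regular credits alone are only bounded by the maximum
    transaction size in this model, matching the removal of the 1/2 limit
    for them.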

    This accidentally also fixes the online resize, where due to the
    limitation of the journal credits we're unable to grow file systems with
    1k block size and size between 16M and 32M. It has been partially fixed
    by 2c869b262a10ca99cb866d04087d75311587a30c, but not entirely.

    Thanks to Jan Kara for helping me get the correct fix.

    Signed-off-by: Lukas Czerner
    Reviewed-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Lukas Czerner
     

29 Jul, 2015

1 commit

  • Commit 6f6a6fda2945 "jbd2: fix ocfs2 corrupt when updating journal
    superblock fails" changed jbd2_cleanup_journal_tail() to return EIO
    when the journal is aborted. That makes the logic in
    jbd2_log_do_checkpoint() bail out, which is fine, except that
    jbd2_journal_destroy() expects jbd2_log_do_checkpoint() to always make
    progress in cleaning the journal. Without it, jbd2_journal_destroy()
    loops forever.

    Fix jbd2_journal_destroy() to clean up the journal checkpoint lists if
    jbd2_log_do_checkpoint() fails with an error.
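
    The shape of the fix can be modelled in userspace as a drain loop that
    bails out on error. Everything named sk_* here is hypothetical, and the
    journal is reduced to a counter of transactions awaiting checkpoint.

```c
#include <errno.h>

/* Hypothetical, heavily simplified journal state. */
struct sk_journal {
	int checkpoint_txns; /* transactions still awaiting checkpoint */
	int aborted;         /* non-zero once the journal is aborted */
};

/* Checkpoint one transaction; fails without progress when aborted. */
int sk_log_do_checkpoint(struct sk_journal *j)
{
	if (j->aborted)
		return -EIO;
	if (j->checkpoint_txns > 0)
		j->checkpoint_txns--;
	return 0;
}

/* Forcibly drop whatever is left on the checkpoint lists. */
void sk_destroy_checkpoint(struct sk_journal *j)
{
	j->checkpoint_txns = 0;
}

/*
 * Drain the checkpoint lists at destroy time. On error the lists are
 * cleaned up directly instead of retrying forever. Returns the number
 * of loop iterations performed.
 */
int sk_journal_destroy_drain(struct sk_journal *j)
{
	int iterations = 0;

	while (j->checkpoint_txns > 0) {
		iterations++;
		if (sk_log_do_checkpoint(j) != 0) {
			sk_destroy_checkpoint(j);
			break;
		}
	}
	return iterations;
}
```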

    Reported-by: Eryu Guan
    Tested-by: Eryu Guan
    Fixes: 6f6a6fda294506dfe0e3e0a253bb2d2923f28f0a
    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     

23 Jul, 2015

1 commit

  • When an error condition is detected, an error status should be recorded in
    the superblocks of EXT4 or JBD2. However, the write request is currently
    submitted without the REQ_FUA flag, even in "barrier=1" mode, and is
    followed by panic() in "errors=panic" mode. On mobile devices that
    reset the whole system as soon as a kernel panic occurs, the write
    request carrying the error flag is lost from the storage cache before it
    is written to the physical cells. As a result, the error flag never
    appears in either superblock on subsequent boots, and e2fsck cannot fix
    the filesystem problems automatically unless it is run in forced
    checking mode.

    [ Changed to use test_opt(sb, BARRIER) instead of checking the journal flags -- TYT ]

    Signed-off-by: Daeho Jeong
    Signed-off-by: Theodore Ts'o

    Daeho Jeong
     

13 Jul, 2015

1 commit

  • It is often the case that we mark buffer as having dirty metadata when
    the buffer is already in that state (frequent for bitmaps, inode table
    blocks, superblock). Thus it is unnecessary to contend on grabbing
    journal head reference and bh_state lock. Avoid that by checking whether
    any modification to the buffer is needed before grabbing any locks or
    references.
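
    The idea of checking before locking can be sketched as follows. This is
    a userspace model only: the sk_* names are hypothetical, lock_count
    merely counts how often the slow path would have taken the lock, and
    the exact condition used by the real code may differ.

```c
#include <stddef.h>

/* Hypothetical, simplified journal head. */
struct sk_dirty_jh {
	const void *b_transaction; /* transaction the buffer belongs to */
	int b_modified;            /* already dirtied in that transaction */
	int lock_count;            /* how often the "lock" was taken */
};

/*
 * Mark a buffer as dirty metadata. If it is already dirty in the given
 * transaction, return before touching any lock or reference.
 */
void sk_dirty_metadata(struct sk_dirty_jh *jh, const void *transaction)
{
	if (jh->b_transaction == transaction && jh->b_modified)
		return; /* fast path: nothing to change */

	jh->lock_count++; /* models grabbing the bh_state lock */
	jh->b_transaction = transaction;
	jh->b_modified = 1;
}
```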

    [ Note: this is a fixed version of commit 2143c1965a761, which was
    reverted in ebeaa8ddb3663b5 due to a false positive triggering of an
    assertion check. -- Ted ]

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     

28 Jun, 2015

1 commit

  • This reverts commit 2143c1965a761332ae417b22fd477b636e4f54ec.

    This commit seems to be the cause of the following jbd2 assertion
    failure:

    ------------[ cut here ]------------
    kernel BUG at fs/jbd2/transaction.c:1325!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: bnep bluetooth fuse ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 ...
    CPU: 7 PID: 5509 Comm: gcc Not tainted 4.1.0-10944-g2a298679b411 #1
    Hardware name: /DH87RL, BIOS RLH8710H.86A.0327.2014.0924.1645 09/24/2014
    task: ffff8803bf866040 ti: ffff880308528000 task.ti: ffff880308528000
    RIP: jbd2_journal_dirty_metadata+0x237/0x290
    Call Trace:
    __ext4_handle_dirty_metadata+0x43/0x1f0
    ext4_handle_dirty_dirent_node+0xde/0x160
    ? jbd2_journal_get_write_access+0x36/0x50
    ext4_delete_entry+0x112/0x160
    ? __ext4_journal_start_sb+0x52/0xb0
    ext4_unlink+0xfa/0x260
    vfs_unlink+0xec/0x190
    do_unlinkat+0x24a/0x270
    SyS_unlink+0x11/0x20
    entry_SYSCALL_64_fastpath+0x12/0x6a
    ---[ end trace ae033ebde8d080b4 ]---

    which is not easily reproducible (I've seen it just once, and then Ted
    was able to reproduce it once). Revert it while Ted and Jan try to
    figure out what is wrong.

    Cc: Jan Kara
    Acked-by: Theodore Ts'o
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

27 Jun, 2015

1 commit

  • Merge second patchbomb from Andrew Morton:

    - most of the rest of MM

    - lots of misc things

    - procfs updates

    - printk feature work

    - updates to get_maintainer, MAINTAINERS, checkpatch

    - lib/ updates

    * emailed patches from Andrew Morton : (96 commits)
    exit,stats: /* obey this comment */
    coredump: add __printf attribute to cn_*printf functions
    coredump: use from_kuid/kgid when formatting corename
    fs/reiserfs: remove unneeded cast
    NILFS2: support NFSv2 export
    fs/befs/btree.c: remove unneeded initializations
    fs/minix: remove unneeded cast
    init/do_mounts.c: add create_dev() failure log
    kasan: remove duplicate definition of the macro KASAN_FREE_PAGE
    fs/efs: femove unneeded cast
    checkpatch: emit "NOTE: " message only once after multiple files
    checkpatch: emit an error when there's a diff in a changelog
    checkpatch: validate MODULE_LICENSE content
    checkpatch: add multi-line handling for PREFER_ETHER_ADDR_COPY
    checkpatch: suggest using eth_zero_addr() and eth_broadcast_addr()
    checkpatch: fix processing of MEMSET issues
    checkpatch: suggest using ether_addr_equal*()
    checkpatch: avoid NOT_UNIFIED_DIFF errors on cover-letter.patch files
    checkpatch: remove local from codespell path
    checkpatch: add --showfile to allow input via pipe to show filenames
    ...

    Linus Torvalds
     

26 Jun, 2015

1 commit


21 Jun, 2015

1 commit

  • It is often the case that we mark buffer as having dirty metadata when
    the buffer is already in that state (frequent for bitmaps, inode table
    blocks, superblock). Thus it is unnecessary to contend on grabbing
    journal head reference and bh_state lock. Avoid that by checking whether
    any modification to the buffer is needed before grabbing any locks or
    references.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     

16 Jun, 2015

2 commits

  • insert_revoke_hash does an open coded endless allocation loop if
    journal_oom_retry is true. It doesn't implement any allocation fallback
    strategy between the retries, though. The memory allocator doesn't know
    about the never fail requirement so it cannot potentially help to move
    on with the allocation (e.g. use memory reserves).

    Get rid of the retry loop and use __GFP_NOFAIL instead. We will lose the
    debugging message but I am not sure it is anyhow helpful.

    Do the same for journal_alloc_journal_head which is doing a similar
    thing.

    Signed-off-by: Michal Hocko
    Signed-off-by: Theodore Ts'o

    Michal Hocko
     
  • If updating the journal superblock fails after journal data has been
    flushed, the error is dropped and the caller is misled into treating
    the operation as successful. In ocfs2, the checkpoint is then treated
    as successful and the other node can get the lock to update. Since
    sb_start still points to the old log block, the other node rewrites
    the journal data during journal recovery. Thus the new updates are
    overwritten and ocfs2 is corrupted. So in the above case we have to
    return the error; ocfs2_commit_cache will then handle the error and
    prevent the other node from updating first. Only after recovering the
    journal can it do the new updates.

    The issue discussion mail can be found at:
    https://oss.oracle.com/pipermail/ocfs2-devel/2015-June/010856.html
    http://comments.gmane.org/gmane.comp.file-systems.ext4/48841

    [ Fixed bug in patch which allowed a non-negative error return from
    jbd2_cleanup_journal_tail() to leak out of jbd2_journal_flush(); this
    was causing xfstests ext4/306 to fail. -- Ted ]

    Reported-by: Yiwen Jiang
    Signed-off-by: Joseph Qi
    Signed-off-by: Theodore Ts'o
    Tested-by: Yiwen Jiang
    Cc: Junxiao Bi
    Cc: stable@vger.kernel.org

    Joseph Qi
     

15 Jun, 2015

1 commit

  • jbd2_cleanup_journal_tail() can be invoked by jbd2__journal_start(),
    so allocations should be done with GFP_NOFS.

    [Full stack trace snipped from 3.10-rh7]
    [] dump_stack+0x19/0x1b
    [] warn_slowpath_common+0x61/0x80
    [] warn_slowpath_null+0x1a/0x20
    [] slab_pre_alloc_hook.isra.31.part.32+0x15/0x17
    [] kmem_cache_alloc+0x55/0x210
    [] ? mempool_alloc_slab+0x15/0x20
    [] mempool_alloc_slab+0x15/0x20
    [] mempool_alloc+0x69/0x170
    [] ? _raw_spin_unlock_irq+0xe/0x20
    [] ? finish_task_switch+0x5d/0x150
    [] bio_alloc_bioset+0x1be/0x2e0
    [] blkdev_issue_flush+0x99/0x120
    [] jbd2_cleanup_journal_tail+0x93/0xa0 [jbd2] -->GFP_KERNEL
    [] jbd2_log_do_checkpoint+0x221/0x4a0 [jbd2]
    [] __jbd2_log_wait_for_space+0xa7/0x1e0 [jbd2]
    [] start_this_handle+0x2d8/0x550 [jbd2]
    [] ? __memcg_kmem_put_cache+0x29/0x30
    [] ? kmem_cache_alloc+0x130/0x210
    [] jbd2__journal_start+0xba/0x190 [jbd2]
    [] ? lru_cache_add+0xe/0x10
    [] ? ext4_da_write_begin+0xf9/0x330 [ext4]
    [] __ext4_journal_start_sb+0x77/0x160 [ext4]
    [] ext4_da_write_begin+0xf9/0x330 [ext4]
    [] generic_file_buffered_write_iter+0x10c/0x270
    [] __generic_file_write_iter+0x178/0x390
    [] __generic_file_aio_write+0x8b/0xb0
    [] generic_file_aio_write+0x5d/0xc0
    [] ext4_file_write+0xa9/0x450 [ext4]
    [] ? pipe_read+0x379/0x4f0
    [] do_sync_write+0x90/0xe0
    [] vfs_write+0xbd/0x1e0
    [] SyS_write+0x58/0xb0
    [] system_call_fastpath+0x16/0x1b

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Dmitry Monakhov
     

09 Jun, 2015

4 commits

  • jbd2_journal_get_write_access() and jbd2_journal_get_create_access() are
    frequently called for buffers that are already part of the running
    transaction - most frequently it is the case for bitmaps, inode table
    blocks, and superblock. Since in such cases we have nothing to do, it is
    unfortunate we still grab reference to journal head, lock the bh, lock
    bh_state only to find out there's nothing to do.

    Improving this is a bit subtle though since until we find out journal
    head is attached to the running transaction, it can disappear from under
    us because checkpointing / commit decided it's no longer needed. We deal
    with this by protecting journal_head slab with RCU. We still have to be
    careful about journal head being freed & reallocated within slab and
    about exposing journal head in consistent state (in particular
    b_modified and b_frozen_data must be in correct state before we allow
    user to touch the buffer).

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • Check for the simple case of unjournaled buffer first, handle it and
    bail out. This allows us to remove one if and unindent the difficult case
    by one tab. The result is easier to read.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • We were acquiring bh_state_lock when allocation of buffer failed in
    do_get_write_access() only to be able to jump to a label that releases
    the lock and does all other checks that don't make sense for this error
    path. Just jump into the right label instead.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • needs_copy is set only in one place in do_get_write_access(), just move
    the frozen buffer copying into that place and factor it out to a
    separate function to make do_get_write_access() slightly more readable.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     

08 Jun, 2015

1 commit

  • This basically reverts 47def82672b3 (jbd2: Remove __GFP_NOFAIL from jbd2
    layer). The deprecation of __GFP_NOFAIL was a bad choice because it led
    to open coding the endless loop around the allocator rather than
    removing the dependency on the non-failing allocation. So the
    deprecation was a clear failure, and reality tells us that
    __GFP_NOFAIL is not even close to going away.

    It is still true that __GFP_NOFAIL allocations are generally discouraged,
    and that new uses should be evaluated and an alternative (pre-allocations
    or reservations) considered, but it doesn't make any sense to lie to
    the allocator about the requirements. The allocator can take steps to
    help make progress if it knows the requirements.

    Signed-off-by: Michal Hocko
    Signed-off-by: Theodore Ts'o
    Acked-by: David Rientjes

    Michal Hocko
     

15 May, 2015

2 commits

  • The journal revoke block recovery code does not check r_count for
    sanity, which means that an evil value of r_count could result in
    the kernel reading off the end of the revoke table and into whatever
    garbage lies beyond. This could crash the kernel, so fix that.

    However, in testing this fix, I discovered that the code to write
    out the revoke tables also was not correctly checking to see if the
    block was full -- the current offset check is fine so long as the
    revoke table space size is a multiple of the record size, but this
    is not true when either of journal_csum_v[23] is set.
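
    Both checks can be sketched as plain bounds tests. The sk_* function
    names and parameters are hypothetical, and the sizes used to exercise
    them are illustrative rather than taken from the on-disk format.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Recovery side: r_count is read from disk and must not point past the
 * usable part of the block (block size minus any checksum tail).
 */
bool sk_revoke_count_valid(size_t r_count, size_t block_size,
			   size_t csum_tail_size)
{
	if (csum_tail_size > block_size)
		return false;
	return r_count <= block_size - csum_tail_size;
}

/*
 * Write side: test whether one more record fits, rather than testing
 * for offset == usable_size, which is never hit when the usable size is
 * not a multiple of the record size.
 */
bool sk_revoke_record_fits(size_t offset, size_t record_len,
			   size_t usable_size)
{
	return offset + record_len <= usable_size;
}
```

    For example, with 1020 usable bytes, a 16-byte header and 8-byte
    records, the offset steps 16, 24, ..., 1016 and never equals 1020, so
    only the "does it fit" form stops before overrunning the block.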

    Signed-off-by: Darrick J. Wong
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara
    Cc: stable@vger.kernel.org

    Darrick J. Wong
     
  • Currently when journal restart fails, we'll have the h_transaction of
    the handle set to NULL to indicate that the handle has been effectively
    aborted. We handle this situation quietly in the jbd2_journal_stop() and just
    free the handle and exit because everything else has been done before we
    attempted (and failed) to restart the journal.

    Unfortunately there are a number of problems with that approach
    introduced with commit

    41a5b913197c "jbd2: invalidate handle if jbd2_journal_restart()
    fails"

    First of all in ext4 jbd2_journal_stop() will be called through
    __ext4_journal_stop() where we would try to get a hold of the superblock
    by dereferencing h_transaction which in this case would lead to NULL
    pointer dereference and crash.

    In addition we're going to free the handle regardless of the refcount
    which is bad as well, because others up the call chain will still
    reference the handle so we might potentially reference already freed
    memory.

    Moreover, it's expected that we'll get an aborted handle as well as a
    detached handle in some of the journalling functions as the error
    propagates up the stack, so it's unnecessary to call WARN_ON every time
    we get a detached handle.

    And finally we might leak some memory by forgetting to free reserved
    handle in jbd2_journal_stop() in the case where handle was detached from
    the transaction (h_transaction is NULL).

    Fix the NULL pointer dereference in __ext4_journal_stop() by just
    calling jbd2_journal_stop() quietly as suggested by Jan Kara. Also fix
    the potential memory leak in jbd2_journal_stop() and use proper
    handle refcounting before we attempt to free it to avoid use-after-free
    issues.

    And finally remove all WARN_ON(!transaction) from the code so that we do
    not get random traces when something goes wrong because when journal
    restart fails we will get to some of those functions.

    Cc: stable@vger.kernel.org
    Signed-off-by: Lukas Czerner
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara

    Lukas Czerner
     

20 Jan, 2015

1 commit


13 Dec, 2014

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "Lots of bug fixes, including Zheng and Jan's extent status shrinker
    fixes, which should improve CPU utilization and potential soft lockups
    under heavy memory pressure, and Eric Whitney's bigalloc fixes"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (26 commits)
    ext4: ext4_da_convert_inline_data_to_extent drop locked page after error
    ext4: fix suboptimal seek_{data,hole} extents traversial
    ext4: ext4_inline_data_fiemap should respect callers argument
    ext4: prevent fsreentrance deadlock for inline_data
    ext4: forbid journal_async_commit in data=ordered mode
    jbd2: remove unnecessary NULL check before iput()
    ext4: Remove an unnecessary check for NULL before iput()
    ext4: remove unneeded code in ext4_unlink
    ext4: don't count external journal blocks as overhead
    ext4: remove never taken branch from ext4_ext_shift_path_extents()
    ext4: create nojournal_checksum mount option
    ext4: update comments regarding ext4_delete_inode()
    ext4: cleanup GFP flags inside resize path
    ext4: introduce aging to extent status tree
    ext4: cleanup flag definitions for extent status tree
    ext4: limit number of scanned extents in status tree shrinker
    ext4: move handling of list of shrinkable inodes into extent status code
    ext4: change LRU to round-robin in extent status tree shrinker
    ext4: cache extent hole in extent status tree for ext4_da_map_blocks()
    ext4: fix block reservation for bigalloc filesystems
    ...

    Linus Torvalds
     

02 Dec, 2014

1 commit

  • When we're enabling journal features, we cannot use the predicate
    jbd2_journal_has_csum_v2or3() because we haven't yet set the sb
    feature flag fields! Moreover, we just finished loading the shash
    driver, so the test is unnecessary; calculate the seed always.

    Without this patch, we fail to initialize the checksum seed the first
    time we turn on journal_checksum, which means that all journal blocks
    written during that first mount are corrupt. Transactions written
    after the second mount will be fine, since the feature flag will be
    set in the journal superblock. xfstests generic/{034,321,322} are the
    regression tests.

    (This is important for 3.18.)

    Signed-off-by: Darrick J. Wong
    Reported-by: Eric Whitney
    Signed-off-by: Theodore Ts'o

    Darrick J. Wong
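
    The ordering problem in this commit can be illustrated with a small
    user-space sketch. It is a toy model, not the jbd2 code: struct
    toy_journal, toy_seed_from_uuid() and both toy_enable_csum_*() functions
    are hypothetical, and the seed computation is a placeholder rather than
    the real checksum over the UUID. The point is only that a predicate
    reading a feature flag cannot be true before the flag has been set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy journal state; all names are hypothetical. */
struct toy_journal {
    bool sb_has_csum;    /* stand-in for the superblock feature flag */
    uint32_t csum_seed;  /* 0 means "never initialised" in this toy */
};

/* Placeholder seed derivation; the real code checksums the journal UUID. */
uint32_t toy_seed_from_uuid(uint32_t uuid)
{
    return uuid ^ 0x9e3779b9u;
}

/* Buggy ordering: the flag is tested before it has been set. */
void toy_enable_csum_buggy(struct toy_journal *j, uint32_t uuid)
{
    if (j->sb_has_csum) /* false on the first enable, so the seed is skipped */
        j->csum_seed = toy_seed_from_uuid(uuid);
    j->sb_has_csum = true;
}

/* Fixed ordering: always calculate the seed, then set the flag. */
void toy_enable_csum_fixed(struct toy_journal *j, uint32_t uuid)
{
    j->csum_seed = toy_seed_from_uuid(uuid);
    j->sb_has_csum = true;
}
```

    With the buggy variant the seed stays at its initial value until a later
    call finds the flag already set, which mirrors the "fine after the second
    mount" behaviour described above.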
     

26 Nov, 2014

1 commit


30 Oct, 2014

1 commit


18 Sep, 2014

2 commits

  • __jbd2_journal_clean_checkpoint_list() returns the number of buffers it
    freed, but no one was using the value, so just stop doing that. This
    also allows for simplifying the calling convention for
    journal_clean_one_cp_list().

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • Yuanhan has reported that when he runs an fsync(2)-heavy workload
    creating new files over a ramdisk, a significant amount of time is spent
    in __jbd2_journal_clean_checkpoint_list() trying to clean old
    transactions (which cannot be cleaned up because the flusher hasn't yet
    checkpointed those buffers). The workload can be generated by:
    fs_mark -d /fs/ram0/1 -D 2 -N 2560 -n 1000000 -L 1 -S 1 -s 4096

    Reduce the amount of scanning by stopping the scan of the transaction
    list once we find a transaction that cannot be checkpointed. Note that
    this way of cleaning is still enough to keep freeing space in the
    journal after fully checkpointed transactions.

    Reported-and-tested-by: Yuanhan Liu
    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
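
    The early-exit scan described in this commit can be sketched as follows.
    This is a simplified user-space model under stated assumptions: it uses a
    plain array where the kernel walks a linked list of transactions, and
    struct toy_txn and toy_clean_checkpoint_list() are hypothetical names.
    The return value (transactions visited) exists only to make the saved
    work visible.

```c
#include <stddef.h>

/* Toy transaction: oldest first in the array; names are hypothetical. */
struct toy_txn {
    int pending_buffers; /* buffers the flusher has not yet written out */
    int released;        /* set to 1 once the transaction is cleaned up */
};

/*
 * Walk the list from the oldest transaction and stop at the first one that
 * still has pending buffers. Returns how many transactions were visited.
 */
size_t toy_clean_checkpoint_list(struct toy_txn *list, size_t n)
{
    size_t visited = 0;

    for (size_t i = 0; i < n; i++) {
        visited++;
        if (list[i].pending_buffers > 0)
            break; /* cannot be checkpointed yet: stop scanning */
        list[i].released = 1;
    }
    return visited;
}
```

    Transactions behind the first blocked one are left untouched in this
    sketch, even if they are clean. The fully checkpointed transactions at
    the head of the list are still released, which is what keeps journal
    space being freed.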
     

17 Sep, 2014

2 commits

  • If EIO happens after we have dropped j_state_lock, we won't notice
    that the journal has been aborted. So it is reasonable to move this
    check after we have grabbed the j_checkpoint_mutex and re-grabbed the
    j_state_lock. This patch helps to prevent a false-positive complaint
    after EIO.

    #DMESG:
    __jbd2_log_wait_for_space: needed 8448 blocks and only had 8386 space available
    __jbd2_log_wait_for_space: no way to get more journal space in ram1-8
    ------------[ cut here ]------------
    WARNING: CPU: 15 PID: 6739 at fs/jbd2/checkpoint.c:168 __jbd2_log_wait_for_space+0x188/0x200()
    Modules linked in: brd iTCO_wdt lpc_ich mfd_core igb ptp dm_mirror dm_region_hash dm_log dm_mod
    CPU: 15 PID: 6739 Comm: fsstress Tainted: G W 3.17.0-rc2-00429-g684de57 #139
    Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.99.99.x028.061320111235 06/13/2011
    00000000000000a8 ffff88077aaab878 ffffffff815c1a8c 00000000000000a8
    0000000000000000 ffff88077aaab8b8 ffffffff8106ce8c ffff88077aaab898
    ffff8807c57e6000 ffff8807c57e6028 0000000000002100 ffff8807c57e62f0
    Call Trace:
    dump_stack+0x51/0x6d
    warn_slowpath_common+0x8c/0xc0
    warn_slowpath_null+0x1a/0x20
    __jbd2_log_wait_for_space+0x188/0x200
    start_this_handle+0x4da/0x7b0
    ? local_clock+0x25/0x30
    ? lockdep_init_map+0xe7/0x180
    jbd2__journal_start+0xdc/0x1d0
    ? __ext4_new_inode+0x7f4/0x1330
    __ext4_journal_start_sb+0xf8/0x110
    __ext4_new_inode+0x7f4/0x1330
    ? lock_release_holdtime+0x29/0x190
    ext4_create+0x8b/0x150
    vfs_create+0x7b/0xb0
    do_last+0x7db/0xcf0
    ? inode_permission+0x4d/0x50
    path_openat+0x242/0x590
    ? __alloc_fd+0x36/0x140
    do_filp_open+0x4a/0xb0
    ? __alloc_fd+0x121/0x140
    do_sys_open+0x170/0x220
    SyS_open+0x1e/0x20
    SyS_creat+0x16/0x20
    system_call_fastpath+0x16/0x1b
    ---[ end trace cd71c831f82059db ]---

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: Theodore Ts'o

    Dmitry Monakhov
     
  • Free the buffer head if the journal descriptor block fails checksum
    verification.

    This is the jbd2 port of the e2fsprogs patch "e2fsck: free bh on csum
    verify error in do_one_pass".

    Signed-off-by: Darrick J. Wong
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Eric Sandeen
    Cc: stable@vger.kernel.org

    Darrick J. Wong
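
    The kind of leak this commit fixes can be shown with a short user-space
    sketch. It is an illustration under stated assumptions, not the recovery
    code: struct toy_bh, toy_read_block(), toy_brelse(), toy_csum() and
    toy_scan_descriptor() are hypothetical, the checksum is a trivial byte
    sum, and the negative return values are placeholders for kernel error
    codes.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TOY_BLOCK_SIZE 16

/* Toy buffer head; all names are hypothetical. */
struct toy_bh {
    uint8_t data[TOY_BLOCK_SIZE];
    uint8_t csum; /* checksum stored alongside the block */
};

int toy_live_buffers; /* buffers read but not yet released */

static struct toy_bh *toy_read_block(const uint8_t *src, uint8_t stored_csum)
{
    struct toy_bh *bh = malloc(sizeof(*bh));

    if (!bh)
        return NULL;
    memcpy(bh->data, src, TOY_BLOCK_SIZE);
    bh->csum = stored_csum;
    toy_live_buffers++;
    return bh;
}

static void toy_brelse(struct toy_bh *bh)
{
    free(bh);
    toy_live_buffers--;
}

/* Trivial placeholder checksum: sum of all bytes, truncated to 8 bits. */
static uint8_t toy_csum(const uint8_t *d, size_t n)
{
    uint8_t sum = 0;

    for (size_t i = 0; i < n; i++)
        sum = (uint8_t)(sum + d[i]);
    return sum;
}

/* Returns 0 on success, -1 on read failure, -2 on checksum mismatch. */
int toy_scan_descriptor(const uint8_t *src, uint8_t stored_csum)
{
    struct toy_bh *bh = toy_read_block(src, stored_csum);

    if (!bh)
        return -1;
    if (toy_csum(bh->data, TOY_BLOCK_SIZE) != bh->csum) {
        toy_brelse(bh); /* release on the error path as well */
        return -2;
    }
    /* ... the descriptor block would be processed here ... */
    toy_brelse(bh);
    return 0;
}
```

    Without the release on the mismatch branch, toy_live_buffers would stay
    above zero after a failed verification, which is the leak in miniature.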
     

11 Sep, 2014

1 commit


05 Sep, 2014

2 commits

  • Since the jbd/jbd2 superblock is not released until the file system is
    unmounted, allocate the buffer cache from the non-movable area to
    allow page migration and CMA allocations to more easily succeed.

    Signed-off-by: Gioh Kim
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara

    Gioh Kim
     
  • When we discover a written-out buffer in a transaction's checkpoint
    list, we don't have to recheck the validity of the transaction. Either
    this is the last buffer in the transaction, and then we are done, or it
    isn't, and then we can just take another buffer from the checkpoint list
    without dropping j_list_lock.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara