30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include those
    headers directly instead of assuming availability. As this conversion
    needs to touch a large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there, i.e. if only gfp is used,
    gfp.h; if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.
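
    The first bullet above (deciding which header a file needs) can be
    sketched roughly as follows. This is not the actual slabh-sweep.py;
    the identifier patterns and the function name needed_include() are
    assumptions made for illustration, and the real script's heuristics
    are not described in this message.

```python
import re

# Hypothetical patterns: a few well-known slab calls and gfp symbols.
# The real script may recognise a different (likely larger) set.
SLAB_PATTERN = re.compile(r"\b(kmalloc|kzalloc|kcalloc|kfree|kmem_cache_\w+)\s*\(")
GFP_PATTERN = re.compile(r"\b(GFP_[A-Z_]+|__GFP_[A-Z_]+|gfp_t)\b")


def needed_include(source: str):
    """Return the header a C source string appears to need, or None.

    slab.h itself includes gfp.h, so a file that uses slab facilities
    only needs slab.h even if it also uses gfp symbols.
    """
    if SLAB_PATTERN.search(source):
        return "linux/slab.h"
    if GFP_PATTERN.search(source):
        return "linux/gfp.h"
    return None
```

    A file that only declares a gfp_t mask would get gfp.h, while any
    file calling kmalloc() would get slab.h.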

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

25 Feb, 2010

1 commit


16 Feb, 2010

1 commit

  • Delay discarding buffers in journal_unmap_buffer until
    we know that the "add to orphan" operation has definitely been
    committed; otherwise the log space of the committing transaction
    may be freed and reused before the truncate gets committed, and
    updates may be lost if a crash happens.

    Signed-off-by: dingdinghua
    Signed-off-by: "Theodore Ts'o"

    dingdinghua
     

23 Dec, 2009

4 commits

  • Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     
  • It triggers the warning in get_page_from_freelist(), and it isn't
    appropriate to use __GFP_NOFAIL here anyway.

    Addresses http://bugzilla.kernel.org/show_bug.cgi?id=14843

    Reported-by: Christian Casteyde
    Signed-off-by: Andrew Morton
    Signed-off-by: "Theodore Ts'o"

    Andrew Morton
     
  • This is a bit complicated because we are trying to optimize when we
    send barriers to the fs data disk. We could just throw in an extra
    barrier to the data disk whenever we send a barrier to the journal
    disk, but that's not always strictly necessary.

    We only need to send a barrier during a commit when there are data
    blocks which must be written out due to an inode written in
    ordered mode, or if fsync() depends on the commit to force data blocks
    to disk. Finally, before we drop transactions from the beginning of
    the journal during a checkpoint operation, we need to guarantee that
    any blocks that were flushed out to the data disk are firmly on the
    rust platter before we drop the transaction from the journal.

    Thanks to Oleg Drokin for pointing out this flaw in ext3/ext4.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     
  • jbd-debug and jbd2-debug are currently read-only (S_IRUGO), which is not
    correct. Make them writable so that we can start debugging.

    Signed-off-by: Yin Kangkai
    Reviewed-by: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Signed-off-by: Jan Kara

    Yin Kangkai
     

18 Dec, 2009

1 commit

  • This reverts commit e4c570c4cb7a95dbfafa3d016d2739bf3fdfe319, as
    requested by Alexey:

    "I think I gave a good enough arguments to not merge it.
    To iterate:
    * patch makes impossible to start using ext3 on EXT3_FS=n kernels
    without reboot.
    * this is done only for one pointer on task_struct"

    None of config options which define task_struct are tristate directly
    or effectively."

    Requested-by: Alexey Dobriyan
    Acked-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

16 Dec, 2009

1 commit

  • journal_info in task_struct is used in journaling file system only. So
    introduce CONFIG_FS_JOURNAL_INFO and make it conditional.

    Signed-off-by: Hiroshi Shimamoto
    Cc: Chris Mason
    Cc: "Theodore Ts'o"
    Cc: Steven Whitehouse
    Cc: KONISHI Ryusuke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hiroshi Shimamoto
     

12 Dec, 2009

1 commit

  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (21 commits)
    ext3: PTR_ERR return of wrong pointer in setup_new_group_blocks()
    ext3: Fix data / filesystem corruption when write fails to copy data
    ext4: Support for 64-bit quota format
    ext3: Support for vfsv1 quota format
    quota: Implement quota format with 64-bit space and inode limits
    quota: Move definition of QFMT_OCFS2 to linux/quota.h
    ext2: fix comment in ext2_find_entry about return values
    ext3: Unify log messages in ext3
    ext2: clear uptodate flag on super block I/O error
    ext2: Unify log messages in ext2
    ext3: make "norecovery" an alias for "noload"
    ext3: Don't update the superblock in ext3_statfs()
    ext3: journal all modifications in ext3_xattr_set_handle
    ext2: Explicitly assign values to on-disk enum of filetypes
    quota: Fix WARN_ON in lookup_one_len
    const: struct quota_format_ops
    ubifs: remove manual O_SYNC handling
    afs: remove manual O_SYNC handling
    kill wait_on_page_writeback_range
    vfs: Implement proper O_SYNC semantics
    ...

    Linus Torvalds
     

10 Dec, 2009

2 commits


07 Dec, 2009

1 commit

  • Now that SLUB seems to be fixed so that it respects the requested
    alignment, use kmem_cache_alloc() as the allocator if the block size of
    the buffer heads to be allocated is less than the page size.
    Previously, we were using a 16k page on a Power system for each buffer,
    even when the file system was using a 1k or 4k block size.
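
    The saving can be illustrated with a little arithmetic. The sketch
    below is a userspace model, not the jbd2 code: the function names
    are hypothetical, and "16k" is assumed to mean 16384 bytes.

```python
def backing_store(block_size: int, page_size: int) -> str:
    """Pick the allocator as described above: slab below a page, pages otherwise."""
    return "slab" if block_size < page_size else "page"


def wasted_per_buffer(block_size: int, page_size: int) -> int:
    """Bytes left unused when one whole page backs a single block-sized buffer."""
    return max(page_size - block_size, 0)


PAGE_SIZE = 16 * 1024  # assumed 16 KiB page, as in the Power example

# With whole-page allocation, a 1k buffer leaves 15360 bytes unused
# and a 4k buffer leaves 12288 bytes unused.
waste_1k = wasted_per_buffer(1024, PAGE_SIZE)
waste_4k = wasted_per_buffer(4096, PAGE_SIZE)
```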

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

01 Dec, 2009

1 commit


16 Nov, 2009

1 commit

  • If there is a failed journal checksum, don't reset the journal. This
    allows for userspace programs to decide how to recover from this
    situation. It may be that ignoring the journal checksum failure might
    be a better way of recovering the file system. Once we add per-block
    checksums, we can definitely do better. Until then, a system
    administrator can try backing up the file system image (or taking a
    snapshot) and trying to determine experimentally whether ignoring
    the checksum failure or aborting the journal replay results in less
    data loss.

    Signed-off-by: "Theodore Ts'o"
    Cc: stable@kernel.org

    Theodore Ts'o
     

11 Nov, 2009

1 commit


02 Oct, 2009

1 commit


30 Sep, 2009

2 commits

  • The /proc/fs/jbd2/<dev>/history was maintained manually; by using
    tracepoints, we can get all of the existing functionality of the /proc
    file plus extra capabilities thanks to the ftrace infrastructure. We
    save memory as a bonus.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     
  • There are a number of kernel printk's which are printed when an ext4
    filesystem is mounted and unmounted. Disable them to economize space
    in the system logs. In addition, disabling the mballoc stats by
    default saves a number of unneeded atomic operations for every block
    allocation or deallocation.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

23 Sep, 2009

1 commit

  • Make all seq_operations structs const, to help mitigate against
    revectoring user-triggerable function pointers.

    This is derived from the grsecurity patch, although generated from scratch
    because it's simpler than extracting the changes from there.

    Signed-off-by: James Morris
    Acked-by: Serge Hallyn
    Acked-by: Casey Schaufler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    James Morris
     

16 Sep, 2009

1 commit


11 Sep, 2009

1 commit

  • Previously the journal_async_commit mount option was equivalent to
    using barrier=0 (and just as unsafe). This patch fixes it so that we
    eliminate the barrier before the commit block (by not using ordered
    mode), and explicitly issue an empty barrier bio after writing the
    commit block. Because of the journal checksum, it is safe to do this;
    if the journal blocks are not all written before a power failure, the
    checksum in the commit block will prevent the last transaction from
    being replayed.
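
    The role of the checksum can be sketched as below. This is only a
    model under stated assumptions: zlib.crc32 stands in for the
    journal's real checksum, and transaction_checksum() and
    should_replay() are hypothetical names, not jbd2 functions.

```python
import zlib


def transaction_checksum(journal_blocks):
    """Checksum covering every journal block of one transaction."""
    crc = 0
    for block in journal_blocks:
        crc = zlib.crc32(block, crc)
    return crc


def should_replay(blocks_on_disk, checksum_in_commit_block):
    """Replay only if the blocks found on disk match the commit block."""
    return transaction_checksum(blocks_on_disk) == checksum_in_commit_block


# The commit block records the checksum of the blocks as intended.
intended = [b"a" * 16, b"b" * 16]
commit_checksum = transaction_checksum(intended)

# Power fails before the second block is fully written.
torn = [intended[0], b"b" * 15 + b"\0"]
```

    Here should_replay(intended, commit_checksum) is true, while the
    torn transaction is rejected, which is why the commit block may be
    written without a barrier in front of it.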

    With the fs_mark benchmark, journal_async_commit shows a 50%
    improvement in files/sec:

    FSUse%  Count   Size  Files/sec  App Overhead
         8   1000  10240       30.5         28242

    vs.

    FSUse%  Count   Size  Files/sec  App Overhead
         8   1000  10240       45.8         28620
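
    The quoted 50% follows from the Files/sec column, assuming the first
    run is the baseline and the second uses journal_async_commit:

```python
# Files/sec figures copied from the two fs_mark runs above.
baseline_files_per_sec = 30.5
async_commit_files_per_sec = 45.8


def percent_improvement(before: float, after: float) -> float:
    """Relative throughput gain of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


improvement = percent_improvement(baseline_files_per_sec,
                                  async_commit_files_per_sec)
```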

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

18 Aug, 2009

1 commit


11 Aug, 2009

1 commit

  • Fix jiffy rounding in the jbd commit timer setup code. Rounding down
    could cause the timer to fire before the corresponding transaction
    has expired. That transaction can stay uncommitted forever if no
    new transaction is created and no explicit sync/umount happens.
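
    A small model of the problem, assuming the timer is rounded to a
    whole-second boundary and HZ is 250 (both are assumptions; the
    helper names are hypothetical and are not the kernel's rounding
    functions):

```python
HZ = 250  # assumed number of jiffies per second


def round_down_to_second(jiffies: int) -> int:
    """Round a jiffies value down to the previous whole second."""
    return jiffies - jiffies % HZ


def round_up_to_second(jiffies: int) -> int:
    """Round a jiffies value up to the next whole second."""
    remainder = jiffies % HZ
    return jiffies if remainder == 0 else jiffies + (HZ - remainder)


t_expires = 1010  # transaction expiry, in jiffies

early = round_down_to_second(t_expires)  # fires before the expiry
safe = round_up_to_second(t_expires)     # fires at or after the expiry
```

    A timer armed at the rounded-down value fires while the transaction
    has not yet expired, so nothing is committed and the timer is not
    necessarily re-armed.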

    Signed-off-by: Alex Zhuravlev (Tomas)
    Signed-off-by: Andreas Dilger
    Signed-off-by: "Theodore Ts'o"

    Andreas Dilger
     

17 Jul, 2009

1 commit


14 Jul, 2009

2 commits

  • The function jbd2_journal_write_metadata_buffer() calls
    jbd_unlock_bh_state(bh_in) too early; this could potentially allow
    another thread to call get_write_access on the buffer head, modify the
    data, and dirty it, allowing the wrong data to be written into the
    journal. Fortunately, if we lose this race, the only time this will
    actually cause filesystem corruption is if there is a system crash or
    other unclean shutdown of the system before the next commit can take
    place.

    Signed-off-by: dingdinghua
    Signed-off-by: "Theodore Ts'o"

    dingdinghua
     
  • The following race can happen:

    CPU1                            CPU2
                                    checkpointing code checks the buffer, adds
                                      it to an array for writeback
    do_get_write_access()
      ...
      lock_buffer()
      unlock_buffer()
                                    flush_batch() submits the buffer for IO
      __jbd2_journal_file_buffer()

    So a buffer under writeout is returned from
    do_get_write_access(). Since the filesystem code relies on the fact
    that journaled buffers cannot be written out, it does not take the
    buffer lock and so it can modify the buffer while it is under
    writeout. That can lead to filesystem corruption if we crash at the
    right moment.

    We fix the problem by clearing the buffer dirty bit under buffer_lock
    even if the buffer is on the BJ_None list. Actually, we clear the dirty
    bit regardless of the list the buffer is on and warn about the fact if
    the buffer is already journalled.

    Thanks for spotting the problem go to dingdinghua.

    Reported-by: dingdinghua
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     

21 Jun, 2009

1 commit


18 Jun, 2009

1 commit

  • This patch reverts 3f31fddf, which is no longer needed: if a race
    between freeing a buffer and committing a transaction occurs and dio
    gets an error, dio now falls back to buffered IO thanks to
    commit 6ccfa806.

    Signed-off-by: Hisashi Hifumi
    Cc: Mingming Cao
    Acked-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: "Theodore Ts'o"

    Hisashi Hifumi
     

17 Jun, 2009

1 commit


09 Jun, 2009

1 commit


14 Apr, 2009

1 commit


06 Apr, 2009

1 commit

  • When we are going to be submitting several sync writes, we want to
    give the IO scheduler a chance to merge some of them. Instead of
    using the implicitly unplugging WRITE_SYNC variant, use WRITE_SYNC_PLUG
    and rely on sync_buffer() doing the unplug when someone does a
    wait_on_buffer()/lock_buffer().

    Signed-off-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Jens Axboe
     

28 Mar, 2009

1 commit


26 Mar, 2009

1 commit


11 Feb, 2009

2 commits

  • If we race with commit code setting i_transaction to NULL, we could
    possibly dereference it. Proper locking requires the journal pointer
    (to access journal->j_list_lock), which we don't have. So we have to
    change the prototype of the function so that the filesystem passes us the
    journal pointer. Also add a more detailed comment about why the
    function jbd2_journal_begin_ordered_truncate() does what it does and
    how it should be used.

    Thanks to Dan Carpenter for pointing to the
    suspicious code.

    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"
    Acked-by: Joel Becker
    CC: linux-ext4@vger.kernel.org
    CC: ocfs2-devel@oss.oracle.com
    CC: mfasheh@suse.de
    CC: Dan Carpenter

    Jan Kara
     
  • The function jbd2_journal_start_commit() returns 1 if either a
    transaction is committing or the function has queued a transaction
    commit. But it returns 0 if we raced with somebody queueing the
    transaction commit as well. This resulted in ext4_sync_fs() not
    functioning correctly (description from Arthur Jones):

    In the case of a data=ordered umount with pending long symlinks
    which are delayed due to a long list of other I/O on the backing
    block device, this causes the buffer associated with the long
    symlinks to not be moved to the inode dirty list in the second
    phase of fsync_super. Then, before they can be dirtied again,
    kjournald exits, seeing the UMOUNT flag and the dirty pages are
    never written to the backing block device, causing long symlink
    corruption and exposing new or previously freed block data to
    userspace.

    This can be reproduced with a script created by Eric Sandeen:

    #!/bin/bash

    umount /mnt/test2
    mount /dev/sdb4 /mnt/test2
    rm -f /mnt/test2/*
    dd if=/dev/zero of=/mnt/test2/bigfile bs=1M count=512
    touch /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
    ln -s /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename \
        /mnt/test2/link
    umount /mnt/test2
    mount /dev/sdb4 /mnt/test2
    ls /mnt/test2/

    This patch fixes jbd2_journal_start_commit() to always return 1 when
    there's a transaction committing or queued for commit.
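
    The difference in return value can be modelled as below. This is a
    userspace sketch, not the jbd2 source: the Journal fields and all
    function names are hypothetical, and transaction ID wraparound is
    ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Journal:
    running_tid: Optional[int] = None     # transaction still accepting updates
    committing_tid: Optional[int] = None  # transaction being committed
    commit_request: int = 0               # highest tid whose commit was requested


def queue_commit(journal: Journal, tid: int) -> bool:
    """Request a commit of tid; True only if this call queued it."""
    if journal.commit_request >= tid:
        return False  # somebody else already queued this commit
    journal.commit_request = tid
    return True


def start_commit_old(journal: Journal) -> int:
    """Old behaviour: 0 when we raced with another queuer."""
    if journal.running_tid is not None:
        return 1 if queue_commit(journal, journal.running_tid) else 0
    return 1 if journal.committing_tid is not None else 0


def start_commit_fixed(journal: Journal) -> int:
    """Fixed behaviour: 1 whenever a commit is in progress or queued."""
    if journal.running_tid is not None:
        queue_commit(journal, journal.running_tid)
        return 1
    return 1 if journal.committing_tid is not None else 0
```

    With Journal(running_tid=5, commit_request=5), i.e. another thread
    has already queued the commit, the old variant returns 0 and the
    caller wrongly concludes there is nothing to wait for; the fixed
    variant returns 1.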

    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"
    CC: Eric Sandeen
    CC: linux-ext4@vger.kernel.org

    Jan Kara
     

12 Jan, 2009

1 commit

  • The following warning:

    fs/jbd2/journal.c: In function ‘jbd2_seq_info_show’:
    fs/jbd2/journal.c:850: warning: format ‘%lu’ expects type ‘long
    unsigned int’, but argument 3 has type ‘uint32_t’

    is caused by incorrect use of do_div, which overwrites the dividend
    in place with the quotient and returns the remainder. So not only
    would an incorrect value be displayed, but
    s->journal->j_average_commit_time would also be changed to a wrong
    value!

    Fix it by using div_u64 instead.
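
    The pitfall can be modelled in userspace as follows. In the kernel,
    do_div(n, base) stores the quotient in n and returns the remainder,
    while div_u64() leaves its arguments alone and returns the quotient.
    The Python functions below only imitate those semantics; the
    one-element list stands in for the lvalue that the do_div macro
    modifies, and the sample value is made up.

```python
def do_div(cell, base):
    """Imitate do_div(n, base): n becomes the quotient, the remainder is returned."""
    remainder = cell[0] % base
    cell[0] //= base
    return remainder


def div_u64(dividend, divisor):
    """Imitate div_u64(): return the quotient without touching the dividend."""
    return dividend // divisor


# Hypothetical average commit time, to be displayed divided by 1000.
average_commit_time = [1500]

shown_wrong = do_div(average_commit_time, 1000)  # remainder, not the quotient
clobbered = average_commit_time[0]               # stored average is now 1

shown_right = div_u64(1500, 1000)                # quotient; nothing is modified
```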

    Signed-off-by: Simon Holm Thøgersen
    Signed-off-by: "Theodore Ts'o"

    Simon Holm Thøgersen
     

09 Jan, 2009

1 commit

  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (57 commits)
    jbd2: Fix oops in jbd2_journal_init_inode() on corrupted fs
    ext4: Remove "extents" mount option
    block: Add Kconfig help which notes that ext4 needs CONFIG_LBD
    ext4: Make printk's consistently prefixed with "EXT4-fs: "
    ext4: Add sanity checks for the superblock before mounting the filesystem
    ext4: Add mount option to set kjournald's I/O priority
    jbd2: Submit writes to the journal using WRITE_SYNC
    jbd2: Add pid and journal device name to the "kjournald2 starting" message
    ext4: Add markers for better debuggability
    ext4: Remove code to create the journal inode
    ext4: provide function to release metadata pages under memory pressure
    ext3: provide function to release metadata pages under memory pressure
    add releasepage hooks to block devices which can be used by file systems
    ext4: Fix s_dirty_blocks_counter if block allocation failed with nodelalloc
    ext4: Init the complete page while building buddy cache
    ext4: Don't allow new groups to be added during block allocation
    ext4: mark the blocks/inode bitmap beyond end of group as used
    ext4: Use new buffer_head flag to check uninit group bitmaps initialization
    ext4: Fix the race between read_inode_bitmap() and ext4_new_inode()
    ext4: code cleanup
    ...

    Linus Torvalds
     

07 Jan, 2009

1 commit

  • On a 32-bit system with CONFIG_LBD, getblk() can fail because the
    provided block number is too big. Add error checks so we fail
    gracefully if getblk() returns NULL (which can also happen on memory
    allocation failures).

    Thanks to David Maciejak from Fortinet's FortiGuard Global Security
    Research Team for reporting this bug.

    http://bugzilla.kernel.org/show_bug.cgi?id=12370

    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"
    cc: stable@kernel.org

    Jan Kara