24 Jul, 2011

2 commits


18 Jul, 2011

6 commits

  • If eh_entries is equal to (or greater than) eh_max, inserting a new
    extent_idx will overflow the number of entries. So check eh_entries
    before inserting the new extent_idx.

    Although the current code cannot hit this case (ext4_ext_insert_index
    is called by ext4_ext_split, and ext4_ext_split is called only if the
    index block has free space), the right logic is to check the capacity
    before insertion.

    Signed-off-by: Robin Dong
    Signed-off-by: "Theodore Ts'o"

    Robin Dong
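
The "check capacity before insertion" rule can be sketched outside the kernel as follows. This is a minimal illustration, not the ext4 code: struct demo_node, DEMO_MAX and demo_insert_index() are hypothetical names, and the fields only mirror the roles of eh_entries and eh_max.

```c
#include <stdint.h>
#include <string.h>

#define DEMO_MAX 4                 /* hypothetical capacity of one node */

/* Simplified stand-in for an extent index node (not the ext4 layout). */
struct demo_node {
    uint16_t entries;              /* entries in use, cf. eh_entries */
    uint16_t max;                  /* capacity, cf. eh_max; must be <= DEMO_MAX */
    uint32_t idx[DEMO_MAX];        /* index keys */
};

/* Insert key at position pos. Returns 0 on success, -1 if the node is
 * full or pos is out of range. */
int demo_insert_index(struct demo_node *n, unsigned pos, uint32_t key)
{
    /* Look at the capacity first, before any entry is moved. */
    if (n->entries >= n->max)
        return -1;
    if (pos > n->entries)
        return -1;

    /* Shift the tail up by one slot and store the new key. */
    memmove(&n->idx[pos + 1], &n->idx[pos],
            (n->entries - pos) * sizeof(n->idx[0]));
    n->idx[pos] = key;
    n->entries++;
    return 0;
}
```

Because the check runs before memmove(), a full node is rejected without any write past the end of the array.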
     
  • This patch avoids an extraneous lookup of the extent cache
    in ext4_ext_map_blocks() when the flag
    EXT4_GET_BLOCKS_PUNCH_OUT_EXT is set.

    The existing logic was performing the lookup but not making
    use of the result. The patch simply reverses the order of evaluation
    in the condition.

    Since ext4_ext_in_cache() does not initialize newex on misses, bypassing
    its invocation does not introduce any new issue in this regard.

    Signed-off-by: Robin Dong
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Lukas Czerner
    Reviewed-by: Eric Gouriou

    Robin Dong
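
The effect of reversing the order of evaluation can be sketched with a plain C condition: when the cheap flag test comes first, && short-circuits and the lookup is skipped whenever its result would be discarded. The names below (DEMO_PUNCH_OUT, demo_in_cache(), the lookup counter) are hypothetical and only stand in for the real flag and ext4_ext_in_cache().

```c
#include <stdbool.h>

#define DEMO_PUNCH_OUT 0x1      /* hypothetical flag bit */

static int demo_lookups;        /* how many times the "cache" was consulted */

/* Stand-in for a cache lookup that has a cost. */
static bool demo_in_cache(int block)
{
    demo_lookups++;
    return block == 7;          /* pretend only block 7 is cached */
}

/* Old order: the lookup always runs, even if the flag discards its result. */
bool demo_use_cache_old(int block, int flags)
{
    return demo_in_cache(block) && (flags & DEMO_PUNCH_OUT) == 0;
}

/* New order: the flag is tested first, so && skips the lookup when it is set. */
bool demo_use_cache_new(int block, int flags)
{
    return (flags & DEMO_PUNCH_OUT) == 0 && demo_in_cache(block);
}
```

Both functions return the same value for every input; only the number of lookups differs.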
     
  • This patch removes the extra parameter in ext4_ext_remove_space()
    which is no longer needed.

    Signed-off-by: Allison Henderson
    Signed-off-by: "Theodore Ts'o"

    Allison Henderson
     
  • This patch optimizes the punch hole operation by skipping the
    tree walking code that is used by truncate. Since punch hole
    is done through map blocks, the path to the extent is already
    known in this function, so we do not need to look it up again.

    Signed-off-by: Allison Henderson
    Signed-off-by: "Theodore Ts'o"

    Allison Henderson
     
  • If the stripe width was set to 1, then this patch will ignore
    that stripe width and ext4 will act as if the stripe width
    were 0 with respect to optimizing allocations.

    Signed-off-by: Dan Ehrenberg
    Signed-off-by: "Theodore Ts'o"

    Dan Ehrenberg
     
  • Previously, if a stripe width was provided, then it would be used
    as the preallocation granularity, with no sanity checking and no
    way to override this. Now, mb_prealloc_size defaults to the smallest
    multiple of stripe size that is greater than or equal to the old
    default mb_prealloc_size, and this can be overridden with the sysfs
    interface.

    Signed-off-by: Dan Ehrenberg
    Signed-off-by: "Theodore Ts'o"

    Dan Ehrenberg
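
The two stripe-related changes above can be sketched together as one rounding helper. This is an illustration under assumptions: demo_prealloc_size() is a hypothetical name, and the value 512 used in the usage note is only an example default, not taken from the commit.

```c
/* Return the smallest multiple of stripe that is >= default_size.
 * A stripe of 0 or 1 is treated as "no stripe" and leaves the
 * default untouched. */
unsigned long demo_prealloc_size(unsigned long default_size,
                                 unsigned long stripe)
{
    if (stripe <= 1)
        return default_size;
    /* Integer round-up: ceil(default_size / stripe) * stripe. */
    return ((default_size + stripe - 1) / stripe) * stripe;
}
```

With an example default of 512 blocks, a stripe of 96 gives 576 (6 x 96), a stripe of 128 leaves 512 unchanged, and a stripe of 1000 gives 1000.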
     

17 Jul, 2011

1 commit


12 Jul, 2011

4 commits


11 Jul, 2011

10 commits

  • If eh->eh_entries is smaller than eh->eh_max, the routine will
    go to "repeat" and then go to "has_space" directly,
    since the arguments "depth" and "eh" are not changed at all.

    Therefore, go to "has_space" directly and remove the redundant "repeat" label.

    Signed-off-by: Robin Dong

    Robin Dong
     
  • In the ext4_trim_all_free() comment, there is no longer an @e4b
    parameter; it is now @group.

    Reported-by: Andreas Dilger
    Signed-off-by: Tao Ma
    Signed-off-by: "Theodore Ts'o"

    Tao Ma
     
  • In ext4, every time FITRIM is called, we iterate over all the
    groups and trim them one by one. This wastes time if a group
    has already been trimmed and has not changed since the last
    trim.

    So this patch adds a new flag in ext4_group_info->bb_state to
    indicate that the group has been trimmed; it is cleared when
    some blocks are freed (in release_blocks_on_commit). In addition,
    trim_minlen is added to ext4_sb_info to record the last minlen
    we used to trim the volume, so that if the caller provides a smaller
    one, we go on with the trim regardless of the bb_state.

    A simple test with my intel x25m ssd:
    df -h shows:
    /dev/sdb1 40G 21G 17G 56% /mnt/ext4
    Block size: 4096

    run the FITRIM with the following parameter:
    range.start = 0;
    range.len = UINT64_MAX;
    range.minlen = 1048576;

    without the patch:
    [root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a
    real 0m5.505s
    user 0m0.000s
    sys 0m1.224s
    [root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a
    real 0m5.359s
    user 0m0.000s
    sys 0m1.178s
    [root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a
    real 0m5.228s
    user 0m0.000s
    sys 0m1.151s

    with the patch:
    [root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a
    real 0m5.625s
    user 0m0.000s
    sys 0m1.269s
    [root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a
    real 0m0.002s
    user 0m0.000s
    sys 0m0.001s
    [root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a
    real 0m0.002s
    user 0m0.000s
    sys 0m0.001s

    A big improvement for the 2nd and 3rd run.

    Even after I delete some big image files, it is still much
    faster than iterating the whole disk.

    [root@boyu-tm test]# time ./ftrim /mnt/ext4/a
    real 0m1.217s
    user 0m0.000s
    sys 0m0.196s

    Cc: Lukas Czerner
    Reviewed-by: Andreas Dilger
    Signed-off-by: Tao Ma
    Signed-off-by: "Theodore Ts'o"

    Tao Ma
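
The bookkeeping described in this commit can be sketched as follows. It is a simplified model, not the ext4 implementation: struct demo_fs, struct demo_group, demo_fitrim() and demo_blocks_freed() are hypothetical names, and the group count is fixed at 4 for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_GROUPS 4

struct demo_group {
    bool was_trimmed;            /* cf. the new flag in bb_state */
};

struct demo_fs {
    uint64_t last_minlen;        /* cf. trim_minlen in ext4_sb_info */
    struct demo_group groups[DEMO_GROUPS];
};

/* Trim every group that needs it; returns how many groups were visited. */
int demo_fitrim(struct demo_fs *fs, uint64_t minlen)
{
    /* A smaller minlen than last time may find extents that were too
     * short before, so the per-group flag cannot be trusted. */
    bool force = minlen < fs->last_minlen;
    int visited = 0;

    for (int i = 0; i < DEMO_GROUPS; i++) {
        if (fs->groups[i].was_trimmed && !force)
            continue;            /* unchanged since the last trim: skip */
        /* ... discarding the group's free extents would happen here ... */
        fs->groups[i].was_trimmed = true;
        visited++;
    }
    fs->last_minlen = minlen;
    return visited;
}

/* Freeing blocks in a group invalidates its "already trimmed" state. */
void demo_blocks_freed(struct demo_fs *fs, int group)
{
    fs->groups[group].was_trimmed = false;
}
```

In this model a second call with the same minlen visits no groups, and freeing blocks in one group makes only that group eligible again, which is the pattern behind the timings quoted in the commit.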
     
  • Add ext4_trim_extent and ext4_trim_all_free.

    Reviewed-by: Lukas Czerner
    Signed-off-by: Tao Ma
    Signed-off-by: "Theodore Ts'o"

    Tao Ma
     
  • When we trim free blocks in an ext4 group, we need to account for
    the free blocks properly and check whether enough free blocks are
    left for us to trim. The current solution only counts free extents
    that are large enough to trim, which isn't appropriate.

    Let us see a small example:
    a group has 1.5M free, made up of five extents of 300k each,
    and minblocks is 1M. With the current solution, we have to iterate
    over the whole group, since these 300k extents are never subtracted
    from 1.5M. But we should actually exit after we find the first 2
    free extents: once the first 600K is subtracted, even though it
    can't be trimmed, the remaining 3 chunks only sum to 900K.

    Reviewed-by: Andreas Dilger
    Signed-off-by: Tao Ma
    Signed-off-by: "Theodore Ts'o"

    Tao Ma
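
The early exit from the example above can be sketched as a small scan loop. The function name demo_scan_group() and its parameters are hypothetical, and lengths are plain numbers in the commit message's "k" units rather than blocks.

```c
#include <stddef.h>

/* Walk the free extents of a group and return how many of them had to
 * be examined before the scan could stop. *trimmed receives the total
 * length of the extents that were large enough to trim. */
size_t demo_scan_group(const unsigned *extents, size_t n,
                       unsigned free_total, unsigned minlen,
                       unsigned *trimmed)
{
    unsigned remaining = free_total;   /* free space not yet examined */
    size_t examined = 0;

    *trimmed = 0;
    for (size_t i = 0; i < n; i++) {
        /* What is left cannot contain an extent of minlen: stop. */
        if (remaining < minlen)
            break;
        examined++;
        if (extents[i] >= minlen)
            *trimmed += extents[i];
        /* Subtract every free extent seen, trimmed or not. */
        remaining -= extents[i];
    }
    return examined;
}
```

For five extents of 300 with a total of 1500 and a minimum of 1000, the loop examines two extents, sees that only 900 remains, and stops without trimming anything.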
     
  • In 0f0a25b, we adjust 'len' by s_first_data_block - start, but
    it can underflow when blocksize=1K, fstrim_range.len=512 and
    fstrim_range.start = 0. In this case, when we run the code:
    len -= first_data_blk - start; len underflows to -1ULL.
    In the end we are safe, because the later last_group check limits
    the trim to the whole volume, but that isn't what the user really wants.

    So this patch fixes it. It also adds a check for 'start', as in ext3, so that
    we can bail out immediately if the start is invalid.

    Cc: Lukas Czerner
    Signed-off-by: Tao Ma
    Signed-off-by: "Theodore Ts'o"

    Tao Ma
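
The wrap-around can be reproduced in user space with unsigned 64-bit arithmetic. The sketch below is not the kernel patch: demo_adjust_len_unchecked() and demo_adjust_range() are hypothetical helpers, and the guard shown is one possible way to reject a range that ends before the first data block.

```c
#include <stdint.h>

/* Unchecked adjustment: wraps around when len is smaller than the
 * distance from start to the first data block. */
uint64_t demo_adjust_len_unchecked(uint64_t start, uint64_t len,
                                   uint64_t first_data_blk)
{
    if (start < first_data_blk)
        len -= first_data_blk - start;
    return len;
}

/* Checked adjustment: returns 0 and updates *start and *len, or -1 if
 * the range ends at or before the first data block (nothing to trim). */
int demo_adjust_range(uint64_t *start, uint64_t *len,
                      uint64_t first_data_blk)
{
    if (*start < first_data_blk) {
        uint64_t skip = first_data_blk - *start;

        if (*len <= skip)
            return -1;
        *len -= skip;
        *start = first_data_blk;
    }
    return 0;
}
```

With a 1K block size, a byte length of 512 becomes 512 >> 10 = 0 blocks; with start 0 and a first data block of 1, the unchecked version computes 0 - 1, which wraps to UINT64_MAX, while the checked version reports that there is nothing to trim.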
     
  • This will help debug who is responsible for starting a jbd2 transaction.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     
  • Using function calls in TP_printk causes perf heartburn, so print the
    MAJOR/MINOR device numbers instead.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
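
Printing the numbers directly amounts to splitting the device number with shifts and masks. The sketch below uses DEMO_ macros that mirror the 20-bit minor split of the kernel's internal dev_t encoding; the macro names and demo_format_dev() are hypothetical, and the output format is only an example.

```c
#include <stdio.h>

#define DEMO_MINORBITS 20
#define DEMO_MINORMASK ((1U << DEMO_MINORBITS) - 1)

/* Split a packed device number into its major and minor parts. */
#define DEMO_MAJOR(dev) ((unsigned int)((dev) >> DEMO_MINORBITS))
#define DEMO_MINOR(dev) ((unsigned int)((dev) & DEMO_MINORMASK))
#define DEMO_MKDEV(ma, mi) (((unsigned int)(ma) << DEMO_MINORBITS) | (mi))

/* Format "dev MAJOR,MINOR" into buf; returns what snprintf() returns. */
int demo_format_dev(char *buf, size_t size, unsigned int dev)
{
    return snprintf(buf, size, "dev %u,%u",
                    DEMO_MAJOR(dev), DEMO_MINOR(dev));
}
```

Since both parts are plain integer expressions, a trace consumer can decode them without calling any helper function.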
     
  • Upon corrupted inode or disk failures, we may fail after we already
    allocate some blocks from the inode or take some blocks from the
    inode's preallocation list, but before we successfully insert the
    corresponding extent to the extent tree. In this case, we should free
    any allocated blocks and discard the inode's preallocated blocks
    because the entries in the inode's preallocation list may be in an
    inconsistent state.

    Signed-off-by: Jiaying Zhang
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@kernel.org

    Jiaying Zhang
     
  • The current implementation of ext4_free_blocks() always calls
    dquot_free_block(). This looks quite sensible in most cases: blocks
    to be freed are associated with an inode and were accounted in quota and
    i_blocks some time ago.

    However, there is a case where the blocks to free have not yet been
    accounted by the time ext4_free_blocks() is called:

    1. delalloc is on, write_begin pre-allocated some space in quota
    2. write-back happens, ext4 allocates some blocks in ext4_ext_map_blocks()
    3. then ext4_ext_map_blocks() gets an error (e.g. ENOSPC) from
    ext4_ext_insert_extent() and calls ext4_free_blocks().

    In this scenario, ext4_free_blocks() calls dquot_free_block(), which in
    turn decrements i_blocks for blocks which were not accounted yet (due
    to delalloc). After a clean umount, e2fsck reports something like:

    > Inode 21, i_blocks is 5080, should be 5128. Fix?

    because i_blocks was erroneously decremented as explained above.

    The patch fixes the problem by passing the new flag
    EXT4_FREE_BLOCKS_NO_QUOT_UPDATE to ext4_free_blocks(), to request
    that the dquot_free_block() call be skipped.

    Signed-off-by: Maxim Patlasov
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@kernel.org

    Maxim Patlasov
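
The role of the new flag can be sketched with a toy model in which the quota update is a simple subtraction. Everything here is hypothetical (struct demo_inode, DEMO_FREE_NO_QUOTA_UPDATE and its value, demo_release_blocks()); the decrement only stands in for the dquot_free_block() call.

```c
#define DEMO_FREE_NO_QUOTA_UPDATE 0x1   /* hypothetical flag value */

struct demo_inode {
    long i_blocks;      /* amount already charged to the inode */
};

/* Release count units of space. The accounting is only rolled back
 * when the caller says the space had been charged in the first place. */
void demo_release_blocks(struct demo_inode *inode, long count, int flags)
{
    /* ... returning the blocks to the allocator would happen here ... */
    if (!(flags & DEMO_FREE_NO_QUOTA_UPDATE))
        inode->i_blocks -= count;       /* stands in for dquot_free_block() */
}
```

Using the numbers from the e2fsck report as illustrative input, releasing 48 units with the flag set leaves i_blocks at 5128, while releasing them without the flag produces the erroneous 5080.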
     

30 Jun, 2011

1 commit


28 Jun, 2011

9 commits

  • Unused variables were deleted.

    Signed-off-by: Yongqiang Yang
    Signed-off-by: "Theodore Ts'o"

    Yongqiang Yang
     
  • I found that ext4_ext_find_goal() and ext4_find_near()
    share the same code for returning a coloured start block
    based on i_block_group.

    We can refactor this into a common function so that they
    don't diverge in the future.

    Thanks to adilger for suggesting the new function name.

    Signed-off-by: Eric Sandeen
    Signed-off-by: "Theodore Ts'o"

    Eric Sandeen
     
  • This patch moves functions from inode.c to indirect.c.
    The moved functions are ext4_ind_* functions and their helpers.
    Functions called from inode.c are declared extern.

    Signed-off-by: Amir Goldstein
    Signed-off-by: "Theodore Ts'o"

    Amir Goldstein
     
  • Move two functions that are needed both by inode.c and by the
    indirect functions that are about to move to indirect.c into
    truncate.h as inline functions, so that we can avoid having duplicate
    copies of the functions (which can be a maintenance problem) without
    having to expose them as global functions.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     
  • In preparation for moving the indirect functions to a separate file,
    move __ext4_check_blockref() to block_validity.c and rename it to
    ext4_check_blockref(), which is exported as a globally visible function.

    Also, rename the cpp macro ext4_check_inode_blockref() to
    ext4_ind_check_inode(), to make it clear that it is only valid for use
    with non-extent mapped inodes.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     
  • We are going to move all ext4_ind_* functions to indirect.c.
    Before we do that, let's rename 2 functions called ext4_indirect_*
    to ext4_ind_*, to keep to the naming convention.

    Signed-off-by: Amir Goldstein
    Signed-off-by: "Theodore Ts'o"

    Amir Goldstein
     
  • We are about to move all indirect inode functions to a new file.
    Before we do that, let's split ext4_ind_truncate() out of ext4_truncate()
    leaving only generic code in the latter, so we will be able to move
    ext4_ind_truncate() to the new file.

    Signed-off-by: Amir Goldstein
    Signed-off-by: "Theodore Ts'o"

    Amir Goldstein
     
  • In ext4_ext_insert_index(), when eh_entries of curp is
    bigger than eh_max, an error message is printed, but its content
    is about logical and ei_block, which is incorrect.

    Signed-off-by: Robin Dong
    Signed-off-by: "Theodore Ts'o"

    Robin Dong
     
  • In journal checkpoint, we write the buffer and wait for it to finish.
    But in cfq, the async queue has a very low priority, and in our test,
    if there are too many sync queues and every queue is filled up with
    requests, the write request will be delayed for quite a long time and
    all the tasks which are waiting for journal space will end up with errors like:

    INFO: task attr_set:3816 blocked for more than 120 seconds.
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    attr_set D ffff880028393480 0 3816 1 0x00000000
    ffff8802073fbae8 0000000000000086 ffff8802140847c8 ffff8800283934e8
    ffff8802073fb9d8 ffffffff8103e456 ffff8802140847b8 ffff8801ed728080
    ffff8801db4bc080 ffff8801ed728450 ffff880028393480 0000000000000002
    Call Trace:
    [] ? __dequeue_entity+0x33/0x38
    [] ? need_resched+0x23/0x2d
    [] ? thread_return+0xa2/0xbc
    [] ? jbd2_journal_dirty_metadata+0x116/0x126 [jbd2]
    [] ? jbd2_journal_dirty_metadata+0x116/0x126 [jbd2]
    [] __mutex_lock_common+0x14e/0x1a9
    [] ? brelse+0x13/0x15 [ext4]
    [] __mutex_lock_slowpath+0x19/0x1b
    [] mutex_lock+0x1b/0x32
    [] __jbd2_journal_insert_checkpoint+0xe3/0x20c [jbd2]
    [] start_this_handle+0x438/0x527 [jbd2]
    [] ? autoremove_wake_function+0x0/0x3e
    [] jbd2_journal_start+0xa1/0xcc [jbd2]
    [] ext4_journal_start_sb+0x57/0x81 [ext4]
    [] ext4_xattr_set+0x6c/0xe3 [ext4]
    [] ext4_xattr_user_set+0x42/0x4b [ext4]
    [] generic_setxattr+0x6b/0x76
    [] __vfs_setxattr_noperm+0x47/0xc0
    [] vfs_setxattr+0x7f/0x9a
    [] setxattr+0xb5/0xe8
    [] ? do_filp_open+0x571/0xa6e
    [] sys_fsetxattr+0x6b/0x91
    [] system_call_fastpath+0x16/0x1b

    So this patch uses WRITE_SYNC in __flush_batch so that the request will
    be moved into the sync queue and handled by cfq in a timely manner. We also
    use the new plug, so that all the WRITE_SYNC requests can be submitted as a
    whole when we unplug it.

    Signed-off-by: Tao Ma
    Signed-off-by: "Theodore Ts'o"
    Cc: Jan Kara
    Reported-by: Robin Dong

    Tao Ma
     

22 Jun, 2011

1 commit


21 Jun, 2011

6 commits

  • Linus Torvalds
     
  • Commit 13e12d14e2dc ("vfs: reorganize 'struct inode' layout a bit")
    moved things around a bit and changed i_state to be unsigned int instead of
    unsigned long. That was to help structure layout for the 64-bit case,
    and shrink 'struct inode' a bit (admittedly that only happened when
    spinlock debugging was on and i_flags didn't pack with i_lock).

    However, Meelis Roos reports that this results in unaligned exceptions
    on sparc, and it turns out that the bit-locking primitives that we use
    for the I_NEW bit want to use the bitops. Which want 'unsigned long',
    not 'unsigned int'.

    We really should fix the bit locking code to not have that kind of
    requirement, but that's a much bigger change. So for now, revert that
    field back to 'unsigned long' (but keep the other re-ordering changes
    from the commit that caused this).

    Andi points out that we have played games with this in 'struct page', so
    it's solvable with other hacks too, but since right now the struct inode
    size advantage only happens with some rare config options, it's not
    worth fighting.

    It _would_ be worth fixing the bitlocking code, though. Especially
    since there is no type safety in the bitlocking code (this never caused
    any warnings, and worked fine on x86-64, because the bitlocks take a
    'void *' and x86-64 doesn't care that deeply about alignment). So it's
    currently a very easy problem to trigger by mistake and never notice.

    Reported-by: Meelis Roos
    Cc: Andi Kleen
    Cc: David Miller
    Signed-off-by: Linus Torvalds

    Linus Torvalds
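
The alignment issue can be sketched with two toy structures. These are hypothetical layouts, not 'struct inode': they only show that an 'unsigned int' field placed after another 'unsigned int' need not sit on an 'unsigned long' boundary, whereas an 'unsigned long' field always does.

```c
#include <stdalign.h>
#include <stddef.h>

/* Toy layouts: i_state as 'unsigned int' versus 'unsigned long'. */
struct demo_inode_int  { unsigned int i_flags; unsigned int  i_state; };
struct demo_inode_long { unsigned int i_flags; unsigned long i_state; };

/* Returns 1 if i_state could safely be accessed through an
 * 'unsigned long *', as word-based bit operations do. */
int demo_int_state_is_long_aligned(void)
{
    return offsetof(struct demo_inode_int, i_state)
           % alignof(unsigned long) == 0;
}

int demo_long_state_is_long_aligned(void)
{
    return offsetof(struct demo_inode_long, i_state)
           % alignof(unsigned long) == 0;
}
```

On a typical LP64 target the first function returns 0 (the field sits at offset 4 while 'unsigned long' wants 8-byte alignment) and the second returns 1; on targets where 'unsigned long' is 4 bytes both return 1. Strict-alignment architectures trap on the misaligned case, while x86-64 tolerates it, which is why the problem went unnoticed there.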
     
  • * 'drm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6:
    drm/radeon/kms/r6xx+: voltage fixes
    drm/nouveau: drop leftover debugging
    drm/radeon: avoid warnings from r600/eg irq handlers on powered off card.
    drm/radeon/kms: add missing param for dce3.2 DP transmitter setup
    drm/radeon/kms/atom: fix duallink on some early DCE3.2 cards
    drm/nouveau: fix assumption that semaphore dmaobj is valid in x-chan sync
    drm/nv50/disp: fix gamma with page flipping overlay turned on
    drm/nouveau/pm: Prevent overflow in nouveau_perf_init()
    drm/nouveau: fix big-endian switch

    Linus Torvalds
     
  • * 'msm-fix' of git://codeaurora.org/quic/kernel/davidb/linux-msm:
    msm: timer: Fix DGT rate on 8960 and 8660
    msm: timer: compensate for timer shift in msm_read_timer_count
    msm: timer: Fix SMP build error

    Linus Torvalds
     
  • * 'for-2.6.40' of git://linux-nfs.org/~bfields/linux:
    nfsd4: fix break_lease flags on nfsd open
    nfsd: link returns nfserr_delay when breaking lease
    nfsd: v4 support requires CRYPTO
    nfsd: fix dependency of nfsd on auth_rpcgss

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (40 commits)
    pxa168_eth: fix race in transmit path.
    ipv4, ping: Remove duplicate icmp.h include
    netxen: fix race in skb->len access
    sgi-xp: fix a use after free
    hp100: fix an skb->len race
    netpoll: copy dev name of slaves to struct netpoll
    ipv4: fix multicast losses
    r8169: fix static initializers.
    inet_diag: fix inet_diag_bc_audit()
    gigaset: call module_put before restart of if_open()
    farsync: add module_put to error path in fst_open()
    net: rfs: enable RFS before first data packet is received
    fs_enet: fix freescale FCC ethernet dp buffer alignment
    netdev: bfin_mac: fix memory leak when freeing dma descriptors
    vlan: don't call ndo_vlan_rx_register on hardware that doesn't have vlan support
    caif: Bugfix - XOFF removed channel from caif-mux
    tun: teach the tun/tap driver to support netpoll
    dp83640: drop PHY status frames in the driver.
    dp83640: fix phy status frame event parsing
    phylib: Allow BCM63XX PHY to be selected only on BCM63XX.
    ...

    Linus Torvalds