19 Oct, 2009

1 commit

  • In order to have better cache layouts of struct sock (separate zones
    for rx/tx paths), we need this preliminary patch.

    The goal is to transfer fields used at lookup time into the first
    read-mostly cache line (inside struct sock_common) and move sk_refcnt
    to a separate cache line (only written by the rx path).

    This patch adds inet_ prefix to daddr, rcv_saddr, dport, num, saddr,
    sport and id fields. This allows a future patch to define these
    fields as macros, like sk_refcnt, without name clashes.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
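
The motivation for the `inet_` prefix can be shown with a small userspace sketch. The `*_sketch` struct names and the reduced layout are assumptions for illustration, not the kernel's definitions; only the `sk_refcnt`-style macro idiom and the prefixed field name come from the commit message.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct sock_common_sketch {
    uint32_t skc_daddr;   /* a lookup field kept in the read-mostly part */
    int      skc_refcnt;
};

struct sock_sketch {
    struct sock_common_sketch __sk_common;
};

/* The existing idiom the commit refers to: sk_refcnt is a macro. */
#define sk_refcnt __sk_common.skc_refcnt

struct inet_sock_sketch {
    struct sock_sketch sk;
};

/*
 * What a later patch could do once the field has a unique prefix.
 * A macro named plain "daddr" would instead rewrite every unrelated
 * identifier called daddr (locals, members of other structs) in any
 * file that includes the header.
 */
#define inet_daddr sk.__sk_common.skc_daddr

static void inet_set_daddr(struct inet_sock_sketch *inet, uint32_t addr)
{
    inet->inet_daddr = addr;  /* expands to inet->sk.__sk_common.skc_daddr */
}

static uint32_t inet_get_daddr(const struct inet_sock_sketch *inet)
{
    return inet->inet_daddr;
}
```

Callers keep writing `inet->inet_daddr` whether it is a real member or a macro, which is why the rename can land first as a purely mechanical patch.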
     

05 Oct, 2009

3 commits

  • Signed-off-by: Alexey Dobriyan
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block: (41 commits)
    Revert "Seperate read and write statistics of in_flight requests"
    cfq-iosched: don't delay async queue if it hasn't dispatched at all
    block: Topology ioctls
    cfq-iosched: use assigned slice sync value, not default
    cfq-iosched: rename 'desktop' sysfs entry to 'low_latency'
    cfq-iosched: implement slower async initiate and queue ramp up
    cfq-iosched: delay async IO dispatch, if sync IO was just done
    cfq-iosched: add a knob for desktop interactiveness
    Add a tracepoint for block request remapping
    block: allow large discard requests
    block: use normal I/O path for discard requests
    swapfile: avoid NULL pointer dereference in swapon when s_bdev is NULL
    fs/bio.c: move EXPORT* macros to line after function
    Add missing blk_trace_remove_sysfs to be in pair with blk_trace_init_sysfs
    cciss: fix build when !PROC_FS
    block: Do not clamp max_hw_sectors for stacking devices
    block: Set max_sectors correctly for stacking devices
    cciss: cciss_host_attr_groups should be const
    cciss: Dynamically allocate the drive_info_struct for each logical drive.
    cciss: Add usage_count attribute to each logical drive in /sys
    ...

    Linus Torvalds
     
  • This reverts commit a9327cac440be4d8333bba975cbbf76045096275.

    Corrado Zoccolo reports:

    "with 2.6.32-rc1 I started getting the following strange output from
    "iostat -kx 2":
    Linux 2.6.31bisect (et2) 04/10/2009 _i686_ (2 CPU)

    avg-cpu: %user %nice %system %iowait %steal %idle
    10,70 0,00 3,16 15,75 0,00 70,38

    Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
    avgrq-sz avgqu-sz await svctm %util
    sda 18,22 0,00 0,67 0,01 14,77 0,02
    43,94 0,01 10,53 39043915,03 2629219,87
    sdb 60,89 9,68 50,79 3,04 1724,43 50,52
    65,95 0,70 13,06 488437,47 2629219,87

    avg-cpu: %user %nice %system %iowait %steal %idle
    2,72 0,00 0,74 0,00 0,00 96,53

    Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
    avgrq-sz avgqu-sz await svctm %util
    sda 0,00 0,00 0,00 0,00 0,00 0,00
    0,00 0,00 0,00 0,00 100,00
    sdb 0,00 0,00 0,00 0,00 0,00 0,00
    0,00 0,00 0,00 0,00 100,00

    avg-cpu: %user %nice %system %iowait %steal %idle
    6,68 0,00 0,99 0,00 0,00 92,33

    Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
    avgrq-sz avgqu-sz await svctm %util
    sda 0,00 0,00 0,00 0,00 0,00 0,00
    0,00 0,00 0,00 0,00 100,00
    sdb 0,00 0,00 0,00 0,00 0,00 0,00
    0,00 0,00 0,00 0,00 100,00

    avg-cpu: %user %nice %system %iowait %steal %idle
    4,40 0,00 0,73 1,47 0,00 93,40

    Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
    avgrq-sz avgqu-sz await svctm %util
    sda 0,00 0,00 0,00 0,00 0,00 0,00
    0,00 0,00 0,00 0,00 100,00
    sdb 0,00 4,00 0,00 3,00 0,00 28,00
    18,67 0,06 19,50 333,33 100,00

    Global values for service time and utilization are garbage. For
    interval values, utilization is always 100%, and service time is
    higher than normal.

    I bisected it down to:
    [a9327cac440be4d8333bba975cbbf76045096275] Seperate read and write
    statistics of in_flight requests
    and verified that reverting just that commit indeed solves the issue
    on 2.6.32-rc1."

    So until this is debugged, revert the bad commit.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

04 Oct, 2009

1 commit


03 Oct, 2009

2 commits

  • On a 256M filesystem, doing this in a loop:

    xfs_io -F -f -d -c 'pwrite 0 64m' test
    rm -f test

    eventually leads to ENOSPC. (the xfs_io command does a
    64m direct IO write to the file "test")

    As with other block allocation callers, it looks like we need to
    potentially retry the allocations on the initial ENOSPC.

    Signed-off-by: Eric Sandeen
    Signed-off-by: "Theodore Ts'o"

    Eric Sandeen
     
  • This fixes the following warning:

    fs/ext4/inode.c: In function 'ext4_dirty_inode':
    fs/ext4/inode.c:5615: warning: unused variable 'current_handle'

    We remove the jbd_debug() statement which does use current_handle, as
    it's not terribly important in the grand scheme of things.

    Thanks to Stephen Rothwell for pointing this out.

    Signed-off-by: Curt Wohlgemuth
    Signed-off-by: "Theodore Ts'o"

    Curt Wohlgemuth
     

02 Oct, 2009

9 commits


01 Oct, 2009

6 commits

  • wait_on_page_writeback_range/btrfs_wait_on_page_writeback_range takes
    a pagecache offset, not a byte offset into the file. Shift the arguments
    around to wait for the correct range

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Chris Mason

    Christoph Hellwig
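
The difference between a byte offset and a pagecache offset is a single shift. The sketch below assumes 4 KiB pages; `SKETCH_PAGE_SHIFT`, `struct page_range` and `bytes_to_page_range()` are illustrative names, not kernel symbols.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12   /* assumption: 4 KiB pages */

struct page_range {
    uint64_t first;   /* index of the first page touched */
    uint64_t last;    /* index of the last page touched */
};

/* Convert an inclusive byte range [start, end] into inclusive page indexes. */
static struct page_range bytes_to_page_range(uint64_t start, uint64_t end)
{
    struct page_range r = {
        start >> SKETCH_PAGE_SHIFT,
        end >> SKETCH_PAGE_SHIFT,
    };
    return r;
}
```

Passing the byte values straight through would make the wait cover pages far beyond the intended range, or miss it entirely, which is the class of bug the commit describes.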
     
  • Kconfig & super.c promised it'd be gone by 2.6.31, so it's
    about time to drop it.

    Signed-off-by: Eric Sandeen
    Signed-off-by: "Theodore Ts'o"

    Eric Sandeen
     
  • In ext4_num_dirty_pages() we were calling page_buffers() before
    checking to see if the page actually had buffers attached to it; this
    would cause a BUG check crash in the inline function page_buffers().

    Thanks to Markus Trippelsdorf for reporting this bug.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
    nilfs2: fix missing initialization of i_dir_start_lookup member
    nilfs2: fix missing zero-fill initialization of btree node cache

    Linus Torvalds
     
  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: Fix time encoding with extra epoch bits
    ext4: Add a stub for mpage_da_data in the trace header
    jbd2: Use tracepoints for history file
    ext4: Use tracepoints for mb_history trace file
    ext4, jbd2: Drop unneeded printks at mount and unmount time
    ext4: Handle nested ext4_journal_start/stop calls without a journal
    ext4: Make sure ext4_dirty_inode() updates the inode in no journal mode
    ext4: Avoid updating the inode table bh twice in no journal mode
    ext4: EXT4_IOC_MOVE_EXT: Check for different original and donor inodes first
    ext4: async direct IO for holes and fallocate support
    ext4: Use end_io callback to avoid direct I/O fallback to buffered I/O
    ext4: Split uninitialized extents for direct I/O
    ext4: release reserved quota when block reservation for delalloc retry
    ext4: Adjust ext4_da_writepages() to write out larger contiguous chunks
    ext4: Fix hueristic which avoids group preallocation for closed files
    ext4: Use ext4_msg() for ext4_da_writepage() errors
    ext4: Update documentation about quota mount options

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6:
    fat: Check s_dirt in fat_sync_fs()
    vfat: change the default from shortname=lower to shortname=mixed
    fat/nls: Fix handling of utf8 invalid char

    Linus Torvalds
     

30 Sep, 2009

10 commits

  • "Looking at ext4.h, I think the setting of extra time fields forgets to
    mask the epoch bits so the epoch part overwrites nsec part. The second
    change is only for coherency (2 -> EXT4_EPOCH_BITS)."

    Thanks to Damien Guibouret for pointing out this problem.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
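
The effect of the missing mask can be reproduced with plain integer arithmetic. The sketch assumes the layout implied by the message: the low 2 bits of the extra field hold the epoch and the remaining bits hold nanoseconds. The `SKETCH_*` macros and function names are illustrative, not the ext4 definitions.

```c
#include <stdint.h>

#define SKETCH_EPOCH_BITS 2
#define SKETCH_EPOCH_MASK ((1u << SKETCH_EPOCH_BITS) - 1)

/* Fixed variant: only the low epoch bits of the upper 32 bits are kept. */
static uint32_t encode_extra_masked(int64_t sec, uint32_t nsec)
{
    uint32_t epoch = (uint32_t)((uint64_t)sec >> 32) & SKETCH_EPOCH_MASK;

    return epoch | (nsec << SKETCH_EPOCH_BITS);
}

/* Buggy variant: the whole upper half of sec is ORed over the nsec bits. */
static uint32_t encode_extra_unmasked(int64_t sec, uint32_t nsec)
{
    uint32_t epoch = (uint32_t)((uint64_t)sec >> 32);

    return epoch | (nsec << SKETCH_EPOCH_BITS);
}

/* Recover the nanosecond part from the packed field. */
static uint32_t decode_nsec(uint32_t extra)
{
    return extra >> SKETCH_EPOCH_BITS;
}
```

With a negative `sec` the upper 32 bits are all ones, so the unmasked variant destroys the nanosecond value, while the masked variant round-trips it.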
     
  • The /proc/fs/jbd2//history was maintained manually; by using
    tracepoints, we can get all of the existing functionality of the /proc
    file plus extra capabilities thanks to the ftrace infrastructure. We
    save memory as a bonus.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     
  • The /proc/fs/ext4//mb_history was maintained manually, and had a
    number of problems: it required a largish amount of memory to be
    allocated for each ext4 filesystem, and the s_mb_history_lock
    introduced a CPU contention problem.

    By ripping out the mb_history code and replacing it with ftrace
    tracepoints, we get more functionality: timestamps, event
    filtering, the ability to correlate mballoc history with other ext4
    tracepoints, etc.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     
  • If an ioctl-initiated transaction is open, we can't force a commit during
    the free space checks in order to free up pinned extents or else we
    deadlock. Just return ENOSPC instead.

    A more satisfying solution that reserves space for the entire user
    transaction up front is forthcoming...

    Signed-off-by: Sage Weil
    Signed-off-by: Chris Mason

    Sage Weil
     
  • Fix leak of vfsmount write reference and open_ioctl_trans reference on
    ENOMEM. Clean up the error paths while we're at it.

    Signed-off-by: Sage Weil
    Signed-off-by: Chris Mason

    Sage Weil
     
  • There are a number of kernel printk's which are printed when an ext4
    filesystem is mounted and unmounted. Disable them to economize space
    in the system logs. In addition, disabling the mballoc stats by
    default saves a number of unneeded atomic operations for every block
    allocation or deallocation.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     
  • We've already defined CONFIG_BTRFS_POSIX_ACL in Kconfig, but we're
    currently not using it and are testing CONFIG_FS_POSIX_ACL instead.
    CONFIG_FS_POSIX_ACL states "Never use this symbol for ifdefs".

    Signed-off-by: Chris Ball
    Signed-off-by: Chris Mason

    Chris Ball
     
  • Error handling code following a kzalloc should free the allocated data.

    The semantic match that finds the problem is as follows:
    (http://www.emn.fr/x-info/coccinelle/)

    // <smpl>
    @r exists@
    local idexpression x;
    statement S;
    expression E;
    identifier f,f1,l;
    position p1,p2;
    expression *ptr != NULL;
    @@

    x@p1 = \(kmalloc\|kzalloc\|kcalloc\)(...);
    ...
    if (x == NULL) S
    <... when != x
         when != if (...) { <+...x...+> }
    (
    x->f1 = E
    |
     (x->f1 == NULL || ...)
    |
     f(...,x->f1,...)
    )
    ...>
    (
     return \(0\|<+...x...+>\|ptr\);
    |
     return@p2 ...;
    )

    @script:python@
    p1 << r.p1;
    p2 << r.p2;
    @@

    print "* file: %s kmalloc %s return %s" % (p1[0].file,p1[0].line,p2[0].line)
    // </smpl>

    Signed-off-by: Julia Lawall
    Signed-off-by: Chris Mason

    Julia Lawall
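
The pattern the semantic match looks for, and its usual fix, can be sketched in userspace C. Here `calloc()`/`free()` stand in for `kzalloc()`/`kfree()`, and `device_sketch`, the allocation counter and the `fail_late` switch are hypothetical test scaffolding.

```c
#include <stdlib.h>

struct device_sketch {
    int *table;
};

static int live_allocs;   /* number of allocations not yet freed */

static void *counted_zalloc(size_t n)
{
    void *p = calloc(1, n);

    if (p)
        live_allocs++;
    return p;
}

static void counted_free(void *p)
{
    if (p) {
        live_allocs--;
        free(p);
    }
}

/* fail_late simulates an error that occurs after the first allocation. */
static struct device_sketch *device_create(int fail_late)
{
    struct device_sketch *dev = counted_zalloc(sizeof(*dev));

    if (!dev)
        return NULL;

    dev->table = counted_zalloc(16 * sizeof(*dev->table));
    if (!dev->table || fail_late)
        goto err_free;

    return dev;

err_free:
    /* The step the buggy code skipped: release what was allocated. */
    counted_free(dev->table);
    counted_free(dev);
    return NULL;
}

static void device_destroy(struct device_sketch *dev)
{
    if (!dev)
        return;
    counted_free(dev->table);
    counted_free(dev);
}
```

Routing every failure after the first allocation through one label keeps the cleanup in a single place, so a newly added error return is less likely to leak.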
     
  • We currently set sb->s_flags |= MS_POSIXACL unconditionally, which is
    incorrect -- it tells the VFS that it shouldn't set umask because we
    will, yet we don't set it ourselves if we aren't using POSIX ACLs, so
    the umask ends up ignored.

    Signed-off-by: Chris Ball
    Signed-off-by: Chris Mason

    Chris Ball
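
The umask problem can be modelled as a small decision function. This is a simplified model that ignores default ACLs, not the VFS code; `create_mode()` and its parameters are hypothetical names.

```c
#include <stdbool.h>

typedef unsigned int mode_sketch_t;

/*
 * When the superblock advertises POSIX ACL support, the VFS leaves umask
 * handling to the filesystem; otherwise the VFS strips the umask bits.
 */
static mode_sketch_t create_mode(mode_sketch_t requested,
                                 mode_sketch_t umask_bits,
                                 bool sb_has_posixacl_flag,
                                 bool fs_acl_code_built)
{
    mode_sketch_t mode = requested;

    if (!sb_has_posixacl_flag)
        mode &= ~umask_bits;   /* the VFS applies the umask */
    else if (fs_acl_code_built)
        mode &= ~umask_bits;   /* the filesystem's ACL code applies it */
    /* Flag set but no ACL code built: nobody applies the umask (the bug). */

    return mode;
}
```

With the flag set unconditionally and ACL support compiled out, a file requested as 0666 under umask 022 keeps mode 0666, which matches the "umask ends up ignored" symptom.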
     
  • This patch fixes a problem where ext4_dirty_inode() was not calling
    ext4_mark_inode_dirty() if the current_handle is not valid, which is
    the case in no journal mode.

    It also removes a test for non-matching transaction which can never
    happen.

    Signed-off-by: Curt Wohlgemuth
    Signed-off-by: "Theodore Ts'o"

    Curt Wohlgemuth
     

29 Sep, 2009

8 commits

  • This patch fixes a problem with handling nested calls to
    ext4_journal_start/ext4_journal_stop, when there is no journal present.

    Signed-off-by: Curt Wohlgemuth
    Signed-off-by: "Theodore Ts'o"

    Curt Wohlgemuth
     
  • This is a cleanup of commit 91ac6f4. Since ext4_mark_inode_dirty()
    has already called ext4_mark_iloc_dirty(), which in turn calls
    ext4_do_update_inode(), it's not necessary to have ext4_write_inode()
    call ext4_do_update_inode() in no journal mode. Indeed, it would be
    duplicated work.

    Reviewed-by: "Aneesh Kumar K.V"
    Signed-off-by: Frank Mayhar
    Signed-off-by: "Theodore Ts'o"

    Frank Mayhar
     
  • The i_dir_start_lookup field in nilfs_inode_info objects should be
    cleared when the objects are allocated, but the initialization was
    missing when reading from disk. This adds the initialization.

    Since the variable just gives a start page on directory lookups, the
    bug was nonfatal until now.

    Signed-off-by: Ryusuke Konishi

    Ryusuke Konishi
     
  • This will fix file system corruption which infrequently happens after
    mount. The problem was reported by users with the title "[NILFS
    users] Fail to mount NILFS." (Message-ID:
    ), and so forth. I've also
    experienced the corruption multiple times on kernel 2.6.30 and 2.6.31.

    The problem turned out to be caused by a mismatch between
    mapping->nrpages of a btree node cache and the actual number of pages
    attached to the cache; if mapping->nrpages becomes zero while the cache
    still has pages, truncate_inode_pages() returns without doing anything.
    Usually this is harmless apart from a possible page leak, but on rare
    occasions garbage collection sees a stale page left in the btree node
    cache of the DAT (i.e. the disk address translation file of nilfs), and
    this induces the corruption.

    I identified a missing initialization in btree node caches as the
    root cause. This corrects the bug.

    I've tested this for kernel 2.6.30 and 2.6.31.

    Reported-by: Yuri Chislov
    Signed-off-by: Ryusuke Konishi
    Cc: stable

    Ryusuke Konishi
     
  • At the start of a transaction we do a btrfs_reserve_metadata_space() and
    specify how many items we plan on modifying. Then once we've done our
    modifications and such, just call btrfs_unreserve_metadata_space() for
    the same number of items we reserved.

    For keeping track of metadata needed for data I've had to add an extent_io op
    for when we merge extents. This lets us track space properly when we are doing
    sequential writes, so we don't end up reserving way more metadata space than
    what we need.

    The only place where the metadata space accounting is not done is in the
    relocation code. This is because Yan is going to be reworking that code in the
    near future, so running btrfs-vol -b could still possibly result in an
    ENOSPC-related panic. This patch also turns off the metadata_ratio stuff in
    order to allow users to more efficiently use their disk space.

    This patch makes it so we track how much metadata we need for an inode's
    delayed allocation extents by tracking how many extents are currently
    waiting for allocation. It introduces two new callbacks for the
    extent_io trees, merge_extent_hook and split_extent_hook. These help
    us keep track of when we merge delalloc extents together and split them
    up. Reservations are handled before any actual dirtying occurs,
    and then we unreserve after we dirty.

    btrfs_unreserve_metadata_for_delalloc() will make the appropriate
    unreservations as needed based on the number of reservations we
    currently have and the number of extents we currently have. Doing the
    reservation outside of doing any of the actual dirtying lets us do
    things like filemap_flush() the inode to try and force delalloc to
    happen, or as a last resort actually start allocation on all delalloc
    inodes in the fs. This has survived dbench, fs_mark and an fsx torture
    test.

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • Move the check to make sure the original and donor inodes are
    different earlier, to avoid a potential deadlock by trying to lock the
    same inode twice.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
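
Why the ordering matters can be seen in a small model with non-recursive locks. The struct, the lock helpers and the `-EINVAL` return value are assumptions for the sketch, not the ext4 implementation.

```c
#include <errno.h>
#include <stdbool.h>

struct inode_sketch {
    unsigned long ino;
    bool locked;
};

/* Non-recursive lock model: re-taking a held lock is reported as deadlock. */
static int lock_inode(struct inode_sketch *inode)
{
    if (inode->locked)
        return -EDEADLK;
    inode->locked = true;
    return 0;
}

static void unlock_inode(struct inode_sketch *inode)
{
    inode->locked = false;
}

static int move_extents_sketch(struct inode_sketch *orig,
                               struct inode_sketch *donor)
{
    int ret;

    /* Reject identical inodes before any lock is taken. */
    if (orig->ino == donor->ino)
        return -EINVAL;

    ret = lock_inode(orig);
    if (ret)
        return ret;

    ret = lock_inode(donor);
    if (ret) {
        unlock_inode(orig);
        return ret;
    }

    /* ... the extent exchange would happen here ... */

    unlock_inode(donor);
    unlock_inode(orig);
    return 0;
}
```

If the identity check came after the first `lock_inode()` call, passing the same inode twice would reach the second lock attempt while the first lock is still held.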
     
  • For async direct IO that covers holes or fallocated regions, the end_io
    callback function now queues the conversion work on a workqueue but
    doesn't flush the work right away, as that might take too long.

    But when fsync is called after all the data I/O has completed, the user
    expects the metadata to be updated as well before fsync returns.

    Thus we need to flush the conversion work when fsync() is called.
    This patch keeps track of a list of completed async direct IOs that
    have work queued on the workqueue. When fsync() is called, it will go
    through the list and do the conversion.

    Signed-off-by: Mingming Cao

    Mingming Cao
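
The bookkeeping described in this commit can be modelled as a per-inode list that the end_io path appends to and fsync drains. The fixed-size array, the struct names and both functions are hypothetical; the kernel uses its own list and workqueue primitives.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 8   /* assumed capacity for the sketch */

struct extent_sketch {
    bool initialized;   /* false while the extent is still "uninitialized" */
};

struct inode_dio_sketch {
    struct extent_sketch *pending[MAX_PENDING];
    size_t npending;
};

/* end_io path: data I/O is complete, conversion is deferred, not done. */
static int dio_end_io(struct inode_dio_sketch *inode, struct extent_sketch *ext)
{
    if (inode->npending == MAX_PENDING)
        return -1;
    inode->pending[inode->npending++] = ext;
    return 0;
}

/* fsync path: convert everything still pending before returning. */
static size_t dio_fsync(struct inode_dio_sketch *inode)
{
    size_t converted = 0;

    for (size_t i = 0; i < inode->npending; i++) {
        inode->pending[i]->initialized = true;
        converted++;
    }
    inode->npending = 0;
    return converted;
}
```

After `dio_fsync()` returns, no extent written by a completed direct I/O is left in the unconverted state, which is the guarantee the commit says fsync callers expect.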
     
  • Currently the DIO VFS code passes create = 0 when writing to the
    middle of a file. It does this to avoid block allocation for holes, so
    as not to expose stale data when there is a parallel buffered read
    (which does not hold the i_mutex lock). Direct I/O writes into holes
    fall back to buffered IO for this reason.

    Since preallocated extents are treated as holes when doing a
    get_block() lookup (the buffer is not mapped), direct IO over
    fallocated space also falls back to buffered IO. Thus ext4 actually
    silently falls back to buffered IO in the above two cases, which is
    undesirable.

    To fix this, this patch creates uninitialized extents when a direct I/O
    write goes into holes in sparse files, and registers an end_io callback
    which converts the uninitialized extent to an initialized extent after
    the I/O is completed.

    Signed-off-by: Mingming Cao
    Signed-off-by: "Theodore Ts'o"

    Mingming Cao