01 Jan, 2010

2 commits

  • In the past, ext4_calc_metadata_amount(), and its sub-functions
    ext4_ext_calc_metadata_amount() and ext4_indirect_calc_metadata_amount()
    badly over-estimated the number of metadata blocks that might be
    required for delayed allocation blocks. This didn't matter as much
    when functions which managed the reserved metadata blocks were more
    aggressive about dropping reserved metadata blocks as delayed
    allocation blocks were written, but unfortunately they were too
    aggressive. This was fixed in commit 0637c6f, but as a result the
    over-estimation by ext4_calc_metadata_amount() would lead to reserving
    2-3 times the number of pending delayed allocation blocks as
    potentially required metadata blocks. So if there is 1 megabyte of
    blocks which have not yet been allocated, up to 3 megabytes of
    space would get reserved out of the user's quota and from the file
    system free space pool until all of the inode's data blocks have been
    allocated.

    This commit addresses this problem by much more accurately estimating
    the number of metadata blocks that will be required. It will still
    somewhat over-estimate the number of blocks needed, since it must make
    a worst case estimate not knowing which physical blocks will be
    needed, but it is much more accurate than before.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
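
To put the figures in the message above in concrete terms, here is a minimal C sketch of the arithmetic. The helper names, the 4 KiB block size used in the usage note, and the idea of a single fixed ratio are assumptions made for illustration only; this is not how the kernel derives its reservation.

```c
#include <assert.h>

/* Illustrative only: metadata blocks reserved when the estimate comes out
 * at `ratio` times the number of pending delayed allocation blocks. */
unsigned long worst_case_reserved_blocks(unsigned long pending_blocks,
                                         unsigned long ratio)
{
    return pending_blocks * ratio;
}

/* Convert a block count to bytes for a given block size. */
unsigned long long blocks_to_bytes(unsigned long blocks,
                                   unsigned long block_size)
{
    return (unsigned long long)blocks * block_size;
}
```

With an assumed 4 KiB block size, 1 megabyte of pending data is 256 blocks; at a ratio of 3 the sketch charges 768 metadata blocks, which is the 3 megabytes mentioned in the message.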
     
  • Commit 0637c6f had a typo which caused the reserved metadata blocks to
    not be released correctly. Fix this.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

31 Dec, 2009

1 commit

  • As reported in Kernel Bugzilla #14936, commit d21cd8f triggered a BUG
    in the function ext4_da_update_reserve_space() found in
    fs/ext4/inode.c. The root cause of this BUG() was the fact
    that ext4_calc_metadata_amount() can severely over-estimate how many
    metadata blocks will be needed, especially when using direct
    block-mapped files.

    In addition, it can also badly *under* estimate how much space is
    needed, since ext4_calc_metadata_amount() assumes that the blocks are
    contiguous, and this is not always true. If the application is
    writing blocks to a sparse file, the number of metadata blocks
    necessary can be severely underestimated by the functions
    ext4_da_reserve_space(), ext4_da_update_reserve_space() and
    ext4_da_release_space(). This was the cause of the dq_claim_space
    reports found on kerneloops.org.

    Unfortunately, doing this right means that we need to massively
    over-estimate the amount of free space needed. So in some cases we
    may need to force the inode to be written to disk asynchronously in
    order to avoid spurious quota failures.

    http://bugzilla.kernel.org/show_bug.cgi?id=14936

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
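
The point about contiguity in the message above can be illustrated with the classic indirect block layout: 12 direct pointers in the inode, followed by single, double and triple indirect trees. The following is a minimal user-space sketch, not the kernel code; the function name and signature are hypothetical. It returns how many indirect blocks lie on the path to one logical block. Contiguous blocks share most of that path, whereas widely scattered blocks in a sparse file may each need a path of their own, which is why assuming contiguity can under-estimate.

```c
#include <assert.h>

#define NDIR_BLOCKS 12  /* direct block pointers held in the inode */

/*
 * Number of indirect blocks on the path from the inode to logical block
 * `lblock` in an indirect-mapped file: 0 for a direct block, up to 3 for a
 * block reached through the triple-indirect pointer. Returns -1 if the
 * block lies beyond what the mapping can address.
 */
int indirect_path_depth(unsigned long long lblock,
                        unsigned long long addrs_per_block)
{
    unsigned long long span = addrs_per_block;
    int depth;

    if (lblock < NDIR_BLOCKS)
        return 0;
    lblock -= NDIR_BLOCKS;

    for (depth = 1; depth <= 3; depth++) {
        if (lblock < span)
            return depth;
        lblock -= span;            /* skip the range covered at this depth */
        span *= addrs_per_block;   /* the next level covers this many blocks */
    }
    return -1;
}
```

With 4 KiB blocks and 4-byte block addresses, `addrs_per_block` is 1024, so logical block 12 needs one indirect block and logical block 1036 already needs two.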
     

30 Dec, 2009

1 commit


26 Dec, 2009

1 commit

  • When ext4_da_writepages increases the nr_to_write in writeback_control
    then it must always re-base the return value. Originally there was a
    (misguided) attempt to prevent wbc.nr_to_write from going negative. In
    fact, it's necessary to allow nr_to_write to be negative so that
    wb_writeback() can correctly calculate how many pages were actually
    written.

    Signed-off-by: Richard Kennedy
    Signed-off-by: "Theodore Ts'o"

    Richard Kennedy
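
The re-basing described in the message above can be modelled in a few lines of user-space C. This is only a sketch under stated assumptions: `struct wbc_sketch`, both function names, and the `desired` and `pages_available` parameters are hypothetical stand-ins, not the kernel's `writeback_control` or `ext4_da_writepages`. It shows why the caller's subtraction only comes out right when the bump is always subtracted again, even if that leaves the counter negative.

```c
#include <assert.h>

/* Hypothetical stand-in for the one field of writeback_control used here. */
struct wbc_sketch {
    long nr_to_write;
};

/* Model of a writepages routine that raises the budget, writes what it can,
 * and then re-bases the counter by the amount it was raised. */
void fs_writepages_sketch(struct wbc_sketch *wbc, long desired,
                          long pages_available)
{
    long bump = 0;

    if (wbc->nr_to_write < desired) {
        bump = desired - wbc->nr_to_write;
        wbc->nr_to_write = desired;
    }

    long written = pages_available < wbc->nr_to_write
                       ? pages_available : wbc->nr_to_write;
    wbc->nr_to_write -= written;

    /* Always re-base; the result is allowed to be negative. */
    wbc->nr_to_write -= bump;
}

/* Model of the caller: pages written = original budget - remaining budget. */
long caller_pages_written(long budget, long desired, long pages_available)
{
    struct wbc_sketch wbc = { budget };

    fs_writepages_sketch(&wbc, desired, pages_available);
    return budget - wbc.nr_to_write;
}
```

For a budget of 1024 pages bumped to 8192 with 5000 dirty pages available, the counter ends at -3976 and the caller correctly computes 5000 pages written. Clamping the counter at zero would make the caller see only 1024.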
     

25 Dec, 2009

11 commits

  • Per commit 240799cd, the option name for readahead should be
    inode_readahead_blks, not inode_readahead.

    Signed-off-by: Fang Wenqi
    Signed-off-by: "Theodore Ts'o"

    Fang Wenqi
     
  • Linus Torvalds
     
  • * 'sysctl' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc-2.6:
    SYSCTL: Add a mutex to the page_alloc zone order sysctl
    SYSCTL: Print binary sysctl warnings (nearly) only once

    Linus Torvalds
     
  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6:
    HWPOISON: Add PROC_FS dependency to hwpoison injector v2

    Linus Torvalds
     
  • * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (34 commits)
    classmate-laptop: add support for Classmate PC ACPI devices
    hp-wmi: Fix two memleaks
    acer-wmi, msi-wmi: Remove needless DMI MODULE_ALIAS
    dell-wmi: do not keep driver loaded on unsupported boxes
    wmi: Free the allocated acpi objects through wmi_get_event_data
    drivers/platform/x86/acerhdf.c: check BIOS information whether it begins with string of table
    acerhdf: add new BIOS versions
    acerhdf: limit modalias matching to supported
    toshiba_acpi: convert to seq_file
    asus_acpi: convert to seq_file
    ACPI: do not select ACPI_DOCK from ATA_ACPI
    sony-laptop: enumerate rfkill devices using SN06
    sony-laptop: rfkill support for newer models
    ACPI: fix OSC regression that caused aer and pciehp not to load
    MAINTAINERS: add maintainer for msi-wmi driver
    fujitu-laptop: fix tests of acpi_evaluate_integer() return value
    arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c: avoid cross-CPU interrupts by using smp_call_function_any()
    ACPI: processor: remove _PDC object list from struct acpi_processor
    ACPI: processor: change acpi_processor_set_pdc() interface
    ACPI: processor: open code acpi_processor_cleanup_pdc
    ...

    Linus Torvalds
     
  • * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
    ocfs2/trivial: Use le16_to_cpu for a disk value in xattr.c
    ocfs2/trivial: Use proper mask for 2 places in hearbeat.c
    Ocfs2: Let ocfs2 support fiemap for symlink and fast symlink.
    Ocfs2: Should ocfs2 support fiemap for S_IFDIR inode?
    ocfs2: Use FIEMAP_EXTENT_SHARED
    fiemap: Add new extent flag FIEMAP_EXTENT_SHARED
    ocfs2: replace u8 by __u8 in ocfs2_fs.h
    ocfs2: explicit declare uninitialized var in user_cluster_connect()
    ocfs2-devel: remove redundant OCFS2_MOUNT_POSIX_ACL check in ocfs2_get_acl_nolock()
    ocfs2: return -EAGAIN instead of EAGAIN in dlm
    ocfs2/cluster: Make fence method configurable - v2
    ocfs2: Set MS_POSIXACL on remount
    ocfs2: Make acl use the default
    ocfs2: Always include ACL support

    Linus Torvalds
     
  • * 'for-linus' of master.kernel.org:/home/rmk/linux-2.6-arm:
    VIDEO: cyberpro: pci_request_regions needs a persistent name
    ARM: dma-isa: request cascade channel after registering it
    ARM: footbridge: trim down old ISA rtc setup
    ARM: fix PAGE_KERNEL
    ARM: Fix wrong shared bit for CPU write buffer bug test
    ARM: 5857/1: ARM: dmabounce: fix build
    ARM: 5856/1: Fix bug of uart0 platfrom data for nuc900
    ARM: 5855/1: putc support for nuc900
    ARM: 5854/1: fix compiling error for NUC900
    ARM: 5849/1: ARMv7: fix Oprofile events count
    ARM: add missing include to nwflash.c
    ARM: Kill CONFIG_CPU_32
    ARM: Convert VFP/Crunch/XscaleCP thread_release() to exit_thread()
    ARM: 5853/1: ARM: Fix build break on ARM v6 and v7

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp:
    edac, pci: remove pesky debug printk
    amd64_edac: restrict PCI config space access
    amd64_edac: fix forcing module load/unload
    amd64_edac: make driver loading more robust
    amd64_edac: fix driver instance freeing
    amd64_edac: fix K8 chip select reporting

    Linus Torvalds
     
  • * 'sh/for-2.6.33' of git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6:
    sh: Ensure all PG_dcache_dirty pages are written back.
    sh: mach-ecovec24: setup.c detailed correction
    serial: sh-sci: Convert tremaining ctrl_xxx I/O routines to __raw_xxx.
    serial: sh-sci: earlyprintk zero uartclk fix
    sh: Only use bl bit toggling for sleeping idle.
    sh: Restore bl bit toggling in idle loop.
    sh: Fix up MAX_DMA_CHANNELS definition when DMA is disabled.
    sh: dmaengine support for SH7785
    sh: dmaengine support for sh7724.

    Linus Torvalds
     
  • Don't pass a name pointer from the kernel stack, it will not survive
    and will result in corrupted /proc/iomem output.

    Signed-off-by: Russell King

    Russell King
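
The lifetime problem described in the message above can be shown with a small user-space sketch. All names here (`region_sketch`, `device_sketch`, the `"video%d"` name format) are hypothetical; the registration function merely imitates an API that stores the name pointer without copying it. The fix pattern is to build the name in storage that lives as long as the device, not in a local array.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Imitates a resource record that keeps only a pointer to its name. */
struct region_sketch {
    const char *name;
};

struct device_sketch {
    char name[32];                /* persistent: lives as long as the device */
    struct region_sketch region;
};

/* Imitates a registration call that does not copy the name. */
void request_region_sketch(struct region_sketch *r, const char *name)
{
    r->name = name;
}

void device_init_sketch(struct device_sketch *dev, int index)
{
    /* Build the name inside the device structure so the stored pointer
     * stays valid after this function returns. A local char array here
     * would leave region.name dangling. */
    snprintf(dev->name, sizeof(dev->name), "video%d", index);
    request_region_sketch(&dev->region, dev->name);
}
```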
     
  • We can't request the cascade channel before it's been registered, so
    move it afterwards.

    Signed-off-by: Russell King

    Russell King
     

24 Dec, 2009

14 commits


23 Dec, 2009

6 commits

  • It triggers the warning in get_page_from_freelist(), and it isn't
    appropriate to use __GFP_NOFAIL here anyway.

    Addresses http://bugzilla.kernel.org/show_bug.cgi?id=14843

    Reported-by: Christian Casteyde
    Signed-off-by: Andrew Morton
    Signed-off-by: "Theodore Ts'o"

    Andrew Morton
     
  • Creating many small files in rapid succession on a small
    filesystem can lead to spurious ENOSPC; on a 104MB filesystem:

    for i in `seq 1 22500`; do
    echo -n > $SCRATCH_MNT/$i
    echo XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX > $SCRATCH_MNT/$i
    done

    leads to ENOSPC even though after a sync, 40% of the fs is free
    again.

    This is because we reserve worst-case metadata for delalloc writes,
    and when data is allocated that worst-case reservation is not
    usually needed.

    When freespace is low, kicking off an async writeback will start
    converting that worst-case space usage into something more realistic,
    almost always freeing up space to continue.

    This resolves the testcase for me, and survives all 4 generic
    ENOSPC tests in xfstests.

    We'll still need a hard synchronous sync to squeeze out the last bit,
    but this fixes things up to a large degree.

    Signed-off-by: Eric Sandeen
    Signed-off-by: "Theodore Ts'o"

    Eric Sandeen
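
The trigger described in the message above amounts to a low-space check before reserving more delalloc blocks. The sketch below is an assumption-laden illustration: the function name is hypothetical, and the comparison against a multiple of the reserved blocks (with `factor` supplied by the caller) is a guess at the general shape, not the threshold the patch actually uses.

```c
#include <assert.h>

/*
 * Returns 1 if free space is low enough, relative to the blocks currently
 * reserved for delayed allocation, that asynchronous writeback should be
 * started to turn worst-case reservations into real usage.
 */
int should_start_writeback(unsigned long long free_blocks,
                           unsigned long long reserved_blocks,
                           unsigned int factor)
{
    return free_blocks < (unsigned long long)factor * reserved_blocks;
}
```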
     
  • ext4, at least, would like to start pushing on writeback if it starts
    to get close to ENOSPC when reserving worst-case blocks for delalloc
    writes. Writing out delalloc data will convert those worst-case
    predictions into usually smaller actual usage, freeing up space
    before we hit ENOSPC based on this speculation.

    Thanks to Jens for the suggestion for the helper function,
    & the naming help.

    I've made the helper return status on whether writeback was
    started even though I don't plan to use it in the ext4 patch;
    it seems like it would be potentially useful to test this
    in some cases.

    Signed-off-by: Eric Sandeen
    Acked-by: Jan Kara

    Eric Sandeen
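
A helper of the kind described in the message above, one that reports whether it actually started writeback, might look like the following user-space model. `struct sb_sketch` and `writeback_if_idle_sketch` are hypothetical names, and the rule of starting only when no writeback is already in progress is an assumption for illustration.

```c
#include <assert.h>

/* Hypothetical stand-in for per-filesystem writeback state. */
struct sb_sketch {
    int writeback_in_progress;
    int writeback_requests;
};

/* Start writeback only if none is running.
 * Returns 1 if writeback was started, 0 otherwise. */
int writeback_if_idle_sketch(struct sb_sketch *sb)
{
    if (sb->writeback_in_progress)
        return 0;
    sb->writeback_in_progress = 1;
    sb->writeback_requests++;
    return 1;
}
```

Returning the status costs nothing for callers that ignore it, and lets a caller or a test distinguish "started" from "already running".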
     
  • b_entry_name and buffer are initially NULL, are initialized within a loop
    to the result of calling kmalloc, and are freed at the bottom of this loop.
    The loop contains gotos to cleanup, which also frees b_entry_name and
    buffer. Some of these gotos are before the reinitializations of
    b_entry_name and buffer. To maintain the invariant that b_entry_name and
    buffer are NULL at the top of the loop, and thus acceptable arguments to
    kfree, these variables are now set to NULL after the kfrees.

    This seems to be the simplest solution. A more complicated solution
    would be to introduce more labels in the error handling code at the end of
    the function.

    A simplified version of the semantic match that finds this problem is as
    follows: (http://coccinelle.lip6.fr/)

    //
    @r@
    identifier E;
    expression E1;
    iterator I;
    statement S;
    @@

    *kfree(E);
    ... when != E = E1
    when != I(E,...) S
    when != &E
    *kfree(E);
    //

    Signed-off-by: Julia Lawall
    Signed-off-by: "Theodore Ts'o"

    Julia Lawall
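
The invariant described in the message above can be reproduced in user-space C with `malloc` and `free` standing in for `kmalloc` and `kfree`. The function name, buffer sizes and the `fail_at` parameter are hypothetical and exist only to exercise the early `goto`; this is a sketch of the pattern, not the ext4 code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns the number of iterations completed. `fail_at` names the iteration
 * that jumps to cleanup before reallocating; pass -1 for no failure. */
int process_entries(int count, int fail_at)
{
    char *name = NULL;
    char *buffer = NULL;
    int done = 0;

    for (int i = 0; i < count; i++) {
        if (i == fail_at)
            goto cleanup;   /* taken before name and buffer are reallocated */

        name = malloc(32);
        if (!name)
            goto cleanup;
        buffer = malloc(64);
        if (!buffer)
            goto cleanup;

        memset(name, 0, 32);
        memset(buffer, 0, 64);
        done++;

        /* Free at the bottom of the loop and reset to NULL so that the
         * cleanup path never frees the same pointer twice. */
        free(name);
        name = NULL;
        free(buffer);
        buffer = NULL;
    }

cleanup:
    free(name);     /* free(NULL) is a no-op, as is kfree(NULL) */
    free(buffer);
    return done;
}
```

Without the two assignments to NULL, the early `goto` on a later iteration would pass already-freed pointers to `free` a second time.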
     
  • sparc64 allmodconfig:

    fs/ext4/super.c: In function `lifetime_write_kbytes_show':
    fs/ext4/super.c:2174: warning: long long unsigned int format, long unsigned int arg (arg 4)
    fs/ext4/super.c:2174: warning: long long unsigned int format, long unsigned int arg (arg 4)

    Signed-off-by: Andrew Morton
    Signed-off-by: "Theodore Ts'o"

    Andrew Morton
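
The warnings quoted above arise when a `%llu` conversion is given an `unsigned long` argument. One common way to silence them is an explicit cast to `unsigned long long`, sketched below with a hypothetical helper; whether the patch uses a cast or changes the variable's type is not stated in the message.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format a kilobyte counter held in an unsigned long using %llu.
 * The cast makes the argument type match the format on every
 * architecture, whatever the width of unsigned long. */
int format_kbytes(char *buf, size_t size, unsigned long kbytes)
{
    return snprintf(buf, size, "%llu\n", (unsigned long long)kbytes);
}
```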
     
  • This is a bit complicated because we are trying to optimize when we
    send barriers to the fs data disk. We could just throw in an extra
    barrier to the data disk whenever we send a barrier to the journal
    disk, but that's not always strictly necessary.

    We only need to send a barrier during a commit when there are data
    blocks which must be written out due to an inode written in
    ordered mode, or if fsync() depends on the commit to force data blocks
    to disk. Finally, before we drop transactions from the beginning of
    the journal during a checkpoint operation, we need to guarantee that
    any blocks that were flushed out to the data disk are firmly on the
    rust platter before we drop the transaction from the journal.

    Thanks to Oleg Drokin for pointing out this flaw in ext3/ext4.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

21 Dec, 2009

2 commits


14 Dec, 2009

2 commits

  • This patch fixes the Kernel BZ #14286. When the address of an extent
    corresponding to a valid block is corrupted, a -EIO should be reported
    instead of a BUG(). This situation should not normally occur
    except in the case of a corrupted filesystem. If however it does,
    then the system should not panic directly but depending on the mount
    time options appropriate action should be taken. If the mount options
    so permit, the I/O should be gracefully aborted by returning a -EIO.

    http://bugzilla.kernel.org/show_bug.cgi?id=14286

    Signed-off-by: Surbhi Palande
    Signed-off-by: "Theodore Ts'o"

    Surbhi Palande
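
The approach described in the message above, reporting an error for a corrupted extent address and leaving the caller to decide what to do, can be sketched as a plain range check. The function name, its parameters and the exact bounds tested are assumptions for illustration; the real validity checks in ext4 are more involved.

```c
#include <assert.h>
#include <errno.h>

/*
 * Returns 0 if the extent [start, start + len) lies inside the filesystem's
 * data area, or -EIO if the address looks corrupted. The caller can then
 * abort the I/O gracefully; nothing here brings the system down.
 */
int check_extent_sketch(unsigned long long start, unsigned long long len,
                        unsigned long long first_data_block,
                        unsigned long long total_blocks)
{
    if (len == 0 ||
        start < first_data_block ||
        start >= total_blocks ||
        len > total_blocks - start)
        return -EIO;
    return 0;
}
```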
     
  • Remove unused #include ('s) in
    fs/ext4/block_validity.c
    fs/ext4/mballoc.h

    Signed-off-by: Huang Weiyi
    Signed-off-by: "Theodore Ts'o"

    Huang Weiyi