26 Jun, 2006

31 commits

  • Convert the ext3 in-kernel filesystem blocks to ext3_fsblk_t. Convert the
    remaining unsigned long in-kernel filesystem blocks to ext3_fsblk_t, and
    replace the printk format strings correspondingly.

    Signed-off-by: Mingming Cao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mingming Cao
     
  • Some of the in-kernel ext3 block variables are treated as signed 4-byte
    ints, which limits an ext3 filesystem to 8TB (with a 4k block size). While
    trying to fix them, the ext3 code turned out to be quite confusing: some
    blocks are filesystem-wide blocks, others are group-relative offsets that need
    to be signed values (as -1 has a special meaning). So it seems saner to define
    two types of physical blocks: one for filesystem-wide blocks, another for
    group-relative blocks. The following patches clarify these two types of
    blocks in the ext3 code, and fix the type bugs which limit the current 32 bit
    ext3 filesystem to 8TB.

    With this series of patches and the percpu counter data type changes in the mm
    tree, we are able to extend the ext3 filesystem limit to 16TB.

    This work is also a prerequisite for the recent >32 bit ext3 work, and makes
    it a lot easier for the kernel to address 48 bit ext3 blocks: simply redefine
    ext3_fsblk_t from unsigned long to sector_t and redefine the format string for
    ext3 filesystem blocks correspondingly.

    Two RFCs with a series of patches have been posted to the ext2-devel list and
    have been reviewed and discussed:
    http://marc.theaimsgroup.com/?l=ext2-devel&m=114722190816690&w=2

    http://marc.theaimsgroup.com/?l=ext2-devel&m=114784919525942&w=2

    Patches are tested on both 32 bit and 64 bit machines with an 8TB ext3
    filesystem (using the latest, soon to be released e2fsprogs-1.39). Tests
    include overnight fsx, tiobench, dbench and fsstress.

    This patch:

    Defines ext3_fsblk_t and ext3_grpblk_t, and the printk format string for
    filesystem wide blocks.

    This patch classifies all block group relative blocks and ext3_fsblk_t blocks
    that occur in the same function, which used to be confusing. It also includes
    kernel bug fixes for filesystem-wide in-kernel block variables. Some
    filesystem-wide blocks are currently treated as int/unsigned int type in the
    kernel, especially in the ext3 block allocation and reservation code. This
    patch fixes those bugs by converting those variables to the ext3_fsblk_t
    (unsigned long) type.

    Signed-off-by: Mingming Cao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mingming Cao
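
    A minimal userspace sketch of the two block types described above. The
    sketch_* names, the helper functions and the SKETCH_FSBLK_FMT macro are
    illustrative assumptions, not the identifiers added by the patch; only the
    split into an unsigned filesystem-wide type and a signed group-relative type,
    and the 8TB/16TB limits for 4k blocks, come from the description.

    ```c
    #include <stdint.h>

    /* Filesystem-wide block number: unsigned, as the message describes. */
    typedef unsigned long sketch_fsblk_t;
    /* Group-relative block offset: signed, because -1 has a special meaning. */
    typedef int sketch_grpblk_t;
    /* Hypothetical printk-style format string for filesystem-wide blocks. */
    #define SKETCH_FSBLK_FMT "%lu"

    /*
     * Turn a (group, offset) pair into a filesystem-wide block number.
     * The caller is assumed to have checked that offset is not negative.
     */
    static sketch_fsblk_t sketch_to_fsblk(unsigned long first_data_block,
                                          unsigned long group,
                                          unsigned long blocks_per_group,
                                          sketch_grpblk_t offset)
    {
        return first_data_block
             + (sketch_fsblk_t)group * blocks_per_group
             + (sketch_fsblk_t)offset;
    }

    /* Largest addressable filesystem size for a given number of usable bits. */
    static uint64_t sketch_max_fs_bytes(unsigned int usable_bits,
                                        uint64_t block_size)
    {
        return ((uint64_t)1 << usable_bits) * block_size;
    }
    ```

    With 4096-byte blocks, 31 usable bits (a signed int) give 8TB and 32 usable
    bits give 16TB, which matches the limits quoted in the message.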
     
  • The problem is that when we write to a file, the copy from userspace to
    pagecache is first done with preemption disabled, so if the source address is
    not immediately available the copy fails *and* *zeros* *the* *destination*.

    This is a problem because a concurrent read (which admittedly is an odd thing
    to do) might see zeros rather than what was there before the write, or what was
    there after, or some mixture of the two (any of these being a reasonable thing
    to see).

    If the copy did fail, it will immediately be retried with preemption
    re-enabled so any transient problem with accessing the source won't cause an
    error.

    The first copy does not need to zero any uncopied bytes, and doing so
    causes the problem. It uses __copy_from_user_inatomic rather than
    copy_from_user, so the simple expedient is to change
    __copy_from_user_inatomic to *not* zero out bytes on failure.

    The first of these two patches prepares for the change by fixing two places
    which assume __copy_from_user_inatomic does zero the tail. The two usages are
    very similar pieces of code which copy from a userspace iovec into one or more
    page-cache pages. These are changed to remove the assumption.

    The second patch changes __copy_from_user_inatomic* to not zero the tail.
    Once these are accepted, I will look at similar patches of other architectures
    where this is important (ppc, mips and sparc being the ones I can find).

    This patch:

    There is a problem with __copy_from_user_inatomic zeroing the tail of the
    buffer in the case of an error. As it is called in atomic context, the error
    may be transient, so it results in zeros being written where maybe they
    shouldn't be.

    In the usage in filemap, this opens a window for a well timed read to see data
    (zeros) which is not consistent with any ordering of reads and writes.

    In most cases where __copy_from_user_inatomic is called, a failure results in
    __copy_from_user being called immediately. As long as the latter zeros the
    tail, the former doesn't need to. However in *copy_from_user_iovec
    implementations (in both filemap and ntfs/file), it is assumed that
    copy_from_user_inatomic will zero the tail.

    This patch removes that assumption, so that after this patch it will
    be safe for copy_from_user_inatomic to not zero the tail.

    This patch also adds some commentary to filemap.h and asm-i386/uaccess.h.

    After this patch, all architectures that might disable preempt when
    kmap_atomic is called need to have their __copy_from_user_inatomic* "fixed".
    This includes
    - powerpc
    - i386
    - mips
    - sparc

    Signed-off-by: Neil Brown
    Cc: David Howells
    Cc: Anton Altaparmakov
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Ralf Baechle
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
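
    The copy-then-retry pattern described above can be sketched in userspace
    roughly as follows. Everything here is a simulation under stated
    assumptions: the sketch_* functions and the fault_at parameter (the offset at
    which the source becomes unavailable) are hypothetical stand-ins, not the
    kernel's uaccess API.

    ```c
    #include <stddef.h>
    #include <string.h>

    /*
     * Stand-in for the atomic copy: copies what it can and leaves the
     * uncopied tail of dst untouched. Returns the number of bytes NOT copied.
     */
    static size_t sketch_copy_inatomic(char *dst, const char *src, size_t n,
                                       size_t fault_at)
    {
        size_t ok = n < fault_at ? n : fault_at;

        memcpy(dst, src, ok);
        return n - ok;
    }

    /*
     * Stand-in for the copy that may sleep: on a real failure it zeroes the
     * tail, so stale destination data is never left behind.
     */
    static size_t sketch_copy_may_sleep(char *dst, const char *src, size_t n,
                                        size_t fault_at)
    {
        size_t ok = n < fault_at ? n : fault_at;

        memcpy(dst, src, ok);
        memset(dst + ok, 0, n - ok);
        return n - ok;
    }

    /* Try the atomic copy first; if it falls short, redo the whole copy. */
    static size_t sketch_copy_with_retry(char *dst, const char *src, size_t n,
                                         size_t atomic_fault_at,
                                         size_t sleeping_fault_at)
    {
        size_t left = sketch_copy_inatomic(dst, src, n, atomic_fault_at);

        if (left)
            left = sketch_copy_may_sleep(dst, src, n, sleeping_fault_at);
        return left;
    }
    ```

    Because the atomic attempt never writes zeros, a reader that looks at the
    destination between the two attempts sees only old or new bytes.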
     
  • This was reported as Debian bug #336604.

    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Theodore Ts'o
     
  • Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Theodore Ts'o
     
  • The variable i is guaranteed to be the same as db_count given the previous
    for loop. So get rid of it since it's dead code.

    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Theodore Ts'o
     
  • If an ext3 filesystem is larger than 2TB, and sector_t is a u32 (i.e.
    CONFIG_LBD is not defined in the kernel), the calculation of the disk sector
    will overflow. Add checks to ext3_fill_super() and ext3_group_extend() to
    prevent mounting, remounting or resizing a >2TB ext3 filesystem if sector_t
    is 4 bytes.

    Verified this patch on a 32 bit platform without CONFIG_LBD defined
    (sector_t is 32 bits long): mount refuses to mount a 10TB ext3 filesystem.

    Signed-off-by: Mingming Cao
    Acked-by: Andreas Dilger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mingming Cao
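
    The overflow can be illustrated with a small userspace check. This is a
    sketch under assumptions: sketch_fits_in_32bit_sectors is a hypothetical
    helper, 512-byte sectors are assumed, and the actual test added to
    ext3_fill_super() and ext3_group_extend() may be written differently.

    ```c
    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Convert a block count to a count of 512-byte sectors using 64-bit
     * arithmetic, then check whether the result still fits in 32 bits.
     * blocksize_bits is log2 of the block size and is assumed to be >= 9.
     */
    static bool sketch_fits_in_32bit_sectors(uint32_t blocks_count,
                                             unsigned int blocksize_bits)
    {
        uint64_t sectors = (uint64_t)blocks_count << (blocksize_bits - 9);

        return sectors <= UINT32_MAX;
    }
    ```

    For 4k blocks (blocksize_bits of 12), a 1TB filesystem has 2^28 blocks and
    2^31 sectors, which fits; a 10TB filesystem needs more than 2^32 sectors and
    does not.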
     
  • "Move" "common code" out to PTR_NOD, which does the conversion from private
    pointer to node number. This is to reduce potential casting/conversion errors
    due to redundancy. (The naming PTR_NOD follows PTR_ERR, turning a pointer
    into xyz.)

    [akpm@osdl.org: cleanups]
    Signed-off-by: Jan Engelhardt
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Engelhardt
     
  • Remove unnecessary casts in fs/openpromfs/inode.c

    Signed-off-by: Jan Engelhardt
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Engelhardt
     
  • tchars is not '\0'-terminated so the strtoul may run into problems. Fix that.
    Also make tchars as big as a long in hexadecimal form would take rather than
    just 16.

    Signed-off-by: Jan Engelhardt
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Engelhardt
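
    A minimal sketch of the fix described above, in plain userspace C. The
    function name sketch_parse_hex and the SKETCH_HEX_DIGITS macro are
    assumptions made for illustration; the point is only that the buffer is
    sized for a long in hexadecimal form and is terminated before strtoul()
    reads it.

    ```c
    #include <stdlib.h>
    #include <string.h>

    /* A long printed in hexadecimal needs two digits per byte. */
    #define SKETCH_HEX_DIGITS (sizeof(unsigned long) * 2)

    /*
     * Parse up to len characters of src as hexadecimal. src does not have to
     * be '\0'-terminated; the local copy always is.
     */
    static unsigned long sketch_parse_hex(const char *src, size_t len)
    {
        char tchars[SKETCH_HEX_DIGITS + 1];

        if (len > SKETCH_HEX_DIGITS)
            len = SKETCH_HEX_DIGITS;
        memcpy(tchars, src, len);
        tchars[len] = '\0';     /* strtoul() must not run past the copy */
        return strtoul(tchars, NULL, 16);
    }
    ```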
     
  • Make two needlessly global functions static.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • The ufs code has a function, ubh_ll_rw_block, which takes a parameter saying
    how many ufs_buffer_heads it should handle, but it is always called with "1"
    for this parameter. This patch removes the unused parameter of
    "ubh_ll_rw_block".

    Signed-off-by: Evgeniy Dushistov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Dushistov
     
  • The ufs super block contains some statistics about the file system, such as
    the number of directories, free blocks, inodes and so on.

    UFS1 holds this information in one location and uses 32bit integers for it;
    UFS2 holds the statistics in another location and uses 64bit integers.

    There is a transitional variant: if UFS1 has type 44BSD and the flags field
    in the super block has a special value, the statistics are handled the way
    UFS2 does it, which also means that nobody cares about the old (UFS1)
    statistics.

    So if fsck is run against such a file system after the linux ufs driver has
    used it, it finds errors, because at present only the UFS1-style statistics
    are updated.

    This patch should fix this. It also contains some minor cleanups: CodingStyle
    fixes and removal of unused variables.

    Signed-off-by: Evgeniy Dushistov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Dushistov
     
  • Presently ufs doesn't support "fsync", which makes some applications unhappy,
    for example vim. This patch fixes this situation.

    Signed-off-by: Evgeniy Dushistov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Dushistov
     
  • The UFS super block usually has a size >512, and because the fragment size
    may be 512, this causes some problems.

    Currently, there are two methods to work with ufs super block:

    1) split structure which describes ufs super blocks into structures with
    size b_data + bh[n]->b_size == bh[n + 1]->b_data

    The second variant may cause some problems in the future, and using two
    variants causes unnecessary code duplication.

    This patch removes the second variant. The patch also contains some
    CodingStyle fixes.

    Signed-off-by: Evgeniy Dushistov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Dushistov
     
  • This patch fixes two bugs which were introduced by previous patches:

    1) Missed "brelse"

    2) Sometimes "baseblk" may be wrongly calculated if i_size is equal to
    zero, which leads to an infinite loop in "mpage_writepages".

    Signed-off-by: Evgeniy Dushistov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Dushistov
     
  • fs/ufs/super.c: In function `ufs_print_super_stuff':
    fs/ufs/super.c:103: warning: unsigned int format, different type arg (arg 2)
    fs/ufs/super.c: In function `ufs2_print_super_stuff':
    fs/ufs/super.c:147: warning: unsigned int format, different type arg (arg 2)
    fs/ufs/super.c: In function `ufs_print_cylinder_stuff':
    fs/ufs/super.c:175: warning: unsigned int format, different type arg (arg 2)

    Cc: Evgeniy Dushistov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Presently if we allocate several "metadata" blocks (pointers to indirect
    blocks for example), we fill only the first block with zeroes. This causes
    some problems in the "truncate" function. This patch also removes some unused
    arguments from several functions and adds comments.

    Signed-off-by: Evgeniy Dushistov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Dushistov
     
  • The ufs_free_blocks function currently looks like this:
    if (err)
    goto failed;
    lock_super();
    failed:
    unlock_super();

    So if an error happens, we unlock a super block that was never locked.

    Signed-off-by: Evgeniy Dushistov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Dushistov
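
    One way to keep the lock and unlock calls balanced is sketched below. This
    is an assumption about the general shape of such a fix, not the change the
    patch actually makes; sketch_lock(), sketch_unlock() and the depth counter
    are hypothetical stand-ins for lock_super() and unlock_super().

    ```c
    /* Counts how many times the pretend lock is currently held. */
    static int sketch_lock_depth;

    static void sketch_lock(void)   { sketch_lock_depth++; }
    static void sketch_unlock(void) { sketch_lock_depth--; }

    /*
     * Take the lock before any path can jump to the label that releases it,
     * so every exit through "failed" unlocks something that was locked.
     */
    static int sketch_free_blocks(int err)
    {
        sketch_lock();
        if (err)
            goto failed;

        /* ... the real work would happen here, under the lock ... */

    failed:
        sketch_unlock();
        return err;
    }
    ```

    An alternative with the same effect is to return directly on the early
    error, before the lock is taken, and reserve the label for paths that hold
    the lock.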
     
  • At present the UFS code uses the DQUOT_* mechanism, but it also updates
    inode->i_blocks manually, which causes a wrong i_blocks value.

    Signed-off-by: Evgeniy Dushistov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Dushistov
     
  • This patch makes a small optimization to ufs_find_entry, like "ext2" does:
    save the number of the page and reuse it in the next call.

    Signed-off-by: Evgeniy Dushistov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Dushistov
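
    The idea of remembering where the last lookup ended can be sketched as
    follows. The structure and function names are hypothetical, and each "page"
    is reduced to a single integer key to keep the example short; the real
    directory code scans entries inside each page.

    ```c
    #include <stddef.h>

    struct sketch_dir {
        const int *pages;      /* stand-in: one key per directory page */
        size_t npages;
        size_t start_lookup;   /* page where the last successful lookup ended */
    };

    /*
     * Search all pages once, starting at the remembered page and wrapping
     * around. Returns the page index of the key, or npages if it is absent.
     */
    static size_t sketch_find_entry(struct sketch_dir *dir, int key)
    {
        size_t start, n;

        if (dir->npages == 0)
            return 0;

        start = dir->start_lookup;
        if (start >= dir->npages)
            start = 0;
        n = start;
        do {
            if (dir->pages[n] == key) {
                dir->start_lookup = n;   /* reuse this page next time */
                return n;
            }
            if (++n >= dir->npages)
                n = 0;
        } while (n != start);

        return dir->npages;
    }
    ```

    Repeated lookups of neighbouring names then tend to hit the remembered page
    first instead of rescanning the directory from the beginning.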
     
  • Currently, to turn on debug mode the "user" has to edit ~10 files, and to
    turn it off he has to do it again.

    This patch introduces the following changes:
    1) turn debug messages on (off) via ".config"
    2) remove unnecessary duplication of code
    3) make the "UFSD" macro more similar to a function
    4) fix some compiler warnings

    Signed-off-by: Evgeniy Dushistov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Dushistov
     
  • To find new bugs, I suggest reverting this patch:
    http://lkml.org/lkml/2006/1/31/275 in the -mm tree.

    Then others can test the "write support" of UFS.

    Signed-off-by: Evgeniy Dushistov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Dushistov
     
  • Writing to a UFS file system with block/fragment!=8 may cause bogus
    behaviour. The problem is in the "ufs_bitmap_search" function, which doesn't
    work correctly in the "block/fragment!=8" case. The idea is stolen from BSD
    code.

    Signed-off-by: Evgeniy Dushistov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Dushistov
     
  • There are two ugly macros in ufs code:
    #define UCPI_UBH ((struct ufs_buffer_head *)ucpi)
    #define USPI_UBH ((struct ufs_buffer_head *)uspi)
    when uspi looks like
    struct {
    struct ufs_buffer_head ;
    }
    and USPI_UBH makes some sense,
    ucpi looks like
    struct {
    struct not_ufs_buffer_head;
    }

    To prevent bugs in future, this patch converts the macros to inline functions
    and fixes the "ucpi" structure.

    Signed-off-by: Evgeniy Dushistov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Dushistov
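
    Why an inline accessor is safer than a cast macro can be sketched like this.
    The structure layouts and sketch_* names are assumptions for illustration
    only; they do not reproduce the real ufs structures.

    ```c
    /* Stand-in for struct ufs_buffer_head. */
    struct sketch_ubh {
        int count;
    };

    /* The buffer head happens to be the first member here. */
    struct sketch_sb_private_info {
        struct sketch_ubh s_ubh;
        int s_other;
    };

    /* Here it is NOT the first member, so a plain cast would be wrong. */
    struct sketch_cg_private_info {
        int c_cgx;
        struct sketch_ubh c_ubh;
    };

    /* The accessors name the member, so the compiler checks type and offset. */
    static inline struct sketch_ubh *
    sketch_uspi_ubh(struct sketch_sb_private_info *uspi)
    {
        return &uspi->s_ubh;
    }

    static inline struct sketch_ubh *
    sketch_ucpi_ubh(struct sketch_cg_private_info *ucpi)
    {
        return &ucpi->c_ubh;
    }
    ```

    Passing the wrong structure type to either accessor is a compile-time
    diagnostic, whereas the cast in the macro version silently accepts any
    pointer.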
     
  • Change the functions in fs/ufs/dir.c and fs/ufs/namei.c to work with pages
    instead of working directly with blocks. This fixes the following bugs:

    * for i in `seq 1 1000`; do touch $i; done - crashes the system
    * mkdir creates a directory without "." and ".." entries

    Signed-off-by: Evgeniy Dushistov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Dushistov
     
  • This series of patches finishes the "bug fixing" mentioned
    here http://lkml.org/lkml/2006/1/31/275 .

    The main bugs:
    * for i in `seq 1 1000`; do touch $i; done - crashes the system
    * mkdir creates a directory without "." and ".." entries

    The suggested solution is to work with the page cache instead of working
    directly with blocks. This solution has the following advantages:

    * reduces code size and complexity
    * some global locks go away
    * fixes bugs

    Most of the code is stolen from ext2, because it has a similar
    directory structure.

    Patches tested with UFS1 and UFS2 file systems.

    This patch installs i_mapping->a_ops for directory inodes and removes some
    duplicated code.

    Signed-off-by: Evgeniy Dushistov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Dushistov
     
  • First of all, some necessary notes about UFS itself: to avoid wasting disk
    space, the tail of a file consists not of blocks (which are ordinarily quite
    big, usually 16K) but of fragments (which are ordinarily 2K). As the file
    grows, its tail occupies 1 fragment, 2 fragments... At some stage the decision
    to allocate a whole block is made and all fragments are moved to one block.

    How this situation was handled before:

    ufs_prepare_write
    ->block_prepare_write
    ->ufs_getfrag_block
    ->...
    ->ufs_new_fragments:

    bh = sb_bread
    bh->b_blocknr = result + i;
    mark_buffer_dirty (bh);

    This is the wrong solution, because:

    - it didn't take into consideration that there is another cache: the "inode
    page cache"

    - because sb_getblk uses page->index rather than b_blocknr to find a
    certain block, this breaks sb_getblk.

    How this situation is handled now: we go through the whole "inode page
    cache"; if a page is not in the cache we load it into the cache, and change
    b_blocknr.

    Signed-off-by: Evgeniy Dushistov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Dushistov
     
  • * After block allocation, we map it to the same "address" as 8 other
    blocks

    * We zero the block several times: once in ufs/block.c and once in
    block_*write_full_page, and use different "caches" for this.

    Signed-off-by: Evgeniy Dushistov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Dushistov
     
  • Currently, ufs write support has two sets of problems: working with files
    and working with directories.

    This series of patches should solve the first problem.

    This patch is similar to http://lkml.org/lkml/2006/1/17/61 and
    complements it.

    The situation is the same: in ufs_trunc_(not direct), we read a block and
    check whether the count of links to it is equal to one; if so we finish the
    loop, if not we continue. Because the "count of links" is always >=2, this
    operation causes an infinite loop and hangs the kernel.

    Signed-off-by: Evgeniy Dushistov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Dushistov
     
  • Fix some spelling errors in fs/freevxfs error messages and comments

    Signed-off-by: Cliff Wickman
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cliff Wickman
     

23 Jun, 2006

9 commits

  • Otherwise we could be racing with truncate/mapping removal.

    Problem found/fixed by Nick Piggin, logic rewritten by me.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • A process flag to indicate whether we are doing sync io is incredibly
    ugly. It also causes performance problems when one does a lot of async
    io and then proceeds to sync it. Part of the io will go out as async,
    and the other part as sync. This causes a disconnect between the
    previously submitted io and the synced io. For io schedulers such as CFQ,
    this causes lost merges and suboptimal scheduling behaviour.

    Remove PF_SYNCWRITE completely from the fsync/msync paths, and let
    the O_DIRECT path just directly indicate that the writes are sync
    by using WRITE_SYNC instead.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Sometimes partitions claim to be larger than the reported capacity of a
    disk device. This patch makes the kernel warn about those partitions.

    We still permit these partitions to be used. Quoting Andries Brouwer:

    Case 1: The kernel is mistaken about the size of the disk. (There are
    commands to clip a disk to a certain capacity, there are jumpers to tell a
    disk that it should report a certain capacity etc. Usually this is because
    of BIOS bugs. In bad cases the machine will crash in the BIOS and hence fail
    to boot if the disk reports full capacity.) In such cases actually accessing
    the blocks of the partition may work fine, or may work fine after running an
    unclip utility. I wrote "setmax" some years ago precisely for this reason.

    Case 2: There was a messy partition table (maybe just a rounding error) but
    the actual filesystem on the partition is contained in the physical disk.
    Now using the filesystem goes without problem.

    Case 3: Both partition and filesystem extend beyond the end of the disk. In
    forensic or debugging situations one often uses a copy of the start of a
    disk. Now access beyond the end gives an expected I/O error.

    Signed-off-by: Mike Miller
    Signed-off-by: Stephen Cameron
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Miller
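
    The condition being warned about can be sketched as a small predicate. The
    function name and parameters are assumptions for illustration, with all
    values counted in sectors; the kernel's own check may be written
    differently.

    ```c
    #include <stdbool.h>
    #include <stdint.h>

    /*
     * True when a partition starting at sector `start` and spanning
     * `nr_sects` sectors reaches past a disk of `capacity` sectors.
     * Written without computing start + nr_sects, so it cannot wrap around.
     */
    static bool sketch_partition_exceeds_disk(uint64_t start, uint64_t nr_sects,
                                              uint64_t capacity)
    {
        return nr_sects > capacity || start > capacity - nr_sects;
    }
    ```

    In line with the message, a caller would only print a warning when this
    returns true and would still register the partition.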
     
  • Split the checkpoint list of the transaction into two lists. In the first
    list we keep the buffers that need to be submitted for IO. In the second
    list are kept buffers that were already submitted and we just have to wait
    for the IO to complete. This should simplify the handling of checkpoint
    lists a bit and may eventually also be a performance gain.

    Signed-off-by: Jan Kara
    Cc: Mark Fasheh
    Cc: "Stephen C. Tweedie"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • list_splice_init(list, head) does unneeded work if it is known that
    list_empty(head) == 1. We can use list_replace_init() instead.

    Signed-off-by: Oleg Nesterov
    Acked-by: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
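
    A self-contained userspace sketch of what replacing a list head means. The
    sketch_list type and helpers below are simplified stand-ins written for this
    example, not the kernel's list.h; they only illustrate why a replace is
    enough when the destination head is known to be empty.

    ```c
    /* Minimal circular doubly-linked list node. */
    struct sketch_list {
        struct sketch_list *next, *prev;
    };

    static void sketch_list_init(struct sketch_list *l)
    {
        l->next = l;
        l->prev = l;
    }

    static void sketch_list_add_tail(struct sketch_list *node,
                                     struct sketch_list *head)
    {
        node->prev = head->prev;
        node->next = head;
        head->prev->next = node;
        head->prev = node;
    }

    /*
     * Let `repl` take over every entry of `old`, then reinitialise `old` as
     * an empty list. Only the two end links are rewritten; whatever `repl`
     * pointed to before is simply overwritten, which is fine when it was
     * empty.
     */
    static void sketch_list_replace_init(struct sketch_list *old,
                                         struct sketch_list *repl)
    {
        repl->next = old->next;
        repl->next->prev = repl;
        repl->prev = old->prev;
        repl->prev->next = repl;
        sketch_list_init(old);
    }
    ```

    A splice has to stitch the incoming entries in between the destination's
    existing neighbours; when there are none, taking over the source's two end
    links is all that is needed.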
     
  • The percpu counter data types are changed in this set of patches to support
    more users like ext3, which needs more than 32 bits to store the total number
    of free blocks in the filesystem.

    - Generic percpu counter data type changes. The sizes of the global counter
    and local counter are now explicitly specified using s64 and s32. The global
    counter is changed from long to s64, while the local counter is changed from
    long to s32, so we can avoid doing 64 bit updates in most cases.

    - Users of the percpu counters are updated to make use of the new
    percpu_counter_init() routine, which now takes an additional parameter to
    allow users to pass the initial value of the global counter.

    Signed-off-by: Mingming Cao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mingming Cao
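
    The split between a 64 bit global count and 32 bit local counts can be
    sketched in a single-threaded simulation. The sketch_* names, the fixed
    number of CPUs and the batch size are assumptions for illustration; real
    per-CPU storage and locking are left out entirely.

    ```c
    #include <stdint.h>

    #define SKETCH_NR_CPUS 4
    #define SKETCH_BATCH   32

    struct sketch_percpu_counter {
        int64_t count;                   /* global value, 64 bit */
        int32_t local[SKETCH_NR_CPUS];   /* per-CPU deltas, 32 bit */
    };

    /* The initial value of the global counter is passed in by the user. */
    static void sketch_counter_init(struct sketch_percpu_counter *c,
                                    int64_t amount)
    {
        int cpu;

        c->count = amount;
        for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
            c->local[cpu] = 0;
    }

    /*
     * Most updates touch only the 32 bit local delta; the 64 bit global
     * count is updated only once the delta reaches the batch size.
     */
    static void sketch_counter_mod(struct sketch_percpu_counter *c, int cpu,
                                   int32_t amount)
    {
        int32_t v = c->local[cpu] + amount;

        if (v >= SKETCH_BATCH || v <= -SKETCH_BATCH) {
            c->count += v;
            c->local[cpu] = 0;
        } else {
            c->local[cpu] = v;
        }
    }

    /* Exact value: the global count plus every outstanding local delta. */
    static int64_t sketch_counter_sum(const struct sketch_percpu_counter *c)
    {
        int64_t sum = c->count;
        int cpu;

        for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
            sum += c->local[cpu];
        return sum;
    }
    ```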
     
  • Remove redundant casts from NEW_AUX_ENT() arguments in fs/binfmt_elf.c

    Signed-off-by: Jesper Juhl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Juhl
     
  • Do a CodingStyle cleanup of fs/binfmt_elf.c and also remove some pointless
    casts of kmalloc() return values in the same file.

    Signed-off-by: Jesper Juhl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Juhl
     
  • Steven Rostedt points out that `rsv' here is usually
    NULL, so we should avoid calling kfree().

    Also, fix up some nearby whitespace damage.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton