20 Oct, 2008

1 commit

  • trylock_buffer and unlock_buffer open and close a critical section.
    Hence, we can use the lock bitops to get the desired memory ordering.
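
    For reference, this is roughly what the helpers look like with the lock
    bitops (a simplified sketch of the buffer_head code, not a verbatim copy):

        static inline int trylock_buffer(struct buffer_head *bh)
        {
                /* acquire semantics: the critical section cannot leak upwards */
                return likely(!test_and_set_bit_lock(BH_Lock, &bh->b_state));
        }

        void unlock_buffer(struct buffer_head *bh)
        {
                /* release semantics: the critical section cannot leak downwards */
                clear_bit_unlock(BH_Lock, &bh->b_state);
                smp_mb__after_clear_bit();
                wake_up_bit(&bh->b_state, BH_Lock);
        }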

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

27 Aug, 2008

1 commit


05 Aug, 2008

1 commit


31 Jul, 2008

1 commit

  • Uninline the __remove_assoc_queue() function in fs/buffer.c; it is called
    in too many places and is too long to really be inlined. Size results:

       text    data     bss      dec     hex  filename
    1134606  118840  212992  1466438  166046  vmlinux.old
    1134303  118840  212992  1466135  165f17  vmlinux
       -303       0       0     -303    -12F  +/-

    This patch is part of the Linux Tiny project and was originally
    written by Matt Mackall.

    Signed-off-by: Thomas Petazzoni
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Petazzoni
     

29 Jul, 2008

1 commit

  • When we read part of a file through the pagecache and a page already
    exists at the corresponding index but is not uptodate, read IO is issued
    to bring the page uptodate.

    I think this is good for the pagesize == blocksize case, but there is
    room for improvement when pagesize != blocksize, because then a page can
    have multiple buffers and, even if the page is not uptodate, some of its
    buffers can be.

    So I suggest that when all buffers corresponding to the part of the file
    we want to read are uptodate, we use this cached page and copy data from
    it to the user buffer even though the page itself is not uptodate. This
    can reduce read IO and improve system throughput.
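
    A minimal sketch of the buffer walk this implies (illustrative; the helper
    name and the locking details are simplified, not the patch's exact code):

        /*
         * Return 1 if every buffer covering [from, from + count) in this
         * page is uptodate, so the read can be satisfied straight from the
         * pagecache without issuing IO for the whole page.
         */
        static int buffers_uptodate_for_range(struct page *page,
                                              unsigned from, unsigned count)
        {
                struct buffer_head *head = page_buffers(page), *bh = head;
                unsigned block_start = 0, to = from + count;

                do {
                        unsigned block_end = block_start + bh->b_size;

                        if (block_end > from && block_start < to &&
                            !buffer_uptodate(bh))
                                return 0;
                        block_start = block_end;
                        bh = bh->b_this_page;
                } while (bh != head);

                return 1;
        }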

    I wrote a benchmark program and measured the following results with it.

    The benchmark does the following:

    1: mount and open a test file.

    2: create a 512MB file.

    3: close the file and umount.

    4: mount again and re-open the test file.

    5: pwrite randomly 300000 times on the test file; each offset is aligned
    to the IO size (1024 bytes).

    6: measure the time of preading randomly 100000 times on the test file.

    The result was:

        2.6.26:          330 sec
        2.6.26-patched:  226 sec

    Arch:       i386
    Filesystem: ext3
    Blocksize:  1024 bytes
    Memory:     1GB

    On ext3/4, files are written through buffers/blocks, so mixed random
    read/write workloads, or random reads after random writes, benefit from
    this patch when pagesize != blocksize. The test result above shows this.

    The benchmark program is as follows:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <time.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/mount.h>

    #define LEN 1024
    #define LOOP (1024*512) /* 512MB */

    int main(void)
    {
        unsigned long i, offset, filesize;
        int fd;
        char buf[LEN];
        time_t t1, t2;

        /* 1: mount and open (create) the test file */
        if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
            perror("cannot mount\n");
            exit(1);
        }
        memset(buf, 0, LEN);
        fd = open("/root/test1/testfile", O_CREAT|O_RDWR|O_TRUNC, 0644);
        if (fd < 0) {
            perror("cannot open file\n");
            exit(1);
        }
        /* 2: create a 512MB file */
        for (i = 0; i < LOOP; i++)
            write(fd, buf, LEN);
        /* 3: close the file and umount, dropping the cached pages */
        close(fd);
        if (umount("/root/test1/") < 0) {
            perror("cannot umount\n");
            exit(1);
        }
        /* 4: mount again and re-open the test file */
        if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
            perror("cannot mount\n");
            exit(1);
        }
        fd = open("/root/test1/testfile", O_RDWR);
        if (fd < 0) {
            perror("cannot open file\n");
            exit(1);
        }

        /* 5: 300000 random 1KB pwrites, offsets aligned to the IO size */
        filesize = LEN * LOOP;
        for (i = 0; i < 300000; i++) {
            offset = (random() % filesize) & (~(LEN - 1));
            pwrite(fd, buf, LEN, offset);
        }
        /* 6: time 100000 random 1KB preads */
        printf("start test\n");
        time(&t1);
        for (i = 0; i < 100000; i++) {
            offset = (random() % filesize) & (~(LEN - 1));
            pread(fd, buf, LEN, offset);
        }
        time(&t2);
        printf("%ld sec\n", t2 - t1);
        close(fd);
        if (umount("/root/test1/") < 0) {
            perror("cannot umount\n");
            exit(1);
        }
        return 0;
    }

    Signed-off-by: Hisashi Hifumi
    Cc: Nick Piggin
    Cc: Christoph Hellwig
    Cc: Jan Kara
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hisashi Hifumi
     

27 Jul, 2008

3 commits

  • Use WARN() instead of a printk+WARN_ON() pair; this way the message
    becomes part of the warning section for better reporting/collection.
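
    For example (illustrative):

        /* before: the message and the stack dump were separate */
        printk(KERN_ERR "something went wrong\n");
        WARN_ON(1);

        /* after: the message is emitted as part of the warning itself */
        WARN(1, "something went wrong\n");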

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     
  • The kmem cache passed to a constructor is only needed for constructors
    that are themselves multiplexers. Nobody uses this "feature", nor does
    anybody use the passed kmem cache in a non-trivial way, so pass only the
    pointer to the object.

    Non-trivial places are:
    arch/powerpc/mm/init_64.c
    arch/powerpc/mm/hugetlbpage.c

    This is flag day, yes.
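
    In terms of the constructor prototype the change amounts to this (sketch;
    "my_ctor" and "my_cache" are illustrative names):

        /* before: the cache pointer was passed to every constructor */
        void my_ctor(struct kmem_cache *cachep, void *obj);

        /* after: constructors receive only the object being initialized */
        void my_ctor(void *obj);

        kmem_cache_create("my_cache", size, 0, flags, my_ctor);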

    Signed-off-by: Alexey Dobriyan
    Acked-by: Pekka Enberg
    Acked-by: Christoph Lameter
    Cc: Jon Tollefson
    Cc: Nick Piggin
    Cc: Matt Mackall
    [akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c]
    [akpm@linux-foundation.org: fix mm/slab.c]
    [akpm@linux-foundation.org: fix ubifs]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • mapping->tree_lock has no read lockers, so convert the lock from an
    rwlock to a spinlock.
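
    The conversion itself is mechanical (sketch):

        /* before: struct address_space */
        rwlock_t        tree_lock;
        write_lock_irq(&mapping->tree_lock);
        write_unlock_irq(&mapping->tree_lock);

        /* after */
        spinlock_t      tree_lock;
        spin_lock_irq(&mapping->tree_lock);
        spin_unlock_irq(&mapping->tree_lock);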

    Signed-off-by: Nick Piggin
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Hugh Dickins
    Cc: "Paul E. McKenney"
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

16 Jul, 2008

1 commit

  • Conflicts:

    arch/powerpc/Kconfig
    arch/s390/kernel/time.c
    arch/x86/kernel/apic_32.c
    arch/x86/kernel/cpu/perfctr-watchdog.c
    arch/x86/kernel/i8259_64.c
    arch/x86/kernel/ldt.c
    arch/x86/kernel/nmi_64.c
    arch/x86/kernel/smpboot.c
    arch/x86/xen/smp.c
    include/asm-x86/hw_irq_32.h
    include/asm-x86/hw_irq_64.h
    include/asm-x86/mach-default/irq_vectors.h
    include/asm-x86/mach-voyager/irq_vectors.h
    include/asm-x86/smp.h
    kernel/Makefile

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

12 Jul, 2008

2 commits

  • Export mpage_bio_submit() and __mpage_writepage() for the benefit of
    ext4's delayed allocation support. Also change __block_write_full_page()
    so that, for buffers with the BH_Delay flag set, it calls get_block() to
    get the physical block allocated, just as in the !BH_Mapped case.
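
    The __block_write_full_page() change is roughly of this shape (sketch;
    error handling and the surrounding buffer loop are omitted):

        } else if ((!buffer_mapped(bh) || buffer_delay(bh)) &&
                   buffer_dirty(bh)) {
                /* allocate the physical block now, as in the !BH_Mapped case */
                err = get_block(inode, block, bh, 1);
                if (err)
                        goto recover;
                clear_buffer_delay(bh);
        }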

    Signed-off-by: Alex Tomas
    Signed-off-by: "Theodore Ts'o"

    Alex Tomas
     
  • There's no need to call mark_inode_dirty() under the page lock in
    generic_write_end(). It unnecessarily lengthens the page lock hold time
    and, more importantly, it forces a locking order between the page lock
    and transaction start for journaling filesystems.
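
    In other words, the dirtying moves out from under the page lock, roughly
    like this (sketch of generic_write_end() after the change):

        if (pos + copied > inode->i_size) {
                i_size_write(inode, pos + copied);
                i_size_changed = 1;
        }

        unlock_page(page);
        page_cache_release(page);

        /* dirty the inode only after the page lock has been dropped */
        if (i_size_changed)
                mark_inode_dirty(inode);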

    Signed-off-by: Jan Kara
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     

01 Jul, 2008

1 commit

  • fsync_buffers_list() and sync_dirty_buffer() both issue async writes and
    then immediately wait on them. Conceptually, that makes them sync writes
    and we should treat them as such so that the IO schedulers can handle
    them appropriately.
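
    The change essentially tags these submissions as synchronous, e.g.
    (sketch):

        /* sync_dirty_buffer(): we wait on the buffer right after this */
        ret = submit_bh(WRITE_SYNC, bh);    /* was: submit_bh(WRITE, bh) */

        /* fsync_buffers_list(): likewise for the locked-buffer writeout */
        ll_rw_block(SWRITE_SYNC, 1, &bh);   /* was: ll_rw_block(SWRITE, 1, &bh) */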

    This patch fixes a write starvation issue that Lin Ming reported, where
    xx is stuck for more than 2 minutes because of a large number of
    synchronous IO in the system:

    INFO: task kjournald:20558 blocked for more than 120 seconds.
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
    message.
    kjournald D ffff810010820978 6712 20558 2
    ffff81022ddb1d10 0000000000000046 ffff81022e7baa10 ffffffff803ba6f2
    ffff81022ecd0000 ffff8101e6dc9160 ffff81022ecd0348 000000008048b6cb
    0000000000000086 ffff81022c4e8d30 0000000000000000 ffffffff80247537
    Call Trace:
    [] kobject_get+0x12/0x17
    [] getnstimeofday+0x2f/0x83
    [] sync_buffer+0x0/0x3f
    [] io_schedule+0x5d/0x9f
    [] sync_buffer+0x3b/0x3f
    [] __wait_on_bit+0x40/0x6f
    [] sync_buffer+0x0/0x3f
    [] out_of_line_wait_on_bit+0x6c/0x78
    [] wake_bit_function+0x0/0x23
    [] sync_dirty_buffer+0x98/0xcb
    [] journal_commit_transaction+0x97d/0xcb6
    [] lock_timer_base+0x26/0x4b
    [] kjournald+0xc1/0x1fb
    [] autoremove_wake_function+0x0/0x2e
    [] kjournald+0x0/0x1fb
    [] kthread+0x47/0x74
    [] schedule_tail+0x28/0x5d
    [] child_rip+0xa/0x12
    [] kthread+0x0/0x74
    [] child_rip+0x0/0x12

    Lin Ming confirms that this patch fixes the issue. I've run tests with
    it for the past week and no ill effects have been observed, so I'm
    proposing it for inclusion into 2.6.26.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

26 Jun, 2008

1 commit


30 Apr, 2008

1 commit


29 Apr, 2008

2 commits


28 Apr, 2008

7 commits

  • On the reported systems, an ftruncate() that expands the size of a FAT
    file became the cause of OOM: cont_expand_zero() filled all memory with
    dirty pages, and since the disk is very slow, the page scanning limit was
    exceeded and the OOM killer was triggered.

    This adds balance_dirty_pages_ratelimited() to avoid filling memory
    with dirty pages.
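
    The fix boils down to throttling inside the zero-fill loop (sketch of one
    iteration; error handling omitted):

        status = pagecache_write_begin(file, mapping, curpos, len, 0,
                                       &page, &fsdata);
        zero_user(page, offset, len);
        status = pagecache_write_end(file, mapping, curpos, len, len,
                                     page, fsdata);
        curpos += len;

        /* the added call: throttle the writer instead of letting it fill
         * all of memory with dirty pages faster than the disk drains them */
        balance_dirty_pages_ratelimited(mapping);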

    Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
     
  • The MPOL_BIND policy creates a zonelist that is used for allocations
    controlled by that mempolicy. As the per-node zonelist is already being
    filtered based on a zone id, this patch adds a version of __alloc_pages() that
    takes a nodemask for further filtering. This eliminates the need for
    MPOL_BIND to create a custom zonelist.

    A positive benefit of this is that allocations using MPOL_BIND now use the
    local node's distance-ordered zonelist instead of a custom node-id-ordered
    zonelist. I.e., pages will be allocated from the closest allowed node with
    available memory.
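
    The new entry point looks like this (sketch of the signature):

        /* like __alloc_pages(), but consider only zones on nodes in *nodemask */
        struct page *__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
                                            struct zonelist *zonelist,
                                            nodemask_t *nodemask);

    MPOL_BIND can then pass its nodemask directly instead of building and
    maintaining a custom zonelist.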

    [Lee.Schermerhorn@hp.com: Mempolicy: update stale documentation and comments]
    [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask]
    [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework]
    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Filtering zonelists requires very frequent use of zone_idx(). This is
    costly as it involves a lookup of another structure and a subtraction
    operation. As the zone_idx is often required, it should be quickly
    accessible. The node idx could also be stored here if it were found that
    accessing zone->node is significant, which may be the case on workloads
    where nodemasks are heavily used.

    This patch introduces a struct zoneref to store a zone pointer and a zone
    index. The zonelist then consists of an array of these struct zonerefs which
    are looked up as necessary. Helpers are given for accessing the zone index as
    well as the node index.
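
    Sketched out, the structure and its accessors look like this:

        struct zoneref {
                struct zone *zone;      /* pointer to the actual zone */
                int zone_idx;           /* cached zone_idx(zone) */
        };

        static inline struct zone *zonelist_zone(struct zoneref *zoneref)
        {
                return zoneref->zone;
        }

        static inline int zonelist_zone_idx(struct zoneref *zoneref)
        {
                return zoneref->zone_idx;
        }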

    [kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers]
    [hugh@veritas.com: mm-have-zonelist: fix memcg ooms]
    [hugh@veritas.com: just return do_try_to_free_pages]
    [hugh@veritas.com: do_try_to_free_pages gfp_mask redundant]
    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently a node has two sets of zonelists, one for each zone type in the
    system and a second set for GFP_THISNODE allocations. Based on the zones
    allowed by a gfp mask, one of these zonelists is selected. All of these
    zonelists consume memory and occupy cache lines.

    This patch replaces the multiple zonelists per-node with two zonelists. The
    first contains all populated zones in the system, ordered by distance, for
    fallback allocations when the target/preferred node has no free pages. The
    second contains all populated zones in the node suitable for GFP_THISNODE
    allocations.

    An iterator macro called for_each_zone_zonelist() is introduced that
    iterates through each zone allowed by the GFP flags in the selected
    zonelist.
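
    A typical caller then looks like this (sketch):

        struct zoneref *z;
        struct zone *zone;

        /* walk every populated zone the GFP flags allow, in fallback order */
        for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask)) {
                /* try to allocate from 'zone'; fall through to the next
                 * zone on failure */
        }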

    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Introduce a node_zonelist() helper function. It is used to lookup the
    appropriate zonelist given a node and a GFP mask. The patch on its own is a
    cleanup but it helps clarify parts of the two-zonelist-per-node patchset. If
    necessary, it can be merged with the next patch in this set without problems.
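
    By the end of the series the helper reduces to roughly this (sketch):

        /* map (node, gfp mask) to the zonelist allocations should use */
        static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
        {
                return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
        }

        /* callers then simply write */
        page = __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));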

    Reviewed-by: Christoph Lameter
    Signed-off-by: Mel Gorman
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The following patches replace multiple zonelists per node with two zonelists
    that are filtered based on the GFP flags. The patches as a set fix a bug with
    regard to the use of MPOL_BIND and ZONE_MOVABLE. With this patchset, the
    MPOL_BIND will apply to the two highest zones when the highest zone is
    ZONE_MOVABLE. This should be considered as an alternative fix for the
    MPOL_BIND+ZONE_MOVABLE in 2.6.23 to the previously discussed hack that filters
    only custom zonelists.

    The first patch cleans up an inconsistency where direct reclaim uses
    zonelist->zones where other places use zonelist.

    The second patch introduces a helper function node_zonelist() for looking up
    the appropriate zonelist for a GFP mask which simplifies patches later in the
    set.

    The third patch defines/remembers the "preferred zone" for numa statistics, as
    it is no longer always the first zone in a zonelist.

    The fourth patch replaces the multiple zonelists with two zonelists that
    are filtered. Two zonelists are needed because the memoryless-node
    patchset introduces a second set of zonelists for __GFP_THISNODE.

    The fifth patch introduces helper macros for retrieving the zone and node
    indices of entries in a zonelist.

    The final patch introduces filtering of the zonelists based on a nodemask.
    Two zonelists exist per node, one for normal allocations and one for
    __GFP_THISNODE.

    Performance results varied depending on the machine configuration. In real
    workloads the gain/loss will depend on how much the userspace portion of the
    benchmark benefits from having more cache available due to reduced referencing
    of zonelists.

    These are the range of performance losses/gains when running against
    2.6.24-rc4-mm1. The set and these machines are a mix of i386, x86_64 and
    ppc64 both NUMA and non-NUMA.
                                    loss    to  gain
    Total CPU time on Kernbench:  -0.86%    to  1.13%
    Elapsed time on Kernbench:    -0.79%    to  0.76%
    page_test from aim9:          -4.37%    to  0.79%
    brk_test from aim9:           -0.71%    to  4.07%
    fork_test from aim9:          -1.84%    to  4.60%
    exec_test from aim9:          -0.71%    to  1.08%

    This patch:

    The allocator deals with zonelists which indicate the order in which zones
    should be targeted for an allocation. Similarly, direct reclaim of pages
    iterates over an array of zones. For consistency, this patch converts direct
    reclaim to use a zonelist. No functionality is changed by this patch. This
    simplifies zonelist iterators in the next patch.

    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Migrate flags must be set on slab creation as agreed upon when the antifrag
    logic was reviewed. Otherwise some slabs of a slabcache will end up in the
    unmovable and others in the reclaimable section depending on which flag was
    active when a new slab page was allocated.

    This likely slid in somehow when antifrag was merged. Remove it.

    The buffer_heads are always allocated with __GFP_RECLAIMABLE because the
    SLAB_RECLAIM_ACCOUNT option is set. The set_migrateflags() never had any
    effect there.

    Radix tree allocations are not directly reclaimable but they are allocated
    with __GFP_RECLAIMABLE set on each allocation. We now set
    SLAB_RECLAIM_ACCOUNT on radix tree slab creation making sure that radix
    tree slabs are consistently placed in the reclaimable section. Radix tree
    slabs will also be accounted as such.
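
    For the radix tree this means marking the cache itself, roughly (sketch):

        radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
                        sizeof(struct radix_tree_node), 0,
                        SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
                        radix_tree_node_ctor);

    so every slab page of radix tree nodes lands in the reclaimable section,
    instead of relying on __GFP_RECLAIMABLE being passed at each allocation.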

    There is then no user left of set_migrateflags(), so remove it.

    Signed-off-by: Christoph Lameter
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

05 Apr, 2008

1 commit

  • Mikulas Patocka noted that the optimization where we check if a buffer
    was already dirty (and we avoid re-dirtying it) was not really SMP-safe.

    Since the read of the old status was not synchronized with anything, an
    aggressive CPU re-ordering of memory accesses might have moved that read
    up to before the data was even written to the buffer, while another CPU
    cleaned the buffer again in the meantime, causing the newly dirtied state
    to never actually hit the disk.

    Admittedly this would probably never trigger in practice, but it's still
    wrong.

    Mikulas sent a patch that fixed the problem, but I dislike the subtlety
    of the whole optimization, so this is an alternate fix that is more
    explicit about the particular SMP ordering for the optimization, and
    separates out the speculative reads of the buffer state into its own
    conditional (and makes the memory barrier only happen if we are likely
    to actually hit the optimized case in the first place).

    I considered removing the optimization entirely, but Andrew argued for
    its continued existence. I'm a push-over.
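
    The resulting shape of mark_buffer_dirty() is roughly (sketch):

        void mark_buffer_dirty(struct buffer_head *bh)
        {
                /*
                 * Speculative check: only worth doing if the buffer already
                 * looks dirty, and only trusted after a full barrier so the
                 * read cannot float above our modification of the data.
                 */
                if (buffer_dirty(bh)) {
                        smp_mb();
                        if (buffer_dirty(bh))
                                return;
                }

                if (!test_set_buffer_dirty(bh))
                        __set_page_dirty(bh->b_page, page_mapping(bh->b_page), 0);
        }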

    Cc: Mikulas Patocka
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

29 Mar, 2008

1 commit

  • The current nobh_write_end() implementation ignores the partial-write
    case (copied < len) if the page was fully mapped, and simply marks the
    page as Uptodate. This is totally wrong, because the area
    [pos+copied, pos+len) wasn't updated explicitly by the previous
    write_begin call; it simply contains garbage from the pagecache, which
    results in data leakage.

    #TEST_CASE_BEGIN:
    ~~~~~~~~~~~~~~~~
    In fact, the issue is triggered by a classic testcase:
    open("/mnt/test", O_RDWR|O_CREAT|O_TRUNC, 0666) = 3
    ftruncate(3, 409600) = 0
    writev(3, [{"a", 1}, {NULL, 4095}], 2) = 1
    ##TESTCASE_SOURCE:
    ~~~~~~~~~~~~~~~~~
    #include <stdio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/uio.h>

    int main(int argc, char **argv)
    {
        int fd, ret;
        struct iovec iov[2];

        fd = open(argv[1], O_RDWR|O_CREAT|O_TRUNC, 0666);
        ftruncate(fd, 409600);
        /* one valid byte followed by a deliberately bad iovec */
        iov[0].iov_base = "a";
        iov[0].iov_len = 1;
        iov[1].iov_base = NULL;
        iov[1].iov_len = 4096;
        ret = writev(fd, iov, sizeof(iov)/sizeof(struct iovec));
        printf("writev = %d, err = %d\n", ret, errno);
        return 0;
    }
    ##TESTCASE RESULT:
    ~~~~~~~~~~~~~~~~~~
    [root@ts63 ~]# mount | grep mnt2
    /dev/mapper/test on /mnt2 type ext2 (rw,nobh)
    [root@ts63 ~]# /tmp/writev /mnt2/test
    writev = 1, err = 0
    [root@ts63 ~]# hexdump -C /mnt2/test

    00000000 61 65 62 6f 6f 74 00 00 f0 b9 b4 59 3a 00 00 00 |aeboot.....Y:...|
    00000010 20 00 00 00 00 00 00 00 21 00 00 00 00 00 00 00 | .......!.......|
    00000020 df df df df df df df df df df df df df df df df |................|
    00000030 3a 00 00 00 2a 00 00 00 21 00 00 00 00 00 00 00 |:...*...!.......|
    00000040 60 c0 8c 00 00 00 00 00 40 4a 8d 00 00 00 00 00 |`.......@J......|
    00000050 00 00 00 00 00 00 00 00 41 00 00 00 00 00 00 00 |........A.......|
    00000060 74 69 6d 65 20 64 64 20 69 66 3d 2f 64 65 76 2f |time dd if=/dev/|
    00000070 6c 6f 6f 70 30 20 20 6f 66 3d 2f 64 65 76 2f 6e |loop0 of=/dev/n|
    skip..
    00000f50 00 00 00 00 00 00 00 00 31 00 00 00 00 00 00 00 |........1.......|
    00000f60 6d 6b 66 73 2e 65 78 74 33 20 2f 64 65 76 2f 76 |mkfs.ext3 /dev/v|
    00000f70 7a 76 67 2f 74 65 73 74 20 2d 62 34 30 39 36 00 |zvg/test -b4096.|
    00000f80 a0 fe 8c 00 00 00 00 00 21 00 00 00 00 00 00 00 |........!.......|
    00000f90 23 31 32 30 35 39 35 30 34 30 34 00 3a 00 00 00 |#1205950404.:...|
    00000fa0 20 00 8d 00 00 00 00 00 21 00 00 00 00 00 00 00 | .......!.......|
    00000fb0 d0 cf 8c 00 00 00 00 00 10 d0 8c 00 00 00 00 00 |................|
    00000fc0 00 00 00 00 00 00 00 00 41 00 00 00 00 00 00 00 |........A.......|
    00000fd0 6d 6f 75 6e 74 20 2f 64 65 76 2f 76 7a 76 67 2f |mount /dev/vzvg/|
    00000fe0 74 65 73 74 20 20 2f 76 7a 20 2d 6f 20 64 61 74 |test /vz -o dat|
    00000ff0 61 3d 77 72 69 74 65 62 61 63 6b 00 00 00 00 00 |a=writeback.....|
    00001000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|

    As you can see, the file's page contains garbage from the pagecache
    instead of zeros.
    #TEST_CASE_END

    The attached patch:
    - Adds a sanity-check BUG_ON in order to prevent incorrect usage by the
    caller. This is a function invariant, because a page cannot have buffers
    and a non-zero *fsdata pointer at the same time.
    - Always attaches buffers to the page if it is a partial-write case.
    - Always switches back to generic_write_end() if the page has buffers.
    This is reasonable, because if the page already has buffers then
    generic_write_begin was called previously.

    Signed-off-by: Dmitri Monakhov
    Reviewed-by: Nick Piggin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitri Monakhov
     

20 Mar, 2008

1 commit

  • Fix kernel-doc notation warnings in fs/.

    Warning(mmotm-2008-0314-1449//fs/super.c:560): missing initial short description on line:
    * mark_files_ro
    Warning(mmotm-2008-0314-1449//fs/locks.c:1277): missing initial short description on line:
    * lease_get_mtime
    Warning(mmotm-2008-0314-1449//fs/locks.c:1277): missing initial short description on line:
    * lease_get_mtime
    Warning(mmotm-2008-0314-1449//fs/namei.c:1368): missing initial short description on line:
    * lookup_one_len: filesystem helper to lookup single pathname component
    Warning(mmotm-2008-0314-1449//fs/buffer.c:3221): missing initial short description on line:
    * bh_uptodate_or_lock: Test whether the buffer is uptodate
    Warning(mmotm-2008-0314-1449//fs/buffer.c:3240): missing initial short description on line:
    * bh_submit_read: Submit a locked buffer for reading
    Warning(mmotm-2008-0314-1449//fs/fs-writeback.c:30): missing initial short description on line:
    * writeback_acquire: attempt to get exclusive writeback access to a device
    Warning(mmotm-2008-0314-1449//fs/fs-writeback.c:47): missing initial short description on line:
    * writeback_in_progress: determine whether there is writeback in progress
    Warning(mmotm-2008-0314-1449//fs/fs-writeback.c:58): missing initial short description on line:
    * writeback_release: relinquish exclusive writeback access against a device.
    Warning(mmotm-2008-0314-1449//include/linux/jbd.h:351): contents before sections
    Warning(mmotm-2008-0314-1449//include/linux/jbd.h:561): contents before sections
    Warning(mmotm-2008-0314-1449//fs/jbd/transaction.c:1935): missing initial short description on line:
    * void journal_invalidatepage()

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

05 Mar, 2008

1 commit

  • Fix a NULL pointer dereference in fsync_buffers_list() introduced by the
    recent fix of races in private_list handling. Since bh->b_assoc_map has
    been cleared in __remove_assoc_queue(), we should really use the original
    value stored in the 'mapping' variable.

    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

04 Mar, 2008

1 commit


09 Feb, 2008

3 commits

  • There are two possible races in handling of private_list in buffer cache.

    1) When fsync_buffers_list() processes a private_list, it clears
    b_assoc_mapping and moves the buffer to its private list. Now
    drop_buffers() comes along, sees the buffer is on a list and so calls
    __remove_assoc_queue(), which complains about b_assoc_mapping being
    cleared (as it cannot propagate a possible IO error). This race has
    actually been observed in the wild.

    2) When fsync_buffers_list() processes a private_list,
    mark_buffer_dirty_inode() can be called on a bh which is already on the
    private list of fsync_buffers_list(). As the buffer is on some list (note
    that the check is performed without private_lock), it is not re-added to
    the mapping's private_list, and after fsync_buffers_list() finishes we
    have a dirty buffer which should be on private_list but isn't. This race
    has not been reported, probably because most (but not all) callers of
    mark_buffer_dirty_inode() hold i_mutex and thus are serialized with
    fsync().

    Fix these issues by not clearing b_assoc_map when fsync_buffers_list()
    moves a buffer to a dedicated list, and by reinserting the buffer into
    private_list when it is found dirty after we have submitted it for IO. We
    also change the tests for whether a buffer is on a private list from
    !list_empty(&bh->b_assoc_buffers) to bh->b_assoc_map, so that they are
    single-word reads and hence lockless checks are safe.
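
    Concretely, the check changes like this (sketch):

        /* before: needs private_lock for the list linkage to be stable */
        on_list = !list_empty(&bh->b_assoc_buffers);

        /* after: b_assoc_map is non-NULL iff the buffer is associated with
         * a mapping, so a plain one-word read is safe without the lock */
        on_list = (bh->b_assoc_map != NULL);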

    Signed-off-by: Jan Kara
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Harvey Harrison
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harvey Harrison
     
  • This is a rewrite of the ramdisk block device driver.

    The old one is really difficult because it effectively implements a block
    device which serves data out of its own buffer cache. It relies on the
    dirty bit being set to pin its backing store in cache; however, there are
    non-trivial paths which can clear the dirty bit (eg. try_to_free_buffers()),
    which had recently led to data corruption. And in general it is completely
    wrong for a block device driver to do this.

    The new one is more like a regular block device driver. It has no idea
    about vm/vfs stuff. Its backing store is similar to the buffer cache (a
    simple radix-tree of pages), but it doesn't know anything about the page
    cache (the pages in the radix tree are not pagecache pages).
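
    The backing store lookup is essentially this (sketch; simplified from the
    new driver, RCU locking omitted):

        static struct page *brd_lookup_page(struct brd_device *brd,
                                            sector_t sector)
        {
                /* 512-byte sectors per page */
                pgoff_t idx = sector >> (PAGE_SHIFT - 9);

                /* a private radix tree of pages -- not pagecache pages */
                return radix_tree_lookup(&brd->brd_pages, idx);
        }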

    There is one slight downside -- direct block device access and filesystem
    metadata access goes through an extra copy and gets stored in RAM twice.
    However, this downside is only slight, because the real buffercache of the
    device is now reclaimable (because we're not playing crazy games with it), so
    under memory intensive situations, footprint should effectively be the same --
    maybe even a slight advantage to the new driver because it can also reclaim
    buffer heads.

    The fact that it now goes through all the regular vm/fs paths makes it
    much more useful for testing, too.

    text  data  bss   dec  hex  filename
    2837   849  384  4070  fe6  drivers/block/rd.o
    3528   371   12  3911  f47  drivers/block/brd.o

    Text is larger, but data and bss are smaller, making total size smaller.

    A few other nice things about it:
    - Similar structure and layout to the new loop device handling.
    - Dynamic ramdisk creation.
    - Runtime flexible buffer head size (because it is no longer part of the
    ramdisk code).
    - Boot / load time flexible ramdisk size, which could easily be extended
    to a per-ramdisk runtime changeable size (eg. with an ioctl).
    - Can use highmem for the backing store.

    [akpm@linux-foundation.org: fix build]
    [byron.bbradley@gmail.com: make rd_size non-static]
    Signed-off-by: Nick Piggin
    Signed-off-by: Byron Bradley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

06 Feb, 2008

2 commits

  • The constructor for buffer_head slabs was removed recently. We need the
    constructor back for slab defrag in order to ensure that slab objects
    always have a definite state even before we allocate them.

    I think we mistakenly merged the removal of the constructor into a
    cleanup patch. You (ie: akpm) had a test that showed that the removal of
    the constructor led to a small regression. The prior state makes things
    easier for slab defrag.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Simplify page cache zeroing of segments of pages through 3 functions

    zero_user_segments(page, start1, end1, start2, end2)

    Zeros two segments of the page. It takes the position where to
    start and end the zeroing which avoids length calculations and
    makes code clearer.

    zero_user_segment(page, start, end)

    Same for a single segment.

    zero_user(page, start, length)

    Length variant for the case where we know the length.

    We remove the zero_user_page macro. Issues:

    1. It's a macro. Inline functions are preferable.

    2. The KM_USER0 macro is only defined for HIGHMEM.

    Having to treat this special case everywhere makes the
    code needlessly complex. The parameter for zeroing is always
    KM_USER0 except in one single case that we open code.

    Avoiding KM_USER0 means a lot of code no longer has to deal
    with the special casing for HIGHMEM. Dealing with
    kmap is only necessary for HIGHMEM configurations. In those
    configurations we use KM_USER0 like we do for a series of other
    functions defined in highmem.h.

    Since KM_USER0 depends on HIGHMEM, the existing zero_user_page
    could only be a macro, not an inline function. The zero_user_*
    functions introduced here can be inline because that constant is
    not used when these functions are called.

    Also extract the flushing of the caches to be outside of the kmap.
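
    Typical conversions then look like this (illustrative):

        /* zero the tail of a partially written block */
        zero_user_segment(page, offset + copied, offset + len);

        /* zero everything outside [from, to) in a single call */
        zero_user_segments(page, 0, from, to, PAGE_CACHE_SIZE);

        /* the old equivalent needed two calls and an explicit KM_USER0:
         *   zero_user_page(page, 0, from, KM_USER0);
         *   zero_user_page(page, to, PAGE_CACHE_SIZE - to, KM_USER0);
         */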

    [akpm@linux-foundation.org: fix nfs and ntfs build]
    [akpm@linux-foundation.org: fix ntfs build some more]
    Signed-off-by: Christoph Lameter
    Cc: Steven French
    Cc: Michael Halcrow
    Cc:
    Cc: Steven Whitehouse
    Cc: Trond Myklebust
    Cc: "J. Bruce Fields"
    Cc: Anton Altaparmakov
    Cc: Mark Fasheh
    Cc: David Chinner
    Cc: Michael Halcrow
    Cc: Steven French
    Cc: Steven Whitehouse
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

29 Jan, 2008

1 commit


21 Oct, 2007

1 commit

  • This path mustn't have been tested :( I did attempt to exercise it
    by injecting failures here, but I suspect PageMappedToDisk may have
    been getting in the way. Will need more of a look, although I think
    nobh mode is OK for an -rc1 (it shouldn't eat anyone's data).

    Commit 03158cd7eb3374843de68421142ca5900df845d9 ("fs: restore nobh")
    introduced a NULL deref. Spotted by the Coverity checker.

    Signed-off-by: Nick Piggin
    Cc: Badari Pulavarty
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

17 Oct, 2007

5 commits

  • Miklos Szeredi and I identified a writeback bug:

    > The following strange behavior can be observed:
    >
    > 1. large file is written
    > 2. after 30 seconds, nr_dirty goes down by 1024
    > 3. then for some time (< 30 sec) nothing happens (disk idle)
    > 4. then nr_dirty again goes down by 1024
    > 5. repeat from 3. until whole file is written
    >
    > So basically a 4Mbyte chunk of the file is written every 30 seconds.
    > I'm quite sure this is not the intended behavior.

    It can be produced by the following test scheme:

    # cat bin/test-writeback.sh
    grep nr_dirty /proc/vmstat
    echo 1 > /proc/sys/fs/inode_debug
    dd if=/dev/zero of=/var/x bs=1K count=204800&
    while true; do grep nr_dirty /proc/vmstat; sleep 1; done

    # bin/test-writeback.sh
    nr_dirty 19207
    nr_dirty 19207
    nr_dirty 30924
    204800+0 records in
    204800+0 records out
    209715200 bytes (210 MB) copied, 1.58363 seconds, 132 MB/s
    nr_dirty 47150
    nr_dirty 47141
    nr_dirty 47142
    nr_dirty 47142
    nr_dirty 47142
    nr_dirty 47142
    nr_dirty 47205
    nr_dirty 47214
    nr_dirty 47214
    nr_dirty 47214
    nr_dirty 47214
    nr_dirty 47214
    nr_dirty 47215
    nr_dirty 47216
    nr_dirty 47216
    nr_dirty 47216
    nr_dirty 47154
    nr_dirty 47143
    nr_dirty 47143
    nr_dirty 47143
    nr_dirty 47143
    nr_dirty 47143
    nr_dirty 47142
    nr_dirty 47142
    nr_dirty 47142
    nr_dirty 47142
    nr_dirty 47134
    nr_dirty 47134
    nr_dirty 47135
    nr_dirty 47135
    nr_dirty 47135
    nr_dirty 46097

    543             if (wbc->pages_skipped != pages_skipped) {
    544                     /*
    545                      * writeback is not making progress due to locked
    546                      * buffers.  Skip this inode for now.
    547                      */
    548                     redirty_tail(inode);
    549             }

    More debug efforts show that __block_write_full_page()
    never has the chance to call submit_bh() for that big dirty file:
    the buffer head is *clean*. So basically no page IO is issued by
    __block_write_full_page(), hence pages_skipped goes up.

    Also the comment in generic_sync_sb_inodes():

    544 /*
    545 * writeback is not making progress due to locked
    546 * buffers. Skip this inode for now.
    547 */

    and the comment in __block_write_full_page():

    1713 /*
    1714 * The page was marked dirty, but the buffers were
    1715 * clean. Someone wrote them back by hand with
    1716 * ll_rw_block/submit_bh. A rare case.
    1717 */

    do not quite agree with each other. The page writeback should be skipped for
    'locked buffer', but here it is 'clean buffer'!

    This patch fixes this bug. Though I'm not sure why __block_write_full_page()
    is called only to do nothing and who actually issued the writeback for us.

    These are the two possible new behaviors after the patch:

    1) pretty nice: wait 30s and write ALL:)
    2) not so good:
    - during the dd: ~16M
    - after 30s: ~4M
    - after 5s: ~4M
    - after 5s: ~176M

    The next patch will fix case (2).

    Cc: David Chinner
    Cc: Ken Chen
    Signed-off-by: Fengguang Wu
    Signed-off-by: David Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Count per BDI reclaimable pages; nr_reclaimable = nr_dirty + nr_unstable.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • This patch marks a number of allocations that are either short-lived,
    such as network buffers, or reclaimable, such as inode allocations. When
    something like updatedb is called, long-lived and unmovable kernel
    allocations tend to be spread throughout the address space, which
    increases fragmentation.

    This patch groups these allocations together as much as possible by adding a
    new MIGRATE_TYPE. The MIGRATE_RECLAIMABLE type is for allocations that can be
    reclaimed on demand, but not moved. i.e. they can be migrated by deleting
    them and re-reading the information from elsewhere.
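
    Allocation sites opt in by tagging their GFP mask; for buffer heads, for
    instance, it is roughly (sketch):

        /* a reclaimable (but not movable) allocation: a buffer_head can be
         * freed and rebuilt from disk on demand, but never migrated */
        bh = kmem_cache_alloc(bh_cachep,
                              set_migrateflags(gfp_flags, __GFP_RECLAIMABLE));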

    Signed-off-by: Mel Gorman
    Cc: Andy Whitcroft
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Implement nobh in new aops. This is a bit tricky. FWIW, nobh_truncate is
    now implemented in a way that does not create blocks in sparse regions,
    which is a silly thing for it to have been doing (isn't it?)

    ext2 survives fsx and fsstress. jfs is converted as well... ext3
    should be easy to do (but not done yet).

    [akpm@linux-foundation.org: coding-style fixes]
    Cc: Badari Pulavarty
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin