04 Oct, 2006

1 commit


30 Jun, 2006

1 commit

  • The recent generic_file_write() deadlock fix caused
    generic_file_buffered_write() to loop inifinitely when presented with a
    zero-length iovec segment. Fix.

    Note that this fix deliberately avoids calling ->prepare_write(),
    ->commit_write() etc with a zero-length write. This is because I don't trust
    all filesystems to get that right.

    This is a cautious approach, for 2.6.17.x. For 2.6.18 we should just go ahead
    and call ->prepare_write() and ->commit_write() with the zero length and fix
    any broken filesystems. So I'll make that change once this code is stabilised
    and backported into 2.6.17.x.

    The reason for preferring to call ->prepare_write() and ->commit_write() with
    the zero-length segment: a zero-length segment _should_ be sufficiently
    uncommon that this is the correct way of handling it. We don't want to
    optimise for poorly-written userspace at the expense of well-written
    userspace.

    Cc: "Vladimir V. Saveliev"
    Cc: Neil Brown
    Cc: Martin Schwidefsky
    Cc: Chris Wright
    Cc: Greg KH
    Cc:
    Cc: walt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

26 Jun, 2006

1 commit

  • The problem is that when we write to a file, the copy from userspace to
    pagecache is first done with preemption disabled, so if the source address is
    not immediately available the copy fails *and* *zeros* *the* *destination*.

    This is a problem because a concurrent read (which admittedly is an odd thing
    to do) might see zeros rather that was there before the write, or what was
    there after, or some mixture of the two (any of these being a reasonable thing
    to see).

    If the copy did fail, it will immediately be retried with preemption
    re-enabled so any transient problem with accessing the source won't cause an
    error.

    The first copying does not need to zero any uncopied bytes, and doing so
    causes the problem. It uses copy_from_user_atomic rather than copy_from_user
    so the simple expedient is to change copy_from_user_atomic to *not* zero out
    bytes on failure.

    The first of these two patches prepares for the change by fixing two places
    which assume copy_from_user_atomic does zero the tail. The two usages are
    very similar pieces of code which copy from a userspace iovec into one or more
    page-cache pages. These are changed to remove the assumption.

    The second patch changes __copy_from_user_inatomic* to not zero the tail.
    Once these are accepted, I will look at similar patches of other architectures
    where this is important (ppc, mips and sparc being the ones I can find).

    This patch:

    There is a problem with __copy_from_user_inatomic zeroing the tail of the
    buffer in the case of an error. As it is called in atomic context, the error
    may be transient, so it results in zeros being written where maybe they
    shouldn't be.

    In the usage in filemap, this opens a window for a well timed read to see data
    (zeros) which is not consistent with any ordering of reads and writes.

    Most cases where __copy_from_user_inatomic is called, a failure results in
    __copy_from_user being called immediately. As long as the latter zeros the
    tail, the former doesn't need to. However in *copy_from_user_iovec
    implementations (in both filemap and ntfs/file), it is assumed that
    copy_from_user_inatomic will zero the tail.

    This patch removes that assumption, so that after this patch it will
    be safe for copy_from_user_inatomic to not zero the tail.

    This patch also adds some commentary to filemap.h and asm-i386/uaccess.h.

    After this patch, all architectures that might disable preempt when
    kmap_atomic is called need to have their __copy_from_user_inatomic* "fixed".
    This includes
    - powerpc
    - i386
    - mips
    - sparc

    Signed-off-by: Neil Brown
    Cc: David Howells
    Cc: Anton Altaparmakov
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Ralf Baechle
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

23 Jun, 2006

1 commit

  • Use the x86 cache-bypassing copy instructions for copy_from_user().

    Some performance data are

    Total of GLOBAL_POWER_EVENTS (CPU cycle samples)

    2.6.12.4.orig 1921587
    2.6.12.4.nt 1599424
    1599424/1921587=83.23% (16.77% reduction)

    BSQ_CACHE_REFERENCE (L3 cache miss)
    2.6.12.4.orig 57427
    2.6.12.4.nt 20858
    20858/57427=36.32% (63.7% reduction)

    L3 cache miss reduction of __copy_from_user_ll
    samples %
    37408 65.1412 vmlinux __copy_from_user_ll
    23 0.1103 vmlinux __copy_user_zeroing_intel_nocache
    23/37408=0.061% (99.94% reduction)

    Top 5 of 2.6.12.4.nt
    Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
    samples % app name symbol name
    128392 8.0274 vmlinux __copy_user_zeroing_intel_nocache
    64206 4.0143 vmlinux journal_add_journal_head
    59746 3.7355 vmlinux do_get_write_access
    47674 2.9807 vmlinux journal_put_journal_head
    46021 2.8774 vmlinux journal_dirty_metadata
    pattern9-0-cpu4-0-09011728/summary.out

    Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x3f (multiple flags) count 3000
    samples % app name symbol name
    69755 4.2861 vmlinux __copy_user_zeroing_intel_nocache
    55685 3.4215 vmlinux journal_add_journal_head
    52371 3.2179 vmlinux __find_get_block
    45504 2.7960 vmlinux journal_put_journal_head
    36005 2.2123 vmlinux journal_stop
    pattern9-0-cpu4-0-09011744/summary.out

    Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
    samples % app name symbol name
    1147 5.4994 vmlinux journal_add_journal_head
    881 4.2240 vmlinux journal_dirty_data
    872 4.1809 vmlinux blk_rq_map_sg
    734 3.5192 vmlinux journal_commit_transaction
    617 2.9582 vmlinux radix_tree_delete
    pattern9-0-cpu4-0-09011731/summary.out

    iozone results are

    original 2.6.12.4 CPU time = 207.768 sec
    cache aware CPU time = 184.783 sec
    (three times run)
    184.783/207.768=88.94% (11.06% reduction)

    original:
    pattern9-0-cpu4-0-08191720/iozone.out: CPU Utilization: Wall time 45.997 CPU time 64.527 CPU utilization 140.28 %
    pattern9-0-cpu4-0-08191741/iozone.out: CPU Utilization: Wall time 46.878 CPU time 71.933 CPU utilization 153.45 %
    pattern9-0-cpu4-0-08191743/iozone.out: CPU Utilization: Wall time 45.152 CPU time 71.308 CPU utilization 157.93 %

    cache awre:
    pattern9-0-cpu4-0-09011728/iozone.out: CPU Utilization: Wall time 44.842 CPU time 62.465 CPU utilization 139.30 %
    pattern9-0-cpu4-0-09011731/iozone.out: CPU Utilization: Wall time 44.718 CPU time 59.273 CPU utilization 132.55 %
    pattern9-0-cpu4-0-09011744/iozone.out: CPU Utilization: Wall time 44.367 CPU time 63.045 CPU utilization 142.10 %

    Signed-off-by: Hiro Yoshioka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hiro Yoshioka
     

24 Jun, 2005

2 commits

  • This patch reworks filemap_xip.c with the goal to reduce code duplication
    from mm/filemap.c. It applies agains 2.6.12-rc6-mm1. Instead of
    implementing the aio functions, this one implements the synchronous
    read/write functions only. For readv and writev, the generic fallback is
    used. For aio, we rely on the application doing the fallback. Since our
    "synchronous" function does memcpy immediately anyway, there is no
    performance difference between using the fallbacks or implementing each
    operation.

    Signed-off-by: Carsten Otte
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Carsten Otte
     
  • - generic_file* file operations do no longer have a xip/non-xip split
    - filemap_xip.c implements a new set of fops that require get_xip_page
    aop to work proper. all new fops are exported GPL-only (don't like to
    see whatever code use those except GPL modules)
    - __xip_unmap now uses page_check_address, which is no longer static
    in rmap.c, and defined in linux/rmap.h
    - mm/filemap.h is now much more clean, plainly having just Linus'
    inline funcs moved here from filemap.c
    - fix includes in filemap_xip to make it build cleanly on i386

    Signed-off-by: Carsten Otte
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Carsten Otte