30 Sep, 2006

1 commit


26 Sep, 2006

2 commits

  • Let's try to keep mm/ comments more useful and up to date. This is a start.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • lock_page needs the caller to have a reference on the page->mapping inode
    due to sync_page, ergo set_page_dirty_lock is obviously buggy according to
    its comments.

    Solve it by introducing a new lock_page_nosync which does not do a sync_page.

    akpm: unpleasant solution to an unpleasant problem. If it goes wrong it could
    cause great slowdowns while the lock_page() caller waits for kblockd to
    perform the unplug. And if a filesystem has special sync_page() requirements
    (none presently do), permanent hangs are possible.

    OTOH, set_page_dirty_lock() is usually (always?) called against userspace
    pages. They are always up to date, so there shouldn't be any pending read I/O
    against these pages.
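
    A minimal sketch of how such a helper can look, modelled on the description
    above (it assumes the 2.6-era TestSetPageLocked()/might_sleep() helpers; the
    exact names in the final patch may differ):

        static inline void lock_page_nosync(struct page *page)
        {
                might_sleep();
                if (TestSetPageLocked(page))
                        __lock_page_nosync(page); /* sleeps, but never calls sync_page() */
        }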

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

30 Jul, 2006

1 commit


01 Jul, 2006

4 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial:
    Remove obsolete #include
    remove obsolete swsusp_encrypt
    arch/arm26/Kconfig typos
    Documentation/IPMI typos
    Kconfig: Typos in net/sched/Kconfig
    v9fs: do not include linux/version.h
    Documentation/DocBook/mtdnand.tmpl: typo fixes
    typo fixes: specfic -> specific
    typo fixes in Documentation/networking/pktgen.txt
    typo fixes: occuring -> occurring
    typo fixes: infomation -> information
    typo fixes: disadvantadge -> disadvantage
    typo fixes: aquire -> acquire
    typo fixes: mecanism -> mechanism
    typo fixes: bandwith -> bandwidth
    fix a typo in the RTC_CLASS help text
    smb is no longer maintained

    Manually merged trivial conflict in arch/um/kernel/vmlinux.lds.S

    Linus Torvalds
     
  • The remaining counters in page_state after the zoned VM counter patches
    have been applied are all just for show in /proc/vmstat. They have no
    essential function for the VM.

    We use a simple increment of per-CPU variables. In order to avoid the most
    severe races we disable preemption. Disabling preemption does not prevent the
    race between an increment and an interrupt handler incrementing the same
    statistics counter. However, that race is exceedingly rare; we may lose one
    increment or so, and there is no requirement (at least not in the kernel) that
    the VM event counters be accurate.

    In the non-preempt case this results in a simple increment for each
    counter. For many architectures this will be reduced by the compiler to a
    single instruction. This single instruction is atomic for i386 and x86_64,
    and therefore even the rare race condition in an interrupt is avoided for
    both architectures in most cases.

    The patchset also adds an off switch for embedded systems that allows
    building Linux kernels without these counters.

    The implementation of these counters is through inline code that hopefully
    results in only a single increment instruction being emitted (i386, x86_64),
    or in the increment being hidden through instruction concurrency (EPIC
    architectures such as ia64 can get that done).
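
    For illustration, the per-CPU increment can be sketched roughly as below
    (names such as vm_event_states and count_vm_event are assumptions here, not
    necessarily the identifiers used by the patch):

        DECLARE_PER_CPU(struct vm_event_state, vm_event_states);

        static inline void count_vm_event(enum vm_event_item item)
        {
                /* get_cpu_var() disables preemption around the increment */
                get_cpu_var(vm_event_states).event[item]++;
                put_cpu_var(vm_event_states);
        }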

    Benefits:
    - VM event counter operations usually reduce to a single inline instruction
    on i386 and x86_64.
    - No interrupt disable, only preempt disable for the preempt case.
    Preempt disable can also be avoided by moving the counter into a spinlock.
    - Handling is similar to zoned VM counters.
    - Simple and easily extendable.
    - Can be omitted to reduce memory use on embedded systems.

    References:

    RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=113512330605497&w=2
    RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=114988082814934&w=2
    local_t http://marc.theaimsgroup.com/?l=linux-kernel&m=114991748606690&w=2
    V2 http://marc.theaimsgroup.com/?t=115014808400007&r=1&w=2
    V3 http://marc.theaimsgroup.com/?l=linux-kernel&m=115024767022346&w=2
    V4 http://marc.theaimsgroup.com/?l=linux-kernel&m=115047968808926&w=2

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Currently a single atomic variable is used to establish the size of the page
    cache in the whole machine. The zoned VM counters have the same method of
    implementation as the nr_pagecache code but also allow the determination of
    the pagecache size per zone.

    Remove the special implementation for nr_pagecache and make it a zoned counter
    named NR_FILE_PAGES.

    Updates of the page cache counters are always performed with interrupts off.
    We can therefore use the __ variant here.
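
    For example, adding a page to the page cache already runs under the
    mapping's tree_lock with interrupts disabled, so the accounting can look
    roughly like this (a sketch, based loosely on the add_to_page_cache() of
    that era):

        write_lock_irq(&mapping->tree_lock);
        error = radix_tree_insert(&mapping->page_tree, offset, page);
        if (!error) {
                page_cache_get(page);
                SetPageLocked(page);
                page->mapping = mapping;
                page->index = offset;
                mapping->nrpages++;
                __inc_zone_page_state(page, NR_FILE_PAGES); /* was: global nr_pagecache */
        }
        write_unlock_irq(&mapping->tree_lock);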

    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Signed-off-by: Jörn Engel
    Signed-off-by: Adrian Bunk

    Jörn Engel
     

30 Jun, 2006

1 commit

    The recent generic_file_write() deadlock fix caused
    generic_file_buffered_write() to loop infinitely when presented with a
    zero-length iovec segment. Fix.

    Note that this fix deliberately avoids calling ->prepare_write(),
    ->commit_write() etc with a zero-length write. This is because I don't trust
    all filesystems to get that right.

    This is a cautious approach, for 2.6.17.x. For 2.6.18 we should just go ahead
    and call ->prepare_write() and ->commit_write() with the zero length and fix
    any broken filesystems. So I'll make that change once this code is stabilised
    and backported into 2.6.17.x.

    The reason for preferring to call ->prepare_write() and ->commit_write() with
    the zero-length segment: a zero-length segment _should_ be sufficiently
    uncommon that this is the correct way of handling it. We don't want to
    optimise for poorly-written userspace at the expense of well-written
    userspace.
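
    A sketch of the cautious behaviour, inside the copy loop of
    generic_file_buffered_write() (variable and label names are illustrative):

        if (unlikely(bytes == 0)) {
                /* Do not call ->prepare_write()/->commit_write() for a
                 * zero-length segment; just move on to the next one. */
                status = 0;
                copied = 0;
                goto next_segment;
        }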

    Cc: "Vladimir V. Saveliev"
    Cc: Neil Brown
    Cc: Martin Schwidefsky
    Cc: Chris Wright
    Cc: Greg KH
    Cc:
    Cc: walt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

29 Jun, 2006

1 commit


28 Jun, 2006

1 commit

    generic_file_buffered_write() prefaults in user pages in order to avoid
    deadlocking when copying from the same page the write goes to.

    However, there is a problem when the write is vectored:
    fault_in_pages_readable() brings in the current segment, or part of it
    (maxlen). filemap_copy_from_user_iovec(), on the other hand, is called to
    copy a number of bytes (bytes) which may exceed the current segment, so it
    switches to the next segment, which has not been brought in yet, and a page
    fault is generated. That causes a deadlock if the fault is for the same page
    the write goes to: the page being written is locked and not up to date, and
    the fault will deadlock trying to lock the already-locked page.
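
    Schematically, the prefault-then-atomic-copy scheme looks like this (a
    sketch, not the exact patch):

        /* 1. Touch the source while no page lock is held, so a fault here
         *    cannot deadlock against the page about to be locked. */
        fault_in_pages_readable(buf, bytes);

        /* 2. Lock the destination pagecache page, then copy atomically;
         *    the copy must not fault while the page is locked. */
        kaddr = kmap_atomic(page, KM_USER0);
        left = __copy_from_user_inatomic(kaddr + offset, buf, bytes);
        kunmap_atomic(kaddr, KM_USER0);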

    [akpm@osdl.org: somewhat rewritten]
    Cc: Neil Brown
    Cc: Martin Schwidefsky
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir V. Saveliev
     

26 Jun, 2006

2 commits

    Back off the readahead size exponentially on I/O error.

    Michael Tokarev described the problem as:

    [QUOTE]
    Suppose there's a CD-ROM with a scratch/etc, and one sector is unreadable.
    In order to "fix" it, one has to read it and write it to another CD-ROM,
    or something.. or just ignore the error (if it's just a skip in a video
    stream). Let's assume the unreadable block is number U.

    But current behavior is just insane. An application requests block
    number N, which is before U. The kernel tries to read ahead blocks N..U.
    The CD-ROM drive tries to read it, re-reads it.. for some time. Finally,
    when all the N..U-1 blocks are read, the kernel returns block number N
    (as requested) to the application, successfully.

    Now the app requests block number N+1, and the kernel tries to read
    blocks N+1..U+1, retrying again as in the previous step.

    And so on, up to when the app requests block number U-1. And when,
    finally, it requests block U, it receives a read error.

    So, the kernel currently tries to re-read the same failing block as
    many times as the current readahead value (256 (times?) by default).

    This whole process already killed my CD-ROM drive (I posted about it
    to LKML several months ago) - literally, the drive has fried, and
    does not work anymore. Of course that problem was a bug in the firmware
    (or whatever) of the drive *too*, but.. the main problem with that is
    the current readahead logic as described above.
    [/QUOTE]

    This was confirmed by Jens Axboe:

    [QUOTE]
    For ide-cd, it tends to only end the first part of the request on a
    medium error. So you may see a lot of repeats :/
    [/QUOTE]

    With this patch, retries are expected to be reduced from, say, 256, to 5.
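
    The mechanism is roughly the following (a sketch; the helper name is
    illustrative):

        /*
         * Shrink the readahead window after a read I/O error, so that
         * subsequent reads near the bad sector retry far fewer pages.
         */
        static void shrink_readahead_size_eio(struct file_ra_state *ra)
        {
                ra->ra_pages /= 4;
        }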

    [akpm@osdl.org: cleanups]
    Signed-off-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • The problem is that when we write to a file, the copy from userspace to
    pagecache is first done with preemption disabled, so if the source address is
    not immediately available the copy fails *and* *zeros* *the* *destination*.

    This is a problem because a concurrent read (which admittedly is an odd thing
    to do) might see zeros rather than what was there before the write, or what
    was there after, or some mixture of the two (any of these being a reasonable
    thing to see).

    If the copy did fail, it will immediately be retried with preemption
    re-enabled so any transient problem with accessing the source won't cause an
    error.

    The first copying does not need to zero any uncopied bytes, and doing so
    causes the problem. It uses copy_from_user_atomic rather than copy_from_user
    so the simple expedient is to change copy_from_user_atomic to *not* zero out
    bytes on failure.
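
    Schematically, the callers already follow this try-atomic-then-retry
    pattern (a sketch):

        kaddr = kmap_atomic(page, KM_USER0);
        left = __copy_from_user_inatomic(kaddr + offset, buf, bytes);
        kunmap_atomic(kaddr, KM_USER0);

        if (unlikely(left)) {
                /* Possibly just a transient fault: retry with a sleeping
                 * copy, which does zero the tail on a real failure. */
                kaddr = kmap(page);
                left = __copy_from_user(kaddr + offset, buf, bytes);
                kunmap(page);
        }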

    The first of these two patches prepares for the change by fixing two places
    which assume copy_from_user_atomic does zero the tail. The two usages are
    very similar pieces of code which copy from a userspace iovec into one or more
    page-cache pages. These are changed to remove the assumption.

    The second patch changes __copy_from_user_inatomic* to not zero the tail.
    Once these are accepted, I will look at similar patches of other architectures
    where this is important (ppc, mips and sparc being the ones I can find).

    This patch:

    There is a problem with __copy_from_user_inatomic zeroing the tail of the
    buffer in the case of an error. As it is called in atomic context, the error
    may be transient, so it results in zeros being written where maybe they
    shouldn't be.

    In the usage in filemap, this opens a window for a well timed read to see data
    (zeros) which is not consistent with any ordering of reads and writes.

    In most cases where __copy_from_user_inatomic is called, a failure results
    in __copy_from_user being called immediately. As long as the latter zeros
    the tail, the former doesn't need to. However, in the *copy_from_user_iovec
    implementations (in both filemap and ntfs/file), it is assumed that
    copy_from_user_inatomic will zero the tail.

    This patch removes that assumption, so that after this patch it will
    be safe for copy_from_user_inatomic to not zero the tail.

    This patch also adds some commentary to filemap.h and asm-i386/uaccess.h.

    After this patch, all architectures that might disable preempt when
    kmap_atomic is called need to have their __copy_from_user_inatomic* "fixed".
    This includes
    - powerpc
    - i386
    - mips
    - sparc

    Signed-off-by: Neil Brown
    Cc: David Howells
    Cc: Anton Altaparmakov
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Ralf Baechle
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

23 Jun, 2006

3 commits

  • Use the x86 cache-bypassing copy instructions for copy_from_user().

    Some performance data:

    Total of GLOBAL_POWER_EVENTS (CPU cycle samples)

    2.6.12.4.orig 1921587
    2.6.12.4.nt 1599424
    1599424/1921587=83.23% (16.77% reduction)

    BSQ_CACHE_REFERENCE (L3 cache miss)
    2.6.12.4.orig 57427
    2.6.12.4.nt 20858
    20858/57427=36.32% (63.7% reduction)

    L3 cache miss reduction of __copy_from_user_ll
    samples %
    37408 65.1412 vmlinux __copy_from_user_ll
    23 0.1103 vmlinux __copy_user_zeroing_intel_nocache
    23/37408=0.061% (99.94% reduction)

    Top 5 of 2.6.12.4.nt
    Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
    samples % app name symbol name
    128392 8.0274 vmlinux __copy_user_zeroing_intel_nocache
    64206 4.0143 vmlinux journal_add_journal_head
    59746 3.7355 vmlinux do_get_write_access
    47674 2.9807 vmlinux journal_put_journal_head
    46021 2.8774 vmlinux journal_dirty_metadata
    pattern9-0-cpu4-0-09011728/summary.out

    Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x3f (multiple flags) count 3000
    samples % app name symbol name
    69755 4.2861 vmlinux __copy_user_zeroing_intel_nocache
    55685 3.4215 vmlinux journal_add_journal_head
    52371 3.2179 vmlinux __find_get_block
    45504 2.7960 vmlinux journal_put_journal_head
    36005 2.2123 vmlinux journal_stop
    pattern9-0-cpu4-0-09011744/summary.out

    Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
    samples % app name symbol name
    1147 5.4994 vmlinux journal_add_journal_head
    881 4.2240 vmlinux journal_dirty_data
    872 4.1809 vmlinux blk_rq_map_sg
    734 3.5192 vmlinux journal_commit_transaction
    617 2.9582 vmlinux radix_tree_delete
    pattern9-0-cpu4-0-09011731/summary.out

    iozone results are

    original 2.6.12.4 CPU time = 207.768 sec
    cache aware CPU time = 184.783 sec
    (three times run)
    184.783/207.768=88.94% (11.06% reduction)

    original:
    pattern9-0-cpu4-0-08191720/iozone.out: CPU Utilization: Wall time 45.997 CPU time 64.527 CPU utilization 140.28 %
    pattern9-0-cpu4-0-08191741/iozone.out: CPU Utilization: Wall time 46.878 CPU time 71.933 CPU utilization 153.45 %
    pattern9-0-cpu4-0-08191743/iozone.out: CPU Utilization: Wall time 45.152 CPU time 71.308 CPU utilization 157.93 %

    cache aware:
    pattern9-0-cpu4-0-09011728/iozone.out: CPU Utilization: Wall time 44.842 CPU time 62.465 CPU utilization 139.30 %
    pattern9-0-cpu4-0-09011731/iozone.out: CPU Utilization: Wall time 44.718 CPU time 59.273 CPU utilization 132.55 %
    pattern9-0-cpu4-0-09011744/iozone.out: CPU Utilization: Wall time 44.367 CPU time 63.045 CPU utilization 142.10 %

    Signed-off-by: Hiro Yoshioka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hiro Yoshioka
     
  • mm/filemap.c:
    - add lots of kernel-doc;
    - fix some typos and kernel-doc errors;
    - drop some blank lines between function close and EXPORT_SYMBOL();

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
    When a writeback_control's `start' and `end' fields are used to
    indicate a one-byte range starting at file offset zero, the required
    values of .start=0, .end=0 mean that the ->writepages() implementation
    has no way of telling that it is being asked to perform a range
    request, because we are currently overloading (start == 0 && end == 0)
    to mean "this is not a write-a-range request".

    To make all this sane, the patch changes the range fields of
    writeback_control.

    A caller of ->writepages() must now always set the range: either
    range_start/range_end, or range_cyclic.

    If range_cyclic is true, ->writepages() treats the range as cyclic;
    otherwise it just uses range_start and range_end.
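
    A whole-file writeback call then ends up looking something like this
    (a sketch; only range_start/range_end/range_cyclic come from this patch,
    the rest are pre-existing writeback_control fields):

        struct writeback_control wbc = {
                .sync_mode   = WB_SYNC_ALL,
                .nr_to_write = LONG_MAX,
                .range_start = 0,
                .range_end   = LLONG_MAX, /* i.e. "to the end of the file" */
        };

        ret = do_writepages(mapping, &wbc);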

    This patch:

    - Adds LLONG_MAX, LLONG_MIN and ULLONG_MAX to include/linux/kernel.h.
    -1 is usually OK for range_end (its type is long long), but if someone did

    range_end += val;            /* range_end becomes "val - 1" */
    u64val = range_end >> bits;  /* u64val becomes "~(0ULL)" */

    or something similar, it would be wrong. So this adds LLONG_MAX to avoid
    such nasty surprises, and uses LLONG_MAX for range_end.

    - Makes all callers of ->writepages() set range_start/range_end or range_cyclic.

    - Fixes the updates of ->writeback_index, which already looked a bit odd:
    if a scan starts at 0 and ends because of the nr_to_write check, recording
    that last index can reduce the chance of ever scanning the end of the file.
    So ->writeback_index is now updated only if range_cyclic is true or the
    whole file has been scanned.

    Signed-off-by: OGAWA Hirofumi
    Cc: Nathan Scott
    Cc: Anton Altaparmakov
    Cc: Steven French
    Cc: "Vladimir V. Saveliev"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
     

27 Apr, 2006

1 commit


24 Mar, 2006

3 commits

    Add two new Linux-specific fadvise() extensions:

    LINUX_FADV_ASYNC_WRITE: start async writeout of any dirty pages between file
    offsets `offset' and `offset+len'. Any pages which are currently under
    writeout are skipped, whether or not they are dirty.

    LINUX_FADV_WRITE_WAIT: wait upon writeout of any dirty pages between file
    offsets `offset' and `offset+len'.

    By combining these two operations the application may do several things:

    LINUX_FADV_ASYNC_WRITE: push some or all of the dirty pages at the disk.

    LINUX_FADV_WRITE_WAIT, LINUX_FADV_ASYNC_WRITE: push all of the currently dirty
    pages at the disk.

    LINUX_FADV_WRITE_WAIT, LINUX_FADV_ASYNC_WRITE, LINUX_FADV_WRITE_WAIT: push all
    of the currently dirty pages at the disk, wait until they have been written.
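
    From user space, that last combination corresponds to a sequence like the
    following (a sketch; it assumes the new advice values are exposed to
    applications, e.g. via linux/fadvise.h, and passed through by
    posix_fadvise()):

        /* flush a dirty byte range and wait until it has reached the disk */
        posix_fadvise(fd, offset, len, LINUX_FADV_WRITE_WAIT);  /* wait for writeout already in flight */
        posix_fadvise(fd, offset, len, LINUX_FADV_ASYNC_WRITE); /* start writeout of the dirty pages */
        posix_fadvise(fd, offset, len, LINUX_FADV_WRITE_WAIT);  /* wait for that writeout to finish */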

    It should be noted that none of these operations write out the file's
    metadata. So unless the application is strictly performing overwrites of
    already-instantiated disk blocks, there are no guarantees here that the data
    will be available after a crash.

    To complete this suite of operations I guess we should have a "sync file
    metadata only" operation. This gives applications access to all the building
    blocks needed for all sorts of sync operations. But sync-metadata doesn't fit
    well with the fadvise() interface. Probably it should be a new syscall:
    sys_fmetadatasync().

    The patch also diddles with the meaning of `endbyte' in sys_fadvise64_64().
    It is made to represent the last affected byte in the file (ie: it is
    inclusive). Generally, all these byte-range and page-range functions are
    inclusive so we can easily represent EOF with -1.

    As Ulrich notes, these two functions are somewhat abusive of the fadvise()
    concept, which appears to be "set the future policy for this fd".

    But these commands are a perfect fit with the fadvise() implementation, and
    several of the existing fadvise() commands are synchronous and don't affect
    future policy either. I think we can live with the slight incongruity.

    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
    I had trouble working out whether filemap_fdatawrite_range()'s `end'
    parameter describes the last-byte-to-be-written or the last-plus-one.
    Clarify that in comments.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Change the page cache allocation calls to support cpuset memory spreading.

    See the previous patch, cpuset_mem_spread, for an explanation of cpuset memory
    spreading.

    On systems without cpusets configured in the kernel, this is no change.

    On systems with cpusets configured in the kernel, but with the "memory_spread"
    cpuset option not enabled for the current task's cpuset, this adds a call to a
    cpuset routine and a failed bit test of the processor state flag PF_SPREAD_PAGE.

    On tasks in cpusets with "memory_spread" enabled, this adds a call to a cpuset
    routine that computes which of the task's mems_allowed nodes should be
    preferred for this allocation.

    If memory spreading applies to a particular allocation, then any other NUMA
    mempolicy does not apply.
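
    The resulting page cache allocation path is roughly the following (a
    sketch; the cpuset helper names come from the companion cpuset_mem_spread
    patch and should be treated as assumptions):

        static inline struct page *page_cache_alloc(struct address_space *x)
        {
                if (cpuset_do_page_mem_spread()) {
                        /* rotate over the task's mems_allowed nodes */
                        int n = cpuset_mem_spread_node();
                        return alloc_pages_node(n, mapping_gfp_mask(x), 0);
                }
                return alloc_pages(mapping_gfp_mask(x), 0);
        }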

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     

22 Mar, 2006

1 commit

  • Remove __put_page from outside the core mm/. It is dangerous because it does
    not handle compound pages nicely, and misses 1->0 transitions. If a user
    later appears that really needs the extra speed we can reevaluate.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

19 Jan, 2006

1 commit

    Migration code currently does not take a reference to the target page
    properly, so between unlocking the pte and trying to take a new
    reference to the page with isolate_lru_page, anything could happen to
    it.

    Fix this by holding the pte lock until we get a chance to elevate the
    refcount.

    Other small cleanups while we're here.

    Signed-off-by: Nick Piggin
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

12 Jan, 2006

1 commit

  • - Move capable() from sched.h to capability.h;

    - Use <linux/capability.h> where capable() is used
    (in include/, block/, ipc/, kernel/, a few drivers/,
    mm/, security/, & sound/;
    many more drivers/ to go)

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy.Dunlap
     

11 Jan, 2006

1 commit

    To allow various options to work per-mount instead of per-sb we need a
    struct vfsmount when updating ctime and mtime. This preparation patch
    replaces the inode_update_time routine with a file_update_time routine so
    we can easily get at the vfsmount (and the file makes more sense in this
    context anyway). Also get rid of the unused second argument - we always
    want to update the ctime when calling this routine.

    Signed-off-by: Christoph Hellwig
    Cc: Al Viro
    Cc: Anton Altaparmakov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

10 Jan, 2006

1 commit


09 Jan, 2006

2 commits

    This patch adds EXPORT_SYMBOL(filemap_write_and_wait) and uses it.

    See mm/filemap.c:

    It also changes filemap_write_and_wait() and filemap_write_and_wait_range().

    Currently, filemap_write_and_wait() doesn't wait if filemap_fdatawrite()
    returns an error. However, even if filemap_fdatawrite() returned an error,
    it may have already submitted some of the data pages to the device
    (e.g. in the case of -ENOSPC).

    Andrew Morton writes,

    If filemap_fdatawrite() returns an error, this might be due to some
    I/O problem: dead disk, unplugged cable, etc. Given the generally
    crappy quality of the kernel's handling of such exceptions, there's a
    good chance that the filemap_fdatawait() will get stuck in D state
    forever.

    So, this patch doesn't wait if filemap_fdatawrite() returns -EIO.
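
    The resulting logic is roughly (a sketch):

        int filemap_write_and_wait(struct address_space *mapping)
        {
                int err = 0;

                if (mapping->nrpages) {
                        err = filemap_fdatawrite(mapping);
                        /*
                         * Even on error some pages may have been submitted,
                         * so wait for them -- but not after a hard -EIO,
                         * where waiting may hang forever.
                         */
                        if (err != -EIO) {
                                int err2 = filemap_fdatawait(mapping);
                                if (!err)
                                        err = err2;
                        }
                }
                return err;
        }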

    Trond, could you please review the nfs part? In particular, I'm not sure
    whether nfs must use "filemap_fdatawrite(inode->i_mapping) == 0" or not.

    Acked-by: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
     
    This exports and changes sync_page_range()/sync_page_range_nolock(): fatfs
    needs them for expanding truncate, and the "size_t count" parameter is
    changed to "loff_t count".

    Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
     

07 Jan, 2006

1 commit

    As find_lock_page() already checks with TestSetPageLocked() that the page is
    locked, there is no need to call lock_page(), which would try-lock the page
    again (the chance of the page being unlocked in between is small). Call
    __lock_page() directly; this saves one atomic operation.

    Also, mark truncate-while-slept path as unlikely while we are here.
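
    In context, the locking fast path becomes something like this (a sketch of
    find_lock_page(); 2.6-era helpers assumed):

        repeat:
                page = find_get_page(mapping, offset);
                if (page && TestSetPageLocked(page)) {
                        __lock_page(page); /* the trylock above already failed */
                        /* the page may have been truncated while we slept */
                        if (unlikely(page->mapping != mapping)) {
                                unlock_page(page);
                                page_cache_release(page);
                                goto repeat;
                        }
                }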

    (akpm: ug. But this is actually a common path for normal old read()s against
    a page which is under readahead I/O so ho-hum.)

    Signed-off-by: Nikita Danilov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nikita Danilov
     

04 Jan, 2006

1 commit

  • readpage(), prepare_write(), and commit_write() callers are updated to
    understand the special return code AOP_TRUNCATED_PAGE in the style of
    writepage() and WRITEPAGE_ACTIVATE. AOP_TRUNCATED_PAGE tells the caller that
    the callee has unlocked the page and that the operation should be tried again
    with a new page. OCFS2 uses this to detect and work around a lock inversion in
    its aop methods. There should be no change in behaviour for methods that don't
    return AOP_TRUNCATED_PAGE.

    WRITEPAGE_ACTIVATE is also prepended with AOP_ for consistency and they are
    made enums so that kerneldoc can be used to document their semantics.
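
    A caller is then expected to handle the new return code along these lines
    (a sketch; the labels are illustrative):

        status = a_ops->prepare_write(file, page, offset, offset + bytes);
        if (unlikely(status)) {
                if (status == AOP_TRUNCATED_PAGE) {
                        /* callee already unlocked the page: drop our
                         * reference and retry with a fresh page */
                        page_cache_release(page);
                        goto find_page_again;
                }
                unlock_page(page);
                page_cache_release(page);
                goto out;
        }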

    Signed-off-by: Zach Brown

    Zach Brown
     

15 Nov, 2005

1 commit


31 Oct, 2005

1 commit

    When __generic_file_aio_read() hits an error during reading, it reports the
    error iff nothing has successfully been read yet. The convention is: when an
    error occurs, if nothing has been read/written, report the error code;
    otherwise, report the number of bytes successfully transferred up to that
    point.

    This corner case can be exposed by performing readv(2) with the following
    iov.

    iov[0] = len0 @ ptr0
    iov[1] = len1 @ NULL (or any other invalid pointer)
    iov[2] = len2 @ ptr2

    When the file is large enough, performing the above readv(2) results in

    len0 bytes from file_pos @ ptr0
    len2 bytes from file_pos + len0 @ ptr2

    And the return value is len0 + len2. The test program is included below.

    This patch makes __generic_file_aio_read()'s error handling identical to
    other functions.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/uio.h>

    int main(int argc, char **argv)
    {
        const char *path;
        struct stat stbuf;
        size_t len0, len1;
        void *buf0, *buf1;
        struct iovec iov[3];
        int fd, i;
        ssize_t ret;

        if (argc < 2) {
            fprintf(stderr, "Usage: testreadv path (better be a "
                    "small text file)\n");
            return 1;
        }
        path = argv[1];

        if (stat(path, &stbuf) < 0) {
            perror("stat");
            return 1;
        }

        len0 = stbuf.st_size / 2;
        len1 = stbuf.st_size - len0;

        if (!len0 || !len1) {
            fprintf(stderr, "Dude, file is too small\n");
            return 1;
        }

        if ((fd = open(path, O_RDONLY)) < 0) {
            perror("open");
            return 1;
        }

        if (!(buf0 = malloc(len0)) || !(buf1 = malloc(len1))) {
            perror("malloc");
            return 1;
        }

        memset(buf0, 0, len0);
        memset(buf1, 0, len1);

        iov[0].iov_base = buf0;
        iov[0].iov_len = len0;
        iov[1].iov_base = NULL;
        iov[1].iov_len = len1;
        iov[2].iov_base = buf1;
        iov[2].iov_len = len1;

        printf("vector ");
        for (i = 0; i < 3; i++)
            printf("%p:%zu ", iov[i].iov_base, iov[i].iov_len);
        printf("\n");

        ret = readv(fd, iov, 3);
        if (ret < 0)
            perror("readv");

        printf("readv returned %zd\nbuf0 = [%s]\nbuf1 = [%s]\n",
               ret, (char *)buf0, (char *)buf1);

        return 0;
    }

    Signed-off-by: Tejun Heo
    Cc: Benjamin LaHaise
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

30 Oct, 2005

4 commits

    Move EXPORT_SYMBOL(filemap_populate) to the proper place, just after the
    function itself; otherwise it's easy to miss that the function is exported.

    Signed-off-by: Nikita Danilov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nikita Danilov
     
  • Updated several references to page_table_lock in common code comments.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with
    a many-threaded application which concurrently initializes different parts of
    a large anonymous area.

    This patch corrects that, by using a separate spinlock per page table page, to
    guard the page table entries in that page, instead of using the mm's single
    page_table_lock. (But even then, page_table_lock is still used to guard page
    table allocation, and anon_vma allocation.)

    In this implementation, the spinlock is tucked inside the struct page of the
    page table page: with a BUILD_BUG_ON in case it overflows - which it would in
    the case of 32-bit PA-RISC with spinlock debugging enabled.

    Splitting the lock is not quite for free: another cacheline access. Ideally,
    I suppose we would use split ptlock only for multi-threaded processes on
    multi-cpu machines; but deciding that dynamically would have its own costs.
    So for now enable it by config, at some number of cpus - since the Kconfig
    language doesn't support inequalities, let preprocessor compare that with
    NR_CPUS. But I don't think it's worth being user-configurable: for good
    testing of both split and unsplit configs, split now at 4 cpus, and perhaps
    change that to 8 later.
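
    The compile-time selection can be sketched as below (macro and field names
    are modelled on the description above; treat them as illustrative):

        #if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
        /* one spinlock per page-table page, tucked into its struct page */
        #define pte_lockptr(mm, pmd)  ({ (void)(mm); &pmd_page(*(pmd))->ptl; })
        #else
        /* few CPUs: keep using the single per-mm page_table_lock */
        #define pte_lockptr(mm, pmd)  ({ (void)(pmd); &(mm)->page_table_lock; })
        #endif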

    There is a benefit even for singly threaded processes: kswapd can be attacking
    one part of the mm while another part is busy faulting.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
    Impose a little more consistency on the page fault handlers do_wp_page,
    do_swap_page, do_anonymous_page, do_no_page, do_file_page: why not pass their
    arguments in the same order and call them by the same names?

    break_cow is all very well, but what it did was inlined elsewhere: easier to
    compare if it's brought back into do_wp_page.

    do_file_page's fallback to do_no_page dates from a time when we were testing
    pte_file by using it wherever possible: currently it's peculiar to nonlinear
    vmas, so just check that. BUG_ON if not? Better not, it's probably page
    table corruption, so just show the pte: hmm, there's a pte_ERROR macro, let's
    use that for do_wp_page's invalid pfn too.

    Hah! Someone in the ppc64 world noticed pte_ERROR was unused so removed it:
    restored (and say "pud" not "pmd" in its pud_ERROR).

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

28 Oct, 2005

1 commit


11 Sep, 2005

1 commit


05 Sep, 2005

2 commits

  • Either shmem_getpage returns a failure, or it found a page, or it was told
    it couldn't do any I/O. So it's useless to check nonblock in the else
    branch. We could add a BUG() there but I preferred to comment the
    offending function.

    This was taken from one of Ingo Molnar's old patches that I'm resurrecting.

    Signed-off-by: Paolo 'Blaisorblade' Giarrusso
    Cc: Ingo Molnar
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paolo 'Blaisorblade' Giarrusso
     
  • The idea of a swap_device_lock per device, and a swap_list_lock over them all,
    is appealing; but in practice almost every holder of swap_device_lock must
    already hold swap_list_lock, which defeats the purpose of the split.

    The only exceptions have been swap_duplicate, valid_swaphandles and an
    untrodden path in try_to_unuse (plus a few places added in this series).
    valid_swaphandles doesn't show up high in profiles, but swap_duplicate does
    demand attention. However, with the hold time in get_swap_pages so much
    reduced, I've not yet found a load and set of swap device priorities to show
    even swap_duplicate benefitting from the split. Certainly the split is mere
    overhead in the common case of a single swap device.

    So, replace swap_list_lock and swap_device_lock by spinlock_t swap_lock
    (generally we seem to prefer an _ in the name, and not hide in a macro).

    If someone can show a regression in swap_duplicate, then probably we should
    add a hashlock for the swap_map entries alone (shorts not being atomic), so as
    to help the case of the single swap device too.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

26 Jun, 2005

1 commit

  • Here is the fix for the problem described in

    http://bugzilla.kernel.org/show_bug.cgi?id=4721

    Basically, the problem is that generic_file_buffered_write() accesses beyond
    the end of the iov[] vector after handling the last segment. If we happen to
    cross a page boundary, we get a fault.

    I think this simple patch is good enough. If we really don't want to
    depend on the "count", then we need to pass nr_segs to
    filemap_set_next_iovec(), decrement it, and check it.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Badari Pulavarty