10 Mar, 2011

1 commit

  • With the plugging now being explicitly controlled by the
    submitter, callers need not pass down unplugging hints
    to the block layer. If they want to unplug, it's because they
    manually plugged on their own - in which case, they should just
    unplug at will.
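
    As a minimal illustration of that model (a sketch, not code from the
    patch; the on-stack plugging API is assumed from the same series, and
    submit_my_bios() is a hypothetical helper standing in for whatever
    issues the I/O):

        #include <linux/blkdev.h>

        void submit_my_bios(void);              /* hypothetical I/O issuer */

        static void write_out_batch(void)
        {
                struct blk_plug plug;

                blk_start_plug(&plug);          /* plug: batch bios on this task */
                submit_my_bios();               /* issue the actual I/O */
                blk_finish_plug(&plug);         /* unplug: flush the batch */
        }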

    Signed-off-by: Jens Axboe

    Jens Axboe
     

08 Aug, 2010

1 commit

  • Remove the current bio flags and reuse the request flags for the bio, too.
    This allows us to more easily trace the type of I/O from the filesystem
    down to the block driver. There were two flags in the bio that were
    missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've
    renamed two request flags that had a superfluous RW in them.

    Note that the flags are in bio.h despite having the REQ_ name - as
    blkdev.h includes bio.h, that is the only way to go for now.
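
    A hedged sketch of what this looks like for a submitter after the
    unification (REQ_SYNC and REQ_META are assumed flag names from that
    era; this is illustrative, not code from the patch):

        #include <linux/fs.h>
        #include <linux/bio.h>

        /* Tag a write bio with the same REQ_* flags the request layer
         * uses, so the intent survives down to the block driver. */
        static void submit_sync_meta_write(struct bio *bio)
        {
                submit_bio(WRITE | REQ_SYNC | REQ_META, bio);
        }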

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

30 Mar, 2010

1 commit

  • …implicit slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    The percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include
    those headers directly instead of assuming their availability. As this
    conversion needs to touch a large number of source files, the following
    script is used as the basis of the conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following:

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there, i.e. if only gfp is used,
    gfp.h; if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It is put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, reverse Christmas tree - or at the end
    if there doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.
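
    For illustration, this is the kind of edit the sweep ends up making
    (a hypothetical file; only the added include line reflects the change):

        #include <linux/module.h>
        #include <linux/slab.h>         /* added: this file uses kmalloc() */

        static void *make_buffer(void)
        {
                /* Previously this built only because slab.h was pulled in
                 * implicitly through percpu.h; now the dependency is direct. */
                return kmalloc(128, GFP_KERNEL);
        }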

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition, and for others adding it to an
    implementation .h or embedding .c file was more appropriate. This
    step added inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given that I had only a couple of failures from tests on step 6, I'm
    fairly confident about the coverage of this conversion patch. If
    there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of the
    specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

16 Dec, 2009

2 commits

  • Seems that page_io.c doesn't really need to know that page_private(page)
    is the swp_entry 'val'. Rework map_swap_page() to do what its name says
    and map a page to a page offset in the swap space.

    The only other caller of map_swap_page() is internal to mm/swapfile.c and
    it does want to map a swap entry to the 'sector'. So rename
    map_swap_page() to map_swap_entry(), make it 'static', and implement
    map_swap_page() as a wrapper around that.
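
    A sketch of the resulting arrangement (signatures assumed from the
    description above, not quoted from the patch):

        /* Internal to mm/swapfile.c: map a swap entry to its sector. */
        static sector_t map_swap_entry(swp_entry_t entry,
                                       struct block_device **bdev);

        /* Exported wrapper: map a swapcache page to its page offset in
         * swap space, hiding the page_private(page) == swp_entry detail. */
        sector_t map_swap_page(struct page *page, struct block_device **bdev)
        {
                swp_entry_t entry;

                entry.val = page_private(page);
                return map_swap_entry(entry, bdev);
        }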

    Signed-off-by: Lee Schermerhorn
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • The swap_info_struct is mostly private to mm/swapfile.c, with only
    one other in-tree user: get_swap_bio(). Adjust its interface to
    map_swap_page(), so that we can then remove get_swap_info_struct().

    But there is a popular user out-of-tree, TuxOnIce: so leave the
    declaration of swap_info_struct in linux/swap.h.

    Signed-off-by: Hugh Dickins
    Cc: Nigel Cunningham
    Cc: KAMEZAWA Hiroyuki
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

17 Jun, 2009

1 commit

  • The file argument resulted from address_space's readpage a long time
    ago.

    We don't use it any more. Let's remove the unnecessary argument.
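
    The change amounts to dropping the argument from the prototype and its
    callers; illustratively (using swap_readpage() as the affected function,
    which is an assumption from context):

        /* before */
        int swap_readpage(struct file *file, struct page *page);

        /* after */
        int swap_readpage(struct page *page);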

    Signed-off-by: Minchan Kim
    Acked-by: Hugh Dickins
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

18 Feb, 2009

1 commit


07 Jan, 2009

2 commits

  • remove_exclusive_swap_page(): its problem is in living up to its name.

    It doesn't matter if someone else has a reference to the page (raised
    page_count); it doesn't matter if the page is mapped into userspace
    (raised page_mapcount - though that hints it may be worth keeping the
    swap): all that matters is that there be no more references to the swap
    (and no writeback in progress).

    swapoff (try_to_unuse) has been removing pages from swapcache for years,
    with no concern for page count or page mapcount, and we used to have a
    comment in lookup_swap_cache() recognizing that: if you go for a page of
    swapcache, you'll get the right page, but it could have been removed from
    swapcache by the time you get page lock.

    So, give up asking for exclusivity: get rid of
    remove_exclusive_swap_page(), and remove_exclusive_swap_page_ref() and
    remove_exclusive_swap_page_count() which were spawned for the recent LRU
    work: replace them by the simpler try_to_free_swap() which just checks
    page_swapcount().

    Similarly, remove the page_count limitation from free_swap_and_cache(),
    but assume that it's worth holding on to the swap if page is mapped and
    swap nowhere near full. Add a vm_swap_full() test in free_swap_cache()?
    It would be consistent, but I think we probably have enough for now.
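
    A sketch of the simpler replacement described above (close to what the
    description implies, but not quoted from the patch):

        /* Free the swap backing this page if, and only if, nothing else
         * references the swap and no writeback is in progress. */
        int try_to_free_swap(struct page *page)
        {
                VM_BUG_ON(!PageLocked(page));

                if (!PageSwapCache(page))
                        return 0;
                if (PageWriteback(page))
                        return 0;
                if (page_swapcount(page))
                        return 0;

                delete_from_swap_cache(page);
                SetPageDirty(page);
                return 1;
        }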

    Signed-off-by: Hugh Dickins
    Cc: Lee Schermerhorn
    Cc: Rik van Riel
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The swap code is over-provisioned with BUG_ONs on assorted page flags,
    mostly dating back to 2.3. They're good documentation, and guard against
    developer error, but a waste of space on most systems: change them to
    VM_BUG_ONs, conditional on CONFIG_DEBUG_VM. Just delete the PagePrivate
    ones: they're later, from 2.5.69, but even less interesting now.
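
    In other words, roughly this kind of change, with VM_BUG_ON compiling
    away unless CONFIG_DEBUG_VM is set (the macro shown is the usual
    definition of that era):

        #ifdef CONFIG_DEBUG_VM
        #define VM_BUG_ON(cond)         BUG_ON(cond)
        #else
        #define VM_BUG_ON(cond)         do { } while (0)
        #endif

        VM_BUG_ON(!PageLocked(page));   /* was: BUG_ON(!PageLocked(page)); */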

    Signed-off-by: Hugh Dickins
    Reviewed-by: Christoph Lameter
    Cc: Nick Piggin
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

06 Feb, 2008

1 commit

  • After running SetPageUptodate, preceding stores to the page contents to
    actually bring it uptodate may not be ordered with the store to set the
    page uptodate.

    Therefore, another CPU which checks PageUptodate is true, then reads the
    page contents can get stale data.

    Fix this by having an smp_wmb before SetPageUptodate, and smp_rmb after
    PageUptodate.

    Many places that test PageUptodate, do so with the page locked, and this
    would be enough to ensure memory ordering in those places if
    SetPageUptodate were only called while the page is locked. Unfortunately
    that is not always the case for some filesystems, but it could be an idea
    for the future.

    Also bring the handling of anonymous page uptodateness in line with that of
    file backed page management, by marking anon pages as uptodate when they
    _are_ uptodate, rather than when our implementation requires that they be
    marked as such. Doing so allows us to get rid of the smp_wmb's in the page
    copying functions, which were especially added for anonymous pages for an
    analogous memory ordering problem. Both file and anonymous pages are
    handled with the same barriers.

    FAQ:
    Q. Why not do this in flush_dcache_page?
    A. Firstly, flush_dcache_page handles only one side (the smp_wmb side) of the
    ordering protocol; we'd still need smp_rmb somewhere. Secondly, hiding away
    memory barriers in a completely unrelated function is nasty; at least in the
    PageUptodate macros, they are located together with (half) the operations
    involved in the ordering. Thirdly, the smp_wmb is only required when first
    bringing the page uptodate, whereas flush_dcache_page should be called each time
    it is written to through the kernel mapping. It is logically the wrong place to
    put it.

    Q. Why does this increase my text size / reduce my performance / etc.
    A. Because it is adding the necessary instructions to eliminate the data-race.

    Q. Can it be improved?
    A. Yes, eg. if you were to create a rule that all SetPageUptodate operations
    run under the page lock, we could avoid the smp_rmb places where PageUptodate
    is queried under the page lock. Requires audit of all filesystems and at least
    some would need reworking. That's great that you're interested; I'm eagerly
    awaiting your patches.
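
    A sketch of the pairing described above, using illustrative helper names
    rather than the exact SetPageUptodate/PageUptodate macros:

        static inline void my_set_page_uptodate(struct page *page)
        {
                smp_wmb();              /* contents visible before the flag */
                set_bit(PG_uptodate, &page->flags);
        }

        static inline int my_page_uptodate(struct page *page)
        {
                int ret = test_bit(PG_uptodate, &page->flags);

                if (ret)
                        smp_rmb();      /* flag read before the contents */
                return ret;
        }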

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

10 Oct, 2007

1 commit

  • As bi_end_io is only called once when the request is complete,
    the 'size' argument is now redundant. Remove it.

    Now there is no need for bio_endio to subtract the size completed
    from bi_size. So don't do that either.

    While we are at it, change bi_end_io to return void.
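
    An illustrative completion handler under the new convention - called
    exactly once when the whole bio completes, no partial-size bookkeeping,
    returning void (a sketch, not code from the patch):

        static void my_end_read_io(struct bio *bio, int error)
        {
                struct page *page = bio->bi_io_vec[0].bv_page;

                if (error)
                        SetPageError(page);
                else
                        SetPageUptodate(page);
                unlock_page(page);
                bio_put(bio);
        }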

    Signed-off-by: Neil Brown
    Signed-off-by: Jens Axboe

    NeilBrown
     

08 Dec, 2006

1 commit

  • Make swsusp use block device offsets instead of swap offsets to identify swap
    locations and make it use the same code paths for writing as well as for
    reading data.

    This allows us to use the same code for handling swap files and swap
    partitions and to simplify the code, eg. by dropping rw_swap_page_sync().

    Signed-off-by: Rafael J. Wysocki
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     

26 Sep, 2006

3 commits

  • Implement async reads for swsusp resuming.

    Crufty old PIII testbox:
    15.7 MB/s -> 20.3 MB/s

    Sony Vaio:
    14.6 MB/s -> 33.3 MB/s

    I didn't implement the post-resume bio_set_pages_dirty(). I don't really
    understand why resume needs to run set_page_dirty() against these pages.

    It might be a worry that this code modifies PG_Uptodate, PG_Error and
    PG_Locked against the image pages. Can this possibly affect the resumed-into
    kernel? Hopefully not, if we're atomically restoring its mem_map?

    Cc: Pavel Machek
    Cc: "Rafael J. Wysocki"
    Cc: Jens Axboe
    Cc: Laurent Riffard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Switch the swsusp writeout code from 4k-at-a-time to 4MB-at-a-time.

    Crufty old PIII testbox:
    12.9 MB/s -> 20.9 MB/s

    Sony Vaio:
    14.7 MB/s -> 26.5 MB/s

    The implementation is crude. A better one would use larger BIOs, but wouldn't
    gain any performance.

    The memcpys will be mostly pipelined with the IO and basically come for free.

    The ENOMEM path has not been tested. It should be.

    Cc: Pavel Machek
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Currently we can silently drop data if the write to swap failed. It
    usually doesn't result in data-corruption because on page-in the process
    will receive SIGBUS (assuming write-failure implies read-failure).

    This assumption might or might not be valid.

    This patch avoids the page being discarded after a failed write, and
    prints a warning the sysadmin _should_ take to heart: if a lot of swap
    space becomes un-writeable, OOM is not far off.

    Tested by making the write fail 'randomly' once every 50 writes or so.
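
    The idea, sketched in the style of a bio write-completion path (the
    function and the "write_failed" condition are illustrative, not the
    patch's code):

        static void end_swap_write(struct page *page, int write_failed)
        {
                if (write_failed) {
                        SetPageError(page);
                        set_page_dirty(page);   /* keep data until rewritten */
                        printk(KERN_ALERT "Write-error on swap-device\n");
                }
                end_page_writeback(page);
        }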

    [akpm@osdl.org: printk warning fix]
    Signed-off-by: Peter Zijlstra
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

01 Jul, 2006

1 commit

  • The remaining counters in page_state after the zoned VM counter patches
    have been applied are all just for show in /proc/vmstat. They have no
    essential function for the VM.

    We use a simple increment of per-cpu variables. In order to avoid the most
    severe races we disable preemption. Preemption does not prevent the race
    between an increment and an interrupt handler incrementing the same
    statistics counter. However, that race is exceedingly rare; we may only
    lose one increment or so, and there is no requirement (at least not in the
    kernel) that the vm event counters have to be accurate.

    In the non preempt case this results in a simple increment for each
    counter. For many architectures this will be reduced by the compiler to a
    single instruction. This single instruction is atomic for i386 and x86_64.
    And therefore even the rare race condition in an interrupt is avoided for
    both architectures in most cases.

    The patchset also adds an off switch for embedded systems that allows
    building Linux kernels without these counters.

    The implementation of these counters is through inline code that hopefully
    results in only a single increment instruction being emitted
    (i386, x86_64), or in the increment being hidden through instruction
    concurrency (EPIC architectures such as ia64 can get that done).
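
    An illustrative sketch of that mechanism, with made-up names rather than
    the kernel's actual counter definitions:

        enum my_event_item { MY_EVENT_FOO, MY_EVENT_BAR, MY_NR_EVENTS };

        struct my_event_state {
                unsigned long event[MY_NR_EVENTS];
        };
        DEFINE_PER_CPU(struct my_event_state, my_event_states);

        static inline void count_my_event(enum my_event_item item)
        {
                preempt_disable();      /* no interrupt disabling needed */
                __get_cpu_var(my_event_states).event[item]++;
                preempt_enable();
        }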

    Benefits:
    - VM event counter operations usually reduce to a single inline instruction
    on i386 and x86_64.
    - No interrupt disable, only preempt disable for the preempt case.
    Preempt disable can also be avoided by moving the counter into a spinlock.
    - Handling is similar to zoned VM counters.
    - Simple and easily extendable.
    - Can be omitted to reduce memory use for embedded use.

    References:

    RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=113512330605497&w=2
    RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=114988082814934&w=2
    local_t http://marc.theaimsgroup.com/?l=linux-kernel&m=114991748606690&w=2
    V2 http://marc.theaimsgroup.com/?t=115014808400007&r=1&w=2
    V3 http://marc.theaimsgroup.com/?l=linux-kernel&m=115024767022346&w=2
    V4 http://marc.theaimsgroup.com/?l=linux-kernel&m=115047968808926&w=2

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

30 Oct, 2005

1 commit

  • Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with
    a many-threaded application which concurrently initializes different parts of
    a large anonymous area.

    This patch corrects that, by using a separate spinlock per page table page, to
    guard the page table entries in that page, instead of using the mm's single
    page_table_lock. (But even then, page_table_lock is still used to guard page
    table allocation, and anon_vma allocation.)

    In this implementation, the spinlock is tucked inside the struct page of the
    page table page: with a BUILD_BUG_ON in case it overflows - which it would in
    the case of 32-bit PA-RISC with spinlock debugging enabled.

    Splitting the lock is not quite for free: another cacheline access. Ideally,
    I suppose we would use split ptlock only for multi-threaded processes on
    multi-cpu machines; but deciding that dynamically would have its own costs.
    So for now enable it by config, at some number of cpus - since the Kconfig
    language doesn't support inequalities, let preprocessor compare that with
    NR_CPUS. But I don't think it's worth being user-configurable: for good
    testing of both split and unsplit configs, split now at 4 cpus, and perhaps
    change that to 8 later.

    There is a benefit even for singly threaded processes: kswapd can be attacking
    one part of the mm while another part is busy faulting.
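
    A sketch of the resulting lock selection (macro and field names as in
    kernels of that era; treat the details as assumptions): below the CPU
    threshold everything still uses the mm's page_table_lock, above it the
    lock embedded in the page-table page's struct page guards that page's
    ptes.

        #if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
        #define pte_lockptr(mm, pmd)    ({ (void)(mm); &pmd_page(*(pmd))->ptl; })
        #else
        #define pte_lockptr(mm, pmd)    ({ (void)(pmd); &(mm)->page_table_lock; })
        #endif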

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

09 Oct, 2005

1 commit

  • - added typedef unsigned int __nocast gfp_t;

    - replaced __nocast uses for gfp flags with gfp_t - it gives exactly
    the same warnings as far as sparse is concerned, doesn't change
    generated code (from gcc's point of view we replaced unsigned int with
    a typedef) and documents what's going on far better.
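
    The typedef itself, plus a hypothetical prototype showing how the
    annotated type is meant to be used:

        typedef unsigned int __nocast gfp_t;

        void *my_alloc_buffer(size_t len, gfp_t gfp_mask);   /* illustrative */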

    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
     

26 Jun, 2005

1 commit


17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds