06 Feb, 2008

4 commits

  • After dirtying a 100M file, the normal behavior is to start writeback
    for all the data after a 30s delay. But sometimes the following happens
    instead:

    - after 30s: ~4M
    - after 5s: ~4M
    - after 5s: all remaining 92M

    Some analysis shows that the internal io dispatch queues go like this:

            s_io            s_more_io
        ---------------------------------
    1)      100M,1K         0
    2)      1K              96M
    3)      0               96M

    1) initial state with a 100M file and a 1K file
    2) 4M written, nr_to_write <= 0, so the big file is requeued to s_more_io
    3) 1K written, nr_to_write > 0, no more writes (BUG)

    nr_to_write > 0 in (3) fools the upper layer into thinking that all data
    has been written out. The big dirty file is actually still sitting in
    s_more_io. We cannot simply splice s_more_io back to s_io as soon as s_io
    becomes empty and let the loop in generic_sync_sb_inodes() continue: this
    may starve newly expired inodes in s_dirty. It is also not an option to
    draw inodes from both s_more_io and s_dirty and let the loop go on: this
    might lead to livelocks, and might also starve other superblocks in sync
    time (well, kupdate may still starve some superblocks; that's another bug).

    We have to return when a full scan of s_io completes. So nr_to_write > 0
    does not necessarily mean that "all data are written". This patch
    introduces a flag, writeback_control.more_io, to indicate that more io
    should be done. With it, the big dirty file no longer has to wait for the
    next kupdate invocation 5s later.

    In sync_sb_inodes() we only set more_io on super_blocks we actually
    visited. This avoids interaction between two pdflush daemons.

    Also, in __sync_single_inode() we no longer blindly keep requeuing the io
    if the filesystem cannot make progress; doing so could lead to 100% iowait.
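
    A minimal, self-contained sketch of the loop shape this enables (not the
    kernel diff; writeback_inodes_once(), the two-queue model, and the 4M
    per-pass budget are illustrative assumptions):

        #include <stdio.h>

        struct writeback_control {
                long nr_to_write;       /* budget for one s_io scan */
                int more_io;            /* s_more_io still holds inodes */
        };

        static long s_io = 100 * 1024 + 1;      /* KB queued: 100M + 1K */
        static long s_more_io;                  /* KB parked on s_more_io */

        /* One full scan of s_io: write up to nr_to_write, requeue the rest. */
        static void writeback_inodes_once(struct writeback_control *wbc)
        {
                long written = s_io < wbc->nr_to_write ? s_io : wbc->nr_to_write;

                s_io -= written;
                wbc->nr_to_write -= written;
                s_more_io += s_io;      /* big inode parked for later */
                s_io = 0;
                wbc->more_io = s_more_io > 0;
        }

        int main(void)
        {
                int pass = 0;

                while (s_io || s_more_io) {
                        struct writeback_control wbc = { .nr_to_write = 4 * 1024 };

                        writeback_inodes_once(&wbc);
                        printf("pass %d: parked=%ldK more_io=%d\n",
                               ++pass, s_more_io, wbc.more_io);
                        /* The old logic stopped whenever nr_to_write > 0;
                         * with more_io set we keep going instead of waiting
                         * 5s for the next kupdate run. */
                        if (wbc.nr_to_write > 0 && !wbc.more_io)
                                break;
                        s_io = s_more_io;       /* rescan requeued inodes */
                        s_more_io = 0;
                }
                return 0;
        }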

    Tested-by: Mike Snitzer
    Signed-off-by: Fengguang Wu
    Cc: Michael Rubin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • fastcall is always defined to be empty, remove it

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Harvey Harrison
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harvey Harrison
     
  • Add vm.highmem_is_dirtyable toggle

    A 32 bit machine with HIGHMEM64 enabled running DCC has an MMAPed file of
    approximately 2Gb size which contains a hash format that is written
    randomly by the dbclean process. On 2.6.16 this process took a few
    minutes. With lowmem only accounting of dirty ratios, this takes about 12
    hours of 100% disk IO, all random writes.

    Include a toggle in /proc/sys/vm/highmem_is_dirtyable which can be set to 1 to
    add the highmem back to the total available memory count.
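
    A usage sketch in C, equivalent to `echo 1 >
    /proc/sys/vm/highmem_is_dirtyable` (the procfs path is from the commit;
    the rest is illustrative):

        #include <stdio.h>

        int main(void)
        {
            FILE *f = fopen("/proc/sys/vm/highmem_is_dirtyable", "w");

            if (!f) {
                perror("highmem_is_dirtyable");
                return 1;
            }
            fputs("1\n", f);        /* count highmem as dirtyable again */
            return fclose(f) != 0;
        }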

    [akpm@linux-foundation.org: Fix the CONFIG_DETECT_SOFTLOCKUP=y build]
    Signed-off-by: Bron Gondwana
    Cc: Ethan Solomita
    Cc: Peter Zijlstra
    Cc: WU Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bron Gondwana
     
  • task_dirty_limit() can become static.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     

15 Jan, 2008

1 commit

  • This reverts commit 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b, as
    requested by Fengguang Wu. It's not quite fully baked yet, and while
    there are patches around to fix the problems it caused, they should get
    more testing. Says Fengguang: "I'll resend them both for -mm later on,
    in a more complete patchset".

    See

    http://bugzilla.kernel.org/show_bug.cgi?id=9738

    for some of this discussion.

    Requested-by: Fengguang Wu
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

16 Nov, 2007

1 commit

  • This code harks back to the days when we didn't count dirty mapped
    pages, which led us to try to balance the number of dirty unmapped pages
    by how much unmapped memory there was in the system.

    That makes no sense any more, since now the dirty counts include the
    mapped pages. Not to mention that the math doesn't work with HIGHMEM
    machines anyway, and causes the unmapped_ratio to potentially turn
    negative (which we do catch thanks to clamping it at a minimum value,
    but I mention that as an indication of how broken the code is).

    The code also was written at a time when the default dirty ratio was
    much larger, and the unmapped_ratio logic effectively capped that large
    dirty ratio a bit. Again, we've since lowered the dirty ratio rather
    aggressively, further lessening the point of that code.

    Acked-by: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

15 Nov, 2007

1 commit

  • We allow violation of bdi limits if there is a lot of room on the system.
    Once we hit half the total limit we start enforcing bdi limits and bdi
    ramp-up should happen. Doing it this way avoids many small writeouts on an
    otherwise idle system and should also speed up the ramp-up.

    Signed-off-by: Peter Zijlstra
    Reviewed-by: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

17 Oct, 2007

10 commits

  • We don't want to introduce pointless delays in throttle_vm_writeout() when
    the writeback limits are not yet exceeded, do we?
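
    A hedged sketch of the shape of the fix (helper names follow the kernel
    of that era from memory, not the exact diff): check the threshold first,
    and only sleep when writeback is really over its limit.

        void throttle_vm_writeout(void)
        {
            long background_thresh, dirty_thresh;

            for (;;) {
                get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
                /* Boost the limit a little: we are likely inside reclaim. */
                dirty_thresh += dirty_thresh / 10;

                if (global_page_state(NR_UNSTABLE_NFS) +
                    global_page_state(NR_WRITEBACK) <= dirty_thresh)
                    break;              /* under the limit: no delay */
                congestion_wait(WRITE, HZ / 10);
            }
        }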

    Cc: Nick Piggin
    Cc: OGAWA Hirofumi
    Cc: Kumar Gala
    Cc: Pete Zaitcev
    Cc: Greg KH
    Reviewed-by: Rik van Riel
    Signed-off-by: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • I_LOCK was used for several unrelated purposes, which caused deadlock
    situations in certain filesystems as a side effect. One of the purposes
    now uses the new I_SYNC bit.

    Also document the various bits and change their order from historical to
    logical.
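
    The flag split, sketched (the bit positions here are illustrative
    assumptions, not quoted from the merged header):

        #define __I_LOCK    7
        #define I_LOCK      (1 << __I_LOCK)     /* keeps its other uses */
        #define __I_SYNC    8
        #define I_SYNC      (1 << __I_SYNC)     /* writeback in progress */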

    [bunk@stusta.de: make fs/inode.c:wake_up_inode() static]
    Signed-off-by: Joern Engel
    Cc: Dave Kleikamp
    Cc: David Chinner
    Cc: Anton Altaparmakov
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joern Engel
     
  • After dirtying a 100M file, the normal behavior is to start writeback for
    all the data after a 30s delay. But sometimes the following happens instead:

    - after 30s: ~4M
    - after 5s: ~4M
    - after 5s: all remaining 92M

    Some analysis shows that the internal io dispatch queues go like this:

            s_io            s_more_io
        ---------------------------------
    1)      100M,1K         0
    2)      1K              96M
    3)      0               96M

    1) initial state with a 100M file and a 1K file
    2) 4M written, nr_to_write <= 0, so the big file is requeued to s_more_io
    3) 1K written, nr_to_write > 0, no more writes (BUG)

    nr_to_write > 0 in (3) fools the upper layer into thinking that all data has
    been written out. The big dirty file is actually still sitting in s_more_io.
    We cannot simply splice s_more_io back to s_io as soon as s_io becomes empty,
    and let the loop in generic_sync_sb_inodes() continue: this may starve newly
    expired inodes in s_dirty. It is also not an option to draw inodes from both
    s_more_io and s_dirty and let the loop go on: this might lead to livelocks,
    and might also starve other superblocks in sync time (well, kupdate may still
    starve some superblocks; that's another bug).

    We have to return when a full scan of s_io completes. So nr_to_write > 0 does
    not necessarily mean that "all data are written". This patch introduces a
    flag, writeback_control.more_io, to indicate this situation. With it, the big
    dirty file no longer has to wait for the next kupdate invocation 5s later.

    Cc: David Chinner
    Cc: Ken Chen
    Signed-off-by: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • This is a writeback-internal marker but we're propagating it all the way
    back to userspace!

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Based on ideas of Andrew:
    http://marc.info/?l=linux-kernel&m=102912915020543&w=2

    Scale the bdi dirty limit inversely with the task's dirty rate. This makes
    heavy writers have a lower dirty limit than the occasional writer.

    Andrea proposed something similar:
    http://lwn.net/Articles/152277/

    The main disadvantage of his patch is that it uses an unrelated quantity to
    measure time, which leaves him with a workload-dependent tunable. Other than
    that, the two approaches appear quite similar.
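
    A runnable approximation of the idea (not the kernel code, which uses a
    decaying floating proportion): shave a task's threshold by its share of
    recent dirtyings, but never below half the global limit.

        #include <stdio.h>

        static unsigned long task_dirty_limit(unsigned long dirty,
                                              unsigned long task_events,
                                              unsigned long total_events)
        {
            unsigned long cut = 0;

            if (total_events)   /* the task's share of recent dirtying */
                cut = (dirty / 2) * task_events / total_events;
            return dirty - cut; /* heavy writers get throttled sooner */
        }

        int main(void)
        {
            /* With a 1000-page limit: an idle task keeps 1000, while a
             * task doing all the recent dirtying drops to 500. */
            printf("idle:  %lu\n", task_dirty_limit(1000, 0, 100));
            printf("heavy: %lu\n", task_dirty_limit(1000, 100, 100));
            return 0;
        }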

    [akpm@linux-foundation.org: fix warning]
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Scale writeback cache per backing device, proportional to its writeout speed.

    By decoupling the BDI dirty thresholds a number of problems we currently have
    will go away, namely:

    - mutual interference starvation (for any number of BDIs);
    - deadlocks with stacked BDIs (loop, FUSE and local NFS mounts).

    It might be that all dirty pages are for a single BDI while other BDIs are
    idling. By giving each BDI a 'fair' share of the dirty limit, each one can have
    dirty pages outstanding and make progress.

    A global threshold also creates a deadlock for stacked BDIs; when A writes to
    B, and A generates enough dirty pages to get throttled, B will never start
    writeback until the dirty pages go away. Again, by giving each BDI its own
    'independent' dirty limit, this problem is avoided.

    So the problem is to determine how to distribute the total dirty limit across
    the BDIs fairly and efficiently. A BDI that has a large dirty limit but does
    not have any dirty pages outstanding is a waste.

    What is done is to keep a floating proportion between the BDIs based on
    writeback completions. This way faster/more active devices get a larger share
    than slower/idle devices.
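
    A runnable sketch of the floating-proportion idea (plain integers here;
    the kernel uses lib/proportions.c with period-based decay):

        #include <stdio.h>

        static unsigned long bdi_dirty_limit(unsigned long dirty_limit,
                                             unsigned long bdi_completions,
                                             unsigned long total_completions)
        {
            if (!total_completions)
                return dirty_limit;     /* no history yet: no clamping */
            return dirty_limit * bdi_completions / total_completions;
        }

        int main(void)
        {
            /* A device with 90 of the last 100 completions gets 90% of a
             * 1000-page limit; the slow device gets the rest. */
            printf("fast: %lu\n", bdi_dirty_limit(1000, 90, 100));
            printf("slow: %lu\n", bdi_dirty_limit(1000, 10, 100));
            return 0;
        }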

    [akpm@linux-foundation.org: fix warnings]
    [hugh@veritas.com: Fix occasional hang when a task couldn't get out of balance_dirty_pages]
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Count per BDI writeback pages.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Count per BDI reclaimable pages; nr_reclaimable = nr_dirty + nr_unstable.
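
    The counters these two patches add, sketched (enum and helper names as
    I read the backing-dev.h of that era; treat them as assumptions):

        enum bdi_stat_item {
            BDI_RECLAIMABLE,    /* dirty + unstable pages */
            BDI_WRITEBACK,      /* pages under writeback */
            NR_BDI_STAT_ITEMS
        };

        /* e.g. accounting a page entering writeback on its device: */
        inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);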

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Here's a cut at fixing up uses of the online node map in generic code.

    mm/shmem.c:shmem_parse_mpol()

        Ensure nodelist is subset of nodes with memory.
        Use node_states[N_HIGH_MEMORY] as default for missing
        nodelist for interleave policy.

    mm/shmem.c:shmem_fill_super()

        initialize policy_nodes to node_states[N_HIGH_MEMORY]

    mm/page-writeback.c:highmem_dirtyable_memory()

        sum over nodes with memory

    mm/page_alloc.c:zlc_setup()

        allowednodes - use nodes with memory.

    mm/page_alloc.c:default_zonelist_order()

        average over nodes with memory.

    mm/page_alloc.c:find_next_best_node()

        skip nodes w/o memory.
        N_HIGH_MEMORY state mask may not be initialized at this time,
        unless we want to depend on early_calculate_totalpages() [see
        below]. Will ZONE_MOVABLE ever be configurable?

    mm/page_alloc.c:find_zone_movable_pfns_for_nodes()

        spread kernelcore over nodes with memory.

        This required calling early_calculate_totalpages()
        unconditionally, and populating N_HIGH_MEMORY node
        state therein from nodes in the early_node_map[].
        If we can depend on this, we can eliminate the
        population of N_HIGH_MEMORY mask from __build_all_zonelists()
        and use the N_HIGH_MEMORY mask in find_next_best_node().

    mm/mempolicy.c:mpol_check_policy()

        Ensure nodes specified for policy are subset of
        nodes with memory.
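
    For instance, the highmem_dirtyable_memory() change might look like this
    sketch (counter names are the 2.6.24-era ones, assumed rather than
    quoted from the diff):

        unsigned long x = 0;
        int node;

        for_each_node_state(node, N_HIGH_MEMORY) {
            struct zone *z = &NODE_DATA(node)->node_zones[ZONE_HIGHMEM];

            x += zone_page_state(z, NR_FREE_PAGES)
               + zone_page_state(z, NR_ACTIVE)
               + zone_page_state(z, NR_INACTIVE);
        }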

    [akpm@linux-foundation.org: fix warnings]
    Signed-off-by: Lee Schermerhorn
    Acked-by: Christoph Lameter
    Cc: Shaohua Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Probing pages and radix_tree_tagged are lockless operations with the lockless
    radix-tree. Convert these users to RCU locking rather than using tree_lock.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

09 Oct, 2007

1 commit

  • All the current page_mkwrite() implementations also set the page dirty,
    which results in the set_page_dirty_balance() call _not_ balancing, because
    the page is already found to be dirty.

    This allows us to dirty a _lot_ of pages without ever hitting
    balance_dirty_pages(). Not good (tm).

    Force a balance call if ->page_mkwrite() was successful.
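
    A sketch of that fix as described (close to, but not quoted from, the
    merged change):

        void set_page_dirty_balance(struct page *page, int page_mkwrite)
        {
            if (set_page_dirty(page) || page_mkwrite) {
                struct address_space *mapping = page_mapping(page);

                if (mapping)
                    balance_dirty_pages_ratelimited(mapping);
            }
        }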

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

20 Jul, 2007

3 commits

  • page-writeback accounting is presently performed in the page-flags macros.
    This is inconsistent and a bit ugly and makes it awkward to implement
    per-backing_dev under-writeback page accounting.

    So move this accounting down to the callsite(s).

    Acked-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Share the same page flag bit for PG_readahead and PG_reclaim.

    One is used only on file reads, the other only for emergency writes. One is
    used mostly for fresh/young pages, the other for old pages.

    Combinations of possible interactions are:

    a) clear PG_reclaim => implicit clear of PG_readahead
        it will delay an asynchronous readahead into a synchronous one
        it actually does _good_ for readahead:
        the pages will be reclaimed soon, it's readahead thrashing!
        in this case, synchronous readahead makes more sense.

    b) clear PG_readahead => implicit clear of PG_reclaim
        one (and only one) page will not be reclaimed in time
        it can be avoided by checking PageWriteback(page) in readahead first

    c) set PG_reclaim => implicit set of PG_readahead
        will confuse readahead and make it restart the size rampup process
        it's a trivial problem, and can mostly be avoided by checking
        PageWriteback(page) first in readahead

    d) set PG_readahead => implicit set of PG_reclaim
        PG_readahead will never be set on already cached pages.
        PG_reclaim will always be cleared on dirtying a page.
        so not a problem.

    In summary,
    a) we get better behavior
    b,d) possible interactions can be avoided
    c) racy condition exists that might affect readahead, but the chance
    is _really_ low, and the hurt on readahead is trivial.

    Compound pages also use PG_reclaim, but for now they do not interact with
    reclaim/readahead code.
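
    The mechanism itself is essentially one line in page-flags.h (shown as a
    sketch):

        #define PG_readahead    PG_reclaim      /* one bit, two disjoint uses */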

    Signed-off-by: Fengguang Wu
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Fix msync data loss and (less importantly) dirty page accounting
    inaccuracies due to the race remaining in clear_page_dirty_for_io().

    The deleted comment explains what the race was, and the added comments
    explain how it is fixed.

    Signed-off-by: Nick Piggin
    Acked-by: Linus Torvalds
    Cc: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

18 Jul, 2007

1 commit

  • It is a bug to set a page dirty if it is not uptodate unless it has
    buffers. If the page has buffers, then the page may be dirty (some buffers
    dirty) but not uptodate (some buffers not uptodate). The exception to this
    rule is if the set_page_dirty caller is racing with truncate or invalidate.

    A buffer can not be set dirty if it is not uptodate.

    If either of these situations occurs, it indicates there could be some data
    loss problem. Some of these warnings could be harmless ones where the
    page or buffer is set uptodate immediately after it is dirtied; however, we
    should fix those up and enforce this ordering.

    Bring the order of operations for truncate into line with those of
    invalidate. This will prevent a page from being able to go !uptodate while
    we're holding the tree_lock, which is probably a good thing anyway.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

11 May, 2007

1 commit

  • Clean up massive code duplication between mpage_writepages() and
    generic_writepages().

    The new generic function, write_cache_pages(), takes a function pointer
    argument, which will be called for each page to be written.

    Maybe cifs_writepages() too can use this infrastructure, but I'm not
    touching that with a ten-foot pole.

    The upcoming page writeback support in fuse will also want this.
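
    The new entry point's shape (signatures as merged in
    mm/page-writeback.c):

        typedef int (*writepage_t)(struct page *page,
                                   struct writeback_control *wbc, void *data);

        int write_cache_pages(struct address_space *mapping,
                              struct writeback_control *wbc,
                              writepage_t writepage, void *data);

    generic_writepages() then reduces to a thin wrapper passing a default
    callback, and fuse can pass its own.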

    Signed-off-by: Miklos Szeredi
    Acked-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

08 May, 2007

1 commit

  • We can use the global ZVC counters to establish the exact size of the LRU
    and the free pages. This allows a more accurate determination of the dirty
    ratio.

    This patch fixes the broken ratio calculations when large amounts of
    memory are allocated to huge pages or other consumers that do not put the
    pages onto the LRU.

    Notes:

    - I did not add NR_SLAB_RECLAIMABLE to the calculation of the
      dirtyable pages. Those may be reclaimable but they are at this
      point not dirtyable. If NR_SLAB_RECLAIMABLE were considered,
      then a huge number of reclaimable pages would stop writeback
      from occurring.

    - This patch used to be in mm as the last one in a series of patches.
      It was removed when Linus updated the treatment of highmem because
      there was a conflict. I updated the patch to follow Linus' approach.
      This patch is needed to fulfill the claims made in the beginning of the
      patchset that is now in Linus' tree.
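
    A hedged sketch of the resulting calculation (free plus LRU pages from
    the global ZVC counters, with NR_SLAB_RECLAIMABLE deliberately left out;
    counter names are the ones of that era, from memory):

        static unsigned long determine_dirtyable_memory(void)
        {
            return global_page_state(NR_FREE_PAGES)
                 + global_page_state(NR_ACTIVE)
                 + global_page_state(NR_INACTIVE);
        }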

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

12 Feb, 2007

3 commits

  • Change a hard-coded constant 0 to the symbolic equivalent NOTIFY_DONE in
    the ratelimit_handler() CPU notifier handler function.
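
    The change, sketched (the handler body is assumed from the
    page-writeback.c of that era, not quoted):

        static int ratelimit_handler(struct notifier_block *self,
                                     unsigned long action, void *hcpu)
        {
            writeback_set_ratelimit();
            return NOTIFY_DONE;     /* previously a bare 0 */
        }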

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul E. McKenney
     
  • A variety of (mostly) innocuous fixes to the embedded kernel-doc content in
    source files, including:

    * make multi-line initial descriptions single line
    * denote some function names, constants and structs as such
    * change erroneous opening '/*' to '/**' in a few places
    * reword some text for clarity
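
    For example, the third fix type turns an ordinary comment into one
    kernel-doc can parse (the comment text here is illustrative):

        /**
         * balance_dirty_pages_ratelimited - balance dirty memory state
         * @mapping: address_space which was dirtied
         *
         * Call this once for each page which was newly dirtied.
         */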

    Signed-off-by: Robert P. J. Day
    Cc: "Randy.Dunlap"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert P. J. Day
     
  • A shmem-backed file does not have page writeback, nor does it participate
    in the backing device's dirty or writeback accounting. So using the generic
    __set_page_dirty_nobuffers() for its .set_page_dirty aops method is a bit
    of overkill. It unnecessarily prolongs shm unmap latency.

    For example, on a densely populated large shm segment (several GBs), the
    unmapping operation becomes painfully long. At unmap, the kernel transfers
    the dirty bit in the PTE into the page struct and on to the radix tree tag.
    The operation of tagging the radix tree is particularly expensive because
    it has to traverse the tree from the root to the leaf node on every dirty
    page. What's bothersome is that the radix tree tag is only used for page
    writeback; shmem is memory backed, and there is no page writeback for such
    a filesystem. In the end, we spend all that time tagging the radix tree and
    none of that fancy tagging is ever used. So let's simplify things by
    introducing a new aop, __set_page_dirty_no_writeback, which will speed up
    shm unmap.
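
    The new aop, sketched from the description: set the dirty flag and
    nothing else, skipping tree tagging and bdi accounting entirely.

        static int __set_page_dirty_no_writeback(struct page *page)
        {
            if (!PageDirty(page))
                return !TestSetPageDirty(page);
            return 0;
        }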

    Signed-off-by: Ken Chen
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Chen
     

30 Jan, 2007

1 commit

  • This makes balance_dirty_pages() always base its calculations on the
    amount of non-highmem memory in the machine, rather than trying to base
    them on total memory and then falling back on non-highmem memory if the
    mapping being written wasn't highmem capable.

    This not only fixes a situation where two different writers can have
    wildly different notions about what is a "balanced" dirty state, but it
    also means that people with highmem machines don't run into an OOM
    situation when regular memory fills up with dirty pages.

    We used to try to handle the latter case by scaling down the dirty_ratio
    if the machine had a lot of highmem pages in page_writeback_init(), but
    it wasn't aggressive enough for some situations, and since basing the
    dirty ratio on highmem memory was broken in the first place, let's just
    stop doing so.

    (A variation of this theme fixed Justin Piszcz's OOM problem when
    copying an 18GB file on a RAID setup).

    Acked-by: Nick Piggin
    Cc: Justin Piszcz
    Cc: Andrew Morton
    Cc: Neil Brown
    Cc: Ingo Molnar
    Cc: Randy Dunlap
    Cc: Christoph Lameter
    Cc: Jens Axboe
    Cc: Peter Zijlstra
    Cc: Adrian Bunk
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

30 Dec, 2006

1 commit

  • The VM layer (on the face of it, fairly reasonably) expected that when
    it does a ->writepage() call to the filesystem, it would write out the
    full page at that point in time. Especially since it had earlier marked
    the whole page dirty with "set_page_dirty()".

    But that isn't actually the case: ->writepage() does not actually write
    a page, it writes the parts of the page that have been explicitly marked
    dirty before, *and* that had not got written out for other reasons since
    the last time we told it they were dirty.

    That last caveat is the important one.

    Which _most_ of the time ends up being the whole page (since we had
    called "set_page_dirty()" on the page earlier), but if the filesystem
    had done any dirty flushing of its own (for example, to honor some
    internal write ordering guarantees), it might end up doing only a
    partial page IO (or none at all) when ->writepage() is actually called.

    That is the correct thing in general (since we actually often _want_
    only the known-dirty parts of the page to be written out), but the
    shared dirty page handling had implicitly forgotten about these details,
    and had a number of cases where it was doing just the "->writepage()"
    part, without telling the low-level filesystem that the whole page might
    have been re-dirtied as part of being mapped writably into user space.

    Since most of the time the FS did actually write out the full page, we
    didn't notice this for a loong time, and this needed some really odd
    patterns to trigger. But it caused occasional corruption with rtorrent
    and with the Debian "apt" database, because both use shared mmaps to
    update the end result.

    This fixes it. Finally. After way too much hair-pulling.

    Acked-by: Nick Piggin
    Acked-by: Martin J. Bligh
    Acked-by: Martin Michlmayr
    Acked-by: Martin Johansson
    Acked-by: Ingo Molnar
    Acked-by: Andrei Popa
    Cc: Hugh Dickins
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Segher Boessenkool
    Cc: David Miller
    Cc: Arjan van de Ven
    Cc: Gordon Farquharson
    Cc: Guillaume Chazarain
    Cc: Theodore Tso
    Cc: Kenneth Cheng
    Cc: Tobias Diedrich
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

22 Dec, 2006

1 commit

  • They were horribly easy to mis-use because of their tempting naming, and
    they also did way more than any users of them generally wanted them to
    do.

    A dirty page can become clean under two circumstances:

    (a) when we write it out. We have "clear_page_dirty_for_io()" for
        this, and that function remains unchanged.

        In the "for IO" case it is not sufficient to just clear the dirty
        bit, you also have to mark the page as being under writeback etc.

    (b) when we actually remove a page due to it becoming inaccessible to
        users, notably because it was truncate()'d away or the file (or
        metadata) no longer exists, and we thus want to cancel any
        outstanding dirty state.

    For the (b) case, we now introduce "cancel_dirty_page()", which only
    touches the page state itself, and verifies that the page is not mapped
    (since cancelling writes on a mapped page would be actively wrong as it
    is still accessible to users).

    Some filesystems need to be fixed up for this: CIFS, FUSE, JFS,
    ReiserFS, XFS all use the old confusing functions, and will be fixed
    separately in subsequent commits (with some of them just removing the
    offending logic, and others using clear_page_dirty_for_io()).

    This was confirmed by Martin Michlmayr to fix the apt database
    corruption on ARM.
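
    A sketch of cancel_dirty_page() per the description above (the
    accounting calls reflect the code of that era as best remembered):

        void cancel_dirty_page(struct page *page, unsigned int account_size)
        {
            if (TestClearPageDirty(page)) {
                struct address_space *mapping = page->mapping;

                if (mapping && mapping_cap_account_dirty(mapping)) {
                    dec_zone_page_state(page, NR_FILE_DIRTY);
                    if (account_size)
                        task_io_account_cancelled_write(account_size);
                }
            }
        }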

    Cc: Martin Michlmayr
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Cc: Arjan van de Ven
    Cc: Andrei Popa
    Cc: Andrew Morton
    Cc: Dave Kleikamp
    Cc: Gordon Farquharson
    Cc: Martin Schwidefsky
    Cc: Trond Myklebust
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

11 Dec, 2006

2 commits

  • Accounting writes is fairly simple: whenever a process flips a page from clean
    to dirty, we accuse it of having caused a write to underlying storage of
    PAGE_CACHE_SIZE bytes.

    This may overestimate the amount of writing: the page-dirtying may cause only
    one buffer_head's worth of writeout. Fixing that is possible, but probably a
    bit messy and isn't obviously important.
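
    The hook itself is tiny (a sketch of the inline helper, assuming
    CONFIG_TASK_IO_ACCOUNTING):

        static inline void task_io_account_write(size_t bytes)
        {
            current->ioac.write_bytes += bytes;
        }

    The set-dirty paths call it with PAGE_CACHE_SIZE, per the rule above.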

    Cc: Jay Lan
    Cc: Shailabh Nagar
    Cc: Balbir Singh
    Cc: Chris Sturtivant
    Cc: Tony Ernst
    Cc: Guillaume Thouvenin
    Cc: David Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Save a tabstop in __set_page_dirty_nobuffers() and __set_page_dirty_buffers()
    and a few other places. No functional changes.

    Cc: Jay Lan
    Cc: Shailabh Nagar
    Cc: Balbir Singh
    Cc: Chris Sturtivant
    Cc: Tony Ernst
    Cc: Guillaume Thouvenin
    Cc: David Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

21 Oct, 2006

1 commit

  • Separate out the concept of "queue congestion" from "backing-dev congestion".
    Congestion is a backing-dev concept, not a queue concept.

    The blk_* congestion functions are retained, as wrappers around the core
    backing-dev congestion functions.

    This proper layering is needed so that NFS can cleanly use the congestion
    functions, and so that CONFIG_BLOCK=n actually links.
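
    The layering, sketched: the core wait moves to the backing-dev level and
    the blk_* name stays as a wrapper (shown as an illustration of the
    shape, not the exact header contents):

        long congestion_wait(int rw, long timeout);

        static inline long blk_congestion_wait(int rw, long timeout)
        {
            return congestion_wait(rw, timeout);
        }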

    Cc: "Thomas Maier"
    Cc: "Jens Axboe"
    Cc: Trond Myklebust
    Cc: David Howells
    Cc: Peter Osterlund
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

01 Oct, 2006

1 commit

  • Make it possible to disable the block layer. Not all embedded devices
    require it; some can make do with just JFFS2, NFS, ramfs, etc - none of
    which require the block layer to be present.

    This patch does the following:

    (*) Introduces CONFIG_BLOCK to disable the block layer, buffering and blockdev
    support.

    (*) Adds dependencies on CONFIG_BLOCK to any configuration item that controls
    an item that uses the block layer. This includes:

        (*) Block I/O tracing.

        (*) Disk partition code.

        (*) All filesystems that are block based, eg: Ext3, ReiserFS, ISOFS.

        (*) The SCSI layer. As far as I can tell, even SCSI chardevs use the
            block layer to do scheduling. Some drivers that use SCSI
            facilities - such as USB storage - end up disabled indirectly
            from this.

        (*) Various block-based device drivers, such as IDE and the old CDROM
            drivers.

        (*) MTD blockdev handling and FTL.

        (*) JFFS - which uses set_bdev_super(), something it could avoid doing
            by taking a leaf out of JFFS2's book.

    (*) Makes most of the contents of linux/blkdev.h, linux/buffer_head.h and
    linux/elevator.h contingent on CONFIG_BLOCK being set. sector_div() is,
    however, still used in places, and so is still available.

    (*) Also made contingent are the contents of linux/mpage.h, linux/genhd.h and
    parts of linux/fs.h.

    (*) Makes a number of files in fs/ contingent on CONFIG_BLOCK.

    (*) Makes mm/bounce.c (bounce buffering) contingent on CONFIG_BLOCK.

    (*) set_page_dirty() doesn't call __set_page_dirty_buffers() if CONFIG_BLOCK
    is not enabled.

    (*) fs/no-block.c is created to hold out-of-line stubs and things that are
    required when CONFIG_BLOCK is not set:

        (*) Default blockdev file operations (to give error ENODEV on opening).

    (*) Makes some /proc changes:

        (*) /proc/devices does not list any blockdevs.

        (*) /proc/diskstats and /proc/partitions are contingent on CONFIG_BLOCK.

    (*) Makes some compat ioctl handling contingent on CONFIG_BLOCK.

    (*) If CONFIG_BLOCK is not defined, makes sys_quotactl() return -ENODEV if
    given command other than Q_SYNC or if a special device is specified.

    (*) In init/do_mounts.c, no reference is made to the blockdev routines if
    CONFIG_BLOCK is not defined. This does not prohibit NFS roots or JFFS2.

    (*) The bdflush, ioprio_set and ioprio_get syscalls can now be absent (return
    error ENOSYS by way of cond_syscall if so).

    (*) The seclvl_bd_claim() and seclvl_bd_release() security calls do nothing if
    CONFIG_BLOCK is not set, since they can't then happen.
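
    Illustrating one item from the list, the set_page_dirty() guard might
    look like this simplified sketch (the !CONFIG_BLOCK fallback shown is an
    assumption, not the exact diff):

        /* inside set_page_dirty(): choose the default .set_page_dirty */
        int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;

        if (!spd) {
        #ifdef CONFIG_BLOCK
            spd = __set_page_dirty_buffers;     /* buffer_head path */
        #else
            spd = __set_page_dirty_nobuffers;   /* block layer absent */
        #endif
        }
        return (*spd)(page);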

    Signed-Off-By: David Howells
    Signed-off-by: Jens Axboe

    David Howells