29 Oct, 2009

1 commit


26 Sep, 2009

1 commit

  • Sometimes we only want to write pages from a specific super_block,
    so allow that to be passed in.

    This fixes a problem with commit 56a131dcf7ed36c3c6e36bea448b674ea85ed5bb
    causing writeback on all super_blocks on a bdi, where we only really
    want to sync a specific sb from writeback_inodes_sb().
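
    A minimal sketch of the idea - the helper name and exact signature here
    are assumptions, not the actual patch:

      /*
       * Illustrative only: skip inodes belonging to another super_block
       * when a specific sb was requested; sb == NULL keeps the old
       * "write everything on this bdi" behaviour.
       */
      static void writeback_inodes_filtered(struct bdi_writeback *wb,
                                            struct super_block *sb)
      {
              struct inode *inode;

              list_for_each_entry(inode, &wb->b_io, i_list) {
                      if (sb && inode->i_sb != sb)
                              continue;  /* not the sb we were asked for */
                      /* ... queue this inode's dirty pages ... */
              }
      }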

    Signed-off-by: Jens Axboe

    Jens Axboe
     

16 Sep, 2009

2 commits

  • bdi_start_writeback() is currently split into two paths, one for
    WB_SYNC_NONE and one for WB_SYNC_ALL. Add bdi_sync_writeback()
    for WB_SYNC_ALL writeback and let bdi_start_writeback() handle
    only WB_SYNC_NONE.

    Push down the writeback_control allocation and only accept the
    parameters that make sense for each function. This cleans up
    the API considerably.
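
    The resulting split might look roughly like this (a sketch; the actual
    prototypes in the tree may differ):

      /* WB_SYNC_NONE: kick opportunistic writeback, return immediately */
      void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages);

      /* WB_SYNC_ALL: data-integrity writeback of one sb, waits for it */
      void bdi_sync_writeback(struct backing_dev_info *bdi,
                              struct super_block *sb);

    Each helper fills out its own struct writeback_control internally, which
    is what removes it from the calling convention.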

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Now that bdi_writeback_all() no longer handles integrity writeback,
    it doesn't have to block anymore. This means that we can switch
    bdi_list reader side protection to RCU.
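
    The reader side then looks roughly like this (sketch; writers still
    serialize on bdi_lock and use the list_*_rcu() helpers):

      struct backing_dev_info *bdi;

      rcu_read_lock();
      list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
              /* kick WB_SYNC_NONE writeback; must not block in here */
      }
      rcu_read_unlock();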

    Signed-off-by: Jens Axboe

    Jens Axboe
     

11 Sep, 2009

4 commits

  • Also a debugging aid. We want to catch dirty inodes being added to
    backing devices that don't do writeback.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This enables us to track who does what and print info. Its main use
    is catching dirty inodes on the default_backing_dev_info, so we can
    fix that up.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This gets rid of pdflush for bdi writeout and kupdated style cleaning.
    pdflush writeout suffers from lack of locality and also requires more
    threads to handle the same workload, since it has to work in a
    non-blocking fashion against each queue. This also introduces lumpy
    behaviour and potential request starvation, since pdflush can be starved
    for queue access if others are accessing it. A sample ffsb workload that
    does random writes to files is about 8% faster here on a simple SATA drive
    during the benchmark phase. File layout also looks a LOT smoother in
    vmstat:

    r b swpd free buff cache si so bi bo in cs us sy id wa
    0 1 0 608848 2652 375372 0 0 0 71024 604 24 1 10 48 42
    0 1 0 549644 2712 433736 0 0 0 60692 505 27 1 8 48 44
    1 0 0 476928 2784 505192 0 0 4 29540 553 24 0 9 53 37
    0 1 0 457972 2808 524008 0 0 0 54876 331 16 0 4 38 58
    0 1 0 366128 2928 614284 0 0 4 92168 710 58 0 13 53 34
    0 1 0 295092 3000 684140 0 0 0 62924 572 23 0 9 53 37
    0 1 0 236592 3064 741704 0 0 4 58256 523 17 0 8 48 44
    0 1 0 165608 3132 811464 0 0 0 57460 560 21 0 8 54 38
    0 1 0 102952 3200 873164 0 0 4 74748 540 29 1 10 48 41
    0 1 0 48604 3252 926472 0 0 0 53248 469 29 0 7 47 45

    where vanilla tends to fluctuate a lot in the creation phase:

    r b swpd free buff cache si so bi bo in cs us sy id wa
    1 1 0 678716 5792 303380 0 0 0 74064 565 50 1 11 52 36
    1 0 0 662488 5864 319396 0 0 4 352 302 329 0 2 47 51
    0 1 0 599312 5924 381468 0 0 0 78164 516 55 0 9 51 40
    0 1 0 519952 6008 459516 0 0 4 78156 622 56 1 11 52 37
    1 1 0 436640 6092 541632 0 0 0 82244 622 54 0 11 48 41
    0 1 0 436640 6092 541660 0 0 0 8 152 39 0 0 51 49
    0 1 0 332224 6200 644252 0 0 4 102800 728 46 1 13 49 36
    1 0 0 274492 6260 701056 0 0 4 12328 459 49 0 7 50 43
    0 1 0 211220 6324 763356 0 0 0 106940 515 37 1 10 51 39
    1 0 0 160412 6376 813468 0 0 0 8224 415 43 0 6 49 45
    1 1 0 85980 6452 886556 0 0 4 113516 575 39 1 11 54 34
    0 2 0 85968 6452 886620 0 0 0 1640 158 211 0 0 46 54

    A 10-disk test with btrfs performs 26% faster with per-bdi flushing. An
    SSD-based writeback test on XFS performs over 20% better as well, with
    throughput very stable at around 1GB/sec, where pdflush manages only
    750MB/sec and fluctuates wildly while doing so. Random buffered writes to
    many files behave a lot better as well, as do random mmap'ed writes.

    A separate thread is added to sync the super blocks. In the long term,
    adding sync_supers_bdi() functionality could get rid of this thread again.
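
    Conceptually, each bdi now runs a flusher loop along these lines (a
    heavily simplified sketch of the real writeback task):

      static int bdi_flusher(void *data)
      {
              struct bdi_writeback *wb = data;

              while (!kthread_should_stop()) {
                      /* explicit work items plus periodic kupdate-style
                       * writeback, for this bdi only */
                      wb_do_writeback(wb);
                      schedule_timeout_interruptible(
                              msecs_to_jiffies(dirty_writeback_interval * 10));
              }
              return 0;
      }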

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This is a first step toward introducing per-bdi flusher threads. There
    should be no change in behaviour, although sb_has_dirty_inodes() is now
    ridiculously expensive, as there's no easy way to answer that question.
    Not a huge problem, since it'll be deleted in subsequent patches.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

12 Jul, 2009

1 commit

  • Move the definition of BLK_RW_ASYNC/BLK_RW_SYNC into linux/backing-dev.h
    so that it is available to all callers of set/clear_bdi_congested().

    This replaces commit 097041e576ee3a50d92dd643ee8ca65bf6a62e21 ("fuse:
    Fix build error"), which will be reverted.

    Signed-off-by: Trond Myklebust
    Acked-by: Larry Finger
    Cc: Jens Axboe
    Cc: Miklos Szeredi
    Signed-off-by: Linus Torvalds

    Trond Myklebust
     

11 Jul, 2009

1 commit


06 Apr, 2009

1 commit


20 Oct, 2008

1 commit

  • Split the LRU lists in two, one set for pages that are backed by real file
    systems ("file") and one for pages that are backed by memory and swap
    ("anon"). The latter includes tmpfs.

    The advantage of doing this is that the VM will not have to scan over lots
    of anonymous pages (which we generally do not want to swap out), just to
    find the page cache pages that it should evict.

    This patch has the infrastructure and a basic policy to balance how much
    we scan the anon lists and how much we scan the file lists. The big
    policy changes are in separate patches.
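
    The list structure introduced by the series looks roughly like this
    (simplified sketch; the real enum is built from LRU_BASE offsets):

      enum lru_list {
              LRU_INACTIVE_ANON,      /* anon and tmpfs/shmem: swap backed */
              LRU_ACTIVE_ANON,
              LRU_INACTIVE_FILE,      /* page cache from real filesystems */
              LRU_ACTIVE_FILE,
              NR_LRU_LISTS
      };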

    [lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
    [kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
    [kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
    [hugh@veritas.com: memcg swapbacked pages active]
    [hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
    [akpm@linux-foundation.org: fix /proc/vmstat units]
    [nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
    [kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
    [kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

30 Apr, 2008

6 commits

  • Fuse needs this for writable mmap support.

    Signed-off-by: Miklos Szeredi
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Add a new BDI capability flag: BDI_CAP_NO_ACCT_WB. If this flag is
    set, then don't update the per-bdi writeback stats from
    test_set_page_writeback() and test_clear_page_writeback().

    Misc cleanups:

    - convert bdi_cap_writeback_dirty() and friends to static inline functions
    - create a flag that includes all three dirty/writeback related flags,
    since almost all users will want them together
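
    A sketch of the flag and helpers described above (the flag value is
    approximate):

      #define BDI_CAP_NO_ACCT_WB      0x00000080

      /* the combined flag covering all three dirty/writeback bits */
      #define BDI_CAP_NO_ACCT_AND_WRITEBACK \
              (BDI_CAP_NO_WRITEBACK | BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_ACCT_WB)

      static inline bool bdi_cap_account_writeback(struct backing_dev_info *bdi)
      {
              /* account writeback only if we both do it and account it */
              return !(bdi->capabilities & (BDI_CAP_NO_ACCT_WB |
                                            BDI_CAP_NO_WRITEBACK));
      }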

    Signed-off-by: Miklos Szeredi
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Move BDI statistics to debugfs:

    /sys/kernel/debug/bdi/<bdi>/stats

    Use postcore_initcall() to initialize the sysfs class and debugfs
    entries, since debugfs itself is initialized at core_initcall() time and
    must come up first.

    Update descriptions in ABI documentation.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Add "max_ratio" to /sys/class/bdi. This indicates the maximum percentage of
    the global dirty threshold allocated to this bdi.

    [mszeredi@suse.cz]

    - fix parsing in max_ratio_store().
    - export bdi_set_max_ratio() to modules
    - limit bdi_dirty with bdi->max_ratio
    - document new sysfs attribute

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Under normal circumstances each device is given a part of the total
    write-back cache proportional to its current average writeout speed
    relative to the other devices.

    min_ratio - allows one to assign a minimum portion of the write-back cache
    to a particular device. This is useful in situations where you might want
    to provide a minimum QoS. (One request for this feature came from
    flash-based storage people who wanted to avoid writing out at all costs -
    they of course needed some pdflush hacks as well.)

    max_ratio - allows one to assign a maximum portion of the dirty limit to
    a particular device. This is useful in situations where you want to avoid
    one device taking all or most of the write-back cache, e.g. an NFS mount
    that is prone to get stuck, or a FUSE mount which you don't trust to play
    fair.

    Add "min_ratio" to /sys/class/bdi. This indicates the minimum percentage of
    the global dirty threshold allocated to this bdi.

    [mszeredi@suse.cz]

    - fix parsing in min_ratio_store()
    - document new sysfs attribute
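
    Together, the two knobs clamp each device's computed share of the global
    dirty limit. Conceptually (a sketch - the real code works through the
    proportion machinery rather than this direct arithmetic):

      unsigned long bdi_clamped_dirty(struct backing_dev_info *bdi,
                                      unsigned long dirty_total,
                                      unsigned long bdi_share)
      {
              unsigned long lo = dirty_total * bdi->min_ratio / 100;
              unsigned long hi = dirty_total * bdi->max_ratio / 100;

              /* keep the share between the min and max percentages */
              return clamp(bdi_share, lo, hi);
      }

    From userspace both knobs are plain sysfs attributes, e.g.
    "echo 10 > /sys/class/bdi/<bdi>/min_ratio".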

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Provide a place in sysfs (/sys/class/bdi) for the backing_dev_info object.
    This allows us to see and set the various BDI specific variables.

    In particular this properly exposes the read-ahead window for all relevant
    users and /sys/block/<device>/queue/read_ahead_kb should be deprecated.

    With patient help from Kay Sievers and Greg KH

    [mszeredi@suse.cz]

    - split off NFS and FUSE changes into separate patches
    - document new sysfs attributes under Documentation/ABI
    - do bdi_class_init as a core_initcall, otherwise the "default" BDI
    won't be initialized
    - remove bdi_init_fmt macro, it's not used very much

    [akpm@linux-foundation.org: fix ia64 warning]
    Signed-off-by: Peter Zijlstra
    Cc: Kay Sievers
    Acked-by: Greg KH
    Cc: Trond Myklebust
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

29 Apr, 2008

1 commit


17 Oct, 2007

6 commits

  • Scale writeback cache per backing device, proportional to its writeout speed.

    By decoupling the BDI dirty thresholds a number of problems we currently have
    will go away, namely:

    - mutual interference starvation (for any number of BDIs);
    - deadlocks with stacked BDIs (loop, FUSE and local NFS mounts).

    It might be that all dirty pages are for a single BDI while other BDIs are
    idling. By giving each BDI a 'fair' share of the dirty limit, each one can have
    dirty pages outstanding and make progress.

    A global threshold also creates a deadlock for stacked BDIs; when A writes to
    B, and A generates enough dirty pages to get throttled, B will never start
    writeback until the dirty pages go away. Again, by giving each BDI its own
    'independent' dirty limit, this problem is avoided.

    So the problem is to determine how to distribute the total dirty limit
    across the BDIs fairly and efficiently. A BDI that has a large dirty
    limit but does not have any dirty pages outstanding is a waste.

    What is done is to keep a floating proportion between the BDIs based on
    writeback completions. This way faster/more active devices get a larger
    share than slower/idle devices.
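
    Ignoring the period-based decay, the resulting share is conceptually
    just (sketch; the kernel implements this with the floating-proportion
    library code):

      unsigned long bdi_dirty_share(unsigned long dirty_total,
                                    unsigned long bdi_completions,
                                    unsigned long total_completions)
      {
              if (!total_completions)
                      return dirty_total;     /* no history yet */
              /* the slice follows the fraction of recent completions */
              return dirty_total * bdi_completions / total_completions;
      }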

    [akpm@linux-foundation.org: fix warnings]
    [hugh@veritas.com: Fix occasional hang when a task couldn't get out of balance_dirty_pages]
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Count per BDI writeback pages.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Count per BDI reclaimable pages; nr_reclaimable = nr_dirty + nr_unstable.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Provide scalable per backing_dev_info statistics counters.
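
    A sketch of the shape this takes (simplified; names follow the series):

      enum bdi_stat_item {
              BDI_RECLAIMABLE,
              BDI_WRITEBACK,
              NR_BDI_STAT_ITEMS
      };

      struct backing_dev_info {
              /* ... */
              struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
      };

      static inline void inc_bdi_stat(struct backing_dev_info *bdi,
                                      enum bdi_stat_item item)
      {
              percpu_counter_inc(&bdi->bdi_stat[item]);
      }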

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • provide BDI constructor/destructor hooks

    [akpm@linux-foundation.org: compile fix]
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • These patches aim to improve balance_dirty_pages() and directly address three
    issues:
    1) inter device starvation
    2) stacked device deadlocks
    3) inter process starvation

    1 and 2 are a direct result of removing the global dirty limit and using
    per-device dirty limits. With its own dirty limit, one device can no
    longer starve another, and the cyclic dependency on the global dirty
    limit is broken.

    In order to efficiently distribute the dirty limit across the independent
    devices, a floating proportion is used; this allocates each device a
    share of the total limit proportional to its recent activity.

    3 is done by also scaling the dirty limit in proportion to the current
    task's recent dirty rate.

    This patch:

    nfs: remove congestion_end(). It's redundant, clear_bdi_congested() already
    wakes the waiters.
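
    For reference, the wakeup that makes congestion_end() redundant happens
    roughly like this (sketch of the era's core helper):

      void clear_bdi_congested(struct backing_dev_info *bdi, int rw)
      {
              enum bdi_state bit;
              wait_queue_head_t *wqh = &congestion_wqh[rw];

              bit = (rw == WRITE) ? BDI_write_congested : BDI_read_congested;
              clear_bit(bit, &bdi->state);
              smp_mb__after_clear_bit();
              if (waitqueue_active(wqh))
                      wake_up(wqh);   /* what congestion_end() duplicated */
      }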

    Signed-off-by: Peter Zijlstra
    Cc: Trond Myklebust
    Cc: "J. Bruce Fields"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

17 Jul, 2007

1 commit


17 Mar, 2007

1 commit

  • The current NFS client congestion logic is severely broken: it marks the
    backing device congested during each nfs_writepages() call but doesn't
    mirror this in nfs_writepage(), which makes for deadlocks. It also
    implements its own waitqueue.

    Replace this by a more regular congestion implementation that puts a cap on
    the number of active writeback pages and uses the bdi congestion waitqueue.

    Also always use an interruptible wait since it makes sense to be able to
    SIGKILL the process even for mounts without 'intr'.
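
    The cap works by accounting in-flight writeback pages per server and
    flipping bdi congestion around a threshold, roughly (a sketch with
    approximate names):

      static void nfs_set_page_writeback(struct page *page)
      {
              struct nfs_server *nfss = NFS_SERVER(page->mapping->host);

              if (!test_set_page_writeback(page) &&
                  atomic_long_inc_return(&nfss->writeback) >
                          NFS_CONGESTION_ON_THRESH)
                      set_bdi_congested(&nfss->backing_dev_info, WRITE);
      }

    The completion side decrements the count and calls clear_bdi_congested()
    once it drops back below the off threshold.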

    Signed-off-by: Peter Zijlstra
    Acked-by: Trond Myklebust
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

21 Oct, 2006

1 commit

  • Separate out the concept of "queue congestion" from "backing-dev congestion".
    Congestion is a backing-dev concept, not a queue concept.

    The blk_* congestion functions are retained, as wrappers around the core
    backing-dev congestion functions.

    This proper layering is needed so that NFS can cleanly use the congestion
    functions, and so that CONFIG_BLOCK=n actually links.
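
    The retained block wrappers then become thin shims over the backing-dev
    core, roughly:

      void blk_set_queue_congested(struct request_queue *q, int rw)
      {
              set_bdi_congested(&q->backing_dev_info, rw);
      }

      void blk_clear_queue_congested(struct request_queue *q, int rw)
      {
              clear_bdi_congested(&q->backing_dev_info, rw);
      }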

    Cc: "Thomas Maier"
    Cc: "Jens Axboe"
    Cc: Trond Myklebust
    Cc: David Howells
    Cc: Peter Osterlund
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds