02 Aug, 2012

2 commits

  • Pull block driver changes from Jens Axboe:

    - Making the plugging support for drivers a bit more sane from Neil.
    This supersedes the plugging change from Shaohua as well.

    - The usual round of drbd updates.

    - Using a tail add instead of a head add in the request completion for
    nbd, making us find the most completed request more quickly.

    - A few floppy changes, getting rid of a duplicated flag and also
    running the floppy init async (since it takes forever in boot terms)
    from Andi.

    * 'for-3.6/drivers' of git://git.kernel.dk/linux-block:
    floppy: remove duplicated flag FD_RAW_NEED_DISK
    blk: pass from_schedule to non-request unplug functions.
    block: stack unplug
    blk: centralize non-request unplug handling.
    md: remove plug_cnt feature of plugging.
    block/nbd: micro-optimization in nbd request completion
    drbd: announce FLUSH/FUA capability to upper layers
    drbd: fix max_bio_size to be unsigned
    drbd: flush drbd work queue before invalidate/invalidate remote
    drbd: fix potential access after free
    drbd: call local-io-error handler early
    drbd: do not reset rs_pending_cnt too early
    drbd: reset congestion information before reporting it in /proc/drbd
    drbd: report congestion if we are waiting for some userland callback
    drbd: differentiate between normal and forced detach
    drbd: cleanup, remove two unused global flags
    floppy: Run floppy initialization asynchronous

    Linus Torvalds
     
  • Pull md updates from NeilBrown.

    * 'for-next' of git://neil.brown.name/md:
    DM RAID: Add support for MD RAID10
    md/RAID1: Add missing case for attempting to repair known bad blocks.
    md/raid5: For odirect-write performance, do not set STRIPE_PREREAD_ACTIVE.
    md/raid1: don't abort a resync on the first badblock.
    md: remove duplicated test on ->openers when calling do_md_stop()
    raid5: Add R5_ReadNoMerge flag which prevent bio from merging at block layer
    md/raid1: prevent merging too large request
    md/raid1: read balance chooses idlest disk for SSD
    md/raid1: make sequential read detection per disk based
    MD RAID10: Export md_raid10_congested
    MD: Move macros from raid1*.h to raid1*.c
    MD RAID1: rename mirror_info structure
    MD RAID10: rename mirror_info structure
    MD RAID10: Fix compiler warning.
    raid5: add a per-stripe lock
    raid5: remove unnecessary bitmap write optimization
    raid5: lockless access raid5 overrided bi_phys_segments
    raid5: reduce chance release_stripe() taking device_lock

    Linus Torvalds
     

01 Aug, 2012

2 commits


31 Jul, 2012

18 commits

  • This will allow md/raid to know why the unplug was called,
    and to act accordingly - if !from_schedule it
    is safe to perform tasks which could themselves schedule.

    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    NeilBrown
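    The pattern above can be sketched in userspace C. This is an
    illustrative sketch only, not the kernel API: the names
    md_unplug_sketch, do_pending_writes and queue_to_helper_thread are
    hypothetical stand-ins for the md unplug path.

```c
#include <assert.h>
#include <stdbool.h>

static int inline_runs;
static int deferred_runs;

/* Work that may block/schedule, so it is only safe outside the scheduler. */
static void do_pending_writes(void) { inline_runs++; }

/* Hand the work to a helper thread instead of running it inline. */
static void queue_to_helper_thread(void) { deferred_runs++; }

/* The unplug callback is told whether it was invoked from the
 * scheduler; if so, it must defer anything that could itself sleep. */
static void md_unplug_sketch(bool from_schedule)
{
    if (!from_schedule)
        do_pending_writes();      /* safe: may block/schedule */
    else
        queue_to_helper_thread(); /* inside the scheduler: defer */
}
```

    The point of passing from_schedule down is exactly this branch: the
    caller's context decides whether the driver may do blocking work
    inline.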
     
  • Both md and umem have similar code for getting notified on a
    blk_finish_plug event.
    Centralize this code in block/ and allow each driver to
    provide its distinctive difference.

    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    NeilBrown
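    The centralization described above amounts to the block layer
    keeping a list of driver-registered callbacks on the plug and firing
    them when the plug is finished. A minimal userspace sketch of that
    idea follows; the struct and function names here are hypothetical,
    not the real block-layer API.

```c
#include <assert.h>
#include <stddef.h>

/* A plug holds a singly-linked list of driver callbacks. */
struct plug_cb {
    void (*callback)(struct plug_cb *cb);
    struct plug_cb *next;
};

struct plug {
    struct plug_cb *cb_list;
};

/* Each driver adds only its own callback; the list handling is shared. */
static void plug_add_cb(struct plug *plug, struct plug_cb *cb)
{
    cb->next = plug->cb_list;
    plug->cb_list = cb;
}

/* Finishing the plug fires every registered callback once. */
static void plug_finish(struct plug *plug)
{
    struct plug_cb *cb = plug->cb_list;

    plug->cb_list = NULL;
    while (cb) {
        struct plug_cb *next = cb->next;
        cb->callback(cb);   /* driver-specific work runs here */
        cb = next;
    }
}

/* Demo callback standing in for md's or umem's unplug work. */
static int fired;
static void count_cb(struct plug_cb *cb) { (void)cb; fired++; }
```

    With this split, md and umem no longer duplicate the list handling;
    each contributes just its callback body.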
     
  • This seemed like a good idea at the time, but after further thought I
    cannot see it making a difference other than very occasionally, and
    testing intended to exercise the case it is most likely to help did
    not show any performance difference from removing it.

    So remove the counting of active plugs and allow 'pending writes' to
    be activated at any time, not just when no plugs are active.

    This is only relevant when there is a write-intent bitmap, and the
    updating of the bitmap will likely introduce enough delay that
    the single-threading of bitmap updates will be enough to collect large
    numbers of updates together.

    Removing this will make it easier to centralise the unplug code, and
    will clear the way for other unplug enhancements which have a
    measurable effect.

    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    NeilBrown
     
  • When doing resync or repair, attempt to correct bad blocks according
    to the WriteErrorSeen policy.

    Signed-off-by: Alex Lyakas
    Signed-off-by: NeilBrown

    Alexander Lyakas
     
  • Merge Andrew's first set of patches:
    "Non-MM patches:

    - lots of misc bits

    - tree-wide have_clk() cleanups

    - quite a lot of printk tweaks. I draw your attention to "printk:
    convert the format for KERN_ to a 2 byte pattern" which
    looks a bit scary. But afaict it's solid.

    - backlight updates

    - lib/ feature work (notably the addition and use of memweight())

    - checkpatch updates

    - rtc updates

    - nilfs updates

    - fatfs updates (partial, still waiting for acks)

    - kdump, proc, fork, IPC, sysctl, taskstats, pps, etc

    - new fault-injection feature work"

    * Merge emailed patches from Andrew Morton : (128 commits)
    drivers/misc/lkdtm.c: fix missing allocation failure check
    lib/scatterlist: do not re-write gfp_flags in __sg_alloc_table()
    fault-injection: add tool to run command with failslab or fail_page_alloc
    fault-injection: add selftests for cpu and memory hotplug
    powerpc: pSeries reconfig notifier error injection module
    memory: memory notifier error injection module
    PM: PM notifier error injection module
    cpu: rewrite cpu-notifier-error-inject module
    fault-injection: notifier error injection
    c/r: fcntl: add F_GETOWNER_UIDS option
    resource: make sure requested range is included in the root range
    include/linux/aio.h: cpp->C conversions
    fs: cachefiles: add support for large files in filesystem caching
    pps: return PTR_ERR on error in device_create
    taskstats: check nla_reserve() return
    sysctl: suppress kmemleak messages
    ipc: use Kconfig options for __ARCH_WANT_[COMPAT_]IPC_PARSE_VERSION
    ipc: compat: use signed size_t types for msgsnd and msgrcv
    ipc: allow compat IPC version field parsing if !ARCH_WANT_OLD_COMPAT_IPC
    ipc: add COMPAT_SHMLBA support
    ...

    Linus Torvalds
     
  • Use memweight() to count the total number of bits set in memory area.

    Signed-off-by: Akinobu Mita
    Cc: Alasdair Kergon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
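    For reference, memweight() counts the total number of set bits in an
    arbitrary memory area. A minimal userspace sketch of its semantics
    (the kernel version in lib/memweight.c is optimized; this only
    illustrates the behaviour):

```c
#include <assert.h>
#include <stddef.h>

/* Count the set bits in a memory area, byte by byte. */
static size_t memweight_sketch(const void *ptr, size_t bytes)
{
    const unsigned char *p = ptr;
    size_t w = 0;

    while (bytes--) {
        unsigned char b = *p++;
        while (b) {
            b &= b - 1;  /* clear the lowest set bit */
            w++;
        }
    }
    return w;
}
```

    This is exactly the kind of open-coded bit counting that callers can
    now delete in favour of the shared helper.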
     
  • 'sync' writes set both REQ_SYNC and REQ_NOIDLE.
    O_DIRECT writes set REQ_SYNC but not REQ_NOIDLE.

    We currently assume that a REQ_SYNC request will not be followed by
    more requests and so set STRIPE_PREREAD_ACTIVE to expedite the
    request.
    This is appropriate for sync requests, but not for O_DIRECT requests.

    So make the setting of STRIPE_PREREAD_ACTIVE conditional on REQ_NOIDLE
    rather than REQ_SYNC. This is consistent with the documented meaning
    of REQ_NOIDLE:

    __REQ_NOIDLE, /* don't anticipate more IO after this one */

    Signed-off-by: Jianpeng Ma
    Signed-off-by: NeilBrown

    majianpeng
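    The change above boils down to which flag gates the expedite path. A
    userspace sketch, with illustrative flag values only (the real REQ_*
    bits live in include/linux/blk_types.h) and a hypothetical helper
    name:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values, not the kernel's actual bit positions. */
#define REQ_SYNC   (1u << 0)
#define REQ_NOIDLE (1u << 1)

/* Before: keyed off REQ_SYNC, which wrongly caught O_DIRECT writes
 * (REQ_SYNC set, REQ_NOIDLE clear).  After: key off REQ_NOIDLE, i.e.
 * expedite only when no further IO is anticipated. */
static bool set_preread_active(unsigned int rw_flags)
{
    return (rw_flags & REQ_NOIDLE) != 0;
}
```

    A 'sync' write (REQ_SYNC|REQ_NOIDLE) still gets
    STRIPE_PREREAD_ACTIVE; an O_DIRECT write (REQ_SYNC alone) no longer
    does.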
     
  • If a resync of a RAID1 array with 2 devices finds a known bad block
    on one device, it will neither read from, nor write to, that device
    for this block offset.
    So there will be one read target (the other device) and zero write
    targets.
    This condition causes md/raid1 to abort the resync, assuming that it
    has finished - without known bad blocks this would be true.

    When there are no write targets because of the presence of bad blocks
    we should only skip over the area covered by the bad block.
    RAID10 already gets this right, raid1 doesn't. Or didn't.

    As this can cause a 'sync' to abort early and appear to have succeeded
    it could lead to some data corruption, so it is suitable for -stable.

    Cc: stable@vger.kernel.org
    Reported-by: Alexander Lyakas
    Signed-off-by: NeilBrown

    NeilBrown
     
  • do_md_stop tests mddev->openers while holding ->open_mutex,
    and fails if this count is too high.
    So callers do not need to check mddev->openers, and doing so isn't
    very meaningful: they don't hold ->open_mutex, so the number could
    change.

    So remove the unnecessary tests on mddev->openers.
    These are not called often enough for there to be any gain in
    an early test on ->open_mutex to avoid the need for a slightly more
    costly mutex_lock call.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Because bios are merged at the block layer, a bio error may be caused
    by another bio that was merged into the same request.
    Using this flag, we can find the exact error sector and avoid
    redundant operations like re-writes and re-reads.

    V0->V1: Use REQ_FLUSH instead of REQ_NOMERGE to avoid bio merging at
    the block layer.

    Signed-off-by: Jianpeng Ma
    Signed-off-by: NeilBrown

    majianpeng
     
  • For SSDs, if the request size exceeds a specific value (the optimal
    IO size), the request size isn't important for bandwidth. In such a
    condition, if making the request size bigger causes some disks to go
    idle, the total throughput will actually drop. A good example is
    doing a readahead in a two-disk raid1 setup.

    So when should we split big requests? We absolutely don't want to
    split big requests into very small requests. Even on an SSD, a big
    request transfer is more efficient. This patch only considers
    requests with a size above the optimal IO size.

    If all disks are busy, is it worth doing a split? Say the optimal IO
    size is 16k, with two 32k requests and two disks. We can let each
    disk run one 32k request, or split the requests into four 16k
    requests so each disk runs two. It's hard to say which case is
    better; it depends on the hardware.

    So only consider the case where there are idle disks. For readahead,
    a split is always better in this case. And in my test, the patch
    below can improve throughput by more than 30%. Hmm, not 100%,
    because the disk isn't 100% busy.

    Such a case can happen not just in readahead; for example, in direct
    IO. But I suppose direct IO usually has a bigger IO depth and makes
    all disks busy, so I ignored it.

    Note: if the raid uses any hard disk, we don't prevent merging. That
    would make performance worse.

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    Shaohua Li
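    The decision described above has three conditions. A hypothetical
    userspace sketch (the function name and parameters are stand-ins,
    not the patch's actual code):

```c
#include <assert.h>
#include <stdbool.h>

/* Split a large read only when: (a) no rotational disk is in the
 * array (merging still wins on hard disks), (b) the request exceeds
 * the optimal IO size (small requests merge better), and (c) at least
 * one disk is idle for the split to put to work. */
static bool should_split(bool any_rotational, unsigned int sectors,
                         unsigned int opt_io_sectors, int idle_disks)
{
    if (any_rotational)
        return false;
    if (sectors <= opt_io_sectors)
        return false;
    return idle_disks > 0;
}
```

    When all disks are already busy, the sketch declines to split,
    matching the commit's "it's hard to say which case is better"
    reasoning.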
     
  • An SSD has no spindle, so the distance between requests means
    nothing. And the original distance-based algorithm can sometimes
    cause severe performance issues for SSD raid.

    Consider two thread groups, one accessing file A, the other
    accessing file B. The first group will access one disk and the
    second will access the other disk, because requests are close
    together within a group and far apart between groups. In this case,
    read balance might keep one disk very busy while the other is
    relatively idle. For SSDs, we should try our best to distribute
    requests to as many disks as possible. There is no spindle-move
    penalty anyway.

    With the patch below, I can see more than 50% throughput improvement
    sometimes, depending on the workload.

    The only exception is small requests that can be merged into a big
    request, which typically drives higher throughput for SSDs too. Such
    small requests are sequential reads. Unlike on hard disks,
    sequential reads which can't be merged (for example direct IO, or
    reads without readahead) can be ignored for SSDs. Again, there is no
    spindle-move penalty. Readahead dispatches small requests, and such
    requests can be merged.

    The last patch helps detect sequential reads well, at least if the
    concurrent read count isn't greater than the raid disk count. In
    that case, the distance-based algorithm doesn't work well either.

    V2: For a mixed hard disk and SSD raid, don't use the distance-based
    algorithm for random IO either. This makes the algorithm generic for
    raids with SSDs.

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    Shaohua Li
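    The "idlest disk" policy above replaces head-distance with queue
    depth: pick the mirror with the fewest pending requests. A minimal
    sketch under that assumption (names are hypothetical):

```c
#include <assert.h>

/* Return the index of the mirror with the fewest pending requests,
 * spreading reads across all devices instead of clustering them on
 * the "closest" one. */
static int choose_idlest(const int *pending, int ndisks)
{
    int best = 0;

    for (int i = 1; i < ndisks; i++)
        if (pending[i] < pending[best])
            best = i;
    return best;
}
```

    In the two-thread-group scenario from the commit message, this keeps
    both disks loaded instead of leaving one relatively idle.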
     
  • Currently the sequential read detection is global. It's natural to
    make it per-disk, which can improve the detection for multiple
    concurrent sequential reads. And the next patch will make SSD read
    balance not use the distance-based algorithm, where this change
    helps detect truly sequential reads for SSDs.

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    Shaohua Li
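    Per-disk detection means each mirror remembers where its own last
    read ended. A hypothetical sketch of that bookkeeping (struct and
    field names are illustrative, not the patch's):

```c
#include <assert.h>
#include <stdbool.h>

/* Each mirror tracks the sector it expects a sequential read to
 * start at next. */
struct mirror_state {
    unsigned long long next_seq_sector;
};

/* A read is "sequential" for this disk if it starts exactly where the
 * disk's previous read ended; update the per-disk state either way. */
static bool is_sequential(struct mirror_state *m,
                          unsigned long long sector,
                          unsigned int sectors)
{
    bool seq = (sector == m->next_seq_sector);

    m->next_seq_sector = sector + sectors;
    return seq;
}
```

    With one such state per disk, two interleaved sequential streams on
    different mirrors no longer destroy each other's detection, which is
    what a single global tracker did.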
     
  • md/raid10: Export is_congested test.

    In similar fashion to commits
    11d8a6e3719519fbc0e2c9d61b6fa931b84bf813
    1ed7242e591af7e233234d483f12d33818b189d9
    we export the RAID10 congestion checking function so that dm-raid.c can
    make use of it and make use of the personality. The 'queue' and 'gendisk'
    structures will not be available to the MD code when device-mapper sets
    up the device, so we conditionalize access to these fields also.

    Signed-off-by: Jonathan Brassow
    Signed-off-by: NeilBrown

    Jonathan Brassow
     
  • MD RAID1/RAID10: Move some macros from .h file to .c file

    There are three macros (IO_BLOCKED, IO_MADE_GOOD, BIO_SPECIAL) which
    are defined in both raid1.h and raid10.h. They are only used in their
    respective .c files.
    However, if we wish to make RAID10 accessible to the device-mapper
    RAID target (dm-raid.c), then we need to move these macros into the
    .c files where they are used so that they do not conflict with each
    other.

    The macros from the two files are identical and could be moved into md.h, but
    I chose to leave the duplication and have them remain in the personality
    files.

    Signed-off-by: Jonathan Brassow
    Signed-off-by: NeilBrown

    Jonathan Brassow
     
  • MD RAID1: Rename the structure 'mirror_info' to 'raid1_info'

    The same structure name ('mirror_info') is used by raid10. Each of
    these structures is defined in its respective header file. If dm-raid
    is to support both RAID1 and RAID10, the header files will be
    included and the structure names must not collide. While only one of
    these structure names needs to change, this patch adds consistency
    to the naming of the structures.

    Signed-off-by: Jonathan Brassow
    Signed-off-by: NeilBrown

    Jonathan Brassow
     
  • MD RAID10: Rename the structure 'mirror_info' to 'raid10_info'

    The same structure name ('mirror_info') is used by raid1. Each of
    these structures is defined in its respective header file. If dm-raid
    is to support both RAID1 and RAID10, the header files will be
    included and the structure names must not collide.

    Signed-off-by: Jonathan Brassow
    Signed-off-by: NeilBrown

    Jonathan Brassow
     
  • MD RAID10: Fix compiler warning.

    Initialize variable to prevent compiler warning.

    Signed-off-by: Jonathan Brassow
    Signed-off-by: NeilBrown

    Jonathan Brassow
     

27 Jul, 2012

18 commits