08 Jan, 2012

1 commit

  • NFS might send us offsets that are not PAGE aligned. So
    we must read in the remainder of the first/last pages, in case
    we need it for parity calculations.

    We only add an sg segment to read the partial page. But
    we don't mark it as read=true because it is a lock-for-write
    page.
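
    As an illustration of the "remainder of the first/last pages"
    mentioned above, here is a minimal user-space sketch. It is not
    the kernel code: SKETCH_PAGE_SIZE, struct partial_pages and
    partial_pages_of() are hypothetical names, and a 4096-byte page
    is assumed.

    ```c
    #include <stdint.h>

    #define SKETCH_PAGE_SIZE 4096u  /* assumed page size, illustration only */

    /* Hypothetical helper type: bytes missing around an unaligned IO. */
    struct partial_pages {
        uint32_t head;  /* bytes before 'offset' inside the first page */
        uint32_t tail;  /* bytes after the IO end inside the last page */
    };

    /* Compute how much must be read so the IO covers whole pages. */
    static struct partial_pages partial_pages_of(uint64_t offset, uint64_t length)
    {
        struct partial_pages pp;
        uint64_t end = offset + length;

        pp.head = (uint32_t)(offset % SKETCH_PAGE_SIZE);
        pp.tail = (uint32_t)((SKETCH_PAGE_SIZE - end % SKETCH_PAGE_SIZE)
                             % SKETCH_PAGE_SIZE);
        return pp;
    }
    ```

    A page-aligned IO yields head == 0 and tail == 0, so no extra
    segment would be needed in that case.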

    TODO: In some cases (IO spans a single unit) we can just
    adjust the raid_unit offset/length, but this is left for
    later Kernels.

    [Bug in 3.2.0 Kernel]
    CC: Stable Tree
    Signed-off-by: Boaz Harrosh

    Boaz Harrosh
     

06 Jan, 2012

1 commit

  • When reading RAID5 files, in rare cases, we calculated too
    few sg segments. There should be two extra for the beginning
    and end partial units.

    Also "too few sg segments" should not be a BUG_ON. All the
    mechanics are in place to handle it as a short read.
    So just return -ENOMEM and the rest of the code will gracefully
    split the IO.
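
    The idea of failing softly instead of crashing can be sketched
    as follows. This is a simplified user-space illustration, not
    the ORE code: sgs_needed() and add_sg_segment() are hypothetical
    names, and the "+ 2" only mirrors the two extra segments
    described above.

    ```c
    #include <errno.h>

    /* Hypothetical: one segment per full unit, plus one each for a
     * partial unit at the beginning and at the end of the read. */
    static unsigned sgs_needed(unsigned full_units)
    {
        return full_units + 2;
    }

    /* Take one segment from a fixed-size array.
     * Returns 0 on success, -ENOMEM when the array is already full. */
    static int add_sg_segment(unsigned *used, unsigned max_sgs)
    {
        if (*used >= max_sgs)
            return -ENOMEM; /* no crash: the caller can do a short read */
        (*used)++;
        return 0;
    }
    ```

    A caller that sees -ENOMEM can stop adding segments, submit what
    it has, and continue with the rest in a following IO.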

    [Bug in 3.2.0 Kernel]
    CC: Stable Tree
    Signed-off-by: Boaz Harrosh

    Boaz Harrosh
     

25 Oct, 2011

2 commits

  • This is finally the RAID5 Write support.

    The bigger part of this patch is not the XOR engine itself, but the
    read4write logic, which is a complete mini prepare_for_striping
    reading engine that can read scattered pages of a stripe into cache
    so they can be used for XOR calculation. That is needed only if the
    write was not stripe aligned.
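
    To see why a partial-stripe write needs the other pages in
    memory, consider a plain XOR parity computation. The sketch
    below is a byte-wise user-space loop, assumed for illustration
    only; it is not the XOR library the patch uses, and xor_parity()
    is a hypothetical name.

    ```c
    #include <stddef.h>
    #include <stdint.h>

    /* XOR 'count' equally sized data buffers into 'parity'.
     * Every data buffer of the stripe must be present, which is why
     * pages not covered by the write have to be read first. */
    static void xor_parity(uint8_t *parity, const uint8_t *const *data,
                           size_t count, size_t len)
    {
        for (size_t i = 0; i < len; i++) {
            uint8_t p = 0;

            for (size_t d = 0; d < count; d++)
                p ^= data[d][i];
            parity[i] = p;
        }
    }
    ```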

    The main algorithm behind the XOR engine is the 2 dimensional array:
    struct __stripe_pages_2d.
    A drawing might save 1000 words
    ---

    __stripe_pages_2d
           |
     n = pages_in_stripe_unit;
     w = group_width - parity;
           |                      pages array presented to the XOR lib
           |                                    |
           V                                    |
     __1_page_stripe[0].pages --> [c0][c1]..[cw][c_par]
                                  [c0][c1]..[cw][c_par]
                                  [c0][c1]..[cw][c_par]
                                   ^
                                   |
                     data added columns first then row

    ---
    The pages are put on this array columns first, i.e.:
    p0-of-c0, p1-of-c0, ... pn-of-c0, p0-of-c1, ...
    So we are doing a corner turn of the pages.
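
    The corner turn can be sketched as an index mapping. This is an
    illustration under assumptions, not the patch's code: struct
    slot and slot_of_page() are hypothetical names, rows stand for
    the page index inside a unit and columns for the unit (c0, c1,
    ...).

    ```c
    /* Position of a page inside the 2D array. */
    struct slot {
        unsigned row;  /* page index inside its stripe unit (p0..pn) */
        unsigned col;  /* stripe unit the page belongs to (c0..cw)   */
    };

    /* 'i' is the sequential index of the page within the stripe data.
     * Sequential pages walk down a column before moving to the next. */
    static struct slot slot_of_page(unsigned i, unsigned pages_in_unit)
    {
        struct slot s = { i % pages_in_unit, i / pages_in_unit };

        return s;
    }
    ```

    Each row then holds one page from every unit, which is the shape
    an XOR over the stripe wants to consume.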

    Note that pages will zigzag down and left, but are put sequentially
    in growing order. So when the time comes to XOR the stripe, only the
    beginning and end of the array need be checked. We scan the array
    and any NULL spot will be filled by pages-to-be-read.
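
    Because the pages are placed sequentially, the holes can only
    sit at the two ends. A minimal sketch of such a scan, assuming
    the pages are viewed as one flat array in column-first order
    (struct gaps and find_gaps() are hypothetical names):

    ```c
    #include <stddef.h>

    /* Number of empty slots at the start and at the end of the array. */
    struct gaps {
        size_t head;
        size_t tail;
    };

    /* Count the leading and trailing NULL entries of 'pages'. */
    static struct gaps find_gaps(void *const *pages, size_t nr)
    {
        struct gaps g = { 0, 0 };

        while (g.head < nr && pages[g.head] == NULL)
            g.head++;
        while (g.tail < nr - g.head && pages[nr - 1 - g.tail] == NULL)
            g.tail++;
        return g;
    }
    ```

    The head and tail counts are the numbers of pages that would
    have to be read before the stripe can be XORed.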

    The FS that wants to support RAID5 needs to supply an
    operations-vector that looks up a given page in cache, and specifies
    if the page is uptodate or needs reading. All these pages to be read
    are put on a slave ore_io_state and synchronously read. All the pages
    of a stripe are read in one IO, using the scatter gather mechanism.
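
    The shape of such an operations-vector could look roughly like
    the sketch below. All names here (sketch_page, sketch_r4w_ops,
    collect_to_read, demo_get_page) are hypothetical and stand in
    for the real kernel types; the point is only the callback that
    returns a page together with an uptodate flag.

    ```c
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    struct sketch_page { bool uptodate; };  /* stand-in for a cache page */

    /* Hypothetical operations-vector supplied by the filesystem. */
    struct sketch_r4w_ops {
        struct sketch_page *(*get_page)(void *priv, uint64_t index,
                                        bool *uptodate);
    };

    /* Gather the pages in [first, first + nr) that still need reading.
     * Returns how many pages were stored in 'to_read'. */
    static size_t collect_to_read(const struct sketch_r4w_ops *ops, void *priv,
                                  uint64_t first, size_t nr,
                                  struct sketch_page **to_read)
    {
        size_t n = 0;

        for (size_t i = 0; i < nr; i++) {
            bool uptodate = false;
            struct sketch_page *pg = ops->get_page(priv, first + i, &uptodate);

            if (pg && !uptodate)
                to_read[n++] = pg;
        }
        return n;
    }

    /* Toy callback for the example: 'priv' is an array of sketch_page. */
    static struct sketch_page *demo_get_page(void *priv, uint64_t index,
                                             bool *uptodate)
    {
        struct sketch_page *cache = priv;

        *uptodate = cache[index].uptodate;
        return &cache[index];
    }
    ```

    In this sketch the collected pages would then be handed to a
    single read, matching the "one IO per stripe" description above.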

    In write we constrain our IO to only be incomplete on a single
    stripe. Meaning either the complete IO is within a single stripe, so
    we might have pages to read from both the beginning and the end of
    the stripe, or we have some reading to do at the beginning but end at
    a stripe boundary. The leftover pages are pushed to the next IO by
    the API already established by previous work, where an IO
    offset/length combination presented to the ORE might get the length
    truncated and the user must re-submit the leftover pages. (Both
    exofs and NFS support this)
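
    One possible reading of this truncate-and-resubmit contract is
    sketched below. It is an assumption made for illustration, not
    the ORE implementation: accepted_len() and count_submits() are
    hypothetical names, and the rule "cut at the last stripe
    boundary the IO crosses" is a simplification of the two cases
    described above.

    ```c
    #include <stdint.h>

    /* Length accepted for one IO: either the IO stays inside a single
     * stripe, or it is cut so that it ends on a stripe boundary. */
    static uint64_t accepted_len(uint64_t offset, uint64_t length,
                                 uint32_t stripe_size)
    {
        uint64_t end = offset + length;
        uint64_t last_boundary = end - end % stripe_size;

        if (last_boundary > offset)
            return last_boundary - offset;  /* ends at a stripe boundary */
        return length;                      /* lies within one stripe */
    }

    /* Caller side: keep re-submitting the leftover until all is written.
     * Returns the number of IOs that were needed. */
    static unsigned count_submits(uint64_t offset, uint64_t length,
                                  uint32_t stripe_size)
    {
        unsigned ios = 0;

        while (length) {
            uint64_t done = accepted_len(offset, length, stripe_size);

            offset += done;
            length -= done;
            ios++;
        }
        return ios;
    }
    ```

    With a stripe size of 1000, an IO at offset 2500 of length 3000
    is accepted up to 5000 and the remaining 500 bytes go into a
    second IO.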

    But any ORE user should make its best effort to align its IO
    beforehand and avoid complications. A cached ore_layout->stripe_size
    member can be used for that calculation. (NOTE: ORE demands
    that stripe_size must fit in 32 bits)

    What else? Well read it and tell me.

    Signed-off-by: Boaz Harrosh

    Boaz Harrosh
     
  • This patch introduces the first stage of RAID5 support,
    mainly the skip-over-raid-units when reading. For
    writes it inserts BLANK units where the XOR blocks
    should be calculated and written to.

    It introduces the new "general raid maths", and the main
    additional parameters and components needed for raid5.

    Since at this stage it could corrupt future versions that
    actually do support raid5, the enablement of raid5
    mounting and setting of parity-count > 0 is disabled. So
    the raid5 code will never be used. Mounting of raid5 is
    only enabled later once the basic XOR write is also in.
    But if the patch "enable RAID5" is applied, this code has
    been tested to be able to properly read raid5 volumes
    and is according to standard.

    Also it has been tested that the new maths still properly
    supports RAID0 and the grouping code just as before.
    (BTW: I have found more bugs in the pnfs-obj RAID math,
    fixed here)

    The ore.c file is getting too big, so new ore_raid.[hc]
    files are added that will include the special raid stuff
    that is not used in striping and mirrors. With future write
    support these will get bigger.
    When adding the ore_raid.c to Kbuild file I was forced to
    rename ore.ko to libore.ko. Is it possible to keep source
    file, say ore.c and module file ore.ko the same even if there
    are multiple files inside ore.ko?

    Signed-off-by: Boaz Harrosh

    Boaz Harrosh