18 Jun, 2009

16 commits

  • Currently, when we update the 'conf' structure while adding a
    drive to a linear array, we keep the old version around until
    the array is finally stopped, as it is not safe to free it
    immediately.

    Now that we have rcu protection on all accesses to 'conf',
    we can use call_rcu to free it more promptly.

    Signed-off-by: NeilBrown

    NeilBrown
     
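    The deferred-free pattern described above can be sketched in
    userspace C. This is only an analogue of the kernel's call_rcu():
    the struct, the pending-free list, and all function names here are
    illustrative stand-ins, and the "grace period" is simulated by an
    explicit reclaim step.

    ```c
    #include <assert.h>
    #include <stdlib.h>

    struct conf {
        int nr_disks;
        struct conf *next_free;   /* stands in for the embedded rcu_head */
    };

    static struct conf *pending_free;  /* queue of retired copies */

    /* analogue of call_rcu(): defer the free instead of freeing now */
    static void defer_free(struct conf *old)
    {
        old->next_free = pending_free;
        pending_free = old;
    }

    /* analogue of a grace period ending: now it is safe to free */
    static int reclaim(void)
    {
        int n = 0;
        while (pending_free) {
            struct conf *c = pending_free;
            pending_free = c->next_free;
            free(c);
            n++;
        }
        return n;
    }

    int main(void)
    {
        struct conf *conf = malloc(sizeof(*conf));
        conf->nr_disks = 2;

        /* grow the array: publish a new conf, retire the old one */
        struct conf *newconf = malloc(sizeof(*newconf));
        newconf->nr_disks = 3;
        defer_free(conf);
        conf = newconf;

        assert(conf->nr_disks == 3);
        assert(reclaim() == 1);   /* exactly one retired copy reclaimed */
        free(conf);
        return 0;
    }
    ```

    The point of the change is that readers may still hold the old
    pointer when the update happens, so the free must wait for them;
    call_rcu lets that happen promptly instead of at array stop.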
  • Due to the lack of memory ordering guarantees, we may have races around
    mddev->conf.

    In particular, the correct contents of the structure we get from
    dereferencing ->private might not be visible to this CPU yet, and
    they might not be correct w.r.t. mddev->raid_disks.

    This patch addresses the problem using rcu protection to avoid
    such race conditions.

    Signed-off-by: SandeepKsinha
    Signed-off-by: NeilBrown

    SandeepKsinha
     
  • If the superblock of a component device indicates the presence of a
    bitmap but the corresponding raid personality does not support bitmaps
    (raid0, linear, multipath, faulty), then something is seriously wrong
    and we'd better refuse to run such an array.

    Currently, this check is performed while the superblocks are examined,
    i.e. before entering personality code. Therefore the generic md layer
    must know which raid levels support bitmaps and which do not.

    This patch avoids this layer violation without adding identical code
    to various personalities. This is accomplished by introducing a new
    public function to md.c, md_check_no_bitmap(), which replaces the
    hard-coded checks in the superblock loading functions.

    A call to md_check_no_bitmap() is added to the ->run method of each
    personality which does not support bitmaps and assembly is aborted
    if at least one component device contains a bitmap.

    Signed-off-by: Andre Noll
    Signed-off-by: NeilBrown

    Andre Noll
     
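    The shape of the centralized check can be sketched as follows. This
    is not the kernel source: the structures, field names, and the
    check_no_bitmap() helper are simplified stand-ins for
    md_check_no_bitmap() and the rdev superblock fields.

    ```c
    #include <assert.h>

    struct rdev_sketch {
        long bitmap_offset;       /* non-zero: superblock recorded a bitmap */
    };

    struct mddev_sketch {
        const char *level;
        struct rdev_sketch *rdevs;
        int nr_rdevs;
    };

    /* analogue of md_check_no_bitmap(): called from the ->run method of
     * personalities without bitmap support; 0 = ok, -1 = refuse to run
     * (the kernel would return -EINVAL) */
    static int check_no_bitmap(const struct mddev_sketch *mddev)
    {
        for (int i = 0; i < mddev->nr_rdevs; i++)
            if (mddev->rdevs[i].bitmap_offset != 0)
                return -1;
        return 0;
    }

    int main(void)
    {
        struct rdev_sketch clean[2]   = { {0}, {0} };
        struct rdev_sketch tainted[2] = { {0}, {4096} };
        struct mddev_sketch ok  = { "raid0", clean, 2 };
        struct mddev_sketch bad = { "raid0", tainted, 2 };

        assert(check_no_bitmap(&ok) == 0);    /* assembly proceeds */
        assert(check_no_bitmap(&bad) != 0);   /* assembly aborted */
        return 0;
    }
    ```

    Because the helper lives in md.c but is called from personality
    code, the generic layer no longer needs a hard-coded list of which
    levels support bitmaps.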
  • It is easiest to round sizes to multiples of chunk size in
    the personality code for those personalities which care.
    Those personalities now do the rounding, so we can
    remove that function from common code.

    Also remove the upper bound on the size of a chunk, and the lower
    bound on the size of a device (1 chunk), neither of which really buy
    us anything.

    Signed-off-by: NeilBrown

    NeilBrown
     
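    The rounding the personalities now do themselves can be sketched
    like this, covering both cases: a power-of-2 chunk can use a mask,
    while the general case uses a remainder (the kernel would use
    sector_div for the latter). The helper name is illustrative.

    ```c
    #include <assert.h>

    typedef unsigned long long sector_t;   /* stand-in for the kernel type */

    /* round a device size down to a whole number of chunks */
    static sector_t round_to_chunk(sector_t sectors, unsigned int chunk_sects)
    {
        if ((chunk_sects & (chunk_sects - 1)) == 0)        /* power of 2 */
            return sectors & ~(sector_t)(chunk_sects - 1); /* mask */
        return sectors - sectors % chunk_sects;            /* remainder */
    }

    int main(void)
    {
        assert(round_to_chunk(1000, 64) == 960);   /* pow2 chunk: mask */
        assert(round_to_chunk(1000, 48) == 960);   /* general: remainder */
        assert(round_to_chunk(128, 64) == 128);    /* already aligned */
        return 0;
    }
    ```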
  • This is currently ensured by common code, but it is more reliable to
    ensure it where it is needed in personality code.
    All the other personalities that care already round the size to
    the chunk_size. raid0 and linear are the only hold-outs.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Currently the assignment to utime gets skipped for 'external'
    metadata. So move it to the top of the function so that it
    always takes effect.
    This is of largely cosmetic interest. Nothing actually depends
    on ->utime being right for external arrays.
    "mdadm --monitor" does use it for 0.90 and 1.x arrays, but with
    mdadm-3.0, this is not important for external metadata.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Currently, the md layer checks in analyze_sbs() if the raid level
    supports reconstruction (mddev->level >= 1) and if reconstruction is
    in progress (mddev->recovery_cp != MaxSector).

    Move that printk into the personality code of those raid levels that
    care (levels 1, 4, 5, 6, 10).

    Signed-off-by: Andre Noll
    Signed-off-by: NeilBrown

    Andre Noll
     
  • The difference between these two methods is artificial.
    Both check that a pending reshape is valid, and perform any
    aspect of it that can be done immediately.
    'reconfig' handles chunk size and layout.
    'check_reshape' handles raid_disks.

    So make them just one method.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Passing the new layout and chunksize as args is not necessary as
    the mddev has fields for new_chunk and new_layout.

    This is preparation for combining the check_reshape and reconfig
    methods.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • In reshape cases that do not change the number of devices,
    start_reshape is called without first calling check_reshape.

    Currently, the check that the stripe_cache is large enough is
    only done in check_reshape. It should be in start_reshape too.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Following the conversion to chunk_sectors, there is room
    for cleaning up a little.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • 1/ Raid5 has learned to take over also raid4 and raid6 arrays.
    2/ new_chunk in mdp_superblock_1 is in sectors, not bytes.

    Signed-off-by: NeilBrown

    Andre Noll
     
  • Signed-off-by: Andre Noll
    Signed-off-by: NeilBrown

    Andre Noll
     
  • This kills some more shifts.

    Signed-off-by: Andre Noll
    Signed-off-by: NeilBrown

    Andre Noll
     
  • A straight-forward conversion which gets rid of some
    multiplications/divisions/shifts. The patch also introduces a couple
    of new ones, most of which are due to conf->chunk_size still being
    represented in bytes. This will be cleaned up in subsequent patches.

    Signed-off-by: Andre Noll
    Signed-off-by: NeilBrown

    Andre Noll
     
  • This patch renames the chunk_size field to chunk_sectors with the
    implied change of semantics. Since

    is_power_of_2(chunk_size) = is_power_of_2(chunk_sectors << 9)
                              = is_power_of_2(chunk_sectors)

    these bits don't need an adjustment for the shift.

    Signed-off-by: Andre Noll
    Signed-off-by: NeilBrown

    Andre Noll
     
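    The identity the commit relies on can be checked directly: shifting
    left by 9 (the sector-to-byte conversion) never changes the number
    of set bits, so a chunk size in bytes is a power of two exactly when
    the same size in sectors is. The helper below mirrors the kernel's
    is_power_of_2().

    ```c
    #include <assert.h>

    typedef unsigned long long u64;

    static int is_power_of_2(u64 n)
    {
        return n != 0 && (n & (n - 1)) == 0;
    }

    int main(void)
    {
        for (u64 chunk_sectors = 1; chunk_sectors < 100000; chunk_sectors++) {
            u64 chunk_size = chunk_sectors << 9;   /* sectors -> bytes */
            assert(is_power_of_2(chunk_size) == is_power_of_2(chunk_sectors));
        }
        return 0;
    }
    ```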

16 Jun, 2009

21 commits

  • Maintain two flows, one for pow2 chunk sizes (which uses masks and
    shift), and a flow for the general case (which uses sector_div).
    This is for the sake of performance.

    - introduce map_sector and is_io_in_chunk_boundary to encapsulate
    those two flows better for raid0_make_request
    - fix blk_mergeable to support the two flows.

    Signed-off-by: raziebe@gmail.com
    Signed-off-by: NeilBrown

    raz ben yehuda
     
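    The two flows can be sketched as follows: for power-of-2 chunks the
    chunk index and in-chunk offset come from a shift and a mask, while
    the general case uses a division and remainder (which the kernel
    wraps in sector_div for 32-bit safety). The function and parameter
    names here are illustrative, not the kernel's map_sector.

    ```c
    #include <assert.h>

    typedef unsigned long long sector_t;

    static void map_chunk(sector_t sector, unsigned int chunk_sects,
                          sector_t *chunk, unsigned int *offset)
    {
        if ((chunk_sects & (chunk_sects - 1)) == 0) {   /* pow2: shift+mask */
            unsigned int shift = 0;
            while ((1u << shift) < chunk_sects)
                shift++;
            *chunk = sector >> shift;
            *offset = (unsigned int)(sector & (chunk_sects - 1));
        } else {                                        /* general: divide */
            *chunk = sector / chunk_sects;
            *offset = (unsigned int)(sector % chunk_sects);
        }
    }

    int main(void)
    {
        sector_t chunk; unsigned int off;

        map_chunk(1000, 64, &chunk, &off);      /* pow2 flow */
        assert(chunk == 15 && off == 40);

        map_chunk(1000, 48, &chunk, &off);      /* general flow */
        assert(chunk == 20 && off == 40);
        return 0;
    }
    ```

    Keeping the pow2 flow separate preserves the old fast path while
    the sector_div flow makes arbitrary chunk sizes work at all.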
  • Remove chunk size check from md as this is now performed in the run
    function in each personality.

    Replace chunk size power 2 code calculations by a regular division.

    Signed-off-by: raziebe@gmail.com
    Signed-off-by: NeilBrown

    raz ben yehuda
     
  • have raid5 check chunk size in run/reshape method instead of in md

    Signed-off-by: raziebe@gmail.com
    Signed-off-by: NeilBrown

    raz ben yehuda
     
  • have raid10 check chunk size in run method instead of in md

    Signed-off-by: raziebe@gmail.com
    Signed-off-by: NeilBrown

    raz ben yehuda
     
  • have raid0 check chunk size in run method instead of in md.
    This is part of a series moving the checks from common code to
    the personalities where they belong.

    hardsect is short and chunksize is an int, so it is safe to use %.

    Signed-off-by: raziebe@gmail.com
    Signed-off-by: NeilBrown

    raz ben yehuda
     
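    The divisibility check described above can be sketched as plain C.
    The point of the "hardsect is short and chunksize is an int" remark
    is that no 64-bit division helper is needed, so '%' is safe. The
    helper name is illustrative.

    ```c
    #include <assert.h>

    /* does the chunk size divide evenly by the device's hardware
     * sector size? 1 = ok, 0 = reject the array in ->run */
    static int chunk_size_valid(int chunk_size, short hardsect)
    {
        if (chunk_size <= 0 || hardsect <= 0)
            return 0;
        return chunk_size % hardsect == 0;
    }

    int main(void)
    {
        assert(chunk_size_valid(65536, 512));    /* 64k chunks, 512b sectors */
        assert(!chunk_size_valid(65536, 4095));  /* does not divide evenly */
        return 0;
    }
    ```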
  • Report to the user what are the raid zones

    Signed-off-by: raziebe@gmail.com
    Signed-off-by: NeilBrown

    raz ben yehuda
     
  • Because of the removal of the device list from the strip zones,
    raid0 did not compile with the MD_DEBUG flag on.

    Signed-off-by: NeilBrown

    raz ben yehuda
     
  • Replace the linear search with binary search in which_dev.

    Signed-off-by: Sandeep K Sinha
    Signed-off-by: NeilBrown

    Sandeep K Sinha
     
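    The binary search over end_sector can be sketched as follows: find
    the first device whose end_sector exceeds the requested sector.
    The struct and names are simplified stand-ins for the linear
    driver's dev_info and which_dev.

    ```c
    #include <assert.h>

    typedef unsigned long long sector_t;

    struct dev_info_sketch {
        sector_t end_sector;       /* first sector past this device */
    };

    static int which_dev(const struct dev_info_sketch *devs, int nr,
                         sector_t sector)
    {
        int lo = 0, hi = nr - 1;

        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;
            if (sector < devs[mid].end_sector)
                hi = mid;
            else
                lo = mid + 1;
        }
        return lo;
    }

    int main(void)
    {
        /* three devices covering sectors [0,100), [100,250), [250,400) */
        struct dev_info_sketch devs[3] = { {100}, {250}, {400} };

        assert(which_dev(devs, 3, 0) == 0);
        assert(which_dev(devs, 3, 99) == 0);
        assert(which_dev(devs, 3, 100) == 1);
        assert(which_dev(devs, 3, 399) == 2);
        return 0;
    }
    ```

    Storing end_sector rather than start + num_sectors is what makes
    the comparison in the loop a single test.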
  • Remove num_sectors from dev_info and replace start_sector with
    end_sector. This makes a lot of comparisons much simpler.

    Signed-off-by: Sandeep K Sinha
    Signed-off-by: NeilBrown

    Sandeep K Sinha
     
  • Get rid of sector_div and hash table for linear raid and replace
    with a linear search in which_dev.
    The hash table adds a lot of complexity for little if any gain.
    Ultimately a binary search will be used which will have smaller
    cache foot print, a similar number of memory access, and no
    divisions.

    Signed-off-by: Sandeep K Sinha
    Signed-off-by: NeilBrown

    Sandeep K Sinha
     
  • Having a macro just to cast a void* isn't really helpful.
    I would much rather see that we are simply dereferencing ->private
    than have to know what the macro does.

    So open code the macro everywhere and remove the pointless cast.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • This setting doesn't seem to make sense (half the chunk size??) and
    shouldn't be needed.
    The segment boundary exported by raid0 should simply be the minimum
    of the segment boundary of all component devices. And we already
    get that right.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • If we treat conf->devlist more like a 2 dimensional array,
    we can get the devlist for a particular zone simply by indexing
    that array, so we don't need to store the pointers to subarrays
    in strip_zone. This makes strip_zone smaller and so (hopefully)
    searches faster.

    Signed-off-by: NeilBrown

    NeilBrown
     
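    The layout change can be sketched like this: instead of each
    strip_zone keeping a pointer to its slice of devices, the slice is
    found by treating conf->devlist as a zones x raid_disks 2-D array
    and indexing with the zone number. The names and types below are
    simplified stand-ins.

    ```c
    #include <assert.h>

    #define RAID_DISKS 4

    static int devlist[2 * RAID_DISKS] = {   /* 2 zones, 4 disks per zone */
        10, 11, 12, 13,                      /* zone 0 */
        20, 21, 22, 23,                      /* zone 1 */
    };

    /* no per-zone pointer stored anywhere: just index the flat array */
    static int *zone_devs(int *devlist, int zone, int raid_disks)
    {
        return devlist + zone * raid_disks;
    }

    int main(void)
    {
        assert(zone_devs(devlist, 0, RAID_DISKS)[0] == 10);
        assert(zone_devs(devlist, 1, RAID_DISKS)[2] == 22);
        return 0;
    }
    ```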
  • Storing ->sectors is redundant, as it can be computed from the
    difference z->zone_end - (z-1)->zone_end.

    The one place where it is used, it is just as efficient to use
    a zone_end value instead.

    And removing it makes strip_zone smaller, so the array of these that
    is searched on every request has a better chance to stay in cache.

    So discard the field and get the value from elsewhere.

    Signed-off-by: NeilBrown

    NeilBrown
     
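    The replacement computation is simple enough to show directly: once
    each strip_zone records only zone_end, a zone's sector count is the
    difference from the previous zone's end, with zone 0 starting at
    sector 0. The helper name is illustrative.

    ```c
    #include <assert.h>

    typedef unsigned long long sector_t;

    static sector_t zone_sectors(const sector_t *zone_end, int zone)
    {
        return zone ? zone_end[zone] - zone_end[zone - 1] : zone_end[0];
    }

    int main(void)
    {
        sector_t zone_end[3] = { 400, 700, 850 };

        assert(zone_sectors(zone_end, 0) == 400);
        assert(zone_sectors(zone_end, 1) == 300);
        assert(zone_sectors(zone_end, 2) == 150);
        return 0;
    }
    ```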
  • raid0_stop() removes all references to the raid0 configuration but
    fails to free the ->devlist buffer.

    This patch closes this leak, removes a pointless initialization and
    fixes a coding style issue in raid0_stop().

    Signed-off-by: Andre Noll
    Signed-off-by: NeilBrown

    Andre Noll
     
  • Currently the raid0 configuration is allocated in raid0_run() while
    the buffers for the strip_zone and the dev_list arrays are allocated
    in create_strip_zones(). On errors, all three buffers are freed
    in raid0_run().

    It's easier and more readable to do the allocation and cleanup within
    a single function. So move that code into create_strip_zones().

    Signed-off-by: Andre Noll
    Signed-off-by: NeilBrown

    Andre Noll
     
  • Currently raid0_run() always returns -ENOMEM on errors. This is
    incorrect as running the array might fail for other reasons, for
    example because not all component devices were available.

    This patch changes create_strip_zones() so that it returns a proper
    error code (either -ENOMEM or -EINVAL) rather than 1 on errors and
    makes raid0_run(), its single caller, return that value instead
    of -ENOMEM.

    Signed-off-by: Andre Noll
    Signed-off-by: NeilBrown

    Andre Noll
     
  • The "sector_shift" and "spacing" fields of struct raid0_private_data
    were only used for the hash table lookups. So the removal of the
    hash table allows us to get rid of these fields as well, which
    simplifies create_strip_zones() and raid0_run() quite a bit.

    Signed-off-by: Andre Noll
    Signed-off-by: NeilBrown

    Andre Noll
     
  • The raid0 hash table has become unused due to the changes in the
    previous patch. This patch removes the hash table allocation and
    setup code and kills the hash_table field of struct raid0_private_data.

    Signed-off-by: Andre Noll
    Signed-off-by: NeilBrown

    Andre Noll
     
  • 1/ remove current_start. The same value is available in
    zone->dev_start and storing it separately doesn't gain anything.
    2/ rename curr_zone_start to curr_zone_end as we are now more
    focused on the 'end' of each zone. We end up storing the
    same number, though - the old name was a little confusing
    (and what does 'current' mean in this context anyway?).

    Signed-off-by: NeilBrown

    NeilBrown
     
  • The number of strip_zones of a raid0 array is bounded by the number of
    drives in the array and is in fact much smaller for typical setups. For
    example, any raid0 array containing identical disks will have only
    a single strip_zone.

    Therefore, the hash tables which are used for quickly finding the
    strip_zone that holds a particular sector are of questionable value
    and add quite a bit of unnecessary complexity.

    This patch replaces the hash table lookup by equivalent code which
    simply loops over all strip zones to find the zone that holds the
    given sector.

    In order to make this loop as fast as possible, the zone->start field
    of struct strip_zone has been renamed to zone_end, and it now stores
    the beginning of the next zone in sectors. This allows us to save one
    addition in the loop.

    Subsequent cleanup patches will remove the hash table structure.

    Signed-off-by: Andre Noll
    Signed-off-by: NeilBrown

    Andre Noll
     
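    The loop that replaced the hash lookup can be sketched as follows.
    Because each zone stores the start of the *next* zone in zone_end,
    finding the zone for a sector needs one comparison per zone and no
    addition inside the loop. The names below are simplified stand-ins.

    ```c
    #include <assert.h>

    typedef unsigned long long sector_t;

    static int find_zone(const sector_t *zone_end, int nr_zones,
                         sector_t sector)
    {
        for (int i = 0; i < nr_zones; i++)
            if (sector < zone_end[i])
                return i;
        return -1;                 /* past the end of the array */
    }

    int main(void)
    {
        sector_t zone_end[3] = { 400, 700, 850 };

        assert(find_zone(zone_end, 3, 0) == 0);
        assert(find_zone(zone_end, 3, 400) == 1);
        assert(find_zone(zone_end, 3, 849) == 2);
        assert(find_zone(zone_end, 3, 850) == -1);
        return 0;
    }
    ```

    Since the number of zones is bounded by the number of drives and is
    usually one or two, this scan is typically as fast as, and far
    simpler than, the hash lookup it replaces.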

15 Jun, 2009

3 commits

  • * 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild-next: (53 commits)
    .gitignore: ignore *.lzma files
    kbuild: add generic --set-str option to scripts/config
    kbuild: simplify argument loop in scripts/config
    kbuild: handle non-existing options in scripts/config
    kallsyms: generalize text region handling
    kallsyms: support kernel symbols in Blackfin on-chip memory
    documentation: make version fix
    kbuild: fix a compile warning
    gitignore: Add GNU GLOBAL files to top .gitignore
    kbuild: fix delay in setlocalversion on readonly source
    README: fix misleading pointer to the defconf directory
    vmlinux.lds.h update
    kernel-doc: cleanup perl script
    Improve vmlinux.lds.h support for arch specific linker scripts
    kbuild: fix headers_exports with boolean expression
    kbuild/headers_check: refine extern check
    kbuild: fix "Argument list too long" error for "make headers_check",
    ignore *.patch files
    Remove bashisms from scripts
    menu: fix embedded menu presentation
    ...

    Linus Torvalds
     
  • Signed-off-by: Sam Ravnborg

    Arne Janbu
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
    mlx4_core: Don't double-free IRQs when falling back from MSI-X to INTx
    IB/mthca: Don't double-free IRQs when falling back from MSI-X to INTx
    IB/mlx4: Add strong ordering to local inval and fast reg work requests
    IB/ehca: Remove superfluous bitmasks from QP control block
    RDMA/cxgb3: Limit fast register size based on T3 limitations
    RDMA/cxgb3: Report correct port state and MTU
    mlx4_core: Add module parameter for number of MTTs per segment
    IB/mthca: Add module parameter for number of MTTs per segment
    RDMA/nes: Fix off-by-one bugs in reset_adapter_ne020() and init_serdes()
    infiniband: Remove void casts
    IB/ehca: Increment version number
    IB/ehca: Remove unnecessary memory operations for userspace queue pairs
    IB/ehca: Fall back to vmalloc() for big allocations
    IB/ehca: Replace vmalloc() with kmalloc() for queue allocation

    Linus Torvalds