14 Jan, 2011

40 commits

  • * 'stable/gntdev' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
    xen/p2m: Fix module linking error.
    xen p2m: clear the old pte when adding a page to m2p_override
    xen gntdev: use gnttab_map_refs and gnttab_unmap_refs
    xen: introduce gnttab_map_refs and gnttab_unmap_refs
    xen p2m: transparently change the p2m mappings in the m2p override
    xen/gntdev: Fix circular locking dependency
    xen/gntdev: stop using "token" argument
    xen: gntdev: move use of GNTMAP_contains_pte next to the map_op
    xen: add m2p override mechanism
    xen: move p2m handling to separate file
    xen/gntdev: add VM_PFNMAP to vma
    xen/gntdev: allow usermode to map granted pages
    xen: define gnttab_set_map_op/unmap_op

    Fix up trivial conflict in drivers/xen/Kconfig

    Linus Torvalds
     
  • * 'stable/platform-pci-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
    xen-platform: Fix compile errors if CONFIG_PCI is not enabled.
    xen: rename platform-pci module to xen-platform-pci.
    xen-platform: use PCI interfaces to request IO and MEM resources.

    Linus Torvalds
     
  • Add hugepage statistics to per-node sysfs meminfo

    Reviewed-by: Rik van Riel
    Signed-off-by: David Rientjes
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • When working in RS485 mode, the atmel_serial driver keeps RTS high after
    the initialization of the serial port. It goes low only after the first
    character has been sent.

    [akpm@linux-foundation.org: simplify code]
    Signed-off-by: Claudio Scordino
    Signed-off-by: Arkadiusz Bubala
    Tested-by: Arkadiusz Bubala
    Cc: Nicolas Ferre
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Claudio Scordino
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm: (32 commits)
    dm: raid456 basic support
    dm: per target unplug callback support
    dm: introduce target callbacks and congestion callback
    dm mpath: delay activate_path retry on SCSI_DH_RETRY
    dm: remove superfluous irq disablement in dm_request_fn
    dm log: use PTR_ERR value instead of ENOMEM
    dm snapshot: avoid storing private suspended state
    dm snapshot: persistent make metadata_wq multithreaded
    dm: use non reentrant workqueues if equivalent
    dm: convert workqueues to alloc_ordered
    dm stripe: switch from local workqueue to system_wq
    dm: dont use flush_scheduled_work
    dm snapshot: remove unused dm_snapshot queued_bios_work
    dm ioctl: suppress needless warning messages
    dm crypt: add loop aes iv generator
    dm crypt: add multi key capability
    dm crypt: add post iv call to iv generator
    dm crypt: use io thread for reads only if mempool exhausted
    dm crypt: scale to multiple cpus
    dm crypt: simplify compatible table output
    ...

    Linus Torvalds
     
  • * 'for-linus' of git://neil.brown.name/md:
    md: Fix removal of extra drives when converting RAID6 to RAID5
    md: range check slot number when manually adding a spare.
    md/raid5: handle manually-added spares in start_reshape.
    md: fix sync_completed reporting for very large drives (>2TB)
    md: allow suspend_lo and suspend_hi to decrease as well as increase.
    md: Don't let implementation detail of curr_resync leak out through sysfs.
    md: separate meta and data devs
    md-new-param-to_sync_page_io
    md-new-param-to-calc_dev_sboffset
    md: Be more careful about clearing flags bit in ->recovery
    md: md_stop_writes requires mddev_lock.
    md/raid5: use sysfs_notify_dirent_safe to avoid NULL pointer
    md: Ensure no IO request to get md device before it is properly initialised.
    md: Fix single printks with multiple KERN_s
    md: fix regression resulting in delays in clearing bits in a bitmap
    md: fix regression with re-adding devices to arrays with no metadata

    Linus Torvalds
     
  • When a RAID6 is converted to a RAID5, the extra drive should
    be discarded. However it isn't due to a typo in a comparison.

    This bug was introduced in commit e93f68a1fc6 in 2.6.35-rc4
    and is suitable for any -stable since then.

    As the extra drive is not removed, the 'degraded' counter is wrong and
    so the RAID5 will not respond correctly to a subsequent failure.

    Cc: stable@kernel.org
    Signed-off-by: NeilBrown

    NeilBrown
     
  • When adding a spare to an active array, we should check the slot
    number, but allow it to be larger than raid_disks if a reshape
    is being prepared.

    Apply the same test when adding a device to an
    array-under-construction. It already had most of the test in place,
    but not quite all.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • It is possible to manually add spares to specific slots before
    starting a reshape.
    raid5_start_reshape should recognise this possibility and include
    it in the accounting.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • The values exported in the sync_completed file are unsigned long, which
    overflows with very large drives, resulting in wrong values reported.

    Since sync_completed uses sectors as unit, we'll start getting wrong
    values with components larger than 2TB.

    This patch simply replaces the use of unsigned long by unsigned long long.
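
    The truncation can be illustrated outside the kernel. The sketch below
    assumes a platform where unsigned long is 32 bits wide and models it
    with uint32_t; the function names are made up for this illustration
    and are not taken from the md code.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512u   /* sync_completed counts 512-byte sectors */

/* Convert a component size in bytes to sectors (must be whole sectors). */
uint64_t bytes_to_sectors(uint64_t bytes)
{
    assert(bytes % SECTOR_SIZE == 0);
    return bytes / SECTOR_SIZE;
}

/* What a 32-bit counter would report: the high bits are silently lost. */
uint32_t report_32bit(uint64_t sectors)
{
    return (uint32_t)sectors;
}

/* What a 64-bit counter reports: the value survives intact. */
unsigned long long report_64bit(uint64_t sectors)
{
    return sectors;
}
```

    A 2 TiB component is exactly 2^32 sectors, so the 32-bit version wraps
    to 0 at that size, while the 64-bit version keeps the full count.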

    Signed-off-by: Rémi Rérolle
    Signed-off-by: NeilBrown

    Rémi Rérolle
     
  • The sysfs attributes 'suspend_lo' and 'suspend_hi' describe a region
    to which reads/writes are suspended so that the underlying data can be
    manipulated without user-space noticing.
    Currently the window they describe can only move forwards along the
    device. However this is an unnecessary restriction which will cause
    problems with planned developments.
    So relax this restriction and allow these endpoints to move
    arbitrarily.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • mddev->curr_resync has artificial values of '1' and '2' which are used
    by the code which ensures only one resync is happening at a time on
    any given device.

    These values are internal and should never be exposed to user-space
    (except when translated appropriately as in the 'pending' status in
    /proc/mdstat).

    Unfortunately they are exposed, as ->curr_resync is assigned to
    ->curr_resync_completed and that value is directly visible through
    sysfs.

    So change the assignments to ->curr_resync_completed to get the same
    value from elsewhere in a form that doesn't have the magic '1' or '2'
    values.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Allow the metadata to be on a separate device from the
    data.

    This doesn't mean the data and metadata will be on separate
    physical devices - it simply gives device-mapper and userspace
    tools more flexibility.

    Signed-off-by: NeilBrown

    Jonathan Brassow
     
  • Add new parameter to 'sync_page_io'.

    The new parameter allows us to distinguish between metadata and data
    operations. This becomes important later when we add the ability to
    use separate devices for data and metadata.

    Signed-off-by: Jonathan Brassow

    Jonathan Brassow
     
  • When we allow for separate devices for data and metadata
    in a later patch, we will need to be able to calculate
    the superblock offset based on more than the bdev.

    Signed-off-by: Jonathan Brassow

    Jonathan Brassow
     
  • Setting ->recovery to 0 is generally not a good idea as it could clear
    bits that shouldn't be cleared. In particular, MD_RECOVERY_FROZEN
    should only be cleared on explicit request from user-space.

    So when we need to clear things, just clear the bits that need
    clearing.

    As there are a few different places which reap a resync process - and
    some do an incomplete job - factor out the code for doing this from
    md_check_recovery and call that function instead of open coding part
    of it.
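
    The difference between zeroing the field and clearing selected bits can
    be sketched as follows. The bit positions are invented for this
    illustration (the real definitions live in the md headers), and
    reap_sync_flags() is a hypothetical helper, not the function the patch
    adds.

```c
#include <assert.h>

/* Illustrative bit masks only; not the kernel's actual values. */
#define MD_RECOVERY_RUNNING (1UL << 0)
#define MD_RECOVERY_SYNC    (1UL << 1)
#define MD_RECOVERY_FROZEN  (1UL << 2)

/* Clear only the bits that describe the finished resync. */
unsigned long reap_sync_flags(unsigned long recovery)
{
    unsigned long out = recovery & ~(MD_RECOVERY_RUNNING | MD_RECOVERY_SYNC);

    /* FROZEN must pass through unchanged: only user-space may clear it. */
    assert((out & MD_RECOVERY_FROZEN) == (recovery & MD_RECOVERY_FROZEN));
    return out;
}
```

    Assigning 0 instead would also drop MD_RECOVERY_FROZEN, which is the
    problem described above.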

    Signed-off-by: NeilBrown
    Reported-by: Jonathan Brassow

    NeilBrown
     
  • As md_stop_writes manipulates the sync_thread and calls md_update_sb,
    it needs to be called with mddev_lock held.

    In all internal cases it is, but the symbol is exported for dm-raid to
    call and in that case the lock won't be held.
    So make an exported version which takes the lock, and an internal
    version which does not.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • With the module parameter 'start_dirty_degraded' set,
    raid5_spare_active() previously called sysfs_notify_dirent() with a NULL
    argument (rdev->sysfs_state) when a rebuild finished.

    Signed-off-by: Jonathan Brassow
    Signed-off-by: Mike Snitzer

    Jonathan Brassow
     
  • When an md device is in the process of coming on line it is possible
    for an IO request (typically a partition table probe) to get through
    before the array is fully initialised, which can cause unexpected
    behaviour (e.g. a crash).

    So explicitly record when the array is ready for IO and don't allow IO
    through until then.

    There is no possibility for a similar problem when the array is going
    off-line as there must only be one 'open' at that time, and it is busy
    off-lining the array and so cannot send IO requests. So no memory
    barrier is needed in md_stop()

    This has been a bug since commit 409c57f3801 in 2.6.30 which
    introduced md_make_request. Before then, each personality would
    register its own make_request_fn when it was ready.
    This is suitable for any stable kernel from 2.6.30.y onwards.

    Cc:
    Signed-off-by: NeilBrown
    Reported-by: "Hawrylewicz Czarnowski, Przemyslaw"

    NeilBrown
     
  • Noticed-by: Russell King
    Signed-off-by: Joe Perches
    Signed-off-by: NeilBrown

    Joe Perches
     
  • commit 589a594be1fb (2.6.37-rc4) fixed a problem where md_thread would
    sometimes call the ->run function at a bad time.

    If an error is detected during array start up after the md_thread has
    been started, the md_thread is killed. This resulted in the ->run
    function being called once. However the array may not be in a state
    that it is safe to call ->run.

    However the fix imposed meant that ->run was not called on a timeout.
    This means that when an array goes idle, bitmap bits do not get
    cleared promptly. While the array is busy the bits will still be
    cleared when appropriate so this is not very serious. There is no
    risk to data.

    Change the test so that we only avoid calling ->run when the thread
    is being stopped. This more explicitly addresses the problem situation.

    This is suitable for 2.6.37-stable and any -stable kernel to which
    589a594be1fb was applied.

    Cc: stable@kernel.org
    Signed-off-by: NeilBrown

    NeilBrown
     
  • This patch is the skeleton for the DM target that will be
    the bridge from DM to MD (initially RAID456 and later RAID1). It
    provides a way to use device-mapper interfaces to the MD RAID456
    drivers.

    As with all device-mapper targets, the nominal public interfaces are the
    constructor (CTR) tables and the status outputs (both STATUSTYPE_INFO
    and STATUSTYPE_TABLE). The CTR table looks like the following:

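    1: <s> <l> raid \
    2:      <raid_type> <#raid_params> <raid_params> \
    3:      <#raid_devs> <meta_dev1> <dev1> .. <meta_devN> <devN>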

    Line 1 contains the standard first three arguments to any device-mapper
    target - the start, length, and target type fields. The target type in
    this case is "raid".

    Line 2 contains the arguments that define the particular raid
    type/personality/level, the required arguments for that raid type, and
    any optional arguments. Possible raid types include: raid4, raid5_la,
    raid5_ls, raid5_rs, raid6_zr, raid6_nr, and raid6_nc. (again, raid1 is
    planned for the future.) The list of required and optional parameters
    is the same for all the current raid types. The required parameters are
    positional, while the optional parameters are given as key/value pairs.
    The possible parameters are as follows:
    <chunk_size>                       Chunk size in sectors.
    [[no]sync]                         Force/Prevent RAID initialization
    [rebuild <idx>]                    Rebuild the drive indicated by the index
    [daemon_sleep <ms>]                Time between bitmap daemon work to clear bits
    [min_recovery_rate <kB/sec/disk>]  Throttle RAID initialization
    [max_recovery_rate <kB/sec/disk>]  Throttle RAID initialization
    [max_write_behind <value>]         See '-write-behind=' (man mdadm)
    [stripe_cache <sectors>]           Stripe cache size for higher RAIDs

    Line 3 contains the list of devices that compose the array in
    metadata/data device pairs. If no separate metadata device is used, a '-'
    is given for the metadata device position. If a drive has failed or is
    missing at creation time, a '-' can be given for both the metadata and
    data drives for a given position.

    Examples:
    # RAID4 - 4 data drives, 1 parity
    # No metadata devices specified to hold superblock/bitmap info
    # Chunk size of 1MiB
    # (Lines separated for easy reading)
    0 1960893648 raid \
    raid4 1 2048 \
    5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81

    # RAID4 - 4 data drives, 1 parity (no metadata devices)
    # Chunk size of 1MiB, force RAID initialization,
    # min recovery rate at 20 kiB/sec/disk
    0 1960893648 raid \
    raid4 4 2048 min_recovery_rate 20 sync\
    5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81

    Performing a 'dmsetup table' should display the CTR table used to
    construct the mapping (with possible reordering of optional
    parameters).
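
    As a rough sketch, the first example table above could be assembled on
    a single line like this. build_raid_table() is a hypothetical userspace
    helper written for this illustration; it only handles the case with one
    raid parameter (the chunk size) and no separate metadata devices.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build a one-line "raid" target table for devices without a separate
 * metadata device ('-' in every metadata position).  Returns the number
 * of characters written (excluding the NUL), or -1 if buf is too small. */
int build_raid_table(char *buf, size_t len,
                     unsigned long long start, unsigned long long length,
                     const char *raid_type, unsigned chunk_sectors,
                     const char *const *devs, unsigned ndevs)
{
    assert(buf != NULL && len > 0);

    /* "1" is the number of raid parameters: just the chunk size. */
    int n = snprintf(buf, len, "%llu %llu raid %s 1 %u %u",
                     start, length, raid_type, chunk_sectors, ndevs);
    if (n < 0 || (size_t)n >= len)
        return -1;

    for (unsigned i = 0; i < ndevs; i++) {
        size_t room = len - (size_t)n;
        int m = snprintf(buf + n, room, " - %s", devs[i]);
        if (m < 0 || (size_t)m >= room)
            return -1;
        n += m;
    }
    return n;
}
```

    The resulting string could then be handed to 'dmsetup create --table';
    a chunk size of 2048 sectors corresponds to 1MiB with 512-byte sectors.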

    Performing a 'dmsetup status' will yield information on the state and
    health of the array. The output is as follows:
    1: <s> <l> raid \
    2:      <raid_type> <#devices> <1 health char for each dev> <resync_ratio>

    Line 1 is standard DM output. Line 2 is best shown by example:
    0 1960893648 raid raid4 5 AAAAA 2/490221568
    Here we can see the RAID type is raid4, there are 5 devices - all of
    which are 'A'live, and the array is 2/490221568 complete with recovery.
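
    A userspace tool could pick such a status line apart roughly as shown
    below. The struct and function names are invented for this sketch, and
    the fixed buffer sizes are arbitrary assumptions rather than limits
    defined by the target.

```c
#include <assert.h>
#include <stdio.h>

struct raid_status {
    unsigned long long start, length;
    char raid_type[16];
    unsigned ndevs;
    char health[64];                 /* one character per device */
    unsigned long long done, total;  /* the two halves of the resync ratio */
};

/* Parse one status line; returns 0 on success, -1 on malformed input. */
int parse_raid_status(const char *line, struct raid_status *st)
{
    assert(line != NULL && st != NULL);

    int n = sscanf(line, "%llu %llu raid %15s %u %63s %llu/%llu",
                   &st->start, &st->length, st->raid_type,
                   &st->ndevs, st->health, &st->done, &st->total);
    return n == 7 ? 0 : -1;
}

/* Count the devices whose health character is 'A' (alive). */
unsigned count_alive(const struct raid_status *st)
{
    unsigned alive = 0;

    for (const char *p = st->health; *p != '\0'; p++)
        if (*p == 'A')
            alive++;
    return alive;
}
```

    Comparing count_alive() with ndevs gives a quick check on whether every
    device in the array is reported alive.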

    Cc: linux-raid@vger.kernel.org
    Signed-off-by: NeilBrown
    Signed-off-by: Jonathan Brassow
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    NeilBrown
     
  • Add per-target unplug callback support.

    Cc: linux-raid@vger.kernel.org
    Signed-off-by: NeilBrown
    Signed-off-by: Jonathan Brassow
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    NeilBrown
     
  • DM currently implements congestion checking by checking on congestion
    in each component device. For raid456 we need to also check if the
    stripe cache is congested.

    Add per-target congestion checker callback support.

    Extending the target_callbacks structure with additional callback
    functions allows for establishing multiple callbacks per-target (a
    callback is also needed for unplug).

    Cc: linux-raid@vger.kernel.org
    Signed-off-by: NeilBrown
    Signed-off-by: Jonathan Brassow
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    NeilBrown
     
  • This patch adds a user-configurable 'pg_init_delay_msecs' feature. Use
    this feature to specify the number of milliseconds to delay before
    retrying scsi_dh_activate, when SCSI_DH_RETRY is returned.

    SCSI Device Handlers return SCSI_DH_IMM_RETRY if we could retry
    activation immediately and SCSI_DH_RETRY in cases where it is better to
    retry after some delay.

    Currently we immediately retry scsi_dh_activate irrespective of
    SCSI_DH_IMM_RETRY and SCSI_DH_RETRY.

    The 'pg_init_delay_msecs' feature may be provided during table create or
    load, e.g.:
    dmsetup create --table "0 20971520 multipath 3 queue_if_no_path \
    pg_init_delay_msecs 2500 ..." mpatha

    The default for 'pg_init_delay_msecs' is 2000 milliseconds.
    Maximum configurable delay is 60000 milliseconds. Specifying a
    'pg_init_delay_msecs' of 0 will cause immediate retry.
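
    The retry policy described above can be summarised in a few lines. The
    enum below is a stand-in for the device handler return codes and
    retry_delay_ms() is a hypothetical helper; neither is the code added by
    this patch.

```c
#include <assert.h>

/* Hypothetical stand-ins for the SCSI device handler results. */
enum dh_result { DH_OK, DH_IMM_RETRY, DH_RETRY };

#define PG_INIT_DELAY_DEFAULT_MS 2000u   /* default from the text above */
#define PG_INIT_DELAY_MAX_MS     60000u  /* maximum configurable delay */

/* Return how long to wait (in ms) before retrying activation,
 * or -1 when no retry is called for. */
long retry_delay_ms(enum dh_result res, unsigned configured_ms)
{
    assert(configured_ms <= PG_INIT_DELAY_MAX_MS);

    switch (res) {
    case DH_IMM_RETRY:
        return 0;                      /* retry straight away */
    case DH_RETRY:
        return (long)configured_ms;    /* 0 also means immediate retry */
    default:
        return -1;                     /* nothing to retry */
    }
}
```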

    Signed-off-by: Nikanth Karthikesan
    Signed-off-by: Chandra Seetharaman
    Acked-by: Mike Christie
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Chandra Seetharaman
     
  • This patch changes spin_lock_irq() to spin_lock() in dm_request_fn().
    It is just a clean-up with no functional change.

    The spin_lock_irq() was leftover from the early request-based dm code,
    where map_request() used to enable interrupts.
    Since current map_request() never enables interrupts, we can change it
    to spin_lock() to match the prior spin_unlock().

    Auditing through the dm and block-layer code called from
    map_request(), I confirmed all functions save/restore interrupt
    status, so no function returns with interrupts enabled.
    Also I haven't observed any problem in my test environment, which
    uses scsi and lpfc driver after heavy I/O testing with occasional
    path down/up.

    Added BUG_ON() to detect breakage in future.

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Kiyoshi Ueda
     
  • It's nicer to return the PTR_ERR() value instead of just returning
    -ENOMEM. In the current code the PTR_ERR() value is always equal to
    -ENOMEM so this doesn't actually affect anything, but still...

    In addition, dm_dirty_log_create() doesn't check for a specific -ENOMEM
    return. So this change is safe relative to potential for a non -ENOMEM
    return in the future.

    Signed-off-by: Dan Carpenter
    Acked-by: Jonathan Brassow
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Dan Carpenter
     
  • Use dm_suspended() rather than having each snapshot target maintain a
    private 'suspended' flag in struct dm_snapshot.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • metadata_wq serves on-stack work items from chunk_io(). Even if
    multiple chunk_io() are simultaneously in progress, each is
    independent and queued only once, so multithreaded workqueue can be
    safely used.

    Switch metadata_wq to multithread and flush the work item instead of
    the workqueue in chunk_io().

    Signed-off-by: Tejun Heo
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Tejun Heo
     
  • kmirrord_wq, kcopyd_work and md->wq are created per dm instance and
    serve only a single work item from the dm instance, so non-reentrant
    workqueues would provide the same ordering guarantees as ordered ones
    while allowing CPU affinity and use of the workqueues for other
    purposes. Switch them to non-reentrant workqueues.

    Signed-off-by: Tejun Heo
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Tejun Heo
     
  • Convert all create[_singlethread]_workqueue() users to the new
    alloc[_ordered]_workqueue(). This conversion is mechanical and
    doesn't introduce any behavior change.

    Signed-off-by: Tejun Heo
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Tejun Heo
     
  • kstriped only serves sc->kstriped_ws which runs dm_table_event().
    This doesn't need to be executed from an ordered workqueue w/ rescuer.
    Drop kstriped and use the system_wq instead. While at it, rename
    kstriped_ws to trigger_event so that it's consistent with other dm
    modules.

    Signed-off-by: Tejun Heo
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Tejun Heo
     
  • flush_scheduled_work() is being deprecated. Flush the used work
    directly instead. In all dm targets, the only work which uses
    system_wq is ->trigger_event.

    Signed-off-by: Tejun Heo
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Tejun Heo
     
  • dm_snapshot->queued_bios_work isn't used. Remove ->queued_bios[_work]
    from dm_snapshot structure, the flush_queued_bios work function and
    ksnapd workqueue.

    The DM snapshot changes that were going to use the ksnapd workqueue were
    either superseded (fix for origin write races) or never completed
    (deallocation of invalid snapshot's memory via workqueue).

    Signed-off-by: Tejun Heo
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Tejun Heo
     
  • The device-mapper should not send warning messages to syslog
    if a device is not found. This can be done by userspace
    according to the returned dm-ioctl error code.

    So move these messages to debug level and use rate limiting
    to not flood syslog.

    Signed-off-by: Milan Broz
    Signed-off-by: Alasdair G Kergon

    Milan Broz
     
  • This patch adds a compatible implementation of the block
    chaining mode used by the Loop-AES block device encryption
    system (http://loop-aes.sourceforge.net/) designed
    by Jari Ruusu.

    It operates on full 512 byte sectors and uses CBC
    with an IV derived from the sector number, the data and
    optionally extra IV seed.

    This means that after CBC decryption the first block of sector
    must be tweaked according to decrypted data.

    Loop-AES can use three encryption schemes:
    version 1: is plain aes-cbc mode (already compatible)
    version 2: uses a 64-key multikey scheme with its own IV generator
    version 3: the same as version 2 with additional IV seed
    (it uses 65 keys, last key is used as IV seed)

    The IV generator is here named lmk (Loop-AES multikey)
    and for the cipher specification looks like: aes:64-cbc-lmk

    Versions 2 and 3 are recognised according to the length
    of the provided multi-key string (which is just the hex encoded
    "raw key" used in original Loop-AES ioctl).

    Configuration of the device and decoding key string will
    be done in userspace (cryptsetup).
    (Loop-AES stores keys in gpg encrypted file, raw keys are
    output of simple hashing of lines in this file).

    Based on an implementation by Max Vozeler:
    http://article.gmane.org/gmane.linux.kernel.cryptoapi/3752/

    Signed-off-by: Milan Broz
    Signed-off-by: Alasdair G Kergon
    CC: Max Vozeler

    Milan Broz
     
  • This patch adds generic multikey handling to be used
    in following patch for Loop-AES mode compatibility.

    This patch extends mapping table to optional keycount and
    implements generic multi-key capability.

    With more keys defined the string is divided into
    several sections and these are used for tfms.

    The tfm is used according to sector offset
    (sector 0->tfm[0], sector 1->tfm[1], sector N->tfm[N modulo keycount])
    (only power of two values supported for keycount here).

    Because of the per-cpu allocation of tfms, this mode can take
    a lot of memory on large smp systems.
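
    The sector-to-key mapping can be written as a single mask operation
    when keycount is a power of two. tfm_index() is a name chosen for this
    sketch and does not appear in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Pick the key (tfm) index for a sector: sector N -> N modulo keycount.
 * keycount must be a non-zero power of two so that the modulo reduces
 * to a bit mask. */
unsigned tfm_index(uint64_t sector, unsigned keycount)
{
    assert(keycount != 0 && (keycount & (keycount - 1)) == 0);
    return (unsigned)(sector & (keycount - 1));
}
```

    With keycount 64, sectors 0, 64 and 128 all use tfm[0], while sector
    130 uses tfm[2].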

    Signed-off-by: Milan Broz
    Signed-off-by: Alasdair G Kergon
    Cc: Max Vozeler

    Milan Broz
     
  • IV (initialisation vector) can in principle depend not only
    on sector but also on plaintext data (or other attributes).

    Change IV generator interface to work directly with dmreq
    structure to allow such dependence in generator.

    Also add post() function which is called after the crypto
    operation.

    This allows tricky modification of decrypted data or IV
    internals.

    In asynchronous mode the post() can be called after
    ctx->sector count was increased so it is needed
    to add iv_sector copy directly to dmreq structure.
    (N.B. dmreq always includes only one sector in scatterlists)

    Signed-off-by: Milan Broz
    Signed-off-by: Alasdair G Kergon

    Milan Broz
     
  • If there is enough memory, code can directly submit bio
    instead of queueing this operation in a separate thread.

    Try to alloc bio clone with GFP_NOWAIT and only if it
    fails use separate queue (map function cannot block here).

    Signed-off-by: Milan Broz
    Signed-off-by: Alasdair G Kergon

    Milan Broz
     
  • Currently dm-crypt does all the encryption work for a single dm-crypt
    mapping in a single workqueue. This does not scale well when multiple
    CPUs are submitting IO at a high rate. The single CPU running the single
    thread cannot keep up with the encryption and encrypted IO performance
    tanks.

    This patch changes the crypto workqueue to be per CPU. This means
    that as long as the IO submitter (or the interrupt target CPUs
    for reads) runs on different CPUs the encryption work will be also
    parallel.

    To avoid a bottleneck on the IO worker I also changed those to be
    per-CPU threads.

    There is still some shared data, so I suspect some bouncing
    cache lines. But I haven't done a detailed study on that yet.

    Signed-off-by: Andi Kleen
    Signed-off-by: Milan Broz
    Signed-off-by: Alasdair G Kergon

    Andi Kleen