22 Apr, 2015

17 commits

  • Expansion/resync can grab a stripe while the stripe is in a batch list. Since
    all stripes in a batch list must be in the same state, we can't allow some of
    them to run into expansion/resync. So we delay expansion/resync for stripes in
    a batch list.

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    shli@kernel.org
     
  • If an IO error happens in any stripe of a batch list, the batch list is
    split, and normal processing then runs for the stripes in the list.

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    shli@kernel.org
     
  • The stripe cache unit is 4k. Even adjacent full stripe writes are handled in
    4k units. Ideally we should use a bigger size for adjacent full stripe writes.
    A bigger stripe cache size means fewer stripes running in the state machine,
    which reduces CPU overhead. A bigger size also causes bigger IOs to be
    dispatched to the underlying disks.

    With the patch below, we automatically batch adjacent full stripe writes
    together. Such stripes are added to the batch list. Only the first stripe
    of the list is put on handle_list and so runs handle_stripe(). Some steps
    of handle_stripe() are extended to cover all stripes of the list, including
    ops_run_io, ops_run_biodrain and so on. With this patch, we have fewer stripes
    running in handle_stripe() and we send the IO of the whole stripe list
    together to increase IO size.

    Stripes added to a batch list have some limitations. A batch list can only
    include full stripe writes and can't cross a chunk boundary, to make sure the
    stripes have the same parity disks. Stripes in a batch list must be in the same
    state (no written, toread and so on). If a stripe is in a batch list, any new
    read/write passed to add_stripe_bio will be blocked as an overlap conflict
    until the batch list is handled. These limitations make sure stripes in a
    batch list stay in exactly the same state throughout their life cycle.

    I tested 160k randwrite on a RAID5 array with a 32k chunk size and 6 PCIe
    SSDs. This patch improves performance by around 30%, and the IO size sent to
    the underlying disks is exactly 32k. I also ran a 4k randwrite test on the
    same array to make sure performance isn't changed by the patch.

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    shli@kernel.org
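
The batching limits described above can be sketched in user-space C. This is only an illustration under stated assumptions, not the kernel code: `struct stripe_sketch`, `can_batch()` and `max_batch_len()` are hypothetical names, and sectors are assumed to be 512 bytes, so a 4k stripe unit is 8 sectors. It also shows why the 160k test fits: RAID5 on 6 disks has 5 data disks, and 5 x 32k = 160k is one full chunk row.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define STRIPE_SECTORS 8u /* one 4k stripe unit, in 512-byte sectors (assumption) */

/* Hypothetical, simplified view of a stripe; not the kernel's struct stripe_head. */
struct stripe_sketch {
    uint64_t sector;     /* start sector of this stripe on each member disk */
    int overwrite_disks; /* data disks fully overwritten by pending writes */
    int data_disks;      /* data disks in the array (parity excluded) */
};

/* A stripe is a full stripe write when every data disk is overwritten. */
static bool is_full_stripe_write(const struct stripe_sketch *sh)
{
    return sh->overwrite_disks == sh->data_disks;
}

/* cur may join prev's batch only if both are full stripe writes,
 * they are adjacent, and they sit in the same chunk. */
static bool can_batch(const struct stripe_sketch *prev,
                      const struct stripe_sketch *cur,
                      uint32_t chunk_sectors)
{
    assert(chunk_sectors >= STRIPE_SECTORS);
    if (!is_full_stripe_write(prev) || !is_full_stripe_write(cur))
        return false;
    if (cur->sector != prev->sector + STRIPE_SECTORS)
        return false;
    /* Same chunk means same parity disk for both stripes. */
    return prev->sector / chunk_sectors == cur->sector / chunk_sectors;
}

/* Upper bound on the number of stripes in one batch list. */
static uint32_t max_batch_len(uint32_t chunk_sectors)
{
    return chunk_sectors / STRIPE_SECTORS;
}
```

With a 32k chunk (64 sectors), `max_batch_len(64)` is 8, which matches the 32k IO size reported for the underlying disks.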
     
  • Track overwrite disk count, so we can know if a stripe is a full stripe write.

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    shli@kernel.org
     
  • A fresh stripe with a write request can be batched. Any time the stripe is
    handled or a new read is queued, the flag is cleared.

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    shli@kernel.org
     
  • Use flex_array for scribble data. Next patch will batch several stripes
    together, so scribble data should be able to cover several stripes, so this
    patch also allocates scribble data for stripes across a chunk.

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    shli@kernel.org
     
  • … not set when accessed from dm-raid

    The patch makes 3 references to mddev->queue in the raid0 personality
    conditional in order to allow for it to be accessed from dm-raid.
    Mandatory, because md instances underneath dm-raid don't manage
    a request queue of their own which'd lead to oopses without the patch.

    Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
    Tested-by: Heinz Mauelshagen <heinzm@redhat.com>
    Signed-off-by: NeilBrown <neilb@suse.de>

    Heinz Mauelshagen
     
  • When md notices non-sync IO happening while it is trying
    to resync (or reshape or recover) it slows down to the
    set minimum.

    The default minimum might have made sense many years ago
    but the drives have become faster. Changing the default
    to match the times isn't really a long term solution.

    This patch changes the code so that instead of waiting until the speed
    has dropped to the target, it just waits until pending requests
    have completed.
    This means that the delay inserted is a function of the speed
    of the devices.

    Testing shows that:
    - for some loads, the resync speed is unchanged. For those loads
    increasing the minimum doesn't change the speed either.
    So this is a good result. To increase resync speed under such
    loads we would probably need to increase the resync window
    size.

    - for other loads, resync speed does increase to a reasonable
    fraction (e.g. 20%) of maximum possible, and throughput of
    the load only drops a little bit (e.g. 10%)

    - for other loads, throughput of the non-sync load drops quite a bit
    more. These seem to be latency-sensitive loads.

    So it isn't a perfect solution, but it is mostly an improvement.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • This option is not well justified and testing suggests that
    it hardly ever makes any difference.

    The comment suggests there might be a need to wait for non-resync
    activity indicated by ->nr_waiting, however raise_barrier()
    already waits for all of that.

    So just remove it to simplify reasoning about speed limiting.

    This allows us to remove a 'FIXME' comment from raid5.c as that
    never used the flag.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • There is really no need for sync_min to be a multiple of
    chunk_size, and values read from here often aren't.
    That means you cannot read a value and expect to be able
    to write it back later.

    So remove the chunk_size check, and round down to a multiple
    of 4K, to be sure everything works with 4K-sector devices.

    Signed-off-by: NeilBrown

    NeilBrown
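
As a rough illustration of the rounding described above, assuming sync_min is expressed in 512-byte sectors so that 4K corresponds to 8 sectors. The helper names are hypothetical and this is not the md code itself:

```c
#include <assert.h>
#include <stdint.h>

/* Round 'sectors' down to a multiple of 'align', which must be a power of two. */
static uint64_t round_down_sectors(uint64_t sectors, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return sectors & ~(align - 1);
}

/* 4K expressed in 512-byte sectors (assumption: the value is a sector count). */
static uint64_t sync_min_round_4k(uint64_t sectors)
{
    return round_down_sectors(sectors, 8);
}
```

A value that is already a multiple of 8 sectors is returned unchanged, so a value rounded once can be written back and read again without drifting.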
     
  • NeilBrown
     
  • When "re-add" is written to /sys/block/mdXX/md/dev-YYY/state,
    the clustered md:

    1. Sends a RE_ADD message with the desc_nr. Nodes receiving the message
    clear the Faulty bit in their respective rdev->flags.
    2. The node initiating the re-add gathers the bitmaps of all nodes
    and copies them into the local bitmap. It does not clear the bitmap
    from which it is copying.
    3. The initiating node schedules an md recovery to sync the devices.

    Signed-off-by: Guoqing Jiang
    Signed-off-by: Goldwyn Rodrigues
    Signed-off-by: NeilBrown

    Goldwyn Rodrigues
     
  • This adds the capability of re-adding a failed disk by
    writing "re-add" to /sys/block/mdXX/md/dev-YYY/state.

    This facilitates adding disks which have encountered a temporary
    error such as a network disconnection/hiccup in an iSCSI device,
    or a SAN cable disconnection which has been restored. In such
    a situation, you do not need to remove and re-add the device.
    Writing re-add to the failed device's state would add it again
    to the array and perform the recovery of only the blocks which
    were written after the device failed.

    This works for generic md, and is not related to clustering. However,
    this patch is to ease re-add operations listed above in clustering
    environments.

    Signed-off-by: Goldwyn Rodrigues
    Signed-off-by: NeilBrown

    Goldwyn Rodrigues
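
A minimal user-space sketch of driving this interface, assuming the sysfs layout quoted above. The helper names and the array/device names used in the usage note (`md0`, `sdb1`) are hypothetical examples:

```c
#include <stdio.h>
#include <string.h>

/* Build "/sys/block/<array>/md/dev-<dev>/state" into buf.
 * Returns 0 on success, -1 on empty names or if buf is too small. */
static int md_state_path(char *buf, size_t size,
                         const char *array, const char *dev)
{
    if (strlen(array) == 0 || strlen(dev) == 0)
        return -1;
    int n = snprintf(buf, size, "/sys/block/%s/md/dev-%s/state", array, dev);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Write 'value' (for example "re-add") to the given state file.
 * Returns 0 on success, -1 if the file cannot be opened or written. */
static int write_state(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    int ok = fputs(value, f) >= 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

A caller would build the path for, say, array `md0` and member `sdb1`, then call `write_state(path, "re-add")`. Doing so normally requires root privileges.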
     
  • This adds "remove" capabilities for the clustered environment.
    When a user initiates removal of a device from the array, a
    REMOVE message with disk number in the array is sent to all
    the nodes which kick the respective device in their own array.

    This facilitates the removal of failed devices.

    Signed-off-by: Goldwyn Rodrigues
    Signed-off-by: NeilBrown

    Goldwyn Rodrigues
     
  • This is required by the clustering module (patches to follow) to
    find the device to remove or re-add.

    Signed-off-by: Goldwyn Rodrigues
    Signed-off-by: NeilBrown

    Goldwyn Rodrigues
     
  • This export is required by the clustering module in order to
    coordinate the remove/re-add of an rdev across all nodes.

    Signed-off-by: Goldwyn Rodrigues
    Signed-off-by: NeilBrown

    Goldwyn Rodrigues
     
  • Since the node number of md-cluster starts from zero, and
    cinfo->slot_number represents the slot number of dlm,
    there is no need to check for equality.

    Signed-off-by: Guoqing Jiang
    Signed-off-by: Goldwyn Rodrigues
    Signed-off-by: NeilBrown

    Guoqing Jiang
     

10 Apr, 2015

1 commit

  • Since commit 20d0189b1012a37d2533a87fb451f7852f2418d1
    in v3.14-rc1 RAID0 has performed incorrect calculations
    when the chunksize is not a power of 2.

    This happens because "sector_div()" modifies its first argument, but
    this wasn't taken into account in the patch.

    So restore that first arg before re-using the variable.

    Reported-by: Joe Landman
    Reported-by: Dave Chinner
    Fixes: 20d0189b1012a37d2533a87fb451f7852f2418d1
    Cc: stable@vger.kernel.org (3.14 and later).
    Signed-off-by: NeilBrown

    NeilBrown
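
The pitfall can be reproduced in user space. In the kernel, sector_div() divides its first argument in place and returns the remainder. The sketch below uses a hypothetical stand-in, `sector_div_sim()`, and a made-up "start of chunk" calculation to show the bug pattern and the fix. It is not the raid0 code.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t sector_t;

/* User-space stand-in for sector_div(): *n becomes the quotient,
 * the remainder is returned. */
static uint32_t sector_div_sim(sector_t *n, uint32_t base)
{
    assert(base != 0);
    uint32_t rem = (uint32_t)(*n % base);
    *n /= base;
    return rem;
}

/* Buggy pattern: 'sector' is reused after the division overwrote it
 * with the chunk index, so the result is wrong. */
static sector_t chunk_start_buggy(sector_t sector, uint32_t chunk_sects)
{
    uint32_t offset = sector_div_sim(&sector, chunk_sects);
    return sector - offset; /* 'sector' is now the quotient, not the address */
}

/* Fixed pattern: restore the original value before reusing the variable. */
static sector_t chunk_start_fixed(sector_t sector, uint32_t chunk_sects)
{
    sector_t orig = sector;
    uint32_t offset = sector_div_sim(&sector, chunk_sects);
    sector = orig; /* undo the in-place modification */
    return sector - offset;
}
```

For sector 100 and a 24-sector chunk (not a power of 2), the remainder is 4 and the quotient is 4, so the buggy variant yields 0 while the fixed variant yields 96.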
     

08 Apr, 2015

1 commit

  • Simon reported the md io stats accounting issue:
    "
    I'm seeing "iostat -x -k 1" print this after a RAID1 rebuild on 4.0-rc5.
    It's not abnormal other than it's 3-disk, with one being SSD (sdc) and
    the other two being write-mostly:

    Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
    sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
    sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
    sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
    md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 345.00 0.00 0.00 0.00 0.00 100.00
    md2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 58779.00 0.00 0.00 0.00 0.00 100.00
    md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 12.00 0.00 0.00 0.00 0.00 100.00
    "
    The cause is that commit "18c0b223cf9901727ef3b02da6711ac930b4e5d4" uses
    generic_start_io_acct to account the disk stats rather than the open-coded
    version, but it also introduced an increment of .in_flight[rw], which md does
    not need. So we go back to the open-coded accounting here to fix it.

    Reported-by: Simon Kirby
    Cc: 3.19
    Signed-off-by: Gu Zheng
    Signed-off-by: NeilBrown

    Gu Zheng
     

07 Apr, 2015

4 commits

  • Pull networking fixes from David Miller:

    1) In TCP, don't register an FRTO for cumulatively ACK'd data that was
    previously SACK'd, from Neal Cardwell.

    2) Need to hold RTNL mutex in ipv4 multicast code namespace cleanup,
    from Cong WANG.

    3) Similarly we have to hold RTNL mutex for fib_rules_unregister(), also
    from Cong WANG.

    4) Revert and rework netns nsid allocation fix, from Nicolas Dichtel.

    5) When we encapsulate for a tunnel device, skb->sk still points to the
    user socket. So this leads to cases where we retraverse the
    ipv4/ipv6 output path with skb->sk being of some other address
    family (f.e. AF_PACKET). This can cause things to crash since the
    ipv4 output path is dereferencing an AF_PACKET socket as if it were
    an ipv4 one.

    The short term fix for 'net' and -stable is to elide these socket
    checks once we've entered an encapsulation sequence by testing
    xmit_recursion.

    Longer term we have a better solution wherein we pass the tunnel's
    socket down through the output paths, but that is way too invasive
    for 'net' and -stable.

    From Hannes Frederic Sowa.

    6) l2tp_init() failure path forgets to unregister per-net ops, from
    Cong WANG.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
    net/mlx4_core: Fix error message deprecation for ConnectX-2 cards
    net: dsa: fix filling routing table from OF description
    l2tp: unregister l2tp_net_ops on failure path
    mvneta: dont call mvneta_adjust_link() manually
    ipv6: protect skb->sk accesses from recursive dereference inside the stack
    netns: don't allocate an id for dead netns
    Revert "netns: don't clear nsid too early on removal"
    ip6mr: call del_timer_sync() in ip6mr_free_table()
    net: move fib_rules_unregister() under rtnl lock
    ipv4: take rtnl_lock and mark mrt table as freed on namespace cleanup
    tcp: fix FRTO undo on cumulative ACK of SACKed range
    xen-netfront: transmit fully GSO-sized packets

    Linus Torvalds
     
  • Commit 1daa4303b4ca ("net/mlx4_core: Deprecate error message at
    ConnectX-2 cards startup to debug") did the deprecation only for port 1
    of the card. Need to deprecate for port 2 as well.

    Fixes: 1daa4303b4ca ("net/mlx4_core: Deprecate error message at ConnectX-2 cards startup to debug")
    Signed-off-by: Jack Morgenstein
    Signed-off-by: Amir Vadai
    Signed-off-by: David S. Miller

    Jack Morgenstein
     
  • Pull input fixes from Dmitry Torokhov:
    "Updates for the input subsystem - two more tweaks for ALPS driver to
    work out kinks after splitting the touchpad, trackstick, and potential
    external PS/2 mouse into separate input devices.

    Changes to support ALPS SS4 devices (protocol V8) will be coming in
    4.1..."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
    Input: alps - document stick behavior for protocol V2
    Input: alps - report V2 Dualpoint Stick events via the right evdev node
    Input: alps - report interleaved bare PS/2 packets via dev3

    Linus Torvalds
     
  • mvneta_adjust_link() is a callback for of_phy_connect() and should
    not be called directly. The result of calling it directly is as below:

    Signed-off-by: David S. Miller

    Stas Sergeev
     

06 Apr, 2015

2 commits

  • On V2 devices the DualPoint Stick reports bare packets, these should be
    reported via the "AlpsPS/2 ALPS DualPoint Stick" dev2 evdev node, which also
    has the INPUT_PROP_POINTING_STICK propbit set.

    Note that since there is no way to distinguish these packets from an external
    PS/2 mouse (insofar as these laptops have an external PS/2 port) this means
    that we will be reporting PS/2 mouse events via this evdev node too, as we've
    been doing in kernel 3.19 and older.

    This has been tested on a Dell Latitude D620 and a Dell Latitude E6400,
    which both have a V2 touchpad + a DualPoint Stick which reports bare packets.

    Signed-off-by: Hans de Goede
    Reviewed-by: Pali Rohár
    Signed-off-by: Dmitry Torokhov

    Hans de Goede
     
  • Bare packets should be reported via the same evdev device regardless of
    whether they are detected at the beginning of a packet or in the middle
    of a packet.

    This has been tested on a Dell Latitude E6400, where the DualPoint Stick
    reports bare packets, which get reported via dev3 when the touchpad is
    idle, and via dev2 when the touchpad and stick are used simultaneously.

    This commit fixes this inconsistency by always reporting bare packets via
    dev3. Note that since they come from a DualPoint Stick they really should be
    reported via dev2; this gets fixed in a later commit.

    Signed-off-by: Hans de Goede
    Reviewed-by: Pali Rohár
    Signed-off-by: Dmitry Torokhov

    Hans de Goede
     

05 Apr, 2015

3 commits

  • Pull USB fixes from Greg KH:
    "Here are some small USB fixes and new device ids for 4.0-rc6. Nothing
    major, some xhci fixes for reported problems, and some usb-serial
    device ids.

    All have been in linux-next for a while"

    * tag 'usb-4.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
    USB: ftdi_sio: Use jtag quirk for SNAP Connect E10
    usb: isp1760: fix spin unlock in the error path of isp1760_udc_start
    usb: xhci: apply XHCI_AVOID_BEI quirk to all Intel xHCI controllers
    usb: xhci: handle Config Error Change (CEC) in xhci driver
    USB: keyspan_pda: add new device id
    USB: ftdi_sio: Added custom PID for Synapse Wireless product

    Linus Torvalds
     
  • Pull staging driver fixes from Greg KH:
    "Here are some staging driver fixes, well, really all just IIO driver
    fixes, for 4.0-rc6. They fix issues that have been reported with
    these drivers.

    All of these patches have been in linux-next for a while"

    * tag 'staging-4.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
    iio: imu: Use iio_trigger_get for indio_dev->trig assignment
    iio: adc: vf610: use ADC clock within specification
    iio/adc/cc10001_adc.c: Fix !HAS_IOMEM build
    iio: core: Fix double free.
    iio:inv-mpu6050: Fix inconsistency for the scale channel
    staging: iio: dummy: Fix undefined symbol build error
    iio: inv_mpu6050: Clear timestamps fifo while resetting hardware fifo
    staging: iio: hmc5843: Set iio name property in sysfs
    iio: bmc150: change sampling frequency
    iio: fix drivers that check buffer->scan_mask

    Linus Torvalds
     
  • Pull tty/serial fixes from Greg KH:
    "Here are 3 serial driver fixes for 4.0-rc6. They fix some reported
    issues with the samsung and fsl_lpuart drivers.

    All have been in linux-next for a while"

    * tag 'tty-4.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
    tty: serial: fsl_lpuart: clear receive flag on FIFO flush
    tty: serial: fsl_lpuart: specify transmit FIFO size
    serial: samsung: Clear operation mode on UART shutdown

    Linus Torvalds
     

04 Apr, 2015

3 commits


03 Apr, 2015

9 commits

  • Pull drm fixes from Dave Airlie:
    "One drm core fix, one exynos regression fix, two sets of radeon fixes
    (Alex was a bit behind last week), and two i915 fixes.

    Nothing too serious we seem to have calmed down i915 since last week"

    * 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
    drm/radeon: fix wait in radeon_mn_invalidate_range_start
    drm/radeon: add extra check in radeon_ttm_tt_unpin_userptr
    drm: Exynos: Respect framebuffer pitch for FIMD/Mixer
    drm/i915: Reject the colorkey ioctls for primary and cursor planes
    drm/i915: Skip allocating shadow batch for 0-length batches
    drm/radeon: programm the VCE fw BAR as well
    drm/radeon: always dump the ring content if it's available
    radeon: Do not directly dereference pointers to BIOS area.
    drm/radeon/dpm: fix 120hz handling harder
    drm/edid: set ELD for firmware and debugfs override EDIDs

    Linus Torvalds
     
  • Pull irqchip fixes from Jason Cooper:
    "This is the second round of fixes for irqchip. It contains some fixes
    found while the arm64 guys were writing the kvm gicv3 its emulation.

    GICv3 ITS:
    - Small batch of fixes discovered while writing the kvm ITS emulation"

    * tag 'irqchip-fixes-4.0-2' of git://git.infradead.org/users/jcooper/linux:
    irqchip: gicv3-its: Use non-cacheable accesses when no shareability
    irqchip: gicv3-its: Fix PROP/PEND and BASE/CBASE confusion
    irqchip: gicv3-its: Fix device ID encoding
    irqchip: gicv3-its: Fix encoding of collection's target redistributor

    Linus Torvalds
     
  • Just two small fixes for radeon, both destined for stable.

    * 'drm-fixes-4.0' of git://people.freedesktop.org/~agd5f/linux:
    drm/radeon: fix wait in radeon_mn_invalidate_range_start
    drm/radeon: add extra check in radeon_ttm_tt_unpin_userptr

    Dave Airlie
     
  • …/daeinki/drm-exynos into drm-fixes

    Fix the display-on issue on the Exynos5250-based Snow (1366x768) board.

    * 'exynos-drm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/daeinki/drm-exynos:
    drm: Exynos: Respect framebuffer pitch for FIMD/Mixer

    Dave Airlie
     
  • One oops fix and a 0-length allocation fix backported from next.

    * tag 'drm-intel-fixes-2015-04-02' of git://anongit.freedesktop.org/drm-intel:
    drm/i915: Reject the colorkey ioctls for primary and cursor planes
    drm/i915: Skip allocating shadow batch for 0-length batches

    Dave Airlie
     
  • Here's a single drm core fix, cc: stable, that affects i915
    users.

    * tag 'topic/drm-fixes-2015-04-02' of git://anongit.freedesktop.org/drm-intel:
    drm/edid: set ELD for firmware and debugfs override EDIDs

    Dave Airlie
     
  • Pull xen regression fixes from David Vrabel:
    "Fix two regressions in the balloon driver's use of memory hotplug when
    used in a PV guest"

    * tag 'stable/for-linus-4.0-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
    xen/balloon: before adding hotplugged memory, set frames to invalid
    x86/xen: prepare p2m list for memory hotplug

    Linus Torvalds
     
  • Pull infiniband/rdma fix from Roland Dreier:
    "Fix for exploitable integer overflow in uverbs interface"

    * tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
    IB/uverbs: Prevent integer overflow in ib_umem_get address arithmetic

    Linus Torvalds
     
  • xen-netfront limits transmitted skbs to be at most 44 segments in size. However,
    GSO permits up to 65536 bytes, which means a maximum of 45 segments of 1448
    bytes each. This slight reduction in the size of packets means a slight loss in
    efficiency.

    Since c/s 9ecd1a75d, xen-netfront sets gso_max_size to
    XEN_NETIF_MAX_TX_SIZE - MAX_TCP_HEADER,
    where XEN_NETIF_MAX_TX_SIZE is 65535 bytes.

    The calculation used by tcp_tso_autosize (and also tcp_xmit_size_goal since c/s
    6c09fa09d) in determining when to split an skb into two is
    sk->sk_gso_max_size - 1 - MAX_TCP_HEADER.

    So the maximum permitted size of an skb is calculated to be
    (XEN_NETIF_MAX_TX_SIZE - MAX_TCP_HEADER) - 1 - MAX_TCP_HEADER.

    Intuitively, this looks like the wrong formula -- we don't need two TCP headers.
    Instead, there is no need to deviate from the default gso_max_size of 65536 as
    this already accommodates the size of the header.

    Currently, the largest skb transmitted by netfront is 63712 bytes (44 segments
    of 1448 bytes each), as observed via tcpdump. This patch makes netfront send
    skbs of up to 65160 bytes (45 segments of 1448 bytes each).

    Similarly, the maximum allowable mtu does not need to subtract MAX_TCP_HEADER as
    it relates to the size of the whole packet, including the header.

    Fixes: 9ecd1a75d977 ("xen-netfront: reduce gso_max_size to account for max TCP header")
    Signed-off-by: Jonathan Davies
    Signed-off-by: David S. Miller

    Jonathan Davies
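
The arithmetic above can be checked with a short sketch. The function names are hypothetical, and MAX_TCP_HEADER depends on kernel configuration, so it is passed in as a parameter. The value 304 mentioned in the usage note is only an assumed example, not a figure taken from the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Largest payload made of whole MSS-sized segments that fits in size_goal. */
static uint32_t max_skb_payload(uint32_t size_goal, uint32_t mss)
{
    assert(mss != 0);
    return (size_goal / mss) * mss;
}

/* Before the patch: gso_max_size was already reduced by one TCP header,
 * and the split calculation subtracts the header a second time. */
static uint32_t old_size_goal(uint32_t max_tcp_header)
{
    uint32_t gso_max_size = 65535u - max_tcp_header;
    return gso_max_size - 1u - max_tcp_header;
}

/* After the patch: default gso_max_size of 65536, header subtracted once. */
static uint32_t new_size_goal(uint32_t max_tcp_header)
{
    return 65536u - 1u - max_tcp_header;
}
```

Assuming a header allowance of 304 bytes and 1448-byte segments, the old goal of 64926 bytes fits 44 segments (63712 bytes) and the new goal of 65231 bytes fits 45 segments (65160 bytes), matching the sizes observed via tcpdump.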