08 Mar, 2014

1 commit

  • Pull block fixes from Jens Axboe:
    "Small collection of fixes for 3.14-rc. It contains:

    - Three minor updates to blk-mq from Christoph.

    - Reduce the number of unaligned (< 4kb) in-flight writes on mtip32xx
    to two. From Micron.

    - Make the blk-mq CPU notify spinlock raw, since it can't be a
    sleeper spinlock on RT. From Mike Galbraith.

    - Drop a now-bogus BUG_ON() for bio iteration with blk integrity. From
    Nic Bellinger.

    - Properly propagate the SYNC flag on requests. From Shaohua"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    blk-mq: add REQ_SYNC early
    rt,blk,mq: Make blk_mq_cpu_notify_lock a raw spinlock
    bio-integrity: Drop bio_integrity_verify BUG_ON in post bip->bip_iter world
    blk-mq: support partial I/O completions
    blk-mq: merge blk_mq_insert_request and blk_mq_run_request
    blk-mq: remove blk_mq_alloc_rq
    mtip32xx: Reduce the number of unaligned writes to 2

    Linus Torvalds
     

04 Mar, 2014

2 commits

  • zram_meta_alloc() could fail, so the caller should check its return
    value. Otherwise, your system will hang.
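
    A minimal sketch of the required check, assuming a zram_init_device()-
    style caller of that era (the function and field names here are
    illustrative, not the exact diff):

      /* zram_meta_alloc() returns NULL on failure, so bail out instead of
       * letting a later dereference hang or crash the system. */
      static int zram_setup_meta(struct zram *zram, u64 disksize)
      {
              struct zram_meta *meta = zram_meta_alloc(disksize);

              if (!meta)
                      return -ENOMEM;   /* caller must handle the failure */

              zram->meta = meta;
              return 0;
      }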

    Signed-off-by: Minchan Kim
    Acked-by: Jerome Marchand
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Commit bf6bddf1924e ("mm: introduce compaction and migration for
    ballooned pages") introduces page_count(page) into memory compaction
    which dereferences page->first_page if PageTail(page).

    This results in a very rare NULL pointer dereference on the
    aforementioned page_count(page). Indeed, anything that does
    compound_head(), including page_count(), is susceptible to racing with
    prep_compound_page() and seeing a NULL or dangling page->first_page
    pointer.

    This patch uses Andrea's implementation of compound_trans_head() that
    deals with such a race and makes it the default compound_head()
    implementation. This includes a read memory barrier that ensures that,
    if PageTail(head) is true, we return a head page that is neither
    NULL nor dangling. The patch then adds a store memory barrier to
    prep_compound_page() to ensure page->first_page is set.

    This is the safest way to ensure we see the head page that we are
    expecting; PageTail(page) is already in the unlikely() path, and the
    memory barriers are unfortunately required.

    Hugetlbfs is the exception: we don't enforce a store memory barrier
    during init since no race is possible.
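
    A sketch of the barrier pairing described above, close to the generic
    compound_head()/prep_compound_page() of that era (simplified, not the
    verbatim patch):

      static inline struct page *compound_head(struct page *page)
      {
              if (unlikely(PageTail(page))) {
                      struct page *head = page->first_page;

                      /* Pairs with the smp_wmb() in prep_compound_page():
                       * if the tail bit is still set after the read
                       * barrier, head is neither NULL nor dangling. */
                      smp_rmb();
                      if (likely(PageTail(page)))
                              return head;
              }
              return page;
      }

      /* in prep_compound_page(), for each tail page p: */
      p->first_page = page;
      smp_wmb();              /* publish first_page before the tail bit */
      __SetPageTail(p);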

    Signed-off-by: David Rientjes
    Cc: Holger Kiehl
    Cc: Christoph Lameter
    Cc: Rafael Aquini
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: "Kirill A. Shutemov"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

15 Feb, 2014

1 commit

  • Pull block IO fixes from Jens Axboe:
    "Second round of updates and fixes for 3.14-rc2. Most of this stuff
    has been queued up for a while. The notable exception is the blk-mq
    changes, which are naturally a bit more in flux still.

    The pull request contains:

    - Two bug fixes for the new immutable vecs, causing crashes with raid
    or swap. From Kent.

    - Various blk-mq tweaks and fixes from Christoph. A fix for
    integrity bios from Nic.

    - A few bcache fixes from Kent and Darrick Wong.

    - xen-blk{front,back} fixes from David Vrabel, Matt Rushton, Nicolas
    Swenson, and Roger Pau Monne.

    - Fix for a vec miscount with integrity vectors from Martin.

    - Minor annotations or fixes from Masanari Iida and Rashika Kheria.

    - Tweak to null_blk to do more normal FIFO processing of requests
    from Shlomo Pongratz.

    - Elevator switching bypass fix from Tejun.

    - Softlockup in blkdev_issue_discard() fix when !CONFIG_PREEMPT from
    me"

    * 'for-linus' of git://git.kernel.dk/linux-block: (31 commits)
    block: add cond_resched() to potentially long running ioctl discard loop
    xen-blkback: init persistent_purge_work work_struct
    blk-mq: pair blk_mq_start_request / blk_mq_requeue_request
    blk-mq: dont assume rq->errors is set when returning an error from ->queue_rq
    block: Fix cloning of discard/write same bios
    block: Fix type mismatch in ssize_t_blk_mq_tag_sysfs_show
    blk-mq: rework flush sequencing logic
    null_blk: use blk_complete_request and blk_mq_complete_request
    virtio_blk: use blk_mq_complete_request
    blk-mq: rework I/O completions
    fs: Add prototype declaration to appropriate header file include/linux/bio.h
    fs: Mark function as static in fs/bio-integrity.c
    block/null_blk: Fix completion processing from LIFO to FIFO
    block: Explicitly handle discard/write same segments
    block: Fix nr_vecs for inline integrity vectors
    blk-mq: Add bio_integrity setup to blk_mq_make_request
    blk-mq: initialize sg_reserved_size
    blk-mq: handle dma_drain_size
    blk-mq: divert __blk_put_request for MQ ops
    blk-mq: support at_head inserations for blk_execute_rq
    ...

    Linus Torvalds
     

12 Feb, 2014

1 commit

  • Initialize the persistent_purge_work work_struct in xen_blkif_alloc (and
    remove the previous initialization done in purge_persistent_gnt). This
    prevents flush_work from complaining even if purge_persistent_gnt has
    not been used.

    Signed-off-by: Roger Pau Monné
    Reviewed-by: David Vrabel
    Tested-by: Sander Eikelenboom
    Signed-off-by: Jens Axboe

    Roger Pau Monne
     

11 Feb, 2014

3 commits

  • …/git/xen/tip into for-linus

    Konrad writes:

    Please git pull the following branch:

    git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip.git stable/for-jens-3.14

    which is based off v3.13-rc6. If you would like me to rebase it on
    a different branch/tag I would be more than happy to do so.

    The patches are all bug-fixes and hopefully can go into 3.14.

    They deal with xen-blkback shutdown issues that caused memory leaks
    as well as shutdown races. They should go to the stable tree, and if
    you are OK with it I will ask for those fixes to be backported.

    There is also a fix to xen-blkfront to deal with an unexpected state
    transition. And lastly, a fix to the header, which was using
    __aligned__ unnecessarily.

    Jens Axboe
     
  • Use the block layer helpers for CPU-local completions instead of
    reimplementing them locally.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Make sure to complete requests on the submitting CPU. Previously this
    was done in blk_mq_end_io, but the responsibility shifted to the drivers.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

08 Feb, 2014

6 commits

  • The completion queue is implemented using a lockless list.

    llist_add adds events at the list head, which is a push operation. The
    completion elements are processed by disconnecting all the pushed
    elements and iterating over the disconnected list. The problem is that
    this processes elements in reverse order w.r.t. the order of insertion,
    i.e. LIFO processing. By reversing the disconnected list, which takes
    linear time, the desired FIFO processing is achieved.
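
    A sketch of the pattern, using the llist helpers available at the time
    (the event type and consumer are illustrative):

      #include <linux/llist.h>

      struct completion_event {
              struct llist_node node;
              /* driver-specific payload */
      };

      static void process_completions(struct llist_head *head)
      {
              /* detach all pushed elements atomically ... */
              struct llist_node *entry = llist_del_all(head);

              /* ... and undo the LIFO effect of llist_add() in O(n) */
              entry = llist_reverse_order(entry);

              while (entry) {
                      struct completion_event *ev =
                              llist_entry(entry, struct completion_event, node);
                      entry = entry->next;
                      handle_event(ev);       /* hypothetical consumer */
              }
      }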

    Signed-off-by: Shlomo Pongratz
    Signed-off-by: Jens Axboe

    Shlomo Pongratz
     
  • Backend drivers shouldn't transition to CLOSED unless the frontend is
    CLOSED. If a backend does transition to CLOSED too soon then the
    frontend may not see the CLOSING state and will not shut down properly.

    So, treat an unexpected backend CLOSED state the same as CLOSING.

    Signed-off-by: David Vrabel
    Acked-by: Konrad Rzeszutek Wilk
    Cc: stable@vger.kernel.org
    Signed-off-by: Konrad Rzeszutek Wilk

    David Vrabel
     
  • This was wrongly introduced in commit 402b27f9; the only difference
    between blkif_request_segment_aligned and blkif_request_segment is
    that the former has named padding, while both share the same
    memory layout.

    Also correct a few minor glitches in the description, including no
    longer assuming PAGE_SIZE == 4096.

    Signed-off-by: Roger Pau Monné
    [Description fix by Jan Beulich]
    Signed-off-by: Jan Beulich
    Reported-by: Jan Beulich
    Cc: Konrad Rzeszutek Wilk
    Cc: David Vrabel
    Cc: Boris Ostrovsky
    Tested-by: Matt Rushton
    Cc: Matt Wilson
    Signed-off-by: Konrad Rzeszutek Wilk

    Roger Pau Monne
     
  • Introduce a new variable to keep track of the number of in-flight
    requests. We need to make sure that when xen_blkif_put is called the
    requests have already been freed, so that we can safely free xen_blkif;
    this was not the case before.
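
    A sketch of the counting idea (the field and wait-queue names are
    illustrative, not the exact patch):

      /* submission path: account the request before issuing it */
      atomic_inc(&blkif->inflight);

      /* completion path: the last free may release a pending shutdown */
      static void xen_blkif_put_request(struct xen_blkif *blkif)
      {
              if (atomic_dec_and_test(&blkif->inflight))
                      wake_up(&blkif->shutdown_wq);
      }

      /* teardown waits until every request has been freed, so the final
       * xen_blkif_put() can safely free the structure itself */
      wait_event(blkif->shutdown_wq, atomic_read(&blkif->inflight) == 0);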

    Signed-off-by: Roger Pau Monné
    Cc: Konrad Rzeszutek Wilk
    Cc: David Vrabel
    Reviewed-by: Boris Ostrovsky
    Tested-by: Matt Rushton
    Reviewed-by: Matt Rushton
    Cc: Matt Wilson
    Cc: Ian Campbell
    Signed-off-by: Konrad Rzeszutek Wilk

    Roger Pau Monne
     
  • I've at least identified two possible memory leaks in blkback, both
    related to the shutdown path of a VBD:

    - blkback doesn't wait for any pending purge work to finish before
    cleaning the list of free_pages. The purge work will call
    put_free_pages and thus we might end up with pages being added to
    the free_pages list after we have emptied it. Fix this by making
    sure there's no pending purge work before exiting
    xen_blkif_schedule, and by moving the free_page cleanup code to
    xen_blkif_free.
    - blkback doesn't wait for pending requests to end before cleaning
    persistent grants and the list of free_pages. Again, this can add
    pages to the free_pages list or persistent grants to the
    persistent_gnts red-black tree. Fixed by moving the persistent
    grants and free_pages cleanup code to xen_blkif_free.

    Also, add some checks in xen_blkif_free to make sure we are cleaning
    everything.

    Signed-off-by: Roger Pau Monné
    Cc: Konrad Rzeszutek Wilk
    Reviewed-by: David Vrabel
    Cc: Boris Ostrovsky
    Tested-by: Matt Rushton
    Reviewed-by: Matt Rushton
    Cc: Matt Wilson
    Cc: Ian Campbell
    Signed-off-by: Konrad Rzeszutek Wilk

    Roger Pau Monne
     
  • Currently shrink_free_pagepool() is called before the pages used for
    persistent grants are released via free_persistent_gnts(). This
    results in a memory leak when a VBD that uses persistent grants is
    torn down.

    Cc: Konrad Rzeszutek Wilk
    Cc: "Roger Pau Monné"
    Cc: Ian Campbell
    Reviewed-by: David Vrabel
    Cc: linux-kernel@vger.kernel.org
    Cc: xen-devel@lists.xen.org
    Cc: Anthony Liguori
    Signed-off-by: Matt Rushton
    Signed-off-by: Matt Wilson
    Signed-off-by: Konrad Rzeszutek Wilk

    Matt Rushton
     

06 Feb, 2014

2 commits

  • Pull Xen fixes from Konrad Rzeszutek Wilk:
    "Bug-fixes:
    - Revert "xen/grant-table: Avoid m2p_override during mapping" as it
    broke the Xen ARM build.
    - Fix CR4 not being set on AP processors in Xen PVH mode"

    * tag 'stable/for-linus-3.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
    xen/pvh: set CR4 flags for APs
    Revert "xen/grant-table: Avoid m2p_override during mapping"

    Linus Torvalds
     
  • Pull NVMe driver update from Matthew Wilcox:
    "Looks like I missed the merge window ... but these are almost all
    bugfixes anyway (the ones that aren't have been baking for months)"

    * git://git.infradead.org/users/willy/linux-nvme:
    NVMe: Namespace use after free on surprise removal
    NVMe: Correct uses of INIT_WORK
    NVMe: Include device and queue numbers in interrupt name
    NVMe: Add a pci_driver shutdown method
    NVMe: Disable admin queue on init failure
    NVMe: Dynamically allocate partition numbers
    NVMe: Async IO queue deletion
    NVMe: Surprise removal handling
    NVMe: Abort timed out commands
    NVMe: Schedule reset for failed controllers
    NVMe: Device resume error handling
    NVMe: Cache dev->pci_dev in a local pointer
    NVMe: Fix lockdep warnings
    NVMe: compat SG_IO ioctl
    NVMe: remove deprecated IRQF_DISABLED
    NVMe: Avoid shift operation when writing cq head doorbell

    Linus Torvalds
     

01 Feb, 2014

1 commit

  • …inux/kernel/git/xen/tip

    Pull Xen bugfixes from Konrad Rzeszutek Wilk:
    "Bug-fixes for the new features that were added during this cycle.

    There are also two fixes for long-standing issues: grant-table
    operations were doing extra, unneeded work, causing performance
    issues, and the self-balloon code was too aggressive, causing OOMs.

    Details:
    - Xen ARM couldn't use the new FIFO events
    - Xen ARM couldn't use the SWIOTLB if compiled as 32-bit with 64-bit PCIe devices.
    - Grant tables were doing needless M2P operations.
    - Ratchet down the self-balloon code so it won't OOM.
    - Fix misplaced kfree in Xen PVH error code paths"

    * tag 'stable/for-linus-3.14-rc0-late-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
    xen/pvh: Fix misplaced kfree from xlated_setup_gnttab_pages
    drivers: xen: deaggressive selfballoon driver
    xen/grant-table: Avoid m2p_override during mapping
    xen/gnttab: Use phys_addr_t to describe the grant frame base address
    xen: swiotlb: handle sizeof(dma_addr_t) != sizeof(phys_addr_t)
    arm/xen: Initialize event channels earlier

    Linus Torvalds
     

31 Jan, 2014

13 commits

  • The grant mapping API does m2p_override unnecessarily: only gntdev needs it;
    for blkback and future netback patches it just causes lock contention, as
    those pages never go to userspace. Therefore this series does the following:
    - the original functions were renamed to __gnttab_[un]map_refs, with a new
    m2p_override parameter
    - based on m2p_override they either follow the original behaviour, or just set
    the private flag and call set_phys_to_machine
    - gnttab_[un]map_refs are now wrappers that call __gnttab_[un]map_refs with
    m2p_override false
    - a new function gnttab_[un]map_refs_userspace provides the old behaviour

    It also removes a stray space from page.h and changes ret to 0 if
    XENFEAT_auto_translated_physmap, as that is the only possible return value
    there.
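
    The resulting API shape looks roughly like this (signatures
    abbreviated; treat this as a sketch of the wrapper pattern rather than
    the exact prototypes):

      /* kernel-only users (blkback, netback): no m2p_override */
      int gnttab_map_refs(struct gnttab_map_grant_ref *map_ops,
                          struct page **pages, unsigned int count)
      {
              return __gnttab_map_refs(map_ops, NULL, pages, count, false);
      }

      /* gntdev, where pages are mapped into userspace */
      int gnttab_map_refs_userspace(struct gnttab_map_grant_ref *map_ops,
                                    struct gnttab_map_grant_ref *kmap_ops,
                                    struct page **pages, unsigned int count)
      {
              return __gnttab_map_refs(map_ops, kmap_ops, pages, count, true);
      }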

    v2:
    - move the storing of the old mfn in page->index to gnttab_map_refs
    - move the function header update to a separate patch

    v3:
    - a new approach to retain old behaviour where it needed
    - squash the patches into one

    v4:
    - move out the common bits from m2p* functions, and pass pfn/mfn as parameter
    - clear page->private before doing anything with the page, so m2p_find_override
    won't race with this

    v5:
    - change return value handling in __gnttab_[un]map_refs
    - remove a stray space in page.h
    - add detail why ret = 0 now at some places

    v6:
    - don't pass pfn to m2p* functions, just get it locally

    Signed-off-by: Zoltan Kiss
    Suggested-by: David Vrabel
    Acked-by: David Vrabel
    Acked-by: Stefano Stabellini
    Signed-off-by: Konrad Rzeszutek Wilk

    Zoltan Kiss
     
  • Finally, we separated the zram->lock dependency from 32-bit stat/table
    handling, so there is no reason to use an rw_semaphore between the read
    and write paths. This patch removes the lock from the read path entirely
    and replaces the rw_semaphore with a mutex, so we now have:

    old:

    read-read: OK
    read-write: NO
    write-write: NO

    Now:

    read-read: OK
    read-write: OK
    write-write: NO
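
    A minimal sketch of the new scheme, simplified from zram's bvec
    read/write path of that era:

      static int zram_bvec_rw(struct zram *zram, struct bio_vec *bvec,
                              u32 index, int offset, struct bio *bio, int rw)
      {
              int ret;

              if (rw == READ) {
                      /* no zram-wide lock at all on the read path */
                      ret = zram_bvec_read(zram, bvec, index, offset, bio);
              } else {
                      mutex_lock(&zram->lock);        /* was down_write() */
                      ret = zram_bvec_write(zram, bvec, index, offset);
                      mutex_unlock(&zram->lock);
              }
              return ret;
      }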

    The data below shows the mixed workload performing 11 times better, and
    there is also an improvement on the write-write path because the current
    rw_semaphore doesn't support SPIN_ON_OWNER. That is a side effect, but a
    welcome one.

    Write-related tests perform better (from 61% to 1058%), while the read
    path varies (from -2.22% to 1.45%), all of it marginal within stddev.

    CPU 12
    iozone -t -T -l 12 -u 12 -r 16K -s 60M -I +Z -V 0

    ==Initial write ==Initial write
    records: 10 records: 10
    avg: 516189.16 avg: 839907.96
    std: 22486.53 (4.36%) std: 47902.17 (5.70%)
    max: 546970.60 max: 909910.35
    min: 481131.54 min: 751148.38
    ==Rewrite ==Rewrite
    records: 10 records: 10
    avg: 509527.98 avg: 1050156.37
    std: 45799.94 (8.99%) std: 40695.44 (3.88%)
    max: 611574.27 max: 1111929.26
    min: 443679.95 min: 980409.62
    ==Read ==Read
    records: 10 records: 10
    avg: 4408624.17 avg: 4472546.76
    std: 281152.61 (6.38%) std: 163662.78 (3.66%)
    max: 4867888.66 max: 4727351.03
    min: 4058347.69 min: 4126520.88
    ==Re-read ==Re-read
    records: 10 records: 10
    avg: 4462147.53 avg: 4363257.75
    std: 283546.11 (6.35%) std: 247292.63 (5.67%)
    max: 4912894.44 max: 4677241.75
    min: 4131386.50 min: 4035235.84
    ==Reverse Read ==Reverse Read
    records: 10 records: 10
    avg: 4565865.97 avg: 4485818.08
    std: 313395.63 (6.86%) std: 248470.10 (5.54%)
    max: 5232749.16 max: 4789749.94
    min: 4185809.62 min: 3963081.34
    ==Stride read ==Stride read
    records: 10 records: 10
    avg: 4515981.80 avg: 4418806.01
    std: 211192.32 (4.68%) std: 212837.97 (4.82%)
    max: 4889287.28 max: 4686967.22
    min: 4210362.00 min: 4083041.84
    ==Random read ==Random read
    records: 10 records: 10
    avg: 4410525.23 avg: 4387093.18
    std: 236693.22 (5.37%) std: 235285.23 (5.36%)
    max: 4713698.47 max: 4669760.62
    min: 4057163.62 min: 3952002.16
    ==Mixed workload ==Mixed workload
    records: 10 records: 10
    avg: 243234.25 avg: 2818677.27
    std: 28505.07 (11.72%) std: 195569.70 (6.94%)
    max: 288905.23 max: 3126478.11
    min: 212473.16 min: 2484150.69
    ==Random write ==Random write
    records: 10 records: 10
    avg: 555887.07 avg: 1053057.79
    std: 70841.98 (12.74%) std: 35195.36 (3.34%)
    max: 683188.28 max: 1096125.73
    min: 437299.57 min: 992481.93
    ==Pwrite ==Pwrite
    records: 10 records: 10
    avg: 501745.93 avg: 810363.09
    std: 16373.54 (3.26%) std: 19245.01 (2.37%)
    max: 518724.52 max: 833359.70
    min: 464208.73 min: 765501.87
    ==Pread ==Pread
    records: 10 records: 10
    avg: 4539894.60 avg: 4457680.58
    std: 197094.66 (4.34%) std: 188965.60 (4.24%)
    max: 4877170.38 max: 4689905.53
    min: 4226326.03 min: 4095739.72

    Signed-off-by: Minchan Kim
    Cc: Nitin Gupta
    Tested-by: Sergey Senozhatsky
    Cc: Jerome Marchand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Commit a0c516cbfc74 ("zram: don't grab mutex in zram_slot_free_noity")
    introduced pending-free-request code to avoid scheduling on a mutex under
    a spinlock, and it was a mess that made the code lengthy and increased
    overhead.

    Now we don't need zram->lock any more to free a slot, so this patch
    reverts that code; tb_lock protects the table instead.

    Signed-off-by: Minchan Kim
    Cc: Nitin Gupta
    Tested-by: Sergey Senozhatsky
    Cc: Jerome Marchand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Currently, the zram table is protected by zram->lock, but that is a
    rather coarse-grained lock and hurts scalability.

    Let's use our own rwlock instead of depending on zram->lock. This patch
    adds new locking, so it will obviously be slower, but it is just
    preparation for removing the coarse-grained rw_semaphore (ie, zram->lock)
    that is the hurdle for zram scalability.

    The final patch in this series will remove the lock from the read path
    and replace the rw_semaphore with a mutex in the write path. As a bonus,
    we can drop the pending-slot-free mess in the next patch.

    Signed-off-by: Minchan Kim
    Cc: Nitin Gupta
    Tested-by: Sergey Senozhatsky
    Cc: Jerome Marchand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Some of the fields in zram->stats are protected by zram->lock, which is
    rather coarse-grained, so let's use atomic operations without explicit
    locking.

    This patch prepares for removing the read path's dependency on
    zram->lock, a very coarse-grained rw_semaphore. Of course, it adds new
    atomic operations, so it might be slower, but my 12-CPU test couldn't
    spot any regression. All gains/losses are marginal within stddev.
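
    A sketch of the conversion (field names follow zram's stats of that
    era, but this is illustrative rather than the full diff):

      struct zram_stats {
              atomic64_t compr_size;  /* was u64, guarded by zram->lock */
              atomic64_t num_reads;
              atomic64_t num_writes;
      };

      /* update sites become single lockless operations */
      atomic64_inc(&zram->stats.num_reads);
      atomic64_add(clen, &zram->stats.compr_size);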

    iozone -t -T -l 12 -u 12 -r 16K -s 60M -I +Z -V 0

    ==Initial write ==Initial write
    records: 50 records: 50
    avg: 412875.17 avg: 415638.23
    std: 38543.12 (9.34%) std: 36601.11 (8.81%)
    max: 521262.03 max: 502976.72
    min: 343263.13 min: 351389.12
    ==Rewrite ==Rewrite
    records: 50 records: 50
    avg: 416640.34 avg: 397914.33
    std: 60798.92 (14.59%) std: 46150.42 (11.60%)
    max: 543057.07 max: 522669.17
    min: 304071.67 min: 316588.77
    ==Read ==Read
    records: 50 records: 50
    avg: 4147338.63 avg: 4070736.51
    std: 179333.25 (4.32%) std: 223499.89 (5.49%)
    max: 4459295.28 max: 4539514.44
    min: 3753057.53 min: 3444686.31
    ==Re-read ==Re-read
    records: 50 records: 50
    avg: 4096706.71 avg: 4117218.57
    std: 229735.04 (5.61%) std: 171676.25 (4.17%)
    max: 4430012.09 max: 4459263.94
    min: 2987217.80 min: 3666904.28
    ==Reverse Read ==Reverse Read
    records: 50 records: 50
    avg: 4062763.83 avg: 4078508.32
    std: 186208.46 (4.58%) std: 172684.34 (4.23%)
    max: 4401358.78 max: 4424757.22
    min: 3381625.00 min: 3679359.94
    ==Stride read ==Stride read
    records: 50 records: 50
    avg: 4094933.49 avg: 4082170.22
    std: 185710.52 (4.54%) std: 196346.68 (4.81%)
    max: 4478241.25 max: 4460060.97
    min: 3732593.23 min: 3584125.78
    ==Random read ==Random read
    records: 50 records: 50
    avg: 4031070.04 avg: 4074847.49
    std: 192065.51 (4.76%) std: 206911.33 (5.08%)
    max: 4356931.16 max: 4399442.56
    min: 3481619.62 min: 3548372.44
    ==Mixed workload ==Mixed workload
    records: 50 records: 50
    avg: 149925.73 avg: 149675.54
    std: 7701.26 (5.14%) std: 6902.09 (4.61%)
    max: 191301.56 max: 175162.05
    min: 133566.28 min: 137762.87
    ==Random write ==Random write
    records: 50 records: 50
    avg: 404050.11 avg: 393021.47
    std: 58887.57 (14.57%) std: 42813.70 (10.89%)
    max: 601798.09 max: 524533.43
    min: 325176.99 min: 313255.34
    ==Pwrite ==Pwrite
    records: 50 records: 50
    avg: 411217.70 avg: 411237.96
    std: 43114.99 (10.48%) std: 33136.29 (8.06%)
    max: 530766.79 max: 471899.76
    min: 320786.84 min: 317906.94
    ==Pread ==Pread
    records: 50 records: 50
    avg: 4154908.65 avg: 4087121.92
    std: 151272.08 (3.64%) std: 219505.04 (5.37%)
    max: 4459478.12 max: 4435857.38
    min: 3730512.41 min: 3101101.67

    Signed-off-by: Minchan Kim
    Cc: Nitin Gupta
    Tested-by: Sergey Senozhatsky
    Cc: Jerome Marchand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Commit a0c516cbfc74 ("zram: don't grab mutex in zram_slot_free_noity")
    introduced pending zram slot free in zram's write path to cover a slot
    free missed due to memory allocation failure in zram_slot_free_notify,
    but it is not necessary because we have already freed the slot right
    before overwriting.

    Signed-off-by: Minchan Kim
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Tested-by: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Sergey reported that we don't need to handle pending free requests on
    every I/O, so this patch removes that handling from the read path while
    keeping it in the write path.

    Consider the example below.

    The swap subsystem asks zram to free block "A" via
    swap_slot_free_notify, but zram pends the request without really
    freeing. The swap subsystem then allocates block "A" for new data; when
    the long-pended request is finally handled, zram blindly frees the new
    data in block "A". :(

    That's why we can't remove the pending-free handling right before a
    zram write.

    Signed-off-by: Minchan Kim
    Reported-by: Sergey Senozhatsky
    Tested-by: Sergey Senozhatsky
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Dan and Sergey reported a race between reset and the flushing of
    pending work: reset can oops by freeing zram->meta while zram_slot_free
    can still access zram->meta if a new request arrives during the race
    window.

    This patch moves the flush to after init_lock is taken, which prevents
    new requests and closes the race.
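
    A sketch of the ordering fix (the work-item name is illustrative):

      static void zram_reset_device(struct zram *zram)
      {
              down_write(&zram->init_lock);   /* block new requests first */
              flush_work(&zram->free_work);   /* drain already-queued frees */

              /* only now is it safe to tear down the table */
              zram_meta_free(zram->meta);
              zram->meta = NULL;
              up_write(&zram->init_lock);
      }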

    Signed-off-by: Minchan Kim
    Reported-by: Dan Carpenter
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Tested-by: Sergey Senozhatsky
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Add my copyright to the zram source code which I maintain.

    Signed-off-by: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Remove the old private compcache project address; upcoming patches
    should be sent to LKML, because the Linux kernel community will take
    care of them.

    Signed-off-by: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Zram has lived in staging for a LONG LONG time and has been
    fixed/improved by many contributors, so the code is clean and stable
    now. Of course, there are lots of products using zram in real practice.

    Major TV companies have used zram as swap for two years now, and
    recently our production team released an Android smartphone that uses
    zram as swap, too; Android KitKat has started to use zram for
    small-memory smartphones. There were also reports that Google shipped
    ChromeOS with zram, and CyanogenMod has used zram for a long time. I
    have heard that some distros use the zram block device for tmpfs, and I
    have seen reports from many other people; for example, Lubuntu has
    started to use it.

    The benefit of zram is very clear. In my experience, one of the
    benefits was removing jitter from a video application under background
    memory pressure. Part of that comes from efficient memory usage via
    compression, but the bigger issue is whether swap exists in the system
    at all. Recent mobile platforms use Java, so there are many anonymous
    pages, but embedded systems are normally reluctant to use eMMC or SD
    cards as swap because of wear-leveling and latency issues. If we do not
    use swap, we can't reclaim anonymous pages, and in the end we could
    encounter an OOM kill. :(

    Even having real storage as swap was a problem, too, because it
    sometimes ends up making the system very unresponsive due to slow swap
    storage performance.

    Quote from Luigi at Google:
    "Since Chrome OS was mentioned: the main reason why we don't use swap
    to a disk (rotating or SSD) is because it doesn't degrade gracefully
    and leads to a bad interactive experience. Generally we prefer to
    manage RAM at a higher level, by transparently killing and restarting
    processes. But we noticed that zram is fast enough to be competitive
    with the latter, and it lets us make more efficient use of the
    available RAM."
    http://www.spinics.net/lists/linux-mm/msg57717.html

    Another use case is zram as a plain block device. Zram is a block
    device, so anyone can format and mount it; some people on the internet
    use zram for /var/tmp.
    http://forums.gentoo.org/viewtopic-t-838198-start-0.html

    Let's promote zram out of staging and enhance/maintain it instead of
    removing it.

    Signed-off-by: Minchan Kim
    Reviewed-by: Konrad Rzeszutek Wilk
    Acked-by: Nitin Gupta
    Acked-by: Pekka Enberg
    Cc: Bob Liu
    Cc: Greg Kroah-Hartman
    Cc: Hugh Dickins
    Cc: Jens Axboe
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Seth Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Pull block IO driver changes from Jens Axboe:

    - bcache update from Kent Overstreet.

    - two bcache fixes from Nicholas Swenson.

    - cciss pci init error fix from Andrew.

    - underflow fix in the parallel IDE pg_write code from Dan Carpenter.
    I'm sure the 1 (or 0) users of that are now happy.

    - two PCI related fixes for sx8 from Jingoo Han.

    - floppy init fix for first block read from Jiri Kosina.

    - pktcdvd error return miss fix from Julia Lawall.

    - removal of IRQF_SHARED from the SEGA Dreamcast CD-ROM code from
    Michael Opdenacker.

    - comment typo fix for the loop driver from Olaf Hering.

    - potential oops fix for null_blk from Raghavendra K T.

    - two fixes from Sam Bradshaw (Micron) for the mtip32xx driver, fixing
    an OOM problem and a problem with handling security locked conditions.

    * 'for-3.14/drivers' of git://git.kernel.dk/linux-block: (47 commits)
    mg_disk: Spelling s/finised/finished/
    null_blk: Null pointer deference problem in alloc_page_buffers
    mtip32xx: Correctly handle security locked condition
    mtip32xx: Make SGL container per-command to eliminate high order dma allocation
    drivers/block/loop.c: fix comment typo in loop_config_discard
    drivers/block/cciss.c:cciss_init_one(): use proper errnos
    drivers/block/paride/pg.c: underflow bug in pg_write()
    drivers/block/sx8.c: remove unnecessary pci_set_drvdata()
    drivers/block/sx8.c: use module_pci_driver()
    floppy: bail out in open() if drive is not responding to block0 read
    bcache: Fix auxiliary search trees for key size > cacheline size
    bcache: Don't return -EINTR when insert finished
    bcache: Improve bucket_prio() calculation
    bcache: Add bch_bkey_equal_header()
    bcache: update bch_bkey_try_merge
    bcache: Move insert_fixup() to btree_keys_ops
    bcache: Convert sorting to btree_keys
    bcache: Convert debug code to btree_keys
    bcache: Convert btree_iter to struct btree_keys
    bcache: Refactor bset_tree sysfs stats
    ...

    Linus Torvalds
     
  • Pull core block IO changes from Jens Axboe:
    "The major piece in here is the immutable bio_ve series from Kent, the
    rest is fairly minor. It was supposed to go in last round, but
    various issues pushed it to this release instead. The pull request
    contains:

    - Various smaller blk-mq fixes from different folks. Nothing major
    here, just minor fixes and cleanups.

    - Fix for a memory leak in the error path in the block ioctl code
    from Christian Engelmayer.

    - Header export fix from CaiZhiyong.

    - Finally the immutable biovec changes from Kent Overstreet. This
    enables some nice future work on making arbitrarily sized bios
    possible, and splitting more efficient. Related fixes to immutable
    bio_vecs:

    - dm-cache immutable fixup from Mike Snitzer.
    - btrfs immutable fixup from Muthu Kumar.

    - bio-integrity fix from Nic Bellinger, which is also going to stable"

    * 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits)
    xtensa: fixup simdisk driver to work with immutable bio_vecs
    block/blk-mq-cpu.c: use hotcpu_notifier()
    blk-mq: for_each_* macro correctness
    block: Fix memory leak in rw_copy_check_uvector() handling
    bio-integrity: Fix bio_integrity_verify segment start bug
    block: remove unrelated header files and export symbol
    blk-mq: uses page->list incorrectly
    blk-mq: use __smp_call_function_single directly
    btrfs: fix missing increment of bi_remaining
    Revert "block: Warn and free bio if bi_end_io is not set"
    block: Warn and free bio if bi_end_io is not set
    blk-mq: fix initializing request's start time
    block: blk-mq: don't export blk_mq_free_queue()
    block: blk-mq: make blk_sync_queue support mq
    block: blk-mq: support draining mq queue
    dm cache: increment bi_remaining when bi_end_io is restored
    block: fixup for generic bio chaining
    block: Really silence spurious compiler warnings
    block: Silence spurious compiler warnings
    block: Kill bio_pair_split()
    ...

    Linus Torvalds
     

30 Jan, 2014

1 commit

  • We need to initialise the work_struct when we initialise the rest of the
    struct nvme_dev; otherwise we'll hit a lockdep warning when we remove
    the device. Use PREPARE_WORK to change the function pointer instead
    of INIT_WORK.
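
    A sketch of the rule, with illustrative handler names: INIT_WORK does
    the full one-time setup, including the lockdep key, while PREPARE_WORK
    (still present in 3.14) only swaps the function pointer:

      /* at device initialisation, exactly once */
      INIT_WORK(&dev->reset_work, nvme_reset_failed_dev);

      /* later, to reuse the same work item for a different task */
      PREPARE_WORK(&dev->reset_work, nvme_remove_disks);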

    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     

29 Jan, 2014

1 commit

  • Pull ceph updates from Sage Weil:
    "This is a big batch. From Ilya we have:

    - rbd support for more than ~250 mapped devices (now uses same scheme
    that SCSI does for device major/minor numbering)
    - crush updates for new mapping behaviors (will be needed for coming
    erasure coding support, among other things)
    - preliminary support for tiered storage pools

    There is also a big series fixing a pile of cephfs bugs with clustered
    MDSs from Yan Zheng, ACL support for cephfs from Guangliang Zhao, ceph
    fscache improvements from Li Wang, improved behavior when we get
    ENOSPC from Josh Durgin, some readv/writev improvements from
    Majianpeng, and the usual mix of small cleanups"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (76 commits)
    ceph: cast PAGE_SIZE to size_t in ceph_sync_write()
    ceph: fix dout() compile warnings in ceph_filemap_fault()
    libceph: support CEPH_FEATURE_OSD_CACHEPOOL feature
    libceph: follow redirect replies from osds
    libceph: rename ceph_osd_request::r_{oloc,oid} to r_base_{oloc,oid}
    libceph: follow {read,write}_tier fields on osd request submission
    libceph: add ceph_pg_pool_by_id()
    libceph: CEPH_OSD_FLAG_* enum update
    libceph: replace ceph_calc_ceph_pg() with ceph_oloc_oid_to_pg()
    libceph: introduce and start using oid abstraction
    libceph: rename MAX_OBJ_NAME_SIZE to CEPH_MAX_OID_NAME_LEN
    libceph: move ceph_file_layout helpers to ceph_fs.h
    libceph: start using oloc abstraction
    libceph: dout() is missing a newline
    libceph: add ceph_kv{malloc,free}() and switch to them
    libceph: support CEPH_FEATURE_EXPORT_PEER
    ceph: add imported caps when handling cap export message
    ceph: add open export target session helper
    ceph: remove exported caps when handling cap import message
    ceph: handle session flush message
    ...

    Linus Torvalds
     

28 Jan, 2014

5 commits

  • On larger systems with many drives, it may help debugging to know which
    queue is tied to which interrupt, just by looking at /proc/interrupts.

    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     
  • We need to shut down the device cleanly when the system is being shut down.
    This was in an earlier patch but was inadvertently lost during a rewrite.
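
    Wiring this up uses the standard pci_driver hook; a sketch, assuming an
    nvme_shutdown() handler as the patch title suggests:

      static struct pci_driver nvme_driver = {
              .name     = "nvme",
              .probe    = nvme_probe,
              .remove   = nvme_remove,
              .shutdown = nvme_shutdown,  /* called at system shutdown */
      };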

    Signed-off-by: Keith Busch
    Signed-off-by: Matthew Wilcox

    Keith Busch
     
  • Disable the admin queue if the device fails during initialization so the
    queue's irq is freed.

    Signed-off-by: Keith Busch
    [rewritten to use nvme_free_queues]
    Signed-off-by: Matthew Wilcox

    Keith Busch
     
  • Some users need more than 64 partitions per device. Rather than simply
    increasing the number of partitions, switch to the dynamic partition
    allocation scheme.

    This means that minor numbers are not stable across boots, but since major
    numbers aren't either, I cannot see this being a significant problem.
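
    The usual way to do this (and, I assume, what the patch amounts to) is
    to allocate the disk with no static minors and let the extended dev_t
    scheme number partitions dynamically:

      ns->disk = alloc_disk(0);               /* 0 preallocated minors */
      ns->disk->flags |= GENHD_FL_EXT_DEVT;   /* dynamic partition dev_ts */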

    Tested-by: Matias Bjørling
    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     
  • This attempts to delete all IO queues at the same time asynchronously on
    shutdown. This is necessary for a present device that is not responding;
    a shutdown operation previously would take 2 minutes per queue-pair
    to time out before moving on to the next queue, making a device removal
    appear to take a very long time or to be "hung", as reported by users.

    In the previous worst case, a removal may be stuck forever until a kill
    signal is given if there are more than 32 queue pairs since it would run
    out of admin command IDs after over an hour of timed out sync commands
    (admin queue depth is 64).

    This patch will wait for the admin command timeout for all commands to
    complete, so the worst case now for an unresponsive controller is 60
    seconds, though that still seems like a long time.

    Since this adds another way to take queues offline, some duplicate code
    resulted, so I moved it into more convenient functions.
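
    A sketch of the async pattern described above; the helper, counter and
    wait queue are hypothetical stand-ins for the real admin-command
    plumbing:

      static void nvme_delete_io_queues(struct nvme_dev *dev)
      {
              int i;

              atomic_set(&dev->pending_deletes, dev->queue_count - 1);
              for (i = 1; i < dev->queue_count; i++)
                      nvme_submit_delete_async(dev, i);   /* hypothetical */

              /* one shared admin timeout bounds the whole teardown,
               * instead of two minutes per queue pair */
              wait_event_timeout(dev->delete_wq,
                                 atomic_read(&dev->pending_deletes) == 0,
                                 ADMIN_TIMEOUT);
      }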

    Signed-off-by: Keith Busch
    [make functions static, correct line length and whitespace issues]
    Signed-off-by: Matthew Wilcox

    Keith Busch