24 Nov, 2015
3 commits
-
We have seen many reports of this kind of issue, so add a
warning in blk-merge so that it can be triggered easily,
without depending on warnings/bugs from drivers.

Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe
-
Commit bdced438acd83a (block: setup bi_phys_segments after
splitting) introduced the computation of bio->bi_phys_segments
during bio splitting.

Unfortunately, neither bio->bi_seg_front_size nor bio->bi_seg_back_size
is computed, so too many physical segments may be obtained
for one request, since both fields are used to check whether a segment
spanning two bios is possible.

This patch fixes the issue by computing the two variables in
blk_bio_segment_split().

Fixes: bdced438acd83a(block: setup bi_phys_segments after splitting)
Reported-by: Michael Ellerman
Reported-by: Mark Salter
Tested-by: Laurent Dufour
Tested-by: Mark Salter
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe
-
Inside blk_bio_segment_split(), the previous bvec pointer (bvprvp)
always points to the iterator local variable, which is obviously
wrong, so fix it by pointing to the local variable 'bvprv'.

Fixes: 5014c311baa2b(block: fix bogus compiler warnings in blk-merge.c)
Cc: stable@kernel.org #4.3
Reported-by: Michael Ellerman
Reported-by: Mark Salter
Tested-by: Laurent Dufour
Tested-by: Mark Salter
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe
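The aliasing problem described in the entry above can be sketched in plain userspace C. This is only an illustration of the bug class, not the kernel code: struct vec, contiguous() and both count_gaps_*() helpers are hypothetical names invented for the example.

```c
#include <stddef.h>

/* Hypothetical stand-in for a bio_vec; not the kernel definition. */
struct vec {
    int start;
    int len;
};

/* Two vectors are contiguous if the previous one ends where the next begins. */
static int contiguous(const struct vec *prev, const struct vec *cur)
{
    return prev->start + prev->len == cur->start;
}

/* Buggy: prevp aliases the iterator copy, so "previous" is always "current". */
static int count_gaps_buggy(const struct vec *v, size_t n)
{
    struct vec bv, *prevp = NULL;
    int gaps = 0;

    for (size_t i = 0; i < n; i++) {
        bv = v[i];
        if (prevp && !contiguous(prevp, &bv))
            gaps++;
        prevp = &bv; /* points at the iterator variable itself */
    }
    return gaps;
}

/* Fixed: keep a separate copy of the previous element and point at that. */
static int count_gaps_fixed(const struct vec *v, size_t n)
{
    struct vec bv, bvprv, *prevp = NULL;
    int gaps = 0;

    for (size_t i = 0; i < n; i++) {
        bv = v[i];
        if (prevp && !contiguous(prevp, &bv))
            gaps++;
        bvprv = bv;
        prevp = &bvprv;
    }
    return gaps;
}
```

For the input {0,4}, {4,4}, {10,2}, the buggy variant compares each element with itself and reports two gaps, while the fixed variant reports the single real gap before the third element.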
21 Nov, 2015
1 commit
-
Liu reported that running certain parts of xfstests threw the
following error:

BUG: sleeping function called from invalid context at mm/page_alloc.c:3190
in_atomic(): 1, irqs_disabled(): 0, pid: 6, name: kworker/u16:0
3 locks held by kworker/u16:0/6:
#0: ("writeback"){++++.+}, at: [] process_one_work+0x173/0x730
#1: ((&(&wb->dwork)->work)){+.+.+.}, at: [] process_one_work+0x173/0x730
#2: (&type->s_umount_key#44){+++++.}, at: [] trylock_super+0x25/0x60
CPU: 5 PID: 6 Comm: kworker/u16:0 Tainted: G OE 4.3.0+ #3
Hardware name: Red Hat KVM, BIOS Bochs 01/01/2011
Workqueue: writeback wb_workfn (flush-btrfs-108)
ffffffff81a3abab ffff88042e282ba8 ffffffff8130191b ffffffff81a3abab
0000000000000c76 ffff88042e282ba8 ffff88042e27c180 ffff88042e282bd8
ffffffff8108ed95 ffff880400000004 0000000000000000 0000000000000c76
Call Trace:
[] dump_stack+0x4f/0x74
[] ___might_sleep+0x185/0x240
[] __might_sleep+0x52/0x90
[] __alloc_pages_nodemask+0x268/0x410
[] ? sched_clock_local+0x1c/0x90
[] ? local_clock+0x21/0x40
[] ? __lock_release+0x420/0x510
[] ? __lock_acquired+0x16c/0x3c0
[] alloc_pages_current+0xc5/0x210
[] ? rbio_is_full+0x55/0x70 [btrfs]
[] ? mark_held_locks+0x78/0xa0
[] ? _raw_spin_unlock_irqrestore+0x40/0x60
[] full_stripe_write+0x5a/0xc0 [btrfs]
[] __raid56_parity_write+0x39/0x60 [btrfs]
[] run_plug+0x11b/0x140 [btrfs]
[] btrfs_raid_unplug+0x23/0x70 [btrfs]
[] blk_flush_plug_list+0x82/0x1f0
[] blk_sq_make_request+0x1f9/0x740
[] ? generic_make_request_checks+0x222/0x7c0
[] ? blk_queue_enter+0x124/0x310
[] ? blk_queue_enter+0x92/0x310
[] generic_make_request+0x172/0x2c0
[] ? generic_make_request+0x164/0x2c0
[] submit_bio+0x70/0x140
[] ? rbio_add_io_page+0x99/0x150 [btrfs]
[] finish_rmw+0x4d9/0x600 [btrfs]
[] full_stripe_write+0x9c/0xc0 [btrfs]
[] raid56_parity_write+0xef/0x160 [btrfs]
[] btrfs_map_bio+0xe3/0x2d0 [btrfs]
[] btrfs_submit_bio_hook+0x8d/0x1d0 [btrfs]
[] submit_one_bio+0x74/0xb0 [btrfs]
[] submit_extent_page+0xe5/0x1c0 [btrfs]
[] __extent_writepage_io+0x408/0x4c0 [btrfs]
[] ? alloc_dummy_extent_buffer+0x140/0x140 [btrfs]
[] __extent_writepage+0x218/0x3a0 [btrfs]
[] ? mark_held_locks+0x78/0xa0
[] extent_write_cache_pages.clone.0+0x2f9/0x400 [btrfs]
[] extent_writepages+0x52/0x70 [btrfs]
[] ? btrfs_set_inode_index+0x70/0x70 [btrfs]
[] btrfs_writepages+0x27/0x30 [btrfs]
[] do_writepages+0x23/0x40
[] __writeback_single_inode+0x89/0x4d0
[] ? writeback_sb_inodes+0x260/0x480
[] ? writeback_sb_inodes+0x260/0x480
[] ? writeback_sb_inodes+0x15f/0x480
[] writeback_sb_inodes+0x2d2/0x480
[] ? down_read_trylock+0x57/0x60
[] ? trylock_super+0x25/0x60
[] ? rcu_read_lock_sched_held+0x4f/0x90
[] __writeback_inodes_wb+0x8c/0xc0
[] wb_writeback+0x2b5/0x500
[] ? mark_held_locks+0x78/0xa0
[] ? __local_bh_enable_ip+0x68/0xc0
[] ? wb_do_writeback+0x62/0x310
[] wb_do_writeback+0xc1/0x310
[] ? set_worker_desc+0x79/0x90
[] wb_workfn+0x92/0x330
[] process_one_work+0x223/0x730
[] ? process_one_work+0x173/0x730
[] ? worker_thread+0x18f/0x430
[] worker_thread+0x11d/0x430
[] ? maybe_create_worker+0xf0/0xf0
[] ? maybe_create_worker+0xf0/0xf0
[] kthread+0xef/0x110
[] ? schedule_tail+0x1e/0xd0
[] ? __init_kthread_worker+0x70/0x70
[] ret_from_fork+0x3f/0x70
[] ? __init_kthread_worker+0x70/0x70

The issue is that we've got the software context pinned while
calling blk_flush_plug_list(), which flushes callbacks that
are allowed to sleep. btrfs and raid have such callbacks.

Flip the checks around a bit, so we can enable preempt a bit
earlier and flush plugs without having preempt disabled.

This only affects blk-mq driven devices, and only those that
register a single queue.

Reported-by: Liu Bo
Tested-by: Liu Bo
Cc: stable@kernel.org
Signed-off-by: Jens Axboe
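The ordering problem in the entry above can be modelled in userspace C. This is a toy model under stated assumptions, not the blk-mq code: preempt_count, get_ctx(), put_ctx(), flush_plug() and the submit_*() functions are all hypothetical stand-ins. It only shows why releasing the pinned context before flushing avoids calling a sleeping callback in atomic context.

```c
#include <stdbool.h>

static int preempt_count;    /* > 0 models "preemption disabled" */
static int sleep_violations; /* times a sleeping callback ran while pinned */

/* Models pinning and releasing the per-CPU software context. */
static void get_ctx(void) { preempt_count++; }
static void put_ctx(void) { preempt_count--; }

/* Models an unplug callback that may allocate memory and therefore sleep. */
static void sleeping_unplug_cb(void)
{
    if (preempt_count > 0)
        sleep_violations++;
}

static void flush_plug(void) { sleeping_unplug_cb(); }

/* Buggy order: the plug is flushed while the context is still pinned. */
static void submit_buggy(bool plug_full)
{
    get_ctx();
    if (plug_full)
        flush_plug();
    put_ctx();
}

/* Fixed order: release the context first, then flush the plug. */
static void submit_fixed(bool plug_full)
{
    get_ctx();
    /* ... per-context work would happen here ... */
    put_ctx();
    if (plug_full)
        flush_plug();
}
```

Calling submit_buggy(true) records one violation in this model; submit_fixed(true) records none.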
20 Nov, 2015
12 commits
-
If md->signature == MAC_DRIVER_MAGIC and md->block_size == 1023, a single
512-byte sector would be read (secsize / 512). However, the partition
structure would be located past the end of the buffer (secsize % 512).

Signed-off-by: Kees Cook
Cc: stable@vger.kernel.org
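The arithmetic in this entry can be made concrete with a small userspace sketch. It is an assumption-laden illustration rather than the actual patch: partition_entry_fits() is a hypothetical helper, and entry_size is left as a parameter because the size of the partition structure is not given here.

```c
#include <stdbool.h>
#include <stddef.h>

#define SECTOR_SIZE 512u

/*
 * Returns true if an entry of entry_size bytes, placed at offset
 * (secsize % 512), lies entirely inside the (secsize / 512) whole
 * sectors that were read.
 */
static bool partition_entry_fits(unsigned int secsize, size_t entry_size)
{
    size_t datasize = (size_t)(secsize / SECTOR_SIZE) * SECTOR_SIZE;
    size_t partoffset = secsize % SECTOR_SIZE;

    if (datasize == 0)
        return false; /* nothing was read at all */
    return partoffset + entry_size <= datasize;
}
```

With secsize == 1023 one sector (512 bytes) is read and the offset is 511, so any entry larger than one byte runs past the buffer; with secsize == 512 the offset is 0 and a 512-byte entry fits exactly.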
Signed-off-by: Jens Axboe
-
kthread_create_on_node takes format+args, so there's no need to do the
pretty-printing in advance. Moreover, "mtip_svc_thd_99" (including its
'\0') only just fits in 16 bytes, so if index could ever go above 99
we'd have a stack buffer overflow.

Signed-off-by: Rasmus Villemoes
Reviewed-by: Jeff Moyer
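The 16-byte limit in this entry is easy to check in userspace. The sketch below assumes a "%02d" index format, which is a guess for illustration; format_thread_name() and THD_NAME_LEN are hypothetical names. It uses snprintf() so that the overflow shows up as a detectable truncation instead of a stack corruption.

```c
#include <stdbool.h>
#include <stdio.h>

#define THD_NAME_LEN 16 /* size of the on-stack name buffer in this sketch */

/*
 * Formats the thread name into buf and reports whether it fit.
 * snprintf() returns the length the full string would have had,
 * so a return value >= THD_NAME_LEN means the name was truncated.
 */
static bool format_thread_name(char buf[THD_NAME_LEN], int index)
{
    int n = snprintf(buf, THD_NAME_LEN, "mtip_svc_thd_%02d", index);

    return n >= 0 && n < THD_NAME_LEN;
}
```

Index 99 yields 15 characters plus the terminating '\0', exactly 16 bytes; index 100 needs 17 bytes and is reported as not fitting. An unbounded sprintf() into the same buffer would write past its end in that case.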
Signed-off-by: Jens Axboe
-
Make sure that there are no unprocessed entries on a completion
queue before deleting it, and check for validity of the CQ
doorbell before writing completions to it.

This fixes problems with doing a sysfs reset of the device while
it's handling IO.

Tested-by: Jon Derrick
Signed-off-by: Jens Axboe
-
Add free block, used block, and bad block information to the show debug
interface. This information is used to debug how targets track blocks.

Also, change the debug function name to make it more generic.
Signed-off-by: Javier Gonzalez
Signed-off-by: Matias Bjørling
Signed-off-by: Jens Axboe
-
Maintain the number of in-use blocks, free blocks, and bad blocks on a
per-lun basis. This allows the upper layers to get information about the
state of each lun.

Also, account for blocks reserved to the device in the free block count.
nr_free_blocks now matches the actual number of blocks on the free list
when the device is booted.

Signed-off-by: Javier Gonzalez
Signed-off-by: Matias Bjørling
Signed-off-by: Jens Axboe
-
According to the Open-Channel SSD Specification, the NVMe-NVM admin
commands use vendor-specific opcodes of NVMe, so use the NVMe admin
queue to dispatch these commands.

Signed-off-by: Wenwei Tao
Updated by me to include set bad block table as well and also use
the admin queue for l2p len calculation.
Signed-off-by: Matias Bjørling
Signed-off-by: Jens Axboe
-
If max_phys_sect is out of bounds, the nvm_dev structure is not
freed.

Signed-off-by: Matias Bjørling
Signed-off-by: Jens Axboe
-
The return value should be non-zero under error conditions.
Remove nvme_free(dev) to avoid freeing dev more than once.

Signed-off-by: Wenwei Tao
Signed-off-by: Matias Bjørling
Signed-off-by: Jens Axboe
-
The gendisk structure has not been initialized when using lightnvm.
Make sure to not delete it upon exit. Also make sure that we use the
appropriate disk_name at unregistration.

Signed-off-by: Matias Bjørling
Signed-off-by: Jens Axboe
-
The linear addressing mode was removed in 7386af2. Make null_blk instead
expose the ppa format geometry and support the generic addressing mode.

Signed-off-by: Matias Bjørling
Signed-off-by: Jens Axboe
-
Instead of using a page pool, we can save memory by only allocating room
for 64 entries for the ppa command. Introduce a ppa_cache to allocate only
the required memory for the ppa list.

Signed-off-by: Matias Bjørling
Signed-off-by: Jens Axboe
-
Reported-by: Paul Grabinar
Signed-off-by: Keith Busch
Signed-off-by: Jens Axboe
17 Nov, 2015
16 commits
-
Currently blk_insert_flush() just adds a flush request to q->queue_head
when a flush is not required. That completely bypasses the IO scheduler, so
e.g. CFQ can be idling waiting for a new request to arrive and will idle
through the whole window unnecessarily. Luckily this only happens in
rare cases, as usually the checks in generic_make_request_checks() clear
the FLUSH and FUA flags early if they are not needed.

When no flushing is actually required, we can easily fix the problem by
properly queueing the request through the IO scheduler. Ideally the IO
scheduler should also be made aware of requests queued via
blk_flush_queue_rq(). However, inserting a flush request through the IO
scheduler can have unwanted side effects: because of flush batching,
delaying the flush request in the IO scheduler will delay all flush requests,
possibly coming from other processes. So we keep adding the request
directly to q->queue_head.

Signed-off-by: Jan Kara
Reviewed-by: Jeff Moyer
Signed-off-by: Jens Axboe
-
Add support for registering as a LightNVM device. This allows us to
evaluate the performance of the LightNVM subsystem.

In /drivers/Makefile, LightNVM is moved above block device drivers
to make sure that the LightNVM media managers have been initialized
before drivers under /drivers/block are initialized.

Signed-off-by: Matias Bjørling
Fix by Jens Axboe to remove unneeded slab cache and the following
memory leak.
Signed-off-by: Jens Axboe
-
To make the intention clearer, use list_{first,prev,next}_entry
instead of list_entry.

Signed-off-by: Geliang Tang
Signed-off-by: Jens Axboe
-
This prevents outstanding IOs from being sent for completion to the target
after the target has been removed. The flow is now: stop new IOs > cleanup
queue > remove target.

Signed-off-by: Javier Gonzalez
Signed-off-by: Matias Bjørling
Signed-off-by: Jens Axboe
-
The specification was updated to remove the double word just after the
number of configuration groups and capabilities. Update the identify
structure to reflect it.

Signed-off-by: Matias Bjørling
Signed-off-by: Jens Axboe
-
The ppa format was not copied from the NVMe specific ppa format to the
lightnvm specific ppa format. This led to the ppa format not being
communicated to the layers above.

Signed-off-by: Matias Bjørling
Signed-off-by: Jens Axboe
-
The linear and device specific address modes can be replaced with a
simple offset and bit length conversion that is generic across all
devices.

This both simplifies the specification and removes the special case for
qemu nvme, which previously relied on the linear address mapping.

Signed-off-by: Matias Bjørling
Signed-off-by: Jens Axboe
-
Both nvm_register and nvm_init do a kfree(dev) on error. Make sure
to only free it once.

Signed-off-by: Matias Bjørling
Signed-off-by: Jens Axboe
-
We register with nvm_devices when registration can still fail.
Move the final registration to the end of the nvm_register function
to make sure we are fully registered when added to the nvm_devices list.

Signed-off-by: Matias Bjørling
Signed-off-by: Jens Axboe
-
Only NAND flash with SLC and MLC is supported. Make sure to not try to
initialize TLC memory or other non-volatile memory types.

Signed-off-by: Matias Bjørling
Signed-off-by: Jens Axboe
-
The nvm_id, nvm_id_group and nvm_addr_format data structures contain
reserved attributes. They are unused by media managers and targets.
Remove them.

Signed-off-by: Matias Bjørling
Signed-off-by: Jens Axboe
-
The mccap field is required for I/O command option support. It defines the
following flash access modes:

* SLC mode
* Erase/Program Suspension
* Scramble On/Off
* Encryption

It is slotted in between mpos and cpar, changing the offset for
cpar as well.

Signed-off-by: Matias Bjørling
Signed-off-by: Jens Axboe
-
An 8-bit and a 16-bit reserved field were inserted in the
specification to align fields appropriately. Reflect this in the
identify group structure.

Signed-off-by: Matias Bjørling
Signed-off-by: Jens Axboe
-
The specification was changed to reflect a multi-value bad block table.
Instead of a bit-based bad block table, the bad block table now allows
eight bad block categories. Currently four are defined:

* Factory bad blocks
* Grown bad blocks
* Device-side reserved blocks
* Host-side reserved blocks

The factory and grown bad blocks are the regular bad blocks. The
reserved blocks are either for internal use or external use. In
particular, the device-side reserved blocks allow the host to
bootstrap from a limited number of flash blocks, reducing the number of
flash blocks to scan upon super block initialization.

Support for both get bad block table and set bad block table is added.
Signed-off-by: Matias Bjørling
Signed-off-by: Jens Axboe
-
The max_phys_sect variable is defined as a char. We do a boundary check
to allow at most 256 physical page descriptors per command. As we are
not indexing from zero, this expression is always false. Bump
max_phys_sect to an unsigned int to support the range check.

Signed-off-by: Matias Bjørling
Reported-by: Geert Uytterhoeven
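Why the range check in this entry can never fire is easy to see in a userspace sketch. The function names below are hypothetical, the field is modelled as an unsigned char, and an 8-bit char is assumed; the kernel code itself is not reproduced here.

```c
#include <stdbool.h>

/*
 * Field stored in 8 bits: the reported value is truncated to 0..255
 * before the check, so "> 256" can never be true.
 */
static bool check_with_char_field(unsigned int reported)
{
    unsigned char max_phys_sect = (unsigned char)reported;
    unsigned int v = max_phys_sect;

    return v > 256;
}

/* Field widened to unsigned int: the range check works as intended. */
static bool check_with_uint_field(unsigned int reported)
{
    unsigned int max_phys_sect = reported;

    return max_phys_sect > 256;
}
```

A reported value of 300 passes unnoticed through the 8-bit variant (it is stored as 44), while the unsigned int variant rejects it and still accepts 256.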
Signed-off-by: Jens Axboe
-
Signed-off-by: Matias Bjørling
Signed-off-by: Jens Axboe
16 Nov, 2015
7 commits
-
Pull perf updates from Thomas Gleixner:
"Mostly updates to the perf tool plus two fixes to the kernel core code:

 - Handle tracepoint filters correctly for inherited events (Peter
   Zijlstra)
 - Prevent a deadlock in perf_lock_task_context (Paul McKenney)
 - Add missing newlines to some pr_err() calls (Arnaldo Carvalho de
   Melo)
 - Print full source file paths when using 'perf annotate --print-line
   --full-paths' (Michael Petlan)
 - Fix 'perf probe -d' when just one out of uprobes and kprobes is
   enabled (Wang Nan)
 - Add compiler.h to list.h to fix 'make perf-tar-src-pkg' generated
   tarballs, i.e. out of tree building (Arnaldo Carvalho de Melo)
 - Add the llvm-src-base.c and llvm-src-kbuild.c files, generated by
   the 'perf test' LLVM entries, when running it in-tree, to
   .gitignore (Yunlong Song)
 - libbpf error reporting improvements, using a strerror interface to
   more precisely tell the user about problems with the provided
   scriptlet, be it in C or as a ready made object file (Wang Nan)
 - Do not be case sensitive when searching for matching 'perf test'
   entries (Arnaldo Carvalho de Melo)
 - Inform the user about objdump failures in 'perf annotate' (Andi
   Kleen)
 - Improve the LLVM 'perf test' entry, introduce new ones for BPF
   and kbuild tests to check the environment used by clang to compile
   .c scriptlets (Wang Nan)"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
perf/x86/intel/rapl: Remove the unused RAPL_EVENT_DESC() macro
tools include: Add compiler.h to list.h
perf probe: Verify parameters in two functions
perf session: Add missing newlines to some pr_err() calls
perf annotate: Support full source file paths for srcline fix
perf test: Add llvm-src-base.c and llvm-src-kbuild.c to .gitignore
perf: Fix inherited events vs. tracepoint filters
perf: Disable IRQs across RCU RS CS that acquires scheduler lock
perf test: Do not be case sensitive when searching for matching tests
perf test: Add 'perf test BPF'
perf test: Enhance the LLVM tests: add kbuild test
perf test: Enhance the LLVM test: update basic BPF test program
perf bpf: Improve BPF related error messages
perf tools: Make fetch_kernel_version() publicly available
bpf tools: Add new API bpf_object__get_kversion()
bpf tools: Improve libbpf error reporting
perf probe: Cleanup find_perf_probe_point_from_map to reduce redundancy
perf annotate: Inform the user about objdump failures in --stdio
perf stat: Make stat options global
perf sched latency: Fix thread pid reuse issue
...
-
Pull scheduler fix from Thomas Gleixner:
"A single fix to prevent math underflow in the numa balancing code"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/numa: Fix math underflow in task_tick_numa()
-
Pull liblockdep fixes from Thomas Gleixner:
"Three small patches to synchronize liblockdep with the latest core
changes"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tools/liblockdep: explicitly declare lockdep API we call from liblockdep
tools/liblockdep: add userspace versions of WRITE_ONCE and RCU_INIT_POINTER
tools/liblockdep: remove task argument from debug_check_no_locks_held
-
Pull x86 fixes from Thomas Gleixner:
"A couple of fixes and updates related to x86:

 - Fix the W+X check regression on XEN
 - The real fix for the low identity map trainwreck
 - Probe legacy PIC early instead of unconditionally allocating legacy
   irqs
 - Add cpu verification to long mode entry
 - Adjust the cache topology to AMD Fam17H systems
 - Let Merrifield use the TSC across S3"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Call verify_cpu() after having entered long mode too
x86/setup: Fix low identity map for >= 2GB kernel range
x86/mm: Skip the hypervisor range when walking PGD
x86/AMD: Fix last level cache topology for AMD Fam17h systems
x86/irq: Probe for PIC presence before allocating descs for legacy IRQs
x86/cpu/intel: Enable X86_FEATURE_NONSTOP_TSC_S3 for Merrifield
-
….kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq and timer fixes from Thomas Gleixner:

 - An irq regression fix to restore the wakeup behaviour of chained
   interrupts.
 - A timer fix for a long standing race versus timers scheduled on a
   target cpu which got exposed by recent changes in the workqueue
   implementation.

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/PM: Restore system wake up from chained interrupts

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers: Use proper base migration in add_timer_on()
-
Pull MIPS updates from Ralf Baechle:
"These are the highlights of the main MIPS pull request for 4.4:

 - Add latencytop support
 - Support appended DTBs
 - VDSO support and initially use it for gettimeofday.
 - Drop the .MIPS.abiflags and ELF NOTE sections from vmlinux
 - Support for the 5KE, an internal test core.
 - Switch all MIPS platforms to libata drivers.
 - Improved support, cleanups for ralink and Lantiq platforms.
 - Support for the new xilfpga platform.
 - A number of DTB improvements for BMIPS.
 - Improved support for CM and CPS.
 - Minor JZ4740 and BCM47xx enhancements"

* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: (120 commits)
MIPS: idle: add case for CPU_5KE
MIPS: Octeon: Support APPENDED_DTB
MIPS: vmlinux: create a section for appended DTB
MIPS: Clean up compat_siginfo_t
MIPS: Fix PAGE_MASK definition
MIPS: BMIPS: Enable GZIP ramdisk and timed printks
MIPS: Add xilfpga defconfig
MIPS: xilfpga: Add mipsfpga platform code
MIPS: xilfpga: Add xilfpga device tree files.
dt-bindings: MIPS: Document xilfpga bindings and boot style
MIPS: Make MIPS_CMDLINE_DTB default
MIPS: Make the kernel arguments from dtb available
MIPS: Use USE_OF as the guard for appended dtb
MIPS: BCM63XX: Use pr_* instead of printk
MIPS: Loongson: Cleanup CONFIG_LOONGSON_SUSPEND.
MIPS: lantiq: Disable xbar fpi burst mode
MIPS: lantiq: Force the crossbar to big endian
MIPS: lantiq: Initialize the USB core on boot
MIPS: lantiq: Return correct value for fpi clock on ar9
MIPS: ralink: Add missing clock on rt305x
...
15 Nov, 2015
1 commit
-
Pull sound fixes from Takashi Iwai:
"Here is a collection of small fixes that have been gathered for
4.4-rc1. The only significant changes are those in PCI drivers
Kconfig, to use "depends on" instead of "select" for CONFIG_ZONE_DMA.
A reverse select is often more user-friendly, but in this case, it
makes it hard to manage the conflict with ZONE_DEVICE, so it was changed
in such a way for now.

Others are all small fixes and quirks: an error check in soundcore
register_chrdev(), HD-audio HDMI/DP phantom jack fix, Intel Broxton DP
quirk, USB-audio DSD device quirk, some constifications, etc"

* tag 'sound-fix-4.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: pci: depend on ZONE_DMA
ALSA: hda - Simplify phantom jack handling for HDMI/DP
ALSA: hda/hdmi - apply Skylake fix-ups to Broxton display codec
ALSA: ctxfi: constify rsc ops structures
ALSA: usb: Add native DSD support for Aune X1S
ALSA: oxfw: add an comment to Kconfig for TASCAM FireOne
sound: fix check for error condition of register_chrdev()