14 Oct, 2020
3 commits
-
The out label is only used in one place and return ret directly without
something like resource cleanup or lock release and so on. So we should
remove this jump label and do some cleanup.Signed-off-by: Miaohe Lin
Signed-off-by: Andrew Morton
Reviewed-by: Andrew Morton
Link: https://lkml.kernel.org/r/20200927124032.22521-1-linmiaohe@huawei.com
Signed-off-by: Linus Torvalds -
SWP_FS is used to make swap_{read,write}page() go through the filesystem,
and it's only used for swap files over NFS for now. Otherwise it will
directly submit IO to blockdev according to swapfile extents reported by
filesystems in advance.As Matthew pointed out [1], SWP_FS naming is somewhat confusing, so let's
rename to SWP_FS_OPS.[1] https://lore.kernel.org/r/20200820113448.GM17456@casper.infradead.org
Suggested-by: Matthew Wilcox
Signed-off-by: Gao Xiang
Signed-off-by: Andrew Morton
Link: https://lkml.kernel.org/r/20200822113019.11319-1-hsiangkao@redhat.com
Signed-off-by: Linus Torvalds -
Pull block updates from Jens Axboe:
- Series of merge handling cleanups (Baolin, Christoph)
- Series of blk-throttle fixes and cleanups (Baolin)
- Series cleaning up BDI, seperating the block device from the
backing_dev_info (Christoph)- Removal of bdget() as a generic API (Christoph)
- Removal of blkdev_get() as a generic API (Christoph)
- Cleanup of is-partition checks (Christoph)
- Series reworking disk revalidation (Christoph)
- Series cleaning up bio flags (Christoph)
- bio crypt fixes (Eric)
- IO stats inflight tweak (Gabriel)
- blk-mq tags fixes (Hannes)
- Buffer invalidation fixes (Jan)
- Allow soft limits for zone append (Johannes)
- Shared tag set improvements (John, Kashyap)
- Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel)
- DM no-wait support (Mike, Konstantin)
- Request allocation improvements (Ming)
- Allow md/dm/bcache to use IO stat helpers (Song)
- Series improving blk-iocost (Tejun)
- Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang,
Xianting, Yang, Yufen, yangerkun)* tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits)
block: fix uapi blkzoned.h comments
blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue
blk-mq: get rid of the dead flush handle code path
block: get rid of unnecessary local variable
block: fix comment and add lockdep assert
blk-mq: use helper function to test hw stopped
block: use helper function to test queue register
block: remove redundant mq check
block: invoke blk_mq_exit_sched no matter whether have .exit_sched
percpu_ref: don't refer to ref->data if it isn't allocated
block: ratelimit handle_bad_sector() message
blk-throttle: Re-use the throtl_set_slice_end()
blk-throttle: Open code __throtl_de/enqueue_tg()
blk-throttle: Move service tree validation out of the throtl_rb_first()
blk-throttle: Move the list operation after list validation
blk-throttle: Fix IO hang for a corner case
blk-throttle: Avoid tracking latency if low limit is invalid
blk-throttle: Avoid getting the current time if tg->last_finish_time is 0
blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()
block: Remove redundant 'return' statement
...
25 Sep, 2020
1 commit
-
There is no point in trying to call bdev_read_page if SWP_SYNCHRONOUS_IO
is not set, as the device won't support it.Signed-off-by: Christoph Hellwig
Reviewed-by: Jan Kara
Reviewed-by: Johannes Thumshirn
Signed-off-by: Jens Axboe
04 Sep, 2020
1 commit
-
Arm's Memory Tagging Extension (MTE) adds some metadata (tags) to
every physical page, when swapping pages out to disk it is necessary to
save these tags, and later restore them when reading the pages back.Add some hooks along with dummy implementations to enable the
arch code to handle this.Three new hooks are added to the swap code:
* arch_prepare_to_swap() and
* arch_swap_invalidate_page() / arch_swap_invalidate_area().
One new hook is added to shmem:
* arch_swap_restore()Signed-off-by: Steven Price
[catalin.marinas@arm.com: add unlock_page() on the error path]
[catalin.marinas@arm.com: dropped the _tags suffix]
Signed-off-by: Catalin Marinas
Acked-by: Andrew Morton
15 Aug, 2020
3 commits
-
struct swap_info_struct si.flags could be accessed concurrently as noticed
by KCSAN,BUG: KCSAN: data-race in scan_swap_map_slots / swap_readpage
write to 0xffff9c77b80ac400 of 8 bytes by task 91325 on cpu 16:
scan_swap_map_slots+0x6fe/0xb50
scan_swap_map_slots at mm/swapfile.c:887
get_swap_pages+0x39d/0x5c0
get_swap_page+0x377/0x524
add_to_swap+0xe4/0x1c0
shrink_page_list+0x1740/0x2820
shrink_inactive_list+0x316/0x8b0
shrink_lruvec+0x8dc/0x1380
shrink_node+0x317/0xd80
do_try_to_free_pages+0x1f7/0xa10
try_to_free_pages+0x26c/0x5e0
__alloc_pages_slowpath+0x458/0x1290
__alloc_pages_nodemask+0x3bb/0x450
alloc_pages_vma+0x8a/0x2c0
do_anonymous_page+0x170/0x700
__handle_mm_fault+0xc9f/0xd00
handle_mm_fault+0xfc/0x2f0
do_page_fault+0x263/0x6f9
page_fault+0x34/0x40read to 0xffff9c77b80ac400 of 8 bytes by task 5422 on cpu 7:
swap_readpage+0x204/0x6a0
swap_readpage at mm/page_io.c:380
read_swap_cache_async+0xa2/0xb0
swapin_readahead+0x6a0/0x890
do_swap_page+0x465/0xeb0
__handle_mm_fault+0xc7a/0xd00
handle_mm_fault+0xfc/0x2f0
do_page_fault+0x263/0x6f9
page_fault+0x34/0x40Reported by Kernel Concurrency Sanitizer on:
CPU: 7 PID: 5422 Comm: gmain Tainted: G W O L 5.5.0-next-20200204+ #6
Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019Other reads,
read to 0xffff91ea33eac400 of 8 bytes by task 11276 on cpu 120:
__swap_writepage+0x140/0xc20
__swap_writepage at mm/page_io.c:289read to 0xffff91ea33eac400 of 8 bytes by task 11264 on cpu 16:
swap_set_page_dirty+0x44/0x1f4
swap_set_page_dirty at mm/page_io.c:442The write is under &si->lock, but the reads are done as lockless. Since
the reads only check for a specific bit in the flag, it is harmless even
if load tearing happens. Thus, just mark them as intentional data races
using the data_race() macro.[cai@lca.pw: add a missing annotation]
Link: http://lkml.kernel.org/r/1581612585-5812-1-git-send-email-cai@lca.pwSigned-off-by: Qian Cai
Signed-off-by: Andrew Morton
Cc: Marco Elver
Link: http://lkml.kernel.org/r/20200207003601.1526-1-cai@lca.pw
Signed-off-by: Linus Torvalds -
The thp prefix is more frequently used than hpage and we should be
consistent between the various functions.[akpm@linux-foundation.org: fix mm/migrate.c]
Signed-off-by: Matthew Wilcox (Oracle)
Signed-off-by: Andrew Morton
Reviewed-by: William Kucharski
Reviewed-by: Zi Yan
Cc: Mike Kravetz
Cc: David Hildenbrand
Cc: "Kirill A. Shutemov"
Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
Signed-off-by: Linus Torvalds -
This function returns the number of bytes in a THP. It is like
page_size(), but compiles to just PAGE_SIZE if CONFIG_TRANSPARENT_HUGEPAGE
is disabled.Signed-off-by: Matthew Wilcox (Oracle)
Signed-off-by: Andrew Morton
Reviewed-by: William Kucharski
Reviewed-by: Zi Yan
Cc: David Hildenbrand
Cc: Mike Kravetz
Cc: "Kirill A. Shutemov"
Link: http://lkml.kernel.org/r/20200629151959.15779-5-willy@infradead.org
Signed-off-by: Linus Torvalds
08 Aug, 2020
1 commit
-
swap_readpage() does the sync io for one page, the io is not big,
normally, the io can be finished quickly, but it may take long time or
wait forever in case of io failure or discard.This patch uses blk_io_schedule() instead of io_schedule() to avoid task
hung and crash (when set /proc/sys/kernel/hung_task_panic) when the above
exception occurs.This is similar to the hung task avoidance in submit_bio_wait(),
blk_execute_rq() and __blkdev_direct_IO().Signed-off-by: Xianting Tian
Signed-off-by: Andrew Morton
Reviewed-by: Andrew Morton
Cc: Ming Lei
Cc: Bart Van Assche
Cc: Hannes Reinecke
Cc: Jens Axboe
Cc: Hugh Dickins
Link: http://lkml.kernel.org/r/1596461807-21087-1-git-send-email-xianting_tian@126.com
Signed-off-by: Linus Torvalds
29 Jun, 2020
1 commit
-
bio_associate_blkg_from_page is a special purpose helper for swap bios
that doesn't need access to bio internals. Move it to the swap code
instead of having it in bio.c.Acked-by: Tejun Heo
Reviewed-by: Johannes Thumshirn
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
10 Jun, 2020
1 commit
-
Patch series "mm: consolidate definitions of page table accessors", v2.
The low level page table accessors (pXY_index(), pXY_offset()) are
duplicated across all architectures and sometimes more than once. For
instance, we have 31 definition of pgd_offset() for 25 supported
architectures.Most of these definitions are actually identical and typically it boils
down to, e.g.static inline unsigned long pmd_index(unsigned long address)
{
return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
}static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
{
return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
}These definitions can be shared among 90% of the arches provided
XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.For architectures that really need a custom version there is always
possibility to override the generic version with the usual ifdefs magic.These patches introduce include/linux/pgtable.h that replaces
include/asm-generic/pgtable.h and add the definitions of the page table
accessors to the new header.This patch (of 12):
The linux/mm.h header includes to allow inlining of the
functions involving page table manipulations, e.g. pte_alloc() and
pmd_alloc(). So, there is no point to explicitly include
in the files that include .The include statements in such cases are remove with a simple loop:
for f in $(git grep -l "include ") ; do
sed -i -e '/include / d' $f
doneSigned-off-by: Mike Rapoport
Signed-off-by: Andrew Morton
Cc: Arnd Bergmann
Cc: Borislav Petkov
Cc: Brian Cain
Cc: Catalin Marinas
Cc: Chris Zankel
Cc: "David S. Miller"
Cc: Geert Uytterhoeven
Cc: Greentime Hu
Cc: Greg Ungerer
Cc: Guan Xuetao
Cc: Guo Ren
Cc: Heiko Carstens
Cc: Helge Deller
Cc: Ingo Molnar
Cc: Ley Foon Tan
Cc: Mark Salter
Cc: Matthew Wilcox
Cc: Matt Turner
Cc: Max Filippov
Cc: Michael Ellerman
Cc: Michal Simek
Cc: Mike Rapoport
Cc: Nick Hu
Cc: Paul Walmsley
Cc: Richard Weinberger
Cc: Rich Felker
Cc: Russell King
Cc: Stafford Horne
Cc: Thomas Bogendoerfer
Cc: Thomas Gleixner
Cc: Tony Luck
Cc: Vincent Chen
Cc: Vineet Gupta
Cc: Will Deacon
Cc: Yoshinori Sato
Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
Signed-off-by: Linus Torvalds
03 Feb, 2020
1 commit
-
By now, bmap() will either return the physical block number related to
the requested file offset or 0 in case of error or the requested offset
maps into a hole.
This patch makes the needed changes to enable bmap() to proper return
errors, using the return value as an error return, and now, a pointer
must be passed to bmap() to be filled with the mapped physical block.It will change the behavior of bmap() on return:
- negative value in case of error
- zero on success or map fell into a holeIn case of a hole, the *block will be zero too
Since this is a prep patch, by now, the only error return is -EINVAL if
->bmap doesn't exist.Reviewed-by: Christoph Hellwig
Signed-off-by: Carlos Maiolino
Signed-off-by: Al Viro
02 Dec, 2019
1 commit
-
If a block device supports rw_page operation, it doesn't submit bios so
the annotation in submit_bio() for refault stall doesn't work. It
happens with zram in android, especially swap read path which could
consume CPU cycle for decompress. It is also a problem for zswap which
uses frontswap.Annotate swap_readpage() to account the synchronous IO overhead to
prevent underreport memory pressure.[akpm@linux-foundation.org: add comment, per Johannes]
Link: http://lkml.kernel.org/r/20191010152134.38545-1-minchan@kernel.org
Signed-off-by: Minchan Kim
Acked-by: Johannes Weiner
Reviewed-by: Shakeel Butt
Cc: Seth Jennings
Cc: Dan Streetman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
16 Nov, 2019
1 commit
-
The following race is observed due to which a processes faulting on a
swap entry, finds the page neither in swapcache nor swap. This causes
zram to give a zero filled page that gets mapped to the process,
resulting in a user space crash later.Consider parent and child processes Pa and Pb sharing the same swap slot
with swap_count 2. Swap is on zram with SWP_SYNCHRONOUS_IO set.
Virtual address 'VA' of Pa and Pb points to the shared swap entry.Pa Pb
fault on VA fault on VA
do_swap_page do_swap_page
lookup_swap_cache fails lookup_swap_cache fails
Pb scheduled out
swapin_readahead (deletes zram entry)
swap_free (makes swap_count 1)
Pb scheduled in
swap_readpage (swap_count == 1)
Takes SWP_SYNCHRONOUS_IO path
zram enrty absent
zram gives a zero filled pageFix this by making sure that swap slot is freed only when swap count
drops down to one.Link: http://lkml.kernel.org/r/1571743294-14285-1-git-send-email-vinmenon@codeaurora.org
Fixes: aa8d22a11da9 ("mm: swap: SWP_SYNCHRONOUS_IO: skip swapcache only if swapped page has no other reference")
Signed-off-by: Vinayak Menon
Suggested-by: Minchan Kim
Acked-by: Minchan Kim
Cc: Michal Hocko
Cc: Hugh Dickins
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
13 Jul, 2019
1 commit
-
swap_extent is used to map swap page offset to backing device's block
offset. For a continuous block range, one swap_extent is used and all
these swap_extents are managed in a linked list.These swap_extents are used by map_swap_entry() during swap's read and
write path. To find out the backing device's block offset for a page
offset, the swap_extent list will be traversed linearly, with
curr_swap_extent being used as a cache to speed up the search.This works well as long as swap_extents are not huge or when the number
of processes that access swap device are few, but when the swap device
has many extents and there are a number of processes accessing the swap
device concurrently, it can be a problem. On one of our servers, the
disk's remaining size is tight:$df -h
Filesystem Size Used Avail Use% Mounted on
... ...
/dev/nvme0n1p1 1.8T 1.3T 504G 72% /home/t4When creating a 80G swapfile there, there are as many as 84656 swap
extents. The end result is, kernel spends abou 30% time in
map_swap_entry() and swap throughput is only 70MB/s.As a comparison, when I used smaller sized swapfile, like 4G whose
swap_extent dropped to 2000, swap throughput is back to 400-500MB/s and
map_swap_entry() is about 3%.One downside of using rbtree for swap_extent is, 'struct rbtree' takes
24 bytes while 'struct list_head' takes 16 bytes, that's 8 bytes more
for each swap_extent. For a swapfile that has 80k swap_extents, that
means 625KiB more memory consumed.Test:
Since it's not possible to reboot that server, I can not test this patch
diretly there. Instead, I tested it on another server with NVMe disk.I created a 20G swapfile on an NVMe backed XFS fs. By default, the
filesystem is quite clean and the created swapfile has only 2 extents.
Testing vanilla and this patch shows no obvious performance difference
when swapfile is not fragmented.To see the patch's effects, I used some tweaks to manually fragment the
swapfile by breaking the extent at 1M boundary. This made the swapfile
have 20K extents.nr_task=4
kernel swapout(KB/s) map_swap_entry(perf) swapin(KB/s) map_swap_entry(perf)
vanilla 165191 90.77% 171798 90.21%
patched 858993 +420% 2.16% 715827 +317% 0.77%nr_task=8
kernel swapout(KB/s) map_swap_entry(perf) swapin(KB/s) map_swap_entry(perf)
vanilla 306783 92.19% 318145 87.76%
patched 954437 +211% 2.35% 1073741 +237% 1.57%swapout: the throughput of swap out, in KB/s, higher is better 1st
map_swap_entry: cpu cycles percent sampled by perf swapin: the
throughput of swap in, in KB/s, higher is better. 2nd map_swap_entry:
cpu cycles percent sampled by perfnr_task=1 doesn't show any difference, this is due to the curr_swap_extent
can be effectively used to cache the correct swap extent for single task
workload.[akpm@linux-foundation.org: s/BUG_ON(1)/BUG()/]
Link: http://lkml.kernel.org/r/20190523142404.GA181@aaronlu
Signed-off-by: Aaron Lu
Cc: Huang Ying
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
05 Jul, 2019
1 commit
-
swap_readpage() sets waiter = bio->bi_private even if synchronous = F,
this means that the caller can get the spurious wakeup after return.This can be fatal if blk_wake_io_task() does
set_current_state(TASK_RUNNING) after the caller does
set_special_state(), in the worst case the kernel can crash in
do_task_dead().Link: http://lkml.kernel.org/r/20190704160301.GA5956@redhat.com
Fixes: 0619317ff8baa2d ("block: add polled wakeup task helper")
Signed-off-by: Oleg Nesterov
Reported-by: Qian Cai
Acked-by: Hugh Dickins
Reviewed-by: Jens Axboe
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
29 Jun, 2019
1 commit
-
0-Day test system reported some OOM regressions for several THP
(Transparent Huge Page) swap test cases. These regressions are bisected
to 6861428921b5 ("block: always define BIO_MAX_PAGES as 256"). In the
commit, BIO_MAX_PAGES is set to 256 even when THP swap is enabled. So the
bio_alloc(gfp_flags, 512) in get_swap_bio() may fail when swapping out
THP. That causes the OOM.As in the patch description of 6861428921b5 ("block: always define
BIO_MAX_PAGES as 256"), THP swap should use multi-page bvec to write THP
to swap space. So the issue is fixed via doing that in get_swap_bio().BTW: I remember I have checked the THP swap code when 6861428921b5
("block: always define BIO_MAX_PAGES as 256") was merged, and thought the
THP swap code needn't to be changed. But apparently, I was wrong. I
should have done this at that time.Link: http://lkml.kernel.org/r/20190624075515.31040-1-ying.huang@intel.com
Fixes: 6861428921b5 ("block: always define BIO_MAX_PAGES as 256")
Signed-off-by: "Huang, Ying"
Reviewed-by: Ming Lei
Cc: Michal Hocko
Cc: Johannes Weiner
Cc: Hugh Dickins
Cc: Minchan Kim
Cc: Rik van Riel
Cc: Daniel Jordan
Cc: Jens Axboe
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
05 Jan, 2019
1 commit
-
swap_readpage() wants to do polling to bring in pages if asked to, but
it doesn't mark the bio as being polled. Additionally, the looping
around the blk_poll() check isn't correct - if we get a zero return, we
should call io_schedule(), we can't just assume that the bio has
completed. The regular bio->bi_private check should be used for that.Link: http://lkml.kernel.org/r/e15243a8-2cdf-c32c-ecee-f289377c8ef9@kernel.dk
Signed-off-by: Jens Axboe
Reviewed-by: Andrew Morton
Cc: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
03 Jan, 2019
1 commit
-
This mostly reverts commit 849a370016a5 ("block: avoid ordered task
state change for polled IO"). It was wrongly claiming that the ordering
wasn't necessary. The memory barrier _is_ necessary.If something is truly polling and not going to sleep, it's the whole
state setting that is unnecessary, not the memory barrier. Whenever you
set your state to a sleeping state, you absolutely need the memory
barrier.Note that sometimes the memory barrier can be elsewhere. For example,
the ordering might be provided by an external lock, or by setting the
process state to sleeping before adding yourself to the wait queue list
that is used for waking up (where the wait queue lock itself will
guarantee that any wakeup will correctly see the sleeping state).But none of those cases were true here.
NOTE! Some of the polling paths may indeed be able to drop the state
setting entirely, at which point the memory barrier also goes away.(Also note that this doesn't revert the TASK_RUNNING cases: there is no
race between a wakeup and setting the process state to TASK_RUNNING,
since the end result doesn't depend on ordering).Cc: Jens Axboe
Cc: Christoph Hellwig
Signed-off-by: Linus Torvalds
08 Dec, 2018
1 commit
-
A prior patch in this series added blkg association to bios issued by
cgroups. There are two other paths that we want to attribute work back
to the appropriate cgroup: swap and writeback. Here we modify the way
swap tags bios to include the blkg. Writeback will be tackle in the next
patch.Signed-off-by: Dennis Zhou
Reviewed-by: Josef Bacik
Acked-by: Tejun Heo
Signed-off-by: Jens Axboe
26 Nov, 2018
1 commit
-
blk_poll() has always kept spinning until it found an IO. This is
fine for SYNC polling, since we need to find one request we have
pending, but in preparation for ASYNC polling it can be beneficial
to just check if we have any entries available or not.Existing callers are converted to pass in 'spin == true', to retain
the old behavior.Signed-off-by: Jens Axboe
19 Nov, 2018
1 commit
-
For the core poll helper, the task state setting don't need to imply any
atomics, as it's the current task itself that is being modified and
we're not going to sleep.For IRQ driven, the wakeup path have the necessary barriers to not need
us using the heavy handed version of the task state setting.Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe
16 Nov, 2018
1 commit
-
If we're polling for IO on a device that doesn't use interrupts, then
IO completion loop (and wake of task) is done by submitting task itself.
If that is the case, then we don't need to enter the wake_up_process()
function, we can simply mark ourselves as TASK_RUNNING.Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe
03 Nov, 2018
1 commit
-
Pull block layer fixes from Jens Axboe:
"The biggest part of this pull request is the revert of the blkcg
cleanup series. It had one fix earlier for a stacked device issue, but
another one was reported. Rather than play whack-a-mole with this,
revert the entire series and try again for the next kernel release.Apart from that, only small fixes/changes.
Summary:
- Indentation fixup for mtip32xx (Colin Ian King)
- The blkcg cleanup series revert (Dennis Zhou)
- Two NVMe fixes. One fixing a regression in the nvme request
initialization in this merge window, causing nvme-fc to not work.
The other is a suspend/resume p2p resource issue (James, Keith)- Fix sg discard merge, allowing us to merge in cases where we didn't
before (Jianchao Wang)- Call rq_qos_exit() after the queue is frozen, preventing a hang
(Ming)- Fix brd queue setup, fixing an oops if we fail setting up all
devices (Ming)"* tag 'for-linus-20181102' of git://git.kernel.dk/linux-block:
nvme-pci: fix conflicting p2p resource adds
nvme-fc: fix request private initialization
blkcg: revert blkcg cleanups series
block: brd: associate with queue until adding disk
block: call rq_qos_exit() after queue is frozen
mtip32xx: clean an indentation issue, remove extraneous tabs
block: fix the DISCARD request merge
02 Nov, 2018
2 commits
-
Pull AFS updates from Al Viro:
"AFS series, with some iov_iter bits included"* 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
missing bits of "iov_iter: Separate type from direction and use accessor functions"
afs: Probe multiple fileservers simultaneously
afs: Fix callback handling
afs: Eliminate the address pointer from the address list cursor
afs: Allow dumping of server cursor on operation failure
afs: Implement YFS support in the fs client
afs: Expand data structure fields to support YFS
afs: Get the target vnode in afs_rmdir() and get a callback on it
afs: Calc callback expiry in op reply delivery
afs: Fix FS.FetchStatus delivery from updating wrong vnode
afs: Implement the YFS cache manager service
afs: Remove callback details from afs_callback_break struct
afs: Commit the status on a new file/dir/symlink
afs: Increase to 64-bit volume ID and 96-bit vnode ID for YFS
afs: Don't invoke the server to read data beyond EOF
afs: Add a couple of tracepoints to log I/O errors
afs: Handle EIO from delivery function
afs: Fix TTL on VL server and address lists
afs: Implement VL server rotation
afs: Improve FS server rotation error handling
... -
This reverts a series committed earlier due to null pointer exception
bug report in [1]. It seems there are edge case interactions that I did
not consider and will need some time to understand what causes the
adverse interactions.The original series can be found in [2] with a follow up series in [3].
[1] https://www.spinics.net/lists/cgroups/msg20719.html
[2] https://lore.kernel.org/lkml/20180911184137.35897-1-dennisszhou@gmail.com/
[3] https://lore.kernel.org/lkml/20181020185612.51587-1-dennis@kernel.org/This reverts the following commits:
d459d853c2ed, b2c3fa546705, 101246ec02b5, b3b9f24f5fcc, e2b0989954ae,
f0fcb3ec89f3, c839e7a03f92, bdc2491708c4, 74b7c02a9bc1, 5bf9a1f3b4ef,
a7b39b4e961c, 07b05bcc3213, 49f4c2dc2b50, 27e6fa996c53Signed-off-by: Dennis Zhou
Signed-off-by: Jens Axboe
27 Oct, 2018
1 commit
-
The SWP_FILE flag serves two purposes: to make swap_{read,write}page() go
through the filesystem, and to make swapoff() call ->swap_deactivate().
For Btrfs, we want the latter but not the former, so split this flag into
two. This makes us always call ->swap_deactivate() if ->swap_activate()
succeeded, not just if it didn't add any swap extents itself.This also resolves the issue of the very misleading name of SWP_FILE,
which is only used for swap files over NFS.Link: http://lkml.kernel.org/r/6d63d8668c4287a4f6d203d65696e96f80abdfc7.1536704650.git.osandov@fb.com
Signed-off-by: Omar Sandoval
Reviewed-by: Nikolay Borisov
Reviewed-by: Andrew Morton
Cc: Johannes Weiner
Cc: David Sterba
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
24 Oct, 2018
1 commit
-
In the iov_iter struct, separate the iterator type from the iterator
direction and use accessor functions to access them in most places.Convert a bunch of places to use switch-statements to access them rather
then chains of bitwise-AND statements. This makes it easier to add further
iterator types. Also, this can be more efficient as to implement a switch
of small contiguous integers, the compiler can use ~50% fewer compare
instructions than it has to use bitwise-and instructions.Further, cease passing the iterator type into the iterator setup function.
The iterator function can set that itself. Only the direction is required.Signed-off-by: David Howells
22 Sep, 2018
1 commit
-
A prior patch in this series added blkg association to bios issued by
cgroups. There are two other paths that we want to attribute work back
to the appropriate cgroup: swap and writeback. Here we modify the way
swap tags bios to include the blkg. Writeback will be tackle in the next
patch.Signed-off-by: Dennis Zhou
Reviewed-by: Josef Bacik
Acked-by: Tejun Heo
Signed-off-by: Jens Axboe
09 Jul, 2018
2 commits
-
For backcharging we need to know who the page belongs to when swapping
it out. We don't worry about things that do ->rw_page (zram etc) at the
moment, we're only worried about pages that actually go to a block
device.Signed-off-by: Tejun Heo
Signed-off-by: Josef Bacik
Acked-by: Johannes Weiner
Acked-by: Andrew Morton
Signed-off-by: Jens Axboe -
Just like REQ_META, it's important to know the IO coming down is swap
in order to guard against potential IO priority inversion issues with
cgroups. Add REQ_SWAP and use it for all swap IO, and add it to our
bio_issue_as_root_blkg helper.Signed-off-by: Josef Bacik
Acked-by: Tejun Heo
Signed-off-by: Jens Axboe
07 Jan, 2018
1 commit
-
This patch converts to bio_first_bvec_all() & bio_first_page_all() for
retrieving the 1st bvec/page, and prepares for supporting multipage bvec.Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe
16 Nov, 2017
1 commit
-
With fast swap storage, the platforms want to use swap more aggressively
and swap-in is crucial to application latency.The rw_page() based synchronous devices like zram, pmem and btt are such
fast storage. When I profile swapin performance with zram lz4
decompress test, S/W overhead is more than 70%. Maybe, it would be
bigger in nvdimm.This patch aims to reduce swap-in latency by skipping swapcache if the
swap device is synchronous device like rw_page based device. It
enhances 45% my swapin test(5G sequential swapin, no readahead, from
2.41sec to 1.64sec).Link: http://lkml.kernel.org/r/1505886205-9671-5-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim
Cc: Dan Williams
Cc: Ross Zwisler
Cc: Hugh Dickins
Cc: Christoph Hellwig
Cc: Ilya Dryomov
Cc: Jens Axboe
Cc: Sergey Senozhatsky
Cc: Huang Ying
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
15 Nov, 2017
1 commit
-
Pull core block layer updates from Jens Axboe:
"This is the main pull request for block storage for 4.15-rc1.Nothing out of the ordinary in here, and no API changes or anything
like that. Just various new features for drivers, core changes, etc.
In particular, this pull request contains:- A patch series from Bart, closing the whole on blk/scsi-mq queue
quescing.- A series from Christoph, building towards hidden gendisks (for
multipath) and ability to move bio chains around.- NVMe
- Support for native multipath for NVMe (Christoph).
- Userspace notifications for AENs (Keith).
- Command side-effects support (Keith).
- SGL support (Chaitanya Kulkarni)
- FC fixes and improvements (James Smart)
- Lots of fixes and tweaks (Various)- bcache
- New maintainer (Michael Lyle)
- Writeback control improvements (Michael)
- Various fixes (Coly, Elena, Eric, Liang, et al)- lightnvm updates, mostly centered around the pblk interface
(Javier, Hans, and Rakesh).- Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)
- Writeback series that fix the much discussed hundreds of millions
of sync-all units. This goes all the way, as discussed previously
(me).- Fix for missing wakeup on writeback timer adjustments (Yafang
Shao).- Fix laptop mode on blk-mq (me).
- {mq,name} tupple lookup for IO schedulers, allowing us to have
alias names. This means you can use 'deadline' on both !mq and on
mq (where it's called mq-deadline). (me).- blktrace race fix, oopsing on sg load (me).
- blk-mq optimizations (me).
- Obscure waitqueue race fix for kyber (Omar).
- NBD fixes (Josef).
- Disable writeback throttling by default on bfq, like we do on cfq
(Luca Miccio).- Series from Ming that enable us to treat flush requests on blk-mq
like any other request. This is a really nice cleanup.- Series from Ming that improves merging on blk-mq with schedulers,
getting us closer to flipping the switch on scsi-mq again.- BFQ updates (Paolo).
- blk-mq atomic flags memory ordering fixes (Peter Z).
- Loop cgroup support (Shaohua).
- Lots of minor fixes from lots of different folks, both for core and
driver code"* 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
nvme: fix visibility of "uuid" ns attribute
blk-mq: fixup some comment typos and lengths
ide: ide-atapi: fix compile error with defining macro DEBUG
blk-mq: improve tag waiting setup for non-shared tags
brd: remove unused brd_mutex
blk-mq: only run the hardware queue if IO is pending
block: avoid null pointer dereference on null disk
fs: guard_bio_eod() needs to consider partitions
xtensa/simdisk: fix compile error
nvme: expose subsys attribute to sysfs
nvme: create 'slaves' and 'holders' entries for hidden controllers
block: create 'slaves' and 'holders' entries for hidden gendisks
nvme: also expose the namespace identification sysfs files for mpath nodes
nvme: implement multipath access to nvme subsystems
nvme: track shared namespaces
nvme: introduce a nvme_ns_ids structure
nvme: track subsystems
block, nvme: Introduce blk_mq_req_flags_t
block, scsi: Make SCSI quiesce and resume work reliably
block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
...
04 Nov, 2017
1 commit
-
That we we can also poll non blk-mq queues. Mostly needed for
the NVMe multipath code, but could also be useful elsewhere.Signed-off-by: Christoph Hellwig
Reviewed-by: Hannes Reinecke
Signed-off-by: Jens Axboe
02 Nov, 2017
1 commit
-
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.By default all files without license information are under the default
license of the kernel, which is GPL version 2.Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if
Reviewed-by: Philippe Ombredanne
Reviewed-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman
08 Sep, 2017
1 commit
-
Pull block layer updates from Jens Axboe:
"This is the first pull request for 4.14, containing most of the code
changes. It's a quiet series this round, which I think we needed after
the churn of the last few series. This contains:- Fix for a registration race in loop, from Anton Volkov.
- Overflow complaint fix from Arnd for DAC960.
- Series of drbd changes from the usual suspects.
- Conversion of the stec/skd driver to blk-mq. From Bart.
- A few BFQ improvements/fixes from Paolo.
- CFQ improvement from Ritesh, allowing idling for group idle.
- A few fixes found by Dan's smatch, courtesy of Dan.
- A warning fixup for a race between changing the IO scheduler and
device remova. From David Jeffery.- A few nbd fixes from Josef.
- Support for cgroup info in blktrace, from Shaohua.
- Also from Shaohua, new features in the null_blk driver to allow it
to actually hold data, among other things.- Various corner cases and error handling fixes from Weiping Zhang.
- Improvements to the IO stats tracking for blk-mq from me. Can
drastically improve performance for fast devices and/or big
machines.- Series from Christoph removing bi_bdev as being needed for IO
submission, in preparation for nvme multipathing code.- Series from Bart, including various cleanups and fixes for switch
fall through case complaints"* 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
kernfs: checking for IS_ERR() instead of NULL
drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
drbd: Fix allyesconfig build, fix recent commit
drbd: switch from kmalloc() to kmalloc_array()
drbd: abort drbd_start_resync if there is no connection
drbd: move global variables to drbd namespace and make some static
drbd: rename "usermode_helper" to "drbd_usermode_helper"
drbd: fix race between handshake and admin disconnect/down
drbd: fix potential deadlock when trying to detach during handshake
drbd: A single dot should be put into a sequence.
drbd: fix rmmod cleanup, remove _all_ debugfs entries
drbd: Use setup_timer() instead of init_timer() to simplify the code.
drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
drbd: new disk-option disable-write-same
drbd: Fix resource role for newly created resources in events2
drbd: mark symbols static where possible
drbd: Send P_NEG_ACK upon write error in protocol != C
drbd: add explicit plugging when submitting batches
drbd: change list_for_each_safe to while(list_first_entry_or_null)
drbd: introduce drbd_recv_header_maybe_unplug
...
07 Sep, 2017
1 commit
-
To support delay splitting THP (Transparent Huge Page) after swapped
out, we need to enhance swap writing code to support to write a THP as a
whole. This will improve swap write IO performance.As Ming Lei pointed out, this should be based on
multipage bvec support, which hasn't been merged yet. So this patch is
only for testing the functionality of the other patches in the series.
And will be reimplemented after multipage bvec support is merged.Link: http://lkml.kernel.org/r/20170724051840.2309-7-ying.huang@intel.com
Signed-off-by: "Huang, Ying"
Cc: "Kirill A . Shutemov"
Cc: Andrea Arcangeli
Cc: Dan Williams
Cc: Hugh Dickins
Cc: Jens Axboe
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Minchan Kim
Cc: Rik van Riel
Cc: Ross Zwisler [for brd.c, zram_drv.c, pmem.c]
Cc: Shaohua Li
Cc: Vishal L Verma
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
24 Aug, 2017
1 commit
-
This way we don't need a block_device structure to submit I/O. The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open. Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).For the actual I/O path all that we need is the gendisk, which exists
once per block device. But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
03 Aug, 2017
1 commit
-
When a thread is OOM-killed during swap_readpage() operation, an oops
occurs because end_swap_bio_read() is calling wake_up_process() based on
an assumption that the thread which called swap_readpage() is still
alive.Out of memory: Kill process 525 (polkitd) score 0 or sacrifice child
Killed process 525 (polkitd) total-vm:528128kB, anon-rss:0kB, file-rss:4kB, shmem-rss:0kB
oom_reaper: reaped process 525 (polkitd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter coretemp ppdev pcspkr vmw_balloon sg shpchp vmw_vmci parport_pc parport i2c_piix4 ip_tables xfs libcrc32c sd_mod sr_mod cdrom ata_generic pata_acpi vmwgfx ahci libahci drm_kms_helper ata_piix syscopyarea sysfillrect sysimgblt fb_sys_fops mptspi scsi_transport_spi ttm e1000 mptscsih drm mptbase i2c_core libata serio_raw
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.13.0-rc2-next-20170725 #129
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
task: ffffffffb7c16500 task.stack: ffffffffb7c00000
RIP: 0010:__lock_acquire+0x151/0x12f0
Call Trace:
lock_acquire+0x59/0x80
_raw_spin_lock_irqsave+0x3b/0x4f
try_to_wake_up+0x3b/0x410
wake_up_process+0x10/0x20
end_swap_bio_read+0x6f/0xf0
bio_endio+0x92/0xb0
blk_update_request+0x88/0x270
scsi_end_request+0x32/0x1c0
scsi_io_completion+0x209/0x680
scsi_finish_command+0xd4/0x120
scsi_softirq_done+0x120/0x140
__blk_mq_complete_request_remote+0xe/0x10
flush_smp_call_function_queue+0x51/0x120
generic_smp_call_function_single_interrupt+0xe/0x20
smp_trace_call_function_single_interrupt+0x22/0x30
smp_call_function_single_interrupt+0x9/0x10
call_function_single_interrupt+0xa7/0xb0
RIP: 0010:native_safe_halt+0x6/0x10
default_idle+0xe/0x20
arch_cpu_idle+0xa/0x10
default_idle_call+0x1e/0x30
do_idle+0x187/0x200
cpu_startup_entry+0x6e/0x70
rest_init+0xd0/0xe0
start_kernel+0x456/0x477
x86_64_start_reservations+0x24/0x26
x86_64_start_kernel+0xf7/0x11a
secondary_startup_64+0xa5/0xa5
Code: c3 49 81 3f 20 9e 0b b8 41 bc 00 00 00 00 44 0f 45 e2 83 fe 01 0f 87 62 ff ff ff 89 f0 49 8b 44 c7 08 48 85 c0 0f 84 52 ff ff ff ff 80 98 01 00 00 8b 3d 5a 49 c4 01 45 8b b3 18 0c 00 00 85
RIP: __lock_acquire+0x151/0x12f0 RSP: ffffa01f39e03c50
---[ end trace 6c441db499169b1e ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: 0x36000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: Fatal exception in interruptFix it by holding a reference to the thread.
[akpm@linux-foundation.org: add comment]
Fixes: 23955622ff8d231b ("swap: add block io poll in swapin path")
Signed-off-by: Tetsuo Handa
Reviewed-by: Shaohua Li
Cc: Tim Chen
Cc: Huang Ying
Cc: Jens Axboe
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds