16 Sep, 2020
20 commits
-
Enable the new inode btree counters feature.
Signed-off-by: Darrick J. Wong
Reviewed-by: Brian Foster -
Add the necessary bits to the online repair code to support logging the
inode btree counters when rebuilding the btrees, and to support fixing
the counters when rebuilding the AGI.Signed-off-by: Darrick J. Wong
Reviewed-by: Brian Foster -
Add the necessary bits to the online scrub code to check the inode btree
counters when enabled.Signed-off-by: Darrick J. Wong
Reviewed-by: Brian Foster -
Now that we have reliable finobt block counts, use them to speed up the
per-AG block reservation calculations at mount time.Signed-off-by: Darrick J. Wong
Reviewed-by: Brian Foster -
Add a btree block usage counters for both inode btrees to the AGI header
so that we don't have to walk the entire finobt at mount time to create
the per-AG reservations.Signed-off-by: Darrick J. Wong
Reviewed-by: Brian Foster -
Instead of poking deeply into buffer cache internals when re-reading the
superblock during log recovery just generalize _xfs_buf_read and use it
there. Note that we don't have to explicitly set up the ops as they
must be set from the initial read.Signed-off-by: Christoph Hellwig
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong -
Merge xfs_getsb into its only caller, and clean that one up a little bit
as well.Signed-off-by: Christoph Hellwig
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong -
Remove the mp argument as this function is only called in transaction
context, and open code xfs_getsb given that the function already accesses
the buffer pointer in the mount point directly.Signed-off-by: Christoph Hellwig
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong -
The log recovery I/O completion handler does not substancially differ from
the normal one except for the fact that it:a) never retries failed writes
b) can have log items that aren't on the AIL
c) never has inode/dquot log items attached and thus don't need to
handle themAdd conditionals for (a) and (b) to the ioend code, while (c) doesn't
need special handling anyway.Signed-off-by: Christoph Hellwig
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong -
Clear the flags at the end of xfs_buf_ioend so that they can be used
during the completion.Signed-off-by: Christoph Hellwig
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong -
Reuse xfs_buf_item_relse instead of duplicating it.
Signed-off-by: Christoph Hellwig
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong -
Now that all the actual error handling is in a single place,
xfs_buf_ioend_disposition just needs to return true if took ownership of
the buffer, or false if not instead of the tristate. Also move the
error check back in the caller to optimize for the fast path, and give
the function a better fitting name.Signed-off-by: Christoph Hellwig
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong -
Keep all the error handling code together.
Signed-off-by: Christoph Hellwig
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong -
Merge xfs_buf_ioerror_retry into its only caller to make the resubmission
flow a little easier to follow.Signed-off-by: Christoph Hellwig
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong -
xfs_buf_ioerror_fail_without_retry is a somewhat weird function in
that it has two trivial checks that decide the return value, while
the rest implements a ratelimited warning. Just lift the two checks
into the caller, and give the remainder a suitable name.Signed-off-by: Christoph Hellwig
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong -
No need to keep a separate helper for this logic.
Signed-off-by: Christoph Hellwig
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong -
Move the buffer retry state machine logic to xfs_buf.c and call it once
from xfs_ioend instead of duplicating it three times for the three kinds
of buffers.Signed-off-by: Christoph Hellwig
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong -
Move the log recovery I/O completion handling entirely into the log
recovery code, and re-arrange the normal I/O completion handler flow
to prepare to lifting more logic into common code in the next commits.Signed-off-by: Christoph Hellwig
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong -
Signed-off-by: Christoph Hellwig
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong -
Handle the no-error case in xfs_buf_iodone_error as well, and to clarify
the code rename the function, use the actual enum type as return value
and then switch on it in the callers.Signed-off-by: Christoph Hellwig
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong
07 Sep, 2020
7 commits
-
With the recent rework of the inode cluster flushing, we no longer
ever wait on the the inode flush "lock". It was never a lock in the
first place, just a completion to allow callers to wait for inode IO
to complete. We now never wait for flush completion as all inode
flushing is non-blocking. Hence we can get rid of all the iflock
infrastructure and instead just set and check a state flag.Rename the XFS_IFLOCK flag to XFS_IFLUSHING, convert all the
xfs_iflock_nowait() test-and-set operations on that flag, and
replace all the xfs_ifunlock() calls to clear operations.Signed-off-by: Dave Chinner
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong -
Remove kmem_realloc() function and convert its users to use MM API
directly (krealloc())Signed-off-by: Carlos Maiolino
Reviewed-by: Christoph Hellwig
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong -
Pull more io_uring fixes from Jens Axboe:
"Two followup fixes. One is fixing a regression from this merge window,
the other is two commits fixing cancelation of deferred requests.Both have gone through full testing, and both spawned a few new
regression test additions to liburing.- Don't play games with const, properly store the output iovec and
assign it as needed.- Deferred request cancelation fix (Pavel)"
* tag 'io_uring-5.9-2020-09-06' of git://git.kernel.dk/linux-block:
io_uring: fix linked deferred ->files cancellation
io_uring: fix cancel of deferred reqs with ->files
io_uring: fix explicit async read/write mapping for large segments -
Pull iommu fixes from Joerg Roedel:
- three Intel VT-d fixes to fix address handling on 32bit, fix a NULL
pointer dereference bug and serialize a hardware register access as
required by the VT-d spec.- two patches for AMD IOMMU to force AMD GPUs into translation mode
when memory encryption is active and disallow using IOMMUv2
functionality. This makes the AMDGPU driver work when memory
encryption is active.- two more fixes for AMD IOMMU to fix updating the Interrupt Remapping
Table Entries.- MAINTAINERS file update for the Qualcom IOMMU driver.
* tag 'iommu-fixes-v5.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/vt-d: Handle 36bit addressing for x86-32
iommu/amd: Do not use IOMMUv2 functionality when SME is active
iommu/amd: Do not force direct mapping when SME is active
iommu/amd: Use cmpxchg_double() when updating 128-bit IRTE
iommu/amd: Restore IRTE.RemapEn bit after programming IRTE
iommu/vt-d: Fix NULL pointer dereference in dev_iommu_priv_set()
iommu/vt-d: Serialize IOMMU GCMD register modifications
MAINTAINERS: Update QUALCOMM IOMMU after Arm SMMU drivers move -
Pull x86 fixes from Ingo Molnar:
- more generic entry code ABI fallout
- debug register handling bugfixes
- fix vmalloc mappings on 32-bit kernels
- kprobes instrumentation output fix on 32-bit kernels
- fix over-eager WARN_ON_ONCE() on !SMAP hardware
- NUMA debugging fix
- fix Clang related crash on !RETPOLINE kernels
* tag 'x86-urgent-2020-09-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/entry: Unbreak 32bit fast syscall
x86/debug: Allow a single level of #DB recursion
x86/entry: Fix AC assertion
tracing/kprobes, x86/ptrace: Fix regs argument order for i386
x86, fakenuma: Fix invalid starting node ID
x86/mm/32: Bring back vmalloc faulting on x86_32
x86/cmdline: Disable jump tables for cmdline.c -
Pull xen updates from Juergen Gross:
"A small series for fixing a problem with Xen PVH guests when running
as backends (e.g. as dom0).Mapping other guests' memory is now working via ZONE_DEVICE, thus not
requiring to abuse the memory hotplug functionality for that purpose"* tag 'for-linus-5.9-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen: add helpers to allocate unpopulated memory
memremap: rename MEMORY_DEVICE_DEVDAX to MEMORY_DEVICE_GENERIC
xen/balloon: add header guard
06 Sep, 2020
13 commits
-
While looking for ->files in ->defer_list, consider that requests there
may actually be links.Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe -
While trying to cancel requests with ->files, it also should look for
requests in ->defer_list, otherwise it might end up hanging a thread.Cancel all requests in ->defer_list up to the last request there with
matching ->files, that's needed to follow drain ordering semantics.Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe -
Pull ARC fixes from Vineet Gupta:
- HSDK-4xd Dev system: perf driver updates for sampling interrupt
- HSDK* Dev System: Ethernet broken [Evgeniy Didin]
- HIGHMEM broken (2 memory banks) [Mike Rapoport]
- show_regs() rewrite once and for all
- Other minor fixes
* tag 'arc-5.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
ARC: [plat-hsdk]: Switch ethernet phy-mode to rgmii-id
arc: fix memory initialization for systems with two memory banks
irqchip/eznps: Fix build error for !ARC700 builds
ARC: show_regs: fix r12 printing and simplify
ARC: HSDK: wireup perf irq
ARC: perf: don't bail setup if pct irq missing in device-tree
ARC: pgalloc.h: delete a duplicated word + other fixes -
Merge misc fixes from Andrew Morton:
"19 patches.Subsystems affected by this patch series: MAINTAINERS, ipc, fork,
checkpatch, lib, and mm (memcg, slub, pagemap, madvise, migration,
hugetlb)"* emailed patches from Andrew Morton :
include/linux/log2.h: add missing () around n in roundup_pow_of_two()
mm/khugepaged.c: fix khugepaged's request size in collapse_file
mm/hugetlb: fix a race between hugetlb sysctl handlers
mm/hugetlb: try preferred node first when alloc gigantic page from cma
mm/migrate: preserve soft dirty in remove_migration_pte()
mm/migrate: remove unnecessary is_zone_device_page() check
mm/rmap: fixup copying of soft dirty and uffd ptes
mm/migrate: fixup setting UFFD_WP flag
mm: madvise: fix vma user-after-free
checkpatch: fix the usage of capture group ( ... )
fork: adjust sysctl_max_threads definition to match prototype
ipc: adjust proc_ipc_sem_dointvec definition to match prototype
mm: track page table modifications in __apply_to_page_range()
MAINTAINERS: IA64: mark Status as Odd Fixes only
MAINTAINERS: add LLVM maintainers
MAINTAINERS: update Cavium/Marvell entries
mm: slub: fix conversion of freelist_corrupted()
mm: memcg: fix memcg reclaim soft lockup
memcg: fix use-after-free in uncharge_batch -
Otherwise gcc generates warnings if the expression is complicated.
Fixes: 312a0c170945 ("[PATCH] LOG2: Alter roundup_pow_of_two() so that it can use a ilog2() on a constant")
Signed-off-by: Jason Gunthorpe
Signed-off-by: Andrew Morton
Link: https://lkml.kernel.org/r/0-v1-8a2697e3c003+41165-log_brackets_jgg@nvidia.com
Signed-off-by: Linus Torvalds -
collapse_file() in khugepaged passes PAGE_SIZE as the number of pages to
be read to page_cache_sync_readahead(). The intent was probably to read
a single page. Fix it to use the number of pages to the end of the
window instead.Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
Signed-off-by: David Howells
Signed-off-by: Matthew Wilcox (Oracle)
Signed-off-by: Andrew Morton
Reviewed-by: Matthew Wilcox (Oracle)
Acked-by: Song Liu
Acked-by: Yang Shi
Acked-by: Pankaj Gupta
Cc: Eric Biggers
Link: https://lkml.kernel.org/r/20200903140844.14194-2-willy@infradead.org
Signed-off-by: Linus Torvalds -
There is a race between the assignment of `table->data` and write value
to the pointer of `table->data` in the __do_proc_doulongvec_minmax() on
the other thread.CPU0: CPU1:
proc_sys_write
hugetlb_sysctl_handler proc_sys_call_handler
hugetlb_sysctl_handler_common hugetlb_sysctl_handler
table->data = &tmp; hugetlb_sysctl_handler_common
table->data = &tmp;
proc_doulongvec_minmax
do_proc_doulongvec_minmax sysctl_head_finish
__do_proc_doulongvec_minmax unuse_table
i = table->data;
*i = val; // corrupt CPU1's stackFix this by duplicating the `table`, and only update the duplicate of
it. And introduce a helper of proc_hugetlb_doulongvec_minmax() to
simplify the code.The following oops was seen:
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor instruction fetch in kernel mode
#PF: error_code(0x0010) - not-present page
Code: Bad RIP value.
...
Call Trace:
? set_max_huge_pages+0x3da/0x4f0
? alloc_pool_huge_page+0x150/0x150
? proc_doulongvec_minmax+0x46/0x60
? hugetlb_sysctl_handler_common+0x1c7/0x200
? nr_hugepages_store+0x20/0x20
? copy_fd_bitmaps+0x170/0x170
? hugetlb_sysctl_handler+0x1e/0x20
? proc_sys_call_handler+0x2f1/0x300
? unregister_sysctl_table+0xb0/0xb0
? __fd_install+0x78/0x100
? proc_sys_write+0x14/0x20
? __vfs_write+0x4d/0x90
? vfs_write+0xef/0x240
? ksys_write+0xc0/0x160
? __ia32_sys_read+0x50/0x50
? __close_fd+0x129/0x150
? __x64_sys_write+0x43/0x50
? do_syscall_64+0x6c/0x200
? entry_SYSCALL_64_after_hwframe+0x44/0xa9Fixes: e5ff215941d5 ("hugetlb: multiple hstates for multiple page sizes")
Signed-off-by: Muchun Song
Signed-off-by: Andrew Morton
Reviewed-by: Mike Kravetz
Cc: Andi Kleen
Link: http://lkml.kernel.org/r/20200828031146.43035-1-songmuchun@bytedance.com
Signed-off-by: Linus Torvalds -
Since commit cf11e85fc08c ("mm: hugetlb: optionally allocate gigantic
hugepages using cma"), the gigantic page would be allocated from node
which is not the preferred node, although there are pages available from
that node. The reason is that the nid parameter has been ignored in
alloc_gigantic_page().Besides, the __GFP_THISNODE also need be checked if user required to
alloc only from the preferred node.After this patch, the preferred node is tried first before other allowed
nodes, and don't try to allocate from other nodes if __GFP_THISNODE is
specified. If user don't specify the preferred node, the current node
will be used as preferred node, which makes sure consistent behavior of
allocating gigantic and non-gigantic hugetlb page.Fixes: cf11e85fc08c ("mm: hugetlb: optionally allocate gigantic hugepages using cma")
Signed-off-by: Li Xinhai
Signed-off-by: Andrew Morton
Reviewed-by: Mike Kravetz
Acked-by: Michal Hocko
Cc: Roman Gushchin
Link: https://lkml.kernel.org/r/20200902025016.697260-1-lixinhai.lxh@gmail.com
Signed-off-by: Linus Torvalds -
The code to remove a migration PTE and replace it with a device private
PTE was not copying the soft dirty bit from the migration entry. This
could lead to page contents not being marked dirty when faulting the page
back from device private memory.Signed-off-by: Ralph Campbell
Signed-off-by: Andrew Morton
Reviewed-by: Christoph Hellwig
Cc: Jerome Glisse
Cc: Alistair Popple
Cc: Christoph Hellwig
Cc: Jason Gunthorpe
Cc: Bharata B Rao
Link: https://lkml.kernel.org/r/20200831212222.22409-3-rcampbell@nvidia.com
Signed-off-by: Linus Torvalds -
Patch series "mm/migrate: preserve soft dirty in remove_migration_pte()".
I happened to notice this from code inspection after seeing Alistair
Popple's patch ("mm/rmap: Fixup copying of soft dirty and uffd ptes").This patch (of 2):
The check for is_zone_device_page() and is_device_private_page() is
unnecessary since the latter is sufficient to determine if the page is a
device private page. Simplify the code for easier reading.Signed-off-by: Ralph Campbell
Signed-off-by: Andrew Morton
Reviewed-by: Christoph Hellwig
Cc: Jerome Glisse
Cc: Alistair Popple
Cc: Christoph Hellwig
Cc: Jason Gunthorpe
Cc: Bharata B Rao
Link: https://lkml.kernel.org/r/20200831212222.22409-1-rcampbell@nvidia.com
Link: https://lkml.kernel.org/r/20200831212222.22409-2-rcampbell@nvidia.com
Signed-off-by: Linus Torvalds -
During memory migration a pte is temporarily replaced with a migration
swap pte. Some pte bits from the existing mapping such as the soft-dirty
and uffd write-protect bits are preserved by copying these to the
temporary migration swap pte.However these bits are not stored at the same location for swap and
non-swap ptes. Therefore testing these bits requires using the
appropriate helper function for the given pte type.Unfortunately several code locations were found where the wrong helper
function is being used to test soft_dirty and uffd_wp bits which leads to
them getting incorrectly set or cleared during page-migration.Fix these by using the correct tests based on pte type.
Fixes: a5430dda8a3a ("mm/migrate: support un-addressable ZONE_DEVICE page in migration")
Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
Fixes: f45ec5ff16a7 ("userfaultfd: wp: support swap and page migration")
Signed-off-by: Alistair Popple
Signed-off-by: Andrew Morton
Reviewed-by: Peter Xu
Cc: Jérôme Glisse
Cc: John Hubbard
Cc: Ralph Campbell
Cc: Alistair Popple
Cc:
Link: https://lkml.kernel.org/r/20200825064232.10023-2-alistair@popple.id.au
Signed-off-by: Linus Torvalds -
Commit f45ec5ff16a75 ("userfaultfd: wp: support swap and page migration")
introduced support for tracking the uffd wp bit during page migration.
However the non-swap PTE variant was used to set the flag for zone device
private pages which are a type of swap page.This leads to corruption of the swap offset if the original PTE has the
uffd_wp flag set.Fixes: f45ec5ff16a75 ("userfaultfd: wp: support swap and page migration")
Signed-off-by: Alistair Popple
Signed-off-by: Andrew Morton
Reviewed-by: Peter Xu
Cc: Jérôme Glisse
Cc: John Hubbard
Cc: Ralph Campbell
Link: https://lkml.kernel.org/r/20200825064232.10023-1-alistair@popple.id.au
Signed-off-by: Linus Torvalds