Eric Lee / smarc-fsl-linux-kernel

24 Oct, 2020

1 commit

c4728cfbe Merge tag 'vfs-5.10-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux ... Browse Code »

Pull clone/dedupe/remap code refactoring from Darrick Wong:
"Move the generic file range remap (aka reflink and dedupe) functions
out of mm/filemap.c and fs/read_write.c and into fs/remap_range.c to
reduce clutter in the first two files"

* tag 'vfs-5.10-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
vfs: move the generic write and copy checks out of mm
vfs: move the remap range helpers to remap_range.c
vfs: move generic_remap_checks out of mm

Linus Torvalds
2020-10-24 02:33:41 +0800

23 Oct, 2020

1 commit

f56e65dff Merge branch 'work.set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull initial set_fs() removal from Al Viro:
"Christoph's set_fs base series + fixups"

* 'work.set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: Allow a NULL pos pointer to __kernel_read
fs: Allow a NULL pos pointer to __kernel_write
powerpc: remove address space overrides using set_fs()
powerpc: use non-set_fs based maccess routines
x86: remove address space overrides using set_fs()
x86: make TASK_SIZE_MAX usable from assembly code
x86: move PAGE_OFFSET, TASK_SIZE & friends to page_{32,64}_types.h
lkdtm: remove set_fs-based tests
test_bitmap: remove user bitmap tests
uaccess: add infrastructure for kernel builds with set_fs()
fs: don't allow splice read/write without explicit ops
fs: don't allow kernel reads and writes without iter ops
sysctl: Convert to iter interfaces
proc: add a read_iter method to proc proc_ops
proc: cleanup the compat vs no compat file ops
proc: remove a level of indentation in proc_get_inode

Linus Torvalds
2020-10-23 00:59:21 +0800

16 Oct, 2020

4 commits

7b84b665c fs: Allow a NULL pos pointer to __kernel_read ... Browse Code »

Match the behaviour of new_sync_read() and __kernel_write().

Reviewed-by: Christoph Hellwig
Signed-off-by: Matthew Wilcox (Oracle)
Signed-off-by: Al Viro

Matthew Wilcox (Oracle)
2020-10-16 02:20:42 +0800
4c207ef48 fs: Allow a NULL pos pointer to __kernel_write ... Browse Code »

Linus prefers that callers be allowed to pass in a NULL pointer for ppos
like new_sync_write().

Reviewed-by: Christoph Hellwig
Signed-off-by: Matthew Wilcox (Oracle)
Signed-off-by: Al Viro

Matthew Wilcox (Oracle)
2020-10-16 02:20:41 +0800
407e9c63e vfs: move the generic write and copy checks out of mm ... Browse Code »

The generic write check helpers also don't have much to do with the page
cache, so move them to the vfs.

Signed-off-by: Darrick J. Wong

Darrick J. Wong
2020-10-16 00:50:01 +0800
1b2c54d63 vfs: move the remap range helpers to remap_range.c ... Browse Code »

Complete the migration by moving the file remapping helper functions out
of read_write.c and into remap_range.c. This reduces the clutter in the
first file and (eventually) will make it so that we can compile out the
second file if it isn't needed.

Signed-off-by: Darrick J. Wong

Darrick J. Wong
2020-10-16 00:48:49 +0800

13 Oct, 2020

1 commit

85ed13e78 Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull compat iovec cleanups from Al Viro:
"Christoph's series around import_iovec() and compat variant thereof"

* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
security/keys: remove compat_keyctl_instantiate_key_iov
mm: remove compat_process_vm_{readv,writev}
fs: remove compat_sys_vmsplice
fs: remove the compat readv/writev syscalls
fs: remove various compat readv/writev helpers
iov_iter: transparently handle compat iovecs in import_iovec
iov_iter: refactor rw_copy_check_uvector and import_iovec
iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c
compat.h: fix a spelling error in

Linus Torvalds
2020-10-13 07:35:51 +0800

03 Oct, 2020

3 commits

5f764d624 fs: remove the compat readv/writev syscalls ... Browse Code »

Now that import_iovec handles compat iovecs, the native readv and writev
syscalls can be used for the compat case as well.

Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro

Christoph Hellwig
2020-10-03 12:02:14 +0800
3523a9d45 fs: remove various compat readv/writev helpers ... Browse Code »

Now that import_iovec handles compat iovecs as well, all the duplicated
code in the compat readv/writev helpers is not needed. Remove them
and switch the compat syscall handlers to use the native helpers.

Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro

Christoph Hellwig
2020-10-03 12:02:14 +0800
89cd35c58 iov_iter: transparently handle compat iovecs in import_iovec ... Browse Code »

Use in compat_syscall to import either native or the compat iovecs, and
remove the now superflous compat_import_iovec.

This removes the need for special compat logic in most callers, and
the remaining ones can still be simplified by using __import_iovec
with a bool compat parameter.

Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro

Christoph Hellwig
2020-10-03 12:02:13 +0800

30 Sep, 2020

1 commit

90fb70279 autofs: use __kernel_write() for the autofs pipe writing ... Browse Code »

autofs got broken in some configurations by commit 13c164b1a186
("autofs: switch to kernel_write") because there is now an extra LSM
permission check done by security_file_permission() in rw_verify_area().

autofs is one if the few places that really does want the much more
limited __kernel_write(), because the write is an internal kernel one
that shouldn't do any user permission checks (it also doesn't need the
file_start_write/file_end_write logic, since it's just a pipe).

There are a couple of other cases like that - accounting, core dumping,
and splice - but autofs stands out because it can be built as a module.

As a result, we need to export this internal __kernel_write() function
again.

We really don't want any other module to use this, but we don't have a
"EXPORT_SYMBOL_FOR_AUTOFS_ONLY()". But we can mark it GPL-only to at
least approximate that "internal use only" for licensing.

While in this area, make autofs pass in NULL for the file position
pointer, since it's always a pipe, and we now use a NULL file pointer
for streaming file descriptors (see file_ppos() and commit 438ab720c675:
"vfs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files")

This effectively reverts commits 9db977522449 ("fs: unexport
__kernel_write") and 13c164b1a186 ("autofs: switch to kernel_write").

Fixes: 13c164b1a186 ("autofs: switch to kernel_write")
Reported-by: Ondrej Mosnacek
Acked-by: Christoph Hellwig
Acked-by: Acked-by: Ian Kent
Signed-off-by: Linus Torvalds

Linus Torvalds
2020-09-30 08:18:34 +0800

25 Sep, 2020

1 commit

fb041b598 iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c ... Browse Code »

This lets the compiler inline it into import_iovec() generating
much better code.

Signed-off-by: David Laight
Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro

David Laight
2020-09-25 23:36:02 +0800

09 Sep, 2020

2 commits

36e2c7421 fs: don't allow splice read/write without explicit ops ... Browse Code »

default_file_splice_write is the last piece of generic code that uses
set_fs to make the uaccess routines operate on kernel pointers. It
implements a "fallback loop" for splicing from files that do not actually
provide a proper splice_read method. The usual file systems and other
high bandwidth instances all provide a ->splice_read, so this just removes
support for various device drivers and procfs/debugfs files. If splice
support for any of those turns out to be important it can be added back
by switching them to the iter ops and using generic_file_splice_read.

Signed-off-by: Christoph Hellwig
Reviewed-by: Kees Cook
Signed-off-by: Al Viro

Christoph Hellwig
2020-09-09 10:21:32 +0800
4d03e3cc5 fs: don't allow kernel reads and writes without iter ops ... Browse Code »

Don't allow calling ->read or ->write with set_fs as a preparation for
killing off set_fs. All the instances that we use kernel_read/write on
are using the iter ops already.

If a file has both the regular ->read/->write methods and the iter
variants those could have different semantics for messed up enough
drivers. Also fails the kernel access to them in that case.

Signed-off-by: Christoph Hellwig
Reviewed-by: Kees Cook
Signed-off-by: Al Viro

Christoph Hellwig
2020-09-09 10:21:31 +0800

30 Jul, 2020

1 commit

bef173299 initrd: switch initrd loading to struct file based APIs ... Browse Code »

There is no good reason to mess with file descriptors from in-kernel
code, switch the initrd loading to struct file based read and writes
instead.

Also Pass an explicit offset instead of ->f_pos, and to make that easier,
use file scope file structs and offsets everywhere except for
identify_ramdisk_image instead of the current strange mix.

Signed-off-by: Christoph Hellwig
Acked-by: Linus Torvalds

Christoph Hellwig
2020-07-30 14:22:47 +0800

08 Jul, 2020

7 commits

775802c05 fs: remove __vfs_read ... Browse Code »

Fold it into the two callers.

Signed-off-by: Christoph Hellwig

Christoph Hellwig
2020-07-08 14:27:57 +0800
6209dd913 fs: implement kernel_read using __kernel_read ... Browse Code »

Consolidate the two in-kernel read helpers to make upcoming changes
easier. The only difference are the missing call to rw_verify_area
in kernel_read, and an access_ok check that doesn't make sense for
kernel buffers to start with.

Signed-off-by: Christoph Hellwig

Christoph Hellwig
2020-07-08 14:27:57 +0800
61a707c54 fs: add a __kernel_read helper ... Browse Code »

This is the counterpart to __kernel_write, and skip the rw_verify_area
call compared to kernel_read.

Signed-off-by: Christoph Hellwig

Christoph Hellwig
2020-07-08 14:27:56 +0800
53ad86266 fs: remove __vfs_write ... Browse Code »

Fold it into the two callers.

Signed-off-by: Christoph Hellwig

Christoph Hellwig
2020-07-08 14:27:56 +0800
81238b2cf fs: implement kernel_write using __kernel_write ... Browse Code »

Consolidate the two in-kernel write helpers to make upcoming changes
easier. The only difference are the missing call to rw_verify_area
in kernel_write, and an access_ok check that doesn't make sense for
kernel buffers to start with.

Signed-off-by: Christoph Hellwig

Christoph Hellwig
2020-07-08 14:27:56 +0800
a01ac27be fs: check FMODE_WRITE in __kernel_write ... Browse Code »

Add a WARN_ON_ONCE if the file isn't actually open for write. This
matches the check done in vfs_write, but actually warn warns as a
kernel user calling write on a file not opened for writing is a pretty
obvious programming error.

Signed-off-by: Christoph Hellwig

Christoph Hellwig
2020-07-08 14:27:56 +0800
9db977522 fs: unexport __kernel_write ... Browse Code »

This is a very special interface that skips sb_writes protection, and not
used by modules anymore.

Signed-off-by: Christoph Hellwig

Christoph Hellwig
2020-07-08 14:27:56 +0800

02 Apr, 2020

1 commit

9e62ccec3 powerpc: Add back __ARCH_WANT_SYS_LLSEEK macro ... Browse Code »

This partially reverts commit caf6f9c8a326 ("asm-generic: Remove
unneeded __ARCH_WANT_SYS_LLSEEK macro")

When CONFIG_COMPAT is disabled on ppc64 the kernel does not build.

There is resistance to both removing the llseek syscall from the 64bit
syscall tables and building the llseek interface unconditionally.

Signed-off-by: Michal Suchanek
Reviewed-by: Arnd Bergmann
Signed-off-by: Michael Ellerman
Link: https://lore.kernel.org/lkml/20190828151552.GA16855@infradead.org/
Link: https://lore.kernel.org/lkml/20190829214319.498c7de2@naga/
Link: https://lore.kernel.org/r/dd4575c51e31766e87f7e7fa121d099ab78d3290.1584699455.git.msuchanek@suse.de

Michal Suchanek
2020-04-02 21:09:59 +0800

04 Feb, 2020

1 commit

7f879e1a9 Merge tag 'ovl-update-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs ... Browse Code »

Pull overlayfs update from Miklos Szeredi:

- Try to preserve holes in sparse files when copying up, thus saving
disk space and improving performance.

- Fix a performance regression introduced in v4.19 by preserving
asynchronicity of IO when fowarding to underlying layers. Add VFS
helpers to submit async iocbs.

- Fix a regression in lseek(2) introduced in v4.19 that breaks >2G
seeks on 32bit kernels.

- Fix a corner case where st_ino/st_dev was not preserved across copy
up.

- Miscellaneous fixes and cleanups.

* tag 'ovl-update-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: fix lseek overflow on 32bit
ovl: add splice file read write helper
ovl: implement async IO routines
vfs: add vfs_iocb_iter_[read|write] helper functions
ovl: layer is const
ovl: fix corner case of non-constant st_dev;st_ino
ovl: fix corner case of conflicting lower layer uuid
ovl: generalize the lower_fs[] array
ovl: simplify ovl_same_sb() helper
ovl: generalize the lower_layers[] array
ovl: improving copy-up efficiency for big sparse file
ovl: use ovl_inode_lock in ovl_llseek()
ovl: use pr_fmt auto generate prefix
ovl: fix wrong WARN_ON() in ovl_cache_update_ino()

Linus Torvalds
2020-02-04 19:45:21 +0800

24 Jan, 2020

2 commits

5dcdc43e2 vfs: add vfs_iocb_iter_[read|write] helper functions ... Browse Code »

This doesn't cause any behavior changes and will be used by overlay async
IO implementation.

Signed-off-by: Jiufei Xue
Signed-off-by: Miklos Szeredi

Jiufei Xue
2020-01-24 16:46:46 +0800
a5e6ea18e fs: allow deduplication of eof block into the end of the destination file ... Browse Code »

We always round down, to a multiple of the filesystem's block size, the
length to deduplicate at generic_remap_check_len(). However this is only
needed if an attempt to deduplicate the last block into the middle of the
destination file is requested, since that leads into a corruption if the
length of the source file is not block size aligned. When an attempt to
deduplicate the last block into the end of the destination file is
requested, we should allow it because it is safe to do it - there's no
stale data exposure and we are prepared to compare the data ranges for
a length not aligned to the block (or page) size - in fact we even do
the data compare before adjusting the deduplication length.

After btrfs was updated to use the generic helpers from VFS (by commit
34a28e3d77535e ("Btrfs: use generic_remap_file_range_prep() for cloning
and deduplication")) we started to have user reports of deduplication
not reflinking the last block anymore, and whence users getting lower
deduplication scores. The main use case is deduplication of entire
files that have a size not aligned to the block size of the filesystem.

We already allow cloning the last block to the end (and beyond) of the
destination file, so allow for deduplication as well.

Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/
CC: stable@vger.kernel.org # 5.1+
Reviewed-by: Josef Bacik
Reviewed-by: Darrick J. Wong
Signed-off-by: Filipe Manana
Signed-off-by: David Sterba

Filipe Manana
2020-01-24 01:20:48 +0800

17 Aug, 2019

1 commit

edc58dd01 vfs: fix page locking deadlocks when deduping files ... Browse Code »

When dedupe wants to use the page cache to compare parts of two files
for dedupe, we must be very careful to handle locking correctly. The
current code doesn't do this. It must lock and unlock the page only
once if the two pages are the same, since the overlapping range check
doesn't catch this when blocksize < pagesize. If the pages are distinct
but from the same file, we must observe page locking order and lock them
in order of increasing offset to avoid clashing with writeback locking.

Fixes: 876bec6f9bbfcb3 ("vfs: refactor clone/dedupe_file_range common functions")
Signed-off-by: Darrick J. Wong
Reviewed-by: Bill O'Donnell
Reviewed-by: Matthew Wilcox (Oracle)

Darrick J. Wong
2019-08-17 09:43:24 +0800

10 Jun, 2019

6 commits

5dae222a5 vfs: allow copy_file_range to copy across devices ... Browse Code »

We want to enable cross-filesystem copy_file_range functionality
where possible, so push the "same superblock only" checks down to
the individual filesystem callouts so they can make their own
decisions about cross-superblock copy offload and fallack to
generic_copy_file_range() for cross-superblock copy.

[Amir] We do not call ->remap_file_range() in case the files are not
on the same sb and do not call ->copy_file_range() in case the files
do not belong to the same filesystem driver.

This changes behavior of the copy_file_range(2) syscall, which will
now allow cross filesystem in-kernel copy. CIFS already supports
cross-superblock copy, between two shares to the same server. This
functionality will now be available via the copy_file_range(2) syscall.

Cc: Steve French
Signed-off-by: Dave Chinner
Signed-off-by: Amir Goldstein
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Amir Goldstein
2019-06-10 01:06:20 +0800
e38f7f53c vfs: introduce file_modified() helper ... Browse Code »

The combination of file_remove_privs() and file_update_mtime() is
quite common in filesystem ->write_iter() methods.

Modelled after the helper file_accessed(), introduce file_modified()
and use it from generic_remap_file_range_prep().

Note that the order of calling file_remove_privs() before
file_update_mtime() in the helper was matched to the more common order by
filesystems and not the current order in generic_remap_file_range_prep().

Signed-off-by: Amir Goldstein
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Amir Goldstein
2019-06-10 01:06:19 +0800
96e6e8f4a vfs: add missing checks to copy_file_range ... Browse Code »

Like the clone and dedupe interfaces we've recently fixed, the
copy_file_range() implementation is missing basic sanity, limits and
boundary condition tests on the parameters that are passed to it
from userspace. Create a new "generic_copy_file_checks()" function
modelled on the generic_remap_checks() function to provide this
missing functionality.

[Amir] Shorten copy length instead of checking pos_in limits
because input file size already abides by the limits.

Signed-off-by: Dave Chinner
Signed-off-by: Amir Goldstein
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Amir Goldstein
2019-06-10 01:06:19 +0800
a31713517 vfs: introduce generic_file_rw_checks() ... Browse Code »

Factor out helper with some checks on in/out file that are
common to clone_file_range and copy_file_range.

Suggested-by: Darrick J. Wong
Signed-off-by: Amir Goldstein
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Amir Goldstein
2019-06-10 01:06:19 +0800
64bf5ff58 vfs: no fallback for ->copy_file_range ... Browse Code »

Now that we have generic_copy_file_range(), remove it as a fallback
case when offloads fail. This puts the responsibility for executing
fallbacks on the filesystems that implement ->copy_file_range and
allows us to add operational validity checks to
generic_copy_file_range().

Rework vfs_copy_file_range() to call a new do_copy_file_range()
helper to execute the copying callout, and move calls to
generic_file_copy_range() into filesystem methods where they
currently return failures.

[Amir] overlayfs is not responsible of executing the fallback.
It is the responsibility of the underlying filesystem.

Signed-off-by: Dave Chinner
Signed-off-by: Amir Goldstein
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Dave Chinner
2019-06-10 01:06:19 +0800
f16acc9d9 vfs: introduce generic_copy_file_range() ... Browse Code »

Right now if vfs_copy_file_range() does not use any offload
mechanism, it falls back to calling do_splice_direct(). This fails
to do basic sanity checks on the files being copied. Before we
start adding this necessarily functionality to the fallback path,
separate it out into generic_copy_file_range().

generic_copy_file_range() has the same prototype as
->copy_file_range() so that filesystems can use it in their custom
->copy_file_range() method if they so choose.

Signed-off-by: Dave Chinner
Signed-off-by: Amir Goldstein
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Dave Chinner
2019-06-10 01:06:18 +0800

06 May, 2019

1 commit

438ab720c vfs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files ... Browse Code »

This amends commit 10dce8af3422 ("fs: stream_open - opener for
stream-like files so that read and write can run simultaneously without
deadlock") in how position is passed into .read()/.write() handler for
stream-like files:

Rasmus noticed that we currently pass 0 as position and ignore any position
change if that is done by a file implementation. This papers over bugs if ppos
is used in files that declare themselves as being stream-like as such bugs will
go unnoticed. Even if a file implementation is correctly converted into using
stream_open, its read/write later could be changed to use ppos and even though
that won't be working correctly, that bug might go unnoticed without someone
doing wrong behaviour analysis. It is thus better to pass ppos=NULL into
read/write for stream-like files as that don't give any chance for ppos usage
bugs because it will oops if ppos is ever used inside .read() or .write().

Note 1: rw_verify_area, new_sync_{read,write} needs to be updated
because they are called by vfs_read/vfs_write & friends before
file_operations .read/.write .

Note 2: if file backend uses new-style .read_iter/.write_iter, position
is still passed into there as non-pointer kiocb.ki_pos . Currently
stream_open.cocci (semantic patch added by 10dce8af3422) ignores files
whose file_operations has *_iter methods.

Suggested-by: Rasmus Villemoes
Signed-off-by: Kirill Smelkov

Kirill Smelkov
2019-05-06 22:46:52 +0800

07 Apr, 2019

1 commit

10dce8af3 fs: stream_open - opener for stream-like files so that read and write can run si… ... Browse Code »

…multaneously without deadlock

Commit 9c225f2655e3 ("vfs: atomic f_pos accesses as per POSIX") added
locking for file.f_pos access and in particular made concurrent read and
write not possible - now both those functions take f_pos lock for the
whole run, and so if e.g. a read is blocked waiting for data, write will
deadlock waiting for that read to complete.

This caused regression for stream-like files where previously read and
write could run simultaneously, but after that patch could not do so
anymore. See e.g. commit 581d21a2d02a ("xenbus: fix deadlock on writes
to /proc/xen/xenbus") which fixes such regression for particular case of
/proc/xen/xenbus.

The patch that added f_pos lock in 2014 did so to guarantee POSIX thread
safety for read/write/lseek and added the locking to file descriptors of
all regular files. In 2014 that thread-safety problem was not new as it
was already discussed earlier in 2006.

However even though 2006'th version of Linus's patch was adding f_pos
locking "only for files that are marked seekable with FMODE_LSEEK (thus
avoiding the stream-like objects like pipes and sockets)", the 2014
version - the one that actually made it into the tree as 9c225f2655e3 -
is doing so irregardless of whether a file is seekable or not.

See

https://lore.kernel.org/lkml/53022DB1.4070805@gmail.com/
https://lwn.net/Articles/180387
https://lwn.net/Articles/180396

for historic context.

The reason that it did so is, probably, that there are many files that
are marked non-seekable, but e.g. their read implementation actually
depends on knowing current position to correctly handle the read. Some
examples:

kernel/power/user.c snapshot_read
fs/debugfs/file.c u32_array_read
fs/fuse/control.c fuse_conn_waiting_read + ...
drivers/hwmon/asus_atk0110.c atk_debugfs_ggrp_read
arch/s390/hypfs/inode.c hypfs_read_iter
...

Despite that, many nonseekable_open users implement read and write with
pure stream semantics - they don't depend on passed ppos at all. And for
those cases where read could wait for something inside, it creates a
situation similar to xenbus - the write could be never made to go until
read is done, and read is waiting for some, potentially external, event,
for potentially unbounded time -> deadlock.

Besides xenbus, there are 14 such places in the kernel that I've found
with semantic patch (see below):

drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write()
drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write()
drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write()
drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write()
net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write()
drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write()
drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write()
drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write()
net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write()
drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write()
drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write()
drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write()
drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write()
drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write()

In addition to the cases above another regression caused by f_pos
locking is that now FUSE filesystems that implement open with
FOPEN_NONSEEKABLE flag, can no longer implement bidirectional
stream-like files - for the same reason as above e.g. read can deadlock
write locking on file.f_pos in the kernel.

FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990f715 ("fuse:
implement nonseekable open") to support OSSPD. OSSPD implements /dev/dsp
in userspace with FOPEN_NONSEEKABLE flag, with corresponding read and
write routines not depending on current position at all, and with both
read and write being potentially blocking operations:

See

https://github.com/libfuse/osspd
https://lwn.net/Articles/308445

https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406
https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477
https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510

Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as
"somewhat pipe-like files ..." with read handler not using offset.
However that test implements only read without write and cannot exercise
the deadlock scenario:

https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131
https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163
https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216

I've actually hit the read vs write deadlock for real while implementing
my FUSE filesystem where there is /head/watch file, for which open
creates separate bidirectional socket-like stream in between filesystem
and its user with both read and write being later performed
simultaneously. And there it is semantically not easy to split the
stream into two separate read-only and write-only channels:

https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169

Let's fix this regression. The plan is:

1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS -
doing so would break many in-kernel nonseekable_open users which
actually use ppos in read/write handlers.

2. Add stream_open() to kernel to open stream-like non-seekable file
descriptors. Read and write on such file descriptors would never use
nor change ppos. And with that property on stream-like files read and
write will be running without taking f_pos lock - i.e. read and write
could be running simultaneously.

3. With semantic patch search and convert to stream_open all in-kernel
nonseekable_open users for which read and write actually do not
depend on ppos and where there is no other methods in file_operations
which assume @offset access.

4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via
steam_open if that bit is present in filesystem open reply.

It was tempting to change fs/fuse/ open handler to use stream_open
instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but
grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE,
and in particular GVFS which actually uses offset in its read and
write handlers

https://codesearch.debian.net/search?q=-%3Enonseekable+%3D
https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080
https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346
https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481

so if we would do such a change it will break a real user.

5. Add stream_open and FOPEN_STREAM handling to stable kernels starting
from v3.14+ (the kernel where 9c225f2655 first appeared).

This will allow to patch OSSPD and other FUSE filesystems that
provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE
in their open handler and this way avoid the deadlock on all kernel
versions. This should work because fs/fuse/ ignores unknown open
flags returned from a filesystem and so passing FOPEN_STREAM to a
kernel that is not aware of this flag cannot hurt. In turn the kernel
that is not aware of FOPEN_STREAM will be < v3.14 where just
FOPEN_NONSEEKABLE is sufficient to implement streams without read vs
write deadlock.

This patch adds stream_open, converts /proc/xen/xenbus to it and adds
semantic patch to automatically locate in-kernel places that are either
required to be converted due to read vs write deadlock, or that are just
safe to be converted because read and write do not use ppos and there
are no other funky methods in file_operations.

Regarding semantic patch I've verified each generated change manually -
that it is correct to convert - and each other nonseekable_open instance
left - that it is either not correct to convert there, or that it is not
converted due to current stream_open.cocci limitations.

The script also does not convert files that should be valid to convert,
but that currently have .llseek = noop_llseek or generic_file_llseek for
unknown reason despite file being opened with nonseekable_open (e.g.
drivers/input/mousedev.c)

Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Yongzhi Pan <panyongzhi@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Tejun Heo <tj@kernel.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Julia Lawall <Julia.Lawall@lip6.fr>
Cc: Nikolaus Rath <Nikolaus@rath.org>
Cc: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Kirill Smelkov
2019-04-07 01:01:55 +0800

13 Mar, 2019

1 commit

5f739e4a4 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull misc vfs updates from Al Viro:
"Assorted fixes (really no common topic here)"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vfs: Make __vfs_write() static
vfs: fix preadv64v2 and pwritev64v2 compat syscalls with offset == -1
pipe: stop using ->can_merge
splice: don't merge into linked buffers
fs: move generic stat response attr handling to vfs_getattr_nosec
orangefs: don't reinitialize result_mask in ->getattr
fs/devpts: always delete dcache dentry-s in dput()

Linus Torvalds
2019-03-13 04:27:20 +0800

05 Mar, 2019

1 commit

736706bee get rid of legacy 'get_ds()' function ... Browse Code »

Every in-kernel use of this function defined it to KERNEL_DS (either as
an actual define, or as an inline function). It's an entirely
historical artifact, and long long long ago used to actually read the
segment selector valueof '%ds' on x86.

Which in the kernel is always KERNEL_DS.

Inspired by a patch from Jann Horn that just did this for a very small
subset of users (the ones in fs/), along with Al who suggested a script.
I then just took it to the logical extreme and removed all the remaining
gunk.

Roughly scripted with

git grep -l '(get_ds())' -- :^tools/ | xargs sed -i 's/(get_ds())/(KERNEL_DS)/'
git grep -lw 'get_ds' -- :^tools/ | xargs sed -i '/^#define get_ds()/d'

plus manual fixups to remove a few unusual usage patterns, the couple of
inline function cases and to fix up a comment that had become stale.

The 'get_ds()' function remains in an x86 kvm selftest, since in user
space it actually does something relevant.

Inspired-by: Jann Horn
Inspired-by: Al Viro
Signed-off-by: Linus Torvalds

Linus Torvalds
2019-03-05 02:50:14 +0800

22 Feb, 2019

1 commit

12e1e7af1 vfs: Make __vfs_write() static ... Browse Code »

__vfs_write() was unexported, and removed from , but
forgotten to be made static.

Fixes: eb031849d52e61d2 ("fs: unexport __vfs_read/__vfs_write")
Signed-off-by: Geert Uytterhoeven
Signed-off-by: Al Viro

Geert Uytterhoeven
2019-02-22 13:30:05 +0800

16 Feb, 2019

1 commit

cc4b1242d vfs: fix preadv64v2 and pwritev64v2 compat syscalls with offset == -1 ... Browse Code »

The preadv2 and pwritev2 syscalls are supposed to emulate the readv and
writev syscalls when offset == -1. Therefore the compat code should
check for offset before calling do_compat_preadv64 and
do_compat_pwritev64. This is the case for the preadv2 and pwritev2
syscalls, but handling of offset == -1 is missing in their 64-bit
equivalent.

This patch fixes that, calling do_compat_readv and do_compat_writev when
offset == -1. This fixes the following glibc tests on x32:
- misc/tst-preadvwritev2
- misc/tst-preadvwritev64v2

Cc: Alexander Viro
Cc: H.J. Lu
Signed-off-by: Aurelien Jarno
Signed-off-by: Al Viro

Aurelien Jarno
2019-02-16 23:21:52 +0800

04 Jan, 2019

1 commit

96d4f267e Remove 'type' argument from access_ok() function ... Browse Code »

Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
of the user address range verification function since we got rid of the
old racy i386-only code to walk page tables by hand.

It existed because the original 80386 would not honor the write protect
bit when in kernel mode, so you had to do COW by hand before doing any
user access. But we haven't supported that in a long time, and these
days the 'type' argument is a purely historical artifact.

A discussion about extending 'user_access_begin()' to do the range
checking resulted this patch, because there is no way we're going to
move the old VERIFY_xyz interface to that model. And it's best done at
the end of the merge window when I've done most of my merges, so let's
just get this done once and for all.

This patch was mostly done with a sed-script, with manual fix-ups for
the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.

There were a couple of notable cases:

- csky still had the old "verify_area()" name as an alias.

- the iter_iov code had magical hardcoded knowledge of the actual
values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
really used it)

- microblaze used the type argument for a debug printout

but other than those oddities this should be a total no-op patch.

I tried to fix up all architectures, did fairly extensive grepping for
access_ok() uses, and the changes are trivial, but I may have missed
something. Any missed conversion should be trivially fixable, though.

Signed-off-by: Linus Torvalds

Linus Torvalds
2019-01-04 10:57:57 +0800