Eric Lee / smarc-fsl-linux-kernel

09 Mar, 2019

1 commit

38e7571c0 Merge tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-block ... Browse Code »

Pull io_uring IO interface from Jens Axboe:
"Second attempt at adding the io_uring interface.

Since the first one, we've added basic unit testing of the three
system calls, that resides in liburing like the other unit tests that
we have so far. It'll take a while to get full coverage of it, but
we're working towards it. I've also added two basic test programs to
tools/io_uring. One uses the raw interface and has support for all the
various features that io_uring supports outside of standard IO, like
fixed files, fixed IO buffers, and polled IO. The other uses the
liburing API, and is a simplified version of cp(1).

This adds support for a new IO interface, io_uring.

io_uring allows an application to communicate with the kernel through
two rings, the submission queue (SQ) and completion queue (CQ) ring.
This allows for very efficient handling of IOs, see the v5 posting for
some basic numbers:

https://lore.kernel.org/linux-block/20190116175003.17880-1-axboe@kernel.dk/

Outside of just efficiency, the interface is also flexible and
extendable, and allows for future use cases like the upcoming NVMe
key-value store API, networked IO, and so on. It also supports async
buffered IO, something that we've always failed to support in the
kernel.

Outside of basic IO features, it supports async polled IO as well.
This particular feature has already been tested at Facebook months ago
for flash storage boxes, with 25-33% improvements. It makes polled IO
actually useful for real world use cases, where even basic flash sees
a nice win in terms of efficiency, latency, and performance. These
boxes were IOPS bound before, now they are not.

This series adds three new system calls. One for setting up an
io_uring instance (io_uring_setup(2)), one for submitting/completing
IO (io_uring_enter(2)), and one for aux functions like registrating
file sets, buffers, etc (io_uring_register(2)). Through the help of
Arnd, I've coordinated the syscall numbers so merge on that front
should be painless.

Jon did a writeup of the interface a while back, which (except for
minor details that have been tweaked) is still accurate. Find that
here:

https://lwn.net/Articles/776703/

Huge thanks to Al Viro for helping getting the reference cycle code
correct, and to Jann Horn for his extensive reviews focused on both
security and bugs in general.

There's a userspace library that provides basic functionality for
applications that don't need or want to care about how to fiddle with
the rings directly. It has helpers to allow applications to easily set
up an io_uring instance, and submit/complete IO through it without
knowing about the intricacies of the rings. It also includes man pages
(thanks to Jeff Moyer), and will continue to grow support helper
functions and features as time progresses. Find it here:

git://git.kernel.dk/liburing

Fio has full support for the raw interface, both in the form of an IO
engine (io_uring), but also with a small test application (t/io_uring)
that can exercise and benchmark the interface"

* tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-block:
io_uring: add a few test tools
io_uring: allow workqueue item to handle multiple buffered requests
io_uring: add support for IORING_OP_POLL
io_uring: add io_kiocb ref count
io_uring: add submission polling
io_uring: add file set registration
net: split out functions related to registering inflight socket files
io_uring: add support for pre-mapped user IO buffers
block: implement bio helper to add iter bvec pages to bio
io_uring: batch io_kiocb allocation
io_uring: use fget/fput_many() for file references
fs: add fget_many() and fput_many()
io_uring: support for IO polling
io_uring: add fsync support
Add io_uring IO interface

Linus Torvalds
2019-03-09 06:48:40 +0800

06 Mar, 2019

1 commit

5704a0681 fs/file.c: initialize init_files.resize_wait ... Browse Code »

(Taken from https://bugzilla.kernel.org/show_bug.cgi?id=200647)

'get_unused_fd_flags' in kthread cause kernel crash. It works fine on
4.1, but causes crash after get 64 fds. It also cause crash on
ubuntu1404/1604/1804, centos7.5, and the crash messages are almost the
same.

The crash message on centos7.5 shows below:

start fd 61
start fd 62
start fd 63
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: __wake_up_common+0x2e/0x90
PGD 0
Oops: 0000 [#1] SMP
Modules linked in: test(OE) xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter devlink sunrpc kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd sg ppdev pcspkr virtio_balloon parport_pc parport i2c_piix4 joydev ip_tables xfs libcrc32c sr_mod cdrom sd_mod crc_t10dif crct10dif_generic ata_generic pata_acpi virtio_scsi virtio_console virtio_net cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm crct10dif_pclmul crct10dif_common crc32c_intel drm ata_piix serio_raw libata virtio_pci virtio_ring i2c_core
virtio floppy dm_mirror dm_region_hash dm_log dm_mod
CPU: 2 PID: 1820 Comm: test_fd Kdump: loaded Tainted: G OE ------------ 3.10.0-862.3.3.el7.x86_64 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
task: ffff8e92b9431fa0 ti: ffff8e94247a0000 task.ti: ffff8e94247a0000
RIP: 0010:__wake_up_common+0x2e/0x90
RSP: 0018:ffff8e94247a2d18 EFLAGS: 00010086
RAX: 0000000000000000 RBX: ffffffff9d09daa0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffffffff9d09daa0
RBP: ffff8e94247a2d50 R08: 0000000000000000 R09: ffff8e92b95dfda8
R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff9d09daa8
R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000003
FS: 0000000000000000(0000) GS:ffff8e9434e80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000017c686000 CR4: 00000000000207e0
Call Trace:
__wake_up+0x39/0x50
expand_files+0x131/0x250
__alloc_fd+0x47/0x170
get_unused_fd_flags+0x30/0x40
test_fd+0x12a/0x1c0 [test]
kthread+0xd1/0xe0
ret_from_fork_nospec_begin+0x21/0x21
Code: 66 90 55 48 89 e5 41 57 41 89 f7 41 56 41 89 ce 41 55 41 54 49 89 fc 49 83 c4 08 53 48 83 ec 10 48 8b 47 08 89 55 cc 4c 89 45 d0 8b 08 49 39 c4 48 8d 78 e8 4c 8d 69 e8 75 08 eb 3b 4c 89 ef
RIP __wake_up_common+0x2e/0x90
RSP
CR2: 0000000000000000

This issue exists since CentOS 7.5 3.10.0-862 and CentOS 7.4
(3.10.0-693.21.1 ) is ok. Root cause: the item 'resize_wait' is not
initialized before being used.

Reported-by: Richard Zhang
Reviewed-by: Andrew Morton
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Shuriyc Chu
2019-03-06 13:07:14 +0800

28 Feb, 2019

1 commit

091141a42 fs: add fget_many() and fput_many() ... Browse Code »

Some uses cases repeatedly get and put references to the same file, but
the only exposed interface is doing these one at the time. As each of
these entail an atomic inc or dec on a shared structure, that cost can
add up.

Add fget_many(), which works just like fget(), except it takes an
argument for how many references to get on the file. Ditto fput_many(),
which can drop an arbitrary number of references to a file.

Reviewed-by: Hannes Reinecke
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Jens Axboe
2019-02-28 23:24:23 +0800

29 Dec, 2018

1 commit

457fa3469 Merge tag 'char-misc-4.21-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc ... Browse Code »

Pull char/misc driver updates from Greg KH:
"Here is the big set of char and misc driver patches for 4.21-rc1.

Lots of different types of driver things in here, as this tree seems
to be the "collection of various driver subsystems not big enough to
have their own git tree" lately.

Anyway, some highlights of the changes in here:

- binderfs: is it a rule that all driver subsystems will eventually
grow to have their own filesystem? Binder now has one to handle the
use of it in containerized systems.

This was discussed at the Plumbers conference a few months ago and
knocked into mergable shape very fast by Christian Brauner. Who
also has signed up to be another binder maintainer, showing a
distinct lack of good judgement :)

- binder updates and fixes

- mei driver updates

- fpga driver updates and additions

- thunderbolt driver updates

- soundwire driver updates

- extcon driver updates

- nvmem driver updates

- hyper-v driver updates

- coresight driver updates

- pvpanic driver additions and reworking for more device support

- lp driver updates. Yes really, it's _finally_ moved to the proper
parallal port driver model, something I never thought I would see
happen. Good stuff.

- other tiny driver updates and fixes.

All of these have been in linux-next for a while with no reported
issues"

* tag 'char-misc-4.21-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (116 commits)
MAINTAINERS: add another Android binder maintainer
intel_th: msu: Fix an off-by-one in attribute store
stm class: Add a reference to the SyS-T document
stm class: Fix a module refcount leak in policy creation error path
char: lp: use new parport device model
char: lp: properly count the lp devices
char: lp: use first unused lp number while registering
char: lp: detach the device when parallel port is removed
char: lp: introduce list to save port number
bus: qcom: remove duplicated include from qcom-ebi2.c
VMCI: Use memdup_user() rather than duplicating its implementation
char/rtc: Use of_node_name_eq for node name comparisons
misc: mic: fix a DMA pool free failure
ptp: fix an IS_ERR() vs NULL check
genwqe: Fix size check
binder: implement binderfs
binder: fix use-after-free due to ksys_close() during fdget()
bus: fsl-mc: remove duplicated include files
bus: fsl-mc: explicitly define the fsl_mc_command endianness
misc: ti-st: make array read_ver_cmd static, shrinks object size
...

Linus Torvalds
2018-12-29 12:54:57 +0800

19 Dec, 2018

1 commit

80cd79563 binder: fix use-after-free due to ksys_close() during fdget() ... Browse Code »

44d8047f1d8 ("binder: use standard functions to allocate fds")
exposed a pre-existing issue in the binder driver.

fdget() is used in ksys_ioctl() as a performance optimization.
One of the rules associated with fdget() is that ksys_close() must
not be called between the fdget() and the fdput(). There is a case
where this requirement is not met in the binder driver which results
in the reference count dropping to 0 when the device is still in
use. This can result in use-after-free or other issues.

If userpace has passed a file-descriptor for the binder driver using
a BINDER_TYPE_FDA object, then kys_close() is called on it when
handling a binder_ioctl(BC_FREE_BUFFER) command. This violates
the assumptions for using fdget().

The problem is fixed by deferring the close using task_work_add(). A
new variant of __close_fd() was created that returns a struct file
with a reference. The fput() is deferred instead of using ksys_close().

Fixes: 44d8047f1d87a ("binder: use standard functions to allocate fds")
Suggested-by: Al Viro
Signed-off-by: Todd Kjos
Cc: stable
Signed-off-by: Greg Kroah-Hartman

Todd Kjos
2018-12-19 16:40:13 +0800

28 Nov, 2018

1 commit

c93ffc15c fs/file: Replace synchronize_sched() with synchronize_rcu() ... Browse Code »

Now that synchronize_rcu() waits for preempt-disable regions of code
as well as RCU read-side critical sections, synchronize_sched() can be
replaced by synchronize_rcu(). This commit therefore makes this change.

Signed-off-by: Paul E. McKenney
Cc: Alexander Viro
Cc:

Paul E. McKenney
2018-11-28 01:21:39 +0800

03 Apr, 2018

2 commits

2ca2a09d6 fs: add ksys_close() wrapper; remove in-kernel calls to sys_close() ... Browse Code »

Using the ksys_close() wrapper allows us to get rid of in-kernel calls
to the sys_close() syscall. The ksys_ prefix denotes that this function
is meant as a drop-in replacement for the syscall. In particular, it
uses the same calling convention as sys_close(), with one subtle
difference:

The few places which checked the return value did not care about the return
value re-writing in sys_close(), so simply use a wrapper around
__close_fd().

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Al Viro
Cc: Andrew Morton
Signed-off-by: Dominik Brodowski

Dominik Brodowski
2018-04-03 02:16:00 +0800
c7248321a fs: add ksys_dup{,3}() helper; remove in-kernel calls to sys_dup{,3}() ... Browse Code »

Using ksys_dup() and ksys_dup3() as helper functions allows us to
avoid the in-kernel calls to the sys_dup() and sys_dup3() syscalls.
The ksys_ prefix denotes that these functions are meant as a drop-in
replacement for the syscalls. In particular, they use the same
calling convention as sys_dup{,3}().

In the near future, the fs-external callers of ksys_dup{,3}() should be
converted to call do_dup2() directly. Then, ksys_dup{,3}() can be moved
within sys_dup{,3}() again.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Alexander Viro
Signed-off-by: Dominik Brodowski

Dominik Brodowski
2018-04-03 02:15:49 +0800

01 Feb, 2018

1 commit

19e7b5f99 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull misc vfs updates from Al Viro:
"All kinds of misc stuff, without any unifying topic, from various
people.

Neil's d_anon patch, several bugfixes, introduction of kvmalloc
analogue of kmemdup_user(), extending bitfield.h to deal with
fixed-endians, assorted cleanups all over the place..."

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits)
alpha: osf_sys.c: use timespec64 where appropriate
alpha: osf_sys.c: fix put_tv32 regression
jffs2: Fix use-after-free bug in jffs2_iget()'s error handling path
dcache: delete unused d_hash_mask
dcache: subtract d_hash_shift from 32 in advance
fs/buffer.c: fold init_buffer() into init_page_buffers()
fs: fold __inode_permission() into inode_permission()
fs: add RWF_APPEND
sctp: use vmemdup_user() rather than badly open-coding memdup_user()
snd_ctl_elem_init_enum_names(): switch to vmemdup_user()
replace_user_tlv(): switch to vmemdup_user()
new primitive: vmemdup_user()
memdup_user(): switch to GFP_USER
eventfd: fold eventfd_ctx_get() into eventfd_ctx_fileget()
eventfd: fold eventfd_ctx_read() into eventfd_read()
eventfd: convert to use anon_inode_getfd()
nfs4file: get rid of pointless include of btrfs.h
uvc_v4l2: clean copyin/copyout up
vme_user: don't use __copy_..._user()
usx2y: don't bother with memdup_user() for 16-byte structure
...

Linus Torvalds
2018-02-01 01:25:20 +0800

05 Dec, 2017

2 commits

d6b4dcf5c fs/file.c: trim includes ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2017-12-05 22:41:03 +0800
388a4c880 fs: Eliminate cond_resched_rcu_qs() in favor of cond_resched() ... Browse Code »

Now that cond_resched() also provides RCU quiescent states when
needed, it can be used in place of cond_resched_rcu_qs(). This
commit therefore makes this change.

Signed-off-by: Paul E. McKenney
Cc: Alexander Viro
Cc:

Paul E. McKenney
2017-12-05 02:28:59 +0800

18 Nov, 2017

1 commit

ca5b857cb Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull misc vfs updates from Al Viro:
"Assorted stuff, really no common topic here"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vfs: grab the lock instead of blocking in __fd_install during resizing
vfs: stop clearing close on exec when closing a fd
include/linux/fs.h: fix comment about struct address_space
fs: make fiemap work from compat_ioctl
coda: fix 'kernel memory exposure attempt' in fsync
pstore: remove unneeded unlikely()
vfs: remove unneeded unlikely()
stubs for mount_bdev() and kill_block_super() in !CONFIG_BLOCK case
make vfs_ustat() static
do_handle_open() should be static
elf_fdpic: fix unused variable warning
fold destroy_super() into __put_super()
new helper: destroy_unused_super()
fix address space warnings in ipc/
acct.h: get rid of detritus

Linus Torvalds
2017-11-18 04:54:01 +0800

06 Nov, 2017

2 commits

c02b1a9b4 vfs: grab the lock instead of blocking in __fd_install during resizing ... Browse Code »

Explicit locking in the fallback case provides a safe state of the
table. Getting rid of blocking semantics makes __fd_install usable
again in non-sleepable contexts, which easies backporting efforts.

There is a side effect of slightly nicer assembly for the common case
as might_sleep can now be removed.

Signed-off-by: Mateusz Guzik
Signed-off-by: Al Viro

Mateusz Guzik
2017-11-06 07:58:07 +0800
529790827 vfs: stop clearing close on exec when closing a fd ... Browse Code »

Codepaths allocating a fd always make sure the bit is set/unset
depending on flags, thus clearing on close is redundant.

Signed-off-by: Mateusz Guzik
Signed-off-by: Al Viro

Mateusz Guzik
2017-11-06 07:58:06 +0800

02 Nov, 2017

1 commit

b24413180 License cleanup: add SPDX GPL-2.0 license identifier to files with no license ... Browse Code »

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if
Reviewed-by: Philippe Ombredanne
Reviewed-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman

Greg Kroah-Hartman
2017-11-02 18:10:55 +0800

07 Jul, 2017

1 commit

c823bd924 fs/file.c: replace alloc_fdmem() with kvmalloc() alternative ... Browse Code »

There is no real reason to duplicate kvmalloc* helpers so drop
alloc_fdmem and replace it with the appropriate library function.

Link: http://lkml.kernel.org/r/20170531155145.17111-2-mhocko@kernel.org
Signed-off-by: Michal Hocko
Cc: Alexander Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2017-07-07 07:24:30 +0800

09 May, 2017

1 commit

19809c2da mm, vmalloc: use __GFP_HIGHMEM implicitly ... Browse Code »

__vmalloc* allows users to provide gfp flags for the underlying
allocation. This API is quite popular

$ git grep "=[[:space:]]__vmalloc\|return[[:space:]]*__vmalloc" | wc -l
77

The only problem is that many people are not aware that they really want
to give __GFP_HIGHMEM along with other flags because there is really no
reason to consume precious lowmemory on CONFIG_HIGHMEM systems for pages
which are mapped to the kernel vmalloc space. About half of users don't
use this flag, though. This signals that we make the API unnecessarily
too complex.

This patch simply uses __GFP_HIGHMEM implicitly when allocating pages to
be mapped to the vmalloc space. Current users which add __GFP_HIGHMEM
are simplified and drop the flag.

Link: http://lkml.kernel.org/r/20170307141020.29107-1-mhocko@kernel.org
Signed-off-by: Michal Hocko
Reviewed-by: Matthew Wilcox
Cc: Al Viro
Cc: Vlastimil Babka
Cc: David Rientjes
Cc: Cristopher Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2017-05-09 08:15:13 +0800

02 Mar, 2017

1 commit

3f07c0144 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/signal.h> ... Browse Code »

We are going to split out of , which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder file that just
maps to to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar

Ingo Molnar
2017-03-02 15:42:29 +0800

28 Sep, 2016

1 commit

9b80a184e fs/file: more unsigned file descriptors ... Browse Code »

Propagate unsignedness for grand total of 149 bytes:

$ ./scripts/bloat-o-meter ../vmlinux-000 ../obj/vmlinux
add/remove: 0/0 grow/shrink: 0/10 up/down: 0/-149 (-149)
function old new delta
set_close_on_exec 99 98 -1
put_files_struct 201 200 -1
get_close_on_exec 59 58 -1
do_prlimit 498 497 -1
do_execveat_common.isra 1662 1661 -1
__close_fd 178 173 -5
do_dup2 219 204 -15
seq_show 685 660 -25
__alloc_fd 384 357 -27
dup_fd 718 646 -72

It mostly comes from converting "unsigned int" to "long" for bit operations.

Signed-off-by: Alexey Dobriyan
Signed-off-by: Al Viro

Alexey Dobriyan
2016-09-28 06:47:38 +0800

03 May, 2016

1 commit

63b6df141 give readdir(2)/getdents(2)/etc. uniform exclusion with lseek() ... Browse Code »

same as read() on regular files has, and for the same reason.

Signed-off-by: Al Viro

Al Viro
2016-05-03 07:49:28 +0800

15 Jan, 2016

1 commit

5d097056c kmemcg: account certain kmem allocations to memcg ... Browse Code »

Mark those kmem allocations that are known to be easily triggered from
userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
memcg. For the list, see below:

- threadinfo
- task_struct
- task_delay_info
- pid
- cred
- mm_struct
- vm_area_struct and vm_region (nommu)
- anon_vma and anon_vma_chain
- signal_struct
- sighand_struct
- fs_struct
- files_struct
- fdtable and fdtable->full_fds_bits
- dentry and external_name
- inode for all filesystems. This is the most tedious part, because
most filesystems overwrite the alloc_inode method.

The list is far from complete, so feel free to add more objects.
Nevertheless, it should be close to "account everything" approach and
keep most workloads within bounds. Malevolent users will be able to
breach the limit, but this was possible even with the former "account
everything" approach (simply because it did not account everything in
fact).

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Vladimir Davydov
Acked-by: Johannes Weiner
Acked-by: Michal Hocko
Cc: Tejun Heo
Cc: Greg Thelen
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vladimir Davydov
2016-01-15 08:00:49 +0800

07 Dec, 2015

1 commit

752343be6 fs/file.c: __const_max is actually __const_min :-) ... Browse Code »

7f4b36f9bb930 "get rid of files_defer_init()" inexplicably changed a
min() to a __const_max() - but the __const_max macro actually gives
the minimum... So no functional change, just less confusing naming.

Signed-off-by: Rasmus Villemoes
Signed-off-by: Al Viro

Rasmus Villemoes
2015-12-07 10:17:11 +0800

06 Nov, 2015

1 commit

ea5c58e70 vfs: clear remainder of 'full_fds_bits' in dup_fd() ... Browse Code »

This fixes a bug from commit f3f86e33dc3d ("vfs: Fix pathological
performance case for __alloc_fd()").

v2: refactor to share fd bitmap copying code
Signed-off-by: Eric Biggers
Signed-off-by: Linus Torvalds

Eric Biggers
2015-11-06 15:05:32 +0800

01 Nov, 2015

2 commits

fc90888d0 vfs: conditionally clear close-on-exec flag ... Browse Code »

We clear the close-on-exec flag when opening and closing files, and the
bit was almost always already clear before. Avoid dirtying the
cacheline if the clearning isn't necessary. That avoids unnecessary
cacheline dirtying and bouncing in multi-socket environments.

Eric Dumazet has a file descriptor benchmark that goes 4% faster from
this on his two-socket machine. It's probably partly superlinear
improvement due to getting slightly less spinlock contention on the
file_lock spinlock due to less work in the critical section.

Tested-by: Eric Dumazet
Signed-off-by: Linus Torvalds

Linus Torvalds
2015-11-01 07:14:51 +0800
f3f86e33d vfs: Fix pathological performance case for __alloc_fd() ... Browse Code »

Al Viro points out that:
> > * [Linux-specific aside] our __alloc_fd() can degrade quite badly
> > with some use patterns. The cacheline pingpong in the bitmap is probably
> > inevitable, unless we accept considerably heavier memory footprint,
> > but we also have a case when alloc_fd() takes O(n) and it's _not_ hard
> > to trigger - close(3);open(...); will have the next open() after that
> > scanning the entire in-use bitmap.

And Eric Dumazet has a somewhat realistic multithreaded microbenchmark
that opens and closes a lot of sockets with minimal work per socket.

This patch largely fixes it. We keep a 2nd-level bitmap of the open
file bitmaps, showing which words are already full. So then we can
traverse that second-level bitmap to efficiently skip already allocated
file descriptors.

On his benchmark, this improves performance by up to an order of
magnitude, by avoiding the excessive open file bitmap scanning.

Tested-and-acked-by: Eric Dumazet
Cc: Al Viro
Signed-off-by: Linus Torvalds

Linus Torvalds
2015-11-01 07:12:10 +0800

01 Jul, 2015

2 commits

5ba97d283 fs/file.c: __fget() and dup2() atomicity rules ... Browse Code »

__fget() does lockless fetch of pointer from the descriptor
table, attempts to grab a reference and treats "it was already
zero" as "it's already gone from the table, we just hadn't
seen the store, let's fail". Unfortunately, that breaks the
atomicity of dup2() - __fget() might see the old pointer,
notice that it's been already dropped and treat that as
"it's closed". What we should be getting is either the
old file or new one, depending whether we come before or after
dup2().

Dmitry had following test failing sometimes :

int fd;
void *Thread(void *x) {
char buf;
int n = read(fd, &buf, 1);
if (n != 1)
exit(printf("read failed: n=%d errno=%d\n", n, errno));
return 0;
}

int main()
{
fd = open("/dev/urandom", O_RDONLY);
int fd2 = open("/dev/urandom", O_RDONLY);
if (fd == -1 || fd2 == -1)
exit(printf("open failed\n"));
pthread_t th;
pthread_create(&th, 0, Thread, 0);
if (dup2(fd2, fd) == -1)
exit(printf("dup2 failed\n"));
pthread_join(th, 0);
if (close(fd) == -1)
exit(printf("close failed\n"));
if (close(fd2) == -1)
exit(printf("close failed\n"));
printf("DONE\n");
return 0;
}

Signed-off-by: Eric Dumazet
Reported-by: Dmitry Vyukov
Signed-off-by: Al Viro

Eric Dumazet
2015-07-01 14:31:08 +0800
8a81252b7 fs/file.c: don't acquire files->file_lock in fd_install() ... Browse Code »

Mateusz Guzik reported :

Currently obtaining a new file descriptor results in locking fdtable
twice - once in order to reserve a slot and second time to fill it.

Holding the spinlock in __fd_install() is needed in case a resize is
done, or to prevent a resize.

Mateusz provided an RFC patch and a micro benchmark :
http://people.redhat.com/~mguzik/pipebench.c

A resize is an unlikely operation in a process lifetime,
as table size is at least doubled at every resize.

We can use RCU instead of the spinlock.

__fd_install() must wait if a resize is in progress.

The resize must block new __fd_install() callers from starting,
and wait that ongoing install are finished (synchronize_sched())

resize should be attempted by a single thread to not waste resources.

rcu_sched variant is used, as __fd_install() and expand_fdtable() run
from process context.

It gives us a ~30% speedup using pipebench on a dual Intel(R) Xeon(R)
CPU E5-2696 v2 @ 2.50GHz

Signed-off-by: Eric Dumazet
Reported-by: Mateusz Guzik
Acked-by: Mateusz Guzik
Tested-by: Mateusz Guzik
Signed-off-by: Al Viro

Eric Dumazet
2015-07-01 14:30:09 +0800

17 Apr, 2015

1 commit

90f31d0ea mm: rcu-protected get_mm_exe_file() ... Browse Code »

This patch removes mm->mmap_sem from mm->exe_file read side.
Also it kills dup_mm_exe_file() and moves exe_file duplication into
dup_mmap() where both mmap_sems are locked.

[akpm@linux-foundation.org: fix comment typo]
Signed-off-by: Konstantin Khlebnikov
Cc: Davidlohr Bueso
Cc: Al Viro
Cc: Oleg Nesterov
Cc: "Paul E. McKenney"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Konstantin Khlebnikov
2015-04-17 21:04:07 +0800

11 Dec, 2014

1 commit

8d10a0358 fs/file.c: replace get_unused_fd() with get_unused_fd_flags(0) ... Browse Code »

This patch replaces calls to get_unused_fd() with equivalent call to
get_unused_fd_flags(0) to preserve current behavor for existing code.

In a further patch, get_unused_fd() will be removed so that new code
start using get_unused_fd_flags(), with the hope O_CLOEXEC could be
used, either by default or choosen by userspace.

Signed-off-by: Yann Droneaud
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Yann Droneaud
2014-12-11 09:41:10 +0800

13 Oct, 2014

1 commit

d6dd50e07 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull RCU updates from Ingo Molnar:
"The main changes in this cycle were:

- changes related to No-CBs CPUs and NO_HZ_FULL

- RCU-tasks implementation

- torture-test updates

- miscellaneous fixes

- locktorture updates

- RCU documentation updates"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (81 commits)
workqueue: Use cond_resched_rcu_qs macro
workqueue: Add quiescent state between work items
locktorture: Cleanup header usage
locktorture: Cannot hold read and write lock
locktorture: Fix __acquire annotation for spinlock irq
locktorture: Support rwlocks
rcu: Eliminate deadlock between CPU hotplug and expedited grace periods
locktorture: Document boot/module parameters
rcutorture: Rename rcutorture_runnable parameter
locktorture: Add test scenario for rwsem_lock
locktorture: Add test scenario for mutex_lock
locktorture: Make torture scripting account for new _runnable name
locktorture: Introduce torture context
locktorture: Support rwsems
locktorture: Add infrastructure for torturing read locks
torture: Address race in module cleanup
locktorture: Make statistics generic
locktorture: Teach about lock debugging
locktorture: Support mutexes
locktorture: Add documentation
...

Linus Torvalds
2014-10-13 21:44:12 +0800

09 Oct, 2014

1 commit

e983094d6 missing annotation in fs/file.c ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2014-10-09 14:39:11 +0800

08 Sep, 2014

1 commit

bde6c3aa9 rcu: Provide cond_resched_rcu_qs() to force quiescent states in long loops ... Browse Code »

RCU-tasks requires the occasional voluntary context switch
from CPU-bound in-kernel tasks. In some cases, this requires
instrumenting cond_resched(). However, there is some reluctance
to countenance unconditionally instrumenting cond_resched() (see
http://lwn.net/Articles/603252/), so this commit creates a separate
cond_resched_rcu_qs() that may be used in place of cond_resched() in
locations prone to long-duration in-kernel looping.

This commit currently instruments only RCU-tasks. Future possibilities
include also instrumenting RCU, RCU-bh, and RCU-sched in order to reduce
IPI usage.

Signed-off-by: Paul E. McKenney

Paul E. McKenney
2014-09-08 07:27:20 +0800

07 May, 2014

1 commit

f6c0a1920 fs/file.c: don't open-code kvfree() ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2014-05-07 05:31:10 +0800

13 Apr, 2014

1 commit

5166701b3 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull vfs updates from Al Viro:
"The first vfs pile, with deep apologies for being very late in this
window.

Assorted cleanups and fixes, plus a large preparatory part of iov_iter
work. There's a lot more of that, but it'll probably go into the next
merge window - it *does* shape up nicely, removes a lot of
boilerplate, gets rid of locking inconsistencie between aio_write and
splice_write and I hope to get Kent's direct-io rewrite merged into
the same queue, but some of the stuff after this point is having
(mostly trivial) conflicts with the things already merged into
mainline and with some I want more testing.

This one passes LTP and xfstests without regressions, in addition to
usual beating. BTW, readahead02 in ltp syscalls testsuite has started
giving failures since "mm/readahead.c: fix readahead failure for
memoryless NUMA nodes and limit readahead pages" - might be a false
positive, might be a real regression..."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
missing bits of "splice: fix racy pipe->buffers uses"
cifs: fix the race in cifs_writev()
ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
kill generic_file_buffered_write()
ocfs2_file_aio_write(): switch to generic_perform_write()
ceph_aio_write(): switch to generic_perform_write()
xfs_file_buffered_aio_write(): switch to generic_perform_write()
export generic_perform_write(), start getting rid of generic_file_buffer_write()
generic_file_direct_write(): get rid of ppos argument
btrfs_file_aio_write(): get rid of ppos
kill the 5th argument of generic_file_buffered_write()
kill the 4th argument of __generic_file_aio_write()
lustre: don't open-code kernel_recvmsg()
ocfs2: don't open-code kernel_recvmsg()
drbd: don't open-code kernel_recvmsg()
constify blk_rq_map_user_iov() and friends
lustre: switch to kernel_sendmsg()
ocfs2: don't open-code kernel_sendmsg()
take iov_iter stuff to mm/iov_iter.c
process_vm_access: tidy up a bit
...

Linus Torvalds
2014-04-13 05:49:50 +0800

02 Apr, 2014

1 commit

7f4b36f9b get rid of files_defer_init() ... Browse Code »

the only thing it's doing these days is calculation of
upper limit for fs.nr_open sysctl and that can be done
statically

Signed-off-by: Al Viro

Al Viro
2014-04-02 11:19:14 +0800

01 Apr, 2014

1 commit

b3fd4ea9d Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull RCU updates from Ingo Molnar:
"Main changes:

- Torture-test changes, including refactoring of rcutorture and
introduction of a vestigial locktorture.

- Real-time latency fixes.

- Documentation updates.

- Miscellaneous fixes"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (77 commits)
rcu: Provide grace-period piggybacking API
rcu: Ensure kernel/rcu/rcu.h can be sourced/used stand-alone
rcu: Fix sparse warning for rcu_expedited from kernel/ksysfs.c
notifier: Substitute rcu_access_pointer() for rcu_dereference_raw()
Documentation/memory-barriers.txt: Clarify release/acquire ordering
rcutorture: Save kvm.sh output to log
rcutorture: Add a lock_busted to test the test
rcutorture: Place kvm-test-1-run.sh output into res directory
rcutorture: Rename TREE_RCU-Kconfig.txt
locktorture: Add kvm-recheck.sh plug-in for locktorture
rcutorture: Gracefully handle NULL cleanup hooks
locktorture: Add vestigial locktorture configuration
rcutorture: Introduce "rcu" directory level underneath configs
rcutorture: Rename kvm-test-1-rcu.sh
rcutorture: Remove RCU dependencies from ver_functions.sh API
rcutorture: Create CFcommon file for common Kconfig parameters
rcutorture: Create config files for scripted test-the-test testing
rcutorture: Add an rcu_busted to test the test
locktorture: Add a lock-torture kernel module
rcutorture: Abstract kvm-recheck.sh
...

Linus Torvalds
2014-04-01 02:05:24 +0800

23 Mar, 2014

1 commit

99aea6813 vfs: Don't let __fdget_pos() get FMODE_PATH files ... Browse Code »

Commit bd2a31d522344 ("get rid of fget_light()") introduced the
__fdget_pos() function, which returns the resulting file pointer and
fdput flags combined in an 'unsigned long'. However, it also changed the
behavior to return files with FMODE_PATH set, which shouldn't happen
because read(), write(), lseek(), etc. aren't allowed on such files.
This commit restores the old behavior.

This regression actually had no effect on read() and write() since
FMODE_READ and FMODE_WRITE are not set on file descriptors opened with
O_PATH, but it did cause lseek() on a file descriptor opened with O_PATH
to fail with ESPIPE rather than EBADF.

Signed-off-by: Eric Biggers
Signed-off-by: Al Viro

Eric Biggers
2014-03-23 12:03:12 +0800

10 Mar, 2014

1 commit

bd2a31d52 get rid of fget_light() ... Browse Code »

instead of returning the flags by reference, we can just have the
low-level primitive return those in lower bits of unsigned long,
with struct file * derived from the rest.

Signed-off-by: Al Viro

Al Viro
2014-03-10 23:44:42 +0800

18 Feb, 2014

1 commit

add1f0995 fs: Substitute rcu_access_pointer() for rcu_dereference_raw() ... Browse Code »

(Trivial patch.)

If the code is looking at the RCU-protected pointer itself, but not
dereferencing it, the rcu_dereference() functions can be downgraded to
rcu_access_pointer(). This commit makes this downgrade in __alloc_fd(),
which simply compares the RCU-protected pointer against NULL with no
dereferencing.

Signed-off-by: Paul E. McKenney
Cc: Alexander Viro
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Josh Triplett

Paul E. McKenney
2014-02-18 07:02:21 +0800

11 Feb, 2014

1 commit

96c7a2ff2 fs/file.c:fdtable: avoid triggering OOMs from alloc_fdmem ... Browse Code »

Recently due to a spike in connections per second memcached on 3
separate boxes triggered the OOM killer from accept. At the time the
OOM killer was triggered there was 4GB out of 36GB free in zone 1. The
problem was that alloc_fdtable was allocating an order 3 page (32KiB) to
hold a bitmap, and there was sufficient fragmentation that the largest
page available was 8KiB.

I find the logic that PAGE_ALLOC_COSTLY_ORDER can't fail pretty dubious
but I do agree that order 3 allocations are very likely to succeed.

There are always pathologies where order > 0 allocations can fail when
there are copious amounts of free memory available. Using the pigeon
hole principle it is easy to show that it requires 1 page more than 50%
of the pages being free to guarantee an order 1 (8KiB) allocation will
succeed, 1 page more than 75% of the pages being free to guarantee an
order 2 (16KiB) allocation will succeed and 1 page more than 87.5% of
the pages being free to guarantee an order 3 allocate will succeed.

A server churning memory with a lot of small requests and replies like
memcached is a common case that if anything can will skew the odds
against large pages being available.

Therefore let's not give external applications a practical way to kill
linux server applications, and specify __GFP_NORETRY to the kmalloc in
alloc_fdmem. Unless I am misreading the code and by the time the code
reaches should_alloc_retry in __alloc_pages_slowpath (where
__GFP_NORETRY becomes signification). We have already tried everything
reasonable to allocate a page and the only thing left to do is wait. So
not waiting and falling back to vmalloc immediately seems like the
reasonable thing to do even if there wasn't a chance of triggering the
OOM killer.

Signed-off-by: "Eric W. Biederman"
Cc: Eric Dumazet
Acked-by: David Rientjes
Cc: Cong Wang
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric W. Biederman
2014-02-11 08:01:41 +0800