31 Oct, 2016

4 commits


28 Sep, 2016

1 commit


16 Sep, 2016

1 commit

  • This ensures that do_mmap() won't implicitly make AIO memory mappings
    executable if the READ_IMPLIES_EXEC personality flag is set. Such
    behavior is problematic because the security_mmap_file LSM hook doesn't
    catch this case, potentially permitting an attacker to bypass a W^X
    policy enforced by SELinux.

    I have tested the patch on my machine.

    To test the behavior, compile and run this:

    #define _GNU_SOURCE
    #include <err.h>
    #include <linux/aio_abi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/personality.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void) {
        personality(READ_IMPLIES_EXEC);
        aio_context_t ctx = 0;
        if (syscall(__NR_io_setup, 1, &ctx))
            err(1, "io_setup");

        char cmd[1000];
        sprintf(cmd, "cat /proc/%d/maps | grep -F '/[aio]'",
                (int)getpid());
        system(cmd);
        return 0;
    }

    In the output, "rw-s" is good, "rwxs" is bad.

    Signed-off-by: Jann Horn
    Signed-off-by: Linus Torvalds

    Jann Horn
     

24 May, 2016

1 commit

  • aio_setup_ring waits for mmap_sem in writable mode. If the waiting task
    gets killed by the OOM killer, it blocks the oom_reaper from
    asynchronous address space reclaim and reduces the chances of timely
    OOM resolution. Wait for the lock in killable mode and return with
    EINTR if the task got killed while waiting. This will also expedite
    the return to userspace and do_exit.
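
    As a hedged illustration of what the new failure mode means for
    callers (not part of the patch itself): io_setup(2) can now fail
    with EINTR, which userspace conventionally treats as retryable.
    A minimal sketch using the raw syscall, as elsewhere on this page:

    #define _GNU_SOURCE
    #include <errno.h>
    #include <linux/aio_abi.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        aio_context_t ctx = 0;
        long ret;

        do {
            ret = syscall(__NR_io_setup, 128, &ctx);   /* raw syscall */
        } while (ret < 0 && errno == EINTR);           /* retry on EINTR */

        if (ret < 0) {
            perror("io_setup");
            return 1;
        }
        printf("aio ring mapped, ctx = %#lx\n", (unsigned long)ctx);
        syscall(__NR_io_destroy, ctx);
        return 0;
    }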

    Signed-off-by: Michal Hocko
    Acked-by: Jeff Moyer
    Acked-by: Vlastimil Babka
    Cc: Benjamin LaHaise
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

04 Apr, 2016

1 commit


05 Sep, 2015

1 commit

  • vma->vm_ops->mremap() looks more natural and clean in move_vma(), and this
    way ->mremap() can have more users. Say, vdso.

    While at it, s/aio_ring_remap/aio_ring_mremap/.

    Note: this is the minimal change before ->mremap() finds another user in
    file_operations; this method should have more arguments, and it can be
    used to kill arch_remap().

    Signed-off-by: Oleg Nesterov
    Acked-by: Pavel Emelyanov
    Acked-by: Kirill A. Shutemov
    Cc: David Rientjes
    Cc: Benjamin LaHaise
    Cc: Hugh Dickins
    Cc: Jeff Moyer
    Cc: Laurent Dufour
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

17 Apr, 2015

2 commits

  • Pull third hunk of vfs changes from Al Viro:
    "This contains the ->direct_IO() changes from Omar + saner
    generic_write_checks() + dealing with fcntl()/{read,write}() races
    (mirroring O_APPEND/O_DIRECT into iocb->ki_flags and instead of
    repeatedly looking at ->f_flags, which can be changed by fcntl(2),
    check ->ki_flags - which cannot) + infrastructure bits for dhowells'
    d_inode annotations + Christoph's switch of /dev/loop to
    vfs_iter_write()"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (30 commits)
    block: loop: switch to VFS ITER_BVEC
    configfs: Fix inconsistent use of file_inode() vs file->f_path.dentry->d_inode
    VFS: Make pathwalk use d_is_reg() rather than S_ISREG()
    VFS: Fix up debugfs to use d_is_dir() in place of S_ISDIR()
    VFS: Combine inode checks with d_is_negative() and d_is_positive() in pathwalk
    NFS: Don't use d_inode as a variable name
    VFS: Impose ordering on accesses of d_inode and d_flags
    VFS: Add owner-filesystem positive/negative dentry checks
    nfs: generic_write_checks() shouldn't be done on swapout...
    ocfs2: use __generic_file_write_iter()
    mirror O_APPEND and O_DIRECT into iocb->ki_flags
    switch generic_write_checks() to iocb and iter
    ocfs2: move generic_write_checks() before the alignment checks
    ocfs2_file_write_iter: stop messing with ppos
    udf_file_write_iter: reorder and simplify
    fuse: ->direct_IO() doesn't need generic_write_checks()
    ext4_file_write_iter: move generic_write_checks() up
    xfs_file_aio_write_checks: switch to iocb/iov_iter
    generic_write_checks(): drop isblk argument
    blkdev_write_iter: expand generic_file_checks() call in there
    ...

    Linus Torvalds
     
  • Pull block layer core bits from Jens Axboe:
    "This is the core pull request for 4.1. Not a lot of stuff in here for
    this round, mostly little fixes or optimizations. This pull request
    contains:

    - An optimization that speeds up queue runs on blk-mq, especially for
    the case where there's a large difference between nr_cpu_ids and
    the actual mapped software queues on a hardware queue. From Chong
    Yuan.

    - Honor node local allocations for requests on legacy devices. From
    David Rientjes.

    - Cleanup of blk_mq_rq_to_pdu() from me.

    - exit_aio() fixup from me, greatly speeding up exiting multiple IO
    contexts off exit_group(). For my particular test case, fio exit
    took ~6 seconds. A typical case of both exposing RCU grace periods
    to user space, and serializing exit of them.

    - Make blk_mq_queue_enter() honor the gfp mask passed in, so we only
    wait if __GFP_WAIT is set. From Keith Busch.

    - blk-mq exports and two added helpers from Mike Snitzer, which will
    be used by the dm-mq code.

    - Cleanups of blk-mq queue init from Wei Fang and Xiaoguang Wang"

    * 'for-4.1/core' of git://git.kernel.dk/linux-block:
    blk-mq: reduce unnecessary software queue looping
    aio: fix serial draining in exit_aio()
    blk-mq: cleanup blk_mq_rq_to_pdu()
    blk-mq: put blk_queue_rq_timeout together in blk_mq_init_queue()
    block: remove redundant check about 'set->nr_hw_queues' in blk_mq_alloc_tag_set()
    block: allocate request memory local to request queue
    blk-mq: don't wait in blk_mq_queue_enter() if __GFP_WAIT isn't set
    blk-mq: export blk_mq_run_hw_queues
    blk-mq: add blk_mq_init_allocated_queue and export blk_mq_register_disk

    Linus Torvalds
     

16 Apr, 2015

2 commits

  • Pull second vfs update from Al Viro:
    "Now that net-next went in... Here's the next big chunk - killing
    ->aio_read() and ->aio_write().

    There'll be one more pile today (direct_IO changes and
    generic_write_checks() cleanups/fixes), but I'd prefer to keep that
    one separate"

    * 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
    ->aio_read and ->aio_write removed
    pcm: another weird API abuse
    infinibad: weird APIs switched to ->write_iter()
    kill do_sync_read/do_sync_write
    fuse: use iov_iter_get_pages() for non-splice path
    fuse: switch to ->read_iter/->write_iter
    switch drivers/char/mem.c to ->read_iter/->write_iter
    make new_sync_{read,write}() static
    coredump: accept any write method
    switch /dev/loop to vfs_iter_write()
    serial2002: switch to __vfs_read/__vfs_write
    ashmem: use __vfs_read()
    export __vfs_read()
    autofs: switch to __vfs_write()
    new helper: __vfs_write()
    switch hugetlbfs to ->read_iter()
    coda: switch to ->read_iter/->write_iter
    ncpfs: switch to ->read_iter/->write_iter
    net/9p: remove (now-)unused helpers
    p9_client_attach(): set fid->uid correctly
    ...

    Linus Torvalds
     
  • exit_aio() currently serializes killing io contexts. Each context
    killing ends up having to do percpu_ref_kill(), which in turns has
    to wait for an RCU grace period. This can take a long time, depending
    on the number of contexts. And there's no point in doing them serially,
    when we could be waiting for all of them in one fell swoop.

    This patch makes my fio thread offload test case exit in 0.2s instead
    of almost 6s.
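
    A hedged userspace analogy of the change (threads stand in for io
    contexts and a sleep for the RCU grace period; this is not the
    kernel code): start every teardown first, then wait once, instead
    of paying one full wait per context.

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define NCTX 8

    /* Stand-in for one context kill that includes a grace-period-like wait. */
    static void *kill_one(void *arg)
    {
        usleep(100 * 1000);        /* pretend grace period: 100ms */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NCTX];
        int i;

        /* The serialized version would create and join inside a single
         * loop, paying ~NCTX * 100ms. Instead, kick everything off... */
        for (i = 0; i < NCTX; i++)
            pthread_create(&t[i], NULL, kill_one, NULL);

        /* ...then wait for all of them in one fell swoop: ~100ms total. */
        for (i = 0; i < NCTX; i++)
            pthread_join(t[i], NULL);

        puts("all contexts drained");
        return 0;       /* build with: cc -pthread demo.c */
    }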

    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jens Axboe
     

15 Apr, 2015

1 commit

  • Pull vfs update from Al Viro:
    "Part one:

    - struct filename-related cleanups

    - saner iov_iter_init() replacements (and switching the syscalls to
    use of those)

    - ntfs switch to ->write_iter() (Anton)

    - aio cleanups and splitting iocb into common and async parts
    (Christoph)

    - assorted fixes (me, bfields, Andrew Elble)

    There's a lot more, including the completion of switchover to
    ->{read,write}_iter(), d_inode/d_backing_inode annotations, f_flags
    race fixes, etc, but that goes after #for-davem merge. David has
    pulled it, and once it's in I'll send the next vfs pull request"

    * 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (35 commits)
    sg_start_req(): use import_iovec()
    sg_start_req(): make sure that there's not too many elements in iovec
    blk_rq_map_user(): use import_single_range()
    sg_io(): use import_iovec()
    process_vm_access: switch to {compat_,}import_iovec()
    switch keyctl_instantiate_key_common() to iov_iter
    switch {compat_,}do_readv_writev() to {compat_,}import_iovec()
    aio_setup_vectored_rw(): switch to {compat_,}import_iovec()
    vmsplice_to_user(): switch to import_iovec()
    kill aio_setup_single_vector()
    aio: simplify arguments of aio_setup_..._rw()
    aio: lift iov_iter_init() into aio_setup_..._rw()
    lift iov_iter into {compat_,}do_readv_writev()
    NFS: fix BUG() crash in notify_change() with patch to chown_common()
    dcache: return -ESTALE not -EBUSY on distributed fs race
    NTFS: Version 2.1.32 - Update file write from aio_write to write_iter.
    VFS: Add iov_iter_fault_in_multipages_readable()
    drop bogus check in file_open_root()
    switch security_inode_getattr() to struct path *
    constify tomoyo_realpath_from_path()
    ...

    Linus Torvalds
     

12 Apr, 2015

10 commits


07 Apr, 2015

2 commits

  • If we fail past the aio_setup_ring(), we need to destroy the
    mapping. We don't need to care about anybody having found ctx,
    or added requests to it, since the last failure exit is exactly
    the failure to make ctx visible to lookups.

    Reproducer (based on one by Joe Mario):

    /* Build with: cc repro.c -laio  (io_setup/io_destroy here are the
     * libaio wrappers; the headers were added for a standalone build) */
    #include <libaio.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    void count(char *p)
    {
        char s[80];
        printf("%s: ", p);
        fflush(stdout);
        sprintf(s, "/bin/cat /proc/%d/maps|/bin/fgrep -c '/[aio] (deleted)'", getpid());
        system(s);
    }

    int main()
    {
        io_context_t *ctx;
        int created, limit, i, destroyed;
        FILE *f;

        count("before");
        if ((f = fopen("/proc/sys/fs/aio-max-nr", "r")) == NULL)
            perror("opening aio-max-nr");
        else if (fscanf(f, "%d", &limit) != 1)
            fprintf(stderr, "can't parse aio-max-nr\n");
        else if ((ctx = calloc(limit, sizeof(io_context_t))) == NULL)
            perror("allocating aio_context_t array");
        else {
            for (i = 0, created = 0; i < limit; i++) {
                if (io_setup(1000, ctx + created) == 0)
                    created++;
            }
            for (i = 0, destroyed = 0; i < created; i++)
                if (io_destroy(ctx[i]) == 0)
                    destroyed++;
            printf("created %d, failed %d, destroyed %d\n",
                   created, limit - created, destroyed);
            count("after");
        }
        return 0;
    }

    Found-by: Joe Mario
    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Al Viro
     
  • teach ->mremap() method to return an error and have it fail for
    aio mappings in process of being killed

    Note that in case of ->mremap() failure we need to undo move_page_tables()
    we'd already done; we could call ->mremap() first, but then the failure of
    move_page_tables() would require undoing whatever _successful_ ->mremap()
    has done, which would be a lot more headache in general.

    Signed-off-by: Al Viro

    Al Viro
     

14 Mar, 2015

2 commits

  • Most callers in the kernel want to perform synchronous file I/O, but
    still have to bloat the stack with a full struct kiocb. Split out
    the parts needed in filesystem code from those in the aio code, and
    only allocate those needed to pass down arguments on the stack. The
    aio code embeds the generic iocb in the one it allocates and can
    easily get back to it by using container_of.

    Also add a ->ki_complete method to struct kiocb, this is used to call
    into the aio code and thus removes the dependency on aio for filesystems
    implementing asynchronous operations. It will also allow other callers
    to substitute their own completion callback.

    We also add a new ->ki_flags field to work around the nasty layering
    violation recently introduced in commit 5e33f6 ("usb: gadget: ffs: add
    eventfd notification about ffs events").

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • The AIO interface is fairly complex because it tries to allow
    filesystems to always work async and then wake up a synchronous
    caller through aio_complete. It turns out that basically no one
    was doing this to avoid the complexity and context switches,
    and we've already fixed up the remaining users and can now
    get rid of this case.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

13 Mar, 2015

1 commit


20 Feb, 2015

1 commit


13 Feb, 2015

1 commit

  • Pull backing device changes from Jens Axboe:
    "This contains a cleanup of how the backing device is handled, in
    preparation for a rework of the life time rules. In this part, the
    most important change is to split the unrelated nommu mmap flags from
    it, but also removing a backing_dev_info pointer from the
    address_space (and inode), and a cleanup of other various minor bits.

    Christoph did all the work here, I just fixed an oops with pages that
    have a swap backing. Arnd fixed a missing export, and Oleg killed the
    lustre backing_dev_info from staging. Last patch was from Al,
    unexporting parts that are now no longer needed outside"

    * 'for-3.20/bdi' of git://git.kernel.dk/linux-block:
    Make super_blocks and sb_lock static
    mtd: export new mtd_mmap_capabilities
    fs: make inode_to_bdi() handle NULL inode
    staging/lustre/llite: get rid of backing_dev_info
    fs: remove default_backing_dev_info
    fs: don't reassign dirty inodes to default_backing_dev_info
    nfs: don't call bdi_unregister
    ceph: remove call to bdi_unregister
    fs: remove mapping->backing_dev_info
    fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info
    nilfs2: set up s_bdi like the generic mount_bdev code
    block_dev: get bdev inode bdi directly from the block device
    block_dev: only write bdev inode on close
    fs: introduce f_op->mmap_capabilities for nommu mmap support
    fs: kill BDI_CAP_SWAP_BACKED
    fs: deduplicate noop_backing_dev_info

    Linus Torvalds
     

04 Feb, 2015

1 commit

  • Under CONFIG_DEBUG_ATOMIC_SLEEP=y, aio_read_event_ring() will throw
    warnings like the following due to being called from wait_event
    context:

    WARNING: CPU: 0 PID: 16006 at kernel/sched/core.c:7300 __might_sleep+0x7f/0x90()
    do not call blocking ops when !TASK_RUNNING; state=1 set at [] prepare_to_wait_event+0x63/0x110
    Modules linked in:
    CPU: 0 PID: 16006 Comm: aio-dio-fcntl-r Not tainted 3.19.0-rc6-dgc+ #705
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    ffffffff821c0372 ffff88003c117cd8 ffffffff81daf2bd 000000000000d8d8
    ffff88003c117d28 ffff88003c117d18 ffffffff8109beda ffff88003c117cf8
    ffffffff821c115e 0000000000000061 0000000000000000 00007ffffe4aa300
    Call Trace:
    [] dump_stack+0x4c/0x65
    [] warn_slowpath_common+0x8a/0xc0
    [] warn_slowpath_fmt+0x46/0x50
    [] ? prepare_to_wait_event+0x63/0x110
    [] ? prepare_to_wait_event+0x63/0x110
    [] __might_sleep+0x7f/0x90
    [] mutex_lock+0x24/0x45
    [] aio_read_events+0x4c/0x290
    [] read_events+0x1ec/0x220
    [] ? prepare_to_wait_event+0x110/0x110
    [] ? hrtimer_get_res+0x50/0x50
    [] SyS_io_getevents+0x4d/0xb0
    [] system_call_fastpath+0x12/0x17
    ---[ end trace bde69eaf655a4fea ]---

    There is not actually a bug here, so annotate the code to tell the
    debug logic that everything is just fine and not to fire a false
    positive.

    Signed-off-by: Dave Chinner
    Signed-off-by: Benjamin LaHaise

    Dave Chinner
     

21 Jan, 2015

2 commits

  • Now that we never use the backing_dev_info pointer in struct address_space
    we can simply remove it and save 4 to 8 bytes in every inode.

    Signed-off-by: Christoph Hellwig
    Acked-by: Ryusuke Konishi
    Reviewed-by: Tejun Heo
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Since "BDI: Provide backing device capability information [try #3]" the
    backing_dev_info structure also provides flags for the kind of mmap
    operation available in a nommu environment, which is entirely unrelated
    to its original purpose.

    Introduce a new nommu-only file operation to provide this information to
    the nommu mmap code instead. Splitting this from the backing_dev_info
    structure allows removing many backing_dev_info instances that aren't
    otherwise needed, and entirely gets rid of the concept of providing a
    backing_dev_info for a character device. It also removes the need for
    the mtd_inodefs filesystem.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Tejun Heo
    Acked-by: Brian Norris
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

14 Dec, 2014

2 commits

  • In this case, it is basically polling. Let's not involve a timer at
    all, because that would hurt performance for application event loops.

    In an arbitrary test I've done, io_getevents syscall elapsed time
    reduces from 50000+ nanoseconds to a few hundred.
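
    A hedged sketch of the polling pattern in question (raw syscalls,
    nothing in flight, zero timeout so io_getevents returns at once):

    #define _GNU_SOURCE
    #include <err.h>
    #include <linux/aio_abi.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        aio_context_t ctx = 0;
        struct io_event events[8];
        struct timespec zero = { 0, 0 };    /* poll, don't sleep */
        long n;

        if (syscall(__NR_io_setup, 8, &ctx))
            err(1, "io_setup");

        /* Returns immediately; with this patch the kernel no longer
         * sets up a timer just to have it expire at once. */
        n = syscall(__NR_io_getevents, ctx, 0, 8, events, &zero);
        printf("io_getevents returned %ld events\n", n);

        syscall(__NR_io_destroy, ctx);
        return 0;
    }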

    Signed-off-by: Fam Zheng
    Signed-off-by: Benjamin LaHaise

    Fam Zheng
     
  • There are actually two issues this patch addresses. Let me start with
    the one I tried to solve in the beginning.

    So, in the checkpoint-restore project (criu) we try to dump tasks'
    state and restore one back exactly as it was. One of the tasks' state
    bits is rings set up with the io_setup() call. There are (almost) no
    problems dumping them; there is a problem restoring them -- if I dump a task
    with aio ring originally mapped at address A, I want to restore one
    back at exactly the same address A. Unfortunately, the io_setup() does
    not allow for that -- it mmaps the ring at whatever place mm finds
    appropriate (it calls do_mmap_pgoff() with zero address and without
    the MAP_FIXED flag).

    To make restore possible I'm going to mremap() the freshly created ring
    into the address A (under which it was seen before dump). The problem is
    that the ring's virtual address is passed back to the user-space as the
    context ID and this ID is then used as search key by all the other io_foo()
    calls. Reworking this ID to be just some integer doesn't seem to work, as
    this value is already used by libaio as a pointer using which this library
    accesses memory for aio meta-data.

    So, to make restore work we need to make sure that

    a) ring is mapped at desired virtual address
    b) kioctx->user_id matches this value

    Having said that, the patch makes mremap() on aio region update the
    kioctx's user_id and mmap_base values.

    Here appears the 2nd issue I mentioned in the beginning of this mail.
    If (regardless of the C/R dances I do) someone creates an io context
    with io_setup(), then mremap()-s the ring and then destroys the context,
    the kill_ioctx() routine will call munmap() on the wrong (old) address.
    This will result in a) aio ring remaining in memory and b) some other
    vma get unexpectedly unmapped.

    What do you think?
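
    A hedged sketch of the restore step described above (it assumes the
    ring fits in one page, which only holds for small rings on small
    machines; a real restorer would read the ring size from
    /proc/pid/maps). With this patch, kioctx->user_id follows the move,
    so io_destroy() on the new address works:

    #define _GNU_SOURCE
    #include <err.h>
    #include <linux/aio_abi.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        aio_context_t ctx = 0;
        long page = sysconf(_SC_PAGESIZE);

        if (syscall(__NR_io_setup, 1, &ctx))
            err(1, "io_setup");

        /* Reserve a destination, then move the ring onto it, the way a
         * restorer would reproduce the pre-dump address A. */
        void *dst = mmap(NULL, page, PROT_NONE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (dst == MAP_FAILED)
            err(1, "mmap");

        void *ring = mremap((void *)(unsigned long)ctx, page, page,
                            MREMAP_MAYMOVE | MREMAP_FIXED, dst);
        if (ring == MAP_FAILED)
            err(1, "mremap");
        printf("ring moved from %#lx to %p\n", (unsigned long)ctx, ring);

        /* The new address now serves as the context ID; without the
         * patch, the kernel would munmap the stale address here. */
        if (syscall(__NR_io_destroy, (aio_context_t)(unsigned long)ring))
            err(1, "io_destroy");
        return 0;
    }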

    Signed-off-by: Pavel Emelyanov
    Acked-by: Dmitry Monakhov
    Signed-off-by: Benjamin LaHaise

    Pavel Emelyanov
     

26 Nov, 2014

1 commit


07 Nov, 2014

1 commit

  • https://bugzilla.kernel.org/show_bug.cgi?id=86831

    Markus reported that shutting down mysqld (with AIO support, on an
    ext3-formatted hard drive) leads to a negative number of dirty pages
    (an underrun of the counter). The negative number causes a drastic
    reduction in write performance because the page cache is not used: the
    kernel thinks there are still 2^32 dirty pages outstanding.

    Adding a warning in __dec_zone_state catches this easily:

    static inline void __dec_zone_state(struct zone *zone,
                                        enum zone_stat_item item)
    {
            atomic_long_dec(&zone->vm_stat[item]);
    +       WARN_ON_ONCE(item == NR_FILE_DIRTY &&
                    atomic_long_read(&zone->vm_stat[item]) < 0);
            atomic_long_dec(&vm_stat[item]);
    }

    [ 21.341632] ------------[ cut here ]------------
    [ 21.346294] WARNING: CPU: 0 PID: 309 at include/linux/vmstat.h:242 cancel_dirty_page+0x164/0x224()
    [ 21.355296] Modules linked in: wutbox_cp sata_mv
    [ 21.359968] CPU: 0 PID: 309 Comm: kworker/0:1 Not tainted 3.14.21-WuT #80
    [ 21.366793] Workqueue: events free_ioctx
    [ 21.370760] [] (unwind_backtrace) from [] (show_stack+0x20/0x24)
    [ 21.378562] [] (show_stack) from [] (dump_stack+0x24/0x28)
    [ 21.385840] [] (dump_stack) from [] (warn_slowpath_common+0x84/0x9c)
    [ 21.393976] [] (warn_slowpath_common) from [] (warn_slowpath_null+0x2c/0x34)
    [ 21.402800] [] (warn_slowpath_null) from [] (cancel_dirty_page+0x164/0x224)
    [ 21.411524] [] (cancel_dirty_page) from [] (truncate_inode_page+0x8c/0x158)
    [ 21.420272] [] (truncate_inode_page) from [] (truncate_inode_pages_range+0x11c/0x53c)
    [ 21.429890] [] (truncate_inode_pages_range) from [] (truncate_pagecache+0x88/0xac)
    [ 21.439252] [] (truncate_pagecache) from [] (truncate_setsize+0x5c/0x74)
    [ 21.447731] [] (truncate_setsize) from [] (put_aio_ring_file.isra.14+0x34/0x90)
    [ 21.456826] [] (put_aio_ring_file.isra.14) from [] (aio_free_ring+0x20/0xcc)
    [ 21.465660] [] (aio_free_ring) from [] (free_ioctx+0x24/0x44)
    [ 21.473190] [] (free_ioctx) from [] (process_one_work+0x134/0x47c)
    [ 21.481132] [] (process_one_work) from [] (worker_thread+0x130/0x414)
    [ 21.489350] [] (worker_thread) from [] (kthread+0xd4/0xec)
    [ 21.496621] [] (kthread) from [] (ret_from_fork+0x14/0x20)
    [ 21.503884] ---[ end trace 79c4bf42c038c9a1 ]---

    The cause is that we set the aio ring file pages *DIRTY* via SetPageDirty
    (which bypasses the VFS dirty-page increment) at init time, and the aio fs
    uses *default_backing_dev_info* as its backing dev, which does not disable
    the dirty-page accounting capability.
    So truncating the aio ring file contributes to dirty-page accounting (a
    VFS dirty-page decrement), and the error occurs.

    The original goal was to keep these pages in memory (not reclaimable
    or swappable) for their lifetime by marking them dirty. But thinking
    about it more, we have already pinned the pages by elevating their
    refcount, which achieves that goal, so the SetPageDirty is unnecessary.

    To fix the issue, use __set_page_dirty_no_writeback instead of the nop
    .set_page_dirty, and drop the SetPageDirty (don't manually set the
    dirty flag, don't disable set_page_dirty(); rely on the default
    behaviour).

    With the above change, dirty-page accounting works correctly. But as we
    know, the aio fs is an anonymous one that should never cause any real
    writeback, so we can skip dirty-page (writeback) accounting by disabling
    that capability. We therefore introduce an aio-private backing_dev_info
    (with the ACCT_DIRTY/WRITEBACK/ACCT_WB capabilities disabled) to replace
    the default one.

    Reported-by: Markus Königshaus
    Signed-off-by: Gu Zheng
    Cc: stable
    Acked-by: Andrew Morton
    Signed-off-by: Benjamin LaHaise

    Gu Zheng
     

25 Sep, 2014

2 commits

  • With the recent addition of percpu_ref_reinit(), percpu_ref now can be
    used as a persistent switch which can be turned on and off repeatedly
    where turning off maps to killing the ref and waiting for it to drain;
    however, there currently isn't a way to initialize a percpu_ref in its
    off (killed and drained) state, which can be inconvenient for certain
    persistent switch use cases.

    Similarly, percpu_ref_switch_to_atomic/percpu() allow dynamic
    selection of operation mode; however, currently a newly initialized
    percpu_ref is always in percpu mode, making it impossible to avoid the
    latency overhead of switching to atomic mode.

    This patch adds @flags to percpu_ref_init() and implements the
    following flags.

    * PERCPU_REF_INIT_ATOMIC : start ref in atomic mode
    * PERCPU_REF_INIT_DEAD : start ref killed and drained

    These flags should be able to serve the above two use cases.

    v2: target_core_tpg.c conversion was missing. Fixed.

    Signed-off-by: Tejun Heo
    Reviewed-by: Kent Overstreet
    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Cc: Johannes Weiner

    Tejun Heo
     
  • …linux-block into for-3.18

    This is to receive 0a30288da1ae ("blk-mq, percpu_ref: implement a
    kludge for SCSI blk-mq stall during probe") which implements
    __percpu_ref_kill_expedited() to work around the SCSI blk-mq stall. That
    commit will be reverted and patches implementing a proper fix will be added.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Cc: Kent Overstreet <kmo@daterainc.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Christoph Hellwig <hch@lst.de>

    Tejun Heo