Eric Lee / smarc-fsl-linux-kernel

04 Oct, 2016

1 commit

4038acdb1 consistent treatment of EFAULT on O_DIRECT read/write ... Browse Code »

Make local filesystems treat a fault as shortened IO,
returning -EFAULT only if nothing had been transferred.
That's how everything else (NFS, FUSE, ceph, Lustre)
behaves.

Signed-off-by: Al Viro

Al Viro
2016-10-04 08:38:55 +0800

08 Jun, 2016

2 commits

8a4c1e42e direct-io: use bio set/get op accessors ... Browse Code »

This patch has the dio code use a REQ_OP for the op and rq_flag_bits
for bi_rw flags. To set/get the op it uses the bio_set_op_attrs/bio_op
accssors.

It also begins to convert btrfs's dio_submit_t because of the dio
submit_io callout use. The next patches will completely convert
this code and the reset of the btrfs code paths.

Signed-off-by: Mike Christie
Reviewed-by: Christoph Hellwig
Reviewed-by: Hannes Reinecke
Signed-off-by: Jens Axboe

Mike Christie
2016-06-08 03:41:38 +0800
4e49ea4a3 block/fs/drivers: remove rw argument from submit_bio ... Browse Code »

This has callers of submit_bio/submit_bio_wait set the bio->bi_rw
instead of passing it in. This makes that use the same as
generic_make_request and how we set the other bio fields.

Signed-off-by: Mike Christie

Fixed up fs/ext4/crypto.c

Signed-off-by: Jens Axboe

Mike Christie
2016-06-08 03:41:38 +0800

28 May, 2016

1 commit

9ecd10b7a direct-io: fix direct write stale data exposure from concurrent buffered read ... Browse Code »

Currently direct writes inside i_size on a DIO_SKIP_HOLES filesystem are
not allowed to allocate blocks(get_more_blocks() sets 'create' to 0
before calling get_block() callback), if it's a sparse file, direct
writes fall back to buffered writes to avoid stale data exposure from
concurrent buffered read. But there're two cases that can result in
stale data exposure are not correctly detected.

1. The detection for "writing inside i_size" is not sufficient,
writes can be treated as "extending writes" wrongly. For example,
direct write 1FSB (file system block) to a 1FSB sparse file on
ext2/3/4, starting from offset 0, in this case it's writing inside
i_size, but 'create' is non-zero, because 'block_in_file' and
'(i_size_read(inode) >> blkbits' are both zero.

2. Direct writes starting from or beyong i_size (not inside i_size)
also could trigger block allocation and expose stale data. For
example, consider a sparse file with i_size of 2k, and a write to
offset 2k or 3k into the file, with a filesystem block size of 4k.
(Thanks to Jeff Moyer for pointing this case out in his review.)

The first problem can be demostrated by running ltp-aiodio test ADSP045
many times. When testing on extN filesystems, I see test failures
occasionally, buffered read could read non-zero (stale) data.

ADSP045: dio_sparse -a 4k -w 4k -s 2k -n 1

dio_sparse 0 TINFO : Dirtying free blocks
dio_sparse 0 TINFO : Starting I/O tests
non zero buffer at buf[0] => 0xffffffaa,ffffffaa,ffffffaa,ffffffaa
non-zero read at offset 0
dio_sparse 0 TINFO : Killing childrens(s)
dio_sparse 1 TFAIL : dio_sparse.c:191: 1 children(s) exited abnormally

The second problem can also be reproduced easily by a hacked dio_sparse
program, which accepts an option to specify the write offset.

What we should really do is to disable block allocation for writes that
could result in filling holes inside i_size.

Link: http://lkml.kernel.org/r/1463156728-13357-1-git-send-email-guaneryu@gmail.com
Reviewed-by: Jan Kara
Signed-off-by: Eryu Guan
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eryu Guan
2016-05-28 05:49:37 +0800

02 May, 2016

4 commits

e25922176 fs: simplify the generic_write_sync prototype ... Browse Code »

The kiocb already has the new position, so use that. The only interesting
case is AIO, where we currently don't bother updating ki_pos. We're about
to free the kiocb after we're done, so we might as well update it to make
everyone's life simpler.

While we're at it also return the bytes written argument passed in if
we were successful so that the boilerplate error switch code in the
callers can go away.

Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro

Christoph Hellwig
2016-05-02 07:58:39 +0800
dde0c2e79 fs: add IOCB_SYNC and IOCB_DSYNC ... Browse Code »

This will allow us to do per-I/O sync file writes, as required by a lot
of fileservers or storage targets.

XXX: Will need a few additional audits for O_DSYNC

Signed-off-by: Al Viro

Christoph Hellwig
2016-05-02 07:58:39 +0800
716b9bc0c direct-io: remove the offset argument to dio_complete ... Browse Code »

It has to be identical to ki_pos of the iocb, so use that instead.

Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro

Christoph Hellwig
2016-05-02 07:58:39 +0800
c8b8e32d7 direct-io: eliminate the offset argument to ->direct_IO ... Browse Code »

Including blkdev_direct_IO and dax_do_io. It has to be ki_pos to actually
work, so eliminate the superflous argument.

Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro

Christoph Hellwig
2016-05-02 07:58:39 +0800

05 Apr, 2016

1 commit

09cbfeaf1 mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros ... Browse Code »

PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized. And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special. They are
not.

The changes are pretty straight-forward:

- << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> ;

- >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> ;

- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

- page_cache_get() -> get_page();

- page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov
Acked-by: Michal Hocko
Signed-off-by: Linus Torvalds

Kirill A. Shutemov
2016-04-05 01:41:08 +0800

22 Mar, 2016

1 commit

53d2e6976 Merge tag 'xfs-for-linus-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs ... Browse Code »

Pull xfs updates from Dave Chinner:
"There's quite a lot in this request, and there's some cross-over with
ext4, dax and quota code due to the nature of the changes being made.

As for the rest of the XFS changes, there are lots of little things
all over the place, which add up to a lot of changes in the end.

The major changes are that we've reduced the size of the struct
xfs_inode by ~100 bytes (gives an inode cache footprint reduction of
>10%), the writepage code now only does a single set of mapping tree
lockups so uses less CPU, delayed allocation reservations won't
overrun under random write loads anymore, and we added compile time
verification for on-disk structure sizes so we find out when a commit
or platform/compiler change breaks the on disk structure as early as
possible.

Change summary:

- error propagation for direct IO failures fixes for both XFS and
ext4
- new quota interfaces and XFS implementation for iterating all the
quota IDs in the filesystem
- locking fixes for real-time device extent allocation
- reduction of duplicate information in the xfs and vfs inode, saving
roughly 100 bytes of memory per cached inode.
- buffer flag cleanup
- rework of the writepage code to use the generic write clustering
mechanisms
- several fixes for inode flag based DAX enablement
- rework of remount option parsing
- compile time verification of on-disk format structure sizes
- delayed allocation reservation overrun fixes
- lots of little error handling fixes
- small memory leak fixes
- enable xfsaild freezing again"

* tag 'xfs-for-linus-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (66 commits)
xfs: always set rvalp in xfs_dir2_node_trim_free
xfs: ensure committed is initialized in xfs_trans_roll
xfs: borrow indirect blocks from freed extent when available
xfs: refactor delalloc indlen reservation split into helper
xfs: update freeblocks counter after extent deletion
xfs: debug mode forced buffered write failure
xfs: remove impossible condition
xfs: check sizes of XFS on-disk structures at compile time
xfs: ioends require logically contiguous file offsets
xfs: use named array initializers for log item dumping
xfs: fix computation of inode btree maxlevels
xfs: reinitialise per-AG structures if geometry changes during recovery
xfs: remove xfs_trans_get_block_res
xfs: fix up inode32/64 (re)mount handling
xfs: fix format specifier , should be %llx and not %llu
xfs: sanitize remount options
xfs: convert mount option parsing to tokens
xfs: fix two memory leaks in xfs_attr_list.c error paths
xfs: XFS_DIFLAG2_DAX limited by PAGE_SIZE
xfs: dynamically switch modes when XFS_DIFLAG2_DAX is set/cleared
...

Linus Torvalds
2016-03-22 02:53:05 +0800

20 Mar, 2016

1 commit

3c2de27d7 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull vfs updates from Al Viro:

- Preparations of parallel lookups (the remaining main obstacle is the
need to move security_d_instantiate(); once that becomes safe, the
rest will be a matter of rather short series local to fs/*.c

- preadv2/pwritev2 series from Christoph

- assorted fixes

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (32 commits)
splice: handle zero nr_pages in splice_to_pipe()
vfs: show_vfsstat: do not ignore errors from show_devname method
dcache.c: new helper: __d_add()
don't bother with __d_instantiate(dentry, NULL)
untangle fsnotify_d_instantiate() a bit
uninline d_add()
replace d_add_unique() with saner primitive
quota: use lookup_one_len_unlocked()
cifs_get_root(): use lookup_one_len_unlocked()
nfs_lookup: don't bother with d_instantiate(dentry, NULL)
kill dentry_unhash()
ceph_fill_trace(): don't bother with d_instantiate(dn, NULL)
autofs4: don't bother with d_instantiate(dentry, NULL) in ->lookup()
configfs: move d_rehash() into configfs_create() for regular files
ceph: don't bother with d_rehash() in splice_dentry()
namei: teach lookup_slow() to skip revalidate
namei: massage lookup_slow() to be usable by lookup_one_len_unlocked()
lookup_one_len_unlocked(): use lookup_dcache()
namei: simplify invalidation logics in lookup_dcache()
namei: change calling conventions for lookup_{fast,slow} and follow_managed()
...

Linus Torvalds
2016-03-20 09:52:29 +0800

05 Mar, 2016

1 commit

c43c83a29 direct-io: only use block polling if explicitly requested ... Browse Code »

Signed-off-by: Christoph Hellwig
Reviewed-by: Stephen Bates
Tested-by: Stephen Bates
Acked-by: Jeff Moyer
Signed-off-by: Al Viro

Christoph Hellwig
2016-03-05 01:20:10 +0800

08 Feb, 2016

1 commit

187372a3b direct-io: always call ->end_io if non-NULL ... Browse Code »

This way we can pass back errors to the file system, and allow for
cleanup required for all direct I/O invocations.

Also allow the ->end_io handlers to return errors on their own, so that
I/O completion errors can be passed on to the callers.

Signed-off-by: Christoph Hellwig
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner

Christoph Hellwig
2016-02-08 11:40:51 +0800

31 Jan, 2016

1 commit

7ddc971f8 block: fix use-after-free in dio_bio_complete ... Browse Code »

kasan reported the following error when i ran xfstest:

[ 701.826854] ==================================================================
[ 701.826864] BUG: KASAN: use-after-free in dio_bio_complete+0x41a/0x600 at addr ffff880080b95f94
[ 701.826870] Read of size 4 by task loop2/3874
[ 701.826879] page:ffffea000202e540 count:0 mapcount:0 mapping: (null) index:0x0
[ 701.826890] flags: 0x100000000000000()
[ 701.826895] page dumped because: kasan: bad access detected
[ 701.826904] CPU: 3 PID: 3874 Comm: loop2 Tainted: G B W L 4.5.0-rc1-next-20160129 #83
[ 701.826910] Hardware name: LENOVO 23205NG/23205NG, BIOS G2ET95WW (2.55 ) 07/09/2013
[ 701.826917] ffff88008fadf800 ffff88008fadf758 ffffffff81ca67bb 0000000041b58ab3
[ 701.826941] ffffffff830d1e74 ffffffff81ca6724 ffff88008fadf748 ffffffff8161c05c
[ 701.826963] 0000000000000282 ffff88008fadf800 ffffed0010172bf2 ffffea000202e540
[ 701.826987] Call Trace:
[ 701.826997] [] dump_stack+0x97/0xdc
[ 701.827005] [] ? _atomic_dec_and_lock+0xc4/0xc4
[ 701.827014] [] ? __dump_page+0x32c/0x490
[ 701.827023] [] kasan_report_error+0x5f3/0x8b0
[ 701.827033] [] ? dio_bio_complete+0x41a/0x600
[ 701.827040] [] __asan_report_load4_noabort+0x59/0x80
[ 701.827048] [] ? dio_bio_complete+0x41a/0x600
[ 701.827053] [] dio_bio_complete+0x41a/0x600
[ 701.827057] [] ? blk_queue_exit+0x108/0x270
[ 701.827060] [] dio_bio_end_aio+0xa0/0x4d0
[ 701.827063] [] ? dio_bio_complete+0x600/0x600
[ 701.827067] [] ? blk_account_io_completion+0x316/0x5d0
[ 701.827070] [] bio_endio+0x79/0x200
[ 701.827074] [] blk_update_request+0x1df/0xc50
[ 701.827078] [] blk_mq_end_request+0x57/0x120
[ 701.827081] [] __blk_mq_complete_request+0x310/0x590
[ 701.827084] [] ? set_next_entity+0x2f8/0x2ed0
[ 701.827088] [] ? put_prev_entity+0x22d/0x2a70
[ 701.827091] [] blk_mq_complete_request+0x5b/0x80
[ 701.827094] [] loop_queue_work+0x273/0x19d0
[ 701.827098] [] ? finish_task_switch+0x1c8/0x8e0
[ 701.827101] [] ? trace_hardirqs_on_caller+0x18/0x6c0
[ 701.827104] [] ? lo_read_simple+0x890/0x890
[ 701.827108] [] ? debug_check_no_locks_freed+0x350/0x350
[ 701.827111] [] ? __hrtick_start+0x130/0x130
[ 701.827115] [] ? __schedule+0x936/0x20b0
[ 701.827118] [] ? kthread_worker_fn+0x3ed/0x8d0
[ 701.827121] [] ? kthread_worker_fn+0x21d/0x8d0
[ 701.827125] [] ? trace_hardirqs_on_caller+0x18/0x6c0
[ 701.827128] [] kthread_worker_fn+0x2af/0x8d0
[ 701.827132] [] ? __init_kthread_worker+0x170/0x170
[ 701.827135] [] ? _raw_spin_unlock_irqrestore+0x36/0x60
[ 701.827138] [] ? __init_kthread_worker+0x170/0x170
[ 701.827141] [] ? __init_kthread_worker+0x170/0x170
[ 701.827144] [] kthread+0x24b/0x3a0
[ 701.827148] [] ? kthread_create_on_node+0x4c0/0x4c0
[ 701.827151] [] ? trace_hardirqs_on+0xd/0x10
[ 701.827155] [] ? do_group_exit+0xdd/0x350
[ 701.827158] [] ? kthread_create_on_node+0x4c0/0x4c0
[ 701.827161] [] ret_from_fork+0x3f/0x70
[ 701.827165] [] ? kthread_create_on_node+0x4c0/0x4c0
[ 701.827167] Memory state around the buggy address:
[ 701.827170] ffff880080b95e80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 701.827172] ffff880080b95f00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 701.827175] >ffff880080b95f80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 701.827177] ^
[ 701.827179] ffff880080b96000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 701.827182] ffff880080b96080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 701.827183] ==================================================================

The problem is that bio_check_pages_dirty calls bio_put, so we must
not access bio fields after bio_check_pages_dirty.

Fixes: 9b81c842355ac96097ba ("block: don't access bio->bi_error after bio_put()").
Signed-off-by: Mike Krinkin
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe

Mike Krinkin
2016-01-31 13:02:10 +0800

23 Jan, 2016

1 commit

5955102c9 wrappers for ->i_mutex access ... Browse Code »

parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).

Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.

Signed-off-by: Al Viro

Al Viro
2016-01-23 07:04:28 +0800

09 Dec, 2015

1 commit

2d4594acb fix the regression from "direct-io: Fix negative return from dio read beyond eof" ... Browse Code »

Sure, it's better to bail out of past-the-eof read and return 0 than return
a bogus negative value on such. Only we'd better make sure we are bailing out
with 0 and not -ENOMEM...

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro

Al Viro
2015-12-09 04:02:42 +0800

01 Dec, 2015

1 commit

74cedf9b6 direct-io: Fix negative return from dio read beyond eof ... Browse Code »

Assume a filesystem with 4KB blocks. When a file has size 1000 bytes and
we issue direct IO read at offset 1024, blockdev_direct_IO() reads the
tail of the last block and the logic for handling short DIO reads in
dio_complete() results in a return value -24 (1000 - 1024) which
obviously confuses userspace.

Fix the problem by bailing out early once we sample i_size and can
reliably check that direct IO read starts beyond i_size.

Reported-by: Avi Kivity
Fixes: 9fe55eea7e4b444bafc42fa0000cc2d1d2847275
CC: stable@vger.kernel.org
CC: Steven Whitehouse
Signed-off-by: Jan Kara
Signed-off-by: Jens Axboe

Jan Kara
2015-12-01 01:15:42 +0800

11 Nov, 2015

2 commits

3419b4503 Merge branch 'for-4.4/io-poll' of git://git.kernel.dk/linux-block ... Browse Code »

Pull block IO poll support from Jens Axboe:
"Various groups have been doing experimentation around IO polling for
(really) fast devices. The code has been reviewed and has been
sitting on the side for a few releases, but this is now good enough
for coordinated benchmarking and further experimentation.

Currently O_DIRECT sync read/write are supported. A framework is in
the works that allows scalable stats tracking so we can auto-tune
this. And we'll add libaio support as well soon. Fow now, it's an
opt-in feature for test purposes"

* 'for-4.4/io-poll' of git://git.kernel.dk/linux-block:
direct-io: be sure to assign dio->bio_bdev for both paths
directio: add block polling support
NVMe: add blk polling support
block: add block polling support
blk-mq: return tag/queue combo in the make_request_fn handlers
block: change ->make_request_fn() and users to return a queue cookie

Linus Torvalds
2015-11-11 09:23:49 +0800
c1c534609 direct-io: be sure to assign dio->bio_bdev for both paths ... Browse Code »

btrfs sets ->submit_io(), and we failed to set the block dev for
that path. That resulted in a potential NULL dereference when
we later wait for IO in dio_await_one().

Reported-by: kernel test robot
Signed-off-by: Jens Axboe

Jens Axboe
2015-11-11 01:14:38 +0800

08 Nov, 2015

1 commit

15c4f638f directio: add block polling support ... Browse Code »

This adds support for sync O_DIRECT read/write poll support.

Signed-off-by: Jens Axboe
[hch: split from a larger patch, minor updates]
Signed-off-by: Christoph Hellwig
Acked-by: Keith Busch

Jens Axboe
2015-11-08 01:40:47 +0800

07 Nov, 2015

1 commit

71baba4b9 mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM ... Browse Code »

__GFP_WAIT was used to signal that the caller was in atomic context and
could not sleep. Now it is possible to distinguish between true atomic
context and callers that are not willing to sleep. The latter should
clear __GFP_DIRECT_RECLAIM so kswapd will still wake. As clearing
__GFP_WAIT behaves differently, there is a risk that people will clear the
wrong flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly
indicate what it does -- setting it allows all reclaim activity, clearing
them prevents it.

[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Mel Gorman
Acked-by: Michal Hocko
Acked-by: Vlastimil Babka
Acked-by: Johannes Weiner
Cc: Christoph Lameter
Acked-by: David Rientjes
Cc: Vitaly Wool
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2015-11-07 09:50:42 +0800

24 Sep, 2015

1 commit

53cbf3b15 fs: direct-io: don't dirtying pages for ITER_BVEC/ITER_KVEC direct read ... Browse Code »

When direct read IO is submitted from kernel, it is often
unnecessary to dirty pages, for example of loop, dirtying pages
have been considered in the upper filesystem(over loop) side
already, and they don't need to be dirtied again.

So this patch doesn't dirtying pages for ITER_BVEC/ITER_KVEC
direct read, and loop should be the 1st case to use ITER_BVEC/ITER_KVEC
for direct read I/O.

The patch is based on previous Dave's patch.

Reviewed-by: Dave Kleikamp
Reviewed-by: Christoph Hellwig
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe

Ming Lei
2015-09-24 01:00:57 +0800

14 Aug, 2015

1 commit

b54ffb73c block: remove bio_get_nr_vecs() ... Browse Code »

We can always fill up the bio now, no need to estimate the possible
size based on queue parameters.

Acked-by: Steven Whitehouse
Signed-off-by: Kent Overstreet
[hch: rebased and wrote a changelog]
Signed-off-by: Christoph Hellwig
Signed-off-by: Ming Lin
Signed-off-by: Jens Axboe

Kent Overstreet
2015-08-14 02:32:04 +0800

12 Aug, 2015

1 commit

9b81c8423 block: don't access bio->bi_error after bio_put() ... Browse Code »

Commit 4246a0b6 ("block: add a bi_error field to struct bio") has added a few
dereferences of 'bio' after a call to bio_put(). This causes use-after-frees
such as:

[521120.719695] BUG: KASan: use after free in dio_bio_complete+0x2b3/0x320 at addr ffff880f36b38714
[521120.720638] Read of size 4 by task mount.ocfs2/9644
[521120.721212] =============================================================================
[521120.722056] BUG kmalloc-256 (Not tainted): kasan: bad access detected
[521120.722968] -----------------------------------------------------------------------------
[521120.722968]
[521120.723915] Disabling lock debugging due to kernel taint
[521120.724539] INFO: Slab 0xffffea003cdace00 objects=32 used=25 fp=0xffff880f36b38600 flags=0x46fffff80004080
[521120.726037] INFO: Object 0xffff880f36b38700 @offset=1792 fp=0xffff880f36b38800
[521120.726037]
[521120.726974] Bytes b4 ffff880f36b386f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[521120.727898] Object ffff880f36b38700: 00 88 b3 36 0f 88 ff ff 00 00 d8 de 0b 88 ff ff ...6............
[521120.728822] Object ffff880f36b38710: 02 00 00 f0 00 00 00 00 00 00 00 00 00 00 00 00 ................
[521120.729705] Object ffff880f36b38720: 01 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 ................
[521120.730623] Object ffff880f36b38730: 00 00 00 00 00 00 00 00 01 00 00 00 00 02 00 00 ................
[521120.731621] Object ffff880f36b38740: 00 02 00 00 01 00 00 00 d0 f7 87 ad ff ff ff ff ................
[521120.732776] Object ffff880f36b38750: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[521120.733640] Object ffff880f36b38760: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[521120.734508] Object ffff880f36b38770: 01 00 03 00 01 00 00 00 88 87 b3 36 0f 88 ff ff ...........6....
[521120.735385] Object ffff880f36b38780: 00 73 22 ad 02 88 ff ff 40 13 e0 3c 00 ea ff ff .s".....@..ffff880f36b38700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[521120.781465] ^
[521120.782083] ffff880f36b38780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[521120.783717] ffff880f36b38800: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[521120.784818] ==================================================================

This patch fixes a few of those places that I caught while auditing the patch, but the
original patch should be audited further for more occurences of this issue since I'm
not too familiar with the code.

Signed-off-by: Sasha Levin
Signed-off-by: Jens Axboe

Sasha Levin
2015-08-12 01:34:32 +0800

29 Jul, 2015

1 commit

4246a0b63 block: add a bi_error field to struct bio ... Browse Code »

Currently we have two different ways to signal an I/O error on a BIO:

(1) by clearing the BIO_UPTODATE flag
(2) by returning a Linux errno value to the bi_end_io callback

The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario. Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.

So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.

Signed-off-by: Christoph Hellwig
Reviewed-by: Hannes Reinecke
Reviewed-by: NeilBrown
Signed-off-by: Jens Axboe

Christoph Hellwig
2015-07-29 22:55:15 +0800

25 Apr, 2015

1 commit

fe0f07d08 direct-io: only inc/dec inode->i_dio_count for file systems ... Browse Code »

do_blockdev_direct_IO() increments and decrements the inode
->i_dio_count for each IO operation. It does this to protect against
truncate of a file. Block devices don't need this sort of protection.

For a capable multiqueue setup, this atomic int is the only shared
state between applications accessing the device for O_DIRECT, and it
presents a scaling wall for that. In my testing, as much as 30% of
system time is spent incrementing and decrementing this value. A mixed
read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
better latencies too. Before:

clat percentiles (usec):
| 1.00th=[ 33], 5.00th=[ 34], 10.00th=[ 34], 20.00th=[ 34],
| 30.00th=[ 34], 40.00th=[ 34], 50.00th=[ 35], 60.00th=[ 35],
| 70.00th=[ 35], 80.00th=[ 35], 90.00th=[ 37], 95.00th=[ 80],
| 99.00th=[ 98], 99.50th=[ 151], 99.90th=[ 155], 99.95th=[ 155],
| 99.99th=[ 165]

After:

clat percentiles (usec):
| 1.00th=[ 95], 5.00th=[ 108], 10.00th=[ 129], 20.00th=[ 149],
| 30.00th=[ 155], 40.00th=[ 161], 50.00th=[ 167], 60.00th=[ 171],
| 70.00th=[ 177], 80.00th=[ 185], 90.00th=[ 201], 95.00th=[ 270],
| 99.00th=[ 390], 99.50th=[ 398], 99.90th=[ 418], 99.95th=[ 422],
| 99.99th=[ 438]

In other setups, Robert Elliott reported seeing good performance
improvements:

https://lkml.org/lkml/2015/4/3/557

The more applications accessing the device, the worse it gets.

Add a new direct-io flags, DIO_SKIP_DIO_COUNT, which tells
do_blockdev_direct_IO() that it need not worry about incrementing
or decrementing the inode i_dio_count for this caller.

Cc: Andrew Morton
Cc: Christoph Hellwig
Cc: Theodore Ts'o
Cc: Elliott, Robert (Server Storage)
Cc: Al Viro
Signed-off-by: Jens Axboe
Signed-off-by: Al Viro

Jens Axboe
2015-04-25 03:45:28 +0800

12 Apr, 2015

1 commit

17f8c842d Remove rw from {,__,do_}blockdev_direct_IO() ... Browse Code »

Most filesystems call through to these at some point, so we'll start
here.

Signed-off-by: Omar Sandoval
Signed-off-by: Al Viro

Omar Sandoval
2015-04-12 10:29:44 +0800

26 Mar, 2015

1 commit

e2e40f2c1 fs: move struct kiocb to fs.h ... Browse Code »

struct kiocb now is a generic I/O container, so move it to fs.h.
Also do a #include diet for aio.h while we're at it.

Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro

Christoph Hellwig
2015-03-26 08:28:11 +0800

14 Mar, 2015

1 commit

04b2fa9f8 fs: split generic and aio kiocb ... Browse Code »

Most callers in the kernel want to perform synchronous file I/O, but
still have to bloat the stack with a full struct kiocb. Split out
the parts needed in filesystem code from those in the aio code, and
only allocate those needed to pass down argument on the stack. The
aio code embedds the generic iocb in the one it allocates and can
easily get back to it by using container_of.

Also add a ->ki_complete method to struct kiocb, this is used to call
into the aio code and thus removes the dependency on aio for filesystems
impementing asynchronous operations. It will also allow other callers
to substitute their own completion callback.

We also add a new ->ki_flags field to work around the nasty layering
violation recently introduced in commit 5e33f6 ("usb: gadget: ffs: add
eventfd notification about ffs events").

Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro

Christoph Hellwig
2015-03-14 00:10:27 +0800

27 Sep, 2014

1 commit

2c80929c4 fuse: honour max_read and max_write in direct_io mode ... Browse Code »

The third argument of fuse_get_user_pages() "nbytesp" refers to the number of
bytes a caller asked to pack into fuse request. This value may be lesser
than capacity of fuse request or iov_iter. So fuse_get_user_pages() must
ensure that *nbytesp won't grow.

Now, when helper iov_iter_get_pages() performs all hard work of extracting
pages from iov_iter, it can be done by passing properly calculated
"maxsize" to the helper.

The other caller of iov_iter_get_pages() (dio_refill_pages()) doesn't need
this capability, so pass LONG_MAX as the maxsize argument here.

Fixes: c9c37e2e6378 ("fuse: switch to iov_iter_get_pages()")
Reported-by: Werner Baumann
Tested-by: Maxim Patlasov
Signed-off-by: Miklos Szeredi
Signed-off-by: Al Viro

Miklos Szeredi
2014-09-27 09:16:51 +0800

08 Aug, 2014

1 commit

c7f3888ad switch iov_iter_get_pages() to passing maximal number of pages ... Browse Code »

... instead of maximal size.

Signed-off-by: Al Viro

Al Viro
2014-08-08 02:40:11 +0800

01 Aug, 2014

1 commit

af4364727 direct-io: fix AIO regression ... Browse Code »

The direct-io.c rewrite to use the iov_iter infrastructure stopped updating
the size field in struct dio_submit, and thus rendered the check for
allowing asynchronous completions to always return false. Fix this by
comparing it to the count of bytes in the iov_iter instead.

Signed-off-by: Christoph Hellwig
Reported-by: Tim Chen
Tested-by: Tim Chen

Christoph Hellwig
2014-08-01 14:35:51 +0800

24 Jul, 2014

1 commit

6fcc5420b direct-io: fix uninitialized warning in do_direct_IO() ... Browse Code »

The following warnings:

fs/direct-io.c: In function ‘__blockdev_direct_IO’:
fs/direct-io.c:1011:12: warning: ‘to’ may be used uninitialized in this function [-Wmaybe-uninitialized]
fs/direct-io.c:913:16: note: ‘to’ was declared here
fs/direct-io.c:1011:12: warning: ‘from’ may be used uninitialized in this function [-Wmaybe-uninitialized]
fs/direct-io.c:913:10: note: ‘from’ was declared here

are false positive because dio_get_page() either fails, or sets both
'from' and 'to'.

Paul Bolle said ...
Maybe it's better to move initializing "to" and "from" out of
dio_get_page(). That _might_ make it easier for both the the reader and
the compiler to understand what's going on. Something like this:

Christoph Hellwig said ...
The fix of moving the code definitively looks nicer, while I think
uninitialized_var is horrible wart that won't get anywhere near my code.

Boaz Harrosh: I agree with Christoph and Paul

Signed-off-by: Boaz Harrosh
Signed-off-by: Christoph Hellwig

Boaz Harrosh
2014-07-24 18:17:07 +0800

07 May, 2014

5 commits

f67da30c1 new helper: iov_iter_npages() ... Browse Code »

counts the pages covered by iov_iter, up to given limit.
do_block_direct_io() and fuse_iter_npages() switched to
it.

Signed-off-by: Al Viro

Al Viro
2014-05-07 05:32:52 +0800
7b2c99d15 new helper: iov_iter_get_pages() ... Browse Code »

iov_iter_get_pages(iter, pages, maxsize, &start) grabs references pinning
the pages of up to maxsize of (contiguous) data from iter. Returns the
amount of memory grabbed or -error. In case of success, the requested
area begins at offset start in pages[0] and runs through pages[1], etc.
Less than requested amount might be returned - either because the contiguous
area in the beginning of iterator is smaller than requested, or because
the kernel failed to pin that many pages.

direct-io.c switched to using iov_iter_get_pages()

Signed-off-by: Al Viro

Al Viro
2014-05-07 05:32:50 +0800
3320c60b3 dio: take updating ->result into do_direct_IO() ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2014-05-07 05:32:50 +0800
886a39115 new primitive: iov_iter_alignment() ... Browse Code »

returns the value aligned as badly as the worst remaining segment
in iov_iter is. Use instead of open-coded equivalents.

Signed-off-by: Al Viro

Al Viro
2014-05-07 05:32:47 +0800
31b140398 switch {__,}blockdev_direct_IO() to iov_iter ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2014-05-07 05:32:46 +0800

05 Apr, 2014

1 commit

d15e03104 Merge tag 'xfs-for-linus-3.15-rc1' of git://oss.sgi.com/xfs/xfs ... Browse Code »

Pull xfs update from Dave Chinner:
"There are a couple of new fallocate features in this request - it was
decided that it was easiest to push them through the XFS tree using
topic branches and have the ext4 support be based on those branches.
Hence you may see some overlap with the ext4 tree merge depending on
how they including those topic branches into their tree. Other than
that, there is O_TMPFILE support, some cleanups and bug fixes.

The main changes in the XFS tree for 3.15-rc1 are:

- O_TMPFILE support
- allowing AIO+DIO writes beyond EOF
- FALLOC_FL_COLLAPSE_RANGE support for fallocate syscall and XFS
implementation
- FALLOC_FL_ZERO_RANGE support for fallocate syscall and XFS
implementation
- IO verifier cleanup and rework
- stack usage reduction changes
- vm_map_ram NOIO context fixes to remove lockdep warings
- various bug fixes and cleanups"

* tag 'xfs-for-linus-3.15-rc1' of git://oss.sgi.com/xfs/xfs: (34 commits)
xfs: fix directory hash ordering bug
xfs: extra semi-colon breaks a condition
xfs: Add support for FALLOC_FL_ZERO_RANGE
fs: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
xfs: inode log reservations are still too small
xfs: xfs_check_page_type buffer checks need help
xfs: avoid AGI/AGF deadlock scenario for inode chunk allocation
xfs: use NOIO contexts for vm_map_ram
xfs: don't leak EFSBADCRC to userspace
xfs: fix directory inode iolock lockdep false positive
xfs: allocate xfs_da_args to reduce stack footprint
xfs: always do log forces via the workqueue
xfs: modify verifiers to differentiate CRC from other errors
xfs: print useful caller information in xfs_error_report
xfs: add xfs_verifier_error()
xfs: add helper for updating checksums on xfs_bufs
xfs: add helper for verifying checksums on xfs_bufs
xfs: Use defines for CRC offsets in all cases
xfs: skip pointless CRC updates after verifier failures
xfs: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate
...

Linus Torvalds
2014-04-05 06:50:08 +0800

04 Apr, 2014

1 commit

2b665e276 fs/direct-io.c: remove redundant comparison ... Browse Code »

The return value of bio_get_nr_vecs() cannot be bigger than
BIO_MAX_PAGES, so we can remove redundant the comparison between
nr_pages and BIO_MAX_PAGES.

Signed-off-by: Gu Zheng
Cc: Al Viro
Reviewed-by: Jeff Moyer
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Gu Zheng
2014-04-04 07:20:57 +0800