27 May, 2016

2 commits

  • Pull misc DAX updates from Vishal Verma:
    "DAX error handling for 4.7

    - Until now, dax has been disabled if media errors were found on any
    device. This enables the use of DAX in the presence of these
    errors by making all sector-aligned zeroing go through the driver.

    - The driver (already) has the ability to clear errors on writes that
    are sent through the block layer using 'DSMs' defined in ACPI 6.1.

    Other misc changes:

    - When mounting DAX filesystems, check to make sure the partition is
    page aligned. This is a requirement for DAX, and previously, we
    allowed such unaligned mounts to succeed, but subsequent
    reads/writes would fail.

    - Misc/cleanup fixes from Jan that remove unused code from DAX
    related to zeroing, writeback, and some size checks"
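    The alignment requirement behind those mount checks can be sketched in
    a few lines of userspace C (the helper name is hypothetical; it mirrors
    the kind of test bdev_dax_supported() performs, assuming 512-byte
    sectors and a 4 KiB page size):

    ```c
    #include <assert.h>

    #define SECTOR_SIZE 512ULL
    #define PAGE_SIZE_ASSUMED 4096ULL  /* typical x86-64 page size */

    /* Return 1 if a partition starting at 'start_sect' spanning
     * 'nr_sects' sectors begins and ends on page boundaries - the
     * precondition for a DAX mount - and 0 otherwise. */
    static int partition_page_aligned(unsigned long long start_sect,
                                      unsigned long long nr_sects)
    {
        return (start_sect * SECTOR_SIZE) % PAGE_SIZE_ASSUMED == 0 &&
               (nr_sects * SECTOR_SIZE) % PAGE_SIZE_ASSUMED == 0;
    }
    ```

    Rejecting unaligned partitions at mount time replaces the old behavior,
    where the mount succeeded but subsequent reads/writes failed.
    
    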

    * tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    dax: fix a comment in dax_zero_page_range and dax_truncate_page
    dax: for truncate/hole-punch, do zeroing through the driver if possible
    dax: export a low-level __dax_zero_page_range helper
    dax: use sb_issue_zerout instead of calling dax_clear_sectors
    dax: enable dax in the presence of known media errors (badblocks)
    dax: fallback from pmd to pte on error
    block: Update blkdev_dax_capable() for consistency
    xfs: Add alignment check for DAX mount
    ext2: Add alignment check for DAX mount
    ext4: Add alignment check for DAX mount
    block: Add bdev_dax_supported() for dax mount checks
    block: Add vfs_msg() interface
    dax: Remove redundant inode size checks
    dax: Remove pointless writeback from dax_do_io()
    dax: Remove zeroing from dax_io()
    dax: Remove dead zeroing code from fault handlers
    ext2: Avoid DAX zeroing to corrupt data
    ext2: Fix block zeroing in ext2_get_blocks() for DAX
    dax: Remove complete_unwritten argument
    DAX: move RADIX_DAX_ definitions to dax.c

    Linus Torvalds
     
  • Pull Ceph updates from Sage Weil:
    "This changeset has a few main parts:

    - Ilya has finished a huge refactoring effort to sync up the
    client-side logic in libceph with the user-space client code, which
    has evolved significantly over the last couple of years, with lots
    of additional behaviors (e.g., how requests are handled when the
    cluster is full and transitions from full to non-full).

    The structure of the code is now much more closely aligned with
    userspace, such that it will be much easier to maintain going
    forward when behavior changes take place. There are some locking
    improvements bundled in as well.

    - Zheng adds multi-filesystem support (multiple namespaces within the
    same Ceph cluster)

    - Zheng has changed the readdir offsets and directory enumeration so
    that dentry offsets are hash-based and therefore stable across
    directory fragmentation events on the MDS.

    - Zheng has a smorgasbord of bug fixes across fs/ceph"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (71 commits)
    ceph: fix wake_up_session_cb()
    ceph: don't use truncate_pagecache() to invalidate read cache
    ceph: SetPageError() for writeback pages if writepages fails
    ceph: handle interrupted ceph_writepage()
    ceph: make ceph_update_writeable_page() uninterruptible
    libceph: make ceph_osdc_wait_request() uninterruptible
    ceph: handle -EAGAIN returned by ceph_update_writeable_page()
    ceph: make fault/page_mkwrite return VM_FAULT_OOM for -ENOMEM
    ceph: block non-fatal signals for fault/page_mkwrite
    ceph: make logical calculation functions return bool
    ceph: tolerate bad i_size for symlink inode
    ceph: improve fragtree change detection
    ceph: keep leaf frag when updating fragtree
    ceph: fix dir_auth check in ceph_fill_dirfrag()
    ceph: don't assume frag tree splits in mds reply are sorted
    ceph: fix inode reference leak
    ceph: using hash value to compose dentry offset
    ceph: don't forbid marking directory complete after forward seek
    ceph: record 'offset' for each entry of readdir result
    ceph: define 'end/complete' in readdir reply as bit flags
    ...

    Linus Torvalds
     

26 May, 2016

10 commits

  • ... with a wrapper around maybe_request_map() - no need for two
    osdmap-specific functions.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • For map check, we are going to need to send CEPH_MSG_MON_GET_VERSION
    messages asynchronously and get a callback on completion. Refactor MON
    client to allow firing off generic requests asynchronously and add an
    async variant of ceph_monc_get_version(). ceph_monc_do_statfs() is
    switched over and remains sync.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
    This adds support for, and switches rbd to, a new, more reliable
    version of the watch/notify protocol. As with the OSD client update,
    this is mostly about getting the right structures linked into the
    right places so that reconnects are properly sent when needed.
    watch/notify v2 also requires sending regular pings to the OSDs -
    send_linger_ping().

    A major change from the old watch/notify implementation is the
    introduction of ceph_osd_linger_request - linger requests no longer
    piggy back on ceph_osd_request. ceph_osd_event has been merged into
    ceph_osd_linger_request.

    All the details are now hidden within libceph, the interface consists
    of a simple pair of watch/unwatch functions and ceph_osdc_notify_ack().
    ceph_osdc_watch() does return ceph_osd_linger_request, but only to keep
    the lifetime management simple.

    ceph_osdc_notify_ack() accepts an optional data payload, which is
    relayed back to the notifier.

    Portions of this patch are loosely based on work by Douglas Fuller
    and Mike Christie.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Introduce __rbd_dev_header_unwatch_sync(), which doesn't flush notify
    callbacks. This is for the new rados_watcherrcb_t, which would be
    called from a notify callback.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • finish_read(), its only user, uses it to get to hdr.data_len, which is
    what ->r_result is set to on success. This gains us the ability to
    safely call callbacks from contexts other than reply, e.g. map check.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • The crux of this is getting rid of ceph_osdc_build_request(), so that
    MOSDOp can be encoded not before but after calc_target() calculates the
    actual target. Encoding now happens within ceph_osdc_start_request().

    Also nuked is the accompanying bunch of pointers into the encoded
    buffer that was used to update fields on each send - instead, the
    entire front is re-encoded. If we want to support target->name_len !=
    base->name_len in the future, there is no other way, because oid is
    surrounded by other fields in the encoded buffer.

    Encoding OSD ops and adding data items to the request message were
    mixed together in osd_req_encode_op(). While we want to re-encode OSD
    ops, we don't want to add duplicate data items to the message when
    resending, so all calls to ceph_osdc_msg_data_add() are factored out
    into a new setup_request_data().

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Switch to ceph_object_id and use ceph_oid_aprintf() instead of a bare
    const char *. This reduces noise in rbd_dev_header_name().

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Currently ceph_object_id can hold object names of up to 100
    (CEPH_MAX_OID_NAME_LEN) characters. This is enough for all use cases,
    except one - long rbd image names:

    - a format 1 header is named "<image name>.rbd"
    - an object that points to a format 2 header is named "rbd_id.<image name>"

    We operate on these potentially long-named objects during rbd map, and,
    for format 1 images, during header refresh. (A format 2 header name is
    a small system-generated string.)

    Lift this 100 character limit by making ceph_object_id be able to point
    to an externally-allocated string. Apart from being able to work with
    almost arbitrarily-long named objects, this allows us to reduce the
    size of ceph_object_id from >100 bytes to 64 bytes.
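    The inline-or-external layout described above is a classic small-string
    optimization. A minimal userspace sketch of the idea (struct and
    function names modeled loosely on ceph_object_id; the inline capacity
    here is illustrative, not the kernel's exact figure):

    ```c
    #include <assert.h>
    #include <stdlib.h>
    #include <string.h>

    #define OID_INLINE_LEN 52  /* illustrative inline capacity */

    /* Object id that stores short names inline and falls back to an
     * external allocation for long ones (e.g. long rbd image names). */
    struct object_id {
        char *name;                       /* points at inline_name or heap */
        char inline_name[OID_INLINE_LEN];
        int name_len;
    };

    static int oid_set_name(struct object_id *oid, const char *name)
    {
        size_t len = strlen(name);

        if (len < OID_INLINE_LEN) {
            oid->name = oid->inline_name;      /* common case: no allocation */
        } else {
            oid->name = malloc(len + 1);       /* rare case: external buffer */
            if (!oid->name)
                return -1;
        }
        memcpy(oid->name, name, len + 1);
        oid->name_len = (int)len;
        return 0;
    }

    static void oid_destroy(struct object_id *oid)
    {
        if (oid->name != oid->inline_name)
            free(oid->name);
    }
    ```

    Because the inline buffer dominates sizeof(struct object_id), shrinking
    it is what allows the overall structure to drop from >100 bytes to 64.
    
    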

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • The size of ->r_request and ->r_reply messages depends on the size of
    the object name (ceph_object_id), while the size of ceph_osd_request is
    fixed. Move message allocation into a separate function that would
    have to be called after ceph_object_id and ceph_object_locator (which
    is also going to become variable in size with RADOS namespaces) have
    been filled in:

    req = ceph_osdc_alloc_request(...);
    <fill in req->r_base_oid>
    <fill in req->r_base_oloc>
    ceph_osdc_alloc_messages(req);

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
    By the time we get to checking the terminating condition of
    for_each_obj_request_safe(img_request), all obj_requests may be
    complete, and the img_request ref that rbd_img_request_submit() takes
    away from its caller may have been put. Moving the next_obj_request
    cursor is then a use-after-free on img_request.

    It's totally benign, as the value that's read is never used, but
    I think it's still worth fixing.
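    The hazard is inherent to safe-iteration macros: the cursor advance
    reads the current element after the loop body may have dropped the last
    reference. A standalone C illustration of the idiom (names hypothetical;
    the rbd fix pins the *container* with a local reference instead):

    ```c
    #include <assert.h>
    #include <stdlib.h>

    struct node { int val; struct node *next; };

    /* Safe-iteration idiom: precompute 'next' BEFORE the body frees
     * 'cur'. In the rbd bug the list's container itself could be freed
     * by the body, so even reading the cursor for the terminating
     * condition touched freed memory. */
    static int free_all(struct node *head)
    {
        struct node *cur = head, *next;
        int freed = 0;

        while (cur) {
            next = cur->next;   /* read before cur is freed */
            free(cur);
            freed++;
            cur = next;
        }
        return freed;
    }

    static struct node *push(struct node *head, int val)
    {
        struct node *n = malloc(sizeof(*n));
        n->val = val;
        n->next = head;
        return n;
    }
    ```
    
    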

    Cc: Alex Elder
    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     

21 May, 2016

4 commits

    The debug_stat sysfs file is read-only and exposes various debugging
    data that zram developers may need. This file is not meant to be used
    by anyone else: its content is not documented and may change at any
    time without notice. Therefore, the output of the debug_stat file
    contains a version string. To avoid any confusion, we will increase
    the version number every time we modify the output.

    At the moment this file exports only one value -- the number of
    re-compressions, IOW, the number of times the compression fast path
    has failed. This stat is temporary and will be useful in case any
    per-cpu compression stream regressions are reported.

    Link: http://lkml.kernel.org/r/20160513230834.GB26763@bbox
    Link: http://lkml.kernel.org/r/20160511134553.12655-1-sergey.senozhatsky@gmail.com
    Signed-off-by: Sergey Senozhatsky
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
    Remove the internal part of the max_comp_streams interface, since we
    switched to per-cpu streams. We will keep the RW max_comp_streams
    attr around, because:

    a) we may (silently) switch back to the idle compression streams list
    and don't want to disturb user space

    b) the max_comp_streams attr must wait for the next 'lay off cycle';
    we give user space 2 years to adjust before we remove/downgrade the
    attr, and there are already several attrs scheduled for removal in
    4.11, so it's too late for max_comp_streams.

    This slightly changes user-visible behaviour:

    - First, reading from the max_comp_streams file will now always
    return the number of online CPUs.

    - Second, writing to max_comp_streams will no longer take any effect.
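    The new attr semantics can be sketched as a userspace analogue (function
    names are hypothetical; sysconf() stands in for the kernel's
    num_online_cpus(), and the sysfs show/store machinery is elided):

    ```c
    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* 'show' always reports the number of online CPUs... */
    static int max_comp_streams_show(char *buf, size_t len)
    {
        long cpus = sysconf(_SC_NPROCESSORS_ONLN);
        return snprintf(buf, len, "%ld\n", cpus);
    }

    /* ...and 'store' accepts input but deliberately changes nothing,
     * reporting success so existing user space is not disturbed. */
    static long max_comp_streams_store(const char *buf, size_t len)
    {
        (void)buf;          /* written value is ignored */
        return (long)len;   /* pretend the whole write was consumed */
    }
    ```
    
    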

    Link: http://lkml.kernel.org/r/20160503165546.25201-1-sergey.senozhatsky@gmail.com
    Signed-off-by: Sergey Senozhatsky
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
    Remove the idle streams list and keep compression streams in per-cpu
    data. This removes two contended spin_lock()/spin_unlock() calls from
    the write path and also prevents the write OP from being preempted
    while holding the compression stream, which can cause slowdowns.

    For instance, let's assume that we have N cpus and N-2
    max_comp_streams. TASK1 owns the last idle stream, and TASK2-TASK3
    come in with write requests:

    TASK1                TASK2                TASK3
    zram_bvec_write()
     spin_lock
     find stream
     spin_unlock

     compress

     <<preempted>>       zram_bvec_write()
                          spin_lock
                          find stream
                          spin_unlock
                            no_stream
                              schedule
                                              zram_bvec_write()
                                               spin_lock
                                               find_stream
                                               spin_unlock
                                                 no_stream
                                                   schedule
     spin_lock
     release stream
     spin_unlock
      wake up TASK2

    Not only will TASK2 and TASK3 fail to get the stream, TASK1 will also
    be preempted in the middle of its operation; we would prefer it to
    finish compression and release the stream.
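    The shape of the fix can be simulated in userspace C (struct and
    function names are hypothetical stand-ins; real kernel code would use
    get_cpu_ptr()/put_cpu_ptr(), which also disables preemption):

    ```c
    #include <assert.h>

    #define NR_CPUS_SIM 4

    struct zcomp_strm_sim { int in_use; int owner_cpu; };

    /* With per-cpu streams each CPU indexes its own slot directly: no
     * shared idle list, so no contended lock, and no task ever sleeps
     * waiting for another task's stream. */
    static struct zcomp_strm_sim streams[NR_CPUS_SIM];

    static struct zcomp_strm_sim *zcomp_strm_get(int cpu)
    {
        struct zcomp_strm_sim *s = &streams[cpu];
        s->in_use = 1;          /* kernel: preemption disabled here */
        s->owner_cpu = cpu;
        return s;
    }

    static void zcomp_strm_put(struct zcomp_strm_sim *s)
    {
        s->in_use = 0;          /* kernel: preemption re-enabled */
    }
    ```
    
    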

    Test environment: x86_64, 4 CPU box, 3G zram, lzo

    The following fio tests were executed:
    read, randread, write, randwrite, rw, randrw
    with the increasing number of jobs from 1 to 10.

    4 streams 8 streams per-cpu
    ===========================================================
    jobs1
    READ: 2520.1MB/s 2566.5MB/s 2491.5MB/s
    READ: 2102.7MB/s 2104.2MB/s 2091.3MB/s
    WRITE: 1355.1MB/s 1320.2MB/s 1378.9MB/s
    WRITE: 1103.5MB/s 1097.2MB/s 1122.5MB/s
    READ: 434013KB/s 435153KB/s 439961KB/s
    WRITE: 433969KB/s 435109KB/s 439917KB/s
    READ: 403166KB/s 405139KB/s 403373KB/s
    WRITE: 403223KB/s 405197KB/s 403430KB/s
    jobs2
    READ: 7958.6MB/s 8105.6MB/s 8073.7MB/s
    READ: 6864.9MB/s 6989.8MB/s 7021.8MB/s
    WRITE: 2438.1MB/s 2346.9MB/s 3400.2MB/s
    WRITE: 1994.2MB/s 1990.3MB/s 2941.2MB/s
    READ: 981504KB/s 973906KB/s 1018.8MB/s
    WRITE: 981659KB/s 974060KB/s 1018.1MB/s
    READ: 937021KB/s 938976KB/s 987250KB/s
    WRITE: 934878KB/s 936830KB/s 984993KB/s
    jobs3
    READ: 13280MB/s 13553MB/s 13553MB/s
    READ: 11534MB/s 11785MB/s 11755MB/s
    WRITE: 3456.9MB/s 3469.9MB/s 4810.3MB/s
    WRITE: 3029.6MB/s 3031.6MB/s 4264.8MB/s
    READ: 1363.8MB/s 1362.6MB/s 1448.9MB/s
    WRITE: 1361.9MB/s 1360.7MB/s 1446.9MB/s
    READ: 1309.4MB/s 1310.6MB/s 1397.5MB/s
    WRITE: 1307.4MB/s 1308.5MB/s 1395.3MB/s
    jobs4
    READ: 20244MB/s 20177MB/s 20344MB/s
    READ: 17886MB/s 17913MB/s 17835MB/s
    WRITE: 4071.6MB/s 4046.1MB/s 6370.2MB/s
    WRITE: 3608.9MB/s 3576.3MB/s 5785.4MB/s
    READ: 1824.3MB/s 1821.6MB/s 1997.5MB/s
    WRITE: 1819.8MB/s 1817.4MB/s 1992.5MB/s
    READ: 1765.7MB/s 1768.3MB/s 1937.3MB/s
    WRITE: 1767.5MB/s 1769.1MB/s 1939.2MB/s
    jobs5
    READ: 18663MB/s 18986MB/s 18823MB/s
    READ: 16659MB/s 16605MB/s 16954MB/s
    WRITE: 3912.4MB/s 3888.7MB/s 6126.9MB/s
    WRITE: 3506.4MB/s 3442.5MB/s 5519.3MB/s
    READ: 1798.2MB/s 1746.5MB/s 1935.8MB/s
    WRITE: 1792.7MB/s 1740.7MB/s 1929.1MB/s
    READ: 1727.6MB/s 1658.2MB/s 1917.3MB/s
    WRITE: 1726.5MB/s 1657.2MB/s 1916.6MB/s
    jobs6
    READ: 21017MB/s 20922MB/s 21162MB/s
    READ: 19022MB/s 19140MB/s 18770MB/s
    WRITE: 3968.2MB/s 4037.7MB/s 6620.8MB/s
    WRITE: 3643.5MB/s 3590.2MB/s 6027.5MB/s
    READ: 1871.8MB/s 1880.5MB/s 2049.9MB/s
    WRITE: 1867.8MB/s 1877.2MB/s 2046.2MB/s
    READ: 1755.8MB/s 1710.3MB/s 1964.7MB/s
    WRITE: 1750.5MB/s 1705.9MB/s 1958.8MB/s
    jobs7
    READ: 21103MB/s 20677MB/s 21482MB/s
    READ: 18522MB/s 18379MB/s 19443MB/s
    WRITE: 4022.5MB/s 4067.4MB/s 6755.9MB/s
    WRITE: 3691.7MB/s 3695.5MB/s 5925.6MB/s
    READ: 1841.5MB/s 1933.9MB/s 2090.5MB/s
    WRITE: 1842.7MB/s 1935.3MB/s 2091.9MB/s
    READ: 1832.4MB/s 1856.4MB/s 1971.5MB/s
    WRITE: 1822.3MB/s 1846.2MB/s 1960.6MB/s
    jobs8
    READ: 20463MB/s 20194MB/s 20862MB/s
    READ: 18178MB/s 17978MB/s 18299MB/s
    WRITE: 4085.9MB/s 4060.2MB/s 7023.8MB/s
    WRITE: 3776.3MB/s 3737.9MB/s 6278.2MB/s
    READ: 1957.6MB/s 1944.4MB/s 2109.5MB/s
    WRITE: 1959.2MB/s 1946.2MB/s 2111.4MB/s
    READ: 1900.6MB/s 1885.7MB/s 2082.1MB/s
    WRITE: 1896.2MB/s 1881.4MB/s 2078.3MB/s
    jobs9
    READ: 19692MB/s 19734MB/s 19334MB/s
    READ: 17678MB/s 18249MB/s 17666MB/s
    WRITE: 4004.7MB/s 4064.8MB/s 6990.7MB/s
    WRITE: 3724.7MB/s 3772.1MB/s 6193.6MB/s
    READ: 1953.7MB/s 1967.3MB/s 2105.6MB/s
    WRITE: 1953.4MB/s 1966.7MB/s 2104.1MB/s
    READ: 1860.4MB/s 1897.4MB/s 2068.5MB/s
    WRITE: 1858.9MB/s 1895.9MB/s 2066.8MB/s
    jobs10
    READ: 19730MB/s 19579MB/s 19492MB/s
    READ: 18028MB/s 18018MB/s 18221MB/s
    WRITE: 4027.3MB/s 4090.6MB/s 7020.1MB/s
    WRITE: 3810.5MB/s 3846.8MB/s 6426.8MB/s
    READ: 1956.1MB/s 1994.6MB/s 2145.2MB/s
    WRITE: 1955.9MB/s 1993.5MB/s 2144.8MB/s
    READ: 1852.8MB/s 1911.6MB/s 2075.8MB/s
    WRITE: 1855.7MB/s 1914.6MB/s 2078.1MB/s

    perf stat

    4 streams 8 streams per-cpu
    ====================================================================================================================
    jobs1
    stalled-cycles-frontend 23,174,811,209 ( 38.21%) 23,220,254,188 ( 38.25%) 23,061,406,918 ( 38.34%)
    stalled-cycles-backend 11,514,174,638 ( 18.98%) 11,696,722,657 ( 19.27%) 11,370,852,810 ( 18.90%)
    instructions 73,925,005,782 ( 1.22) 73,903,177,632 ( 1.22) 73,507,201,037 ( 1.22)
    branches 14,455,124,835 ( 756.063) 14,455,184,779 ( 755.281) 14,378,599,509 ( 758.546)
    branch-misses 69,801,336 ( 0.48%) 80,225,529 ( 0.55%) 72,044,726 ( 0.50%)
    jobs2
    stalled-cycles-frontend 49,912,741,782 ( 46.11%) 50,101,189,290 ( 45.95%) 32,874,195,633 ( 35.11%)
    stalled-cycles-backend 27,080,366,230 ( 25.02%) 27,949,970,232 ( 25.63%) 16,461,222,706 ( 17.58%)
    instructions 122,831,629,690 ( 1.13) 122,919,846,419 ( 1.13) 121,924,786,775 ( 1.30)
    branches 23,725,889,239 ( 692.663) 23,733,547,140 ( 688.062) 23,553,950,311 ( 794.794)
    branch-misses 90,733,041 ( 0.38%) 96,320,895 ( 0.41%) 84,561,092 ( 0.36%)
    jobs3
    stalled-cycles-frontend 66,437,834,608 ( 45.58%) 63,534,923,344 ( 43.69%) 42,101,478,505 ( 33.19%)
    stalled-cycles-backend 34,940,799,661 ( 23.97%) 34,774,043,148 ( 23.91%) 21,163,324,388 ( 16.68%)
    instructions 171,692,121,862 ( 1.18) 171,775,373,044 ( 1.18) 170,353,542,261 ( 1.34)
    branches 32,968,962,622 ( 628.723) 32,987,739,894 ( 630.512) 32,729,463,918 ( 717.027)
    branch-misses 111,522,732 ( 0.34%) 110,472,894 ( 0.33%) 99,791,291 ( 0.30%)
    jobs4
    stalled-cycles-frontend 98,741,701,675 ( 49.72%) 94,797,349,965 ( 47.59%) 54,535,655,381 ( 33.53%)
    stalled-cycles-backend 54,642,609,615 ( 27.51%) 55,233,554,408 ( 27.73%) 27,882,323,541 ( 17.14%)
    instructions 220,884,807,851 ( 1.11) 220,930,887,273 ( 1.11) 218,926,845,851 ( 1.35)
    branches 42,354,518,180 ( 592.105) 42,362,770,587 ( 590.452) 41,955,552,870 ( 716.154)
    branch-misses 138,093,449 ( 0.33%) 131,295,286 ( 0.31%) 121,794,771 ( 0.29%)
    jobs5
    stalled-cycles-frontend 116,219,747,212 ( 48.14%) 110,310,397,012 ( 46.29%) 66,373,082,723 ( 33.70%)
    stalled-cycles-backend 66,325,434,776 ( 27.48%) 64,157,087,914 ( 26.92%) 32,999,097,299 ( 16.76%)
    instructions 270,615,008,466 ( 1.12) 270,546,409,525 ( 1.14) 268,439,910,948 ( 1.36)
    branches 51,834,046,557 ( 599.108) 51,811,867,722 ( 608.883) 51,412,576,077 ( 729.213)
    branch-misses 158,197,086 ( 0.31%) 142,639,805 ( 0.28%) 133,425,455 ( 0.26%)
    jobs6
    stalled-cycles-frontend 138,009,414,492 ( 48.23%) 139,063,571,254 ( 48.80%) 75,278,568,278 ( 32.80%)
    stalled-cycles-backend 79,211,949,650 ( 27.68%) 79,077,241,028 ( 27.75%) 37,735,797,899 ( 16.44%)
    instructions 319,763,993,731 ( 1.12) 319,937,782,834 ( 1.12) 316,663,600,784 ( 1.38)
    branches 61,219,433,294 ( 595.056) 61,250,355,540 ( 598.215) 60,523,446,617 ( 733.706)
    branch-misses 169,257,123 ( 0.28%) 154,898,028 ( 0.25%) 141,180,587 ( 0.23%)
    jobs7
    stalled-cycles-frontend 162,974,812,119 ( 49.20%) 159,290,061,987 ( 48.43%) 88,046,641,169 ( 33.21%)
    stalled-cycles-backend 92,223,151,661 ( 27.84%) 91,667,904,406 ( 27.87%) 44,068,454,971 ( 16.62%)
    instructions 369,516,432,430 ( 1.12) 369,361,799,063 ( 1.12) 365,290,380,661 ( 1.38)
    branches 70,795,673,950 ( 594.220) 70,743,136,124 ( 597.876) 69,803,996,038 ( 732.822)
    branch-misses 181,708,327 ( 0.26%) 165,767,821 ( 0.23%) 150,109,797 ( 0.22%)
    jobs8
    stalled-cycles-frontend 185,000,017,027 ( 49.30%) 182,334,345,473 ( 48.37%) 99,980,147,041 ( 33.26%)
    stalled-cycles-backend 105,753,516,186 ( 28.18%) 107,937,830,322 ( 28.63%) 51,404,177,181 ( 17.10%)
    instructions 418,153,161,055 ( 1.11) 418,308,565,828 ( 1.11) 413,653,475,581 ( 1.38)
    branches 80,035,882,398 ( 592.296) 80,063,204,510 ( 589.843) 79,024,105,589 ( 730.530)
    branch-misses 199,764,528 ( 0.25%) 177,936,926 ( 0.22%) 160,525,449 ( 0.20%)
    jobs9
    stalled-cycles-frontend 210,941,799,094 ( 49.63%) 204,714,679,254 ( 48.55%) 114,251,113,756 ( 33.96%)
    stalled-cycles-backend 122,640,849,067 ( 28.85%) 122,188,553,256 ( 28.98%) 58,360,041,127 ( 17.35%)
    instructions 468,151,025,415 ( 1.10) 467,354,869,323 ( 1.11) 462,665,165,216 ( 1.38)
    branches 89,657,067,510 ( 585.628) 89,411,550,407 ( 588.990) 88,360,523,943 ( 730.151)
    branch-misses 218,292,301 ( 0.24%) 191,701,247 ( 0.21%) 178,535,678 ( 0.20%)
    jobs10
    stalled-cycles-frontend 233,595,958,008 ( 49.81%) 227,540,615,689 ( 49.11%) 160,341,979,938 ( 43.07%)
    stalled-cycles-backend 136,153,676,021 ( 29.03%) 133,635,240,742 ( 28.84%) 65,909,135,465 ( 17.70%)
    instructions 517,001,168,497 ( 1.10) 516,210,976,158 ( 1.11) 511,374,038,613 ( 1.37)
    branches 98,911,641,329 ( 585.796) 98,700,069,712 ( 591.583) 97,646,761,028 ( 728.712)
    branch-misses 232,341,823 ( 0.23%) 199,256,308 ( 0.20%) 183,135,268 ( 0.19%)

    per-cpu streams tend to cause significantly fewer stalled cycles,
    execute fewer branches, and hit fewer branch-misses.

    perf stat reported execution time

    4 streams 8 streams per-cpu
    ====================================================================
    jobs1
    seconds elapsed 20.909073870 20.875670495 20.817838540
    jobs2
    seconds elapsed 18.529488399 18.720566469 16.356103108
    jobs3
    seconds elapsed 18.991159531 18.991340812 16.766216066
    jobs4
    seconds elapsed 19.560643828 19.551323547 16.246621715
    jobs5
    seconds elapsed 24.746498464 25.221646740 20.696112444
    jobs6
    seconds elapsed 28.258181828 28.289765505 22.885688857
    jobs7
    seconds elapsed 32.632490241 31.909125381 26.272753738
    jobs8
    seconds elapsed 35.651403851 36.027596308 29.108024711
    jobs9
    seconds elapsed 40.569362365 40.024227989 32.898204012
    jobs10
    seconds elapsed 44.673112304 43.874898137 35.632952191

    Please see
    Link: http://marc.info/?l=linux-kernel&m=146166970727530
    Link: http://marc.info/?l=linux-kernel&m=146174716719650
    for more test results (under low memory conditions).

    Signed-off-by: Sergey Senozhatsky
    Suggested-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
    Pass GFP flags to zs_malloc() instead of using a fixed mask supplied
    to zs_create_pool(), so we can be more flexible; but, more
    importantly, we need this to switch zram to per-cpu compression
    streams -- zram will try to allocate a handle with preemption disabled
    in a fast path and switch to a slow path (using a different gfp mask)
    if the fast one has failed.

    Apart from that, this also aligns the zs_malloc() interface with
    zspool/zbud.
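    The fast-path/slow-path dance described above can be sketched in
    userspace C (the allocator and flag names are stand-ins for zs_malloc()
    and GFP masks such as GFP_NOWAIT/GFP_NOIO, not the actual API):

    ```c
    #include <assert.h>
    #include <stdlib.h>

    #define GFP_FAST 0x1   /* stand-in for a non-sleeping mask */
    #define GFP_SLOW 0x2   /* stand-in for a sleeping mask */

    /* Fake allocator: the non-sleeping attempt can fail under memory
     * pressure (it may not reclaim), while the sleeping one succeeds. */
    static void *fake_zs_malloc(size_t size, int gfp, int under_pressure)
    {
        if ((gfp & GFP_FAST) && under_pressure)
            return NULL;
        return malloc(size);
    }

    /* zram's pattern: try with preemption disabled first, then fall back. */
    static void *alloc_handle(size_t size, int under_pressure)
    {
        /* fast path: preemption disabled, must not sleep */
        void *h = fake_zs_malloc(size, GFP_FAST, under_pressure);
        if (h)
            return h;
        /* slow path: preemption re-enabled, allowed to sleep/reclaim */
        return fake_zs_malloc(size, GFP_SLOW, under_pressure);
    }
    ```
    
    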

    [sergey.senozhatsky@gmail.com: pass GFP flags to zs_malloc() instead of using a fixed mask]
    Link: http://lkml.kernel.org/r/20160429150942.GA637@swordfish
    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     

20 May, 2016

1 commit

    Many developers already know that the reference-count field of struct
    page is _count, an atomic type. They may try to handle it directly,
    and this could break the purpose of the page reference count
    tracepoint. To prevent direct _count modification, this patch renames
    it to _refcount and adds a warning message to the code. After that,
    developers who need to handle the reference count will find that the
    field should not be accessed directly.
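    The effect of the rename is that open-coded field access stops
    compiling, steering callers to accessor helpers. A simplified,
    non-atomic userspace stand-in for the kernel's page_ref_* helpers:

    ```c
    #include <assert.h>

    struct page_sim {
        int _refcount;   /* do not touch directly; use the helpers below */
    };

    /* Simplified analogues of the kernel's page_ref_count() /
     * page_ref_inc() / page_ref_dec_and_test(); the real versions are
     * atomic and hooked into the page-ref tracepoints. */
    static int page_ref_count(const struct page_sim *page)
    {
        return page->_refcount;
    }

    static void page_ref_inc(struct page_sim *page)
    {
        page->_refcount++;
    }

    static int page_ref_dec_and_test(struct page_sim *page)
    {
        return --page->_refcount == 0;
    }
    ```
    
    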

    [akpm@linux-foundation.org: fix comments, per Vlastimil]
    [akpm@linux-foundation.org: Documentation/vm/transhuge.txt too]
    [sfr@canb.auug.org.au: sync ethernet driver changes]
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Stephen Rothwell
    Cc: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Johannes Berg
    Cc: "David S. Miller"
    Cc: Sunil Goutham
    Cc: Chris Metcalf
    Cc: Manish Chopra
    Cc: Yuval Mintz
    Cc: Tariq Toukan
    Cc: Saeed Mahameed
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

19 May, 2016

1 commit

    1/ If a mapping overlaps a bad sector, fail the request.

    2/ Do not opportunistically report more dax-capable capacity than is
    requested when errors are present.
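    The overlap test behind point 1/ amounts to a range-intersection check
    against the known bad-block list. A minimal userspace sketch (loosely
    modeled on what a badblocks_check() caller does; names and the error
    constant here are stand-ins):

    ```c
    #include <assert.h>

    #define EIO_SIM 5   /* stand-in for the kernel's EIO */

    struct bad_range { unsigned long long start; unsigned int len; };

    /* Return -EIO_SIM if [sector, sector+len) overlaps any known bad
     * range, 0 otherwise - i.e. fail the request rather than hand out
     * a DAX mapping that covers a bad sector. */
    static int check_request(const struct bad_range *bad, int nr_bad,
                             unsigned long long sector, unsigned int len)
    {
        int i;

        for (i = 0; i < nr_bad; i++) {
            unsigned long long b_start = bad[i].start;
            unsigned long long b_end = b_start + bad[i].len;

            if (sector < b_end && sector + len > b_start)
                return -EIO_SIM;   /* mapping would cover a bad sector */
        }
        return 0;
    }
    ```
    
    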

    Reviewed-by: Jeff Moyer
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dan Williams
    [vishal: fix a conflict with system RAM collision patches]
    [vishal: add a 'size' parameter to ->direct_access]
    [vishal: fix a conflict with DAX alignment check patches]
    Signed-off-by: Vishal Verma

    Dan Williams
     

18 May, 2016

3 commits

  • Pull networking updates from David Miller:
    "Highlights:

    1) Support SPI based w5100 devices, from Akinobu Mita.

    2) Partial Segmentation Offload, from Alexander Duyck.

    3) Add GMAC4 support to stmmac driver, from Alexandre TORGUE.

    4) Allow cls_flower stats offload, from Amir Vadai.

    5) Implement bpf blinding, from Daniel Borkmann.

    6) Optimize _ASYNC_ bit twiddling on sockets, unless the socket is
    actually using FASYNC these atomics are superfluous. From Eric
    Dumazet.

    7) Run TCP more preemptibly, also from Eric Dumazet.

    8) Support LED blinking, EEPROM dumps, and rxvlan offloading in mlx5e
    driver, from Gal Pressman.

    9) Allow creating ppp devices via rtnetlink, from Guillaume Nault.

    10) Improve BPF usage documentation, from Jesper Dangaard Brouer.

    11) Support tunneling offloads in qed, from Manish Chopra.

    12) aRFS offloading in mlx5e, from Maor Gottlieb.

    13) Add RFS and RPS support to SCTP protocol, from Marcelo Ricardo
    Leitner.

    14) Add MSG_EOR support to TCP, this allows controlling packet
    coalescing on application record boundaries for more accurate
    socket timestamp sampling. From Martin KaFai Lau.

    15) Fix alignment of 64-bit netlink attributes across the board, from
    Nicolas Dichtel.

    16) Per-vlan stats in bridging, from Nikolay Aleksandrov.

    17) Several conversions of drivers to ethtool ksettings, from Philippe
    Reynes.

    18) Checksum neutral ILA in ipv6, from Tom Herbert.

    19) Factorize all of the various marvell dsa drivers into one, from
    Vivien Didelot

    20) Add VF support to qed driver, from Yuval Mintz"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1649 commits)
    Revert "phy dp83867: Fix compilation with CONFIG_OF_MDIO=m"
    Revert "phy dp83867: Make rgmii parameters optional"
    r8169: default to 64-bit DMA on recent PCIe chips
    phy dp83867: Make rgmii parameters optional
    phy dp83867: Fix compilation with CONFIG_OF_MDIO=m
    bpf: arm64: remove callee-save registers use for tmp registers
    asix: Fix offset calculation in asix_rx_fixup() causing slow transmissions
    switchdev: pass pointer to fib_info instead of copy
    net_sched: close another race condition in tcf_mirred_release()
    tipc: fix nametable publication field in nl compat
    drivers: net: Don't print unpopulated net_device name
    qed: add support for dcbx.
    ravb: Add missing free_irq() calls to ravb_close()
    qed: Remove a stray tab
    net: ethernet: fec-mpc52xx: use phy_ethtool_{get|set}_link_ksettings
    net: ethernet: fec-mpc52xx: use phydev from struct net_device
    bpf, doc: fix typo on bpf_asm descriptions
    stmmac: hardware TX COE doesn't work when force_thresh_dma_mode is set
    net: ethernet: fs-enet: use phy_ethtool_{get|set}_link_ksettings
    net: ethernet: fs-enet: use phydev from struct net_device
    ...

    Linus Torvalds
     
  • Pull block driver updates from Jens Axboe:
    "On top of the core pull request, this is the drivers pull request for
    this merge window. This contains:

    - Switch drivers to the new write back cache API, and kill off the
    flush flags. From me.

    - Kill the discard support for the STEC pci-e flash driver. It's
    trivially broken, and apparently unmaintained, so it's safer to
    just remove it. From Jeff Moyer.

    - A set of lightnvm updates from the usual suspects (Matias/Javier,
    and Simon), and fixes from Arnd, Jeff Mahoney, Sagi, and Wenwei
    Tao.

    - A set of updates for NVMe:

    - Turn the controller state management into a proper state
    machine. From Christoph.

    - Shuffling of code in preparation for NVMe-over-fabrics, also
    from Christoph.

    - Cleanup of the command prep part from Ming Lin.

    - Rewrite of the discard support from Ming Lin.

    - Deadlock fix for namespace removal from Ming Lin.

    - Use the now exported blk-mq tag helper for IO termination.
    From Sagi.

    - Various little fixes from Christoph, Guilherme, Keith, Ming
    Lin, Wang Sheng-Hui.

    - Convert mtip32xx to use the now exported blk-mq tag iter function,
    from Keith"

    * 'for-4.7/drivers' of git://git.kernel.dk/linux-block: (74 commits)
    lightnvm: reserved space calculation incorrect
    lightnvm: rename nr_pages to nr_ppas on nvm_rq
    lightnvm: add is_cached entry to struct ppa_addr
    lightnvm: expose gennvm_mark_blk to targets
    lightnvm: remove mgt targets on mgt removal
    lightnvm: pass dma address to hardware rather than pointer
    lightnvm: do not assume sequential lun alloc.
    nvme/lightnvm: Log using the ctrl named device
    lightnvm: rename dma helper functions
    lightnvm: enable metadata to be sent to device
    lightnvm: do not free unused metadata on rrpc
    lightnvm: fix out of bound ppa lun id on bb tbl
    lightnvm: refactor set_bb_tbl for accepting ppa list
    lightnvm: move responsibility for bad blk mgmt to target
    lightnvm: make nvm_set_rqd_ppalist() aware of vblks
    lightnvm: remove struct factory_blks
    lightnvm: refactor device ops->get_bb_tbl()
    lightnvm: introduce nvm_for_each_lun_ppa() macro
    lightnvm: refactor dev->online_target to global nvm_targets
    lightnvm: rename nvm_targets to nvm_tgt_type
    ...

    Linus Torvalds
     
  • Pull core block layer updates from Jens Axboe:
    "This is the core block IO changes for this merge window. Nothing
    earth shattering in here, it's mostly just fixes. In detail:

    - Fix for a long standing issue where wrong ordering in blk-mq caused
    order_to_size() to spew a warning. From Bart.

    - Async discard support from Christoph. Basically just splitting our
    sync interface into a submit + wait part.

    - Add a cleaner interface for flagging whether a device has a write
    back cache or not. We've previously overloaded blk_queue_flush()
    with this, but let's make it more explicit. Drivers cleaned up and
    updated in the drivers pull request. From me.

    - Fix for a double check for whether IO accounting is enabled or not.
    From Michael Callahan.

    - Fix for the async discard from Mike Snitzer, reinstating the early
    EOPNOTSUPP return if the device doesn't support discards.

    - Also from Mike, export bio_inc_remaining() so dm can drop it's
    private copy of it.

    - From Ming Lin, add support for passing in an offset for request
    payloads.

    - Tag function export from Sagi, which will be used in NVMe in the
    drivers pull.

    - Two blktrace related fixes from Shaohua.

    - Propagate NOMERGE flag when making a request from a bio, also from
    Shaohua.

    - An optimization to not parse cgroup paths in blk-throttle, if we
    don't need to. From Shaohua"

    * 'for-4.7/core' of git://git.kernel.dk/linux-block:
    blk-mq: fix undefined behaviour in order_to_size()
    blk-throttle: don't parse cgroup path if trace isn't enabled
    blktrace: add missed mask name
    blktrace: delete garbage for message trace
    block: make bio_inc_remaining() interface accessible again
    block: reinstate early return of -EOPNOTSUPP from blkdev_issue_discard
    block: Minor blk_account_io_start usage cleanup
    block: add __blkdev_issue_discard
    block: remove struct bio_batch
    block: copy NOMERGE flag from bio to request
    block: add ability to flag write back caching on a device
    blk-mq: Export tagset iter function
    block: add offset in blk_add_request_payload()
    writeback: Fix performance regression in wb_over_bg_thresh()

    Linus Torvalds
     

11 May, 2016

1 commit


28 Apr, 2016

2 commits

  • ... instead of just returning an error.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Josh Durgin

    Ilya Dryomov
     
  • A while ago, commit 9875201e1049 ("rbd: fix use-after free of
    rbd_dev->disk") fixed rbd unmap vs notify race by introducing
    an exported wrapper for flushing notifies and sticking it into
    do_rbd_remove().

    A similar problem exists on the rbd map path, though: the watch is
    registered in rbd_dev_image_probe(), while the disk is set up quite
    a few steps later, in rbd_dev_device_setup(). Nothing prevents
    a notify from coming in and crashing on a NULL rbd_dev->disk:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
    Call Trace:
    [] rbd_watch_cb+0x34/0x180 [rbd]
    [] do_event_work+0x40/0xb0 [libceph]
    [] process_one_work+0x17b/0x470
    [] worker_thread+0x11b/0x400
    [] ? rescuer_thread+0x400/0x400
    [] kthread+0xcf/0xe0
    [] ? finish_task_switch+0x53/0x170
    [] ? kthread_create_on_node+0x140/0x140
    [] ret_from_fork+0x58/0x90
    [] ? kthread_create_on_node+0x140/0x140
    RIP [] rbd_dev_refresh+0xfa/0x180 [rbd]

    If an error occurs during rbd map, we have to error out, potentially
    tearing down a watch. Just like on rbd unmap, notifies have to be
    flushed, otherwise rbd_watch_cb() may end up trying to read in the
    image header after rbd_dev_image_release() has run:

    Assertion failure in rbd_dev_header_info() at line 4722:

    rbd_assert(rbd_image_format_valid(rbd_dev->image_format));

    Call Trace:
    [] ? rbd_parent_request_create+0x150/0x150
    [] rbd_dev_refresh+0x59/0x390
    [] rbd_watch_cb+0x69/0x290
    [] do_event_work+0x10f/0x1c0
    [] process_one_work+0x689/0x1a80
    [] ? process_one_work+0x5e7/0x1a80
    [] ? finish_task_switch+0x225/0x640
    [] ? pwq_dec_nr_in_flight+0x2b0/0x2b0
    [] worker_thread+0xd9/0x1320
    [] ? process_one_work+0x1a80/0x1a80
    [] kthread+0x21d/0x2e0
    [] ? kthread_stop+0x550/0x550
    [] ret_from_fork+0x22/0x40
    [] ? kthread_stop+0x550/0x550
    RIP [] rbd_dev_header_info+0xa19/0x1e30

    To fix this, a) check if RBD_DEV_FLAG_EXISTS is set before calling
    revalidate_disk(), b) move the ceph_osdc_flush_notifies() call into
    rbd_dev_header_unwatch_sync() to cover rbd map error paths and c) turn
    header read-in into a critical section. The latter also happens to
    take care of the rbd map foo@bar vs rbd snap rm foo@bar race.

    Fixes: http://tracker.ceph.com/issues/15490

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Josh Durgin

    Ilya Dryomov
     

26 Apr, 2016

1 commit

    Simply creating a file system on an skd device, followed by mount and
    fstrim, will result in errors in the logs and then a BUG(). Let's remove
    discard support from that driver; as far as I can tell, it hasn't
    worked right since it was merged. This patch also has the side effect of
    cleaning up an unintentionally shadowed declaration inside
    skd_end_request.

    I tested to ensure that I can still do I/O to the device using xfstests
    ./check -g quick. I didn't do anything more extensive than that,
    though.

    Signed-off-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jeff Moyer
     

16 Apr, 2016

1 commit

  • Pull block fixes from Jens Axboe:
    "A few fixes for the current series. This contains:

    - Two fixes for NVMe:

    One fixes a reset race that can be triggered by repeated
    insert/removal of the module.

    The other fixes an issue on some platforms, where we get probe
    timeouts since legacy interrupts aren't working. This used not to
    be a problem since we had the worker thread poll for completions,
    but since that was killed off, it means those poor souls can't
    successfully probe their NVMe device. Use a proper IRQ check and
    probe (MSI-X -> MSI -> legacy), like most other drivers, to work
    around this. Both from Keith.

    - A loop corruption issue with offset in iters, from Ming Lei.

    - A fix for not having the partition stat per-cpu ref count
    initialized before sending out the KOBJ_ADD, which could cause user
    space to access the counter prior to initialization. Also from
    Ming Lei.

    - A fix for using the wrong congestion state, from Kaixu Xia"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    block: loop: fix filesystem corruption in case of aio/dio
    NVMe: Always use MSI/MSI-x interrupts
    NVMe: Fix reset/remove race
    writeback: fix the wrong congested state variable definition
    block: partition: initialize percpuref before sending out KOBJ_ADD

    Linus Torvalds
     

15 Apr, 2016

1 commit

  • Starting from commit e36f620428 ("block: split bios to max possible
    length"), block core can split a bio in the middle of a bvec.

    Unfortunately loop dio/aio doesn't consider this situation, and
    always treats 'iter.iov_offset' as zero. Filesystem corruption is
    then observed.

    This patch figures out the offset of the base bvec via
    'bio->bi_iter.bi_bvec_done' and fixes the issue by passing the offset
    to the iov iterator.

    Fixes: e36f6204288088f ("block: split bios to max possible length")
    Cc: Keith Busch
    Cc: Al Viro
    Cc: stable@vger.kernel.org (4.5)
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

14 Apr, 2016

1 commit


13 Apr, 2016

11 commits


08 Apr, 2016

1 commit