13 Feb, 2015

40 commits

  • The file uses nothing from init.h, and also doesn't need the full module.h
    machinery; export.h is sufficient. The latter requires the user to ensure
    compiler.h is included, so do that explicitly instead of relying on some
    other header pulling it in.

    Signed-off-by: Rasmus Villemoes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • This patch makes hexdump return the number of bytes placed in the buffer,
    excluding the trailing NUL. In the case of overflow it returns the number
    of bytes needed to produce the entire dump. Thus, it mimics snprintf().

    This will be useful for users that would like to repeat with a bigger
    buffer.
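
    The contract matches snprintf(), so callers can size the buffer by asking
    first and retrying. A minimal userspace sketch of that pattern follows;
    hex_to_buf() and dump_all() are made-up stand-ins for hex_dump_to_buffer()
    (groupsize 1, no ASCII part), not the kernel's actual code:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for hex_dump_to_buffer(): writes "xx xx ..." into buf,
 * returns the length the complete dump needs (excluding the trailing
 * NUL), writing only the chunks that fit -- the snprintf() contract. */
static int hex_to_buf(const unsigned char *src, size_t len,
                      char *buf, size_t buflen)
{
    size_t pos = 0;                 /* length of the complete dump */

    if (buflen)
        buf[0] = '\0';
    for (size_t i = 0; i < len; i++) {
        char tmp[4];
        int n = snprintf(tmp, sizeof(tmp), i ? " %02x" : "%02x", src[i]);

        if (pos + (size_t)n < buflen)       /* chunk + NUL fit? */
            memcpy(buf + pos, tmp, (size_t)n + 1);
        pos += (size_t)n;
    }
    return (int)pos;
}

/* Retry pattern enabled by the new return value: ask for the size
 * with a zero-length buffer, then allocate exactly what is needed. */
static char *dump_all(const unsigned char *src, size_t len)
{
    int need = hex_to_buf(src, len, NULL, 0);
    char *buf = malloc((size_t)need + 1);

    if (buf)
        hex_to_buf(src, len, buf, (size_t)need + 1);
    return buf;
}
```

    The caller frees the returned string; a truncated first attempt can be
    detected by comparing the return value against the buffer size, exactly
    as with snprintf().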

    [akpm@linux-foundation.org: fix printk warning]
    Signed-off-by: Andy Shevchenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     
  • Instead of doing the calculations separately in each groupsize case,
    let's do them beforehand. While there, change the switch to an
    if-else-if construction.

    Signed-off-by: Andy Shevchenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     
  • In the current implementation we have a floating ascii column in the tail
    of the dump.

    For example, for a row size of 16, the ascii column starts at the
    offsets shown in the following table:

    group size \ length     8    12    16
    1                      50    50    50
    2                      22    32    42
    4                      20    29    38
    8                      19     -    36

    This patch puts it at the same offset regardless of the number of bytes
    dumped.

    The change is safe since all current users, which use ASCII part of the
    dump, rely on the group size equal to 1. The patch doesn't change
    behaviour for such group size (see the table above).

    Signed-off-by: Andy Shevchenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     
  • Test different scenarios of the function calls located in lib/hexdump.c.

    Currently only hex_dump_to_buffer() is tested, and test data is provided
    only for little-endian CPUs.

    Signed-off-by: Andy Shevchenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     
  • Since chunk->end_addr is (chunk->start_addr + size - 1), the end address
    to compare should be (start + size - 1).
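
    Since end_addr is inclusive, the containment check looks like this
    userspace sketch (the field names mirror the kernel's gen_pool chunk,
    but the struct and helper here are illustrative, not the actual code):

```c
#include <stdbool.h>
#include <stddef.h>

/* chunk->end_addr is inclusive: start_addr + size - 1.  So when
 * testing whether [start, start + size) lies inside a chunk, the
 * last address to compare is start + size - 1, not start + size. */
struct chunk {
    unsigned long start_addr;
    unsigned long end_addr;     /* inclusive */
};

static bool addr_in_chunk(const struct chunk *c,
                          unsigned long start, size_t size)
{
    unsigned long end = start + size - 1;   /* the fix */

    return start >= c->start_addr && end <= c->end_addr;
}
```

    With the old off-by-one (comparing start + size), a region that exactly
    fills a chunk would wrongly be rejected.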

    Signed-off-by: Toshi Kikuchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kikuchi
     
  • Now that all in-tree users of strnicmp have been converted to
    strncasecmp, the wrapper can be removed.

    Signed-off-by: Rasmus Villemoes
    Cc: David Howells
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • Also, rename bits to nbits. Both changes for consistency with other
    bitmap_* functions.

    Signed-off-by: Rasmus Villemoes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • Make the return value and the ord and nbits parameters of
    bitmap_ord_to_pos unsigned.

    Also, simplify the implementation and as a side effect make the result
    fully defined, returning nbits for ord >= weight, in analogy with what
    find_{first,next}_bit does. This is a better sentinel than the former
    ("unofficial") 0. No current users are affected by this change.
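
    A single-word userspace model of the new semantics (the kernel's
    bitmap_ord_to_pos works on arrays of unsigned long; this sketch only
    shows the contract, including the nbits sentinel):

```c
/* Return the position of the ord-th set bit in word, or nbits when
 * ord >= weight -- the same sentinel find_first_bit()/find_next_bit()
 * use for "nothing found". */
static unsigned int ord_to_pos(unsigned long word,
                               unsigned int ord, unsigned int nbits)
{
    for (unsigned int pos = 0; pos < nbits; pos++)
        if (((word >> pos) & 1) && ord-- == 0)
            return pos;
    return nbits;           /* sentinel: ord >= bitmap weight */
}
```

    For word 0b10110 (set bits at 1, 2, 4), ord 0 maps to position 1 and
    any ord >= 3 yields nbits.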

    Signed-off-by: Rasmus Villemoes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • The ordinal of a set bit is simply the number of set bits before it;
    counting those doesn't need to be done one bit at a time. While at it,
    update the parameters to unsigned int.

    It is not completely unthinkable that gcc would see pos as compile-time
    constant 0 in one of the uses of bitmap_pos_to_ord. Since the static
    inline frontend bitmap_weight doesn't handle nbits==0 correctly (it would
    behave exactly as if nbits==BITS_PER_LONG), use __bitmap_weight.

    Alternatively, the last line could be spelled bitmap_weight(buf, pos+1)-1,
    but this is simpler.
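
    The counting reduces to one popcount. Single-word userspace sketch (the
    kernel version calls __bitmap_weight() over an array; the mask below
    assumes pos is less than the word width):

```c
/* Ordinal of the set bit at pos == number of set bits strictly
 * before pos == weight of the low pos bits of the word. */
static int pos_to_ord(unsigned long word, unsigned int pos)
{
    if (!((word >> pos) & 1))
        return -1;                          /* bit at pos not set */
    return __builtin_popcountl(word & ((1UL << pos) - 1));
}
```

    For word 0b10110, the bit at position 4 has two set bits before it, so
    its ordinal is 2.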

    Signed-off-by: Rasmus Villemoes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • Change the sz and nbits parameters of bitmap_fold to unsigned int for
    consistency with other bitmap_* functions, and to save another few bytes
    in the generated code.

    [akpm@linux-foundation.org: fix kerneldoc]
    Signed-off-by: Rasmus Villemoes
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • Change the nbits parameter of bitmap_onto to unsigned int for consistency
    with other bitmap_* functions.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Rasmus Villemoes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • Since the various bitmap_* functions now take an unsigned int as nbits
    parameter, it makes sense to also update the various wrappers, even though
    they're marked as obsolete.

    Signed-off-by: Rasmus Villemoes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • Since the various bitmap_* functions now take an unsigned int as nbits
    parameter, it makes sense to also update the various wrappers.

    Signed-off-by: Rasmus Villemoes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • For consistency with the other bitmap_* functions, also make the nbits
    parameter of bitmap_zero, bitmap_fill and bitmap_copy unsigned.

    Signed-off-by: Rasmus Villemoes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • string_get_size() was documented to return an error, but in fact always
    returned 0. Since the output always fits in 9 bytes, just document that
    and let callers do what they do now: pass a small stack buffer and ignore
    the return value.

    Signed-off-by: Rasmus Villemoes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • The remainder from do_div is always a u32, and after size has been reduced
    to be below 1000 (or 1024), it certainly fits in u32. So both remainder
    and sf_cap can be made u32s, the format specifiers can be simplified (%lld
    wasn't the right thing to use for _unsigned_ long long anyway), and we can
    replace a do_div with an ordinary 32/32 bit division.

    Signed-off-by: Rasmus Villemoes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • While commit 3c9f3681d0b4 ("[SCSI] lib: add generic helper to print
    sizes rounded to the correct SI range") says that Z and Y are included
    in preparation for 128 bit computers, they just waste .text currently.
    If and when we get u128, string_get_size needs updating anyway (and ISO
    needs to come up with four more prefixes).

    Also there's no need to include and test for the NULL sentinel; once we
    reach "E" size is at most 18. [The test is also wrong; it should be
    units_str[units][i+1]; if we've reached NULL we're already doomed.]

    Signed-off-by: Rasmus Villemoes
    Cc: James Bottomley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • All callers of skip_atoi have already checked that the first character
    is a digit. In this case, gcc generates simpler code for a do-while
    loop.
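
    The resulting helper has the shape below (userspace copy of the
    lib/vsprintf.c helper for illustration):

```c
#include <ctype.h>

/* Callers guarantee isdigit(**s) on entry, so a do-while needs no
 * initial test and gcc emits slightly simpler code. */
static int skip_atoi(const char **s)
{
    int i = 0;

    do {
        i = i * 10 + *((*s)++) - '0';
    } while (isdigit((unsigned char)**s));

    return i;
}
```

    The pointer is advanced past the digits as a side effect, leaving *s at
    the first non-digit character.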

    Signed-off-by: Rasmus Villemoes
    Cc: Jiri Kosina
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • On 64 bit, size may very well be huge even if bit 31 happens to be 0.
    Somehow it doesn't feel right that one can pass a 5 GiB buffer but not a
    3 GiB one. So cap at INT_MAX as was probably the intention all along.
    This is also the made-up value passed by sprintf and vsprintf.

    Signed-off-by: Rasmus Villemoes
    Cc: Jiri Kosina
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • It seems a little simpler to consume the p from a %p specifier in
    format_decode, just as it is done for the surrounding %c, %s and %% cases.

    While there, delete a redundant and misplaced comment.

    Signed-off-by: Rasmus Villemoes
    Cc: Jiri Kosina
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • Neaten the MODULE_PARM_DESC message.
    Use 30 seconds in the comment for the zap console locks timeout.

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • __FUNCTION__ hasn't been treated as a string literal since gcc 3.4, so
    this only helps people who only test-compile using 3.3 (compiler-gcc3.h
    barks at anything older than that). Besides, there are almost no
    occurrences of __FUNCTION__ left in the tree.

    [akpm@linux-foundation.org: convert remaining __FUNCTION__ references]
    Signed-off-by: Rasmus Villemoes
    Cc: Michal Nazarewicz
    Cc: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • On POWER8 virtualised kernels the VTB register can be read to have a view
    of time that only increases while the guest is running. This will prevent
    guests from seeing time jump if a guest is paused for significant amounts
    of time.

    On POWER7 and below virtualised kernels, stolen time is subtracted from
    local_clock as a best-effort approximation. This will not eliminate
    spurious warnings in the case of a suspended guest but may reduce the
    occurrence in the case of softlockups due to host overcommit.

    Bare metal kernels should avoid reading the VTB as KVM does not restore
    sane values when not executing; the approximation is fine as host kernels
    won't observe any stolen time.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Cyril Bur
    Cc: Michael Ellerman
    Cc: Andrew Jones
    Acked-by: Don Zickus
    Cc: Ingo Molnar
    Cc: Ulrich Obergfell
    Cc: chai wen
    Cc: Fabian Frederick
    Cc: Aaron Tomlin
    Cc: Ben Zhang
    Cc: Martin Schwidefsky
    Cc: John Stultz
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyril Bur
     
  • When the hypervisor pauses a virtualised kernel the kernel will observe a
    jump in timebase, this can cause spurious messages from the softlockup
    detector.

    Whilst these messages are harmless, they are accompanied by a stack
    trace which causes undue concern; more problematically, the stack trace
    in the guest has nothing to do with the observed problem and can only be
    misleading.

    Furthermore, on POWER8 this is completely avoidable with the introduction
    of the Virtual Time Base (VTB) register.

    This patch (of 2):

    This permits the use of arch-specific clocks for which virtualised kernels
    can use their notion of 'running' time, not the elapsed wall time which
    will include host execution time.

    Signed-off-by: Cyril Bur
    Cc: Michael Ellerman
    Cc: Andrew Jones
    Acked-by: Don Zickus
    Cc: Ingo Molnar
    Cc: Ulrich Obergfell
    Cc: chai wen
    Cc: Fabian Frederick
    Cc: Aaron Tomlin
    Cc: Ben Zhang
    Cc: Martin Schwidefsky
    Cc: John Stultz
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyril Bur
     
  • Everybody uses unsigned long for pgoff_t, and no one ever overrode the
    definition of pgoff_t. Keep it that way, and remove the option of
    overriding it.

    Signed-off-by: Geert Uytterhoeven
    Cc: Randy Dunlap
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     
  • Have git ignore the Debian directory created when running:
    make tar-pkg / targz-pkg / tarbz2-pkg / tarxz-pkg

    Signed-off-by: Andrey Skvortsov
    Cc: Michal Marek
    Cc: Greg Kroah-Hartman
    Cc: Boaz Harrosh
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Skvortsov
     
  • If an attacker can cause a controlled kernel stack overflow, overwriting
    the restart block is a very juicy exploit target. This is because the
    restart_block is held in the same memory allocation as the kernel stack.

    Moving the restart block to struct task_struct prevents this exploit by
    making the restart_block harder to locate.

    Note that there are other fields in thread_info that are also easy
    targets, at least on some architectures.

    It's also a decent simplification, since the restart code is more or less
    identical on all architectures.

    [james.hogan@imgtec.com: metag: align thread_info::supervisor_stack]
    Signed-off-by: Andy Lutomirski
    Cc: Thomas Gleixner
    Cc: Al Viro
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Kees Cook
    Cc: David Miller
    Acked-by: Richard Weinberger
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Vineet Gupta
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Haavard Skinnemoen
    Cc: Hans-Christian Egtvedt
    Cc: Steven Miao
    Cc: Mark Salter
    Cc: Aurelien Jacquiot
    Cc: Mikael Starvik
    Cc: Jesper Nilsson
    Cc: David Howells
    Cc: Richard Kuo
    Cc: "Luck, Tony"
    Cc: Geert Uytterhoeven
    Cc: Michal Simek
    Cc: Ralf Baechle
    Cc: Jonas Bonn
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Acked-by: Michael Ellerman (powerpc)
    Tested-by: Michael Ellerman (powerpc)
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Chen Liqin
    Cc: Lennox Wu
    Cc: Chris Metcalf
    Cc: Guan Xuetao
    Cc: Chris Zankel
    Cc: Max Filippov
    Cc: Oleg Nesterov
    Cc: Guenter Roeck
    Signed-off-by: James Hogan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • Instead of a custom approach, let's use string_escape_str() to escape a
    given string (task_name in this case).

    Signed-off-by: Andy Shevchenko
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     
  • The output of /proc/$pid/numa_maps is in terms of number of pages like
    anon=22 or dirty=54. Here's some output:

    7f4680000000 default file=/hugetlb/bigfile anon=50 dirty=50 N0=50
    7f7659600000 default file=/anon_hugepage\040(deleted) anon=50 dirty=50 N0=50
    7fff8d425000 default stack anon=50 dirty=50 N0=50

    Looks like we have a stack and a couple of anonymous hugetlbfs areas
    which all appear to use the same amount of memory. They don't.

    The 'bigfile' uses 1GB pages and takes up ~50GB of space. The
    anon_hugepage uses 2MB pages and takes up ~100MB of space while the stack
    uses normal 4k pages. You can go over to smaps to figure out what the
    page size _really_ is with KernelPageSize or MMUPageSize. But, I think
    this is a pretty nasty and counterintuitive interface as it stands.

    This patch introduces a 'kernelpagesize_kB' line element to the
    /proc/<pid>/numa_maps report file in order to help identify the size of
    pages that are backing memory areas mapped by a given task. This is
    especially useful to help differentiate between HUGE and GIGANTIC page
    backed VMAs.

    This patch is based on Dave Hansen's proposal and reviewers' follow-ups
    taken from the following discussion threads:
    * https://lkml.org/lkml/2011/9/21/454
    * https://lkml.org/lkml/2014/12/20/66

    Signed-off-by: Rafael Aquini
    Cc: Johannes Weiner
    Cc: Dave Hansen
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     
  • Add a small section to proc.txt doc in order to document its
    /proc/pid/numa_maps interface. It does not introduce any functional
    changes, just documentation.

    Signed-off-by: Rafael Aquini
    Cc: Johannes Weiner
    Cc: Dave Hansen
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     
  • Use the PDE() helper to get proc_dir_entry instead of coding it directly.

    Signed-off-by: Alexander Kuleshov
    Acked-by: Nicolas Dichtel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Kuleshov
     
  • Peak resident size of a process can be reset back to the process's
    current rss value by writing "5" to /proc/pid/clear_refs. The driving
    use-case for this would be getting the peak RSS value, which can be
    retrieved from the VmHWM field in /proc/pid/status, per benchmark
    iteration or test scenario.
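
    In userspace the per-iteration measurement could look like the sketch
    below (requires a kernel with this patch applied; error handling kept
    minimal, helper names are made up):

```c
#include <stdio.h>

/* Read the peak RSS (VmHWM, in kB) of the current process from
 * /proc/self/status. */
static long read_vmhwm_kb(void)
{
    char line[256];
    long kb = -1;
    FILE *f = fopen("/proc/self/status", "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f))
        if (sscanf(line, "VmHWM: %ld", &kb) == 1)
            break;
    fclose(f);
    return kb;
}

/* Reset the peak back to the current RSS: write "5" to clear_refs. */
static int reset_peak_rss(void)
{
    FILE *f = fopen("/proc/self/clear_refs", "w");

    if (!f)
        return -1;
    fputs("5", f);
    return fclose(f);
}
```

    A benchmark harness would call reset_peak_rss() before each iteration
    and read_vmhwm_kb() after it.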

    [akpm@linux-foundation.org: clarify behaviour in documentation]
    Signed-off-by: Petr Cermak
    Cc: Bjorn Helgaas
    Cc: Primiano Tucci
    Cc: Petr Cermak
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Cermak
     
  • Remove the function search_one_table() that is not used anywhere.

    This was partially found by using a static code analysis program called
    cppcheck.

    Signed-off-by: Rickard Strandqvist
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rickard Strandqvist
     
  • Keeping zsmalloc fragmentation at a low level is our target. But now
    we still need to add debug code to zsmalloc to get the quantitative
    data.

    This patch adds a new configuration option CONFIG_ZSMALLOC_STAT to
    enable statistics collection for developers. Currently only the object
    statistics in each class are collected. Users can get the information
    via debugfs.

    cat /sys/kernel/debug/zsmalloc/zram0/...

    For example:

    After I copied "jdk-8u25-linux-x64.tar.gz" to zram with ext4 filesystem:
    class  size  obj_allocated  obj_used  pages_used
        0    32              0         0           0
        1    48            256        12           3
        2    64             64        14           1
        3    80             51         7           1
        4    96            128         5           3
        5   112             73         5           2
        6   128             32         4           1
        7   144              0         0           0
        8   160              0         0           0
        9   176              0         0           0
       10   192              0         0           0
       11   208              0         0           0
       12   224              0         0           0
       13   240              0         0           0
       14   256             16         1           1
       15   272             15         9           1
       16   288              0         0           0
       17   304              0         0           0
       18   320              0         0           0
       19   336              0         0           0
       20   352              0         0           0
       21   368              0         0           0
       22   384              0         0           0
       23   400              0         0           0
       24   416              0         0           0
       25   432              0         0           0
       26   448              0         0           0
       27   464              0         0           0
       28   480              0         0           0
       29   496             33         1           4
       30   512              0         0           0
       31   528              0         0           0
       32   544              0         0           0
       33   560              0         0           0
       34   576              0         0           0
       35   592              0         0           0
       36   608              0         0           0
       37   624              0         0           0
       38   640              0         0           0
       40   672              0         0           0
       42   704              0         0           0
       43   720             17         1           3
       44   736              0         0           0
       46   768              0         0           0
       49   816              0         0           0
       51   848              0         0           0
       52   864             14         1           3
       54   896              0         0           0
       57   944             13         1           3
       58   960              0         0           0
       62  1024              4         1           1
       66  1088             15         2           4
       67  1104              0         0           0
       71  1168              0         0           0
       74  1216              0         0           0
       76  1248              0         0           0
       83  1360              3         1           1
       91  1488             11         1           4
       94  1536              0         0           0
      100  1632              5         1           2
      107  1744              0         0           0
      111  1808              9         1           4
      126  2048              4         4           2
      144  2336              7         3           4
      151  2448              0         0           0
      168  2720             15        15          10
      190  3072             28        27          21
      202  3264              0         0           0
      254  4096          36209     36209       36209

    Total            37022     36326       36288

    We can calculate the overall fragmentation from the last line:
    Total 37022 36326 36288
    (37022 - 36326) / 37022 = 1.87%
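
    The same calculation as a small helper (illustrative only; the name is
    made up):

```c
/* Fragmentation from the Total line: the share of allocated objects
 * that are not in use, as a percentage. */
static double frag_percent(unsigned long allocated, unsigned long used)
{
    return 100.0 * (double)(allocated - used) / (double)allocated;
}
```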

    Also, by analysing the objects allocated in every class we know why we
    got such low fragmentation: most of the allocated objects are in class
    254, and a class 254 zspage contains only 1 page. So no fragmentation
    will be introduced by allocating objs in class 254.

    And in the future, we can collect other zsmalloc statistics as needed
    and analyse them.

    Signed-off-by: Ganesh Mahendran
    Suggested-by: Minchan Kim
    Acked-by: Minchan Kim
    Cc: Nitin Gupta
    Cc: Seth Jennings
    Cc: Dan Streetman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ganesh Mahendran
     
  • Currently the underlying allocators of zpool, zsmalloc and zbud, do not
    know who creates them. There is no way for zsmalloc/zbud to find out
    which caller they belong to.

    Now we want to add statistics collection in zsmalloc, and we need to
    name the debugfs dir for each pool created. The way suggested by
    Minchan Kim is to use a name passed by the caller (such as zram) to
    create the zsmalloc pool:

    /sys/kernel/debug/zsmalloc/zram0

    This patch adds an argument `name' to zs_create_pool() and other related
    functions.

    Signed-off-by: Ganesh Mahendran
    Acked-by: Minchan Kim
    Cc: Seth Jennings
    Cc: Nitin Gupta
    Cc: Dan Streetman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ganesh Mahendran
     
  • `struct zram' contains both `struct gendisk' and `struct request_queue'.
    The latter can be deleted, because zram->disk carries a ->queue pointer
    and ->queue carries a zram pointer:

    create_device()
        zram->queue->queuedata = zram
        zram->disk->queue = zram->queue
        zram->disk->private_data = zram

    so zram->queue is not needed; we can access all necessary data anyway.

    Signed-off-by: Sergey Senozhatsky
    Cc: Minchan Kim
    Cc: Jerome Marchand
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • An admin could reset zram while I/O is in flight, so we have used
    zram->init_lock as a read-side lock in the I/O path to prevent sudden
    freeing of the zram meta.

    However, the init_lock is really troublesome. We can't call
    zram_meta_alloc under init_lock due to a lockdep splat, because
    zram_rw_page is in the reclaim path and holds it as a read lock while
    other places in process context hold it as a write lock. So we have
    done the allocation outside the lock to avoid the lockdep warning, but
    that's not good for readability and, finally, I met another lockdep
    splat between init_lock and cpu_hotplug from kmem_cache_destroy while
    working on zsmalloc compaction. :(

    Yes, the ideal is to remove the horrible init_lock from zram's rw path.
    This patch removes it there and instead adds an atomic refcount for
    meta lifetime management and a completion to free the meta in process
    context. It's important to free the meta in process context because
    some of the resource destruction needs a mutex lock, which could be
    held if we released the resource in reclaim context, so it's a
    deadlock, again.

    As a bonus, we can remove the init_done check in the rw path because
    zram_meta_get will take over that role.
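
    The shape of that refcount-plus-completion scheme, as a userspace
    sketch with made-up names (meta_get/meta_put/meta_reset); the kernel
    side uses atomics plus a struct completion and sleeps properly instead
    of the polling loop shown here:

```c
#include <stdatomic.h>
#include <stdbool.h>

struct meta {
    atomic_int refcount;        /* starts at 1: the device's own ref */
    atomic_bool released;       /* stands in for a struct completion */
};

/* I/O path: take a reference unless teardown already dropped the
 * last one (this replaces taking init_lock for reading). */
static bool meta_get(struct meta *m)
{
    int old = atomic_load(&m->refcount);

    while (old > 0)
        if (atomic_compare_exchange_weak(&m->refcount, &old, old + 1))
            return true;
    return false;               /* device is being torn down */
}

/* Drop a reference; the last put signals the waiter. */
static void meta_put(struct meta *m)
{
    if (atomic_fetch_sub(&m->refcount, 1) == 1)
        atomic_store(&m->released, true);   /* complete(&done) */
}

/* Reset path, process context: drop the initial ref and wait for the
 * last I/O ref to go away; only then free the meta, where sleeping
 * locks taken by the destructors are allowed. */
static void meta_reset(struct meta *m)
{
    meta_put(m);
    while (!atomic_load(&m->released))
        ;                                   /* wait_for_completion() */
    /* ...free meta resources here... */
}
```

    After meta_reset() returns, meta_get() can no longer succeed, so no
    I/O path can see freed meta.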

    Signed-off-by: Sergey Senozhatsky
    Signed-off-by: Minchan Kim
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Ganesh Mahendran
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • bd_holders is increased only when a user opens the device file as
    FMODE_EXCL, so if something opens zram0 as !FMODE_EXCL and requests I/O
    while another user resets zram0, we can see the following warning.

    zram0: detected capacity change from 0 to 64424509440
    Buffer I/O error on dev zram0, logical block 180823, lost async page write
    Buffer I/O error on dev zram0, logical block 180824, lost async page write
    Buffer I/O error on dev zram0, logical block 180825, lost async page write
    Buffer I/O error on dev zram0, logical block 180826, lost async page write
    Buffer I/O error on dev zram0, logical block 180827, lost async page write
    Buffer I/O error on dev zram0, logical block 180828, lost async page write
    Buffer I/O error on dev zram0, logical block 180829, lost async page write
    Buffer I/O error on dev zram0, logical block 180830, lost async page write
    Buffer I/O error on dev zram0, logical block 180831, lost async page write
    Buffer I/O error on dev zram0, logical block 180832, lost async page write
    ------------[ cut here ]------------
    WARNING: CPU: 11 PID: 1996 at fs/block_dev.c:57 __blkdev_put+0x1d7/0x210()
    Modules linked in:
    CPU: 11 PID: 1996 Comm: dd Not tainted 3.19.0-rc6-next-20150202+ #1125
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
    dump_stack+0x45/0x57
    warn_slowpath_common+0x8a/0xc0
    warn_slowpath_null+0x1a/0x20
    __blkdev_put+0x1d7/0x210
    blkdev_put+0x50/0x130
    blkdev_close+0x25/0x30
    __fput+0xdf/0x1e0
    ____fput+0xe/0x10
    task_work_run+0xa7/0xe0
    do_notify_resume+0x49/0x60
    int_signal+0x12/0x17
    ---[ end trace 274fbbc5664827d2 ]---

    The warning comes from bdev_write_inode in the blkdev_put path.

    static void bdev_write_inode(struct inode *inode)
    {
            spin_lock(&inode->i_lock);
            while (inode->i_state & I_DIRTY) {
                    spin_unlock(&inode->i_lock);
                    WARN_ON_ONCE(write_inode_now(inode, true));
                    spin_lock(&inode->i_lock);
            }
            spin_unlock(&inode->i_lock);
    }

    The reason is that the dd process encounters I/O failures due to the
    sudden block device disappearance, so filemap_check_errors in
    __writeback_single_inode returns -EIO.

    If we check bd_openers instead of bd_holders, we can address the
    problem. brd already uses bd_openers rather than bd_holders, so
    although I'm not an expert on the block layer, it seems to be the
    better choice.

    I can reproduce the warning with the simple script below. In addition,
    I added msleep(2000) below set_capacity(zram->disk, 0) after applying
    your patch to make the window huge (kudos to Ganesh!).

    script:

    echo $((60<<30)) > /sys/block/zram0/disksize
    setsid dd if=/dev/zero of=/dev/zram0 &
    sleep 1
    setsid echo 1 > /sys/block/zram0/reset

    Signed-off-by: Minchan Kim
    Acked-by: Sergey Senozhatsky
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Ganesh Mahendran
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • We need to return the set_capacity(disk, 0) call from reset_store()
    back to zram_reset_device(), a catch by Ganesh Mahendran. Potentially,
    we can race set_capacity() calls from the init and reset paths.

    The problem is that zram_reset_device() is also called from
    zram_exit(), which performs operations in a misleadingly reversed
    order: we first create_device() and then init it, while zram_exit()
    performs destroy_device() first and then does zram_reset_device().
    This is done to remove the sysfs group before we reset the device, so
    the device reset/destruction cannot be raced by a sysfs attr write
    (e.g. disksize).

    Apart from that, destroy_device() releases zram->disk (but we still
    have the ->disk pointer), so we cannot access zram->disk in the later
    zram_reset_device() call, which may cause additional errors in the
    future.

    So, this patch reworks and cleans up the destroy path.

    1) remove several unneeded goto labels in zram_init()

    2) factor out zram_init() error path and zram_exit() into
    destroy_devices() function, which takes the number of devices to
    destroy as its argument.

    3) remove sysfs group in destroy_devices() first, so we can reorder
    operations -- reset device (as expected) goes before disk destroy and
    queue cleanup. So we can always access ->disk in zram_reset_device().

    4) and, finally, return set_capacity() back under ->init_lock.

    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: Sergey Senozhatsky
    Reported-by: Ganesh Mahendran
    Cc: Minchan Kim
    Cc: Jerome Marchand
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky