Eric Lee / smarc-fsl-linux-kernel

11 Sep, 2015

40 commits

d0edd8528 ipc: convert invalid scenarios to use WARN_ON ... Browse Code »

Considering Linus' past rants about the (ab)use of BUG in the kernel, I
took a look at how we deal with such calls in ipc. Given that any errors
or corruption in ipc code are most likely contained within the set of
processes participating in the broken mechanisms, there aren't really many
strong fatal system failure scenarios that would require a BUG call.
Also, if something is seriously wrong, ipc might not be the place for such
a BUG either.

1. For example, recently, a customer hit one of these BUG_ONs in shm
after failing shm_lock(). A busted ID imho does not merit a BUG_ON,
and WARN would have been better.

2. MSG_COPY functionality of posix msgrcv(2) for checkpoint/restore.
I don't see how we can hit this anyway -- at least it should be IS_ERR.
The 'copy' arg from do_msgrcv is always set by calling prepare_copy()
first and foremost. We could also probably drop this check altogether.
Either way, it does not merit a BUG_ON.

3. No ->fault() callback for the fs getting the corresponding page --
seems selfish to make the system unusable.

Signed-off-by: Davidlohr Bueso
Cc: Manfred Spraul
Cc: Linus Torvalds
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Davidlohr Bueso
2015-09-11 04:29:01 +0800
8b235f2f1 zlib_deflate/deftree: remove bi_reverse() ... Browse Code »

Remove bi_reverse() and use generic bitrev32() instead - it should have
better performance on some platforms.

Signed-off-by: yalin wang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

yalin wang
2015-09-11 04:29:01 +0800
e4e29dc48 lib/decompress_unlzma: Do a NULL check for pointer ... Browse Code »

Compare pointer-typed values to NULL rather than 0.

The semantic patch that makes this change is available
in scripts/coccinelle/null/badzero.cocci.

Signed-off-by: Fabio Estevam
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Fabio Estevam
2015-09-11 04:29:01 +0800
2d3862d26 lib/decompressors: use real out buf size for gunzip with kernel ... Browse Code »

When loading x86 64bit kernel above 4GiB with patched grub2, got kernel
gunzip error.

| early console in decompress_kernel
| decompress_kernel:
| input: [0x807f2143b4-0x807ff61aee]
| output: [0x807cc00000-0x807f3ea29b] 0x027ea29c: output_len
| boot via startup_64
| KASLR using RDTSC...
| new output: [0x46fe000000-0x470138cfff] 0x0338d000: output_run_size
| decompress: [0x46fe000000-0x47007ea29b]
Cc: Alexandre Courbot
Cc: Jon Medhurst
Cc: Stephen Warren
Cc: "H. Peter Anvin"
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Yinghai Lu
2015-09-11 04:29:01 +0800
e852d82a5 fs/affs: make root lookup from blkdev logical size ... Browse Code »

This patch resolves https://bugzilla.kernel.org/show_bug.cgi?id=16531.

When logical blkdev size > 512 then sector numbers become larger than the
device can support.

Make affs start lookup based on the device's logical sector size instead
of 512.

Reported-by: Mark
Suggested-by: Mark
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pranay Kr. Srivastava
2015-09-11 04:29:01 +0800
9a5bc726d sysctl: fix int -> unsigned long assignments in INT_MIN case ... Browse Code »

The following

if (val < 0)
*lvalp = (unsigned long)-val;

is incorrect because the compiler is free to assume -val to be positive
and use a sign-extend instruction for extending the bit pattern. This is
a problem if val == INT_MIN:

# echo -2147483648 >/proc/sys/dev/scsi/logging_level
# cat /proc/sys/dev/scsi/logging_level
-18446744071562067968

Cast to unsigned long before negation - that way we first sign-extend and
then negate an unsigned, which is well defined. With this:

# cat /proc/sys/dev/scsi/logging_level
-2147483648

Signed-off-by: Ilya Dryomov
Cc: Mikulas Patocka
Cc: Robert Xiao
Cc: "Eric W. Biederman"
Cc: Kees Cook
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ilya Dryomov
2015-09-11 04:29:01 +0800
1303a27c9 kexec: export KERNEL_IMAGE_SIZE to vmcoreinfo ... Browse Code »

In x86_64, since v2.6.26 the KERNEL_IMAGE_SIZE is changed to 512M, and
accordingly the MODULES_VADDR is changed to 0xffffffffa0000000. However,
in v3.12 Kees Cook introduced kaslr to randomise the location of kernel.
And the kernel text mapping addr space is enlarged from 512M to 1G. That
means now KERNEL_IMAGE_SIZE is variable, its value is 512M when kaslr
support is not compiled in and 1G when kaslr support is compiled in.
Accordingly the MODULES_VADDR is changed too to be:

#define MODULES_VADDR (__START_KERNEL_map + KERNEL_IMAGE_SIZE)

So when kaslr is compiled in and enabled, the kernel text mapping addr
space and modules vaddr space need be adjusted. Otherwise makedumpfile
will collapse since the addr for some symbols is not correct.

Hence KERNEL_IMAGE_SIZE need be exported to vmcoreinfo and got in
makedumpfile to help calculate MODULES_VADDR.

Signed-off-by: Baoquan He
Acked-by: Kees Cook
Acked-by: Vivek Goyal
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Baoquan He
2015-09-11 04:29:01 +0800
bbb78b8f3 kexec: align crash_notes allocation to make it be inside one physical page ... Browse Code »

People reported that crash_notes in /proc/vmcore were corrupted and this
cause crash kdump failure. With code debugging and log we got the root
cause. This is because percpu variable crash_notes are allocated in 2
vmalloc pages. Currently percpu is based on vmalloc by default. Vmalloc
can't guarantee 2 continuous vmalloc pages are also on 2 continuous
physical pages. So when 1st kernel exports the starting address and size
of crash_notes through sysfs like below:

/sys/devices/system/cpu/cpux/crash_notes
/sys/devices/system/cpu/cpux/crash_notes_size

kdump kernel use them to get the content of crash_notes. However the 2nd
part may not be in the next neighbouring physical page as we expected if
crash_notes are allocated accross 2 vmalloc pages. That's why
nhdr_ptr->n_namesz or nhdr_ptr->n_descsz could be very huge in
update_note_header_size_elf64() and cause note header merging failure or
some warnings.

In this patch change to call __alloc_percpu() to passed in the align value
by rounding crash_notes_size up to the nearest power of two. This makes
sure the crash_notes is allocated inside one physical page since
sizeof(note_buf_t) in all ARCHS is smaller than PAGE_SIZE. Meanwhile add
a BUILD_BUG_ON to break compile if size is bigger than PAGE_SIZE since
crash_notes definitely will be in 2 pages. That need be avoided, and need
be reported if it's unavoidable.

[akpm@linux-foundation.org: use correct comment layout]
Signed-off-by: Baoquan He
Cc: Eric W. Biederman
Cc: Vivek Goyal
Cc: Dave Young
Cc: Lisa Mitchell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Baoquan He
2015-09-11 04:29:01 +0800
04e9949b2 kexec: remove unnecessary test in kimage_alloc_crash_control_pages() ... Browse Code »

Transforming PFN(Page Frame Number) to struct page is never failure, so we
can simplify the code logic to do the image->control_page assignment
directly in the loop, and remove the unnecessary conditional judgement.

Signed-off-by: Minfei Huang
Acked-by: Dave Young
Acked-by: Vivek Goyal
Cc: Simon Horman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Minfei Huang
2015-09-11 04:29:01 +0800
2965faa5e kexec: split kexec_load syscall from kexec core code ... Browse Code »

There are two kexec load syscalls, kexec_load another and kexec_file_load.
kexec_file_load has been splited as kernel/kexec_file.c. In this patch I
split kexec_load syscall code to kernel/kexec.c.

And add a new kconfig option KEXEC_CORE, so we can disable kexec_load and
use kexec_file_load only, or vice verse.

The original requirement is from Ted Ts'o, he want kexec kernel signature
being checked with CONFIG_KEXEC_VERIFY_SIG enabled. But kexec-tools use
kexec_load syscall can bypass the checking.

Vivek Goyal proposed to create a common kconfig option so user can compile
in only one syscall for loading kexec kernel. KEXEC/KEXEC_FILE selects
KEXEC_CORE so that old config files still work.

Because there's general code need CONFIG_KEXEC_CORE, so I updated all the
architecture Kconfig with a new option KEXEC_CORE, and let KEXEC selects
KEXEC_CORE in arch Kconfig. Also updated general kernel code with to
kexec_load syscall.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Dave Young
Cc: Eric W. Biederman
Cc: Vivek Goyal
Cc: Petr Tesarik
Cc: Theodore Ts'o
Cc: Josh Boyer
Cc: David Howells
Cc: Geert Uytterhoeven
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Dave Young
2015-09-11 04:29:01 +0800
a43cac0d9 kexec: split kexec_file syscall code to kexec_file.c ... Browse Code »

Split kexec_file syscall related code to another file kernel/kexec_file.c
so that the #ifdef CONFIG_KEXEC_FILE in kexec.c can be dropped.

Sharing variables and functions are moved to kernel/kexec_internal.h per
suggestion from Vivek and Petr.

[akpm@linux-foundation.org: fix bisectability]
[akpm@linux-foundation.org: declare the various arch_kexec functions]
[akpm@linux-foundation.org: fix build]
Signed-off-by: Dave Young
Cc: Eric W. Biederman
Cc: Vivek Goyal
Cc: Petr Tesarik
Cc: Theodore Ts'o
Cc: Josh Boyer
Cc: David Howells
Cc: Geert Uytterhoeven
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Dave Young
2015-09-11 04:29:01 +0800
a202fbbf5 drivers/net/wireless/ath/wil6210: use seq_hex_dump() to dump buffers ... Browse Code »

Instead of custom approach let's use recently introduced seq_hex_dump()
helper.

Signed-off-by: Andy Shevchenko
Cc: Alexander Viro
Cc: Joe Perches
Cc: Tadeusz Struk
Cc: Helge Deller
Cc: Ingo Tuchscherer
Cc: Catalin Marinas
Cc: Vladimir Kondratiev
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andy Shevchenko
2015-09-11 04:29:01 +0800
6fc37c490 kmemleak: use seq_hex_dump() to dump buffers ... Browse Code »

Instead of custom approach let's use recently introduced seq_hex_dump()
helper.

Signed-off-by: Andy Shevchenko
Cc: Alexander Viro
Cc: Joe Perches
Cc: Tadeusz Struk
Cc: Helge Deller
Cc: Ingo Tuchscherer
Acked-by: Catalin Marinas
Cc: Vladimir Kondratiev
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andy Shevchenko
2015-09-11 04:29:01 +0800
5d2fe875c drivers/s390/crypto/zcrypt_api.c: use seq_hex_dump() to dump buffers ... Browse Code »

Instead of custom approach let's use recently introduced seq_hex_dump()
helper.

Signed-off-by: Andy Shevchenko
Acked-by: Ingo Tuchscherer
Cc: Alexander Viro
Cc: Joe Perches
Cc: Tadeusz Struk
Cc: Helge Deller
Cc: Catalin Marinas
Cc: Vladimir Kondratiev
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andy Shevchenko
2015-09-11 04:29:01 +0800
b342a65dd parisc: use seq_hex_dump() to dump buffers ... Browse Code »

Instead of custom approach let's use recently introduced seq_hex_dump()
helper.

In one case it changes the output from
1111111122222222333333334444444455555555666666667777777788888888
to
11111111 22222222 33333333 44444444 55555555 66666666 77777777 88888888

though it seems it prints same data (by meaning) in both cases. I decide
to choose to use the space divided one.

Signed-off-by: Andy Shevchenko
Acked-by: Helge Deller
Cc: Alexander Viro
Cc: Joe Perches
Cc: Tadeusz Struk
Cc: Ingo Tuchscherer
Cc: Catalin Marinas
Cc: Vladimir Kondratiev
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andy Shevchenko
2015-09-11 04:29:01 +0800
d0cce0622 drivers/crypto/qat: use seq_hex_dump() to dump buffers ... Browse Code »

Instead of custom approach let's use recently introduced seq_hex_dump()
helper.

Signed-off-by: Andy Shevchenko
Acked-by: Tadeusz Struk
Cc: Alexander Viro
Cc: Joe Perches
Cc: Helge Deller
Cc: Ingo Tuchscherer
Cc: Catalin Marinas
Cc: Vladimir Kondratiev
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andy Shevchenko
2015-09-11 04:29:01 +0800
37607102c seq_file: provide an analogue of print_hex_dump() ... Browse Code »

This introduces a new helper and switches current users to use it. All
patches are compiled tested. kmemleak is tested via its own test suite.

This patch (of 6):

The new seq_hex_dump() is a complete analogue of print_hex_dump().

We have few users of this functionality already. It allows to reduce their
codebase.

Signed-off-by: Andy Shevchenko
Cc: Alexander Viro
Cc: Joe Perches
Cc: Tadeusz Struk
Cc: Helge Deller
Cc: Ingo Tuchscherer
Cc: Catalin Marinas
Cc: Vladimir Kondratiev
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andy Shevchenko
2015-09-11 04:29:01 +0800
40f705a73 fs: Don't dump core if the corefile would become world-readable. ... Browse Code »

On a filesystem like vfat, all files are created with the same owner
and mode independent of who created the file. When a vfat filesystem
is mounted with root as owner of all files and read access for everyone,
root's processes left world-readable coredumps on it (but other
users' processes only left empty corefiles when given write access
because of the uid mismatch).

Given that the old behavior was inconsistent and insecure, I don't see
a problem with changing it. Now, all processes refuse to dump core unless
the resulting corefile will only be readable by their owner.

Signed-off-by: Jann Horn
Acked-by: Kees Cook
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jann Horn
2015-09-11 04:29:01 +0800
fbb181694 fs: if a coredump already exists, unlink and recreate with O_EXCL ... Browse Code »

It was possible for an attacking user to trick root (or another user) into
writing his coredumps into an attacker-readable, pre-existing file using
rename() or link(), causing the disclosure of secret data from the victim
process' virtual memory. Depending on the configuration, it was also
possible to trick root into overwriting system files with coredumps. Fix
that issue by never writing coredumps into existing files.

Requirements for the attack:
- The attack only applies if the victim's process has a nonzero
RLIMIT_CORE and is dumpable.
- The attacker can trick the victim into coredumping into an
attacker-writable directory D, either because the core_pattern is
relative and the victim's cwd is attacker-writable or because an
absolute core_pattern pointing to a world-writable directory is used.
- The attacker has one of these:
A: on a system with protected_hardlinks=0:
execute access to a folder containing a victim-owned,
attacker-readable file on the same partition as D, and the
victim-owned file will be deleted before the main part of the attack
takes place. (In practice, there are lots of files that fulfill
this condition, e.g. entries in Debian's /var/lib/dpkg/info/.)
This does not apply to most Linux systems because most distros set
protected_hardlinks=1.
B: on a system with protected_hardlinks=1:
execute access to a folder containing a victim-owned,
attacker-readable and attacker-writable file on the same partition
as D, and the victim-owned file will be deleted before the main part
of the attack takes place.
(This seems to be uncommon.)
C: on any system, independent of protected_hardlinks:
write access to a non-sticky folder containing a victim-owned,
attacker-readable file on the same partition as D
(This seems to be uncommon.)

The basic idea is that the attacker moves the victim-owned file to where
he expects the victim process to dump its core. The victim process dumps
its core into the existing file, and the attacker reads the coredump from
it.

If the attacker can't move the file because he does not have write access
to the containing directory, he can instead link the file to a directory
he controls, then wait for the original link to the file to be deleted
(because the kernel checks that the link count of the corefile is 1).

A less reliable variant that requires D to be non-sticky works with link()
and does not require deletion of the original link: link() the file into
D, but then unlink() it directly before the kernel performs the link count
check.

On systems with protected_hardlinks=0, this variant allows an attacker to
not only gain information from coredumps, but also clobber existing,
victim-writable files with coredumps. (This could theoretically lead to a
privilege escalation.)

Signed-off-by: Jann Horn
Cc: Kees Cook
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jann Horn
2015-09-11 04:29:01 +0800
bb304a5c6 kmod: handle UMH_WAIT_PROC from system unbound workqueue ... Browse Code »

The UMH_WAIT_PROC handler runs in its own thread in order to make sure
that waiting for the exec kernel thread completion won't block other
usermodehelper queued jobs.

On older workqueue implementations, worklets couldn't sleep without
blocking the rest of the queue. But now the workqueue subsystem handles
that. Khelper still had the older limitation due to its singlethread
properties but we replaced it to system unbound workqueues.

Those are affine to the current node and can block up to some number of
instances.

They are a good candidate to handle UMH_WAIT_PROC assuming that we have
enough system unbound workers to handle lots of parallel usermodehelper
jobs.

Signed-off-by: Frederic Weisbecker
Cc: Rik van Riel
Reviewed-by: Oleg Nesterov
Cc: Christoph Lameter
Cc: Tejun Heo
Cc: Rusty Russell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Frederic Weisbecker
2015-09-11 04:29:01 +0800
90f023030 kmod: use system_unbound_wq instead of khelper ... Browse Code »

We need to launch the usermodehelper kernel threads with the widest
affinity and this is partly why we use khelper. This workqueue has
unbound properties and thus a wide affinity inherited by all its children.

Now khelper also has special properties that we aren't much interested in:
ordered and singlethread. There is really no need about ordering as all
we do is creating kernel threads. This can be done concurrently. And
singlethread is a useless limitation as well.

The workqueue engine already proposes generic unbound workqueues that
don't share these useless properties and handle well parallel jobs.

The only worrysome specific is their affinity to the node of the current
CPU. It's fine for creating the usermodehelper kernel threads but those
inherit this affinity for longer jobs such as requesting modules.

This patch proposes to use these node affine unbound workqueues assuming
that a node is sufficient to handle several parallel usermodehelper
requests.

Signed-off-by: Frederic Weisbecker
Cc: Rik van Riel
Reviewed-by: Oleg Nesterov
Cc: Christoph Lameter
Cc: Tejun Heo
Cc: Rusty Russell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Frederic Weisbecker
2015-09-11 04:29:01 +0800
b639e86ba kmod: add up-to-date explanations on the purpose of each asynchronous levels ... Browse Code »

There seem to be quite some confusions on the comments, likely due to
changes that came after them.

Now since it's very non obvious why we have 3 levels of asynchronous code
to implement usermodehelpers, it's important to comment in detail the
reason of this layout.

Signed-off-by: Frederic Weisbecker
Cc: Rik van Riel
Reviewed-by: Oleg Nesterov
Cc: Christoph Lameter
Cc: Tejun Heo
Cc: Rusty Russell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Frederic Weisbecker
2015-09-11 04:29:01 +0800
d097c0240 kmod: remove unecessary explicit wide CPU affinity setting ... Browse Code »

Khelper is affine to all CPUs. Now since it creates the
call_usermodehelper_exec_[a]sync() kernel threads, those inherit the wide
affinity.

As such explicitly forcing a wide affinity from those kernel threads
is like a no-op.

Just remove it. It's needless and it breaks CPU isolation users who
rely on workqueue affinity tuning.

Signed-off-by: Frederic Weisbecker
Cc: Rik van Riel
Reviewed-by: Oleg Nesterov
Cc: Christoph Lameter
Cc: Tejun Heo
Cc: Rusty Russell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Frederic Weisbecker
2015-09-11 04:29:01 +0800
b6b50a814 kmod: bunch of internal functions renames ... Browse Code »

This patchset does a bunch of cleanups and converts khelper to use system
unbound workqueues. The 3 first patches should be uncontroversial. The
last 2 patches are debatable.

Kmod creates kernel threads that perform userspace jobs and we want those
to have a large affinity in order not to contend busy CPUs. This is
(partly) why we use khelper which has a wide affinity that the kernel
threads it create can inherit from. Now khelper is a dedicated workqueue
that has singlethread properties which we aren't interested in.

Hence those two debatable changes:

_ We would like to use generic workqueues. System unbound workqueues are
a very good candidate but they are not wide affine, only node affine.
Now probably a node is enough to perform many parallel kmod jobs.

_ We would like to remove the wait_for_helper kernel thread (UMH_WAIT_PROC
handler) to use the workqueue. It means that if the workqueue blocks,
and no other worker can take pending kmod request, we can be screwed.
Now if we have 512 threads, this should be enough.

This patch (of 5):

Underscores on function names aren't much verbose to explain the purpose
of a function. And kmod has interesting such flavours.

Lets rename the following functions:

* __call_usermodehelper -> call_usermodehelper_exec_work
* ____call_usermodehelper -> call_usermodehelper_exec_async
* wait_for_helper -> call_usermodehelper_exec_sync

Signed-off-by: Frederic Weisbecker
Cc: Rik van Riel
Reviewed-by: Oleg Nesterov
Cc: Christoph Lameter
Cc: Tejun Heo
Cc: Rusty Russell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Frederic Weisbecker
2015-09-11 04:29:01 +0800
60b61a6f4 kmod: correct documentation of return status of request_module ... Browse Code »

If request_module() successfully runs modprobe, but modprobe exits with a
non-zero status, then the return value from request_module() will be that
(positive) error status. So the return from request_module can be:

negative errno
zero for success
positive exit code.

Signed-off-by: NeilBrown
Cc: Goldwyn Rodrigues
Cc: Oleg Nesterov
Cc: Tejun Heo
Cc: Rusty Russell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

NeilBrown
2015-09-11 04:29:01 +0800
b4cc0efea hfs: fix B-tree corruption after insertion at position 0 ... Browse Code »

Fix B-tree corruption when a new record is inserted at position 0 in the
node in hfs_brec_insert().

This is an identical change to the corresponding hfs b-tree code to Sergei
Antonov's "hfsplus: fix B-tree corruption after insertion at position 0",
to keep similar code paths in the hfs and hfsplus drivers in sync, where
appropriate.

Signed-off-by: Hin-Tak Leung
Cc: Sergei Antonov
Cc: Joe Perches
Reviewed-by: Vyacheslav Dubeyko
Cc: Anton Altaparmakov
Cc: Al Viro
Cc: Christoph Hellwig
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hin-Tak Leung
2015-09-11 04:29:01 +0800
7cb74be6f hfs,hfsplus: cache pages correctly between bnode_create and bnode_free ... Browse Code »

Pages looked up by __hfs_bnode_create() (called by hfs_bnode_create() and
hfs_bnode_find() for finding or creating pages corresponding to an inode)
are immediately kmap()'ed and used (both read and write) and kunmap()'ed,
and should not be page_cache_release()'ed until hfs_bnode_free().

This patch fixes a problem I first saw in July 2012: merely running "du"
on a large hfsplus-mounted directory a few times on a reasonably loaded
system would get the hfsplus driver all confused and complaining about
B-tree inconsistencies, and generates a "BUG: Bad page state". Most
recently, I can generate this problem on up-to-date Fedora 22 with shipped
kernel 4.0.5, by running "du /" (="/" + "/home" + "/mnt" + other smaller
mounts) and "du /mnt" simultaneously on two windows, where /mnt is a
lightly-used QEMU VM image of the full Mac OS X 10.9:

$ df -i / /home /mnt
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/mapper/fedora-root 3276800 551665 2725135 17% /
/dev/mapper/fedora-home 52879360 716221 52163139 2% /home
/dev/nbd0p2 4294967295 1387818 4293579477 1% /mnt

After applying the patch, I was able to run "du /" (60+ times) and "du
/mnt" (150+ times) continuously and simultaneously for 6+ hours.

There are many reports of the hfsplus driver getting confused under load
and generating "BUG: Bad page state" or other similar issues over the
years. [1]

The unpatched code [2] has always been wrong since it entered the kernel
tree. The only reason why it gets away with it is that the
kmap/memcpy/kunmap follow very quickly after the page_cache_release() so
the kernel has not had a chance to reuse the memory for something else,
most of the time.

The current RW driver appears to have followed the design and development
of the earlier read-only hfsplus driver [3], where-by version 0.1 (Dec
2001) had a B-tree node-centric approach to
read_cache_page()/page_cache_release() per bnode_get()/bnode_put(),
migrating towards version 0.2 (June 2002) of caching and releasing pages
per inode extents. When the current RW code first entered the kernel [2]
in 2005, there was an REF_PAGES conditional (and "//" commented out code)
to switch between B-node centric paging to inode-centric paging. There
was a mistake with the direction of one of the REF_PAGES conditionals in
__hfs_bnode_create(). In a subsequent "remove debug code" commit [4], the
read_cache_page()/page_cache_release() per bnode_get()/bnode_put() were
removed, but a page_cache_release() was mistakenly left in (propagating
the "REF_PAGES !REF_PAGE" mistake), and the commented-out
page_cache_release() in bnode_release() (which should be spanned by
!REF_PAGES) was never enabled.

References:
[1]:
Michael Fox, Apr 2013
http://www.spinics.net/lists/linux-fsdevel/msg63807.html
("hfsplus volume suddenly inaccessable after 'hfs: recoff %d too large'")

Sasha Levin, Feb 2015
http://lkml.org/lkml/2015/2/20/85 ("use after free")

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/740814
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1027887
https://bugzilla.kernel.org/show_bug.cgi?id=42342
https://bugzilla.kernel.org/show_bug.cgi?id=63841
https://bugzilla.kernel.org/show_bug.cgi?id=78761

[2]:
http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\
fs/hfs/bnode.c?id=d1081202f1d0ee35ab0beb490da4b65d4bc763db
commit d1081202f1d0ee35ab0beb490da4b65d4bc763db
Author: Andrew Morton
Date: Wed Feb 25 16:17:36 2004 -0800

[PATCH] HFS rewrite

http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\
fs/hfsplus/bnode.c?id=91556682e0bf004d98a529bf829d339abb98bbbd

commit 91556682e0bf004d98a529bf829d339abb98bbbd
Author: Andrew Morton
Date: Wed Feb 25 16:17:48 2004 -0800

[PATCH] HFS+ support

[3]:
http://sourceforge.net/projects/linux-hfsplus/

http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.1/
http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.2/

http://linux-hfsplus.cvs.sourceforge.net/viewvc/linux-hfsplus/linux/\
fs/hfsplus/bnode.c?r1=1.4&r2=1.5

Date: Thu Jun 6 09:45:14 2002 +0000
Use buffer cache instead of page cache in bnode.c. Cache inode extents.

[4]:
http://git.kernel.org/cgit/linux/kernel/git/\
stable/linux-stable.git/commit/?id=a5e3985fa014029eb6795664c704953720cc7f7d

commit a5e3985fa014029eb6795664c704953720cc7f7d
Author: Roman Zippel
Date: Tue Sep 6 15:18:47 2005 -0700

[PATCH] hfs: remove debug code

Signed-off-by: Hin-Tak Leung
Signed-off-by: Sergei Antonov
Reviewed-by: Anton Altaparmakov
Reported-by: Sasha Levin
Cc: Al Viro
Cc: Christoph Hellwig
Cc: Vyacheslav Dubeyko
Cc: Sougata Santra
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hin-Tak Leung
2015-09-11 04:29:01 +0800
3725e9dd5 fs/coda: fix readlink buffer overflow ... Browse Code »

Dan Carpenter discovered a buffer overflow in the Coda file system
readlink code. A userspace file system daemon can return a 4096 byte
result which then triggers a one byte write past the allocated readlink
result buffer.

This does not trigger with an unmodified Coda implementation because Coda
has a 1024 byte limit for symbolic links, however other userspace file
systems using the Coda kernel module could be affected.

Although this is an obvious overflow, I don't think this has to be handled
as too sensitive from a security perspective because the overflow is on
the Coda userspace daemon side which already needs root to open Coda's
kernel device and to mount the file system before we get to the point that
links can be read.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Jan Harkes
Reported-by: Dan Carpenter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jan Harkes
2015-09-11 04:29:01 +0800
c5595fa2f checkpatch: add constant comparison on left side test ... Browse Code »

"CONST variable" checks like:

if (NULL != foo)
and
while (0 < bar(...))

where a constant (or what appears to be a constant like an upper case
identifier) is on the left of a comparison are generally preferred to be
written using the constant on the right side like:

if (foo != NULL)
and
while (bar(...) > 0)

Add a test for this.

Add a --fix option too, but only do it when the code is immediately
surrounded by parentheses to avoid misfixing things like "(0 < bar() +
constant)"

Signed-off-by: Joe Perches
Cc: Nicolas Morey Chaisemartin
Cc: Viresh Kumar
Cc: Dan Carpenter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joe Perches
2015-09-11 04:29:01 +0800
54507b518 checkpatch: add __pmem to $Sparse annotations ... Browse Code »

commit 61031952f4c8 ("arch, x86: pmem api for ensuring durability of
persistent memory updates") added a new __pmem annotation for sparse
verification. Add __pmem to the $Sparse variable so checkpatch can
appropriately ignore uses of this attribute too.

Signed-off-by: Joe Perches
Reviewed-by: Ross Zwisler
Acked-by: Andy Whitcroft
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joe Perches
2015-09-11 04:29:01 +0800
4e5d56bdf checkpatch: fix left brace warning ... Browse Code »

Using checkpatch.pl with Perl 5.22.0 generates the following warning:

Unescaped left brace in regex is deprecated, passed through in regex;

This patch fixes the warnings by escaping occurrences of the left brace
inside the regular expression.

Signed-off-by: Eddie Kovsky
Cc: Joe Perches
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eddie Kovsky
2015-09-11 04:29:01 +0800
bf4daf12a checkpatch: avoid some commit message long line warnings ... Browse Code »

Fixes: and Link: lines may exceed 75 chars in the commit log.

So too can stack dump and dmesg lines and lines that seem
like filenames.

And Fixes: lines don't need to have a "commit" prefix before the
commit id.

Add exceptions for these types of lines.

Signed-off-by: Joe Perches
Reported-by: Paul Bolle
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joe Perches
2015-09-11 04:29:01 +0800
6e3007574 checkpatch: emit an error on formats with 0x%<decimal> ... Browse Code »

Using 0x%d is wrong. Emit a message when it happens.

Miscellanea:

Improve the %Lu warning to match formats like %16Lu.

Signed-off-by: Joe Perches
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joe Perches
2015-09-11 04:29:01 +0800
7bd7e483c checkpatch: make --strict the default for drivers/staging files and patches ... Browse Code »

Making --strict the default for staging may help some people submit
patches without obvious defects.

Signed-off-by: Joe Perches
Cc: Dan Carpenter
Cc: Greg Kroah-Hartman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joe Perches
2015-09-11 04:29:01 +0800
86406b1cb checkpatch: always check block comment styles ... Browse Code »

Some of the block comment tests that are used only for networking are
appropriate for all patches.

For example, these styles are not encouraged:

/*
block comment without introductory *
*/
and
/*
* block comment with line terminating */

Remove the networking specific test and add comments.

There are some infrequent false positives where code is lazily
commented out using /* and */ rather than using #if 0/#endif blocks
like:
/* case foo:
case bar: */
case baz:

Signed-off-by: Joe Perches
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joe Perches
2015-09-11 04:29:01 +0800
7d3a9f673 checkpatch: report the right line # when using --emacs and --file ... Browse Code »

commit 34d8815f9512 ("checkpatch: add --showfile to allow input via pipe
to show filenames") broke the --emacs with --file option.

Fix it.

Signed-off-by: Joe Perches
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joe Perches
2015-09-11 04:29:01 +0800
100425dee checkpatch: add some <foo>_destroy functions to NEEDLESS_IF tests ... Browse Code »

Sergey Senozhatsky has modified several destroy functions that can
now be called with NULL values.

- kmem_cache_destroy()
- mempool_destroy()
- dma_pool_destroy()

Update checkpatch to warn when those functions are preceded by an if.

Update checkpatch to --fix all the calls too only when the code style
form is using leading tabs.

from:
if (foo)
(foo);
to:
(foo);

Signed-off-by: Joe Perches
Tested-by: Sergey Senozhatsky
Cc: David Rientjes
Cc: Julia Lawall
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joe Perches
2015-09-11 04:29:01 +0800
3e838b6c4 checkpatch: Allow longer declaration macros ... Browse Code »

Some really long declaration macros exist.

For instance;
DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
and
DECLARE_DM_KCOPYD_THROTTLE_WITH_MODULE_PARM(name, description)

Increase the limit from 2 words to 6 after DECLARE/DEFINE uses.

Signed-off-by: Joe Perches
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joe Perches
2015-09-11 04:29:01 +0800
9f5af480f checkpatch: improve SUSPECT_CODE_INDENT test ... Browse Code »

Many lines exist like

if (foo)
bar;

where the tabbed indentation of the branch is not one more than the "if"
line above it.

checkpatch should emit a warning on those lines.

Miscellenea:

o Remove comments from branch blocks
o Skip blank lines in block

Signed-off-by: Joe Perches
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joe Perches
2015-09-11 04:29:01 +0800
9d3e3c705 checkpatch: add warning on BUG/BUG_ON use ... Browse Code »

Using BUG/BUG_ON crashes the kernel and is just unfriendly.

Enable code that emits a warning on BUG/BUG_ON use.

Make the code emit the message at WARNING level when scanning a patch and
at CHECK level when scanning files so that script users don't feel an
obligation to fix code that might be above their pay grade.

Signed-off-by: Joe Perches
Reported-by: Geert Uytterhoeven
Tested-by: Geert Uytterhoeven
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joe Perches
2015-09-11 04:29:01 +0800