05 Sep, 2015

40 commits

  • Add the basic infrastructure for alloc/free operations on pointer arrays.
    It includes a generic function in the common slab code that this
    infrastructure patch uses to create the unoptimized fallback for slab
    bulk operations.

    Allocators can then provide optimized allocation functions for situations
    in which large numbers of objects are needed. These optimizations may
    avoid taking locks repeatedly and may bypass metadata creation if all
    objects in slab pages can be used to provide the objects required.

    Allocators can extend the skeletons provided and add their own code to the
    bulk alloc and free functions. They can keep the generic allocation and
    freeing and simply fall back to those when the optimizations would not
    work (for example, when debugging is enabled).
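
    As a rough illustration of the fallback shape described above, here is a
    minimal userspace sketch. All names here (cache_alloc_bulk and friends)
    are hypothetical stand-ins; the actual kernel functions are
    __kmem_cache_{alloc,free}_bulk and take gfp flags:

```c
#include <stdlib.h>
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical stand-in for a slab cache; illustrative only. */
struct kmem_cache { size_t object_size; };

static void *cache_alloc(struct kmem_cache *s) { return malloc(s->object_size); }
static void cache_free(struct kmem_cache *s, void *p) { (void)s; free(p); }

/*
 * Unoptimized fallback: allocate one object at a time.  On failure,
 * free everything allocated so far, so callers see all-or-nothing
 * semantics.
 */
bool cache_alloc_bulk(struct kmem_cache *s, size_t nr, void **objs)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		objs[i] = cache_alloc(s);
		if (!objs[i]) {
			while (i--)
				cache_free(s, objs[i]);
			return false;
		}
	}
	return true;
}

/* Unoptimized fallback: free one object at a time. */
void cache_free_bulk(struct kmem_cache *s, size_t nr, void **objs)
{
	for (size_t i = 0; i < nr; i++)
		cache_free(s, objs[i]);
}
```

    An optimized allocator would override these with a version that grabs a
    whole slab page under one lock acquisition instead of looping.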

    Signed-off-by: Christoph Lameter
    Signed-off-by: Jesper Dangaard Brouer
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • With this patchset the SLUB allocator now has both bulk alloc and free
    implemented.

    This patchset mostly optimizes the "fastpath", where objects are available
    on the per-CPU fastpath page. It mostly amortizes the cost of the lighter,
    non-locked cmpxchg_double used on the fastpath.

    The "fallback" bulking (e.g. __kmem_cache_free_bulk) provides a good basis
    for comparison. For the measurements[1], the fallback functions
    __kmem_cache_{free,alloc}_bulk were copied from slab_common.c and marked
    "noinline" to force a function call, as in slab_common.c.

    Measurements on CPU i7-4790K @ 4.00GHz
    Baseline normal fastpath (alloc+free cost): 42 cycles(tsc) 10.601 ns

    Measurements last-patch with disabled debugging:

    Bulk- fallback - this-patch
    1 - 57 cycles(tsc) 14.448 ns - 44 cycles(tsc) 11.236 ns improved 22.8%
    2 - 51 cycles(tsc) 12.768 ns - 28 cycles(tsc) 7.019 ns improved 45.1%
    3 - 48 cycles(tsc) 12.232 ns - 22 cycles(tsc) 5.526 ns improved 54.2%
    4 - 48 cycles(tsc) 12.025 ns - 19 cycles(tsc) 4.786 ns improved 60.4%
    8 - 46 cycles(tsc) 11.558 ns - 18 cycles(tsc) 4.572 ns improved 60.9%
    16 - 45 cycles(tsc) 11.458 ns - 18 cycles(tsc) 4.658 ns improved 60.0%
    30 - 45 cycles(tsc) 11.499 ns - 18 cycles(tsc) 4.568 ns improved 60.0%
    32 - 79 cycles(tsc) 19.917 ns - 65 cycles(tsc) 16.454 ns improved 17.7%
    34 - 78 cycles(tsc) 19.655 ns - 63 cycles(tsc) 15.932 ns improved 19.2%
    48 - 68 cycles(tsc) 17.049 ns - 50 cycles(tsc) 12.506 ns improved 26.5%
    64 - 80 cycles(tsc) 20.009 ns - 63 cycles(tsc) 15.929 ns improved 21.3%
    128 - 94 cycles(tsc) 23.749 ns - 86 cycles(tsc) 21.583 ns improved 8.5%
    158 - 97 cycles(tsc) 24.299 ns - 90 cycles(tsc) 22.552 ns improved 7.2%
    250 - 102 cycles(tsc) 25.681 ns - 98 cycles(tsc) 24.589 ns improved 3.9%

    Benchmarking shows impressive improvements in the "fastpath" with a small
    number of objects in the working set. Once the working set grows enough
    to activate the "slowpath" (which contains the heavier, locked
    cmpxchg_double), the improvement decreases.

    I'm currently working on also optimizing the "slowpath" (as the network
    stack use-case hits it), but this patchset should provide a good
    foundation for further improvements. The rest of my patch queue in this
    area needs some more work, but preliminary results are good. I'm
    attending the Netfilter Workshop[2] next week, and afterwards I'll
    hopefully return to working on further improvements in this area.

    This patch (of 6):

    s/succedd/succeed/

    Signed-off-by: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     
  • Rename watchdog_suspend() to lockup_detector_suspend() and
    watchdog_resume() to lockup_detector_resume() to avoid confusion with the
    watchdog subsystem and to be consistent with the existing name
    lockup_detector_init().

    Also provide comment blocks to explain the watchdog_running and
    watchdog_suspended variables and their relationship.

    Signed-off-by: Ulrich Obergfell
    Reviewed-by: Aaron Tomlin
    Cc: Guenter Roeck
    Cc: Don Zickus
    Cc: Ulrich Obergfell
    Cc: Jiri Olsa
    Cc: Michal Hocko
    Cc: Stephane Eranian
    Cc: Chris Metcalf
    Cc: Frederic Weisbecker
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Obergfell
     
  • Remove watchdog_nmi_disable_all() and watchdog_nmi_enable_all() since
    these functions are no longer needed. If a subsystem has a need to
    deactivate the watchdog temporarily, it should utilize the
    watchdog_suspend() and watchdog_resume() functions.

    [akpm@linux-foundation.org: fix build with CONFIG_LOCKUP_DETECTOR=m]
    Signed-off-by: Ulrich Obergfell
    Reviewed-by: Aaron Tomlin
    Cc: Guenter Roeck
    Cc: Don Zickus
    Cc: Ulrich Obergfell
    Cc: Jiri Olsa
    Cc: Michal Hocko
    Cc: Stephane Eranian
    Cc: Chris Metcalf
    Cc: Frederic Weisbecker
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Obergfell
     
  • Remove update_watchdog() and restart_watchdog_hrtimer() since these
    functions are no longer needed. Changes of parameters such as the sample
    period are honored at the time when the watchdog threads are being
    unparked.

    Signed-off-by: Ulrich Obergfell
    Reviewed-by: Aaron Tomlin
    Cc: Guenter Roeck
    Cc: Don Zickus
    Cc: Ulrich Obergfell
    Cc: Jiri Olsa
    Cc: Michal Hocko
    Cc: Stephane Eranian
    Cc: Chris Metcalf
    Cc: Frederic Weisbecker
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Obergfell
     
  • This interface can be used to deactivate the hard and soft lockup
    detectors temporarily. Callers are expected to minimize the duration of
    deactivation. Multiple deactivations are allowed to occur in parallel,
    but should be rare in practice.
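
    The "multiple deactivations in parallel" requirement suggests a
    refcounted suspend/resume. Here is a minimal userspace sketch of that
    idea (illustrative names; the kernel versions park and unpark the
    watchdog threads instead of flipping a flag):

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t detector_lock = PTHREAD_MUTEX_INITIALIZER;
static int suspend_count;            /* plays the role of watchdog_suspended */
static bool detector_active = true;

/* First suspender turns the detector off; later ones just count. */
int detector_suspend(void)
{
	pthread_mutex_lock(&detector_lock);
	if (suspend_count++ == 0)
		detector_active = false;   /* park threads in the real code */
	pthread_mutex_unlock(&detector_lock);
	return 0;
}

/* The detector only comes back when the last suspender resumes. */
void detector_resume(void)
{
	pthread_mutex_lock(&detector_lock);
	if (--suspend_count == 0)
		detector_active = true;    /* unpark threads in the real code */
	pthread_mutex_unlock(&detector_lock);
}

bool detector_is_active(void)
{
	return detector_active;
}
```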

    [akpm@linux-foundation.org: remove unneeded static initialization]
    Signed-off-by: Ulrich Obergfell
    Reviewed-by: Aaron Tomlin
    Cc: Guenter Roeck
    Cc: Don Zickus
    Cc: Ulrich Obergfell
    Cc: Jiri Olsa
    Cc: Michal Hocko
    Cc: Stephane Eranian
    Cc: Chris Metcalf
    Cc: Frederic Weisbecker
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Obergfell
     
  • Originally watchdog_nmi_enable(cpu) and watchdog_nmi_disable(cpu) were
    only called in watchdog thread context. However, the following commits
    utilize these functions outside of watchdog thread context too.

    commit 9809b18fcf6b8d8ec4d3643677345907e6b50eca
    Author: Michal Hocko
    Date: Tue Sep 24 15:27:30 2013 -0700

    watchdog: update watchdog_thresh properly

    commit b3738d29323344da3017a91010530cf3a58590fc
    Author: Stephane Eranian
    Date: Mon Nov 17 20:07:03 2014 +0100

    watchdog: Add watchdog enable/disable all functions

    Hence, it is now possible that these functions execute concurrently with
    the same 'cpu' argument. This concurrency is problematic because per-cpu
    'watchdog_ev' can be accessed/modified without adequate synchronization.

    The patch series aims to address the above problem. However, instead of
    introducing locks to protect per-cpu 'watchdog_ev' a different approach is
    taken: Invoke these functions by parking and unparking the watchdog
    threads (to ensure they are always called in watchdog thread context).

    static struct smp_hotplug_thread watchdog_threads = {
        ...
        .park   = watchdog_disable,  // calls watchdog_nmi_disable()
        .unpark = watchdog_enable,   // calls watchdog_nmi_enable()
    };

    Both previously mentioned commits call these functions in a similar way
    and thus in principle contain some duplicate code. The patch series also
    avoids this duplication by providing a commonly usable mechanism.

    - Patch 1/4 introduces the watchdog_{park|unpark}_threads functions that
    park/unpark all watchdog threads specified in 'watchdog_cpumask'. They
    are intended to be called inside of kernel/watchdog.c only.

    - Patch 2/4 introduces the watchdog_{suspend|resume} functions which can
    be utilized by external callers to deactivate the hard and soft lockup
    detector temporarily.

    - Patch 3/4 utilizes watchdog_{park|unpark}_threads to replace some code
    that was introduced by commit 9809b18fcf6b8d8ec4d3643677345907e6b50eca.

    - Patch 4/4 utilizes watchdog_{suspend|resume} to replace some code that
    was introduced by commit b3738d29323344da3017a91010530cf3a58590fc.

    A few corner cases should be mentioned here for completeness.

    - kthread_park() of watchdog/N could hang if cpu N is already locked up.
    However, if watchdog is enabled the lockup will be detected anyway.

    - kthread_unpark() of watchdog/N could hang if cpu N got locked up after
    kthread_park(). The occurrence of this scenario should be _very_ rare
    in practice, in particular because it is not expected that temporary
    deactivation will happen frequently, and if it happens at all it is
    expected that the duration of deactivation will be short.

    This patch (of 4): introduce watchdog_park_threads() and watchdog_unpark_threads()

    These functions are intended to be used only from inside kernel/watchdog.c
    to park/unpark all watchdog threads that are specified in
    watchdog_cpumask.

    Signed-off-by: Ulrich Obergfell
    Reviewed-by: Aaron Tomlin
    Cc: Guenter Roeck
    Cc: Don Zickus
    Cc: Ulrich Obergfell
    Cc: Jiri Olsa
    Cc: Michal Hocko
    Cc: Stephane Eranian
    Cc: Chris Metcalf
    Cc: Frederic Weisbecker
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Obergfell
     
  • The kernel's NMI watchdog has nothing to do with the watchdog subsystem.
    Its header declarations should be in linux/nmi.h, not linux/watchdog.h.

    The code provided two sets of dummy functions if HARDLOCKUP_DETECTOR is
    not configured, one in the include file and one in kernel/watchdog.c.
    Remove the dummy functions from kernel/watchdog.c and use those from the
    include file.

    Signed-off-by: Guenter Roeck
    Cc: Stephane Eranian
    Cc: Peter Zijlstra (Intel)
    Cc: Ingo Molnar
    Cc: Don Zickus
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Guenter Roeck
     
  • housekeeping_mask gathers all the CPUs that aren't part of the nohz_full
    set. This is exactly what we want the watchdog to be affine to without
    the need to use complicated cpumask operations.

    Signed-off-by: Frederic Weisbecker
    Reviewed-by: Chris Metcalf
    Cc: Thomas Gleixner
    Cc: Chris Metcalf
    Cc: Don Zickus
    Cc: Peter Zijlstra
    Cc: Ulrich Obergfell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Frederic Weisbecker
     
  • It makes registration cheaper and simpler for the smpboot per-cpu
    kthread users that don't always need to update the cpumask after thread
    creation.

    [sfr@canb.auug.org.au: fix for allow passing the cpumask on per-cpu thread registration]
    Signed-off-by: Frederic Weisbecker
    Reviewed-by: Chris Metcalf
    Reviewed-by: Thomas Gleixner
    Cc: Chris Metcalf
    Cc: Don Zickus
    Cc: Peter Zijlstra
    Cc: Ulrich Obergfell
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Frederic Weisbecker
     
  • The per-cpu kthread cleanup() callback is the mirror of the setup()
    callback. When the per-cpu kthread is started, it first calls setup()
    to initialize the resources which are then released by cleanup() when
    the kthread exits.

    Now since the introduction of a per-cpu kthread cpumask, the kthreads
    excluded by the cpumask on boot may happen to be parked immediately
    after their creation without taking the setup() stage, waiting to be
    asked to unpark to do so. Then when smpboot_unregister_percpu_thread()
    is later called, the kthread is stopped without having ever called
    setup().

    But this triggers a bug as the kthread unconditionally calls cleanup()
    on exit but this doesn't mirror any setup(). Thus the kernel crashes
    because we try to free resources that haven't been initialized, as in
    the watchdog case:

    WATCHDOG disable 0
    WATCHDOG disable 1
    WATCHDOG disable 2
    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: hrtimer_active+0x26/0x60
    [...]
    Call Trace:
    hrtimer_try_to_cancel+0x1c/0x280
    hrtimer_cancel+0x1d/0x30
    watchdog_disable+0x56/0x70
    watchdog_cleanup+0xe/0x10
    smpboot_thread_fn+0x23c/0x2c0
    kthread+0xf8/0x110
    ret_from_fork+0x3f/0x70

    This bug is currently masked with explicit kthread unparking before
    kthread_stop() on smpboot_destroy_threads(). This forces a call to
    setup() and then unpark().

    We could fix this by unconditionally calling setup() on kthread entry.
    But setup() isn't always cheap. In the watchdog case it launches an
    hrtimer, perf events, etc., so we would like to skip it when there is a
    chance the kthread will never be used, as with a reduced cpumask.

    So let's simply do a state machine check before calling cleanup() that
    makes sure setup() has been called before mirroring it.

    And remove the nasty hack workaround.
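
    The state machine check boils down to: remember whether setup() ran, and
    only mirror it with cleanup() on exit. A minimal sketch under assumed
    names (the state values echo the kernel's HP_THREAD_* states, but the
    structure and functions here are illustrative):

```c
#include <stdbool.h>

enum thread_state { HP_THREAD_NONE, HP_THREAD_ACTIVE, HP_THREAD_PARKED };

struct percpu_thread {
	enum thread_state state;
	int resources;               /* stands in for hrtimer, perf events, ... */
};

static void setup(struct percpu_thread *t)
{
	t->resources = 1;            /* acquire resources */
	t->state = HP_THREAD_ACTIVE;
}

static void cleanup(struct percpu_thread *t)
{
	t->resources = 0;            /* release resources acquired by setup() */
	t->state = HP_THREAD_NONE;
}

/* Called on kthread exit: skip cleanup() when setup() never ran,
 * which is exactly the case of a thread parked straight after creation. */
void thread_exit(struct percpu_thread *t)
{
	if (t->state != HP_THREAD_NONE)
		cleanup(t);
}
```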

    Signed-off-by: Frederic Weisbecker
    Reviewed-by: Chris Metcalf
    Reviewed-by: Thomas Gleixner
    Cc: Chris Metcalf
    Cc: Don Zickus
    Cc: Peter Zijlstra
    Cc: Ulrich Obergfell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Frederic Weisbecker
     
  • The cpumask is allocated before threads get created. If the latter step
    fails, we need to free the cpumask.
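
    The shape of the fix is the usual error-path rule: anything allocated
    before a fallible step must be released when that step fails. A small
    userspace sketch with made-up names:

```c
#include <stdlib.h>

struct cpumask { unsigned long bits; };

/* Pretend thread creation always fails, to exercise the error path. */
static int create_threads(void)
{
	return -1;
}

int register_percpu_thread(struct cpumask **maskp)
{
	struct cpumask *mask = calloc(1, sizeof(*mask));
	int ret;

	if (!mask)
		return -1;

	ret = create_threads();
	if (ret) {
		free(mask);          /* the fix: don't leak the cpumask */
		return ret;
	}
	*maskp = mask;
	return 0;
}
```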

    Signed-off-by: Frederic Weisbecker
    Reviewed-by: Chris Metcalf
    Reviewed-by: Thomas Gleixner
    Cc: Chris Metcalf
    Cc: Don Zickus
    Cc: Peter Zijlstra
    Cc: Ulrich Obergfell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Frederic Weisbecker
     
  • Many file systems that implement the show_options hook fail to correctly
    escape their output which could lead to unescaped characters (e.g. new
    lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files. This
    could lead to confusion, spoofed entries (resulting in things like
    systemd issuing false d-bus "mount" notifications), and who knows what
    else. This looks like it would only be the root user stepping on
    themselves, but it's possible weird things could happen in containers or
    in other situations with delegated mount privileges.

    Here's an example using overlay with setuid fusermount trusting the
    contents of /proc/mounts (via the /etc/mtab symlink). Imagine the use
    of "sudo" is something more sneaky:

    $ BASE="ovl"
    $ MNT="$BASE/mnt"
    $ LOW="$BASE/lower"
    $ UP="$BASE/upper"
    $ WORK="$BASE/work/ 0 0
    none /proc fuse.pwn user_id=1000"
    $ mkdir -p "$LOW" "$UP" "$WORK"
    $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
    $ cat /proc/mounts
    none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0
    none /proc fuse.pwn user_id=1000 0 0
    $ fusermount -u /proc
    $ cat /proc/mounts
    cat: /proc/mounts: No such file or directory

    This fixes the problem by adding new seq_show_option and
    seq_show_option_n helpers, and updating the vulnerable show_option
    handlers to use them as needed. Some, like SELinux, need to be open
    coded due to unusual existing escape mechanisms.
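
    The core of the fix is escaping characters that /proc/mounts parsers
    treat as delimiters. A simplified userspace sketch of the idea (this is
    an illustration, not the kernel's seq_show_option implementation, which
    writes into a seq_file):

```c
#include <stdio.h>
#include <string.h>

/*
 * Append ",name=value" to out, octal-escaping space, tab, newline and
 * backslash in the value, so a crafted value cannot forge extra fields
 * or extra lines in a /proc/mounts-style file.
 */
void show_option(char *out, size_t outsz, const char *name, const char *value)
{
	size_t n = snprintf(out, outsz, ",%s=", name);

	for (const char *p = value; *p && n + 5 < outsz; p++) {
		if (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\\')
			n += snprintf(out + n, outsz - n, "\\%03o",
				      (unsigned char)*p);
		else
			out[n++] = *p;
	}
	out[n] = '\0';
}
```

    With this in place, the overlay example above would print the workdir
    with "\040" and "\012" instead of raw spaces and newlines, and the fake
    "none /proc fuse.pwn" line could not appear.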

    [akpm@linux-foundation.org: add lost chunk, per Kees]
    [keescook@chromium.org: seq_show_option should be using const parameters]
    Signed-off-by: Kees Cook
    Acked-by: Serge Hallyn
    Acked-by: Jan Kara
    Acked-by: Paul Moore
    Cc: J. R. Okajima
    Signed-off-by: Kees Cook
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • A NULL check before kfree is redundant, so clean those up.
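
    For illustration, kfree(NULL), like free(NULL) in userspace C, is
    defined to be a no-op, so the guard adds nothing:

```c
#include <stdlib.h>

/* Before: redundant guard. */
void release_before(char *p)
{
	if (p)               /* unnecessary: free(NULL) is a no-op */
		free(p);
}

/* After: same behavior, less code. */
void release_after(char *p)
{
	free(p);             /* safe even when p == NULL */
}
```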

    Signed-off-by: Joseph Qi
    Reviewed-by: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • These uses sometimes do and sometimes don't have '\n' terminations. Make
    the uses consistently use '\n' terminations and remove the newline from
    the functions.

    Miscellanea:

    o Coalesce formats
    o Realign arguments

    Signed-off-by: Joe Perches
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • While appending an extent to a file, it will call these functions:
    ocfs2_insert_extent

    -> call ocfs2_grow_tree() if there's no free rec
    -> ocfs2_add_branch add a new branch to extent tree,
    now rec[0] in the leaf of rightmost path is empty
    -> ocfs2_do_insert_extent
    -> ocfs2_rotate_tree_right
    -> ocfs2_extend_rotate_transaction
    -> jbd2_journal_restart if jbd2_journal_extend fail
    -> ocfs2_insert_path
    -> ocfs2_extend_trans
    -> jbd2_journal_restart if jbd2_journal_extend fail
    -> ocfs2_insert_at_leaf
    -> ocfs2_et_update_clusters
    jbd2_journal_restart() may be called, and it may happen that buffers
    dirtied in ocfs2_add_branch() are committed while buffers dirtied in
    ocfs2_insert_at_leaf() and ocfs2_et_update_clusters() are not. So an
    empty rec[0] is left in the rightmost path, which causes a read-only
    filesystem when ocfs2_commit_truncate() is called, with the error
    message: "Inode %lu has an empty extent record".

    This is not a serious problem, so simply remove the rightmost path when
    ocfs2_commit_truncate() is called.

    Signed-off-by: joyce.xue
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xue jiufei
     
  • 1. After we call ocfs2_journal_access_di() in ocfs2_write_begin(),
    jbd2_journal_restart() may also be called. This function decrements
    transaction A's t_updates and obtains a new transaction B. If
    jbd2_journal_commit_transaction() happens to commit transaction A when
    t_updates==0, it will go on to complete the commit and unfile the
    buffer.

    So by the time jbd2_journal_dirty_metadata() runs, the handle points to
    the new transaction B, and the buffer head's journal head is already
    freed: jh->b_transaction == NULL, jh->b_next_transaction == NULL. It
    returns EINVAL, which triggers the BUG_ON(status).

    thread 1 jbd2
    ocfs2_write_begin jbd2_journal_commit_transaction
    ocfs2_write_begin_nolock
    ocfs2_start_trans
    jbd2__journal_start(t_updates+1,
    transaction A)
    ocfs2_journal_access_di
    ocfs2_write_cluster_by_desc
    ocfs2_mark_extent_written
    ocfs2_change_extent_flag
    ocfs2_split_extent
    ocfs2_extend_rotate_transaction
    jbd2_journal_restart
    (t_updates-1,transaction B) t_updates==0
    __jbd2_journal_refile_buffer
    (jh->b_transaction = NULL)
    ocfs2_write_end
    ocfs2_write_end_nolock
    ocfs2_journal_dirty
    jbd2_journal_dirty_metadata(bug)
    ocfs2_commit_trans

    2. In ext4, I found that: jbd2_journal_get_write_access() called by
    ext4_write_end.

    ext4_write_begin
    ext4_journal_start
    __ext4_journal_start_sb
    ext4_journal_check_start
    jbd2__journal_start

    ext4_write_end
    ext4_mark_inode_dirty
    ext4_reserve_inode_write
    ext4_journal_get_write_access
    jbd2_journal_get_write_access
    ext4_mark_iloc_dirty
    ext4_do_update_inode
    ext4_handle_dirty_metadata
    jbd2_journal_dirty_metadata

    3. So I think we should put ocfs2_journal_access_di() before
    ocfs2_journal_dirty() in ocfs2_write_end, and it works well after my
    modification.

    Signed-off-by: vicky
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Cc: Zhangguanghui
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    yangwenfang
     
  • o2hb_elapsed_msecs computes the time taken for a disk heartbeat.
    'struct timeval' variables are used to store start and end times. On
    32-bit systems, the 'tv_sec' component of 'struct timeval' will overflow
    in year 2038 and beyond.

    This patch solves the overflow as follows:

    1. Rewrite o2hb_elapsed_msecs to use 'ktime_t' values to record the
    start and end times, and the built-in function 'ktime_ms_delta' to
    compute the elapsed time. ktime_get_real() is used since the code
    prints out the wallclock time.

    2. Change the format string to print time as a single 64-bit
    milliseconds value ("%lld") instead of seconds and microseconds. This
    simplifies the code, since converting ktime_t to the old format would
    need expensive computation. However, the debug log string is less
    readable than the previous format.

    Signed-off-by: Tina Ruchandani
    Suggested-by: Arnd Bergmann
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tina Ruchandani
     
  • There is a race between a crashed dio and rm which can lead to reading
    an inode with OCFS2_VALID_FL cleared, and thus to a read-only
    filesystem.

    N1 N2
    ------------------------------------------------------------------------
    dd with direct flag
    rm file
    crashed with an dio entry left
    in orphan dir
    clear OCFS2_VALID_FL in
    ocfs2_remove_inode
    recover N1 and read the corrupted inode,
    and set filesystem read-only

    So we skip the inode deletion this time and wait for the dio entry to be
    recovered first.

    Signed-off-by: Joseph Qi
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • The following case can lead to a lockres being freed while it is still
    in use.

    cat /sys/kernel/debug/o2dlm/locking_state dlm_thread
    lockres_seq_start
    -> lock dlm->track_lock
    -> get resA
    resA->refs decrease to 0,
    call dlm_lockres_release,
    and wait for "cat" unlock.
    Although resA->refs is already set to 0,
    increase resA->refs, and then unlock
    lock dlm->track_lock
    -> list_del_init()
    -> unlock
    -> free resA

    In such a race, an invalid address access may occur. So we should delete
    res->tracking from the list before resA->refs decreases to 0.

    Signed-off-by: Yiwen Jiang
    Reviewed-by: Joseph Qi
    Cc: Joel Becker
    Signed-off-by: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yiwen Jiang
     
  • This bug in mainline code was pointed out by Mark Fasheh. When
    ocfs2_iop_set_acl() and ocfs2_iop_get_acl() are entered from the VFS
    layer, the inode lock is not held. This seems to be a regression from
    older kernels. This patch fixes that.

    Orabug: 20189959
    Signed-off-by: Tariq Saeed
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tariq Saeed
     
  • PID: 614 TASK: ffff882a739da580 CPU: 3 COMMAND: "ocfs2dc"
    #0 [ffff882ecc3759b0] machine_kexec at ffffffff8103b35d
    #1 [ffff882ecc375a20] crash_kexec at ffffffff810b95b5
    #2 [ffff882ecc375af0] oops_end at ffffffff815091d8
    #3 [ffff882ecc375b20] die at ffffffff8101868b
    #4 [ffff882ecc375b50] do_trap at ffffffff81508bb0
    #5 [ffff882ecc375ba0] do_invalid_op at ffffffff810165e5
    #6 [ffff882ecc375c40] invalid_op at ffffffff815116fb
    [exception RIP: ocfs2_ci_checkpointed+208]
    RIP: ffffffffa0a7e940 RSP: ffff882ecc375cf0 RFLAGS: 00010002
    RAX: 0000000000000001 RBX: 000000000000654b RCX: ffff8812dc83f1f8
    RDX: 00000000000017d9 RSI: ffff8812dc83f1f8 RDI: ffffffffa0b2c318
    RBP: ffff882ecc375d20 R8: ffff882ef6ecfa60 R9: ffff88301f272200
    R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffffffffff
    R13: ffff8812dc83f4f0 R14: 0000000000000000 R15: ffff8812dc83f1f8
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
    #7 [ffff882ecc375d28] ocfs2_check_meta_downconvert at ffffffffa0a7edbd [ocfs2]
    #8 [ffff882ecc375d38] ocfs2_unblock_lock at ffffffffa0a84af8 [ocfs2]
    #9 [ffff882ecc375dc8] ocfs2_process_blocked_lock at ffffffffa0a85285 [ocfs2]
    #10 [ffff882ecc375e18] ocfs2_downconvert_thread_do_work at ffffffffa0a85445 [ocfs2]
    #11 [ffff882ecc375e68] ocfs2_downconvert_thread at ffffffffa0a854de [ocfs2]
    #12 [ffff882ecc375ee8] kthread at ffffffff81090da7
    #13 [ffff882ecc375f48] kernel_thread_helper at ffffffff81511884
    The assert is tripped because the transaction is not checkpointed and
    the lock level is PR.

    Some time ago, a chmod command had been executed. As a result, the
    following call chain left the inode cluster lock in PR state, later on
    causing the assert.
    system_call_fastpath
    -> my_chmod
    -> sys_chmod
    -> sys_fchmodat
    -> notify_change
    -> ocfs2_setattr
    -> posix_acl_chmod
    -> ocfs2_iop_set_acl
    -> ocfs2_set_acl
    -> ocfs2_acl_set_mode
    Here is how.
    1119 int ocfs2_setattr(struct dentry *dentry, struct iattr *attr)
    1120 {
    1247 ocfs2_inode_unlock(inode, 1); <<< WRONG thing to do.
    ..
    1258 if (!status && attr->ia_valid & ATTR_MODE) {
    1259 status = posix_acl_chmod(inode, inode->i_mode);

    519 posix_acl_chmod(struct inode *inode, umode_t mode)
    520 {
    ..
    539 ret = inode->i_op->set_acl(inode, acl, ACL_TYPE_ACCESS);

    287 int ocfs2_iop_set_acl(struct inode *inode, struct posix_acl *acl, ...
    288 {
    289 return ocfs2_set_acl(NULL, inode, NULL, type, acl, NULL, NULL);

    224 int ocfs2_set_acl(handle_t *handle,
    225 struct inode *inode, ...
    231 {
    ..
    252 ret = ocfs2_acl_set_mode(inode, di_bh,
    253 handle, mode);

    168 static int ocfs2_acl_set_mode(struct inode *inode, struct buffer_head ...
    170 {
    183 if (handle == NULL) {
    >>> BUG: inode lock not held in ex at this point <<<
    184 handle = ocfs2_start_trans(OCFS2_SB(inode->i_sb),
    185 OCFS2_INODE_UPDATE_CREDITS);

    At ocfs2_setattr #1247 we unlock, and at #1259 we call posix_acl_chmod.
    When we reach ocfs2_acl_set_mode #184 and start the transaction, the
    inode cluster lock is not held in EX mode (it should be). How could this
    have happened?

    We are the lock master, were holding the lock in EX, and released it in
    ocfs2_setattr #1247. Note that there are no holders of this lock at that
    point. Another node needs the lock in PR, and we downconvert from EX to
    PR. So the inode lock is PR when we start the transaction in
    ocfs2_acl_set_mode #184. The transaction stays in core (not flushed to
    disk). Now another node wants the lock in EX; the downconvert thread
    gets kicked (the one that tripped the assert above) and finds an
    unflushed transaction, but the lock is not EX (it is PR). If the lock
    were at EX, it would have flushed the transaction (ocfs2_ci_checkpointed
    -> ocfs2_start_checkpoint) before downconverting (to NULL) for the
    request.

    ocfs2_setattr must not drop the inode lock EX in this code path. If it
    does and takes it again before the transaction, say in ocfs2_set_acl,
    another cluster node can get in between, execute another setattr,
    overwriting the one in progress on this node, resulting in a
    mode/acl/size combination that is a mix of the two.

    Orabug: 20189959
    Signed-off-by: Tariq Saeed
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tariq Saeed
     
  • Currently error handling in dlm_request_join is a little obscure, so
    optimize it to promote readability.

    If packet.code is invalid, reset it to JOIN_DISALLOW to keep it
    meaningful. It only influences the log printing.

    Signed-off-by: Norton.Zhu
    Cc: Srinivas Eeda
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Norton.Zhu
     
  • When running dirop_fileop_racer we found a case where an inode cannot
    be removed.

    Two nodes, say Node A and Node B, mount the same ocfs2 volume. Create
    two dirs /race/1/ and /race/2/ in the filesystem.

    Node A Node B
    rm -r /race/2/
    mv /race/1/ /race/2/
    call ocfs2_unlink(), get
    the EX mode of /race/2/
    wait for B unlock /race/2/
    decrease i_nlink of /race/2/ to 0,
    and add inode of /race/2/ into
    orphan dir, unlock /race/2/
    got EX mode of /race/2/. because
    /race/1/ is dir, so inc i_nlink
    of /race/2/ and update into disk,
    unlock /race/2/
    because i_nlink of /race/2/
    is not zero, this inode will
    always remain in orphan dir

    This patch fixes the case by testing whether i_nlink of the new dir is
    zero.

    Signed-off-by: Yiwen Jiang
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Cc: Joseph Qi
    Cc: Xue jiufei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yiwen Jiang
     
  • In ocfs2, ip_alloc_sem is used to protect allocation changes on the
    node. In direct I/O, we take ip_alloc_sem to keep data consistent in the
    race between direct I/O and ocfs2_truncate_file (buffered I/O already
    uses ip_alloc_sem). Although the inode->i_mutex lock avoids this
    concurrency, I think ip_alloc_sem is still needed, because protecting
    allocation changes is significant.

    Other filesystems such as ext4 also use an rw_semaphore to keep data
    consistent in the get_block-vs-truncate race, so ip_alloc_sem is needed
    in ocfs2 direct I/O.

    Signed-off-by: Weiwei Wang
    Signed-off-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    WeiWei Wang
     
  • In case a validation fails, clear the rest of the buffers and return the
    error to the calling function.

    This also facilitates bubbling up the error originating from ocfs2_error
    to calling functions.

    Signed-off-by: Goldwyn Rodrigues
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Goldwyn Rodrigues
     
  • Caveat: This may return -EROFS for a read case, which seems wrong. This
    is happening even without this patch series though. Should we convert
    EROFS to EIO?

    Signed-off-by: Goldwyn Rodrigues
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Goldwyn Rodrigues
     
  • OCFS2 is often used in high-availability systems. However, ocfs2
    converts the filesystem to read-only at the drop of a hat. This may not
    be necessary, since turning the filesystem read-only affects other
    running processes as well, decreasing availability.

    This attempt adds errors=continue, which returns EIO to the calling
    process and terminates further processing so that the filesystem is not
    corrupted further. However, the filesystem is not converted to
    read-only.

    As a future plan, I intend to create a small utility or extend
    fsck.ocfs2 to fix small errors such as those in the inode. The input to
    the utility, such as the inode, can come from the kernel logs, so we
    don't have to schedule downtime for fixing small-enough errors.

    The patch changes the ocfs2_error to return an error. The error
    returned depends on the mount option set. If none is set, the default
    is to turn the filesystem read-only.

    Perhaps errors=continue is not the best option name. Historically it is
    used for making an attempt to progress in the current process itself.
    Should we call it errors=eio? or errors=killproc? Suggestions/Comments
    welcome.

    Sources are available at:
    https://github.com/goldwynr/linux/tree/error-cont

    Signed-off-by: Goldwyn Rodrigues
    Signed-off-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Goldwyn Rodrigues
     
  • Disk inode deletion may be heavily delayed when one node unlinks a
    file after the same dentry has been freed on another node (say N1)
    by memory shrinking, while the inode is left in memory. This inode
    can only be freed while N1 is doing the orphan scan work.

    However, N1 may skip the orphan scan several times because other
    nodes may do the work earlier. In our tests it may take 1 hour on a
    4-node cluster, which hurts the user experience. So we think the
    inode should be freed after the data is flushed to disk when i_count
    becomes zero, to avoid such circumstances.

    Signed-off-by: Joyce.xue
    Cc: Joel Becker
    Reviewed-by: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xue jiufei
     
  • The trusted extended attributes are only visible to processes which
    have the CAP_SYS_ADMIN capability, but the check is missing in the
    ocfs2 xattr_handler trusted list. The check is important because
    this namespace will be used for implementing mechanisms in userspace
    to which ordinary processes should not have access.

    Signed-off-by: Sanidhya Kashyap
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Cc: Taesoo kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sanidhya Kashyap
     
  • In ocfs2_rename, if ocfs2_delete_entry(old) fails, it will leave an
    inode with two entries (old and new). Thus, the filesystem will be
    inconsistent.

    The case is described below:

    ocfs2_rename
    -> ocfs2_start_trans
    -> ocfs2_add_entry(new)
    -> ocfs2_delete_entry(old)
    -> __ocfs2_journal_access *failed* because of -ENOMEM
    -> ocfs2_commit_trans

    So the filesystem should be set read-only at this point.

    Signed-off-by: Yiwen Jiang
    Cc: Joseph Qi
    Cc: Joel Becker
    Reviewed-by: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    jiangyiwen
     
  • Use list_for_each_entry instead of list_for_each to simplify code.

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • The last goto statement is unneeded, so remove it.

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • In dlm_register_domain_handlers, if o2hb_register_callback fails, it
    will call dlm_unregister_domain_handlers to unregister. This will
    trigger the BUG_ON in o2hb_unregister_callback because hc_magic is 0.
    So we should call o2hb_setup_callback to initialize hc first.

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • status is already initialized and will only be 0 or negative in the
    code flow, so remove the unneeded assignment after the label 'local'.

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • The unlocking order in ocfs2_unlink and ocfs2_rename mismatches the
    corresponding locking order. Although this won't cause issues,
    adjust the code so that it looks more reasonable.

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • Since commit 86b9c6f3f891 ("ocfs2: remove filesize checks for sync I/O
    journal commit") removes filesize checks for sync I/O journal commit,
    variables old_size and old_clusters are not actually used any more. So
    clean them up.

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • 'o2hb_map_slot_data' and 'o2hb_populate_slot_data' are called from
    only one place, in 'o2hb_region_dev_write'. Their return value is
    checked there, and 'mlog_errno' is called to log a message if it is
    not 0.

    So there is no need to call 'mlog_errno' directly within these
    functions; doing so would result in logging the message twice.

    Signed-off-by: Christophe JAILLET
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christophe JAILLET
     
  • When the storage network is unstable, it may trigger the BUG in
    __ocfs2_journal_access because the buffer is not uptodate. We can
    retry the write in this case, or return an error instead of
    triggering the BUG.

    Signed-off-by: Joseph Qi
    Reported-by: Zhangguanghui
    Tested-by: Zhangguanghui
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • 1) Take rw EX lock in case of append dio.
    2) Explicitly treat the error code -EIOCBQUEUED as normal.
    3) Set di_bh to NULL after brelse if it may be used again later.

    Signed-off-by: Joseph Qi
    Cc: Yiwen Jiang
    Cc: Weiwei Wang
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi