Eric Lee / smarc-fsl-linux-kernel

25 Mar, 2020

2 commits

f9f635c04 locks: reinstate locks_delete_block optimization ... Browse Code »

[ Upstream commit dcf23ac3e846ca0cf626c155a0e3fcbbcf4fae8a ]

There is measurable performance impact in some synthetic tests due to
commit 6d390e4b5d48 (locks: fix a potential use-after-free problem when
wakeup a waiter). Fix the race condition instead by clearing the
fl_blocker pointer after the wake_up, using explicit acquire/release
semantics.

This does mean that we can no longer use the clearing of fl_blocker as
the wait condition, so switch the waiters over to checking whether the
fl_blocked_member list_head is empty.

Reviewed-by: yangerkun
Reviewed-by: NeilBrown
Fixes: 6d390e4b5d48 (locks: fix a potential use-after-free problem when wakeup a waiter)
Signed-off-by: Jeff Layton
Signed-off-by: Linus Torvalds
Signed-off-by: Sasha Levin

Linus Torvalds
2020-03-25 15:25:41 +0800
384e15fc4 locks: fix a potential use-after-free problem when wakeup a waiter ... Browse Code »

[ Upstream commit 6d390e4b5d48ec03bb87e63cf0a2bff5f4e116da ]

'16306a61d3b7 ("fs/locks: always delete_block after waiting.")' add the
logic to check waiter->fl_blocker without blocked_lock_lock. And it will
trigger a UAF when we try to wakeup some waiter：

Thread 1 has create a write flock a on file, and now thread 2 try to
unlock and delete flock a, thread 3 try to add flock b on the same file.

Thread2 Thread3
flock syscall(create flock b)
...flock_lock_inode_wait
flock_lock_inode(will insert
our fl_blocked_member list
to flock a's fl_blocked_requests)
sleep
flock syscall(unlock)
...flock_lock_inode_wait
locks_delete_lock_ctx
...__locks_wake_up_blocks
__locks_delete_blocks(
b->fl_blocker = NULL)
...
break by a signal
locks_delete_block
b->fl_blocker == NULL &&
list_empty(&b->fl_blocked_requests)
success, return directly
locks_free_lock b
wake_up(&b->fl_waiter)
trigger UAF

Fix it by remove this logic, and this patch may also fix CVE-2019-19769.

Cc: stable@vger.kernel.org
Fixes: 16306a61d3b7 ("fs/locks: always delete_block after waiting.")
Signed-off-by: yangerkun
Signed-off-by: Jeff Layton
Signed-off-by: Sasha Levin

yangerkun
2020-03-25 15:25:41 +0800

09 Jan, 2020

1 commit

72893303a locks: print unsigned ino in /proc/locks ... Browse Code »

commit 98ca480a8f22fdbd768e3dad07024c8d4856576c upstream.

An ino is unsigned, so display it as such in /proc/locks.

Cc: stable@vger.kernel.org
Signed-off-by: Amir Goldstein
Signed-off-by: Jeff Layton
Signed-off-by: Greg Kroah-Hartman

Amir Goldstein
2020-01-09 17:19:57 +0800

28 Sep, 2019

1 commit

298fb76a5 Merge tag 'nfsd-5.4' of git://linux-nfs.org/~bfields/linux ... Browse Code »

Pull nfsd updates from Bruce Fields:
"Highlights:

- Add a new knfsd file cache, so that we don't have to open and close
on each (NFSv2/v3) READ or WRITE. This can speed up read and write
in some cases. It also replaces our readahead cache.

- Prevent silent data loss on write errors, by treating write errors
like server reboots for the purposes of write caching, thus forcing
clients to resend their writes.

- Tweak the code that allocates sessions to be more forgiving, so
that NFSv4.1 mounts are less likely to hang when a server already
has a lot of clients.

- Eliminate an arbitrary limit on NFSv4 ACL sizes; they should now be
limited only by the backend filesystem and the maximum RPC size.

- Allow the server to enforce use of the correct kerberos credentials
when a client reclaims state after a reboot.

And some miscellaneous smaller bugfixes and cleanup"

* tag 'nfsd-5.4' of git://linux-nfs.org/~bfields/linux: (34 commits)
sunrpc: clean up indentation issue
nfsd: fix nfs read eof detection
nfsd: Make nfsd_reset_boot_verifier_locked static
nfsd: degraded slot-count more gracefully as allocation nears exhaustion.
nfsd: handle drc over-allocation gracefully.
nfsd: add support for upcall version 2
nfsd: add a "GetVersion" upcall for nfsdcld
nfsd: Reset the boot verifier on all write I/O errors
nfsd: Don't garbage collect files that might contain write errors
nfsd: Support the server resetting the boot verifier
nfsd: nfsd_file cache entries should be per net namespace
nfsd: eliminate an unnecessary acl size limit
Deprecate nfsd fault injection
nfsd: remove duplicated include from filecache.c
nfsd: Fix the documentation for svcxdr_tmpalloc()
nfsd: Fix up some unused variable warnings
nfsd: close cached files prior to a REMOVE or RENAME that would replace target
nfsd: rip out the raparms cache
nfsd: have nfsd_test_lock use the nfsd_file cache
nfsd: hook up nfs4_preprocess_stateid_op to the nfsd_file cache
...

Linus Torvalds
2019-09-28 08:00:27 +0800

20 Aug, 2019

1 commit

cfddf9f4c locks: fix a memory leak bug in __break_lease() ... Browse Code »

In __break_lease(), the file lock 'new_fl' is allocated in lease_alloc().
However, it is not deallocated in the following execution if
smp_load_acquire() fails, leading to a memory leak bug. To fix this issue,
free 'new_fl' before returning the error.

Signed-off-by: Wenwen Wang
Signed-off-by: Jeff Layton

Wenwen Wang
2019-08-20 17:48:52 +0800

19 Aug, 2019

2 commits

eb82dd393 nfsd: convert fi_deleg_file and ls_file fields to nfsd_file ... Browse Code »

Have them keep an nfsd_file reference instead of a struct file.

Signed-off-by: Jeff Layton
Signed-off-by: Trond Myklebust
Signed-off-by: Trond Myklebust
Signed-off-by: J. Bruce Fields

Jeff Layton
2019-08-19 23:09:09 +0800
18f6622eb locks: create a new notifier chain for lease attempts ... Browse Code »

With the new file caching infrastructure in nfsd, we can end up holding
files open for an indefinite period of time, even when they are still
idle. This may prevent the kernel from handing out leases on the file,
which is something we don't want to block.

Fix this by running a SRCU notifier call chain whenever on any
lease attempt. nfsd can then purge the cache for that inode before
returning.

Since SRCU is only conditionally compiled in, we must only define the
new chain if it's enabled, and users of the chain must ensure that
SRCU is enabled.

Signed-off-by: Jeff Layton
Signed-off-by: Trond Myklebust
Signed-off-by: Trond Myklebust
Signed-off-by: J. Bruce Fields

Jeff Layton
2019-08-19 23:00:39 +0800

25 Jul, 2019

1 commit

43e4cb942 locks: Fix procfs output for file leases ... Browse Code »

Since commit 778fc546f749c588aa2f ("locks: fix tracking of inprogress
lease breaks"), leases break don't change @fl_type but modifies
@fl_flags. However, procfs's part haven't been updated.

Previously, for a breaking lease the target type was printed (see
target_leasetype()), as returns fcntl(F_GETLEASE). But now it's always
"READ", as F_UNLCK no longer means "breaking". Unlike the previous
one, this behaviour don't provide a complete description of the lease.

There are /proc/pid/fdinfo/ outputs for a lease (the same for READ and
WRITE) breaked by O_WRONLY.
-- before:
lock: 1: LEASE BREAKING READ 2558 08:03:815793 0 EOF
-- after:
lock: 1: LEASE BREAKING UNLCK 2558 08:03:815793 0 EOF

Signed-off-by: Pavel Begunkov
Signed-off-by: Jeff Layton

Pavel Begunkov
2019-07-25 19:49:44 +0800

11 Jul, 2019

1 commit

d2b6b4c83 Merge tag 'nfsd-5.3' of git://linux-nfs.org/~bfields/linux ... Browse Code »

Pull nfsd updates from Bruce Fields:
"Highlights:

- Add a new /proc/fs/nfsd/clients/ directory which exposes some
long-requested information about NFSv4 clients (like open files)
and allows forced revocation of client state.

- Replace the global duplicate reply cache by a cache per network
namespace; previously, a request in one network namespace could
incorrectly match an entry from another, though we haven't seen
this in production. This is the last remaining container bug that
I'm aware of; at this point you should be able to run separate
nfsd's in each network namespace, each with their own set of
exports, and everything should work.

- Cleanup and modify lock code to show the pid of lockd as the owner
of NLM locks. This is the correct version of the bugfix originally
attempted in b8eee0e90f97 ("lockd: Show pid of lockd for remote
locks")"

* tag 'nfsd-5.3' of git://linux-nfs.org/~bfields/linux: (34 commits)
nfsd: Make __get_nfsdfs_client() static
nfsd: Make two functions static
nfsd: Fix misuse of strlcpy
sunrpc/cache: remove the exporting of cache_seq_next
nfsd: decode implementation id
nfsd: create xdr_netobj_dup helper
nfsd: allow forced expiration of NFSv4 clients
nfsd: create get_nfsdfs_clp helper
nfsd4: show layout stateids
nfsd: show lock and deleg stateids
nfsd4: add file to display list of client's opens
nfsd: add more information to client info file
nfsd: escape high characters in binary data
nfsd: copy client's address including port number to cl_addr
nfsd4: add a client info file
nfsd: make client/ directory names small ints
nfsd: add nfsd/clients directory
nfsd4: use reference count to free client
nfsd: rename cl_refcount
nfsd: persist nfsd filesystem across mounts
...

Linus Torvalds
2019-07-11 12:22:43 +0800

04 Jul, 2019

1 commit

f85d93385 locks: Cleanup lm_compare_owner and lm_owner_key ... Browse Code »

After the update to use nlm_lockowners for the NLM server, there are no
more users of lm_compare_owner and lm_owner_key.

Signed-off-by: Benjamin Coddington
Signed-off-by: J. Bruce Fields

Benjamin Coddington
2019-07-04 05:52:09 +0800

19 Jun, 2019

2 commits

387e3746d locks: eliminate false positive conflicts for write lease ... Browse Code »

check_conflicting_open() is checking for existing fd's open for read or
for write before allowing to take a write lease. The check that was
implemented using i_count and d_count is an approximation that has
several false positives. For example, overlayfs since v4.19, takes an
extra reference on the dentry; An open with O_PATH takes a reference on
the dentry although the file cannot be read nor written.

Change the implementation to use i_readcount and i_writecount to
eliminate the false positive conflicts and allow a write lease to be
taken on an overlayfs file.

The change of behavior with existing fd's open with O_PATH is symmetric
w.r.t. current behavior of lease breakers - an open with O_PATH currently
does not break a write lease.

This increases the size of struct inode by 4 bytes on 32bit archs when
CONFIG_FILE_LOCKING is defined and CONFIG_IMA was not already
defined.

Signed-off-by: Amir Goldstein
Signed-off-by: Jeff Layton

Amir Goldstein
2019-06-19 20:49:38 +0800
d51f527f4 locks: Add trace_leases_conflict ... Browse Code »

Signed-off-by: Ira Weiny
Signed-off-by: Jeff Layton

Ira Weiny
2019-06-19 20:49:37 +0800

21 May, 2019

1 commit

457c89965 treewide: Add SPDX license identifier for missed files ... Browse Code »

Add SPDX license identifiers to all files which:

- Have no license information of any form

- Have EXPORT_.*_SYMBOL_GPL inside which was used in the
initial scan/conversion to ignore the file

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

GPL-2.0-only

Signed-off-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman

Thomas Gleixner
2019-05-21 16:50:45 +0800

16 May, 2019

1 commit

700a800a9 Merge tag 'nfsd-5.2' of git://linux-nfs.org/~bfields/linux ... Browse Code »

Pull nfsd updates from Bruce Fields:
"This consists mostly of nfsd container work:

Scott Mayhew revived an old api that communicates with a userspace
daemon to manage some on-disk state that's used to track clients
across server reboots. We've been using a usermode_helper upcall for
that, but it's tough to run those with the right namespaces, so a
daemon is much friendlier to container use cases.

Trond fixed nfsd's handling of user credentials in user namespaces. He
also contributed patches that allow containers to support different
sets of NFS protocol versions.

The only remaining container bug I'm aware of is that the NFS reply
cache is shared between all containers. If anyone's aware of other
gaps in our container support, let me know.

The rest of this is miscellaneous bugfixes"

* tag 'nfsd-5.2' of git://linux-nfs.org/~bfields/linux: (23 commits)
nfsd: update callback done processing
locks: move checks from locks_free_lock() to locks_release_private()
nfsd: fh_drop_write in nfsd_unlink
nfsd: allow fh_want_write to be called twice
nfsd: knfsd must use the container user namespace
SUNRPC: rsi_parse() should use the current user namespace
SUNRPC: Fix the server AUTH_UNIX userspace mappings
lockd: Pass the user cred from knfsd when starting the lockd server
SUNRPC: Temporary sockets should inherit the cred from their parent
SUNRPC: Cache the process user cred in the RPC server listener
nfsd: Allow containers to set supported nfs versions
nfsd: Add custom rpcbind callbacks for knfsd
SUNRPC: Allow further customisation of RPC program registration
SUNRPC: Clean up generic dispatcher code
SUNRPC: Add a callback to initialise server requests
SUNRPC/nfs: Fix return value for nfs4_callback_compound()
nfsd: handle legacy client tracking records sent by nfsdcld
nfsd: re-order client tracking method selection
nfsd: keep a tally of RECLAIM_COMPLETE operations when using nfsdcld
nfsd: un-deprecate nfsdcld
...

Linus Torvalds
2019-05-16 09:21:43 +0800

08 May, 2019

1 commit

b4b52b881 Merge tag 'Wimplicit-fallthrough-5.2-rc1' of git://git.kernel.org/pub/scm/linux/… ... Browse Code »

…kernel/git/gustavoars/linux

Pull Wimplicit-fallthrough updates from Gustavo A. R. Silva:
"Mark switch cases where we are expecting to fall through.

This is part of the ongoing efforts to enable -Wimplicit-fallthrough.

Most of them have been baking in linux-next for a whole development
cycle. And with Stephen Rothwell's help, we've had linux-next
nag-emails going out for newly introduced code that triggers
-Wimplicit-fallthrough to avoid gaining more of these cases while we
work to remove the ones that are already present.

We are getting close to completing this work. Currently, there are
only 32 of 2311 of these cases left to be addressed in linux-next. I'm
auditing every case; I take a look into the code and analyze it in
order to determine if I'm dealing with an actual bug or a false
positive, as explained here:

https://lore.kernel.org/lkml/c2fad584-1705-a5f2-d63c-824e9b96cf50@embeddedor.com/

While working on this, I've found and fixed the several missing
break/return bugs, some of them introduced more than 5 years ago.

Once this work is finished, we'll be able to universally enable
"-Wimplicit-fallthrough" to avoid any of these kinds of bugs from
entering the kernel again"

* tag 'Wimplicit-fallthrough-5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux: (27 commits)
memstick: mark expected switch fall-throughs
drm/nouveau/nvkm: mark expected switch fall-throughs
NFC: st21nfca: Fix fall-through warnings
NFC: pn533: mark expected switch fall-throughs
block: Mark expected switch fall-throughs
ASN.1: mark expected switch fall-through
lib/cmdline.c: mark expected switch fall-throughs
lib: zstd: Mark expected switch fall-throughs
scsi: sym53c8xx_2: sym_nvram: Mark expected switch fall-through
scsi: sym53c8xx_2: sym_hipd: mark expected switch fall-throughs
scsi: ppa: mark expected switch fall-through
scsi: osst: mark expected switch fall-throughs
scsi: lpfc: lpfc_scsi: Mark expected switch fall-throughs
scsi: lpfc: lpfc_nvme: Mark expected switch fall-through
scsi: lpfc: lpfc_nportdisc: Mark expected switch fall-through
scsi: lpfc: lpfc_hbadisc: Mark expected switch fall-throughs
scsi: lpfc: lpfc_els: Mark expected switch fall-throughs
scsi: lpfc: lpfc_ct: Mark expected switch fall-throughs
scsi: imm: mark expected switch fall-throughs
scsi: csiostor: csio_wr: mark expected switch fall-through
...

Linus Torvalds
2019-05-08 03:48:10 +0800

24 Apr, 2019

1 commit

5926459e7 locks: move checks from locks_free_lock() to locks_release_private() ... Browse Code »

Code that allocates locks using locks_alloc_lock() will free it
using locks_free_lock(), and will benefit from the BUG_ON()
consistency checks therein.

However some code (nfsd and lockd) allocate a lock embedded in
some other data structure, and so free the lock themselves after
calling locks_release_private(). This path does not benefit from
the consistency checks.

To help catch future errors, move the BUG_ON() checks to
locks_release_private() - which locks_free_lock() already calls.
This ensures that all users for locks will find out if the lock
isn't detached properly before being free.

Signed-off-by: NeilBrown
Reviewed-by: Jeff Layton
Signed-off-by: J. Bruce Fields

NeilBrown
2019-04-24 21:51:48 +0800

09 Apr, 2019

1 commit

0a4c92657 fs: mark expected switch fall-throughs ... Browse Code »

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

This patch fixes the following warnings:

fs/affs/affs.h:124:38: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/configfs/dir.c:1692:11: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/configfs/dir.c:1694:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ceph/file.c:249:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext4/hash.c:233:15: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext4/hash.c:246:15: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext2/inode.c:1237:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext2/inode.c:1244:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext4/indirect.c:1182:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext4/indirect.c:1188:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext4/indirect.c:1432:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext4/indirect.c:1440:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/f2fs/node.c:618:8: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/f2fs/node.c:620:8: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/btrfs/ref-verify.c:522:15: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/gfs2/bmap.c:711:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/gfs2/bmap.c:722:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/jffs2/fs.c:339:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/nfsd/nfs4proc.c:429:12: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ufs/util.h:62:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ufs/util.h:43:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/fcntl.c:770:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/seq_file.c:319:10: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/libfs.c:148:11: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/libfs.c:150:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/signalfd.c:178:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/locks.c:1473:16: warning: this statement may fall through [-Wimplicit-fallthrough=]

Warning level 3 was used: -Wimplicit-fallthrough=3

This patch is part of the ongoing efforts to enabling
-Wimplicit-fallthrough.

Reviewed-by: Kees Cook
Signed-off-by: Gustavo A. R. Silva

Gustavo A. R. Silva
2019-04-09 07:21:02 +0800

25 Mar, 2019

1 commit

945ab8f6d locks: wake any locks blocked on request before deadlock check ... Browse Code »

Andreas reported that he was seeing the tdbtorture test fail in some
cases with -EDEADLCK when it wasn't before. Some debugging showed that
deadlock detection was sometimes discovering the caller's lock request
itself in a dependency chain.

While we remove the request from the blocked_lock_hash prior to
reattempting to acquire it, any locks that are blocked on that request
will still be present in the hash and will still have their fl_blocker
pointer set to the current request.

This causes posix_locks_deadlock to find a deadlock dependency chain
when it shouldn't, as a lock request cannot block itself.

We are going to end up waking all of those blocked locks anyway when we
go to reinsert the request back into the blocked_lock_hash, so just do
it prior to checking for deadlocks. This ensures that any lock blocked
on the current request will no longer be part of any blocked request
chain.

URL: https://bugzilla.kernel.org/show_bug.cgi?id=202975
Fixes: 5946c4319ebb ("fs/locks: allow a lock request to block other requests.")
Cc: stable@vger.kernel.org
Reported-by: Andreas Schneider
Signed-off-by: Neil Brown
Signed-off-by: Jeff Layton

Jeff Layton
2019-03-25 20:36:24 +0800

28 Feb, 2019

1 commit

02e525b2a locking/percpu-rwsem: Remove preempt_disable variants ... Browse Code »

Effective revert commit:

87709e28dc7c ("fs/locks: Use percpu_down_read_preempt_disable()")

This is causing major pain for PREEMPT_RT.

Sebastian did a lot of lockperf runs on 2 and 4 node machines with all
preemption modes (PREEMPT=n should be an obvious NOP for this patch
and thus serves as a good control) and no results showed significance
over 2-sigma (the PREEMPT=n results were almost empty at 1-sigma).

Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Peter Zijlstra (Intel)
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Will Deacon
Signed-off-by: Ingo Molnar

Peter Zijlstra
2019-02-28 14:55:37 +0800

03 Jan, 2019

1 commit

bf77ae4c9 locks: fix error in locks_move_blocks() ... Browse Code »

After moving all requests from
fl->fl_blocked_requests
to
new->fl_blocked_requests

it is nonsensical to do anything to all the remaining elements, there
aren't any. This should do something to all the requests that have been
moved. For simplicity, it does it to all requests in the target list.

Setting "f->fl_blocker = new" to all members of new->fl_blocked_requests
is "obviously correct" as it preserves the invariant of the linkage
among requests.

Reported-by: syzbot+239d99847eb49ecb3899@syzkaller.appspotmail.com
Fixes: 5946c4319ebb ("fs/locks: allow a lock request to block other requests.")
Signed-off-by: NeilBrown
Signed-off-by: Jeff Layton

NeilBrown
2019-01-03 09:14:50 +0800

17 Dec, 2018

1 commit

052b8cfa4 locks: Use inode_is_open_for_write ... Browse Code »

Use the aptly named function rather than open coding it. No functional
changes.

Signed-off-by: Nikolay Borisov
Signed-off-by: Jeff Layton

Nikolay Borisov
2018-12-17 20:19:46 +0800

07 Dec, 2018

5 commits

7bbd1fc0e fs/locks: remove unnecessary white space. ... Browse Code »

- spaces before tabs,
- spaces at the end of lines,
- multiple blank lines,
- blank lines before EXPORT_SYMBOL,
can all go.

Signed-off-by: NeilBrown
Reviewed-by: J. Bruce Fields
Signed-off-by: Jeff Layton

NeilBrown
2018-12-07 19:51:00 +0800
cb03f94ff fs/locks: merge posix_unblock_lock() and locks_delete_block() ... Browse Code »

posix_unblock_lock() is not specific to posix locks, and behaves
nearly identically to locks_delete_block() - the former returning a
status while the later doesn't.

So discard posix_unblock_lock() and use locks_delete_block() instead,
after giving that function an appropriate return value.

Signed-off-by: NeilBrown
Reviewed-by: J. Bruce Fields
Signed-off-by: Jeff Layton

NeilBrown
2018-12-07 19:50:56 +0800
fd7732e03 fs/locks: create a tree of dependent requests. ... Browse Code »

When we find an existing lock which conflicts with a request,
and the request wants to wait, we currently add the request
to a list. When the lock is removed, the whole list is woken.
This can cause the thundering-herd problem.
To reduce the problem, we make use of the (new) fact that
a pending request can itself have a list of blocked requests.
When we find a conflict, we look through the existing blocked requests.
If any one of them blocks the new request, the new request is attached
below that request, otherwise it is added to the list of blocked
requests, which are now known to be mutually non-conflicting.

This way, when the lock is released, only a set of non-conflicting
locks will be woken, the rest can stay asleep.
If the lock request cannot be granted and the request needs to be
requeued, all the other requests it blocks will then be woken

To make this more concrete:

If you have a many-core machine, and have many threads all wanting to
briefly lock a give file (udev is known to do this), you can get quite
poor performance.

When one thread releases a lock, it wakes up all other threads that
are waiting (classic thundering-herd) - one will get the lock and the
others go to sleep.
When you have few cores, this is not very noticeable: by the time the
4th or 5th thread gets enough CPU time to try to claim the lock, the
earlier threads have claimed it, done what was needed, and released.
So with few cores, many of the threads don't end up contending.
With 50+ cores, lost of threads can get the CPU at the same time,
and the contention can easily be measured.

This patchset creates a tree of pending lock requests in which siblings
don't conflict and each lock request does conflict with its parent.
When a lock is released, only requests which don't conflict with each
other a woken.

Testing shows that lock-acquisitions-per-second is now fairly stable
even as the number of contending process goes to 1000. Without this
patch, locks-per-second drops off steeply after a few 10s of
processes.

There is a small cost to this extra complexity.
At 20 processes running a particular test on 72 cores, the lock
acquisitions per second drops from 1.8 million to 1.4 million with
this patch. For 100 processes, this patch still provides 1.4 million
while without this patch there are about 700,000.

Reported-and-tested-by: Martin Wilck
Signed-off-by: NeilBrown
Reviewed-by: J. Bruce Fields
Signed-off-by: Jeff Layton

NeilBrown
2018-12-07 19:49:24 +0800
c0e159089 fs/locks: change all *_conflict() functions to return bool. ... Browse Code »

posix_locks_conflict() and flock_locks_conflict() both return int.
leases_conflict() returns bool.

This inconsistency will cause problems for the next patch if not
fixed.

So change posix_locks_conflict() and flock_locks_conflict() to return
bool.
Also change the locks_conflict() helper.

And convert some
return (foo);
to
return foo;

Signed-off-by: NeilBrown
Reviewed-by: J. Bruce Fields
Signed-off-by: Jeff Layton

NeilBrown
2018-12-07 19:49:24 +0800
16306a61d fs/locks: always delete_block after waiting. ... Browse Code »

Now that requests can block other requests, we
need to be careful to always clean up those blocked
requests.
Any time that we wait for a request, we might have
other requests attached, and when we stop waiting,
we must clean them up.
If the lock was granted, the requests might have been
moved to the new lock, though when merged with a
pre-exiting lock, this might not happen.
In all cases we don't want blocked locks to remain
attached, so we remove them to be safe.

Signed-off-by: NeilBrown
Reviewed-by: J. Bruce Fields
Tested-by: syzbot+a4a3d526b4157113ec6a@syzkaller.appspotmail.com
Tested-by: kernel test robot
Signed-off-by: Jeff Layton

NeilBrown
2018-12-07 19:49:17 +0800

01 Dec, 2018

4 commits

5946c4319 fs/locks: allow a lock request to block other requests. ... Browse Code »

Currently, a lock can block pending requests, but all pending
requests are equal. If lots of pending requests are
mutually exclusive, this means they will all be woken up
and all but one will fail. This can hurt performance.

So we will allow pending requests to block other requests.
Only the first request will be woken, and it will wake the others.

This patch doesn't implement this fully, but prepares the way.

- It acknowledges that a request might be blocking other requests,
and when the request is converted to a lock, those blocked
requests are moved across.
- When a request is requeued or discarded, all blocked requests are
woken.
- When deadlock-detection looks for the lock which blocks a
given request, we follow the chain of ->fl_blocker all
the way to the top.

Tested-by: kernel test robot
Signed-off-by: NeilBrown
Reviewed-by: J. Bruce Fields
Signed-off-by: Jeff Layton

NeilBrown
2018-12-01 00:26:12 +0800
d6367d624 fs/locks: use properly initialized file_lock when unlocking. ... Browse Code »

Both locks_remove_posix() and locks_remove_flock() use a
struct file_lock without calling locks_init_lock() on it.
This means the various list_heads are not initialized, which
will become a problem with a later patch.

So change them both to initialize properly. For flock locks,
this involves using flock_make_lock(), and changing it to
allow a file_lock to be passed in, so memory allocation isn't
always needed.

Signed-off-by: NeilBrown
Reviewed-by: J. Bruce Fields
Signed-off-by: Jeff Layton

NeilBrown
2018-12-01 00:26:12 +0800
ad6bbd8b1 fs/locks: split out __locks_wake_up_blocks(). ... Browse Code »

This functionality will be useful in future patches, so
split it out from locks_wake_up_blocks().

Signed-off-by: NeilBrown
Reviewed-by: J. Bruce Fields
Signed-off-by: Jeff Layton

NeilBrown
2018-12-01 00:26:12 +0800
ada5c1da8 fs/locks: rename some lists and pointers. ... Browse Code »

struct file lock contains an 'fl_next' pointer which
is used to point to the lock that this request is blocked
waiting for. So rename it to fl_blocker.

The fl_blocked list_head in an active lock is the head of a list of
blocked requests. In a request it is a node in that list.
These are two distinct uses, so replace with two list_heads
with different names.
fl_blocked_requests is the head of a list of blocked requests
fl_blocked_member is a node in a member of that list.

The two different list_heads are never used at the same time, but that
will change in a future patch.

Note that a tracepoint is changed to report fl_blocker instead
of fl_next.

Signed-off-by: NeilBrown
Reviewed-by: J. Bruce Fields
Signed-off-by: Jeff Layton

NeilBrown
2018-12-01 00:26:12 +0800

22 Aug, 2018

2 commits

d9a185f8b Merge tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs ... Browse Code »

Pull overlayfs updates from Miklos Szeredi:
"This contains two new features:

- Stack file operations: this allows removal of several hacks from
the VFS, proper interaction of read-only open files with copy-up,
possibility to implement fs modifying ioctls properly, and others.

- Metadata only copy-up: when file is on lower layer and only
metadata is modified (except size) then only copy up the metadata
and continue to use the data from the lower file"

* tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (66 commits)
ovl: Enable metadata only feature
ovl: Do not do metacopy only for ioctl modifying file attr
ovl: Do not do metadata only copy-up for truncate operation
ovl: add helper to force data copy-up
ovl: Check redirect on index as well
ovl: Set redirect on upper inode when it is linked
ovl: Set redirect on metacopy files upon rename
ovl: Do not set dentry type ORIGIN for broken hardlinks
ovl: Add an inode flag OVL_CONST_INO
ovl: Treat metacopy dentries as type OVL_PATH_MERGE
ovl: Check redirects for metacopy files
ovl: Move some dir related ovl_lookup_single() code in else block
ovl: Do not expose metacopy only dentry from d_real()
ovl: Open file with data except for the case of fsync
ovl: Add helper ovl_inode_realdata()
ovl: Store lower data inode in ovl_inode
ovl: Fix ovl_getattr() to get number of blocks from lower
ovl: Add helper ovl_dentry_lowerdata() to get lower data dentry
ovl: Copy up meta inode data from lowest data inode
ovl: Modify ovl_lookup() and friends to lookup metacopy dentry
...

Linus Torvalds
2018-08-22 09:19:09 +0800
0214f46b3 Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/eb… ... Browse Code »

…iederm/user-namespace

Pull core signal handling updates from Eric Biederman:
"It was observed that a periodic timer in combination with a
sufficiently expensive fork could prevent fork from every completing.
This contains the changes to remove the need for that restart.

This set of changes is split into several parts:

- The first part makes PIDTYPE_TGID a proper pid type instead
something only for very special cases. The part starts using
PIDTYPE_TGID enough so that in __send_signal where signals are
actually delivered we know if the signal is being sent to a a group
of processes or just a single process.

- With that prep work out of the way the logic in fork is modified so
that fork logically makes signals received while it is running
appear to be received after the fork completes"

* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (22 commits)
signal: Don't send signals to tasks that don't exist
signal: Don't restart fork when signals come in.
fork: Have new threads join on-going signal group stops
fork: Skip setting TIF_SIGPENDING in ptrace_init_task
signal: Add calculate_sigpending()
fork: Unconditionally exit if a fatal signal is pending
fork: Move and describe why the code examines PIDNS_ADDING
signal: Push pid type down into complete_signal.
signal: Push pid type down into __send_signal
signal: Push pid type down into send_signal
signal: Pass pid type into do_send_sig_info
signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task
signal: Pass pid type into group_send_sig_info
signal: Pass pid and pid type into send_sigqueue
posix-timers: Noralize good_sigevent
signal: Use PIDTYPE_TGID to clearly store where file signals will be sent
pid: Implement PIDTYPE_TGID
pids: Move the pgrp and session pid pointers from task_struct to signal_struct
kvm: Don't open code task_pid in kvm_vcpu_ioctl
pids: Compute task_tgid using signal->leader_pid
...

Linus Torvalds
2018-08-22 04:47:29 +0800

14 Aug, 2018

1 commit

575b94386 Merge tag 'locks-v4.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux ... Browse Code »

Pull file locking updates from Jeff Layton:
"Just a couple of patches from Konstantin to fix /proc/locks when the
process that set the lock has exited, and a new tracepoint for the
flock() codepath. Also threw in mailmap entries for my addresses and a
comment cleanup"

* tag 'locks-v4.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
locks: remove misleading obsolete comment
mailmap: remap some of my email addresses to kernel.org address
locks: add tracepoint in flock codepath
fs/lock: show locks taken by processes from another pidns
fs/lock: skip lock owner pid translation in case we are in init_pid_ns

Linus Torvalds
2018-08-14 12:56:50 +0800

09 Aug, 2018

1 commit

da33a871b locks: remove misleading obsolete comment ... Browse Code »

The spinlock handling in this file has changed significantly since this
comment was written, and the file_lock_lock is no more. In addition,
this overall comment no longer applies. Deleting an entry now requires
both locks.

Signed-off-by: Jeff Layton

Jeff Layton
2018-08-09 00:59:06 +0800

07 Aug, 2018

1 commit

c883da313 locks: add tracepoint in flock codepath ... Browse Code »

Signed-off-by: Jeff Layton

Jeff Layton
2018-08-07 01:15:16 +0800

21 Jul, 2018

1 commit

019191342 signal: Use PIDTYPE_TGID to clearly store where file signals will be sent ... Browse Code »

When f_setown is called a pid and a pid type are stored. Replace the use
of PIDTYPE_PID with PIDTYPE_TGID as PIDTYPE_TGID goes to the entire thread
group. Replace the use of PIDTYPE_MAX with PIDTYPE_PID as PIDTYPE_PID now
is only for a thread.

Update the users of __f_setown to use PIDTYPE_TGID instead of
PIDTYPE_PID.

For now the code continues to capture task_pid (when task_tgid would
really be appropriate), and iterate on PIDTYPE_PID (even when type ==
PIDTYPE_TGID) out of an abundance of caution to preserve existing
behavior.

Oleg Nesterov suggested using the test to ensure we use PIDTYPE_PID
for tgid lookup also be used to avoid taking the tasklist lock.

Suggested-by: Oleg Nesterov
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2018-07-21 23:43:12 +0800

18 Jul, 2018

2 commits

de2a4a501 Partially revert "locks: fix file locking on overlayfs" ... Browse Code »

This partially reverts commit c568d68341be7030f5647def68851e469b21ca11.

Overlayfs files will now automatically get the correct locks, no need to
hack overlay support in VFS.

It is a partial revert, because it leaves the locks_inode() calls in place
and defines locks_inode() to file_inode(). We could revert those as well,
but it would be unnecessary code churn and it makes sense to document that
we are getting the inode for locking purposes.

Don't revert MS_NOREMOTELOCK yet since that has been part of the userspace
API for some time (though not in a useful way). Will try to remove
internal flags later when the dust around the new mount API settles.

Signed-off-by: Miklos Szeredi
Acked-by: Jeff Layton

Miklos Szeredi
2018-07-18 21:44:43 +0800
8cf9ee506 Revert "vfs: do get_write_access() on upper layer of overlayfs" ... Browse Code »

This reverts commit 4d0c5ba2ff79ef9f5188998b29fd28fcb05f3667.

We now get write access on both overlay and underlying layers so this patch
is no longer needed for correct operation.

Signed-off-by: Miklos Szeredi

Miklos Szeredi
2018-07-18 21:44:43 +0800

15 Jun, 2018

1 commit

7a932516f Merge tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground ... Browse Code »

Pull inode timestamps conversion to timespec64 from Arnd Bergmann:
"This is a late set of changes from Deepa Dinamani doing an automated
treewide conversion of the inode and iattr structures from 'timespec'
to 'timespec64', to push the conversion from the VFS layer into the
individual file systems.

As Deepa writes:

'The series aims to switch vfs timestamps to use struct timespec64.
Currently vfs uses struct timespec, which is not y2038 safe.

The series involves the following:
1. Add vfs helper functions for supporting struct timepec64
timestamps.
2. Cast prints of vfs timestamps to avoid warnings after the switch.
3. Simplify code using vfs timestamps so that the actual replacement
becomes easy.
4. Convert vfs timestamps to use struct timespec64 using a script.
This is a flag day patch.

Next steps:
1. Convert APIs that can handle timespec64, instead of converting
timestamps at the boundaries.
2. Update internal data structures to avoid timestamp conversions'

Thomas Gleixner adds:

'I think there is no point to drag that out for the next merge
window. The whole thing needs to be done in one go for the core
changes which means that you're going to play that catchup game
forever. Let's get over with it towards the end of the merge window'"

* tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground:
pstore: Remove bogus format string definition
vfs: change inode times to use struct timespec64
pstore: Convert internal records to timespec64
udf: Simplify calls to udf_disk_stamp_to_time
fs: nfs: get rid of memcpys for inode times
ceph: make inode time prints to be long long
lustre: Use long long type to print inode time
fs: add timespec64_truncate()

Linus Torvalds
2018-06-15 06:31:07 +0800

14 Jun, 2018

1 commit

1cf8e5de4 fs/lock: show locks taken by processes from another pidns ... Browse Code »

Currently if we face a lock taken by a process invisible in the current
pidns we skip the lock completely, but this

1) makes the output not that nice
(root@vz7)/: cat /proc/${PID_A2}/fdinfo/3
pos: 4
flags: 02100002
mnt_id: 257
lock: (root@vz7)/:

2) makes it more difficult to debug issues with leaked flocks
if you get error on lock, but don't see any locks in /proc/$id/fdinfo/$file

Let's show information about such locks again as previously, but
show zero in the owner pid field.

After the patch:
===============
(root@vz7)/:cat /proc/${PID_A2}/fdinfo/3
pos: 4
flags: 02100002
mnt_id: 295
lock: 1: FLOCK ADVISORY WRITE 0 b6:f8a61:529946 0 EOF

Fixes: 9d5b86ac13c5 ("fs/locks: Remove fl_nspid and use fs-specific l_pid for remote locks")
Signed-off-by: Konstantin Khorenko
Acked-by: Andrey Vagin
Reviewed-by: Benjamin Coddington
Signed-off-by: Jeff Layton

Konstantin Khorenko
2018-06-14 19:37:50 +0800