Eric Lee / smarc-fsl-linux-kernel

27 Oct, 2016

1 commit

2a9becdd4 kernfs: Add noop_fsync to supported kernfs_file_fops ... Browse Code »

If you edit a kernfs backed file with vi(1), you see an ugly error
message when you write the file because vi tries to fsync(2) the
file after writing, which fails.

We have noop_fsync() for this, use it.

Signed-off-by: Tony Luck
Signed-off-by: Greg Kroah-Hartman

Tony Luck
9 years ago

15 Oct, 2016

1 commit

f34d3606f Merge branch 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup ... Browse Code »

Pull cgroup updates from Tejun Heo:

- tracepoints for basic cgroup management operations added

- kernfs and cgroup path formatting functions updated to behave in the
style of strlcpy()

- non-critical bug fixes

* 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
blkcg: Unlock blkcg_pol_mutex only once when cpd == NULL
cgroup: fix error handling regressions in proc_cgroup_show() and cgroup_release_agent()
cpuset: fix error handling regression in proc_cpuset_show()
cgroup: add tracepoints for basic operations
cgroup: make cgroup_path() and friends behave in the style of strlcpy()
kernfs: remove kernfs_path_len()
kernfs: make kernfs_path*() behave in the style of strlcpy()
kernfs: add dummy implementation of kernfs_path_from_node()

Linus Torvalds
9 years ago

11 Oct, 2016

3 commits

101105b17 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull more vfs updates from Al Viro:
">rename2() work from Miklos + current_time() from Deepa"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: Replace current_fs_time() with current_time()
fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
fs: Replace CURRENT_TIME with current_time() for inode timestamps
fs: proc: Delete inode time initializations in proc_alloc_inode()
vfs: Add current_time() api
vfs: add note about i_op->rename changes to porting
fs: rename "rename2" i_op to "rename"
vfs: remove unused i_op->rename
fs: make remaining filesystems use .rename2
libfs: support RENAME_NOREPLACE in simple_rename()
fs: support RENAME_NOREPLACE for local filesystems
ncpfs: fix unused variable warning

Linus Torvalds
9 years ago
3873691e5 Merge remote-tracking branch 'ovl/rename2' into for-linus Browse Code »

Al Viro
9 years ago
97d211670 Merge branch 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull vfs xattr updates from Al Viro:
"xattr stuff from Andreas

This completes the switch to xattr_handler ->get()/->set() from
->getxattr/->setxattr/->removexattr"

* 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vfs: Remove {get,set,remove}xattr inode operations
xattr: Stop calling {get,set,remove}xattr inode operations
vfs: Check for the IOP_XATTR flag in listxattr
xattr: Add __vfs_{get,set,remove}xattr helpers
libfs: Use IOP_XATTR flag for empty directory handling
vfs: Use IOP_XATTR flag for bad-inode handling
vfs: Add IOP_XATTR inode operations flag
vfs: Move xattr_resolve_name to the front of fs/xattr.c
ecryptfs: Switch to generic xattr handlers
sockfs: Get rid of getxattr iop
sockfs: getxattr: Fail with -EOPNOTSUPP for invalid attribute names
kernfs: Switch to generic xattr handlers
hfs: Switch to generic xattr handlers
jffs2: Remove jffs2_{get,set,remove}xattr macros
xattr: Remove unnecessary NULL attribute name check

Linus Torvalds
9 years ago

08 Oct, 2016

2 commits

e55f1d1d1 Merge remote-tracking branch 'jk/vfs' into work.misc Browse Code »

Al Viro
9 years ago
fd50ecadd vfs: Remove {get,set,remove}xattr inode operations ... Browse Code »

These inode operations are no longer used; remove them.

Signed-off-by: Andreas Gruenbacher
Signed-off-by: Al Viro

Andreas Gruenbacher
9 years ago

07 Oct, 2016

1 commit

e72a1a8b3 kernfs: Switch to generic xattr handlers ... Browse Code »

Signed-off-by: Andreas Gruenbacher
Acked-by: Greg Kroah-Hartman
Acked-by: Tejun Heo
Signed-off-by: Al Viro

Andreas Gruenbacher
9 years ago

28 Sep, 2016

1 commit

c2050a454 fs: Replace current_fs_time() with current_time() ... Browse Code »

current_fs_time() uses struct super_block* as an argument.
As per Linus's suggestion, this is changed to take struct
inode* as a parameter instead. This is because the function
is primarily meant for vfs inode timestamps.
Also the function was renamed as per Arnd's suggestion.

Change all calls to current_fs_time() to use the new
current_time() function instead. current_fs_time() will be
deleted.

Signed-off-by: Deepa Dinamani
Signed-off-by: Al Viro

Deepa Dinamani
9 years ago

27 Sep, 2016

2 commits

2773bf00a fs: rename "rename2" i_op to "rename" ... Browse Code »

Generated patch:

sed -i "s/\.rename2\t/\.rename\t\t/" `git grep -wl rename2`
sed -i "s/\brename2\b/rename/g" `git grep -wl rename2`

Signed-off-by: Miklos Szeredi

Miklos Szeredi
9 years ago
1cd66c93b fs: make remaining filesystems use .rename2 ... Browse Code »

This is trivial to do:

- add flags argument to foo_rename()
- check if flags is zero
- assign foo_rename() to .rename2 instead of .rename

This doesn't mean it's impossible to support RENAME_NOREPLACE for these
filesystems, but it is not trivial, like for local filesystems.
RENAME_NOREPLACE must guarantee atomicity (i.e. it shouldn't be possible
for a file to be created on one host while it is overwritten by rename on
another host).

Filesystems converted:

9p, afs, ceph, coda, ecryptfs, kernfs, lustre, ncpfs, nfs, ocfs2, orangefs.

After this, we can get rid of the duplicate interfaces for rename.

Signed-off-by: Miklos Szeredi
Acked-by: Greg Kroah-Hartman
Acked-by: David Howells [AFS]
Acked-by: Mike Marshall
Cc: Eric Van Hensbergen
Cc: Ilya Dryomov
Cc: Jan Harkes
Cc: Tyler Hicks
Cc: Oleg Drokin
Cc: Trond Myklebust
Cc: Mark Fasheh

Miklos Szeredi
9 years ago

22 Sep, 2016

1 commit

31051c85b fs: Give dentry to inode_change_ok() instead of inode ... Browse Code »

inode_change_ok() will be resposible for clearing capabilities and IMA
extended attributes and as such will need dentry. Give it as an argument
to inode_change_ok() instead of an inode. Also rename inode_change_ok()
to setattr_prepare() to better relect that it does also some
modifications in addition to checks.

Reviewed-by: Christoph Hellwig
Signed-off-by: Jan Kara

Jan Kara
9 years ago

31 Aug, 2016

1 commit

df6a58c5c kernfs: don't depend on d_find_any_alias() when generating notifications ... Browse Code »

kernfs_notify_workfn() sends out file modified events for the
scheduled kernfs_nodes. Because the modifications aren't from
userland, it doesn't have the matching file struct at hand and can't
use fsnotify_modify(). Instead, it looked up the inode and then used
d_find_any_alias() to find the dentry and used fsnotify_parent() and
fsnotify() directly to generate notifications.

The assumption was that the relevant dentries would have been pinned
if there are listeners, which isn't true as inotify doesn't pin
dentries at all and watching the parent doesn't pin the child dentries
even for dnotify. This led to, for example, inotify watchers not
getting notifications if the system is under memory pressure and the
matching dentries got reclaimed. It can also be triggered through
/proc/sys/vm/drop_caches or a remount attempt which involves shrinking
dcache.

fsnotify_parent() only uses the dentry to access the parent inode,
which kernfs can do easily. Update kernfs_notify_workfn() so that it
uses fsnotify() directly for both the parent and target inodes without
going through d_find_any_alias(). While at it, supply the target file
name to fsnotify() from kernfs_node->name.

Signed-off-by: Tejun Heo
Reported-by: Evgeny Vereshchagin
Fixes: d911d9874801 ("kernfs: make kernfs_notify() trigger inotify events too")
Cc: John McCutchan
Cc: Robert Love
Cc: Eric Paris
Cc: stable@vger.kernel.org # v3.16+
Signed-off-by: Greg Kroah-Hartman

Tejun Heo
9 years ago

10 Aug, 2016

2 commits

bb09c8634 kernfs: remove kernfs_path_len() ... Browse Code »

It doesn't have any in-kernel user and the same result can be obtained
from kernfs_path(@kn, NULL, 0). Remove it.

Signed-off-by: Tejun Heo
Acked-by: Greg Kroah-Hartman
Cc: Serge Hallyn

Tejun Heo
9 years ago
3abb1d90f kernfs: make kernfs_path*() behave in the style of strlcpy() ... Browse Code »

kernfs_path*() functions always return the length of the full path but
the path content is undefined if the length is larger than the
provided buffer. This makes its behavior different from strlcpy() and
requires error handling in all its users even when they don't care
about truncation. In addition, the implementation can actully be
simplified by making it behave properly in strlcpy() style.

* Update kernfs_path_from_node_locked() to always fill up the buffer
with path. If the buffer is not large enough, the output is
truncated and terminated.

* kernfs_path() no longer needs error handling. Make it a simple
inline wrapper around kernfs_path_from_node().

* sysfs_warn_dup()'s use of kernfs_path() doesn't need error handling.
Updated accordingly.

* cgroup_path()'s use of kernfs_path() updated to retain the old
behavior.

Signed-off-by: Tejun Heo
Acked-by: Greg Kroah-Hartman
Acked-by: Serge Hallyn

Tejun Heo
9 years ago

30 Jul, 2016

1 commit

a867d7349 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace ... Browse Code »

Pull userns vfs updates from Eric Biederman:
"This tree contains some very long awaited work on generalizing the
user namespace support for mounting filesystems to include filesystems
with a backing store. The real world target is fuse but the goal is
to update the vfs to allow any filesystem to be supported. This
patchset is based on a lot of code review and testing to approach that
goal.

While looking at what is needed to support the fuse filesystem it
became clear that there were things like xattrs for security modules
that needed special treatment. That the resolution of those concerns
would not be fuse specific. That sorting out these general issues
made most sense at the generic level, where the right people could be
drawn into the conversation, and the issues could be solved for
everyone.

At a high level what this patchset does a couple of simple things:

- Add a user namespace owner (s_user_ns) to struct super_block.

- Teach the vfs to handle filesystem uids and gids not mapping into
to kuids and kgids and being reported as INVALID_UID and
INVALID_GID in vfs data structures.

By assigning a user namespace owner filesystems that are mounted with
only user namespace privilege can be detected. This allows security
modules and the like to know which mounts may not be trusted. This
also allows the set of uids and gids that are communicated to the
filesystem to be capped at the set of kuids and kgids that are in the
owning user namespace of the filesystem.

One of the crazier corner casees this handles is the case of inodes
whose i_uid or i_gid are not mapped into the vfs. Most of the code
simply doesn't care but it is easy to confuse the inode writeback path
so no operation that could cause an inode write-back is permitted for
such inodes (aka only reads are allowed).

This set of changes starts out by cleaning up the code paths involved
in user namespace permirted mounts. Then when things are clean enough
adds code that cleanly sets s_user_ns. Then additional restrictions
are added that are possible now that the filesystem superblock
contains owner information.

These changes should not affect anyone in practice, but there are some
parts of these restrictions that are changes in behavior.

- Andy's restriction on suid executables that does not honor the
suid bit when the path is from another mount namespace (think
/proc/[pid]/fd/) or when the filesystem was mounted by a less
privileged user.

- The replacement of the user namespace implicit setting of MNT_NODEV
with implicitly setting SB_I_NODEV on the filesystem superblock
instead.

Using SB_I_NODEV is a stronger form that happens to make this state
user invisible. The user visibility can be managed but it caused
problems when it was introduced from applications reasonably
expecting mount flags to be what they were set to.

There is a little bit of work remaining before it is safe to support
mounting filesystems with backing store in user namespaces, beyond
what is in this set of changes.

- Verifying the mounter has permission to read/write the block device
during mount.

- Teaching the integrity modules IMA and EVM to handle filesystems
mounted with only user namespace root and to reduce trust in their
security xattrs accordingly.

- Capturing the mounters credentials and using that for permission
checks in d_automount and the like. (Given that overlayfs already
does this, and we need the work in d_automount it make sense to
generalize this case).

Furthermore there are a few changes that are on the wishlist:

- Get all filesystems supporting posix acls using the generic posix
acls so that posix_acl_fix_xattr_from_user and
posix_acl_fix_xattr_to_user may be removed. [Maintainability]

- Reducing the permission checks in places such as remount to allow
the superblock owner to perform them.

- Allowing the superblock owner to chown files with unmapped uids and
gids to something that is mapped so the files may be treated
normally.

I am not considering even obvious relaxations of permission checks
until it is clear there are no more corner cases that need to be
locked down and handled generically.

Many thanks to Seth Forshee who kept this code alive, and putting up
with me rewriting substantial portions of what he did to handle more
corner cases, and for his diligent testing and reviewing of my
changes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (30 commits)
fs: Call d_automount with the filesystems creds
fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns
evm: Translate user/group ids relative to s_user_ns when computing HMAC
dquot: For now explicitly don't support filesystems outside of init_user_ns
quota: Handle quota data stored in s_user_ns in quota_setxquota
quota: Ensure qids map to the filesystem
vfs: Don't create inodes with a uid or gid unknown to the vfs
vfs: Don't modify inodes with a uid or gid unknown to the vfs
cred: Reject inodes with invalid ids in set_create_file_as()
fs: Check for invalid i_uid in may_follow_link()
vfs: Verify acls are valid within superblock's s_user_ns.
userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS
fs: Refuse uid/gid changes which don't map into s_user_ns
selinux: Add support for unprivileged mounts from user namespaces
Smack: Handle labels consistently in untrusted mounts
Smack: Add support for unprivileged mounts from user namespaces
fs: Treat foreign mounts as nosuid
fs: Limit file caps to the user namespace of the super block
userns: Remove the now unnecessary FS_USERNS_DEV_MOUNT flag
userns: Remove implicit MNT_NODEV fragility.
...

Linus Torvalds
9 years ago

24 Jun, 2016

3 commits

a2982cc92 vfs: Generalize filesystem nodev handling. ... Browse Code »

Introduce a function may_open_dev that tests MNT_NODEV and a new
superblock flab SB_I_NODEV. Use this new function in all of the
places where MNT_NODEV was previously tested.

Add the new SB_I_NODEV s_iflag to proc, sysfs, and mqueuefs as those
filesystems should never support device nodes, and a simple superblock
flags makes that very hard to get wrong. With SB_I_NODEV set if any
device nodes somehow manage to show up on on a filesystem those
device nodes will be unopenable.

Acked-by: Seth Forshee
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
9 years ago
29a517c23 kernfs: The cgroup filesystem also benefits from SB_I_NOEXEC ... Browse Code »

The cgroup filesystem is in the same boat as sysfs. No one ever
permits executables of any kind on the cgroup filesystem, and there is
no reasonable future case to support executables in the future.

Therefore move the setting of SB_I_NOEXEC which makes the code proof
against future mistakes of accidentally creating executables from
sysfs to kernfs itself. Making the code simpler and covering the
sysfs, cgroup, and cgroup2 filesystems.

Acked-by: Seth Forshee
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
9 years ago
6e4eab577 fs: Add user namespace member to struct super_block ... Browse Code »

Start marking filesystems with a user namespace owner, s_user_ns. In
this change this is only used for permission checks of who may mount a
filesystem. Ultimately s_user_ns will be used for translating ids and
checking capabilities for filesystems mounted from user namespaces.

The default policy for setting s_user_ns is implemented in sget(),
which arranges for s_user_ns to be set to current_user_ns() and to
ensure that the mounter of the filesystem has CAP_SYS_ADMIN in that
user_ns.

The guts of sget are split out into another function sget_userns().
The function sget_userns calls alloc_super with the specified user
namespace or it verifies the existing superblock that was found
has the expected user namespace, and fails with EBUSY when it is not.
This failing prevents users with the wrong privileges mounting a
filesystem.

The reason for the split of sget_userns from sget is that in some
cases such as mount_ns and kernfs_mount_ns a different policy for
permission checking of mounts and setting s_user_ns is necessary, and
the existence of sget_userns() allows those policies to be
implemented.

The helper mount_ns is expected to be used for filesystems such as
proc and mqueuefs which present per namespace information. The
function mount_ns is modified to call sget_userns instead of sget to
ensure the user namespace owner of the namespace whose information is
presented by the filesystem is used on the superblock.

For sysfs and cgroup the appropriate permission checks are already in
place, and kernfs_mount_ns is modified to call sget_userns so that
the init_user_ns is the only user namespace used.

For the cgroup filesystem cgroup namespace mounts are bind mounts of a
subset of the full cgroup filesystem and as such s_user_ns must be the
same for all of them as there is only a single superblock.

Mounts of sysfs that vary based on the network namespace could in principle
change s_user_ns but it keeps the analysis and implementation of kernfs
simpler if that is not supported, and at present there appear to be no
benefits from supporting a different s_user_ns on any sysfs mount.

Getting the details of setting s_user_ns correct has been
a long process. Thanks to Pavel Tikhorirorv who spotted a leak
in sget_userns. Thanks to Seth Forshee who has kept the work alive.

Thanks-to: Seth Forshee
Thanks-to: Pavel Tikhomirov
Acked-by: Seth Forshee
Signed-off-by: Eric W. Biederman

Eric W. Biederman
9 years ago

11 Jun, 2016

1 commit

8387ff257 vfs: make the string hashes salt the hash ... Browse Code »

We always mixed in the parent pointer into the dentry name hash, but we
did it late at lookup time. It turns out that we can simplify that
lookup-time action by salting the hash with the parent pointer early
instead of late.

A few other users of our string hashes also wanted to mix in their own
pointers into the hash, and those are updated to use the same mechanism.

Hash users that don't have any particular initial salt can just use the
NULL pointer as a no-salt.

Cc: Vegard Nossum
Cc: George Spelvin
Cc: Al Viro
Signed-off-by: Linus Torvalds

Linus Torvalds
9 years ago

28 May, 2016

1 commit

3767e255b switch ->setxattr() to passing dentry and inode separately ... Browse Code »

smack ->d_instantiate() uses ->setxattr(), so to be able to call it before
we'd hashed the new dentry and attached it to inode, we need ->setxattr()
instances getting the inode as an explicit argument rather than obtaining
it from dentry.

Similar change for ->getxattr() had been done in commit ce23e64. Unlike
->getxattr() (which is used by both selinux and smack instances of
->d_instantiate()) ->setxattr() is used only by smack one and unfortunately
it got missed back then.

Reported-by: Seung-Woo Kim
Tested-by: Casey Schaufler
Signed-off-by: Al Viro

Al Viro
9 years ago

21 May, 2016

1 commit

3aa2fc166 Merge tag 'driver-core-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core ... Browse Code »

Pull driver core updates from Greg KH:
"Here's the "big" driver core update for 4.7-rc1.

Mostly just debugfs changes, the long-known and messy races with
removing debugfs files should be fixed thanks to the great work of
Nicolai Stange. We also have some isa updates in here (the x86
maintainers told me to take it through this tree), a new warning when
we run out of dynamic char major numbers, and a few other assorted
changes, details in the shortlog.

All have been in linux-next for some time with no reported issues"

* tag 'driver-core-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (32 commits)
Revert "base: dd: don't remove driver_data in -EPROBE_DEFER case"
gpio: ws16c48: Utilize the ISA bus driver
gpio: 104-idio-16: Utilize the ISA bus driver
gpio: 104-idi-48: Utilize the ISA bus driver
gpio: 104-dio-48e: Utilize the ISA bus driver
watchdog: ebc-c384_wdt: Utilize the ISA bus driver
iio: stx104: Utilize the module_isa_driver and max_num_isa_dev macros
iio: stx104: Add X86 dependency to STX104 Kconfig option
Documentation: Add ISA bus driver documentation
isa: Implement the max_num_isa_dev macro
isa: Implement the module_isa_driver macro
pnp: pnpbios: Add explicit X86_32 dependency to PNPBIOS
isa: Decouple X86_32 dependency from the ISA Kconfig option
driver-core: use 'dev' argument in dev_dbg_ratelimited stub
base: dd: don't remove driver_data in -EPROBE_DEFER case
kernfs: Move faulting copy_user operations outside of the mutex
devcoredump: add scatterlist support
debugfs: unproxify files created through debugfs_create_u32_array()
debugfs: unproxify files created through debugfs_create_blob()
debugfs: unproxify files created through debugfs_create_bool()
...

Linus Torvalds
9 years ago

18 May, 2016

1 commit

7f427d3a6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull parallel filesystem directory handling update from Al Viro.

This is the main parallel directory work by Al that makes the vfs layer
able to do lookup and readdir in parallel within a single directory.
That's a big change, since this used to be all protected by the
directory inode mutex.

The inode mutex is replaced by an rwsem, and serialization of lookups of
a single name is done by a "in-progress" dentry marker.

The series begins with xattr cleanups, and then ends with switching
filesystems over to actually doing the readdir in parallel (switching to
the "iterate_shared()" that only takes the read lock).

A more detailed explanation of the process from Al Viro:
"The xattr work starts with some acl fixes, then switches ->getxattr to
passing inode and dentry separately. This is the point where the
things start to get tricky - that got merged into the very beginning
of the -rc3-based #work.lookups, to allow untangling the
security_d_instantiate() mess. The xattr work itself proceeds to
switch a lot of filesystems to generic_...xattr(); no complications
there.

After that initial xattr work, the series then does the following:

- untangle security_d_instantiate()

- convert a bunch of open-coded lookup_one_len_unlocked() to calls of
that thing; one such place (in overlayfs) actually yields a trivial
conflict with overlayfs fixes later in the cycle - overlayfs ended
up switching to a variant of lookup_one_len_unlocked() sans the
permission checks. I would've dropped that commit (it gets
overridden on merge from #ovl-fixes in #for-next; proper resolution
is to use the variant in mainline fs/overlayfs/super.c), but I
didn't want to rebase the damn thing - it was fairly late in the
cycle...

- some filesystems had managed to depend on lookup/lookup exclusion
for *fs-internal* data structures in a way that would break if we
relaxed the VFS exclusion. Fixing hadn't been hard, fortunately.

- core of that series - parallel lookup machinery, replacing
->i_mutex with rwsem, making lookup_slow() take it only shared. At
that point lookups happen in parallel; lookups on the same name
wait for the in-progress one to be done with that dentry.

Surprisingly little code, at that - almost all of it is in
fs/dcache.c, with fs/namei.c changes limited to lookup_slow() -
making it use the new primitive and actually switching to locking
shared.

- parallel readdir stuff - first of all, we provide the exclusion on
per-struct file basis, same as we do for read() vs lseek() for
regular files. That takes care of most of the needed exclusion in
readdir/readdir; however, these guys are trickier than lookups, so
I went for switching them one-by-one. To do that, a new method
'->iterate_shared()' is added and filesystems are switched to it
as they are either confirmed to be OK with shared lock on directory
or fixed to be OK with that. I hope to kill the original method
come next cycle (almost all in-tree filesystems are switched
already), but it's still not quite finished.

- several filesystems get switched to parallel readdir. The
interesting part here is dealing with dcache preseeding by readdir;
that needs minor adjustment to be safe with directory locked only
shared.

Most of the filesystems doing that got switched to in those
commits. Important exception: NFS. Turns out that NFS folks, with
their, er, insistence on VFS getting the fuck out of the way of the
Smart Filesystem Code That Knows How And What To Lock(tm) have
grown the locking of their own. They had their own homegrown
rwsem, with lookup/readdir/atomic_open being *writers* (sillyunlink
is the reader there). Of course, with VFS getting the fuck out of
the way, as requested, the actual smarts of the smart filesystem
code etc. had become exposed...

- do_last/lookup_open/atomic_open cleanups. As the result, open()
without O_CREAT locks the directory only shared. Including the
->atomic_open() case. Backmerge from #for-linus in the middle of
that - atomic_open() fix got brought in.

- then comes NFS switch to saner (VFS-based ;-) locking, killing the
homegrown "lookup and readdir are writers" kinda-sorta rwsem. All
exclusion for sillyunlink/lookup is done by the parallel lookups
mechanism. Exclusion between sillyunlink and rmdir is a real rwsem
now - rmdir being the writer.

Result: NFS lookups/readdirs/O_CREAT-less opens happen in parallel
now.

- the rest of the series consists of switching a lot of filesystems
to parallel readdir; in a lot of cases ->llseek() gets simplified
as well. One backmerge in there (again, #for-linus - rockridge
fix)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (74 commits)
ext4: switch to ->iterate_shared()
hfs: switch to ->iterate_shared()
hfsplus: switch to ->iterate_shared()
hostfs: switch to ->iterate_shared()
hpfs: switch to ->iterate_shared()
hpfs: handle allocation failures in hpfs_add_pos()
gfs2: switch to ->iterate_shared()
f2fs: switch to ->iterate_shared()
afs: switch to ->iterate_shared()
befs: switch to ->iterate_shared()
befs: constify stuff a bit
isofs: switch to ->iterate_shared()
get_acorn_filename(): deobfuscate a bit
btrfs: switch to ->iterate_shared()
logfs: no need to lock directory in lseek
switch ecryptfs to ->iterate_shared
9p: switch to ->iterate_shared()
fat: switch to ->iterate_shared()
romfs, squashfs: switch to ->iterate_shared()
more trivial ->iterate_shared conversions
...

Linus Torvalds
9 years ago

12 May, 2016

1 commit

3cc9b23c8 kernfs: kernfs_sop_show_path: don't return 0 after seq_dentry call ... Browse Code »

Our caller expects 0 on success, not >0.

This fixes a bug in the patch

cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces

where /sys does not show up in mountinfo, breaking criu.

Thanks for catching this, Andrei.

Reported-by: Andrei Vagin
Signed-off-by: Serge Hallyn
Signed-off-by: Tejun Heo

Serge E. Hallyn
9 years ago

10 May, 2016

1 commit

4f41fc596 cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces ... Browse Code »

Patch summary:

When showing a cgroupfs entry in mountinfo, show the path of the mount
root dentry relative to the reader's cgroup namespace root.

Short explanation (courtesy of mkerrisk):

If we create a new cgroup namespace, then we want both /proc/self/cgroup
and /proc/self/mountinfo to show cgroup paths that are correctly
virtualized with respect to the cgroup mount point. Previous to this
patch, /proc/self/cgroup shows the right info, but /proc/self/mountinfo
does not.

Long version:

When a uid 0 task which is in freezer cgroup /a/b, unshares a new cgroup
namespace, and then mounts a new instance of the freezer cgroup, the new
mount will be rooted at /a/b. The root dentry field of the mountinfo
entry will show '/a/b'.

cat > /tmp/do1 << EOF
mount -t cgroup -o freezer freezer /mnt
grep freezer /proc/self/mountinfo
EOF

unshare -Gm bash /tmp/do1
> 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
> 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer

The task's freezer cgroup entry in /proc/self/cgroup will simply show
'/':

grep freezer /proc/self/cgroup
9:freezer:/

If instead the same task simply bind mounts the /a/b cgroup directory,
the resulting mountinfo entry will again show /a/b for the dentry root.
However in this case the task will find its own cgroup at /mnt/a/b,
not at /mnt:

mount --bind /sys/fs/cgroup/freezer/a/b /mnt
130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer

In other words, there is no way for the task to know, based on what is
in mountinfo, which cgroup directory is its own.

Example (by mkerrisk):

First, a little script to save some typing and verbiage:

echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup | grep freezer)"
cat /proc/self/mountinfo | grep freezer |
awk '{print "\tmountinfo:\t\t" $4 "\t" $5}'

Create cgroup, place this shell into the cgroup, and look at the state
of the /proc files:

2653
2653 # Our shell
14254 # cat(1)
/proc/self/cgroup: 10:freezer:/a/b
mountinfo: / /sys/fs/cgroup/freezer

Create a shell in new cgroup and mount namespaces. The act of creating
a new cgroup namespace causes the process's current cgroups directories
to become its cgroup root directories. (Here, I'm using my own version
of the "unshare" utility, which takes the same options as the util-linux
version):

Look at the state of the /proc files:

/proc/self/cgroup: 10:freezer:/
mountinfo: / /sys/fs/cgroup/freezer

The third entry in /proc/self/cgroup (the pathname of the cgroup inside
the hierarchy) is correctly virtualized w.r.t. the cgroup namespace, which
is rooted at /a/b in the outer namespace.

However, the info in /proc/self/mountinfo is not for this cgroup
namespace, since we are seeing a duplicate of the mount from the
old mount namespace, and the info there does not correspond to the
new cgroup namespace. However, trying to create a new mount still
doesn't show us the right information in mountinfo:

# propagating to other mountns
/proc/self/cgroup: 7:freezer:/
mountinfo: /a/b /mnt/freezer

The act of creating a new cgroup namespace caused the process's
current freezer directory, "/a/b", to become its cgroup freezer root
directory. In other words, the pathname directory of the directory
within the newly mounted cgroup filesystem should be "/",
but mountinfo wrongly shows us "/a/b". The consequence of this is
that the process in the cgroup namespace cannot correctly construct
the pathname of its cgroup root directory from the information in
/proc/PID/mountinfo.

With this patch, the dentry root field in mountinfo is shown relative
to the reader's cgroup namespace. So the same steps as above:

/proc/self/cgroup: 10:freezer:/a/b
mountinfo: / /sys/fs/cgroup/freezer
/proc/self/cgroup: 10:freezer:/
mountinfo: /../.. /sys/fs/cgroup/freezer
/proc/self/cgroup: 10:freezer:/
mountinfo: / /mnt/freezer

cgroup.clone_children freezer.parent_freezing freezer.state tasks
cgroup.procs freezer.self_freezing notify_on_release
3164
2653 # First shell that placed in this cgroup
3164 # Shell started by 'unshare'
14197 # cat(1)

Signed-off-by: Serge Hallyn
Tested-by: Michael Kerrisk
Acked-by: Michael Kerrisk
Signed-off-by: Tejun Heo

Serge E. Hallyn
9 years ago

09 May, 2016

1 commit

8cb0d2c1c kernfs: no point locking directory around that generic_file_llseek() ... Browse Code »

Signed-off-by: Al Viro

Al Viro
9 years ago

03 May, 2016

3 commits

779b83913 kernfs: use lookup_one_len_unlocked() ... Browse Code »

Signed-off-by: Al Viro

Al Viro
9 years ago
84695ffee Merge getxattr prototype change into work.lookups ... Browse Code »

The rest of work.xattr stuff isn't needed for this branch

Al Viro
9 years ago
e99ed4de1 kernfs_path_from_node_locked: don't overwrite nlen ... Browse Code »

We've calculated @len to be the bytes we need for '/..' entries from
@kn_from to the common ancestor, and calculated @nlen to be the extra
bytes we need to get from the common ancestor to @kn_to. We use them
as such at the end. But in the loop copying the actual entries, we
overwrite @nlen. Use a temporary variable for that instead.

Without this, the return length, when the buffer is large enough, is
wrong. (When the buffer is NULL or too small, the returned value is
correct. The buffer contents are also correct.)

Interestingly, no callers of this function are affected by this as of
yet. However the upcoming cgroup_show_path() will be.

Signed-off-by: Serge Hallyn

Serge Hallyn
9 years ago

01 May, 2016

1 commit

e4234a1fc kernfs: Move faulting copy_user operations outside of the mutex ... Browse Code »

A fault in a user provided buffer may lead anywhere, and lockdep warns
that we have a potential deadlock between the mm->mmap_sem and the
kernfs file mutex:

[ 82.811702] ======================================================
[ 82.811705] [ INFO: possible circular locking dependency detected ]
[ 82.811709] 4.5.0-rc4-gfxbench+ #1 Not tainted
[ 82.811711] -------------------------------------------------------
[ 82.811714] kms_setmode/5859 is trying to acquire lock:
[ 82.811717] (&dev->struct_mutex){+.+.+.}, at: [] drm_gem_mmap+0x1a1/0x270
[ 82.811731]
but task is already holding lock:
[ 82.811734] (&mm->mmap_sem){++++++}, at: [] vm_mmap_pgoff+0x44/0xa0
[ 82.811745]
which lock already depends on the new lock.

[ 82.811749]
the existing dependency chain (in reverse order) is:
[ 82.811752]
-> #3 (&mm->mmap_sem){++++++}:
[ 82.811761] [] lock_acquire+0xc3/0x1d0
[ 82.811766] [] __might_fault+0x75/0xa0
[ 82.811771] [] kernfs_fop_write+0x8a/0x180
[ 82.811787] [] __vfs_write+0x23/0xe0
[ 82.811792] [] vfs_write+0xa4/0x190
[ 82.811797] [] SyS_write+0x44/0xb0
[ 82.811801] [] entry_SYSCALL_64_fastpath+0x16/0x73
[ 82.811807]
-> #2 (s_active#6){++++.+}:
[ 82.811814] [] lock_acquire+0xc3/0x1d0
[ 82.811819] [] __kernfs_remove+0x210/0x2f0
[ 82.811823] [] kernfs_remove_by_name_ns+0x40/0xa0
[ 82.811828] [] sysfs_remove_file_ns+0x10/0x20
[ 82.811832] [] device_del+0x124/0x250
[ 82.811837] [] device_unregister+0x19/0x60
[ 82.811841] [] cpu_cache_sysfs_exit+0x51/0xb0
[ 82.811846] [] cacheinfo_cpu_callback+0x38/0x70
[ 82.811851] [] notifier_call_chain+0x39/0xa0
[ 82.811856] [] __raw_notifier_call_chain+0x9/0x10
[ 82.811860] [] cpu_notify+0x1e/0x40
[ 82.811865] [] cpu_notify_nofail+0x9/0x20
[ 82.811869] [] _cpu_down+0x233/0x340
[ 82.811874] [] disable_nonboot_cpus+0xc9/0x350
[ 82.811878] [] suspend_devices_and_enter+0x5a1/0xb50
[ 82.811883] [] pm_suspend+0x543/0x8d0
[ 82.811888] [] state_store+0x77/0xe0
[ 82.811892] [] kobj_attr_store+0xf/0x20
[ 82.811897] [] sysfs_kf_write+0x40/0x50
[ 82.811902] [] kernfs_fop_write+0x13c/0x180
[ 82.811906] [] __vfs_write+0x23/0xe0
[ 82.811910] [] vfs_write+0xa4/0x190
[ 82.811914] [] SyS_write+0x44/0xb0
[ 82.811918] [] entry_SYSCALL_64_fastpath+0x16/0x73
[ 82.811923]
-> #1 (cpu_hotplug.lock){+.+.+.}:
[ 82.811929] [] lock_acquire+0xc3/0x1d0
[ 82.811933] [] mutex_lock_nested+0x62/0x3b0
[ 82.811940] [] get_online_cpus+0x61/0x80
[ 82.811944] [] stop_machine+0x1b/0xe0
[ 82.811949] [] gen8_ggtt_insert_entries__BKL+0x2d/0x30 [i915]
[ 82.812009] [] ggtt_bind_vma+0x46/0x70 [i915]
[ 82.812045] [] i915_vma_bind+0x140/0x290 [i915]
[ 82.812081] [] i915_gem_object_do_pin+0x899/0xb00 [i915]
[ 82.812117] [] i915_gem_object_pin+0x35/0x40 [i915]
[ 82.812154] [] intel_init_pipe_control+0xbe/0x210 [i915]
[ 82.812192] [] intel_logical_rings_init+0xe2/0xde0 [i915]
[ 82.812232] [] i915_gem_init+0xf3/0x130 [i915]
[ 82.812278] [] i915_driver_load+0xf2d/0x1770 [i915]
[ 82.812318] [] drm_dev_register+0xa4/0xb0
[ 82.812323] [] drm_get_pci_dev+0xce/0x1e0
[ 82.812328] [] i915_pci_probe+0x2f/0x50 [i915]
[ 82.812360] [] pci_device_probe+0x87/0xf0
[ 82.812366] [] driver_probe_device+0x229/0x450
[ 82.812371] [] __driver_attach+0x83/0x90
[ 82.812375] [] bus_for_each_dev+0x61/0xa0
[ 82.812380] [] driver_attach+0x19/0x20
[ 82.812384] [] bus_add_driver+0x1ef/0x290
[ 82.812388] [] driver_register+0x5b/0xe0
[ 82.812393] [] __pci_register_driver+0x5b/0x60
[ 82.812398] [] drm_pci_init+0xd6/0x100
[ 82.812402] [] 0xffffffffa027c094
[ 82.812406] [] do_one_initcall+0xae/0x1d0
[ 82.812412] [] do_init_module+0x5b/0x1cb
[ 82.812417] [] load_module+0x1c20/0x2480
[ 82.812422] [] SyS_finit_module+0x7e/0xa0
[ 82.812428] [] entry_SYSCALL_64_fastpath+0x16/0x73
[ 82.812433]
-> #0 (&dev->struct_mutex){+.+.+.}:
[ 82.812439] [] __lock_acquire+0x1fc9/0x20f0
[ 82.812443] [] lock_acquire+0xc3/0x1d0
[ 82.812456] [] drm_gem_mmap+0x1c7/0x270
[ 82.812460] [] mmap_region+0x334/0x580
[ 82.812466] [] do_mmap+0x364/0x410
[ 82.812470] [] vm_mmap_pgoff+0x6d/0xa0
[ 82.812474] [] SyS_mmap_pgoff+0x184/0x220
[ 82.812479] [] SyS_mmap+0x1d/0x20
[ 82.812484] [] entry_SYSCALL_64_fastpath+0x16/0x73
[ 82.812489]
other info that might help us debug this:

[ 82.812493] Chain exists of:
&dev->struct_mutex --> s_active#6 --> &mm->mmap_sem

[ 82.812502] Possible unsafe locking scenario:

[ 82.812506] CPU0 CPU1
[ 82.812508] ---- ----
[ 82.812510] lock(&mm->mmap_sem);
[ 82.812514] lock(s_active#6);
[ 82.812519] lock(&mm->mmap_sem);
[ 82.812522] lock(&dev->struct_mutex);
[ 82.812526]
*** DEADLOCK ***

[ 82.812531] 1 lock held by kms_setmode/5859:
[ 82.812533] #0: (&mm->mmap_sem){++++++}, at: [] vm_mmap_pgoff+0x44/0xa0
[ 82.812541]
stack backtrace:
[ 82.812547] CPU: 0 PID: 5859 Comm: kms_setmode Not tainted 4.5.0-rc4-gfxbench+ #1
[ 82.812550] Hardware name: /NUC5CPYB, BIOS PYBSWCEL.86A.0040.2015.0814.1353 08/14/2015
[ 82.812553] 0000000000000000 ffff880079407bf0 ffffffff813f8505 ffffffff825fb270
[ 82.812560] ffffffff825c4190 ffff880079407c30 ffffffff810c84ac ffff880079407c90
[ 82.812566] ffff8800797ed328 ffff8800797ecb00 0000000000000001 ffff8800797ed350
[ 82.812573] Call Trace:
[ 82.812578] [] dump_stack+0x67/0x92
[ 82.812582] [] print_circular_bug+0x1fc/0x310
[ 82.812586] [] __lock_acquire+0x1fc9/0x20f0
[ 82.812590] [] lock_acquire+0xc3/0x1d0
[ 82.812594] [] ? drm_gem_mmap+0x1a1/0x270
[ 82.812599] [] drm_gem_mmap+0x1c7/0x270
[ 82.812603] [] ? drm_gem_mmap+0x1a1/0x270
[ 82.812608] [] mmap_region+0x334/0x580
[ 82.812612] [] do_mmap+0x364/0x410
[ 82.812616] [] vm_mmap_pgoff+0x6d/0xa0
[ 82.812629] [] SyS_mmap_pgoff+0x184/0x220
[ 82.812633] [] SyS_mmap+0x1d/0x20
[ 82.812637] [] entry_SYSCALL_64_fastpath+0x16/0x73

Highly unlikely though this scenario is, we can avoid the issue entirely
by moving the copy operation from out under the kernfs_get_active()
tracking by assigning the preallocated buffer its own mutex. The
temporary buffer allocation doesn't require mutex locking as it is
entirely local.

The locked section was extended by the addition of the preallocated buf
to speed up md user operations in

commit 2b75869bba676c248d8d25ae6d2bd9221dfffdb6
Author: NeilBrown
Date: Mon Oct 13 16:41:28 2014 +1100

sysfs/kernfs: allow attributes to request write buffer be pre-allocated.

Reported-by: Ville Syrjälä
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=94350
Signed-off-by: Chris Wilson
Reviewed-by: Joonas Lahtinen
Cc: Ville Syrjälä
Cc: Joonas Lahtinen
Cc: NeilBrown
Acked-by: Tejun Heo
Signed-off-by: Greg Kroah-Hartman

Chris Wilson
9 years ago

19 Apr, 2016

1 commit

5614e7725 Merge 4.6-rc4 into driver-core-next ... Browse Code »

We want those fixes in here as well.

Signed-off-by: Greg Kroah-Hartman

Greg Kroah-Hartman
9 years ago

11 Apr, 2016

1 commit

ce23e6401 ->getxattr(): pass dentry and inode as separate arguments ... Browse Code »

Signed-off-by: Al Viro

Al Viro
9 years ago

05 Apr, 2016

1 commit

09cbfeaf1 mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros ... Browse Code »

PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized. And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special. They are
not.

The changes are pretty straight-forward:

- << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> ;

- >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> ;

- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

- page_cache_get() -> get_page();

- page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov
Acked-by: Michal Hocko
Signed-off-by: Linus Torvalds

Kirill A. Shutemov
9 years ago

30 Mar, 2016

1 commit

3a3a5fece fs: kernfs: Replace CURRENT_TIME by current_fs_time() ... Browse Code »

This is in preparation for the series that transitions
filesystem timestamps to use 64 bit time and hence make
them y2038 safe.

CURRENT_TIME macro will be deleted before merging the
aforementioned series.

Use current_fs_time() instead of CURRENT_TIME for inode
timestamps.

struct kernfs_node is associated with a sysfs file/ directory.
Truncate the values to appropriate time granularity when
writing to inode timestamps of the files.

ktime_get_real_ts() is used to obtain times for
struct kernfs_iattrs. Since these times are later assigned to
inode times using timespec_truncate() for all filesystem based
operations, we can save the supers list traversal time here by
using ktime_get_real_ts() directly.

Signed-off-by: Deepa Dinamani
Signed-off-by: Greg Kroah-Hartman

Deepa Dinamani
9 years ago

22 Mar, 2016

1 commit

5518f66b5 Merge branch 'for-4.6-ns' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup ... Browse Code »

Pull cgroup namespace support from Tejun Heo:
"These are changes to implement namespace support for cgroup which has
been pending for quite some time now. It is very straight-forward and
only affects what part of cgroup hierarchies are visible.

After unsharing, mounting a cgroup fs will be scoped to the cgroups
the task belonged to at the time of unsharing and the cgroup paths
exposed to userland would be adjusted accordingly"

* 'for-4.6-ns' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: fix and restructure error handling in copy_cgroup_ns()
cgroup: fix alloc_cgroup_ns() error handling in copy_cgroup_ns()
Add FS_USERNS_FLAG to cgroup fs
cgroup: Add documentation for cgroup namespaces
cgroup: mount cgroupns-root when inside non-init cgroupns
kernfs: define kernfs_node_dentry
cgroup: cgroup namespace setns support
cgroup: introduce cgroup namespaces
sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
kernfs: Add API to generate relative kernfs path

Linus Torvalds
9 years ago

17 Feb, 2016

2 commits

fb3c83156 kernfs: define kernfs_node_dentry ... Browse Code »

Add a new kernfs api is added to lookup the dentry for a particular
kernfs path.

Signed-off-by: Aditya Kali
Signed-off-by: Serge E. Hallyn
Acked-by: Greg Kroah-Hartman
Signed-off-by: Tejun Heo

Aditya Kali
9 years ago
9f6df573a kernfs: Add API to generate relative kernfs path ... Browse Code »

The new function kernfs_path_from_node() generates and returns kernfs
path of a given kernfs_node relative to a given parent kernfs_node.

Signed-off-by: Aditya Kali
Signed-off-by: Serge E. Hallyn
Acked-by: Greg Kroah-Hartman
Signed-off-by: Tejun Heo

Aditya Kali
9 years ago

08 Feb, 2016

1 commit

e56ed358a kernfs: make kernfs_walk_ns() use kernfs_pr_cont_buf[] ... Browse Code »

kernfs_walk_ns() uses a static path_buf[PATH_MAX] to separate out path
components. Keeping around the 4k buffer just for kernfs_walk_ns() is
wasteful. This patch makes it piggyback on kernfs_pr_cont_buf[]
instead. This requires kernfs_walk_ns() to hold kernfs_rename_lock.

Signed-off-by: Tejun Heo
Reported-by: Geert Uytterhoeven
Signed-off-by: Greg Kroah-Hartman

Tejun Heo
9 years ago

23 Jan, 2016

1 commit

5955102c9 wrappers for ->i_mutex access ... Browse Code »

parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).

Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.

Signed-off-by: Al Viro

Al Viro
9 years ago

15 Jan, 2016

1 commit

b2a209ffa Revert "kernfs: do not account ino_ida allocations to memcg" ... Browse Code »

Currently, all kmem allocations (namely every kmem_cache_alloc, kmalloc,
alloc_kmem_pages call) are accounted to memory cgroup automatically.
Callers have to explicitly opt out if they don't want/need accounting
for some reason. Such a design decision leads to several problems:

- kmalloc users are highly sensitive to failures, many of them
implicitly rely on the fact that kmalloc never fails, while memcg
makes failures quite plausible.

- A lot of objects are shared among different containers by design.
Accounting such objects to one of containers is just unfair.
Moreover, it might lead to pinning a dead memcg along with its kmem
caches, which aren't tiny, which might result in noticeable increase
in memory consumption for no apparent reason in the long run.

- There are tons of short-lived objects. Accounting them to memcg will
only result in slight noise and won't change the overall picture, but
we still have to pay accounting overhead.

For more info, see

- http://lkml.kernel.org/r/20151105144002.GB15111%40dhcp22.suse.cz
- http://lkml.kernel.org/r/20151106090555.GK29259@esperanza

Therefore this patchset switches to the white list policy. Now kmalloc
users have to explicitly opt in by passing __GFP_ACCOUNT flag.

Currently, the list of accounted objects is quite limited and only
includes those allocations that (1) are known to be easily triggered
from userspace and (2) can fail gracefully (for the full list see patch
no. 6) and it still misses many object types. However, accounting only
those objects should be a satisfactory approximation of the behavior we
used to have for most sane workloads.

This patch (of 6):

Revert 499611ed451508a42d1d7d ("kernfs: do not account ino_ida allocations
to memcg").

Black-list kmem accounting policy (aka __GFP_NOACCOUNT) turned out to be
fragile and difficult to maintain, because there seem to be many more
allocations that should not be accounted than those that should be.
Besides, false accounting an allocation might result in much worse
consequences than not accounting at all, namely increased memory
consumption due to pinned dead kmem caches.

So it was decided to switch to the white-list policy. This patch reverts
bits introducing the black-list policy. The white-list policy will be
introduced later in the series.

Signed-off-by: Vladimir Davydov
Acked-by: Johannes Weiner
Cc: Michal Hocko
Cc: Tejun Heo
Cc: Greg Thelen
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vladimir Davydov
10 years ago