Eric Lee / smarc-ti-linux-kernel | Embedian Git Server

17 Dec, 2014

1 commit

50062175f vm_area_operations: kill ->migrate() ... Browse Code »

the only instance this method has ever grown was one in kernfs -
one that call ->migrate() of another vm_ops if it exists.

Signed-off-by: Al Viro

Al Viro
2014-12-17 21:26:51 +0800

15 Dec, 2014

1 commit

e6b5be2be Merge tag 'driver-core-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core ... Browse Code »

Pull driver core update from Greg KH:
"Here's the set of driver core patches for 3.19-rc1.

They are dominated by the removal of the .owner field in platform
drivers. They touch a lot of files, but they are "simple" changes,
just removing a line in a structure.

Other than that, a few minor driver core and debugfs changes. There
are some ath9k patches coming in through this tree that have been
acked by the wireless maintainers as they relied on the debugfs
changes.

Everything has been in linux-next for a while"

* tag 'driver-core-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (324 commits)
Revert "ath: ath9k: use debugfs_create_devm_seqfile() helper for seq_file entries"
fs: debugfs: add forward declaration for struct device type
firmware class: Deletion of an unnecessary check before the function call "vunmap"
firmware loader: fix hung task warning dump
devcoredump: provide a one-way disable function
device: Add dev__once variants
ath: ath9k: use debugfs_create_devm_seqfile() helper for seq_file entries
ath: use seq_file api for ath9k debugfs files
debugfs: add helper function to create device related seq_file
drivers/base: cacheinfo: remove noisy error boot message
Revert "core: platform: add warning if driver has no owner"
drivers: base: support cpu cache information interface to userspace via sysfs
drivers: base: add cpu_device_create to support per-cpu devices
topology: replace custom attribute macros with standard DEVICE_ATTR*
cpumask: factor out show_cpumap into separate helper function
driver core: Fix unbalanced device reference in drivers_probe
driver core: fix race with userland in device_add()
sysfs/kernfs: make read requests on pre-alloc files use the buffer.
sysfs/kernfs: allow attributes to request write buffer be pre-allocated.
fs: sysfs: return EGBIG on write if offset is larger than file size
...

Linus Torvalds
2014-12-15 08:10:09 +0800

20 Nov, 2014

1 commit

41d28bca2 switch d_materialise_unique() users to d_splice_alias() ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2014-11-20 02:01:20 +0800

08 Nov, 2014

2 commits

4ef67a8c9 sysfs/kernfs: make read requests on pre-alloc files use the buffer. ... Browse Code »
2

To match the previous patch which used the pre-alloc buffer for
writes, this patch causes reads to use the same buffer.
This is not strictly necessary as the current seq_read() will allocate
on first read, so user-space can trigger the required pre-alloc. But
consistency is valuable.

The read function is somewhat simpler than seq_read() and, for example,
does not support reading from an offset into the file: reads must be
at the start of the file.

As seq_read() does not use the prealloc buffer, ->seq_show is
incompatible with ->prealloc and caused an EINVAL return from open().
sysfs code which calls into kernfs always chooses the correct function.

As the buffer is shared with writes and other reads, the mutex is
extended to cover the copy_to_user.

Signed-off-by: NeilBrown
Reviewed-by: Tejun Heo
Signed-off-by: Greg Kroah-Hartman

NeilBrown
2014-11-08 02:54:38 +0800
2b75869bb sysfs/kernfs: allow attributes to request write buffer be pre-allocated. ... Browse Code »
2

md/raid allows metadata management to be performed in user-space.
A various times, particularly on device failure, the metadata needs
to be updated before further writes can be permitted.
This means that the user-space program which updates metadata much
not block on writeout, and so must not allocate memory.

mlockall(MCL_CURRENT|MCL_FUTURE) and pre-allocation can avoid all
memory allocation issues for user-memory, but that does not help
kernel memory.
Several kernel objects can be pre-allocated. e.g. files opened before
any writes to the array are permitted.
However some kernel allocation happens in places that cannot be
pre-allocated.
In particular, writes to sysfs files (to tell md that it can now
allow writes to the array) allocate a buffer using GFP_KERNEL.

This patch allows attributes to be marked as "PREALLOC". In that case
the maximal buffer is allocated when the file is opened, and then used
on each write instead of allocating a new buffer.

As the same buffer is now shared for all writes on the same file
description, the mutex is extended to cover full use of the buffer
including the copy_from_user().

The new __ATTR_PREALLOC() 'or's a new flag in to the 'mode', which is
inspected by sysfs_add_file_mode_ns() to determine if the file should be
marked as requiring prealloc.

Despite the comment, we *do* use ->seq_show together with ->prealloc
in this patch. The next patch fixes that.

Signed-off-by: NeilBrown
Reviewed-by: Tejun Heo
Signed-off-by: Greg Kroah-Hartman

NeilBrown
2014-11-08 02:53:25 +0800

09 Oct, 2014

1 commit

9b053f320 vfs: Remove unnecessary calls of check_submounts_and_drop ... Browse Code »

Now that check_submounts_and_drop can not fail and is called from
d_invalidate there is no longer a need to call check_submounts_and_drom
from filesystem d_revalidate methods so remove it.

Reviewed-by: Miklos Szeredi
Signed-off-by: "Eric W. Biederman"
Signed-off-by: Al Viro

Eric W. Biederman
2014-10-09 14:38:56 +0800

22 Jul, 2014

1 commit

90125edbc Merge 3.16-rc6 into driver-core-next ... Browse Code »

We want the platform changes in here as well.

Signed-off-by: Greg Kroah-Hartman

Greg Kroah-Hartman
2014-07-22 01:07:25 +0800

11 Jul, 2014

1 commit

40f612373 Merge branch 'for-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup ... Browse Code »

Pull cgroup fixes from Tejun Heo:
"Mostly fixes for the fallouts from the recent cgroup core changes.

The decoupled nature of cgroup dynamic hierarchy management
(hierarchies are created dynamically on mount but may or may not be
reused once unmounted depending on remaining usages) led to more
ugliness being added to kernfs.

Hopefully, this is the last of it"

* 'for-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cpuset: break kernfs active protection in cpuset_write_resmask()
cgroup: fix a race between cgroup_mount() and cgroup_kill_sb()
kernfs: introduce kernfs_pin_sb()
cgroup: fix mount failure in a corner case
cpuset,mempolicy: fix sleeping function called from invalid context
cgroup: fix broken css_has_online_children()

Linus Torvalds
2014-07-11 02:38:23 +0800

10 Jul, 2014

1 commit

8278bd3ab kernfs: kernel-doc warning fix ... Browse Code »

s/static_name/name_is_static

Signed-off-by: Fabian Frederick
Acked-by: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Greg Kroah-Hartman

Fabian Frederick
2014-07-10 07:37:29 +0800

03 Jul, 2014

1 commit

ecca47ce8 kernfs: kernfs_notify() must be useable from non-sleepable contexts ... Browse Code »

d911d9874801 ("kernfs: make kernfs_notify() trigger inotify events
too") added fsnotify triggering to kernfs_notify() which requires a
sleepable context. There are already existing users of
kernfs_notify() which invoke it from an atomic context and in general
it's silly to require a sleepable context for triggering a
notification.

The following is an invalid context bug triggerd by md invoking
sysfs_notify() from IO completion path.

BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586
in_atomic(): 1, irqs_disabled(): 1, pid: 0, name: swapper/1
2 locks held by swapper/1/0:
#0: (&(&vblk->vq_lock)->rlock){-.-...}, at: [] virtblk_done+0x42/0xe0 [virtio_blk]
#1: (&(&bitmap->counts.lock)->rlock){-.....}, at: [] bitmap_endwrite+0x68/0x240
irq event stamp: 33518
hardirqs last enabled at (33515): [] default_idle+0x1f/0x230
hardirqs last disabled at (33516): [] common_interrupt+0x6d/0x72
softirqs last enabled at (33518): [] _local_bh_enable+0x22/0x50
softirqs last disabled at (33517): [] irq_enter+0x60/0x80
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.16.0-0.rc2.git2.1.fc21.x86_64 #1
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
0000000000000000 f90db13964f4ee05 ffff88007d403b80 ffffffff81807b4c
0000000000000000 ffff88007d403ba8 ffffffff810d4f14 0000000000000000
0000000000441800 ffff880078fa1780 ffff88007d403c38 ffffffff8180caf2
Call Trace:
[] dump_stack+0x4d/0x66
[] __might_sleep+0x184/0x240
[] mutex_lock_nested+0x42/0x440
[] kernfs_notify+0x90/0x150
[] bitmap_endwrite+0xcc/0x240
[] close_write+0x93/0xb0 [raid1]
[] r1_bio_write_done+0x29/0x50 [raid1]
[] raid1_end_write_request+0xe4/0x260 [raid1]
[] bio_endio+0x6b/0xa0
[] blk_update_request+0x94/0x420
[] blk_mq_end_io+0x1a/0x70
[] virtblk_request_done+0x32/0x80 [virtio_blk]
[] __blk_mq_complete_request+0x88/0x120
[] blk_mq_complete_request+0x2a/0x30
[] virtblk_done+0x66/0xe0 [virtio_blk]
[] vring_interrupt+0x3a/0xa0 [virtio_ring]
[] handle_irq_event_percpu+0x77/0x340
[] handle_irq_event+0x3d/0x60
[] handle_edge_irq+0x66/0x130
[] handle_irq+0x84/0x150
[] do_IRQ+0x4d/0xe0
[] common_interrupt+0x72/0x72
[] ? native_safe_halt+0x6/0x10
[] default_idle+0x24/0x230
[] arch_cpu_idle+0xf/0x20
[] cpu_startup_entry+0x37c/0x7b0
[] start_secondary+0x25b/0x300

This patch fixes it by punting the notification delivery through a
work item. This ends up adding an extra pointer to kernfs_elem_attr
enlarging kernfs_node by a pointer, which is not ideal but not a very
big deal either. If this turns out to be an actual issue, we can move
kernfs_elem_attr->size to kernfs_node->iattr later.

Signed-off-by: Tejun Heo
Reported-by: Josh Boyer
Cc: Jens Axboe
Reviewed-by: Michael S. Tsirkin
Signed-off-by: Greg Kroah-Hartman

Tejun Heo
2014-07-03 00:32:09 +0800

30 Jun, 2014

1 commit

4e26445fa kernfs: introduce kernfs_pin_sb() ... Browse Code »

kernfs_pin_sb() tries to get a refcnt of the superblock.

This will be used by cgroupfs.

v2:
- make kernfs_pin_sb() return the superblock.
- drop kernfs_drop_sb().

tj: Updated the comment a bit.

[ This is a prerequisite for a bugfix. ]
Cc: # 3.15
Acked-by: Greg Kroah-Hartman
Signed-off-by: Li Zefan
Signed-off-by: Tejun Heo

Li Zefan
2014-06-30 22:16:25 +0800

28 May, 2014

1 commit

26fc9cd20 kernfs: move the last knowledge of sysfs out from kernfs ... Browse Code »
13

There is still one residue of sysfs remaining: the sb_magic
SYSFS_MAGIC. However this should be kernfs user specific,
so this patch moves it out. Kerrnfs user should specify their
magic number while mouting.

Signed-off-by: Jianyu Zhan
Acked-by: Tejun Heo
Signed-off-by: Greg Kroah-Hartman

Jianyu Zhan
2014-05-28 05:33:17 +0800

23 May, 2014

1 commit

cbfef5336 Merge 3.15-rc6 into driver-core-next ... Browse Code »

We want the kernfs fixes in this branch as well for testing.

Signed-off-by: Greg Kroah-Hartman

Greg Kroah-Hartman
2014-05-23 09:13:53 +0800

13 May, 2014

1 commit

555724a83 kernfs, sysfs, cgroup: restrict extra perm check on open to sysfs ... Browse Code »

The kernfs open method - kernfs_fop_open() - inherited extra
permission checks from sysfs. While the vfs layer allows ignoring the
read/write permissions checks if the issuer has CAP_DAC_OVERRIDE,
sysfs explicitly denied open regardless of the cap if the file doesn't
have any of the UGO perms of the requested access or doesn't implement
the requested operation. It can be debated whether this was a good
idea or not but the behavior is too subtle and dangerous to change at
this point.

After cgroup got converted to kernfs, this extra perm check also got
applied to cgroup breaking libcgroup which opens write-only files with
O_RDWR as root. This patch gates the extra open permission check with
a new flag KERNFS_ROOT_EXTRA_OPEN_PERM_CHECK and enables it for sysfs.
For sysfs, nothing changes. For cgroup, root now can perform any
operation regardless of the permissions as it was before kernfs
conversion. Note that kernfs still fails unimplemented operations
with -EINVAL.

While at it, add comments explaining KERNFS_ROOT flags.

Signed-off-by: Tejun Heo
Reported-by: Andrey Wagin
Tested-by: Andrey Wagin
Cc: Li Zefan
References: http://lkml.kernel.org/g/CANaxB-xUm3rJ-Cbp72q-rQJO5mZe1qK6qXsQM=vh0U8upJ44+A@mail.gmail.com
Fixes: 2bd59d48ebfb ("cgroup: convert to kernfs")
Signed-off-by: Greg Kroah-Hartman

Tejun Heo
2014-05-13 19:21:40 +0800

28 Apr, 2014

1 commit

d35cc56dd Merge 3.15-rc3 into staging-next Browse Code »

Greg Kroah-Hartman
2014-04-28 12:36:39 +0800

26 Apr, 2014

4 commits

b44b21402 kernfs: add back missing error check in kernfs_fop_mmap() ... Browse Code »
5

While updating how mmap enabled kernfs files are handled by lockdep,
9b2db6e18945 ("sysfs: bail early from kernfs_file_mmap() to avoid
spurious lockdep warning") inadvertently dropped error return check
from kernfs_file_mmap(). The intention was just dropping "if
(ops->mmap)" check as the control won't reach the point if the mmap
callback isn't implemented, but I mistakenly removed the error return
check together with it.

This led to Xorg crash on i810 which was reported and bisected to the
commit and then to the specific change by Tobias.

Signed-off-by: Tejun Heo
Reported-and-bisected-by: Tobias Powalowski
Tested-by: Tobias Powalowski
References: http://lkml.kernel.org/g/533D01BD.1010200@googlemail.com
Cc: stable # 3.14
Signed-off-by: Greg Kroah-Hartman

Tejun Heo
2014-04-26 03:25:13 +0800
c1befb885 kernfs: fix a subdir count leak ... Browse Code »

Currently kernfs_link_sibling() increates parent->dir.subdirs before
adding the node into parent's chidren rb tree.

Because it is possible that kernfs_link_sibling() couldn't find
a suitable slot and bail out, this leads to a mismatch between
elevated subdir count with actual children node numbers.

This patches fix this problem, by moving the subdir accouting
after the actual addtion happening.

Signed-off-by: Jianyu Zhan
Acked-by: Tejun Heo
Signed-off-by: Greg Kroah-Hartman

Jianyu Zhan
2014-04-26 03:25:13 +0800
d911d9874 kernfs: make kernfs_notify() trigger inotify events too ... Browse Code »
15

kernfs_notify() is used to indicate either new data is available or
the content of a file has changed. It currently only triggers poll
which may not be the most convenient to monitor especially when there
are a lot to monitor. Let's hook it up to fsnotify too so that the
events can be monitored via inotify too.

fsnotify_modify() requires file * but kernfs_notify() doesn't have any
specific file associated; however, we can walk all super_blocks
associated with a kernfs_root and as kernfs always associate one ino
with inode and one dentry with an inode, it's trivial to look up the
dentry associated with a given kernfs_node. As any active monitor
would pin dentry, just looking up existing dentry is enough. This
patch looks up the dentry associated with the specified kernfs_node
and generates events equivalent to fsnotify_modify().

Note that as fsnotify doesn't provide fsnotify_modify() equivalent
which can be called with dentry, kernfs_notify() directly calls
fsnotify_parent() and fsnotify(). It might be better to add a wrapper
in fsnotify.h instead.

Signed-off-by: Tejun Heo
Cc: John McCutchan
Cc: Robert Love
Cc: Eric Paris
Signed-off-by: Greg Kroah-Hartman

Tejun Heo
2014-04-26 02:43:31 +0800
7d568a838 kernfs: implement kernfs_root->supers list ... Browse Code »

Currently, there's no way to find out which super_blocks are
associated with a given kernfs_root. Let's implement it - the planned
inotify extension to kernfs_notify() needs it.

Make kernfs_super_info point back to the super_block and chain it at
kernfs_root->supers.

Signed-off-by: Tejun Heo
Signed-off-by: Greg Kroah-Hartman

Tejun Heo
2014-04-26 02:43:31 +0800

17 Apr, 2014

1 commit

4afddd60a kernfs: protect lazy kernfs_iattrs allocation with mutex ... Browse Code »
5

kernfs_iattrs is allocated lazily when operations which require it
take place; unfortunately, the lazy allocation and returning weren't
properly synchronized and when there are multiple concurrent
operations, it might end up returning kernfs_iattrs which hasn't
finished initialization yet or different copies to different callers.

Fix it by synchronizing with a mutex. This can be smarter with memory
barriers but let's go there if it actually turns out to be necessary.

Signed-off-by: Tejun Heo
Link: http://lkml.kernel.org/g/533ABA32.9080602@oracle.com
Reported-by: Sasha Levin
Cc: stable@vger.kernel.org # 3.14
Signed-off-by: Greg Kroah-Hartman

Tejun Heo
2014-04-17 02:54:40 +0800

04 Apr, 2014

3 commits

76ca7d1cc Merge branch 'akpm' (incoming from Andrew) ... Browse Code »

Merge first patch-bomb from Andrew Morton:
- Various misc bits
- kmemleak fixes
- small befs, codafs, cifs, efs, freexxfs, hfsplus, minixfs, reiserfs things
- fanotify
- I appear to have become SuperH maintainer
- ocfs2 updates
- direct-io tweaks
- a bit of the MM queue
- printk updates
- MAINTAINERS maintenance
- some backlight things
- lib/ updates
- checkpatch updates
- the rtc queue
- nilfs2 updates
- Small Documentation/ updates

* emailed patches from Andrew Morton : (237 commits)
Documentation/SubmittingPatches: remove references to patch-scripts
Documentation/SubmittingPatches: update some dead URLs
Documentation/filesystems/ntfs.txt: remove changelog reference
Documentation/kmemleak.txt: updates
fs/reiserfs/super.c: add __init to init_inodecache
fs/reiserfs: move prototype declaration to header file
fs/hfsplus/attributes.c: add __init to hfsplus_create_attr_tree_cache()
fs/hfsplus/extents.c: fix concurrent acess of alloc_blocks
fs/hfsplus/extents.c: remove unused variable in hfsplus_get_block
nilfs2: update project's web site in nilfs2.txt
nilfs2: update MAINTAINERS file entries fix
nilfs2: verify metadata sizes read from disk
nilfs2: add FITRIM ioctl support for nilfs2
nilfs2: add nilfs_sufile_trim_fs to trim clean segs
nilfs2: implementation of NILFS_IOCTL_SET_SUINFO ioctl
nilfs2: add nilfs_sufile_set_suinfo to update segment usage
nilfs2: add struct nilfs_suinfo_update and flags
nilfs2: update MAINTAINERS file entries
fs/coda/inode.c: add __init to init_inodecache()
BEFS: logging cleanup
...

Linus Torvalds
2014-04-04 07:22:16 +0800
91b0abe36 mm + fs: store shadow entries in page cache ... Browse Code »

Reclaim will be leaving shadow entries in the page cache radix tree upon
evicting the real page. As those pages are found from the LRU, an
iput() can lead to the inode being freed concurrently. At this point,
reclaim must no longer install shadow pages because the inode freeing
code needs to ensure the page tree is really empty.

Add an address_space flag, AS_EXITING, that the inode freeing code sets
under the tree lock before doing the final truncate. Reclaim will check
for this flag before installing shadow pages.

Signed-off-by: Johannes Weiner
Reviewed-by: Rik van Riel
Reviewed-by: Minchan Kim
Cc: Andrea Arcangeli
Cc: Bob Liu
Cc: Christoph Hellwig
Cc: Dave Chinner
Cc: Greg Thelen
Cc: Hugh Dickins
Cc: Jan Kara
Cc: KOSAKI Motohiro
Cc: Luigi Semenzato
Cc: Mel Gorman
Cc: Metin Doslu
Cc: Michel Lespinasse
Cc: Ozgun Erdogan
Cc: Peter Zijlstra
Cc: Roman Gushchin
Cc: Ryan Mallon
Cc: Tejun Heo
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2014-04-04 07:21:01 +0800
32d01dc7b Merge branch 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup ... Browse Code »

Pull cgroup updates from Tejun Heo:
"A lot updates for cgroup:

- The biggest one is cgroup's conversion to kernfs. cgroup took
after the long abandoned vfs-entangled sysfs implementation and
made it even more convoluted over time. cgroup's internal objects
were fused with vfs objects which also brought in vfs locking and
object lifetime rules. Naturally, there are places where vfs rules
don't fit and nasty hacks, such as credential switching or lock
dance interleaving inode mutex and cgroup_mutex with object serial
number comparison thrown in to decide whether the operation is
actually necessary, needed to be employed.

After conversion to kernfs, internal object lifetime and locking
rules are mostly isolated from vfs interactions allowing shedding
of several nasty hacks and overall simplification. This will also
allow implmentation of operations which may affect multiple cgroups
which weren't possible before as it would have required nesting
i_mutexes.

- Various simplifications including dropping of module support,
easier cgroup name/path handling, simplified cgroup file type
handling and task_cg_lists optimization.

- Prepatory changes for the planned unified hierarchy, which is still
a patchset away from being actually operational. The dummy
hierarchy is updated to serve as the default unified hierarchy.
Controllers which aren't claimed by other hierarchies are
associated with it, which BTW was what the dummy hierarchy was for
anyway.

- Various fixes from Li and others. This pull request includes some
patches to add missing slab.h to various subsystems. This was
triggered xattr.h include removal from cgroup.h. cgroup.h
indirectly got included a lot of files which brought in xattr.h
which brought in slab.h.

There are several merge commits - one to pull in kernfs updates
necessary for converting cgroup (already in upstream through
driver-core), others for interfering changes in the fixes branch"

* 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (74 commits)
cgroup: remove useless argument from cgroup_exit()
cgroup: fix spurious lockdep warning in cgroup_exit()
cgroup: Use RCU_INIT_POINTER(x, NULL) in cgroup.c
cgroup: break kernfs active_ref protection in cgroup directory operations
cgroup: fix cgroup_taskset walking order
cgroup: implement CFTYPE_ONLY_ON_DFL
cgroup: make cgrp_dfl_root mountable
cgroup: drop const from @buffer of cftype->write_string()
cgroup: rename cgroup_dummy_root and related names
cgroup: move ->subsys_mask from cgroupfs_root to cgroup
cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding
cgroup: remove NULL checks from [pr_cont_]cgroup_{name|path}()
cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root
cgroup: reorganize cgroup bootstrapping
cgroup: relocate setting of CGRP_DEAD
cpuset: use rcu_read_lock() to protect task_cs()
cgroup_freezer: document freezer_fork() subtleties
cgroup: update cgroup_transfer_tasks() to either succeed or fail
cgroup: drop task_lock() protection around task->cgroups
cgroup: update how a newly forked task gets associated with css_set
...

Linus Torvalds
2014-04-04 04:05:42 +0800

09 Mar, 2014

2 commits

b7ce40cff kernfs: cache atomic_write_len in kernfs_open_file ... Browse Code »

While implementing atomic_write_len, 4d3773c4bb41 ("kernfs: implement
kernfs_ops->atomic_write_len") moved data copy from userland inside
kernfs_get_active() and kernfs_open_file->mutex so that
kernfs_ops->atomic_write_len can be accessed before copying buffer
from userland; unfortunately, this could lead to locking order
inversion involving mmap_sem if copy_from_user() takes a page fault.

======================================================
[ INFO: possible circular locking dependency detected ]
3.14.0-rc4-next-20140228-sasha-00011-g4077c67-dirty #26 Tainted: G W
-------------------------------------------------------
trinity-c236/10658 is trying to acquire lock:
(&of->mutex#2){+.+.+.}, at: [] kernfs_fop_mmap+0x54/0x120

but task is already holding lock:
(&mm->mmap_sem){++++++}, at: [] vm_mmap_pgoff+0x6e/0xe0

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&mm->mmap_sem){++++++}:
[] validate_chain+0x6c5/0x7b0
[] __lock_acquire+0x4cd/0x5a0
[] lock_acquire+0x182/0x1d0
[] might_fault+0x7e/0xb0
[] kernfs_fop_write+0xd8/0x190
[] vfs_write+0xe3/0x1d0
[] SyS_write+0x5d/0xa0
[] tracesys+0xdd/0xe2

-> #0 (&of->mutex#2){+.+.+.}:
[] check_prev_add+0x13f/0x560
[] validate_chain+0x6c5/0x7b0
[] __lock_acquire+0x4cd/0x5a0
[] lock_acquire+0x182/0x1d0
[] mutex_lock_nested+0x6a/0x510
[] kernfs_fop_mmap+0x54/0x120
[] mmap_region+0x310/0x5c0
[] do_mmap_pgoff+0x385/0x430
[] vm_mmap_pgoff+0x8f/0xe0
[] SyS_mmap_pgoff+0x1b0/0x210
[] SyS_mmap+0x1d/0x20
[] tracesys+0xdd/0xe2

other info that might help us debug this:

Possible unsafe locking scenario:

CPU0 CPU1
---- ----
lock(&mm->mmap_sem);
lock(&of->mutex#2);
lock(&mm->mmap_sem);
lock(&of->mutex#2);

*** DEADLOCK ***

1 lock held by trinity-c236/10658:
#0: (&mm->mmap_sem){++++++}, at: [] vm_mmap_pgoff+0x6e/0xe0

stack backtrace:
CPU: 2 PID: 10658 Comm: trinity-c236 Tainted: G W 3.14.0-rc4-next-20140228-sasha-00011-g4077c67-dirty #26
0000000000000000 ffff88011911fa48 ffffffff8438e945 0000000000000000
0000000000000000 ffff88011911fa98 ffffffff811a0109 ffff88011911fab8
ffff88011911fab8 ffff88011911fa98 ffff880119128cc0 ffff880119128cf8
Call Trace:
[] dump_stack+0x52/0x7f
[] print_circular_bug+0x129/0x160
[] check_prev_add+0x13f/0x560
[] ? deactivate_slab+0x511/0x550
[] validate_chain+0x6c5/0x7b0
[] __lock_acquire+0x4cd/0x5a0
[] ? mmap_region+0x24a/0x5c0
[] lock_acquire+0x182/0x1d0
[] ? kernfs_fop_mmap+0x54/0x120
[] mutex_lock_nested+0x6a/0x510
[] ? kernfs_fop_mmap+0x54/0x120
[] ? get_parent_ip+0x11/0x50
[] ? kernfs_fop_mmap+0x54/0x120
[] kernfs_fop_mmap+0x54/0x120
[] mmap_region+0x310/0x5c0
[] do_mmap_pgoff+0x385/0x430
[] ? vm_mmap_pgoff+0x6e/0xe0
[] vm_mmap_pgoff+0x8f/0xe0
[] ? __rcu_read_unlock+0x44/0xb0
[] ? dup_fd+0x3c0/0x3c0
[] SyS_mmap_pgoff+0x1b0/0x210
[] SyS_mmap+0x1d/0x20
[] tracesys+0xdd/0xe2

Fix it by caching atomic_write_len in kernfs_open_file during open so
that it can be determined without accessing kernfs_ops in
kernfs_fop_write(). This restores the structure of kernfs_fop_write()
before 4d3773c4bb41 with updated @len determination logic.

Signed-off-by: Tejun Heo
Reported-by: Sasha Levin
References: http://lkml.kernel.org/g/53113485.2090407@oracle.com
Signed-off-by: Greg Kroah-Hartman

Tejun Heo
2014-03-09 14:08:29 +0800
88391d49a kernfs: fix off by one error. ... Browse Code »
5

The hash values 0 and 1 are reserved for magic directory entries, but
the code only prevents names hashing to 0. This patch fixes the test
to also prevent hash value 1.

Signed-off-by: Richard Cochran
Cc:
Reviewed-by: "Eric W. Biederman"
Signed-off-by: Greg Kroah-Hartman

Richard Cochran
2014-03-09 14:08:29 +0800

03 Mar, 2014

1 commit

13df79774 Merge 3.14-rc5 into driver-core-next ... Browse Code »

We want the fixes in here.

Greg Kroah-Hartman
2014-03-03 12:09:08 +0800

25 Feb, 2014

1 commit

fed95bab8 sysfs: fix namespace refcnt leak ... Browse Code »

As mount() and kill_sb() is not a one-to-one match, we shoudn't get
ns refcnt unconditionally in sysfs_mount(), and instead we should
get the refcnt only when kernfs_mount() allocated a new superblock.

v2:
- Changed the name of the new argument, suggested by Tejun.
- Made the argument optional, suggested by Tejun.

v3:
- Make the new argument as second-to-last arg, suggested by Tejun.

Signed-off-by: Li Zefan
Acked-by: Tejun Heo
---
fs/kernfs/mount.c | 8 +++++++-
fs/sysfs/mount.c | 5 +++--
include/linux/kernfs.h | 9 +++++----
3 files changed, 15 insertions(+), 7 deletions(-)
Signed-off-by: Greg Kroah-Hartman

Li Zefan
2014-02-25 23:37:52 +0800

15 Feb, 2014

1 commit

f41c59345 kernfs: fix kernfs_node_from_dentry() ... Browse Code »

Currently kernfs_node_from_dentry() returns NULL for root dentry,
because root_dentry->d_op == NULL.

Due to this bug cgroupstats_build() returns -EINVAL for root cgroup.

# mount -t cgroup -o cpuacct /cgroup
# Documentation/accounting/getdelays -C /cgroup
fatal reply error, errno -22

With this fix:

# Documentation/accounting/getdelays -C /cgroup
sleeping 305, blocked 0, running 1, stopped 0, uninterruptible 1

Signed-off-by: Li Zefan
Acked-by: Tejun Heo
Signed-off-by: Greg Kroah-Hartman

Li Zefan
2014-02-15 06:31:37 +0800

12 Feb, 2014

1 commit

e61734c55 cgroup: remove cgroup->name ... Browse Code »
26

cgroup->name handling became quite complicated over time involving
dedicated struct cgroup_name for RCU protection. Now that cgroup is
on kernfs, we can drop all of it and simply use kernfs_name/path() and
friends. Replace cgroup->name and all related code with kernfs
name/path constructs.

* Reimplement cgroup_name() and cgroup_path() as thin wrappers on top
of kernfs counterparts, which involves semantic changes.
pr_cont_cgroup_name() and pr_cont_cgroup_path() added.

* cgroup->name handling dropped from cgroup_rename().

* All users of cgroup_name/path() updated to the new semantics. Users
which were formatting the string just to printk them are converted
to use pr_cont_cgroup_name/path() instead, which simplifies things
quite a bit. As cgroup_name() no longer requires RCU read lock
around it, RCU lockings which were protecting only cgroup_name() are
removed.

v2: Comment above oom_info_lock updated as suggested by Michal.

v3: dummy_top doesn't have a kn associated and
pr_cont_cgroup_name/path() ended up calling the matching kernfs
functions with NULL kn leading to oops. Test for NULL kn and
print "/" if so. This issue was reported by Fengguang Wu.

v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").

Signed-off-by: Tejun Heo
Acked-by: Peter Zijlstra
Acked-by: Michal Hocko
Acked-by: Li Zefan
Cc: Fengguang Wu
Cc: Ingo Molnar
Cc: Johannes Weiner
Cc: Balbir Singh
Cc: KAMEZAWA Hiroyuki

Tejun Heo
2014-02-12 22:29:50 +0800

11 Feb, 2014

1 commit

9561a8961 kernfs: fix hash calculation in kernfs_rename_ns() ... Browse Code »

3eef34ad7dc3 ("kernfs: implement kernfs_get_parent(),
kernfs_name/path() and friends") restructured kernfs_rename_ns() such
that new name assignment happens under kernfs_rename_lock;
unfortunately, it mistakenly passed NULL to kernfs_name_hash() to
calculate the new hash if the name hasn't changed, which can lead to
oops.

Fix it by using kn->name and kn->ns when calculating the new hash.

Signed-off-by: Tejun Heo
Reported-by: Dan Carpenter dan.carpenter@oracle.com
Signed-off-by: Greg Kroah-Hartman

Tejun Heo
2014-02-11 08:00:19 +0800

08 Feb, 2014

10 commits

ba341d55a kernfs: add CONFIG_KERNFS ... Browse Code »

As sysfs was kernfs's only user, kernfs has been piggybacking on
CONFIG_SYSFS; however, kernfs is scheduled to grow a new user very
soon. Introduce a separate config option CONFIG_KERNFS which is to be
selected by kernfs users.

Signed-off-by: Tejun Heo
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman

Tejun Heo
2014-02-08 08:08:57 +0800
3eef34ad7 kernfs: implement kernfs_get_parent(), kernfs_name/path() and friends ... Browse Code »

kernfs_node->parent and ->name are currently marked as "published"
indicating that kernfs users may access them directly; however, those
fields may get updated by kernfs_rename[_ns]() and unrestricted access
may lead to erroneous values or oops.

Protect ->parent and ->name updates with a irq-safe spinlock
kernfs_rename_lock and implement the following accessors for these
fields.

* kernfs_name() - format the node's name into the specified buffer
* kernfs_path() - format the node's path into the specified buffer
* pr_cont_kernfs_name() - pr_cont a node's name (doesn't need buffer)
* pr_cont_kernfs_path() - pr_cont a node's path (doesn't need buffer)
* kernfs_get_parent() - pin and return a node's parent

All can be called under any context. The recursive sysfs_pathname()
in fs/sysfs/dir.c is replaced with kernfs_path() and
sysfs_rename_dir_ns() is updated to use kernfs_get_parent() instead of
dereferencing parent directly.

v2: Dummy definition of kernfs_path() for !CONFIG_KERNFS was missing
static inline making it cause a lot of build warnings. Add it.

Signed-off-by: Tejun Heo
Signed-off-by: Greg Kroah-Hartman

Tejun Heo
2014-02-08 08:05:35 +0800
0c23b2259 kernfs: implement kernfs_node_from_dentry(), kernfs_root_from_sb() and kernfs_rename() ... Browse Code »

Implement helpers to determine node from dentry and root from
super_block. Also add a kernfs_rename_ns() wrapper which assumes NULL
namespace. These generally make sense and will be used by cgroup.

v2: Some dummy implementations for !CONFIG_SYSFS was missing. Fixed.
Reported by kbuild test robot.

Signed-off-by: Tejun Heo
Cc: kbuild test robot
Signed-off-by: Greg Kroah-Hartman

Tejun Heo
2014-02-08 08:00:41 +0800
4d3773c4b kernfs: implement kernfs_ops->atomic_write_len ... Browse Code »

A write to a kernfs_node is buffered through a kernel buffer. Writes
atomic_write_len. If not set (zero), the
behavior stays the same. If set, writes upto the size are executed
atomically and larger writes are rejected with -E2BIG.

A different implementation strategy would be allowing configuring
chunking size while making the original write size available to the
write method; however, such strategy, while being more complicated,
doesn't really buy anything. If the write implementation has to
handle chunking, the specific chunk size shouldn't matter all that
much.

Signed-off-by: Tejun Heo
Signed-off-by: Greg Kroah-Hartman

Tejun Heo
2014-02-08 07:52:48 +0800
d35258ef7 kernfs: allow nodes to be created in the deactivated state ... Browse Code »

Currently, kernfs_nodes are made visible to userland on creation,
which makes it difficult for kernfs users to atomically succeed or
fail creation of multiple nodes. In addition, if something fails
after creating some nodes, the created nodes might already be in use
and their active refs need to be drained for removal, which has the
potential to introduce tricky reverse locking dependency on active_ref
depending on how the error path is synchronized.

This patch introduces per-root flag KERNFS_ROOT_CREATE_DEACTIVATED.
If set, all nodes under the root are created in the deactivated state
and stay invisible to userland until explicitly enabled by the new
kernfs_activate() API. Also, nodes which have never been activated
are guaranteed to bypass draining on removal thus allowing error paths
to not worry about lockding dependency on active_ref draining.

Signed-off-by: Tejun Heo
Signed-off-by: Greg Kroah-Hartman

Tejun Heo
2014-02-08 07:52:48 +0800
b9c9dad0c kernfs: add missing kernfs_active() checks in directory operations ... Browse Code »

kernfs_iop_lookup(), kernfs_dir_pos() and kernfs_dir_next_pos() were
missing kernfs_active() tests before using the found kernfs_node. As
deactivated state is currently visible only while a node is being
removed, this doesn't pose an actual problem. e.g. lookup succeeding
on a deactivated node doesn't harm anything as the eventual file
operations are gonna fail and those failures are indistinguishible
from the cases in which the lookups had happened before the node was
deactivated.

However, we're gonna allow new nodes to be created deactivated and
then activated explicitly by the kernfs user when it sees fit. This
is to support atomically making multiple nodes visible to userland and
thus those nodes must not be visible to userland before activated.

Let's plug the lookup and readdir holes so that deactivated nodes are
invisible to userland.

Signed-off-by: Tejun Heo
Signed-off-by: Greg Kroah-Hartman

Tejun Heo
2014-02-08 07:52:48 +0800
6a7fed4ee kernfs: implement kernfs_syscall_ops->remount_fs() and ->show_options() ... Browse Code »

Add two super_block related syscall callbacks ->remount_fs() and
->show_options() to kernfs_syscall_ops. These simply forward the
matching super_operations.

Signed-off-by: Tejun Heo
Signed-off-by: Greg Kroah-Hartman

Tejun Heo
2014-02-08 07:52:48 +0800
90c07c895 kernfs: rename kernfs_dir_ops to kernfs_syscall_ops ... Browse Code »

We're gonna need non-dir syscall callbacks, which will make dir_ops a
misnomer. Let's rename kernfs_dir_ops to kernfs_syscall_ops.

This is pure rename.

Signed-off-by: Tejun Heo
Signed-off-by: Greg Kroah-Hartman

Tejun Heo
2014-02-08 07:52:48 +0800
07c7530dd kernfs: invoke dir_ops while holding active ref of the target node ... Browse Code »

kernfs_dir_ops are currently being invoked without any active
reference, which makes it tricky for the invoked operations to
determine whether the objects associated those nodes are safe to
access and will remain that way for the duration of such operations.

kernfs already has active_ref mechanism to deal with this which makes
the removal of a given node the synchronization point for gating the
file operations. There's no reason for dir_ops to be any different.
Update the dir_ops handling so that active_ref is held while the
dir_ops are executing. This guarantees that while a dir_ops is
executing the target nodes stay alive.

As kernfs_dir_ops doesn't have any in-kernel user at this point, this
doesn't affect anybody.

Signed-off-by: Tejun Heo
Signed-off-by: Greg Kroah-Hartman

Tejun Heo
2014-02-08 07:52:48 +0800
6b0afc2a2 kernfs, sysfs, driver-core: implement kernfs_remove_self() and its wrappers ... Browse Code »

Sometimes it's necessary to implement a node which wants to delete
nodes including itself. This isn't straightforward because of kernfs
active reference. While a file operation is in progress, an active
reference is held and kernfs_remove() waits for all such references to
drain before completing. For a self-deleting node, this is a deadlock
as kernfs_remove() ends up waiting for an active reference that itself
is sitting on top of.

This currently is worked around in the sysfs layer using
sysfs_schedule_callback() which makes such removals asynchronous.
While it works, it's rather cumbersome and inherently breaks
synchronicity of the operation - the file operation which triggered
the operation may complete before the removal is finished (or even
started) and the removal may fail asynchronously. If a removal
operation is immmediately followed by another operation which expects
the specific name to be available (e.g. removal followed by rename
onto the same name), there's no way to make the latter operation
reliable.

The thing is there's no inherent reason for this to be asynchrnous.
All that's necessary to do this synchronous is a dedicated operation
which drops its own active ref and deactivates self. This patch
implements kernfs_remove_self() and its wrappers in sysfs and driver
core. kernfs_remove_self() is to be called from one of the file
operations, drops the active ref the task is holding, removes the self
node, and restores active ref to the dead node so that the ref is
balanced afterwards. __kernfs_remove() is updated so that it takes an
early exit if the target node is already fully removed so that the
active ref restored by kernfs_remove_self() after removal doesn't
confuse the deactivation path.

This makes implementing self-deleting nodes very easy. The normal
removal path doesn't even need to be changed to use
kernfs_remove_self() for the self-deleting node. The method can
invoke kernfs_remove_self() on itself before proceeding the normal
removal path. kernfs_remove() invoked on the node by the normal
deletion path will simply be ignored.

This will replace sysfs_schedule_callback(). A subtle feature of
sysfs_schedule_callback() is that it collapses multiple invocations -
even if multiple removals are triggered, the removal callback is run
only once. An equivalent effect can be achieved by testing the return
value of kernfs_remove_self() - only the one which gets %true return
value should proceed with actual deletion. All other instances of
kernfs_remove_self() will wait till the enclosing kernfs operation
which invoked the winning instance of kernfs_remove_self() finishes
and then return %false. This trivially makes all users of
kernfs_remove_self() automatically show correct synchronous behavior
even when there are multiple concurrent operations - all "echo 1 >
delete" instances will finish only after the whole operation is
completed by one of the instances.

Note that manipulation of active ref is implemented in separate public
functions - kernfs_[un]break_active_protection().
kernfs_remove_self() is the only user at the moment but this will be
used to cater to more complex cases.

v2: For !CONFIG_SYSFS, dummy version kernfs_remove_self() was missing
and sysfs_remove_file_self() had incorrect return type. Fix it.
Reported by kbuild test bot.

v3: kernfs_[un]break_active_protection() separated out from
kernfs_remove_self() and exposed as public API.

Signed-off-by: Tejun Heo
Cc: Alan Stern
Cc: kbuild test robot
Signed-off-by: Greg Kroah-Hartman

Tejun Heo
2014-02-08 07:42:41 +0800