09 Oct, 2014

1 commit


22 Jul, 2014

1 commit


11 Jul, 2014

1 commit

  • Pull cgroup fixes from Tejun Heo:
    "Mostly fixes for the fallouts from the recent cgroup core changes.

    The decoupled nature of cgroup dynamic hierarchy management
    (hierarchies are created dynamically on mount but may or may not be
    reused once unmounted depending on remaining usages) led to more
    ugliness being added to kernfs.

    Hopefully, this is the last of it"

    * 'for-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cpuset: break kernfs active protection in cpuset_write_resmask()
    cgroup: fix a race between cgroup_mount() and cgroup_kill_sb()
    kernfs: introduce kernfs_pin_sb()
    cgroup: fix mount failure in a corner case
    cpuset,mempolicy: fix sleeping function called from invalid context
    cgroup: fix broken css_has_online_children()

    Linus Torvalds
     

10 Jul, 2014

1 commit


03 Jul, 2014

1 commit

  • d911d9874801 ("kernfs: make kernfs_notify() trigger inotify events
    too") added fsnotify triggering to kernfs_notify() which requires a
    sleepable context. There are already existing users of
    kernfs_notify() which invoke it from an atomic context and in general
    it's silly to require a sleepable context for triggering a
    notification.

    The following is an invalid context bug triggered by md invoking
    sysfs_notify() from the IO completion path.

    BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586
    in_atomic(): 1, irqs_disabled(): 1, pid: 0, name: swapper/1
    2 locks held by swapper/1/0:
    #0: (&(&vblk->vq_lock)->rlock){-.-...}, at: [] virtblk_done+0x42/0xe0 [virtio_blk]
    #1: (&(&bitmap->counts.lock)->rlock){-.....}, at: [] bitmap_endwrite+0x68/0x240
    irq event stamp: 33518
    hardirqs last enabled at (33515): [] default_idle+0x1f/0x230
    hardirqs last disabled at (33516): [] common_interrupt+0x6d/0x72
    softirqs last enabled at (33518): [] _local_bh_enable+0x22/0x50
    softirqs last disabled at (33517): [] irq_enter+0x60/0x80
    CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.16.0-0.rc2.git2.1.fc21.x86_64 #1
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    0000000000000000 f90db13964f4ee05 ffff88007d403b80 ffffffff81807b4c
    0000000000000000 ffff88007d403ba8 ffffffff810d4f14 0000000000000000
    0000000000441800 ffff880078fa1780 ffff88007d403c38 ffffffff8180caf2
    Call Trace:
    [] dump_stack+0x4d/0x66
    [] __might_sleep+0x184/0x240
    [] mutex_lock_nested+0x42/0x440
    [] kernfs_notify+0x90/0x150
    [] bitmap_endwrite+0xcc/0x240
    [] close_write+0x93/0xb0 [raid1]
    [] r1_bio_write_done+0x29/0x50 [raid1]
    [] raid1_end_write_request+0xe4/0x260 [raid1]
    [] bio_endio+0x6b/0xa0
    [] blk_update_request+0x94/0x420
    [] blk_mq_end_io+0x1a/0x70
    [] virtblk_request_done+0x32/0x80 [virtio_blk]
    [] __blk_mq_complete_request+0x88/0x120
    [] blk_mq_complete_request+0x2a/0x30
    [] virtblk_done+0x66/0xe0 [virtio_blk]
    [] vring_interrupt+0x3a/0xa0 [virtio_ring]
    [] handle_irq_event_percpu+0x77/0x340
    [] handle_irq_event+0x3d/0x60
    [] handle_edge_irq+0x66/0x130
    [] handle_irq+0x84/0x150
    [] do_IRQ+0x4d/0xe0
    [] common_interrupt+0x72/0x72
    [] ? native_safe_halt+0x6/0x10
    [] default_idle+0x24/0x230
    [] arch_cpu_idle+0xf/0x20
    [] cpu_startup_entry+0x37c/0x7b0
    [] start_secondary+0x25b/0x300

    This patch fixes it by punting the notification delivery through a
    work item. This ends up adding an extra pointer to kernfs_elem_attr
    enlarging kernfs_node by a pointer, which is not ideal but not a very
    big deal either. If this turns out to be an actual issue, we can move
    kernfs_elem_attr->size to kernfs_node->iattr later.

    Signed-off-by: Tejun Heo
    Reported-by: Josh Boyer
    Cc: Jens Axboe
    Reviewed-by: Michael S. Tsirkin
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
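
The work-item deferral described in the commit above can be sketched in plain userspace C. This is only an illustrative model, not the kernel code: `struct node`, `node_notify()` and `notify_workfn()` are hypothetical names, and the locking that a real implementation would need around the pending list is omitted because the model is single-threaded. The extra `notify_next` pointer plays the role of the pointer the commit says is added to each attribute node.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel API. */
struct node {
    const char *name;
    struct node *notify_next;   /* extra per-node pointer; NULL when not queued */
    int delivered;              /* how many times delivery ran for this node */
};

/* Sentinel so that the last queued node still has a non-NULL link. */
struct node notify_list_end;
struct node *notify_list = &notify_list_end;

/* Callable from "atomic" context: it only links the node and never sleeps. */
void node_notify(struct node *n)
{
    assert(n != NULL);
    if (n->notify_next)         /* already pending; collapse duplicates */
        return;
    n->notify_next = notify_list;
    notify_list = n;
}

/* Runs later from "process" context, where sleeping would be allowed. */
int notify_workfn(void)
{
    int handled = 0;

    while (notify_list != &notify_list_end) {
        struct node *n = notify_list;

        notify_list = n->notify_next;
        n->notify_next = NULL;
        n->delivered++;         /* stand-in for the poll wakeup and fsnotify */
        handled++;
    }
    return handled;
}
```

Calling `node_notify()` twice on the same node before the worker runs queues it once, so the sleeping part of delivery happens a single time per pending node.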
     

30 Jun, 2014

1 commit

  • kernfs_pin_sb() tries to get a refcnt of the superblock.

    This will be used by cgroupfs.

    v2:
    - make kernfs_pin_sb() return the superblock.
    - drop kernfs_drop_sb().

    tj: Updated the comment a bit.

    [ This is a prerequisite for a bugfix. ]
    Cc: # 3.15
    Acked-by: Greg Kroah-Hartman
    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     

28 May, 2014

1 commit


23 May, 2014

1 commit


13 May, 2014

1 commit

  • The kernfs open method - kernfs_fop_open() - inherited extra
    permission checks from sysfs. While the vfs layer allows ignoring the
    read/write permissions checks if the issuer has CAP_DAC_OVERRIDE,
    sysfs explicitly denied open regardless of the cap if the file doesn't
    have any of the UGO perms of the requested access or doesn't implement
    the requested operation. It can be debated whether this was a good
    idea or not but the behavior is too subtle and dangerous to change at
    this point.

    After cgroup got converted to kernfs, this extra perm check also got
    applied to cgroup breaking libcgroup which opens write-only files with
    O_RDWR as root. This patch gates the extra open permission check with
    a new flag KERNFS_ROOT_EXTRA_OPEN_PERM_CHECK and enables it for sysfs.
    For sysfs, nothing changes. For cgroup, root now can perform any
    operation regardless of the permissions as it was before kernfs
    conversion. Note that kernfs still fails unimplemented operations
    with -EINVAL.

    While at it, add comments explaining KERNFS_ROOT flags.

    Signed-off-by: Tejun Heo
    Reported-by: Andrey Wagin
    Tested-by: Andrey Wagin
    Cc: Li Zefan
    References: http://lkml.kernel.org/g/CANaxB-xUm3rJ-Cbp72q-rQJO5mZe1qK6qXsQM=vh0U8upJ44+A@mail.gmail.com
    Fixes: 2bd59d48ebfb ("cgroup: convert to kernfs")
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
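
The gating described in the commit above can be modeled roughly as follows. This is a sketch under stated assumptions: `struct file_model`, `open_check()` and the `WANT_*` and `ROOT_EXTRA_OPEN_PERM_CHECK` constants are hypothetical stand-ins, and the `-EACCES` return value is an assumption, since the commit message only says the open is denied.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for KERNFS_ROOT_EXTRA_OPEN_PERM_CHECK. */
#define ROOT_EXTRA_OPEN_PERM_CHECK 0x1
#define WANT_READ  0x1
#define WANT_WRITE 0x2

struct file_model {
    unsigned root_flags;    /* flags of the root this file lives under */
    unsigned mode;          /* permission bits, e.g. 0200 */
    bool has_read_op;       /* file implements a read operation */
    bool has_write_op;      /* file implements a write operation */
};

/* Returns 0 if the open may proceed, a negative errno otherwise. */
int open_check(const struct file_model *f, unsigned want)
{
    /* Without the flag (the cgroup case), the extra check is skipped. */
    if (!(f->root_flags & ROOT_EXTRA_OPEN_PERM_CHECK))
        return 0;

    /* With the flag (the sysfs case), deny even a privileged opener if
     * no user/group/other bit grants the access or the op is missing. */
    if ((want & WANT_WRITE) && (!(f->mode & 0222) || !f->has_write_op))
        return -EACCES;
    if ((want & WANT_READ) && (!(f->mode & 0444) || !f->has_read_op))
        return -EACCES;
    return 0;
}
```

With a write-only file (mode 0200), a read-write open is refused when the flag is set and allowed when it is clear, which matches the libcgroup case the commit describes.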
     

28 Apr, 2014

1 commit


26 Apr, 2014

4 commits

  • While updating how mmap enabled kernfs files are handled by lockdep,
    9b2db6e18945 ("sysfs: bail early from kernfs_file_mmap() to avoid
    spurious lockdep warning") inadvertently dropped error return check
    from kernfs_file_mmap(). The intention was just dropping "if
    (ops->mmap)" check as the control won't reach the point if the mmap
    callback isn't implemented, but I mistakenly removed the error return
    check together with it.

    This led to Xorg crash on i810 which was reported and bisected to the
    commit and then to the specific change by Tobias.

    Signed-off-by: Tejun Heo
    Reported-and-bisected-by: Tobias Powalowski
    Tested-by: Tobias Powalowski
    References: http://lkml.kernel.org/g/533D01BD.1010200@googlemail.com
    Cc: stable # 3.14
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • Currently kernfs_link_sibling() increments parent->dir.subdirs before
    adding the node into the parent's children rb tree.

    Because kernfs_link_sibling() may fail to find a suitable slot and
    bail out, this leads to a mismatch between the elevated subdir count
    and the actual number of child nodes.

    This patch fixes the problem by moving the subdir accounting to
    after the actual addition.

    Signed-off-by: Jianyu Zhan
    Acked-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Jianyu Zhan
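
The ordering fix in the commit above amounts to "count only after the insert succeeded". A minimal userspace sketch, assuming a fixed-size array in place of the rb tree and hypothetical names (`struct dir_model`, `link_sibling()`):

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define MAX_CHILDREN 8

/* Hypothetical model of a directory node; an array replaces the rb tree. */
struct dir_model {
    const char *children[MAX_CHILDREN];
    int nr_children;
    int subdirs;            /* number of children that are directories */
};

/* Returns 0 on success, -EEXIST on a name clash, -ENOSPC when full. */
int link_sibling(struct dir_model *parent, const char *name, int is_dir)
{
    for (int i = 0; i < parent->nr_children; i++)
        if (strcmp(parent->children[i], name) == 0)
            return -EEXIST;     /* bail out before touching any counter */

    if (parent->nr_children == MAX_CHILDREN)
        return -ENOSPC;

    parent->children[parent->nr_children++] = name;

    /* Account for the subdirectory only once the node is really linked. */
    if (is_dir)
        parent->subdirs++;
    return 0;
}
```

A failed second insert of the same name leaves `subdirs` untouched, so the counter cannot drift away from the real number of children.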
     
  • kernfs_notify() is used to indicate either new data is available or
    the content of a file has changed. It currently only triggers poll,
    which may not be the most convenient interface, especially when there
    are a lot of files to monitor. Let's hook it up to fsnotify too so
    that the events can be monitored via inotify as well.

    fsnotify_modify() requires file * but kernfs_notify() doesn't have any
    specific file associated; however, we can walk all super_blocks
    associated with a kernfs_root and, as kernfs always associates one
    ino with one inode and one dentry with an inode, it's trivial to look
    up the dentry associated with a given kernfs_node. As any active monitor
    would pin dentry, just looking up existing dentry is enough. This
    patch looks up the dentry associated with the specified kernfs_node
    and generates events equivalent to fsnotify_modify().

    Note that as fsnotify doesn't provide fsnotify_modify() equivalent
    which can be called with dentry, kernfs_notify() directly calls
    fsnotify_parent() and fsnotify(). It might be better to add a wrapper
    in fsnotify.h instead.

    Signed-off-by: Tejun Heo
    Cc: John McCutchan
    Cc: Robert Love
    Cc: Eric Paris
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • Currently, there's no way to find out which super_blocks are
    associated with a given kernfs_root. Let's implement it - the planned
    inotify extension to kernfs_notify() needs it.

    Make kernfs_super_info point back to the super_block and chain it at
    kernfs_root->supers.

    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

17 Apr, 2014

1 commit

  • kernfs_iattrs is allocated lazily when operations which require it
    take place; unfortunately, the lazy allocation and returning weren't
    properly synchronized and when there are multiple concurrent
    operations, it might end up returning kernfs_iattrs which hasn't
    finished initialization yet or different copies to different callers.

    Fix it by synchronizing with a mutex. This can be smarter with memory
    barriers but let's go there if it actually turns out to be necessary.

    Signed-off-by: Tejun Heo
    Link: http://lkml.kernel.org/g/533ABA32.9080602@oracle.com
    Reported-by: Sasha Levin
    Cc: stable@vger.kernel.org # 3.14
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
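
The mutex-based fix in the commit above can be illustrated with a small userspace model. It is a sketch only: `struct node_model`, `struct iattrs_model` and `node_iattrs()` are hypothetical names, a pthread mutex stands in for the kernel mutex, and the default mode value is made up for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the lazily allocated attribute block. */
struct iattrs_model { int uid, gid, mode; };
struct node_model { struct iattrs_model *iattr; };

pthread_mutex_t iattr_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Returns the node's attribute block, allocating it on first use. */
struct iattrs_model *node_iattrs(struct node_model *n)
{
    struct iattrs_model *ret;

    pthread_mutex_lock(&iattr_mutex);
    if (!n->iattr) {
        struct iattrs_model *ia = calloc(1, sizeof(*ia));

        if (ia) {
            ia->mode = 0644;    /* finish initialization first ... */
            n->iattr = ia;      /* ... and only then publish the pointer */
        }
    }
    ret = n->iattr;
    pthread_mutex_unlock(&iattr_mutex);
    return ret;
}
```

Because allocation, initialization and publication all happen under one lock, concurrent callers either see no block yet or the same fully initialized one.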
     

04 Apr, 2014

3 commits

  • Merge first patch-bomb from Andrew Morton:
    - Various misc bits
    - kmemleak fixes
    - small befs, codafs, cifs, efs, freevxfs, hfsplus, minixfs, reiserfs things
    - fanotify
    - I appear to have become SuperH maintainer
    - ocfs2 updates
    - direct-io tweaks
    - a bit of the MM queue
    - printk updates
    - MAINTAINERS maintenance
    - some backlight things
    - lib/ updates
    - checkpatch updates
    - the rtc queue
    - nilfs2 updates
    - Small Documentation/ updates

    * emailed patches from Andrew Morton : (237 commits)
    Documentation/SubmittingPatches: remove references to patch-scripts
    Documentation/SubmittingPatches: update some dead URLs
    Documentation/filesystems/ntfs.txt: remove changelog reference
    Documentation/kmemleak.txt: updates
    fs/reiserfs/super.c: add __init to init_inodecache
    fs/reiserfs: move prototype declaration to header file
    fs/hfsplus/attributes.c: add __init to hfsplus_create_attr_tree_cache()
    fs/hfsplus/extents.c: fix concurrent acess of alloc_blocks
    fs/hfsplus/extents.c: remove unused variable in hfsplus_get_block
    nilfs2: update project's web site in nilfs2.txt
    nilfs2: update MAINTAINERS file entries fix
    nilfs2: verify metadata sizes read from disk
    nilfs2: add FITRIM ioctl support for nilfs2
    nilfs2: add nilfs_sufile_trim_fs to trim clean segs
    nilfs2: implementation of NILFS_IOCTL_SET_SUINFO ioctl
    nilfs2: add nilfs_sufile_set_suinfo to update segment usage
    nilfs2: add struct nilfs_suinfo_update and flags
    nilfs2: update MAINTAINERS file entries
    fs/coda/inode.c: add __init to init_inodecache()
    BEFS: logging cleanup
    ...

    Linus Torvalds
     
  • Reclaim will be leaving shadow entries in the page cache radix tree upon
    evicting the real page. As those pages are found from the LRU, an
    iput() can lead to the inode being freed concurrently. At this point,
    reclaim must no longer install shadow pages because the inode freeing
    code needs to ensure the page tree is really empty.

    Add an address_space flag, AS_EXITING, that the inode freeing code sets
    under the tree lock before doing the final truncate. Reclaim will check
    for this flag before installing shadow pages.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Pull cgroup updates from Tejun Heo:
    "A lot of updates for cgroup:

    - The biggest one is cgroup's conversion to kernfs. cgroup took
    after the long abandoned vfs-entangled sysfs implementation and
    made it even more convoluted over time. cgroup's internal objects
    were fused with vfs objects which also brought in vfs locking and
    object lifetime rules. Naturally, there are places where vfs rules
    don't fit and nasty hacks, such as credential switching or lock
    dance interleaving inode mutex and cgroup_mutex with object serial
    number comparison thrown in to decide whether the operation is
    actually necessary, needed to be employed.

    After conversion to kernfs, internal object lifetime and locking
    rules are mostly isolated from vfs interactions allowing shedding
    of several nasty hacks and overall simplification. This will also
    allow implementation of operations which may affect multiple cgroups,
    which weren't possible before as they would have required nesting
    i_mutexes.

    - Various simplifications including dropping of module support,
    easier cgroup name/path handling, simplified cgroup file type
    handling and task_cg_lists optimization.

    - Preparatory changes for the planned unified hierarchy, which is still
    a patchset away from being actually operational. The dummy
    hierarchy is updated to serve as the default unified hierarchy.
    Controllers which aren't claimed by other hierarchies are
    associated with it, which BTW was what the dummy hierarchy was for
    anyway.

    - Various fixes from Li and others. This pull request includes some
    patches to add missing slab.h to various subsystems. This was
    triggered by the xattr.h include removal from cgroup.h. cgroup.h
    indirectly got included in a lot of files, which brought in xattr.h,
    which brought in slab.h.

    There are several merge commits - one to pull in kernfs updates
    necessary for converting cgroup (already in upstream through
    driver-core), others for interfering changes in the fixes branch"

    * 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (74 commits)
    cgroup: remove useless argument from cgroup_exit()
    cgroup: fix spurious lockdep warning in cgroup_exit()
    cgroup: Use RCU_INIT_POINTER(x, NULL) in cgroup.c
    cgroup: break kernfs active_ref protection in cgroup directory operations
    cgroup: fix cgroup_taskset walking order
    cgroup: implement CFTYPE_ONLY_ON_DFL
    cgroup: make cgrp_dfl_root mountable
    cgroup: drop const from @buffer of cftype->write_string()
    cgroup: rename cgroup_dummy_root and related names
    cgroup: move ->subsys_mask from cgroupfs_root to cgroup
    cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding
    cgroup: remove NULL checks from [pr_cont_]cgroup_{name|path}()
    cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root
    cgroup: reorganize cgroup bootstrapping
    cgroup: relocate setting of CGRP_DEAD
    cpuset: use rcu_read_lock() to protect task_cs()
    cgroup_freezer: document freezer_fork() subtleties
    cgroup: update cgroup_transfer_tasks() to either succeed or fail
    cgroup: drop task_lock() protection around task->cgroups
    cgroup: update how a newly forked task gets associated with css_set
    ...

    Linus Torvalds
     

09 Mar, 2014

2 commits

  • While implementing atomic_write_len, 4d3773c4bb41 ("kernfs: implement
    kernfs_ops->atomic_write_len") moved data copy from userland inside
    kernfs_get_active() and kernfs_open_file->mutex so that
    kernfs_ops->atomic_write_len can be accessed before copying buffer
    from userland; unfortunately, this could lead to locking order
    inversion involving mmap_sem if copy_from_user() takes a page fault.

    ======================================================
    [ INFO: possible circular locking dependency detected ]
    3.14.0-rc4-next-20140228-sasha-00011-g4077c67-dirty #26 Tainted: G W
    -------------------------------------------------------
    trinity-c236/10658 is trying to acquire lock:
    (&of->mutex#2){+.+.+.}, at: [] kernfs_fop_mmap+0x54/0x120

    but task is already holding lock:
    (&mm->mmap_sem){++++++}, at: [] vm_mmap_pgoff+0x6e/0xe0

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&mm->mmap_sem){++++++}:
    [] validate_chain+0x6c5/0x7b0
    [] __lock_acquire+0x4cd/0x5a0
    [] lock_acquire+0x182/0x1d0
    [] might_fault+0x7e/0xb0
    [] kernfs_fop_write+0xd8/0x190
    [] vfs_write+0xe3/0x1d0
    [] SyS_write+0x5d/0xa0
    [] tracesys+0xdd/0xe2

    -> #0 (&of->mutex#2){+.+.+.}:
    [] check_prev_add+0x13f/0x560
    [] validate_chain+0x6c5/0x7b0
    [] __lock_acquire+0x4cd/0x5a0
    [] lock_acquire+0x182/0x1d0
    [] mutex_lock_nested+0x6a/0x510
    [] kernfs_fop_mmap+0x54/0x120
    [] mmap_region+0x310/0x5c0
    [] do_mmap_pgoff+0x385/0x430
    [] vm_mmap_pgoff+0x8f/0xe0
    [] SyS_mmap_pgoff+0x1b0/0x210
    [] SyS_mmap+0x1d/0x20
    [] tracesys+0xdd/0xe2

    other info that might help us debug this:

    Possible unsafe locking scenario:

    CPU0                    CPU1
    ----                    ----
    lock(&mm->mmap_sem);
                            lock(&of->mutex#2);
                            lock(&mm->mmap_sem);
    lock(&of->mutex#2);

    *** DEADLOCK ***

    1 lock held by trinity-c236/10658:
    #0: (&mm->mmap_sem){++++++}, at: [] vm_mmap_pgoff+0x6e/0xe0

    stack backtrace:
    CPU: 2 PID: 10658 Comm: trinity-c236 Tainted: G W 3.14.0-rc4-next-20140228-sasha-00011-g4077c67-dirty #26
    0000000000000000 ffff88011911fa48 ffffffff8438e945 0000000000000000
    0000000000000000 ffff88011911fa98 ffffffff811a0109 ffff88011911fab8
    ffff88011911fab8 ffff88011911fa98 ffff880119128cc0 ffff880119128cf8
    Call Trace:
    [] dump_stack+0x52/0x7f
    [] print_circular_bug+0x129/0x160
    [] check_prev_add+0x13f/0x560
    [] ? deactivate_slab+0x511/0x550
    [] validate_chain+0x6c5/0x7b0
    [] __lock_acquire+0x4cd/0x5a0
    [] ? mmap_region+0x24a/0x5c0
    [] lock_acquire+0x182/0x1d0
    [] ? kernfs_fop_mmap+0x54/0x120
    [] mutex_lock_nested+0x6a/0x510
    [] ? kernfs_fop_mmap+0x54/0x120
    [] ? get_parent_ip+0x11/0x50
    [] ? kernfs_fop_mmap+0x54/0x120
    [] kernfs_fop_mmap+0x54/0x120
    [] mmap_region+0x310/0x5c0
    [] do_mmap_pgoff+0x385/0x430
    [] ? vm_mmap_pgoff+0x6e/0xe0
    [] vm_mmap_pgoff+0x8f/0xe0
    [] ? __rcu_read_unlock+0x44/0xb0
    [] ? dup_fd+0x3c0/0x3c0
    [] SyS_mmap_pgoff+0x1b0/0x210
    [] SyS_mmap+0x1d/0x20
    [] tracesys+0xdd/0xe2

    Fix it by caching atomic_write_len in kernfs_open_file during open so
    that it can be determined without accessing kernfs_ops in
    kernfs_fop_write(). This restores the structure of kernfs_fop_write()
    before 4d3773c4bb41 with updated @len determination logic.

    Signed-off-by: Tejun Heo
    Reported-by: Sasha Levin
    References: http://lkml.kernel.org/g/53113485.2090407@oracle.com
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • The hash values 0 and 1 are reserved for magic directory entries, but
    the code only prevents names hashing to 0. This patch fixes the test
    to also prevent hash value 1.

    Signed-off-by: Richard Cochran
    Cc:
    Reviewed-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Richard Cochran
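
One way to keep both reserved values out of the hash space, as the commit above requires, is to shift any result below 2 upward. The helper name `reserve_magic_hashes()` is hypothetical and the snippet is a sketch of the idea, not a copy of the kernel's hash function.

```c
#include <assert.h>

/* Hash values 0 and 1 are reserved for the magic directory entries,
 * so remap anything that lands on them to an ordinary value. */
unsigned int reserve_magic_hashes(unsigned int hash)
{
    if (hash < 2)       /* testing only for 0 would let 1 slip through */
        hash += 2;
    return hash;
}
```

A test that only rejected `hash == 0` would let a name hashing to 1 collide with a magic entry, which is the gap the commit closes.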
     

03 Mar, 2014

1 commit


25 Feb, 2014

1 commit

  • As mount() and kill_sb() are not a one-to-one match, we shouldn't get
    the ns refcnt unconditionally in sysfs_mount(); instead we should
    get the refcnt only when kernfs_mount() allocated a new superblock.

    v2:
    - Changed the name of the new argument, suggested by Tejun.
    - Made the argument optional, suggested by Tejun.

    v3:
    - Make the new argument as second-to-last arg, suggested by Tejun.

    Signed-off-by: Li Zefan
    Acked-by: Tejun Heo
    ---
    fs/kernfs/mount.c | 8 +++++++-
    fs/sysfs/mount.c | 5 +++--
    include/linux/kernfs.h | 9 +++++----
    3 files changed, 15 insertions(+), 7 deletions(-)
    Signed-off-by: Greg Kroah-Hartman

    Li Zefan
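
The refcount imbalance addressed by the commit above shows up when one superblock is mounted several times but killed once. A rough userspace model, with hypothetical names throughout (`mount_model()`, `sysfs_mount_model()`, `kill_sb_model()`) and an optional out-argument standing in for the new argument mentioned in the v2 notes:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical models of a superblock and of a namespace refcount. */
struct sb_model { bool in_use; };
struct ns_model { int refcnt; };

/* Reuses @sb if it is already set up. If @new_sb is non-NULL, it reports
 * whether a new superblock had to be set up by this call. */
struct sb_model *mount_model(struct sb_model *sb, bool *new_sb)
{
    if (new_sb)
        *new_sb = !sb->in_use;
    sb->in_use = true;
    return sb;
}

void sysfs_mount_model(struct sb_model *sb, struct ns_model *ns)
{
    bool new_sb;

    mount_model(sb, &new_sb);
    if (new_sb)             /* take the ns ref once per superblock */
        ns->refcnt++;
}

/* Called once per superblock, no matter how many mounts shared it. */
void kill_sb_model(struct sb_model *sb, struct ns_model *ns)
{
    sb->in_use = false;
    ns->refcnt--;
}
```

Two mounts followed by a single kill leave the count at zero; taking the reference unconditionally on every mount would leave it at one.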
     

15 Feb, 2014

1 commit

  • Currently kernfs_node_from_dentry() returns NULL for root dentry,
    because root_dentry->d_op == NULL.

    Due to this bug cgroupstats_build() returns -EINVAL for root cgroup.

    # mount -t cgroup -o cpuacct /cgroup
    # Documentation/accounting/getdelays -C /cgroup
    fatal reply error, errno -22

    With this fix:

    # Documentation/accounting/getdelays -C /cgroup
    sleeping 305, blocked 0, running 1, stopped 0, uninterruptible 1

    Signed-off-by: Li Zefan
    Acked-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Li Zefan
     

12 Feb, 2014

1 commit

  • cgroup->name handling became quite complicated over time involving
    dedicated struct cgroup_name for RCU protection. Now that cgroup is
    on kernfs, we can drop all of it and simply use kernfs_name/path() and
    friends. Replace cgroup->name and all related code with kernfs
    name/path constructs.

    * Reimplement cgroup_name() and cgroup_path() as thin wrappers on top
    of kernfs counterparts, which involves semantic changes.
    pr_cont_cgroup_name() and pr_cont_cgroup_path() added.

    * cgroup->name handling dropped from cgroup_rename().

    * All users of cgroup_name/path() updated to the new semantics. Users
    which were formatting the string just to printk them are converted
    to use pr_cont_cgroup_name/path() instead, which simplifies things
    quite a bit. As cgroup_name() no longer requires RCU read lock
    around it, RCU lockings which were protecting only cgroup_name() are
    removed.

    v2: Comment above oom_info_lock updated as suggested by Michal.

    v3: dummy_top doesn't have a kn associated and
    pr_cont_cgroup_name/path() ended up calling the matching kernfs
    functions with NULL kn leading to oops. Test for NULL kn and
    print "/" if so. This issue was reported by Fengguang Wu.

    v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").

    Signed-off-by: Tejun Heo
    Acked-by: Peter Zijlstra
    Acked-by: Michal Hocko
    Acked-by: Li Zefan
    Cc: Fengguang Wu
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki

    Tejun Heo
     

11 Feb, 2014

1 commit

  • 3eef34ad7dc3 ("kernfs: implement kernfs_get_parent(),
    kernfs_name/path() and friends") restructured kernfs_rename_ns() such
    that new name assignment happens under kernfs_rename_lock;
    unfortunately, it mistakenly passed NULL to kernfs_name_hash() to
    calculate the new hash if the name hasn't changed, which can lead to
    oops.

    Fix it by using kn->name and kn->ns when calculating the new hash.

    Signed-off-by: Tejun Heo
    Reported-by: Dan Carpenter dan.carpenter@oracle.com
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

08 Feb, 2014

15 commits

  • As sysfs was kernfs's only user, kernfs has been piggybacking on
    CONFIG_SYSFS; however, kernfs is scheduled to grow a new user very
    soon. Introduce a separate config option CONFIG_KERNFS which is to be
    selected by kernfs users.

    Signed-off-by: Tejun Heo
    Cc: linux-fsdevel@vger.kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • kernfs_node->parent and ->name are currently marked as "published"
    indicating that kernfs users may access them directly; however, those
    fields may get updated by kernfs_rename[_ns]() and unrestricted access
    may lead to erroneous values or oops.

    Protect ->parent and ->name updates with an irq-safe spinlock
    kernfs_rename_lock and implement the following accessors for these
    fields.

    * kernfs_name() - format the node's name into the specified buffer
    * kernfs_path() - format the node's path into the specified buffer
    * pr_cont_kernfs_name() - pr_cont a node's name (doesn't need buffer)
    * pr_cont_kernfs_path() - pr_cont a node's path (doesn't need buffer)
    * kernfs_get_parent() - pin and return a node's parent

    All can be called under any context. The recursive sysfs_pathname()
    in fs/sysfs/dir.c is replaced with kernfs_path() and
    sysfs_rename_dir_ns() is updated to use kernfs_get_parent() instead of
    dereferencing parent directly.

    v2: Dummy definition of kernfs_path() for !CONFIG_KERNFS was missing
    static inline making it cause a lot of build warnings. Add it.

    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • Implement helpers to determine node from dentry and root from
    super_block. Also add a kernfs_rename_ns() wrapper which assumes NULL
    namespace. These generally make sense and will be used by cgroup.

    v2: Some dummy implementations for !CONFIG_SYSFS were missing. Fixed.
    Reported by kbuild test robot.

    Signed-off-by: Tejun Heo
    Cc: kbuild test robot
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
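  • A write to a kernfs_node is buffered through a kernel buffer. Writes
    <= PAGE_SIZE are performed atomically, while larger ones are executed
    in PAGE_SIZE chunks. While this is enough for sysfs, cgroup, which is
    scheduled to be converted to use kernfs, needs a bit more control
    over it.

    This patch adds kernfs_ops->atomic_write_len. If not set (zero), the
    behavior stays the same. If set, writes up to the size are executed
    atomically and larger writes are rejected with -E2BIG.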

    A different implementation strategy would be allowing configuring
    chunking size while making the original write size available to the
    write method; however, such strategy, while being more complicated,
    doesn't really buy anything. If the write implementation has to
    handle chunking, the specific chunk size shouldn't matter all that
    much.

    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
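
The length rule from the commit above can be written as a small pure function. This is a sketch with assumptions labeled: `write_len()` and `MODEL_PAGE_SIZE` are hypothetical names, and the unset case assumes the pre-existing behavior is to consume at most one page per write call.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096    /* assumed page size for the model */

/* Returns how many bytes a single write call should consume, or -E2BIG.
 * @atomic_write_len == 0 means the limit is not set. */
long write_len(size_t count, size_t atomic_write_len)
{
    if (atomic_write_len) {
        /* Limit set: the write is all-or-nothing up to the limit. */
        if (count > atomic_write_len)
            return -E2BIG;
        return (long)count;
    }

    /* Limit not set: assumed legacy behavior, at most one page per call. */
    return (long)(count < MODEL_PAGE_SIZE ? count : MODEL_PAGE_SIZE);
}
```

Keeping the decision dependent only on `count` and a cached limit is also what lets the length be determined before any data is copied from userland, as the later locking fix in this log relies on.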
     
  • Currently, kernfs_nodes are made visible to userland on creation,
    which makes it difficult for kernfs users to atomically succeed or
    fail creation of multiple nodes. In addition, if something fails
    after creating some nodes, the created nodes might already be in use
    and their active refs need to be drained for removal, which has the
    potential to introduce tricky reverse locking dependency on active_ref
    depending on how the error path is synchronized.

    This patch introduces per-root flag KERNFS_ROOT_CREATE_DEACTIVATED.
    If set, all nodes under the root are created in the deactivated state
    and stay invisible to userland until explicitly enabled by the new
    kernfs_activate() API. Also, nodes which have never been activated
    are guaranteed to bypass draining on removal thus allowing error paths
    to not worry about locking dependency on active_ref draining.

    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
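
The create-deactivated-then-activate pattern from the commit above can be modeled as nodes that lookups skip until they are switched on together. The names `struct kn_model`, `lookup()` and `activate_all()` are hypothetical, and the model ignores locking and active-reference draining entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical node model: created inactive, invisible until activated. */
struct kn_model {
    const char *name;
    bool active;
};

/* Lookup as seen from userland: deactivated nodes are never returned. */
const struct kn_model *lookup(const struct kn_model *nodes, int nr,
                              const char *name)
{
    for (int i = 0; i < nr; i++)
        if (nodes[i].active && strcmp(nodes[i].name, name) == 0)
            return &nodes[i];
    return NULL;
}

/* Make a fully created group of nodes visible in one step. */
void activate_all(struct kn_model *nodes, int nr)
{
    for (int i = 0; i < nr; i++)
        nodes[i].active = true;
}
```

If creating any node in the group fails, the caller can drop the whole group before activation and userland never observes a partial result.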
     
  • kernfs_iop_lookup(), kernfs_dir_pos() and kernfs_dir_next_pos() were
    missing kernfs_active() tests before using the found kernfs_node. As
    deactivated state is currently visible only while a node is being
    removed, this doesn't pose an actual problem. e.g. lookup succeeding
    on a deactivated node doesn't harm anything as the eventual file
    operations are gonna fail and those failures are indistinguishable
    from the cases in which the lookups had happened before the node was
    deactivated.

    However, we're gonna allow new nodes to be created deactivated and
    then activated explicitly by the kernfs user when it sees fit. This
    is to support atomically making multiple nodes visible to userland and
    thus those nodes must not be visible to userland before activated.

    Let's plug the lookup and readdir holes so that deactivated nodes are
    invisible to userland.

    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • Add two super_block related syscall callbacks ->remount_fs() and
    ->show_options() to kernfs_syscall_ops. These simply forward the
    matching super_operations.

    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • We're gonna need non-dir syscall callbacks, which will make dir_ops a
    misnomer. Let's rename kernfs_dir_ops to kernfs_syscall_ops.

    This is pure rename.

    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • kernfs_dir_ops are currently being invoked without any active
    reference, which makes it tricky for the invoked operations to
    determine whether the objects associated with those nodes are safe to
    access and will remain that way for the duration of such operations.

    kernfs already has active_ref mechanism to deal with this which makes
    the removal of a given node the synchronization point for gating the
    file operations. There's no reason for dir_ops to be any different.
    Update the dir_ops handling so that active_ref is held while the
    dir_ops are executing. This guarantees that while a dir_ops is
    executing the target nodes stay alive.

    As kernfs_dir_ops doesn't have any in-kernel user at this point, this
    doesn't affect anybody.

    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • Sometimes it's necessary to implement a node which wants to delete
    nodes including itself. This isn't straightforward because of kernfs
    active reference. While a file operation is in progress, an active
    reference is held and kernfs_remove() waits for all such references to
    drain before completing. For a self-deleting node, this is a deadlock
    as kernfs_remove() ends up waiting for an active reference that itself
    is sitting on top of.

    This currently is worked around in the sysfs layer using
    sysfs_schedule_callback() which makes such removals asynchronous.
    While it works, it's rather cumbersome and inherently breaks
    synchronicity of the operation - the file operation which triggered
    the operation may complete before the removal is finished (or even
    started) and the removal may fail asynchronously. If a removal
    operation is immediately followed by another operation which expects
    the specific name to be available (e.g. removal followed by rename
    onto the same name), there's no way to make the latter operation
    reliable.

    The thing is there's no inherent reason for this to be asynchronous.
    All that's necessary to do this synchronously is a dedicated operation
    which drops its own active ref and deactivates self. This patch
    implements kernfs_remove_self() and its wrappers in sysfs and driver
    core. kernfs_remove_self() is to be called from one of the file
    operations, drops the active ref the task is holding, removes the self
    node, and restores active ref to the dead node so that the ref is
    balanced afterwards. __kernfs_remove() is updated so that it takes an
    early exit if the target node is already fully removed so that the
    active ref restored by kernfs_remove_self() after removal doesn't
    confuse the deactivation path.

    This makes implementing self-deleting nodes very easy. The normal
    removal path doesn't even need to be changed to use
    kernfs_remove_self() for the self-deleting node. The method can
    invoke kernfs_remove_self() on itself before proceeding with the normal
    removal path. kernfs_remove() invoked on the node by the normal
    deletion path will simply be ignored.

    This will replace sysfs_schedule_callback(). A subtle feature of
    sysfs_schedule_callback() is that it collapses multiple invocations -
    even if multiple removals are triggered, the removal callback is run
    only once. An equivalent effect can be achieved by testing the return
    value of kernfs_remove_self() - only the one which gets %true return
    value should proceed with actual deletion. All other instances of
    kernfs_remove_self() will wait till the enclosing kernfs operation
    which invoked the winning instance of kernfs_remove_self() finishes
    and then return %false. This trivially makes all users of
    kernfs_remove_self() automatically show correct synchronous behavior
    even when there are multiple concurrent operations - all "echo 1 >
    delete" instances will finish only after the whole operation is
    completed by one of the instances.
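
    The return-value contract described above can be sketched in
    userspace as follows. This is only a model under stated assumptions:
    rs_node, rs_remove_self() and rs_delete_store() are hypothetical
    names, not kernel API, and the blocking of the losing callers until
    the winner finishes is deliberately left out.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a kernfs node; not the kernel structure. */
struct rs_node {
	atomic_flag removal_claimed;	/* set by the first remove_self caller */
	int deletions;			/* how many times real deletion ran */
};

void rs_node_init(struct rs_node *kn)
{
	atomic_flag_clear(&kn->removal_claimed);
	kn->deletions = 0;
}

/*
 * Model of the documented contract: exactly one caller sees true.
 * The real function additionally makes the other callers wait for the
 * winner's enclosing operation; that part is omitted in this sketch.
 */
bool rs_remove_self(struct rs_node *kn)
{
	return !atomic_flag_test_and_set(&kn->removal_claimed);
}

/* Shape of a "delete" store method: only the winner performs deletion. */
void rs_delete_store(struct rs_node *kn)
{
	if (rs_remove_self(kn))
		kn->deletions++;	/* stands in for the actual removal work */
}
```

    However many times rs_delete_store() is invoked on the same node in
    this model, the deletion work runs once, which mirrors the collapsing
    behavior of sysfs_schedule_callback().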

    Note that manipulation of active ref is implemented in separate public
    functions - kernfs_[un]break_active_protection().
    kernfs_remove_self() is the only user at the moment but this will be
    used to cater to more complex cases.

    v2: For !CONFIG_SYSFS, dummy version kernfs_remove_self() was missing
    and sysfs_remove_file_self() had incorrect return type. Fix it.
    Reported by kbuild test bot.

    v3: kernfs_[un]break_active_protection() separated out from
    kernfs_remove_self() and exposed as public API.

    Signed-off-by: Tejun Heo
    Cc: Alan Stern
    Cc: kbuild test robot
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • KERNFS_REMOVED is used to mark half-initialized and dying nodes so
    that they don't show up in lookups and deny adding new nodes under or
    renaming it; however, its role overlaps that of deactivation.

    It's necessary to deny addition of new children while removal is in
    progress; however, this role considerably intersects with deactivation
    - KERNFS_REMOVED prevents new children while deactivation prevents new
    file operations. There's no reason to have them separate making
    things more complex than necessary.

    This patch removes KERNFS_REMOVED.

    * Instead of KERNFS_REMOVED, each node now starts its life
    deactivated. This means that we now use both atomic_add() and
    atomic_sub() on KN_DEACTIVATED_BIAS, which is INT_MIN. The compiler
    generates overflow warnings when negating INT_MIN as the negation
    can't be represented as a positive number. Nothing is actually
    broken but let's bump BIAS by one to avoid the warnings for archs
    which negate the subtrahend.

    * A new helper kernfs_active() which tests whether kn->active >= 0 is
    added for convenience and lockdep annotation. All KERNFS_REMOVED
    tests are replaced with negated kernfs_active() tests.

    * __kernfs_remove() is updated to deactivate, but not drain, all nodes
    in the subtree instead of setting KERNFS_REMOVED. This removes
    deactivation from kernfs_deactivate(), which is now renamed to
    kernfs_drain().

    * Sanity check on KERNFS_REMOVED in kernfs_put() is replaced with
    checks on the active ref.

    * Some comment style updates in the affected area.

    v2: Reordered before removal path restructuring. kernfs_active()
    dropped and kernfs_get/put_active() used instead. RB_EMPTY_NODE()
    used in the lookup paths.

    v3: Reverted most of v2 except for creating a new node with
    KN_DEACTIVATED_BIAS.
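
    The biased active counter described above can be modelled in
    userspace with C11 atomics. This is a minimal sketch, not the kernel
    code: the bias_* names and BIAS_DEACTIVATED are hypothetical, and the
    only facts taken from the message are that a node starts deactivated,
    that the bias is INT_MIN bumped by one, and that "active" means the
    counter is >= 0.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* One above INT_MIN so that the negated bias is still representable. */
#define BIAS_DEACTIVATED (INT_MIN + 1)

/* Hypothetical stand-in for the active counter of a node. */
struct bias_node {
	atomic_int active;
};

/* Every node starts its life deactivated. */
void bias_init(struct bias_node *kn)
{
	atomic_init(&kn->active, BIAS_DEACTIVATED);
}

/* Equivalent of the kn->active >= 0 test. */
bool bias_active(struct bias_node *kn)
{
	return atomic_load(&kn->active) >= 0;
}

/* Activation subtracts the bias, deactivation adds it back. */
void bias_activate(struct bias_node *kn)
{
	atomic_fetch_sub(&kn->active, BIAS_DEACTIVATED);
}

void bias_deactivate(struct bias_node *kn)
{
	atomic_fetch_add(&kn->active, BIAS_DEACTIVATED);
}

/* Take an active ref only while the counter is non-negative. */
bool bias_get_active(struct bias_node *kn)
{
	int v = atomic_load(&kn->active);

	while (v >= 0) {
		/* on failure v is reloaded and the sign is rechecked */
		if (atomic_compare_exchange_weak(&kn->active, &v, v + 1))
			return true;
	}
	return false;
}

void bias_put_active(struct bias_node *kn)
{
	atomic_fetch_sub(&kn->active, 1);
}

/* A deactivated node is drained once only the bias is left. */
bool bias_drained(struct bias_node *kn)
{
	return atomic_load(&kn->active) == BIAS_DEACTIVATED;
}
```

    In this model a deactivated node refuses new active refs while the
    refs taken earlier can still be dropped, so a drain step only has to
    wait for the counter to fall back to the bias value.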

    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • There currently are two mechanisms gating active ref lockdep
    annotations - KERNFS_LOCKDEP flag and KERNFS_ACTIVE_REF type mask.
    The former disables lockdep annotations in kernfs_get/put_active()
    while the latter disables all of kernfs_deactivate().

    While KERNFS_ACTIVE_REF also behaves as an optimization to skip the
    deactivation step for non-file nodes, the benefit is marginal and it
    needlessly diverges code paths. Let's drop KERNFS_ACTIVE_REF.

    While at it, add a test helper kernfs_lockdep() to test KERNFS_LOCKDEP
    flag so that it's more convenient and the related code can be compiled
    out when not enabled.

    v2: Refreshed on top of ("kernfs: make kernfs_deactivate() honor
    KERNFS_LOCKDEP flag"). As the earlier patch already added
    KERNFS_LOCKDEP tests to kernfs_deactivate(), those additions are
    dropped from this patch and the existing ones are simply converted
    to kernfs_lockdep().
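
    A helper of this kind can be sketched as below. The names are
    hypothetical stand-ins (LD_DEBUG_LOCK_ALLOC models the kernel config
    option that enables lockdep, LD_FLAG_LOCKDEP models the node flag and
    its bit value is made up); the point is only that the test collapses
    to a constant false when the feature is disabled, so callers guarded
    by it can be compiled out.

```c
#include <stdbool.h>

/* Hypothetical build-time switch; defaults to disabled in this sketch. */
#ifndef LD_DEBUG_LOCK_ALLOC
#define LD_DEBUG_LOCK_ALLOC 0
#endif

#define LD_FLAG_LOCKDEP 0x0100u	/* made-up bit value for illustration */

/* Hypothetical stand-in for a node carrying a flags word. */
struct ld_node {
	unsigned int flags;
};

/* True only if lock debugging is built in and the node opted in. */
bool ld_lockdep(const struct ld_node *kn)
{
#if LD_DEBUG_LOCK_ALLOC
	return (kn->flags & LD_FLAG_LOCKDEP) != 0;
#else
	(void)kn;	/* unused when the feature is compiled out */
	return false;
#endif
}
```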

    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • kernfs_addrm_cxt and the accompanying kernfs_addrm_start/finish() were
    added because there were operations which should be performed outside
    kernfs_mutex after adding and removing kernfs_nodes. The necessary
    operations were recorded in kernfs_addrm_cxt and performed by
    kernfs_addrm_finish(); however, after the recent changes which
    relocated deactivation and unmapping so that they're performed
    directly during removal, the only operation kernfs_addrm_finish()
    performs is kernfs_put(), which can be moved inside the removal path
    too.

    This patch moves the kernfs_put() of the base ref to __kernfs_remove()
    and removes kernfs_addrm_cxt and kernfs_addrm_start/finish().

    * kernfs_add_one() is updated to grab and release kernfs_mutex itself.
    kernfs_addrm_start/finish() invocations around it are removed from
    all users.

    * __kernfs_remove() puts an unlinked node directly instead of chaining
    it to kernfs_addrm_cxt. Its callers are updated to grab and release
    kernfs_mutex instead of calling kernfs_addrm_start/finish() around
    it.

    v2: Rebased on top of "kernfs: associate a new kernfs_node with its
    parent on creation" which dropped @parent from kernfs_add_one().

    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • kernfs_unmap_bin_file() is supposed to unmap all memory mappings of
    the target file before kernfs_remove() finishes; however, it currently
    is being called from kernfs_addrm_finish() and has the same race
    problem as the original implementation of deactivation when there are
    multiple removers - only the remover which snatches the node to its
    addrm_cxt->removed list is guaranteed to wait for its completion
    before returning.

    It can be easily fixed by moving kernfs_unmap_bin_file() invocation
    from kernfs_addrm_finish() to kernfs_deactivate(). The function may
    be called multiple times but that shouldn't do any harm.

    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • The recursive nature of kernfs_remove() means that, even if
    kernfs_remove() is not allowed to be called multiple times on the same
    node, there may be race conditions between removal of parent and its
    descendants. While we can claim that kernfs_remove() shouldn't be
    called on one of the descendants while the removal of an ancestor is
    in progress, such a rule is unnecessarily restrictive and very difficult
    to enforce. It's better to simply allow invoking kernfs_remove() as
    the caller sees fit as long as the caller ensures that the node is
    accessible.

    The current behavior in such situations is broken. Whoever enters
    removal path first takes the node off the hierarchy and then
    deactivates. Subsequent removers either return as soon as they notice
    that they aren't the first one or can't even find the target node as it
    has already been removed from the hierarchy. In both cases, the
    following removers may finish prematurely while the nodes which should
    be removed and drained are still being processed by the first one.

    This patch restructures so that multiple removers, whether through
    recursion or direct invocation, always follow the following rules.

    * When there are multiple concurrent removers, only one puts the base
    ref.

    * Regardless of which one puts the base ref, all removers are blocked
    until the target node is fully deactivated and removed.

    To achieve the above, removal path now first marks all descendants
    including self REMOVED and then deactivates and unlinks leftmost
    descendant one-by-one. kernfs_deactivate() is called directly from
    __kernfs_remove() and drops and regrabs kernfs_mutex for each
    descendant to drain active refs. As this means that multiple removers
    can enter kernfs_deactivate() for the same node, the function is
    updated so that it can handle multiple deactivators of the same node -
    only one actually deactivates but all wait till drain completion.

    The restructured removal path guarantees that a removed node gets
    unlinked only after the node is deactivated and drained. Combined
    with proper multiple deactivator handling, this guarantees that any
    invocation of kernfs_remove() returns only after the node itself and
    all its descendants are deactivated, drained and removed.
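
    The two rules can be illustrated with a single-threaded model. This
    is a sketch under loud assumptions: mr_node, mr_remove_poll() and
    mr_put_active() are hypothetical, locking is omitted, and blocking is
    replaced by a boolean that tells the caller whether it would be
    allowed to return yet.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a node that is being removed. */
struct mr_node {
	bool deactivated;	/* set once by whichever remover is first */
	bool base_ref_put;	/* base ref dropped by exactly one remover */
	int active_refs;	/* in-flight file operations */
	int deactivations;	/* how often deactivation really happened */
	int base_ref_puts;	/* how often the base ref was dropped */
};

/*
 * One step of a remover. Returns true when the remover may return,
 * i.e. the node is deactivated and drained. A real implementation
 * would sleep instead of returning false.
 */
bool mr_remove_poll(struct mr_node *kn)
{
	if (!kn->deactivated) {
		/* only the first remover actually deactivates */
		kn->deactivated = true;
		kn->deactivations++;
	}

	if (kn->active_refs > 0)
		return false;	/* every remover waits for the drain */

	if (!kn->base_ref_put) {
		/* only one remover puts the base ref */
		kn->base_ref_put = true;
		kn->base_ref_puts++;
	}
	return true;
}

/* An in-flight operation finishing and dropping its active ref. */
void mr_put_active(struct mr_node *kn)
{
	kn->active_refs--;
}
```

    With one active ref outstanding, two removers polling this model both
    get false; once the ref is dropped both get true, while deactivation
    and the base ref put have each happened exactly once.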

    v2: Draining separated into a separate loop (used to be in the same
    loop as unlink) and done from __kernfs_deactivate(). This is to
    allow exposing deactivation as a separate interface later.

    Root node removal was broken in v1 patch. Fixed.

    v3: Revert most of v2 except for root node removal fix and
    simplification of KERNFS_REMOVED setting loop.

    v4: Refreshed on top of ("kernfs: make kernfs_deactivate() honor
    KERNFS_LOCKDEP flag").

    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo