05 Jun, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this file is released under the gplv2

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 68 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Armijn Hemel
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190531190114.292346262@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

08 May, 2019

1 commit

  • Pull selinux updates from Paul Moore:
    "We've got a few SELinux patches for the v5.2 merge window, the
    highlights are below:

    - Add LSM hooks, and the SELinux implementation, for proper labeling
    of kernfs. While we are only including the SELinux implementation
    here, the rest of the LSM folks have given the hooks a thumbs-up.

    - Update the SELinux mdp (Make Dummy Policy) script to actually work
    on a modern system.

    - Disallow userspace to change the LSM credentials via
    /proc/self/attr when the task's credentials are already overridden.

    The change was made in procfs because all the LSM folks agreed this
    was the Right Thing To Do and duplicating it across each LSM was
    going to be annoying"

    * tag 'selinux-pr-20190507' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
    proc: prevent changes to overridden credentials
    selinux: Check address length before reading address family
    kernfs: fix xattr name handling in LSM helpers
    MAINTAINERS: update SELinux file patterns
    selinux: avoid uninitialized variable warning
    selinux: remove useless assignments
    LSM: lsm_hooks.h - fix missing colon in docstring
    selinux: Make selinux_kernfs_init_security static
    kernfs: initialize security of newly created nodes
    selinux: implement the kernfs_init_security hook
    LSM: add new hook for kernfs node initialization
    kernfs: use simple_xattrs for security attributes
    selinux: try security xattr after genfs for kernfs filesystems
    kernfs: do not alloc iattrs in kernfs_xattr_get
    kernfs: clean up struct kernfs_iattrs
    scripts/selinux: fix build
    selinux: use kernel linux/socket.h for genheaders and mdp
    scripts/selinux: modernize mdp

    Linus Torvalds
     

26 Apr, 2019

1 commit

  • smp_mb__before_atomic() can not be applied to atomic_set(). Remove the
    barrier and rely on RELEASE synchronization.

    Fixes: ba16b2846a8c6 ("kernfs: add an API to get kernfs node from inode number")
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrea Parri
    Acked-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Andrea Parri
     

21 Mar, 2019

3 commits

  • Use the new security_kernfs_init_security() hook to allow LSMs to
    possibly assign a non-default security context to a newly created kernfs
    node based on the attributes of the new node and also its parent node.

    This fixes an issue with cgroupfs under SELinux, where newly created
    cgroup subdirectories/files would not inherit its parent's context if
    it had been set explicitly to a non-default value (other than the genfs
    context specified by the policy). This can be reproduced as follows (on
    Fedora/RHEL):

    # mkdir /sys/fs/cgroup/unified/test
    # # Need permissive to change the label under Fedora policy:
    # setenforce 0
    # chcon -t container_file_t /sys/fs/cgroup/unified/test
    # ls -lZ /sys/fs/cgroup/unified
    total 0
    -r--r--r--. 1 root root system_u:object_r:cgroup_t:s0 0 Jan 29 03:06 cgroup.controllers
    -rw-r--r--. 1 root root system_u:object_r:cgroup_t:s0 0 Jan 29 03:06 cgroup.max.depth
    -rw-r--r--. 1 root root system_u:object_r:cgroup_t:s0 0 Jan 29 03:06 cgroup.max.descendants
    -rw-r--r--. 1 root root system_u:object_r:cgroup_t:s0 0 Jan 29 03:06 cgroup.procs
    -r--r--r--. 1 root root system_u:object_r:cgroup_t:s0 0 Jan 29 03:06 cgroup.stat
    -rw-r--r--. 1 root root system_u:object_r:cgroup_t:s0 0 Jan 29 03:06 cgroup.subtree_control
    -rw-r--r--. 1 root root system_u:object_r:cgroup_t:s0 0 Jan 29 03:06 cgroup.threads
    drwxr-xr-x. 2 root root system_u:object_r:cgroup_t:s0 0 Jan 29 03:06 init.scope
    drwxr-xr-x. 26 root root system_u:object_r:cgroup_t:s0 0 Jan 29 03:21 system.slice
    drwxr-xr-x. 3 root root system_u:object_r:container_file_t:s0 0 Jan 29 03:15 test
    drwxr-xr-x. 3 root root system_u:object_r:cgroup_t:s0 0 Jan 29 03:06 user.slice
    # mkdir /sys/fs/cgroup/unified/test/subdir

    Actual result:

    # ls -ldZ /sys/fs/cgroup/unified/test/subdir
    drwxr-xr-x. 2 root root system_u:object_r:cgroup_t:s0 0 Jan 29 03:15 /sys/fs/cgroup/unified/test/subdir

    Expected result:

    # ls -ldZ /sys/fs/cgroup/unified/test/subdir
    drwxr-xr-x. 2 root root unconfined_u:object_r:container_file_t:s0 0 Jan 29 03:15 /sys/fs/cgroup/unified/test/subdir

    Link: https://github.com/SELinuxProject/selinux-kernel/issues/39

    Signed-off-by: Ondrej Mosnacek
    Acked-by: Casey Schaufler
    Signed-off-by: Paul Moore

    Ondrej Mosnacek
     
  • Replace the special handling of security xattrs with simple_xattrs, as
    is already done for the trusted xattrs. This simplifies the code and
    allows LSMs to use more than just a single xattr to do their business.

    Signed-off-by: Ondrej Mosnacek
    Acked-by: Casey Schaufler
    [PM: manual merge fixes]
    Signed-off-by: Paul Moore

    Ondrej Mosnacek
     
  • Right now, kernfs_iattrs embeds the whole struct iattr, even though it
    doesn't really use half of its fields... This both leads to wasting
    space and makes the code look awkward. Let's just list the few fields
    we need directly in struct kernfs_iattrs.

    Signed-off-by: Ondrej Mosnacek
    Acked-by: Casey Schaufler
    [PM: merged a number of chunks manually due to fuzz]
    Signed-off-by: Paul Moore

    Ondrej Mosnacek
     

08 Feb, 2019

1 commit

  • Creating a new cache for kernfs_iattrs.
    Currently, memory is allocated with kzalloc() which
    always gives aligned memory. On ARM, this is 64 byte aligned.
    To avoid the wastage of memory in aligning the size requested,
    a new cache for kernfs_iattrs is created.

    Size of struct kernfs_iattrs is 80 Bytes.
    On ARM, it will come in kmalloc-128 slab.
    and it will come in kmalloc-192 slab if debug info is enabled.
    Extra bytes taken 48 bytes.

    Total number of objects created : 4096
    Total saving = 48*4096 = 192 KB

    After creating new slab(When debug info is enabled) :
    sh-3.2# cat /proc/slabinfo
    ...
    kernfs_iattrs_cache 4069 4096 128 32 1 : tunables 0 0 0 : slabdata 128 128 0
    ...

    All testing has been done on ARM target.

    Signed-off-by: Ayush Mittal
    Signed-off-by: Vaneet Narang
    Acked-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Ayush Mittal
     

21 Jul, 2018

1 commit

  • This change allows creating kernfs files and directories with arbitrary
    uid/gid instead of always using GLOBAL_ROOT_UID/GID by extending
    kernfs_create_dir_ns() and kernfs_create_file_ns() with uid/gid arguments.
    The "simple" kernfs_create_file() and kernfs_create_dir() are left alone
    and always create objects belonging to the global root.

    When creating symlinks ownership (uid/gid) is taken from the target kernfs
    object.

    Co-Developed-by: Tyler Hicks
    Signed-off-by: Dmitry Torokhov
    Signed-off-by: Tyler Hicks
    Signed-off-by: David S. Miller

    Dmitry Torokhov
     

06 Jun, 2018

1 commit

  • struct timespec is not y2038 safe. Transition vfs to use
    y2038 safe struct timespec64 instead.

    The change was made with the help of the following cocinelle
    script. This catches about 80% of the changes.
    All the header file and logic changes are included in the
    first 5 rules. The rest are trivial substitutions.
    I avoid changing any of the function signatures or any other
    filesystem specific data structures to keep the patch simple
    for review.

    The script can be a little shorter by combining different cases.
    But, this version was sufficient for my usecase.

    virtual patch

    @ depends on patch @
    identifier now;
    @@
    - struct timespec
    + struct timespec64
    current_time ( ... )
    {
    - struct timespec now = current_kernel_time();
    + struct timespec64 now = current_kernel_time64();
    ...
    - return timespec_trunc(
    + return timespec64_trunc(
    ... );
    }

    @ depends on patch @
    identifier xtime;
    @@
    struct \( iattr \| inode \| kstat \) {
    ...
    - struct timespec xtime;
    + struct timespec64 xtime;
    ...
    }

    @ depends on patch @
    identifier t;
    @@
    struct inode_operations {
    ...
    int (*update_time) (...,
    - struct timespec t,
    + struct timespec64 t,
    ...);
    ...
    }

    @ depends on patch @
    identifier t;
    identifier fn_update_time =~ "update_time$";
    @@
    fn_update_time (...,
    - struct timespec *t,
    + struct timespec64 *t,
    ...) { ... }

    @ depends on patch @
    identifier t;
    @@
    lease_get_mtime( ... ,
    - struct timespec *t
    + struct timespec64 *t
    ) { ... }

    @te depends on patch forall@
    identifier ts;
    local idexpression struct inode *inode_node;
    identifier i_xtime =~ "^i_[acm]time$";
    identifier ia_xtime =~ "^ia_[acm]time$";
    identifier fn_update_time =~ "update_time$";
    identifier fn;
    expression e, E3;
    local idexpression struct inode *node1;
    local idexpression struct inode *node2;
    local idexpression struct iattr *attr1;
    local idexpression struct iattr *attr2;
    local idexpression struct iattr attr;
    identifier i_xtime1 =~ "^i_[acm]time$";
    identifier i_xtime2 =~ "^i_[acm]time$";
    identifier ia_xtime1 =~ "^ia_[acm]time$";
    identifier ia_xtime2 =~ "^ia_[acm]time$";
    @@
    (
    (
    - struct timespec ts;
    + struct timespec64 ts;
    |
    - struct timespec ts = current_time(inode_node);
    + struct timespec64 ts = current_time(inode_node);
    )

    i_xtime, &ts)
    + timespec64_equal(&inode_node->i_xtime, &ts)
    |
    - timespec_equal(&ts, &inode_node->i_xtime)
    + timespec64_equal(&ts, &inode_node->i_xtime)
    |
    - timespec_compare(&inode_node->i_xtime, &ts)
    + timespec64_compare(&inode_node->i_xtime, &ts)
    |
    - timespec_compare(&ts, &inode_node->i_xtime)
    + timespec64_compare(&ts, &inode_node->i_xtime)
    |
    ts = current_time(e)
    |
    fn_update_time(..., &ts,...)
    |
    inode_node->i_xtime = ts
    |
    node1->i_xtime = ts
    |
    ts = inode_node->i_xtime
    |
    ia_xtime ...+> = ts
    |
    ts = attr1->ia_xtime
    |
    ts.tv_sec
    |
    ts.tv_nsec
    |
    btrfs_set_stack_timespec_sec(..., ts.tv_sec)
    |
    btrfs_set_stack_timespec_nsec(..., ts.tv_nsec)
    |
    - ts = timespec64_to_timespec(
    + ts =
    ...
    -)
    |
    - ts = ktime_to_timespec(
    + ts = ktime_to_timespec64(
    ...)
    |
    - ts = E3
    + ts = timespec_to_timespec64(E3)
    |
    - ktime_get_real_ts(&ts)
    + ktime_get_real_ts64(&ts)
    |
    fn(...,
    - ts
    + timespec64_to_timespec(ts)
    ,...)
    )
    ...+>
    (

    )
    |
    - timespec_equal(&node1->i_xtime1, &node2->i_xtime2)
    + timespec64_equal(&node1->i_xtime2, &node2->i_xtime2)
    |
    - timespec_equal(&node1->i_xtime1, &attr2->ia_xtime2)
    + timespec64_equal(&node1->i_xtime2, &attr2->ia_xtime2)
    |
    - timespec_compare(&node1->i_xtime1, &node2->i_xtime2)
    + timespec64_compare(&node1->i_xtime1, &node2->i_xtime2)
    |
    node1->i_xtime1 =
    - timespec_trunc(attr1->ia_xtime1,
    + timespec64_trunc(attr1->ia_xtime1,
    ...)
    |
    - attr1->ia_xtime1 = timespec_trunc(attr2->ia_xtime2,
    + attr1->ia_xtime1 = timespec64_trunc(attr2->ia_xtime2,
    ...)
    |
    - ktime_get_real_ts(&attr1->ia_xtime1)
    + ktime_get_real_ts64(&attr1->ia_xtime1)
    |
    - ktime_get_real_ts(&attr.ia_xtime1)
    + ktime_get_real_ts64(&attr.ia_xtime1)
    )

    @ depends on patch @
    struct inode *node;
    struct iattr *attr;
    identifier fn;
    identifier i_xtime =~ "^i_[acm]time$";
    identifier ia_xtime =~ "^ia_[acm]time$";
    expression e;
    @@
    (
    - fn(node->i_xtime);
    + fn(timespec64_to_timespec(node->i_xtime));
    |
    fn(...,
    - node->i_xtime);
    + timespec64_to_timespec(node->i_xtime));
    |
    - e = fn(attr->ia_xtime);
    + e = fn(timespec64_to_timespec(attr->ia_xtime));
    )

    @ depends on patch forall @
    struct inode *node;
    struct iattr *attr;
    identifier i_xtime =~ "^i_[acm]time$";
    identifier ia_xtime =~ "^ia_[acm]time$";
    identifier fn;
    @@
    {
    + struct timespec ts;
    i_xtime);
    fn (...,
    - &node->i_xtime,
    + &ts,
    ...);
    |
    + ts = timespec64_to_timespec(attr->ia_xtime);
    fn (...,
    - &attr->ia_xtime,
    + &ts,
    ...);
    )
    ...+>
    }

    @ depends on patch forall @
    struct inode *node;
    struct iattr *attr;
    struct kstat *stat;
    identifier ia_xtime =~ "^ia_[acm]time$";
    identifier i_xtime =~ "^i_[acm]time$";
    identifier xtime =~ "^[acm]time$";
    identifier fn, ret;
    @@
    {
    + struct timespec ts;
    i_xtime);
    ret = fn (...,
    - &node->i_xtime,
    + &ts,
    ...);
    |
    + ts = timespec64_to_timespec(node->i_xtime);
    ret = fn (...,
    - &node->i_xtime);
    + &ts);
    |
    + ts = timespec64_to_timespec(attr->ia_xtime);
    ret = fn (...,
    - &attr->ia_xtime,
    + &ts,
    ...);
    |
    + ts = timespec64_to_timespec(attr->ia_xtime);
    ret = fn (...,
    - &attr->ia_xtime);
    + &ts);
    |
    + ts = timespec64_to_timespec(stat->xtime);
    ret = fn (...,
    - &stat->xtime);
    + &ts);
    )
    ...+>
    }

    @ depends on patch @
    struct inode *node;
    struct inode *node2;
    identifier i_xtime1 =~ "^i_[acm]time$";
    identifier i_xtime2 =~ "^i_[acm]time$";
    identifier i_xtime3 =~ "^i_[acm]time$";
    struct iattr *attrp;
    struct iattr *attrp2;
    struct iattr attr ;
    identifier ia_xtime1 =~ "^ia_[acm]time$";
    identifier ia_xtime2 =~ "^ia_[acm]time$";
    struct kstat *stat;
    struct kstat stat1;
    struct timespec64 ts;
    identifier xtime =~ "^[acmb]time$";
    expression e;
    @@
    (
    ( node->i_xtime2 \| attrp->ia_xtime2 \| attr.ia_xtime2 \) = node->i_xtime1 ;
    |
    node->i_xtime2 = \( node2->i_xtime1 \| timespec64_trunc(...) \);
    |
    node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
    |
    node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
    |
    stat->xtime = node2->i_xtime1;
    |
    stat1.xtime = node2->i_xtime1;
    |
    ( node->i_xtime2 \| attrp->ia_xtime2 \) = attrp->ia_xtime1 ;
    |
    ( attrp->ia_xtime1 \| attr.ia_xtime1 \) = attrp2->ia_xtime2;
    |
    - e = node->i_xtime1;
    + e = timespec64_to_timespec( node->i_xtime1 );
    |
    - e = attrp->ia_xtime1;
    + e = timespec64_to_timespec( attrp->ia_xtime1 );
    |
    node->i_xtime1 = current_time(...);
    |
    node->i_xtime2 = node->i_xtime1 = node->i_xtime3 =
    - e;
    + timespec_to_timespec64(e);
    |
    node->i_xtime1 = node->i_xtime3 =
    - e;
    + timespec_to_timespec64(e);
    |
    - node->i_xtime1 = e;
    + node->i_xtime1 = timespec_to_timespec64(e);
    )

    Signed-off-by: Deepa Dinamani
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:

    Deepa Dinamani
     

29 Jul, 2017

5 commits

  • inode number and generation can identify a kernfs node. We are going to
    export the identification by exportfs operations, so put ino and
    generation into a separate structure. It's convenient when later patches
    use the identification.

    Acked-by: Greg Kroah-Hartman
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • When working on adding exportfs operations in kernfs, I found it's hard
    to initialize dentry->d_fsdata in the exportfs operations. Looks there
    is no way to do it without race condition. Look at the kernfs code
    closely, there is no point to set dentry->d_fsdata. inode->i_private
    already points to kernfs_node, and we can get inode from a dentry. So
    this patch just delete the d_fsdata usage.

    Acked-by: Tejun Heo
    Acked-by: Greg Kroah-Hartman
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • Add an API to get kernfs node from inode number. We will need this to
    implement exportfs operations.

    This API will be used in blktrace too later, so it should be as fast as
    possible. To make the API lock free, kernfs node is freed in RCU
    context. And we depend on kernfs_node count/ino number to filter out
    stale kernfs nodes.

    Acked-by: Tejun Heo
    Acked-by: Greg Kroah-Hartman
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • Set i_generation for kernfs inode. This is required to implement
    exportfs operations. The generation is 32-bit, so it's possible the
    generation wraps up and we find stale files. To reduce the posssibility,
    we don't reuse inode numer immediately. When the inode number allocation
    wraps, we increase generation number. In this way generation/inode
    number consist of a 64-bit number which is unlikely duplicated. This
    does make the idr tree more sparse and waste some memory. Since idr
    manages 32-bit keys, idr uses a 6-level radix tree, each level covers 6
    bits of the key. In a 100k inode kernfs, the worst case will have around
    300k radix tree node. Each node is 576bytes, so the tree will use about
    ~150M memory. Sounds not too bad, if this really is a problem, we should
    find better data structure.

    Acked-by: Tejun Heo
    Acked-by: Greg Kroah-Hartman
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • kernfs uses ida to manage inode number. The problem is we can't get
    kernfs_node from inode number with ida. Switching to use idr, next patch
    will add an API to get kernfs_node from inode number.

    Acked-by: Tejun Heo
    Acked-by: Greg Kroah-Hartman
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     

28 Feb, 2017

1 commit

  • Pull cgroup updates from Tejun Heo:
    "Several noteworthy changes.

    - Parav's rdma controller is finally merged. It is very straight
    forward and can limit the abosolute numbers of common rdma
    constructs used by different cgroups.

    - kernel/cgroup.c got too chubby and disorganized. Created
    kernel/cgroup/ subdirectory and moved all cgroup related files
    under kernel/ there and reorganized the core code. This hurts for
    backporting patches but was long overdue.

    - cgroup v2 process listing reimplemented so that it no longer
    depends on allocating a buffer large enough to cache the entire
    result to sort and uniq the output. v2 has always mangled the sort
    order to ensure that users don't depend on the sorted output, so
    this shouldn't surprise anybody. This makes the pid listing
    functions use the same iterators that are used internally, which
    have to have the same iterating capabilities anyway.

    - perf cgroup filtering now works automatically on cgroup v2. This
    patch was posted a long time ago but somehow fell through the
    cracks.

    - misc fixes asnd documentation updates"

    * 'for-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (27 commits)
    kernfs: fix locking around kernfs_ops->release() callback
    cgroup: drop the matching uid requirement on migration for cgroup v2
    cgroup, perf_event: make perf_event controller work on cgroup2 hierarchy
    cgroup: misc cleanups
    cgroup: call subsys->*attach() only for subsystems which are actually affected by migration
    cgroup: track migration context in cgroup_mgctx
    cgroup: cosmetic update to cgroup_taskset_add()
    rdmacg: Fixed uninitialized current resource usage
    cgroup: Add missing cgroup-v2 PID controller documentation.
    rdmacg: Added documentation for rdmacg
    IB/core: added support to use rdma cgroup controller
    rdmacg: Added rdma cgroup controller
    cgroup: fix a comment typo
    cgroup: fix RCU related sparse warnings
    cgroup: move namespace code to kernel/cgroup/namespace.c
    cgroup: rename functions for consistency
    cgroup: move v1 mount functions to kernel/cgroup/cgroup-v1.c
    cgroup: separate out cgroup1_kf_syscall_ops
    cgroup: refactor mount path and clearly distinguish v1 and v2 paths
    cgroup: move cgroup v1 specific code to kernel/cgroup/cgroup-v1.c
    ...

    Linus Torvalds
     

10 Feb, 2017

1 commit


28 Dec, 2016

1 commit

  • Add ->open/release() methods to kernfs_ops. ->open() is called when
    the file is opened and ->release() when the file is either released or
    severed. These callbacks can be used, for example, to manage
    persistent caching objects over multiple seq_file iterations.

    Signed-off-by: Tejun Heo
    Acked-by: Greg Kroah-Hartman
    Acked-by: Acked-by: Zefan Li

    Tejun Heo
     

15 Oct, 2016

1 commit

  • Pull cgroup updates from Tejun Heo:

    - tracepoints for basic cgroup management operations added

    - kernfs and cgroup path formatting functions updated to behave in the
    style of strlcpy()

    - non-critical bug fixes

    * 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    blkcg: Unlock blkcg_pol_mutex only once when cpd == NULL
    cgroup: fix error handling regressions in proc_cgroup_show() and cgroup_release_agent()
    cpuset: fix error handling regression in proc_cpuset_show()
    cgroup: add tracepoints for basic operations
    cgroup: make cgroup_path() and friends behave in the style of strlcpy()
    kernfs: remove kernfs_path_len()
    kernfs: make kernfs_path*() behave in the style of strlcpy()
    kernfs: add dummy implementation of kernfs_path_from_node()

    Linus Torvalds
     

11 Oct, 2016

1 commit

  • Pull more vfs updates from Al Viro:
    ">rename2() work from Miklos + current_time() from Deepa"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: Replace current_fs_time() with current_time()
    fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
    fs: Replace CURRENT_TIME with current_time() for inode timestamps
    fs: proc: Delete inode time initializations in proc_alloc_inode()
    vfs: Add current_time() api
    vfs: add note about i_op->rename changes to porting
    fs: rename "rename2" i_op to "rename"
    vfs: remove unused i_op->rename
    fs: make remaining filesystems use .rename2
    libfs: support RENAME_NOREPLACE in simple_rename()
    fs: support RENAME_NOREPLACE for local filesystems
    ncpfs: fix unused variable warning

    Linus Torvalds
     

08 Oct, 2016

1 commit


07 Oct, 2016

1 commit


27 Sep, 2016

2 commits

  • Generated patch:

    sed -i "s/\.rename2\t/\.rename\t\t/" `git grep -wl rename2`
    sed -i "s/\brename2\b/rename/g" `git grep -wl rename2`

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • This is trivial to do:

    - add flags argument to foo_rename()
    - check if flags is zero
    - assign foo_rename() to .rename2 instead of .rename

    This doesn't mean it's impossible to support RENAME_NOREPLACE for these
    filesystems, but it is not trivial, like for local filesystems.
    RENAME_NOREPLACE must guarantee atomicity (i.e. it shouldn't be possible
    for a file to be created on one host while it is overwritten by rename on
    another host).

    Filesystems converted:

    9p, afs, ceph, coda, ecryptfs, kernfs, lustre, ncpfs, nfs, ocfs2, orangefs.

    After this, we can get rid of the duplicate interfaces for rename.

    Signed-off-by: Miklos Szeredi
    Acked-by: Greg Kroah-Hartman
    Acked-by: David Howells [AFS]
    Acked-by: Mike Marshall
    Cc: Eric Van Hensbergen
    Cc: Ilya Dryomov
    Cc: Jan Harkes
    Cc: Tyler Hicks
    Cc: Oleg Drokin
    Cc: Trond Myklebust
    Cc: Mark Fasheh

    Miklos Szeredi
     

10 Aug, 2016

2 commits

  • It doesn't have any in-kernel user and the same result can be obtained
    from kernfs_path(@kn, NULL, 0). Remove it.

    Signed-off-by: Tejun Heo
    Acked-by: Greg Kroah-Hartman
    Cc: Serge Hallyn

    Tejun Heo
     
  • kernfs_path*() functions always return the length of the full path but
    the path content is undefined if the length is larger than the
    provided buffer. This makes its behavior different from strlcpy() and
    requires error handling in all its users even when they don't care
    about truncation. In addition, the implementation can actully be
    simplified by making it behave properly in strlcpy() style.

    * Update kernfs_path_from_node_locked() to always fill up the buffer
    with path. If the buffer is not large enough, the output is
    truncated and terminated.

    * kernfs_path() no longer needs error handling. Make it a simple
    inline wrapper around kernfs_path_from_node().

    * sysfs_warn_dup()'s use of kernfs_path() doesn't need error handling.
    Updated accordingly.

    * cgroup_path()'s use of kernfs_path() updated to retain the old
    behavior.

    Signed-off-by: Tejun Heo
    Acked-by: Greg Kroah-Hartman
    Acked-by: Serge Hallyn

    Tejun Heo
     

11 Jun, 2016

1 commit

  • We always mixed in the parent pointer into the dentry name hash, but we
    did it late at lookup time. It turns out that we can simplify that
    lookup-time action by salting the hash with the parent pointer early
    instead of late.

    A few other users of our string hashes also wanted to mix in their own
    pointers into the hash, and those are updated to use the same mechanism.

    Hash users that don't have any particular initial salt can just use the
    NULL pointer as a no-salt.

    Cc: Vegard Nossum
    Cc: George Spelvin
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

21 May, 2016

1 commit

  • Pull driver core updates from Greg KH:
    "Here's the "big" driver core update for 4.7-rc1.

    Mostly just debugfs changes, the long-known and messy races with
    removing debugfs files should be fixed thanks to the great work of
    Nicolai Stange. We also have some isa updates in here (the x86
    maintainers told me to take it through this tree), a new warning when
    we run out of dynamic char major numbers, and a few other assorted
    changes, details in the shortlog.

    All have been in linux-next for some time with no reported issues"

    * tag 'driver-core-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (32 commits)
    Revert "base: dd: don't remove driver_data in -EPROBE_DEFER case"
    gpio: ws16c48: Utilize the ISA bus driver
    gpio: 104-idio-16: Utilize the ISA bus driver
    gpio: 104-idi-48: Utilize the ISA bus driver
    gpio: 104-dio-48e: Utilize the ISA bus driver
    watchdog: ebc-c384_wdt: Utilize the ISA bus driver
    iio: stx104: Utilize the module_isa_driver and max_num_isa_dev macros
    iio: stx104: Add X86 dependency to STX104 Kconfig option
    Documentation: Add ISA bus driver documentation
    isa: Implement the max_num_isa_dev macro
    isa: Implement the module_isa_driver macro
    pnp: pnpbios: Add explicit X86_32 dependency to PNPBIOS
    isa: Decouple X86_32 dependency from the ISA Kconfig option
    driver-core: use 'dev' argument in dev_dbg_ratelimited stub
    base: dd: don't remove driver_data in -EPROBE_DEFER case
    kernfs: Move faulting copy_user operations outside of the mutex
    devcoredump: add scatterlist support
    debugfs: unproxify files created through debugfs_create_u32_array()
    debugfs: unproxify files created through debugfs_create_blob()
    debugfs: unproxify files created through debugfs_create_bool()
    ...

    Linus Torvalds
     

18 May, 2016

1 commit

  • Pull parallel filesystem directory handling update from Al Viro.

    This is the main parallel directory work by Al that makes the vfs layer
    able to do lookup and readdir in parallel within a single directory.
    That's a big change, since this used to be all protected by the
    directory inode mutex.

    The inode mutex is replaced by an rwsem, and serialization of lookups of
    a single name is done by a "in-progress" dentry marker.

    The series begins with xattr cleanups, and then ends with switching
    filesystems over to actually doing the readdir in parallel (switching to
    the "iterate_shared()" that only takes the read lock).

    A more detailed explanation of the process from Al Viro:
    "The xattr work starts with some acl fixes, then switches ->getxattr to
    passing inode and dentry separately. This is the point where the
    things start to get tricky - that got merged into the very beginning
    of the -rc3-based #work.lookups, to allow untangling the
    security_d_instantiate() mess. The xattr work itself proceeds to
    switch a lot of filesystems to generic_...xattr(); no complications
    there.

    After that initial xattr work, the series then does the following:

    - untangle security_d_instantiate()

    - convert a bunch of open-coded lookup_one_len_unlocked() to calls of
    that thing; one such place (in overlayfs) actually yields a trivial
    conflict with overlayfs fixes later in the cycle - overlayfs ended
    up switching to a variant of lookup_one_len_unlocked() sans the
    permission checks. I would've dropped that commit (it gets
    overridden on merge from #ovl-fixes in #for-next; proper resolution
    is to use the variant in mainline fs/overlayfs/super.c), but I
    didn't want to rebase the damn thing - it was fairly late in the
    cycle...

    - some filesystems had managed to depend on lookup/lookup exclusion
    for *fs-internal* data structures in a way that would break if we
    relaxed the VFS exclusion. Fixing hadn't been hard, fortunately.

    - core of that series - parallel lookup machinery, replacing
    ->i_mutex with rwsem, making lookup_slow() take it only shared. At
    that point lookups happen in parallel; lookups on the same name
    wait for the in-progress one to be done with that dentry.

    Surprisingly little code, at that - almost all of it is in
    fs/dcache.c, with fs/namei.c changes limited to lookup_slow() -
    making it use the new primitive and actually switching to locking
    shared.

    - parallel readdir stuff - first of all, we provide the exclusion on
    per-struct file basis, same as we do for read() vs lseek() for
    regular files. That takes care of most of the needed exclusion in
    readdir/readdir; however, these guys are trickier than lookups, so
    I went for switching them one-by-one. To do that, a new method
    '->iterate_shared()' is added and filesystems are switched to it
    as they are either confirmed to be OK with shared lock on directory
    or fixed to be OK with that. I hope to kill the original method
    come next cycle (almost all in-tree filesystems are switched
    already), but it's still not quite finished.

    - several filesystems get switched to parallel readdir. The
    interesting part here is dealing with dcache preseeding by readdir;
    that needs minor adjustment to be safe with directory locked only
    shared.

    Most of the filesystems doing that got switched to in those
    commits. Important exception: NFS. Turns out that NFS folks, with
    their, er, insistence on VFS getting the fuck out of the way of the
    Smart Filesystem Code That Knows How And What To Lock(tm) have
    grown the locking of their own. They had their own homegrown
    rwsem, with lookup/readdir/atomic_open being *writers* (sillyunlink
    is the reader there). Of course, with VFS getting the fuck out of
    the way, as requested, the actual smarts of the smart filesystem
    code etc. had become exposed...

    - do_last/lookup_open/atomic_open cleanups. As the result, open()
    without O_CREAT locks the directory only shared. Including the
    ->atomic_open() case. Backmerge from #for-linus in the middle of
    that - atomic_open() fix got brought in.

    - then comes NFS switch to saner (VFS-based ;-) locking, killing the
    homegrown "lookup and readdir are writers" kinda-sorta rwsem. All
    exclusion for sillyunlink/lookup is done by the parallel lookups
    mechanism. Exclusion between sillyunlink and rmdir is a real rwsem
    now - rmdir being the writer.

    Result: NFS lookups/readdirs/O_CREAT-less opens happen in parallel
    now.

    - the rest of the series consists of switching a lot of filesystems
    to parallel readdir; in a lot of cases ->llseek() gets simplified
    as well. One backmerge in there (again, #for-linus - rockridge
    fix)"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (74 commits)
    ext4: switch to ->iterate_shared()
    hfs: switch to ->iterate_shared()
    hfsplus: switch to ->iterate_shared()
    hostfs: switch to ->iterate_shared()
    hpfs: switch to ->iterate_shared()
    hpfs: handle allocation failures in hpfs_add_pos()
    gfs2: switch to ->iterate_shared()
    f2fs: switch to ->iterate_shared()
    afs: switch to ->iterate_shared()
    befs: switch to ->iterate_shared()
    befs: constify stuff a bit
    isofs: switch to ->iterate_shared()
    get_acorn_filename(): deobfuscate a bit
    btrfs: switch to ->iterate_shared()
    logfs: no need to lock directory in lseek
    switch ecryptfs to ->iterate_shared
    9p: switch to ->iterate_shared()
    fat: switch to ->iterate_shared()
    romfs, squashfs: switch to ->iterate_shared()
    more trivial ->iterate_shared conversions
    ...

    Linus Torvalds
     

09 May, 2016

1 commit


03 May, 2016

1 commit

  • We've calculated @len to be the bytes we need for '/..' entries from
    @kn_from to the common ancestor, and calculated @nlen to be the extra
    bytes we need to get from the common ancestor to @kn_to. We use them
    as such at the end. But in the loop copying the actual entries, we
    overwrite @nlen. Use a temporary variable for that instead.

    Without this, the return length, when the buffer is large enough, is
    wrong. (When the buffer is NULL or too small, the returned value is
    correct. The buffer contents are also correct.)

    Interestingly, no callers of this function are affected by this as of
    yet. However the upcoming cgroup_show_path() will be.

    Signed-off-by: Serge Hallyn

    Serge Hallyn
     

30 Mar, 2016

1 commit

  • This is in preparation for the series that transitions
    filesystem timestamps to use 64 bit time and hence make
    them y2038 safe.

    CURRENT_TIME macro will be deleted before merging the
    aforementioned series.

    Use current_fs_time() instead of CURRENT_TIME for inode
    timestamps.

    struct kernfs_node is associated with a sysfs file/ directory.
    Truncate the values to appropriate time granularity when
    writing to inode timestamps of the files.

    ktime_get_real_ts() is used to obtain times for
    struct kernfs_iattrs. Since these times are later assigned to
    inode times using timespec_truncate() for all filesystem based
    operations, we can save the supers list traversal time here by
    using ktime_get_real_ts() directly.

    Signed-off-by: Deepa Dinamani
    Signed-off-by: Greg Kroah-Hartman

    Deepa Dinamani
     

22 Mar, 2016

1 commit

  • Pull cgroup namespace support from Tejun Heo:
    "These are changes to implement namespace support for cgroup which has
    been pending for quite some time now. It is very straight-forward and
    only affects what part of cgroup hierarchies are visible.

    After unsharing, mounting a cgroup fs will be scoped to the cgroups
    the task belonged to at the time of unsharing and the cgroup paths
    exposed to userland would be adjusted accordingly"

    * 'for-4.6-ns' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: fix and restructure error handling in copy_cgroup_ns()
    cgroup: fix alloc_cgroup_ns() error handling in copy_cgroup_ns()
    Add FS_USERNS_FLAG to cgroup fs
    cgroup: Add documentation for cgroup namespaces
    cgroup: mount cgroupns-root when inside non-init cgroupns
    kernfs: define kernfs_node_dentry
    cgroup: cgroup namespace setns support
    cgroup: introduce cgroup namespaces
    sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
    kernfs: Add API to generate relative kernfs path

    Linus Torvalds
     

17 Feb, 2016

1 commit


08 Feb, 2016

1 commit

  • kernfs_walk_ns() uses a static path_buf[PATH_MAX] to separate out path
    components. Keeping around the 4k buffer just for kernfs_walk_ns() is
    wasteful. This patch makes it piggyback on kernfs_pr_cont_buf[]
    instead. This requires kernfs_walk_ns() to hold kernfs_rename_lock.

    Signed-off-by: Tejun Heo
    Reported-by: Geert Uytterhoeven
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

23 Jan, 2016

1 commit

  • parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
    inode_foo(inode) being mutex_foo(&inode->i_mutex).

    Please, use those for access to ->i_mutex; over the coming cycle
    ->i_mutex will become rwsem, with ->lookup() done with it held
    only shared.

    Signed-off-by: Al Viro

    Al Viro
     

15 Jan, 2016

1 commit

  • Currently, all kmem allocations (namely every kmem_cache_alloc, kmalloc,
    alloc_kmem_pages call) are accounted to memory cgroup automatically.
    Callers have to explicitly opt out if they don't want/need accounting
    for some reason. Such a design decision leads to several problems:

    - kmalloc users are highly sensitive to failures, many of them
    implicitly rely on the fact that kmalloc never fails, while memcg
    makes failures quite plausible.

    - A lot of objects are shared among different containers by design.
    Accounting such objects to one of containers is just unfair.
    Moreover, it might lead to pinning a dead memcg along with its kmem
    caches, which aren't tiny, which might result in noticeable increase
    in memory consumption for no apparent reason in the long run.

    - There are tons of short-lived objects. Accounting them to memcg will
    only result in slight noise and won't change the overall picture, but
    we still have to pay accounting overhead.

    For more info, see

    - http://lkml.kernel.org/r/20151105144002.GB15111%40dhcp22.suse.cz
    - http://lkml.kernel.org/r/20151106090555.GK29259@esperanza

    Therefore this patchset switches to the white list policy. Now kmalloc
    users have to explicitly opt in by passing __GFP_ACCOUNT flag.

    Currently, the list of accounted objects is quite limited and only
    includes those allocations that (1) are known to be easily triggered
    from userspace and (2) can fail gracefully (for the full list see patch
    no. 6) and it still misses many object types. However, accounting only
    those objects should be a satisfactory approximation of the behavior we
    used to have for most sane workloads.

    This patch (of 6):

    Revert 499611ed451508a42d1d7d ("kernfs: do not account ino_ida allocations
    to memcg").

    Black-list kmem accounting policy (aka __GFP_NOACCOUNT) turned out to be
    fragile and difficult to maintain, because there seem to be many more
    allocations that should not be accounted than those that should be.
    Besides, false accounting an allocation might result in much worse
    consequences than not accounting at all, namely increased memory
    consumption due to pinned dead kmem caches.

    So it was decided to switch to the white-list policy. This patch reverts
    bits introducing the black-list policy. The white-list policy will be
    introduced later in the series.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

21 Nov, 2015

1 commit

  • Implement kernfs_walk_and_get() which is similar to
    kernfs_find_and_get() but can walk a path instead of just a name.

    v2: Use strlcpy() instead of strlen() + memcpy() as suggested by
    David.

    Signed-off-by: Tejun Heo
    Acked-by: Greg Kroah-Hartman
    Cc: David Miller

    Tejun Heo
     

19 Aug, 2015

1 commit


04 Jul, 2015

1 commit

  • Pull user namespace updates from Eric Biederman:
    "Long ago and far away when user namespaces where young it was realized
    that allowing fresh mounts of proc and sysfs with only user namespace
    permissions could violate the basic rule that only root gets to decide
    if proc or sysfs should be mounted at all.

    Some hacks were put in place to reduce the worst of the damage could
    be done, and the common sense rule was adopted that fresh mounts of
    proc and sysfs should allow no more than bind mounts of proc and
    sysfs. Unfortunately that rule has not been fully enforced.

    There are two kinds of gaps in that enforcement. Only filesystems
    mounted on empty directories of proc and sysfs should be ignored but
    the test for empty directories was insufficient. So in my tree
    directories on proc, sysctl and sysfs that will always be empty are
    created specially. Every other technique is imperfect as an ordinary
    directory can have entries added even after a readdir returns and
    shows that the directory is empty. Special creation of directories
    for mount points makes the code in the kernel a smidge clearer about
    it's purpose. I asked container developers from the various container
    projects to help test this and no holes were found in the set of mount
    points on proc and sysfs that are created specially.

    This set of changes also starts enforcing the mount flags of fresh
    mounts of proc and sysfs are consistent with the existing mount of
    proc and sysfs. I expected this to be the boring part of the work but
    unfortunately unprivileged userspace winds up mounting fresh copies of
    proc and sysfs with noexec and nosuid clear when root set those flags
    on the previous mount of proc and sysfs. So for now only the atime,
    read-only and nodev attributes which userspace happens to keep
    consistent are enforced. Dealing with the noexec and nosuid
    attributes remains for another time.

    This set of changes also addresses an issue with how open file
    descriptors from /proc//ns/* are displayed. Recently readlink of
    /proc//fd has been triggering a WARN_ON that has not been
    meaningful since it was added (as all of the code in the kernel was
    converted) and is not now actively wrong.

    There is also a short list of issues that have not been fixed yet that
    I will mention briefly.

    It is possible to rename a directory from below to above a bind mount.
    At which point any directory pointers below the renamed directory can
    be walked up to the root directory of the filesystem. With user
    namespaces enabled a bind mount of the bind mount can be created
    allowing the user to pick a directory whose children they can rename
    to outside of the bind mount. This is challenging to fix and doubly
    so because all obvious solutions must touch code that is in the
    performance part of pathname resolution.

    As mentioned above there is also a question of how to ensure that
    developers by accident or with purpose do not introduce exectuable
    files on sysfs and proc and in doing so introduce security regressions
    in the current userspace that will not be immediately obvious and as
    such are likely to require breaking userspace in painful ways once
    they are recognized"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    vfs: Remove incorrect debugging WARN in prepend_path
    mnt: Update fs_fully_visible to test for permanently empty directories
    sysfs: Create mountpoints with sysfs_create_mount_point
    sysfs: Add support for permanently empty directories to serve as mount points.
    kernfs: Add support for always empty directories.
    proc: Allow creating permanently empty directories that serve as mount points
    sysctl: Allow creating permanently empty directories that serve as mountpoints.
    fs: Add helper functions for permanently empty directories.
    vfs: Ignore unlocked mounts in fs_fully_visible
    mnt: Modify fs_fully_visible to deal with locked ro nodev and atime
    mnt: Refactor the logic for mounting sysfs and proc in a user namespace

    Linus Torvalds
     

01 Jul, 2015

1 commit

  • Add a new function kernfs_create_empty_dir that can be used to create
    directory that can not be modified.

    Update the code to use make_empty_dir_inode when reporting a
    permanently empty directory to the vfs.

    Update the code to not allow adding to permanently empty directories.

    Cc: stable@vger.kernel.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman