30 May, 2018

1 commit

  • [ Upstream commit a0b0d1c345d0317efe594df268feb5ccc99f651e ]

    proc_sys_link_fill_cache() does not take currently unregistering sysctl
    tables into account, which might result in a page fault in
    sysctl_follow_link() - add a check to fix it.
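
    The kind of check described above can be sketched in userspace as
    follows. This is a simplified model, not the kernel code: struct
    table_header, use_table(), unuse_table() and fill_link() are
    hypothetical stand-ins chosen for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a sysctl table header. */
struct table_header {
    int used;            /* number of active users */
    bool unregistering;  /* set once teardown has started */
    const char *target;  /* what a link entry would resolve to */
};

/* Take a reference unless the table is already being torn down. */
static bool use_table(struct table_header *h)
{
    if (h->unregistering)
        return false;
    h->used++;
    return true;
}

/* Drop a reference taken by use_table(). */
static void unuse_table(struct table_header *h)
{
    h->used--;
}

/*
 * Model of filling a cache entry for a link: the link is only followed
 * while a reference is held, and it is skipped when the table is
 * unregistering. Returns the link target, or NULL if it was skipped.
 */
static const char *fill_link(struct table_header *h)
{
    const char *target;

    if (!use_table(h))
        return NULL;        /* table is going away: do not touch it */
    target = h->target;     /* safe: a reference is held */
    unuse_table(h);
    return target;
}
```

    In this model a header with unregistering set is never dereferenced
    past the flag check, which is the property the fix is after.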

    This bug has been present since v3.4.

    Link: http://lkml.kernel.org/r/20180228013506.4915-1-danilokrummrich@dk-develop.de
    Fixes: 0e47c99d7fe25 ("sysctl: Replace root_list with links between sysctl_table_sets")
    Signed-off-by: Danilo Krummrich
    Acked-by: Kees Cook
    Reviewed-by: Andrew Morton
    Cc: "Luis R . Rodriguez"
    Cc: "Eric W. Biederman"
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Danilo Krummrich
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

14 Jul, 2017

1 commit

  • Merge yet more updates from Andrew Morton:

    - various misc things

    - kexec updates

    - sysctl core updates

    - scripts/gdb updates

    - checkpoint-restart updates

    - ipc updates

    - kernel/watchdog updates

    - Kees's "rough equivalent to the glibc _FORTIFY_SOURCE=1 feature"

    - "stackprotector: ascii armor the stack canary"

    - more MM bits

    - checkpatch updates

    * emailed patches from Andrew Morton: (96 commits)
    writeback: rework wb_[dec|inc]_stat family of functions
    ARM: samsung: usb-ohci: move inline before return type
    video: fbdev: omap: move inline before return type
    video: fbdev: intelfb: move inline before return type
    USB: serial: safe_serial: move __inline__ before return type
    drivers: tty: serial: move inline before return type
    drivers: s390: move static and inline before return type
    x86/efi: move asmlinkage before return type
    sh: move inline before return type
    MIPS: SMP: move asmlinkage before return type
    m68k: coldfire: move inline before return type
    ia64: sn: pci: move inline before type
    ia64: move inline before return type
    FRV: tlbflush: move asmlinkage before return type
    CRIS: gpio: move inline before return type
    ARM: HP Jornada 7XX: move inline before return type
    ARM: KVM: move asmlinkage before type
    checkpatch: improve the STORAGE_CLASS test
    mm, migration: do not trigger OOM killer when migrating memory
    drm/i915: use __GFP_RETRY_MAYFAIL
    ...

    Linus Torvalds
     

13 Jul, 2017

3 commits

  • To keep parity with the regular int interfaces, provide an unsigned int
    proc_douintvec_minmax(), which allows you to specify a range of allowed
    valid numbers.

    Adding proc_douintvec_minmax_sysadmin() is easy but we can wait for an
    actual user for that.
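
    A range-checked unsigned int parser can be sketched in userspace as
    below. This is a minimal illustration of the idea, not the kernel
    helper: parse_uint_minmax() is a hypothetical name, and the test
    values assume a 32-bit unsigned int.

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Parse a decimal unsigned int from 's' and accept it only if it lies
 * in [min, max]. Returns 0 on success or -EINVAL; '*out' is only
 * written on success.
 */
static int parse_uint_minmax(const char *s, unsigned int min,
                             unsigned int max, unsigned int *out)
{
    char *end;
    unsigned long long v;

    /* strtoull() would silently accept a leading '-', so reject it here. */
    while (isspace((unsigned char)*s))
        s++;
    if (!isdigit((unsigned char)*s))
        return -EINVAL;

    errno = 0;
    v = strtoull(s, &end, 10);
    if (errno == ERANGE || v > UINT_MAX)
        return -EINVAL;             /* does not fit in unsigned int */
    if (*end != '\0' && *end != '\n')
        return -EINVAL;             /* trailing garbage */
    if (v < min || v > max)
        return -EINVAL;             /* outside the allowed range */

    *out = (unsigned int)v;
    return 0;
}
```

    With min = 1 and max = 100, the strings "0", "101" and "-5" are all
    rejected and the previous value is left untouched.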

    Link: http://lkml.kernel.org/r/20170519033554.18592-6-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Acked-by: Kees Cook
    Cc: Subash Abhinov Kasiviswanathan
    Cc: Heinrich Schuchardt
    Cc: Kees Cook
    Cc: "David S. Miller"
    Cc: Ingo Molnar
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • Commit e7d316a02f68 ("sysctl: handle error writing UINT_MAX to u32
    fields") added proc_douintvec() to start adding support for
    unsigned int; this, however, was only half the work needed. Two fixes
    have come in since then for the following issues:

    o Printing the values shows a negative value; this happens because
    do_proc_dointvec() is used, and it prints through proc_put_long()

    This was fixed by commit 5380e5644afbba9 ("sysctl: don't print negative
    flag for proc_douintvec").

    o We can easily wrap around the int values: UINT_MAX is 4294967295, if
    we echo in 4294967295 + 1 we end up with 0, using 4294967295 + 2 we
    end up with 1.
    o We echo negative values in and they are accepted

    This was fixed by commit 425fffd886ba ("sysctl: report EINVAL if value
    is larger than UINT_MAX for proc_douintvec").
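
    The wrap-around arithmetic above can be reproduced in userspace.
    This is only a sketch: store_unchecked() and store_checked() are
    hypothetical names, and the numbers assume a 32-bit unsigned int.

```c
#include <errno.h>
#include <limits.h>

/* Unchecked store: the value is reduced modulo UINT_MAX + 1. */
static unsigned int store_unchecked(unsigned long long v)
{
    return (unsigned int)v;
}

/* Checked store: refuse anything that does not fit in unsigned int. */
static int store_checked(unsigned long long v, unsigned int *out)
{
    if (v > UINT_MAX)
        return -EINVAL;
    *out = (unsigned int)v;
    return 0;
}
```

    Here store_unchecked(4294967295ULL + 1) yields 0 and
    store_unchecked(4294967295ULL + 2) yields 1, while store_checked()
    returns -EINVAL for both.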

    It also still was not added to sysctl_check_table(). Instead of
    adding it with the current implementation, just provide proper and
    simplified unsigned int support, without any array support and
    with no negative support at all.

    Historically sysctl proc helpers have supported arrays. Due to the
    complexity this adds, though, we've taken a step back to evaluate array
    users and determine if it's worth maintaining for unsigned int. An
    evaluation using Coccinelle has been done to perform a grammatical
    search to ask ourselves:

    o How many sysctl proc_dointvec() (int) users exist which likely
    should be moved over to proc_douintvec() (unsigned int) ?
    Answer: about 8
    - Of these how many are array users ?
    Answer: Probably only 1
    o How many sysctl array users exist ?
    Answer: about 12

    This last question gives us an idea of just how popular arrays are: they are not.
    Array support should probably just be kept for strings.

    The identified uint ports are:

    drivers/infiniband/core/ucma.c - max_backlog
    drivers/infiniband/core/iwcm.c - default_backlog
    net/core/sysctl_net_core.c - rps_sock_flow_sysctl()
    net/netfilter/nf_conntrack_timestamp.c - nf_conntrack_timestamp -- bool
    net/netfilter/nf_conntrack_acct.c nf_conntrack_acct -- bool
    net/netfilter/nf_conntrack_ecache.c - nf_conntrack_events -- bool
    net/netfilter/nf_conntrack_helper.c - nf_conntrack_helper -- bool
    net/phonet/sysctl.c proc_local_port_range()

    The only possible array user is proc_local_port_range(), but it does not
    seem worth it to add array support just for this, given that range support
    works just as well. Unsigned int support is mainly desirable when you
    *need* more than INT_MAX, or when int min/max support does not suffice
    for your ranges.

    If you forget and by mistake happen to register an unsigned int proc
    entry with an array, the driver will fail and you will get something as
    follows:

    sysctl table check failed: debug/test_sysctl//uint_0002 array not allowed
    CPU: 2 PID: 1342 Comm: modprobe Tainted: G W E
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
    Call Trace:
    dump_stack+0x63/0x81
    __register_sysctl_table+0x350/0x650
    ? kmem_cache_alloc_trace+0x107/0x240
    __register_sysctl_paths+0x1b3/0x1e0
    ? 0xffffffffc005f000
    register_sysctl_table+0x1f/0x30
    test_sysctl_init+0x10/0x1000 [test_sysctl]
    do_one_initcall+0x52/0x1a0
    ? kmem_cache_alloc_trace+0x107/0x240
    do_init_module+0x5f/0x200
    load_module+0x1867/0x1bd0
    ? __symbol_put+0x60/0x60
    SYSC_finit_module+0xdf/0x110
    SyS_finit_module+0xe/0x10
    entry_SYSCALL_64_fastpath+0x1e/0xad
    RIP: 0033:0x7f042b22d119
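
    The registration-time check that rejects such an entry can be
    modelled in userspace as below. This is a sketch under assumptions:
    struct entry, enum handler_kind and check_entry_array() are
    hypothetical, and the only rule modelled is that an unsigned int
    entry must be backed by exactly one unsigned int.

```c
#include <stddef.h>
#include <stdio.h>

enum handler_kind { HANDLER_INT, HANDLER_UINT };

/* Hypothetical model of one table entry. */
struct entry {
    const char *name;
    enum handler_kind kind;
    size_t maxlen;   /* size in bytes of the backing storage */
};

/* Returns 0 if the entry is acceptable, nonzero otherwise. */
static int check_entry_array(const struct entry *e)
{
    if (e->kind == HANDLER_UINT && e->maxlen != sizeof(unsigned int)) {
        fprintf(stderr, "table check failed: %s array not allowed\n",
                e->name);
        return 1;
    }
    return 0;
}
```

    An entry whose maxlen is 2 * sizeof(unsigned int) fails the check,
    while an int entry of the same size is still accepted.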

    Fixes: e7d316a02f68 ("sysctl: handle error writing UINT_MAX to u32 fields")
    Link: http://lkml.kernel.org/r/20170519033554.18592-5-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Suggested-by: Alexey Dobriyan
    Cc: Subash Abhinov Kasiviswanathan
    Cc: Liping Zhang
    Cc: Alexey Dobriyan
    Cc: Heinrich Schuchardt
    Cc: Kees Cook
    Cc: "David S. Miller"
    Cc: Ingo Molnar
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • Patch series "sysctl: few fixes", v5.

    I've been working on making kmod more deterministic, and as I did that I
    couldn't help but notice a few issues with sysctl. My end goal was just
    to fix unsigned int support, which back then was completely broken.
    Liping Zhang has sent up small atomic fixes; however, they still missed
    one more fix, and Alexey Dobriyan had also suggested just dropping array
    support given its complexity.

    I have inspected array support using Coccinelle and indeed it's not that
    popular, so if in fact we can avoid it for new interfaces, I agree it's
    best.

    I did develop a sysctl stress driver but will hold that off for another
    series.

    This patch (of 5):

    Commit 7c60c48f58a7 ("sysctl: Improve the sysctl sanity checks")
    improved sanity checks considerably, however the enhancements on
    sysctl_check_table() meant adding a functional change so that only the
    last table entry's sanity error is propagated. It also changed the way
    errors were propagated so that each new check reset the err value; this
    means only the last sanity check computed is used for an error. This has
    been in the kernel since v3.4 days.

    Fix this by carrying on errors from previous checks and iterations as we
    traverse the table and ensuring we keep any error from previous checks.
    We keep iterating on the table even if an error is found so we can
    complain for all errors found in one shot. This works as -EINVAL is
    always returned on error anyway, and the check for error is any non-zero
    value.
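
    The difference between overwriting and carrying the error can be
    shown with a small userspace model. The names here (struct entry,
    check_entry(), check_table_overwrite(), check_table_accumulate())
    are hypothetical, and the per-entry rule is made up for the example.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical table entry; a NULL name terminates the table. */
struct entry {
    const char *name;
    int mode;
};

/* Made-up per-entry check: bits outside 0777 make the entry invalid. */
static int check_entry(const struct entry *e)
{
    return (e->mode & ~0777) ? -EINVAL : 0;
}

/* Buggy: each iteration overwrites err, so only the last entry counts. */
static int check_table_overwrite(const struct entry *t)
{
    int err = 0;

    for (; t->name; t++)
        err = check_entry(t);
    return err;
}

/* Fixed: accumulate, and keep going so every bad entry is visited. */
static int check_table_accumulate(const struct entry *t)
{
    int err = 0;

    for (; t->name; t++)
        err |= check_entry(t);
    return err;
}
```

    OR-ing is enough in this model because every failing check returns
    the same -EINVAL and callers only test for a non-zero result.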

    Fixes: 7c60c48f58a7 ("sysctl: Improve the sysctl sanity checks")
    Link: http://lkml.kernel.org/r/20170519033554.18592-2-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Alexey Dobriyan
    Cc: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     

12 Jul, 2017

1 commit

  • Andrei Vagin writes:
    FYI: This bug has been reproduced on 4.11.7
    > BUG: Dentry ffff895a3dd01240{i=4e7c09a,n=lo} still in use (1) [unmount of proc proc]
    > ------------[ cut here ]------------
    > WARNING: CPU: 1 PID: 13588 at fs/dcache.c:1445 umount_check+0x6e/0x80
    > CPU: 1 PID: 13588 Comm: kworker/1:1 Not tainted 4.11.7-200.fc25.x86_64 #1
    > Hardware name: CompuLab sbc-flt1/fitlet, BIOS SBCFLT_0.08.04 06/27/2015
    > Workqueue: events proc_cleanup_work
    > Call Trace:
    > dump_stack+0x63/0x86
    > __warn+0xcb/0xf0
    > warn_slowpath_null+0x1d/0x20
    > umount_check+0x6e/0x80
    > d_walk+0xc6/0x270
    > ? dentry_free+0x80/0x80
    > do_one_tree+0x26/0x40
    > shrink_dcache_for_umount+0x2d/0x90
    > generic_shutdown_super+0x1f/0xf0
    > kill_anon_super+0x12/0x20
    > proc_kill_sb+0x40/0x50
    > deactivate_locked_super+0x43/0x70
    > deactivate_super+0x5a/0x60
    > cleanup_mnt+0x3f/0x90
    > mntput_no_expire+0x13b/0x190
    > kern_unmount+0x3e/0x50
    > pid_ns_release_proc+0x15/0x20
    > proc_cleanup_work+0x15/0x20
    > process_one_work+0x197/0x450
    > worker_thread+0x4e/0x4a0
    > kthread+0x109/0x140
    > ? process_one_work+0x450/0x450
    > ? kthread_park+0x90/0x90
    > ret_from_fork+0x2c/0x40
    > ---[ end trace e1c109611e5d0b41 ]---
    > VFS: Busy inodes after unmount of proc. Self-destruct in 5 seconds. Have a nice day...
    > BUG: unable to handle kernel NULL pointer dereference at (null)
    > IP: _raw_spin_lock+0xc/0x30
    > PGD 0

    Fix this by taking a reference to the super block in proc_sys_prune_dcache.

    The superblock reference is the core of the fix; however, the sysctl_inodes
    list is converted to a hlist so that hlist_del_init_rcu may be used. This
    allows proc_sys_prune_dcache to remove inodes from the sysctl_inodes list,
    while not causing problems for proc_sys_evict_inode if it later chooses to
    remove the inode from the sysctl_inodes list. Removing inodes from the
    sysctl_inodes list allows proc_sys_prune_dcache to have a progress
    guarantee, while still being able to drop all locks. The fact that
    head->unregistering is set in start_unregistering ensures that no more
    inodes will be added to the sysctl_inodes list.

    Previously the code did a dance where it delayed calling iput until the
    next entry in the list was being considered to ensure the inode remained on
    the sysctl_inodes list until the next entry was walked to. The structure
    of the loop in this patch does not need that so is much easier to
    understand and maintain.
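
    The loop shape described above (unlink the first entry, drop the
    lock, do the work, retake the lock) can be sketched as a
    single-threaded userspace model. It is only an illustration: struct
    node, struct list_model and prune_all() are hypothetical, and an
    integer flag stands in for the real locking.

```c
#include <stddef.h>

struct node {
    struct node *next;
    int pruned;
};

struct list_model {
    struct node *first;
    int lock_held;   /* stands in for the lock protecting the list */
};

/* Prune every node on the list; returns how many were handled. */
static int prune_all(struct list_model *l)
{
    int count = 0;

    l->lock_held = 1;
    while (l->first) {
        struct node *n = l->first;

        /* Unlink first: the list shrinks on every pass, so the loop ends. */
        l->first = n->next;
        n->next = NULL;

        l->lock_held = 0;   /* the expensive work runs without the lock */
        n->pruned = 1;
        count++;
        l->lock_held = 1;
    }
    l->lock_held = 0;
    return count;
}
```

    Because each pass removes one entry and nothing is added once
    teardown has started, the loop terminates even though the lock is
    released in the middle of every iteration.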

    Cc: stable@vger.kernel.org
    Reported-by: Andrei Vagin
    Tested-by: Andrei Vagin
    Fixes: ace0c791e6c3 ("proc/sysctl: Don't grab i_lock under sysctl_lock.")
    Fixes: d6cffbbe9a7e ("proc/sysctl: prune stale dentries during unregistering")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

06 May, 2017

1 commit

  • Pull namespace updates from Eric Biederman:
    "This is a set of small fixes that were mostly stumbled over during
    more significant development. This proc fix and the fix to
    posix-timers are the most significant of the lot.

    There is a lot of good development going on but unfortunately it
    didn't quite make the merge window"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    proc: Fix unbalanced hard link numbers
    signal: Make kill_proc_info static
    rlimit: Properly call security_task_setrlimit
    signal: Remove unused definition of sig_user_definied
    ia64: Remove unused IA64_TASK_SIGHAND_OFFSET and IA64_SIGHAND_SIGLOCK_OFFSET
    ipc: Remove unused declaration of recompute_msgmni
    posix-timers: Correct sanity check in posix_cpu_nsleep
    sysctl: Remove dead register_sysctl_root

    Linus Torvalds
     

17 Apr, 2017

1 commit

  • The function no longer does anything. There is only a single caller of
    register_sysctl_root when semantically there should be two. Remove
    this function so that if someone decides this functionality is needed
    again it will be obvious all of the callers of setup_sysctl_set need
    to be audited and modified appropriately.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

08 Apr, 2017

1 commit

  • Commit e7d316a02f68 ("sysctl: handle error writing UINT_MAX to u32
    fields") introduced the proc_douintvec helper function, but it forgot to
    add the related sanity check when doing register_sysctl_table. So add
    it now.

    Signed-off-by: Liping Zhang
    Cc: Subash Abhinov Kasiviswanathan
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Liping Zhang
     

04 Mar, 2017

1 commit

  • Pull vfs 'statx()' update from Al Viro.

    This adds the new extended stat() interface that internally subsumes our
    previous stat interfaces, and allows user mode to specify in more detail
    what kind of information it wants.

    It also allows for some explicit synchronization information to be
    passed to the filesystem, which can be relevant for network filesystems:
    is the cached value ok, or do you need open/close consistency, or what?

    From David Howells.

    Andreas Dilger points out that the first version of the extended statx
    interface was posted June 29, 2010:

    https://www.spinics.net/lists/linux-fsdevel/msg33831.html

    * 'rebased-statx' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    statx: Add a system call to make enhanced file info available

    Linus Torvalds
     

03 Mar, 2017

1 commit

  • Add a system call to make extended file information available, including
    file creation and some attribute flags where available through the
    underlying filesystem.

    The getattr inode operation is altered to take two additional arguments: a
    u32 request_mask and an unsigned int flags that indicate the
    synchronisation mode. This change is propagated to the vfs_getattr*()
    function.

    Functions like vfs_stat() are now inline wrappers around new functions
    vfs_statx() and vfs_statx_fd() to reduce stack usage.

    ========
    OVERVIEW
    ========

    The idea was initially proposed as a set of xattrs that could be retrieved
    with getxattr(), but the general preference proved to be for a new syscall
    with an extended stat structure.

    A number of requests were gathered for features to be included. The
    following have been included:

    (1) Make the fields a consistent size on all arches and make them large.

    (2) Spare space, request flags and information flags are provided for
    future expansion.

    (3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
    __s64).

    (4) Creation time: The SMB protocol carries the creation time, which could
    be exported by Samba, which will in turn help CIFS make use of
    FS-Cache as that can be used for coherency data (stx_btime).

    This is also specified in NFSv4 as a recommended attribute and could
    be exported by NFSD [Steve French].

    (5) Lightweight stat: Ask for just those details of interest, and allow a
    netfs (such as NFS) to approximate anything not of interest, possibly
    without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
    Dilger] (AT_STATX_DONT_SYNC).

    (6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
    its cached attributes are up to date [Trond Myklebust]
    (AT_STATX_FORCE_SYNC).

    And the following have been left out for future extension:

    (7) Data version number: Could be used by userspace NFS servers [Aneesh
    Kumar].

    Can also be used to modify fill_post_wcc() in NFSD which retrieves
    i_version directly, but has just called vfs_getattr(). It could get
    it from the kstat struct if it used vfs_xgetattr() instead.

    (There's disagreement on the exact semantics of a single field, since
    not all filesystems do this the same way).

    (8) BSD stat compatibility: Including more fields from the BSD stat such
    as creation time (st_btime) and inode generation number (st_gen)
    [Jeremy Allison, Bernd Schubert].

    (9) Inode generation number: Useful for FUSE and userspace NFS servers
    [Bernd Schubert].

    (This was asked for but later deemed unnecessary with the
    open-by-handle capability available and caused disagreement as to
    whether it's a security hole or not).

    (10) Extra coherency data may be useful in making backups [Andreas Dilger].

    (No particular data were offered, but things like last backup
    timestamp, the data version number and the DOS archive bit would come
    into this category).

    (11) Allow the filesystem to indicate what it can/cannot provide: A
    filesystem can now say it doesn't support a standard stat feature if
    that isn't available, so if, for instance, inode numbers or UIDs don't
    exist or are fabricated locally...

    (This requires a separate system call - I have an fsinfo() call idea
    for this).

    (12) Store a 16-byte volume ID in the superblock that can be returned in
    struct xstat [Steve French].

    (Deferred to fsinfo).

    (13) Include granularity fields in the time data to indicate the
    granularity of each of the times (NFSv4 time_delta) [Steve French].

    (Deferred to fsinfo).

    (14) FS_IOC_GETFLAGS value. These could be translated to BSD's st_flags.
    Note that the Linux IOC flags are a mess and filesystems such as Ext4
    define flags that aren't in linux/fs.h, so translation in the kernel
    may be a necessity (or, possibly, we provide the filesystem type too).

    (Some attributes are made available in stx_attributes, but the general
    feeling was that the IOC flags were too ext[234]-specific and shouldn't
    be exposed through statx this way).

    (15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
    Michael Kerrisk].

    (Deferred, probably to fsinfo. Finding out if there's an ACL or
    seclabel might require extra filesystem operations).

    (16) Femtosecond-resolution timestamps [Dave Chinner].

    (A __reserved field has been left in the statx_timestamp struct for
    this - if there proves to be a need).

    (17) A set multiple attributes syscall to go with this.

    ===============
    NEW SYSTEM CALL
    ===============

    The new system call is:

    int ret = statx(int dfd,
    const char *filename,
    unsigned int flags,
    unsigned int mask,
    struct statx *buffer);

    The dfd, filename and flags parameters indicate the file to query, in a
    similar way to fstatat(). There is no equivalent of lstat() as that can be
    emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags. There is
    also no equivalent of fstat() as that can be emulated by passing a NULL
    filename to statx() with the fd of interest in dfd.

    Whether or not statx() synchronises the attributes with the backing store
    can be controlled by OR'ing a value into the flags argument (this typically
    only affects network filesystems):

    (1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
    respect.

    (2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
    its attributes with the server - which might require data writeback to
    occur to get the timestamps correct.

    (3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
    network filesystem. The resulting values should be considered
    approximate.

    mask is a bitmask indicating the fields in struct statx that are of
    interest to the caller. The user should set this to STATX_BASIC_STATS to
    get the basic set returned by stat(). It should be noted that asking for
    more information may entail extra I/O operations.

    buffer points to the destination for the data. This must be 256 bytes in
    size.
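
    A call in the lstat()-like form might look as follows from
    userspace. This sketch assumes glibc 2.28 or later, which provides a
    statx() wrapper when _GNU_SOURCE is defined; query_basic() is a
    hypothetical helper name.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>      /* AT_FDCWD, AT_* flags */
#include <string.h>
#include <sys/stat.h>   /* statx(), struct statx, STATX_* */

/*
 * Query the basic attributes of 'path' without following a final
 * symlink. Returns 0 on success or a negative errno value.
 */
static int query_basic(const char *path, struct statx *stx)
{
    memset(stx, 0, sizeof(*stx));
    if (statx(AT_FDCWD, path,
              AT_SYMLINK_NOFOLLOW | AT_STATX_SYNC_AS_STAT,
              STATX_BASIC_STATS, stx) != 0)
        return -errno;
    return 0;
}
```

    After a successful call, the caller should consult stx->stx_mask
    before trusting any individual field.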

    ======================
    MAIN ATTRIBUTES RECORD
    ======================

    The following structures are defined in which to return the main attribute
    set:

    struct statx_timestamp {
    __s64 tv_sec;
    __s32 tv_nsec;
    __s32 __reserved;
    };

    struct statx {
    __u32 stx_mask;
    __u32 stx_blksize;
    __u64 stx_attributes;
    __u32 stx_nlink;
    __u32 stx_uid;
    __u32 stx_gid;
    __u16 stx_mode;
    __u16 __spare0[1];
    __u64 stx_ino;
    __u64 stx_size;
    __u64 stx_blocks;
    __u64 __spare1[1];
    struct statx_timestamp stx_atime;
    struct statx_timestamp stx_btime;
    struct statx_timestamp stx_ctime;
    struct statx_timestamp stx_mtime;
    __u32 stx_rdev_major;
    __u32 stx_rdev_minor;
    __u32 stx_dev_major;
    __u32 stx_dev_minor;
    __u64 __spare2[14];
    };

    The defined bits in request_mask and stx_mask are:

    STATX_TYPE Want/got stx_mode & S_IFMT
    STATX_MODE Want/got stx_mode & ~S_IFMT
    STATX_NLINK Want/got stx_nlink
    STATX_UID Want/got stx_uid
    STATX_GID Want/got stx_gid
    STATX_ATIME Want/got stx_atime{,_ns}
    STATX_MTIME Want/got stx_mtime{,_ns}
    STATX_CTIME Want/got stx_ctime{,_ns}
    STATX_INO Want/got stx_ino
    STATX_SIZE Want/got stx_size
    STATX_BLOCKS Want/got stx_blocks
    STATX_BASIC_STATS [The stuff in the normal stat struct]
    STATX_BTIME Want/got stx_btime{,_ns}
    STATX_ALL [All currently available stuff]

    stx_btime is the file creation time, stx_mask is a bitmask indicating the
    data provided and __spares*[] are where as-yet undefined fields can be
    placed.
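
    Checking stx_mask before using a field can be sketched like this.
    The helpers got_basic_stats() and got_btime() are hypothetical; the
    STATX_* constants are assumed to come from the kernel's exported
    <linux/stat.h> header.

```c
#include <linux/stat.h>   /* STATX_* mask bits */
#include <stdbool.h>

/* True if every field of the classic stat set was filled in. */
static bool got_basic_stats(unsigned int stx_mask)
{
    return (stx_mask & STATX_BASIC_STATS) == STATX_BASIC_STATS;
}

/* True if the creation time (stx_btime) was filled in. */
static bool got_btime(unsigned int stx_mask)
{
    return (stx_mask & STATX_BTIME) != 0;
}
```

    For a returned mask of 0x7ff, got_basic_stats() is true and
    got_btime() is false, so stx_btime should not be used.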

    Time fields are structures with separate seconds and nanoseconds fields
    plus a reserved field in case we want to add even finer resolution. Note
    that times will be negative if before 1970; in such a case, the nanosecond
    fields will also be negative if not zero.

    The bits defined in the stx_attributes field convey information about a
    file, how it is accessed, where it is and what it does. The following
    attributes map to FS_*_FL flags and are the same numerical value:

    STATX_ATTR_COMPRESSED File is compressed by the fs
    STATX_ATTR_IMMUTABLE File is marked immutable
    STATX_ATTR_APPEND File is append-only
    STATX_ATTR_NODUMP File is not to be dumped
    STATX_ATTR_ENCRYPTED File requires key to decrypt in fs

    Within the kernel, the supported flags are listed by:

    KSTAT_ATTR_FS_IOC_FLAGS

    [Are any other IOC flags of sufficient general interest to be exposed
    through this interface?]

    New flags include:

    STATX_ATTR_AUTOMOUNT Object is an automount trigger

    These are for the use of GUI tools that might want to mark files specially,
    depending on what they are.
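
    A tool could turn stx_attributes into readable names roughly as
    follows. This is a sketch: describe_attributes() is a hypothetical
    helper, the label strings are made up, and the STATX_ATTR_*
    constants are assumed to come from <linux/stat.h>.

```c
#include <linux/stat.h>   /* STATX_ATTR_* */
#include <stddef.h>
#include <string.h>

/* Write a comma-separated list of the set attribute names into 'buf'. */
static void describe_attributes(unsigned long long attrs,
                                char *buf, size_t len)
{
    static const struct {
        unsigned long long bit;
        const char *name;
    } names[] = {
        { STATX_ATTR_COMPRESSED, "compressed" },
        { STATX_ATTR_IMMUTABLE,  "immutable"  },
        { STATX_ATTR_APPEND,     "append"     },
        { STATX_ATTR_NODUMP,     "nodump"     },
        { STATX_ATTR_ENCRYPTED,  "encrypted"  },
        { STATX_ATTR_AUTOMOUNT,  "automount"  },
    };
    size_t i;

    if (len == 0)
        return;
    buf[0] = '\0';
    for (i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
        if (!(attrs & names[i].bit))
            continue;
        /* strncat() always terminates; leave room for the NUL. */
        if (buf[0] != '\0')
            strncat(buf, ",", len - strlen(buf) - 1);
        strncat(buf, names[i].name, len - strlen(buf) - 1);
    }
}
```

    An attribute word of 0x1000 has only the automount bit set, so the
    helper reports just that one name.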

    Fields in struct statx come in a number of classes:

    (0) stx_dev_*, stx_blksize.

    These are local system information and are always available.

    (1) stx_mode, stx_nlink, stx_uid, stx_gid, stx_[amc]time, stx_ino,
    stx_size, stx_blocks.

    These will be returned whether the caller asks for them or not. The
    corresponding bits in stx_mask will be set to indicate whether they
    actually have valid values.

    If the caller didn't ask for them, then they may be approximated. For
    example, NFS won't waste any time updating them from the server,
    unless as a byproduct of updating something requested.

    If the values don't actually exist for the underlying object (such as
    UID or GID on a DOS file), then the bit won't be set in the stx_mask,
    even if the caller asked for the value. In such a case, the returned
    value will be a fabrication.

    Note that there are instances where the type might not be valid, for
    instance Windows reparse points.

    (2) stx_rdev_*.

    This will be set only if stx_mode indicates we're looking at a
    blockdev or a chardev, otherwise will be 0.

    (3) stx_btime.

    Similar to (1), except this will be set to 0 if it doesn't exist.

    =======
    TESTING
    =======

    The following test program can be used to test the statx system call:

    samples/statx/test-statx.c

    Just compile and run, passing it paths to the files you want to examine.
    The file is built automatically if CONFIG_SAMPLES is enabled.

    Here's some example output. Firstly, an NFS directory that crosses to
    another FSID. Note that the AUTOMOUNT attribute is set because transiting
    this directory will cause d_automount to be invoked by the VFS.

    [root@andromeda ~]# /tmp/test-statx -A /warthog/data
    statx(/warthog/data) = 0
    results=7ff
    Size: 4096 Blocks: 8 IO Block: 1048576 directory
    Device: 00:26 Inode: 1703937 Links: 125
    Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
    Access: 2016-11-24 09:02:12.219699527+0000
    Modify: 2016-11-17 10:44:36.225653653+0000
    Change: 2016-11-17 10:44:36.225653653+0000
    Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)

    Secondly, the result of automounting on that directory.

    [root@andromeda ~]# /tmp/test-statx /warthog/data
    statx(/warthog/data) = 0
    results=7ff
    Size: 4096 Blocks: 8 IO Block: 1048576 directory
    Device: 00:27 Inode: 2 Links: 125
    Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
    Access: 2016-11-24 09:02:12.219699527+0000
    Modify: 2016-11-17 10:44:36.225653653+0000
    Change: 2016-11-17 10:44:36.225653653+0000

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     

02 Mar, 2017

1 commit


22 Feb, 2017

1 commit

  • Konstantin Khlebnikov writes:
    > This patch has locking problem. I've got lockdep splat under LTP.
    >
    > [ 6633.115456] ======================================================
    > [ 6633.115502] [ INFO: possible circular locking dependency detected ]
    > [ 6633.115553] 4.9.10-debug+ #9 Tainted: G L
    > [ 6633.115584] -------------------------------------------------------
    > [ 6633.115627] ksm02/284980 is trying to acquire lock:
    > [ 6633.115659] (&sb->s_type->i_lock_key#4){+.+...}, at: [] igrab+0x1e/0x80
    > [ 6633.115834] but task is already holding lock:
    > [ 6633.115882] (sysctl_lock){+.+...}, at: [] unregister_sysctl_table+0x6b/0x110
    > [ 6633.116026] which lock already depends on the new lock.
    > [ 6633.116026]
    > [ 6633.116080]
    > [ 6633.116080] the existing dependency chain (in reverse order) is:
    > [ 6633.116117]
    > -> #2 (sysctl_lock){+.+...}:
    > -> #1 (&(&dentry->d_lockref.lock)->rlock){+.+...}:
    > -> #0 (&sb->s_type->i_lock_key#4){+.+...}:
    >
    > d_lock nests inside i_lock
    > sysctl_lock nests inside d_lock in d_compare
    >
    > This patch adds i_lock nesting inside sysctl_lock.

    Al Viro replied:
    > Once ->unregistering is set, you can drop sysctl_lock just fine. So I'd
    > try something like this - use rcu_read_lock() in proc_sys_prune_dcache(),
    > drop sysctl_lock() before it and regain after. Make sure that no inodes
    > are added to the list once ->unregistering has been set and use RCU list
    > primitives for modifying the inode list, with sysctl_lock still used to
    > serialize its modifications.
    >
    > Freeing struct inode is RCU-delayed (see proc_destroy_inode()), so doing
    > igrab() is safe there. Since we don't drop inode reference until after we'd
    > passed beyond it in the list, list_for_each_entry_rcu() should be fine.

    I agree with Al Viro's analysis of the situation.

    Fixes: d6cffbbe9a7e ("proc/sysctl: prune stale dentries during unregistering")
    Reported-by: Konstantin Khlebnikov
    Tested-by: Konstantin Khlebnikov
    Suggested-by: Al Viro
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

13 Feb, 2017

1 commit

  • Currently, unregistering a sysctl table does not prune its dentries.
    Stale dentries could slow down sysctl operations significantly.

    For example, the command:

    # for i in {1..100000} ; do unshare -n -- sysctl -a &> /dev/null ; done

    creates millions of stale dentries around the sysctls of the loopback
    interface:

    # sysctl fs.dentry-state
    fs.dentry-state = 25812579 24724135 45 0 0 0

    All of them have matching names, so lookup has to scan through the whole
    hash chain and call d_compare (proc_sys_compare), which checks them
    under a system-wide spinlock (sysctl_lock).

    # time sysctl -a > /dev/null
    real 1m12.806s
    user 0m0.016s
    sys 1m12.400s

    Currently only the memory reclaimer can remove this garbage.
    But without significant memory pressure this never happens.

    This patch collects sysctl inodes into a list on the sysctl table header
    and prunes all their dentries once that table unregisters.

    Konstantin Khlebnikov writes:
    > On 10.02.2017 10:47, Al Viro wrote:
    >> how about the matching stats *after* that patch?
    >
    > dcache size doesn't grow endlessly, so stats are fine
    >
    > # sysctl fs.dentry-state
    > fs.dentry-state = 92712 58376 45 0 0 0
    >
    > # time sysctl -a &>/dev/null
    >
    > real 0m0.013s
    > user 0m0.004s
    > sys 0m0.008s

    Signed-off-by: Konstantin Khlebnikov
    Suggested-by: Al Viro
    Signed-off-by: Eric W. Biederman

    Konstantin Khlebnikov
     

10 Jan, 2017

1 commit

  • Fixes CVE-2016-9191: proc_sys_readdir doesn't drop the reference
    added by grab_header when returning from the !dir_emit_dots path.
    This can cause any path that calls unregister_sysctl_table to
    wait forever.

    The calltrace of CVE-2016-9191:

    [ 5535.960522] Call Trace:
    [ 5535.963265] [] schedule+0x3f/0xa0
    [ 5535.968817] [] schedule_timeout+0x3db/0x6f0
    [ 5535.975346] [] ? wait_for_completion+0x45/0x130
    [ 5535.982256] [] wait_for_completion+0xc3/0x130
    [ 5535.988972] [] ? wake_up_q+0x80/0x80
    [ 5535.994804] [] drop_sysctl_table+0xc4/0xe0
    [ 5536.001227] [] drop_sysctl_table+0x77/0xe0
    [ 5536.007648] [] unregister_sysctl_table+0x4d/0xa0
    [ 5536.014654] [] unregister_sysctl_table+0x7f/0xa0
    [ 5536.021657] [] unregister_sched_domain_sysctl+0x15/0x40
    [ 5536.029344] [] partition_sched_domains+0x44/0x450
    [ 5536.036447] [] ? __mutex_unlock_slowpath+0x111/0x1f0
    [ 5536.043844] [] rebuild_sched_domains_locked+0x64/0xb0
    [ 5536.051336] [] update_flag+0x11d/0x210
    [ 5536.057373] [] ? mutex_lock_nested+0x2df/0x450
    [ 5536.064186] [] ? cpuset_css_offline+0x1b/0x60
    [ 5536.070899] [] ? trace_hardirqs_on+0xd/0x10
    [ 5536.077420] [] ? mutex_lock_nested+0x2df/0x450
    [ 5536.084234] [] ? css_killed_work_fn+0x25/0x220
    [ 5536.091049] [] cpuset_css_offline+0x35/0x60
    [ 5536.097571] [] css_killed_work_fn+0x5c/0x220
    [ 5536.104207] [] process_one_work+0x1df/0x710
    [ 5536.110736] [] ? process_one_work+0x160/0x710
    [ 5536.117461] [] worker_thread+0x12b/0x4a0
    [ 5536.123697] [] ? process_one_work+0x710/0x710
    [ 5536.130426] [] kthread+0xfe/0x120
    [ 5536.135991] [] ret_from_fork+0x1f/0x40
    [ 5536.142041] [] ? kthread_create_on_node+0x230/0x230

    One cgroup maintainer mentioned that "cgroup is trying to offline
    a cpuset css, which takes place under cgroup_mutex. The offlining
    ends up trying to drain active usages of a sysctl table which apparently
    is not happening."
    The real reason is that proc_sys_readdir doesn't drop the reference added
    by grab_header when returning from the !dir_emit_dots path, so this cpuset
    offline path waits here forever.
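
    The leak pattern described above can be sketched in userspace C. This is
    only an illustration, not the kernel code: struct header, grab(),
    release(), readdir_leaky() and readdir_fixed() are hypothetical stand-ins
    for the sysctl header, grab_header and the readdir path.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a sysctl table header's usage counter. */
struct header {
    int used;            /* active references */
    bool unregistering;  /* set once teardown has begun */
};

/* Take a reference; fails once teardown has started. */
bool grab(struct header *h)
{
    if (h->unregistering)
        return false;
    h->used++;
    return true;
}

/* Drop a reference taken by grab(). */
void release(struct header *h)
{
    assert(h->used > 0);
    h->used--;
}

/* Buggy shape: the early return skips release(). */
int readdir_leaky(struct header *h, bool emit_dots_ok)
{
    if (!grab(h))
        return -1;
    if (!emit_dots_ok)
        return 0;        /* reference leaked here */
    release(h);
    return 0;
}

/* Fixed shape: every exit after grab() goes through release(). */
int readdir_fixed(struct header *h, bool emit_dots_ok)
{
    if (!grab(h))
        return -1;
    if (!emit_dots_ok)
        goto out;
    /* ... directory entries would be emitted here ... */
out:
    release(h);
    return 0;
}
```

    With the leaky variant the counter never returns to zero, so anything
    that waits for all references to drain before unregistering would block
    indefinitely.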

    See here for details: http://www.openwall.com/lists/oss-security/2016/11/04/13

    Fixes: f0c3b5093add ("[readdir] convert procfs")
    Cc: stable@vger.kernel.org
    Reported-by: CAI Qian
    Tested-by: Yang Shukui
    Signed-off-by: Zhou Chengming
    Acked-by: Al Viro
    Signed-off-by: Eric W. Biederman

    Zhou Chengming
     

11 Oct, 2016

2 commits

  • Pull more vfs updates from Al Viro:
    "->rename2() work from Miklos + current_time() from Deepa"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: Replace current_fs_time() with current_time()
    fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
    fs: Replace CURRENT_TIME with current_time() for inode timestamps
    fs: proc: Delete inode time initializations in proc_alloc_inode()
    vfs: Add current_time() api
    vfs: add note about i_op->rename changes to porting
    fs: rename "rename2" i_op to "rename"
    vfs: remove unused i_op->rename
    fs: make remaining filesystems use .rename2
    libfs: support RENAME_NOREPLACE in simple_rename()
    fs: support RENAME_NOREPLACE for local filesystems
    ncpfs: fix unused variable warning

    Linus Torvalds
     
  • Pull misc vfs updates from Al Viro:
    "Assorted misc bits and pieces.

    There are several single-topic branches left after this (rename2
    series from Miklos, current_time series from Deepa Dinamani, xattr
    series from Andreas, uaccess stuff from me) and I'd prefer to
    send those separately"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits)
    proc: switch auxv to use of __mem_open()
    hpfs: support FIEMAP
    cifs: get rid of unused arguments of CIFSSMBWrite()
    posix_acl: uapi header split
    posix_acl: xattr representation cleanups
    fs/aio.c: eliminate redundant loads in put_aio_ring_file
    fs/internal.h: add const to ns_dentry_operations declaration
    compat: remove compat_printk()
    fs/buffer.c: make __getblk_slow() static
    proc: unsigned file descriptors
    fs/file: more unsigned file descriptors
    fs: compat: remove redundant check of nr_segs
    cachefiles: Fix attempt to read i_blocks after deleting file [ver #2]
    cifs: don't use memcpy() to copy struct iov_iter
    get rid of separate multipage fault-in primitives
    fs: Avoid premature clearing of capabilities
    fs: Give dentry to inode_change_ok() instead of inode
    fuse: Propagate dentry down to inode_change_ok()
    ceph: Propagate dentry down to inode_change_ok()
    xfs: Propagate dentry down to inode_change_ok()
    ...

    Linus Torvalds
     

07 Oct, 2016

1 commit

  • Pull namespace updates from Eric Biederman:
    "This set of changes is a number of smaller things that have been
    overlooked in other development cycles focused on more fundamental
    change. The devpts changes are small things that were a distraction
    until we managed to kill off DEVPTS_MULTIPLE_INSTANCES. There is a
    trivial regression fix to autofs for the unprivileged mount changes
    that went in last cycle. A pair of ioctls has been added by Andrey
    Vagin making it possible to discover the relationships between
    namespaces when referring to them through file descriptors.

    The big user visible change is starting to add simple resource limits
    to catch programs that misbehave. With namespaces in general and user
    namespaces in particular allowing users to use more kinds of
    resources, it has become important to have something to limit errant
    programs. Because the purpose of these limits is to catch errant
    programs the code needs to be inexpensive to use as it is always on, and
    the default limits need to be high enough that well behaved programs
    on well behaved systems don't encounter them.

    To this end, after some review I have implemented per user per user
    namespace limits, and use them to limit the number of namespaces. The
    limits being per user mean that one user can not exhaust the limits of
    another user. The limits being per user namespace allow contexts where
    the limit is 0 and security conscious folks can remove from their
    threat analysis the code used to manage namespaces (as they have
    historically done as it is root only). At the same time the limits being
    per user namespace allow other parts of the system to use namespaces.

    Namespaces are increasingly being used in application sand boxing
    scenarios so an all or nothing disable for the entire system for the
    security conscious folks makes increasing use of these sandboxes
    impossible.

    There is also added a limit on the maximum number of mounts present in
    a single mount namespace. It is nontrivial to guess what a reasonable
    system wide limit on the number of mount structures in the kernel would
    be, especially as it varies based on how a system is using
    containers. A limit on the number of mounts in a mount namespace
    however is much easier to understand and set. In most cases in
    practice only about 1000 mounts are used. Given that some autofs
    scenarios have the potential to be 30,000 to 50,000 mounts I have set
    the default limit for the number of mounts at 100,000 which is well
    above every known set of users but low enough that the mount hash
    tables don't degrade unreasonably.

    These limits are a start. I expect this establishes a pattern that
    other limits for resources that namespaces use will follow. There has
    been interest in making inotify event limits per user per user
    namespace as well as interest expressed in making details about what
    is going on in the kernel more visible"
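
    One way to picture the per user per user namespace limits described
    above is a chain of counters, one per enclosing user namespace, where a
    charge must fit at every level. The following userspace C sketch is an
    assumption-laden illustration, not the kernel implementation: struct
    ucount, charge_ns() and uncharge_ns() are hypothetical names, and the
    -ENOSPC result follows the shortlog entry below about returning ENOSPC
    when the limit is reached.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-user, per-user-namespace counter. */
struct ucount {
    struct ucount *parent; /* counter in the enclosing user namespace, or NULL */
    long count;            /* namespaces currently charged at this level */
    long max;              /* limit for this level */
};

/* Charge one namespace at every level; undo and fail if any level is full. */
int charge_ns(struct ucount *uc)
{
    struct ucount *pos, *undo;

    for (pos = uc; pos != NULL; pos = pos->parent) {
        if (pos->count >= pos->max)
            goto fail;
        pos->count++;
    }
    return 0;

fail:
    /* Roll back the levels that were charged before the full one. */
    for (undo = uc; undo != pos; undo = undo->parent)
        undo->count--;
    return -ENOSPC;
}

/* Release a charge previously taken by charge_ns(). */
void uncharge_ns(struct ucount *uc)
{
    for (; uc != NULL; uc = uc->parent) {
        assert(uc->count > 0);
        uc->count--;
    }
}
```

    Setting max to 0 at one level models the "limit is 0" contexts mentioned
    above: every charge through that level fails, while sibling levels with
    their own limits are unaffected.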

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (28 commits)
    autofs: Fix automounts by using current_real_cred()->uid
    mnt: Add a per mount namespace limit on the number of mounts
    netns: move {inc,dec}_net_namespaces into #ifdef
    nsfs: Simplify __ns_get_path
    tools/testing: add a test to check nsfs ioctl-s
    nsfs: add ioctl to get a parent namespace
    nsfs: add ioctl to get an owning user namespace for ns file descriptor
    kernel: add a helper to get an owning user namespace for a namespace
    devpts: Change the owner of /dev/pts/ptmx to the mounter of /dev/pts
    devpts: Remove sync_filesystems
    devpts: Make devpts_kill_sb safe if fsi is NULL
    devpts: Simplify devpts_mount by using mount_nodev
    devpts: Move the creation of /dev/pts/ptmx into fill_super
    devpts: Move parse_mount_options into fill_super
    userns: When the per user per user namespace limit is reached return ENOSPC
    userns; Document per user per user namespace limits.
    mntns: Add a limit on the number of mount namespaces.
    netns: Add a limit on the number of net namespaces
    cgroupns: Add a limit on the number of cgroup namespaces
    ipcns: Add a limit on the number of ipc namespaces
    ...

    Linus Torvalds
     

28 Sep, 2016

1 commit

  • CURRENT_TIME macro is not appropriate for filesystems as it
    doesn't use the right granularity for filesystem timestamps.
    Use current_time() instead.

    CURRENT_TIME is also not y2038 safe.

    This is also in preparation for the patch that transitions
    vfs timestamps to use 64 bit time and hence make them
    y2038 safe. As part of the effort current_time() will be
    extended to do range checks. Hence, it is necessary for all
    file system timestamps to use current_time(). Also,
    current_time() will be transitioned along with vfs to be
    y2038 safe.

    Note that whenever a single call to current_time() is used
    to change timestamps in different inodes, it is because they
    share the same time granularity.
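
    To illustrate what "the right granularity" can mean, here is a minimal
    userspace C sketch that rounds a timestamp down to a given granularity
    in nanoseconds. It is an assumption about the general idea, not the
    kernel's current_time(): trunc_to_gran() and now_at_gran() are
    hypothetical helpers.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Round a timestamp down to a multiple of gran_ns nanoseconds. */
struct timespec trunc_to_gran(struct timespec t, long gran_ns)
{
    assert(gran_ns >= 1 && gran_ns <= NSEC_PER_SEC);
    t.tv_nsec -= t.tv_nsec % gran_ns;
    return t;
}

/* Hypothetical helper: "now" at a filesystem's timestamp granularity. */
struct timespec now_at_gran(long gran_ns)
{
    struct timespec now;

    clock_gettime(CLOCK_REALTIME, &now);
    return trunc_to_gran(now, gran_ns);
}
```

    A granularity of 1 keeps full nanosecond precision, while NSEC_PER_SEC
    keeps whole seconds only. Two inodes stamped from one such call get
    identical values only if they share the same granularity.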

    Signed-off-by: Deepa Dinamani
    Reviewed-by: Arnd Bergmann
    Acked-by: Felipe Balbi
    Acked-by: Steven Whitehouse
    Acked-by: Ryusuke Konishi
    Acked-by: David Sterba
    Signed-off-by: Al Viro

    Deepa Dinamani
     

22 Sep, 2016

1 commit

  • inode_change_ok() will be responsible for clearing capabilities and IMA
    extended attributes and as such will need a dentry. Give it as an argument
    to inode_change_ok() instead of an inode. Also rename inode_change_ok()
    to setattr_prepare() to better reflect that it also makes some
    modifications in addition to checks.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Jan Kara
     

18 Aug, 2016

1 commit


15 Aug, 2016

1 commit

  • If net namespace is attached to a user namespace let's make container's
    root owner of sysctls affecting said network namespace instead of global
    root.

    This also allows us to clean up net_ctl_permissions() because we do not
    need to fudge permissions anymore for the container's owner since it now
    owns the objects in question.

    Acked-by: "Eric W. Biederman"
    Signed-off-by: Dmitry Torokhov
    Signed-off-by: David S. Miller

    Dmitry Torokhov
     

08 Aug, 2016

1 commit


07 Aug, 2016

1 commit

  • Pull more vfs updates from Al Viro:
    "Assorted cleanups and fixes.

    In the "trivial API change" department - ->d_compare() losing 'parent'
    argument"

    * 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    cachefiles: Fix race between inactivating and culling a cache object
    9p: use clone_fid()
    9p: fix braino introduced in "9p: new helper - v9fs_parent_fid()"
    vfs: make dentry_needs_remove_privs() internal
    vfs: remove file_needs_remove_privs()
    vfs: fix deadlock in file_remove_privs() on overlayfs
    get rid of 'parent' argument of ->d_compare()
    cifs, msdos, vfat, hfs+: don't bother with parent in ->d_compare()
    affs ->d_compare(): don't bother with ->d_inode
    fold _d_rehash() and __d_rehash() together
    fold dentry_rcuwalk_invalidate() into its only remaining caller

    Linus Torvalds
     

06 Aug, 2016

1 commit

  • Pull qstr constification updates from Al Viro:
    "Fairly self-contained bunch - surprising lot of places passes struct
    qstr * as an argument when const struct qstr * would suffice; it
    complicates analysis for no good reason.

    I'd prefer to feed that separately from the assorted fixes (those are
    in #for-linus and with somewhat trickier topology)"

    * 'work.const-qstr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    qstr: constify instances in adfs
    qstr: constify instances in lustre
    qstr: constify instances in f2fs
    qstr: constify instances in ext2
    qstr: constify instances in vfat
    qstr: constify instances in procfs
    qstr: constify instances in fuse
    qstr constify instances in fs/dcache.c
    qstr: constify instances in nfs
    qstr: constify instances in ocfs2
    qstr: constify instances in autofs4
    qstr: constify instances in hfs
    qstr: constify instances in hfsplus
    qstr: constify instances in logfs
    qstr: constify dentry_init_security

    Linus Torvalds
     

01 Aug, 2016

1 commit


31 Jul, 2016

1 commit


11 Jun, 2016

1 commit

  • We always mixed in the parent pointer into the dentry name hash, but we
    did it late at lookup time. It turns out that we can simplify that
    lookup-time action by salting the hash with the parent pointer early
    instead of late.

    A few other users of our string hashes also wanted to mix in their own
    pointers into the hash, and those are updated to use the same mechanism.

    Hash users that don't have any particular initial salt can just use the
    NULL pointer as a no-salt.
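
    The idea of salting early can be sketched in userspace C as follows.
    This uses an FNV-1a style hash purely for illustration; it is not the
    kernel's name hash, and salted_name_hash() is a hypothetical function.
    The salt pointer is folded into the initial state, so no extra mixing
    step is needed once the string has been hashed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* FNV-1a style string hash seeded with a pointer salt (illustrative only). */
uint64_t salted_name_hash(const void *salt, const char *name, size_t len)
{
    /* Fold the salt into the initial state; a NULL salt leaves it unchanged. */
    uint64_t h = 14695981039346656037ULL ^ (uint64_t)(uintptr_t)salt;
    size_t i;

    for (i = 0; i < len; i++) {
        h ^= (unsigned char)name[i];
        h *= 1099511628211ULL;
    }
    return h;
}
```

    In this sketch the same name under two different parents hashes to
    different values, and callers without a natural salt pass NULL.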

    Cc: Vegard Nossum
    Cc: George Spelvin
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

03 May, 2016

2 commits


29 Sep, 2015

1 commit

  • IS_ERR(_OR_NULL) already contains an 'unlikely' compiler hint, so there
    is no need to repeat it in its callers. Drop it.
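
    A rough userspace C sketch of why the extra annotation is redundant,
    assuming a GCC or Clang compiler for __builtin_expect. The helper names
    ptr_is_err(), ptr_is_err_or_null() and lookup_failed() are hypothetical
    and only mirror the shape of the kernel helpers.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ERRNO 4095
#define unlikely(x) __builtin_expect(!!(x), 0)

/* The branch hint lives inside the helper, so callers need not repeat it. */
int ptr_is_err(const void *ptr)
{
    return unlikely((unsigned long)ptr >= (unsigned long)-MAX_ERRNO);
}

int ptr_is_err_or_null(const void *ptr)
{
    return unlikely(ptr == NULL) || ptr_is_err(ptr);
}

int lookup_failed(const void *result)
{
    /* Redundant form: if (unlikely(ptr_is_err(result))) */
    if (ptr_is_err(result))
        return 1;
    return 0;
}
```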

    Signed-off-by: Viresh Kumar
    Reviewed-by: Jeff Layton
    Reviewed-by: David Howells
    Reviewed-by: Steve French
    Signed-off-by: Jiri Kosina

    Viresh Kumar
     

01 Jul, 2015

1 commit


16 Apr, 2015

1 commit


09 Aug, 2014

1 commit


29 Jun, 2013

2 commits

  • Instances either don't look at it at all (the majority of cases) or
    only want it to find the superblock (which can be had as dentry->d_sb).
    A few cases that want more are actually safe with dentry->d_inode -
    the only precaution needed is the check that it hadn't been replaced with
    NULL by rmdir() or by overwriting rename(), in which case it should
    simply be treated as a cache miss.

    Signed-off-by: Linus Torvalds
    Signed-off-by: Al Viro

    Linus Torvalds
     
  • Signed-off-by: Al Viro

    Al Viro
     

28 Feb, 2013

1 commit

  • - use pr_foo() throughout

    - remove a couple of duplicated KERN_WARNINGs, via WARN(KERN_WARNING "...")

    - nuke a few warnings which I've never seen happen, ever.

    Cc: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

23 Feb, 2013

1 commit


21 Dec, 2012

1 commit