09 Sep, 2017

1 commit

  • ... such that we can avoid the tree walks to get the node with the
    smallest key. Semantically the same, as the previously used rb_first(),
    but O(1). The main overhead is the extra footprint for the cached rb_node
    pointer, which should not matter for procfs.
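
    The caching idea can be sketched in userspace C. This is a minimal
    illustration, not the kernel's rbtree code: it uses a plain unbalanced
    binary search tree, and the names struct cached_root, walk_first(),
    cached_first() and cached_insert() are hypothetical. The insert path
    notes whether it ever descended to the right; if it did not, the new node
    holds the smallest key and becomes the cached leftmost pointer.

```c
#include <stddef.h>

/* Hypothetical userspace sketch; an unbalanced BST stands in for the rbtree. */
struct node {
    int key;
    struct node *left, *right;
};

struct cached_root {
    struct node *root;
    struct node *leftmost;   /* cached pointer to the node with the smallest key */
};

/* Tree walk to the smallest key: cost grows with the height of the tree. */
static struct node *walk_first(const struct cached_root *t)
{
    struct node *n = t->root;

    if (!n)
        return NULL;
    while (n->left)
        n = n->left;
    return n;
}

/* Same answer as walk_first(), but a single pointer read. */
static struct node *cached_first(const struct cached_root *t)
{
    return t->leftmost;
}

/* Insert an item and keep the cached leftmost pointer up to date. */
static void cached_insert(struct cached_root *t, struct node *item)
{
    struct node **link = &t->root;
    int is_leftmost = 1;

    item->left = item->right = NULL;
    while (*link) {
        if (item->key < (*link)->key) {
            link = &(*link)->left;
        } else {
            link = &(*link)->right;
            is_leftmost = 0;   /* went right at least once */
        }
    }
    *link = item;
    if (is_leftmost)
        t->leftmost = item;
}
```

    The cost is one extra pointer per tree root, which matches the footprint
    trade-off described above. A removal path would also have to refresh the
    cached pointer when the leftmost node is deleted; that part is omitted
    here.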

    Link: http://lkml.kernel.org/r/20170719014603.19029-14-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

07 Sep, 2017

1 commit

  • /proc/pid/smaps_rollup is a new proc file that improves the performance
    of user programs that determine aggregate memory statistics (e.g., total
    PSS) of a process.

    Android regularly "samples" the memory usage of various processes in
    order to balance its memory pool sizes. This sampling process involves
    opening /proc/pid/smaps and summing certain fields. For very large
    processes, sampling memory use this way can take several hundred
    milliseconds, due mostly to the overhead of the seq_printf calls in
    task_mmu.c.

    smaps_rollup improves the situation. It contains most of the fields of
    /proc/pid/smaps, but rather than a set of fields for each VMA,
    smaps_rollup contains one synthetic smaps-format entry
    representing the whole process. In the single smaps_rollup synthetic
    entry, each field is the summation of the corresponding field in all of
    the real-smaps VMAs. Using a common format for smaps_rollup and smaps
    lets userspace reuse parsers written for non-rollup smaps when reading
    smaps_rollup, and it allows userspace to switch
    between smaps_rollup and smaps at runtime (say, based on the
    availability of smaps_rollup in a given kernel) with minimal fuss.

    By using smaps_rollup instead of smaps, a caller can avoid the
    significant overhead of formatting, reading, and parsing each of a large
    process's potentially very numerous memory mappings. For sampling
    system_server's PSS in Android, we measured a 12x speedup, representing
    a savings of several hundred milliseconds.

    One alternative to a new per-process proc file would have been including
    PSS information in /proc/pid/status. We considered this option but
    thought that PSS would be too expensive (by a few orders of magnitude)
    to collect relative to what's already emitted as part of
    /proc/pid/status, and slowing every user of /proc/pid/status for the
    sake of readers that happen to want PSS feels wrong.

    The code itself works by reusing the existing VMA-walking framework we
    use for regular smaps generation and keeping the mem_size_stats
    structure around between VMA walks instead of using a fresh one for each
    VMA. In this way, summation happens automatically. We let seq_file
    walk over the VMAs just as it does for regular smaps and just emit
    nothing to the seq_file until we hit the last VMA.
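
    From the userspace side, the shared format means one parser can serve
    both files. The following is a rough sketch under that assumption; the
    helper name sum_smaps_field_kb() is hypothetical, and it works on text
    already read into memory rather than opening any proc file itself. Fed
    the contents of smaps it adds up one value per VMA; fed the contents of
    smaps_rollup it sees a single entry and returns that value.

```c
#include <stdio.h>
#include <string.h>

/*
 * Sum every "<field>: <n> kB" line in smaps-format text and return the
 * total in kB. Only lines that start with the exact field name followed
 * by ':' are counted, so "Pss" does not match "SwapPss:".
 */
static unsigned long long sum_smaps_field_kb(const char *text, const char *field)
{
    unsigned long long total = 0;
    size_t flen = strlen(field);
    const char *line = text;

    while (line && *line) {
        if (strncmp(line, field, flen) == 0 && line[flen] == ':') {
            unsigned long long kb;

            /* %llu skips the padding between the colon and the number. */
            if (sscanf(line + flen + 1, "%llu", &kb) == 1)
                total += kb;
        }
        /* Advance to the start of the next line, if there is one. */
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return total;
}
```

    A caller could try /proc/pid/smaps_rollup first and fall back to
    /proc/pid/smaps when the open fails, passing whichever buffer it obtained
    to the same function.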

    Benchmarks:

    using smaps:
    iterations:1000 pid:1163 pss:220023808
    0m29.46s real 0m08.28s user 0m20.98s system

    using smaps_rollup:
    iterations:1000 pid:1163 pss:220702720
    0m04.39s real 0m00.03s user 0m04.31s system

    We're using the PSS samples we collect asynchronously for
    system-management tasks like fine-tuning oom_adj_score, memory use
    tracking for debugging, application-level memory-use attribution, and
    deciding whether we want to kill large processes during system idle
    maintenance windows. Android has been using PSS for these purposes for
    a long time; as the average process VMA count has increased and
    devices have become more efficiency-conscious, PSS-collection inefficiency
    has started to matter more. IMHO, it'd be a lot safer to optimize the
    existing PSS-collection model, which has been fine-tuned over the years,
    instead of changing the memory tracking approach entirely to work around
    smaps-generation inefficiency.

    Tim said:

    : There are two main reasons why Android gathers PSS information:
    :
    : 1. Android devices can show the user the amount of memory used per
    : application via the settings app. This is a less important use case.
    :
    : 2. We log PSS to help identify leaks in applications. We have found
    : an enormous number of bugs (in the Android platform, in Google's own
    : apps, and in third-party applications) using this data.
    :
    : To do this, system_server (the main process in Android userspace) will
    : sample the PSS of a process three seconds after it changes state (for
    : example, app is launched and becomes the foreground application) and about
    : every ten minutes after that. The net result is that PSS collection is
    : regularly running on at least one process in the system (usually a few
    : times a minute while the screen is on, less when screen is off due to
    : suspend). PSS of a process is an incredibly useful stat to track, and we
    : aren't going to get rid of it. We've looked at some very hacky approaches
    : using RSS ("take the RSS of the target process, subtract the RSS of the
    : zygote process that is the parent of all Android apps") to reduce the
    : accounting time, but it regularly overestimated the memory used by 20+
    : percent. Accordingly, I don't think that there's a good alternative to
    : using PSS.
    :
    : We started looking into PSS collection performance after we noticed random
    : frequency spikes while a phone's screen was off; occasionally, one of the
    : CPU clusters would ramp to a high frequency because there was 200-300ms of
    : constant CPU work from a single thread in the main Android userspace
    : process. The work causing the spike (which is reasonable governor
    : behavior given the amount of CPU time needed) was always PSS collection.
    : As a result, Android is burning more power than we should be on PSS
    : collection.
    :
    : The other issue (and why I'm less sure about improving smaps as a
    : long-term solution) is that the number of VMAs per process has increased
    : significantly from release to release. After trying to figure out why we
    : were seeing these 200-300ms PSS collection times on Android O but had not
    : noticed it in previous versions, we found that the number of VMAs in the
    : main system process increased by 50% from Android N to Android O (from
    : ~1800 to ~2700) and varying increases in every userspace process. Android
    : M to N also had an increase in the number of VMAs, although not as much.
    : I'm not sure why this is increasing so much over time, but thinking about
    : ASLR and ways to make ASLR better, I expect that this will continue to
    : increase going forward. I would not be surprised if we hit 5000 VMAs on
    : the main Android process (system_server) by 2020.
    :
    : If we assume that the number of VMAs is going to increase over time, then
    : doing anything we can do to reduce the overhead of each VMA during PSS
    : collection seems like the right way to go, and that means outputting an
    : aggregate statistic (to avoid whatever overhead there is per line in
    : writing smaps and in reading each line from userspace).

    Link: http://lkml.kernel.org/r/20170812022148.178293-1-dancol@google.com
    Signed-off-by: Daniel Colascione
    Cc: Tim Murray
    Cc: Joel Fernandes
    Cc: Al Viro
    Cc: Randy Dunlap
    Cc: Minchan Kim
    Cc: Michal Hocko
    Cc: Sonny Rao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Colascione
     

19 Jul, 2017

1 commit

  • Pull structure randomization updates from Kees Cook:
    "Now that IPC and other changes have landed, enable manual markings for
    randstruct plugin, including the task_struct.

    This is the rest of what was staged in -next for the gcc-plugins, and
    comes in three patches, largest first:

    - mark "easy" structs with __randomize_layout

    - mark task_struct with an optional anonymous struct to isolate the
    __randomize_layout section

    - mark structs to opt _out_ of automated marking (which will come
    later)

    And, FWIW, this continues to pass allmodconfig (normal and patched to
    enable gcc-plugins) builds of x86_64, i386, arm64, arm, powerpc, and
    s390 for me"

    * tag 'gcc-plugins-v4.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    randstruct: opt-out externally exposed function pointer structs
    task_struct: Allow randomized layout
    randstruct: Mark various structs for randomization

    Linus Torvalds
     

12 Jul, 2017

1 commit

  • Andrei Vagin writes:
    FYI: This bug has been reproduced on 4.11.7
    > BUG: Dentry ffff895a3dd01240{i=4e7c09a,n=lo} still in use (1) [unmount of proc proc]
    > ------------[ cut here ]------------
    > WARNING: CPU: 1 PID: 13588 at fs/dcache.c:1445 umount_check+0x6e/0x80
    > CPU: 1 PID: 13588 Comm: kworker/1:1 Not tainted 4.11.7-200.fc25.x86_64 #1
    > Hardware name: CompuLab sbc-flt1/fitlet, BIOS SBCFLT_0.08.04 06/27/2015
    > Workqueue: events proc_cleanup_work
    > Call Trace:
    > dump_stack+0x63/0x86
    > __warn+0xcb/0xf0
    > warn_slowpath_null+0x1d/0x20
    > umount_check+0x6e/0x80
    > d_walk+0xc6/0x270
    > ? dentry_free+0x80/0x80
    > do_one_tree+0x26/0x40
    > shrink_dcache_for_umount+0x2d/0x90
    > generic_shutdown_super+0x1f/0xf0
    > kill_anon_super+0x12/0x20
    > proc_kill_sb+0x40/0x50
    > deactivate_locked_super+0x43/0x70
    > deactivate_super+0x5a/0x60
    > cleanup_mnt+0x3f/0x90
    > mntput_no_expire+0x13b/0x190
    > kern_unmount+0x3e/0x50
    > pid_ns_release_proc+0x15/0x20
    > proc_cleanup_work+0x15/0x20
    > process_one_work+0x197/0x450
    > worker_thread+0x4e/0x4a0
    > kthread+0x109/0x140
    > ? process_one_work+0x450/0x450
    > ? kthread_park+0x90/0x90
    > ret_from_fork+0x2c/0x40
    > ---[ end trace e1c109611e5d0b41 ]---
    > VFS: Busy inodes after unmount of proc. Self-destruct in 5 seconds. Have a nice day...
    > BUG: unable to handle kernel NULL pointer dereference at (null)
    > IP: _raw_spin_lock+0xc/0x30
    > PGD 0

    Fix this by taking a reference to the super block in proc_sys_prune_dcache.

    The superblock reference is the core of the fix; in addition, the
    sysctl_inodes list is converted to an hlist so that hlist_del_init_rcu may
    be used. This allows proc_sys_prune_dcache to remove inodes from the
    sysctl_inodes list without causing problems for proc_sys_evict_inode if it
    later chooses to remove the inode from the sysctl_inodes list. Removing
    inodes from the sysctl_inodes list allows proc_sys_prune_dcache to have a
    progress guarantee, while still being able to drop all locks. The fact that
    head->unregistering is set in start_unregistering ensures that no more
    inodes will be added to the sysctl_inodes list.

    Previously the code did a dance where it delayed calling iput until the
    next entry in the list was being considered to ensure the inode remained on
    the sysctl_inodes list until the next entry was walked to. The structure
    of the loop in this patch does not need that, so it is much easier to
    understand and maintain.

    Cc: stable@vger.kernel.org
    Reported-by: Andrei Vagin
    Tested-by: Andrei Vagin
    Fixes: ace0c791e6c3 ("proc/sysctl: Don't grab i_lock under sysctl_lock.")
    Fixes: d6cffbbe9a7e ("proc/sysctl: prune stale dentries during unregistering")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

01 Jul, 2017

1 commit

  • This marks many critical kernel structures for randomization. These are
    structures that have been targeted in the past in security exploits, or
    contain functions pointers, pointers to function pointer tables, lists,
    workqueues, ref-counters, credentials, permissions, or are otherwise
    sensitive. This initial list was extracted from Brad Spengler/PaX Team's
    code in the last public patch of grsecurity/PaX based on my understanding
    of the code. Changes or omissions from the original code are mine and
    don't reflect the original grsecurity/PaX code.

    Left out of this list is task_struct, which requires special handling
    and will be covered in a subsequent patch.

    Signed-off-by: Kees Cook

    Kees Cook
     

04 Mar, 2017

1 commit

  • Pull vfs 'statx()' update from Al Viro.

    This adds the new extended stat() interface that internally subsumes our
    previous stat interfaces, and allows user mode to specify in more detail
    what kind of information it wants.

    It also allows for some explicit synchronization information to be
    passed to the filesystem, which can be relevant for network filesystems:
    is the cached value ok, or do you need open/close consistency, or what?

    From David Howells.

    Andreas Dilger points out that the first version of the extended statx
    interface was posted June 29, 2010:

    https://www.spinics.net/lists/linux-fsdevel/msg33831.html

    * 'rebased-statx' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    statx: Add a system call to make enhanced file info available

    Linus Torvalds
     

03 Mar, 2017

1 commit

  • Add a system call to make extended file information available, including
    file creation and some attribute flags where available through the
    underlying filesystem.

    The getattr inode operation is altered to take two additional arguments: a
    u32 request_mask and an unsigned int flags that indicate the
    synchronisation mode. This change is propagated to the vfs_getattr*()
    function.

    Functions like vfs_stat() are now inline wrappers around new functions
    vfs_statx() and vfs_statx_fd() to reduce stack usage.

    ========
    OVERVIEW
    ========

    The idea was initially proposed as a set of xattrs that could be retrieved
    with getxattr(), but the general preference proved to be for a new syscall
    with an extended stat structure.

    A number of requests were gathered for features to be included. The
    following have been included:

    (1) Make the fields a consistent size on all arches and make them large.

    (2) Spare space, request flags and information flags are provided for
    future expansion.

    (3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
    __s64).

    (4) Creation time: The SMB protocol carries the creation time, which could
    be exported by Samba, which will in turn help CIFS make use of
    FS-Cache as that can be used for coherency data (stx_btime).

    This is also specified in NFSv4 as a recommended attribute and could
    be exported by NFSD [Steve French].

    (5) Lightweight stat: Ask for just those details of interest, and allow a
    netfs (such as NFS) to approximate anything not of interest, possibly
    without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
    Dilger] (AT_STATX_DONT_SYNC).

    (6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
    its cached attributes are up to date [Trond Myklebust]
    (AT_STATX_FORCE_SYNC).

    And the following have been left out for future extension:

    (7) Data version number: Could be used by userspace NFS servers [Aneesh
    Kumar].

    Can also be used to modify fill_post_wcc() in NFSD which retrieves
    i_version directly, but has just called vfs_getattr(). It could get
    it from the kstat struct if it used vfs_xgetattr() instead.

    (There's disagreement on the exact semantics of a single field, since
    not all filesystems do this the same way).

    (8) BSD stat compatibility: Including more fields from the BSD stat such
    as creation time (st_btime) and inode generation number (st_gen)
    [Jeremy Allison, Bernd Schubert].

    (9) Inode generation number: Useful for FUSE and userspace NFS servers
    [Bernd Schubert].

    (This was asked for but later deemed unnecessary with the
    open-by-handle capability available and caused disagreement as to
    whether it's a security hole or not).

    (10) Extra coherency data may be useful in making backups [Andreas Dilger].

    (No particular data were offered, but things like last backup
    timestamp, the data version number and the DOS archive bit would come
    into this category).

    (11) Allow the filesystem to indicate what it can/cannot provide: A
    filesystem can now say it doesn't support a standard stat feature if
    that isn't available, so if, for instance, inode numbers or UIDs don't
    exist or are fabricated locally...

    (This requires a separate system call - I have an fsinfo() call idea
    for this).

    (12) Store a 16-byte volume ID in the superblock that can be returned in
    struct xstat [Steve French].

    (Deferred to fsinfo).

    (13) Include granularity fields in the time data to indicate the
    granularity of each of the times (NFSv4 time_delta) [Steve French].

    (Deferred to fsinfo).

    (14) FS_IOC_GETFLAGS value. These could be translated to BSD's st_flags.
    Note that the Linux IOC flags are a mess and filesystems such as Ext4
    define flags that aren't in linux/fs.h, so translation in the kernel
    may be a necessity (or, possibly, we provide the filesystem type too).

    (Some attributes are made available in stx_attributes, but the general
    feeling was that the IOC flags were too ext[234]-specific and shouldn't
    be exposed through statx this way).

    (15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
    Michael Kerrisk].

    (Deferred, probably to fsinfo. Finding out if there's an ACL or
    seclabel might require extra filesystem operations).

    (16) Femtosecond-resolution timestamps [Dave Chinner].

    (A __reserved field has been left in the statx_timestamp struct for
    this - if there proves to be a need).

    (17) A set multiple attributes syscall to go with this.

    ===============
    NEW SYSTEM CALL
    ===============

    The new system call is:

    int ret = statx(int dfd,
                    const char *filename,
                    unsigned int flags,
                    unsigned int mask,
                    struct statx *buffer);

    The dfd, filename and flags parameters indicate the file to query, in a
    similar way to fstatat(). There is no equivalent of lstat() as that can be
    emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags. There is
    also no equivalent of fstat() as that can be emulated by passing a NULL
    filename to statx() with the fd of interest in dfd.

    Whether or not statx() synchronises the attributes with the backing store
    can be controlled by OR'ing a value into the flags argument (this typically
    only affects network filesystems):

    (1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
    respect.

    (2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
    its attributes with the server - which might require data writeback to
    occur to get the timestamps correct.

    (3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
    network filesystem. The resulting values should be considered
    approximate.

    mask is a bitmask indicating the fields in struct statx that are of
    interest to the caller. The user should set this to STATX_BASIC_STATS to
    get the basic set returned by stat(). It should be noted that asking for
    more information may entail extra I/O operations.

    buffer points to the destination for the data. This must be 256 bytes in
    size.

    ======================
    MAIN ATTRIBUTES RECORD
    ======================

    The following structures are defined in which to return the main attribute
    set:

    struct statx_timestamp {
            __s64   tv_sec;
            __s32   tv_nsec;
            __s32   __reserved;
    };

    struct statx {
            __u32   stx_mask;
            __u32   stx_blksize;
            __u64   stx_attributes;
            __u32   stx_nlink;
            __u32   stx_uid;
            __u32   stx_gid;
            __u16   stx_mode;
            __u16   __spare0[1];
            __u64   stx_ino;
            __u64   stx_size;
            __u64   stx_blocks;
            __u64   __spare1[1];
            struct statx_timestamp  stx_atime;
            struct statx_timestamp  stx_btime;
            struct statx_timestamp  stx_ctime;
            struct statx_timestamp  stx_mtime;
            __u32   stx_rdev_major;
            __u32   stx_rdev_minor;
            __u32   stx_dev_major;
            __u32   stx_dev_minor;
            __u64   __spare2[14];
    };

    The defined bits in request_mask and stx_mask are:

    STATX_TYPE          Want/got stx_mode & S_IFMT
    STATX_MODE          Want/got stx_mode & ~S_IFMT
    STATX_NLINK         Want/got stx_nlink
    STATX_UID           Want/got stx_uid
    STATX_GID           Want/got stx_gid
    STATX_ATIME         Want/got stx_atime{,_ns}
    STATX_MTIME         Want/got stx_mtime{,_ns}
    STATX_CTIME         Want/got stx_ctime{,_ns}
    STATX_INO           Want/got stx_ino
    STATX_SIZE          Want/got stx_size
    STATX_BLOCKS        Want/got stx_blocks
    STATX_BASIC_STATS   [The stuff in the normal stat struct]
    STATX_BTIME         Want/got stx_btime{,_ns}
    STATX_ALL           [All currently available stuff]

    stx_btime is the file creation time, stx_mask is a bitmask indicating the
    data provided and __spares*[] are where as-yet undefined fields can be
    placed.

    Time fields are structures with separate seconds and nanoseconds fields
    plus a reserved field in case we want to add even finer resolution. Note
    that times will be negative if before 1970; in such a case, the nanosecond
    fields will also be negative if not zero.
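
    As a worked example of that sign convention, the sketch below folds a
    timestamp into a single signed nanosecond count. The type struct
    ts_sketch is a hypothetical local mirror of the layout quoted above, and
    ts_to_ns() is likewise an illustrative name. Because both fields carry
    the same sign, a plain multiply-and-add is enough: tv_sec = -1 with
    tv_nsec = -500000000 comes out as -1500000000 ns, that is, 1.5 seconds
    before the epoch.

```c
#include <stdint.h>

/* Hypothetical local mirror of the timestamp layout quoted in the text. */
struct ts_sketch {
    int64_t tv_sec;
    int32_t tv_nsec;
    int32_t reserved;
};

/*
 * Convert to signed nanoseconds relative to 1970. A 64-bit nanosecond
 * count covers roughly 292 years either side of the epoch; values outside
 * that range would overflow and are not handled in this sketch.
 */
static int64_t ts_to_ns(struct ts_sketch ts)
{
    return ts.tv_sec * INT64_C(1000000000) + ts.tv_nsec;
}
```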

    The bits defined in the stx_attributes field convey information about a
    file, how it is accessed, where it is and what it does. The following
    attributes map to FS_*_FL flags and are the same numerical value:

    STATX_ATTR_COMPRESSED   File is compressed by the fs
    STATX_ATTR_IMMUTABLE    File is marked immutable
    STATX_ATTR_APPEND       File is append-only
    STATX_ATTR_NODUMP       File is not to be dumped
    STATX_ATTR_ENCRYPTED    File requires key to decrypt in fs

    Within the kernel, the supported flags are listed by:

    KSTAT_ATTR_FS_IOC_FLAGS

    [Are any other IOC flags of sufficient general interest to be exposed
    through this interface?]

    New flags include:

    STATX_ATTR_AUTOMOUNT Object is an automount trigger

    These are for the use of GUI tools that might want to mark files specially,
    depending on what they are.

    Fields in struct statx come in a number of classes:

    (0) stx_dev_*, stx_blksize.

    These are local system information and are always available.

    (1) stx_mode, stx_nlink, stx_uid, stx_gid, stx_[amc]time, stx_ino,
    stx_size, stx_blocks.

    These will be returned whether the caller asks for them or not. The
    corresponding bits in stx_mask will be set to indicate whether they
    actually have valid values.

    If the caller didn't ask for them, then they may be approximated. For
    example, NFS won't waste any time updating them from the server,
    unless as a byproduct of updating something requested.

    If the values don't actually exist for the underlying object (such as
    UID or GID on a DOS file), then the bit won't be set in the stx_mask,
    even if the caller asked for the value. In such a case, the returned
    value will be a fabrication.

    Note that there are instances where the type might not be valid, for
    instance Windows reparse points.

    (2) stx_rdev_*.

    This will be set only if stx_mode indicates we're looking at a
    blockdev or a chardev, otherwise will be 0.

    (3) stx_btime.

    Similar to (1), except this will be set to 0 if it doesn't exist.
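
    The field classes above suggest that a caller should consult stx_mask
    before trusting a value. A small sketch of that check, with hypothetical
    helper names statx_missing() and statx_has_btime(), assuming the struct
    statx and STATX_* definitions from the kernel's <linux/stat.h> header:

```c
#include <linux/stat.h>   /* struct statx, STATX_* mask bits */

/*
 * Return the bits the caller asked for that did not come back set in
 * stx_mask. Per the description above, the matching fields may then hold
 * fabricated values.
 */
static unsigned int statx_missing(const struct statx *stx, unsigned int wanted)
{
    return wanted & ~stx->stx_mask;
}

/* The creation time is meaningful only when STATX_BTIME came back set. */
static int statx_has_btime(const struct statx *stx)
{
    return (stx->stx_mask & STATX_BTIME) != 0;
}
```

    For instance, a tool that displays file ownership could call
    statx_missing(&stx, STATX_UID | STATX_GID) and print a placeholder for
    any bit that is returned, instead of showing a made-up owner.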

    =======
    TESTING
    =======

    The following test program can be used to test the statx system call:

    samples/statx/test-statx.c

    Just compile and run, passing it paths to the files you want to examine.
    The file is built automatically if CONFIG_SAMPLES is enabled.

    Here's some example output. Firstly, an NFS directory that crosses to
    another FSID. Note that the AUTOMOUNT attribute is set because transiting
    this directory will cause d_automount to be invoked by the VFS.

    [root@andromeda ~]# /tmp/test-statx -A /warthog/data
    statx(/warthog/data) = 0
    results=7ff
    Size: 4096 Blocks: 8 IO Block: 1048576 directory
    Device: 00:26 Inode: 1703937 Links: 125
    Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
    Access: 2016-11-24 09:02:12.219699527+0000
    Modify: 2016-11-17 10:44:36.225653653+0000
    Change: 2016-11-17 10:44:36.225653653+0000
    Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)

    Secondly, the result of automounting on that directory.

    [root@andromeda ~]# /tmp/test-statx /warthog/data
    statx(/warthog/data) = 0
    results=7ff
    Size: 4096 Blocks: 8 IO Block: 1048576 directory
    Device: 00:27 Inode: 2 Links: 125
    Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
    Access: 2016-11-24 09:02:12.219699527+0000
    Modify: 2016-11-17 10:44:36.225653653+0000
    Change: 2016-11-17 10:44:36.225653653+0000

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     

02 Mar, 2017

2 commits


13 Feb, 2017

1 commit

  • Currently unregistering a sysctl table does not prune its dentries.
    Stale dentries could slow down sysctl operations significantly.

    For example, the command:

    # for i in {1..100000} ; do unshare -n -- sysctl -a &> /dev/null ; done

    creates millions of stale dentries around the sysctls of the loopback interface:

    # sysctl fs.dentry-state
    fs.dentry-state = 25812579 24724135 45 0 0 0

    All of them have matching names, thus lookup has to scan through the whole
    hash chain and call d_compare (proc_sys_compare), which checks them
    under a system-wide spinlock (sysctl_lock).

    # time sysctl -a > /dev/null
    real 1m12.806s
    user 0m0.016s
    sys 1m12.400s

    Currently only the memory reclaimer can remove this garbage,
    but without significant memory pressure this never happens.

    This patch collects sysctl inodes into a list on the sysctl table header and
    prunes all their dentries once that table unregisters.

    Konstantin Khlebnikov writes:
    > On 10.02.2017 10:47, Al Viro wrote:
    >> how about the matching stats *after* that patch?
    >
    > dcache size doesn't grow endlessly, so stats are fine
    >
    > # sysctl fs.dentry-state
    > fs.dentry-state = 92712 58376 45 0 0 0
    >
    > # time sysctl -a &>/dev/null
    >
    > real 0m0.013s
    > user 0m0.004s
    > sys 0m0.008s

    Signed-off-by: Konstantin Khlebnikov
    Suggested-by: Al Viro
    Signed-off-by: Eric W. Biederman

    Konstantin Khlebnikov
     

24 Jan, 2017

1 commit


15 Dec, 2016

1 commit

  • Pull security subsystem updates from James Morris:
    "Generally pretty quiet for this release. Highlights:

    Yama:
    - allow ptrace access for original parent after re-parenting

    TPM:
    - add documentation
    - many bugfixes & cleanups
    - define a generic open() method for ascii & bios measurements

    Integrity:
    - Harden against malformed xattrs

    SELinux:
    - bugfixes & cleanups

    Smack:
    - Remove unnecessary smack_known_invalid label
    - Do not apply star label in smack_setprocattr hook
    - parse mnt opts after privileges check (fixes unpriv DoS vuln)"

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (56 commits)
    Yama: allow access for the current ptrace parent
    tpm: adjust return value of tpm_read_log
    tpm: vtpm_proxy: conditionally call tpm_chip_unregister
    tpm: Fix handling of missing event log
    tpm: Check the bios_dir entry for NULL before accessing it
    tpm: return -ENODEV if np is not set
    tpm: cleanup of printk error messages
    tpm: replace of_find_node_by_name() with dev of_node property
    tpm: redefine read_log() to handle ACPI/OF at runtime
    tpm: fix the missing .owner in tpm_bios_measurements_ops
    tpm: have event log use the tpm_chip
    tpm: drop tpm1_chip_register(/unregister)
    tpm: replace dynamically allocated bios_dir with a static array
    tpm: replace symbolic permission with octal for securityfs files
    char: tpm: fix kerneldoc tpm2_unseal_trusted name typo
    tpm_tis: Allow tpm_tis to be bound using DT
    tpm, tpm_vtpm_proxy: add kdoc comments for VTPM_PROXY_IOC_NEW_DEV
    tpm: Only call pm_runtime_get_sync if device has a parent
    tpm: define a generic open() method for ascii & bios measurements
    Documentation: tpm: add the Physical TPM device tree binding documentation
    ...

    Linus Torvalds
     

14 Dec, 2016

1 commit

  • Pull xen updates from Juergen Gross:
    "Xen features and fixes for 4.10

    These are some fixes, a move of some arm related headers to share them
    between arm and arm64 and a series introducing a helper to make code
    more readable.

    The most notable change is David stepping down as maintainer of the
    Xen hypervisor interface. This results in me sending you the pull
    requests for Xen related code from now on"

    * tag 'for-linus-4.10-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (29 commits)
    xen/balloon: Only mark a page as managed when it is released
    xenbus: fix deadlock on writes to /proc/xen/xenbus
    xen/scsifront: don't request a slot on the ring until request is ready
    xen/x86: Increase xen_e820_map to E820_X_MAX possible entries
    x86: Make E820_X_MAX unconditionally larger than E820MAX
    xen/pci: Bubble up error and fix description.
    xen: xenbus: set error code on failure
    xen: set error code on failures
    arm/xen: Use alloc_percpu rather than __alloc_percpu
    arm/arm64: xen: Move shared architecture headers to include/xen/arm
    xen/events: use xen_vcpu_id mapping for EVTCHNOP_status
    xen/gntdev: Use VM_MIXEDMAP instead of VM_IO to avoid NUMA balancing
    xen-scsifront: Add a missing call to kfree
    MAINTAINERS: update XEN HYPERVISOR INTERFACE
    xenfs: Use proc_create_mount_point() to create /proc/xen
    xen-platform: use builtin_pci_driver
    xen-netback: fix error handling output
    xen: make use of xenbus_read_unsigned() in xenbus
    xen: make use of xenbus_read_unsigned() in xen-pciback
    xen: make use of xenbus_read_unsigned() in xen-fbfront
    ...

    Linus Torvalds
     

13 Dec, 2016

2 commits


17 Nov, 2016

1 commit

  • Mounting proc in user namespace containers fails if the xenbus
    filesystem is mounted on /proc/xen because this directory fails
    the "permanently empty" test. proc_create_mount_point() exists
    specifically to create such mountpoints in proc but is currently
    proc-internal. Export this interface to modules, then use it in
    xenbus when creating /proc/xen.

    Signed-off-by: Seth Forshee
    Signed-off-by: David Vrabel
    Signed-off-by: Juergen Gross

    Seth Forshee
     

15 Nov, 2016

1 commit

  • Pass the file mode of the proc inode to be created to
    proc_pid_make_inode. In proc_pid_make_inode, initialize inode->i_mode
    before calling security_task_to_inode. This allows selinux to set
    isec->sclass right away without introducing "half-initialized" inode
    security structs.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Paul Moore

    Andreas Gruenbacher
     

28 Sep, 2016

1 commit


24 Jun, 2016

1 commit

  • Move the call of get_pid_ns, the call of proc_parse_options, and
    the setting of s_iflags into proc_fill_super so that mount_ns
    can be used.

    Convert proc_mount to call mount_ns and remove the now unnecessary
    code.

    Acked-by: Seth Forshee
    Reviewed-by: Djalal Harouni
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

01 Jul, 2015

1 commit

  • Add a new function proc_create_mount_point that, when used, creates a
    directory that cannot be added to.

    Add a new function is_empty_pde to test if a proc directory entry is a
    mount point.

    Update the code to use make_empty_dir_inode when reporting
    a permanently empty directory to the vfs.

    Update the code to not allow adding to permanently empty directories.

    Update /proc/openprom and /proc/fs/nfsd to be permanently empty directories.

    Cc: stable@vger.kernel.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
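The rule this commit describes (a directory created as a mount point can never gain children) can be illustrated with a toy model. This is only a sketch in Python, not kernel code: the class `ProcDir` and the helpers `create_mount_point` and `is_empty_dir` are hypothetical names that loosely echo proc_create_mount_point and is_empty_pde from the message.

```python
class ProcDir:
    """Toy model of a proc directory entry (hypothetical, not kernel code)."""

    def __init__(self, name, permanently_empty=False):
        self.name = name
        self.permanently_empty = permanently_empty
        self.children = {}

    def add_child(self, child):
        # Rule from the commit message: nothing may be registered
        # beneath a permanently empty directory.
        if self.permanently_empty:
            raise PermissionError(f"{self.name} is permanently empty")
        self.children[child.name] = child
        return child


def create_mount_point(parent, name):
    """Create a directory that can only ever serve as a mount point."""
    return parent.add_child(ProcDir(name, permanently_empty=True))


def is_empty_dir(entry):
    """Report whether an entry was created as a permanently empty directory."""
    return entry.permanently_empty
```

Keeping the flag on the entry itself means the "may I add here?" check stays a single test at registration time.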
     

23 Feb, 2015

1 commit


17 Dec, 2014

1 commit

  • Pull vfs pile #2 from Al Viro:
    "Next pile (and there'll be one or two more).

    The large piece in this one is getting rid of /proc/*/ns/* weirdness;
    among other things, it allows to (finally) make nameidata completely
    opaque outside of fs/namei.c, making for easier further cleanups in
    there"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    coda_venus_readdir(): use file_inode()
    fs/namei.c: fold link_path_walk() call into path_init()
    path_init(): don't bother with LOOKUP_PARENT in argument
    fs/namei.c: new helper (path_cleanup())
    path_init(): store the "base" pointer to file in nameidata itself
    make default ->i_fop have ->open() fail with ENXIO
    make nameidata completely opaque outside of fs/namei.c
    kill proc_ns completely
    take the targets of /proc/*/ns/* symlinks to separate fs
    bury struct proc_ns in fs/proc
    copy address of proc_ns_ops into ns_common
    new helpers: ns_alloc_inum/ns_free_inum
    make proc_ns_operations work with struct ns_common * instead of void *
    switch the rest of proc_ns_operations to working with &...->ns
    netns: switch ->get()/->put()/->install()/->inum() to working with &net->ns
    make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns
    common object embedded into various struct ....ns

    Linus Torvalds
     

11 Dec, 2014

2 commits

  • procfs inodes need only the ns_ops part; nsfs inodes don't need it at all

    Signed-off-by: Al Viro

    Al Viro
     
  • When a lot of netdevices are created, one of the bottlenecks is the
    creation of proc entries. This series aims to accelerate this part.

    The current implementation for the directories in /proc uses a singly
    linked list. This is slow when handling directories with large numbers of
    entries (e.g. netdevice-related entries when lots of tunnels are opened).

    This patch replaces this linked list with a red-black tree.

    Here are some numbers:

    dummy30000.batch contains 30 000 times 'link add type dummy'.

    Before the patch:
    $ time ip -b dummy30000.batch
    real 2m31.950s
    user 0m0.440s
    sys 2m21.440s
    $ time rmmod dummy
    real 1m35.764s
    user 0m0.000s
    sys 1m24.088s

    After the patch:
    $ time ip -b dummy30000.batch
    real 2m0.874s
    user 0m0.448s
    sys 1m49.720s
    $ time rmmod dummy
    real 1m13.988s
    user 0m0.000s
    sys 1m1.008s

    The idea of improving this part was suggested by Thierry Herbelot.

    [akpm@linux-foundation.org: initialise proc_root.subdir at compile time]
    Signed-off-by: Nicolas Dichtel
    Acked-by: David S. Miller
    Cc: Thierry Herbelot
    Acked-by: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicolas Dichtel
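To give a feel for why an ordered structure helps with directories of this size, the sketch below counts comparisons for a linear scan versus a binary search over sorted names. It is a Python illustration only: the binary search stands in for a balanced tree rather than implementing a red-black tree, and the ordering in `sort_key` (length first, then content) is an assumption made for the example.

```python
def linear_lookup(names, target):
    """Scan an unsorted list; return (found, comparisons)."""
    for steps, name in enumerate(names, start=1):
        if name == target:
            return True, steps
    return False, len(names)


def sort_key(name):
    # Assumed ordering for this sketch: shorter names first, then by content.
    return (len(name), name)


def ordered_lookup(sorted_names, target):
    """Binary search over names sorted by sort_key; return (found, comparisons)."""
    lo, hi, steps = 0, len(sorted_names), 0
    key = sort_key(target)
    while lo < hi:
        mid = (lo + hi) // 2
        steps += 1
        mid_key = sort_key(sorted_names[mid])
        if mid_key == key:
            return True, steps
        if mid_key < key:
            lo = mid + 1
        else:
            hi = mid
    return False, steps
```

With 30 000 names, as in the benchmark above, the linear scan may need up to 30 000 comparisons per lookup, while the binary search needs at most 15.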
     

05 Dec, 2014

1 commit

  • a) make get_proc_ns() return a pointer to struct ns_common
    b) mirror ns_ops in dentry->d_fsdata of ns dentries, so that
    is_mnt_ns_file() could get away with fewer dereferences.

    That way struct proc_ns becomes invisible outside of fs/proc/*.c

    Signed-off-by: Al Viro

    Al Viro
     

10 Oct, 2014

3 commits

  • m_start() can use get_proc_task() instead, and "struct inode *"
    provides more potentially useful info, see the next changes.

    Signed-off-by: Oleg Nesterov
    Cc: Alexander Viro
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Cc: Greg Ungerer
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • A simple test-case from Kirill Shutemov

    cat /proc/self/maps >/dev/null
    chmod +x /proc/self/net/packet
    exec /proc/self/net/packet

    makes lockdep unhappy: cat and exec take seq_file->lock and
    cred_guard_mutex in opposite orders.

    It's a false positive, and we probably should not allow "chmod +x" on proc
    files. Still, I think that we should avoid mm_access() and cred_guard_mutex
    in sys_read() paths; security checking should happen at open time. Besides,
    this doesn't even look right if the task changes its ->mm between m_stop()
    and m_start().

    Add the new "mm_struct *mm" member into struct proc_maps_private and change
    proc_maps_open() to initialize it using proc_mem_open(). Change m_start() to
    use priv->mm if atomic_inc_not_zero(mm_users) succeeds or return NULL (eof)
    otherwise.

    The only complication is that proc_maps_open() users should additionally do
    mmdrop() in fop->release(); add the new proc_map_release() helper for that.

    Note: this is a user-visible change. If the task execs after open("maps"),
    the new ->mm won't be visible via this file. I hope this is fine, and this
    matches /proc/pid/mem behaviour.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Oleg Nesterov
    Reported-by: "Kirill A. Shutemov"
    Acked-by: Kirill A. Shutemov
    Acked-by: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
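The "increment unless zero" step that m_start() relies on can be sketched as follows. This is a Python illustration under stated assumptions: `UserCount` is a hypothetical stand-in for mm_users, a lock takes the place of the kernel's atomic operation, and `start_iteration` only mirrors the decision described in the message (use the saved mm, or report end of file).

```python
import threading


class UserCount:
    """Toy stand-in for mm_users; a lock replaces the kernel's atomic op."""

    def __init__(self, initial):
        self._value = initial
        self._lock = threading.Lock()

    def inc_not_zero(self):
        """Take a reference only if at least one is still held."""
        with self._lock:
            if self._value == 0:
                return False
            self._value += 1
            return True

    def dec(self):
        """Drop a reference and return the remaining count."""
        with self._lock:
            self._value -= 1
            return self._value


def start_iteration(users):
    """Sketch of the m_start() decision: iterate, or report EOF (None)."""
    if not users.inc_not_zero():
        return None  # address space already gone: behave like end of file
    return "iterate"
```

The point of the check is that a count that has already reached zero is never revived, so a reader cannot resurrect an address space that is being torn down.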
     
  • Extract the mm_access() code from __mem_open() into the new helper,
    proc_mem_open(), the next patch will add another caller.

    Signed-off-by: Oleg Nesterov
    Acked-by: Kirill A. Shutemov
    Acked-by: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

10 Aug, 2014

1 commit

  • Pull namespace updates from Eric Biederman:
    "This is a bunch of small changes built against 3.16-rc6. The most
    significant change for users is the first patch which makes setns
    dramatically faster by removing unneeded rcu handling.

    The next chunk of changes are so that "mount -o remount,.." will not
    allow the user namespace root to drop flags on a mount set by the
    system wide root. In other words, this forces read-only mounts to stay
    read-only, no-dev mounts to stay no-dev, no-suid mounts to stay no-suid,
    and no-exec mounts to stay no-exec, and it prevents unprivileged users
    from messing with a mount's atime settings. I have included my test case
    as the last patch in this series so people performing backports can
    verify this change works correctly.

    The next change fixes a bug in NFS that was discovered while auditing
    nsproxy users for the first optimization. Today you can oops the
    kernel by reading /proc/fs/nfsfs/{servers,volumes} if you are clever
    with pid namespaces. I rebased and fixed the build of the
    !CONFIG_NFS_FS case yesterday when a build bot caught my typo. Given
    that no one, to my knowledge, bases anything on my tree, fixing the typo
    in place seems more responsible than requiring a typo-fix to be
    backported as well.

    The last change is a small semantic cleanup introducing
    /proc/thread-self and pointing /proc/mounts and /proc/net at it. This
    prevents several kinds of problematic corner cases. It is a
    user-visible change so it has a minute chance of causing regressions,
    so the changes to /proc/mounts and /proc/net are individual one-line
    commits that can be trivially reverted. Unfortunately I lost and
    could not find the email of the original reporter so he is not
    credited. From at least one perspective this change to /proc/net is a
    regression fix to allow pthread /proc/net uses that were broken by
    the introduction of the network namespace"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    proc: Point /proc/mounts at /proc/thread-self/mounts instead of /proc/self/mounts
    proc: Point /proc/net at /proc/thread-self/net instead of /proc/self/net
    proc: Implement /proc/thread-self to point at the directory of the current thread
    proc: Have net show up under /proc/<tgid>/task/<tid>
    NFS: Fix /proc/fs/nfsfs/servers and /proc/fs/nfsfs/volumes
    mnt: Add tests for unprivileged remount cases that have found to be faulty
    mnt: Change the default remount atime from relatime to the existing value
    mnt: Correct permission checks in do_remount
    mnt: Move the test for MNT_LOCK_READONLY from change_mount_flags into do_remount
    mnt: Only change user settable mount flags in remount
    namespaces: Use task_lock and not rcu to protect nsproxy

    Linus Torvalds
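The remount restriction described in this pull request boils down to a bitmask check: an unprivileged caller may add flags but may not clear any flag that was locked by the system-wide root. The Python sketch below illustrates the idea only. The flag values and the function `remount_allowed` are hypothetical and do not correspond to the kernel's MNT_* constants or to do_remount().

```python
# Hypothetical flag values, chosen only for this illustration.
RDONLY, NODEV, NOSUID, NOEXEC = 0x1, 0x2, 0x4, 0x8


def remount_allowed(current, locked, requested, privileged):
    """Return True if a remount from `current` to `requested` may proceed.

    `locked` is the subset of `current` flags that an unprivileged
    caller must not clear.
    """
    if privileged:
        return True
    dropped = current & ~requested  # flags set now but absent from the request
    return (dropped & locked) == 0
```

For example, with RDONLY and NODEV both locked, a request that adds NOEXEC is accepted, while a request that omits RDONLY is refused.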
     

09 Aug, 2014

3 commits

  • If you're applying this patch, all /proc/$PID/* files were converted
    to seq_file interface and this code became unused.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Signed-off-by: Alexey Dobriyan
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • * remove proc_create(NULL, ...) check, let it oops

    * warn about proc_create("", ...) and proc_create("very very long name", ...)
    proc code keeps length as u8, no 256+ name length possible

    * warn about proc_create("123", ...)
    /proc/$PID and /proc/misc namespaces are separate things,
    but a dumb module might create a funky $PID-like entry.

    * remove post mortem strchr('/') check
    Triggering it implies either strchr() is buggy or memory corruption.
    It should be VFS check anyway.

    In reality, none of these checks will ever trigger,
    it is preparation for the next patch.

    Based on patch from Al Viro.

    Signed-off-by: Alexey Dobriyan
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
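The name checks listed in this commit can be summarised as a small validation routine. The following Python sketch is an illustration, not the kernel's logic: `proc_name_warnings` is a hypothetical helper, the 255-byte limit follows from the u8 length mentioned above, and treating any all-digit name as PID-like is an assumption. The '/' check is left out on purpose, since the commit removes it.

```python
def proc_name_warnings(name):
    """Return the list of problems for a proposed /proc entry name."""
    problems = []
    raw = name.encode()
    if len(raw) == 0:
        problems.append("empty name")
    if len(raw) > 255:
        # The length is kept in a u8, so 256 or more bytes cannot be stored.
        problems.append("name too long")
    if raw.isdigit():
        # Assumption: all-digit names collide with the /proc/$PID namespace.
        problems.append("looks like a PID")
    return problems
```

A name such as "cpuinfo" yields no warnings, while "", a 256-character name, and "123" each yield exactly one.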
     

05 Aug, 2014

1 commit

  • /proc/thread-self is derived from /proc/self. /proc/thread-self
    points to the directory in proc containing information about the
    current thread.

    This functionality has been missing for a long time, and is tricky to
    implement in userspace as gettid() is not exported by glibc. More
    importantly this allows fixing defects in /proc/mounts and /proc/net
    where in a threaded application today they wind up being empty files
    when only the initial pthread has exited, causing problems for other
    threads.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
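From userspace, the difference shows up as follows: without /proc/thread-self a program has to assemble the per-thread directory itself, whereas with it a single symlink lookup is enough. This is a minimal Python sketch, assuming Python 3.8+ for `threading.get_native_id()` and assuming the link resolves to a relative path of the form `<tgid>/task/<tid>`. The helper names are hypothetical.

```python
import os
import threading


def thread_self_target(tgid, tid):
    """Relative link target assumed for /proc/thread-self."""
    return f"{tgid}/task/{tid}"


def current_thread_dir():
    """Build the current thread's /proc directory by hand (Python 3.8+)."""
    return "/proc/" + thread_self_target(os.getpid(), threading.get_native_id())


def read_thread_self():
    """Return the kernel-provided link target, or None where it is unavailable."""
    try:
        return os.readlink("/proc/thread-self")
    except OSError:
        return None
```

On kernels that lack the symlink, or on systems without procfs, `read_thread_self()` returns None and the hand-built path from `current_thread_dir()` is the fallback.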
     

12 Mar, 2014

1 commit

  • The same data is now available in sysfs, so we can remove the code
    that exports it in /proc and replace it with a symlink to the sysfs
    version.

    Tested on versatile qemu model and mpc5200 eval board. More testing
    would be appreciated.

    v5: Fixed up conflicts with mainline changes

    Signed-off-by: Grant Likely
    Cc: Rob Herring
    Cc: Benjamin Herrenschmidt
    Cc: David S. Miller
    Cc: Nathan Fontenot
    Cc: Pantelis Antoniou

    Grant Likely
     

29 Jun, 2013

2 commits


02 May, 2013

4 commits

  • Pull VFS updates from Al Viro,

    Misc cleanups all over the place, mainly wrt /proc interfaces (switch
    create_proc_entry to proc_create(), get rid of the deprecated
    create_proc_read_entry() in favor of using proc_create_data() and
    seq_file etc).

    7kloc removed.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
    don't bother with deferred freeing of fdtables
    proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
    proc: Make the PROC_I() and PDE() macros internal to procfs
    proc: Supply a function to remove a proc entry by PDE
    take cgroup_open() and cpuset_open() to fs/proc/base.c
    ppc: Clean up scanlog
    ppc: Clean up rtas_flash driver somewhat
    hostap: proc: Use remove_proc_subtree()
    drm: proc: Use remove_proc_subtree()
    drm: proc: Use minor->index to label things, not PDE->name
    drm: Constify drm_proc_list[]
    zoran: Don't print proc_dir_entry data in debug
    reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
    proc: Supply an accessor for getting the data from a PDE's parent
    airo: Use remove_proc_subtree()
    rtl8192u: Don't need to save device proc dir PDE
    rtl8187se: Use a dir under /proc/net/r8180/
    proc: Add proc_mkdir_data()
    proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
    proc: Move PDE_NET() to fs/proc/proc_net.c
    ...

    Linus Torvalds
     
  • Move non-public declarations and definitions from linux/proc_fs.h to
    fs/proc/internal.h.

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     
  • Make the PROC_I() and PDE() macros internal to procfs. This means making
    PDE_DATA() out of line. This could be made more optimal by storing
    PDE()->data into inode->i_private.

    Also provide a __PDE_DATA() that is inline and internal to procfs.

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     
  • Move some bits from linux/proc_fs.h to linux/of.h, signal.h and tty.h.

    Also move proc_tty_init() and proc_device_tree_init() to fs/proc/internal.h as
    they're internal to procfs.

    Signed-off-by: David Howells
    Acked-by: Greg Kroah-Hartman
    Acked-by: Grant Likely
    cc: devicetree-discuss@lists.ozlabs.org
    cc: linux-arch@vger.kernel.org
    cc: Greg Kroah-Hartman
    cc: Jiri Slaby
    Signed-off-by: Al Viro

    David Howells