25 Dec, 2016

1 commit


18 Dec, 2016

1 commit

  • …/linux/kernel/git/mszeredi/vfs

    Pull partial readlink cleanups from Miklos Szeredi.

    This is the uncontroversial part of the readlink cleanup patch-set that
    simplifies the default readlink handling.

    Miklos and Al are still discussing the rest of the series.

    * git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
    vfs: make generic_readlink() static
    vfs: remove ".readlink = generic_readlink" assignments
    vfs: default to generic_readlink()
    vfs: replace calling i_op->readlink with vfs_readlink()
    proc/self: use generic_readlink
    ecryptfs: use vfs_get_link()
    bad_inode: add missing i_op initializers

    Linus Torvalds
     

15 Dec, 2016

2 commits

  • Pull audit updates from Paul Moore:
    "After the small number of patches for v4.9, we've got a much bigger
    pile for v4.10.

    The bulk of these patches involve a rework of the audit backlog queue
    to enable us to move the netlink multicasting out of the task/thread
    that generates the audit record and into the kernel thread that emits
    the record (just like we do for the audit unicast to auditd).

    While we were playing with the backlog queue(s) we fixed a number of
    other little problems with the code, and from all the testing so far
    things look to be in much better shape now. Doing this also allowed us
    to re-enable disabling IRQs for some netns operations ("netns: avoid
    disabling irq for netns id").

    The remaining patches fix some small problems that are well documented
    in the commit descriptions, as well as adding session ID filtering
    support"

    * 'stable-4.10' of git://git.infradead.org/users/pcmoore/audit:
    audit: use proper refcount locking on audit_sock
    netns: avoid disabling irq for netns id
    audit: don't ever sleep on a command record/message
    audit: handle a clean auditd shutdown with grace
    audit: wake up kauditd_thread after auditd registers
    audit: rework audit_log_start()
    audit: rework the audit queue handling
    audit: rename the queues and kauditd related functions
    audit: queue netlink multicast sends just like we do for unicast sends
    audit: fixup audit_init()
    audit: move kaudit thread start from auditd registration to kaudit init (#2)
    audit: add support for session ID user filter
    audit: fix formatting of AUDIT_CONFIG_CHANGE events
    audit: skip sessionid sentinel value when auto-incrementing
    audit: tame initialization warning len_abuf in audit_log_execve_info
    audit: less stack usage for /proc/*/loginuid

    Linus Torvalds
     
  • Pull security subsystem updates from James Morris:
    "Generally pretty quiet for this release. Highlights:

    Yama:
    - allow ptrace access for original parent after re-parenting

    TPM:
    - add documentation
    - many bugfixes & cleanups
    - define a generic open() method for ascii & bios measurements

    Integrity:
    - Harden against malformed xattrs

    SELinux:
    - bugfixes & cleanups

    Smack:
    - Remove unnecessary smack_known_invalid label
    - Do not apply star label in smack_setprocattr hook
    - parse mnt opts after privileges check (fixes unpriv DoS vuln)"

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (56 commits)
    Yama: allow access for the current ptrace parent
    tpm: adjust return value of tpm_read_log
    tpm: vtpm_proxy: conditionally call tpm_chip_unregister
    tpm: Fix handling of missing event log
    tpm: Check the bios_dir entry for NULL before accessing it
    tpm: return -ENODEV if np is not set
    tpm: cleanup of printk error messages
    tpm: replace of_find_node_by_name() with dev of_node property
    tpm: redefine read_log() to handle ACPI/OF at runtime
    tpm: fix the missing .owner in tpm_bios_measurements_ops
    tpm: have event log use the tpm_chip
    tpm: drop tpm1_chip_register(/unregister)
    tpm: replace dynamically allocated bios_dir with a static array
    tpm: replace symbolic permission with octal for securityfs files
    char: tpm: fix kerneldoc tpm2_unseal_trusted name typo
    tpm_tis: Allow tpm_tis to be bound using DT
    tpm, tpm_vtpm_proxy: add kdoc comments for VTPM_PROXY_IOC_NEW_DEV
    tpm: Only call pm_runtime_get_sync if device has a parent
    tpm: define a generic open() method for ascii & bios measurements
    Documentation: tpm: add the Physical TPM device tree binding documentation
    ...

    Linus Torvalds
     

14 Dec, 2016

1 commit

  • Pull xen updates from Juergen Gross:
    "Xen features and fixes for 4.10

    These are some fixes, a move of some arm related headers to share them
    between arm and arm64 and a series introducing a helper to make code
    more readable.

    The most notable change is David stepping down as maintainer of the
    Xen hypervisor interface. This results in me sending you the pull
    requests for Xen related code from now on"

    * tag 'for-linus-4.10-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (29 commits)
    xen/balloon: Only mark a page as managed when it is released
    xenbus: fix deadlock on writes to /proc/xen/xenbus
    xen/scsifront: don't request a slot on the ring until request is ready
    xen/x86: Increase xen_e820_map to E820_X_MAX possible entries
    x86: Make E820_X_MAX unconditionally larger than E820MAX
    xen/pci: Bubble up error and fix description.
    xen: xenbus: set error code on failure
    xen: set error code on failures
    arm/xen: Use alloc_percpu rather than __alloc_percpu
    arm/arm64: xen: Move shared architecture headers to include/xen/arm
    xen/events: use xen_vcpu_id mapping for EVTCHNOP_status
    xen/gntdev: Use VM_MIXEDMAP instead of VM_IO to avoid NUMA balancing
    xen-scsifront: Add a missing call to kfree
    MAINTAINERS: update XEN HYPERVISOR INTERFACE
    xenfs: Use proc_create_mount_point() to create /proc/xen
    xen-platform: use builtin_pci_driver
    xen-netback: fix error handling output
    xen: make use of xenbus_read_unsigned() in xenbus
    xen: make use of xenbus_read_unsigned() in xen-pciback
    xen: make use of xenbus_read_unsigned() in xen-fbfront
    ...

    Linus Torvalds
     

13 Dec, 2016

11 commits

  • Runtime nlink calculation works but meh. I don't know how to do it at
    compile time, but I know how to do it at init time.

    Shift "2+" part into init time as a bonus.

    Link: http://lkml.kernel.org/r/20161122195549.GB29812@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Comparison for "
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • format_decode and vsnprintf occasionally show up in perf top, so I went
    looking for places that might not need the full printf power. With the
    help of kprobes, I gathered some statistics on which format strings we
    mostly pass to vsnprintf. On a trivial desktop workload, I hit "%x" 25%
    of the time, so something apparently reads /proc/pid/status (which does
    5*16 printf("%x") calls) a lot.

    With this patch, reading /proc/pid/status is 30% faster according to
    this microbenchmark:

    char buf[4096];
    int i, fd;
    for (i = 0; i < 10000; ++i) {
        fd = open("/proc/self/status", O_RDONLY);
        read(fd, buf, sizeof(buf));
        close(fd);
    }

    Link: http://lkml.kernel.org/r/1474410485-1305-1-git-send-email-linux@rasmusvillemoes.dk
    Signed-off-by: Rasmus Villemoes
    Acked-by: Andrei Vagin
    Acked-by: Kees Cook
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • Some comments were obsoleted since commit 05c0ae21c034 ("try a saner
    locking for pde_opener...").

    Some new comments added.

    Some confusing comments replaced with equally confusing ones.

    Link: http://lkml.kernel.org/r/20161029160231.GD1246@avx2
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • kzalloc is too much, half of the fields will be reinitialized anyway.

    If proc file doesn't have ->release hook (some still do not), clearing
    is unnecessary because it will be freed immediately.

    Link: http://lkml.kernel.org/r/20161029155747.GC1246@avx2
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • struct pde_opener::closing is boolean.

    Link: http://lkml.kernel.org/r/20161029155439.GB1246@avx2
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • list_del_init() is too much, structure will be freed in three lines
    anyway.

    Link: http://lkml.kernel.org/r/20161029155313.GA1246@avx2
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Linux doesn't support 4GB+ filenames in /proc, so unsigned long is too
    much.

    MOV r64, r/m64 is larger than MOV r32, r/m32.

    Link: http://lkml.kernel.org/r/20161029161123.GG1246@avx2
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • "unsigned int" is better on x86_64 because most of the time it
    autoexpands to a 64-bit value, while "int" requires a MOVSX instruction.

    Link: http://lkml.kernel.org/r/20161029160810.GF1246@avx2
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Similar to being able to examine if a process has been correctly
    confined with seccomp, the state of no_new_privs is equally interesting,
    so this adds it to /proc/$pid/status.

    Link: http://lkml.kernel.org/r/20161103214041.GA58566@beast
    Signed-off-by: Kees Cook
    Reviewed-by: Jann Horn
    Cc: Jonathan Corbet
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Konstantin Khlebnikov
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: Rodrigo Freire
    Cc: John Stultz
    Cc: Ross Zwisler
    Cc: Robert Ho
    Cc: Jerome Marchand
    Cc: Andy Lutomirski
    Cc: Johannes Weiner
    Cc: Alexey Dobriyan
    Cc: "Richard W.M. Jones"
    Cc: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • The other pagetable walks in task_mmu.c have a cond_resched() after
    walking their ptes: add a cond_resched() in gather_pte_stats() too, for
    reading /proc/<pid>/numa_maps. Only pagemap_pmd_range() has a
    cond_resched() in its (unusually expensive) pmd_trans_huge case: more
    should probably be added, but leave them unchanged for now.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1612052157400.13021@eggly.anvils
    Signed-off-by: Hugh Dickins
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Gerald Schaefer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

09 Dec, 2016

2 commits


24 Nov, 2016

1 commit


17 Nov, 2016

1 commit

  • Mounting proc in user namespace containers fails if the xenbus
    filesystem is mounted on /proc/xen because this directory fails
    the "permanently empty" test. proc_create_mount_point() exists
    specifically to create such mountpoints in proc but is currently
    proc-internal. Export this interface to modules, then use it in
    xenbus when creating /proc/xen.

    Signed-off-by: Seth Forshee
    Signed-off-by: David Vrabel
    Signed-off-by: Juergen Gross

    Seth Forshee
     

15 Nov, 2016

1 commit

  • Pass the file mode of the proc inode to be created to
    proc_pid_make_inode. In proc_pid_make_inode, initialize inode->i_mode
    before calling security_task_to_inode. This allows selinux to set
    isec->sclass right away without introducing "half-initialized" inode
    security structs.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Paul Moore

    Andreas Gruenbacher
     

04 Nov, 2016

1 commit


28 Oct, 2016

1 commit

  • Reading auxv of any kernel thread results in NULL pointer dereferencing
    in auxv_read() where mm can be NULL. Fix that by checking for NULL mm
    and bailing out early. This is also the original behavior changed by
    recent commit c5317167854e ("proc: switch auxv to use of __mem_open()").

    # cat /proc/2/auxv
    Unable to handle kernel NULL pointer dereference at virtual address 000000a8
    Internal error: Oops: 17 [#1] PREEMPT SMP ARM
    CPU: 3 PID: 113 Comm: cat Not tainted 4.9.0-rc1-ARCH+ #1
    Hardware name: BCM2709
    task: ea3b0b00 task.stack: e99b2000
    PC is at auxv_read+0x24/0x4c
    LR is at do_readv_writev+0x2fc/0x37c
    Process cat (pid: 113, stack limit = 0xe99b2210)
    Call chain:
    auxv_read
    do_readv_writev
    vfs_readv
    default_file_splice_read
    splice_direct_to_actor
    do_splice_direct
    do_sendfile
    SyS_sendfile64
    ret_fast_syscall

    Fixes: c5317167854e ("proc: switch auxv to use of __mem_open()")
    Link: http://lkml.kernel.org/r/1476966200-14457-1-git-send-email-chianglungyu@gmail.com
    Signed-off-by: Leon Yu
    Acked-by: Oleg Nesterov
    Acked-by: Michal Hocko
    Cc: Al Viro
    Cc: Kees Cook
    Cc: John Stultz
    Cc: Mateusz Guzik
    Cc: Janis Danisevskis
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Leon Yu
     

25 Oct, 2016

1 commit

  • Now that Lorenzo cleaned things up and made the FOLL_FORCE users
    explicit, it becomes obvious how some of them don't really need
    FOLL_FORCE at all.

    So remove FOLL_FORCE from the proc code that reads the command line and
    arguments from user space.

    The mem_rw() function actually does want FOLL_FORCE, because gdb (and
    possibly many other debuggers) use it as a much more convenient version
    of PTRACE_PEEKDATA, but we should consider making the FOLL_FORCE part
    conditional on actually being a ptracer. This does not actually do
    that, it just adds a comment to that effect and moves the gup_flags
    settings next to each other.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

23 Oct, 2016

1 commit

  • Pull vmap stack fixes from Ingo Molnar:
    "This is fallout from CONFIG_HAVE_ARCH_VMAP_STACK=y on x86: stack
    accesses that used to be just somewhat questionable are now totally
    buggy.

    These changes try to do it without breaking the ABI: the fields are
    left there, they are just reporting zero, or reporting narrower
    information (the maps file change)"

    * 'mm-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    mm: Change vm_is_stack_for_task() to vm_is_stack_for_current()
    fs/proc: Stop trying to report thread stacks
    fs/proc: Stop reporting eip and esp in /proc/PID/stat
    mm/numa: Remove duplicated include from mprotect.c

    Linus Torvalds
     

20 Oct, 2016

2 commits

  • This reverts more of:

    b76437579d13 ("procfs: mark thread stack correctly in proc/<pid>/maps")

    ... which was partially reverted by:

    65376df58217 ("proc: revert /proc/<pid>/maps [stack:TID] annotation")

    Originally, /proc/PID/task/TID/maps was the same as /proc/TID/maps.

    In current kernels, /proc/PID/maps (or /proc/TID/maps even for
    threads) shows "[stack]" for VMAs in the mm's stack address range.

    In contrast, /proc/PID/task/TID/maps uses KSTK_ESP to guess the
    target thread's stack's VMA. This is racy, probably returns garbage
    and, on arches with CONFIG_THREAD_INFO_IN_TASK=y, is also crash-prone:
    KSTK_ESP is not safe to use on tasks that aren't known to be running
    ordinary process-context kernel code.

    This patch removes the difference and just shows "[stack]" for VMAs
    in the mm's stack range. This is IMO much more sensible -- the
    actual "stack" address really is treated specially by the VM code,
    and the current thread stack isn't even well-defined for programs
    that frequently switch stacks on their own.

    Reported-by: Jann Horn
    Signed-off-by: Andy Lutomirski
    Acked-by: Thomas Gleixner
    Cc: Al Viro
    Cc: Andrew Morton
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Linux API
    Cc: Peter Zijlstra
    Cc: Tycho Andersen
    Link: http://lkml.kernel.org/r/3e678474ec14e0a0ec34c611016753eea2e1b8ba.1475257877.git.luto@kernel.org
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     
  • Reporting these fields on a non-current task is dangerous. If the
    task is in any state other than normal kernel code, they may contain
    garbage or even kernel addresses on some architectures. (x86_64
    used to do this. I bet lots of architectures still do.) With
    CONFIG_THREAD_INFO_IN_TASK=y, it can OOPS, too.

    As far as I know, there are no user programs that make any material
    use of these fields, so just get rid of them.

    Reported-by: Jann Horn
    Signed-off-by: Andy Lutomirski
    Acked-by: Thomas Gleixner
    Cc: Al Viro
    Cc: Andrew Morton
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Linux API
    Cc: Peter Zijlstra
    Cc: Tetsuo Handa
    Cc: Tycho Andersen
    Link: http://lkml.kernel.org/r/a5fed4c3f4e33ed25d4bb03567e329bc5a712bcc.1475257877.git.luto@kernel.org
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     

19 Oct, 2016

1 commit

  • This removes the 'write' argument from access_remote_vm() and replaces
    it with 'gup_flags' as use of this function previously silently implied
    FOLL_FORCE, whereas after this patch callers explicitly pass this flag.

    We make this explicit as use of FOLL_FORCE can result in surprising
    behaviour (and hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     

11 Oct, 2016

3 commits

  • Pull more vfs updates from Al Viro:
    "->rename2() work from Miklos + current_time() from Deepa"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: Replace current_fs_time() with current_time()
    fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
    fs: Replace CURRENT_TIME with current_time() for inode timestamps
    fs: proc: Delete inode time initializations in proc_alloc_inode()
    vfs: Add current_time() api
    vfs: add note about i_op->rename changes to porting
    fs: rename "rename2" i_op to "rename"
    vfs: remove unused i_op->rename
    fs: make remaining filesystems use .rename2
    libfs: support RENAME_NOREPLACE in simple_rename()
    fs: support RENAME_NOREPLACE for local filesystems
    ncpfs: fix unused variable warning

    Linus Torvalds
     
  • Al Viro
     
  • Pull misc vfs updates from Al Viro:
    "Assorted misc bits and pieces.

    There are several single-topic branches left after this (rename2
    series from Miklos, current_time series from Deepa Dinamani, xattr
    series from Andreas, uaccess stuff from me) and I'd prefer to
    send those separately"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits)
    proc: switch auxv to use of __mem_open()
    hpfs: support FIEMAP
    cifs: get rid of unused arguments of CIFSSMBWrite()
    posix_acl: uapi header split
    posix_acl: xattr representation cleanups
    fs/aio.c: eliminate redundant loads in put_aio_ring_file
    fs/internal.h: add const to ns_dentry_operations declaration
    compat: remove compat_printk()
    fs/buffer.c: make __getblk_slow() static
    proc: unsigned file descriptors
    fs/file: more unsigned file descriptors
    fs: compat: remove redundant check of nr_segs
    cachefiles: Fix attempt to read i_blocks after deleting file [ver #2]
    cifs: don't use memcpy() to copy struct iov_iter
    get rid of separate multipage fault-in primitives
    fs: Avoid premature clearing of capabilities
    fs: Give dentry to inode_change_ok() instead of inode
    fuse: Propagate dentry down to inode_change_ok()
    ceph: Propagate dentry down to inode_change_ok()
    xfs: Propagate dentry down to inode_change_ok()
    ...

    Linus Torvalds
     

08 Oct, 2016

9 commits

  • Al Viro
     
  • Current supplementary groups code can massively overallocate memory and
    is implemented in a way so that access to individual gid is done via 2D
    array.

    If number of gids is
    Cc: Vasily Kulikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Recently, Red Hat reported that the nvml test suite failed on QEMU/KVM.
    For more detail, please refer to:

    https://bugzilla.redhat.com/show_bug.cgi?id=1365721

    This bug is not specific to NVDIMM/DAX; it affects any other file
    system as well. This simple test case, abstracted from nvml, easily
    reproduces the bug in a common environment:

    -------------------------- testcase.c -----------------------------

    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Assumed values: the original definitions are missing from this listing. */
    #define NTHREAD 16
    #define PROCMAXLEN 4096

    int
    is_pmem_proc(const void *addr, size_t len)
    {
        const char *caddr = addr;

        FILE *fp;
        if ((fp = fopen("/proc/self/smaps", "r")) == NULL) {
            printf("!/proc/self/smaps");
            return 0;
        }

        int retval = 0; /* assume false until proven otherwise */
        char line[PROCMAXLEN]; /* for fgets() */
        char *lo = NULL; /* beginning of current range in smaps file */
        char *hi = NULL; /* end of current range in smaps file */
        int needmm = 0; /* looking for mm flag for current range */
        while (fgets(line, PROCMAXLEN, fp) != NULL) {
            static const char vmflags[] = "VmFlags:";
            static const char mm[] = " wr";

            /* check for range line */
            if (sscanf(line, "%p-%p", &lo, &hi) == 2) {
                if (needmm) {
                    /* last range matched, but no mm flag found */
                    printf("never found mm flag.\n");
                    break;
                } else if (caddr < lo) {
                    /* never found the range for caddr */
                    printf("#######no match for addr %p.\n", caddr);
                    break;
                } else if (caddr < hi) {
                    /* start address is in this range */
                    size_t rangelen = (size_t)(hi - caddr);

                    /* remember that matching has started */
                    needmm = 1;

                    /* calculate remaining range to search for */
                    if (len > rangelen) {
                        len -= rangelen;
                        caddr += rangelen;
                        printf("matched %zu bytes in range "
                               "%p-%p, %zu left over.\n",
                               rangelen, lo, hi, len);
                    } else {
                        len = 0;
                        printf("matched all bytes in range "
                               "%p-%p.\n", lo, hi);
                    }
                }
            } else if (needmm && strncmp(line, vmflags,
                                         sizeof(vmflags) - 1) == 0) {
                if (strstr(&line[sizeof(vmflags) - 1], mm) != NULL) {
                    printf("mm flag found.\n");
                    if (len == 0) {
                        /* entire range matched */
                        retval = 1;
                        break;
                    }
                    needmm = 0; /* saw what was needed */
                } else {
                    /* mm flag not set for some or all of range */
                    printf("range has no mm flag.\n");
                    break;
                }
            }
        }

        fclose(fp);

        printf("returning %d.\n", retval);
        return retval;
    }

    void *Addr;
    size_t Size;

    /*
     * worker -- the work each thread performs
     */
    static void *
    worker(void *arg)
    {
        int *ret = (int *)arg;
        *ret = is_pmem_proc(Addr, Size);
        return NULL;
    }

    int main(int argc, char *argv[])
    {
        if (argc < 2 || argc > 3) {
            printf("usage: %s file [env].\n", argv[0]);
            return -1;
        }

        int fd = open(argv[1], O_RDWR);

        struct stat stbuf;
        fstat(fd, &stbuf);

        Size = stbuf.st_size;
        Addr = mmap(0, stbuf.st_size, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);

        close(fd);

        pthread_t threads[NTHREAD];
        int ret[NTHREAD];

        /* kick off NTHREAD threads */
        for (int i = 0; i < NTHREAD; i++)
            pthread_create(&threads[i], NULL, worker, &ret[i]);

        /* wait for all the threads to complete */
        for (int i = 0; i < NTHREAD; i++)
            pthread_join(threads[i], NULL);

        /* verify that all the threads return the same value */
        for (int i = 1; i < NTHREAD; i++) {
            if (ret[0] != ret[i]) {
                printf("Error i %d ret[0] = %d ret[i] = %d.\n", i,
                       ret[0], ret[i]);
            }
        }

        printf("%d", ret[0]);
        return 0;
    }

    The test fails because some threads cannot find, in
    "/proc/self/smaps", the memory region that was mapped by the main thread.

    The cause is that procfs uses 'file->version' to record the last VMA
    already handled by a read() system call. When the next read() is issued,
    it uses 'version' to find that VMA and then continues with the VMA that
    follows it. The related code is as follows:

    if (last_addr) {
        vma = find_vma(mm, last_addr);
        if (vma && (vma = m_next_vma(priv, vma)))
            return vma;
    }

    However, a VMA is skipped if the last handled VMA has gone away, e.g.:

    The process VMA list is A->B->C->D

    CPU 0                                   CPU 1
    read() system call
       handle VMA B
       version = B
    return to userspace

                                            unmap VMA B

    issue read() again to continue to get
    the region info
       find_vma(version) will get VMA C
       m_next_vma(C) will get VMA D
       handle D
       !!! VMA C is lost !!!

    To fix this bug, we make 'file->version' record the end address of the
    current VMA. m_start() then looks up a VMA with vm_start < last_vm_end
    and moves on to the next VMA if it finds the same or an overlapping
    VMA. This guarantees that we do not miss an exclusive VMA, but we can
    still miss one if the previous VMA was shrunk. This is acceptable
    because guaranteeing "never miss a VMA" is simply not feasible. Users
    have to cope with some inconsistencies if the file is not read in one
    go.

    [mhocko@suse.com: changelog fixes]
    Link: http://lkml.kernel.org/r/1475296958-27652-1-git-send-email-robert.hu@intel.com
    Acked-by: Dave Hansen
    Signed-off-by: Xiao Guangrong
    Signed-off-by: Robert Hu
    Acked-by: Michal Hocko
    Acked-by: Oleg Nesterov
    Cc: Paolo Bonzini
    Cc: Dan Williams
    Cc: Gleb Natapov
    Cc: Marcelo Tosatti
    Cc: Stefan Hajnoczi
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert Ho
     
  • In changing from checking ptrace_may_access(p, PTRACE_MODE_ATTACH_FSCREDS)
    to capable(CAP_SYS_NICE), I missed that ptrace_may_access() succeeds when
    p == current, but the CAP_SYS_NICE check doesn't.

    Thus while the previous commit was intended to loosen the privileges
    needed to modify a process's timerslack, it needlessly restricted a
    task modifying its own timerslack via /proc/<tid>/timerslack_ns
    (which is also permitted via the PR_SET_TIMERSLACK method).

    This patch corrects this by checking if p == current before checking the
    CAP_SYS_NICE value.

    This patch applies on top of my two previous patches currently in -mm

    Link: http://lkml.kernel.org/r/1471906870-28624-1-git-send-email-john.stultz@linaro.org
    Signed-off-by: John Stultz
    Acked-by: Kees Cook
    Cc: "Serge E. Hallyn"
    Cc: Thomas Gleixner
    Cc: Arjan van de Ven
    Cc: Oren Laadan
    Cc: Ruchi Kandoi
    Cc: Rom Lemarchand
    Cc: Todd Kjos
    Cc: Colin Cross
    Cc: Nick Kralevich
    Cc: Dmitry Shmidt
    Cc: Elliott Hughes
    Cc: Android Kernel Team
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Stultz
     
  • As requested, this patch checks the existing LSM hooks
    task_getscheduler/task_setscheduler when reading or modifying the task's
    timerslack value.

    Previous versions added new get/settimerslack LSM hooks, but since they
    checked the same PROCESS__SET/GETSCHED values as existing hooks, it was
    suggested we just use the existing ones.

    Link: http://lkml.kernel.org/r/1469132667-17377-2-git-send-email-john.stultz@linaro.org
    Signed-off-by: John Stultz
    Cc: Kees Cook
    Cc: "Serge E. Hallyn"
    Cc: Thomas Gleixner
    Cc: Arjan van de Ven
    Cc: Oren Laadan
    Cc: Ruchi Kandoi
    Cc: Rom Lemarchand
    Cc: Todd Kjos
    Cc: Colin Cross
    Cc: Nick Kralevich
    Cc: Dmitry Shmidt
    Cc: Elliott Hughes
    Cc: James Morris
    Cc: Android Kernel Team
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Stultz
     
  • When an interface to allow a task to change another task's timerslack was
    first proposed, it was suggested that something greater than
    CAP_SYS_NICE would be needed, as a task could be delayed further than
    what normally could be done with nice adjustments.

    So CAP_SYS_PTRACE was adopted instead for what became the
    /proc/<tid>/timerslack_ns interface. However, for Android (where this
    feature originates), giving the system_server CAP_SYS_PTRACE would allow
    it to observe and modify all tasks' memory. This is considered too high
    a privilege level for only needing to change the timerslack.

    After some discussion, it was realized that a CAP_SYS_NICE process can
    set a task as SCHED_FIFO, so they could fork some spinning processes and
    set them all SCHED_FIFO 99, in effect delaying all other tasks for an
    infinite amount of time.

    So as a CAP_SYS_NICE task can already cause trouble for other tasks,
    using it as a required capability for accessing and modifying
    /proc/<tid>/timerslack_ns seems sufficient.

    Thus, this patch loosens the capability requirements to CAP_SYS_NICE and
    removes CAP_SYS_PTRACE, simplifying some of the code flow as well.

    This is technically an ABI change, but as the feature just landed in
    4.6, I suspect no one is yet using it.

    Link: http://lkml.kernel.org/r/1469132667-17377-1-git-send-email-john.stultz@linaro.org
    Signed-off-by: John Stultz
    Reviewed-by: Nick Kralevich
    Acked-by: Serge Hallyn
    Acked-by: Kees Cook
    Cc: Kees Cook
    Cc: "Serge E. Hallyn"
    Cc: Thomas Gleixner
    Cc: Arjan van de Ven
    Cc: Oren Laadan
    Cc: Ruchi Kandoi
    Cc: Rom Lemarchand
    Cc: Todd Kjos
    Cc: Colin Cross
    Cc: Nick Kralevich
    Cc: Dmitry Shmidt
    Cc: Elliott Hughes
    Cc: Android Kernel Team
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Stultz
     
  • Use a specific routine to emit most lines so that the code is easier to
    read and maintain.

    akpm:
       text    data     bss     dec     hex filename
       2976       8       0    2984     ba8 fs/proc/meminfo.o before
       2669       8       0    2677     a75 fs/proc/meminfo.o after

    Link: http://lkml.kernel.org/r/8fce7fdef2ba081a4ef531594e97da8a9feebb58.1470810406.git.joe@perches.com
    Signed-off-by: Joe Perches
    Cc: Andi Kleen
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Allow some seq_puts removals by taking a string instead of a single
    char.

    [akpm@linux-foundation.org: update vmstat_show(), per Joe]
    Link: http://lkml.kernel.org/r/667e1cf3d436de91a5698170a1e98d882905e956.1470704995.git.joe@perches.com
    Signed-off-by: Joe Perches
    Cc: Joe Perches
    Cc: Andi Kleen
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • top(1) opens the following files for every PID:

    /proc/*/stat
    /proc/*/statm
    /proc/*/status

    This patch switches /proc/*/status away from seq_printf().
    The result is 13.5% speedup.

    Benchmark is open("/proc/self/status")+read+close 1,000,000 times.

    BEFORE
    $ perf stat -r 10 taskset -c 3 ./proc-self-status

    Performance counter stats for 'taskset -c 3 ./proc-self-status' (10 runs):

    10748.474301 task-clock (msec) # 0.954 CPUs utilized ( +- 0.91% )
    12 context-switches # 0.001 K/sec ( +- 1.09% )
    1 cpu-migrations # 0.000 K/sec
    104 page-faults # 0.010 K/sec ( +- 0.45% )
    37,424,127,876 cycles # 3.482 GHz ( +- 0.04% )
    8,453,010,029 stalled-cycles-frontend # 22.59% frontend cycles idle ( +- 0.12% )
    3,747,609,427 stalled-cycles-backend # 10.01% backend cycles idle ( +- 0.68% )
    65,632,764,147 instructions # 1.75 insn per cycle
    # 0.13 stalled cycles per insn ( +- 0.00% )
    13,981,324,775 branches # 1300.773 M/sec ( +- 0.00% )
    138,967,110 branch-misses # 0.99% of all branches ( +- 0.18% )

    11.263885428 seconds time elapsed ( +- 0.04% )
    ^^^^^^^^^^^^

    AFTER
    $ perf stat -r 10 taskset -c 3 ./proc-self-status

    Performance counter stats for 'taskset -c 3 ./proc-self-status' (10 runs):

    9010.521776 task-clock (msec) # 0.925 CPUs utilized ( +- 1.54% )
    11 context-switches # 0.001 K/sec ( +- 1.54% )
    1 cpu-migrations # 0.000 K/sec ( +- 11.11% )
    103 page-faults # 0.011 K/sec ( +- 0.60% )
    32,352,310,603 cycles # 3.591 GHz ( +- 0.07% )
    7,849,199,578 stalled-cycles-frontend # 24.26% frontend cycles idle ( +- 0.27% )
    3,269,738,842 stalled-cycles-backend # 10.11% backend cycles idle ( +- 0.73% )
    56,012,163,567 instructions # 1.73 insn per cycle
    # 0.14 stalled cycles per insn ( +- 0.00% )
    11,735,778,795 branches # 1302.453 M/sec ( +- 0.00% )
    98,084,459 branch-misses # 0.84% of all branches ( +- 0.28% )

    9.741247736 seconds time elapsed ( +- 0.07% )
    ^^^^^^^^^^^

    Link: http://lkml.kernel.org/r/20160806125608.GB1187@p183.telecom.by
    Signed-off-by: Alexey Dobriyan
    Cc: Joe Perches
    Cc: Andi Kleen
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan