20 Jan, 2017

1 commit

  • commit 93362fa47fe98b62e4a34ab408c4a418432e7939 upstream.

    Fixes CVE-2016-9191: proc_sys_readdir() doesn't drop the reference
    added by grab_header() when it returns from the !dir_emit_dots() path.
    As a result, any path that calls unregister_sysctl_table() can
    wait forever.

    The calltrace of CVE-2016-9191:

    [ 5535.960522] Call Trace:
    [ 5535.963265] schedule+0x3f/0xa0
    [ 5535.968817] schedule_timeout+0x3db/0x6f0
    [ 5535.975346] ? wait_for_completion+0x45/0x130
    [ 5535.982256] wait_for_completion+0xc3/0x130
    [ 5535.988972] ? wake_up_q+0x80/0x80
    [ 5535.994804] drop_sysctl_table+0xc4/0xe0
    [ 5536.001227] drop_sysctl_table+0x77/0xe0
    [ 5536.007648] unregister_sysctl_table+0x4d/0xa0
    [ 5536.014654] unregister_sysctl_table+0x7f/0xa0
    [ 5536.021657] unregister_sched_domain_sysctl+0x15/0x40
    [ 5536.029344] partition_sched_domains+0x44/0x450
    [ 5536.036447] ? __mutex_unlock_slowpath+0x111/0x1f0
    [ 5536.043844] rebuild_sched_domains_locked+0x64/0xb0
    [ 5536.051336] update_flag+0x11d/0x210
    [ 5536.057373] ? mutex_lock_nested+0x2df/0x450
    [ 5536.064186] ? cpuset_css_offline+0x1b/0x60
    [ 5536.070899] ? trace_hardirqs_on+0xd/0x10
    [ 5536.077420] ? mutex_lock_nested+0x2df/0x450
    [ 5536.084234] ? css_killed_work_fn+0x25/0x220
    [ 5536.091049] cpuset_css_offline+0x35/0x60
    [ 5536.097571] css_killed_work_fn+0x5c/0x220
    [ 5536.104207] process_one_work+0x1df/0x710
    [ 5536.110736] ? process_one_work+0x160/0x710
    [ 5536.117461] worker_thread+0x12b/0x4a0
    [ 5536.123697] ? process_one_work+0x710/0x710
    [ 5536.130426] kthread+0xfe/0x120
    [ 5536.135991] ret_from_fork+0x1f/0x40
    [ 5536.142041] ? kthread_create_on_node+0x230/0x230

    One cgroup maintainer mentioned that "cgroup is trying to offline
    a cpuset css, which takes place under cgroup_mutex. The offlining
    ends up trying to drain active usages of a sysctl table which apparently
    is not happening."
    The real reason is that proc_sys_readdir() doesn't drop the reference added
    by grab_header() when it returns from the !dir_emit_dots() path. So this
    cpuset offline path will wait here forever.

    See here for details: http://www.openwall.com/lists/oss-security/2016/11/04/13

    Fixes: f0c3b5093add ("[readdir] convert procfs")
    Reported-by: CAI Qian
    Tested-by: Yang Shukui
    Signed-off-by: Zhou Chengming
    Acked-by: Al Viro
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Greg Kroah-Hartman

    Zhou Chengming
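
The reference leak described in the CVE-2016-9191 fix above can be modelled in user space. This is only a minimal sketch under stated assumptions: `header_model`, `grab_header_model()`, `drop_header_model()` and both `readdir_*` functions are hypothetical stand-ins, not the kernel's sysctl code. It shows why an early return that skips the drop leaves the use count stuck above zero, which is the condition an unregister path would wait on.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a sysctl table header; only the use count matters. */
struct header_model {
    int used; /* active references; an unregister path waits for this to reach 0 */
};

static void grab_header_model(struct header_model *h) { h->used++; }
static void drop_header_model(struct header_model *h) { h->used--; }

/* Buggy shape: the early return skips the drop, so the reference leaks. */
static int readdir_buggy(struct header_model *h, bool emit_dots_ok)
{
    grab_header_model(h);
    if (!emit_dots_ok)
        return 0; /* reference leaked here */
    /* ... emit directory entries ... */
    drop_header_model(h);
    return 0;
}

/* Fixed shape: every exit path goes through the same drop. */
static int readdir_fixed(struct header_model *h, bool emit_dots_ok)
{
    grab_header_model(h);
    if (!emit_dots_ok)
        goto out;
    /* ... emit directory entries ... */
out:
    drop_header_model(h);
    return 0;
}
```

Calling `readdir_buggy()` with `emit_dots_ok == false` leaves `used` at 1, while `readdir_fixed()` always returns it to 0.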
     

28 Oct, 2016

1 commit

  • Reading auxv of any kernel thread results in NULL pointer dereferencing
    in auxv_read() where mm can be NULL. Fix that by checking for NULL mm
    and bailing out early. This is also the original behavior changed by
    recent commit c5317167854e ("proc: switch auxv to use of __mem_open()").

    # cat /proc/2/auxv
    Unable to handle kernel NULL pointer dereference at virtual address 000000a8
    Internal error: Oops: 17 [#1] PREEMPT SMP ARM
    CPU: 3 PID: 113 Comm: cat Not tainted 4.9.0-rc1-ARCH+ #1
    Hardware name: BCM2709
    task: ea3b0b00 task.stack: e99b2000
    PC is at auxv_read+0x24/0x4c
    LR is at do_readv_writev+0x2fc/0x37c
    Process cat (pid: 113, stack limit = 0xe99b2210)
    Call chain:
    auxv_read
    do_readv_writev
    vfs_readv
    default_file_splice_read
    splice_direct_to_actor
    do_splice_direct
    do_sendfile
    SyS_sendfile64
    ret_fast_syscall

    Fixes: c5317167854e ("proc: switch auxv to use of __mem_open()")
    Link: http://lkml.kernel.org/r/1476966200-14457-1-git-send-email-chianglungyu@gmail.com
    Signed-off-by: Leon Yu
    Acked-by: Oleg Nesterov
    Acked-by: Michal Hocko
    Cc: Al Viro
    Cc: Kees Cook
    Cc: John Stultz
    Cc: Mateusz Guzik
    Cc: Janis Danisevskis
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Leon Yu
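
The early bail-out described above can be sketched as follows. This is a user-space model, not the kernel's auxv_read(): `mm_model`, `AUXV_WORDS` and `auxv_read_model()` are hypothetical names, and the copy logic is simplified. A NULL `mm` stands for a kernel thread, which has no address space.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define AUXV_WORDS 4 /* illustrative size, not the kernel's vector size */

/* Hypothetical stand-in for the part of the mm that holds the saved auxv. */
struct mm_model {
    unsigned long saved_auxv[AUXV_WORDS];
};

/* Returns the number of bytes copied; 0 when there is no mm to read from. */
static size_t auxv_read_model(const struct mm_model *mm, void *buf, size_t count)
{
    size_t avail;
    size_t n;

    if (!mm)
        return 0; /* bail out early instead of dereferencing NULL */

    avail = sizeof(mm->saved_auxv);
    n = count < avail ? count : avail;
    memcpy(buf, mm->saved_auxv, n);
    return n;
}
```

With this shape, a read on a task without an mm simply yields an empty result rather than faulting.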
     

25 Oct, 2016

1 commit

  • Now that Lorenzo cleaned things up and made the FOLL_FORCE users
    explicit, it becomes obvious how some of them don't really need
    FOLL_FORCE at all.

    So remove FOLL_FORCE from the proc code that reads the command line and
    arguments from user space.

    The mem_rw() function actually does want FOLL_FORCE, because gdb (and
    possibly many other debuggers) use it as a much more convenient version
    of PTRACE_PEEKDATA, but we should consider making the FOLL_FORCE part
    conditional on actually being a ptracer. This does not actually do
    that, just adds a comment to that effect and moves the gup_flags
    settings next to each other.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

23 Oct, 2016

1 commit

  • Pull vmap stack fixes from Ingo Molnar:
    "This is fallout from CONFIG_HAVE_ARCH_VMAP_STACK=y on x86: stack
    accesses that used to be just somewhat questionable are now totally
    buggy.

    These changes try to do it without breaking the ABI: the fields are
    left there, they are just reporting zero, or reporting narrower
    information (the maps file change)"

    * 'mm-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    mm: Change vm_is_stack_for_task() to vm_is_stack_for_current()
    fs/proc: Stop trying to report thread stacks
    fs/proc: Stop reporting eip and esp in /proc/PID/stat
    mm/numa: Remove duplicated include from mprotect.c

    Linus Torvalds
     

20 Oct, 2016

2 commits

  • This reverts more of:

    b76437579d13 ("procfs: mark thread stack correctly in proc/<pid>/maps")

    ... which was partially reverted by:

    65376df58217 ("proc: revert /proc/<pid>/maps [stack:TID] annotation")

    Originally, /proc/PID/task/TID/maps was the same as /proc/TID/maps.

    In current kernels, /proc/PID/maps (or /proc/TID/maps even for
    threads) shows "[stack]" for VMAs in the mm's stack address range.

    In contrast, /proc/PID/task/TID/maps uses KSTK_ESP to guess the
    target thread's stack's VMA. This is racy, probably returns garbage
    and, on arches with CONFIG_THREAD_INFO_IN_TASK=y, is also crash-prone:
    KSTK_ESP is not safe to use on tasks that aren't known to be running
    ordinary process-context kernel code.

    This patch removes the difference and just shows "[stack]" for VMAs
    in the mm's stack range. This is IMO much more sensible -- the
    actual "stack" address really is treated specially by the VM code,
    and the current thread stack isn't even well-defined for programs
    that frequently switch stacks on their own.

    Reported-by: Jann Horn
    Signed-off-by: Andy Lutomirski
    Acked-by: Thomas Gleixner
    Cc: Al Viro
    Cc: Andrew Morton
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Linux API
    Cc: Peter Zijlstra
    Cc: Tycho Andersen
    Link: http://lkml.kernel.org/r/3e678474ec14e0a0ec34c611016753eea2e1b8ba.1475257877.git.luto@kernel.org
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     
  • Reporting these fields on a non-current task is dangerous. If the
    task is in any state other than normal kernel code, they may contain
    garbage or even kernel addresses on some architectures. (x86_64
    used to do this. I bet lots of architectures still do.) With
    CONFIG_THREAD_INFO_IN_TASK=y, it can OOPS, too.

    As far as I know, there are no user programs that make any material
    use of these fields, so just get rid of them.

    Reported-by: Jann Horn
    Signed-off-by: Andy Lutomirski
    Acked-by: Thomas Gleixner
    Cc: Al Viro
    Cc: Andrew Morton
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Linux API
    Cc: Peter Zijlstra
    Cc: Tetsuo Handa
    Cc: Tycho Andersen
    Link: http://lkml.kernel.org/r/a5fed4c3f4e33ed25d4bb03567e329bc5a712bcc.1475257877.git.luto@kernel.org
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     

19 Oct, 2016

1 commit

  • This removes the 'write' argument from access_remote_vm() and replaces
    it with 'gup_flags' as use of this function previously silently implied
    FOLL_FORCE, whereas after this patch callers explicitly pass this flag.

    We make this explicit as use of FOLL_FORCE can result in surprising
    behaviour (and hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
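
The difference between an implied and an explicit FOLL_FORCE, as described above, can be illustrated with a small model. This is a sketch only: the `MODEL_FOLL_*` bit values and both helper functions are hypothetical and are not taken from kernel headers.

```c
#include <assert.h>

/* Illustrative flag bits for this sketch; not taken from kernel headers. */
#define MODEL_FOLL_WRITE 0x01u
#define MODEL_FOLL_FORCE 0x10u

/* Old convention: a boolean 'write' argument silently implied FORCE. */
static unsigned int flags_from_write_arg(int write)
{
    unsigned int flags = MODEL_FOLL_FORCE;

    if (write)
        flags |= MODEL_FOLL_WRITE;
    return flags;
}

/* New convention: nothing is implied; the caller names every flag it wants. */
static unsigned int explicit_flags(int want_write, int want_force)
{
    unsigned int flags = 0;

    if (want_write)
        flags |= MODEL_FOLL_WRITE;
    if (want_force)
        flags |= MODEL_FOLL_FORCE;
    return flags;
}
```

Under the old convention even a plain read carried the FORCE bit; under the new one, a caller that does not ask for it does not get it.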
     

11 Oct, 2016

3 commits

  • Pull more vfs updates from Al Viro:
    "->rename2() work from Miklos + current_time() from Deepa"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: Replace current_fs_time() with current_time()
    fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
    fs: Replace CURRENT_TIME with current_time() for inode timestamps
    fs: proc: Delete inode time initializations in proc_alloc_inode()
    vfs: Add current_time() api
    vfs: add note about i_op->rename changes to porting
    fs: rename "rename2" i_op to "rename"
    vfs: remove unused i_op->rename
    fs: make remaining filesystems use .rename2
    libfs: support RENAME_NOREPLACE in simple_rename()
    fs: support RENAME_NOREPLACE for local filesystems
    ncpfs: fix unused variable warning

    Linus Torvalds
     
  • Al Viro
     
  • Pull misc vfs updates from Al Viro:
    "Assorted misc bits and pieces.

    There are several single-topic branches left after this (rename2
    series from Miklos, current_time series from Deepa Dinamani, xattr
    series from Andreas, uaccess stuff from me) and I'd prefer to
    send those separately"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits)
    proc: switch auxv to use of __mem_open()
    hpfs: support FIEMAP
    cifs: get rid of unused arguments of CIFSSMBWrite()
    posix_acl: uapi header split
    posix_acl: xattr representation cleanups
    fs/aio.c: eliminate redundant loads in put_aio_ring_file
    fs/internal.h: add const to ns_dentry_operations declaration
    compat: remove compat_printk()
    fs/buffer.c: make __getblk_slow() static
    proc: unsigned file descriptors
    fs/file: more unsigned file descriptors
    fs: compat: remove redundant check of nr_segs
    cachefiles: Fix attempt to read i_blocks after deleting file [ver #2]
    cifs: don't use memcpy() to copy struct iov_iter
    get rid of separate multipage fault-in primitives
    fs: Avoid premature clearing of capabilities
    fs: Give dentry to inode_change_ok() instead of inode
    fuse: Propagate dentry down to inode_change_ok()
    ceph: Propagate dentry down to inode_change_ok()
    xfs: Propagate dentry down to inode_change_ok()
    ...

    Linus Torvalds
     

08 Oct, 2016

10 commits

  • Al Viro
     
  • Current supplementary groups code can massively overallocate memory and
    is implemented in a way so that access to individual gid is done via 2D
    array.

    If number of gids is
    Cc: Vasily Kulikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Recently, Red Hat reported that the nvml test suite failed on QEMU/KVM;
    for more details, please refer to:

    https://bugzilla.redhat.com/show_bug.cgi?id=1365721

    Actually, this bug affects not only NVDIMM/DAX but also any other
    file system. This simple test case, abstracted from nvml, can easily
    reproduce the bug in a common environment:

    -------------------------- testcase.c -----------------------------

    /*
     * Assumption: the preprocessor lines of the original test case are
     * missing from this changelog. The headers below are what the code
     * needs to build; the two constants are illustrative values only.
     */
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define PROCMAXLEN 4096 /* assumed line-buffer size */
    #define NTHREAD 16      /* assumed thread count */

    int
    is_pmem_proc(const void *addr, size_t len)
    {
        const char *caddr = addr;

        FILE *fp;
        if ((fp = fopen("/proc/self/smaps", "r")) == NULL) {
            printf("!/proc/self/smaps");
            return 0;
        }

        int retval = 0;         /* assume false until proven otherwise */
        char line[PROCMAXLEN];  /* for fgets() */
        char *lo = NULL;        /* beginning of current range in smaps file */
        char *hi = NULL;        /* end of current range in smaps file */
        int needmm = 0;         /* looking for mm flag for current range */
        while (fgets(line, PROCMAXLEN, fp) != NULL) {
            static const char vmflags[] = "VmFlags:";
            static const char mm[] = " wr";

            /* check for range line */
            if (sscanf(line, "%p-%p", &lo, &hi) == 2) {
                if (needmm) {
                    /* last range matched, but no mm flag found */
                    printf("never found mm flag.\n");
                    break;
                } else if (caddr < lo) {
                    /* never found the range for caddr */
                    printf("#######no match for addr %p.\n", caddr);
                    break;
                } else if (caddr < hi) {
                    /* start address is in this range */
                    size_t rangelen = (size_t)(hi - caddr);

                    /* remember that matching has started */
                    needmm = 1;

                    /* calculate remaining range to search for */
                    if (len > rangelen) {
                        len -= rangelen;
                        caddr += rangelen;
                        printf("matched %zu bytes in range "
                               "%p-%p, %zu left over.\n",
                               rangelen, lo, hi, len);
                    } else {
                        len = 0;
                        printf("matched all bytes in range "
                               "%p-%p.\n", lo, hi);
                    }
                }
            } else if (needmm && strncmp(line, vmflags,
                                         sizeof(vmflags) - 1) == 0) {
                if (strstr(&line[sizeof(vmflags) - 1], mm) != NULL) {
                    printf("mm flag found.\n");
                    if (len == 0) {
                        /* entire range matched */
                        retval = 1;
                        break;
                    }
                    needmm = 0; /* saw what was needed */
                } else {
                    /* mm flag not set for some or all of range */
                    printf("range has no mm flag.\n");
                    break;
                }
            }
        }

        fclose(fp);

        printf("returning %d.\n", retval);
        return retval;
    }

    void *Addr;
    size_t Size;

    /*
     * worker -- the work each thread performs
     */
    static void *
    worker(void *arg)
    {
        int *ret = (int *)arg;
        *ret = is_pmem_proc(Addr, Size);
        return NULL;
    }

    int main(int argc, char *argv[])
    {
        if (argc < 2 || argc > 3) {
            printf("usage: %s file [env].\n", argv[0]);
            return -1;
        }

        int fd = open(argv[1], O_RDWR);

        struct stat stbuf;
        fstat(fd, &stbuf);

        Size = stbuf.st_size;
        Addr = mmap(0, stbuf.st_size, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);

        close(fd);

        pthread_t threads[NTHREAD];
        int ret[NTHREAD];

        /* kick off NTHREAD threads */
        for (int i = 0; i < NTHREAD; i++)
            pthread_create(&threads[i], NULL, worker, &ret[i]);

        /* wait for all the threads to complete */
        for (int i = 0; i < NTHREAD; i++)
            pthread_join(threads[i], NULL);

        /* verify that all the threads return the same value */
        for (int i = 1; i < NTHREAD; i++) {
            if (ret[0] != ret[i]) {
                printf("Error i %d ret[0] = %d ret[i] = %d.\n", i,
                       ret[0], ret[i]);
            }
        }

        printf("%d", ret[0]);
        return 0;
    }

    It fails because some threads cannot find, in "/proc/self/smaps", the
    memory region that was allocated in the main process.

    This is caused by procfs, which uses 'file->version' to record the last
    VMA that has already been handled by the read() system call. When the
    next read() is issued, it uses 'version' to find that VMA and then
    continues with the VMA after it. The related code is as follows:

    if (last_addr) {
    vma = find_vma(mm, last_addr);
    if (vma && (vma = m_next_vma(priv, vma)))
    return vma;
    }

    However, a VMA will be lost if the last handled VMA is gone, e.g.:

    The process VMA list is A->B->C->D

    CPU 0                                       CPU 1
    read() system call
       handle VMA B
       version = B
    return to userspace

                                                unmap VMA B

    issue read() again to continue to get
    the region info
       find_vma(version) will get VMA C
       m_next_vma(C) will get VMA D
       handle D
       !!! VMA C is lost !!!

    In order to fix this bug, we make 'file->version' indicate the end address
    of the current VMA. m_start() will then look up a VMA with vma_start
    < last_vm_end and move on to the next VMA if it finds the same or an
    overlapping VMA. This guarantees that we will not miss an exclusive
    VMA, but we can still miss one if the previous VMA was shrunk. This is
    acceptable because guaranteeing "never miss a vma" is simply not feasible.
    Users have to cope with some inconsistencies if the file is not read in
    one go.

    [mhocko@suse.com: changelog fixes]
    Link: http://lkml.kernel.org/r/1475296958-27652-1-git-send-email-robert.hu@intel.com
    Acked-by: Dave Hansen
    Signed-off-by: Xiao Guangrong
    Signed-off-by: Robert Hu
    Acked-by: Michal Hocko
    Acked-by: Oleg Nesterov
    Cc: Paolo Bonzini
    Cc: Dan Williams
    Cc: Gleb Natapov
    Cc: Marcelo Tosatti
    Cc: Stefan Hajnoczi
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert Ho
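
The lost-VMA scenario and the end-address scheme described above can be reproduced with a small user-space model. This is a sketch that follows the wording of the changelog, not the kernel source: `vma_model`, `find_vma_model()`, `resume_old()` and `resume_new()` are hypothetical names, and regions are kept in a sorted array rather than the kernel's data structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical region descriptor covering [start, end). */
struct vma_model {
    unsigned long start;
    unsigned long end;
};

/* Index of the first region whose end lies above addr, or -1 if none. */
static int find_vma_model(const struct vma_model *v, size_t n, unsigned long addr)
{
    for (size_t i = 0; i < n; i++)
        if (v[i].end > addr)
            return (int)i;
    return -1;
}

/* Old scheme: 'last' identifies the region handled last by its start address. */
static int resume_old(const struct vma_model *v, size_t n, unsigned long last)
{
    int i = find_vma_model(v, n, last);

    if (i < 0 || (size_t)i + 1 >= n)
        return -1;
    return i + 1; /* always steps past whatever the lookup returned */
}

/* New scheme: 'last_end' is the end address of the region handled last. */
static int resume_new(const struct vma_model *v, size_t n, unsigned long last_end)
{
    int i = find_vma_model(v, n, last_end - 1);

    if (i < 0)
        return -1;
    if (v[i].start < last_end) /* same or overlapping region: skip it */
        i++;
    return (size_t)i < n ? i : -1;
}
```

With regions A, B, C, D and B unmapped between two reads, `resume_old()` lands on D and skips C, whereas `resume_new()` resumes at C.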
     
  • In changing from checking ptrace_may_access(p, PTRACE_MODE_ATTACH_FSCREDS)
    to capable(CAP_SYS_NICE), I missed that ptrace_may_access() succeeds when p
    == current, but the CAP_SYS_NICE check doesn't.

    Thus while the previous commit was intended to loosen the needed
    privileges to modify a process's timerslack, it needlessly restricted a
    task modifying its own timerslack via /proc/<tid>/timerslack_ns
    (which is also permitted via the PR_SET_TIMERSLACK method).

    This patch corrects this by checking if p == current before checking the
    CAP_SYS_NICE value.

    This patch applies on top of my two previous patches currently in -mm

    Link: http://lkml.kernel.org/r/1471906870-28624-1-git-send-email-john.stultz@linaro.org
    Signed-off-by: John Stultz
    Acked-by: Kees Cook
    Cc: "Serge E. Hallyn"
    Cc: Thomas Gleixner
    Cc: Arjan van de Ven
    Cc: Oren Laadan
    Cc: Ruchi Kandoi
    Cc: Rom Lemarchand
    Cc: Todd Kjos
    Cc: Colin Cross
    Cc: Nick Kralevich
    Cc: Dmitry Shmidt
    Cc: Elliott Hughes
    Cc: Android Kernel Team
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Stultz
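
The ordering of the two checks described above can be sketched as a plain predicate. This is a user-space model under stated assumptions: `task_model` and `timerslack_may_access()` are hypothetical, and the capability is passed in as a boolean instead of calling capable().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical task handle; only pointer identity is used. */
struct task_model {
    int id;
};

/*
 * A task may always touch its own timerslack; for any other task the
 * caller needs the (modelled) CAP_SYS_NICE capability.
 */
static int timerslack_may_access(const struct task_model *p,
                                 const struct task_model *current_task,
                                 bool has_cap_sys_nice)
{
    if (p != current_task && !has_cap_sys_nice)
        return -EPERM;
    return 0;
}
```

Checking `p != current_task` first is what lets an unprivileged task adjust its own value.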
     
  • As requested, this patch checks the existing LSM hooks
    task_getscheduler/task_setscheduler when reading or modifying the task's
    timerslack value.

    Previous versions added new get/settimerslack LSM hooks, but since they
    checked the same PROCESS__SET/GETSCHED values as existing hooks, it was
    suggested we just use the existing ones.

    Link: http://lkml.kernel.org/r/1469132667-17377-2-git-send-email-john.stultz@linaro.org
    Signed-off-by: John Stultz
    Cc: Kees Cook
    Cc: "Serge E. Hallyn"
    Cc: Thomas Gleixner
    Cc: Arjan van de Ven
    Cc: Oren Laadan
    Cc: Ruchi Kandoi
    Cc: Rom Lemarchand
    Cc: Todd Kjos
    Cc: Colin Cross
    Cc: Nick Kralevich
    Cc: Dmitry Shmidt
    Cc: Elliott Hughes
    Cc: James Morris
    Cc: Android Kernel Team
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Stultz
     
  • When an interface to allow a task to change another task's timerslack was
    first proposed, it was suggested that something greater than
    CAP_SYS_NICE would be needed, as a task could be delayed further than
    what normally could be done with nice adjustments.

    So CAP_SYS_PTRACE was adopted instead for what became the
    /proc/<tid>/timerslack_ns interface. However, for Android (where this
    feature originates), giving the system_server CAP_SYS_PTRACE would allow
    it to observe and modify all tasks' memory. This is considered too high
    a privilege level for only needing to change the timerslack.

    After some discussion, it was realized that a CAP_SYS_NICE process can
    set a task as SCHED_FIFO, so they could fork some spinning processes and
    set them all SCHED_FIFO 99, in effect delaying all other tasks for an
    infinite amount of time.

    So as a CAP_SYS_NICE task can already cause trouble for other tasks,
    using it as a required capability for accessing and modifying
    /proc/<tid>/timerslack_ns seems sufficient.

    Thus, this patch loosens the capability requirements to CAP_SYS_NICE and
    removes CAP_SYS_PTRACE, simplifying some of the code flow as well.

    This is technically an ABI change, but as the feature just landed in
    4.6, I suspect no one is yet using it.

    Link: http://lkml.kernel.org/r/1469132667-17377-1-git-send-email-john.stultz@linaro.org
    Signed-off-by: John Stultz
    Reviewed-by: Nick Kralevich
    Acked-by: Serge Hallyn
    Acked-by: Kees Cook
    Cc: Kees Cook
    Cc: "Serge E. Hallyn"
    Cc: Thomas Gleixner
    Cc: Arjan van de Ven
    Cc: Oren Laadan
    Cc: Ruchi Kandoi
    Cc: Rom Lemarchand
    Cc: Todd Kjos
    Cc: Colin Cross
    Cc: Nick Kralevich
    Cc: Dmitry Shmidt
    Cc: Elliott Hughes
    Cc: Android Kernel Team
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Stultz
     
  • Use a specific routine to emit most lines so that the code is easier to
    read and maintain.

    akpm:
    text data bss dec hex filename
    2976 8 0 2984 ba8 fs/proc/meminfo.o before
    2669 8 0 2677 a75 fs/proc/meminfo.o after

    Link: http://lkml.kernel.org/r/8fce7fdef2ba081a4ef531594e97da8a9feebb58.1470810406.git.joe@perches.com
    Signed-off-by: Joe Perches
    Cc: Andi Kleen
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
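
The idea of routing most output lines through one helper, as described above, can be sketched in user space. The helper name `show_val_kb_model()`, its buffer-based signature and the exact column width are assumptions made for illustration; this is not the kernel's meminfo code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Appends "<label><value right-aligned in 8 columns> kB\n" at offset len.
 * Returns the new length, or the old length if the line did not fit.
 * The caller must ensure len < size.
 */
static size_t show_val_kb_model(char *buf, size_t size, size_t len,
                                const char *label, unsigned long kb)
{
    int n = snprintf(buf + len, size - len, "%s%8lu kB\n", label, kb);

    if (n < 0 || (size_t)n >= size - len)
        return len; /* line did not fit: report the old length */
    return len + (size_t)n;
}
```

Each caller then shrinks to a single line per field, which is where the readability gain comes from.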
     
  • Allow some seq_puts removals by taking a string instead of a single
    char.

    [akpm@linux-foundation.org: update vmstat_show(), per Joe]
    Link: http://lkml.kernel.org/r/667e1cf3d436de91a5698170a1e98d882905e956.1470704995.git.joe@perches.com
    Signed-off-by: Joe Perches
    Cc: Joe Perches
    Cc: Andi Kleen
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • top(1) opens the following files for every PID:

    /proc/*/stat
    /proc/*/statm
    /proc/*/status

    This patch switches /proc/*/status away from seq_printf().
    The result is 13.5% speedup.

    Benchmark is open("/proc/self/status")+read+close 1,000,000 times.

    BEFORE
    $ perf stat -r 10 taskset -c 3 ./proc-self-status

    Performance counter stats for 'taskset -c 3 ./proc-self-status' (10 runs):

    10748.474301 task-clock (msec) # 0.954 CPUs utilized ( +- 0.91% )
    12 context-switches # 0.001 K/sec ( +- 1.09% )
    1 cpu-migrations # 0.000 K/sec
    104 page-faults # 0.010 K/sec ( +- 0.45% )
    37,424,127,876 cycles # 3.482 GHz ( +- 0.04% )
    8,453,010,029 stalled-cycles-frontend # 22.59% frontend cycles idle ( +- 0.12% )
    3,747,609,427 stalled-cycles-backend # 10.01% backend cycles idle ( +- 0.68% )
    65,632,764,147 instructions # 1.75 insn per cycle
    # 0.13 stalled cycles per insn ( +- 0.00% )
    13,981,324,775 branches # 1300.773 M/sec ( +- 0.00% )
    138,967,110 branch-misses # 0.99% of all branches ( +- 0.18% )

    11.263885428 seconds time elapsed ( +- 0.04% )
    ^^^^^^^^^^^^

    AFTER
    $ perf stat -r 10 taskset -c 3 ./proc-self-status

    Performance counter stats for 'taskset -c 3 ./proc-self-status' (10 runs):

    9010.521776 task-clock (msec) # 0.925 CPUs utilized ( +- 1.54% )
    11 context-switches # 0.001 K/sec ( +- 1.54% )
    1 cpu-migrations # 0.000 K/sec ( +- 11.11% )
    103 page-faults # 0.011 K/sec ( +- 0.60% )
    32,352,310,603 cycles # 3.591 GHz ( +- 0.07% )
    7,849,199,578 stalled-cycles-frontend # 24.26% frontend cycles idle ( +- 0.27% )
    3,269,738,842 stalled-cycles-backend # 10.11% backend cycles idle ( +- 0.73% )
    56,012,163,567 instructions # 1.73 insn per cycle
    # 0.14 stalled cycles per insn ( +- 0.00% )
    11,735,778,795 branches # 1302.453 M/sec ( +- 0.00% )
    98,084,459 branch-misses # 0.84% of all branches ( +- 0.28% )

    9.741247736 seconds time elapsed ( +- 0.07% )
    ^^^^^^^^^^^

    Link: http://lkml.kernel.org/r/20160806125608.GB1187@p183.telecom.by
    Signed-off-by: Alexey Dobriyan
    Cc: Joe Perches
    Cc: Andi Kleen
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
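
The benchmark source is not part of the changelog above. A loop of the kind it describes (open, read, close, repeated) might look like the following sketch; the function name `bench_open_read_close()`, the buffer size and the error handling are assumptions.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Performs 'iterations' rounds of open + read + close on 'path'.
 * Returns the number of completed rounds, or -1 on the first failure.
 */
static long bench_open_read_close(const char *path, long iterations)
{
    char buf[4096];

    for (long i = 0; i < iterations; i++) {
        int fd = open(path, O_RDONLY);
        ssize_t n;

        if (fd < 0)
            return -1;
        n = read(fd, buf, sizeof(buf));
        close(fd);
        if (n < 0)
            return -1;
    }
    return iterations;
}
```

A program that calls this with "/proc/self/status" and 1000000 iterations could then be timed with `perf stat` in the same way as the runs shown above.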
     
  • Trying to walk all of virtual memory requires architecture specific
    knowledge. On x86_64, addresses must be sign extended from bit 48,
    whereas on arm64 the top VA_BITS of address space have their own set of
    page tables.

    clear_refs_write() calls walk_page_range() on the range 0 to ~0UL and
    provides a test_walk() callback that only expects to be walking over
    VMAs. Currently walk_pmd_range() will skip memory regions that don't
    have a VMA, reporting them as a hole.

    As this call only expects to walk user address space, make it walk 0 to
    'highest_vm_end'.

    Link: http://lkml.kernel.org/r/1472655792-22439-1-git-send-email-james.morse@arm.com
    Signed-off-by: James Morse
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    James Morse
     

07 Oct, 2016

1 commit

  • Pull namespace updates from Eric Biederman:
    "This set of changes is a number of smaller things that have been
    overlooked in other development cycles focused on more fundamental
    change. The devpts changes are small things that were a distraction
    until we managed to kill off DEVPTS_MULTIPLE_INSTANCES. There is a
    trivial regression fix to autofs for the unprivileged mount changes
    that went in last cycle. A pair of ioctls has been added by Andrey
    Vagin making it possible to discover the relationships between
    namespaces when referring to them through file descriptors.

    The big user visible change is starting to add simple resource limits
    to catch programs that misbehave. With namespaces in general and user
    namespaces in particular allowing users to use more kinds of
    resources, it has become important to have something to limit errant
    programs. Because the purpose of these limits is to catch errant
    programs the code needs to be inexpensive to use as it is always on, and
    the default limits need to be high enough that well behaved programs
    on well behaved systems don't encounter them.

    To this end, after some review I have implemented per user per user
    namespace limits, and use them to limit the number of namespaces. The
    limits being per user mean that one user can not exhaust the limits of
    another user. The limits being per user namespace allow contexts where
    the limit is 0 and security conscious folks can remove from their
    threat analysis the code used to manage namespaces (as they have
    historically done as it is root only). At the same time the limits being
    per user namespace allow other parts of the system to use namespaces.

    Namespaces are increasingly being used in application sand boxing
    scenarios so an all or nothing disable for the entire system for the
    security conscious folks makes increasing use of these sandboxes
    impossible.

    There is also added a limit on the maximum number of mounts present in
    a single mount namespace. It is nontrivial to guess what a reasonable
    system wide limit on the number of mount structures in the kernel would
    be, especially as it varies based on how a system is using
    containers. A limit on the number of mounts in a mount namespace
    however is much easier to understand and set. In most cases in
    practice only about 1000 mounts are used. Given that some autofs
    scenarios have the potential to be 30,000 to 50,000 mounts I have set
    the default limit for the number of mounts at 100,000 which is well
    above every known set of users but low enough that the mount hash
    tables don't degrade unreasonably.

    These limits are a start. I expect this establishes a pattern that
    other limits for resources that namespaces use will follow. There has
    been interest in making inotify event limits per user per user
    namespace as well as interest expressed in making details about what
    is going on in the kernel more visible"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (28 commits)
    autofs: Fix automounts by using current_real_cred()->uid
    mnt: Add a per mount namespace limit on the number of mounts
    netns: move {inc,dec}_net_namespaces into #ifdef
    nsfs: Simplify __ns_get_path
    tools/testing: add a test to check nsfs ioctl-s
    nsfs: add ioctl to get a parent namespace
    nsfs: add ioctl to get an owning user namespace for ns file descriptor
    kernel: add a helper to get an owning user namespace for a namespace
    devpts: Change the owner of /dev/pts/ptmx to the mounter of /dev/pts
    devpts: Remove sync_filesystems
    devpts: Make devpts_kill_sb safe if fsi is NULL
    devpts: Simplify devpts_mount by using mount_nodev
    devpts: Move the creation of /dev/pts/ptmx into fill_super
    devpts: Move parse_mount_options into fill_super
    userns: When the per user per user namespace limit is reached return ENOSPC
    userns; Document per user per user namespace limits.
    mntns: Add a limit on the number of mount namespaces.
    netns: Add a limit on the number of net namespaces
    cgroupns: Add a limit on the number of cgroup namespaces
    ipcns: Add a limit on the number of ipc namespaces
    ...

    Linus Torvalds
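
The per-user, per-user-namespace limits described above, together with the ENOSPC behaviour named in the commit list, can be modelled with a simple counter. This is a sketch under stated assumptions: `ns_counter_model` and its two helpers are hypothetical, and the model is not atomic, unlike what a kernel implementation would need.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical counter for one user within one user namespace. */
struct ns_counter_model {
    int count; /* namespaces currently charged to this user */
    int max;   /* configured limit; 0 disables creation entirely */
};

/* Charges one namespace; fails with -ENOSPC once the limit is reached. */
static int ns_counter_inc(struct ns_counter_model *c)
{
    if (c->count >= c->max)
        return -ENOSPC;
    c->count++;
    return 0;
}

/* Releases one charge when a namespace goes away. */
static void ns_counter_dec(struct ns_counter_model *c)
{
    if (c->count > 0)
        c->count--;
}
```

Because each user has a separate counter, one user hitting the limit does not affect another, and a limit of 0 rejects every attempt.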
     

06 Oct, 2016

2 commits

  • Signed-off-by: Al Viro

    Al Viro
     
  • Pull networking updates from David Miller:

    1) BBR TCP congestion control, from Neal Cardwell, Yuchung Cheng and
    co. at Google. https://lwn.net/Articles/701165/

    2) Do TCP Small Queues for retransmits, from Eric Dumazet.

    3) Support collect_md mode for all IPV4 and IPV6 tunnels, from Alexei
    Starovoitov.

    4) Allow cls_flower to classify packets in ip tunnels, from Amir Vadai.

    5) Support DSA tagging in older mv88e6xxx switches, from Andrew Lunn.

    6) Support GMAC protocol in iwlwifi mvm, from Ayala Beker.

    7) Support ndo_poll_controller in mlx5, from Calvin Owens.

    8) Move VRF processing to an output hook and allow l3mdev to be
    loopback, from David Ahern.

    9) Support SOCK_DESTROY for UDP sockets. Also from David Ahern.

    10) Congestion control in RXRPC, from David Howells.

    11) Support geneve RX offload in ixgbe, from Emil Tantilov.

    12) When hitting pressure for new incoming TCP data SKBs, perform a
    partial rather than a full purge of the OFO queue (which could be
    huge). From Eric Dumazet.

    13) Convert XFRM state and policy lookups to RCU, from Florian Westphal.

    14) Support RX network flow classification to igb, from Gangfeng Huang.

    15) Hardware offloading of eBPF in nfp driver, from Jakub Kicinski.

    16) New skbmod packet action, from Jamal Hadi Salim.

    17) Remove some inefficiencies in snmp proc output, from Jia He.

    18) Add FIB notifications to properly propagate route changes to
    hardware which is doing forwarding offloading. From Jiri Pirko.

    19) New dsa driver for qca8xxx chips, from John Crispin.

    20) Implement RFC7559 ipv6 router solicitation backoff, from Maciej
    Żenczykowski.

    21) Add L3 mode to ipvlan, from Mahesh Bandewar.

    22) Support 802.1ad in mlx4, from Moshe Shemesh.

    23) Support hardware LRO in mediatek driver, from Nelson Chang.

    24) Add TC offloading to mlx5, from Or Gerlitz.

    25) Convert various drivers to ethtool ksettings interfaces, from
    Philippe Reynes.

    26) TX max rate limiting for cxgb4, from Rahul Lakkireddy.

    27) NAPI support for ath10k, from Rajkumar Manoharan.

    28) Support XDP in mlx5, from Rana Shahout and Saeed Mahameed.

    29) UDP replicast support in TIPC, from Richard Alpe.

    30) Per-queue statistics for qed driver, from Sudarsana Reddy Kalluru.

    31) Support BQL in thunderx driver, from Sunil Goutham.

    32) TSO support in alx driver, from Tobias Regnery.

    33) Add stream parser engine and use it in kcm.

    34) Support async DHCP replies in ipconfig module, from Uwe
    Kleine-König.

    35) DSA port fast aging for mv88e6xxx driver, from Vivien Didelot.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1715 commits)
    mlxsw: switchx2: Fix misuse of hard_header_len
    mlxsw: spectrum: Fix misuse of hard_header_len
    net/faraday: Stop NCSI device on shutdown
    net/ncsi: Introduce ncsi_stop_dev()
    net/ncsi: Rework the channel monitoring
    net/ncsi: Allow to extend NCSI request properties
    net/ncsi: Rework request index allocation
    net/ncsi: Don't probe on the reserved channel ID (0x1f)
    net/ncsi: Introduce NCSI_RESERVED_CHANNEL
    net/ncsi: Avoid unused-value build warning from ia64-linux-gcc
    net: Add netdev all_adj_list refcnt propagation to fix panic
    net: phy: Add Edge-rate driver for Microsemi PHYs.
    vmxnet3: Wake queue from reset work
    i40e: avoid NULL pointer dereference and recursive errors on early PCI error
    qed: Add RoCE ll2 & GSI support
    qed: Add support for memory registeration verbs
    qed: Add support for QP verbs
    qed: PD,PKEY and CQ verb support
    qed: Add support for RoCE hw init
    qede: Add qedr framework
    ...

    Linus Torvalds
     

30 Sep, 2016

1 commit


28 Sep, 2016

3 commits

  • The CURRENT_TIME macro is not appropriate for filesystems as it
    doesn't use the right granularity for filesystem timestamps.
    Use current_time() instead.

    CURRENT_TIME is also not y2038 safe.

    This is also in preparation for the patch that transitions
    vfs timestamps to use 64 bit time and hence make them
    y2038 safe. As part of the effort current_time() will be
    extended to do range checks. Hence, it is necessary for all
    file system timestamps to use current_time(). Also,
    current_time() will be transitioned along with vfs to be
    y2038 safe.

    Note that whenever a single call to current_time() is used
    to change timestamps in different inodes, it is because they
    share the same time granularity.

    Signed-off-by: Deepa Dinamani
    Reviewed-by: Arnd Bergmann
    Acked-by: Felipe Balbi
    Acked-by: Steven Whitehouse
    Acked-by: Ryusuke Konishi
    Acked-by: David Sterba
    Signed-off-by: Al Viro

    Deepa Dinamani
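
The granularity point above can be illustrated with a small userspace sketch. It is not the kernel implementation: `trunc_to_granularity()` and `fs_now()` are hypothetical names, and the granularity is passed in directly as an assumption, whereas in the kernel it would come from the filesystem the inode belongs to.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/*
 * Round a timestamp down to a multiple of gran_ns nanoseconds.
 * gran_ns is an assumed per-filesystem granularity, e.g. 1 for
 * nanosecond timestamps or 1000000000 for whole seconds.
 */
static struct timespec trunc_to_granularity(struct timespec t, long gran_ns)
{
    assert(gran_ns > 0);
    if (gran_ns >= 1000000000L)
        t.tv_nsec = 0;                    /* whole seconds only */
    else if (gran_ns > 1)
        t.tv_nsec -= t.tv_nsec % gran_ns; /* drop the sub-granule part */
    return t;
}

/* Hypothetical stand-in for a filesystem-aware "now". */
static struct timespec fs_now(long gran_ns)
{
    struct timespec now;

    clock_gettime(CLOCK_REALTIME, &now);
    return trunc_to_granularity(now, gran_ns);
}
```

A caller storing timestamps with microsecond resolution would use `fs_now(1000)`. A bare "current time" value has no such parameter, which is why it cannot pick the right granularity for each filesystem.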
     
  • proc uses new_inode_pseudo() to allocate a new inode.
    This in turn calls the proc_alloc_inode() callback.
    But, at this point, inode is still not initialized
    with the super_block pointer which only happens just
    before alloc_inode() returns after the call to
    inode_init_always().

    Also, the inode times are initialized again after the
    call to new_inode_pseudo() in proc_get_inode().
    The assignment in proc_alloc_inode() is redundant and
    also doesn't work after the current_time() API is
    changed to take struct inode * instead of
    struct super_block *.

    This bug was reported after current_time() was used to
    assign times in proc_alloc_inode().

    Signed-off-by: Deepa Dinamani
    Reported-by: Fengguang Wu [0-day test robot]
    Reviewed-by: Arnd Bergmann
    Signed-off-by: Al Viro

    Deepa Dinamani
     
  • Make struct proc_inode::fd unsigned.

    This allows better code generation on x86_64 (fewer sign extensions).

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Al Viro

    Alexey Dobriyan
     

23 Sep, 2016

1 commit


22 Sep, 2016

1 commit

  • inode_change_ok() will be responsible for clearing capabilities and IMA
    extended attributes and as such will need a dentry. Give it as an argument
    to inode_change_ok() instead of an inode. Also rename inode_change_ok()
    to setattr_prepare() to better reflect that it also does some
    modifications in addition to checks.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Jan Kara
     

21 Sep, 2016

2 commits

  • We hit the hardened usercopy feature check for kernel text access by
    reading the kcore file:

    usercopy: kernel memory exposure attempt detected from ffffffff8179a01f () (4065 bytes)
    kernel BUG at mm/usercopy.c:75!

    Bypass this check for kcore by adding a bounce buffer for ktext data.

    Reported-by: Steve Best
    Fixes: f5509cc18daa ("mm: Hardened usercopy")
    Suggested-by: Kees Cook
    Signed-off-by: Jiri Olsa
    Acked-by: Kees Cook
    Signed-off-by: Linus Torvalds

    Jiri Olsa
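
The bounce-buffer idea described above can be sketched in plain userspace C. This is only an illustration under stated assumptions: `copy_via_bounce()` and `BOUNCE_SIZE` are hypothetical names, and both stages are ordinary `memcpy()` calls, whereas in the kernel the second stage would be the copy to user space that the hardened usercopy check inspects.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BOUNCE_SIZE 4096 /* assumed: one page */

/*
 * Copy len bytes from src to dst through a fixed intermediate buffer,
 * at most BOUNCE_SIZE bytes at a time, so the final copy never reads
 * from the original source region directly.
 * Returns the number of bytes copied.
 */
static size_t copy_via_bounce(void *dst, const void *src, size_t len)
{
    static unsigned char bounce[BOUNCE_SIZE];
    const unsigned char *in = src;
    unsigned char *out = dst;
    size_t done = 0;

    assert(dst != NULL && src != NULL);
    while (done < len) {
        size_t chunk = len - done;

        if (chunk > BOUNCE_SIZE)
            chunk = BOUNCE_SIZE;
        memcpy(bounce, in + done, chunk);  /* stage 1: source -> bounce */
        memcpy(out + done, bounce, chunk); /* stage 2: bounce -> destination */
        done += chunk;
    }
    return done;
}
```

The extra copy costs some throughput, but it keeps the checked copy sourced from a buffer that the check accepts.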
     
  • The next patch adds a bounce buffer for the ktext area, so it's
    convenient to have a single bounce buffer for both the
    vmalloc/module and ktext cases.

    Suggested-by: Linus Torvalds
    Signed-off-by: Jiri Olsa
    Acked-by: Kees Cook
    Signed-off-by: Linus Torvalds

    Jiri Olsa
     

15 Sep, 2016

1 commit


13 Sep, 2016

1 commit


11 Sep, 2016

1 commit

  • Pull libnvdimm fixes from Dan Williams:
    "nvdimm fixes for v4.8, two of them are tagged for -stable:

    - Fix devm_memremap_pages() to use track_pfn_insert(). Otherwise,
    DAX pmd mappings end up with an uncached pgprot, and unusable
    performance for the device-dax interface. The device-dax interface
    appeared in 4.7 so this is tagged for -stable.

    - Fix a couple VM_BUG_ON() checks in the show_smaps() path to
    understand DAX pmd entries. This fix is tagged for -stable.

    - Fix a mis-merge of the nfit machine-check handler to flip the
    polarity of an if() to match the final version of the patch that
    Vishal sent for 4.8-rc1. Without this the nfit machine check
    handler never detects / inserts new 'badblocks' entries which
    applications use to identify lost portions of files.

    - For test purposes, fix the nvdimm_clear_poison() path to operate on
    legacy / simulated nvdimm memory ranges. Without this fix a test
    can set badblocks, but never clear them on these ranges.

    - Fix the range checking done by dax_dev_pmd_fault(). This is not
    tagged for -stable since this problem is mitigated by specifying
    aligned resources at device-dax setup time.

    These patches have appeared in a next release over the past week. The
    recent rebase you can see in the timestamps was to drop an invalid fix
    as identified by the updated device-dax unit tests [1]. The -mm
    touches have an ack from Andrew"

    [1]: "[ndctl PATCH 0/3] device-dax test for recent kernel bugs"
    https://lists.01.org/pipermail/linux-nvdimm/2016-September/006855.html

    * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    libnvdimm: allow legacy (e820) pmem region to clear bad blocks
    nfit, mce: Fix SPA matching logic in MCE handler
    mm: fix cache mode of dax pmd mappings
    mm: fix show_smap() for zone_device-pmd ranges
    dax: fix mapping size check

    Linus Torvalds
     

10 Sep, 2016

1 commit

  • Attempting to dump /proc/<pid>/smaps for a process with pmd dax mappings
    currently results in the following VM_BUG_ONs:

    kernel BUG at mm/huge_memory.c:1105!
    task: ffff88045f16b140 task.stack: ffff88045be14000
    RIP: 0010:[] [] follow_trans_huge_pmd+0x2cb/0x340
    [..]
    Call Trace:
    [] smaps_pte_range+0xa0/0x4b0
    [] ? vsnprintf+0x255/0x4c0
    [] __walk_page_range+0x1fe/0x4d0
    [] walk_page_vma+0x62/0x80
    [] show_smap+0xa6/0x2b0

    kernel BUG at fs/proc/task_mmu.c:585!
    RIP: 0010:[] [] smaps_pte_range+0x499/0x4b0
    Call Trace:
    [] ? vsnprintf+0x255/0x4c0
    [] __walk_page_range+0x1fe/0x4d0
    [] walk_page_vma+0x62/0x80
    [] show_smap+0xa6/0x2b0

    These locations are sanity checking page flags that must be set for an
    anonymous transparent huge page, but are not set for the zone_device
    pages associated with dax mappings.

    Cc: Ross Zwisler
    Cc: Kirill A. Shutemov
    Acked-by: Andrew Morton
    Signed-off-by: Dan Williams

    Dan Williams
     

02 Sep, 2016

1 commit


01 Sep, 2016

1 commit

  • For more convenient access if one has a pointer to the task.

    As a minor nit, take advantage of the fact that only task lock + rcu are
    needed to safely grab ->exe_file. This saves the mm refcount dance.

    Use the helper in proc_exe_link.

    Signed-off-by: Mateusz Guzik
    Acked-by: Konstantin Khlebnikov
    Acked-by: Richard Guy Briggs
    Cc: # 4.3.x
    Signed-off-by: Paul Moore

    Mateusz Guzik
     

19 Aug, 2016

1 commit

  • When printing call return addresses found on a stack, /proc/<pid>/stack
    can sometimes give a confusing result. If the call instruction was the
    last instruction in the function (which can happen when calling a
    noreturn function), '%pS' will incorrectly display the name of the
    function which happens to be next in the object code, rather than the
    name of the actual calling function.

    Use '%pB' instead, which was created for this exact purpose.

    Signed-off-by: Josh Poimboeuf
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Byungchul Park
    Cc: Denys Vlasenko
    Cc: Frederic Weisbecker
    Cc: H. Peter Anvin
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Nilay Vaish
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/47ad2821e5ebdbed1fbf83fb85424ae4fbdf8b6e.1471535549.git.jpoimboe@redhat.com
    Signed-off-by: Ingo Molnar

    Josh Poimboeuf
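
The difference between the two lookups can be shown with a toy symbol table. This is a userspace sketch, not the kernel's symbol resolver: the table, the addresses, and the function names `lookup()`, `resolve_plain()` and `resolve_backtrace()` are all hypothetical. It assumes the backtrace-style lookup resolves the byte just before the return address, which still lies inside the call instruction.

```c
#include <assert.h>
#include <stddef.h>

struct sym {
    unsigned long start;
    const char *name;
};

/* Hypothetical symbol table, sorted by start address. */
static const struct sym symtab[] = {
    { 0x1000, "caller" },    /* 0x1000..0x101f, ends with a call to a noreturn function */
    { 0x1020, "next_func" }, /* happens to follow caller in the object code */
};

/* Return the name of the last symbol whose start is <= addr, or NULL. */
static const char *lookup(unsigned long addr)
{
    const char *name = NULL;

    for (size_t i = 0; i < sizeof(symtab) / sizeof(symtab[0]); i++)
        if (symtab[i].start <= addr)
            name = symtab[i].name;
    return name;
}

/* Resolve the return address exactly as given. */
static const char *resolve_plain(unsigned long ret_addr)
{
    return lookup(ret_addr);
}

/* Resolve the byte before the return address instead. */
static const char *resolve_backtrace(unsigned long ret_addr)
{
    assert(ret_addr > 0);
    return lookup(ret_addr - 1);
}
```

With a return address of 0x1020, just past the final call in `caller`, the plain lookup lands on `next_func`, while the backtrace-style lookup stays inside `caller`.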
     

18 Aug, 2016

1 commit


15 Aug, 2016

1 commit

  • If a net namespace is attached to a user namespace, make the container's
    root the owner of the sysctls affecting that network namespace instead of
    the global root.

    This also allows us to clean up net_ctl_permissions() because we do not
    need to fudge permissions anymore for the container's owner since it now
    owns the objects in question.

    Acked-by: "Eric W. Biederman"
    Signed-off-by: Dmitry Torokhov
    Signed-off-by: David S. Miller

    Dmitry Torokhov