12 Oct, 2016

1 commit

  • In the dlm_migrate_request_handler(), when `ret' is -EEXIST, the mle
    should be freed, otherwise the memory will be leaked.

    Link: http://lkml.kernel.org/r/71604351584F6A4EBAE558C676F37CA4A3D3522A@H3CMLB12-EX.srv.huawei-3com.com
    Signed-off-by: Guozhonghua
    Reviewed-by: Mark Fasheh
    Cc: Eric Ren
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Guozhonghua
     

11 Oct, 2016

10 commits

  • Pull networking fixes from David Miller:

    1) Netfilter list handling fix, from Linus.

    2) RXRPC/AFS bug fixes from David Howells (oops on call to serviceless
    endpoints, build warnings, missing notifications, etc.) From David
    Howells.

    3) Kernel log message missing newlines, from Colin Ian King.

    4) Don't enter direct reclaim in netlink dumps, the idea is to use a
    high order allocation first and fallback quickly to a 0-order
    allocation if such a high-order one cannot be done cheaply and
    without reclaim. From Eric Dumazet.

    5) Fix firmware download errors in btusb bluetooth driver, from Ethan
    Hsieh.

    6) Missing Kconfig deps for QCOM_EMAC, from Geert Uytterhoeven.

    7) Fix MDIO_XGENE dup Kconfig entry. From Laura Abbott.

    8) Constrain ipv6 rtr_solicits sysctl values properly, from Maciej
    Żenczykowski.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (65 commits)
    netfilter: Fix slab corruption.
    be2net: Enable VF link state setting for BE3
    be2net: Fix TX stats for TSO packets
    be2net: Update Copyright string in be_hw.h
    be2net: NCSI FW section should be properly updated with ethtool for BE3
    be2net: Provide an alternate way to read pf_num for BEx chips
    wan/fsl_ucc_hdlc: Fix size used in dma_free_coherent()
    net: macb: NULL out phydev after removing mdio bus
    xen-netback: make sure that hashes are not send to unaware frontends
    Fixing a bug in team driver due to incorrect 'unsigned int' to 'int' conversion
    MAINTAINERS: add myself as a maintainer of xen-netback
    ipv6 addrconf: disallow rtr_solicits < -1
    Bluetooth: btusb: Fix atheros firmware download error
    drivers: net: phy: Correct duplicate MDIO_XGENE entry
    ethernet: qualcomm: QCOM_EMAC should depend on HAS_DMA and HAS_IOMEM
    net: ethernet: mediatek: remove hwlro property in the device tree
    net: ethernet: mediatek: get hw lro capability by the chip id instead of by the dtsi
    net: ethernet: mediatek: get the chip id by ETHDMASYS registers
    net: bgmac: Fix errant feature flag check
    netlink: do not enter direct reclaim from netlink_dump()
    ...

    Linus Torvalds
     
  • Pull more vfs updates from Al Viro:
    ">rename2() work from Miklos + current_time() from Deepa"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: Replace current_fs_time() with current_time()
    fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
    fs: Replace CURRENT_TIME with current_time() for inode timestamps
    fs: proc: Delete inode time initializations in proc_alloc_inode()
    vfs: Add current_time() api
    vfs: add note about i_op->rename changes to porting
    fs: rename "rename2" i_op to "rename"
    vfs: remove unused i_op->rename
    fs: make remaining filesystems use .rename2
    libfs: support RENAME_NOREPLACE in simple_rename()
    fs: support RENAME_NOREPLACE for local filesystems
    ncpfs: fix unused variable warning

    Linus Torvalds
     
  • Al Viro
     
  • Pull vfs xattr updates from Al Viro:
    "xattr stuff from Andreas

    This completes the switch to xattr_handler ->get()/->set() from
    ->getxattr/->setxattr/->removexattr"

    * 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    vfs: Remove {get,set,remove}xattr inode operations
    xattr: Stop calling {get,set,remove}xattr inode operations
    vfs: Check for the IOP_XATTR flag in listxattr
    xattr: Add __vfs_{get,set,remove}xattr helpers
    libfs: Use IOP_XATTR flag for empty directory handling
    vfs: Use IOP_XATTR flag for bad-inode handling
    vfs: Add IOP_XATTR inode operations flag
    vfs: Move xattr_resolve_name to the front of fs/xattr.c
    ecryptfs: Switch to generic xattr handlers
    sockfs: Get rid of getxattr iop
    sockfs: getxattr: Fail with -EOPNOTSUPP for invalid attribute names
    kernfs: Switch to generic xattr handlers
    hfs: Switch to generic xattr handlers
    jffs2: Remove jffs2_{get,set,remove}xattr macros
    xattr: Remove unnecessary NULL attribute name check

    Linus Torvalds
     
  • Pull dlm fix from David Teigland:
    "This includes a bug fix for a bad memory access during workqueue
    cleanup, which can happen while shutting down the dlm networking
    layer"

    * tag 'dlm-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
    dlm: free workqueues after the connections

    Linus Torvalds
     
  • Pull Ceph updates from Ilya Dryomov:
    "The big ticket item here is support for rbd exclusive-lock feature,
    with maintenance operations offloaded to userspace (Douglas Fuller,
    Mike Christie and myself). Another block device bullet is a series
    fixing up layering error paths (myself).

    On the filesystem side, we've got patches that improve our handling of
    buffered vs dio write races (Neil Brown) and a few assorted fixes from
    Zheng. Also included a couple of random cleanups and a minor CRUSH
    update"

    * tag 'ceph-for-4.9-rc1' of git://github.com/ceph/ceph-client: (39 commits)
    crush: remove redundant local variable
    crush: don't normalize input of crush_ln iteratively
    libceph: ceph_build_auth() doesn't need ceph_auth_build_hello()
    libceph: use CEPH_AUTH_UNKNOWN in ceph_auth_build_hello()
    ceph: fix description for rsize and rasize mount options
    rbd: use kmalloc_array() in rbd_header_from_disk()
    ceph: use list_move instead of list_del/list_add
    ceph: handle CEPH_SESSION_REJECT message
    ceph: avoid accessing / when mounting a subpath
    ceph: fix mandatory flock check
    ceph: remove warning when ceph_releasepage() is called on dirty page
    ceph: ignore error from invalidate_inode_pages2_range() in direct write
    ceph: fix error handling of start_read()
    rbd: add rbd_obj_request_error() helper
    rbd: img_data requests don't own their page array
    rbd: don't call rbd_osd_req_format_read() for !img_data requests
    rbd: rework rbd_img_obj_exists_submit() error paths
    rbd: don't crash or leak on errors in rbd_img_obj_parent_read_full_callback()
    rbd: move bumping img_request refcount into rbd_obj_request_submit()
    rbd: mark the original request as done if stat request fails
    ...

    Linus Torvalds
     
  • Pull splice fixups from Al Viro:
    "A couple of fixups for interaction of pipe-backed iov_iter with
    O_DIRECT reads + constification of a couple of primitives in uio.h
    missed by previous rounds.

    Kudos to davej - his fuzzing has caught those bugs"

    * 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    [btrfs] fix check_direct_IO() for non-iovec iterators
    constify iov_iter_count() and iter_is_iovec()
    fix ITER_PIPE interaction with direct_IO

    Linus Torvalds
     
  • Pull misc vfs updates from Al Viro:
    "Assorted misc bits and pieces.

    There are several single-topic branches left after this (rename2
    series from Miklos, current_time series from Deepa Dinamani, xattr
    series from Andreas, uaccess stuff from from me) and I'd prefer to
    send those separately"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits)
    proc: switch auxv to use of __mem_open()
    hpfs: support FIEMAP
    cifs: get rid of unused arguments of CIFSSMBWrite()
    posix_acl: uapi header split
    posix_acl: xattr representation cleanups
    fs/aio.c: eliminate redundant loads in put_aio_ring_file
    fs/internal.h: add const to ns_dentry_operations declaration
    compat: remove compat_printk()
    fs/buffer.c: make __getblk_slow() static
    proc: unsigned file descriptors
    fs/file: more unsigned file descriptors
    fs: compat: remove redundant check of nr_segs
    cachefiles: Fix attempt to read i_blocks after deleting file [ver #2]
    cifs: don't use memcpy() to copy struct iov_iter
    get rid of separate multipage fault-in primitives
    fs: Avoid premature clearing of capabilities
    fs: Give dentry to inode_change_ok() instead of inode
    fuse: Propagate dentry down to inode_change_ok()
    ceph: Propagate dentry down to inode_change_ok()
    xfs: Propagate dentry down to inode_change_ok()
    ...

    Linus Torvalds
     
  • looking for duplicate ->iov_base makes sense only for
    iovec-backed iterators; for kvec-backed ones it's pointless,
    for bvec-backed ones it's pointless and broken on 32bit (we
    walk through an array of struct bio_vec accessing them as if
    they were struct iovec; works by accident on 64bit, but on
    32bit it'll blow up) and for pipe-backed ones it's pointless
    and ends up oopsing.

    Signed-off-by: Al Viro

    Al Viro
     
  • by making sure we call iov_iter_advance() on original
    iov_iter even if direct_IO (done on its copy) has returned 0.
    It's a no-op for old iov_iter flavours and does the right thing
    (== truncation of the stuff we'd allocated, but not filled) in
    ITER_PIPE case. Failures (e.g. -EIO) get caught and dealt with
    by cleanup in generic_file_read_iter().

    Signed-off-by: Al Viro

    Al Viro
     

10 Oct, 2016

1 commit

  • After backporting commit ee44b4bc054a ("dlm: use sctp 1-to-1 API")
    series to a kernel with an older workqueue which didn't use RCU yet, it
    was noticed that we are freeing the workqueues in dlm_lowcomms_stop()
    too early as free_conn() will try to access that memory for canceling
    the queued works if any.

    This issue was introduced by commit 0d737a8cfd83 as before it such
    attempt to cancel the queued works wasn't performed, so the issue was
    not present.

    This patch fixes it by simply inverting the free order.

    Cc: stable@vger.kernel.org
    Fixes: 0d737a8cfd83 ("dlm: fix race while closing connections")
    Signed-off-by: Marcelo Ricardo Leitner
    Signed-off-by: David Teigland

    Marcelo Ricardo Leitner
     

08 Oct, 2016

28 commits

  • Al Viro
     
  • Al Viro
     
  • Al Viro
     
  • Al Viro
     
  • Merge updates from Andrew Morton:

    - fsnotify updates

    - ocfs2 updates

    - all of MM

    * emailed patches from Andrew Morton : (127 commits)
    console: don't prefer first registered if DT specifies stdout-path
    cred: simpler, 1D supplementary groups
    CREDITS: update Pavel's information, add GPG key, remove snail mail address
    mailmap: add Johan Hovold
    .gitattributes: set git diff driver for C source code files
    uprobes: remove function declarations from arch/{mips,s390}
    spelling.txt: "modeled" is spelt correctly
    nmi_backtrace: generate one-line reports for idle cpus
    arch/tile: adopt the new nmi_backtrace framework
    nmi_backtrace: do a local dump_stack() instead of a self-NMI
    nmi_backtrace: add more trigger_*_cpu_backtrace() methods
    min/max: remove sparse warnings when they're nested
    Documentation/filesystems/proc.txt: add more description for maps/smaps
    mm, proc: fix region lost in /proc/self/smaps
    proc: fix timerslack_ns CAP_SYS_NICE check when adjusting self
    proc: add LSM hook checks to /proc//timerslack_ns
    proc: relax /proc//timerslack_ns capability requirements
    meminfo: break apart a very long seq_printf with #ifdefs
    seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char
    proc: faster /proc/*/status
    ...

    Linus Torvalds
     
  • These inode operations are no longer used; remove them.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Al Viro

    Andreas Gruenbacher
     
  • Current supplementary groups code can massively overallocate memory and
    is implemented in a way so that access to individual gid is done via 2D
    array.

    If number of gids is
    Cc: Vasily Kulikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Recently, Redhat reported that nvml test suite failed on QEMU/KVM,
    more detailed info please refer to:

    https://bugzilla.redhat.com/show_bug.cgi?id=1365721

    Actually, this bug is not only for NVDIMM/DAX but also for any other
    file systems. This simple test case abstracted from nvml can easily
    reproduce this bug in common environment:

    -------------------------- testcase.c -----------------------------

    int
    is_pmem_proc(const void *addr, size_t len)
    {
    const char *caddr = addr;

    FILE *fp;
    if ((fp = fopen("/proc/self/smaps", "r")) == NULL) {
    printf("!/proc/self/smaps");
    return 0;
    }

    int retval = 0; /* assume false until proven otherwise */
    char line[PROCMAXLEN]; /* for fgets() */
    char *lo = NULL; /* beginning of current range in smaps file */
    char *hi = NULL; /* end of current range in smaps file */
    int needmm = 0; /* looking for mm flag for current range */
    while (fgets(line, PROCMAXLEN, fp) != NULL) {
    static const char vmflags[] = "VmFlags:";
    static const char mm[] = " wr";

    /* check for range line */
    if (sscanf(line, "%p-%p", &lo, &hi) == 2) {
    if (needmm) {
    /* last range matched, but no mm flag found */
    printf("never found mm flag.\n");
    break;
    } else if (caddr < lo) {
    /* never found the range for caddr */
    printf("#######no match for addr %p.\n", caddr);
    break;
    } else if (caddr < hi) {
    /* start address is in this range */
    size_t rangelen = (size_t)(hi - caddr);

    /* remember that matching has started */
    needmm = 1;

    /* calculate remaining range to search for */
    if (len > rangelen) {
    len -= rangelen;
    caddr += rangelen;
    printf("matched %zu bytes in range "
    "%p-%p, %zu left over.\n",
    rangelen, lo, hi, len);
    } else {
    len = 0;
    printf("matched all bytes in range "
    "%p-%p.\n", lo, hi);
    }
    }
    } else if (needmm && strncmp(line, vmflags,
    sizeof(vmflags) - 1) == 0) {
    if (strstr(&line[sizeof(vmflags) - 1], mm) != NULL) {
    printf("mm flag found.\n");
    if (len == 0) {
    /* entire range matched */
    retval = 1;
    break;
    }
    needmm = 0; /* saw what was needed */
    } else {
    /* mm flag not set for some or all of range */
    printf("range has no mm flag.\n");
    break;
    }
    }
    }

    fclose(fp);

    printf("returning %d.\n", retval);
    return retval;
    }

    void *Addr;
    size_t Size;

    /*
    * worker -- the work each thread performs
    */
    static void *
    worker(void *arg)
    {
    int *ret = (int *)arg;
    *ret = is_pmem_proc(Addr, Size);
    return NULL;
    }

    int main(int argc, char *argv[])
    {
    if (argc < 2 || argc > 3) {
    printf("usage: %s file [env].\n", argv[0]);
    return -1;
    }

    int fd = open(argv[1], O_RDWR);

    struct stat stbuf;
    fstat(fd, &stbuf);

    Size = stbuf.st_size;
    Addr = mmap(0, stbuf.st_size, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);

    close(fd);

    pthread_t threads[NTHREAD];
    int ret[NTHREAD];

    /* kick off NTHREAD threads */
    for (int i = 0; i < NTHREAD; i++)
    pthread_create(&threads[i], NULL, worker, &ret[i]);

    /* wait for all the threads to complete */
    for (int i = 0; i < NTHREAD; i++)
    pthread_join(threads[i], NULL);

    /* verify that all the threads return the same value */
    for (int i = 1; i < NTHREAD; i++) {
    if (ret[0] != ret[i]) {
    printf("Error i %d ret[0] = %d ret[i] = %d.\n", i,
    ret[0], ret[i]);
    }
    }

    printf("%d", ret[0]);
    return 0;
    }

    It failed as some threads can not find the memory region in
    "/proc/self/smaps" which is allocated in the main process

    It is caused by proc fs which uses 'file->version' to indicate the VMA that
    is the last one has already been handled by read() system call. When the
    next read() issues, it uses the 'version' to find the VMA, then the next
    VMA is what we want to handle, the related code is as follows:

    if (last_addr) {
    vma = find_vma(mm, last_addr);
    if (vma && (vma = m_next_vma(priv, vma)))
    return vma;
    }

    However, VMA will be lost if the last VMA is gone, e.g:

    The process VMA list is A->B->C->D

    CPU 0 CPU 1
    read() system call
    handle VMA B
    version = B
    return to userspace

    unmap VMA B

    issue read() again to continue to get
    the region info
    find_vma(version) will get VMA C
    m_next_vma(C) will get VMA D
    handle D
    !!! VMA C is lost !!!

    In order to fix this bug, we make 'file->version' indicate the end address
    of the current VMA. m_start will then look up a vma which with vma_start
    < last_vm_end and moves on to the next vma if we found the same or an
    overlapping vma. This will guarantee that we will not miss an exclusive
    vma but we can still miss one if the previous vma was shrunk. This is
    acceptable because guaranteeing "never miss a vma" is simply not feasible.
    User has to cope with some inconsistencies if the file is not read in one
    go.

    [mhocko@suse.com: changelog fixes]
    Link: http://lkml.kernel.org/r/1475296958-27652-1-git-send-email-robert.hu@intel.com
    Acked-by: Dave Hansen
    Signed-off-by: Xiao Guangrong
    Signed-off-by: Robert Hu
    Acked-by: Michal Hocko
    Acked-by: Oleg Nesterov
    Cc: Paolo Bonzini
    Cc: Dan Williams
    Cc: Gleb Natapov
    Cc: Marcelo Tosatti
    Cc: Stefan Hajnoczi
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert Ho
     
  • In changing from checking ptrace_may_access(p, PTRACE_MODE_ATTACH_FSCREDS)
    to capable(CAP_SYS_NICE), I missed that ptrace_my_access succeeds when p
    == current, but the CAP_SYS_NICE doesn't.

    Thus while the previous commit was intended to loosen the needed
    privileges to modify a processes timerslack, it needlessly restricted a
    task modifying its own timerslack via the proc//timerslack_ns
    (which is permitted also via the PR_SET_TIMERSLACK method).

    This patch corrects this by checking if p == current before checking the
    CAP_SYS_NICE value.

    This patch applies on top of my two previous patches currently in -mm

    Link: http://lkml.kernel.org/r/1471906870-28624-1-git-send-email-john.stultz@linaro.org
    Signed-off-by: John Stultz
    Acked-by: Kees Cook
    Cc: "Serge E. Hallyn"
    Cc: Thomas Gleixner
    Cc: Arjan van de Ven
    Cc: Oren Laadan
    Cc: Ruchi Kandoi
    Cc: Rom Lemarchand
    Cc: Todd Kjos
    Cc: Colin Cross
    Cc: Nick Kralevich
    Cc: Dmitry Shmidt
    Cc: Elliott Hughes
    Cc: Android Kernel Team
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Stultz
     
  • As requested, this patch checks the existing LSM hooks
    task_getscheduler/task_setscheduler when reading or modifying the task's
    timerslack value.

    Previous versions added new get/settimerslack LSM hooks, but since they
    checked the same PROCESS__SET/GETSCHED values as existing hooks, it was
    suggested we just use the existing ones.

    Link: http://lkml.kernel.org/r/1469132667-17377-2-git-send-email-john.stultz@linaro.org
    Signed-off-by: John Stultz
    Cc: Kees Cook
    Cc: "Serge E. Hallyn"
    Cc: Thomas Gleixner
    Cc: Arjan van de Ven
    Cc: Oren Laadan
    Cc: Ruchi Kandoi
    Cc: Rom Lemarchand
    Cc: Todd Kjos
    Cc: Colin Cross
    Cc: Nick Kralevich
    Cc: Dmitry Shmidt
    Cc: Elliott Hughes
    Cc: James Morris
    Cc: Android Kernel Team
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Stultz
     
  • When an interface to allow a task to change another tasks timerslack was
    first proposed, it was suggested that something greater then
    CAP_SYS_NICE would be needed, as a task could be delayed further then
    what normally could be done with nice adjustments.

    So CAP_SYS_PTRACE was adopted instead for what became the
    /proc//timerslack_ns interface. However, for Android (where this
    feature originates), giving the system_server CAP_SYS_PTRACE would allow
    it to observe and modify all tasks memory. This is considered too high
    a privilege level for only needing to change the timerslack.

    After some discussion, it was realized that a CAP_SYS_NICE process can
    set a task as SCHED_FIFO, so they could fork some spinning processes and
    set them all SCHED_FIFO 99, in effect delaying all other tasks for an
    infinite amount of time.

    So as a CAP_SYS_NICE task can already cause trouble for other tasks,
    using it as a required capability for accessing and modifying
    /proc//timerslack_ns seems sufficient.

    Thus, this patch loosens the capability requirements to CAP_SYS_NICE and
    removes CAP_SYS_PTRACE, simplifying some of the code flow as well.

    This is technically an ABI change, but as the feature just landed in
    4.6, I suspect no one is yet using it.

    Link: http://lkml.kernel.org/r/1469132667-17377-1-git-send-email-john.stultz@linaro.org
    Signed-off-by: John Stultz
    Reviewed-by: Nick Kralevich
    Acked-by: Serge Hallyn
    Acked-by: Kees Cook
    Cc: Kees Cook
    Cc: "Serge E. Hallyn"
    Cc: Thomas Gleixner
    Cc: Arjan van de Ven
    Cc: Oren Laadan
    Cc: Ruchi Kandoi
    Cc: Rom Lemarchand
    Cc: Todd Kjos
    Cc: Colin Cross
    Cc: Nick Kralevich
    Cc: Dmitry Shmidt
    Cc: Elliott Hughes
    Cc: Android Kernel Team
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Stultz
     
  • Use a specific routine to emit most lines so that the code is easier to
    read and maintain.

    akpm:
    text data bss dec hex filename
    2976 8 0 2984 ba8 fs/proc/meminfo.o before
    2669 8 0 2677 a75 fs/proc/meminfo.o after

    Link: http://lkml.kernel.org/r/8fce7fdef2ba081a4ef531594e97da8a9feebb58.1470810406.git.joe@perches.com
    Signed-off-by: Joe Perches
    Cc: Andi Kleen
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Allow some seq_puts removals by taking a string instead of a single
    char.

    [akpm@linux-foundation.org: update vmstat_show(), per Joe]
    Link: http://lkml.kernel.org/r/667e1cf3d436de91a5698170a1e98d882905e956.1470704995.git.joe@perches.com
    Signed-off-by: Joe Perches
    Cc: Joe Perches
    Cc: Andi Kleen
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • top(1) opens the following files for every PID:

    /proc/*/stat
    /proc/*/statm
    /proc/*/status

    This patch switches /proc/*/status away from seq_printf().
    The result is 13.5% speedup.

    Benchmark is open("/proc/self/status")+read+close 1.000.000 million times.

    BEFORE
    $ perf stat -r 10 taskset -c 3 ./proc-self-status

    Performance counter stats for 'taskset -c 3 ./proc-self-status' (10 runs):

    10748.474301 task-clock (msec) # 0.954 CPUs utilized ( +- 0.91% )
    12 context-switches # 0.001 K/sec ( +- 1.09% )
    1 cpu-migrations # 0.000 K/sec
    104 page-faults # 0.010 K/sec ( +- 0.45% )
    37,424,127,876 cycles # 3.482 GHz ( +- 0.04% )
    8,453,010,029 stalled-cycles-frontend # 22.59% frontend cycles idle ( +- 0.12% )
    3,747,609,427 stalled-cycles-backend # 10.01% backend cycles idle ( +- 0.68% )
    65,632,764,147 instructions # 1.75 insn per cycle
    # 0.13 stalled cycles per insn ( +- 0.00% )
    13,981,324,775 branches # 1300.773 M/sec ( +- 0.00% )
    138,967,110 branch-misses # 0.99% of all branches ( +- 0.18% )

    11.263885428 seconds time elapsed ( +- 0.04% )
    ^^^^^^^^^^^^

    AFTER
    $ perf stat -r 10 taskset -c 3 ./proc-self-status

    Performance counter stats for 'taskset -c 3 ./proc-self-status' (10 runs):

    9010.521776 task-clock (msec) # 0.925 CPUs utilized ( +- 1.54% )
    11 context-switches # 0.001 K/sec ( +- 1.54% )
    1 cpu-migrations # 0.000 K/sec ( +- 11.11% )
    103 page-faults # 0.011 K/sec ( +- 0.60% )
    32,352,310,603 cycles # 3.591 GHz ( +- 0.07% )
    7,849,199,578 stalled-cycles-frontend # 24.26% frontend cycles idle ( +- 0.27% )
    3,269,738,842 stalled-cycles-backend # 10.11% backend cycles idle ( +- 0.73% )
    56,012,163,567 instructions # 1.73 insn per cycle
    # 0.14 stalled cycles per insn ( +- 0.00% )
    11,735,778,795 branches # 1302.453 M/sec ( +- 0.00% )
    98,084,459 branch-misses # 0.84% of all branches ( +- 0.28% )

    9.741247736 seconds time elapsed ( +- 0.07% )
    ^^^^^^^^^^^

    Link: http://lkml.kernel.org/r/20160806125608.GB1187@p183.telecom.by
    Signed-off-by: Alexey Dobriyan
    Cc: Joe Perches
    Cc: Andi Kleen
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • When the huge page is added to the page cahce (huge_add_to_page_cache),
    the page private flag will be cleared. since this code
    (remove_inode_hugepages) will only be called for pages in the page
    cahce, PagePrivate(page) will always be false.

    The patch remove the code without any functional change.

    Link: http://lkml.kernel.org/r/1475113323-29368-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Mike Kravetz
    Tested-by: Mike Kravetz
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
     
  • Avoid making ifdef get pretty unwieldy if many ARCHs support gigantic
    page. No functional change with this patch.

    Link: http://lkml.kernel.org/r/1475227569-63446-2-git-send-email-xieyisheng1@huawei.com
    Signed-off-by: Yisheng Xie
    Suggested-by: Michal Hocko
    Acked-by: Michal Hocko
    Acked-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Hanjun Guo
    Cc: Will Deacon
    Cc: Dave Hansen
    Cc: Sudeep Holla
    Cc: Catalin Marinas
    Cc: Mark Rutland
    Cc: Rob Herring
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yisheng Xie
     
  • After using the offset of the swap entry as the key of the swap cache,
    the page_index() becomes exactly same as page_file_index(). So the
    page_file_index() is removed and the callers are changed to use
    page_index() instead.

    Link: http://lkml.kernel.org/r/1473270649-27229-2-git-send-email-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Trond Myklebust
    Cc: Anna Schumaker
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Cc: Dan Williams
    Cc: Joonsoo Kim
    Cc: Ross Zwisler
    Cc: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • The global zero page is used to satisfy an anonymous read fault. If
    THP(Transparent HugePage) is enabled then the global huge zero page is
    used. The global huge zero page uses an atomic counter for reference
    counting and is allocated/freed dynamically according to its counter
    value.

    CPU time spent on that counter will greatly increase if there are a lot
    of processes doing anonymous read faults. This patch proposes a way to
    reduce the access to the global counter so that the CPU load can be
    reduced accordingly.

    To do this, a new flag of the mm_struct is introduced:
    MMF_USED_HUGE_ZERO_PAGE. With this flag, the process only need to touch
    the global counter in two cases:

    1 The first time it uses the global huge zero page;
    2 The time when mm_user of its mm_struct reaches zero.

    Note that right now, the huge zero page is eligible to be freed as soon
    as its last use goes away. With this patch, the page will not be
    eligible to be freed until the exit of the last process from which it
    was ever used.

    And with the use of mm_user, the kthread is not eligible to use huge
    zero page either. Since no kthread is using huge zero page today, there
    is no difference after applying this patch. But if that is not desired,
    I can change it to when mm_count reaches zero.

    Case used for test on Haswell EP:

    usemem -n 72 --readonly -j 0x200000 100G

    Which spawns 72 processes and each will mmap 100G anonymous space and
    then do read only access to that space sequentially with a step of 2MB.

    CPU cycles from perf report for base commit:
    54.03% usemem [kernel.kallsyms] [k] get_huge_zero_page
    CPU cycles from perf report for this commit:
    0.11% usemem [kernel.kallsyms] [k] mm_get_huge_zero_page

    Performance(throughput) of the workload for base commit: 1784430792
    Performance(throughput) of the workload for this commit: 4726928591
    164% increase.

    Runtime of the workload for base commit: 707592 us
    Runtime of the workload for this commit: 303970 us
    50% drop.

    Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.com
    Signed-off-by: Aaron Lu
    Cc: Sergey Senozhatsky
    Cc: "Kirill A. Shutemov"
    Cc: Dave Hansen
    Cc: Tim Chen
    Cc: Huang Ying
    Cc: Vlastimil Babka
    Cc: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Ebru Akagunduz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Lu
     
  • Trying to walk all of virtual memory requires architecture specific
    knowledge. On x86_64, addresses must be sign extended from bit 48,
    whereas on arm64 the top VA_BITS of address space have their own set of
    page tables.

    clear_refs_write() calls walk_page_range() on the range 0 to ~0UL, it
    provides a test_walk() callback that only expects to be walking over
    VMAs. Currently walk_pmd_range() will skip memory regions that don't
    have a VMA, reporting them as a hole.

    As this call only expects to walk user address space, make it walk 0 to
    'highest_vm_end'.

    Link: http://lkml.kernel.org/r/1472655792-22439-1-git-send-email-james.morse@arm.com
    Signed-off-by: James Morse
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    James Morse
     
  • To support DAX pmd mappings with unmodified applications, filesystems
    need to align an mmap address by the pmd size.

    Call thp_get_unmapped_area() from f_op->get_unmapped_area.

    Note, there is no change in behavior for a non-DAX file.

    Link: http://lkml.kernel.org/r/1472497881-9323-3-git-send-email-toshi.kani@hpe.com
    Signed-off-by: Toshi Kani
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Kirill A. Shutemov
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Theodore Ts'o
    Cc: Andreas Dilger
    Cc: Mike Kravetz
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
     
  • The extern struct variable ocfs2_inode_cache is not defined. It meant to
    use ocfs2_inode_cachep defined in super.c, I think. Fortunately it is
    not used anywhere now, so no impact actually. Clean it up to fix this
    mistake.

    Link: http://lkml.kernel.org/r/57E1E49D.8050503@huawei.com
    Signed-off-by: Joseph Qi
    Reviewed-by: Eric Ren
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • The workqueue "dlm_worker" queues a single work item &dlm->dispatched_work
    and thus it doesn't require execution ordering. Hence, alloc_workqueue
    has been used to replace the deprecated create_singlethread_workqueue
    instance.

    The WQ_MEM_RECLAIM flag has been set to ensure forward progress under
    memory pressure.

    Since there are fixed number of work items, explicit concurrency
    limit is unnecessary here.

    Link: http://lkml.kernel.org/r/2b5ad8d6688effe1a9ddb2bc2082d26fbbe00302.1472590094.git.bhaktipriya96@gmail.com
    Signed-off-by: Bhaktipriya Shridhar
    Acked-by: Tejun Heo
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bhaktipriya Shridhar
     
  • The workqueue "ocfs2_wq" queues multiple work items viz
    &osb->la_enable_wq, &journal->j_recovery_work, &os->os_orphan_scan_work,
    &osb->osb_truncate_log_wq which require strict execution ordering. Hence,
    an ordered dedicated workqueue has been used.

    WQ_MEM_RECLAIM has been set to ensure forward progress under memory
    pressure because the workqueue is being used on a memory reclaim path.

    Link: http://lkml.kernel.org/r/66279de510a7f4cfc6e386d99b7e04b3f65fb11b.1472590094.git.bhaktipriya96@gmail.com
    Signed-off-by: Bhaktipriya Shridhar
    Acked-by: Tejun Heo
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bhaktipriya Shridhar
     
  • The workqueue "o2net_wq" queues multiple work items viz
    &old_sc->sc_shutdown_work, &sc->sc_rx_work, &sc->sc_connect_work which
    require strict execution ordering. Hence, an ordered dedicated
    workqueue has been used.

    WQ_MEM_RECLAIM has been set to ensure forward progress under memory
    pressure.

    Link: http://lkml.kernel.org/r/ddc12e5766c79ba26f8a00d98049107f8a1d4866.1472590094.git.bhaktipriya96@gmail.com
    Signed-off-by: Bhaktipriya Shridhar
    Acked-by: Tejun Heo
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bhaktipriya Shridhar
     
  • The workqueue "user_dlm_worker" queues a single work item
    &lockres->l_work per user_lock_res instance and so it doesn't require
    execution ordering. Hence, alloc_workqueue has been used to replace the
    deprecated create_singlethread_workqueue instance.

    The WQ_MEM_RECLAIM flag has been set to ensure forward progress under
    memory pressure.

    Since there are fixed number of work items, explicit concurrency
    limit is unnecessary here.

    Link: http://lkml.kernel.org/r/9748136d3a3b18138ad1d6ba708367aa1fe9f98c.1472590094.git.bhaktipriya96@gmail.com
    Signed-off-by: Bhaktipriya Shridhar
    Acked-by: Tejun Heo
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bhaktipriya Shridhar
     
  • Use assert_spin_locked() macro instead of hand-made BUG_ON statements.

    Link: http://lkml.kernel.org/r/1474537439-18919-1-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Suggested-by: Heiner Kallweit
    Reviewed-by: Jeff Layton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • When freeing permission events by fsnotify_destroy_event(), the warning
    WARN_ON(!list_empty(&event->list)); may falsely hit.

    This is because although fanotify_get_response() saw event->response
    set, there is nothing to make sure the current CPU also sees the removal
    of the event from the list. Add proper locking around the WARN_ON() to
    avoid the false warning.

    Link: http://lkml.kernel.org/r/1473797711-14111-7-git-send-email-jack@suse.cz
    Reported-by: Miklos Szeredi
    Signed-off-by: Jan Kara
    Reviewed-by: Lino Sanfilippo
    Cc: Eric Paris
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Fanotify code has its own lock (access_lock) to protect a list of events
    waiting for a response from userspace.

    However this is somewhat awkward as the same list_head in the event is
    protected by notification_lock if it is part of the notification queue
    and by access_lock if it is part of the fanotify private queue which
    makes it difficult for any reliable checks in the generic code. So make
    fanotify use the same lock - notification_lock - for protecting its
    private event list.

    Link: http://lkml.kernel.org/r/1473797711-14111-6-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Lino Sanfilippo
    Cc: Miklos Szeredi
    Cc: Eric Paris
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara