08 Oct, 2016

40 commits

  • Git can be told to apply language-specific rules when generating diffs.
    Enable this for C source code files (*.c and *.h) so that function names
    are printed right. Specifically, doing so prevents "git diff" from
    mistakenly considering unindented goto labels as function names.
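
As a rough illustration, a `.gitattributes` entry of the kind described above might look like the fragment below. The `cpp` name is git's built-in diff driver for C and C++; the exact patterns and spacing used by the commit are an assumption here.

```
*.c   diff=cpp
*.h   diff=cpp
```

With such an entry in place, hunk headers in "git diff" output are derived from the driver's function-name rules rather than from the first unindented line.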

    Link: http://lkml.kernel.org/r/20160907143403.1449324f@endymion
    Signed-off-by: Jean Delvare
    Cc: Peter Zijlstra
    Cc: Joe Perches
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jean Delvare
     
  • The declarations of arch-specific functions have been moved to a common
    header in commit 3820b4d2789f ('uprobes: Move function declarations out
    of arch'), but MIPS and S390 have added them to their own trees later.
    Remove the unnecessary duplicates.

    Link: http://lkml.kernel.org/r/1472804384-17830-1-git-send-email-marcin.nowakowski@imgtec.com
    Signed-off-by: Marcin Nowakowski
    Acked-by: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marcin Nowakowski
     
  • No need to correct the correct.

    Link: http://lkml.kernel.org/r/1472490791.3425.38.camel@perches.com
    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • When doing an nmi backtrace of many cores, most of which are idle, the
    output is a little overwhelming and very uninformative. Suppress
    messages for cpus that are idling when they are interrupted and just
    emit one line, "NMI backtrace for N skipped: idling at pc 0xNNN".

    We do this by grouping all the cpuidle code together into a new
    .cpuidle.text section, and then checking the address of the interrupted
    PC to see if it lies within that section.
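
The address check described above can be sketched in plain C as follows. This is a minimal userspace model, not the kernel code: the `text_section` struct and both function names are hypothetical, and in a real kernel the bounds would come from linker-provided section symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical section bounds; a kernel would take these from linker symbols. */
struct text_section {
    uintptr_t start;  /* first address inside the section */
    uintptr_t end;    /* one past the last address inside the section */
};

/* Return true when pc falls inside the half-open range [start, end). */
bool pc_in_section(const struct text_section *sec, uintptr_t pc)
{
    return pc >= sec->start && pc < sec->end;
}

/* Decide whether the backtrace for a cpu can be reduced to a single line. */
bool should_skip_backtrace(const struct text_section *idle_text, uintptr_t pc)
{
    return pc_in_section(idle_text, pc);
}
```

Because all idle routines are grouped into one contiguous section, a single range comparison is enough; no per-function table is needed.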

    This commit suitably tags x86 and tile idle routines, and only adds in
    the minimal framework for other architectures.

    Link: http://lkml.kernel.org/r/1472487169-14923-5-git-send-email-cmetcalf@mellanox.com
    Signed-off-by: Chris Metcalf
    Acked-by: Peter Zijlstra (Intel)
    Tested-by: Peter Zijlstra (Intel)
    Tested-by: Daniel Thompson [arm]
    Tested-by: Petr Mladek
    Cc: Aaron Tomlin
    Cc: Peter Zijlstra (Intel)
    Cc: "Rafael J. Wysocki"
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Metcalf
     
  • Previously tile was rolling its own method of capturing backtrace data
    in the NMI handlers, but it was relying on running printk() from the NMI
    handler, which is not always safe. So adopt the nmi_backtrace model
    (with the new cpumask extension) instead.

    So that we can call the nmi_backtrace code directly from the NMI handler,
    move the nmi_enter()/exit() into the top-level tile NMI handler.

    The semantics of the routine change slightly since it is now synchronous
    with the remote cores completing the backtraces. Previously it was
    asynchronous, but with protection to avoid starting a new remote
    backtrace if the old one was still in progress.

    Link: http://lkml.kernel.org/r/1472487169-14923-4-git-send-email-cmetcalf@mellanox.com
    Signed-off-by: Chris Metcalf
    Cc: Daniel Thompson [arm]
    Cc: Petr Mladek
    Cc: Aaron Tomlin
    Cc: Peter Zijlstra (Intel)
    Cc: "Rafael J. Wysocki"
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Metcalf
     
  • Currently on arm there is code that checks whether it should call
    dump_stack() explicitly, to avoid trying to raise an NMI when the
    current context is not preemptible by the backtrace IPI. Similarly, the
    forthcoming arch/tile support uses an IPI mechanism that does not
    support generating an NMI to self.

    Accordingly, move the code that guards this case into the generic
    mechanism, and invoke it unconditionally whenever we want a backtrace of
    the current cpu. It seems plausible that in all cases, dump_stack()
    will generate better information than generating a stack from the NMI
    handler. The register state will be missing, but that state is likely
    not particularly helpful in any case.

    Or, if we think it is helpful, we should be capturing and emitting the
    current register state in all cases when regs == NULL is passed to
    nmi_cpu_backtrace().

    Link: http://lkml.kernel.org/r/1472487169-14923-3-git-send-email-cmetcalf@mellanox.com
    Signed-off-by: Chris Metcalf
    Tested-by: Daniel Thompson [arm]
    Reviewed-by: Petr Mladek
    Acked-by: Aaron Tomlin
    Cc: "Rafael J. Wysocki"
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Metcalf
     
  • Patch series "improvements to the nmi_backtrace code" v9.

    This patch series modifies the trigger_xxx_backtrace() NMI-based remote
    backtracing code to make it more flexible, and makes a few small
    improvements along the way.

    The motivation comes from the task isolation code, where there are
    scenarios where we want to be able to diagnose a case where some cpu is
    about to interrupt a task-isolated cpu. It can be helpful to see both
    where the interrupting cpu is, and also an approximation of where the
    cpu that is being interrupted is. The nmi_backtrace framework allows us
    to discover the stack of the interrupted cpu.

    I've tested that the change works as desired on tile, and build-tested
    x86, arm, mips, and sparc64. For x86 I confirmed that the generic
    cpuidle stuff as well as the architecture-specific routines are in the
    new cpuidle section. For arm, mips, and sparc I just build-tested it
    and made sure the generic cpuidle routines were in the new cpuidle
    section, but I didn't attempt to figure out which the platform-specific
    idle routines might be. That might be more usefully done by someone
    with platform experience in follow-up patches.

    This patch (of 4):

    Currently you can only request a backtrace of either all cpus, or all
    cpus but yourself. It can also be helpful to request a remote backtrace
    of a single cpu, and since we want that, the logical extension is to
    support a cpumask as the underlying primitive.
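
To see why a mask is the natural primitive, here is a toy model in which "all", "all but self" and "single cpu" are just three ways of building the same kind of value. The `toy_cpumask_t` type and the helper names are made up for illustration and are limited to 64 cpus; the kernel's real cpumask type is not shown here.

```c
#include <assert.h>
#include <stdint.h>

/* Toy cpumask: bit n set means cpu n is included (at most 64 cpus). */
typedef uint64_t toy_cpumask_t;

/* Mask with the lowest nr_cpus bits set, i.e. "all cpus". */
toy_cpumask_t mask_all(unsigned nr_cpus)
{
    return nr_cpus >= 64 ? UINT64_MAX : ((UINT64_C(1) << nr_cpus) - 1);
}

/* Mask selecting exactly one cpu; cpu is assumed to be below 64. */
toy_cpumask_t mask_single(unsigned cpu)
{
    return UINT64_C(1) << cpu;
}

/* "All but self" is simply "all" with one bit cleared. */
toy_cpumask_t mask_all_but_self(unsigned nr_cpus, unsigned self)
{
    return mask_all(nr_cpus) & ~mask_single(self);
}
```

A backtrace routine that accepts such a mask covers every one of the earlier request types without needing a separate entry point for each.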

    This change modifies the existing lib/nmi_backtrace.c code to take a
    cpumask as its basic primitive, and modifies the linux/nmi.h code to use
    the new "cpumask" method instead.

    The existing clients of nmi_backtrace (arm and x86) are converted to
    using the new cpumask approach in this change.

    The other users of the backtracing API (sparc64 and mips) are converted
    to use the cpumask approach rather than the all/allbutself approach.
    The mips code ignored the "include_self" boolean but with this change it
    will now also dump a local backtrace if requested.

    Link: http://lkml.kernel.org/r/1472487169-14923-2-git-send-email-cmetcalf@mellanox.com
    Signed-off-by: Chris Metcalf
    Tested-by: Daniel Thompson [arm]
    Reviewed-by: Aaron Tomlin
    Reviewed-by: Petr Mladek
    Cc: "Rafael J. Wysocki"
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Ralf Baechle
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Metcalf
     
  • Currently, when min/max are nested within themselves, sparse will warn:

    warning: symbol '_min1' shadows an earlier one
    originally declared here
    warning: symbol '_min1' shadows an earlier one
    originally declared here
    warning: symbol '_min2' shadows an earlier one
    originally declared here

    This also immediately happens when min3() or max3() are used.

    Since sparse implements __COUNTER__, we can use __UNIQUE_ID() to
    generate unique variable names, avoiding this.
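
A minimal sketch of the idea, assuming a compiler with GNU C extensions (statement expressions, `__typeof__`, `__COUNTER__`). `UNIQUE_NAME`, `MIN_IMPL` and `MIN` are hypothetical stand-ins and not the kernel's actual macros:

```c
#include <assert.h>

#define PASTE_(a, b) a##b
#define PASTE(a, b) PASTE_(a, b)
/* Hypothetical stand-in for __UNIQUE_ID(): prefix plus a running counter. */
#define UNIQUE_NAME(prefix) PASTE(prefix, __COUNTER__)

/* a and b are the names of the temporaries, chosen by the caller. */
#define MIN_IMPL(x, y, a, b) ({      \
    __typeof__(x) a = (x);           \
    __typeof__(y) b = (y);           \
    a < b ? a : b; })

/* Every expansion of MIN() gets a fresh pair of temporary names. */
#define MIN(x, y) MIN_IMPL(x, y, UNIQUE_NAME(min_a_), UNIQUE_NAME(min_b_))

/* Nested use: the inner and outer temporaries no longer share a name. */
int min_of_three(int p, int q, int r)
{
    return MIN(MIN(p, q), r);
}
```

Since each expansion draws new counter values, a nested invocation declares differently named temporaries and nothing is shadowed.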

    Link: http://lkml.kernel.org/r/1471519773-29882-1-git-send-email-johannes@sipsolutions.net
    Signed-off-by: Johannes Berg
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Berg
     
  • Add some more description on the limitations for smaps/maps readings, as
    well as some guarantees we can make.

    Link: http://lkml.kernel.org/r/1475296958-27652-2-git-send-email-robert.hu@intel.com
    Signed-off-by: Robert Ho
    Acked-by: Michal Hocko
    Cc: Dave Hansen
    Cc: Xiao Guangrong
    Cc: Robert Hu
    Cc: Oleg Nesterov
    Cc: Paolo Bonzini
    Cc: Dan Williams
    Cc: Gleb Natapov
    Cc: Marcelo Tosatti
    Cc: Stefan Hajnoczi
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert Ho
     
  • Recently, Red Hat reported that the nvml test suite failed on QEMU/KVM;
    for more details, please refer to:

    https://bugzilla.redhat.com/show_bug.cgi?id=1365721

    Actually, this bug is not specific to NVDIMM/DAX; it affects any other
    file system as well. This simple test case, abstracted from nvml, can
    easily reproduce the bug in a common environment:

    -------------------------- testcase.c -----------------------------

    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define PROCMAXLEN 4096 /* assumed value; not given in the original */
    #define NTHREAD 16      /* assumed value; not given in the original */

    int
    is_pmem_proc(const void *addr, size_t len)
    {
        const char *caddr = addr;

        FILE *fp;
        if ((fp = fopen("/proc/self/smaps", "r")) == NULL) {
            printf("!/proc/self/smaps");
            return 0;
        }

        int retval = 0; /* assume false until proven otherwise */
        char line[PROCMAXLEN]; /* for fgets() */
        char *lo = NULL; /* beginning of current range in smaps file */
        char *hi = NULL; /* end of current range in smaps file */
        int needmm = 0; /* looking for mm flag for current range */
        while (fgets(line, PROCMAXLEN, fp) != NULL) {
            static const char vmflags[] = "VmFlags:";
            static const char mm[] = " wr";

            /* check for range line */
            if (sscanf(line, "%p-%p", &lo, &hi) == 2) {
                if (needmm) {
                    /* last range matched, but no mm flag found */
                    printf("never found mm flag.\n");
                    break;
                } else if (caddr < lo) {
                    /* never found the range for caddr */
                    printf("#######no match for addr %p.\n", caddr);
                    break;
                } else if (caddr < hi) {
                    /* start address is in this range */
                    size_t rangelen = (size_t)(hi - caddr);

                    /* remember that matching has started */
                    needmm = 1;

                    /* calculate remaining range to search for */
                    if (len > rangelen) {
                        len -= rangelen;
                        caddr += rangelen;
                        printf("matched %zu bytes in range "
                               "%p-%p, %zu left over.\n",
                               rangelen, lo, hi, len);
                    } else {
                        len = 0;
                        printf("matched all bytes in range "
                               "%p-%p.\n", lo, hi);
                    }
                }
            } else if (needmm && strncmp(line, vmflags,
                                         sizeof(vmflags) - 1) == 0) {
                if (strstr(&line[sizeof(vmflags) - 1], mm) != NULL) {
                    printf("mm flag found.\n");
                    if (len == 0) {
                        /* entire range matched */
                        retval = 1;
                        break;
                    }
                    needmm = 0; /* saw what was needed */
                } else {
                    /* mm flag not set for some or all of range */
                    printf("range has no mm flag.\n");
                    break;
                }
            }
        }

        fclose(fp);

        printf("returning %d.\n", retval);
        return retval;
    }

    void *Addr;
    size_t Size;

    /*
     * worker -- the work each thread performs
     */
    static void *
    worker(void *arg)
    {
        int *ret = (int *)arg;
        *ret = is_pmem_proc(Addr, Size);
        return NULL;
    }

    int main(int argc, char *argv[])
    {
        if (argc < 2 || argc > 3) {
            printf("usage: %s file [env].\n", argv[0]);
            return -1;
        }

        int fd = open(argv[1], O_RDWR);

        struct stat stbuf;
        fstat(fd, &stbuf);

        Size = stbuf.st_size;
        Addr = mmap(0, stbuf.st_size, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);

        close(fd);

        pthread_t threads[NTHREAD];
        int ret[NTHREAD];

        /* kick off NTHREAD threads */
        for (int i = 0; i < NTHREAD; i++)
            pthread_create(&threads[i], NULL, worker, &ret[i]);

        /* wait for all the threads to complete */
        for (int i = 0; i < NTHREAD; i++)
            pthread_join(threads[i], NULL);

        /* verify that all the threads return the same value */
        for (int i = 1; i < NTHREAD; i++) {
            if (ret[0] != ret[i]) {
                printf("Error i %d ret[0] = %d ret[i] = %d.\n", i,
                       ret[0], ret[i]);
            }
        }

        printf("%d", ret[0]);
        return 0;
    }

    The test fails because some threads cannot find, in "/proc/self/smaps",
    the memory region which is allocated in the main process.

    This is caused by proc fs, which uses 'file->version' to record the last
    VMA that has already been handled by the read() system call. When the
    next read() is issued, it uses 'version' to find that VMA, and the VMA
    following it is the one we want to handle. The related code is as follows:

    if (last_addr) {
        vma = find_vma(mm, last_addr);
        if (vma && (vma = m_next_vma(priv, vma)))
            return vma;
    }

    However, a VMA will be lost if the last handled VMA is gone, e.g.:

    The process VMA list is A->B->C->D

    CPU 0                                    CPU 1
    read() system call
       handle VMA B
       version = B
    return to userspace

                                             unmap VMA B

    issue read() again to continue to get
    the region info
       find_vma(version) will get VMA C
       m_next_vma(C) will get VMA D
       handle D
       !!! VMA C is lost !!!

    In order to fix this bug, we make 'file->version' indicate the end address
    of the current VMA. m_start will then look up a vma with vma_start
    < last_vm_end and move on to the next vma if it finds the same or an
    overlapping vma. This guarantees that we will not miss an exclusive
    vma, but we can still miss one if the previous vma was shrunk. This is
    acceptable because guaranteeing "never miss a vma" is simply not feasible.
    Users have to cope with some inconsistencies if the file is not read in
    one go.
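
The resume rule can be modelled in userspace roughly as follows. This is a sketch under stated assumptions, not the kernel implementation: `toy_vma`, `toy_find_vma` and `toy_resume` are hypothetical names, the mappings live in a sorted array, and the comparison follows the wording of the description above.

```c
#include <assert.h>
#include <stddef.h>

struct toy_vma {
    unsigned long start;  /* inclusive */
    unsigned long end;    /* exclusive */
};

/* Like find_vma(): first mapping whose end lies above addr, or NULL.
 * The array is assumed to be sorted by address and non-overlapping. */
const struct toy_vma *toy_find_vma(const struct toy_vma *v, size_t n,
                                   unsigned long addr)
{
    for (size_t i = 0; i < n; i++)
        if (v[i].end > addr)
            return &v[i];
    return NULL;
}

/* Pick the mapping to report next, given the end address of the mapping
 * reported last (0 means nothing has been reported yet). */
const struct toy_vma *toy_resume(const struct toy_vma *v, size_t n,
                                 unsigned long last_end)
{
    const struct toy_vma *vma;

    if (!last_end)
        return n ? &v[0] : NULL;

    vma = toy_find_vma(v, n, last_end - 1);
    /* Same or overlapping mapping was already reported: step past it. */
    if (vma && vma->start < last_end)
        vma = (vma + 1 < v + n) ? vma + 1 : NULL;
    return vma;
}
```

In the A->B->C->D scenario, resuming with the end address of B returns C whether or not B has been unmapped in the meantime, which is the case the old lookup got wrong.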

    [mhocko@suse.com: changelog fixes]
    Link: http://lkml.kernel.org/r/1475296958-27652-1-git-send-email-robert.hu@intel.com
    Acked-by: Dave Hansen
    Signed-off-by: Xiao Guangrong
    Signed-off-by: Robert Hu
    Acked-by: Michal Hocko
    Acked-by: Oleg Nesterov
    Cc: Paolo Bonzini
    Cc: Dan Williams
    Cc: Gleb Natapov
    Cc: Marcelo Tosatti
    Cc: Stefan Hajnoczi
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert Ho
     
  • In changing from checking ptrace_may_access(p, PTRACE_MODE_ATTACH_FSCREDS)
    to capable(CAP_SYS_NICE), I missed that ptrace_may_access() succeeds when
    p == current, but the CAP_SYS_NICE check doesn't.

    Thus while the previous commit was intended to loosen the needed
    privileges to modify a process's timerslack, it needlessly restricted a
    task modifying its own timerslack via the proc//timerslack_ns
    (which is permitted also via the PR_SET_TIMERSLACK method).

    This patch corrects this by checking if p == current before checking the
    CAP_SYS_NICE value.
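
The ordering of the two checks can be sketched like this. It is a simplified model: the function name and its boolean inputs are hypothetical and stand in for the `p == current` comparison and the capability lookup.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* is_self: the target task is the caller itself.
 * has_cap_sys_nice: the caller holds CAP_SYS_NICE.
 * Both are hypothetical inputs used only for this sketch. */
int timerslack_access_check(bool is_self, bool has_cap_sys_nice)
{
    if (is_self)
        return 0;        /* a task may always adjust its own timerslack */
    if (!has_cap_sys_nice)
        return -EPERM;   /* touching another task needs the capability */
    return 0;
}
```

Testing for self first means an unprivileged task is never refused access to its own value.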

    This patch applies on top of my two previous patches currently in -mm

    Link: http://lkml.kernel.org/r/1471906870-28624-1-git-send-email-john.stultz@linaro.org
    Signed-off-by: John Stultz
    Acked-by: Kees Cook
    Cc: "Serge E. Hallyn"
    Cc: Thomas Gleixner
    Cc: Arjan van de Ven
    Cc: Oren Laadan
    Cc: Ruchi Kandoi
    Cc: Rom Lemarchand
    Cc: Todd Kjos
    Cc: Colin Cross
    Cc: Nick Kralevich
    Cc: Dmitry Shmidt
    Cc: Elliott Hughes
    Cc: Android Kernel Team
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Stultz
     
  • As requested, this patch checks the existing LSM hooks
    task_getscheduler/task_setscheduler when reading or modifying the task's
    timerslack value.

    Previous versions added new get/settimerslack LSM hooks, but since they
    checked the same PROCESS__SET/GETSCHED values as existing hooks, it was
    suggested we just use the existing ones.

    Link: http://lkml.kernel.org/r/1469132667-17377-2-git-send-email-john.stultz@linaro.org
    Signed-off-by: John Stultz
    Cc: Kees Cook
    Cc: "Serge E. Hallyn"
    Cc: Thomas Gleixner
    Cc: Arjan van de Ven
    Cc: Oren Laadan
    Cc: Ruchi Kandoi
    Cc: Rom Lemarchand
    Cc: Todd Kjos
    Cc: Colin Cross
    Cc: Nick Kralevich
    Cc: Dmitry Shmidt
    Cc: Elliott Hughes
    Cc: James Morris
    Cc: Android Kernel Team
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Stultz
     
  • When an interface to allow a task to change another task's timerslack was
    first proposed, it was suggested that something greater than
    CAP_SYS_NICE would be needed, as a task could be delayed further than
    what normally could be done with nice adjustments.

    So CAP_SYS_PTRACE was adopted instead for what became the
    /proc//timerslack_ns interface. However, for Android (where this
    feature originates), giving the system_server CAP_SYS_PTRACE would allow
    it to observe and modify all tasks' memory. This is considered too high
    a privilege level for only needing to change the timerslack.

    After some discussion, it was realized that a CAP_SYS_NICE process can
    set a task as SCHED_FIFO, so they could fork some spinning processes and
    set them all SCHED_FIFO 99, in effect delaying all other tasks for an
    infinite amount of time.

    So as a CAP_SYS_NICE task can already cause trouble for other tasks,
    using it as a required capability for accessing and modifying
    /proc//timerslack_ns seems sufficient.

    Thus, this patch loosens the capability requirements to CAP_SYS_NICE and
    removes CAP_SYS_PTRACE, simplifying some of the code flow as well.

    This is technically an ABI change, but as the feature just landed in
    4.6, I suspect no one is yet using it.

    Link: http://lkml.kernel.org/r/1469132667-17377-1-git-send-email-john.stultz@linaro.org
    Signed-off-by: John Stultz
    Reviewed-by: Nick Kralevich
    Acked-by: Serge Hallyn
    Acked-by: Kees Cook
    Cc: Kees Cook
    Cc: "Serge E. Hallyn"
    Cc: Thomas Gleixner
    Cc: Arjan van de Ven
    Cc: Oren Laadan
    Cc: Ruchi Kandoi
    Cc: Rom Lemarchand
    Cc: Todd Kjos
    Cc: Colin Cross
    Cc: Nick Kralevich
    Cc: Dmitry Shmidt
    Cc: Elliott Hughes
    Cc: Android Kernel Team
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Stultz
     
  • Use a specific routine to emit most lines so that the code is easier to
    read and maintain.

    akpm:
       text    data     bss     dec     hex filename
       2976       8       0    2984     ba8 fs/proc/meminfo.o before
       2669       8       0    2677     a75 fs/proc/meminfo.o after

    Link: http://lkml.kernel.org/r/8fce7fdef2ba081a4ef531594e97da8a9feebb58.1470810406.git.joe@perches.com
    Signed-off-by: Joe Perches
    Cc: Andi Kleen
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Allow some seq_puts removals by taking a string instead of a single
    char.

    [akpm@linux-foundation.org: update vmstat_show(), per Joe]
    Link: http://lkml.kernel.org/r/667e1cf3d436de91a5698170a1e98d882905e956.1470704995.git.joe@perches.com
    Signed-off-by: Joe Perches
    Cc: Joe Perches
    Cc: Andi Kleen
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • top(1) opens the following files for every PID:

    /proc/*/stat
    /proc/*/statm
    /proc/*/status

    This patch switches /proc/*/status away from seq_printf().
    The result is 13.5% speedup.

    Benchmark is open("/proc/self/status")+read+close 1.000.000 times.

    BEFORE
    $ perf stat -r 10 taskset -c 3 ./proc-self-status

    Performance counter stats for 'taskset -c 3 ./proc-self-status' (10 runs):

    10748.474301 task-clock (msec) # 0.954 CPUs utilized ( +- 0.91% )
    12 context-switches # 0.001 K/sec ( +- 1.09% )
    1 cpu-migrations # 0.000 K/sec
    104 page-faults # 0.010 K/sec ( +- 0.45% )
    37,424,127,876 cycles # 3.482 GHz ( +- 0.04% )
    8,453,010,029 stalled-cycles-frontend # 22.59% frontend cycles idle ( +- 0.12% )
    3,747,609,427 stalled-cycles-backend # 10.01% backend cycles idle ( +- 0.68% )
    65,632,764,147 instructions # 1.75 insn per cycle
    # 0.13 stalled cycles per insn ( +- 0.00% )
    13,981,324,775 branches # 1300.773 M/sec ( +- 0.00% )
    138,967,110 branch-misses # 0.99% of all branches ( +- 0.18% )

    11.263885428 seconds time elapsed ( +- 0.04% )
    ^^^^^^^^^^^^

    AFTER
    $ perf stat -r 10 taskset -c 3 ./proc-self-status

    Performance counter stats for 'taskset -c 3 ./proc-self-status' (10 runs):

    9010.521776 task-clock (msec) # 0.925 CPUs utilized ( +- 1.54% )
    11 context-switches # 0.001 K/sec ( +- 1.54% )
    1 cpu-migrations # 0.000 K/sec ( +- 11.11% )
    103 page-faults # 0.011 K/sec ( +- 0.60% )
    32,352,310,603 cycles # 3.591 GHz ( +- 0.07% )
    7,849,199,578 stalled-cycles-frontend # 24.26% frontend cycles idle ( +- 0.27% )
    3,269,738,842 stalled-cycles-backend # 10.11% backend cycles idle ( +- 0.73% )
    56,012,163,567 instructions # 1.73 insn per cycle
    # 0.14 stalled cycles per insn ( +- 0.00% )
    11,735,778,795 branches # 1302.453 M/sec ( +- 0.00% )
    98,084,459 branch-misses # 0.84% of all branches ( +- 0.28% )

    9.741247736 seconds time elapsed ( +- 0.07% )
    ^^^^^^^^^^^

    Link: http://lkml.kernel.org/r/20160806125608.GB1187@p183.telecom.by
    Signed-off-by: Alexey Dobriyan
    Cc: Joe Perches
    Cc: Andi Kleen
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Every current KDE system has a process named ksysguardd polling the files
    below once every several seconds:

    $ strace -e trace=open -p $(pidof ksysguardd)
    Process 1812 attached
    open("/etc/mtab", O_RDONLY|O_CLOEXEC) = 8
    open("/etc/mtab", O_RDONLY|O_CLOEXEC) = 8
    open("/proc/net/dev", O_RDONLY) = 8
    open("/proc/net/wireless", O_RDONLY) = -1 ENOENT (No such file or directory)
    open("/proc/stat", O_RDONLY) = 8
    open("/proc/vmstat", O_RDONLY) = 8

    Hell knows what it is doing but speed up reading /proc/vmstat by 33%!

    Benchmark is open+read+close 1.000.000 times.

    BEFORE
    $ perf stat -r 10 taskset -c 3 ./proc-vmstat

    Performance counter stats for 'taskset -c 3 ./proc-vmstat' (10 runs):

    13146.768464 task-clock (msec) # 0.960 CPUs utilized ( +- 0.60% )
    15 context-switches # 0.001 K/sec ( +- 1.41% )
    1 cpu-migrations # 0.000 K/sec ( +- 11.11% )
    104 page-faults # 0.008 K/sec ( +- 0.57% )
    45,489,799,349 cycles # 3.460 GHz ( +- 0.03% )
    9,970,175,743 stalled-cycles-frontend # 21.92% frontend cycles idle ( +- 0.10% )
    2,800,298,015 stalled-cycles-backend # 6.16% backend cycles idle ( +- 0.32% )
    79,241,190,850 instructions # 1.74 insn per cycle
    # 0.13 stalled cycles per insn ( +- 0.00% )
    17,616,096,146 branches # 1339.956 M/sec ( +- 0.00% )
    176,106,232 branch-misses # 1.00% of all branches ( +- 0.18% )

    13.691078109 seconds time elapsed ( +- 0.03% )
    ^^^^^^^^^^^^

    AFTER
    $ perf stat -r 10 taskset -c 3 ./proc-vmstat

    Performance counter stats for 'taskset -c 3 ./proc-vmstat' (10 runs):

    8688.353749 task-clock (msec) # 0.950 CPUs utilized ( +- 1.25% )
    10 context-switches # 0.001 K/sec ( +- 2.13% )
    1 cpu-migrations # 0.000 K/sec
    104 page-faults # 0.012 K/sec ( +- 0.56% )
    30,384,010,730 cycles # 3.497 GHz ( +- 0.07% )
    12,296,259,407 stalled-cycles-frontend # 40.47% frontend cycles idle ( +- 0.13% )
    3,370,668,651 stalled-cycles-backend # 11.09% backend cycles idle ( +- 0.69% )
    28,969,052,879 instructions # 0.95 insn per cycle
    # 0.42 stalled cycles per insn ( +- 0.01% )
    6,308,245,891 branches # 726.058 M/sec ( +- 0.00% )
    214,685,502 branch-misses # 3.40% of all branches ( +- 0.26% )

    9.146081052 seconds time elapsed ( +- 0.07% )
    ^^^^^^^^^^^

    vsnprintf() is slow because:

    1. format_decode() is busy looking for a format specifier: 2 branches
       per character (not in this case, but in others)

    2. approximately a million branches while parsing the format mini
       language and everywhere

    3. just look at what string() does

    /proc/vmstat is a good case because most of its content is strings.

    Link: http://lkml.kernel.org/r/20160806125455.GA1187@p183.telecom.by
    Signed-off-by: Alexey Dobriyan
    Cc: Joe Perches
    Cc: Andi Kleen
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • This came to light when implementing native 64-bit atomics for ARCv2.

    The atomic64 self-test code uses CONFIG_ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE
    to check whether atomic64_dec_if_positive() is available. It seems it
    was needed when not every arch defined it. However, as of the current
    code the Kconfig option seems needless:

    - for CONFIG_GENERIC_ATOMIC64 it is auto-enabled in lib/Kconfig and a
      generic definition of the API is present in lib/atomic64.c
    - arches with native 64-bit atomics select it in arch/*/Kconfig and
      define the API in their headers

    So I see no point in keeping the Kconfig option.

    Compile tested for:
    - blackfin (CONFIG_GENERIC_ATOMIC64)
    - x86 (!CONFIG_GENERIC_ATOMIC64)
    - ia64

    Link: http://lkml.kernel.org/r/1473703083-8625-3-git-send-email-vgupta@synopsys.com
    Signed-off-by: Vineet Gupta
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Ralf Baechle
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: "David S. Miller"
    Cc: Chris Metcalf
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Vineet Gupta
    Cc: Zhaoxiu Zeng
    Cc: Linus Walleij
    Cc: Alexander Potapenko
    Cc: Andrey Ryabinin
    Cc: Herbert Xu
    Cc: Ming Lin
    Cc: Arnd Bergmann
    Cc: Geert Uytterhoeven
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Andi Kleen
    Cc: Boqun Feng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vineet Gupta
     
  • This is based on the s390 version and is needed to get rid of
    CONFIG_ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE.

    Link: http://lkml.kernel.org/r/1473703083-8625-2-git-send-email-vgupta@synopsys.com
    Signed-off-by: Vineet Gupta
    Reported-by: kbuild test robot
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vineet Gupta
     
  • The macro PAGE_ALIGNED() is error-prone because it doesn't follow the
    convention of parenthesizing its parameter @addr within the macro body.
    For example, given

    unsigned long *ptr = kmalloc(...); PAGE_ALIGNED(ptr + 16);

    the left parameter of macro IS_ALIGNED() should be
    (unsigned long)(ptr + 16), but it is actually (unsigned long)ptr + 16.

    Fix this by simply canonicalizing the PAGE_ALIGNED() macro definition.
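
The difference is easy to reproduce outside the kernel. The sketch below uses simplified stand-in macros (the `TOY_` and `_BUGGY`/`_FIXED` names are hypothetical, and a 4096-byte page is assumed) together with a page-aligned static buffer so that the pointer arithmetic stays within a real object.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PAGE_SIZE 4096UL
#define TOY_IS_ALIGNED(x, a) (((x) & ((a) - 1)) == 0)

/* Unparenthesized parameter: the cast binds to `addr` before any `+`. */
#define PAGE_ALIGNED_BUGGY(addr) TOY_IS_ALIGNED((unsigned long)addr, TOY_PAGE_SIZE)
/* Parenthesized parameter: the whole argument expression is cast. */
#define PAGE_ALIGNED_FIXED(addr) TOY_IS_ALIGNED((unsigned long)(addr), TOY_PAGE_SIZE)

/* Page-aligned scratch buffer so the addresses below are well defined. */
static _Alignas(4096) unsigned long toy_buf[2048];

/* Element located 16 bytes below the first page boundary inside toy_buf. */
unsigned long *toy_ptr(void)
{
    return toy_buf + (TOY_PAGE_SIZE - 16) / sizeof(unsigned long);
}

/* Adds 16 to the integer value of ptr, i.e. 16 bytes. */
bool buggy_says_aligned(unsigned long *ptr)
{
    return PAGE_ALIGNED_BUGGY(ptr + 16);
}

/* Adds 16 elements to ptr first, then converts to an integer. */
bool fixed_says_aligned(unsigned long *ptr)
{
    return PAGE_ALIGNED_FIXED(ptr + 16);
}
```

For the pointer returned by `toy_ptr()`, the buggy macro lands exactly on the page boundary and reports alignment, while the fixed macro checks the address of the element 16 slots further on, which is not page aligned.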

    Link: http://lkml.kernel.org/r/57EA6AE7.7090807@zoho.com
    Signed-off-by: zijun_hu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zijun_hu
     
  • When the huge page is added to the page cache (huge_add_to_page_cache),
    the page private flag will be cleared. Since this code
    (remove_inode_hugepages) will only be called for pages in the page
    cache, PagePrivate(page) will always be false.

    The patch removes the code without any functional change.

    Link: http://lkml.kernel.org/r/1475113323-29368-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Mike Kravetz
    Tested-by: Mike Kravetz
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
     
  • Currently we warn only about allocation failures, but small
    allocations are basically nofail and they might loop in the page
    allocator for a long time, especially when reclaim cannot make any
    progress - e.g. GFP_NOFS cannot invoke the oom killer and relies on a
    different context to make forward progress in case there is a lot of
    memory used by filesystems.

    Give us at least a clue when something like this happens and warn about
    allocations which take more than 10s. Print the basic allocation
    context information along with the cumulative time spent in the
    allocation as well as the allocation stack. Repeat the warning after
    every 10 seconds so that we know that the problem is permanent rather
    than ephemeral.
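
One way to picture the "warn, then repeat every 10 seconds" behaviour is a deadline that is pushed out each time it fires. The struct and function names below are hypothetical and time is counted in whole seconds for simplicity; this is a sketch of the idea, not the allocator code.

```c
#include <assert.h>
#include <stdbool.h>

#define STALL_WARN_INTERVAL 10  /* seconds, as in the description above */

struct stall_tracker {
    unsigned long start;      /* when the allocation attempt began */
    unsigned long next_warn;  /* elapsed time after which a warning is due */
};

void stall_tracker_init(struct stall_tracker *t, unsigned long now)
{
    t->start = now;
    t->next_warn = STALL_WARN_INTERVAL;
}

/* Returns true when a warning is due, then moves the deadline out by
 * another interval so the warning repeats while the stall persists. */
bool stall_should_warn(struct stall_tracker *t, unsigned long now)
{
    if (now - t->start <= t->next_warn)
        return false;
    t->next_warn += STALL_WARN_INTERVAL;
    return true;
}
```

A caller would poll `stall_should_warn()` from the retry loop and print the allocation context together with the elapsed time whenever it returns true.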

    Link: http://lkml.kernel.org/r/20160929084407.7004-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Tetsuo Handa
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • warn_alloc_failed is currently used from the page and vmalloc
    allocators. This is a good reuse of the code except that vmalloc would
    appreciate a slightly different warning message. This is already
    handled by the fmt parameter except that

    "%s: page allocation failure: order:%u, mode:%#x(%pGg)"

    is printed anyway. This might be quite misleading because it might be a
    vmalloc failure which leads to the warning while the page allocator is
    not the culprit here. Fix this by always using the fmt string and only
    print the context that makes sense for the particular context (e.g.
    order makes only very little sense for the vmalloc context).

    Rename the function to not miss any user and also because a later patch
    will reuse it also for !failure cases.

    Link: http://lkml.kernel.org/r/20160929084407.7004-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Tetsuo Handa
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • We triggered a dead loop in truncate_inode_pages_range() on a 32-bit
    architecture with the test case below:

    ...
    fd = open();
    write(fd, buf, 4096);
    preadv64(fd, &iovec, 1, 0xffffffff000);
    ftruncate(fd, 0);
    ...

    Then ftruncate() never returns.

    The filesystem used in this case is ubifs, but it can be triggered on
    many other filesystems.

    When preadv64() is called with offset=0xffffffff000, a page with
    index=0xffffffff will be added to the radix tree of ->mapping. Then
    this page can be found in ->mapping with pagevec_lookup(). After that,
    truncate_inode_pages_range(), which is called in ftruncate(), will fall
    into an infinite loop:

    - find a page with index=0xffffffff, since index>=end, this page won't
    be truncated

    - index++, and index becomes 0

    - the page with index=0xffffffff will be found again

    The data type of index is unsigned long, so index won't overflow to 0 on
    a 64-bit architecture in this case, and the dead loop won't happen.
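
The wraparound can be checked with fixed-width integers. The helpers below are hypothetical and assume 4 KiB pages; `uint32_t` models unsigned long on a 32-bit architecture and `uint64_t` models it on a 64-bit one.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12  /* assumes 4 KiB pages */

/* Page index for a file offset, held in a 32-bit index. */
uint32_t page_index32(uint64_t offset)
{
    return (uint32_t)(offset >> TOY_PAGE_SHIFT);
}

/* The same computation with a 64-bit index. */
uint64_t page_index64(uint64_t offset)
{
    return offset >> TOY_PAGE_SHIFT;
}

/* Advancing past the last representable 32-bit index wraps to 0. */
uint32_t next_index32(uint32_t index)
{
    return index + 1u;
}
```

For offset 0xffffffff000 the page index is 0xffffffff; incrementing it in 32 bits yields 0, so the scan restarts from the beginning, whereas the 64-bit index simply moves on to 0x100000000.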

    Since truncate_inode_pages_range() is executed while holding
    inode->i_rwsem, any operation that needs this lock will be blocked,
    and a hung task will be reported, e.g.:

    INFO: task truncate_test:3364 blocked for more than 120 seconds.
    ...
    call_rwsem_down_write_failed+0x17/0x30
    generic_file_write_iter+0x32/0x1c0
    ubifs_write_iter+0xcc/0x170
    __vfs_write+0xc4/0x120
    vfs_write+0xb2/0x1b0
    SyS_write+0x46/0xa0

    The page with index=0xffffffff added to ->mapping is useless. Fix this
    by checking the read position before allocating pages.

    Link: http://lkml.kernel.org/r/1475151010-40166-1-git-send-email-fangwei1@huawei.com
    Signed-off-by: Wei Fang
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Al Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Fang
     
  • Arm64 has supported gigantic pages since commit 084bd29810a5 ("ARM64: mm:
    HugeTLB support."); however, they can only be allocated at boot time and
    can't be freed.

    This patch selects ARCH_HAS_GIGANTIC_PAGE so that gigantic pages can be
    allocated and freed at runtime on arm64.

    Link: http://lkml.kernel.org/r/1475227569-63446-3-git-send-email-xieyisheng1@huawei.com
    Signed-off-by: Yisheng Xie
    Acked-by: Michal Hocko
    Acked-by: Catalin Marinas
    Acked-by: Hillf Danton
    Cc: Hanjun Guo
    Cc: Will Deacon
    Cc: Dave Hansen
    Cc: Sudeep Holla
    Cc: Mark Rutland
    Cc: Rob Herring
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yisheng Xie
     
  • Avoid an unwieldy ifdef as more architectures add support for gigantic
    pages. No functional change with this patch.

    Link: http://lkml.kernel.org/r/1475227569-63446-2-git-send-email-xieyisheng1@huawei.com
    Signed-off-by: Yisheng Xie
    Suggested-by: Michal Hocko
    Acked-by: Michal Hocko
    Acked-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Hanjun Guo
    Cc: Will Deacon
    Cc: Dave Hansen
    Cc: Sudeep Holla
    Cc: Catalin Marinas
    Cc: Mark Rutland
    Cc: Rob Herring
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yisheng Xie
     
  • We have received a hard-to-explain oom report from a customer. The oom
    triggered even though there was a lot of free memory:

    PoolThread invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
    PoolThread cpuset=/ mems_allowed=0-7
    Pid: 30055, comm: PoolThread Tainted: G E X 3.0.101-80-default #1
    Call Trace:
    dump_trace+0x75/0x300
    dump_stack+0x69/0x6f
    dump_header+0x8e/0x110
    oom_kill_process+0xa6/0x350
    out_of_memory+0x2b7/0x310
    __alloc_pages_slowpath+0x7dd/0x820
    __alloc_pages_nodemask+0x1e9/0x200
    alloc_pages_vma+0xe1/0x290
    do_anonymous_page+0x13e/0x300
    do_page_fault+0x1fd/0x4c0
    page_fault+0x25/0x30
    [...]
    active_anon:1135959151 inactive_anon:1051962 isolated_anon:0
    active_file:13093 inactive_file:222506 isolated_file:0
    unevictable:262144 dirty:2 writeback:0 unstable:0
    free:432672819 slab_reclaimable:7917 slab_unreclaimable:95308
    mapped:261139 shmem:166297 pagetables:2228282 bounce:0
    [...]
    Node 0 DMA free:15896kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15672kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
    lowmem_reserve[]: 0 2892 775542 775542
    Node 0 DMA32 free:2783784kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2961572kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
    lowmem_reserve[]: 0 0 772650 772650
    Node 0 Normal free:8120kB min:8160kB low:10200kB high:12240kB active_anon:779334960kB inactive_anon:2198744kB active_file:0kB inactive_file:180kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:791193600kB mlocked:131072kB dirty:0kB writeback:0kB mapped:372940kB shmem:361480kB slab_reclaimable:4536kB slab_unreclaimable:68472kB kernel_stack:10104kB pagetables:1414820kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2280 all_unreclaimable? yes
    lowmem_reserve[]: 0 0 0 0
    Node 1 Normal free:476718144kB min:8192kB low:10240kB high:12288kB active_anon:307623696kB inactive_anon:283620kB active_file:10392kB inactive_file:69908kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:257208kB shmem:189896kB slab_reclaimable:3868kB slab_unreclaimable:44756kB kernel_stack:1848kB pagetables:1369432kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0
    Node 2 Normal free:386002452kB min:8192kB low:10240kB high:12288kB active_anon:398563752kB inactive_anon:68184kB active_file:10292kB inactive_file:29936kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:32084kB shmem:776kB slab_reclaimable:6888kB slab_unreclaimable:60056kB kernel_stack:8208kB pagetables:1282880kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0
    Node 3 Normal free:196406760kB min:8192kB low:10240kB high:12288kB active_anon:587445640kB inactive_anon:164396kB active_file:5716kB inactive_file:709844kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:291776kB shmem:111416kB slab_reclaimable:5152kB slab_unreclaimable:44516kB kernel_stack:2168kB pagetables:1455956kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0
    Node 4 Normal free:425338880kB min:8192kB low:10240kB high:12288kB active_anon:359695204kB inactive_anon:43216kB active_file:5748kB inactive_file:14772kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:24708kB shmem:1120kB slab_reclaimable:1884kB slab_unreclaimable:41060kB kernel_stack:1856kB pagetables:1100208kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0
    Node 5 Normal free:11140kB min:8192kB low:10240kB high:12288kB active_anon:784240872kB inactive_anon:1217164kB active_file:28kB inactive_file:48kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:11408kB shmem:0kB slab_reclaimable:2008kB slab_unreclaimable:49220kB kernel_stack:1360kB pagetables:531600kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1202 all_unreclaimable? yes
    lowmem_reserve[]: 0 0 0 0
    Node 6 Normal free:243395332kB min:8192kB low:10240kB high:12288kB active_anon:542015544kB inactive_anon:40208kB active_file:968kB inactive_file:8484kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:19992kB shmem:496kB slab_reclaimable:1672kB slab_unreclaimable:37052kB kernel_stack:2088kB pagetables:750264kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0
    Node 7 Normal free:10768kB min:8192kB low:10240kB high:12288kB active_anon:784916936kB inactive_anon:192316kB active_file:19228kB inactive_file:56852kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:34440kB shmem:4kB slab_reclaimable:5660kB slab_unreclaimable:36100kB kernel_stack:1328kB pagetables:1007968kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0

    So all nodes but Node 0 have a lot of free memory, which suggests that
    memory should have been available, especially with mems_allowed=0-7.
    One could speculate that a massive process managed to terminate and
    free up a lot of memory while racing with the above allocation request.
    Although this is highly unlikely, it cannot be ruled out.

    Further debugging, however, showed that the faulting process had a
    mempolicy (not a cpuset) binding it to Node 0. We cannot see that
    information in the report, though. mems_allowed turned out to be more
    confusing than helpful.

    Fix this by always printing the nodemask. It is either the mempolicy mask
    (and non-null) or the one defined by the cpusets. The new output for
    the above oom report would be

    PoolThread invoked oom-killer: gfp_mask=0x280da(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_adj=0, oom_score_adj=0

    This patch doesn't touch show_mem and the node filtering based on the
    cpuset node mask because mempolicy is always a subset of cpusets and
    seeing the full cpuset oom context might be helpful for tuning more
    specific mempolicies inside cpusets (e.g. when they turn out to be too
    restrictive). To avoid ugly ifdefs the mask is printed even for
    !NUMA configurations but this should be OK (a single node will be
    printed).
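
    To illustrate how a node mask reads as a list of node ranges (the
    "nodemask=0" above versus "mems_allowed=0-7"), here is a user-space
    sketch. It is not the kernel's bitmap printing code: format_nodemask()
    is a hypothetical helper and the mask is assumed to fit in one
    unsigned long.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format the set bits of mask as a comma-separated list of ranges,
 * e.g. 0x0d -> "0,2-3". Output is truncated if buf is too small.
 */
static void format_nodemask(unsigned long mask, char *buf, size_t len)
{
    const int nbits = (int)(sizeof(mask) * 8);
    size_t used = 0;
    int bit = 0;

    if (len == 0)
        return;
    buf[0] = '\0';

    while (bit < nbits) {
        if (!((mask >> bit) & 1UL)) {
            bit++;
            continue;
        }
        int first = bit;
        /* Extend the run while the next bit is also set. */
        while (bit + 1 < nbits && ((mask >> (bit + 1)) & 1UL))
            bit++;

        int n;
        if (first == bit)
            n = snprintf(buf + used, len - used, "%s%d",
                         used ? "," : "", first);
        else
            n = snprintf(buf + used, len - used, "%s%d-%d",
                         used ? "," : "", first, bit);
        if (n < 0 || (size_t)n >= len - used)
            return; /* truncated: stop here */
        used += (size_t)n;
        bit++;
    }
}
```

    A mask with only bit 0 set formats as "0", and a mask with bits 0
    through 7 set formats as "0-7".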

    Link: http://lkml.kernel.org/r/20160930214146.28600-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Sellami Abdelkader
    Acked-by: Vlastimil Babka
    Cc: David Rientjes
    Cc: Sellami Abdelkader
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Let's add a comment on why we skip page_mapcount() for sl[aou]b pages.

    Link: http://lkml.kernel.org/r/20160922105532.GB24593@node
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The old code was always doing:

    vma->vm_end = next->vm_end
    vma_rb_erase(next) // in __vma_unlink
    vma->vm_next = next->vm_next // in __vma_unlink
    next = vma->vm_next
    vma_gap_update(next)

    The new code still does the above for remove_next == 1 and 2, but for
    remove_next == 3 it has been changed and it does:

    next->vm_start = vma->vm_start
    vma_rb_erase(vma) // in __vma_unlink
    vma_gap_update(next)

    In the latter case, while unlinking "vma", validate_mm_rb() is told to
    ignore the "vma" that is being removed, but it is next->vm_start that
    was reduced. So for the new case, to avoid a false positive from
    validate_mm_rb, it should be "next" that is ignored when "vma" is
    being unlinked.

    "vma" and "next" in the above description are considered pre-swap().

    Link: http://lkml.kernel.org/r/1474492522-2261-4-git-send-email-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Tested-by: Shaun Tancheff
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Jan Vorlicek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • There are three cases, not two.

    Link: http://lkml.kernel.org/r/1474492522-2261-3-git-send-email-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Jan Vorlicek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • If next were NULL we couldn't reach this code path.

    Link: http://lkml.kernel.org/r/1474309513-20313-2-git-send-email-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Jan Vorlicek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • The rmap_walk can access vm_page_prot (and potentially vm_flags in the
    pte/pmd manipulations). So it is not safe to wait for the caller to
    update vm_page_prot/vm_flags after vma_merge has returned: by then
    vma_merge may have removed the "next" vma and extended the "current"
    vma over the next->vm_start,vm_end range while the "current" vma still
    has its old vm_page_prot, and the rmap locks have been released.

    The vm_page_prot/vm_flags must be transferred from the "next" vma to the
    current vma while vma_merge still holds the rmap locks.

    The side effect of this race condition is pte corruption during
    migration: remove_migration_ptes, when run on an address of the "next"
    vma that got removed, used the vm_page_prot of the current vma.

    migrate                                 mprotect
    ------------                            -------------
    migrating in "next" vma
                                            vma_merge() # removes "next" vma and
                                                        # extends "current" vma
                                                        # current vma is not with
                                                        # vm_page_prot updated
    remove_migration_ptes
    read vm_page_prot of current "vma"
    establish pte with wrong permissions
                                            vm_set_page_prot(vma) # too late!
                                            change_protection in the old vma range
                                            only, next range is not updated

    This caused segmentation faults and potentially memory corruption in
    heavy mprotect loads with some light page migration caused by compaction
    in the background.

    Hugh Dickins pointed out the comment about the odd case 8 in vma_merge,
    which confirms that case 8 is the only buggy one where the race can
    trigger; in all other vma_merge cases the above cannot happen.

    This fix removes the oddness factor from case 8 and it converts it from:

        AAAA
    PPPPNNNNXXXX -> PPPPNNNNNNNN

    to:

        AAAA
    PPPPNNNNXXXX -> PPPPXXXXXXXX

    XXXX has the right vma properties for the whole merged vma returned by
    vma_adjust, so it solves the problem fully. It has the added benefit
    that the callers could stop updating vma properties when vma_merge
    succeeds; however, the callers are not updated by this patch. (There
    are bits like VM_SOFTDIRTY that still need special care for the whole
    range, as the vma merging ignores them, but as long as they're not
    processed by rmap walks and are instead accessed with the mmap_sem held
    at least for reading, it is fine for them not to be updated within
    vma_adjust before releasing the rmap_locks.)
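
    The difference between the two case 8 outcomes can be shown with a toy
    model. This is only a sketch under simplifying assumptions: struct
    vma_model and both merge helpers are hypothetical, there is no locking,
    and a vma is reduced to a range plus a single page_prot value.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily reduced stand-in for a vma. */
struct vma_model {
    unsigned long start, end;
    unsigned long page_prot;
    bool removed;
};

/*
 * Old case 8: keep N and extend it over X. The survivor still carries
 * N's page_prot until the caller updates it later.
 */
static struct vma_model *merge_case8_old(struct vma_model *n,
                                         struct vma_model *x)
{
    n->end = x->end;
    x->removed = true;
    return n;
}

/*
 * New case 8: keep X and extend it backwards over N. The survivor
 * already has the properties wanted for the whole merged range.
 */
static struct vma_model *merge_case8_new(struct vma_model *n,
                                         struct vma_model *x)
{
    x->start = n->start;
    n->removed = true;
    return x;
}
```

    In the old variant a reader that looks at the surviving vma right
    after the merge sees N's stale page_prot; in the new variant it sees
    X's value without any further update by the caller.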

    Link: http://lkml.kernel.org/r/1474309513-20313-1-git-send-email-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Reported-by: Aditya Mandaleeka
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Jan Vorlicek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • mm->highest_vm_end doesn't need any update.

    After finally removing the oddness from vma_merge case 8 that was
    causing:

    1) constant risk of trouble whenever anybody would check vma fields
    from rmap_walks, like it happened when page migration was
    introduced and it read the vma->vm_page_prot from a rmap_walk

    2) the callers of vma_merge to re-initialize any value different from
    the current vma, instead of vma_merge() more reliably returning a
    vma that already matches all fields passed as parameter

    .. it is also worth taking the opportunity to clean up superfluous
    code in vma_adjust() which, if not removed, makes the function even
    harder to read.

    Link: http://lkml.kernel.org/r/1474492522-2261-5-git-send-email-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Jan Vorlicek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • vma->vm_page_prot is read locklessly from the rmap_walk and may be
    updated concurrently; this change prevents the risk of reading
    intermediate values.
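
    The message does not name the mechanism. One common kernel idiom for
    this kind of lockless access is the READ_ONCE()/WRITE_ONCE() pair, and
    the sketch below assumes that approach. The macros here are simplified
    user-space stand-ins (they rely on the GCC/Clang __typeof__ extension),
    and struct vma_model and both accessors are hypothetical.

```c
#include <assert.h>

/* Simplified user-space stand-ins for the kernel macros of the same name. */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical, reduced stand-in for a vma. */
struct vma_model {
    unsigned long vm_page_prot;
};

/* Writer side: publish the new value with a single store. */
static void set_page_prot(struct vma_model *vma, unsigned long prot)
{
    WRITE_ONCE(vma->vm_page_prot, prot);
}

/* Lockless reader side: load the value exactly once. */
static unsigned long get_page_prot(struct vma_model *vma)
{
    return READ_ONCE(vma->vm_page_prot);
}
```

    The volatile access keeps the compiler from splitting, repeating or
    caching the load and store, which is what lets a lockless reader see
    either the old or the new value rather than a mix of both.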

    Link: http://lkml.kernel.org/r/1474660305-19222-1-git-send-email-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Jan Vorlicek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • According to Hugh's suggestion, alloc_stable_node() with GFP_KERNEL can
    in rare cases cause a hung task warning.

    At present, if alloc_stable_node() allocation fails, two break_cows may
    want to allocate a couple of pages, and the issue will come up when free
    memory is under pressure.

    We fix it by adding __GFP_HIGH to GFP, to grant access to memory
    reserves, increasing the likelihood of allocation success.

    [akpm@linux-foundation.org: tweak comment]
    Link: http://lkml.kernel.org/r/1474354484-58233-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang
    Suggested-by: Hugh Dickins
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
     
  • Fix typo in comment.

    Link: http://lkml.kernel.org/r/1474788764-5774-1-git-send-email-ysxie@foxmail.com
    Signed-off-by: Yisheng Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yisheng Xie
     
  • For every pfn aligned to minimum_order, dissolve_free_huge_pages() will
    call dissolve_free_huge_page() which takes the hugetlb spinlock, even if
    the page is not huge at all or a hugepage that is in-use.

    Improve this by doing the PageHuge() and page_count() checks already in
    dissolve_free_huge_pages() before calling dissolve_free_huge_page(). In
    dissolve_free_huge_page(), when holding the spinlock, those checks need
    to be revalidated.
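
    The pattern of a cheap unlocked pre-check followed by revalidation
    under the lock can be sketched as follows. This is a single-threaded
    model, not the hugetlb code: struct page_model, model_lock() and the
    two dissolve helpers are hypothetical, and the lock is replaced by a
    counter so the saving is visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for a page. */
struct page_model {
    bool huge;
    int refcount;
};

/* The lock is modelled as a counter of acquisitions. */
static unsigned lock_acquisitions;
static void model_lock(void)   { lock_acquisitions++; }
static void model_unlock(void) { }

/* Takes the lock, then re-checks: the state may have changed meanwhile. */
static bool dissolve_one(struct page_model *p)
{
    bool dissolved = false;

    model_lock();
    if (p->huge && p->refcount == 0) {
        p->huge = false;
        dissolved = true;
    }
    model_unlock();
    return dissolved;
}

/* Only pages that pass the unlocked pre-check pay for the lock. */
static unsigned dissolve_range(struct page_model *pages, size_t n)
{
    unsigned dissolved = 0;

    for (size_t i = 0; i < n; i++) {
        if (pages[i].huge && pages[i].refcount == 0)
            dissolved += dissolve_one(&pages[i]);
    }
    return dissolved;
}
```

    For a range holding one normal page, one in-use huge page and one free
    huge page, the lock is taken once instead of three times.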

    Link: http://lkml.kernel.org/r/20160926172811.94033-4-gerald.schaefer@de.ibm.com
    Signed-off-by: Gerald Schaefer
    Acked-by: Michal Hocko
    Acked-by: Naoya Horiguchi
    Cc: "Kirill A . Shutemov"
    Cc: Vlastimil Babka
    Cc: Mike Kravetz
    Cc: "Aneesh Kumar K . V"
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Rui Teng
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     
  • In dissolve_free_huge_pages(), free hugepages will be dissolved without
    making sure that there are enough of them left to satisfy hugepage
    reservations.

    Fix this by adding a return value to dissolve_free_huge_pages() and
    checking h->free_huge_pages vs. h->resv_huge_pages. Note that this may
    lead to the situation where dissolve_free_huge_page() returns an error
    and all free hugepages that were dissolved before that error are lost,
    while the memory block still cannot be set offline.
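
    A minimal model of the reservation check and of the partial-failure
    behaviour noted above. The field names follow the message, but struct
    hstate_model and both functions are hypothetical simplifications with
    no locking.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical, reduced stand-in for the hugepage pool state. */
struct hstate_model {
    unsigned long free_huge_pages;
    unsigned long resv_huge_pages;
};

/*
 * Dissolve one free hugepage. Returns 0 on success, or -EBUSY when every
 * remaining free page is needed to back a reservation.
 */
static int dissolve_free_page(struct hstate_model *h)
{
    if (h->free_huge_pages <= h->resv_huge_pages)
        return -EBUSY;
    h->free_huge_pages--;
    return 0;
}

/*
 * Dissolve count pages, stopping at the first error. Pages dissolved
 * before the error are not restored.
 */
static int dissolve_free_pages(struct hstate_model *h, unsigned long count)
{
    for (unsigned long i = 0; i < count; i++) {
        int rc = dissolve_free_page(h);
        if (rc)
            return rc;
    }
    return 0;
}
```

    Starting with three free pages and two reserved, a request to dissolve
    two pages fails with -EBUSY after the first page has already been
    dissolved, which is the lost-page situation the message warns about.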

    Fixes: c8721bbb ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
    Link: http://lkml.kernel.org/r/20160926172811.94033-3-gerald.schaefer@de.ibm.com
    Signed-off-by: Gerald Schaefer
    Acked-by: Michal Hocko
    Acked-by: Naoya Horiguchi
    Cc: "Kirill A . Shutemov"
    Cc: Vlastimil Babka
    Cc: Mike Kravetz
    Cc: "Aneesh Kumar K . V"
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Rui Teng
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     
  • Patch series "mm/hugetlb: memory offline issues with hugepages", v4.

    This addresses several issues with hugepages and memory offline. While
    the first patch fixes a panic, and is therefore rather important, the
    last patch is just a performance optimization.

    The second patch fixes a theoretical issue with reserved hugepages,
    while still leaving an ugly usability issue; see its description.

    This patch (of 3):

    dissolve_free_huge_pages() will either run into the VM_BUG_ON() or a
    list corruption and addressing exception when trying to set a memory
    block offline that is part (but not the first part) of a "gigantic"
    hugetlb page with a size > memory block size.

    When no other smaller hugetlb page sizes are present, the VM_BUG_ON()
    will trigger directly. In the other case we will run into an addressing
    exception later, because dissolve_free_huge_page() will not work on the
    head page of the compound hugetlb page which will result in a NULL
    hstate from page_hstate().

    To fix this, first remove the VM_BUG_ON() because it is wrong, and then
    use the compound head page in dissolve_free_huge_page(). This means
    that an unused pre-allocated gigantic page that has any part of itself
    inside the memory block that is going offline will be dissolved
    completely. Losing an unused gigantic hugepage is preferable to failing
    the memory offline, for example in the situation where a (possibly
    faulty) memory DIMM needs to go offline.
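
    The effect of working on the head page can be sketched with an array
    model. Everything here is a hypothetical simplification: each entry of
    struct page_model records the index of its compound head, and
    dissolve_overlapping() stands in for the dissolve path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical page model: head is the index of the compound head page
 * (a page's own index if it is not part of a compound page). Only the
 * head entry carries the free_huge state.
 */
struct page_model {
    size_t head;
    bool free_huge;
};

/*
 * Dissolve every free huge page that overlaps [start, end). The state is
 * always looked up on the head, even when the range begins at a tail.
 */
static unsigned dissolve_overlapping(struct page_model *pages,
                                     size_t start, size_t end)
{
    unsigned dissolved = 0;

    for (size_t i = start; i < end; i++) {
        struct page_model *head = &pages[pages[i].head];

        if (head->free_huge) {
            head->free_huge = false;
            dissolved++;
        }
    }
    return dissolved;
}
```

    If an eight-page gigantic page starts at index 0 and the range going
    offline covers only indices 4 to 7, the whole page is still dissolved
    exactly once through its head.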

    Fixes: c8721bbb ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
    Link: http://lkml.kernel.org/r/20160926172811.94033-2-gerald.schaefer@de.ibm.com
    Signed-off-by: Gerald Schaefer
    Acked-by: Michal Hocko
    Acked-by: Naoya Horiguchi
    Cc: "Kirill A . Shutemov"
    Cc: Vlastimil Babka
    Cc: Mike Kravetz
    Cc: "Aneesh Kumar K . V"
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Rui Teng
    Cc: Dave Hansen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     
  • Commit b4def3509d18 ("mm, nobootmem: clean-up of free_low_memory_core_early()")
    removed the unnecessary nodeid argument. After that, this comment
    became more confusing. We should move it to the right place.

    Fixes: b4def3509d18c1db9 ("mm, nobootmem: clean-up of free_low_memory_core_early()")
    Link: http://lkml.kernel.org/r/1473996082-14603-1-git-send-email-wanlong.gao@gmail.com
    Signed-off-by: Wanlong Gao
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanlong Gao