23 Nov, 2013

1 commit

  • Pull SLAB changes from Pekka Enberg:
    "The patches from Joonsoo Kim switch mm/slab.c to use 'struct page' for
    slab internals similar to mm/slub.c. This reduces memory usage and
    improves performance:

    https://lkml.org/lkml/2013/10/16/155

    Rest of the changes are bug fixes from various people"

    * 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux: (21 commits)
    mm, slub: fix the typo in mm/slub.c
    mm, slub: fix the typo in include/linux/slub_def.h
    slub: Handle NULL parameter in kmem_cache_flags
    slab: replace non-existing 'struct freelist *' with 'void *'
    slab: fix to calm down kmemleak warning
    slub: proper kmemleak tracking if CONFIG_SLUB_DEBUG disabled
    slab: rename slab_bufctl to slab_freelist
    slab: remove useless statement for checking pfmemalloc
    slab: use struct page for slab management
    slab: replace free and inuse in struct slab with newly introduced active
    slab: remove SLAB_LIMIT
    slab: remove kmem_bufctl_t
    slab: change the management method of free objects of the slab
    slab: use __GFP_COMP flag for allocating slab pages
    slab: use well-defined macro, virt_to_slab()
    slab: overloading the RCU head over the LRU for RCU free
    slab: remove cachep in struct slab_rcu
    slab: remove nodeid in struct slab
    slab: remove colouroff in struct slab
    slab: change return type of kmem_getpages() to struct page
    ...

    Linus Torvalds
     

22 Nov, 2013

3 commits

  • Fengguang Wu reports that compiling mm/mempolicy.c results in a warning:

    mm/mempolicy.c: In function 'mpol_to_str':
    mm/mempolicy.c:2878:2: error: format not a string literal and no format arguments

    Kees says this is because he is using -Wformat-security.

    Silence the warning.
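    The warning is the classic -Wformat-security pattern: a computed buffer
    is passed where a format string is expected. A hypothetical illustration
    (not the actual mm/mempolicy.c code) of the problem and the usual fix:

    /* Hypothetical sketch of the -Wformat-security pattern and its fix. */
    char buffer[64];
    int mode = 0;

    snprintf(buffer, sizeof(buffer), "policy=%d", mode);

    printk(buffer);         /* warns: format not a string literal */
    printk("%s", buffer);   /* fix: pass the data via a literal format */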

    Signed-off-by: David Rientjes
    Reported-by: Fengguang Wu
    Suggested-by: Kees Cook
    Acked-by: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Commit 7cb2ef56e6a8 ("mm: fix aio performance regression for database
    caused by THP") can cause a dereference of a dangling pointer if
    split_huge_page runs during PageHuge() while there are updates to the
    tail_page->private field.

    Also it is repeating compound_head twice for hugetlbfs and it is running
    compound_head+compound_trans_head for THP when a single one is needed in
    both cases.

    The new code within the PageSlab() check doesn't need to verify that the
    THP page size is never bigger than the smallest hugetlbfs page size, to
    avoid memory corruption.

    A longstanding theoretical race condition was found while fixing the
    above (see the change right after the skip_unlock label, that is
    relevant for the compound_lock path too).

    By re-establishing the _mapcount tail refcounting for all compound
    pages, this also fixes the below problem:

    echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

    BUG: Bad page state in process bash pfn:59a01
    page:ffffea000139b038 count:0 mapcount:10 mapping: (null) index:0x0
    page flags: 0x1c00000000008000(tail)
    Modules linked in:
    CPU: 6 PID: 2018 Comm: bash Not tainted 3.12.0+ #25
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    Call Trace:
    dump_stack+0x55/0x76
    bad_page+0xd5/0x130
    free_pages_prepare+0x213/0x280
    __free_pages+0x36/0x80
    update_and_free_page+0xc1/0xd0
    free_pool_huge_page+0xc2/0xe0
    set_max_huge_pages.part.58+0x14c/0x220
    nr_hugepages_store_common.isra.60+0xd0/0xf0
    nr_hugepages_store+0x13/0x20
    kobj_attr_store+0xf/0x20
    sysfs_write_file+0x189/0x1e0
    vfs_write+0xc5/0x1f0
    SyS_write+0x55/0xb0
    system_call_fastpath+0x16/0x1b

    Signed-off-by: Khalid Aziz
    Signed-off-by: Andrea Arcangeli
    Tested-by: Khalid Aziz
    Cc: Pravin Shelar
    Cc: Greg Kroah-Hartman
    Cc: Ben Hutchings
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Right now, the migration code in migrate_page_copy() uses copy_huge_page()
    for hugetlbfs and thp pages:

    if (PageHuge(page) || PageTransHuge(page))
            copy_huge_page(newpage, page);

    So, yay for code reuse. But:

    void copy_huge_page(struct page *dst, struct page *src)
    {
            struct hstate *h = page_hstate(src);

    and a non-hugetlbfs page has no page_hstate(). This works 99% of the
    time because page_hstate() determines the hstate from the page order
    alone. Since the page order of a THP page matches the default hugetlbfs
    page order, it works.

    But, if you change the default huge page size on the boot command-line
    (say default_hugepagesz=1G), then we might not even *have* a 2MB hstate
    so page_hstate() returns null and copy_huge_page() oopses pretty fast
    since copy_huge_page() dereferences the hstate:

    void copy_huge_page(struct page *dst, struct page *src)
    {
            struct hstate *h = page_hstate(src);
            if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {
            ...

    Mel noticed that the migration code is really the only user of these
    functions. This moves all the copy code over to migrate.c and makes
    copy_huge_page() work for THP by checking for it explicitly.
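    A rough sketch of the resulting copy_huge_page() shape, as assumed from
    the description above (not necessarily the exact merged code):

    /* Sketch: hugetlbfs pages get their size from the hstate, THP pages are
     * detected explicitly and sized from their page order. */
    static void copy_huge_page(struct page *dst, struct page *src)
    {
            int i, nr_pages;

            if (PageHuge(src)) {
                    /* hugetlbfs page */
                    struct hstate *h = page_hstate(src);

                    nr_pages = pages_per_huge_page(h);
                    if (unlikely(nr_pages > MAX_ORDER_NR_PAGES)) {
                            /* gigantic-page copy path omitted in this sketch */
                            return;
                    }
            } else {
                    /* THP page: no hstate involved */
                    BUG_ON(!PageTransHuge(src));
                    nr_pages = hpage_nr_pages(src);
            }

            for (i = 0; i < nr_pages; i++) {
                    cond_resched();
                    copy_highpage(dst + i, src + i);
            }
    }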

    I believe the bug was introduced in commit b32967ff101a ("mm: numa: Add
    THP migration for the NUMA working set scanning fault case")

    [akpm@linux-foundation.org: fix coding-style and comment text, per Naoya Horiguchi]
    Signed-off-by: Dave Hansen
    Acked-by: Mel Gorman
    Reviewed-by: Naoya Horiguchi
    Cc: Hillf Danton
    Cc: Andrea Arcangeli
    Tested-by: Dave Jiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

21 Nov, 2013

1 commit

  • This reverts commit ea1e7ed33708c7a760419ff9ded0a6cb90586a50.

    Al points out that while the commit *does* actually create a separate
    slab for the page->ptl allocation, that slab is never actually used, and
    the code continues to use kmalloc/kfree.

    Damien Wyart points out that the original patch did have the conversion
    to use kmem_cache_alloc/free, so it got lost somewhere on its way to me.

    Revert the half-arsed attempt that didn't do anything. If we really do
    want the special slab (remember: this is all relevant just for debug
    builds, so it's not necessarily all that critical) we might as well redo
    the patch fully.

    Reported-by: Al Viro
    Acked-by: Andrew Morton
    Cc: Kirill A Shutemov
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

16 Nov, 2013

1 commit

  • Pull trivial tree updates from Jiri Kosina:
    "Usual earth-shaking, news-breaking, rocket science pile from
    trivial.git"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (23 commits)
    doc: usb: Fix typo in Documentation/usb/gadget_configs.txt
    doc: add missing files to timers/00-INDEX
    timekeeping: Fix some trivial typos in comments
    mm: Fix some trivial typos in comments
    irq: Fix some trivial typos in comments
    NUMA: fix typos in Kconfig help text
    mm: update 00-INDEX
    doc: Documentation/DMA-attributes.txt fix typo
    DRM: comment: `halve' -> `half'
    Docs: Kconfig: `devlopers' -> `developers'
    doc: typo on word accounting in kprobes.c in mutliple architectures
    treewide: fix "usefull" typo
    treewide: fix "distingush" typo
    mm/Kconfig: Grammar s/an/a/
    kexec: Typo s/the/then/
    Documentation/kvm: Update cpuid documentation for steal time and pv eoi
    treewide: Fix common typo in "identify"
    __page_to_pfn: Fix typo in comment
    Correct some typos for word frequency
    clk: fixed-factor: Fix a trivial typo
    ...

    Linus Torvalds
     

15 Nov, 2013

13 commits

  • This patch enhances the type safety for the kfifo API. It is now safe
    to put const data into a non const FIFO and the API will now generate a
    compiler warning when reading from the fifo where the destination
    address is pointing to a const variable.

    As a side effect, kfifo_put() now expects the value of an element
    instead of a pointer to the element. This was suggested by Russell King.
    It makes the handling of kfifo_put() easier since there is no need to
    create a helper variable for getting the address of a pointer or to pass
    integers of different sizes.

    IMHO the API break is okay, since there are currently only six users of
    kfifo_put().

    The code is also cleaner by kicking out the "if (0)" expressions.
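    A small usage sketch of the new calling convention (illustrative only;
    process_byte() is a placeholder):

    DECLARE_KFIFO(fifo, unsigned char, 128);
    unsigned char in = 0x42, out;

    INIT_KFIFO(fifo);
    kfifo_put(&fifo, in);               /* new API: pass the value itself */
    /* old API: kfifo_put(&fifo, &in);  -- pointer to the element */
    if (kfifo_get(&fifo, &out))
            process_byte(out);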

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Stefani Seibold
    Cc: Russell King
    Cc: Hauke Mehrtens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stefani Seibold
     
  • If DEBUG_SPINLOCK and DEBUG_LOCK_ALLOC are enabled, spinlock_t on x86_64
    is 72 bytes. For page->ptl it will be allocated from the kmalloc-96 slab,
    so we lose 24 bytes on each. An average system can easily allocate a few
    tens of thousands of page->ptl and the overhead is significant.

    Let's create a separate slab for page->ptl allocation to solve this.
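    A sketch of what the dedicated cache looks like (illustrative of the
    approach, not a quote of the patch):

    static struct kmem_cache *page_ptl_cachep;

    void __init ptlock_cache_init(void)
    {
            page_ptl_cachep = kmem_cache_create("page->ptl", sizeof(spinlock_t),
                                                0, SLAB_PANIC, NULL);
    }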

    Signed-off-by: Kirill A. Shutemov
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Use kernel/bounds.c to convert the build-time spinlock_t size check into
    a preprocessor symbol and apply that to properly separate the page::ptl
    situation.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Kirill A. Shutemov
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • If the split page table lock is in use, we embed the lock into the
    struct page of the table's page. We have to disable the split lock if
    spinlock_t is too big to be embedded, as when DEBUG_SPINLOCK or
    DEBUG_LOCK_ALLOC is enabled.

    This patch adds support for dynamic allocation of the split page table
    lock if we can't embed it into struct page.

    page->ptl is unsigned long now and we use it as a spinlock_t if
    sizeof(spinlock_t) <= sizeof(unsigned long); otherwise the spinlock_t is
    allocated dynamically and page->ptl stores a pointer to it.
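    Roughly, the lock accessor ends up looking like this (a sketch; the
    config symbol name is illustrative):

    #if ALLOC_SPLIT_PTLOCKS            /* sizeof(spinlock_t) > sizeof(long) */
    static inline spinlock_t *ptlock_ptr(struct page *page)
    {
            return (spinlock_t *)page->ptl;    /* ptl holds a pointer */
    }
    #else
    static inline spinlock_t *ptlock_ptr(struct page *page)
    {
            return (spinlock_t *)&page->ptl;   /* lock embedded in the word */
    }
    #endif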

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The basic idea is the same as with PTE level: the lock is embedded into
    struct page of table's page.

    We can't use mm->pmd_huge_pte to store pgtables for THP, since we don't
    take mm->page_table_lock anymore. Let's reuse page->lru of table's page
    for that.

    pgtable_pmd_page_ctor() returns true if initialization is successful and
    false otherwise. The current implementation never fails, but the
    assumption that the constructor can fail will help to port it to -rt,
    where spinlock_t is rather huge and cannot be embedded into struct page
    -- dynamic allocation is required.
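    The constructor contract, as a sketch (the lock-init helper name is
    assumed):

    /* Sketch: fails only if the (possibly dynamic) ptl setup fails; the
     * current embedded-lock implementation always succeeds. */
    static inline bool pgtable_pmd_page_ctor(struct page *page)
    {
    #ifdef CONFIG_TRANSPARENT_HUGEPAGE
            page->pmd_huge_pte = NULL;         /* list of deposited pgtables */
    #endif
            return ptlock_init(page);          /* assumed lock-init helper */
    }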

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Reviewed-by: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Only trivial cases left. Let's convert them altogether.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Hugetlb supports multiple page sizes. We use split lock only for PMD
    level, but not for PUD.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Currently mm->pmd_huge_pte is protected by the page table lock. That
    will not work with a split lock. We have to have a per-pmd pmd_huge_pte
    for proper access serialization.

    For now, let's just introduce a wrapper to access mm->pmd_huge_pte.
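    For the non-split case the wrapper can stay trivial, along the lines of
    (sketch):

    /* Sketch: with a single per-mm lock, every pmd shares the same deposit
     * list, so the pmd argument is ignored for now. */
    #define pmd_huge_pte(mm, pmd) ((mm)->pmd_huge_pte)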

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Alex Thorlton
    Cc: Ingo Molnar
    Cc: Naoya Horiguchi
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • With split page table lock we can't know which lock we need to take
    before we find the relevant pmd.

    Let's move lock taking inside the function.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • With split ptlock it's important to know which lock
    pmd_trans_huge_lock() took. This patch adds one more parameter to the
    function to return the lock.

    In most places migration to the new API is trivial. The exception is
    move_huge_pmd(): we need to take two locks if the pmd tables are
    different.
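    A caller under the new convention looks roughly like this (sketch;
    do_work_on_huge_pmd() is a placeholder):

    spinlock_t *ptl;

    if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
            /* pmd is a stable huge pmd; ptl is whichever lock was taken */
            do_work_on_huge_pmd(pmd);          /* placeholder */
            spin_unlock(ptl);
    }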

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • With split page table lock for PMD level we can't hold mm->page_table_lock
    while updating nr_ptes.

    Let's convert it to atomic_long_t to avoid races.
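    The conversion itself is mechanical; a before/after sketch:

    /* Before (sketch): only safe under mm->page_table_lock */
    mm->nr_ptes++;

    /* After (sketch): atomic updates, no page_table_lock required */
    atomic_long_inc(&mm->nr_ptes);
    nr = atomic_long_read(&mm->nr_ptes);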

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: Naoya Horiguchi
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Alex Thorlton noticed that some massively threaded workloads work poorly
    if THP is enabled. This patchset fixes this by introducing a split page
    table lock for PMD tables. hugetlbfs is not covered yet.

    This patchset is based on work by Naoya Horiguchi.

    : akpm result summary:
    :
    : THP off, v3.12-rc2: 18.059261877 seconds time elapsed
    : THP off, patched: 16.768027318 seconds time elapsed
    :
    : THP on, v3.12-rc2: 42.162306788 seconds time elapsed
    : THP on, patched: 8.397885779 seconds time elapsed
    :
    : HUGETLB, v3.12-rc2: 47.574936948 seconds time elapsed
    : HUGETLB, patched: 19.447481153 seconds time elapsed

    THP off, v3.12-rc2:
    -------------------

    Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):

    1037072.835207 task-clock # 57.426 CPUs utilized ( +- 3.59% )
    95,093 context-switches # 0.092 K/sec ( +- 3.93% )
    140 cpu-migrations # 0.000 K/sec ( +- 5.28% )
    10,000,550 page-faults # 0.010 M/sec ( +- 0.00% )
    2,455,210,400,261 cycles # 2.367 GHz ( +- 3.62% ) [83.33%]
    2,429,281,882,056 stalled-cycles-frontend # 98.94% frontend cycles idle ( +- 3.67% ) [83.33%]
    1,975,960,019,659 stalled-cycles-backend # 80.48% backend cycles idle ( +- 3.88% ) [66.68%]
    46,503,296,013 instructions # 0.02 insns per cycle
    # 52.24 stalled cycles per insn ( +- 3.21% ) [83.34%]
    9,278,997,542 branches # 8.947 M/sec ( +- 4.00% ) [83.34%]
    89,881,640 branch-misses # 0.97% of all branches ( +- 1.17% ) [83.33%]

    18.059261877 seconds time elapsed ( +- 2.65% )

    THP on, v3.12-rc2:
    ------------------

    Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):

    3114745.395974 task-clock # 73.875 CPUs utilized ( +- 1.84% )
    267,356 context-switches # 0.086 K/sec ( +- 1.84% )
    99 cpu-migrations # 0.000 K/sec ( +- 1.40% )
    58,313 page-faults # 0.019 K/sec ( +- 0.28% )
    7,416,635,817,510 cycles # 2.381 GHz ( +- 1.83% ) [83.33%]
    7,342,619,196,993 stalled-cycles-frontend # 99.00% frontend cycles idle ( +- 1.88% ) [83.33%]
    6,267,671,641,967 stalled-cycles-backend # 84.51% backend cycles idle ( +- 2.03% ) [66.67%]
    117,819,935,165 instructions # 0.02 insns per cycle
    # 62.32 stalled cycles per insn ( +- 4.39% ) [83.34%]
    28,899,314,777 branches # 9.278 M/sec ( +- 4.48% ) [83.34%]
    71,787,032 branch-misses # 0.25% of all branches ( +- 1.03% ) [83.33%]

    42.162306788 seconds time elapsed ( +- 1.73% )

    HUGETLB, v3.12-rc2:
    -------------------

    Performance counter stats for './thp_memscale_hugetlbfs -c 80 -b 512M' (5 runs):

    2588052.787264 task-clock # 54.400 CPUs utilized ( +- 3.69% )
    246,831 context-switches # 0.095 K/sec ( +- 4.15% )
    138 cpu-migrations # 0.000 K/sec ( +- 5.30% )
    21,027 page-faults # 0.008 K/sec ( +- 0.01% )
    6,166,666,307,263 cycles # 2.383 GHz ( +- 3.68% ) [83.33%]
    6,086,008,929,407 stalled-cycles-frontend # 98.69% frontend cycles idle ( +- 3.77% ) [83.33%]
    5,087,874,435,481 stalled-cycles-backend # 82.51% backend cycles idle ( +- 4.41% ) [66.67%]
    133,782,831,249 instructions # 0.02 insns per cycle
    # 45.49 stalled cycles per insn ( +- 4.30% ) [83.34%]
    34,026,870,541 branches # 13.148 M/sec ( +- 4.24% ) [83.34%]
    68,670,942 branch-misses # 0.20% of all branches ( +- 3.26% ) [83.33%]

    47.574936948 seconds time elapsed ( +- 2.09% )

    THP off, patched:
    -----------------

    Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):

    943301.957892 task-clock # 56.256 CPUs utilized ( +- 3.01% )
    86,218 context-switches # 0.091 K/sec ( +- 3.17% )
    121 cpu-migrations # 0.000 K/sec ( +- 6.64% )
    10,000,551 page-faults # 0.011 M/sec ( +- 0.00% )
    2,230,462,457,654 cycles # 2.365 GHz ( +- 3.04% ) [83.32%]
    2,204,616,385,805 stalled-cycles-frontend # 98.84% frontend cycles idle ( +- 3.09% ) [83.32%]
    1,778,640,046,926 stalled-cycles-backend # 79.74% backend cycles idle ( +- 3.47% ) [66.69%]
    45,995,472,617 instructions # 0.02 insns per cycle
    # 47.93 stalled cycles per insn ( +- 2.51% ) [83.34%]
    9,179,700,174 branches # 9.731 M/sec ( +- 3.04% ) [83.35%]
    89,166,529 branch-misses # 0.97% of all branches ( +- 1.45% ) [83.33%]

    16.768027318 seconds time elapsed ( +- 2.47% )

    THP on, patched:
    ----------------

    Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):

    458793.837905 task-clock # 54.632 CPUs utilized ( +- 0.79% )
    41,831 context-switches # 0.091 K/sec ( +- 0.97% )
    98 cpu-migrations # 0.000 K/sec ( +- 1.66% )
    57,829 page-faults # 0.126 K/sec ( +- 0.62% )
    1,077,543,336,716 cycles # 2.349 GHz ( +- 0.81% ) [83.33%]
    1,067,403,802,964 stalled-cycles-frontend # 99.06% frontend cycles idle ( +- 0.87% ) [83.33%]
    864,764,616,143 stalled-cycles-backend # 80.25% backend cycles idle ( +- 0.73% ) [66.68%]
    16,129,177,440 instructions # 0.01 insns per cycle
    # 66.18 stalled cycles per insn ( +- 7.94% ) [83.35%]
    3,618,938,569 branches # 7.888 M/sec ( +- 8.46% ) [83.36%]
    33,242,032 branch-misses # 0.92% of all branches ( +- 2.02% ) [83.32%]

    8.397885779 seconds time elapsed ( +- 0.18% )

    HUGETLB, patched:
    -----------------

    Performance counter stats for './thp_memscale_hugetlbfs -c 80 -b 512M' (5 runs):

    395353.076837 task-clock # 20.329 CPUs utilized ( +- 8.16% )
    55,730 context-switches # 0.141 K/sec ( +- 5.31% )
    138 cpu-migrations # 0.000 K/sec ( +- 4.24% )
    21,027 page-faults # 0.053 K/sec ( +- 0.00% )
    930,219,717,244 cycles # 2.353 GHz ( +- 8.21% ) [83.32%]
    914,295,694,103 stalled-cycles-frontend # 98.29% frontend cycles idle ( +- 8.35% ) [83.33%]
    704,137,950,187 stalled-cycles-backend # 75.70% backend cycles idle ( +- 9.16% ) [66.69%]
    30,541,538,385 instructions # 0.03 insns per cycle
    # 29.94 stalled cycles per insn ( +- 3.98% ) [83.35%]
    8,415,376,631 branches # 21.286 M/sec ( +- 3.61% ) [83.36%]
    32,645,478 branch-misses # 0.39% of all branches ( +- 3.41% ) [83.32%]

    19.447481153 seconds time elapsed ( +- 2.00% )

    This patch (of 11):

    CONFIG_GENERIC_LOCKBREAK increases sizeof(spinlock_t) to 8 bytes. It
    leads to an increase of sizeof(struct page) by 4 bytes on 32-bit systems
    if the split page table lock is in use, since page->ptl shares space in a
    union with longs and pointers.

    Let's disable the split page table lock on 32-bit systems with
    GENERIC_LOCKBREAK enabled.

    Signed-off-by: Kirill A. Shutemov
    Cc: Alex Thorlton
    Cc: Ingo Molnar
    Cc: Naoya Horiguchi
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • There's only one caller of do_generic_file_read() and the only actor is
    file_read_actor(). No reason to have a callback parameter.
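    The change amounts to dropping the actor argument; a sketch of the
    prototypes (illustrative, 3.12-era types):

    /* Before (sketch): actor supplied as a callback parameter */
    static void do_generic_file_read(struct file *filp, loff_t *ppos,
                                     read_descriptor_t *desc, read_actor_t actor);

    /* After (sketch): file_read_actor() is called directly inside */
    static void do_generic_file_read(struct file *filp, loff_t *ppos,
                                     read_descriptor_t *desc);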

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Dave Hansen
    Reviewed-by: Wanpeng Li
    Cc: Matthew Wilcox
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

14 Nov, 2013

2 commits

  • Pull core locking changes from Ingo Molnar:
    "The biggest changes:

    - add lockdep support for seqcount/seqlocks structures, this
    unearthed both bugs and required extra annotation.

    - move the various kernel locking primitives to the new
    kernel/locking/ directory"

    * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
    block: Use u64_stats_init() to initialize seqcounts
    locking/lockdep: Mark __lockdep_count_forward_deps() as static
    lockdep/proc: Fix lock-time avg computation
    locking/doc: Update references to kernel/mutex.c
    ipv6: Fix possible ipv6 seqlock deadlock
    cpuset: Fix potential deadlock w/ set_mems_allowed
    seqcount: Add lockdep functionality to seqcount/seqlock structures
    net: Explicitly initialize u64_stats_sync structures for lockdep
    locking: Move the percpu-rwsem code to kernel/locking/
    locking: Move the lglocks code to kernel/locking/
    locking: Move the rwsem code to kernel/locking/
    locking: Move the rtmutex code to kernel/locking/
    locking: Move the semaphore core to kernel/locking/
    locking: Move the spinlock code to kernel/locking/
    locking: Move the lockdep code to kernel/locking/
    locking: Move the mutex code to kernel/locking/
    hung_task debugging: Add tracepoint to report the hang
    x86/locking/kconfig: Update paravirt spinlock Kconfig description
    lockstat: Report avg wait and hold times
    lockdep, x86/alternatives: Drop ancient lockdep fixup message
    ...

    Linus Torvalds
     
  • Pull block IO core updates from Jens Axboe:
    "This is the pull request for the core changes in the block layer for
    3.13. It contains:

    - The new blk-mq request interface.

    This is a new and more scalable queueing model that marries the
    best part of the request based interface we currently have (which
    is fully featured, but scales poorly) and the bio based "interface"
    which the new drivers for high IOPS devices end up using because
    it's much faster than the request based one.

    The bio interface has no block layer support, since it taps into
    the stack much earlier. This means that drivers end up having to
    implement a lot of functionality on their own, like tagging,
    timeout handling, requeue, etc. The blk-mq interface provides all
    these. Some drivers even provide a switch to select bio or rq and
    have code to handle both, since things like merging only work in
    the rq model and it is hence faster for some workloads. This is a
    huge mess. Conversion of these drivers nets us a substantial code
    reduction. Initial results on converting SCSI to this model even
    shows an 8x improvement on single queue devices. So while the
    model was intended to work on the newer multiqueue devices, it has
    substantial improvements for "classic" hardware as well. This code
    has gone through extensive testing and development, it's now ready
    to go. A pull request to convert virtio-blk to this model will be
    coming as well, with more drivers scheduled for 3.14 conversion.

    - Two blktrace fixes from Jan and Chen Gang.

    - A plug merge fix from Alireza Haghdoost.

    - Conversion of __get_cpu_var() from Christoph Lameter.

    - Fix for sector_div() with 64-bit divider from Geert Uytterhoeven.

    - A fix for a race between request completion and the timeout
    handling from Jeff Moyer. This is what caused the merge conflict
    with blk-mq/core, in case you are looking at that.

    - A dm stacking fix from Mike Snitzer.

    - A code consolidation fix and duplicated code removal from Kent
    Overstreet.

    - A handful of block bug fixes from Mikulas Patocka, fixing a loop
    crash and memory corruption on blk cg.

    - Elevator switch bug fix from Tomoki Sekiyama.

    A heads-up that I had to rebase this branch. Initially the immutable
    bio_vecs had been queued up for inclusion, but a week later, it became
    clear that it wasn't fully cooked yet. So the decision was made to
    pull this out and postpone it until 3.14. It was a straightforward
    rebase, just pruning out the immutable series and the later fixes of
    problems with it. The rest of the patches applied directly and no
    further changes were made"

    * 'for-3.13/core' of git://git.kernel.dk/linux-block: (31 commits)
    block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
    block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
    block: Do not call sector_div() with a 64-bit divisor
    kernel: trace: blktrace: remove redundent memcpy() in compat_blk_trace_setup()
    block: Consolidate duplicated bio_trim() implementations
    block: Use rw_copy_check_uvector()
    block: Enable sysfs nomerge control for I/O requests in the plug list
    block: properly stack underlying max_segment_size to DM device
    elevator: acquire q->sysfs_lock in elevator_change()
    elevator: Fix a race in elevator switching and md device initialization
    block: Replace __get_cpu_var uses
    bdi: test bdi_init failure
    block: fix a probe argument to blk_register_region
    loop: fix crash if blk_alloc_queue fails
    blk-core: Fix memory corruption if blkcg_init_queue fails
    block: fix race between request completion and timeout handling
    blktrace: Send BLK_TN_PROCESS events to all running traces
    blk-mq: don't disallow request merges for req->special being set
    blk-mq: mq plug list breakage
    blk-mq: fix for flush deadlock
    ...

    Linus Torvalds
     

13 Nov, 2013

19 commits

  • Pull networking updates from David Miller:

    1) The addition of nftables. No longer will we need protocol aware
    firewall filtering modules, it can all live in userspace.

    At the core of nftables is a, for lack of a better term, virtual
    machine that executes byte codes to inspect packet or metadata
    (arriving interface index, etc.) and make verdict decisions.

    Besides support for loading packet contents and comparing them, the
    interpreter supports lookups in various data structures as
    fundamental operations. For example, sets are supported, and
    therefore one could create a set of whitelist IP address entries
    which have ACCEPT verdicts attached to them, and use the appropriate
    byte codes to do such lookups.

    Since the interpreted code is composed in userspace, userspace can
    do things like optimize things before giving it to the kernel.

    Another major improvement is the capability of atomically updating
    portions of the ruleset. In the existing netfilter implementation,
    one has to update the entire rule set in order to make a change and
    this is very expensive.

    Userspace tools exist to create nftables rules using existing
    netfilter rule sets, but both kernel implementations will need to
    co-exist for quite some time as we transition from the old to the
    new stuff.

    Kudos to Patrick McHardy, Pablo Neira Ayuso, and others who have
    worked so hard on this.

    2) Daniel Borkmann and Hannes Frederic Sowa made several improvements
    to our pseudo-random number generator, mostly used for things like
    UDP port randomization and netfilter, amongst other things.

    In particular the taus88 generator is updated to taus113, and test
    cases are added.

    3) Support 64-bit rates in HTB and TBF schedulers, from Eric Dumazet
    and Yang Yingliang.

    4) Add support for new 577xx tigon3 chips to tg3 driver, from Nithin
    Sujir.

    5) Fix two fatal flaws in TCP dynamic right sizing, from Eric Dumazet,
    Neal Cardwell, and Yuchung Cheng.

    6) Allow IP_TOS and IP_TTL to be specified in sendmsg() ancillary
    control message data, much like other socket option attributes.
    From Francesco Fusco.

    7) Allow applications to specify a cap on the rate computed
    automatically by the kernel for pacing flows, via a new
    SO_MAX_PACING_RATE socket option. From Eric Dumazet.

    8) Make the initial autotuned send buffer sizing in TCP more closely
    reflect actual needs, from Eric Dumazet.

    9) Currently early socket demux only happens for TCP sockets, but we
    can do it for connected UDP sockets too. Implementation from Shawn
    Bohrer.

    10) Refactor inet socket demux with the goal of improving hash demux
    performance for listening sockets. With the main goals being able
    to use RCU lookups on even request sockets, and eliminating the
    listening lock contention. From Eric Dumazet.

    11) The bonding layer has many demuxes in its fast path, and an RCU
    conversion was started back in 3.11, several changes here extend the
    RCU usage to even more locations. From Ding Tianhong and Wang
    Yufen, based upon suggestions by Nikolay Aleksandrov and Veaceslav
    Falico.

    12) Allow stackability of segmentation offloads to, in particular, allow
    segmentation offloading over tunnels. From Eric Dumazet.

    13) Significantly improve the handling of secret keys we input into the
    various hash functions in the inet hashtables, TCP fast open, as
    well as syncookies. From Hannes Frederic Sowa. The key fundamental
    operation is "net_get_random_once()" which uses static keys.

    Hannes even extended this to ipv4/ipv6 fragmentation handling and
    our generic flow dissector.

    14) The generic driver layer takes care now to set the driver data to
    NULL on device removal, so it's no longer necessary for drivers to
    explicitly set it to NULL any more. Many drivers have been cleaned
    up in this way, from Jingoo Han.

    15) Add a BPF based packet scheduler classifier, from Daniel Borkmann.

    16) Improve CRC32 interfaces and generic SKB checksum iterators so that
    SCTP's checksumming can more cleanly be handled. Also from Daniel
    Borkmann.

    17) Add a new PMTU discovery mode, IP_PMTUDISC_INTERFACE, which forces
    using the interface MTU value. This helps avoid PMTU attacks,
    particularly on DNS servers. From Hannes Frederic Sowa.

    18) Use generic XPS for transmit queue steering rather than internal
    (re-)implementation in virtio-net. From Jason Wang.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits)
    random32: add test cases for taus113 implementation
    random32: upgrade taus88 generator to taus113 from errata paper
    random32: move rnd_state to linux/random.h
    random32: add prandom_reseed_late() and call when nonblocking pool becomes initialized
    random32: add periodic reseeding
    random32: fix off-by-one in seeding requirement
    PHY: Add RTL8201CP phy_driver to realtek
    xtsonic: add missing platform_set_drvdata() in xtsonic_probe()
    macmace: add missing platform_set_drvdata() in mace_probe()
    ethernet/arc/arc_emac: add missing platform_set_drvdata() in arc_emac_probe()
    ipv6: protect for_each_sk_fl_rcu in mem_check with rcu_read_lock_bh
    vlan: Implement vlan_dev_get_egress_qos_mask as an inline.
    ixgbe: add warning when max_vfs is out of range.
    igb: Update link modes display in ethtool
    netfilter: push reasm skb through instead of original frag skbs
    ip6_output: fragment outgoing reassembled skb properly
    MAINTAINERS: mv643xx_eth: take over maintainership from Lennart
    net_sched: tbf: support of 64bit rates
    ixgbe: deleting dfwd stations out of order can cause null ptr deref
    ixgbe: fix build err, num_rx_queues is only available with CONFIG_RPS
    ...

    Linus Torvalds
     
  • Merge first patch-bomb from Andrew Morton:
    "Quite a lot of other stuff is banked up awaiting further
    next->mainline merging, but this batch contains:

    - Lots of random misc patches
    - OCFS2
    - Most of MM
    - backlight updates
    - lib/ updates
    - printk updates
    - checkpatch updates
    - epoll tweaking
    - rtc updates
    - hfs
    - hfsplus
    - documentation
    - procfs
    - update gcov to gcc-4.7 format
    - IPC"

    * emailed patches from Andrew Morton : (269 commits)
    ipc, msg: fix message length check for negative values
    ipc/util.c: remove unnecessary work pending test
    devpts: plug the memory leak in kill_sb
    ./Makefile: export initial ramdisk compression config option
    init/Kconfig: add option to disable kernel compression
    drivers: w1: make w1_slave::flags long to avoid memory corruption
    drivers/w1/masters/ds1wm.c: use dev_get_platdata()
    drivers/memstick/core/ms_block.c: fix unreachable state in h_msb_read_page()
    drivers/memstick/core/mspro_block.c: fix attributes array allocation
    drivers/pps/clients/pps-gpio.c: remove redundant of_match_ptr
    kernel/panic.c: reduce 1 byte usage for print tainted buffer
    gcov: reuse kbasename helper
    kernel/gcov/fs.c: use pr_warn()
    kernel/module.c: use pr_foo()
    gcov: compile specific gcov implementation based on gcc version
    gcov: add support for gcc 4.7 gcov format
    gcov: move gcov structs definitions to a gcc version specific file
    kernel/taskstats.c: return -ENOMEM when alloc memory fails in add_del_listener()
    kernel/taskstats.c: add nla_nest_cancel() for failure processing between nla_nest_start() and nla_nest_end()
    kernel/sysctl_binary.c: use scnprintf() instead of snprintf()
    ...

    Linus Torvalds
     
  • Pull vfs updates from Al Viro:
    "All kinds of stuff this time around; some more notable parts:

    - RCU'd vfsmounts handling
    - new primitives for coredump handling
    - files_lock is gone
    - Bruce's delegations handling series
    - exportfs fixes

    plus misc stuff all over the place"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (101 commits)
    ecryptfs: ->f_op is never NULL
    locks: break delegations on any attribute modification
    locks: break delegations on link
    locks: break delegations on rename
    locks: helper functions for delegation breaking
    locks: break delegations on unlink
    namei: minor vfs_unlink cleanup
    locks: implement delegations
    locks: introduce new FL_DELEG lock flag
    vfs: take i_mutex on renamed file
    vfs: rename I_MUTEX_QUOTA now that it's not used for quotas
    vfs: don't use PARENT/CHILD lock classes for non-directories
    vfs: pull ext4's double-i_mutex-locking into common code
    exportfs: fix quadratic behavior in filehandle lookup
    exportfs: better variable name
    exportfs: move most of reconnect_path to helper function
    exportfs: eliminate unused "noprogress" counter
    exportfs: stop retrying once we race with rename/remove
    exportfs: clear DISCONNECTED on all parents sooner
    exportfs: more detailed comment for path_reconnect
    ...

    Linus Torvalds
     
  • Pull cgroup changes from Tejun Heo:
    "Not too much activity this time around. css_id is finally killed and
    a minor update to device_cgroup"

    * 'for-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    device_cgroup: remove can_attach
    cgroup: kill css_id
    memcg: stop using css id
    memcg: fail to create cgroup if the cgroup id is too big
    memcg: convert to use cgroup id
    memcg: convert to use cgroup_is_descendant()

    Linus Torvalds
     
  • Pull percpu changes from Tejun Heo:
    "Two smallish changes for percpu. Two patches to remove unused
    this_cpu_xor() and one to fix a bug in percpu init failure path so
    that it can reach the proper BUG() instead of oopsing earlier"

    * 'for-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
    x86: remove this_cpu_xor() implementation
    percpu: remove this_cpu_xor() implementation
    percpu: fix bootmem error handling in pcpu_page_first_chunk()

    Linus Torvalds
     
  • Commit 0255d4918480 ("mm: Account for a THP NUMA hinting update as one
    PTE update") was added to account for the number of PTE updates when
    marking pages prot_numa. task_numa_work was using the old return value
    to track how much address space had been updated. Altering the return
    value causes the scanner to do more work than it is configured or
    documented to in a single unit of work.

    This patch reverts that commit and accounts for the number of THP
    updates separately in vmstat. It is up to the administrator to
    interpret the pair of values correctly. This is a straight-forward
    operation and likely to only be of interest when actively debugging NUMA
    balancing problems.

    The impact of this patch is that the NUMA PTE scanner will scan slower
    when THP is enabled and workloads may converge slower as a result. On
    the flip side, system CPU usage should be lower than recent tests
    reported. This is an illustrative example of a short single JVM specjbb
    test

    specjbb
    3.12.0 3.12.0
    vanilla acctupdates
    TPut 1 26143.00 ( 0.00%) 25747.00 ( -1.51%)
    TPut 7 185257.00 ( 0.00%) 183202.00 ( -1.11%)
    TPut 13 329760.00 ( 0.00%) 346577.00 ( 5.10%)
    TPut 19 442502.00 ( 0.00%) 460146.00 ( 3.99%)
    TPut 25 540634.00 ( 0.00%) 549053.00 ( 1.56%)
    TPut 31 512098.00 ( 0.00%) 519611.00 ( 1.47%)
    TPut 37 461276.00 ( 0.00%) 474973.00 ( 2.97%)
    TPut 43 403089.00 ( 0.00%) 414172.00 ( 2.75%)

                  3.12.0      3.12.0
                 vanilla acctupdates
    User         5169.64     5184.14
    System        100.45       80.02
    Elapsed       252.75      251.85

    Performance is similar but note the reduction in system CPU time. While
    this showed a performance gain, it will not be universal but at least
    it'll be behaving as documented. The vmstats are obviously different but
    here is an obvious interpretation of them from mmtests.

                                 3.12.0      3.12.0
                                vanilla acctupdates
    NUMA page range updates     1408326    11043064
    NUMA huge PMD updates             0       21040
    NUMA PTE updates            1408326      291624

    "NUMA page range updates" == nr_pte_updates and is the value returned to
    the NUMA pte scanner. NUMA huge PMD updates were the number of THP
    updates which in combination can be used to calculate how many ptes were
    updated from userspace.

    Signed-off-by: Mel Gorman
    Reported-by: Alex Thorlton
    Reviewed-by: Rik van Riel
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The same calculation is currently done in three different places.
    Factor that code so future changes have to be made in only one place.
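    The shared helper is essentially the standard overcommit formula; a
    sketch (assuming the usual globals):

    /* Sketch: commit limit = (RAM - hugetlb pages) * overcommit_ratio% + swap */
    unsigned long vm_commit_limit(void)
    {
            return ((totalram_pages - hugetlb_total_pages())
                    * sysctl_overcommit_ratio / 100) + total_swap_pages;
    }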

    [akpm@linux-foundation.org: uninline vm_commit_limit()]
    Signed-off-by: Jerome Marchand
    Cc: Dave Hansen
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     
  • Signed-off-by: Zhi Yong Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhi Yong Wu
     
  • The refcount routines did not fit the kernel get/put semantics exactly;
    there were too many judgement statements on the refcount and it could go
    negative.

    This patch does the following:

    - move refcount judgement to zswap_entry_put() to hide resource free function.

    - add a new function zswap_entry_find_get(), so that callers can use
    easily in the following pattern:

    zswap_entry_find_get
    .../* do something */
    zswap_entry_put

    - to eliminate compile error, move some functions declaration

    This patch is based on Minchan Kim 's idea and suggestion.
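    A sketch of the put side as described above (names follow the text, not
    necessarily the exact patch):

    /* Sketch: drop a reference and free the entry (and its compressed data)
     * once nobody holds it any more. */
    static void zswap_entry_put(struct zswap_tree *tree,
                                struct zswap_entry *entry)
    {
            int refcount = --entry->refcount;

            BUG_ON(refcount < 0);
            if (refcount == 0) {
                    rb_erase(&entry->rbnode, &tree->rbroot);
                    zswap_free_entry(tree, entry);
            }
    }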

    Signed-off-by: Weijie Yang
    Cc: Seth Jennings
    Acked-by: Minchan Kim
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Weijie Yang
     
  • Consider the following scenario:

    thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page)
    thread 1: call zswap_frontswap_invalidate_page to invalidate entry x.
    finished, entry x and its zbud is not freed as its refcount != 0
    now, the swap_map[x] = 0
    thread 0: now call zswap_get_swap_cache_page
    swapcache_prepare return -ENOENT because entry x is not used any more
    zswap_get_swap_cache_page return ZSWAP_SWAPCACHE_NOMEM
    zswap_writeback_entry do nothing except put refcount

    Now, the memory of zswap_entry x and its zpage leak.

    Modify:
    - check the refcount in fail path, free memory if it is not referenced.

    - use ZSWAP_SWAPCACHE_FAIL instead of ZSWAP_SWAPCACHE_NOMEM, as the fail
    path can be caused not only by nomem but also by invalidate.

    Signed-off-by: Weijie Yang
    Reviewed-by: Bob Liu
    Reviewed-by: Minchan Kim
    Acked-by: Seth Jennings
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Weijie Yang
     
  • Signed-off-by: Qiang Huang
    Reviewed-by: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qiang Huang
     
  • We can't see the relationship with memcg from the parameters,
    so the name with memcg_idx would be more reasonable.

    Signed-off-by: Qiang Huang
    Reviewed-by: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qiang Huang
     
  • Signed-off-by: Qiang Huang
    Reviewed-by: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qiang Huang
     
  • This patch fixes the problem that get_unmapped_area() can return illegal
    address and result in failing mmap(2) etc.

    In case an address higher than PAGE_SIZE is set in
    /proc/sys/vm/mmap_min_addr, an address lower than mmap_min_addr can be
    returned by get_unmapped_area(), even if you do not pass any virtual
    address hint (i.e. the second argument).

    This is because the current get_unmapped_area() code does not take into
    account mmap_min_addr.

    This leads to two actual problems as follows:

    1. mmap(2) can fail with EPERM on the process without CAP_SYS_RAWIO,
    although any illegal parameter is not passed.

    2. The bottom-up search path after the top-down search might not work in
    arch_get_unmapped_area_topdown().

    Note: The first and third chunks of my patch, which change the "len"
    check, are for a more precise check using mmap_min_addr, and not for
    solving the above problem.

    [How to reproduce]

    --- test.c -------------------------------------------------
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <errno.h>

    int main(int argc, char *argv[])
    {
            void *ret = NULL, *last_map;
            size_t pagesize = sysconf(_SC_PAGESIZE);

            do {
                    last_map = ret;
                    ret = mmap(0, pagesize, PROT_NONE,
                               MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
                    // printf("ret=%p\n", ret);
            } while (ret != MAP_FAILED);

            if (errno != ENOMEM) {
                    printf("ERR: unexpected errno: %d (last map=%p)\n",
                           errno, last_map);
            }

            return 0;
    }
    ---------------------------------------------------------------

    $ gcc -m32 -o test test.c
    $ sudo sysctl -w vm.mmap_min_addr=65536
    vm.mmap_min_addr = 65536
    $ ./test (run as a non-privileged user)
    ERR: unexpected errno: 1 (last map=0x10000)

    Signed-off-by: Akira Takeuchi
    Signed-off-by: Kiyoshi Owada
    Reviewed-by: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akira Takeuchi
     
  • When __rmqueue_fallback() doesn't find a free block with the required size
    it splits a larger page and puts the rest of the page onto the free list.

    But it has one serious mistake. When putting pages back,
    __rmqueue_fallback() always uses start_migratetype if the type is not
    CMA. However, __rmqueue_fallback() is only called when all of the
    start_migratetype queue is empty. That said, __rmqueue_fallback() always
    puts memory back on the wrong queue except when try_to_steal_freepages()
    changed the pageblock type (i.e. the requested size is smaller than half
    of the page block). The end result is that the antifragmentation
    framework increases fragmentation instead of decreasing it.

    Mel's original anti fragmentation does the right thing. But commit
    47118af076f6 ("mm: mmzone: MIGRATE_CMA migration type added") broke it.

    This patch restores sane and old behavior. It also removes an incorrect
    comment which was introduced by commit fef903efcf0c ("mm/page_alloc.c:
    restructure free-page stealing code and fix a bug").

    Signed-off-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • In general, every tracepoint should have zero overhead if it is
    disabled. However, trace_mm_page_alloc_extfrag() is one exception. It
    evaluates "new_type == start_migratetype" even if the tracepoint is
    disabled.

    However, the code can be moved into the tracepoint's TP_fast_assign(),
    and TP_fast_assign() exists for exactly this purpose. This patch does
    that.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Currently, set_pageblock_migratetype() screws up MIGRATE_CMA and
    MIGRATE_ISOLATE if page_group_by_mobility_disabled is true. It rewrites
    the argument to MIGRATE_UNMOVABLE and we lose these attributes.

    The problem was introduced by commit 49255c619fbd ("page allocator: move
    check for disabled anti-fragmentation out of fastpath"). So a 4-year-old
    issue may mean that nobody uses page_group_by_mobility_disabled.

    But anyway, this patch fixes the problem.
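    The shape of the fix, as a sketch consistent with the description (the
    threshold constant is assumed):

    void set_pageblock_migratetype(struct page *page, int migratetype)
    {
            /* Only the ordinary allocator types collapse to UNMOVABLE; CMA
             * and ISOLATE keep their meaning even with grouping disabled. */
            if (unlikely(page_group_by_mobility_disabled &&
                         migratetype < MIGRATE_PCPTYPES))
                    migratetype = MIGRATE_UNMOVABLE;

            set_pageblock_flags_group(page, (unsigned long)migratetype,
                                      PB_migrate, PB_migrate_end);
    }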

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • The kernel's readahead algorithm sometimes interprets random read
    accesses as sequential and triggers unnecessary data prefetching from the
    storage device (impacting random read average latency).

    In order to identify sequential cache read misses, the readahead
    algorithm intends to check whether offset - previous offset == 1
    (trivial sequential reads) or offset - previous offset == 0 (sequential
    reads not aligned on page boundary):

    if (offset - (ra->prev_pos >> PAGE_CACHE_SHIFT) <= 1UL)

    The current offset is stored in the "offset" variable of type "pgoff_t"
    (unsigned long), while the previous offset is stored in "ra->prev_pos" of
    type "loff_t" (long long). Therefore, the operands of the if statement
    are implicitly converted to long long. Consequently, when previous offset
    > current offset (which happens on a random pattern), the if condition is
    true and the access is wrongly interpreted as sequential. An unnecessary
    data prefetch is triggered, impacting the average random read latency.

    Storing the previous offset value in a "pgoff_t" variable (unsigned
    long) fixes the sequential read detection logic.
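    With the previous offset held in a pgoff_t, the comparison stays in
    unsigned long arithmetic; a sketch of the detection:

    /* Sketch of the corrected check (unsigned arithmetic throughout) */
    pgoff_t prev_offset = ra->prev_pos >> PAGE_CACHE_SHIFT;

    if (offset - prev_offset <= 1UL) {
            /* sequential (possibly page-straddling) miss: extend readahead */
    } else {
            /* random access: no readahead extension */
    }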

    Signed-off-by: Damien Ramonda
    Reviewed-by: Fengguang Wu
    Acked-by: Pierre Tardy
    Acked-by: David Cohen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Damien Ramonda
     
  • Signed-off-by: Daeseok Youn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daeseok Youn