14 Oct, 2020

2 commits

  • Rename head_pincount() --> head_compound_pincount(). These names are more
    accurate (or less misleading) than the original ones.

    Signed-off-by: John Hubbard
    Signed-off-by: Andrew Morton
    Cc: Qian Cai
    Cc: Matthew Wilcox
    Cc: Vlastimil Babka
    Cc: Kirill A. Shutemov
    Cc: Mike Rapoport
    Cc: William Kucharski
    Link: https://lkml.kernel.org/r/20200807183358.105097-1-jhubbard@nvidia.com
    Signed-off-by: Linus Torvalds

    John Hubbard
     
  • __dump_page() checks i_dentry is fetchable and i_ino is earlier in the
    struct than i_dentry, so it ought to work fine, but it's possible that
    struct randomisation has reordered i_ino after i_dentry and the pointer is
    just wild enough that i_dentry is fetchable and i_ino isn't.

    Also print the inode number if the dentry is invalid.

    Reported-by: Vlastimil Babka
    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: John Hubbard
    Reviewed-by: Mike Rapoport
    Link: https://lkml.kernel.org/r/20200819185710.28180-1-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

08 Aug, 2020

7 commits

  • If a compound page is being split while dump_page() is being run on that
    page, we can end up calling compound_mapcount() on a page that is no
    longer compound. This leads to a crash (already seen at least once in the
    field), due to the VM_BUG_ON_PAGE() assertion inside compound_mapcount().

    (The above is from Matthew Wilcox's analysis of Qian Cai's bug report.)

    A similar problem is possible, via compound_pincount() instead of
    compound_mapcount().

    In order to avoid this kind of crash, make dump_page() slightly more
    robust, by providing a pair of simpler routines that don't contain
    assertions: head_mapcount() and head_pincount().

    For debug tools, we don't want to go *too* far in this direction, but this
    is a simple small fix, and the crash has already been seen, so it's a good
    trade-off.

    Reported-by: Qian Cai
    Suggested-by: Matthew Wilcox
    Signed-off-by: John Hubbard
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Kirill A. Shutemov
    Cc: Mike Rapoport
    Cc: William Kucharski
    Link: http://lkml.kernel.org/r/20200804214807.169256-1-jhubbard@nvidia.com
    Signed-off-by: Linus Torvalds

    John Hubbard
     
  • The actual address of the struct page isn't particularly helpful, while
    the hashed address helps match with other messages elsewhere. Add the PFN
    that the page refers to in order to help diagnose problems where the page
    is improperly aligned for the purpose.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: John Hubbard
    Acked-by: Mike Rapoport
    Cc: "Kirill A. Shutemov"
    Cc: Vlastimil Babka
    Cc: William Kucharski
    Link: http://lkml.kernel.org/r/20200709202117.7216-7-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • The inode number helps correlate this page with debug messages elsewhere
    in the kernel.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: John Hubbard
    Acked-by: Mike Rapoport
    Cc: "Kirill A. Shutemov"
    Cc: Vlastimil Babka
    Cc: William Kucharski
    Link: http://lkml.kernel.org/r/20200709202117.7216-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • This is simpler to use than copy_from_kernel_nofault(). Also make some of
    the related error messages less verbose.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Acked-by: Mike Rapoport
    Acked-by: Vlastimil Babka
    Cc: John Hubbard
    Cc: "Kirill A. Shutemov"
    Cc: William Kucharski
    Link: http://lkml.kernel.org/r/20200709202117.7216-5-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Tail page flags contain very little useful information. Print the head
    page's flags instead. While the flags will contain "head" for tail pages,
    this should not be too confusing as the previous line starts with the word
    "head:" and so the flags should be interpreted as belonging to the head
    page.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: John Hubbard
    Acked-by: Mike Rapoport
    Acked-by: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: William Kucharski
    Link: http://lkml.kernel.org/r/20200709202117.7216-4-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Simplify both the implementation and the output by splitting all the
    compound page information onto a second line.

    Reported-by: John Hubbard
    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Tested-by: John Hubbard
    Reviewed-by: John Hubbard
    Acked-by: Mike Rapoport
    Acked-by: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: William Kucharski
    Link: http://lkml.kernel.org/r/20200709202117.7216-3-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Patch series "Improvements for dump_page()", v2.

    Here's a sample dump of a pagecache tail page with all of the patches
    applied:

    page:000000006d1c49ca refcount:6 mapcount:0 mapping:00000000136b8d90 index:0x109 pfn:0x6c645
    head:000000008bd38076 order:2 compound_mapcount:0 compound_pincount:0
    aops:xfs_address_space_operations ino:800042 dentry name:"fd"
    flags: 0x4000000000012014(uptodate|lru|private|head)
    raw: 4000000000000000 ffffd46ac1b19101 ffffffff00000202 dead000000000004
    raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
    head: 4000000000012014 ffffd46ac1b1bbc8 ffffd46ac1b1bc08 ffff91976f659560
    head: 0000000000000108 ffff919773220680 00000006ffffffff 0000000000000000
    page dumped because: testing

    This patch (of 6):

    If we can't call page_mapping() to get the page mapping, handle the
    anon/ksm/movable bits correctly.

    [akpm@linux-foundation.org: augmented code comment from John]
    Link: http://lkml.kernel.org/r/15cff11a-6762-8a6a-3f0e-dd227280cd6f@nvidia.com

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: John Hubbard
    Acked-by: Mike Rapoport
    Acked-by: Vlastimil Babka
    Cc: William Kucharski
    Cc: "Kirill A. Shutemov"
    Link: http://lkml.kernel.org/r/20200709202117.7216-1-willy@infradead.org
    Link: http://lkml.kernel.org/r/20200709202117.7216-2-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     


10 Jun, 2020

1 commit

  • Except for historical confusion in the kprobes/uprobes and bpf tracers,
    which has been fixed now, there is no good reason to ever allow user
    memory accesses from probe_kernel_read. Switch probe_kernel_read to only
    read from kernel memory.

    [akpm@linux-foundation.org: update it for "mm, dump_page(): do not crash with invalid mapping pointer"]

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Alexei Starovoitov
    Cc: Daniel Borkmann
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Masami Hiramatsu
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20200521152301.2587579-17-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

03 Jun, 2020

1 commit

  • We have seen a following problem on a RPi4 with 1G RAM:

    BUG: Bad page state in process systemd-hwdb pfn:35601
    page:ffff7e0000d58040 refcount:15 mapcount:131221 mapping:efd8fe765bc80080 index:0x1 compound_mapcount: -32767
    Unable to handle kernel paging request at virtual address efd8fe765bc80080
    Mem abort info:
    ESR = 0x96000004
    Exception class = DABT (current EL), IL = 32 bits
    SET = 0, FnV = 0
    EA = 0, S1PTW = 0
    Data abort info:
    ISV = 0, ISS = 0x00000004
    CM = 0, WnR = 0
    [efd8fe765bc80080] address between user and kernel address ranges
    Internal error: Oops: 96000004 [#1] SMP
    Modules linked in: btrfs libcrc32c xor xor_neon zlib_deflate raid6_pq mmc_block xhci_pci xhci_hcd usbcore sdhci_iproc sdhci_pltfm sdhci mmc_core clk_raspberrypi gpio_raspberrypi_exp pcie_brcmstb bcm2835_dma gpio_regulator phy_generic fixed sg scsi_mod efivarfs
    Supported: No, Unreleased kernel
    CPU: 3 PID: 408 Comm: systemd-hwdb Not tainted 5.3.18-8-default #1 SLE15-SP2 (unreleased)
    Hardware name: raspberrypi rpi/rpi, BIOS 2020.01 02/21/2020
    pstate: 40000085 (nZcv daIf -PAN -UAO)
    pc : __dump_page+0x268/0x368
    lr : __dump_page+0xc4/0x368
    sp : ffff000012563860
    x29: ffff000012563860 x28: ffff80003ddc4300
    x27: 0000000000000010 x26: 000000000000003f
    x25: ffff7e0000d58040 x24: 000000000000000f
    x23: efd8fe765bc80080 x22: 0000000000020095
    x21: efd8fe765bc80080 x20: ffff000010ede8b0
    x19: ffff7e0000d58040 x18: ffffffffffffffff
    x17: 0000000000000001 x16: 0000000000000007
    x15: ffff000011689708 x14: 3030386362353637
    x13: 6566386466653a67 x12: 6e697070616d2031
    x11: 32323133313a746e x10: 756f6370616d2035
    x9 : ffff00001168a840 x8 : ffff00001077a670
    x7 : 000000000000013d x6 : ffff0000118a43b5
    x5 : 0000000000000001 x4 : ffff80003dd9e2c8
    x3 : ffff80003dd9e2c8 x2 : 911c8d7c2f483500
    x1 : dead000000000100 x0 : efd8fe765bc80080
    Call trace:
    __dump_page+0x268/0x368
    bad_page+0xd4/0x168
    check_new_page_bad+0x80/0xb8
    rmqueue_bulk.constprop.26+0x4d8/0x788
    get_page_from_freelist+0x4d4/0x1228
    __alloc_pages_nodemask+0x134/0xe48
    alloc_pages_vma+0x198/0x1c0
    do_anonymous_page+0x1a4/0x4d8
    __handle_mm_fault+0x4e8/0x560
    handle_mm_fault+0x104/0x1e0
    do_page_fault+0x1e8/0x4c0
    do_translation_fault+0xb0/0xc0
    do_mem_abort+0x50/0xb0
    el0_da+0x24/0x28
    Code: f9401025 8b8018a0 9a851005 17ffffca (f94002a0)

    Besides the underlying issue with page->mapping containing a bogus value
    for some reason, we can see that __dump_page() crashed by trying to read
    the pointer at mapping->host, turning a recoverable warning into full
    Oops.

    It can be expected that when a page is reported as being in a bad state
    for some reason, the pointers in it should not be trusted blindly.

    So this patch treats all data in __dump_page() that depends on
    page->mapping as lava, using probe_kernel_read_strict(). Ideally this
    would include the dentry->d_parent recursively, but that would mean
    changing printk handler for %pd. Chances of reaching the dentry
    printing part with an initially bogus mapping pointer should be rather
    low, though.

    Also prefix printing mapping->a_ops with a description of what is being
    printed. In case the value is bogus, %ps will print raw value instead
    of the symbol name and then it's not obvious at all that it's printing
    a_ops.

    Reported-by: Petr Tesarik
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Acked-by: Kirill A. Shutemov
    Cc: Matthew Wilcox
    Cc: John Hubbard
    Link: http://lkml.kernel.org/r/20200331165454.12263-1-vbabka@suse.cz
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

03 Apr, 2020

2 commits

  • As part of pin_user_pages() and related API calls, pages are "dma-pinned".
    For the case of compound pages of order > 1, the per-page accounting of
    dma pins is accomplished via the 3rd struct page in the compound page. In
    order to support debugging of any pin_user_pages()- related problems,
    enhance dump_page() so as to report the pin count in that case.

    Documentation/core-api/pin_user_pages.rst is also updated accordingly.

    Signed-off-by: John Hubbard
    Signed-off-by: Andrew Morton
    Acked-by: Kirill A. Shutemov
    Cc: Jan Kara
    Cc: Matthew Wilcox
    Cc: Ira Weiny
    Cc: Jérôme Glisse
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Jason Gunthorpe
    Cc: Jonathan Corbet
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Shuah Khan
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/20200211001536.1027652-13-jhubbard@nvidia.com
    Signed-off-by: Linus Torvalds

    John Hubbard
     
  • There was no protection against a corrupted struct page having an
    implausible compound_head(). Sanity check that a compound page has a head
    within reach of the maximum allocatable page (this will need to be
    adjusted if one of the plans to allocate 1GB pages comes to fruition). In
    addition,

    - Print the mapping pointer using %p instead of %px. The actual value of
    the pointer can be read out of the raw page dump and using %p gives a
    chance to correlate it with an earlier printk of the mapping pointer
    - Print the mapping pointer from the head page, not the tail page
    (the tail ->mapping pointer may be in use for other purposes, eg part
    of a list_head)
    - Print the order of the page for compound pages
    - Dump the raw head page as well as the raw page
    - Print the refcount from the head page, not the tail page

    Suggested-by: Kirill A. Shutemov
    Co-developed-by: John Hubbard
    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: John Hubbard
    Signed-off-by: Andrew Morton
    Cc: Ira Weiny
    Cc: Jan Kara
    Cc: Jérôme Glisse
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Jason Gunthorpe
    Cc: Jonathan Corbet
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Shuah Khan
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/20200211001536.1027652-12-jhubbard@nvidia.com
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

01 Feb, 2020

2 commits

  • It is not that hard to trigger lockdep splats by calling printk from
    under zone->lock. Most of them are false positives caused by lock
    chains introduced early in the boot process and they do not cause any
    real problems (although most of the early boot lock dependencies could
    happen after boot as well). There are some console drivers which do
    allocate from the printk context as well and those should be fixed. In
    any case, false positives are not that trivial to work around and it is
    far from optimal to lose lockdep functionality for something that is a
    non-issue.

    So change has_unmovable_pages() so that it no longer calls dump_page()
    itself - instead it returns a "struct page *" of the unmovable page back
    to the caller so that in the case of a has_unmovable_pages() failure,
    the caller can call dump_page() after releasing zone->lock. Also, make
    dump_page() able to report a CMA page as well, so the reason string
    from has_unmovable_pages() can be removed.

    Even though has_unmovable_pages doesn't hold any reference to the
    returned page this should be reasonably safe for the purpose of
    reporting the page (dump_page) because it cannot be hotremoved in the
    context of memory unplug. The state of the page might change but that
    is the case even with the existing code as zone->lock only plays role
    for free pages.

    While at it, remove a similar but unnecessary debug-only printk() as
    well. A sample of one of those lockdep splats is,

    WARNING: possible circular locking dependency detected
    ------------------------------------------------------
    test.sh/8653 is trying to acquire lock:
    ffffffff865a4460 (console_owner){-.-.}, at:
    console_unlock+0x207/0x750

    but task is already holding lock:
    ffff88883fff3c58 (&(&zone->lock)->rlock){-.-.}, at:
    __offline_isolated_pages+0x179/0x3e0

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #3 (&(&zone->lock)->rlock){-.-.}:
    __lock_acquire+0x5b3/0xb40
    lock_acquire+0x126/0x280
    _raw_spin_lock+0x2f/0x40
    rmqueue_bulk.constprop.21+0xb6/0x1160
    get_page_from_freelist+0x898/0x22c0
    __alloc_pages_nodemask+0x2f3/0x1cd0
    alloc_pages_current+0x9c/0x110
    allocate_slab+0x4c6/0x19c0
    new_slab+0x46/0x70
    ___slab_alloc+0x58b/0x960
    __slab_alloc+0x43/0x70
    __kmalloc+0x3ad/0x4b0
    __tty_buffer_request_room+0x100/0x250
    tty_insert_flip_string_fixed_flag+0x67/0x110
    pty_write+0xa2/0xf0
    n_tty_write+0x36b/0x7b0
    tty_write+0x284/0x4c0
    __vfs_write+0x50/0xa0
    vfs_write+0x105/0x290
    redirected_tty_write+0x6a/0xc0
    do_iter_write+0x248/0x2a0
    vfs_writev+0x106/0x1e0
    do_writev+0xd4/0x180
    __x64_sys_writev+0x45/0x50
    do_syscall_64+0xcc/0x76c
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    -> #2 (&(&port->lock)->rlock){-.-.}:
    __lock_acquire+0x5b3/0xb40
    lock_acquire+0x126/0x280
    _raw_spin_lock_irqsave+0x3a/0x50
    tty_port_tty_get+0x20/0x60
    tty_port_default_wakeup+0xf/0x30
    tty_port_tty_wakeup+0x39/0x40
    uart_write_wakeup+0x2a/0x40
    serial8250_tx_chars+0x22e/0x440
    serial8250_handle_irq.part.8+0x14a/0x170
    serial8250_default_handle_irq+0x5c/0x90
    serial8250_interrupt+0xa6/0x130
    __handle_irq_event_percpu+0x78/0x4f0
    handle_irq_event_percpu+0x70/0x100
    handle_irq_event+0x5a/0x8b
    handle_edge_irq+0x117/0x370
    do_IRQ+0x9e/0x1e0
    ret_from_intr+0x0/0x2a
    cpuidle_enter_state+0x156/0x8e0
    cpuidle_enter+0x41/0x70
    call_cpuidle+0x5e/0x90
    do_idle+0x333/0x370
    cpu_startup_entry+0x1d/0x1f
    start_secondary+0x290/0x330
    secondary_startup_64+0xb6/0xc0

    -> #1 (&port_lock_key){-.-.}:
    __lock_acquire+0x5b3/0xb40
    lock_acquire+0x126/0x280
    _raw_spin_lock_irqsave+0x3a/0x50
    serial8250_console_write+0x3e4/0x450
    univ8250_console_write+0x4b/0x60
    console_unlock+0x501/0x750
    vprintk_emit+0x10d/0x340
    vprintk_default+0x1f/0x30
    vprintk_func+0x44/0xd4
    printk+0x9f/0xc5

    -> #0 (console_owner){-.-.}:
    check_prev_add+0x107/0xea0
    validate_chain+0x8fc/0x1200
    __lock_acquire+0x5b3/0xb40
    lock_acquire+0x126/0x280
    console_unlock+0x269/0x750
    vprintk_emit+0x10d/0x340
    vprintk_default+0x1f/0x30
    vprintk_func+0x44/0xd4
    printk+0x9f/0xc5
    __offline_isolated_pages.cold.52+0x2f/0x30a
    offline_isolated_pages_cb+0x17/0x30
    walk_system_ram_range+0xda/0x160
    __offline_pages+0x79c/0xa10
    offline_pages+0x11/0x20
    memory_subsys_offline+0x7e/0xc0
    device_offline+0xd5/0x110
    state_store+0xc6/0xe0
    dev_attr_store+0x3f/0x60
    sysfs_kf_write+0x89/0xb0
    kernfs_fop_write+0x188/0x240
    __vfs_write+0x50/0xa0
    vfs_write+0x105/0x290
    ksys_write+0xc6/0x160
    __x64_sys_write+0x43/0x50
    do_syscall_64+0xcc/0x76c
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    other info that might help us debug this:

    Chain exists of:
    console_owner --> &(&port->lock)->rlock --> &(&zone->lock)->rlock

    Possible unsafe locking scenario:

           CPU0                         CPU1
           ----                         ----
      lock(&(&zone->lock)->rlock);
                                   lock(&(&port->lock)->rlock);
                                   lock(&(&zone->lock)->rlock);
      lock(console_owner);

    *** DEADLOCK ***

    9 locks held by test.sh/8653:
    #0: ffff88839ba7d408 (sb_writers#4){.+.+}, at:
    vfs_write+0x25f/0x290
    #1: ffff888277618880 (&of->mutex){+.+.}, at:
    kernfs_fop_write+0x128/0x240
    #2: ffff8898131fc218 (kn->count#115){.+.+}, at:
    kernfs_fop_write+0x138/0x240
    #3: ffffffff86962a80 (device_hotplug_lock){+.+.}, at:
    lock_device_hotplug_sysfs+0x16/0x50
    #4: ffff8884374f4990 (&dev->mutex){....}, at:
    device_offline+0x70/0x110
    #5: ffffffff86515250 (cpu_hotplug_lock.rw_sem){++++}, at:
    __offline_pages+0xbf/0xa10
    #6: ffffffff867405f0 (mem_hotplug_lock.rw_sem){++++}, at:
    percpu_down_write+0x87/0x2f0
    #7: ffff88883fff3c58 (&(&zone->lock)->rlock){-.-.}, at:
    __offline_isolated_pages+0x179/0x3e0
    #8: ffffffff865a4920 (console_lock){+.+.}, at:
    vprintk_emit+0x100/0x340

    stack backtrace:
    Hardware name: HPE ProLiant DL560 Gen10/ProLiant DL560 Gen10,
    BIOS U34 05/21/2019
    Call Trace:
    dump_stack+0x86/0xca
    print_circular_bug.cold.31+0x243/0x26e
    check_noncircular+0x29e/0x2e0
    check_prev_add+0x107/0xea0
    validate_chain+0x8fc/0x1200
    __lock_acquire+0x5b3/0xb40
    lock_acquire+0x126/0x280
    console_unlock+0x269/0x750
    vprintk_emit+0x10d/0x340
    vprintk_default+0x1f/0x30
    vprintk_func+0x44/0xd4
    printk+0x9f/0xc5
    __offline_isolated_pages.cold.52+0x2f/0x30a
    offline_isolated_pages_cb+0x17/0x30
    walk_system_ram_range+0xda/0x160
    __offline_pages+0x79c/0xa10
    offline_pages+0x11/0x20
    memory_subsys_offline+0x7e/0xc0
    device_offline+0xd5/0x110
    state_store+0xc6/0xe0
    dev_attr_store+0x3f/0x60
    sysfs_kf_write+0x89/0xb0
    kernfs_fop_write+0x188/0x240
    __vfs_write+0x50/0xa0
    vfs_write+0x105/0x290
    ksys_write+0xc6/0x160
    __x64_sys_write+0x43/0x50
    do_syscall_64+0xcc/0x76c
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Link: http://lkml.kernel.org/r/20200117181200.20299-1-cai@lca.pw
    Signed-off-by: Qian Cai
    Reviewed-by: David Hildenbrand
    Cc: Michal Hocko
    Cc: Sergey Senozhatsky
    Cc: Petr Mladek
    Cc: Steven Rostedt (VMware)
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • Commit 76a1850e4572 ("mm/debug.c: __dump_page() prints an extra line")
    inadvertently removed printing of page flags for pages that are neither
    anon nor ksm nor have a mapping. Fix that.

    Using pr_cont() again would be a solution, but the commit explicitly
    removed its use. Avoiding the danger of mixing up split lines from
    multiple CPUs might be beneficial for near-panic dumps like this, so fix
    this without reintroducing pr_cont().

    Link: http://lkml.kernel.org/r/9f884d5c-ca60-dc7b-219c-c081c755fab6@suse.cz
    Fixes: 76a1850e4572 ("mm/debug.c: __dump_page() prints an extra line")
    Signed-off-by: Vlastimil Babka
    Reported-by: Anshuman Khandual
    Reported-by: Michal Hocko
    Acked-by: Michal Hocko
    Cc: David Hildenbrand
    Cc: Qian Cai
    Cc: Oscar Salvador
    Cc: Mel Gorman
    Cc: Mike Rapoport
    Cc: Dan Williams
    Cc: Pavel Tatashin
    Cc: Ralph Campbell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     


16 Nov, 2019

2 commits

  • PageAnon() and PageKsm() use the low two bits of the page->mapping
    pointer to indicate the page type. PageAnon() only checks the LSB while
    PageKsm() checks the least significant 2 bits are equal to 3.

    Therefore, PageAnon() is true for KSM pages. __dump_page() incorrectly
    will never print "ksm" because it checks PageAnon() first. Fix this by
    checking PageKsm() first.

    Link: http://lkml.kernel.org/r/20191113000651.20677-1-rcampbell@nvidia.com
    Fixes: 1c6fb1d89e73 ("mm: print more information about mapping in __dump_page")
    Signed-off-by: Ralph Campbell
    Acked-by: Michal Hocko
    Cc: Jerome Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ralph Campbell
     
  • When dumping struct page information, __dump_page() prints the page type
    with a trailing blank followed by the page flags on a separate line:

    anon
    flags: 0x100000000090034(uptodate|lru|active|head|swapbacked)

    It looks like the intent was to use pr_cont() for printing "flags:" but
    pr_cont() usage is discouraged so fix this by extending the format to
    include the flags into a single line:

    anon flags: 0x100000000090034(uptodate|lru|active|head|swapbacked)

    If the page is file backed, the name might be long so use two lines:

    shmem_aops name:"dev/zero"
    flags: 0x10000000008000c(uptodate|dirty|swapbacked)

    Eliminate pr_cont() usage as well for appending compound_mapcount.

    Link: http://lkml.kernel.org/r/20191112012608.16926-1-rcampbell@nvidia.com
    Signed-off-by: Ralph Campbell
    Reviewed-by: Andrew Morton
    Cc: Jerome Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ralph Campbell
     

15 May, 2019

1 commit

  • Commit 0139aa7b7fa ("mm: rename _count, field of the struct page, to
    _refcount") left out a couple of references to the old field name. Fix
    that.

    Link: http://lkml.kernel.org/r/cedf87b02eb8a6b3eac57e8e91da53fb15c3c44c.1556537475.git.baruch@tkos.co.il
    Fixes: 0139aa7b7fa ("mm: rename _count, field of the struct page, to _refcount")
    Signed-off-by: Baruch Siach
    Reviewed-by: Andrew Morton
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Baruch Siach
     

30 Mar, 2019

2 commits

  • While debugging something, I added a dump_page() into do_swap_page(),
    and I got the splat from below. The issue happens when dereferencing
    mapping->host in __dump_page():

    ...
    else if (mapping) {
            pr_warn("%ps ", mapping->a_ops);
            if (mapping->host->i_dentry.first) {
                    struct dentry *dentry;
                    dentry = container_of(mapping->host->i_dentry.first,
                                          struct dentry, d_u.d_alias);
                    pr_warn("name:\"%pd\" ", dentry);
            }
    }
    ...

    Swap address space does not contain an inode information, and so
    mapping->host equals NULL.

    Although the dump_page() call was added artificially into
    do_swap_page(), I am not sure if we can hit this from any other path, so
    it looks worth fixing it. We can easily do that by checking
    mapping->host first.

    Link: http://lkml.kernel.org/r/20190318072931.29094-1-osalvador@suse.de
    Fixes: 1c6fb1d89e73c ("mm: print more information about mapping in __dump_page")
    Signed-off-by: Oscar Salvador
    Acked-by: Michal Hocko
    Acked-by: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     
  • atomic64_read() on ppc64le returns "long int", so fix the same way as
    commit d549f545e690 ("drm/virtio: use %llu format string form
    atomic64_t") by adding a cast to u64, which makes it work on all arches.

    In file included from ./include/linux/printk.h:7,
    from ./include/linux/kernel.h:15,
    from mm/debug.c:9:
    mm/debug.c: In function 'dump_mm':
    ./include/linux/kern_levels.h:5:18: warning: format '%llx' expects argument of type 'long long unsigned int', but argument 19 has type 'long int' [-Wformat=]
    #define KERN_SOH "\001" /* ASCII Start Of Header */
    ^~~~~~
    ./include/linux/kern_levels.h:8:20: note: in expansion of macro
    'KERN_SOH'
    #define KERN_EMERG KERN_SOH "0" /* system is unusable */
    ^~~~~~~~
    ./include/linux/printk.h:297:9: note: in expansion of macro 'KERN_EMERG'
    printk(KERN_EMERG pr_fmt(fmt), ##__VA_ARGS__)
    ^~~~~~~~~~
    mm/debug.c:133:2: note: in expansion of macro 'pr_emerg'
    pr_emerg("mm %px mmap %px seqnum %llu task_size %lu"
    ^~~~~~~~
    mm/debug.c:140:17: note: format string is defined here
    "pinned_vm %llx data_vm %lx exec_vm %lx stack_vm %lx"
    ~~~^
    %lx

    Link: http://lkml.kernel.org/r/20190310183051.87303-1-cai@lca.pw
    Fixes: 70f8a3ca68d3 ("mm: make mm->pinned_vm an atomic64 counter")
    Signed-off-by: Qian Cai
    Acked-by: Davidlohr Bueso
    Cc: Jason Gunthorpe
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     

10 Mar, 2019

1 commit

  • Pull rdma updates from Jason Gunthorpe:
    "This has been a slightly more active cycle than normal with ongoing
    core changes and quite a lot of collected driver updates.

    - Various driver fixes for bnxt_re, cxgb4, hns, mlx5, pvrdma, rxe

    - A new data transfer mode for HFI1 giving higher performance

    - Significant functional and bug fix update to the mlx5
    On-Demand-Paging MR feature

    - A chip hang reset recovery system for hns

    - Change mm->pinned_vm to an atomic64

    - Update bnxt_re to support a new 57500 chip

    - A sane netlink 'rdma link add' method for creating rxe devices and
    fixing the various unregistration race conditions in rxe's
    unregister flow

    - Allow lookup up objects by an ID over netlink

    - Various reworking of the core to driver interface:
    - drivers should not assume umem SGLs are in PAGE_SIZE chunks
    - ucontext is accessed via udata not other means
    - start to make the core code responsible for object memory
    allocation
    - drivers should convert struct device to struct ib_device via a
    helper
    - drivers have more tools to avoid use after unregister problems"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (280 commits)
    net/mlx5: ODP support for XRC transport is not enabled by default in FW
    IB/hfi1: Close race condition on user context disable and close
    RDMA/umem: Revert broken 'off by one' fix
    RDMA/umem: minor bug fix in error handling path
    RDMA/hns: Use GFP_ATOMIC in hns_roce_v2_modify_qp
    cxgb4: kfree mhp after the debug print
    IB/rdmavt: Fix concurrency panics in QP post_send and modify to error
    IB/rdmavt: Fix loopback send with invalidate ordering
    IB/iser: Fix dma_nents type definition
    IB/mlx5: Set correct write permissions for implicit ODP MR
    bnxt_re: Clean cq for kernel consumers only
    RDMA/uverbs: Don't do double free of allocated PD
    RDMA: Handle ucontext allocations by IB/core
    RDMA/core: Fix a WARN() message
    bnxt_re: fix the regression due to changes in alloc_pbl
    IB/mlx4: Increase the timeout for CM cache
    IB/core: Abort page fault handler silently during owning process exit
    IB/mlx5: Validate correct PD before prefetch MR
    IB/mlx5: Protect against prefetch of invalid MR
    RDMA/uverbs: Store PR pointer before it is overwritten
    ...

    Linus Torvalds
     

22 Feb, 2019

1 commit

  • Evaluating page_mapping() on a poisoned page ends up dereferencing junk
    and making PF_POISONED_CHECK() considerably crashier than intended:

    Unable to handle kernel NULL pointer dereference at virtual address 0000000000000006
    Mem abort info:
    ESR = 0x96000005
    Exception class = DABT (current EL), IL = 32 bits
    SET = 0, FnV = 0
    EA = 0, S1PTW = 0
    Data abort info:
    ISV = 0, ISS = 0x00000005
    CM = 0, WnR = 0
    user pgtable: 4k pages, 39-bit VAs, pgdp = 00000000c2f6ac38
    [0000000000000006] pgd=0000000000000000, pud=0000000000000000
    Internal error: Oops: 96000005 [#1] PREEMPT SMP
    Modules linked in:
    CPU: 2 PID: 491 Comm: bash Not tainted 5.0.0-rc1+ #1
    Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno Development Platform, BIOS EDK II Dec 17 2018
    pstate: 00000005 (nzcv daif -PAN -UAO)
    pc : page_mapping+0x18/0x118
    lr : __dump_page+0x1c/0x398
    Process bash (pid: 491, stack limit = 0x000000004ebd4ecd)
    Call trace:
    page_mapping+0x18/0x118
    __dump_page+0x1c/0x398
    dump_page+0xc/0x18
    remove_store+0xbc/0x120
    dev_attr_store+0x18/0x28
    sysfs_kf_write+0x40/0x50
    kernfs_fop_write+0x130/0x1d8
    __vfs_write+0x30/0x180
    vfs_write+0xb4/0x1a0
    ksys_write+0x60/0xd0
    __arm64_sys_write+0x18/0x20
    el0_svc_common+0x94/0xf8
    el0_svc_handler+0x68/0x70
    el0_svc+0x8/0xc
    Code: f9400401 d1000422 f240003f 9a801040 (f9400402)
    ---[ end trace cdb5eb5bf435cecb ]---

    Fix that by not inspecting the mapping until we've determined that it's
    likely to be valid. Now the above condition still ends up stopping the
    kernel, but in the correct manner:

    page:ffffffbf20000000 is uninitialized and poisoned
    raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
    raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
    page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
    ------------[ cut here ]------------
    kernel BUG at ./include/linux/mm.h:1006!
    Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
    Modules linked in:
    CPU: 1 PID: 483 Comm: bash Not tainted 5.0.0-rc1+ #3
    Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno Development Platform, BIOS EDK II Dec 17 2018
    pstate: 40000005 (nZcv daif -PAN -UAO)
    pc : remove_store+0xbc/0x120
    lr : remove_store+0xbc/0x120
    ...

    Link: http://lkml.kernel.org/r/03b53ee9d7e76cda4b9b5e1e31eea080db033396.1550071778.git.robin.murphy@arm.com
    Fixes: 1c6fb1d89e73 ("mm: print more information about mapping in __dump_page")
    Signed-off-by: Robin Murphy
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Murphy
     

08 Feb, 2019

1 commit

    Taking a sleeping lock to _only_ increment a variable is quite the
    overkill, and pretty much all users do this. Furthermore, some drivers
    (e.g. infiniband and scif) that need pinned semantics go to quite some
    trouble to delay the (un)accounting of pinned pages via a workqueue
    when the lock cannot be acquired.

    By making the counter atomic we no longer need to hold the mmap_sem and
    can simplify some code around it for pinned_vm users. The counter is
    64-bit, so we need not worry about overflow from, e.g., rdma input that
    is controlled from userspace.

    Reviewed-by: Ira Weiny
    Reviewed-by: Christoph Lameter
    Reviewed-by: Daniel Jordan
    Reviewed-by: Jan Kara
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Jason Gunthorpe

    Davidlohr Bueso
     

29 Dec, 2018

3 commits

  • Those strings are immutable as well.

    Link: http://lkml.kernel.org/r/20181124090508.GB10877@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
    __dump_page messages use the KERN_EMERG or KERN_ALERT loglevel (this
    has been the case since 2004). Most callers of this function are really
    detecting a critical page state and BUG right after. On the other hand,
    the function is also called from contexts that just want to report the
    page state, and those would rather not disrupt the logs that much (e.g.
    some systems route these messages to the normal console).

    Reduce the loglevel to KERN_WARNING to make dump_page easier to reuse
    in other contexts, while the messages will still make it to the kernel
    log in most setups. Even if the loglevel setup filters warnings away,
    the paths that are really critical already print a more targeted error
    or panic, and that should make it to the kernel log.

    [mhocko@kernel.org: fix __dump_page()]
    Link: http://lkml.kernel.org/r/20181212142540.GA7378@dhcp22.suse.cz
    [akpm@linux-foundation.org: s/KERN_WARN/KERN_WARNING/, per Michal]
    Link: http://lkml.kernel.org/r/20181107101830.17405-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Reviewed-by: Anshuman Khandual
    Cc: Baoquan He
    Cc: Oscar Salvador
    Cc: Oscar Salvador
    Cc: William Kucharski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    I have been promising to improve memory offlining failure debugging for
    quite some time. As things stand now, we get only very limited
    information in the kernel log when the offlining fails. It is usually
    only

    [ 1984.506184] rac1 kernel: memory offlining [mem 0x82600000000-0x8267fffffff] failed

    with no further details. We do not know what exactly fails and for what
    reason. Whenever I was forced to debug such a failure I've always had to
    do a debugging patch to tell me more. We can enable some tracepoints but
    it would be much better to get a better picture without using them.

    This patch series does two things. The first is to make dump_page more
    usable by printing more information about the mapping (patch 1). It
    then reduces the log level from emerg to warning so that the function
    is usable from less critical contexts (patch 2). I have also added more
    detailed information about the offlining failure (patch 4) and finally
    added dump_page to the isolation and offlining migration paths. Patch 3
    is a trivial cleanup.

    This patch (of 6):

    __dump_page prints the mapping pointer, but that is quite unhelpful for
    many reports because the pointer itself only distinguishes anon/ksm
    mappings from other ones (by the lowest bits being set). Sometimes it
    would be much more helpful to know what kind of mapping it actually is,
    and if we know it is a file mapping, to also try to resolve the dentry
    name.

    [dan.carpenter@oracle.com: fix a width vs precision bug in printk]
    Link: http://lkml.kernel.org/r/20181123072135.gqvblm2vdujbvfjs@kili.mountain
    [mhocko@kernel.org: use %dp to print dentry]
    Link: http://lkml.kernel.org/r/20181125080834.GB12455@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20181107101830.17405-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Reviewed-by: Anshuman Khandual
    Reviewed-by: William Kucharski
    Cc: Oscar Salvador
    Cc: Baoquan He
    Cc: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

27 Oct, 2018

1 commit

  • Patch series "Address issues slowing persistent memory initialization", v5.

    The main thing this patch set achieves is that it allows us to initialize
    each node worth of persistent memory independently. As a result we reduce
    page init time by about 2 minutes because instead of taking 30 to 40
    seconds per node and going through each node one at a time, we process all
    4 nodes in parallel in the case of a 12TB persistent memory setup spread
    evenly over 4 nodes.

    This patch (of 3):

    On systems with a large amount of memory it can take a significant amount
    of time to initialize all of the page structs with the PAGE_POISON_PATTERN
    value. I have seen it take over 2 minutes to initialize a system with
    over 12TB of RAM.

    In order to work around the issue I had to disable CONFIG_DEBUG_VM and
    then the boot time returned to something much more reasonable as the
    arch_add_memory call completed in milliseconds versus seconds. However in
    doing that I had to disable all of the other VM debugging on the system.

    In order to work around a kernel that might have CONFIG_DEBUG_VM enabled
    on a system that has a large amount of memory I have added a new kernel
    parameter named "vm_debug" that can be set to "-" in order to disable it.

    Link: http://lkml.kernel.org/r/20180925201921.3576.84239.stgit@localhost.localdomain
    Reviewed-by: Pavel Tatashin
    Signed-off-by: Alexander Duyck
    Cc: Dave Hansen
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Duyck
     

14 Sep, 2018

1 commit

  • Jann Horn points out that the vmacache_flush_all() function is not only
    potentially expensive, it's buggy too. It also happens to be entirely
    unnecessary, because the sequence number overflow case can be avoided by
    simply making the sequence number be 64-bit. That doesn't even grow the
    data structures in question, because the other adjacent fields are
    already 64-bit.

    So simplify the whole thing by just making the sequence number overflow
    case go away entirely, which gets rid of all the complications and makes
    the code faster too. Win-win.

    [ Oleg Nesterov points out that the VMACACHE_FULL_FLUSHES statistics
    also just goes away entirely with this ]

    Reported-by: Jann Horn
    Suggested-by: Will Deacon
    Acked-by: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

04 Jul, 2018

1 commit

    If struct page is poisoned and uninitialized access is detected via
    PF_POISONED_CHECK(page), dump_page() is called to output the page. But
    dump_page() itself accesses struct page to determine how to print it,
    and therefore gets into a recursive loop.

    For example:

    dump_page()
    __dump_page()
    PageSlab(page)
    PF_POISONED_CHECK(page)
    VM_BUG_ON_PGFLAGS(PagePoisoned(page), page)
    dump_page() recursion loop.

    Link: http://lkml.kernel.org/r/20180702180536.2552-1-pasha.tatashin@oracle.com
    Fixes: f165b378bbdf ("mm: uninitialized struct page poisoning sanity checking")
    Signed-off-by: Pavel Tatashin
    Acked-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     

05 Jan, 2018

1 commit

  • With the recent addition of hashed kernel pointers, places which need to
    produce useful debug output have to specify %px, not %p. This patch
    fixes all the VM debug to use %px. This is appropriate because it's
    debug output that the user should never be able to trigger, and kernel
    developers need to see the actual pointers.

    Link: http://lkml.kernel.org/r/20171219133236.GE13680@bombadil.infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Michal Hocko
    Cc: "Tobin C. Harding"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

16 Nov, 2017

3 commits

  • Currently, we account page tables separately for each page table level,
    but that's redundant -- we only make use of total memory allocated to
    page tables for oom_badness calculation. We also provide the
    information to userspace, but it has dubious value there too.

    This patch switches page table accounting to single counter.

    mm->pgtables_bytes is now used to account all page table levels. We use
    bytes, because page table size for different levels of page table tree
    may be different.

    The change has a user-visible effect: VmPMD and VmPUD are no longer
    reported in /proc/[pid]/status. Not sure if anybody uses them. (As an
    alternative, we can always report 0 kB for them.)

    OOM-killer report is also slightly changed: we now report pgtables_bytes
    instead of nr_ptes, nr_pmd, nr_puds.

    Apart from reducing the number of per-mm counters, the benefit is that
    we now calculate oom_badness() more correctly for machines that have
    different page table sizes depending on the level, or whose page tables
    are smaller than a page.

    The only downside can be debuggability because we do not know which page
    table level could leak. But I do not remember many bugs that would be
    caught by separate counters so I wouldn't lose sleep over this.

    [akpm@linux-foundation.org: fix mm/huge_memory.c]
    Link: http://lkml.kernel.org/r/20171006100651.44742-2-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    [kirill.shutemov@linux.intel.com: fix build]
    Link: http://lkml.kernel.org/r/20171016150113.ikfxy3e7zzfvsr4w@black.fi.intel.com
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Let's add wrappers for ->nr_ptes with the same interface as for nr_pmd
    and nr_pud.

    The patch also makes nr_ptes accounting dependent on CONFIG_MMU. Page
    table accounting doesn't make sense if you don't have page tables.

    It's preparation for consolidation of page-table counters in mm_struct.

    Link: http://lkml.kernel.org/r/20171006100651.44742-1-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    On a machine with 5-level paging support, a process can allocate a
    significant amount of memory and stay unnoticed by the oom-killer and
    memory cgroup. The trick is to allocate a lot of PUD page tables. We
    don't account PUD page tables, only PMD and PTE.

    We already addressed the same issue for PMD page tables, see commit
    dc6c9a35b66b ("mm: account pmd page tables to the process").
    Introduction of 5-level paging brings the same issue for PUD page
    tables.

    The patch expands accounting to PUD level.

    [kirill.shutemov@linux.intel.com: s/pmd_t/pud_t/]
    Link: http://lkml.kernel.org/r/20171004074305.x35eh5u7ybbt5kar@black.fi.intel.com
    [heiko.carstens@de.ibm.com: s390/mm: fix pud table accounting]
    Link: http://lkml.kernel.org/r/20171103090551.18231-1-heiko.carstens@de.ibm.com
    Link: http://lkml.kernel.org/r/20171002080427.3320-1-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Heiko Carstens
    Acked-by: Rik van Riel
    Acked-by: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied
    to a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if <5
    lines).
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

11 Aug, 2017

2 commits

    Currently, tlb_flush_pending is used only for CONFIG_[NUMA_BALANCING|
    COMPACTION], but upcoming patches that solve a subtle TLB flush
    batching problem will use it regardless of compaction/NUMA, so this
    patch removes that dependency.

    [akpm@linux-foundation.org: remove more ifdefs from world's ugliest printk statement]
    Link: http://lkml.kernel.org/r/20170802000818.4760-6-namit@vmware.com
    Signed-off-by: Minchan Kim
    Signed-off-by: Nadav Amit
    Acked-by: Mel Gorman
    Cc: "David S. Miller"
    Cc: Andrea Arcangeli
    Cc: Andy Lutomirski
    Cc: Heiko Carstens
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Jeff Dike
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Nadav Amit
    Cc: Rik van Riel
    Cc: Russell King
    Cc: Sergey Senozhatsky
    Cc: Tony Luck
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Patch series "fixes of TLB batching races", v6.

    It turns out that the Linux TLB batching mechanism suffers from various
    races. Races caused by batching during reclamation were recently
    handled by Mel, and this patch-set deals with the others. The more
    fundamental issue is that concurrent updates of the page-tables allow
    for TLB flushes to be batched on one core, while another core changes
    the page-tables. This other core may assume a PTE change does not
    require a flush based on the updated PTE value, while it is unaware that
    TLB flushes are still pending.

    This behavior affects KSM (which may result in memory corruption) and
    MADV_FREE and MADV_DONTNEED (which may result in incorrect behavior). A
    proof-of-concept can easily produce the wrong behavior of MADV_DONTNEED.
    Memory corruption in KSM is harder to produce in practice, but was
    observed by hacking the kernel and adding a delay before flushing and
    replacing the KSM page.

    Finally, there is also one memory barrier missing, which may affect
    architectures with a weak memory model.

    This patch (of 7):

    Setting and clearing mm->tlb_flush_pending can be performed by multiple
    threads, since mmap_sem may only be acquired for read in
    task_numa_work(). If this happens, tlb_flush_pending might be cleared
    while one of the threads still changes PTEs and batches TLB flushes.

    This can lead to the same race between migration and
    change_protection_range() that led to the introduction of
    tlb_flush_pending. The result of this race was data corruption, which
    means that this patch also addresses a theoretically possible data
    corruption.

    An actual data corruption was not observed, yet the race was confirmed
    by adding an assertion to check that tlb_flush_pending is not set by
    two threads, adding artificial latency in change_protection_range(),
    and using sysctl to reduce kernel.numa_balancing_scan_delay_ms.

    Link: http://lkml.kernel.org/r/20170802000818.4760-2-namit@vmware.com
    Fixes: 20841405940e ("mm: fix TLB flush race between migration, and change_protection_range")
    Signed-off-by: Nadav Amit
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: Minchan Kim
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: "David S. Miller"
    Cc: Andrea Arcangeli
    Cc: Heiko Carstens
    Cc: Ingo Molnar
    Cc: Jeff Dike
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Russell King
    Cc: Sergey Senozhatsky
    Cc: Tony Luck
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nadav Amit
     

13 Dec, 2016

1 commit

    __dump_page() is used when a page metadata inconsistency is detected,
    either by standard runtime checks or by extra checks in CONFIG_DEBUG_VM
    builds. It prints some of the relevant metadata, but not the whole
    struct page, which is based on unions whose interpretation depends on
    the context.

    This means that sometimes e.g. a VM_BUG_ON_PAGE() checks a certain
    field which is not printed by __dump_page(), and the resulting bug
    report may then lack clues that could help in determining the root
    cause. This patch solves the problem by simply printing the whole
    struct page word by word, so no part is missing, but the interpretation
    of the data is left to developers. This is similar to e.g. x86_64 raw
    stack dumps.

    Example output:

    page:ffffea00000475c0 count:1 mapcount:0 mapping: (null) index:0x0
    flags: 0x100000000000400(reserved)
    raw: 0100000000000400 0000000000000000 0000000000000000 00000001ffffffff
    raw: ffffea00000475e0 ffffea00000475e0 0000000000000000 0000000000000000
    page dumped because: VM_BUG_ON_PAGE(1)

    [aryabinin@virtuozzo.com: suggested print_hex_dump()]
    Link: http://lkml.kernel.org/r/2ff83214-70fe-741e-bf05-fe4a4073ec3e@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Kirill A. Shutemov
    Acked-by: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

08 Oct, 2016

1 commit