26 Feb, 2013

1 commit

  • Pull module update from Rusty Russell:
    "The sweeping change is to make add_taint() explicitly indicate whether
    to disable lockdep, but it's a mechanical change."

    * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
    MODSIGN: Add option to not sign modules during modules_install
    MODSIGN: Add -s option to sign-file
    MODSIGN: Specify the hash algorithm on sign-file command line
    MODSIGN: Simplify Makefile with a Kconfig helper
    module: clean up load_module a little more.
    modpost: Ignore ARC specific non-alloc sections
    module: constify within_module_*
    taint: add explicit flag to show whether lock dep is still OK.
    module: printk message when module signature fail taints kernel.

    Linus Torvalds
     

24 Feb, 2013

8 commits

  • I dislike the way in which "swapcache" gets used in do_swap_page():
    there is always a page from swapcache there (even if maybe uncached by
    the time we lock it), but tests are made according to "swapcache".
    Rework that with "page != swapcache", as has been done in unuse_pte().

    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Petr Holasek
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
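
    A minimal stand-in sketch of the idiom described above (editor's sketch;
    simplified types, not the kernel code): decisions are made by comparing
    the page being mapped against the page found in swap cache, rather than
    testing a separate flag.

    #include <stddef.h>

    struct page { int id; };

    /* toy model of the tail of do_swap_page(): "page" is what we mapped,
     * "swapcache" is what the swap-cache lookup returned */
    static void finish_swap_fault(struct page *page, struct page *swapcache)
    {
            if (page != swapcache) {
                    /* a private copy was mapped: unlock and release the
                     * swap-cache page separately from "page" */
            } else {
                    /* the swap-cache page itself was mapped */
            }
    }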
     
  • In "ksm: remove old stable nodes more thoroughly" I said that I'd never
    seen its WARN_ON_ONCE(page_mapped(page)). True at the time of writing,
    but it soon appeared once I tried fuller tests on the whole series.

    It turned out to be due to the KSM page migration itself: unmerge_and_
    remove_all_rmap_items() failed to locate and replace all the KSM pages,
    because of that hiatus in page migration when old pte has been replaced
    by migration entry, but not yet by new pte. follow_page() finds no page
    at that instant, but a KSM page reappears shortly after, without a
    fault.

    Add FOLL_MIGRATION flag, so follow_page() can do migration_entry_wait()
    for KSM's break_cow(). I'd have preferred to avoid another flag, and do
    it every time, in case someone else makes the same easy mistake; but did
    not find another transgressor (the common get_user_pages() is of course
    safe), and cannot be sure that every follow_page() caller is prepared to
    sleep - ia64's xencomm_vtop()? Now, THP's wait_split_huge_page() can
    already sleep there, since anon_vma locking was changed to mutex, but
    maybe that's somehow excluded.

    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Petr Holasek
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
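
    The new behaviour can be sketched with stand-ins (editor's sketch; the
    real follow_page() walks page tables and recognises migration swap
    entries, and FOLL_MIGRATION's value here is made up):

    #include <stdbool.h>
    #include <stddef.h>

    #define FOLL_MIGRATION 0x400            /* illustrative flag value */

    struct pte_view {
            bool present;
            bool is_migration_entry;        /* old pte gone, new one not yet in */
            void *page;
    };

    static void wait_for_migration(struct pte_view *pte)
    {
            /* stands in for migration_entry_wait(), which sleeps until the
             * new pte has been installed */
    }

    static void *follow_page_sketch(struct pte_view *pte, unsigned int flags)
    {
            if (pte->present)
                    return pte->page;
            if (pte->is_migration_entry && (flags & FOLL_MIGRATION)) {
                    wait_for_migration(pte);
                    return NULL;            /* caller (e.g. break_cow) retries */
            }
            return NULL;                    /* genuinely nothing mapped here */
    }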
     
  • This change adds a follow_page_mask function which is equivalent to
    follow_page, but with an extra page_mask argument.

    follow_page_mask sets *page_mask to HPAGE_PMD_NR - 1 when it encounters
    a THP page, and to 0 in other cases.

    __get_user_pages() makes use of this in order to accelerate populating
    THP ranges - that is, when both the pages and vmas arrays are NULL, we
    don't need to iterate HPAGE_PMD_NR times to cover a single THP page (and
    we also avoid taking mm->page_table_lock that many times).

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
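
    The fast-forward arithmetic can be demonstrated standalone (editor's
    sketch; PAGE_SHIFT and HPAGE_PMD_NR are the usual x86-64 values, and the
    address is arbitrary):

    #include <stdio.h>

    #define PAGE_SHIFT   12
    #define PAGE_SIZE    (1UL << PAGE_SHIFT)
    #define HPAGE_PMD_NR 512                 /* 2MB THP / 4k pages */

    int main(void)
    {
            unsigned long addr = 0x7f0000003000UL;       /* inside a THP */
            unsigned long page_mask = HPAGE_PMD_NR - 1;  /* from follow_page_mask */
            /* when pages[] and vmas[] are NULL, one lookup covers the rest
             * of the huge page instead of HPAGE_PMD_NR separate iterations */
            unsigned long increm = 1 + (~(addr >> PAGE_SHIFT) & page_mask);

            printf("advance by %lu small pages (%lu bytes)\n",
                   increm, increm * PAGE_SIZE);
            return 0;
    }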
     
  • Use long type for page counts in mm_populate() so as to avoid integer
    overflow when running the following test code:

    int main(void) {
            void *p = mmap(NULL, 0x100000000000, PROT_READ,
                           MAP_PRIVATE | MAP_ANON, -1, 0);
            printf("p: %p\n", p);
            mlockall(MCL_CURRENT);
            printf("done\n");
            return 0;
    }

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Switching merge_across_nodes after running KSM is liable to oops on stale
    nodes still left over from the previous stable tree. It's not something
    that people will often want to do, but it would be lame to demand a reboot
    when they're trying to determine which merge_across_nodes setting is best.

    How can this happen? We only permit switching merge_across_nodes when
    pages_shared is 0, and usually set run 2 to force that beforehand, which
    ought to unmerge everything: yet oopses still occur when you then run 1.

    Three causes:

    1. The old stable tree (built according to the inverse
    merge_across_nodes) has not been fully torn down. A stable node
    lingers until get_ksm_page() notices that the page it references no
    longer references it: but the page is not necessarily freed as soon as
    expected, particularly when swapcache.

    Fix this with a pass through the old stable tree, applying
    get_ksm_page() to each of the remaining nodes (most found stale and
    removed immediately), with forced removal of any left over. Unless the
    page is still mapped: I've not seen that case, it shouldn't occur, but
    better to WARN_ON_ONCE and EBUSY than BUG.

    2. __ksm_enter() has a nice little optimization, to insert the new mm
    just behind ksmd's cursor, so there's a full pass for it to stabilize
    (or be removed) before ksmd addresses it. Nice when ksmd is running,
    but not so nice when we're trying to unmerge all mms: we were missing
    those mms forked and inserted behind the unmerge cursor. Easily fixed
    by inserting at the end when KSM_RUN_UNMERGE.

    3. It is possible for a KSM page to be faulted back from swapcache
    into an mm, just after unmerge_and_remove_all_rmap_items() scanned past
    it. Fix this by copying on fault when KSM_RUN_UNMERGE: but that is
    private to ksm.c, so dissolve the distinction between
    ksm_might_need_to_copy() and ksm_does_need_to_copy(), doing it all in
    the one call into ksm.c.

    A long outstanding, unrelated bugfix sneaks in with that third fix:
    ksm_does_need_to_copy() would copy from a !PageUptodate page (implying I/O
    error when read in from swap) to a page which it then marks Uptodate. Fix
    this case by not copying, letting do_swap_page() discover the error.

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: Petr Holasek
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Cc: Gerald Schaefer
    Cc: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
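
    Cause 2's fix can be pictured with a toy list (editor's sketch; ksmd's
    real mm_slot list, cursor handling and locking differ):

    #include <stdbool.h>
    #include <stddef.h>

    struct mm_slot { struct mm_slot *next; };

    /* Normally a new mm goes in next to ksmd's scan cursor so it gets a
     * full pass to stabilise before ksmd acts on it.  While unmerging,
     * it must go at the tail so the single head-to-tail unmerge walk
     * cannot miss mms forked behind the cursor. */
    static void insert_mm_slot(struct mm_slot *head, struct mm_slot *cursor,
                               struct mm_slot *slot, bool unmerging)
    {
            if (!unmerging) {
                    slot->next = cursor->next;
                    cursor->next = slot;
                    return;
            }
            while (head->next)
                    head = head->next;      /* find the tail */
            head->next = slot;
            slot->next = NULL;
    }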
     
  • page->_last_nid fits into page->flags on 64-bit. The unlikely 32-bit
    NUMA configuration with NUMA Balancing will still need an extra page
    field. As Peter notes "Completely dropping 32bit support for
    CONFIG_NUMA_BALANCING would simplify things, but it would also remove
    the warning if we grow enough 64bit only page-flags to push the last-cpu
    out."

    [mgorman@suse.de: minor modifications]
    Signed-off-by: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Ingo Molnar
    Cc: Simon Jeons
    Cc: Wanpeng Li
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
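
    The space trade-off can be sketched as follows (editor's sketch; the
    field name, bit position and mask are illustrative, not the kernel's
    actual layout):

    #define CONFIG_64BIT 1          /* drop this to model the 32-bit case */

    struct toy_page {
            unsigned long flags;    /* zone, node, section, ... */
    #if defined(CONFIG_NUMA_BALANCING) && !defined(CONFIG_64BIT)
            int _last_nid;          /* 32-bit NUMA balancing pays 4 extra
                                       bytes per page frame */
    #endif
    };

    #ifdef CONFIG_64BIT
    /* on 64-bit, last_nid is packed into spare high bits of ->flags */
    #define LAST_NID_SHIFT 54       /* illustrative position */
    #define LAST_NID_MASK  0xffUL
    static inline int toy_page_last_nid(const struct toy_page *page)
    {
            return (int)((page->flags >> LAST_NID_SHIFT) & LAST_NID_MASK);
    }
    #endif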
     
  • In find_extend_vma(), we don't need mlock_vma_pages_range() to verify
    the vma type - we know we're working with a stack. So, we can call
    directly into __mlock_vma_pages_range(), and remove the last
    make_pages_present() call site.

    Note that we don't use mm_populate() here, so we can't release the
    mmap_sem while allocating new stack pages. This is deemed acceptable,
    because the stack vmas grow by a bounded number of pages at a time, and
    these are anon pages so we don't have to read from disk to populate
    them.

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Tested-by: Andy Lutomirski
    Cc: Greg Ungerer
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • When ex-KSM pages are faulted from swap cache, the fault handler is not
    capable of re-establishing anon_vma-spanning KSM pages. In this case, a
    copy of the page is created instead, just like during a COW break.

    These freshly made copies are known to be exclusive to the faulting VMA
    and there is no reason to go look for this page in parent and sibling
    processes during rmap operations.

    Use page_add_new_anon_rmap() for these copies. This also puts them on
    the proper LRU lists and marks them SwapBacked, so we can get rid of
    doing this ad-hoc in the KSM copy code.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Acked-by: Hugh Dickins
    Cc: Simon Jeons
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Satoru Moriya
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

10 Jan, 2013

1 commit

  • The check for a pmd being in the process of being split was dropped by
    mistake by commit d10e63f29488 ("mm: numa: Create basic numa page
    hinting infrastructure"). Put it back.

    Reported-by: Dave Jones
    Debugged-by: Hillf Danton
    Acked-by: Andrea Arcangeli
    Acked-by: Mel Gorman
    Cc: Kirill Shutemov
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

05 Jan, 2013

1 commit

  • Since commit e303297e6c3a ("mm: extended batches for generic
    mmu_gather") we are batching pages to be freed until either
    tlb_next_batch cannot allocate a new batch or we are done.

    This works just fine most of the time, but we can get into trouble with
    a non-preemptible kernel (CONFIG_PREEMPT_NONE or CONFIG_PREEMPT_VOLUNTARY)
    on large machines, where too aggressive batching might lead to soft
    lockups during the process exit path (exit_mmap) because there are no
    scheduling points down the free_pages_and_swap_cache path and so the
    freeing can take long enough to trigger the soft lockup.

    The lockup is harmless except when the system is set up to panic on
    soft lockup, which is not that unusual.

    The simplest way to work around this issue is to limit the maximum
    number of batches in a single mmu_gather. 10k of collected pages should
    be safe to prevent soft lockups (we would have 2ms for each one) even if
    they are all freed without an explicit scheduling point.

    This patch doesn't add any new explicit scheduling points because it
    relies on zap_pmd_range, which already calls cond_resched once per PMD
    while zapping page tables.

    The following lockup has been reported for a 3.0 kernel with a huge
    process (on the order of hundreds of GB, but I don't know any more details).

    BUG: soft lockup - CPU#56 stuck for 22s! [kernel:31053]
    Modules linked in: af_packet nfs lockd fscache auth_rpcgss nfs_acl sunrpc mptctl mptbase autofs4 binfmt_misc dm_round_robin dm_multipath bonding cpufreq_conservative cpufreq_userspace cpufreq_powersave pcc_cpufreq mperf microcode fuse loop osst sg sd_mod crc_t10dif st qla2xxx scsi_transport_fc scsi_tgt netxen_nic i7core_edac iTCO_wdt joydev e1000e serio_raw pcspkr edac_core iTCO_vendor_support acpi_power_meter rtc_cmos hpwdt hpilo button container usbhid hid dm_mirror dm_region_hash dm_log linear uhci_hcd ehci_hcd usbcore usb_common scsi_dh_emc scsi_dh_alua scsi_dh_hp_sw scsi_dh_rdac scsi_dh dm_snapshot pcnet32 mii edd dm_mod raid1 ext3 mbcache jbd fan thermal processor thermal_sys hwmon cciss scsi_mod
    Supported: Yes
    CPU 56
    Pid: 31053, comm: kernel Not tainted 3.0.31-0.9-default #1 HP ProLiant DL580 G7
    RIP: 0010: _raw_spin_unlock_irqrestore+0x8/0x10
    RSP: 0018:ffff883ec1037af0 EFLAGS: 00000206
    RAX: 0000000000000e00 RBX: ffffea01a0817e28 RCX: ffff88803ffd9e80
    RDX: 0000000000000200 RSI: 0000000000000206 RDI: 0000000000000206
    RBP: 0000000000000002 R08: 0000000000000001 R09: ffff887ec724a400
    R10: 0000000000000000 R11: dead000000200200 R12: ffffffff8144c26e
    R13: 0000000000000030 R14: 0000000000000297 R15: 000000000000000e
    FS: 00007ed834282700(0000) GS:ffff88c03f200000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 000000000068b240 CR3: 0000003ec13c5000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process kernel (pid: 31053, threadinfo ffff883ec1036000, task ffff883ebd5d4100)
    Call Trace:
    release_pages+0xc5/0x260
    free_pages_and_swap_cache+0x9d/0xc0
    tlb_flush_mmu+0x5c/0x80
    tlb_finish_mmu+0xe/0x50
    exit_mmap+0xbd/0x120
    mmput+0x49/0x120
    exit_mm+0x122/0x160
    do_exit+0x17a/0x430
    do_group_exit+0x3d/0xb0
    get_signal_to_deliver+0x247/0x480
    do_signal+0x71/0x1b0
    do_notify_resume+0x98/0xb0
    int_signal+0x12/0x17
    DWARF2 unwinder stuck at int_signal+0x12/0x17

    Signed-off-by: Michal Hocko
    Cc: [3.0+]
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
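
    The shape of the workaround can be sketched like this (editor's sketch;
    structure and constant names are made up, and "10000" simply mirrors the
    "10k of collected pages" figure above):

    #include <stdbool.h>
    #include <stdlib.h>

    #define BATCH_PAGES     512
    #define MAX_BATCH_COUNT (10000 / BATCH_PAGES)   /* cap pages per gather */

    struct batch {
            struct batch *next;
            unsigned int  nr;
            void         *pages[BATCH_PAGES];
    };

    struct gather {
            struct batch *active;
            unsigned int  batch_count;
    };

    /* Returning false once the cap is reached forces the caller to flush
     * (free what has been collected) instead of growing the gather without
     * bound and freeing everything in one long, unpreemptible burst. */
    static bool next_batch(struct gather *tlb)
    {
            struct batch *b;

            if (tlb->batch_count == MAX_BATCH_COUNT)
                    return false;

            b = calloc(1, sizeof(*b));
            if (!b)
                    return false;

            tlb->batch_count++;
            tlb->active->next = b;
            tlb->active = b;
            return true;
    }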
     

17 Dec, 2012

1 commit

  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for the workloads driven by perf, which is bad
    but is potentially explained by the lack of scheduler smarts. Thomas'
    results show balancenuma improves on mainline but falls far short of
    numacore or autonuma. Srikar's results indicate we all suffer on a
    large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers, which is not reported by
    the tool by default and sometimes missed in reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions, but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore batches
    PTE handling, but I no longer think this is the case. It's possible
    numacore is just able to trigger it due to higher rates of migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

13 Dec, 2012

4 commits

    page_mkwrite is initialized to zero and only set once; from that point
    on there is no way to reach the oom or oom_free_new labels.

    [akpm@linux-foundation.org: cleanup]
    Signed-off-by: Dominik Dingel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dominik Dingel
     
    We have two different implementations of the is_zero_pfn() and
    my_zero_pfn() helpers: for architectures with and without zero page
    coloring.

    Let's consolidate them in <asm-generic/pgtable.h>.

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Pass vma instead of mm and add address parameter.

    In most cases we already have the vma on the stack. We provide
    split_huge_page_pmd_mm() for the few cases when we have the mm but not
    the vma.

    This change is preparation to huge zero pmd splitting implementation.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: "H. Peter Anvin"
    Cc: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    On write access to the huge zero page we allocate a new huge page and
    clear it.

    If that fails (ENOMEM), we fall back gracefully: we create a new pmd
    table, point the pte at the fault address to a newly allocated normal
    (4k) page, and set all other ptes in the pmd to the normal zero page.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: "H. Peter Anvin"
    Cc: Mel Gorman
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
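
    The fallback can be pictured with a toy pte table (editor's sketch; the
    helpers and table are stand-ins for the real pmd/pte machinery):

    #include <stddef.h>
    #include <stdlib.h>

    #define PTRS_PER_PMD 512                /* 2MB huge page / 4k pages */

    static void *small_zero_page;           /* stand-in for the shared zero page */

    /* ENOMEM path: split the huge zero mapping into small entries, giving
     * the faulting slot a fresh writable 4k page and pointing every other
     * slot back at the normal zero page. */
    static void huge_zero_cow_fallback(void *pte_table[PTRS_PER_PMD],
                                       size_t fault_index)
    {
            void *new_page = calloc(1, 4096);

            if (!new_page)
                    return;                 /* real code would fail the fault */
            for (size_t i = 0; i < PTRS_PER_PMD; i++)
                    pte_table[i] = (i == fault_index) ? new_page
                                                      : small_zero_page;
    }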
     

12 Dec, 2012

1 commit

  • On x86 memory accesses to pages without the ACCESSED flag set result in
    the ACCESSED flag being set automatically. With the ARM architecture a
    page access fault is raised instead (and it will continue to be raised
    until the ACCESSED flag is set for the appropriate PTE/PMD).

    For normal memory pages, handle_pte_fault will call pte_mkyoung
    (effectively setting the ACCESSED flag). For transparent huge pages,
    pmd_mkyoung will only be called for a write fault.

    This patch ensures that faults on transparent hugepages which do not
    result in a CoW update the access flags for the faulting pmd.

    Signed-off-by: Will Deacon
    Cc: Chris Metcalf
    Acked-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Cc: Ni zhan Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Will Deacon
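
    The intended behaviour can be sketched with stand-ins (editor's sketch;
    the real fix updates the faulting pmd via the THP fault path):

    #include <stdbool.h>

    struct huge_pmd_view { bool accessed; bool dirty; };

    /* Mark the entry accessed on any access fault, not only on writes, so
     * architectures that keep faulting until the ACCESSED bit is set (such
     * as ARM) make progress on read faults against transparent hugepages. */
    static void huge_pmd_access_fault(struct huge_pmd_view *pmd, bool write)
    {
            pmd->accessed = true;           /* previously only set for writes */
            if (write)
                    pmd->dirty = true;
            /* the real code would now write back and flush the entry */
    }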
     

11 Dec, 2012

8 commits

  • The PTE scanning rate and fault rates are two of the biggest sources of
    system CPU overhead with automatic NUMA placement. Ideally a proper policy
    would detect if a workload was properly placed, schedule and adjust the
    PTE scanning rate accordingly. We do not track the necessary information
    to do that but we at least know if we migrated or not.

    This patch scans slower if a page was not migrated as the result of a
    NUMA hinting fault up to sysctl_numa_balancing_scan_period_max which is
    now higher than the previous default. Once every minute it will reset
    the scanner in case of phase changes.

    This is hilariously crude and the numbers are arbitrary. Workloads will
    converge quite slowly in comparison to what a proper policy should be able
    to do. On the plus side, we will chew up less CPU for workloads that have
    no need for automatic balancing.

    Signed-off-by: Mel Gorman

    Mel Gorman
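
    One plausible shape of the adjustment (editor's sketch; the exact update
    rule and constants are assumptions, not taken from the patch):

    #include <stdio.h>

    #define SCAN_PERIOD_MIN   100U          /* ms, illustrative */
    #define SCAN_PERIOD_MAX 60000U          /* ms, illustrative */

    static unsigned int scan_period = SCAN_PERIOD_MIN;

    static void numa_hinting_fault(int migrated)
    {
            if (!migrated) {
                    /* placement already fine (or migration failed): back off */
                    scan_period *= 2;
                    if (scan_period > SCAN_PERIOD_MAX)
                            scan_period = SCAN_PERIOD_MAX;
            } else {
                    /* still moving pages: keep scanning aggressively */
                    scan_period = SCAN_PERIOD_MIN;
            }
    }

    int main(void)
    {
            numa_hinting_fault(0);
            numa_hinting_fault(0);
            printf("scan period backed off to %u ms\n", scan_period);
            return 0;
    }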
     
  • To say that the PMD handling code was incorrectly transferred from autonuma
    is an understatement. The intention was to handle a PMDs worth of pages
    in the same fault and effectively batch the taking of the PTL and page
    migration. The copied version instead has the impact of clearing a number
    of pte_numa PTE entries and whether any page migration takes place depends
    on racing. This just happens to work in some cases.

    This patch handles pte_numa faults in batch when a pmd_numa fault is
    handled. The pages are migrated if they are currently misplaced.
    Essentially this is making the assumption that NUMA locality is
    on a PMD boundary, but that could be addressed, if necessary, by only
    setting pmd_numa when all the pages within that PMD are on the same
    node.

    Signed-off-by: Mel Gorman

    Mel Gorman
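
    The batching idea can be sketched with stand-ins (editor's sketch; the
    real code walks the page table under the page table lock and calls the
    migration code for each misplaced page):

    #include <stdbool.h>
    #include <stddef.h>

    #define PTRS_PER_PMD 512

    struct toy_pte { bool numa_marked; int page_node; };

    static void migrate_if_misplaced(struct toy_pte *pte, int target_node)
    {
            if (pte->page_node != target_node)
                    pte->page_node = target_node;   /* stands in for migration */
    }

    /* On a pmd_numa fault, handle every marked pte under that pmd in one
     * pass instead of taking a separate fault (and lock round-trip) per pte. */
    static void handle_pmd_numa_fault(struct toy_pte ptes[PTRS_PER_PMD],
                                      int this_node)
    {
            for (size_t i = 0; i < PTRS_PER_PMD; i++) {
                    if (!ptes[i].numa_marked)
                            continue;
                    ptes[i].numa_marked = false;    /* restore normal protection */
                    migrate_if_misplaced(&ptes[i], this_node);
            }
    }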
     
  • It is tricky to quantify the basic cost of automatic NUMA placement in a
    meaningful manner. This patch adds some vmstats that can be used as part
    of a basic costing model.

    u = basic unit = sizeof(void *)
    Ca = cost of struct page access = sizeof(struct page) / u
    Cpte = Cost PTE access = Ca
    Cupdate = Cost PTE update = (2 * Cpte) + (2 * Wlock)
    where Cpte is incurred twice for a read and a write and Wlock
    is a constant representing the cost of taking or releasing a
    lock
    Cnumahint = Cost of a minor page fault = some high constant e.g. 1000
    Cpagerw = Cost to read or write a full page = Ca + PAGE_SIZE/u
    Ci = Cost of page isolation = Ca + Wi
    where Wi is a constant that should reflect the approximate cost
    of the locking operation
    Cpagecopy = Cpagerw + (Cpagerw * Wnuma) + Ci + (Ci * Wnuma)
    where Wnuma is the approximate NUMA factor. 1 is local. 1.2
    would imply that remote accesses are 20% more expensive

    Balancing cost = Cpte * numa_pte_updates +
    Cnumahint * numa_hint_faults +
    Ci * numa_pages_migrated +
    Cpagecopy * numa_pages_migrated

    Note that numa_pages_migrated is used as a measure of how many pages
    were isolated even though it would miss pages that failed to migrate. A
    vmstat counter could have been added for it but the isolation cost is
    pretty marginal in comparison to the overall cost so it seemed overkill.

    The ideal way to measure automatic placement benefit would be to count
    the number of remote accesses versus local accesses and do something like

    benefit = (remote_accesses_before - remote_accesses_after) * Wnuma

    but the information is not readily available. As a workload converges, the
    expectation would be that the number of remote numa hints would reduce to 0.

    convergence = numa_hint_faults_local / numa_hint_faults
    where this is measured for the last N number of
    numa hints recorded. When the workload is fully
    converged the value is 1.

    This can measure if the placement policy is converging and how fast it is
    doing it.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel

    Mel Gorman
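
    As a concrete reading of the model, here is a worked example with assumed
    weights and counter values (every number below is illustrative, chosen
    only to show how the formula combines the vmstat counters):

    #include <stdio.h>

    int main(void)
    {
            const double u    = sizeof(void *);       /* basic unit */
            const double Ca   = 64.0 / u;             /* assume 64-byte struct page */
            const double Cpte = Ca;
            const double Cnumahint = 1000.0;
            const double Cpagerw   = Ca + 4096.0 / u; /* 4k pages */
            const double Wi    = 10.0;                /* assumed isolation weight */
            const double Ci    = Ca + Wi;
            const double Wnuma = 1.2;                 /* remote access 20% dearer */
            const double Cpagecopy = Cpagerw + Cpagerw * Wnuma + Ci + Ci * Wnuma;

            /* assumed vmstat deltas over some interval */
            const double numa_pte_updates    = 1e6;
            const double numa_hint_faults    = 2e5;
            const double numa_pages_migrated = 5e4;

            const double cost = Cpte * numa_pte_updates +
                                Cnumahint * numa_hint_faults +
                                Ci * numa_pages_migrated +
                                Cpagecopy * numa_pages_migrated;

            printf("approximate balancing cost: %.0f units of u\n", cost);
            return 0;
    }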
     
  • NOTE: This patch is based on "sched, numa, mm: Add fault driven
    placement and migration policy" but as it throws away all the policy
    to just leave a basic foundation I had to drop the signed-offs-by.

    This patch creates a bare-bones method for setting PTEs pte_numa in the
    context of the scheduler that when faulted later will be faulted onto the
    node the CPU is running on. In itself this does nothing useful but any
    placement policy will fundamentally depend on receiving hints on placement
    from fault context and doing something intelligent about it.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel

    Peter Zijlstra
     
  • Note: Based on "mm/mpol: Use special PROT_NONE to migrate pages" but
    sufficiently different that the signed-off-bys were dropped

    Combine our previous _PAGE_NUMA, mpol_misplaced and migrate_misplaced_page()
    pieces into an effective migrate on fault scheme.

    Note that (on x86) we rely on PROT_NONE pages being !present and avoid
    the TLB flush from try_to_unmap(TTU_MIGRATION). This greatly improves the
    page-migration performance.

    Based-on-work-by: Peter Zijlstra
    Signed-off-by: Mel Gorman

    Mel Gorman
     
  • Note: This patch started as "mm/mpol: Create special PROT_NONE
    infrastructure" and preserves the basic idea but steals *very*
    heavily from "autonuma: numa hinting page faults entry points" for
    the actual fault handlers without the migration parts. The end
    result is barely recognisable as either patch so all Signed-off
    and Reviewed-bys are dropped. If Peter, Ingo and Andrea are ok with
    this version, I will re-add the signed-offs-by to reflect the history.

    In order to facilitate a lazy -- fault driven -- migration of pages, create
    a special transient PAGE_NUMA variant; we can then use the 'spurious'
    protection faults it generates to drive our migrations.

    The meaning of PAGE_NUMA depends on the architecture but on x86 it is
    effectively PROT_NONE. Actual PROT_NONE mappings will not generate these
    NUMA faults for the reason that the page fault code checks the permission on
    the VMA (and will throw a segmentation fault on actual PROT_NONE mappings),
    before it ever calls handle_mm_fault.

    [dhillf@gmail.com: Fix typo]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel

    Mel Gorman
     
  • Introduce FOLL_NUMA to tell follow_page to check
    pte/pmd_numa. get_user_pages must use FOLL_NUMA, and it's safe to do
    so because it always invokes handle_mm_fault and retries the
    follow_page later.

    KVM secondary MMU page faults will trigger the NUMA hinting page
    faults through gup_fast -> get_user_pages -> follow_page ->
    handle_mm_fault.

    Other follow_page callers like KSM should not use FOLL_NUMA, or they
    would fail to get the pages if they use follow_page instead of
    get_user_pages.

    [ This patch was picked up from the AutoNUMA tree. ]

    Originally-by: Andrea Arcangeli
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    [ ported to this tree. ]
    Signed-off-by: Ingo Molnar
    Reviewed-by: Rik van Riel

    Andrea Arcangeli
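
    The division of labour can be sketched with stand-ins (editor's sketch;
    FOLL_NUMA's value is made up and the types are simplified):

    #include <stdbool.h>
    #include <stddef.h>

    #define FOLL_NUMA 0x200                 /* illustrative flag value */

    struct numa_pte_view { bool numa_marked; void *page; };

    /* With FOLL_NUMA set, a pte_numa entry is reported as "no page", which
     * pushes get_user_pages() into handle_mm_fault() where the NUMA hinting
     * fault is taken.  Callers such as KSM leave the flag clear and keep
     * seeing the page directly. */
    static void *follow_page_numa_sketch(struct numa_pte_view *pte,
                                         unsigned int flags)
    {
            if ((flags & FOLL_NUMA) && pte->numa_marked)
                    return NULL;
            return pte->page;
    }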
     
  • With transparent hugepage support, handle_mm_fault() has to be careful
    that a normal PMD has been established before handling a PTE fault. To
    achieve this, it used __pte_alloc() directly instead of pte_alloc_map
    as pte_alloc_map is unsafe to run against a huge PMD. pte_offset_map()
    is called once it is known the PMD is safe.

    pte_alloc_map() is smart enough to check if a PTE is already present
    before calling __pte_alloc, but this check was lost. As a consequence,
    PTEs may be allocated unnecessarily and the page table lock taken.
    This useless PTE does get cleaned up, but it's a performance hit which
    is visible in page_test from aim9.

    This patch simply re-adds the check normally done by pte_alloc_map to
    check if the PTE needs to be allocated before taking the page table
    lock. The effect is noticeable in page_test from aim9.

    AIM9
                    2.6.38-vanilla      2.6.38-checkptenone
    creat-clo       446.10 ( 0.00%)     424.47 (-5.10%)
    page_test        38.10 ( 0.00%)      42.04 ( 9.37%)
    brk_test         52.45 ( 0.00%)      51.57 (-1.71%)
    exec_test       382.00 ( 0.00%)     456.90 (16.39%)
    fork_test        60.11 ( 0.00%)      67.79 (11.34%)
    MMTests Statistics: duration
    Total Elapsed Time (seconds)        611.90      612.22

    (While this affects 2.6.38, it is a performance rather than a
    functional bug and normally outside the rules for -stable. While the big
    performance differences are in a microbenchmark, the difference in fork
    and exec performance may be significant enough that -stable wants to
    consider the patch.)

    Reported-by: Raz Ben Yehuda
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrea Arcangeli
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    [ Picked this up from the AutoNUMA tree to help
    it upstream and to allow apples-to-apples
    performance comparisons. ]
    Signed-off-by: Ingo Molnar

    Mel Gorman
     

17 Nov, 2012

1 commit

  • do_wp_page() sets mmun_called if mmun_start and mmun_end were
    initialized and, if so, may call mmu_notifier_invalidate_range_end()
    with these values. This doesn't prevent gcc from emitting a build
    warning though:

    mm/memory.c: In function `do_wp_page':
    mm/memory.c:2530: warning: `mmun_start' may be used uninitialized in this function
    mm/memory.c:2531: warning: `mmun_end' may be used uninitialized in this function

    It's much easier to initialize the variables to impossible values and
    use a simple comparison to determine whether they were initialized,
    removing the bool entirely.

    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

09 Oct, 2012

10 commits

  • When a transparent hugepage is mapped and it is included in an mlock()
    range, follow_page() incorrectly avoids setting the page's mlock bit and
    moving it to the unevictable lru.

    This is evident if you try to mlock(), munlock(), and then mlock() a
    range again. Currently:

    #define MAP_SIZE (4UL << 30) /* 4GB */

    void *ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
    mlock(ptr, MAP_SIZE);

    $ grep -E "Unevictable|Inactive\(anon" /proc/meminfo
    Inactive(anon): 6304 kB
    Unevictable: 4213924 kB

    munlock(ptr, MAP_SIZE);

    Inactive(anon): 4186252 kB
    Unevictable: 19652 kB

    mlock(ptr, MAP_SIZE);

    Inactive(anon): 4198556 kB
    Unevictable: 21684 kB

    Notice that less than 2MB was added to the unevictable list; this is
    because these pages in the range are not transparent hugepages since the
    4GB range was allocated with mmap() and has no specific alignment. If
    posix_memalign() were used instead, unevictable would not have grown at
    all on the second mlock().

    The fix is to call mlock_vma_page() so that the mlock bit is set and the
    page is added to the unevictable list. With this patch:

    mlock(ptr, MAP_SIZE);

    Inactive(anon): 4056 kB
    Unevictable: 4213940 kB

    munlock(ptr, MAP_SIZE);

    Inactive(anon): 4198268 kB
    Unevictable: 19636 kB

    mlock(ptr, MAP_SIZE);

    Inactive(anon): 4008 kB
    Unevictable: 4213940 kB

    Signed-off-by: David Rientjes
    Acked-by: Hugh Dickins
    Reviewed-by: Andrea Arcangeli
    Cc: Naoya Horiguchi
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Signed-off-by: Robert P. J. Day
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert P. J. Day
     
  • In order to allow sleeping during invalidate_page mmu notifier calls, we
    need to avoid calling when holding the PT lock. In addition to its direct
    calls, invalidate_page can also be called as a substitute for a change_pte
    call, in case the notifier client hasn't implemented change_pte.

    This patch drops the invalidate_page call from change_pte, and instead
    wraps all calls to change_pte with invalidate_range_start and
    invalidate_range_end calls.

    Note that change_pte still cannot sleep after this patch, and that clients
    implementing change_pte should not take action on it in case the number of
    outstanding invalidate_range_start calls is larger than one, otherwise
    they might miss a later invalidation.

    Signed-off-by: Haggai Eran
    Cc: Andrea Arcangeli
    Cc: Sagi Grimberg
    Cc: Peter Zijlstra
    Cc: Xiao Guangrong
    Cc: Or Gerlitz
    Cc: Haggai Eran
    Cc: Shachar Raindel
    Cc: Liran Liss
    Cc: Christoph Lameter
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Haggai Eran
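
    The new calling pattern can be sketched with stand-in hooks (editor's
    sketch; the real code uses the mmu_notifier_invalidate_range_start/end
    API around the pte update):

    struct mm;                              /* opaque stand-in */

    static void range_start(struct mm *mm, unsigned long s, unsigned long e) {}
    static void range_end(struct mm *mm, unsigned long s, unsigned long e)   {}
    static void update_pte(struct mm *mm, unsigned long addr, void *newpage) {}

    /* Instead of firing an invalidate_page callback from inside the pte
     * update (with the page table lock held), bracket the update with range
     * notifications; clients without a change_pte handler now rely on the
     * surrounding start/end pair and must not miss it. */
    static void replace_page_sketch(struct mm *mm, unsigned long addr,
                                    void *newpage)
    {
            unsigned long start = addr, end = addr + 4096;

            range_start(mm, start, end);    /* may sleep: no spinlocks held */
            update_pte(mm, addr, newpage);
            range_end(mm, start, end);
    }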
     
  • In order to allow sleeping during mmu notifier calls, we need to avoid
    invoking them under the page table spinlock. This patch solves the
    problem by calling invalidate_page notification after releasing the lock
    (but before freeing the page itself), or by wrapping the page invalidation
    with calls to invalidate_range_begin and invalidate_range_end.

    To prevent accidental changes to the invalidate_range_end arguments after
    the call to invalidate_range_begin, the patch introduces a convention of
    saving the arguments in consistently named locals:

    unsigned long mmun_start; /* For mmu_notifiers */
    unsigned long mmun_end; /* For mmu_notifiers */

    ...

    mmun_start = ...
    mmun_end = ...
    mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);

    ...

    mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);

    The patch changes code to use this convention for all calls to
    mmu_notifier_invalidate_range_start/end, except those where the calls are
    close enough so that anyone who glances at the code can see the values
    aren't changing.

    This patchset is a preliminary step towards on-demand paging design to be
    added to the RDMA stack.

    Why do we want on-demand paging for Infiniband?

    Applications register memory with an RDMA adapter using system calls,
    and subsequently post IO operations that refer to the corresponding
    virtual addresses directly to HW. Until now, this was achieved by
    pinning the memory during the registration calls. The goal of on demand
    paging is to avoid pinning the pages of registered memory regions (MRs).
    This will allow users the same flexibility they get when swapping any
    other part of their processes address spaces. Instead of requiring the
    entire MR to fit in physical memory, we can allow the MR to be larger,
    and only fit the current working set in physical memory.

    Why should anyone care? What problems are users currently experiencing?

    This can make programming with RDMA much simpler. Today, developers
    that are working with more data than their RAM can hold need either to
    deregister and reregister memory regions throughout their process's
    life, or keep a single memory region and copy the data to it. On demand
    paging will allow these developers to register a single MR at the
    beginning of their process's life, and let the operating system manage
    which pages needs to be fetched at a given time. In the future, we
    might be able to provide a single memory access key for each process
    that would provide the entire process's address as one large memory
    region, and the developers wouldn't need to register memory regions at
    all.

    Is there any prospect that any other subsystems will utilise these
    infrastructural changes? If so, which and how, etc?

    As for other subsystems, I understand that XPMEM wanted to sleep in
    MMU notifiers, as Christoph Lameter wrote at
    http://lkml.indiana.edu/hypermail/linux/kernel/0802.1/0460.html and
    perhaps Andrea knows about other use cases.

    Scheduling in mmu notifications is required since we need to sync the
    hardware with the secondary page tables change. A TLB flush of an IO
    device is inherently slower than a CPU TLB flush, so our design works by
    sending the invalidation request to the device, and waiting for an
    interrupt before exiting the mmu notifier handler.

    Avi said:

    kvm may be a buyer. kvm::mmu_lock, which serializes guest page
    faults, also protects long operations such as destroying large ranges.
    It would be good to convert it into a spinlock, but as it is used inside
    mmu notifiers, this cannot be done.

    (there are alternatives, such as keeping the spinlock and using a
    generation counter to do the teardown in O(1), which is what the "may"
    is doing up there).

    [akpm@linux-foundation.org: possible speed tweak in hugetlb_cow(), cleanups]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Haggai Eran
    Cc: Peter Zijlstra
    Cc: Xiao Guangrong
    Cc: Or Gerlitz
    Cc: Haggai Eran
    Cc: Shachar Raindel
    Cc: Liran Liss
    Cc: Christoph Lameter
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sagi Grimberg
     
  • We had thought that pages could no longer get freed while still marked as
    mlocked; but Johannes Weiner posted this program to demonstrate that
    truncating an mlocked private file mapping containing COWed pages is still
    mishandled:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/mman.h>

    int main(void)
    {
            char *map;
            int fd;

            system("grep mlockfreed /proc/vmstat");
            fd = open("chigurh", O_CREAT|O_EXCL|O_RDWR);
            unlink("chigurh");
            ftruncate(fd, 4096);
            map = mmap(NULL, 4096, PROT_WRITE, MAP_PRIVATE, fd, 0);
            map[0] = 11;
            mlock(map, sizeof(fd));
            ftruncate(fd, 0);
            close(fd);
            munlock(map, sizeof(fd));
            munmap(map, 4096);
            system("grep mlockfreed /proc/vmstat");
            return 0;
    }

    The anon COWed pages are not caught by truncation's clear_page_mlock() of
    the pagecache pages; but unmap_mapping_range() unmaps them, so we ought to
    look out for them there in page_remove_rmap(). Indeed, why should
    truncation or invalidation be doing the clear_page_mlock() when removing
    from pagecache? mlock is a property of mapping in userspace, not a
    property of pagecache: an mlocked unmapped page is nonsensical.

    Reported-by: Johannes Weiner
    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Michel Lespinasse
    Cc: Ying Han
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Implement an interval tree as a replacement for the VMA prio_tree. The
    algorithms are similar to lib/interval_tree.c; however that code can't be
    directly reused as the interval endpoints are not explicitly stored in the
    VMA. So instead, the common algorithm is moved into a template and the
    details (node type, how to get interval endpoints from the node, etc) are
    filled in using the C preprocessor.

    Once the interval tree functions are available, using them as a
    replacement to the VMA prio tree is a relatively simple, mechanical job.

    Signed-off-by: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Hillf Danton
    Cc: Peter Zijlstra
    Cc: Catalin Marinas
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
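
    The "template via preprocessor" idea can be illustrated in miniature
    (editor's sketch; the macro and field names are made up and much simpler
    than the kernel's generic interval tree header):

    /* The generic code is written once against accessor macros; each user
     * instantiates it for a node type whose interval endpoints are computed
     * from other fields rather than stored directly. */
    #define DEFINE_INTERVAL_ACCESSORS(prefix, type, start_expr, last_expr)   \
            static inline unsigned long prefix##_start(const type *n)        \
            { return (start_expr); }                                         \
            static inline unsigned long prefix##_last(const type *n)         \
            { return (last_expr); }

    struct vma_like {
            unsigned long vm_pgoff;         /* file offset, in pages */
            unsigned long vm_npages;        /* mapping size, in pages */
    };

    /* endpoints derived from the node itself, as for a VMA in the file's
     * address space */
    DEFINE_INTERVAL_ACCESSORS(vma_interval, struct vma_like,
                              n->vm_pgoff,
                              n->vm_pgoff + n->vm_npages - 1)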
     
    A long time ago, in v2.4, VM_RESERVED kept the swapout process off a VMA;
    it has since lost its original meaning but still has some effects:

     | effect                 | alternative flags
    -+------------------------+---------------------------------------------
    1| account as reserved_vm | VM_IO
    2| skip in core dump      | VM_IO, VM_DONTDUMP
    3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
    4| do not mlock           | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP

    This patch removes the reserved_vm counter from mm_struct. Seems like
    nobody cares about it; it is not exported to userspace directly, it only
    reduces the total_vm shown in /proc.

    Thus VM_RESERVED can be replaced with VM_IO or the pair VM_DONTEXPAND | VM_DONTDUMP.

    remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
    remap_vmalloc_range() sets VM_DONTEXPAND | VM_DONTDUMP.

    [akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
    Merge VM_INSERTPAGE into VM_MIXEDMAP. A VM_MIXEDMAP VMA can mix pure-pfn
    ptes, special ptes and normal ptes.

    Now copy_page_range() always copies a VM_MIXEDMAP VMA on fork, like
    VM_PFNMAP. If a driver populates the whole VMA at mmap() time, it
    probably does not expect page faults.

    This patch removes the special check from vma_wants_writenotify() which
    disables page write tracking for VMAs populated via vm_insert_page().
    The BDI below a mapped file should not use dirty accounting; moreover,
    do_wp_page() can handle this.

    vm_insert_page() still marks the vma after first usage. Usually it is
    called from an f_op->mmap() handler under the mm->mmap_sem write-lock,
    so it is able to change vma->vm_flags. Callers must set VM_MIXEDMAP at
    mmap time if they want to call this function from other places, for
    example from a page-fault handler.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Replace the generic vma-flag VM_PFN_AT_MMAP with x86-only VM_PAT.

    We can toss mapping address from remap_pfn_range() into
    track_pfn_vma_new(), and collect all PAT-related logic together in
    arch/x86/.

    This patch also restores the original frustration-free is_cow_mapping()
    check in remap_pfn_range(), as it was before commit v2.6.28-rc8-88-g3c8bb73
    ("x86: PAT: store vm_pgoff for all linear_over_vma_region mappings - v3").

    The is_linear_pfn_mapping() checks can be removed from mm/huge_memory.c,
    because this case is already handled by VM_PFNMAP in the VM_NO_THP bit-mask.

    [suresh.b.siddha@intel.com: Reset the VM_PAT flag as part of untrack_pfn_vma()]
    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Suresh Siddha
    Cc: Venkatesh Pallipadi
    Cc: H. Peter Anvin
    Cc: Nick Piggin
    Cc: Ingo Molnar
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: Hugh Dickins
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • With PAT enabled, vm_insert_pfn() looks up the existing pfn memory
    attribute and uses it. Expectation is that the driver reserves the
    memory attributes for the pfn before calling vm_insert_pfn().

    remap_pfn_range() (when called for the whole vma) will setup a new
    attribute (based on the prot argument) for the specified pfn range.
    This addresses the legacy usage which typically calls remap_pfn_range()
    with a desired memory attribute. For ranges smaller than the vma size
    (which is typically not the case), remap_pfn_range() will use the
    existing memory attribute for the pfn range.

    Expose two different APIs for these different behaviors:
    track_pfn_insert() for tracking the pfn attribute set by vm_insert_pfn(),
    and track_pfn_remap() for remap_pfn_range().

    This cleanup also prepares the ground for the track/untrack pfn vma
    routines to take over the ownership of setting PAT specific vm_flag in
    the 'vma'.

    [khlebnikov@openvz.org: Clear checks in track_pfn_remap()]
    [akpm@linux-foundation.org: tweak a few comments]
    Signed-off-by: Suresh Siddha
    Signed-off-by: Konstantin Khlebnikov
    Cc: Venkatesh Pallipadi
    Cc: H. Peter Anvin
    Cc: Nick Piggin
    Cc: Ingo Molnar
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: Hugh Dickins
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Konstantin Khlebnikov
    Cc: Matt Helsley
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Suresh Siddha
     

02 Aug, 2012

1 commit

  • Pull second vfs pile from Al Viro:
    "The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the
    deadlock reproduced by xfstests 068), symlink and hardlink restriction
    patches, plus assorted cleanups and fixes.

    Note that another fsfreeze deadlock (emergency thaw one) is *not*
    dealt with - the series by Fernando conflicts a lot with Jan's, breaks
    userland ABI (FIFREEZE semantics gets changed) and trades the deadlock
    for massive vfsmount leak; this is going to be handled next cycle.
    There probably will be another pull request, but that stuff won't be
    in it."

    Fix up trivial conflicts due to unrelated changes next to each other in
    drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c}

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
    delousing target_core_file a bit
    Documentation: Correct s_umount state for freeze_fs/unfreeze_fs
    fs: Remove old freezing mechanism
    ext2: Implement freezing
    btrfs: Convert to new freezing mechanism
    nilfs2: Convert to new freezing mechanism
    ntfs: Convert to new freezing mechanism
    fuse: Convert to new freezing mechanism
    gfs2: Convert to new freezing mechanism
    ocfs2: Convert to new freezing mechanism
    xfs: Convert to new freezing code
    ext4: Convert to new freezing mechanism
    fs: Protect write paths by sb_start_write - sb_end_write
    fs: Skip atime update on frozen filesystem
    fs: Add freezing handling to mnt_want_write() / mnt_drop_write()
    fs: Improve filesystem freezing handling
    switch the protection of percpu_counter list to spinlock
    nfsd: Push mnt_want_write() outside of i_mutex
    btrfs: Push mnt_want_write() outside of i_mutex
    fat: Push mnt_want_write() outside of i_mutex
    ...

    Linus Torvalds