08 Apr, 2014

3 commits

  • The NUMA scanning code can end up iterating over many gigabytes of
    unpopulated memory, especially in the case of a freshly started KVM
    guest with lots of memory.

    This results in the mmu notifier code being called even when there are
    no mapped pages in a virtual address range. The amount of time wasted
    can be enough to trigger soft lockup warnings with very large KVM
    guests.

    This patch moves the mmu notifier call to the pmd level, which
    represents 1GB areas of memory on x86-64. Furthermore, the mmu notifier
    code is only called from the address in the PMD where present mappings
    are first encountered.

    The hugetlbfs code is left alone for now; hugetlb mappings are not
    relocatable, and as such are left alone by the NUMA code, and should
    never trigger this problem to begin with.
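
    A minimal sketch of the resulting shape of change_pmd_range()
    (illustrative kernel-style C, not the verbatim upstream diff; the
    loop plumbing is abbreviated):

        /*
         * Walk the pmds under one pud and fire the mmu notifier lazily,
         * only once a present entry is actually found.
         */
        static void sketch_change_pmd_range(struct vm_area_struct *vma,
                        pud_t *pud, unsigned long addr, unsigned long end)
        {
                struct mm_struct *mm = vma->vm_mm;
                unsigned long mni_start = 0;    /* 0: notifier not called */
                unsigned long next;
                pmd_t *pmd = pmd_offset(pud, addr);

                do {
                        next = pmd_addr_end(addr, end);
                        if (pmd_none_or_clear_bad(pmd))
                                continue;       /* unpopulated: skip cheaply */

                        /* first present mapping: invalidate from here on */
                        if (!mni_start) {
                                mni_start = addr;
                                mmu_notifier_invalidate_range_start(mm,
                                                mni_start, end);
                        }
                        /* ... update protections under this pmd ... */
                } while (pmd++, addr = next, addr != end);

                if (mni_start)
                        mmu_notifier_invalidate_range_end(mm, mni_start, end);
        }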

    Signed-off-by: Rik van Riel
    Acked-by: David Rientjes
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Reported-by: Xing Gang
    Tested-by: Chegu Vinod
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Sasha reported the following bug using trinity:

    kernel BUG at mm/mprotect.c:149!
    invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Dumping ftrace buffer:
    (ftrace buffer empty)
    Modules linked in:
    CPU: 20 PID: 26219 Comm: trinity-c216 Tainted: G W 3.14.0-rc5-next-20140305-sasha-00011-ge06f5f3-dirty #105
    task: ffff8800b6c80000 ti: ffff880228436000 task.ti: ffff880228436000
    RIP: change_protection_range+0x3b3/0x500
    Call Trace:
    change_protection+0x25/0x30
    change_prot_numa+0x1b/0x30
    task_numa_work+0x279/0x360
    task_work_run+0xae/0xf0
    do_notify_resume+0x8e/0xe0
    retint_signal+0x4d/0x92

    The VM_BUG_ON was added in -mm by the patch "mm,numa: reorganize
    change_pmd_range". The race existed without the patch but was just
    harder to hit.

    The problem is that a transhuge check is made without holding the PTL.
    It's possible at the time of the check that a parallel fault clears the
    pmd and inserts a new one which then triggers the VM_BUG_ON check. This
    patch removes the VM_BUG_ON but fixes the race by rechecking transhuge
    under the PTL when marking page tables for NUMA hinting and bailing if a
    race occurred. It is not a problem for calls to mprotect() as they hold
    mmap_sem for write.
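
    The shape of the fix, as a hedged sketch (assuming the split-ptl
    pmd_lock() helper of this era; not the verbatim diff):

        /* The unlocked pmd_trans_huge() check is only a hint; repeat it
         * under the page table lock and bail if the race was lost. */
        if (pmd_trans_huge(*pmd)) {             /* unlocked: may be stale */
                spinlock_t *ptl = pmd_lock(mm, pmd);

                if (!pmd_trans_huge(*pmd)) {
                        /* a parallel fault changed the pmd: bail out and
                         * fall back to the pte-level path */
                        spin_unlock(ptl);
                } else {
                        /* ... operate on the huge pmd safely ... */
                        spin_unlock(ptl);
                }
        }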

    Signed-off-by: Mel Gorman
    Reported-by: Sasha Levin
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Reorganize the order of ifs in change_pmd_range a little, in preparation
    for the next patch.

    [akpm@linux-foundation.org: fix indenting, per David]
    Signed-off-by: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Reported-by: Xing Gang
    Tested-by: Chegu Vinod
    Acked-by: David Rientjes
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

17 Feb, 2014

2 commits

  • Archs like ppc64 don't do a TLB flush in the set_pte/pmd functions when
    using a hash table MMU, for various reasons (the flush is handled as
    part of the PTE modification when necessary).

    ppc64 thus doesn't implement flush_tlb_range for hash-based MMUs.

    Additionally, ppc64 requires the TLB flushing to be batched within ptl
    locks.

    The reason to do that is to ensure that the hash page table is in sync
    with the Linux page table.

    We track the hpte index in the Linux pte, and if we clear it without
    flushing the hash and drop the ptl lock, another cpu can update the pte
    and we can end up with a duplicate entry in the hash table, which is
    fatal.

    We also want to keep set_pte_at simple by not requiring it to do a hash
    flush, for performance reasons. We do that by assuming that set_pte_at()
    is never *ever* called on a PTE that is already valid.

    This was the case until the NUMA code went in which broke that assumption.

    Fix that by introducing a new pair of helpers to set _PAGE_NUMA in a
    way similar to ptep/pmdp_set_wrprotect(), with a generic implementation
    using set_pte_at() and a powerpc specific one using the appropriate
    mechanism needed to keep the hash table in sync.
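
    A hedged sketch of the generic fallback for the new helper, modelled
    on ptep_set_wrprotect() (architectures like ppc64 override it with a
    hash-flushing variant):

        #ifndef ptep_set_numa
        static inline void ptep_set_numa(struct mm_struct *mm,
                        unsigned long addr, pte_t *ptep)
        {
                pte_t ptent = *ptep;

                /* fine generically; ppc64 must flush the hash instead */
                ptent = pte_mknuma(ptent);
                set_pte_at(mm, addr, ptep, ptent);
        }
        #endif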

    Acked-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Benjamin Herrenschmidt

    Aneesh Kumar K.V
     
  • So move it within the if block

    Acked-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Benjamin Herrenschmidt

    Aneesh Kumar K.V
     

22 Jan, 2014

1 commit

  • KSM pages can be shared between tasks that are not necessarily related
    to each other from a NUMA perspective. This patch causes those pages to
    be ignored by automatic NUMA balancing so they do not migrate and do not
    cause unrelated tasks to be grouped together.
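
    The check amounts to something like this sketch inside
    change_pte_range() (not the verbatim diff):

        struct page *page = vm_normal_page(vma, addr, oldpte);

        /* KSM pages can be shared by unrelated tasks: never mark them */
        if (page && !PageKsm(page)) {
                ptent = pte_mknuma(ptent);
                set_pte_at(mm, addr, pte, ptent);
        }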

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

19 Dec, 2013

3 commits

  • There are a few subtle races, between change_protection_range (used by
    mprotect and change_prot_numa) on one side, and NUMA page migration and
    compaction on the other side.

    The basic race is that there is a time window between when the PTE gets
    made non-present (PROT_NONE or NUMA), and the TLB is flushed.

    During that time, a CPU may continue writing to the page.

    This is fine most of the time, however compaction or the NUMA migration
    code may come in, and migrate the page away.

    When that happens, the CPU may continue writing, through the cached
    translation, to what is no longer the current memory location of the
    process.

    This only affects x86, which has a somewhat optimistic pte_accessible.
    All other architectures appear to be safe, and will either always flush,
    or flush whenever there is a valid mapping, even with no permissions
    (SPARC).

    The basic race looks like this:

    CPU A                       CPU B                   CPU C

                                                        load TLB entry
    make entry PTE/PMD_NUMA
                                fault on entry
                                                        read/write old page
                                start migrating page
                                change PTE/PMD to new page
                                                        read/write old page [*]
    flush TLB
                                                        reload TLB from new entry
                                                        read/write new page
                                                        lose data

    [*] the old page may belong to a new user at this point!

    The obvious fix is to flush remote TLB entries, by making sure that
    pte_accessible is aware of the fact that PROT_NONE and PROT_NUMA memory
    may still be accessible if there is a TLB flush pending for the mm.

    This should fix both NUMA migration and compaction.
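
    On x86 the fix amounts to roughly the following (a sketch close to,
    but not verbatim, the upstream change):

        static inline bool pte_accessible(struct mm_struct *mm, pte_t a)
        {
                if (pte_flags(a) & _PAGE_PRESENT)
                        return true;

                /* PROT_NONE/NUMA ptes may still be cached in remote TLBs
                 * until the pending flush for this mm has completed */
                if ((pte_flags(a) & _PAGE_PROTNONE) &&
                    mm_tlb_flush_pending(mm))
                        return true;

                return false;
        }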

    [mgorman@suse.de: fix build]
    Signed-off-by: Rik van Riel
    Signed-off-by: Mel Gorman
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • On a protection change it is no longer clear if the page should still
    be accessible. This patch clears the NUMA hinting fault bits on a
    protection change.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The TLB must be flushed if the PTE is updated, but change_pte_range was
    clearing the PTE while marking PTEs pte_numa, without necessarily
    flushing the TLB if it reinserted the same entry. Without the flush,
    it's conceivable that two processors could have different TLB entries
    for the same virtual address, and at the very least it would generate
    spurious faults.

    This patch only unmaps the pages in change_pte_range for a full
    protection change.
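
    The resulting split in change_pte_range(), as a hedged sketch using
    era-appropriate helpers (not the verbatim diff):

        if (prot_numa) {
                /* modify the pte in place: no clear, no flush needed */
                pte_t ptent = *pte;

                ptent = pte_mknuma(ptent);
                set_pte_at(mm, addr, pte, ptent);
        } else {
                /* a real protection change: clear, modify, reinsert */
                pte_t ptent = ptep_modify_prot_start(mm, addr, pte);

                ptent = pte_modify(ptent, newprot);
                ptep_modify_prot_commit(mm, addr, pte, ptent);
        }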

    [riel@redhat.com: write pte_numa pte back to the page tables]
    Signed-off-by: Mel Gorman
    Signed-off-by: Rik van Riel
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc: Chegu Vinod
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

13 Nov, 2013

1 commit

  • Commit 0255d4918480 ("mm: Account for a THP NUMA hinting update as one
    PTE update") was added to account for the number of PTE updates when
    marking pages prot_numa. task_numa_work was using the old return value
    to track how much address space had been updated. Altering the return
    value causes the scanner to do more work than it is configured or
    documented to in a single unit of work.

    This patch reverts that commit and accounts for the number of THP
    updates separately in vmstat. It is up to the administrator to
    interpret the pair of values correctly. This is a straight-forward
    operation and likely to only be of interest when actively debugging NUMA
    balancing problems.

    The impact of this patch is that the NUMA PTE scanner will scan slower
    when THP is enabled and workloads may converge slower as a result. On
    the flip side, system CPU usage should be lower than recent tests
    reported. This is an illustrative example of a short single JVM specjbb
    test:

    specjbb
                             3.12.0                3.12.0
                            vanilla           acctupdates
    TPut 1       26143.00 (  0.00%)    25747.00 ( -1.51%)
    TPut 7      185257.00 (  0.00%)   183202.00 ( -1.11%)
    TPut 13     329760.00 (  0.00%)   346577.00 (  5.10%)
    TPut 19     442502.00 (  0.00%)   460146.00 (  3.99%)
    TPut 25     540634.00 (  0.00%)   549053.00 (  1.56%)
    TPut 31     512098.00 (  0.00%)   519611.00 (  1.47%)
    TPut 37     461276.00 (  0.00%)   474973.00 (  2.97%)
    TPut 43     403089.00 (  0.00%)   414172.00 (  2.75%)

                   3.12.0        3.12.0
                  vanilla   acctupdates
    User          5169.64       5184.14
    System         100.45         80.02
    Elapsed        252.75        251.85

    Performance is similar but note the reduction in system CPU time. While
    this showed a performance gain, it will not be universal but at least
    it'll be behaving as documented. The vmstats are obviously different but
    here is an obvious interpretation of them from mmtests.

                                  3.12.0       3.12.0
                                 vanilla  acctupdates
    NUMA page range updates      1408326     11043064
    NUMA huge PMD updates              0        21040
    NUMA PTE updates             1408326       291624

    "NUMA page range updates" == nr_pte_updates and is the value returned to
    the NUMA pte scanner. NUMA huge PMD updates were the number of THP
    updates which in combination can be used to calculate how many ptes were
    updated from userspace.
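
    As a rough sanity check of those numbers: 21040 huge PMD updates x 512
    base pages each is about 10.77 million pages, which together with the
    291624 PTE updates approximately accounts for the 11043064 "NUMA page
    range updates" reported above.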

    Signed-off-by: Mel Gorman
    Reported-by: Alex Thorlton
    Reviewed-by: Rik van Riel
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

01 Nov, 2013

1 commit

  • Resolve cherry-picking conflicts:

    Conflicts:
    mm/huge_memory.c
    mm/memory.c
    mm/mprotect.c

    See this upstream merge commit for more details:

    52469b4fcd4f Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

29 Oct, 2013

1 commit

  • A THP PMD update is accounted for as 512 pages updated in vmstat. This
    is a large difference when estimating the cost of automatic NUMA
    balancing and can be misleading when comparing results that had
    collapsed versus split THP. This patch addresses the accounting issue.
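
    The accounting change is essentially one line in the change_pmd_range()
    loop, sketched here (HPAGE_PMD_NR is 512 with 2MB THP and 4K base
    pages):

        if (change_huge_pmd(vma, pmd, addr, newprot, prot_numa))
                pages++;        /* was: pages += HPAGE_PMD_NR; */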

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Cc:
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-10-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     

17 Oct, 2013

1 commit

  • If page migration is turned on in config and the page is migrating, we
    may lose the soft dirty bit. If fork and mprotect are called on
    migrating pages (once migration is complete) pages do not obtain the
    soft dirty bit in the corresponding pte entries. Fix it by adding an
    appropriate test on swap entries.
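
    A hedged sketch of the pattern the fix adds when rewriting a migration
    swap entry (close to, but not verbatim, the upstream diff):

        swp_entry_t entry = pte_to_swp_entry(oldpte);

        if (is_write_migration_entry(entry)) {
                pte_t newpte;

                make_migration_entry_read(&entry);
                newpte = swp_entry_to_pte(entry);
                /* the previously missing test: carry soft-dirty over */
                if (pte_swp_soft_dirty(oldpte))
                        newpte = pte_swp_mksoft_dirty(newpte);
                set_pte_at(mm, addr, pte, newpte);
        }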

    Signed-off-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Andy Lutomirski
    Cc: Matt Mackall
    Cc: Xiao Guangrong
    Cc: Marcelo Tosatti
    Cc: KOSAKI Motohiro
    Cc: Stephen Rothwell
    Cc: Peter Zijlstra
    Cc: "Aneesh Kumar K.V"
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     

09 Oct, 2013

8 commits

  • With the THP migration races closed it is still possible to occasionally
    see corruption. The problem is related to handling PMD pages in batch.
    When a page fault is handled it can be assumed that the page being
    faulted will also be flushed from the TLB. The same flushing does not
    happen when handling PMD pages in batch. Fixing this is straightforward
    but there are a number of reasons not to:

    1. Multiple TLB flushes may have to be sent depending on what pages get
       migrated.
    2. The handling of PMDs in batch means that faults get accounted to
       the task that is handling the fault. While care is taken to only
       mark PMDs where the last CPU and PID match, it can still have
       problems due to PID truncation when matching PIDs.
    3. Batching on the PMD level may reduce faults but setting pmd_numa
       requires taking a heavy lock that can contend with THP migration,
       and handling the fault requires the release/acquisition of the PTL
       for every page migrated. It's still pretty heavy.

    PMD batch handling is not something that people have ever been happy
    with. This patch removes it, and later patches will deal with the
    additional fault overhead using more intelligent migration rate
    adaptation.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-48-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
  • Change the per page last fault tracking to use cpu,pid instead of
    nid,pid. This will allow us to try and look up the alternate task more
    easily. Note that even though it is the cpu that is stored in the page
    flags, the mpol_misplaced decision is still based on the node.
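
    A hedged sketch of the encoding (field widths are illustrative and the
    helper names are modelled on, not copied from, the kernel's):

        #define LAST__PID_SHIFT 8
        #define LAST__PID_MASK  ((1 << LAST__PID_SHIFT) - 1)

        /* pack the faulting cpu and the low pid bits into the spare
         * page-flag bits that previously held the nid,pid pair */
        static inline int cpu_pid_to_cpupid(int cpu, int pid)
        {
                return (cpu << LAST__PID_SHIFT) | (pid & LAST__PID_MASK);
        }

        static inline int cpupid_to_pid(int cpupid)
        {
                return cpupid & LAST__PID_MASK;
        }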

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Link: http://lkml.kernel.org/r/1381141781-10992-43-git-send-email-mgorman@suse.de
    [ Fixed build failure on 32-bit systems. ]
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Base page PMD faulting is meant to batch handle NUMA hinting faults from
    PTEs. However, even if no PTE faults would ever be handled within a
    range, the kernel still traps PMD hinting faults. This patch avoids the
    overhead.
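
    A hedged sketch of the idea in change_pmd_range() (the pmd-marking
    helper named here is hypothetical, for illustration only):

        unsigned long this_pages;

        this_pages = change_pte_range(vma, pmd, addr, next, newprot,
                                      dirty_accountable, prot_numa);
        pages += this_pages;

        /* only trap pmd-level hinting faults where pte updates were
         * actually made within the range */
        if (prot_numa && this_pages)
                mark_pmd_numa_sketch(vma->vm_mm, addr, pmd);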

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-37-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
  • Ideally it would be possible to distinguish between NUMA hinting faults that
    are private to a task and those that are shared. If treated identically
    there is a risk that shared pages bounce between nodes depending on
    the order they are referenced by tasks. Ultimately what is desirable is
    that task private pages remain local to the task while shared pages are
    interleaved between sharing tasks running on different nodes to give good
    average performance. This is further complicated by THP as even
    applications that partition their data may not be partitioning on a huge
    page boundary.

    To start with, this patch assumes that multi-threaded or multi-process
    applications partition their data and that in general the private accesses
    are more important for cpu->memory locality in the general case. Also,
    no new infrastructure is required to treat private pages properly but
    interleaving for shared pages requires additional infrastructure.

    To detect private accesses the pid of the last accessing task is
    required, but the storage requirements are high. This patch borrows
    heavily from
    Ingo Molnar's patch "numa, mm, sched: Implement last-CPU+PID hash tracking"
    to encode some bits from the last accessing task in the page flags as
    well as the node information. Collisions will occur but it is better than
    just depending on the node information. Node information is then used to
    determine if a page needs to migrate. The PID information is used to detect
    private/shared accesses. The preferred NUMA node is selected based on where
    the maximum number of approximately private faults were measured. Shared
    faults are not taken into consideration for a few reasons.

    First, if there are many tasks sharing the page then they'll all move
    towards the same node. The node will be compute overloaded and the
    tasks then scheduled away later, only to bounce back again.
    Alternatively the sharing tasks would just bounce around nodes because
    the fault information is effectively noise. Either way, accounting for
    shared faults the same as private faults can result in lower
    performance overall.

    The second reason is based on a hypothetical workload that has a small
    number of very important, heavily accessed private pages but a large shared
    array. The shared array would dominate the number of faults and be selected
    as a preferred node even though it's the wrong decision.

    The third reason is that multiple threads in a process will race each
    other to fault the shared page making the fault information unreliable.
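
    The selection policy reduces to a sketch like this (the fault-array
    layout and indexing are assumptions for illustration, not taken from
    the patch):

        int nid, max_nid = -1;
        unsigned long max_faults = 0;

        for_each_online_node(nid) {
                /* faults recorded per (node, private/shared) pair;
                 * only the private counter is consulted here */
                unsigned long faults = p->numa_faults[2 * nid + 1];

                if (faults > max_faults) {
                        max_faults = faults;
                        max_nid = nid;
                }
        }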

    Signed-off-by: Mel Gorman
    [ Fix compilation error when !NUMA_BALANCING. ]
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-30-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
  • Currently automatic NUMA balancing is unable to distinguish between
    false shared versus private pages except by ignoring pages with an
    elevated page_mapcount entirely. This avoids shared pages bouncing
    between the nodes whose tasks are using them, but it also ignores quite
    a lot of data.

    This patch kicks away the training wheels now that the support for
    identifying shared/private pages is in place. The ordering is so that
    the impact of the shared/private detection can be easily measured. Note
    that the patch does not migrate shared, file-backed pages within vmas
    marked VM_EXEC as these are generally shared library pages. Migrating
    such pages is not beneficial as there is an expectation they are
    read-shared between caches, and iTLB and iCache pressure is generally
    low.
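
    The exec-vma exemption is a simple filter, sketched here (its exact
    placement in the scanner is assumed):

        /* inside the vma walk of the NUMA pte scanner */
        if (vma->vm_file && (vma->vm_flags & VM_EXEC))
                continue;       /* shared library text: leave it alone */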

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-28-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
  • NUMA PTE scanning is expensive both in terms of the scanning itself and
    the TLB flush if there are any updates. The TLB flush is avoided if no
    PTEs are updated but there is a bug where transhuge PMDs are considered
    to be updated even if they were already pmd_numa. This patch addresses
    the problem and TLB flushes should be reduced.
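
    The fix reduces to an early bail-out of roughly this shape inside
    change_huge_pmd() (a sketch, not the verbatim diff):

        if (prot_numa && pmd_numa(*pmd))
                goto out;       /* already marked: no update, no flush */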

    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Reviewed-by: Rik van Riel
    Signed-off-by: Mel Gorman
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-12-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
  • NUMA PTE scanning is expensive both in terms of the scanning itself and
    the TLB flush if there are any updates. Currently non-present PTEs are
    accounted for as updates, incurring a TLB flush that is only necessary
    for anonymous migration entries. This patch addresses the problem and
    should reduce TLB flushes.
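
    In change_pte_range() the accounting becomes conditional, roughly as
    follows (a sketch using era-appropriate helpers):

        } else if (IS_ENABLED(CONFIG_MIGRATION) && !pte_file(oldpte)) {
                swp_entry_t entry = pte_to_swp_entry(oldpte);

                if (is_write_migration_entry(entry)) {
                        make_migration_entry_read(&entry);
                        set_pte_at(mm, addr, pte, swp_entry_to_pte(entry));
                        pages++;        /* count only real modifications */
                }
        }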

    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Reviewed-by: Rik van Riel
    Signed-off-by: Mel Gorman
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-11-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
  • A THP PMD update is accounted for as 512 pages updated in vmstat. This is
    large difference when estimating the cost of automatic NUMA balancing and
    can be misleading when comparing results that had collapsed versus split
    THP. This patch addresses the accounting issue.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-10-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     


17 Dec, 2012

1 commit

  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for these workloads driven by perf, which is
    bad but is potentially explained by the lack of scheduler smarts.
    Thomas' results show balancenuma improves on mainline but falls far
    short of numacore or autonuma. Srikar's results indicate we all suffer
    on a large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers, which is not reported by
    the tool by default and sometimes missed in reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions, but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore batch
    handles PTEs but I no longer think this is the case. It's possible
    numacore is just able to trigger it due to higher rates of migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

13 Dec, 2012

1 commit

  • Pass vma instead of mm and add an address parameter.

    In most cases we already have the vma on the stack. We provide
    split_huge_page_pmd_mm() for the few cases when we have the mm but not
    the vma.

    This change is preparation for the huge zero pmd splitting
    implementation.
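
    The resulting interface, sketched as prototypes (close to, but not
    verbatim, the patch):

        /* before: */
        void split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd);

        /* after: vma + address, plus an mm-only fallback */
        void split_huge_page_pmd(struct vm_area_struct *vma,
                                 unsigned long address, pmd_t *pmd);
        void split_huge_page_pmd_mm(struct mm_struct *mm,
                                    unsigned long address, pmd_t *pmd);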

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: "H. Peter Anvin"
    Cc: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

11 Dec, 2012

4 commits

  • To say that the PMD handling code was incorrectly transferred from
    autonuma is an understatement. The intention was to handle a PMD's
    worth of pages in the same fault and effectively batch the taking of
    the PTL and page migration. The copied version instead has the effect
    of clearing a number of pte_numa PTE entries, and whether any page
    migration takes place depends on racing. This just happens to work in
    some cases.

    This patch handles pte_numa faults in batch when a pmd_numa fault is
    handled. The pages are migrated if they are currently misplaced.
    Essentially this is making an assumption that NUMA locality is
    on a PMD boundary, but that could be addressed by only setting
    pmd_numa if all the pages within that PMD are on the same node,
    if necessary.

    Signed-off-by: Mel Gorman

    Mel Gorman
     
  • This patch converts change_prot_numa() to use change_protection(). As
    pte_numa and friends check the PTE bits directly it is necessary for
    change_protection() to use pmd_mknuma(). Hence the required
    modifications to change_protection() are a little clumsy but the
    end result is that most of the numa page table helpers are just one or
    two instructions.

    Signed-off-by: Mel Gorman

    Mel Gorman
     
  • Reuse the NUMA code's 'modified page protections' count that
    change_protection() computes and skip the TLB flush if there are
    no changes to a range that sys_mprotect() modifies.

    Given that mprotect() already optimizes the same-flags case
    I expected this optimization to dominantly trigger on
    CONFIG_NUMA_BALANCING=y kernels - but even with that feature
    disabled it triggers rather often.

    There are two reasons for that:

    1)

    While sys_mprotect() already optimizes the same-flags case:

        if (newflags == oldflags) {
                *pprev = vma;
                return 0;
        }

    this test works in many cases, but it is too sharp in some
    others, where it differentiates between protection values that the
    underlying PTE format makes no distinction about, such as
    PROT_EXEC == PROT_READ on x86.

    2)

    Even where the pte format for the new vma flags necessitates a
    modification of the pagetables, there might be no pagetables
    yet to modify: they might not be instantiated yet.

    During a regular desktop bootup this optimization hits a couple
    of hundred times. During a Java test I measured thousands of
    hits.

    So this optimization improves sys_mprotect() in general, not just
    CONFIG_NUMA_BALANCING=y kernels.

    [ We could further increase the efficiency of this optimization if
    change_pte_range() and change_huge_pmd() was a bit smarter about
    recognizing exact-same-value protection masks - when the hardware
    can do that safely. This would probably further speed up mprotect(). ]
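
    The core of the optimization, as a sketch of the tail of
    change_protection_range():

        /* only flush the TLB if we actually modified any entries */
        if (pages)
                flush_tlb_range(vma, start, end);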

    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • This will be used for three kinds of purposes:

    - to optimize mprotect()

    - to speed up working set scanning for working set areas that
    have not been touched

    - to more accurately scan per real working set

    No change in functionality from this patch.
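
    The counting interface this introduces looks roughly like the
    following prototype (a sketch of the 2012-era signature):

        unsigned long change_protection(struct vm_area_struct *vma,
                        unsigned long start, unsigned long end,
                        pgprot_t newprot, int dirty_accountable);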

    Suggested-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

23 Mar, 2012

1 commit

  • Merge first batch of patches from Andrew Morton:
    "A few misc things and all the MM queue"

    * emailed from Andrew Morton : (92 commits)
    memcg: avoid THP split in task migration
    thp: add HPAGE_PMD_* definitions for !CONFIG_TRANSPARENT_HUGEPAGE
    memcg: clean up existing move charge code
    mm/memcontrol.c: remove unnecessary 'break' in mem_cgroup_read()
    mm/memcontrol.c: remove redundant BUG_ON() in mem_cgroup_usage_unregister_event()
    mm/memcontrol.c: s/stealed/stolen/
    memcg: fix performance of mem_cgroup_begin_update_page_stat()
    memcg: remove PCG_FILE_MAPPED
    memcg: use new logic for page stat accounting
    memcg: remove PCG_MOVE_LOCK flag from page_cgroup
    memcg: simplify move_account() check
    memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat)
    memcg: kill dead prev_priority stubs
    memcg: remove PCG_CACHE page_cgroup flag
    memcg: let css_get_next() rely upon rcu_read_lock()
    cgroup: revert ss_id_lock to spinlock
    idr: make idr_get_next() good for rcu_read_lock()
    memcg: remove unnecessary thp check in page stat accounting
    memcg: remove redundant returns
    memcg: enum lru_list lru
    ...

    Linus Torvalds
     

22 Mar, 2012

2 commits

  • Since commit 2a11c8ea20bf ("kconfig: Introduce IS_ENABLED(),
    IS_BUILTIN() and IS_MODULE()") there is a generic grep-friendly method
    for checking config options in C expressions.
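
    For example, instead of an #ifdef block a config option can be tested
    inline (a generic illustration, not a line from the patch):

        /* compiled and type-checked either way; the optimizer drops the
         * branch when the option is off */
        if (IS_ENABLED(CONFIG_MIGRATION)) {
                /* ... migration-only handling ... */
        }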

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Pull security subsystem updates for 3.4 from James Morris:
    "The main addition here is the new Yama security module from Kees Cook,
    which was discussed at the Linux Security Summit last year. Its
    purpose is to collect miscellaneous DAC security enhancements in one
    place. This also marks a departure in policy for LSM modules, which
    were previously limited to being standalone access control systems.
    Chromium OS is using Yama, and I believe there are plans for Ubuntu,
    at least.

    This patchset also includes maintenance updates for AppArmor, TOMOYO
    and others."

    Fix trivial conflict due to the jump_label->static_key rename.

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (38 commits)
    AppArmor: Fix location of const qualifier on generated string tables
    TOMOYO: Return error if fails to delete a domain
    AppArmor: add const qualifiers to string arrays
    AppArmor: Add ability to load extended policy
    TOMOYO: Return appropriate value to poll().
    AppArmor: Move path failure information into aa_get_name and rename
    AppArmor: Update dfa matching routines.
    AppArmor: Minor cleanup of d_namespace_path to consolidate error handling
    AppArmor: Retrieve the dentry_path for error reporting when path lookup fails
    AppArmor: Add const qualifiers to generated string tables
    AppArmor: Fix oops in policy unpack auditing
    AppArmor: Fix error returned when a path lookup is disconnected
    KEYS: testing wrong bit for KEY_FLAG_REVOKED
    TOMOYO: Fix mount flags checking order.
    security: fix ima kconfig warning
    AppArmor: Fix the error case for chroot relative path name lookup
    AppArmor: fix mapping of META_READ to audit and quiet flags
    AppArmor: Fix underflow in xindex calculation
    AppArmor: Fix dropping of allowed operations that are force audited
    AppArmor: Add mising end of structure test to caps unpacking
    ...

    Linus Torvalds
     

07 Mar, 2012

1 commit

  • Several users of "find_vma_prev()" were not in fact interested in the
    previous vma if there was no primary vma to be found either. And in
    those cases, we're much better off just using the regular "find_vma()",
    and then "prev" can be looked up by just checking vma->vm_prev.

    The find_vma_prev() semantics are fairly subtle (see Mikulas' recent
    commit 83cd904d271b: "mm: fix find_vma_prev"), and the whole "return
    prev by reference" means that it generates worse code too.

    Thus this "let's avoid using this inconvenient and clearly too subtle
    interface when we don't really have to" patch.

    Cc: Mikulas Patocka
    Cc: KOSAKI Motohiro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     


14 Jan, 2011

3 commits

  • Natively handle huge pmds when changing page tables on behalf of
    mprotect().

    I left out update_mmu_cache() because we do not need it on x86 anyway,
    but more importantly the interface works on ptes, not pmds.
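
    The hook in change_pmd_range() looks roughly like this (a sketch using
    2011-era signatures):

        if (pmd_trans_huge(*pmd)) {
                if (next - addr != HPAGE_PMD_SIZE)
                        split_huge_page_pmd(vma->vm_mm, pmd);
                else if (change_huge_pmd(vma, pmd, addr, newprot))
                        continue;       /* whole pmd handled natively */
                /* otherwise fall through to the pte level */
        }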

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Flushing the tlb for huge pmds requires the vma's anon_vma, so pass
    along the vma instead of the mm; we can always get the latter when we
    need it.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • split_huge_page_pmd compat code. Each one of those would need to be
    expanded to hundreds of lines of complex code without a fully reliable
    split_huge_page_pmd design.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

10 Nov, 2010

1 commit

  • As pointed out by Linus, commit dab5855 ("perf_counter: Add mmap event
    hooks to mprotect()") is fundamentally wrong, as mprotect_fixup() can
    free 'vma' due to merging. Fix the problem by moving the
    perf_event_mmap() hook into mprotect_fixup().

    Note: there's another successful return path from mprotect_fixup() if
    the old flags equal the new flags. We don't, however, need to call
    perf_event_mmap() there because 'perf' already knows the VMA is
    executable.

    Reported-by: Dave Jones
    Analyzed-by: Linus Torvalds
    Cc: Ingo Molnar
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Pekka Enberg
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     

30 Mar, 2010

1 commit

  • include cleanup: Update gfp.h and slab.h includes to prepare for
    breaking implicit slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h, making everything defined by the two files
    universally available and complicating inclusion dependencies.

    The percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include
    those headers directly instead of assuming availability. As this
    conversion needs to touch a large number of source files, the following
    script is used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    wildly available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build test were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of the
    specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

21 Sep, 2009

1 commit

  • Bye-bye Performance Counters, welcome Performance Events!

    In the past few months the perfcounters subsystem has grown out of its
    initial role of counting hardware events, and has become (and is
    becoming) a much broader generic event enumeration, reporting, logging,
    monitoring, analysis facility.

    Naming its core object 'perf_counter' and naming the subsystem
    'perfcounters' has become more and more of a misnomer. With pending
    code like hw-breakpoints support the 'counter' name is less and
    less appropriate.

    All in one, we've decided to rename the subsystem to 'performance
    events' and to propagate this rename through all fields, variables
    and API names. (in an ABI compatible fashion)

    The word 'event' is also a bit shorter than 'counter' - which makes
    it slightly more convenient to write/handle as well.

    Thanks goes to Stephane Eranian who first observed this misnomer and
    suggested a rename.

    User-space tooling and ABI compatibility is not affected - this patch
    should be function-invariant. (Also, defconfigs were not touched to
    keep the size down.)

    This patch has been generated via the following script:

    FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')

    sed -i \
      -e 's/PERF_EVENT_/PERF_RECORD_/g' \
      -e 's/PERF_COUNTER/PERF_EVENT/g' \
      -e 's/perf_counter/perf_event/g' \
      -e 's/nb_counters/nb_events/g' \
      -e 's/swcounter/swevent/g' \
      -e 's/tpcounter_event/tp_event/g' \
      $FILES

    for N in $(find . -name perf_counter.[ch]); do
      M=$(echo $N | sed 's/perf_counter/perf_event/g')
      mv $N $M
    done

    FILES=$(find . -name perf_event.*)

    sed -i \
      -e 's/COUNTER_MASK/REG_MASK/g' \
      -e 's/COUNTER/EVENT/g' \
      -e 's/\<event\>/event_id/g' \
      -e 's/counter/event/g' \
      -e 's/Counter/Event/g' \
      $FILES

    ... to keep it as correct as possible. This script can also be
    used by anyone who has pending perfcounters patches - it converts
    a Linux kernel tree over to the new naming. We tried to time this
    change to the point in time where the amount of pending patches
    is the smallest: the end of the merge window.

    Namespace clashes were fixed up in a preparatory patch - and some
    stylistic fallout will be fixed up in a subsequent patch.

    ( NOTE: 'counters' are still the proper terminology when we deal
    with hardware registers - and these sed scripts are a bit
    over-eager in renaming them. I've undone some of that, but
    in case there's something left where 'counter' would be
    better than 'event' we can undo that on an individual basis
    instead of touching an otherwise nicely automated patch. )

    Suggested-by: Stephane Eranian
    Acked-by: Peter Zijlstra
    Acked-by: Paul Mackerras
    Reviewed-by: Arjan van de Ven
    Cc: Mike Galbraith
    Cc: Arnaldo Carvalho de Melo
    Cc: Frederic Weisbecker
    Cc: Steven Rostedt
    Cc: Benjamin Herrenschmidt
    Cc: David Howells
    Cc: Kyle McMartin
    Cc: Martin Schwidefsky
    Cc: "David S. Miller"
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc:
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

09 Jun, 2009

1 commit

  • Some JIT compilers allocate memory for generated code with
    posix_memalign() + mprotect() so we need to hook into mprotect()
    to make sure 'perf' is aware that we're executing code in
    anonymous memory.

    [ penberg@cs.helsinki.fi: move the hook to sys_mprotect() ]
    Cc: Arnaldo Carvalho de Melo
    Signed-off-by: Pekka Enberg
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra