18 Dec, 2012

1 commit

  • This build error is currently hidden by the fact that the x86
    implementation of 'update_mmu_cache_pmd()' is a macro that doesn't use
    its last argument, but commit b32967ff101a ("mm: numa: Add THP migration
    for the NUMA working set scanning fault case") introduced a call with
    the wrong third argument.

    In the akpm tree, it causes this build error:

    mm/migrate.c: In function 'migrate_misplaced_transhuge_page_put':
    mm/migrate.c:1666:2: error: incompatible type for argument 3 of 'update_mmu_cache_pmd'
    arch/x86/include/asm/pgtable.h:792:20: note: expected 'struct pmd_t *' but argument is of type 'pmd_t'

    Fix it.

    Signed-off-by: Stephen Rothwell
    Signed-off-by: Linus Torvalds

    Stephen Rothwell
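
    A hedged sketch of the mismatch described above (the macro and the call
    below are illustrative, not the exact arch header or diff):

    /*
     * On x86 the helper was a macro along the lines of
     *
     *     #define update_mmu_cache_pmd(vma, addr, pmd) do { } while (0)
     *
     * which never expands its last argument, so passing a pmd_t by value
     * still compiles there.  An architecture providing a real function
     * that takes a pmd_t * fails instead, so the call site has to pass a
     * pointer to the entry:
     */
    update_mmu_cache_pmd(vma, address, &entry);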
     

17 Dec, 2012

1 commit

  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for the workloads driven by perf, which is bad
    but is potentially explained by the lack of scheduler smarts. Thomas'
    results show balancenuma improves on mainline but falls far short of
    numacore or autonuma. Srikar's results indicate we all suffer on a
    large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers, which is not reported by
    the tool by default and is sometimes missed in reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions, but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore
    batch-handles PTEs, but I no longer think this is the case. It's
    possible numacore is just able to trigger it due to higher rates of
    migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

13 Dec, 2012

1 commit

  • N_HIGH_MEMORY stands for the nodes that have normal or high memory.
    N_MEMORY stands for the nodes that have any memory.

    The code here needs to handle the nodes which have memory, so we should
    use N_MEMORY instead.

    Signed-off-by: Lai Jiangshan
    Acked-by: Christoph Lameter
    Signed-off-by: Wen Congyang
    Cc: Hillf Danton
    Cc: Lin Feng
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
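
    A minimal sketch of what the conversion amounts to, assuming a
    hypothetical per-node helper do_something_per_node():

    int nid;

    /* before: visits only nodes with normal/high memory */
    for_each_node_state(nid, N_HIGH_MEMORY)
            do_something_per_node(nid);

    /* after: visits every node that has any memory at all */
    for_each_node_state(nid, N_MEMORY)
            do_something_per_node(nid);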
     

12 Dec, 2012

4 commits

  • The patch "mm: introduce compaction and migration for virtio ballooned pages"
    hacks around putback_lru_pages() in order to allow ballooned pages to be
    re-inserted on the balloon page list as if a ballooned page were an LRU page.

    As ballooned pages are not legitimate LRU pages, this patch introduces
    putback_movable_pages() to properly cope with cases where the isolated
    pageset contains both ballooned pages and LRU pages, thus fixing the
    inelegant hack around putback_lru_pages() mentioned above.

    Signed-off-by: Rafael Aquini
    Cc: Rusty Russell
    Cc: "Michael S. Tsirkin"
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andi Kleen
    Cc: Konrad Rzeszutek Wilk
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
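
    A hedged sketch of the idea behind putback_movable_pages(): walk the
    isolated list and send balloon pages back to the balloon's page list
    instead of the LRU (the body below is a simplified approximation, not
    the exact mainline code):

    void putback_movable_pages(struct list_head *list)
    {
            struct page *page, *next;

            list_for_each_entry_safe(page, next, list, lru) {
                    list_del(&page->lru);
                    dec_zone_page_state(page, NR_ISOLATED_ANON +
                                              page_is_file_cache(page));
                    if (unlikely(balloon_page_movable(page)))
                            balloon_page_putback(page); /* back to the balloon list */
                    else
                            putback_lru_page(page);     /* ordinary LRU page */
            }
    }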
     
  • Memory fragmentation introduced by ballooning might significantly reduce
    the number of 2MB contiguous memory blocks that can be used within a guest,
    thus imposing performance penalties associated with the reduced number of
    transparent huge pages that could be used by the guest workload.

    This patch introduces the helper functions as well as the necessary changes
    to teach compaction and migration bits how to cope with pages which are
    part of a guest memory balloon, in order to make them movable by memory
    compaction procedures.

    Signed-off-by: Rafael Aquini
    Acked-by: Mel Gorman
    Cc: Rusty Russell
    Cc: "Michael S. Tsirkin"
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Konrad Rzeszutek Wilk
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     
  • Memory fragmentation introduced by ballooning might significantly reduce
    the number of 2MB contiguous memory blocks that can be used within a
    guest, thus imposing performance penalties associated with the reduced
    number of transparent huge pages that could be used by the guest workload.

    This patch set follows the main idea discussed at the 2012 LSFMMS session
    "Ballooning for transparent huge pages" (http://lwn.net/Articles/490114/):
    introduce the required changes to the virtio_balloon driver, as well as
    the changes to the core compaction & migration bits, in order to make
    those subsystems aware of ballooned pages and to allow memory balloon
    pages to become movable within a guest, thus avoiding the aforementioned
    fragmentation issue.

    Following are numbers showing how this patch series benefits compaction
    effectiveness in memory-ballooned guests.

    Results for the STRESS-HIGHALLOC benchmark, from Mel Gorman's mmtests
    suite, running on a 4GB RAM KVM guest which was ballooning 512MB of RAM
    in 64MB chunks every minute (inflating/deflating) while the test was
    running:

    ===BEGIN stress-highalloc

    STRESS-HIGHALLOC
    highalloc-3.7 highalloc-3.7
    rc4-clean rc4-patch
    Pass 1 55.00 ( 0.00%) 62.00 ( 7.00%)
    Pass 2 54.00 ( 0.00%) 62.00 ( 8.00%)
    while Rested 75.00 ( 0.00%) 80.00 ( 5.00%)

    MMTests Statistics: duration
    3.7 3.7
    rc4-clean rc4-patch
    User 1207.59 1207.46
    System 1300.55 1299.61
    Elapsed 2273.72 2157.06

    MMTests Statistics: vmstat
    3.7 3.7
    rc4-clean rc4-patch
    Page Ins 3581516 2374368
    Page Outs 11148692 10410332
    Swap Ins 80 47
    Swap Outs 3641 476
    Direct pages scanned 37978 33826
    Kswapd pages scanned 1828245 1342869
    Kswapd pages reclaimed 1710236 1304099
    Direct pages reclaimed 32207 31005
    Kswapd efficiency 93% 97%
    Kswapd velocity 804.077 622.546
    Direct efficiency 84% 91%
    Direct velocity 16.703 15.682
    Percentage direct scans 2% 2%
    Page writes by reclaim 79252 9704
    Page writes file 75611 9228
    Page writes anon 3641 476
    Page reclaim immediate 16764 11014
    Page rescued immediate 0 0
    Slabs scanned 2171904 2152448
    Direct inode steals 385 2261
    Kswapd inode steals 659137 609670
    Kswapd skipped wait 1 69
    THP fault alloc 546 631
    THP collapse alloc 361 339
    THP splits 259 263
    THP fault fallback 98 50
    THP collapse fail 20 17
    Compaction stalls 747 499
    Compaction success 244 145
    Compaction failures 503 354
    Compaction pages moved 370888 474837
    Compaction move failure 77378 65259

    ===END stress-highalloc

    This patch:

    Introduce MIGRATEPAGE_SUCCESS as the default return code for the
    address_space_operations.migratepage() method, and document the expected
    return codes for that method in failure cases.

    Signed-off-by: Rafael Aquini
    Cc: Rusty Russell
    Cc: "Michael S. Tsirkin"
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andi Kleen
    Cc: Konrad Rzeszutek Wilk
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
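
    A hedged sketch of the return convention being documented: the callback
    reports MIGRATEPAGE_SUCCESS (0) when the page has been moved and a
    negative errno otherwise (the callback name and the blocking test are
    hypothetical):

    static int example_migratepage(struct address_space *mapping,
                                   struct page *newpage, struct page *page,
                                   enum migrate_mode mode)
    {
            /* hypothetical: bail out rather than block in async mode */
            if (example_would_block(page) && mode == MIGRATE_ASYNC)
                    return -EAGAIN;

            /* migrate_page() itself returns MIGRATEPAGE_SUCCESS on success */
            return migrate_page(mapping, newpage, page, mode);
    }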
     
  • Several places need to find the pmd from an (mm_struct, address) pair,
    so introduce a function to simplify it.

    [akpm@linux-foundation.org: fix warning]
    Signed-off-by: Bob Liu
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Ni zhan Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
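
    A minimal sketch of such a helper (mainline names it mm_find_pmd(); the
    walk below follows the pre-five-level page-table layout of that era):

    static pmd_t *find_pmd(struct mm_struct *mm, unsigned long address)
    {
            pgd_t *pgd;
            pud_t *pud;
            pmd_t *pmd = NULL;

            pgd = pgd_offset(mm, address);
            if (!pgd_present(*pgd))
                    goto out;

            pud = pud_offset(pgd, address);
            if (!pud_present(*pud))
                    goto out;

            pmd = pmd_offset(pud, address);
            if (!pmd_present(*pmd))
                    pmd = NULL;          /* not mapped at the pmd level */
    out:
            return pmd;
    }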
     

11 Dec, 2012

13 commits

  • rmap_walk_anon() and try_to_unmap_anon() appear to be too
    careful about locking the anon vma: while they need protection
    against anon vma list modifications, they do not need exclusive
    access to the list itself.

    Transforming this exclusive lock into a read-locked rwsem removes
    a global lock from the hot path of page-migration-intensive
    threaded workloads, which can otherwise show pathological
    performance like this:

    96.43% process 0 [kernel.kallsyms] [k] perf_trace_sched_switch
    |
    --- perf_trace_sched_switch
    __schedule
    schedule
    schedule_preempt_disabled
    __mutex_lock_common.isra.6
    __mutex_lock_slowpath
    mutex_lock
    |
    |--50.61%-- rmap_walk
    | move_to_new_page
    | migrate_pages
    | migrate_misplaced_page
    | __do_numa_page.isra.69
    | handle_pte_fault
    | handle_mm_fault
    | __do_page_fault
    | do_page_fault
    | page_fault
    | __memset_sse2
    | |
    | --100.00%-- worker_thread
    | |
    | --100.00%-- start_thread
    |
    --49.39%-- page_lock_anon_vma
    try_to_unmap_anon
    try_to_unmap
    migrate_pages
    migrate_misplaced_page
    __do_numa_page.isra.69
    handle_pte_fault
    handle_mm_fault
    __do_page_fault
    do_page_fault
    page_fault
    __memset_sse2
    |
    --100.00%-- worker_thread
    start_thread

    With this change applied the profile is now nicely flat
    and there's no anon-vma related scheduling/blocking.

    Rename anon_vma_[un]lock() => anon_vma_[un]lock_write(),
    to make it clearer that it's an exclusive write-lock in
    that case - suggested by Rik van Riel.

    Suggested-by: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Paul Turner
    Cc: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Signed-off-by: Ingo Molnar
    Signed-off-by: Mel Gorman

    Ingo Molnar
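
    A hedged sketch of the locking change (fields and wrappers abridged
    from the mainline structures of that era):

    struct anon_vma {
            struct anon_vma *root;          /* root of this anon_vma tree */
            struct rw_semaphore rwsem;      /* was: struct mutex mutex */
            /* ... */
    };

    /* list modifications still take the lock exclusively ... */
    static inline void anon_vma_lock_write(struct anon_vma *anon_vma)
    {
            down_write(&anon_vma->root->rwsem);
    }

    /* ... but rmap walkers can now run in parallel */
    static inline void anon_vma_lock_read(struct anon_vma *anon_vma)
    {
            down_read(&anon_vma->root->rwsem);
    }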
     
  • If there is excessive migration due to NUMA balancing it gets rate
    limited. It does this by counting the number of pages it has migrated
    recently but counts a transhuge page as 1 page. Account for it properly.

    Signed-off-by: Mel Gorman

    Mel Gorman
     
  • Subject says it all. Allocation failures and a failure to isolate should
    be accounted as a migration failure. This is partially another
    difference between base page and transhuge page migration. A base page
    migration makes multiple attempts for these conditions before it would
    be accounted for as a failure.

    Signed-off-by: Mel Gorman

    Mel Gorman
     
  • Commit "Add THP migration for the NUMA working set scanning fault case"
    breaks the build because HPAGE_PMD_SHIFT and HPAGE_PMD_MASK defined to
    explode without CONFIG_TRANSPARENT_HUGEPAGE:

    mm/migrate.c: In function 'migrate_misplaced_transhuge_page_put':
    mm/migrate.c:1549: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed
    mm/migrate.c:1564: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed
    mm/migrate.c:1566: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed
    mm/migrate.c:1573: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed
    mm/migrate.c:1606: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed
    mm/migrate.c:1648: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed

    CONFIG_NUMA_BALANCING allows compilation without enabling transparent
    hugepages, so define the dummy function for such a configuration and only
    define migrate_misplaced_transhuge_page_put() when transparent hugepages
    are enabled.

    Signed-off-by: David Rientjes
    Signed-off-by: Mel Gorman

    Mel Gorman
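
    A hedged sketch of the config split described above (the prototype is
    approximate for that series; the stub's return value is illustrative):

    #ifdef CONFIG_TRANSPARENT_HUGEPAGE
    extern int migrate_misplaced_transhuge_page(struct mm_struct *mm,
                                                struct vm_area_struct *vma,
                                                pmd_t *pmd, pmd_t entry,
                                                unsigned long address,
                                                struct page *page, int node);
    #else
    static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
                                                       struct vm_area_struct *vma,
                                                       pmd_t *pmd, pmd_t entry,
                                                       unsigned long address,
                                                       struct page *page, int node)
    {
            return -EAGAIN; /* no THP support compiled in */
    }
    #endif /* CONFIG_TRANSPARENT_HUGEPAGE */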
     
  • Note: This is very heavily based on a patch from Peter Zijlstra with
    fixes from Ingo Molnar, Hugh Dickins and Johannes Weiner. That patch
    put a lot of migration logic into mm/huge_memory.c where it does
    not belong. This version tries to share some of the migration
    logic with migrate_misplaced_page. However, it should be noted
    that now migrate.c is doing more with the pagetable manipulation
    than is preferred. The end result is barely recognisable so as
    before, the signed-offs had to be removed but will be re-added if
    the original authors are ok with it.

    Add THP migration for the NUMA working set scanning fault case.

    It uses the page lock to serialize. No migration pte dance is
    necessary because the pte is already unmapped when we decide
    to migrate.

    [dhillf@gmail.com: Fix memory leak on isolation failure]
    [dhillf@gmail.com: Fix transfer of last_nid information]
    Signed-off-by: Mel Gorman

    Mel Gorman
     
  • Pass last_nid from misplaced page to newly allocated migration target page.

    Signed-off-by: Hillf Danton
    Signed-off-by: Mel Gorman

    Hillf Danton
     
  • If there are a large number of NUMA hinting faults and all of them
    are resulting in migrations it may indicate that memory is just
    bouncing uselessly around. NUMA balancing cost is likely exceeding
    any benefit from locality. Rate limit the PTE updates if the node
    is migration rate-limited. As noted in the comments, this distorts
    the NUMA faulting statistics.

    Signed-off-by: Mel Gorman

    Mel Gorman
     
  • NOTE: This is very heavily based on similar logic in autonuma. It should
    be signed off by Andrea, but because there was no standalone
    patch and it's sufficiently different from what he did, the
    signed-off is omitted. Will be added back if requested.

    If a large number of pages are misplaced then the memory bus can be
    saturated just migrating pages between nodes. This patch rate-limits
    the amount of memory that can be migrated between nodes.

    Signed-off-by: Mel Gorman

    Mel Gorman
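
    A hedged sketch of the rate limiting described above: a per-node window
    counter is reset each interval and migration is skipped once the window's
    budget is spent (field and variable names follow the series; the body is
    simplified):

    static bool numa_migrate_ratelimited(pg_data_t *pgdat, unsigned long nr_pages)
    {
            /* Start a new accounting window when the old one expires. */
            if (time_after(jiffies, pgdat->numabalancing_migrate_next_window)) {
                    pgdat->numabalancing_migrate_nr_pages = 0;
                    pgdat->numabalancing_migrate_next_window = jiffies +
                            msecs_to_jiffies(migrate_interval_millisecs);
            }

            /* Budget for this window already spent: skip the migration. */
            if (pgdat->numabalancing_migrate_nr_pages > ratelimit_pages)
                    return true;

            pgdat->numabalancing_migrate_nr_pages += nr_pages;
            return false;
    }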
     
  • It is tricky to quantify the basic cost of automatic NUMA placement in a
    meaningful manner. This patch adds some vmstats that can be used as part
    of a basic costing model.

    u = basic unit = sizeof(void *)
    Ca = cost of struct page access = sizeof(struct page) / u
    Cpte = Cost PTE access = Ca
    Cupdate = Cost PTE update = (2 * Cpte) + (2 * Wlock)
    where Cpte is incurred twice for a read and a write and Wlock
    is a constant representing the cost of taking or releasing a
    lock
    Cnumahint = Cost of a minor page fault = some high constant e.g. 1000
    Cpagerw = Cost to read or write a full page = Ca + PAGE_SIZE/u
    Ci = Cost of page isolation = Ca + Wi
    where Wi is a constant that should reflect the approximate cost
    of the locking operation
    Cpagecopy = Cpagerw + (Cpagerw * Wnuma) + Ci + (Ci * Wnuma)
    where Wnuma is the approximate NUMA factor. 1 is local. 1.2
    would imply that remote accesses are 20% more expensive

    Balancing cost = Cpte * numa_pte_updates +
    Cnumahint * numa_hint_faults +
    Ci * numa_pages_migrated +
    Cpagecopy * numa_pages_migrated

    Note that numa_pages_migrated is used as a measure of how many pages
    were isolated even though it would miss pages that failed to migrate. A
    vmstat counter could have been added for it but the isolation cost is
    pretty marginal in comparison to the overall cost so it seemed overkill.

    The ideal way to measure automatic placement benefit would be to count
    the number of remote accesses versus local accesses and do something like

    benefit = (remote_accesses_before - remote_accesses_after) * Wnuma

    but the information is not readily available. As a workload converges, the
    expectation would be that the number of remote numa hints would reduce to 0.

    convergence = numa_hint_faults_local / numa_hint_faults
    where this is measured for the last N number of
    numa hints recorded. When the workload is fully
    converged the value is 1.

    This can measure if the placement policy is converging and how fast it is
    doing it.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel

    Mel Gorman
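
    As a hedged worked example of the model, assume a 64-bit kernel with 4K
    pages (so u = 8 and sizeof(struct page) = 64), Wnuma = 1.2 and Ci = 10
    units:

    Ca        = 64 / 8                              = 8
    Cpte      = Ca                                  = 8
    Cpagerw   = Ca + PAGE_SIZE/u = 8 + 4096/8       = 520
    Cpagecopy = 520 + (520 * 1.2) + 10 + (10 * 1.2) = 1166

    so under these assumptions a single migrated page costs about 1166 units,
    comparable to the Cnumahint constant of 1000 and roughly two orders of
    magnitude more than the PTE update that produced the hint.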
     
  • If we have to avoid migrating to a node that is nearly full, put the
    page and return zero.

    Signed-off-by: Hillf Danton
    Signed-off-by: Mel Gorman

    Mel Gorman
     
  • Note: This was originally based on Peter's patch "mm/migrate: Introduce
    migrate_misplaced_page()" but borrows extremely heavily from Andrea's
    "autonuma: memory follows CPU algorithm and task/mm_autonuma stats
    collection". The end result is barely recognisable so signed-offs
    had to be dropped. If original authors are ok with it, I'll
    re-add the signed-off-bys.

    Add migrate_misplaced_page() which deals with migrating pages from
    faults.

    Based-on-work-by: Lee Schermerhorn
    Based-on-work-by: Peter Zijlstra
    Based-on-work-by: Andrea Arcangeli
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel

    Peter Zijlstra
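
    A heavily simplified, hedged sketch of the fault-path helper: isolate the
    misplaced page and hand it to migrate_pages() with a node-local allocation
    callback (reference counting, rate limiting, and the exact migrate_pages()
    argument list for that kernel are omitted or approximated):

    int migrate_misplaced_page(struct page *page, int node)
    {
            LIST_HEAD(migratepages);
            int nr_remaining;

            if (isolate_lru_page(page))
                    return 0;               /* already gone or pinned; give up */

            list_add(&page->lru, &migratepages);
            nr_remaining = migrate_pages(&migratepages,
                                         alloc_misplaced_dst_page, node,
                                         false, MIGRATE_ASYNC,
                                         MR_NUMA_MISPLACED);
            if (nr_remaining)
                    putback_lru_pages(&migratepages);

            return !nr_remaining;           /* 1 if the page was migrated */
    }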
     
  • The pgmigrate_success and pgmigrate_fail vmstat counters tell the user
    about migration activity but not the type or the reason. This patch adds
    a tracepoint to identify the type of page migration and why the page is
    being migrated.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel

    Mel Gorman
     
  • The compact_pages_moved and compact_pagemigrate_failed events are
    convenient for determining if compaction is active and to what
    degree migration is succeeding, but they are at the wrong level. Other
    users of migration may also want to know if migration is working
    properly and this will be particularly true for any automated
    NUMA migration. This patch moves the counters down to migration
    with the new events called pgmigrate_success and pgmigrate_fail.
    The compact_blocks_moved counter is removed because while it was
    useful for debugging initially, it's worthless now as no meaningful
    conclusions can be drawn from its value.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel

    Mel Gorman
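
    A hedged illustration of the new accounting point: once migrate_pages()
    knows how many pages succeeded and failed, it bumps the new events and
    (per the previous entry) emits the tracepoint; the counts are whatever
    the caller tallied:

    count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
    count_vm_events(PGMIGRATE_FAIL, nr_failed);
    trace_mm_migrate_pages(nr_succeeded, nr_failed, mode, reason);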
     

01 Aug, 2012

3 commits

  • Compaction (and page migration in general) can currently be hindered
    by pages being owned by memory cgroups that are at their limits and
    unreclaimable.

    The reason is that the replacement page is being charged against the limit
    while the page being replaced is also still charged. But this seems
    unnecessary, given that only one of the two pages will still be in use
    after migration finishes.

    This patch changes the memcg migration sequence so that the replacement
    page is not charged. Whatever page is still in use after successful or
    failed migration gets to keep the charge of the page that was going to be
    replaced.

    The replacement page will still show up temporarily in the rss/cache
    statistics, this can be fixed in a later patch as it's less urgent.

    Reported-by: David Rientjes
    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • With HugeTLB pages, the hugetlb cgroup is uncharged in the compound page
    destructor. Since we are holding a hugepage reference, we can be sure
    that the old page won't get uncharged until the last put_page().

    Signed-off-by: Aneesh Kumar K.V
    Cc: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hillf Danton
    Cc: Michal Hocko
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • Since we migrate only one hugepage, don't use a linked list for passing
    the page around. Directly pass the page that needs to be migrated as an
    argument. This also removes the usage of page->lru in the migrate path.

    Signed-off-by: Aneesh Kumar K.V
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Cc: Hillf Danton
    Reviewed-by: Michal Hocko
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
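
    A hedged sketch of the resulting interface change (prototypes approximate
    for that kernel): the list-based entry point gives way to one that takes
    the hugepage directly, so page->lru is no longer touched on this path.

    /* before: callers built a one-element list */
    int migrate_huge_pages(struct list_head *from, new_page_t get_new_page,
                           unsigned long private, bool offlining,
                           enum migrate_mode mode);

    /* after: pass the single page itself */
    int migrate_huge_page(struct page *hpage, new_page_t get_new_page,
                          unsigned long private, bool offlining,
                          enum migrate_mode mode);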
     

04 Jun, 2012

1 commit

  • New tmpfs use of !PageUptodate pages for fallocate() is triggering the
    WARNING: at mm/page-writeback.c:1990 when __set_page_dirty_nobuffers()
    is called from migrate_page_copy() for compaction.

    It is anomalous that migration should use __set_page_dirty_nobuffers()
    on an address_space that does not participate in dirty and writeback
    accounting; and this has also been observed to insert surprising dirty
    tags into a tmpfs radix_tree, despite tmpfs not using tags at all.

    We should probably give migrate_page_copy() a better way to preserve the
    tag and migrate accounting info, when mapping_cap_account_dirty(). But
    that needs some more work: so in the interim, avoid the warning by using
    a simple SetPageDirty on PageSwapBacked pages.

    Reported-and-tested-by: Dave Jones
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
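
    A hedged sketch of the interim fix inside migrate_page_copy(): swap-backed
    (tmpfs/shmem) pages get a plain SetPageDirty, while pages of filesystems
    that do dirty accounting keep the existing path:

    if (PageDirty(page)) {
            if (PageSwapBacked(page))
                    SetPageDirty(newpage);   /* no dirty accounting or radix tags */
            else
                    __set_page_dirty_nobuffers(newpage);
    }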
     

24 May, 2012

1 commit

  • Pull user namespace enhancements from Eric Biederman:
    "This is a course correction for the user namespace, so that we can
    reach an inexpensive, maintainable, and reasonably complete
    implementation.

    Highlights:
    - Config guards make it impossible to enable the user namespace and
    code that has not been converted to be user namespace safe.

    - Use of the new kuid_t type ensures that if you somehow get past the
    config guards, the kernel will encounter type errors if you enable
    user namespaces and attempt to compile in code whose permission
    checks have not been updated to be user namespace safe.

    - All uids from child user namespaces are mapped into the initial
    user namespace before they are processed, removing the need to add
    an additional check to see if the user namespace of the compared
    uids remains the same.

    - With the user namespaces compiled out the performance is as good or
    better than it is today.

    - For most operations absolutely nothing changes performance or
    operationally with the user namespace enabled.

    - The worst case performance I could come up with was timing 1
    billion cache cold stat operations with the user namespace code
    enabled. This went from 156s to 164s on my laptop (or 156ns to
    164ns per stat operation).

    - (uid_t)-1 and (gid_t)-1 are reserved as internal error values.
    Most uid/gid setting system calls treat these values specially
    anyway, so attempting to use -1 as a uid would likely cause
    entertaining failures in userspace.

    - If setuid is called with a uid that cannot be mapped, setuid fails.
    I have looked at sendmail, login, ssh and every other program I
    could think of that would call setuid and they all check for and
    handle the case where setuid fails.

    - If stat or a similar system call is called from a context in which
    we cannot map a uid, we lie and return overflowuid. The LFS
    experience suggests not lying and returning an error code might be
    better, but the historical precedent with uids is different, and I
    cannot think of anything that would break by lying about a uid we
    can't map.

    - Capabilities are localized to the current user namespace making it
    safe to give the initial user in a user namespace all capabilities.

    My git tree covers all of the modifications needed to convert the core
    kernel and enough changes to make a system bootable to runlevel 1."

    Fix up trivial conflicts due to nearby independent changes in fs/stat.c

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits)
    userns: Silence silly gcc warning.
    cred: use correct cred accessor with regards to rcu read lock
    userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq
    userns: Convert cgroup permission checks to use uid_eq
    userns: Convert tmpfs to use kuid and kgid where appropriate
    userns: Convert sysfs to use kgid/kuid where appropriate
    userns: Convert sysctl permission checks to use kuid and kgids.
    userns: Convert proc to use kuid/kgid where appropriate
    userns: Convert ext4 to user kuid/kgid where appropriate
    userns: Convert ext3 to use kuid/kgid where appropriate
    userns: Convert ext2 to use kuid/kgid where appropriate.
    userns: Convert devpts to use kuid/kgid where appropriate
    userns: Convert binary formats to use kuid/kgid where appropriate
    userns: Add negative depends on entries to avoid building code that is userns unsafe
    userns: signal remove unnecessary map_cred_ns
    userns: Teach inode_capable to understand inodes whose uids map to other namespaces.
    userns: Fail exec for suid and sgid binaries with ids outside our user namespace.
    userns: Convert stat to return values mapped from kuids and kgids
    userns: Convert user specfied uids and gids in chown into kuids and kgid
    userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs
    ...

    Linus Torvalds
     

26 Apr, 2012

1 commit

  • Commit 3268c63 ("mm: fix move/migrate_pages() race on task struct")
    added an odd construct where 'mm' is checked for being NULL, and if it
    is, it would still get dereferenced anyway by mmput()ing it.

    Signed-off-by: Sasha Levin
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
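
    A hedged sketch of the shape of the fix: bail out before ever reaching
    mmput() when the task has no mm (surrounding code abridged):

    mm = get_task_mm(task);
    put_task_struct(task);

    if (!mm)
            return -EINVAL;          /* previously fell through to mmput(mm) */

    err = do_migrate_pages(mm, &old_nodes, &new_nodes,
                           capable(CAP_SYS_NICE) ? MPOL_MF_MOVE_ALL
                                                 : MPOL_MF_MOVE);
    mmput(mm);
    return err;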
     

22 Mar, 2012

1 commit

  • Migration functions perform the rcu_read_unlock too early. As a result
    the task pointed to may change from under us. This can result in an oops,
    as reported by Dave Hansen in https://lkml.org/lkml/2012/2/23/302.

    The following patch extends the period of the rcu_read_lock until after the
    permissions checks are done. We also take a refcount so that the task
    reference is stable when calling security check functions and performing
    cpuset node validation (which takes a mutex).

    The refcount is dropped before actual page migration occurs so there is no
    change to the refcounts held during page migration.

    Also move the determination of the mm of the task struct to immediately
    before the do_migrate*() calls so that it is clear that we switch from
    handling the task during permission checks to the mm for the actual
    migration. Since the determination is only done once and we then no
    longer use the task_struct we can be sure that we operate on a specific
    address space that will not change from under us.

    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Christoph Lameter
    Cc: "Eric W. Biederman"
    Reported-by: Dave Hansen
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
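
    A hedged sketch of the resulting pattern in the migrate/move_pages system
    calls (checks abridged): the task is looked up and pinned under
    rcu_read_lock(), the lock is held across the permission checks, and the
    mm is taken only afterwards:

    rcu_read_lock();
    task = pid ? find_task_by_vpid(pid) : current;
    if (!task) {
            rcu_read_unlock();
            return -ESRCH;
    }
    get_task_struct(task);          /* keep the task stable for the checks below */

    /* ... credential and security_task_movememory() checks ... */

    rcu_read_unlock();

    mm = get_task_mm(task);         /* switch from the task to its mm */
    put_task_struct(task);
    if (!mm)
            return -EINVAL;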
     

06 Mar, 2012

1 commit

  • When moving tasks from an old memcg (with move_charge_at_immigrate on the
    new memcg), followed by removal of the old memcg, we hit a General
    Protection Fault in mem_cgroup_lru_del_list() (called from release_pages,
    called from free_pages_and_swap_cache, from tlb_flush_mmu, from
    tlb_finish_mmu, from exit_mmap, from mmput, from exit_mm, from do_exit).

    Somewhat reproducible, takes a few hours: the old struct mem_cgroup has
    been freed and poisoned by SLAB_DEBUG, but mem_cgroup_lru_del_list() is
    still trying to update its stats, and take page off lru before freeing.

    A task, or a charge, or a page on lru: each secures a memcg against
    removal. In this case, the last task has been moved out of the old memcg,
    and it is exiting: anonymous pages are uncharged one by one from the
    memcg, as they are zapped from its pagetables, so the charge gets down to
    0; but the pages themselves are queued in an mmu_gather for freeing.

    Most of those pages will be on lru (and force_empty is careful to
    lru_add_drain_all, to add pages from pagevec to lru first), but not
    necessarily all: perhaps some have been isolated for page reclaim, perhaps
    some isolated for other reasons. So, force_empty may find no task, no
    charge and no page on lru, and let the removal proceed.

    There would still be no problem if these pages were immediately freed; but
    typically (and the put_page_testzero protocol demands it) they have to be
    added back to lru before they are found freeable, then removed from lru
    and freed. We don't see the issue when adding, because the
    mem_cgroup_iter() loops keep their own reference to the memcg being
    scanned; but when it comes to mem_cgroup_lru_del_list(), nothing is
    left to secure the memcg against removal.

    I believe this was not an issue in v3.2: there, PageCgroupAcctLRU and
    PageCgroupUsed flags were used (like a trick with mirrors) to deflect view
    of pc->mem_cgroup to the stable root_mem_cgroup when neither set.
    38c5d72f3ebe ("memcg: simplify LRU handling by new rule") mercifully
    removed those convolutions, but left this General Protection Fault.

    But it's surprisingly easy to restore the old behaviour: just check
    PageCgroupUsed in mem_cgroup_lru_add_list() (which decides on which lruvec
    to add), and reset pc to root_mem_cgroup if page is uncharged. A risky
    change? just going back to how it worked before; testing, and an audit of
    uses of pc->mem_cgroup, show no problem.

    And there's a nice bonus: with mem_cgroup_lru_add_list() itself making
    sure that an uncharged page goes to root lru, mem_cgroup_reset_owner() no
    longer has any purpose, and we can safely revert 4e5f01c2b9b9 ("memcg:
    clear pc->mem_cgroup if necessary").

    Calling update_page_reclaim_stat() after add_page_to_lru_list() in swap.c
    is not strictly necessary: the lru_lock there, with RCU before memcg
    structures are freed, makes mem_cgroup_get_reclaim_stat_from_page safe
    without that; but it seems cleaner to rely on one dependency less.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
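
    A hedged sketch of the restored behaviour inside mem_cgroup_lru_add_list():
    an uncharged page is simply accounted to the root memcg's lruvec, so a
    later lru del never follows a pointer into a freed mem_cgroup:

    pc = lookup_page_cgroup(page);
    if (!PageCgroupUsed(pc))
            pc->mem_cgroup = root_mem_cgroup;   /* uncharged: park on root lru */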
     

04 Feb, 2012

1 commit

  • Postpone resetting page->mapping until the final remove_migration_ptes().
    Otherwise the expression PageAnon(migration_entry_to_page(entry)) does not
    work.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

13 Jan, 2012

3 commits

  • This patch adds a lightweight sync migrate mode, MIGRATE_SYNC_LIGHT,
    that avoids writing back pages to backing storage. Async compaction
    maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
    For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
    used.

    This avoids sync compaction stalling for an excessive length of time,
    particularly when copying files to a USB stick where there might be a
    large number of dirty pages backed by a filesystem that does not support
    ->writepages.

    [aarcange@redhat.com: This patch is heavily based on Andrea's work]
    [akpm@linux-foundation.org: fix fs/nfs/write.c build]
    [akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
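
    The three modes described above end up as a small enum (comments
    paraphrase the semantics):

    enum migrate_mode {
            MIGRATE_ASYNC,          /* never block */
            MIGRATE_SYNC_LIGHT,     /* may block, but do not write back dirty pages */
            MIGRATE_SYNC,           /* may block and write back (hotplug etc.) */
    };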
     
  • Asynchronous compaction is used when allocating transparent hugepages to
    avoid blocking for long periods of time. Due to reports of stalling,
    there was a debate on disabling synchronous compaction but this severely
    impacted allocation success rates. Part of the reason was that many dirty
    pages are skipped in asynchronous compaction by the following check:

    if (PageDirty(page) && !sync &&
        mapping->a_ops->migratepage != migrate_page)
            rc = -EBUSY;

    This skips over all mapping aops using buffer_migrate_page() even though
    it is possible to migrate some of these pages without blocking. This
    patch updates the ->migratepage callback with a "sync" parameter. It is
    the responsibility of the callback to fail gracefully if migration would
    block.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
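
    A hedged sketch of a callback honouring the new parameter (the filesystem
    name and the blocking condition are hypothetical; the mode type follows
    the companion MIGRATE_SYNC_LIGHT patch above):

    static int examplefs_migratepage(struct address_space *mapping,
                                     struct page *newpage, struct page *page,
                                     enum migrate_mode mode)
    {
            /* would have to wait on writeback/buffers: fail gracefully instead */
            if (PageDirty(page) && mode == MIGRATE_ASYNC)
                    return -EBUSY;

            return buffer_migrate_page(mapping, newpage, page, mode);
    }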
     
  • This is a preparation before removing a flag PCG_ACCT_LRU in page_cgroup
    and reducing atomic ops/complexity in memcg LRU handling.

    In some cases, pages are added to the lru before being charged to a
    memcg, and pages are not classified into a memory cgroup at lru
    addition. Currently, the lru where the page should be added is
    determined by a bit in page_cgroup->flags and by pc->mem_cgroup. I'd
    like to remove the check of the flag.

    To handle the case where pc->mem_cgroup may contain stale pointers (if
    pages are added to the LRU before classification), this patch resets
    pc->mem_cgroup to root_mem_cgroup before lru additions.

    [akpm@linux-foundation.org: fix CONFIG_CGROUP_MEM_CONT=n build]
    [hughd@google.com: fix CONFIG_CGROUP_MEM_RES_CTLR=y CONFIG_CGROUP_MEM_RES_CTLR_SWAP=n build]
    [akpm@linux-foundation.org: ksm.c needs memcontrol.h, per Michal]
    [hughd@google.com: stop oops in mem_cgroup_reset_owner()]
    [hughd@google.com: fix page migration to reset_owner]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Miklos Szeredi
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Ying Han
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

07 Nov, 2011

1 commit

  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of
    net: inet_timewait_sock doesnt need
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     

01 Nov, 2011

1 commit

  • unmap_and_move() is one big, messy function. Clean it up.

    Signed-off-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

31 Oct, 2011

1 commit