19 Apr, 2013

1 commit

  • As mentioned by Ingo, the SCHED_FEAT_OWNER_SPIN scheduler
    feature bit was really just an early hack to make testing with
    and without mutex spinning possible, so it is no longer
    necessary.

    This patch removes the SCHED_FEAT_OWNER_SPIN feature bit and
    moves the mutex spinning code from kernel/sched/core.c back to
    kernel/mutex.c, which is where it belongs (see the sketch at the
    end of this entry).

    Signed-off-by: Waiman Long
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Chandramouleeswaran Aswin
    Cc: Davidlohr Bueso
    Cc: Norton Scott J
    Cc: Rik van Riel
    Cc: Paul E. McKenney
    Cc: David Howells
    Cc: Dave Jones
    Cc: Clark Williams
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1366226594-5506-2-git-send-email-Waiman.Long@hp.com
    Signed-off-by: Ingo Molnar

    Waiman Long
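    For context, the optimistic spinning that this feature bit used to
    gate works roughly like the sketch below. This is a minimal,
    self-contained illustration, not the kernel implementation; the
    struct layout and the task_is_on_cpu() helper are assumptions made
    for this example.

    /* Minimal sketch of mutex owner-spinning (illustrative only).
     * Idea: while the lock owner is still running on another CPU, spin
     * briefly instead of sleeping, because the owner is likely to
     * release the lock soon.
     */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <sched.h>

    struct task { int on_cpu; };            /* 1 while running on a CPU */

    struct simple_mutex {
        _Atomic(struct task *) owner;       /* NULL when unlocked */
    };

    /* Assumed stand-in for the real "is the owner still running?" check. */
    static bool task_is_on_cpu(const struct task *t)
    {
        return t->on_cpu != 0;
    }

    /* Returns true if the lock was released while we were spinning. */
    static bool spin_on_owner(struct simple_mutex *m, struct task *owner)
    {
        while (atomic_load(&m->owner) == owner) {
            if (!task_is_on_cpu(owner))
                return false;               /* owner went to sleep: stop spinning */
            sched_yield();                  /* stand-in for cpu_relax() */
        }
        return true;                        /* owner changed or lock was released */
    }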
     

17 Dec, 2012

1 commit

  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent comparisons available from different people are:

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for the workloads driven by perf, which is bad
    but is potentially explained by the lack of scheduler smarts. Thomas'
    results show balancenuma improves on mainline but falls far short of
    numacore or autonuma. Srikar's results indicate we all suffer on a
    large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers, which is not reported by
    the tool by default and sometimes missed in reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore batch
    handles PTEs but I no longer think this is the case. It's possible
    numacore is just able to trigger it due to higher rates of migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

11 Dec, 2012

3 commits

  • Because migrations are driven by the CPU a task is running on, there
    is no point tracking NUMA faults until a task runs on a new node.
    This patch tracks the first node used by an address space. Until it
    changes, PTE scanning is disabled and no NUMA hinting faults are
    trapped. This should help workloads that are short-lived, do not care
    about NUMA placement, or have bound themselves to a single node.

    This takes advantage of the logic in "mm: sched: numa: Implement slow
    start for working set sampling" to delay when the checks are made. This
    will take advantage of processes that set their CPU and node bindings
    early in their lifetime. It will also potentially allow any initial load
    balancing to take place (see the sketch at the end of this entry).

    Signed-off-by: Mel Gorman

    Mel Gorman
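    A minimal sketch of the first-node check described above, with the
    field and constant names assumed for illustration rather than taken
    from the patch itself:

    /* Illustrative sketch: skip NUMA PTE scanning until the address
     * space has been seen running on more than one node.
     */
    #define NUMA_PTE_SCAN_INIT   (-1)   /* no node recorded yet */
    #define NUMA_PTE_SCAN_ACTIVE (-2)   /* second node seen: scanning allowed */

    struct mm_numa_state {
        int first_nid;                  /* starts at NUMA_PTE_SCAN_INIT */
    };

    /* Called periodically for the current task's address space. */
    static int numa_scan_allowed(struct mm_numa_state *mm, int this_nid)
    {
        if (mm->first_nid == NUMA_PTE_SCAN_INIT)
            mm->first_nid = this_nid;   /* remember the first node used */

        if (mm->first_nid != NUMA_PTE_SCAN_ACTIVE && mm->first_nid != this_nid)
            mm->first_nid = NUMA_PTE_SCAN_ACTIVE;   /* the mm moved nodes */

        /* Scanning (and hence hinting faults) only once a new node is seen. */
        return mm->first_nid == NUMA_PTE_SCAN_ACTIVE;
    }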
     
  • This patch adds Kconfig options and kernel parameters to allow
    automatic NUMA balancing to be enabled and disabled. The existence
    of such a switch was and is very important when debugging problems
    related to transparent hugepages, and we should have the same for
    automatic NUMA placement (an illustrative example follows at the
    end of this entry).

    Signed-off-by: Mel Gorman

    Mel Gorman
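    As an illustration only, such a switch typically combines a
    compile-time default with a boot-time parameter handler along these
    lines; the names below are assumptions, not necessarily those used
    by the patch:

    /* Illustrative boot-time on/off switch for NUMA balancing, in the
     * style of a kernel command-line handler (simplified: no
     * early_param() machinery, plain C only).
     */
    #include <stdbool.h>
    #include <string.h>

    static bool numa_balancing_enabled = true;   /* compile-time default */

    /* Parse "numa_balancing=enable" / "numa_balancing=disable". */
    static int setup_numabalancing(const char *str)
    {
        if (str == NULL)
            return 0;
        if (strcmp(str, "enable") == 0)
            numa_balancing_enabled = true;
        else if (strcmp(str, "disable") == 0)
            numa_balancing_enabled = false;
        else
            return 0;                            /* unrecognized value */
        return 1;
    }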
     
  • NOTE: This patch is based on "sched, numa, mm: Add fault driven
    placement and migration policy" but as it throws away all the policy
    to just leave a basic foundation I had to drop the signed-offs-by.

    This patch creates a bare-bones method for setting PTEs pte_numa in
    scheduler context so that, when they are faulted later, they will be
    faulted onto the node the CPU is running on. In itself this does
    nothing useful, but any placement policy will fundamentally depend on
    receiving placement hints from fault context and doing something
    intelligent about them (a conceptual sketch follows at the end of
    this entry).

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel

    Peter Zijlstra
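    Conceptually the mechanism has two halves: a scanner marks PTEs so
    that the next access traps, and the fault handler records which node
    the access came from. The sketch below is illustrative only; the
    types, flag, and function names are assumptions for this example:

    /* Conceptual sketch of NUMA hinting faults (illustrative only). */
    struct fake_pte { unsigned long flags; };
    #define PTE_NUMA 0x1UL

    static unsigned long numa_hinting_faults;   /* hint counter for the sketch */

    /* Scanner side: mark the PTE so the next access takes a fault. */
    static void mark_pte_numa(struct fake_pte *pte)
    {
        pte->flags |= PTE_NUMA;
    }

    /* Fault side: restore access and record the placement hint.
     * A real policy would decide here whether to migrate the page
     * toward cpu_node; this bare-bones version only gathers hints. */
    static void handle_numa_hinting_fault(struct fake_pte *pte,
                                          int cpu_node, int page_node)
    {
        pte->flags &= ~PTE_NUMA;
        if (page_node != cpu_node)
            numa_hinting_faults++;
    }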
     

13 Sep, 2012

1 commit

  • Heterogeneous ARM platforms use the arch_scale_freq_power function
    to reflect the relative capacity of each core (see the sketch at the
    end of this entry).

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1341826026-6504-6-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
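    A hedged sketch of what such an override can look like: the hook
    returns a per-CPU capacity so the scheduler can weight load on
    big.LITTLE-style systems. SCHED_POWER_SCALE is the conventional
    full-capacity value; the table contents here are made up for the
    example:

    #define SCHED_POWER_SCALE 1024UL    /* capacity of a "full" CPU */

    /* Example capacities: two big cores, two little cores at ~half. */
    static const unsigned long cpu_capacity[4] = {
        SCHED_POWER_SCALE, SCHED_POWER_SCALE,
        SCHED_POWER_SCALE / 2, SCHED_POWER_SCALE / 2,
    };

    /* Illustrative arch hook: relative capacity of a given CPU. */
    unsigned long arch_scale_freq_power_example(int cpu)
    {
        return cpu_capacity[cpu];
    }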
     

04 Sep, 2012

1 commit

  • Commit beac4c7e4a1c ("sched: Remove AFFINE_WAKEUPS feature") removed
    use of the flag but left the definition. Get rid of it.

    Signed-off-by: Namhyung Kim
    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Link: http://lkml.kernel.org/r/1345090865-20851-1-git-send-email-namhyung@kernel.org
    Signed-off-by: Ingo Molnar

    Namhyung Kim
     

26 Apr, 2012

1 commit

  • Commits 367456c756a6 ("sched: Ditch per cgroup task lists for
    load-balancing") and 5d6523ebd ("sched: Fix load-balance wreckage")
    left some more wreckage.

    By setting loop_max unconditionally to ->nr_running, load-balancing
    could take a lot of time on very long runqueues (hackbench!). So keep
    the sysctl as the maximum limit on the number of tasks we'll iterate
    (see the sketch at the end of this entry).

    Furthermore, the min load filter for migration completely fails with
    cgroups since inequality in per-cpu state can easily lead to such
    small loads :/

    Furthermore, the change to add new tasks to the tail of the queue
    instead of the head seems to have some effect... not quite sure I
    understand why.

    Combined, these fixes solve the huge hackbench regression reported by
    Tim when hackbench is run in a cgroup.

    Reported-by: Tim Chen
    Acked-by: Tim Chen
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/1335365763.28150.267.camel@twins
    [ got rid of the CONFIG_PREEMPT tuning and made small readability edits ]
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
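    In sketch form, the loop_max part of the fix amounts to clamping the
    per-pass iteration count by the existing sysctl instead of using the
    raw runqueue length; the structure and field names below are
    simplified for illustration:

    /* Sketch of the loop_max clamp (illustrative only). */
    static unsigned int sysctl_sched_nr_migrate = 32;   /* existing limit */

    struct lb_env_sketch {
        unsigned int loop;          /* tasks examined so far this pass */
        unsigned int loop_max;      /* upper bound for this pass */
    };

    static void set_loop_max(struct lb_env_sketch *env,
                             unsigned int busiest_nr_running)
    {
        /* Before the fix loop_max was busiest_nr_running itself, which
         * made balancing very long runqueues (hackbench) expensive. */
        env->loop_max = busiest_nr_running < sysctl_sched_nr_migrate
                      ? busiest_nr_running
                      : sysctl_sched_nr_migrate;
    }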
     

07 Dec, 2011

1 commit

  • Now that we initialize jump_labels before sched_init(), we can use
    them for the debug features without having to worry about a window
    where they have the wrong setting (see the sketch at the end of this
    entry).

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-vpreo4hal9e0kzqmg5y0io2k@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
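    In sketch form, each debug feature gets a key that is flipped when
    the feature bit is toggled, so the fast-path test can be patched
    into a plain branch instead of reading a bitmask. The simplified
    structure below only illustrates the idea and does no code patching;
    the names are assumptions for this example:

    /* Illustrative stand-in for a jump-label gated sched_feat() test. */
    struct static_key_sketch { int enabled; };

    static struct static_key_sketch sched_feat_example_key;

    /* Fast path: with real jump labels this becomes a patched no-op or
     * jump rather than a load and compare. */
    static inline int sched_feat_example(void)
    {
        return sched_feat_example_key.enabled;
    }

    /* Slow path, called when the feature bit is toggled via debugfs. */
    static void sched_feat_example_set(int on)
    {
        sched_feat_example_key.enabled = on;   /* real code would flip a static key */
    }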
     

17 Nov, 2011

1 commit