24 May, 2012

2 commits

  • Pull user namespace enhancements from Eric Biederman:
    "This is a course correction for the user namespace, so that we can
    reach an inexpensive, maintainable, and reasonably complete
    implementation.

    Highlights:
    - Config guards make it impossible to enable the user namespace
    together with code that has not been converted to be user namespace
    safe.

    - Use of the new kuid_t type ensures that if you somehow get past the
    config guards, the kernel will encounter type errors if you enable
    user namespaces and attempt to compile in code whose permission
    checks have not been updated to be user namespace safe.

    - All uids from child user namespaces are mapped into the initial
    user namespace before they are processed, removing the need for an
    additional check that the compared uids come from the same user
    namespace.

    - With user namespaces compiled out, performance is as good as or
    better than it is today.

    - For most operations, absolutely nothing changes, either in
    performance or operationally, with user namespaces enabled.

    - The worst-case performance hit I could come up with was timing 1
    billion cache-cold stat operations with the user namespace code
    enabled. This went from 156s to 164s on my laptop (i.e. from 156ns
    to 164ns per stat operation).

    - (uid_t)-1 and (gid_t)-1 are reserved as internal error values.
    Most uid/gid-setting system calls treat these values specially
    anyway, so attempting to use -1 as a uid would likely cause
    entertaining failures in userspace.

    - If setuid is called with a uid that cannot be mapped, setuid fails.
    I have looked at sendmail, login, ssh and every other program I
    could think of that would call setuid, and they all check for and
    handle the case where setuid fails.

    - If stat or a similar system call is called from a context in which
    we cannot map a uid, we lie and return overflowuid. The LFS
    experience suggests that not lying and returning an error code might
    be better, but the historical precedent with uids is different and I
    cannot think of anything that would break if we lie about a uid we
    can't map.

    - Capabilities are localized to the current user namespace, making it
    safe to give the initial user in a user namespace all capabilities.

    My git tree covers all of the modifications needed to convert the core
    kernel and enough changes to make a system bootable to runlevel 1."

    Fix up trivial conflicts due to nearby independent changes in fs/stat.c
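
    For reference, the kuid_t scheme described above boils down to
    roughly the following (a simplified sketch after this era's
    include/linux/uidgid.h, not the verbatim header):

        /*
         * kuid_t is a distinct struct type, so comparing it against a
         * raw uid_t without an explicit conversion is a compile-time
         * type error.
         */
        typedef struct {
                uid_t val;
        } kuid_t;

        static inline bool uid_eq(kuid_t left, kuid_t right)
        {
                return left.val == right.val;
        }

        /* Map a uid seen in user namespace 'ns' to a kernel kuid_t;
         * yields INVALID_UID if no mapping exists. */
        extern kuid_t make_kuid(struct user_namespace *ns, uid_t uid);

        /* Map back for reporting to userspace; the _munged variant
         * returns overflowuid instead of failing, which is the stat()
         * behaviour described above. */
        extern uid_t from_kuid(struct user_namespace *ns, kuid_t kuid);
        extern uid_t from_kuid_munged(struct user_namespace *ns,
                                      kuid_t kuid);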

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits)
    userns: Silence silly gcc warning.
    cred: use correct cred accessor with regards to rcu read lock
    userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq
    userns: Convert cgroup permission checks to use uid_eq
    userns: Convert tmpfs to use kuid and kgid where appropriate
    userns: Convert sysfs to use kgid/kuid where appropriate
    userns: Convert sysctl permission checks to use kuid and kgids.
    userns: Convert proc to use kuid/kgid where appropriate
    userns: Convert ext4 to user kuid/kgid where appropriate
    userns: Convert ext3 to use kuid/kgid where appropriate
    userns: Convert ext2 to use kuid/kgid where appropriate.
    userns: Convert devpts to use kuid/kgid where appropriate
    userns: Convert binary formats to use kuid/kgid where appropriate
    userns: Add negative depends on entries to avoid building code that is userns unsafe
    userns: signal remove unnecessary map_cred_ns
    userns: Teach inode_capable to understand inodes whose uids map to other namespaces.
    userns: Fail exec for suid and sgid binaries with ids outside our user namespace.
    userns: Convert stat to return values mapped from kuids and kgids
    userns: Convert user specfied uids and gids in chown into kuids and kgid
    userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs
    ...

    Linus Torvalds
     
  • …rnel.org/pub/scm/linux/kernel/git/tip/tip

    Pull perf fixes from Ingo Molnar:

    - Leftover AMD PMU driver fix from the end of the v3.4
    stabilization cycle.

    - Late tools/perf/ changes that missed the first round:
    * endianness fixes
    * event parsing improvements
    * libtraceevent fixes factored out from trace-cmd
    * perl scripting engine fixes related to libtraceevent
    * testcase improvements
    * perf inject / pipe mode fixes
    * plus a kernel side fix

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/x86: Update event scheduling constraints for AMD family 15h models

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    Revert "sched, perf: Use a single callback into the scheduler"
    perf evlist: Show event attribute details
    perf tools: Bump default sample freq to 4 kHz
    perf buildid-list: Work better with pipe mode
    perf tools: Fix piped mode read code
    perf inject: Fix broken perf inject -b
    perf tools: rename HEADER_TRACE_INFO to HEADER_TRACING_DATA
    perf tools: Add union u64_swap type for swapping u64 data
    perf tools: Carry perf_event_attr bitfield throught different endians
    perf record: Fix documentation for branch stack sampling
    perf target: Add cpu flag to sample_type if target has cpu
    perf tools: Always try to build libtraceevent
    perf tools: Rename libparsevent to libtraceevent in Makefile
    perf script: Rename struct event to struct event_format in perl engine
    perf script: Explicitly handle known default print arg type
    perf tools: Add hardcoded name term for pmu events
    perf tools: Separate 'mem:' event scanner bits
    perf tools: Use allocated list for each parsed event
    perf tools: Add support for displaying event parser debug info
    perf test: Move parse event automated tests to separated object

    Linus Torvalds
     

23 May, 2012

4 commits

  • This reverts commit cb04ff9ac424 ("sched, perf: Use a single
    callback into the scheduler").

    Before this change was introduced, the process switch worked
    like this (with respect to perf event scheduling):

    schedule (prev, next)
    - schedule out all perf events for prev
    - switch to next
    - schedule in all perf events for current (next)

    After the commit, the process switch looks like:

    schedule (prev, next)
    - schedule out all perf events for prev
    - schedule in all perf events for (next)
    - switch to next

    The problem is that after we schedule perf events in, the PMU
    is enabled and we can receive events even before we make the
    switch to next - so "current" is still the prev process (and
    event SAMPLE data are filled in based on the value of
    "current").

    That's exactly what we see in the test__PERF_RECORD test: we
    receive SAMPLEs with the PID of the process that our tracee is
    scheduled from.
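
    In kernel/sched terms, the two-hook ordering this revert restores
    looks roughly like this (a simplified sketch after
    kernel/sched/core.c, with almost everything elided):

        static inline void
        prepare_task_switch(struct rq *rq, struct task_struct *prev,
                            struct task_struct *next)
        {
                /* PMU goes quiet here, before 'current' changes
                 * hands ... */
                perf_event_task_sched_out(prev, next);
        }

        static void finish_task_switch(struct rq *rq,
                                       struct task_struct *prev)
        {
                /* ... and is re-enabled only once 'current' is the
                 * next task, so SAMPLEs carry the right PID. */
                perf_event_task_sched_in(prev, current);
        }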

    Discussed with Peter Zijlstra:

    > Bah!, yeah I guess reverting is the right thing for now. Sad
    > though.
    >
    > So by having the two hooks we have a black-spot between them
    > where we receive no events at all, this black-spot covers the
    > hand-over of current and we thus don't receive the 'wrong'
    > events.
    >
    > I rather liked we could do away with both that black-spot and
    > clean up the code a little, but apparently people rely on it.

    Signed-off-by: Jiri Olsa
    Acked-by: Peter Zijlstra
    Cc: acme@redhat.com
    Cc: paulus@samba.org
    Cc: cjashfor@linux.vnet.ibm.com
    Cc: fweisbec@gmail.com
    Cc: eranian@google.com
    Link: http://lkml.kernel.org/r/20120523111302.GC1638@m.brq.redhat.com
    Signed-off-by: Ingo Molnar

    Jiri Olsa
     
  • Pull scheduler changes from Ingo Molnar:
    "The biggest change is the cleanup/simplification of the load-balancer:
    instead of the current practice of architectures twiddling scheduler
    internal data structures and providing the scheduler domains in
    colorfully inconsistent ways, we now have generic scheduler code in
    kernel/sched/core.c:sched_init_numa() that looks at the architecture's
    node_distance() parameters and (while not fully trusting it) deduces a
    NUMA topology from it.

    This inevitably changes balancing behavior - hopefully for the better.

    There are various smaller optimizations, cleanups and fixlets as well"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched: Taint kernel with TAINT_WARN after sleep-in-atomic bug
    sched: Remove stale power aware scheduling remnants and dysfunctional knobs
    sched/debug: Fix printing large integers on 32-bit platforms
    sched/fair: Improve the ->group_imb logic
    sched/nohz: Fix rq->cpu_load[] calculations
    sched/numa: Don't scale the imbalance
    sched/fair: Revert sched-domain iteration breakage
    sched/x86: Rewrite set_cpu_sibling_map()
    sched/numa: Fix the new NUMA topology bits
    sched/numa: Rewrite the CONFIG_NUMA sched domain support
    sched/fair: Propagate 'struct lb_env' usage into find_busiest_group
    sched/fair: Add some serialization to the sched_domain load-balance walk
    sched/fair: Let minimally loaded cpu balance the group
    sched: Change rq->nr_running to unsigned int
    x86/numa: Check for nonsensical topologies on real hw as well
    x86/numa: Hard partition cpu topology masks on node boundaries
    x86/numa: Allow specifying node_distance() for numa=fake
    x86/sched: Make mwait_usable() heed to "idle=" kernel parameters properly
    sched: Update documentation and comments
    sched_rt: Avoid unnecessary dequeue and enqueue of pushable tasks in set_cpus_allowed_rt()

    Linus Torvalds
     
  • Pull perf changes from Ingo Molnar:
    "Lots of changes:

    - (much) improved assembly annotation support in perf report, with
    jump visualization, searching, navigation, visual output
    improvements and more.

    - kernel support for AMD IBS PMU hardware features. Notably 'perf
    record -e cycles:p' and 'perf top -e cycles:p' should work without
    skid now, like PEBS does on the Intel side, because it takes
    advantage of IBS transparently.

    - the libtraceevent library: it is the first step towards unifying
    tracing tooling and perf, and it also gives a tracing library for
    external tools like powertop to rely on.

    - infrastructure: various improvements and refactoring of the UI
    modules and related code

    - infrastructure: cleanup and simplification of the profiling
    targets code (--uid, --pid, --tid, --cpu, --all-cpus, etc.)

    - tons of robustness fixes all around

    - various ftrace updates: speedups, cleanups, robustness
    improvements.

    - typing 'make' in tools/ will now give you a menu of projects to
    build and a short help text to explain what each does.

    - ... and lots of other changes I forgot to list.

    The perf record make bzImage + perf report regression you reported
    should be fixed."

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (166 commits)
    tracing: Remove kernel_lock annotations
    tracing: Fix initial buffer_size_kb state
    ring-buffer: Merge separate resize loops
    perf evsel: Create events initially disabled -- again
    perf tools: Split term type into value type and term type
    perf hists: Fix callchain ip printf format
    perf target: Add uses_mmap field
    ftrace: Remove selecting FRAME_POINTER with FUNCTION_TRACER
    ftrace/x86: Have x86 ftrace use the ftrace_modify_all_code()
    ftrace: Make ftrace_modify_all_code() global for archs to use
    ftrace: Return record ip addr for ftrace_location()
    ftrace: Consolidate ftrace_location() and ftrace_text_reserved()
    ftrace: Speed up search by skipping pages by address
    ftrace: Remove extra helper functions
    ftrace: Sort all function addresses, not just per page
    tracing: change CPU ring buffer state from tracing_cpumask
    tracing: Check return value of tracing_dentry_percpu()
    ring-buffer: Reset head page before running self test
    ring-buffer: Add integrity check at end of iter read
    ring-buffer: Make addition of pages in ring buffer atomic
    ...

    Linus Torvalds
     
  • Pull cgroup updates from Tejun Heo:
    "cgroup file type addition / removal is updated so that file types are
    added and removed instead of individual files so that dynamic file
    type addition / removal can be implemented by cgroup and used by
    controllers. blkio controller changes which will come through block
    tree are dependent on this. Other changes include res_counter cleanup
    and disallowing kthread / PF_THREAD_BOUND threads to be attached to
    non-root cgroups.

    There's a reported bug with the file type addition / removal handling
    which can lead to oops on cgroup umount. The issue is being looked
    into. It shouldn't cause problems for most setups and isn't a
    security concern."

    Fix up trivial conflict in Documentation/feature-removal-schedule.txt

    * 'for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
    res_counter: Account max_usage when calling res_counter_charge_nofail()
    res_counter: Merge res_counter_charge and res_counter_charge_nofail
    cgroups: disallow attaching kthreadd or PF_THREAD_BOUND threads
    cgroup: remove cgroup_subsys->populate()
    cgroup: get rid of populate for memcg
    cgroup: pass struct mem_cgroup instead of struct cgroup to socket memcg
    cgroup: make css->refcnt clearing on cgroup removal optional
    cgroup: use negative bias on css->refcnt to block css_tryget()
    cgroup: implement cgroup_rm_cftypes()
    cgroup: introduce struct cfent
    cgroup: relocate __d_cgrp() and __d_cft()
    cgroup: remove cgroup_add_file[s]()
    cgroup: convert memcg controller to the new cftype interface
    memcg: always create memsw files if CONFIG_CGROUP_MEM_RES_CTLR_SWAP
    cgroup: convert all non-memcg controllers to the new cftype interface
    cgroup: relocate cftype and cgroup_subsys definitions in controllers
    cgroup: merge cft_release_agent cftype array into the base files array
    cgroup: implement cgroup_add_cftypes() and friends
    cgroup: build list of all cgroups under a given cgroupfs_root
    cgroup: move cgroup_clear_directory() call out of cgroup_populate_dir()
    ...

    Linus Torvalds
     

22 May, 2012

2 commits

  • Pull smp hotplug cleanups from Thomas Gleixner:
    "This series is merily a cleanup of code copied around in arch/* and
    not changing any of the real cpu hotplug horrors yet. I wish I'd had
    something more substantial for 3.5, but I underestimated the lurking
    horror..."

    Fix up trivial conflicts in arch/{arm,sparc,x86}/Kconfig and
    arch/sparc/include/asm/thread_info_32.h

    * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (79 commits)
    um: Remove leftover declaration of alloc_task_struct_node()
    task_allocator: Use config switches instead of magic defines
    sparc: Use common threadinfo allocator
    score: Use common threadinfo allocator
    sh-use-common-threadinfo-allocator
    mn10300: Use common threadinfo allocator
    powerpc: Use common threadinfo allocator
    mips: Use common threadinfo allocator
    hexagon: Use common threadinfo allocator
    m32r: Use common threadinfo allocator
    frv: Use common threadinfo allocator
    cris: Use common threadinfo allocator
    x86: Use common threadinfo allocator
    c6x: Use common threadinfo allocator
    fork: Provide kmemcache based thread_info allocator
    tile: Use common threadinfo allocator
    fork: Provide weak arch_release_[task_struct|thread_info] functions
    fork: Move thread info gfp flags to header
    fork: Remove the weak insanity
    sh: Remove cpu_idle_wait()
    ...

    Linus Torvalds
     
  • Pull RCU changes from Ingo Molnar:
    "This is the v3.5 RCU tree from Paul E. McKenney:

    1) A set of improvements and fixes to the RCU_FAST_NO_HZ feature (with
    more on the way for 3.6). Posted to LKML:
    https://lkml.org/lkml/2012/4/23/324 (commits 1-3 and 5),
    https://lkml.org/lkml/2012/4/16/611 (commit 4),
    https://lkml.org/lkml/2012/4/30/390 (commit 6), and
    https://lkml.org/lkml/2012/5/4/410 (commit 7, combined with
    the other commits for the convenience of the tester).

    2) Changes to make rcu_barrier() avoid disrupting execution of CPUs
    that have no RCU callbacks. Posted to LKML:
    https://lkml.org/lkml/2012/4/23/322.

    3) A couple of commits that improve the efficiency of the interaction
    between preemptible RCU and the scheduler, these two being all that
    survived an abortive attempt to allow preemptible RCU's
    __rcu_read_lock() to be inlined. The full set was posted to LKML at
    https://lkml.org/lkml/2012/4/14/143, and the first and third patches
    of that set remain.

    4) Lai Jiangshan's algorithmic implementation of SRCU, which includes
    call_srcu() and srcu_barrier(). A major feature of this new
    implementation is that synchronize_srcu() no longer disturbs the
    execution of other CPUs. This work is based on earlier
    implementations by Peter Zijlstra and Paul E. McKenney. Posted to
    LKML: https://lkml.org/lkml/2012/2/22/82.

    5) A number of miscellaneous bug fixes and improvements which were
    posted to LKML at: https://lkml.org/lkml/2012/4/23/353 with
    subsequent updates posted to LKML."
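
    For orientation, the new call_srcu()/srcu_barrier() entry points
    follow the usual SRCU usage pattern; a minimal sketch (hypothetical
    srcu_struct and callback, setup and error handling elided):

        #include <linux/srcu.h>

        static struct srcu_struct my_srcu; /* init_srcu_struct() at setup */

        static void reader(void)
        {
                /* Reader side: cheap and never blocks updaters for long. */
                int idx = srcu_read_lock(&my_srcu);
                /* ... access SRCU-protected data ... */
                srcu_read_unlock(&my_srcu, idx);
        }

        static void updater(struct rcu_head *head,
                            void (*func)(struct rcu_head *))
        {
                /* Queue an async callback to run after a grace period. */
                call_srcu(&my_srcu, head, func);
        }

        static void teardown(void)
        {
                /* Wait for pending SRCU callbacks, then clean up. */
                srcu_barrier(&my_srcu);
                cleanup_srcu_struct(&my_srcu);
        }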

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
    rcu: Make rcu_barrier() less disruptive
    rcu: Explicitly initialize RCU_FAST_NO_HZ per-CPU variables
    rcu: Make RCU_FAST_NO_HZ handle timer migration
    rcu: Update RCU maintainership
    rcu: Make exit_rcu() more precise and consolidate
    rcu: Move PREEMPT_RCU preemption to switch_to() invocation
    rcu: Ensure that RCU_FAST_NO_HZ timers expire on correct CPU
    rcu: Add rcutorture test for call_srcu()
    rcu: Implement per-domain single-threaded call_srcu() state machine
    rcu: Use single value to handle expedited SRCU grace periods
    rcu: Improve srcu_readers_active_idx()'s cache locality
    rcu: Remove unused srcu_barrier()
    rcu: Implement a variant of Peter's SRCU algorithm
    rcu: Improve SRCU's wait_idx() comments
    rcu: Flip ->completed only once per SRCU grace period
    rcu: Increment upper bit only for srcu_read_lock()
    rcu: Remove fast check path from __synchronize_srcu()
    rcu: Direct algorithmic SRCU implementation
    rcu: Introduce rcutorture testing for rcu_barrier()
    timer: Fix mod_timer_pinned() header comment
    ...

    Linus Torvalds
     

19 May, 2012

1 commit

  • Merge reason: We are going to queue up a dependent patch:

    "perf tools: Move parse event automated tests to separated object"

    That depends on:

    commit e7c72d8
    perf tools: Add 'G' and 'H' modifiers to event parsing

    Conflicts:
    tools/perf/builtin-stat.c

    Conflicted with the recent 'perf_target' patches when checking the
    result of perf_evsel open routines to see if a retry is needed to cope
    with older kernels where the exclude guest/host perf_event_attr bits
    were not used.

    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo
     

17 May, 2012

2 commits

  • It's been broken forever (i.e. it's not scheduling in a power
    aware fashion), as reported by Suresh and others sending
    patches, and nobody cares enough to fix it properly ...
    so remove it to free up space for something better.

    There are various problems with the code as it stands today, first
    and foremost the user interface, which is bound to topology
    levels and has multiple values per level. This results in a
    state explosion which the administrator or distro needs to
    master and almost nobody does.

    Furthermore, large configuration state spaces aren't good: the thing
    doesn't just work right, because either it's under so many
    impossible-to-meet constraints, or, even if there's an achievable
    state, workloads have to be aware of it precisely and can never meet
    it for dynamic workloads.

    So pushing this kind of decision to user-space was a bad idea
    even with a single knob - it's exponentially worse with knobs
    on every node of the topology.

    There is a proposal to replace the user interface with a single
    3 state knob:

    sched_balance_policy := { performance, power, auto }

    where 'auto' would be the preferred default which looks at things
    like Battery/AC mode and possible cpufreq state or whatever the hw
    exposes to show us power use expectations - but there's been no
    progress on it in the past many months.

    Aside from that, the actual implementation of the various knobs
    is known to be broken. There have been sporadic attempts at
    fixing things but these always stop short of reaching a mergeable
    state.

    Therefore, remove it wholesale, in the hope of spurring people
    who care to come forward once again and work on a coherent
    replacement.

    Signed-off-by: Peter Zijlstra
    Cc: Suresh Siddha
    Cc: Arjan van de Ven
    Cc: Vincent Guittot
    Cc: Vaidyanathan Srinivasan
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/1326104915.2442.53.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Merge reason: bring together all the pending scheduler bits,
    for the sched/numa changes.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

14 May, 2012

7 commits

    Some numbers, like nr_running and nr_uninterruptible, are
    fundamentally unsigned, since it's impossible to have a negative
    number of tasks, yet we still print them as signed so that an
    underflow condition is easy to recognise.

    rq->nr_uninterruptible has 'special' accounting and can in fact very
    easily become negative on a per-cpu basis.

    It was noted that since the P() macro assumes things are long long,
    and the promotion of unsigned 'int/long' to long long on 32-bit does
    not sign-extend, we print silly large numbers instead of the easier
    to read signed numbers.

    Therefore extend the P() macro so that it does not require the sign
    extension.
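
    The promotion trap is easy to reproduce in plain C; a minimal
    user-space illustration (not the kernel macro itself):

        #include <stdio.h>

        int main(void)
        {
                /* an underflowed counter, on a 32-bit platform where
                 * unsigned long is 32 bits */
                unsigned long n = (unsigned long)-1;

                /* unsigned -> long long zero-extends: 4294967295 */
                printf("%lld\n", (long long)n);

                /* via the signed type it sign-extends: -1 */
                printf("%lld\n", (long long)(long)n);

                return 0;
        }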

    Reported-by: Diwakar Tundlam
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-gk5tm8t2n4ix2vkpns42uqqp@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Group imbalance is meant to deal with situations where affinity masks
    and sched domains don't align well, such as 3 cpus from one group and
    6 from another. In this case the domain-based balancer will want to
    put an equal number of tasks on each side even though they don't
    have an equal number of cpus.

    Currently group_imb is set whenever two cpus of a group have a weight
    difference of at least one avg task and the heaviest cpu has at least
    two tasks. A group with imbalance set will always be picked as busiest
    and a balance pass will be forced.

    The problem is that even if there are no affinity masks this stuff
    can trigger and cause weird balancing decisions. E.g. the observed
    behaviour was that of 6 cpus, 5 had 2 tasks and 1 had 3 tasks; due
    to the difference of 1 avg load (they all had the same weight) and
    nr_running being >1, the group_imbalance logic triggered and did the
    weird thing of pulling more load instead of trying to move the 1
    excess task to the other domain of 6 cpus, which had 5 cpus with 2
    tasks and 1 cpu with 1 task.

    Curb the group_imbalance stuff by weakening the nr_running condition:
    also track min_nr_running, and use the difference in nr_running over
    the set instead of the absolute max nr_running, as sketched below.
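
    The shape of the weakened trigger is roughly the following (a
    sketch only, with the group statistics passed in explicitly; not
    the literal diff):

        /* Flag imbalance on the spread of nr_running across the
         * group rather than on the absolute maximum. */
        static void check_group_imb(struct sg_lb_stats *sgs,
                                    unsigned long max_cpu_load,
                                    unsigned long min_cpu_load,
                                    unsigned long avg_load_per_task,
                                    unsigned int max_nr_running,
                                    unsigned int min_nr_running)
        {
                if ((max_cpu_load - min_cpu_load) >= avg_load_per_task &&
                    (max_nr_running - min_nr_running) > 1)
                        sgs->group_imb = 1;
        }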

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-9s7dedozxo8kjsb9kqlrukkf@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    While investigating why the load-balancer was misbehaving I found
    that the rq->cpu_load[] tables were completely screwy... a bit more
    digging revealed that the updates that got through were missing
    ticks, followed by a catch-up of 2 ticks.

    The catch-up assumes the cpu was idle during that time (since only
    nohz can cause missed ticks and the machine is idle etc..), which
    means that especially the higher indices were significantly lower
    than they ought to be.

    The reason for this is that it's not correct to compare against
    jiffies on every jiffy on any cpu other than the cpu that updates
    jiffies.

    This patch kludges around it by only doing the catch-up stuff from
    nohz_idle_balance() and doing the regular stuff unconditionally from
    the tick.

    Signed-off-by: Peter Zijlstra
    Cc: pjt@google.com
    Cc: Venkatesh Pallipadi
    Link: http://lkml.kernel.org/n/tip-tp4kj18xdd5aj4vvj0qg55s2@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    It's far too easy to get a ridiculously large imbalance pct when you
    scale it like that. Use a fixed 125% for now.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-zsriaft1dv7hhboyrpvqjy6s@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Patches c22402a2f ("sched/fair: Let minimally loaded cpu balance the
    group") and 0ce90475 ("sched/fair: Add some serialization to the
    sched_domain load-balance walk") are horribly broken so revert them.

    The problem is that while it sounds good to have the minimally loaded
    cpu do the pulling of more load, the way we walk the domains there is
    absolutely no guarantee this cpu will actually get to the domain. In
    fact it's very likely it won't. Therefore the higher up the tree we
    get, the less likely it is we'll balance at all.

    The first-of-mask approach always walks up; while sucky in that it
    accumulates load on the first cpu and needs extra passes to spread
    it out, it at least guarantees that a cpu gets up that far and that
    load-balancing happens at all.

    Since it's now always the first cpu, and idle cpus should always be
    able to balance (so they get a task as fast as possible), we can
    also do away with the added serialization.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-rpuhs5s56aiv1aw7khv9zkw6@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    There's no need to convert a node number to a node number by
    pretending it's a cpu number.

    Reported-by: Yinghai Lu
    Reported-and-Tested-by: Greg Pearson
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-0sqhrht34phowgclj12dgk8h@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • …/linux-rcu into core/rcu

    Pull the v3.5 RCU tree from Paul E. McKenney:

    1) A set of improvements and fixes to the RCU_FAST_NO_HZ feature
    (with more on the way for 3.6). Posted to LKML:
    https://lkml.org/lkml/2012/4/23/324 (commits 1-3 and 5),
    https://lkml.org/lkml/2012/4/16/611 (commit 4),
    https://lkml.org/lkml/2012/4/30/390 (commit 6), and
    https://lkml.org/lkml/2012/5/4/410 (commit 7, combined with
    the other commits for the convenience of the tester).

    2) Changes to make rcu_barrier() avoid disrupting execution of CPUs
    that have no RCU callbacks. Posted to LKML:
    https://lkml.org/lkml/2012/4/23/322.

    3) A couple of commits that improve the efficiency of the interaction
    between preemptible RCU and the scheduler, these two being all
    that survived an abortive attempt to allow preemptible RCU's
    __rcu_read_lock() to be inlined. The full set was posted to
    LKML at https://lkml.org/lkml/2012/4/14/143, and the first and
    third patches of that set remain.

    4) Lai Jiangshan's algorithmic implementation of SRCU, which includes
    call_srcu() and srcu_barrier(). A major feature of this new
    implementation is that synchronize_srcu() no longer disturbs
    the execution of other CPUs. This work is based on earlier
    implementations by Peter Zijlstra and Paul E. McKenney. Posted to
    LKML: https://lkml.org/lkml/2012/2/22/82.

    5) A number of miscellaneous bug fixes and improvements which were
    posted to LKML at: https://lkml.org/lkml/2012/4/23/353 with
    subsequent updates posted to LKML.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     

09 May, 2012

7 commits

    We can easily use a single callback for both sched-in and sched-out.
    This reduces the code footprint in the scheduler path and removes
    the PMU black spot otherwise present between the out and in
    callbacks.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-o56ajxp1edwqg6x9d31wb805@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    The current code groups up to 16 nodes in a level and then puts an
    ALLNODES domain spanning the entire tree on top of that. This doesn't
    reflect the numa topology, and especially for the smaller,
    not-fully-connected machines out there today this can make a
    difference.

    Therefore, build a proper numa topology based on node_distance().

    Since there are no fixed numa layers anymore, the static SD_NODE_INIT
    and SD_ALLNODES_INIT aren't usable anymore; the new code tries to
    construct something similar, scaling some values on the number of
    cpus in the domain and/or the node_distance() ratio.
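
    The gist of the new setup code is to enumerate the distinct
    node_distance() values and build one topology level per distance;
    a heavily simplified sketch (hypothetical local names, the real
    code is sched_init_numa() in kernel/sched/core.c):

        static void build_numa_levels(void)
        {
                int distances[MAX_NUMNODES];
                int nr_levels = 0;
                int i, j, k;

                for_each_online_node(i) {
                        for_each_online_node(j) {
                                int d = node_distance(i, j);
                                bool known = false;

                                for (k = 0; k < nr_levels; k++)
                                        if (distances[k] == d)
                                                known = true;
                                if (!known)
                                        distances[nr_levels++] = d;
                        }
                }
                /* ... one sched_domain level per distinct distance,
                 * with flags/values scaled as described above ... */
        }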

    Signed-off-by: Peter Zijlstra
    Cc: Anton Blanchard
    Cc: Benjamin Herrenschmidt
    Cc: Chris Metcalf
    Cc: David Howells
    Cc: "David S. Miller"
    Cc: Fenghua Yu
    Cc: "H. Peter Anvin"
    Cc: Ivan Kokshaysky
    Cc: linux-alpha@vger.kernel.org
    Cc: linux-ia64@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mips@linux-mips.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: linux-sh@vger.kernel.org
    Cc: Matt Turner
    Cc: Paul Mackerras
    Cc: Paul Mundt
    Cc: Ralf Baechle
    Cc: Richard Henderson
    Cc: sparclinux@vger.kernel.org
    Cc: Tony Luck
    Cc: x86@kernel.org
    Cc: Dimitri Sivanich
    Cc: Greg Pearson
    Cc: KAMEZAWA Hiroyuki
    Cc: bob.picco@oracle.com
    Cc: chris.mason@oracle.com
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/n/tip-r74n3n8hhuc2ynbrnp3vt954@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • More function argument passing reduction.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-v66ivjfqdiqdso01lqgqx6qf@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Since the sched_domain walk is completely unserialized (!SD_SERIALIZE)
    it is possible that multiple cpus in the group get elected to do the
    next level. Avoid this by adding some serialization.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-vqh9ai6s0ewmeakjz80w4qz6@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Currently we let the leftmost (or first idle) cpu ascend the
    sched_domain tree and perform load-balancing. The result is that the
    busiest cpu in the group might be performing this function and pull
    more load to itself. The next load balance pass will then try to
    equalize this again.

    Change this to pick the least loaded cpu to perform higher domain
    balancing.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-v8zlrmgmkne3bkcy9dej1fvm@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    Since there's a PID space limit of 30 bits (see
    futex.h:FUTEX_TID_MASK) and allocating that many tasks (assuming a
    lower bound of 2 pages per task) would still take 8T of memory (2^30
    tasks x 2 x 4 KiB pages = 8 TiB), it seems reasonable to say that
    unsigned int is sufficient for rq->nr_running.

    When we do get anywhere near that number of tasks I suspect other
    things would go funny; load-balancer load computations would really
    need to be hoisted to 128 bits, etc.

    So save a few bytes and convert rq->nr_running and friends to
    unsigned int.

    Suggested-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-y3tvyszjdmbibade5bw8zl81@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    If we have one cpu that failed to boot, the boot cpu gave up on
    waiting for it, and then another cpu is being booted, the kernel
    might crash with the following OOPS:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
    IP: [] __bitmap_weight+0x30/0x80
    Call Trace:
    [] build_sched_domains+0x7b6/0xa50

    The crash happens in init_sched_groups_power(), which expects
    sched_groups to be a circular linked list. However that is not
    always true, since the sched_groups preallocated in __sdt_alloc are
    initialized in build_sched_groups, and it may exit early

    if (cpu != cpumask_first(sched_domain_span(sd)))
    return 0;

    without initializing the sd->groups->next field.

    Fix the bug by initializing the next field right after the
    sched_group is allocated.
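
    Roughly, the one-liner lands in __sdt_alloc(), right after the
    allocation (context abbreviated and paraphrased, not the literal
    patch):

        sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
                          GFP_KERNEL, cpu_to_node(j));
        if (!sg)
                return -ENOMEM;

        /* keep the list circular even if build_sched_groups()
         * exits early and never links this group further */
        sg->next = sg;

        *per_cpu_ptr(sdd->sg, j) = sg;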

    Also-Reported-by: Jiang Liu
    Signed-off-by: Igor Mammedov
    Cc: a.p.zijlstra@chello.nl
    Cc: pjt@google.com
    Cc: seto.hidetoshi@jp.fujitsu.com
    Link: http://lkml.kernel.org/r/1336559908-32533-1-git-send-email-imammedo@redhat.com
    Signed-off-by: Ingo Molnar

    Igor Mammedov
     

05 May, 2012

1 commit

  • All archs define init_task in the same way (except ia64, but there is
    no particular reason why ia64 cannot use the common version). Create a
    generic instance so all archs can be converted over.

    The config switch is temporary and will be removed when all archs are
    converted over.

    Signed-off-by: Thomas Gleixner
    Cc: Benjamin Herrenschmidt
    Cc: Chen Liqin
    Cc: Chris Metcalf
    Cc: Chris Zankel
    Cc: David Howells
    Cc: David S. Miller
    Cc: Geert Uytterhoeven
    Cc: Guan Xuetao
    Cc: Haavard Skinnemoen
    Cc: Hirokazu Takata
    Cc: James E.J. Bottomley
    Cc: Jesper Nilsson
    Cc: Jonas Bonn
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Matt Turner
    Cc: Michal Simek
    Cc: Mike Frysinger
    Cc: Paul Mundt
    Cc: Ralf Baechle
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Russell King
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20120503085034.092585287@linutronix.de

    Thomas Gleixner
     

26 Apr, 2012

3 commits

    Under extreme memory pressure, percpu allocation might fail. We
    hit it when the system goes to suspend-to-ram, causing a kworker
    panic:

    EIP: [] build_sched_domains+0x23a/0xad0
    Kernel panic - not syncing: Fatal exception
    Pid: 3026, comm: kworker/u:3
    3.0.8-137473-gf42fbef #1

    Call Trace:
    [] panic+0x66/0x16c
    [...]
    [] partition_sched_domains+0x287/0x4b0
    [] cpuset_update_active_cpus+0x1fe/0x210
    [] cpuset_cpu_inactive+0x1d/0x30
    [...]

    With this fix applied, build_sched_domains() will return -ENOMEM and
    the suspend attempt fails.

    Signed-off-by: he, bo
    Reviewed-by: Zhang, Yanmin
    Reviewed-by: Srivatsa S. Bhat
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc:
    Link: http://lkml.kernel.org/r/1335355161.5892.17.camel@hebo
    [ So, we fail to deallocate a CPU because we cannot allocate RAM :-/
    I don't like that kind of sad behavior but nevertheless it should
    not crash under high memory load. ]
    Signed-off-by: Ingo Molnar

    he, bo
     
  • Commits 367456c756a6 ("sched: Ditch per cgroup task lists for
    load-balancing") and 5d6523ebd ("sched: Fix load-balance wreckage")
    left some more wreckage.

    By setting loop_max unconditionally to ->nr_running, load-balancing
    could take a lot of time on very long runqueues (hackbench!). So keep
    the sysctl as a max limit on the number of tasks we'll iterate.

    Furthermore, the min load filter for migration completely fails with
    cgroups since inequality in per-cpu state can easily lead to such
    small loads :/

    Furthermore, the change to add new tasks to the tail of the queue
    instead of the head seems to have some effect... not quite sure I
    understand why.

    Combined, these fixes solve the huge hackbench regression reported by
    Tim when hackbench is run in a cgroup.

    Reported-by: Tim Chen
    Acked-by: Tim Chen
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/1335365763.28150.267.camel@twins
    [ got rid of the CONFIG_PREEMPT tuning and made small readability edits ]
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    All SMP architectures have magic to fork the idle task and to store
    it for reuse when cpu hotplug is enabled. Provide a generic
    infrastructure for it.

    Create/reinit the idle thread for the cpu which is brought up in the
    generic code and hand the thread pointer to the architecture code via
    __cpu_up().

    Note that fork_idle() is called via a workqueue, because this
    guarantees that the idle thread does not get a reference to a user
    space VM. This can happen when the boot process did not bring up all
    possible cpus and a later cpu_up() is initiated via the sysfs
    interface. In that case fork_idle() would be called in the context of
    the user space task and take a reference on the user space VM.
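
    A condensed sketch of the resulting interface (after
    kernel/smpboot.c, with locking and error paths elided):

        static DEFINE_PER_CPU(struct task_struct *, idle_threads);

        /* Called from the generic cpu_up() path; reuses the idle task
         * forked earlier instead of forking a fresh one on re-plug. */
        struct task_struct *idle_thread_get(unsigned int cpu)
        {
                struct task_struct *tsk = per_cpu(idle_threads, cpu);

                if (!tsk)
                        return ERR_PTR(-ENOMEM);
                init_idle(tsk, cpu);    /* reinitialize for reuse */
                return tsk;
        }

        /* The architecture then receives the thread pointer directly: */
        int __cpu_up(unsigned int cpu, struct task_struct *tidle);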

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Rusty Russell
    Cc: Paul E. McKenney
    Cc: Srivatsa S. Bhat
    Cc: Matt Turner
    Cc: Russell King
    Cc: Mike Frysinger
    Cc: Jesper Nilsson
    Cc: Richard Kuo
    Cc: Tony Luck
    Cc: Hirokazu Takata
    Cc: Ralf Baechle
    Cc: David Howells
    Cc: James E.J. Bottomley
    Cc: Benjamin Herrenschmidt
    Cc: Martin Schwidefsky
    Cc: Paul Mundt
    Cc: David S. Miller
    Cc: Chris Metcalf
    Cc: Richard Weinberger
    Cc: x86@kernel.org
    Acked-by: Venkatesh Pallipadi
    Link: http://lkml.kernel.org/r/20120420124557.102478630@linutronix.de

    Thomas Gleixner
     

13 Apr, 2012

1 commit

    Migration status depends on a difference of weight from 0 and 1.
    If weight > 1 (<= 1) and old weight <= 1 (> 1) then the task becomes
    pushable (or not pushable). We are not interested in its exact
    values, whether it is 3 or 4, for example.
    Now if we are changing affinity from a set of 3 cpus to a set of 4,
    the task will be dequeued and enqueued sequentially without any
    important difference in comparison with the initial state. The only
    difference is in the internal representation of the plist queue of
    pushable tasks and the fact that the task may not be the first in a
    sequence of tasks of the same priority. But it seems to me it gains
    nothing.
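
    In code, the short-circuit boils down to something like this (a
    paraphrased sketch of the set_cpus_allowed_rt() test, not the
    literal patch):

        /* Only a weight transition across 1 changes whether the task
         * is pushable, so skip the requeue dance otherwise. */
        static bool pushable_state_changes(struct task_struct *p,
                                           const struct cpumask *new_mask)
        {
                int weight = cpumask_weight(new_mask);

                /* same side of the "more than one allowed cpu"
                 * boundary means nothing to update */
                return (p->rt.nr_cpus_allowed > 1) != (weight > 1);
        }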

    Link: http://lkml.kernel.org/r/273741334120764@web83.yandex.ru

    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Tkhai Kirill
    Signed-off-by: Steven Rostedt

    Kirill Tkhai
     

02 Apr, 2012

1 commit

  • Convert debug, freezer, cpuset, cpu_cgroup, cpuacct, net_prio, blkio,
    net_cls and device controllers to use the new cftype based interface.
    A termination entry is added to the cftype arrays, and populate
    callbacks are replaced with cgroup_subsys->base_cftypes
    initializations.

    This is a functionally identical transformation. There shouldn't be
    any visible behavior change.
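
    The shape of the conversion, for one controller (a schematic
    sketch with hypothetical names, not any one controller's actual
    table):

        static u64 my_usage_read(struct cgroup *cgrp, struct cftype *cft);
        static int my_limit_write(struct cgroup *cgrp, struct cftype *cft,
                                  u64 val);

        static struct cftype my_files[] = {
                {
                        .name = "usage",
                        .read_u64 = my_usage_read,
                },
                {
                        .name = "limit",
                        .write_u64 = my_limit_write,
                },
                { }     /* termination entry */
        };

        struct cgroup_subsys my_subsys = {
                .name = "my",
                .base_cftypes = my_files,
                /* ... create/destroy/attach callbacks ... */
        };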

    memcg is rather special and will be converted separately.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Cc: Paul Menage
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: "David S. Miller"
    Cc: Vivek Goyal

    Tejun Heo
     

01 Apr, 2012

1 commit