17 Jun, 2010

2 commits


11 Jun, 2010

1 commit


09 Jun, 2010

1 commit

  • Frederic reported that frequency-driven swevents didn't work properly
    and even caused a division-by-zero error.

    It turns out there are two bugs. The division-by-zero comes from
    perf_calculate_period() not handling a zero divisor.

    The other was more interesting and turned out to be a wrong comparison
    in perf_adjust_period(). The comparison was between an s64 and u64 and
    got implicitly converted to an unsigned comparison. The problem is
    that period_left is typically < 0, so it ended up being always true.

    Cure this by making the local period variables s64.

    Reported-by: Frederic Weisbecker
    Tested-by: Frederic Weisbecker
    Signed-off-by: Peter Zijlstra
    Cc:
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
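
The signed/unsigned pitfall described in this commit can be reproduced in user space. The sketch below is not the kernel code: the function names are hypothetical and int64_t/uint64_t stand in for s64/u64. It only shows why a negative period_left makes the mixed comparison true.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Buggy shape: period_left is signed, the threshold is unsigned.
 * C converts the signed operand to unsigned before comparing; the
 * cast below only spells out what the compiler does implicitly.
 */
static bool left_exceeds_buggy(int64_t period_left, uint64_t threshold)
{
	return (uint64_t)period_left > threshold;
}

/* Fixed shape: both operands are signed, so negative values stay negative. */
static bool left_exceeds_fixed(int64_t period_left, int64_t threshold)
{
	return period_left > threshold;
}
```

With period_left = -5, the buggy form compares a value near 2^64 against the threshold, so it reports "exceeds" for any realistic threshold; the signed form does not.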
     

05 Jun, 2010

13 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
    module: fix bne2 "gave up waiting for init of module libcrc32c"
    module: verify_export_symbols under the lock
    module: move find_module check to end
    module: make locking more fine-grained.
    module: Make module sysfs functions private.
    module: move sysfs exposure to end of load_module
    module: fix kdb's illicit use of struct module_use.
    module: Make the 'usage' lists be two-way

    Linus Torvalds
     
  • Problem: it's hard to avoid an init routine stumbling over a
    request_module these days. And it's not clear it's always a bad idea:
    for example, a module like kvm with dynamic dependencies on kvm-intel
    or kvm-amd would be neater if it could simply request_module the right
    one.

    In this particular case, it's libcrc32c:

    libcrc32c_mod_init
    crypto_alloc_shash
    crypto_alloc_tfm
    crypto_find_alg
    crypto_alg_mod_lookup
    crypto_larval_lookup
    request_module

    If another module is waiting inside resolve_symbol() for libcrc32c to
    finish initializing (ie. bne2 depends on libcrc32c) then it does so
    holding the module lock, and our request_module() can't make progress
    until that is released.

    Waiting inside resolve_symbol() without the lock isn't all that hard:
    we just need to pass the -EBUSY up the call chain so we can sleep
    where we don't hold the lock. Error reporting is a bit trickier: we
    need to copy the name of the unfinished module before releasing the
    lock.

    Other notes:
    1) This also fixes a theoretical issue where a weak dependency would allow
    symbol version mismatches to be ignored.
    2) We rename use_module to ref_module to make life easier for the only
    external user (the out-of-tree ksplice patches).

    Signed-off-by: Rusty Russell
    Cc: Linus Torvalds
    Cc: Tim Abbot
    Tested-by: Brandon Philips

    Rusty Russell
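
The "pass -EBUSY up the call chain" idea can be sketched in user space as follows. Everything here is an illustrative assumption: the struct, the boolean standing in for the module lock, and the function names are hypothetical and are not the functions in kernel/module.c. The point is only the ordering: copy the name while the lock is held, drop the lock, and let the caller wait afterwards.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

#define NAME_LEN 32

enum mod_state { STATE_LIVE, STATE_COMING };

struct mod {
	char name[NAME_LEN];
	enum mod_state state;
};

static bool lock_held;	/* hypothetical stand-in for the module lock */

static void lock(void)   { lock_held = true; }
static void unlock(void) { lock_held = false; }

/*
 * Called with the lock held. If the symbol's owner is still initializing,
 * copy its name for later error reporting and return -EBUSY instead of
 * sleeping here.
 */
static int resolve_symbol_locked(const struct mod *owner, char *busy_name)
{
	if (owner->state == STATE_COMING) {
		snprintf(busy_name, NAME_LEN, "%s", owner->name);
		return -EBUSY;
	}
	return 0;
}

/*
 * Take the lock only around the lookup. When -EBUSY comes back, the lock
 * has already been released, so the caller can sleep and retry without
 * blocking a request_module() issued by the owner's init routine.
 */
static int resolve_symbol_wait(const struct mod *owner, char *busy_name)
{
	int err;

	lock();
	err = resolve_symbol_locked(owner, busy_name);
	unlock();

	return err;
}
```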
     
  • verify_export_symbols() disabled preemption so it was "safe", but
    nothing stops another module slipping in before this module is added to
    the global list, now that we don't hold the lock the whole time.

    So we check this just after we check for duplicate modules, and just
    before we put the module in the global list.

    (find_symbol finds symbols in coming and going modules, too).

    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • I think Rusty may have made the lock a bit _too_ fine-grained there, and
    didn't add it to some places that needed it. It looks, for example, like
    PATCH 1/2 actually drops the lock in places where it's needed
    ("find_module()" is documented to need it, but now load_module() didn't
    hold it at all when it did the find_module()).

    Rather than adding a new "module_loading" list, I think we should be able
    to just use the existing "modules" list, and just fix up the locking a
    bit.

    In fact, maybe we could just move the "look up existing module" a bit
    later - optimistically assuming that the module doesn't exist, and then
    just undoing the work if it turns out that we were wrong, just before
    adding ourselves to the list.

    Signed-off-by: Rusty Russell

    Linus Torvalds
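
The "assume the module doesn't exist, undo the work if we were wrong" ordering might look roughly like the sketch below. It is a single-threaded user-space illustration with hypothetical names; the comments mark where a real implementation would take and release the lock.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct mod {
	char name[32];
	struct mod *next;
};

static struct mod *modules;	/* global list; a real one needs a lock */

static struct mod *find_module_sketch(const char *name)
{
	for (struct mod *m = modules; m; m = m->next)
		if (strcmp(m->name, name) == 0)
			return m;
	return NULL;
}

static int load_module_sketch(const char *name)
{
	/* The expensive setup happens first, without the lock. */
	struct mod *m = calloc(1, sizeof(*m));

	if (!m)
		return -ENOMEM;
	snprintf(m->name, sizeof(m->name), "%s", name);

	/* The lock would be taken here. */
	if (find_module_sketch(name)) {
		/* We guessed wrong: unlock and undo the setup. */
		free(m);
		return -EEXIST;
	}
	m->next = modules;	/* check and insert happen under one lock hold */
	modules = m;
	/* The lock would be released here. */
	return 0;
}
```

Because the duplicate check and the list insertion sit in the same critical section, no second loader can slip in between them.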
     
  • Kay Sievers reports that we still have some
    contention over module loading which is slowing boot.

    Linus also disliked a previous "drop lock and regrab" patch to fix the
    bne2 "gave up waiting for init of module libcrc32c" message.

    This is more ambitious: we only grab the lock where we need it.

    Signed-off-by: Rusty Russell
    Cc: Brandon Philips
    Cc: Kay Sievers
    Cc: Linus Torvalds

    Rusty Russell
     
  • These were placed in the header in ef665c1a06 to get the various
    SYSFS/MODULE config combinations to compile.

    That may have been necessary then, but it's not now. These functions
    are all local to module.c.

    Signed-off-by: Rusty Russell
    Cc: Randy Dunlap

    Rusty Russell
     
  • This means a little extra work, but is more logical: we don't put
    anything in sysfs until we're about to put the module into the
    global list and parse its parameters.

    This also gives us a logical place to put duplicate module detection
    in the next patch.

    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • Linus changed the structure, and luckily this didn't compile any more.

    Reported-by: Stephen Rothwell
    Signed-off-by: Rusty Russell
    Cc: Jason Wessel
    Cc: Martin Hicks

    Rusty Russell
     
  • When adding a module that depends on another one, we used to create a
    one-way list of "modules_which_use_me", so that module unloading could
    see who needs a module.

    It's actually quite simple to make that list go both ways: so that we
    not only can see "who uses me", but also see a list of modules that are
    "used by me".

    In fact, we always wanted that list in "module_unload_free()": when we
    unload a module, we want to also release all the other modules that are
    used by that module. But because we didn't have that list, we used to
    first iterate over all modules, and then iterate over each "used by me"
    list of that module.

    By making the list two-way, we simplify module_unload_free(), and it
    allows for some trivial fixes later too.

    Signed-off-by: Linus Torvalds
    Signed-off-by: Rusty Russell (cleaned & rebased)

    Linus Torvalds
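
A two-way usage list can be sketched with one edge object linked into both modules. The field and function names below are hypothetical, and plain singly linked lists replace the kernel's list_head, so this illustrates the idea and is not the structure in the tree.

```c
#include <stddef.h>
#include <stdlib.h>

struct module;

/* One edge "source uses target", linked into both modules' lists. */
struct module_use {
	struct module *source, *target;
	struct module_use *next_used_by_source;	/* source's "used by me" list */
	struct module_use *next_user_of_target;	/* target's "who uses me" list */
};

struct module {
	int refcount;
	struct module_use *uses;	/* modules used by me */
	struct module_use *users;	/* modules which use me */
};

static int ref_module_sketch(struct module *source, struct module *target)
{
	struct module_use *use = malloc(sizeof(*use));

	if (!use)
		return -1;
	use->source = source;
	use->target = target;
	use->next_used_by_source = source->uses;
	source->uses = use;
	use->next_user_of_target = target->users;
	target->users = use;
	target->refcount++;
	return 0;
}

/* Unload: walk only our own "used by me" list; no scan over all modules. */
static void module_unload_free_sketch(struct module *mod)
{
	struct module_use *use = mod->uses;

	while (use) {
		struct module_use *next = use->next_used_by_source;
		struct module_use **pp = &use->target->users;

		/* Unlink the edge from the target's "who uses me" list. */
		while (*pp != use)
			pp = &(*pp)->next_user_of_target;
		*pp = use->next_user_of_target;

		use->target->refcount--;
		free(use);
		use = next;
	}
	mod->uses = NULL;
}
```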
     
  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block: (27 commits)
    block: make blk_init_free_list and elevator_init idempotent
    block: avoid unconditionally freeing previously allocated request_queue
    pipe: change /proc/sys/fs/pipe-max-pages to byte sized interface
    pipe: change the privilege required for growing a pipe beyond system max
    pipe: adjust minimum pipe size to 1 page
    block: disable preemption before using sched_clock()
    cciss: call BUG() earlier
    Preparing 8.3.8rc2
    drbd: Reduce verbosity
    drbd: use drbd specific ratelimit instead of global printk_ratelimit
    drbd: fix hang on local read errors while disconnected
    drbd: Removed the now empty w_io_error() function
    drbd: removed duplicated #includes
    drbd: improve usage of MSG_MORE
    drbd: need to set socket bufsize early to take effect
    drbd: improve network latency, TCP_QUICKACK
    drbd: Revert "drbd: Create new current UUID as late as possible"
    brd: support discard
    Revert "writeback: fix WB_SYNC_NONE writeback from umount"
    Revert "writeback: ensure that WB_SYNC_NONE writeback with sb pinned is sync"
    ...

    Linus Torvalds
     
  • The commit 80b5184cc537718122e036afe7e62d202b70d077 ("kernel/: convert cpu
    notifier to return encapsulate errno value") changed the return value of
    cpu notifier callbacks.

    Those callbacks no longer return NOTIFY_BAD on failure. But a few
    callbacks are called directly at init time, and their callers check
    the return value.

    I forgot to update the BUG_ON checks in those direct callers in that
    commit.

    Signed-off-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
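
To see why a check against NOTIFY_BAD stops catching failures once errno values are encapsulated, here is a user-space re-creation. The constants and the two helpers are written from memory of include/linux/notifier.h and should be treated as assumptions; the two check functions are hypothetical names for the old and a stricter test at a direct call site.

```c
#include <errno.h>
#include <stdbool.h>

#define NOTIFY_OK		0x0001
#define NOTIFY_STOP_MASK	0x8000
#define NOTIFY_BAD		(NOTIFY_STOP_MASK | 0x0002)

/* Encapsulate a negative errno value in a notifier return code. */
static int notifier_from_errno(int err)
{
	if (err)
		return NOTIFY_STOP_MASK | (NOTIFY_OK - err);
	return NOTIFY_OK;
}

/* Recover the errno value (0 when the callback succeeded). */
static int notifier_to_errno(int ret)
{
	ret &= ~NOTIFY_STOP_MASK;
	return ret > NOTIFY_OK ? NOTIFY_OK - ret : 0;
}

/* Old check at a direct call site: only catches the literal NOTIFY_BAD. */
static bool old_check_fires(int ret)
{
	return ret == NOTIFY_BAD;
}

/* Stricter check: anything other than NOTIFY_OK counts as a failure. */
static bool new_check_fires(int ret)
{
	return ret != NOTIFY_OK;
}
```

A callback that fails with -ENOMEM returns a code that is neither NOTIFY_OK nor NOTIFY_BAD, so only the stricter check notices it.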
     
  • Child groups should have a greater depth than their parents. Prior to
    this change, the parent would incorrectly report zero memory usage for
    child cgroups when use_hierarchy is enabled.

    test script:
    mount -t cgroup none /cgroups -o memory
    cd /cgroups
    mkdir cg1

    echo 1 > cg1/memory.use_hierarchy
    mkdir cg1/cg11

    echo $$ > cg1/cg11/tasks
    dd if=/dev/zero of=/tmp/foo bs=1M count=1

    echo
    echo CHILD
    grep cache cg1/cg11/memory.stat

    echo
    echo PARENT
    grep cache cg1/memory.stat

    echo $$ > tasks
    rmdir cg1/cg11 cg1
    cd /
    umount /cgroups

    Using fae9c79, a recent patch that changed alloc_css_id() depth computation,
    the parent incorrectly reports zero usage:
    root@ubuntu:~# ./test
    1+0 records in
    1+0 records out
    1048576 bytes (1.0 MB) copied, 0.0151844 s, 69.1 MB/s

    CHILD
    cache 1048576
    total_cache 1048576

    PARENT
    cache 0
    total_cache 0

    With this patch, the parent correctly includes child usage:
    root@ubuntu:~# ./test
    1+0 records in
    1+0 records out
    1048576 bytes (1.0 MB) copied, 0.0136827 s, 76.6 MB/s

    CHILD
    cache 1052672
    total_cache 1052672

    PARENT
    cache 0
    total_cache 1052672

    Signed-off-by: Greg Thelen
    Acked-by: Paul Menage
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Li Zefan
    Cc: [2.6.34.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
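
One way to picture why the depth matters is a simplified model in which each ID records its ancestors in an array indexed by depth. This model is an assumption made for illustration (the struct, MAX_DEPTH, and the depth_step parameter are hypothetical), not the kernel's css_id code. If a child reuses its parent's depth, it overwrites the parent's slot and the ancestry test fails, which would hide the child's usage from the parent.

```c
#include <stdbool.h>

#define MAX_DEPTH 8	/* hypothetical limit for this sketch */

struct css_id_sketch {
	unsigned short id;
	unsigned short depth;
	unsigned short stack[MAX_DEPTH];	/* stack[d] = ancestor id at depth d */
};

/* depth_step is 1 for the intended behaviour, 0 to model the regression. */
static void init_child(struct css_id_sketch *child,
		       const struct css_id_sketch *parent,
		       unsigned short id, unsigned short depth_step)
{
	child->id = id;
	child->depth = (unsigned short)(parent->depth + depth_step);
	for (unsigned short d = 0; d <= parent->depth; d++)
		child->stack[d] = parent->stack[d];
	child->stack[child->depth] = id;	/* own slot */
}

static bool is_ancestor(const struct css_id_sketch *child,
			const struct css_id_sketch *root)
{
	return child->depth >= root->depth &&
	       child->stack[root->depth] == root->id;
}
```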
     
  • task_struct->personality is "unsigned int", but sys_personality() paths use
    "unsigned long personality". This means that every assignment or
    comparison can go wrong. In particular, if this argument does not fit
    into "unsigned int" __set_personality() changes the caller's personality
    and then sys_personality() returns -EINVAL.

    Turn this argument into "unsigned int" and avoid overflows. Obviously,
    this is a user-visible change: we just ignore the upper bits. But this
    can't break a sane application.

    There is another thing which can confuse poorly written applications.
    User-space thinks that this syscall returns int, not long. This means
    that the returned value can be negative and look like the error code. But
    note that libc won't be confused and thus errno won't be set, and with
    this patch the user-space can never get -1 unless sys_personality() really
    fails. And, most importantly, the negative RET != -1 is only possible if
    that app previously called personality(RET).

    Pointed-out-by: Wenming Zhang
    Suggested-by: Linus Torvalds
    Signed-off-by: Oleg Nesterov
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
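
The width mismatch can be modelled in user space. In this sketch uint64_t stands in for a 64-bit "unsigned long", a global stands in for task_struct->personality, and both function names are hypothetical. It shows the old shape changing the stored value and then reporting -EINVAL, while the new shape simply ignores the upper bits.

```c
#include <errno.h>
#include <stdint.h>

static unsigned int task_personality;	/* stand-in for task_struct->personality */

/* Old shape: 64-bit argument, 32-bit field. */
static long sys_personality_old(uint64_t personality)
{
	long old = (long)task_personality;

	if (personality != 0xffffffff) {
		task_personality = (unsigned int)personality;	/* upper bits lost */
		if (task_personality != personality)
			return -EINVAL;	/* but the field was already changed */
	}
	return old;
}

/* New shape: the argument has the same width as the field. */
static long sys_personality_new(unsigned int personality)
{
	unsigned int old = task_personality;

	if (personality != 0xffffffff)
		task_personality = personality;
	return (long)old;
}
```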
     

04 Jun, 2010

2 commits

  • …l/git/tip/linux-2.6-tip

    * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    sched, trace: Fix sched_switch() prev_state argument
    sched: Fix wake_affine() vs RT tasks
    sched: Make sure timers have migrated before killing the migration_thread

    Linus Torvalds
     
  • …el/git/tip/linux-2.6-tip

    * 'perf-fixes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    perf: Fix crash in swevents
    perf buildid-list: Fix --with-hits event processing
    perf scripts python: Give field dict to unhandled callback
    perf hist: fix objdump output parsing
    perf-record: Check correct pid when forking
    perf: Do the comm inheritance per thread in event__process_task
    perf: Use event__process_task from perf sched
    perf: Process comm events by tid
    blktrace: Fix new kernel-doc warnings
    perf_events: Fix unincremented buffer base on partial copy
    perf_events: Fix event scheduling issues introduced by transactional API
    perf_events, trace: Fix perf_trace_destroy(), mutex went missing
    perf_events, trace: Fix probe unregister race
    perf_events: Fix races in group composition
    perf_events: Fix races and clean up perf_event and perf_mmap_data interaction

    Linus Torvalds
     

03 Jun, 2010

2 commits

  • Frederic reported that because swevents handling doesn't disable IRQs
    anymore, we can get a recursion of perf_adjust_period(), once from
    overflow handling and once from the tick.

    If both call ->disable, we get a double hlist_del_rcu() and trigger
    a LIST_POISON2 dereference.

    Since we don't actually need to stop/start a swevent to reprogram
    the hardware (there is no hardware to program), simply nop out these
    callbacks for the swevent pmu.

    Reported-by: Frederic Weisbecker
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • This changes the interface to be based on bytes instead. The API
    matches that of F_SETPIPE_SZ in that it rounds up the passed in
    size so that the resulting page array is a power-of-2 in size.

    The proc file is renamed to /proc/sys/fs/pipe-max-size to
    reflect this change.

    Signed-off-by: Jens Axboe

    Jens Axboe
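
The rounding rule described here (bytes in, power-of-two number of pages out) could be sketched as below. The 4 KiB page size and both function names are assumptions for illustration, and overflow for very large inputs is not handled.

```c
#include <stddef.h>

#define PAGE_SIZE_SKETCH 4096UL	/* assumption: 4 KiB pages */

/* Smallest power of two that is >= n (returns 1 for n == 0). */
static unsigned long roundup_pow_of_two_sketch(unsigned long n)
{
	unsigned long p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* Round a byte count up to a power-of-two number of pages, in bytes. */
static unsigned long round_pipe_size_sketch(unsigned long bytes)
{
	unsigned long nr_pages =
		(bytes + PAGE_SIZE_SKETCH - 1) / PAGE_SIZE_SKETCH;

	return roundup_pow_of_two_sketch(nr_pages) * PAGE_SIZE_SKETCH;
}
```

Under these assumptions a request for 3 pages' worth of bytes (12288) yields 4 pages (16384).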
     

02 Jun, 2010

1 commit

  • In commit e9fb7631ebcd ("cpu-hotplug: introduce cpu_notify(),
    __cpu_notify(), cpu_notify_nofail()") the new helper functions access
    cpu_chain. As a result, it shouldn't be marked __cpuinitdata (via
    section mismatch warning).

    Alternatively, the helper functions should be forced inline, or marked
    __ref or __cpuinit. In the meantime, this patch silences the warning
    the trivial way.

    Signed-off-by: Daniel J Blueman
    Signed-off-by: Linus Torvalds

    Daniel J Blueman
     

01 Jun, 2010

3 commits

  • * 'for-35' of git://repo.or.cz/linux-kbuild: (81 commits)
    kbuild: Revert part of e8d400a to resolve a conflict
    kbuild: Fix checking of scm-identifier variable
    gconfig: add support to show hidden options that have prompts
    menuconfig: add support to show hidden options which have prompts
    gconfig: remove show_debug option
    gconfig: remove dbg_print_ptype() and dbg_print_stype()
    kconfig: fix zconfdump()
    kconfig: some small fixes
    add random binaries to .gitignore
    kbuild: Include gen_initramfs_list.sh and the file list in the .d file
    kconfig: recalc symbol value before showing search results
    .gitignore: ignore *.lzo files
    headerdep: perlcritic warning
    scripts/Makefile.lib: Align the output of LZO
    kbuild: Generate modules.builtin in make modules_install
    Revert "kbuild: specify absolute paths for cscope"
    kbuild: Do not unnecessarily regenerate modules.builtin
    headers_install: use local file handles
    headers_check: fix perl warnings
    export_report: fix perl warnings
    ...

    Linus Torvalds
     
  • Mike reports that since e9e9250b (sched: Scale down cpu_power due to RT
    tasks), wake_affine() goes funny on RT tasks due to them still having a
    !0 weight and wake_affine() still subtracts that from the rq weight.

    Since nobody should be using se->weight for RT tasks, set the value to
    zero. Also, since we now use ->cpu_power to normalize rq weights to
    account for RT cpu usage, add that factor into the imbalance computation.

    Reported-by: Mike Galbraith
    Tested-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Rafael sees an occasional crash at percpu_modfree from kernel/module.c; it
    only occurred with another (since-reverted) patch, but that patch simply
    changed timing to uncover this bug; it was otherwise unrelated.

    The comment about the mod being freed is self-explanatory, but neither
    Tejun nor I read it. This bug was introduced in 259354deaa, after it
    had previously been fixed in 6e2b75740b. How embarrassing.

    Reported-by: "Rafael J. Wysocki"
    Signed-off-by: Rusty Russell
    Embarrassingly-Acked-by: Tejun Heo
    Cc: Masami Hiramatsu
    Tested-by: "Rafael J. Wysocki"
    Signed-off-by: Linus Torvalds

    Rusty Russell
     

31 May, 2010

12 commits

  • Fix blktrace.c kernel-doc warnings:
    Warning(kernel/trace/blktrace.c:858): No description found for parameter 'ignore'
    Warning(kernel/trace/blktrace.c:890): No description found for parameter 'ignore'

    Signed-off-by: Randy Dunlap
    Cc: Jens Axboe
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Randy Dunlap
     
  • If a sample size crosses to the next page boundary, the copy
    will be made in more than one step. However we forget to advance
    the source offset for the next copy, leading to unexpected double
    copies that completely mess up the traces.

    This fixes various kinds of bad traces that have irrelevant
    data inside, as an example:

    geany-4979 [001] 5758.077775: sched_switch: prev_comm=! prev_pid=121
    prev_prio=0 prev_state=S|D|Z|X|x ==> next_comm= next_pid=7497072
    next_prio=0

    Signed-off-by: Frederic Weisbecker
    Cc: Arnaldo Carvalho de Melo
    Cc: Paul Mackerras
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
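
A copy that is split at page boundaries has to advance both the destination offset and the source pointer after each chunk. The sketch below uses a deliberately tiny page size and hypothetical names; it is not the perf output code, only the loop shape the message describes.

```c
#include <stddef.h>
#include <string.h>

#define PAGE_SKETCH 4	/* tiny page size so the boundary is easy to hit */
#define NR_PAGES 4

struct paged_buf {
	char pages[NR_PAGES][PAGE_SKETCH];
};

/* Copy len bytes to byte position offset, splitting at page boundaries. */
static void paged_copy(struct paged_buf *b, size_t offset,
		       const void *buf, size_t len)
{
	const char *src = buf;

	while (len) {
		size_t page = (offset / PAGE_SKETCH) % NR_PAGES;
		size_t in_page = offset % PAGE_SKETCH;
		size_t n = PAGE_SKETCH - in_page;

		if (n > len)
			n = len;
		memcpy(&b->pages[page][in_page], src, n);

		src += n;	/* the step the commit says was forgotten */
		offset += n;
		len -= n;
	}
}
```

Without the `src += n` line, every chunk after the first would copy the start of the sample again, which matches the "double copies" the message mentions.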
     
  • The transactional API patch between the generic and model-specific
    code introduced several important bugs with event scheduling, at
    least on X86. If you had pinned events, e.g., watchdog, and were
    over-committing the PMU, you would get bogus counts. The bug was
    showing up on Intel CPUs because events would move around more
    often than on AMD. But the problem also existed on AMD, though it was
    harder to expose.

    The issues were:

    - group_sched_in() was missing a cancel_txn() in the error path

    - cpuc->n_added was not properly maintained, leading to missing
    actions in hw_perf_enable(), i.e., n_running being 0. You cannot
    update n_added until you know the transaction has succeeded. In
    case of failed transaction n_added was not adjusted back.

    - in case of failed transactions, event_sched_out() was called
    and eventually invoked x86_disable_event() to touch the HW reg.
    But with transactions, on X86, event_sched_in() does not touch
    HW registers, it simply collects events into a list. Thus, you
    could end up calling x86_disable_event() on a counter which
    did not correspond to the current event when idx != -1.

    The patch modifies the generic and X86 code to avoid all those problems.

    First, we keep track of the number of events added last. In case the
    transaction fails, we subtract them from n_added. This approach is
    necessary (as opposed to delaying updates to n_added) because not all
    event updates use the transaction API, e.g., single events.

    Second, we encapsulate the event_sched_in() and event_sched_out() in
    group_sched_in() inside the transaction. That makes the operations
    symmetrical and you can also detect that you are inside a transaction
    and skip the HW reg access by checking cpuc->group_flag.

    With this patch, you can now overcommit the PMU even with pinned
    system-wide events present and still get valid counts.

    Signed-off-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Stephane Eranian
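
The bookkeeping described above (remember how many events the current transaction added, and subtract them again on failure) can be sketched like this. The struct, the capacity constant, and the function names are hypothetical simplifications, not the x86 PMU code.

```c
#include <errno.h>

#define MAX_EVENTS 4	/* hypothetical counter capacity */

struct cpu_hw_sketch {
	int n_events;	/* events collected so far */
	int n_added;	/* events added since the last enable */
	int n_txn;	/* events added inside the current transaction */
	int in_txn;
};

static void start_txn(struct cpu_hw_sketch *c)
{
	c->in_txn = 1;
	c->n_txn = 0;
}

/* Also used outside transactions, e.g. for single events. */
static int add_event(struct cpu_hw_sketch *c)
{
	if (c->n_events >= MAX_EVENTS)
		return -ENOSPC;
	c->n_events++;
	c->n_added++;
	if (c->in_txn)
		c->n_txn++;
	return 0;
}

/* Roll back only what this transaction added. */
static void cancel_txn(struct cpu_hw_sketch *c)
{
	c->n_added -= c->n_txn;
	c->n_events -= c->n_txn;
	c->n_txn = 0;
	c->in_txn = 0;
}

static void commit_txn(struct cpu_hw_sketch *c)
{
	c->in_txn = 0;
}

/* Schedule a group of n events: all of them or none. */
static int group_sched_in_sketch(struct cpu_hw_sketch *c, int n)
{
	start_txn(c);
	for (int i = 0; i < n; i++) {
		int err = add_event(c);

		if (err) {
			cancel_txn(c);	/* the error path must not skip this */
			return err;
		}
	}
	commit_txn(c);
	return 0;
}
```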
     
  • Steve spotted I forgot to do the destroy under event_mutex.

    Reported-by: Steven Rostedt
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • tracepoint_probe_unregister() does not synchronize against the probe
    callbacks, so do that explicitly. This properly serializes the callbacks
    and the free of the data used therein.

    Also, use this_cpu_ptr() where possible.

    Acked-by: Frederic Weisbecker
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Group siblings don't pin each-other or the parent, so when we destroy
    events we must make sure to clean up all cross referencing pointers.

    In particular, for destruction of a group leader we must be able to
    find all its siblings and remove their reference to it.

    This means that detaching an event from its context must not detach it
    from the group, otherwise we can end up failing to clear all pointers.

    Solve this by clearly separating the attachment to a context and
    attachment to a group, and keep the group composed until we destroy
    the events.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • In order to move toward separate buffer objects, rework the whole
    perf_mmap_data construct to be a more self-sufficient entity, one
    with its own lifetime rules.

    This greatly sanitizes the whole output redirection code, which
    was riddled with bugs and races.

    Signed-off-by: Peter Zijlstra
    Cc:
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Problem: In a stress test where some heavy tests were running along with
    regular CPU offlining and onlining, a hang was observed. The system seems
    to be hung at a point where migration_call() tries to kill the
    migration_thread of the dying CPU, which just got moved to the current
    CPU. This migration thread does not get a chance to run (and die) since
    rt_throttled is set to 1 on current, and it doesn't get cleared as the
    hrtimer which is supposed to reset the rt bandwidth
    (sched_rt_period_timer) is tied to the CPU which we just marked dead!

    Solution: This patch pushes the killing of migration thread to
    "CPU_POST_DEAD" event. By then all the timers (including
    sched_rt_period_timer) should have got migrated (along with other
    callbacks).

    Signed-off-by: Amit Arora
    Signed-off-by: Gautham R Shenoy
    Acked-by: Tejun Heo
    Signed-off-by: Peter Zijlstra
    Cc: Thomas Gleixner
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Amit K. Arora
     
  • …/git/tip/linux-2.6-tip

    * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    mutex: Fix optimistic spinning vs. BKL

    Linus Torvalds
     
  • …/git/tip/linux-2.6-tip

    * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    perf tui: Fix last use_browser problem related to .perfconfig
    perf symbols: Add the build id cache to the vmlinux path
    perf tui: Reset use_browser if stdout is not a tty
    ring-buffer: Move zeroing out excess in page to ring buffer code
    ring-buffer: Reset "real_end" when page is filled

    Linus Torvalds
     
  • If there's only one CPU online when disable_nonboot_cpus() is called,
    the error variable will not be initialized and that may lead to
    erroneous behavior. Fix this issue by initializing error in
    disable_nonboot_cpus() as appropriate.

    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
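
The uninitialized-variable case is easy to see in a reduced form: when only the boot CPU is online, the loop body never runs, so the return value is whatever the variable held at declaration. The helper and its fail_cpu parameter below are hypothetical and exist only to make the sketch testable.

```c
#include <errno.h>

/* Hypothetical stand-in for taking one CPU offline; fails for fail_cpu. */
static int cpu_down_sketch(int cpu, int fail_cpu)
{
	return cpu == fail_cpu ? -EBUSY : 0;
}

static int disable_nonboot_cpus_sketch(int nr_online, int fail_cpu)
{
	int error = 0;	/* without this, nr_online == 1 returns garbage */

	/* CPU 0 is treated as the boot CPU and is skipped. */
	for (int cpu = 1; cpu < nr_online; cpu++) {
		error = cpu_down_sketch(cpu, fail_cpu);
		if (error)
			break;
	}
	return error;
}
```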
     
  • This reverts commit 0ac0c0d0f837c499afd02a802f9cf52d3027fa3b, which
    caused cross-architecture build problems for all the wrong reasons.
    IA64 already added its own version of __node_random(), but the fact is,
    there is nothing architectural about the function, and the original
    commit was just badly done. Revert it, since no fix is forthcoming.

    Requested-by: Stephen Rothwell
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

30 May, 2010

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
    ceph: clean up on forwarded aborted mds request
    ceph: fix leak of osd authorizer
    ceph: close out mds, osd connections before stopping auth
    ceph: make lease code DN specific
    fs/ceph: Use ERR_CAST
    ceph: renew auth tickets before they expire
    ceph: do not resend mon requests on auth ticket renewal
    ceph: removed duplicated #includes
    ceph: avoid possible null dereference
    ceph: make mds requests killable, not interruptible
    sched: add wait_for_completion_killable_timeout

    Linus Torvalds
     
  • Add missing _killable_timeout variant for wait_for_completion that will
    return when a timeout expires or the task is killed.

    CC: Ingo Molnar
    CC: Andreas Herrmann
    CC: Thomas Gleixner
    CC: Mike Galbraith
    Acked-by: Peter Zijlstra
    Signed-off-by: Sage Weil

    Sage Weil
     

29 May, 2010

1 commit