03 Mar, 2013

1 commit

  • Pull ext4 bug fixes from Ted Ts'o:
    "Various bug fixes for ext4. The most important is a fix for the new
    extent cache's slab shrinker which can cause significant, user-visible
    pauses when the system is under memory pressure."

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: enable quotas before orphan cleanup
    ext4: don't allow quota mount options when quota feature enabled
    ext4: fix a warning from sparse check for ext4_dir_llseek
    ext4: convert number of blocks to clusters properly
    ext4: fix possible memory leak in ext4_remount()
    jbd2: fix ERR_PTR dereference in jbd2__journal_start
    ext4: use percpu counter for extent cache count
    ext4: optimize ext4_es_shrink()

    Linus Torvalds
     

01 Mar, 2013

2 commits

  • When the system is under memory pressure, ext4_es_shrink() will get
    called very often. So optimize returning the number of items in the
    file system's extent status cache by keeping a per-filesystem count,
    instead of calculating it each time by scanning all of the inodes in
    the extent status cache.

    Also rename the slab used for the extent status cache to be
    "ext4_extent_status" so it's obvious the slab in question is created
    by ext4.

    Signed-off-by: "Theodore Ts'o"
    Cc: Zheng Liu

    Theodore Ts'o
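
    The counting change described in the commit above can be pictured with
    a small user-space model. This is only a sketch: the class and method
    names (ExtentCacheFS, count_by_scan, count_fast) are hypothetical and
    are not kernel APIs. It contrasts scanning every inode on each query
    with returning a per-filesystem count that is maintained on every
    insert and free.

```python
# Toy model only; all names here are hypothetical, not kernel identifiers.
class Inode:
    def __init__(self, fs):
        self.fs = fs
        self.cached = 0  # cached extent-status entries for this inode

    def add_extent(self):
        self.cached += 1
        self.fs.cached_total += 1  # keep the per-filesystem count in step

    def free_extent(self):
        if self.cached:
            self.cached -= 1
            self.fs.cached_total -= 1


class ExtentCacheFS:
    def __init__(self):
        self.inodes = []
        self.cached_total = 0  # per-filesystem count

    def new_inode(self):
        inode = Inode(self)
        self.inodes.append(inode)
        return inode

    def count_by_scan(self):
        # Before: walk every inode each time the count is requested.
        return sum(inode.cached for inode in self.inodes)

    def count_fast(self):
        # After: return the maintained counter in constant time.
        return self.cached_total
```

    Both methods return the same number; only the cost per query differs,
    which matters when the query is issued very often.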
     
  • Pull block IO core bits from Jens Axboe:
    "Below are the core block IO bits for 3.9. It was delayed a few days
    since my workstation kept crashing every 2-8h after pulling it into
    current -git, but turns out it is a bug in the new pstate code (divide
    by zero, will report separately). In any case, it contains:

    - The big cfq/blkcg update from Tejun and Vivek.

    - Additional block and writeback tracepoints from Tejun.

    - Improvement of the should sort (based on queues) logic in the plug
    flushing.

    - _io() variants of the wait_for_completion() interface, using
    io_schedule() instead of schedule() to contribute to io wait
    properly.

    - Various little fixes.

    You'll get two trivial merge conflicts, which should be easy enough to
    fix up"

    Fix up the trivial conflicts due to hlist traversal cleanups (commit
    b67bfe0d42ca: "hlist: drop the node parameter from iterators").

    * 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits)
    block: remove redundant check to bd_openers()
    block: use i_size_write() in bd_set_size()
    cfq: fix lock imbalance with failed allocations
    drivers/block/swim3.c: fix null pointer dereference
    block: don't select PERCPU_RWSEM
    block: account iowait time when waiting for completion of IO request
    sched: add wait_for_completion_io[_timeout]
    writeback: add more tracepoints
    block: add block_{touch|dirty}_buffer tracepoint
    buffer: make touch_buffer() an exported function
    block: add @req to bio_{front|back}_merge tracepoints
    block: add missing block_bio_complete() tracepoint
    block: Remove should_sort judgement when flush blk_plug
    block,elevator: use new hashtable implementation
    cfq-iosched: add hierarchical cfq_group statistics
    cfq-iosched: collect stats from dead cfqgs
    cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()
    blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
    block: RCU free request_queue
    blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()
    ...

    Linus Torvalds
     

27 Feb, 2013

1 commit

  • Pull ext4 updates from Theodore Ts'o:
    "The one new feature added in this patch series is the ability to use
    the "punch hole" functionality for inodes that are not using extent
    maps.

    In the bug fix category, we fixed some races in the AIO and fstrim
    code, and some potential NULL pointer dereferences and memory leaks in
    error handling code paths.

    In the optimization category, we fixed a performance regression in the
    jbd2 layer introduced by commit d9b01934d56a ("jbd: fix fsync() tid
    wraparound bug", introduced in v3.0) which shows up in the AIM7
    benchmark. We also further optimized jbd2 by minimizing the amount of
    time that transaction handles are held active.

    This patch series also features some additional enhancement of the
    extent status tree, which is now used to cache extent information in a
    more efficient/compact form than what we use on-disk."

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (65 commits)
    ext4: fix free clusters calculation in bigalloc filesystem
    ext4: no need to remove extent if len is 0 in ext4_es_remove_extent()
    ext4: fix xattr block allocation/release with bigalloc
    ext4: reclaim extents from extent status tree
    ext4: adjust some functions for reclaiming extents from extent status tree
    ext4: remove single extent cache
    ext4: lookup block mapping in extent status tree
    ext4: track all extent status in extent status tree
    ext4: let ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flag
    ext4: rename and improbe ext4_es_find_extent()
    ext4: add physical block and status member into extent status tree
    ext4: refine extent status tree
    ext4: use ERR_PTR() abstraction for ext4_append()
    ext4: refactor code to read directory blocks into ext4_read_dirblock()
    ext4: add debugging context for warning in ext4_da_update_reserve_space()
    ext4: use KERN_WARNING for warning messages
    jbd2: use module parameters instead of debugfs for jbd_debug
    ext4: use module parameters instead of debugfs for mballoc_debug
    ext4: start handle at the last possible moment when creating inodes
    ext4: fix the number of credits needed for acl ops with inline data
    ...

    Linus Torvalds
     

25 Feb, 2013

1 commit

  • Pull KVM updates from Marcelo Tosatti:
    "KVM updates for the 3.9 merge window, including x86 real mode
    emulation fixes, stronger memory slot interface restrictions, mmu_lock
    spinlock hold time reduction, improved handling of large page faults
    on shadow, initial APICv HW acceleration support, s390 channel IO
    based virtio, amongst others"

    * tag 'kvm-3.9-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (143 commits)
    Revert "KVM: MMU: lazily drop large spte"
    x86: pvclock kvm: align allocation size to page size
    KVM: nVMX: Remove redundant get_vmcs12 from nested_vmx_exit_handled_msr
    x86 emulator: fix parity calculation for AAD instruction
    KVM: PPC: BookE: Handle alignment interrupts
    booke: Added DBCR4 SPR number
    KVM: PPC: booke: Allow multiple exception types
    KVM: PPC: booke: use vcpu reference from thread_struct
    KVM: Remove user_alloc from struct kvm_memory_slot
    KVM: VMX: disable apicv by default
    KVM: s390: Fix handling of iscs.
    KVM: MMU: cleanup __direct_map
    KVM: MMU: remove pt_access in mmu_set_spte
    KVM: MMU: cleanup mapping-level
    KVM: MMU: lazily drop large spte
    KVM: VMX: cleanup vmx_set_cr0().
    KVM: VMX: add missing exit names to VMX_EXIT_REASONS array
    KVM: VMX: disable SMEP feature when guest is in non-paging mode
    KVM: Remove duplicate text in api.txt
    Revert "KVM: MMU: split kvm_mmu_free_page"
    ...

    Linus Torvalds
     

21 Feb, 2013

1 commit

  • Pull ACPI and power management updates from Rafael Wysocki:

    - Rework of the ACPI namespace scanning code from Rafael J. Wysocki
    with contributions from Bjorn Helgaas, Jiang Liu, Mika Westerberg,
    Toshi Kani, and Yinghai Lu.

    - ACPI power resources handling and ACPI device PM update from Rafael
    J Wysocki.

    - ACPICA update to version 20130117 from Bob Moore and Lv Zheng with
    contributions from Aaron Lu, Chao Guan, Jesper Juhl, and Tim Gardner.

    - Support for Intel Lynxpoint LPSS from Mika Westerberg.

    - cpuidle update from Len Brown including Intel Haswell support, C1
    state for intel_idle, removal of global pm_idle.

    - cpuidle fixes and cleanups from Daniel Lezcano.

    - cpufreq fixes and cleanups from Viresh Kumar and Fabio Baltieri with
    contributions from Stratos Karafotis and Rickard Andersson.

    - Intel P-states driver for Sandy Bridge processors from Dirk
    Brandewie.

    - cpufreq driver for Marvell Kirkwood SoCs from Andrew Lunn.

    - cpufreq fixes related to ordering issues between acpi-cpufreq and
    powernow-k8 from Borislav Petkov and Matthew Garrett.

    - cpufreq support for Calxeda Highbank processors from Mark Langsdorf
    and Rob Herring.

    - cpufreq driver for the Freescale i.MX6Q SoC and cpufreq-cpu0 update
    from Shawn Guo.

    - cpufreq Exynos fixes and cleanups from Jonghwan Choi, Sachin Kamat,
    and Inderpal Singh.

    - Support for "lightweight suspend" from Zhang Rui.

    - Removal of the deprecated power trace API from Paul Gortmaker.

    - Assorted updates from Andreas Fleig, Colin Ian King, Davidlohr Bueso,
    Joseph Salisbury, Kees Cook, Li Fei, Nishanth Menon, ShuoX Liu,
    Srinivas Pandruvada, Tejun Heo, Thomas Renninger, and Yasuaki
    Ishimatsu.

    * tag 'pm+acpi-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (267 commits)
    PM idle: remove global declaration of pm_idle
    unicore32 idle: delete stray pm_idle comment
    openrisc idle: delete pm_idle
    mn10300 idle: delete pm_idle
    microblaze idle: delete pm_idle
    m32r idle: delete pm_idle, and other dead idle code
    ia64 idle: delete pm_idle
    cris idle: delete idle and pm_idle
    ARM64 idle: delete pm_idle
    ARM idle: delete pm_idle
    blackfin idle: delete pm_idle
    sparc idle: rename pm_idle to sparc_idle
    sh idle: rename global pm_idle to static sh_idle
    x86 idle: rename global pm_idle to static x86_idle
    APM idle: register apm_cpu_idle via cpuidle
    cpufreq / intel_pstate: Add kernel command line option disable intel_pstate.
    cpufreq / intel_pstate: Change to disallow module build
    tools/power turbostat: display SMI count by default
    intel_idle: export both C1 and C1E
    ACPI / hotplug: Fix concurrency issues and memory leaks
    ...

    Linus Torvalds
     

20 Feb, 2013

2 commits

  • Pull workqueue changes from Tejun Heo:
    "A lot of reorganization is going on mostly to prepare for worker pools
    with custom attributes so that workqueue can replace custom pool
    implementations in places including writeback and btrfs and make CPU
    assignment in crypto more flexible.

    workqueue evolved from purely per-cpu design and implementation, so
    there are a lot of assumptions regarding being bound to CPUs and even
    unbound workqueues are implemented as an extension of the model -
    workqueues running on the special unbound CPU. Bulk of changes this
    round are about promoting worker_pools as the top level abstraction
    replacing global_cwq (global cpu workqueue). At this point, I'm
    fairly confident about getting custom worker pools working pretty soon
    and ready for the next merge window.

    Lai's patches are replacing the convoluted mb() dancing workqueue has
    been doing with a much simpler mechanism which only depends on
    assignment atomicity of long. For details, please read the commit
    message of 0b3dae68ac ("workqueue: simplify is-work-item-queued-here
    test"). While the change ends up adding one pointer to struct
    delayed_work, the inflation in percentage is less than five percent
    and it decouples delayed_work logic much more cleanly from usual work
    handling, removes the unusual memory barrier dancing, and allows for
    further simplification, so I think the trade-off is acceptable.

    There will be two more workqueue related pull requests and there are
    some shared commits among them. I'll write further pull requests
    assuming this pull request is pulled first."

    * 'for-3.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (37 commits)
    workqueue: un-GPL function delayed_work_timer_fn()
    workqueue: rename cpu_workqueue to pool_workqueue
    workqueue: reimplement is_chained_work() using current_wq_worker()
    workqueue: fix is_chained_work() regression
    workqueue: pick cwq instead of pool in __queue_work()
    workqueue: make get_work_pool_id() cheaper
    workqueue: move nr_running into worker_pool
    workqueue: cosmetic update in try_to_grab_pending()
    workqueue: simplify is-work-item-queued-here test
    workqueue: make work->data point to pool after try_to_grab_pending()
    workqueue: add delayed_work->wq to simplify reentrancy handling
    workqueue: make work_busy() test WORK_STRUCT_PENDING first
    workqueue: replace WORK_CPU_NONE/LAST with WORK_CPU_END
    workqueue: post global_cwq removal cleanups
    workqueue: rename nr_running variables
    workqueue: remove global_cwq
    workqueue: remove worker_pool->gcwq
    workqueue: replace for_each_worker_pool() with for_each_std_worker_pool()
    workqueue: make freezing/thawing per-pool
    workqueue: make hotplug processing per-pool
    ...

    Linus Torvalds
     
  • Pull perf changes from Ingo Molnar:
    "There are lots of improvements, the biggest changes are:

    Main kernel side changes:

    - Improve uprobes performance by adding 'pre-filtering' support, by
    Oleg Nesterov.

    - Make some POWER7 events available in sysfs, equivalent to what was
    done on x86, from Sukadev Bhattiprolu.

    - tracing updates by Steve Rostedt - mostly misc fixes and smaller
    improvements.

    - Use perf/event tracing to report PCI Express advanced errors, by
    Tony Luck.

    - Enable northbridge performance counters on AMD family 15h, by Jacob
    Shin.

    - This tracing commit:

    tracing: Remove the extra 4 bytes of padding in events

    changes the ABI. All involved parties (PowerTop in particular)
    seem to agree that it's safe to do now with the introduction of
    libtraceevent, but the devil is in the details ...

    Main tooling side changes:

    - Add 'event group view', from Namhyung Kim:

    To use it, 'perf record' should group events when recording. And
    then perf report parses the saved group relation from file header
    and prints them together if --group option is provided. You can
    use the 'perf evlist' command to see event group information:

    $ perf record -e '{ref-cycles,cycles}' noploop 1
    [ perf record: Woken up 2 times to write data ]
    [ perf record: Captured and wrote 0.385 MB perf.data (~16807 samples) ]

    $ perf evlist --group
    {ref-cycles,cycles}

    With this example, default perf report will show you each event
    separately.

    You can use --group option to enable event group view:

    $ perf report --group
    ...
    # group: {ref-cycles,cycles}
    # ========
    # Samples: 7K of event 'anon group { ref-cycles, cycles }'
    # Event count (approx.): 6876107743
    #
    #         Overhead  Command  Shared Object      Symbol
    # ................  .......  .................  ..........................
       99.84%   99.76%  noploop  noploop            [.] main
        0.07%    0.00%  noploop  ld-2.15.so         [.] strcmp
        0.03%    0.00%  noploop  [kernel.kallsyms]  [k] timerqueue_del
        0.03%    0.03%  noploop  [kernel.kallsyms]  [k] sched_clock_cpu
        0.02%    0.00%  noploop  [kernel.kallsyms]  [k] account_user_time
        0.01%    0.00%  noploop  [kernel.kallsyms]  [k] __alloc_pages_nodemask
        0.00%    0.00%  noploop  [kernel.kallsyms]  [k] native_write_msr_safe
        0.00%    0.11%  noploop  [kernel.kallsyms]  [k] _raw_spin_lock
        0.00%    0.06%  noploop  [kernel.kallsyms]  [k] find_get_page
        0.00%    0.02%  noploop  [kernel.kallsyms]  [k] rcu_check_callbacks
        0.00%    0.02%  noploop  [kernel.kallsyms]  [k] __current_kernel_time

    As you can see, the Overhead column now contains both ref-cycles
    and cycles, and the header line also shows the group information:
    'anon group { ref-cycles, cycles }'. The output is sorted by the
    period of the group leader first.

    - Initial GTK+ annotate browser, from Namhyung Kim.

    - Add option for runtime switching perf data file in perf report,
    just press 's' and a menu with the valid files found in the current
    directory will be presented, from Feng Tang.

    - Add support to display whole group data for raw columns, from Jiri
    Olsa.

    - Add per processor socket count aggregation in perf stat, from
    Stephane Eranian.

    - Add interval printing in 'perf stat', from Stephane Eranian.

    - 'perf test' improvements

    - Add support for wildcards in tracepoint system name, from Jiri
    Olsa.

    - Add anonymous huge page recognition, from Joshua Zhu.

    - perf build-id cache now can show DSOs present in a perf.data file
    that are not in the cache, to integrate with build-id servers being
    put in place by organizations such as Fedora.

    - perf top now shares more of the evsel config/creation routines with
    'record', paving the way for further integration like 'top'
    snapshots, etc.

    - perf top now supports DWARF callchains.

    - Fix mmap limitations on 32-bit, fix from David Miller.

    - 'perf bench numa mem' NUMA performance measurement suite

    - ... and lots of fixes, performance improvements, cleanups and other
    improvements I failed to list - see the shortlog and git log for
    details."

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (270 commits)
    perf/x86/amd: Enable northbridge performance counters on AMD family 15h
    perf/hwbp: Fix cleanup in case of kzalloc failure
    perf tools: Fix build with bison 2.3 and older.
    perf tools: Limit unwind support to x86 archs
    perf annotate: Make it to be able to skip unannotatable symbols
    perf gtk/annotate: Fail early if it can't annotate
    perf gtk/annotate: Show source lines with gray color
    perf gtk/annotate: Support multiple event annotation
    perf ui/gtk: Implement basic GTK2 annotation browser
    perf annotate: Fix warning message on a missing vmlinux
    perf buildid-cache: Add --update option
    uprobes/perf: Avoid uprobe_apply() whenever possible
    uprobes/perf: Teach trace_uprobe/perf code to use UPROBE_HANDLER_REMOVE
    uprobes/perf: Teach trace_uprobe/perf code to pre-filter
    uprobes/perf: Teach trace_uprobe/perf code to track the active perf_event's
    uprobes: Introduce uprobe_apply()
    perf: Introduce hw_perf_event->tp_target and ->tp_list
    uprobes/perf: Always increment trace_uprobe->nhit
    uprobes/tracing: Kill uprobe_trace_consumer, embed uprobe_consumer into trace_uprobe
    uprobes/tracing: Introduce is_trace_uprobe_enabled()
    ...

    Linus Torvalds
     

18 Feb, 2013

5 commits

  • Although extent status is loaded on demand, we also need to reclaim
    extents from the tree when we are under heavy memory pressure, because
    in some cases a fragmented extent tree makes the status tree consume
    too much memory.

    Here we maintain an LRU list in super_block. When the extent status of
    an inode is accessed and changed, the inode is moved to the tail
    of the list. The inode is dropped from the list when it is
    cleared. In the inode, a counter is added to count the number of
    cached objects in the extent status tree. Only written/unwritten/hole
    extents are counted, because delayed extents are not reclaimed:
    fiemap, bigalloc and seek_data/hole need them. The counter is
    incremented when a new extent is allocated and decremented when an
    extent is freed.

    In this commit we use the normal shrinker framework to reclaim memory
    from the status tree. ext4_es_reclaim_extents_count() traverses the LRU
    list to count the number of reclaimable extents. ext4_es_shrink() tries
    to reclaim written/unwritten/hole extents from the extent status tree.
    An inode that has been shrunk is moved to the tail of the LRU list.

    Signed-off-by: Zheng Liu
    Signed-off-by: "Theodore Ts'o"
    Cc: Jan Kara

    Zheng Liu
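
    The reclaim scheme in the commit above (an LRU list of inodes, a count
    of reclaimable extents, and a shrink pass that skips delayed extents)
    can be modelled in user space roughly as follows. This is a sketch
    under stated assumptions, not the kernel implementation: ExtentLRU and
    its methods are hypothetical names, and extents are reduced to bare
    status strings.

```python
from collections import OrderedDict

# Statuses the commit message says are reclaimable; delayed extents are kept.
RECLAIMABLE = {"written", "unwritten", "hole"}


class ExtentLRU:
    """Hypothetical model: inode number -> list of cached extent statuses."""

    def __init__(self):
        self.lru = OrderedDict()  # first key = least recently used inode

    def touch(self, ino, status):
        # Cache a new extent and move the inode to the tail of the list.
        self.lru.setdefault(ino, []).append(status)
        self.lru.move_to_end(ino)

    def count(self):
        # Walk the list and count reclaimable extents.
        return sum(1 for ext in self.lru.values() for s in ext if s in RECLAIMABLE)

    def shrink(self, nr_to_scan):
        # Reclaim up to nr_to_scan extents, starting from the head of the list.
        reclaimed = 0
        for ino in list(self.lru):
            if reclaimed >= nr_to_scan:
                break
            kept = []
            for status in self.lru[ino]:
                if status in RECLAIMABLE and reclaimed < nr_to_scan:
                    reclaimed += 1
                else:
                    kept.append(status)
            self.lru[ino] = kept
            self.lru.move_to_end(ino)  # a shrunk inode goes to the tail
        return reclaimed
```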
     
  • After tracking all extent status, we already have an extent cache in
    memory. Every time we want to look up a block mapping, we can first
    try to look it up in the extent status tree to avoid a potential disk
    I/O.

    A new function called ext4_es_lookup_extent is defined to do this
    work. When we try to look up a block mapping, we always call
    ext4_map_blocks and/or ext4_da_map_blocks. So in these functions we
    first try to look up the block mapping in the extent status tree.

    A new flag EXT4_GET_BLOCKS_NO_PUT_HOLE is used in ext4_da_map_blocks
    in order not to put a hole into the extent status tree, because this
    hole would be converted to a delayed extent in the tree immediately.

    Signed-off-by: Zheng Liu
    Signed-off-by: "Theodore Ts'o"
    Cc: Jan Kara

    Zheng Liu
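
    The lookup order described in the commit above (consult the in-memory
    extent cache first, fall back to the slow path only on a miss) is
    sketched below. The names ExtentStatusCache, map_block and
    read_from_disk are hypothetical, a sorted list stands in for the
    kernel's tree, and cached extents are assumed not to overlap.

```python
import bisect


class ExtentStatusCache:
    """Hypothetical cache of non-overlapping (lblk, length, pblk) extents."""

    def __init__(self):
        self.starts = []   # sorted first logical block of each extent
        self.extents = []  # parallel list of (lblk, length, pblk)

    def insert(self, lblk, length, pblk):
        i = bisect.bisect_left(self.starts, lblk)
        self.starts.insert(i, lblk)
        self.extents.insert(i, (lblk, length, pblk))

    def lookup(self, lblk):
        # Find the last extent starting at or before lblk, then range-check.
        i = bisect.bisect_right(self.starts, lblk) - 1
        if i >= 0:
            start, length, pblk = self.extents[i]
            if lblk < start + length:
                return pblk + (lblk - start)
        return None


def map_block(cache, lblk, read_from_disk):
    # Fast path: answer from the cache and avoid the potential disk I/O.
    pblk = cache.lookup(lblk)
    if pblk is not None:
        return pblk, "cache"
    # Slow path: read_from_disk is an assumed callback returning an extent.
    start, length, pblk0 = read_from_disk(lblk)
    cache.insert(start, length, pblk0)
    return pblk0 + (lblk - start), "disk"
```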
     
  • This commit renames ext4_es_find_extent to ext4_es_find_delayed_extent
    and improves the function. First, we split the input and output
    parameters. Second, the function no longer returns the first block of
    the next delayed extent after 'es'.

    Signed-off-by: Zheng Liu
    Signed-off-by: "Theodore Ts'o"
    Cc: Jan Kara

    Zheng Liu
     
  • This commit adds two members to the extent_status structure so that it
    records the physical block and the extent status. Here es_pblk is used
    to record both of them: a physical block number only has 48 bits, so
    the extent status can be stashed in the remaining bits to save some
    memory. Written, unwritten, delayed and hole are now defined as
    statuses.

    Because a new member is added to the extent status tree, all interfaces
    need to be adjusted.

    Signed-off-by: Zheng Liu
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Jan Kara

    Zheng Liu
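
    The packing idea in the commit above (a 48-bit physical block number
    and the status flags sharing one 64-bit field) works out as in the
    sketch below. The 48-bit width comes from the commit message; the
    exact flag positions in the upper bits are an assumption made for
    illustration, and pack/unpack are hypothetical helper names.

```python
PBLK_BITS = 48                      # width given in the commit message
PBLK_MASK = (1 << PBLK_BITS) - 1

# Assumed flag positions in the otherwise unused upper bits.
ES_WRITTEN = 1 << 63
ES_UNWRITTEN = 1 << 62
ES_DELAYED = 1 << 61
ES_HOLE = 1 << 60
STATUS_MASK = ES_WRITTEN | ES_UNWRITTEN | ES_DELAYED | ES_HOLE


def pack(pblk, status):
    """Combine a physical block number and status flags into one 64-bit value."""
    if pblk & ~PBLK_MASK:
        raise ValueError("physical block does not fit in 48 bits")
    if status & ~STATUS_MASK:
        raise ValueError("unknown status bits")
    return status | pblk


def unpack(es_pblk):
    """Split the combined value back into (physical block, status flags)."""
    return es_pblk & PBLK_MASK, es_pblk & STATUS_MASK
```

    Because the two masks do not overlap, packing and unpacking are
    lossless and the combined value still fits in 64 bits.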
     
  • This commit refines the extent status tree code.

    1) A prefix 'es_' is added to the extent status tree structure
    members.

    2) Refactored es_remove_extent() so that __es_remove_extent() can be
    used by es_insert_extent() to remove the old extent entry(-ies) before
    inserting a new one.

    3) Rename extent_status_end() to ext4_es_end()

    4) ext4_es_can_be_merged() is defined to check whether two extents can
    be merged or not.

    5) Updated and clarified comments.

    Signed-off-by: Zheng Liu
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Jan Kara

    Zheng Liu
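
    The commit above introduces a mergeability check but its message does
    not spell out the criteria. The sketch below shows one plausible set
    of conditions, all of them assumptions: the extents must be logically
    adjacent, share a status, and (for extents that carry a physical
    block) be physically adjacent too. Extent and can_be_merged are
    hypothetical names.

```python
from collections import namedtuple

# Hypothetical extent record: logical start, length, physical start, status.
Extent = namedtuple("Extent", "lblk length pblk status")


def can_be_merged(a, b):
    """Return True if extent b can be appended to extent a (assumed rules)."""
    if a.status != b.status:
        return False
    if a.lblk + a.length != b.lblk:
        return False  # not logically contiguous
    if a.status in ("written", "unwritten") and a.pblk + a.length != b.pblk:
        return False  # mapped extents must also be physically contiguous
    return True
```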
     

14 Feb, 2013

1 commit

  • workqueue has moved away from global_cwqs to worker_pools and with the
    scheduled custom worker pools, workqueues will be associated with
    pools which don't have anything to do with CPUs. The workqueue code
    went through a significant amount of changes recently and mass renaming
    isn't likely to hurt much additionally. Let's replace 'cpu' with
    'pool' so that it reflects the current design.

    * s/struct cpu_workqueue_struct/struct pool_workqueue/
    * s/cpu_wq/pool_wq/
    * s/cwq/pwq/

    This patch is purely cosmetic.

    Signed-off-by: Tejun Heo

    Tejun Heo
     

09 Feb, 2013

1 commit


07 Feb, 2013

1 commit

  • Track the delay between when we first request that the commit begin
    and when it actually begins, so we can see how much of a gap exists.
    In theory, this should just be the remaining scheduling quantum of
    the thread which requested the commit (assuming it was not a
    synchronous operation which triggered the commit request) plus
    scheduling overhead; however, it's possible that real time processes
    might keep the kjournald thread from executing.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
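
    The measurement described in the commit above amounts to remembering
    when a commit was first requested and subtracting that from the time
    the commit actually begins. A minimal sketch follows; the class and
    method names are hypothetical, and the clock is injectable so the
    behaviour can be checked without real scheduling delays.

```python
import time


class CommitDelayTracker:
    """Hypothetical model of request-to-start delay tracking."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.requested_at = None  # time of the first pending request
        self.delays = []

    def request_commit(self):
        # Only the first request for a pending commit sets the timestamp.
        if self.requested_at is None:
            self.requested_at = self.clock()

    def commit_begins(self):
        # Delay is zero when the commit starts without a prior request.
        if self.requested_at is None:
            delay = 0.0
        else:
            delay = self.clock() - self.requested_at
        self.requested_at = None
        self.delays.append(delay)
        return delay
```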
     

29 Jan, 2013

1 commit


26 Jan, 2013

1 commit

  • The text in Documentation said it would be removed in 2.6.41;
    the text in the Kconfig said removal in the 3.1 release. Either
    way you look at it, we are well past both, so push it off a cliff.

    Note that the POWER_CSTATE and the POWER_PSTATE are part of the
    legacy tracing API. Remove all tracepoints which use these flags.
    As can be seen from context, most already have a trace entry via
    trace_cpu_idle anyways.

    Also, the cpufreq/cpufreq.c PSTATE one is actually unpaired, as
    compared to the CSTATE ones which all have a clear start/stop.
    As part of this, the trace_power_frequency also becomes orphaned,
    so it too is deleted.

    Signed-off-by: Paul Gortmaker
    Acked-by: Steven Rostedt
    Signed-off-by: Rafael J. Wysocki

    Paul Gortmaker
     

25 Jan, 2013

1 commit

  • Move gcwq->cpu to pool->cpu. This introduces a couple places where
    gcwq->pools[0].cpu is used. These will soon go away as gcwq is
    further reduced.

    This is part of an effort to remove global_cwq and make worker_pool
    the top level abstraction, which in turn will help implementing worker
    pools with user-specified attributes.

    Signed-off-by: Tejun Heo
    Reviewed-by: Lai Jiangshan

    Tejun Heo
     

17 Jan, 2013

1 commit


14 Jan, 2013

4 commits

  • Add tracepoints for page dirtying, writeback_single_inode start, inode
    dirtying and writeback. For the latter two inode events, a pair of
    events are defined to denote start and end of the operations (the
    starting one has _start suffix and the one w/o suffix happens after
    the operation is complete). These inode ops are FS specific and can
    be non-trivial and having enclosing tracepoints is useful for external
    tracers.

    This is part of tracepoint additions to improve visibility into
    dirtying / writeback operations for io tracer and userland.

    v2: writeback_dirty_inode[_start] TPs may be called for files on
    pseudo FSes w/ unregistered bdi. Check whether bdi->dev is %NULL
    before dereferencing.

    v3: buffer dirtying moved to a block TP.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • The former is triggered from touch_buffer() and the latter
    mark_buffer_dirty().

    This is part of tracepoint additions to improve visibility into
    dirtying / writeback operations for io tracer and userland.

    v2: Transformed writeback_dirty_buffer to block_dirty_buffer and made
    it share TP definition with block_touch_buffer.

    Signed-off-by: Tejun Heo
    Cc: Fengguang Wu
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • bio_{front|back}_merge tracepoints report a bio merging into an
    existing request but didn't specify which request the bio is being
    merged into. Add @req to it. This makes it impossible to share the
    event template with block_bio_queue - split it out.

    @req isn't used or exported to userland at this point and there is no
    userland visible behavior change. Later changes will make use of the
    extra parameter.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • bio completion didn't kick block_bio_complete TP. Only dm was
    explicitly triggering the TP on IO completion. This makes
    block_bio_complete TP useless for tracers which want to know about
    bios, and all other bio based drivers skip generating blktrace
    completion events.

    This patch makes all bio completions via bio_endio() generate
    block_bio_complete TP.

    * Explicit trace_block_bio_complete() invocation removed from dm and
    the trace point is unexported.

    * @rq dropped from trace_block_bio_complete(). bios may fly around
    w/o queue associated. Verifying and accessing the associated queue
    belongs to TP probes.

    * blktrace now gets both request and bio completions. Make it ignore
    bio completions if request completion path is happening.

    This makes all bio based drivers generate blktrace completion events
    properly and makes the block_bio_complete TP actually useful.

    v2: With this change, block_bio_complete TP could be invoked on sg
    commands which have bio's with %NULL bi_bdev. Update TP
    assignment code to check whether bio->bi_bdev is %NULL before
    dereferencing.

    Signed-off-by: Tejun Heo
    Original-patch-by: Namhyung Kim
    Cc: Tejun Heo
    Cc: Steven Rostedt
    Cc: Alasdair Kergon
    Cc: dm-devel@redhat.com
    Cc: Neil Brown
    Signed-off-by: Jens Axboe

    Tejun Heo
     

11 Jan, 2013

1 commit


09 Jan, 2013

3 commits

  • This commit adds event tracing for callback acceleration to allow better
    tracking of callbacks through the system.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • When the type of global variable blimit changed from int to long, the
    type of the blimit argument of trace_rcu_batch_start() should have
    changed as well. This commit fixes this issue.

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • Currently, rcutorture traces every read-side access. This can be
    problematic because even a two-minute rcutorture run on a two-CPU system
    can generate 28,853,363 reads. Normally, only a failing read is of
    interest, so this commit adjusts rcutorture's tracing to only
    trace failing reads. The resulting event tracing records the time
    and the ->completed value captured at the beginning of the RCU read-side
    critical section, allowing correlation with other event-tracing messages.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett
    [ paulmck: Add fix to build problem located by Randy Dunlap based on
    diagnosis by Steven Rostedt. ]

    Paul E. McKenney
     

08 Jan, 2013

1 commit

  • Add a new capability, KVM_CAP_S390_CSS_SUPPORT, which will pass
    intercepts for channel I/O instructions to userspace. Only I/O
    instructions interacting with I/O interrupts need to be handled
    in-kernel:

    - TEST PENDING INTERRUPTION (tpi) dequeues and stores pending
    interrupts entirely in-kernel.
    - TEST SUBCHANNEL (tsch) dequeues pending interrupts in-kernel
    and exits via KVM_EXIT_S390_TSCH to userspace for subchannel-
    related processing.

    Reviewed-by: Marcelo Tosatti
    Reviewed-by: Alexander Graf
    Signed-off-by: Cornelia Huck
    Signed-off-by: Marcelo Tosatti

    Cornelia Huck
     

04 Jan, 2013

1 commit

  • This header file will define a new trace event that will be triggered when
    an AER event occurs. The following data will be provided to the trace
    event.

    char * dev_name - The name of the slot where the device resides
    ([domain:]bus:device.function).

    u32 status - Either the correctable or uncorrectable register
    indicating what error or errors have been seen.

    u8 severity - error severity 0:NONFATAL 1:FATAL 2:CORRECTED

    The trace event will also provide a trace string that may look like:

    "0000:05:00.0 PCIe Bus Error:severity=Uncorrected (Non-Fatal), Poisoned
    TLP"

    Signed-off-by: Lance Ortiz
    Acked-by: Mauro Carvalho Chehab
    Acked-by: Boris Petkov
    Signed-off-by: Tony Luck

    Lance Ortiz
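
    To make the three fields in the commit above concrete, the sketch
    below builds a string in the shape of the example given there. The
    severity numbering is taken from the commit message; the label text
    for severities 1 and 2 and the status bit table (bit 12 for Poisoned
    TLP, bit 18 for Malformed TLP, uncorrectable register only) are
    assumptions, and format_aer_event is a hypothetical helper.

```python
# Severity numbering from the commit message: 0 NONFATAL, 1 FATAL, 2 CORRECTED.
# The label text for 1 and 2 is assumed by analogy with the quoted example.
SEVERITY = {
    0: "Uncorrected (Non-Fatal)",
    1: "Uncorrected (Fatal)",
    2: "Corrected",
}

# Assumed, partial decode table for the uncorrectable status register.
UNCORRECTABLE_BITS = {12: "Poisoned TLP", 18: "Malformed TLP"}


def format_aer_event(dev_name, status, severity):
    """Render dev_name, status and severity as a human-readable string."""
    names = [name for bit, name in sorted(UNCORRECTABLE_BITS.items())
             if status & (1 << bit)]
    # Fall back to the raw register value when no known bit is set.
    detail = ", ".join(names) if names else "status=0x%08x" % status
    return "%s PCIe Bus Error:severity=%s, %s" % (
        dev_name, SEVERITY[severity], detail)
```

    Called with "0000:05:00.0", a status with only bit 12 set, and
    severity 0, this yields the same text as the example string quoted in
    the commit message.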
     

03 Jan, 2013

1 commit

  • Pull ext4 bug fixes from Ted Ts'o:
    "Various bug fixes for ext4. Perhaps the most serious bug fixed is one
    which could cause file system corruptions when performing file punch
    operations."

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: avoid hang when mounting non-journal filesystems with orphan list
    ext4: lock i_mutex when truncating orphan inodes
    ext4: do not try to write superblock on ro remount w/o journal
    ext4: include journal blocks in df overhead calcs
    ext4: remove unaligned AIO warning printk
    ext4: fix an incorrect comment about i_mutex
    ext4: fix deadlock in journal_unmap_buffer()
    ext4: split off ext4_journalled_invalidatepage()
    jbd2: fix assertion failure in jbd2_journal_flush()
    ext4: check dioread_nolock on remount
    ext4: fix extent tree corruption caused by hole punch

    Linus Torvalds
     

26 Dec, 2012

1 commit


19 Dec, 2012

2 commits

  • This flag is used to indicate to the callees that this allocation is a
    kernel allocation in process context, and should be accounted to current's
    memcg.

    Signed-off-by: Glauber Costa
    Acked-by: Johannes Weiner
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Acked-by: Kamezawa Hiroyuki
    Acked-by: Michal Hocko
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Suleiman Souhlal
    Cc: Tejun Heo
    Cc: David Rientjes
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: JoonSoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • Pull btrfs update from Chris Mason:
    "A big set of fixes and features.

    In terms of line count, most of the code comes from Stefan, who added
    the ability to replace a single drive in place. This is different
    from how btrfs normally replaces drives, and is much much much faster.

    Josef is plowing through our synchronous write performance. This pull
    request does not include the DIO_OWN_WAITING patch that was discussed
    on the list, but it has a number of other improvements to cut down our
    latencies and CPU time during fsync/O_DIRECT writes.

    Miao Xie has a big series of fixes and is spreading out ordered
    operations over more CPUs. This improves performance and reduces
    contention.

    I've put in fixes for error handling around hash collisions. These
    are going back to individual stable kernels as I test against them.

    Otherwise we have a lot of fixes and cleanups, thanks everyone!
    raid5/6 is being rebased against the device replacement code. I'll
    have it posted this Friday along with a nice series of benchmarks."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (115 commits)
    Btrfs: fix a bug of per-file nocow
    Btrfs: fix hash overflow handling
    Btrfs: don't take inode delalloc mutex if we're a free space inode
    Btrfs: fix autodefrag and umount lockup
    Btrfs: fix permissions of empty files not affected by umask
    Btrfs: put raid properties into global table
    Btrfs: fix BUG() in scrub when first superblock reading gives EIO
    Btrfs: do not call file_update_time in aio_write
    Btrfs: only unlock and relock if we have to
    Btrfs: use tokens where we can in the tree log
    Btrfs: optimize leaf_space_used
    Btrfs: don't memset new tokens
    Btrfs: only clear dirty on the buffer if it is marked as dirty
    Btrfs: move checks in set_page_dirty under DEBUG
    Btrfs: log changed inodes based on the extent map tree
    Btrfs: add path->really_keep_locks
    Btrfs: do not mark ems as prealloc if we are writing to them
    Btrfs: keep track of the extents original block length
    Btrfs: inline csums if we're fsyncing
    Btrfs: don't bother copying if we're only logging the inode
    ...

    Linus Torvalds
     

17 Dec, 2012

3 commits

  • Value 0 is not a tree id, so besides an upper limit, a lower limit is
    necessary as well while parsing root types of tracepoint.
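
    The two-sided check described here can be sketched as follows. This is
    a minimal illustration, not the patch itself: the table contents and
    all names are hypothetical, and the only point is that id 0 is
    rejected in the same way as an id past the end of the table.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name table indexed by tree id; slot 0 is unused because
 * 0 is not a tree id. */
static const char *const sketch_tree_names[] = {
    NULL,
    "ROOT_TREE",
    "EXTENT_TREE",
    "CHUNK_TREE",
};

#define SKETCH_MAX_TREE_ID \
    (sizeof(sketch_tree_names) / sizeof(sketch_tree_names[0]) - 1)

/* Return a printable name, or "-" when the id is outside [1, max]. */
const char *sketch_tree_name(uint64_t id)
{
    if (id < 1 || id > SKETCH_MAX_TREE_ID)
        return "-";
    return sketch_tree_names[id];
}
```

    With only an upper limit, id 0 would index the unused slot and hand a
    NULL pointer to whatever prints the name.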

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    Liu Bo
     
  • Pull ext4 update from Ted Ts'o:
    "There are two major features for this merge window. The first is
    inline data, which allows small files or directories to be stored in
    the in-inode extended attribute area. (This requires that the file
    system use inodes which are at least 256 bytes; 128 byte
    inodes do not have any room for in-inode xattrs.)

    The second new feature is SEEK_HOLE/SEEK_DATA support. This is
    enabled by the extent status tree patches, and this infrastructure
    will be used to further optimize ext4 in the future.

    Beyond that, we have the usual collection of code cleanups and bug
    fixes."
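
    From userspace, SEEK_HOLE/SEEK_DATA are reached through lseek(2). A
    minimal sketch, assuming a Linux system; the helper names are made up
    for this example, and the fallback numeric values are the Linux ones,
    used only if the libc headers do not expose the constants.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* exposes SEEK_DATA and SEEK_HOLE on glibc */
#endif
#include <sys/types.h>
#include <unistd.h>

/* Fallback to the Linux values if the headers did not define them. */
#ifndef SEEK_DATA
#define SEEK_DATA 3
#endif
#ifndef SEEK_HOLE
#define SEEK_HOLE 4
#endif

/* Offset of the first data byte at or after 'from', or -1 on error
 * (errno is ENXIO when there is no data at or past 'from'). */
off_t sketch_next_data(int fd, off_t from)
{
    return lseek(fd, from, SEEK_DATA);
}

/* Offset of the first hole at or after 'from'; the end of the file
 * counts as a hole. Returns -1 on error. */
off_t sketch_next_hole(int fd, off_t from)
{
    return lseek(fd, from, SEEK_HOLE);
}
```

    A copy tool could alternate the two calls to walk only the data
    regions of a sparse file and skip the holes.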

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (63 commits)
    ext4: zero out inline data using memset() instead of empty_zero_page
    ext4: ensure Inode flags consistency are checked at build time
    ext4: Remove CONFIG_EXT4_FS_XATTR
    ext4: remove unused variable from ext4_ext_in_cache()
    ext4: remove redundant initialization in ext4_fill_super()
    ext4: remove redundant code in ext4_alloc_inode()
    ext4: use sync_inode_metadata() when syncing inode metadata
    ext4: enable ext4 inline support
    ext4: let fallocate handle inline data correctly
    ext4: let ext4_truncate handle inline data correctly
    ext4: evict inline data out if we need to strore xattr in inode
    ext4: let fiemap work with inline data
    ext4: let ext4_rename handle inline dir
    ext4: let empty_dir handle inline dir
    ext4: let ext4_delete_entry() handle inline data
    ext4: make ext4_delete_entry generic
    ext4: let ext4_find_entry handle inline data
    ext4: create a new function search_dir
    ext4: let ext4_readdir handle inline data
    ext4: let add_dir_entry handle inline data properly
    ...

    Linus Torvalds
     
  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for these workloads driven by perf, which is bad
    but is potentially explained by the lack of scheduler smarts. Thomas'
    results show balancenuma improves on mainline but falls far short of
    numacore or autonuma. Srikar's results indicate we all suffer on a
    large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers which is not reported by
    the tool by default and sometimes missed in reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore batch
    handles PTEs but I no longer think this is the case. It's possible
    numacore is just able to trigger it due to higher rates of migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

12 Dec, 2012

3 commits

  • Pull perf updates from Ingo Molnar:
    "Lots of activity:

    211 files changed, 8328 insertions(+), 4116 deletions(-)

    most of it on the tooling side.

    Main changes:

    * ftrace enhancements and fixes from Steve Rostedt.

    * uprobes fixes, cleanups and preparation for the ARM port from Oleg
    Nesterov.

    * UAPI fixes, from David Howells - prepares the arch/x86 UAPI
    transition

    * Separate perf tests into multiple objects, one per test, from Jiri
    Olsa.

    * Make hardware event translations available in sysfs, from Jiri
    Olsa.

    * Fixes to /proc/pid/maps parsing, preparatory to supporting data
    maps, from Namhyung Kim

    * Implement ui_progress for GTK, from Namhyung Kim

    * Add framework for automated perf_event_attr tests, where tools with
    different command line options will be run from a 'perf test', via
    python glue, and the perf syscall will be intercepted to verify
    that the perf_event_attr fields set by the tool are those expected,
    from Jiri Olsa

    * Add a 'link' method for hists, so that we can have the leader with
    buckets for all the entries in all the hists. This new method is
    now used in the default 'diff' output, making the sum of the
    'baseline' column be 100%, eliminating blind spots.

    * libtraceevent fixes for compiler warnings, trying to make perf
    build on some distros, like Fedora 14, 32-bit; some of the warnings
    really pointed to real bugs.

    * Add a browser for 'perf script' and make it available from the
    report and annotate browsers. It does filtering to find the
    scripts that handle events found in the perf.data file used. From
    Feng Tang

    * perf inject changes to allow showing where a task sleeps, from
    Andrew Vagin.

    * Makefile improvements from Namhyung Kim.

    * Add --pre and --post command hooks in 'stat', from Peter Zijlstra.

    * Don't stop synthesizing threads when one vanishes, this is for the
    existing threads when we start a tool like trace.

    * Use sched:sched_stat_runtime to provide a thread summary, this
    produces the same output as the 'trace summary' subcommand of
    tglx's original "trace" tool.

    * Support interrupted syscalls in 'trace'

    * Add an event duration column and filter in 'trace'.

    * There are references to the man pages in some tools, so try to
    build Documentation when installing, warning the user if that is
    not possible, from Borislav Petkov.

    * Give user better message if precise is not supported, from David
    Ahern.

    * Try to find cross-built objdump path by using the session
    environment information in the perf.data file header, from Irina
    Tirdea, original patch and idea by Namhyung Kim.

    * Displays more output on features check for make V=1, so that one can
    figure out what is happening by looking at gcc output, etc. From
    Jiri Olsa.

    * Add on_exit implementation for systems without one, e.g. Android,
    from Bernhard Rosenkraenzer.

    * Only process events for vcpus of interest, helps handling large
    number of events, from David Ahern.

    * Cross compilation fixes for Android, from Irina Tirdea.

    * Add documentation on compiling for Android, from Irina Tirdea.

    * perf diff improvements from Jiri Olsa.

    * Target (task/user/cpu/syswide) handling improvements, from Namhyung
    Kim.

    * Add support in 'trace' for tracing workload given by command line,
    from Namhyung Kim.

    * ... and much more."

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (194 commits)
    uprobes: Use percpu_rw_semaphore to fix register/unregister vs dup_mmap() race
    perf evsel: Introduce is_group_member method
    perf powerpc: Use uapi/unistd.h to fix build error
    tools: Pass the target in descend
    tools: Honour the O= flag when tool build called from a higher Makefile
    tools: Define a Makefile function to do subdir processing
    perf ui: Always compile browser setup code
    perf ui: Add ui_progress__finish()
    perf ui gtk: Implement ui_progress functions
    perf ui: Introduce generic ui_progress helper
    perf ui tui: Move progress.c under ui/tui directory
    perf tools: Add basic event modifier sanity check
    perf tools: Omit group members from perf_evlist__disable/enable
    perf tools: Ensure single disable call per event in record comand
    perf tools: Fix 'disabled' attribute config for record command
    perf tools: Fix attributes for '{}' defined event groups
    perf tools: Use sscanf for parsing /proc/pid/maps
    perf tools: Add gtk. config option for launching GTK browser
    perf tools: Fix compile error on NO_NEWT=1 build
    perf hists: Initialize all of he->stat with zeroes
    ...

    Linus Torvalds
     
  • Pull RCU update from Ingo Molnar:
    "The major features of this tree are:

    1. A first version of no-callbacks CPUs. This version prohibits
    offlining CPU 0, but only when enabled via CONFIG_RCU_NOCB_CPU=y.
    Relaxing this constraint is in progress, but not yet ready
    for prime time. These commits were posted to LKML at
    https://lkml.org/lkml/2012/10/30/724.

    2. Changes to SRCU that allow statically initialized srcu_struct
    structures. These commits were posted to LKML at
    https://lkml.org/lkml/2012/10/30/296.

    3. Restructuring of RCU's debugfs output. These commits were posted
    to LKML at https://lkml.org/lkml/2012/10/30/341.

    4. Additional CPU-hotplug/RCU improvements, posted to LKML at
    https://lkml.org/lkml/2012/10/30/327.
    Note that the commit eliminating __stop_machine() was judged to
    be too high a risk, so it is deferred to 3.9.

    5. Changes to RCU's idle interface, most notably a new module
    parameter that redirects normal grace-period operations to
    their expedited equivalents. These were posted to LKML at
    https://lkml.org/lkml/2012/10/30/739.

    6. Additional diagnostics for RCU's CPU stall warning facility,
    posted to LKML at https://lkml.org/lkml/2012/10/30/315.
    The most notable change reduces the
    default RCU CPU stall-warning time from 60 seconds to 21 seconds,
    so that it once again happens sooner than the softlockup timeout.

    7. Documentation updates, which were posted to LKML at
    https://lkml.org/lkml/2012/10/30/280.
    A couple of late-breaking changes were posted at
    https://lkml.org/lkml/2012/11/16/634 and
    https://lkml.org/lkml/2012/11/16/547.

    8. Miscellaneous fixes, which were posted to LKML at
    https://lkml.org/lkml/2012/10/30/309.

    9. Finally, a fix for a lockdep-RCU splat was posted to LKML
    at https://lkml.org/lkml/2012/11/7/486."

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (49 commits)
    context_tracking: New context tracking susbsystem
    sched: Mark RCU reader in sched_show_task()
    rcu: Separate accounting of callbacks from callback-free CPUs
    rcu: Add callback-free CPUs
    rcu: Add documentation for the new rcuexp debugfs trace file
    rcu: Update documentation for TREE_RCU debugfs tracing
    rcu: Reduce default RCU CPU stall warning timeout
    rcu: Fix TINY_RCU rcu_is_cpu_rrupt_from_idle check
    rcu: Clarify memory-ordering properties of grace-period primitives
    rcu: Add new rcutorture module parameters to start/end test messages
    rcu: Remove list_for_each_continue_rcu()
    rcu: Fix batch-limit size problem
    rcu: Add tracing for synchronize_sched_expedited()
    rcu: Remove old debugfs interfaces and also RCU flavor name
    rcu: split 'rcuhier' to each flavor
    rcu: split 'rcugp' to each flavor
    rcu: split 'rcuboost' to each flavor
    rcu: split 'rcubarrier' to each flavor
    rcu: Fix tracing formatting
    rcu: Remove the interface "rcudata.csv"
    ...

    Linus Torvalds
     
  • The maximum oom_score_adj is 1000 and the minimum oom_score_adj is -1000,
    so this range can be represented by the signed short type with no
    functional change. The extra space this frees up in struct signal_struct
    will be used for per-thread oom kill flags in the next patch.
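
    The range argument can be checked at compile time. The sketch below is
    only an illustration: the macro, struct and function names are
    hypothetical stand-ins, and the two structs model just the narrowed
    fields rather than the real signal_struct layout.

```c
#include <limits.h>

/* Range stated in the commit message. */
#define SKETCH_OOM_SCORE_ADJ_MIN (-1000)
#define SKETCH_OOM_SCORE_ADJ_MAX 1000

/* The whole range must be representable in a signed short. */
_Static_assert(SKETCH_OOM_SCORE_ADJ_MIN >= SHRT_MIN &&
               SKETCH_OOM_SCORE_ADJ_MAX <= SHRT_MAX,
               "oom_score_adj range does not fit in a short");

/* Hypothetical before/after field layouts, to show where space is saved. */
struct sketch_adj_before { int adj;   int adj_min;   };
struct sketch_adj_after  { short adj; short adj_min; };

/* Clamp an arbitrary input to the valid range before narrowing it. */
short sketch_clamp_oom_score_adj(long value)
{
    if (value < SKETCH_OOM_SCORE_ADJ_MIN)
        return SKETCH_OOM_SCORE_ADJ_MIN;
    if (value > SKETCH_OOM_SCORE_ADJ_MAX)
        return SKETCH_OOM_SCORE_ADJ_MAX;
    return (short)value;
}
```

    Comparing sizeof(struct sketch_adj_after) with
    sizeof(struct sketch_adj_before) shows the kind of space the narrower
    type frees up on a given ABI.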

    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Reviewed-by: Michal Hocko
    Cc: Anton Vorontsov
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes