04 Sep, 2013

1 commit

  • Pull cgroup updates from Tejun Heo:
    "A lot of activities on the cgroup front. Most changes aren't visible
    to userland at all at this point and are laying foundation for the
    planned unified hierarchy.

    - The biggest change is decoupling the lifetime management of css
    (cgroup_subsys_state) from that of the cgroup. Because controllers
    (cpu, memory, block and so on) will need to be dynamically enabled
    and disabled, css, which is the association point between a cgroup
    and a controller, may come and go dynamically across the lifetime of
    a cgroup. Till now, css's were created when the associated cgroup
    was created and stayed till the cgroup got destroyed.

    Assumptions around this tight coupling permeated through cgroup
    core and controllers. These assumptions are gradually removed,
    which constitutes the bulk of the patches, and the css destruction
    path is completely decoupled from the cgroup destruction path. Note
    that decoupling of the creation path is relatively easy on top of
    these changes and the patchset is pending for the next window.

    - cgroup has its own event mechanism, cgroup.event_control, which is
    only used by memcg. It is overly complex, trying to achieve a
    flexibility whose benefits seem dubious at best. Going forward,
    new events will simply generate a file-modified event and the
    existing mechanism is being made specific to memcg. This pull
    request contains preparatory patches for that change.

    - Various fixes and cleanups"

    Fixed up conflict in kernel/cgroup.c as per Tejun.

    * 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (69 commits)
    cgroup: fix cgroup_css() invocation in css_from_id()
    cgroup: make cgroup_write_event_control() use css_from_dir() instead of __d_cgrp()
    cgroup: make cgroup_event hold onto cgroup_subsys_state instead of cgroup
    cgroup: implement CFTYPE_NO_PREFIX
    cgroup: make cgroup_css() take cgroup_subsys * instead and allow NULL subsys
    cgroup: rename cgroup_css_from_dir() to css_from_dir() and update its syntax
    cgroup: fix cgroup_write_event_control()
    cgroup: fix subsystem file accesses on the root cgroup
    cgroup: change cgroup_from_id() to css_from_id()
    cgroup: use css_get() in cgroup_create() to check CSS_ROOT
    cpuset: remove an unncessary forward declaration
    cgroup: RCU protect each cgroup_subsys_state release
    cgroup: move subsys file removal to kill_css()
    cgroup: factor out kill_css()
    cgroup: decouple cgroup_subsys_state destruction from cgroup destruction
    cgroup: replace cgroup->css_kill_cnt with ->nr_css
    cgroup: bounce cgroup_subsys_state ref kill confirmation to a work item
    cgroup: move cgroup->subsys[] assignment to online_css()
    cgroup: reorganize css init / exit paths
    cgroup: add __rcu modifier to cgroup->subsys[]
    ...

    Linus Torvalds
     

09 Aug, 2013

9 commits

  • Previously, none of the css descendant iterators included the origin
    (the root of the subtree) css in the iteration. The reasons were
    consistency with css_for_each_child() and the fact that, at the time
    of introduction, more use cases needed to skip the origin anyway;
    however, given that css_is_descendant() considers self to be a
    descendant, omitting the origin css has become confusing, and the
    accumulated use cases rather clearly indicate that including the
    origin results in simpler code overall.

    While this is the kind of change that can easily lead to subtle bugs,
    the cgroup API including the iterators has recently gone through
    major restructuring and no out-of-tree changes will be applicable
    without adjustments, making this a relatively acceptable opportunity
    for this type of change.

    The conversions are mostly straightforward. If the iteration block
    had explicit origin handling before or after, it's moved inside the
    iteration. If not, if (pos == origin) continue; is added. Some
    conversions consolidate origin handling with the rest and thereby add
    extra reference get/put around the origin. While the extra ref
    operations aren't strictly necessary, this shouldn't cause any
    noticeable difference.
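
    As a minimal sketch of the conversion pattern (hypothetical
    controller code; update_one() stands in for per-css work), an
    iteration that doesn't want origin handling now skips it explicitly:

      /* hypothetical per-css hook */
      static void update_one(struct cgroup_subsys_state *pos);

      static void update_subtree(struct cgroup_subsys_state *origin)
      {
              struct cgroup_subsys_state *pos;

              rcu_read_lock();
              css_for_each_descendant_pre(pos, origin) {
                      if (pos == origin)
                              continue;  /* origin is now included */
                      update_one(pos);
              }
              rcu_read_unlock();
      }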

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Michal Hocko
    Cc: Jens Axboe
    Cc: Matt Helsley
    Cc: Johannes Weiner
    Cc: Balbir Singh

    Tejun Heo
     
  • cgroup is in the process of converting to css (cgroup_subsys_state)
    from cgroup as the principal subsystem interface handle. This is
    mostly to prepare for the unified hierarchy support, where css's will
    be created and destroyed dynamically, but it also helps clean up
    subsystem implementations as css is usually what they are interested
    in anyway.

    cgroup_taskset which is used by the subsystem attach methods is the
    last cgroup subsystem API which isn't using css as the handle. Update
    cgroup_taskset_cur_cgroup() to cgroup_taskset_cur_css() and
    cgroup_taskset_for_each() to take @skip_css instead of @skip_cgrp.

    The conversions are pretty mechanical. One exception is
    cpuset::cgroup_cs(), which lost its last user and got removed.

    This patch shouldn't introduce any functional changes.
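
    A sketch of a converted ->attach() under the new interface
    (foo_attach() and attach_one() are hypothetical names):

      static void foo_attach(struct cgroup_subsys_state *css,
                             struct cgroup_taskset *tset)
      {
              struct task_struct *task;

              /* @skip_css is NULL: visit every task in the set */
              cgroup_taskset_for_each(task, NULL, tset)
                      attach_one(css, task);
      }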

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Daniel Wagner
    Cc: Ingo Molnar
    Cc: Matt Helsley
    Cc: Steven Rostedt

    Tejun Heo
     
  • cgroup is currently in the process of transitioning to using css
    (cgroup_subsys_state) as the primary handle instead of cgroup in
    subsystem API. For hierarchy iterators, this is beneficial because

    * In most cases, css is the only thing subsystems care about anyway.

    * On the planned unified hierarchy, iterations for different
    subsystems will need to skip over different subtrees of the
    hierarchy depending on which subsystems are enabled on each cgroup.
    Passing around css makes it unnecessary to explicitly specify the
    subsystem in question as css is the intersection between a cgroup
    and a subsystem.

    * For the planned unified hierarchy, css's will need to be created
    and destroyed dynamically, independently of the cgroup hierarchy.
    Having cgroup core manage css iteration makes enforcing deref rules
    a lot easier.

    Most subsystem conversions are straight-forward. Noteworthy changes
    are

    * blkio: cgroup_to_blkcg() is no longer used. Removed.

    * freezer: cgroup_freezer() is no longer used. Removed.

    * devices: cgroup_to_devcgroup() is no longer used. Removed.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe

    Tejun Heo
     
  • cgroup is currently in the process of transitioning to using struct
    cgroup_subsys_state * as the primary handle instead of struct cgroup.
    Please see the previous commit which converts the subsystem methods
    for rationale.

    This patch converts all cftype file operations to take @css instead of
    @cgroup. cftypes for the cgroup core files don't have their subsystem
    pointer set. These will automatically use the dummy_css added by the
    previous patch and can be converted the same way.

    Most subsystem conversions are straightforward but there are some
    interesting ones.

    * freezer: update_if_frozen() is also converted to take @css instead
    of @cgroup for consistency. This will make the code look simpler
    too once iterators are converted to use css.

    * memory/vmpressure: mem_cgroup_from_css() needs to be exported to
    vmpressure while mem_cgroup_from_cont() can be made static.
    Updated accordingly.

    * cpu: cgroup_tg() doesn't have any user left. Removed.

    * cpuacct: cgroup_ca() doesn't have any user left. Removed.

    * hugetlb: hugetlb_cgroup_from_cgroup() doesn't have any user left.
    Removed.

    * net_cls: cgrp_cls_state() doesn't have any user left. Removed.
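
    For illustration, a converted read handler might look like this (a
    sketch; struct foo_state and its counter are hypothetical):

      struct foo_state {
              struct cgroup_subsys_state css;
              u64 events;
      };

      /* previously took (struct cgroup *cgrp, struct cftype *cft) */
      static u64 foo_read_u64(struct cgroup_subsys_state *css,
                              struct cftype *cft)
      {
              return container_of(css, struct foo_state, css)->events;
      }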

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Daniel Wagner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe
    Cc: Steven Rostedt

    Tejun Heo
     
  • cgroup is transitioning to using css (cgroup_subsys_state) instead of
    cgroup as the primary subsystem handle. The cgroupfs file interface
    will be converted to use css's which requires finding out the
    subsystem from cftype so that the matching css can be determined from
    the cgroup.

    This patch adds cftype->ss which points to the subsystem the file
    belongs to. The field is initialized while a cftype is being
    registered. This makes it unnecessary to explicitly specify the
    subsystem for other cftype handling functions. @ss argument dropped
    from various cftype handling functions.

    This patch shouldn't introduce any behavior differences.
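
    A sketch of the kind of lookup this enables (hypothetical helper,
    shown with the subsystem-pointer form of cgroup_css() and the
    dummy_css for core files, both from neighboring patches in this
    series):

      static struct cgroup_subsys_state *
      cftype_css(struct cgroup *cgrp, struct cftype *cft)
      {
              /* cft->ss is NULL for cgroup core files */
              return cft->ss ? cgroup_css(cgrp, cft->ss) : &cgrp->dummy_css;
      }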

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Vivek Goyal
    Cc: Jens Axboe

    Tejun Heo
     
  • cgroup is currently in the process of transitioning to using struct
    cgroup_subsys_state * as the primary handle instead of struct cgroup *
    in subsystem implementations for the following reasons.

    * With unified hierarchy, subsystems will be dynamically bound and
    unbound from cgroups and thus css's (cgroup_subsys_state) may be
    created and destroyed dynamically over the lifetime of a cgroup,
    which is different from the current state where all css's are
    allocated and destroyed together with the associated cgroup. This
    in turn means that cgroup_css() should be synchronized and may
    return NULL, making it more cumbersome to use.

    * Differing levels of per-subsystem granularity in the unified
    hierarchy means that the task and descendant iterators should behave
    differently depending on the specific subsystem the iteration is
    being performed for.

    * In the majority of cases, a subsystem only cares about its own part
    of the cgroup hierarchy - ie. the hierarchy of css's. Subsystem
    methods often obtain the matching css pointer from the cgroup and
    don't bother with the cgroup pointer itself. Passing around css
    fits much better.

    This patch converts all cgroup_subsys methods to take @css instead of
    @cgroup. The conversions are mostly straight-forward. A few
    noteworthy changes are

    * ->css_alloc() now takes css of the parent cgroup rather than the
    pointer to the new cgroup as the css for the new cgroup doesn't
    exist yet. Knowing the parent css is enough for all the existing
    subsystems.

    * In kernel/cgroup.c::offline_css(), unnecessary open coded css
    dereference is replaced with local variable access.

    This patch shouldn't cause any behavior differences.

    v2: Unnecessary explicit cgrp->subsys[] deref in css_online() replaced
    with local variable @css as suggested by Li Zefan.

    Rebased on top of new for-3.12 which includes for-3.11-fixes so
    that ->css_free() invocation added by da0a12caff ("cgroup: fix a
    leak when percpu_ref_init() fails") is converted too. Suggested
    by Li Zefan.
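
    A sketch of a converted ->css_alloc() (struct foo_state is a
    hypothetical controller state embedding the css):

      struct foo_state { struct cgroup_subsys_state css; };

      static struct cgroup_subsys_state *
      foo_css_alloc(struct cgroup_subsys_state *parent_css)
      {
              struct foo_state *fs = kzalloc(sizeof(*fs), GFP_KERNEL);

              if (!fs)
                      return ERR_PTR(-ENOMEM);
              /*
               * parent_css (NULL for the root) is enough to inherit
               * from; the new cgroup's own css doesn't exist yet.
               */
              return &fs->css;
      }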

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Daniel Wagner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe
    Cc: Steven Rostedt

    Tejun Heo
     
  • Currently, controllers have to explicitly follow the cgroup hierarchy
    to find the parent of a given css. cgroup is moving towards using
    cgroup_subsys_state as the main controller interface construct, so
    let's provide a way to climb the hierarchy using just csses.

    This patch implements css_parent() which, given a css, returns its
    parent. The function is guaranteed to return a valid non-NULL parent
    css as long as the target css is not at the top of the hierarchy.

    freezer, cpuset, cpu, cpuacct, hugetlb, memory, net_cls and devices
    are converted to use css_parent() instead of accessing cgroup->parent
    directly.

    * __parent_ca() is dropped from cpuacct and its usage is replaced with
    parent_ca(). The only difference between the two was the NULL test
    on cgroup->parent, which is now embedded in css_parent(), making the
    distinction moot. Note that eventually a css->parent field will be
    added to css and the NULL check in css_parent() will go away.

    This patch shouldn't cause any behavior differences.
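
    A minimal usage sketch (apply_one() is a hypothetical per-level
    hook); the walk terminates at the root, where css_parent() returns
    NULL:

      static void apply_up(struct cgroup_subsys_state *css)
      {
              while (css) {
                      apply_one(css);
                      css = css_parent(css);
              }
      }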

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • css (cgroup_subsys_state) is usually embedded in a subsys-specific
    data structure. Subsystems either use container_of() directly to cast
    from css to such a data structure or have an accessor function
    wrapping the cast. As cgroup as a whole is moving towards using css
    as the main interface handle, add and update such accessors to ease
    dealing with css's.

    All accessors explicitly handle NULL input and return NULL in those
    cases. While this looks like an extra branch in the code, as all
    controller-specific data structures have css as the first field, the
    cast doesn't involve any offsetting and the compiler can trivially
    optimize out the branch.

    * blkio, freezer, cpuset, cpu, cpuacct and net_cls didn't have such
    accessor. Added.

    * memory, hugetlb and devices already had one but didn't explicitly
    handle NULL input. Updated.
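
    The pattern, sketched for a hypothetical controller; with css as the
    first member, container_of() involves no offsetting and the NULL
    branch folds away:

      struct foo_state {
              struct cgroup_subsys_state css;  /* must be first */
              /* controller-specific fields follow */
      };

      static inline struct foo_state *foo_css(struct cgroup_subsys_state *css)
      {
              return css ? container_of(css, struct foo_state, css) : NULL;
      }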

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • The names of the two struct cgroup_subsys_state accessors -
    cgroup_subsys_state() and task_subsys_state() - are somewhat awkward.
    The former clashes with the type name and the latter doesn't even
    indicate it's somehow related to cgroup.

    We're about to revamp large portion of cgroup API, so, let's rename
    them so that they're less awkward. Most per-controller usages of the
    accessors are localized in accessor wrappers and given the amount of
    scheduled changes, this isn't gonna add any noticeable headache.

    Rename cgroup_subsys_state() to cgroup_css() and task_subsys_state()
    to task_css(). This patch is pure rename.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     

15 Jul, 2013

1 commit

  • The __cpuinit type of throwaway sections might have made sense
    some time ago when RAM was more constrained, but now the savings
    do not offset the cost and complications. For example, the fix in
    commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time")
    is a good example of the nasty type of bugs that can be created
    with improper use of the various __init prefixes.

    After a discussion on LKML[1] it was decided that cpuinit should go
    the way of devinit and be phased out. Once all the users are gone,
    we can then finally remove the macros themselves from linux/init.h.

    This removes all the drivers/block uses of the __cpuinit macros
    from all C files.

    [1] https://lkml.org/lkml/2013/5/20/589

    Cc: Jens Axboe
    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

12 Jul, 2013

1 commit

  • Pull core block IO updates from Jens Axboe:
    "Here are the core IO block bits for 3.11. It contains:

    - A tweak to the reserved tag logic from Jan, for weirdo devices with
    just 3 free tags. But for those it improves things substantially
    for random writes.

    - Periodic writeback fix from Jan. Marked for stable as well.

    - Fix for a race condition in IO scheduler switching from Jianpeng.

    - The hierarchical blk-cgroup support from Tejun. This is the grunt
    of the series.

    - blk-throttle fix from Vivek.

    Just a note that I'm in the middle of a relocation, whole family is
    flying out tomorrow. Hence I will be AWOL the remainder of this week,
    but back at work again on Monday the 15th. CC'ing Tejun, since any
    potential "surprises" will most likely be from the blk-cgroup work.
    But it's been brewing for a while and sitting in my tree and
    linux-next for a long time, so should be solid."

    * 'for-3.11/core' of git://git.kernel.dk/linux-block: (36 commits)
    elevator: Fix a race in elevator switching
    block: Reserve only one queue tag for sync IO if only 3 tags are available
    writeback: Fix periodic writeback after fs mount
    blk-throttle: implement proper hierarchy support
    blk-throttle: implement throtl_grp->has_rules[]
    blk-throttle: Account for child group's start time in parent while bio climbs up
    blk-throttle: add throtl_qnode for dispatch fairness
    blk-throttle: make throtl_pending_timer_fn() ready for hierarchy
    blk-throttle: make tg_dispatch_one_bio() ready for hierarchy
    blk-throttle: make blk_throtl_bio() ready for hierarchy
    blk-throttle: make blk_throtl_drain() ready for hierarchy
    blk-throttle: dispatch from throtl_pending_timer_fn()
    blk-throttle: implement dispatch looping
    blk-throttle: separate out throtl_service_queue->pending_timer from throtl_data->dispatch_work
    blk-throttle: set REQ_THROTTLED from throtl_charge_bio() and gate stats update with it
    blk-throttle: implement sq_to_tg(), sq_to_td() and throtl_log()
    blk-throttle: add throtl_service_queue->parent_sq
    blk-throttle: generalize update_disptime optimization in blk_throtl_bio()
    blk-throttle: dispatch to throtl_data->service_queue.bio_lists[]
    blk-throttle: move bio_lists[] and friends to throtl_service_queue
    ...

    Linus Torvalds
     

10 Jul, 2013

3 commits

  • Graft AIX partitions enumeration into partitions/msdos.c

    There is already AIX disk detection logic in msdos.c. When an AIX disk
    has been found, and if so configured, the AIX partitions recognizer is
    called. This avoids removing the AIX disk protection from msdos.c,
    avoids code duplication, and ensures that AIX partitions enumeration
    is called before plain msdos partitions enumeration.

    Signed-off-by: Philippe De Muyter
    Cc: Karel Zak
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Philippe De Muyter
     
  • Add partitions/aix.h and partitions/aix.c.

    AIX LVM permits making "logical volumes" which are made of multiple
    slices of multiple disks. The new code allows access only to those
    "logical volumes" which are made of one slice on the probed disk, a
    slice being a contiguous disk area. The code also detects "logical
    volumes" made of multiple slices on the probed disk, but cannot
    describe them to the partition layer, because the partition layer's
    generic code does not support that. When such non-contiguous "logical
    volumes" are detected, a diagnostic message is printed.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Philippe De Muyter
    Cc: Karel Zak
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Philippe De Muyter
     
  • Signed-off-by: Philippe De Muyter
    Cc: Karel Zak
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Philippe De Muyter
     

04 Jul, 2013

4 commits

  • Merge first patch-bomb from Andrew Morton:
    - various misc bits
    - I've been patchmonkeying ocfs2 for a while, as Joel and Mark have
    been distracted. There has been quite a bit of activity.
    - About half the MM queue
    - Some backlight bits
    - Various lib/ updates
    - checkpatch updates
    - zillions more little rtc patches
    - ptrace
    - signals
    - exec
    - procfs
    - rapidio
    - nbd
    - aoe
    - pps
    - memstick
    - tools/testing/selftests updates

    * emailed patches from Andrew Morton: (445 commits)
    tools/testing/selftests: don't assume the x bit is set on scripts
    selftests: add .gitignore for kcmp
    selftests: fix clean target in kcmp Makefile
    selftests: add .gitignore for vm
    selftests: add hugetlbfstest
    self-test: fix make clean
    selftests: exit 1 on failure
    kernel/resource.c: remove the unneeded assignment in function __find_resource
    aio: fix wrong comment in aio_complete()
    drivers/w1/slaves/w1_ds2408.c: add magic sequence to disable P0 test mode
    drivers/memstick/host/r592.c: convert to module_pci_driver
    drivers/memstick/host/jmb38x_ms: convert to module_pci_driver
    pps-gpio: add device-tree binding and support
    drivers/pps/clients/pps-gpio.c: convert to module_platform_driver
    drivers/pps/clients/pps-gpio.c: convert to devm_* helpers
    drivers/parport/share.c: use kzalloc
    Documentation/accounting/getdelays.c: avoid strncpy in accounting tool
    aoe: update internal version number to v83
    aoe: update copyright date
    aoe: perform I/O completions in parallel
    ...

    Linus Torvalds
     
  • Disk names may contain arbitrary strings, so they must not be
    interpreted as format strings. It seems that only md allows arbitrary
    strings to be used for disk names, but this could allow for a local
    memory corruption from uid 0 into ring 0.

    CVE-2013-2851
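
    The bug class, sketched (not the actual md code):

      static void report_disk(const char *name)
      {
              pr_info("%s\n", name);  /* safe: the name is data */
              /*
               * pr_info(name) would be unsafe: '%' specifiers in the
               * name would be interpreted.
               */
      }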

    Signed-off-by: Kees Cook
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • There is a hole in struct hd_geometry, so we have to zero the struct on
    stack before copying it to user-space.
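
    The fix pattern, sketched with illustrative values:

      static int put_geometry(struct hd_geometry __user *ugeo)
      {
              struct hd_geometry geo;

              memset(&geo, 0, sizeof(geo));  /* zeros the padding hole too */
              geo.heads = 255;
              geo.sectors = 63;
              geo.cylinders = 1024;
              return copy_to_user(ugeo, &geo, sizeof(geo)) ? -EFAULT : 0;
      }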

    Signed-off-by: Cong Wang
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cong Wang
     
  • Pull s390 updates from Martin Schwidefsky:
    "This is the bulk of the s390 patches for the 3.11 merge window.

    Notable enhancements are: the block timeout patches for dasd from
    Hannes, and more work on the PCI support front. In addition some
    cleanup and the usual bug fixing."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (42 commits)
    s390/dasd: Fail all requests when DASD_FLAG_ABORTIO is set
    s390/dasd: Add 'timeout' attribute
    block: check for timeout function in blk_rq_timed_out()
    block/dasd: detailed I/O errors
    s390/dasd: Reduce amount of messages for specific errors
    s390/dasd: Implement block timeout handling
    s390/dasd: process all requests in the device tasklet
    s390/dasd: make number of retries configurable
    s390/dasd: Clarify comment
    s390/hwsampler: Updated misleading member names in hws_data_entry
    s390/appldata_net_sum: do not use static data
    s390/appldata_mem: do not use static data
    s390/vmwatchdog: do not use static data
    s390/airq: simplify adapter interrupt code
    s390/pci: remove per device debug attribute
    s390/dma: remove gratuitous brackets
    s390/facility: decompose test_facility()
    s390/sclp: remove duplicated include from sclp_ctl.c
    s390/irq: store interrupt information in pt_regs
    s390/drivers: Cocci spatch "ptr_ret.spatch"
    ...

    Linus Torvalds
     

03 Jul, 2013

2 commits

  • There's a race between elevator switching and normal I/O operation,
    because the allocation of struct elevator_queue and of struct
    elevator_data is not one atomic operation. So there is a window in
    which a NULL ->elevator_data can be used. For example:

    Thread A:                            Thread B:
    blk_queue_bio                        elevator_switch
      spin_lock_irq(q->queue_lock)         elevator_alloc
      elv_merge                            elevator_init_fn

    Because elevator_alloc is called without holding the queue lock,
    ->elevator_data is still NULL while, at the same time, thread A calls
    elv_merge and needs some info from elevator_data. So the crash
    happens.

    Move elevator_alloc into the elevator_init_fn so that the two
    operations happen atomically.

    The following method easily reproduces this bug:
    1: dd if=/dev/sdb of=/dev/null
    2: while true; do echo noop > scheduler; echo deadline > scheduler; done
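
    A sketch of the fixed shape, close to what the noop scheduler ends
    up with (not the exact patch):

      static int noop_init_queue(struct request_queue *q,
                                 struct elevator_type *e)
      {
              struct elevator_queue *eq = elevator_alloc(q, e);
              struct noop_data *nd;

              if (!eq)
                      return -ENOMEM;
              nd = kmalloc_node(sizeof(*nd), GFP_KERNEL, q->node);
              if (!nd) {
                      kobject_put(&eq->kobj);
                      return -ENOMEM;
              }
              eq->elevator_data = nd;

              spin_lock_irq(q->queue_lock);
              q->elevator = eq;  /* published only when fully set up */
              spin_unlock_irq(q->queue_lock);
              return 0;
      }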

    Signed-off-by: Jianpeng Ma
    Signed-off-by: Jens Axboe

    Jianpeng Ma
     
  • Pull workqueue changes from Tejun Heo:
    "Surprisingly, Lai and I didn't break too many things implementing
    custom pools and stuff last time around and there aren't any follow-up
    changes necessary at this point.

    The only change in this pull request is Viresh's patches to make some
    per-cpu workqueues behave as unbound workqueues dependent on a boot
    param whose default can be configured via a config option. This leads
    to higher processing overhead / lower bandwidth as more work items are
    bounced across CPUs; however, it can lead to noticeable powersave in
    certain configurations - ~10% w/ idlish constant workload on a
    big.LITTLE configuration according to Viresh.

    This is because per-cpu workqueues interfere with how the scheduler
    perceives whether or not each CPU is idle by forcing pinned tasks on
    them, which makes the scheduler's power-aware scheduling decisions
    less effective.

    Its effectiveness is likely less pronounced on homogeneous
    configurations and this type of optimization can probably be made
    automatic; however, the changes are pretty minimal and the affected
    workqueues are clearly marked, so it's an easy gain for some
    configurations for the time being with pretty unintrusive changes."

    * 'for-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    fbcon: queue work on power efficient wq
    block: queue work on power efficient wq
    PHYLIB: queue work on system_power_efficient_wq
    workqueue: Add system wide power_efficient workqueues
    workqueues: Introduce new flag WQ_POWER_EFFICIENT for power oriented workqueues

    Linus Torvalds
     

01 Jul, 2013

2 commits

  • rq_timed_out_fn might have been unset while the request
    was in flight, so we need to check for it in blk_rq_timed_out().
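
    The guard, sketched; when no handler is registered the timer is
    simply reset:

      static void blk_rq_timed_out(struct request *req)
      {
              struct request_queue *q = req->q;
              enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;

              if (q->rq_timed_out_fn)  /* may have been unset in flight */
                      ret = q->rq_timed_out_fn(req);
              /* ... act on ret (complete, reset timer, ...) as before */
      }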

    Acked-by: Jens Axboe
    Signed-off-by: Hannes Reinecke
    Signed-off-by: Stefan Weinhuber
    Signed-off-by: Martin Schwidefsky

    Hannes Reinecke
     
  • The DASD driver is using FASTFAIL as an equivalent to the
    transport errors in SCSI. And the 'steal lock' function maps
    roughly to a reservation error. So we should be returning the
    appropriate error codes when completing a request.

    Acked-by: Jens Axboe
    Signed-off-by: Hannes Reinecke
    Signed-off-by: Stefan Weinhuber
    Signed-off-by: Martin Schwidefsky

    Hannes Reinecke
     

29 Jun, 2013

1 commit

  • In case a device has three tags available we still reserve two of
    them for sync IO. That leaves only a single tag for async IO such as
    writeback from the flusher thread, which results in poor performance.

    Allow async IO to consume two tags in case the queue has three tags
    available, to get decent async write performance.

    This patch improves streaming write performance on a machine with such
    a disk from ~21 MB/s to ~52 MB/s. Also, postmark throughput in the
    presence of a streaming writer improves from 8 to 12 transactions per
    second, so sync IO doesn't seem to be harmed in the presence of a
    heavy async writer.
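
    The adjusted reservation, sketched as a helper (the name is
    assumed):

      static int async_tag_limit(int max_depth)
      {
              switch (max_depth) {
              case 2:
                      return 1;  /* keep one tag for sync IO */
              case 3:
                      return 2;  /* new: async may use two of three */
              default:
                      return max_depth - 2;
              }
      }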

    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

17 May, 2013

1 commit

  • In blk_post_runtime_resume, an autosuspend request will be initiated for
    the device. Since we are holding the queue lock, we can't sleep and thus
    we should use the async version to initiate an autosuspend, i.e.
    pm_request_suspend instead of pm_runtime_suspend, which might sleep.
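
    The constraint, sketched (a fragment, not the full function):

      spin_lock_irq(q->queue_lock);
      /* ... mark the queue active again ... */
      pm_request_suspend(q->dev);  /* async; pm_runtime_suspend() may sleep */
      spin_unlock_irq(q->queue_lock);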

    Signed-off-by: Aaron Lu
    Signed-off-by: Jens Axboe

    Aaron Lu
     

15 May, 2013

13 commits

  • With the recent updates, blk-throttle is finally ready for proper
    hierarchy support. Dispatching now honors service_queue->parent_sq
    and propagates correctly. The only thing missing is setting
    ->parent_sq correctly so that throtl_grp hierarchy matches the cgroup
    hierarchy.

    This patch updates throtl_pd_init() such that service_queues form the
    same hierarchy as the cgroup hierarchy if sane_behavior is enabled.
    As this concludes proper hierarchy support for blkcg, the shameful
    .broken_hierarchy tag is removed from blkio_subsys.

    v2: Updated blkio-controller.txt as suggested by Vivek.
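
    The wiring, sketched (the sane_behavior test is condensed to a
    boolean here):

      struct throtl_service_queue *parent_sq = &td->service_queue;

      if (sane_behavior && blkg->parent)
              parent_sq = &blkg_to_tg(blkg->parent)->service_queue;
      tg->service_queue.parent_sq = parent_sq;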

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Cc: Li Zefan

    Tejun Heo
     
  • blk_throtl_bio() has a quick exit path for throtl_grps without limits
    configured. It looks at the bps and iops limits and if both are not
    configured, the bio is issued immediately. While this is correct in
    the current flat hierarchy as each throtl_grp behaves completely
    independently, it would become wrong in proper hierarchy mode. A
    group without any limits could still be limited by one of its
    ancestors, and bio's queued for such a group should not bypass
    blk-throtl.

    As having a quick bypass mechanism is beneficial, this patch
    reimplements the mechanism such that it's correct even with proper
    hierarchy. throtl_grp->has_rules[] is added. These booleans are
    updated for the whole subtree whenever a config is updated so that
    has_rules[] of the whole subtree stays synchronized. They're also
    updated when a new throtl_grp comes online so that it can't escape the
    limits of its ancestors.

    As no throtl_grp has another throtl_grp as parent now, this patch
    doesn't yet make any behavior differences.
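
    A sketch of the propagation rule for one group (-1 meaning "no
    limit" is assumed):

      static void tg_update_has_rules(struct throtl_grp *tg)
      {
              struct throtl_grp *parent = sq_to_tg(tg->service_queue.parent_sq);
              int rw;

              for (rw = READ; rw <= WRITE; rw++)
                      tg->has_rules[rw] =
                              (parent && parent->has_rules[rw]) ||
                              tg->bps[rw] != -1 || tg->iops[rw] != -1;
      }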

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • With the planned proper hierarchy support, a bio will climb up the
    tree before actually being dispatched. This makes sure the bio is
    also subjected to the parent's throttling limits, if any.

    It might happen that the parent is idle and, when the bio is
    transferred to the parent, a new slice starts fresh. But that is
    incorrect, as the parent's wait time should have started when the bio
    was queued in the child group; starting fresh causes IOs to be
    throttled more than configured as they climb the hierarchy.

    Given that we have not written the hierarchical algorithm in a way
    where the child's and parent's time slices are synchronized, we
    transfer the child's start time to the parent if the parent was
    idling. If the parent was busy doing dispatch of other bios all this
    while, this is not an issue.

    The child's slice start time is passed to the parent. The parent
    looks at its last expired slice start time. If the child's start time
    is after the parent's old start time, that means the parent had been
    idle and, after the parent went idle, the child had an IO queued. So
    use the child's start time as the parent's start time.

    If the parent's start time is after the child's start time, that
    means, when the IO got queued in the child group, the parent was not
    idle. But later it dispatched some IO, its slice got trimmed and then
    it went idle. After a while the child's request got shifted to the
    parent group. In this case use the parent's old start time as the new
    start time, as that's the duration of slice we did not use.

    This logic is far from perfect, as with multiple children the first
    child transferring a bio decides the start time, while a bio might
    have been queued even earlier in another child which is yet to be
    transferred up to the parent. In that case we will lose time and
    bandwidth in the parent. This patch is just an approximation to make
    the situation somewhat better.
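
    The decision rule, sketched (the helper name is hypothetical):

      static unsigned long pick_slice_start(unsigned long parent_old_start,
                                            unsigned long child_start)
      {
              /* child queued after the parent's old slice: parent idled */
              if (time_after(child_start, parent_old_start))
                      return child_start;
              /* otherwise keep the unused remainder of the old slice */
              return parent_old_start;
      }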

    Signed-off-by: Vivek Goyal
    Signed-off-by: Tejun Heo

    Vivek Goyal
     
  • With flat hierarchy, there's only a single level of dispatching
    happening and fairness beyond that point is the responsibility of the
    rest of the block layer and driver, which usually works out okay;
    however, with the planned hierarchy support,
    service_queue->bio_lists[] can be filled up by bios from a single
    source. While the limits would still be honored, it'd be very easy to
    starve IOs from siblings or children.

    To avoid such starvation, this patch implements throtl_qnode and
    converts service_queue->bio_lists[] to lists of per-source qnodes
    which in turn contain the bio's. For example, when a bio is
    dispatched from a child group, the bio doesn't get queued on
    ->bio_lists[] directly but first gets queued on the group's qnode
    which in turn gets queued on service_queue->queued[]. When
    dispatching for the upper level, the ->queued[] list is consumed in
    round-robin order so that the dispatch window is consumed fairly by
    all IO sources.

    There are two ways a bio can come to a throtl_grp - directly queued to
    the group or dispatched from a child. For the former
    throtl_grp->qnode_on_self[rw] is used. For the latter, the child's
    ->qnode_on_parent[rw].

    Note that this means that the child which is contributing a bio to its
    parent should stay pinned until all its bios are dispatched to its
    grand-parent. This patch moves blkg refcnting from bio add/remove
    spots to qnode activation/deactivation so that the blkg containing an
    active qnode is always pinned. As child pins the parent, this is
    sufficient for keeping the relevant sub-tree pinned while bios are in
    flight.

    The starvation issue was spotted by Vivek Goyal.

    v2: The original patch used the same throtl_grp->qnode_on_self/parent
    for reads and writes causing RWs to be queued incorrectly if there
    already are outstanding IOs in the other direction. They should
    be throtl_grp->qnode_on_self/parent[2] so that READs and WRITEs
    can use different qnodes. Spotted by Vivek Goyal.
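
    The qnode shape, roughly as introduced:

      struct throtl_qnode {
              struct list_head  node;  /* on service_queue->queued[rw] */
              struct bio_list   bios;  /* bios from this source */
              struct throtl_grp *tg;   /* source group; pinned while active */
      };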

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • throtl_pending_timer_fn() currently assumes that the parent_sq is the
    top level one and the bio's dispatched are ready to be issued;
    however, this assumption will be wrong with proper hierarchy support.
    This patch makes the following changes to make
    throtl_pending_timer_fn() ready for hierarchy.

    * If the parent_sq isn't the top-level one, update the parent
    throtl_grp's dispatch time and schedule the next dispatch as
    necessary. If the parent's dispatch time is now, repeat the
    function for the parent throtl_grp.

    * If the parent_sq is the top-level one, kick issue work_item as
    before.

    * The debug message printed by throtl_log() now prints out the
    service_queue's nr_queued[] instead of the total nr_queued as the
    latter becomes uninteresting and misleading with hierarchical
    dispatch.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • tg_dispatch_one_bio() currently assumes that the parent_sq is the top
    level one and the bio being dispatched is ready to be issued; however,
    this assumption will be wrong with proper hierarchy support. This
    patch makes the following changes to make tg_dispatch_one_bio() ready
    for hierarchy.

    * throtl_data->nr_queued[] is incremented in blk_throtl_bio() instead
    of throtl_add_bio_tg() so that throtl_add_bio_tg() can be used to
    transfer a bio from a child tg to its parent.

    * tg_dispatch_one_bio() is updated to distinguish whether its parent
    is another throtl_grp or the throtl_data. If the former, the bio is
    transferred to the parent throtl_grp using throtl_add_bio_tg(). If
    the latter, the bio is ready to be issued and is put on the top-level
    service_queue's bio_lists[] and throtl_data->nr_queued is
    decremented.

    As all throtl_grps currently have the top level service_queue as their
    ->parent_sq, this patch in itself doesn't make any behavior
    difference.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • Currently, blk_throtl_bio() issues the passed in bio directly if it's
    within limits of its associated tg (throtl_grp). This behavior
    becomes incorrect with hierarchy support as the bio should be
    accounted to and throttled by the ancestor throtl_grps too.

    This patch makes the direct issue path of blk_throtl_bio() loop
    until it reaches the top-level service_queue or gets throttled. If
    the former, the bio can be issued directly; otherwise, it gets queued
    at the first layer where it was above limits.

    As tg->parent_sq is always the top-level service queue currently, this
    patch in itself doesn't make any behavior differences.
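
    The direct-issue loop, condensed (helper names as used in the
    series):

      while (tg) {
              if (!tg_may_dispatch(tg, bio, NULL))
                      break;      /* over limit: queue at this level */
              throtl_charge_bio(tg, bio);
              sq = sq->parent_sq;
              tg = sq_to_tg(sq);  /* NULL once we reach throtl_data */
      }
      /* tg == NULL here means the bio can be issued directly */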

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • The current blk_throtl_drain() assumes that all active throtl_grps are
    queued on throtl_data->service_queue, which won't be true once
    hierarchy support is implemented.

    This patch makes blk_throtl_drain() perform post-order walk of the
    blkg hierarchy draining each associated throtl_grp, which guarantees
    that all bios will eventually be pushed to the top-level service_queue
    in throtl_data.
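
    The walk, sketched (pos is the iteration cursor; tg_drain_bios()
    drains one service_queue):

      blkg_for_each_descendant_post(blkg, pos, td->queue->root_blkg)
              tg_drain_bios(&blkg_to_tg(blkg)->service_queue);

      /* finally, move what accumulated at the top level into the td */
      tg_drain_bios(&td->service_queue);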

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • Currently, blk_throtl_dispatch_work_fn() is responsible for both
    dispatching bio's from throtl_grp's according to their limits and then
    issuing the dispatched bios.

    This patch moves the dispatch part to throtl_pending_timer_fn() so
    that the work item is kicked iff there are bio's to issue. This is to
    avoid work item execution at each step when hierarchy support is
    enabled. bio's will be dispatched towards the top-level service_queue
    from the timers at each layer and the work item will only be used to
    issue the bio's which reached the top-level service_queue.

    While fetching bio's to issue from bio_lists[],
    blk_throtl_dispatch_work_fn() fetches all READs before WRITEs. While
    the original code also dispatched READs first, if multiple throtl_grps
    are dispatched in the same run, WRITEs from a throtl_grp which is
    dispatched first would precede READs from throtl_grps which are
    dispatched later. While this is a behavior change, given that the
    previous code already prioritized READs and block layer generally
    prioritizes and segregates READs from WRITEs, this isn't likely to
    make any noticeable differences.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • throtl_select_dispatch() only dispatches throtl_quantum bios on each
    invocation. blk_throtl_dispatch_work_fn() in turn depends on
    throtl_schedule_next_dispatch() scheduling the next dispatch window
    immediately so that undue delays aren't incurred. This effectively
    chains multiple dispatch work item executions back-to-back when there
    are more than throtl_quantum bios to dispatch on a given tick.

    There is no reason to finish the current work item just to repeat it
    immediately. This patch makes throtl_schedule_next_dispatch() return
    %false without doing anything if the current dispatch window is still
    open, and updates blk_throtl_dispatch_work_fn() to repeat dispatching
    after cpu_relax() on a %false return.

    This change will help implement hierarchy support, as dispatching
    will be done from the pending_timer and an immediate reschedule of a
    timer function isn't supported and doesn't make much sense.

    While this patch changes how dispatch behaves when there are more than
    throtl_quantum bios to dispatch on a single tick, the behavior change
    is immaterial.
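
    The repeat shape, sketched (arguments simplified):

      while (true) {
              throtl_select_dispatch(sq);
              if (throtl_schedule_next_dispatch(sq))
                      break;  /* window closed; next dispatch scheduled */
              cpu_relax();    /* window still open: dispatch again */
      }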

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • Currently, throtl_data->dispatch_work is a delayed_work item which
    handles both delayed dispatch and issuing bios. The two tasks will be
    separated to support proper hierarchy. To prepare for that, this
    patch separates out the timer into throtl_service_queue->pending_timer
    from throtl_data->dispatch_work and make the latter a work_struct.

    * As the timer is now per-service_queue, it's initialized and
    del_sync'd as its corresponding service_queue is created and
    destroyed. The timer, when triggered, simply schedules
    throtl_data->dispatch_work for execution.

    * throtl_schedule_delayed_work() is renamed to
    throtl_schedule_pending_timer() and takes @sq and @expires now.

    * Similarly, throtl_schedule_next_dispatch() now takes @sq, which
    should be the parent_sq of the service_queue which just got a new
    bio or was updated. As the parent_sq is always the top-level
    service_queue now, this doesn't change anything at this point.

    This patch doesn't introduce any behavior differences.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • With proper hierarchy support, a bio can be dispatched multiple times
    until it reaches the top-level service_queue and we don't want to
    update dispatch stats at each step. They are local stats and will be
    kept local. If recursive stats are necessary, they should be
    implemented separately and definitely not by updating counters
    recursively on each dispatch.

    This patch moves the REQ_THROTTLED setting to throtl_charge_bio() and
    gates the stats update with it so that dispatch stats are updated
    only the first time the bio is charged to a throtl_grp, which will
    always be the throtl_grp the bio was originally queued to.

    This means that REQ_THROTTLED would be set even for bios which don't
    get throttled. As we don't want bios to leave blk-throtl with the
    flag set, move the REQ_THROTTLED clearing to the end of
    blk_throtl_bio() and clear it if the bio is being issued directly.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • Now that both throtl_data and throtl_grp embed throtl_service_queue,
    we can unify throtl_log() and throtl_log_tg().

    * sq_to_tg() is added. This returns the throtl_grp a service_queue is
    embedded in. If the service_queue is the top-level one embedded in
    throtl_data, NULL is returned.

    * sq_to_td() is added. A service_queue is always associated with a
    throtl_data. This function finds the associated td and returns it.

    * throtl_log() is updated to take throtl_service_queue instead of
    throtl_data. If the service_queue is one embedded in throtl_grp, it
    prints the same header as throtl_log_tg() did. If it's one embedded
    in throtl_data, it behaves the same as before. This renders
    throtl_log_tg() unnecessary. Removed.

    This change is necessary for hierarchy support as we're gonna be using
    the same code paths to dispatch bios to intermediate service_queues
    embedded in throtl_grps and the top-level service_queue embedded in
    throtl_data.

    This patch doesn't make any behavior changes.

    v2: throtl_log() didn't print a space after blkg path. Updated so
    that it prints a space after throtl_grp path. Spotted by Vivek.
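
    The two helpers, roughly as introduced; only the top-level
    service_queue has no parent_sq:

      static struct throtl_grp *sq_to_tg(struct throtl_service_queue *sq)
      {
              if (sq && sq->parent_sq)
                      return container_of(sq, struct throtl_grp, service_queue);
              return NULL;  /* top-level: embedded in throtl_data */
      }

      static struct throtl_data *sq_to_td(struct throtl_service_queue *sq)
      {
              struct throtl_grp *tg = sq_to_tg(sq);

              return tg ? tg->td
                        : container_of(sq, struct throtl_data, service_queue);
      }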

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo