19 Oct, 2014

1 commit

  • Pull core block layer changes from Jens Axboe:
    "This is the core block IO pull request for 3.18. Apart from the new
    and improved flush machinery for blk-mq, this is all mostly bug fixes
    and cleanups.

    - blk-mq timeout updates and fixes from Christoph.

    - Removal of REQ_END, also from Christoph. We pass it through the
    ->queue_rq() hook for blk-mq instead, freeing up one of the request
    bits. The space was overly tight on 32-bit, so Martin also killed
    REQ_KERNEL since it's no longer used.

    - blk integrity updates and fixes from Martin and Gu Zheng.

    - Update to the flush machinery for blk-mq from Ming Lei. Now we
    have a per-hardware-context flush request, which both cleans up the
    code and should scale better for flush-intensive workloads on blk-mq.

    - Improve the error printing, from Rob Elliott.

    - Backing device improvements and cleanups from Tejun.

    - Fixup of a misplaced rq_complete() tracepoint from Hannes.

    - Make blk_get_request() return error pointers, fixing up issues
    where we NULL deref when a device goes bad or missing. From Joe
    Lawrence.

    - Prep work for drastically reducing the memory consumption of dm
    devices from Junichi Nomura. This allows creating clone bio sets
    without preallocating a lot of memory.

    - Fix a blk-mq hang on certain combinations of queue depths and
    hardware queues from me.

    - Limit memory consumption for blk-mq devices for crash dump
    scenarios and drivers that use crazy high depths (certain SCSI
    shared tag setups). We now just use a single queue and limited
    depth for that"
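
    For illustration, a hedged sketch of the blk_get_request() convention
    change mentioned above (the block layer calls are real; the caller,
    example_submit(), is made up):

    #include <linux/blkdev.h>
    #include <linux/err.h>

    static int example_submit(struct request_queue *q)
    {
        struct request *rq;

        /* blk_get_request() now returns ERR_PTR() values, never NULL */
        rq = blk_get_request(q, READ, GFP_KERNEL);
        if (IS_ERR(rq))                 /* was: if (!rq) */
            return PTR_ERR(rq);

        /* set up and issue rq here */
        blk_put_request(rq);
        return 0;
    }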

    * 'for-3.18/core' of git://git.kernel.dk/linux-block: (58 commits)
    block: Remove REQ_KERNEL
    blk-mq: allocate cpumask on the home node
    bio-integrity: remove the needless fail handle of bip_slab creating
    block: include func name in __get_request prints
    block: make blk_update_request print prefix match ratelimited prefix
    blk-merge: don't compute bi_phys_segments from bi_vcnt for cloned bio
    block: fix alignment_offset math that assumes io_min is a power-of-2
    blk-mq: Make bt_clear_tag() easier to read
    blk-mq: fix potential hang if rolling wakeup depth is too high
    block: add bioset_create_nobvec()
    block: use bio_clone_fast() in blk_rq_prep_clone()
    block: misplaced rq_complete tracepoint
    sd: Honor block layer integrity handling flags
    block: Replace strnicmp with strncasecmp
    block: Add T10 Protection Information functions
    block: Don't merge requests if integrity flags differ
    block: Integrity checksum flag
    block: Relocate bio integrity flags
    block: Add a disk flag to block integrity profile
    block: Add prefix to block integrity profile flags
    ...

    Linus Torvalds
     

10 Oct, 2014

2 commits

  • Pull percpu updates from Tejun Heo:
    "A lot of activities on percpu front. Notable changes are...

    - percpu allocator now can take @gfp. If @gfp doesn't contain
    GFP_KERNEL, it tries to allocate from what's already available to
    the allocator and a work item tries to keep the reserve around
    certain level so that these atomic allocations usually succeed.

    This will replace the ad-hoc percpu memory pool used by
    blk-throttle and also be used by the planned blkcg support for
    writeback IOs.

    Please note that I noticed a bug in how @gfp is interpreted while
    preparing this pull request and applied the fix 6ae833c7fe0c
    ("percpu: fix how @gfp is interpreted by the percpu allocator")
    just now.

    - percpu_ref now uses longs for percpu and global counters instead of
    ints. It leads to more sparse packing of the percpu counters on
    64bit machines but the overhead should be negligible and this
    allows using percpu_ref for refcnting pages and in-memory objects
    directly.

    - The switching between percpu and single counter modes of a
    percpu_ref is made independent of putting the base ref and a
    percpu_ref can now optionally be initialized in single or killed
    mode. This allows avoiding percpu shutdown latency for cases where
    the refcounted objects may be synchronously created and destroyed
    in rapid succession with only a fraction of them reaching fully
    operational status (SCSI probing does this when combined with
    blk-mq support). It's also planned to be used to implement forced
    single mode to detect underflow more timely for debugging.

    There's a separate branch percpu/for-3.18-consistent-ops which cleans
    up the duplicate percpu accessors. That branch causes a number of
    conflicts with s390 and other trees. I'll send a separate pull
    request w/ resolutions once other branches are merged"
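
    As a hedged sketch of the first and third items above (example_init()
    and its error handling are illustrative, not taken from a real caller):

    #include <linux/percpu.h>
    #include <linux/percpu-refcount.h>

    static int example_init(struct percpu_ref *ref, percpu_ref_func_t *release)
    {
        unsigned long __percpu *stats;
        int ret;

        /* no GFP_KERNEL: served from the allocator's reserve, may fail */
        stats = alloc_percpu_gfp(unsigned long, GFP_NOWAIT);
        if (!stats)
            return -ENOMEM;

        /* start in atomic (single-counter) mode; switch to percpu mode later */
        ret = percpu_ref_init(ref, release, PERCPU_REF_INIT_ATOMIC, GFP_KERNEL);
        if (ret)
            free_percpu(stats);
        return ret;
    }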

    * 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (33 commits)
    percpu: fix how @gfp is interpreted by the percpu allocator
    blk-mq, percpu_ref: start q->mq_usage_counter in atomic mode
    percpu_ref: make INIT_ATOMIC and switch_to_atomic() sticky
    percpu_ref: add PERCPU_REF_INIT_* flags
    percpu_ref: decouple switching to percpu mode and reinit
    percpu_ref: decouple switching to atomic mode and killing
    percpu_ref: add PCPU_REF_DEAD
    percpu_ref: rename things to prepare for decoupling percpu/atomic mode switch
    percpu_ref: replace pcpu_ prefix with percpu_
    percpu_ref: minor code and comment updates
    percpu_ref: relocate percpu_ref_reinit()
    Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe"
    Revert "percpu: free percpu allocation info for uniprocessor system"
    percpu-refcount: make percpu_ref based on longs instead of ints
    percpu-refcount: improve WARN messages
    percpu: fix locking regression in the failure path of pcpu_alloc()
    percpu-refcount: add @gfp to percpu_ref_init()
    proportions: add @gfp to init functions
    percpu_counter: add @gfp to percpu_counter_init()
    percpu_counter: make percpu_counters_lock irq-safe
    ...

    Linus Torvalds
     
  • Page reclaim tests zone_is_reclaim_dirty(), but the site that actually
    sets this state does zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY), sending the
    reader through layers of indirection just to track down a simple bit.

    Remove all zone flag wrappers and just use bitops against zone->flags
    directly. It's just as readable and the lines are barely any longer.

    Also rename ZONE_TAIL_LRU_DIRTY to ZONE_DIRTY to match ZONE_WRITEBACK, and
    remove the zone_flags_t typedef.
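
    A hedged before/after sketch of a typical call site once the wrappers
    are gone (the example_* helpers are made up for illustration):

    #include <linux/mmzone.h>
    #include <linux/bitops.h>

    /* was: zone_is_reclaim_dirty(zone) */
    static bool example_zone_dirty(struct zone *zone)
    {
        return test_bit(ZONE_DIRTY, &zone->flags);
    }

    /* was: zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY) */
    static void example_mark_zone_dirty(struct zone *zone)
    {
        set_bit(ZONE_DIRTY, &zone->flags);
    }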

    Signed-off-by: Johannes Weiner
    Acked-by: David Rientjes
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

09 Sep, 2014

4 commits

  • A block_device may be attached to different gendisks and thus
    different bdis over time. bdev_inode_switch_bdi() is used to switch
    the associated bdi. The function assumes that the inode could be
    dirty and transfers it between bdis if so. This is a bit nasty in
    that it reaches into bdi internals.

    This patch reimplements the function so that it writes out the inode
    if dirty. This is a lot simpler and can be implemented without
    exposing bdi internals.
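
    A simplified, hedged sketch of the approach described above (locking
    details and error handling omitted; example_switch_bdi() is not the
    exact function from the patch):

    #include <linux/fs.h>
    #include <linux/writeback.h>
    #include <linux/backing-dev.h>

    static void example_switch_bdi(struct inode *inode,
                                   struct backing_dev_info *dst)
    {
        /* write the inode out instead of migrating it between bdis */
        while (inode->i_state & I_DIRTY)
            write_inode_now(inode, true);

        spin_lock(&inode->i_lock);
        inode->i_data.backing_dev_info = dst;
        spin_unlock(&inode->i_lock);
    }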

    Signed-off-by: Tejun Heo
    Cc: Alexander Viro
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • bdi_destroy() has code to transfer the remaining dirty inodes to the
    default_backing_dev_info; however, given the shutdown sequence, it
    isn't clear how such a condition would happen. Also, it isn't a full
    solution as the transferred inodes still point to the bdi which is
    being destroyed. Operations on those inodes can end up accessing
    already released fields such as the percpu stat fields.

    Digging through the history, it seems that the code was added as a
    quick workaround for a bug report without fully root-causing the
    issue. We probably want to remove the code in time but for now let's
    add a comment noting that it is a quick workaround.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Canceling of bdi->wb.dwork is currently a bit mushy.
    bdi_wb_shutdown() performs cancel_delayed_work_sync() at the end after
    shutting down and flushing the delayed_work and bdi_destroy() tries
    yet again after bdi_unregister().

    bdi->wb.dwork is queued only after checking BDI_registered while
    holding bdi->wb_lock and bdi_wb_shutdown() clears the flag while
    holding the same lock and then flushes the delayed_work. There's no
    way the delayed_work can be queued again after that.

    Replace the two unnecessary cancel_delayed_work_sync() invocations
    with WARNs on pending. This simplifies and clarifies the code a bit
    and will help future changes in further isolating bdi_writeback
    handling.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • The only places where NULL test on bdi->dev is used are
    bdi_[un]register(). The functions can't be called in parallel anyway
    and there's no point in protecting bdi->dev clearing with a lock.
    Remove bdi->wb_lock grabbing around bdi->dev clearing and move it
    after device_unregister() call so that bdi->dev doesn't have to be
    cached in a local variable.

    This patch shouldn't introduce any behavior difference.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

08 Sep, 2014

2 commits

  • Percpu allocator now supports allocation mask. Add @gfp to
    [flex_]proportions init functions so that !GFP_KERNEL allocation masks
    can be used with them too.

    This patch doesn't make any functional difference.
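
    For illustration, a hedged sketch of a converted caller (example_init()
    is made up; only the extra @gfp argument is the point):

    #include <linux/flex_proportions.h>
    #include <linux/gfp.h>

    static int example_init(struct fprop_global *p)
    {
        /* was: fprop_global_init(p); */
        return fprop_global_init(p, GFP_KERNEL);
    }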

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Peter Zijlstra

    Tejun Heo
     
  • Percpu allocator now supports allocation mask. Add @gfp to
    percpu_counter_init() so that !GFP_KERNEL allocation masks can be used
    with percpu_counters too.

    We could have left percpu_counter_init() alone and added
    percpu_counter_init_gfp(); however, the number of users isn't that
    high and introducing _gfp variants to all percpu data structures would
    be quite ugly, so let's just do the conversion. This is the one with
    the most users. Other percpu data structures are a lot easier to
    convert.

    This patch doesn't make any functional difference.
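
    For illustration, a hedged sketch of a converted caller (example_setup()
    is made up; the allocation mask is now explicit):

    #include <linux/percpu_counter.h>

    static int example_setup(struct percpu_counter *pc)
    {
        /* was: percpu_counter_init(pc, 0); */
        return percpu_counter_init(pc, 0, GFP_KERNEL);
    }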

    Signed-off-by: Tejun Heo
    Acked-by: Jan Kara
    Acked-by: "David S. Miller"
    Cc: x86@kernel.org
    Cc: Jens Axboe
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andrew Morton

    Tejun Heo
     

18 Apr, 2014

1 commit

  • Mostly scripted conversion of the smp_mb__* barriers.
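
    Roughly, the conversion collapses the bitop/atomic-specific barrier
    variants into the two generic ones; a hedged sketch (example() is
    illustrative):

    #include <linux/atomic.h>
    #include <linux/bitops.h>

    static void example(unsigned long *word, atomic_t *v)
    {
        smp_mb__before_atomic();        /* was: smp_mb__before_clear_bit() */
        clear_bit(0, word);

        atomic_inc(v);
        smp_mb__after_atomic();         /* was: smp_mb__after_atomic_inc() */
    }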

    Signed-off-by: Peter Zijlstra
    Acked-by: Paul E. McKenney
    Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org
    Cc: Linus Torvalds
    Cc: linux-arch@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

04 Apr, 2014

2 commits

  • After commit 839a8e8660b6 ("writeback: replace custom worker pool
    implementation with unbound workqueue") when a device is removed while we
    are writing to it, we crash in bdi_writeback_workfn() ->
    set_worker_desc() because bdi->dev is NULL.

    This can happen because even though bdi_unregister() cancels all pending
    flushing work, nothing really prevents new ones from being queued from
    balance_dirty_pages() or other places.

    Fix the problem by clearing the BDI_registered bit in bdi_unregister() and
    checking it before scheduling any flushing work.
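
    A hedged sketch of the check described above (simplified; the real
    change sits in the bdi wakeup path, and example_wakeup_delayed() is
    not the exact function):

    #include <linux/backing-dev.h>
    #include <linux/writeback.h>

    static void example_wakeup_delayed(struct backing_dev_info *bdi)
    {
        unsigned long timeout = msecs_to_jiffies(dirty_writeback_interval * 10);

        spin_lock_bh(&bdi->wb_lock);
        /* don't schedule new flushing work once the bdi is unregistered */
        if (test_bit(BDI_registered, &bdi->state))
            queue_delayed_work(bdi_wq, &bdi->wb.dwork, timeout);
        spin_unlock_bh(&bdi->wb_lock);
    }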

    Fixes: 839a8e8660b6777e7fe4e80af1a048aebe2b5977

    Reviewed-by: Tejun Heo
    Signed-off-by: Jan Kara
    Cc: Derek Basehore
    Cc: Jens Axboe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • bdi_wakeup_thread_delayed() used the mod_delayed_work() function to
    schedule work to writeback dirty inodes. The problem with this is that
    it can delay work that is scheduled for immediate execution, such as the
    work from sync_inodes_sb(). This can happen since mod_delayed_work()
    can now steal work from a work_queue. This fixes the problem by using
    queue_delayed_work() instead. This is a regression caused by commit
    839a8e8660b6 ("writeback: replace custom worker pool implementation with
    unbound workqueue").

    The reason that this causes a problem is that laptop-mode will change
    the delay, dirty_writeback_centisecs, to 60000 (10 minutes) by default.
    In the case that bdi_wakeup_thread_delayed() races with
    sync_inodes_sb(), sync will be stopped for 10 minutes and trigger a hung
    task. Even if dirty_writeback_centisecs is not long enough to cause a
    hung task, we still don't want to delay sync for that long.

    We fix the problem by using queue_delayed_work() when we want to
    schedule writeback sometime in future. This function doesn't change the
    timer if it is already armed.
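
    A hedged illustration of the semantic difference (example() is
    illustrative; the real call site is bdi_wakeup_thread_delayed()):

    #include <linux/workqueue.h>

    static void example(struct workqueue_struct *wq,
                        struct delayed_work *dwork, unsigned long delay)
    {
        /*
         * mod_delayed_work() re-arms the timer to expire @delay from now,
         * even if the work was already queued for immediate execution;
         * queue_delayed_work() is a no-op if the work is already pending,
         * so an imminent run is never postponed.
         */
        queue_delayed_work(wq, dwork, delay);   /* was: mod_delayed_work() */
    }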

    For the same reason, we also change bdi_writeback_workfn() to
    immediately queue the work again in the case that the work_list is not
    empty. The same problem can happen if the sync work is run on the
    rescue worker.

    [jack@suse.cz: update changelog, add comment, use bdi_wakeup_thread_delayed()]
    Signed-off-by: Derek Basehore
    Reviewed-by: Jan Kara
    Cc: Alexander Viro
    Reviewed-by: Tejun Heo
    Cc: Greg Kroah-Hartman
    Cc: "Darrick J. Wong"
    Cc: Derek Basehore
    Cc: Kees Cook
    Cc: Benson Leung
    Cc: Sonny Rao
    Cc: Luigi Semenzato
    Cc: Jens Axboe
    Cc: Dave Chinner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Derek Basehore
     

12 Sep, 2013

1 commit


20 Aug, 2013

1 commit


17 Jul, 2013

1 commit


04 Jul, 2013

1 commit

    Calling dev_set_name with a single parameter causes it to be handled as a
    format string. Many callers are passing potentially dynamic string
    content, so use "%s" in those cases to avoid any potential accidents,
    including wrappers like device_create*() and bdi_register().
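
    A hedged before/after sketch of the fix (example() is illustrative):

    #include <linux/device.h>

    static void example(struct device *dev, const char *name)
    {
        /* was: dev_set_name(dev, name); -- 'name' parsed as a format string */
        dev_set_name(dev, "%s", name);
    }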

    Signed-off-by: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

02 Apr, 2013

3 commits

  • There are cases where userland wants to tweak the priority and
    affinity of writeback flushers. Expose bdi_wq to userland by setting
    WQ_SYSFS. It appears under /sys/bus/workqueue/devices/writeback/ and
    allows adjusting maximum concurrency level, cpumask and nice level.
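
    A hedged sketch of what exposing the workqueue amounts to (the exact
    flag set is approximated from this description):

    #include <linux/workqueue.h>
    #include <linux/init.h>
    #include <linux/errno.h>

    static struct workqueue_struct *example_bdi_wq;

    static int __init example_init(void)
    {
        /* WQ_SYSFS publishes the wq under /sys/bus/workqueue/devices/ */
        example_bdi_wq = alloc_workqueue("writeback", WQ_UNBOUND | WQ_FREEZABLE |
                                         WQ_MEM_RECLAIM | WQ_SYSFS, 0);
        return example_bdi_wq ? 0 : -ENOMEM;
    }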

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Fengguang Wu
    Cc: Jeff Moyer
    Cc: Kay Sievers
    Cc: Greg Kroah-Hartman

    Tejun Heo
     
  • Writeback implements its own worker pool - each bdi can be associated
    with a worker thread which is created and destroyed dynamically. The
    worker thread for the default bdi is always present and serves as the
    "forker" thread which forks off worker threads for other bdis.

    There's no reason for writeback to implement its own worker pool when
    using an unbound workqueue instead is much simpler and more efficient.
    This patch replaces custom worker pool implementation in writeback
    with an unbound workqueue.

    The conversion isn't too complicated, but the following points are worth
    mentioning.

    * bdi_writeback->last_active, task and wakeup_timer are removed.
    delayed_work ->dwork is added instead. Explicit timer handling is
    no longer necessary. Everything works by either queueing / modding
    / flushing / canceling the delayed_work item.

    * bdi_writeback_thread() becomes bdi_writeback_workfn() which runs off
    bdi_writeback->dwork. On each execution, it processes
    bdi->work_list and reschedules itself if there are more things to
    do.

    The function also handles low-mem condition, which used to be
    handled by the forker thread. If the function is running off a
    rescuer thread, it only writes out a limited number of pages so that
    the rescuer can serve other bdis too. This preserves the flusher
    creation failure behavior of the forker thread.

    * INIT_LIST_HEAD(&bdi->bdi_list) is used to tell
    bdi_writeback_workfn() about on-going bdi unregistration so that it
    always drains work_list even if it's running off the rescuer. Note
    that the original code was broken in this regard. Under memory
    pressure, a bdi could finish unregistration with non-empty
    work_list.

    * The default bdi is no longer special. It now is treated the same as
    any other bdi and bdi_cap_flush_forker() is removed.

    * BDI_pending is no longer used. Removed.

    * Some tracepoints become non-applicable. The following TPs are
    removed - writeback_nothread, writeback_wake_thread,
    writeback_wake_forker_thread, writeback_thread_start,
    writeback_thread_stop.

    Everything, including devices coming and going away and rescuer
    operation under simulated memory pressure, seems to work fine in my
    test setup.
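
    A heavily simplified, hedged sketch of the delayed_work flow described
    above (rescuer and unregistration handling omitted; do_one_pass() is a
    made-up stand-in for the real work-list processing):

    #include <linux/backing-dev.h>
    #include <linux/workqueue.h>

    static void do_one_pass(struct bdi_writeback *wb)
    {
        /* stand-in: process wb->bdi->work_list and write back dirty inodes */
    }

    static void example_writeback_workfn(struct work_struct *work)
    {
        struct bdi_writeback *wb = container_of(to_delayed_work(work),
                                                struct bdi_writeback, dwork);

        do_one_pass(wb);

        /* more work arrived meanwhile: run again right away */
        if (!list_empty(&wb->bdi->work_list))
            mod_delayed_work(bdi_wq, &wb->dwork, 0);
    }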

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Cc: Fengguang Wu
    Cc: Jeff Moyer

    Tejun Heo
     
  • There's no user left. Remove it.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Cc: Fengguang Wu

    Tejun Heo
     

22 Feb, 2013

1 commit

  • This patchset ("stable page writes, part 2") makes some key
    modifications to the original 'stable page writes' patchset. First, it
    provides creators (devices and filesystems) of a backing_dev_info a flag
    that declares whether or not it is necessary to ensure that page
    contents cannot change during writeout. It is no longer assumed that
    this is true of all devices (which was never true anyway). Second, the
    flag is used to relax the wait_on_page_writeback calls so that waiting
    only occurs if the device needs it. Third, it fixes up the remaining
    disk-backed filesystems to use this improved conditional-wait logic to
    provide stable page writes on those filesystems.

    It is hoped that (for people not using checksumming devices, anyway)
    this patchset will give back the performance lost to unnecessary waiting
    since the original stable page write patchset went into 3.0. Sorry about not
    fixing it sooner.

    Complaints were registered by several people about the long write
    latencies introduced by the original stable page write patchset.
    Generally speaking, the kernel ought to allocate as little extra memory
    as possible to facilitate writeout, but for people who simply cannot
    wait, a second page stability strategy is (re)introduced: snapshotting
    page contents. The waiting behavior is still the default strategy; to
    enable page snapshotting, a superblock flag (MS_SNAP_STABLE) must be
    set. This flag is used to bandaid^Henable stable page writeback on
    ext3[1], and is not used anywhere else.

    Given that there are already a few storage devices and network FSes that
    have rolled their own page stability wait/page snapshot code, it would
    be nice to move towards consolidating all of these. It seems possible
    that iscsi and raid5 may wish to use the new stable page write support
    to enable zero-copy writeout.

    Thank you to Jan Kara for helping fix a couple more filesystems.

    Per Andrew Morton's request, here are the results of using dbench to measure
    latencies on ext2:

    3.8.0-rc3:
    Operation      Count    AvgLat    MaxLat
    ----------------------------------------
    WriteX        109347     0.028    59.817
    ReadX         347180     0.004     3.391
    Flush          15514    29.828   287.283

    Throughput 57.429 MB/sec  4 clients  4 procs  max_latency=287.290 ms

    3.8.0-rc3 + patches:
    Operation      Count    AvgLat    MaxLat
    ----------------------------------------
    WriteX        105556     0.029     4.273
    ReadX         335004     0.005     4.112
    Flush          14982    30.540   298.634

    Throughput 55.4496 MB/sec  4 clients  4 procs  max_latency=298.650 ms

    As you can see, for ext2 the maximum write latency decreases from ~60ms
    on a laptop hard disk to ~4ms. I'm not sure why the flush latencies
    increase, though I suspect that being able to dirty pages faster gives
    the flusher more work to do.

    On ext4, the average write latency decreases as well as all the maximum
    latencies:

    3.8.0-rc3:
    Operation      Count    AvgLat    MaxLat
    ----------------------------------------
    WriteX         85624     0.152    33.078
    ReadX         272090     0.010    61.210
    Flush          12129    36.219   168.260

    Throughput 44.8618 MB/sec  4 clients  4 procs  max_latency=168.276 ms

    3.8.0-rc3 + patches:
    Operation      Count    AvgLat    MaxLat
    ----------------------------------------
    WriteX         86082     0.141    30.928
    ReadX         273358     0.010    36.124
    Flush          12214    34.800   165.689

    Throughput 44.9941 MB/sec  4 clients  4 procs  max_latency=165.722 ms

    XFS seems to exhibit similar latency improvements as ext2:

    3.8.0-rc3:
    Operation      Count    AvgLat    MaxLat
    ----------------------------------------
    WriteX        125739     0.028   104.343
    ReadX         399070     0.005     4.115
    Flush          17851    25.004   131.390

    Throughput 66.0024 MB/sec  4 clients  4 procs  max_latency=131.406 ms

    3.8.0-rc3 + patches:
    Operation      Count    AvgLat    MaxLat
    ----------------------------------------
    WriteX        123529     0.028     6.299
    ReadX         392434     0.005     4.287
    Flush          17549    25.120   188.687

    Throughput 64.9113 MB/sec  4 clients  4 procs  max_latency=188.704 ms

    ...and btrfs, just to round things out, also shows some latency
    decreases:

    3.8.0-rc3:
    Operation      Count    AvgLat    MaxLat
    ----------------------------------------
    WriteX         67122     0.083    82.355
    ReadX         212719     0.005     2.828
    Flush           9547    47.561   147.418

    Throughput 35.3391 MB/sec  4 clients  4 procs  max_latency=147.433 ms

    3.8.0-rc3 + patches:
    Operation      Count    AvgLat    MaxLat
    ----------------------------------------
    WriteX         64898     0.101    71.631
    ReadX         206673     0.005     7.123
    Flush           9190    47.963   219.034

    Throughput 34.0795 MB/sec  4 clients  4 procs  max_latency=219.044 ms

    Before this patchset, all filesystems would block, regardless of whether
    or not it was necessary. ext3 would wait, but still generate occasional
    checksum errors. The network filesystems were left to do their own
    thing, so they'd wait too.

    After this patchset, all the disk filesystems except ext3 and btrfs will
    wait only if the hardware requires it. ext3 (if necessary) snapshots
    pages instead of blocking, and btrfs provides its own bdi so the mm will
    never wait. Network filesystems haven't been touched, so either they
    provide their own wait code, or they don't block at all. The blocking
    behavior is back to what it was before 3.0 if you don't have a disk
    requiring stable page writes.

    This patchset has been tested on 3.8.0-rc3 on x64 with ext3, ext4, and
    xfs. I've spot-checked 3.8.0-rc4 and seem to be getting the same
    results as -rc3.

    [1] The alternative fixes to ext3 include fixing the locking order and
    page bit handling like we did for ext4 (but then why not just use
    ext4?), or setting PG_writeback so early that ext3 becomes extremely
    slow. I tried that, but the number of write()s I could initiate dropped
    by nearly an order of magnitude. That was a bit much even for the
    author of the stable page series! :)

    This patch:

    Creates a per-backing-device flag that tracks whether or not pages must
    be held immutable during writeout. Eventually it will be used to waive
    wait_for_page_writeback() if nothing requires stable pages.
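
    A hedged sketch of how the flag is meant to be consumed by the later
    patches in the series (example_maybe_wait() is illustrative):

    #include <linux/backing-dev.h>
    #include <linux/pagemap.h>

    static void example_maybe_wait(struct page *page,
                                   struct backing_dev_info *bdi)
    {
        /* only wait if the device actually needs stable page contents */
        if (bdi->capabilities & BDI_CAP_STABLE_WRITES)
            wait_on_page_writeback(page);
    }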

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Jan Kara
    Cc: Adrian Hunter
    Cc: Andy Lutomirski
    Cc: Artem Bityutskiy
    Cc: Joel Becker
    Cc: Mark Fasheh
    Cc: Steven Whitehouse
    Cc: Jens Axboe
    Cc: Eric Van Hensbergen
    Cc: Ron Minnich
    Cc: Latchesar Ionkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     

18 Dec, 2012

1 commit


06 Dec, 2012

1 commit

  • In realtime environments, it may be desirable to keep the per-bdi
    flusher threads from running on certain cpus. This patch adds a
    cpu_list file to /sys/class/bdi/* to enable this. The default is to tie
    the flusher threads to the same numa node as the backing device (though
    I could be convinced to make it a mask of all cpus to avoid a change in
    behaviour).

    Thanks to Jeremy Eder for the original idea.

    Signed-off-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jeff Moyer
     

25 Aug, 2012

1 commit


04 Aug, 2012

1 commit

  • Finally we can kill the 'sync_supers' kernel thread along with the
    '->write_super()' superblock operation because all the users are gone.
    Now every file-system is supposed to self-manage its own superblock and
    its dirty state.

    The nice thing about killing this thread is that it improves power management.
    Indeed, 'sync_supers' is a source of monotonic system wake-ups - it woke up
    every 5 seconds no matter what - even if there were no dirty superblocks and
    even if there were no file-systems using this service (e.g., btrfs and
    journalled ext4 do not need it). So it was wasting power most of the time. And
    because the thread was in the core of the kernel, all systems had to have it.
    So I am quite happy to make it go away.

    Interestingly, this thread is a left-over from the pdflush kernel thread which
    was a self-forking kernel thread responsible for all the write-back in old
    Linux kernels. It was turned into per-block device BDI threads, and
    'sync_supers' was a left-over. Thus, R.I.P, pdflush as well.

    Signed-off-by: Artem Bityutskiy
    Signed-off-by: Al Viro

    Artem Bityutskiy
     

01 Aug, 2012

1 commit

  • Since per-BDI flusher threads were introduced in 2.6, the pdflush
    mechanism is not used any more. But the old interface exported through
    /proc/sys/vm/nr_pdflush_threads still exists and is obviously useless.

    For backward compatibility, print a warning and return 2 to notify
    users that the interface has been removed.

    Signed-off-by: Wanpeng Li
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     

09 Jun, 2012

1 commit


01 Feb, 2012

1 commit

  • While 7a401a972df8e18 ("backing-dev: ensure wakeup_timer is deleted")
    addressed the problem of the bdi being freed with a queued wakeup
    timer, there are other races that could happen if the wakeup timer
    expires after/during bdi_unregister(), before bdi_destroy() is called.

    wakeup_timer_fn() could attempt to wake up a task which has already
    been freed, or could access a NULL bdi->dev via the wake_forker_thread
    tracepoint.

    Cc:
    Cc: Jens Axboe
    Reported-by: Chanho Min
    Reviewed-by: Namjae Jeon
    Signed-off-by: Rabin Vincent
    Signed-off-by: Wu Fengguang

    Rabin Vincent
     

22 Nov, 2011

1 commit

  • Writeback and thinkpad_acpi have been using thaw_process() to prevent
    deadlock between the freezer and kthread_stop(); unfortunately, this
    is inherently racy - nothing prevents freezing from happening between
    thaw_process() and kthread_stop().

    This patch implements kthread_freezable_should_stop() which enters
    refrigerator if necessary but is guaranteed to return if
    kthread_stop() is invoked. Both thaw_process() users are converted to
    use the new function.

    Note that this deadlock condition exists for many freezable
    kthreads. They need to be converted to use the new should_stop or
    freezable workqueue.
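
    A hedged sketch of the intended usage pattern in a freezable kthread
    (example_thread() is illustrative):

    #include <linux/kthread.h>
    #include <linux/freezer.h>
    #include <linux/delay.h>

    static int example_thread(void *data)
    {
        bool was_frozen;

        set_freezable();
        /* may enter the refrigerator inside the call, but is guaranteed
         * to return once kthread_stop() has been issued */
        while (!kthread_freezable_should_stop(&was_frozen)) {
            /* do one unit of work */
            msleep(1000);
        }
        return 0;
    }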

    Tested with synthetic test case.

    Signed-off-by: Tejun Heo
    Acked-by: Henrique de Moraes Holschuh
    Cc: Jens Axboe
    Cc: Oleg Nesterov

    Tejun Heo
     

11 Nov, 2011

1 commit

    bdi_prune_sb() in bdi_unregister() attempts to remove the bdi links
    from all super_blocks and then del_timer_sync() the writeback timer.

    However, this can race with __mark_inode_dirty(), leading to
    bdi_wakeup_thread_delayed() rearming the writeback timer on the bdi
    we're unregistering, after we've called del_timer_sync().

    This can end up with the bdi being freed with an active timer inside it,
    as in the case of the following dump after the removal of an SD card.

    Fix this by redoing the del_timer_sync() in bdi_destroy().

    ------------[ cut here ]------------
    WARNING: at /home/rabin/kernel/arm/lib/debugobjects.c:262 debug_print_object+0x9c/0xc8()
    ODEBUG: free active (active state 0) object type: timer_list hint: wakeup_timer_fn+0x0/0x180
    Modules linked in:
    Backtrace:
    [] (dump_backtrace+0x0/0x110) from [] (dump_stack+0x18/0x1c)
    r6:c02bc638 r5:00000106 r4:c79f5d18 r3:00000000
    [] (dump_stack+0x0/0x1c) from [] (warn_slowpath_common+0x54/0x6c)
    [] (warn_slowpath_common+0x0/0x6c) from [] (warn_slowpath_fmt+0x38/0x40)
    r8:20000013 r7:c780c6f0 r6:c031613c r5:c780c6f0 r4:c02b1b29
    r3:00000009
    [] (warn_slowpath_fmt+0x0/0x40) from [] (debug_print_object+0x9c/0xc8)
    r3:c02b1b29 r2:c02bc662
    [] (debug_print_object+0x0/0xc8) from [] (debug_check_no_obj_freed+0xac/0x1dc)
    r6:c7964000 r5:00000001 r4:c7964000
    [] (debug_check_no_obj_freed+0x0/0x1dc) from [] (kmem_cache_free+0x88/0x1f8)
    [] (kmem_cache_free+0x0/0x1f8) from [] (blk_release_queue+0x70/0x78)
    [] (blk_release_queue+0x0/0x78) from [] (kobject_release+0x70/0x84)
    r5:c79641f0 r4:c796420c
    [] (kobject_release+0x0/0x84) from [] (kref_put+0x68/0x80)
    r7:00000083 r6:c74083d0 r5:c015289c r4:c796420c
    [] (kref_put+0x0/0x80) from [] (kobject_put+0x48/0x5c)
    r5:c79643b4 r4:c79641f0
    [] (kobject_put+0x0/0x5c) from [] (blk_cleanup_queue+0x68/0x74)
    r4:c7964000
    [] (blk_cleanup_queue+0x0/0x74) from [] (mmc_blk_put+0x78/0xe8)
    r5:00000000 r4:c794c400
    [] (mmc_blk_put+0x0/0xe8) from [] (mmc_blk_release+0x24/0x38)
    r5:c794c400 r4:c0322824
    [] (mmc_blk_release+0x0/0x38) from [] (__blkdev_put+0xe8/0x170)
    r5:c78d5e00 r4:c74083c0
    [] (__blkdev_put+0x0/0x170) from [] (blkdev_put+0x11c/0x12c)
    r8:c79f5f70 r7:00000001 r6:c74083d0 r5:00000083 r4:c74083c0
    r3:00000000
    [] (blkdev_put+0x0/0x12c) from [] (kill_block_super+0x60/0x6c)
    r7:c7942300 r6:c79f4000 r5:00000083 r4:c74083c0
    [] (kill_block_super+0x0/0x6c) from [] (deactivate_locked_super+0x44/0x70)
    r6:c79f4000 r5:c031af64 r4:c794dc00 r3:c00b06c4
    [] (deactivate_locked_super+0x0/0x70) from [] (deactivate_super+0x6c/0x70)
    r5:c794dc00 r4:c794dc00
    [] (deactivate_super+0x0/0x70) from [] (mntput_no_expire+0x188/0x194)
    r5:c794dc00 r4:c7942300
    [] (mntput_no_expire+0x0/0x194) from [] (sys_umount+0x2e4/0x310)
    r6:c7942300 r5:00000000 r4:00000000 r3:00000000
    [] (sys_umount+0x0/0x310) from [] (ret_fast_syscall+0x0/0x30)
    ---[ end trace e5c83c92ada51c76 ]---

    Cc: stable@kernel.org
    Signed-off-by: Rabin Vincent
    Signed-off-by: Linus Walleij
    Signed-off-by: Jens Axboe

    Rabin Vincent
     

07 Nov, 2011

1 commit

  • * 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: Add a 'reason' to wb_writeback_work
    writeback: send work item to queue_io, move_expired_inodes
    writeback: trace event balance_dirty_pages
    writeback: trace event bdi_dirty_ratelimit
    writeback: fix ppc compile warnings on do_div(long long, unsigned long)
    writeback: per-bdi background threshold
    writeback: dirty position control - bdi reserve area
    writeback: control dirty pause time
    writeback: limit max dirty pause time
    writeback: IO-less balance_dirty_pages()
    writeback: per task dirty rate limit
    writeback: stabilize bdi->dirty_ratelimit
    writeback: dirty rate control
    writeback: add bg_threshold parameter to __bdi_update_bandwidth()
    writeback: dirty position control
    writeback: account per-bdi accumulated dirtied pages

    Linus Torvalds
     

01 Nov, 2011

1 commit


31 Oct, 2011

1 commit

  • This creates a new 'reason' field in a wb_writeback_work
    structure, which unambiguously identifies who initiates
    writeback activity. A 'wb_reason' enumeration has been
    added to writeback.h, to enumerate the possible reasons.

    The 'writeback_work_class' tracepoint event class and the
    'writeback_queue_io' tracepoint are updated to include the
    symbolic 'reason' in all trace events.

    And the 'writeback_inodes_sbXXX' family of routines has had
    a wb_stats parameter added to them, so callers can specify
    why writeback is being started.

    Acked-by: Jan Kara
    Signed-off-by: Curt Wohlgemuth
    Signed-off-by: Wu Fengguang

    Curt Wohlgemuth
     

03 Oct, 2011

3 commits

  • There are some imperfections in balanced_dirty_ratelimit.

    1) large fluctuations

    The dirty_rate used for computing balanced_dirty_ratelimit is merely
    averaged over the past 200ms (very small compared to the 3s estimation
    period for write_bw), which makes for a rather dispersed distribution of
    balanced_dirty_ratelimit.

    It's pretty hard to average out the singular points by increasing the
    estimation period. Considering that the averaging technique will
    introduce very undesirable time lags, I give it up totally. (btw, the 3s
    write_bw averaging time lag is much more acceptable because its impact
    is one-way and therefore won't lead to oscillations.)

    The more practical way is filtering -- most singular
    balanced_dirty_ratelimit points can be filtered out by remembering some
    prev_balanced_rate and prev_prev_balanced_rate. However the more
    reliable way is to guard balanced_dirty_ratelimit with task_ratelimit.

    2) due to truncates and fs redirties, the (write_bw <=> dirty_rate)
    match could become unbalanced, which may lead to large systematical
    errors in balanced_dirty_ratelimit. The truncates, due to its possibly
    bumpy nature, can hardly be compensated smoothly. So let's face it. When
    some over-estimated balanced_dirty_ratelimit brings dirty_ratelimit
    high, dirty pages will go higher than the setpoint. task_ratelimit will
    in turn become lower than dirty_ratelimit. So if we consider both
    balanced_dirty_ratelimit and task_ratelimit and update dirty_ratelimit
    only when they are on the same side of dirty_ratelimit, the systematical
    errors in balanced_dirty_ratelimit won't be able to bring
    dirty_ratelimit far away.

    The balanced_dirty_ratelimit estimation may also be inaccurate near
    @limit or @freerun; however, that is less of an issue.

    3) since we ultimately want to

    - keep the fluctuations of task ratelimit as small as possible
    - keep the dirty pages around the setpoint as long time as possible

    the update policy used for (2) also serves the above goals nicely:
    if for some reason the dirty pages are high (task_ratelimit < dirty_ratelimit),
    and dirty_ratelimit is low (dirty_ratelimit < balanced_dirty_ratelimit),
    there is no point to bring up dirty_ratelimit in a hurry only to hurt
    both the above two goals.

    So, we make use of task_ratelimit to limit the update of dirty_ratelimit
    in two ways:

    1) avoid changing dirty rate when it's against the position control target
    (the adjusted rate will slow down the progress of dirty pages going
    back to setpoint).

    2) limit the step size. task_ratelimit is changing values step by step,
    leaving a consistent trace comparing to the randomly jumping
    balanced_dirty_ratelimit. task_ratelimit also has the nice smaller
    errors in stable state and typically larger errors when there are big
    errors in rate. So it's a pretty good limiting factor for the step
    size of dirty_ratelimit.

    Note that bdi->dirty_ratelimit is always tracking balanced_dirty_ratelimit.
    task_ratelimit is merely used as a limiting factor.

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • It's all about bdi->dirty_ratelimit, which aims to be (write_bw / N)
    when there are N dd tasks.

    On write() syscall, use bdi->dirty_ratelimit
    ============================================

    balance_dirty_pages(pages_dirtied)
    {
        task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio();
        pause = pages_dirtied / task_ratelimit;
        sleep(pause);
    }

    On every 200ms, update bdi->dirty_ratelimit
    ===========================================

    bdi_update_dirty_ratelimit()
    {
        task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio();
        balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate;
        bdi->dirty_ratelimit = balanced_dirty_ratelimit
    }

    Estimation of balanced bdi->dirty_ratelimit
    ===========================================

    balanced task_ratelimit
    -----------------------

    balance_dirty_pages() needs to throttle tasks dirtying pages such that
    the total amount of dirty pages stays below the specified dirty limit in
    order to avoid memory deadlocks. Furthermore we desire fairness in that
    tasks get throttled proportionally to the amount of pages they dirty.

    IOW we want to throttle tasks such that we match the dirty rate to the
    writeout bandwidth, this yields a stable amount of dirty pages:

    dirty_rate == write_bw (1)

    The fairness requirement gives us:

    task_ratelimit = balanced_dirty_ratelimit
    == write_bw / N (2)

    where N is the number of dd tasks. We don't know N beforehand, but
    still can estimate balanced_dirty_ratelimit within 200ms.

    Start by throttling each dd task at rate

    task_ratelimit = task_ratelimit_0 (3)
    (any non-zero initial value is OK)

    After 200ms, we measured

    dirty_rate = # of pages dirtied by all dd's / 200ms
    write_bw = # of pages written to the disk / 200ms

    For the aggressive dd dirtiers, the equality holds

    dirty_rate == N * task_rate
    == N * task_ratelimit_0 (4)
    Or
    task_ratelimit_0 == dirty_rate / N (5)

    Now we conclude that the balanced task ratelimit can be estimated by

    balanced_dirty_ratelimit = task_ratelimit_0 * (write_bw / dirty_rate)    (6)

    Because with (4) and (5) we can get the desired equality (1):

    balanced_dirty_ratelimit == (dirty_rate / N) * (write_bw / dirty_rate)
                             == write_bw / N

    Then using the balanced task ratelimit we can compute task pause times like:

    task_pause = task->nr_dirtied / task_ratelimit

    task_ratelimit with position control
    ------------------------------------

    However, while the above gives us means of matching the dirty rate to
    the writeout bandwidth, it at best provides us with a stable dirty page
    count (assuming a static system). In order to control the dirty page
    count such that it is high enough to provide performance, but does not
    exceed the specified limit we need another control.

    The dirty position control works by extending (2) to

    task_ratelimit = balanced_dirty_ratelimit * pos_ratio (7)

    where pos_ratio is a negative feedback function that is subject to

    1) f(setpoint) = 1.0
    2) df/dx < 0

    That is, if the dirty pages are ABOVE the setpoint, we throttle each
    task a bit more HEAVILY than balanced_dirty_ratelimit, so that the dirty
    pages are created more slowly than they are cleaned and thus DROP back to
    the setpoint (and the reverse).

    Based on (7) and the assumption that both dirty_ratelimit and pos_ratio
    remain CONSTANT for the past 200ms, we get

    task_ratelimit_0 = balanced_dirty_ratelimit * pos_ratio (8)

    Putting (8) into (6), we get the formula used in
    bdi_update_dirty_ratelimit():

    balanced_dirty_ratelimit *= pos_ratio * (write_bw / dirty_rate)    (9)

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • Introduce the BDI_DIRTIED counter. It will be used for estimating the
    bdi's dirty bandwidth.

    CC: Jan Kara
    CC: Michael Rubin
    CC: Peter Zijlstra
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

03 Sep, 2011

2 commits


27 Jul, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback: (27 commits)
    mm: properly reflect task dirty limits in dirty_exceeded logic
    writeback: don't busy retry writeback on new/freeing inodes
    writeback: scale IO chunk size up to half device bandwidth
    writeback: trace global_dirty_state
    writeback: introduce max-pause and pass-good dirty limits
    writeback: introduce smoothed global dirty limit
    writeback: consolidate variable names in balance_dirty_pages()
    writeback: show bdi write bandwidth in debugfs
    writeback: bdi write bandwidth estimation
    writeback: account per-bdi accumulated written pages
    writeback: make writeback_control.nr_to_write straight
    writeback: skip tmpfs early in balance_dirty_pages_ratelimited_nr()
    writeback: trace event writeback_queue_io
    writeback: trace event writeback_single_inode
    writeback: remove .nonblocking and .encountered_congestion
    writeback: remove writeback_control.more_io
    writeback: skip balance_dirty_pages() for in-memory fs
    writeback: add bdi_dirty_limit() kernel-doc
    writeback: avoid extra sync work at enqueue time
    writeback: elevate queue_io() into wb_writeback()
    ...

    Fix up trivial conflicts in fs/fs-writeback.c and mm/filemap.c

    Linus Torvalds
     

26 Jul, 2011

2 commits

  • * Merge akpm patch series: (122 commits)
    drivers/connector/cn_proc.c: remove unused local
    Documentation/SubmitChecklist: add RCU debug config options
    reiserfs: use hweight_long()
    reiserfs: use proper little-endian bitops
    pnpacpi: register disabled resources
    drivers/rtc/rtc-tegra.c: properly initialize spinlock
    drivers/rtc/rtc-twl.c: check return value of twl_rtc_write_u8() in twl_rtc_set_time()
    drivers/rtc: add support for Qualcomm PMIC8xxx RTC
    drivers/rtc/rtc-s3c.c: support clock gating
    drivers/rtc/rtc-mpc5121.c: add support for RTC on MPC5200
    init: skip calibration delay if previously done
    misc/eeprom: add eeprom access driver for digsy_mtc board
    misc/eeprom: add driver for microwire 93xx46 EEPROMs
    checkpatch.pl: update $logFunctions
    checkpatch: make utf-8 test --strict
    checkpatch.pl: add ability to ignore various messages
    checkpatch: add a "prefer __aligned" check
    checkpatch: validate signature styles and To: and Cc: lines
    checkpatch: add __rcu as a sparse modifier
    checkpatch: suggest using min_t or max_t
    ...

    Did this as a merge because of (trivial) conflicts in
    - Documentation/feature-removal-schedule.txt
    - arch/xtensa/include/asm/uaccess.h
    that were just easier to fix up in the merge than in the patch series.

    Linus Torvalds
     
  • Vito said:

    : The system has many usb disks coming and going day to day, with their
    : respective bdi's having min_ratio set to 1 when inserted. It works for
    : some time until eventually min_ratio can no longer be set, even when the
    : active set of bdi's seen in /sys/class/bdi/*/min_ratio doesn't add up to
    : anywhere near 100.
    :
    : This then leads to an unrelated starvation problem caused by write-heavy
    : fuse mounts being used atop the usb disks, a problem the min_ratio setting
    : at the underlying devices bdi effectively prevents.

    Fix this leakage by resetting the bdi min_ratio when unregistering the
    BDI.
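
    A hedged sketch of the fix (example_teardown() is illustrative; the
    real change resets the ratio from within bdi_unregister()):

    #include <linux/backing-dev.h>

    static void example_teardown(struct backing_dev_info *bdi)
    {
        /* give the fraction back so repeated hot-plug cycles don't leak it */
        bdi_set_min_ratio(bdi, 0);
        bdi_unregister(bdi);
    }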

    Signed-off-by: Peter Zijlstra
    Reported-by: Vito Caputo
    Cc: Wu Fengguang
    Cc: Miklos Szeredi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra