Eric Lee / smarc-fsl-linux-kernel

27 Jul, 2011

40 commits

82f9d486e memcg: add memory.vmscan_stat

The commit log of 0ae5e89c60c9 ("memcg: count the soft_limit reclaim
in...") says it adds scanning stats to the memory.stat file. But it doesn't,
because we felt we needed to reach a consensus on such new APIs.

This patch is a trial to add memory.vmscan_stat. It shows
- the number of scanned pages (total, anon, file)
- the number of rotated pages (total, anon, file)
- the number of freed pages (total, anon, file)
- the elapsed time (including sleep/pause time)

for both direct and soft reclaim.

The biggest difference from Ying's original version is that this file
can be reset by a write, as in

# echo 0 > ...../memory.vmscan_stat

Here is an example of the output. This is the result after make -j 6 kernel
under a 300M limit.

[kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.vmscan_stat
scanned_pages_by_limit 9471864
scanned_anon_pages_by_limit 6640629
scanned_file_pages_by_limit 2831235
rotated_pages_by_limit 4243974
rotated_anon_pages_by_limit 3971968
rotated_file_pages_by_limit 272006
freed_pages_by_limit 2318492
freed_anon_pages_by_limit 962052
freed_file_pages_by_limit 1356440
elapsed_ns_by_limit 351386416101
scanned_pages_by_system 0
scanned_anon_pages_by_system 0
scanned_file_pages_by_system 0
rotated_pages_by_system 0
rotated_anon_pages_by_system 0
rotated_file_pages_by_system 0
freed_pages_by_system 0
freed_anon_pages_by_system 0
freed_file_pages_by_system 0
elapsed_ns_by_system 0
scanned_pages_by_limit_under_hierarchy 9471864
scanned_anon_pages_by_limit_under_hierarchy 6640629
scanned_file_pages_by_limit_under_hierarchy 2831235
rotated_pages_by_limit_under_hierarchy 4243974
rotated_anon_pages_by_limit_under_hierarchy 3971968
rotated_file_pages_by_limit_under_hierarchy 272006
freed_pages_by_limit_under_hierarchy 2318492
freed_anon_pages_by_limit_under_hierarchy 962052
freed_file_pages_by_limit_under_hierarchy 1356440
elapsed_ns_by_limit_under_hierarchy 351386416101
scanned_pages_by_system_under_hierarchy 0
scanned_anon_pages_by_system_under_hierarchy 0
scanned_file_pages_by_system_under_hierarchy 0
rotated_pages_by_system_under_hierarchy 0
rotated_anon_pages_by_system_under_hierarchy 0
rotated_file_pages_by_system_under_hierarchy 0
freed_pages_by_system_under_hierarchy 0
freed_anon_pages_by_system_under_hierarchy 0
freed_file_pages_by_system_under_hierarchy 0
elapsed_ns_by_system_under_hierarchy 0

total_xxxx is for hierarchy management.

This will be useful for further memcg development and needs to be
developed before we do any complicated rework of LRU/softlimit
management.

This patch adds a new struct memcg_scanrecord to the scan_control struct.
sc->nr_scanned et al. are not designed for exporting information. For
example, nr_scanned is reset frequently and is incremented by 2 when
scanning mapped pages.

To avoid complexity, I added a new param in scan_control which is for
exporting scanning score.
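
As a rough illustration of how a consumer might read this file, the sketch
below parses the "name value" lines shown above and groups them by scope and
reclaim cause. The helper name parse_vmscan_stat and the grouping scheme are
assumptions made for this example; they are not part of the patch.

```python
def parse_vmscan_stat(text):
    """Parse 'name value' lines into {scope: {cause: {counter: int}}}.

    Hypothetical helper: the key layout is inferred from the sample
    output in this commit message, not taken from the kernel source.
    """
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        name, value = parts
        scope = "local"
        if name.endswith("_under_hierarchy"):
            scope = "hierarchy"
            name = name[: -len("_under_hierarchy")]
        # Remaining names look like "<counter>_by_<cause>".
        counter, sep, cause = name.rpartition("_by_")
        if not sep:
            continue
        stats.setdefault(scope, {}).setdefault(cause, {})[counter] = int(value)
    return stats


sample = """\
scanned_pages_by_limit 9471864
freed_pages_by_limit 2318492
elapsed_ns_by_limit 351386416101
scanned_pages_by_system 0
scanned_pages_by_limit_under_hierarchy 9471864
"""
stats = parse_vmscan_stat(sample)
limit = stats["local"]["limit"]
# Fraction of scanned pages that limit reclaim actually freed.
freed_ratio = limit["freed_pages"] / limit["scanned_pages"]
```

Grouping by cause keeps the by_limit and by_system counters apart, so ratios
such as freed/scanned can be computed per reclaim type.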

Signed-off-by: KAMEZAWA Hiroyuki
Cc: Daisuke Nishimura
Cc: Michal Hocko
Cc: Ying Han
Cc: Andrew Bresticker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2011-07-27 07:49:42 +0800
108b6a784 memcg: fix behavior of mem_cgroup_resize_limit()

Commit 22a668d7c3ef ("memcg: fix behavior under memory.limit equals to
memsw.limit") introduced "memsw_is_minimum" flag, which becomes true
when mem_limit == memsw_limit. The flag is checked at the beginning of
reclaim, and "noswap" is set if the flag is true, because using swap is
meaningless in this case.

This works well in most cases, but when we try to shrink mem_limit,
which is currently the same as memsw_limit, we might fail to shrink
mem_limit because swap isn't used.

This patch fixes this behavior by:
- check MEM_CGROUP_RECLAIM_SHRINK at the beginning of reclaim
- If it is set, don't set "noswap" flag even if memsw_is_minimum is true.
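
The two steps above can be sketched as a small decision function. This is
only a model of the logic described here: the function name, its parameters
and the bit value chosen for MEM_CGROUP_RECLAIM_SHRINK are assumptions for
illustration, not the kernel's definitions.

```python
# Hypothetical bit value; only the flag name comes from the commit message.
MEM_CGROUP_RECLAIM_SHRINK = 1 << 1


def reclaim_noswap(reclaim_options, memsw_is_minimum, noswap=False):
    """Return True if this reclaim pass should avoid using swap."""
    shrink = bool(reclaim_options & MEM_CGROUP_RECLAIM_SHRINK)
    # Swap is pointless when mem_limit == memsw_limit, unless we are
    # shrinking mem_limit: then swapping out is what lowers memory usage.
    if not shrink and memsw_is_minimum:
        noswap = True
    return noswap


normal_reclaim = reclaim_noswap(0, memsw_is_minimum=True)
shrinking = reclaim_noswap(MEM_CGROUP_RECLAIM_SHRINK, memsw_is_minimum=True)
```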

Signed-off-by: Daisuke Nishimura
Cc: Balbir Singh
Acked-by: KAMEZAWA Hiroyuki
Cc: Michal Hocko
Cc: Ying Han
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Daisuke Nishimura
2011-07-27 07:49:42 +0800
4508378b9 memcg: fix vmscan count in small memcgs

Commit 246e87a93934 ("memcg: fix get_scan_count() for small targets")
fixed the memcg/kswapd behavior against small targets and prevented the
vmscan priority from getting too high.

But the implementation is too naive and adds another problem for small
memcgs. It always forces a scan of 32 pages of file/anon and doesn't
handle swappiness and other rotate_info. It makes vmscan scan the anon LRU
regardless of swappiness, which hurts reclaim. This patch fixes it by
adjusting the scanning count with regard to swappiness et al.

In a test, "cat 1G file under 300M limit" (swappiness=20):
before patch
scanned_pages_by_limit 360919
scanned_anon_pages_by_limit 180469
scanned_file_pages_by_limit 180450
rotated_pages_by_limit 31
rotated_anon_pages_by_limit 25
rotated_file_pages_by_limit 6
freed_pages_by_limit 180458
freed_anon_pages_by_limit 19
freed_file_pages_by_limit 180439
elapsed_ns_by_limit 429758872
after patch
scanned_pages_by_limit 180674
scanned_anon_pages_by_limit 24
scanned_file_pages_by_limit 180650
rotated_pages_by_limit 35
rotated_anon_pages_by_limit 24
rotated_file_pages_by_limit 11
freed_pages_by_limit 180634
freed_anon_pages_by_limit 0
freed_file_pages_by_limit 180634
elapsed_ns_by_limit 367119089
scanned_pages_by_system 0

The amount of anon scanning is decreased (as expected), and elapsed time
is reduced. With this patch, small memcgs will work better.
(*) Because the amount of file cache is much bigger than anon,
reclaim_stat's rotate-scan counter makes us scan files more.
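
To make the comparison concrete, the following sketch computes the share of
anon pages among all scanned pages and the relative reduction in elapsed
time, using only the numbers quoted above. The variable and function names
are made up for this example.

```python
# Counters copied from the before/after output in this commit message.
before = {"scanned": 360919, "scanned_anon": 180469, "elapsed_ns": 429758872}
after = {"scanned": 180674, "scanned_anon": 24, "elapsed_ns": 367119089}


def anon_share(stats):
    """Fraction of scanned pages that came from the anon LRU."""
    return stats["scanned_anon"] / stats["scanned"]


anon_before = anon_share(before)  # roughly half of all scanned pages
anon_after = anon_share(after)    # a tiny fraction
elapsed_saving = 1 - after["elapsed_ns"] / before["elapsed_ns"]

print(f"anon share of scans: {anon_before:.1%} -> {anon_after:.3%}")
print(f"elapsed time reduced by {elapsed_saving:.1%}")
```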

Signed-off-by: KAMEZAWA Hiroyuki
Cc: Daisuke Nishimura
Cc: Michal Hocko
Cc: Ying Han
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2011-07-27 07:49:42 +0800
1af8efe96 memcg: change memcg_oom_mutex to spinlock

memcg_oom_mutex is used to protect memcg OOM path and eventfd interface
for oom_control. None of the critical sections which it protects sleep
(eventfd_signal works from atomic context and the rest are simple linked
list resp. oom_lock atomic operations).

A mutex is also too heavyweight for those code paths because it triggers a
lot of scheduling. It also makes convoying effects more visible
when we have a large number of OOM kills, because we take the lock
multiple times during mem_cgroup_handle_oom, so we have multiple places
where many processes can sleep.

Signed-off-by: Michal Hocko
Cc: Balbir Singh
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2011-07-27 07:49:42 +0800
79dfdaccd memcg: make oom_lock 0 and 1 based rather than counter

Commit 867578cb ("memcg: fix oom kill behavior") introduced an oom_lock
counter which is incremented by mem_cgroup_oom_lock when we are about to
handle a memcg OOM situation. mem_cgroup_handle_oom falls back to a sleep
if oom_lock > 1 to prevent multiple OOM kills at the same time.
The counter is then decremented by mem_cgroup_oom_unlock called from the
same function.

This works correctly but it can lead to serious starvation when we have
many processes triggering OOM and many CPUs available for them (I have
tested with 16 CPUs).

Consider a process (call it A) which gets the oom_lock (the first one
that got to mem_cgroup_handle_oom and grabbed memcg_oom_mutex) and other
processes that are blocked on the mutex. While A releases the mutex and
calls mem_cgroup_out_of_memory others will wake up (one after another)
and increase the counter and fall into sleep (memcg_oom_waitq).

Once A finishes mem_cgroup_out_of_memory it takes the mutex again and
decreases oom_lock and wakes other tasks (if releasing memory by
somebody else - e.g. killed process - hasn't done it yet).

A testcase would look like:
Assume malloc XXX is a program allocating XXX Megabytes of memory
which touches all allocated pages in a tight loop
# swapoff SWAP_DEVICE
# cgcreate -g memory:A
# cgset -r memory.oom_control=0 A
# cgset -r memory.limit_in_bytes=200M A
# for i in `seq 100`
# do
# cgexec -g memory:A malloc 10 &
# done

The main problem here is that all processes still race for the mutex and
there is no guarantee that we will get counter back to 0 for those that
got back to mem_cgroup_handle_oom. In the end the whole convoy
in/decreases the counter but we do not get to 1 that would enable
killing so nothing useful can be done. The time is basically unbounded
because it highly depends on scheduling and ordering on mutex (I have
seen this taking hours...).

This patch replaces the counter by a simple {un}lock semantic. As
mem_cgroup_oom_{un}lock works on a subtree of a hierarchy we have to
make sure that nobody else races with us, which is guaranteed by the
memcg_oom_mutex.

We have to be careful while locking subtrees because we can encounter a
subtree which is already locked: hierarchy:

      A
     / \
    B   \
   / \   \
  C   D   E

B - C - D tree might be already locked. While we want to enable locking
E subtree because OOM situations cannot influence each other we
definitely do not want to allow locking A.

Therefore we have to refuse lock if any subtree is already locked and
clear up the lock for all nodes that have been set up to the failure
point.
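
The refuse-and-roll-back rule can be sketched in a few lines. This is a
simplified, single-threaded model: the Group class and the function names
are hypothetical, and the serialization that memcg_oom_mutex provides in the
kernel is deliberately left out.

```python
class Group:
    """Hypothetical stand-in for one memory cgroup in the hierarchy."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.oom_lock = False

    def subtree(self):
        # Pre-order walk: the group itself, then all of its descendants.
        yield self
        for child in self.children:
            yield from child.subtree()


def oom_trylock(root):
    """Try to lock every group under root; undo everything on failure."""
    taken = []
    for group in root.subtree():
        if group.oom_lock:
            # Part of the subtree is already locked: clear the locks we
            # set up to the failure point and refuse.
            for g in taken:
                g.oom_lock = False
            return False
        group.oom_lock = True
        taken.append(group)
    return True


# The hierarchy from the diagram above.
c, d, e = Group("C"), Group("D"), Group("E")
b = Group("B", [c, d])
a = Group("A", [b, e])

locked_b = oom_trylock(b)  # B, C and D become locked
locked_e = oom_trylock(e)  # independent subtree, so this is allowed
locked_a = oom_trylock(a)  # refused because B is already locked
```

After the failed attempt on A, A itself is left unlocked, while the locks
held on the B and E subtrees are untouched.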

On the other hand we have to make sure that the rest of the world will
recognize that a group is under OOM even though it doesn't have a lock.
Therefore we have to introduce an under_oom variable which is incremented
and decremented for the whole subtree when we enter resp. leave
mem_cgroup_handle_oom. under_oom, unlike oom_lock, doesn't need to be
updated under memcg_oom_mutex because its users only check a single
group and they use atomic operations for that.

This can be checked easily by the following test case:

# cgcreate -g memory:A
# cgset -r memory.use_hierarchy=1 A
# cgset -r memory.oom_control=1 A
# cgset -r memory.limit_in_bytes=100M A
# cgset -r memory.memsw.limit_in_bytes=100M A
# cgcreate -g memory:A/B
# cgset -r memory.oom_control=1 A/B
# cgset -r memory.limit_in_bytes=20M A/B
# cgset -r memory.memsw.limit_in_bytes=20M A/B
# cgexec -g memory:A/B malloc 30 & #->this will be blocked by OOM of group B
# cgexec -g memory:A malloc 80 & #->this will be blocked by OOM of group A

While B gets oom_lock, A will not get it. Both of them go to sleep and
wait for an external action. We can raise the limit for A to force it
to wake up:

# cgset -r memory.memsw.limit_in_bytes=300M A
# cgset -r memory.limit_in_bytes=300M A

malloc in A has to wake up even though it doesn't have oom_lock.

Finally, the unlock path is very easy because we always unlock only the
subtree we have locked previously while we always decrement under_oom.

Signed-off-by: Michal Hocko
Signed-off-by: KAMEZAWA Hiroyuki
Cc: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2011-07-27 07:49:42 +0800
bb2a0de92 memcg: consolidate memory cgroup lru stat functions

In mm/memcontrol.c, there are many LRU stat functions, such as:

mem_cgroup_zone_nr_lru_pages
mem_cgroup_node_nr_file_lru_pages
mem_cgroup_nr_file_lru_pages
mem_cgroup_node_nr_anon_lru_pages
mem_cgroup_nr_anon_lru_pages
mem_cgroup_node_nr_unevictable_lru_pages
mem_cgroup_nr_unevictable_lru_pages
mem_cgroup_node_nr_lru_pages
mem_cgroup_nr_lru_pages
mem_cgroup_get_local_zonestat

Some of them are under #if MAX_NUMNODES > 1 and others are not.
This seems bad. This patch consolidates all functions into

mem_cgroup_zone_nr_lru_pages()
mem_cgroup_node_nr_lru_pages()
mem_cgroup_nr_lru_pages()

For these functions, "which LRU?" information is passed by a mask.

example:
mem_cgroup_nr_lru_pages(mem, BIT(LRU_ACTIVE_ANON))

I also added some macros: ALL_LRU, ALL_LRU_FILE, ALL_LRU_ANON.

example:
mem_cgroup_nr_lru_pages(mem, ALL_LRU)

BTW, this patch also seems better with regard to the NUMA memory placement
of the counters.

Currently, when we gather all LRU information, we scan in the following order:
for_each_lru -> for_each_node -> for_each_zone.

This means we'll touch cache lines in different nodes in turn.

After patch, we'll scan
for_each_node -> for_each_zone -> for_each_lru(mask)

Then, we'll gather information in the same cacheline at once.
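
A minimal model of the mask-based interface and the new loop order is
sketched below. The LRU index values, the nested-list data layout and the
function names are assumptions made for this example; only the macro names
and the node -> zone -> lru(mask) order come from the description above.

```python
# Assumed LRU list indices; the real enum lives in the kernel headers.
(LRU_INACTIVE_ANON, LRU_ACTIVE_ANON,
 LRU_INACTIVE_FILE, LRU_ACTIVE_FILE, LRU_UNEVICTABLE) = range(5)
NR_LRU_LISTS = 5


def BIT(n):
    return 1 << n


ALL_LRU_FILE = BIT(LRU_INACTIVE_FILE) | BIT(LRU_ACTIVE_FILE)
ALL_LRU_ANON = BIT(LRU_INACTIVE_ANON) | BIT(LRU_ACTIVE_ANON)
ALL_LRU = (1 << NR_LRU_LISTS) - 1


def zone_nr_lru_pages(zone_counts, lru_mask):
    # Innermost loop: all requested LRUs of one zone are read together.
    return sum(count for lru, count in enumerate(zone_counts)
               if lru_mask & BIT(lru))


def node_nr_lru_pages(node, lru_mask):
    return sum(zone_nr_lru_pages(zone, lru_mask) for zone in node)


def nr_lru_pages(nodes, lru_mask):
    # for_each_node -> for_each_zone -> for_each_lru(mask)
    return sum(node_nr_lru_pages(node, lru_mask) for node in nodes)


# Per-zone page counts, indexed by LRU list (made-up numbers).
nodes = [
    [[1, 2, 3, 4, 5], [10, 20, 30, 40, 50]],  # node 0: two zones
    [[100, 200, 300, 400, 500]],              # node 1: one zone
]
active_anon = nr_lru_pages(nodes, BIT(LRU_ACTIVE_ANON))
all_file = nr_lru_pages(nodes, ALL_LRU_FILE)
everything = nr_lru_pages(nodes, ALL_LRU)
```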

[akpm@linux-foundation.org: fix warnings, build error]
Signed-off-by: KAMEZAWA Hiroyuki
Cc: Daisuke Nishimura
Cc: Balbir Singh
Cc: Michal Hocko
Cc: Ying Han
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2011-07-27 07:49:42 +0800
1f4c025b5 memcg: export memory cgroup's swappiness with mem_cgroup_swappiness()

Each memory cgroup has a 'swappiness' value which can be accessed by
get_swappiness(memcg). The major user is try_to_free_mem_cgroup_pages()
and swappiness is passed by argument. It's propagated by scan_control.

get_swappiness() is a static function but some planned updates will need
to get swappiness from files other than memcontrol.c. This patch exports
get_swappiness() as mem_cgroup_swappiness(). With this, we can remove the
swappiness argument from try_to_free... and drop swappiness from
scan_control. Only memcg uses it.

Signed-off-by: KAMEZAWA Hiroyuki
Cc: Daisuke Nishimura
Cc: Balbir Singh
Cc: Michal Hocko
Cc: Ying Han
Cc: Shaohua Li
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2011-07-27 07:49:42 +0800
b830ac1d9 rtc: fix hrtimer deadlock

Ben reported a lockup related to rtc. The lockup happens due to:

CPU0                                        CPU1

rtc_irq_set_state()                         __run_hrtimer()
  spin_lock_irqsave(&rtc->irq_task_lock)      rtc_handle_legacy_irq();
                                                spin_lock(&rtc->irq_task_lock);
  hrtimer_cancel()
    while (callback_running);

So the running callback never finishes as it's blocked on
rtc->irq_task_lock.

Use hrtimer_try_to_cancel() instead and drop rtc->irq_task_lock while
waiting for the callback. Fix this for both rtc_irq_set_state() and
rtc_irq_set_freq().

Signed-off-by: Thomas Gleixner
Reported-by: Ben Greear
Cc: John Stultz
Cc: Ingo Molnar
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Thomas Gleixner
2011-07-27 07:49:42 +0800
431e2bcc3 rtc: limit frequency

Due to the hrtimer self rearming mode a user can DoS the machine simply
because it's starved by hrtimer events.

The RTC hrtimer is self rearming. We really need to limit the frequency
to something sensible.

Signed-off-by: Thomas Gleixner
Cc: John Stultz
Cc: Ingo Molnar
Cc: Ben Greear
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Thomas Gleixner
2011-07-27 07:49:42 +0800
2c4f57d12 rtc: handle errors correctly in rtc_irq_set_state()

The code checks the correctness of the parameters, but unconditionally
arms/disarms the hrtimer.

The result is that a random task might arm/disarm rtc timer and surprise
the real owner by either generating events or by stopping them.

Signed-off-by: Thomas Gleixner
Cc: John Stultz
Cc: Ingo Molnar
Cc: Ben Greear
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Thomas Gleixner
2011-07-27 07:49:41 +0800
b45d59fb9 mn10300, exec: remove redundant set_fs(USER_DS)

The address limit is already set in flush_old_exec() so this
set_fs(USER_DS) is redundant.

Signed-off-by: Mathias Krause
Cc: Koichi Yasutake
Acked-by: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mathias Krause
2011-07-27 07:49:41 +0800
fc92805a8 drivers/base/power/opp.c: fix dev_opp initial value

The initial value of dev_opp should be an ERR_PTR(), since IS_ERR() is used
to check for errors.

Signed-off-by: Jonghwan Choi
Cc: "Rafael J. Wysocki"
Cc: Greg KH
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jonghwan Choi
2011-07-27 07:49:41 +0800
adc400f69 frv, exec: remove redundant set_fs(USER_DS)

The address limit is already set in flush_old_exec() so those calls to
set_fs(USER_DS) are redundant.

Also removed the dead code in flush_thread().

Signed-off-by: Mathias Krause
Acked-by: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mathias Krause
2011-07-27 07:49:41 +0800
6fd4ce886 Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/upstream-linus

* 'upstream' of git://git.linux-mips.org/pub/scm/upstream-linus: (31 commits)
MIPS: Close races in TLB modify handlers.
MIPS: Add uasm UASM_i_SRL_SAFE macro.
MIPS: RB532: Use hex_to_bin()
MIPS: Enable cpu_has_clo_clz for MIPS Technologies' platforms
MIPS: PowerTV: Provide cpu-feature-overrides.h
MIPS: Remove pointless return statement from empty void functions.
MIPS: Limit fixrange_init() to the FIXMAP region
MIPS: Install handlers for software IRQs
MIPS: Move FIXADDR_TOP into spaces.h
MIPS: Add SYNC after cacheflush
MIPS: pfn_valid() is broken on low memory HIGHMEM systems
MIPS: HIGHMEM DMA on noncoherent MIPS32 processors
MIPS: topdown mmap support
MIPS: Remove redundant addr_limit assignment on exec.
MIPS: AR7: Replace __attribute__((__packed__)) with __packed
MIPS: AR7: Remove 'space before tabs' in platform.c
MIPS: Lantiq: Add missing clk_enable and clk_disable functions.
MIPS: AR7: Fix trailing semicolon bug in clock.c
MAINTAINERS: Update MIPS entry.
MIPS: BCM63xx: Remove duplicate PERF_IRQSTAT_REG definition
...

Linus Torvalds
2011-07-27 05:17:28 +0800
ba5b56cb3 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits)
ceph: document unlocked d_parent accesses
ceph: explicitly reference rename old_dentry parent dir in request
ceph: document locking for ceph_set_dentry_offset
ceph: avoid d_parent in ceph_dentry_hash; fix ceph_encode_fh() hashing bug
ceph: protect d_parent access in ceph_d_revalidate
ceph: protect access to d_parent
ceph: handle racing calls to ceph_init_dentry
ceph: set dir complete frag after adding capability
rbd: set blk_queue request sizes to object size
ceph: set up readahead size when rsize is not passed
rbd: cancel watch request when releasing the device
ceph: ignore lease mask
ceph: fix ceph_lookup_open intent usage
ceph: only link open operations to directory unsafe list if O_CREAT|O_TRUNC
ceph: fix bad parent_inode calc in ceph_lookup_open
ceph: avoid carrying Fw cap during write into page cache
libceph: don't time out osd requests that haven't been received
ceph: report f_bfree based on kb_avail rather than diffing.
ceph: only queue capsnap if caps are dirty
ceph: fix snap writeback when racing with writes
...

Linus Torvalds
2011-07-27 04:38:50 +0800
243dd2809 gma500: udelay(20000) is too long again

so replace it with mdelay(20).

Fixes build error:

ERROR: "__bad_udelay" [drivers/staging/gma500/psb_gfx.ko] undefined!

Signed-off-by: Stephen Rothwell
Signed-off-by: Linus Torvalds

Stephen Rothwell
2011-07-27 02:55:14 +0800
9c646cfc3 USB / Renesas: Fix build issue related to struct scatterlist

Fix build issue caused by undefined struct scatterlist in
drivers/usb/renesas_usbhs/fifo.c.

Signed-off-by: Rafael J. Wysocki
Signed-off-by: Linus Torvalds

Rafael J. Wysocki
2011-07-27 02:52:55 +0800
6c0cbef66 MMC / TMIO: Fix build issue related to struct scatterlist

Fix build issue caused by undefined struct scatterlist in
drivers/mmc/host/tmio_mmc.c.

Signed-off-by: Rafael J. Wysocki
Signed-off-by: Linus Torvalds

Rafael J. Wysocki
2011-07-27 02:52:55 +0800
2ac232f37 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
jbd: change the field "b_cow_tid" of struct journal_head from type unsigned to tid_t
ext3.txt: update the links in the section "useful links" to the latest ones
ext3: Fix data corruption in inodes with journalled data
ext2: check xattr name_len before acquiring xattr_sem in ext2_xattr_get
ext3: Fix compilation with -DDX_DEBUG
quota: Remove unused declaration
jbd: Use WRITE_SYNC in journal checkpoint.
jbd: Fix oops in journal_remove_journal_head()
ext3: Return -EINVAL when start is beyond the end of fs in ext3_trim_fs()
ext3/ioctl.c: silence sparse warnings about different address spaces
ext3/ext4 Documentation: remove bh/nobh since it has been deprecated
ext3: Improve truncate error handling
ext3: use proper little-endian bitops
ext2: include fs.h into ext2_fs.h
ext3: Fix oops in ext3_try_to_allocate_with_rsv()
jbd: fix a bug of leaking jh->b_jcount
jbd: remove dependency on __GFP_NOFAIL
ext3: Convert ext3 to new truncate calling convention
jbd: Add fixed tracepoints
ext3: Add fixed tracepoints

Resolve conflicts in fs/ext3/fsync.c due to fsync locking push-down and
new fixed tracepoints.

Linus Torvalds
2011-07-27 02:34:40 +0800
d79698da3 ceph: document unlocked d_parent accesses

For the most part we don't care about racing with rename when directing
MDS requests; either the old or new parent is fine. Document that, and
do some minor cleanup.

Reviewed-by: Yehuda Sadeh
Signed-off-by: Sage Weil

Sage Weil
2011-07-27 02:31:26 +0800
41b02e1f9 ceph: explicitly reference rename old_dentry parent dir in request

We carry a pin on the parent directory for the rename source and dest
dentries. For the source it's r_locked_dir; we need to explicitly
reference the old_dentry parent as well, since the dentry's d_parent may
change between when the request was created and pinned and when it is
freed.

Reviewed-by: Yehuda Sadeh
Signed-off-by: Sage Weil

Sage Weil
2011-07-27 02:31:14 +0800
4f1772645 ceph: document locking for ceph_set_dentry_offset

Reviewed-by: Yehuda Sadeh
Signed-off-by: Sage Weil

Sage Weil
2011-07-27 02:31:08 +0800
e5f86dc37 ceph: avoid d_parent in ceph_dentry_hash; fix ceph_encode_fh() hashing bug

Have the caller pass in a safely-obtained reference to the parent directory
for calculating a dentry's hash value.

While we're here, simplify the flow through ceph_encode_fh() so that there
is a single exit point and cleanup.

Also fix a bug with the dentry hash calculation: calculate the hash for the
dentry we were given, not its parent.

Reviewed-by: Yehuda Sadeh
Signed-off-by: Sage Weil

Sage Weil
2011-07-27 02:30:55 +0800
bf1c6aca9 ceph: protect d_parent access in ceph_d_revalidate

Protect d_parent with d_lock. Carry a reference. Simplify the flow so
that there is a single exit point and cleanup.

Reviewed-by: Yehuda Sadeh
Signed-off-by: Sage Weil

Sage Weil
2011-07-27 02:30:43 +0800
5f21c96dd ceph: protect access to d_parent

d_parent is protected by d_lock: use it when looking up a dentry's parent
directory inode. Also take a reference and drop it in the caller to avoid
a use-after-free.

Reported-by: Al Viro
Reviewed-by: Yehuda Sadeh
Signed-off-by: Sage Weil

Sage Weil
2011-07-27 02:30:29 +0800
48d0cbd12 ceph: handle racing calls to ceph_init_dentry

The ->lookup() and prepopulate_readdir() callers are working with unhashed
dentries, so we don't have to worry. The export.c callers, though, need
to initialize something they got back from d_obtain_alias() and are
potentially racing with other callers. Make sure we don't return unless
the dentry is properly initialized (by us or someone else).

Reported-by: Al Viro
Reviewed-by: Yehuda Sadeh
Signed-off-by: Sage Weil

Sage Weil
2011-07-27 02:30:15 +0800
dfabbed6f ceph: set dir complete frag after adding capability

Currently ceph_add_cap clears the complete bit if we are newly issued the
FILE_SHARED cap, which is normally the case for a newly issued cap on a new
directory. That means we clear the just-set bit. Move the check that sets
the flag to after the cap is added/updated.

Reviewed-by: Yehuda Sadeh
Signed-off-by: Sage Weil

Sage Weil
2011-07-27 02:30:02 +0800
029bcbd8b rbd: set blk_queue request sizes to object size

This improves performance since more requests can be merged.

Reviewed-by: Yehuda Sadeh
Signed-off-by: Josh Durgin

Josh Durgin
2011-07-27 02:29:35 +0800
e98522274 ceph: set up readahead size when rsize is not passed

This should improve the default read performance, as without it
readahead is practically disabled.

Signed-off-by: Yehuda Sadeh

Yehuda Sadeh
2011-07-27 02:29:14 +0800
79e3057c4 rbd: cancel watch request when releasing the device

We were missing this cleanup, so when a device was released the osd
didn't clean up its watchers list. Subsequent notifications could then
be slow, as the osd needed to time out on the client.

Signed-off-by: Yehuda Sadeh

Yehuda Sadeh
2011-07-27 02:29:04 +0800
2f90b852e ceph: ignore lease mask

The lease mask is no longer used (and it changed a while back). Instead,
use a non-zero duration to indicate that there is a lease being issued.

Reviewed-by: Yehuda Sadeh
Signed-off-by: Sage Weil

Sage Weil
2011-07-27 02:28:25 +0800
468640e32 ceph: fix ceph_lookup_open intent usage

We weren't properly calling lookup_instantiate_filp when setting up the
lookup intent, which could lead to file leakage on errors. So:

- use separate helper for the hidden snapdir translation, immediately
following the mds request
- use ceph_finish_lookup for the final dentry/return value dance in the
exit path
- lookup_instantiate_filp on success

Reported-by: Al Viro
Reviewed-by: Yehuda Sadeh
Signed-off-by: Sage Weil

Sage Weil
2011-07-27 02:28:11 +0800
9bae113a0 ceph: only link open operations to directory unsafe list if O_CREAT|O_TRUNC

We only need to put these on the directory unsafe list if they have
side effects that fsync(2) should flush out.

Reviewed-by: Yehuda Sadeh
Signed-off-by: Sage Weil

Sage Weil
2011-07-27 02:27:59 +0800
acda76578 ceph: fix bad parent_inode calc in ceph_lookup_open

We were always getting NULL here because the intent file f_dentry is always
NULL at this point, which means we were always passing NULL to
ceph_mdsc_do_request. In reality, this was fine, since this isn't
currently ever a write operation that needs to get strung on the dir's
unsafe list.

Use the dir explicitly, and only pass it if this open has side-effects that
a dir fsync should flush.

Reviewed-by: Yehuda Sadeh
Signed-off-by: Sage Weil

Sage Weil
2011-07-27 02:27:48 +0800
d8de9ab63 ceph: avoid carrying Fw cap during write into page cache

The generic_file_aio_write call may block on balance_dirty_pages while we
flush data to the OSDs. If we hold a reference to the FILE_WR cap during
that interval revocation by the MDS (e.g., to do a stat(2)) may be very
slow.

Reviewed-by: Yehuda Sadeh
Signed-off-by: Sage Weil

Sage Weil
2011-07-27 02:27:34 +0800
4cf9d5446 libceph: don't time out osd requests that haven't been received

Keep track of when an outgoing message is ACKed (i.e., the server fully
received it and, presumably, queued it for processing). Time out OSD
requests only if it's been too long since they've been received.

This prevents timeouts and connection thrashing when the OSDs are simply
busy and are throttling the requests they read off the network.
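
The timeout rule can be pictured with a small model. Everything here is
hypothetical (the OsdRequest class, its fields and the timed_out helper);
it only illustrates "start the clock at the ACK, not at the send".

```python
class OsdRequest:
    """Hypothetical request record; field names are made up for the sketch."""

    def __init__(self, sent_at):
        self.sent_at = sent_at
        self.acked_at = None  # set once the server ACKs the message

    def mark_acked(self, now):
        self.acked_at = now


def timed_out(req, now, timeout):
    # A request the server has not ACKed yet is never timed out: the OSD
    # may simply be throttling what it reads off the network.
    if req.acked_at is None:
        return False
    # Otherwise measure from the moment the request was received.
    return now - req.acked_at > timeout


req = OsdRequest(sent_at=0.0)
still_waiting = timed_out(req, now=120.0, timeout=60.0)  # not ACKed yet
req.mark_acked(now=120.0)
just_acked = timed_out(req, now=150.0, timeout=60.0)     # 30s since the ACK
expired = timed_out(req, now=200.0, timeout=60.0)        # 80s since the ACK
```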

Reviewed-by: Yehuda Sadeh
Signed-off-by: Sage Weil

Sage Weil
2011-07-27 02:27:24 +0800
8f04d4227 ceph: report f_bfree based on kb_avail rather than diffing.

Reviewed-by: Yehuda Sadeh
Signed-off-by: Greg Farnum

Greg Farnum
2011-07-27 02:27:06 +0800
e77dc3e9c ceph: only queue capsnap if caps are dirty

We used to go into this branch if i_wrbuffer_ref_head was non-zero. This
was an ancient check from before we were careful about dealing with all
kinds of caps (and not just dirty pages). It is cleaner to only queue a
capsnap if there is an actual dirty cap. If we are racing with...
something...we will end up here with ci->i_wrbuffer_refs but no dirty
caps.

Reviewed-by: Yehuda Sadeh
Signed-off-by: Sage Weil

Sage Weil
2011-07-27 02:26:41 +0800
af0ed569d ceph: fix snap writeback when racing with writes

There are two problems that come up when we try to queue a capsnap while a
write is in progress:

- The FILE_WR cap is held, but not yet dirty, so we may queue a capsnap
with dirty == 0. That will crash later in __ceph_flush_snaps(). Or
on the FILE_WR cap if a write is in progress.
- We may not have i_head_snapc set, which causes problems pretty quickly.
Look to the snaprealm in this case.

Reviewed-by: Yehuda Sadeh
Signed-off-by: Sage Weil

Sage Weil
2011-07-27 02:26:31 +0800
9cfa1098d ceph: use flag bit for at_end readdir flag

This saves us a word of memory per file.

Reviewed-by: Yehuda Sadeh
Signed-off-by: Sage Weil

Sage Weil
2011-07-27 02:26:18 +0800