21 Jul, 2010

2 commits

  • Borislav Petkov reported that his 32-bit NUMA system has a problem:

    [ 0.000000] Reserving total of 4c00 pages for numa KVA remap
    [ 0.000000] kva_start_pfn ~ 32800 max_low_pfn ~ 375fe
    [ 0.000000] max_pfn = 238000
    [ 0.000000] 8202MB HIGHMEM available.
    [ 0.000000] 885MB LOWMEM available.
    [ 0.000000] mapped low ram: 0 - 375fe000
    [ 0.000000] low ram: 0 - 375fe000
    [ 0.000000] alloc (nid=8 100000 - 7ee00000) (1000000 - ffffffff) 1000 1000 => 34e7000
    [ 0.000000] alloc (nid=8 100000 - 7ee00000) (1000000 - ffffffff) 200 40 => 34c9d80
    [ 0.000000] alloc (nid=0 100000 - 7ee00000) (1000000 - ffffffffffffffff) 180 40 => 34e6140
    [ 0.000000] alloc (nid=1 80000000 - c7e60000) (1000000 - ffffffffffffffff) 240 40 => 80000000
    [ 0.000000] BUG: unable to handle kernel paging request at 40000000
    [ 0.000000] IP: [] __alloc_memory_core_early+0x147/0x1d6
    [ 0.000000] *pdpt = 0000000000000000 *pde = f000ff53f000ff00
    ...
    [ 0.000000] Call Trace:
    [ 0.000000] [] ? __alloc_bootmem_node+0x216/0x22f
    [ 0.000000] [] ? sparse_early_usemaps_alloc_node+0x5a/0x10b
    [ 0.000000] [] ? sparse_init+0x1dc/0x499
    [ 0.000000] [] ? paging_init+0x168/0x1df
    [ 0.000000] [] ? native_pagetable_setup_start+0xef/0x1bb

    It looks like bootmem allocations are being placed at too high an address.

    Try to cap the allocation limit with get_max_mapped().
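
    A minimal sketch of the idea (illustrative names and values, not the
    kernel's exact early-allocator interfaces): clamp the upper bound of the
    allocation range to the highest address that is already mapped, so the
    allocator cannot hand back memory the kernel cannot touch yet.

    /* Illustrative sketch only; values mirror the log above. */
    #include <stdint.h>
    #include <stdio.h>

    static uint64_t max_mapped(void)
    {
            return 0x375fe000ULL;           /* e.g. max_low_pfn << PAGE_SHIFT */
    }

    static uint64_t clamp_limit(uint64_t limit)
    {
            uint64_t mapped = max_mapped();

            return limit < mapped ? limit : mapped;
    }

    int main(void)
    {
            /* Without the clamp, a limit of 0xffffffff lets the allocator
             * pick an address above the mapped range; the first write to it
             * then faults, as in the oops above. */
            printf("limit = %#llx\n",
                   (unsigned long long)clamp_limit(0xffffffffULL));
            return 0;
    }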

    Reported-by: Borislav Petkov
    Tested-by: Conny Seidel
    Signed-off-by: Yinghai Lu
    Cc: [2.6.34.x]
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: Johannes Weiner
    Cc: Lee Schermerhorn
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     
  • We need lock_page_nosync() here because we have no reference to the
    mapping when taking the page lock.

    Signed-off-by: Nick Piggin
    Reviewed-by: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

20 Jul, 2010

1 commit


19 Jul, 2010

3 commits

  • The current shrinker implementation requires the registered callback
    to have global state to work from. This makes it difficult to shrink
    caches that are not global (e.g. per-filesystem caches). Pass the shrinker
    structure to the callback so that users can embed the shrinker structure
    in the context the shrinker needs to operate on and get back to it in the
    callback via container_of().
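
    For illustration, a minimal userspace sketch of the container_of()
    pattern this enables; the cache structure and function names below are
    made up, not taken from any filesystem:

    /* Sketch: embed the shrinker in a per-filesystem cache and recover the
     * cache from the callback argument via container_of(). */
    #include <stddef.h>
    #include <stdio.h>

    #define container_of(ptr, type, member) \
            ((type *)((char *)(ptr) - offsetof(type, member)))

    struct shrinker {
            int (*shrink)(struct shrinker *s, int nr_to_scan);
    };

    struct fs_cache {                       /* hypothetical per-fs cache */
            int nr_objects;
            struct shrinker shrinker;       /* embedded, no global state */
    };

    static int fs_cache_shrink(struct shrinker *s, int nr_to_scan)
    {
            struct fs_cache *cache = container_of(s, struct fs_cache, shrinker);

            if (nr_to_scan > cache->nr_objects)
                    nr_to_scan = cache->nr_objects;
            cache->nr_objects -= nr_to_scan;
            return cache->nr_objects;       /* objects remaining */
    }

    int main(void)
    {
            struct fs_cache cache = {
                    .nr_objects = 100,
                    .shrinker   = { .shrink = fs_cache_shrink },
            };

            printf("remaining: %d\n",
                   cache.shrinker.shrink(&cache.shrinker, 30));
            return 0;
    }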

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     
  • With commits 08677214 and 59be5a8e, alloc_bootmem()/free_bootmem() and
    friends use the early_res functions for memory management when
    NO_BOOTMEM is enabled. This patch adds the kmemleak calls in the
    corresponding code paths for bootmem allocations.

    Signed-off-by: Catalin Marinas
    Acked-by: Pekka Enberg
    Acked-by: Yinghai Lu
    Cc: H. Peter Anvin
    Cc: stable@kernel.org

    Catalin Marinas
     
  • The pointer to the page_cgroup table allocated in
    init_section_page_cgroup() is stored in section->page_cgroup as (base -
    pfn). Since this value does not point to the beginning or inside the
    allocated memory block, kmemleak reports a false positive.

    This was reported in bugzilla.kernel.org as #16297.
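
    To see why the stored pointer looks leaked to a scanner, here is a small
    userspace sketch of the arithmetic (sizes and names are illustrative,
    not the kernel code):

    /* The stored value is base - pfn, which lies far outside the allocated
     * block, so a scanner looking for references into the block finds none. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct page_cgroup { unsigned long flags; };

    int main(void)
    {
            unsigned long start_pfn = 0x100000;     /* section's first pfn */
            size_t nr = 4096;

            struct page_cgroup *base = calloc(nr, sizeof(*base));
            if (!base)
                    return 1;

            /* Storing "base - start_pfn" makes a lookup just table + pfn,
             * but the stored value itself does not point into the block. */
            uintptr_t stored = (uintptr_t)base - start_pfn * sizeof(*base);

            printf("block : %p .. %p\n", (void *)base, (void *)(base + nr));
            printf("stored: %#lx (outside the block)\n", (unsigned long)stored);

            /* Recovering an entry still works: */
            struct page_cgroup *pc = (struct page_cgroup *)
                    (stored + (start_pfn + 42) * sizeof(*pc));
            printf("entry 42 recovered: %d\n", pc == &base[42]);

            free(base);
            return 0;
    }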

    Signed-off-by: Catalin Marinas
    Reported-by: Adrien Dessemond
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Pekka Enberg
    Cc: Andrew Morton

    Catalin Marinas
     

14 Jul, 2010

1 commit

  • Rename lmb to memblock via the following scripts:

    # Rewrite lmb -> memblock everywhere except oprofile and defconfig files.
    FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')

    sed -i \
        -e 's/lmb/memblock/g' \
        -e 's/LMB/MEMBLOCK/g' \
        $FILES

    # Rename the lmb.[ch] source files themselves.
    for N in $(find . -name 'lmb.[ch]'); do
        M=$(echo $N | sed 's/lmb/memblock/g')
        mv $N $M
    done

    Then fix up the incorrect replacements this makes in names such as
    lmbench and dlmb.

    Also move memblock.c from lib/ to mm/.

    Suggested-by: Ingo Molnar
    Acked-by: "H. Peter Anvin"
    Acked-by: Benjamin Herrenschmidt
    Acked-by: Linus Torvalds
    Signed-off-by: Yinghai Lu
    Signed-off-by: Benjamin Herrenschmidt

    Yinghai Lu
     

08 Jul, 2010

1 commit


06 Jul, 2010

2 commits

  • First, remove items from the work_list as soon as we start working on
    them. This means we don't have to track any pending or visited state and
    can get rid of all the RCU magic for freeing the work items - we can
    simply free them once the operation has finished. Second, use a real
    completion for tracking synchronous requests - if the caller sets the
    completion pointer we complete it, otherwise we use it as a boolean
    indicator that we can free the work item directly. Third, unify struct
    wb_writeback_args and struct bdi_work into a single data structure,
    struct wb_writeback_work. Previously we set all parameters in a struct
    wb_writeback_args, copied it into struct bdi_work, and then copied that
    again onto the stack to use it there. Instead, just allocate one
    structure dynamically or on the stack and use it all the way through
    the stack.
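
    A rough userspace sketch of the resulting pattern (structure and names
    are stand-ins, not the exact kernel types): one work structure carries
    all parameters, and an optional completion pointer distinguishes
    synchronous callers, whose item can live on the stack, from
    asynchronous ones, whose item the worker frees.

    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct completion { bool done; };        /* stand-in for the kernel type */

    struct wb_work {
            long nr_pages;                   /* ... the writeback parameters */
            struct completion *done;         /* set by synchronous callers   */
    };

    static void process_work(struct wb_work *work)
    {
            printf("writing back %ld pages\n", work->nr_pages);

            if (work->done)
                    work->done->done = true; /* caller owns the memory */
            else
                    free(work);              /* nobody waits: free it here */
    }

    int main(void)
    {
            /* Asynchronous caller: allocate dynamically, no completion. */
            struct wb_work *async = malloc(sizeof(*async));
            if (!async)
                    return 1;
            async->nr_pages = 1024;
            async->done = NULL;
            process_work(async);

            /* Synchronous caller: the work item can live on the stack. */
            struct completion done = { false };
            struct wb_work sync = { .nr_pages = 16, .done = &done };

            process_work(&sync);
            printf("sync completed: %d\n", done.done);
            return 0;
    }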

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • This was just an odd wrapper around writeback_inodes_wb. Removing it
    also allows us to get rid of the bdi member of struct writeback_control,
    which was rather out of place there.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

30 Jun, 2010

3 commits

  • My patch "Factor out duplicate put/frees in mpol_shared_policy_init()
    to a common return path" and Dan Carpenter's fix thereto both left a
    dangling reference to the incoming tmpfs superblock mempolicy structure.
    A similar leak was introduced earlier when the nodemask was moved
    offstack to the scratch area, despite the note in the comment block
    regarding the incoming ref.

    Move the remaining put of the incoming "mpol" to the common exit path
    to drop the reference.
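
    For illustration, a tiny sketch of the common-exit-path shape the fix
    restores (hypothetical names, not the actual mempolicy code): every
    return path funnels through one label that drops the incoming reference
    exactly once.

    #include <stdio.h>
    #include <stdlib.h>

    struct mpol { int refcnt; };

    static void mpol_put(struct mpol *p)
    {
            if (p && --p->refcnt == 0)
                    printf("mpol freed\n");
    }

    static int shared_policy_init(struct mpol *incoming, int fail)
    {
            int ret = 0;
            void *scratch = malloc(64);      /* offstack nodemask stand-in */

            if (!scratch) {
                    ret = -1;
                    goto put_mpol;           /* early failure drops the ref too */
            }
            if (fail)
                    ret = -1;

            free(scratch);
    put_mpol:
            mpol_put(incoming);              /* the one place the ref is dropped */
            return ret;
    }

    int main(void)
    {
            struct mpol m = { .refcnt = 1 };

            return shared_policy_init(&m, 0) ? 1 : 0;
    }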

    Signed-off-by: Lee Schermerhorn
    Acked-by: Dan Carpenter
    Cc: KOSAKI Motohiro
    Cc: David Rientjes
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • The OOM waitqueue should be woken up when oom_disable is canceled. This
    is a fix for 3c11ecf448eff8f1 ("memcg: oom kill disable and oom status").

    How to test:
    Create a cgroup A...
    1. set memory.limit and memory.memsw.limit to be small value
    2. echo 1 > /cgroup/A/memory.oom_control, this disables oom-kill.
    3. run a program which must cause OOM.

    The program run in step 3 will sleep on the OOM waitqueue in memcg. The
    problem then is how to wake it up:

    1. echo 0 > /cgroup/A/memory.oom_control (enable OOM-killer)
    2. echo big mem > /cgroup/A/memory.memsw.limit_in_bytes(allow more swap)

    etc..

    Without the patch, the sleeping task cannot be woken up.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block:
    block: Don't count_vm_events for discard bio in submit_bio.
    cfq: fix recursive call in cfq_blkiocg_update_completion_stats()
    cfq-iosched: Fixed boot warning with BLK_CGROUP=y and CFQ_GROUP_IOSCHED=n
    cfq: Don't allow queue merges for queues that have no process references
    block: fix DISCARD_BARRIER requests
    cciss: set SCSI max cmd len to 16, as default is wrong
    cpqarray: fix two more wrong section type
    cpqarray: fix wrong __init type on pci probe function
    drbd: Fixed a race between disk-attach and unexpected state changes
    writeback: fix pin_sb_for_writeback
    writeback: add missing requeue_io in writeback_inodes_wb
    writeback: simplify and split bdi_start_writeback
    writeback: simplify wakeup_flusher_threads
    writeback: fix writeback_inodes_wb from writeback_inodes_sb
    writeback: enforce s_umount locking in writeback_inodes_sb
    writeback: queue work on stack in writeback_inodes_sb
    writeback: fix writeback completion notifications

    Linus Torvalds
     

18 Jun, 2010

1 commit

  • per_cpu_ptr_to_phys() determines whether the passed in @addr belongs
    to the first_chunk or not by just matching the address against the
    address range of the base unit (unit0, used by cpu0). When an address
    from another cpu is passed in, it always determines that the address
    doesn't belong to the first chunk even when it does. This makes the
    function return a bogus physical address which may lead to a crash.

    This problem was discovered by Cliff Wickman while investigating a
    crash during kdump on a SGI UV system.
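
    A toy sketch of the check being fixed (layout and names are made up, not
    the real percpu allocator): the passed-in address must be tested against
    every cpu's unit range in the first chunk, not only cpu0's.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NR_CPUS   4
    #define UNIT_SIZE 0x1000u

    static uintptr_t first_chunk_base = 0x100000;
    static unsigned  unit_off[NR_CPUS] = { 0x0000, 0x1000, 0x2000, 0x3000 };

    /* Buggy check: only cpu0's unit is considered. */
    static bool in_first_chunk_buggy(uintptr_t addr)
    {
            return addr >= first_chunk_base &&
                   addr <  first_chunk_base + UNIT_SIZE;
    }

    /* Fixed check: walk every unit of the first chunk. */
    static bool in_first_chunk_fixed(uintptr_t addr)
    {
            for (int cpu = 0; cpu < NR_CPUS; cpu++) {
                    uintptr_t start = first_chunk_base + unit_off[cpu];

                    if (addr >= start && addr < start + UNIT_SIZE)
                            return true;
            }
            return false;
    }

    int main(void)
    {
            uintptr_t addr = first_chunk_base + unit_off[2] + 0x10; /* cpu2 */

            printf("buggy: %d, fixed: %d\n",
                   in_first_chunk_buggy(addr), in_first_chunk_fixed(addr));
            return 0;
    }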

    Signed-off-by: Tejun Heo
    Reported-by: Cliff Wickman
    Tested-by: Cliff Wickman
    Cc: stable@kernel.org

    Tejun Heo
     

17 Jun, 2010

1 commit


11 Jun, 2010

1 commit


09 Jun, 2010

2 commits

  • sync can currently take a really long time if a concurrent writer is
    extending a file. The problem is that the dirty pages on the address
    space grow in the same direction as write_cache_pages scans, so if
    the writer keeps ahead of writeback, the writeback will not
    terminate until the writer stops adding dirty pages.

    For a data integrity sync, we only need to write the pages dirty at
    the time we start the writeback, so we can stop scanning once we get
    to the page that was at the end of the file at the time the scan
    started.

    This prevents an operation like copying a large file from keeping sync
    from completing, as sync will not write back pages that were dirtied
    after it was started. This does not impact the existing integrity
    guarantees, as any dirty page (old or new) within the EOF range at the
    start of the scan will still be captured.

    This patch will not prevent sync from blocking on large writes into
    holes. That requires more complex intervention while this patch only
    addresses the common append-case of this sync holdoff.
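
    A toy model of the change in scan termination (purely illustrative, not
    the write_cache_pages() code): the data-integrity pass pins the end
    index when it starts, while an unpinned scan keeps chasing a writer
    that appends faster than we write back.

    #include <stdio.h>

    static long file_pages;                  /* grows while we "write back" */

    static void writeback_scan(int pin_end)
    {
            long index = 0;
            long end = file_pages - 1;       /* captured at scan start */
            long written = 0;

            while (index <= (pin_end ? end : file_pages - 1)) {
                    written++;
                    index++;
                    file_pages += 2;         /* concurrent appender outruns us */
                    if (written > 1000)      /* safety exit for the unpinned case */
                            break;
            }
            printf("%s scan wrote %ld pages\n",
                   pin_end ? "pinned" : "unpinned", written);
    }

    int main(void)
    {
            file_pages = 100;
            writeback_scan(1);               /* stops after the original 100 pages */

            file_pages = 100;
            writeback_scan(0);               /* would chase the writer forever */
            return 0;
    }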

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Dave Chinner
     
  • If a filesystem writes more than one page in ->writepage, write_cache_pages
    fails to notice this and continues to attempt writeback when wbc->nr_to_write
    has gone negative - this trace was captured from XFS:

    wbc_writeback_start: towrt=1024
    wbc_writepage: towrt=1024
    wbc_writepage: towrt=0
    wbc_writepage: towrt=-1
    wbc_writepage: towrt=-5
    wbc_writepage: towrt=-21
    wbc_writepage: towrt=-85

    This has adverse effects on filesystem writeback behaviour. write_cache_pages()
    needs to terminate after a certain number of pages are written, not after a
    certain number of calls to ->writepage are made. This is a regression
    introduced by 17bc6c30cf6bfffd816bdc53682dd46fc34a2cf4 ("vfs: Add
    no_nrwrite_index_update writeback control flag"), but cannot be reverted
    directly due to subsequent bug fixes that have gone in on top of it.
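
    A minimal sketch of the termination rule being restored (illustrative
    only): the loop stops once the page budget is used up, even though a
    single ->writepage call may consume several pages of it, as in the XFS
    trace above.

    #include <stdio.h>

    /* A filesystem that clusters writes: each call writes several pages
     * and decrements the budget accordingly. */
    static void writepage(long *nr_to_write)
    {
            *nr_to_write -= 4;
    }

    static long write_cache_pages(long nr_to_write)
    {
            long calls = 0;

            while (nr_to_write > 0) {        /* stop on pages written,      */
                    writepage(&nr_to_write); /* not on number of calls made */
                    calls++;
            }
            return calls;
    }

    int main(void)
    {
            printf("->writepage called %ld times\n", write_cache_pages(1024));
            return 0;
    }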

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Dave Chinner
     

05 Jun, 2010

4 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    Minix: Clean up left over label
    fix truncate inode time modification breakage
    fix setattr error handling in sysfs, configfs
    fcntl: return -EFAULT if copy_to_user fails
    wrong type for 'magic' argument in simple_fill_super()
    fix the deadlock in qib_fs
    mqueue doesn't need make_bad_inode()

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block: (27 commits)
    block: make blk_init_free_list and elevator_init idempotent
    block: avoid unconditionally freeing previously allocated request_queue
    pipe: change /proc/sys/fs/pipe-max-pages to byte sized interface
    pipe: change the privilege required for growing a pipe beyond system max
    pipe: adjust minimum pipe size to 1 page
    block: disable preemption before using sched_clock()
    cciss: call BUG() earlier
    Preparing 8.3.8rc2
    drbd: Reduce verbosity
    drbd: use drbd specific ratelimit instead of global printk_ratelimit
    drbd: fix hang on local read errors while disconnected
    drbd: Removed the now empty w_io_error() function
    drbd: removed duplicated #includes
    drbd: improve usage of MSG_MORE
    drbd: need to set socket bufsize early to take effect
    drbd: improve network latency, TCP_QUICKACK
    drbd: Revert "drbd: Create new current UUID as late as possible"
    brd: support discard
    Revert "writeback: fix WB_SYNC_NONE writeback from umount"
    Revert "writeback: ensure that WB_SYNC_NONE writeback with sb pinned is sync"
    ...

    Linus Torvalds
     
  • Greg Thelen reported that Johannes's recent stack diet patch makes the
    kernel hang. His test is the following:

    mount -t cgroup none /cgroups -o memory
    mkdir /cgroups/cg1
    echo $$ > /cgroups/cg1/tasks
    dd bs=1024 count=1024 if=/dev/null of=/data/foo
    echo $$ > /cgroups/tasks
    echo 1 > /cgroups/cg1/memory.force_empty

    Actually, this OOM retry logic has been broken since the following
    two-year-old patch:

    commit a41f24ea9fd6169b147c53c2392e2887cc1d9247
    Author: Nishanth Aravamudan
    Date: Tue Apr 29 00:58:25 2008 -0700

    page allocator: smarter retry of costly-order allocations

    The original intention was "return success if the system has shrinkable
    zones even though the priority==0 reclaim failed". But the above patch
    changed this to "return nr_reclaimed if .....", forgetting that
    nr_reclaimed may be 0 when the priority==0 reclaim fails.

    Johannes's patch 0aeb2339e54e ("vmscan: remove all_unreclaimable scan
    control") made it worse. Originally, a priority==0 reclaim failure on
    memcg returned 0, but that patch changed it to return 1, which totally
    confused memcg.

    This patch fixes it completely.

    Reported-by: Greg Thelen
    Signed-off-by: KOSAKI Motohiro
    Acked-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Tested-by: Greg Thelen
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • mtime and ctime should be changed only if the file size has actually
    changed. The patches changing ext2 and tmpfs from vmtruncate to the new
    truncate sequence caused regressions where they always update
    timestamps.

    There are some strange cases in POSIX where truncate(2) must not update
    times unless the size has actually changed; see 6e656be89.

    This area is all still rather buggy in different ways in a lot of
    filesystems and needs a cleanup and audit (ideally the vfs will provide
    a simple attribute or call to direct all filesystems exactly which
    attributes to change). But coming up with the best solution will take a
    while and is not appropriate for rc anyway.

    So fix recent regression for now.

    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    Nick Piggin
     

01 Jun, 2010

2 commits


31 May, 2010

2 commits


28 May, 2010

14 commits

  • Cc: Christoph Hellwig
    Acked-by: Hugh Dickins
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    npiggin@suse.de
     
  • Introduce a new truncate calling sequence into fs/mm subsystems. Rather than
    setattr > vmtruncate > truncate, have filesystems call their truncate sequence
    from ->setattr if filesystem specific operations are required. vmtruncate is
    deprecated, and truncate_pagecache and inode_newsize_ok helpers introduced
    previously should be used.

    simple_setattr is introduced for simple in-ram filesystems to implement
    the new truncate sequence. Eventually all filesystems should be converted
    to implement a setattr, and the default code in notify_change should go
    away.

    simple_setsize is also introduced to perform just the ATTR_SIZE portion
    of simple_setattr (ie. changing i_size and trimming pagecache).

    To implement the new truncate sequence:
    - filesystem specific manipulations (eg freeing blocks) must be done in
    the setattr method rather than ->truncate.
    - vmtruncate can not be used by core code to trim blocks past i_size in
    the event of write failure after allocation, so this must be performed
    in the fs code.
    - convert usage of helpers block_write_begin, nobh_write_begin,
    cont_write_begin, and *blockdev_direct_IO* to use _newtrunc postfixed
    variants. These avoid calling vmtruncate to trim blocks (see previous).
    - inode_setattr should not be used. generic_setattr is a new function
    to be used to copy simple attributes into the generic inode.
    - make use of the better opportunity to handle errors with the new sequence.

    Big problem with the previous calling sequence: the filesystem is not called
    until i_size has already changed. This means it is not allowed to fail the
    call, and also it does not know what the previous i_size was. Also, generic
    code calling vmtruncate to truncate allocated blocks in case of error had
    no good way to return a meaningful error (or, for example, atomically handle
    block deallocation).
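
    To make the ordering concrete, a heavily simplified userspace sketch of
    a ->setattr following the new sequence; every type and helper below is
    a stand-in (the real helpers are inode_newsize_ok(), truncate_pagecache()
    and simple_setsize()), and real filesystems differ in the details:

    #include <stdio.h>

    struct inode { long i_size; long i_blocks; };
    struct iattr { int size_valid; long ia_size; };

    static int newsize_ok_stub(const struct inode *inode, long newsize)
    {
            return newsize < 0 ? -1 : 0;     /* e.g. s_maxbytes/rlimit checks */
    }

    static void setsize_stub(struct inode *inode, long newsize)
    {
            long oldsize = inode->i_size;

            inode->i_size = newsize;
            printf("trimming pagecache %ld..%ld\n", newsize, oldsize);
    }

    static int example_setattr(struct inode *inode, const struct iattr *attr)
    {
            if (attr->size_valid) {
                    int error = newsize_ok_stub(inode, attr->ia_size);

                    if (error)
                            return error;    /* we can fail before i_size moves */

                    setsize_stub(inode, attr->ia_size);
                    inode->i_blocks = attr->ia_size / 512;  /* fs-specific work */
            }
            /* copy the remaining simple attributes (mode, uid, times, ...) */
            return 0;
    }

    int main(void)
    {
            struct inode ino = { .i_size = 8192, .i_blocks = 16 };
            struct iattr attr = { .size_valid = 1, .ia_size = 4096 };

            return example_setattr(&ino, &attr);
    }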

    Cc: Christoph Hellwig
    Acked-by: Jan Kara
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    npiggin@suse.de
     
  • We don't name our generic fsync implementations very well currently.
    The no-op implementation for in-memory filesystems is called
    simple_sync_file, which doesn't make much sense to start with, and the
    generic one for simple filesystems is called simple_fsync, which can
    lead to some confusion.

    This patch renames the generic file fsync method to generic_file_fsync
    to match the other generic_file_* routines it is supposed to be used
    with, and the no-op implementation to noop_fsync to make it obvious
    what to expect. In addition add some documentation for both methods.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (27 commits)
    Btrfs: add more error checking to btrfs_dirty_inode
    Btrfs: allow unaligned DIO
    Btrfs: drop verbose enospc printk
    Btrfs: Fix block generation verification race
    Btrfs: fix preallocation and nodatacow checks in O_DIRECT
    Btrfs: avoid ENOSPC errors in btrfs_dirty_inode
    Btrfs: move O_DIRECT space reservation to btrfs_direct_IO
    Btrfs: rework O_DIRECT enospc handling
    Btrfs: use async helpers for DIO write checksumming
    Btrfs: don't walk around with task->state != TASK_RUNNING
    Btrfs: do aio_write instead of write
    Btrfs: add basic DIO read/write support
    direct-io: do not merge logically non-contiguous requests
    direct-io: add a hook for the fs to provide its own submit_bio function
    fs: allow short direct-io reads to be completed via buffered IO
    Btrfs: Metadata ENOSPC handling for balance
    Btrfs: Pre-allocate space for data relocation
    Btrfs: Metadata ENOSPC handling for tree log
    Btrfs: Metadata reservation for orphan inodes
    Btrfs: Introduce global metadata reservation
    ...

    Linus Torvalds
     
  • Example usage of generic "numa_mem_id()":

    The mainline slab code, since ~2.6.19, does not handle memoryless nodes
    well. Specifically, the "fast path"--____cache_alloc()--will never
    succeed, as slab doesn't cache off-node objects on the per-cpu queues,
    and for memoryless nodes all memory will be "off node" relative to
    numa_node_id(). This adds significant overhead to all kmem cache
    allocations, incurring a significant regression relative to earlier
    kernels [from before slab.c was reorganized].

    This patch uses the generic topology function "numa_mem_id()" to return
    the "effective local memory node" for the calling context. This is the
    first node in the local node's generic fallback zonelist-- the same node
    that "local" mempolicy-based allocations would use. This lets slab cache
    these "local" allocations and avoid fallback/refill on every allocation.

    N.B.: Slab will need to handle node and memory hotplug events that could
    change the value returned by numa_mem_id() for any given node if recent
    changes to address memory hotplug don't already address this. E.g., flush
    all per cpu slab queues before rebuilding the zonelists while the
    "machine" is held in the stopped state.

    Performance impact on "hackbench 400 process 200"

    2.6.34-rc3-mmotm-100405-1609 no-patch this-patch
    ia64 no memoryless nodes [avg of 10]: 11.713 11.637 ~0.65 diff
    ia64 cpus all on memless nodes [10]: 228.259 26.484 ~8.6x speedup

    The slowdown of the patched kernel from ~12 sec to ~28 seconds when
    configured with memoryless nodes is the result of all cpus allocating from
    a single node's mm pagepool. The cache lines of the single node are
    distributed/interleaved over the memory of the real physical nodes, but
    the zone lock, list heads, ... of the single node with memory still each
    live in a single cache line that is accessed from all processors.

    x86_64 [8x6 AMD] [avg of 40]: 2.883 2.845

    Signed-off-by: Lee Schermerhorn
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Nick Piggin
    Cc: David Rientjes
    Cc: Eric Whitney
    Cc: KAMEZAWA Hiroyuki
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: "Luck, Tony"
    Cc: Pekka Enberg
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Introduce numa_mem_id(), based on generic percpu variable infrastructure
    to track "nearest node with memory" for archs that support memoryless
    nodes.

    Define the API when CONFIG_HAVE_MEMORYLESS_NODES is defined; otherwise
    provide stubs. Architectures will define HAVE_MEMORYLESS_NODES if/when
    they support them.

    Archs can override definitions of:

    numa_mem_id() - returns node number of "local memory" node
    set_numa_mem() - initialize [this cpu's] per cpu variable 'numa_mem'
    cpu_to_mem() - return numa_mem for specified cpu; may be used as lvalue

    Generic initialization of 'numa_mem' occurs in __build_all_zonelists().
    This will initialize the boot cpu at boot time, and all cpus on change of
    numa_zonelist_order, or when node or memory hot-plug requires zonelist
    rebuild. Archs that support memoryless nodes will need to initialize
    'numa_mem' for secondary cpus as they're brought on-line.
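
    A small stand-in sketch of that bookkeeping (not the real header or
    functions): each cpu records the nearest node that actually has memory,
    set for the boot cpu when zonelists are built and for secondary cpus as
    they come online.

    #include <stdio.h>

    #define NR_CPUS 4

    static int cpu_numa_mem[NR_CPUS];        /* the per-cpu 'numa_mem' */
    static int this_cpu;                     /* pretend we run on cpu0 */

    static void set_numa_mem(int cpu, int node) { cpu_numa_mem[cpu] = node; }
    static int  cpu_to_mem(int cpu)             { return cpu_numa_mem[cpu]; }
    static int  numa_mem_id(void)               { return cpu_to_mem(this_cpu); }

    /* Stand-in for "first node with memory in this cpu's fallback order". */
    static int local_memory_node(int cpu)
    {
            return (cpu < 2) ? 1 : 3;        /* nodes 0 and 2 are memoryless */
    }

    int main(void)
    {
            for (int cpu = 0; cpu < NR_CPUS; cpu++)
                    set_numa_mem(cpu, local_memory_node(cpu));

            /* an allocator now asks for the effective local memory node */
            printf("cpu0 allocates from node %d\n", numa_mem_id());
            return 0;
    }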

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Christoph Lameter
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Nick Piggin
    Cc: David Rientjes
    Cc: Eric Whitney
    Cc: KAMEZAWA Hiroyuki
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: "Luck, Tony"
    Cc: Pekka Enberg
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Rework the generic version of the numa_node_id() function to use the new
    generic percpu variable infrastructure.

    Guard the new implementation with a new config option:

    CONFIG_USE_PERCPU_NUMA_NODE_ID.

    Archs which support this new implementation will default this option to 'y'
    when NUMA is configured. This config option could be removed if/when all
    archs switch over to the generic percpu implementation of numa_node_id().
    Arch support involves:

    1) converting any existing per cpu variable implementations to use
    this implementation. x86_64 is an instance of such an arch.
    2) archs that don't use a per cpu variable for numa_node_id() will
    need to initialize the new per cpu variable "numa_node" as cpus
    are brought on-line. ia64 is an example.
    3) Defining USE_PERCPU_NUMA_NODE_ID in arch dependent Kconfig--e.g.,
    when NUMA is configured. This is required because I have
    retained the old implementation by default to allow archs to
    be modified incrementally, as desired.

    Subsequent patches will convert x86_64 and ia64 to use this implementation.

    Signed-off-by: Lee Schermerhorn
    Cc: Tejun Heo
    Cc: Mel Gorman
    Reviewed-by: Christoph Lameter
    Cc: Nick Piggin
    Cc: David Rientjes
    Cc: Eric Whitney
    Cc: KAMEZAWA Hiroyuki
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: "Luck, Tony"
    Cc: Pekka Enberg
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • With the previous modification, a cpu notifier can return an
    encapsulated errno value. This converts the cpu notifiers for slab
    accordingly.
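
    For illustration, a stand-in for the encoding this relies on (the kernel
    provides notifier_from_errno()/notifier_to_errno(); the constants below
    only mimic the idea): the callback wraps a real errno instead of
    returning a bare failure code, and the caller can unwrap it.

    #include <stdio.h>

    #define NOTIFY_OK        0x0001
    #define NOTIFY_STOP_MASK 0x8000

    static int notifier_from_errno(int err)
    {
            return err ? (NOTIFY_STOP_MASK | (NOTIFY_OK - err)) : NOTIFY_OK;
    }

    static int notifier_to_errno(int ret)
    {
            return (ret & NOTIFY_STOP_MASK) ?
                    NOTIFY_OK - (ret & ~NOTIFY_STOP_MASK) : 0;
    }

    /* A CPU_UP_PREPARE-style handler propagating -ENOMEM (-12). */
    static int cpuup_prepare(void)
    {
            return -12;
    }

    static int slab_cpu_callback(void)
    {
            return notifier_from_errno(cpuup_prepare());
    }

    int main(void)
    {
            printf("errno recovered: %d\n",
                   notifier_to_errno(slab_cpu_callback()));
            return 0;
    }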

    Signed-off-by: Akinobu Mita
    Cc: Christoph Lameter
    Acked-by: Pekka Enberg
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • We have observed several workloads running on multi-node systems where
    memory is assigned unevenly across the nodes in the system. There are
    numerous reasons for this but one is the round-robin rotor in
    cpuset_mem_spread_node().

    For example, a simple test that writes a multi-page file will allocate
    pages on nodes 0 2 4 6 ... Odd nodes are skipped. (Sometimes it
    allocates on odd nodes & skips even nodes).

    An example is shown below. The program "lfile" writes a file consisting
    of 10 pages. The program then mmaps the file & uses get_mempolicy(...,
    MPOL_F_NODE) to determine the nodes where the file pages were allocated.
    The output is shown below:

    # ./lfile
    allocated on nodes: 2 4 6 0 1 2 6 0 2

    There is a single rotor that is used for allocating both file pages & slab
    pages. Writing the file allocates both a data page & a slab page
    (buffer_head). This advances the RR rotor 2 nodes for each page
    allocated.

    A quick test seems to confirm this is the cause of the uneven
    allocation:

    # echo 0 >/dev/cpuset/memory_spread_slab
    # ./lfile
    allocated on nodes: 6 7 8 9 0 1 2 3 4 5

    This patch introduces a second rotor that is used for slab allocations.
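
    A tiny sketch of the effect (illustrative, not the cpuset code): with a
    single shared rotor each written page advances it twice, so file pages
    land only on every other node, while a second rotor dedicated to slab
    keeps the file-page placement even.

    #include <stdio.h>

    #define NR_NODES 8

    static int next_node(int *rotor)
    {
            *rotor = (*rotor + 1) % NR_NODES;
            return *rotor;
    }

    int main(void)
    {
            int shared = -1, page_rotor = -1, slab_rotor = -1;

            printf("shared rotor, file pages on:");
            for (int i = 0; i < 5; i++) {
                    printf(" %d", next_node(&shared));   /* data page */
                    next_node(&shared);                  /* buffer_head (slab) */
            }

            printf("\nsplit rotors, file pages on:");
            for (int i = 0; i < 5; i++) {
                    printf(" %d", next_node(&page_rotor));
                    next_node(&slab_rotor);
            }
            printf("\n");
            return 0;
    }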

    Signed-off-by: Jack Steiner
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Paul Menage
    Cc: Jack Steiner
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jack Steiner
     
  • Introduce struct mem_cgroup_thresholds. It helps reduce the number of
    checks of the threshold type (memory or mem+swap).
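
    A hedged sketch of what such a grouping looks like (the field names
    below are for illustration only): the code picks the memory or mem+swap
    instance once, and everything downstream works on the chosen struct.

    #include <stdbool.h>
    #include <stdio.h>

    struct threshold_ary { int size; };

    struct thresholds {
            struct threshold_ary *primary;   /* active array */
            struct threshold_ary *spare;     /* kept to avoid reallocation */
    };

    struct memcg {
            struct thresholds thresholds;        /* memory usage      */
            struct thresholds memsw_thresholds;  /* memory+swap usage */
    };

    static void register_event(struct memcg *memcg, bool swap)
    {
            /* check the type once ... */
            struct thresholds *t = swap ? &memcg->memsw_thresholds
                                        : &memcg->thresholds;

            /* ... and the rest no longer cares which type it is */
            printf("thresholds so far: %d\n", t->primary ? t->primary->size : 0);
    }

    int main(void)
    {
            struct memcg m = { { 0 }, { 0 } };

            register_event(&m, false);
            register_event(&m, true);
            return 0;
    }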

    [akpm@linux-foundation.org: repair comment]
    Signed-off-by: Kirill A. Shutemov
    Cc: Phil Carmody
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Paul Menage
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Since we are unable to handle an error returned by
    cftype.unregister_event() properly, let's make the callback
    void-returning.

    mem_cgroup_unregister_event() has been rewritten to be a "never fail"
    function. In mem_cgroup_usage_register_event() we save the old buffer
    for the thresholds array and reuse it in
    mem_cgroup_usage_unregister_event() to avoid allocation.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Phil Carmody
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Paul Menage
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • FILE_MAPPED per memcg of migrated file cache is not properly updated,
    because our hook in page_add_file_rmap() can't know to which memcg
    FILE_MAPPED should be counted.

    Basically, this patch is for fixing the bug but includes some big changes
    to fix up other messes.

    Now, when migrating a mapped file, events happen in the following sequence.

    1. allocate a new page.
    2. get memcg of an old page.
    3. charge against a new page before migration. But at this point,
    no changes to the new page's page_cgroup, no commit for the charge.
    (IOW, the PCG_USED bit is not set.)
    4. page migration replaces radix-tree, old-page and new-page.
    5. page migration remaps the new page if the old page was mapped.
    6. Here, the new page is unlocked.
    7. memcg commits the charge for the new page, marking the new page's
    page_cgroup as PCG_USED.

    Because the "commit" happens after page-remap, we cannot count
    FILE_MAPPED at step 5, because we should avoid trusting
    page_cgroup->mem_cgroup while the PCG_USED bit is unset.
    (Note: memcg's LRU removal code does trust it, but the LRU-isolation
    logic helps there: when we overwrite page_cgroup->mem_cgroup, the
    page_cgroup is not on the LRU or page_cgroup->mem_cgroup is NULL.)

    We can lose the file_mapped accounting information at step 5 because
    FILE_MAPPED is updated only when mapcount changes 0->1, so we should
    catch it there.

    BTW, historically, the above implementation comes from handling
    migration failure of anonymous pages. Because we charge both the old
    page and the new page with mapcount=0, we can't catch
    - the page really being freed before remap.
    - migration failing but the page being freed before remap
    or ..... corner cases.

    New migration sequence with memcg is:

    1. allocate a new page.
    2. mark PageCgroupMigration to the old page.
    3. charge against a new page onto the old page's memcg. (here, new page's pc
    is marked as PageCgroupUsed.)
    4. page migration replaces radix-tree, page table, etc...
    5. At remapping, the new page's page_cgroup is now marked as "USED".
    We can catch the 0->1 event and FILE_MAPPED will be properly updated.

    And we can catch SWAPOUT events after this page is unlocked, and the
    page being freed by unmap() can also be caught.

    7. Clear PageCgroupMigration of the old page.

    So, FILE_MAPPED will be correctly updated.

    Then, what is the MIGRATION flag for?
    Without it, on migration failure we may have to charge the old page
    again because it may be fully unmapped. "Charge" means that we have to
    dive into memory reclaim or something similarly complicated, so it's
    better to avoid charging it again. Before this patch, __commit_charge()
    worked on both the old and the new page and fixed everything up, but
    this technique had some race conditions around FILE_MAPPED, SWAPOUT,
    etc.
    Now, the kernel uses the MIGRATION flag and doesn't uncharge the old
    page until the end of migration.

    I hope this change makes memcg's page migration much simpler. This
    page migration has caused several troubles, so it is worth adding a
    flag for simplification.

    Reviewed-by: Daisuke Nishimura
    Tested-by: Daisuke Nishimura
    Reported-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    akpm@linux-foundation.org
     
  • Only an out of memory error will cause ret to be set.

    Signed-off-by: Phil Carmody
    Acked-by: Kirill A. Shutemov
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Phil Carmody
     
  • The bottom 4 hunks are atomically changing memory to which there are no
    aliases as it's freshly allocated, so there's no need to use atomic
    operations.

    The other hunks are just atomic_read and atomic_set, and do not involve
    any read-modify-write. The use of atomic_{read,set} doesn't prevent a
    read/write or write/write race, so if a race were possible (I'm not saying
    one is), then it would still be there even with atomic_set.

    See:
    http://digitalvampire.org/blog/index.php/2007/05/13/atomic-cargo-cults/
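
    A tiny illustration of that argument (plain C, simulating one possible
    interleaving): swapping atomic_read()/atomic_set() into a
    read-modify-write does not make the combination atomic, so the lost
    update is still possible.

    #include <stdio.h>

    static int counter;                     /* the shared value */

    int main(void)
    {
            /* Two contexts both try to increment:
             *     v = atomic_read(&counter);  ...  atomic_set(&counter, v + 1);
             * If both reads happen before either write, one update is lost,
             * exactly as with plain loads and stores. */
            int a = counter;                /* context A: atomic_read */
            int b = counter;                /* context B: atomic_read */

            counter = a + 1;                /* context A: atomic_set */
            counter = b + 1;                /* context B: atomic_set */

            printf("two increments, counter = %d\n", counter);  /* 1, not 2 */
            return 0;
    }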

    Signed-off-by: Phil Carmody
    Acked-by: Kirill A. Shutemov
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Phil Carmody