23 Mar, 2011

40 commits

  • Currently, if the task is STOPPED on ptrace attach, it's left alone
    and the state is silently changed to TRACED on the next ptrace call.
    The behavior breaks the assumption that arch_ptrace_stop() is called
    before any task is poked by ptrace and is ugly in that a task
    manipulates the state of another task directly.

    With GROUP_STOP_PENDING, the transitions between TASK_STOPPED and
    TRACED can be made clean. The tracer can use the flag to tell the
    tracee to retry stop on attach and detach. On retry, the tracee will
    enter the desired state in the correct way. The lower 16 bits of
    task->group_stop are used to remember the signal number which caused
    the last group stop. This is used while retrying for ptrace attach,
    as the original group_exit_code could have been consumed with
    wait(2) by then.

    As the real parent may wait(2) and consume the group_exit_code
    anytime, the group_exit_code needs to be saved separately so that it
    can be used when switching from regular sleep to ptrace_stop(). This
    is recorded in the lower 16 bits of task->group_stop.

    If a task is already stopped and there's no intervening SIGCONT, a
    ptrace request immediately following a successful PTRACE_ATTACH should
    always succeed even if the tracer doesn't wait(2) for attach
    completion; however, with this change, the tracee might still be
    TASK_RUNNING trying to enter TASK_TRACED, which would cause the
    following request to fail with -ESRCH.

    This intermediate state is hidden from the ptracer by setting
    GROUP_STOP_TRAPPING on attach and making ptrace_check_attach() wait
    for it to clear on its signal->wait_chldexit. Completing the
    transition or getting killed clears TRAPPING and wakes up the tracer.

    Note that the STOPPED -> RUNNING -> TRACED transition is still visible
    to other threads which are in the same group as the ptracer and the
    reverse transition is visible to all. Please read the comments for
    details.
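
    To illustrate the guarantee, a userland sketch (illustrative only;
    PTRACE_PEEKUSER stands in for any request, and the tracee is assumed
    to already be group-stopped with no intervening SIGCONT):

        #include <sys/types.h>
        #include <sys/ptrace.h>

        static long attach_and_request(pid_t pid)
        {
                ptrace(PTRACE_ATTACH, pid, NULL, NULL);
                /* No wait(2) here. During the STOPPED -> RUNNING ->
                 * TRACED retry this could previously fail with -ESRCH;
                 * with GROUP_STOP_TRAPPING, ptrace_check_attach() now
                 * waits for the transition to finish. */
                return ptrace(PTRACE_PEEKUSER, pid, NULL, NULL);
        }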

    Oleg:

    * Spotted a race condition where a task may retry group stop without
    proper bookkeeping. Fixed by redoing bookkeeping on retry.

    * Spotted that the transition is visible to userland in several
    different ways. Most are fixed with GROUP_STOP_TRAPPING. Unhandled
    corner case is documented.

    * Pointed out that not setting GROUP_STOP_SIGMASK on an already
    stopped task would result in more consistent behavior.

    * Pointed out that calling ptrace_stop() from do_signal_stop() in
    TASK_STOPPED can race with group stop start logic and then confuse
    the TRAPPING wait in ptrace_check_attach(). ptrace_stop() is now
    called with TASK_RUNNING.

    * Suggested using signal->wait_chldexit instead of bit wait.

    * Spotted a race condition between TRACED transition and clearing of
    TRAPPING.

    Signed-off-by: Tejun Heo
    Acked-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Jan Kratochvil

    Tejun Heo
     
  • Currently task->signal->group_stop_count is used to decide whether to
    stop for group stop. However, if there is a task in the group which
    is taking a long time to stop, other tasks which are continued by
    ptrace would repeatedly stop for the same group stop until the group
    stop is complete.

    Conversely, if a ptraced task is in TASK_TRACED state, the debugger
    won't get notified of group stops, which is inconsistent compared to
    a ptraced task in any other state.

    This patch introduces GROUP_STOP_PENDING which tracks whether a task
    is yet to stop for the group stop in progress. The flag is set when a
    group stop starts and cleared when the task stops the first time for
    the group stop, and consulted whenever it needs to be determined
    whether the task should participate in a group stop. Note that now
    tasks in TASK_TRACED also participate in group stop.

    This results in the following behavior changes.

    * For a single group stop, a ptracer would see at most one stop
    reported.

    * A ptracee in TASK_TRACED now also participates in group stop and the
    tracer would get the notification. However, as a ptraced task could
    be in TASK_STOPPED state or any ptrace trap could consume group
    stop, the notification may still be missing. These will be
    addressed with further patches.

    * A ptracee may start a group stop while one is still in progress if
    the tracer lets it continue with stop signal delivery. Group stop
    code handles this correctly.

    Oleg:

    * Spotted that a task might skip the signal check even when its
    GROUP_STOP_PENDING is set. Fixed by updating
    recalc_sigpending_tsk() to check GROUP_STOP_PENDING instead of
    group_stop_count.

    * Pointed out that task->group_stop should be cleared whenever
    task->signal->group_stop_count is cleared. Fixed accordingly.

    * Pointed out the behavior inconsistency between TASK_TRACED and
    RUNNING and the last behavior change.

    Signed-off-by: Tejun Heo
    Acked-by: Oleg Nesterov
    Cc: Roland McGrath

    Tejun Heo
     
  • task->signal->group_stop_count is used to track the progress of group
    stop. It's initialized to the number of tasks which need to stop for
    group stop to finish, and each stopping or trapping task decrements it.
    However, each task doesn't keep track of whether it decremented the
    counter or not and if woken up before the group stop is complete and
    stops again, it can decrement the counter multiple times.

    Please consider the following example code.

    #include <unistd.h>
    #include <signal.h>
    #include <pthread.h>
    #include <sys/types.h>
    #include <sys/ptrace.h>
    #include <sys/wait.h>

    static void *worker(void *arg)
    {
            while (1)
                    ;
            return NULL;
    }

    int main(void)
    {
            pthread_t thread;
            pid_t pid;
            int i;

            pid = fork();
            if (!pid) {
                    /* child: five spinning threads plus the main thread */
                    for (i = 0; i < 5; i++)
                            pthread_create(&thread, NULL, worker, NULL);
                    while (1)
                            ;
                    return 0;
            }

            /* parent: trap the first thread, delivering SIGSTOP on
             * every singlestep */
            ptrace(PTRACE_ATTACH, pid, NULL, NULL);
            while (1) {
                    waitid(P_PID, pid, NULL, WSTOPPED);
                    ptrace(PTRACE_SINGLESTEP, pid, NULL,
                           (void *)(long)SIGSTOP);
            }
            return 0;
    }

    The child creates five threads and the parent continuously traps the
    first thread; whenever the child gets a signal, SIGSTOP is
    delivered. If an external process sends SIGSTOP to the child, all
    other threads in the process should reliably stop. However, due to
    the above bug, the first thread will often end up consuming
    group_stop_count multiple times, and SIGSTOP often ends up stopping
    none, or only some, of the other four threads.

    This patch adds a new field, task->group_stop, which is protected by
    siglock and uses the GROUP_STOP_CONSUME flag to track which tasks
    still need to consume group_stop_count, fixing this bug.

    task_clear_group_stop_pending() and task_participate_group_stop() are
    added to help manipulate group stop states. As ptrace_stop() now
    also uses task_participate_group_stop(), it will set
    SIGNAL_STOP_STOPPED if it completes a group stop.
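
    A simplified sketch of the consume-once logic these helpers implement
    (not the exact kernel code; assumes siglock is held):

        #include <linux/sched.h>

        /* Each task consumes group_stop_count at most once per group
         * stop; the per-task GROUP_STOP_CONSUME bit remembers whether
         * this task's contribution is still outstanding. */
        static bool task_participate_group_stop(struct task_struct *task)
        {
                struct signal_struct *sig = task->signal;
                bool consume = task->group_stop & GROUP_STOP_CONSUME;

                task->group_stop &= ~GROUP_STOP_CONSUME;

                if (!consume)
                        return false;   /* already counted once */

                if (!--sig->group_stop_count) {
                        /* last participant: group stop is complete */
                        sig->flags = SIGNAL_STOP_STOPPED;
                        return true;
                }
                return false;
        }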

    There still are many issues regarding the interaction between group
    stop and ptrace. Patches to address them will follow.

    - Oleg spotted duplicate GROUP_STOP_CONSUME. Dropped.

    Signed-off-by: Tejun Heo
    Acked-by: Oleg Nesterov
    Cc: Roland McGrath

    Tejun Heo
     
  • tracehook_notify_jctl() aids in determining whether and what to report
    to the parent when a task is stopped or continued. The function also
    adds an extra requirement that siglock may be released across it,
    which is currently unused and quite difficult to satisfy in a
    well-defined manner.

    As job control and the notifications are about to receive a major
    overhaul, remove the tracehook and open-code it. If it ever becomes
    necessary, let's factor it out after the overhaul.

    * Oleg spotted incorrect CLD_CONTINUED/STOPPED selection when ptraced.
    Fixed.

    Signed-off-by: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Roland McGrath

    Tejun Heo
     
  • * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx: (66 commits)
    avr32: at32ap700x: fix typo in DMA master configuration
    dmaengine/dmatest: Pass timeout via module params
    dma: let IMX_DMA depend on IMX_HAVE_DMA_V1 instead of an explicit list of SoCs
    fsldma: make halt behave nicely on all supported controllers
    fsldma: reduce locking during descriptor cleanup
    fsldma: support async_tx dependencies and automatic unmapping
    fsldma: fix controller lockups
    fsldma: minor codingstyle and consistency fixes
    fsldma: improve link descriptor debugging
    fsldma: use channel name in printk output
    fsldma: move related helper functions near each other
    dmatest: fix automatic buffer unmap type
    drivers, pch_dma: Fix warning when CONFIG_PM=n.
    dmaengine/dw_dmac fix: use readl & writel instead of __raw_readl & __raw_writel
    avr32: at32ap700x: Specify DMA Flow Controller, Src and Dst msize
    dw_dmac: Setting Default Burst length for transfers as 16.
    dw_dmac: Allow src/dst msize & flow controller to be configured at runtime
    dw_dmac: Changing type of src_master and dest_master to u8.
    dw_dmac: Pass Channel Priority from platform_data
    dw_dmac: Pass Channel Allocation Order from platform_data
    ...

    Linus Torvalds
     
  • Instead of always creating a huge (268K) deflate_workspace with the
    maximum compression parameters (windowBits=15, memLevel=8), allow the
    caller to obtain a smaller workspace by specifying smaller parameter
    values.

    For example, when capturing oops and panic reports to a medium with
    limited capacity, such as NVRAM, compression may be the only way to
    capture the whole report. In this case, a small workspace (24K works
    fine) is a win, whether you allocate the workspace when you need it (i.e.,
    during an oops or panic) or at boot time.

    I've verified that this patch works with all accepted values of windowBits
    (positive and negative), memLevel, and compression level.
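
    A sketch of a caller using the new parameters (the parameter values
    here are illustrative of the small-workspace case described above):

        #include <linux/zlib.h>
        #include <linux/vmalloc.h>
        #include <linux/errno.h>

        static z_stream stream;

        /* Small workspace: raw deflate with a 512-byte window
         * (windowBits = -9) and memLevel = 1. The same parameters must
         * then be passed to zlib_deflateInit2(). */
        static int small_deflate_init(void)
        {
                stream.workspace = vmalloc(zlib_deflate_workspacesize(-9, 1));
                if (!stream.workspace)
                        return -ENOMEM;
                return zlib_deflateInit2(&stream, Z_BEST_COMPRESSION,
                                         Z_DEFLATED, -9, 1,
                                         Z_DEFAULT_STRATEGY);
        }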

    Signed-off-by: Jim Keniston
    Cc: Herbert Xu
    Cc: David Miller
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jim Keniston
     
  • Add brackets around typecasted argument in crc32() macro.
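
    The fix is plain macro hygiene; schematically (macro body
    abbreviated):

        /* Before: the cast binds only to the leading token of 'data'. */
        #define crc32(seed, data, length) \
                crc32_le(seed, (unsigned char const *)data, length)
        /* crc32(s, buf + 4, len) with u32 *buf casts 'buf' first, so
         * '+ 4' advances 4 bytes instead of 16. */

        /* After: parenthesizing preserves the caller's arithmetic. */
        #define crc32(seed, data, length) \
                crc32_le(seed, (unsigned char const *)(data), length)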

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Analog Devices' SigmaStudio can produce firmware blobs for devices with
    these DSPs embedded (like some audio codecs). Allow these device drivers
    to easily parse and load them.

    Signed-off-by: Mike Frysinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Frysinger
     
    1. simple_strto*() do not contain overflow checks and use the crufty
       libc way of indicating failure.
    2. strict_strto*() also do not have overflow checks, but the name and
       comments pretend they do.
    3. Both families have only "long long" and "long" variants, but users
       want strtou8().
    4. Both the "simple" and "strict" prefixes are wrong: "simple" doesn't
       say what exactly is so simple, and "strict" should not exist
       because conversion should be strict by default.

    The solution is to use "k" prefix and add convertors for more types.
    Enter
    kstrtoull()
    kstrtoll()
    kstrtoul()
    kstrtol()
    kstrtouint()
    kstrtoint()

    kstrtou64()
    kstrtos64()
    kstrtou32()
    kstrtos32()
    kstrtou16()
    kstrtos16()
    kstrtou8()
    kstrtos8()

    Include a runtime testsuite (somewhat incomplete) as well.

    strict_strto*() become deprecated, stubbed to kstrto*() and
    eventually will be removed altogether.

    Use kstrto*() in code today!

    Note: on some archs _kstrtoul() and _kstrtol() are left in the tree,
    even if they'll be unused at runtime. This is a temporary solution,
    because I don't want to hardcode the list of archs where these
    functions aren't needed. The current solution with sizeof() and
    __alignof__ at least always works.
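
    A minimal usage sketch (a hypothetical sysfs store handler; all
    names except kstrtouint() are made up):

        #include <linux/kernel.h>
        #include <linux/kobject.h>
        #include <linux/sysfs.h>

        static unsigned int threshold;

        static ssize_t threshold_store(struct kobject *kobj,
                                       struct kobj_attribute *attr,
                                       const char *buf, size_t count)
        {
                unsigned int val;
                int ret;

                /* base 10; returns 0 on success, -errno on overflow
                 * or malformed input */
                ret = kstrtouint(buf, 10, &val);
                if (ret < 0)
                        return ret;

                threshold = val;
                return count;
        }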

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
    Move setup_nr_cpu_ids(), smp_init() and some other SMP boot parameter
    setup functions from init/main.c to kernel/smp.c, saving some #ifdef
    CONFIG_SMP.

    Signed-off-by: WANG Cong
    Cc: Rakib Mullick
    Cc: David Howells
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Tejun Heo
    Cc: Arnd Bergmann
    Cc: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Amerigo Wang
     
  • PTR_RET() can be used if you have an error-pointer and are only interested
    in the eventual error value, but not the pointer. Yields the usual 0 for
    no error, -ESOMETHING otherwise.
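
    In other words, PTR_RET(p) is IS_ERR(p) ? PTR_ERR(p) : 0; a
    hypothetical caller:

        #include <linux/err.h>

        static struct my_ctx *ctx;      /* hypothetical */

        static int __init my_init(void)
        {
                ctx = my_ctx_create();  /* returns pointer or ERR_PTR() */
                /* only the error status matters for the return value */
                return PTR_RET(ctx);
        }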

    Signed-off-by: Uwe Kleine-König
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uwe Kleine-König
     
    Reorder struct block_device to remove 8 bytes of padding on 64-bit
    builds. This also shrinks bdev_inode by 8 bytes (776 -> 768),
    allowing it to fit into one fewer cache line.

    Signed-off-by: Richard Kennedy
    Cc: Al Viro
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Kennedy
     
  • Unify identical gcc3.x and gcc4.x macros.

    Signed-off-by: Borislav Petkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Borislav Petkov
     
  • All architectures can use the common dma_addr_t typedef now. We can
    remove the arch specific dma_addr_t.

    Signed-off-by: FUJITA Tomonori
    Acked-by: Arnd Bergmann
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Ivan Kokshaysky
    Cc: Richard Henderson
    Cc: Matt Turner
    Cc: "Luck, Tony"
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Chris Metcalf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    FUJITA Tomonori
     
  • Add a new __GFP_OTHER_NODE flag to tell the low level numa statistics in
    zone_statistics() that an allocation is on behalf of another thread. This
    way the local and remote counters can still be correct, even when
    background daemons like khugepaged are changing memory mappings.

    This only affects the accounting, but I think it's worth doing that right
    to avoid confusing users.

    I first tried to just pass down the right node, but this required a lot of
    changes to pass down this parameter and at least one addition of a 10th
    argument to a 9 argument function. Using the flag is a lot less
    intrusive.
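
    A sketch of the intended use (a khugepaged-like daemon allocating on
    behalf of a task living on target_node; the function name is
    illustrative):

        #include <linux/gfp.h>

        static struct page *alloc_for_remote_task(int target_node,
                                                  unsigned int order)
        {
                /* The flag only changes which node's numa_hit/numa_miss/
                 * numa_other counters the allocation is charged to. */
                return alloc_pages_node(target_node,
                                        GFP_KERNEL | __GFP_OTHER_NODE,
                                        order);
        }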

    Open: should this also be used for migration?

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Andi Kleen
    Cc: Andrea Arcangeli
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • When reclaiming for order-0 pages, kswapd requires that all zones be
    balanced. Each cycle through balance_pgdat() does background ageing on
    all zones if necessary and applies equal pressure on the inactive zone
    unless a lot of pages are free already.

    A "lot of free pages" is defined as a "balance gap" above the high
    watermark which is currently 7*high_watermark. Historically this was
    reasonable as min_free_kbytes was small. However, on systems using huge
    pages, it is recommended that min_free_kbytes is higher and it is tuned
    with hugeadm --set-recommended-min_free_kbytes. With the introduction of
    transparent huge page support, this recommended value is also applied. On
    X86-64 with 4G of memory, min_free_kbytes becomes 67584 so one would
    expect around 68M of memory to be free. The Normal zone is approximately
    35000 pages so under even normal memory pressure such as copying a large
    file, it gets exhausted quickly. As it is getting exhausted, kswapd
    applies pressure equally to all zones, including the DMA32 zone. DMA32 is
    approximately 700,000 pages with a high watermark of around 23,000 pages.
    In this situation, kswapd will reclaim around 23000 * 8 = 184000
    pages, or roughly 718M (the factor 8 being the high watermark plus
    the balance gap of 7 * high watermark), before the zone is ignored.
    What the user sees is free memory far higher than it should be.

    To avoid an excessive number of pages being reclaimed from the larger
    zones, explicitly define the "balance gap" to be either 1% of the
    zone or the low watermark for the zone, whichever is smaller. While
    kswapd will check all zones to apply pressure, it'll ignore zones
    that meet the (high_wmark + balance_gap) watermark.
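
    A sketch of the resulting check in balance_pgdat() (identifiers
    approximate the kernel's; a ratio of 100 gives the ~1% gap):

        #define KSWAPD_ZONE_BALANCE_GAP_RATIO   100

        /* ~1% of the zone, capped by the zone's low watermark. */
        unsigned long balance_gap =
                min(low_wmark_pages(zone),
                    (zone->present_pages +
                     KSWAPD_ZONE_BALANCE_GAP_RATIO - 1) /
                    KSWAPD_ZONE_BALANCE_GAP_RATIO);

        /* Zones already above high watermark + gap are left alone. */
        if (!zone_watermark_ok_safe(zone, order,
                                    high_wmark_pages(zone) + balance_gap,
                                    end_zone, 0))
                shrink_zone(priority, zone, &sc);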

    To test this, 80G were copied from a partition and the amount of memory
    being used was recorded. A comparison of patched and unpatched kernels can
    be seen at
    http://www.csn.ul.ie/~mel/postings/minfree-20110222/memory-usage-hydra.ps
    and shows that kswapd is not reclaiming as much memory with the patch
    applied.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Shaohua Li
    Cc: "Chen, Tim C"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    Right now, if a mm_walk has either ->pte_entry or ->pmd_entry set, it
    will unconditionally split any transparent huge pages it runs into.
    In practice, that means that anyone doing a

    cat /proc/$pid/smaps

    will unconditionally break down every huge page in the process and depend
    on khugepaged to re-collapse it later. This is fairly suboptimal.

    This patch changes that behavior. It teaches each ->pmd_entry handler
    (there are five) that they must break down the THPs themselves. Also, the
    _generic_ code will never break down a THP unless a ->pte_entry handler is
    actually set.

    This means that the ->pmd_entry handlers can now choose to deal with THPs
    without breaking them down.
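
    A sketch of what a ->pmd_entry handler looks like under the new rules
    (the handler and its accounting are hypothetical):

        static int my_pmd_entry(pmd_t *pmd, unsigned long addr,
                                unsigned long end, struct mm_walk *walk)
        {
                if (pmd_trans_huge(*pmd)) {
                        /* Either account the whole huge page here and
                         * return, or split it and fall through: */
                        split_huge_page_pmd(walk->mm, pmd);
                }
                /* ... process individual ptes as before ... */
                return 0;
        }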

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Dave Hansen
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Reviewed-by: Eric B Munson
    Tested-by: Eric B Munson
    Cc: Michael J Wolf
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Matt Mackall
    Cc: Jeremy Fitzhardinge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • The rotate_reclaimable_page function moves just written out pages, which
    the VM wanted to reclaim, to the end of the inactive list. That way the
    VM will find those pages first next time it needs to free memory.

    This patch applies the same rule to memcg. It can help prevent
    unnecessary eviction of memcg working-set pages.

    Signed-off-by: Minchan Kim
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Acked-by: Johannes Weiner
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    Recently, a thrashing problem was reported
    (http://marc.info/?l=rsync&m=128885034930933&w=2). It happens with
    backup workloads (e.g. nightly rsync): the workload creates use-once
    pages but touches each page twice, which promotes the pages to the
    active list and results in working-set page eviction.

    Some application developers want POSIX_FADV_NOREUSE to be supported,
    but other OSes don't support it either
    (http://marc.info/?l=linux-mm&m=128928979512086&w=2).

    As another approach, application developers use POSIX_FADV_DONTNEED,
    but it has a problem: if the kernel finds a page under writeback
    during invalidate_mapping_pages(), it can't drop it. This makes the
    interface hard to use, since applications always have to sync data
    before calling fadvise(..., POSIX_FADV_DONTNEED) to make sure the
    pages are discardable. As a result they can't use the kernel's
    deferred write and may see a performance loss.
    (http://insights.oetiker.ch/linux/fadvise.html)
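
    For reference, the userspace pattern being described is:

        #include <fcntl.h>
        #include <unistd.h>

        static void drop_cached_pages(int fd)
        {
                /* Without the fsync(), dirty/writeback pages survive
                 * the invalidation, so applications are forced to sync
                 * first and lose the benefit of deferred write. */
                fsync(fd);
                posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
        }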

    In fact, invalidation is a very strong hint to the reclaimer: it
    means we don't use the page any more. So let's move a page we can't
    truncate right now to the head of the inactive list.

    The reason for moving the page to the head of the LRU is that
    dirty/writeback pages will be flushed sooner or later anyway; doing
    so prevents writeout via pageout(), which is less efficient than the
    flusher threads' writeout.

    This patch originally reused Peter's lru_demote patch with some
    changes, hence his Signed-off-by.

    Signed-off-by: Minchan Kim
    Reported-by: Ben Gamari
    Signed-off-by: Peter Zijlstra
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Reviewed-by: KOSAKI Motohiro
    Cc: Wu Fengguang
    Acked-by: Johannes Weiner
    Cc: Nick Piggin
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    Reorder mm_struct to remove 16 bytes of alignment padding on 64-bit
    builds. On my config this shrinks mm_struct enough to fit in one
    fewer cache line, allowing more objects per slab in the mm_struct
    kmem_cache under SLUB.

    slabinfo before patch :-

        Sizes (bytes)          Slabs
        --------------------------------
        Object :     848       Total  :  9
        SlabObj:     896       Full   :  2
        SlabSiz:   16384       Partial:  5
        Loss   :      48       CpuSlab:  2
        Align  :      64       Objects: 18

    slabinfo after :-

        Sizes (bytes)          Slabs
        --------------------------------
        Object :     832       Total  :  7
        SlabObj:     832       Full   :  2
        SlabSiz:   16384       Partial:  3
        Loss   :       0       CpuSlab:  2
        Align  :      64       Objects: 19

    Signed-off-by: Richard Kennedy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Kennedy
     
  • TestSetPageLocked() isn't being used anywhere. Also, using it would
    likely be an error, since the proper interface trylock_page() provides
    stronger ordering guarantees.

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
    This patch changes the anon_vma refcount to be 0 when the object is
    free. It does this by adding one reference for the in-use state of
    the anon_vma structure (i.e. when the anon_vma->head list is not
    empty).

    This allows a simpler release scheme without having to check both the
    refcount and the list, and avoids taking a ref for each entry on the
    list.

    Signed-off-by: Peter Zijlstra
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • We need the anon_vma refcount unconditionally to simplify the anon_vma
    lifetime rules.

    Signed-off-by: Peter Zijlstra
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • The normal code pattern used in the kernel is: get/put.

    Signed-off-by: Peter Zijlstra
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
    Now that remove_from_page_cache() has been renamed to
    delete_from_page_cache(), change the internal page cache handling
    function's name too, for consistency with __remove_from_swap_cache()
    and remove_from_swap_cache().

    Signed-off-by: Minchan Kim
    Cc: Christoph Hellwig
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    Now delete_from_page_cache() replaces remove_from_page_cache(), so
    remove remove_from_page_cache() itself; filesystems and other code
    out of mainline will notice it at compile time and can fix it.

    Signed-off-by: Minchan Kim
    Cc: Christoph Hellwig
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    Presently we increase the page refcount in add_to_page_cache() but
    don't decrease it in remove_from_page_cache(). Such asymmetry adds
    confusion, requiring callers to notice it and to add a comment
    explaining why they release a page reference. It's not a good API.

    A long time ago, Hugh tried it (http://lkml.org/lkml/2004/10/24/140)
    but gave up because reiser4's drop_page() had to unlock the page
    between removing it from the page cache and doing the
    page_cache_release(). But now the situation has changed: at least
    nothing in current mainline has any such obstacle. The problem is
    out-of-mainline filesystems; if they do such things as reiser4 did,
    this patch could be a problem for them, but they will discover it at
    compile time since we remove remove_from_page_cache().

    This patch:

    This function works as a wrapper around remove_from_page_cache(); the
    difference is that it drops the page reference itself, so callers
    have to make sure they hold a page reference before calling it.

    This patch prepares for the removal of remove_from_page_cache().
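
    A simplified sketch of the wrapper (details such as the freepage
    callback and accounting are elided):

        #include <linux/pagemap.h>
        #include <linux/mm.h>

        void delete_from_page_cache(struct page *page)
        {
                struct address_space *mapping = page->mapping;

                /* same removal as before... */
                spin_lock_irq(&mapping->tree_lock);
                __remove_from_page_cache(page);
                spin_unlock_irq(&mapping->tree_lock);

                /* ...but the page cache's reference is dropped here,
                 * not by the caller */
                page_cache_release(page);
        }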

    Signed-off-by: Minchan Kim
    Cc: Christoph Hellwig
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Cc: Edward Shishkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • This function basically does:

    remove_from_page_cache(old);
    page_cache_release(old);
    add_to_page_cache_locked(new);

    Except it does this atomically, so there's no possibility for the "add" to
    fail because of a race.

    If memory cgroups are enabled, then the memory cgroup charge is also moved
    from the old page to the new.

    This function is currently used by fuse to move pages into the page cache
    on read, instead of copying the page contents.
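
    A sketch of the caller side (the wrapper function is hypothetical):

        /* Atomically substitute 'newpage' for 'oldpage' at the same
         * offset in the same mapping; the old page is released. */
        static int steal_page_into_cache(struct page *oldpage,
                                         struct page *newpage)
        {
                int err = replace_page_cache_page(oldpage, newpage,
                                                  GFP_KERNEL);
                if (err)
                        return err;     /* e.g. -ENOMEM; fall back to
                                         * copying the contents */
                return 0;
        }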

    [minchan.kim@gmail.com: add freepage() hook to replace_page_cache_page()]
    Signed-off-by: Miklos Szeredi
    Acked-by: Rik van Riel
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
    A GUP user may want to try to acquire a reference to a page if it is
    already in memory, but not if I/O is needed to bring it in. For
    example, KVM may tell a vcpu to schedule another guest process if the
    current one is trying to access a swapped-out page. Meanwhile, the
    page will be swapped in and the guest process that depends on it will
    be able to run again.

    This patch adds FAULT_FLAG_RETRY_NOWAIT (suggested by Linus) and
    FOLL_NOWAIT follow_page flags. FAULT_FLAG_RETRY_NOWAIT, when used in
    conjunction with VM_FAULT_ALLOW_RETRY, indicates to handle_mm_fault that
    it shouldn't drop mmap_sem and wait on a page, but return VM_FAULT_RETRY
    instead.
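
    Conceptually, the fault-side behavior is (a simplified sketch, not
    the literal code):

        /* called when the page needs I/O to be brought in */
        static int fault_needs_io(unsigned int flags)
        {
                if ((flags & FAULT_FLAG_ALLOW_RETRY) &&
                    (flags & FAULT_FLAG_RETRY_NOWAIT))
                        /* keep mmap_sem, don't sleep on the page; the
                         * caller (e.g. KVM) can run something else */
                        return VM_FAULT_RETRY;

                /* otherwise: drop mmap_sem and wait as before */
                return 0;
        }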

    [akpm@linux-foundation.org: improve FOLL_NOWAIT comment]
    Signed-off-by: Gleb Natapov
    Cc: Linus Torvalds
    Cc: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: Michel Lespinasse
    Cc: Avi Kivity
    Cc: Marcelo Tosatti
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gleb Natapov
     
  • The oom killer is extremely verbose for machines with a large number of
    cpus and/or nodes. This verbosity can often be harmful if it causes other
    important messages to be scrolled from the kernel log and incurs a
    significant time delay, specifically for kernels with CONFIG_NODES_SHIFT >
    8.

    This patch causes only memory information to be displayed for nodes that
    are allowed by current's cpuset when dumping the VM state. Information
    for all other nodes is irrelevant to the oom condition; we don't care if
    there's an abundance of memory elsewhere if we can't access it.

    This only affects the behavior of dumping memory information when an oom
    is triggered. Other dumps, such as for sysrq+m, still display the
    unfiltered form when using the existing show_mem() interface.

    Additionally, the per-cpu pageset statistics are extremely verbose in
    oom killer output, so they are now suppressed. This removes

    nodes_weight(current->mems_allowed) * (1 + nr_cpus)

    lines from the oom killer output.

    Callers may use __show_mem(SHOW_MEM_FILTER_NODES) to filter disallowed
    nodes.

    Signed-off-by: David Rientjes
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
    Since all kthreads are created by a single helper task, they all use
    memory from a single node for their kernel stack and task struct.

    This patch suite creates kthread_create_on_node(), adding a 'node'
    parameter to the parameters already taken by kthread_create().

    This parameter is used to allocate the memory for the new kthread on
    its memory node if possible.
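
    Usage mirrors kthread_create() with the extra node argument; a
    sketch for a per-cpu worker (worker_fn and the name are
    hypothetical):

        static struct task_struct *start_worker(int cpu, void *data)
        {
                struct task_struct *t;

                /* stack and task_struct come from cpu's memory node */
                t = kthread_create_on_node(worker_fn, data,
                                           cpu_to_node(cpu),
                                           "worker/%d", cpu);
                if (!IS_ERR(t)) {
                        kthread_bind(t, cpu);
                        wake_up_process(t);
                }
                return t;
        }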

    Signed-off-by: Eric Dumazet
    Acked-by: David S. Miller
    Reviewed-by: Andi Kleen
    Acked-by: Rusty Russell
    Cc: Tejun Heo
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: David Howells
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     
  • This patch reverts 5a03b051 ("thp: use compaction in kswapd for GFP_ATOMIC
    order > 0") due to reports stating that kswapd CPU usage was higher and
    IRQs were being disabled more frequently. This was reported at
    http://www.spinics.net/linux/fedora/alsa-user/msg09885.html.

    Without this patch applied, CPU usage by kswapd hovers around the 20% mark
    according to the tester (Arthur Marsh:
    http://www.spinics.net/linux/fedora/alsa-user/msg09899.html). With this
    patch applied, it's around 2%.

    The problem is not related to THP which specifies __GFP_NO_KSWAPD but is
    triggered by high-order allocations hitting the low watermark for their
    order and waking kswapd on kernels with CONFIG_COMPACTION set. The most
    common trigger for this is network cards configured for jumbo frames but
    it's also possible it'll be triggered by fork-heavy workloads (order-1)
    and some wireless cards which depend on order-1 allocations.

    The symptoms for the user will be high CPU usage by kswapd in low-memory
    situations which could be confused with another writeback problem. While
    a patch like 5a03b051 may be reintroduced in the future, this patch plays
    it safe for now and reverts it.

    [mel@csn.ul.ie: Beefed up the changelog]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Mel Gorman
    Reported-by: Arthur Marsh
    Tested-by: Arthur Marsh
    Cc: [2.6.38.1]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • In systems with multiple framebuffer devices, one of the devices might be
    blanked while another is unblanked. In order for the backlight blanking
    logic to know whether to turn off the backlight for a particular
    framebuffer's blanking notification, it needs to be able to check if a
    given framebuffer device corresponds to the backlight.

    This plumbs the check_fb hook from core backlight through the
    pwm_backlight helper to allow platform code to plug in a check_fb hook.
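
    A sketch of what board code might plug in (my_fb_info is whatever
    framebuffer the board actually drives; other values are
    illustrative):

        static int my_check_fb(struct device *dev, struct fb_info *info)
        {
                /* nonzero: this framebuffer's blanking controls the
                 * backlight; other fbs' blanking is ignored */
                return info == my_fb_info;
        }

        static struct platform_pwm_backlight_data my_bl_data = {
                .max_brightness = 255,
                .dft_brightness = 200,
                .pwm_period_ns  = 78770,
                .check_fb       = my_check_fb,
        };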

    Signed-off-by: Robert Morell
    Cc: Richard Purdie
    Cc: Arun Murthy
    Cc: Linus Walleij
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert Morell
     
  • There may be multiple ways of controlling the backlight on a given
    machine. Allow drivers to expose the type of interface they are
    providing, making it possible for userspace to make appropriate policy
    decisions.
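
    A sketch of a driver declaring its type at registration (names other
    than the backlight API are illustrative):

        static struct backlight_device *
        register_my_backlight(struct device *dev, void *priv)
        {
                struct backlight_properties props;

                memset(&props, 0, sizeof(props));
                /* e.g. BACKLIGHT_RAW for direct register control, vs
                 * BACKLIGHT_PLATFORM or BACKLIGHT_FIRMWARE interfaces */
                props.type = BACKLIGHT_RAW;
                props.max_brightness = 255;

                return backlight_device_register("my_bl", dev, priv,
                                                 &my_bl_ops, &props);
        }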

    Signed-off-by: Matthew Garrett
    Cc: Richard Purdie
    Cc: Chris Wilson
    Cc: David Airlie
    Cc: Alex Deucher
    Cc: Ben Skeggs
    Cc: Zhang Rui
    Cc: Len Brown
    Cc: Jesse Barnes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Garrett
     
  • And fix a typo.

    Signed-off-by: Uwe Kleine-König
    Cc: Lars-Peter Clausen
    Cc: Richard Purdie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uwe Kleine-König
     
    Simple backlight driver for the National Semiconductor LM3530.
    Presently only manual mode is supported; PWM and ALS support are to
    be added.

    Signed-off-by: Shreshtha Kumar Sahu
    Cc: Linus Walleij
    Cc: Richard Purdie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shreshtha Kumar Sahu
     
  • syncfs() is duplicating name_to_handle_at() due to a merging mistake.

    Cc: Sage Weil
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • * 'slab/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
    slub: Add statistics for this_cmpxchg_double failures
    slub: Add missing irq restore for the OOM path

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
    rbd: use watch/notify for changes in rbd header
    libceph: add lingering request and watch/notify event framework
    rbd: update email address in Documentation
    ceph: rename dentry_release -> d_release, fix comment
    ceph: add request to the tail of unsafe write list
    ceph: remove request from unsafe list if it is canceled/timed out
    ceph: move readahead default to fs/ceph from libceph
    ceph: add ino32 mount option
    ceph: update common header files
    ceph: remove debugfs debug cruft
    libceph: fix osd request queuing on osdmap updates
    ceph: preserve I_COMPLETE across rename
    libceph: Fix base64-decoding when input ends in newline.

    Linus Torvalds
     
  • Using delayed-work for tty flip buffers ends up causing us to wait for
    the next tick to complete some actions. That's usually not all that
    noticeable, but for certain latency-critical workloads it ends up being
    totally unacceptable.

    As an extreme case of this, passing a token back-and-forth over a pty
    will take two ticks per iteration, so even just a thousand iterations
    will take 8 seconds assuming a common 250Hz configuration.

    Avoiding the whole delayed work issue brings that ping-pong test-case
    down to 0.009s on my machine.

    In more practical terms, this latency has been a performance problem for
    things like dive computer simulators (simulating the serial interface
    using the ptys) and for other environments (Alan mentions a CP/M emulator).
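
    The change is essentially this (a sketch of the scheduling call; the
    flip-buffer work item becomes a plain work_struct):

        /* Before: wait for the next tick even when there is no reason. */
        schedule_delayed_work(&tty->buf.work, 1);

        /* After: run the flip-buffer work as soon as possible. */
        schedule_work(&tty->buf.work);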

    Reported-by: Jef Driesen
    Acked-by: Greg KH
    Acked-by: Alan Cox
    Signed-off-by: Linus Torvalds

    Linus Torvalds