23 Mar, 2011

40 commits

  • Currently, if the task is STOPPED on ptrace attach, it's left alone
    and the state is silently changed to TRACED on the next ptrace call.
    The behavior breaks the assumption that arch_ptrace_stop() is called
    before any task is poked by ptrace and is ugly in that a task
    manipulates the state of another task directly.

    With GROUP_STOP_PENDING, the transitions between TASK_STOPPED and
    TRACED can be made clean. The tracer can use the flag to tell the
    tracee to retry stop on attach and detach. On retry, the tracee will
    enter the desired state in the correct way. The lower 16 bits of
    task->group_stop are used to remember the signal number which caused
    the last group stop. This is used while retrying for ptrace attach,
    as the original group_exit_code could have been consumed with
    wait(2) by then.

    As the real parent may wait(2) and consume the group_exit_code
    anytime, the group_exit_code needs to be saved separately so that it
    can be used when switching from regular sleep to ptrace_stop(). This
    is recorded in the lower 16 bits of task->group_stop.

    If a task is already stopped and there's no intervening SIGCONT, a
    ptrace request immediately following a successful PTRACE_ATTACH should
    always succeed even if the tracer doesn't wait(2) for attach
    completion; however, with this change, the tracee might still be
    TASK_RUNNING trying to enter TASK_TRACED, which would cause the
    following request to fail with -ESRCH.

    This intermediate state is hidden from the ptracer by setting
    GROUP_STOP_TRAPPING on attach and making ptrace_check_attach() wait
    for it to clear on its signal->wait_chldexit. Completing the
    transition or getting killed clears TRAPPING and wakes up the tracer.

    Note that the STOPPED -> RUNNING -> TRACED transition is still visible
    to other threads which are in the same group as the ptracer and the
    reverse transition is visible to all. Please read the comments for
    details.
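
    To illustrate the guarantee, a userland sketch (illustrative only;
    PTRACE_PEEKUSER stands in for any request, and the tracee is assumed
    to already be group-stopped with no intervening SIGCONT):

        #include <sys/types.h>
        #include <sys/ptrace.h>

        static long attach_and_request(pid_t pid)
        {
                ptrace(PTRACE_ATTACH, pid, NULL, NULL);
                /* No wait(2) here. During the STOPPED -> RUNNING ->
                 * TRACED retry this could previously fail with -ESRCH;
                 * with GROUP_STOP_TRAPPING, ptrace_check_attach() now
                 * waits for the transition to finish. */
                return ptrace(PTRACE_PEEKUSER, pid, NULL, NULL);
        }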

    Oleg:

    * Spotted a race condition where a task may retry group stop without
    proper bookkeeping. Fixed by redoing bookkeeping on retry.

    * Spotted that the transition is visible to userland in several
    different ways. Most are fixed with GROUP_STOP_TRAPPING. Unhandled
    corner case is documented.

    * Pointed out that not setting GROUP_STOP_SIGMASK on an already
    stopped task would result in more consistent behavior.

    * Pointed out that calling ptrace_stop() from do_signal_stop() in
    TASK_STOPPED can race with group stop start logic and then confuse
    the TRAPPING wait in ptrace_check_attach(). ptrace_stop() is now
    called with TASK_RUNNING.

    * Suggested using signal->wait_chldexit instead of bit wait.

    * Spotted a race condition between TRACED transition and clearing of
    TRAPPING.

    Signed-off-by: Tejun Heo
    Acked-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Jan Kratochvil

    Tejun Heo
     
  • Currently task->signal->group_stop_count is used to decide whether to
    stop for group stop. However, if there is a task in the group which
    is taking a long time to stop, other tasks which are continued by
    ptrace would repeatedly stop for the same group stop until the group
    stop is complete.

    Conversely, if a ptraced task is in TASK_TRACED state, the debugger
    won't get notified of group stops, which is inconsistent compared to
    a ptraced task in any other state.

    This patch introduces GROUP_STOP_PENDING which tracks whether a task
    is yet to stop for the group stop in progress. The flag is set when a
    group stop starts and cleared when the task stops the first time for
    the group stop, and consulted whenever it needs to be determined
    whether the task should participate in a group stop. Note that now
    tasks in TASK_TRACED also participate in group stop.

    This results in the following behavior changes.

    * For a single group stop, a ptracer would see at most one stop
    reported.

    * A ptracee in TASK_TRACED now also participates in group stop and the
    tracer would get the notification. However, as a ptraced task could
    be in TASK_STOPPED state or any ptrace trap could consume group
    stop, the notification may still be missing. These will be
    addressed with further patches.

    * A ptracee may start a group stop while one is still in progress if
    the tracer lets it continue with stop signal delivery. Group stop
    code handles this correctly.

    Oleg:

    * Spotted that a task might skip the signal check even when its
    GROUP_STOP_PENDING is set. Fixed by updating
    recalc_sigpending_tsk() to check GROUP_STOP_PENDING instead of
    group_stop_count.

    * Pointed out that task->group_stop should be cleared whenever
    task->signal->group_stop_count is cleared. Fixed accordingly.

    * Pointed out the behavior inconsistency between TASK_TRACED and
    RUNNING and the last behavior change.

    Signed-off-by: Tejun Heo
    Acked-by: Oleg Nesterov
    Cc: Roland McGrath

    Tejun Heo
     
  • task->signal->group_stop_count is used to track the progress of group
    stop. It's initialized to the number of tasks which need to stop for
    group stop to finish, and each stopping or trapping task decrements it.
    However, each task doesn't keep track of whether it decremented the
    counter or not and if woken up before the group stop is complete and
    stops again, it can decrement the counter multiple times.

    Please consider the following example code.

    #include <unistd.h>
    #include <signal.h>
    #include <pthread.h>
    #include <sys/types.h>
    #include <sys/ptrace.h>
    #include <sys/wait.h>

    static void *worker(void *arg)
    {
            while (1)
                    ;
            return NULL;
    }

    int main(void)
    {
            pthread_t thread;
            pid_t pid;
            int i;

            pid = fork();
            if (!pid) {
                    /* child: five spinning threads plus the main thread */
                    for (i = 0; i < 5; i++)
                            pthread_create(&thread, NULL, worker, NULL);
                    while (1)
                            ;
                    return 0;
            }

            /* parent: trap the first thread, delivering SIGSTOP on
             * every singlestep */
            ptrace(PTRACE_ATTACH, pid, NULL, NULL);
            while (1) {
                    waitid(P_PID, pid, NULL, WSTOPPED);
                    ptrace(PTRACE_SINGLESTEP, pid, NULL,
                           (void *)(long)SIGSTOP);
            }
            return 0;
    }

    The child creates five threads and the parent continuously traps the
    first thread; whenever the child gets a signal, SIGSTOP is
    delivered. If an external process sends SIGSTOP to the child, all
    other threads in the process should reliably stop. However, due to
    the above bug, the first thread will often end up consuming
    group_stop_count multiple times, and SIGSTOP often ends up stopping
    none, or only some, of the other four threads.

    This patch adds a new field, task->group_stop, which is protected by
    siglock and uses the GROUP_STOP_CONSUME flag to track which tasks
    still need to consume group_stop_count, fixing this bug.

    task_clear_group_stop_pending() and task_participate_group_stop() are
    added to help manipulate group stop states. As ptrace_stop() now
    also uses task_participate_group_stop(), it will set
    SIGNAL_STOP_STOPPED if it completes a group stop.
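
    A simplified sketch of the consume-once logic these helpers implement
    (not the exact kernel code; assumes siglock is held):

        #include <linux/sched.h>

        /* Each task consumes group_stop_count at most once per group
         * stop; the per-task GROUP_STOP_CONSUME bit remembers whether
         * this task's contribution is still outstanding. */
        static bool task_participate_group_stop(struct task_struct *task)
        {
                struct signal_struct *sig = task->signal;
                bool consume = task->group_stop & GROUP_STOP_CONSUME;

                task->group_stop &= ~GROUP_STOP_CONSUME;

                if (!consume)
                        return false;   /* already counted once */

                if (!--sig->group_stop_count) {
                        /* last participant: group stop is complete */
                        sig->flags = SIGNAL_STOP_STOPPED;
                        return true;
                }
                return false;
        }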

    There still are many issues regarding the interaction between group
    stop and ptrace. Patches to address them will follow.

    - Oleg spotted duplicate GROUP_STOP_CONSUME. Dropped.

    Signed-off-by: Tejun Heo
    Acked-by: Oleg Nesterov
    Cc: Roland McGrath

    Tejun Heo
     
  • tracehook_notify_jctl() aids in determining whether and what to report
    to the parent when a task is stopped or continued. The function also
    adds an extra requirement that siglock may be released across it,
    which is currently unused and quite difficult to satisfy in a
    well-defined manner.

    As job control and the notifications are about to receive a major
    overhaul, remove the tracehook and open-code it. If it ever becomes
    necessary, let's factor it out after the overhaul.

    * Oleg spotted incorrect CLD_CONTINUED/STOPPED selection when ptraced.
    Fixed.

    Signed-off-by: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Roland McGrath

    Tejun Heo
     
  • * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx: (66 commits)
    avr32: at32ap700x: fix typo in DMA master configuration
    dmaengine/dmatest: Pass timeout via module params
    dma: let IMX_DMA depend on IMX_HAVE_DMA_V1 instead of an explicit list of SoCs
    fsldma: make halt behave nicely on all supported controllers
    fsldma: reduce locking during descriptor cleanup
    fsldma: support async_tx dependencies and automatic unmapping
    fsldma: fix controller lockups
    fsldma: minor codingstyle and consistency fixes
    fsldma: improve link descriptor debugging
    fsldma: use channel name in printk output
    fsldma: move related helper functions near each other
    dmatest: fix automatic buffer unmap type
    drivers, pch_dma: Fix warning when CONFIG_PM=n.
    dmaengine/dw_dmac fix: use readl & writel instead of __raw_readl & __raw_writel
    avr32: at32ap700x: Specify DMA Flow Controller, Src and Dst msize
    dw_dmac: Setting Default Burst length for transfers as 16.
    dw_dmac: Allow src/dst msize & flow controller to be configured at runtime
    dw_dmac: Changing type of src_master and dest_master to u8.
    dw_dmac: Pass Channel Priority from platform_data
    dw_dmac: Pass Channel Allocation Order from platform_data
    ...

    Linus Torvalds
     
  • Instead of always creating a huge (268K) deflate_workspace with the
    maximum compression parameters (windowBits=15, memLevel=8), allow the
    caller to obtain a smaller workspace by specifying smaller parameter
    values.

    For example, when capturing oops and panic reports to a medium with
    limited capacity, such as NVRAM, compression may be the only way to
    capture the whole report. In this case, a small workspace (24K works
    fine) is a win, whether you allocate the workspace when you need it (i.e.,
    during an oops or panic) or at boot time.

    I've verified that this patch works with all accepted values of windowBits
    (positive and negative), memLevel, and compression level.
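
    A sketch of a caller using the new parameters (the parameter values
    here are illustrative of the small-workspace case described above):

        #include <linux/zlib.h>
        #include <linux/vmalloc.h>
        #include <linux/errno.h>

        static z_stream stream;

        /* Small workspace: raw deflate with a 512-byte window
         * (windowBits = -9) and memLevel = 1. The same parameters must
         * then be passed to zlib_deflateInit2(). */
        static int small_deflate_init(void)
        {
                stream.workspace = vmalloc(zlib_deflate_workspacesize(-9, 1));
                if (!stream.workspace)
                        return -ENOMEM;
                return zlib_deflateInit2(&stream, Z_BEST_COMPRESSION,
                                         Z_DEFLATED, -9, 1,
                                         Z_DEFAULT_STRATEGY);
        }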

    Signed-off-by: Jim Keniston
    Cc: Herbert Xu
    Cc: David Miller
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jim Keniston
     
  • Add brackets around typecasted argument in crc32() macro.
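
    The fix is plain macro hygiene; schematically (macro body
    abbreviated):

        /* Before: the cast binds only to the leading token of 'data'. */
        #define crc32(seed, data, length) \
                crc32_le(seed, (unsigned char const *)data, length)
        /* crc32(s, buf + 4, len) with u32 *buf casts 'buf' first, so
         * '+ 4' advances 4 bytes instead of 16. */

        /* After: parenthesizing preserves the caller's arithmetic. */
        #define crc32(seed, data, length) \
                crc32_le(seed, (unsigned char const *)(data), length)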

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Analog Devices' SigmaStudio can produce firmware blobs for devices with
    these DSPs embedded (like some audio codecs). Allow these device drivers
    to easily parse and load them.

    Signed-off-by: Mike Frysinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Frysinger
     
    1. simple_strto*() do not contain overflow checks and use the crufty
       libc way of indicating failure.
    2. strict_strto*() also do not have overflow checks, but the name and
       comments pretend they do.
    3. Both families have only "long long" and "long" variants, but users
       want strtou8().
    4. Both the "simple" and "strict" prefixes are wrong: "simple" doesn't
       say what exactly is so simple, and "strict" should not exist
       because conversion should be strict by default.

    The solution is to use "k" prefix and add convertors for more types.
    Enter
    kstrtoull()
    kstrtoll()
    kstrtoul()
    kstrtol()
    kstrtouint()
    kstrtoint()

    kstrtou64()
    kstrtos64()
    kstrtou32()
    kstrtos32()
    kstrtou16()
    kstrtos16()
    kstrtou8()
    kstrtos8()

    Include a runtime testsuite (somewhat incomplete) as well.

    strict_strto*() become deprecated, stubbed to kstrto*() and
    eventually will be removed altogether.

    Use kstrto*() in code today!

    Note: on some archs _kstrtoul() and _kstrtol() are left in the tree,
    even if they'll be unused at runtime. This is a temporary solution,
    because I don't want to hardcode the list of archs where these
    functions aren't needed. The current solution with sizeof() and
    __alignof__ at least always works.
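
    A minimal usage sketch (a hypothetical sysfs store handler; all
    names except kstrtouint() are made up):

        #include <linux/kernel.h>
        #include <linux/kobject.h>
        #include <linux/sysfs.h>

        static unsigned int threshold;

        static ssize_t threshold_store(struct kobject *kobj,
                                       struct kobj_attribute *attr,
                                       const char *buf, size_t count)
        {
                unsigned int val;
                int ret;

                /* base 10; returns 0 on success, -errno on overflow
                 * or malformed input */
                ret = kstrtouint(buf, 10, &val);
                if (ret < 0)
                        return ret;

                threshold = val;
                return count;
        }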

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
    Move setup_nr_cpu_ids(), smp_init() and some other SMP boot parameter
    setup functions from init/main.c to kernel/smp.c, saving some #ifdef
    CONFIG_SMP.

    Signed-off-by: WANG Cong
    Cc: Rakib Mullick
    Cc: David Howells
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Tejun Heo
    Cc: Arnd Bergmann
    Cc: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Amerigo Wang
     
  • PTR_RET() can be used if you have an error-pointer and are only interested
    in the eventual error value, but not the pointer. Yields the usual 0 for
    no error, -ESOMETHING otherwise.
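
    In other words, PTR_RET(p) is IS_ERR(p) ? PTR_ERR(p) : 0; a
    hypothetical caller:

        #include <linux/err.h>

        static struct my_ctx *ctx;      /* hypothetical */

        static int __init my_init(void)
        {
                ctx = my_ctx_create();  /* returns pointer or ERR_PTR() */
                /* only the error status matters for the return value */
                return PTR_RET(ctx);
        }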

    Signed-off-by: Uwe Kleine-König
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uwe Kleine-König
     
    Reorder struct block_device to remove 8 bytes of padding on 64-bit
    builds. This also shrinks bdev_inode by 8 bytes (776 -> 768),
    allowing it to fit into one fewer cache line.

    Signed-off-by: Richard Kennedy
    Cc: Al Viro
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Kennedy
     
  • Unify identical gcc3.x and gcc4.x macros.

    Signed-off-by: Borislav Petkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Borislav Petkov
     
  • All architectures can use the common dma_addr_t typedef now. We can
    remove the arch specific dma_addr_t.

    Signed-off-by: FUJITA Tomonori
    Acked-by: Arnd Bergmann
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Ivan Kokshaysky
    Cc: Richard Henderson
    Cc: Matt Turner
    Cc: "Luck, Tony"
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Chris Metcalf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    FUJITA Tomonori
     
  • Add a new __GFP_OTHER_NODE flag to tell the low level numa statistics in
    zone_statistics() that an allocation is on behalf of another thread. This
    way the local and remote counters can still be correct, even when
    background daemons like khugepaged are changing memory mappings.

    This only affects the accounting, but I think it's worth doing that right
    to avoid confusing users.

    I first tried to just pass down the right node, but this required a lot of
    changes to pass down this parameter and at least one addition of a 10th
    argument to a 9 argument function. Using the flag is a lot less
    intrusive.
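
    A sketch of the intended use (a khugepaged-like daemon allocating on
    behalf of a task living on target_node; the function name is
    illustrative):

        #include <linux/gfp.h>

        static struct page *alloc_for_remote_task(int target_node,
                                                  unsigned int order)
        {
                /* The flag only changes which node's numa_hit/numa_miss/
                 * numa_other counters the allocation is charged to. */
                return alloc_pages_node(target_node,
                                        GFP_KERNEL | __GFP_OTHER_NODE,
                                        order);
        }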

    Open: should this also be used for migration?

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Andi Kleen
    Cc: Andrea Arcangeli
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • When reclaiming for order-0 pages, kswapd requires that all zones be
    balanced. Each cycle through balance_pgdat() does background ageing on
    all zones if necessary and applies equal pressure on the inactive zone
    unless a lot of pages are free already.

    A "lot of free pages" is defined as a "balance gap" above the high
    watermark which is currently 7*high_watermark. Historically this was
    reasonable as min_free_kbytes was small. However, on systems using huge
    pages, it is recommended that min_free_kbytes is higher and it is tuned
    with hugeadm --set-recommended-min_free_kbytes. With the introduction of
    transparent huge page support, this recommended value is also applied. On
    X86-64 with 4G of memory, min_free_kbytes becomes 67584 so one would
    expect around 68M of memory to be free. The Normal zone is approximately
    35000 pages so under even normal memory pressure such as copying a large
    file, it gets exhausted quickly. As it is getting exhausted, kswapd
    applies pressure equally to all zones, including the DMA32 zone. DMA32 is
    approximately 700,000 pages with a high watermark of around 23,000 pages.
    In this situation, kswapd will reclaim around 23000 * 8 = 184000
    pages, or roughly 718M (the factor 8 being the high watermark plus
    the balance gap of 7 * high watermark), before the zone is ignored.
    What the user sees is free memory far higher than it should be.

    To avoid an excessive number of pages being reclaimed from the larger
    zones, explicitly define the "balance gap" to be either 1% of the
    zone or the low watermark for the zone, whichever is smaller. While
    kswapd will check all zones to apply pressure, it'll ignore zones
    that meet the (high_wmark + balance_gap) watermark.
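
    A sketch of the resulting check in balance_pgdat() (identifiers
    approximate the kernel's; a ratio of 100 gives the ~1% gap):

        #define KSWAPD_ZONE_BALANCE_GAP_RATIO   100

        /* ~1% of the zone, capped by the zone's low watermark. */
        unsigned long balance_gap =
                min(low_wmark_pages(zone),
                    (zone->present_pages +
                     KSWAPD_ZONE_BALANCE_GAP_RATIO - 1) /
                    KSWAPD_ZONE_BALANCE_GAP_RATIO);

        /* Zones already above high watermark + gap are left alone. */
        if (!zone_watermark_ok_safe(zone, order,
                                    high_wmark_pages(zone) + balance_gap,
                                    end_zone, 0))
                shrink_zone(priority, zone, &sc);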

    To test this, 80G were copied from a partition and the amount of memory
    being used was recorded. A comparison of patched and unpatched kernels can
    be seen at
    http://www.csn.ul.ie/~mel/postings/minfree-20110222/memory-usage-hydra.ps
    and shows that kswapd is not reclaiming as much memory with the patch
    applied.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Shaohua Li
    Cc: "Chen, Tim C"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    Right now, if a mm_walk has either ->pte_entry or ->pmd_entry set, it
    will unconditionally split any transparent huge pages it runs into.
    In practice, that means that anyone doing a

    cat /proc/$pid/smaps

    will unconditionally break down every huge page in the process and depend
    on khugepaged to re-collapse it later. This is fairly suboptimal.

    This patch changes that behavior. It teaches each ->pmd_entry handler
    (there are five) that they must break down the THPs themselves. Also, the
    _generic_ code will never break down a THP unless a ->pte_entry handler is
    actually set.

    This means that the ->pmd_entry handlers can now choose to deal with THPs
    without breaking them down.
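
    A sketch of what a ->pmd_entry handler looks like under the new rules
    (the handler and its accounting are hypothetical):

        static int my_pmd_entry(pmd_t *pmd, unsigned long addr,
                                unsigned long end, struct mm_walk *walk)
        {
                if (pmd_trans_huge(*pmd)) {
                        /* Either account the whole huge page here and
                         * return, or split it and fall through: */
                        split_huge_page_pmd(walk->mm, pmd);
                }
                /* ... process individual ptes as before ... */
                return 0;
        }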

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Dave Hansen
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Reviewed-by: Eric B Munson
    Tested-by: Eric B Munson
    Cc: Michael J Wolf
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Matt Mackall
    Cc: Jeremy Fitzhardinge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • The rotate_reclaimable_page function moves just written out pages, which
    the VM wanted to reclaim, to the end of the inactive list. That way the
    VM will find those pages first next time it needs to free memory.

    This patch applies the same rule to memcg. It can help prevent
    unnecessary eviction of memcg working-set pages.

    Signed-off-by: Minchan Kim
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Acked-by: Johannes Weiner
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    Recently, a thrashing problem was reported
    (http://marc.info/?l=rsync&m=128885034930933&w=2). It happens with
    backup workloads (e.g. nightly rsync): the workload creates use-once
    pages but touches each page twice, which promotes the pages to the
    active list and results in working-set page eviction.

    Some application developers want POSIX_FADV_NOREUSE to be supported,
    but other OSes don't support it either
    (http://marc.info/?l=linux-mm&m=128928979512086&w=2).

    As another approach, application developers use POSIX_FADV_DONTNEED,
    but it has a problem: if the kernel finds a page under writeback
    during invalidate_mapping_pages(), it can't drop it. This makes the
    interface hard to use, since applications always have to sync data
    before calling fadvise(..., POSIX_FADV_DONTNEED) to make sure the
    pages are discardable. As a result they can't use the kernel's
    deferred write and may see a performance loss.
    (http://insights.oetiker.ch/linux/fadvise.html)
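
    For reference, the userspace pattern being described is:

        #include <fcntl.h>
        #include <unistd.h>

        static void drop_cached_pages(int fd)
        {
                /* Without the fsync(), dirty/writeback pages survive
                 * the invalidation, so applications are forced to sync
                 * first and lose the benefit of deferred write. */
                fsync(fd);
                posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
        }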

    In fact, invalidation is a very strong hint to the reclaimer: it
    means we don't use the page any more. So let's move a page we can't
    truncate right now to the head of the inactive list.

    The reason for moving the page to the head of the LRU is that
    dirty/writeback pages will be flushed sooner or later anyway; doing
    so prevents writeout via pageout(), which is less efficient than the
    flusher threads' writeout.

    This patch originally reused Peter's lru_demote patch with some
    changes, hence his Signed-off-by.

    Signed-off-by: Minchan Kim
    Reported-by: Ben Gamari
    Signed-off-by: Peter Zijlstra
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Reviewed-by: KOSAKI Motohiro
    Cc: Wu Fengguang
    Acked-by: Johannes Weiner
    Cc: Nick Piggin
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    Reorder mm_struct to remove 16 bytes of alignment padding on 64-bit
    builds. On my config this shrinks mm_struct enough to fit in one
    fewer cache line, allowing more objects per slab in the mm_struct
    kmem_cache under SLUB.

    slabinfo before patch :-

        Sizes (bytes)          Slabs
        --------------------------------
        Object :     848       Total  :  9
        SlabObj:     896       Full   :  2
        SlabSiz:   16384       Partial:  5
        Loss   :      48       CpuSlab:  2
        Align  :      64       Objects: 18

    slabinfo after :-

        Sizes (bytes)          Slabs
        --------------------------------
        Object :     832       Total  :  7
        SlabObj:     832       Full   :  2
        SlabSiz:   16384       Partial:  3
        Loss   :       0       CpuSlab:  2
        Align  :      64       Objects: 19

    Signed-off-by: Richard Kennedy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Kennedy
     
  • TestSetPageLocked() isn't being used anywhere. Also, using it would
    likely be an error, since the proper interface trylock_page() provides
    stronger ordering guarantees.

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
    This patch changes the anon_vma refcount to be 0 when the object is
    free. It does this by adding one reference for the in-use state of
    the anon_vma structure (i.e. when the anon_vma->head list is not
    empty).

    This allows a simpler release scheme without having to check both the
    refcount and the list, and avoids taking a ref for each entry on the
    list.

    Signed-off-by: Peter Zijlstra
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • We need the anon_vma refcount unconditionally to simplify the anon_vma
    lifetime rules.

    Signed-off-by: Peter Zijlstra
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • The normal code pattern used in the kernel is: get/put.

    Signed-off-by: Peter Zijlstra
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
    Now that remove_from_page_cache() has been renamed to
    delete_from_page_cache(), change the internal page cache handling
    function's name too, for consistency with __remove_from_swap_cache()
    and remove_from_swap_cache().

    Signed-off-by: Minchan Kim
    Cc: Christoph Hellwig
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    Now delete_from_page_cache() replaces remove_from_page_cache(), so
    remove remove_from_page_cache() itself; filesystems and other code
    out of mainline will notice it at compile time and can fix it.

    Signed-off-by: Minchan Kim
    Cc: Christoph Hellwig
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    Presently we increase the page refcount in add_to_page_cache() but
    don't decrease it in remove_from_page_cache(). Such asymmetry adds
    confusion, requiring callers to notice it and to add a comment
    explaining why they release a page reference. It's not a good API.

    A long time ago, Hugh tried it (http://lkml.org/lkml/2004/10/24/140)
    but gave up because reiser4's drop_page() had to unlock the page
    between removing it from the page cache and doing the
    page_cache_release(). But now the situation has changed: at least
    nothing in current mainline has any such obstacle. The problem is
    out-of-mainline filesystems; if they do such things as reiser4 did,
    this patch could be a problem for them, but they will discover it at
    compile time since we remove remove_from_page_cache().

    This patch:

    This function works as a wrapper around remove_from_page_cache(); the
    difference is that it drops the page reference itself, so callers
    have to make sure they hold a page reference before calling it.

    This patch prepares for the removal of remove_from_page_cache().
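
    A simplified sketch of the wrapper (details such as the freepage
    callback and accounting are elided):

        #include <linux/pagemap.h>
        #include <linux/mm.h>

        void delete_from_page_cache(struct page *page)
        {
                struct address_space *mapping = page->mapping;

                /* same removal as before... */
                spin_lock_irq(&mapping->tree_lock);
                __remove_from_page_cache(page);
                spin_unlock_irq(&mapping->tree_lock);

                /* ...but the page cache's reference is dropped here,
                 * not by the caller */
                page_cache_release(page);
        }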

    Signed-off-by: Minchan Kim
    Cc: Christoph Hellwig
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Cc: Edward Shishkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • This function basically does:

    remove_from_page_cache(old);
    page_cache_release(old);
    add_to_page_cache_locked(new);

    Except it does this atomically, so there's no possibility for the "add" to
    fail because of a race.

    If memory cgroups are enabled, then the memory cgroup charge is also moved
    from the old page to the new.

    This function is currently used by fuse to move pages into the page cache
    on read, instead of copying the page contents.
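
    A sketch of the caller side (the wrapper function is hypothetical):

        /* Atomically substitute 'newpage' for 'oldpage' at the same
         * offset in the same mapping; the old page is released. */
        static int steal_page_into_cache(struct page *oldpage,
                                         struct page *newpage)
        {
                int err = replace_page_cache_page(oldpage, newpage,
                                                  GFP_KERNEL);
                if (err)
                        return err;     /* e.g. -ENOMEM; fall back to
                                         * copying the contents */
                return 0;
        }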

    [minchan.kim@gmail.com: add freepage() hook to replace_page_cache_page()]
    Signed-off-by: Miklos Szeredi
    Acked-by: Rik van Riel
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
    A GUP user may want to try to acquire a reference to a page if it is
    already in memory, but not if I/O is needed to bring it in. For
    example, KVM may tell a vcpu to schedule another guest process if the
    current one is trying to access a swapped-out page. Meanwhile, the
    page will be swapped in and the guest process that depends on it will
    be able to run again.

    This patch adds FAULT_FLAG_RETRY_NOWAIT (suggested by Linus) and
    FOLL_NOWAIT follow_page flags. FAULT_FLAG_RETRY_NOWAIT, when used in
    conjunction with VM_FAULT_ALLOW_RETRY, indicates to handle_mm_fault that
    it shouldn't drop mmap_sem and wait on a page, but return VM_FAULT_RETRY
    instead.
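
    Conceptually, the fault-side behavior is (a simplified sketch, not
    the literal code):

        /* called when the page needs I/O to be brought in */
        static int fault_needs_io(unsigned int flags)
        {
                if ((flags & FAULT_FLAG_ALLOW_RETRY) &&
                    (flags & FAULT_FLAG_RETRY_NOWAIT))
                        /* keep mmap_sem, don't sleep on the page; the
                         * caller (e.g. KVM) can run something else */
                        return VM_FAULT_RETRY;

                /* otherwise: drop mmap_sem and wait as before */
                return 0;
        }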

    [akpm@linux-foundation.org: improve FOLL_NOWAIT comment]
    Signed-off-by: Gleb Natapov
    Cc: Linus Torvalds
    Cc: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: Michel Lespinasse
    Cc: Avi Kivity
    Cc: Marcelo Tosatti
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gleb Natapov
     
  • The oom killer is extremely verbose for machines with a large number of
    cpus and/or nodes. This verbosity can often be harmful if it causes other
    important messages to be scrolled from the kernel log and incurs a
    significant time delay, specifically for kernels with CONFIG_NODES_SHIFT >
    8.

    This patch causes only memory information to be displayed for nodes that
    are allowed by current's cpuset when dumping the VM state. Information
    for all other nodes is irrelevant to the oom condition; we don't care if
    there's an abundance of memory elsewhere if we can't access it.

    This only affects the behavior of dumping memory information when an oom
    is triggered. Other dumps, such as for sysrq+m, still display the
    unfiltered form when using the existing show_mem() interface.

    Additionally, the per-cpu pageset statistics are extremely verbose in
    oom killer output, so they are now suppressed. This removes

    nodes_weight(current->mems_allowed) * (1 + nr_cpus)

    lines from the oom killer output.

    Callers may use __show_mem(SHOW_MEM_FILTER_NODES) to filter disallowed
    nodes.

    Signed-off-by: David Rientjes
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
    Since all kthreads are created by a single helper task, they all use
    memory from a single node for their kernel stack and task struct.

    This patch suite creates kthread_create_on_node(), adding a 'node'
    parameter to the parameters already taken by kthread_create().

    This parameter is used to allocate the memory for the new kthread on
    its memory node if possible.
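
    Usage mirrors kthread_create() with the extra node argument; a
    sketch for a per-cpu worker (worker_fn and the name are
    hypothetical):

        static struct task_struct *start_worker(int cpu, void *data)
        {
                struct task_struct *t;

                /* stack and task_struct come from cpu's memory node */
                t = kthread_create_on_node(worker_fn, data,
                                           cpu_to_node(cpu),
                                           "worker/%d", cpu);
                if (!IS_ERR(t)) {
                        kthread_bind(t, cpu);
                        wake_up_process(t);
                }
                return t;
        }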

    Signed-off-by: Eric Dumazet
    Acked-by: David S. Miller
    Reviewed-by: Andi Kleen
    Acked-by: Rusty Russell
    Cc: Tejun Heo
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: David Howells
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     
  • This patch reverts 5a03b051 ("thp: use compaction in kswapd for GFP_ATOMIC
    order > 0") due to reports stating that kswapd CPU usage was higher and
    IRQs were being disabled more frequently. This was reported at
    http://www.spinics.net/linux/fedora/alsa-user/msg09885.html.

    Without this patch applied, CPU usage by kswapd hovers around the 20% mark
    according to the tester (Arthur Marsh:
    http://www.spinics.net/linux/fedora/alsa-user/msg09899.html). With this
    patch applied, it's around 2%.

    The problem is not related to THP which specifies __GFP_NO_KSWAPD but is
    triggered by high-order allocations hitting the low watermark for their
    order and waking kswapd on kernels with CONFIG_COMPACTION set. The most
    common trigger for this is network cards configured for jumbo frames but
    it's also possible it'll be triggered by fork-heavy workloads (order-1)
    and some wireless cards which depend on order-1 allocations.

    The symptoms for the user will be high CPU usage by kswapd in low-memory
    situations which could be confused with another writeback problem. While
    a patch like 5a03b051 may be reintroduced in the future, this patch plays
    it safe for now and reverts it.

    [mel@csn.ul.ie: Beefed up the changelog]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Mel Gorman
    Reported-by: Arthur Marsh
    Tested-by: Arthur Marsh
    Cc: [2.6.38.1]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • In systems with multiple framebuffer devices, one of the devices might be
    blanked while another is unblanked. In order for the backlight blanking
    logic to know whether to turn off the backlight for a particular
    framebuffer's blanking notification, it needs to be able to check if a
    given framebuffer device corresponds to the backlight.

    This plumbs the check_fb hook from core backlight through the
    pwm_backlight helper to allow platform code to plug in a check_fb hook.
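
    A sketch of what board code might plug in (my_fb_info is whatever
    framebuffer the board actually drives; other values are
    illustrative):

        static int my_check_fb(struct device *dev, struct fb_info *info)
        {
                /* nonzero: this framebuffer's blanking controls the
                 * backlight; other fbs' blanking is ignored */
                return info == my_fb_info;
        }

        static struct platform_pwm_backlight_data my_bl_data = {
                .max_brightness = 255,
                .dft_brightness = 200,
                .pwm_period_ns  = 78770,
                .check_fb       = my_check_fb,
        };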

    Signed-off-by: Robert Morell
    Cc: Richard Purdie
    Cc: Arun Murthy
    Cc: Linus Walleij
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert Morell
     
  • There may be multiple ways of controlling the backlight on a given
    machine. Allow drivers to expose the type of interface they are
    providing, making it possible for userspace to make appropriate policy
    decisions.
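
    A sketch of a driver declaring its type at registration (names other
    than the backlight API are illustrative):

        static struct backlight_device *
        register_my_backlight(struct device *dev, void *priv)
        {
                struct backlight_properties props;

                memset(&props, 0, sizeof(props));
                /* e.g. BACKLIGHT_RAW for direct register control, vs
                 * BACKLIGHT_PLATFORM or BACKLIGHT_FIRMWARE interfaces */
                props.type = BACKLIGHT_RAW;
                props.max_brightness = 255;

                return backlight_device_register("my_bl", dev, priv,
                                                 &my_bl_ops, &props);
        }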

    Signed-off-by: Matthew Garrett
    Cc: Richard Purdie
    Cc: Chris Wilson
    Cc: David Airlie
    Cc: Alex Deucher
    Cc: Ben Skeggs
    Cc: Zhang Rui
    Cc: Len Brown
    Cc: Jesse Barnes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Garrett
     
  • And fix a typo.

    Signed-off-by: Uwe Kleine-König
    Cc: Lars-Peter Clausen
    Cc: Richard Purdie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uwe Kleine-König
     
    Simple backlight driver for the National Semiconductor LM3530.
    Presently only manual mode is supported; PWM and ALS support are to
    be added.

    Signed-off-by: Shreshtha Kumar Sahu
    Cc: Linus Walleij
    Cc: Richard Purdie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shreshtha Kumar Sahu
     
  • syncfs() is duplicating name_to_handle_at() due to a merging mistake.

    Cc: Sage Weil
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • * 'slab/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
    slub: Add statistics for this_cmpxchg_double failures
    slub: Add missing irq restore for the OOM path

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
    rbd: use watch/notify for changes in rbd header
    libceph: add lingering request and watch/notify event framework
    rbd: update email address in Documentation
    ceph: rename dentry_release -> d_release, fix comment
    ceph: add request to the tail of unsafe write list
    ceph: remove request from unsafe list if it is canceled/timed out
    ceph: move readahead default to fs/ceph from libceph
    ceph: add ino32 mount option
    ceph: update common header files
    ceph: remove debugfs debug cruft
    libceph: fix osd request queuing on osdmap updates
    ceph: preserve I_COMPLETE across rename
    libceph: Fix base64-decoding when input ends in newline.

    Linus Torvalds
     
  • Using delayed-work for tty flip buffers ends up causing us to wait for
    the next tick to complete some actions. That's usually not all that
    noticeable, but for certain latency-critical workloads it ends up being
    totally unacceptable.

    As an extreme case of this, passing a token back-and-forth over a pty
    will take two ticks per iteration, so even just a thousand iterations
    will take 8 seconds assuming a common 250Hz configuration.

    Avoiding the whole delayed work issue brings that ping-pong test-case
    down to 0.009s on my machine.

    In more practical terms, this latency has been a performance problem for
    things like dive computer simulators (simulating the serial interface
    using the ptys) and for other environments (Alan mentions a CP/M emulator).
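
    The change is essentially this (a sketch of the scheduling call; the
    flip-buffer work item becomes a plain work_struct):

        /* Before: wait for the next tick even when there is no reason. */
        schedule_delayed_work(&tty->buf.work, 1);

        /* After: run the flip-buffer work as soon as possible. */
        schedule_work(&tty->buf.work);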

    Reported-by: Jef Driesen
    Acked-by: Greg KH
    Acked-by: Alan Cox
    Signed-off-by: Linus Torvalds

    Linus Torvalds