01 Aug, 2012

5 commits

  • The conditional mem_cgroup_cancel_charge_swapin() is a leftover from when
    the function would continue to reestablish the page even after
    mem_cgroup_try_charge_swapin() failed. After 85d9fc8 "memcg: fix refcnt
    handling at swapoff", the condition is always true when this code is
    reached.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit b3a27d ("swap: Add swap slot free callback to
    block_device_operations") dereferences p->bdev->bd_disk but this is a NULL
    dereference if using swap-over-NFS. This patch checks SWP_BLKDEV on the
    swap_info_struct before dereferencing.

    With reference to this callback, Christoph Hellwig stated "Please just
    remove the callback entirely. It has no user outside the staging tree and
    was added clearly against the rules for that staging tree". This would
    also be my preference but there was not an obvious way of keeping zram in
    staging/ happy.
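
    A minimal sketch of the added guard, assuming the mainline call site
    in swap_entry_free() of that era (names per my reading, not quoted
    from the patch):

    /* Only block-backed swap areas have a bd_disk worth notifying. */
    if (p->flags & SWP_BLKDEV) {
            struct gendisk *disk = p->bdev->bd_disk;

            if (disk->fops->swap_slot_free_notify)
                    disk->fops->swap_slot_free_notify(p->bdev, offset);
    }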

    Signed-off-by: Xiaotian Feng
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Christoph Hellwig
    Cc: David S. Miller
    Cc: Eric B Munson
    Cc: Eric Paris
    Cc: James Morris
    Cc: Mel Gorman
    Cc: Mike Christie
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The version of swap_activate introduced is sufficient for swap-over-NFS
    but would not provide enough information to implement a generic handler.
    This patch shuffles things slightly to ensure the same information is
    available for aops->swap_activate() as is available to the core.

    No functionality change.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Christoph Hellwig
    Cc: David S. Miller
    Cc: Eric B Munson
    Cc: Eric Paris
    Cc: James Morris
    Cc: Mel Gorman
    Cc: Mike Christie
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: Trond Myklebust
    Cc: Xiaotian Feng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently swapfiles are managed entirely by the core VM by using ->bmap to
    allocate space and write to the blocks directly. This effectively ensures
    that the underlying blocks are allocated and avoids the need for the swap
    subsystem to locate what physical blocks store offsets within a file.

    If the swap subsystem is to use the filesystem information to locate
    the blocks, it is critical that information such as block groups,
    block bitmaps and the block descriptor table that map the swap file
    remain resident in memory. This patch adds address_space_operations
    that the VM can call when activating or deactivating swap backed by a
    file.

    int swap_activate(struct file *);
    int swap_deactivate(struct file *);

    The ->swap_activate() method is used to communicate to the file that the
    VM relies on it, and the address_space should take adequate measures such
    as reserving space in the underlying device, reserving memory for mempools
    and pinning information such as the block descriptor table in memory. The
    ->swap_deactivate() method is called on sys_swapoff() if ->swap_activate()
    returned success.

    After a successful swapfile ->swap_activate, the swapfile is marked
    SWP_FILE and swapper_space.a_ops will proxy to
    sis->swap_file->f_mapping->a_ops, using ->direct_IO to write swapcache
    pages and ->readpage to read them.

    It is perfectly possible that ->direct_IO could be used to read the
    swap pages as well, but it is an unnecessary complication. Similarly,
    ->writepage could be used instead of ->direct_IO to write the pages,
    but filesystem developers have stated that calling writepage from the
    VM is undesirable for a variety of reasons, and using ->direct_IO
    opens up the possibility of writing back batches of swap pages in the
    future.
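
    As a hedged illustration (not taken from the patch), a filesystem
    might wire the new methods into its address_space_operations like
    this; the example_* names are hypothetical, and the readpage and
    direct_IO helpers are assumed to be defined elsewhere:

    static int example_swap_activate(struct file *file)
    {
            /* Pin block-mapping metadata, reserve device space and
             * mempools here, as described above. */
            return 0;
    }

    static int example_swap_deactivate(struct file *file)
    {
            /* Release whatever ->swap_activate() reserved. */
            return 0;
    }

    static const struct address_space_operations example_aops = {
            .readpage        = example_readpage,   /* reads swap pages */
            .direct_IO       = example_direct_IO,  /* writes swapcache pages */
            .swap_activate   = example_swap_activate,
            .swap_deactivate = example_swap_deactivate,
    };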

    [a.p.zijlstra@chello.nl: Original patch]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Christoph Hellwig
    Cc: David S. Miller
    Cc: Eric B Munson
    Cc: Eric Paris
    Cc: James Morris
    Cc: Mel Gorman
    Cc: Mike Christie
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: Trond Myklebust
    Cc: Xiaotian Feng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • In order to teach filesystems to handle swap cache pages, three new page
    functions are introduced:

    pgoff_t page_file_index(struct page *);
    loff_t page_file_offset(struct page *);
    struct address_space *page_file_mapping(struct page *);

    page_file_index() - gives the offset of this page in the file in
    PAGE_CACHE_SIZE blocks. Like page->index for mapped pages, but this
    function also gives the correct index for PG_swapcache pages.

    page_file_offset() - uses page_file_index(), so that it will give the
    expected result, even for PG_swapcache pages.

    page_file_mapping() - gives the mapping backing the actual page; that is
    for swap cache pages it will give swap_file->f_mapping.
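
    Roughly how the three helpers fit together (a sketch assuming
    out-of-line __page_file_mapping()/__page_file_index() helpers that
    consult the swap_info_struct for PG_swapcache pages; details may
    differ from the patch):

    static inline struct address_space *page_file_mapping(struct page *page)
    {
            if (unlikely(PageSwapCache(page)))
                    return __page_file_mapping(page); /* swap_file->f_mapping */
            return page->mapping;
    }

    static inline pgoff_t page_file_index(struct page *page)
    {
            if (unlikely(PageSwapCache(page)))
                    return __page_file_index(page);   /* offset in the swapfile */
            return page->index;
    }

    static inline loff_t page_file_offset(struct page *page)
    {
            return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT;
    }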

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Christoph Hellwig
    Cc: David S. Miller
    Cc: Eric B Munson
    Cc: Eric Paris
    Cc: James Morris
    Cc: Mel Gorman
    Cc: Mike Christie
    Cc: Neil Brown
    Cc: Sebastian Andrzej Siewior
    Cc: Trond Myklebust
    Cc: Xiaotian Feng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

16 Jun, 2012

1 commit

  • Minchan Kim reports that when a system has many swap areas, and tmpfs
    swaps out to the ninth or more, shmem_getpage_gfp()'s attempts to read
    back the page cannot locate it, and the read fails with -ENOMEM.

    Whoops. Yes, I blindly followed read_swap_header()'s
    pte_to_swp_entry(swp_entry_to_pte()) technique for determining the
    maximum usable swap offset, without stopping to realize that that
    actually depends upon the
    pte swap encoding shifting swap offset to the higher bits and truncating
    it there. Whereas our radix_tree swap encoding leaves offset in the
    lower bits: it's swap "type" (that is, index of swap area) that was
    truncated.

    Fix it by reducing the SWP_TYPE_SHIFT() in swapops.h, and removing the
    broken radix_to_swp_entry(swp_to_radix_entry()) from read_swap_header().

    This does not reduce the usable size of a swap area any further, it
    leaves it as claimed when making the original commit: no change from 3.0
    on x86_64, nor on i386 without PAE; but 3.0's 512GB is reduced to 128GB
    per swapfile on i386 with PAE. It's not a change I would have risked
    five years ago, but with x86_64 supported for ten years, I believe it's
    appropriate now.

    Hmm, and what if some architecture implements its swap pte with offset
    encoded below type? That would equally break the maximum usable swap
    offset check. Happily, they all follow the same tradition of encoding
    offset above type, but I'll prepare a check on that for next.
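
    For reference, the round-trip technique in question encodes the
    largest conceivable offset into a pte and decodes it back, letting
    the architecture's truncation reveal the limit; this only measures
    the offset bits when they are the ones that get truncated:

    /* In read_swap_header(): the architecture's maximum usable offset. */
    maxpages = swp_offset(pte_to_swp_entry(
                    swp_entry_to_pte(swp_entry(0, ~0UL)))) + 1;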

    Reported-and-Reviewed-and-Tested-by: Minchan Kim
    Signed-off-by: Hugh Dickins
    Cc: stable@vger.kernel.org [3.1, 3.2, 3.3, 3.4]
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

05 Jun, 2012

1 commit

  • Pull frontswap feature from Konrad Rzeszutek Wilk:
    "Frontswap provides a "transcendent memory" interface for swap pages.
    In some environments, dramatic performance savings may be obtained
    because swapped pages are saved in RAM (or a RAM-like device) instead
    of a swap disk. This tag provides the basic infrastructure along with
    some changes to the existing backends."

    Fix up trivial conflict in mm/Makefile due to removal of swap token code
    changing a line next to the new frontswap entry.

    This pull request came in before the merge window even opened, it got
    delayed to after the merge window by me just wanting to make sure it had
    actual users. Apparently IBM is using this on their embedded side, and
    Jan Beulich says that it's already made available for SLES and OpenSUSE
    users.

    Also acked by Rik van Riel, and Konrad points to other people liking it
    too. So in it goes.

    By Dan Magenheimer (4) and Konrad Rzeszutek Wilk (2)
    via Konrad Rzeszutek Wilk
    * tag 'stable/frontswap.v16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm:
    frontswap: s/put_page/store/g s/get_page/load
    MAINTAINER: Add myself for the frontswap API
    mm: frontswap: config and doc files
    mm: frontswap: core frontswap functionality
    mm: frontswap: core swap subsystem hooks and headers
    mm: frontswap: add frontswap header file

    Linus Torvalds
     

30 May, 2012

2 commits

  • This patch changes memcg's behavior at task_move().

    At task_move(), the kernel scans a task's page table and moves the
    charges for mapped pages from the source cgroup to the target cgroup.
    There has long been a bug in the handling of shared anonymous pages.

    Before patch:
    - The spec says 'shared anonymous pages are not moved.'
    - The implementation was 'shared anonymous pages may be moved':
      if page_mapcount <= 2, the page could be moved.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Naoya Horiguchi
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • The GMA500 GPU driver uses GEM shmem objects, but with a new twist: the
    backing RAM has to be below 4GB. Not a problem while the boards
    supported only 4GB: but now Intel's D2700MUD boards support 8GB, and
    their GMA3600 is managed by the GMA500 driver.

    shmem/tmpfs has never pretended to support hardware restrictions on the
    backing memory, but it might have appeared to do so before v3.1, and
    even now it works fine until a page is swapped out then back in. When
    read_cache_page_gfp() supplied a freshly allocated page for copy, that
    compensated for whatever choice might have been made by earlier swapin
    readahead; but swapoff was likely to destroy the illusion.

    We'd like to continue to support GMA500, so now add a new
    shmem_should_replace_page() check on the zone when about to move a page
    from swapcache to filecache (in swapin and swapoff cases), with
    shmem_replace_page() to allocate and substitute a suitable page (given
    gma500/gem.c's mapping_set_gfp_mask GFP_KERNEL | __GFP_DMA32).

    This does involve a minor extension to mem_cgroup_replace_page_cache()
    (the page may or may not have already been charged); and I've removed a
    comment and call to mem_cgroup_uncharge_cache_page(), which in fact is
    always a no-op while PageSwapCache.

    Also removed optimization of an unlikely path in shmem_getpage_gfp(),
    now that we need to check PageSwapCache more carefully (a racing caller
    might already have made the copy). And at one point shmem_unuse_inode()
    needs to use the hitherto private page_swapcount(), to guard against
    racing with inode eviction.

    It would make sense to extend shmem_should_replace_page(), to cover
    cpuset and NUMA mempolicy restrictions too, but set that aside for now:
    needs a cleanup of shmem mempolicy handling, and more testing, and ought
    to handle swap faults in do_swap_page() as well as shmem.
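
    The zone check itself can be tiny; a sketch, assuming the helper
    simply compares the page's zone against what the mapping's gfp mask
    allows:

    static bool shmem_should_replace_page(struct page *page, gfp_t gfp)
    {
            /* Replace if the page sits in a zone the gfp mask forbids,
             * e.g. a >4GB page for a GFP_KERNEL | __GFP_DMA32 mapping. */
            return page_zonenum(page) > gfp_zone(gfp);
    }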

    Signed-off-by: Hugh Dickins
    Cc: Christoph Hellwig
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Alan Cox
    Cc: Stephane Marchesin
    Cc: Andi Kleen
    Cc: Dave Airlie
    Cc: Daniel Vetter
    Cc: Rob Clark
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

15 May, 2012

1 commit

    This patch, 2 of 4, contains the changes to the core swap subsystem.
    This includes:

    (1) makes available the core swap data structures (swap_lock,
    swap_list and swap_info) that are needed by frontswap.c. We don't
    need to expose them to the dozens of files that include swap.h, so we
    create a new swapfile.h just to extern-ify these and modify their
    declarations to non-static.

    (2) adds frontswap-related elements to swap_info_struct:
    frontswap_map points to vzalloc'ed one-bit-per-swap-page metadata
    that indicates whether the swap page is in frontswap or on the
    device, and frontswap_pages counts how many pages are in frontswap.

    (3) adds hooks in the swap subsystem and extends try_to_unuse so that
    frontswap_shrink can do a "partial swapoff".

    Note that a failed frontswap_map allocation is safe... failure is noted
    by lack of "FS" in the subsequent printk.
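
    A sketch of the swap_info_struct additions described in (2); the
    guarding and comments follow the changelog and the version notes
    below, not necessarily the exact upstream layout:

    /* In struct swap_info_struct: */
    #ifdef CONFIG_FRONTSWAP
            unsigned long *frontswap_map;  /* vzalloc'ed, one bit per swap page */
            atomic_t frontswap_pages;      /* informational count; atomic_t
                                              to avoid races (see v8 notes) */
    #endif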

    ---

    [v14: rebase to 3.4-rc2]
    [v10: no change]
    [v9: akpm@linux-foundation.org: mark some statics __read_mostly]
    [v9: akpm@linux-foundation.org: add clarifying comments]
    [v9: akpm@linux-foundation.org: no need to loop repeating try_to_unuse]
    [v9: error27@gmail.com: remove superfluous check for NULL]
    [v8: rebase to 3.0-rc4]
    [v8: kamezawa.hiroyu@jp.fujitsu.com: change counter to atomic_t to avoid races]
    [v8: kamezawa.hiroyu@jp.fujitsu.com: comment to clarify informational counters]
    [v7: rebase to 3.0-rc3]
    [v7: JBeulich@novell.com: add new swap struct elements only if config'd]
    [v6: rebase to 3.0-rc1]
    [v6: lliubbo@gmail.com: fix null pointer deref if vzalloc fails]
    [v6: konrad.wilk@oracl.com: various checks and code clarifications/comments]
    [v5: no change from v4]
    [v4: rebase to 2.6.39]
    Signed-off-by: Dan Magenheimer
    Reviewed-by: Kamezawa Hiroyuki
    Acked-by: Jan Beulich
    Acked-by: Seth Jennings
    Cc: Jeremy Fitzhardinge
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Nitin Gupta
    Cc: Matthew Wilcox
    Cc: Chris Mason
    Cc: Rik Riel
    Cc: Andrew Morton
    [v11: Rebased, fixed mm/swapfile.c context change]
    Signed-off-by: Konrad Rzeszutek Wilk

    Dan Magenheimer
     

29 Mar, 2012

1 commit

  • Most system calls taking flags first check that the flags passed in are
    valid, and that helps userspace to detect when new flags are supported.

    But swapon never did so: start checking now, to help if we ever want to
    support more swap_flags in future.

    It's difficult to get stray bits set in an int, and swapon is not widely
    used, so this is most unlikely to break any userspace; but we can just
    revert if it turns out to do so.
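
    A minimal sketch of the check, assuming a SWAP_FLAGS_VALID mask built
    from the flags already defined in swap.h (the macro name follows the
    patch as far as I recall):

    #define SWAP_FLAGS_VALID        (SWAP_FLAG_DISCARD | SWAP_FLAG_PREFER | \
                                     SWAP_FLAG_PRIO_MASK)

    /* Early in sys_swapon(): reject any flag bits we don't know about. */
    if (swap_flags & ~SWAP_FLAGS_VALID)
            return -EINVAL;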

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

23 Mar, 2012

1 commit

  • Merge first batch of patches from Andrew Morton:
    "A few misc things and all the MM queue"

    * emailed from Andrew Morton : (92 commits)
    memcg: avoid THP split in task migration
    thp: add HPAGE_PMD_* definitions for !CONFIG_TRANSPARENT_HUGEPAGE
    memcg: clean up existing move charge code
    mm/memcontrol.c: remove unnecessary 'break' in mem_cgroup_read()
    mm/memcontrol.c: remove redundant BUG_ON() in mem_cgroup_usage_unregister_event()
    mm/memcontrol.c: s/stealed/stolen/
    memcg: fix performance of mem_cgroup_begin_update_page_stat()
    memcg: remove PCG_FILE_MAPPED
    memcg: use new logic for page stat accounting
    memcg: remove PCG_MOVE_LOCK flag from page_cgroup
    memcg: simplify move_account() check
    memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat)
    memcg: kill dead prev_priority stubs
    memcg: remove PCG_CACHE page_cgroup flag
    memcg: let css_get_next() rely upon rcu_read_lock()
    cgroup: revert ss_id_lock to spinlock
    idr: make idr_get_next() good for rcu_read_lock()
    memcg: remove unnecessary thp check in page stat accounting
    memcg: remove redundant returns
    memcg: enum lru_list lru
    ...

    Linus Torvalds
     

22 Mar, 2012

4 commits

    When swapon() is not passed the SWAP_FLAG_DISCARD option, sys_swapon()
    will still perform a discard operation. This can cause problems if
    discard is slow or buggy.

    Reverse the order of the check so that a discard operation is performed
    only if the sys_swapon() caller is attempting to enable discard.
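
    A sketch of the reordered test; the surrounding sys_swapon() code is
    paraphrased rather than quoted:

    /* Probe for discard support only when the caller asked for it. */
    if ((swap_flags & SWAP_FLAG_DISCARD) &&
        p->bdev && blk_queue_discard(bdev_get_queue(p->bdev))) {
            /* mark the swap device discardable */
            p->flags |= SWP_DISCARDABLE;
    }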

    Signed-off-by: Shaohua Li
    Reported-by: Holger Kiehl
    Tested-by: Holger Kiehl
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Ever since abandoning the virtual scan of processes, for scalability
    reasons, swap space has been a little more fragmented than before. This
    can lead to the situation where a large memory user is killed, swap space
    ends up full of "holes" and swapin readahead is totally ineffective.

    On my home system, after killing a leaky firefox it took over an hour to
    page just under 2GB of memory back in, slowing the virtual machines down
    to a crawl.

    This patch makes swapin readahead simply skip over holes, instead of
    stopping at them. This allows the system to swap things back in at rates
    of several MB/second, instead of a few hundred kB/second.

    The checks done in valid_swaphandles are already done in
    read_swap_cache_async as well, allowing us to remove a fair amount of
    code.
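
    A condensed sketch of the new behaviour in swapin_readahead(): read a
    page_cluster-aligned window around the faulting entry and simply
    continue past unallocated slots instead of breaking out (fragment
    assumes the function's entry/gfp_mask/vma/addr parameters):

    unsigned long mask = (1UL << page_cluster) - 1;
    unsigned long offset = swp_offset(entry);
    unsigned long start = offset & ~mask, end = offset | mask;
    struct page *page;

    if (!start)
            start++;        /* first page is the swap header */
    for (offset = start; offset <= end; offset++) {
            page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
                                         gfp_mask, vma, addr);
            if (!page)
                    continue;       /* hole: skip it, don't stop readahead */
            page_cache_release(page);
    }
    lru_add_drain();        /* push any new pages onto the LRU now */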

    [akpm@linux-foundation.org: fix it for page_cluster >= 32]
    Signed-off-by: Rik van Riel
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Acked-by: Johannes Weiner
    Acked-by: Mel Gorman
    Cc: Adrian Drzewiecki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • In some cases it may happen that pmd_none_or_clear_bad() is called with
    the mmap_sem hold in read mode. In those cases the huge page faults can
    allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a
    false positive from pmd_bad() that will not like to see a pmd
    materializing as trans huge.

    It's not khugepaged causing the problem: khugepaged holds the mmap_sem
    in write mode (and all those sites must hold the mmap_sem in read mode
    to prevent pagetables from going away from under them; during code
    review it seems vm86 mode on 32bit kernels requires that too, unless
    it's restricted to 1 thread per process or UP builds). The race is
    only with the huge pagefaults, which can convert a pmd_none() into a
    pmd_trans_huge().

    Effectively all these pmd_none_or_clear_bad() sites running with
    mmap_sem in read mode are somewhat speculative with the page faults, and
    the result is always undefined when they run simultaneously. This is
    probably why it wasn't common to run into this. For example if the
    madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
    fault, the hugepage will not be zapped, if the page fault runs first it
    will be zapped.

    Altering pmd_bad() not to error out if it finds hugepmds won't be
    enough to fix this, because zap_pmd_range would then proceed to call
    zap_pte_range (which would be incorrect if the pmd became a
    pmd_trans_huge()).

    The simplest way to fix this is to read the pmd in the local stack
    (regardless of what we read, no need of actual CPU barriers, only
    compiler barrier needed), and be sure it is not changing under the code
    that computes its value. Even if the real pmd is changing under the
    value we hold on the stack, we don't care. If we actually end up in
    zap_pte_range it means the pmd was not none already and it was not huge,
    and it can't become huge from under us (khugepaged locking explained
    above).

    All we need is to enforce that there is no way anymore that in a code
    path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad
    can run into a hugepmd. The overhead of a barrier() is just a compiler
    tweak and should not be measurable (I only added it for THP builds). I
    don't exclude different compiler versions may have prevented the race
    too by caching the value of *pmd on the stack (that hasn't been
    verified, but it wouldn't be impossible considering
    pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines
    and there's no external function called in between pmd_trans_huge and
    pmd_none_or_clear_bad).

    if (pmd_trans_huge(*pmd)) {
            if (next - addr != HPAGE_PMD_SIZE) {
                    VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
                    split_huge_page_pmd(vma->vm_mm, pmd);
            } else if (zap_huge_pmd(tlb, vma, pmd, addr))
                    continue;
            /* fall through */
    }
    if (pmd_none_or_clear_bad(pmd))
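
    A sketch of the resulting helper; the name
    pmd_none_or_trans_huge_or_clear_bad() and the details follow my
    reading of the fix, so treat this as illustrative:

    static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
    {
            pmd_t pmdval = *pmd;    /* read once into a local */

    #ifdef CONFIG_TRANSPARENT_HUGEPAGE
            barrier();      /* keep the compiler from re-reading *pmd below */
    #endif
            if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
                    return 1;
            if (unlikely(pmd_bad(pmdval))) {
                    pmd_clear_bad(pmd);
                    return 1;
            }
            return 0;
    }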

    Because this race condition could be exercised without special
    privileges this was reported in CVE-2012-1179.

    The race was identified and fully explained by Ulrich who debugged it.
    I'm quoting his accurate explanation below, for reference.

    ====== start quote =======
    mapcount 0 page_mapcount 1
    kernel BUG at mm/huge_memory.c:1384!

    At some point prior to the panic, a "bad pmd ..." message similar to the
    following is logged on the console:

    mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).

    The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
    the page's PMD table entry.

        143 void pmd_clear_bad(pmd_t *pmd)
        144 {
     -> 145         pmd_ERROR(*pmd);
        146         pmd_clear(pmd);
        147 }

    After the PMD table entry has been cleared, there is an inconsistency
    between the actual number of PMD table entries that are mapping the page
    and the page's map count (_mapcount field in struct page). When the page
    is subsequently reclaimed, __split_huge_page() detects this inconsistency.

        1381         if (mapcount != page_mapcount(page))
        1382                 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
        1383                        mapcount, page_mapcount(page));
     -> 1384         BUG_ON(mapcount != page_mapcount(page));

    The root cause of the problem is a race of two threads in a multithreaded
    process. Thread B incurs a page fault on a virtual address that has never
    been accessed (PMD entry is zero) while Thread A is executing an madvise()
    system call on a virtual address within the same 2 MB (huge page) range.

                 virtual address space
                .---------------------.
                |                     |
                |                     |
              .-|---------------------|
              | |                     |
              | |                     |<-- B(fault)
        2 MB  | |/////////////////////|-.
        huge <  |/////////////////////|  > A(range)
        page  | |/////////////////////|-'
              | |                     |
              | |                     |
              '-|---------------------|
                |                     |
                |                     |
                '---------------------'

    - Thread A is executing an madvise(..., MADV_DONTNEED) system call
    on the virtual address range "A(range)" shown in the picture.

    sys_madvise
      // Acquire the semaphore in shared mode.
      down_read(&current->mm->mmap_sem)
      ...
      madvise_vma
        switch (behavior)
        case MADV_DONTNEED:
          madvise_dontneed
            zap_page_range
              unmap_vmas
                unmap_page_range
                  zap_pud_range
                    zap_pmd_range
                      //
                      // Assume that this huge page has never been accessed.
                      // I.e. content of the PMD entry is zero (not mapped).
                      //
                      if (pmd_trans_huge(*pmd)) {
                          // We don't get here due to the above assumption.
                      }
                      //
                      // Assume that Thread B incurred a page fault and
        .-----------> // sneaks in here as shown below.
        |             //
        |             if (pmd_none_or_clear_bad(pmd))
        |             {
        |                 if (unlikely(pmd_bad(*pmd)))
        |                     pmd_clear_bad
        |                     {
        |                         pmd_ERROR
        |                         // Log "bad pmd ..." message here.
        |                         pmd_clear
        |                         // Clear the page's PMD entry.
        |                         // Thread B incremented the map count
        |                         // in page_add_new_anon_rmap(), but
        |                         // now the page is no longer mapped
        |                         // by a PMD entry (-> inconsistency).
        |                     }
        |             }
        |
        v
    - Thread B is handling a page fault on virtual address "B(fault)" shown
    in the picture.

    ...
    do_page_fault
      __do_page_fault
        // Acquire the semaphore in shared mode.
        down_read_trylock(&mm->mmap_sem)
        ...
        handle_mm_fault
          if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
            // We get here due to the above assumption (PMD entry is zero).
            do_huge_pmd_anonymous_page
              alloc_hugepage_vma
                // Allocate a new transparent huge page here.
              ...
              __do_huge_pmd_anonymous_page
                ...
                spin_lock(&mm->page_table_lock)
                ...
                page_add_new_anon_rmap
                  // Here we increment the page's map count (starts at -1).
                  atomic_set(&page->_mapcount, 0)
                set_pmd_at
                  // Here we set the page's PMD entry which will be cleared
                  // when Thread A calls pmd_clear_bad().
                ...
                spin_unlock(&mm->page_table_lock)

    The mmap_sem does not prevent the race because both threads are acquiring
    it in shared mode (down_read). Thread B holds the page_table_lock while
    the page's map count and PMD table entry are updated. However, Thread A
    does not synchronize on that lock.

    ====== end quote =======

    [akpm@linux-foundation.org: checkpatch fixes]
    Reported-by: Ulrich Obergfell
    Signed-off-by: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Dave Jones
    Acked-by: Larry Woodman
    Acked-by: Rik van Riel
    Cc: [2.6.38+]
    Cc: Mark Salter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Pull security subsystem updates for 3.4 from James Morris:
    "The main addition here is the new Yama security module from Kees Cook,
    which was discussed at the Linux Security Summit last year. Its
    purpose is to collect miscellaneous DAC security enhancements in one
    place. This also marks a departure in policy for LSM modules, which
    were previously limited to being standalone access control systems.
    Chromium OS is using Yama, and I believe there are plans for Ubuntu,
    at least.

    This patchset also includes maintenance updates for AppArmor, TOMOYO
    and others."

    Fix up trivial conflict due to the jump_label->static_key rename.

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (38 commits)
    AppArmor: Fix location of const qualifier on generated string tables
    TOMOYO: Return error if fails to delete a domain
    AppArmor: add const qualifiers to string arrays
    AppArmor: Add ability to load extended policy
    TOMOYO: Return appropriate value to poll().
    AppArmor: Move path failure information into aa_get_name and rename
    AppArmor: Update dfa matching routines.
    AppArmor: Minor cleanup of d_namespace_path to consolidate error handling
    AppArmor: Retrieve the dentry_path for error reporting when path lookup fails
    AppArmor: Add const qualifiers to generated string tables
    AppArmor: Fix oops in policy unpack auditing
    AppArmor: Fix error returned when a path lookup is disconnected
    KEYS: testing wrong bit for KEY_FLAG_REVOKED
    TOMOYO: Fix mount flags checking order.
    security: fix ima kconfig warning
    AppArmor: Fix the error case for chroot relative path name lookup
    AppArmor: fix mapping of META_READ to audit and quiet flags
    AppArmor: Fix underflow in xindex calculation
    AppArmor: Fix dropping of allowed operations that are force audited
    AppArmor: Add mising end of structure test to caps unpacking
    ...

    Linus Torvalds
     

20 Mar, 2012

1 commit


14 Feb, 2012

1 commit


13 Jan, 2012

1 commit


11 Jan, 2012

1 commit

    Colin Cross reported:

    Under the following conditions, __alloc_pages_slowpath can loop
    forever:
    gfp_mask & __GFP_WAIT is true
    gfp_mask & __GFP_FS is false
    reclaim and compaction make no progress
    order <= PAGE_ALLOC_COSTLY_ORDER

    Signed-off-by: Mel Gorman
    Acked-by: David Rientjes
    Cc: Minchan Kim
    Cc: Pekka Enberg
    Cc: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

07 Nov, 2011

1 commit

  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of
    net: inet_timewait_sock doesnt need
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     

01 Nov, 2011

1 commit

  • test_set_oom_score_adj() was introduced in 72788c385604 ("oom: replace
    PF_OOM_ORIGIN with toggling oom_score_adj") to temporarily elevate
    current's oom_score_adj for ksm and swapoff without requiring an
    additional per-process flag.

    Using that function to both set oom_score_adj to OOM_SCORE_ADJ_MAX and
    then reinstate the previous value is racy since it's possible that
    userspace can set the value to something else itself before the old value
    is reinstated. That results in userspace setting current's oom_score_adj
    to a different value and then the kernel immediately setting it back to
    its previous value without notification.

    To fix this, a new compare_swap_oom_score_adj() function is introduced
    with the same semantics as the compare and swap CAS instruction, or
    CMPXCHG on x86. It is used to reinstate the previous value of
    oom_score_adj if and only if the present value is the same as the old
    value.
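
    A sketch of the helper under the siglock, which serializes
    oom_score_adj updates (the locking here follows my reading of that
    era's oom_kill.c):

    void compare_swap_oom_score_adj(int old_val, int new_val)
    {
            struct sighand_struct *sighand = current->sighand;

            spin_lock_irq(&sighand->siglock);
            /* Reinstate only if userspace didn't change it meanwhile. */
            if (current->signal->oom_score_adj == old_val)
                    current->signal->oom_score_adj = new_val;
            spin_unlock_irq(&sighand->siglock);
    }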

    Signed-off-by: David Rientjes
    Cc: Oleg Nesterov
    Cc: Ying Han
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

31 Oct, 2011

1 commit


04 Aug, 2011

1 commit

  • If swap entries are to be stored along with struct page pointers in a
    radix tree, they need to be distinguished as exceptional entries.

    Most of the handling of swap entries in radix tree will be contained in
    shmem.c, but a few functions in filemap.c's common code need to check
    for their appearance: find_get_page(), find_lock_page(),
    find_get_pages() and find_get_pages_contig().

    So as not to slow their fast paths, tuck those checks inside the
    existing checks for unlikely radix_tree_deref_slot(); except for
    find_lock_page(), where it is an added test. And make it a BUG in
    find_get_pages_tag(), which is not applied to tmpfs files.

    A part of the reason for eliminating shmem_readpage() earlier, was to
    minimize the places where common code would need to allow for swap
    entries.

    The swp_entry_t known to swapfile.c must be massaged into a slightly
    different form when stored in the radix tree, just as it gets massaged
    into a pte_t when stored in page tables.

    In an i386 kernel this limits its information (type and page offset) to
    30 bits: given 32 "types" of swapfile and 4kB pagesize, that's a maximum
    swapfile size of 128GB. Which is less than the 512GB we previously
    allowed with X86_PAE (where the swap entry can occupy the entire upper
    32 bits of a pte_t), but not a new limitation on 32-bit without PAE; and
    there's not a new limitation on 64-bit (where swap filesize is already
    limited to 16TB by a 32-bit page offset). Thirty areas of 128GB is
    probably still enough swap for a 64GB 32-bit machine.

    Provide swp_to_radix_entry() and radix_to_swp_entry() conversions, and
    enforce filesize limit in read_swap_header(), just as for ptes.
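
    A sketch of the two conversions, assuming the radix-tree
    exceptional-entry encoding of the time (a low tag bit plus a shift of
    two, which is what costs the 32-bit swp_entry_t its two bits):

    static inline void *swp_to_radix_entry(swp_entry_t entry)
    {
            unsigned long value = entry.val << RADIX_TREE_EXCEPTIONAL_SHIFT;

            return (void *)(value | RADIX_TREE_EXCEPTIONAL_ENTRY);
    }

    static inline swp_entry_t radix_to_swp_entry(void *arg)
    {
            swp_entry_t entry;

            entry.val = (unsigned long)arg >> RADIX_TREE_EXCEPTIONAL_SHIFT;
            return entry;
    }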

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

21 Jul, 2011

1 commit

    Moving the event counter into the dynamically allocated 'struct
    seq_file' allows poll() support without the need to allocate its own
    tracking structure.

    All current users are switched over to use the new counter.

    Requested-by: Andrew Morton
    Acked-by: NeilBrown
    Tested-by: Lucas De Marchi
    Signed-off-by: Kay Sievers
    Signed-off-by: Al Viro

    Kay Sievers
     

28 Jun, 2011

1 commit

  • Before adding any more global entry points into shmem.c, gather such
    prototypes into shmem_fs.h. Remove mm's own declarations from swap.h,
    but for now leave the ones in mm.h: because shmem_file_setup() and
    shmem_zero_setup() are called from various places, and we should not
    force other subsystems to update immediately.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

25 May, 2011

1 commit

  • There's a kernel-wide shortage of per-process flags, so it's always
    helpful to trim one when possible without incurring a significant penalty.
    It's even more important when you're planning on adding a per-process
    flag yourself, which I plan to do shortly for transparent hugepages.

    PF_OOM_ORIGIN is used by ksm and swapoff to prefer current since it has a
    tendency to allocate large amounts of memory and should be preferred for
    killing over other tasks. We'd rather immediately kill the task making
    the errant syscall rather than penalizing an innocent task.

    This patch removes PF_OOM_ORIGIN since its behavior is equivalent to
    setting the process's oom_score_adj to OOM_SCORE_ADJ_MAX.

    The process's old oom_score_adj is stored and then set to
    OOM_SCORE_ADJ_MAX during the time it used to have PF_OOM_ORIGIN. The old
    value is then reinstated when the process should no longer be considered a
    high priority for oom killing.
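
    The swapoff usage then looks roughly like this (a sketch of the
    pattern described above, not a quote of the patch):

    /* Prefer current for oom killing while try_to_unuse() runs. */
    oom_score_adj = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX);
    err = try_to_unuse(type);
    test_set_oom_score_adj(oom_score_adj);  /* reinstate the old value */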

    Signed-off-by: David Rientjes
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Cc: Hugh Dickins
    Cc: Izik Eidus
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

25 Mar, 2011

1 commit

  • * 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
    Documentation/iostats.txt: bit-size reference etc.
    cfq-iosched: removing unnecessary think time checking
    cfq-iosched: Don't clear queue stats when preempt.
    blk-throttle: Reset group slice when limits are changed
    blk-cgroup: Only give unaccounted_time under debug
    cfq-iosched: Don't set active queue in preempt
    block: fix non-atomic access to genhd inflight structures
    block: attempt to merge with existing requests on plug flush
    block: NULL dereference on error path in __blkdev_get()
    cfq-iosched: Don't update group weights when on service tree
    fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
    block: Require subsystems to explicitly allocate bio_set integrity mempool
    jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    fs: make fsync_buffers_list() plug
    mm: make generic_writepages() use plugging
    blk-cgroup: Add unaccounted time to timeslice_used.
    block: fixup plugging stubs for !CONFIG_BLOCK
    block: remove obsolete comments for blkdev_issue_zeroout.
    blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
    ...

    Fix up conflicts in fs/{aio.c,super.c}

    Linus Torvalds
     

24 Mar, 2011

1 commit

    Remove initialization of a variable in the caller of a memory cgroup
    function. It is actually the return value of the memcg function, but
    it was being initialized in the caller.

    Some memory cgroup code uses the following style to bring the result
    of the start function to the end function, for avoiding races:

    mem_cgroup_start_A(&(*ptr))
    /* Something very complicated can happen here. */
    mem_cgroup_end_A(*ptr)

    In some calls, *ptr had to be initialized to NULL by the caller. But
    that's ugly. This patch makes the _start function initialize *ptr
    instead.
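
    A sketch of the convention after this patch; mem_cgroup_start_A is
    the changelog's own placeholder name, not a real function:

    void mem_cgroup_start_A(struct mem_cgroup **ptr)
    {
            *ptr = NULL;    /* initialized here, not by every caller */
            /* ...then take references/locks and fill *ptr as needed... */
    }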

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Johannes Weiner
    Acked-by: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

23 Mar, 2011

11 commits

  • A conflict between 52c50567d8ab ("mm: swap: unlock swapfile inode mutex
    before closing file on bad swapfiles") and 83ef99befc32 ("sys_swapon:
    remove did_down variable") caused a double unlock of the inode mutex
    (once in bad_swap: before the filp_close, once at the end just before
    returning).

    The patch which added the extra unlock cleared did_down to avoid
    unlocking twice, but the other patch removed the did_down variable.

    To fix, set inode to NULL after the first unlock, since it will be used
    after that point only for the final unlock.

    While checking this patch, I found a path which could unlock without
    locking, in case the same inode was added as a swapfile twice. To fix,
    move the setting of the inode variable further down, to just before
    claim_swapfile, which will lock the inode before doing anything else.
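
    A sketch of the first fix, paraphrasing the error path rather than
    quoting it:

    bad_swap:
            /* ... */
            if (inode && S_ISREG(inode->i_mode)) {
                    mutex_unlock(&inode->i_mutex);
                    inode = NULL;   /* skip the second unlock at the end */
            }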

    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Eric B Munson
    Cc: KAMEZAWA Hiroyuki
    Cc: Andrew Morton
    Signed-off-by: Cesar Eduardo Barros
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • scan_swap_map() is a large function (224 lines), with several loops and a
    complex control flow involving several gotos.

    Given all that, it is a bit silly that it is marked as inline. The
    compiler agrees with me: on a x86-64 compile, it did not inline the
    function.

    Remove the "inline" and let the compiler decide instead.

    Signed-off-by: Cesar Eduardo Barros
    Reviewed-by: Pekka Enberg
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • The block in sys_swapon which does the final adjustments to the
    swap_info_struct and to swap_list is the same as the block which
    re-inserts it again at sys_swapoff on failure of try_to_unuse(). Move
    this code to a separate function, and use it both in sys_swapon and
    sys_swapoff.
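
    A sketch of the shared helper (named enable_swap_info() in the patch,
    if memory serves; the priority-ordered list insertion is paraphrased):

    static void enable_swap_info(struct swap_info_struct *p, int prio,
                                 unsigned char *swap_map)
    {
            int i, prev;

            spin_lock(&swap_lock);
            p->prio = prio;
            p->swap_map = swap_map;
            p->flags |= SWP_WRITEOK;
            nr_swap_pages += p->pages;
            total_swap_pages += p->pages;

            /* insert swap space into swap_list, ordered by priority: */
            prev = -1;
            for (i = swap_list.head; i >= 0; i = swap_info[i]->next) {
                    if (p->prio >= swap_info[i]->prio)
                            break;
                    prev = i;
            }
            p->next = i;
            if (prev < 0)
                    swap_list.head = swap_list.next = p->type;
            else
                    swap_info[prev]->next = p->type;
            spin_unlock(&swap_lock);
    }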

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: Pekka Enberg
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • The block in sys_swapon which does the final adjustments to the
    swap_info_struct and to swap_list is the same as the block which
    re-inserts it again at sys_swapoff on failure of try_to_unuse(), except
    for the order of the operations within the lock. Since the order should
    not matter, arbitrarily change sys_swapoff to match sys_swapon, in
    preparation to making both share the same code.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: Pekka Enberg
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • The block in sys_swapon which does the final adjustments to the
    swap_info_struct and to swap_list is the same as the block which
    re-inserts it again at sys_swapoff on failure of try_to_unuse(). To be
    able to make both share the same code, move the printk() call in the
    middle of it to just after it.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
    The nr_good_pages variable still exists within
    setup_swap_map_and_extents(), but after it returns,
    nr_good_pages == p->pages.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: Pekka Enberg
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • Since there is no cleanup to do, there is no reason to jump to a label.
    Return directly instead.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: Pekka Enberg
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • Move the code which parses the bad block list and the extents to a
    separate function. Only code movement, no functional changes.

    This change uses the fact that, after the success path, nr_good_pages ==
    p->pages.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: Pekka Enberg
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • The call to swap_cgroup_swapon is in the middle of loading the swap map
    and extents. As it only does memory allocation and does not depend on
    the swapfile layout (map/extents), it can be called earlier (or later).

    Move it to just after the allocation of swap_map, since it is
    conceptually similar (allocates a map).

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: Pekka Enberg
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • Since there is no cleanup to do, there is no reason to jump to a label.
    Return directly instead.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros
     
  • Move the code which parses and checks the swapfile header (except for
    the bad block list) to a separate function. Only code movement, no
    functional changes.

    Signed-off-by: Cesar Eduardo Barros
    Tested-by: Eric B Munson
    Acked-by: Eric B Munson
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cesar Eduardo Barros