12 Jul, 2012

9 commits

  • memblock_free_reserved_regions() calls memblock_free(), but
    memblock_free() can itself trigger doubling of the reserved.regions
    array, which would free the old array range while it is still in use.

    Tejun Heo also pointed out another bug that could be related to this:

    | I don't think we're saving any noticeable
    | amount by doing this "free - give it to page allocator - reserve
    | again" dancing. We should just allocate regions aligned to page
    | boundaries and free them later when memblock is no longer in use.

    In that case, with DEBUG_PAGEALLOC enabled, we get a panic:

    memblock_free: [0x0000102febc080-0x0000102febf080] memblock_free_reserved_regions+0x37/0x39
    BUG: unable to handle kernel paging request at ffff88102febd948
    IP: [] __next_free_mem_range+0x9b/0x155
    PGD 4826063 PUD cf67a067 PMD cf7fa067 PTE 800000102febd160
    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    CPU 0
    Pid: 0, comm: swapper Not tainted 3.5.0-rc2-next-20120614-sasha #447
    RIP: 0010:[] [] __next_free_mem_range+0x9b/0x155

    See the discussion at https://lkml.org/lkml/2012/6/13/469

    So try to allocate with PAGE_SIZE alignment and free it later.

    Reported-by: Sasha Levin
    Acked-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Yinghai Lu
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     
  • After commit f5bf18fa22f8 ("bootmem/sparsemem: remove limit constraint
    in alloc_bootmem_section"), usemap allocations may easily be placed
    outside the optimal section that holds the node descriptor, even if
    there is space available in that section. This results in unnecessary
    hotplug dependencies: the node must be unplugged before the section
    holding the usemap can be removed.

    The reason is that the bootmem allocator doesn't guarantee a linear
    search starting from the passed allocation goal but may start out at a
    much higher address absent an upper limit.

    Fix this by first trying the allocation with the limit at the section
    end, then retrying without a limit if that fails. This preserves the
    fix from f5bf18fa22f8 (no panic if the allocation does not fit in the
    section) while still trying to stay within the section at first.

    Signed-off-by: Yinghai Lu
    Signed-off-by: Johannes Weiner
    Cc: [3.3.x, 3.4.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
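The two-step strategy described above (first bounded by the section end, then unbounded) can be sketched as follows. This is a toy model in Python, not kernel code: the free-list allocator and the function names are illustrative stand-ins for the bootmem allocator.

```python
def alloc_bounded(free_ranges, size, goal, limit=None):
    """First-fit allocation at or above `goal`, optionally ending below `limit`."""
    for start, end in sorted(free_ranges):
        base = max(start, goal)
        if base + size <= end and (limit is None or base + size <= limit):
            return base
    return None

def alloc_usemap(free_ranges, size, goal, section_end):
    # Try to keep the usemap inside the section holding the node descriptor...
    addr = alloc_bounded(free_ranges, size, goal, limit=section_end)
    if addr is not None:
        return addr
    # ...but don't panic if it doesn't fit: retry without the limit.
    return alloc_bounded(free_ranges, size, goal)
```

With a small request the allocation stays inside the section; a request too large for the section falls through to the unbounded retry instead of failing outright.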
     
  • Commit 238305bb4d41 ("mm: remove sparsemem allocation details from the
    bootmem allocator") introduced a bug in the allocation goal calculation
    that put section usemaps not in the same section as the node
    descriptors, creating unnecessary hotplug dependencies between them:

    node 0 must be removed before remove section 16399
    node 1 must be removed before remove section 16399
    node 2 must be removed before remove section 16399
    node 3 must be removed before remove section 16399
    node 4 must be removed before remove section 16399
    node 5 must be removed before remove section 16399
    node 6 must be removed before remove section 16399

    The reason is that it applies PAGE_SECTION_MASK to the physical address
    of the node descriptor when finding a suitable place to put the usemap,
    although this mask is actually intended to be used with PFNs. Applied at
    address width, the PFN mask clears too few low bits, so the target
    address is not the base of the wanted section holding the node
    descriptor, and the node must be offlined before the section holding
    the usemap can go.

    Fix this by extending the mask to address width before use.

    Signed-off-by: Yinghai Lu
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
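The mask-width bug above can be shown in miniature. The constants here are illustrative (PAGE_SHIFT=12 and 2^15 pages per 128 MiB section, x86_64-like); the function names are stand-ins, not the kernel's:

```python
PAGE_SHIFT = 12
PFN_SECTION_SHIFT = 15                               # 2^15 pages per section
PAGE_SECTION_MASK = ~((1 << PFN_SECTION_SHIFT) - 1)  # a mask meant for PFNs

def usemap_goal_buggy(pgdat_phys):
    # The bug: a PFN mask applied directly to a physical address
    # clears too few low bits.
    return pgdat_phys & PAGE_SECTION_MASK

def usemap_goal_fixed(pgdat_phys):
    # The fix: extend the mask to address width before use.
    return pgdat_phys & (PAGE_SECTION_MASK << PAGE_SHIFT)
```

For a node descriptor at physical address 0x101234567, the fixed goal is the section base 0x100000000, while the buggy goal is only 32 KiB-aligned (0x101230000), so the subsequent allocation is not anchored to the wanted section.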
     
  • shmem_add_to_page_cache() has three callsites, but only one of them
    wants the radix_tree_preload() (an exceptional entry guarantees that
    the radix tree node is present in the other cases), and only that site
    may need mem_cgroup_uncharge_cache_page() (PageSwapCache makes it a
    no-op in the other cases). We did it this way originally to mirror
    add_to_page_cache_locked(); but it's confusing now, so move the
    radix_tree preloading and mem_cgroup uncharging to that one caller.

    Signed-off-by: Hugh Dickins
    Acked-by: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When adding the page_private checks before calling shmem_replace_page(), I
    did realize that there is a further race, but thought it too unlikely to
    need a hurried fix.

    But independently I've been chasing why a mem cgroup's memory.stat
    sometimes shows negative rss after all tasks have gone: I expected it to
    be a stats gathering bug, but actually it's shmem swapping's fault.

    It's an old surprise, that when you lock_page(lookup_swap_cache(swap)),
    the page may have been removed from swapcache before getting the lock; or
    it may have been freed and reused and be back in swapcache; and it can
    even be using the same swap location as before (page_private same).

    The swapoff case is already secure against this (swap cannot be reused
    until the whole area has been swapped off, and a new one swapped on); and
    shmem_getpage_gfp() is protected by shmem_add_to_page_cache()'s check for
    the expected radix_tree entry - but a little too late.

    By that time, we might have already decided to shmem_replace_page(): I
    don't know of a problem from that, but I'd feel more at ease not to do so
    spuriously. And we have already done mem_cgroup_cache_charge(), on
    perhaps the wrong mem cgroup: and this charge is not then undone on the
    error path, because PageSwapCache ends up preventing that.

    It's this last case which causes the occasional negative rss in
    memory.stat: the page is charged here as cache, but (sometimes) found to
    be anon when eventually it's uncharged - and in between, it's an
    undeserved charge on the wrong memcg.

    Fix this by adding an earlier check on the radix_tree entry: it's
    inelegant to descend the tree twice, but swapping is not the fast path,
    and a better solution would need a pair (try+commit) of memcg calls, and a
    rework of shmem_replace_page() to keep out of the swapcache.

    We can use the added shmem_confirm_swap() function to replace the
    find_get_page+page_cache_release we were already doing on the error path.
    And add a comment on that -EEXIST: it seems a peculiar errno to be using,
    but originates from its use in radix_tree_insert().

    [It can be surprising to see positive rss left in a memcg's memory.stat
    after all tasks have gone, since it is supposed to count anonymous but not
    shmem. Aside from sharing anon pages via fork with a task in some other
    memcg, it often happens after swapping: because a swap page can't be freed
    while under writeback, nor while locked. So it's not an error, and these
    residual pages are easily freed once pressure demands.]

    Signed-off-by: Hugh Dickins
    Acked-by: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Revert 4fb5ef089b28 ("tmpfs: support SEEK_DATA and SEEK_HOLE"). I believe
    it's correct, and it's been nice to have from rc1 to rc6; but as the
    original commit said:

    I don't know who actually uses SEEK_DATA or SEEK_HOLE, and whether it
    would be of any use to them on tmpfs. This code adds 92 lines and 752
    bytes on x86_64 - is that bloat or worthwhile?

    Nobody asked for it, so I conclude that it's bloat: let's revert tmpfs to
    the dumb generic support for v3.5. We can always reinstate it later if
    useful, and anyone needing it in a hurry can just get it out of git.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Hellwig
    Cc: Josef Bacik
    Cc: Andi Kleen
    Cc: Andreas Dilger
    Cc: Dave Chinner
    Cc: Marco Stornelli
    Cc: Jeff liu
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • We should goto error to release the memory resource if
    hotadd_new_pgdat() fails.

    Signed-off-by: Wen Congyang
    Cc: Yasuaki ISIMATU
    Acked-by: David Rientjes
    Cc: Len Brown
    Cc: "Brown, Len"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     
  • If page migration cannot charge the temporary page to the memcg,
    migrate_pages() will return -ENOMEM. This isn't considered in memory
    compaction, however, and the loop continues to iterate over all
    pageblocks trying to isolate and migrate pages. If a small number of
    very large memcgs happen to be oom, these attempts will mostly be
    futile, leading to an enormous amount of cpu consumption due to the
    page migration failures.

    This patch will short circuit and fail memory compaction if
    migrate_pages() returns -ENOMEM. COMPACT_PARTIAL is returned in case
    some migrations were successful so that the page allocator will retry.

    Signed-off-by: David Rientjes
    Acked-by: Mel Gorman
    Cc: Minchan Kim
    Cc: Kamezawa Hiroyuki
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
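The short-circuit above can be sketched as a toy loop. Return codes and the `migrate` callback are illustrative stand-ins for migrate_pages() and the compaction return values, not kernel APIs:

```python
ENOMEM = 12
COMPACT_PARTIAL, COMPACT_COMPLETE = 1, 2   # illustrative return codes

def compact_zone(pageblocks, migrate):
    for block in pageblocks:
        err = migrate(block)
        if err == -ENOMEM:
            # Short-circuit: stop scanning the remaining pageblocks, but
            # report partial success so the page allocator will retry.
            return COMPACT_PARTIAL
    return COMPACT_COMPLETE
```

The point of returning COMPACT_PARTIAL rather than a hard failure is that earlier migrations in the loop may have succeeded, so a retry by the page allocator can still be worthwhile.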
     
  • kswapd_stop() is called to destroy the kswapd work thread when all memory
    of a NUMA node has been offlined. But kswapd_stop() only terminates the
    work thread without resetting NODE_DATA(nid)->kswapd to NULL. The stale
    pointer will prevent kswapd_run() from creating a new work thread when
    adding memory to the memory-less NUMA node again. Eventually the stale
    pointer may cause invalid memory access.

    An example stack dump is shown below. It was reproduced with 2.6.32,
    but the latest kernel has the same issue.

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] exit_creds+0x12/0x78
    PGD 0
    Oops: 0000 [#1] SMP
    last sysfs file: /sys/devices/system/memory/memory391/state
    CPU 11
    Modules linked in: cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq microcode fuse loop dm_mod tpm_tis rtc_cmos i2c_i801 rtc_core tpm serio_raw pcspkr sg tpm_bios igb i2c_core iTCO_wdt rtc_lib mptctl iTCO_vendor_support button dca bnx2 usbhid hid uhci_hcd ehci_hcd usbcore sd_mod crc_t10dif edd ext3 mbcache jbd fan ide_pci_generic ide_core ata_generic ata_piix libata thermal processor thermal_sys hwmon mptsas mptscsih mptbase scsi_transport_sas scsi_mod
    Pid: 7949, comm: sh Not tainted 2.6.32.12-qiuxishi-5-default #92 Tecal RH2285
    RIP: 0010:exit_creds+0x12/0x78
    RSP: 0018:ffff8806044f1d78 EFLAGS: 00010202
    RAX: 0000000000000000 RBX: ffff880604f22140 RCX: 0000000000019502
    RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000
    RBP: ffff880604f22150 R08: 0000000000000000 R09: ffffffff81a4dc10
    R10: 00000000000032a0 R11: ffff880006202500 R12: 0000000000000000
    R13: 0000000000c40000 R14: 0000000000008000 R15: 0000000000000001
    FS: 00007fbc03d066f0(0000) GS:ffff8800282e0000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000000000000 CR3: 000000060f029000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process sh (pid: 7949, threadinfo ffff8806044f0000, task ffff880603d7c600)
    Stack:
    ffff880604f22140 ffffffff8103aac5 ffff880604f22140 ffffffff8104d21e
    ffff880006202500 0000000000008000 0000000000c38000 ffffffff810bd5b1
    0000000000000000 ffff880603d7c600 00000000ffffdd29 0000000000000003
    Call Trace:
    __put_task_struct+0x5d/0x97
    kthread_stop+0x50/0x58
    offline_pages+0x324/0x3da
    memory_block_change_state+0x179/0x1db
    store_mem_state+0x9e/0xbb
    sysfs_write_file+0xd0/0x107
    vfs_write+0xad/0x169
    sys_write+0x45/0x6e
    system_call_fastpath+0x16/0x1b
    Code: ff 4d 00 0f 94 c0 84 c0 74 08 48 89 ef e8 1f fd ff ff 5b 5d 31 c0 41 5c c3 53 48 8b 87 20 06 00 00 48 89 fb 48 8b bf 18 06 00 00 00 48 c7 83 18 06 00 00 00 00 00 00 f0 ff 0f 0f 94 c0 84 c0
    RIP exit_creds+0x12/0x78
    RSP
    CR2: 0000000000000000

    [akpm@linux-foundation.org: add pglist_data.kswapd locking comments]
    Signed-off-by: Xishi Qiu
    Signed-off-by: Jiang Liu
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Reviewed-by: Minchan Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     

07 Jul, 2012

1 commit

  • Otherwise the code races with munmap (causing a use-after-free
    of the vma) or with close (causing a use-after-free of the struct
    file).

    The bug was introduced by commit 90ed52ebe481 ("[PATCH] holepunch: fix
    mmap_sem i_mutex deadlock").

    Cc: Hugh Dickins
    Cc: Miklos Szeredi
    Cc: Badari Pulavarty
    Cc: Nick Piggin
    Cc: stable@vger.kernel.org
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     

04 Jul, 2012

1 commit

  • Pull block bits from Jens Axboe:
    "As vacation is coming up, thought I'd better get rid of my pending
    changes in my for-linus branch for this iteration. It contains:

    - Two patches for mtip32xx. Killing a non-compliant sysfs interface
    and moving it to debugfs, where it belongs.

    - A few patches from Asias. Two legit bug fixes, and one killing an
    interface that is no longer in use.

    - A patch from Jan, making the annoying partition ioctl warning a bit
    less annoying, by restricting it to !CAP_SYS_RAWIO only.

    - Three bug fixes for drbd from Lars Ellenberg.

    - A fix for an old regression for umem, it hasn't really worked since
    the plugging scheme was changed in 3.0.

    - A few fixes from Tejun.

    - A splice fix from Eric Dumazet, fixing an issue with pipe
    resizing."

    * 'for-linus' of git://git.kernel.dk/linux-block:
    scsi: Silence unnecessary warnings about ioctl to partition
    block: Drop dead function blk_abort_queue()
    block: Mitigate lock unbalance caused by lock switching
    block: Avoid missed wakeup in request waitqueue
    umem: fix up unplugging
    splice: fix racy pipe->buffers uses
    drbd: fix null pointer dereference with on-congestion policy when diskless
    drbd: fix list corruption by failing but already aborted reads
    drbd: fix access of unallocated pages and kernel panic
    xen/blkfront: Add WARN to deal with misbehaving backends.
    blkcg: drop local variable @q from blkg_destroy()
    mtip32xx: Create debugfs entries for troubleshooting
    mtip32xx: Remove 'registers' and 'flags' from sysfs
    blkcg: fix blkg_alloc() failure path
    block: blkcg_policy_cfq shouldn't be used if !CONFIG_CFQ_GROUP_IOSCHED
    block: fix return value on cfq_init() failure
    mtip32xx: Remove version.h header file inclusion
    xen/blkback: Copy id field when doing BLKIF_DISCARD.

    Linus Torvalds
     

21 Jun, 2012

7 commits

  • If the range passed to mbind() is not allocated on nodes set in the
    nodemask, it migrates the pages to respect the constraint.

    The final formal parameter of migrate_pages() is a mode of type enum
    migrate_mode, not a boolean. do_mbind() is currently passing "true",
    which is the equivalent of MIGRATE_SYNC_LIGHT. This should instead be
    MIGRATE_SYNC for synchronous page migration.

    Signed-off-by: David Rientjes
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • __alloc_memory_core_early() asks memblock for a range of memory, then
    tries to reserve it. If the reserved region array lacks space for the new
    range, memblock_double_array() is called to allocate more space for the
    array. If memblock is used to allocate memory for the new array it can
    end up using a range that overlaps with the range originally allocated in
    __alloc_memory_core_early(), leading to possible data corruption.

    With this patch memblock_double_array() now calls memblock_find_in_range()
    with a narrowed candidate range (in cases where the reserved.regions array
    is being doubled) so any memory allocated will not overlap with the
    original range that was being reserved. The range is narrowed by passing
    in the starting address and size of the previously allocated range. Then
    the range above the ending address is searched and if a candidate is not
    found, the range below the starting address is searched.

    Signed-off-by: Greg Pearson
    Signed-off-by: Yinghai Lu
    Acked-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Pearson
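The narrowed search order (above the in-flight range first, then below it) can be sketched as a toy first-fit allocator. `find_in_range` is an illustrative stand-in for memblock_find_in_range(), not the kernel function:

```python
def find_in_range(free_ranges, start, end, size):
    """First-fit search for `size` bytes inside [start, end)."""
    for base, top in sorted(free_ranges):
        lo, hi = max(base, start), min(top, end)
        if hi - lo >= size:
            return lo
    return None

def alloc_doubled_array(free_ranges, size, obj_start, obj_size, mem_end):
    # Search above the previously allocated (not yet reserved) range first...
    addr = find_in_range(free_ranges, obj_start + obj_size, mem_end, size)
    if addr is None:
        # ...and only fall back to the range below it if that fails.
        addr = find_in_range(free_ranges, 0, obj_start, size)
    return addr
```

Because both candidate windows exclude [obj_start, obj_start + obj_size), the doubled array can never land on top of the range that __alloc_memory_core_early() is in the middle of reserving.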
     
  • Fix kernel-doc warnings in mm/memory.c:

    Warning(mm/memory.c:1377): No description found for parameter 'start'
    Warning(mm/memory.c:1377): Excess function parameter 'address' description in 'zap_page_range'

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Fix kernel-doc warnings such as

    Warning(../mm/page_cgroup.c:432): No description found for parameter 'id'
    Warning(../mm/page_cgroup.c:432): Excess function parameter 'mem' description in 'swap_cgroup_record'

    Signed-off-by: Wanpeng Li
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • Andrea asked for addr, end, vma->vm_start, and vma->vm_end to be emitted
    when !rwsem_is_locked(&tlb->mm->mmap_sem). Otherwise, debugging the
    underlying issue is more difficult.

    Suggested-by: Andrea Arcangeli
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • If use_hierarchy is set, reclaim testing soon oopses in css_is_ancestor()
    called from __mem_cgroup_same_or_subtree() called from page_referenced():
    when processes are exiting, it's easy for mm_match_cgroup() to pass along
    a NULL memcg coming from a NULL mm->owner.

    Check for that in __mem_cgroup_same_or_subtree(). Return true or false?
    False because we cannot know if it was in the hierarchy, but also false
    because it's better not to count a reference from an exiting process.

    Signed-off-by: Hugh Dickins
    Acked-by: Johannes Weiner
    Acked-by: Konstantin Khlebnikov
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The divide in p->signal->oom_score_adj * totalpages / 1000 within
    oom_badness() was causing an overflow of the signed long data type.

    This adds both the root bias and p->signal->oom_score_adj before doing
    the normalization, which fixes the issue and also cleans up the
    calculation.

    Tested-by: Dave Jones
    Signed-off-by: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
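The overflow can be shown in miniature with a 32-bit signed long, the width on which the product adj * totalpages wraps first. This is an illustrative sketch of the reordering only (the real oom_badness() also folds in the root task bonus); none of these helper names exist in the kernel:

```python
def to_long(x):
    """Wrap a Python int the way a 32-bit signed long would."""
    return (x + 2**31) % 2**32 - 2**31

def adj_scaled_buggy(adj, totalpages):
    # Multiply first, divide later: the intermediate product can wrap
    # negative before the division by 1000 happens.
    return to_long(adj * totalpages) // 1000

def adj_scaled_fixed(adj, totalpages):
    # Normalize totalpages first: the intermediate stays small.
    return to_long(adj * (totalpages // 1000))
```

With adj = 1000 and totalpages = 4,000,000 (about 15 GiB of 4 KiB pages), the buggy ordering wraps to a negative score while the fixed ordering yields the expected 4,000,000.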
     

16 Jun, 2012

2 commits

  • Minchan Kim reports that when a system has many swap areas, and tmpfs
    swaps out to the ninth or later of them, shmem_getpage_gfp()'s attempts
    to read back the page cannot locate it, and the read fails with -ENOMEM.

    Whoops. Yes, I blindly followed read_swap_header()'s pte_to_swp_entry(
    swp_entry_to_pte()) technique for determining maximum usable swap
    offset, without stopping to realize that that actually depends upon the
    pte swap encoding shifting swap offset to the higher bits and truncating
    it there. Whereas our radix_tree swap encoding leaves offset in the
    lower bits: it's swap "type" (that is, index of swap area) that was
    truncated.

    Fix it by reducing the SWP_TYPE_SHIFT() in swapops.h, and removing the
    broken radix_to_swp_entry(swp_to_radix_entry()) from read_swap_header().

    This does not reduce the usable size of a swap area any further, it
    leaves it as claimed when making the original commit: no change from 3.0
    on x86_64, nor on i386 without PAE; but 3.0's 512GB is reduced to 128GB
    per swapfile on i386 with PAE. It's not a change I would have risked
    five years ago, but with x86_64 supported for ten years, I believe it's
    appropriate now.

    Hmm, and what if some architecture implements its swap pte with offset
    encoded below type? That would equally break the maximum usable swap
    offset check. Happily, they all follow the same tradition of encoding
    offset above type, but I'll prepare a check on that for next.

    Reported-and-Reviewed-and-Tested-by: Minchan Kim
    Signed-off-by: Hugh Dickins
    Cc: stable@vger.kernel.org [3.1, 3.2, 3.3, 3.4]
    Signed-off-by: Linus Torvalds

    Hugh Dickins
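The truncation can be modelled with a generic encode/decode pair: offset in the low `shift` bits, swap type above it, the whole entry clipped to the word size. The bit widths here are illustrative, not the kernel's actual SWP_TYPE_SHIFT values:

```python
BITS = 32                 # illustrative word size for the encoded entry

def encode(swp_type, offset, shift):
    # Offset lives in the low `shift` bits, the swap type above them;
    # masking to BITS models storage in a fixed-width word.
    return ((swp_type << shift) | offset) & (2**BITS - 1)

def decode(entry, shift):
    return entry >> shift, entry & ((1 << shift) - 1)
```

With only 3 bits left above a shift of 29, the ninth swap area (type 8) is silently truncated to type 0 on the round trip; reducing the shift sacrifices maximum offset but preserves the type, which is the trade-off the fix makes.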
     
  • Pull core updates (RCU and locking) from Ingo Molnar:
    "Most of the diffstat comes from the RCU slow boot regression fixes,
    but there are also some debuggability improvements/fixes."

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    memblock: Document memblock_is_region_{memory,reserved}()
    rcu: Precompute RCU_FAST_NO_HZ timer offsets
    rcu: Move RCU_FAST_NO_HZ per-CPU variables to rcu_dynticks structure
    rcu: Update RCU_FAST_NO_HZ tracing for lazy callbacks
    rcu: RCU_FAST_NO_HZ detection of callback adoption
    spinlock: Indicate that a lockup is only suspected
    kdump: Execute kmsg_dump(KMSG_DUMP_PANIC) after smp_send_stop()
    panic: Make panic_on_oops configurable

    Linus Torvalds
     

14 Jun, 2012

1 commit

  • Dave Jones reported a kernel BUG at mm/slub.c:3474! triggered
    by splice_shrink_spd() called from vmsplice_to_pipe()

    commit 35f3d14dbbc5 (pipe: add support for shrinking and growing pipes)
    added capability to adjust pipe->buffers.

    The problem is that some paths don't hold the pipe mutex and assume
    pipe->buffers doesn't change for their duration.

    Fix this by adding nr_pages_max field in struct splice_pipe_desc, and
    use it in place of pipe->buffers where appropriate.

    splice_shrink_spd() loses its struct pipe_inode_info argument.

    Reported-by: Dave Jones
    Signed-off-by: Eric Dumazet
    Cc: Jens Axboe
    Cc: Alexander Viro
    Cc: Tom Herbert
    Cc: stable # 2.6.35
    Tested-by: Dave Jones
    Signed-off-by: Jens Axboe

    Eric Dumazet
     

09 Jun, 2012

1 commit

  • If the privileges given to root threads (3% of allowable memory) or a
    negative value of /proc/pid/oom_score_adj happen to exceed the amount of
    rss of a thread, its badness score overflows as a result of commit
    a7f638f999ff ("mm, oom: normalize oom scores to oom_score_adj scale only
    for userspace").

    Fix this by making the type signed and returning 1, meaning the thread
    is still eligible for kill, if the value is negative.

    Reported-by: Dave Jones
    Acked-by: Oleg Nesterov
    Signed-off-by: David Rientjes
    Signed-off-by: Linus Torvalds

    David Rientjes
     

08 Jun, 2012

2 commits

  • At first glance one would think that memblock_is_region_memory()
    and memblock_is_region_reserved() would be implemented in the
    same way. Unfortunately they aren't and the former returns
    whether the region specified is a subset of a memory bank while
    the latter returns whether the region specified intersects with
    reserved memory.

    Document the two functions so that users aren't tempted to
    make the implementation the same between them and to clarify the
    purpose of the functions.

    Signed-off-by: Stephen Boyd
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/1337845521-32755-1-git-send-email-sboyd@codeaurora.org
    Signed-off-by: Ingo Molnar

    Stephen Boyd
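The asymmetry being documented is subset semantics versus intersection semantics. A toy model in Python (the list-of-(base, size) representation is illustrative; only the predicate names mirror the memblock functions):

```python
def is_region_memory(memory, base, size):
    # Subset semantics: the region is fully contained in one memory bank.
    return any(b <= base and base + size <= b + s for (b, s) in memory)

def is_region_reserved(reserved, base, size):
    # Intersection semantics: the region overlaps any reserved range.
    return any(base < b + s and b < base + size for (b, s) in reserved)
```

A region that merely brushes a reserved range counts as reserved, while the same partial overlap with a memory bank is not enough to count as memory, which is exactly the trap the documentation warns implementers away from.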
     
  • Commit bde05d1ccd51 ("shmem: replace page if mapping excludes its zone")
    is not at all likely to break for anyone, but it was an earlier version
    from before review feedback was incorporated. Fix that up now.

    * shmem_replace_page must flush_dcache_page after copy_highpage [akpm]
    * Expand comment on why shmem_unuse_inode needs page_swapcount [akpm]
    * Remove excess of VM_BUG_ONs from shmem_replace_page [wangcong]
    * Check page_private matches swap before calling shmem_replace_page [hughd]
    * shmem_replace_page allow for unexpected race in radix_tree lookup [hughd]

    Signed-off-by: Hugh Dickins
    Cc: Cong Wang
    Cc: Christoph Hellwig
    Cc: KAMEZAWA Hiroyuki
    Cc: Alan Cox
    Cc: Stephane Marchesin
    Cc: Andi Kleen
    Cc: Dave Airlie
    Cc: Daniel Vetter
    Cc: Rob Clark
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

05 Jun, 2012

3 commits

  • Pull signal and vfs compile breakage fixes from Al Viro.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    fixups for signal breakage

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    nommu: fix compilation of nommu.c

    Linus Torvalds
     
  • Compiling 3.5-rc1 for nommu targets gives:

    CC mm/nommu.o
    mm/nommu.c: In function ‘sys_mmap_pgoff’:
    mm/nommu.c:1489:2: error: ‘ret’ undeclared (first use in this function)
    mm/nommu.c:1489:2: note: each undeclared identifier is reported only once for each function it appears in

    It is trivially fixed by replacing 'ret' with 'retval', the local
    variable already defined for the return value.

    Signed-off-by: Greg Ungerer
    Signed-off-by: Al Viro

    Greg Ungerer
     
  • Pull frontswap feature from Konrad Rzeszutek Wilk:
    "Frontswap provides a "transcendent memory" interface for swap pages.
    In some environments, dramatic performance savings may be obtained
    because swapped pages are saved in RAM (or a RAM-like device) instead
    of a swap disk. This tag provides the basic infrastructure along with
    some changes to the existing backends."

    Fix up trivial conflict in mm/Makefile due to removal of swap token code
    changing a line next to the new frontswap entry.

    This pull request came in before the merge window even opened, it got
    delayed to after the merge window by me just wanting to make sure it had
    actual users. Apparently IBM is using this on their embedded side, and
    Jan Beulich says that it's already made available for SLES and OpenSUSE
    users.

    Also acked by Rik van Riel, and Konrad points to other people liking it
    too. So in it goes.

    By Dan Magenheimer (4) and Konrad Rzeszutek Wilk (2)
    via Konrad Rzeszutek Wilk
    * tag 'stable/frontswap.v16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm:
    frontswap: s/put_page/store/g s/get_page/load
    MAINTAINER: Add myself for the frontswap API
    mm: frontswap: config and doc files
    mm: frontswap: core frontswap functionality
    mm: frontswap: core swap subsystem hooks and headers
    mm: frontswap: add frontswap header file

    Linus Torvalds
     

04 Jun, 2012

2 commits

  • This reverts commit 5ceb9ce6fe9462a298bb2cd5c9f1ca6cb80a0199.

    That commit seems to be the cause of the mm compaction list corruption
    issues that Dave Jones reported. The locking (or rather, absence
    thereof) is dubious, as is the use of the 'page' variable once it has
    been found to be outside the pageblock range.

    So revert it for now, we can re-visit this for 3.6. If we even need to:
    as Minchan Kim says, "The patch wasn't a bug fix and even test workload
    was very theoretical".

    Reported-and-tested-by: Dave Jones
    Acked-by: Hugh Dickins
    Acked-by: KOSAKI Motohiro
    Acked-by: Minchan Kim
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Kyungmin Park
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • New tmpfs use of !PageUptodate pages for fallocate() is triggering the
    WARNING: at mm/page-writeback.c:1990 when __set_page_dirty_nobuffers()
    is called from migrate_page_copy() for compaction.

    It is anomalous that migration should use __set_page_dirty_nobuffers()
    on an address_space that does not participate in dirty and writeback
    accounting; and this has also been observed to insert surprising dirty
    tags into a tmpfs radix_tree, despite tmpfs not using tags at all.

    We should probably give migrate_page_copy() a better way to preserve the
    tag and migrate accounting info, when mapping_cap_account_dirty(). But
    that needs some more work: so in the interim, avoid the warning by using
    a simple SetPageDirty on PageSwapBacked pages.

    Reported-and-tested-by: Dave Jones
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

02 Jun, 2012

3 commits

  • Pull slab updates from Pekka Enberg:
    "Mainly a bunch of SLUB fixes from Joonsoo Kim"

    * 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
    slub: use __SetPageSlab function to set PG_slab flag
    slub: fix a memory leak in get_partial_node()
    slub: remove unused argument of init_kmem_cache_node()
    slub: fix a possible memory leak
    Documentations: Fix slabinfo.c directory in vm/slub.txt
    slub: fix incorrect return type of get_any_partial()

    Linus Torvalds
     
  • Pull vfs changes from Al Viro.
    "A lot of misc stuff. The obvious groups:
    * Miklos' atomic_open series; kills the damn abuse of
    ->d_revalidate() by NFS, which was the major stumbling block for
    all work in that area.
    * ripping security_file_mmap() and dealing with deadlocks in the
    area; sanitizing the neighborhood of vm_mmap()/vm_munmap() in
    general.
    * ->encode_fh() switched to saner API; insane fake dentry in
    mm/cleancache.c gone.
    * assorted annotations in fs (endianness, __user)
    * parts of Artem's ->s_dirty work (jff2 and reiserfs parts)
    * ->update_time() work from Josef.
    * other bits and pieces all over the place.

    Normally it would've been in two or three pull requests, but
    signal.git stuff had eaten a lot of time during this cycle ;-/"

    Fix up trivial conflicts in Documentation/filesystems/vfs.txt (the
    'truncate_range' inode method was removed by the VM changes, the VFS
    update adds an 'update_time()' method), and in fs/btrfs/ulist.[ch] (due
    to sparse fix added twice, with other changes nearby).

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (95 commits)
    nfs: don't open in ->d_revalidate
    vfs: retry last component if opening stale dentry
    vfs: nameidata_to_filp(): don't throw away file on error
    vfs: nameidata_to_filp(): inline __dentry_open()
    vfs: do_dentry_open(): don't put filp
    vfs: split __dentry_open()
    vfs: do_last() common post lookup
    vfs: do_last(): add audit_inode before open
    vfs: do_last(): only return EISDIR for O_CREAT
    vfs: do_last(): check LOOKUP_DIRECTORY
    vfs: do_last(): make ENOENT exit RCU safe
    vfs: make follow_link check RCU safe
    vfs: do_last(): use inode variable
    vfs: do_last(): inline walk_component()
    vfs: do_last(): make exit RCU safe
    vfs: split do_lookup()
    Btrfs: move over to use ->update_time
    fs: introduce inode operation ->update_time
    reiserfs: get rid of resierfs_sync_super
    reiserfs: mark the superblock as dirty a bit later
    ...

    Linus Torvalds
     
  • Btrfs has to make sure we have space to allocate new blocks in order to modify
    the inode, so updating time can fail. We've gotten around this by having our
    own file_update_time but this is kind of a pain, and Christoph has indicated he
    would like to make xfs do something different with atime updates. So introduce
    ->update_time, where we will deal with i_version an a/m/c time updates and
    indicate which changes need to be made. The normal version just does what it
    has always done, updates the time and marks the inode dirty, and then
    filesystems can choose to do something different.

    I've gone through all of the users of file_update_time and made them check for
    errors with the exception of the fault code since it's complicated and I wasn't
    quite sure what to do there, also Jan is going to be pushing the file time
    updates into page_mkwrite for those who have it so that should satisfy btrfs and
    make it not a big deal to check the file_update_time() return code in the
    generic fault path. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

01 Jun, 2012

8 commits

  • Signed-off-by: Al Viro

    Al Viro
     
  • Take vm_mmap_pgoff() to mm/util.c, convert vm_mmap() to use it and
    move vm_mmap() to mm/util.c as well; convert both versions of
    sys_mmap_pgoff() to use vm_mmap_pgoff()

    Signed-off-by: Al Viro

    Al Viro
     
  • just pull into vm_mmap()

    Signed-off-by: Al Viro

    Al Viro
     
  • after all, 0 bytes and 0 pages is the same thing...

    Signed-off-by: Al Viro

    Al Viro
     
  • it really should be done by get_unmapped_area(); that cuts down on
    the amount of callers considerably and it's the right place for
    that stuff anyway.

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Merge misc patches from Andrew Morton:

    - the "misc" tree - stuff from all over the map

    - checkpatch updates

    - fatfs

    - kmod changes

    - procfs

    - cpumask

    - UML

    - kexec

    - mqueue

    - rapidio

    - pidns

    - some checkpoint-restore feature work. Reluctantly. Most of it
    delayed a release. I'm still rather worried that we don't have a
    clear roadmap to completion for this work.

    * emailed from Andrew Morton : (78 patches)
    kconfig: update compression algorithm info
    c/r: prctl: add ability to set new mm_struct::exe_file
    c/r: prctl: extend PR_SET_MM to set up more mm_struct entries
    c/r: procfs: add arg_start/end, env_start/end and exit_code members to /proc/$pid/stat
    syscalls, x86: add __NR_kcmp syscall
    fs, proc: introduce /proc//task//children entry
    sysctl: make kernel.ns_last_pid control dependent on CHECKPOINT_RESTORE
    aio/vfs: cleanup of rw_copy_check_uvector() and compat_rw_copy_check_uvector()
    eventfd: change int to __u64 in eventfd_signal()
    fs/nls: add Apple NLS
    pidns: make killed children autoreap
    pidns: use task_active_pid_ns in do_notify_parent
    rapidio/tsi721: add DMA engine support
    rapidio: add DMA engine support for RIO data transfers
    ipc/mqueue: add rbtree node caching support
    tools/selftests: add mq_perf_tests
    ipc/mqueue: strengthen checks on mqueue creation
    ipc/mqueue: correct mq_attr_ok test
    ipc/mqueue: improve performance of send/recv
    selftests: add mq_open_tests
    ...

    Linus Torvalds
     
  • A cleanup of rw_copy_check_uvector and compat_rw_copy_check_uvector after
    changes made to support CMA in an earlier patch.

    Rather than having an additional check_access parameter to these
    functions, the first paramater type is overloaded to allow the caller to
    specify CHECK_IOVEC_ONLY which means check that the contents of the iovec
    are valid, but do not check the memory that they point to. This is used
    by process_vm_readv/writev where we need to validate that a iovec passed
    to the syscall is valid but do not want to check the memory that it points
    to at this point because it refers to an address space in another process.

    Signed-off-by: Chris Yeoh
    Reviewed-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christopher Yeoh