27 Jul, 2012

1 commit

  • Pull x86/mm changes from Peter Anvin:
    "The big change here is the patchset by Alex Shi to use INVLPG to flush
    only the affected pages when we only need to flush a small page range.

    It also removes the special INVALIDATE_TLB_VECTOR interrupts (32
    vectors!) and replaces them with an ordinary IPI function call."

    Fix up trivial conflicts in arch/x86/include/asm/apic.h (added code next
    to changed line)

    * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/tlb: Fix build warning and crash when building for !SMP
    x86/tlb: do flush_tlb_kernel_range by 'invlpg'
    x86/tlb: replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR
    x86/tlb: enable tlb flush range support for x86
    mm/mmu_gather: enable tlb flush range in generic mmu_gather
    x86/tlb: add tlb_flushall_shift knob into debugfs
    x86/tlb: add tlb_flushall_shift for specific CPU
    x86/tlb: fall back to flush all when meet a THP large page
    x86/flush_tlb: try flush_tlb_single one by one in flush_tlb_range
    x86/tlb_info: get last level TLB entry number of CPU
    x86: Add read_mostly declaration/definition to variables from smp.h
    x86: Define early read-mostly per-cpu macros

    Linus Torvalds
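
    The heuristic behind the series is simple: when a flush covers only a
    few pages, invalidating them one at a time with INVLPG is cheaper than
    dropping the whole TLB. A minimal standalone sketch of that decision
    follows; the stub flush functions and the tlb_flushall_shift /
    tlb_entries values are illustrative stand-ins, not the kernel's actual
    per-CPU tuning.

        #include <stdio.h>

        #define PAGE_SHIFT 12
        #define PAGE_SIZE  (1UL << PAGE_SHIFT)

        /* Stand-ins for the real primitives: INVLPG on one page vs. a full flush. */
        static void flush_one_page(unsigned long addr) { printf("invlpg %#lx\n", addr); }
        static void flush_all(void)                    { printf("flush entire TLB\n"); }

        /* Illustrative tunables; the kernel derives these per CPU and exposes
         * tlb_flushall_shift in debugfs. */
        static unsigned long tlb_entries = 512;
        static int tlb_flushall_shift = 2;

        static void flush_tlb_range_sketch(unsigned long start, unsigned long end)
        {
                unsigned long addr, pages = (end - start) >> PAGE_SHIFT;

                if (pages > (tlb_entries >> tlb_flushall_shift)) {
                        flush_all();            /* big range: cheaper to flush everything */
                        return;
                }
                for (addr = start; addr < end; addr += PAGE_SIZE)
                        flush_one_page(addr);   /* small range: per-page INVLPG */
        }

        int main(void)
        {
                flush_tlb_range_sketch(0x400000, 0x404000);  /* 4 pages    -> per-page INVLPG */
                flush_tlb_range_sketch(0x400000, 0x800000);  /* 1024 pages -> full flush */
                return 0;
        }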
     

25 Jul, 2012

2 commits

  • Pull trivial tree from Jiri Kosina:
    "Trivial updates all over the place as usual."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (29 commits)
    Fix typo in include/linux/clk.h .
    pci: hotplug: Fix typo in pci
    iommu: Fix typo in iommu
    video: Fix typo in drivers/video
    Documentation: Add newline at end-of-file to files lacking one
    arm,unicore32: Remove obsolete "select MISC_DEVICES"
    module.c: spelling s/postition/position/g
    cpufreq: Fix typo in cpufreq driver
    trivial: typo in comment in mksysmap
    mach-omap2: Fix typo in debug message and comment
    scsi: aha152x: Fix sparse warning and make printing pointer address more portable.
    Change email address for Steve Glendinning
    Btrfs: fix typo in convert_extent_bit
    via: Remove bogus if check
    netprio_cgroup.c: fix comment typo
    backlight: fix memory leak on obscure error path
    Documentation: asus-laptop.txt references an obsolete Kconfig item
    Documentation: ManagementStyle: fixed typo
    mm/vmscan: cleanup comment error in balance_pgdat
    mm: cleanup on the comments of zone_reclaim_stat
    ...

    Linus Torvalds
     
  • Pull frontswap updates from Konrad Rzeszutek Wilk:
    "Cleanups in code and documentation. Little bit of refactoring for
    cleaner look."

    * tag 'stable/for-linus-3.6-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm:
    mm/frontswap: cleanup doc and comment error
    mm: frontswap: remove unneeded headers
    mm: frontswap: split out function to clear a page out
    mm: frontswap: remove unnecessary check during initialization
    mm: frontswap: make all branches of if statement in put page consistent
    mm: frontswap: split frontswap_shrink further to simplify locking
    mm: frontswap: split out __frontswap_unuse_pages
    mm: frontswap: split out __frontswap_curr_pages
    mm: frontswap: trivial coding convention issues
    mm: frontswap: remove casting from function calls through ops structure

    Linus Torvalds
     

24 Jul, 2012

2 commits

  • Pull arch/tile updates from Chris Metcalf:
    "These changes provide support for PCIe root complex and USB host mode
    for tilegx's on-chip I/Os.

    In addition, this pull provides the required underpinning for the
    on-chip networking support that was pulled into 3.5. The changes have
    all been through LKML (with several rounds for PCIe RC) and on
    linux-next."

    * git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile:
    tile: updates to pci root complex from community feedback
    bounce: allow use of bounce pool via config option
    usb: add host support for the tilegx architecture
    arch/tile: provide kernel support for the tilegx USB shim
    tile pci: enable IOMMU to support DMA for legacy devices
    arch/tile: enable ZONE_DMA for tilegx
    tilegx pci: support I/O to arbitrarily-cached pages
    tile: remove unused header
    arch/tile: tilegx PCI root complex support
    arch/tile: provide kernel support for the tilegx TRIO shim
    arch/tile: break out the "csum a long" function to <asm/checksum.h>
    arch/tile: provide kernel support for the tilegx mPIPE shim
    arch/tile: common DMA code for the GXIO IORPC subsystem
    arch/tile: support MMIO-based readb/writeb etc.
    arch/tile: introduce GXIO IORPC framework for tilegx

    Linus Torvalds
     
  • Pull the big VFS changes from Al Viro:
    "This one is *big* and changes quite a few things around VFS. What's in there:

    - the first of two really major architecture changes - death to open
    intents.

    The former is finally there; it was very long in the making, but with
    Miklos getting through the really hard and messy final push in
    fs/namei.c, we finally have it. Unlike his variant, this one
    doesn't introduce struct opendata; what we have instead is
    ->atomic_open() taking preallocated struct file * and passing
    everything via its fields.

    Instead of returning struct file *, it returns -E... on error, 0
    on success and 1 in "deal with it yourself" case (e.g. symlink
    found on server, etc.).

    See comments before fs/namei.c:atomic_open(). That made a lot of
    goodies finally possible and quite a few are in that pile:
    ->lookup(), ->d_revalidate() and ->create() do not get struct
    nameidata * anymore; ->lookup() and ->d_revalidate() get lookup
    flags instead, ->create() gets "do we want it exclusive" flag.

    With the introduction of new helper (kern_path_locked()) we are rid
    of all struct nameidata instances outside of fs/namei.c; it's still
    visible in namei.h, but not for long. Come the next cycle,
    declaration will move either to fs/internal.h or to fs/namei.c
    itself. [me, miklos, hch]

    - The second major change: behaviour of final fput(). Now we have
    __fput() done without any locks held by caller *and* not from deep
    in call stack.

    That obviously lifts a lot of constraints on the locking in there.
    Moreover, it's legal now to call fput() from atomic contexts (which
    has immediately simplified life for aio.c). We also don't need
    anti-recursion logics in __scm_destroy() anymore.

    There is a price, though - the damn thing has become partially
    asynchronous. For fput() from normal process we are guaranteed
    that pending __fput() will be done before the caller returns to
    userland, exits or gets stopped for ptrace.

    For kernel threads and atomic contexts it's done via
    schedule_work(), so theoretically we might need a way to make sure
    it's finished; so far only one such place had been found, but there
    might be more.

    There's flush_delayed_fput() (do all pending __fput()) and there's
    __fput_sync() (fput() analog doing __fput() immediately). I hope
    we won't need them often; see warnings in fs/file_table.c for
    details. [me, based on task_work series from Oleg merged last
    cycle]

    - sync series from Jan

    - large part of "death to sync_supers()" work from Artem; the only
    bits missing here are exofs and ext4 ones. As far as I understand,
    those are going via the exofs and ext4 trees resp.; once they are
    in, we can put ->write_super() to the rest, along with the thread
    calling it.

    - preparatory bits from unionmount series (from dhowells).

    - assorted cleanups and fixes all over the place, as usual.

    This is not the last pile for this cycle; there's at least jlayton's
    ESTALE work and fsfreeze series (the latter - in dire need of fixes,
    so I'm not sure it'll make the cut this cycle). I'll probably throw
    symlink/hardlink restrictions stuff from Kees into the next pile, too.
    Plus there's a lot of misc patches I hadn't thrown into that one -
    it's large enough as it is..."

    * 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (127 commits)
    ext4: switch EXT4_IOC_RESIZE_FS to mnt_want_write_file()
    btrfs: switch btrfs_ioctl_balance() to mnt_want_write_file()
    switch dentry_open() to struct path, make it grab references itself
    spufs: shift dget/mntget towards dentry_open()
    zoran: don't bother with struct file * in zoran_map
    ecryptfs: don't reinvent the wheels, please - use struct completion
    don't expose I_NEW inodes via dentry->d_inode
    tidy up namei.c a bit
    unobfuscate follow_up() a bit
    ext3: pass custom EOF to generic_file_llseek_size()
    ext4: use core vfs llseek code for dir seeks
    vfs: allow custom EOF in generic_file_llseek code
    vfs: Avoid unnecessary WB_SYNC_NONE writeback during sys_sync and reorder sync passes
    vfs: Remove unnecessary flushing of block devices
    vfs: Make sys_sync writeout also block device inodes
    vfs: Create function for iterating over block devices
    vfs: Reorder operations during sys_sync
    quota: Move quota syncing to ->sync_fs method
    quota: Split dquot_quota_sync() to writeback and cache flushing part
    vfs: Move noop_backing_dev_info check from sync into writeback
    ...

    Linus Torvalds
     

18 Jul, 2012

3 commits

  • Merge Andrew's remaining patches for 3.5:
    "Nine fixes"

    * Merge emailed patches from Andrew Morton : (9 commits)
    mm: fix lost kswapd wakeup in kswapd_stop()
    m32r: make memset() global for CONFIG_KERNEL_BZIP2=y
    m32r: add memcpy() for CONFIG_KERNEL_GZIP=y
    m32r: consistently use "suffix-$(...)"
    m32r: fix 'fix breakage from "m32r: use generic ptrace_resume code"' fallout
    m32r: fix pull clearing RESTORE_SIGMASK into block_sigmask() fallout
    m32r: remove duplicate definition of PTRACE_O_TRACESYSGOOD
    mn10300: fix "pull clearing RESTORE_SIGMASK into block_sigmask()" fallout
    bootmem: make ___alloc_bootmem_node_nopanic() really nopanic

    Linus Torvalds
     
  • Offlining memory may block forever, waiting for kswapd() to wake up
    because kswapd() does not check the event kthread->should_stop before
    sleeping.

    The proper pattern, from Documentation/memory-barriers.txt, is:

    --- waker ---
    event_indicated = 1;
    wake_up_process(event_daemon);

    --- sleeper ---
    for (;;) {
            set_current_state(TASK_UNINTERRUPTIBLE);
            if (event_indicated)
                    break;
            schedule();
    }

    set_current_state() may be wrapped by:
            prepare_to_wait();

    In the kswapd() case, event_indicated is kthread->should_stop.

    === offlining memory (waker) ===
    kswapd_stop()
        kthread_stop()
            kthread->should_stop = 1
            wake_up_process()
            wait_for_completion()

    === kswapd_try_to_sleep (sleeper) ===
    kswapd_try_to_sleep()
        prepare_to_wait()
        .
        .
        schedule()
        .
        .
        finish_wait()

    The schedule() needs to be protected by a test of kthread->should_stop,
    which is wrapped by kthread_should_stop().

    Reproducer:
        Do heavy file I/O in the background.
        Do a memory offline/online in a tight loop.

    Signed-off-by: Aaditya Kumar
    Acked-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaditya Kumar
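
    A sketch of the fixed sleep path: the one line that matters is testing
    kthread_should_stop() before schedule(), so a stop request that arrives
    between prepare_to_wait() and schedule() is not slept through. This is
    a simplified rendering of the pattern, not the literal mm/vmscan.c code.

        /* Simplified sketch of kswapd_try_to_sleep() after the fix. */
        static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
        {
                DEFINE_WAIT(wait);

                prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);

                /*
                 * kthread_stop() sets should_stop and then wakes us.  If that
                 * happens after prepare_to_wait(), going to sleep here would
                 * leave the stopper blocked in wait_for_completion() forever.
                 */
                if (!kthread_should_stop())
                        schedule();

                finish_wait(&pgdat->kswapd_wait, &wait);
        }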
     
  • In reaction to commit 99ab7b19440a ("mm: sparse: fix usemap allocation
    above node descriptor section") Johannes said:
    | while backporting the below patch, I realised that your fix busted
    | f5bf18fa22f8 again. The problem was not a panicking version on
    | allocation failure but when the usemap size was too large such that
    | goal + size > limit triggers the BUG_ON in the bootmem allocator. So
    | we need a version that passes limit ONLY if the usemap is smaller than
    | the section.

    After checking the code, the name of ___alloc_bootmem_node_nopanic()
    does not reflect what the function actually does.

    Make bootmem really not panic.

    Hopefully we can kill off bootmem sooner.

    Signed-off-by: Yinghai Lu
    Cc: Johannes Weiner
    Cc: [3.3.x, 3.4.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
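
    In sketch form, "really not panic" means the allocator walks its
    fallback chain and ultimately returns NULL instead of hitting a BUG_ON
    or panic() when the constrained request cannot be satisfied. The shape
    below is a simplified reconstruction; helper names follow mm/bootmem.c
    internals but the exact flow is abridged.

        /* Sketch: a nopanic allocation that degrades gracefully. */
        static void *alloc_nopanic_sketch(pg_data_t *pgdat, unsigned long size,
                                          unsigned long align, unsigned long goal,
                                          unsigned long limit)
        {
                void *ptr;

                /* Try the preferred node with the caller's goal/limit first. */
                ptr = alloc_bootmem_bdata(pgdat->bdata, size, align, goal, limit);
                if (ptr)
                        return ptr;

                /* Then any node, still honouring the limit... */
                ptr = alloc_bootmem_core(size, align, goal, limit);
                if (ptr)
                        return ptr;

                /* ...and finally give up the goal rather than panicking. */
                if (goal)
                        return alloc_nopanic_sketch(pgdat, size, align, 0, limit);

                return NULL;    /* the caller decides what failure means */
        }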
     

17 Jul, 2012

1 commit

  • Pull CMA and DMA-mapping fixes from Marek Szyprowski:
    "Another set of minor fixups for recently merged Contiguous Memory
    Allocator and ARM DMA-mapping changes. Those patches fix mysterious
    crashes on systems with CMA and Himem enabled as well as some corner
    cases caused by typical off-by-one bug."

    * 'fixes-for-linus' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping:
    ARM: dma-mapping: modify condition check while freeing pages
    mm: cma: fix condition check when setting global cma area
    mm: cma: don't replace lowmem pages with highmem

    Linus Torvalds
     

12 Jul, 2012

11 commits

  • memblock_free_reserved_regions() calls memblock_free(), but
    memblock_free() may in turn double the reserved.regions array, so we
    could end up freeing the old range that backs reserved.regions itself.

    Also tj said there is another bug which could be related to this.

    | I don't think we're saving any noticeable
    | amount by doing this "free - give it to page allocator - reserve
    | again" dancing. We should just allocate regions aligned to page
    | boundaries and free them later when memblock is no longer in use.

    In that case, with DEBUG_PAGEALLOC enabled, we get a panic:

    memblock_free: [0x0000102febc080-0x0000102febf080] memblock_free_reserved_regions+0x37/0x39
    BUG: unable to handle kernel paging request at ffff88102febd948
    IP: [] __next_free_mem_range+0x9b/0x155
    PGD 4826063 PUD cf67a067 PMD cf7fa067 PTE 800000102febd160
    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    CPU 0
    Pid: 0, comm: swapper Not tainted 3.5.0-rc2-next-20120614-sasha #447
    RIP: 0010:[] [] __next_free_mem_range+0x9b/0x155

    See the discussion at https://lkml.org/lkml/2012/6/13/469

    So try to allocate with PAGE_SIZE alignment and free it later.

    Reported-by: Sasha Levin
    Acked-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Yinghai Lu
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     
  • After commit f5bf18fa22f8 ("bootmem/sparsemem: remove limit constraint
    in alloc_bootmem_section"), usemap allocations may easily be placed
    outside the optimal section that holds the node descriptor, even if
    there is space available in that section. This results in unnecessary
    hotplug dependencies that need to have the node unplugged before the
    section holding the usemap.

    The reason is that the bootmem allocator doesn't guarantee a linear
    search starting from the passed allocation goal but may start out at a
    much higher address absent an upper limit.

    Fix this by trying the allocation with the limit at the section end,
    then retry without if that fails. This keeps the fix from f5bf18fa22f8
    of not panicking if the allocation does not fit in the section, but
    still makes sure to try to stay within the section at first.

    Signed-off-by: Yinghai Lu
    Signed-off-by: Johannes Weiner
    Cc: [3.3.x, 3.4.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
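
    The strategy reads naturally as a two-step allocation: first ask the
    bootmem allocator to stay below the end of the section that already
    holds the node descriptor, and only if that fails retry without the
    limit, keeping the no-panic behaviour of f5bf18fa22f8. A hedged sketch;
    the helper name and exact arguments are illustrative rather than the
    literal mm/sparse.c hunk.

        /* Sketch: keep the usemap inside the node-descriptor section if possible. */
        static void *alloc_usemap_near_pgdat(int nid, unsigned long size,
                                             unsigned long goal, unsigned long limit)
        {
                void *p;

                /* First try: constrain the search to [goal, limit), the section
                 * holding the node descriptor. */
                p = ___alloc_bootmem_node_nopanic(NODE_DATA(nid), size,
                                                  SMP_CACHE_BYTES, goal, limit);
                if (p)
                        return p;

                /* Fall back to an unconstrained allocation; hot-remove ordering
                 * gets worse, but we do not panic. */
                return ___alloc_bootmem_node_nopanic(NODE_DATA(nid), size,
                                                     SMP_CACHE_BYTES, goal, 0);
        }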
     
  • Commit 238305bb4d41 ("mm: remove sparsemem allocation details from the
    bootmem allocator") introduced a bug in the allocation goal calculation
    that put section usemaps not in the same section as the node
    descriptors, creating unnecessary hotplug dependencies between them:

    node 0 must be removed before remove section 16399
    node 1 must be removed before remove section 16399
    node 2 must be removed before remove section 16399
    node 3 must be removed before remove section 16399
    node 4 must be removed before remove section 16399
    node 5 must be removed before remove section 16399
    node 6 must be removed before remove section 16399

    The reason is that it applies PAGE_SECTION_MASK to the physical address
    of the node descriptor when finding a suitable place to put the usemap,
    when this mask is actually intended to be used with PFNs. Because the
    PFN mask is wider, the target address will point beyond the wanted
    section holding the node descriptor and the node must be offlined before
    the section holding the usemap can go.

    Fix this by extending the mask to address width before use.

    Signed-off-by: Yinghai Lu
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
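
    The arithmetic slip is easy to reproduce outside the kernel:
    PAGE_SECTION_MASK is a mask on page frame numbers, so applying it
    directly to a byte address clears far too few low bits. The "goal"
    then ends up near the node descriptor itself rather than at the start
    of its section, and the search window based on it spills into the next
    section. A standalone demonstration with x86_64-like constants:

        #include <stdio.h>

        #define PAGE_SHIFT              12
        #define SECTION_SIZE_BITS       27                          /* 128 MB sections */
        #define PFN_SECTION_SHIFT       (SECTION_SIZE_BITS - PAGE_SHIFT)
        #define PAGES_PER_SECTION       (1UL << PFN_SECTION_SHIFT)
        #define PAGE_SECTION_MASK       (~(PAGES_PER_SECTION - 1))  /* a PFN mask */

        int main(void)
        {
                unsigned long pgdat_phys = 0x102345678UL;  /* node descriptor's physical address */

                /* Buggy: the PFN mask applied to a byte address clears only 15 bits. */
                unsigned long wrong = pgdat_phys & PAGE_SECTION_MASK;

                /* Fixed: widen the mask to address width before using it. */
                unsigned long right = pgdat_phys & (PAGE_SECTION_MASK << PAGE_SHIFT);

                printf("wrong goal: %#lx (only 32 KB aligned)\n", wrong);           /* 0x102340000 */
                printf("right goal: %#lx (start of the 128 MB section)\n", right);  /* 0x100000000 */
                return 0;
        }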
     
  • shmem_add_to_page_cache() has three callsites, but only one of them wants
    the radix_tree_preload() (an exceptional entry guarantees that the radix
    tree node is present in the other cases), and only that site can achieve
    mem_cgroup_uncharge_cache_page() (PageSwapCache makes it a no-op in the
    other cases). We did it this way originally to reflect
    add_to_page_cache_locked(); but it's confusing now, so move the radix_tree
    preloading and mem_cgroup uncharging to that one caller.

    Signed-off-by: Hugh Dickins
    Acked-by: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When adding the page_private checks before calling shmem_replace_page(), I
    did realize that there is a further race, but thought it too unlikely to
    need a hurried fix.

    But independently I've been chasing why a mem cgroup's memory.stat
    sometimes shows negative rss after all tasks have gone: I expected it to
    be a stats gathering bug, but actually it's shmem swapping's fault.

    It's an old surprise, that when you lock_page(lookup_swap_cache(swap)),
    the page may have been removed from swapcache before getting the lock; or
    it may have been freed and reused and be back in swapcache; and it can
    even be using the same swap location as before (page_private same).

    The swapoff case is already secure against this (swap cannot be reused
    until the whole area has been swapped off, and a new swapped on); and
    shmem_getpage_gfp() is protected by shmem_add_to_page_cache()'s check for
    the expected radix_tree entry - but a little too late.

    By that time, we might have already decided to shmem_replace_page(): I
    don't know of a problem from that, but I'd feel more at ease not to do so
    spuriously. And we have already done mem_cgroup_cache_charge(), on
    perhaps the wrong mem cgroup: and this charge is not then undone on the
    error path, because PageSwapCache ends up preventing that.

    It's this last case which causes the occasional negative rss in
    memory.stat: the page is charged here as cache, but (sometimes) found to
    be anon when eventually it's uncharged - and in between, it's an
    undeserved charge on the wrong memcg.

    Fix this by adding an earlier check on the radix_tree entry: it's
    inelegant to descend the tree twice, but swapping is not the fast path,
    and a better solution would need a pair (try+commit) of memcg calls, and a
    rework of shmem_replace_page() to keep out of the swapcache.

    We can use the added shmem_confirm_swap() function to replace the
    find_get_page+page_cache_release we were already doing on the error path.
    And add a comment on that -EEXIST: it seems a peculiar errno to be using,
    but originates from its use in radix_tree_insert().

    [It can be surprising to see positive rss left in a memcg's memory.stat
    after all tasks have gone, since it is supposed to count anonymous but not
    shmem. Aside from sharing anon pages via fork with a task in some other
    memcg, it often happens after swapping: because a swap page can't be freed
    while under writeback, nor while locked. So it's not an error, and these
    residual pages are easily freed once pressure demands.]

    Signed-off-by: Hugh Dickins
    Acked-by: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
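
    The helper mentioned above is small enough to sketch: it peeks at the
    radix tree entry for the index and confirms it is still the expected
    swap entry, without taking any page reference. Reconstructed from the
    description, so treat the field and helper names as approximate.

        /* Sketch of shmem_confirm_swap(): is this index still backed by @swap? */
        static bool shmem_confirm_swap(struct address_space *mapping,
                                       pgoff_t index, swp_entry_t swap)
        {
                void *item;

                rcu_read_lock();
                item = radix_tree_lookup(&mapping->page_tree, index);
                rcu_read_unlock();

                return item == swp_to_radix_entry(swap);
        }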
     
  • Revert 4fb5ef089b28 ("tmpfs: support SEEK_DATA and SEEK_HOLE"). I believe
    it's correct, and it's been nice to have from rc1 to rc6; but as the
    original commit said:

    I don't know who actually uses SEEK_DATA or SEEK_HOLE, and whether it
    would be of any use to them on tmpfs. This code adds 92 lines and 752
    bytes on x86_64 - is that bloat or worthwhile?

    Nobody asked for it, so I conclude that it's bloat: let's revert tmpfs to
    the dumb generic support for v3.5. We can always reinstate it later if
    useful, and anyone needing it in a hurry can just get it out of git.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Hellwig
    Cc: Josef Bacik
    Cc: Andi Kleen
    Cc: Andreas Dilger
    Cc: Dave Chinner
    Cc: Marco Stornelli
    Cc: Jeff liu
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • We should goto error to release the memory resource if
    hotadd_new_pgdat() fails.

    Signed-off-by: Wen Congyang
    Cc: Yasuaki ISIMATU
    Acked-by: David Rientjes
    Cc: Len Brown
    Cc: "Brown, Len"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     
  • If page migration cannot charge the temporary page to the memcg,
    migrate_pages() will return -ENOMEM. Memory compaction does not
    consider this, however, and the loop continues to iterate over all
    pageblocks trying to isolate and migrate pages. If a small number of
    very large memcgs happen to be OOM, these attempts will mostly be
    futile, leading to an enormous amount of CPU consumption due to the
    page migration failures.

    This patch will short circuit and fail memory compaction if
    migrate_pages() returns -ENOMEM. COMPACT_PARTIAL is returned in case
    some migrations were successful so that the page allocator will retry.

    Signed-off-by: David Rientjes
    Acked-by: Mel Gorman
    Cc: Minchan Kim
    Cc: Kamezawa Hiroyuki
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
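
    In sketch form, the short-circuit sits in compaction's per-pageblock
    migration loop: any other failure just moves on to the next block, but
    -ENOMEM aborts the run and reports COMPACT_PARTIAL so the page
    allocator retries instead of spinning. Simplified from the shape of
    compact_zone(); not the literal code.

        /* Inside the pageblock loop of compact_zone() (simplified sketch). */
        err = migrate_pages(&cc->migratepages, compaction_alloc,
                            (unsigned long)cc, false,
                            cc->sync ? MIGRATE_SYNC_LIGHT : MIGRATE_ASYNC);
        if (err) {
                putback_lru_pages(&cc->migratepages);
                cc->nr_migratepages = 0;
                /*
                 * A memcg charge failure for the temporary page will keep
                 * failing for every further pageblock: give up now and report
                 * partial progress so the allocator retries.
                 */
                if (err == -ENOMEM) {
                        ret = COMPACT_PARTIAL;
                        goto out;
                }
        }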
     
  • kswapd_stop() is called to destroy the kswapd work thread when all memory
    of a NUMA node has been offlined. But kswapd_stop() only terminates the
    work thread without resetting NODE_DATA(nid)->kswapd to NULL. The stale
    pointer will prevent kswapd_run() from creating a new work thread when
    adding memory to the memory-less NUMA node again. Eventually the stale
    pointer may cause invalid memory access.

    An example stack dump is below. It was reproduced with 2.6.32, but the
    latest kernel has the same issue.

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] exit_creds+0x12/0x78
    PGD 0
    Oops: 0000 [#1] SMP
    last sysfs file: /sys/devices/system/memory/memory391/state
    CPU 11
    Modules linked in: cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq microcode fuse loop dm_mod tpm_tis rtc_cmos i2c_i801 rtc_core tpm serio_raw pcspkr sg tpm_bios igb i2c_core iTCO_wdt rtc_lib mptctl iTCO_vendor_support button dca bnx2 usbhid hid uhci_hcd ehci_hcd usbcore sd_mod crc_t10dif edd ext3 mbcache jbd fan ide_pci_generic ide_core ata_generic ata_piix libata thermal processor thermal_sys hwmon mptsas mptscsih mptbase scsi_transport_sas scsi_mod
    Pid: 7949, comm: sh Not tainted 2.6.32.12-qiuxishi-5-default #92 Tecal RH2285
    RIP: 0010:exit_creds+0x12/0x78
    RSP: 0018:ffff8806044f1d78 EFLAGS: 00010202
    RAX: 0000000000000000 RBX: ffff880604f22140 RCX: 0000000000019502
    RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000
    RBP: ffff880604f22150 R08: 0000000000000000 R09: ffffffff81a4dc10
    R10: 00000000000032a0 R11: ffff880006202500 R12: 0000000000000000
    R13: 0000000000c40000 R14: 0000000000008000 R15: 0000000000000001
    FS: 00007fbc03d066f0(0000) GS:ffff8800282e0000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000000000000 CR3: 000000060f029000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process sh (pid: 7949, threadinfo ffff8806044f0000, task ffff880603d7c600)
    Stack:
    ffff880604f22140 ffffffff8103aac5 ffff880604f22140 ffffffff8104d21e
    ffff880006202500 0000000000008000 0000000000c38000 ffffffff810bd5b1
    0000000000000000 ffff880603d7c600 00000000ffffdd29 0000000000000003
    Call Trace:
    __put_task_struct+0x5d/0x97
    kthread_stop+0x50/0x58
    offline_pages+0x324/0x3da
    memory_block_change_state+0x179/0x1db
    store_mem_state+0x9e/0xbb
    sysfs_write_file+0xd0/0x107
    vfs_write+0xad/0x169
    sys_write+0x45/0x6e
    system_call_fastpath+0x16/0x1b
    Code: ff 4d 00 0f 94 c0 84 c0 74 08 48 89 ef e8 1f fd ff ff 5b 5d 31 c0 41 5c c3 53 48 8b 87 20 06 00 00 48 89 fb 48 8b bf 18 06 00 00 00 48 c7 83 18 06 00 00 00 00 00 00 f0 ff 0f 0f 94 c0 84 c0
    RIP exit_creds+0x12/0x78
    RSP
    CR2: 0000000000000000

    [akpm@linux-foundation.org: add pglist_data.kswapd locking comments]
    Signed-off-by: Xishi Qiu
    Signed-off-by: Jiang Liu
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Reviewed-by: Minchan Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
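
    The fix pairs naturally with kswapd_run(): stopping the thread must
    also clear the per-node pointer so that a later online of the node
    spawns a fresh thread instead of dereferencing a freed task_struct. A
    simplified sketch of that pairing (the real patch also documents the
    locking around pgdat->kswapd):

        /* Sketch of the kswapd_run()/kswapd_stop() pairing after the fix. */
        void kswapd_stop(int nid)
        {
                struct task_struct *kswapd = NODE_DATA(nid)->kswapd;

                if (kswapd) {
                        kthread_stop(kswapd);
                        NODE_DATA(nid)->kswapd = NULL;  /* no stale pointer left behind */
                }
        }

        int kswapd_run(int nid)
        {
                pg_data_t *pgdat = NODE_DATA(nid);

                if (pgdat->kswapd)
                        return 0;                       /* already running */

                pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
                if (IS_ERR(pgdat->kswapd)) {
                        pr_err("Failed to start kswapd on node %d\n", nid);
                        pgdat->kswapd = NULL;
                        return -1;
                }
                return 0;
        }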
     
  • Merge memory fault handling fix from Tony Luck.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • In commit dad1743e5993f1 ("x86/mce: Only restart instruction after machine
    check recovery if it is safe") we fixed mce_notify_process() to force a
    signal to the current process if it was not restartable (RIPV bit not
    set in MCG_STATUS). But doing it here means that the process doesn't
    get told the virtual address of the fault via siginfo_t->si_addr. This
    would prevent application level recovery from the fault.

    Make a new MF_MUST_KILL flag bit for memory_failure() et al. to use so
    that we will provide the right information with the signal.

    Signed-off-by: Tony Luck
    Acked-by: Borislav Petkov
    Cc: stable@kernel.org # 3.4+

    Tony Luck
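
    The mechanism in sketch form: when MCG_STATUS reports the interrupted
    context as non-restartable, the machine-check code no longer raises its
    own addressless signal but passes MF_MUST_KILL to memory_failure(), so
    the SIGBUS it sends carries si_addr for application-level recovery.
    The struct and field names below are illustrative, not the verbatim
    arch/x86 machine-check code.

        /* Sketch of the notify path after the change. */
        void mce_notify_process_sketch(struct mce_info *mi)
        {
                unsigned long pfn = mi->paddr >> PAGE_SHIFT;
                int flags = MF_ACTION_REQUIRED;

                /* RIPV clear: we cannot return to the interrupted instruction,
                 * so memory_failure() must deliver the fatal, addressed signal. */
                if (!mi->restartable)
                        flags |= MF_MUST_KILL;

                memory_failure(pfn, MCE_VECTOR, flags);
        }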
     

07 Jul, 2012

1 commit

  • Otherwise the code races with munmap (causing a use-after-free
    of the vma) or with close (causing a use-after-free of the struct
    file).

    The bug was introduced by commit 90ed52ebe481 ("[PATCH] holepunch: fix
    mmap_sem i_mutex deadlock")

    Cc: Hugh Dickins
    Cc: Miklos Szeredi
    Cc: Badari Pulavarty
    Cc: Nick Piggin
    Cc: stable@vger.kernel.org
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     

06 Jul, 2012

1 commit

  • The filesystem layer expects pages in the block device's mapping to not
    be in highmem (the mapping's gfp mask is set in bdget()), but CMA can
    currently replace lowmem pages with highmem pages, leading to crashes in
    filesystem code such as the one below:

    Unable to handle kernel NULL pointer dereference at virtual address 00000400
    pgd = c0c98000
    [00000400] *pgd=00c91831, *pte=00000000, *ppte=00000000
    Internal error: Oops: 817 [#1] PREEMPT SMP ARM
    CPU: 0 Not tainted (3.5.0-rc5+ #80)
    PC is at __memzero+0x24/0x80
    ...
    Process fsstress (pid: 323, stack limit = 0xc0cbc2f0)
    Backtrace:
    [] (ext4_getblk+0x0/0x180) from [] (ext4_bread+0x1c/0x98)
    [] (ext4_bread+0x0/0x98) from [] (ext4_mkdir+0x160/0x3bc)
    r4:c15337f0
    [] (ext4_mkdir+0x0/0x3bc) from [] (vfs_mkdir+0x8c/0x98)
    [] (vfs_mkdir+0x0/0x98) from [] (sys_mkdirat+0x74/0xac)
    r6:00000000 r5:c152eb40 r4:000001ff r3:c14b43f0
    [] (sys_mkdirat+0x0/0xac) from [] (sys_mkdir+0x20/0x24)
    r6:beccdcf0 r5:00074000 r4:beccdbbc
    [] (sys_mkdir+0x0/0x24) from [] (ret_fast_syscall+0x0/0x30)

    Fix this by replacing only highmem pages with highmem.

    Reported-by: Laura Abbott
    Signed-off-by: Rabin Vincent
    Acked-by: Michal Nazarewicz
    Signed-off-by: Marek Szyprowski

    Rabin Vincent
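
    The allocation policy of the fix, in sketch form: a replacement page
    may come from highmem only when the page being migrated out of the CMA
    range is itself a highmem page, so lowmem-only users such as block
    device mappings never end up with highmem backing. Illustrative helper,
    not the literal mm/page_alloc.c hunk.

        /* Sketch: choose GFP flags for the page that replaces @page during
         * CMA's contiguous-range migration. */
        static struct page *cma_replacement_page(struct page *page)
        {
                gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE;

                /* Only a highmem page may be replaced by another highmem page. */
                if (PageHighMem(page))
                        gfp_mask |= __GFP_HIGHMEM;

                return alloc_page(gfp_mask);
        }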
     

04 Jul, 2012

1 commit

  • Pull block bits from Jens Axboe:
    "As vacation is coming up, thought I'd better get rid of my pending
    changes in my for-linus branch for this iteration. It contains:

    - Two patches for mtip32xx. Killing a non-compliant sysfs interface
    and moving it to debugfs, where it belongs.

    - A few patches from Asias. Two legit bug fixes, and one killing an
    interface that is no longer in use.

    - A patch from Jan, making the annoying partition ioctl warning a bit
    less annoying, by restricting it to !CAP_SYS_RAWIO only.

    - Three bug fixes for drbd from Lars Ellenberg.

    - A fix for an old regression for umem, it hasn't really worked since
    the plugging scheme was changed in 3.0.

    - A few fixes from Tejun.

    - A splice fix from Eric Dumazet, fixing an issue with pipe
    resizing."

    * 'for-linus' of git://git.kernel.dk/linux-block:
    scsi: Silence unnecessary warnings about ioctl to partition
    block: Drop dead function blk_abort_queue()
    block: Mitigate lock unbalance caused by lock switching
    block: Avoid missed wakeup in request waitqueue
    umem: fix up unplugging
    splice: fix racy pipe->buffers uses
    drbd: fix null pointer dereference with on-congestion policy when diskless
    drbd: fix list corruption by failing but already aborted reads
    drbd: fix access of unallocated pages and kernel panic
    xen/blkfront: Add WARN to deal with misbehaving backends.
    blkcg: drop local variable @q from blkg_destroy()
    mtip32xx: Create debugfs entries for troubleshooting
    mtip32xx: Remove 'registers' and 'flags' from sysfs
    blkcg: fix blkg_alloc() failure path
    block: blkcg_policy_cfq shouldn't be used if !CONFIG_CFQ_GROUP_IOSCHED
    block: fix return value on cfq_init() failure
    mtip32xx: Remove version.h header file inclusion
    xen/blkback: Copy id field when doing BLKIF_DISCARD.

    Linus Torvalds
     

28 Jun, 2012

3 commits

  • Signed-off-by: Wanpeng Li
    Signed-off-by: Jiri Kosina

    Wanpeng Li
     
  • Since there are five lists in LRU cache, the array nr in get_scan_count
    should be:

    nr[0] = anon inactive pages to scan; nr[1] = anon active pages to scan
    nr[2] = file inactive pages to scan; nr[3] = file active pages to scan

    Signed-off-by: Wanpeng Li
    Reviewed-by: Rik van Riel
    Acked-by: Minchan Kim
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Jiri Kosina

    Wanpeng Li
     
  • This patch enables TLB flush range support in the generic mmu layer.

    Most architectures have their own TLB flush range support, like ARM
    and IA64. x86 has no such hardware support yet, but the 'invlpg'
    instruction can implement it to some degree. So enable this feature
    in the generic layer for x86 now; it may also be useful for other
    architectures later.

    The generic mmu_gather struct is protected by the macro
    HAVE_GENERIC_MMU_GATHER. Architectures that already support flush
    ranges have their own mmu_gather struct, so this change is safe for
    them.

    In the future we may unify this struct and the related functions
    across architectures.

    Thanks to Peter Zijlstra for his time and repeated reminders about
    keeping the code safe across multiple architectures!

    Signed-off-by: Alex Shi
    Link: http://lkml.kernel.org/r/1340845344-27557-7-git-send-email-alex.shi@intel.com
    Signed-off-by: H. Peter Anvin

    Alex Shi
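
    The idea in sketch form: the generic mmu_gather grows a start/end pair
    that every unmapped entry widens, so the final flush can target just
    that span instead of the whole mm. Field and helper names below follow
    the changelog wording; the details differ from the real
    include/asm-generic/tlb.h.

        /* Sketch: range tracking in the generic mmu_gather (HAVE_GENERIC_MMU_GATHER). */
        struct mmu_gather {
                struct mm_struct *mm;
                unsigned long start;    /* lowest address unmapped so far  */
                unsigned long end;      /* highest address unmapped so far */
                /* ... page batches, flags ... */
        };

        static inline void tlb_track_range(struct mmu_gather *tlb,
                                           unsigned long addr, unsigned long size)
        {
                if (tlb->start > addr)
                        tlb->start = addr;
                if (tlb->end < addr + size)
                        tlb->end = addr + size;
        }

        static inline void tlb_flush_sketch(struct mmu_gather *tlb)
        {
                /* arch hook (stand-in name); on x86 this can now use INVLPG
                 * when [start, end) spans only a few pages. */
                arch_flush_tlb_range(tlb->mm, tlb->start, tlb->end);
                tlb->start = TASK_SIZE;     /* reset to "empty" for the next batch */
                tlb->end   = 0;
        }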
     

21 Jun, 2012

7 commits

  • If the range passed to mbind() is not allocated on nodes set in the
    nodemask, it migrates the pages to respect the constraint.

    The final formal parameter of migrate_pages() is a mode of type enum
    migrate_mode, not a boolean. do_mbind() is currently passing "true",
    which is the equivalent of MIGRATE_SYNC_LIGHT. This should instead be MIGRATE_SYNC
    for synchronous page migration.

    Signed-off-by: David Rientjes
    Signed-off-by: Linus Torvalds

    David Rientjes
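
    The bug is easiest to see against the prototype: the last parameter of
    migrate_pages() is an enum migrate_mode (MIGRATE_ASYNC = 0,
    MIGRATE_SYNC_LIGHT = 1, MIGRATE_SYNC = 2), so a literal "true" silently
    becomes MIGRATE_SYNC_LIGHT. A sketch of the call before and after; the
    surrounding arguments are paraphrased from the 3.5-era do_mbind().

        /* Before: "true" promotes to 1 == MIGRATE_SYNC_LIGHT, not full sync. */
        nr_failed = migrate_pages(&pagelist, new_vma_page,
                                  (unsigned long)vma, false, true);

        /* After: mbind() wants synchronous migration, so say it explicitly. */
        nr_failed = migrate_pages(&pagelist, new_vma_page,
                                  (unsigned long)vma, false, MIGRATE_SYNC);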
     
  • __alloc_memory_core_early() asks memblock for a range of memory, then
    tries to reserve it. If the reserved region array lacks space for the new
    range, memblock_double_array() is called to allocate more space for the
    array. If memblock is used to allocate memory for the new array it can
    end up using a range that overlaps with the range originally allocated in
    __alloc_memory_core_early(), leading to possible data corruption.

    With this patch memblock_double_array() now calls memblock_find_in_range()
    with a narrowed candidate range (in cases where the reserved.regions array
    is being doubled) so any memory allocated will not overlap with the
    original range that was being reserved. The range is narrowed by passing
    in the starting address and size of the previously allocated range. Then
    the range above the ending address is searched and if a candidate is not
    found, the range below the starting address is searched.

    Signed-off-by: Greg Pearson
    Signed-off-by: Yinghai Lu
    Acked-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Pearson
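
    A sketch of the narrowed search: when doubling the reserved array, the
    candidate window excludes the range the caller is currently in the
    middle of reserving, trying the memory above it first and the memory
    below it second. Parameter names are illustrative, not the exact
    mm/memblock.c signature.

        /* Sketch: find room for the doubled regions array while avoiding
         * [exclude_start, exclude_start + exclude_size). */
        static phys_addr_t find_space_for_array(phys_addr_t exclude_start,
                                                phys_addr_t exclude_size,
                                                phys_addr_t new_array_size)
        {
                phys_addr_t addr;

                /* First look above the in-flight reservation ... */
                addr = memblock_find_in_range(exclude_start + exclude_size,
                                              MEMBLOCK_ALLOC_ACCESSIBLE,
                                              new_array_size, PAGE_SIZE);
                if (addr)
                        return addr;

                /* ... then below it. */
                return memblock_find_in_range(0, exclude_start,
                                              new_array_size, PAGE_SIZE);
        }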
     
  • Fix kernel-doc warnings in mm/memory.c:

    Warning(mm/memory.c:1377): No description found for parameter 'start'
    Warning(mm/memory.c:1377): Excess function parameter 'address' description in 'zap_page_range'

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Fix kernel-doc warnings such as

    Warning(../mm/page_cgroup.c:432): No description found for parameter 'id'
    Warning(../mm/page_cgroup.c:432): Excess function parameter 'mem' description in 'swap_cgroup_record'

    Signed-off-by: Wanpeng Li
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • Andrea asked for addr, end, vma->vm_start, and vma->vm_end to be emitted
    when !rwsem_is_locked(&tlb->mm->mmap_sem). Otherwise, debugging the
    underlying issue is more difficult.

    Suggested-by: Andrea Arcangeli
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • If use_hierarchy is set, reclaim testing soon oopses in css_is_ancestor()
    called from __mem_cgroup_same_or_subtree() called from page_referenced():
    when processes are exiting, it's easy for mm_match_cgroup() to pass along
    a NULL memcg coming from a NULL mm->owner.

    Check for that in __mem_cgroup_same_or_subtree(). Return true or false?
    False because we cannot know if it was in the hierarchy, but also false
    because it's better not to count a reference from an exiting process.

    Signed-off-by: Hugh Dickins
    Acked-by: Johannes Weiner
    Acked-by: Konstantin Khlebnikov
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
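
    The fix itself is a one-line guard; its shape, reconstructed from the
    description rather than copied from mm/memcontrol.c:

        /* Sketch of the guard added to __mem_cgroup_same_or_subtree(). */
        static bool __mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
                                                 struct mem_cgroup *memcg)
        {
                if (root_memcg == memcg)
                        return true;
                /* An exiting task can hand us a NULL memcg (NULL mm->owner):
                 * treat it as "not in the hierarchy". */
                if (!root_memcg->use_hierarchy || !memcg)
                        return false;
                return css_is_ancestor(&memcg->css, &root_memcg->css);
        }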
     
  • The divide in p->signal->oom_score_adj * totalpages / 1000 within
    oom_badness() was causing an overflow of the signed long data type.

    This adds both the root bias and p->signal->oom_score_adj before doing the
    normalization which fixes the issue and also cleans up the calculation.

    Tested-by: Dave Jones
    Signed-off-by: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
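
    The overflow is easy to reproduce where it actually bites, on a 32-bit
    kernel where "long" is 32 bits wide: oom_score_adj can be 1000, so
    adj * totalpages overflows long on large-memory PAE systems. A
    standalone demonstration using int32_t as a stand-in for a 32-bit
    long, followed by the shape of the fixed calculation (bias folded in
    before a single normalization); the numbers are illustrative, not the
    kernel's exact formula.

        #include <stdio.h>
        #include <stdint.h>

        int main(void)
        {
                /* A 32-bit PAE kernel with 64 GB of RAM. */
                int32_t totalpages = 64LL * 1024 * 1024 * 1024 / 4096;  /* 16777216 pages */
                int32_t adj = 1000;                   /* maximum oom_score_adj */
                int32_t rss_points = totalpages / 2;  /* say the task maps half of RAM */

                /* Old calculation: the product adj * totalpages is formed first
                 * and no longer fits in a 32-bit signed long. */
                int64_t product = (int64_t)adj * totalpages;
                printf("adj * totalpages = %lld, INT32_MAX = %d -> overflow\n",
                       (long long)product, INT32_MAX);

                /* Shape of the fix: fold the root bias into adj, then scale by
                 * totalpages / 1000 - dividing first keeps every intermediate
                 * value comfortably within 32 bits. */
                int32_t root_bonus = 30;              /* ~3% bonus for CAP_SYS_ADMIN tasks */
                int32_t bias = (adj - root_bonus) * (totalpages / 1000);
                printf("badness points: %d\n", rss_points + bias);
                return 0;
        }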