30 Jan, 2010

1 commit

  • After memory pressure has forced it to dip into the reserves, 2.6.32's
    5f8dcc21211a3d4e3a7a5ca366b469fb88117f61 "page-allocator: split per-cpu
    list into one-list-per-migrate-type" has been returning MIGRATE_RESERVE
    pages to the MIGRATE_MOVABLE free_list: in some sense depleting reserves.

    Fix that in the most straightforward way (which, considering the overheads
    of alternative approaches, is Mel's preference): the right migratetype is
    already in page_private(page), but free_pcppages_bulk() wasn't using it.

    How did this bug show up? As a 20% slowdown in my tmpfs loop kbuild
    swapping tests, on a PowerMac G5 with the SLUB allocator. Bisecting to
    that commit was easy, but explaining the magnitude of the slowdown was
    not.

    The same effect appears, but much less markedly, with SLAB, and even
    less markedly on other machines (the PowerMac divides into fewer zones
    than x86, I think that may be a factor). We guess that lumpy reclaim
    of short-lived high-order pages is implicated in some way, and probably
    this bug has been tickling a poor decision somewhere in page reclaim.

    But instrumentation hasn't told me much, I've run out of time and
    imagination to determine exactly what's going on, and shouldn't hold up
    the fix any longer: it's valid, and might even fix other misbehaviours.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
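The mechanics can be modelled in user space. This is an illustrative sketch with invented names (`free_count` stands in for the zone free lists, `private` for page_private(page)), not the kernel's actual data structures:

```c
#include <assert.h>

/* User-space model of the fix: each page records its migratetype in
 * page_private(page), and the bulk-free path must return the page to the
 * matching free list instead of assuming MIGRATE_MOVABLE. */
enum migratetype { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_RESERVE, MIGRATE_TYPES };

struct page { int private; };          /* stands in for page_private(page) */

static int free_count[MIGRATE_TYPES];  /* stands in for the per-type free lists */

/* Buggy behaviour: every per-cpu page went back onto the MOVABLE list. */
static void free_page_buggy(struct page *page)
{
    (void)page;
    free_count[MIGRATE_MOVABLE]++;
}

/* Fixed behaviour: use the migratetype already stored in the page. */
static void free_page_fixed(struct page *page)
{
    free_count[page->private]++;
}
```

With the buggy variant, a MIGRATE_RESERVE page silently migrates to the MOVABLE list on free, which is exactly the reserve depletion described above.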
     

28 Jan, 2010

1 commit

  • It's a simplified 'read_cache_page()' which takes a page allocation
    flag, so that different paths can control how aggressive the memory
    allocations are that populate an address space.

    In particular, the intel GPU object mapping code wants to be able to do
    a certain amount of its own internal memory management by automatically
    shrinking the address space when memory starts getting tight. This
    allows it to dynamically use different memory allocation policies on a
    per-allocation basis, rather than depend on the (static) address space
    gfp policy.

    The actual new function is a one-liner, but re-organizing the helper
    functions to the point where you can do this with a single line of code
    is what most of the patch is all about.

    Tested-by: Chris Wilson
    Signed-off-by: Linus Torvalds

    Linus Torvalds
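The layering can be sketched in user space (a hedged model with invented types and fields, not the kernel API): the gfp-taking variant does the work, and the classic entry point becomes a one-liner that passes the mapping's static mask.

```c
#include <assert.h>

/* Model: read_cache_page_gfp takes a per-call allocation policy; the
 * classic entry point just forwards the mapping's default policy. */
struct mapping { unsigned default_gfp; unsigned last_gfp_used; };

static int read_cache_page_gfp_model(struct mapping *m, unsigned long index, unsigned gfp)
{
    (void)index;
    m->last_gfp_used = gfp;    /* a real implementation allocates with gfp here */
    return 0;
}

/* The classic entry point becomes a one-liner over the new helper. */
static int read_cache_page_model(struct mapping *m, unsigned long index)
{
    return read_cache_page_gfp_model(m, index, m->default_gfp);
}
```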
     

21 Jan, 2010

1 commit

  • In free_unmap_vmap_area_noflush(), va->flags is marked VM_LAZY_FREE
    first, and then vmap_lazy_nr is increased atomically.

    But in __purge_vmap_area_lazy(), while traversing vmap_area_list, nr is
    counted by checking whether VM_LAZY_FREE is set in va->flags. After
    counting nr, the kernel reads vmap_lazy_nr atomically and checks a
    BUG_ON condition, that nr must not be greater than vmap_lazy_nr, to
    prevent vmap_lazy_nr from going negative.

    The problem is that, if the task is interrupted right after marking
    VM_LAZY_FREE, the increment of vmap_lazy_nr can be delayed.
    Consequently, the BUG_ON condition can be met, because nr counts more
    areas than vmap_lazy_nr reports.

    This is highly probable when vmalloc/vfree are called frequently. The
    scenario has been verified by adding a delay between marking
    VM_LAZY_FREE and increasing vmap_lazy_nr in
    free_unmap_vmap_area_noflush().

    Even though vmap_lazy_nr is used to check a high watermark, it need
    never be a strict watermark. And although the BUG_ON condition is meant
    to prevent vmap_lazy_nr from going negative, vmap_lazy_nr is a signed
    variable, so it can legitimately go negative temporarily.

    Consequently, removing the BUG_ON condition is the proper fix.

    A possible BUG_ON report looks like this:

    kernel BUG at mm/vmalloc.c:517!
    invalid opcode: 0000 [#1] SMP
    EIP: 0060:[] EFLAGS: 00010297 CPU: 3
    EIP is at __purge_vmap_area_lazy+0x144/0x150
    EAX: ee8a8818 EBX: c08e77d4 ECX: e7c7ae40 EDX: c08e77ec
    ESI: 000081fe EDI: e7c7ae60 EBP: e7c7ae64 ESP: e7c7ae3c
    DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
    Call Trace:
    [] free_unmap_vmap_area_noflush+0x69/0x70
    [] remove_vm_area+0x22/0x70
    [] __vunmap+0x45/0xe0
    [] vmalloc+0x2c/0x30
    Code: 8d 59 e0 eb 04 66 90 89 cb 89 d0 e8 87 fe ff ff 8b 43 20 89 da 8d 48 e0 8d 43 20 3b 04 24 75 e7 fe 05 a8 a5 a3 c0 e9 78 ff ff ff 0b eb fe 90 8d b4 26 00 00 00 00 56 89 c6 b8 ac a5 a3 c0 31
    EIP: [] __purge_vmap_area_lazy+0x144/0x150 SS:ESP 0068:e7c7ae3c

    [ See also http://marc.info/?l=linux-kernel&m=126335856228090&w=2 ]

    Signed-off-by: Yongseok Koh
    Reviewed-by: Minchan Kim
    Cc: Nick Piggin
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yongseok Koh
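The race window can be modelled in user space (an illustrative sketch, not kernel code; `vmap_lazy_nr` is atomic in the kernel but plain here):

```c
#include <assert.h>

#define VM_LAZY_FREE 0x1UL

/* Model: the freeing path sets VM_LAZY_FREE first and only afterwards
 * adds to vmap_lazy_nr, so a purge running in between counts more
 * lazily-freed areas than vmap_lazy_nr reports. */
struct vmap_area { unsigned long flags; };

static long vmap_lazy_nr;  /* atomic in the kernel; plain here */

static void mark_lazy_free(struct vmap_area *va) { va->flags |= VM_LAZY_FREE; }
static void account_lazy_free(void) { vmap_lazy_nr++; }

/* What the purge loop computes while walking the area list. */
static long count_lazy(const struct vmap_area *vas, int n)
{
    long nr = 0;
    for (int i = 0; i < n; i++)
        if (vas[i].flags & VM_LAZY_FREE)
            nr++;
    return nr;
}
```

Mark without accounting, and the purge-side count exceeds the counter — the transient state the removed BUG_ON mistook for a bug.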
     

17 Jan, 2010

8 commits

  • Commit f2260e6b (page allocator: update NR_FREE_PAGES only as necessary)
    introduced a minor regression: if __rmqueue() fails, the NR_FREE_PAGES
    statistic goes wrong. This patch fixes it.

    Signed-off-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Reviewed-by: Minchan Kim
    Reported-by: Huang Shijie
    Reviewed-by: Christoph Lameter
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
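The shape of the fix can be modelled in user space (names are illustrative, not the kernel's code): adjust the statistic only after the allocation has actually succeeded.

```c
#include <assert.h>

/* Model: NR_FREE_PAGES is decremented only after __rmqueue() has handed
 * out a page; adjusting first and bailing on failure skews the statistic. */
static long nr_free_pages = 4;

static void *rmqueue_stub(int should_fail)
{
    static int dummy_page;
    return should_fail ? 0 : (void *)&dummy_page;
}

static void *alloc_fixed(int should_fail)
{
    void *page = rmqueue_stub(should_fail);
    if (!page)
        return 0;        /* failure path leaves the statistic untouched */
    nr_free_pages--;     /* account only for pages really removed */
    return page;
}
```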
     
  • Fix a problem in NOMMU mmap with ramfs whereby a shared mmap can happen
    over the end of a truncation. The problem is that
    ramfs_nommu_check_mappings() checks the reduced file size against the
    VMA tree, but not the vm_region tree.

    The following sequence of events can cause the problem:

    fd = open("/tmp/x", O_RDWR|O_TRUNC|O_CREAT, 0600);
    ftruncate(fd, 32 * 1024);
    a = mmap(NULL, 32 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    b = mmap(NULL, 16 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    munmap(a, 32 * 1024);
    ftruncate(fd, 16 * 1024);
    c = mmap(NULL, 32 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

    Mapping 'a' creates a vm_region covering 32KB of the file. Mapping 'b'
    sees that the vm_region from 'a' is covering the region it wants and so
    shares it, pinning it in memory.

    Mapping 'a' then goes away and the file is truncated to the end of VMA
    'b'. However, the region allocated by 'a' is still in effect, and has
    _not_ been reduced.

    Mapping 'c' is then created, and because there's a vm_region covering the
    desired region, get_unmapped_area() is _not_ called to repeat the check,
    and the mapping is granted, even though the pages from the latter half of
    the mapping have been discarded.

    However:

    d = mmap(NULL, 16 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

    Mapping 'd' should work, and should end up sharing the region allocated by
    'a'.

    To deal with this, we shrink the vm_region struct during the truncation,
    lest do_mmap_pgoff() take it as licence to share the full region
    automatically without calling the get_unmapped_area() file op again.

    Signed-off-by: David Howells
    Acked-by: Al Viro
    Cc: Greg Ungerer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
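The effect of the fix on the a/b/c/d sequence above can be modelled in user space (a hedged sketch; the struct is a stand-in, not the kernel's vm_region):

```c
#include <assert.h>

/* Model: truncation shrinks the shared vm_region itself, so a later
 * shared mmap of the old, larger size can no longer piggyback on it and
 * must redo the get_unmapped_area() check. */
struct vm_region_model { unsigned long size; };

static void truncate_region(struct vm_region_model *r, unsigned long new_size)
{
    if (new_size < r->size)
        r->size = new_size;
}

/* Can an existing region satisfy a new shared mapping of len bytes? */
static int region_can_share(const struct vm_region_model *r, unsigned long len)
{
    return len <= r->size;
}
```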
     
  • get_unmapped_area() is unnecessary for NOMMU as no-one calls it.

    Signed-off-by: David Howells
    Acked-by: Al Viro
    Cc: Greg Ungerer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • In split_vma(), there's no need to check if the VMA being split has a
    region that's in use by more than one VMA because:

    (1) The preceding test prohibits splitting of non-anonymous VMAs and regions
    (eg: file or chardev backed VMAs).

    (2) Anonymous regions can't be mapped multiple times because there's no handle
    by which to refer to the already existing region.

    (3) If a VMA has previously been split, then the region backing it has also
    been split into two regions, each of usage 1.

    Signed-off-by: David Howells
    Acked-by: Al Viro
    Cc: Greg Ungerer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • The vm_usage count field in struct vm_region does not need to be atomic
    as it's only ever modified whilst nommu_region_sem is write locked.

    Signed-off-by: David Howells
    Acked-by: Al Viro
    Cc: Greg Ungerer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Current mem_cgroup_force_empty() only ensures mem->res.usage == 0 on
    success. But this doesn't guarantee the memcg's LRU is really empty,
    because there are some cases in which !PageCgroupUsed pages exist on the
    memcg's LRU.

    For example:
    - Pages can be uncharged by their owner process while they are on the LRU.
    - A race between mem_cgroup_add_lru_list() and __mem_cgroup_uncharge_common().

    So there can be a case in which the usage is zero but some of the LRUs are not empty.

    OTOH, mem_cgroup_del_lru_list(), which can be called asynchronously with
    rmdir, accesses the mem_cgroup, so this access can cause a problem if it
    races with rmdir because the mem_cgroup might have been freed by rmdir.

    Actually, I saw a bug which seems to be caused by this race.

    [1530745.949906] BUG: unable to handle kernel NULL pointer dereference at 0000000000000230
    [1530745.950651] IP: [] mem_cgroup_del_lru_list+0x30/0x80
    [1530745.950651] PGD 3863de067 PUD 3862c7067 PMD 0
    [1530745.950651] Oops: 0002 [#1] SMP
    [1530745.950651] last sysfs file: /sys/devices/system/cpu/cpu7/cache/index1/shared_cpu_map
    [1530745.950651] CPU 3
    [1530745.950651] Modules linked in: configs ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp nfsd nfs_acl auth_rpcgss exportfs autofs4 hidp rfcomm l2cap crc16 bluetooth lockd sunrpc ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic uio ipv6 cxgb3i cxgb3 mdio libiscsi_tcp libiscsi scsi_transport_iscsi dm_mirror dm_multipath scsi_dh video output sbs sbshc battery ac lp kvm_intel kvm sg ide_cd_mod cdrom serio_raw tpm_tis tpm tpm_bios acpi_memhotplug button parport_pc parport rtc_cmos rtc_core rtc_lib e1000 i2c_i801 i2c_core pcspkr dm_region_hash dm_log dm_mod ata_piix libata shpchp megaraid_mbox sd_mod scsi_mod megaraid_mm ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: freq_table]
    [1530745.950651] Pid: 19653, comm: shmem_test_02 Tainted: G M 2.6.32-mm1-00701-g2b04386 #3 Express5800/140Rd-4 [N8100-1065]
    [1530745.950651] RIP: 0010:[] [] mem_cgroup_del_lru_list+0x30/0x80
    [1530745.950651] RSP: 0018:ffff8803863ddcb8 EFLAGS: 00010002
    [1530745.950651] RAX: 00000000000001e0 RBX: ffff8803abc02238 RCX: 00000000000001e0
    [1530745.950651] RDX: 0000000000000000 RSI: ffff88038611a000 RDI: ffff8803abc02238
    [1530745.950651] RBP: ffff8803863ddcc8 R08: 0000000000000002 R09: ffff8803a04c8643
    [1530745.950651] R10: 0000000000000000 R11: ffffffff810c7333 R12: 0000000000000000
    [1530745.950651] R13: ffff880000017f00 R14: 0000000000000092 R15: ffff8800179d0310
    [1530745.950651] FS: 0000000000000000(0000) GS:ffff880017800000(0000) knlGS:0000000000000000
    [1530745.950651] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [1530745.950651] CR2: 0000000000000230 CR3: 0000000379d87000 CR4: 00000000000006e0
    [1530745.950651] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [1530745.950651] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    [1530745.950651] Process shmem_test_02 (pid: 19653, threadinfo ffff8803863dc000, task ffff88038612a8a0)
    [1530745.950651] Stack:
    [1530745.950651] ffffea00040c2fe8 0000000000000000 ffff8803863ddd98 ffffffff810c739a
    [1530745.950651] 00000000863ddd18 000000000000000c 0000000000000000 0000000000000000
    [1530745.950651] 0000000000000002 0000000000000000 ffff8803863ddd68 0000000000000046
    [1530745.950651] Call Trace:
    [1530745.950651] [] release_pages+0x142/0x1e7
    [1530745.950651] [] ? pagevec_move_tail+0x6e/0x112
    [1530745.950651] [] pagevec_move_tail+0xfd/0x112
    [1530745.950651] [] lru_add_drain+0x76/0x94
    [1530745.950651] [] exit_mmap+0x6e/0x145
    [1530745.950651] [] mmput+0x5e/0xcf
    [1530745.950651] [] exit_mm+0x11c/0x129
    [1530745.950651] [] ? audit_free+0x196/0x1c9
    [1530745.950651] [] do_exit+0x1f5/0x6b7
    [1530745.950651] [] ? up_read+0x2b/0x2f
    [1530745.950651] [] ? lockdep_sys_exit_thunk+0x35/0x67
    [1530745.950651] [] do_group_exit+0x83/0xb0
    [1530745.950651] [] sys_exit_group+0x17/0x1b
    [1530745.950651] [] system_call_fastpath+0x16/0x1b
    [1530745.950651] Code: 54 53 0f 1f 44 00 00 83 3d cc 29 7c 00 00 41 89 f4 75 63 eb 4e 48 83 7b 08 00 75 04 0f 0b eb fe 48 89 df e8 18 f3 ff ff 44 89 e2 ff 4c d0 50 48 8b 05 2b 2d 7c 00 48 39 43 08 74 39 48 8b 4b
    [1530745.950651] RIP [] mem_cgroup_del_lru_list+0x30/0x80
    [1530745.950651] RSP
    [1530745.950651] CR2: 0000000000000230
    [1530745.950651] ---[ end trace c3419c1bb8acc34f ]---
    [1530745.950651] Fixing recursive fault but reboot is needed!

    The problem here is that pages on the LRU may contain a pointer to a
    stale memcg. To make res->usage zero, all pages on the memcg must be
    uncharged or moved to another (parent) memcg. A moved page_cgroup has
    already been removed from the original LRU, but an uncharged page_cgroup
    still contains a pointer to the memcg, without the PCG_USED bit. (This
    asynchronous LRU work is for improving performance.) If the PCG_USED
    bit is not set, the page_cgroup will never be added to the memcg's LRU,
    so pages not on an LRU never access the stale pointer. What we have to
    take care of, then, are page_cgroups _on_ an LRU list. This patch fixes
    the problem by making mem_cgroup_force_empty() visit all LRUs before
    exiting its loop, guaranteeing there are no pages left on its LRUs.

    Signed-off-by: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
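The loop shape can be sketched in user space (an illustrative model with invented counters, not the memcg code): the exit condition is "every LRU observed empty", not "usage reached zero".

```c
#include <assert.h>

#define NR_LRU_LISTS 4

/* Model: force_empty keeps draining until every LRU list is empty, so no
 * page_cgroup can be left pointing at the soon-to-be-freed memcg. */
static int lru_len[NR_LRU_LISTS];

static void force_empty_model(void)
{
    int busy;
    do {
        busy = 0;
        for (int lru = 0; lru < NR_LRU_LISTS; lru++) {
            if (lru_len[lru] > 0) {
                lru_len[lru]--;   /* stands in for draining one page */
                busy = 1;
            }
        }
    } while (busy);               /* only exit once all LRUs are empty */
}
```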
     
  • Commit f50de2d3 (vmscan: have kswapd sleep for a short interval and double
    check it should be asleep) can cause kswapd to enter an infinite loop if
    running on a single-CPU system. If all zones are unreclaimable,
    sleeping_prematurely() returns 1 and kswapd will call balance_pgdat()
    again. But that is totally meaningless: balance_pgdat() can't do
    anything about an unreclaimable zone!

    Signed-off-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Reported-by: Will Newton
    Reviewed-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Tested-by: Will Newton
    Reviewed-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
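The fixed decision can be modelled in user space (a hedged sketch; the real sleeping_prematurely() checks zone watermarks, modelled here as a trivial return):

```c
#include <assert.h>

/* Model: when every zone is unreclaimable, reporting the sleep as
 * premature only sends kswapd back into balance_pgdat() with nothing to
 * do, so the check must allow it to sleep instead. */
static int sleeping_prematurely_model(const int *zone_unreclaimable, int nzones)
{
    int all_unreclaimable = 1;
    for (int i = 0; i < nzones; i++)
        if (!zone_unreclaimable[i])
            all_unreclaimable = 0;
    if (all_unreclaimable)
        return 0;   /* balance_pgdat() could not help; let kswapd sleep */
    return 1;       /* stand-in for the real watermark re-checks */
}
```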
     
  • The current check for 'backward merging' within add_active_range() does
    not seem correct. start_pfn must be compared against
    early_node_map[i].start_pfn (and NOT against .end_pfn) to find out whether
    the new region is backward-mergeable with the existing range.

    Signed-off-by: Kazuhisa Ichikawa
    Acked-by: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kazuhisa Ichikawa
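The corrected test can be sketched as follows (a user-space model; the exact comparators are illustrative, the point is that the comparison is against .start_pfn, not .end_pfn):

```c
#include <assert.h>

/* Model: a new range extends an existing one backwards when it starts
 * before the existing start_pfn and reaches it. */
struct pfn_range { unsigned long start_pfn, end_pfn; };

static int backward_mergeable(const struct pfn_range *r,
                              unsigned long start_pfn, unsigned long end_pfn)
{
    return start_pfn < r->start_pfn && end_pfn >= r->start_pfn;
}
```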
     

14 Jan, 2010

1 commit

  • If __block_prepare_write() fails in block_write_begin(), the allocated
    blocks can be outside of ->i_size.

    But the new truncate_pagecache() in vmtruncate() does nothing unless
    new < old. That means the above case is no longer handled.

    So this patch fixes it by removing the "new < old" check. More
    cleanup/change will be needed, but with -rc out and the truncate rework
    in progress, this makes the minimum fix for now.

    Acked-by: Nick Piggin
    Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
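The failure case can be modelled in user space (invented names; a hedged sketch of the behaviour, not the VFS code): the failed-write cleanup calls truncate with new == old, which the old guard skipped entirely.

```c
#include <assert.h>

/* Model: with the "new < old" guard, a truncate called at an unchanged
 * i_size does nothing, leaving blocks allocated past EOF; with the guard
 * removed, the trim always runs. */
static int past_eof_blocks;

static void truncate_model(long old_size, long new_size, int with_guard)
{
    if (with_guard && !(new_size < old_size))
        return;              /* old behaviour: nothing happens when new == old */
    past_eof_blocks = 0;     /* stands in for trimming past-EOF state */
}
```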
     

12 Jan, 2010

2 commits

  • sz is in bytes, MAX_ORDER_NR_PAGES is in pages.

    Signed-off-by: Andrea Arcangeli
    Acked-by: David Gibson
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
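The unit mismatch is the whole bug, so a one-line model captures it (constants here are illustrative, assuming 4 KiB pages): one side must be scaled by PAGE_SHIFT before comparing.

```c
#include <assert.h>

/* Model: sz is in bytes while MAX_ORDER_NR_PAGES counts pages; convert
 * bytes to pages before the comparison. */
#define MODEL_PAGE_SHIFT         12
#define MODEL_MAX_ORDER_NR_PAGES 2048UL

static int fits_in_max_order(unsigned long sz_bytes)
{
    return (sz_bytes >> MODEL_PAGE_SHIFT) <= MODEL_MAX_ORDER_NR_PAGES;
}
```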
     
  • __pcpu_ptr_to_addr() can be overridden by the architecture and might not
    behave well if passed a NULL pointer. So avoid calling it until we have
    verified that its arg is not NULL.

    Cc: Rusty Russell
    Cc: Kamalesh Babulal
    Acked-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
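The ordering constraint can be sketched in user space (the conversion stub is invented for illustration): the NULL check must come before any conversion, since the arch override may not tolerate NULL.

```c
#include <assert.h>

/* Model: the pointer-to-address conversion (which an arch may override)
 * runs only after the NULL check, never on NULL itself. */
static void *ptr_to_addr_stub(void *p)
{
    return (char *)p + 0x10;   /* illustrative conversion */
}

static void *lookup_addr_safe(void *pcpu_ptr)
{
    if (!pcpu_ptr)
        return 0;              /* bail out before any conversion */
    return ptr_to_addr_stub(pcpu_ptr);
}
```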
     

08 Jan, 2010

1 commit


07 Jan, 2010

2 commits

  • The MMU code uses the copy_*_user_page() variants in access_process_vm()
    rather than copy_*_user() as the former includes an icache flush. This
    is important when doing things like setting software breakpoints with
    gdb. So switch the NOMMU code over to do the same.

    This patch makes the reasonable assumption that copy_from_user_page()
    won't fail - which is probably fine, as we've checked the VMA from which
    we're copying is usable, and the copy is not allowed to cross VMAs. The
    one case where it might go wrong is if the VMA maps a device rather than
    RAM, and that device returns an error - in which case rubbish will be
    returned rather than EIO.

    Signed-off-by: Jie Zhang
    Signed-off-by: Mike Frysinger
    Signed-off-by: David Howells
    Acked-by: David McCullough
    Acked-by: Paul Mundt
    Acked-by: Greg Ungerer
    Signed-off-by: Linus Torvalds

    Jie Zhang
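The pairing can be modelled in user space (a hedged sketch with stub names, not the arch implementation): copy_to_user_page() is the plain copy plus an icache flush over the written range.

```c
#include <assert.h>
#include <string.h>

/* Model: the icache flush after the copy is what makes a freshly written
 * breakpoint visible to instruction fetch. */
static int icache_flushes;

static void flush_icache_stub(void *addr, unsigned long len)
{
    (void)addr; (void)len;
    icache_flushes++;
}

static void copy_to_user_page_model(void *dst, const void *src, unsigned long len)
{
    memcpy(dst, src, len);          /* the plain copy part */
    flush_icache_stub(dst, len);    /* the step the NOMMU path was missing */
}
```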
     
  • When working with FDPIC, there are many shared mappings of read-only
    code regions between applications (the C library, applet packages like
    busybox, etc.), but the current do_mmap_pgoff() function will issue an
    icache flush whenever a VMA is added to an MM instead of only doing it
    when the map is initially created.

    The flush can instead be done when a region is first mmapped PROT_EXEC.
    Note that we may not rely on the first mapping of a region being
    executable - it's possible for it to be PROT_READ only, so we have to
    remember whether we've flushed the region or not, and then flush the
    entire region when a bit of it is made executable.

    However, this also affects the brk area. That will no longer be
    executable. We can mprotect() it to PROT_EXEC on MPU-mode kernels, but
    for NOMMU mode kernels, when it increases the brk allocation, making
    sys_brk() flush the extra from the icache should suffice. The brk area
    probably isn't used by NOMMU programs since the brk area can only use up
    the leavings from the stack allocation, where the stack allocation is
    larger than requested.

    Signed-off-by: David Howells
    Signed-off-by: Mike Frysinger
    Signed-off-by: Linus Torvalds

    Mike Frysinger
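The flush-once policy can be sketched in user space (illustrative structure, not the kernel's vm_region): remember per-region whether the icache has been flushed, and flush the whole region the first time any part of it becomes executable.

```c
#include <assert.h>

/* Model: a region is flushed at most once, when first made executable,
 * instead of on every VMA added to an MM. */
struct region_model { int icache_flushed; int flush_calls; };

static void map_region(struct region_model *r, int prot_exec)
{
    if (prot_exec && !r->icache_flushed) {
        r->flush_calls++;       /* stands in for flushing the whole region */
        r->icache_flushed = 1;
    }
}
```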
     

31 Dec, 2009

2 commits


29 Dec, 2009

1 commit

  • Commit ce79ddc8e2376a9a93c7d42daf89bfcbb9187e62 ("SLAB: Fix lockdep annotations
    for CPU hotplug") broke init_node_lock_keys() off-slab logic which causes
    lockdep false positives.

    Fix that up by reverting the logic back to original while keeping CPU hotplug
    fixes intact.

    Reported-and-tested-by: Heiko Carstens
    Reported-and-tested-by: Andi Kleen
    Signed-off-by: Pekka Enberg

    Pekka Enberg
     

25 Dec, 2009

2 commits


24 Dec, 2009

1 commit


23 Dec, 2009

1 commit

  • * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (36 commits)
    powerpc/gc/wii: Remove get_irq_desc()
    powerpc/gc/wii: hlwd-pic: convert irq_desc.lock to raw_spinlock
    powerpc/gamecube/wii: Fix off-by-one error in ugecon/usbgecko_udbg
    powerpc/mpic: Fix problem that affinity is not updated
    powerpc/mm: Fix stupid bug in subpge protection handling
    powerpc/iseries: use DECLARE_COMPLETION_ONSTACK for non-constant completion
    powerpc: Fix MSI support on U4 bridge PCIe slot
    powerpc: Handle VSX alignment faults correctly in little-endian mode
    powerpc/mm: Fix typo of cpumask_clear_cpu()
    powerpc/mm: Fix hash_utils_64.c compile errors with DEBUG enabled.
    powerpc: Convert BUG() to use unreachable()
    powerpc/pseries: Make declarations of cpu_hotplug_driver_lock() ANSI compatible.
    powerpc/pseries: Don't panic when H_PROD fails during cpu-online.
    powerpc/mm: Fix a WARN_ON() with CONFIG_DEBUG_PAGEALLOC and CONFIG_DEBUG_VM
    powerpc/defconfigs: Set HZ=100 on pseries and ppc64 defconfigs
    powerpc/defconfigs: Disable token ring in powerpc defconfigs
    powerpc/defconfigs: Reduce 64bit vmlinux by making acenic and cramfs modules
    powerpc/pseries: Select XICS and PCI_MSI PSERIES
    powerpc/85xx: Wrong variable returned on error
    powerpc/iseries: Convert to proc_fops
    ...

    Linus Torvalds
     

22 Dec, 2009

1 commit

  • The injector filter requires stable_page_flags() which is supplied
    by procfs. So make it dependent on that.

    Also add ifdefs around the filter code in memory-failure.c so that
    when the filter is disabled due to missing dependencies the whole
    code still builds.

    Reported-by: Ingo Molnar
    Signed-off-by: Andi Kleen

    Andi Kleen
     

20 Dec, 2009

1 commit

  • …git/tip/linux-2.6-tip

    * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86, irq: Allow 0xff for /proc/irq/[n]/smp_affinity on an 8-cpu system
    Makefile: Unexport LC_ALL instead of clearing it
    x86: Fix objdump version check in arch/x86/tools/chkobjdump.awk
    x86: Reenable TSC sync check at boot, even with NONSTOP_TSC
    x86: Don't use POSIX character classes in gen-insn-attr-x86.awk
    Makefile: set LC_CTYPE, LC_COLLATE, LC_NUMERIC to C
    x86: Increase MAX_EARLY_RES; insufficient on 32-bit NUMA
    x86: Fix checking of SRAT when node 0 ram is not from 0
    x86, cpuid: Add "volatile" to asm in native_cpuid()
    x86, msr: msrs_alloc/free for CONFIG_SMP=n
    x86, amd: Get multi-node CPU info from NodeId MSR instead of PCI config space
    x86: Add IA32_TSC_AUX MSR and use it
    x86, msr/cpuid: Register enough minors for the MSR and CPUID drivers
    initramfs: add missing decompressor error check
    bzip2: Add missing checks for malloc returning NULL
    bzip2/lzma/gzip: pre-boot malloc doesn't return NULL on failure

    Linus Torvalds
     

18 Dec, 2009

5 commits

  • Memory balloon drivers can allocate a large amount of memory which is not
    movable but could be freed to accommodate memory hotplug remove.

    Prior to calling the memory hotplug notifier chain the memory in the
    pageblock is isolated. Currently, if the migrate type is not
    MIGRATE_MOVABLE the isolation will not proceed, causing the memory removal
    for that page range to fail.

    Rather than failing pageblock isolation if the migratetype is not
    MIGRATE_MOVABLE, this patch checks whether all of the pages in the
    pageblock that are not on the LRU are owned by a registered balloon
    driver (or other entity), using a notifier chain. If all of the non-movable pages are owned
    by a balloon, they can be freed later through the memory notifier chain
    and the range can still be isolated in set_migratetype_isolate().

    Signed-off-by: Robert Jennings
    Cc: Mel Gorman
    Cc: Ingo Molnar
    Cc: Brian King
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Gerald Schaefer
    Cc: KAMEZAWA Hiroyuki
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Benjamin Herrenschmidt

    Robert Jennings
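The relaxed isolation rule can be modelled in user space (page kinds are an invented abstraction standing in for the free/LRU/balloon-notifier checks):

```c
#include <assert.h>

/* Model: a non-MOVABLE pageblock may still be isolated when every page
 * that is neither free nor on the LRU is claimed by a balloon driver. */
enum page_kind { PAGE_FREE, PAGE_LRU, PAGE_BALLOON, PAGE_UNKNOWN };

static int can_isolate(const enum page_kind *pages, int n)
{
    for (int i = 0; i < n; i++)
        if (pages[i] == PAGE_UNKNOWN)   /* unclaimed page: refuse isolation */
            return 0;
    return 1;                           /* free/LRU/balloon pages are all fine */
}
```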
     
  • …/rusty/linux-2.6-for-linus

    * 'cpumask-cleanups' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
    cpumask: rename tsk_cpumask to tsk_cpus_allowed
    cpumask: don't recommend set_cpus_allowed hack in Documentation/cpu-hotplug.txt
    cpumask: avoid dereferencing struct cpumask
    cpumask: convert drivers/idle/i7300_idle.c to cpumask_var_t
    cpumask: use modern cpumask style in drivers/scsi/fcoe/fcoe.c
    cpumask: avoid deprecated function in mm/slab.c
    cpumask: use cpu_online in kernel/perf_event.c

    Linus Torvalds
     
  • …s/security-testing-2.6

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6:
    Keys: KEYCTL_SESSION_TO_PARENT needs TIF_NOTIFY_RESUME architecture support
    NOMMU: Optimise away the {dac_,}mmap_min_addr tests
    security/min_addr.c: make init_mmap_min_addr() static
    keys: PTR_ERR return of wrong pointer in keyctl_get_security()

    Linus Torvalds
     
  • * 'kmemleak' of git://linux-arm.org/linux-2.6:
    kmemleak: fix kconfig for crc32 build error
    kmemleak: Reduce the false positives by checking for modified objects
    kmemleak: Show the age of an unreferenced object
    kmemleak: Release the object lock before calling put_object()
    kmemleak: Scan the _ftrace_events section in modules
    kmemleak: Simplify the kmemleak_scan_area() function prototype
    kmemleak: Do not use off-slab management with SLAB_NOLEAKTRACE

    Linus Torvalds
     
  • I added blk_run_backing_dev on page_cache_async_readahead so readahead I/O
    is unplugged to improve throughput, especially in RAID environments.

    The normal case is, if page N become uptodate at time T(N), then T(N)
    Acked-by: Wu Fengguang
    Cc: Jens Axboe
    Cc: KOSAKI Motohiro
    Tested-by: Ronald
    Cc: Bart Van Assche
    Cc: Vladislav Bolkhovitin
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hisashi Hifumi
     

17 Dec, 2009

9 commits

  • These days we use cpumask_empty() which takes a pointer.

    Signed-off-by: Rusty Russell
    Acked-by: Christoph Lameter

    Rusty Russell
     
  • Replacing
    error = 0;
    if (error)
    op
    with nothing is not quite an equivalent transformation ;-)

    Signed-off-by: Al Viro

    Al Viro
     
  • Found one system that boots from socket1 instead of socket0, and SRAT
    gets rejected...

    [ 0.000000] SRAT: Node 1 PXM 0 0-a0000
    [ 0.000000] SRAT: Node 1 PXM 0 100000-80000000
    [ 0.000000] SRAT: Node 1 PXM 0 100000000-2080000000
    [ 0.000000] SRAT: Node 0 PXM 1 2080000000-4080000000
    [ 0.000000] SRAT: Node 2 PXM 2 4080000000-6080000000
    [ 0.000000] SRAT: Node 3 PXM 3 6080000000-8080000000
    [ 0.000000] SRAT: Node 4 PXM 4 8080000000-a080000000
    [ 0.000000] SRAT: Node 5 PXM 5 a080000000-c080000000
    [ 0.000000] SRAT: Node 6 PXM 6 c080000000-e080000000
    [ 0.000000] SRAT: Node 7 PXM 7 e080000000-10080000000
    ...
    [ 0.000000] NUMA: Allocated memnodemap from 500000 - 701040
    [ 0.000000] NUMA: Using 20 for the hash shift.
    [ 0.000000] Adding active range (0, 0x2080000, 0x4080000) 0 entries of 3200 used
    [ 0.000000] Adding active range (1, 0x0, 0x96) 1 entries of 3200 used
    [ 0.000000] Adding active range (1, 0x100, 0x7f750) 2 entries of 3200 used
    [ 0.000000] Adding active range (1, 0x100000, 0x2080000) 3 entries of 3200 used
    [ 0.000000] Adding active range (2, 0x4080000, 0x6080000) 4 entries of 3200 used
    [ 0.000000] Adding active range (3, 0x6080000, 0x8080000) 5 entries of 3200 used
    [ 0.000000] Adding active range (4, 0x8080000, 0xa080000) 6 entries of 3200 used
    [ 0.000000] Adding active range (5, 0xa080000, 0xc080000) 7 entries of 3200 used
    [ 0.000000] Adding active range (6, 0xc080000, 0xe080000) 8 entries of 3200 used
    [ 0.000000] Adding active range (7, 0xe080000, 0x10080000) 9 entries of 3200 used
    [ 0.000000] SRAT: PXMs only cover 917504MB of your 1048566MB e820 RAM. Not used.
    [ 0.000000] SRAT: SRAT not used.

    The early_node_map is not sorted, because node0 with a non-zero start
    comes first.

    So try to sort it right away after all regions are registered.

    This also fixes a regression from 8716273c (x86: Export srat physical
    topology).

    -v2: make it more robust, to handle cross-node cases like node0 [0,4g), [8,12g) and node1 [4g, 8g), [12g, 16g)
    -v3: update comments.

    Reported-and-tested-by: Jens Axboe
    Signed-off-by: Yinghai Lu
    LKML-Reference:
    Signed-off-by: H. Peter Anvin

    Yinghai Lu
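The idea can be sketched in user space, with qsort standing in for the kernel's sort() (struct and function names are invented for illustration):

```c
#include <assert.h>
#include <stdlib.h>

/* Model: sort the registered ranges by start_pfn once all of them are in,
 * so a node with a non-zero start registered first cannot leave the map
 * unsorted. */
struct active_range { unsigned long start_pfn, end_pfn; };

static int cmp_start_pfn(const void *a, const void *b)
{
    const struct active_range *ra = a, *rb = b;
    return (ra->start_pfn > rb->start_pfn) - (ra->start_pfn < rb->start_pfn);
}

static void sort_node_map(struct active_range *map, int n)
{
    qsort(map, n, sizeof(*map), cmp_start_pfn);
}
```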
     
  • In NOMMU mode clamp dac_mmap_min_addr to zero to cause the tests on it to be
    skipped by the compiler. We do this as the minimum mmap address doesn't make
    any sense in NOMMU mode.

    mmap_min_addr and round_hint_to_min() can be discarded entirely in NOMMU mode.

    Signed-off-by: David Howells
    Acked-by: Eric Paris
    Signed-off-by: James Morris

    David Howells
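The optimisation can be sketched as follows (macro names are invented; the point is that a compile-time zero makes the range test constant-false, so the compiler discards it):

```c
#include <assert.h>

/* Model: under NOMMU the minimum mmap address is clamped to a
 * compile-time zero, so the check below can never fire. */
#ifdef MODEL_CONFIG_MMU
#define MODEL_MMAP_MIN_ADDR 4096UL
#else
#define MODEL_MMAP_MIN_ADDR 0UL   /* NOMMU: minimum mmap address is meaningless */
#endif

static int below_min_addr(unsigned long addr)
{
    return addr < MODEL_MMAP_MIN_ADDR;   /* always false when the constant is 0 */
}
```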
     
  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (34 commits)
    HWPOISON: Remove stray phrase in a comment
    HWPOISON: Try to allocate migration page on the same node
    HWPOISON: Don't do early filtering if filter is disabled
    HWPOISON: Add a madvise() injector for soft page offlining
    HWPOISON: Add soft page offline support
    HWPOISON: Undefine short-hand macros after use to avoid namespace conflict
    HWPOISON: Use new shake_page in memory_failure
    HWPOISON: Use correct name for MADV_HWPOISON in documentation
    HWPOISON: mention HWPoison in Kconfig entry
    HWPOISON: Use get_user_page_fast in hwpoison madvise
    HWPOISON: add an interface to switch off/on all the page filters
    HWPOISON: add memory cgroup filter
    memcg: add accessor to mem_cgroup.css
    memcg: rename and export try_get_mem_cgroup_from_page()
    HWPOISON: add page flags filter
    mm: export stable page flags
    HWPOISON: limit hwpoison injector to known page types
    HWPOISON: add fs/device filters
    HWPOISON: return 0 to indicate success reliably
    HWPOISON: make semantics of IGNORED/DELAYED clear
    ...

    Linus Torvalds
     
  • * 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (38 commits)
    direct I/O fallback sync simplification
    ocfs: stop using do_sync_mapping_range
    cleanup blockdev_direct_IO locking
    make generic_acl slightly more generic
    sanitize xattr handler prototypes
    libfs: move EXPORT_SYMBOL for d_alloc_name
    vfs: force reval of target when following LAST_BIND symlinks (try #7)
    ima: limit imbalance msg
    Untangling ima mess, part 3: kill dead code in ima
    Untangling ima mess, part 2: deal with counters
    Untangling ima mess, part 1: alloc_file()
    O_TRUNC open shouldn't fail after file truncation
    ima: call ima_inode_free ima_inode_free
    IMA: clean up the IMA counts updating code
    ima: only insert at inode creation time
    ima: valid return code from ima_inode_alloc
    fs: move get_empty_filp() deffinition to internal.h
    Sanitize exec_permission_lite()
    Kill cached_lookup() and real_lookup()
    Kill path_lookup_open()
    ...

    Trivial conflicts in fs/direct-io.c

    Linus Torvalds
     
  • In the case of direct I/O falling back to buffered I/O we sync data
    twice currently: once at the end of generic_file_buffered_write using
    filemap_write_and_wait_range and once a little later in
    __generic_file_aio_write using do_sync_mapping_range with all flags set.

    The wait before write of the do_sync_mapping_range call does not make
    any sense, so just keep the filemap_write_and_wait_range call and move
    it to the right spot.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Now that we cache the ACL pointers in the generic inode all the generic_acl
    cruft can go away and generic_acl.c can directly implement xattr handlers
    dealing with the full Posix ACL semantics for in-memory filesystems.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Add a flags argument to struct xattr_handler and pass it to all xattr
    handler methods. This allows using the same methods for multiple
    handlers, e.g. for the ACL methods which perform exactly the same action
    for the access and default ACLs, just using a different underlying
    attribute. With a little more groundwork it'll also allow sharing the
    methods for the regular user/trusted/secure handlers in extN, ocfs2 and
    jffs2 like it's already done for xfs in this patch.

    Also change the inode argument to the handlers to a dentry to allow
    using the handler mechanism for filesystems that require it later,
    e.g. cifs.

    [with GFS2 bits updated by Steven Whitehouse ]

    Signed-off-by: Christoph Hellwig
    Reviewed-by: James Morris
    Acked-by: Joel Becker
    Signed-off-by: Al Viro

    Christoph Hellwig
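The sharing this enables can be sketched in user space (types are heavily simplified and the names are invented, not the kernel's struct xattr_handler): one get method serves both ACL handlers, with the handler's flags field selecting the underlying attribute.

```c
#include <assert.h>

/* Model: handler->flags is passed through to the shared method, which
 * uses it to pick the access vs. default ACL attribute. */
enum { MODEL_ACL_ACCESS, MODEL_ACL_DEFAULT };

struct xattr_handler_model {
    const char *prefix;
    int flags;                       /* passed through to the methods */
    int (*get)(int flags, char *buf);
};

static int acl_get_model(int flags, char *buf)
{
    *buf = (flags == MODEL_ACL_DEFAULT) ? 'd' : 'a';  /* choose by flags */
    return 1;
}
```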