12 Jan, 2012

1 commit

  • * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/numa: Add constraints check for nid parameters
    mm, x86: Remove debug_pagealloc_enabled
    x86/mm: Initialize high mem before free_all_bootmem()
    arch/x86/kernel/e820.c: quiet sparse noise about plain integer as NULL pointer
    arch/x86/kernel/e820.c: Eliminate bubble sort from sanitize_e820_map()
    x86: Fix mmap random address range
    x86, mm: Unify zone_sizes_init()
    x86, mm: Prepare zone_sizes_init() for unification
    x86, mm: Use max_low_pfn for ZONE_NORMAL on 64-bit
    x86, mm: Wrap ZONE_DMA32 with CONFIG_ZONE_DMA32
    x86, mm: Use max_pfn instead of highend_pfn
    x86, mm: Move zone init from paging_init() on 64-bit
    x86, mm: Use MAX_DMA_PFN for ZONE_DMA on 32-bit

    Linus Torvalds
     

11 Jan, 2012

2 commits

  • This one behaves similarly to the /proc/<pid>/fd/ one - it contains
    symlinks, one for each mapping backed by a file; the name of a symlink
    is "vma->vm_start-vma->vm_end" and the target is the file. Opening a
    symlink results in a file that points to exactly the same inode as the
    vma's one.

    For example, the ls -l of some arbitrary /proc/<pid>/map_files/:

    | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
    | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
    | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
    | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
    | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so

    This *helps* the checkpointing process in three ways:

    1. When dumping a task's mappings we know the exact file that is
    mapped by a particular region. We do this by opening the
    /proc/$pid/map_files/$address symlink the way we do with file
    descriptors.

    2. This also helps in determining which anonymous shared mappings are
    shared with each other, by comparing their inodes.

    3. When restoring a set of processes, if two of them have a shared
    mapping, we map the memory via the first one and then open its
    /proc/$pid/map_files/$address file and map it in the second task.

    Using /proc/$pid/maps for this is quite inconvenient, since it
    requires repeated re-reading and re-parsing of this text file, which
    slows down the restore procedure significantly. Also, as pointed out
    in (3), it is much easier to use the top-level shared mapping in
    children via /proc/$pid/map_files/$address when needed.
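
    As a minimal userspace sketch of step (3) - the pid, address range and
    mapping length below are made up for illustration:

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/mman.h>

        int main(void)
        {
                /* hypothetical entry; the name encodes vm_start-vm_end */
                const char *link =
                        "/proc/1234/map_files/7f8f80403000-7f8f80404000";
                int fd = open(link, O_RDONLY);
                void *p;

                if (fd < 0) {
                        perror("open");
                        return 1;
                }
                /* maps the very same inode the first task has mapped */
                p = mmap(NULL, 0x1000, PROT_READ, MAP_SHARED, fd, 0);
                if (p == MAP_FAILED) {
                        perror("mmap");
                        return 1;
                }
                return 0;
        }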

    [akpm@linux-foundation.org: coding-style fixes]
    [gorcunov@openvz.org: make map_files depend on CHECKPOINT_RESTORE]
    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Cyrill Gorcunov
    Reviewed-by: Vasiliy Kulikov
    Reviewed-by: "Kirill A. Shutemov"
    Cc: Tejun Heo
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • With CONFIG_DEBUG_PAGEALLOC configured, the CPU will generate an
    exception on access (read, write) to an unallocated page, which
    permits us to catch code which corrupts memory. However, the kernel
    tries to maximise memory usage, hence there are usually few free pages
    in the system and buggy code usually corrupts some crucial data.

    This patch changes the buddy allocator to keep more free/protected pages
    and to interlace free/protected and allocated pages to increase the
    probability of catching corruption.

    When the kernel is compiled with CONFIG_DEBUG_PAGEALLOC,
    debug_guardpage_minorder defines the minimum order used by the page
    allocator to grant a request. The requested size will be returned with
    the remaining pages used as guard pages.

    The default value of debug_guardpage_minorder is zero: no change from
    current behaviour.
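
    For reference, the knob is the debug_guardpage_minorder= boot
    parameter; e.g. booting with:

        debug_guardpage_minorder=2

    makes the page allocator serve each request from at least an order-2
    (4-page) block, returning the requested pages and keeping the rest of
    the block free as guard pages.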

    [akpm@linux-foundation.org: tweak documentation, s/flg/flag/]
    Signed-off-by: Stanislaw Gruszka
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: "Rafael J. Wysocki"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stanislaw Gruszka
     

09 Dec, 2011

2 commits

  • Use atomic-long operations instead of looping around cmpxchg().
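
    As an illustrative sketch of the conversion (not the patch itself):

        #include <linux/atomic.h>

        static atomic_long_t nr_objects;

        /* before: open-coded loop around cmpxchg() */
        static void add_old(long delta)
        {
                long old, new;

                do {
                        old = atomic_long_read(&nr_objects);
                        new = old + delta;
                } while (atomic_long_cmpxchg(&nr_objects, old, new) != old);
        }

        /* after: a single atomic-long operation */
        static void add_new(long delta)
        {
                atomic_long_add(delta, &nr_objects);
        }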

    [akpm@linux-foundation.org: massage atomic.h inclusions]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Now all ARCH_POPULATES_NODE_MAP archs select HAVE_MEMBLOCK_NODE_MAP -
    there's no user of early_node_map[] left. Kill early_node_map[] and
    replace ARCH_POPULATES_NODE_MAP with HAVE_MEMBLOCK_NODE_MAP. Also,
    relocate for_each_mem_pfn_range() and its helpers from mm.h to
    memblock.h, as page_alloc.c no longer hosts an alternative
    implementation.

    This change is ultimately a one-to-one mapping and shouldn't cause any
    observable difference; however, after the recent changes, there are
    some functions which would now fit memblock.c better than page_alloc.c,
    and depending on HAVE_MEMBLOCK_NODE_MAP instead of HAVE_MEMBLOCK
    doesn't make much sense for some of them. Further cleanups for
    functions inside HAVE_MEMBLOCK_NODE_MAP in mm.h would be nice.

    -v2: Fix compile bug introduced by mis-spelling
    CONFIG_HAVE_MEMBLOCK_NODE_MAP to CONFIG_MEMBLOCK_HAVE_NODE_MAP in
    mmzone.h. Reported by Stephen Rothwell.

    Signed-off-by: Tejun Heo
    Cc: Stephen Rothwell
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu
    Cc: Tony Luck
    Cc: Ralf Baechle
    Cc: Martin Schwidefsky
    Cc: Chen Liqin
    Cc: Paul Mundt
    Cc: "David S. Miller"
    Cc: "H. Peter Anvin"

    Tejun Heo
     

06 Dec, 2011

1 commit

  • When (no)bootmem finishes its operation, it passes pages to the buddy
    allocator. Since debug_pagealloc_enabled is not set at that point, we
    do not protect those pages, which is not what we want with
    CONFIG_DEBUG_PAGEALLOC=y.

    To fix this, remove debug_pagealloc_enabled. That variable was
    introduced by commit 12d6f21e "x86: do not PSE on
    CONFIG_DEBUG_PAGEALLOC=y" to get more CPA (change page
    attribute) code testing. But currently we have CONFIG_CPA_DEBUG,
    which tests CPA.

    Signed-off-by: Stanislaw Gruszka
    Acked-by: Mel Gorman
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/1322582711-14571-1-git-send-email-sgruszka@redhat.com
    Signed-off-by: Ingo Molnar

    Stanislaw Gruszka
     

29 Nov, 2011

1 commit

  • Conflicts & resolutions:

    * arch/x86/xen/setup.c

    dc91c728fd "xen: allow extra memory to be in multiple regions"
    24aa07882b "memblock, x86: Replace memblock_x86_reserve/free..."

    conflicted on xen_add_extra_mem() updates. The resolution is
    trivial, as the latter just wants to replace
    memblock_x86_reserve_range() with memblock_reserve().

    * drivers/pci/intel-iommu.c

    166e9278a3f "x86/ia64: intel-iommu: move to drivers/iommu/"
    5dfe8660a3d "bootmem: Replace work_with_active_regions() with..."

    conflicted as the former moved the file under drivers/iommu/.
    Resolved by applying the changes from the latter on the moved
    file.

    * mm/Kconfig

    6661672053a "memblock: add NO_BOOTMEM config symbol"
    c378ddd53f9 "memblock, x86: Make ARCH_DISCARD_MEMBLOCK a config option"

    conflicted trivially. Both added config options. Just
    letting both add their own options resolves the conflict.

    * mm/memblock.c

    d1f0ece6cdc "mm/memblock.c: small function definition fixes"
    ed7b56a799c "memblock: Remove memblock_memory_can_coalesce()"

    conflicted. The former updates a function removed by the
    latter. Resolution is trivial.

    Signed-off-by: Tejun Heo

    Tejun Heo
     

03 Nov, 2011

2 commits

  • This avoids duplicating the function in every arch gup_fast.

    Signed-off-by: Andrea Arcangeli
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Benjamin Herrenschmidt
    Cc: David Gibson
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Michel, while working on the working set estimation code, noticed that
    calling get_page_unless_zero() on a random pfn_to_page(random_pfn)
    wasn't safe if the pfn ended up being a tail page of a transparent
    hugepage under splitting by __split_huge_page_refcount().

    He then found the problem could also theoretically materialize with
    page_cache_get_speculative() during the speculative radix tree lookups
    that use get_page_unless_zero() on SMP, if the radix tree page is
    freed and reallocated and get_user_pages is called on it before
    page_cache_get_speculative has a chance to call get_page_unless_zero().

    So the best way to fix the problem is to keep page_tail->_count zero at
    all times. This will guarantee that get_page_unless_zero() can never
    succeed on any tail page. page_tail->_mapcount is guaranteed zero and
    is unused for all tail pages of a compound page, so we can simply
    account the tail page references there and transfer them to
    tail_page->_count in __split_huge_page_refcount() (in addition to the
    head_page->_mapcount).

    While debugging this s/_count/_mapcount/ change I also noticed that
    get_page is called by direct-io.c on pages returned by get_user_pages.
    That wasn't entirely safe because the two atomic_incs in get_page
    weren't atomic. Other get_user_pages users, like the secondary-MMU
    page fault that establishes the shadow pagetables, would never call
    any superfluous get_page after get_user_pages returns. It's safer to
    make get_page universally safe for tail pages and to use
    get_page_foll() within follow_page (inside get_user_pages()).
    get_page_foll() is safe to do the refcounting for tail pages without
    taking any locks because it is run within PT lock protected critical
    sections (PT lock for pte and page_table_lock for pmd_trans_huge).

    The standard get_page() as invoked by direct-io will now take the
    compound_lock, but still only for tail pages. The direct-io paths are
    usually I/O bound and the compound_lock is per THP, so very
    fine-grained; there's no risk of scalability issues with it. A simple
    direct-io benchmark with all the lockdep prove-locking and spinlock
    debugging infrastructure enabled shows identical performance and no
    overhead. So it's worth it. Ideally direct-io should stop calling
    get_page() on pages returned by get_user_pages(). The spinlock in
    get_page() is already optimized away for no-THP builds, but doing
    get_page() on tail pages returned by GUP is generally a rare operation
    and usually only runs in I/O paths.

    This new refcounting on page_tail->_mapcount, in addition to avoiding
    new RCU critical sections, will also allow the working set estimation
    code to work without any further complexity associated with the tail
    page refcounting with THP.

    Signed-off-by: Andrea Arcangeli
    Reported-by: Michel Lespinasse
    Reviewed-by: Michel Lespinasse
    Reviewed-by: Minchan Kim
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Benjamin Herrenschmidt
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

01 Nov, 2011

1 commit

  • Add __attribute__((format(printf, ...))) to the function to validate
    format and arguments. Use the vsprintf extension %pV to avoid any
    possible message interleaving. Coalesce the format string. Convert
    printks/pr_warning to pr_warn.

    [akpm@linux-foundation.org: use the __printf() macro]
    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     

09 Aug, 2011

1 commit

  • In commit 2efaca927f5c ("mm/futex: fix futex writes on archs with SW
    tracking of dirty & young") we forgot about MMU=n. This patch fixes
    that.

    Signed-off-by: Peter Zijlstra
    Acked-by: Benjamin Herrenschmidt
    Acked-by: David Howells
    Link: http://lkml.kernel.org/r/1311761831.24752.413.camel@twins
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

03 Aug, 2011

2 commits

  • Some trivial conflicts due to various other merges
    adding to the end of common lists sooner than this one.

    arch/ia64/Kconfig
    arch/powerpc/Kconfig
    arch/x86/Kconfig
    lib/Kconfig
    lib/Makefile

    Signed-off-by: Len Brown

    Len Brown
     
  • memory_failure() is the entry point for HWPoison memory error
    recovery. It must be called in process context. But commonly
    hardware memory errors are notified via MCE or NMI, so some delayed
    execution mechanism must be used. In the MCE handler, a work queue +
    ring buffer mechanism is used.

    In addition to MCE, now APEI (ACPI Platform Error Interface) GHES
    (Generic Hardware Error Source) can be used to report memory errors
    too. To add support for APEI GHES memory recovery, a mechanism similar
    to that of MCE is implemented. memory_failure_queue() is the new
    entry point that can be called in IRQ context. The next step is to
    make the MCE handler use this interface too.
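
    Schematically, the deferral mechanism looks like this (an illustrative
    sketch only - the names and ring size are invented, not the actual
    mm/memory-failure.c code):

        #include <linux/spinlock.h>
        #include <linux/workqueue.h>

        #define MF_RING_SIZE 16

        static unsigned long mf_ring[MF_RING_SIZE];
        static unsigned int mf_head;
        static DEFINE_SPINLOCK(mf_lock);

        static void mf_work_fn(struct work_struct *work)
        {
                /* drain the ring and call memory_failure() here,
                 * now safely in process context */
        }
        static DECLARE_WORK(mf_work, mf_work_fn);

        /* entry point, callable in IRQ context */
        void memory_failure_queue_sketch(unsigned long pfn)
        {
                unsigned long flags;

                spin_lock_irqsave(&mf_lock, flags);
                mf_ring[mf_head] = pfn;
                mf_head = (mf_head + 1) % MF_RING_SIZE;
                spin_unlock_irqrestore(&mf_lock, flags);

                schedule_work(&mf_work);
        }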

    Signed-off-by: Huang Ying
    Cc: Andi Kleen
    Cc: Wu Fengguang
    Cc: Andrew Morton
    Signed-off-by: Len Brown

    Huang Ying
     

26 Jul, 2011

4 commits

  • I haven't reproduced it myself but the fail scenario is that on such
    machines (notably ARM and some embedded powerpc), if you manage to hit
    that futex path on a writable page whose dirty bit has gone from the PTE,
    you'll livelock inside the kernel from what I can tell.

    It will go in a loop of trying the atomic access, failing, trying gup
    to "fix it up", getting success from gup, going back to the atomic
    access, failing again because dirty wasn't fixed, etc...

    So I think you essentially hang in the kernel.

    The scenario is probably rare-ish because the affected architectures
    are embedded and tend to not swap much (if at all), so we probably
    rarely hit the case where dirty or young is missing, but I think Shan
    has a piece of SW that can reliably reproduce it using a shared
    writable mapping & fork or something like that.

    On archs which use SW tracking of dirty & young, a page without dirty
    is effectively mapped read-only, and a page without young is
    inaccessible in the PTE.

    Additionally, some architectures might lazily flush the TLB when relaxing
    write protection (by doing only a local flush), and expect a fault to
    invalidate the stale entry if it's still present on another processor.

    The futex code assumes that if the "in_atomic()" access -EFAULTs, it
    can "fix it up" by calling get_user_pages(), which would then be
    equivalent to taking the fault.

    However that isn't the case. get_user_pages() will not call
    handle_mm_fault() in the case where the PTE seems to have the right
    permissions, regardless of the dirty and young state. It will eventually
    update those bits ... in the struct page, but not in the PTE.

    Additionally, it will not handle the lazy TLB flushing that can be
    required by some architectures in the fault case.

    Basically, gup is the wrong interface for the job. The patch provides a
    more appropriate one which boils down to just calling handle_mm_fault()
    since what we are trying to do is simulate a real page fault.

    The futex code currently attempts to write to user memory within a
    pagefault disabled section, and if that fails, tries to fix it up using
    get_user_pages().

    This doesn't work on archs where the dirty and young bits are maintained
    by software, since they will gate access permission in the TLB, and will
    not be updated by gup().

    In addition, there's an expectation on some archs that a spurious write
    fault triggers a local TLB flush, and that is missing from the picture as
    well.

    I decided that adding those "features" to gup() would be too much for
    this already too complex function, and instead added a new, simpler
    fixup_user_fault(), which is essentially a wrapper around
    handle_mm_fault() that the futex code can call.
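
    The resulting futex-side call pattern is roughly (simplified sketch of
    the fault-in path):

        static int fault_in_user_writeable(u32 __user *uaddr)
        {
                struct mm_struct *mm = current->mm;
                int ret;

                down_read(&mm->mmap_sem);
                /* simulate a real write fault on uaddr */
                ret = fixup_user_fault(current, mm, (unsigned long)uaddr,
                                       FAULT_FLAG_WRITE);
                up_read(&mm->mmap_sem);

                return ret < 0 ? ret : 0;
        }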

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix some nits Darren saw, fiddle comment layout]
    Signed-off-by: Benjamin Herrenschmidt
    Reported-by: Shan Hai
    Tested-by: Shan Hai
    Cc: David Laight
    Acked-by: Peter Zijlstra
    Cc: Darren Hart
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     
  • Correct comment on truncate_inode_pages*() in linux/mm.h; and remove
    the declaration of page_unuse() - it didn't exist even in 2.2.26 or
    2.4.0!

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Originally, walk_hugetlb_range() didn't require a caller to take any
    lock. But commit d33b9f45bd ("mm: hugetlb: fix hugepage memory leak in
    walk_page_range") changed that rule, because it added a find_vma()
    call in walk_hugetlb_range().

    Any commit that changes a locking rule should update the documentation
    too.

    [akpm@linux-foundation.org: clarify comment]
    Signed-off-by: KOSAKI Motohiro
    Cc: Naoya Horiguchi
    Cc: Hiroyuki Kamezawa
    Cc: Andrea Arcangeli
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • These uses are read-only and in a subsequent patch I have a const struct
    page in my hand...

    [akpm@linux-foundation.org: fix warnings in lowmem_page_address()]
    Signed-off-by: Ian Campbell
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ian Campbell
     

23 Jul, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (107 commits)
    vfs: use ERR_CAST for err-ptr tossing in lookup_instantiate_filp
    isofs: Remove global fs lock
    jffs2: fix IN_DELETE_SELF on overwriting rename() killing a directory
    fix IN_DELETE_SELF on overwriting rename() on ramfs et.al.
    mm/truncate.c: fix build for CONFIG_BLOCK not enabled
    fs:update the NOTE of the file_operations structure
    Remove dead code in dget_parent()
    AFS: Fix silly characters in a comment
    switch d_add_ci() to d_splice_alias() in "found negative" case as well
    simplify gfs2_lookup()
    jfs_lookup(): don't bother with . or ..
    get rid of useless dget_parent() in btrfs rename() and link()
    get rid of useless dget_parent() in fs/btrfs/ioctl.c
    fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers
    drivers: fix up various ->llseek() implementations
    fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseek
    Ext4: handle SEEK_HOLE/SEEK_DATA generically
    Btrfs: implement our own ->llseek
    fs: add SEEK_HOLE and SEEK_DATA flags
    reiserfs: make reiserfs default to barrier=flush
    ...

    Fix up trivial conflicts in fs/xfs/linux-2.6/xfs_super.c due to the new
    shrinker callout for the inode cache, that clashed with the xfs code to
    start the periodic workers later.

    Linus Torvalds
     

21 Jul, 2011

1 commit

  • With context based shrinkers, we can implement a per-superblock
    shrinker that shrinks the caches attached to the superblock. We
    currently have global shrinkers for the inode and dentry caches that
    split up into per-superblock operations via a coarse proportioning
    method that does not batch very well. The global shrinkers also
    have a dependency - dentries pin inodes - so we have to be very
    careful about how we register the global shrinkers so that the
    implicit call order is always correct.

    With a per-sb shrinker callout, we can encode this dependency
    directly into the per-sb shrinker, hence avoiding the need for
    strictly ordering shrinker registrations. We also have no need for
    any proportioning code, because the shrinker subsystem already
    provides this functionality across all shrinkers. Allowing the
    shrinker to operate on a single superblock at a time means that we do
    fewer superblock list traversals and less locking, and reclaim should
    batch more effectively. This should result in less CPU overhead for
    reclaim and potentially faster reclaim of items from each filesystem.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     

20 Jul, 2011

1 commit

  • For shrinkers that have their own cond_resched* calls, having
    shrink_slab break the work down into small batches is not
    particularly efficient. Add a custom batch size field to the struct
    shrinker so that shrinkers can use a larger batch size if they
    desire.

    A value of zero (uninitialised) means "use the default", so
    behaviour is unchanged by this patch.
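
    A shrinker opting in would look something like this (sketch; the
    callback is hypothetical):

        static int example_shrink(struct shrinker *s,
                                  struct shrink_control *sc)
        {
                /* free up to sc->nr_to_scan objects, return what's left */
                return 0;
        }

        static struct shrinker example_shrinker = {
                .shrink = example_shrink,
                .seeks  = DEFAULT_SEEKS,
                .batch  = 1024, /* larger batches between resched points */
        };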

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     

15 Jul, 2011

3 commits

  • Add optional region->nid which can be enabled by arch using
    CONFIG_HAVE_MEMBLOCK_NODE_MAP. When enabled, memblock also carries
    NUMA node information and replaces early_node_map[].

    Newly added memblocks have MAX_NUMNODES as nid. Arch can then call
    memblock_set_node() to set node information. memblock takes care of
    merging and node affine allocations w.r.t. node information.

    When MEMBLOCK_NODE_MAP is enabled, early_node_map[], related data
    structures and functions to manipulate and iterate it are disabled.
    memblock version of __next_mem_pfn_range() is provided such that
    for_each_mem_pfn_range() behaves the same and its users don't have to
    be updated.
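
    An arch would then register node information roughly like this
    (sketch, using the signature as introduced here; base, size and nid
    would come from firmware parsing):

        /* e.g. while parsing SRAT or the device tree */
        memblock_add(base, size);
        memblock_set_node(base, size, nid);     /* replaces MAX_NUMNODES */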

    -v2: Yinghai spotted section mismatch caused by missing
    __init_memblock in memblock_set_node(). Fixed.

    Signed-off-by: Tejun Heo
    Link: http://lkml.kernel.org/r/20110714094342.GF3455@htj.dyndns.org
    Cc: Yinghai Lu
    Cc: Benjamin Herrenschmidt
    Signed-off-by: H. Peter Anvin

    Tejun Heo
     
  • With the previous changes, the generic NUMA-aware memblock API has
    feature parity with memblock_x86_find_in_range_node(). There currently
    are two users - x86 setup_node_data() and __alloc_memory_core_early()
    in nobootmem.c.

    This patch converts the former to use memblock_alloc_nid() and the
    latter to use memblock_find_in_range_node(), and kills
    memblock_x86_find_in_range_node() and related functions, including
    find_memory_core_early() in page_alloc.c.

    Signed-off-by: Tejun Heo
    Link: http://lkml.kernel.org/r/1310460395-30913-9-git-send-email-tj@kernel.org
    Cc: Yinghai Lu
    Cc: Benjamin Herrenschmidt
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Signed-off-by: H. Peter Anvin

    Tejun Heo
     
  • Callback based iteration is cumbersome and much less useful than a
    for_each_*() iterator. This patch implements for_each_mem_pfn_range(),
    which replaces work_with_active_regions(). All the current users of
    work_with_active_regions() are converted.

    This simplifies walking over early_node_map and will allow converting
    internal logic in page_alloc to use the iterator instead of walking
    early_node_map directly, which in turn will enable moving node
    information to memblock.

    powerpc change is only compile tested.
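
    A converted call site reads roughly like this (sketch; register_range
    is a hypothetical stand-in for the old callback):

        unsigned long start_pfn, end_pfn;
        int i, nid;

        /* instead of work_with_active_regions(nid, callback, data) */
        for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid)
                register_range(start_pfn, end_pfn, nid);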

    Signed-off-by: Tejun Heo
    Link: http://lkml.kernel.org/r/20110714074610.GD3455@htj.dyndns.org
    Cc: Yinghai Lu
    Cc: Benjamin Herrenschmidt
    Signed-off-by: H. Peter Anvin

    Tejun Heo
     

13 Jul, 2011

1 commit

  • SPARSEMEM w/o VMEMMAP and DISCONTIGMEM, both used only on 32bit, use a
    sections array to map pfn to nid, which is limited in granularity. If
    NUMA nodes are laid out such that the mapping cannot be accurate, boot
    will fail, triggering the BUG_ON() in mminit_verify_page_links().

    On 32bit, the granularity is 512MiB w/ PAE and SPARSEMEM. This seems
    to have been granular enough until commit 2706a0bf7b (x86, NUMA:
    Enable CONFIG_AMD_NUMA on 32bit too). Apparently, there is a machine
    which aligns NUMA nodes to 128MiB and has only AMD NUMA but not SRAT.
    This led to the following BUG_ON().

    On node 0 totalpages: 2096615
    DMA zone: 32 pages used for memmap
    DMA zone: 0 pages reserved
    DMA zone: 3927 pages, LIFO batch:0
    Normal zone: 1740 pages used for memmap
    Normal zone: 220978 pages, LIFO batch:31
    HighMem zone: 16405 pages used for memmap
    HighMem zone: 1853533 pages, LIFO batch:31
    BUG: Int 6: CR2 (null)
    EDI (null) ESI 00000002 EBP 00000002 ESP c1543ecc
    EBX f2400000 EDX 00000006 ECX (null) EAX 00000001
    err (null) EIP c16209aa CS 00000060 flg 00010002
    Stack: f2400000 00220000 f7200800 c1620613 00220000 01000000 04400000 00238000
    (null) f7200000 00000002 f7200b58 f7200800 c1620929 000375fe (null)
    f7200b80 c16395f0 00200a02 f7200a80 (null) 000375fe 00000002 (null)
    Pid: 0, comm: swapper Not tainted 2.6.39-rc5-00181-g2706a0b #17
    Call Trace:
    [] ? early_fault+0x2e/0x2e
    [] ? mminit_verify_page_links+0x12/0x42
    [] ? memmap_init_zone+0xaf/0x10c
    [] ? free_area_init_node+0x2b9/0x2e3
    [] ? free_area_init_nodes+0x3f2/0x451
    [] ? paging_init+0x112/0x118
    [] ? setup_arch+0x791/0x82f
    [] ? start_kernel+0x6a/0x257

    This patch implements node_map_pfn_alignment(), which determines the
    maximum internode alignment, and updates numa_register_memblks() to
    reject the NUMA configuration if that alignment exceeds the pfn -> nid
    mapping granularity of the memory model, as determined by
    PAGES_PER_SECTION. (For example, nodes aligned to 128MiB cannot be
    represented when a single 512MiB section must map to one nid.)

    This makes the problematic machine boot w/ flatmem by rejecting the
    NUMA config, and provides protection against crazy NUMA
    configurations.

    Signed-off-by: Tejun Heo
    Link: http://lkml.kernel.org/r/20110712074534.GB2872@htj.dyndns.org
    Reported-and-Tested-by: Hans Rosenfeld
    Cc: Conny Seidel
    Signed-off-by: H. Peter Anvin

    Tejun Heo
     

27 May, 2011

2 commits

  • Setup and cleanup of mm_struct->exe_file is currently done in
    fs/proc/. This was because exe_file was needed only for
    /proc/<pid>/exe. Since we will need the exe_file functionality also
    for core dumps (so the core name can contain the full binary path),
    build this functionality into the kernel unconditionally.

    To achieve that, move it out of proc FS to kernel/, where in fact it
    should belong. By doing that we can make dup_mm_exe_file static. Also
    we can drop the linux/proc_fs.h inclusion in fs/exec.c and
    kernel/fork.c.

    Signed-off-by: Jiri Slaby
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     
  • The type of vma->vm_flags is 'unsigned long'. Neither 'int' nor
    'unsigned int'. This patch fixes such misuse.

    Signed-off-by: KOSAKI Motohiro
    [ Changed to use a typedef - we'll extend it to cover more cases
    later, since there has been discussion about making it a 64-bit
    type.. - Linus ]
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

25 May, 2011

11 commits

  • Remove the noMMU declaration of shmem_get_unmapped_area() from mm.h:
    it fell out of use in 2.6.21 and ceased to exist in 2.6.29.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The problem with having two different types of counters is that developers
    adding new code need to keep in mind whether it's safe to use both the
    atomic and non-atomic implementations. For example, when adding new
    callers of the *_mm_counter() functions a developer needs to ensure that
    those paths are always executed with page_table_lock held, in case we're
    using the non-atomic implementation of mm counters.

    Hugh Dickins introduced the atomic mm counters in commit f412ac08c986
    ("[PATCH] mm: fix rss and mmlist locking"). When asked why he left the
    non-atomic counters around he said,

    | The only reason was to avoid adding costly atomic operations into a
    | configuration that had no need for them there: the page_table_lock
    | sufficed.
    |
    | Certainly it would be simpler just to delete the non-atomic variant.
    |
    | And I think it's fair to say that any configuration on which we're
    | measuring performance to that degree (rather than "does it boot fast?"
    | type measurements), would already be going the split ptlocks route.

    Removing the non-atomic counters eases the maintenance burden because
    developers no longer have to be mindful of the two implementations
    when using *_mm_counter().

    Note that all architectures provide a means of atomically updating
    atomic_long_t variables, even if they have to revert to the generic
    spinlock implementation because they don't support 64-bit atomic
    instructions (see lib/atomic64.c).
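
    With only the atomic implementation left, the *_mm_counter() helpers
    reduce to plain atomic-long operations, roughly (an illustrative
    sketch of the shape, not the exact mm.h code):

        static inline void add_mm_counter(struct mm_struct *mm,
                                          int member, long value)
        {
                atomic_long_add(value, &mm->rss_stat.count[member]);
        }

        static inline unsigned long get_mm_counter(struct mm_struct *mm,
                                                   int member)
        {
                return atomic_long_read(&mm->rss_stat.count[member]);
        }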

    Signed-off-by: Matt Fleming
    Acked-by: Dave Hansen
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Fleming
     
  • Do not define PFN_SECTION_SHIFT if !CONFIG_SPARSEMEM.

    Signed-off-by: Daniel Kiper
    Acked-by: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Kiper
     
  • set_page_section() is meaningful only in the CONFIG_SPARSEMEM and
    !CONFIG_SPARSEMEM_VMEMMAP context. Move it to the proper place and
    amend the functions which use it accordingly.

    Signed-off-by: Daniel Kiper
    Acked-by: Dave Hansen
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Kiper
     
  • Change each shrinker's API by consolidating the existing parameters
    into a shrink_control struct. This will simplify any further feature
    additions without having to touch every shrinker file.
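
    The consolidated API boils down to this shape (sketch of the struct
    and the changed callback):

        struct shrink_control {
                gfp_t gfp_mask;

                /* how many objects the callback should scan and try
                 * to reclaim on this pass */
                unsigned long nr_to_scan;
        };

        /* callbacks change from
         *      shrink(shrinker, nr_to_scan, gfp_mask)
         * to taking the single control struct: */
        int (*shrink)(struct shrinker *shrinker, struct shrink_control *sc);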

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: fix warning]
    [kosaki.motohiro@jp.fujitsu.com: fix up new shrinker API]
    [akpm@linux-foundation.org: fix xfs warning]
    [akpm@linux-foundation.org: update gfs2]
    Signed-off-by: Ying Han
    Cc: KOSAKI Motohiro
    Cc: Minchan Kim
    Acked-by: Pavel Emelyanov
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Steven Whitehouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • Consolidate the existing parameters to shrink_slab() into a new
    shrink_control struct. This is needed later to pass the same struct to
    shrinkers.

    Signed-off-by: Ying Han
    Cc: KOSAKI Motohiro
    Cc: Minchan Kim
    Acked-by: Pavel Emelyanov
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • This originally started as a simple patch to give vmalloc() some more
    verbose output on failure on top of the plain page allocator messages.
    Johannes suggested that it might be nicer to lead with the vmalloc() info
    _before_ the page allocator messages.

    But, I do think there's a lot of value in what __alloc_pages_slowpath()
    does with its filtering and so forth.

    This patch creates a new function which other allocators can call instead
    of relying on the internal page allocator warnings. It also gives this
    function private rate-limiting which separates it from other
    printk_ratelimit() users.
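
    A caller-side sketch (the helper's name and signature here follow what
    was merged as warn_alloc_failed(), but treat the details as
    illustrative):

        if (!area) {
                warn_alloc_failed(gfp_mask, 0,
                                  "vmalloc: allocation failure: %lu bytes\n",
                                  real_size);
                return NULL;
        }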

    Signed-off-by: Dave Hansen
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Michal Nazarewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • Hugh says:
    "The only significant loser, I think, would be page reclaim (when
    concurrent with truncation): could spin for a long time waiting for
    the i_mmap_mutex it expects would soon be dropped?"

    Counterpoints:
    - cpu contention makes the spin stop (need_resched())
    - zapping pages should free pages at a higher rate than reclaim
    ever can

    I think the simplification of the truncate code is definitely worth it.

    Effectively reverts: 2aa15890f3c ("mm: prevent concurrent
    unmap_mapping_range() on the same inode") and takes out the code that
    caused its problem.

    Signed-off-by: Peter Zijlstra
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Rework the existing mmu_gather infrastructure.

    The direct purpose of these patches was to allow preemptible mmu_gather,
    but even without that I think these patches provide an improvement to the
    status quo.

    The first 9 patches rework the mmu_gather infrastructure. For review
    purpose I've split them into generic and per-arch patches with the last of
    those a generic cleanup.

    The next patch provides generic RCU page-table freeing, and the followup
    is a patch converting s390 to use this. I've also got 4 patches from
    DaveM lined up (not included in this series) that uses this to implement
    gup_fast() for sparc64.

    Then there is one patch that extends the generic mmu_gather batching.

    After that follow the mm preemptibility patches; these make part of
    the mm a lot more preemptible. They convert i_mmap_lock and
    anon_vma->lock to mutexes, which together with the mmu_gather rework
    makes mmu_gather preemptible as well.

    Making i_mmap_lock a mutex also enables a clean-up of the truncate code.

    This also allows for preemptible mmu_notifiers, something that I think
    XPMEM wants.

    Furthermore, it removes the new and universally detested unmap_mutex.

    This patch:

    Remove the first obstacle towards a fully preemptible mmu_gather.

    The current scheme assumes mmu_gather is always done with preemption
    disabled and uses per-cpu storage for the page batches. Change this to
    try and allocate a page for batching and in case of failure, use a small
    on-stack array to make some progress.

    Preemptible mmu_gather is desired in general and usable once i_mmap_lock
    becomes a mutex. Doing it before the mutex conversion saves us from
    having to rework the code by moving the mmu_gather bits inside the
    pte_lock.

    Also avoid flushing the tlb batches from under the pte lock; this is
    useful even without the i_mmap_lock conversion, as it significantly
    reduces pte lock hold times.
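
    The batching fallback is schematically (simplified from the mmu_gather
    code, with details trimmed):

        static int tlb_next_batch(struct mmu_gather *tlb)
        {
                struct mmu_gather_batch *batch;

                /* try to allocate a fresh page for the next batch */
                batch = (void *)__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
                if (!batch)
                        return 0;       /* fall back: stay on the small
                                           on-stack array, flush early */

                batch->next = NULL;
                batch->nr   = 0;
                batch->max  = MAX_GATHER_BATCH;

                tlb->active->next = batch;
                tlb->active = batch;
                return 1;
        }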

    [akpm@linux-foundation.org: fix comment typo]
    Signed-off-by: Peter Zijlstra
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Currently we have expand_upwards exported while expand_downwards is
    accessible only via expand_stack or expand_stack_downwards.

    check_stack_guard_page is a nice example of the asymmetry. It uses
    expand_stack for VM_GROWSDOWN while expand_upwards is called for
    VM_GROWSUP case.

    Let's clean this up by exporting both functions and making those names
    consistent. Let's use expand_{upwards,downwards} because expanding
    doesn't always involve stack manipulation (an example is
    ia64_do_page_fault, which uses expand_upwards for register backing
    store expansion). expand_downwards has to be defined for both
    CONFIG_STACK_GROWS{UP,DOWN} because get_arg_page calls the downwards
    version in the early process initialization phase for the growsup
    configuration.

    Signed-off-by: Michal Hocko
    Acked-by: Hugh Dickins
    Cc: James Bottomley
    Cc: "Luck, Tony"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Currently, memory hotplug calls setup_per_zone_wmarks() and
    calculate_zone_inactive_ratio(), but doesn't call
    setup_per_zone_lowmem_reserve().

    This means the number of reserved pages isn't updated even when memory
    hotplug occurs. This patch fixes it.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro