15 Aug, 2011

1 commit

  • Commit db64fe02258f ("mm: rewrite vmap layer") introduced code that does
    address calculations under the assumption that VMAP_BLOCK_SIZE is a
    power of two. However, this might not be true if CONFIG_NR_CPUS is not
    set to a power of two.

    Wrong vmap_block index/offset values could lead to memory corruption.
    However, this has never been observed in practice (or never been
    diagnosed correctly); what caught this was the BUG_ON in vb_alloc() that
    checks for inconsistent vmap_block indices.

    To fix this, ensure that VMAP_BLOCK_SIZE is always a power of two.
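
    For illustration, a stand-alone sketch of why the power-of-two assumption
    matters (the constants below are made up, not the real VMAP_BLOCK_SIZE
    formula): index/offset arithmetic that mixes division with a bit mask only
    agrees when the block size is a power of two.

    #include <stdio.h>

    int main(void)
    {
            unsigned long addr = 0x12345678;
            unsigned long nr_cpus = 6;                 /* not a power of two */
            unsigned long block = nr_cpus * 64 * 4096; /* made-up block size */
            unsigned long pow2 = 1;

            /* the mask trick only equals the true remainder for powers of two */
            printf("addr %% block = %lx, addr & (block - 1) = %lx\n",
                   addr % block, addr & (block - 1));

            /* rounding the per-cpu factor up to a power of two restores it */
            while (pow2 < nr_cpus)
                    pow2 <<= 1;
            block = pow2 * 64 * 4096;
            printf("addr %% block = %lx, addr & (block - 1) = %lx\n",
                   addr % block, addr & (block - 1));
            return 0;
    }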

    BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=31572
    Reported-by: Pavel Kysilka
    Reported-by: Matias A. Fonzo
    Signed-off-by: Clemens Ladisch
    Signed-off-by: Stefan Richter
    Cc: Nick Piggin
    Cc: Jeremy Fitzhardinge
    Cc: Krzysztof Helt
    Cc: Andrew Morton
    Cc: 2.6.28+
    Signed-off-by: Linus Torvalds

    Clemens Ladisch
     

10 Aug, 2011

2 commits

  • This reverts commit 8521fc50d433507a7cdc96bec280f9e5888a54cc.

    The patch incorrectly assumes that using atomic FLUSHING_CACHED_CHARGE
    bit operations is sufficient but that is not true. Johannes Weiner has
    reported a crash during parallel memory cgroup removal:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
    IP: [] css_is_ancestor+0x20/0x70
    Oops: 0000 [#1] PREEMPT SMP
    Pid: 19677, comm: rmdir Tainted: G W 3.0.0-mm1-00188-gf38d32b #35 ECS MCP61M-M3/MCP61M-M3
    RIP: 0010:[] css_is_ancestor+0x20/0x70
    RSP: 0018:ffff880077b09c88 EFLAGS: 00010202
    Process rmdir (pid: 19677, threadinfo ffff880077b08000, task ffff8800781bb310)
    Call Trace:
    [] mem_cgroup_same_or_subtree+0x33/0x40
    [] drain_all_stock+0x11f/0x170
    [] mem_cgroup_force_empty+0x231/0x6d0
    [] mem_cgroup_pre_destroy+0x14/0x20
    [] cgroup_rmdir+0xb9/0x500
    [] vfs_rmdir+0x86/0xe0
    [] do_rmdir+0xfb/0x110
    [] sys_rmdir+0x16/0x20
    [] system_call_fastpath+0x16/0x1b

    We are crashing because we try to dereference the cached memcg while
    checking whether we should wait for draining on the cache. The cache has
    already been cleaned up, though.

    There is also a theoretical chance that the cached memcg gets freed
    between the time we test for FLUSHING_CACHED_CHARGE and the time we
    dereference it in mem_cgroup_same_or_subtree:

    CPU0                       CPU1                       CPU2
    mem=stock->cached
                               stock->cached=NULL
                               clear_bit
                                                          test_and_set_bit
    test_bit()  ...
                               mem_cgroup_destroy
    use after free

    The percpu_charge_mutex protected against this race because sync draining
    was exclusive.
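
    A boiled-down sketch of the racy pattern (simplified; variable names like
    root_mem and the exact control flow are assumptions, not the memcontrol.c
    code verbatim): testing the flag and dereferencing the cached pointer are
    two separate steps, and nothing pins the memcg in between.

    /* sync drain path, simplified */
    struct mem_cgroup *mem = stock->cached;   /* snapshot, no reference held */

    if (mem && stock->nr_pages &&
        test_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
            /* stock->cached may have been cleared and the memcg freed
             * between the snapshot above and this dereference */
            if (mem_cgroup_same_or_subtree(root_mem, mem))
                    flush_work(&stock->work);
    }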

    It is safer to revert now and come up with a more parallel
    implementation later.

    Signed-off-by: Michal Hocko
    Reported-by: Johannes Weiner
    Acked-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • deactivate_slab() gets the comparison wrong when checking whether more
    than the minimum number of partial pages are on the partial list. An
    effect of this may be that empty pages are not freed from
    deactivate_slab(). The result could be an OOM due to growth of the partial
    slabs per node. Frees mostly occur from __slab_free(), which is okay, so
    this would only affect use cases where a lot of switching around of
    per-cpu slabs occurs.

    Switching per cpu slabs occurs with high frequency if debugging options are
    enabled.
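
    A rough sketch of the intended decision (illustrative only, not the actual
    patch; s, n and page are the kmem_cache, node and slab page already at
    hand in deactivate_slab()):

    if (!page->inuse && n->nr_partial > s->min_partial) {
            /* the node already holds enough partial slabs: free the empty one */
            stat(s, FREE_SLAB);
            discard_slab(s, page);
    } else {
            /* otherwise keep it on the partial list for future allocations */
            add_partial(n, page, 1);
    }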

    Reported-and-tested-by: Xiaotian Feng
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

09 Aug, 2011

2 commits

  • The check_bytes() function is used by slub debugging. It returns a pointer
    to the first unmatching byte for a character in the given memory area.

    If the byte value to match is greater than or equal to 0x80, check_bytes()
    doesn't work, because the 64-bit pattern is generated as below.

    value64 = value | value << 8 | value << 16 | value << 24;
    value64 = value64 | value64 << 32;

    Because the type of value is u8, the integer promotions turn value << 24
    into a negative int, which is then sign-extended when converted to 64
    bits. The upper 32 bits of value64 are therefore 0xffffffff after the
    first line, and the second line has no effect.

    This fixes the 64-bit pattern generation.
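
    For illustration, a small user-space demo of the promotion problem and of
    one way to build the pattern safely by widening first; it mirrors the idea
    of the fix but is not the kernel patch verbatim.

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            uint8_t value = 0xa5;   /* any byte >= 0x80 triggers the bug */
            uint64_t bad, good;

            /* buggy: operands are promoted to int, value << 24 goes negative,
             * and converting that int to a 64-bit value sign-extends it */
            bad = value | value << 8 | value << 16 | value << 24;
            bad = bad | bad << 32;

            /* safe: widen to 64 bits first, then shift */
            good = value;
            good |= good << 8;
            good |= good << 16;
            good |= good << 32;

            printf("buggy pattern: %016llx\n", (unsigned long long)bad);
            printf("fixed pattern: %016llx\n", (unsigned long long)good);
            return 0;
    }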

    Signed-off-by: Akinobu Mita
    Cc: Christoph Lameter
    Cc: Matt Mackall
    Reviewed-by: Marcin Slusarz
    Acked-by: Eric Dumazet
    Signed-off-by: Pekka Enberg

    Akinobu Mita
     
  • When a slab is freed by __slab_free() and the slab can only ever contain
    a single object, then it was full (and therefore not on the partial lists,
    but on the full list in the debug case) before we reached slab_empty.

    This caused the following full list corruption when SLUB debugging was enabled:

    [ 5913.233035] ------------[ cut here ]------------
    [ 5913.233097] WARNING: at lib/list_debug.c:53 __list_del_entry+0x8d/0x98()
    [ 5913.233101] Hardware name: Adamo 13
    [ 5913.233105] list_del corruption. prev->next should be ffffea000434fd20, but was ffffea0004199520
    [ 5913.233108] Modules linked in: nfs fscache fuse ebtable_nat ebtables ppdev parport_pc lp parport ipt_MASQUERADE iptable_nat nf_nat nfsd lockd nfs_acl auth_rpcgss xt_CHECKSUM sunrpc iptable_mangle bridge stp llc cpufreq_ondemand acpi_cpufreq freq_table mperf ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables rfcomm bnep arc4 iwlagn snd_hda_codec_hdmi snd_hda_codec_idt snd_hda_intel btusb mac80211 snd_hda_codec bluetooth snd_hwdep snd_seq snd_seq_device snd_pcm usb_debug dell_wmi sparse_keymap cdc_ether usbnet cdc_acm uvcvideo cdc_wdm mii cfg80211 snd_timer dell_laptop videodev dcdbas snd microcode v4l2_compat_ioctl32 soundcore joydev tg3 pcspkr snd_page_alloc iTCO_wdt i2c_i801 rfkill iTCO_vendor_support wmi virtio_net kvm_intel kvm ipv6 xts gf128mul dm_crypt i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: scsi_wait_scan]
    [ 5913.233213] Pid: 0, comm: swapper Not tainted 3.0.0+ #127
    [ 5913.233213] Call Trace:
    [ 5913.233213] [] warn_slowpath_common+0x83/0x9b
    [ 5913.233213] [] warn_slowpath_fmt+0x46/0x48
    [ 5913.233213] [] __list_del_entry+0x8d/0x98
    [ 5913.233213] [] list_del+0xe/0x2d
    [ 5913.233213] [] __slab_free+0x1db/0x235
    [ 5913.233213] [] ? bvec_free_bs+0x35/0x37
    [ 5913.233213] [] ? bvec_free_bs+0x35/0x37
    [ 5913.233213] [] ? bvec_free_bs+0x35/0x37
    [ 5913.233213] [] kmem_cache_free+0x88/0x102
    [ 5913.233213] [] bvec_free_bs+0x35/0x37
    [ 5913.233213] [] bio_free+0x34/0x64
    [ 5913.233213] [] dm_bio_destructor+0x12/0x14
    [ 5913.233213] [] bio_put+0x2b/0x2d
    [ 5913.233213] [] clone_endio+0x9e/0xb4
    [ 5913.233213] [] bio_endio+0x2d/0x2f
    [ 5913.233213] [] crypt_dec_pending+0x5c/0x8b [dm_crypt]
    [ 5913.233213] [] crypt_endio+0x78/0x81 [dm_crypt]

    [ Full discussion here: https://lkml.org/lkml/2011/8/4/375 ]

    Make sure that we also remove such a slab from the full lists.

    Reported-and-tested-by: Dave Jones
    Reported-and-tested-by: Xiaotian Feng
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

05 Aug, 2011

1 commit


04 Aug, 2011

18 commits

  • Fernando found we hit the regular OFF_SLAB 'recursion' before we
    annotate the locks, cure this.

    The relevant portion of the stack-trace:

    > [ 0.000000] [] rt_spin_lock+0x50/0x56
    > [ 0.000000] [] __cache_free+0x43/0xc3
    > [ 0.000000] [] kmem_cache_free+0x6c/0xdc
    > [ 0.000000] [] slab_destroy+0x4f/0x53
    > [ 0.000000] [] free_block+0x94/0xc1
    > [ 0.000000] [] do_tune_cpucache+0x10b/0x2bb
    > [ 0.000000] [] enable_cpucache+0x7b/0xa7
    > [ 0.000000] [] kmem_cache_init_late+0x1f/0x61
    > [ 0.000000] [] start_kernel+0x24c/0x363
    > [ 0.000000] [] i386_start_kernel+0xa9/0xaf

    Reported-by: Fernando Lopez-Lezcano
    Acked-by: Pekka Enberg
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1311888176.2617.379.camel@laptop
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Lockdep thinks there's lock recursion through:

    kmem_cache_free()
      cache_flusharray()
        spin_lock(&l3->list_lock)   <--------------.
          ...                                      |
            spin_lock(&l3->list_lock)  ------------'

    Now debug objects doesn't use SLAB_DESTROY_BY_RCU and hence there is no
    actual possibility of recursion. Luckily debug objects marks its slab
    cache with SLAB_DEBUG_OBJECTS so we can identify it.

    Mark all SLAB_DEBUG_OBJECTS slab caches (all one of them!) with a special
    lockdep key so that lockdep sees it's a different cachep.

    Also add a WARN on trying to create a SLAB_DESTROY_BY_RCU |
    SLAB_DEBUG_OBJECTS cache, to avoid possible future trouble.
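
    A minimal sketch of the underlying lockdep technique (not the slab patch
    itself; the lock and key names are hypothetical): giving one lock instance
    its own class key makes lockdep track it separately from other locks of
    the same type, so nesting them is no longer reported as recursion.

    #include <linux/spinlock.h>
    #include <linux/lockdep.h>

    static struct lock_class_key debug_objects_cache_key;
    static spinlock_t debug_objects_list_lock;

    static void debug_objects_lock_init(void)
    {
            spin_lock_init(&debug_objects_list_lock);
            /* from now on lockdep reports this lock under its own class */
            lockdep_set_class(&debug_objects_list_lock, &debug_objects_cache_key);
    }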

    Reported-and-tested-by: Sebastian Siewior
    [ fixes to the initial patch ]
    Reported-by: Thomas Gleixner
    Acked-by: Pekka Enberg
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1311341165.27400.58.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • * 'apei-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6:
    ACPI, APEI, EINJ Param support is disabled by default
    APEI GHES: 32-bit buildfix
    ACPI: APEI build fix
    ACPI, APEI, GHES: Add hardware memory error recovery support
    HWPoison: add memory_failure_queue()
    ACPI, APEI, GHES, Error records content based throttle
    ACPI, APEI, GHES, printk support for recoverable error via NMI
    lib, Make gen_pool memory allocator lockless
    lib, Add lock-less NULL terminated single list
    Add Kconfig option ARCH_HAVE_NMI_SAFE_CMPXCHG
    ACPI, APEI, Add WHEA _OSC support
    ACPI, APEI, Add APEI bit support in generic _OSC call
    ACPI, APEI, GHES, Support disable GHES at boot time
    ACPI, APEI, GHES, Prevent GHES to be built as module
    ACPI, APEI, Use apei_exec_run_optional in APEI EINJ and ERST
    ACPI, APEI, Add apei_exec_run_optional
    ACPI, APEI, GHES, Do not ratelimit fatal error printk before panic
    ACPI, APEI, ERST, Fix erst-dbg long record reading issue
    ACPI, APEI, ERST, Prevent erst_dbg from loading if ERST is disabled

    Linus Torvalds
     
  • Make the radix_tree exceptional cases, mostly in filemap.c, clearer.

    It's hard to devise a suitable snappy name that illuminates the use by
    shmem/tmpfs for swap, while keeping filemap/pagecache/radix_tree
    generality. And akpm points out that /* radix_tree_deref_retry(page) */
    comments look like calls that have been commented out for unknown
    reason.

    Skirt the naming difficulty by rearranging these blocks to handle the
    transient radix_tree_deref_retry(page) case first; then just explain the
    remaining shmem/tmpfs swap case in a comment.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • We have already acknowledged that swapoff of a tmpfs file is slower than
    it was before conversion to the generic radix_tree: a little slower
    there will be acceptable, if the hotter paths are faster.

    But it was a shock to find swapoff of a 500MB file 20 times slower on my
    laptop, taking 10 minutes; and at that rate it significantly slows down
    my testing.

    Now, most of that turned out to be overhead from PROVE_LOCKING and
    PROVE_RCU: without those it was only 4 times slower than before; and
    more realistic tests on other machines don't fare as badly.

    I've tried a number of things to improve it, including tagging the swap
    entries, then doing lookup by tag: I'd expected that to halve the time,
    but in practice it's erratic, and often counter-productive.

    The only change I've so far found to make a consistent improvement, is
    to short-circuit the way we go back and forth, gang lookup packing
    entries into the array supplied, then shmem scanning that array for the
    target entry. Scanning in place doubles the speed, so it's now only
    twice as slow as before (or three times slower when the PROVEs are on).

    So, add radix_tree_locate_item() as an expedient, once-off,
    single-caller hack to do the lookup directly in place. #ifdef it on
    CONFIG_SHMEM and CONFIG_SWAP, as much to document its limited
    applicability as save space in other configurations. And, sadly,
    #include sched.h for cond_resched().

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove PageSwapBacked (!page_is_file_cache) cases from
    add_to_page_cache_locked() and add_to_page_cache_lru(): those pages now
    go through shmem_add_to_page_cache().

    Remove a comment on maximum tmpfs size from fsstack_copy_inode_size(),
    and add a comment on swap entries to invalidate_mapping_pages().

    And mincore_page() uses find_get_page() on what might be shmem or a
    tmpfs file: allow for a radix_tree_exceptional_entry(), and proceed to
    find_get_page() on swapper_space if so (oh, swapper_space needs #ifdef).

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • But we've not yet removed the old swp_entry_t i_direct[16] from
    shmem_inode_info. That's because it was still being shared with the
    inline symlink. Remove it now (saving 64 or 128 bytes from shmem inode
    size), and use kmemdup() for short symlinks, say, those up to 128 bytes.
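
    A hedged sketch of the short-symlink path (the SHORT_SYMLINK_LEN name, the
    info->symlink field and the error handling are illustrative assumptions,
    not the exact shmem.c code):

    #define SHORT_SYMLINK_LEN 128   /* assumed cutoff, per the text above */

    if (len <= SHORT_SYMLINK_LEN) {
            /* duplicate the target into a small kmalloc'd buffer
             * instead of consuming a whole page */
            info->symlink = kmemdup(symname, len, GFP_KERNEL);
            if (!info->symlink) {
                    iput(inode);
                    return -ENOMEM;
            }
    }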

    I wonder why mpol_free_shared_policy() is done in shmem_destroy_inode()
    rather than shmem_evict_inode(), where we usually do such freeing? I
    guess it doesn't matter, and I'm not into NUMA mpol testing right now.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Reviewed-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Convert shmem_writepage() to use shmem_delete_from_page_cache() to use
    shmem_radix_tree_replace() to substitute swap entry for page pointer
    atomically in the radix tree.

    As with shmem_add_to_page_cache(), it's not entirely satisfactory to be
    copying such code from delete_from_swap_cache, but again judged easier
    to sell than making its other callers go through the extras.

    Remove the toy implementation's shmem_put_swap() and shmem_get_swap(),
    now unreferenced, and the hack to disable swap: it's now good to go.

    The way things have worked out, info->lock no longer helps to guard the
    shmem_swaplist: we increment swapped under shmem_swaplist_mutex only.
    That global mutex exclusion between shmem_writepage() and shmem_unuse()
    is not pretty, and we ought to find another way; but it's been forced on
    us by recent race discoveries, not a consequence of this patchset.

    And what has become of the WARN_ON_ONCE(1) free_swap_and_cache() if a
    swap entry was found already present? That's no longer possible, the
    (unknown) one inserting this page into filecache would hit the swap
    entry occupying that slot.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove mem_cgroup_shmem_charge_fallback(): it was only required when we
    had to move swappage to filecache with GFP_NOWAIT.

    Remove the GFP_NOWAIT special case from mem_cgroup_cache_charge(), by
    moving its call out from shmem_add_to_page_cache() to two of thats three
    callers. But leave it doing mem_cgroup_uncharge_cache_page() on error:
    although asymmetrical, it's easier for all 3 callers to handle.

    These two changes would also be appropriate if anyone were to start
    using shmem_read_mapping_page_gfp() with GFP_NOWAIT.

    Remove mem_cgroup_get_shmem_target(): mc_handle_file_pte() can test
    radix_tree_exceptional_entry() to get what it needs for itself.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Convert shmem_getpage_gfp(), the engine-room of shmem, to expect page or
    swap entry returned from radix tree by find_lock_page().

    Whereas the repetitive old method proceeded mainly under info->lock,
    dropping and repeating whenever one of the conditions needed was not
    met, now we can proceed without it, leaving shmem_add_to_page_cache() to
    check for a race.

    This way there is no need to preallocate a page, no need for an early
    radix_tree_preload(), no need for mem_cgroup_shmem_charge_fallback().

    Move the error unwinding down to the bottom instead of repeating it
    throughout. ENOSPC handling is a little different from before: there is
    no longer any race between find_lock_page() and finding swap, but we can
    arrive at ENOSPC before calling shmem_recalc_inode(), which might
    occasionally discover freed space.

    Be stricter to check i_size before returning. info->lock is used for
    little but alloced, swapped, i_blocks updates. Move i_blocks updates
    out from under the max_blocks check, so even an unlimited size=0 mount
    can show accurate du.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Convert shmem_unuse_inode() to use a lockless gang lookup of the radix
    tree, searching for matching swap.

    This is somewhat slower than the old method: because of repeated radix
    tree descents, because of copying entries up, but probably most because
    the old method noted and skipped once a vector page was cleared of swap.
    Perhaps we can devise a use of radix tree tagging to achieve that later.

    shmem_add_to_page_cache() uses shmem_radix_tree_replace() to compensate
    for the lockless lookup by checking that the expected entry is in place,
    under lock. It is not very satisfactory to be copying this much from
    add_to_page_cache_locked(), but I think easier to sell than insisting
    that every caller of add_to_page_cache*() go through the extras.
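
    Roughly, the compensating check works like the sketch below (simplified;
    the helper name and details are illustrative, the real
    shmem_radix_tree_replace() may differ): with the mapping's tree_lock held,
    re-verify that the slot still contains the entry seen during the lockless
    lookup, and only then replace it.

    /* simplified sketch; caller holds the mapping's tree_lock */
    static int replace_if_expected(struct radix_tree_root *root, pgoff_t index,
                                   void *expected, void *replacement)
    {
            void **pslot = radix_tree_lookup_slot(root, index);

            if (!pslot || *pslot != expected)
                    return -ENOENT;   /* raced with someone else: give up */
            radix_tree_replace_slot(pslot, replacement);
            return 0;
    }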

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Disable the toy swapping implementation in shmem_writepage() - it's hard
    to support two schemes at once - and convert shmem_truncate_range() to a
    lockless gang lookup of swap entries along with pages, freeing both.

    Since the second loop tightens its noose until all entries of either
    kind have been squeezed out (and we shall make sure that there's not an
    instant when neither is visible), there is no longer a need for yet
    another pass below.

    shmem_radix_tree_replace() compensates for the lockless lookup by
    checking that the expected entry is in place, under lock, before
    replacing it. Here it just deletes, but will be used in later patches
    to substitute swap entry for page or page for swap entry.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Bring truncate.c's code for truncate_inode_pages_range() inline into
    shmem_truncate_range(), replacing its first call (there's a followup
    call below, but leave that one, it will disappear next).

    Don't play with it yet, apart from leaving out the cleancache flush, and
    (importantly) the nrpages == 0 skip, and moving shmem_setattr()'s
    partial page preparation into its partial page handling.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • While it's at its least, make a number of boring nitpicky cleanups to
    shmem.c, mostly for consistency of variable naming. Things like "swap"
    instead of "entry", "pgoff_t index" instead of "unsigned long idx".

    And since everything else here is prefixed "shmem_", better change
    init_tmpfs() to shmem_init().

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The maximum size of a shmem/tmpfs file has been limited by the maximum
    size of its triple-indirect swap vector. With 4kB page size, maximum
    filesize was just over 2TB on a 32-bit kernel, but sadly one eighth of
    that on a 64-bit kernel. (With 8kB page size, maximum filesize was just
    over 4TB on a 64-bit kernel, but 16TB on a 32-bit kernel,
    MAX_LFS_FILESIZE being then more restrictive than swap vector layout.)

    It's a shame that tmpfs should be more restrictive than ramfs, and this
    limitation has now been noticed. Add another level to the swap vector?
    No, it became obscure and hard to maintain, once I complicated it to
    make use of highmem pages nine years ago: better choose another way.

    Surely, if 2.4 had had the radix tree pagecache introduced in 2.5, then
    tmpfs would never have invented its own peculiar radix tree: we would
    have fitted swap entries into the common radix tree instead, in much the
    same way as we fit swap entries into page tables.

    And why should each file have a separate radix tree for its pages and
    for its swap entries? The swap entries are required precisely where and
    when the pages are not. We want to put them together in a single radix
    tree: which can then avoid much of the locking which was needed to
    prevent them from being exchanged underneath us.

    This also avoids the waste of memory devoted to swap vectors, first in
    the shmem_inode itself, then at least two more pages once a file grew
    beyond 16 data pages (pages accounted by df and du, but not by memcg).
    Allocated upfront, to avoid allocation when under swapping pressure, but
    pure waste when CONFIG_SWAP is not set - I have never spattered around
    the ifdefs to prevent that, preferring this move to sharing the common
    radix tree instead.

    There are three downsides to sharing the radix tree. One, that it binds
    tmpfs more tightly to the rest of mm, either requiring knowledge of swap
    entries in radix tree there, or duplication of its code here in shmem.c.
    I believe that the simplifications and memory savings (and probable higher
    performance, not yet measured) justify that.

    Two, that on HIGHMEM systems with SWAP enabled, it's the lowmem radix
    nodes that cannot be freed under memory pressure - whereas before it was
    the less precious highmem swap vector pages that could not be freed.
    I'm hoping that 64-bit has now been accessible for long enough, that the
    highmem argument has grown much less persuasive.

    Three, that swapoff is slower than it used to be on tmpfs files, since
    it's using a simple generic mechanism not tailored to it: I find this
    noticeable, and shall want to improve, but maybe nobody else will
    notice.

    So... now remove most of the old swap vector code from shmem.c. But,
    for the moment, keep the simple i_direct vector of 16 pages, with simple
    accessors shmem_put_swap() and shmem_get_swap(), as a toy implementation
    to help mark where swap needs to be handled in subsequent patches.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • If swap entries are to be stored along with struct page pointers in a
    radix tree, they need to be distinguished as exceptional entries.

    Most of the handling of swap entries in radix tree will be contained in
    shmem.c, but a few functions in filemap.c's common code need to check
    for their appearance: find_get_page(), find_lock_page(),
    find_get_pages() and find_get_pages_contig().

    So as not to slow their fast paths, tuck those checks inside the
    existing checks for unlikely radix_tree_deref_slot(); except for
    find_lock_page(), where it is an added test. And make it a BUG in
    find_get_pages_tag(), which is not applied to tmpfs files.

    A part of the reason for eliminating shmem_readpage() earlier, was to
    minimize the places where common code would need to allow for swap
    entries.

    The swp_entry_t known to swapfile.c must be massaged into a slightly
    different form when stored in the radix tree, just as it gets massaged
    into a pte_t when stored in page tables.

    In an i386 kernel this limits its information (type and page offset) to
    30 bits: given 32 "types" of swapfile and 4kB pagesize, that's a maximum
    swapfile size of 128GB. Which is less than the 512GB we previously
    allowed with X86_PAE (where the swap entry can occupy the entire upper
    32 bits of a pte_t), but not a new limitation on 32-bit without PAE; and
    there's not a new limitation on 64-bit (where swap filesize is already
    limited to 16TB by a 32-bit page offset). Thirty areas of 128GB is
    probably still enough swap for a 64GB 32-bit machine.

    Provide swp_to_radix_entry() and radix_to_swp_entry() conversions, and
    enforce filesize limit in read_swap_header(), just as for ptes.
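
    As a sketch, the conversions amount to shifting the swap entry up past the
    two radix-tree flag bits and tagging the result as exceptional; the exact
    constants below follow the exceptional-entry patch earlier in this series
    and should be treated as assumptions rather than the final code.

    #define RADIX_TREE_EXCEPTIONAL_ENTRY    2
    #define RADIX_TREE_EXCEPTIONAL_SHIFT    2

    static inline void *swp_to_radix_entry(swp_entry_t entry)
    {
            /* on 32-bit this leaves 30 bits: 5 for the swap type and 25 for
             * the page offset, i.e. 2^25 * 4kB = 128GB per swap area */
            unsigned long value = entry.val << RADIX_TREE_EXCEPTIONAL_SHIFT;

            return (void *)(value | RADIX_TREE_EXCEPTIONAL_ENTRY);
    }

    static inline swp_entry_t radix_to_swp_entry(void *arg)
    {
            swp_entry_t entry;

            entry.val = (unsigned long)arg >> RADIX_TREE_EXCEPTIONAL_SHIFT;
            return entry;
    }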

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • A patchset to extend tmpfs to MAX_LFS_FILESIZE by abandoning its
    peculiar swap vector, instead keeping a file's swap entries in the same
    radix tree as its struct page pointers: thus saving memory, and
    simplifying its code and locking.

    This patch:

    The radix_tree is used by several subsystems for different purposes. A
    major use is to store the struct page pointers of a file's pagecache for
    memory management. But what if mm wanted to store something other than
    page pointers there too?

    The low bit of a radix_tree entry is already used to denote an indirect
    pointer, for internal use, and the unlikely radix_tree_deref_retry()
    case.

    Define the next bit as denoting an exceptional entry, and supply inline
    functions radix_tree_exception() to return non-0 in either unlikely
    case, and radix_tree_exceptional_entry() to return non-0 in the second
    case.
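
    A sketch of the two flag bits and the new predicates, following the
    description above (the exact constant values are an assumption):

    #define RADIX_TREE_INDIRECT_PTR         1   /* existing internal use */
    #define RADIX_TREE_EXCEPTIONAL_ENTRY    2   /* new: exceptional entry */

    static inline int radix_tree_exceptional_entry(void *arg)
    {
            /* non-0 only for the exceptional case */
            return (unsigned long)arg & RADIX_TREE_EXCEPTIONAL_ENTRY;
    }

    static inline int radix_tree_exception(void *arg)
    {
            /* non-0 in either unlikely case */
            return (unsigned long)arg &
                   (RADIX_TREE_INDIRECT_PTR | RADIX_TREE_EXCEPTIONAL_ENTRY);
    }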

    If a subsystem already uses radix_tree with that bit set, no problem: it
    does not affect internal workings at all, but is defined for the
    convenience of those storing well-aligned pointers in the radix_tree.

    The radix_tree_gang_lookups have an implicit assumption that the caller
    can deduce the offset of each entry returned e.g. by the page->index of
    a struct page. But that may not be feasible for some kinds of item to
    be stored there.

    radix_tree_gang_lookup_slot() now allows for an optional indices argument,
    an output array in which to return those offsets. The same could be added
    to other radix_tree_gang_lookups, but for now keep it to the only one
    for which we need it.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • init_fault_attr_dentries() is used to export fault_attr via debugfs.
    But it can only export it in the debugfs root directory.

    Per Forlin is working on mmc_fail_request which adds support to inject
    data errors after a completed host transfer in MMC subsystem.

    The fault_attr for mmc_fail_request should be defined per mmc host and
    exported in a per-host debugfs directory like
    /sys/kernel/debug/mmc0/mmc_fail_request.

    init_fault_attr_dentries() doesn't help for mmc_fail_request. So this
    introduces fault_create_debugfs_attr(), which is able to create the
    directory under an arbitrary parent directory, and replaces
    init_fault_attr_dentries().
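
    A hedged usage sketch for a per-host MMC directory; the (name, parent,
    attr) signature with an ERR_PTR return is assumed from the description,
    and the fail_mmc_request field is an illustrative fault_attr member.

    static int mmc_fail_request_debugfs_init(struct mmc_host *host)
    {
            struct dentry *dir, *fail_dir;

            dir = debugfs_create_dir("mmc0", NULL);
            if (!dir)
                    return -ENOMEM;

            fail_dir = fault_create_debugfs_attr("mmc_fail_request", dir,
                                                 &host->fail_mmc_request);
            if (IS_ERR(fail_dir))
                    return PTR_ERR(fail_dir);
            return 0;
    }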

    [akpm@linux-foundation.org: extraneous semicolon, per Randy]
    Signed-off-by: Akinobu Mita
    Tested-by: Per Forlin
    Cc: Jens Axboe
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Matt Mackall
    Cc: Randy Dunlap
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     

03 Aug, 2011

2 commits

  • Some trivial conflicts due to other various merges
    adding to the end of common lists sooner than this one.

    arch/ia64/Kconfig
    arch/powerpc/Kconfig
    arch/x86/Kconfig
    lib/Kconfig
    lib/Makefile

    Signed-off-by: Len Brown

    Len Brown
     
  • memory_failure() is the entry point for HWPoison memory error
    recovery. It must be called in process context. But commonly
    hardware memory errors are notified via MCE or NMI, so some delayed
    execution mechanism must be used. In MCE handler, a work queue + ring
    buffer mechanism is used.

    In addition to MCE, now APEI (ACPI Platform Error Interface) GHES
    (Generic Hardware Error Source) can be used to report memory errors
    too. To add support to APEI GHES memory recovery, a mechanism similar
    to that of MCE is implemented. memory_failure_queue() is the new
    entry point that can be called in IRQ context. The next step is to
    make the MCE handler use this interface too.
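
    A hedged sketch of how an error handler might hand a poisoned page off to
    process context; the three-argument (pfn, trapno, flags) signature is
    assumed from the APEI/MCE code of this era.

    /* running in IRQ context: do not call memory_failure() directly,
     * queue the pfn for the work-queue based recovery instead */
    static void report_mem_error(unsigned long pfn)
    {
            memory_failure_queue(pfn, 0 /* trapno */, 0 /* flags */);
    }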

    Signed-off-by: Huang Ying
    Cc: Andi Kleen
    Cc: Wu Fengguang
    Cc: Andrew Morton
    Signed-off-by: Len Brown

    Huang Ying
     

02 Aug, 2011

1 commit

  • exit_mm() sets ->mm to NULL and then does mmput()->exit_mmap(), which
    frees the memory.

    However select_bad_process() checks ->mm != NULL before TIF_MEMDIE,
    so it continues to kill other tasks even if we have the oom-killed
    task freeing its memory.

    Change select_bad_process() to check ->mm after TIF_MEMDIE, but skip
    the tasks which have already passed exit_notify() to ensure a zombie
    with TIF_MEMDIE set can't block the oom-killer. Alternatively we could
    probably clear TIF_MEMDIE after exit_mmap().

    Signed-off-by: Oleg Nesterov
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

01 Aug, 2011

1 commit


31 Jul, 2011

2 commits

  • Use the nice enumerated constant.

    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Pekka Enberg

    Andrew Morton
     
  • * 'slub/lockless' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6: (21 commits)
    slub: When allocating a new slab also prep the first object
    slub: disable interrupts in cmpxchg_double_slab when falling back to pagelock
    Avoid duplicate _count variables in page_struct
    Revert "SLUB: Fix build breakage in linux/mm_types.h"
    SLUB: Fix build breakage in linux/mm_types.h
    slub: slabinfo update for cmpxchg handling
    slub: Not necessary to check for empty slab on load_freelist
    slub: fast release on full slab
    slub: Add statistics for the case that the current slab does not match the node
    slub: Get rid of the another_slab label
    slub: Avoid disabling interrupts in free slowpath
    slub: Disable interrupts in free_debug processing
    slub: Invert locking and avoid slab lock
    slub: Rework allocator fastpaths
    slub: Pass kmem_cache struct to lock and freeze slab
    slub: explicit list_lock taking
    slub: Add cmpxchg_double_slab()
    mm: Rearrange struct page
    slub: Move page->frozen handling near where the page->freelist handling occurs
    slub: Do not use frozen page flag but a bit in the page counters
    ...

    Linus Torvalds
     

28 Jul, 2011

1 commit


27 Jul, 2011

9 commits

  • This allows us to move duplicated code in <asm/atomic.h>
    (atomic_inc_not_zero() for now) to <linux/atomic.h>.

    Signed-off-by: Arun Sharma
    Reviewed-by: Eric Dumazet
    Cc: Ingo Molnar
    Cc: David Miller
    Cc: Eric Dumazet
    Acked-by: Mike Frysinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun Sharma
     
  • Now that cleanup_fault_attr_dentries() recursively removes a directory, we
    can simplify the error handling in the initialization code and no longer
    need to hold dentry structs for each debugfs file.

    Signed-off-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • Now that cleanup_fault_attr_dentries() recursively removes a directory, we
    can simplify the error handling in the initialization code and no longer
    need to hold dentry structs for each debugfs file.

    Signed-off-by: Akinobu Mita
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • Use debugfs_remove_recursive() to simplify initialization and
    deinitialization of fault injection debugfs files.

    Signed-off-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • [ This patch has already been accepted as commit 0ac0c0d0f837 but later
    reverted (commit 35926ff5fba8) because it introduced an arch-specific
    __node_random which was defined only for x86, so it broke other
    archs. This is a followup without any arch-specific code. Other than
    that there are no functional changes.]

    Some workloads that create a large number of small files tend to assign
    too many pages to node 0 (multi-node systems). Part of the reason is
    that the rotor (in cpuset_mem_spread_node()) used to assign nodes starts
    at node 0 for newly created tasks.

    This patch changes the rotor to be initialized to a random node number
    of the cpuset.

    [akpm@linux-foundation.org: fix layout]
    [Lee.Schermerhorn@hp.com: Define stub numa_random() for !NUMA configuration]
    [mhocko@suse.cz: Make it arch independent]
    [akpm@linux-foundation.org: fix CONFIG_NUMA=y, MAX_NUMNODES>1 build]
    Signed-off-by: Jack Steiner
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Michal Hocko
    Reviewed-by: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Paul Menage
    Cc: Jack Steiner
    Cc: Robin Holt
    Cc: David Rientjes
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Jack Steiner
    Cc: KOSAKI Motohiro
    Cc: Lee Schermerhorn
    Cc: Michal Hocko
    Cc: Paul Menage
    Cc: Pekka Enberg
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • percpu_charge_mutex protects against multiple simultaneous per-cpu charge
    cache drains because we might end up having too many work items. At
    least this was the case until commit 26fe61684449 ("memcg: fix percpu
    cached charge draining frequency") when we introduced a more targeted
    draining for async mode.

    Now that also sync draining is targeted we can safely remove mutex
    because we will not send more work than the current number of CPUs.
    FLUSHING_CACHED_CHARGE protects from sending the same work multiple
    times, and the stock->nr_pages == 0 check protects against pointlessly
    sending a work item when there is obviously nothing to be done. This is of
    course racy but we
    can live with it as the race window is really small (we would have to
    see FLUSHING_CACHED_CHARGE cleared while nr_pages would be still
    non-zero).

    The only remaining place where we can race is synchronous mode when we
    rely on FLUSHING_CACHED_CHARGE test which might have been set by other
    drainer on the same group but we should wait in that case as well.

    Signed-off-by: Michal Hocko
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • We check in several places whether two given groups are the same or at
    least in the same subtree of a hierarchy. Let's make a helper for it to
    make the code easier to read.

    Signed-off-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Currently we have two ways how to drain per-CPU caches for charges.
    drain_all_stock_sync will synchronously drain all caches while
    drain_all_stock_async will asynchronously drain only those that refer to
    a given memory cgroup or its subtree in hierarchy. Targeted async
    draining has been introduced by 26fe6168 (memcg: fix percpu cached
    charge draining frequency) to reduce the number of cpu workers.

    sync draining is currently triggered only from mem_cgroup_force_empty
    which is triggered only by userspace (mem_cgroup_force_empty_write) or
    when a cgroup is removed (mem_cgroup_pre_destroy). Although these are
    not usually frequent operations it still makes some sense to do targeted
    draining as well, especially if the box has many CPUs.

    This patch unifies both methods to use the single code (drain_all_stock)
    which relies on the original async implementation and just adds
    flush_work to wait on all caches that are still under work for the sync
    mode. We use the FLUSHING_CACHED_CHARGE bit check to avoid waiting on
    work that we haven't triggered. Please note that both sync
    and async functions are currently protected by percpu_charge_mutex so we
    cannot race with other drainers.

    Signed-off-by: Michal Hocko
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • drain_all_stock_async tries to optimize the work to be done on the work
    queue by excluding any work for the current CPU, because it assumes that
    the context we are called from has already tried to charge from that cache
    and failed, so it must be empty already.

    While the assumption is correct we can optimize it even more by checking
    the current number of pages in the cache. This will also reduce work
    on other CPUs with an empty stock.

    For the current CPU we can simply call drain_local_stock rather than
    deferring it to the work queue.

    [kamezawa.hiroyu@jp.fujitsu.com: use drain_local_stock for current CPU optimization]
    Signed-off-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Linus Torvalds

    Michal Hocko