08 Oct, 2010

26 commits

  • This fixes a problem introduced with the hugetlb hwpoison handling

    The user space SIGBUS signalling wants to know the size of the hugepage
    that caused a HWPOISON fault.

    Unfortunately the architecture page fault handlers do not have easy
    access to the struct page.

    Pass the information out in the fault error code instead.

    I added a separate VM_FAULT_HWPOISON_LARGE bit for this case and encode
    the hpage index in some free upper bits of the fault code. The small
    page hwpoison keeps the VM_FAULT_HWPOISON name to minimize
    changes.

    Also add code to hugetlb.h to convert that index into a page shift.

    Will be used in a further patch.

    Cc: Naoya Horiguchi
    Cc: fengguang.wu@intel.com
    Signed-off-by: Andi Kleen

    Andi Kleen
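
A minimal sketch of the encoding described above: a flag bit marks the hugepage case and a few upper bits carry the hugepage index, which is later mapped to a page shift. The flag values, bit positions, helper names, and the two-entry size table are assumptions made for illustration, not values taken from the patch.

```c
#include <stddef.h>

/* Hypothetical flag values and bit positions, for illustration only. */
#define FAULT_HWPOISON        0x0010u  /* small-page hwpoison */
#define FAULT_HWPOISON_LARGE  0x0020u  /* hugepage hwpoison, index in upper bits */
#define FAULT_HINDEX_SHIFT    12
#define FAULT_HINDEX_MASK     0xfu
#define BASE_PAGE_SHIFT       12u      /* assumed 4 KiB base pages */

/* Assumed table of hugepage sizes: index -> page shift (2 MiB, 1 GiB). */
static const unsigned int hpage_shift_table[] = { 21, 30 };

/* Build a fault code that carries the hugepage index. */
static inline unsigned int fault_encode_hugepage(unsigned int hindex)
{
    return FAULT_HWPOISON_LARGE |
           ((hindex & FAULT_HINDEX_MASK) << FAULT_HINDEX_SHIFT);
}

/* Recover the page shift to report; falls back to the base page shift. */
static inline unsigned int fault_to_page_shift(unsigned int fault)
{
    if (fault & FAULT_HWPOISON_LARGE) {
        size_t idx = (fault >> FAULT_HINDEX_SHIFT) & FAULT_HINDEX_MASK;

        if (idx < sizeof(hpage_shift_table) / sizeof(hpage_shift_table[0]))
            return hpage_shift_table[idx];
    }
    return BASE_PAGE_SHIFT;
}
```

With this layout the fault handler needs only the integer fault code, not the struct page, to report the size of the affected page.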
     
  • migrate_huge_page_move_mapping() is declared as "extern int ..."
    in include/linux/migrate.h for !CONFIG_MIGRATION,
    which causes build errors like the following:

    mm/mprotect.o: In function `migrate_huge_page_move_mapping':
    mprotect.c:(.text+0x0): multiple definition of `migrate_huge_page_move_mapping'
    mm/shmem.o:shmem.c:(.text+0x0): first defined here
    mm/rmap.o: In function `migrate_huge_page_move_mapping':
    rmap.c:(.text+0x0): multiple definition of `migrate_huge_page_move_mapping'
    mm/shmem.o:shmem.c:(.text+0x0): first defined here

    Reported-by: Stephen Rothwell
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
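
The message does not show the fix itself. A common remedy for this class of error is to make the disabled-configuration stub `static inline`, so each file that includes the header gets a private copy. The sketch below uses hypothetical names (`SKETCH_CONFIG_MIGRATION`, `sketch_move_mapping`) and an assumed error code.

```c
#include <errno.h>
#include <stddef.h>

struct address_space;   /* opaque here; only pointers are passed */
struct page;

#ifdef SKETCH_CONFIG_MIGRATION
/* Feature enabled: declaration only, defined in exactly one .c file. */
int sketch_move_mapping(struct address_space *mapping,
                        struct page *newpage, struct page *page);
#else
/*
 * Feature disabled: a stub with a body. "static inline" gives it
 * internal linkage, so including this header from several .c files
 * does not produce multiple external definitions at link time.
 */
static inline int sketch_move_mapping(struct address_space *mapping,
                                      struct page *newpage,
                                      struct page *page)
{
    (void)mapping;
    (void)newpage;
    (void)page;
    return -ENOSYS;   /* assumed "not supported" result */
}
#endif
```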
     
  • Fixes warning reported by Stephen Rothwell

    mm/hugetlb.c:2950: warning: 'is_hugepage_on_freelist' defined but not used

    for the !CONFIG_MEMORY_FAILURE case.

    Signed-off-by: Andi Kleen

    Andi Kleen
     
  • Linus asked for a cleanup of __page_set_anon_rmap to make
    it look more like the cleaner huge pages version.

    Factor out the duplicated PageAnon check into a single check
    at the beginning of the function.

    Remove obsolete comments and rewrite them into standard English.

    No functional changes.

    Signed-off-by: Andi Kleen

    Andi Kleen
     
  • Currently unpoisoning hugepages doesn't work correctly because
    clearing PG_HWPoison is done outside the if (TestClearPageHWPoison) block.
    This patch fixes it.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Jun'ichi Nomura
    Acked-by: Mel Gorman
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
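
The pattern at issue is "test and clear, then do the follow-up work only if the bit was actually set". A userspace sketch with C11 atomics, using hypothetical names (`page_sketch`, `poisoned_pages`) rather than the kernel's page flag helpers:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define FLAG_HWPOISON 0x1u   /* hypothetical flag bit */

struct page_sketch {
    atomic_uint flags;
};

static atomic_long poisoned_pages;   /* global count of poisoned pages */

/* Atomically clear the flag and report whether it was set before. */
static bool test_clear_hwpoison(struct page_sketch *p)
{
    unsigned int old = atomic_fetch_and(&p->flags, ~FLAG_HWPOISON);

    return (old & FLAG_HWPOISON) != 0;
}

/* All unpoison bookkeeping stays inside the "bit was set" branch. */
static bool unpoison(struct page_sketch *p)
{
    if (test_clear_hwpoison(p)) {
        atomic_fetch_sub(&poisoned_pages, 1);
        return true;
    }
    return false;
}
```

Calling `unpoison()` twice on the same page changes the counter only once, which is the property the fix is after.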
     
  • This patch extends the soft offlining framework to support hugepages.
    When corrected memory errors occur repeatedly on a hugepage,
    we can choose to stop using it by migrating its data onto another hugepage
    and disabling the original (possibly half-broken) one.

    ChangeLog since v4:
    - branch soft_offline_page() for hugepage

    ChangeLog since v3:
    - remove comment about "ToDo: hugepage soft-offline"

    ChangeLog since v2:
    - move refcount handling into isolate_lru_page()

    ChangeLog since v1:
    - add double check in isolating hwpoisoned hugepage
    - define free/non-free checker for hugepage
    - postpone calling put_page() for hugepage in soft_offline_page()

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Jun'ichi Nomura
    Acked-by: Mel Gorman
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     
  • Currently error recovery for free hugepages works only for MF_COUNT_INCREASED.
    This patch enables the !MF_COUNT_INCREASED case.

    Free hugepages can be handled directly by alloc_huge_page() and
    dequeue_hwpoisoned_huge_page(), and both of them are protected
    by hugetlb_lock, so there is no race between them.

    Note that this patch defines the refcount of a HWPoisoned hugepage
    dequeued from the freelist as 1, rather than the present 0, so that we
    can avoid a race between unpoison and memory failure on a free hugepage.
    This is reasonable because, unlike free buddy pages, a free hugepage
    is governed by hugetlbfs even after error handling finishes.
    It also makes the unpoison code added in a later patch cleaner.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Jun'ichi Nomura
    Acked-by: Mel Gorman
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     
  • Currently alloc_huge_page() raises the page refcount outside hugetlb_lock,
    which causes a race when dequeue_hwpoisoned_huge_page() runs concurrently
    with alloc_huge_page().
    To avoid it, this patch moves set_page_refcounted() inside hugetlb_lock.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Wu Fengguang
    Acked-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     
  • This check is necessary to avoid a race between dequeue and allocation,
    which can cause a free hugepage to be dequeued twice and make the kernel unstable.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Wu Fengguang
    Acked-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
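
The two locking changes above can be sketched in userspace: the allocator raises the refcount before it drops the free-list lock, and the poison path checks under the same lock that the page is still free before removing it. All names here (`hpage_sketch`, `alloc_hpage`, `dequeue_poisoned_hpage`) and the spinlock built on `atomic_flag` are assumptions of the sketch, not the kernel code.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct hpage_sketch {
    int refcount;                  /* 0 while the page sits on the free list */
    struct hpage_sketch *next;
};

static struct hpage_sketch *free_list;
static atomic_flag free_list_lock = ATOMIC_FLAG_INIT;

static void fl_lock(void)
{
    while (atomic_flag_test_and_set_explicit(&free_list_lock,
                                             memory_order_acquire))
        ;   /* spin until the lock is free */
}

static void fl_unlock(void)
{
    atomic_flag_clear_explicit(&free_list_lock, memory_order_release);
}

/* Unlink p if it is on the list. Caller must hold the lock. */
static bool unlink_locked(struct hpage_sketch *p)
{
    for (struct hpage_sketch **pp = &free_list; *pp; pp = &(*pp)->next) {
        if (*pp == p) {
            *pp = p->next;
            p->next = NULL;
            return true;
        }
    }
    return false;
}

static void free_hpage(struct hpage_sketch *p)
{
    fl_lock();
    p->refcount = 0;
    p->next = free_list;
    free_list = p;
    fl_unlock();
}

static struct hpage_sketch *alloc_hpage(void)
{
    fl_lock();
    struct hpage_sketch *p = free_list;
    if (p) {
        unlink_locked(p);
        p->refcount = 1;   /* raised before the lock is dropped */
    }
    fl_unlock();
    return p;
}

/* Returns 0 if the free page was taken, -EBUSY if it is already in use. */
static int dequeue_poisoned_hpage(struct hpage_sketch *p)
{
    int ret = -EBUSY;

    fl_lock();
    if (p->refcount == 0 && unlink_locked(p)) {
        p->refcount = 1;   /* a dequeued poisoned page holds one reference */
        ret = 0;
    }
    fl_unlock();
    return ret;
}
```

Because both paths change the list and the refcount under one lock, a page can never be observed as "off the list but refcount 0", which is the window that allowed a double dequeue.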
     
  • This patch extends the page migration code to support hugepage migration.
    One of the potential users of this feature is soft offlining, which
    is triggered by corrected memory errors (added by the next patch).

    Todo:
    - there are other users of page migration such as memory policy,
    memory hotplug and memory compaction.
    They are not ready for hugepage support for now.

    ChangeLog since v4:
    - define migrate_huge_pages()
    - remove changes on isolation/putback_lru_page()

    ChangeLog since v2:
    - refactor isolate/putback_lru_page() to handle hugepage
    - add comment about race on unmap_and_move_huge_page()

    ChangeLog since v1:
    - divide migration code path for hugepage
    - define routine checking migration swap entry for hugetlb
    - replace "goto" with "if/else" in remove_migration_pte()

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Jun'ichi Nomura
    Acked-by: Mel Gorman
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     
  • This patch modifies hugepage copy functions to have only destination
    and source hugepages as arguments for later use.
    The old ones are renamed from copy_{gigantic,huge}_page() to
    copy_user_{gigantic,huge}_page().
    This naming convention is consistent with that between copy_highpage()
    and copy_user_highpage().

    ChangeLog since v4:
    - add blank line between local declaration and code
    - remove unnecessary might_sleep()

    ChangeLog since v2:
    - change copy_huge_page() from macro to inline dummy function
    to avoid compile warning when !CONFIG_HUGETLB_PAGE.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     
  • We can't use the existing hugepage allocation functions to allocate a
    hugepage for page migration, because page migration can happen
    asynchronously with the running processes, and page migration users should
    call the allocation function with physical addresses (not virtual
    addresses) as arguments.

    ChangeLog since v3:
    - unify alloc_buddy_huge_page() and alloc_buddy_huge_page_node()

    ChangeLog since v2:
    - remove unnecessary get/put_mems_allowed() (thanks to David Rientjes)

    ChangeLog since v1:
    - add comment on top of alloc_huge_page_no_vma()

    Signed-off-by: Naoya Horiguchi
    Acked-by: Mel Gorman
    Signed-off-by: Jun'ichi Nomura
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     
  • Since the PageHWPoison() check exists to keep a hwpoisoned page that remains
    in the pagecache from being mapped into the process, it should be done in
    the "found in pagecache" branch, not in the common path.
    Otherwise, metadata corruption occurs if a memory failure happens between
    alloc_huge_page() and lock_page(), because the page fault fails with the
    metadata changes (such as refcount, mapcount, etc.) left in place.

    This patch moves the check to the "found in pagecache" branch and fixes the problem.

    ChangeLog since v2:
    - remove retry check in "new allocation" path.
    - make description more detailed
    - change patch name from "HWPOISON, hugetlb: move PG_HWPoison bit check"

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Jun'ichi Nomura
    Acked-by: Mel Gorman
    Reviewed-by: Wu Fengguang
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     
  • * 'hwpoison-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6:
    HWPOISON: Stop shrinking at right page count
    HWPOISON: Report correct address granuality for AO huge page errors
    HWPOISON: Copy si_addr_lsb to user
    page-types.c: fix name of unpoison interface

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block:
    elevator: fix oops on early call to elevator_change()

    Linus Torvalds
     
  • * 'for-linus' of git://neil.brown.name/md:
    md: check return code of read_sb_page
    md/raid1: minor bio initialisation improvements.
    md/raid1: avoid overflow in raid1 resync when bitmap is in use.

    Linus Torvalds
     
  • * 'drm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6:
    drm: don't drop handle reference on unload
    drm/ttm: Fix two race conditions + fix busy codepaths

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
    Input: wacom - fix runtime PM related deadlock
    Input: joydev - fix JSIOCSAXMAP ioctl
    Input: uinput - setup MT usage during device creation

    Linus Torvalds
     
  • * 'for-linus' of git://oss.sgi.com/xfs/xfs:
    xfs: properly account for reclaimed inodes

    Linus Torvalds
     
  • * 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6: (37 commits)
    V4L/DVB: v4l: radio: si470x: fix unneeded free_irq() call
    V4L/DVB: v4l: videobuf: prevent passing a NULL to dma_free_coherent()
    V4L/DVB: ir-core: Fix null dereferences in the protocols sysfs interface
    V4L/DVB: v4l: s5p-fimc: Fix 3-planar formats handling and pixel offset error on S5PV210 SoCs
    V4L/DVB: v4l: s5p-fimc: Fix return value on probe() failure
    V4L/DVB: uvcvideo: Restrict frame rates for Chicony CNF7129 webcam
    V4L/DVB: uvcvideo: Fix support for Medion Akoya All-in-one PC integrated webcam
    V4L/DVB: ivtvfb: prevent reading uninitialized stack memory
    V4L/DVB: cx25840: Fix typo in volume control initialization: 65335 vs. 65535
    V4L/DVB: v4l: mem2mem_testdev: add missing release for video_device
    V4L/DVB: v4l: mem2mem_testdev: fix errorenous comparison
    V4L/DVB: mt9v022.c: Fixed compilation warning
    V4L/DVB: mt9m111: added current colorspace at g_fmt
    V4L/DVB: mt9m111: cropcap and s_crop check if type is VIDEO_CAPTURE
    V4L/DVB: mx2_camera: fix a race causing NULL dereference
    V4L/DVB: tm6000: bugfix data handling
    V4L/DVB: gspca - sn9c20x: Bad transfer size of Bayer images
    V4L/DVB: videobuf-dma-sg: set correct size in last sg element
    V4L/DVB: cx231xx: Avoid an OOPS when card is unknown (card=0)
    V4L/DVB: dvb: fix smscore_getbuffer() logic
    ...

    Linus Torvalds
     
  • * 'i2c-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging:
    of/i2c: Fix module load order issue caused by of_i2c.c
    i2c: Fix checks which cause legacy suspend to never get called
    i2c-pca: Fix waitforcompletion() return value
    i2c: Fix for suspend/resume issue
    i2c: Remove obsolete cleanup for clientdata

    Linus Torvalds
     
  • When proc_doulongvec_minmax() is used with an array of longs, and no
    min/max check requested (.extra1 or .extra2 being NULL), we dereference a
    NULL pointer for the second element of the array.

    Noticed while doing some changes in the network stack for the "16TB problem".

    The fix is to not change the min & max pointers in
    __do_proc_doulongvec_minmax(), so that all elements of the vector share a
    single min/max limit, as in proc_dointvec_minmax().

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Eric Dumazet
    Cc: "Eric W. Biederman"
    Cc: Americo Wang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
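
A small userspace sketch of the corrected behaviour: one optional min and one optional max are shared by every element, and the pointers are never advanced, so a NULL limit stays NULL for the whole vector. The function name and the choice to skip out-of-range values are assumptions of the sketch.

```c
#include <stddef.h>

/*
 * Copy n values from src to dst, applying one shared optional range.
 * min and max may each be NULL, meaning "no check on that side".
 * Returns the number of values stored; out-of-range values are skipped.
 */
static size_t store_ulongvec(unsigned long *dst, const unsigned long *src,
                             size_t n,
                             const unsigned long *min,
                             const unsigned long *max)
{
    size_t stored = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned long val = src[i];

        /* min/max are not incremented here: same limit for every i. */
        if ((min && val < *min) || (max && val > *max))
            continue;
        dst[i] = val;
        stored++;
    }
    return stored;
}
```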
     
  • Add Samsung S5P series FIMC (Camera Interface) maintainers.

    Signed-off-by: Kyungmin Park
    Cc: Kyungmin Park
    Cc: Sylwester Nawrocki
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kyungmin Park
     
  • Cc: Haavard Skinnemoen
    Cc: Hans-Christian Egtvedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • We need to check parent's thresholds if parent has use_hierarchy == 1 to
    be sure that parent's threshold events will be triggered even if parent
    itself is not active (no MEM_CGROUP_EVENTS).

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • During boot of a 16TB system, the following is printed:
    Dentry cache hash table entries: -2147483648 (order: 22, 17179869184 bytes)

    Signed-off-by: Robin Holt
    Reviewed-by: WANG Cong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
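
The message quotes the symptom but not the fix. The arithmetic behind it can be checked with a short sketch: order 22 with 4 KiB pages gives 17179869184 bytes, and with 8-byte buckets that is 2^31 entries, one more than a signed 32-bit value can hold. The 4 KiB page size and 8-byte bucket size are assumptions consistent with the numbers printed, and the helper names are hypothetical.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Number of hash buckets for a table of 2^order pages. */
static uint64_t hash_entries(unsigned int order, unsigned int page_shift,
                             uint64_t bucket_bytes)
{
    return (UINT64_C(1) << (order + page_shift)) / bucket_bytes;
}

/* True if the value can be printed through a signed 32-bit field. */
static int fits_in_int32(uint64_t v)
{
    return v <= (uint64_t)INT32_MAX;
}

/* Format the count with a 64-bit unsigned conversion. */
static int format_entries(char *buf, size_t len, uint64_t entries)
{
    return snprintf(buf, len, "%" PRIu64, entries);
}
```

Printing the count through a 64-bit unsigned conversion, as `format_entries()` does, avoids the negative number in the boot message.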
     

07 Oct, 2010

14 commits

  • When we call the slab shrinker to free a page we need to stop at
    page count one because the caller always holds a single reference, not zero.

    This avoids useless looping over slab shrinkers and freeing too much
    memory.

    Reviewed-by: Wu Fengguang
    Signed-off-by: Andi Kleen

    Andi Kleen
     
  • The SIGBUS user space signalling is supposed to report the
    address granularity of a corruption. Pass this information correctly
    for huge pages by querying the hpage order.

    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Wu Fengguang
    Signed-off-by: Andi Kleen

    Andi Kleen
     
  • The original hwpoison code added a new siginfo field si_addr_lsb to
    pass the granularity of the fault address to user space. Unfortunately
    this field was never copied to user space. Fix this here.

    I added explicit checks for the MCEERR codes to avoid having
    to patch all potential callers to initialize the field.

    Signed-off-by: Andi Kleen

    Andi Kleen
     
  • The page-types utility still uses an out-of-date name for the
    unpoison interface: debugfs:hwpoison/renew-pfn
    This patch renames and fixes it.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Wu Fengguang
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     
  • 2.6.36 introduces an API for drivers to switch the IO scheduler
    instead of manually calling the elevator exit and init functions.
    This API was added since q->elevator must be cleared in between
    those two calls. And since we already have this functionality
    directly from use by the sysfs interface to switch schedulers
    online, it was prudent to reuse it internally too.

    But this API needs the queue to be in a fully initialized state
    before it is called, or it will attempt to unregister elevator
    kobjects before they have been added. This results in an oops
    like this:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000051
    IP: [] sysfs_create_dir+0x2e/0xc0
    PGD 47ddfc067 PUD 47c6a1067 PMD 0
    Oops: 0000 [#1] PREEMPT SMP
    last sysfs file: /sys/devices/pci0000:00/0000:00:02.0/0000:04:00.1/irq
    CPU 2
    Modules linked in: t(+) loop hid_apple usbhid ahci ehci_hcd uhci_hcd libahci usbcore nls_base igb

    Pid: 7319, comm: modprobe Not tainted 2.6.36-rc6+ #132 QSSC-S4R/QSSC-S4R
    RIP: 0010:[] [] sysfs_create_dir+0x2e/0xc0
    RSP: 0018:ffff88027da25d08 EFLAGS: 00010246
    RAX: ffff88047c68c528 RBX: 00000000fffffffe RCX: 0000000000000000
    RDX: 000000000000002f RSI: 000000000000002f RDI: ffff88047e196c88
    RBP: ffff88027da25d38 R08: 0000000000000000 R09: d84156c5635688c0
    R10: d84156c5635688c0 R11: 0000000000000000 R12: ffff88047e196c88
    R13: 0000000000000000 R14: 0000000000000000 R15: ffff88047c68c528
    FS: 00007fcb0b26f6e0(0000) GS:ffff880287400000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000000000051 CR3: 000000047e76e000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process modprobe (pid: 7319, threadinfo ffff88027da24000, task ffff88027d377090)
    Stack:
    ffff88027da25d58 ffff88047c68c528 00000000fffffffe ffff88047e196c88
    ffff88047c68c528 ffff88047e05bd90 ffff88027da25d78 ffffffff8123fb77
    ffff88047e05bd90 0000000000000000 ffff88047e196c88 ffff88047c68c528
    Call Trace:
    [] kobject_add_internal+0xe7/0x1f0
    [] kobject_add_varg+0x38/0x60
    [] kobject_add+0x69/0x90
    [] ? sysfs_remove_dir+0x20/0xa0
    [] ? sub_preempt_count+0x9d/0xe0
    [] ? _raw_spin_unlock+0x30/0x50
    [] ? sysfs_remove_dir+0x20/0xa0
    [] ? sysfs_remove_dir+0x34/0xa0
    [] elv_register_queue+0x34/0xa0
    [] elevator_change+0xfd/0x250
    [] ? t_init+0x0/0x361 [t]
    [] ? t_init+0x0/0x361 [t]
    [] t_init+0xa8/0x361 [t]
    [] do_one_initcall+0x3e/0x170
    [] sys_init_module+0xbd/0x220
    [] system_call_fastpath+0x16/0x1b
    Code: e5 41 56 41 55 41 54 49 89 fc 53 48 83 ec 10 48 85 ff 74 52 48 8b 47 18 49 c7 c5 00 46 61 81 48 85 c0 74 04 4c 8b 68 30 45 31 f6 80 7d 51 00 74 0e 49 8b 44 24 28 4c 89 e7 ff 50 20 49 89 c6
    RIP [] sysfs_create_dir+0x2e/0xc0
    RSP
    CR2: 0000000000000051
    ---[ end trace a6541d3bf07945df ]---

    Fix this by adding a registered bit to the elevator queue, which is
    set when the sysfs kobjects have been registered.

    Signed-off-by: Jens Axboe

    Jens Axboe
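
A sketch of the "registered bit" idea: the queue records whether its sysfs-like objects were added, and a switch only unregisters (and re-registers) when they were. The structure, the counter standing in for kobjects, and the rule that the new scheduler is registered only if the old one was are assumptions of the sketch.

```c
#include <stdbool.h>

struct elv_queue_sketch {
    bool registered;      /* set once the sysfs-like objects exist */
    int  objects_added;   /* stand-in for the kobjects */
};

static void sketch_register(struct elv_queue_sketch *e)
{
    if (e->registered)
        return;
    e->objects_added++;
    e->registered = true;
}

static void sketch_unregister(struct elv_queue_sketch *e)
{
    if (!e->registered)
        return;           /* nothing was added, so nothing to remove */
    e->objects_added--;
    e->registered = false;
}

/* Switch schedulers; safe to call before the queue is registered. */
static void sketch_switch(struct elv_queue_sketch *old_e,
                          struct elv_queue_sketch *new_e)
{
    if (old_e->registered) {
        sketch_unregister(old_e);
        sketch_register(new_e);
    }
}
```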
     
  • since the handle references are all tied to a file_priv, and when it disappears
    all the handle refs go with it.

    The fbcon ones we'd only notice on unload, but the nouveau notifier one
    would happen on reboot.

    nouveau: Reported-by: Marc Dionne
    nouveau: Tested-by: Marc Dionne
    i915 unload: Reported-by: Keith Packard
    Acked-by: Ben Skeggs
    Signed-off-by: Dave Airlie

    Dave Airlie
     
  • When marking an inode reclaimable, a per-AG counter is increased, the
    inode is tagged reclaimable in its per-AG tree, and, when this is the
    first reclaimable inode in the AG, the AG entry in the per-mount tree
    is also tagged.

    When an inode is finally reclaimed, however, it is only deleted from
    the per-AG tree. The counter is not decreased, nor is the parent
    tree's AG entry untagged properly.

    Since the tags in the per-mount tree are not cleared, the inode
    shrinker iterates over all AGs that have had reclaimable inodes at one
    point in time.

    The counters on the other hand signal an increasing amount of slab
    objects to reclaim. Since "70e60ce xfs: convert inode shrinker to
    per-filesystem context" this is not a real issue anymore because the
    shrinker bails out after one iteration.

    But the problem was observable on a machine running v2.6.34, where the
    reclaimable work increased and each process going into direct reclaim
    eventually got stuck on the xfs inode shrinking path, trying to scan
    several million objects.

    Fix this by properly unwinding the reclaimable-state tracking of an
    inode when it is reclaimed.

    Signed-off-by: Johannes Weiner
    Cc: stable@kernel.org
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Johannes Weiner
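
The unwinding described above amounts to keeping the counter and the parent-tree tag symmetric. A minimal sketch, with a hypothetical `ag_sketch` structure in place of the per-AG and per-mount radix trees:

```c
#include <stdbool.h>

struct ag_sketch {
    unsigned int reclaimable;     /* per-AG count of reclaimable inodes */
    bool tagged_in_mount_tree;    /* AG entry tagged in the per-mount tree */
};

/* Marking: bump the counter; tag the AG on the first reclaimable inode. */
static void ag_mark_reclaimable(struct ag_sketch *ag)
{
    if (ag->reclaimable++ == 0)
        ag->tagged_in_mount_tree = true;
}

/* Reclaim: drop the counter; untag the AG when the last one is gone. */
static void ag_inode_reclaimed(struct ag_sketch *ag)
{
    if (ag->reclaimable == 0)
        return;                   /* defensive: nothing to unwind */
    if (--ag->reclaimable == 0)
        ag->tagged_in_mount_tree = false;
}
```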
     
  • Function read_sb_page may return ERR_PTR(...). Check for it.

    Signed-off-by: Vasiliy Kulikov
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: NeilBrown

    Vasiliy Kulikov
     
  • When performing a resync we pre-allocate some bios and repeatedly use
    them. This requires us to re-initialise them each time.
    One field (bi_comp_cpu) and some flags weren't being initialised
    reliably.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • bitmap_start_sync returns - via a pass-by-reference variable - the
    number of sectors before we need to check with the bitmap again.
    Since commit ef4256733506f245 this number can be substantially larger,
    2^27 is a common value.

    Unfortunately it is an 'int' and so when raid1.c:sync_request shifts
    it 9 places to the left it becomes 0. This results in a zero-length
    read which the scsi layer justifiably complains about.

    This patch just removes the shift so the common case becomes safe with
    a trivially-correct patch.

    In the next merge window we will convert this 'int' to a 'sector_t'.

    Reported-by: "George Spelvin"
    Signed-off-by: NeilBrown

    NeilBrown
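
The truncation can be reproduced in isolation. The sketch uses unsigned 32-bit arithmetic so the wraparound is well defined in C (overflowing a signed 'int' shift is undefined behaviour), and contrasts it with a 64-bit type. The helper names are hypothetical, and this only illustrates the overflow, not the patch, which removes the shift.

```c
#include <stdint.h>

/* 32-bit: 2^27 sectors << 9 needs bit 36, so the result wraps to 0. */
static uint32_t sectors_to_bytes32(uint32_t sectors)
{
    return sectors << 9;
}

/* 64-bit: the same shift keeps the full value. */
static uint64_t sectors_to_bytes64(uint64_t sectors)
{
    return sectors << 9;
}
```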
     
  • Linus Torvalds
     
  • * 'upstream' of git://git.linux-mips.org/pub/scm/upstream-linus:
    MIPS: Octeon: Place cnmips_cu2_setup in __init memory.
    MIPS: Don't place cu2 notifiers in __cpuinitdata
    MIPS: Calculate VMLINUZ_LOAD_ADDRESS based on the length of vmlinux.bin
    MIPS: Alchemy: Resolve prom section mismatches
    MIPS: Fix syscall 64 bit number comments.
    MIPS: Hookup fanotify_init, fanotify_mark, and prlimit64 syscalls.
    MIPS: TX49xx: Rename ARCH_KMALLOC_MINALIGN to ARCH_DMA_MINALIGN
    MIPS: N32: Fix getdents64 syscall for n32
    MIPS: Remove pr_ uses of KERN_
    MIPS: PNX8550: Sort out machine halt, restart and powerdown functions.
    MIPS: GIC: Remove dependencies from Malta files.
    MIPS: Kconfig: Fix and clarify kconfig help text for VSMP and SMTC.
    MIPS: DMA: Fix computation of DMA flags from device's coherent_dma_mask.
    MIPS: Audit: Fix hang in entry.S.
    MIPS: Document why RELOC_HIDE is there.
    MIPS: Octeon: Determine if helper needs to be built
    MIPS: Use generic atomic64 for 32-bit kernels
    MIPS: RM7000: Symbol should be static
    MIPS: kspd: Adjust confusing if indentation
    MIPS: Fix a typo.

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block:
    writeback: always use sb->s_bdi for writeback purposes

    Linus Torvalds
     
  • * 'v2.6.36-rc6-urgent-fixes' of git://xenbits.xen.org/people/sstabellini/linux-pvhvm:
    xen: do not initialize PV timers on HVM if !xen_have_vector_callback
    xen: do not set xenstored_ready before xenbus_probe on hvm

    Linus Torvalds