10 Apr, 2010

3 commits

  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block: (34 commits)
    cfq-iosched: Fix the incorrect timeslice accounting with forced_dispatch
    loop: Update mtime when writing using aops
    block: expose the statistics in blkio.time and blkio.sectors for the root cgroup
    backing-dev: Handle class_create() failure
    Block: Fix block/elevator.c elevator_get() off-by-one error
    drbd: lc_element_by_index() never returns NULL
    cciss: unlock on error path
    cfq-iosched: Do not merge queues of BE and IDLE classes
    cfq-iosched: Add additional blktrace log messages in CFQ for easier debugging
    i2o: Remove the dangerous kobj_to_i2o_device macro
    block: remove 16 bytes of padding from struct request on 64bits
    cfq-iosched: fix a kbuild regression
    block: make CONFIG_BLK_CGROUP visible
    Remove GENHD_FL_DRIVERFS
    block: Export max number of segments and max segment size in sysfs
    block: Finalize conversion of block limits functions
    block: Fix overrun in lcm() and move it to lib
    vfs: improve writeback_inodes_wb()
    paride: fix off-by-one test
    drbd: fix al-to-on-disk-bitmap for 4k logical_block_size
    ...

    Linus Torvalds
     
  • As suggested by Linus, fix up kmem_ptr_validate() to handle non-kernel pointers
    more gracefully. The patch changes kmem_ptr_validate() to use the newly
    introduced kern_ptr_validate() helper to check that a pointer is a valid kernel
    pointer before we attempt to convert it into a 'struct page'.

    Cc: Andrew Morton
    Cc: Ingo Molnar
    Cc: Matt Mackall
    Cc: Nick Piggin
    Signed-off-by: Pekka Enberg
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     
  • As suggested by Linus, introduce a kern_ptr_validate() helper that does some
    sanity checks to make sure a pointer is a valid kernel pointer. This is a
    preparatory step for fixing SLUB kmem_ptr_validate().
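
    A minimal user-space sketch of this kind of sanity check, assuming a
    single contiguous range of valid addresses. The names demo_addr_range and
    demo_ptr_looks_valid are hypothetical and are not the kernel's
    kern_ptr_validate(); the checks the kernel actually performs may differ.

    ```c
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical stand-in for the range of addresses considered valid. */
    struct demo_addr_range {
    	uintptr_t start;	/* inclusive lower bound */
    	uintptr_t end;		/* exclusive upper bound */
    };

    /* True if [addr, addr + size) lies inside r and addr is pointer-aligned. */
    static bool demo_ptr_looks_valid(const struct demo_addr_range *r,
    				 uintptr_t addr, size_t size)
    {
    	if (size == 0 || addr < r->start || addr >= r->end)
    		return false;
    	if (size > r->end - addr)		/* object runs past the end */
    		return false;
    	if (addr & (sizeof(void *) - 1))	/* misaligned pointer */
    		return false;
    	return true;
    }
    ```

    Only a pointer that passes every check would then be safe to convert
    into a page reference.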

    Cc: Andrew Morton
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Ingo Molnar
    Cc: Matt Mackall
    Cc: Nick Piggin
    Signed-off-by: Pekka Enberg
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     

08 Apr, 2010

1 commit

  • …git/x86/linux-2.6-tip

    * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-tip:
    x86: Fix double enable_IR_x2apic() call on SMP kernel on !SMP boards
    x86: Increase CONFIG_NODES_SHIFT max to 10
    ibft, x86: Change reserve_ibft_region() to find_ibft_region()
    x86, hpet: Fix bug in RTC emulation
    x86, hpet: Erratum workaround for read after write of HPET comparator
    bootmem, x86: Fix 32bit numa system without RAM on node 0
    nobootmem, x86: Fix 32bit numa system without RAM on node 0
    x86: Handle overlapping mptables
    x86: Make e820_remove_range to handle all covered case
    x86-32, resume: do a global tlb flush in S4 resume

    Linus Torvalds
     

07 Apr, 2010

5 commits

  • Presently, memcg's FILE_MAPPED accounting has the following race with
    move_account (which happens at rmdir()):

    increment page->mapcount (rmap.c)
    mem_cgroup_update_file_mapped()         move_account()
                                            lock_page_cgroup()
                                            check page_mapped()
                                            if page_mapped(page)>1 {
                                                FILE_MAPPED -1 from old memcg
                                                FILE_MAPPED +1 to new memcg
                                            }
                                            .....
                                            overwrite pc->mem_cgroup
                                            unlock_page_cgroup()
    lock_page_cgroup()
    FILE_MAPPED +1 to pc->mem_cgroup
    unlock_page_cgroup()

    Then:
    old memcg (-1 file mapped)
    new memcg (+2 file mapped)

    This happens because move_account sees page_mapped(), which is not guarded
    by lock_page_cgroup(). This patch adds a FILE_MAPPED flag to page_cgroup
    and moves the accounting information based on it. Now all checks are
    synchronized with lock_page_cgroup().

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Balbir Singh
    Reviewed-by: Daisuke Nishimura
    Cc: Andrea Righi
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • When we look into pagemap using page-types with option -p, the value of
    pfn for hugepages looks wrong (see below.) This is because pte was
    evaluated only once for one vma although it should be updated for each
    hugepage. This patch fixes it.

    $ page-types -p 3277 -Nl -b huge
    voffset   offset  len  flags
    7f21e8a00 11e400  1    ___U___________H_G________________
    7f21e8a01 11e401  1ff  ________________TG________________
                 ^^^
    7f21e8c00 11e400  1    ___U___________H_G________________
    7f21e8c01 11e401  1ff  ________________TG________________
                 ^^^

    One hugepage contains 1 head page and 511 tail pages on x86_64, and each
    pair of lines represents one hugepage. Voffset and offset are the virtual
    and physical addresses in page units, respectively. Different hugepages
    should not have the same offset value.

    With this patch applied:

    $ page-types -p 3386 -Nl -b huge
    voffset   offset  len  flags
    7fec7a600 112c00  1    ___UD__________H_G________________
    7fec7a601 112c01  1ff  ________________TG________________
                 ^^^
    7fec7a800 113200  1    ___UD__________H_G________________
    7fec7a801 113201  1ff  ________________TG________________
                 ^^^
                 OK

    More info:

    - This patch modifies walk_page_range()'s hugepage walker. But the
    change only affects pagemap_read(), which is the only caller of hugepage
    callback.

    - Without this patch, the hugetlb_entry() callback is called once per vma,
    which doesn't match the natural expectation from its name.

    - With this patch, hugetlb_entry() is called per hugepte entry and the
    callback can become much simpler.
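
    The per-entry behaviour can be sketched in user-space C as follows. This
    is an illustrative model only: demo_walk_huge_range, huge_entry_fn, and
    the 2 MiB DEMO_HPAGE_SIZE are assumptions made for the example and are
    not the kernel's walk_page_range() code.

    ```c
    #include <stddef.h>
    #include <stdint.h>

    #define DEMO_HPAGE_SIZE (2UL * 1024 * 1024)	/* assumed 2 MiB huge page */
    #define DEMO_HPAGE_MASK (~(DEMO_HPAGE_SIZE - 1))

    typedef int (*huge_entry_fn)(uintptr_t addr, uintptr_t next, void *priv);

    /* Invoke fn once per huge page overlapping [start, end); stop on nonzero. */
    static int demo_walk_huge_range(uintptr_t start, uintptr_t end,
    				huge_entry_fn fn, void *priv)
    {
    	uintptr_t addr = start;

    	while (addr < end) {
    		/* End of the huge page containing addr, clamped to end. */
    		uintptr_t next = (addr & DEMO_HPAGE_MASK) + DEMO_HPAGE_SIZE;
    		int err;

    		if (next > end)
    			next = end;
    		err = fn(addr, next, priv);
    		if (err)
    			return err;
    		addr = next;
    	}
    	return 0;
    }

    /* Example callback: count how many times it is invoked. */
    static int demo_count_entry(uintptr_t addr, uintptr_t next, void *priv)
    {
    	(void)addr;
    	(void)next;
    	++*(int *)priv;
    	return 0;
    }
    ```

    Because the callback sees exactly one huge page per call, it can look up
    that page's entry itself instead of looping over the whole vma.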

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Shaohua Li reported that his tmpfs streaming I/O test can lead to OOM.
    The test uses a 6G tmpfs on a system with 3G of memory. The tmpfs holds
    6 copies of the kernel source, and the test runs kbuild on each copy. His
    investigation shows the test has a lot of rotated anon pages and very few
    file pages, so get_scan_ratio calculates percent[0] (i.e. the scanning
    percentage for anon) to be zero. Actually percent[0] should be a big
    value, but our calculation rounds it to zero.
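
    How a percentage can collapse to zero is easy to see in a simplified
    model that assumes truncating integer division. The function below is a
    hypothetical illustration, not the kernel's get_scan_ratio(); ap and fp
    stand for the weighted anon and file pressure values.

    ```c
    /*
     * Simplified model of an integer percentage split between anon and
     * file scanning. Any anon share below 1% truncates to 0.
     */
    static void demo_scan_percent(unsigned long ap, unsigned long fp,
    			      unsigned long percent[2])
    {
    	percent[0] = 100 * ap / (ap + fp + 1);	/* anon share, truncated */
    	percent[1] = 100 - percent[0];		/* file share gets the rest */
    }
    ```

    With ap = 5 and fp = 1000 the anon share is 500 / 1006, which truncates
    to 0, so anon pages would not be scanned at all in this model.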

    We had the same problem before commit 84b18490 ("vmscan: get_scan_ratio()
    cleanup") too, but the old logic could rescue the percent[0]==0 case,
    though only when priority==0. That hid the real issue. I didn't think
    that merely streaming I/O could produce the percent[0]==0 && priority==0
    situation, but I was wrong.

    So we definitely have to fix this tmpfs streaming I/O issue, but first I
    am reverting the regression commit.

    This reverts commit 84b18490d1f1bc7ed5095c929f78bc002eb70f26.

    Signed-off-by: KOSAKI Motohiro
    Reported-by: Shaohua Li
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • btrfs relocate_file_extent_cluster() calls us with NULL filp:

    [ 4005.426805] BUG: unable to handle kernel NULL pointer dereference at 00000021
    [ 4005.426818] IP: [] page_cache_sync_readahead+0x18/0x3e

    Signed-off-by: Wu Fengguang
    Cc: Yan Zheng
    Reported-by: Kirill A. Shutemov
    Tested-by: Kirill A. Shutemov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • - We weren't zeroing p->rss_stat[] at fork()

    - Consequently sync_mm_rss() was dereferencing tsk->mm for kernel
    threads and was oopsing.

    - Make __sync_task_rss_stat() static, too.

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=15648

    [akpm@linux-foundation.org: remove the BUG_ON(!mm->rss)]
    Reported-by: Troels Liebe Bentsen
    Signed-off-by: KAMEZAWA Hiroyuki
    "Michael S. Tsirkin"
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

06 Apr, 2010

3 commits

  • * 'slabh' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc:
    eeepc-wmi: include slab.h
    staging/otus: include slab.h from usbdrv.h
    percpu: don't implicitly include slab.h from percpu.h
    kmemcheck: Fix build errors due to missing slab.h
    include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
    iwlwifi: don't include iwl-dev.h from iwl-devtrace.h
    x86: don't include slab.h from arch/x86/include/asm/pgtable_32.h

    Fix up trivial conflicts in include/linux/percpu.h due to
    is_kernel_percpu_address() having been introduced since the slab.h
    cleanup with the percpu_up.c splitup.

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
    module: add stub for is_module_percpu_address
    percpu, module: implement and use is_kernel/module_percpu_address()
    module: encapsulate percpu handling better and record percpu_size

    Linus Torvalds
     
  • Fix a memory leak in anon_vma_fork(), where we fail to tear down the
    anon_vmas attached to the new VMA in case setting up the new anon_vma
    fails.

    This bug also has the potential to leave behind anon_vma_chain structs
    with pointers to invalid memory.

    Reported-by: Minchan Kim
    Signed-off-by: Rik van Riel
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

02 Apr, 2010

3 commits

  • I hit this when we had a bug in IDR for a few days. Basically sysfs would
    fail to create new inodes since it uses an IDR and therefore class_create would
    fail.

    While we are unlikely to see this fail we may as well handle it instead of
    oopsing.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Jens Axboe

    Anton Blanchard
     
  • When 32bit numa is used, free_all_bootmem() will still only go over with
    node id 0.

    If node 0 doesn't have RAM installed, the lowest populated node
    becomes low RAM.

    This one fixes BOOTMEM path by iterating over the bdata_list.
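
    The idea of walking every node instead of only node 0 can be sketched as
    below. This is a user-space illustration under assumed types: struct
    demo_node and demo_free_all are hypothetical and do not mirror the
    kernel's bootmem data structures.

    ```c
    #include <stddef.h>

    /* Hypothetical per-node descriptor; the kernel's bootmem data differs. */
    struct demo_node {
    	unsigned long free_pages;	/* pages this node can hand over */
    	struct demo_node *next;		/* next node on the list */
    };

    /* Release pages from every node on the list, not only the first one. */
    static unsigned long demo_free_all(struct demo_node *list)
    {
    	unsigned long total = 0;

    	for (struct demo_node *n = list; n; n = n->next) {
    		total += n->free_pages;
    		n->free_pages = 0;
    	}
    	return total;
    }
    ```

    If only the first node were visited and it had no RAM, the total would be
    zero even though a later node holds all the memory.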

    -v3: add more comments, and fix bootmem path too.
    -v4: separate from one big patch

    Signed-off-by: Yinghai Lu
    LKML-Reference:
    Signed-off-by: H. Peter Anvin

    Yinghai Lu
     
  • On a system without RAM on node 0, I got the following boot dump with a
    32-bit NUMA kernel:

    early_node_map[4] active PFN ranges
    1: 0x00000010 -> 0x00000099
    1: 0x00000100 -> 0x0007da00
    1: 0x0007e800 -> 0x0007ffa0
    1: 0x0007ffae -> 0x0007ffb0
    ...
    Subtract (29 early reservations)
    #000 [0000001000 - 0000002000]
    #001 [0000089000 - 000008f000]
    #002 [0000091000 - 0000093500]
    ...
    #027 [007cbfef40 - 007e800000]
    #028 [007e9ca000 - 007ff95000]
    (0 free memory ranges)
    Initializing HighMem for node 0 (00000000:00000000)
    Initializing HighMem for node 1 (00000000:00000000)
    Memory: 0k/2096832k available (6662k kernel code, 2096300k reserved, 4829k data, 484k init, 0k highmem)
    ...
    Checking if this processor honours the WP bit even in supervisor mode...Ok.
    swapper: page allocation failure. order:0, mode:0x0
    Pid: 0, comm: swapper Not tainted 2.6.34-rc3-tip-03818-g4b1ea6c-dirty #35
    Call Trace:
    [] ? printk+0xf/0x11
    [] __alloc_pages_nodemask+0x417/0x487
    [] new_slab+0xe2/0x1fe
    [] kmem_cache_open+0x185/0x358
    [] T.954+0x1c/0x60
    [] kmem_cache_init+0x24/0x113
    [] start_kernel+0x166/0x2e4
    [] ? unknown_bootoption+0x0/0x18e
    [] i386_start_kernel+0xce/0xd5
    Mem-Info:
    Node 1 DMA per-cpu:
    CPU 0: hi: 0, btch: 1 usd: 0
    Node 1 Normal per-cpu:
    CPU 0: hi: 0, btch: 1 usd: 0
    active_anon:0 inactive_anon:0 isolated_anon:0
    active_file:0 inactive_file:0 isolated_file:0
    unevictable:0 dirty:0 writeback:0 unstable:0
    free:0 slab_reclaimable:0 slab_unreclaimable:0
    mapped:0 shmem:0 pagetables:0 bounce:0

    When 32bit NUMA is used, free_all_bootmem() will still only go over with
    node id 0.

    If node 0 doesn't have RAM installed, we need to go with node 1
    because early_node_map still uses 1 for all ranges, and the RAM from
    node 1 becomes low RAM.

    Use MAX_NUMNODES like 64-bit NUMA does.

    Note: the BOOTMEM path has the same problem.
    This bug existed before we had NO_BOOTMEM support.

    -v3: add more comments, and fix bootmem path too.
    -v4: separate bootmem path fix

    Signed-off-by: Yinghai Lu
    LKML-Reference:
    Signed-off-by: H. Peter Anvin

    Yinghai Lu
     

30 Mar, 2010

3 commits

  • percpu.h has always been including slab.h to get k[mz]alloc/free() for
    UP inline implementation. percpu.h being used by very low level
    headers including module.h and sched.h, this meant that a lot of files
    unintentionally got slab.h inclusion.

    Lee Schermerhorn was trying to make topology.h use percpu.h and got
    bitten by this implicit inclusion. The right thing to do is break
    this ultimately unnecessary dependency. The previous patch added
    explicit inclusion of either gfp.h or slab.h to the source files using
    them. This patch updates percpu.h such that slab.h is no longer
    included from percpu.h.

    Signed-off-by: Tejun Heo
    Reviewed-by: Christoph Lameter
    Cc: Ingo Molnar
    Cc: Lee Schermerhorn

    Tejun Heo
     
  • mm/kmemcheck.c:69: error: dereferencing pointer to incomplete type
    mm/kmemcheck.c:69: error: 'SLAB_NOTRACK' undeclared (first use in this function)
    mm/kmemcheck.c:82: error: dereferencing pointer to incomplete type
    mm/kmemcheck.c:94: error: dereferencing pointer to incomplete type
    mm/kmemcheck.c:94: error: dereferencing pointer to incomplete type
    mm/kmemcheck.c:94: error: 'SLAB_DESTROY_BY_RCU' undeclared (first use in this function)

    Signed-off-by: Randy Dunlap
    Signed-off-by: Tejun Heo

    Randy Dunlap
     
  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    The percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include those
    headers directly instead of assuming availability. As this conversion
    needs to touch a large number of source files, the following script is
    used as the basis of the conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to place the new include so that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

29 Mar, 2010

1 commit

  • lockdep has custom code to check whether a pointer belongs to static
    percpu area which is somewhat broken. Implement proper
    is_kernel/module_percpu_address() and replace the custom code.

    On UP, percpu variables are regular static variables and can't be
    distinguished from them. Always return %false on UP.
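
    A rough user-space sketch of such an address check on SMP, assuming each
    CPU has its own copy of the static percpu area at a known base address.
    The name demo_is_percpu_address and the layout parameters (unit_base,
    nr_cpus, static_size) are hypothetical and are not the kernel's
    is_kernel_percpu_address() implementation.

    ```c
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* True if addr falls inside any CPU's copy of the static percpu area. */
    static bool demo_is_percpu_address(uintptr_t addr,
    				   const uintptr_t *unit_base,
    				   size_t nr_cpus, size_t static_size)
    {
    	for (size_t cpu = 0; cpu < nr_cpus; cpu++) {
    		if (addr >= unit_base[cpu] &&
    		    addr - unit_base[cpu] < static_size)
    			return true;
    	}
    	return false;
    }
    ```

    On UP there is no separate per-CPU copy to compare against, which is why
    a check of this kind has nothing to match and returns false.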

    Signed-off-by: Tejun Heo
    Acked-by: Peter Zijlstra
    Cc: Rusty Russell
    Cc: Ingo Molnar

    Tejun Heo
     

27 Mar, 2010

1 commit


26 Mar, 2010

2 commits

  • Fix __get_user_pages() to make it pin the last page on a buffer that doesn't
    begin at the start of a page, but is a multiple of PAGE_SIZE in size.

    The problem is that __get_user_pages() advances the pointer too much when it
    iterates to the next page if the page it's currently looking at isn't used from
    the first byte. This can cause the end of a short VMA to be reached
    prematurely, resulting in the last page being lost.
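
    The off-by-one can be reproduced with a small user-space model. The
    functions and the 4 KiB DEMO_PAGE_SIZE below are assumptions for
    illustration and are not the kernel's __get_user_pages() code; they only
    contrast stepping to the next page boundary with stepping by a full page.

    ```c
    #include <stdint.h>

    #define DEMO_PAGE_SIZE 4096UL
    #define DEMO_PAGE_MASK (~(DEMO_PAGE_SIZE - 1))

    /* Count pages touched by [start, start + len), stepping to boundaries. */
    static unsigned long demo_count_pages(uintptr_t start, unsigned long len)
    {
    	uintptr_t end = start + len;
    	unsigned long pages = 0;

    	while (start < end) {
    		pages++;
    		/* Advance to the start of the next page. */
    		start = (start + DEMO_PAGE_SIZE) & DEMO_PAGE_MASK;
    	}
    	return pages;
    }

    /* Buggy variant for comparison: always advances by a whole page. */
    static unsigned long demo_count_pages_buggy(uintptr_t start,
    					    unsigned long len)
    {
    	uintptr_t end = start + len;
    	unsigned long pages = 0;

    	while (start < end) {
    		pages++;
    		start += DEMO_PAGE_SIZE;	/* overshoots if start is unaligned */
    	}
    	return pages;
    }
    ```

    For a buffer starting at 0x1800 with a length of one page, the first
    function visits two pages while the buggy variant visits only one, which
    corresponds to the lost last page.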

    Signed-off-by: Steven J. Magnani
    Signed-off-by: David Howells
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Revert the following patch:

    commit c08c6e1f54c85fc299cf9f88cf330d6dd28a9a1d
    Author: Steven J. Magnani
    Date: Fri Mar 5 13:42:24 2010 -0800

    nommu: get_user_pages(): pin last page on non-page-aligned start

    As it assumes that the mappings begin at the start of pages - something that
    isn't necessarily true on NOMMU systems. On NOMMU systems, it is possible for
    a mapping to only occupy part of the page, and not necessarily touch either end
    of it; in fact it's also possible for multiple non-overlapping mappings to
    coexist on one page (consider direct mappings of ROMFS files, for example).

    Signed-off-by: David Howells
    Acked-by: Steven J. Magnani
    Signed-off-by: Linus Torvalds

    David Howells
     

25 Mar, 2010

11 commits

  • Discovered while testing other mempolicy changes:

    get_mempolicy() does not handle static/relative mode flags correctly.
    Return the value that the user specified so that it can be restored
    via set_mempolicy() if desired.

    Signed-off-by: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • In 2.6.34-rc1, removing vhost_net module causes an oops in sync_mm_rss
    (called from do_exit) when workqueue is destroyed. This does not happen
    on net-next, or with vhost on top of 2.6.33.

    The issue seems to be introduced by
    34e55232e59f7b19050267a05ff1226e5cd122a5 ("mm: avoid false sharing of
    mm_counter") which added sync_mm_rss() that is passed task->mm, and
    dereferences it without checking. If task is a kernel thread, mm might be
    NULL. I think this might also happen e.g. with aio.

    This patch fixes the oops by calling sync_mm_rss when task->mm is set to
    NULL. I also added BUG_ON to detect any other cases where counters get
    incremented while mm is NULL.

    The oops I observed looks like this:

    BUG: unable to handle kernel NULL pointer dereference at 00000000000002a8
    IP: [] sync_mm_rss+0x33/0x6f
    PGD 0
    Oops: 0002 [#1] SMP
    last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map
    CPU 2
    Modules linked in: vhost_net(-) tun bridge stp sunrpc ipv6 cpufreq_ondemand acpi_cpufreq freq_table kvm_intel kvm i5000_edac edac_core rtc_cmos bnx2 button i2c_i801 i2c_core rtc_core e1000e sg joydev ide_cd_mod serio_raw pcspkr rtc_lib cdrom virtio_net virtio_blk virtio_pci virtio_ring virtio af_packet e1000 shpchp aacraid uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]

    Pid: 2046, comm: vhost Not tainted 2.6.34-rc1-vhost #25 System Planar/IBM System x3550 -[7978B3G]-
    RIP: 0010:[] [] sync_mm_rss+0x33/0x6f
    RSP: 0018:ffff8802379b7e60 EFLAGS: 00010202
    RAX: 0000000000000008 RBX: ffff88023f2390c0 RCX: 0000000000000000
    RDX: ffff88023f2396b0 RSI: 0000000000000000 RDI: ffff88023f2390c0
    RBP: ffff8802379b7e60 R08: 0000000000000000 R09: 0000000000000000
    R10: ffff88023aecfbc0 R11: 0000000000013240 R12: 0000000000000000
    R13: ffffffff81051a6c R14: ffffe8ffffc0f540 R15: 0000000000000000
    FS: 0000000000000000(0000) GS:ffff880001e80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00000000000002a8 CR3: 000000023af23000 CR4: 00000000000406e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process vhost (pid: 2046, threadinfo ffff8802379b6000, task ffff88023f2390c0)
    Stack:
    ffff8802379b7ee0 ffffffff81040687 ffffe8ffffc0f558 ffffffffa00a3e2d
    0000000000000000 ffff88023f2390c0 ffffffff81055817 ffff8802379b7e98
    ffff8802379b7e98 0000000100000286 ffff8802379b7ee0 ffff88023ad47d78
    Call Trace:
    [] do_exit+0x147/0x6c4
    [] ? handle_rx_net+0x0/0x17 [vhost_net]
    [] ? autoremove_wake_function+0x0/0x39
    [] ? worker_thread+0x0/0x229
    [] kthreadd+0x0/0xf2
    [] kernel_thread_helper+0x4/0x10
    [] ? kthread+0x0/0x87
    [] ? kernel_thread_helper+0x0/0x10
    Code: 00 8b 87 6c 02 00 00 85 c0 74 14 48 98 f0 48 01 86 a0 02 00 00 c7 87 6c 02 00 00 00 00 00 00 8b 87 70 02 00 00 85 c0 74 14 48 98 48 01 86 a8 02 00 00 c7 87 70 02 00 00 00 00 00 00 8b 87 74
    RIP [] sync_mm_rss+0x33/0x6f
    RSP
    CR2: 00000000000002a8
    ---[ end trace 41603ba922beddd2 ]---
    Fixing recursive fault but reboot is needed!

    (note: handle_rx_net is a work item using workqueue in question).
    sync_mm_rss+0x33/0x6f gave me a hint. I also tried reverting
    34e55232e59f7b19050267a05ff1226e5cd122a5 and the oops goes away.

    The module in question calls use_mm and later unuse_mm from a kernel
    thread. It is when this kernel thread is destroyed that the crash
    happens.

    Signed-off-by: Michael S. Tsirkin
    Andrea Arcangeli
    Reviewed-by: Rik van Riel
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael S. Tsirkin
     
  • mpol_parse_str() has produced many bugs related to its 'err' variable,
    because the code is ugly and unfriendly to review.

    This patch simplifies it.

    Signed-off-by: KOSAKI Motohiro
    Cc: Ravikiran Thirumalai
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Acked-by: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Commit 71fe804b6d5 (mempolicy: use struct mempolicy pointer in
    shmem_sb_info) added the mpol=local mount option, but the feature has
    been broken since it was introduced, because the code always returns 1
    (i.e. mount failure).

    This patch fixes it.

    Signed-off-by: KOSAKI Motohiro
    Cc: Ravikiran Thirumalai
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Acked-by: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Currently, the following mount operation causes a mount error:

    % mount -t tmpfs -ompol=bind:0 none /tmp

    This is because commit 71fe804b6d5 (mempolicy: use struct mempolicy
    pointer in shmem_sb_info) corrupted the MPOL_BIND parsing code.

    This patch restores the needed code.

    Signed-off-by: KOSAKI Motohiro
    Cc: Ravikiran Thirumalai
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Acked-by: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Fix an 'oops' when a tmpfs mount point is mounted with the mpol=default
    mempolicy.

    Upon remounting a tmpfs mount point with 'mpol=default' option, the mount
    code crashed with a null pointer dereference. The initial problem report
    was on 2.6.27, but the problem exists in mainline 2.6.34-rc as well. On
    examining the code, we see that mpol_new returns NULL if default mempolicy
    was requested. This 'NULL' mempolicy is accessed to store the node mask
    resulting in oops.

    The following patch fixes it.

    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Acked-by: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ravikiran G Thirumalai
     
  • ksm.c's write_protect_page implements a lockless means of verifying a page
    does not have any users of the page which are not accounted for via other
    kernel tracking means. It does this by removing the writable pte with TLB
    flushes, checking the page_count against the total known users, and then
    using set_pte_at_notify to make it a read-only entry.

    An unneeded mmu_notifier callout is made in the case where the known users
    do not match the page_count. In that event, we are inserting the
    identical pte and there is no need for the set_pte_at_notify, but rather
    the simpler set_pte_at suffices.

    Signed-off-by: Robin Holt
    Acked-by: Izik Eidus
    Acked-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
     
  • Fix an incorrect comment in do_mmap_shared_file(). If a mapping is
    requested MAP_SHARED, then a private copy cannot be made and still provide
    correct semantics.

    Signed-off-by: David Howells
    Reported-by: Dave Hudson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • There was a potential null deref introduced in c62b1a3b31b5 ("memcg: use
    generic percpu instead of private implementation").

    Signed-off-by: Dan Carpenter
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Carpenter
     
  • In commit 02491447 ("memcg: move charges of anonymous swap"), I tried to
    disable the move charge feature in the no-MMU case by enclosing all the
    related functions in "#ifdef CONFIG_MMU", but the commit places these
    ifdefs in the wrong place. (It seems they were mangled while handling
    some fixes...)

    This patch fixes it up.

    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • Commit 08677214e318297 ("x86: Make 64 bit use early_res instead
    of bootmem before slab") introduced early_res replacement for
    bootmem, but left code in __free_pages_memory() which dumps all
    the ranges that are being freed, without any additional
    information, causing some noise in dmesg during bootup.

    Just remove printing of the ranges, that doesn't provide
    anything useful anyway.

    While at it, remove other commented-out KERN_DEBUG messages in
    the NO_BOOTMEM code as well.

    Signed-off-by: Jiri Kosina
    Found-OK-by: Andrew Morton
    Cc: Johannes Weiner
    Cc: Yinghai Lu
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Jiri Kosina
     

18 Mar, 2010

1 commit

  • swap_cgroup uses 2bytes data and uses cmpxchg in a new operation. 2byte
    cmpxchg/xchg is not available on some archs. This patch replaces
    cmpxchg/xchg with operations under lock.
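
    Doing a 2-byte compare-and-exchange under a lock can be sketched in
    portable C11 as follows. The names are hypothetical and the spinlock is
    built from atomic_flag purely for illustration; the lock the kernel patch
    actually uses is not shown in this log.

    ```c
    #include <stdatomic.h>

    static atomic_flag demo_lock = ATOMIC_FLAG_INIT;

    static void demo_lock_acquire(void)
    {
    	while (atomic_flag_test_and_set_explicit(&demo_lock,
    						 memory_order_acquire))
    		;	/* spin until the flag is released */
    }

    static void demo_lock_release(void)
    {
    	atomic_flag_clear_explicit(&demo_lock, memory_order_release);
    }

    /* Compare-and-exchange on a 16-bit value; returns the value seen. */
    static unsigned short demo_cmpxchg16(unsigned short *p,
    				     unsigned short old,
    				     unsigned short new_val)
    {
    	unsigned short seen;

    	demo_lock_acquire();
    	seen = *p;
    	if (seen == old)
    		*p = new_val;
    	demo_lock_release();
    	return seen;
    }

    /* Unconditional exchange on a 16-bit value; returns the old value. */
    static unsigned short demo_xchg16(unsigned short *p, unsigned short new_val)
    {
    	unsigned short seen;

    	demo_lock_acquire();
    	seen = *p;
    	*p = new_val;
    	demo_lock_release();
    	return seen;
    }
    ```

    The lock makes the read-compare-write sequence atomic with respect to
    other lock holders, so no hardware support for 2-byte cmpxchg is needed.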

    Signed-off-by: KAMEZAWA Hiroyuki
    Reported-by: Sachin Sant wrote:
    Acked-by: Balbir Singh
    Acked-by: Daisuke Nishimura
    Cc: Li Zefan
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

14 Mar, 2010

1 commit

  • …/git/tip/linux-2.6-tip

    * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    locking: Make sparse work with inline spinlocks and rwlocks
    x86/mce: Fix RCU lockdep splats
    rcu: Increase RCU CPU stall timeouts if PROVE_RCU
    ftrace: Replace read_barrier_depends() with rcu_dereference_raw()
    rcu: Suppress RCU lockdep warnings during early boot
    rcu, ftrace: Fix RCU lockdep splat in ftrace_perf_buf_prepare()
    rcu: Suppress __mpol_dup() false positive from RCU lockdep
    rcu: Make rcu_read_lock_sched_held() handle !PREEMPT
    rcu: Add control variables to lockdep_rcu_dereference() diagnostics
    rcu, cgroup: Relax the check in task_subsys_state() as early boot is now handled by lockdep-RCU
    rcu: Use wrapper function instead of exporting tasklist_lock
    sched, rcu: Fix rcu_dereference() for RCU-lockdep
    rcu: Make task_subsys_state() RCU-lockdep checks handle boot-time use
    rcu: Fix holdoff for accelerated GPs for last non-dynticked CPU
    x86/gart: Unexport gart_iommu_aperture

    Fix trivial conflicts in kernel/trace/ftrace.c

    Linus Torvalds
     

13 Mar, 2010

5 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (56 commits)
    doc: fix typo in comment explaining rb_tree usage
    Remove fs/ntfs/ChangeLog
    doc: fix console doc typo
    doc: cpuset: Update the cpuset flag file
    Fix of spelling in arch/sparc/kernel/leon_kernel.c no longer needed
    Remove drivers/parport/ChangeLog
    Remove drivers/char/ChangeLog
    doc: typo - Table 1-2 should refer to "status", not "statm"
    tree-wide: fix typos "ass?o[sc]iac?te" -> "associate" in comments
    No need to patch AMD-provided drivers/gpu/drm/radeon/atombios.h
    devres/irq: Fix devm_irq_match comment
    Remove reference to kthread_create_on_cpu
    tree-wide: Assorted spelling fixes
    tree-wide: fix 'lenght' typo in comments and code
    drm/kms: fix spelling in error message
    doc: capitalization and other minor fixes in pnp doc
    devres: typo fix s/dev/devm/
    Remove redundant trailing semicolons from macros
    fix typo "definetly" -> "definitely" in comment
    tree-wide: s/widht/width/g typo in comments
    ...

    Fix trivial conflict in Documentation/laptops/00-INDEX

    Linus Torvalds
     
  • In current page-fault code,

    handle_mm_fault()
    -> ...
    -> mem_cgroup_charge()
    -> map page or handle error.
    -> check return code.

    If page fault's return code is VM_FAULT_OOM, page_fault_out_of_memory() is
    called. But if it's caused by memcg, OOM should have been already
    invoked.

    Then, I added a patch: a636b327f731143ccc544b966cfd8de6cb6d72c6. That
    patch records last_oom_jiffies for memcg's sub-hierarchy and prevents
    page_fault_out_of_memory from being invoked in near future.

    But Nishimura-san reported that check by jiffies is not enough when the
    system is terribly heavy.

    This patch changes memcg's oom logic as follows:
    * If memcg causes OOM-kill, continue to retry.
    * remove jiffies check which is used now.
    * add memcg-oom-lock which works like perzone oom lock.
    * If current is killed(as a process), bypass charge.

    Something more sophisticated can be added, but this patch does the
    fundamental things.
    TODO:
    - add oom notifier
    - add permemcg disable-oom-kill flag and freezer at oom.
    - more chances for wake up oom waiter (when changing memory limit etc..)

    Reviewed-by: Daisuke Nishimura
    Tested-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: David Rientjes
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Events should be removed after rmdir of cgroup directory, but before
    destroying subsystem state objects. Let's take reference to cgroup
    directory dentry to do that.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Acked-by: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Cc: Dan Malek
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Presently, if panic_on_oom=2, the whole system panics even if the OOM
    happened in some special situation (such as cpuset, mempolicy, ...). So
    panic_on_oom=2 means panic_on_oom_always.

    Now, memcg doesn't check the panic_on_oom flag. This patch adds a check.

    BTW, how is it useful?

    kdump+panic_on_oom=2 is the last tool for investigating what happens in
    an OOM-ed system. When a task is killed, the system recovers and there
    will be few hints as to what happened. In a mission-critical system, OOM
    should never happen. So panic_on_oom=2+kdump is useful for avoiding the
    next OOM by obtaining precise information via a snapshot.

    TODO:
    - Since memcg is for isolating the system's memory usage, an oom-notifier
    and freeze_at_oom (or rest_at_oom) should be implemented. Then a
    management daemon can do similar jobs (as kdump) or take a snapshot per
    cgroup.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Nick Piggin
    Reviewed-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Memcg has 2 event counters which count "the same" event; only their
    usages differ. This patch tries to reduce them to one event counter.

    The logic now uses an "only increment, no reset" counter and a mask for
    each check. The softlimit check was done every 1000 events, so a similar
    check can be done by !(new_counter & 0x3ff). The threshold check was done
    every 100 events, so a similar check can be done by
    !(new_counter & 0x7f).
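
    The mask-based check can be illustrated with a few lines of C. The macro
    and function names here are hypothetical; only the mask values 0x3ff and
    0x7f come from the description above. Note that `!` binds tighter than
    `&` in C, so the parentheses around the masked value are required.

    ```c
    #include <stdbool.h>

    #define DEMO_SOFTLIMIT_MASK 0x3ffUL	/* fires every 1024 events */
    #define DEMO_THRESH_MASK    0x7fUL	/* fires every 128 events */

    /* True when the low bits selected by mask are all zero. */
    static bool demo_event_due(unsigned long counter, unsigned long mask)
    {
    	return !(counter & mask);
    }
    ```

    With a monotonically increasing counter, demo_event_due(counter,
    DEMO_SOFTLIMIT_MASK) is true once every 1024 increments and the threshold
    mask fires once every 128, approximating the old 1000 and 100 event
    intervals without any reset.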

    ALL event checks are done right after EVENT percpu counter is updated.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Kirill A. Shutemov
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki