03 Nov, 2011

1 commit

  • Michel, while working on the working set estimation code, noticed that
    calling get_page_unless_zero() on a random pfn_to_page(random_pfn)
    wasn't safe if the pfn ended up being a tail page of a transparent
    hugepage under splitting by __split_huge_page_refcount().

    He then found that the problem could also theoretically materialize with
    page_cache_get_speculative() during the speculative radix tree lookups
    that use get_page_unless_zero() on SMP, if the radix tree page is freed
    and reallocated and get_user_pages() is called on it before
    page_cache_get_speculative() has a chance to call get_page_unless_zero().

    So the best way to fix the problem is to keep page_tail->_count zero at
    all times. This will guarantee that get_page_unless_zero() can never
    succeed on any tail page. page_tail->_mapcount is guaranteed zero and
    is unused for all tail pages of a compound page, so we can simply
    account the tail page references there and transfer them to
    tail_page->_count in __split_huge_page_refcount() (in addition to the
    head_page->_mapcount).
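
    As a rough illustration of that invariant (a hedged sketch with made-up
    helper names, not the patch itself): tail page references are counted in
    _mapcount, so _count stays zero and get_page_unless_zero() can never
    succeed on a tail page.

        static inline void get_tail_page_ref(struct page *tail)
        {
                atomic_inc(&tail->_mapcount);   /* tail->_count remains 0 */
        }

        /* at split time the accumulated references move over to _count: */
        static void transfer_tail_refs_at_split(struct page *tail)
        {
                /* _mapcount uses -1 as its "unused" base value */
                int refs = atomic_read(&tail->_mapcount) + 1;

                atomic_add(refs, &tail->_count);
                atomic_set(&tail->_mapcount, -1);
        }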

    While debugging this s/_count/_mapcount/ change I also noticed that
    get_page() is called by direct-io.c on pages returned by get_user_pages().
    That wasn't entirely safe because the two atomic_inc() in get_page()
    weren't atomic as a pair. Other get_user_pages() users, like the
    secondary-MMU page fault handlers that establish shadow pagetables, never
    call a superfluous get_page() after get_user_pages() returns. It's safer
    to make get_page() universally safe for tail pages and to use
    get_page_foll() within follow_page() (inside get_user_pages()).
    get_page_foll() is safe to do the refcounting for tail pages without
    taking any locks because it runs within PT-lock-protected critical
    sections (the PT lock for pte and page_table_lock for pmd_trans_huge).

    The standard get_page(), as invoked by direct-io, will instead now take
    the compound_lock, but still only for tail pages. The direct-io paths
    are usually I/O bound and the compound_lock is per THP, so it is very
    fine-grained and there's no risk of scalability issues with it. A simple
    direct-io benchmark with all lockdep prove-locking and spinlock debugging
    infrastructure enabled shows identical performance and no overhead, so
    it's worth it. Ideally direct-io should stop calling get_page() on pages
    returned by get_user_pages() altogether. The spinlock in get_page() is
    already optimized away for no-THP builds, and in any case doing
    get_page() on tail pages returned by GUP is a rare operation and usually
    only run in I/O paths.

    This new refcounting on page_tail->_mapcount, in addition to avoiding new
    RCU critical sections, will also allow the working set estimation code to
    work without any further complexity associated with the tail page
    refcounting under THP.

    Signed-off-by: Andrea Arcangeli
    Reported-by: Michel Lespinasse
    Reviewed-by: Michel Lespinasse
    Reviewed-by: Minchan Kim
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Benjamin Herrenschmidt
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

01 Nov, 2011

3 commits

  • Add comments to explain the page statistics field in the mm_struct.

    [akpm@linux-foundation.org: add missing ;]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Some kernel components (infiniband and perf) pin user space memory by
    increasing the page count and account that memory as "mlocked".

    The difference between mlocking and pinning is:

    A. mlocked pages are marked with PG_mlocked and are exempt from
    swapping. Page migration may move them around though.
    They are kept on a special LRU list.

    B. Pinned pages cannot be moved because something needs to
    directly access physical memory. They may not be on any
    LRU list.

    I recently saw an mlockall()ed process where mm->locked_vm became
    bigger than the virtual size of the process (!) because some
    memory was accounted for twice:

    once when the page was mlocked and once when the Infiniband
    layer increased the refcount because it needed to pin the RDMA
    memory.

    This patch introduces a separate counter for pinned pages and
    accounts them separately.
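
    As a rough sketch of what the separate accounting looks like for a
    pinning driver (an illustrative helper, assuming the new field is
    called pinned_vm):

        /* Hedged sketch: a driver that pins user pages accounts them in
         * the dedicated counter instead of inflating locked_vm. */
        static void account_pinned(struct mm_struct *mm, unsigned long npages)
        {
                down_write(&mm->mmap_sem);
                mm->pinned_vm += npages;  /* previously: mm->locked_vm += npages */
                up_write(&mm->mmap_sem);
        }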

    Signed-off-by: Christoph Lameter
    Cc: Mike Marciniszyn
    Cc: Roland Dreier
    Cc: Sean Hefty
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This removes mm->oom_disable_count entirely since it's unnecessary and
    currently buggy. The counter was intended to be per-process but it's
    currently decremented in the exit path for each thread that exits, causing
    it to underflow.

    The count was originally intended to prevent oom killing threads that
    share memory with threads that cannot be killed since it doesn't lead to
    future memory freeing. The counter could be fixed to represent all
    threads sharing the same mm, but it's better to remove the count since:

    - it is possible that the OOM_DISABLE thread sharing memory with the
    victim is waiting on that thread to exit and will actually cause
    future memory freeing, and

    - there is no guarantee that a thread is disabled from oom killing just
    because another thread sharing its mm is oom disabled.

    Signed-off-by: David Rientjes
    Reported-by: Oleg Nesterov
    Reviewed-by: Oleg Nesterov
    Cc: Ying Han
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

20 Oct, 2011

1 commit

  • A few network drivers currently use skb_frag_struct to describe sub-page
    regions of memory, but I have patches which add additional fields and
    semantics there which these other users do not want.

    A structure for referencing sub-page regions seems like a generally
    useful thing, so add one instead of adding a network-subsystem-specific
    structure.
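
    A purely illustrative sketch of what such a generic descriptor could
    look like (field names here are assumptions, not the structure that was
    actually merged):

        struct sub_page_region {
                struct page *page;      /* containing page                */
                unsigned int offset;    /* byte offset within that page   */
                unsigned int size;      /* length of the region, in bytes */
        };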

    Signed-off-by: Ian Campbell
    Acked-by: Jens Axboe
    Acked-by: David Rientjes
    Signed-off-by: David S. Miller

    Ian Campbell
     

20 Aug, 2011

1 commit

  • Allow filling out the rest of the kmem_cache_cpu cacheline with pointers to
    partial pages. The partial page list is used in slab_free() to avoid
    per node lock taking.

    In __slab_alloc() we can then take multiple partial pages off the per
    node partial list in one go reducing node lock pressure.

    We can also use the per cpu partial list in slab_alloc() to avoid scanning
    partial lists for pages with free objects.

    The main effect of a per cpu partial list is that the per node list_lock
    is taken for batches of partial pages instead of individual ones.
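
    A hedged sketch of the batching idea (pop_first_partial() and
    push_cpu_partial() are illustrative stand-ins, not SLUB functions):

        /*
         * Refill under one list_lock acquisition so that later allocations
         * can be served from the per-cpu partial list without touching the
         * per-node lock at all.
         */
        static struct page *get_partial_batch(struct kmem_cache_node *n,
                                              struct kmem_cache_cpu *c,
                                              int batch)
        {
                struct page *page, *first = NULL;

                spin_lock(&n->list_lock);     /* once per batch, not per page */
                while (batch-- > 0 && (page = pop_first_partial(n)) != NULL) {
                        if (!first)
                                first = page; /* allocate from this one now */
                        else
                                push_cpu_partial(c, page);
                }
                spin_unlock(&n->list_lock);
                return first;
        }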

    Potential future enhancements:

    1. The pickup from the partial list could perhaps be done without
    disabling interrupts, with some work. The free path already puts the
    page into the per cpu partial list without disabling interrupts.

    2. __slab_free() may have some code paths that could use optimization.

    Performance:

                                       Before        After
    ./hackbench 100 process 200000
        Time:                        1953.047     1564.614
    ./hackbench 100 process 20000
        Time:                         207.176      156.940
    ./hackbench 100 process 20000
        Time:                         204.468      156.940
    ./hackbench 100 process 20000
        Time:                         204.879      158.772
    ./hackbench 10 process 20000
        Time:                          20.153       15.853
    ./hackbench 10 process 20000
        Time:                          20.153       15.986
    ./hackbench 10 process 20000
        Time:                          19.363       16.111
    ./hackbench 1 process 20000
        Time:                           2.518        2.307
    ./hackbench 1 process 20000
        Time:                           2.258        2.339
    ./hackbench 1 process 20000
        Time:                           2.864        2.163

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

08 Jul, 2011

1 commit

  • On Wed, 6 Jul 2011, Jonathan Cameron wrote:

    > Getting:
    >
    > CHK include/linux/version.h
    > CHK include/generated/utsrelease.h
    > make[1]: `include/generated/mach-types.h' is up to date.
    > CC arch/arm/kernel/asm-offsets.s
    > In file included from include/linux/sched.h:64:0,
    > from arch/arm/kernel/asm-offsets.c:13:
    > include/linux/mm_types.h:74:15: error: duplicate member '_count'
    > make[1]: *** [arch/arm/kernel/asm-offsets.s] Error 1
    > make: *** [prepare0] Error 2
    >
    > Issue looks to have been introduced by
    >
    > mm: Rearrange struct page
    >
    > fc9bb8c768abe7ae10861c3510e01a95f98d5933
    >
    > Guessing it's a known issue, but just thought I'd flag it up in case
    > it's something very specific about my build.
    >
    > gcc-2.6 armv7a
    >
    > Reverting that patch works, but given I don't know the history, I'm
    > not proposing doing that in general!

    Well _count exists in two unionized structs but always has the same offset
    within the larger struct. Maybe ARM creates different offsets there for
    some reason?

    The following is a patch to restructure the union / structs combo in such
    a way that only a single definition of _count remains.
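
    The restructured layout ends up roughly like this hedged sketch
    (abbreviated; the real header carries more fields and comments), with
    _count declared exactly once alongside the union it used to be
    duplicated in:

        struct {                        /* second doubleword of struct page */
                union {
                        atomic_t _mapcount;   /* mapping count, normal pages */
                        struct {              /* SLUB's view of the same bits */
                                unsigned inuse:16;
                                unsigned objects:15;
                                unsigned frozen:1;
                        };
                };
                atomic_t _count;        /* single definition, same offset */
        };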

    Reported-by: Jonathan Cameron
    Tested-by: Piotr Hosowicz
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

02 Jul, 2011

2 commits

  • We need to be able to use cmpxchg_double on the freelist and object count
    field in struct page. Rearrange the fields in struct page according to
    doubleword entities so that the freelist pointer comes before the counters.
    Do the rearranging with a future in mind where we use more doubleword
    atomics to avoid locking of updates to flags/mapping or lru pointers.

    Create another union to allow access to counters in struct page as a
    single unsigned long value.

    The doublewords must be properly aligned for cmpxchg_double to work.
    Sadly this increases the size of page struct by one word on some architectures.
    But as a result, page structs are now cacheline aligned on x86_64.
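
    As a hedged illustration of why the layout matters (the helper below is
    hypothetical; cmpxchg_double() is the real primitive), the freelist
    pointer and the counters word can then be replaced together in one
    atomic operation:

        static inline bool update_freelist_and_counters(struct page *page,
                        void *old_fl, unsigned long old_cnt,
                        void *new_fl, unsigned long new_cnt)
        {
                /* requires freelist and counters to form one naturally
                 * aligned doubleword, hence the rearrangement above */
                return cmpxchg_double(&page->freelist, &page->counters,
                                      old_fl, old_cnt, new_fl, new_cnt);
        }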

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • Do not use a page flag for the frozen bit. It needs to be part
    of the state that is handled with cmpxchg_double(). So use a bit
    in the counter struct in the page struct for that purpose.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

30 May, 2011

1 commit

  • Thomas Gleixner reports that we now have a boot crash triggered by
    CONFIG_CPUMASK_OFFSTACK=y:

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] find_next_bit+0x55/0xb0
    Call Trace:
    [] cpumask_any_but+0x2a/0x70
    [] flush_tlb_mm+0x2b/0x80
    [] pud_populate+0x35/0x50
    [] pgd_alloc+0x9a/0xf0
    [] mm_init+0xec/0x120
    [] mm_alloc+0x53/0xd0

    which was introduced by commit de03c72cfce5 ("mm: convert
    mm->cpu_vm_cpumask into cpumask_var_t"), and is due to wrong ordering of
    mm_init() vs mm_init_cpumask().

    Thomas wrote a patch to just fix the ordering of initialization, but I
    hate the new double allocation in the fork path, so I ended up instead
    doing some more radical surgery to clean it all up.
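
    The shape of the cleanup ends up roughly like this hedged sketch
    (simplified; the real code also handles the !CONFIG_CPUMASK_OFFSTACK
    case, where cpu_vm_mask_var is just a plain embedded cpumask):

        struct mm_struct {
                /* ... */
                cpumask_var_t cpu_vm_mask_var;
        #ifdef CONFIG_CPUMASK_OFFSTACK
                struct cpumask cpumask_allocation;  /* storage embedded in the
                                                     * mm itself, so no second
                                                     * allocation in fork     */
        #endif
        };

        static inline void mm_init_cpumask(struct mm_struct *mm)
        {
        #ifdef CONFIG_CPUMASK_OFFSTACK
                mm->cpu_vm_mask_var = &mm->cpumask_allocation;
        #endif
        }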

    Reported-by: Thomas Gleixner
    Reported-by: Ingo Molnar
    Cc: KOSAKI Motohiro
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

27 May, 2011

2 commits

  • Setup and cleanup of mm_struct->exe_file is currently done in fs/proc/.
    This was because exe_file was needed only for /proc/<pid>/exe. Since we
    will also need the exe_file functionality for core dumps (so the core
    name can contain the full binary path), build this functionality into
    the kernel unconditionally.

    To achieve that, move it out of proc FS into kernel/, where in fact it
    belongs. By doing that we can make dup_mm_exe_file static. Also we can
    drop the linux/proc_fs.h inclusion in fs/exec.c and kernel/fork.c.

    Signed-off-by: Jiri Slaby
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     
  • The type of vma->vm_flags is 'unsigned long', not 'int' nor
    'unsigned int'. This patch fixes such misuse.
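
    A hedged sketch of the typedef approach mentioned in the bracketed note
    below (the helper is hypothetical):

        typedef unsigned long vm_flags_t;   /* leaves room to grow to 64 bits */

        /* interfaces take the full-width type instead of a truncating int: */
        static inline int vma_flags_set(struct vm_area_struct *vma,
                                        vm_flags_t flags)
        {
                return (vma->vm_flags & flags) == flags;
        }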

    Signed-off-by: KOSAKI Motohiro
    [ Changed to use a typedef - we'll extend it to cover more cases
    later, since there has been discussion about making it a 64-bit
    type.. - Linus ]
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

25 May, 2011

3 commits

  • The problem with having two different types of counters is that developers
    adding new code need to keep in mind whether it's safe to use both the
    atomic and non-atomic implementations. For example, when adding new
    callers of the *_mm_counter() functions a developer needs to ensure that
    those paths are always executed with page_table_lock held, in case we're
    using the non-atomic implementation of mm counters.

    Hugh Dickins introduced the atomic mm counters in commit f412ac08c986
    ("[PATCH] mm: fix rss and mmlist locking"). When asked why he left the
    non-atomic counters around he said,

    | The only reason was to avoid adding costly atomic operations into a
    | configuration that had no need for them there: the page_table_lock
    | sufficed.
    |
    | Certainly it would be simpler just to delete the non-atomic variant.
    |
    | And I think it's fair to say that any configuration on which we're
    | measuring performance to that degree (rather than "does it boot fast?"
    | type measurements), would already be going the split ptlocks route.

    Removing the non-atomic counters eases the maintenance burden because
    developers no longer have to be mindful of the two implementations when
    using *_mm_counter().

    Note that all architectures provide a means of atomically updating
    atomic_long_t variables, even if they have to revert to the generic
    spinlock implementation because they don't support 64-bit atomic
    instructions (see lib/atomic64.c).
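
    A hedged sketch of the single remaining (atomic) flavour (simplified;
    the actual field names may differ slightly):

        static inline void add_mm_counter(struct mm_struct *mm,
                                          int member, long value)
        {
                atomic_long_add(value, &mm->rss_stat.count[member]);
        }

        static inline unsigned long get_mm_counter(struct mm_struct *mm,
                                                   int member)
        {
                long val = atomic_long_read(&mm->rss_stat.count[member]);

                return val > 0 ? val : 0;  /* transient negatives read as 0 */
        }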

    Signed-off-by: Matt Fleming
    Acked-by: Dave Hansen
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Fleming
     
  • cpumask_t is a very big struct and cpu_vm_mask is placed at the wrong
    position in mm_struct, which may reduce the cache hit ratio.

    This patch makes two changes:
    1) Move the cpumask to the end of mm_struct, because usually only the
    front bits of a cpumask are accessed when the system has cpu-hotplug
    capability.
    2) Convert cpu_vm_mask into cpumask_var_t. It may help to reduce the
    memory footprint if cpumask_size() uses nr_cpumask_bits properly in the
    future.

    In addition, this patch renames cpu_vm_mask to cpu_vm_mask_var. It may
    help to detect out-of-tree cpu_vm_mask users.

    This patch has no functional change.

    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KOSAKI Motohiro
    Cc: David Howells
    Cc: Koichi Yasutake
    Cc: Hugh Dickins
    Cc: Chris Metcalf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Hugh says:
    "The only significant loser, I think, would be page reclaim (when
    concurrent with truncation): could spin for a long time waiting for
    the i_mmap_mutex it expects would soon be dropped? "

    Counter points:
    - cpu contention makes the spin stop (need_resched())
    - zap pages should be freeing pages at a higher rate than reclaim
    ever can

    I think the simplification of the truncate code is definitely worth it.

    Effectively reverts: 2aa15890f3c ("mm: prevent concurrent
    unmap_mapping_range() on the same inode") and takes out the code that
    caused its problem.

    Signed-off-by: Peter Zijlstra
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

23 Mar, 2011

1 commit

  • Reorder mm_struct to remove 16 bytes of alignment padding on 64 bit
    builds. On my config this shrinks mm_struct by enough to fit in one
    fewer cache lines and allows more objects per slab in mm_struct
    kmem_cache under SLUB.

    slabinfo before patch :-
    Sizes (bytes)            Slabs
    --------------------------------
    Object : 848             Total  : 9
    SlabObj: 896             Full   : 2
    SlabSiz: 16384           Partial: 5
    Loss   : 48              CpuSlab: 2
    Align  : 64              Objects: 18

    slabinfo after :-
    Sizes (bytes)            Slabs
    --------------------------------
    Object : 832             Total  : 7
    SlabObj: 832             Full   : 2
    SlabSiz: 16384           Partial: 3
    Loss   : 0               CpuSlab: 2
    Align  : 64              Objects: 19
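
    For readers unfamiliar with why reordering removes padding, a tiny hedged
    example (not from the patch) of how member order changes a struct's size
    on a 64-bit build:

        struct padded { int a; void *p; int b; };  /* 4 + 4(pad) + 8 + 4 + 4(pad) = 24 */
        struct packed { void *p; int a; int b; };  /* 8 + 4 + 4                  = 16 */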

    Signed-off-by: Richard Kennedy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Kennedy
     

14 Jan, 2011

1 commit

  • This increases the size of the mm struct a bit, but it is needed to
    preallocate one pte for each hugepage so that split_huge_page will not
    require a fail path. Guarantee of success is a fundamental property of
    split_huge_page: it avoids decreasing swapping reliability and avoids
    adding -ENOMEM fail paths that would otherwise force the hugepage-unaware
    VM code to learn rolling back in the middle of its pte mangling
    operations (if anything, we need it to learn to handle pmd_trans_huge
    natively rather than to be capable of rollback). When split_huge_page
    runs, a pte page is needed for the split to succeed, to map the newly
    split regular pages with regular ptes. This way all existing VM code
    remains backwards compatible by just adding a split_huge_page* one-liner.
    The memory waste of those preallocated ptes is negligible, so it is
    worth it.
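
    A hedged sketch of the preallocation scheme (the deposit/withdraw helpers
    are illustrative names; the spare page-table page is kept per mm under
    page_table_lock):

        /* at hugepage instantiation time: reserve the pte page up front */
        pgtable_t pgtable = pte_alloc_one(mm, haddr);
        if (unlikely(!pgtable))
                return VM_FAULT_OOM;          /* fail here, never at split time */
        deposit_spare_pgtable(mm, pgtable);   /* illustrative helper */

        /* at split_huge_page() time: guaranteed to succeed */
        pgtable = withdraw_spare_pgtable(mm); /* illustrative helper */
        pmd_populate(mm, pmd, pgtable);       /* map the regular ptes */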

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

27 Oct, 2010

1 commit

  • It's pointless to kill a task if another thread sharing its mm cannot be
    killed to allow future memory freeing. A subsequent patch will prevent
    kills in such cases, but first it's necessary to have a way to flag a task
    that shares memory with an OOM_DISABLE task that doesn't incur an
    additional tasklist scan, which would make select_bad_process() an O(n^2)
    function.

    This patch adds an atomic counter to struct mm_struct that follows how
    many threads attached to it have an oom_score_adj of OOM_SCORE_ADJ_MIN.
    They cannot be killed by the kernel, so their memory cannot be freed in
    oom conditions.

    This only requires task_lock() on the task that we're operating on, it
    does not require mm->mmap_sem since task_lock() pins the mm and the
    operation is atomic.

    [rientjes@google.com: changelog and sys_unshare() code]
    [rientjes@google.com: protect oom_disable_count with task_lock in fork]
    [rientjes@google.com: use old_mm for oom_disable_count in exec]
    Signed-off-by: Ying Han
    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     

21 Aug, 2010

1 commit

  • It's a really simple list, and several of the users want to go backwards
    in it to find the previous vma. So rather than have to look up the
    previous entry with 'find_vma_prev()' or something similar, just make it
    doubly linked instead.
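
    A hedged sketch of the change (abbreviated):

        struct vm_area_struct {
                struct vm_area_struct *vm_next, *vm_prev;  /* now doubly linked */
                /* ... */
        };

        /* callers that previously needed find_vma_prev() can simply do: */
        struct vm_area_struct *prev = vma->vm_prev;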

    Tested-by: Ian Campbell
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

20 Aug, 2010

1 commit

  • This adds annotations for RCU operations in core kernel components

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Paul E. McKenney
    Cc: Al Viro
    Cc: Jens Axboe
    Cc: Andrew Morton
    Reviewed-by: Josh Triplett

    Arnd Bergmann
     

13 Mar, 2010

1 commit

  • Commit 34e55232e59f7b19050267a05ff1226e5cd122a5 ("mm: avoid false sharing
    of mm_counter") added sync_mm_rss() for syncing loosely accounted rss
    counters. It's for CONFIG_MMU, but sync_mm_rss() is called even in the
    NOMMU environment (kernel/exit.c, fs/exec.c). The above commit doesn't
    handle that well.

    This patch makes SPLIT_RSS_COUNTING depend on SPLIT_PTLOCKS && CONFIG_MMU.

    And to avoid unnecessary function calls, sync_mm_rss() is changed to an
    inline noop function in the header file when the feature is disabled.
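
    A hedged sketch of the resulting arrangement (simplified):

        #if defined(SPLIT_RSS_COUNTING)   /* now implies SPLIT_PTLOCKS && MMU */
        void sync_mm_rss(struct task_struct *task, struct mm_struct *mm);
        #else
        static inline void sync_mm_rss(struct task_struct *task,
                                       struct mm_struct *mm)
        {
                /* nothing to flush without the split counters */
        }
        #endif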

    Reported-by: David Howells
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Mike Frysinger
    Signed-off-by: Michal Simek
    Signed-off-by: David Howells
    Cc: Greg Ungerer
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

07 Mar, 2010

4 commits

  • The old anon_vma code can lead to scalability issues with heavily forking
    workloads. Specifically, each anon_vma will be shared between the parent
    process and all its child processes.

    In a workload with 1000 child processes and a VMA with 1000 anonymous
    pages per process that get COWed, this leads to a system with a million
    anonymous pages in the same anon_vma, each of which is mapped in just one
    of the 1000 processes. However, the current rmap code needs to walk them
    all, leading to O(N) scanning complexity for each page.

    This can result in systems where one CPU is walking the page tables of
    1000 processes in page_referenced_one, while all other CPUs are stuck on
    the anon_vma lock. This leads to catastrophic failure for a benchmark
    like AIM7, where the total number of processes can reach in the tens of
    thousands. Real workloads are still a factor 10 less process intensive
    than AIM7, but they are catching up.

    This patch changes the way anon_vmas and VMAs are linked, which allows us
    to associate multiple anon_vmas with a VMA. At fork time, each child
    process gets its own anon_vmas, in which its COWed pages will be
    instantiated. The parents' anon_vma is also linked to the VMA, because
    non-COWed pages could be present in any of the children.

    This reduces rmap scanning complexity to O(1) for the pages of the 1000
    child processes, with O(N) complexity for at most 1/N pages in the system.
    This reduces the average scanning cost in heavily forking workloads from
    O(N) to 2.

    The only real complexity in this patch stems from the fact that linking a
    VMA to anon_vmas now involves memory allocations. This means vma_adjust
    can fail, if it needs to attach a VMA to anon_vma structures. This in
    turn means error handling needs to be added to the calling functions.

    A second source of complexity is that, because there can be multiple
    anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
    "the" anon_vma lock. To prevent the rmap code from walking up an
    incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
    flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
    to make sure it is impossible to compile a kernel that needs both symbolic
    values for the same bitflag.
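
    A hedged sketch of the linking object that implements the many-to-many
    association (abbreviated):

        struct anon_vma_chain {
                struct vm_area_struct *vma;     /* the VMA this link belongs to     */
                struct anon_vma *anon_vma;      /* one of that VMA's anon_vmas      */
                struct list_head same_vma;      /* chain of all anon_vmas of a VMA  */
                struct list_head same_anon_vma; /* chain of all VMAs of an anon_vma */
        };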

    Some test results:

    Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
    box with 16GB RAM and not quite enough IO), the system ends up running
    >99% in system time, with every CPU on the same anon_vma lock in the
    pageout code.

    With these changes, AIM7 hits the cross-over point around 29.7k users.
    This happens with ~99% IO wait time, there never seems to be any spike in
    system time. The anon_vma lock contention appears to be resolved.

    [akpm@linux-foundation.org: cleanups]
    Signed-off-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Cc: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • A frequent question from users about memory management is how many swap
    entries are used by a process. This information also gives some hints
    to the oom-killer.

    Although we can count the number of swap entries per process by scanning
    /proc/<pid>/smaps, this is very slow and not suitable for the usual
    process-information handlers which work like 'ps' or 'top' (ps and top
    are already slow enough..).

    This patch adds a counter of swap entries to mm_counter and updates it at
    each swap event. The information is exported via the /proc/<pid>/status
    file as

    [kamezawa@bluextal memory]$ cat /proc/self/status
    Name: cat
    State: R (running)
    Tgid: 2910
    Pid: 2910
    PPid: 2823
    TracerPid: 0
    Uid: 500 500 500 500
    Gid: 500 500 500 500
    FDSize: 256
    Groups: 500
    VmPeak: 82696 kB
    VmSize: 82696 kB
    VmLck: 0 kB
    VmHWM: 432 kB
    VmRSS: 432 kB
    VmData: 172 kB
    VmStk: 84 kB
    VmExe: 48 kB
    VmLib: 1568 kB
    VmPTE: 40 kB
    VmSwap: 0 kB

    Reviewed-by: Minchan Kim
    Reviewed-by: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Considering the nature of per-mm stats, they are objects shared among
    threads and can be a cache-miss point in the page fault path.

    This patch adds a per-thread cache for mm_counter. RSS values are
    counted into a struct in task_struct and synchronized with the mm's
    counters at certain events.

    In this patch, the event is the number of calls to handle_mm_fault:
    the per-thread value is added to the mm at every 64 calls.
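
    A hedged sketch of the caching scheme (simplified; the real threshold
    macro and field names may differ):

        struct task_rss_stat {
                int events;                 /* flush trigger counter */
                int count[NR_MM_COUNTERS];  /* per-thread deltas     */
        };

        static void check_sync_rss_stat(struct task_struct *task)
        {
                if (unlikely(task->rss_stat.events++ > 64)) {
                        /* fold the thread-local deltas into the mm counters */
                        sync_mm_rss(task, task->mm);
                        task->rss_stat.events = 0;
                }
        }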

    A rough estimation with a small benchmark on parallel threads (2 threads)
    shows:
    [before]
    4.5 cache-miss/faults
    [after]
    4.0 cache-miss/faults
    Anyway, the most contended object is mmap_sem if the number of threads grows.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Presently, the per-mm statistics counters are defined by macros in sched.h.

    This patch modifies them to be
    - defined in mm.h as inline functions
    - backed by an array instead of macro-generated names.

    This patch is for reducing the size of a future patch that modifies the
    implementation of the per-mm counters.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

07 Jan, 2010

1 commit

  • When working with FDPIC, there are many shared mappings of read-only
    code regions between applications (the C library, applet packages like
    busybox, etc.), but the current do_mmap_pgoff() function will issue an
    icache flush whenever a VMA is added to an MM instead of only doing it
    when the map is initially created.

    The flush can instead be done when a region is first mmapped PROT_EXEC.
    Note that we may not rely on the first mapping of a region being
    executable - it's possible for it to be PROT_READ only, so we have to
    remember whether we've flushed the region or not, and then flush the
    entire region when a bit of it is made executable.

    However, this also affects the brk area. That will no longer be
    executable. We can mprotect() it to PROT_EXEC on MPU-mode kernels, but
    for NOMMU mode kernels, when it increases the brk allocation, making
    sys_brk() flush the extra from the icache should suffice. The brk area
    probably isn't used by NOMMU programs since the brk area can only use up
    the leavings from the stack allocation, where the stack allocation is
    larger than requested.

    Signed-off-by: David Howells
    Signed-off-by: Mike Frysinger
    Signed-off-by: Linus Torvalds

    Mike Frysinger
     

24 Sep, 2009

2 commits

  • Because the binfmt is not different between threads in the same process,
    it can be moved from task_struct to mm_struct. The binfmt module is then
    handled per mm_struct instead of per task_struct.

    Signed-off-by: Hiroshi Shimamoto
    Acked-by: Oleg Nesterov
    Cc: Rusty Russell
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hiroshi Shimamoto
     
  • ->ioctx_lock and ->ioctx_list are used only under CONFIG_AIO.

    Signed-off-by: Alexey Dobriyan
    Cc: Zach Brown
    Cc: Benjamin LaHaise
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

19 Aug, 2009

1 commit

  • Commit 2ff05b2b (oom: move oom_adj value) moved the oom_adj value to the
    mm_struct. It was a very good first step toward sanitizing OOM.

    However, Paul Menage reported that the commit causes a regression in his
    job scheduler: the current OOM logic can kill an OOM_DISABLED process.

    Why? His program has code similar to the following.

    ...
    set_oom_adj(OOM_DISABLE);   /* the job scheduler is never killed by oom */
    ...
    if (vfork() == 0) {
            set_oom_adj(0);     /* the invoked child can be killed */
            execve("foo-bar-cmd");
    }
    ....

    vfork() parent and child share the same mm_struct, so the above
    set_oom_adj(0) doesn't change oom_adj only for the vfork() child, it
    also changes it for the vfork() parent. The vfork() parent (the job
    scheduler) thus lost its OOM immunity and was killed.

    Actually, the fork-set-exec idiom is used very frequently in userland
    programs. We must not break this assumption.

    Therefore, this patch reverts commit 2ff05b2b and the related commits.

    Reverted commit list
    ---------------------
    - commit 2ff05b2b4e (oom: move oom_adj value from task_struct to mm_struct)
    - commit 4d8b9135c3 (oom: avoid unnecessary mm locking and scanning for OOM_DISABLE)
    - commit 8123681022 (oom: only oom kill exiting tasks with attached memory)
    - commit 933b787b57 (mm: copy over oom_adj value at fork time)

    Signed-off-by: KOSAKI Motohiro
    Cc: Paul Menage
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Nick Piggin
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

17 Jun, 2009

2 commits

  • * akpm: (182 commits)
    fbdev: bf54x-lq043fb: use kzalloc over kmalloc/memset
    fbdev: *bfin*: fix __dev{init,exit} markings
    fbdev: *bfin*: drop unnecessary calls to memset
    fbdev: bfin-t350mcqb-fb: drop unused local variables
    fbdev: blackfin has __raw I/O accessors, so use them in fb.h
    fbdev: s1d13xxxfb: add accelerated bitblt functions
    tcx: use standard fields for framebuffer physical address and length
    fbdev: add support for handoff from firmware to hw framebuffers
    intelfb: fix a bug when changing video timing
    fbdev: use framebuffer_release() for freeing fb_info structures
    radeon: P2G2CLK_ALWAYS_ONb tested twice, should 2nd be P2G2CLK_DAC_ALWAYS_ONb?
    s3c-fb: CPUFREQ frequency scaling support
    s3c-fb: fix resource releasing on error during probing
    carminefb: fix possible access beyond end of carmine_modedb[]
    acornfb: remove fb_mmap function
    mb862xxfb: use CONFIG_OF instead of CONFIG_PPC_OF
    mb862xxfb: restrict compliation of platform driver to PPC
    Samsung SoC Framebuffer driver: add Alpha Channel support
    atmel-lcdc: fix pixclock upper bound detection
    offb: use framebuffer_alloc() to allocate fb_info struct
    ...

    Manually fix up conflicts due to kmemcheck in mm/slab.c

    Linus Torvalds
     
  • The per-task oom_adj value is a characteristic of its mm more than the
    task itself since it's not possible to oom kill any thread that shares the
    mm. If a task were to be killed while attached to an mm that could not be
    freed because another thread were set to OOM_DISABLE, it would have
    needlessly been terminated since there is no potential for future memory
    freeing.

    This patch moves oomkilladj (now more appropriately named oom_adj) from
    struct task_struct to struct mm_struct. This requires task_lock() on a
    task to check its oom_adj value to protect against exec, but it's already
    necessary to take the lock when dereferencing the mm to find the total VM
    size for the badness heuristic.

    This fixes a livelock if the oom killer chooses a task and another thread
    sharing the same memory has an oom_adj value of OOM_DISABLE. This occurs
    because oom_kill_task() repeatedly returns 1 and refuses to kill the
    chosen task while select_bad_process() will repeatedly choose the same
    task during the next retry.

    Taking task_lock() in select_bad_process() to check for OOM_DISABLE and in
    oom_kill_task() to check for threads sharing the same memory will be
    removed in the next patch in this series where it will no longer be
    necessary.

    Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since
    these threads are immune from oom killing already. They simply report an
    oom_adj value of OOM_DISABLE.

    Cc: Nick Piggin
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

13 Jun, 2009

1 commit

  • General description: kmemcheck is a patch to the linux kernel that
    detects use of uninitialized memory. It does this by trapping every
    read and write to memory that was allocated dynamically (e.g. using
    kmalloc()). If a memory address is read that has not previously been
    written to, a message is printed to the kernel log.

    Thanks to Andi Kleen for the set_memory_4k() solution.

    Andrew Morton suggested documenting the shadow member of struct page.

    Signed-off-by: Vegard Nossum
    Signed-off-by: Pekka Enberg

    [export kmemcheck_mark_initialized]
    [build fix for setup_max_cpus]
    Signed-off-by: Ingo Molnar

    [rebased for mainline inclusion]
    Signed-off-by: Vegard Nossum

    Vegard Nossum
     

03 Apr, 2009

1 commit

  • This fixes a build failure with generic debug pagealloc:

    mm/debug-pagealloc.c: In function 'set_page_poison':
    mm/debug-pagealloc.c:8: error: 'struct page' has no member named 'debug_flags'
    mm/debug-pagealloc.c: In function 'clear_page_poison':
    mm/debug-pagealloc.c:13: error: 'struct page' has no member named 'debug_flags'
    mm/debug-pagealloc.c: In function 'page_poison':
    mm/debug-pagealloc.c:18: error: 'struct page' has no member named 'debug_flags'
    mm/debug-pagealloc.c: At top level:
    mm/debug-pagealloc.c:120: error: redefinition of 'kernel_map_pages'
    include/linux/mm.h:1278: error: previous definition of 'kernel_map_pages' was here
    mm/debug-pagealloc.c: In function 'kernel_map_pages':
    mm/debug-pagealloc.c:122: error: 'debug_pagealloc_enabled' undeclared (first use in this function)

    by fixing

    - debug_flags should be in struct page
    - define DEBUG_PAGEALLOC config option for all architectures
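
    A hedged sketch of the struct page side of the fix (simplified; the
    config symbol guarding the field is stated as an assumption here):

        struct page {
                /* ... */
        #ifdef CONFIG_WANT_PAGE_DEBUG_FLAGS
                unsigned long debug_flags;  /* carries e.g. the poison bit used
                                             * by mm/debug-pagealloc.c          */
        #endif
        };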

    Signed-off-by: Akinobu Mita
    Reported-by: Alexander Beregalov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita