01 Nov, 2011

11 commits

  • This removes mm->oom_disable_count entirely since it's unnecessary and
    currently buggy. The counter was intended to be per-process but it's
    currently decremented in the exit path for each thread that exits, causing
    it to underflow.

    The count was originally intended to prevent oom killing threads that
    share memory with threads that cannot be killed, since killing them
    does not lead to future memory freeing. The counter could be fixed to
    represent all threads sharing the same mm, but it's better to remove
    the count since:

    - it is possible that the OOM_DISABLE thread sharing memory with the
    victim is waiting on that thread to exit and will actually cause
    future memory freeing, and

    - there is no guarantee that a thread is disabled from oom killing just
    because another thread sharing its mm is oom disabled.

    Signed-off-by: David Rientjes
    Reported-by: Oleg Nesterov
    Reviewed-by: Oleg Nesterov
    Cc: Ying Han
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • After selecting a task to kill, the oom killer iterates all processes and
    kills all other threads that share the same mm_struct in different thread
    groups. It would not otherwise be helpful to kill a thread if its memory
    would not be subsequently freed.

    A kernel thread, however, may assume a user thread's mm by using
    use_mm(). This is only temporary and should not result in sending a
    SIGKILL to that kthread.

    This patch ensures that only user threads and not kthreads are sent a
    SIGKILL if they share the same mm_struct as the oom killed task.
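
    A minimal sketch of the idea, assuming the usual task iteration
    helpers (the exact placement in oom_kill.c may differ):

    for_each_process(q)
        if (q->mm == mm && !same_thread_group(q, p) &&
            !(q->flags & PF_KTHREAD))
            force_sig(SIGKILL, q); /* skip kthreads that only use_mm() */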

    Signed-off-by: David Rientjes
    Reviewed-by: Michal Hocko
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • If a thread has been oom killed and is frozen, thaw it before returning to
    the page allocator. Otherwise, it can stay frozen indefinitely and no
    memory will be freed.
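
    A rough sketch of the check, assuming the freezer helpers of that
    era (frozen() and thaw_process() are the assumed names), applied to
    a task already marked TIF_MEMDIE:

    /* an oom-killed task must be thawed or it can never exit
     * and free memory */
    if (test_tsk_thread_flag(p, TIF_MEMDIE) && unlikely(frozen(p)))
        thaw_process(p);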

    Signed-off-by: David Rientjes
    Reported-by: Konstantin Khlebnikov
    Cc: Oleg Nesterov
    Cc: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Cc: "Rafael J. Wysocki"
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Looks like someone got distracted after adding the comment characters.

    Signed-off-by: Johannes Weiner
    Acked-by: Peter Zijlstra
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • A per-task block plug can reduce block queue lock contention and
    increase request merging. Currently page reclaim doesn't use it. I
    originally thought page reclaim didn't need it, because the kswapd
    thread count is limited and file cache writeback is mostly done by
    the flusher threads.

    When I tested a workload with heavy swap on a 4-node machine, each
    CPU was doing direct page reclaim and swapping, which caused block
    queue lock contention. In my test, without the patch below, the CPU
    utilization is about 2% ~ 7%; with the patch, it is about 1% ~ 3%.
    Disk throughput isn't changed. This should improve normal kswapd
    writeout and file cache writeback too (by increasing request
    merging, for example), but the effect might be less obvious for the
    reasons above.
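
    A minimal sketch of wrapping a reclaim loop in a per-task plug (the
    exact placement inside vmscan.c is illustrative):

    struct blk_plug plug;

    blk_start_plug(&plug);
    /* ... isolate pages and issue swap-out/writeback I/O ... */
    blk_finish_plug(&plug);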

    Signed-off-by: Shaohua Li
    Cc: Jens Axboe
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • unmap_and_move() is one big, messy function. Clean it up.

    Signed-off-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • In the __zone_reclaim case, we don't want to reclaim mapped pages.
    Nonetheless, we isolate mapped pages and re-add them to the head of
    the LRU. That is unnecessary CPU overhead and causes LRU churning.

    Of course, a page might be mapped when we isolate it but no longer
    mapped by the time we try to reclaim it, so it could have been
    reclaimed after all. But that race is rare, and even when it
    happens, it's no big deal.
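
    The filter comes down to one early check in __isolate_lru_page(),
    roughly:

    /* let callers that cannot handle mapped pages skip them */
    if ((mode & ISOLATE_UNMAPPED) && page_mapped(page))
        return ret;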

    Signed-off-by: Minchan Kim
    Acked-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • In async mode, compaction doesn't migrate dirty or writeback pages,
    so it's pointless to pick such a page and re-add it to the LRU list.

    Of course, a page might be dirty or under writeback when compaction
    isolates it but no longer by the time we try to migrate it, so it
    could have been migrated. But that is very unlikely, since the
    isolate-and-migrate cycle is much faster than writeout.

    So this patch reduces CPU overhead and prevents unnecessary LRU
    churning.
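
    The corresponding check in __isolate_lru_page() is, roughly:

    /* async compaction asks for clean pages only */
    if ((mode & ISOLATE_CLEAN) &&
        (PageDirty(page) || PageWriteback(page)))
        return ret;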

    Signed-off-by: Minchan Kim
    Acked-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Michal Hocko
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Change the ISOLATE_XXX macros to a bitwise isolate_mode_t type.
    Macros are generally discouraged here, as they are type-unsafe and
    make debugging harder because the symbols cannot be passed through
    to the debugger.

    Quote from Johannes:
    "Hmm, it would probably be cleaner to fully convert the isolation
    mode into independent flags. INACTIVE, ACTIVE, BOTH is currently a
    tri-state among flags, which is a bit ugly."

    This patch also moves the isolate mode definitions from swap.h to
    mmzone.h (which is pulled in via memcontrol.h).
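
    A sketch of the sparse-checkable bitwise type and flags this
    converts to:

    typedef unsigned __bitwise__ isolate_mode_t;

    /* Isolate inactive pages */
    #define ISOLATE_INACTIVE    ((__force isolate_mode_t)0x1)
    /* Isolate active pages */
    #define ISOLATE_ACTIVE      ((__force isolate_mode_t)0x2)
    /* Isolate clean file pages */
    #define ISOLATE_CLEAN       ((__force isolate_mode_t)0x4)
    /* Isolate unmapped file pages */
    #define ISOLATE_UNMAPPED    ((__force isolate_mode_t)0x8)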

    Signed-off-by: Minchan Kim
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Compaction's acct_isolated() uses page_lru_base_type(), which
    returns only the base type of an LRU list, so it never returns
    LRU_ACTIVE_ANON or LRU_ACTIVE_FILE. In addition, cc->nr_anon and
    cc->nr_file are used only in acct_isolated(), so they don't need to
    be fields in compact_control.

    This patch removes the fields from compact_control and makes the
    job of acct_isolated() clear: counting the number of anon/file
    pages isolated.
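
    A sketch of the simplified function:

    static void acct_isolated(struct zone *zone, struct compact_control *cc)
    {
        struct page *page;
        unsigned int count[2] = { 0, };

        list_for_each_entry(page, &cc->migratepages, lru)
            count[!!page_is_file_cache(page)]++;

        __mod_zone_page_state(zone, NR_ISOLATED_ANON, count[0]);
        __mod_zone_page_state(zone, NR_ISOLATED_FILE, count[1]);
    }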

    Signed-off-by: Minchan Kim
    Acked-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Michal Hocko
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • The basic idea behind cross memory attach is to allow MPI programs doing
    intra-node communication to do a single copy of the message rather than a
    double copy of the message via shared memory.

    The following patch attempts to achieve this by allowing a destination
    process, given an address and size from a source process, to copy memory
    directly from the source process into its own address space via a system
    call. There is also a symmetrical ability to copy from the current
    process's address space into a destination process's address space.

    - Use of /proc/pid/mem has been considered, but there are issues
      with using it:
      - It does not allow specifying iovecs for both src and dest;
        assuming preadv or pwritev were implemented, either the area
        read from or the area written to would need to be contiguous.
      - Currently mem_read allows only processes that are currently
        ptrace'ing the target, and are still able to ptrace it, to read
        from the target. This check could possibly be moved to the open
        call, but it's not clear exactly what race this restriction is
        stopping (the reason appears to have been lost).
      - Having to send the fd of /proc/self/mem via SCM_RIGHTS on a
        unix domain socket is a bit ugly from a userspace point of
        view, especially when you may have hundreds if not (eventually)
        thousands of processes that all need to do this with each
        other.
      - It doesn't allow for some future uses of the interface we would
        like to consider adding (see below).
      - Interestingly, reading from /proc/pid/mem currently actually
        involves two copies! (But this could be fixed pretty easily.)

    As mentioned previously, use of vmsplice was considered instead,
    but it has problems. Since you need the reader and writer working
    co-operatively, if the pipe is not drained then you block, which
    requires some wrapping to do non-blocking sends or to poll on the
    receive side. In all-to-all communication it requires ordering,
    otherwise you can deadlock. And in the example of many MPI tasks
    writing to one MPI task, vmsplice serialises the copying.

    There are some cases of MPI collectives where even a single-copy
    interface does not get us all the performance gain we could have.
    For example, in an MPI_Reduce, rather than copying the data from
    the source we would like to use it directly in a math op (say the
    reduce is doing a sum), as this would save us a copy; we don't need
    to keep a copy of the data from the source. I haven't implemented
    this, but I think this interface could do all of that in the future
    through the use of the flags - e.g. one could specify the math
    operation and type, and the kernel, rather than just copying the
    data, would apply the specified operation between the source and
    destination and store the result in the destination.

    Although we don't have a "second user" of the interface yet (though
    I've had some nibbles from people who may be interested in using it
    for intra-process messaging that is not MPI), this interface is
    something hardware vendors are already implementing in their custom
    drivers for fast local communication. So in addition to being
    useful for OpenMPI, it would mean the driver maintainers don't have
    to fix things up when the mm changes.

    There was some discussion about how much faster a true zero copy would
    go. Here's a link back to the email with some testing I did on that:

    http://marc.info/?l=linux-mm&m=130105930902915&w=2

    There is a basic man page for the proposed interface here:

    http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt
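
    As a rough usage sketch, following the prototype in the proposed
    man page and assuming a libc wrapper with that signature (the pid,
    remote address, and length are illustrative):

    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <sys/uio.h>

    /* copy 'len' bytes at 'remote_addr' in process 'pid' into 'buf' */
    static ssize_t read_remote(pid_t pid, void *remote_addr,
                               void *buf, size_t len)
    {
        struct iovec local  = { .iov_base = buf,         .iov_len = len };
        struct iovec remote = { .iov_base = remote_addr, .iov_len = len };

        /* the flags argument must currently be 0 */
        return process_vm_readv(pid, &local, 1, &remote, 1, 0);
    }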

    This has been implemented for x86 and powerpc; other architectures
    should (I think) mainly just need to add syscall numbers for
    process_vm_readv and process_vm_writev. There are 32-bit
    compatibility versions for 64-bit kernels.

    For arch maintainers there are some simple tests to be able to quickly
    verify that the syscalls are working correctly here:

    http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgz

    Signed-off-by: Chris Yeoh
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: Arnd Bergmann
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: David Howells
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christopher Yeoh
     

29 Oct, 2011

1 commit

  • * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/vfs-queue: (21 commits)
    leases: fix write-open/read-lease race
    nfs: drop unnecessary locking in llseek
    ext4: replace cut'n'pasted llseek code with generic_file_llseek_size
    vfs: add generic_file_llseek_size
    vfs: do (nearly) lockless generic_file_llseek
    direct-io: merge direct_io_walker into __blockdev_direct_IO
    direct-io: inline the complete submission path
    direct-io: separate map_bh from dio
    direct-io: use a slab cache for struct dio
    direct-io: rearrange fields in dio/dio_submit to avoid holes
    direct-io: fix a wrong comment
    direct-io: separate fields only used in the submission path from struct dio
    vfs: fix spinning prevention in prune_icache_sb
    vfs: add a comment to inode_permission()
    vfs: pass all mask flags check_acl and posix_acl_permission
    vfs: add hex format for MAY_* flag values
    vfs: indicate that the permission functions take all the MAY_* flags
    compat: sync compat_stats with statfs.
    vfs: add "device" tag to /proc/self/mountstats
    cleanup: vfs: small comment fix for block_invalidatepage
    ...

    Fix up trivial conflict in fs/gfs2/file.c (llseek changes)

    Linus Torvalds
     

28 Oct, 2011

1 commit

  • Currently, when you call iov_iter_advance, the pointer to the iovec
    array can be incremented, but the nr_segs value in the iov_iter
    struct is not decremented. The result is an iov_iter struct with an
    nr_segs value that goes beyond the end of the array.

    While I'm not aware of anything that's specifically broken by this, it
    seems odd and a bit dangerous not to decrement that value. If someone
    were to trust the nr_segs value to be correct, then they could end up
    walking off the end of the array.

    Changing this might also provide some micro-optimization when dealing
    with the last iovec in an array. Many of the other routines that deal
    with iov_iter have optimized codepaths when nr_segs == 1.
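
    A sketch of the advance loop keeping nr_segs in step as whole
    iovecs are consumed (field names as in the iov_iter of that era):

    const struct iovec *iov = i->iov;
    size_t base = i->iov_offset;
    unsigned long nr_segs = i->nr_segs;

    while (bytes) {
        size_t copy = min(bytes, iov->iov_len - base);

        bytes -= copy;
        base += copy;
        if (iov->iov_len == base) {
            iov++;
            nr_segs--;   /* the fix: shrink nr_segs past each iovec */
            base = 0;
        }
    }
    i->iov = iov;
    i->iov_offset = base;
    i->nr_segs = nr_segs;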

    Cc: Nick Piggin
    Signed-off-by: Jeff Layton
    Signed-off-by: Christoph Hellwig

    Jeff Layton
     

26 Oct, 2011

1 commit


25 Oct, 2011

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (59 commits)
    MAINTAINERS: linux-m32r is moderated for non-subscribers
    linux@lists.openrisc.net is moderated for non-subscribers
    Drop default from "DM365 codec select" choice
    parisc: Kconfig: cleanup Kernel page size default
    Kconfig: remove redundant CONFIG_ prefix on two symbols
    cris: remove arch/cris/arch-v32/lib/nand_init.S
    microblaze: add missing CONFIG_ prefixes
    h8300: drop puzzling Kconfig dependencies
    MAINTAINERS: microblaze-uclinux@itee.uq.edu.au is moderated for non-subscribers
    tty: drop superfluous dependency in Kconfig
    ARM: mxc: fix Kconfig typo 'i.MX51'
    Fix file references in Kconfig files
    aic7xxx: fix Kconfig references to READMEs
    Fix file references in drivers/ide/
    thinkpad_acpi: Fix printk typo 'bluestooth'
    bcmring: drop commented out line in Kconfig
    btmrvl_sdio: fix typo 'btmrvl_sdio_sd6888'
    doc: raw1394: Trivial typo fix
    CIFS: Don't free volume_info->UNC until we are entirely done with it.
    treewide: Correct spelling of successfully in comments
    ...

    Linus Torvalds
     
  • * 'next' of git://selinuxproject.org/~jmorris/linux-security: (95 commits)
    TOMOYO: Fix incomplete read after seek.
    Smack: allow to access /smack/access as normal user
    TOMOYO: Fix unused kernel config option.
    Smack: fix: invalid length set for the result of /smack/access
    Smack: compilation fix
    Smack: fix for /smack/access output, use string instead of byte
    Smack: domain transition protections (v3)
    Smack: Provide information for UDS getsockopt(SO_PEERCRED)
    Smack: Clean up comments
    Smack: Repair processing of fcntl
    Smack: Rule list lookup performance
    Smack: check permissions from user space (v2)
    TOMOYO: Fix quota and garbage collector.
    TOMOYO: Remove redundant tasklist_lock.
    TOMOYO: Fix domain transition failure warning.
    TOMOYO: Remove tomoyo_policy_memory_lock spinlock.
    TOMOYO: Simplify garbage collector.
    TOMOYO: Fix make namespacecheck warnings.
    target: check hex2bin result
    encrypted-keys: check hex2bin result
    ...

    Linus Torvalds
     

20 Oct, 2011

1 commit

  • I don't usually pay much attention to the stale "? " addresses in
    stack backtraces, but this lucky report from Pawel Sikora hints that
    mremap's move_ptes() has inadequate locking against page migration.

    3.0 BUG_ON(!PageLocked(p)) in migration_entry_to_page():
    kernel BUG at include/linux/swapops.h:105!
    RIP: 0010:[] [] migration_entry_wait+0x156/0x160
    [] handle_pte_fault+0xae1/0xaf0
    [] ? __pte_alloc+0x42/0x120
    [] ? do_huge_pmd_anonymous_page+0xab/0x310
    [] handle_mm_fault+0x181/0x310
    [] ? vma_adjust+0x537/0x570
    [] do_page_fault+0x11d/0x4e0
    [] ? do_mremap+0x2d5/0x570
    [] page_fault+0x1f/0x30

    mremap's down_write of mmap_sem, together with i_mmap_mutex or lock,
    and pagetable locks, were good enough before page migration (with its
    requirement that every migration entry be found) came in, and enough
    while migration always held mmap_sem; but not enough nowadays, when
    there's memory hotremove and compaction.

    The danger is that move_ptes() lets a migration entry dodge around
    behind remove_migration_pte()'s back, so it's in the old location when
    looking at the new, then in the new location when looking at the old.

    Either mremap's move_ptes() must additionally take anon_vma lock(), or
    migration's remove_migration_pte() must stop peeking for is_swap_entry()
    before it takes pagetable lock.

    Consensus chooses the latter: we prefer to add overhead to migration
    than to mremapping, which gets used by JVMs and by exec stack setup.

    Reported-and-tested-by: Paweł Sikora
    Signed-off-by: Hugh Dickins
    Acked-by: Andrea Arcangeli
    Acked-by: Mel Gorman
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

28 Sep, 2011

3 commits

  • A slab should be discarded when the node's partial count exceeds
    min_partial. Otherwise, the node's partial slabs may eat up all
    memory.

    Signed-off-by: Alex Shi
    Acked-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Alex Shi
     
  • Correct comment errors that mistake the cpu partial object count
    for a page count and may make readers misunderstand.

    Signed-off-by: Alex Shi
    Reviewed-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Alex Shi
     
  • Historically /proc/slabinfo and files under /sys/kernel/slab/* have
    world read permissions and are accessible to the world. slabinfo
    contains rather private information related both to the kernel and
    userspace tasks. Depending on the situation, it might reveal either
    private information per se or information useful to make another
    targeted attack. Some examples of what can be learned by
    reading/watching for /proc/slabinfo entries:

    1) dentry (and different *inode*) numbers might reveal other
    processes' fs activity. The number of dentry "active objects"
    doesn't strictly show the count of files opened/touched by a
    process; however, there is a good correlation between them. The
    patch "proc: force dcache drop on unauthorized access" relies on
    the privacy of the dentry count.

    2) different inode entries might reveal the same information as
    (1), but these are finer-grained counters. If a filesystem is
    mounted at a private mount point (or even in a private namespace)
    and its fs type differs from the other mounted fs types, fs
    activity at this mount point/namespace is revealed. If there is a
    single ecryptfs mount point, the whole fs activity of a single user
    is revealed. The number of files at an ecryptfs mount point is
    private information per se.

    3) fuse_* reveals the number of files / fs activity of a user at a
    user-private mount point. It is of approximately the same severity
    as the ecryptfs infoleak in (2).

    4) sysfs_dir_cache, similar to (2), reveals device addition/
    removal, which can otherwise be hidden by "chmod 0700 /sys/". With
    0444 slabinfo the precise number of sysfs files is known to the
    world.

    5) buffer_head might reveal some kernel activity. Combined with
    other information leaks, an attacker might identify which specific
    kernel routines generate buffer_head activity.

    6) *kmalloc* infoleaks are very situational. An attacker should
    watch the specific kmalloc size entry and filter out noise from
    unrelated kernel activity. With a relatively silent victim system,
    he might get rather precise counters.

    Additional information sources might significantly increase the
    benefits of the slabinfo infoleak. E.g. if an attacker knows that
    process activity on the system is very low (only core daemons like
    syslog and cron), he may run setxid binaries / trigger local daemon
    activity / trigger network service activity / await sporadic cron
    job activity / etc. and get rather precise counters for the fs and
    network activity of these privileged tasks, which is unknown
    otherwise.

    Also, hiding slabinfo and /sys/kernel/slab/* is one step toward
    complicating the exploitation of kernel heap overflows (and
    possibly other bugs). The related discussion:

    http://thread.gmane.org/gmane.linux.kernel/1108378

    To keep compatibility with the old permission model, where a
    non-root monitoring daemon could watch for kernel memleaks through
    slabinfo, one should do:

    groupadd slabinfo
    usermod -a -G slabinfo $MONITOR_USER

    And add the following commands to init scripts (to mountall.conf in
    Ubuntu's upstart case):

    chmod g+r /proc/slabinfo /sys/kernel/slab/*/*
    chgrp slabinfo /proc/slabinfo /sys/kernel/slab/*/*

    Signed-off-by: Vasiliy Kulikov
    Reviewed-by: Kees Cook
    Reviewed-by: Dave Hansen
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    CC: Valdis.Kletnieks@vt.edu
    CC: Linus Torvalds
    CC: Alan Cox
    Signed-off-by: Pekka Enberg

    Vasiliy Kulikov
     

22 Sep, 2011

1 commit

  • * 'for-linus' of git://git.kernel.dk/linux-block:
    floppy: use del_timer_sync() in init cleanup
    blk-cgroup: be able to remove the record of unplugged device
    block: Don't check QUEUE_FLAG_SAME_COMP in __blk_complete_request
    mm: Add comment explaining task state setting in bdi_forker_thread()
    mm: Cleanup clearing of BDI_pending bit in bdi_forker_thread()
    block: simplify force plug flush code a little bit
    block: change force plug flush call order
    block: Fix queue_flag update when rq_affinity goes from 2 to 1
    block: separate priority boosting from REQ_META
    block: remove READ_META and WRITE_META
    xen-blkback: fixed indentation and comments
    xen-blkback: Don't disconnect backend until state switched to XenbusStateClosed.

    Linus Torvalds
     

19 Sep, 2011

2 commits


15 Sep, 2011

9 commits

  • Fast-forward merge with Linus to be able to merge patches based on
    a more recent version of the tree.

    Jiri Kosina
     
  • Signed-off-by: Joe Perches
    Acked-by: Paul Menage
    Signed-off-by: Jiri Kosina

    Joe Perches
     
  • The entries found by find_get_pages() could all be swap entries.
    In this case we skip the entries, but make sure the skipped entries
    are accounted for, so we don't keep looping.

    Using nr_found > nr_skip simplifies the code, as suggested by Eric.

    Reported-and-tested-by: Eric Dumazet
    Signed-off-by: Shaohua Li
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Xen backend drivers (e.g., blkback and netback) would sometimes fail to
    map grant pages into the vmalloc address space allocated with
    alloc_vm_area(). The GNTTABOP_map_grant_ref would fail because Xen could
    not find the page (in the L2 table) containing the PTEs it needed to
    update.

    (XEN) mm.c:3846:d0 Could not find L1 PTE for address fbb42000

    netback and blkback were making the hypercall from a kernel thread where
    task->active_mm != &init_mm and alloc_vm_area() was only updating the page
    tables for init_mm. The usual method of deferring the update to the page
    tables of other processes (i.e., after taking a fault) doesn't work as a
    fault cannot occur during the hypercall.

    This would work on some systems depending on what else was using vmalloc.

    Fix this by reverting ef691947d8a3 ("vmalloc: remove vmalloc_sync_all()
    from alloc_vm_area()") and add a comment to explain why it's needed.
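
    A sketch of the restored behaviour (error handling elided; f is the
    empty per-pte callback used there to force PTE allocation):

    struct vm_struct *area = get_vm_area(size, VM_IOREMAP);

    /* populate init_mm's page tables for the new area */
    apply_to_page_range(&init_mm, (unsigned long)area->addr,
                        size, f, NULL);

    /*
     * A hypercall touching this area cannot take the lazy vmalloc
     * fault, so sync the new PTEs into all page tables now.
     */
    vmalloc_sync_all();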

    Signed-off-by: David Vrabel
    Cc: Jeremy Fitzhardinge
    Cc: Konrad Rzeszutek Wilk
    Cc: Ian Campbell
    Cc: Keir Fraser
    Cc: [3.0.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Vrabel
     
  • Revert the post-3.0 commit 82f9d486e59f5 ("memcg: add
    memory.vmscan_stat").

    The implementation of per-memcg reclaim statistics violates how memcg
    hierarchies usually behave: hierarchically.

    The reclaim statistics are accounted to child memcgs and the parent
    hitting the limit, but not to hierarchy levels in between. Usually,
    hierarchical statistics are perfectly recursive, with each level
    representing the sum of itself and all its children.

    Since this exports statistics to userspace, this may lead to confusion
    and problems with changing things after the release, so revert it now,
    we can try again later.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Michal Hocko
    Cc: Ying Han
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Without swap, anonymous pages are not scanned. As such, they should not
    count when considering force-scanning a small target if there is no swap.

    Otherwise, targets are not force-scanned even when their effective scan
    number is zero and the other conditions--kswapd/memcg--apply.

    This fixes 246e87a93934 ("memcg: fix get_scan_count() for small
    targets").

    [akpm@linux-foundation.org: fix comment]
    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Cc: Ying Han
    Cc: Balbir Singh
    Cc: KOSAKI Motohiro
    Cc: Daisuke Nishimura
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The vmstat_text array is only defined for CONFIG_SYSFS or CONFIG_PROC_FS,
    yet it is referenced for per-node vmstat with CONFIG_NUMA:

    drivers/built-in.o: In function `node_read_vmstat':
    node.c:(.text+0x1106df): undefined reference to `vmstat_text'

    Introduced in commit fa25c503dfa2 ("mm: per-node vmstat: show proper
    vmstats").

    Define the array for CONFIG_NUMA as well.
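
    The guard around the definition then needs to cover all three
    users; a sketch:

    /* vmstat_text is needed by procfs, sysfs, and the NUMA node code */
    #if defined(CONFIG_PROC_FS) || defined(CONFIG_SYSFS) || \
        defined(CONFIG_NUMA)
    const char * const vmstat_text[] = {
        /* ... */
    };
    #endif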

    [akpm@linux-foundation.org: remove unneeded ifdefs]
    Signed-off-by: David Rientjes
    Reported-by: Cong Wang
    Acked-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • When compiling mm/mempolicy.c with strict user copy checks, the
    following warning is shown:

    In file included from arch/x86/include/asm/uaccess.h:572,
    from include/linux/uaccess.h:5,
    from include/linux/highmem.h:7,
    from include/linux/pagemap.h:10,
    from include/linux/mempolicy.h:70,
    from mm/mempolicy.c:68:
    In function `copy_from_user',
    inlined from `compat_sys_get_mempolicy' at mm/mempolicy.c:1415:
    arch/x86/include/asm/uaccess_64.h:64: warning: call to `copy_from_user_overflow' declared with attribute warning: copy_from_user() buffer size is not provably correct
    LD mm/built-in.o

    Fix this by passing the correct buffer size.
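
    A sketch of the kind of fix, clamping the copy to the size of the
    on-stack bitmap (nm and alloc_size as in compat_sys_get_mempolicy):

    DECLARE_BITMAP(bm, MAX_NUMNODES);
    unsigned long copy_size;

    /* never copy more than the local bitmap can hold */
    copy_size = min_t(unsigned long, sizeof(bm), alloc_size);
    err = copy_from_user(bm, nm, copy_size);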

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • commit 9d8cebd4bcd7 ("mm: fix mbind vma merge problem") didn't
    really fix the mbind vma merge problem, due to a wrong pgoff value
    being passed to vma_merge(), which made vma_merge() always return
    NULL.

    Before the patch applied, we are getting a result like:

    addr = 0x7fa58f00c000
    [snip]
    7fa58f00c000-7fa58f00d000 rw-p 00000000 00:00 0
    7fa58f00d000-7fa58f00e000 rw-p 00000000 00:00 0
    7fa58f00e000-7fa58f00f000 rw-p 00000000 00:00 0

    here at 7fa58f00c000->7fa58f00f000 we get 3 VMAs which were
    expected to be merged, as described in commit 9d8cebd.

    Re-testing the patched kernel with the reproducer provided in commit
    9d8cebd, we get the correct result:

    addr = 0x7ffa5aaa2000
    [snip]
    7ffa5aaa2000-7ffa5aaa6000 rw-p 00000000 00:00 0
    7fffd556f000-7fffd5584000 rw-p 00000000 00:00 0 [stack]
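
    The fix, roughly, is to compute the pgoff for the candidate range
    before calling vma_merge() (a sketch of mbind_range()'s loop):

    /* pgoff must correspond to vmstart, not to vma->vm_start */
    pgoff = vma->vm_pgoff +
            ((vmstart - vma->vm_start) >> PAGE_SHIFT);
    prev = vma_merge(mm, prev, vmstart, vmend, vma->vm_flags,
                     vma->anon_vma, vma->vm_file, pgoff, new_pol);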

    Signed-off-by: Caspar Zhang
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Lee Schermerhorn
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Caspar Zhang
     

14 Sep, 2011

1 commit


03 Sep, 2011

2 commits


27 Aug, 2011

2 commits

  • Whether a slab is added to the head or the tail of the partial list
    is performance-sensitive, so use DEACTIVATE_TO_TAIL and
    DEACTIVATE_TO_HEAD explicitly to document the intent and avoid
    getting it wrong.
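
    A sketch of how the flag steers placement in add_partial():

    static inline void add_partial(struct kmem_cache_node *n,
                                   struct page *page, int tail)
    {
        n->nr_partial++;
        if (tail == DEACTIVATE_TO_TAIL)
            list_add_tail(&page->lru, &n->partial);
        else
            list_add(&page->lru, &n->partial);
    }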

    Acked-by: Christoph Lameter
    Signed-off-by: Shaohua Li
    Signed-off-by: Shaohua Li
    Signed-off-by: Pekka Enberg

    Shaohua Li
     
  • When a slab has just one free object, adding it to the head of the
    partial list doesn't make sense, and it can cause lock contention.
    For example:
    1. a CPU takes the slab from the partial list
    2. it fetches an object
    3. it switches to another slab
    4. it frees an object, and the slab is added to the partial list
       again
    In this way n->list_lock will be heavily contended. In fact, Alex
    hit a hackbench regression: 3.1-rc1 performance dropped about 70%
    against 3.0. This patch fixes it.

    Acked-by: Christoph Lameter
    Reported-by: Alex Shi
    Signed-off-by: Shaohua Li
    Signed-off-by: Shaohua Li
    Signed-off-by: Pekka Enberg

    Shaohua Li
     

26 Aug, 2011

3 commits

  • Commit 79dfdaccd1d5 ("memcg: make oom_lock 0 and 1 based rather than
    counter") tried to oom lock the hierarchy and roll back upon
    encountering an already locked memcg.

    The code is confused when it comes to detecting a locked memcg, though,
    so it would fail and rollback after locking one memcg and encountering
    an unlocked second one.

    The result is that oom-locking hierarchies fails unconditionally and
    that every oom killer invocation simply goes to sleep on the oom
    waitqueue forever. The tasks practically hang forever without anyone
    intervening, possibly holding locks that trip up unrelated tasks, too.
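
    Roughly what the fixed locking looks like, assuming the
    for_each_mem_cgroup_tree_cond() style iterator used by that code:

    struct mem_cgroup *iter, *failed = NULL;
    bool cond = true;

    for_each_mem_cgroup_tree_cond(iter, mem, cond) {
        if (iter->oom_lock) {
            /* part of the hierarchy is already locked: remember
             * where we stopped and bail out */
            failed = iter;
            cond = false;
        } else
            iter->oom_lock = true;
    }
    if (!failed)
        return true;

    /* roll back only the memcgs we actually locked */
    cond = true;
    for_each_mem_cgroup_tree_cond(iter, mem, cond) {
        if (iter == failed) {
            cond = false;
            continue;
        }
        iter->oom_lock = false;
    }
    return false;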

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • ZONE_CONGESTED is only cleared in kswapd, but pages can be freed by
    any task. It's possible ZONE_CONGESTED isn't cleared in some cases:

    1. the zone is already balanced on entering balance_pgdat() for
    order-0, because concurrent tasks freed memory. In this case, the
    later check will skip the zone as balanced, so the flag isn't
    cleared.

    2. a high-order balance falls back to order-0. Quote from Mel: at
    the end of balance_pgdat(), kswapd uses the following logic;

    If reclaiming at high order {
        for each zone {
            if all_unreclaimable
                skip
            if watermark is not met
                order = 0
                loop again

            /* watermark is met */
            clear congested
        }
    }

    i.e. it clears ZONE_CONGESTED if the zone is balanced; if not, it
    restarts balancing at order-0. However, if the higher zones are
    balanced for order-0, kswapd will miss clearing ZONE_CONGESTED, as
    that only happens after a zone is shrunk. This can mean that
    wait_iff_congested() stalls unnecessarily.

    This patch makes kswapd clear ZONE_CONGESTED during its initial
    highmem->dma scan for zones that are already balanced.
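
    A sketch of the added clearing in that initial scan (helper names
    from that era's balance_pgdat()):

    if (!zone_watermark_ok_safe(zone, order,
                                high_wmark_pages(zone), 0, 0)) {
        end_zone = i;
        break;
    } else {
        /* the zone is already balanced, so no reclaim will run
         * against it: clear the congestion flag here */
        zone_clear_flag(zone, ZONE_CONGESTED);
    }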

    Signed-off-by: Shaohua Li
    Acked-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • I get the below warning:

    BUG: using smp_processor_id() in preemptible [00000000] code: bash/746
    caller is native_sched_clock+0x37/0x6e
    Pid: 746, comm: bash Tainted: G W 3.0.0+ #254
    Call Trace:
    [] debug_smp_processor_id+0xc2/0xdc
    [] native_sched_clock+0x37/0x6e
    [] try_to_free_mem_cgroup_pages+0x7d/0x270
    [] mem_cgroup_force_empty+0x24b/0x27a
    [] ? sys_close+0x38/0x138
    [] ? sys_close+0x38/0x138
    [] mem_cgroup_force_empty_write+0x17/0x19
    [] cgroup_file_write+0xa8/0xba
    [] vfs_write+0xb3/0x138
    [] sys_write+0x4a/0x71
    [] ? sys_close+0xf0/0x138
    [] system_call_fastpath+0x16/0x1b

    sched_clock() can't be used with preemption enabled, and we don't
    need a fast way to read the clock here, so let's use the ktime API.
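
    A minimal sketch of the substitution:

    /* before: sched_clock(), which is per-cpu and preempt-unsafe */
    ktime_t start = ktime_get();        /* preempt-safe */
    /* ... do the reclaim work ... */
    s64 elapsed_ns = ktime_to_ns(ktime_sub(ktime_get(), start));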

    Signed-off-by: Shaohua Li
    Acked-by: KAMEZAWA Hiroyuki
    Tested-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li