02 May, 2013

2 commits

  • Pull VFS updates from Al Viro:

    Misc cleanups all over the place, mainly wrt /proc interfaces (switch
    create_proc_entry to proc_create(), get rid of the deprecated
    create_proc_read_entry() in favor of using proc_create_data() and
    seq_file etc).

    7kloc removed.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
    don't bother with deferred freeing of fdtables
    proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
    proc: Make the PROC_I() and PDE() macros internal to procfs
    proc: Supply a function to remove a proc entry by PDE
    take cgroup_open() and cpuset_open() to fs/proc/base.c
    ppc: Clean up scanlog
    ppc: Clean up rtas_flash driver somewhat
    hostap: proc: Use remove_proc_subtree()
    drm: proc: Use remove_proc_subtree()
    drm: proc: Use minor->index to label things, not PDE->name
    drm: Constify drm_proc_list[]
    zoran: Don't print proc_dir_entry data in debug
    reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
    proc: Supply an accessor for getting the data from a PDE's parent
    airo: Use remove_proc_subtree()
    rtl8192u: Don't need to save device proc dir PDE
    rtl8187se: Use a dir under /proc/net/r8180/
    proc: Add proc_mkdir_data()
    proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
    proc: Move PDE_NET() to fs/proc/proc_net.c
    ...

    Linus Torvalds
     
  • Al Viro
     

01 May, 2013

9 commits

  • Pull compat cleanup from Al Viro:
    "Mostly about syscall wrappers this time; there will be another pile
    with patches in the same general area from various people, but I'd
    rather push those after both that and vfs.git pile are in."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    syscalls.h: slightly reduce the jungles of macros
    get rid of union semop in sys_semctl(2) arguments
    make do_mremap() static
    sparc: no need to sign-extend in sync_file_range() wrapper
    ppc compat wrappers for add_key(2) and request_key(2) are pointless
    x86: trim sys_ia32.h
    x86: sys32_kill and sys32_mprotect are pointless
    get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC
    merge compat sys_ipc instances
    consolidate compat lookup_dcookie()
    convert vmsplice to COMPAT_SYSCALL_DEFINE
    switch getrusage() to COMPAT_SYSCALL_DEFINE
    switch epoll_pwait to COMPAT_SYSCALL_DEFINE
    convert sendfile{,64} to COMPAT_SYSCALL_DEFINE
    switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE
    make SYSCALL_DEFINE-generated wrappers do asmlinkage_protect
    make HAVE_SYSCALL_WRAPPERS unconditional
    consolidate cond_syscall and SYSCALL_ALIAS declarations
    teach SYSCALL_DEFINE how to deal with long long/unsigned long long
    get rid of duplicate logics in __SC_....[1-6] definitions

    Linus Torvalds
     
  • cleancache_ops is used to decide whether a backend is registered.
    So now cleancache_enabled is always true if CONFIG_CLEANCACHE is defined.

    Signed-off-by: Bob Liu
    Cc: Wanpeng Li
    Cc: Andor Daam
    Cc: Dan Magenheimer
    Cc: Florian Schmaus
    Cc: Konrad Rzeszutek Wilk
    Cc: Minchan Kim
    Cc: Stefan Hengelein
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
     
  • Use the cleancache_ops pointer, instead of a separate backend_registered
    flag, to determine whether a backend is enabled. This allows us to remove
    the backend_registered check and just do 'if (cleancache_ops)'.
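
    A minimal userspace sketch of the pattern described here, not the
    kernel's cleancache code. All names (demo_ops, demo_register_ops,
    demo_put_page, demo_accept_all) are hypothetical. A NULL ops pointer
    stands in for "no backend registered", so no separate flag is kept:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified backend operations table. */
struct demo_ops {
    int (*put_page)(int pool_id, unsigned long index);
};

/* NULL means "no backend registered"; no separate boolean is kept. */
static struct demo_ops *demo_ops_ptr;

/* Register a backend and return the previously registered one, if any. */
static struct demo_ops *demo_register_ops(struct demo_ops *ops)
{
    struct demo_ops *old = demo_ops_ptr;

    assert(ops != NULL);
    demo_ops_ptr = ops;
    return old;
}

/* Front-end hook: returns -1 when no backend is present. */
static int demo_put_page(int pool_id, unsigned long index)
{
    if (!demo_ops_ptr)
        return -1;
    return demo_ops_ptr->put_page(pool_id, index);
}

/* Trivial backend used for illustration: accepts every page. */
static int demo_accept_all(int pool_id, unsigned long index)
{
    (void)pool_id;
    (void)index;
    return 0;
}

static struct demo_ops demo_accept_all_ops = { .put_page = demo_accept_all };
```

    Calling demo_put_page() before demo_register_ops(&demo_accept_all_ops)
    fails with -1; afterwards the call is forwarded to the backend.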

    [v1: Rebase on top of b97c4b430b0a (ramster->zcache move)]
    Signed-off-by: Konrad Rzeszutek Wilk
    Signed-off-by: Bob Liu
    Cc: Wanpeng Li
    Cc: Andor Daam
    Cc: Dan Magenheimer
    Cc: Florian Schmaus
    Cc: Minchan Kim
    Cc: Stefan Hengelein
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konrad Rzeszutek Wilk
     
  • With the goal of allowing tmem backends (zcache, ramster, Xen tmem) to
    be built/loaded as modules rather than built-in and enabled by a boot
    parameter, this patch provides "lazy initialization", allowing backends
    to register to cleancache even after filesystems were mounted. Calls to
    init_fs and init_shared_fs are remembered as fake poolids but no real
    tmem_pools are created. On backend registration, the fake poolids are
    mapped to real poolids and their respective tmem_pools.
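
    The lazy initialization can be sketched in userspace roughly as follows.
    This is an illustration under stated assumptions, not the patch itself:
    the table size, the placeholder offset and every demo_* name are
    hypothetical, and only the non-shared init path is modelled.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_FS       4      /* hypothetical limit on tracked filesystems */
#define DEMO_FAKE_OFFSET  1000   /* hypothetical base for placeholder pool ids */
#define DEMO_NO_POOL      (-1)   /* no real pool has been created yet */

static int demo_slot_used[DEMO_MAX_FS];
static int demo_real_id[DEMO_MAX_FS];
static int (*demo_backend_init_fs)(void);   /* NULL until a backend registers */

/* Called at mount time: always hands out a placeholder ("fake") pool id. */
static int demo_init_fs(void)
{
    int i;

    for (i = 0; i < DEMO_MAX_FS; i++) {
        if (demo_slot_used[i])
            continue;
        demo_slot_used[i] = 1;
        /* Create the real pool now only if a backend already exists. */
        demo_real_id[i] = demo_backend_init_fs ? demo_backend_init_fs()
                                               : DEMO_NO_POOL;
        return i + DEMO_FAKE_OFFSET;
    }
    return DEMO_NO_POOL;   /* table full */
}

/* Called when a backend loads: create real pools for earlier mounts. */
static void demo_register_backend(int (*init_fs)(void))
{
    int i;

    assert(init_fs != NULL);
    demo_backend_init_fs = init_fs;
    for (i = 0; i < DEMO_MAX_FS; i++)
        if (demo_slot_used[i] && demo_real_id[i] == DEMO_NO_POOL)
            demo_real_id[i] = init_fs();
}

/* Translate a placeholder id into the real pool id, if one exists yet. */
static int demo_real_poolid(int fake_id)
{
    int slot = fake_id - DEMO_FAKE_OFFSET;

    if (slot < 0 || slot >= DEMO_MAX_FS || !demo_slot_used[slot])
        return DEMO_NO_POOL;
    return demo_real_id[slot];
}

/* Illustrative backend: hands out increasing pool ids starting at 42. */
static int demo_next_pool = 42;

static int demo_alloc_pool(void)
{
    return demo_next_pool++;
}
```

    Filesystems "mounted" before demo_register_backend(demo_alloc_pool) keep
    their placeholder ids; only the translation table changes on registration.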

    Signed-off-by: Stefan Hengelein
    Signed-off-by: Florian Schmaus
    Signed-off-by: Andor Daam
    Signed-off-by: Dan Magenheimer
    [v1: Minor fixes: used #define for some values and bools]
    [v2: Removed CLEANCACHE_HAS_LAZY_INIT]
    [v3: Added more comments, added a lock for [shared_|]fs_poolid_map]
    Signed-off-by: Konrad Rzeszutek Wilk
    Signed-off-by: Bob Liu
    Cc: Wanpeng Li
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Magenheimer
     
  • Frontswap initialization routine depends on swap_lock, which wants to be
    atomic about frontswap's first appearance. IOW, frontswap is either not
    present and will fail all calls, OR frontswap is fully functional. But if
    a new swap_info_struct isn't registered by enable_swap_info, the swap
    subsystem doesn't start I/O, so there is no race between the init
    procedure and page I/O working on frontswap.

    So let's remove unnecessary swap_lock dependency.

    Cc: Dan Magenheimer
    Signed-off-by: Minchan Kim
    [v1: Rebased on my branch, reworked to work with backends loading late]
    [v2: Added a check for !map]
    [v3: Made the invalidate path follow the init path]
    [v4: Address comments by Wanpeng Li]
    Signed-off-by: Konrad Rzeszutek Wilk
    Signed-off-by: Bob Liu
    Cc: Wanpeng Li
    Cc: Andor Daam
    Cc: Florian Schmaus
    Cc: Stefan Hengelein
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • After allowing tmem backends to build/run as modules, frontswap_enabled is
    always true if CONFIG_FRONTSWAP is defined. But frontswap_test() depends on
    whether a backend is registered, so move it into frontswap.c and use
    frontswap_ops to make the decision.

    frontswap_set/clear are not used outside frontswap, so don't export them.

    Signed-off-by: Bob Liu
    Cc: Wanpeng Li
    Cc: Andor Daam
    Cc: Dan Magenheimer
    Cc: Florian Schmaus
    Cc: Konrad Rzeszutek Wilk
    Cc: Minchan Kim
    Cc: Stefan Hengelein
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
     
  • This simplifies the frontswap code - we can get rid of the
    'backend_registered' test and instead check against frontswap_ops.

    [v1: Rebase on top of 703ba7fe5e0 (ramster->zcache move)]
    Signed-off-by: Konrad Rzeszutek Wilk
    Signed-off-by: Bob Liu
    Cc: Wanpeng Li
    Cc: Andor Daam
    Cc: Dan Magenheimer
    Cc: Florian Schmaus
    Cc: Minchan Kim
    Cc: Stefan Hengelein
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konrad Rzeszutek Wilk
     
  • With the goal of allowing tmem backends (zcache, ramster, Xen tmem) to
    be built/loaded as modules rather than built-in and enabled by a boot
    parameter, this patch provides "lazy initialization", allowing backends
    to register to frontswap even after swapon was run. Before a backend
    registers, all calls to init are recorded and the creation of tmem_pools
    is delayed until a backend registers or until a frontswap store is
    attempted.

    Signed-off-by: Stefan Hengelein
    Signed-off-by: Florian Schmaus
    Signed-off-by: Andor Daam
    Signed-off-by: Dan Magenheimer
    [v1: Fixes per Seth Jennings suggestions]
    [v2: Removed FRONTSWAP_HAS_.. ]
    [v3: Fix up per Bob Liu recommendations]
    [v4: Fix up per Andrew's comments]
    Signed-off-by: Konrad Rzeszutek Wilk
    Signed-off-by: Bob Liu
    Cc: Wanpeng Li
    Cc: Dan Magenheimer
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Magenheimer
     
  • Pull trivial tree updates from Jiri Kosina:
    "Usual stuff, mostly comment fixes, typo fixes, printk fixes and small
    code cleanups"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (45 commits)
    mm: Convert print_symbol to %pSR
    gfs2: Convert print_symbol to %pSR
    m32r: Convert print_symbol to %pSR
    iostats.txt: add easy-to-find description for field 6
    x86 cmpxchg.h: fix wrong comment
    treewide: Fix typo in printk and comments
    doc: devicetree: Fix various typos
    docbook: fix 8250 naming in device-drivers
    pata_pdc2027x: Fix compiler warning
    treewide: Fix typo in printks
    mei: Fix comments in drivers/misc/mei
    treewide: Fix typos in kernel messages
    pm44xx: Fix comment for "CONFIG_CPU_IDLE"
    doc: Fix typo "CONFIG_CGROUP_CGROUP_MEMCG_SWAP"
    mmzone: correct "pags" to "pages" in comment.
    kernel-parameters: remove outdated 'noresidual' parameter
    Remove spurious _H suffixes from ifdef comments
    sound: Remove stray pluses from Kconfig file
    radio-shark: Fix printk "CONFIG_LED_CLASS"
    doc: put proper reference to CONFIG_MODULE_SIG_ENFORCE
    ...

    Linus Torvalds
     

30 Apr, 2013

29 commits

  • Merge second batch of fixes from Andrew Morton:

    - various misc bits

    - some printk updates

    - a new "SRAM" driver.

    - MAINTAINERS updates

    - the backlight driver queue

    - checkpatch updates

    - a few init/ changes

    - a huge number of drivers/rtc changes

    - fatfs updates

    - some lib/idr.c work

    - some renaming of the random driver interfaces

    * emailed patches from Andrew Morton: (285 commits)
    net: rename random32 to prandom
    net/core: remove duplicate statements by do-while loop
    net/core: rename random32() to prandom_u32()
    net/netfilter: rename random32() to prandom_u32()
    net/sched: rename random32() to prandom_u32()
    net/sunrpc: rename random32() to prandom_u32()
    scsi: rename random32() to prandom_u32()
    lguest: rename random32() to prandom_u32()
    uwb: rename random32() to prandom_u32()
    video/uvesafb: rename random32() to prandom_u32()
    mmc: rename random32() to prandom_u32()
    drbd: rename random32() to prandom_u32()
    kernel/: rename random32() to prandom_u32()
    mm/: rename random32() to prandom_u32()
    lib/: rename random32() to prandom_u32()
    x86: rename random32() to prandom_u32()
    x86: pageattr-test: remove srandom32 call
    uuid: use prandom_bytes()
    raid6test: use prandom_bytes()
    sctp: convert sctp_assoc_set_id() to use idr_alloc_cyclic()
    ...

    Linus Torvalds
     
  • Pull cgroup updates from Tejun Heo:

    - Fixes and a lot of cleanups. Locking cleanup is finally complete.
    cgroup_mutex is no longer exposed to individual controllers which
    used to cause nasty deadlock issues. Li fixed and cleaned up quite a
    bit including long standing ones like racy cgroup_path().

    - device cgroup now supports proper hierarchy thanks to Aristeu.

    - perf_event cgroup now supports proper hierarchy.

    - A new mount option "__DEVEL__sane_behavior" is added. As indicated
    by the name, this option is to be used for development only at this
    point and generates a warning message when used. Unfortunately,
    cgroup interface currently has too many breakages and inconsistencies
    to implement a consistent and unified hierarchy on top. The new flag
    is used to collect the behavior changes which are necessary to
    implement consistent unified hierarchy. It's likely that this flag
    won't be used verbatim when it becomes ready but will be enabled
    implicitly along with unified hierarchy.

    The option currently disables some of the broken behaviors in cgroup core
    and also .use_hierarchy switch in memcg (will be routed through -mm),
    which can be used to make very unusual hierarchy where nesting is
    partially honored. It will also be used to implement hierarchy
    support for blk-throttle which would be impossible otherwise without
    introducing a full separate set of control knobs.

    This is essentially versioning of interface which isn't very nice but
    at this point I can't see any other options which would allow keeping
    the interface the same while moving towards hierarchy behavior which
    is at least somewhat sane. The planned unified hierarchy is likely
    to require some level of adaptation from userland anyway, so I think
    it'd be best to take the chance and update the interface such that
    it's supportable in the long term.

    Maintaining the existing interface does complicate cgroup core but
    shouldn't put too much strain on individual controllers and I think
    it'd be manageable for the foreseeable future. Maybe we'll be able
    to drop it in a decade.

    Fix up conflicts (including a semantic one adding a new #include to ppc
    that was uncovered by the header file changes) as per Tejun.

    * 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (45 commits)
    cpuset: fix compile warning when CONFIG_SMP=n
    cpuset: fix cpu hotplug vs rebuild_sched_domains() race
    cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn()
    cgroup: restore the call to eventfd->poll()
    cgroup: fix use-after-free when umounting cgroupfs
    cgroup: fix broken file xattrs
    devcg: remove parent_cgroup.
    memcg: force use_hierarchy if sane_behavior
    cgroup: remove cgrp->top_cgroup
    cgroup: introduce sane_behavior mount option
    move cgroupfs_root to include/linux/cgroup.h
    cgroup: convert cgroupfs_root flag bits to masks and add CGRP_ prefix
    cgroup: make cgroup_path() not print double slashes
    Revert "cgroup: remove bind() method from cgroup_subsys."
    perf: make perf_event cgroup hierarchical
    cgroup: implement cgroup_is_descendant()
    cgroup: make sure parent won't be destroyed before its children
    cgroup: remove bind() method from cgroup_subsys.
    devcg: remove broken_hierarchy tag
    cgroup: remove cgroup_lock_is_held()
    ...

    Linus Torvalds
     
  • Use the preferable function name, which implies the use of a
    pseudo-random number generator.

    Signed-off-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • The memcg is not referenced, so it can be destroyed at any time right
    after we exit rcu read section, so it's not safe to access it.

    To fix this, we call css_tryget() to get a reference while we're still
    in rcu read section.

    This also removes a bogus comment above __memcg_create_cache_enqueue().

    Signed-off-by: Li Zefan
    Acked-by: Glauber Costa
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • There are times when HIGHMEM is enabled, but we don't prefer
    CONFIG_BOUNCE to be enabled. CONFIG_BOUNCE can reduce the block device
    throughput, and this is not ideal for machines where we don't gain much
    by enabling it. So provide an option to deselect CONFIG_BOUNCE. The
    observation was made while measuring eMMC throughput using iozone on an
    ARM device with 1GB RAM.

    Signed-off-by: Vinayak Menon
    Cc: David Rientjes
    Cc: Jens Axboe
    Cc: Randy Dunlap
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vinayak Menon
     
  • Currently, we do memset() before reserving the area. This may not cause
    any problem, but it is somewhat weird. So change execution order.

    Signed-off-by: Joonsoo Kim
    Cc: Yinghai Lu
    Acked-by: Johannes Weiner
    Cc: Jiang Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Remove unused argument and make function static, because there is no user
    outside of nobootmem.c

    Signed-off-by: Joonsoo Kim
    Cc: Yinghai Lu
    Acked-by: Johannes Weiner
    Cc: Jiang Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • PFN_PHYS() is a phys_addr_t, which can be u32 or u64.
    Fix the build warning when phys_addr_t is u32.

    mm/memory_hotplug.c: warning: format '%llx' expects argument of type 'long long unsigned int', but argument 2 has type 'unsigned int' [-Wformat]: => 1685:3
    mm/memory_hotplug.c: warning: format '%llx' expects argument of type 'long long unsigned int', but argument 3 has type 'unsigned int' [-Wformat]: => 1685:3
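
    A userspace sketch of one common way to silence this class of warning,
    shown as an illustration rather than as the exact change made here:
    cast the value to unsigned long long so it matches %llx whatever the
    width of the underlying type. demo_phys_addr_t, DEMO_PFN_PHYS and
    demo_format_range are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for the kernel's phys_addr_t when it is only 32 bits wide. */
typedef uint32_t demo_phys_addr_t;

#define DEMO_PAGE_SHIFT 12
/* Simplified analogue of PFN_PHYS(): page frame number -> physical address. */
#define DEMO_PFN_PHYS(pfn) ((demo_phys_addr_t)(pfn) << DEMO_PAGE_SHIFT)

/* Format a range; the casts make the arguments match %llx on every width. */
static int demo_format_range(char *buf, size_t len,
                             unsigned long start_pfn, unsigned long end_pfn)
{
    assert(buf != NULL);
    return snprintf(buf, len, "%#llx-%#llx",
                    (unsigned long long)DEMO_PFN_PHYS(start_pfn),
                    (unsigned long long)DEMO_PFN_PHYS(end_pfn) - 1);
}
```

    Without the casts, a 32-bit demo_phys_addr_t passed to %llx triggers the
    same -Wformat diagnostic as quoted in the warning text.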

    Signed-off-by: Randy Dunlap
    Reported-by: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • As pointed out by Andrew Morton, the swap-over-NFS writeback is not
    setting PageWriteback before it is queued for direct IO. While swap
    pages do not participate in BDI or process dirty accounting and the IO
    is synchronous, the writeback bit is still required and not setting it
    in this case was an oversight. swapoff depends on the page writeback to
    synchronise all pending writes on a swap page before it is reused.
    Swapcache freeing and reuse depend on checking the PageWriteback under
    lock to ensure the page is safe to reuse.

    Direct IO handlers and the direct IO handler for NFS do not deal with
    PageWriteback as they are synchronous writes. In the case of NFS, it
    schedules pages (or a page in the case of swap) for IO and then waits
    synchronously for IO to complete in nfs_direct_write(). It is
    recognised that this is a slowdown from normal swap handling which is
    asynchronous and uses a completion handler. Shoving PageWriteback
    handling down into direct IO handlers looks like a bad fit to handle the
    swap case although it may have to be dealt with some day if swap is
    converted to use direct IO in general and bmap is finally done away
    with. At that point it will be necessary to refit asynchronous direct
    IO with completion handlers onto the swap subsystem.

    As swapcache currently depends on PageWriteback to protect against
    races, this patch sets PageWriteback under the page lock before queueing
    it for direct IO. It is cleared when the direct IO handler returns. IO
    errors are treated similarly to the direct-to-bio case except PageError
    is not set as in the case of swap-over-NFS, it is likely to be a
    transient error.

    It was asked what prevents such a page being reclaimed in parallel.
    With this patch applied, such a page will now be skipped (most of the
    time) or blocked until the writeback completes. Reclaim checks
    PageWriteback under the page lock before calling try_to_free_swap and
    the page lock should prevent the page being requeued for IO before it is
    freed.

    This and Jerome's related patch should be considered for -stable as far
    back as 3.6 when swap-over-NFS was introduced.

    [akpm@linux-foundation.org: use pr_err_ratelimited()]
    [akpm@linux-foundation.org: remove hopefully-unneeded cast in printk]
    Signed-off-by: Mel Gorman
    Cc: Jerome Marchand
    Cc: Hugh Dickins
    Cc: [3.6+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Since commit 62c230bc1790 ("mm: add support for a filesystem to activate
    swap files and use direct_IO for writing swap pages"), swap_writepage()
    calls direct_IO on swap files. However, in that case the page isn't
    redirtied if I/O fails, and is therefore handled afterwards as if it has
    been successfully written to the swap file, leading to memory corruption
    when the page is eventually swapped back in.

    This patch sets the page dirty when direct_IO() fails. It fixes a
    memory corruption that happened while using swap-over-NFS.

    Signed-off-by: Jerome Marchand
    Acked-by: Johannes Weiner
    Acked-by: Mel Gorman
    Cc: Hugh Dickins
    Cc: [3.6+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     
  • A memcg may livelock when oom if the process that grabs the hierarchy's
    oom lock is never the first process with PF_EXITING set in the memcg's
    task iteration.

    The oom killer, both global and memcg, will defer if it finds an
    eligible process that is in the process of exiting and it is not being
    ptraced. The idea is to allow it to exit without using memory reserves
    before needlessly killing another process.

    This normally works fine except in the memcg case with a large number of
    threads attached to the oom memcg. In this case, the memcg oom killer
    only gets called for the process that grabs the hierarchy's oom lock;
    all others end up blocked on the memcg's oom waitqueue. Thus, if the
    process that grabs the hierarchy's oom lock is never the first
    PF_EXITING process in the memcg's task iteration, the oom killer is
    constantly deferred without anything making progress.

    The fix is to give PF_EXITING processes access to memory reserves so
    that we've marked them as oom killed without any iteration. This allows
    __mem_cgroup_try_charge() to succeed so that the process may exit. This
    makes the memcg oom killer exemption for TIF_MEMDIE tasks, now
    immediately granted for processes with pending SIGKILLs and those in the
    exit path, to be equivalent to what is done for the global oom killer.

    Signed-off-by: David Rientjes
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Current implementation of huge zero page uses pfn value 0 to indicate
    that the page hasn't been allocated yet. It assumes that the buddy page
    allocator can't return a page with pfn == 0.

    Let's rework the code to store 'struct page *' of huge zero page, not
    its pfn. This way we can avoid the weak assumption.
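
    The idea can be sketched in userspace as below. This is a simplified
    illustration with hypothetical names (demo_page, demo_get_huge_zero_page,
    demo_is_huge_zero_page), not the code from the patch: a NULL pointer
    marks "not allocated", so no pfn value has to be reserved as a marker.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for struct page; pfn 0 is treated as a valid frame. */
struct demo_page {
    unsigned long pfn;
};

/* NULL means "not allocated yet"; no pfn value is reserved as a marker. */
static struct demo_page *demo_huge_zero_page;

/* Allocate the shared page on first use and return it afterwards. */
static struct demo_page *demo_get_huge_zero_page(struct demo_page *(*alloc)(void))
{
    assert(alloc != NULL);
    if (!demo_huge_zero_page)
        demo_huge_zero_page = alloc();
    return demo_huge_zero_page;
}

/* Identity test by pointer, independent of which pfn was handed out. */
static int demo_is_huge_zero_page(const struct demo_page *page)
{
    return page != NULL && page == demo_huge_zero_page;
}

/* Illustrative allocator that happens to return the frame with pfn 0. */
static struct demo_page demo_frame_zero = { .pfn = 0 };
static int demo_alloc_calls;

static struct demo_page *demo_alloc_frame_zero(void)
{
    demo_alloc_calls++;
    return &demo_frame_zero;
}
```

    Even when the allocator returns the frame with pfn 0, the page is
    recognised correctly and is allocated only once.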

    [akpm@linux-foundation.org: fix sparse warning]
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Minchan Kim
    Acked-by: Minchan Kim
    Reviewed-by: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • This might cause a use-after-free bug.

    Signed-off-by: Li Zefan
    Cc: Glauber Costa
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • There are two convenient ways to report errors to userspace:

    1) return the error to the original syscall, for example write(2)
    2) mark the mapping with an error flag and return it on a later fsync(2)

    The second one is broken if (mapping->nrpages == 0). This is a real-life
    situation because after an error, pages are likely to be truncated or
    invalidated.

    We have to return an error regardless of the number of pages in the mapping.
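
    A reduced userspace model of the intended behaviour, assuming a
    hypothetical demo_mapping structure and helper names; the real kernel
    structures and functions differ. The recorded error is checked
    unconditionally, not only when pages are still cached:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, much-reduced model of an address space. */
struct demo_mapping {
    unsigned long nrpages;   /* pages currently cached */
    int io_error;            /* set when an asynchronous write failed */
    int writeback_runs;      /* how often writeback was attempted */
};

/* Report a recorded error once, then clear it. */
static int demo_check_errors(struct demo_mapping *m)
{
    if (m->io_error) {
        m->io_error = 0;
        return -EIO;
    }
    return 0;
}

/* fsync-like helper: flush cached pages, then report recorded errors. */
static int demo_write_and_wait(struct demo_mapping *m)
{
    assert(m != NULL);
    if (m->nrpages)
        m->writeback_runs++;   /* stand-in for writing back cached pages */

    /* Checked unconditionally, so an empty mapping still reports -EIO. */
    return demo_check_errors(m);
}
```

    With nrpages == 0 and a pending error, the first call returns -EIO and
    a second call returns 0 because the error has been consumed.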

    #Original testcase: git@github.com:dmonakhov/xfstests.git
    MOUNT_OPTIONS="-b1024"
    ./check shared/305

    Signed-off-by: Dmitry Monakhov
    Reviewed-by: Jan Kara
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Monakhov
     
  • There is no comment for parameter nid of memblock_insert_region().
    This patch adds comment for it.

    Signed-off-by: Tang Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • Signed-off-by: Cody P Schafer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
  • In page reclaim, a huge page is split. split_huge_page() adds tail pages
    to the LRU list. Since we are reclaiming a huge page, it's better to
    reclaim all subpages of the huge page instead of just the head page.
    This patch adds the split tail pages to the shrink page list so the tail
    pages can be reclaimed soon.

    Before this patch, run a swap workload:
    thp_fault_alloc 3492
    thp_fault_fallback 608
    thp_collapse_alloc 6
    thp_collapse_alloc_failed 0
    thp_split 916

    With this patch:
    thp_fault_alloc 4085
    thp_fault_fallback 16
    thp_collapse_alloc 90
    thp_collapse_alloc_failed 0
    thp_split 1272

    fallback allocation is reduced a lot.

    [akpm@linux-foundation.org: fix CONFIG_SWAP=n build]
    Signed-off-by: Shaohua Li
    Acked-by: Rik van Riel
    Acked-by: Minchan Kim
    Acked-by: Johannes Weiner
    Reviewed-by: Wanpeng Li
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • To prevent flooding the swap device with writebacks, frontswap backends
    need to count and limit the number of outstanding writebacks. The
    incrementing of the counter can be done before the call to
    __swap_writepage(). However, the caller must receive a notification
    when the writeback completes in order to decrement the counter.

    To achieve this functionality, this patch modifies __swap_writepage() to
    take the bio completion callback function as an argument.

    end_swap_bio_write(), the normal bio completion function, is also made
    non-static so that code doing the accounting can call it after the
    accounting is done.

    There should be no behavioural change to existing code.
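
    The accounting scheme can be sketched in userspace as follows. The
    queue, the limit of two outstanding writebacks and all demo_* names are
    assumptions made for illustration; they are not taken from the patch.
    The backend passes its own completion callback, decrements its counter
    there, and then calls the normal completion handler:

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_PENDING     8   /* capacity of the simulated device queue */
#define DEMO_WRITEBACK_LIMIT 2   /* hypothetical cap chosen by the backend */

typedef void (*demo_end_io_t)(int page_id);

static struct {
    int page_id;
    demo_end_io_t end_io;
} demo_pending[DEMO_MAX_PENDING];
static int demo_npending;
static int demo_completed;     /* writes finished by the normal handler */
static int demo_outstanding;   /* writebacks the backend is waiting for */

/* Low-level write: queue the page and remember whom to notify later. */
static int demo_submit_write(int page_id, demo_end_io_t end_io)
{
    assert(end_io != NULL);
    if (demo_npending == DEMO_MAX_PENDING)
        return -1;
    demo_pending[demo_npending].page_id = page_id;
    demo_pending[demo_npending].end_io = end_io;
    demo_npending++;
    return 0;
}

/* Simulate the device finishing the most recently queued write. */
static int demo_complete_one(void)
{
    if (demo_npending == 0)
        return -1;
    demo_npending--;
    demo_pending[demo_npending].end_io(demo_pending[demo_npending].page_id);
    return 0;
}

/* Normal completion handler, callable by other completion code. */
static void demo_end_swap_write(int page_id)
{
    (void)page_id;
    demo_completed++;
}

/* Backend handler: do the accounting, then run the normal handler. */
static void demo_backend_end_io(int page_id)
{
    demo_outstanding--;
    demo_end_swap_write(page_id);
}

/* Backend writeback: refuse new work while too many writes are in flight. */
static int demo_backend_writeback(int page_id)
{
    if (demo_outstanding >= DEMO_WRITEBACK_LIMIT)
        return -1;
    demo_outstanding++;
    if (demo_submit_write(page_id, demo_backend_end_io) != 0) {
        demo_outstanding--;
        return -1;
    }
    return 0;
}
```

    A third demo_backend_writeback() call is rejected until
    demo_complete_one() has run the callback and lowered the counter.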

    Signed-off-by: Seth Jennings
    Signed-off-by: Bob Liu
    Acked-by: Minchan Kim
    Reviewed-by: Dan Magenheimer
    Cc: Konrad Rzeszutek Wilk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Seth Jennings
     
  • swap_writepage() is currently where frontswap hooks into the swap write
    path to capture pages with the frontswap_store() function. However, if
    a frontswap backend wants to "resume" the writeback of a page to the
    swap device, it can't call swap_writepage() as the page will simply
    reenter the backend.

    This patch separates swap_writepage() into a top and bottom half, the
    bottom half named __swap_writepage() to allow a frontswap backend, like
    zswap, to resume writeback beyond the frontswap_store() hook.

    __add_to_swap_cache() is also made non-static so that the page for which
    writeback is to be resumed can be added to the swap cache.
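
    The split can be pictured with the following userspace sketch. The
    function names mirror the description only loosely and every demo_*
    identifier is hypothetical; counters replace real page I/O:

```c
#include <assert.h>

static int demo_backend_stores;   /* pages captured by the backend */
static int demo_device_writes;    /* pages written to the swap device */
static int demo_backend_full;     /* when set, the backend rejects stores */

/* Stand-in for the frontswap store hook: 0 means the backend took the page. */
static int demo_frontswap_store(int page_id)
{
    assert(page_id >= 0);
    if (demo_backend_full)
        return -1;
    demo_backend_stores++;
    return 0;
}

/* Bottom half: write to the device without consulting the backend. */
static int demo_swap_writepage_bottom(int page_id)
{
    (void)page_id;
    demo_device_writes++;
    return 0;
}

/* Top half: offer the page to the backend first, else fall through. */
static int demo_swap_writepage(int page_id)
{
    if (demo_frontswap_store(page_id) == 0)
        return 0;
    return demo_swap_writepage_bottom(page_id);
}

/* Backend eviction path: resume writeback below the store hook. */
static int demo_backend_resume_writeback(int page_id)
{
    return demo_swap_writepage_bottom(page_id);
}
```

    Because demo_backend_resume_writeback() enters below the store hook, the
    page reaches the device instead of re-entering the backend.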

    Signed-off-by: Seth Jennings
    Signed-off-by: Bob Liu
    Acked-by: Minchan Kim
    Reviewed-by: Dan Magenheimer
    Cc: Konrad Rzeszutek Wilk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Seth Jennings
     
  • Fix a corner case for MAP_FIXED when requested mapping length is larger
    than rlimit for virtual memory. In such case any overlapping mappings
    are unmapped before we check for the limit and return ENOMEM.

    The check is moved before the loop that unmaps overlapping parts of
    existing mappings. When we are about to hit the limit (currently mapped
    pages + len > limit) we scan for overlapping pages and check again
    accounting for them.
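
    The ordering of the checks can be modelled in userspace like this. It
    is a toy sketch over a 16-page address space with hypothetical names
    (demo_mapped, demo_count_mapped, demo_map_fixed), not the kernel's mmap
    code:

```c
#include <assert.h>
#include <errno.h>

#define DEMO_NPAGES 16   /* size of the toy address space, in pages */

static unsigned char demo_mapped[DEMO_NPAGES];   /* 1 = page is mapped */

/* Count mapped pages in [start, start + len). */
static unsigned long demo_count_mapped(unsigned long start, unsigned long len)
{
    unsigned long i, n = 0;

    assert(len <= DEMO_NPAGES && start <= DEMO_NPAGES - len);
    for (i = start; i < start + len; i++)
        n += demo_mapped[i];
    return n;
}

/* MAP_FIXED-like request; the limit is checked before anything is unmapped. */
static int demo_map_fixed(unsigned long start, unsigned long len,
                          unsigned long limit)
{
    unsigned long total, i;

    if (len == 0 || len > DEMO_NPAGES || start > DEMO_NPAGES - len)
        return -EINVAL;

    total = demo_count_mapped(0, DEMO_NPAGES);
    if (total + len > limit) {
        /* Pages about to be replaced do not add to the total. */
        unsigned long overlap = demo_count_mapped(start, len);

        if (total + len - overlap > limit)
            return -ENOMEM;   /* existing mappings are left untouched */
    }

    /* Only now replace whatever overlapped with the new mapping. */
    for (i = start; i < start + len; i++)
        demo_mapped[i] = 1;
    return 0;
}
```

    A request that would exceed the limit returns -ENOMEM with the page
    table unchanged, while an overlapping request that fits once the
    overlap is discounted still succeeds.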

    This fixes the situation where a userspace program expects that the
    previous mappings are preserved after the mmap() syscall has returned
    with an error. (POSIX clearly states that a successful mapping shall
    replace any previous mappings.)

    This corner case was found and can be tested with LTP testcase:

    testcases/open_posix_testsuite/conformance/interfaces/mmap/24-2.c

    In this case the mmap, which is clearly over current limit, unmaps
    dynamic libraries and the testcase segfaults right after returning into
    userspace.

    I've also looked at the second instance of the unmapping loop in the
    do_brk(). The do_brk() is called from brk() syscall and from vm_brk().
    The brk() syscall checks for overlapping mappings and bails out when
    there are any (so it can't be triggered from the brk syscall). The
    vm_brk() is called only from binfmt handlers so it shouldn't be
    triggered unless a binfmt handler created overlapping mappings.

    Signed-off-by: Cyril Hrubis
    Reviewed-by: Mel Gorman
    Reviewed-by: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyril Hrubis
     
  • With this patch userland applications that want to maintain the
    interactivity/memory allocation cost can use the pressure level
    notifications. The levels are defined like this:

    The "low" level means that the system is reclaiming memory for new
    allocations. Monitoring this reclaiming activity might be useful for
    maintaining cache level. Upon notification, the program (typically
    "Activity Manager") might analyze vmstat and act in advance (i.e.
    prematurely shut down unimportant services).

    The "medium" level means that the system is experiencing medium memory
    pressure, the system might be swapping, paging out active file
    caches, etc. Upon this event applications may decide to further analyze
    vmstat/zoneinfo/memcg or internal memory usage statistics and free any
    resources that can be easily reconstructed or re-read from a disk.

    The "critical" level means that the system is actively thrashing, it is
    about to run out of memory (OOM) or even the in-kernel OOM killer is on its
    way to trigger. Applications should do whatever they can to help the
    system. It might be too late to consult with vmstat or any other
    statistics, so it's advisable to take an immediate action.

    The events are propagated upward until the event is handled, i.e. the
    events are not pass-through. Here is what this means: for example you
    have three cgroups: A->B->C. Now you set up an event listener on
    cgroups A, B and C, and suppose group C experiences some pressure. In
    this situation, only group C will receive the notification, i.e. groups
    A and B will not receive it. This is done to avoid excessive
    "broadcasting" of messages, which disturbs the system and which is
    especially bad if we are low on memory or thrashing. So, organize the
    cgroups wisely, or propagate the events manually (or, ask us to
    implement the pass-through events, explaining why you would need them.)
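
    The propagation rule can be sketched in userspace as below. The
    structure and function names (demo_cgroup, demo_pressure_event) are
    hypothetical and the sketch ignores pressure levels; it only shows that
    the nearest group with a listener consumes the event:

```c
#include <assert.h>
#include <stddef.h>

struct demo_cgroup {
    struct demo_cgroup *parent;   /* NULL for the root */
    int has_listener;             /* an event listener is registered */
    int events;                   /* notifications delivered so far */
};

/* Deliver to the nearest listening group; return it, or NULL if none. */
static struct demo_cgroup *demo_pressure_event(struct demo_cgroup *cg)
{
    assert(cg != NULL);
    for (; cg != NULL; cg = cg->parent) {
        if (cg->has_listener) {
            cg->events++;
            return cg;   /* handled: ancestors are not notified */
        }
    }
    return NULL;
}
```

    With listeners on A, B and C in the A->B->C example, pressure in C is
    delivered to C only; if C has no listener, B receives it and A still
    does not.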

    Performance wise, the memory pressure notifications feature itself is
    lightweight and does not require much bookkeeping, in contrast to the
    rest of memcg features. Unfortunately, as of the current memcg
    implementation, page accounting is an inseparable part and cannot be
    turned off. The good news is that there are some efforts[1] to improve
    the situation; plus, implementing the same, fully API-compatible[2]
    interface for CONFIG_MEMCG=n case (e.g. embedded) is also a viable
    option, so it will not require any changes on the userland side.

    [1] http://permalink.gmane.org/gmane.linux.kernel.cgroups/6291
    [2] http://lkml.org/lkml/2013/2/21/454

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix CONFIG_CGROUPS=n warnings]
    Signed-off-by: Anton Vorontsov
    Acked-by: Kirill A. Shutemov
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Tejun Heo
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Mel Gorman
    Cc: Glauber Costa
    Cc: Michal Hocko
    Cc: Luiz Capitulino
    Cc: Greg Thelen
    Cc: Leonid Moiseichuk
    Cc: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: Bartlomiej Zolnierkiewicz
    Cc: John Stultz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Vorontsov
     
  • In madvise(), there doesn't seem to be any reason for taking the
    &current->mm->mmap_sem before start and len_in have been validated.
    Incidentally, this removes the need for the out: label.

    [akpm@linux-foundation.org: s/out_plug/out/, per David]
    Signed-off-by: Rasmus Villemoes
    Acked-by: KOSAKI Motohiro
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • __remove_pages() is only necessary for CONFIG_MEMORY_HOTREMOVE. PowerPC
    pseries will return -EOPNOTSUPP if unsupported.

    Adding an #ifdef causes several other functions it depends on to also
    become unnecessary, which saves in .text when disabled (it's disabled in
    most defconfigs besides powerpc, including x86). remove_memory_block()
    becomes static since it is not referenced outside of
    drivers/base/memory.c.

    Build tested on x86 and powerpc with CONFIG_MEMORY_HOTREMOVE both enabled
    and disabled.

    Signed-off-by: David Rientjes
    Acked-by: Toshi Kani
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Greg Kroah-Hartman
    Cc: Wen Congyang
    Cc: Tang Chen
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Change __remove_pages() to call release_mem_region_adjustable(). This
    allows a requested memory range to be released from the iomem_resource
    table even if it does not exactly match a resource entry but still
    fits within one. The resource entries initialized at bootup usually
    cover whole contiguous memory ranges and may not necessarily match the
    size of memory hot-delete requests.

    If release_mem_region_adjustable() fails, __remove_pages() emits a
    warning message and continues, as was the case with
    release_mem_region(). release_mem_region(), which is defined as
    __release_region(), emits a warning message and returns no error since
    it is a void function.

    Signed-off-by: Toshi Kani
    Reviewed-by: Yasuaki Ishimatsu
    Acked-by: David Rientjes
    Cc: Ram Pai
    Cc: T Makphaibulchoke
    Cc: Wen Congyang
    Cc: Tang Chen
    Cc: Jiang Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
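
The idea of releasing a range that fits inside a larger entry without matching it exactly can be sketched as follows. This is an illustration only, not the kernel's release_mem_region_adjustable(): struct range and release_adjustable() are hypothetical names, bounds are inclusive, and the sketch reports the surviving pieces to the caller instead of editing a resource tree.

```c
#include <errno.h>

/* Hypothetical resource entry with inclusive bounds. */
struct range {
    unsigned long start;
    unsigned long end;
};

/*
 * Remove [start, start + size - 1] from *res.
 *
 * Returns the number of pieces of *res that remain (0, 1 or 2) and
 * stores them in out[], or -EINVAL if the request is empty, wraps
 * around, or does not fit entirely inside *res.
 */
static int release_adjustable(const struct range *res, unsigned long start,
                              unsigned long size, struct range out[2])
{
    unsigned long end;
    int n = 0;

    if (size == 0)
        return -EINVAL;
    end = start + size - 1;
    if (end < start)
        return -EINVAL;
    if (start < res->start || end > res->end)
        return -EINVAL;

    /* part of the entry below the released range survives */
    if (start > res->start) {
        out[n].start = res->start;
        out[n].end = start - 1;
        n++;
    }
    /* part of the entry above the released range survives */
    if (end < res->end) {
        out[n].start = end + 1;
        out[n].end = res->end;
        n++;
    }
    return n;
}
```

A return value of 0 corresponds to an exact match, 1 to trimming the head or the tail of the entry, and 2 to a release from the middle that splits the entry in two.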
     
  • The comment over migrate_pages() looks quite weird, and makes it hard to
    grasp what it is trying to say. Rewrite it more comprehensibly.

    Signed-off-by: Srivatsa S. Bhat
    Acked-by: Christoph Lameter
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Srivatsa S. Bhat
     
  • Currently the memory barrier in __do_huge_pmd_anonymous_page doesn't
    work, because lru_cache_add_lru uses a pagevec and so can easily skip
    taking the spinlock. The rule stated in the existing comment is
    therefore broken, and the user might see inconsistent data.

    I was not the first person to point out the problem. Mel and Peter
    pointed it out a few months ago, and Peter pointed out further that
    even spin_lock/unlock can't guarantee it:

    http://marc.info/?t=134333512700004

    In particular:

    *A = a;
    LOCK
    UNLOCK
    *B = b;

    may occur as:

    LOCK, STORE *B, STORE *A, UNLOCK

    Finally, Hugh pointed out that we don't even need a memory barrier
    there, because __SetPageUptodate already provides one explicitly since
    Nick's commit 0ed361dec369 ("mm: fix PageUptodate data race").

    So this patch fixes the comment on THP and adds the same comment to
    do_anonymous_page, too, because everybody except Hugh had missed that,
    which means we need a comment about it.

    Signed-off-by: Minchan Kim
    Acked-by: Andrea Arcangeli
    Acked-by: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
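
The pairing this commit relies on, a write barrier before the "uptodate" flag is set and a read barrier after the flag is seen, can be approximated in portable C11. The sketch below is an analogy and not kernel code: struct sketch_page, sketch_publish(), and sketch_read() are hypothetical names, and a release store and acquire load stand in for the kernel's barriers around the flag.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical page: a payload plus an "uptodate" flag. */
struct sketch_page {
    char data[64];
    atomic_bool uptodate;
};

/* Writer: fill the payload, then publish it with a release store. */
static void sketch_publish(struct sketch_page *p, const char *src)
{
    size_t len = strlen(src);

    if (len >= sizeof(p->data))
        len = sizeof(p->data) - 1;
    memcpy(p->data, src, len);
    p->data[len] = '\0';

    /* All stores above become visible before the flag reads as true. */
    atomic_store_explicit(&p->uptodate, true, memory_order_release);
}

/* Reader: only look at the payload after an acquire load sees the flag. */
static const char *sketch_read(struct sketch_page *p)
{
    if (!atomic_load_explicit(&p->uptodate, memory_order_acquire))
        return NULL;
    return p->data;
}
```

With the ordering attached to the flag itself, a reader that observes the flag also observes the payload, regardless of whether any lock happens to be taken between the two stores.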
     
  • CONFIG_HOTPLUG is going away as an option; clean up the CONFIG_HOTPLUG
    ifdefs in mm files.

    Signed-off-by: Yijing Wang
    Acked-by: Greg Kroah-Hartman
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yijing Wang
     
  • Just a trivial issue I stumbled on while doing something else...

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Alter the admin and user reserves of the previous patches in this series
    when memory is added or removed.

    If memory is added and the reserves have been eliminated or increased
    above the default max, then we'll trust the admin.

    If memory is removed and there isn't enough free memory, then we need to
    reset the reserves.

    Otherwise keep the reserve set by the admin.

    The reserve reset code is the same as the reserve initialization code.

    I tested hot addition and removal by triggering it via sysfs. The
    reserves shrank when they were set high and memory was removed. They
    were reset higher when memory was added again.

    [akpm@linux-foundation.org: use register_hotmemory_notifier()]
    [akpm@linux-foundation.org: init_user_reserve() and init_admin_reserve() can no longer be __meminit]
    [fengguang.wu@intel.com: make init_reserve_notifier() static]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Andrew Shewmaker
    Signed-off-by: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Shewmaker
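
The policy in this commit message can be written down as a small decision function. This is a sketch under stated assumptions rather than the patch itself: enum mem_event, init_reserve(), and adjust_reserve() are hypothetical names, all sizes are in kilobytes, and the initialization rule (a fixed fraction of free memory, capped at a default maximum) is assumed for illustration because the message does not spell it out.

```c
/* Hypothetical event type for this sketch. */
enum mem_event { MEM_ADDED, MEM_REMOVED };

/*
 * Assumed initialization rule: 1/32 of free memory, capped at
 * default_max_kb. The fraction is an assumption of this sketch.
 */
static unsigned long init_reserve(unsigned long free_kb,
                                  unsigned long default_max_kb)
{
    unsigned long r = free_kb / 32;

    return r < default_max_kb ? r : default_max_kb;
}

/* Return the reserve to use after a memory hot-add or hot-remove. */
static unsigned long adjust_reserve(enum mem_event ev,
                                    unsigned long reserve_kb,
                                    unsigned long free_kb,
                                    unsigned long default_max_kb)
{
    switch (ev) {
    case MEM_ADDED:
        /*
         * A reserve of 0 (eliminated) or one at or above the
         * default max was set deliberately: trust the admin.
         * Anything in between is recomputed.
         */
        if (reserve_kb > 0 && reserve_kb < default_max_kb)
            return init_reserve(free_kb, default_max_kb);
        break;
    case MEM_REMOVED:
        /* Not enough free memory left to honour the reserve. */
        if (reserve_kb > free_kb)
            return init_reserve(free_kb, default_max_kb);
        break;
    }
    /* Otherwise keep the reserve set by the admin. */
    return reserve_kb;
}
```

Reusing init_reserve() for the reset mirrors the statement that the reserve reset code is the same as the reserve initialization code.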