23 Sep, 2009

3 commits

  • Move various magic-number definitions into magic.h.

    Signed-off-by: Nick Black
    Acked-by: Pekka Enberg
    Cc: Al Viro
    Cc: "David S. Miller"
    Cc: Casey Schaufler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Black
     
  • When syslog is not available and there is also no serial/net console,
    it can be hard to read printk messages, for example oops/panic/warning
    messages during the shutdown phase.

    Add a printk delay feature so that each printk message can be delayed by
    some milliseconds.

    The delay is set via the proc/sysctl interface: /proc/sys/kernel/printk_delay

    The value ranges from 0 to 10000 milliseconds; the default is 0. (A small
    usage sketch follows this entry.)

    [akpm@linux-foundation.org: fix a few things]
    Signed-off-by: Dave Young
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Young
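
    A minimal userspace sketch of driving the sysctl described above; the
    100 ms value is only an illustration:

        /* Sketch: delay every printk line by 100 ms via the sysctl file. */
        #include <stdio.h>

        int main(void)
        {
                FILE *f = fopen("/proc/sys/kernel/printk_delay", "w");

                if (!f) {
                        perror("printk_delay");
                        return 1;
                }
                fprintf(f, "100\n");    /* milliseconds per message, range 0..10000 */
                return fclose(f) ? 1 : 0;
        }

    Writing 0 restores the default (no delay).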
     
  • Fix compiler warnings of the form

    include/net/inet_sock.h:208: warning: ISO C90 forbids mixed declarations and code

    Cc: Johannes Berg
    Acked-by: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

22 Sep, 2009

37 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/vegard/kmemcheck:
    kmemcheck: add missing braces to do-while in kmemcheck_annotate_bitfield
    kmemcheck: update documentation
    kmemcheck: depend on HAVE_ARCH_KMEMCHECK
    kmemcheck: remove useless check
    kmemcheck: remove duplicated #include

    Linus Torvalds
     
  • * 'for-2.6.32' of git://linux-nfs.org/~bfields/linux: (68 commits)
    nfsd4: nfsv4 clients should cross mountpoints
    nfsd: revise 4.1 status documentation
    sunrpc/cache: avoid variable over-loading in cache_defer_req
    sunrpc/cache: use list_del_init for the list_head entries in cache_deferred_req
    nfsd: return success for non-NFS4 nfs4_state_start
    nfsd41: Refactor create_client()
    nfsd41: modify nfsd4.1 backchannel to use new xprt class
    nfsd41: Backchannel: Implement cb_recall over NFSv4.1
    nfsd41: Backchannel: cb_sequence callback
    nfsd41: Backchannel: Setup sequence information
    nfsd41: Backchannel: Server backchannel RPC wait queue
    nfsd41: Backchannel: Add sequence arguments to callback RPC arguments
    nfsd41: Backchannel: callback infrastructure
    nfsd4: use common rpc_cred for all callbacks
    nfsd4: allow nfs4 state startup to fail
    SUNRPC: Defer the auth_gss upcall when the RPC call is asynchronous
    nfsd4: fix null dereference creating nfsv4 callback client
    nfsd4: fix whitespace in NFSPROC4_CLNT_CB_NULL definition
    nfsd41: sunrpc: add new xprt class for nfsv4.1 backchannel
    sunrpc/cache: simplify cache_fresh_locked and cache_fresh_unlocked.
    ...

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
    trivial: fix typo in aic7xxx comment
    trivial: fix comment typo in drivers/ata/pata_hpt37x.c
    trivial: typo in kernel-parameters.txt
    trivial: fix typo in tracing documentation
    trivial: add __init/__exit macros in drivers/gpio/bt8xxgpio.c
    trivial: add __init macro/ fix of __exit macro location in ipmi_poweroff.c
    trivial: remove unnecessary semicolons
    trivial: Fix duplicated word "options" in comment
    trivial: kbuild: remove extraneous blank line after declaration of usage()
    trivial: improve help text for mm debug config options
    trivial: doc: hpfall: accept disk device to unload as argument
    trivial: doc: hpfall: reduce risk that hpfall can do harm
    trivial: SubmittingPatches: Fix reference to renumbered step
    trivial: fix typos "man[ae]g?ment" -> "management"
    trivial: media/video/cx88: add __init/__exit macros to cx88 drivers
    trivial: fix typo in CONFIG_DEBUG_FS in gcov doc
    trivial: fix missing printk space in amd_k7_smp_check
    trivial: fix typo s/ketymap/keymap/ in comment
    trivial: fix typo "to to" in multiple files
    trivial: fix typos in comments s/DGBU/DBGU/
    ...

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid:
    HID: Remove duplicate Kconfig entry
    HID: consolidate connect and disconnect into core code
    HID: fix non-atomic allocation in hid_input_report

    Linus Torvalds
     
  • The shutdown method is used by the winbond cir driver to set up the
    hardware for wake-from-S5.

    Signed-off-by: Bjorn Helgaas
    Signed-off-by: David Härdeman
    Cc: Dmitry Torokhov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Härdeman
     
  • This offers a way for platforms to define flags and thresholds for the
    free-fall/wakeup functions of the lis302d chips.

    More registers needed to be separated as they are specific to the

    Signed-off-by: Daniel Mack
    Acked-by: Pavel Machek
    Cc: Eric Piel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Mack
     
  • Bit 0x80 in CTRL_REG3 is an ACTIVE_LOW rather than an ACTIVE_HIGH
    function; I got that wrong during my last change.

    Signed-off-by: Daniel Mack
    Acked-by: Pavel Machek
    Cc: Eric Piel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Mack
     
  • FLEX_ARRAY_INIT(element_size, total_nr_elements) cannot determine whether
    either parameter is valid, so flex arrays that are statically allocated
    with this interface can easily become corrupted or reference memory beyond
    their allocation.

    This removes FLEX_ARRAY_INIT() as a struct flex_array initializer since no
    initializer may perform the required checking. Instead, the array is now
    defined with a new interface:

    DEFINE_FLEX_ARRAY(name, element_size, total_nr_elements)

    This may be prefixed with `static' for file scope, as in the sketch at the
    end of this entry.

    This interface includes compile-time checking of the parameters to ensure
    they are valid. Since the validity of both element_size and
    total_nr_elements depend on FLEX_ARRAY_BASE_SIZE and FLEX_ARRAY_PART_SIZE,
    the kernel build will fail if either of these predefined values changes
    such that the array parameters are no longer valid.

    Since BUILD_BUG_ON() requires compile time constants, several of the
    static inline functions that were once local to lib/flex_array.c had to be
    moved to include/linux/flex_array.h.

    Signed-off-by: David Rientjes
    Acked-by: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
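
    A minimal sketch of the new interface, assuming kernel context; the struct
    and identifiers here are hypothetical, not taken from the patch:

        #include <linux/flex_array.h>
        #include <linux/gfp.h>
        #include <linux/types.h>

        struct sample {
                u32 hits;
                u32 misses;
        };

        /* Statically defined; element_size and total are checked at compile time. */
        static DEFINE_FLEX_ARRAY(sample_stats, sizeof(struct sample), 128);

        static int record_sample(unsigned int idx)
        {
                struct sample s = { .hits = 1 };

                /* Backing parts are still allocated on demand at put time. */
                return flex_array_put(&sample_stats, idx, &s, GFP_KERNEL);
        }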
     
  • Add a new function to the flex_array API:

    int flex_array_shrink(struct flex_array *fa)

    This function will free all unused second-level pages. Since elements are
    now poisoned if they are not allocated with __GFP_ZERO, it's possible to
    identify parts that consist solely of unused elements.

    flex_array_shrink() returns the number of pages freed.

    Signed-off-by: David Rientjes
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
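
    A small sketch of how the new call might be used, assuming kernel context;
    the surrounding function is hypothetical:

        #include <linux/flex_array.h>
        #include <linux/kernel.h>

        /* After many elements have been cleared, return unused part pages. */
        static void compact_table(struct flex_array *fa)
        {
                int freed = flex_array_shrink(fa);      /* number of pages freed */

                pr_debug("flex_array: freed %d part pages\n", freed);
        }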
     
  • Newly initialized flex_arrays and/or flex_array_parts are now poisoned
    with a new poison value, FLEX_ARRAY_FREE. Its value is similar to
    POISON_FREE used in the various slab allocators, but is different so that
    flex-array-poisoned kmem can be distinguished from slab-allocator-poisoned
    kmem.

    This will allow us to identify flex_array_part's that only contain free
    elements (and free them with an addition to the flex_array API). This
    could also be extended in the future to identify `get' uses on elements
    that have not been `put'.

    If __GFP_ZERO is passed for a part's gfp mask, the poisoning is avoided.
    These elements are considered to be in-use since they have been
    initialized.

    Signed-off-by: David Rientjes
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Add a new function to the flex_array API:

    int flex_array_clear(struct flex_array *fa,
                         unsigned int element_nr)

    This function will zero the element at element_nr in the flex_array.

    Although this is equivalent to using flex_array_put() and passing a
    pointer to zero'd memory, flex_array_clear() does not require such a
    pointer to memory that would most likely need to be allocated on the
    caller's stack which could be significantly large depending on
    element_size.

    Signed-off-by: David Rientjes
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
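
    A minimal sketch, assuming kernel context; drop_slot() is a hypothetical
    caller:

        #include <linux/flex_array.h>

        /*
         * Release one slot. Without flex_array_clear() the caller would need a
         * zeroed element_size-sized buffer of its own plus flex_array_put().
         */
        static int drop_slot(struct flex_array *fa, unsigned int idx)
        {
                return flex_array_clear(fa, idx);
        }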
     
  • Fix the menu idle governor which balances power savings, energy efficiency
    and performance impact.

    The reason for a reworked governor is that there have been serious
    performance issues reported with the existing code on Nehalem server
    systems.

    To show this I'm sure Andrew wants to see benchmark results:
    (benchmark is "fio", "no cstates" is using "idle=poll")

                 no cstates    current linux    new algorithm
    1 disk        107 Mb/s          85 Mb/s         105 Mb/s
    2 disks       215 Mb/s         123 Mb/s         209 Mb/s
    12 disks      590 Mb/s         320 Mb/s         585 Mb/s

    In various power benchmark measurements, no degradation was found by our
    measurement & diagnostics team. Obviously a small percentage more power
    was used in the "fio" benchmark, due to the much higher performance.

    While it would be a novel idea to describe the new algorithm in this
    commit message, I cheaped out and described it in comments in the code
    instead.

    [changes since first post: spelling fixes from akpm, review feedback,
    folded menu-tng into menu.c]

    Signed-off-by: Arjan van de Ven
    Cc: Venkatesh Pallipadi
    Cc: Len Brown
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Yanmin Zhang
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     
  • Anyone who wants to do copy to/from user from a kernel thread needs
    use_mm (like what fs/aio has). Move that into mm/, to make reusing and
    exporting easier down the line, and make aio use it. The next intended
    user, besides aio, will be vhost-net. (A small sketch follows this entry.)

    Acked-by: Andrea Arcangeli
    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael S. Tsirkin
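
    A hedged sketch of the pattern this enables: a kernel thread temporarily
    adopting a user mm so copy_to_user() targets that process's address space.
    The mm and user pointer are assumed to be handed over by the owning task;
    the function name is hypothetical:

        #include <linux/mmu_context.h>
        #include <linux/uaccess.h>
        #include <linux/errno.h>

        static int push_result(struct mm_struct *mm, int __user *uptr, int value)
        {
                int ret;

                use_mm(mm);             /* switch this kthread onto the user mm */
                ret = copy_to_user(uptr, &value, sizeof(value)) ? -EFAULT : 0;
                unuse_mm(mm);           /* drop back to the lazy/anonymous mm */

                return ret;
        }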
     
  • Add a flag for mmap that will be used to request a huge page region that
    will look like anonymous memory to userspace. This is accomplished by
    using a file on the internal vfsmount. MAP_HUGETLB is a modifier of
    MAP_ANONYMOUS and so must be specified with it. The region will behave
    the same as a MAP_ANONYMOUS region using small pages.

    [akpm@linux-foundation.org: fix arch definitions of MAP_HUGETLB]
    Signed-off-by: Eric B Munson
    Acked-by: David Rientjes
    Cc: Mel Gorman
    Cc: Adam Litke
    Cc: David Gibson
    Cc: Lee Schermerhorn
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
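
    A userspace sketch of the new flag, applying to this and the two related
    MAP_HUGETLB entries below; the 2 MB length and the fallback #define are
    assumptions for illustration (the real values are architecture specific):

        #include <stdio.h>
        #include <sys/mman.h>

        #ifndef MAP_HUGETLB
        #define MAP_HUGETLB 0x40000     /* x86 value, for this sketch only */
        #endif

        int main(void)
        {
                size_t len = 2UL * 1024 * 1024;         /* one 2 MB huge page */
                void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

                if (p == MAP_FAILED) {
                        perror("mmap(MAP_HUGETLB)");    /* e.g. no huge pages reserved */
                        return 1;
                }
                /* Behaves like an ordinary anonymous mapping, backed by huge pages. */
                munmap(p, len);
                return 0;
        }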
     
  • Add a flag for mmap that will be used to request a huge page region that
    will look like anonymous memory to user space. This is accomplished by
    using a file on the internal vfsmount. MAP_HUGETLB is a modifier of
    MAP_ANONYMOUS and so must be specified with it. The region will behave
    the same as a MAP_ANONYMOUS region using small pages.

    The patch also adds the MAP_STACK flag, which was previously defined only
    on some architectures but not on others. Since MAP_STACK is meant to be a
    hint only, architectures can define it without assigning a specific
    meaning to it.

    Signed-off-by: Arnd Bergmann
    Cc: Eric B Munson
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     
  • This patchset adds a flag to mmap that allows the user to request that an
    anonymous mapping be backed with huge pages. This mapping will borrow
    functionality from the huge page shm code to create a file on the kernel
    internal mount and use it to approximate an anonymous mapping. The
    MAP_HUGETLB flag is a modifier to MAP_ANONYMOUS and will not work without
    both flags being present.

    A new flag is necessary because there is no other way to hook into huge
    pages without creating a file on a hugetlbfs mount which wouldn't be
    MAP_ANONYMOUS.

    To userspace, this mapping will behave just like an anonymous mapping
    because the file is not accessible outside of the kernel.

    This patchset is meant to simplify the programming model. Presently there
    is a large chunk of boilerplate code, contained in libhugetlbfs, required
    to create private, hugepage-backed mappings. This patchset would allow
    use of hugepages without linking to libhugetlbfs or having hugetlbfs
    mounted.

    Unification of the VM code would provide these same benefits, but it has
    been resisted each time that it has been suggested for several reasons: it
    would break PAGE_SIZE assumptions across the kernel, it makes page-table
    abstractions really expensive, and it does not provide any benefit on
    architectures that do not support huge pages, incurring fast path
    penalties without providing any benefit on these architectures.

    This patch:

    There are two means of creating mappings backed by huge pages:

    1. mmap() a file created on hugetlbfs
    2. Use shm which creates a file on an internal mount which essentially
       maps it MAP_SHARED

    The internal mount is only used for shared mappings but there is very
    little that stops it being used for private mappings. This patch extends
    hugetlbfs_file_setup() to deal with the creation of files that will be
    mapped MAP_PRIVATE on the internal hugetlbfs mount. This extended API is
    used in a subsequent patch to implement the MAP_HUGETLB mmap() flag.

    Signed-off-by: Eric Munson
    Acked-by: David Rientjes
    Cc: Mel Gorman
    Cc: Adam Litke
    Cc: David Gibson
    Cc: Lee Schermerhorn
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     
  • CONFIG_SHMEM off gives you (ramfs masquerading as) tmpfs, even when
    CONFIG_TMPFS is off: that's a little anomalous, and I'd intended to make
    more sense of it by removing CONFIG_TMPFS altogether, always enabling its
    code when CONFIG_SHMEM; but so many defconfigs have CONFIG_SHMEM on
    CONFIG_TMPFS off that we'd better leave that as is.

    But there is no point in asking for CONFIG_TMPFS if CONFIG_SHMEM is off:
    make TMPFS depend on SHMEM, which also prevents TMPFS_POSIX_ACL
    shmem_acl.o being pointlessly built into the kernel when SHMEM is off.

    And a selfish change, to prevent the world from being rebuilt when I
    switch between CONFIG_SHMEM on and off: the only CONFIG_SHMEM in the
    header files is mm.h shmem_lock() - give that a shmem.c stub instead.

    Signed-off-by: Hugh Dickins
    Acked-by: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • __get_user_pages() has been taking its own GUP flags, then processing
    them into FOLL flags for follow_page(). Though oddly named, the FOLL
    flags are more widely used, so pass them to __get_user_pages() now.
    Sorry, VM flags, VM_FAULT flags and FAULT_FLAGs are still distinct.

    (The patch to __get_user_pages() looks peculiar, with both gup_flags
    and foll_flags: the gup_flags remain constant; but as before there's
    an exceptional case, out of scope of the patch, in which foll_flags
    per page have FOLL_WRITE masked off.)

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • follow_hugetlb_page() shouldn't be guessing about the coredump case
    either: pass the foll_flags down to it, instead of just the write bit.

    Remove that obscure huge_zeropage_ok() test. The decision is easy,
    though unlike the non-huge case - here vm_ops->fault is always set.
    But we know that a fault would serve up zeroes, unless there's
    already a hugetlbfs pagecache page to back the range.

    (Alternatively, since hugetlb pages aren't swapped out under pressure,
    you could save more dump space by arguing that a page not yet faulted
    into this process cannot be relevant to the dump; but that would be
    more surprising.)

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The "FOLL_ANON optimization" and its use_zero_page() test have caused
    confusion and bugs: why does it test VM_SHARED? for the very good but
    unsatisfying reason that VMware crashed without it. As we look to maybe
    reinstating anonymous use of the ZERO_PAGE, we need to sort this out.

    Easily done: it's silly for __get_user_pages() and follow_page() to
    be guessing whether it's safe to assume that they're being used for
    a coredump (which can take a shortcut snapshot where other uses must
    handle a fault) - just tell them with GUP_FLAGS_DUMP and FOLL_DUMP.

    get_dump_page() doesn't even want a ZERO_PAGE: an error suits fine.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • In preparation for the next patch, add a simple get_dump_page(addr)
    interface for the CONFIG_ELF_CORE dumpers to use, instead of calling
    get_user_pages() directly. They're not interested in errors: they
    just want to use holes as much as possible, to save space and make
    sure that the data is aligned where the headers said it would be.

    Oh, and don't use that horrid DUMP_SEEK(off) macro!

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
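
    A hedged sketch of how a dumper might consume the new interface; this is a
    fragment with placeholder write/seek steps, not the actual binfmt_elf code:

        /* One page of the core dump: NULL from get_dump_page() means a hole. */
        struct page *page = get_dump_page(addr);

        if (page) {
                void *kaddr = kmap(page);
                /* ... write PAGE_SIZE bytes from kaddr to the core file ... */
                kunmap(page);
                page_cache_release(page);
        } else {
                /* ... seek forward PAGE_SIZE bytes, leaving a hole ... */
        }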
     
  • The following two patches remove searching in the page allocator fast-path
    by maintaining multiple free-lists in the per-cpu structure. At the time
    the search was introduced, increasing the per-cpu structures would waste a
    lot of memory as per-cpu structures were statically allocated at
    compile-time. This is no longer the case.

    The patches are as follows. They are based on mmotm-2009-08-27.

    Patch 1 adds multiple lists to struct per_cpu_pages, one per
    migratetype that can be stored on the PCP lists.

    Patch 2 notes that the pcpu drain path checks empty lists multiple times.
    The patch reduces the number of checks by maintaining a count of free
    lists encountered. Lists containing pages will then free multiple
    pages in batch.

    The patches were tested with kernbench, netperf udp/tcp, hackbench and
    sysbench. The netperf tests were not bound to any CPU in particular and
    were run such that the results should be 99% confidence that the reported
    results are within 1% of the estimated mean. sysbench was run with a
    postgres background and read-only tests. Similar to netperf, it was run
    multiple times so that its 99%-confidence results are within 1%. The
    patches were tested on x86, x86-64 and ppc64 as follows:

    x86: Intel Pentium D 3GHz with 8G RAM (no-brand machine)
    kernbench - No significant difference, variance well within noise
    netperf-udp - 1.34% to 2.28% gain
    netperf-tcp - 0.45% to 1.22% gain
    hackbench - Small variances, very close to noise
    sysbench - Very small gains

    x86-64: AMD Phenom 9950 1.3GHz with 8G RAM (no-brand machine)
    kernbench - No significant difference, variance well within noise
    netperf-udp - 1.83% to 10.42% gains
    netperf-tcp - Not conclusive until buffer >= PAGE_SIZE
                    4096: +15.83%
                    8192: + 0.34% (not significant)
                   16384: + 1%
    hackbench - Small gains, very close to noise
    sysbench - 0.79% to 1.6% gain

    ppc64: PPC970MP 2.5GHz with 10GB RAM (it's a terrasoft powerstation)
    kernbench - No significant difference, variance well within noise
    netperf-udp - 2-3% gain for almost all buffer sizes tested
    netperf-tcp - losses on small buffers, gains on larger buffers
    possibly indicates some bad caching effect.
    hackbench - No significant difference
    sysbench - 2-4% gain

    This patch:

    Currently the per-cpu page allocator searches the PCP list for pages of
    the correct migrate-type to reduce the possibility of pages being
    inappropriately placed from a fragmentation perspective. This search is
    potentially expensive in a fast path and undesirable. Splitting the
    per-cpu list into multiple lists increases the size of a per-cpu structure,
    and this was potentially a major problem at the time the search was
    introduced. This problem has been mitigated as now only the necessary
    number of structures is allocated for the running system.

    This patch replaces a list search in the per-cpu allocator with one list
    per migrate type. The potential snag with this approach is when bulk
    freeing pages. We round-robin free pages based on migrate type which has
    little bearing on the cache hotness of the page and potentially checks
    empty lists repeatedly in the event the majority of PCP pages are of one
    type.

    Signed-off-by: Mel Gorman
    Acked-by: Nick Piggin
    Cc: Christoph Lameter
    Cc: Minchan Kim
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently, the OOM logic call flow is:

    __out_of_memory()
        select_bad_process()         for each task
            badness()                calculate badness of one task
        oom_kill_process()           search child
            oom_kill_task()          kill target task and mm shared tasks with it

    For example, suppose process-A has two threads, thread-A and thread-B, uses
    a large amount of memory, and each thread has the following oom_adj and
    oom_score:

    thread-A: oom_adj = OOM_DISABLE, oom_score = 0
    thread-B: oom_adj = 0, oom_score = very-high

    Then select_bad_process() selects thread-B, but oom_kill_task() refuses to
    kill the task because thread-A has OOM_DISABLE. Thus __out_of_memory()
    calls select_bad_process() again, but select_bad_process() selects the same
    task. This means the kernel falls into a livelock.

    The fact is, select_bad_process() must select a killable task, otherwise
    the OOM logic goes into a livelock.

    The root cause is that oom_adj shouldn't be a per-thread value; it should
    be a per-process value because the OOM killer kills a process, not a
    thread. This patch therefore moves oomkilladj (now more appropriately named
    oom_adj) from struct task_struct to struct signal_struct, which naturally
    prevents select_bad_process() from choosing the wrong task.

    Signed-off-by: KOSAKI Motohiro
    Cc: Paul Menage
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1, in
    which case shrink_list() _still_ calls isolate_pages() with the much
    larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list scan
    rate by up to 32 times.

    For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
    So when shrink_zone() expects to scan 4 pages in the active/inactive list,
    4 pages of the active list will be scanned, while in effect the inactive
    list will be (over)scanned by SWAP_CLUSTER_MAX=32 pages. And that could
    break the balance between the two lists.

    It can further impact the scan of anon active list, due to the anon
    active/inactive ratio rebalance logic in balance_pgdat()/shrink_zone():

    inactive anon list over scanned => inactive_anon_is_low() == TRUE
                                    => shrink_active_list()
                                    => active anon list over scanned

    So the end result may be

    - anon inactive => over scanned
    - anon active => over scanned (maybe not as much)
    - file inactive => over scanned
    - file active => under scanned (relatively)

    The accesses to nr_saved_scan are not lock protected and so not 100%
    accurate; however, we can tolerate small errors and the resulting small
    imbalances in scan rates between zones.

    Cc: Rik van Riel
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Balbir Singh
    Reviewed-by: Minchan Kim
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Signed-off-by: Alexey Dobriyan
    Acked-by: David Rientjes
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • This is being done by allowing boot time allocations to specify that they
    may want a sub-page sized amount of memory.

    Overall this seems more consistent with the other hash table allocations,
    and allows making two supposedly mm-only variables really mm-only
    (nr_{kernel,all}_pages).

    Signed-off-by: Jan Beulich
    Cc: Ingo Molnar
    Cc: "Eric W. Biederman"
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Beulich
     
  • Sizing of memory allocations shouldn't depend on the number of physical
    pages found in a system, as that generally includes (perhaps a huge amount
    of) non-RAM pages. The amount of what actually is usable as storage
    should instead be used as a basis here.

    Some of the calculations (i.e. those not intending to use high memory)
    should likely even use (totalram_pages - totalhigh_pages).

    Signed-off-by: Jan Beulich
    Acked-by: Rusty Russell
    Acked-by: Ingo Molnar
    Cc: Dave Airlie
    Cc: Kyle McMartin
    Cc: Jeremy Fitzhardinge
    Cc: Pekka Enberg
    Cc: Hugh Dickins
    Cc: "David S. Miller"
    Cc: Patrick McHardy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Beulich
     
  • Make page_has_private() return a true boolean value and remove the double
    negations from the two callsites using it for arithmetic.

    Signed-off-by: Johannes Weiner
    Cc: Christoph Lameter
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • page_is_file_cache() has been used for both boolean checks and LRU
    arithmetic, which was always a bit weird.

    Now that page_lru_base_type() exists for LRU arithmetic, make
    page_is_file_cache() a real predicate function and adjust the
    boolean-using callsites to drop those pesky double negations.

    Signed-off-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Instead of abusing page_is_file_cache() for LRU list index arithmetic, add
    another helper with a more appropriate name and convert the non-boolean
    users of page_is_file_cache() accordingly.

    This new helper gives the LRU base type a page is supposed to live on,
    inactive anon or inactive file.

    [hugh.dickins@tiscali.co.uk: convert del_page_from_lru() also]
    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
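
    Roughly how such a helper composes with the active bit; a sketch in the
    spirit of page_lru(), not a verbatim copy of the patch:

        #include <linux/mm.h>
        #include <linux/mm_inline.h>
        #include <linux/mmzone.h>

        /* Pick the LRU list for a page: base type (anon/file) plus active offset. */
        static enum lru_list lru_index_for(struct page *page)
        {
                enum lru_list lru = page_lru_base_type(page);   /* inactive anon or file */

                if (PageActive(page))
                        lru += LRU_ACTIVE;      /* shift to the corresponding active list */
                return lru;
        }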
     
  • The kzalloc mempool zeros items when they are initially allocated, but
    does not rezero used items that are returned to the pool. Consequently
    mempool_alloc()s may return non-zeroed memory.

    Since there are/were only two in-tree users for
    mempool_create_kzalloc_pool(), and 'fixing' this in a way that will
    re-zero used (but not new) items before first use is non-trivial, just
    remove it.

    Signed-off-by: Sage Weil
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sage Weil
     
  • The page allocation trace event reports that a page was successfully
    allocated but it does not specify where it came from. When analysing
    performance, it can be important to distinguish between pages coming from
    the per-cpu allocator and pages coming from the buddy lists as the latter
    requires the zone lock to be taken and more data structures to be
    examined.

    This patch adds a trace event for __rmqueue reporting when a page is being
    allocated from the buddy lists. It distinguishes between being called to
    refill the per-cpu lists or whether it is a high-order allocation.
    Similarly, this patch adds an event to catch when the PCP lists are being
    drained a little and pages are going back to the buddy lists.

    This is trickier to draw conclusions from but high activity on those
    events could explain why there were a large number of cache misses on a
    page-allocator-intensive workload. The coalescing and splitting of
    buddies involves a lot of writing of page metadata and cache line bounces
    not to mention the acquisition of an interrupt-safe lock necessary to
    enter this path.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Ingo Molnar
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Li Ming Chun
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Fragmentation avoidance depends on being able to use free pages from lists
    of the appropriate migrate type. In the event this is not possible,
    __rmqueue_fallback() selects a different list and in some circumstances
    changes the migratetype of the pageblock. Simplistically, the more times
    this event occurs, the more likely it is that fragmentation will later be
    a problem, at least for hugepage allocation, but there are other
    considerations such as the order of the page being split to satisfy the
    allocation.

    This patch adds a trace event for __rmqueue_fallback() that reports what
    page is being used for the fallback, the orders of relevant pages, the
    desired migratetype and the migratetype of the lists being used, whether
    the pageblock changed type and whether this event is important with
    respect to fragmentation avoidance or not. This information can be used
    to help analyse fragmentation avoidance and help decide whether
    min_free_kbytes should be increased or not.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Ingo Molnar
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Li Ming Chun
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patch adds trace events for the allocation and freeing of pages,
    including the freeing of pagevecs. Using the events, it will be known
    what struct page and pfns are being allocated and freed and what the call
    site was in many cases.

    The page alloc tracepoints can be used as an indicator of whether the
    workload was heavily dependent on the page allocator or not. You can make
    a guess based on vmstat, but you can't get a per-process breakdown.
    Depending on the call path, the call_site for page allocation may be
    __get_free_pages() instead of a useful callsite. Instead of passing down
    a return address similar to slab debugging, the user should enable the
    stacktrace and sym-addr options to get a proper stack trace.

    The pagevec free tracepoint has a different use case. It can be used to
    get an idea of how many pages are being dumped off the LRU and whether it
    is kswapd doing the work or a process doing direct reclaim.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Ingo Molnar
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Li Ming Chun
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The function free_cold_page() has no callers so delete it.

    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Just as the swapoff system call allocates many pages of RAM to various
    processes, perhaps triggering OOM, so "echo 2 >/sys/kernel/mm/ksm/run"
    (unmerge) is liable to allocate many pages of RAM to various processes,
    perhaps triggering OOM; and each is normally run from a modest admin
    process (swapoff or shell), easily repeated until it succeeds.

    So treat unmerge_and_remove_all_rmap_items() in the same way that we treat
    try_to_unuse(): generalize PF_SWAPOFF to PF_OOM_ORIGIN, and bracket both
    with that, to ask the OOM killer to kill them first, to prevent them from
    spawning more and more OOM kills.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • A few cleanups, given the munlock fix: the comment on ksm_test_exit() no
    longer applies, and it can be made private to ksm.c; there's no more
    reference to mmu_gather or tlb.h, and mmap.c doesn't need ksm.h.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins