20 Oct, 2008

40 commits

  • Add temperature sensor support for Macbook Pro 3.

    Signed-off-by: Henrik Rydberg
    Cc: Nicolas Boichat
    Cc: Riki Oktarianto
    Cc: Mark M. Hoffman
    Cc: Jean Delvare
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Henrik Rydberg
     
  • Adds temperature sensor support for the Macbook Pro 4.

    Signed-off-by: Henrik Rydberg
    Cc: Nicolas Boichat
    Cc: Riki Oktarianto
    Cc: Mark M. Hoffman
    Cc: Jean Delvare
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Henrik Rydberg
     
  • dmi_system_id.driver_data is already void*.

    Cc: Henrik Rydberg
    Cc: Nicolas Boichat
    Cc: Riki Oktarianto
    Cc: Mark M. Hoffman
    Cc: Jean Delvare
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • This patch adds accelerometer, backlight and temperature sensor support
    for the Macbook Air.

    Signed-off-by: Henrik Rydberg
    Cc: Nicolas Boichat
    Cc: Riki Oktarianto
    Cc: Mark M. Hoffman
    Cc: Jean Delvare
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Henrik Rydberg
     
  • On some recent Macbooks, the package length for the light sensors ALV0 and
    ALV1 has changed from 6 to 10. This patch allows for a variable package
    length encompassing both variants.

    Signed-off-by: Henrik Rydberg
    Cc: Nicolas Boichat
    Cc: Riki Oktarianto
    Cc: Mark M. Hoffman
    Cc: Jean Delvare
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Henrik Rydberg
     
  • The time to wait for a status change while reading or writing to the SMC
    ports is a balance between read reliability and system performance. The
    current setting yields roughly three errors in a thousand when
    simultaneously reading three different temperature values on a Macbook
    Air. This patch increases the setting to a value yielding roughly one
    error in ten thousand, with no noticeable system performance degradation.

    Signed-off-by: Henrik Rydberg
    Cc: Nicolas Boichat
    Cc: Riki Oktarianto
    Cc: Mark M. Hoffman
    Cc: Jean Delvare
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Henrik Rydberg
     
  • On many Macbooks since mid 2007, the Pro, C2D and Air models, applesmc
    fails to read some or all SMC ports. This problem has various effects,
    such as flooded logfiles, malfunctioning temperature sensors,
    accelerometers failing to initialize, and difficulties getting backlight
    functionality to work properly.

    The root of the problem seems to be the command protocol. The current
    code sends out a command byte, then repeatedly polls for an ack before
    continuing to send or receive data. From experiments leading to this
    patch, it seems the command protocol never quite worked or changed so that
    one now sends a command byte, waits a little bit, polls for an ack, and if
    it fails, repeats the whole thing by sending the command byte again.

    This patch implements a send_command function according to the new
    interpretation of the protocol, and should work also for earlier models.

    Signed-off-by: Henrik Rydberg
    Cc: Nicolas Boichat
    Cc: Riki Oktarianto
    Cc: Mark M. Hoffman
    Cc: Jean Delvare
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Henrik Rydberg
     
  • At one single place in the code, the specified number of bytes to read and
    the actual number of bytes read differ by one. This one-liner patch fixes
    that inconsistency.

    Signed-off-by: Henrik Rydberg
    Cc: Nicolas Boichat
    Cc: Riki Oktarianto
    Cc: Mark M. Hoffman
    Cc: Jean Delvare
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Henrik Rydberg
     
  • Adds therm-min/max/crit-alarm callbacks, sensor-device-attribute
    declarations, and refs to those new decls in the macro used to initialize
    the therm_group (of sysfs files).

    The thermistors use voltage channels to measure; so they don't have a
    fault-alarm, but unlike the other voltages, they do have an overtemp,
    which we call crit (by convention).

    [akpm@linux-foundation.org: cleanup]
    Signed-off-by: Jim Cromie
    Cc: Jean Delvare
    Cc: "Mark M. Hoffman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jim Cromie
     
    Temp and vin status register values may be set by chip specifications,
    set again by the BIOS, or by a previous load of this driver. The debug
    output nicely displays the modprobe init=\d actions.

    Signed-off-by: Jim Cromie
    Cc: Jean Delvare
    Cc: "Mark M. Hoffman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jim Cromie
     
  • The driver handles 3 logical devices in a fixed-length array. Give this
    a define-d constant.

    Signed-off-by: Jim Cromie
    Cc: Jean Delvare
    Cc: "Mark M. Hoffman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jim Cromie
     
  • Adds temp-min/max/crit/fault-alarm callbacks, sensor-device-attribute
    declarations, and refs to those new decls in the macro used to initialize
    the temp_group (of sysfs files).

    [akpm@linux-foundation.org: cleanups]
    Signed-off-by: Jim Cromie
    Cc: Jean Delvare
    Cc: "Mark M. Hoffman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jim Cromie
     
  • Adds vin-min/max-alarm callbacks, sensor-device-attribute declarations,
    and refs to those new decls in the macro used to initialize the vin_group
    (of sysfs files).

    [akpm@linux-foundation.org: cleanups]
    Signed-off-by: Jim Cromie
    Cc: Jean Delvare
    Cc: "Mark M. Hoffman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jim Cromie
     
  • Bring hwmon/pc87360 into agreement with
    Documentation/hwmon/sysfs-interface.

    The patchset adds separate limit alarms for voltages and temps; it also
    adds temp[123]_fault files. On my Soekris, temps 1 and 2 are
    unused/unconnected, so temp[123]_fault reads 1,1,0 respectively. This
    agrees with /usr/bin/sensors, which has always shown them as OPEN.
    Temps 4,5,6 are thermistor based and don't have a fault bit in their
    status register.

    This patch:

    2 different kinds of constants added:
    - CHAN_ALM_* constants for (later) vin, temp alarm callbacks.
    - CHAN_* conversion constants, used in _init_device, partly for RW1C bits

    Signed-off-by: Jim Cromie
    Cc: Jean Delvare
    Cc: "Mark M. Hoffman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jim Cromie
     
  • Fix a typo and replace an incorrect word in the comment.

    Signed-off-by: Ameya Palande
    Cc: "Ashok Raj"
    Cc: "Shaohua Li"
    Cc: "Anil S Keshavamurthy"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ameya Palande
     
  • These comments are useless, remove them.

    Signed-off-by: WANG Cong
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    WANG Cong
     
  • On my HP 2510, pressing the (i) button generates an unknown keycode:
    0x213b. So here is a patch adding support for it. However, as it seems
    there is already support for a similar button connected to 0x231b as
    keycode, I wonder if it could be a typo in the driver?

    Signed-off-by: Eric Piel
    Cc: Matthew Garrett
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Piel
     
  • I recently fell into the trap of assuming this dumps all timers, when
    in fact it only dumps hrtimers. Fix the documentation.

    Signed-off-by: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Fix

    arch/um/sys-i386/signal.c: In function 'copy_sc_from_user':
    arch/um/sys-i386/signal.c:182: warning: dereferencing 'void *' pointer
    arch/um/sys-i386/signal.c:182: error: request for member '_fxsr_env' in something not a structure or union

    Signed-off-by: WANG Cong
    Cc: Jeff Dike
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    WANG Cong
     
  • Remove a duplicated include file in arch/m68k/bvme6000/rtc.c.

    Signed-off-by: Huang Weiyi
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Weiyi
     
  • Describe why we need the freezer subsystem and how to use it in a
    documentation file. Since the cgroups.txt file is focused on the
    subsystem-agnostic portions of cgroups make a directory and move the old
    cgroups.txt file at the same time.

    Signed-off-by: Matt Helsley
    Cc: Paul Menage
    Cc: containers@lists.linux-foundation.org
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Helsley
     
  • check_if_frozen() sounds like it should return something when in fact it's
    just updating the freezer state.

    Signed-off-by: Matt Helsley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Helsley
     
  • Rename cgroup freezer states to be less generic to avoid any name
    collisions while also better describing what each state is.

    Signed-off-by: Matt Helsley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Helsley
     
  • Don't let frozen tasks or cgroups change. This means frozen tasks can't
    leave their current cgroup for another cgroup. It also means that tasks
    cannot be added to or removed from a cgroup in the FROZEN state. We
    enforce these rules by checking for frozen tasks and cgroups in the
    can_attach() function.

    Signed-off-by: Matt Helsley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Helsley
     
  • When a system is resumed after a suspend, it will also unfreeze frozen
    cgroups.

    This patch modifies the resume sequence to skip tasks which are part
    of a frozen control group.

    Signed-off-by: Cedric Le Goater
    Signed-off-by: Matt Helsley
    Acked-by: Serge E. Hallyn
    Tested-by: Matt Helsley
    Acked-by: Rafael J. Wysocki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Helsley
     
  • This patch implements a new freezer subsystem in the control groups
    framework. It provides a way to stop and resume execution of all tasks in
    a cgroup by writing in the cgroup filesystem.

    The freezer subsystem in the container filesystem defines a file named
    freezer.state. Writing "FROZEN" to the state file will freeze all tasks
    in the cgroup. Subsequently writing "RUNNING" will unfreeze the tasks in
    the cgroup. Reading will return the current state.

    * Examples of usage :

    # mkdir /containers/freezer
    # mount -t cgroup -ofreezer freezer /containers
    # mkdir /containers/0
    # echo $some_pid > /containers/0/tasks

    to get status of the freezer subsystem :

    # cat /containers/0/freezer.state
    RUNNING

    to freeze all tasks in the container :

    # echo FROZEN > /containers/0/freezer.state
    # cat /containers/0/freezer.state
    FREEZING
    # cat /containers/0/freezer.state
    FROZEN

    to unfreeze all tasks in the container :

    # echo RUNNING > /containers/0/freezer.state
    # cat /containers/0/freezer.state
    RUNNING

    This is the basic mechanism, which should do the right thing for
    user-space tasks in a simple scenario.

    It's important to note that freezing can be incomplete. In that case we
    return EBUSY. This means that some tasks in the cgroup are busy doing
    something that prevents us from completely freezing the cgroup at this
    time. After EBUSY, the cgroup will remain partially frozen -- reflected
    by freezer.state reporting "FREEZING" when read. The state will remain
    "FREEZING" until one of these things happens:

    1) Userspace cancels the freezing operation by writing "RUNNING" to
       the freezer.state file
    2) Userspace retries the freezing operation by writing "FROZEN" to
       the freezer.state file (writing "FREEZING" is not legal and
       returns EIO)
    3) The tasks that blocked the cgroup from entering the "FROZEN"
       state disappear from the cgroup's set of tasks.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: export thaw_process]
    Signed-off-by: Cedric Le Goater
    Signed-off-by: Matt Helsley
    Acked-by: Serge E. Hallyn
    Tested-by: Matt Helsley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Helsley
     
  • Now that the TIF_FREEZE flag is available in all architectures, extract
    the refrigerator() and freeze_task() from kernel/power/process.c and make
    it available to all.

    The refrigerator() can now be used in a control group subsystem
    implementing a control group freezer.

    Signed-off-by: Cedric Le Goater
    Signed-off-by: Matt Helsley
    Acked-by: Serge E. Hallyn
    Tested-by: Matt Helsley
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Helsley
     
  • This patch series introduces a cgroup subsystem that utilizes the swsusp
    freezer to freeze a group of tasks. It's immediately useful for batch job
    management scripts. It should also be useful in the future for
    implementing container checkpoint/restart.

    The freezer subsystem in the container filesystem defines a cgroup file
    named freezer.state. Reading freezer.state will return the current state
    of the cgroup. Writing "FROZEN" to the state file will freeze all tasks
    in the cgroup. Subsequently writing "RUNNING" will unfreeze the tasks in
    the cgroup.

    * Examples of usage :

    # mkdir /containers/freezer
    # mount -t cgroup -ofreezer freezer /containers
    # mkdir /containers/0
    # echo $some_pid > /containers/0/tasks

    to get status of the freezer subsystem :

    # cat /containers/0/freezer.state
    RUNNING

    to freeze all tasks in the container :

    # echo FROZEN > /containers/0/freezer.state
    # cat /containers/0/freezer.state
    FREEZING
    # cat /containers/0/freezer.state
    FROZEN

    to unfreeze all tasks in the container :

    # echo RUNNING > /containers/0/freezer.state
    # cat /containers/0/freezer.state
    RUNNING

    This patch:

    The first step in making the refrigerator() available to all
    architectures, even for those without power management.

    The purpose of such a change is to be able to use the refrigerator() in a
    new control group subsystem which will implement a control group freezer.

    [akpm@linux-foundation.org: fix sparc]
    Signed-off-by: Cedric Le Goater
    Signed-off-by: Matt Helsley
    Acked-by: Pavel Machek
    Acked-by: Serge E. Hallyn
    Acked-by: Rafael J. Wysocki
    Acked-by: Nigel Cunningham
    Tested-by: Matt Helsley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Helsley
     
  • To prepare the chunking, move the sys_move_pages() code that is used when
    nodes!=NULL into do_pages_move(). And rename do_move_pages() into
    do_move_page_to_node_array().

    Signed-off-by: Brice Goglin
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brice Goglin
     
  • do_pages_stat() does not need any page_to_node entry for real. Just pass
    the pointers to the user-space page address array and to the user-space
    status array, and have do_pages_stat() traverse the former and fill the
    latter directly.

    Signed-off-by: Brice Goglin
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brice Goglin
     
  • A patchset reworking sys_move_pages(). It removes the possibly large
    vmalloc by using multiple chunks when migrating large buffers. It also
    dramatically increases the throughput for large buffers since the lookup
    in new_page_node() is now limited to a single chunk, causing the quadratic
    complexity to have a much slower impact. There is no need to use any
    radix-tree-like structure to improve this lookup.

    sys_move_pages() duration on a 4-quadcore-opteron 2347HE (1.9 GHz),
    migrating between nodes #2 and #3:

    length    move_pages (us)    move_pages+patch (us)
      4kB                 126                       98
     40kB                 198                      168
    400kB                 963                      937
      4MB               12503                    11930
     40MB              246867                    11848

    Patches #1 and #4 are the important ones:
    1) stop returning -ENOENT from sys_move_pages() if nothing got migrated
    2) don't vmalloc a huge page_to_node array for do_pages_stat()
    3) extract do_pages_move() out of sys_move_pages()
    4) rework do_pages_move() to work on page_sized chunks
    5) move_pages: no need to set pp->page to ZERO_PAGE(0) by default

    This patch:

    There is no point in returning -ENOENT from sys_move_pages() if all pages
    were already on the right node, while we return 0 if only 1 page was not.
    Most applications don't know where their pages are allocated, so it's not
    an error to try to migrate them anyway.

    Just return 0 and let the status array in user-space be checked if the
    application needs details.

    It will make the upcoming chunked-move_pages() support much easier.

    Signed-off-by: Brice Goglin
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brice Goglin
     
  • During hotplug memory remove, memory regions should be released in
    PAGES_PER_SECTION-sized chunks. This mirrors the code in add_memory,
    where resources are requested in PAGES_PER_SECTION-sized chunks.

    Attempting to release the entire memory region fails because there is
    not a single resource covering the total number of pages being removed.
    Instead, the resources for the pages are split into
    PAGES_PER_SECTION-sized chunks, as requested during memory add.

    Signed-off-by: Nathan Fontenot
    Signed-off-by: Badari Pulavarty
    Acked-by: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nathan Fontenot
     
  • The current documentation of dirty_ratio and dirty_background_ratio is a
    bit misleading.

    In the documentation we say that they are "a percentage of total system
    memory", but the current page writeback policy, instead, is to apply the
    percentages to the dirtyable memory, that means free pages + reclaimable
    pages.

    Better to be more explicit to clarify this concept.

    Signed-off-by: Andrea Righi
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Righi
     
  • This attribute just has a write operation.

    [akpm@linux-foundation.org: use S_IWUSR as suggested by Randy]
    Signed-off-by: Shaohua Li
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • This replaces zone->lru_lock in setup_per_zone_pages_min() with zone->lock.
    There seems to be no need for the lru_lock anymore, but there is a need for
    zone->lock instead, because that function may call move_freepages() via
    setup_zone_migrate_reserve().

    Signed-off-by: Gerald Schaefer
    Acked-by: KAMEZAWA Hiroyuki
    Tested-by: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     
  • Presently hugepages don't use the zero page at all, because the zero
    page is only used for coredumping and hugepages can't be core dumped.

    However, we have now implemented hugepage coredumping, so we should
    implement the zero page for hugepages too.

    Implementation note:

    o Why do we only check VM_SHARED for the zero page?

      Normal pages are checked as follows:

      static inline int use_zero_page(struct vm_area_struct *vma)
      {
              if (vma->vm_flags & (VM_LOCKED | VM_SHARED))
                      return 0;

              return !vma->vm_ops || !vma->vm_ops->fault;
      }

      First, hugepages are never mlock()ed, so we aren't concerned with
      VM_LOCKED.

      Second, hugetlbfs is a pseudo filesystem, not a real filesystem, and
      it doesn't have any file backing, so the ops->fault check is
      meaningless.

    o Why don't we use the zero page if !pte?

      !pte indicates that the {pud, pmd} doesn't exist or that some error
      happened, so we shouldn't return the zero page if any error occurred.

    Signed-off-by: KOSAKI Motohiro
    Cc: Adam Litke
    Cc: Hugh Dickins
    Cc: Kawai Hidehiro
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Presently a hugepage vma has the VM_RESERVED flag set so that it is not
    swapped. But a VM_RESERVED vma isn't core dumped, because this flag is
    often used for kernel vmas (e.g. vmalloc, sound related).

    Thus hugepages are never dumped and can't be debugged easily. Many
    developers want hugepages to be included in core dumps.

    However, we can't read a generic VM_RESERVED area, because such an area
    is often an IO mapping area; reading it may change device state, which
    is a definitely undesirable side effect.

    So adding a hugepage-specific bit to the coredump filter is better. It
    enables hugepage core dumping without causing side effects on any i/o
    devices.

    In addition, libhugetlbfs uses hugetlb private mapping pages as
    anonymous pages, so hugepage private mapping pages should be core
    dumped by default.

    Thus /proc/[pid]/coredump_filter gets two new bits:

    - bit 5 means hugetlb private mapping pages are dumped or not (default: yes)
    - bit 6 means hugetlb shared mapping pages are dumped or not (default: no)

    I tested with the following method.

    % ulimit -c unlimited
    % ./crash_hugepage 50
    % ./crash_hugepage 50 -p
    % ls -lh
    % gdb ./crash_hugepage core
    %
    % echo 0x43 > /proc/self/coredump_filter
    % ./crash_hugepage 50
    % ./crash_hugepage 50 -p
    % ls -lh
    % gdb ./crash_hugepage core

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>

    #include "hugetlbfs.h"

    int main(int argc, char **argv)
    {
        char *p;
        int ch;
        int mmap_flags = MAP_SHARED;
        int fd;
        int nr_pages;

        while ((ch = getopt(argc, argv, "p")) != -1) {
            switch (ch) {
            case 'p':
                mmap_flags &= ~MAP_SHARED;
                mmap_flags |= MAP_PRIVATE;
                break;
            default:
                /* nothing */
                break;
            }
        }
        argc -= optind;
        argv += optind;

        if (argc == 0) {
            printf("need # of pages\n");
            exit(1);
        }

        nr_pages = atoi(argv[0]);
        if (nr_pages < 2) {
            printf("nr_pages must >2\n");
            exit(1);
        }

        fd = hugetlbfs_unlinked_fd();
        p = mmap(NULL, nr_pages * gethugepagesize(),
                 PROT_READ|PROT_WRITE, mmap_flags, fd, 0);

        sleep(2);

        *(p + gethugepagesize()) = 1;   /* COW */
        sleep(2);

        /* crash! */
        *(int *)0 = 1;

        return 0;
    }

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Kawai Hidehiro
    Cc: Hugh Dickins
    Cc: William Irwin
    Cc: Adam Litke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Improve debuggability of memory setup problems.

    Signed-off-by: Yinghai Lu
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     
  • mm/hugetlb.c:265:17: warning: symbol 'resv_map_alloc' was not declared. Should it be static?
    mm/hugetlb.c:277:6: warning: symbol 'resv_map_release' was not declared. Should it be static?
    mm/hugetlb.c:292:9: warning: Using plain integer as NULL pointer
    mm/hugetlb.c:1750:5: warning: symbol 'unmap_ref_private' was not declared. Should it be static?

    Signed-off-by: Harvey Harrison
    Acked-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harvey Harrison
     
  • Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and
    provide a fast, scalable percpu frontend for small vmaps (requires a
    slightly different API, though).

    The biggest problem with vmap is actually vunmap. Presently this requires
    a global kernel TLB flush, which on most architectures is a broadcast IPI
    to all CPUs to flush the cache. This is all done under a global lock. As
    the number of CPUs increases, so will the number of vunmaps a scaled
    workload will want to perform, and so will the cost of a global TLB flush.
    This gives terrible quadratic scalability characteristics.

    Another problem is that the entire vmap subsystem works under a single
    lock. It is a rwlock, but it is actually taken for write in all the fast
    paths, and the read locking would likely never be run concurrently anyway,
    so it's just pointless.

    This is a rewrite of vmap subsystem to solve those problems. The existing
    vmalloc API is implemented on top of the rewritten subsystem.

    The TLB flushing problem is solved by using lazy TLB unmapping. vmap
    addresses do not have to be flushed immediately when they are vunmapped,
    because the kernel will not reuse them again (would be a use-after-free)
    until they are reallocated. So the addresses aren't allocated again until
    a subsequent TLB flush. A single TLB flush then can flush multiple
    vunmaps from each CPU.

    XEN and PAT and such do not like deferred TLB flushing because they can't
    always handle multiple aliasing virtual addresses to a physical address.
    They now call vm_unmap_aliases() in order to flush any deferred mappings.
    That call is very expensive (well, actually not a lot more expensive than
    a single vunmap under the old scheme), however it should be OK if not
    called too often.

    The virtual memory extent information is stored in an rbtree rather than a
    linked list to improve the algorithmic scalability.

    There is a per-CPU allocator for small vmaps, which amortizes or avoids
    global locking.

    To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces
    must be used in place of vmap and vunmap. Vmalloc does not use these
    interfaces at the moment, so it will not be quite so scalable (although it
    will use lazy TLB flushing).

    As a quick test of performance, I ran a test that loops in the kernel,
    linearly mapping then touching then unmapping 4 pages. Different numbers
    of tests were run in parallel on a 4 core, 2 socket opteron. Results are
    in nanoseconds per map+touch+unmap.

    threads    vanilla    vmap rewrite
          1      14700            2900
          2      33600            3000
          4      49500            2800
          8      70631            2900

    So with 8 cores, the rewritten version is already 25x faster.

    In a slightly more realistic test (although with an older and less
    scalable version of the patch), I ripped the not-very-good vunmap batching
    code out of XFS, and implemented the large buffer mapping with vm_map_ram
    and vm_unmap_ram... along with a couple of other tricks, I was able to
    speed up a large directory workload by 20x on a 64 CPU system. I believe
    vmap/vunmap is actually sped up a lot more than 20x on such a system, but
    I'm running into other locks now. vmap is pretty well blown off the
    profiles.

    Before:
    1352059 total 0.1401
    798784 _write_lock 8320.6667
    Cc: Jeremy Fitzhardinge
    Cc: Krzysztof Helt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin