30 Mar, 2010

1 commit

  • include cleanup: Update gfp.h and slab.h includes to prepare for
    breaking implicit slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h, which
    in turn includes gfp.h, making everything defined by those two files
    universally available and complicating inclusion dependencies.

    The percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include
    those headers directly instead of assuming their availability. As this
    conversion needs to touch a large number of source files, the
    following script was used as the basis of the conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following:

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there, i.e. if only gfp is used,
    gfp.h; if slab is used, slab.h. (A before/after sketch of such an
    edit follows this list.)

    * When the script inserts a new include, it looks at the include
    blocks and tries to place the new include so that its order conforms
    to its surroundings. It is put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, reverse-Christmas-tree - or at the end
    if there doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints an
    error message indicating which .h file needs to be added to the
    file.
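
    For illustration, here is a minimal sketch of the kind of edit the
    sweep performs (foo.c and struct foo are hypothetical, not taken
    from the actual conversion):

    /* before: this file compiled only because sched.h pulled in
     * percpu.h, which pulled in slab.h */
    #include <linux/sched.h>

    /* after: the slab facility actually used is included directly,
     * so the file survives the percpu.h -> slab.h break */
    #include <linux/sched.h>
    #include <linux/slab.h>

    struct foo { int x; };

    static struct foo *foo_new(gfp_t gfp)
    {
    	return kmalloc(sizeof(struct foo), gfp);	/* needs slab.h */
    }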

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition, and for others adding it to an
    implementation .h or embedding .c file was more appropriate. This
    step added inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed,
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs, requiring slab.h to be added manually (a sketch of that
    wrapper pattern follows this list of steps).

    5. The script was run on all .h files, but without automatically
    editing them, as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored, as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on the arch to make
    things build (like ipr on powerpc/64, which failed due to missing
    writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.
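
    The malloc/free() wrappers mentioned in step 4 follow this pattern
    (a simplified sketch of the idiom, not the exact kernel code):
    macros that map userspace-style allocator names onto slab calls,
    which only compile if slab.h is in scope:

    #include <linux/slab.h>

    /* only compiles once slab.h is included explicitly; before this
     * sweep the header arrived implicitly via percpu.h */
    #define malloc(size)	kmalloc(size, GFP_KERNEL)
    #define free(ptr)	kfree(ptr)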

    Given the fact that I had only a couple of failures from the build
    tests in step 7, I'm fairly confident about the coverage of this
    conversion patch. If there is a breakage, it's likely to be something
    in one of the arch headers, which should be discoverable easily on
    most builds of the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>


24 Jun, 2009

1 commit

  • Currently, the following three different ways to define percpu arrays
    are in use.

    1. DEFINE_PER_CPU(elem_type[array_len], array_name);
    2. DEFINE_PER_CPU(elem_type, array_name[array_len]);
    3. DEFINE_PER_CPU(elem_type, array_name)[array_len];

    Unify to #1, which correctly separates the roles of the two
    parameters and thus allows more flexibility in the way percpu
    variables are defined.
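
    As a sketch of form #1 (the variable name and element count here are
    made up for illustration; kernel code, requiring <linux/percpu.h>):

    /* the array-ness lives entirely in the type argument; the second
     * argument is just the identifier */
    DEFINE_PER_CPU(int[4], hypothetical_counters);

    /* element i of the copy belonging to a given cpu */
    per_cpu(hypothetical_counters, cpu)[i]++;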

    [ Impact: cleanup ]

    Signed-off-by: Tejun Heo
    Reviewed-by: Christoph Lameter
    Cc: Ingo Molnar
    Cc: Tony Luck
    Cc: Benjamin Herrenschmidt
    Cc: Thomas Gleixner
    Cc: Jeremy Fitzhardinge
    Cc: linux-mm@kvack.org
    Cc: Christoph Lameter
    Cc: David S. Miller


03 Sep, 2008

1 commit

  • Quicklists store pages for each CPU as caches. (Each CPU can cache
    node_free_pages/16 pages.)

    They are used as a page table cache: exit() increases the cache size,
    while fork() consumes it.

    So, for example, if an apache-style application (one parent, many
    children) runs, the CPU running the parent will fork() while other
    CPUs process the middleware work and exit().

    At that time, the CPU on which the parent runs has no page table
    cache at all, while the others (on which the children run) have
    maximal caches.

    QList_max = (#ofCPUs - 1) x Free / 16
    => QList_max / (Free + QList_max) = (#ofCPUs - 1) / (16 + #ofCPUs - 1)

    So, how much quicklist memory is used in the worst case?

    It is proportional to the number of CPUs, because the limit on each
    per-cpu quicklist cache does not take the number of CPUs into
    account. The above calculation gives:

    Number of CPUs per node           2      4      8      16
    ==============================   ====   ====   ====   ====
    QList_max / (Free + QList_max)   5.8%   16%    30%    48%

    Wow! Quicklists can consume about 50% of memory in the worst case.
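
    A standalone userspace check of that arithmetic (the percentages
    match the table above, up to rounding):

    #include <stdio.h>

    int main(void)
    {
    	/* worst case: every CPU on the node except the one running
    	 * the parent holds a full cache of Free/16 pages, so
    	 * QList_max / (Free + QList_max) = (n - 1) / (16 + n - 1) */
    	int cpus[] = { 2, 4, 8, 16 };

    	for (int i = 0; i < 4; i++) {
    		int n = cpus[i];
    		printf("%2d cpus/node: %4.1f%%\n",
    		       n, 100.0 * (n - 1) / (16 + n - 1));
    	}
    	return 0;
    }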

    My demonstration program is here:
    --------------------------------------------------------------------------------
    #define _GNU_SOURCE

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sched.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    #define BUFFSIZE 512

    int max_cpu(void)	/* get max number of logical cpus from /proc/cpuinfo */
    {
    	FILE *fd;
    	char *ret, buffer[BUFFSIZE];
    	int cpu = 1;

    	fd = fopen("/proc/cpuinfo", "r");
    	if (fd == NULL) {
    		perror("fopen(/proc/cpuinfo)");
    		exit(EXIT_FAILURE);
    	}
    	while (1) {
    		ret = fgets(buffer, BUFFSIZE, fd);
    		if (ret == NULL)
    			break;
    		if (!strncmp(buffer, "processor", 9))
    			cpu = atoi(strchr(buffer, ':') + 2);
    	}
    	fclose(fd);
    	return cpu;
    }

    void cpu_bind(int cpu)	/* bind current process to one cpu */
    {
    	cpu_set_t mask;
    	int ret;

    	CPU_ZERO(&mask);
    	CPU_SET(cpu, &mask);
    	ret = sched_setaffinity(0, sizeof(mask), &mask);
    	if (ret == -1) {
    		perror("sched_setaffinity()");
    		exit(EXIT_FAILURE);
    	}
    	sched_yield();	/* not necessary */
    }

    #define MMAP_SIZE (10 * 1024 * 1024)	/* 10 MB */
    #define FORK_INTERVAL 1			/* 1 second */

    int main(int argc, char *argv[])
    {
    	int cpu_max, nextcpu;
    	long pagesize;
    	pid_t pid;

    	/* set max number of logical cpu */
    	if (argc > 1)
    		cpu_max = atoi(argv[1]) - 1;
    	else
    		cpu_max = max_cpu();

    	/* get the page size */
    	pagesize = sysconf(_SC_PAGESIZE);
    	if (pagesize == -1) {
    		perror("sysconf(_SC_PAGESIZE)");
    		exit(EXIT_FAILURE);
    	}

    	/* prepare parent process */
    	cpu_bind(0);
    	nextcpu = cpu_max;

    loop:
    	/* select destination cpu for child process by round-robin rule */
    	if (++nextcpu > cpu_max)
    		nextcpu = 1;

    	pid = fork();

    	if (pid == 0) {		/* child action */
    		char *p;
    		int i;

    		/* consume page tables by touching one byte per page */
    		p = mmap(0, MMAP_SIZE, PROT_WRITE,
    			 MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
    		i = MMAP_SIZE / pagesize;
    		while (i-- > 0) {
    			*p = 1;
    			p += pagesize;
    		}

    		/* move to other cpu */
    		cpu_bind(nextcpu);
    		/*
    		printf("a child moved to cpu%d after mmap().\n", nextcpu);
    		fflush(stdout);
    		*/

    		/* give the page tables back to pgtable_quicklist */
    		exit(0);
    	} else if (pid > 0) {	/* parent action */
    		sleep(FORK_INTERVAL);
    		waitpid(pid, NULL, WNOHANG);
    	}

    	goto loop;
    }
    ----------------------------------------

    When the above program, which does task migration, runs, my 8GB box
    spends 800MB of memory on quicklists. This is not a memory leak, but
    it doesn't seem good.

    % cat /proc/meminfo

    MemTotal: 7701568 kB
    MemFree: 4724672 kB
    (snip)
    Quicklists: 844800 kB

    because

    - My machine spec is
      number of numa nodes: 2
      number of cpus: 8 (4 CPUs x 2 nodes)
      total mem: 8GB (4GB x 2 nodes)
      free mem: about 5GB

    - Then, (4.7GB free + 0.8GB quicklists) x 16% ~= 880MB, which matches:
      the quicklists really do hold about 800MB.

    So, if that program runs on a machine with the following spec:

    CPUs: 64 (8 cpus x 8 nodes)
    Mem: 1TB (128GB x 8 nodes)

    then quicklists can waste 300GB (= 1TB x 30%). That is far too large.

    So, I don't like cache policies that are proportional to the number
    of cpus.

    My patch changes the per-cpu cache amount
    from:
        per-cpu-cache-amount = memory_on_node / 16
    to:
        per-cpu-cache-amount = memory_on_node / 16 / number_of_cpus_on_node
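
    In code terms the change is confined to the sizing calculation (the
    real function is max_pages() in mm/quicklist.c; this is a simplified
    sketch, not the actual patch hunk):

    /* before: a node's free memory / 16 for each cpu, so the
     * node-wide worst case grows with the cpu count */
    unsigned long per_cpu_cache_before(unsigned long node_free_pages)
    {
    	return node_free_pages / 16;
    }

    /* after: divide the node's share among its cpus, so the total
     * cached on a node stays bounded regardless of cpu count */
    unsigned long per_cpu_cache_after(unsigned long node_free_pages,
    				  unsigned int cpus_on_node)
    {
    	return node_free_pages / 16 / cpus_on_node;
    }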

    Signed-off-by: KOSAKI Motohiro
    Cc: Keiichiro Tokunaga
    Acked-by: Christoph Lameter
    Tested-by: David Miller
    Acked-by: Mike Travis
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds


15 Jan, 2008

1 commit

  • Quicklist sizing is based on the number of free pages. This must be
    the number of free pages that can be allocated with GFP_KERNEL:
    node_page_state() also counts the pages in ZONE_HIGHMEM and
    ZONE_MOVABLE, which may cause the quicklists to become too large,
    leading to OOM.
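
    A sketch of the idea behind the fix (the helper name and zone
    enumeration here are illustrative, not the exact patch): count only
    the free pages in zones GFP_KERNEL can allocate from, rather than
    the node-wide total:

    /* sum NR_FREE_PAGES over the node's GFP_KERNEL-usable zones only;
     * ZONE_HIGHMEM and ZONE_MOVABLE are deliberately skipped */
    static unsigned long node_free_pages_gfp_kernel(int node)
    {
    	struct zone *zones = NODE_DATA(node)->node_zones;
    	unsigned long free = 0;

    #ifdef CONFIG_ZONE_DMA
    	free += zone_page_state(&zones[ZONE_DMA], NR_FREE_PAGES);
    #endif
    #ifdef CONFIG_ZONE_DMA32
    	free += zone_page_state(&zones[ZONE_DMA32], NR_FREE_PAGES);
    #endif
    	free += zone_page_state(&zones[ZONE_NORMAL], NR_FREE_PAGES);
    	return free;
    }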

    Signed-off-by: Christoph Lameter
    Tested-by: Dhaval Giani
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds


08 May, 2007

1 commit

  • On x86_64 this cuts allocation overhead for page table pages down to
    a fraction (kernel compile / editing load; TSC-based measurement of
    time spent in each function, shown as call count, total time and
    (min/avg/max) per call):

    no quicklist:

    pte_alloc  1569048  4.3s     (401ns/2.7us/179.7us)
    pmd_alloc   780988  2.1s     (337ns/2.7us/86.1us)
    pud_alloc   780072  2.2s     (424ns/2.8us/300.6us)
    pgd_alloc   260022  1s       (920ns/4us/263.1us)

    quicklist:

    pte_alloc   452436  573.4ms  (8ns/1.3us/121.1us)
    pmd_alloc   196204  174.5ms  (7ns/889ns/46.1us)
    pud_alloc   195688  172.4ms  (7ns/881ns/151.3us)
    pgd_alloc    65228  9.8ms    (8ns/150ns/6.1us)

    pgd allocations are the most complex and there we see the most
    dramatic improvement (maybe we can cut down the amount of pgds
    cached somewhat?). But even the pte allocations still see a doubling
    of performance.

    1. Proven code from the IA64 arch.

    The method used here has been fine-tuned for years and
    is NUMA aware. It is based on the knowledge that accesses
    to page table pages are sparse in nature. Taking a page
    off the freelist instead of allocating a zeroed page
    allows a reduction in the number of cachelines touched,
    in addition to getting rid of the slab overhead. So
    performance improves. This is particularly useful if pgds
    contain standard mappings. We can save on the teardown
    and setup of such a page if we have some on the quicklists.
    This includes avoiding the list operations that are otherwise
    necessary on alloc and free to track pgds.

    2. Lightweight alternative to using slab to manage page-size pages

    Slab overhead is significant, and even page allocator use
    is pretty heavyweight. The use of a per-cpu quicklist
    means that we touch only two cachelines for an allocation.
    There is no need to access the page struct (unless arch code
    needs to fiddle around with it). So the fast path just
    means bringing in one cacheline at the beginning of the
    page. That same cacheline may then be used to store the
    page table entry, or a second cacheline may be used
    if the page table entry is not in the first cacheline of
    the page. The current code zeroes the page, which means
    touching 32 cachelines (a 4096-byte page divided into
    128-byte cachelines). We get down from 32 to 2 cachelines
    in the fast path.

    3. x86_64 gets lightweight page table page management.

    This will allow x86_64 arch code to repopulate pgds
    and other page table entries faster. The list operations
    for pgds are reduced in the same way as for i386, to the
    point where a pgd is allocated from the page allocator
    and, when freed, goes back to the page allocator. A pgd
    can pass through the quicklists without having to be
    reinitialized.

    4. Consolidation of code from multiple arches

    So far arches have had their own implementations of quicklist
    management. This patch moves that feature into the core,
    allowing easier maintenance and consistent management of
    quicklists.

    Page table pages have the characteristic that they are typically zero
    or in a known state when they are freed. This is usually exactly the
    same state as needed after allocation, so it makes sense to build a
    list of freed page table pages and then consume those pages first for
    new allocations. Such pages have already been initialized correctly
    (thus no need to zero them) and are likely already cached in such a
    way that the MMU can use them most effectively. Page table pages are
    used in a sparse way, so zeroing them on allocation is not too
    useful. The resulting fast path is sketched below.
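
    A simplified sketch of that fast path (modeled on the per-cpu list
    logic this patch introduces; field and helper names are abbreviated):

    /* pop a previously freed page off the per-cpu list; fall back to
     * a zeroed page from the page allocator only when the list is
     * empty */
    void *quicklist_alloc_sketch(int nr, gfp_t flags, void (*ctor)(void *))
    {
    	struct quicklist *q = &get_cpu_var(quicklist)[nr];
    	void **p = q->page;

    	if (p) {			/* fast path: reuse a cached page */
    		q->page = p[0];		/* next free page is chained in word 0 */
    		p[0] = NULL;
    		q->nr_pages--;
    	}
    	put_cpu_var(quicklist);
    	if (p)
    		return p;

    	p = (void *)__get_free_page(flags | __GFP_ZERO);	/* slow path */
    	if (p && ctor)
    		ctor(p);
    	return p;
    }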

    Such an implementation already exists for ia64. However, that
    implementation did not support constructors and destructors as needed
    by i386 / x86_64, and it only supported a single quicklist. The
    implementation here has constructor and destructor support as well as
    the ability for an arch to specify how many quicklists are needed.

    Quicklists are enabled by an arch defining CONFIG_QUICKLIST. If more
    than one quicklist is necessary then NR_QUICK can be defined for
    additional lists. F.e. i386 needs two and thus has

    config NR_QUICK
    	int
    	default 2

    If an arch has requested quicklist support then pages can be allocated
    from the quicklist (or from the page allocator if the quicklist is
    empty) via:

    quicklist_alloc(<quicklist nr>, <gfp flags>, <constructor>)

    Page table pages can be freed using:

    quicklist_free(<quicklist nr>, <destructor>, <page>)

    Pages must have a definite state after allocation and before
    they are freed. If no constructor is specified then pages
    will be zeroed on allocation and must be zeroed before they are
    freed.

    If a constructor is used then the constructor will establish
    a definite page state. F.e. the i386 and x86_64 pgd constructors
    establish certain mappings.

    Constructors and destructors can also be used to track the pages.
    i386 and x86_64 use a list of pgds in order to be able to dynamically
    update standard mappings.
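
    As a usage sketch, an arch whose page table pages need no special
    state would wire its pte allocation to quicklist 0 with no
    constructor or destructor (modeled on the ia64-style pgalloc glue;
    the signatures here are illustrative):

    static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
    					  unsigned long addr)
    {
    	/* zeroed page from quicklist 0, no constructor needed */
    	return quicklist_alloc(0, GFP_KERNEL, NULL);
    }

    static inline void pte_free_kernel(pte_t *pte)
    {
    	/* page must be back in its definite (zeroed) state here */
    	quicklist_free(0, NULL, pte);
    }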

    Signed-off-by: Christoph Lameter
    Cc: "David S. Miller"
    Cc: Andi Kleen
    Cc: "Luck, Tony"
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
