16 Apr, 2015

26 commits

  • Commit 38c5ce936a08 ("mm/gup: Replace ACCESS_ONCE with READ_ONCE")
    converted ACCESS_ONCE usage in gup_pmd_range() to READ_ONCE, since
    ACCESS_ONCE doesn't work reliably on non-scalar types.

    This patch also fixes the other ACCESS_ONCE usages in gup_pte_range()
    and __get_user_pages_fast() in mm/gup.c
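
    As a rough userspace illustration of why a size-dispatching read helps
    with non-scalar types, here is a minimal sketch. It is not the kernel's
    READ_ONCE; the names read_once_size, READ_ONCE_SKETCH, fake_pte_t and
    snapshot_pte are made up for this example, and it assumes a GNU C
    compiler (__typeof__ and statement expressions).

```c
#include <stdint.h>
#include <string.h>

/*
 * Read "size" bytes from p into res. Scalar-sized objects are read with a
 * single volatile access; anything else falls back to memcpy.
 */
static inline void read_once_size(const volatile void *p, void *res, size_t size)
{
    switch (size) {
    case 1: *(uint8_t *)res = *(const volatile uint8_t *)p; break;
    case 2: *(uint16_t *)res = *(const volatile uint16_t *)p; break;
    case 4: *(uint32_t *)res = *(const volatile uint32_t *)p; break;
    case 8: *(uint64_t *)res = *(const volatile uint64_t *)p; break;
    default:
        memcpy(res, (const void *)p, size);
    }
}

/* Hypothetical macro name; GNU C statement expression. */
#define READ_ONCE_SKETCH(x) ({                          \
    __typeof__(x) val_;                                 \
    read_once_size(&(x), &val_, sizeof(val_));          \
    val_;                                               \
})

/* Hypothetical stand-in for a page table entry wrapped in a struct. */
typedef struct { uint64_t val; } fake_pte_t;

/* Take one snapshot of *ptep and work only on the local copy. */
static uint64_t snapshot_pte(fake_pte_t *ptep)
{
    fake_pte_t pte = READ_ONCE_SKETCH(*ptep);
    return pte.val;
}
```

    The point of the design is that the struct is read through a scalar
    volatile access of matching size instead of a volatile-qualified struct
    copy.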

    Signed-off-by: Jason Low
    Acked-by: Michal Hocko
    Acked-by: Davidlohr Bueso
    Acked-by: Rik van Riel
    Reviewed-by: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Low
     
  • As suggested by Kirill the "goto"s in vma_to_resize aren't necessary, just
    change them to explicit return.

    Signed-off-by: Derek Che
    Suggested-by: "Kirill A. Shutemov"
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Derek
     
  • Recently I straced bash's behavior in this dd zero pipe to read test, as
    part of testing under vm.overcommit_memory=2 (OVERCOMMIT_NEVER mode):

    # dd if=/dev/zero | read x

    The bash subshell calls mremap to reallocate more and more memory until
    it either fails with -ENOMEM (what I expected) or is killed by the system
    OOM killer (which should not happen in OVERCOMMIT_NEVER mode). But the
    mremap system call actually failed with -EFAULT, which surprised me; I
    think it is supposed to be -ENOMEM. I then wrote this piece of C code,
    which confirmed it: https://gist.github.com/crquan/326bde37e1ddda8effe5

    $ ./remap
    allocated one page @0x7f686bf71000, (PAGE_SIZE: 4096)
    grabbed 7680512000 bytes of memory (1875125 pages) @ 00007f6690993000.
    mremap failed Bad address (14).

    The -EFAULT comes from the branch taken when security_vm_enough_memory_mm
    fails. Underneath, it calls __vm_enough_memory, which returns only 0 for
    success or -ENOMEM. So why does vma_to_resize need to return -EFAULT in
    this case? This looks like a mistake to me.

    Some more digging into git history:

    1) Before commit 119f657c7 ("RLIMIT_AS checking fix") in May 1 2005
    (pre 2.6.12 days) it was returning -ENOMEM for this failure;

    2) but commit 119f657c7 ("untangling do_mremap(), part 1") changed it
    accidentally, to whatever is preserved in local ret, which happened to
    be -EFAULT from a previous assignment;

    3) then in commit 54f5de709 code refactoring, it explicitly returns
    -EFAULT, which looks wrong.
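
    The gist above is not reproduced here. As a separate, hypothetical probe
    of the same idea, the sketch below maps one anonymous page, tries to grow
    it with mremap, and reports the resulting errno. The helper name
    try_mremap_grow is an assumption of this example; what a very large
    request returns depends on the kernel version and the overcommit mode.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map one anonymous page, try to grow it to new_len bytes, and return 0 on
 * success or the errno value reported by the failing call.
 */
static int try_mremap_grow(size_t new_len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return errno;

    void *q = mremap(p, page, new_len, MREMAP_MAYMOVE);
    if (q == MAP_FAILED) {
        int err = errno;    /* save before munmap can clobber it */
        munmap(p, page);
        return err;
    }

    munmap(q, new_len);
    return 0;
}
```

    Calling try_mremap_grow() with a size far beyond the commit limit under
    vm.overcommit_memory=2 and comparing the result against ENOMEM and EFAULT
    is one way to observe the behavior described in this message.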

    Signed-off-by: Derek Che
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Derek
     
  • In the original implementation of vm_map_ram by Nick Piggin there were
    two bitmaps: alloc_map and dirty_map. Neither was used as presumably
    intended: finding a suitable free hole for the next allocation in a block.
    vm_map_ram allocates space sequentially in a block and on free marks
    pages as dirty, so freed space can't be reused anymore.

    It would be very interesting to know the real purpose of those bitmaps;
    maybe the implementation was incomplete, etc.

    A long time ago Zhang Yanfei removed alloc_map in these two commits:

    mm/vmalloc.c: remove dead code in vb_alloc
    3fcd76e8028e0be37b02a2002b4f56755daeda06
    mm/vmalloc.c: remove alloc_map from vmap_block
    b8e748b6c32999f221ea4786557b8e7e6c4e4e7a

    In this patch I replace dirty_map with two range variables: dirty min and
    max. These variables store the minimum and maximum position of dirty space
    in a block, since we only need to know the dirty range, not the exact
    position of dirty pages.

    Why was this done? Several reasons. First, at first glance it seems that
    the vm_map_ram allocator cares about fragmentation and thus uses bitmaps
    to find a free hole, but that is not true, so to avoid complexity it is
    better to use something simple, like min and max range values. Secondly,
    the code becomes simpler: no iteration over a bitmap, just comparing
    values in the min and max macros. Thirdly, the bitmap occupies up to 1024
    bits (4MB is the maximum size of a block). Here I replace the whole bitmap
    with two longs.

    Finally, vm_unmap_aliases should be slightly faster and the whole
    vmap_block structure occupies less memory.
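
    The min/max bookkeeping described above can be sketched as follows. This
    is an illustrative model, not the patch itself: the struct, the function
    names and the capacity constant BLOCK_PAGES_SKETCH are assumptions of the
    example.

```c
#define BLOCK_PAGES_SKETCH 1024UL   /* assumed number of page slots per block */

/* Dirty range of one block: [dirty_min, dirty_max) in page slots. */
struct dirty_range {
    unsigned long dirty_min;
    unsigned long dirty_max;
};

/* An empty range is encoded as min == capacity and max == 0. */
static void dirty_range_init(struct dirty_range *r)
{
    r->dirty_min = BLOCK_PAGES_SKETCH;
    r->dirty_max = 0;
}

/* Record that nr slots starting at offset became dirty. */
static void dirty_range_mark(struct dirty_range *r,
                             unsigned long offset, unsigned long nr)
{
    if (offset < r->dirty_min)
        r->dirty_min = offset;
    if (offset + nr > r->dirty_max)
        r->dirty_max = offset + nr;
}

/* True if nothing has been marked dirty yet. */
static int dirty_range_empty(const struct dirty_range *r)
{
    return r->dirty_max == 0;
}
```

    Marking costs two comparisons, and a flush only needs the single span
    from dirty_min to dirty_max instead of a walk over a bitmap. The
    trade-off is that clean pages between two dirty spans are covered too.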

    Signed-off-by: Roman Pen
    Cc: Zhang Yanfei
    Cc: Eric Dumazet
    Acked-by: Joonsoo Kim
    Cc: David Rientjes
    Cc: WANG Chao
    Cc: Fabian Frederick
    Cc: Christoph Lameter
    Cc: Gioh Kim
    Cc: Rob Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Pen
     
  • The previous implementation allocates a new vmap block and then repeats
    the search for a free block from the very beginning, iterating over the
    CPU free list.

    Why can this be better?

    1. Allocation can happen on one CPU, but the search can be done on another
    CPU. In the worst case we preallocate a number of vmap blocks equal to
    the number of CPUs on the system.

    2. In the previous patch I added the newly allocated block to the tail of
    the free list to avoid quick exhaustion of virtual space and to give a
    chance to occupy blocks which were allocated long ago. Thus, to find the
    newly allocated block the whole search sequence has to be repeated, which
    is not efficient.

    In this patch the newly allocated block is occupied right away and the
    address of the virtual space is returned to the caller, so there is no
    need to repeat the search sequence; the allocation job is done.

    Signed-off-by: Roman Pen
    Cc: Andrew Morton
    Cc: Eric Dumazet
    Acked-by: Joonsoo Kim
    Cc: David Rientjes
    Cc: WANG Chao
    Cc: Fabian Frederick
    Cc: Christoph Lameter
    Cc: Gioh Kim
    Cc: Rob Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Pen
     
  • Recently I came across high fragmentation in the vm_map_ram allocator:
    a vmap_block has free space, but new blocks still continue to appear.
    Further investigation showed that certain mapping/unmapping sequences
    can exhaust vmalloc space. On small 32-bit systems that's not a big
    problem, because purging will be called soon, on the first allocation
    failure (alloc_vmap_area), but on 64-bit machines, e.g. x86_64 with 45
    bits of vmalloc space, that can be a disaster.

    1) I came up with a simple allocation sequence, which exhausts virtual
    space very quickly:

    while (iters) {

        /* Map/unmap big chunk */
        vaddr = vm_map_ram(pages, 16, -1, PAGE_KERNEL);
        vm_unmap_ram(vaddr, 16);

        /* Map/unmap small chunks.
         *
         * -1 for hole, which should be left at the end of each block
         * to keep it partially used, with some free space available */
        for (i = 0; i < (VMAP_BBMAP_BITS - 16) / 8 - 1; i++) {
            vaddr = vm_map_ram(pages, 8, -1, PAGE_KERNEL);
            vm_unmap_ram(vaddr, 8);
        }
    }

    The idea behind it is simple:

    1. We have to map a big chunk, e.g. 16 pages.

    2. Then we have to occupy the remaining space with smaller chunks, i.e.
    8 pages. At the end a small hole should remain to keep the block in the
    free list, but not let a big chunk occupy the remaining space.

    3. Goto 1 - an allocation request of 16 pages can't be completed (only 8
    slots are left free in the block after step #2), so a new block will be
    allocated, and all further requests will land in the newly allocated block.

    To have some measurement numbers for all further tests I set up ftrace and
    enabled 4 basic calls in a function profile:

    echo vm_map_ram > /sys/kernel/debug/tracing/set_ftrace_filter;
    echo alloc_vmap_area >> /sys/kernel/debug/tracing/set_ftrace_filter;
    echo vm_unmap_ram >> /sys/kernel/debug/tracing/set_ftrace_filter;
    echo free_vmap_block >> /sys/kernel/debug/tracing/set_ftrace_filter;

    So for this scenario I got these results:

    BEFORE (all new blocks are put to the head of a free list)
    # cat /sys/kernel/debug/tracing/trace_stat/function0
    Function            Hit       Time           Avg         s^2
    --------            ---       ----           ---         ---
    vm_map_ram          126000    30683.30 us    0.243 us    30819.36 us
    vm_unmap_ram        126000    22003.24 us    0.174 us    340.886 us
    alloc_vmap_area     1000      4132.065 us    4.132 us    0.903 us

    AFTER (all new blocks are put to the tail of a free list)
    # cat /sys/kernel/debug/tracing/trace_stat/function0
    Function            Hit       Time           Avg         s^2
    --------            ---       ----           ---         ---
    vm_map_ram          126000    28713.13 us    0.227 us    24944.70 us
    vm_unmap_ram        126000    20403.96 us    0.161 us    1429.872 us
    alloc_vmap_area     993       3916.795 us    3.944 us    29.370 us
    free_vmap_block     992       654.157 us     0.659 us    1.273 us

    SUMMARY:

    The most interesting numbers in those tables are the numbers of block
    allocations and deallocations: the alloc_vmap_area and free_vmap_block
    calls, which show that before the change blocks were not freed, and
    virtual space and physical memory (vmap_block structure allocations,
    etc) were consumed.

    The average time spent in vm_map_ram/vm_unmap_ram became slightly
    better. That can be explained by the reasonable number of blocks in the
    free list which we need to iterate to find a suitable free block.

    2) Another scenario is a random allocation:

    while (iters) {

        /* Randomly take number from a range [1..32/64] */
        nr = rand(1, VMAP_MAX_ALLOC);
        vaddr = vm_map_ram(pages, nr, -1, PAGE_KERNEL);
        vm_unmap_ram(vaddr, nr);
    }

    I chose the Mersenne Twister PRNG to generate a reproducible random state,
    guaranteeing that both runs see the same random sequence. For each
    vm_map_ram call a random number from [1..32/64] was taken to represent
    the number of pages to map.

    I did 10'000 vm_map_ram calls and got these two tables:

    BEFORE (all new blocks are put to the head of a free list)

    # cat /sys/kernel/debug/tracing/trace_stat/function0
    Function            Hit       Time           Avg         s^2
    --------            ---       ----           ---         ---
    vm_map_ram          10000     10170.01 us    1.017 us    993.609 us
    vm_unmap_ram        10000     5321.823 us    0.532 us    59.789 us
    alloc_vmap_area     420       2150.239 us    5.119 us    3.307 us
    free_vmap_block     37        159.587 us     4.313 us    134.344 us

    AFTER (all new blocks are put to the tail of a free list)

    # cat /sys/kernel/debug/tracing/trace_stat/function0
    Function            Hit       Time           Avg         s^2
    --------            ---       ----           ---         ---
    vm_map_ram          10000     7745.637 us    0.774 us    395.229 us
    vm_unmap_ram        10000     5460.573 us    0.546 us    67.187 us
    alloc_vmap_area     414       2201.650 us    5.317 us    5.591 us
    free_vmap_block     412       574.421 us     1.394 us    15.138 us

    SUMMARY:

    The 'BEFORE' table shows that 420 blocks were allocated and only 37 were
    freed. The remaining 383 blocks are still in the free list, consuming
    virtual space and physical memory.

    The 'AFTER' table shows that 414 blocks were allocated and 412 were
    really freed. 2 blocks remained in the free list.

    So fragmentation was dramatically reduced. Why? Because when we put a
    newly allocated block at the head, all further requests will occupy the
    new block, regardless of the space remaining in other blocks. In this
    scenario all requests come randomly. Eventually the remaining free space
    will be less than the requested size, the free list will be iterated and
    it is possible that nothing will be found there - finally a new block
    will be created. So exhaustion in the random scenario happens for the
    maximum possible allocation size: 32 pages on a 32-bit system and 64
    pages on a 64-bit system.

    Also the average cost of vm_map_ram was reduced from 1.017 us to 0.774 us.
    Again this can be explained by iteration through a smaller list of free
    blocks.

    3) The next simple scenario is a sequential allocation, where the
    allocation order is increased for each block. This scenario forces the
    allocator to reach the maximum number of partially free blocks in the
    free list:

    while (iters) {

        /* Populate free list with blocks with remaining space */
        for (order = 0; order <= ilog2(VMAP_MAX_ALLOC); order++) {
            nr = VMAP_BBMAP_BITS / (1 << order);

            /* Leave a hole */
            nr -= 1;

            for (i = 0; i < nr; i++) {
                vaddr = vm_map_ram(pages, (1 << order), -1, PAGE_KERNEL);
                vm_unmap_ram(vaddr, (1 << order));
            }
        }

        /* Completely occupy blocks from a free list */
        for (order = 0; order <= ilog2(VMAP_MAX_ALLOC); order++) {
            vaddr = vm_map_ram(pages, (1 << order), -1, PAGE_KERNEL);
            vm_unmap_ram(vaddr, (1 << order));
        }
    }

    Results which I got:

    BEFORE (all new blocks are put to the head of a free list)

    # cat /sys/kernel/debug/tracing/trace_stat/function0
    Function            Hit       Time           Avg         s^2
    --------            ---       ----           ---         ---
    vm_map_ram          2032000   399545.2 us    0.196 us    467123.7 us
    vm_unmap_ram        2032000   363225.7 us    0.178 us    111405.9 us
    alloc_vmap_area     7001      30627.76 us    4.374 us    495.755 us
    free_vmap_block     6993      7011.685 us    1.002 us    159.090 us

    AFTER (all new blocks are put to the tail of a free list)

    # cat /sys/kernel/debug/tracing/trace_stat/function0
    Function            Hit       Time           Avg         s^2
    --------            ---       ----           ---         ---
    vm_map_ram          2032000   394259.7 us    0.194 us    589395.9 us
    vm_unmap_ram        2032000   292500.7 us    0.143 us    94181.08 us
    alloc_vmap_area     7000      31103.11 us    4.443 us    703.225 us
    free_vmap_block     7000      6750.844 us    0.964 us    119.112 us

    SUMMARY:

    No surprises here, almost all numbers are the same.

    While fixing this fragmentation problem I also made some improvements to
    the allocation logic for a new vmap block: occupy the block immediately
    and get rid of the extra search in the free list.

    I also replaced the dirty bitmap with min/max dirty range values to make
    the logic simpler and slightly faster, since comparing two longs costs
    less than looping through a bitmap.

    This patchset raises several questions:

    Q: Think the problem you comments is already known so that I wrote comments
    about it as "it could consume lots of address space through fragmentation".
    Could you tell me about your situation and reason why it should be avoided?
    Gioh Kim

    A: Indeed, there was commit 364376383 which adds an explicit comment about
    fragmentation. But the fragmentation described in that comment is caused
    by mixing long-lived and short-lived objects, when a whole block is pinned
    in memory because some page slots are still in use. Here I am talking
    about blocks which are free: nobody uses them, yet the allocator keeps
    them alive forever, continuously allocating new blocks.

    Q: I think that if you put newly allocated block to the tail of a free
    list, below example would results in enormous performance degradation.

    new block: 1MB (256 pages)

    while (iters--) {
    vm_map_ram(3 or something else not dividable for 256) * 85
    vm_unmap_ram(3) * 85
    }

    On every iteration, it needs newly allocated block and it is put to the
    tail of a free list so finding it consumes large amount of time.
    Joonsoo Kim

    A: The second patch in the current patchset gets rid of the extra search
    in the free list, so the new block will be occupied immediately.

    Also, the scenario above is impossible, because vm_map_ram allocates
    virtual ranges in orders, i.e. 2^n. Passing 3 to vm_map_ram allocates
    4 slots in a block, and 256 slots (the capacity of a block) is of course
    divisible by 4, so the block will be completely occupied.

    But there is a worst case which we can reach: each free block has a hole
    equal to its order size.

    The maximum allocation size is 64 pages on a 64-bit system
    (if you try to map more, the original alloc_vmap_area will be called).

    So the maximum order is 6. That means that the worst case, before the
    allocator decides to allocate a new block, is to iterate over 7 blocks:

    HEAD
    1st block - has 1 page slot free (order 0)
    2nd block - has 2 page slots free (order 1)
    3rd block - has 4 page slots free (order 2)
    4th block - has 8 page slots free (order 3)
    5th block - has 16 page slots free (order 4)
    6th block - has 32 page slots free (order 5)
    7th block - has 64 page slots free (order 6)
    TAIL

    So the worst scenario on a 64-bit system is that each CPU queue can have 7
    blocks in its free list.

    This can happen only if you allocate blocks with increasing order
    (as I did in the function written in the comment of the first patch).
    This is a weird and rare case, but it is still possible. Afterwards you
    will get 7 blocks in a list.

    All further requests should be placed in a newly allocated block or some
    free slots should be found in the free list.
    This does not look dramatically awful.
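
    The order arithmetic used in this answer can be checked with a small
    sketch. The helper names order_for_pages and slots_for_pages are
    hypothetical and exist only for this illustration; they are not the
    allocator's own functions.

```c
/* Smallest order such that (1 << order) >= nr_pages. */
static unsigned int order_for_pages(unsigned long nr_pages)
{
    unsigned int order = 0;

    while ((1UL << order) < nr_pages)
        order++;
    return order;
}

/* Number of block slots consumed by a request for nr_pages pages. */
static unsigned long slots_for_pages(unsigned long nr_pages)
{
    return 1UL << order_for_pages(nr_pages);
}
```

    A request for 3 pages is rounded up to 4 slots, which divides a 256-slot
    block evenly. A 64-page request has order 6, so the orders 0 through 6
    give the 7 partially filled blocks of the worst case listed above.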

    This patch (of 3):

    If a suitable block can't be found, a new block is allocated and put at
    the head of the free list, so on the next iteration this new block will
    be found first.

    That's bad, because old blocks in the free list will not get a chance to
    be fully used, thus fragmentation will grow.

    Let's consider this simple example:

    #1 We have one block in the free list which is partially used, and where
    only one page is free:

    HEAD |xxxxxxxxx-| TAIL
                   ^
                   free space for 1 page, order 0

    #2 A new allocation request of order 1 (2 pages) comes; a new block is
    allocated since we do not have free space to complete this request. The
    new block is put at the head of the free list:

    HEAD |----------|xxxxxxxxx-| TAIL

    #3 Two pages are occupied in the newly found block:

    HEAD |xx--------|xxxxxxxxx-| TAIL
          ^
          two pages mapped here

    #4 A new allocation request of order 0 (1 page) comes. The block created
    in step #2 is located at the beginning of the free list, so it will be
    found first:

    HEAD |xxX-------|xxxxxxxxx-| TAIL
            ^                 ^
            page mapped here, but better to use this hole

    It is obvious that it is better to complete the request of step #4 using
    the old block, where free space is left, because otherwise fragmentation
    will increase greatly.

    But fragmentation is not the only problem. The worst thing is that I can
    easily create a scenario where the whole vmalloc space is exhausted by
    blocks which are not used, but are already dirty and have several free
    pages.

    Let's consider this function, whose execution should be pinned to one CPU:

    static void exhaust_virtual_space(struct page *pages[16], int iters)
    {
        /* Firstly we have to map a big chunk, e.g. 16 pages.
         * Then we have to occupy the remaining space with smaller
         * chunks, i.e. 8 pages. At the end small hole should remain.
         * So at the end of our allocation sequence block looks like
         * this:
         *                XX  big chunk
         * |XXxxxxxxx-|    x  small chunk
         *                 -  hole, which is enough for a small chunk,
         *                    but is not enough for a big chunk
         */
        while (iters--) {
            int i;
            void *vaddr;

            /* Map/unmap big chunk */
            vaddr = vm_map_ram(pages, 16, -1, PAGE_KERNEL);
            vm_unmap_ram(vaddr, 16);

            /* Map/unmap small chunks.
             *
             * -1 for hole, which should be left at the end of each block
             * to keep it partially used, with some free space available */
            for (i = 0; i < (VMAP_BBMAP_BITS - 16) / 8 - 1; i++) {
                vaddr = vm_map_ram(pages, 8, -1, PAGE_KERNEL);
                vm_unmap_ram(vaddr, 8);
            }
        }
    }

    On every iteration a new block (1MB of vm area in my case) will be
    allocated and then occupied, without any attempt to satisfy the small
    allocation requests from previously allocated blocks in the free list.

    In the case of random allocation (size randomly taken from the range
    [1..64] in the 64-bit case or [1..32] in the 32-bit case) the situation
    is the same: new blocks continue to appear whenever the maximum possible
    allocation size (32 or 64) is passed to the allocator, because none of
    the remaining blocks in the free list has enough free space to complete
    this allocation request.

    In summary, if new blocks are put at the head of the free list,
    virtual space will eventually be exhausted.

    In the current patch I simply put the newly allocated block at the tail
    of the free list, thus reducing fragmentation and giving a chance to
    satisfy allocation requests from older blocks with possible holes left.
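
    The effect of head versus tail insertion can be reproduced with a toy
    userspace model of one CPU's free list. This is only a sketch under
    stated assumptions: blocks have 256 slots, space inside a block is handed
    out sequentially and never reused, every mapping is unmapped immediately
    (as in the test loop above), and a newly created block is occupied right
    away. All names in it (freelist_sim, map_unmap, run_exhaust) are invented
    for the illustration.

```c
#include <stdbool.h>

#define BLOCK_SLOTS 256   /* assumed block capacity in page slots */
#define MAX_BLOCKS  1024  /* capacity of this toy free list */

/* Toy free list: free_slots[0] is the head, free_slots[nr - 1] the tail. */
struct freelist_sim {
    int free_slots[MAX_BLOCKS];
    int nr;         /* blocks currently on the free list */
    int allocated;  /* blocks created so far */
    int freed;      /* blocks released so far */
};

/* Map and immediately unmap n slots. */
static void map_unmap(struct freelist_sim *s, int n, bool new_at_tail)
{
    int idx = -1;

    /* First fit, scanning from the head of the free list. */
    for (int i = 0; i < s->nr; i++) {
        if (s->free_slots[i] >= n) {
            idx = i;
            break;
        }
    }

    if (idx < 0) {
        /* Nothing fits: create a block and occupy it right away. */
        if (s->nr == MAX_BLOCKS)
            return; /* toy list is full; drop the request */
        if (new_at_tail) {
            idx = s->nr;
        } else {
            for (int i = s->nr; i > 0; i--)
                s->free_slots[i] = s->free_slots[i - 1];
            idx = 0;
        }
        s->free_slots[idx] = BLOCK_SLOTS;
        s->nr++;
        s->allocated++;
    }

    s->free_slots[idx] -= n;

    /*
     * Every mapping is unmapped at once, so a block with no free slots
     * left is also fully dirty and can be freed.
     */
    if (s->free_slots[idx] == 0) {
        for (int i = idx; i < s->nr - 1; i++)
            s->free_slots[i] = s->free_slots[i + 1];
        s->nr--;
        s->freed++;
    }
}

/* One 16-slot chunk, then one fewer 8-slot chunks than would fill a block. */
static struct freelist_sim run_exhaust(int iters, bool new_at_tail)
{
    struct freelist_sim s = { .nr = 0 };

    while (iters--) {
        map_unmap(&s, 16, new_at_tail);
        for (int i = 0; i < (BLOCK_SLOTS - 16) / 8 - 1; i++)
            map_unmap(&s, 8, new_at_tail);
    }
    return s;
}
```

    In this model, run_exhaust(100, false) ends with one live block per
    iteration and none freed, while run_exhaust(100, true) keeps completing
    and freeing older blocks, so only a couple of blocks stay on the list.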

    Signed-off-by: Roman Pen
    Cc: Eric Dumazet
    Acked-by: Joonsoo Kim
    Cc: David Rientjes
    Cc: WANG Chao
    Cc: Fabian Frederick
    Cc: Christoph Lameter
    Cc: Gioh Kim
    Cc: Rob Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Pen
     
  • Add min_size mount option to the hugetlbfs documentation. Also, add the
    missing pagesize option and mention that size can be specified as bytes or
    a percentage of huge page pool.

    Signed-off-by: Mike Kravetz
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Joonsoo Kim
    Cc: Andi Kleen
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Make 'min_size=' an option when mounting a hugetlbfs. This
    option takes the same value as the 'size' option. min_size can be
    specified without specifying size. If both are specified, min_size must
    be less than or equal to size, else the mount will fail. If min_size is
    specified, then at mount time an attempt is made to reserve min_size
    pages. If the reservation fails, the mount fails. At umount time, the
    reserved pages are released.

    Signed-off-by: Mike Kravetz
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Joonsoo Kim
    Cc: Andi Kleen
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • The same routines that perform subpool maximum size accounting
    hugepage_subpool_get/put_pages() are modified to also perform minimum size
    accounting. When a delta value is passed to these routines, calculate how
    global reservations must be adjusted to maintain the subpool minimum size.
    The routines now return this global reserve count adjustment. This
    global reserve count adjustment is then passed to the global accounting
    routine hugetlb_acct_memory().
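
    To make the arithmetic concrete, here is a simplified model of minimum
    size accounting. It is not the code from this patch: it leaves out
    maximum size accounting and locking, derives the reserve from a usage
    counter, and the names subpool_model, model_get_pages and
    model_put_pages are hypothetical.

```c
/* Simplified subpool: only the minimum size is modelled. */
struct subpool_model {
    long min_hpages;   /* pages the subpool must always be able to use */
    long used_hpages;  /* pages currently handed out to the subpool */
};

/* Pages that must stay reserved globally to honour the minimum. */
static long model_reserved(const struct subpool_model *sp)
{
    long rsv = sp->min_hpages - sp->used_hpages;

    return rsv > 0 ? rsv : 0;
}

/*
 * Take delta pages. Returns how many of them must be newly accounted
 * against the global pool; the rest is covered by the existing reserve.
 */
static long model_get_pages(struct subpool_model *sp, long delta)
{
    long before = model_reserved(sp);

    sp->used_hpages += delta;
    return delta - (before - model_reserved(sp));
}

/*
 * Give back delta pages. Returns how many of them can be released to the
 * global pool; the rest is kept to rebuild the reserve.
 */
static long model_put_pages(struct subpool_model *sp, long delta)
{
    long before = model_reserved(sp);

    sp->used_hpages -= delta;
    return delta - (model_reserved(sp) - before);
}
```

    With a minimum of 10 pages, taking 4 pages needs no global adjustment,
    taking 8 more needs 2 extra pages, and giving 3 back releases only 2
    because one page goes back into the reserve.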

    Signed-off-by: Mike Kravetz
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Joonsoo Kim
    Cc: Andi Kleen
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • hugetlbfs allocates huge pages from the global pool as needed. Even if
    the global pool contains a sufficient number of pages for the filesystem
    size at mount time, those global pages could be grabbed for some other
    use. As a result, filesystem huge page allocations may fail due to lack
    of pages.

    Applications such as a database want to use huge pages for performance
    reasons. hugetlbfs filesystem semantics with ownership and modes work
    well to manage access to a pool of huge pages. However, the application
    would like some reasonable assurance that allocations will not fail due to
    a lack of huge pages. At application startup time, the application would
    like to configure itself to use a specific number of huge pages. Before
    starting, the application can check to make sure that enough huge pages
    exist in the system global pools. However, there are no guarantees that
    those pages will be available when needed by the application. What the
    application wants is exclusive use of a subset of huge pages.

    Add a new hugetlbfs mount option 'min_size=' to indicate that the
    specified number of pages will be available for use by the filesystem. At
    mount time, this number of huge pages will be reserved for exclusive use
    of the filesystem. If there is not a sufficient number of free pages, the
    mount will fail. As pages are allocated to and freed from the
    filesystem, the number of reserved pages is adjusted so that the specified
    minimum is maintained.

    This patch (of 4):

    Add a field to the subpool structure to indicate the minimum number of
    huge pages to always be used by this subpool. This minimum count includes
    allocated pages as well as reserved pages. If the minimum number of pages
    for the subpool has not been allocated, pages are reserved up to this
    minimum. An additional field (rsv_hpages) is used to track the number of
    pages reserved to meet this minimum size. The hstate pointer in the
    subpool is convenient to have when reserving and unreserving the pages.

    Signed-off-by: Mike Kravetz
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Joonsoo Kim
    Cc: Andi Kleen
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • When compaction is activated via /proc/sys/vm/compact_memory it is
    better to scan the whole zone. Some platforms, for instance ARM, have
    the start_pfn of a zone at zero, so the first attempt to compact via
    /proc doesn't work. The compaction scanner position needs to be reset
    first.

    Signed-off-by: Gioh Kim
    Acked-by: Vlastimil Babka
    Acked-by: David Rientjes
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gioh Kim
     
  • memcg currently uses hardcoded GFP_TRANSHUGE gfp flags for all THP
    charges. THP allocations, however, might be using different flags
    depending on /sys/kernel/mm/transparent_hugepage/{,khugepaged/}defrag and
    the current allocation context.

    The primary difference is that defrag configured to the "madvise" value
    will clear the __GFP_WAIT flag from the core gfp mask to make the
    allocation lighter for all mappings which are not backed by VM_HUGEPAGE
    vmas. If the memcg charge path ignores this fact we will get a light
    allocation, but a potential memcg reclaim would defeat the whole point
    of the configuration.

    Fix the mismatch by providing the same gfp mask used for the allocation to
    the charge functions. This is quite easy for all paths except for the
    khugepaged kernel thread with !CONFIG_NUMA, which does a pre-allocation
    long before the allocated page is used in collapse_huge_page via
    khugepaged_alloc_page. To avoid cluttering the whole code path
    from khugepaged_do_scan we simply return the current flags as per the
    khugepaged_defrag() value, which might have changed since the
    preallocation. If somebody changed the value of the knob we would charge
    differently, but this shouldn't happen often and it is definitely not
    critical because it would only lead to a reduced success rate of one-off
    THP promotion.

    [akpm@linux-foundation.org: fix weird code layout while we're there]
    [rientjes@google.com: clean up around alloc_hugepage_gfpmask()]
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Acked-by: David Rientjes
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • "deactivate_page" was created for file invalidation, so its logic is too
    specific to file-backed pages. So, let's rename the function and its
    data to file-specific names and free up the generic name.

    Signed-off-by: Minchan Kim
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Wang, Yalin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • …d the unevictable LRU

    The memory compaction code uses the migration code to do most of the
    work in compaction. However, the compaction code interacts with the
    unevictable LRU differently than migration code and this difference
    should be noted in the documentation.

    [akpm@linux-foundation.org: identify /proc/sys/vm/compact_unevictable directly]
    Signed-off-by: Eric B Munson <emunson@akamai.com>
    Cc: Michal Hocko <mhocko@suse.cz>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Eric B Munson
     
  • Currently, pages which are marked as unevictable are protected from
    compaction, but not from other types of migration. The POSIX real time
    extension explicitly states that mlock() will prevent a major page
    fault, but the spirit of this is that mlock() should give a process the
    ability to control sources of latency, including minor page faults.
    However, the mlock manpage only explicitly says that a locked page will
    not be written to swap and this can cause some confusion. The
    compaction code today does not give a developer who wants to avoid swap
    but wants to have large contiguous areas available any method to achieve
    this state. This patch introduces a sysctl for controlling compaction
    behavior with respect to the unevictable lru. Users who demand no page
    faults after a page is present can set compact_unevictable_allowed to 0
    and users who need the large contiguous areas can enable compaction on
    locked memory by leaving the default value of 1.

    To illustrate this problem I wrote a quick test program that mmaps a
    large number of 1MB files filled with random data. These maps are
    created locked and read only. Then every other mmap is unmapped and I
    attempt to allocate huge pages to the static huge page pool. When the
    compact_unevictable_allowed sysctl is 0, I cannot allocate hugepages
    after fragmenting memory. When the value is set to 1, allocations
    succeed.

    Signed-off-by: Eric B Munson
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Acked-by: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Thomas Gleixner
    Cc: Christoph Lameter
    Cc: Peter Zijlstra
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     
  • With the page flag sanitization patchset, an invalid usage of
    ClearPageReclaim() is detected in set_page_dirty(). This can be called
    from __unmap_hugepage_range(), so let's check PageReclaim() before trying
    to clear it to avoid the misuse.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • With the page flag sanitization patchset, an invalid usage of
    ClearPageSwapCache() is detected in migrate_page_copy().
    migrate_page_copy() is shared by both the normal and hugepage (both thp
    and hugetlb) code paths, so let's check PageSwapCache() and clear the
    flag only if it is set, to avoid the invalid clear operation.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • THP uses tail page refcounting to be able to split huge pages at any time.
    Tail page refcounting is not needed for other users of compound pages and
    it's harmful because of overhead.

    We try to exclude non-THP pages from tail page refcounting using
    __compound_tail_refcounted() check. It excludes most common non-THP
    compound pages: SL*B and hugetlb, but it doesn't catch rest of __GFP_COMP
    users -- drivers.

    And it's not only about overhead.

    Drivers might want to use compound pages to get refcounting semantics
    suitable for mapping high-order pages to userspace. But tail page
    refcounting breaks it.

    Tail page refcounting uses ->_mapcount in tail pages to store GUP pins on
    them. It means GUP pins would affect page_mapcount() for tail pages.
    It's not a problem for THP, because it never maps tail pages. But unlike
    THP, drivers map parts of compound pages with PTEs and it makes
    page_mapcount() be called for tail pages.

    In particular, GUP pins would shift PSS up and affect /proc/kpagecount for
    such pages. But I'm not aware of anything which can lead to a crash or
    other serious misbehaviour.

    Since currently all THP pages are anonymous and all driver pages are not,
    we can fix the __compound_tail_refcounted() check by requiring PageAnon()
    to enable tail page refcounting.
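
    A minimal userspace sketch of the check described above, assuming a
    hypothetical struct page_sketch whose booleans stand in for PageSlab(),
    PageHeadHuge() and PageAnon() on the head page. It is not the kernel
    code; it only models why a driver's __GFP_COMP page passes the old test
    but not the new one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the head page's flags; not struct page. */
struct page_sketch {
	bool slab;	/* models PageSlab() */
	bool huge;	/* models PageHeadHuge() (hugetlb) */
	bool anon;	/* models PageAnon() */
};

/* Old policy: only SL*B and hugetlb pages are excluded. */
static bool tail_refcounted_old(const struct page_sketch *head)
{
	assert(head != NULL);
	return !head->slab && !head->huge;
}

/* New policy: additionally require an anonymous page, i.e. THP. */
static bool tail_refcounted_new(const struct page_sketch *head)
{
	return tail_refcounted_old(head) && head->anon;
}
```

    For a driver page (all three fields false) the old predicate returns
    true and the new one returns false; for an anonymous THP head page
    (anon set) both return true.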

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Hugh Dickins
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Currently we take a naive approach to page flags on compound pages - we
    set the flag on the page without considering whether the flag makes sense
    for a tail page or for a compound page in general. This patchset tries to
    sort this out by defining a per-flag policy on what needs to be done when
    a page-flag helper operates on a compound page.

    The last patch in the patchset also sanitizes usage of page->mapping for
    tail pages. We don't define the meaning of page->mapping for tail
    pages. Currently it's always NULL, which can be inconsistent with head
    page and potentially lead to problems.

    For now I caught one case of illegal usage of page flags or ->mapping:
    sound subsystem allocates pages with __GFP_COMP and maps them with PTEs.
    It leads to setting dirty bit on tail pages and access to tail_page's
    ->mapping. I don't see any bad behaviour caused by this, but worth
    fixing anyway.

    This patchset makes more sense if you take my THP refcounting into
    account: we will see more compound pages mapped with PTEs and we need to
    define behaviour of flags on compound pages to avoid bugs.

    This patch (of 16):

    We have page-flags helper function declarations/definitions spread over
    several header files. Let's consolidate them in one header.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • This cleanup patch moves all strings passed to action_result() into a
    single array action_page_type so that a reader can easily find which
    kind of action results are possible. And this patch also fixes the
    odd lines to be printed out, like "unknown page state page" or "free
    buddy, 2nd try page".
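
    The idea of keeping every message in one indexed array can be sketched
    as follows. The enum, the array name page_type_name and the strings are
    hypothetical and chosen for illustration; they are not the identifiers
    or messages used by the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of page states; names are illustrative only. */
enum page_type_sketch {
	SKETCH_KERNEL,
	SKETCH_FREE_BUDDY,
	SKETCH_UNKNOWN,
	SKETCH_NR_TYPES
};

/* One table holds every message, so all possible outputs are visible here. */
static const char * const page_type_name[SKETCH_NR_TYPES] = {
	[SKETCH_KERNEL]     = "reserved kernel page",
	[SKETCH_FREE_BUDDY] = "free buddy page",
	[SKETCH_UNKNOWN]    = "unknown page",
};

/* Map a type to its message; out-of-range values fall back to "unknown". */
static const char *page_type_to_name(int type)
{
	if (type < 0 || type >= SKETCH_NR_TYPES)
		type = SKETCH_UNKNOWN;
	/* Catches an enum value that was added without a matching string. */
	assert(page_type_name[type] != NULL);
	return page_type_name[type];
}
```

    Because callers pass an enum value instead of a free-form string, the
    word "page" appears once per table entry and cannot be appended twice
    by the printing code.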

    [akpm@linux-foundation.org: rename messages, per David]
    [akpm@linux-foundation.org: s/DIRTY_UNEVICTABLE_LRU/CLEAN_UNEVICTABLE_LRU/, per Andi]
    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Andi Kleen
    Cc: Tony Luck
    Cc: "Xie XiuQi"
    Cc: Steven Rostedt
    Cc: Chen Gong
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Low and high watermarks, as they are defined in the TODO to the mem_cgroup
    struct, have already been implemented by Johannes, so remove the stale
    comment.

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • mem_cgroup_lookup() is a wrapper around mem_cgroup_from_id(), which
    checks that id != 0 before issuing the function call. Today, there is
    no point in this additional check apart from optimization, because no
    css has id 0, so css_from_id() would not find one anyway.

    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • All callers of zone_movable_is_highmem are under #ifdef CONFIG_HIGHMEM,
    so the else branch return 0 is not needed.

    Signed-off-by: Zhang Zhen
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Zhen
     
  • Alter 'taks' -> 'task'

    Signed-off-by: Yaowei Bai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     
  • vfs_readdir() was replaced by iterate_dir() in commit 5c0ba4e0762e
    ("[readdir] introduce iterate_dir() and dir_context").

    Signed-off-by: Zhang Zhen
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Zhen
     
  • Pull networking updates from David Miller:

    1) Add BQL support to via-rhine, from Tino Reichardt.

    2) Integrate SWITCHDEV layer support into the DSA layer, so DSA drivers
    can support hw switch offloading. From Florian Fainelli.

    3) Allow 'ip address' commands to initiate multicast group join/leave,
    from Madhu Challa.

    4) Many ipv4 FIB lookup optimizations from Alexander Duyck.

    5) Support EBPF in cls_bpf classifier and act_bpf action, from Daniel
    Borkmann.

    6) Remove the ugly compat support in ARP for ugly layers like ax25,
    rose, etc. And use this to clean up the neigh layer, then use it to
    implement MPLS support. All from Eric Biederman.

    7) Support L3 forwarding offloading in switches, from Scott Feldman.

    8) Collapse the LOCAL and MAIN ipv4 FIB tables when possible, to speed
    up route lookups even further. From Alexander Duyck.

    9) Many improvements and bug fixes to the rhashtable implementation,
    from Herbert Xu and Thomas Graf. In particular, in the case where
    an rhashtable user bulk adds a large number of items into an empty
    table, we expand the table much more sanely.

    10) Don't make the tcp_metrics hash table per-namespace, from Eric
    Biederman.

    11) Extend EBPF to access SKB fields, from Alexei Starovoitov.

    12) Split out new connection request sockets so that they can be
    established in the main hash table. Much less false sharing since
    hash lookups go direct to the request sockets instead of having to
    go first to the listener then to the request socks hashed
    underneath. From Eric Dumazet.

    13) Add async I/O support for crypto AF_ALG sockets, from Tadeusz Struk.

    14) Support stable privacy address generation for RFC7217 in IPV6. From
    Hannes Frederic Sowa.

    15) Hash network namespace into IP frag IDs, also from Hannes Frederic
    Sowa.

    16) Convert PTP get/set methods to use 64-bit time, from Richard
    Cochran.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1816 commits)
    fm10k: Bump driver version to 0.15.2
    fm10k: corrected VF multicast update
    fm10k: mbx_update_max_size does not drop all oversized messages
    fm10k: reset head instead of calling update_max_size
    fm10k: renamed mbx_tx_dropped to mbx_tx_oversized
    fm10k: update xcast mode before synchronizing multicast addresses
    fm10k: start service timer on probe
    fm10k: fix function header comment
    fm10k: comment next_vf_mbx flow
    fm10k: don't handle mailbox events in iov_event path and always process mailbox
    fm10k: use separate workqueue for fm10k driver
    fm10k: Set PF queues to unlimited bandwidth during virtualization
    fm10k: expose tx_timeout_count as an ethtool stat
    fm10k: only increment tx_timeout_count in Tx hang path
    fm10k: remove extraneous "Reset interface" message
    fm10k: separate PF only stats so that VF does not display them
    fm10k: use hw->mac.max_queues for stats
    fm10k: only show actual queues, not the maximum in hardware
    fm10k: allow creation of VLAN on default vid
    fm10k: fix unused warnings
    ...

    Linus Torvalds
     

15 Apr, 2015

14 commits

  • Pull ARM updates from Russell King:
    "Included in this update are both some long term fixes and some new
    features.

    Fixes:

    - An integer overflow in the calculation of ELF_ET_DYN_BASE.

    - Avoiding OOMs for high-order IOMMU allocations

    - SMP requires the data cache to be enabled for synchronisation
    primitives to work, so prevent the CPU_DCACHE_DISABLE option being
    visible on SMP builds.

    - A bug going back 10+ years in the noMMU ARM94* CPU support code,
    where it corrupts registers. Found by folk getting Linux running
    on their cameras.

    - Versatile Express needs an errata workaround enabled for CPU
    hot-unplug to work.

    Features:

    - Clean up module linker by handling out of range relocations
    separately from relocation cases we don't handle.

    - Fix a long term bug in the pci_mmap_page_range() code, which we
    hope won't impact userspace (we hope there's no users of the
    existing broken interface.)

    - Don't map DMA coherent allocations when we don't have a MMU.

    - Drop experimental status for SMP_ON_UP.

    - Warn when DT doesn't specify ePAPR mandatory cache properties.

    - Add documentation concerning how we find the start of physical
    memory for AUTO_ZRELADDR kernels, detailing why we have chosen the
    mask and the implications of changing it.

    - Updates from Ard Biesheuvel to address some issues with large
    kernels (such as allyesconfig) failing to link.

    - Allow hibernation to work on modern (ARMv7) CPUs - this appears to
    have never worked in the past on these CPUs.

    - Enable IRQ_SHOW_LEVEL, which changes the /proc/interrupts output
    format (hopefully without userspace breaking... let's hope that if
    it causes someone a problem, they tell us.)

    - Fix tegra-ahb DT offsets.

    - Rework ARM errata 643719 code (and ARMv7 flush_cache_louis()/
    flush_dcache_all()) code to be more efficient, and enable this
    errata workaround by default for ARMv7+SMP CPUs. This complements
    the Versatile Express fix above.

    - Rework ARMv7 context code for errata 430973, so that only Cortex A8
    CPUs are impacted by the branch target buffer flush when this
    errata is enabled. Also update the help text to indicate that all
    r1p* A8 CPUs are impacted.

    - Switch ARM to the generic show_mem() implementation, it conveys all
    the information which we were already reporting.

    - Prevent slow timer sources being used for udelay() - timers running
    at less than 1MHz are not useful for this, and can cause udelay()
    to return immediately, without any wait. Using such a slow timer
    is silly.

    - VDSO support for 32-bit ARM, mainly for gettimeofday() using the
    ARM architected timer.

    - Perf support for Scorpion performance monitoring units"

    vdso semantic conflict fixed up as per linux-next.

    * 'for-linus' of git://ftp.arm.linux.org.uk/~rmk/linux-arm: (52 commits)
    ARM: update errata 430973 documentation to cover Cortex A8 r1p*
    ARM: ensure delay timer has sufficient accuracy for delays
    ARM: switch to use the generic show_mem() implementation
    ARM: proc-v7: avoid errata 430973 workaround for non-Cortex A8 CPUs
    ARM: enable ARM errata 643719 workaround by default
    ARM: cache-v7: optimise test for Cortex A9 r0pX devices
    ARM: cache-v7: optimise branches in v7_flush_cache_louis
    ARM: cache-v7: consolidate initialisation of cache level index
    ARM: cache-v7: shift CLIDR to extract appropriate field before masking
    ARM: cache-v7: use movw/movt instructions
    ARM: allow 16-bit instructions in ALT_UP()
    ARM: proc-arm94*.S: fix setup function
    ARM: vexpress: fix CPU hotplug with CT9x4 tile.
    ARM: 8276/1: Make CPU_DCACHE_DISABLE depend on !SMP
    ARM: 8335/1: Documentation: DT bindings: Tegra AHB: document the legacy base address
    ARM: 8334/1: amba: tegra-ahb: detect and correct bogus base address
    ARM: 8333/1: amba: tegra-ahb: fix register offsets in the macros
    ARM: 8339/1: Enable CONFIG_GENERIC_IRQ_SHOW_LEVEL
    ARM: 8338/1: kexec: Relax SMP validation to improve DT compatibility
    ARM: 8337/1: mm: Do not invoke OOM for higher order IOMMU DMA allocations
    ...

    Linus Torvalds
     
  • Pull s390 updates from Martin Schwidefsky:
    "The major change in this merge is the removal of the support for
    31-bit kernels. Naturally 31-bit user space will continue to work via
    the compat layer.

    And then some cleanup, some improvements and bug fixes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (23 commits)
    s390/smp: wait until secondaries are active & online
    s390/hibernate: fix save and restore of kernel text section
    s390/cacheinfo: add missing facility check
    s390/syscalls: simplify syscall_get_arch()
    s390/irq: enforce correct irqclass_sub_desc array size
    s390: remove "64" suffix from mem64.S and swsusp_asm64.S
    s390/ipl: cleanup macro usage
    s390/ipl: cleanup shutdown_action attributes
    s390/ipl: cleanup bin attr usage
    s390/uprobes: fix address space annotation
    s390: add missing arch_release_task_struct() declaration
    s390: make couple of functions and variables static
    s390/maccess: improve s390_kernel_write()
    s390/maccess: remove potentially broken probe_kernel_write()
    s390/watchdog: support for KVM hypervisors and delete pr_info messages
    s390/watchdog: enable KEEPALIVE for /dev/watchdog
    s390/dasd: remove setting of scheduler from driver
    s390/traps: panic() instead of die() on translation exception
    s390: remove test_facility(2) (== z/Architecture mode active) checks
    s390/cmpxchg: simplify cmpxchg_double
    ...

    Linus Torvalds
     
  • Pull power management and ACPI updates from Rafael Wysocki:
    "These are mostly fixes and cleanups all over, although there are a few
    items that sort of fall into the new feature category.

    First off, we have new callbacks for PM domains that should help us to
    handle some issues related to device initialization in a better way.

    There also is some consolidation in the unified device properties API
    area allowing us to use that interface for accessing data coming from
    platform initialization code in addition to firmware-provided data.

    We have some new device/CPU IDs in a few drivers, support for new
    chips and a new cpufreq driver too.

    Specifics:

    - Generic PM domains support update including new PM domain callbacks
    to handle device initialization better (Russell King, Rafael J
    Wysocki, Kevin Hilman)

    - Unified device properties API update including a new mechanism for
    accessing data provided by platform initialization code (Rafael J
    Wysocki, Adrian Hunter)

    - ARM cpuidle update including ARM32/ARM64 handling consolidation
    (Daniel Lezcano)

    - intel_idle update including support for the Silvermont Core in the
    Baytrail SOC and for the Airmont Core in the Cherrytrail and
    Braswell SOCs (Len Brown, Mathias Krause)

    - New cpufreq driver for Hisilicon ACPU (Leo Yan)

    - intel_pstate update including support for the Knights Landing chip
    (Dasaratharaman Chandramouli, Kristen Carlson Accardi)

    - QorIQ cpufreq driver update (Tang Yuantian, Arnd Bergmann)

    - powernv cpufreq driver update (Shilpasri G Bhat)

    - devfreq update including Tegra support changes (Tomeu Vizoso,
    MyungJoo Ham, Chanwoo Choi)

    - powercap RAPL (Running-Average Power Limit) driver update including
    support for Intel Broadwell server chips (Jacob Pan, Mathias Krause)

    - ACPI device enumeration update related to the handling of the
    special PRP0001 device ID allowing DT-style 'compatible' property
    to be used for ACPI device identification (Rafael J Wysocki)

    - ACPI EC driver update including limited _DEP support (Lan Tianyu,
    Lv Zheng)

    - ACPI backlight driver update including a new mechanism to allow
    native backlight handling to be forced on non-Windows 8 systems and
    a new quirk for Lenovo Ideapad Z570 (Aaron Lu, Hans de Goede)

    - New Windows Vista compatibility quirk for Sony VGN-SR19XN (Chen Yu)

    - Assorted ACPI fixes and cleanups (Aaron Lu, Martin Kepplinger,
    Masanari Iida, Mika Westerberg, Nan Li, Rafael J Wysocki)

    - Fixes related to suspend-to-idle for the iTCO watchdog driver and
    the ACPI core system suspend/resume code (Rafael J Wysocki, Chen Yu)

    - PM tracing support for the suspend phase of system suspend/resume
    transitions (Zhonghui Fu)

    - Configurable delay for the system suspend/resume testing facility
    (Brian Norris)

    - PNP subsystem cleanups (Peter Huewe, Rafael J Wysocki)"

    * tag 'pm+acpi-4.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (74 commits)
    ACPI / scan: Fix NULL pointer dereference in acpi_companion_match()
    ACPI / scan: Rework modalias creation when "compatible" is present
    intel_idle: mark cpu id array as __initconst
    powercap / RAPL: mark rapl_ids array as __initconst
    powercap / RAPL: add ID for Broadwell server
    intel_pstate: Knights Landing support
    intel_pstate: remove MSR test
    cpufreq: fix qoriq uniprocessor build
    ACPI / scan: Take the PRP0001 position in the list of IDs into account
    ACPI / scan: Simplify acpi_match_device()
    ACPI / scan: Generalize of_compatible matching
    device property: Introduce firmware node type for platform data
    device property: Make it possible to use secondary firmware nodes
    PM / watchdog: iTCO: stop watchdog during system suspend
    cpufreq: hisilicon: add acpu driver
    ACPI / EC: Call acpi_walk_dep_device_list() after installing EC opregion handler
    cpufreq: powernv: Report cpu frequency throttling
    intel_idle: Add support for the Airmont Core in the Cherrytrail and Braswell SOCs
    intel_idle: Update support for Silvermont Core in Baytrail SOC
    PM / devfreq: tegra: Register governor on module init
    ...

    Linus Torvalds
     
  • Jeff Kirsher says:

    ====================
    Intel Wired LAN Driver Updates 2015-04-14

    This series contains updates to fm10k only.

    Fixed transmit statistics, which were actually using values from the
    receive ring instead of the transmit ring. Fixed up spelling mistakes
    in code comments and resolved unused argument warnings. Added support
    for netconsole. Fixed up statistic reporting so that we are only
    reporting from actual queues as well as display PF only stats for
    just the PF and not the VF. Also fixed an issue that when returning
    virtualization queues from the VF back to the PF, we were retaining
    the VF rate limiter.

    Fixed up the driver to use a separate workqueue, which helps reduce
    and stabilize latency between scheduling the work in our interrupt and
    actually performing the work.

    Fixed a bug where the VF tried to set a multicast address before
    requesting the required xcast mode.

    Fix VF multicast update since VFs were being improperly added to the
    switch's multicast group. The error stems from the fact that incorrect
    arguments were passed to the update_mc_addr().

    Thanks to Alex Duyck for the extensive review.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Pull input subsystem updates from Dmitry Torokhov:
    "You will get the following new drivers:

    - Qualcomm PM8941 power key driver
    - ChipOne icn8318 touchscreen controller driver
    - Broadcom iProc touchscreen and keypad drivers
    - Semtech SX8654 I2C touchscreen controller driver

    ALPS driver now supports newer SS4 devices; Elantech got a fix that
    should make it work on some ASUS laptops; and a slew of other
    enhancements and random fixes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (51 commits)
    Input: alps - non interleaved V2 dualpoint has separate stick button bits
    Input: alps - fix touchpad buttons getting stuck when used with trackpoint
    Input: atkbd - document "no new force-release quirks" policy
    Input: ALPS - make alps_get_pkt_id_ss4_v2() and others static
    Input: ALPS - V7 devices can report 5-finger taps
    Input: ALPS - add support for SS4 touchpad devices
    Input: ALPS - refactor alps_set_abs_params_mt()
    Input: elantech - fix absolute mode setting on some ASUS laptops
    Input: atmel_mxt_ts - split out touchpad initialisation logic
    Input: atmel_mxt_ts - implement support for T100 touch object
    Input: cros_ec_keyb - fix clearing keyboard state on wakeup
    Input: gscps2 - drop pci_ids dependency
    Input: synaptics - allocate 3 slots to keep stability in image sensors
    Input: Revert "Revert "synaptics - use dmax in input_mt_assign_slots""
    Input: MT - make slot assignment work for overcovered solutions
    mfd: tc3589x: enforce device-tree only mode
    Input: tc3589x - localize platform data
    Input: tsc2007 - Convert msecs to jiffies only once
    Input: edt-ft5x06 - remove EV_SYN event report
    Input: edt-ft5x06 - allow to setting the maximum axes value through the DT
    ...

    Linus Torvalds
     
  • Pull i2c updates from Wolfram Sang:
    "Most notable:

    - introducing the i2c_quirk infrastructure. Now, flaws of I2C
    controllers can be described and the core will check if the flaws
    collide with the messages to be sent

    - wait_for_completion return type cleanup series

    - new drivers for Digicolor, Netlogic XLP, Ingenic JZ4780

    - updates to the I2C slave framework which include API changes. Its
    only user was updated, too. Documentation was finally added

    - changed dynamic bus numbering for the DT case. This could change
    bus numbers for users. However, it fixes a collision where dynamic
    and static busses request the same id.

    - driver bugfixes, cleanups"

    * 'i2c/for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (52 commits)
    i2c: xlp9xx: Driver for Netlogic XLP9XX/5XX I2C controller
    of: Add vendor prefix 'netlogic'
    i2c: davinci: use ICPFUNC to toggle I2C as gpio for bus recovery
    i2c: davinci: use bus recovery infrastructure
    i2c: change input parameter to i2c_adapter for prepare/unprepare_recovery
    i2c: i2c-mux-gpio: remove error messages for probe deferrals
    i2c: jz4780: Add i2c bus controller driver for Ingenic JZ4780
    i2c: dln2: set the device tree node of the adapter
    i2c: davinci: fixup wait_for_completion_timeout handling
    i2c: mpc: Fix ISR return value
    i2c: slave-eeprom: add more info when to increase the pointer
    i2c: slave: add documentation for i2c-slave-eeprom
    Documentation: i2c: describe the new slave mode
    i2c: slave: rework the slave API
    i2c: add support for the Digicolor I2C controller
    i2c: busses with dynamic ids should start after fixed ids for DT
    of: base: add function to get highest id of an alias stem
    i2c: designware: Suppress error message if platform_get_irq() < 0
    i2c: mpc: assign the correct prescaler from SVR
    i2c: img-scb: fixup of wait_for_completion_timeout return handling
    ...

    Linus Torvalds
     
  • Pull VFIO updates from Alex Williamson:

    - VFIO platform bus driver support (Baptiste Reynal, Antonios Motakis,
    testing and review by Eric Auger)

    - Split VFIO irqfd support to separate module (Alex Williamson)

    - vfio-pci VGA arbiter client (Alex Williamson)

    - New vfio-pci.ids= module option (Alex Williamson)

    - vfio-pci D3 power state support for idle devices (Alex Williamson)

    * tag 'vfio-v4.1-rc1' of git://github.com/awilliam/linux-vfio: (30 commits)
    vfio-pci: Fix use after free
    vfio-pci: Move idle devices to D3hot power state
    vfio-pci: Remove warning if try-reset fails
    vfio-pci: Allow PCI IDs to be specified as module options
    vfio-pci: Add VGA arbiter client
    vfio-pci: Add module option to disable VGA region access
    vgaarb: Stub vga_set_legacy_decoding()
    vfio: Split virqfd into a separate module for vfio bus drivers
    vfio: virqfd_lock can be static
    vfio: put off the allocation of "minor" in vfio_create_group
    vfio/platform: implement IRQ masking/unmasking via an eventfd
    vfio: initialize the virqfd workqueue in VFIO generic code
    vfio: move eventfd support code for VFIO_PCI to a separate file
    vfio: pass an opaque pointer on virqfd initialization
    vfio: add local lock for virqfd instead of depending on VFIO PCI
    vfio: virqfd: rename vfio_pci_virqfd_init and vfio_pci_virqfd_exit
    vfio: add a vfio_ prefix to virqfd_enable and virqfd_disable and export
    vfio/platform: support for level sensitive interrupts
    vfio/platform: trigger an interrupt via eventfd
    vfio/platform: initial interrupts support code
    ...

    Linus Torvalds
     
  • Pull pincontrol updates from Linus Walleij:
    "This is the bulk of pin control changes for the v4.1 development
    cycle. Nothing really exciting this time: we basically added a few
    new drivers and subdrivers and stabilized them in linux-next. Some
    cleanups too. With sunrisepoint Intel has a real fine fully featured
    pin control driver for contemporary hardware, and the AMD driver is
    also for large deployments. Most of the others are ARM devices.

    New drivers:
    - Intel Sunrisepoint
    - AMD KERNCZ GPIO
    - Broadcom Cygnus IOMUX

    New subdrivers:
    - Marvell MVEBU Armada 39x SoCs
    - Samsung Exynos 5433
    - nVidia Tegra 210
    - Mediatek MT8135
    - Mediatek MT8173
    - AMLogic Meson8b
    - Qualcomm PM8916

    On top of this cleanups and development history for the above drivers
    as issues were fixed after merging"

    * tag 'pinctrl-v4.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (71 commits)
    pinctrl: sirf: move sgpio lock into state container
    pinctrl: Add support for PM8916 GPIO's and MPP's
    pinctrl: bcm2835: Fix support for threaded level triggered IRQs
    sh-pfc: r8a7790: add EtherAVB pin groups
    pinctrl: Document "function" + "pins" pinmux binding
    pinctrl: intel: Add Intel Sunrisepoint pin controller and GPIO support
    pinctrl: fsl: imx: Check for 0 config register
    pinctrl: Add support for Meson8b
    documentation: Extend pinctrl docs for Meson8b
    pinctrl: Cleanup Meson8 driver
    Fix inconsistent spinlock of AMD GPIO driver which can be recognized by static analysis tool smatch. Declare constant Variables with Sparse's suggestion.
    pinctrl: at91: convert __raw to endian agnostic IO
    pinctrl: constify of_device_id array
    pinctrl: pinconf-generic: add dt node names to error messages
    pinctrl: pinconf-generic: scan also referenced phandle node
    pinctrl: mvebu: add suspend/resume support to Armada XP pinctrl driver
    pinctrl: st: Display pin's function when printing pinctrl debug information
    pinctrl: st: Show correct pin direction also in GPIO mode
    pinctrl: st: Supply a GPIO get_direction() call-back
    pinctrl: st: Move st_get_pio_control() further up the source file
    ...

    Linus Torvalds
     
  • Pull backlight updates from Lee Jones:
    "Changes to existing drivers:

    - Use of_get_child_by_name() instead of refcount; 88pm860x_bl

    - Terminate array with NULL element; da9052_bl"

    * tag 'backlight-for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight:
    backlight: da9052_bl: Terminate da9052_wled_ids array with empty element
    backlight: 88pm860x_bl: Use of_get_child_by_name() instead of refcount hack

    Linus Torvalds
     
  • Pull MFD updates from Lee Jones:
    "Changes to existing drivers:

    - Rename child driver [axp288_battery => axp288_fuel_gauge]; axp20x
    - Rename child driver [max77693-flash => max77693-led]; max77693
    - Error handling fixes; intel_soc_pmic
    - GPIO tweaking; intel_soc_pmic
    - Remove non-DT code; vexpress-sysreg, tc3589x
    - Remove unused/legacy code; ti_am335x_tscadc, rts5249, rtsx_gops, rtsx_pcr,
    rtc-s5m, sec-core, max77693, menelaus,
    wm5102-tables
    - Trivial fixups; rtsx_pci, da9150-core, sec-core, max7769, max77693,
    mc13xxx-core, dln2, hi6421-pmic-core, rk808, twl4030-power,
    lpc_ich, menelaus, twl6040
    - Update register/address values; rts5227, rts5249
    - DT and/or binding document fixups; arizona, da9150, mt6397, axp20x,
    qcom-rpm, qcom-spmi-pmic
    - Couple of trivial core Kconfig fixups
    - Remove use of seq_printf return value; ab8500-debugfs
    - Remove __exit markups; menelaus, tps65010
    - Fix platform-device name collisions; mfd-core

    New drivers/supported devices:

    - Add support for wm8280/wm8281 into arizona
    - Add support for COMe-cBL6 into kempld-core
    - Add support for rts524a and rts525a into rts5249
    - Add support for ipq8064 into qcom_rpm
    - Add support for extcon into axp20x
    - New MediaTek MT6397 PMIC driver
    - New Maxim MAX77843 PMIC driver
    - New Intel Quark X1000 I2C-GPIO driver
    - New Skyworks SKY81452 driver"

    * tag 'mfd-for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (76 commits)
    mfd: sec: Fix RTC alarm interrupt number on S2MPS11
    mfd: wm5102: Remove registers for output 3R from readable list
    mfd: tps65010: Remove incorrect __exit markups
    mfd: devicetree: bindings: Add Qualcomm RPM regulator subnodes
    mfd: axp20x: Add support for extcon cell
    mfd: lpc_ich: Sort IDs
    mfd: twl6040: Remove wrong and unneeded "platform:twl6040" modalias
    mfd: qcom-spmi-pmic: Add specific compatible strings for Qualcomm's SPMI PMIC's
    mfd: axp20x: Fix duplicate const for model names
    mfd: menelaus: Use macro for magic number
    mfd: menelaus: Drop support for SW controller VCORE
    mfd: menelaus: Delete omap_has_menelaus
    mfd: arizona: Correct type of gpio_defaults
    mfd: lpc_ich: Sort IDs
    mfd: Fix a typo in Kconfig
    mfd: qcom_rpm: Add support for IPQ8064
    mfd: devicetree: qcom_rpm: Document IPQ8064 resources
    mfd: core: Fix platform-device name collisions
    mfd: intel_quark_i2c_gpio: Don't crash if !DMI
    dt-bindings: Add vendor-prefix for X-Powers
    ...

    Linus Torvalds
     
  • Merge first patchbomb from Andrew Morton:

    - arch/sh updates

    - ocfs2 updates

    - kernel/watchdog feature

    - about half of mm/

    * emailed patches from Andrew Morton: (122 commits)
    Documentation: update arch list in the 'memtest' entry
    Kconfig: memtest: update number of test patterns up to 17
    arm: add support for memtest
    arm64: add support for memtest
    memtest: use phys_addr_t for physical addresses
    mm: move memtest under mm
    mm, hugetlb: abort __get_user_pages if current has been oom killed
    mm, mempool: do not allow atomic resizing
    memcg: print cgroup information when system panics due to panic_on_oom
    mm: numa: remove migrate_ratelimited
    mm: fold arch_randomize_brk into ARCH_HAS_ELF_RANDOMIZE
    mm: split ET_DYN ASLR from mmap ASLR
    s390: redefine randomize_et_dyn for ELF_ET_DYN_BASE
    mm: expose arch_mmap_rnd when available
    s390: standardize mmap_rnd() usage
    powerpc: standardize mmap_rnd() usage
    mips: extract logic for mmap_rnd()
    arm64: standardize mmap_rnd() usage
    x86: standardize mmap_rnd() usage
    arm: factor out mmap ASLR into mmap_rnd
    ...

    Linus Torvalds
     
  • Since arm64/arm support the memtest command line option, update the
    "memtest" entry.

    Signed-off-by: Vladimir Murzin
    Cc: "H. Peter Anvin"
    Cc: Catalin Marinas
    Cc: Ingo Molnar
    Cc: Mark Rutland
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Murzin
     
  • Additional test patterns for memtest were introduced in commit
    63823126c221 ("x86: memtest: add additional (regular) test patterns"),
    but it looks like Kconfig was not updated at that time.

    Update Kconfig entry with the actual number of maximum test patterns.

    Signed-off-by: Vladimir Murzin
    Cc: "H. Peter Anvin"
    Cc: Catalin Marinas
    Cc: Ingo Molnar
    Cc: Mark Rutland
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Murzin
     
  • Add support for memtest command line option.

    Signed-off-by: Vladimir Murzin
    Acked-by: Will Deacon
    Cc: "H. Peter Anvin"
    Cc: Catalin Marinas
    Cc: Ingo Molnar
    Cc: Mark Rutland
    Cc: Russell King
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Murzin