23 Oct, 2016

1 commit

  • Pull vmap stack fixes from Ingo Molnar:
    "This is fallout from CONFIG_HAVE_ARCH_VMAP_STACK=y on x86: stack
    accesses that used to be just somewhat questionable are now totally
    buggy.

    These changes try to do it without breaking the ABI: the fields are
    left there, they are just reporting zero, or reporting narrower
    information (the maps file change)"

    * 'mm-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    mm: Change vm_is_stack_for_task() to vm_is_stack_for_current()
    fs/proc: Stop trying to report thread stacks
    fs/proc: Stop reporting eip and esp in /proc/PID/stat
    mm/numa: Remove duplicated include from mprotect.c

    Linus Torvalds
     

20 Oct, 2016

1 commit

  • Asking for a non-current task's stack can't be done without races
    unless the task is frozen in kernel mode. As far as I know,
    vm_is_stack_for_task() never had a safe non-current use case.

    The __unused annotation is because some KSTK_ESP implementations
    ignore their parameter, which IMO is further justification for this
    patch.

    Signed-off-by: Andy Lutomirski
    Acked-by: Thomas Gleixner
    Cc: Al Viro
    Cc: Andrew Morton
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Jann Horn
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Linux API
    Cc: Peter Zijlstra
    Cc: Tycho Andersen
    Link: http://lkml.kernel.org/r/4c3f68f426e6c061ca98b4fc7ef85ffbb0a25b0c.1475257877.git.luto@kernel.org
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     

19 Oct, 2016

2 commits

  • This removes the 'write' argument from access_process_vm() and
    replaces it with 'gup_flags', as use of this function previously
    implied FOLL_FORCE silently; after this patch, callers pass the flag
    explicitly.

    We make this explicit as use of FOLL_FORCE can result in surprising
    behaviour (and hence bugs) within the mm subsystem.
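
    A minimal before/after sketch of a caller (illustrative, not taken
    from the patch itself):

        /* before: 'write' implied FOLL_FORCE silently */
        copied = access_process_vm(tsk, addr, buf, len, 1);

        /* after: the flags are spelled out explicitly */
        copied = access_process_vm(tsk, addr, buf, len,
                                   FOLL_FORCE | FOLL_WRITE);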

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Jesper Nilsson
    Acked-by: Michal Hocko
    Acked-by: Michael Ellerman
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the 'write' and 'force' use from get_user_pages_unlocked()
    and replaces them with 'gup_flags' to make the use of FOLL_FORCE
    explicit in callers as use of this flag can result in surprising
    behaviour (and hence bugs) within the mm subsystem.
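
    The conversion pattern in callers is roughly (a sketch, assuming the
    4.9-era signature of get_user_pages_unlocked()):

        unsigned int gup_flags = 0;

        if (write)
                gup_flags |= FOLL_WRITE;
        if (force)
                gup_flags |= FOLL_FORCE;
        ret = get_user_pages_unlocked(start, nr_pages, pages, gup_flags);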

    Signed-off-by: Lorenzo Stoakes
    Reviewed-by: Jan Kara
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     

29 Jul, 2016

1 commit

  • There are now a number of accounting oddities, such as mapped file
    pages being accounted for on the node while the total number of file
    pages is accounted on the zone. This can be coped with to some extent
    but it's confusing, so this patch moves the relevant file-based
    accounting to the node. Due to throttling logic in the page allocator
    for reliable OOM detection, it is still necessary to track dirty and
    writeback pages on a per-zone basis.

    [mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting]
    Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

27 Jul, 2016

2 commits

  • Naive approach: on mapping/unmapping the page as compound, we update
    ->_mapcount on each 4k page. That's not efficient, but it's not
    obvious how to optimize it; we can look into optimization later.

    The PG_double_map optimization doesn't work for file pages, since the
    lifecycle of file pages differs from that of anon pages: a file page
    can be mapped again at any time.
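
    Roughly, the naive approach amounts to a loop like this in the rmap
    helpers (a simplified sketch, not the exact patch):

        int i;

        /* mapping the page as compound: touch every 4k subpage */
        for (i = 0; i < HPAGE_PMD_NR; i++)
                atomic_inc(&page[i]._mapcount);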

    Link: http://lkml.kernel.org/r/1466021202-61880-11-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We have allowed migration for LRU pages only until now, and that was
    enough to make high-order pages. But recently, embedded systems
    (e.g., webOS, Android) use lots of non-movable pages (e.g., zram, GPU
    memory), so we have seen several reports about the trouble of small
    high-order allocations. To fix the problem, there were several
    efforts (e.g., enhancing the compaction algorithm, SLUB fallback to
    0-order pages, reserved memory, vmalloc and so on), but if there are
    lots of non-movable pages in the system, those solutions are void in
    the long run.

    So, this patch supports a facility to turn non-movable pages into
    movable ones. For the feature, it introduces migration-related
    functions in address_space_operations as well as some page flags.

    If a driver wants to make its own pages movable, it should define
    three functions, which are function pointers of struct
    address_space_operations.

    1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);

    What the VM expects of a driver's isolate_page function is to return
    *true* if the driver isolated the page successfully. On returning
    true, the VM marks the page as PG_isolated so that concurrent
    isolation on several CPUs skips the page. If a driver cannot isolate
    the page, it should return *false*.

    Once a page is successfully isolated, the VM uses the page.lru
    fields, so the driver shouldn't expect those fields to be preserved.

    2. int (*migratepage) (struct address_space *mapping,
    struct page *newpage, struct page *oldpage, enum migrate_mode);

    After isolation, the VM calls the driver's migratepage with the
    isolated page. The job of migratepage is to move the content of the
    old page to the new page and to set up the fields of struct page
    newpage. Keep in mind that the driver should indicate to the VM that
    the oldpage is no longer movable via __ClearPageMovable() under
    page_lock if it migrated the oldpage successfully and returns 0. If
    the driver cannot migrate the page at the moment, it can return
    -EAGAIN. On -EAGAIN, the VM will retry page migration after a short
    time because the VM interprets -EAGAIN as "temporary migration
    failure". On returning any error other than -EAGAIN, the VM will
    give up on migrating the page without retrying.

    The driver shouldn't touch the page.lru field the VM uses in these
    functions.

    3. void (*putback_page)(struct page *);

    If migration fails on an isolated page, the VM should return the
    isolated page to the driver, so the VM calls the driver's
    putback_page with the page whose migration failed. In this function,
    the driver should put the isolated page back into its own data
    structure.

    4. non-lru movable page flags

    There are two page flags for supporting non-lru movable page.

    * PG_movable

    The driver should use the function below to make a page movable,
    under page_lock:

    void __SetPageMovable(struct page *page, struct address_space *mapping)

    It takes an address_space argument to register the migration family
    of functions which will be called by the VM. Strictly speaking,
    PG_movable is not a real flag of struct page. Rather, the VM reuses
    the lower bits of page->mapping to represent it:

    #define PAGE_MAPPING_MOVABLE 0x2
    page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;

    so the driver shouldn't access page->mapping directly. Instead, the
    driver should use page_mapping(), which masks off the low two bits of
    page->mapping to return the correct struct address_space.

    For testing for a non-lru movable page, the VM supports the
    __PageMovable function. However, it doesn't guarantee identifying a
    non-lru movable page, because the page->mapping field is unified with
    other variables in struct page. Also, if the driver releases the page
    after isolation by the VM, page->mapping doesn't have a stable value
    although it has PAGE_MAPPING_MOVABLE set (look at __ClearPageMovable).
    But __PageMovable is a cheap way to tell whether a page is LRU or
    non-lru movable once the page has been isolated, because LRU pages
    can never have PAGE_MAPPING_MOVABLE in page->mapping. It is also good
    for just peeking at non-lru movable pages before the more expensive
    check with lock_page during pfn scanning to select a victim.

    For guaranteeing a non-lru movable page, the VM provides the
    PageMovable function. Unlike __PageMovable, PageMovable validates
    page->mapping and mapping->a_ops->isolate_page under lock_page. The
    lock_page prevents sudden destruction of page->mapping.

    A driver using __SetPageMovable should clear the flag via
    __ClearPageMovable under page_lock before releasing the page.

    * PG_isolated

    To prevent concurrent isolation among several CPUs, the VM marks an
    isolated page as PG_isolated under lock_page. So if a CPU encounters
    a PG_isolated non-lru movable page, it can skip it. The driver
    doesn't need to manipulate the flag because the VM will set/clear it
    automatically. Keep in mind that if the driver sees a PG_isolated
    page, the page has been isolated by the VM, so the driver shouldn't
    touch the page.lru field. PG_isolated is aliased with the PG_reclaim
    flag, so the driver shouldn't use the flag for its own purposes. A
    minimal driver skeleton tying these callbacks together is sketched
    below.
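
    A hypothetical driver skeleton (illustrative names, not a real
    driver) wiring these callbacks together:

        static bool mydrv_isolate_page(struct page *page, isolate_mode_t mode)
        {
                /* detach the page from the driver's private lists; on
                 * success the VM marks it PG_isolated and may reuse
                 * page->lru */
                return true;
        }

        static int mydrv_migratepage(struct address_space *mapping,
                                     struct page *newpage, struct page *oldpage,
                                     enum migrate_mode mode)
        {
                /* copy content and driver state to newpage, then tell
                 * the VM the old page is no longer movable */
                __ClearPageMovable(oldpage);
                return 0;       /* or -EAGAIN for a temporary failure */
        }

        static void mydrv_putback_page(struct page *page)
        {
                /* migration failed: take the isolated page back */
        }

        static const struct address_space_operations mydrv_aops = {
                .isolate_page   = mydrv_isolate_page,
                .migratepage    = mydrv_migratepage,
                .putback_page   = mydrv_putback_page,
        };

        /* and when a page becomes movable, under page_lock:
         *      __SetPageMovable(page, mapping);
         */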

    [opensource.ganesh@gmail.com: mm/compaction: remove local variable is_lru]
    Link: http://lkml.kernel.org/r/20160618014841.GA7422@leo-test
    Link: http://lkml.kernel.org/r/1464736881-24886-3-git-send-email-minchan@kernel.org
    Signed-off-by: Gioh Kim
    Signed-off-by: Minchan Kim
    Signed-off-by: Ganesh Mahendran
    Acked-by: Vlastimil Babka
    Cc: Sergey Senozhatsky
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Rafael Aquini
    Cc: Jonathan Corbet
    Cc: John Einar Reitan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

24 May, 2016

2 commits

  • All the callers of vm_mmap seem to check for the failure already and
    bail out in one way or another on the error which means that we can
    change it to use killable version of vm_mmap_pgoff and return -EINTR if
    the current task gets killed while waiting for mmap_sem. This also
    means that vm_mmap_pgoff can be killable by default and drop the
    additional parameter.

    This will help in OOM conditions, when the oom victim might be stuck
    waiting for the mmap_sem for write, which in turn can block the
    oom_reaper, which relies on the mmap_sem for read to make forward
    progress and reclaim the address space of the victim.

    Please note that load_elf_binary ignores the vm_mmap error for the
    current->personality & MMAP_PAGE_ZERO case, but that shouldn't be a
    problem because the address is not used anywhere and we never return
    to userspace if we got killed.
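
    The heart of the change is the lock acquisition in vm_mmap_pgoff (a
    simplified sketch of the pattern):

        if (down_write_killable(&mm->mmap_sem))
                return -EINTR;
        ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff, &populate);
        up_write(&mm->mmap_sem);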

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Andrea Arcangeli
    Cc: Al Viro
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • This is a follow-up work for the oom_reaper [1]. As the async OOM
    killing depends on mmap_sem for read, we would really appreciate it
    if a holder for write didn't stand in the way. This patchset changes
    many of the down_write calls to be killable, to help those cases when
    the writer is blocked waiting for readers to release the lock, and so
    help __oom_reap_task process the oom victim.

    Most of the patches are really trivial because the lock is held from
    shallow syscall paths where we can return EINTR trivially and allow
    the current task to die (note that EINTR will never get to userspace
    as the task has a fatal signal pending). Others seem easy as well, as
    the callers are already handling fatal errors and bail and return to
    userspace, which should be sufficient to handle the failure
    gracefully. I am not familiar with all those code paths, so a deeper
    review is really appreciated.

    As this work is touching more areas which are not directly connected I
    have tried to keep the CC list as small as possible and people who I
    believed would be familiar are CCed only to the specific patches (all
    should have received the cover though).

    This patchset is based on linux-next and it depends on
    down_write_killable for rw_semaphores, which got merged into the tip
    locking/rwsem branch and is included in the next tree. I guess it
    would be easiest to route these patches via mmotm because of the
    dependency on the tip tree, but if the respective maintainers prefer
    another way I have no objections.

    I haven't covered all the down_write(mm->mmap_sem) instances here:

    $ git grep "down_write(.*\)" next/master | wc -l
    98
    $ git grep "down_write(.*\)" | wc -l
    62

    I have tried to cover those which should be relatively easy to review in
    this series because this alone should be a nice improvement. Other
    places can be changed on top.

    [0] http://lkml.kernel.org/r/1456752417-9626-1-git-send-email-mhocko@kernel.org
    [1] http://lkml.kernel.org/r/1452094975-551-1-git-send-email-mhocko@kernel.org
    [2] http://lkml.kernel.org/r/1456750705-7141-1-git-send-email-mhocko@kernel.org

    This patch (of 18):

    This is the first step in making mmap_sem write waiters killable. It
    focuses on the trivial ones which take the lock early after entering
    the syscall and do not change any state before that.

    Therefore it is very easy to change them to use down_write_killable
    and immediately return with -EINTR. This will allow the waiter to
    pass away without blocking the mmap_sem, which might be required to
    make forward progress. E.g. the oom reaper will need the lock for
    reading to dismantle the OOM victim's address space.

    The only tricky function in this patch is vm_mmap_pgoff, which has
    many call sites via vm_mmap. To reduce the risk, vm_mmap keeps the
    original non-killable semantics for now.

    vm_munmap callers do not bother checking the return value, so open
    code it into the munmap syscall path for now for simplicity, as in
    the sketch below.
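
    The open-coded munmap path then looks roughly like this (a simplified
    sketch):

        SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
        {
                int ret;
                struct mm_struct *mm = current->mm;

                if (down_write_killable(&mm->mmap_sem))
                        return -EINTR;
                ret = do_munmap(mm, addr, len);
                up_write(&mm->mmap_sem);
                return ret;
        }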

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Konstantin Khlebnikov
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

20 May, 2016

1 commit

  • It's huge. Uninlining it saves 206 bytes per callsite. Shaves 4924
    bytes from the x86_64 allmodconfig vmlinux.

    [akpm@linux-foundation.org: coding-style fixes]
    Cc: Steve Capper
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

21 Mar, 2016

1 commit

  • Pull x86 protection key support from Ingo Molnar:
    "This tree adds support for a new memory protection hardware feature
    that is available in upcoming Intel CPUs: 'protection keys' (pkeys).

    There's a background article at LWN.net:

    https://lwn.net/Articles/643797/

    The gist is that protection keys allow the encoding of
    user-controllable permission masks in the pte. So instead of having a
    fixed protection mask in the pte (which needs a system call to change
    and works on a per-page basis), the user can map a (handful of)
    protection mask variants and can change the masks at runtime
    relatively cheaply, without having to change every single page in the
    affected virtual memory range.

    This allows the dynamic switching of the protection bits of large
    amounts of virtual memory, via user-space instructions. It also
    allows more precise control of MMU permission bits: for example the
    executable bit is separate from the read bit (see more about that
    below).

    This tree adds the MM infrastructure and low level x86 glue needed for
    that, plus it adds a high level API to make use of protection keys -
    if a user-space application calls:

    mmap(..., PROT_EXEC);

    or

    mprotect(ptr, sz, PROT_EXEC);

    (note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice
    this special case, and will set a special protection key on this
    memory range. It also sets the appropriate bits in the Protection
    Keys User Rights (PKRU) register so that the memory becomes unreadable
    and unwritable.

    So using protection keys the kernel is able to implement 'true'
    PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies
    PROT_READ as well. Unreadable executable mappings have security
    advantages: they cannot be read via information leaks to figure out
    ASLR details, nor can they be scanned for ROP gadgets - and they
    cannot be used by exploits for data purposes either.

    We know of no user-space code that relies on pure PROT_EXEC mappings
    today, but binary loaders could start making use of this new feature
    to map binaries and libraries in a more secure fashion.

    There is other pending pkeys work that offers more high level system
    call APIs to manage protection keys - but those are not part of this
    pull request.

    Right now there's a Kconfig that controls this feature
    (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled
    (like most x86 CPU feature enablement code that has no runtime
    overhead), but it's not user-configurable at the moment. If there's
    any serious problem with this then we can make it configurable and/or
    flip the default"
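
    For illustration, an execute-only mapping from user space would look
    like this (a sketch; the behaviour described above requires
    pkeys-capable hardware):

        #include <sys/mman.h>

        int main(void)
        {
                /* PROT_EXEC only, no PROT_READ/WRITE: the kernel assigns
                 * an execute-only protection key, so ordinary loads and
                 * stores to the mapping fault while instruction fetches
                 * still work */
                void *p = mmap(NULL, 4096, PROT_EXEC,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                return p == MAP_FAILED;
        }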

    * 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
    x86/mm/pkeys: Fix mismerge of protection keys CPUID bits
    mm/pkeys: Fix siginfo ABI breakage caused by new u64 field
    x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA
    mm/core, x86/mm/pkeys: Add execute-only protection keys support
    x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
    x86/mm/pkeys: Allow kernel to modify user pkey rights register
    x86/fpu: Allow setting of XSAVE state
    x86/mm: Factor out LDT init from context init
    mm/core, x86/mm/pkeys: Add arch_validate_pkey()
    mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
    x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
    x86/mm/pkeys: Add Kconfig prompt to existing config option
    x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
    x86/mm/pkeys: Dump PKRU with other kernel registers
    mm/core, x86/mm/pkeys: Differentiate instruction fetches
    x86/mm/pkeys: Optimize fault handling in access_error()
    mm/core: Do not enforce PKEY permissions on remote mm access
    um, pkeys: Add UML arch_*_access_permitted() methods
    mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
    x86/mm/gup: Simplify get_user_pages() PTE bit handling
    ...

    Linus Torvalds
     

18 Mar, 2016

1 commit

  • Currently we have two copies of the same code which implements memory
    overcommitment logic. Let's move it into mm/util.c and hence avoid
    duplication. No functional changes here.

    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

16 Feb, 2016

1 commit

  • The concept here was a suggestion from Ingo. The implementation
    horrors are all mine.

    This allows get_user_pages(), get_user_pages_unlocked(), and
    get_user_pages_locked() to be called with or without the
    leading tsk/mm arguments. We will give a compile-time warning
    about the old style being __deprecated and we will also
    WARN_ON() if the non-remote version is used for a remote-style
    access.

    Doing this, folks will get nice warnings and will not break the
    build. This should be nice for -next and will hopefully let
    developers fix up their own code instead of maintainers needing
    to do it at merge time.

    The way we do this is hideous. It uses the __VA_ARGS__ macro
    functionality to call different functions based on the number
    of arguments passed to the macro.

    There's an additional hack to ensure that our EXPORT_SYMBOL()
    of the deprecated symbols doesn't trigger a warning.

    We should be able to remove this mess as soon as -rc1 hits in
    the release after this is merged.
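
    The dispatch-by-argument-count trick looks roughly like this (an
    illustrative sketch with made-up names, not the kernel's exact
    macros):

        /* PICK() expands to its 5th argument, so which name is chosen
         * depends on how many arguments the caller passed */
        #define PICK(_1, _2, _3, _4, NAME, ...) NAME

        /* gup(a, b) expands to gup2(a, b);
         * gup(a, b, c, d) expands to gup4(a, b, c, d) */
        #define gup(...) PICK(__VA_ARGS__, gup4, gup3, gup2, gup1)(__VA_ARGS__)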

    Signed-off-by: Dave Hansen
    Cc: Al Viro
    Cc: Alexander Kuleshov
    Cc: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: Dominik Dingel
    Cc: Geliang Tang
    Cc: Jan Kara
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Leon Romanovsky
    Cc: Linus Torvalds
    Cc: Masahiro Yamada
    Cc: Mateusz Guzik
    Cc: Maxime Coquelin
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Oleg Nesterov
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Vladimir Davydov
    Cc: Vlastimil Babka
    Cc: Xie XiuQi
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160212210155.73222EE1@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     

04 Feb, 2016

1 commit

  • Commit b76437579d13 ("procfs: mark thread stack correctly in
    proc/<pid>/maps") added the [stack:TID] annotation to
    /proc/<pid>/maps.

    Finding the task of a stack VMA requires walking the entire thread
    list, turning this into quadratic behavior: a thousand threads means
    a thousand stacks, so the rendering of /proc/<pid>/maps needs to look
    at a million combinations.

    The cost is not in proportion to the usefulness as described in the
    patch.

    Drop the [stack:TID] annotation to make /proc/<pid>/maps (and
    /proc/<pid>/numa_maps) usable again for higher thread counts.

    The [stack] annotation inside /proc/<pid>/task/<tid>/maps is
    retained, as identifying the stack VMA there is an O(1) operation.

    Siddesh said:
    "The end users needed a way to identify thread stacks programmatically and
    there wasn't a way to do that. I'm afraid I no longer remember (or have
    access to the resources that would aid my memory since I changed
    employers) the details of their requirement. However, I did do this on my
    own time because I thought it was an interesting project for me and nobody
    really gave any feedback then as to its utility, so as far as I am
    concerned you could roll back the main thread maps information since the
    information is available in the thread-specific files"

    Signed-off-by: Johannes Weiner
    Cc: "Kirill A. Shutemov"
    Cc: Siddhesh Poyarekar
    Cc: Shaohua Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

21 Jan, 2016

1 commit

  • Only functions doing more than one read are modified. Consumers
    happened to deal with possibly changing data, but it does not seem
    like a good thing to rely on.

    Signed-off-by: Mateusz Guzik
    Acked-by: Cyrill Gorcunov
    Cc: Alexey Dobriyan
    Cc: Jarod Wilson
    Cc: Jan Stancek
    Cc: Al Viro
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mateusz Guzik
     

16 Jan, 2016

2 commits

  • Both page_referenced() and page_idle_clear_pte_refs_one() assume that
    THP can only be mapped with a PMD, so there's no reason to look at
    PTEs for PageTransHuge() pages. That's not true anymore: THP can be
    mapped with PTEs too.

    The patch removes the PageTransHuge() test from the functions and
    open-codes the page table check.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Kirill A. Shutemov
    Cc: Vladimir Davydov
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: Sasha Levin
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We don't define the meaning of page->mapping for tail pages.
    Currently it's always NULL, which can be inconsistent with the head
    page and potentially lead to problems.

    Let's poison the pointer to catch all illegal uses.

    page_rmapping(), page_mapping() and page_anon_vma() are changed to look
    on head page.

    The only illegal use I've caught so far is __GFP_COMP pages from the
    sound subsystem, mapped with PTEs. do_shared_fault() is changed to
    use page_rmapping() instead of direct access to fault_page->mapping.

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Jérôme Glisse
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

04 Jan, 2016

1 commit


06 Nov, 2015

1 commit


16 Apr, 2015

1 commit

  • The most-used page->mapping helper -- page_mapping() -- has already
    been uninlined. Let's also uninline page_rmapping() and
    page_anon_vma(). Depending on configuration, this saves around 400
    bytes of text:

    text data bss dec hex filename
    660318 99254 410000 1169572 11d8a4 mm/built-in.o-before
    659854 99254 410000 1169108 11d6d4 mm/built-in.o

    I also tried to make the code a bit cleaner.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Kirill A. Shutemov
    Cc: Christoph Lameter
    Cc: Konstantin Khlebnikov
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

14 Feb, 2015

1 commit

  • kstrdup() is often used to duplicate strings where neither source nor
    destination will ever be modified. In such cases we can just reuse
    the source instead of duplicating it. The problem is that we must be
    sure that the source is non-modifiable and that its lifetime is long
    enough.

    I suspect the good candidates for such strings are strings located in
    the kernel's .rodata section: they cannot be modified because the
    section is read-only, and their lifetime is equal to the kernel's
    lifetime.

    This small patchset proposes an alternative version of kstrdup -
    kstrdup_const - which returns the source string if it is located in
    .rodata and otherwise falls back to kstrdup. To verify whether the
    source is in .rodata, the function checks if the address is between
    the sentinels __start_rodata and __end_rodata. I guess it should work
    on all architectures.

    The main patch is accompanied by four patches constifying kstrdup for
    cases where the situation described above happens frequently.

    I have tested the patchset on a mobile platform (exynos4210-trats)
    and it saves 3272 string allocations. Since the minimal allocation is
    32 or 64 bytes depending on Kconfig options, the patchset saves about
    100KB or 200KB of memory respectively.

    Stats from the tested platform show that the main offender is sysfs:

    By caller:
    2260 __kernfs_new_node
    631 clk_register+0xc8/0x1b8
    318 clk_register+0x34/0x1b8
    51 kmem_cache_create
    12 alloc_vfsmnt

    By string (with count >= 5):
    883 power
    876 subsystem
    135 parameters
    132 device
    61 iommu_group
    ...

    This patch (of 5):

    Add an alternative version of kstrdup which returns a pointer to a
    constant char array. The function checks whether the input string is
    in a persistent, read-only memory section; if yes, it returns the
    input string, otherwise it falls back to kstrdup.

    kstrdup_const is accompanied by kfree_const, which performs
    conditional deallocation of the string; a minimal sketch of the pair
    follows.
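
    Sketched under the assumption of the __start_rodata/__end_rodata
    sentinel check described above:

        static int is_kernel_rodata(unsigned long addr)
        {
                return addr >= (unsigned long)__start_rodata &&
                       addr < (unsigned long)__end_rodata;
        }

        const char *kstrdup_const(const char *s, gfp_t gfp)
        {
                if (is_kernel_rodata((unsigned long)s))
                        return s;       /* reuse the .rodata string */
                return kstrdup(s, gfp);
        }

        void kfree_const(const void *x)
        {
                if (!is_kernel_rodata((unsigned long)x))
                        kfree(x);       /* only free what was duplicated */
        }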

    Signed-off-by: Andrzej Hajda
    Cc: Marek Szyprowski
    Cc: Kyungmin Park
    Cc: Mike Turquette
    Cc: Alexander Viro
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Cc: Greg KH
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrzej Hajda
     

12 Feb, 2015

1 commit


10 Oct, 2014

1 commit

  • - Rename vm_is_stack() to task_of_stack() and change it to return
    "struct task_struct *" rather than the global (and thus wrong in
    general) pid_t.

    - Add the new pid_of_stack() helper which calls task_of_stack() and
    uses the right namespace to report the correct pid_t.

    Unfortunately we need to define this helper twice, in task_mmu.c and
    in task_nommu.c. Perhaps it makes sense to add fs/proc/util.c and
    move at least pid_of_stack/task_of_stack there to avoid the code
    duplication.

    - Change show_map_vma() and show_numa_map() to use the new helper.

    Signed-off-by: Oleg Nesterov
    Cc: Alexander Viro
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Cc: Greg Ungerer
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

09 Aug, 2014

1 commit

  • Aleksei hit a soft lockup while reading /proc/PID/smaps. David
    investigated the problem and suggested the right fix.

    while_each_thread() is racy and should die; this patch updates
    vm_is_stack() to use for_each_thread() instead, as sketched below.
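
    The fixed iteration is roughly (a sketch of the pattern, not the
    verbatim diff):

        struct task_struct *t;

        rcu_read_lock();
        for_each_thread(task, t) {
                if (vm_is_stack_for_task(t, vma)) {
                        ret = t->pid;
                        break;
                }
        }
        rcu_read_unlock();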

    Signed-off-by: Oleg Nesterov
    Reported-by: Aleksei Besogonov
    Tested-by: Aleksei Besogonov
    Suggested-by: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

07 Aug, 2014

1 commit

  • The functions krealloc(), __krealloc() and kzfree() belong to the
    slab API, so they should be placed in slab_common.c.

    Also move the slab allocator's tracepoint definitions to
    slab_common.c. No functional changes here.

    Signed-off-by: Andrey Ryabinin
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

07 May, 2014

1 commit


13 Apr, 2014

1 commit

  • Pull audit updates from Eric Paris.

    * git://git.infradead.org/users/eparis/audit: (28 commits)
    AUDIT: make audit_is_compat depend on CONFIG_AUDIT_COMPAT_GENERIC
    audit: renumber AUDIT_FEATURE_CHANGE into the 1300 range
    audit: do not cast audit_rule_data pointers pointlesly
    AUDIT: Allow login in non-init namespaces
    audit: define audit_is_compat in kernel internal header
    kernel: Use RCU_INIT_POINTER(x, NULL) in audit.c
    sched: declare pid_alive as inline
    audit: use uapi/linux/audit.h for AUDIT_ARCH declarations
    syscall_get_arch: remove useless function arguments
    audit: remove stray newline from audit_log_execve_info() audit_panic() call
    audit: remove stray newlines from audit_log_lost messages
    audit: include subject in login records
    audit: remove superfluous new- prefix in AUDIT_LOGIN messages
    audit: allow user processes to log from another PID namespace
    audit: anchor all pid references in the initial pid namespace
    audit: convert PPIDs to the inital PID namespace.
    pid: get pid_t ppid of task in init_pid_ns
    audit: rename the misleading audit_get_context() to audit_take_context()
    audit: Add generic compat syscall support
    audit: Add CONFIG_HAVE_ARCH_AUDITSYSCALL
    ...

    Linus Torvalds
     

08 Apr, 2014

1 commit

  • To increase compiler portability there is <linux/compiler.h>, which
    provides convenience macros for various gcc constructs, e.g. __weak
    for __attribute__((weak)). I've replaced all instances of gcc
    attributes with the right macro in the memory management (/mm)
    subsystem.
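
    For example, a representative conversion:

        /* before: raw gcc attribute */
        void __attribute__((weak)) foo(void);

        /* after: the <linux/compiler.h> convenience macro */
        void __weak foo(void);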

    [akpm@linux-foundation.org: while-we're-there consistency tweaks]
    Signed-off-by: Gideon Israel Dsouza
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gideon Israel Dsouza
     

08 Mar, 2014

1 commit


22 Jan, 2014

1 commit

  • Some applications that run on HPC clusters are designed around the
    availability of RAM, and the overcommit ratio is fine-tuned to get
    the maximum usage of memory without swapping. With growing memory,
    the 1%-of-all-RAM grain provided by overcommit_ratio has become too
    coarse for these workloads (on a 2TB machine it represents no less
    than 20GB).

    This patch adds the new overcommit_kbytes sysctl variable that allows
    a much finer grain.
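
    The resulting limit computation roughly becomes (a sketch, assuming
    the factored vm_commit_limit() helper):

        unsigned long allowed;

        if (sysctl_overcommit_kbytes)
                allowed = sysctl_overcommit_kbytes >> (PAGE_SHIFT - 10);
        else
                allowed = (totalram_pages - hugetlb_total_pages())
                          * sysctl_overcommit_ratio / 100;
        allowed += total_swap_pages;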

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix nommu build]
    Signed-off-by: Jerome Marchand
    Cc: Dave Hansen
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     

15 Jan, 2014

1 commit

  • Commit 8456a648cf44 ("slab: use struct page for slab management") causes
    a crash in the LVM2 testsuite on PA-RISC (the crashing test is
    fsadm.sh). The testsuite doesn't crash on 3.12, crashes on 3.13-rc1 and
    later.

    Bad Address (null pointer deref?): Code=15 regs=000000413edd89a0 (Addr=000006202224647d)
    CPU: 3 PID: 24008 Comm: loop0 Not tainted 3.13.0-rc6 #5
    task: 00000001bf3c0048 ti: 000000413edd8000 task.ti: 000000413edd8000

    YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
    PSW: 00001000000001101111100100001110 Not tainted
    r00-03 000000ff0806f90e 00000000405c8de0 000000004013e6c0 000000413edd83f0
    r04-07 00000000405a95e0 0000000000000200 00000001414735f0 00000001bf349e40
    r08-11 0000000010fe3d10 0000000000000001 00000040829c7778 000000413efd9000
    r12-15 0000000000000000 000000004060d800 0000000010fe3000 0000000010fe3000
    r16-19 000000413edd82a0 00000041078ddbc0 0000000000000010 0000000000000001
    r20-23 0008f3d0d83a8000 0000000000000000 00000040829c7778 0000000000000080
    r24-27 00000001bf349e40 00000001bf349e40 202d66202224640d 00000000405a95e0
    r28-31 202d662022246465 000000413edd88f0 000000413edd89a0 0000000000000001
    sr00-03 000000000532c000 0000000000000000 0000000000000000 000000000532c000
    sr04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000

    IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000401fe42c 00000000401fe430
    IIR: 539c0030 ISR: 00000000202d6000 IOR: 000006202224647d
    CPU: 3 CR30: 000000413edd8000 CR31: 0000000000000000
    ORIG_R28: 00000000405a95e0
    IAOQ[0]: vma_interval_tree_iter_first+0x14/0x48
    IAOQ[1]: vma_interval_tree_iter_first+0x18/0x48
    RP(r2): flush_dcache_page+0x128/0x388
    Backtrace:
    flush_dcache_page+0x128/0x388
    lo_splice_actor+0x90/0x148 [loop]
    splice_from_pipe_feed+0xc0/0x1d0
    __splice_from_pipe+0xac/0xc0
    lo_direct_splice_actor+0x1c/0x70 [loop]
    splice_direct_to_actor+0xec/0x228
    lo_receive+0xe4/0x298 [loop]
    loop_thread+0x478/0x640 [loop]
    kthread+0x134/0x168
    end_fault_vector+0x20/0x28
    xfs_setsize_buftarg+0x0/0x90 [xfs]

    Kernel panic - not syncing: Bad Address (null pointer deref?)

    Commit 8456a648cf44 changes the page structure so that the slab
    subsystem reuses the page->mapping field.

    The crash happens in the following way:
    * XFS allocates some memory from slab and issues a bio to read data
    into it.
    * the bio is sent to the loopback device.
    * lo_receive creates an actor and calls splice_direct_to_actor.
    * lo_splice_actor copies data to the target page.
    * lo_splice_actor calls flush_dcache_page because the page may be
    mapped by userspace. In that case we need to flush the kernel cache.
    * flush_dcache_page asks for the list of userspace mappings, however
    the page->mapping field is reused by the slab subsystem for a
    different purpose. This causes the crash.

    Note that other architectures without coherent caches (sparc, arm, mips)
    also call page_mapping from flush_dcache_page, so they may crash in the
    same way.

    This patch fixes this bug by testing if the page is a slab page in
    page_mapping and returning NULL if it is.

    The patch also fixes VM_BUG_ON(PageSlab(page)) that could happen in
    earlier kernels in the same scenario on architectures without cache
    coherence when CONFIG_DEBUG_VM is enabled - so it should be backported
    to stable kernels.

    In the old kernels, the function page_mapping is placed in
    include/linux/mm.h, so you should modify the patch accordingly when
    backporting it.
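
    The fix itself is a short test at the top of page_mapping() (a
    sketch):

        struct address_space *page_mapping(struct page *page)
        {
                /* slab reuses page->mapping for its own bookkeeping, so
                 * a slab page has no userspace mapping to report */
                if (unlikely(PageSlab(page)))
                        return NULL;

                /* swap-cache and anon handling elided */
                return page->mapping;
        }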

    Signed-off-by: Mikulas Patocka
    Cc: John David Anglin ]
    Cc: Andi Kleen
    Cc: Christoph Lameter
    Acked-by: Pekka Enberg
    Reviewed-by: Joonsoo Kim
    Cc: Helge Deller
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
     

13 Nov, 2013

1 commit

  • The same calculation is currently done in three different places.
    Factor that code so future changes have to be made in only one place.
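
    The factored helper (a sketch matching the formula being
    deduplicated):

        unsigned long vm_commit_limit(void)
        {
                return ((totalram_pages - hugetlb_total_pages())
                        * sysctl_overcommit_ratio / 100) + total_swap_pages;
        }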

    [akpm@linux-foundation.org: uninline vm_commit_limit()]
    Signed-off-by: Jerome Marchand
    Cc: Dave Hansen
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     

12 Sep, 2013

1 commit

  • PageSwapCache() is always false when !CONFIG_SWAP, so the compiler
    properly discards the related code. Therefore, we don't need the
    #ifdef explicitly.
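
    So guards like the following can go (illustrative; 'nr_swapcache' is
    a made-up counter):

        /* before */
        #ifdef CONFIG_SWAP
                if (PageSwapCache(page))
                        nr_swapcache++;
        #endif

        /* after: PageSwapCache() is compile-time 0 without CONFIG_SWAP,
         * so the compiler discards the branch anyway */
        if (PageSwapCache(page))
                nr_swapcache++;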

    Signed-off-by: Joonsoo Kim
    Acked-by: Johannes Weiner
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

11 Jul, 2013

1 commit

  • Since all architectures have been converted to use vm_unmapped_area(),
    there is no remaining use for the free_area_cache.

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Cc: "James E.J. Bottomley"
    Cc: "Luck, Tony"
    Cc: Benjamin Herrenschmidt
    Cc: David Howells
    Cc: Helge Deller
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Paul Mackerras
    Cc: Richard Henderson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

24 Feb, 2013

4 commits

  • When I use several fast SSDs for swap, swapper_space.tree_lock is
    heavily contended. This patch makes each swap partition have its own
    address_space to reduce the lock contention. There is an array of
    address_spaces for swap; the swap entry type is the index into the
    array.

    In my test with 3 SSDs, this increases the swapout throughput by 20%.
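
    The mapping from a swap entry to its address_space is a direct array
    index (a sketch of the added interface):

        struct address_space swapper_spaces[MAX_SWAPFILES];

        #define swap_address_space(entry) \
                (&swapper_spaces[swp_type(entry)])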

    [akpm@linux-foundation.org: revert unneeded change to __add_to_swap_cache]
    Signed-off-by: Shaohua Li
    Cc: Hugh Dickins
    Acked-by: Rik van Riel
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • According to akpm, this saves 1/2k text and makes things simple for the
    next patch.

    Numbers from Minchan:

    add/remove: 1/0 grow/shrink: 6/22 up/down: 92/-516 (-424)
    function old new delta
    page_mapping - 48 +48
    do_task_stat 2292 2308 +16
    page_remove_rmap 240 248 +8
    load_elf_binary 4500 4508 +8
    update_queue 532 536 +4
    scsi_probe_and_add_lun 2892 2896 +4
    lookup_fast 644 648 +4
    vcs_read 1040 1036 -4
    __ip_route_output_key 1904 1900 -4
    ip_route_input_noref 2508 2500 -8
    shmem_file_aio_read 784 772 -12
    __isolate_lru_page 272 256 -16
    shmem_replace_page 708 688 -20
    mark_buffer_dirty 228 208 -20
    __set_page_dirty_buffers 240 220 -20
    __remove_mapping 276 256 -20
    update_mmu_cache 500 476 -24
    set_page_dirty_balance 92 68 -24
    set_page_dirty 172 148 -24
    page_evictable 88 64 -24
    page_cache_pipe_buf_steal 248 224 -24
    clear_page_dirty_for_io 340 316 -24
    test_set_page_writeback 400 372 -28
    test_clear_page_writeback 516 488 -28
    invalidate_inode_page 156 128 -28
    page_mkclean 432 400 -32
    flush_dcache_page 360 328 -32
    __set_page_dirty_nobuffers 324 280 -44
    shrink_page_list 2412 2356 -56

    Signed-off-by: Shaohua Li
    Suggested-by: Andrew Morton
    Cc: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • do_mmap_pgoff() rounds up the desired size to the next PAGE_SIZE
    multiple; however, there was no equivalent code in mm_populate(),
    which caused issues.

    This could be fixed by introducing the same rounding in
    mm_populate(); however, I think it's preferable to make
    do_mmap_pgoff() return populate as a size rather than as a boolean,
    so we don't have to duplicate the size-rounding logic in
    mm_populate().

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Tested-by: Andy Lutomirski
    Cc: Greg Ungerer
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • When creating new mappings using the MAP_POPULATE / MAP_LOCKED flags (or
    with MCL_FUTURE in effect), we want to populate the pages within the
    newly created vmas. This may take a while as we may have to read pages
    from disk, so ideally we want to do this outside of the write-locked
    mmap_sem region.

    This change introduces mm_populate(), which is used to defer
    populating such mappings until after the mmap_sem write lock has been
    released. This is implemented as a generalization of the former
    do_mlock_pages(), which accomplished the same task but was used
    during mlock() / mlockall().
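
    The resulting call pattern (a simplified sketch of vm_mmap_pgoff
    after this series):

        down_write(&mm->mmap_sem);
        ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff, &populate);
        up_write(&mm->mmap_sem);

        /* populate outside the write-locked region; it is a size in
         * bytes, so zero means there is nothing to fault in */
        if (populate)
                mm_populate(ret, populate);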

    Signed-off-by: Michel Lespinasse
    Reported-by: Andy Lutomirski
    Acked-by: Rik van Riel
    Tested-by: Andy Lutomirski
    Cc: Greg Ungerer
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

29 Oct, 2012

1 commit


16 Oct, 2012

1 commit