07 Jun, 2014

1 commit

  • The remap_file_pages() system call is used to create a nonlinear
    mapping, that is, a mapping in which the pages of the file are mapped
    into a nonsequential order in memory. The advantage of using
    remap_file_pages() over using repeated calls to mmap(2) is that the
    former approach does not require the kernel to create additional VMA
    (Virtual Memory Area) data structures.

    Supporting of nonlinear mapping requires significant amount of
    non-trivial code in kernel virtual memory subsystem including hot paths.
    Also to get nonlinear mapping work kernel need a way to distinguish
    normal page table entries from entries with file offset (pte_file).
    Kernel reserves flag in PTE for this purpose. PTE flags are scarce
    resource especially on some CPU architectures. It would be nice to free
    up the flag for other usage.

    Fortunately, there are not many users of remap_file_pages() in the wild.
    It's only known that one enterprise RDBMS implementation uses the
    syscall on 32-bit systems to map files bigger than can linearly fit into
    32-bit virtual address space. This use-case is not critical anymore
    since 64-bit systems are widely available.

    The plan is to deprecate the syscall and replace it with an emulation.
    The emulation will create new VMAs instead of nonlinear mappings. It's
    going to work slower for rare users of remap_file_pages() but ABI is
    preserved.

    One side effect of emulation (apart from performance) is that user can
    hit vm.max_map_count limit more easily due to additional VMAs. See
    comment for DEFAULT_MAX_MAP_COUNT for more details on the limit.

    [akpm@linux-foundation.org: fix spello]
    Signed-off-by: Kirill A. Shutemov
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Dave Jones
    Cc: Armin Rigo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

05 Jun, 2014

1 commit

  • Currently memory error handler handles action optional errors in the
    deferred manner by default. And if a recovery aware application wants
    to handle it immediately, it can do it by setting PF_MCE_EARLY flag.
    However, such signal can be sent only to the main thread, so it's
    problematic if the application wants to have a dedicated thread to
    handler such signals.

    So this patch adds dedicated thread support to memory error handler. We
    have PF_MCE_EARLY flags for each thread separately, so with this patch
    AO signal is sent to the thread with PF_MCE_EARLY flag set, not the main
    thread. If you want to implement a dedicated thread, you call prctl()
    to set PF_MCE_EARLY on the thread.

    Memory error handler collects processes to be killed, so this patch lets
    it check PF_MCE_EARLY flag on each thread in the collecting routines.

    No behavioral change for all non-early kill cases.

    Tony said:

    : The old behavior was crazy - someone with a multithreaded process might
    : well expect that if they call prctl(PF_MCE_EARLY) in just one thread, then
    : that thread would see the SIGBUS with si_code = BUS_MCEERR_A0 - even if
    : that thread wasn't the main thread for the process.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Tony Luck
    Cc: Kamil Iskra
    Cc: Andi Kleen
    Cc: Borislav Petkov
    Cc: Chen Gong
    Cc: [3.2+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

04 Jun, 2014

1 commit

  • Pull trivial tree changes from Jiri Kosina:
    "Usual pile of patches from trivial tree that make the world go round"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (23 commits)
    staging: go7007: remove reference to CONFIG_KMOD
    aic7xxx: Remove obsolete preprocessor define
    of: dma: doc fixes
    doc: fix incorrect formula to calculate CommitLimit value
    doc: Note need of bc in the kernel build from 3.10 onwards
    mm: Fix printk typo in dmapool.c
    modpost: Fix comment typo "Modules.symvers"
    Kconfig.debug: Grammar s/addition/additional/
    wimax: Spelling s/than/that/, wording s/destinatary/recipient/
    aic7xxx: Spelling s/termnation/termination/
    arm64: mm: Remove superfluous "the" in comment
    of: Spelling s/anonymouns/anonymous/
    dma: imx-sdma: Spelling s/determnine/determine/
    ath10k: Improve grammar in comments
    ath6kl: Spelling s/determnine/determine/
    of: Improve grammar for of_alias_get_id() documentation
    drm/exynos: Spelling s/contro/control/
    radio-bcm2048.c: fix wrong overflow check
    doc: printk-formats: do not mention casts for u64/s64
    doc: spelling error changes
    ...

    Linus Torvalds
     

05 May, 2014

1 commit


19 Apr, 2014

1 commit

  • In document numa_memory_policy.txt, the following examples for flag
    MPOL_F_RELATIVE_NODES are incorrect.

    For example, consider a task that is attached to a cpuset with
    mems 2-5 that sets an Interleave policy over the same set with
    MPOL_F_RELATIVE_NODES. If the cpuset's mems change to 3-7, the
    interleave now occurs over nodes 3,5-6. If the cpuset's mems
    then change to 0,2-3,5, then the interleave occurs over nodes
    0,3,5.

    According to the comment of the patch adding flag MPOL_F_RELATIVE_NODES,
    the nodemasks the user specifies should be considered relative to the
    current task's mems_allowed.

    (https://lkml.org/lkml/2008/2/29/428)

    And according to numa_memory_policy.txt, if the user's nodemask includes
    nodes that are outside the range of the new set of allowed nodes, then
    the remap wraps around to the beginning of the nodemask and, if not
    already set, sets the node in the mempolicy nodemask.

    So in the example, if the user specifies 2-5, for a task whose
    mems_allowed is 3-7, the nodemasks should be remapped the third, fourth,
    fifth, sixth node in mems_allowed. like the following:

    mems_allowed: 3 4 5 6 7

    relative index: 0 1 2 3 4
    5

    So the nodemasks should be remapped to 3,5-7, but not 3,5-6.

    And for a task whose mems_allowed is 0,2-3,5, the nodemasks should be
    remapped to 0,2-3,5, but not 0,3,5.

    mems_allowed: 0 2 3 5

    relative index: 0 1 2 3
    4 5

    Signed-off-by: Tang Chen
    Cc: Randy Dunlap
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     

21 Mar, 2014

1 commit


11 Feb, 2014

1 commit

  • Some of the 00-INDEX files are somewhat outdated and some folders does
    not contain 00-INDEX at all. Only outdated (with the notably exception
    of spi) indexes are touched here, the 169 folders without 00-INDEX has
    not been touched.

    New 00-INDEX
    - spi/* was added in a series of commits dating back to 2006

    Added files (missing in (*/)00-INDEX)
    - dmatest.txt was added by commit 851b7e16a07d ("dmatest: run test via
    debugfs")
    - this_cpu_ops.txt was added by commit a1b2a555d637 ("percpu: add
    documentation on this_cpu operations")
    - ww-mutex-design.txt was added by commit 040a0a371005 ("mutex: Add
    support for wound/wait style locks")
    - bcache.txt was added by commit cafe56359144 ("bcache: A block layer
    cache")
    - kernel-per-CPU-kthreads.txt was added by commit 49717cb40410
    ("kthread: Document ways of reducing OS jitter due to per-CPU
    kthreads")
    - phy.txt was added by commit ff764963479a ("drivers: phy: add generic
    PHY framework")
    - block/null_blk was added by commit 12f8f4fc0314 ("null_blk:
    documentation")
    - module-signing.txt was added by commit 3cafea307642 ("Add
    Documentation/module-signing.txt file")
    - assoc_array.txt was added by commit 3cb989501c26 ("Add a generic
    associative array implementation.")
    - arm/IXP4xx was part of the initial repo
    - arm/cluster-pm-race-avoidance.txt was added by commit 7fe31d28e839
    ("ARM: mcpm: introduce helpers for platform coherency exit/setup")
    - arm/firmware.txt was added by commit 7366b92a77fc ("ARM: Add
    interface for registering and calling firmware-specific operations")
    - arm/kernel_mode_neon.txt was added by commit 2afd0a05241d ("ARM:
    7825/1: document the use of NEON in kernel mode")
    - arm/tcm.txt was added by commit bc581770cfdd ("ARM: 5580/2: ARM TCM
    (Tightly-Coupled Memory) support v3")
    - arm/vlocks.txt was added by commit 9762f12d3e05 ("ARM: mcpm: Add
    baremetal voting mutexes")
    - blackfin/gptimers-example.c, Makefile was added by commit
    4b60779d5ea7 ("Blackfin: add an example showing how to use the
    gptimers API")
    - devicetree/usage-model.txt was added by commit 31134efc681a ("dt:
    Linux DT usage model documentation")
    - fb/api.txt was added by commit fb21c2f42879 ("fbdev: Add FOURCC-based
    format configuration API")
    - fb/sm501.txt was added by commit e6a049807105 ("video, sm501: add
    edid and commandline support")
    - fb/udlfb.txt was added by commit 96f8d864afd6 ("fbdev: move udlfb out
    of staging.")
    - filesystems/Makefile was added by commit 1e0051ae48a2
    ("Documentation/fs/: split txt and source files")
    - filesystems/nfs/nfsd-admin-interfaces.txt was added by commit
    8a4c6e19cfed ("nfsd: document kernel interfaces for nfsd
    configuration")
    - ide/warm-plug-howto.txt was added by commit f74c91413ec6 ("ide: add
    warm-plug support for IDE devices (take 2)")
    - laptops/Makefile was added by commit d49129accc21
    ("Documentation/laptop/: split txt and source files")
    - leds/leds-blinkm.txt was added by commit b54cf35a7f65 ("LEDS: add
    BlinkM RGB LED driver, documentation and update MAINTAINERS")
    - leds/ledtrig-oneshot.txt was added by commit 5e417281cde2 ("leds: add
    oneshot trigger")
    - leds/ledtrig-transient.txt was added by commit 44e1e9f8e705 ("leds:
    add new transient trigger for one shot timer activation")
    - m68k/README.buddha was part of the initial repo
    - networking/LICENSE.(qla3xxx|qlcnic|qlge) was added by commits
    40839129f779, c4e84bde1d59, 5a4faa873782
    - networking/Makefile was added by commit 3794f3e812ef ("docsrc: build
    Documentation/ sources")
    - networking/i40evf.txt was added by commit 105bf2fe6b32 ("i40evf: add
    driver to kernel build system")
    - networking/ipsec.txt was added by commit b3c6efbc36e2 ("xfrm: Add
    file to document IPsec corner case")
    - networking/mac80211-auth-assoc-deauth.txt was added by commit
    3cd7920a2be8 ("mac80211: add auth/assoc/deauth flow diagram")
    - networking/netlink_mmap.txt was added by commit 5683264c3981
    ("netlink: add documentation for memory mapped I/O")
    - networking/nf_conntrack-sysctl.txt was added by commit c9f9e0e1597f
    ("netfilter: doc: add nf_conntrack sysctl api documentation") lan)
    - networking/team.txt was added by commit 3d249d4ca7d0 ("net: introduce
    ethernet teaming device")
    - networking/vxlan.txt was added by commit d342894c5d2f ("vxlan:
    virtual extensible lan")
    - power/runtime_pm.txt was added by commit 5e928f77a09a ("PM: Introduce
    core framework for run-time PM of I/O devices (rev. 17)")
    - power/charger-manager.txt was added by commit 3bb3dbbd56ea
    ("power_supply: Add initial Charger-Manager driver")
    - RCU/lockdep-splat.txt was added by commit d7bd2d68aa2e ("rcu:
    Document interpretation of RCU-lockdep splats")
    - s390/kvm.txt was added by 5ecee4b (KVM: s390: API documentation)
    - s390/qeth.txt was added by commit b4d72c08b358 ("qeth: bridgeport
    support - basic control")
    - scheduler/sched-bwc.txt was added by commit 88ebc08ea9f7 ("sched: Add
    documentation for bandwidth control")
    - scsi/advansys.txt was added by commit 4bd6d7f35661 ("[SCSI] advansys:
    Move documentation to Documentation/scsi")
    - scsi/bfa.txt was added by commit 1ec90174bdb4 ("[SCSI] bfa: add
    readme file")
    - scsi/bnx2fc.txt was added by commit 12b8fc10eaf4 ("[SCSI] bnx2fc: Add
    driver documentation")
    - scsi/cxgb3i.txt was added by commit c3673464ebc0 ("[SCSI] cxgb3i: Add
    cxgb3i iSCSI driver.")
    - scsi/hpsa.txt was added by commit 992ebcf14f3c ("[SCSI] hpsa: Add
    hpsa.txt to Documentation/scsi")
    - scsi/link_power_management_policy.txt was added by commit
    ca77329fb713 ("[libata] Link power management infrastructure")
    - scsi/osd.txt was added by commit 78e0c621deca ("[SCSI] osd:
    Documentation for OSD library")
    - scsi/scsi-parameter.txt was created/moved by commit 163475fb111c
    ("Documentation: move SCSI parameters to their own text file")
    - serial/driver was part of the initial repo
    - serial/n_gsm.txt was added by commit 323e84122ec6 ("n_gsm: add a
    documentation")
    - timers/Makefile was added by commit 3794f3e812ef ("docsrc: build
    Documentation/ sources")
    - virt/kvm/s390.txt was added by commit d9101fca3d57 ("KVM: s390:
    diagnose call documentation")
    - vm/split_page_table_lock was added by commit 49076ec2ccaf ("mm:
    dynamically allocate page->ptl if it cannot be embedded to struct
    page")
    - w1/slaves/w1_ds28e04 was added by commit fbf7f7b4e2ae ("w1: Add
    1-wire slave device driver for DS28E04-100")
    - w1/masters/omap-hdq was added by commit e0a29382c6f5 ("hdq:
    documentation for OMAP HDQ")
    - x86/early-microcode.txt was added by commit 0d91ea86a895 ("x86, doc:
    Documentation for early microcode loading")
    - x86/earlyprintk.txt was added by commit a1aade478862 ("x86/doc:
    mini-howto for using earlyprintk=dbgp")
    - x86/entry_64.txt was added by commit 8b4777a4b50c ("x86-64: Document
    some of entry_64.S")
    - x86/pat.txt was added by commit d27554d874c7 ("x86: PAT
    documentation")

    Moved files
    - arm/kernel_user_helpers.txt was moved out of arch/arm/kernel by
    commit 37b8304642c7 ("ARM: kuser: move interface documentation out of
    the source code")
    - efi-stub.txt was moved out of x86/ and down into Documentation/ in
    commit 4172fe2f8a47 ("EFI stub documentation updates")
    - laptops/hpfall.c was moved out of hwmon/ and into laptops/ in commit
    efcfed9bad88 ("Move hp_accel to drivers/platform/x86")
    - commit 5616c23ad9cd ("x86: doc: move x86-generic documentation from
    Doc/x86/i386"):
    * x86/usb-legacy-support.txt
    * x86/boot.txt
    * x86/zero_page.txt
    - power/video_extension.txt was moved to acpi in commit 70e66e4df191
    ("ACPI / video: move video_extension.txt to Documentation/acpi")

    Removed files (left in 00-INDEX)
    - memory.txt was removed by commit 00ea8990aadf ("memory.txt: remove
    stray information")
    - gpio.txt was moved to gpio/ in commit fd8e198cfcaa ("Documentation:
    gpiolib: document new interface")
    - networking/DLINK.txt was removed by commit 168e06ae26dd
    ("drivers/net: delete old parallel port de600/de620 drivers")
    - serial/hayes-esp.txt was removed by commit f53a2ade0bb9 ("tty: esp:
    remove broken driver")
    - s390/TAPE was removed by commit 9e280f669308 ("[S390] remove tape
    block docu")
    - vm/locking was removed by commit 57ea8171d2bc ("mm: documentation:
    remove hopelessly out-of-date locking doc")
    - laptops/acer-wmi.txt was remvoed by commit 020036678e81 ("acer-wmi:
    Delete out-of-date documentation")

    Typos/misc issues
    - rpc-server-gss.txt was added as knfsd-rpcgss.txt in commit
    030d794bf498 ("SUNRPC: Use gssproxy upcall for server RPCGSS
    authentication.")
    - commit b88cf73d9278 ("net: add missing entries to
    Documentation/networking/00-INDEX")
    * generic-hdlc.txt was added as generic_hdlc.txt
    * spider_net.txt was added as spider-net.txt
    - w1/master/mxc-w1 was added as mxc_w1 by commit a5fd9139f74c ("w1: add
    1-wire master driver for i.MX27 / i.MX31")
    - s390/zfcpdump.txt was added as zfcpdump by commit 6920c12a407e
    ("[S390] Add Documentation/s390/00-INDEX.")

    Signed-off-by: Henrik Austad
    Reviewed-by: Paul E. McKenney [rcu bits]
    Acked-by: Rob Landley
    Cc: Jiri Kosina
    Cc: Thomas Gleixner
    Cc: Rob Herring
    Cc: David S. Miller
    Cc: Mark Brown
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Gleb Natapov
    Cc: Linus Torvalds
    Cc: Len Brown
    Cc: James Bottomley
    Cc: Jean-Christophe Plagniol-Villard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Henrik Austad
     

24 Jan, 2014

1 commit

  • Documentation/vm/locking is a blast from the past. In the entire git
    history, it has had precisely Three modifications. Two of those look to
    be pure renames, and the third was from 2005.

    The doc contains such gems as:

    > The page_table_lock is grabbed while holding the
    > kernel_lock spinning monitor.

    > Page stealers hold kernel_lock to protect against a bunch of
    > races.

    Or this which talks about mmap_sem:

    > 4. The exception to this rule is expand_stack, which just
    > takes the read lock and the page_table_lock, this is ok
    > because it doesn't really modify fields anybody relies on.

    expand_stack() doesn't take any locks any more directly, and the
    mmap_sem acquisition was long ago moved up in to the page fault code
    itself.

    It could be argued that we need to rewrite this, but it is dangerous to
    leave it as-is. It will confuse more people than it helps.

    Signed-off-by: Dave Hansen
    Cc: Hugh Dickins
    Acked-by: Vlastimil Babka
    Cc: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

22 Jan, 2014

1 commit

  • Some applications that run on HPC clusters are designed around the
    availability of RAM and the overcommit ratio is fine tuned to get the
    maximum usage of memory without swapping. With growing memory, the
    1%-of-all-RAM grain provided by overcommit_ratio has become too coarse
    for these workload (on a 2TB machine it represents no less than 20GB).

    This patch adds the new overcommit_kbytes sysctl variable that allow a
    much finer grain.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix nommu build]
    Signed-off-by: Jerome Marchand
    Cc: Dave Hansen
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     

22 Nov, 2013

1 commit

  • There are two code paths how page with pmd page table can be freed:
    pmd_free() and pmd_free_tlb().

    I've missed the second one and didn't add page table destructor call
    there. It leads to leak of page->ptl for pmd page tables, if
    dynamically allocated page->ptl is in use.

    The patch adds the missed destructor and modifies documentation
    accordingly.

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Andrey Vagin
    Tested-by: Andrey Vagin
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

16 Nov, 2013

1 commit

  • Pull trivial tree updates from Jiri Kosina:
    "Usual earth-shaking, news-breaking, rocket science pile from
    trivial.git"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (23 commits)
    doc: usb: Fix typo in Documentation/usb/gadget_configs.txt
    doc: add missing files to timers/00-INDEX
    timekeeping: Fix some trivial typos in comments
    mm: Fix some trivial typos in comments
    irq: Fix some trivial typos in comments
    NUMA: fix typos in Kconfig help text
    mm: update 00-INDEX
    doc: Documentation/DMA-attributes.txt fix typo
    DRM: comment: `halve' -> `half'
    Docs: Kconfig: `devlopers' -> `developers'
    doc: typo on word accounting in kprobes.c in mutliple architectures
    treewide: fix "usefull" typo
    treewide: fix "distingush" typo
    mm/Kconfig: Grammar s/an/a/
    kexec: Typo s/the/then/
    Documentation/kvm: Update cpuid documentation for steal time and pv eoi
    treewide: Fix common typo in "identify"
    __page_to_pfn: Fix typo in comment
    Correct some typos for word frequency
    clk: fixed-factor: Fix a trivial typo
    ...

    Linus Torvalds
     

15 Nov, 2013

1 commit

  • If split page table lock is in use, we embed the lock into struct page
    of table's page. We have to disable split lock, if spinlock_t is too
    big be to be embedded, like when DEBUG_SPINLOCK or DEBUG_LOCK_ALLOC
    enabled.

    This patch add support for dynamic allocation of split page table lock
    if we can't embed it to struct page.

    page->ptl is unsigned long now and we use it as spinlock_t if
    sizeof(spinlock_t) ptl.

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

13 Nov, 2013

1 commit


14 Oct, 2013

1 commit

  • The following files moved files out of Documentation/vm/
    c6dd897f ("mm: move page-types.c from Documentation to tools/vm")
    f0f57b2b ("move hugepage test examples to tools/testing/selftests/vm)

    Remove these files from vm/00-INDEX.

    The following commits added new files do Documentation/vm/
    4fe4746a ("mm/fs: cleancache documentation") added vm/cleancache.txt
    d65bfacb ("mm: highmem documentation") added vm/highmem.txt
    1c9bf22c ("thp: transparent hugepage support documentation") added
    vm/transhuge.txt
    0f8975ec ("mm: soft-dirty bits for user memory changes tracking")
    61b0d760 ("zswap: add documentation")
    27c6aec2 ("mm: frontswap: config and doc files")

    Add the missing documentation-files with a short description to 00-INDEX

    Signed-off-by: Henrik Austad
    Signed-off-by: Jiri Kosina

    Henrik Austad
     

12 Sep, 2013

2 commits

  • Pavel reported that in case if vma area get unmapped and then mapped (or
    expanded) in-place, the soft dirty tracker won't be able to recognize this
    situation since it works on pte level and ptes are get zapped on unmap,
    loosing soft dirty bit of course.

    So to resolve this situation we need to track actions on vma level, there
    VM_SOFTDIRTY flag comes in. When new vma area created (or old expanded)
    we set this bit, and keep it here until application calls for clearing
    soft dirty bit.

    Thus when user space application track memory changes now it can detect if
    vma area is renewed.

    Reported-by: Pavel Emelyanov
    Signed-off-by: Cyrill Gorcunov
    Cc: Andy Lutomirski
    Cc: Matt Mackall
    Cc: Xiao Guangrong
    Cc: Marcelo Tosatti
    Cc: KOSAKI Motohiro
    Cc: Stephen Rothwell
    Cc: Peter Zijlstra
    Cc: "Aneesh Kumar K.V"
    Cc: Rob Landley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Explicitly mention/recommend using the libhugetlbfs test cases when
    changing related kernel code. Developers that are unaware of the project
    can easily miss this and introduce potential regressions that may or may
    not be caught by community review.

    Also do some cleanups that make the document visually easier to view at a
    first glance.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

11 Jul, 2013

1 commit

  • Add the documentation file for the zswap functionality

    Signed-off-by: Seth Jennings
    Acked-by: Rik van Riel
    Cc: Greg Kroah-Hartman
    Cc: Nitin Gupta
    Cc: Minchan Kim
    Cc: Konrad Rzeszutek Wilk
    Cc: Dan Magenheimer
    Cc: Robert Jennings
    Cc: Jenifer Hopper
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Larry Woodman
    Cc: Benjamin Herrenschmidt
    Cc: Dave Hansen
    Cc: Joe Perches
    Cc: Joonsoo Kim
    Cc: Cody P Schafer
    Cc: Hugh Dickens
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Seth Jennings
     

10 Jul, 2013

1 commit

  • Transparent huge zero page is used during the page fault instead of in
    khugepaged.

    # ls /sys/kernel/mm/transparent_hugepage/
    defrag enabled khugepaged use_zero_page
    # ls /sys/kernel/mm/transparent_hugepage/khugepaged/
    alloc_sleep_millisecs defrag full_scans max_ptes_none pages_collapsed pages_to_scan scan_sleep_millisecs

    This patch corrects the documentation just like the codes done.

    Signed-off-by: Wanpeng Li
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     

05 Jul, 2013

1 commit

  • Pull trivial tree updates from Jiri Kosina:
    "The usual stuff from trivial tree"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
    treewide: relase -> release
    Documentation/cgroups/memory.txt: fix stat file documentation
    sysctl/net.txt: delete reference to obsolete 2.4.x kernel
    spinlock_api_smp.h: fix preprocessor comments
    treewide: Fix typo in printk
    doc: device tree: clarify stuff in usage-model.txt.
    open firmware: "/aliasas" -> "/aliases"
    md: bcache: Fixed a typo with the word 'arithmetic'
    irq/generic-chip: fix a few kernel-doc entries
    frv: Convert use of typedef ctl_table to struct ctl_table
    sgi: xpc: Convert use of typedef ctl_table to struct ctl_table
    doc: clk: Fix incorrect wording
    Documentation/arm/IXP4xx fix a typo
    Documentation/networking/ieee802154 fix a typo
    Documentation/DocBook/media/v4l fix a typo
    Documentation/video4linux/si476x.txt fix a typo
    Documentation/virtual/kvm/api.txt fix a typo
    Documentation/early-userspace/README fix a typo
    Documentation/video4linux/soc-camera.txt fix a typo
    lguest: fix CONFIG_PAE -> CONFIG_x86_PAE in comment
    ...

    Linus Torvalds
     

04 Jul, 2013

2 commits

  • In order to reuse bits from pagemap entries gracefully, we leave the
    entries as is but on pagemap open emit a warning in dmesg, that bits
    55-60 are about to change in a couple of releases. Next, if a user
    issues soft-dirty clear command via the clear_refs file (it was disabled
    before v3.9) we assume that he's aware of the new pagemap format, note
    that fact and report the bits in pagemap in the new manner.

    The "migration strategy" looks like this then:

    1. existing users are not affected -- they don't touch soft-dirty feature, thus
    see old bits in pagemap, but are warned and have time to fix themselves
    2. those who use soft-dirty know about new pagemap format
    3. some time soon we get rid of any signs of page-shift in pagemap as well as
    this trick with clear-soft-dirty affecting pagemap format.

    Signed-off-by: Pavel Emelyanov
    Cc: Matt Mackall
    Cc: Xiao Guangrong
    Cc: Glauber Costa
    Cc: Marcelo Tosatti
    Cc: KOSAKI Motohiro
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • The soft-dirty is a bit on a PTE which helps to track which pages a task
    writes to. In order to do this tracking one should

    1. Clear soft-dirty bits from PTEs ("echo 4 > /proc/PID/clear_refs)
    2. Wait some time.
    3. Read soft-dirty bits (55'th in /proc/PID/pagemap2 entries)

    To do this tracking, the writable bit is cleared from PTEs when the
    soft-dirty bit is. Thus, after this, when the task tries to modify a
    page at some virtual address the #PF occurs and the kernel sets the
    soft-dirty bit on the respective PTE.

    Note, that although all the task's address space is marked as r/o after
    the soft-dirty bits clear, the #PF-s that occur after that are processed
    fast. This is so, since the pages are still mapped to physical memory,
    and thus all the kernel does is finds this fact out and puts back
    writable, dirty and soft-dirty bits on the PTE.

    Another thing to note, is that when mremap moves PTEs they are marked
    with soft-dirty as well, since from the user perspective mremap modifies
    the virtual memory at mremap's new address.

    Signed-off-by: Pavel Emelyanov
    Cc: Matt Mackall
    Cc: Xiao Guangrong
    Cc: Glauber Costa
    Cc: Marcelo Tosatti
    Cc: KOSAKI Motohiro
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     

28 May, 2013

1 commit


30 Apr, 2013

1 commit

  • Add user_reserve_kbytes knob.

    Limit the growth of the memory reserved for other user processes to
    min(3% current process size, user_reserve_pages). Only about 8MB is
    necessary to enable recovery in the default mode, and only a few hundred
    MB are required even when overcommit is disabled.

    user_reserve_pages defaults to min(3% free pages, 128MB)

    I arrived at 128MB by taking the max VSZ of sshd, login, bash, and top ...
    then adding the RSS of each.

    This only affects OVERCOMMIT_NEVER mode.

    Background

    1. user reserve

    __vm_enough_memory reserves a hardcoded 3% of the current process size for
    other applications when overcommit is disabled. This was done so that a
    user could recover if they launched a memory hogging process. Without the
    reserve, a user would easily run into a message such as:

    bash: fork: Cannot allocate memory

    2. admin reserve

    Additionally, a hardcoded 3% of free memory is reserved for root in both
    overcommit 'guess' and 'never' modes. This was intended to prevent a
    scenario where root-cant-log-in and perform recovery operations.

    Note that this reserve shrinks, and doesn't guarantee a useful reserve.

    Motivation

    The two hardcoded memory reserves should be updated to account for current
    memory sizes.

    Also, the admin reserve would be more useful if it didn't shrink too much.

    When the current code was originally written, 1GB was considered
    "enterprise". Now the 3% reserve can grow to multiple GB on large memory
    systems, and it only needs to be a few hundred MB at most to enable a user
    or admin to recover a system with an unwanted memory hogging process.

    I've found that reducing these reserves is especially beneficial for a
    specific type of application load:

    * single application system
    * one or few processes (e.g. one per core)
    * allocating all available memory
    * not initializing every page immediately
    * long running

    I've run scientific clusters with this sort of load. A long running job
    sometimes failed many hours (weeks of CPU time) into a calculation. They
    weren't initializing all of their memory immediately, and they weren't
    using calloc, so I put systems into overcommit 'never' mode. These
    clusters run diskless and have no swap.

    However, with the current reserves, a user wishing to allocate as much
    memory as possible to one process may be prevented from using, for
    example, almost 2GB out of 32GB.

    The effect is less, but still significant when a user starts a job with
    one process per core. I have repeatedly seen a set of processes
    requesting the same amount of memory fail because one of them could not
    allocate the amount of memory a user would expect to be able to allocate.
    For example, Message Passing Interfce (MPI) processes, one per core. And
    it is similar for other parallel programming frameworks.

    Changing this reserve code will make the overcommit never mode more useful
    by allowing applications to allocate nearly all of the available memory.

    Also, the new admin_reserve_kbytes will be safer than the current behavior
    since the hardcoded 3% of available memory reserve can shrink to something
    useless in the case where applications have grabbed all available memory.

    Risks

    * "bash: fork: Cannot allocate memory"

    The downside of the first patch-- which creates a tunable user reserve
    that is only used in overcommit 'never' mode--is that an admin can set
    it so low that a user may not be able to kill their process, even if
    they already have a shell prompt.

    Of course, a user can get in the same predicament with the current 3%
    reserve--they just have to launch processes until 3% becomes negligible.

    * root-cant-log-in problem

    The second patch, adding the tunable rootuser_reserve_pages, allows
    the admin to shoot themselves in the foot by setting it too small. They
    can easily get the system into a state where root-can't-log-in.

    However, the new admin_reserve_kbytes will be safer than the current
    behavior since the hardcoded 3% of available memory reserve can shrink
    to something useless in the case where applications have grabbed all
    available memory.

    Alternatives

    * Memory cgroups provide a more flexible way to limit application memory.

    Not everyone wants to set up cgroups or deal with their overhead.

    * We could create a fourth overcommit mode which provides smaller reserves.

    The size of useful reserves may be drastically different depending
    on the whether the system is embedded or enterprise.

    * Force users to initialize all of their memory or use calloc.

    Some users don't want/expect the system to overcommit when they malloc.
    Overcommit 'never' mode is for this scenario, and it should work well.

    The new user and admin reserve tunables are simple to use, with low
    overhead compared to cgroups. The patches preserve current behavior where
    3% of memory is less than 128MB, except that the admin reserve doesn't
    shrink to an unusable size under pressure. The code allows admins to tune
    for embedded and enterprise usage.

    FAQ

    * How is the root-cant-login problem addressed?
    What happens if admin_reserve_pages is set to 0?

    Root is free to shoot themselves in the foot by setting
    admin_reserve_kbytes too low.

    On x86_64, the minimum useful reserve is:
    8MB for overcommit 'guess'
    128MB for overcommit 'never'

    admin_reserve_pages defaults to min(3% free memory, 8MB)

    So, anyone switching to 'never' mode needs to adjust
    admin_reserve_pages.

    * How do you calculate a minimum useful reserve?

    A user or the admin needs enough memory to login and perform
    recovery operations, which includes, at a minimum:

    sshd or login + bash (or some other shell) + top (or ps, kill, etc.)

    For overcommit 'guess', we can sum resident set sizes (RSS)
    because we only need enough memory to handle what the recovery
    programs will typically use. On x86_64 this is about 8MB.

    For overcommit 'never', we can take the max of their virtual sizes (VSZ)
    and add the sum of their RSS. We use VSZ instead of RSS because mode
    forces us to ensure we can fulfill all of the requested memory allocations--
    even if the programs only use a fraction of what they ask for.
    On x86_64 this is about 128MB.

    When swap is enabled, reserves are useful even when they are as
    small as 10MB, regardless of overcommit mode.

    When both swap and overcommit are disabled, then the admin should
    tune the reserves higher to be absolutley safe. Over 230MB each
    was safest in my testing.

    * What happens if user_reserve_pages is set to 0?

    Note, this only affects overcomitt 'never' mode.

    Then a user will be able to allocate all available memory minus
    admin_reserve_kbytes.

    However, they will easily see a message such as:

    "bash: fork: Cannot allocate memory"

    And they won't be able to recover/kill their application.
    The admin should be able to recover the system if
    admin_reserve_kbytes is set appropriately.

    * What's the difference between overcommit 'guess' and 'never'?

    "Guess" allows an allocation if there are enough free + reclaimable
    pages. It has a hardcoded 3% of free pages reserved for root.

    "Never" allows an allocation if there is enough swap + a configurable
    percentage (default is 50) of physical RAM. It has a hardcoded 3% of
    free pages reserved for root, like "Guess" mode. It also has a
    hardcoded 3% of the current process size reserved for additional
    applications.

    * Why is overcommit 'guess' not suitable even when an app eventually
    writes to every page? It takes free pages, file pages, available
    swap pages, reclaimable slab pages into consideration. In other words,
    these are all pages available, then why isn't overcommit suitable?

    Because it only looks at the present state of the system. It
    does not take into account the memory that other applications have
    malloced, but haven't initialized yet. It overcommits the system.

    Test Summary

    There was little change in behavior in the default overcommit 'guess'
    mode with swap enabled before and after the patch. This was expected.

    Systems run most predictably (i.e. no oom kills) in overcommit 'never'
    mode with swap enabled. This also allowed the most memory to be allocated
    to a user application.

    Overcommit 'guess' mode without swap is a bad idea. It is easy to
    crash the system. None of the other tested combinations crashed.
    This matches my experience on the Roadrunner supercomputer.

    Without the tunable user reserve, a system in overcommit 'never' mode
    and without swap does not allow the admin to recover, although the
    admin can.

    With the new tunable reserves, a system in overcommit 'never' mode
    and without swap can be configured to:

    1. maximize user-allocatable memory, running close to the edge of
    recoverability

    2. maximize recoverability, sacrificing allocatable memory to
    ensure that a user cannot take down a system

    Test Description

    Fedora 18 VM - 4 x86_64 cores, 5725MB RAM, 4GB Swap

    System is booted into multiuser console mode, with unnecessary services
    turned off. Caches were dropped before each test.

    Hogs are user memtester processes that attempt to allocate all free memory
    as reported by /proc/meminfo

    In overcommit 'never' mode, memory_ratio=100

    Test Results

    3.9.0-rc1-mm1

    Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
    ---------- ---- ---- ------------- ---- ------------- --------------
    guess yes 1 5432/5432 no yes yes
    guess yes 4 5444/5444 1 yes yes
    guess no 1 5302/5449 no yes yes
    guess no 4 - crash no no

    never yes 1 5460/5460 1 yes yes
    never yes 4 5460/5460 1 yes yes
    never no 1 5218/5432 no no yes
    never no 4 5203/5448 no no yes

    3.9.0-rc1-mm1-tunablereserves

    User and Admin Recovery show their respective reserves, if applicable.

    Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
    ---------- ---- ---- ------------- ---- ------------- --------------
    guess yes 1 5419/5419 no - yes 8MB yes
    guess yes 4 5436/5436 1 - yes 8MB yes
    guess no 1 5440/5440 * - yes 8MB yes
    guess no 4 - crash - no 8MB no

    * process would successfully mlock, then the oom killer would pick it

    never yes 1 5446/5446 no 10MB yes 20MB yes
    never yes 4 5456/5456 no 10MB yes 20MB yes
    never no 1 5387/5429 no 128MB no 8MB barely
    never no 1 5323/5428 no 226MB barely 8MB barely
    never no 1 5323/5428 no 226MB barely 8MB barely

    never no 1 5359/5448 no 10MB no 10MB barely

    never no 1 5323/5428 no 0MB no 10MB barely
    never no 1 5332/5428 no 0MB no 50MB yes
    never no 1 5293/5429 no 0MB no 90MB yes

    never no 1 5001/5427 no 230MB yes 338MB yes
    never no 4* 4998/5424 no 230MB yes 338MB yes

    * more memtesters were launched, able to allocate approximately another 100MB

    Future Work

    - Test larger memory systems.

    - Test an embedded image.

    - Test other architectures.

    - Time malloc microbenchmarks.

    - Would it be useful to be able to set overcommit policy for
    each memory cgroup?

    - Some lines are slightly above 80 chars.
    Perhaps define a macro to convert between pages and kb?
    Other places in the kernel do this.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: make init_user_reserve() static]
    Signed-off-by: Andrew Shewmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Shewmaker
     

24 Feb, 2013

2 commits

  • Added slightly more detail to the Documentation of merge_across_nodes, a
    few comments in areas indicated by review, and renamed get_ksm_page()'s
    argument from "locked" to "lock_it". No functional change.

    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Petr Holasek
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Here's a KSM series, based on mmotm 2013-01-23-17-04: starting with
    Petr's v7 "KSM: numa awareness sysfs knob"; then fixing the two issues
    we had with that, fully enabling KSM page migration on the way.

    (A different kind of KSM/NUMA issue which I've certainly not begun to
    address here: when KSM pages are unmerged, there's usually no sense in
    preferring to allocate the new pages local to the caller's node.)

    This patch:

    Introduces new sysfs boolean knob /sys/kernel/mm/ksm/merge_across_nodes
    which control merging pages across different numa nodes. When it is set
    to zero only pages from the same node are merged, otherwise pages from
    all nodes can be merged together (default behavior).

    Typical use-case could be a lot of KVM guests on NUMA machine and cpus
    from more distant nodes would have significant increase of access
    latency to the merged ksm page. Sysfs knob was choosen for higher
    variability when some users still prefers higher amount of saved
    physical memory regardless of access latency.

    Every numa node has its own stable & unstable trees because of faster
    searching and inserting. Changing of merge_across_nodes value is
    possible only when there are not any ksm shared pages in system.

    I've tested this patch on numa machines with 2, 4 and 8 nodes and
    measured speed of memory access inside of KVM guests with memory pinned
    to one of nodes with this benchmark:

    http://pholasek.fedorapeople.org/alloc_pg.c

    Population standard deviations of access times in percentage of average
    were following:

    merge_across_nodes=1
    2 nodes 1.4%
    4 nodes 1.6%
    8 nodes 1.7%

    merge_across_nodes=0
    2 nodes 1%
    4 nodes 0.32%
    8 nodes 0.018%

    RFC: https://lkml.org/lkml/2011/11/30/91
    v1: https://lkml.org/lkml/2012/1/23/46
    v2: https://lkml.org/lkml/2012/6/29/105
    v3: https://lkml.org/lkml/2012/9/14/550
    v4: https://lkml.org/lkml/2012/9/23/137
    v5: https://lkml.org/lkml/2012/12/10/540
    v6: https://lkml.org/lkml/2012/12/23/154
    v7: https://lkml.org/lkml/2012/12/27/225

    Hugh notes that this patch brings two problems, whose solution needs
    further support in mm/ksm.c, which follows in subsequent patches:

    1) switching merge_across_nodes after running KSM is liable to oops
    on stale nodes still left over from the previous stable tree;

    2) memory hotremove may migrate KSM pages, but there is no provision
    here for !merge_across_nodes to migrate nodes to the proper tree.

    Signed-off-by: Petr Holasek
    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Cc: Gerald Schaefer
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Holasek
     

14 Dec, 2012

1 commit

  • Merge misc VM changes from Andrew Morton:
    "The rest of most-of-MM. The other MM bits await a slab merge.

    This patch includes the addition of a huge zero_page. Not a
    performance boost but it an save large amounts of physical memory in
    some situations.

    Also a bunch of Fujitsu engineers are working on memory hotplug.
    Which, as it turns out, was badly broken. About half of their patches
    are included here; the remainder are 3.8 material."

    However, this merge disables CONFIG_MOVABLE_NODE, which was totally
    broken. We don't add new features with "default y", nor do we add
    Kconfig questions that are incomprehensible to most people without any
    help text. Does the feature even make sense without compaction or
    memory hotplug?

    * akpm: (54 commits)
    mm/bootmem.c: remove unused wrapper function reserve_bootmem_generic()
    mm/memory.c: remove unused code from do_wp_page()
    asm-generic, mm: pgtable: consolidate zero page helpers
    mm/hugetlb.c: fix warning on freeing hwpoisoned hugepage
    hwpoison, hugetlbfs: fix RSS-counter warning
    hwpoison, hugetlbfs: fix "bad pmd" warning in unmapping hwpoisoned hugepage
    mm: protect against concurrent vma expansion
    memcg: do not check for mm in __mem_cgroup_count_vm_event
    tmpfs: support SEEK_DATA and SEEK_HOLE (reprise)
    mm: provide more accurate estimation of pages occupied by memmap
    fs/buffer.c: remove redundant initialization in alloc_page_buffers()
    fs/buffer.c: do not inline exported function
    writeback: fix a typo in comment
    mm: introduce new field "managed_pages" to struct zone
    mm, oom: remove statically defined arch functions of same name
    mm, oom: remove redundant sleep in pagefault oom handler
    mm, oom: cleanup pagefault oom handler
    memory_hotplug: allow online/offline memory to result movable node
    numa: add CONFIG_MOVABLE_NODE for movable-dedicated node
    mm, memcg: avoid unnecessary function call when memcg is disabled
    ...

    Linus Torvalds
     

13 Dec, 2012

3 commits

  • By default kernel tries to use huge zero page on read page fault. It's
    possible to disable huge zero page by writing 0 or enable it back by
    writing 1:

    echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/use_zero_page
    echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/use_zero_page

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: "H. Peter Anvin"
    Cc: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • hzp_alloc is incremented every time a huge zero page is successfully
    allocated. It includes allocations which where dropped due
    race with other allocation. Note, it doesn't count every map
    of the huge zero page, only its allocation.

    hzp_alloc_failed is incremented if kernel fails to allocate huge zero
    page and falls back to using small pages.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: "H. Peter Anvin"
    Cc: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Pass vma instead of mm and add address parameter.

    In most cases we already have vma on the stack. We provides
    split_huge_page_pmd_mm() for few cases when we have mm, but not vma.

    This change is preparation to huge zero pmd splitting implementation.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: "H. Peter Anvin"
    Cc: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

19 Nov, 2012

1 commit


09 Oct, 2012

2 commits

  • page_evictable(page, vma) is an irritant: almost all its callers pass
    NULL for vma. Remove the vma arg and use mlocked_vma_newpage(vma, page)
    explicitly in the couple of places it's needed. But in those places we
    don't even need page_evictable() itself! They're dealing with a freshly
    allocated anonymous page, which has no "mapping" and cannot be mlocked yet.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Michel Lespinasse
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • A long time ago, in v2.4, VM_RESERVED kept swapout process off VMA,
    currently it lost original meaning but still has some effects:

    | effect | alternative flags
    -+------------------------+---------------------------------------------
    1| account as reserved_vm | VM_IO
    2| skip in core dump | VM_IO, VM_DONTDUMP
    3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
    4| do not mlock | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP

    This patch removes reserved_vm counter from mm_struct. Seems like nobody
    cares about it, it does not exported into userspace directly, it only
    reduces total_vm showed in proc.

    Thus VM_RESERVED can be replaced with VM_IO or pair VM_DONTEXPAND | VM_DONTDUMP.

    remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
    remap_vmalloc_range() set VM_DONTEXPAND | VM_DONTDUMP.

    [akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

22 Aug, 2012

1 commit

  • Commit f0f57b2b1488 ("mm: move hugepage test examples to
    tools/testing/selftests/vm") moved map_hugetlb.c, hugepage-shm.c and
    hugepage-mmap.c tests into tools/testing/selftests/vm/ directory, but it
    didn't update hugetlbpage.txt

    Signed-off-by: Zhouping Liu
    Acked-by: Dave Young
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhouping Liu
     

23 Jul, 2012

1 commit


05 Jun, 2012

1 commit

  • Pull frontswap feature from Konrad Rzeszutek Wilk:
    "Frontswap provides a "transcendent memory" interface for swap pages.
    In some environments, dramatic performance savings may be obtained
    because swapped pages are saved in RAM (or a RAM-like device) instead
    of a swap disk. This tag provides the basic infrastructure along with
    some changes to the existing backends."

    Fix up trivial conflict in mm/Makefile due to removal of swap token code
    changing a line next to the new frontswap entry.

    This pull request came in before the merge window even opened, it got
    delayed to after the merge window by me just wanting to make sure it had
    actual users. Apparently IBM is using this on their embedded side, and
    Jan Beulich says that it's already made available for SLES and OpenSUSE
    users.

    Also acked by Rik van Riel, and Konrad points to other people liking it
    too. So in it goes.

    By Dan Magenheimer (4) and Konrad Rzeszutek Wilk (2)
    via Konrad Rzeszutek Wilk
    * tag 'stable/frontswap.v16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm:
    frontswap: s/put_page/store/g s/get_page/load
    MAINTAINER: Add myself for the frontswap API
    mm: frontswap: config and doc files
    mm: frontswap: core frontswap functionality
    mm: frontswap: core swap subsystem hooks and headers
    mm: frontswap: add frontswap header file

    Linus Torvalds
     

02 Jun, 2012

1 commit

  • Pull slab updates from Pekka Enberg:
    "Mainly a bunch of SLUB fixes from Joonsoo Kim"

    * 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
    slub: use __SetPageSlab function to set PG_slab flag
    slub: fix a memory leak in get_partial_node()
    slub: remove unused argument of init_kmem_cache_node()
    slub: fix a possible memory leak
    Documentations: Fix slabinfo.c directory in vm/slub.txt
    slub: fix incorrect return type of get_any_partial()

    Linus Torvalds
     

01 Jun, 2012

1 commit

  • This is an implementation of Andrew's proposal to extend the pagemap file
    bits to report what is missing about tasks' working set.

    The problem with the working set detection is multilateral. In the criu
    (checkpoint/restore) project we dump the tasks' memory into image files
    and to do it properly we need to detect which pages inside mappings are
    really in use. The mincore syscall I though could help with this did not.
    First, it doesn't report swapped pages, thus we cannot find out which
    parts of anonymous mappings to dump. Next, it does report pages from page
    cache as present even if they are not mapped, and it doesn't make that has
    not been cow-ed.

    Note, that issue with swap pages is critical -- we must dump swap pages to
    image file. But the issues with file pages are optimization -- we can
    take all file pages to image, this would be correct, but if we know that a
    page is not mapped or not cow-ed, we can remove them from dump file. The
    dump would still be self-consistent, though significantly smaller in size
    (up to 10 times smaller on real apps).

    Andrew noticed, that the proc pagemap file solved 2 of 3 above issues --
    it reports whether a page is present or swapped and it doesn't report not
    mapped page cache pages. But, it doesn't distinguish cow-ed file pages
    from not cow-ed.

    I would like to make the last unused bit in this file to report whether the
    page mapped into respective pte is PageAnon or not.

    [comment stolen from Pavel Emelyanov's v1 patch]

    Signed-off-by: Konstantin Khlebnikov
    Cc: Pavel Emelyanov
    Cc: Matt Mackall
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

30 May, 2012

1 commit


15 May, 2012

2 commits

  • Sounds so much more natural.

    Suggested-by: Andrea Arcangeli
    Signed-off-by: Konrad Rzeszutek Wilk

    Konrad Rzeszutek Wilk
     
  • This patch 4of4 adds configuration and documentation files including a FAQ.

    [v14: updated docs/FAQ to use zcache and RAMster as examples]
    [v10: no change]
    [v9: akpm@linux-foundation.org: sysfs->debugfs; no longer need Doc/ABI file]
    [v8: rebase to 3.0-rc4]
    [v7: rebase to 3.0-rc3]
    [v6: rebase to 3.0-rc1]
    [v5: change config default to n]
    [v4: rebase to 2.6.39]
    Signed-off-by: Dan Magenheimer
    Acked-by: Jan Beulich
    Acked-by: Seth Jennings
    Cc: Jeremy Fitzhardinge
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Nitin Gupta
    Cc: Matthew Wilcox
    Cc: Chris Mason
    Cc: Rik Riel
    Cc: Andrew Morton
    Signed-off-by: Konrad Rzeszutek Wilk

    Dan Magenheimer