16 Jan, 2016

11 commits

  • Pull md updates from Neil Brown:
    "Mostly clustered-raid1 and raid5 journal updates. one Y2038 fix and
    other minor stuff.

    One patch removes me from the MAINTAINERS file and adds a record of my
    md maintainership to Credits"

    Many thanks to Neil, who has been around for a _looong_ time.

    * tag 'md/4.5' of git://neil.brown.name/md: (26 commits)
    md/raid: only permit hot-add of compatible integrity profiles
    Remove myself as MD Maintainer, and add to Credits.
    raid5-cache: handle journal hotadd in quiesce
    MD: add journal with array suspended
    md: set MD_HAS_JOURNAL in correct places
    md: Remove 'ready' field from mddev.
    md: remove unnecesary md_new_event_inintr
    raid5: allow r5l_io_unit allocations to fail
    raid5-cache: use a mempool for the metadata block
    raid5-cache: use a bio_set
    raid5-cache: add journal hot add/remove support
    drivers: md: use ktime_get_real_seconds()
    md: avoid warning for 32-bit sector_t
    raid5-cache: free meta_page earlier
    raid5-cache: simplify r5l_move_io_unit_list
    md: update comment for md_allow_write
    md-cluster: update comments for MD_CLUSTER_SEND_LOCKED_ALREADY
    md-cluster: Protect communication with mutexes
    md-cluster: Defer MD reloading to mddev->thread
    md-cluster: update the documentation
    ...

    Linus Torvalds
     
  • Pull regulator updates from Mark Brown:
    "Aside from a fix for a spurious warning (which caused more problems
    than it fixed in the fixing really) this is all driver updates,
    including new drivers for Dialog PV88060/90 and TI LM363x and TPS65086
    devices. The qcom_smd driver has had PM8916 and PMA8084 support
    added"

    * tag 'regulator-v4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: (36 commits)
    regulator: core: remove some dead code
    regulator: core: use dev_to_rdev
    regulator: lp872x: Get rid of duplicate reference to DVS GPIO
    regulator: lp872x: Add missing of_match in regulators descriptions
    regulator: axp20x: Fix GPIO LDO enable value for AXP22x
    regulator: lp8788: constify regulator_ops structures
    regulator: wm8*: constify regulator_ops structures
    regulator: da9*: constify regulator_ops structures
    regulator: mt6311: Use REGCACHE_RBTREE
    regulator: tps65917/palmas: Add bypass ops for LDOs with bypass capability
    regulator: qcom-smd: Add support for PMA8084
    regulator: qcom-smd: Add PM8916 support
    soc: qcom: documentation: Update SMD/RPM Docs
    regulator: pv88090: logical vs bitwise AND typo
    regulator: pv88090: Fix irq leak
    regulator: pv88090: new regulator driver
    regulator: wm831x-ldo: Use platform_register/unregister_drivers()
    regulator: wm831x-dcdc: Use platform_register/unregister_drivers()
    regulator: lp8788-ldo: Use platform_register/unregister_drivers()
    regulator: core: Fix nested locking of supplies
    ...

    Linus Torvalds
     
  • Pull mailbox fixlet from Jussi Brar.

    * 'mailbox-for-next' of git://git.linaro.org/landing-teams/working/fujitsu/integration:
    mailbox: constify mbox_chan_ops structure

    Linus Torvalds
     
  • Pull UDF fixes and quota cleanups from Jan Kara:
    "Several UDF fixes and some minor quota cleanups"

    * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
    udf: Check output buffer length when converting name to CS0
    udf: Prevent buffer overrun with multi-byte characters
    quota: constify qtree_fmt_operations structures
    udf: avoid uninitialized variable use
    udf: Fix lost indirect extent block
    udf: Factor out code for creating indirect extent
    udf: limit the maximum number of indirect extents in a row
    udf: limit the maximum number of TD redirections
    fs: make quota/dquot.c explicitly non-modular
    fs: make quota/netlink.c explicitly non-modular

    Linus Torvalds
     
  • Merge first patch-bomb from Andrew Morton:

    - A few hotfixes which missed 4.4 because I was asleep. cc'ed to
    -stable

    - A few misc fixes

    - OCFS2 updates

    - Part of MM. Including pretty large changes to page-flags handling
    and to thp management which have been buffered up for 2-3 cycles now.

    I have a lot of MM material this time.

    [ It turns out the THP part wasn't quite ready, so that got dropped from
    this series - Linus ]

    * emailed patches from Andrew Morton: (117 commits)
    zsmalloc: reorganize struct size_class to pack 4 bytes hole
    mm/zbud.c: use list_last_entry() instead of list_tail_entry()
    zram/zcomp: do not zero out zcomp private pages
    zram: pass gfp from zcomp frontend to backend
    zram: try vmalloc() after kmalloc()
    zram/zcomp: use GFP_NOIO to allocate streams
    mm: add tracepoint for scanning pages
    drivers/base/memory.c: fix kernel warning during memory hotplug on ppc64
    mm/page_isolation: use macro to judge the alignment
    mm: fix noisy sparse warning in LIBCFS_ALLOC_PRE()
    mm: rework virtual memory accounting
    include/linux/memblock.h: fix ordering of 'flags' argument in comments
    mm: move lru_to_page to mm_inline.h
    Documentation/filesystems: describe the shared memory usage/accounting
    memory-hotplug: don't BUG() in register_memory_resource()
    hugetlb: make mm and fs code explicitly non-modular
    mm/swapfile.c: use list_for_each_entry_safe in free_swap_count_continuations
    mm: /proc/pid/clear_refs: no need to clear VM_SOFTDIRTY in clear_soft_dirty_pmd()
    mm: make sure isolate_lru_page() is never called for tail page
    vmstat: make vmstat_updater deferrable again and shut down on idle
    ...

    Linus Torvalds
     
  • Reorder the pages_per_zspage field in struct size_class, which
    eliminates the 4-byte hole between it and the stats field.

    Signed-off-by: Weijie Yang
    Reviewed-by: Sergey Senozhatsky
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Weijie Yang
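
The effect of the reordering described in the zsmalloc commit above can be sketched in user space with Python's ctypes, which applies the platform's C layout rules. The two structures below are hypothetical stand-ins, not the real struct size_class: the field types and the `holes` helper are assumptions, chosen so that a 4-byte int sits directly in front of an 8-byte-aligned member on a typical 64-bit ABI.

```python
import ctypes


class SizeClassBefore(ctypes.Structure):
    # Hypothetical layout: pages_per_zspage (4 bytes) is followed by an
    # 8-byte-aligned member, so the compiler has to insert padding.
    _fields_ = [
        ("index", ctypes.c_uint),
        ("size", ctypes.c_int),
        ("pages_per_zspage", ctypes.c_int),
        ("stats", ctypes.c_uint64 * 2),
        ("huge", ctypes.c_bool),
    ]


class SizeClassAfter(ctypes.Structure):
    # Same hypothetical fields, with pages_per_zspage moved behind stats.
    _fields_ = [
        ("index", ctypes.c_uint),
        ("size", ctypes.c_int),
        ("stats", ctypes.c_uint64 * 2),
        ("pages_per_zspage", ctypes.c_int),
        ("huge", ctypes.c_bool),
    ]


def holes(struct):
    """Return (field_name, gap_in_bytes) for every gap between adjacent fields."""
    gaps = []
    fields = struct._fields_
    for (name, ctype), (next_name, _) in zip(fields, fields[1:]):
        end = getattr(struct, name).offset + ctypes.sizeof(ctype)
        gap = getattr(struct, next_name).offset - end
        if gap:
            gaps.append((name, gap))
    return gaps


if __name__ == "__main__":
    for struct in (SizeClassBefore, SizeClassAfter):
        print(struct.__name__, holes(struct), ctypes.sizeof(struct))
```

Tools such as pahole report the same kind of gaps for real kernel structures; the sketch only shows why moving one field can remove a hole.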
     
  • list_last_entry() has been defined in list.h, so replace
    list_tail_entry() with it.

    Signed-off-by: Geliang Tang
    Cc: Seth Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geliang Tang
     
  • Do not __GFP_ZERO allocated zcomp ->private pages. We keep allocated
    streams around and use them for read/write requests, so we supply a
    zeroed out ->private to compression algorithm as a scratch buffer only
    once -- the first time we use that stream. For the rest of IO requests
    served by this stream ->private usually contains some temporary data
    from the previous requests.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Each zcomp backend uses its own gfp flags, but that is pointless because
    the context in which a backend is called is determined by the upper
    layer (i.e., the zcomp frontend), and the frontend can call backends in
    different contexts. In one context (the zram init part) allocation
    should be made as likely to succeed as possible; in another (allocating
    further streams to accelerate I/O) it is merely optional. So let's pass
    gfp down from the driver (i.e., the zcomp frontend), following the
    normal MM convention.

    [sergey.senozhatsky@gmail.com: add missing __vmalloc zero and highmem gfps]
    Signed-off-by: Minchan Kim
    Signed-off-by: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • When using LZ4 multi compression streams for zram swap, we saw page
    allocation failure messages while running system tests. This happened
    more than once (2 - 5 times per test), and some of the failures kept
    recurring on order-3 allocation attempts.

    To allocate the private data for parallel compression, we have to call
    kzalloc() with order 2/3 at runtime (lzo/lz4). If no order 2/3 memory
    is available at that time, the page allocation fails. This patch uses
    vmalloc() as a fallback for kmalloc(), which prevents the page
    allocation failure warning.

    With this change we no longer saw the warning in our tests, and
    process startup latency dropped by about 60-120ms in each case.

    For reference, a call trace:

    Binder_1: page allocation failure: order:3, mode:0x10c0d0
    CPU: 0 PID: 424 Comm: Binder_1 Tainted: GW 3.10.49-perf-g991d02b-dirty #20
    Call trace:
    dump_backtrace+0x0/0x270
    show_stack+0x10/0x1c
    dump_stack+0x1c/0x28
    warn_alloc_failed+0xfc/0x11c
    __alloc_pages_nodemask+0x724/0x7f0
    __get_free_pages+0x14/0x5c
    kmalloc_order_trace+0x38/0xd8
    zcomp_lz4_create+0x2c/0x38
    zcomp_strm_alloc+0x34/0x78
    zcomp_strm_multi_find+0x124/0x1ec
    zcomp_strm_find+0xc/0x18
    zram_bvec_rw+0x2fc/0x780
    zram_make_request+0x25c/0x2d4
    generic_make_request+0x80/0xbc
    submit_bio+0xa4/0x15c
    __swap_writepage+0x218/0x230
    swap_writepage+0x3c/0x4c
    shrink_page_list+0x51c/0x8d0
    shrink_inactive_list+0x3f8/0x60c
    shrink_lruvec+0x33c/0x4cc
    shrink_zone+0x3c/0x100
    try_to_free_pages+0x2b8/0x54c
    __alloc_pages_nodemask+0x514/0x7f0
    __get_free_pages+0x14/0x5c
    proc_info_read+0x50/0xe4
    vfs_read+0xa0/0x12c
    SyS_read+0x44/0x74
    DMA: 3397*4kB (MC) 26*8kB (RC) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB
    0*512kB 0*1024kB 0*2048kB 0*4096kB = 13796kB

    [minchan@kernel.org: change vmalloc gfp and adding comment about gfp]
    [sergey.senozhatsky@gmail.com: tweak comments and styles]
    Signed-off-by: Kyeongdon Kim
    Signed-off-by: Minchan Kim
    Acked-by: Sergey Senozhatsky
    Sergey Senozhatsky
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kyeongdon Kim
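
The free-list line at the end of the trace above shows why an order-3 request can fail while plenty of memory is free. As a rough user-space model (not kernel code), the sketch below parses that line and compares a contiguous, kmalloc-like request with a page-by-page, vmalloc-like one. The function names and the 4 kB page size are assumptions made for illustration.

```python
import re

# Free-list summary copied from the trace above.
FREE_LINE = (
    "3397*4kB (MC) 26*8kB (RC) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB "
    "0*512kB 0*1024kB 0*2048kB 0*4096kB"
)


def parse_free_blocks(line, page_kb=4):
    """Return {order: free_block_count} from 'count*sizekB' tokens."""
    blocks = {}
    for count, size_kb in re.findall(r"(\d+)\*(\d+)kB", line):
        order = (int(size_kb) // page_kb).bit_length() - 1
        blocks[order] = int(count)
    return blocks


def total_free_kb(blocks, page_kb=4):
    """Total free memory across all orders, in kB."""
    return sum(count * (page_kb << order) for order, count in blocks.items())


def can_alloc_contiguous(blocks, order):
    """kmalloc-like: needs one free block of at least the requested order."""
    return any(count > 0 for o, count in blocks.items() if o >= order)


def can_alloc_pagewise(blocks, order):
    """vmalloc-like: needs 2**order free pages, not necessarily adjacent."""
    free_pages = sum(count << o for o, count in blocks.items())
    return free_pages >= (1 << order)


blocks = parse_free_blocks(FREE_LINE)

if __name__ == "__main__":
    print("free kB:", total_free_kb(blocks))
    print("order-3 contiguous:", can_alloc_contiguous(blocks, 3))
    print("order-3 page-by-page:", can_alloc_pagewise(blocks, 3))
```

In this model every free block is order 0 or 1, so no single 32 kB block exists even though the total matches the 13796kB reported in the trace. That is the situation in which falling back to a non-contiguous allocation helps.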
     
  • We can end up allocating a new compression stream with GFP_KERNEL from
    within the IO path, which may result in nested (recursive) IO
    operations. That can introduce problems if the IO path in question is a
    reclaimer, holding some locks that will deadlock nested IOs.

    Allocate streams and working memory using GFP_NOIO flag, forbidding
    recursive IO and FS operations.

    An example:

    inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
    git/20158 [HC0[0]:SC0[0]:HE1:SE1] takes:
    (jbd2_handle){+.+.?.}, at: start_this_handle+0x4ca/0x555
    {IN-RECLAIM_FS-W} state was registered at:
    __lock_acquire+0x8da/0x117b
    lock_acquire+0x10c/0x1a7
    start_this_handle+0x52d/0x555
    jbd2__journal_start+0xb4/0x237
    __ext4_journal_start_sb+0x108/0x17e
    ext4_dirty_inode+0x32/0x61
    __mark_inode_dirty+0x16b/0x60c
    iput+0x11e/0x274
    __dentry_kill+0x148/0x1b8
    shrink_dentry_list+0x274/0x44a
    prune_dcache_sb+0x4a/0x55
    super_cache_scan+0xfc/0x176
    shrink_slab.part.14.constprop.25+0x2a2/0x4d3
    shrink_zone+0x74/0x140
    kswapd+0x6b7/0x930
    kthread+0x107/0x10f
    ret_from_fork+0x3f/0x70
    irq event stamp: 138297
    hardirqs last enabled at (138297): debug_check_no_locks_freed+0x113/0x12f
    hardirqs last disabled at (138296): debug_check_no_locks_freed+0x33/0x12f
    softirqs last enabled at (137818): __do_softirq+0x2d3/0x3e9
    softirqs last disabled at (137813): irq_exit+0x41/0x95

    other info that might help us debug this:
    Possible unsafe locking scenario:
    CPU0
    ----
    lock(jbd2_handle);

    lock(jbd2_handle);

    *** DEADLOCK ***
    5 locks held by git/20158:
    #0: (sb_writers#7){.+.+.+}, at: mnt_want_write+0x24/0x4b
    #1: (&type->i_mutex_dir_key#2/1){+.+.+.}, at: lock_rename+0xd9/0xe3
    #2: (&sb->s_type->i_mutex_key#11){+.+.+.}, at: lock_two_nondirectories+0x3f/0x6b
    #3: (&sb->s_type->i_mutex_key#11/4){+.+.+.}, at: lock_two_nondirectories+0x66/0x6b
    #4: (jbd2_handle){+.+.?.}, at: start_this_handle+0x4ca/0x555

    stack backtrace:
    CPU: 2 PID: 20158 Comm: git Not tainted 4.1.0-rc7-next-20150615-dbg-00016-g8bdf555-dirty #211
    Call Trace:
    dump_stack+0x4c/0x6e
    mark_lock+0x384/0x56d
    mark_held_locks+0x5f/0x76
    lockdep_trace_alloc+0xb2/0xb5
    kmem_cache_alloc_trace+0x32/0x1e2
    zcomp_strm_alloc+0x25/0x73 [zram]
    zcomp_strm_multi_find+0xe7/0x173 [zram]
    zcomp_strm_find+0xc/0xe [zram]
    zram_bvec_rw+0x2ca/0x7e0 [zram]
    zram_make_request+0x1fa/0x301 [zram]
    generic_make_request+0x9c/0xdb
    submit_bio+0xf7/0x120
    ext4_io_submit+0x2e/0x43
    ext4_bio_write_page+0x1b7/0x300
    mpage_submit_page+0x60/0x77
    mpage_map_and_submit_buffers+0x10f/0x21d
    ext4_writepages+0xc8c/0xe1b
    do_writepages+0x23/0x2c
    __filemap_fdatawrite_range+0x84/0x8b
    filemap_flush+0x1c/0x1e
    ext4_alloc_da_blocks+0xb8/0x117
    ext4_rename+0x132/0x6dc
    ? mark_held_locks+0x5f/0x76
    ext4_rename2+0x29/0x2b
    vfs_rename+0x540/0x636
    SyS_renameat2+0x359/0x44d
    SyS_rename+0x1e/0x20
    entry_SYSCALL_64_fastpath+0x12/0x6f

    [minchan@kernel.org: add stable mark]
    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Kyeongdon Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     

15 Jan, 2016

29 commits

  • Pull trivial tree updates from Jiri Kosina.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
    floppy: make local variable non-static
    exynos: fixes an incorrect header guard
    dt-bindings: fixes some incorrect header guards
    cpufreq-dt: correct dead link in documentation
    cpufreq: ARM big LITTLE: correct dead link in documentation
    treewide: Fix typos in printk
    Documentation: filesystem: Fix typo in fs/eventfd.c
    fs/super.c: use && instead of & for warn_on condition
    Documentation: fix sysfs-ptp
    lib: scatterlist: fix Kconfig description

    Linus Torvalds
     
  • Pull livepatching updates from Jiri Kosina:

    - RO/NX attribute fixes for patch module relocations from Josh
    Poimboeuf. As part of this effort, module.c has been cleaned up as
    well and livepatching is piggy-backing on this cleanup. Rusty is OK
    with this whole lot going through livepatching tree.

    - symbol disambiguation support from Chris J Arges. That series is
    also

    Reviewed-by: Miroslav Benes

    but this came in only after I've already pushed out. Didn't want to
    rebase because of that, hence I am mentioning it here.

    - symbol lookup fix from Miroslav Benes

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
    livepatch: Cleanup module page permission changes
    module: keep percpu symbols in module's symtab
    module: clean up RO/NX handling.
    module: use a structure to encapsulate layout.
    gcov: use within_module() helper.
    module: Use the same logic for setting and unsetting RO/NX
    livepatch: function,sympos scheme in livepatch sysfs directory
    livepatch: add sympos as disambiguator field to klp_reloc
    livepatch: add old_sympos as disambiguator field to klp_func

    Linus Torvalds
     
  • Pull HID updates from Jiri Kosina:

    - appoint Benjamin Tissoires as co-maintainer / designated reviewer

    - sysfs report_descriptor visibility fix for unclaimed devices, from
    Andy Lutomirski

    - suspend/resume fixes for Sony driver from Frank Praznik

    - IRQ deadlock fix from Ioan-Adrian Ratiu

    - hid-i2c fixes affecting (at least) Yoga 900 from Mika Westerberg and
    Srinivas Pandruvada

    - a lot of new device support (especially, but not limited to, Wacom)
    and assorted small misc fixes

    - almost complete G920 support; the only bit that is missing is
    switching the device to HID mode automatically; Simon Wood and Michal
    Maly are working on it.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: (46 commits)
    Revert "INPUT: xpad: switch Logitech G920 Wheel into HID mode"
    HID: sensor-hub: Add quirk for Lenovo Yoga 900 with ITE Chips
    HID: Add new PID for Microchip Pick16F1454
    HID: wacom: Use correct report to query pen ID from INTUOSHT2 devices
    HID: i2c-hid: Prevent sending reports from racing with device reset
    HID: use kobj_to_dev()
    HID: wiimote: use dev_to_wii()
    HID: add a new helper to_hid_driver()
    HID: use to_hid_device()
    HID: move to_hid_device() to hid.h
    HID: usbhid: use to_usb_device
    HID: corsair: Convert to use module_hid_driver
    HID: input: ignore the battery in OKLICK Laser BTmouse
    HID: wacom: Fix pad button range for CINTIQ_COMPANION_2
    HID: wacom: Fix touchring value reporting
    HID: wacom: Report 'strip2' values in ABS_RY
    HID: wacom: Limit touchstrip data to 13 bits
    HID: wacom: bitwise vs logical ORs
    HID: wacom: Apply lowres quirk to BAMBOO_TOUCH devices
    HID: enable hid device to suspend/resume asynchronously
    ...

    Linus Torvalds
     
  • Pull NFS client updates from Trond Myklebust:
    "Highlights include:

    Stable fixes:
    - Fix a regression in the SunRPC socket polling code
    - Fix the attribute cache revalidation code
    - Fix race in __update_open_stateid()
    - Fix an lo->plh_block_lgets imbalance in layoutreturn
    - Fix an Oopsable typo in ff_mirror_match_fh()

    Features:
    - pNFS layout recall performance improvements.
    - pNFS/flexfiles: Support server-supplied layoutstats sampling period

    Bugfixes + cleanups:
    - NFSv4: Don't perform cached access checks before we've OPENed the
    file
    - Fix starvation issues with background flushes
    - Reclaim writes should be flushed as unstable writes if there are
    already entries in the commit lists
    - Various bugfixes from Chuck to fix NFS/RDMA send queue ordering
    problems
    - Ensure that we propagate fatal layoutget errors back to the
    application
    - Fixes for sundry flexfiles layoutstats bugs
    - Fix files/flexfiles to not cache invalidated layouts in the DS
    commit buckets"

    * tag 'nfs-for-4.5-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (68 commits)
    NFS: Fix a compile warning about unused variable in nfs_generic_pg_pgios()
    NFSv4: Fix a compile warning about no prototype for nfs4_ioctl()
    NFS: Use wait_on_atomic_t() for unlock after readahead
    SUNRPC: Fixup socket wait for memory
    NFSv4.1/pNFS: Cleanup constify struct pnfs_layout_range arguments
    NFSv4.1/pnfs: Cleanup copying of pnfs_layout_range structures
    NFSv4.1/pNFS: Cleanup pnfs_mark_matching_lsegs_invalid()
    NFSv4.1/pNFS: Fix a race in initiate_file_draining()
    NFSv4.1/pNFS: pnfs_error_mark_layout_for_return() must always return layout
    NFSv4.1/pNFS: pnfs_mark_matching_lsegs_return() should set the iomode
    NFSv4.1/pNFS: Use nfs4_stateid_copy for copying stateids
    NFSv4.1/pNFS: Don't pass stateids by value to pnfs_send_layoutreturn()
    NFS: Relax requirements in nfs_flush_incompatible
    NFSv4.1/pNFS: Don't queue up a new commit if the layout segment is invalid
    NFS: Allow multiple commit requests in flight per file
    NFS/pNFS: Fix up pNFS write reschedule layering violations and bugs
    SUNRPC: Fix a missing break in rpc_anyaddr()
    pNFS/flexfiles: Fix an Oopsable typo in ff_mirror_match_fh()
    NFS: Fix attribute cache revalidation
    NFS: Ensure we revalidate attributes before using execute_ok()
    ...

    Linus Torvalds
     
  • Pull vfs fix from Al Viro:
    "Don't put symlink bodies in pagecache into highmem"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    Make sure that highmem pages are not added to symlink page cache

    Linus Torvalds
     
  • This patch series performs swapin readahead, up to a certain limit, to
    improve THP performance, and adds tracepoints for khugepaged_scan_pmd,
    collapse_huge_page and __collapse_huge_page_isolate.

    This patch series was written to deal with programs that access most,
    but not all, of their memory after they get swapped out. Currently
    these programs do not get their memory collapsed into THPs after the
    system swapped their memory out, while they would get THPs before
    swapping happened.

    This patch series was tested with a test program that allocates 400MB
    of memory, writes to it, and then sleeps. I then force the system to
    swap out all of it. Afterwards, the test program touches the area by
    writing, leaving a piece of it unwritten. This shows how much swapin
    readahead the patch performs.

    Test results:

    After swapped out
    -------------------------------------------------------------------
                  | Anonymous | AnonHugePages | Swap      | Fraction |
    -------------------------------------------------------------------
    With patch    | 90076 kB  | 88064 kB      | 309928 kB | %99      |
    -------------------------------------------------------------------
    Without patch | 194068 kB | 192512 kB     | 205936 kB | %99      |
    -------------------------------------------------------------------

    After swapped in
    -------------------------------------------------------------------
                  | Anonymous | AnonHugePages | Swap      | Fraction |
    -------------------------------------------------------------------
    With patch    | 201408 kB | 198656 kB     | 198596 kB | %98      |
    -------------------------------------------------------------------
    Without patch | 292624 kB | 192512 kB     | 107380 kB | %65      |
    -------------------------------------------------------------------

    This patch (of 3):

    Static tracepoints record data from the traced functions, which makes
    it possible to automate debugging without making many changes to the
    source code.

    This patch adds tracepoints for khugepaged_scan_pmd, collapse_huge_page
    and __collapse_huge_page_isolate.

    [dan.carpenter@oracle.com: add a missing tab]
    Signed-off-by: Ebru Akagunduz
    Acked-by: Kirill A. Shutemov
    Acked-by: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: Joonsoo Kim
    Cc: Xie XiuQi
    Cc: Cyrill Gorcunov
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Cc: Aneesh Kumar K.V
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ebru Akagunduz
     
  • Fix a bug where a kernel warning is triggered when performing a memory
    hotplug on ppc64. This warning may also occur on any architecture that
    uses the memory_probe_store interface.

    WARNING: at drivers/base/memory.c:200
    CPU: 9 PID: 13042 Comm: systemd-udevd Not tainted 4.4.0-rc4-00113-g0bd0f1e-dirty #7
    NIP [c00000000055e034] pages_correctly_reserved+0x134/0x1b0
    LR [c00000000055e7f8] memory_subsys_online+0x68/0x140
    Call Trace:
    memory_subsys_online+0x68/0x140
    device_online+0xb4/0x120
    store_mem_state+0xb0/0x180
    dev_attr_store+0x34/0x60
    sysfs_kf_write+0x64/0xa0
    kernfs_fop_write+0x17c/0x1e0
    __vfs_write+0x40/0x160
    vfs_write+0xb8/0x200
    SyS_write+0x60/0x110
    system_call+0x38/0xd0

    The warning is triggered because there is a udev rule that automatically
    tries to online memory after it has been added. The udev rule varies
    from distro to distro, but will generally look something like:

    SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"

    On any architecture that uses memory_probe_store to reserve memory, the
    udev rule will be triggered after the first section of the block is
    reserved and will subsequently attempt to online the entire block,
    interrupting the memory reservation process and causing the warning.
    This patch modifies memory_probe_store to add a block of memory with a
    single call to add_memory as opposed to looping through and adding each
    section individually. A single call to add_memory is protected by the
    mem_hotplug mutex which will prevent the udev rule from onlining memory
    until the reservation of the entire block is complete.

    Signed-off-by: John Allen
    Acked-by: Dave Hansen
    Cc: Nathan Fontenot
    Cc: Michael Ellerman
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Allen
     
  • Signed-off-by: Wang Xiaoqiang
    Reviewed-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Running sparse on drivers/staging/lustre results in dozens of warnings:
    include/linux/gfp.h:281:41: warning: odd constant _Bool cast (400000
    becomes 1)

    Use "!!" to explicitly convert to bool and get rid of the warning.

    Signed-off-by: Joshua Clayton
    Cc: Mel Gorman
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joshua Clayton
     
  • While inspecting some unclear code in the prctl(PR_SET_MM_MEM) call
    (which tests the RLIMIT_DATA value to figure out whether we're allowed
    to assign new @start_brk, @brk, @start_data, @end_data from mm_struct),
    it was noticed that RLIMIT_DATA, in the form it's implemented now,
    doesn't do anything useful because most user-space libraries use the
    mmap() syscall for dynamic memory allocations.

    Linus suggested converting the RLIMIT_DATA rlimit into something
    suitable for anonymous memory accounting. But in this patch we go
    further, and the changes are bundled together as:

    * keep vma counting if CONFIG_PROC_FS=n, will be used for limits
    * replace mm->shared_vm with better defined mm->data_vm
    * account anonymous executable areas as executable
    * account file-backed growsdown/up areas as stack
    * drop struct file* argument from vm_stat_account
    * enforce RLIMIT_DATA for size of data areas

    This way code looks cleaner: now code/stack/data classification depends
    only on vm_flags state:

    VM_EXEC & ~VM_WRITE -> code (VmExe + VmLib in proc)
    VM_GROWSUP | VM_GROWSDOWN -> stack (VmStk)
    VM_WRITE & ~VM_SHARED & !stack -> data (VmData)

    The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called
    "shared", but that might be a strange beast like a readonly-private or
    VM_IO area.

    - RLIMIT_AS limits whole address space "VmSize"
    - RLIMIT_STACK limits stack "VmStk" (but each vma individually)
    - RLIMIT_DATA now limits "VmData"

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Cyrill Gorcunov
    Cc: Quentin Casasnovas
    Cc: Vegard Nossum
    Acked-by: Linus Torvalds
    Cc: Willy Tarreau
    Cc: Andy Lutomirski
    Cc: Kees Cook
    Cc: Vladimir Davydov
    Cc: Pavel Emelyanov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
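
The vm_flags rules listed in the commit above can be sketched as a small classifier. This is a user-space illustration of the three rules as written there, not the kernel's implementation: the flag bit values are placeholders chosen for the example, and `classify_vma` is a hypothetical name.

```python
# Placeholder bit values for illustration; the kernel defines the real ones.
VM_WRITE = 0x02
VM_EXEC = 0x04
VM_SHARED = 0x08
VM_GROWSDOWN = 0x100
VM_GROWSUP = 0x200


def classify_vma(flags):
    """Classify a mapping as code, stack, data or other from its flags."""
    # VM_EXEC & ~VM_WRITE -> code (VmExe + VmLib)
    if (flags & VM_EXEC) and not (flags & VM_WRITE):
        return "code"
    # VM_GROWSUP | VM_GROWSDOWN -> stack (VmStk)
    if flags & (VM_GROWSUP | VM_GROWSDOWN):
        return "stack"
    # VM_WRITE & ~VM_SHARED & !stack -> data (VmData)
    if (flags & VM_WRITE) and not (flags & VM_SHARED):
        return "data"
    # Everything else, e.g. readonly-private or shared writable areas.
    return "other"


if __name__ == "__main__":
    print(classify_vma(VM_EXEC))
    print(classify_vma(VM_WRITE | VM_GROWSDOWN))
    print(classify_vma(VM_WRITE))
    print(classify_vma(VM_WRITE | VM_SHARED))
```

Under these rules a private mapping that is both writable and executable counts as data, which is the category RLIMIT_DATA now limits.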
     
  • for_each_free_mem_range() and for_each_free_mem_range_reverse() both
    accept a 'flags' argument. The comment surrounding the macro placed the
    'flags' documentation at the very end, while 'flags' is in fact the 3rd
    argument to the macro, so let's preserve the natural ordering here.

    Fixes: fc6daaf931518 ("mm/memblock: add extra "flags" to memblock to allow selection of memory based on attribute")
    Signed-off-by: Florian Fainelli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Florian Fainelli
     
  • Move lru_to_page() from internal.h to mm_inline.h.

    Signed-off-by: Geliang Tang
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geliang Tang
     
  • Shared memory accounting support has been present in the kernel since
    commit 4b02108ac1b3 ("mm: oom analysis: add shmem vmstat") and in
    userland free(1) since 2014. This patch updates the documentation to
    reflect this change.

    Signed-off-by: Rodrigo Freire
    Acked-by: Vlastimil Babka
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rodrigo Freire
     
  • An out-of-memory condition is not a bug, and while we can't add new
    memory in such a case, crashing the system seems wrong. Propagating the
    return value from register_memory_resource() requires an interface
    change.

    Signed-off-by: Vitaly Kuznetsov
    Reviewed-by: Igor Mammedov
    Acked-by: David Rientjes
    Cc: Tang Chen
    Cc: Naoya Horiguchi
    Cc: Xishi Qiu
    Cc: Sheng Yong
    Cc: Zhu Guihua
    Cc: Dan Williams
    Cc: David Vrabel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Kuznetsov
     
  • The Kconfig currently controlling compilation of this code is:

    config HUGETLBFS
    bool "HugeTLB file system support"

    ...meaning that it currently is not being built as a module by anyone.

    Let's remove the modular code that is essentially orphaned, so that when
    reading the driver there is no doubt it is builtin-only.

    Since module_init translates to device_initcall in the non-modular case,
    the init ordering gets moved to earlier levels when we use the more
    appropriate initcalls here.

    Originally I had the fs part and the mm part as separate commits, just
    by happenstance of the nature of how I detected these non-modular use
    cases. But that can possibly introduce regressions if the patch merge
    ordering puts the fs part 1st -- as the 0-day testing reported a splat
    at mount time.

    Investigating with "initcall_debug" showed that the delta was
    init_hugetlbfs_fs being called _before_ hugetlb_init instead of after. So
    both the fs change and the mm change are here together.

    In addition, it worked before due to luck of link order, since they were
    both in the same initcall category. So we now have the fs part using
    fs_initcall, and the mm part using subsys_initcall, which puts it one
    bucket earlier. It now passes the basic sanity test that failed in
    earlier 0-day testing.

    We delete the MODULE_LICENSE tag and capture that information at the top
    of the file alongside author comments, etc.

    We don't replace module.h with init.h since the file already has that.
    Also note that MODULE_ALIAS is a no-op for non-modular code.

    Signed-off-by: Paul Gortmaker
    Reported-by: kernel test robot
    Cc: Nadia Yvette Chambers
    Cc: Alexander Viro
    Cc: Naoya Horiguchi
    Reviewed-by: Mike Kravetz
    Cc: David Rientjes
    Cc: Hillf Danton
    Acked-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Gortmaker
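
The ordering problem described in the hugetlb commit above can be modelled in a few lines. The sketch assumes that initcalls run level by level (subsys before fs before device) and that calls within one level run in link order. `boot_order` and the registration lists are hypothetical and exist only for this illustration.

```python
# Relative initcall levels: lower levels run earlier during boot.
INITCALL_LEVELS = {
    "subsys_initcall": 4,
    "fs_initcall": 5,
    "device_initcall": 6,  # what module_init becomes for built-in code
}


def boot_order(registrations):
    """Order (function, initcall_kind) pairs, given in link order.

    sorted() is stable, so functions registered at the same level keep
    their link order, mirroring the "luck of link order" in the commit.
    """
    ordered = sorted(registrations, key=lambda reg: INITCALL_LEVELS[reg[1]])
    return [name for name, _kind in ordered]


# Same level: the outcome depends purely on which object is linked first.
same_level = [
    ("init_hugetlbfs_fs", "device_initcall"),
    ("hugetlb_init", "device_initcall"),
]

# Distinct levels: hugetlb_init runs first regardless of link order.
distinct_levels = [
    ("init_hugetlbfs_fs", "fs_initcall"),
    ("hugetlb_init", "subsys_initcall"),
]

if __name__ == "__main__":
    print(boot_order(same_level))
    print(boot_order(distinct_levels))
```

Giving the mm part an earlier level than the fs part makes the dependency explicit instead of relying on link order.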
     
  • Use list_for_each_entry_safe() instead of list_for_each_safe() to
    simplify the code.

    Signed-off-by: Geliang Tang
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geliang Tang
     
  • clear_soft_dirty_pmd() is called by clear_refs_write(CLEAR_REFS_SOFT_DIRTY);
    VM_SOFTDIRTY was already cleared before walk_page_range().

    Signed-off-by: Oleg Nesterov
    Acked-by: Kirill A. Shutemov
    Acked-by: Cyrill Gorcunov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The VM_BUG_ON_PAGE() would catch such cases if any still exist.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Currently the vmstat updater is not deferrable as a result of commit
    ba4877b9ca51 ("vmstat: do not use deferrable delayed work for
    vmstat_update"). This in turn can cause multiple interruptions of the
    applications because the vmstat updater may run at

    Make vmstat_update deferrable again and provide a function that folds
    the differentials when the processor is going to idle mode, thus
    addressing the issue of the above commit in a clean way.

    Note that the shepherd thread will continue scanning the differentials
    from another processor and will reenable the vmstat workers if it
    detects any changes.

    Fixes: ba4877b9ca51 ("vmstat: do not use deferrable delayed work for vmstat_update")
    Signed-off-by: Christoph Lameter
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
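
The fold-on-idle idea described above can be pictured with a small user-space model. This is only a sketch, not the kernel implementation: the class `VmstatModel` and its methods `account`, `fold_cpu`, `enter_idle`, and `shepherd_scan` are hypothetical names chosen for illustration, and a single counter stands in for the full set of vmstat items.

```python
class VmstatModel:
    """Toy model: per-CPU differentials folded into a global counter."""

    def __init__(self, nr_cpus):
        self.global_count = 0
        self.diff = [0] * nr_cpus              # pending per-CPU differentials
        self.worker_enabled = [False] * nr_cpus

    def account(self, cpu, delta):
        # Cheap per-CPU update; the global counter is not touched yet.
        self.diff[cpu] += delta

    def fold_cpu(self, cpu):
        # Fold this CPU's differential into the global counter.
        self.global_count += self.diff[cpu]
        self.diff[cpu] = 0

    def enter_idle(self, cpu):
        # Fold before idling so a deferred updater leaves nothing stale behind.
        self.fold_cpu(cpu)
        self.worker_enabled[cpu] = False

    def shepherd_scan(self):
        # Re-enable the worker on any CPU that has accumulated changes.
        for cpu, pending in enumerate(self.diff):
            if pending != 0:
                self.worker_enabled[cpu] = True


model = VmstatModel(nr_cpus=2)
model.account(0, 5)
model.enter_idle(0)      # CPU 0 folds its differential before idling
model.account(1, 3)
model.shepherd_scan()    # CPU 1 has pending changes, so its worker is re-enabled
```

In this model the idle CPU contributes its pending value to the global counter without needing a timer wakeup, while the scan step plays the role the message assigns to the shepherd thread.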
     
  • A CONFIG_MEMCG=y kernel booted with "cgroup_disable=memory" crashes on a
    NULL memcg (but non-NULL root_mem_cgroup) when vmpressure kicks in.
    Here's the patch I use to avoid that, but you might prefer a test on
    mem_cgroup_disabled() somewhere.

    Signed-off-by: Hugh Dickins
    Acked-by: Johannes Weiner
    Cc: David S. Miller
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • According to the direct use of struct static_key is
    deprecated. Update the socket and slab accounting code accordingly.

    Signed-off-by: Johannes Weiner
    Acked-by: David S. Miller
    Reported-by: Jason Baron
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Let the networking stack know when a memcg is under reclaim pressure so
    that it can clamp its transmit windows accordingly.

    Whenever the reclaim efficiency of a cgroup's LRU lists drops low enough
    for a MEDIUM or HIGH vmpressure event to occur, assert a pressure state
    in the socket and tcp memory code that tells it to curb consumption
    growth from sockets associated with said control group.

    Traditionally, vmpressure reports for the entire subtree of a memcg
    under pressure, which drops useful information on the individual groups
    reclaimed. However, it's too late to change the user interface, so add a
    second reporting mode that reports on the level of reclaim instead of at
    the level of pressure, and use that report for sockets.

    vmpressure events are naturally edge triggered, so for hysteresis assert
    socket pressure for a second to allow for subsequent vmpressure events
    to occur before letting the socket code return to normal.

    This will likely need fine-tuning for a wider variety of workloads, but
    for now stick to the vmpressure presets and keep hysteresis simple.

    Signed-off-by: Johannes Weiner
    Acked-by: David S. Miller
    Reviewed-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
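
The one-second hysteresis described above can be sketched as a deadline that each qualifying event pushes forward. This is a minimal illustration under assumptions: the class `SocketPressure`, its method names, the string level names, and the explicit `now` argument are all hypothetical and do not reflect the kernel's actual data structures.

```python
HYSTERESIS_SECONDS = 1.0  # "assert socket pressure for a second"


class SocketPressure:
    def __init__(self):
        # Time up to which pressure stays asserted (hypothetical field).
        self.pressure_until = 0.0

    def on_vmpressure_event(self, level, now):
        # Only MEDIUM or HIGH events assert socket pressure.
        if level in ("medium", "high"):
            self.pressure_until = now + HYSTERESIS_SECONDS

    def under_pressure(self, now):
        # Pressure holds until the deadline set by the latest event passes.
        return now < self.pressure_until


sp = SocketPressure()
sp.on_vmpressure_event("low", now=10.0)
low_result = sp.under_pressure(now=10.0)    # a LOW event asserts nothing
sp.on_vmpressure_event("medium", now=10.0)
during = sp.under_pressure(now=10.5)        # inside the one-second window
after = sp.under_pressure(now=11.5)         # window expired, back to normal
```

Because the events are edge triggered, a later event simply moves the deadline forward, and the socket side returns to normal on its own once no further events arrive.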
     
  • Socket memory can be a significant share of overall memory consumed by
    common workloads. In order to provide reasonable resource isolation in
    the unified hierarchy, this type of memory needs to be included in the
    tracking/accounting of a cgroup under active memory resource control.

    Overhead is only incurred when a non-root control group is created AND
    the memory controller is instructed to track and account the memory
    footprint of that group. cgroup.memory=nosocket can be specified on the
    boot command line to override any runtime configuration and forcibly
    exclude socket memory from active memory resource control.

    Signed-off-by: Johannes Weiner
    Acked-by: David S. Miller
    Reviewed-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The unified hierarchy memory controller will account socket memory.
    Move the infrastructure functions accordingly.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Reviewed-by: Vladimir Davydov
    Acked-by: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The unified hierarchy memory controller doesn't expose the memory+swap
    counter to userspace, but its accounting is hardcoded in all charge
    paths right now, including the per-cpu charge cache ("the stock").

    To avoid adding yet more pointless memory+swap accounting with the
    socket memory support in unified hierarchy, disable the counter
    altogether when in unified hierarchy mode.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Reviewed-by: Vladimir Davydov
    Acked-by: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
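
As a rough picture of what disabling the counter means for a charge path, the sketch below skips the memory+swap counter whenever unified hierarchy mode is in effect. The `Counter` class, the `charge` function, the `unified_hierarchy` flag, and the order in which the two counters are charged are assumptions made for illustration, not the kernel's code.

```python
class Counter:
    """Hypothetical stand-in for a limit-checked page counter."""

    def __init__(self, limit):
        self.usage = 0
        self.limit = limit

    def try_charge(self, nr_pages):
        if self.usage + nr_pages > self.limit:
            return False
        self.usage += nr_pages
        return True

    def uncharge(self, nr_pages):
        self.usage -= nr_pages


def charge(memory, memsw, nr_pages, unified_hierarchy):
    # Legacy mode also charges memory+swap; unified mode skips it entirely.
    if not unified_hierarchy and not memsw.try_charge(nr_pages):
        return False
    if not memory.try_charge(nr_pages):
        if not unified_hierarchy:
            memsw.uncharge(nr_pages)  # roll back the earlier charge
        return False
    return True


unified_mem, unified_memsw = Counter(100), Counter(100)
charge(unified_mem, unified_memsw, 10, unified_hierarchy=True)

legacy_mem, legacy_memsw = Counter(100), Counter(100)
charge(legacy_mem, legacy_memsw, 10, unified_hierarchy=False)
```

In the unified case only the memory counter moves; the memory+swap counter stays at zero because it is never consulted.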
     
  • The unified hierarchy memory controller is going to use this jump label
    as well to control the networking callbacks. Move it to the memory
    controller code and give it a more generic name.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Reviewed-by: Vladimir Davydov
    Acked-by: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • There won't be any separate counters for socket memory consumed by
    protocols other than TCP in the future. Remove the indirection and link
    sockets directly to their owning memory cgroup.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Acked-by: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • There won't be a tcp control soft limit, so integrating the memcg code
    into the global skmem limiting scheme complicates things unnecessarily.
    Replace this with simple and clear charge and uncharge calls--hidden
    behind a jump label--to account skb memory.

    Note that this is not purely aesthetic: as a result of shoehorning the
    per-memcg code into the same memory accounting functions that handle the
    global level, the old code would compare the per-memcg consumption
    against the smaller of the per-memcg limit and the global limit. This
    allowed the total consumption of multiple sockets to exceed the global
    limit, as long as the individual sockets stayed within bounds. After
    this change, the code will always compare the per-memcg consumption to
    the per-memcg limit, and the global consumption to the global limit, and
    thus close this loophole.

    Without a soft limit, the per-memcg memory pressure state in sockets is
    generally questionable. However, we did it until now, so we continue to
    enter it when the hard limit is hit, and packets are dropped, to let
    other sockets in the cgroup know that they shouldn't grow their transmit
    windows, either. However, keep it simple in the new callback model and
    leave memory pressure lazily when the next packet is accepted (as
    opposed to doing it synchronously when packets are processed). When
    packets are dropped, network performance will already be in the toilet,
    so that should be a reasonable trade-off.

    As described above, consumption is now checked on the per-memcg level
    and the global level separately. Likewise, memory pressure states are
    maintained on both the per-memcg level and the global level, and a
    socket is considered under pressure when either level asserts as much.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Acked-by: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
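
The loophole described above can be worked through with small numbers. The sketch below is a simplified model, assuming two hypothetical groups "A" and "B" and illustrative function names (`old_check`, `new_check`, `under_pressure`); it mirrors the comparison logic as the message describes it, not the kernel functions themselves.

```python
def old_check(memcg_usage, memcg_limit, global_limit):
    # Old behaviour as described: per-memcg consumption is compared against
    # the smaller of the two limits; global consumption is never consulted.
    return memcg_usage <= min(memcg_limit, global_limit)


def new_check(memcg_usage, memcg_limit, global_usage, global_limit):
    # New behaviour: each level is compared against its own limit.
    return memcg_usage <= memcg_limit and global_usage <= global_limit


def under_pressure(memcg_pressure, global_pressure):
    # A socket is under pressure when either level asserts as much.
    return memcg_pressure or global_pressure


usages = {"A": 60, "B": 60}   # hypothetical per-group consumption
memcg_limit = 100
global_limit = 100
total = sum(usages.values())  # 120, which exceeds the global limit

old_results = [old_check(u, memcg_limit, global_limit)
               for u in usages.values()]
new_results = [new_check(u, memcg_limit, total, global_limit)
               for u in usages.values()]
```

With these numbers the old comparison admits both groups even though their combined consumption is over the global limit, while the separate per-level comparison rejects further growth.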
     
  • tcp_memcontrol replicates the global sysctl_mem limit array per cgroup,
    but it only ever sets these entries to the value of the memory_allocated
    page_counter limit. Use the latter directly.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Acked-by: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner