27 Oct, 2014

1 commit

  • Pull vfs updates from Al Viro:
    "overlayfs merge + leak fix for d_splice_alias() failure exits"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    overlayfs: embed middle into overlay_readdir_data
    overlayfs: embed root into overlay_readdir_data
    overlayfs: make ovl_cache_entry->name an array instead of pointer
    overlayfs: don't hold ->i_mutex over opening the real directory
    fix inode leaks on d_splice_alias() failure exits
    fs: limit filesystem stacking depth
    overlay: overlay filesystem documentation
    overlayfs: implement show_options
    overlayfs: add statfs support
    overlay filesystem
    shmem: support RENAME_WHITEOUT
    ext4: support RENAME_WHITEOUT
    vfs: add RENAME_WHITEOUT
    vfs: add whiteout support
    vfs: export check_sticky()
    vfs: introduce clone_private_mount()
    vfs: export __inode_permission() to modules
    vfs: export do_splice_direct() to modules
    vfs: add i_op->dentry_open()

    Linus Torvalds
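
    As a hedged illustration of the newly merged overlay filesystem from
    userspace, the sketch below mounts one via mount(2). The /lower, /upper,
    /work and /merged paths are hypothetical and must already exist
    (upperdir and workdir on the same filesystem); see the overlay
    documentation added in this pull for the authoritative options.

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
            const char *opts = "lowerdir=/lower,upperdir=/upper,workdir=/work";

            /* the in-kernel filesystem type merged here is "overlay" */
            if (mount("overlay", "/merged", "overlay", 0, opts) != 0) {
                    perror("mount");
                    return 1;
            }
            printf("overlay mounted on /merged\n");
            return 0;
    }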
     

25 Oct, 2014

1 commit

  • Pull ACPI and power management updates from Rafael Wysocki:
    "This is material that didn't make it to my 3.18-rc1 pull request for
    various reasons, mostly related to timing and travel (LinuxCon EU /
    LPC) plus a couple of fixes for recent bugs.

    The only really new thing here is the PM QoS class for memory
    bandwidth, but it is simple enough and users of it will be added in
    the next cycle. One major change in behavior is that platform devices
    enumerated by ACPI will use 32-bit DMA mask by default. Also included
    is an ACPICA update to a new upstream release, but that's mostly
    cleanups, changes in tools and similar. The rest is fixes and
    cleanups mostly.

    Specifics:

    - Fix for a recent PCI power management change that overlooked the
    fact that some IRQ chips might not be able to configure PCIe PME
    for system wakeup from Lucas Stach.

    - Fix for a bug introduced in 3.17 where acpi_device_wakeup() is
    called with a wrong ordering of arguments from Zhang Rui.

    - A bunch of intel_pstate driver fixes (all -stable candidates) from
    Dirk Brandewie, Gabriele Mazzotta and Pali Rohár.

    - Fixes for a rather long-standing problem with the OOM killer and
    the freezer: frozen processes killed by the OOM killer do not actually
    release any memory until they are thawed, so OOM-killing them is
    rather pointless, with a couple of cleanups on top (Michal Hocko,
    Cong Wang, Rafael J Wysocki).

    - ACPICA update to upstream release 20140926, including mostly
    cleanups reducing differences between the upstream ACPICA and the
    kernel code, tools changes (acpidump, acpiexec) and support for the
    _DDN object (Bob Moore, Lv Zheng).

    - New PM QoS class for memory bandwidth from Tomeu Vizoso.

    - Default 32-bit DMA mask for platform devices enumerated by ACPI
    (this change is mostly needed for some drivers development in
    progress targeted at 3.19) from Heikki Krogerus.

    - ACPI EC driver cleanups, mostly related to debugging, from Lv
    Zheng.

    - cpufreq-dt driver updates from Thomas Petazzoni.

    - powernv cpuidle driver update from Preeti U Murthy"

    * tag 'pm+acpi-3.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (34 commits)
    intel_pstate: Correct BYT VID values.
    intel_pstate: Fix BYT frequency reporting
    intel_pstate: Don't lose sysfs settings during cpu offline
    cpufreq: intel_pstate: Reflect current no_turbo state correctly
    cpufreq: expose scaling_cur_freq sysfs file for set_policy() drivers
    cpufreq: intel_pstate: Fix setting max_perf_pct in performance policy
    PCI / PM: handle failure to enable wakeup on PCIe PME
    ACPI: invoke acpi_device_wakeup() with correct parameters
    PM / freezer: Clean up code after recent fixes
    PM: convert do_each_thread to for_each_process_thread
    OOM, PM: OOM killed task shouldn't escape PM suspend
    freezer: remove obsolete comments in __thaw_task()
    freezer: Do not freeze tasks killed by OOM killer
    ACPI / platform: provide default DMA mask
    cpuidle: powernv: Populate cpuidle state details by querying the device-tree
    cpufreq: cpufreq-dt: adjust message related to regulators
    cpufreq: cpufreq-dt: extend with platform_data
    cpufreq: allow driver-specific data
    ACPI / EC: Cleanup coding style.
    ACPI / EC: Refine event/query debugging messages.
    ...

    Linus Torvalds
     

24 Oct, 2014

1 commit

  • Allocate a dentry, initialize it with a whiteout and hash it in the place
    of the old dentry. Later the old dentry will be moved away and the
    whiteout will remain.

    i_mutex protects against concurrent readdir.

    Signed-off-by: Miklos Szeredi
    Cc: Hugh Dickins

    Miklos Szeredi
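
    For context, whiteouts are requested from userspace through the
    RENAME_WHITEOUT flag added elsewhere in this series. Below is a minimal,
    hedged sketch of such a call; the "old"/"new" paths are made up, the
    build assumes a libc/kernel that defines SYS_renameat2, and the
    underlying filesystem (e.g. tmpfs or ext4 after these patches) must
    support the flag.

    #include <fcntl.h>              /* AT_FDCWD */
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef RENAME_WHITEOUT
    #define RENAME_WHITEOUT (1 << 2)        /* include/uapi/linux/fs.h */
    #endif

    int main(void)
    {
            /* move "old" to "new" and leave a whiteout where "old" was */
            if (syscall(SYS_renameat2, AT_FDCWD, "old",
                        AT_FDCWD, "new", RENAME_WHITEOUT) != 0) {
                    perror("renameat2");
                    return 1;
            }
            return 0;
    }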
     

22 Oct, 2014

1 commit

  • The PM freezer relies on having all tasks frozen by the time devices are
    getting frozen so that no task will touch them while they are getting
    frozen. But the OOM killer is allowed to kill an already frozen task in
    order to handle an OOM situation. To protect against late wakeups, the
    OOM killer is disabled after all tasks are frozen. This, however, still
    leaves a window open when a killed task didn't manage to die by the time
    freeze_processes finishes.

    Reduce the race window by checking all tasks after the OOM killer has
    been disabled. This is unfortunately still not completely race free,
    because oom_killer_disable cannot stop an already ongoing OOM kill, so a
    task might still wake up from the fridge and get killed without
    freeze_processes noticing. Full synchronization of the OOM killer and the
    freezer is, however, too heavyweight for this highly unlikely case.

    Introduce and check an oom_kills counter which gets incremented early
    when the allocator enters the __alloc_pages_may_oom path, and only check
    all the tasks if the counter changes during the freezing attempt. The
    counter is updated that early to reduce the race window, since the
    allocator has already checked oom_killer_disabled, which is set by the
    PM-freezing code. A false positive only pushes the PM freezer onto a slow
    path, which is not a big deal (a schematic sketch of this check follows
    below).

    Changes since v1
    - push the re-check loop out of freeze_processes into
    check_frozen_processes and invert the condition to make the code more
    readable as per Rafael

    Fixes: f660daac474c6f (oom: thaw threads if oom killed thread is frozen before deferring)
    Cc: 3.2+ # 3.2+
    Signed-off-by: Michal Hocko
    Signed-off-by: Rafael J. Wysocki

    Michal Hocko
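
    The check can be summarized with the schematic, userspace-style sketch
    below; oom_kills, oom_killer_disabled and the stubbed
    try_to_freeze_tasks()/check_frozen_processes() are stand-ins for the
    kernel symbols, not the real implementation.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    static atomic_int oom_kills;            /* bumped early in the allocator's OOM path */
    static atomic_bool oom_killer_disabled;

    /* allocator side, stand-in for __alloc_pages_may_oom(): count the kill
     * attempt as early as possible so the freezer has a chance to see it */
    static void alloc_pages_may_oom(void)
    {
            if (!atomic_load(&oom_killer_disabled))
                    atomic_fetch_add(&oom_kills, 1);
    }

    static bool try_to_freeze_tasks(void)    { return true; } /* stub */
    static bool check_frozen_processes(void) { return true; } /* stub: re-scan all tasks */

    /* PM side, stand-in for freeze_processes() */
    static bool freeze_processes(void)
    {
            int saved = atomic_load(&oom_kills);

            if (!try_to_freeze_tasks())
                    return false;
            atomic_store(&oom_killer_disabled, true);

            /* re-check every task only if an OOM kill may have raced with the
             * freezing attempt; a false positive just takes the slow path */
            if (atomic_load(&oom_kills) != saved)
                    return check_frozen_processes();
            return true;
    }

    int main(void)
    {
            alloc_pages_may_oom();
            printf("freeze %s\n", freeze_processes() ? "succeeded" : "failed");
            return 0;
    }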
     

21 Oct, 2014

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "A large number of cleanups and bug fixes, with some (minor) journal
    optimizations"

    [ This got sent to me before -rc1, but was stuck in my spam folder. - Linus ]

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (67 commits)
    ext4: check s_chksum_driver when looking for bg csum presence
    ext4: move error report out of atomic context in ext4_init_block_bitmap()
    ext4: Replace open coded mdata csum feature to helper function
    ext4: delete useless comments about ext4_move_extents
    ext4: fix reservation overflow in ext4_da_write_begin
    ext4: add ext4_iget_normal() which is to be used for dir tree lookups
    ext4: don't orphan or truncate the boot loader inode
    ext4: grab missed write_count for EXT4_IOC_SWAP_BOOT
    ext4: optimize block allocation on grow indepth
    ext4: get rid of code duplication
    ext4: fix over-defensive complaint after journal abort
    ext4: fix return value of ext4_do_update_inode
    ext4: fix mmap data corruption when blocksize < pagesize
    vfs: fix data corruption when blocksize < pagesize for mmaped data
    ext4: fold ext4_nojournal_sops into ext4_sops
    ext4: support freezing ext2 (nojournal) file systems
    ext4: fold ext4_sync_fs_nojournal() into ext4_sync_fs()
    ext4: don't check quota format when there are no quota files
    jbd2: simplify calling convention around __jbd2_journal_clean_checkpoint_list
    jbd2: avoid pointless scanning of checkpoint lists
    ...

    Linus Torvalds
     

19 Oct, 2014

1 commit

  • Pull core block layer changes from Jens Axboe:
    "This is the core block IO pull request for 3.18. Apart from the new
    and improved flush machinery for blk-mq, this is all mostly bug fixes
    and cleanups.

    - blk-mq timeout updates and fixes from Christoph.

    - Removal of REQ_END, also from Christoph. We pass it through the
    ->queue_rq() hook for blk-mq instead, freeing up one of the request
    bits. The space was overly tight on 32-bit, so Martin also killed
    REQ_KERNEL since it's no longer used.

    - blk integrity updates and fixes from Martin and Gu Zheng.

    - Update to the flush machinery for blk-mq from Ming Lei. Now we
    have a per hardware context flush request, which both cleans up the
    code and should scale better for flush-intensive workloads on blk-mq.

    - Improve the error printing, from Rob Elliott.

    - Backing device improvements and cleanups from Tejun.

    - Fixup of a misplaced rq_complete() tracepoint from Hannes.

    - Make blk_get_request() return error pointers, fixing up issues
    where we NULL deref when a device goes bad or missing. From Joe
    Lawrence.

    - Prep work for drastically reducing the memory consumption of dm
    devices from Junichi Nomura. This allows creating clone bio sets
    without preallocating a lot of memory.

    - Fix a blk-mq hang on certain combinations of queue depths and
    hardware queues from me.

    - Limit memory consumption for blk-mq devices for crash dump
    scenarios and drivers that use crazy high depths (certain SCSI
    shared tag setups). We now just use a single queue and limited
    depth for that"

    * 'for-3.18/core' of git://git.kernel.dk/linux-block: (58 commits)
    block: Remove REQ_KERNEL
    blk-mq: allocate cpumask on the home node
    bio-integrity: remove the needless fail handle of bip_slab creating
    block: include func name in __get_request prints
    block: make blk_update_request print prefix match ratelimited prefix
    blk-merge: don't compute bi_phys_segments from bi_vcnt for cloned bio
    block: fix alignment_offset math that assumes io_min is a power-of-2
    blk-mq: Make bt_clear_tag() easier to read
    blk-mq: fix potential hang if rolling wakeup depth is too high
    block: add bioset_create_nobvec()
    block: use bio_clone_fast() in blk_rq_prep_clone()
    block: misplaced rq_complete tracepoint
    sd: Honor block layer integrity handling flags
    block: Replace strnicmp with strncasecmp
    block: Add T10 Protection Information functions
    block: Don't merge requests if integrity flags differ
    block: Integrity checksum flag
    block: Relocate bio integrity flags
    block: Add a disk flag to block integrity profile
    block: Add prefix to block integrity profile flags
    ...

    Linus Torvalds
     

14 Oct, 2014

6 commits

  • Merge second patch-bomb from Andrew Morton:
    - a few hotfixes
    - drivers/dma updates
    - MAINTAINERS updates
    - Quite a lot of lib/ updates
    - checkpatch updates
    - binfmt updates
    - autofs4
    - drivers/rtc/
    - various small tweaks to less used filesystems
    - ipc/ updates
    - kernel/watchdog.c changes

    * emailed patches from Andrew Morton : (135 commits)
    mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared
    kernel/param: consolidate __{start,stop}___param[] in <linux/moduleparam.h>
    ia64: remove duplicate declarations of __per_cpu_start[] and __per_cpu_end[]
    frv: remove unused declarations of __start___ex_table and __stop___ex_table
    kvm: ensure hard lockup detection is disabled by default
    kernel/watchdog.c: control hard lockup detection default
    staging: rtl8192u: use %*pEn to escape buffer
    staging: rtl8192e: use %*pEn to escape buffer
    staging: wlan-ng: use %*pEhp to print SN
    lib80211: remove unused print_ssid()
    wireless: hostap: proc: print properly escaped SSID
    wireless: ipw2x00: print SSID via %*pE
    wireless: libertas: print escaped string via %*pE
    lib/vsprintf: add %*pE[achnops] format specifier
    lib / string_helpers: introduce string_escape_mem()
    lib / string_helpers: refactoring the test suite
    lib / string_helpers: move documentation to c-file
    include/linux: remove strict_strto* definitions
    arch/x86/mm/numa.c: fix boot failure when all nodes are hotpluggable
    fs: check bh blocknr earlier when searching lru
    ...

    Linus Torvalds
     
  • Pull x86 mm updates from Ingo Molnar:
    "This tree includes the following changes:

    - fix memory hotplug
    - fix hibernation bootup memory layout assumptions
    - fix hyperv numa guest kernel messages
    - remove dead code
    - update documentation"

    * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/mm: Update memory map description to list hypervisor-reserved area
    x86/mm, hibernate: Do not assume the first e820 area to be RAM
    x86/mm/numa: Drop dead code and rename setup_node_data() to setup_alloc_data()
    x86/mm/hotplug: Modify PGD entry when removing memory
    x86/mm/hotplug: Pass sync_global_pgds() a correct argument in remove_pagetable()
    x86: Remove set_pmd_pfn

    Linus Torvalds
     
  • For VMAs that don't want write notifications, PTEs created for read faults
    have their write bit set. If the read fault happens after VM_SOFTDIRTY is
    cleared, then the PTE's softdirty bit will remain clear after subsequent
    writes.

    Here's a simple code snippet to demonstrate the bug:

    char *m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
                   MAP_ANONYMOUS | MAP_SHARED, -1, 0);
    system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
    assert(*m == '\0');     /* new PTE allows write access */
    assert(!soft_dirty(m)); /* soft_dirty() checks the page's pagemap bit */
    *m = 'x';               /* should dirty the page */
    assert(soft_dirty(m));  /* fails */

    With this patch, write notifications are enabled when VM_SOFTDIRTY is
    cleared. Furthermore, to avoid unnecessary faults, write notifications
    are disabled when VM_SOFTDIRTY is set.

    As a side effect of enabling and disabling write notifications with
    care, this patch fixes a bug in mprotect where vm_page_prot bits set by
    drivers were zapped on mprotect. An analogous bug was fixed in mmap by
    commit c9d0bf241451 ("mm: uncached vma support with writenotify").

    Signed-off-by: Peter Feiner
    Reported-by: Peter Feiner
    Suggested-by: Kirill A. Shutemov
    Cc: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Jamie Liu
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: Bjorn Helgaas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Feiner
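
    The snippet assumes a soft_dirty() helper; a runnable userspace version
    is sketched below, using the documented pagemap interface (the soft-dirty
    state is bit 55 of the 64-bit /proc/self/pagemap entry for the page).

    #include <assert.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static int soft_dirty(void *addr)
    {
            uint64_t entry;
            int fd = open("/proc/self/pagemap", O_RDONLY);
            off_t off = (uintptr_t)addr / getpagesize() * sizeof(entry);

            assert(fd >= 0);
            assert(pread(fd, &entry, sizeof(entry), off) == sizeof(entry));
            close(fd);
            return (entry >> 55) & 1;                  /* soft-dirty bit */
    }

    int main(void)
    {
            char *m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
                           MAP_ANONYMOUS | MAP_SHARED, -1, 0);

            assert(m != MAP_FAILED);
            system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
            assert(*m == '\0');                        /* read fault populates the PTE */
            assert(!soft_dirty(m));
            *m = 'x';                                  /* should dirty the page */
            assert(soft_dirty(m));                     /* fails on unpatched kernels */
            return 0;
    }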
     
  • Add a function to create CMA region from previously reserved memory and
    add support for handling 'shared-dma-pool' reserved-memory device tree
    nodes.

    Based on previous code provided by Josh Cartwright

    Signed-off-by: Marek Szyprowski
    Cc: Arnd Bergmann
    Cc: Michal Nazarewicz
    Cc: Grant Likely
    Cc: Laura Abbott
    Cc: Josh Cartwright
    Cc: Joonsoo Kim
    Cc: Kyungmin Park
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marek Szyprowski
     
  • The current cma bitmap aligned mask computation is incorrect. It could
    cause an unexpected alignment when using cma_alloc() if the wanted align
    order is larger than cma->order_per_bit.

    Take kvm for example (PAGE_SHIFT = 12): kvm_cma->order_per_bit is set to
    6. When kvm_alloc_rma() tries to alloc kvm_rma_pages, it will use 15 as
    the expected align value. With the current implementation, however, we
    get 0 as the cma bitmap aligned mask rather than 511 (see the worked
    example below).

    This patch fixes the cma bitmap aligned mask calculation.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Weijie Yang
    Acked-by: Michal Nazarewicz
    Cc: Joonsoo Kim
    Cc: "Aneesh Kumar K.V"
    Cc: [3.17]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Weijie Yang
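
    To make the arithmetic concrete, the standalone sketch below compares the
    old and fixed mask computations with the changelog's numbers
    (order_per_bit = 6, requested align order 15); the helpers are
    illustrative rewrites, not copies of mm/cma.c.

    #include <stdio.h>

    static unsigned long buggy_mask(int align_order, int order_per_bit)
    {
            /* the old code shifted by the wrong amount: 15 >> 6 == 0 */
            return (1UL << (align_order >> order_per_bit)) - 1;
    }

    static unsigned long fixed_mask(int align_order, int order_per_bit)
    {
            if (align_order <= order_per_bit)
                    return 0;
            return (1UL << (align_order - order_per_bit)) - 1;
    }

    int main(void)
    {
            /* align_order 15, order_per_bit 6: expect (1 << 9) - 1 = 511 */
            printf("buggy: %lu, fixed: %lu\n",
                   buggy_mask(15, 6), fixed_mask(15, 6));
            return 0;
    }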
     
  • Commit bf0dea23a9c0 ("mm/slab: use percpu allocator for cpu cache")
    changed the allocation method for cpu cache array from slab allocator to
    percpu allocator. Alignment should be provided for aligned memory in
    percpu allocator case, but, that commit mistakenly set this alignment to
    0. So, percpu allocator returns unaligned memory address. It doesn't
    cause any problem on x86 which permits unaligned access, but, it causes
    the problem on sparc64 which needs strong guarantee of alignment.

    The following bug report came from David Miller.

    I'm getting tons of the following on sparc64:

    [603965.383447] Kernel unaligned access at TPC[546b58] free_block+0x98/0x1a0
    [603965.396987] Kernel unaligned access at TPC[546b60] free_block+0xa0/0x1a0
    ...
    [603970.554394] log_unaligned: 333 callbacks suppressed
    ...

    This patch provides a proper alignment parameter when allocating cpu
    cache to fix this unaligned memory access problem on sparc64.

    Reported-by: David Miller
    Tested-by: David Miller
    Tested-by: Meelis Roos
    Signed-off-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

13 Oct, 2014

2 commits

  • Pull RCU updates from Ingo Molnar:
    "The main changes in this cycle were:

    - changes related to No-CBs CPUs and NO_HZ_FULL

    - RCU-tasks implementation

    - torture-test updates

    - miscellaneous fixes

    - locktorture updates

    - RCU documentation updates"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (81 commits)
    workqueue: Use cond_resched_rcu_qs macro
    workqueue: Add quiescent state between work items
    locktorture: Cleanup header usage
    locktorture: Cannot hold read and write lock
    locktorture: Fix __acquire annotation for spinlock irq
    locktorture: Support rwlocks
    rcu: Eliminate deadlock between CPU hotplug and expedited grace periods
    locktorture: Document boot/module parameters
    rcutorture: Rename rcutorture_runnable parameter
    locktorture: Add test scenario for rwsem_lock
    locktorture: Add test scenario for mutex_lock
    locktorture: Make torture scripting account for new _runnable name
    locktorture: Introduce torture context
    locktorture: Support rwsems
    locktorture: Add infrastructure for torturing read locks
    torture: Address race in module cleanup
    locktorture: Make statistics generic
    locktorture: Teach about lock debugging
    locktorture: Support mutexes
    locktorture: Add documentation
    ...

    Linus Torvalds
     
  • Pull vfs updates from Al Viro:
    "The big thing in this pile is Eric's unmount-on-rmdir series; we
    finally have everything we need for that. The final piece of prereqs
    is delayed mntput() - now filesystem shutdown always happens on
    shallow stack.

    Other than that, we have several new primitives for iov_iter (Matt
    Wilcox, culled from his XIP-related series) pushing the conversion to
    ->read_iter()/ ->write_iter() a bit more, a bunch of fs/dcache.c
    cleanups and fixes (including the external name refcounting, which
    gives consistent behaviour of d_move() wrt procfs symlinks for long
    and short names alike) and assorted cleanups and fixes all over the
    place.

    This is just the first pile; there's a lot of stuff from various
    people that ought to go in this window. Starting with
    unionmount/overlayfs mess... ;-/"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (60 commits)
    fs/file_table.c: Update alloc_file() comment
    vfs: Deduplicate code shared by xattr system calls operating on paths
    reiserfs: remove pointless forward declaration of struct nameidata
    don't need that forward declaration of struct nameidata in dcache.h anymore
    take dname_external() into fs/dcache.c
    let path_init() failures treated the same way as subsequent link_path_walk()
    fix misuses of f_count() in ppp and netlink
    ncpfs: use list_for_each_entry() for d_subdirs walk
    vfs: move getname() from callers to do_mount()
    gfs2_atomic_open(): skip lookups on hashed dentry
    [infiniband] remove pointless assignments
    gadgetfs: saner API for gadgetfs_create_file()
    f_fs: saner API for ffs_sb_create_file()
    jfs: don't hash direct inode
    [s390] remove pointless assignment of ->f_op in vmlogrdr ->open()
    ecryptfs: ->f_op is never NULL
    android: ->f_op is never NULL
    nouveau: __iomem misannotations
    missing annotation in fs/file.c
    fs: namespace: suppress 'may be used uninitialized' warnings
    ...

    Linus Torvalds
     

12 Oct, 2014

1 commit


11 Oct, 2014

1 commit

  • Commit d3ac21cacc24790eb45d735769f35753f5b56ceb ("mm: Support compiling
    out madvise and fadvise") incorrectly made fadvise conditional on
    CONFIG_MMU. (The merged branch unintentionally incorporated v1 of the
    patch rather than the fixed v2.) Apply the delta from v1 to v2, to
    allow fadvise without CONFIG_MMU.

    Reported-by: Johannes Weiner
    Signed-off-by: Josh Triplett

    Josh Triplett
     

10 Oct, 2014

24 commits

  • Pull percpu updates from Tejun Heo:
    "A lot of activities on percpu front. Notable changes are...

    - percpu allocator now can take @gfp. If @gfp doesn't contain
    GFP_KERNEL, it tries to allocate from what's already available to
    the allocator, and a work item tries to keep the reserve around a
    certain level so that these atomic allocations usually succeed.

    This will replace the ad-hoc percpu memory pool used by
    blk-throttle and also be used by the planned blkcg support for
    writeback IOs.

    Please note that I noticed a bug in how @gfp is interpreted while
    preparing this pull request and applied the fix 6ae833c7fe0c
    ("percpu: fix how @gfp is interpreted by the percpu allocator")
    just now.

    - percpu_ref now uses longs for percpu and global counters instead of
    ints. It leads to more sparse packing of the percpu counters on
    64bit machines but the overhead should be negligible and this
    allows using percpu_ref for refcnting pages and in-memory objects
    directly.

    - The switching between percpu and single counter modes of a
    percpu_ref is made independent of putting the base ref and a
    percpu_ref can now optionally be initialized in single or killed
    mode. This allows avoiding percpu shutdown latency for cases where
    the refcounted objects may be synchronously created and destroyed
    in rapid succession with only a fraction of them reaching fully
    operational status (SCSI probing does this when combined with
    blk-mq support). It's also planned to be used to implement forced
    single mode to detect underflow more timely for debugging.

    There's a separate branch percpu/for-3.18-consistent-ops which cleans
    up the duplicate percpu accessors. That branch causes a number of
    conflicts with s390 and other trees. I'll send a separate pull
    request w/ resolutions once other branches are merged"

    * 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (33 commits)
    percpu: fix how @gfp is interpreted by the percpu allocator
    blk-mq, percpu_ref: start q->mq_usage_counter in atomic mode
    percpu_ref: make INIT_ATOMIC and switch_to_atomic() sticky
    percpu_ref: add PERCPU_REF_INIT_* flags
    percpu_ref: decouple switching to percpu mode and reinit
    percpu_ref: decouple switching to atomic mode and killing
    percpu_ref: add PCPU_REF_DEAD
    percpu_ref: rename things to prepare for decoupling percpu/atomic mode switch
    percpu_ref: replace pcpu_ prefix with percpu_
    percpu_ref: minor code and comment updates
    percpu_ref: relocate percpu_ref_reinit()
    Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe"
    Revert "percpu: free percpu allocation info for uniprocessor system"
    percpu-refcount: make percpu_ref based on longs instead of ints
    percpu-refcount: improve WARN messages
    percpu: fix locking regression in the failure path of pcpu_alloc()
    percpu-refcount: add @gfp to percpu_ref_init()
    proportions: add @gfp to init functions
    percpu_counter: add @gfp to percpu_counter_init()
    percpu_counter: make percpu_counters_lock irq-safe
    ...

    Linus Torvalds
     
  • Pull cgroup updates from Tejun Heo:
    "Nothing too interesting. Just a handful of cleanup patches"

    * 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    Revert "cgroup: remove redundant variable in cgroup_mount()"
    cgroup: remove redundant variable in cgroup_mount()
    cgroup: fix missing unlock in cgroup_release_agent()
    cgroup: remove CGRP_RELEASABLE flag
    perf/cgroup: Remove perf_put_cgroup()
    cgroup: remove redundant check in cgroup_ino()
    cpuset: simplify proc_cpuset_show()
    cgroup: simplify proc_cgroup_show()
    cgroup: use a per-cgroup work for release agent
    cgroup: remove bogus comments
    cgroup: remove redundant code in cgroup_rmdir()
    cgroup: remove some useless forward declarations
    cgroup: fix a typo in comment.

    Linus Torvalds
     
  • Currently there are NCHUNKS (64) freelists in zbud_pool; the last one,
    unbuddied[63], would link zbud pages that have 63 free chunks. Going by
    the logic of num_free_chunks(), the maximum number of free chunks in an
    unbuddied zbud page is 62, so no zbud page is ever added to or removed
    from the last freelist, yet we still search that unused freelist when
    looking for an unbuddied zbud page, which is unneeded.

    This patch redefines NCHUNKS to 63, the number of free chunks in one zbud
    page, so we can shrink the zbud pool structure and avoid touching the
    always-empty last freelist whenever zbud_alloc fails to allocate from the
    freelists.

    Signed-off-by: Chao Yu
    Cc: Seth Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chao Yu
     
  • Change zsmalloc init_zspage() logic to iterate through each object on each
    of its pages, checking the offset to verify the object is on the current
    page before linking it into the zspage.

    The current zsmalloc init_zspage free object linking code has logic that
    relies on there only being one page per zspage when PAGE_SIZE is a
    multiple of class->size. It calculates the number of objects for the
    current page, and iterates through all of them plus one, to account for
    the assumed partial object at the end of the page. While this currently
    works, the logic can be simplified to just link the object at each
    successive offset until the offset is larger than PAGE_SIZE, which does
    not rely on PAGE_SIZE being a multiple of class->size.

    Signed-off-by: Dan Streetman
    Acked-by: Minchan Kim
    Cc: Sergey Senozhatsky
    Cc: Nitin Gupta
    Cc: Seth Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
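
    A toy, single-page sketch of the simplified linking described above:
    place a free-list link at every successive class-size offset that still
    fits on the page, with no assumption that PAGE_SIZE is a multiple of the
    class size. The types and sizes below are made up; this is not the
    zsmalloc code.

    #include <stdio.h>

    #define PAGE_SIZE 4096

    struct link { int next; };               /* stand-in for zsmalloc's link_free */

    static void init_page(char *page, int class_size)
    {
            int off, nr = 0;

            /* link a free slot at every class-size offset that still fits */
            for (off = 0; off + (int)sizeof(struct link) <= PAGE_SIZE; off += class_size) {
                    struct link *l = (struct link *)(page + off);
                    l->next = nr + 1;        /* index of the next free slot */
                    nr++;
            }
            printf("linked %d objects of size %d\n", nr, class_size);
    }

    int main(void)
    {
            static char page[PAGE_SIZE];

            init_page(page, 96);             /* 96 does not divide 4096 evenly */
            return 0;
    }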
     
  • The letter 'f' in "n
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Sheng-Hui
     
  • zs_get_total_size_bytes returns the amount of memory zsmalloc has
    consumed in bytes, but zsmalloc internally operates on pages rather than
    bytes, so change the API to return pages. The benefit is that zsmalloc
    drops the unnecessary conversion overhead (ie, turning a page count into
    bytes).

    Since the return type is pages, "zs_get_total_pages" is a better name
    than "zs_get_total_size_bytes".

    Signed-off-by: Minchan Kim
    Reviewed-by: Dan Streetman
    Cc: Sergey Senozhatsky
    Cc: Jerome Marchand
    Cc:
    Cc:
    Cc: Luigi Semenzato
    Cc: Nitin Gupta
    Cc: Seth Jennings
    Cc: David Horner
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Currently, zram has no feature to limit memory so theoretically zram can
    deplete system memory. Users have asked for a limit several times as even
    without exhaustion zram makes it hard to control memory usage of the
    platform. This patchset adds the feature.

    Patch 1 makes zs_get_total_size_bytes faster because it will be used
    frequently in later patches for the new feature.

    Patch 2 changes zs_get_total_size_bytes's return unit from bytes to pages
    so that zsmalloc doesn't need an unnecessary operation (ie, << PAGE_SHIFT).

    Patch 3 adds the new feature. I added the feature into the zram layer,
    not zsmalloc, because the limitation is zram's requirement, not
    zsmalloc's, so other users of zsmalloc (ie, zpool) shouldn't be affected
    by an unnecessary branch in zsmalloc. In future, if every user of
    zsmalloc wants the feature, we could easily move it from the client side
    into zsmalloc, but the reverse would be painful.

    Patch 4 adds a new facility to report the maximum memory usage of zram,
    which avoids users polling /sys/block/zram0/mem_used_total frequently
    and ensures transient maxima are not missed.
    This patch (of 4):

    pages_allocated is currently counted per size_class, so when a zsmalloc
    user wants to see total_size_bytes, all the counts have to be gathered
    from each size_class to report the sum.

    That is not a problem if the value is read rarely, but once users start
    reading it frequently it becomes a bad deal from a performance point of
    view.

    This patch moves the count from size_class to zs_pool so it could reduce
    memory footprint (from [255 * 8byte] to [sizeof(atomic_long_t)]).

    Signed-off-by: Minchan Kim
    Reviewed-by: Dan Streetman
    Cc: Sergey Senozhatsky
    Cc: Jerome Marchand
    Cc:
    Cc:
    Cc: Luigi Semenzato
    Cc: Nitin Gupta
    Cc: Seth Jennings
    Reviewed-by: David Horner
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • vmstat workers are used for folding counter differentials into the zone,
    per node and global counters at certain time intervals. They currently
    run at defined intervals on all processors which will cause some holdoff
    for processors that need minimal intrusion by the OS.

    The current vmstat_update mechanism depends on a deferrable timer firing
    every other second by default which registers a work queue item that runs
    on the local CPU, with the result that we have 1 interrupt and one
    additional schedulable task on each CPU every 2 seconds. If a workload
    indeed causes VM activity or multiple tasks are running on a CPU, then
    there are probably bigger issues to deal with.

    However, some workloads dedicate a CPU for a single CPU bound task. This
    is done in high performance computing, in high frequency financial
    applications, in networking (Intel DPDK, EZchip NPS) and with the advent
    of systems with more and more CPUs over time, this may become more and
    more common to do since when one has enough CPUs one cares less about
    efficiently sharing a CPU with other tasks and more about efficiently
    monopolizing a CPU per task.

    The difference made by this timer firing and the workqueue kernel thread
    being scheduled can be enormous. An artificial test measuring the
    worst-case time to do a simple "i++" in an endless loop on a bare metal
    system and under Linux on an isolated CPU with dynticks, with and without
    this patch, has Linux match the bare metal performance (~700 cycles) with
    this patch and lose by a couple of orders of magnitude (~200k cycles)
    without it[*]. The loss occurs for something that just calculates
    statistics. For networking applications, for example, this could be the
    difference between dropping packets and sustaining line rate.

    Statistics are important and useful, but it would be great if there were
    a way to keep statistics gathering from producing such a huge performance
    difference. This patch does just that.

    This patch creates a vmstat shepherd worker that monitors the per cpu
    differentials on all processors. If there are differentials on a
    processor then a vmstat worker local to the processors with the
    differentials is created. That worker will then start folding the diffs
    in regular intervals. Should the worker find that there is no work to be
    done then it will make the shepherd worker monitor the differentials
    again.

    With this patch it is then possible to have periods longer than
    2 seconds without any OS event on a "cpu" (hardware thread).

    The patch shows a very minor increase in system performance.

    hackbench -s 512 -l 2000 -g 15 -f 25 -P

    Results before the patch:

    Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
    Each sender will pass 2000 messages of 512 bytes
    Time: 4.992
    Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
    Each sender will pass 2000 messages of 512 bytes
    Time: 4.971
    Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
    Each sender will pass 2000 messages of 512 bytes
    Time: 5.063

    Hackbench after the patch:

    Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
    Each sender will pass 2000 messages of 512 bytes
    Time: 4.973
    Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
    Each sender will pass 2000 messages of 512 bytes
    Time: 4.990
    Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
    Each sender will pass 2000 messages of 512 bytes
    Time: 4.993

    [fengguang.wu@intel.com: cpu_stat_off can be static]
    Signed-off-by: Christoph Lameter
    Reviewed-by: Gilad Ben-Yossef
    Cc: Frederic Weisbecker
    Cc: Thomas Gleixner
    Cc: Tejun Heo
    Cc: John Stultz
    Cc: Mike Frysinger
    Cc: Minchan Kim
    Cc: Hakan Akkan
    Cc: Max Krasnyansky
    Cc: "Paul E. McKenney"
    Cc: Hugh Dickins
    Cc: Viresh Kumar
    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
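
    The shepherd idea described above can be sketched schematically as
    below: one shepherd periodically scans per-cpu differentials and only
    (re)arms a worker on cpus that have something to fold, while a worker
    that finds nothing to do goes quiet again. NR_CPUS, the diff array and
    the function names here are stand-ins, not the vmstat code.

    #include <stdbool.h>
    #include <stdio.h>

    #define NR_CPUS 4

    static int  vm_diff[NR_CPUS];        /* pending per-cpu counter deltas */
    static bool worker_active[NR_CPUS];  /* is a vmstat worker armed on this cpu? */

    static void vmstat_update(int cpu)   /* the per-cpu work item */
    {
            if (vm_diff[cpu]) {
                    printf("cpu%d: folding %d\n", cpu, vm_diff[cpu]);
                    vm_diff[cpu] = 0;            /* fold diffs into the global counters */
            } else {
                    worker_active[cpu] = false;  /* nothing to do: go quiet */
            }
    }

    static void vmstat_shepherd(void)    /* runs periodically on one housekeeping cpu */
    {
            for (int cpu = 0; cpu < NR_CPUS; cpu++)
                    if (!worker_active[cpu] && vm_diff[cpu])
                            worker_active[cpu] = true;   /* re-arm the local worker */
    }

    int main(void)
    {
            vm_diff[2] = 7;                  /* only cpu2 saw VM activity */
            vmstat_shepherd();
            for (int cpu = 0; cpu < NR_CPUS; cpu++)
                    if (worker_active[cpu])
                            vmstat_update(cpu);
            return 0;
    }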
     
  • PROT_NUMA VMAs are skipped to avoid problems distinguishing between
    present, prot_none and special entries. MPOL_MF_LAZY is not visible from
    userspace since commit a720094ded8c ("mm: mempolicy: Hide MPOL_NOOP and
    MPOL_MF_LAZY from userspace for now") but it should still skip VMAs the
    same way task_numa_work does.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: Hugh Dickins
    Acked-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Always mark pages with PageBalloon even if balloon compaction is disabled
    and expose this mark in /proc/kpageflags as KPF_BALLOON.

    Also this patch adds three counters into /proc/vmstat: "balloon_inflate",
    "balloon_deflate" and "balloon_migrate". They accumulate balloon
    activity. Current size of balloon is (balloon_inflate - balloon_deflate)
    pages.

    All generic balloon code is now gathered under the option
    CONFIG_MEMORY_BALLOON. It should be selected by any ballooning driver
    that wants to use this feature. Currently virtio-balloon is the only
    user.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Rafael Aquini
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Now ballooned pages are detected using PageBalloon(). Fake mapping is no
    longer required. This patch links ballooned pages to balloon device using
    field page->private instead of page->mapping. Also this patch embeds
    balloon_dev_info directly into struct virtio_balloon.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Rafael Aquini
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Sasha Levin reported a KASAN splat inside isolate_migratepages_range().
    The problem is in the function __is_movable_balloon_page(), which tests
    AS_BALLOON_MAP in page->mapping->flags. This function has no protection
    against anonymous pages, so as a result it tried to check address space
    flags inside struct anon_vma.

    Further investigation shows more problems in current implementation:

    * The special branch in __unmap_and_move() never works:
    balloon_page_movable() checks page flags and page_count. In
    __unmap_and_move() the page is locked and its reference counter is
    elevated, thus balloon_page_movable() always fails. As a result execution
    goes to the normal migration path. virtballoon_migratepage() returns
    MIGRATEPAGE_BALLOON_SUCCESS instead of MIGRATEPAGE_SUCCESS,
    move_to_new_page() thinks this is an error code and assigns
    newpage->mapping to NULL. The newly migrated page loses its connection to
    the balloon and any ability to be migrated further.

    * lru_lock is erroneously required in isolate_migratepages_range() for
    isolating ballooned pages. This function releases lru_lock periodically,
    which makes migration mostly impossible for some pages.

    * balloon_page_dequeue has a tight race with balloon_page_isolate:
    balloon_page_isolate could be executed in parallel with dequeue between
    picking the page from the list and locking page_lock. The race is rare
    because they use trylock_page() for locking.

    This patch fixes all of them.

    Instead of a fake mapping with a special flag, this patch uses a special
    state of page->_mapcount: PAGE_BALLOON_MAPCOUNT_VALUE = -256. The buddy
    allocator uses PAGE_BUDDY_MAPCOUNT_VALUE = -128 for a similar purpose.
    Storing the mark directly in struct page makes everything safer and
    easier (see the sketch after this commit).

    PagePrivate is used to mark pages present in page list (i.e. not
    isolated, like PageLRU for normal pages). It replaces special rules for
    reference counter and makes balloon migration similar to migration of
    normal pages. This flag is protected by page_lock together with link to
    the balloon device.

    Signed-off-by: Konstantin Khlebnikov
    Reported-by: Sasha Levin
    Link: http://lkml.kernel.org/p/53E6CEAA.9020105@oracle.com
    Cc: Rafael Aquini
    Cc: Andrey Ryabinin
    Cc: [3.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
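
    An illustrative, self-contained sketch of the marking scheme described
    above: a special negative _mapcount value tags a page as ballooned,
    mirroring the PAGE_BUDDY_MAPCOUNT_VALUE convention. The struct below is
    a toy stand-in for struct page, not kernel code.

    #include <stdio.h>

    #define PAGE_BALLOON_MAPCOUNT_VALUE (-256)

    struct page { int _mapcount; };      /* real pages keep -1 when unmapped */

    static void set_page_balloon(struct page *p)
    {
            p->_mapcount = PAGE_BALLOON_MAPCOUNT_VALUE;
    }

    static int page_balloon(const struct page *p)
    {
            return p->_mapcount == PAGE_BALLOON_MAPCOUNT_VALUE;
    }

    static void clear_page_balloon(struct page *p)
    {
            p->_mapcount = -1;               /* back to the "no mappings" state */
    }

    int main(void)
    {
            struct page pg = { ._mapcount = -1 };

            set_page_balloon(&pg);
            printf("ballooned? %d\n", page_balloon(&pg));
            clear_page_balloon(&pg);
            printf("ballooned? %d\n", page_balloon(&pg));
            return 0;
    }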
     
  • This series implements general forms of get_user_pages_fast and
    __get_user_pages_fast in core code and activates them for arm and arm64.

    These are required for Transparent HugePages to function correctly, as a
    futex on a THP tail will otherwise result in an infinite loop (due to the
    core implementation of __get_user_pages_fast always returning 0).

    Unfortunately, a futex on THP tail can be quite common for certain
    workloads; thus THP is unreliable without a __get_user_pages_fast
    implementation.

    This series may also be beneficial for direct-IO heavy workloads and
    certain KVM workloads.

    This patch (of 6):

    get_user_pages_fast() attempts to pin user pages by walking the page
    tables directly and avoids taking locks. Thus the walker needs to be
    protected from page table pages being freed from under it, and needs to
    block any THP splits.

    One way to achieve this is to have the walker disable interrupts, and rely
    on IPIs from the TLB flushing code blocking before the page table pages
    are freed.

    On some platforms we have hardware broadcast of TLB invalidations, thus
    the TLB flushing code doesn't necessarily need to broadcast IPIs; and
    spuriously broadcasting IPIs can hurt system performance if done too
    often.

    This problem has been solved on PowerPC and Sparc by batching up page
    table pages belonging to more than one mm_user, then scheduling an
    rcu_sched callback to free the pages. This RCU page table free logic has
    been promoted to core code and is activated when one enables
    HAVE_RCU_TABLE_FREE. Unfortunately, these architectures implement their
    own get_user_pages_fast routines.

    The RCU page table free logic coupled with an IPI broadcast on THP split
    (which is a rare event), allows one to protect a page table walker by
    merely disabling the interrupts during the walk.

    This patch provides a general RCU implementation of get_user_pages_fast
    that can be used by architectures that perform hardware broadcast of TLB
    invalidations.

    It is based heavily on the PowerPC implementation by Nick Piggin.

    [akpm@linux-foundation.org: various comment fixes]
    Signed-off-by: Steve Capper
    Tested-by: Dann Frazier
    Reviewed-by: Catalin Marinas
    Acked-by: Hugh Dickins
    Cc: Russell King
    Cc: Mark Rutland
    Cc: Mel Gorman
    Cc: Will Deacon
    Cc: Christoffer Dall
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steve Capper
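
    A structural, heavily stubbed sketch of the pattern described above: the
    walker brackets the lockless page-table walk with an interrupt-off
    section, pins what it can and lets the caller fall back to the slow path
    for the rest. local_irq_save()/local_irq_restore() and gup_pte_range()
    are userspace stubs here, not the arm/arm64 implementation.

    #include <stdio.h>

    typedef unsigned long irq_flags_t;

    static void local_irq_save(irq_flags_t *f)   { *f = 0; }    /* stub */
    static void local_irq_restore(irq_flags_t f) { (void)f; }   /* stub */

    /* stub walker: pretend every page in the range can be pinned locklessly */
    static void gup_pte_range(unsigned long start, unsigned long end,
                              void **pages, int *nr)
    {
            for (; start < end; start += 4096)
                    pages[(*nr)++] = (void *)start;  /* would be a pinned struct page */
    }

    static int __get_user_pages_fast(unsigned long start, int nr_pages, void **pages)
    {
            unsigned long end = start + (unsigned long)nr_pages * 4096;
            irq_flags_t flags;
            int nr = 0;

            /* no locks taken: the irq-off section is what keeps page-table pages
             * and THPs stable while they are walked */
            local_irq_save(&flags);
            gup_pte_range(start, end, pages, &nr);
            local_irq_restore(flags);

            return nr;   /* the caller falls back to the slow path for the rest */
    }

    int main(void)
    {
            void *pages[4];

            printf("pinned %d pages\n", __get_user_pages_fast(0x1000, 4, pages));
            return 0;
    }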
     
  • Remove 3 brace coding style for any arm of this statement

    Signed-off-by: Paul McQuade
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul McQuade
     
  • WARNING: Prefer: pr_err(... to printk(KERN_ERR ...

    [akpm@linux-foundation.org: remove KERN_ERR]
    Signed-off-by: Paul McQuade
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul McQuade
     
  • Replace asm. headers with linux/headers:

    Signed-off-by: Paul McQuade
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul McQuade
     
  • Signed-off-by: Paul McQuade
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul McQuade
     
  • "WARNING: Use #include instead of "

    Signed-off-by: Paul McQuade
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul McQuade
     
  • memcg_can_account_kmem() returns true iff

    !mem_cgroup_disabled() && !mem_cgroup_is_root(memcg) &&
    memcg_kmem_is_active(memcg);

    To begin with, the !mem_cgroup_is_root(memcg) check is useless, because
    one can't enable kmem accounting for the root cgroup (mem_cgroup_write()
    returns EINVAL on an attempt to set the limit on the root cgroup).

    Furthermore, the !mem_cgroup_disabled() check also seems to be redundant.
    The point is memcg_can_account_kmem() is called from three places:
    mem_cgroup_slabinfo_read(), __memcg_kmem_get_cache(), and
    __memcg_kmem_newpage_charge(). The latter two functions are only invoked
    if memcg_kmem_enabled() returns true, which implies that the memory cgroup
    subsystem is enabled. And mem_cgroup_slabinfo_read() shows the output of
    memory.kmem.slabinfo, which won't exist if the memory cgroup is completely
    disabled.

    So let's substitute all the calls to memcg_can_account_kmem() with plain
    memcg_kmem_is_active(), and kill the former.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • In a memcg with even just moderate cache pressure, success rates for
    transparent huge page allocations drop to zero, wasting a lot of effort
    that the allocator puts into assembling these pages.

    The reason for this is that the memcg reclaim code was never designed for
    higher-order charges. It reclaims in small batches until there is room
    for at least one page. Huge page charges only succeed when these batches
    add up over a series of huge faults, which is unlikely under any
    significant load involving order-0 allocations in the group.

    Remove that loop on the memcg side in favor of passing the actual reclaim
    goal to direct reclaim, which is already set up and optimized to meet
    higher-order goals efficiently.

    This brings memcg's THP policy in line with the system policy: if the
    allocator painstakingly assembles a hugepage, memcg will at least make an
    honest effort to charge it. As a result, transparent hugepage allocation
    rates amid cache activity are drastically improved:

                                  vanilla                patched
    pgalloc              4717530.80 ( +0.00%)   4451376.40 (  -5.64%)
    pgfault               491370.60 ( +0.00%)    225477.40 ( -54.11%)
    pgmajfault                  2.00 ( +0.00%)         1.80 (  -6.67%)
    thp_fault_alloc             0.00 ( +0.00%)       531.60 (+100.00%)
    thp_fault_fallback        749.00 ( +0.00%)       217.40 ( -70.88%)

    [ Note: this may in turn increase memory consumption from internal
    fragmentation, which is an inherent risk of transparent hugepages.
    Some setups may have to adjust the memcg limits accordingly to
    accommodate this - or, if the machine is already packed to capacity,
    disable the transparent huge page feature. ]

    Signed-off-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Dave Hansen
    Cc: Greg Thelen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When attempting to charge pages, we first charge the memory counter and
    then the memory+swap counter. If one of the counters is at its limit, we
    enter reclaim, but if it's the memory+swap counter, reclaim shouldn't swap
    because that wouldn't change the situation. However, if the counters have
    the same limits, we never get to the memory+swap limit. To know whether
    reclaim should swap or not, there is a state flag that indicates whether
    the limits are equal and whether hitting the memory limit implies hitting
    the memory+swap limit.

    Just try the memory+swap counter first.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Dave Hansen
    Cc: Greg Thelen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
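
    A hedged sketch of the ordering described above: charge the memory+swap
    counter first, and only treat swap as a viable reclaim strategy when it
    is the plain memory counter that is full. The counter struct and the
    try_charge()/may_swap names are illustrative, not the memcontrol.c code.

    #include <stdbool.h>
    #include <stdio.h>

    struct counter { long usage, limit; };

    static bool try_charge_counter(struct counter *c, long nr)
    {
            if (c->usage + nr > c->limit)
                    return false;
            c->usage += nr;
            return true;
    }

    static bool try_charge(struct counter *mem, struct counter *memsw,
                           long nr, bool *may_swap)
    {
            *may_swap = true;
            if (!try_charge_counter(memsw, nr)) {
                    *may_swap = false;       /* memory+swap is full: swapping can't help */
                    return false;
            }
            if (!try_charge_counter(mem, nr)) {
                    memsw->usage -= nr;      /* roll back; reclaim may swap this time */
                    return false;
            }
            return true;
    }

    int main(void)
    {
            struct counter mem = { 0, 100 }, memsw = { 0, 100 };
            bool may_swap;

            printf("charged: %d\n", try_charge(&mem, &memsw, 64, &may_swap));
            printf("charged: %d (may_swap=%d)\n",
                   try_charge(&mem, &memsw, 64, &may_swap), may_swap);
            return 0;
    }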
     
  • free_pages_and_swap_cache limits release_pages to PAGEVEC_SIZE chunks.
    This is not a big deal for the normal release path, but it completely
    kills memcg uncharge batching, which is there to reduce res_counter
    spin_lock contention. Dave has noticed this with his page fault
    scalability test case on a large machine where the lock was basically
    dominating on all CPUs:

    80.18% 80.18% [kernel] [k] _raw_spin_lock
    |
    --- _raw_spin_lock
    |
    |--66.59%-- res_counter_uncharge_until
    | res_counter_uncharge
    | uncharge_batch
    | uncharge_list
    | mem_cgroup_uncharge_list
    | release_pages
    | free_pages_and_swap_cache
    | tlb_flush_mmu_free
    | |
    | |--90.12%-- unmap_single_vma
    | | unmap_vmas
    | | unmap_region
    | | do_munmap
    | | vm_munmap
    | | sys_munmap
    | | system_call_fastpath
    | | __GI___munmap
    | |
    | --9.88%-- tlb_flush_mmu
    | tlb_finish_mmu
    | unmap_region
    | do_munmap
    | vm_munmap
    | sys_munmap
    | system_call_fastpath
    | __GI___munmap

    In his case the load was running in the root memcg and that part has been
    handled by reverting 05b843012335 ("mm: memcontrol: use root_mem_cgroup
    res_counter") because this is a clear regression, but the problem remains
    inside dedicated memcgs.

    There is no reason to limit release_pages to PAGEVEC_SIZE batches other
    than lru_lock held times. This logic, however, can be moved inside the
    function. mem_cgroup_uncharge_list and free_hot_cold_page_list do not
    hold any lock for the whole pages_to_free list so it is safe to call them
    in a single run.

    The release_pages() code was previously breaking the lru_lock each
    PAGEVEC_SIZE pages (ie, 14 pages). However this code has no usage of
    pagevecs so switch to breaking the lock at least every SWAP_CLUSTER_MAX
    (32) pages. This means that the lock acquisition frequency is
    approximately halved and the max hold times are approximately doubled.

    The now unneeded batching is removed from free_pages_and_swap_cache().

    Also update the grossly out-of-date release_pages documentation.

    Signed-off-by: Michal Hocko
    Signed-off-by: Johannes Weiner
    Reported-by: Dave Hansen
    Cc: Vladimir Davydov
    Cc: Greg Thelen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
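
    A small sketch of the lock-breaking pattern described above:
    release_pages() itself drops and re-takes the LRU lock every
    SWAP_CLUSTER_MAX pages, so callers no longer have to pre-chop their lists
    into PAGEVEC_SIZE chunks and uncharge batching can cover the whole list.
    The mutex and the int "pages" below are placeholders, not the mm code.

    #include <pthread.h>
    #include <stdio.h>

    #define SWAP_CLUSTER_MAX 32

    static pthread_mutex_t lru_lock = PTHREAD_MUTEX_INITIALIZER;

    static void release_pages(int *pages, int nr)
    {
            int batch = 0;

            pthread_mutex_lock(&lru_lock);
            for (int i = 0; i < nr; i++) {
                    /* break the lock periodically to bound hold times, instead of
                     * forcing every caller to batch by PAGEVEC_SIZE */
                    if (++batch > SWAP_CLUSTER_MAX) {
                            pthread_mutex_unlock(&lru_lock);
                            batch = 0;
                            pthread_mutex_lock(&lru_lock);
                    }
                    pages[i] = 0;            /* "free" the page under the lock */
            }
            pthread_mutex_unlock(&lru_lock);
            printf("released %d pages\n", nr);
    }

    int main(void)
    {
            int pages[100];

            release_pages(pages, 100);
            return 0;
    }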
     
  • cat /sys/.../pools followed by removal of the device leads to:

    |======================================================
    |[ INFO: possible circular locking dependency detected ]
    |3.17.0-rc4+ #1498 Not tainted
    |-------------------------------------------------------
    |rmmod/2505 is trying to acquire lock:
    | (s_active#28){++++.+}, at: [] kernfs_remove_by_name_ns+0x3c/0x88
    |
    |but task is already holding lock:
    | (pools_lock){+.+.+.}, at: [] dma_pool_destroy+0x18/0x17c
    |
    |which lock already depends on the new lock.
    |the existing dependency chain (in reverse order) is:
    |
    |-> #1 (pools_lock){+.+.+.}:
    | [] show_pools+0x30/0xf8
    | [] dev_attr_show+0x1c/0x48
    | [] sysfs_kf_seq_show+0x88/0x10c
    | [] kernfs_seq_show+0x24/0x28
    | [] seq_read+0x1b8/0x480
    | [] vfs_read+0x8c/0x148
    | [] SyS_read+0x40/0x8c
    | [] ret_fast_syscall+0x0/0x48
    |
    |-> #0 (s_active#28){++++.+}:
    | [] __kernfs_remove+0x258/0x2ec
    | [] kernfs_remove_by_name_ns+0x3c/0x88
    | [] dma_pool_destroy+0x148/0x17c
    | [] hcd_buffer_destroy+0x20/0x34
    | [] usb_remove_hcd+0x110/0x1a4

    The problem is the lock order of pools_lock and kernfs_mutex in the
    dma_pool_destroy() and show_pools() call paths.

    This patch breaks out the creation of the sysfs file outside of the
    pools_lock mutex. The newly added pools_reg_lock ensures that there is no
    race between the create and destroy code paths in terms of whether or not
    the sysfs file has to be deleted (and whether it was deleted before we
    try to create a new one) and what to do if device_create_file() failed.

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sebastian Andrzej Siewior
     
  • While growing the per-memcg cache arrays, we jump between memcontrol.c
    and slab_common.c in a weird way:

    memcg_alloc_cache_id - memcontrol.c
    memcg_update_all_caches - slab_common.c
    memcg_update_cache_size - memcontrol.c

    There's absolutely no reason why memcg_update_cache_size can't live on the
    slab's side though. So let's move it there and settle it comfortably amid
    per-memcg cache allocation functions.

    Besides, this patch cleans this function up a bit, removing all the
    useless comments from it, and renames it to memcg_update_cache_params to
    conform to memcg_alloc/free_cache_params, which we already have in
    slab_common.c.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Christoph Lameter
    Cc: Glauber Costa
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov