07 Oct, 2012

1 commit

  • Pull SLAB changes from Pekka Enberg:
    "New and noteworthy:

    * More SLAB allocator unification patches from Christoph Lameter and
    others. This paves the way for slab memcg patches that hopefully
    will land in v3.8.

    * SLAB tracing improvements from Ezequiel Garcia.

    * Kernel tainting upon SLAB corruption from Dave Jones.

    * Miscellaneous SLAB allocator bug fixes and improvements from various
    people."

    * 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux: (43 commits)
    slab: Fix build failure in __kmem_cache_create()
    slub: init_kmem_cache_cpus() and put_cpu_partial() can be static
    mm/slab: Fix kmem_cache_alloc_node_trace() declaration
    Revert "mm/slab: Fix kmem_cache_alloc_node_trace() declaration"
    mm, slob: fix build breakage in __kmalloc_node_track_caller
    mm/slab: Fix kmem_cache_alloc_node_trace() declaration
    mm/slab: Fix typo _RET_IP -> _RET_IP_
    mm, slub: Rename slab_alloc() -> slab_alloc_node() to match SLAB
    mm, slab: Rename __cache_alloc() -> slab_alloc()
    mm, slab: Match SLAB and SLUB kmem_cache_alloc_xxx_trace() prototype
    mm, slab: Replace 'caller' type, void* -> unsigned long
    mm, slob: Add support for kmalloc_track_caller()
    mm, slab: Remove silly function slab_buffer_size()
    mm, slob: Use NUMA_NO_NODE instead of -1
    mm, sl[au]b: Taint kernel when we detect a corrupted slab
    slab: Only define slab_error for DEBUG
    slab: fix the DEADLOCK issue on l3 alien lock
    slub: Zero initial memory segment for kmem_cache and kmem_cache_node
    Revert "mm/sl[aou]b: Move sysfs_slab_add to common"
    mm/sl[aou]b: Move kmem_cache refcounting to common code
    ...

    Linus Torvalds
     

03 Oct, 2012

10 commits

  • Fix up a trivial conflict with NUMA_NO_NODE cleanups.

    Conflicts:
    mm/slob.c

    Signed-off-by: Pekka Enberg

    Pekka Enberg
     
  • Fix build failure with CONFIG_DEBUG_SLAB=y && CONFIG_DEBUG_PAGEALLOC=y caused
    by commit 8a13a4cc "mm/sl[aou]b: Shrink __kmem_cache_create() parameter lists".

    mm/slab.c: In function '__kmem_cache_create':
    mm/slab.c:2474: error: 'align' undeclared (first use in this function)
    mm/slab.c:2474: error: (Each undeclared identifier is reported only once
    mm/slab.c:2474: error: for each function it appears in.)
    make[1]: *** [mm/slab.o] Error 1
    make: *** [mm] Error 2

    Acked-by: Christoph Lameter
    Signed-off-by: Tetsuo Handa
    Signed-off-by: Pekka Enberg

    Tetsuo Handa
     
  • Acked-by: Glauber Costa
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Signed-off-by: Fengguang Wu
    Signed-off-by: Pekka Enberg

    Fengguang Wu
     
  • Pull frontswap update from Konrad Rzeszutek Wilk:
    "Features:
    - Support exclusive get if the backend is capable.
    Bug-fixes:
    - Fix compile warnings
    - Add comments/cleanup doc
    - Fix wrong if condition"

    * tag 'stable/for-linus-3.7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm:
    frontswap: support exclusive gets if tmem backend is capable
    mm: frontswap: fix a wrong if condition in frontswap_shrink
    mm/frontswap: fix uninit'ed variable warning
    mm/frontswap: cleanup doc and comment error
    mm: frontswap: remove unneeded headers

    Linus Torvalds
     
  • Pull vfs update from Al Viro:

    - big one - consolidation of descriptor-related logics; almost all of
    that is moved to fs/file.c

    (BTW, I'm seriously tempted to rename the result to fd.c. As it is,
    we have a situation where file_table.c is about handling of struct
    file and file.c is about handling of descriptor tables; the reasons
    are historical - file_table.c used to be about the static array of
    struct file we had way back.)

    A lot of stray ends got cleaned up and converted to saner primitives;
    the disgusting mess in android/binder.c is still disgusting, but at
    least doesn't poke so much in descriptor table guts anymore. A bunch
    of relatively minor races got fixed in the process, plus an ext4
    struct file leak.

    - related thing - fget_light() partially unuglified; see fdget() in
    there (and yes, it generates code as good as what we used to have; a
    short usage sketch follows this entry).

    - also related - bits of Cyrill's procfs stuff that got entangled into
    that work; _not_ all of it, just the initial move to fs/proc/fd.c and
    switch of fdinfo to seq_file.

    - Alex's fs/coredump.c split-off - the same story; it had been easier
    to take that commit than mess with conflicts. The rest is a separate
    pile, this was just a mechanical code movement.

    - a few misc patches all over the place. Not all for this cycle,
    there'll be more (and quite a few currently sit in akpm's tree)."

    Fix up trivial conflicts in the android binder driver, and some fairly
    simple conflicts due to two different changes to the sock_alloc_file()
    interface ("take descriptor handling from sock_alloc_file() to callers"
    vs "net: Providing protocol type via system.sockprotoname xattr of
    /proc/PID/fd entries" adding a dentry name to the socket)

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
    MAX_LFS_FILESIZE should be a loff_t
    compat: fs: Generic compat_sys_sendfile implementation
    fs: push rcu_barrier() from deactivate_locked_super() to filesystems
    btrfs: reada_extent doesn't need kref for refcount
    coredump: move core dump functionality into its own file
    coredump: prevent double-free on an error path in core dumper
    usb/gadget: fix misannotations
    fcntl: fix misannotations
    ceph: don't abuse d_delete() on failure exits
    hypfs: ->d_parent is never NULL or negative
    vfs: delete surplus inode NULL check
    switch simple cases of fget_light to fdget
    new helpers: fdget()/fdput()
    switch o2hb_region_dev_write() to fget_light()
    proc_map_files_readdir(): don't bother with grabbing files
    make get_file() return its argument
    vhost_set_vring(): turn pollstart/pollstop into bool
    switch prctl_set_mm_exe_file() to fget_light()
    switch xfs_find_handle() to fget_light()
    switch xfs_swapext() to fget_light()
    ...

    Linus Torvalds
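
    As a quick illustration of the fdget()/fdput() helpers mentioned
    above, here is a minimal sketch; do_something_locked() is a
    placeholder, and the struct fd details are simplified from fs/file.h:

        /* sketch: the pattern that replaces fget_light()/fput_light() */
        static int frob_fd(unsigned int fd)
        {
            struct fd f = fdget(fd);
            int err;

            if (!f.file)
                return -EBADF;
            err = do_something_locked(f.file);  /* placeholder */
            fdput(f);  /* drops the reference only if fdget() took one */
            return err;
        }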
     
  • Pull cgroup hierarchy update from Tejun Heo:
    "Currently, different cgroup subsystems handle nested cgroups
    completely differently. There's no consistency among subsystems and
    the behaviors often are outright broken.

    People at least seem to agree that the broken hierarchy behaviors need
    to be weeded out if any progress is to be made on this front, and that
    the fallout from deprecating the broken behaviors should be
    acceptable, especially given that the current behaviors don't make
    much sense when nested.

    This patch makes cgroup emit warning messages if cgroups for
    subsystems with broken hierarchy behavior are nested, to prepare for
    fixing them in the future. This was put in a separate branch because
    more related changes were expected (they didn't make it this round)
    and the memory cgroup wanted to pull in this and make changes on top."
    (A sketch of the subsystem marking follows this entry.)

    * 'for-3.7-hierarchy' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them

    Linus Torvalds
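
    Schematically, a controller opts in to the new warning by marking
    itself; a minimal sketch, assuming the field name from the commit
    subject ("mark subsystems with broken hierarchy support"):

        /* sketch: subsystems with broken nesting behavior set this flag,
         * and cgroup core whines once when such a controller is enabled
         * on a nested cgroup */
        struct cgroup_subsys example_subsys = {
            .name             = "example",
            .broken_hierarchy = true,   /* nested behavior not sane yet */
        };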
     
  • Pull cgroup updates from Tejun Heo:

    - xattr support added. The implementation is shared with tmpfs. The
    usage is restricted and intended to be used to manage per-cgroup
    metadata by system software. tmpfs changes are routed through this
    branch with Hugh's permission.

    - cgroup subsystem ID handling simplified.

    * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: Define CGROUP_SUBSYS_COUNT according the configuration
    cgroup: Assign subsystem IDs during compile time
    cgroup: Do not depend on a given order when populating the subsys array
    cgroup: Wrap subsystem selection macro
    cgroup: Remove CGROUP_BUILTIN_SUBSYS_COUNT
    cgroup: net_prio: Do not define task_netpioidx() when not selected
    cgroup: net_cls: Do not define task_cls_classid() when not selected
    cgroup: net_cls: Move sock_update_classid() declaration to cls_cgroup.h
    cgroup: trivial fixes for Documentation/cgroups/cgroups.txt
    xattr: mark variable as uninitialized to make both gcc and smatch happy
    fs: add missing documentation to simple_xattr functions
    cgroup: add documentation on extended attributes usage
    cgroup: rename subsys_bits to subsys_mask
    cgroup: add xattr support
    cgroup: revise how we re-populate root directory
    xattr: extract simple_xattr code from tmpfs

    Linus Torvalds
     
  • Pull workqueue changes from Tejun Heo:
    "This is workqueue updates for v3.7-rc1. A lot of activities this
    round including considerable API and behavior cleanups.

    * delayed_work combines a timer and a work item. The handling of the
    timer part has always been a bit clunky, leading to a confusing
    cancellation API with weird corner-case behaviors. delayed_work is
    updated to use the new IRQ-safe timer, and cancellation now works as
    expected.

    * Another deficiency of delayed_work was the lack of a counterpart to
    mod_timer(), which led to cancel+queue combinations or open-coded
    timer+work usages. mod_delayed_work[_on]() are added (a usage sketch
    follows this entry).

    These two delayed_work changes make delayed_work provide an interface
    and behavior like a timer that executes in process context.

    * A work item could be executed concurrently on multiple CPUs, which
    is rather unintuitive and made flush_work() behavior confusing and
    half-broken under certain circumstances. This problem doesn't
    exist for non-reentrant workqueues. While the non-reentrancy check
    isn't free, the overhead is incurred only when a work item bounces
    across different CPUs, and even in a simulated pathological scenario
    the overhead isn't too high.

    All workqueues are made non-reentrant. This removes the
    distinction between flush_[delayed_]work() and
    flush_[delayed_]work_sync(). The former is now as strong as the
    latter, and the specified work item is guaranteed to have finished
    execution of any previous queueing on return.

    * In addition to the various bug fixes, Lai redid and simplified CPU
    hotplug handling significantly.

    * Joonsoo introduced system_highpri_wq and used it during CPU
    hotplug.

    There are two merge commits - one to pull in IRQ safe timer from
    tip/timers/core and the other to pull in CPU hotplug fixes from
    wq/for-3.6-fixes as Lai's hotplug restructuring depended on them."

    Fixed a number of trivial conflicts, but the more interesting conflicts
    were silent ones where the deprecated interfaces had been used by new
    code in the merge window, and thus didn't cause any real data conflicts.

    Tejun pointed out a few of them, I fixed a couple more.

    * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits)
    workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending()
    workqueue: use cwq_set_max_active() helper for workqueue_set_max_active()
    workqueue: introduce cwq_set_max_active() helper for thaw_workqueues()
    workqueue: remove @delayed from cwq_dec_nr_in_flight()
    workqueue: fix possible stall on try_to_grab_pending() of a delayed work item
    workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback()
    workqueue: use __cpuinit instead of __devinit for cpu callbacks
    workqueue: rename manager_mutex to assoc_mutex
    workqueue: WORKER_REBIND is no longer necessary for idle rebinding
    workqueue: WORKER_REBIND is no longer necessary for busy rebinding
    workqueue: reimplement idle worker rebinding
    workqueue: deprecate __cancel_delayed_work()
    workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()
    workqueue: use mod_delayed_work() instead of __cancel + queue
    workqueue: use irqsafe timer for delayed_work
    workqueue: clean up delayed_work initializers and add missing one
    workqueue: make deferrable delayed_work initializer names consistent
    workqueue: cosmetic whitespace updates for macro definitions
    workqueue: deprecate system_nrt[_freezable]_wq
    workqueue: deprecate flush[_delayed]_work_sync()
    ...

    Linus Torvalds
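
    A small sketch of the mod_delayed_work() addition described above,
    replacing the old cancel+queue pattern (my_dwork, my_handler and the
    100ms period are illustrative):

        #include <linux/workqueue.h>

        static void my_handler(struct work_struct *work);
        static DECLARE_DELAYED_WORK(my_dwork, my_handler);

        static void rearm_every_100ms(void)
        {
            /* old pattern: __cancel_delayed_work() followed by
             * queue_delayed_work(); new: one call that behaves like
             * mod_timer() for work items */
            mod_delayed_work(system_wq, &my_dwork, msecs_to_jiffies(100));
        }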
     

02 Oct, 2012

2 commits

  • Pull RCU changes from Ingo Molnar:

    0. 'idle RCU':

    Adds RCU APIs that allow non-idle tasks to enter RCU idle mode and
    provides x86 code to make use of them, allowing RCU to treat
    user-mode execution as an extended quiescent state when the new
    RCU_USER_QS kernel configuration parameter is specified. (Work is
    in progress to port this to a few other architectures, but is not
    part of this series.)

    1. A fix for a latent bug that has been in RCU ever since the addition
    of CPU stall warnings. This bug results in false-positive stall
    warnings, but thus far only on embedded systems with severely
    cut-down userspace configurations.

    2. Further reductions in latency spikes for huge systems, along with
    additional boot-time adaptation to the actual hardware.

    This is a large change, as it moves RCU grace-period initialization
    and cleanup, along with quiescent-state forcing, from softirq to a
    kthread. However, it appears to be in quite good shape (famous
    last words).

    3. Updates to documentation and rcutorture, the latter category
    including keeping statistics on CPU-hotplug latencies and fixing
    some initialization-time races.

    4. CPU-hotplug fixes and improvements.

    5. Idle-loop fixes that were omitted on an earlier submission.

    6. Miscellaneous fixes and improvements

    In certain RCU configurations new kernel threads will show up (rcu_bh,
    rcu_sched), showing RCU processing overhead. (A sketch of the
    idle-loop API additions follows this entry.)

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (90 commits)
    rcu: Apply micro-optimization and int/bool fixes to RCU's idle handling
    rcu: Userspace RCU extended QS selftest
    x86: Exit RCU extended QS on notify resume
    x86: Use the new schedule_user API on userspace preemption
    rcu: Exit RCU extended QS on user preemption
    rcu: Exit RCU extended QS on kernel preemption after irq/exception
    x86: Exception hooks for userspace RCU extended QS
    x86: Unspaghettize do_general_protection()
    x86: Syscall hooks for userspace RCU extended QS
    rcu: Switch task's syscall hooks on context switch
    rcu: Ignore userspace extended quiescent state by default
    rcu: Allow rcu_user_enter()/exit() to nest
    rcu: Settle config for userspace extended quiescent state
    rcu: Make RCU_FAST_NO_HZ handle adaptive ticks
    rcu: New rcu_user_enter_after_irq() and rcu_user_exit_after_irq() APIs
    rcu: New rcu_user_enter() and rcu_user_exit() APIs
    ia64: Add missing RCU idle APIs on idle loop
    xtensa: Add missing RCU idle APIs on idle loop
    score: Add missing RCU idle APIs on idle loop
    parisc: Add missing RCU idle APIs on idle loop
    ...

    Linus Torvalds
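
    The "Add missing RCU idle APIs on idle loop" commits in the shortlog
    wrap each architecture's idle loop in the existing rcu_idle_enter()/
    rcu_idle_exit() pair; schematically (arch_idle_insn() stands in for
    the arch-specific wait instruction):

        /* sketch: idle is an extended quiescent state for RCU */
        static void cpu_idle_loop(void)
        {
            while (1) {
                rcu_idle_enter();          /* CPU enters RCU-idle */
                while (!need_resched())
                    arch_idle_insn();      /* placeholder */
                rcu_idle_exit();           /* CPU leaves RCU-idle */
                schedule();
            }
        }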
     
  • Pull the trivial tree from Jiri Kosina:
    "Tiny usual fixes all over the place"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
    doc: fix old config name of kprobetrace
    fs/fs-writeback.c: cleanup riteback_sb_inodes kerneldoc
    btrfs: fix the commment for the action flags in delayed-ref.h
    btrfs: fix trivial typo for the comment of BTRFS_FREE_INO_OBJECTID
    vfs: fix kerneldoc for generic_fh_to_parent()
    treewide: fix comment/printk/variable typos
    ipr: fix small coding style issues
    doc: fix broken utf8 encoding
    nfs: comment fix
    platform/x86: fix asus_laptop.wled_type module parameter
    mfd: printk/comment fixes
    doc: getdelays.c: remember to close() socket on error in create_nl_socket()
    doc: aliasing-test: close fd on write error
    mmc: fix comment typos
    dma: fix comments
    spi: fix comment/printk typos in spi
    Coccinelle: fix typo in memdup_user.cocci
    tmiofb: missing NULL pointer checks
    tools: perf: Fix typo in tools/perf
    tools/testing: fix comment / output typos
    ...

    Linus Torvalds
     

28 Sep, 2012

1 commit

    Speculative pagecache lookups can elevate the refcount from under us,
    so avoid the false positive. If the refcount is < 2 we'll be notified
    by a VM_BUG_ON in put_page_testzero, as there are two
    put_page(src_page) calls in a row before returning from this function.
    (A hedged sketch of the change follows this entry.)

    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Rik van Riel
    Reviewed-by: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Petr Holasek
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
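
    A hedged sketch of the kind of change described (the exact context in
    the collapse path is not shown here; treat the before/after as an
    assumption based on the message above, not the verified patch):

        /* before: false positive if a speculative pagecache lookup has
         * transiently elevated the refcount */
        VM_BUG_ON(page_count(src_page) != 2);
        /* after: don't assert an exact count; a refcount below 2 is
         * still caught by the VM_BUG_ON in put_page_testzero() on the
         * second of the two put_page(src_page) calls */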
     

26 Sep, 2012

4 commits

  • On Sat, 8 Sep 2012, Ezequiel Garcia wrote:

    > @@ -454,15 +455,35 @@ void *__kmalloc_node(size_t size, gfp_t gfp, int node)
    > gfp |= __GFP_COMP;
    > ret = slob_new_pages(gfp, order, node);
    >
    > - trace_kmalloc_node(_RET_IP_, ret,
    > + trace_kmalloc_node(caller, ret,
    > size, PAGE_SIZE << order, gfp, node);
    > }
    >
    > kmemleak_alloc(ret, size, 1, gfp);
    > return ret;
    > }
    > +
    > +void *__kmalloc_node(size_t size, gfp_t gfp, int node)
    > +{
    > + return __do_kmalloc_node(size, gfp, node, _RET_IP_);
    > +}
    > EXPORT_SYMBOL(__kmalloc_node);
    >
    > +#ifdef CONFIG_TRACING
    > +void *__kmalloc_track_caller(size_t size, gfp_t gfp, unsigned long caller)
    > +{
    > + return __do_kmalloc_node(size, gfp, NUMA_NO_NODE, caller);
    > +}
    > +
    > +#ifdef CONFIG_NUMA
    > +void *__kmalloc_node_track_caller(size_t size, gfp_t gfpflags,
    > + int node, unsigned long caller)
    > +{
    > + return __do_kmalloc_node(size, gfp, node, caller);
    > +}
    > +#endif

    This breaks Pekka's slab/next tree with this:

    mm/slob.c: In function '__kmalloc_node_track_caller':
    mm/slob.c:488: error: 'gfp' undeclared (first use in this function)
    mm/slob.c:488: error: (Each undeclared identifier is reported only once
    mm/slob.c:488: error: for each function it appears in.)

    mm, slob: fix build breakage in __kmalloc_node_track_caller

    "mm, slob: Add support for kmalloc_track_caller()" breaks the build
    because gfp is undeclared. Fix it (the corrected helper is sketched
    after this entry).

    Acked-by: Ezequiel Garcia
    Signed-off-by: David Rientjes
    Signed-off-by: Pekka Enberg

    David Rientjes
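
    The fix is a one-identifier change: the quoted hunk declares the
    parameter gfpflags but passes the undeclared gfp. The corrected
    helper, per the report above:

        #ifdef CONFIG_NUMA
        void *__kmalloc_node_track_caller(size_t size, gfp_t gfpflags,
                                          int node, unsigned long caller)
        {
            /* pass the declared parameter, not the undeclared 'gfp' */
            return __do_kmalloc_node(size, gfpflags, node, caller);
        }
        #endif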
     
  • The bug was introduced in commit 4052147c0afa ("mm, slab: Match SLAB
    and SLUB kmem_cache_alloc_xxx_trace() prototype").

    Reported-by: Fengguang Wu
    Signed-off-by: Ezequiel Garcia
    Signed-off-by: Pekka Enberg

    Ezequiel Garcia
     
  • The bug was introduced by commit 7c0cb9c64f83 ("mm, slab: Replace
    'caller' type, void* -> unsigned long").

    Reported-by: Fengguang Wu
    Signed-off-by: Ezequiel Garcia
    Signed-off-by: Pekka Enberg

    Ezequiel Garcia
     
  • Resolved conflict in kernel/sched/core.c using Peter Zijlstra's
    approach from https://lkml.org/lkml/2012/9/5/585.

    Paul E. McKenney
     

21 Sep, 2012

2 commits

  • Tmem, as originally specified, assumes that "get" operations
    performed on persistent pools never flush the page of data out
    of tmem on a successful get, waiting instead for a flush
    operation. This is intended to mimic the model of a swap
    disk, where a disk read is non-destructive. Unlike a
    disk, however, freeing up the RAM can be valuable. Over
    the years that frontswap was in the review process, several
    reviewers (and notably Hugh Dickins in 2010) pointed out that
    this would result, at least temporarily, in two copies of the
    data in RAM: one (compressed for zcache) copy in tmem,
    and one copy in the swap cache. We wondered if this could
    be done differently, at least optionally.

    This patch allows tmem backends to instruct the frontswap
    code that this backend performs exclusive gets. Zcache2
    already contains hooks to support this feature. Other
    backends are completely unaffected unless/until they are
    updated to support this feature.

    While it is not clear that exclusive gets are a performance
    win on all workloads at all times, this small patch allows for
    experimentation by backends (a hedged sketch follows this entry).

    P.S. Let's not quibble about the naming of "get" vs "read" vs
    "load" etc. The naming is currently horribly inconsistent between
    cleancache and frontswap and existing tmem backends, so will need
    to be straightened out as a separate patch. "Get" is used
    by the tmem architecture spec, existing backends, and
    all documentation and presentation material so I am
    using it in this patch.

    Signed-off-by: Dan Magenheimer
    Signed-off-by: Konrad Rzeszutek Wilk

    Dan Magenheimer
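
    A minimal sketch of the opt-in and its effect on a successful load;
    the hook name follows the commit subject, but the helper used to
    forget the swap slot is an assumption:

        static bool frontswap_exclusive_gets;   /* sketch */

        /* called by a capable backend (e.g. zcache2) at init time */
        void frontswap_tmem_exclusive_gets(bool enable)
        {
            frontswap_exclusive_gets = enable;
        }

        /* in the load path: a successful exclusive get removed the page
         * from tmem, so stop tracking this slot as frontswapped */
        if (ret == 0 && frontswap_exclusive_gets)
            frontswap_forget_slot(sis, offset); /* assumed helper */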
     
    pages_to_unuse is set to 0 to unuse all frontswap pages, but that
    doesn't happen, because a wrong condition in frontswap_shrink cancels
    it (sketched after this entry).

    -v2: Add a comment to explain the return value of __frontswap_shrink,
    as suggested by Dan Carpenter, thanks.

    Signed-off-by: Zhenzhong Duan
    Signed-off-by: Konrad Rzeszutek Wilk

    Zhenzhong Duan
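
    A hedged guess at the shape of the bug, based only on the description
    above: pages_to_unuse == 0 means "unuse everything", so it must not
    be treated as "nothing to do":

        /* frontswap_shrink(), schematically */
        ret = __frontswap_shrink(target_pages, &pages_to_unuse, &type);
        /* buggy condition: skipped try_to_unuse() exactly when
         * pages_to_unuse == 0, i.e. when all pages should be unused:
         *     if (ret == 0 && pages_to_unuse)
         * fixed: let the return value alone decide */
        if (ret == 0)
            try_to_unuse(type, true, pages_to_unuse);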
     

19 Sep, 2012

2 commits

    It doesn't seem worth adding a new taint flag for this, so just reuse
    the one from 'bad page' (see the sketch after this entry).

    Acked-by: Christoph Lameter # SLUB
    Acked-by: David Rientjes
    Signed-off-by: Dave Jones
    Signed-off-by: Pekka Enberg

    Dave Jones
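
    Schematically, the corruption-detection paths gain a one-liner;
    TAINT_BAD_PAGE is the reused 'bad page' flag, and the exact call
    sites are not shown:

        /* e.g. in SLUB's slab_err()/object_err() and SLAB's error path */
        add_taint(TAINT_BAD_PAGE);  /* taint the kernel on slab corruption */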
     
  • On Tue, 11 Sep 2012, Stephen Rothwell wrote:
    > After merging the final tree, today's linux-next build (sparc64 defconfig)
    > produced this warning:
    >
    > mm/slab.c:808:13: warning: '__slab_error' defined but not used [-Wunused-function]
    >
    > Introduced by commit 945cf2b6199b ("mm/sl[aou]b: Extract a common
    > function for kmem_cache_destroy"). All uses of slab_error() are now
    > guarded by DEBUG.

    There is no use case left for slab builds without DEBUG.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

18 Sep, 2012

4 commits

    There may be a bug when registering section info. For example, on my
    Itanium platform, the pfn range of node0 includes the other nodes, so
    other nodes' section info will be registered twice, and the memmap's
    page count will equal 3.

    node0: start_pfn=0x100, spanned_pfn=0x20fb00, present_pfn=0x7f8a3, => 0x000100-0x20fc00
    node1: start_pfn=0x80000, spanned_pfn=0x80000, present_pfn=0x80000, => 0x080000-0x100000
    node2: start_pfn=0x100000, spanned_pfn=0x80000, present_pfn=0x80000, => 0x100000-0x180000
    node3: start_pfn=0x180000, spanned_pfn=0x80000, present_pfn=0x80000, => 0x180000-0x200000

    free_all_bootmem_node()
    register_page_bootmem_info_node()
    register_page_bootmem_info_section()

    When hot-removing memory, we can't free the memmap's page because
    page_count() is 2 after put_page_bootmem(). (The fix is sketched
    after this entry.)

    sparse_remove_one_section()
    free_section_usemap()
    free_map_bootmem()
    put_page_bootmem()

    [akpm@linux-foundation.org: add code comment]
    Signed-off-by: Xishi Qiu
    Signed-off-by: Jiang Liu
    Acked-by: Mel Gorman
    Cc: "Luck, Tony"
    Cc: Yasuaki Ishimatsu
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    qiuxishi
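
    The fix amounts to registering only sections whose pages really
    belong to the node being walked; a sketch of the loop, with the node
    check as described above (per-arch details omitted):

        /* register_page_bootmem_info_node(), sketch of the section loop */
        for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
            /* node0's pfn range may span other nodes (see the Itanium
             * layout above); skip sections living on another node so
             * their info isn't registered twice */
            if (pfn_valid(pfn) && (pfn_to_nid(pfn) == node))
                register_page_bootmem_info_section(pfn);
        }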
     
    The heuristic method for buddy was introduced by commit 43506fad21ca
    ("mm/page_alloc.c: simplify calculation of combined index of adjacent
    buddy lists"). But the page address of the higher page's buddy was
    wrongly calculated, which causes page_is_buddy to fail forever. IOW,
    the heuristic method was effectively disabled by the wrong page
    address of the higher page's buddy.

    Calculating the page address of the higher page's buddy should be
    based on higher_page, offset by the difference between the index of
    the higher page's buddy and the index of the higher page (see the
    sketch after this entry).

    Signed-off-by: Haifeng Li
    Signed-off-by: Gavin Shan
    Reviewed-by: Michal Hocko
    Cc: KyongHo Cho
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: [2.6.38+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Haifeng
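
    The corrected calculation, sketched with the identifiers used in
    mm/page_alloc.c's __free_one_page():

        /* page_idx/order describe the page just freed */
        combined_idx = buddy_idx & page_idx;
        higher_page  = page + (combined_idx - page_idx);
        buddy_idx    = __find_buddy_index(combined_idx, order + 1);
        /* fix: the higher buddy sits at higher_page plus the index
         * difference, not at the previously miscomputed address */
        higher_buddy = higher_page + (buddy_idx - combined_idx);
        if (page_is_buddy(higher_page, higher_buddy, order + 1)) {
            /* queue the freed page at the list tail: an order+1
             * merge is likely soon */
        }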
     
    get_partial() currently does not check pfmemalloc_match(), meaning
    that pfmemalloc pages can leak to non-pfmemalloc users.
    This is a problem in the following situation. Assume that there is a
    request from normal allocation and there are no objects in the per-cpu
    cache and no node-partial slab.

    In this case, slab_alloc enters the slow path and new_slab_objects()
    is called, which may return a PFMEMALLOC page. As the current user is
    not allowed to access a PFMEMALLOC page, deactivate_slab() is called
    ([5091b74a: mm: slub: optimise the SLUB fast path to avoid pfmemalloc
    checks]) and an object from the PFMEMALLOC page is returned.

    Next time, when we get another request from normal allocation,
    slab_alloc() enters the slow-path and calls new_slab_objects(). In
    new_slab_objects(), we call get_partial() and get a partial slab which
    was just deactivated but is a pfmemalloc page. We extract one object
    from it and re-deactivate.

    "deactivate -> re-get in get_partial -> re-deactivate" occures repeatedly.

    As a result, access to PFMEMALLOC page is not properly restricted and it
    can cause a performance degradation due to frequent deactivation.
    deactivation frequently.

    This patch changes get_partial_node() to take pfmemalloc_match() into
    account and prevents the "deactivate -> re-get in get_partial()
    scenario. Instead, new_slab() is called.

    Signed-off-by: Joonsoo Kim
    Acked-by: David Rientjes
    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Chuck Lever
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
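
    Schematically, the change in get_partial_node(); locking and the rest
    of the loop body are omitted:

        /* skip partial slabs whose pfmemalloc state doesn't match this
         * allocation, so a normal request falls through to new_slab()
         * instead of re-acquiring a just-deactivated PFMEMALLOC slab */
        list_for_each_entry_safe(page, page2, &n->partial, lru) {
            if (!pfmemalloc_match(page, flags))
                continue;
            /* ... acquire_slab() and object extraction as before ... */
        }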
     
    An array cache also holds an object at index 0, so check that one as
    well (sketched after this entry).

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Chuck Lever
    Cc: David Rientjes
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
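
    Presumably the scan simply started at index 1; a hedged sketch (the
    surrounding function in mm/slab.c's pfmemalloc handling is not
    shown, and the loop shape is an assumption):

        /* entry 0 of an array cache is a valid object too */
        for (i = 0; i < ac->avail; i++) {   /* was: i = 1 */
            /* ... examine ac->entry[i] ... */
        }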