01 Feb, 2017

4 commits

  • commit e47483bca2cc59a4593b37a270b16ee42b1d9f08 upstream.

    Ganapatrao Kulkarni reported that the LTP test cpuset01 in stress mode
    triggers the OOM killer within a few seconds, despite lots of free memory. The
    test attempts to repeatedly fault in memory in one process in a cpuset,
    while changing allowed nodes of the cpuset between 0 and 1 in another
    process.

    The problem comes from insufficient protection against cpuset changes,
    which can cause get_page_from_freelist() to consider all zones as
    non-eligible due to nodemask and/or current->mems_allowed. This was
    masked in the past by sufficient retries, but since commit 682a3385e773
    ("mm, page_alloc: inline the fast path of the zonelist iterator") we set
    the preferred_zoneref once and don't iterate over the whole zonelist in
    further attempts, so the only eligible zones might lie in the zonelist
    before our starting point and we always miss them.

    A previous patch fixed this problem for current->mems_allowed. However,
    cpuset changes also update the task's mempolicy nodemask. The fix has
    two parts. We have to repeat the preferred_zoneref search when we
    detect a cpuset update by way of a seqcount, and we have to check the
    seqcount before considering OOM.
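
    The fix follows the kernel's cpuset seqcount pattern. As a hedged,
    simplified sketch of the slowpath (not the exact diff, with the
    reclaim/compaction details elided):

    retry_cpuset:
        cpuset_mems_cookie = read_mems_allowed_begin();
        /* re-do the preferred_zoneref lookup here, as described above */

        page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
        if (page)
            goto got_pg;

        /* Before declaring OOM, check for a racing cpuset/mempolicy update */
        if (read_mems_allowed_retry(cpuset_mems_cookie))
            goto retry_cpuset;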

    [akpm@linux-foundation.org: fix typo in comment]
    Link: http://lkml.kernel.org/r/20170120103843.24587-5-vbabka@suse.cz
    Fixes: c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a zonelist twice")
    Signed-off-by: Vlastimil Babka
    Reported-by: Ganapatrao Kulkarni
    Acked-by: Mel Gorman
    Acked-by: Hillf Danton
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     
  • commit 5ce9bfef1d27944c119a397a9d827bef795487ce upstream.

    This is a preparation for the following patch to make review simpler.
    While the primary motivation is a bug fix, this also simplifies the fast
    path, although the moved code is only enabled when cpusets are in use.

    Link: http://lkml.kernel.org/r/20170120103843.24587-4-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Hillf Danton
    Cc: Ganapatrao Kulkarni
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     
  • commit 16096c25bf0ca5d87e4fa6ec6108ba53feead212 upstream.

    Ganapatrao Kulkarni reported that the LTP test cpuset01 in stress mode
    triggers the OOM killer within a few seconds, despite lots of free memory. The
    test attempts to repeatedly fault in memory in one process in a cpuset,
    while changing allowed nodes of the cpuset between 0 and 1 in another
    process.

    One possible cause is that in the fast path we find the preferred
    zoneref according to current mems_allowed, so that it points to the
    middle of the zonelist, skipping e.g. zones of node 1 completely. If
    the mems_allowed is updated to contain only node 1, we never reach it in
    the zonelist, and trigger OOM before checking the cpuset_mems_cookie.

    This patch fixes the particular case by redoing the preferred zoneref
    search if we switch back to the original nodemask. The condition is
    also slightly changed so that when the last non-root cpuset is removed,
    we don't miss it.
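
    As a rough sketch of the idea (simplified from the actual patch), when
    the slowpath falls back from the mempolicy nodemask to the original
    one, the starting zoneref is recomputed rather than reused:

    if (unlikely(ac->nodemask != nodemask)) {
        ac->nodemask = nodemask;
        ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
                                        ac->high_zoneidx, ac->nodemask);
    }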

    Note that this is not a full fix, and more patches will follow.

    Link: http://lkml.kernel.org/r/20170120103843.24587-3-vbabka@suse.cz
    Fixes: 682a3385e773 ("mm, page_alloc: inline the fast path of the zonelist iterator")
    Signed-off-by: Vlastimil Babka
    Reported-by: Ganapatrao Kulkarni
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Acked-by: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     
  • commit ea57485af8f4221312a5a95d63c382b45e7840dc upstream.

    Patch series "fix premature OOM regression in 4.7+ due to cpuset races".

    This is v2 of my attempt to fix the recent report based on the LTP
    cpuset stress test [1]. The intention is for this to go to the stable
    4.9 LTSS kernels, as triggering repeated OOMs is not nice. That's why
    the patches try not to be too intrusive.

    Unfortunately, while investigating I found that modifying the testcase
    to use per-VMA policies instead of per-task policies brings the OOMs
    back, but that seems to be a much older and harder-to-fix problem. I
    have posted an RFC [2], but I believe that fixing the recent
    regressions has a higher priority.

    Longer term we might try to think about how to fix the cpuset mess in
    a better and less error-prone way. I was, for example, very surprised
    to learn that cpuset updates change not only task->mems_allowed but
    also the nodemask of mempolicies. Until now I expected the nodemask
    parameter of alloc_pages_nodemask() to be stable. I wonder why we then
    treat cpusets specially in get_page_from_freelist() and distinguish
    HARDWALL etc., when there's an unconditional intersection between
    mempolicy and cpuset. I would expect the nodemask adjustment to save
    overhead in g_p_f(), but that clearly doesn't happen in the current
    form. So we have both crazy complexity and overhead, AFAICS.

    [1] https://lkml.kernel.org/r/CAFpQJXUq-JuEP=QPidy4p_=FN0rkH5Z-kfB4qBvsf6jMS87Edg@mail.gmail.com
    [2] https://lkml.kernel.org/r/7c459f26-13a6-a817-e508-b65b903a8378@suse.cz

    This patch (of 4):

    Since commit c33d6c06f60f ("mm, page_alloc: avoid looking up the first
    zone in a zonelist twice") we have had a wrong check for a NULL
    preferred_zone, which can theoretically happen due to concurrent cpuset
    modification. We check the zoneref pointer, which is never NULL, when
    we should check the zone pointer. Also document this in the
    first_zones_zonelist() comment, per Michal Hocko.
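
    In outline (a hedged sketch rather than the literal diff), the check
    changes from testing the zoneref pointer to testing the zone it refers
    to:

    ac.preferred_zoneref = first_zones_zonelist(ac.zonelist,
                                    ac.high_zoneidx, ac.nodemask);
    if (!ac.preferred_zoneref->zone) {  /* was: if (!ac.preferred_zoneref) */
        page = NULL;
        goto no_zone;
    }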

    Fixes: c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a zonelist twice")
    Link: http://lkml.kernel.org/r/20170120103843.24587-2-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Hillf Danton
    Cc: Ganapatrao Kulkarni
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     

06 Jan, 2017

1 commit

  • commit a6de734bc002fe2027ccc074fbbd87d72957b7a4 upstream.

    Vlastimil Babka pointed out that commit 479f854a207c ("mm, page_alloc:
    defer debugging checks of pages allocated from the PCP") will allow the
    per-cpu list counter to be out of sync with the per-cpu list contents if
    a struct page is corrupted.

    The consequence is an infinite loop if the per-cpu lists get fully
    drained by free_pcppages_bulk because all the lists are empty but the
    count is positive. The infinite loop occurs here

    do {
        batch_free++;
        if (++migratetype == MIGRATE_PCPTYPES)
            migratetype = 0;
        list = &pcp->lists[migratetype];
    } while (list_empty(list));

    What the user sees is a bad page warning followed by a soft lockup with
    interrupts disabled in free_pcppages_bulk().

    This patch keeps the accounting in sync.
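
    The essence of the fix, shown here as an approximate sketch of
    free_pcppages_bulk(), is to decrement pcp->count for every page
    removed from the list, even when the page is then skipped as bad:

    page = list_last_entry(list, struct page, lru);
    /* must delete as __free_one_page list manipulates */
    list_del(&page->lru);
    pcp->count--;               /* keep count and list contents in sync */

    if (bulkfree_pcp_prepare(page))
        continue;               /* bad page: skipped but already counted */

    __free_one_page(page, page_to_pfn(page), zone, 0, mt);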

    Fixes: 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP")
    Link: http://lkml.kernel.org/r/20161202112951.23346-2-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Jesper Dangaard Brouer
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     

12 Nov, 2016

1 commit

    Commit 63f53dea0c98 ("mm: warn about allocations which stall for too
    long") mistakenly embedded a "\n" in the middle of the format string,
    resulting in strange output:

    [ 722.876655] kworker/0:1: page allocation stalls for 160001ms, order:0
    [ 722.876656] , mode:0x2400000(GFP_NOIO)
    [ 722.876657] CPU: 0 PID: 6966 Comm: kworker/0:1 Not tainted 4.8.0+ #69
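
    Illustratively (a hedged sketch, not the exact diff, with stall_ms
    standing in for the computed stall time), the problem and the fix come
    down to where the newline sits in the format string:

    /* before: the embedded "\n" splits one warning across two log lines */
    pr_warn("%s: page allocation stalls for %ums, order:%u\n, mode:%#x(%pGg)\n",
            current->comm, stall_ms, order, gfp_mask, &gfp_mask);

    /* after: a single trailing newline keeps the message on one line */
    pr_warn("%s: page allocation stalls for %ums, order:%u, mode:%#x(%pGg)\n",
            current->comm, stall_ms, order, gfp_mask, &gfp_mask);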

    Link: http://lkml.kernel.org/r/1476026219-7974-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     

02 Nov, 2016

1 commit


01 Nov, 2016

1 commit

    The stack frame size could grow too large when the plugin used long long
    on 32-bit architectures and the given function had too many basic blocks.

    The gcc warning was:

    drivers/pci/hotplug/ibmphp_ebda.c: In function 'ibmphp_access_ebda':
    drivers/pci/hotplug/ibmphp_ebda.c:409:1: warning: the frame size of 1108 bytes is larger than 1024 bytes [-Wframe-larger-than=]

    This switches latent_entropy from u64 to unsigned long.

    Thanks to PaX Team and Emese Revfy for the patch.

    Signed-off-by: Kees Cook

    Kees Cook
     

28 Oct, 2016

2 commits

  • Recent changes to printk require KERN_CONT uses to continue logging
    messages. So add KERN_CONT where necessary.
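
    For example (a minimal illustration rather than one of the actual
    hunks, with the counts[] array invented for the sketch), a message
    built up across several printk() calls now needs explicit continuation
    markers:

    printk(KERN_INFO "migratetype counts:");
    for (i = 0; i < MIGRATE_TYPES; i++)
        printk(KERN_CONT " %lu", counts[i]);    /* or pr_cont(...) */
    printk(KERN_CONT "\n");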

    [akpm@linux-foundation.org: coding-style fixes]
    Fixes: 4bcc595ccd80 ("printk: reinstate KERN_CONT for printing continuation lines")
    Link: http://lkml.kernel.org/r/c7df37c8665134654a17aaeb8b9f6ace1d6db58b.1476239034.git.joe@perches.com
    Reported-by: Mark Rutland
    Signed-off-by: Joe Perches
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • The per-zone waitqueues exist because of a scalability issue with the
    page waitqueues on some NUMA machines, but it turns out that they hurt
    normal loads, and now with the vmalloced stacks they also end up
    breaking gfs2 that uses a bit_wait on a stack object:

    wait_on_bit(&gh->gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE)

    where 'gh' can be a reference to the local variable 'mount_gh' on the
    stack of fill_super().

    The reason the per-zone hash table breaks for this case is that there is
    no "zone" for virtual allocations, and trying to look up the physical
    page to get at it will fail (with a BUG_ON()).

    It turns out that I actually complained to the mm people about the
    per-zone hash table for another reason just a month ago: the zone lookup
    also hurts the regular use of "unlock_page()" a lot, because the zone
    lookup ends up forcing several unnecessary cache misses and generates
    horrible code.

    As part of that earlier discussion, we had a much better solution for
    the NUMA scalability issue - by just making the page lock have a
    separate contention bit, the waitqueue doesn't even have to be looked at
    for the normal case.

    Peter Zijlstra already has a patch for that, but let's see if anybody
    even notices. In the meantime, let's fix the actual gfs2 breakage by
    simplifying the bitlock waitqueues and removing the per-zone issue.

    Reported-by: Andreas Gruenbacher
    Tested-by: Bob Peterson
    Acked-by: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andy Lutomirski
    Cc: Steven Whitehouse
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

16 Oct, 2016

1 commit

  • Pull gcc plugins update from Kees Cook:
    "This adds a new gcc plugin named "latent_entropy". It is designed to
    extract as much uncertainty as possible from a running system at boot
    time, hoping to capitalize on any possible variation in
    CPU operation (due to runtime data differences, hardware differences,
    SMP ordering, thermal timing variation, cache behavior, etc).

    At the very least, this plugin is a much more comprehensive example
    of how to manipulate kernel code using the gcc plugin internals"

    * tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    latent_entropy: Mark functions with __latent_entropy
    gcc-plugins: Add latent_entropy plugin

    Linus Torvalds
     

11 Oct, 2016

2 commits

  • The __latent_entropy gcc attribute can be used only on functions and
    variables. If it is on a function then the plugin will instrument it for
    gathering control-flow entropy. If the attribute is on a variable then
    the plugin will initialize it with random contents. The variable must
    be an integer, an integer array type or a structure with integer fields.

    These specific functions have been selected because they are init
    functions (to help gather boot-time entropy), are called at unpredictable
    times, or they have variable loops, each of which provides some level of
    latent entropy.
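
    For illustration (these declarations are examples of the usage
    pattern, not lines from this patch), the attribute is simply added to
    function or variable definitions:

    /* instrument an init function for control-flow entropy */
    static int __init __latent_entropy init_foo(void)
    {
        return 0;
    }

    /* initialize a variable with build-time random contents */
    static unsigned long seed[4] __latent_entropy;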

    Signed-off-by: Emese Revfy
    [kees: expanded commit message]
    Signed-off-by: Kees Cook

    Emese Revfy
     
  • This adds a new gcc plugin named "latent_entropy". It is designed to
    extract as much uncertainty as possible from a running system at boot time,
    hoping to capitalize on any possible variation in CPU operation
    (due to runtime data differences, hardware differences, SMP ordering,
    thermal timing variation, cache behavior, etc).

    At the very least, this plugin is a much more comprehensive example of
    how to manipulate kernel code using the gcc plugin internals.

    The need for very-early boot entropy tends to be very architecture or
    system design specific, so this plugin is more suited for those sorts
    of special cases. The existing kernel RNG already attempts to extract
    entropy from reliable runtime variation, but this plugin takes the idea to
    a logical extreme by permuting a global variable based on any variation
    in code execution (e.g. a different value (and permutation function)
    is used to permute the global based on loop count, case statement,
    if/then/else branching, etc).

    To do this, the plugin starts by inserting a local variable in every
    marked function. The plugin then adds logic so that the value of this
    variable is modified by randomly chosen operations (add, xor and rol) and
    random values (gcc generates separate static values for each location at
    compile time and also injects the stack pointer at runtime). The resulting
    value depends on the control flow path (e.g., loops and branches taken).

    Before the function returns, the plugin mixes this local variable into
    the latent_entropy global variable. The value of this global variable
    is added to the kernel entropy pool in do_one_initcall() and _do_fork(),
    though it does not credit any bytes of entropy to the pool; the contents
    of the global are just used to mix the pool.
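
    The handoff into the entropy pool is small; roughly (a sketch of the
    helper relied on here, with the config guard shown approximately):

    static inline void add_latent_entropy(void)
    {
    #ifdef CONFIG_GCC_PLUGIN_LATENT_ENTROPY
        add_device_randomness((const void *)&latent_entropy,
                              sizeof(latent_entropy));
    #endif
    }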

    Additionally, the plugin can pre-initialize arrays with build-time
    random contents, so that two different kernel builds running on identical
    hardware will not have the same starting values.

    Signed-off-by: Emese Revfy
    [kees: expanded commit message and code comments]
    Signed-off-by: Kees Cook

    Emese Revfy
     

08 Oct, 2016

15 commits

    Currently we only warn about allocation failures, but small
    allocations are basically nofail and might loop in the page allocator
    for a long time, especially when reclaim cannot make any progress -
    e.g. GFP_NOFS cannot invoke the OOM killer and relies on a different
    context to make forward progress when a lot of memory is used by
    filesystems.

    Give us at least a clue when something like this happens and warn about
    allocations which take more than 10s. Print the basic allocation
    context information along with the cumulative time spent in the
    allocation as well as the allocation stack. Repeat the warning after
    every 10 seconds so that we know that the problem is permanent rather
    than ephemeral.
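
    A simplified sketch of the slowpath addition described above (variable
    names approximate):

    /* Make sure we know about allocations which stall for too long */
    if (time_after(jiffies, alloc_start + stall_timeout)) {
        warn_alloc(gfp_mask,
                   "page allocation stalls for %ums, order:%u",
                   jiffies_to_msecs(jiffies - alloc_start), order);
        stall_timeout += 10 * HZ;   /* warn again after another 10s */
    }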

    Link: http://lkml.kernel.org/r/20160929084407.7004-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Tetsuo Handa
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • warn_alloc_failed is currently used from the page and vmalloc
    allocators. This is a good reuse of the code except that vmalloc would
    appreciate a slightly different warning message. This is already
    handled by the fmt parameter except that

    "%s: page allocation failure: order:%u, mode:%#x(%pGg)"

    is printed anyway. This might be quite misleading, because it might be
    a vmalloc failure that leads to the warning while the page allocator is
    not the culprit. Fix this by always using the fmt string and only
    printing the context that makes sense for the particular caller (e.g.
    order makes very little sense for the vmalloc context).

    Rename the function so as not to miss any users, and also because a
    later patch will reuse it for !failure cases as well.
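
    After the rename, callers pass their own context string; roughly (a
    hedged sketch of the two kinds of call sites):

    /* page allocator */
    warn_alloc(gfp_mask, "page allocation failure: order:%u", order);

    /* vmalloc, where "order" would be meaningless */
    warn_alloc(gfp_mask, "vmalloc: allocation failure: %lu bytes", real_size);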

    Link: http://lkml.kernel.org/r/20160929084407.7004-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Tetsuo Handa
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    should_reclaim_retry() makes decisions based on no_progress_loops, so
    it makes sense to also update the counter there. This is also
    consistent with should_compact_retry() and compaction_retries. No
    functional change.

    [hillf.zj@alibaba-inc.com: fix missing pointer dereferences]
    Link: http://lkml.kernel.org/r/20160926162025.21555-3-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Hillf Danton
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
    The new ultimate compaction priority disables some heuristics, which
    may result in excessive cost. This is fine for non-costly orders,
    where we want to try hard before resorting to OOM, but it might be
    disruptive for costly orders, which do not trigger OOM and should
    generally have some fallback. Thus, we disable the full priority for
    costly orders.
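
    In should_compact_retry(), the escalation is then capped for costly
    orders; approximately (a sketch, not the verbatim hunk):

    min_priority = (order > PAGE_ALLOC_COSTLY_ORDER) ?
                    COMPACT_PRIO_SYNC_LIGHT : MIN_COMPACT_PRIORITY;
    if (*compact_priority > min_priority) {
        (*compact_priority)--;      /* retry, one priority harder */
        return true;
    }
    return false;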

    Suggested-by: Michal Hocko
    Link: http://lkml.kernel.org/r/20160906135258.18335-4-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
    During the reclaim/compaction loop, the compaction priority can be
    increased by the should_compact_retry() function, but the current code
    is not
    optimal. Priority is only increased when compaction_failed() is true,
    which means that compaction has scanned the whole zone. This may not
    happen even after multiple attempts with a lower priority due to
    parallel activity, so we might needlessly struggle on the lower
    priorities and possibly run out of compaction retry attempts in the
    process.

    After this patch we are guaranteed at least one attempt at the highest
    compaction priority even if we exhaust all retries at the lower
    priorities.

    Link: http://lkml.kernel.org/r/20160906135258.18335-3-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Patch series "reintroduce compaction feedback for OOM decisions".

    After several people reported OOMs for order-2 allocations in 4.7 due
    to Michal Hocko's OOM rework, he reverted the part that considered
    compaction feedback [1] in the decisions to retry reclaim/compaction.
    This was to provide a fix quickly for 4.8 rc and 4.7 stable series,
    while mmotm had an almost complete solution that instead improved
    compaction reliability.

    This series completes the mmotm solution and reintroduces the compaction
    feedback into OOM decisions. The first two patches restore the state of
    mmotm before the temporary solution was merged, the last patch should be
    the missing piece for reliability. The third patch restricts the
    hardened compaction to non-costly orders, since costly orders don't
    result in OOMs in the first place.

    [1] http://marc.info/?i=20160822093249.GA14916%40dhcp22.suse.cz%3E

    This patch (of 4):

    Commit 6b4e3181d7bd ("mm, oom: prevent premature OOM killer invocation
    for high order request") was intended as a quick fix of OOM regressions
    for 4.8 and stable 4.7.x kernels. For a better long-term solution, we
    still want to consider compaction feedback, which should be possible
    after some more improvements in the following patches.

    This reverts commit 6b4e3181d7bd5ca5ab6f45929e4a5ffa7ab4ab7f.

    Link: http://lkml.kernel.org/r/20160906135258.18335-2-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
    Currently arch-specific code can reserve memory blocks, but
    alloc_large_system_hash() may not take them into consideration when
    sizing the hashes. This can lead to bigger hashes than required and
    leave no memory available for other purposes. This is specifically
    true for systems with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.

    One approach to solve this problem would be to walk through the
    memblock regions, calculate the available memory, and base the size of
    the hashes on that.

    The other approach would be to depend on the architecture to provide the
    number of pages that are reserved. This change provides hooks to allow
    the architecture to provide the required info.

    Link: http://lkml.kernel.org/r/1472476010-4709-2-git-send-email-srikar@linux.vnet.ibm.com
    Signed-off-by: Srikar Dronamraju
    Suggested-by: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Michael Ellerman
    Cc: Mahesh Salgaonkar
    Cc: Hari Bathini
    Cc: Dave Hansen
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Srikar Dronamraju
     
    Use the existing enums instead of a hardcoded index when looking at
    the zonelist. This makes the code more readable. There is no
    functional change in this patch.

    Link: http://lkml.kernel.org/r/1472227078-24852-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Reviewed-by: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
    Until now, if a page_ext user wants to use its own field on page_ext,
    the field has to be defined in struct page_ext by hard-coding. This
    wastes memory in the following situation.

    struct page_ext {
    #ifdef CONFIG_A
        int a;
    #endif
    #ifdef CONFIG_B
        int b;
    #endif
    };

    Assume that the kernel is built with both CONFIG_A and CONFIG_B. Even
    if we enable feature A and don't enable feature B at runtime, each
    entry of struct page_ext takes two ints rather than one. This is an
    undesirable result, so this patch tries to fix it.

    To solve the above problem, this patch implements support for extra
    space allocation at runtime. When a user's need() callback returns
    true, its extra memory requirement is added to the entry size of
    page_ext. Also, the offset of each user's extra memory space is
    returned, so the user can access the extra space without having to
    define the needed field in struct page_ext by hard-coding.

    This patch only implements the infrastructure. A following patch will
    use it for page_owner, which is the only user having its own fields on
    page_ext.
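
    A hypothetical user of the new infrastructure (names invented for
    illustration) would then look roughly like this:

    static bool need_my_feature(void)
    {
        return my_feature_enabled;          /* decided at runtime */
    }

    struct page_ext_operations my_feature_ops = {
        .size = sizeof(struct my_feature_data), /* extra bytes per entry */
        .need = need_my_feature,
    };

    /* the data lives at the returned offset inside each page_ext entry */
    struct my_feature_data *d = (void *)page_ext + my_feature_ops.offset;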

    Link: http://lkml.kernel.org/r/1471315879-32294-6-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Michal Hocko
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
    What debug_pagealloc does is just map/unmap page table entries.
    Basically, it doesn't need additional memory space to remember
    anything. But with the guard page feature, it requires additional
    memory to distinguish whether a page is a guard page or not. Guard
    pages are only used when debug_guardpage_minorder is non-zero, so this
    patch removes the additional memory allocation (page_ext) if
    debug_guardpage_minorder is zero.

    This saves memory if we just use debug_pagealloc and not guard pages.
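
    The gist, as a simplified sketch of the registration check, is that
    the guard-page page_ext user only reports a need for page_ext when
    both debug_pagealloc and a non-zero minorder are in effect:

    static bool need_debug_guardpage(void)
    {
        /* If we don't use debug_pagealloc, we don't need guard pages */
        if (!debug_pagealloc_enabled())
            return false;

        if (!debug_guardpage_minorder())
            return false;

        return true;
    }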

    Link: http://lkml.kernel.org/r/1471315879-32294-3-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Reviewed-by: Sergey Senozhatsky
    Acked-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Patch series "Reduce memory waste by page extension user".

    This patchset tries to reduce memory waste by page extension users.

    The first case is architecture-supported debug_pagealloc. It doesn't
    require additional memory if guard pages aren't used; 8 bytes per page
    will be saved in this case.

    The second case is related to the page owner feature. Until now, if
    page_ext users want to use their own fields on page_ext, the fields
    have to be defined in struct page_ext by hard-coding. This has the
    following problem.

    struct page_ext {
    #ifdef CONFIG_A
        int a;
    #endif
    #ifdef CONFIG_B
        int b;
    #endif
    };

    Assume that the kernel is built with both CONFIG_A and CONFIG_B. Even
    if we enable feature A and don't enable feature B at runtime, each
    entry of struct page_ext takes two ints rather than one. This is
    undesirable waste, so this patchset tries to reduce it. With this
    patchset, we can save 20 bytes per page dedicated to the page owner
    feature in some configurations.

    This patch (of 6):

    We can make the code cleaner by moving the decision condition for
    set_page_guard() into set_page_guard() itself. This helps code
    readability. There is no functional change.

    Link: http://lkml.kernel.org/r/1471315879-32294-2-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Michal Hocko
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
    On x86_64, MAX_ORDER_NR_PAGES is usually 4M and a pageblock is usually
    2M, so we only set one pageblock's migratetype in deferred_free_range()
    if the pfn is aligned to MAX_ORDER_NR_PAGES. That leaves blocks with an
    uninitialized migratetype; as "cat /proc/pagetypeinfo" shows, almost
    half of the blocks are Unmovable.

    Also, we missed freeing the last block in deferred_init_memmap(), which
    causes a memory leak.
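
    The fix boils down to operating at pageblock granularity instead of
    MAX_ORDER granularity; schematically (not the verbatim hunk):

    /* before: migratetype set only on MAX_ORDER_NR_PAGES boundaries */
    if (!(pfn & (MAX_ORDER_NR_PAGES - 1)))
        set_pageblock_migratetype(page, MIGRATE_MOVABLE);

    /* after: set it for every pageblock in the deferred range */
    if (!(pfn & (pageblock_nr_pages - 1)))
        set_pageblock_migratetype(page, MIGRATE_MOVABLE);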

    Fixes: ac5d2539b238 ("mm: meminit: reduce number of times pageblocks are set during struct page init")
    Link: http://lkml.kernel.org/r/57A3260F.4050709@huawei.com
    Signed-off-by: Xishi Qiu
    Cc: Taku Izumi
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: "Kirill A . Shutemov"
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • Commit 342332e6a925 ("mm/page_alloc.c: introduce kernelcore=mirror
    option") rewrote the calculation of node spanned pages. But when we
    have a movable node, the size of node spanned pages is double added.
    That's because we have an empty normal zone, the present pages is zero,
    but its spanned pages is not zero.

    e.g.
    Zone ranges:
    DMA [mem 0x0000000000001000-0x0000000000ffffff]
    DMA32 [mem 0x0000000001000000-0x00000000ffffffff]
    Normal [mem 0x0000000100000000-0x0000007c7fffffff]
    Movable zone start for each node
    Node 1: 0x0000001080000000
    Node 2: 0x0000002080000000
    Node 3: 0x0000003080000000
    Node 4: 0x0000003c80000000
    Node 5: 0x0000004c80000000
    Node 6: 0x0000005c80000000
    Early memory node ranges
    node 0: [mem 0x0000000000001000-0x000000000009ffff]
    node 0: [mem 0x0000000000100000-0x000000007552afff]
    node 0: [mem 0x000000007bd46000-0x000000007bd46fff]
    node 0: [mem 0x000000007bdcd000-0x000000007bffffff]
    node 0: [mem 0x0000000100000000-0x000000107fffffff]
    node 1: [mem 0x0000001080000000-0x000000207fffffff]
    node 2: [mem 0x0000002080000000-0x000000307fffffff]
    node 3: [mem 0x0000003080000000-0x0000003c7fffffff]
    node 4: [mem 0x0000003c80000000-0x0000004c7fffffff]
    node 5: [mem 0x0000004c80000000-0x0000005c7fffffff]
    node 6: [mem 0x0000005c80000000-0x0000006c7fffffff]
    node 7: [mem 0x0000006c80000000-0x0000007c7fffffff]

    node1:
    Normal, start=0x1080000, present=0x0, spanned=0x1000000
    Movable, start=0x1080000, present=0x1000000, spanned=0x1000000
    pgdat, start=0x1080000, present=0x1000000, spanned=0x2000000

    After this patch, the problem is fixed.

    node1:
    Normal, start=0x0, present=0x0, spanned=0x0
    Movable, start=0x1080000, present=0x1000000, spanned=0x1000000
    pgdat, start=0x1080000, present=0x1000000, spanned=0x1000000

    Link: http://lkml.kernel.org/r/57A325E8.6070100@huawei.com
    Signed-off-by: Xishi Qiu
    Cc: Taku Izumi
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: "Kirill A . Shutemov"
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
    The __compaction_suitable() function checks the low watermark plus a
    compact_gap() gap to decide if there's enough free memory to perform
    compaction. Then __isolate_free_page() uses a low watermark check to
    decide if a particular free page can be isolated. In the latter case,
    using the low watermark is needlessly pessimistic, as the free page
    isolations are only temporary. For __compaction_suitable() the higher
    watermark makes sense for high-order allocations, where more free pages
    increase the chance of success, and we can typically fail with some
    order-0 fallback when the system is struggling to reach that watermark.
    But for a low-order allocation, forming the page should not be that
    hard. So using the low watermark here might just prevent compaction
    from even trying, and eventually lead to the OOM killer even if we are
    above the min watermarks.

    So after this patch, we use the min watermark for non-costly orders in
    __compaction_suitable(), and for all orders in __isolate_free_page().
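
    After the change, the __compaction_suitable() check reads roughly as
    follows (a hedged sketch based on the description above):

    /* min watermark is enough for non-costly orders; keep low for costly */
    watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
                    low_wmark_pages(zone) : min_wmark_pages(zone);
    watermark += compact_gap(order);
    if (!__zone_watermark_ok(zone, 0, watermark, classzone_idx,
                             ALLOC_CMA, wmark_target))
        return COMPACT_SKIPPED;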

    [vbabka@suse.cz: clarify __isolate_free_page() comment]
    Link: http://lkml.kernel.org/r/7ae4baec-4eca-e70b-2a69-94bea4fb19fa@suse.cz
    Link: http://lkml.kernel.org/r/20160810091226.6709-11-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Tested-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The __compaction_suitable() function checks the low watermark plus a
    compact_gap() gap to decide if there's enough free memory to perform
    compaction. This check uses the direct compactor's alloc_flags, but that's
    wrong, since these flags are not applicable for freepage isolation.

    For example, alloc_flags may indicate access to memory reserves, making
    compaction proceed, and then fail watermark check during the isolation.

    A similar problem exists for ALLOC_CMA, which may be part of
    alloc_flags, but not during freepage isolation. In this case however it
    makes sense to use ALLOC_CMA in both __compaction_suitable() and
    __isolate_free_page(), since there's actually nothing preventing the
    freepage scanner from isolating from CMA pageblocks, on the assumption
    that a page that could be migrated once by compaction can also be
    migrated later by a CMA allocation. Thus we should count pages in CMA
    pageblocks when considering compaction suitability and when isolating
    freepages.

    To sum up, this patch should remove some false positives from
    __compaction_suitable(), and allow compaction to proceed when free pages
    required for compaction reside in the CMA pageblocks.

    Link: http://lkml.kernel.org/r/20160810091226.6709-10-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Tested-by: Lorenzo Stoakes
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

02 Sep, 2016

2 commits

  • Firmware Assisted Dump (FA_DUMP) on ppc64 reserves substantial amounts
    of memory when booting a secondary kernel. Srikar Dronamraju reported
    that multiple nodes may have no memory managed by the buddy allocator
    but still return true for populated_zone().

    Commit 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of
    nodes") was reported to cause kswapd to spin at 100% CPU usage when
    fadump was enabled. The old code happened to deal with the situation
    of a populated node with zero free pages by coincidence, but the
    current code tries to reclaim populated zones without realising that
    this is impossible.

    We cannot just convert populated_zone() as many existing users really
    need to check for present_pages. This patch introduces a managed_zone()
    helper and uses it in the few cases where it is critical that the check
    is made for managed pages -- zonelist construction and page reclaim.
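
    The new helper sits next to populated_zone() and differs only in which
    counter it tests; approximately:

    /* true if the zone has pages managed by the buddy allocator */
    static inline bool managed_zone(struct zone *zone)
    {
        return zone->managed_pages;
    }

    /* true if the zone has memory present, whether managed or not */
    static inline bool populated_zone(struct zone *zone)
    {
        return !!zone->present_pages;
    }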

    Link: http://lkml.kernel.org/r/20160831195104.GB8119@techsingularity.net
    Signed-off-by: Mel Gorman
    Reported-by: Srikar Dronamraju
    Tested-by: Srikar Dronamraju
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    There have been several reports about premature OOM killer invocation
    in the 4.7 kernel, where an order-2 allocation request (for the kernel
    stack) invoked the OOM killer even during basic workloads (light IO or
    even a kernel compile on some filesystems). In all reported cases the
    memory is
    fragmented and there are no order-2+ pages available. There is usually
    a large amount of slab memory (usually dentries/inodes) and further
    debugging has shown that there are way too many unmovable blocks which
    are skipped during the compaction. Multiple reporters have confirmed
    that the current linux-next which includes [1] and [2] helped and OOMs
    are not reproducible anymore.

    A simpler fix for the late rc and stable kernels is to simply ignore
    the compaction feedback and retry as long as there is reclaim progress
    and we are not getting OOM for order-0 pages. We already do that for
    CONFIG_COMPACTION=n, so let's reuse the same code when compaction is
    enabled as well.

    [1] http://lkml.kernel.org/r/20160810091226.6709-1-vbabka@suse.cz
    [2] http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a933559305a@suse.cz

    Fixes: 0a0337e0d1d1 ("mm, oom: rework oom detection")
    Link: http://lkml.kernel.org/r/20160823074339.GB23577@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Tested-by: Olaf Hering
    Tested-by: Ralf-Peter Rohbeck
    Cc: Markus Trippelsdorf
    Cc: Arkadiusz Miskiewicz
    Cc: Ralf-Peter Rohbeck
    Cc: Jiri Slaby
    Cc: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Cc: [4.7.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

12 Aug, 2016

1 commit

  • meminfo_proc_show() and si_mem_available() are using the wrong helpers
    for calculating the size of the LRUs. The user-visible impact is that
    there appears to be an abnormally high number of unevictable pages.

    Link: http://lkml.kernel.org/r/20160805105805.GR2799@techsingularity.net
    Signed-off-by: Mel Gorman
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

11 Aug, 2016

2 commits

    Some of the node thresholds depend on the number of managed pages in
    the node. When memory goes on/offline, that number can change and we
    need to adjust the thresholds accordingly.

    Add recalculation in the appropriate places and clean up the related
    functions for better maintenance.

    Link: http://lkml.kernel.org/r/1470724248-26780-2-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Acked-by: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Before resetting min_unmapped_pages, we need to initialize
    min_unmapped_pages rather than min_slab_pages.

    Fixes: a5f5f91da6 (mm: convert zone_reclaim to node_reclaim)
    Link: http://lkml.kernel.org/r/1470724248-26780-1-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Acked-by: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

10 Aug, 2016

1 commit

  • To distinguish non-slab pages charged to kmemcg we mark them PageKmemcg,
    which sets page->_mapcount to -512. Currently, we set/clear PageKmemcg
    in __alloc_pages_nodemask()/free_pages_prepare() for any page allocated
    with __GFP_ACCOUNT, including those that aren't actually charged to any
    cgroup, i.e. allocated from the root cgroup context. To avoid overhead
    in case cgroups are not used, we only do that if memcg_kmem_enabled() is
    true. The latter is set iff there are kmem-enabled memory cgroups
    (online or offline). The root cgroup is not considered kmem-enabled.

    As a result, if a page is allocated with __GFP_ACCOUNT for the root
    cgroup when there are kmem-enabled memory cgroups and is freed after all
    kmem-enabled memory cgroups were removed, e.g.

    # no memory cgroups has been created yet, create one
    mkdir /sys/fs/cgroup/memory/test
    # run something allocating pages with __GFP_ACCOUNT, e.g.
    # a program using pipe
    dmesg | tail
    # remove the memory cgroup
    rmdir /sys/fs/cgroup/memory/test

    we'll get a bad page state bug complaining about page->_mapcount != -1:

    BUG: Bad page state in process swapper/0 pfn:1fd945c
    page:ffffea007f651700 count:0 mapcount:-511 mapping: (null) index:0x0
    flags: 0x1000000000000000()

    To avoid that, let's mark with PageKmemcg only those pages that are
    actually charged to and hence pin a non-root memory cgroup.

    Fixes: 4949148ad433 ("mm: charge/uncharge kmemcg from generic page allocator paths")
    Reported-and-tested-by: Eric Dumazet
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

05 Aug, 2016

1 commit

  • Paul Mackerras and Reza Arbab reported that machines with memoryless
    nodes fail when vmstats are refreshed. Paul reported an oops as follows

    Unable to handle kernel paging request for data at address 0xff7a10000
    Faulting instruction address: 0xc000000000270cd0
    Oops: Kernel access of bad area, sig: 11 [#1]
    SMP NR_CPUS=2048 NUMA PowerNV
    Modules linked in:
    CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.7.0-kvm+ #118
    task: c000000ff0680010 task.stack: c000000ff0704000
    NIP: c000000000270cd0 LR: c000000000270ce8 CTR: 0000000000000000
    REGS: c000000ff0707900 TRAP: 0300 Not tainted (4.7.0-kvm+)
    MSR: 9000000102009033 CR: 846b6824 XER: 20000000
    CFAR: c000000000008768 DAR: 0000000ff7a10000 DSISR: 42000000 SOFTE: 1
    NIP refresh_zone_stat_thresholds+0x80/0x240
    LR refresh_zone_stat_thresholds+0x98/0x240
    Call Trace:
    refresh_zone_stat_thresholds+0xb8/0x240 (unreliable)

    Both supplied potential fixes, but one potentially missed checks and
    the other had redundant initialisations. This version initialises
    per_cpu_nodestats on a per-pgdat basis instead of on a per-zone basis.

    Link: http://lkml.kernel.org/r/20160804092404.GI2799@techsingularity.net
    Signed-off-by: Mel Gorman
    Reported-by: Paul Mackerras
    Reported-by: Reza Arbab
    Tested-by: Reza Arbab
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

03 Aug, 2016

1 commit

  • There was only one use of __initdata_refok and __exit_refok

    __init_refok was used 46 times against 82 for __ref.

    Those definitions are obsolete since commit 312b1485fb50 ("Introduce new
    section reference annotations tags: __ref, __refdata, __refconst")

    This patch removes the following compatibility definitions and replaces
    them treewide.

    /* compatibility defines */
    #define __init_refok __ref
    #define __initdata_refok __refdata
    #define __exit_refok __ref

    I can also provide separate patches if necessary.
    (One patch per tree and check in 1 month or 2 to remove old definitions)

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/1466796271-3043-1-git-send-email-fabf@skynet.be
    Signed-off-by: Fabian Frederick
    Cc: Ingo Molnar
    Cc: Sam Ravnborg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     

29 Jul, 2016

4 commits

  • Async compaction detects contention either due to failing trylock on
    zone->lock or lru_lock, or by need_resched(). Since 1f9efdef4f3f ("mm,
    compaction: khugepaged should not give up due to need_resched()") the
    code got quite complicated to distinguish these two up to the
    __alloc_pages_slowpath() level, so different decisions could be taken
    for khugepaged allocations.

    After the recent changes, khugepaged allocations don't check for
    contended compaction anymore, so we again don't need to distinguish lock
    and sched contention, and simplify the current convoluted code a lot.

    However, I believe it's also possible to simplify even more and
    completely remove the check for contended compaction after the initial
    async compaction for costly orders, which was originally aimed at THP
    page fault allocations. There are several reasons why this can be done
    now:

    - with the new defaults, THP page faults no longer do reclaim/compaction
      at all, unless the system admin has overridden the default, or the
      application has indicated via madvise that it can benefit from THPs.
      In both cases, it means that the potential extra latency is expected
      and worth the benefits.
    - even if reclaim/compaction proceeds after this patch where it
      previously wouldn't, the second compaction attempt is still async and
      will detect the contention and back off, if the contention persists
    - there are still heuristics like deferred compaction and pageblock skip
      bits in place that prevent excessive THP page fault latencies

    Link: http://lkml.kernel.org/r/20160721073614.24395-9-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • In the context of direct compaction, for some types of allocations we
    would like the compaction to either succeed or definitely fail while
    trying as hard as possible. Current async/sync_light migration mode is
    insufficient, as there are heuristics such as caching scanner positions,
    marking pageblocks as unsuitable or deferring compaction for a zone. At
    least the final compaction attempt should be able to override these
    heuristics.

    To communicate how hard compaction should try, we replace migration mode
    with a new enum compact_priority and change the relevant function
    signatures. In compact_zone_order() where struct compact_control is
    constructed, the priority is mapped to suitable control flags. This
    patch itself has no functional change, as the current priority levels
    are mapped back to the same migration modes as before. Expanding them
    will be done next.

    Note that !CONFIG_COMPACTION variant of try_to_compact_pages() is
    removed, as the only caller exists under CONFIG_COMPACTION.
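
    As introduced here (sketched from the description above; later patches
    in the series extend the enum), the priorities and their mapping back
    to migration modes look roughly like:

    enum compact_priority {
        COMPACT_PRIO_SYNC_LIGHT,
        MIN_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
        DEF_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
        COMPACT_PRIO_ASYNC,
        INIT_COMPACT_PRIORITY = COMPACT_PRIO_ASYNC
    };

    /* in compact_zone_order(), the priority maps to a migration mode */
    .mode = prio == COMPACT_PRIO_ASYNC ?
                    MIGRATE_ASYNC : MIGRATE_SYNC_LIGHT,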

    Link: http://lkml.kernel.org/r/20160721073614.24395-8-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • After the previous patch, we can distinguish costly allocations that
    should be really lightweight, such as THP page faults, with
    __GFP_NORETRY. This means we don't need to recognize khugepaged
    allocations via PF_KTHREAD anymore. We can also change THP page faults
    in areas where madvise(MADV_HUGEPAGE) was used to try as hard as
    khugepaged, as the process has indicated that it benefits from THPs and
    is willing to pay some initial latency costs.

    We can also make the flags handling less cryptic by distinguishing
    GFP_TRANSHUGE_LIGHT (no reclaim at all, default mode in page fault) from
    GFP_TRANSHUGE (only direct reclaim, khugepaged default). Adding
    __GFP_NORETRY or __GFP_KSWAPD_RECLAIM is done where needed.

    The patch effectively changes the current GFP_TRANSHUGE users as
    follows:

    * get_huge_zero_page() - the zero page lifetime should be relatively
    long and it's shared by multiple users, so it's worth spending some
    effort on it. We use GFP_TRANSHUGE, and __GFP_NORETRY is not added.
    This also restores direct reclaim to this allocation, which was
    unintentionally removed by commit e4a49efe4e7e ("mm: thp: set THP defrag
    by default to madvise and add a stall-free defrag option")

    * alloc_hugepage_khugepaged_gfpmask() - this is khugepaged, so latency
    is not an issue. So if khugepaged "defrag" is enabled (the default), do
    reclaim via GFP_TRANSHUGE without __GFP_NORETRY. We can remove the
    PF_KTHREAD check from page alloc.

    As a side effect, khugepaged will now no longer check if the initial
    compaction was deferred or contended. This is OK, as khugepaged sleep
    times between collapse attempts are long enough to prevent noticeable
    disruption, so we should allow it to spend some effort.

    * migrate_misplaced_transhuge_page() - already was masking out
    __GFP_RECLAIM, so just convert to GFP_TRANSHUGE_LIGHT which is
    equivalent.

    * alloc_hugepage_direct_gfpmask() - vma's with VM_HUGEPAGE (via madvise)
    are now allocating without __GFP_NORETRY. Other vma's keep using
    __GFP_NORETRY if direct reclaim/compaction is at all allowed (by default
    it's allowed only for madvised vma's). The rest is conversion to
    GFP_TRANSHUGE(_LIGHT).
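
    The resulting flag definitions (reproduced approximately, so treat
    them as a sketch) make the distinction explicit:

    #define GFP_TRANSHUGE_LIGHT ((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
                                  __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)
    #define GFP_TRANSHUGE       (GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)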

    [mhocko@suse.com: suggested GFP_TRANSHUGE_LIGHT]
    Link: http://lkml.kernel.org/r/20160721073614.24395-7-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Since THP allocations during page faults can be costly, extra decisions
    are employed for them to avoid excessive reclaim and compaction, if the
    initial compaction doesn't look promising. The detection has never been
    perfect as there is no gfp flag specific to THP allocations. At this
    moment it checks the whole combination of flags that makes up
    GFP_TRANSHUGE, and hopes that no other users of such combination exist,
    or would mind being treated the same way. Extra care is also taken to
    separate allocations from khugepaged, where latency doesn't matter that
    much.

    It is however possible to distinguish these allocations in a simpler and
    more reliable way. The key observation is that after the initial
    compaction followed by the first iteration of "standard"
    reclaim/compaction, both __GFP_NORETRY allocations and costly
    allocations without __GFP_REPEAT are declared as failures:

    /* Do not loop if specifically requested */
    if (gfp_mask & __GFP_NORETRY)
        goto nopage;

    /*
     * Do not retry costly high order allocations unless they are
     * __GFP_REPEAT
     */
    if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
        goto nopage;

    This means we can further distinguish allocations that are costly order
    *and* additionally include the __GFP_NORETRY flag. As it happens,
    GFP_TRANSHUGE allocations do already fall into this category. This will
    also allow other costly allocations with similar high-order benefit vs
    latency considerations to use this semantic. Furthermore, we can
    distinguish THP allocations that should try a bit harder (such as from
    khugepageed) by removing __GFP_NORETRY, as will be done in the next
    patch.

    Link: http://lkml.kernel.org/r/20160721073614.24395-6-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka