26 Jun, 2015

1 commit

  • Pull cgroup writeback support from Jens Axboe:
    "This is the big pull request for adding cgroup writeback support.

    This code has been in development for a long time, and it has been
    simmering in for-next for a good chunk of this cycle too. This is one
    of those problems that has been talked about for at least half a
    decade, finally there's a solution and code to go with it.

    Also see last week's writeup on LWN:

    http://lwn.net/Articles/648292/"

    * 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
    writeback, blkio: add documentation for cgroup writeback support
    vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
    writeback: do foreign inode detection iff cgroup writeback is enabled
    v9fs: fix error handling in v9fs_session_init()
    bdi: fix wrong error return value in cgwb_create()
    buffer: remove unusued 'ret' variable
    writeback: disassociate inodes from dying bdi_writebacks
    writeback: implement foreign cgroup inode bdi_writeback switching
    writeback: add lockdep annotation to inode_to_wb()
    writeback: use unlocked_inode_to_wb transaction in inode_congested()
    writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
    writeback: implement [locked_]inode_to_wb_and_lock_list()
    writeback: implement foreign cgroup inode detection
    writeback: make writeback_control track the inode being written back
    writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
    mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
    writeback: implement memcg writeback domain based throttling
    writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
    writeback: implement memcg wb_domain
    writeback: update wb_over_bg_thresh() to use wb_domain aware operations
    ...

    Linus Torvalds
     

25 Jun, 2015

4 commits

  • memcg->under_oom tracks whether the memcg is under OOM conditions and is
    an atomic_t counter managed with mem_cgroup_[un]mark_under_oom(). While
    atomic_t appears simple synchronization-wise, using it as a
    synchronization construct, as is done here, is trickier and more
    error-prone due to weak memory ordering rules, especially around
    atomic_read(), and it can give a false sense of security.

    For example, both non-trivial read sites of memcg->under_oom are a bit
    problematic, although not actually broken.

    * mem_cgroup_oom_register_event()

    It isn't explicit what guarantees the memory ordering between event
    addition and the memcg->under_oom check. This isn't broken only because
    memcg_oom_lock is used for both the event list and memcg->oom_lock.

    * memcg_oom_recover()

    The lockless test doesn't have any explanation why this would be
    safe.

    mem_cgroup_[un]mark_under_oom() are very cold paths and there's no point
    in avoiding locking memcg_oom_lock there. This patch converts
    memcg->under_oom from atomic_t to int, puts its modifications under
    memcg_oom_lock, and documents why the lockless test in
    memcg_oom_recover() is safe.
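
    Roughly, the resulting pattern looks like the sketch below (the hierarchy
    walk and the actual field layout are omitted; this is not the literal
    patch):

        static DEFINE_SPINLOCK(memcg_oom_lock);

        /* sketch: under_oom is a plain int, only modified under memcg_oom_lock */
        static void mem_cgroup_mark_under_oom(struct mem_cgroup *memcg)
        {
                spin_lock(&memcg_oom_lock);
                memcg->under_oom++;
                spin_unlock(&memcg_oom_lock);
        }

        static void mem_cgroup_unmark_under_oom(struct mem_cgroup *memcg)
        {
                spin_lock(&memcg_oom_lock);
                if (memcg->under_oom > 0)
                        memcg->under_oom--;
                spin_unlock(&memcg_oom_lock);
        }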

    Signed-off-by: Tejun Heo
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • Since commit 4942642080ea ("mm: memcg: handle non-error OOM situations
    more gracefully"), nobody uses mem_cgroup->oom_wakeups. Remove it.

    While at it, also fold memcg_wakeup_oom() into memcg_oom_recover() which
    is its only user. This cleanup was suggested by Michal.

    Signed-off-by: Tejun Heo
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • The zonelist locking and the oom_sem are two overlapping locks that are
    used to serialize global OOM killing against different things.

    The historical zonelist locking serializes OOM kills from allocations with
    overlapping zonelists against each other to prevent killing more tasks
    than necessary in the same memory domain. Only when neither tasklists nor
    zonelists from two concurrent OOM kills overlap (tasks in separate memcgs
    bound to separate nodes) are OOM kills allowed to execute in parallel.

    The younger oom_sem is a read-write lock to serialize OOM killing against
    the PM code trying to disable the OOM killer altogether.

    However, the OOM killer is a fairly cold error path; there is really no
    reason to optimize for highly performant and concurrent OOM kills. And
    the oom_sem is just flat-out redundant.

    Replace both locking schemes with a single global mutex serializing OOM
    kills regardless of context.
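
    Schematically, the serialization boils down to one mutex around every OOM
    kill; a sketch (victim selection and the real argument list are omitted):

        static DEFINE_MUTEX(oom_lock);

        /* sketch: an allocation path only attempts a kill if no other OOM
         * kill is already in flight */
        static bool try_global_oom_kill(void)
        {
                if (!mutex_trylock(&oom_lock))
                        return false;   /* another OOM kill is running, back off */
                /* ... select and kill a victim ... */
                mutex_unlock(&oom_lock);
                return true;
        }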

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: David Rientjes
    Cc: Tetsuo Handa
    Cc: Andrea Arcangeli
    Cc: Dave Chinner
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Rename unmark_oom_victim() to exit_oom_victim(). Marking and unmarking
    are related in functionality, but the interface is not symmetrical at
    all: one is an internal OOM killer function used during the killing, the
    other is for an OOM victim to signal its own death on exit later on.
    This has locking implications, see follow-up changes.

    While at it, rename mark_tsk_oom_victim() to mark_oom_victim(), which
    is easier on the eye.

    Signed-off-by: Johannes Weiner
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Andrea Arcangeli
    Cc: Dave Chinner
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

11 Jun, 2015

2 commits

  • On -rt, the VM_BUG_ON(!irqs_disabled()) triggers inside the memcg
    swapout path because the spin_lock_irq(&mapping->tree_lock) in the
    caller doesn't actually disable the hardware interrupts - which is fine,
    because on -rt the tophalves run in process context and so we are still
    safe from preemption while updating the statistics.

    Remove the VM_BUG_ON() but keep the comment of what we rely on.

    Signed-off-by: Johannes Weiner
    Reported-by: Clark Williams
    Cc: Fernando Lopez-Lezcano
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When trimming memcg consumption excess (see memory.high), we call
    try_to_free_mem_cgroup_pages without checking if we are allowed to sleep
    in the current context, which can result in a deadlock. Fix this.
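
    Presumably the fix amounts to only entering reclaim when the allocation
    context may block, along these lines (helper name hypothetical; __GFP_WAIT
    was the relevant flag at the time):

        /* sketch: skip direct reclaim of the memory.high excess from atomic
         * context (names approximate) */
        static void reclaim_high_excess(struct mem_cgroup *memcg,
                                        unsigned long nr_pages, gfp_t gfp_mask)
        {
                if (!(gfp_mask & __GFP_WAIT))
                        return;         /* caller cannot sleep, don't reclaim here */
                try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
        }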

    Fixes: 241994ed8649 ("mm: memcontrol: default hierarchy interface for memory")
    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

02 Jun, 2015

8 commits

  • While cgroup writeback support now connects memcg and blkcg so that
    writeback IOs are properly attributed and controlled, the IO back
    pressure propagation mechanism implemented in balance_dirty_pages()
    and its subroutines wasn't aware of cgroup writeback.

    Processes belonging to a memcg may have access to only a subset of the
    total memory available in the system, and not factoring this into dirty
    throttling rendered it completely ineffective for processes under memcg
    limits; memcg ended up building a separate ad-hoc degenerate mechanism
    directly into vmscan code to limit page dirtying.

    The previous patches updated balance_dirty_pages() and its subroutines
    so that they can deal with multiple wb_domain's (writeback domains)
    and defined per-memcg wb_domain. Processes belonging to a non-root
    memcg are bound to two wb_domains, global wb_domain and memcg
    wb_domain, and should be throttled according to IO pressures from both
    domains. This patch updates dirty throttling code so that it repeats
    similar calculations for the two domains - the differences between the
    two are few and minor - and applies the lower of the two sets of
    resulting constraints.

    wb_over_bg_thresh(), which controls when background writeback
    terminates, is also updated to consider both global and memcg
    wb_domains. It returns true if dirty is over bg_thresh for either
    domain.

    This makes the dirty throttling mechanism operational for memcg
    domains, including writeback-bandwidth-proportional dirty page
    distribution inside them, but the ad-hoc memcg throttling mechanism in
    vmscan is still in place. The next patch will rip it out.
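
    In essence, the same calculation is repeated against each domain and the
    more constrained result wins; wb_over_bg_thresh() illustrates the shape of
    it (a simplified sketch, structure names abbreviated):

        /* sketch: background writeback keeps going if either domain is over
         * its background threshold */
        struct dtc_sketch { unsigned long dirty, bg_thresh; };

        static bool over_bg_thresh_sketch(const struct dtc_sketch *gdtc,
                                          const struct dtc_sketch *mdtc)
        {
                if (gdtc->dirty > gdtc->bg_thresh)
                        return true;
                if (mdtc && mdtc->dirty > mdtc->bg_thresh)
                        return true;
                return false;
        }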

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Greg Thelen
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • The amount of available memory to a memcg wb_domain can change as
    memcg configuration changes. A domain's ->dirty_limit exists to
    smooth out sudden drops in dirty threshold; however, when a domain's
    size actually drops significantly, it hinders the dirty throttling
    from adjusting to the new configuration leading to unexpected
    behaviors including unnecessary OOM kills.

    This patch resolves the issue by adding wb_domain_size_changed() which
    resets ->dirty_limit[_tstmp] and making memcg call it on configuration
    changes.
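
    The reset helper can be as small as the following sketch (consistent with
    the fields named in the patch title, but not necessarily the literal code):

        /* sketch: forget the smoothed limit so it is recomputed for the new size */
        static inline void wb_domain_size_changed(struct wb_domain *dom)
        {
                spin_lock(&dom->lock);
                dom->dirty_limit_tstamp = jiffies;
                dom->dirty_limit = 0;
                spin_unlock(&dom->lock);
        }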

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Greg Thelen
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Dirtyable memory is distributed to a wb (bdi_writeback) according to
    the relative bandwidth the wb is writing out in the whole system.
    This distribution is global - each wb is measured against all other
    wb's and gets the proportionately sized portion of the memory in the
    whole system.

    For cgroup writeback, the amount of dirtyable memory is scoped by
    memcg and thus each wb would need to be measured and controlled in its
    memcg. IOW, a wb will belong to two writeback domains - the global
    and memcg domains.

    The previous patches laid the groundwork to support the two wb_domains
    and this patch implements memcg wb_domain. memcg->cgwb_domain is
    initialized on css online and destroyed on css release,
    wb->memcg_completions is added, and __wb_writeout_inc() is updated to
    increment completions against both global and memcg wb_domains.

    The following patches will update balance_dirty_pages() and its
    subroutines to actually consider memcg wb_domain for throttling.
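
    Conceptually, a completed write now bumps the wb's completion count in
    both domains; a sketch (helper names approximate):

        /* sketch: account the writeout against the global domain and, if the
         * wb belongs to a non-root memcg, against that memcg's domain too */
        static void wb_writeout_inc_sketch(struct bdi_writeback *wb)
        {
                struct wb_domain *cgdom = mem_cgroup_wb_domain(wb);

                wb_domain_writeout_inc(&global_wb_domain, &wb->completions,
                                       wb->bdi->max_prop_frac);
                if (cgdom)
                        wb_domain_writeout_inc(cgdom, wb_memcg_completions(wb),
                                               wb->bdi->max_prop_frac);
        }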

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Greg Thelen
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • cpu_possible_mask represents the CPUs which are actually possible
    during that boot instance. For systems which don't support CPU
    hotplug, this will match cpu_online_mask exactly in most cases. Even
    for systems which support CPU hotplug, the number of possible CPU
    slots is highly unlikely to diverge greatly from the number of online
    CPUs. The only cases where the difference between possible and online
    caused problems were when the boot code failed to initialize the
    possible mask and left it fully set at NR_CPUS - 1.

    As such, most per-cpu constructs allocate for all possible CPUs and
    often iterate over the possibles, which also has the benefit of
    avoiding the blocking CPU hotplug synchronization.

    memcg open codes per-cpu stat counting for mem_cgroup_read_stat() and
    mem_cgroup_read_events(), which iterate over online CPUs and handle
    CPU hotplug operations explicitly. This complexity doesn't actually
    buy anything. Switch to iterating over the possibles and drop the
    explicit CPU hotplug handling.

    Eventually, we want to convert memcg to use percpu_counter instead of
    its own custom implementation which also benefits from quick access
    w/o summing for cases where larger error margin is acceptable.

    This will allow mem_cgroup_read_stat() to be called from non-sleepable
    contexts which will be used by cgroup writeback.
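
    The resulting read side is then just a plain sum over the possible CPUs,
    roughly (field names as they were at the time; treat this as a sketch):

        /* sketch: no get/put_online_cpus(), no hotplug notifier bookkeeping */
        static long mem_cgroup_read_stat_sketch(struct mem_cgroup *memcg, int idx)
        {
                long val = 0;
                int cpu;

                for_each_possible_cpu(cpu)
                        val += per_cpu(memcg->stat->count[idx], cpu);
                return val;
        }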

    Signed-off-by: Tejun Heo
    Cc: Michal Hocko
    Acked-by: Johannes Weiner
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • For the planned cgroup writeback support, on each bdi
    (backing_dev_info), each memcg will be served by a separate wb
    (bdi_writeback). This patch updates bdi so that a bdi can host
    multiple wbs (bdi_writebacks).

    On the default hierarchy, blkcg implicitly enables memcg. This allows
    using memcg's page ownership for attributing writeback IOs, and every
    memcg - blkcg combination can be served by its own wb by assigning a
    dedicated wb to each memcg. This means that there may be multiple
    wb's of a bdi mapped to the same blkcg. As congested state is per
    blkcg - bdi combination, those wb's should share the same congested
    state. This is achieved by tracking congested state via
    bdi_writeback_congested structs which are keyed by blkcg.

    bdi->wb remains unchanged and will keep serving the root cgroup.
    cgwb's (cgroup wb's) for non-root cgroups are created on-demand or
    looked up while dirtying an inode according to the memcg of the page
    being dirtied or current task. Each cgwb is indexed on bdi->cgwb_tree
    by its memcg id. Once an inode is associated with its wb, it can be
    retrieved using inode_to_wb().

    Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all
    pages will keep being associated with bdi->wb.
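
    The lookup itself is cheap; something along these lines (creation,
    refcounting and RCU details omitted; a sketch only):

        /* sketch: cgwbs hang off the bdi in a radix tree keyed by memcg css id */
        static struct bdi_writeback *
        cgwb_lookup_sketch(struct backing_dev_info *bdi,
                           struct cgroup_subsys_state *memcg_css)
        {
                if (!memcg_css->parent)
                        return &bdi->wb;        /* root memcg keeps using bdi->wb */
                return radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id);
        }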

    v3: inode_attach_wb() in account_page_dirtied() moved inside
    mapping_cap_account_dirty() block where it's known to be !NULL.
    Also, an unnecessary NULL check before kfree() removed. Both
    detected by the kbuild bot.

    v2: Updated so that wb association is per inode and wb is per memcg
    rather than blkcg.

    Signed-off-by: Tejun Heo
    Cc: kbuild test robot
    Cc: Dan Carpenter
    Cc: Jens Axboe
    Cc: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Implement mem_cgroup_css_from_page() which returns the
    cgroup_subsys_state of the memcg associated with a given page on the
    default hierarchy. This will be used by cgroup writeback support.

    This function assumes that page->mem_cgroup association doesn't change
    until the page is released, which is true on the default hierarchy as
    long as replace_page_cache_page() is not used. As the only user of
    replace_page_cache_page() is FUSE which won't support cgroup writeback
    for the time being, this works for now, and replace_page_cache_page()
    will soon be updated so that the invariant actually holds.

    Note that the RCU protected page->mem_cgroup access is consistent with
    other usages across memcg but ultimately incorrect. These unlocked
    accesses are missing required barriers. page->mem_cgroup should be
    made an RCU pointer and updated and accessed using RCU operations.
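
    In shape, the function is just this (a sketch of the described behavior;
    the exact default-hierarchy test may differ, see the v4 note below):

        struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page)
        {
                struct mem_cgroup *memcg = page->mem_cgroup;

                /* fall back to the root css on traditional hierarchies */
                if (!memcg || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
                        memcg = root_mem_cgroup;
                return &memcg->css;
        }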

    v4: Instead of triggering WARN, return the root css on the traditional
    hierarchies. This makes the function a lot easier to deal with
    especially as there's no light way to synchronize against
    hierarchy rebinding.

    v3: s/mem_cgroup_migrate()/mem_cgroup_css_from_page()/

    v2: Trigger WARN if the function is used on the traditional
    hierarchies and add comment about the assumed invariant.

    Signed-off-by: Tejun Heo
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Add global mem_cgroup_root_css which points to the root memcg css.
    This will be used by cgroup writeback support. If memcg is disabled,
    it's defined as ERR_PTR(-EINVAL).

    Signed-off-by: Tejun Heo
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • When modifying PG_Dirty on cached file pages, update the new
    MEM_CGROUP_STAT_DIRTY counter. This is done in the same places where
    global NR_FILE_DIRTY is managed. The new memcg stat is visible in the
    per memcg memory.stat cgroupfs file. The most recent past attempt at
    this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632

    The new accounting supports future efforts to add per cgroup dirty
    page throttling and writeback. It also helps an administrator break
    down a container's memory usage and provides evidence to understand
    memcg oom kills (the new dirty count is included in memcg oom kill
    messages).

    The ability to move page accounting between memcg
    (memory.move_charge_at_immigrate) makes this accounting more
    complicated than the global counter. The existing
    mem_cgroup_{begin,end}_page_stat() lock is used to serialize move
    accounting with stat updates.
    Typical update operation:

        memcg = mem_cgroup_begin_page_stat(page)
        if (TestSetPageDirty()) {
                [...]
                mem_cgroup_update_page_stat(memcg)
        }
        mem_cgroup_end_page_stat(memcg)

    Summary of mem_cgroup_end_page_stat() overhead:
    - Without CONFIG_MEMCG it's a no-op
    - With CONFIG_MEMCG and no inter memcg task movement, it's just
    rcu_read_lock()
    - With CONFIG_MEMCG and inter memcg task movement, it's
    rcu_read_lock() + spin_lock_irqsave()

    A memcg parameter is added to several routines because their callers
    now grab mem_cgroup_begin_page_stat() which returns the memcg later
    needed by mem_cgroup_update_page_stat().

    Because mem_cgroup_begin_page_stat() may disable interrupts, some
    adjustments are needed:
    - move __mark_inode_dirty() from __set_page_dirty() to its caller.
    __mark_inode_dirty() locking does not want interrupts disabled.
    - use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in
    __delete_from_page_cache(), replace_page_cache_page(),
    invalidate_complete_page2(), and __remove_mapping().

    text data bss dec hex filename
    8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before
    8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after
    +192 text bytes
    8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before
    8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after
    +773 text bytes

    Performance tests run on v4.0-rc1-36-g4f671fe2f952. Lower is better for
    all metrics, they're all wall clock or cycle counts. The read and write
    fault benchmarks just measure fault time, they do not include I/O time.

    * CONFIG_MEMCG not set:
    baseline patched
    kbuild 1m25.030000(+-0.088% 3 samples) 1m25.426667(+-0.120% 3 samples)
    dd write 100 MiB 0.859211561 +-15.10% 0.874162885 +-15.03%
    dd write 200 MiB 1.670653105 +-17.87% 1.669384764 +-11.99%
    dd write 1000 MiB 8.434691190 +-14.15% 8.474733215 +-14.77%
    read fault cycles 254.0(+-0.000% 10 samples) 253.0(+-0.000% 10 samples)
    write fault cycles 2021.2(+-3.070% 10 samples) 1984.5(+-1.036% 10 samples)

    * CONFIG_MEMCG=y root_memcg:
    baseline patched
    kbuild 1m25.716667(+-0.105% 3 samples) 1m25.686667(+-0.153% 3 samples)
    dd write 100 MiB 0.855650830 +-14.90% 0.887557919 +-14.90%
    dd write 200 MiB 1.688322953 +-12.72% 1.667682724 +-13.33%
    dd write 1000 MiB 8.418601605 +-14.30% 8.673532299 +-15.00%
    read fault cycles 266.0(+-0.000% 10 samples) 266.0(+-0.000% 10 samples)
    write fault cycles 2051.7(+-1.349% 10 samples) 2049.6(+-1.686% 10 samples)

    * CONFIG_MEMCG=y non-root_memcg:
    baseline patched
    kbuild 1m26.120000(+-0.273% 3 samples) 1m25.763333(+-0.127% 3 samples)
    dd write 100 MiB 0.861723964 +-15.25% 0.818129350 +-14.82%
    dd write 200 MiB 1.669887569 +-13.30% 1.698645885 +-13.27%
    dd write 1000 MiB 8.383191730 +-14.65% 8.351742280 +-14.52%
    read fault cycles 265.7(+-0.172% 10 samples) 267.0(+-0.000% 10 samples)
    write fault cycles 2070.6(+-1.512% 10 samples) 2084.4(+-2.148% 10 samples)

    As expected anon page faults are not affected by this patch.

    tj: Updated to apply on top of the recent cancel_dirty_page() changes.

    Signed-off-by: Sha Zhengju
    Signed-off-by: Greg Thelen
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Greg Thelen
     

16 Apr, 2015

3 commits

  • We converted some of the usages of ACCESS_ONCE to READ_ONCE in the mm/
    tree since ACCESS_ONCE doesn't work reliably on non-scalar types.

    This patch removes the rest of the usages of ACCESS_ONCE and uses the new
    READ_ONCE API for the read accesses. This makes things cleaner by
    avoiding separate/multiple sets of APIs.
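
    The conversion is mechanical; for a single read it is simply (illustrative
    variable names):

        /* before: fine for scalars, unreliable for aggregate types */
        next = ACCESS_ONCE(node->next);

        /* after: READ_ONCE handles non-scalar types as well */
        next = READ_ONCE(node->next);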

    Signed-off-by: Jason Low
    Acked-by: Michal Hocko
    Acked-by: Davidlohr Bueso
    Acked-by: Rik van Riel
    Reviewed-by: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Low
     
  • Low and high watermarks, as they are defined in the TODO for the
    mem_cgroup struct, have already been implemented by Johannes, so remove
    the stale comment.

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • mem_cgroup_lookup() is a wrapper around mem_cgroup_from_id(), which
    checks that id != 0 before issuing the function call. Today, there is
    no point in this additional check apart from optimization, because there
    is no css with id 0 for css_from_id() to return.
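
    For reference, the removed wrapper was essentially just the following
    (sketch):

        static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
        {
                /* id 0 is unused, so skip the lookup */
                if (!id)
                        return NULL;
                return mem_cgroup_from_id(id);
        }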

    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

15 Apr, 2015

3 commits

  • If the kernel panics due to an OOM caused by a cgroup reaching its limit
    while 'compulsory panic_on_oom' is enabled, we will only see that the OOM
    happened because "compulsory panic_on_oom is enabled", which doesn't tell
    the difference between mempolicy and memcg. And dumping system-wide
    information is plain wrong and more confusing. This patch provides the
    information of the cgroup whose limit triggered the panic.

    Signed-off-by: Balasubramani Vivekanandan
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balasubramani Vivekanandan
     
  • When !MMU, the build reports a warning. The related warning with
    allmodconfig under c6x:

    CC mm/memcontrol.o
    mm/memcontrol.c:2802:12: warning: 'mem_cgroup_move_account' defined but not used [-Wunused-function]
    static int mem_cgroup_move_account(struct page *page,
    ^

    Signed-off-by: Chen Gang
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • Add myself to the list of copyright holders.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

13 Mar, 2015

1 commit

  • If the memory cgroup controller is initially mounted in the scope of the
    default cgroup hierarchy and then remounted to a legacy hierarchy, it will
    still have hierarchy support enabled, which is incorrect. We should
    disable hierarchy support if bound to the legacy cgroup hierarchy.

    Signed-off-by: Vladimir Davydov
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

01 Mar, 2015

2 commits

  • The memcg control knobs indicate the highest possible value using the
    symbolic name "infinity", which is long and awkward to type.

    Switch to the string "max", which is just as descriptive but shorter and
    sweeter.

    This changes a user interface, so do it before the release and before
    the development flag is dropped from the default hierarchy.

    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • A memcg is considered low limited even when its current usage is equal to
    the low limit. This leads to interesting side effects, e.g.
    groups/hierarchies with no memory accounted are considered protected and
    so reclaim will emit the MEMCG_LOW event when encountering them.

    Another and much bigger issue was reported by Joonsoo Kim. He hit a
    NULL pointer dereference with the legacy cgroup API, which doesn't even
    expose the low limit. The limit is 0 by default, but the initial check
    fails for a memcg with 0 consumption, and parent_mem_cgroup() would
    return NULL if use_hierarchy is 0, so page_counter_read() would try to
    dereference NULL.

    I suppose that the current implementation is just an oversight because the
    documentation in Documentation/cgroups/unified-hierarchy.txt says:

    "The memory.low boundary on the other hand is a top-down allocated
    reserve. A cgroup enjoys reclaim protection when it and all its
    ancestors are below their low boundaries"

    Fix the usage and the low limit comparison in mem_cgroup_low() accordingly.
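
    The corrected check roughly becomes the following (a simplified sketch;
    function and field names approximate):

        /* sketch: protection only applies while usage is strictly below the
         * low boundary, and the hierarchy walk never dereferences a NULL
         * parent */
        static bool memcg_is_low_sketch(struct mem_cgroup *root,
                                        struct mem_cgroup *memcg)
        {
                if (page_counter_read(&memcg->memory) >= memcg->low)
                        return false;           /* usage == low no longer counts */

                while (memcg != root) {
                        memcg = parent_mem_cgroup(memcg);
                        if (!memcg || memcg == root_mem_cgroup)
                                break;
                        if (page_counter_read(&memcg->memory) >= memcg->low)
                                return false;
                }
                return true;
        }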

    Fixes: 241994ed8649 (mm: memcontrol: default hierarchy interface for memory)
    Reported-by: Joonsoo Kim
    Signed-off-by: Michal Hocko
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

13 Feb, 2015

8 commits

  • Move memcg_socket_limit_enabled decrement to tcp_destroy_cgroup (called
    from memcg_destroy_kmem -> mem_cgroup_sockets_destroy) and zap a bunch of
    wrapper functions.

    Although this patch moves static keys decrement from __mem_cgroup_free to
    mem_cgroup_css_free, it does not introduce any functional changes, because
    the keys are incremented on setting the limit (tcp or kmem), which can
    only happen after successful mem_cgroup_css_online.

    Signed-off-by: Vladimir Davydov
    Cc: Glauber Costa
    Cc: KAMEZAWA Hiroyuki
    Cc: Eric W. Biederman
    Cc: David S. Miller
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Now, the only reason to keep kmemcg_id till css free is list_lru, which
    uses it to distribute elements between per-memcg lists. However, it can
    be easily sorted out - we only need to change the kmemcg_id of an offline
    cgroup to its parent's id, making further list_lru_add() calls add elements
    to the parent's list, and then move all elements from the offline cgroup's
    list to that of its parent. It will work, because a racing
    list_lru_del() does not need to know the list it is deleting the element
    from. It can decrement the wrong nr_items counter though, but the ongoing
    reparenting will fix it. After list_lru reparenting is done we are free
    to release kmemcg_id saving a valuable slot in a per-memcg array for new
    cgroups.

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • We need to look up a kmem_cache in ->memcg_params.memcg_caches arrays only
    on allocations, so there is no need to have the array entries set until
    css free - we can clear them on css offline. This will allow us to reuse
    array entries more efficiently and avoid costly array relocations.

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, kmem_cache stores a pointer to struct memcg_cache_params
    instead of embedding it. The rationale is to save memory when kmem
    accounting is disabled. However, the memcg_cache_params has shrivelled
    drastically since it was first introduced:

    * Initially:

    struct memcg_cache_params {
            bool is_root_cache;
            union {
                    struct kmem_cache *memcg_caches[0];
                    struct {
                            struct mem_cgroup *memcg;
                            struct list_head list;
                            struct kmem_cache *root_cache;
                            bool dead;
                            atomic_t nr_pages;
                            struct work_struct destroy;
                    };
            };
    };

    * Now:

    struct memcg_cache_params {
            bool is_root_cache;
            union {
                    struct {
                            struct rcu_head rcu_head;
                            struct kmem_cache *memcg_caches[0];
                    };
                    struct {
                            struct mem_cgroup *memcg;
                            struct kmem_cache *root_cache;
                    };
            };
    };

    So the memory saving does not seem to be a clear win anymore.

    OTOH, keeping a pointer to memcg_cache_params struct instead of embedding
    it results in touching one more cache line on kmem alloc/free hot paths.
    Besides, it makes linking kmem caches in a list chained by a field of
    struct memcg_cache_params really painful due to a level of indirection,
    while I want to make them linked in the following patch. That said, let
    us embed it.

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Dave Chinner
    Cc: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • There are several FS shrinkers, including super_block::s_shrink, that
    keep reclaimable objects in the list_lru structure. Hence to turn them
    to memcg-aware shrinkers, it is enough to make list_lru per-memcg.

    This patch does the trick. It adds an array of lru lists to the
    list_lru_node structure (per-node part of the list_lru), one for each
    kmem-active memcg, and dispatches every item addition or removal to the
    list corresponding to the memcg which the item is accounted to. So now
    the list_lru structure is not just per node, but per node and per memcg.

    Not all list_lrus need this feature, so this patch also adds a new
    method, list_lru_init_memcg, which initializes a list_lru as memcg
    aware. Otherwise (i.e. if initialized with the old list_lru_init), the
    list_lru won't have per-memcg lists.

    Just like per memcg caches arrays, the arrays of per-memcg lists are
    indexed by memcg_cache_id, so we must grow them whenever
    memcg_nr_cache_ids is increased. So we introduce a callback,
    memcg_update_all_list_lrus, invoked by memcg_alloc_cache_id if the id
    space is full.

    The locking is implemented in a manner similar to lruvecs, i.e. we have
    one lock per node that protects all lists (both global and per cgroup) on
    the node.
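
    The resulting data layout looks roughly like this (field names are
    approximate and kmem-config guards are omitted; a sketch only):

        struct list_lru_one {
                struct list_head        list;
                long                    nr_items;
        };

        struct list_lru_memcg {
                /* one lru per kmem-active memcg, indexed by memcg_cache_id */
                struct list_lru_one     *lru[0];
        };

        struct list_lru_node {
                spinlock_t              lock;           /* protects all lists on this node */
                struct list_lru_one     lru;            /* global list */
                struct list_lru_memcg   *memcg_lrus;    /* per-memcg lists, may be NULL */
        };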

    Signed-off-by: Vladimir Davydov
    Cc: Dave Chinner
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Glauber Costa
    Cc: Alexander Viro
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • We need a stable value of memcg_nr_cache_ids in kmem_cache_create()
    (memcg_alloc_cache_params() wants it for root caches), where we only
    hold the slab_mutex and no memcg-related locks. As a result, we have to
    update memcg_nr_cache_ids under the slab_mutex, which we can only take
    on the slab's side (see memcg_update_array_size). This looks awkward
    and will become even worse when per-memcg list_lru is introduced, which
    also wants stable access to memcg_nr_cache_ids.

    To get rid of this dependency between the memcg_nr_cache_ids and the
    slab_mutex, this patch introduces a special rwsem. The rwsem is held
    for writing during memcg_caches arrays relocation and memcg_nr_cache_ids
    updates. Therefore one can take it for reading to get a stable access
    to memcg_caches arrays and/or memcg_nr_cache_ids.

    Currently the semaphore is taken for reading only from
    kmem_cache_create, right before taking the slab_mutex, so right now
    there's not much point in using an rwsem instead of a mutex. However,
    once list_lru is made per-memcg, it will allow list_lru initializations
    to proceed concurrently.
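
    The helpers are thin wrappers around the new rwsem; roughly (treat the
    names and bodies as a sketch):

        static DECLARE_RWSEM(memcg_cache_ids_sem);

        /* held across any access that needs a stable memcg_nr_cache_ids */
        void memcg_get_cache_ids(void)
        {
                down_read(&memcg_cache_ids_sem);
        }

        void memcg_put_cache_ids(void)
        {
                up_read(&memcg_cache_ids_sem);
        }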

    Signed-off-by: Vladimir Davydov
    Cc: Dave Chinner
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Glauber Costa
    Cc: Alexander Viro
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • memcg_limited_groups_array_size, which defines the size of memcg_caches
    arrays, sounds rather cumbersome. Also, the name doesn't indicate in any
    way that it's related to kmem/caches stuff. So let's rename it to
    memcg_nr_cache_ids. It's concise and points us directly to
    memcg_cache_id.

    Also, rename kmem_limited_groups to memcg_cache_ida.

    Signed-off-by: Vladimir Davydov
    Cc: Dave Chinner
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Glauber Costa
    Cc: Alexander Viro
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • This patch adds SHRINKER_MEMCG_AWARE flag. If a shrinker has this flag
    set, it will be called per memory cgroup. The memory cgroup to scan
    objects from is passed in shrink_control->memcg. If the memory cgroup
    is NULL, a memcg aware shrinker is supposed to scan objects from the
    global list. Unaware shrinkers are only called on global pressure with
    memcg=NULL.
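
    A memcg-aware shrinker then registers with the flag and consults
    shrink_control->memcg; a sketch with hypothetical callbacks and list
    helper:

        static unsigned long my_count_objects(struct shrinker *s,
                                              struct shrink_control *sc)
        {
                /* sc->memcg selects whose objects to count; NULL means the
                 * global list (my_list_count is a hypothetical helper) */
                return my_list_count(sc->memcg, sc->nid);
        }

        static struct shrinker my_shrinker = {
                .count_objects  = my_count_objects,
                .scan_objects   = my_scan_objects,      /* hypothetical */
                .seeks          = DEFAULT_SEEKS,
                .flags          = SHRINKER_MEMCG_AWARE,
        };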

    Signed-off-by: Vladimir Davydov
    Cc: Dave Chinner
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Glauber Costa
    Cc: Alexander Viro
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

12 Feb, 2015

8 commits

  • pagewalk.c can handle the vma by itself, so we don't have to pass the vma
    via walk->private. Both mem_cgroup_count_precharge() and
    mem_cgroup_move_charge() do a per-vma loop themselves, but now that this
    is done in pagewalk.c, let's clean them up.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Johannes Weiner
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Cyrill Gorcunov
    Cc: Dave Hansen
    Cc: Kirill A. Shutemov
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • The swap controller code is scattered all over the file. Gather all
    the code that isn't directly needed by the memory controller at the
    end of the file in its own CONFIG_MEMCG_SWAP section.

    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Reviewed-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The initialization code for the per-cpu charge stock and the soft
    limit tree is compact enough to inline it into mem_cgroup_init().

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Reviewed-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • - No need to test the node for N_MEMORY. node_online() is enough for
    node fallback to work in slab, use NUMA_NO_NODE for everything else.

    - Remove the BUG_ON() for allocation failure. A NULL pointer crash is
    just as descriptive, and the absent return value check is obvious.

    - Move local variables to the inner-most blocks.

    - Point to the tree structure after it's initialized, not before; it's
    just more logical that way.

    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Guenter Roeck
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit 5695be142e20 ("OOM, PM: OOM killed task shouldn't escape PM
    suspend") has left a race window when the OOM killer manages to
    note_oom_kill() after freeze_processes() checks the counter. The race
    window is quite small and really unlikely, and a partial solution was
    deemed sufficient at the time of submission.

    Tejun wasn't happy about this partial solution, though, and insisted on a
    full solution. That requires full exclusion between the OOM killer and
    the freezer's task freezing. This is done by this patch, which introduces
    the oom_sem RW lock and turns oom_killer_disable() into a full OOM
    barrier.

    oom_killer_disabled check is moved from the allocation path to the OOM
    level and we take oom_sem for reading for both the check and the whole
    OOM invocation.

    oom_killer_disable() takes oom_sem for writing so it waits for all
    currently running OOM killer invocations. Then it disables all further
    OOMs by setting oom_killer_disabled and checks for any oom victims.
    Victims are counted via mark_tsk_oom_victim resp. unmark_oom_victim. The
    last victim wakes up all waiters enqueued by oom_killer_disable().
    Therefore this function acts as the full OOM barrier.
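
    Put together, oom_killer_disable() looks roughly like this (a sketch; the
    caller-side checks and error handling are simplified):

        /* sketch: wait out running OOM invocations, forbid new ones, then
         * wait for already-selected victims to exit */
        bool oom_killer_disable(void)
        {
                down_write(&oom_sem);
                oom_killer_disabled = true;
                up_write(&oom_sem);

                wait_event(oom_victims_wait, !atomic_read(&oom_victims));
                return true;
        }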

    The page fault path is covered now as well although it was assumed to be
    safe before. As per Tejun, "We used to have freezing points deep in file
    system code which may be reachable from page fault." so it would be
    better and more robust to not rely on freezing points here. Same applies
    to the memcg OOM killer.

    out_of_memory tells the caller whether the OOM was allowed to trigger and
    the callers are supposed to handle the situation. The page allocation
    path simply fails the allocation same as before. The page fault path will
    retry the fault (more on that later) and Sysrq OOM trigger will simply
    complain to the log.

    Normally there wouldn't be any unfrozen user tasks after
    try_to_freeze_tasks so the function will not block. But if there was an
    OOM killer racing with try_to_freeze_tasks and the OOM victim didn't
    finish yet then we have to wait for it. This should complete in a finite
    time, though, because

    - the victim cannot loop in the page fault handler (it would die
    on the way out from the exception)
    - it cannot loop in the page allocator because all the further
    allocation would fail and __GFP_NOFAIL allocations are not
    acceptable at this stage
    - it shouldn't be blocked on any locks held by frozen tasks
    (try_to_freeze expects lockless context) and kernel threads and
    work queues are not frozen yet

    Signed-off-by: Michal Hocko
    Suggested-by: Tejun Heo
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Oleg Nesterov
    Cc: Cong Wang
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • This patchset addresses a race which was described in the changelog for
    5695be142e20 ("OOM, PM: OOM killed task shouldn't escape PM suspend"):

    : PM freezer relies on having all tasks frozen by the time devices are
    : getting frozen so that no task will touch them while they are getting
    : frozen. But OOM killer is allowed to kill an already frozen task in order
    : to handle OOM situation. In order to protect from late wake ups OOM
    : killer is disabled after all tasks are frozen. This, however, still keeps
    : a window open when a killed task didn't manage to die by the time
    : freeze_processes finishes.

    The original patch hasn't closed the race window completely because that
    would require a more complex solution as it can be seen by this patchset.

    The primary motivation was to close the race condition between the OOM
    killer and the PM freezer _completely_. As Tejun pointed out, even though
    the race condition is unlikely, the weird bugs it could cause deep in the
    PM freezer would be that much harder to debug when the debugging options
    are reduced considerably. I can only speculate what might happen when a
    task is still runnable unexpectedly.

    On the plus side, and as a side effect, the oom enable/disable has better
    (full barrier) semantics without polluting hot paths.

    I have tested the series in KVM with 100M RAM:
    - many small tasks (20M anon mmap) which are triggering OOM continually
    - s2ram which resumes automatically is triggered in a loop:

          echo processors > /sys/power/pm_test
          while true
          do
                  echo mem > /sys/power/state
                  sleep 1s
          done

    - simple module which allocates and frees 20M in 8K chunks. If it sees
    freezing(current) then it tries another round of allocation before calling
    try_to_freeze
    - debugging messages of PM stages and OOM killer enable/disable/fail added
    and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before
    it wakes up waiters.
    - rebased on top of the current mmotm which means some necessary updates
    in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but
    I think this should be OK because __thaw_task shouldn't interfere with any
    locking down wake_up_process. Oleg?

    As expected there are no OOM killed tasks after oom is disabled and
    allocations requested by the kernel thread are failing after all the tasks
    are frozen and OOM disabled. I wasn't able to catch a race where
    oom_killer_disable() would really have to wait, but I kind of expected
    that, since the race is really unlikely.

    [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB
    [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1
    [ 243.636072] (elapsed 2.837 seconds) done.
    [ 243.641985] Trying to disable OOM killer
    [ 243.643032] Waiting for concurent OOM victims
    [ 243.644342] OOM killer disabled
    [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done.
    [ 243.652983] Suspending console(s) (use no_console_suspend to debug)
    [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010
    [...]
    [ 243.992600] PM: suspend of devices complete after 336.667 msecs
    [ 243.993264] PM: late suspend of devices complete after 0.660 msecs
    [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs
    [ 243.994717] ACPI: Preparing to enter system sleep state S3
    [ 243.994795] PM: Saving platform NVS memory
    [ 243.994796] Disabling non-boot CPUs ...

    The first 2 patches are simple cleanups for OOM. They should go in
    regardless of the rest IMO.

    Patches 3 and 4 are trivial printk -> pr_info conversion and they should
    go in ditto.

    The main patch is the last one and I would appreciate acks from Tejun and
    Rafael. I think the OOM part should be OK (except for __thaw_task vs.
    task_lock where a look from Oleg would be appreciated) but I am not so sure I
    haven't screwed anything in the freezer code. I have found several
    surprises there.

    This patch (of 5):

    This patch is just a preparatory and it doesn't introduce any functional
    change.

    Note:
    I am utterly unhappy about the lowmemory killer abusing TIF_MEMDIE just to
    wait for the oom victim and to prevent new killing. This is
    just a side effect of the flag. The primary meaning is to give the oom
    victim access to the memory reserves, and that shouldn't be necessary
    here.

    Signed-off-by: Michal Hocko
    Cc: Tejun Heo
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Oleg Nesterov
    Cc: Cong Wang
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Turn the move type enum into flags and give the flags field a shorter
    name. Once that is done, move_anon() and move_file() are simple enough to
    just fold them into the callsites.
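
    After the conversion, the flags look roughly like this (a sketch):

        #define MOVE_ANON       0x1U
        #define MOVE_FILE       0x2U
        #define MOVE_MASK       (MOVE_ANON | MOVE_FILE)

    At the callsites, a former move_anon() test then reads simply as a flag
    check such as mc.flags & MOVE_ANON.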

    [akpm@linux-foundation.org: tweak MOVE_MASK definition, per Michal]
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Reviewed-by: Vladimir Davydov
    Cc: Greg Thelen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Introduce the basic control files to account, partition, and limit
    memory using cgroups in default hierarchy mode.

    This interface versioning allows us to address fundamental design
    issues in the existing memory cgroup interface, further explained
    below. The old interface will be maintained indefinitely, but a
    clearer model and improved workload performance should encourage
    existing users to switch over to the new one eventually.

    The control files are thus:

    - memory.current shows the current consumption of the cgroup and its
    descendants, in bytes.

    - memory.low configures the lower end of the cgroup's expected
    memory consumption range. The kernel considers memory below that
    boundary to be a reserve - the minimum that the workload needs in
    order to make forward progress - and generally avoids reclaiming
    it, unless there is an imminent risk of entering an OOM situation.

    - memory.high configures the upper end of the cgroup's expected
    memory consumption range. A cgroup whose consumption grows beyond
    this threshold is forced into direct reclaim, to work off the
    excess and to throttle new allocations heavily, but is generally
    allowed to continue and the OOM killer is not invoked.

    - memory.max configures the hard maximum amount of memory that the
    cgroup is allowed to consume before the OOM killer is invoked.

    - memory.events shows event counters that indicate how often the
    cgroup was reclaimed while below memory.low, how often it was
    forced to reclaim excess beyond memory.high, how often it hit
    memory.max, and how often it entered OOM due to memory.max. This
    allows users to identify configuration problems when observing a
    degradation in workload performance. An overcommitted system will
    have an increased rate of low boundary breaches, whereas increased
    rates of high limit breaches, maximum hits, or even OOM situations
    will indicate internally overcommitted cgroups.

    For existing users of memory cgroups, the following deviations from
    the current interface are worth pointing out and explaining:

    - The original lower boundary, the soft limit, is defined as a limit
    that is per default unset. As a result, the set of cgroups that
    global reclaim prefers is opt-in, rather than opt-out. The costs
    for optimizing these mostly negative lookups are so high that the
    implementation, despite its enormous size, does not even provide
    the basic desirable behavior. First off, the soft limit has no
    hierarchical meaning. All configured groups are organized in a
    global rbtree and treated like equal peers, regardless where they
    are located in the hierarchy. This makes subtree delegation
    impossible. Second, the soft limit reclaim pass is so aggressive
    that it not just introduces high allocation latencies into the
    system, but also impacts system performance due to overreclaim, to
    the point where the feature becomes self-defeating.

    The memory.low boundary on the other hand is a top-down allocated
    reserve. A cgroup enjoys reclaim protection when it and all its
    ancestors are below their low boundaries, which makes delegation
    of subtrees possible. Secondly, new cgroups have no reserve per
    default and in the common case most cgroups are eligible for the
    preferred reclaim pass. This allows the new low boundary to be
    efficiently implemented with just a minor addition to the generic
    reclaim code, without the need for out-of-band data structures and
    reclaim passes. Because the generic reclaim code considers all
    cgroups except for the ones running low in the preferred first
    reclaim pass, overreclaim of individual groups is eliminated as
    well, resulting in much better overall workload performance.

    - The original high boundary, the hard limit, is defined as a strict
    limit that can not budge, even if the OOM killer has to be called.
    But this generally goes against the goal of making the most out of
    the available memory. The memory consumption of workloads varies
    during runtime, and that requires users to overcommit. But doing
    that with a strict upper limit requires either a fairly accurate
    prediction of the working set size or adding slack to the limit.
    Since working set size estimation is hard and error prone, and
    getting it wrong results in OOM kills, most users tend to err on
    the side of a looser limit and end up wasting precious resources.

    The memory.high boundary on the other hand can be set much more
    conservatively. When hit, it throttles allocations by forcing
    them into direct reclaim to work off the excess, but it never
    invokes the OOM killer. As a result, a high boundary that is
    chosen too aggressively will not terminate the processes, but
    instead it will lead to gradual performance degradation. The user
    can monitor this and make corrections until the minimal memory
    footprint that still gives acceptable performance is found.

    In extreme cases, with many concurrent allocations and a complete
    breakdown of reclaim progress within the group, the high boundary
    can be exceeded. But even then it's mostly better to satisfy the
    allocation from the slack available in other groups or the rest of
    the system than killing the group. Otherwise, memory.max is there
    to limit this type of spillover and ultimately contain buggy or
    even malicious applications.

    - The original control file names are unwieldy and inconsistent in
    many different ways. For example, the upper boundary hit count is
    exported in the memory.failcnt file, but an OOM event count has to
    be manually counted by listening to memory.oom_control events, and
    lower boundary / soft limit events have to be counted by first
    setting a threshold for that value and then counting those events.
    Also, usage and limit files encode their units in the filename.
    That makes the filenames very long, even though this is not
    information that a user needs to be reminded of every time they
    type out those names.

    To address these naming issues, as well as to signal clearly that
    the new interface carries a new configuration model, the naming
    conventions in it necessarily differ from the old interface.

    - The original limit files indicate the state of an unset limit with
    a very high number, and a configured limit can be unset by echoing
    -1 into those files. But that very high number is implementation
    and architecture dependent and not very descriptive. And while -1
    can be understood as an underflow into the highest possible value,
    -2 or -10M etc. do not work, so it's not consistent.

    memory.low, memory.high, and memory.max will use the string
    "infinity" to indicate and set the highest possible value.

    [akpm@linux-foundation.org: use seq_puts() for basic strings]
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Greg Thelen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner