23 Aug, 2010

1 commit


04 Aug, 2010

1 commit

  • Below is an updated version of the original series, bunching all patches into one big patch
    that updates broken web addresses located in Documentation/*.
    Some of the addresses date as far back as 1995, so searching became a bit difficult;
    the best way to deal with these is to use web.archive.org to locate the outdated addresses.
    There are also some addresses pointing to .spec files: some were located, but some (after searching
    on the company's site) were still nowhere to be found. In those cases I just changed the address
    to the company site, so that users can contact the company, which can locate the files for them.

    Signed-off-by: Justin P. Mattock
    Signed-off-by: Thomas Weber
    Signed-off-by: Mike Frysinger
    Cc: Paulo Marques
    Cc: Randy Dunlap
    Cc: Michael Neuling
    Signed-off-by: Jiri Kosina

    Justin P. Mattock
     

28 May, 2010

5 commits

  • Some of the information is old, and I think the current document doesn't work as
    "a guide for users". We need a summary of all of our controls, at least.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Randy Dunlap
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch adds support for moving the charge of file pages, which include
    normal files, tmpfs files and swaps of tmpfs files. It's enabled by setting
    bit 1 of the target cgroup's memory.move_charge_at_immigrate.

    Unlike the case of anonymous pages, file pages (and swaps) in the range
    mmapped by the task will be moved even if the task hasn't done a page fault,
    i.e. they might not be the task's "RSS", but another task's "RSS" that maps
    the same file. The mapcount of the page is also ignored (the page can be moved
    even if page_mapcount(page) > 1). So, the conditions that a page/swap
    must meet to be moved are that it is in the range mmapped by the
    target task and that it is charged to the old cgroup.
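
    As a rough illustration of the flag file, the sketch below builds the
    integer that would be written to memory.move_charge_at_immigrate. Only
    bit 1 (file pages) comes from this message; the meaning of bit 0
    (anonymous pages) is an assumption, and MoveCharge, move_charge_value()
    and write_move_charge() are hypothetical helper names.

```python
from enum import IntFlag


class MoveCharge(IntFlag):
    """Charge types selectable in memory.move_charge_at_immigrate."""
    ANON = 1 << 0  # bit 0: anonymous pages (assumption, not stated in this message)
    FILE = 1 << 1  # bit 1: file pages (normal files, tmpfs files, swaps of tmpfs files)


def move_charge_value(*kinds: MoveCharge) -> int:
    """Combine the requested charge types into the integer written to the file."""
    value = 0
    for kind in kinds:
        value |= int(kind)
    return value


def write_move_charge(cgroup_dir: str, value: int) -> None:
    """Write the mask into the flag file of the cgroup mounted at cgroup_dir."""
    with open(f"{cgroup_dir}/memory.move_charge_at_immigrate", "w") as f:
        f.write(str(value))
```

    For example, move_charge_value(MoveCharge.FILE) gives the value with only
    bit 1 set, which could then be passed to write_move_charge() on a system
    with the memory controller mounted.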

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix warning]
    Signed-off-by: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • This adds a feature to disable the oom-killer for memcg. If it is disabled,
    of course, tasks under the memcg will stop.

    But now we have an oom-notifier for memcg, and the world around the memcg is
    not out of memory: a memcg's out-of-memory just shows that the memcg hit its
    limit. The administrator or a management daemon can then recover the situation
    by

    - killing some processes
    - enlarging the limit or adding more swap
    - migrating some tasks
    - removing file cache on tmpfs (difficult ?)

    Unlike with the oom-killer, you can gather enough information before killing
    tasks (with gcore, ps, etc.).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Considering containers or other resource management software in userland,
    event notification of OOM in memcg should be implemented. Now that memcg has
    a "threshold" notifier which uses eventfd, we can make use of it for oom
    notification.

    This patch adds an oom notification eventfd callback for memcg. The usage is
    very similar to the threshold notifier, but the control file is memory.oom_control
    and no arguments other than the eventfd are required.

    % cgroup_event_notifier /cgroup/A/memory.oom_control dummy
    (About cgroup_event_notifier, see Documentation/cgroup/)
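
    A minimal userspace sketch of the same flow, in Python rather than the C
    helper named above. It assumes the registration string written to
    cgroup.event_control is the eventfd number followed by the fd of
    memory.oom_control, with no further arguments; oom_registration_line()
    and wait_for_oom() are hypothetical names, and os.eventfd() needs
    Python 3.10+ on Linux.

```python
import os
import struct


def oom_registration_line(event_fd: int, control_fd: int) -> str:
    """Build the string written to cgroup.event_control (assumed format)."""
    return f"{event_fd} {control_fd}"


def wait_for_oom(cgroup_dir: str) -> int:
    """Register for OOM events on cgroup_dir and block until one is signalled."""
    efd = os.eventfd(0)
    cfd = os.open(os.path.join(cgroup_dir, "memory.oom_control"), os.O_RDONLY)
    try:
        ecfd = os.open(os.path.join(cgroup_dir, "cgroup.event_control"), os.O_WRONLY)
        try:
            os.write(ecfd, oom_registration_line(efd, cfd).encode())
        finally:
            os.close(ecfd)
        # Reading an eventfd blocks until it is signalled and returns an
        # 8-byte counter in host byte order.
        (count,) = struct.unpack("=Q", os.read(efd, 8))
        return count
    finally:
        os.close(cfd)
        os.close(efd)
```

    Calling wait_for_oom("/cgroup/A") would block until the group reports OOM,
    after which a management daemon could collect information or raise the
    limit.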

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: David Rientjes
    Cc: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Signed-off-by: Trevor Woerner
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Trevor Woerner
     

22 May, 2010

1 commit


21 May, 2010

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (44 commits)
    vlynq: make whole Kconfig-menu dependant on architecture
    add descriptive comment for TIF_MEMDIE task flag declaration.
    EEPROM: max6875: Header file cleanup
    EEPROM: 93cx6: Header file cleanup
    EEPROM: Header file cleanup
    agp: use NULL instead of 0 when pointer is needed
    rtc-v3020: make bitfield unsigned
    PCI: make bitfield unsigned
    jbd2: use NULL instead of 0 when pointer is needed
    cciss: fix shadows sparse warning
    doc: inode uses a mutex instead of a semaphore.
    uml: i386: Avoid redefinition of NR_syscalls
    fix "seperate" typos in comments
    cocbalt_lcdfb: correct sections
    doc: Change urls for sparse
    Powerpc: wii: Fix typo in comment
    i2o: cleanup some exit paths
    Documentation/: it's -> its where appropriate
    UML: Fix compiler warning due to missing task_struct declaration
    UML: add kernel.h include to signal.c
    ...

    Linus Torvalds
     

29 Apr, 2010

1 commit


27 Apr, 2010

1 commit

  • This patch fixes a few usability and configurability issues.

    o All the cgroup based controller options are configurable from the
    "General Setup/Control Group Support/" menu. blkio is the only exception.
    Hence make this option visible in the above menu and make it configurable from
    there to bring it in line with the rest of the cgroup based controllers.

    o Get rid of CONFIG_DEBUG_CFQ_IOSCHED.

    This option currently does two things.

    - Enable printing of cgroup paths in blktrace
    - Enables CONFIG_DEBUG_BLK_CGROUP, which in turn displays additional stat
    files in cgroup.

    If we are using group scheduling, blktrace data is not really of much use
    if cgroup information is not present. To get this data, one currently has to
    also enable CONFIG_DEBUG_CFQ_IOSCHED, which in turn brings the overhead of
    all the additional debug stat files, which is not desired.

    Hence, this patch moves printing of cgroup paths under
    CONFIG_CFQ_GROUP_IOSCHED.

    This allows us to get rid of CONFIG_DEBUG_CFQ_IOSCHED completely. Now all
    the debug stat files are controlled only by CONFIG_DEBUG_BLK_CGROUP which
    can be enabled through config menu.

    Signed-off-by: Vivek Goyal
    Acked-by: Divyesh Shah
    Reviewed-by: Gui Jianfeng
    Signed-off-by: Jens Axboe

    Vivek Goyal
     

25 Apr, 2010

1 commit


23 Apr, 2010

1 commit


13 Apr, 2010

1 commit


09 Apr, 2010

4 commits

  • 1) group_wait_time - This is the amount of time the cgroup had to wait to get a
    timeslice for one of its queues from when it became busy, i.e., went from 0
    to 1 request queued. This is different from the io_wait_time which is the
    cumulative total of the amount of time spent by each IO in that cgroup waiting
    in the scheduler queue. This stat is a great way to find out any jobs in the
    fleet that are being starved or waiting for longer than what is expected (due
    to an IO controller bug or any other issue).
    2) empty_time - This is the amount of time a cgroup spends w/o any pending
    requests. This stat is useful when a job does not seem to be able to use its
    assigned disk share by helping check if that is happening due to an IO
    controller bug or because the job is not submitting enough IOs.
    3) idle_time - This is the amount of time spent by the IO scheduler idling
    for a given cgroup in anticipation of a better request than the existing ones
    from other queues/cgroups.

    All these stats are recorded using start and stop events. When reading these
    stats, we do not add the delta between the current time and the last start time
    if we're between the start and stop events. We avoid doing this to make sure
    that these numbers are always monotonically increasing when read. Since we're
    using sched_clock() which may use the tsc as its source, it may induce some
    inconsistency (due to tsc resync across cpus) if we included the current delta.
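
    The start/stop accounting described above can be sketched as follows.
    This is an illustrative model, not the blk-cgroup code: StartStopStat is
    a hypothetical name, and timestamps are passed in explicitly instead of
    coming from sched_clock().

```python
class StartStopStat:
    """Accumulates time between start and stop events."""

    def __init__(self):
        self.total = 0      # completed intervals only
        self._start = None  # timestamp of the pending start event, if any

    def start(self, now):
        """Record a start event unless an interval is already open."""
        if self._start is None:
            self._start = now

    def stop(self, now):
        """Close the open interval and add its length to the total."""
        if self._start is not None:
            # Ignore a clock that appears to have gone backwards
            # (for example after a tsc resync) instead of subtracting time.
            if now > self._start:
                self.total += now - self._start
            self._start = None

    def read(self):
        """Return only completed time; the in-progress delta is not added."""
        return self.total
```

    Because read() never includes the open interval, successive reads can
    only stay equal or grow, which is the property the message asks for.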

    Signed-off-by: Divyesh Shah
    Signed-off-by: Jens Axboe

    Divyesh Shah
     
  • These stats are useful for getting a feel for the queue depth of the cgroup,
    i.e., how filled up its queues are at a given instant and over the existence of
    the cgroup. This ability is useful when debugging problems in the wild as it
    helps understand the application's IO pattern w/o having to read through the
    userspace code (because it's tedious or just not available) or w/o the ability
    to run blktrace (since you may not have root access and/or not want to disturb
    performance).

    Signed-off-by: Divyesh Shah
    Signed-off-by: Jens Axboe

    Divyesh Shah
     
  • This includes both the number of bios merged into requests belonging to this
    cgroup as well as the number of requests merged together.
    In the past, we've observed different merging behavior across upstream kernels,
    some by design, some actual bugs. This stat helps a lot in debugging such
    problems when applications report decreased throughput with a new kernel
    version.

    This needed adding an extra elevator function to capture bios being merged as I
    did not want to pollute elevator code with blkiocg knowledge and hence needed
    the accounting invocation to come from CFQ.

    Signed-off-by: Divyesh Shah
    Signed-off-by: Jens Axboe

    Divyesh Shah
     
  • that include some minor fixes and addresses all comments.

    Changelog: (most based on Vivek Goyal's comments)
    o renamed blkiocg_reset_write to blkiocg_reset_stats
    o more clarification in the documentation on io_service_time and io_wait_time
    o Initialize blkg->stats_lock
    o rename io_add_stat to blkio_add_stat and declare it static
    o use bool for direction and sync
    o derive direction and sync info from existing rq methods
    o use 12 for major:minor string length
    o define io_service_time better to cover the NCQ case
    o add a separate reset_stats interface
    o make the indexed stats a 2d array to simplify macro and function pointer code
    o blkio.time now exports in jiffies as before
    o Added stats description in patch description and
    Documentation/cgroup/blkio-controller.txt
    o Prefix all stats functions with blkio and make them static as applicable
    o replace IO_TYPE_MAX with IO_TYPE_TOTAL
    o Moved #define constant to top of blk-cgroup.c
    o Pass dev_t around instead of char *
    o Add note to documentation file about resetting stats
    o use BLK_CGROUP_MODULE in addition to BLK_CGROUP config option in #ifdef
    statements
    o Avoid struct request specific knowledge in blk-cgroup. blk-cgroup.h now has
    rq_direction() and rq_sync() functions which are used by CFQ and when using
    io-controller at a higher level, bio_* functions can be added.

    Signed-off-by: Divyesh Shah
    Signed-off-by: Jens Axboe

    Divyesh Shah
     

25 Mar, 2010

2 commits


19 Mar, 2010

1 commit


16 Mar, 2010

1 commit


13 Mar, 2010

14 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (56 commits)
    doc: fix typo in comment explaining rb_tree usage
    Remove fs/ntfs/ChangeLog
    doc: fix console doc typo
    doc: cpuset: Update the cpuset flag file
    Fix of spelling in arch/sparc/kernel/leon_kernel.c no longer needed
    Remove drivers/parport/ChangeLog
    Remove drivers/char/ChangeLog
    doc: typo - Table 1-2 should refer to "status", not "statm"
    tree-wide: fix typos "ass?o[sc]iac?te" -> "associate" in comments
    No need to patch AMD-provided drivers/gpu/drm/radeon/atombios.h
    devres/irq: Fix devm_irq_match comment
    Remove reference to kthread_create_on_cpu
    tree-wide: Assorted spelling fixes
    tree-wide: fix 'lenght' typo in comments and code
    drm/kms: fix spelling in error message
    doc: capitalization and other minor fixes in pnp doc
    devres: typo fix s/dev/devm/
    Remove redundant trailing semicolons from macros
    fix typo "definetly" -> "definitely" in comment
    tree-wide: s/widht/width/g typo in comments
    ...

    Fix trivial conflict in Documentation/laptops/00-INDEX

    Linus Torvalds
     
  • Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Description of sanity check for memory thresholds.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Cc: Dan Malek
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • An example of cgroup notification API usage.

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Cc: Dan Malek
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Presently, if panic_on_oom=2, the whole system panics even if the oom
    happened in some special situation (such as cpuset, mempolicy....). That is,
    panic_on_oom=2 means panic_on_oom_always.

    Now, memcg doesn't check the panic_on_oom flag. This patch adds a check.

    BTW, how is it useful?

    kdump+panic_on_oom=2 is the last tool to investigate what happens in an
    oom-ed system. When a task is killed, the system recovers and there will
    be few hints to know what happened. In a mission critical system, oom should
    never happen. Then, panic_on_oom=2+kdump is useful to avoid the next OOM by
    getting precise information via a snapshot.

    TODO:
    - Since memcg is for isolating the system's memory usage, oom-notifier and
    freeze_at_oom (or rest_at_oom) should be implemented. Then, a management
    daemon can do similar jobs (as kdump) or take a snapshot per cgroup.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Nick Piggin
    Reviewed-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Update memcg_test.txt to describe how to test the move-charge feature.

    Signed-off-by: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • It allows registering multiple memory and memsw thresholds and getting
    notifications when usage crosses them.

    To register a threshold, an application needs to:
    - create an eventfd;
    - open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
    - write a string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to
    cgroup.event_control.

    The application will be notified through the eventfd when memory usage crosses
    the threshold in any direction.

    It's applicable to root and non-root cgroups.

    It uses stats to track memory usage, similar to soft limits. It checks
    whether we need to send an event to userspace on every 100 page in/out. I guess
    it's a good compromise between performance and accuracy of thresholds.
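
    The batching trade-off can be illustrated with a small model. This is a
    sketch, not the memcg implementation: ThresholdWatcher and
    crossed_thresholds() are hypothetical names, the interval of 100 page
    events is taken from the text above, and the exact boundary handling
    (a threshold counts as crossed when usage moves from below it to at or
    above it, or back) is an assumption.

```python
CHECK_INTERVAL = 100  # page in/out events between threshold checks


def crossed_thresholds(prev_usage, cur_usage, thresholds):
    """Return thresholds lying between two usage samples, in either direction."""
    lo, hi = sorted((prev_usage, cur_usage))
    return [t for t in sorted(thresholds) if lo < t <= hi]


class ThresholdWatcher:
    """Checks thresholds only once per CHECK_INTERVAL page events."""

    def __init__(self, thresholds, usage=0):
        self.thresholds = sorted(thresholds)
        self.usage = usage          # current usage in bytes
        self.checked_usage = usage  # usage at the time of the last check
        self.events = 0             # page events since the last check

    def page_event(self, delta_bytes):
        """Account one page in/out; return thresholds to notify, if any."""
        self.usage += delta_bytes
        self.events += 1
        if self.events < CHECK_INTERVAL:
            return []
        self.events = 0
        hit = crossed_thresholds(self.checked_usage, self.usage, self.thresholds)
        self.checked_usage = self.usage
        return hit
```

    In this model a threshold that is crossed and re-crossed within one batch
    of 100 events is never reported, which is the accuracy given up in
    exchange for not checking on every page.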

    [akpm@linux-foundation.org: coding-style fixes]
    [nishimura@mxp.nes.nec.co.jp: fix documentation merge issue]
    Signed-off-by: Kirill A. Shutemov
    Cc: Li Zefan
    Cc: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Cc: Dan Malek
    Cc: Vladislav Buzov
    Cc: Daisuke Nishimura
    Cc: Alexander Shishkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • This patchset introduces eventfd-based API for notifications in cgroups
    and implements memory notifications on top of it.

    It uses statistics in the memory controller to track memory usage.

    Output of time(1) on building kernel on tmpfs:

    Root cgroup before changes:
    make -j2 506.37 user 60.93s system 193% cpu 4:52.77 total
    Non-root cgroup before changes:
    make -j2 507.14 user 62.66s system 193% cpu 4:54.74 total
    Root cgroup after changes (0 thresholds):
    make -j2 507.13 user 62.20s system 193% cpu 4:53.55 total
    Non-root cgroup after changes (0 thresholds):
    make -j2 507.70 user 64.20s system 193% cpu 4:55.70 total
    Root cgroup after changes (1 thresholds, never crossed):
    make -j2 506.97 user 62.20s system 193% cpu 4:53.90 total
    Non-root cgroup after changes (1 thresholds, never crossed):
    make -j2 507.55 user 64.08s system 193% cpu 4:55.63 total

    This patch:

    Introduce the write-only file "cgroup.event_control" in every cgroup.

    To register a new notification handler you need to:
    - create an eventfd;
    - open a control file to be monitored. Callbacks register_event() and
    unregister_event() must be defined for the control file;
    - write "<event_fd> <control_file_fd> <args>" to cgroup.event_control.
    Interpretation of args is defined by the control file implementation.

    The eventfd will be woken up by the control file implementation or when the
    cgroup is removed.

    To unregister a notification handler just close the eventfd.

    If you need notification functionality for a control file you have to
    implement callbacks register_event() and unregister_event() in the
    struct cftype.
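
    To make the split between the generic part and the per-file part
    concrete, here is a sketch of how a write to cgroup.event_control could
    be parsed. It assumes the written string is an eventfd number, the fd of
    the control file, and an optional free-form argument string separated by
    whitespace; EventRequest and parse_event_control() are hypothetical
    names, not kernel functions.

```python
from typing import NamedTuple


class EventRequest(NamedTuple):
    event_fd: int    # eventfd to signal
    control_fd: int  # fd of the control file being monitored
    args: str        # left uninterpreted for the control file implementation


def parse_event_control(buf: str) -> EventRequest:
    """Split a registration string into its two fds and the raw args."""
    parts = buf.strip().split(None, 2)
    if len(parts) < 2:
        raise ValueError("expected '<event_fd> <control_file_fd> [args]'")
    event_fd, control_fd = int(parts[0]), int(parts[1])
    if event_fd < 0 or control_fd < 0:
        raise ValueError("file descriptors must be non-negative")
    args = parts[2] if len(parts) == 3 else ""
    return EventRequest(event_fd, control_fd, args)
```

    The generic code only needs the two descriptors; the args string would be
    handed unchanged to the control file's register_event() callback.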

    [kamezawa.hiroyu@jp.fujitsu.com: Kconfig fix]
    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: KAMEZAWA Hiroyuki
    Paul Menage
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Cc: Dan Malek
    Cc: Vladislav Buzov
    Cc: Daisuke Nishimura
    Cc: Alexander Shishkin
    Cc: Davide Libenzi
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • This patch is another core part of this move-charge-at-task-migration
    feature. It enables moving charges of anonymous swaps.

    To move the charge of swap, we need to exchange swap_cgroup's record.

    In current implementation, swap_cgroup's record is protected by:

    - page lock: if the entry is on swap cache.
    - swap_lock: if the entry is not on swap cache.

    This works well in usual swap-in/out activity.

    But this behavior makes the feature of moving swap charge check many
    conditions to exchange swap_cgroup's record safely.

    So I changed the modification of swap_cgroup's record (swap_cgroup_record())
    to use xchg, and defined a new function to cmpxchg swap_cgroup's record.

    This patch also enables moving the charge of non pte_present but not uncharged
    swap caches, which can exist on the swap-out path, by getting the target
    pages via find_get_page() as do_mincore() does.

    [kosaki.motohiro@jp.fujitsu.com: fix ia64 build]
    [akpm@linux-foundation.org: fix typos]
    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Paul Menage
    Cc: Daisuke Nishimura
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • In current memcg, charges associated with a task aren't moved to the new
    cgroup at task migration. Some users feel this behavior to be strange.
    These patches are for this feature, that is, for charging to the new
    cgroup and, of course, uncharging from the old cgroup at task migration.

    This patch adds "memory.move_charge_at_immigrate" file, which is a flag
    file to determine whether charges should be moved to the new cgroup at
    task migration or not and what type of charges should be moved. This
    patch also adds read and write handlers of the file.

    This patch also adds no-op handlers for this feature. These handlers will
    be implemented in later patches. And you cannot write any values other
    than 0 to move_charge_at_immigrate yet.

    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Paul Menage
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • Add a forgotten item into CONTENTS.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Provides support for unloading modular subsystems.

    This patch adds a new function cgroup_unload_subsys which is to be used
    for removing a loaded subsystem during module deletion. Reference
    counting of the subsystems' modules is moved from once (at load time) to
    once per attached hierarchy (in parse_cgroupfs_options and
    rebind_subsystems) (i.e., 0 or 1).

    Signed-off-by: Ben Blum
    Acked-by: Li Zefan
    Cc: Paul Menage
    Cc: "David S. Miller"
    Cc: KAMEZAWA Hiroyuki
    Cc: Lai Jiangshan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • Add interface between cgroups subsystem management and module loading

    This patch implements rudimentary module-loading support for cgroups -
    namely, a cgroup_load_subsys (similar to cgroup_init_subsys) for use as a
    module initcall, and a struct module pointer in struct cgroup_subsys.

    Several functions that might be wanted by modules have had EXPORT_SYMBOL
    added to them, but it's unclear exactly which functions want it and which
    won't.

    Signed-off-by: Ben Blum
    Acked-by: Li Zefan
    Cc: Paul Menage
    Cc: "David S. Miller"
    Cc: KAMEZAWA Hiroyuki
    Cc: Lai Jiangshan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • Add cancel_attach() operation to struct cgroup_subsys. cancel_attach()
    can be used when the can_attach() operation prepares something for the subsys,
    but we should roll back what can_attach() has prepared if attaching the
    task fails after we've succeeded in can_attach().
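
    The rollback pattern can be modelled as below. This is a simplified
    sketch, not the cgroup core: the Subsystem class and attach_task() are
    hypothetical, and the only failure modelled is a later subsystem
    refusing in can_attach().

```python
class Subsystem:
    """Toy subsystem that prepares state in can_attach()."""

    def __init__(self, name, allow=True):
        self.name = name
        self.allow = allow
        self.prepared = False  # set by can_attach(), cleared by attach/cancel
        self.attached = []

    def can_attach(self, task):
        if not self.allow:
            return False
        self.prepared = True
        return True

    def cancel_attach(self, task):
        # Undo whatever can_attach() prepared.
        self.prepared = False

    def attach(self, task):
        self.prepared = False
        self.attached.append(task)


def attach_task(subsystems, task):
    """Attach task to all subsystems, rolling back if any can_attach() fails."""
    succeeded = []
    for ss in subsystems:
        if not ss.can_attach(task):
            # Roll back only the subsystems whose can_attach() succeeded.
            for done in succeeded:
                done.cancel_attach(task)
            return False
        succeeded.append(ss)
    for ss in subsystems:
        ss.attach(task)
    return True
```

    Without cancel_attach(), the first subsystem in a failed attach would be
    left with prepared state that nothing ever consumes.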

    Signed-off-by: Daisuke Nishimura
    Acked-by: Li Zefan
    Reviewed-by: Paul Menage
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     

24 Feb, 2010

1 commit

  • This patch updates the documentation with the correct cpuset flag file names.
    We need to update the current manual for cpuset. For example:
    before) cpus, cpu_exclusive, mems
    after ) cpuset.cpus, cpuset.cpu_exclusive, cpuset.mems

    Signed-off-by: Geunsik Lim
    Acked-by: Paul Menage
    Signed-off-by: Jiri Kosina

    GeunSik Lim
     

04 Dec, 2009

1 commit


08 Oct, 2009

1 commit

  • Update documentation of cgroups tasks and procs files

    Document the cgroup.procs file.

    Clarify the semantics of the cgroup.procs and tasks files. Although the
    current cgroup.procs interface returns a sorted and uniquified list of
    pids, potential future performance enhancements could result in those
    properties being removed - explicitly document this aspect of the API.

    There are no existing users of cgroup.procs, so compatibility isn't an
    issue. There are users of the "tasks" file, but none that would appear to
    break in the event of the sorted property being broken. The standard
    "libcpuset" explicitly sorts the results of reading from the tasks file,
    and "libcg" and other users don't appear to care about ordering.
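
    A userspace reader that does not want to depend on those properties can
    sort and de-duplicate the list itself, as in this small sketch.
    normalize_pids() and read_cgroup_pids() are hypothetical helper names.

```python
import os
from typing import List


def normalize_pids(text: str) -> List[int]:
    """Parse the contents of a tasks or cgroup.procs file into sorted unique pids."""
    return sorted({int(field) for field in text.split()})


def read_cgroup_pids(cgroup_dir: str, filename: str = "cgroup.procs") -> List[int]:
    """Read a pid list file from a mounted cgroup without assuming any ordering."""
    with open(os.path.join(cgroup_dir, filename)) as f:
        return normalize_pids(f.read())
```

    This mirrors what the message says libcpuset already does for the tasks
    file.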

    Signed-off-by: Paul Menage
    Reviewed-by: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     

24 Sep, 2009

1 commit

  • Soft limits are a new feature for the memory resource controller. Something
    similar has existed in the group scheduler in the form of shares. The CPU
    controller's interpretation of shares is very different, though.

    Soft limits are the most useful feature to have for environments where the
    administrator wants to overcommit the system, such that only on memory
    contention do the limits become active. The current soft limits
    implementation provides a soft_limit_in_bytes interface for the memory
    controller and not for memory+swap controller. The implementation
    maintains an RB-Tree of groups that exceed their soft limit and starts
    reclaiming from the group that exceeds this limit by the maximum amount.
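
    The selection rule can be sketched as follows. This model uses a plain
    scan over a dictionary rather than the RB-tree mentioned above, and
    pick_reclaim_victim() and the group names are hypothetical.

```python
def pick_reclaim_victim(groups):
    """Return the group exceeding its soft limit by the largest amount.

    groups maps a group name to (usage_bytes, soft_limit_bytes).
    Returns None when no group is over its soft limit.
    """
    excess = {
        name: usage - limit
        for name, (usage, limit) in groups.items()
        if usage > limit
    }
    if not excess:
        return None
    return max(excess, key=excess.get)
```

    With groups {"a": (500, 400), "b": (900, 300), "c": (100, 200)}, group
    "b" is chosen because it is 600 bytes over its soft limit, while "c" is
    under its limit and is never considered.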

    This patch:

    Add documentation for soft limits

    Signed-off-by: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh