10 Jan, 2009

9 commits

  • …git/tip/linux-2.6-tip

    * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (36 commits)
    x86: fix section mismatch warnings in mcheck/mce_amd_64.c
    x86: offer frame pointers in all build modes
    x86: remove duplicated #include's
    x86: k8 numa register active regions later
    x86: update Alan Cox's email addresses
    x86: rename all fields of mpc_table mpc_X to X
    x86: rename all fields of mpc_oemtable oem_X to X
    x86: rename all fields of mpc_bus mpc_X to X
    x86: rename all fields of mpc_cpu mpc_X to X
    x86: rename all fields of mpc_intsrc mpc_X to X
    x86: rename all fields of mpc_lintsrc mpc_X to X
    x86: rename all fields of mpc_iopic mpc_X to X
    x86: irqinit_64.c init_ISA_irqs should be static
    Documentation/x86/boot.txt: payload length was changed to payload_length
    x86: setup_percpu.c fix style problems
    x86: irqinit_64.c fix style problems
    x86: irqinit_32.c fix style problems
    x86: i8259.c fix style problems
    x86: irq_32.c fix style problems
    x86: ioport.c fix style problems
    ...

    Linus Torvalds
     
  • Currently, ext3 in mainline Linux doesn't have the freeze feature which
    suspends write requests. So, we cannot take a backup which keeps the
    filesystem's consistency with the storage device's features (snapshot and
    replication) while it is mounted.

    In many cases, a commercial filesystem (e.g. VxFS) has the freeze feature,
    and it is used to take a consistent backup.

    If Linux's standard filesystem ext3 has the freeze feature, we can do it
    without a commercial filesystem.

    So I have implemented the ioctls of the freeze feature.
    I think we can take the consistent backup with the following steps.
    1. Freeze the filesystem with the freeze ioctl.
    2. Separate the replication volume or create the snapshot
    with the storage device's feature.
    3. Unfreeze the filesystem with the unfreeze ioctl.
    4. Take the backup from the separated replication volume
    or the snapshot.
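Steps 1 to 3 above can be sketched from user space roughly as follows. This is a minimal sketch, not code from the patch: the ioctl names FIFREEZE and FITHAW are the ones found in current <linux/fs.h> and are an assumption here, since this message does not name them. snapshot_frozen() and the snapshot_fn callback are hypothetical names used only for illustration. Freezing normally requires administrator privileges.

```c
#include <assert.h>
#include <fcntl.h>
#include <linux/fs.h>   /* FIFREEZE, FITHAW (assumed ioctl names) */
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical callback: creates the snapshot or splits the replica volume. */
typedef int (*snapshot_fn)(void *ctx);

/*
 * Freeze the filesystem mounted at mountpoint, run the snapshot step,
 * then unfreeze. Returns 0 on success and -1 on any failure.
 */
static int snapshot_frozen(const char *mountpoint, snapshot_fn snap, void *ctx)
{
    int rc = -1;
    int fd;

    assert(mountpoint != NULL);
    fd = open(mountpoint, O_RDONLY);
    if (fd < 0)
        return -1;

    if (ioctl(fd, FIFREEZE, 0) == 0) {
        /* Step 2: writes are suspended while the snapshot is taken. */
        rc = snap ? snap(ctx) : 0;
        /* Step 3: always unfreeze, even if the snapshot step failed. */
        if (ioctl(fd, FITHAW, 0) != 0)
            rc = -1;
    }
    close(fd);
    return rc;
}
```

Step 4, taking the backup from the separated volume or snapshot, happens after this function returns, so the filesystem stays frozen only for the duration of the snapshot step.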

    This patch:

    VFS:
    Changed the type of write_super_lockfs and unlockfs from "void"
    to "int" so that they can return an error.
    Renamed the super block operations write_super_lockfs and unlockfs
    to freeze_fs and unfreeze_fs to avoid confusion.

    ext3, ext4, xfs, gfs2, jfs:
    Changed the type of write_super_lockfs and unlockfs from "void"
    to "int" so that write_super_lockfs returns an error if needed,
    and unlockfs always returns 0.

    reiserfs:
    Changed the type of write_super_lockfs and unlockfs from "void"
    to "int" so that they always return 0 (success) to keep the current behavior.
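The effect of the type change can be illustrated with a small stand-alone model. This is a sketch under stated assumptions, not the kernel's definitions: sb_model, sb_ops_model, journal_freeze(), journal_unfreeze() and freeze_super_model() are hypothetical names, and only the freeze_fs and unfreeze_fs member names come from the text above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct super_block. */
struct sb_model {
    int journal_broken;  /* pretend state that makes a freeze fail */
    int frozen;
};

/* Hypothetical stand-in for the two affected super block operations. */
struct sb_ops_model {
    int (*freeze_fs)(struct sb_model *sb);    /* formerly void write_super_lockfs */
    int (*unfreeze_fs)(struct sb_model *sb);  /* formerly void unlockfs */
};

/* A filesystem whose freeze can fail is now able to report it. */
static int journal_freeze(struct sb_model *sb)
{
    if (sb->journal_broken)
        return -EIO;
    sb->frozen = 1;
    return 0;
}

/* In this model unfreezing always succeeds and returns 0. */
static int journal_unfreeze(struct sb_model *sb)
{
    sb->frozen = 0;
    return 0;
}

/* Caller side: with an int return type the error can be propagated. */
static int freeze_super_model(const struct sb_ops_model *ops, struct sb_model *sb)
{
    assert(ops != NULL && sb != NULL);
    if (ops->freeze_fs)
        return ops->freeze_fs(sb);
    return 0;
}
```

With the old "void" signatures the caller in this model would have had no way to learn that the freeze failed.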

    Signed-off-by: Takashi Sato
    Signed-off-by: Masayuki Hamaguchi
    Cc: Christoph Hellwig
    Cc: Dave Kleikamp
    Cc: Dave Chinner
    Cc: Alasdair G Kergon
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Takashi Sato
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus:
    MAINTAINERS: squashfs entry
    Squashfs: documentation
    Squashfs: initrd support
    Squashfs: Kconfig entry
    Squashfs: Makefiles
    Squashfs: header files
    Squashfs: block operations
    Squashfs: cache operations
    Squashfs: uid/gid lookup operations
    Squashfs: fragment block operations
    Squashfs: export operations
    Squashfs: super block operations
    Squashfs: symlink operations
    Squashfs: regular file operations
    Squashfs: directory readdir operations
    Squashfs: directory lookup operations
    Squashfs: inode operations

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-nommu:
    NOMMU: Support XIP on initramfs
    NOMMU: Teach kobjsize() about VMA regions.
    FLAT: Don't attempt to expand the userspace stack to fill the space allocated
    FDPIC: Don't attempt to expand the userspace stack to fill the space allocated
    NOMMU: Improve procfs output using per-MM VMAs
    NOMMU: Make mmap allocation page trimming behaviour configurable.
    NOMMU: Make VMAs per MM as for MMU-mode linux
    NOMMU: Delete askedalloc and realalloc variables
    NOMMU: Rename ARM's struct vm_region
    NOMMU: Fix cleanup handling in ramfs_nommu_get_umapped_area()

    Linus Torvalds
     
  • * 'for-linus' of git://git390.osdl.marist.edu/pub/scm/linux-2.6:
    [S390] update documentation for hvc_iucv kernel parameter.
    [S390] hvc_iucv: Special handling of IUCV HVC devices
    [S390] hvc_iucv: Refactor console and device initialization
    [S390] hvc_iucv: Update function documentation
    [S390] hvc_iucv: Limit rate of outgoing IUCV messages
    [S390] hvc_iucv: Change IUCV term id and use one device as default
    [S390] Use unsigned long long for u64 on 64bit.
    [S390] qdio: fix broken pointer in case of CONFIG_DEBUG_FS is disabled
    [S390] vdso: compile fix
    [S390] remove code for oldselect system call
    [S390] types: add/fix types.h include in header files
    [S390] dasd: add device attribute to disable blocking on lost paths
    [S390] dasd: send change uevents for dasd block devices
    [S390] tape block: fix dependencies
    [S390] asm-s390/posix_types.h: drop __USE_ALL usage
    [S390] gettimeofday.S: removed duplicated #includes
    [S390] ptrace: no extern declarations for userspace

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (864 commits)
    Btrfs: explicitly mark the tree log root for writeback
    Btrfs: Drop the hardware crc32c asm code
    Btrfs: Add Documentation/filesystem/btrfs.txt, remove old COPYING
    Btrfs: kmap_atomic(KM_USER0) is safe for btrfs_readpage_end_io_hook
    Btrfs: Don't use kmap_atomic(..., KM_IRQ0) during checksum verifies
    Btrfs: tree logging checksum fixes
    Btrfs: don't change file extent's ram_bytes in btrfs_drop_extents
    Btrfs: Use btrfs_join_transaction to avoid deadlocks during snapshot creation
    Btrfs: drop remaining LINUX_KERNEL_VERSION checks and compat code
    Btrfs: drop EXPORT symbols from extent_io.c
    Btrfs: Fix checkpatch.pl warnings
    Btrfs: Fix free block discard calls down to the block layer
    Btrfs: avoid orphan inode caused by log replay
    Btrfs: avoid potential super block corruption
    Btrfs: do not call kfree if kmalloc failed in btrfs_sysfs_add_super
    Btrfs: fix a memory leak in btrfs_get_sb
    Btrfs: Fix typo in clear_state_cb
    Btrfs: Fix memset length in btrfs_file_write
    Btrfs: update directory's size when creating subvol/snapshot
    Btrfs: add permission checks to the ioctls
    ...

    Linus Torvalds
     
  • * git://git.infradead.org/mtd-2.6: (67 commits)
    [MTD] [MAPS] Fix printk format warning in nettel.c
    [MTD] [NAND] add cmdline parsing (mtdparts=) support to cafe_nand
    [MTD] CFI: remove major/minor version check for command set 0x0002
    [MTD] [NAND] ndfc driver
    [MTD] [TESTS] Fix some size_t printk format warnings
    [MTD] LPDDR Makefile and KConfig
    [MTD] LPDDR extended physmap driver to support LPDDR flash
    [MTD] LPDDR added new pfow_base parameter
    [MTD] LPDDR Command set driver
    [MTD] LPDDR PFOW definition
    [MTD] LPDDR QINFO records definitions
    [MTD] LPDDR qinfo probing.
    [MTD] [NAND] pxa3xx: convert from ns to clock ticks more accurately
    [MTD] [NAND] pxa3xx: fix non-page-aligned reads
    [MTD] [NAND] fix nandsim sched.h references
    [MTD] [NAND] alauda: use USB API functions rather than constants
    [MTD] struct device - replace bus_id with dev_name(), dev_set_name()
    [MTD] fix m25p80 64-bit divisions
    [MTD] fix dataflash 64-bit divisions
    [MTD] [NAND] Set the fsl elbc ECCM according the settings in bootloader.
    ...

    Fixed up trivial debug conflicts in drivers/mtd/devices/{m25p80.c,mtd_dataflash.c}

    Linus Torvalds
     
  • * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (94 commits)
    ACPICA: hide private headers
    ACPICA: create acpica/ directory
    ACPI: fix build warning
    ACPI : Use RSDT instead of XSDT by adding boot option of "acpi=rsdt"
    ACPI: Avoid array address overflow when _CST MWAIT hint bits are set
    fujitsu-laptop: Simplify SBLL/SBL2 backlight handling
    fujitsu-laptop: Add BL power, LED control and radio state information
    ACPICA: delete utcache.c
    ACPICA: delete acdisasm.h
    ACPICA: Update version to 20081204.
    ACPICA: FADT: Update error msgs for consistency
    ACPICA: FADT: set acpi_gbl_use_default_register_widths to TRUE by default
    ACPICA: FADT parsing changes and fixes
    ACPICA: Add ACPI_MUTEX_TYPE configuration option
    ACPICA: Fixes for various ACPI data tables
    ACPICA: Restructure includes into public/private
    ACPI: remove private acpica headers from driver files
    ACPI: reboot.c: use new acpi_reset interface
    ACPICA: New: acpi_reset interface - write to reset register
    ACPICA: Move all public H/W interfaces to new hwxface
    ...

    Linus Torvalds
     
  • * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx: (22 commits)
    ioat: fix self test for multi-channel case
    dmaengine: bump initcall level to arch_initcall
    dmaengine: advertise all channels on a device to dma_filter_fn
    dmaengine: use idr for registering dma device numbers
    dmaengine: add a release for dma class devices and dependent infrastructure
    ioat: do not perform removal actions at shutdown
    iop-adma: enable module removal
    iop-adma: kill debug BUG_ON
    iop-adma: let devm do its job, don't duplicate free
    dmaengine: kill enum dma_state_client
    dmaengine: remove 'bigref' infrastructure
    dmaengine: kill struct dma_client and supporting infrastructure
    dmaengine: replace dma_async_client_register with dmaengine_get
    atmel-mci: convert to dma_request_channel and down-level dma_slave
    dmatest: convert to dma_request_channel
    dmaengine: introduce dma_request_channel and private channels
    net_dma: convert to dma_find_channel
    dmaengine: provide a common 'issue_pending_all' implementation
    dmaengine: centralize channel allocation, introduce dma_find_channel
    dmaengine: up-level reference counting to the module level
    ...

    Linus Torvalds
     

09 Jan, 2009

28 commits

  • Signed-off-by: Hendrik Brueckner
    Signed-off-by: Martin Schwidefsky

    Hendrik Brueckner
     
  • Len Brown
     
  • Len Brown
     
    On some boxes both the RSDT and the XSDT tables exist. Unfortunately,
    the following errors sometimes occur when the XSDT is used:
    a. 32/64X address mismatch
    b. The 32/64X FACS address mismatch

    In such cases the boot option "acpi=rsdt" is provided so that the
    RSDT is tried instead of the XSDT when the system doesn't work well.
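The selection this option influences can be sketched as a small decision function. This is an illustrative model, not the ACPICA code: rsdp_model, pick_root_table() and the rsdt_forced flag are hypothetical names, and the rule that the XSDT is preferred only for revision 2 or later with a non-zero XSDT address is an assumption based on the ACPI table layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum root_table { ROOT_TABLE_RSDT, ROOT_TABLE_XSDT };

/* Hypothetical subset of the RSDP fields that matter for the choice. */
struct rsdp_model {
    uint8_t  revision;      /* 0 for ACPI 1.0, which has no XSDT pointer */
    uint32_t rsdt_address;  /* 32-bit physical address of the RSDT */
    uint64_t xsdt_address;  /* 64-bit physical address of the XSDT, 0 if absent */
};

/*
 * Prefer the XSDT when it is available, unless the user booted with
 * "acpi=rsdt" (modelled here by rsdt_forced).
 */
static enum root_table pick_root_table(const struct rsdp_model *rsdp,
                                       bool rsdt_forced)
{
    assert(rsdp != NULL);
    if (!rsdt_forced && rsdp->revision > 1 && rsdp->xsdt_address != 0)
        return ROOT_TABLE_XSDT;
    return ROOT_TABLE_RSDT;
}
```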

    http://bugzilla.kernel.org/show_bug.cgi?id=8246

    Signed-off-by: Zhao Yakui
    Cc: Thomas Renninger
    Signed-off-by: Len Brown

    Zhao Yakui
     
  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (57 commits)
    jbd2: Fix oops in jbd2_journal_init_inode() on corrupted fs
    ext4: Remove "extents" mount option
    block: Add Kconfig help which notes that ext4 needs CONFIG_LBD
    ext4: Make printk's consistently prefixed with "EXT4-fs: "
    ext4: Add sanity checks for the superblock before mounting the filesystem
    ext4: Add mount option to set kjournald's I/O priority
    jbd2: Submit writes to the journal using WRITE_SYNC
    jbd2: Add pid and journal device name to the "kjournald2 starting" message
    ext4: Add markers for better debuggability
    ext4: Remove code to create the journal inode
    ext4: provide function to release metadata pages under memory pressure
    ext3: provide function to release metadata pages under memory pressure
    add releasepage hooks to block devices which can be used by file systems
    ext4: Fix s_dirty_blocks_counter if block allocation failed with nodelalloc
    ext4: Init the complete page while building buddy cache
    ext4: Don't allow new groups to be added during block allocation
    ext4: mark the blocks/inode bitmap beyond end of group as used
    ext4: Use new buffer_head flag to check uninit group bitmaps initialization
    ext4: Fix the race between read_inode_bitmap() and ext4_new_inode()
    ext4: code cleanup
    ...

    Linus Torvalds
     
  • * 'docs-next' of git://git.lwn.net/linux-2.6:
    Fix a typo in the development process document.
    Document handling of bad memory
    Document RCU and unloadable modules

    Linus Torvalds
     
  • Reported-by: Aníbal Monsalve Salazar
    Signed-off-by: Jonathan Corbet

    Jonathan Corbet
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/lrg/voltage-2.6:
    regulator: fix kernel-doc warnings
    regulator: catch some registration errors
    regulator: Add basic DocBook manual
    regulator: Fix some kerneldoc rendering issues
    regulator: Add missing kerneldoc
    regulator: Clean up kerneldoc warnings
    regulator: Remove extraneous kerneldoc annotations
    regulator: init/link earlier
    regulator: move set_machine_constraints after regulator device initialization
    regulator: da903x: make da903x_is_enabled return 0 or 1
    regulator: da903x: add '\n' to error messages
    regulator: sysfs attribute reduction (v2)
    regulator: code shrink (v2)
    regulator: improved mode error checks
    regulator: enable/disable refcounting
    regulator: struct device - replace bus_id with dev_name(), dev_set_name()

    Linus Torvalds
     
  • Add a basic DocBook manual for the regulator API. This is much more
    skeletal than the existing text documentation, the main benefit is to
    provide a skeleton for automatic generation of a manual based on the
    kerneldoc for the API.

    Since large portions of the text are lifted from the existing text format
    documentation written by Liam Girdwood much of the credit belongs to
    him.

    Signed-off-by: Mark Brown
    Signed-off-by: Liam Girdwood

    Mark Brown
     
  • Clean up the sysfs interface to regulators by only exposing the
    attributes that can be properly displayed. For example: when a
    particular regulator method is needed to display the value, only
    create that attribute when that method exists.

    This cleaned-up interface is much more comprehensible. Most
    regulators only support a subset of the possible methods, so
    often more than half the attributes would be meaningless. Many
    "not defined" values are no longer necessary. (But handling
    of out-of-range values still looks a bit iffy.)

    Documentation is updated to reflect that few of the attributes
    are *always* present, and to briefly explain why a regulator may
    not have a given attribute.

    This adds object code, about a dozen bytes more than was removed
    by the preceding patch, but saves a bunch of per-regulator data
    associated with the now-removed attributes. So there's a net
    reduction in memory footprint.

    Signed-off-by: David Brownell
    Signed-off-by: Liam Girdwood

    David Brownell
     
  • * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (53 commits)
    serial: Add driver for the Cell Network Processor serial port NWP device
    powerpc: enable dynamic ftrace
    powerpc/cell: Fix the prototype of create_vma_map()
    powerpc/mm: Make clear_fixmap() actually work
    powerpc/kdump: Use ppc_save_regs() in crash_setup_regs()
    powerpc: Export cacheable_memzero as its now used in a driver
    powerpc: Fix missing semicolons in mmu_decl.h
    powerpc/pasemi: local_irq_save uses an unsigned long
    powerpc/cell: Fix some u64 vs. long types
    powerpc/cell: Use correct types in beat files
    powerpc: Use correct type in prom_init.c
    powerpc: Remove unnecessary casts
    mtd/ps3vram: Use _PAGE_NO_CACHE in memory ioremap
    mtd/ps3vram: Use msleep in waits
    mtd/ps3vram: Use proper kernel types
    mtd/ps3vram: Cleanup ps3vram driver messages
    mtd/ps3vram: Remove ps3vram debug routines
    mtd/ps3vram: Add modalias support to the ps3vram driver
    mtd/ps3vram: Add ps3vram driver for accessing video RAM as MTD
    powerpc: Fix iseries drivers build failure without CONFIG_VIOPATH
    ...

    Linus Torvalds
     
    While reviewing ocfs2 code, I found 2 places where "successful" is
    misspelled as "successfull". Running grep "successfull " over the kernel
    tree found 22 such typos in total -- great minds always think alike :)

    This patch fixes all the similar typos. Thanks to Randy for his ack and comments.

    Signed-off-by: Coly Li
    Acked-by: Randy Dunlap
    Acked-by: Roland Dreier
    Cc: Jeremy Kerr
    Cc: Jeff Garzik
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Theodore Ts'o
    Cc: Mark Fasheh
    Cc: Vlad Yasevich
    Cc: Sridhar Samudrala
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Coly Li
     
  • Send completion status of the commands to the userspace. Message and
    protocol are described in the documentation.

    Signed-off-by: Evgeniy Polyakov
    Cc: Paul Alfille
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Polyakov
     
  • Add a command which allows resetting the bus.

    Signed-off-by: Evgeniy Polyakov
    Cc: Paul Alfille
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Polyakov
     
  • Signed-off-by: Evgeniy Polyakov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Polyakov
     
  • This patch adds support for the 1-wire master interface for i.MX27 and
    i.MX31.

    Signed-off-by: Luotao Fu
    Signed-off-by: Sascha Hauer
    Signed-off-by: Evgeniy Polyakov
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sascha Hauer
     
  • These patches introduce new locking/refcount support for cgroups to
    reduce the need for subsystems to call cgroup_lock(). This will
    ultimately allow the atomicity of cgroup_rmdir() (which was removed
    recently) to be restored.

    These three patches give:

    1/3 - introduce a per-subsystem hierarchy_mutex which a subsystem can
    use to prevent changes to its own cgroup tree

    2/3 - use hierarchy_mutex in place of calling cgroup_lock() in the
    memory controller

    3/3 - introduce a css_tryget() function similar to the one recently
    proposed by Kamezawa, but avoiding spurious refcount failures in
    the event of a race between a css_tryget() and an unsuccessful
    cgroup_rmdir()

    Future patches will likely involve:

    - using hierarchy mutex in place of cgroup_lock() in more subsystems
    where appropriate

    - restoring the atomicity of cgroup_rmdir() with respect to cgroup_create()

    This patch:

    Add a hierarchy_mutex to the cgroup_subsys object that protects changes to
    the hierarchy observed by that subsystem. It is taken by the cgroup
    subsystem (in addition to cgroup_mutex) for the following operations:

    - linking a cgroup into that subsystem's cgroup tree
    - unlinking a cgroup from that subsystem's cgroup tree
    - moving the subsystem to/from a hierarchy (including across the
    bind() callback)

    Thus if the subsystem holds its own hierarchy_mutex, it can safely
    traverse its own hierarchy.

    Signed-off-by: Paul Menage
    Tested-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Fix swapin charge operation of memcg.

    Now, memcg has hooks into the swap-out operation and checks whether SwapCache
    is really unused or not. That check depends on the contents of struct page,
    i.e. if PageAnon(page) && page_mapped(page), the page is recognized as
    still-in-use.

    Now, reuse_swap_page() calls delete_from_swap_cache() before establishment
    of any rmap. Then, in the following sequence

    (Page fault with WRITE)
    try_charge() (charge += PAGESIZE)
    commit_charge() (Check page_cgroup is used or not..)
    reuse_swap_page()
    -> delete_from_swap_cache()
    -> mem_cgroup_uncharge_swapcache() (charge -= PAGESIZE)
    ......
    New charge is uncharged soon....
    To avoid this, move commit_charge() after page_mapcount() goes up to 1.
    By this,

    try_charge() (usage += PAGESIZE)
    reuse_swap_page() (may usage -= PAGESIZE if PCG_USED is set)
    commit_charge() (If page_cgroup is not marked as PCG_USED,
    add new charge.)
    Accounting will be correct.
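The two orderings above can be compared with a toy model. This is a sketch under stated assumptions, not the memcg code: page_model, group_model and the two fault_*_order() helpers are hypothetical, usage is counted in pages, and the rule that commit_charge() cancels the precharge for a page already marked used is an assumption made for the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct page_model  { bool used; bool mapped; };  /* used stands for PCG_USED */
struct group_model { long usage; };              /* charged pages */

static void try_charge(struct group_model *g) { g->usage++; }

/* Keep the precharge if the page is not accounted yet, otherwise cancel it. */
static void commit_charge(struct group_model *g, struct page_model *p)
{
    if (p->used)
        g->usage--;
    else
        p->used = true;
}

/* Stand-in for reuse_swap_page() -> mem_cgroup_uncharge_swapcache(). */
static void reuse_swap_page(struct group_model *g, struct page_model *p)
{
    if (p->used && !p->mapped) {   /* page looks unused: drop its charge */
        g->usage--;
        p->used = false;
    }
}

/* Old order: commit before the page is mapped. Returns the final usage. */
static long fault_old_order(struct page_model *p)
{
    struct group_model g = { 0 };

    assert(p != NULL);
    try_charge(&g);
    commit_charge(&g, p);
    reuse_swap_page(&g, p);        /* the new charge is uncharged here */
    p->mapped = true;
    return g.usage;
}

/* New order: commit after page_mapcount() has gone up to 1. */
static long fault_new_order(struct page_model *p)
{
    struct group_model g = { 0 };

    assert(p != NULL);
    try_charge(&g);
    reuse_swap_page(&g, p);        /* nothing to drop: page not marked used */
    p->mapped = true;
    commit_charge(&g, p);
    return g.usage;
}
```

In this model the old order ends with a mapped page that is charged to nobody, while the new order leaves exactly one page charged.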

    Changelog (v2) -> (v3)
    - fixed invalid charge to swp_entry==0.
    - updated documentation.
    Changelog (v1) -> (v2)
    - fixed comment.

    [nishimura@mxp.nes.nec.co.jp: swap accounting leak doc fix]
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Tested-by: Balbir Singh
    Cc: Hugh Dickins
    Cc: Daisuke Nishimura
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Documentation for implementation details and how to test.

    Just an example. feel free to modify, add, remove lines.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Currently, /proc/sys/vm/swappiness can change the swappiness ratio for global
    reclaim. However, memcg reclaim has no tuning parameter of its own.

    In general, the optimal swappiness depends on the workload. (e.g. an HPC
    workload needs lower swappiness than others.)

    So, per-cgroup swappiness improves administrator tunability.

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: KOSAKI Motohiro
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Add the following five fields to memory.stat file:

    - inactive_ratio
    - recent_rotated_anon
    - recent_rotated_file
    - recent_scanned_anon
    - recent_scanned_file

    Acked-by: Rik van Riel
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: KOSAKI Motohiro
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Documentation updates for hierarchy support

    Signed-off-by: Balbir Singh
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: David Rientjes
    Cc: Pavel Emelianov
    Cc: Dhaval Giani
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • This patch implements a per-cgroup limit on the usage of memory+swap. Although
    SwapCache exists, double counting of swap-cache and swap-entry is
    avoided.

    The Mem+Swap controller works as follows.
    - memory usage is limited by memory.limit_in_bytes.
    - memory + swap usage is limited by memory.memsw_limit_in_bytes.

    This has the following benefit.
    - A user can limit total resource usage of mem+swap.

    Without this, because the memory resource controller doesn't take care of
    swap usage, a process can exhaust all the swap (by a memory leak.)
    We can avoid this case.

    And Swap is shared resource but it cannot be reclaimed (goes back to memory)
    until it's used. This characteristic can be trouble when the memory
    is divided into some parts by cpuset or memcg.
    Assume group A and group B.
    After some application executes, the system can be..

    Group A -- very large free memory space but occupy 99% of swap.
    Group B -- under memory shortage but cannot use swap...it's nearly full.

    Ability to set appropriate swap limit for each group is required.

    Someone may wonder "why not swap but mem+swap?"

    - The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
    to move account from memory to swap...there is no change in usage of
    mem+swap.

    In other words, when we want to limit the usage of swap without affecting
    global LRU, mem+swap limit is better than just limiting swap.

    Accounting target information is stored in swap_cgroup which is
    per swap entry record.

    Charge is done as follows.
    map
    - charge page and memsw.

    unmap
    - uncharge page/memsw if not SwapCache.

    swap-out (__delete_from_swap_cache)
    - uncharge page
    - record mem_cgroup information to swap_cgroup.

    swap-in (do_swap_page)
    - charged as page and memsw.
    record in swap_cgroup is cleared.
    memsw accounting is decremented.

    swap-free (swap_free())
    - if swap entry is freed, memsw is uncharged by PAGE_SIZE.
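The charge rules listed above can be traced with a toy counter model. This is a sketch, not the controller's implementation: memsw_model and the model_*() helpers are hypothetical names, counters are in pages rather than bytes, and dropping the swap record on swap-free is an assumption made for the model.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-cgroup counters, in pages. */
struct memsw_model {
    long mem;           /* counted against memory.limit_in_bytes */
    long memsw;         /* counted against the memory+swap limit */
    long swap_records;  /* entries recorded in swap_cgroup for this group */
};

/* map: charge page and memsw. */
static void model_map(struct memsw_model *m)
{
    assert(m != NULL);
    m->mem++;
    m->memsw++;
}

/* unmap of a page that is not SwapCache: uncharge page and memsw. */
static void model_unmap(struct memsw_model *m)
{
    assert(m != NULL && m->mem > 0 && m->memsw > 0);
    m->mem--;
    m->memsw--;
}

/* swap-out: uncharge the page, remember the owner; memsw is unchanged. */
static void model_swap_out(struct memsw_model *m)
{
    assert(m != NULL && m->mem > 0);
    m->mem--;
    m->swap_records++;
}

/* swap-in: charge page and memsw, clear the record, decrement memsw. */
static void model_swap_in(struct memsw_model *m)
{
    assert(m != NULL && m->swap_records > 0);
    m->mem++;
    m->memsw++;
    m->swap_records--;
    m->memsw--;
}

/* swap-free: the swap entry is gone, so memsw is uncharged. */
static void model_swap_free(struct memsw_model *m)
{
    assert(m != NULL && m->swap_records > 0 && m->memsw > 0);
    m->swap_records--;
    m->memsw--;
}
```

Tracing map, swap-out and swap-in through this model shows memsw staying at one page the whole time, which is the point made earlier: swapping a page out moves the account from memory to swap without changing mem+swap usage.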

    There are people who work in never-swap environments and consider swap
    something bad. For such people, this mem+swap controller extension is just
    overhead. This overhead is avoided by a config or boot option.
    (see Kconfig. The details are not in this patch.)

    TODO:
    - maybe more optimization can be done in swap-in path. (but not very safe.)
    But we just do simple accounting at this stage.

    [nishimura@mxp.nes.nec.co.jp: make resize limit hold mutex]
    [hugh@veritas.com: memswap controller core swapcache fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Config and control variable for mem+swap controller.

    This patch adds CONFIG_CGROUP_MEM_RES_CTLR_SWAP
    (memory resource controller swap extension.)

    For accounting swap, it's obvious that we have to use additional memory to
    remember "who uses swap". This adds more overhead. So, it's better to
    offer "choice" to users. This patch adds 2 choices.

    This patch adds 2 parameters to enable swap extension or not.
    - CONFIG
    - boot option

    Reviewed-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • SwapCache support for memory resource controller (memcg)

    Before mem+swap controller, memcg itself should handle SwapCache in proper
    way. This is cut-out from it.

    In current memcg, SwapCache is just leaked and the user can create tons of
    SwapCache. This is a leak of account and should be handled.

    SwapCache accounting is done as follows.

    charge (anon)
    - charged when it's mapped.
    (because of readahead, charge at add_to_swap_cache() is not sane)
    uncharge (anon)
    - uncharged when it's dropped from swapcache and fully unmapped.
    means it's not uncharged at unmap.
    Note: delete from swap cache at swap-in is done after rmap information
    is established.
    charge (shmem)
    - charged at swap-in. this prevents charge at add_to_page_cache().

    uncharge (shmem)
    - uncharged when it's dropped from swapcache and not on shmem's
    radix-tree.

    at migration, check against 'old page' is modified to handle shmem.

    Comparing to the old version discussed (and caused troubles), we have
    advantages of
    - PCG_USED bit.
    - simple migrating handling.

    So, situation is much easier than several months ago, maybe.

    [hugh@veritas.com: memcg: handle swap caches build fix]
    Reviewed-by: Daisuke Nishimura
    Tested-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • By memcg-move-all-accounts-to-parent-at-rmdir.patch, there is no leak of
    memory usage and force_empty is removed.

    This patch adds "force_empty" again, in a reasonable manner.

    The memory.force_empty file takes effect on

    #echo 0 (or some) > memory.force_empty
    and behaves as follows.

    1. it only works when there are no tasks in this cgroup.
    2. it frees as many pages under this cgroup as possible.
    3. pages which cannot be freed are moved up to the parent.
    4. memcg will then be empty after the above echo returns.

    This is much better behavior than the old "force_empty", which just forgot
    all accounts. This patch also checks signal_pending(), so the above "echo"
    can be stopped by "Ctrl-C".

    [akpm@linux-foundation.org: cleanup]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch provides a function to move the account information of a page
    between mem_cgroups and rewrites force_empty to make use of it.

    This moving of page_cgroup is done while
    - lru_lock of the source/destination mem_cgroup is held.
    - lock_page_cgroup() is held.

    So, a routine which touches pc->mem_cgroup without lock_page_cgroup()
    should confirm whether pc->mem_cgroup is still valid. Typical code can be
    as follows.

    (while page is not under lock_page())
    mem = pc->mem_cgroup;
    mz = page_cgroup_zoneinfo(pc);
    spin_lock_irqsave(&mz->lru_lock, flags);
    if (pc->mem_cgroup == mem)
    ...../* some list handling */
    spin_unlock_irqrestore(&mz->lru_lock, flags);

    Of course, better way is
    lock_page_cgroup(pc);
    ....
    unlock_page_cgroup(pc);

    But you should check the lock nesting and avoid deadlock.

    If you handle a page_cgroup from mem_cgroup's LRU under mz->lru_lock,
    you don't have to worry about what pc->mem_cgroup points to.
    Moved pages are added to the head of the lru, not to the tail.

    Expected users of this routine are:
    - force_empty (rmdir)
    - moving tasks between cgroup (for moving account information.)
    - hierarchy (maybe useful.)

    force_empty (rmdir) uses this move_account and moves pages to its parent.
    This "move" will not cause OOM (I added an "oom" parameter to try_charge().)

    If the parent is busy (not enough memory), force_empty calls try_to_free_page()
    and reduces usage.

    Purpose of this behavior is
    - Fix "forget all" behavior of force_empty and avoid leak of accounting.
    - By "moving first, free if necessary", keep pages on memory as much as
    possible.

    Adding a switch to change behavior of force_empty to
    - free first, move if necessary
    - free all, if there is mlocked/busy pages, return -EBUSY.
    is under consideration. (I'll add it if someone requests.)

    This patch also removes memory.force_empty file, a brutal debug-only interface.

    Reviewed-by: Daisuke Nishimura
    Tested-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • - remove 'releasable' since it has been moved to the debug subsys.
    - update lock requirements of subsys callbacks.

    Signed-off-by: Li Zefan
    Cc: Paul Menage
    Cc: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     

08 Jan, 2009

3 commits

  • NOMMU mmap allocates a piece of memory for an mmap that's rounded up in size to
    the nearest power-of-2 number of pages. Currently it then discards the excess
    pages back to the page allocator, making that memory available for use by other
    things. This can, however, cause a greater amount of fragmentation.

    To counter this, a sysctl is added in order to fine-tune the trimming
    behaviour. The default behaviour remains to trim pages aggressively, while
    this can either be disabled completely or set to a higher page-granular
    watermark in order to have finer-grained control.
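The allocation and trimming decision described above can be modelled in a few lines. This is a sketch under stated assumptions, not the kernel code: trim_result, round_up_pow2() and nommu_trim() are hypothetical names, and the reading of the sysctl (0 disables trimming, 1 trims any excess, a larger value trims only when at least that many excess pages exist) is an assumption drawn from the description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result: pages allocated, kept and handed back. */
struct trim_result {
    size_t allocated;  /* power-of-2 page count taken from the allocator */
    size_t kept;       /* pages that stay attached to the mapping */
    size_t trimmed;    /* excess pages returned to the page allocator */
};

/* Round a page count up to the next power of two. */
static size_t round_up_pow2(size_t pages)
{
    size_t n = 1;

    assert(pages > 0 && pages <= SIZE_MAX / 2 + 1);
    while (n < pages)
        n <<= 1;
    return n;
}

/* trim_threshold stands in for the sysctl value described above. */
static struct trim_result nommu_trim(size_t requested, size_t trim_threshold)
{
    struct trim_result r;
    size_t excess;

    r.allocated = round_up_pow2(requested);
    excess = r.allocated - requested;

    if (trim_threshold != 0 && excess >= trim_threshold) {
        r.trimmed = excess;        /* give the excess pages back */
        r.kept = requested;
    } else {
        r.trimmed = 0;             /* keep the whole power-of-2 block */
        r.kept = r.allocated;
    }
    return r;
}
```

For a 5-page request, 8 pages are allocated. In this model a threshold of 1 returns the 3 excess pages, while thresholds of 0 or 4 keep all 8 and so avoid splitting the block.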

    vm region vm_top bits taken from an earlier patch by David Howells.

    Signed-off-by: Paul Mundt
    Signed-off-by: David Howells
    Tested-by: Mike Frysinger

    Paul Mundt
     
  • Make VMAs per mm_struct as for MMU-mode linux. This solves two problems:

    (1) In SYSV SHM where nattch for a segment does not reflect the number of
    shmat's (and forks) done.

    (2) In mmap() where the VMA's vm_mm is set to point to the parent mm by an
    exec'ing process when VM_EXECUTABLE is specified, regardless of the fact
    that a VMA might be shared and already have its vm_mm assigned to another
    process or a dead process.

    A new struct (vm_region) is introduced to track a mapped region and to remember
    the circumstances under which it may be shared and the vm_list_struct structure
    is discarded as it's no longer required.

    This patch makes the following additional changes:

    (1) Regions are now allocated with alloc_pages() rather than kmalloc() and
    with no recourse to __GFP_COMP, so the pages are not composite. Instead,
    each page has a reference on it held by the region. Anything else that is
    interested in such a page will have to get a reference on it to retain it.
    When the pages are released due to unmapping, each page is passed to
    put_page() and will be freed when the page usage count reaches zero.

    (2) Excess pages are trimmed after an allocation as the allocation must be
    made as a power-of-2 quantity of pages.

    (3) VMAs are added to the parent MM's R/B tree and mmap lists. As an MM may
    end up with overlapping VMAs within the tree, the VMA struct address is
    appended to the sort key.

    (4) Non-anonymous VMAs are now added to the backing inode's prio list.

    (5) Holes may be punched in anonymous VMAs with munmap(), releasing parts of
    the backing region. The VMA and region structs will be split if
    necessary.

    (6) sys_shmdt() only releases one attachment to a SYSV IPC shared memory
    segment instead of all the attachments at that address. Multiple
    shmat()'s return the same address under NOMMU-mode instead of different
    virtual addresses as under MMU-mode.

    (7) Core dumping for ELF-FDPIC requires fewer exceptions for NOMMU-mode.

    (8) /proc/maps is now the global list of mapped regions, and may list bits
    that aren't actually mapped anywhere.

    (9) /proc/meminfo gains a line (tagged "MmapCopy") that indicates the amount
    of RAM currently allocated by mmap to hold mappable regions that can't be
    mapped directly. These are copies of the backing device or file if not
    anonymous.

    These changes make NOMMU mode more similar to MMU mode. The downside is that
    NOMMU mode requires some extra memory to track things over NOMMU without this
    patch (VMAs are no longer shared, and there are now region structs).
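Item (3) above, where the VMA struct address is appended to the sort key so that overlapping VMAs can coexist in one tree, can be sketched as a comparison function. This is an illustrative model only: vma_model and vma_key_cmp() are hypothetical names, and ordering by start address and then end address before the struct address is an assumption, since the message only states that the address is appended.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields of a VMA used as the sort key. */
struct vma_model {
    unsigned long vm_start;
    unsigned long vm_end;
};

/*
 * Returns a negative, zero or positive value. The result is zero only when
 * a and b are the same object, so two VMAs covering the same range still
 * get distinct positions in the tree.
 */
static int vma_key_cmp(const struct vma_model *a, const struct vma_model *b)
{
    uintptr_t pa = (uintptr_t)a;
    uintptr_t pb = (uintptr_t)b;

    assert(a != NULL && b != NULL);
    if (a->vm_start != b->vm_start)
        return a->vm_start < b->vm_start ? -1 : 1;
    if (a->vm_end != b->vm_end)
        return a->vm_end < b->vm_end ? -1 : 1;
    if (pa != pb)
        return pa < pb ? -1 : 1;   /* tie-break on the struct address */
    return 0;
}
```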

    Signed-off-by: David Howells
    Tested-by: Mike Frysinger
    Acked-by: Paul Mundt

    David Howells
     
  • Benjamin Herrenschmidt