15 Jan, 2012

5 commits

  • Introduce a wrapper around scsi_cmd_ioctl that takes a block device.

    The function will then be enhanced to detect partition block devices
    and, in that case, subject the ioctls to whitelisting.

    Cc: linux-scsi@vger.kernel.org
    Cc: Jens Axboe
    Cc: James Bottomley
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Linus Torvalds

    Paolo Bonzini
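
    A minimal sketch of the wrapper idea described above, not the mainline
    diff (the wrapper name and exact signatures are assumptions based on the
    description):

    /* Take a block_device so the follow-up change can detect partitions
     * (bdev != bdev->bd_contains) and whitelist the ioctls they may pass. */
    int scsi_cmd_blk_ioctl(struct block_device *bdev, fmode_t mode,
                           unsigned int cmd, void __user *arg)
    {
            return scsi_cmd_ioctl(bdev->bd_disk->queue, bdev->bd_disk,
                                  mode, cmd, arg);
    }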
     
  • 2nd round of GPIO changes for v3.3 merge window

    * tag 'gpio-for-linus' of git://git.secretlab.ca/git/linux-2.6:
    GPIO: sa1100: implement proper gpiolib gpio_to_irq conversion
    gpio: pl061: remove combined interrupt
    gpio: pl061: convert to use generic irq chip
    GPIO: add bindings for managed devices
    ARM: realview: convert pl061 no irq to 0 instead of -1
    gpio: pl061: convert to use 0 for no irq
    gpio: pl061: use chained_irq_* functions in irq handler
    GPIO/pl061: Add suspend resume capability
    drivers/gpio/gpio-tegra.c: use devm_request_and_ioremap

    Linus Torvalds
     
  • * 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: (119 commits)
    MIPS: Delete unused function add_temporary_entry.
    MIPS: Set default pci cache line size.
    MIPS: Flush huge TLB
    MIPS: Octeon: Remove SYS_SUPPORTS_HIGHMEM.
    MIPS: Octeon: Add support for OCTEON II PCIe
    MIPS: Octeon: Update PCI Latency timer and enable more error reporting.
    MIPS: Alchemy: Update cpu-feature-overrides
    MIPS: Alchemy: db1200: Improve PB1200 detection.
    MIPS: Alchemy: merge Au1000 and Au1300-style IRQ controller code.
    MIPS: Alchemy: chain IRQ controllers to MIPS IRQ controller
    MIPS: Alchemy: irq: register pm at irq init time
    MIPS: Alchemy: Touchscreen support on DB1100
    MIPS: Alchemy: Hook up IrDA on DB1000/DB1100
    net/irda: convert au1k_ir to platform driver.
    MIPS: Alchemy: remove unused board headers
    MTD: nand: make au1550nd.c a platform_driver
    MIPS: Netlogic: Mark Netlogic chips as SMT capable
    MIPS: Netlogic: Add support for XLP 3XX cores
    MIPS: Netlogic: Merge some of XLR/XLP wakup code
    MIPS: Netlogic: Add default XLP config.
    ...

    Fix up trivial conflicts in arch/mips/kernel/{perf_event_mipsxx.c,
    traps.c} and drivers/tty/serial/Makefile

    Linus Torvalds
     
  • Autogenerated GPG tag for Rusty D1ADB8F1: 15EE 8D6C AB0E 7F0C F999 BFCB D920 0E6C D1AD B8F1

    * tag 'for-linus' of git://github.com/rustyrussell/linux:
    module_param: check that bool parameters really are bool.
    intelfbdrv.c: bailearly is an int module_param
    paride/pcd: fix bool verbose module parameter.
    module_param: make bool parameters really bool (drivers & misc)
    module_param: make bool parameters really bool (arch)
    module_param: make bool parameters really bool (core code)
    kernel/async: remove redundant declaration.
    printk: fix unnecessary module_param_name.
    lirc_parallel: fix module parameter description.
    module_param: avoid bool abuse, add bint for special cases.
    module_param: check type correctness for module_param_array
    modpost: use linker section to generate table.
    modpost: use a table rather than a giant if/else statement.
    modules: sysfs - export: taint, coresize, initsize
    kernel/params: replace DEBUGP with pr_debug
    module: replace DEBUGP with pr_debug
    module: struct module_ref should contains long fields
    module: Fix performance regression on modules with large symbol tables
    module: Add comments describing how the "strmap" logic works

    Fix up conflicts in scripts/mod/file2alias.c due to the new linker-
    generated table approach to adding __mod_*_device_table entries. The
    ARM sa11x0 mcp bus needed to be converted to that too.

    Linus Torvalds
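
    A small module sketch of the rule this series enforces (hypothetical
    module; the parameter name is chosen for illustration): a bool module
    parameter must now be backed by a C bool, with the new "bint" type
    available for the special cases that still need an int.

    #include <linux/module.h>
    #include <linux/moduleparam.h>

    static bool verbose;                   /* was often "static int" before */
    module_param(verbose, bool, 0644);
    MODULE_PARM_DESC(verbose, "Enable verbose logging");

    static int __init demo_init(void)
    {
            if (verbose)
                    pr_info("demo: loaded\n");
            return 0;
    }

    static void __exit demo_exit(void)
    {
    }

    module_init(demo_init);
    module_exit(demo_exit);
    MODULE_LICENSE("GPL");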
     
  • * 'for-3.3' of git://linux-nfs.org/~bfields/linux: (31 commits)
    nfsd4: nfsd4_create_clid_dir return value is unused
    NFSD: Change name of extended attribute containing junction
    svcrpc: don't revert to SVC_POOL_DEFAULT on nfsd shutdown
    svcrpc: fix double-free on shutdown of nfsd after changing pool mode
    nfsd4: be forgiving in the absence of the recovery directory
    nfsd4: fix spurious 4.1 post-reboot failures
    NFSD: forget_delegations should use list_for_each_entry_safe
    NFSD: Only reinitilize the recall_lru list under the recall lock
    nfsd4: initialize special stateid's at compile time
    NFSd: use network-namespace-aware cache registering routines
    SUNRPC: create svc_xprt in proper network namespace
    svcrpc: update outdated BKL comment
    nfsd41: allow non-reclaim open-by-fh's in 4.1
    svcrpc: avoid memory-corruption on pool shutdown
    svcrpc: destroy server sockets all at once
    svcrpc: make svc_delete_xprt static
    nfsd: Fix oops when parsing a 0 length export
    nfsd4: Use kmemdup rather than duplicating its implementation
    nfsd4: add a separate (lockowner, inode) lookup
    nfsd4: fix CONFIG_NFSD_FAULT_INJECTION compile error
    ...

    Linus Torvalds
     

14 Jan, 2012

5 commits

  • * 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
    dma-buf: Documentation update for Kconfig select
    nouveau: Support Optimus models for vga_switcheroo
    nouveau: properly check for _DSM function support
    dma-buf: drop option text so users don't select it.
    radeon: Call pci_clear_master() instead of open-coding it.
    gma500: Discard modes that don't fit in stolen memory
    drm: bump DRM_CONNECTOR_MAX_ENCODER from 2 to 3
    drm/radeon/kms: Fix module parameter description format
    drm/radeon/kms/ni: fix packet2 handling for VM IB parser
    ttm/dma: Remove the WARN() which is not useful.

    Linus Torvalds
     
  • * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6: (59 commits)
    rtc: max8925: Add function to work as wakeup source
    mfd: Add pm ops to max8925
    mfd: Convert aat2870 to dev_pm_ops
    mfd: Still check other interrupts if we get a wm831x touchscreen IRQ
    mfd: Introduce missing kfree in 88pm860x probe routine
    mfd: Add S5M series configuration
    mfd: Add s5m series irq driver
    mfd: Add S5M core driver
    mfd: Improve mc13xxx dt binding document
    mfd: Fix stmpe section mismatch
    mfd: Fix stmpe build warning
    mfd: Fix STMPE I2c build failure
    mfd: Constify aat2870-core i2c_device_id table
    gpio: Add support for stmpe variant 801
    mfd: Add support for stmpe variant 801
    mfd: Add support for stmpe variant 610
    mfd: Add support for STMPE SPI interface
    mfd: Separate out STMPE controller and interface specific code
    misc: Remove max8997-muic sysfs attributes
    mfd: Remove unused wm831x_irq_data_to_mask_reg()
    ...

    Fix up trivial conflict in drivers/leds/Kconfig due to addition of
    LEDS_MAX8997 and LEDS_TCA6507 next to each other.

    Linus Torvalds
     
  • MMC highlights for 3.3:

    Core:
    * Support for the HS200 high-speed eMMC mode.
    * Support SDIO 3.0 Ultra High Speed cards.
    * Kill pending block requests immediately if card is removed.
    * Enable the eMMC feature for locking boot partitions read-only
    until next power on, exposed via sysfs.

    Drivers:
    * Runtime PM support for Intel Medfield SDIO.
    * Suspend/resume support for sdhci-spear.
    * sh-mmcif now processes requests asynchronously.

    * tag 'mmc-merge-for-3.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc: (58 commits)
    mmc: fix a deadlock between system suspend and MMC block IO
    mmc: sdhci: restore the enabled dma when do reset all
    mmc: dw_mmc: miscaculated the fifo-depth with wrong bit operation
    mmc: host: Adds support for eMMC 4.5 HS200 mode
    mmc: core: HS200 mode support for eMMC 4.5
    mmc: dw_mmc: fixed wrong bit operation for SDMMC_GET_FCNT()
    mmc: core: Separate the timeout value for cache-ctrl
    mmc: sdhci-spear: Fix compilation error
    mmc: sdhci: Deal with failure case in sdhci_suspend_host
    mmc: dw_mmc: Clear the DDR mode for non-DDR
    mmc: sd: Fix SDR12 timing regression
    mmc: sdhci: Fix tuning timer incorrect setting when suspending host
    mmc: core: Add option to prevent eMMC sleep command
    mmc: omap_hsmmc: use threaded irq handler for card-detect.
    mmc: sdhci-pci: enable runtime PM for Medfield SDIO
    mmc: sdhci: Always pass clock request value zero to set_clock host op
    mmc: sdhci-pci: remove SDHCI_QUIRK2_OWN_CARD_DETECTION
    mmc: sdhci-pci: get gpio numbers from platform data
    mmc: sdhci-pci: add platform data
    mmc: sdhci: prevent card detection activity for non-removable cards
    ...

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
    GFS2: Fix nlink setting on inode creation
    GFS2: fail mount if journal recovery fails
    GFS2: let spectator mount do read only recovery
    GFS2: Fix a use-after-free that coverity spotted
    GFS2: dlm based recovery coordination

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
    ceph: ensure prealloc_blob is in place when removing xattr
    rbd: initialize snap_rwsem in rbd_add()
    ceph: enable/disable dentry complete flags via mount option
    vfs: export symbol d_find_any_alias()
    ceph: always initialize the dentry in open_root_dentry()
    libceph: remove useless return value for osd_client __send_request()
    ceph: avoid iput() while holding spinlock in ceph_dir_fsync
    ceph: avoid useless dget/dput in encode_fh
    ceph: dereference pointer after checking for NULL
    crush: fix force for non-root TAKE
    ceph: remove unnecessary d_fsdata conditional checks
    ceph: Use kmemdup rather than duplicating its implementation

    Fix up conflicts in fs/ceph/super.c (d_alloc_root() failure handling vs
    always initialize the dentry in open_root_dentry)

    Linus Torvalds
     

13 Jan, 2012

30 commits

    There exists at least one NVIDIA GPU (Quadro NVS 300) with a DMS-59
    connector that is capable of carrying DisplayPort, TMDS and VGA on a
    single connector.

    We need to bump the allowed encoder limit to support all three configs.

    Signed-off-by: Ben Skeggs
    Signed-off-by: Dave Airlie

    Ben Skeggs
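
    The change itself is just the limit bump named in the shortlog above
    (the header location is my assumption):

    /* include/drm/drm_crtc.h (location assumed) */
    #define DRM_CONNECTOR_MAX_ENCODER 3  /* was 2; DMS-59 can expose DP, TMDS and VGA */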
     
  • Andrew explains:

    - various misc stuff

    - Most of the rest of MM: memcg, threaded hugepages, others.

    - cpumask

    - kexec

    - kdump

    - some direct-io performance tweaking

    - radix-tree optimisations

    - new selftests code

    A note on this: often people will develop a new userspace-visible
    feature and will develop userspace code to exercise/test that
    feature. Then they merge the patch and the selftest code dies.
    Sometimes we paste it into the changelog. Sometimes the code gets
    thrown into Documentation/(!).

    This saddens me. So this patch creates a bare-bones framework which
    will henceforth allow me to ask people to include their test apps in
    the kernel tree so we can keep them alive. Then when people enhance
    or fix the feature, I can ask them to update the test app too.

    The infrastructure is terribly trivial at present - let's see how it
    evolves.

    - checkpoint/restart feature work.

    A note on this: this is a project by various mad Russians to perform
    c/r mainly from userspace, with various oddball helper code added
    into the kernel where the need is demonstrated.

    So rather than some large central lump of code, what we have is
    little bits and pieces popping up in various places which either
    expose something new or which permit something which is normally
    kernel-private to be modified.

    The overall project is an ongoing thing. I've judged that the size
    and scope of the thing means that we're more likely to be successful
    with it if we integrate the support into mainline piecemeal rather
    than allowing it all to develop out-of-tree.

    However I'm less confident than the developers that it will all
    eventually work! So what I'm asking them to do is to wrap each piece
    of new code inside CONFIG_CHECKPOINT_RESTORE. So if it all
    eventually comes to tears and the project as a whole fails, it should
    be a simple matter to go through and delete all trace of it.

    This lot pretty much wraps up the -rc1 merge for me.

    * akpm: (96 commits)
    unlzo: fix input buffer free
    ramoops: update parameters only after successful init
    ramoops: fix use of rounddown_pow_of_two()
    c/r: prctl: add PR_SET_MM codes to set up mm_struct entries
    c/r: procfs: add start_data, end_data, start_brk members to /proc/$pid/stat v4
    c/r: introduce CHECKPOINT_RESTORE symbol
    selftests: new x86 breakpoints selftest
    selftests: new very basic kernel selftests directory
    radix_tree: take radix_tree_path off stack
    radix_tree: remove radix_tree_indirect_to_ptr()
    dio: optimize cache misses in the submission path
    vfs: cache request_queue in struct block_device
    fs/direct-io.c: calculate fs_count correctly in get_more_blocks()
    drivers/parport/parport_pc.c: fix warnings
    panic: don't print redundant backtraces on oops
    sysctl: add the kernel.ns_last_pid control
    kdump: add udev events for memory online/offline
    include/linux/crash_dump.h needs elf.h
    kdump: fix crash_kexec()/smp_send_stop() race in panic()
    kdump: crashk_res init check for /sys/kernel/kexec_crash_size
    ...

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (69 commits)
    pptp: Accept packet with seq zero
    RDS: Remove some unused iWARP code
    net: fsl: fec: handle 10Mbps speed in RMII mode
    drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c: add missing iounmap
    drivers/net/ethernet/tundra/tsi108_eth.c: add missing iounmap
    ksz884x: fix mtu for VLAN
    net_sched: sfq: add optional RED on top of SFQ
    dp83640: Fix NOHZ local_softirq_pending 08 warning
    gianfar: Fix invalid TX frames returned on error queue when time stamping
    gianfar: Fix missing sock reference when processing TX time stamps
    phylib: introduce mdiobus_alloc_size()
    net: decrement memcg jump label when limit, not usage, is changed
    net: reintroduce missing rcu_assign_pointer() calls
    inet_diag: Rename inet_diag_req_compat into inet_diag_req
    inet_diag: Rename inet_diag_req into inet_diag_req_v2
    bond_alb: don't disable softirq under bond_alb_xmit
    mac80211: fix rx->key NULL pointer dereference in promiscuous mode
    nl80211: fix old station flags compatibility
    mdio-octeon: use an unique MDIO bus name.
    mdio-gpio: use an unique MDIO bus name.
    ...

    Linus Torvalds
     
    When we restore a task we need to set up text, data and data heap sizes
    from userspace to the values the task had at checkpoint time. This patch
    adds auxiliary prctl codes for that.

    While most of them have a statistical nature (their values are involved
    in the calculation of /proc/<pid>/statm output), the start_brk and brk
    values are used to compute the allowed size of program data segment
    expansion, which means that arbitrary changes to these values can be a
    dangerous operation. So, to restrict access, the following requirements
    apply to the prctl calls:

    - The process has to have CAP_SYS_ADMIN capability granted.
    - For all opcodes except start_brk/brk members an appropriate
    VMA area must exist and should fit certain VMA flags,
    such as:
    - code segment must be executable but not writable;
    - data segment must not be executable.

    start_brk/brk values must not intersect with data segment and must not
    exceed RLIMIT_DATA resource limit.

    Still the main guard is CAP_SYS_ADMIN capability check.

    Note that the kernel must be compiled with CONFIG_CHECKPOINT_RESTORE
    support, otherwise these prctl calls return -EINVAL. (A minimal userspace
    sketch follows this entry.)

    [akpm@linux-foundation.org: cache current->mm in a local, saving 200 bytes text]
    Signed-off-by: Cyrill Gorcunov
    Reviewed-by: Kees Cook
    Cc: Tejun Heo
    Cc: Andrew Vagin
    Cc: Serge Hallyn
    Cc: Pavel Emelyanov
    Cc: Vasiliy Kulikov
    Cc: KAMEZAWA Hiroyuki
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
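
    A minimal userspace sketch of how a restorer might use the new codes.
    The addresses are illustrative, and the fallback defines match the
    values merged for 3.3; they are only needed when the installed headers
    predate that release.

    #include <stdio.h>
    #include <sys/prctl.h>

    #ifndef PR_SET_MM
    #define PR_SET_MM            35
    #define PR_SET_MM_START_BRK   6
    #define PR_SET_MM_BRK         7
    #endif

    /* Requires CAP_SYS_ADMIN and CONFIG_CHECKPOINT_RESTORE, as noted above. */
    static int restore_brk(unsigned long start_brk, unsigned long brk)
    {
            if (prctl(PR_SET_MM, PR_SET_MM_START_BRK, start_brk, 0, 0))
                    return -1;
            if (prctl(PR_SET_MM, PR_SET_MM_BRK, brk, 0, 0))
                    return -1;
            return 0;
    }

    int main(void)
    {
            /* In a real restorer these values come from the checkpoint image. */
            if (restore_brk(0x1000000UL, 0x1004000UL))
                    perror("prctl(PR_SET_MM)");
            return 0;
    }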
     
  • It is not used anymore, remove it

    Signed-off-by: Xiao Guangrong
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
    This makes it possible to get from the inode to the request_queue with one
    less cache miss. It is used in a follow-on optimization (see the sketch
    after this entry).

    The lifetime of the pointer is the same as that of the gendisk.

    This assumes that the queue will always stay the same in the gendisk while
    it's visible to block_devices. I think that's safe, correct?

    Signed-off-by: Andi Kleen
    Acked-by: Jeff Moyer
    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
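
    A hedged sketch of the idea, not the mainline hunk (the cached field
    name bd_queue is my assumption): once the gendisk's queue is stashed in
    the block_device when it is bound to its disk, bdev_get_queue() no
    longer has to chase bdev->bd_disk->queue.

    /* when the block_device is bound to its gendisk */
    bdev->bd_queue = disk->queue;

    /* include/linux/blkdev.h (sketch) */
    static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
    {
            return bdev->bd_queue;          /* one less cache miss */
    }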
     
  • Building an ARM target we get the following warnings:

    CC arch/arm/kernel/setup.o
    In file included from arch/arm/kernel/setup.c:39:
    arch/arm/include/asm/elf.h:102:1: warning: "vmcore_elf64_check_arch" redefined
    In file included from arch/arm/kernel/setup.c:24:
    include/linux/crash_dump.h:30:1: warning: this is the location of the previous definition

    Quoting Russell King:

    "linux/crash_dump.h makes no attempt to include asm/elf.h, but it depends
    on stuff in asm/elf.h to determine how stuff inside this file is defined
    at parse time.

    So, if asm/elf.h is included after linux/crash_dump.h or not at all, you
    get a different result from the situation where asm/elf.h is included
    before."

    So add the elf.h header to crash_dump.h to avoid this problem (see the
    sketch after this entry).

    The original discussion about this can be found at:
    http://www.spinics.net/lists/arm-kernel/msg154113.html

    Signed-off-by: Fabio Estevam
    Cc: Russell King
    Cc: [3.2.1]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabio Estevam
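
    The fix is the one-line include described above (sketch; the header's
    surrounding includes are omitted):

    /* include/linux/crash_dump.h */
    #include <linux/elf.h>  /* brings in asm/elf.h so vmcore_elf64_check_arch
                               is defined consistently, whatever the include
                               order of the callers */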
     
    KMSG_DUMP_KEXEC is useless because we already save kernel messages inside
    /proc/vmcore, and it is unsafe to allow modules to do other things in a
    crash dump scenario.

    [akpm@linux-foundation.org: fix powerpc build]
    Signed-off-by: WANG Cong
    Reported-by: Vivek Goyal
    Acked-by: Vivek Goyal
    Acked-by: Jarod Wilson
    Cc: "Eric W. Biederman"
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    WANG Cong
     
  • del_page_from_lru() repeats del_page_from_lru_list(), also working out
    which LRU the page was on, clearing the relevant bits. Decouple those
    functions: remove del_page_from_lru() and add page_off_lru().

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Mostly we use "enum lru_list lru": change those few "l"s to "lru"s.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • What's so special about ____pagevec_lru_add() that it needs four leading
    underscores? Nothing, it just helped to distinguish from
    __pagevec_lru_add() in 2.6.28 development. Cut two leading underscores.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Replace pagevecs in putback_lru_pages() and move_active_pages_to_lru()
    by lists of pages_to_free: then apply Konstantin Khlebnikov's
    free_hot_cold_page_list() to them instead of pagevec_release().

    Which simplifies the flow (no need to drop and retake lock whenever
    pagevec fills up) and reduces stale addresses in stack backtraces
    (which often showed through the pagevecs); but more importantly,
    removes another 120 bytes from the deepest stacks in page reclaim.
    Although I've not recently seen an actual stack overflow here with
    a vanilla kernel, move_active_pages_to_lru() has often featured in
    deep backtraces.

    However, free_hot_cold_page_list() does not handle compound pages
    (nor need it: a Transparent HugePage would have been split by the
    time it reaches the call in shrink_page_list()), but it is possible
    for putback_lru_pages() or move_active_pages_to_lru() to be left
    holding the last reference on a THP, so must exclude the unlikely
    compound case before putting on pages_to_free.

    Remove pagevec_strip(), its work now done in move_active_pages_to_lru().
    The pagevec in scan_mapping_unevictable_pages() remains in mm/vmscan.c,
    but that is never on the reclaim path, and cannot be replaced by a list.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Konstantin Khlebnikov
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
    This patch adds a lightweight sync migrate mode, MIGRATE_SYNC_LIGHT, that
    avoids writing back pages to backing storage. Async compaction maps to
    MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT. For other
    migrate_pages users such as memory hotplug, MIGRATE_SYNC is used (the
    resulting modes are sketched after this entry).

    This avoids sync compaction stalling for an excessive length of time,
    particularly when copying files to a USB stick where there might be a
    large number of dirty pages backed by a filesystem that does not support
    ->writepages.

    [aarcange@redhat.com: This patch is heavily based on Andrea's work]
    [akpm@linux-foundation.org: fix fs/nfs/write.c build]
    [akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
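
    The resulting set of modes, roughly as described above (a sketch of the
    enum rather than a copy of the header):

    /* include/linux/migrate_mode.h (sketch) */
    enum migrate_mode {
            MIGRATE_ASYNC,          /* never block; used by async compaction */
            MIGRATE_SYNC_LIGHT,     /* may block, but no writeback to backing
                                       storage; used by sync compaction */
            MIGRATE_SYNC,           /* may block and write back; memory hotplug
                                       and other migrate_pages() users */
    };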
     
    Commit 39deaf85 ("mm: compaction: make isolate_lru_page() filter-aware")
    noted that compaction does not migrate dirty or writeback pages and that
    it was meaningless to pick the page and re-add it to the LRU list. This
    had to be partially reverted because some dirty pages can be migrated by
    compaction without blocking.

    This patch updates "mm: compaction: make isolate_lru_page" by skipping
    over pages that migration has no possibility of migrating to minimise LRU
    disruption.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Reviewed-by: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Asynchronous compaction is used when allocating transparent hugepages to
    avoid blocking for long periods of time. Due to reports of stalling,
    there was a debate on disabling synchronous compaction but this severely
    impacted allocation success rates. Part of the reason was that many dirty
    pages are skipped in asynchronous compaction by the following check:

    if (PageDirty(page) && !sync &&
        mapping->a_ops->migratepage != migrate_page)
            rc = -EBUSY;

    This skips over all mapping aops using buffer_migrate_page() even though
    it is possible to migrate some of these pages without blocking. This
    patch updates the ->migratepage callback with a "sync" parameter. It is
    the responsibility of the callback to fail gracefully if migration would
    block.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
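
    Roughly, the callback described above grows a mode argument and each
    implementation decides whether it can proceed without blocking (a
    sketch; in mainline the parameter ended up typed as enum migrate_mode
    rather than a plain sync flag):

    struct address_space_operations {
            /* ... */
            int (*migratepage)(struct address_space *mapping,
                               struct page *newpage, struct page *page,
                               enum migrate_mode mode);
            /* ... */
    };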
     
    In trace_mm_vmscan_lru_isolate(), we don't output 'file' information to
    the trace event and it is a bit inconvenient for the user to get the
    real information (like the line pasted below):

    mm_vmscan_lru_isolate: isolate_mode=2 order=0 nr_requested=32
    nr_scanned=32 nr_taken=32 contig_taken=0 contig_dirty=0 contig_failed=0

    'active' can be obtained by analyzing the mode (thanks go to Minchan and
    Mel), so this patch adds 'file' to the trace event; it now looks like:

    mm_vmscan_lru_isolate: isolate_mode=2 order=0 nr_requested=32
    nr_scanned=32 nr_taken=32 contig_taken=0 contig_dirty=0 contig_failed=0
    file=0

    Signed-off-by: Tao Ma
    Acked-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Reviewed-by: Minchan Kim
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tao Ma
     
    We have tlb_remove_tlb_entry to indicate that a pte tlb flush entry should
    be flushed, but no corresponding API for a pmd entry. This hasn't been a
    problem so far because THP is currently x86-only and tlb_flush() on x86
    flushes the entire TLB. But it is confusing and could be missed if THP is
    ported to another architecture.

    Also convert tlb->need_flush = 1 to a VM_BUG_ON(!tlb->need_flush) in
    __tlb_remove_page() as suggested by Andrea Arcangeli. The
    __tlb_remove_page() function is supposed to be called after
    tlb_remove_xxx_tlb_entry() and we can catch any misuse.

    Signed-off-by: Shaohua Li
    Reviewed-by: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
    Currently, during LRU handling, the memory cgroup code needs to do
    complicated work to see a valid pc->mem_cgroup, which may be overwritten.

    This patch is for relaxing the protocol. This patch guarantees
    - when pc->mem_cgroup is overwritten, page must not be on LRU.

    With this, the LRU routines can trust pc->mem_cgroup and don't need to
    check bits in pc->flags. This new rule may add a small overhead to swapin,
    but in most cases LRU handling gets faster.

    After this patch, PCG_ACCT_LRU bit is obsolete and removed.

    [akpm@linux-foundation.org: remove unneeded VM_BUG_ON(), restore hannes's christmas tree]
    [akpm@linux-foundation.org: clean up code comment]
    [hughd@google.com: fix NULL mem_cgroup_try_charge]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Miklos Szeredi
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Ying Han
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This is a preparation before removing a flag PCG_ACCT_LRU in page_cgroup
    and reducing atomic ops/complexity in memcg LRU handling.

    In some cases, pages are added to the LRU before being charged to a memcg,
    so they are not classified into a memory cgroup at LRU addition. Currently,
    the LRU onto which a page should be added is determined by a bit in
    page_cgroup->flags and by pc->mem_cgroup. I'd like to remove the check of
    the flag.

    In that case, pc->mem_cgroup may contain a stale pointer if pages are
    added to the LRU before classification. To handle this, the patch resets
    pc->mem_cgroup to root_mem_cgroup before LRU additions.

    [akpm@linux-foundation.org: fix CONFIG_CGROUP_MEM_CONT=n build]
    [hughd@google.com: fix CONFIG_CGROUP_MEM_RES_CTLR=y CONFIG_CGROUP_MEM_RES_CTLR_SWAP=n build]
    [akpm@linux-foundation.org: ksm.c needs memcontrol.h, per Michal]
    [hughd@google.com: stop oops in mem_cgroup_reset_owner()]
    [hughd@google.com: fix page migration to reset_owner]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Miklos Szeredi
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Ying Han
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • There are multiple places which need to get the swap_cgroup address, so
    add a helper function:

    static struct swap_cgroup *swap_cgroup_getsc(swp_entry_t ent,
                                                 struct swap_cgroup_ctrl **ctrl);

    to simplify the code.

    Signed-off-by: Bob Liu
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
     
  • Signed-off-by: Johannes Weiner
    Acked-by: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Balbir Singh
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    In split_huge_page(), mem_cgroup_split_huge_fixup() is called to handle
    page_cgroup modifications. It takes move_lock_page_cgroup(), modifies
    page_cgroup and LRU accounting, and is called HPAGE_PMD_SIZE - 1 times.

    But thinking again:
    - compound_lock() is held at move_account time, so it's not necessary
    to take move_lock_page_cgroup().
    - The LRU is locked and all tail pages will go onto the same LRU the
    head is now on.
    - page_cgroup is contiguous in the huge page range.

    This patch changes mem_cgroup_split_huge_fixup() to be called once per
    hugepage, reducing the cost of splitting.

    [akpm@linux-foundation.org: fix typo, per Michal]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Andrea Arcangeli
    Reviewed-by: Michal Hocko
    Cc: Balbir Singh
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • To find the page corresponding to a certain page_cgroup, the pc->flags
    encoded the node or section ID with the base array to compare the pc
    pointer to.

    Now that the per-memory cgroup LRU lists link page descriptors directly,
    there is no longer any code that knows the struct page_cgroup of a PFN
    but not the struct page.

    [hughd@google.com: remove unused node/section info from pc->flags fix]
    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: "Kirill A. Shutemov"
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Now that all code that operated on global per-zone LRU lists is
    converted to operate on per-memory cgroup LRU lists instead, there is no
    reason to keep the double-LRU scheme around any longer.

    The pc->lru member is removed and page->lru is linked directly to the
    per-memory cgroup LRU lists, which removes two pointers from a
    descriptor that exists for every page frame in the system.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Hugh Dickins
    Signed-off-by: Ying Han
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    Having a unified structure with an LRU list set for both global zones and
    per-memcg zones allows the code that deals with LRU lists, and does not
    care about the container itself, to be kept simple.

    Once the per-memcg LRU lists directly link struct pages, the isolation
    function and all other list manipulations are shared between the memcg
    case and the global LRU case.

    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Memory cgroup limit reclaim and traditional global pressure reclaim will
    soon share the same code to reclaim from a hierarchical tree of memory
    cgroups.

    In preparation of this, move the two right next to each other in
    shrink_zone().

    The mem_cgroup_hierarchical_reclaim() polymath is split into a soft
    limit reclaim function, which still does hierarchy walking on its own,
    and a limit (shrinking) reclaim function, which relies on generic
    reclaim code to walk the hierarchy.

    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    Commit ef6a3c6311 ("mm: add replace_page_cache_page() function") added a
    function replace_page_cache_page(). This function replaces a page in the
    radix-tree with a new page. When doing this, the memory cgroup needs to
    fix up the accounting information: memcg needs to check the PCG_USED bit,
    etc.

    In some (many?) cases, 'newpage' is on the LRU before calling
    replace_page_cache(). So, memcg's LRU accounting information should be
    fixed, too.

    This patch adds mem_cgroup_replace_page_cache() and removes the old hooks.
    In that function, the old page is unaccounted without touching
    res_counter and the new page is accounted to the memcg (of the old page).
    When overwriting pc->mem_cgroup of the new page, take zone->lru_lock to
    avoid races with LRU handling.

    Background:
    replace_page_cache_page() is called by the FUSE code in its splice()
    handling. Here, 'newpage' is replacing oldpage, but this newpage is not a
    newly allocated page and may be on the LRU. LRU mis-accounting is critical
    for the memory cgroup because rmdir() checks that the whole LRU is empty
    and that there is no accounting leak. If a page is on a different LRU than
    it should be, rmdir() will fail.

    This bug was added in March 2011, but there has been no bug report yet. I
    guess there are not many people who use memcg and FUSE at the same time
    with upstream kernels.

    The result of this bug is that an admin cannot destroy a memcg because of
    the accounting leak. So, no panic, no deadlock. And, even if an active
    cgroup exists, umount can succeed. So there is no problem at shutdown.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Miklos Szeredi
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • The current epoll code can be tickled to run basically indefinitely in
    both loop detection path check (on ep_insert()), and in the wakeup paths.
    The programs that tickle this behavior set up deeply linked networks of
    epoll file descriptors that cause the epoll algorithms to traverse them
    indefinitely. A couple of these sample programs have been previously
    posted in this thread: https://lkml.org/lkml/2011/2/25/297.

    To fix the loop detection path check algorithms, I simply keep track of
    the epoll nodes that have already been visited. Thus, the loop detection
    becomes proportional to the number of epoll file descriptors and links.
    This dramatically decreases the run-time of the loop check algorithm. In
    one diabolical case I tried, it reduced the run-time from 15 minutes (all
    in kernel time) to .3 seconds.

    Fixing the wakeup paths could be done at wakeup time in a similar manner
    by keeping track of nodes that have already been visited, but the
    complexity is harder, since there can be multiple wakeups on different
    CPUs. Thus, I've opted to limit the number of possible wakeup paths when
    the paths are created.

    This is accomplished by noting that the end file descriptors found during
    the loop detection pass (from the newly added link) are actually the
    sources for wakeup events. I keep a list of these file descriptors and
    limit the number and length of the paths that emanate from these 'source
    file descriptors'. In the current implementation I allow 1000 paths of
    length 1, 500 of length 2, 100 of length 3, 50 of length 4 and 10 of
    length 5. Note that it is sufficient to check the 'source file
    descriptors' reachable from the newly added link, since no other 'source
    file descriptors' will have newly added links. This allows us to check
    only the wakeup paths that may have gotten too long, and not re-check all
    possible wakeup paths on the system.

    In terms of the path limit selection, I think it's first worth noting that
    the most common case for epoll is probably the model where you have 1
    epoll file descriptor that is monitoring n 'source file descriptors'. In
    this case, each 'source file descriptor' has 1 path of length 1. Thus, I
    believe that the limits I'm proposing are quite reasonable and in fact may
    be too generous; I'm hoping that the proposed limits will not cause any
    workloads that currently work to fail.

    In terms of locking, I have extended the use of the 'epmutex' to all
    epoll_ctl add and remove operations. Currently it's only used in a subset
    of the add paths. I need to hold the epmutex so that we can correctly
    traverse a coherent graph to check the number of paths. I believe that
    this additional locking is probably ok, since it's in the setup/teardown
    paths and doesn't affect the running paths, but it certainly is going to
    add some extra overhead. Also worth noting is that the epmutex was
    recently added to the ep_ctl add operations in the initial path loop
    detection code, using the argument that it was not on a critical path.

    Another thing to note here, is the length of epoll chains that is allowed.
    Currently, eventpoll.c defines:

    /* Maximum number of nesting allowed inside epoll sets */
    #define EP_MAX_NESTS 4

    This basically means that I am limited to a graph depth of 5 (EP_MAX_NESTS
    + 1). However, this limit is currently only enforced during the loop
    check detection code, and only when the epoll file descriptors are added
    in a certain order. Thus, this limit is currently easily bypassed. The
    newly added check for wakeup paths strictly limits the wakeup paths to a
    length of 5, regardless of the order in which ep's are linked together.
    Thus, a side-effect of the new code is a more consistent enforcement of
    the graph depth.

    Thus far, I've tested this using the sample programs previously
    mentioned, which now either return quickly or return -EINVAL. I've also
    tested using the piptest.c epoll tester, which showed no difference in
    performance. I've also created a number of different epoll networks and
    tested that they behave as expected. (A minimal nesting example follows
    this entry.)

    I believe this solves the original diabolical test cases, while still
    preserving the sane epoll nesting.

    Signed-off-by: Jason Baron
    Cc: Nelson Elhage
    Cc: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Baron
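
    A minimal userspace illustration of the nesting being bounded here
    (assumes Linux epoll; error handling trimmed). An epoll fd is itself
    pollable, so it can be added to another epoll set; creating a cycle is
    rejected with ELOOP, and with this patch overly long or branchy wakeup
    paths are refused as well.

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/epoll.h>

    int main(void)
    {
            int outer = epoll_create1(0);
            int inner = epoll_create1(0);
            struct epoll_event ev = { .events = EPOLLIN, .data.fd = inner };

            /* inner becomes a wakeup source for outer: one path of length 1 */
            if (epoll_ctl(outer, EPOLL_CTL_ADD, inner, &ev) == -1)
                    perror("add inner to outer");

            /* linking back the other way would form a loop */
            ev.data.fd = outer;
            if (epoll_ctl(inner, EPOLL_CTL_ADD, outer, &ev) == -1)
                    perror("add outer to inner (expected ELOOP)");

            close(inner);
            close(outer);
            return 0;
    }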
     
  • While implementing cmpxchg_double() on s390 I realized that we don't set
    CONFIG_CMPXCHG_LOCAL despite the fact that we have support for it.

    However setting that option will increase the size of struct page by
    eight bytes on 64 bit, which we certainly do not want. Also, it doesn't
    make sense that a present cpu feature should increase the size of struct
    page.

    Besides that, it looks like the dependency on CMPXCHG_LOCAL is wrong and
    that it should depend on CMPXCHG_DOUBLE instead.

    This patch:

    If an architecture supports CMPXCHG_LOCAL, this shouldn't automatically
    result in larger struct pages when the SLUB allocator is used. Instead,
    introduce a new config option, "HAVE_ALIGNED_STRUCT_PAGE", which can be
    selected if a double-word-aligned struct page is required. Also update the
    x86 Kconfig so that it works as before.

    Signed-off-by: Heiko Carstens
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
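
    The shape of the new option, as I read the description above (a sketch,
    not the exact mm_types.h hunk): architectures select
    HAVE_ALIGNED_STRUCT_PAGE when they want the double-word-aligned struct
    page that SLUB's cmpxchg_double path needs.

    struct page {
            /* ... */
    }
    #ifdef CONFIG_HAVE_ALIGNED_STRUCT_PAGE
            __aligned(2 * sizeof(unsigned long))
    #endif
    ;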
     
  • The uses have been renamed so delete the unused macro.

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches