15 Jan, 2012

5 commits

  • Introduce a wrapper around scsi_cmd_ioctl that takes a block device.

    The function will then be enhanced to detect partition block devices
    and, in that case, subject the ioctls to whitelisting.

    Cc: linux-scsi@vger.kernel.org
    Cc: Jens Axboe
    Cc: James Bottomley
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Linus Torvalds

    Paolo Bonzini
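
    A minimal sketch of the wrapper idea described above, not the mainline
    diff (the wrapper name and exact signatures are assumptions based on the
    description):

    /* Take a block_device so the follow-up change can detect partitions
     * (bdev != bdev->bd_contains) and whitelist the ioctls they may pass. */
    int scsi_cmd_blk_ioctl(struct block_device *bdev, fmode_t mode,
                           unsigned int cmd, void __user *arg)
    {
            return scsi_cmd_ioctl(bdev->bd_disk->queue, bdev->bd_disk,
                                  mode, cmd, arg);
    }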
     
  • 2nd round of GPIO changes for v3.3 merge window

    * tag 'gpio-for-linus' of git://git.secretlab.ca/git/linux-2.6:
    GPIO: sa1100: implement proper gpiolib gpio_to_irq conversion
    gpio: pl061: remove combined interrupt
    gpio: pl061: convert to use generic irq chip
    GPIO: add bindings for managed devices
    ARM: realview: convert pl061 no irq to 0 instead of -1
    gpio: pl061: convert to use 0 for no irq
    gpio: pl061: use chained_irq_* functions in irq handler
    GPIO/pl061: Add suspend resume capability
    drivers/gpio/gpio-tegra.c: use devm_request_and_ioremap

    Linus Torvalds
     
  • * 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: (119 commits)
    MIPS: Delete unused function add_temporary_entry.
    MIPS: Set default pci cache line size.
    MIPS: Flush huge TLB
    MIPS: Octeon: Remove SYS_SUPPORTS_HIGHMEM.
    MIPS: Octeon: Add support for OCTEON II PCIe
    MIPS: Octeon: Update PCI Latency timer and enable more error reporting.
    MIPS: Alchemy: Update cpu-feature-overrides
    MIPS: Alchemy: db1200: Improve PB1200 detection.
    MIPS: Alchemy: merge Au1000 and Au1300-style IRQ controller code.
    MIPS: Alchemy: chain IRQ controllers to MIPS IRQ controller
    MIPS: Alchemy: irq: register pm at irq init time
    MIPS: Alchemy: Touchscreen support on DB1100
    MIPS: Alchemy: Hook up IrDA on DB1000/DB1100
    net/irda: convert au1k_ir to platform driver.
    MIPS: Alchemy: remove unused board headers
    MTD: nand: make au1550nd.c a platform_driver
    MIPS: Netlogic: Mark Netlogic chips as SMT capable
    MIPS: Netlogic: Add support for XLP 3XX cores
    MIPS: Netlogic: Merge some of XLR/XLP wakup code
    MIPS: Netlogic: Add default XLP config.
    ...

    Fix up trivial conflicts in arch/mips/kernel/{perf_event_mipsxx.c,
    traps.c} and drivers/tty/serial/Makefile

    Linus Torvalds
     
  • Autogenerated GPG tag for Rusty D1ADB8F1: 15EE 8D6C AB0E 7F0C F999 BFCB D920 0E6C D1AD B8F1

    * tag 'for-linus' of git://github.com/rustyrussell/linux:
    module_param: check that bool parameters really are bool.
    intelfbdrv.c: bailearly is an int module_param
    paride/pcd: fix bool verbose module parameter.
    module_param: make bool parameters really bool (drivers & misc)
    module_param: make bool parameters really bool (arch)
    module_param: make bool parameters really bool (core code)
    kernel/async: remove redundant declaration.
    printk: fix unnecessary module_param_name.
    lirc_parallel: fix module parameter description.
    module_param: avoid bool abuse, add bint for special cases.
    module_param: check type correctness for module_param_array
    modpost: use linker section to generate table.
    modpost: use a table rather than a giant if/else statement.
    modules: sysfs - export: taint, coresize, initsize
    kernel/params: replace DEBUGP with pr_debug
    module: replace DEBUGP with pr_debug
    module: struct module_ref should contains long fields
    module: Fix performance regression on modules with large symbol tables
    module: Add comments describing how the "strmap" logic works

    Fix up conflicts in scripts/mod/file2alias.c due to the new linker-
    generated table approach to adding __mod_*_device_table entries. The
    ARM sa11x0 mcp bus needed to be converted to that too.

    Linus Torvalds
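
    A small module sketch of the rule this series enforces (hypothetical
    module; the parameter name is chosen for illustration): a bool module
    parameter must now be backed by a C bool, with the new "bint" type
    available for the special cases that still need an int.

    #include <linux/module.h>
    #include <linux/moduleparam.h>

    static bool verbose;                   /* was often "static int" before */
    module_param(verbose, bool, 0644);
    MODULE_PARM_DESC(verbose, "Enable verbose logging");

    static int __init demo_init(void)
    {
            if (verbose)
                    pr_info("demo: loaded\n");
            return 0;
    }

    static void __exit demo_exit(void)
    {
    }

    module_init(demo_init);
    module_exit(demo_exit);
    MODULE_LICENSE("GPL");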
     
  • * 'for-3.3' of git://linux-nfs.org/~bfields/linux: (31 commits)
    nfsd4: nfsd4_create_clid_dir return value is unused
    NFSD: Change name of extended attribute containing junction
    svcrpc: don't revert to SVC_POOL_DEFAULT on nfsd shutdown
    svcrpc: fix double-free on shutdown of nfsd after changing pool mode
    nfsd4: be forgiving in the absence of the recovery directory
    nfsd4: fix spurious 4.1 post-reboot failures
    NFSD: forget_delegations should use list_for_each_entry_safe
    NFSD: Only reinitilize the recall_lru list under the recall lock
    nfsd4: initialize special stateid's at compile time
    NFSd: use network-namespace-aware cache registering routines
    SUNRPC: create svc_xprt in proper network namespace
    svcrpc: update outdated BKL comment
    nfsd41: allow non-reclaim open-by-fh's in 4.1
    svcrpc: avoid memory-corruption on pool shutdown
    svcrpc: destroy server sockets all at once
    svcrpc: make svc_delete_xprt static
    nfsd: Fix oops when parsing a 0 length export
    nfsd4: Use kmemdup rather than duplicating its implementation
    nfsd4: add a separate (lockowner, inode) lookup
    nfsd4: fix CONFIG_NFSD_FAULT_INJECTION compile error
    ...

    Linus Torvalds
     

14 Jan, 2012

5 commits

  • * 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
    dma-buf: Documentation update for Kconfig select
    nouveau: Support Optimus models for vga_switcheroo
    nouveau: properly check for _DSM function support
    dma-buf: drop option text so users don't select it.
    radeon: Call pci_clear_master() instead of open-coding it.
    gma500: Discard modes that don't fit in stolen memory
    drm: bump DRM_CONNECTOR_MAX_ENCODER from 2 to 3
    drm/radeon/kms: Fix module parameter description format
    drm/radeon/kms/ni: fix packet2 handling for VM IB parser
    ttm/dma: Remove the WARN() which is not useful.

    Linus Torvalds
     
  • * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6: (59 commits)
    rtc: max8925: Add function to work as wakeup source
    mfd: Add pm ops to max8925
    mfd: Convert aat2870 to dev_pm_ops
    mfd: Still check other interrupts if we get a wm831x touchscreen IRQ
    mfd: Introduce missing kfree in 88pm860x probe routine
    mfd: Add S5M series configuration
    mfd: Add s5m series irq driver
    mfd: Add S5M core driver
    mfd: Improve mc13xxx dt binding document
    mfd: Fix stmpe section mismatch
    mfd: Fix stmpe build warning
    mfd: Fix STMPE I2c build failure
    mfd: Constify aat2870-core i2c_device_id table
    gpio: Add support for stmpe variant 801
    mfd: Add support for stmpe variant 801
    mfd: Add support for stmpe variant 610
    mfd: Add support for STMPE SPI interface
    mfd: Separate out STMPE controller and interface specific code
    misc: Remove max8997-muic sysfs attributes
    mfd: Remove unused wm831x_irq_data_to_mask_reg()
    ...

    Fix up trivial conflict in drivers/leds/Kconfig due to addition of
    LEDS_MAX8997 and LEDS_TCA6507 next to each other.

    Linus Torvalds
     
  • MMC highlights for 3.3:

    Core:
    * Support for the HS200 high-speed eMMC mode.
    * Support SDIO 3.0 Ultra High Speed cards.
    * Kill pending block requests immediately if card is removed.
    * Enable the eMMC feature for locking boot partitions read-only
    until next power on, exposed via sysfs.

    Drivers:
    * Runtime PM support for Intel Medfield SDIO.
    * Suspend/resume support for sdhci-spear.
    * sh-mmcif now processes requests asynchronously.

    * tag 'mmc-merge-for-3.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc: (58 commits)
    mmc: fix a deadlock between system suspend and MMC block IO
    mmc: sdhci: restore the enabled dma when do reset all
    mmc: dw_mmc: miscaculated the fifo-depth with wrong bit operation
    mmc: host: Adds support for eMMC 4.5 HS200 mode
    mmc: core: HS200 mode support for eMMC 4.5
    mmc: dw_mmc: fixed wrong bit operation for SDMMC_GET_FCNT()
    mmc: core: Separate the timeout value for cache-ctrl
    mmc: sdhci-spear: Fix compilation error
    mmc: sdhci: Deal with failure case in sdhci_suspend_host
    mmc: dw_mmc: Clear the DDR mode for non-DDR
    mmc: sd: Fix SDR12 timing regression
    mmc: sdhci: Fix tuning timer incorrect setting when suspending host
    mmc: core: Add option to prevent eMMC sleep command
    mmc: omap_hsmmc: use threaded irq handler for card-detect.
    mmc: sdhci-pci: enable runtime PM for Medfield SDIO
    mmc: sdhci: Always pass clock request value zero to set_clock host op
    mmc: sdhci-pci: remove SDHCI_QUIRK2_OWN_CARD_DETECTION
    mmc: sdhci-pci: get gpio numbers from platform data
    mmc: sdhci-pci: add platform data
    mmc: sdhci: prevent card detection activity for non-removable cards
    ...

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
    GFS2: Fix nlink setting on inode creation
    GFS2: fail mount if journal recovery fails
    GFS2: let spectator mount do read only recovery
    GFS2: Fix a use-after-free that coverity spotted
    GFS2: dlm based recovery coordination

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
    ceph: ensure prealloc_blob is in place when removing xattr
    rbd: initialize snap_rwsem in rbd_add()
    ceph: enable/disable dentry complete flags via mount option
    vfs: export symbol d_find_any_alias()
    ceph: always initialize the dentry in open_root_dentry()
    libceph: remove useless return value for osd_client __send_request()
    ceph: avoid iput() while holding spinlock in ceph_dir_fsync
    ceph: avoid useless dget/dput in encode_fh
    ceph: dereference pointer after checking for NULL
    crush: fix force for non-root TAKE
    ceph: remove unnecessary d_fsdata conditional checks
    ceph: Use kmemdup rather than duplicating its implementation

    Fix up conflicts in fs/ceph/super.c (d_alloc_root() failure handling vs
    always initialize the dentry in open_root_dentry)

    Linus Torvalds
     

13 Jan, 2012

30 commits

    There exists at least one NVIDIA GPU (Quadro NVS 300) with a DMS-59
    connector that is capable of carrying DisplayPort, TMDS and VGA on a
    single connector.

    We need to bump the allowed encoder limit to support all three configs.

    Signed-off-by: Ben Skeggs
    Signed-off-by: Dave Airlie

    Ben Skeggs
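
    The change itself is just the limit bump named in the shortlog above
    (the header location is my assumption):

    /* include/drm/drm_crtc.h (location assumed) */
    #define DRM_CONNECTOR_MAX_ENCODER 3  /* was 2; DMS-59 can expose DP, TMDS and VGA */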
     
  • Andrew explains:

    - various misc stuff

    - Most of the rest of MM: memcg, threaded hugepages, others.

    - cpumask

    - kexec

    - kdump

    - some direct-io performance tweaking

    - radix-tree optimisations

    - new selftests code

    A note on this: often people will develop a new userspace-visible
    feature and will develop userspace code to exercise/test that
    feature. Then they merge the patch and the selftest code dies.
    Sometimes we paste it into the changelog. Sometimes the code gets
    thrown into Documentation/(!).

    This saddens me. So this patch creates a bare-bones framework which
    will henceforth allow me to ask people to include their test apps in
    the kernel tree so we can keep them alive. Then when people enhance
    or fix the feature, I can ask them to update the test app too.

    The infrastructure is terribly trivial at present - let's see how it
    evolves.

    - checkpoint/restart feature work.

    A note on this: this is a project by various mad Russians to perform
    c/r mainly from userspace, with various oddball helper code added
    into the kernel where the need is demonstrated.

    So rather than some large central lump of code, what we have is
    little bits and pieces popping up in various places which either
    expose something new or which permit something which is normally
    kernel-private to be modified.

    The overall project is an ongoing thing. I've judged that the size
    and scope of the thing means that we're more likely to be successful
    with it if we integrate the support into mainline piecemeal rather
    than allowing it all to develop out-of-tree.

    However I'm less confident than the developers that it will all
    eventually work! So what I'm asking them to do is to wrap each piece
    of new code inside CONFIG_CHECKPOINT_RESTORE. So if it all
    eventually comes to tears and the project as a whole fails, it should
    be a simple matter to go through and delete all trace of it.

    This lot pretty much wraps up the -rc1 merge for me.

    * akpm: (96 commits)
    unlzo: fix input buffer free
    ramoops: update parameters only after successful init
    ramoops: fix use of rounddown_pow_of_two()
    c/r: prctl: add PR_SET_MM codes to set up mm_struct entries
    c/r: procfs: add start_data, end_data, start_brk members to /proc/$pid/stat v4
    c/r: introduce CHECKPOINT_RESTORE symbol
    selftests: new x86 breakpoints selftest
    selftests: new very basic kernel selftests directory
    radix_tree: take radix_tree_path off stack
    radix_tree: remove radix_tree_indirect_to_ptr()
    dio: optimize cache misses in the submission path
    vfs: cache request_queue in struct block_device
    fs/direct-io.c: calculate fs_count correctly in get_more_blocks()
    drivers/parport/parport_pc.c: fix warnings
    panic: don't print redundant backtraces on oops
    sysctl: add the kernel.ns_last_pid control
    kdump: add udev events for memory online/offline
    include/linux/crash_dump.h needs elf.h
    kdump: fix crash_kexec()/smp_send_stop() race in panic()
    kdump: crashk_res init check for /sys/kernel/kexec_crash_size
    ...

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (69 commits)
    pptp: Accept packet with seq zero
    RDS: Remove some unused iWARP code
    net: fsl: fec: handle 10Mbps speed in RMII mode
    drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c: add missing iounmap
    drivers/net/ethernet/tundra/tsi108_eth.c: add missing iounmap
    ksz884x: fix mtu for VLAN
    net_sched: sfq: add optional RED on top of SFQ
    dp83640: Fix NOHZ local_softirq_pending 08 warning
    gianfar: Fix invalid TX frames returned on error queue when time stamping
    gianfar: Fix missing sock reference when processing TX time stamps
    phylib: introduce mdiobus_alloc_size()
    net: decrement memcg jump label when limit, not usage, is changed
    net: reintroduce missing rcu_assign_pointer() calls
    inet_diag: Rename inet_diag_req_compat into inet_diag_req
    inet_diag: Rename inet_diag_req into inet_diag_req_v2
    bond_alb: don't disable softirq under bond_alb_xmit
    mac80211: fix rx->key NULL pointer dereference in promiscuous mode
    nl80211: fix old station flags compatibility
    mdio-octeon: use an unique MDIO bus name.
    mdio-gpio: use an unique MDIO bus name.
    ...

    Linus Torvalds
     
    When we restore a task we need to set up text, data and data heap sizes
    from userspace to the values the task had at checkpoint time. This patch
    adds auxiliary prctl codes for that.

    While most of them have a statistical nature (their values are involved
    in the calculation of /proc/<pid>/statm output), the start_brk and brk
    values are used to compute the allowed size of program data segment
    expansion, which means that arbitrary changes to these values can be a
    dangerous operation. So, to restrict access, the following requirements
    apply to the prctl calls:

    - The process has to have CAP_SYS_ADMIN capability granted.
    - For all opcodes except start_brk/brk members an appropriate
    VMA area must exist and should fit certain VMA flags,
    such as:
    - code segment must be executable but not writable;
    - data segment must not be executable.

    start_brk/brk values must not intersect with data segment and must not
    exceed RLIMIT_DATA resource limit.

    Still the main guard is CAP_SYS_ADMIN capability check.

    Note that the kernel must be compiled with CONFIG_CHECKPOINT_RESTORE
    support, otherwise these prctl calls return -EINVAL. (A minimal userspace
    sketch follows this entry.)

    [akpm@linux-foundation.org: cache current->mm in a local, saving 200 bytes text]
    Signed-off-by: Cyrill Gorcunov
    Reviewed-by: Kees Cook
    Cc: Tejun Heo
    Cc: Andrew Vagin
    Cc: Serge Hallyn
    Cc: Pavel Emelyanov
    Cc: Vasiliy Kulikov
    Cc: KAMEZAWA Hiroyuki
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
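
    A minimal userspace sketch of how a restorer might use the new codes.
    The addresses are illustrative, and the fallback defines match the
    values merged for 3.3; they are only needed when the installed headers
    predate that release.

    #include <stdio.h>
    #include <sys/prctl.h>

    #ifndef PR_SET_MM
    #define PR_SET_MM            35
    #define PR_SET_MM_START_BRK   6
    #define PR_SET_MM_BRK         7
    #endif

    /* Requires CAP_SYS_ADMIN and CONFIG_CHECKPOINT_RESTORE, as noted above. */
    static int restore_brk(unsigned long start_brk, unsigned long brk)
    {
            if (prctl(PR_SET_MM, PR_SET_MM_START_BRK, start_brk, 0, 0))
                    return -1;
            if (prctl(PR_SET_MM, PR_SET_MM_BRK, brk, 0, 0))
                    return -1;
            return 0;
    }

    int main(void)
    {
            /* In a real restorer these values come from the checkpoint image. */
            if (restore_brk(0x1000000UL, 0x1004000UL))
                    perror("prctl(PR_SET_MM)");
            return 0;
    }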
     
  • It is not used anymore, remove it

    Signed-off-by: Xiao Guangrong
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
    This makes it possible to get from the inode to the request_queue with one
    less cache miss. It is used in a follow-on optimization (see the sketch
    after this entry).

    The lifetime of the pointer is the same as that of the gendisk.

    This assumes that the queue will always stay the same in the gendisk while
    it's visible to block_devices. I think that's safe, correct?

    Signed-off-by: Andi Kleen
    Acked-by: Jeff Moyer
    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
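
    A hedged sketch of the idea, not the mainline hunk (the cached field
    name bd_queue is my assumption): once the gendisk's queue is stashed in
    the block_device when it is bound to its disk, bdev_get_queue() no
    longer has to chase bdev->bd_disk->queue.

    /* when the block_device is bound to its gendisk */
    bdev->bd_queue = disk->queue;

    /* include/linux/blkdev.h (sketch) */
    static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
    {
            return bdev->bd_queue;          /* one less cache miss */
    }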
     
  • Building an ARM target we get the following warnings:

    CC arch/arm/kernel/setup.o
    In file included from arch/arm/kernel/setup.c:39:
    arch/arm/include/asm/elf.h:102:1: warning: "vmcore_elf64_check_arch" redefined
    In file included from arch/arm/kernel/setup.c:24:
    include/linux/crash_dump.h:30:1: warning: this is the location of the previous definition

    Quoting Russell King:

    "linux/crash_dump.h makes no attempt to include asm/elf.h, but it depends
    on stuff in asm/elf.h to determine how stuff inside this file is defined
    at parse time.

    So, if asm/elf.h is included after linux/crash_dump.h or not at all, you
    get a different result from the situation where asm/elf.h is included
    before."

    So add the elf.h header to crash_dump.h to avoid this problem (see the
    sketch after this entry).

    The original discussion about this can be found at:
    http://www.spinics.net/lists/arm-kernel/msg154113.html

    Signed-off-by: Fabio Estevam
    Cc: Russell King
    Cc: [3.2.1]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabio Estevam
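
    The fix is the one-line include described above (sketch; the header's
    surrounding includes are omitted):

    /* include/linux/crash_dump.h */
    #include <linux/elf.h>  /* brings in asm/elf.h so vmcore_elf64_check_arch
                               is defined consistently, whatever the include
                               order of the callers */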
     
    KMSG_DUMP_KEXEC is useless because we already save kernel messages inside
    /proc/vmcore, and it is unsafe to allow modules to do other things in a
    crash dump scenario.

    [akpm@linux-foundation.org: fix powerpc build]
    Signed-off-by: WANG Cong
    Reported-by: Vivek Goyal
    Acked-by: Vivek Goyal
    Acked-by: Jarod Wilson
    Cc: "Eric W. Biederman"
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    WANG Cong
     
  • del_page_from_lru() repeats del_page_from_lru_list(), also working out
    which LRU the page was on, clearing the relevant bits. Decouple those
    functions: remove del_page_from_lru() and add page_off_lru().

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Mostly we use "enum lru_list lru": change those few "l"s to "lru"s.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • What's so special about ____pagevec_lru_add() that it needs four leading
    underscores? Nothing, it just helped to distinguish from
    __pagevec_lru_add() in 2.6.28 development. Cut two leading underscores.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Replace pagevecs in putback_lru_pages() and move_active_pages_to_lru()
    by lists of pages_to_free: then apply Konstantin Khlebnikov's
    free_hot_cold_page_list() to them instead of pagevec_release().

    Which simplifies the flow (no need to drop and retake lock whenever
    pagevec fills up) and reduces stale addresses in stack backtraces
    (which often showed through the pagevecs); but more importantly,
    removes another 120 bytes from the deepest stacks in page reclaim.
    Although I've not recently seen an actual stack overflow here with
    a vanilla kernel, move_active_pages_to_lru() has often featured in
    deep backtraces.

    However, free_hot_cold_page_list() does not handle compound pages
    (nor need it: a Transparent HugePage would have been split by the
    time it reaches the call in shrink_page_list()), but it is possible
    for putback_lru_pages() or move_active_pages_to_lru() to be left
    holding the last reference on a THP, so must exclude the unlikely
    compound case before putting on pages_to_free.

    Remove pagevec_strip(), its work now done in move_active_pages_to_lru().
    The pagevec in scan_mapping_unevictable_pages() remains in mm/vmscan.c,
    but that is never on the reclaim path, and cannot be replaced by a list.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Konstantin Khlebnikov
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
    This patch adds a lightweight sync migrate mode, MIGRATE_SYNC_LIGHT, that
    avoids writing back pages to backing storage. Async compaction maps to
    MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT. For other
    migrate_pages users such as memory hotplug, MIGRATE_SYNC is used (the
    resulting modes are sketched after this entry).

    This avoids sync compaction stalling for an excessive length of time,
    particularly when copying files to a USB stick where there might be a
    large number of dirty pages backed by a filesystem that does not support
    ->writepages.

    [aarcange@redhat.com: This patch is heavily based on Andrea's work]
    [akpm@linux-foundation.org: fix fs/nfs/write.c build]
    [akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
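
    The resulting set of modes, roughly as described above (a sketch of the
    enum rather than a copy of the header):

    /* include/linux/migrate_mode.h (sketch) */
    enum migrate_mode {
            MIGRATE_ASYNC,          /* never block; used by async compaction */
            MIGRATE_SYNC_LIGHT,     /* may block, but no writeback to backing
                                       storage; used by sync compaction */
            MIGRATE_SYNC,           /* may block and write back; memory hotplug
                                       and other migrate_pages() users */
    };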
     
    Commit 39deaf85 ("mm: compaction: make isolate_lru_page() filter-aware")
    noted that compaction does not migrate dirty or writeback pages and that
    it was meaningless to pick the page and re-add it to the LRU list. This
    had to be partially reverted because some dirty pages can be migrated by
    compaction without blocking.

    This patch updates "mm: compaction: make isolate_lru_page" by skipping
    over pages that migration has no possibility of migrating to minimise LRU
    disruption.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Reviewed-by: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Asynchronous compaction is used when allocating transparent hugepages to
    avoid blocking for long periods of time. Due to reports of stalling,
    there was a debate on disabling synchronous compaction but this severely
    impacted allocation success rates. Part of the reason was that many dirty
    pages are skipped in asynchronous compaction by the following check:

    if (PageDirty(page) && !sync &&
        mapping->a_ops->migratepage != migrate_page)
            rc = -EBUSY;

    This skips over all mapping aops using buffer_migrate_page() even though
    it is possible to migrate some of these pages without blocking. This
    patch updates the ->migratepage callback with a "sync" parameter. It is
    the responsibility of the callback to fail gracefully if migration would
    block.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
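
    Roughly, the callback described above grows a mode argument and each
    implementation decides whether it can proceed without blocking (a
    sketch; in mainline the parameter ended up typed as enum migrate_mode
    rather than a plain sync flag):

    struct address_space_operations {
            /* ... */
            int (*migratepage)(struct address_space *mapping,
                               struct page *newpage, struct page *page,
                               enum migrate_mode mode);
            /* ... */
    };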
     
    In trace_mm_vmscan_lru_isolate(), we don't output 'file' information to
    the trace event and it is a bit inconvenient for the user to get the
    real information (like the line pasted below):

    mm_vmscan_lru_isolate: isolate_mode=2 order=0 nr_requested=32
    nr_scanned=32 nr_taken=32 contig_taken=0 contig_dirty=0 contig_failed=0

    'active' can be obtained by analyzing the mode (thanks go to Minchan and
    Mel), so this patch adds 'file' to the trace event; it now looks like:

    mm_vmscan_lru_isolate: isolate_mode=2 order=0 nr_requested=32
    nr_scanned=32 nr_taken=32 contig_taken=0 contig_dirty=0 contig_failed=0
    file=0

    Signed-off-by: Tao Ma
    Acked-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Reviewed-by: Minchan Kim
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tao Ma
     
    We have tlb_remove_tlb_entry to indicate that a pte tlb flush entry should
    be flushed, but no corresponding API for a pmd entry. This hasn't been a
    problem so far because THP is currently x86-only and tlb_flush() on x86
    flushes the entire TLB. But it is confusing and could be missed if THP is
    ported to another architecture.

    Also convert tlb->need_flush = 1 to a VM_BUG_ON(!tlb->need_flush) in
    __tlb_remove_page() as suggested by Andrea Arcangeli. The
    __tlb_remove_page() function is supposed to be called after
    tlb_remove_xxx_tlb_entry() and we can catch any misuse.

    Signed-off-by: Shaohua Li
    Reviewed-by: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
    Currently, during LRU handling, the memory cgroup code needs to do
    complicated work to see a valid pc->mem_cgroup, which may be overwritten.

    This patch is for relaxing the protocol. This patch guarantees
    - when pc->mem_cgroup is overwritten, page must not be on LRU.

    With this, the LRU routines can trust pc->mem_cgroup and don't need to
    check bits in pc->flags. This new rule may add a small overhead to swapin,
    but in most cases LRU handling gets faster.

    After this patch, PCG_ACCT_LRU bit is obsolete and removed.

    [akpm@linux-foundation.org: remove unneeded VM_BUG_ON(), restore hannes's christmas tree]
    [akpm@linux-foundation.org: clean up code comment]
    [hughd@google.com: fix NULL mem_cgroup_try_charge]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Miklos Szeredi
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Ying Han
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This is a preparation before removing a flag PCG_ACCT_LRU in page_cgroup
    and reducing atomic ops/complexity in memcg LRU handling.

    In some cases, pages are added to the LRU before being charged to a memcg,
    so they are not classified into a memory cgroup at LRU addition. Currently,
    the LRU onto which a page should be added is determined by a bit in
    page_cgroup->flags and by pc->mem_cgroup. I'd like to remove the check of
    the flag.

    In that case, pc->mem_cgroup may contain a stale pointer if pages are
    added to the LRU before classification. To handle this, the patch resets
    pc->mem_cgroup to root_mem_cgroup before LRU additions.

    [akpm@linux-foundation.org: fix CONFIG_CGROUP_MEM_CONT=n build]
    [hughd@google.com: fix CONFIG_CGROUP_MEM_RES_CTLR=y CONFIG_CGROUP_MEM_RES_CTLR_SWAP=n build]
    [akpm@linux-foundation.org: ksm.c needs memcontrol.h, per Michal]
    [hughd@google.com: stop oops in mem_cgroup_reset_owner()]
    [hughd@google.com: fix page migration to reset_owner]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Miklos Szeredi
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Ying Han
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • There are multiple places which need to get the swap_cgroup address, so
    add a helper function:

    static struct swap_cgroup *swap_cgroup_getsc(swp_entry_t ent,
                                                 struct swap_cgroup_ctrl **ctrl);

    to simplify the code.

    Signed-off-by: Bob Liu
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
     
  • Signed-off-by: Johannes Weiner
    Acked-by: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Balbir Singh
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    In split_huge_page(), mem_cgroup_split_huge_fixup() is called to handle
    page_cgroup modifications. It takes move_lock_page_cgroup(), modifies
    page_cgroup and LRU accounting, and is called HPAGE_PMD_SIZE - 1 times.

    But thinking again:
    - compound_lock() is held at move_account time, so it's not necessary
    to take move_lock_page_cgroup().
    - The LRU is locked and all tail pages will go onto the same LRU the
    head is now on.
    - page_cgroup is contiguous in the huge page range.

    This patch changes mem_cgroup_split_huge_fixup() to be called once per
    hugepage, reducing the cost of splitting.

    [akpm@linux-foundation.org: fix typo, per Michal]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Andrea Arcangeli
    Reviewed-by: Michal Hocko
    Cc: Balbir Singh
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • To find the page corresponding to a certain page_cgroup, the pc->flags
    encoded the node or section ID with the base array to compare the pc
    pointer to.

    Now that the per-memory cgroup LRU lists link page descriptors directly,
    there is no longer any code that knows the struct page_cgroup of a PFN
    but not the struct page.

    [hughd@google.com: remove unused node/section info from pc->flags fix]
    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: "Kirill A. Shutemov"
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Now that all code that operated on global per-zone LRU lists is
    converted to operate on per-memory cgroup LRU lists instead, there is no
    reason to keep the double-LRU scheme around any longer.

    The pc->lru member is removed and page->lru is linked directly to the
    per-memory cgroup LRU lists, which removes two pointers from a
    descriptor that exists for every page frame in the system.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Hugh Dickins
    Signed-off-by: Ying Han
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    Having a unified structure with an LRU list set for both global zones and
    per-memcg zones allows the code that deals with LRU lists, and does not
    care about the container itself, to be kept simple.

    Once the per-memcg LRU lists directly link struct pages, the isolation
    function and all other list manipulations are shared between the memcg
    case and the global LRU case.

    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Memory cgroup limit reclaim and traditional global pressure reclaim will
    soon share the same code to reclaim from a hierarchical tree of memory
    cgroups.

    In preparation of this, move the two right next to each other in
    shrink_zone().

    The mem_cgroup_hierarchical_reclaim() polymath is split into a soft
    limit reclaim function, which still does hierarchy walking on its own,
    and a limit (shrinking) reclaim function, which relies on generic
    reclaim code to walk the hierarchy.

    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    Commit ef6a3c6311 ("mm: add replace_page_cache_page() function") added a
    function replace_page_cache_page(). This function replaces a page in the
    radix-tree with a new page. When doing this, the memory cgroup needs to
    fix up the accounting information: memcg needs to check the PCG_USED bit,
    etc.

    In some (many?) cases, 'newpage' is on the LRU before calling
    replace_page_cache(). So, memcg's LRU accounting information should be
    fixed, too.

    This patch adds mem_cgroup_replace_page_cache() and removes the old hooks.
    In that function, the old page is unaccounted without touching
    res_counter and the new page is accounted to the memcg (of the old page).
    When overwriting pc->mem_cgroup of the new page, take zone->lru_lock to
    avoid races with LRU handling.

    Background:
    replace_page_cache_page() is called by the FUSE code in its splice()
    handling. Here, 'newpage' is replacing oldpage, but this newpage is not a
    newly allocated page and may be on the LRU. LRU mis-accounting is critical
    for the memory cgroup because rmdir() checks that the whole LRU is empty
    and that there is no accounting leak. If a page is on a different LRU than
    it should be, rmdir() will fail.

    This bug was added in March 2011, but there has been no bug report yet. I
    guess there are not many people who use memcg and FUSE at the same time
    with upstream kernels.

    The result of this bug is that an admin cannot destroy a memcg because of
    the accounting leak. So, no panic, no deadlock. And, even if an active
    cgroup exists, umount can succeed. So there is no problem at shutdown.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Miklos Szeredi
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • The current epoll code can be tickled to run basically indefinitely in
    both loop detection path check (on ep_insert()), and in the wakeup paths.
    The programs that tickle this behavior set up deeply linked networks of
    epoll file descriptors that cause the epoll algorithms to traverse them
    indefinitely. A couple of these sample programs have been previously
    posted in this thread: https://lkml.org/lkml/2011/2/25/297.

    To fix the loop detection path check algorithms, I simply keep track of
    the epoll nodes that have already been visited. Thus, the loop detection
    becomes proportional to the number of epoll file descriptors and links.
    This dramatically decreases the run-time of the loop check algorithm. In
    one diabolical case I tried, it reduced the run-time from 15 minutes (all
    in kernel time) to .3 seconds.

    Fixing the wakeup paths could be done at wakeup time in a similar manner
    by keeping track of nodes that have already been visited, but the
    complexity is harder, since there can be multiple wakeups on different
    CPUs. Thus, I've opted to limit the number of possible wakeup paths when
    the paths are created.

    This is accomplished by noting that the end file descriptors found during
    the loop detection pass (from the newly added link) are actually the
    sources for wakeup events. I keep a list of these file descriptors and
    limit the number and length of the paths that emanate from these 'source
    file descriptors'. In the current implementation I allow 1000 paths of
    length 1, 500 of length 2, 100 of length 3, 50 of length 4 and 10 of
    length 5. Note that it is sufficient to check the 'source file
    descriptors' reachable from the newly added link, since no other 'source
    file descriptors' will have newly added links. This allows us to check
    only the wakeup paths that may have gotten too long, and not re-check all
    possible wakeup paths on the system.

    In terms of the path limit selection, I think it's first worth noting that
    the most common case for epoll is probably the model where you have 1
    epoll file descriptor that is monitoring n 'source file descriptors'. In
    this case, each 'source file descriptor' has 1 path of length 1. Thus, I
    believe that the limits I'm proposing are quite reasonable and in fact may
    be too generous; I'm hoping that the proposed limits will not cause any
    workloads that currently work to fail.

    In terms of locking, I have extended the use of the 'epmutex' to all
    epoll_ctl add and remove operations. Currently it's only used in a subset
    of the add paths. I need to hold the epmutex so that we can correctly
    traverse a coherent graph to check the number of paths. I believe that
    this additional locking is probably ok, since it's in the setup/teardown
    paths and doesn't affect the running paths, but it certainly is going to
    add some extra overhead. Also worth noting is that the epmutex was
    recently added to the ep_ctl add operations in the initial path loop
    detection code, using the argument that it was not on a critical path.

    Another thing to note here, is the length of epoll chains that is allowed.
    Currently, eventpoll.c defines:

    /* Maximum number of nesting allowed inside epoll sets */
    #define EP_MAX_NESTS 4

    This basically means that I am limited to a graph depth of 5 (EP_MAX_NESTS
    + 1). However, this limit is currently only enforced during the loop
    check detection code, and only when the epoll file descriptors are added
    in a certain order. Thus, this limit is currently easily bypassed. The
    newly added check for wakeup paths strictly limits the wakeup paths to a
    length of 5, regardless of the order in which ep's are linked together.
    Thus, a side-effect of the new code is a more consistent enforcement of
    the graph depth.

    Thus far, I've tested this using the sample programs previously
    mentioned, which now either return quickly or return -EINVAL. I've also
    tested using the piptest.c epoll tester, which showed no difference in
    performance. I've also created a number of different epoll networks and
    tested that they behave as expected. (A minimal nesting example follows
    this entry.)

    I believe this solves the original diabolical test cases, while still
    preserving the sane epoll nesting.

    Signed-off-by: Jason Baron
    Cc: Nelson Elhage
    Cc: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Baron
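
    A minimal userspace illustration of the nesting being bounded here
    (assumes Linux epoll; error handling trimmed). An epoll fd is itself
    pollable, so it can be added to another epoll set; creating a cycle is
    rejected with ELOOP, and with this patch overly long or branchy wakeup
    paths are refused as well.

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/epoll.h>

    int main(void)
    {
            int outer = epoll_create1(0);
            int inner = epoll_create1(0);
            struct epoll_event ev = { .events = EPOLLIN, .data.fd = inner };

            /* inner becomes a wakeup source for outer: one path of length 1 */
            if (epoll_ctl(outer, EPOLL_CTL_ADD, inner, &ev) == -1)
                    perror("add inner to outer");

            /* linking back the other way would form a loop */
            ev.data.fd = outer;
            if (epoll_ctl(inner, EPOLL_CTL_ADD, outer, &ev) == -1)
                    perror("add outer to inner (expected ELOOP)");

            close(inner);
            close(outer);
            return 0;
    }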
     
  • While implementing cmpxchg_double() on s390 I realized that we don't set
    CONFIG_CMPXCHG_LOCAL despite the fact that we have support for it.

    However setting that option will increase the size of struct page by
    eight bytes on 64 bit, which we certainly do not want. Also, it doesn't
    make sense that a present cpu feature should increase the size of struct
    page.

    Besides that, it looks like the dependency on CMPXCHG_LOCAL is wrong and
    that it should depend on CMPXCHG_DOUBLE instead.

    This patch:

    If an architecture supports CMPXCHG_LOCAL, this shouldn't automatically
    result in larger struct pages when the SLUB allocator is used. Instead,
    introduce a new config option, "HAVE_ALIGNED_STRUCT_PAGE", which can be
    selected if a double-word-aligned struct page is required. Also update the
    x86 Kconfig so that it works as before.

    Signed-off-by: Heiko Carstens
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
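
    The shape of the new option, as I read the description above (a sketch,
    not the exact mm_types.h hunk): architectures select
    HAVE_ALIGNED_STRUCT_PAGE when they want the double-word-aligned struct
    page that SLUB's cmpxchg_double path needs.

    struct page {
            /* ... */
    }
    #ifdef CONFIG_HAVE_ALIGNED_STRUCT_PAGE
            __aligned(2 * sizeof(unsigned long))
    #endif
    ;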
     
  • The uses have been renamed so delete the unused macro.

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches