01 Aug, 2012

40 commits

  • Rename cast6 module to cast6_generic to allow autoloading of optimized
    implementations. Generic functions and s-boxes are exported so that they can
    be used within optimized implementations.

    Signed-off-by: Johannes Goetzfried
    Signed-off-by: Herbert Xu

    Johannes Goetzfried
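
    A minimal sketch of the arrangement described above; the exact symbol names
    are assumptions, not necessarily the ones the patch exports:

        /* cast6_generic.c (sketch): export the reference routines and keep
         * answering requests for "cast6" so the module still autoloads. */
        EXPORT_SYMBOL_GPL(cast6_setkey);
        EXPORT_SYMBOL_GPL(__cast6_encrypt);
        EXPORT_SYMBOL_GPL(__cast6_decrypt);

        MODULE_ALIAS("cast6");    /* "cast6" requests load cast6_generic */

    An optimized module can then reuse the exported key schedule and fall back
    to the generic routines where convenient.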
     
  • This patch adds an x86_64/AVX assembler implementation of the Cast5 block
    cipher. The implementation processes sixteen blocks in parallel (four 4-block
    chunk AVX operations). The table lookups are done in general-purpose registers.
    For small block sizes the functions from the generic module are called. A good
    performance increase is provided for block sizes greater than or equal to 128
    bytes.

    Patch has been tested with tcrypt and automated filesystem tests.

    Tcrypt benchmark results:

    Intel Core i5-2500 CPU (fam:6, model:42, step:7)

    cast5-avx-x86_64 vs. cast5-generic
    64bit key:
    size     ecb-enc  ecb-dec  cbc-enc  cbc-dec  ctr-enc  ctr-dec
    16B      0.99x    0.99x    1.00x    1.00x    1.02x    1.01x
    64B      1.00x    1.00x    0.98x    1.00x    1.01x    1.02x
    256B     2.03x    2.01x    0.95x    2.11x    2.12x    2.13x
    1024B    2.30x    2.24x    0.95x    2.29x    2.35x    2.35x
    8192B    2.31x    2.27x    0.95x    2.31x    2.39x    2.39x

    128bit key:
    size     ecb-enc  ecb-dec  cbc-enc  cbc-dec  ctr-enc  ctr-dec
    16B      0.99x    0.99x    1.00x    1.00x    1.01x    1.01x
    64B      1.00x    1.00x    0.98x    1.01x    1.02x    1.01x
    256B     2.17x    2.13x    0.96x    2.19x    2.19x    2.19x
    1024B    2.29x    2.32x    0.95x    2.34x    2.37x    2.38x
    8192B    2.35x    2.32x    0.95x    2.35x    2.39x    2.39x

    Signed-off-by: Johannes Goetzfried
    Signed-off-by: Herbert Xu

    Johannes Goetzfried
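
    A rough sketch of the wide/serial split described above, for the ECB path;
    the function names (the 16-way assembler stub in particular) are
    illustrative, not the actual glue code:

        #define CAST5_BLOCK_SIZE      8
        #define CAST5_PARALLEL_BLOCKS 16

        /* hypothetical prototype of the AVX assembler routine */
        asmlinkage void cast5_enc_blk_16way(struct cast5_ctx *ctx, u8 *dst,
                                            const u8 *src);

        static void ecb_encrypt_chunk(struct cast5_ctx *ctx, u8 *dst,
                                      const u8 *src, unsigned int nbytes)
        {
            /* Wide path: sixteen 8-byte blocks per call into the AVX code. */
            while (nbytes >= CAST5_PARALLEL_BLOCKS * CAST5_BLOCK_SIZE) {
                cast5_enc_blk_16way(ctx, dst, src);
                src    += CAST5_PARALLEL_BLOCKS * CAST5_BLOCK_SIZE;
                dst    += CAST5_PARALLEL_BLOCKS * CAST5_BLOCK_SIZE;
                nbytes -= CAST5_PARALLEL_BLOCKS * CAST5_BLOCK_SIZE;
            }

            /* Remaining blocks go through the exported generic routine. */
            while (nbytes >= CAST5_BLOCK_SIZE) {
                __cast5_encrypt(ctx, dst, src);
                src    += CAST5_BLOCK_SIZE;
                dst    += CAST5_BLOCK_SIZE;
                nbytes -= CAST5_BLOCK_SIZE;
            }
        }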
     
  • New ECB, CBC and CTR test vectors for cast5. We need larger test vectors to
    check the parallel code paths in the optimized implementation. Tests have also
    been added to the tcrypt module.

    Signed-off-by: Johannes Goetzfried
    Signed-off-by: Herbert Xu

    Johannes Goetzfried
     
  • Rename cast5 module to cast5_generic to allow autoloading of optimized
    implementations. Generic functions and s-boxes are exported so that they can
    be used within optimized implementations.

    Signed-off-by: Johannes Goetzfried
    Signed-off-by: Herbert Xu

    Johannes Goetzfried
     
  • Initialization of cra_list is currently mixed, most ciphers initialize this
    field and most shashes do not. Initialization however is not needed at all
    since cra_list is initialized/overwritten in __crypto_register_alg() with
    list_add(). Therefore perform cleanup to remove all unneeded initializations
    of this field in 'arch/s390/crypto/'.

    Cc: Gerald Schaefer
    Cc: linux-s390@vger.kernel.org
    Signed-off-by: Jussi Kivilinna
    Signed-off-by: Jan Glauber
    Signed-off-by: Herbert Xu

    Jussi Kivilinna
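
    For illustration only (the alg shown is made up): __crypto_register_alg()
    puts the algorithm on the global list itself with list_add(), overwriting
    whatever cra_list was set to, so definitions like this need no initializer
    for that field.

        static struct crypto_alg example_alg = {
            .cra_name        = "example",
            .cra_driver_name = "example-generic",
            .cra_priority    = 100,
            .cra_flags       = CRYPTO_ALG_TYPE_CIPHER,
            .cra_blocksize   = 16,
            .cra_ctxsize     = sizeof(struct example_ctx),
            .cra_module      = THIS_MODULE,
            /* .cra_list = LIST_HEAD_INIT(example_alg.cra_list) -- removed:
             * registration does list_add() on cra_list anyway. */
        };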
     
  • Initialization of cra_list is currently mixed, most ciphers initialize this
    field and most shashes do not. Initialization however is not needed at all
    since cra_list is initialized/overwritten in __crypto_register_alg() with
    list_add(). Therefore perform cleanup to remove all unneeded initializations
    of this field in 'drivers/crypto/'.

    Cc: Benjamin Herrenschmidt
    Cc: linux-geode@lists.infradead.org
    Cc: Michal Ludvig
    Cc: Dmitry Kasatkin
    Cc: Varun Wadekar
    Cc: Eric Bénard
    Signed-off-by: Jussi Kivilinna
    Acked-by: Kent Yoder
    Acked-by: Vladimir Zapolskiy
    Signed-off-by: Herbert Xu

    Jussi Kivilinna
     
  • Initialization of cra_list is currently mixed, most ciphers initialize this
    field and most shashes do not. Initialization however is not needed at all
    since cra_list is initialized/overwritten in __crypto_register_alg() with
    list_add(). Therefore perform cleanup to remove all unneeded initializations
    of this field in 'arch/x86/crypto/'.

    Signed-off-by: Jussi Kivilinna
    Signed-off-by: Herbert Xu

    Jussi Kivilinna
     
  • Initialization of cra_list is currently mixed, most ciphers initialize this
    field and most shashes do not. Initialization however is not needed at all
    since cra_list is initialized/overwritten in __crypto_register_alg() with
    list_add(). Therefore perform cleanup to remove all unneeded initializations
    of this field in 'crypto/'.

    Signed-off-by: Jussi Kivilinna
    Signed-off-by: Herbert Xu

    Jussi Kivilinna
     
  • Combine all shash algs to be registered and use new crypto_[un]register_shashes
    functions. This simplifies init/exit code.

    Signed-off-by: Jussi Kivilinna
    Signed-off-by: Herbert Xu

    Jussi Kivilinna
     
  • Combine all shash algs to be registered and use new crypto_[un]register_shashes
    functions. This simplifies init/exit code.

    Signed-off-by: Jussi Kivilinna
    Signed-off-by: Herbert Xu

    Jussi Kivilinna
     
  • Combine all shash algs to be registered and use new crypto_[un]register_shashes
    functions. This simplifies init/exit code.

    Signed-off-by: Jussi Kivilinna
    Signed-off-by: Herbert Xu

    Jussi Kivilinna
     
  • Combine all shash algs to be registered and use new crypto_[un]register_shashes
    functions. This simplifies init/exit code.

    Signed-off-by: Jussi Kivilinna
    Signed-off-by: Herbert Xu

    Jussi Kivilinna
     
  • Add crypto_[un]register_shashes() to allow simplifying init/exit code of shash
    crypto modules that register multiple algorithms.

    Signed-off-by: Jussi Kivilinna
    Signed-off-by: Herbert Xu

    Jussi Kivilinna
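
    A sketch of the init/exit simplification the new helpers allow; module and
    function names are illustrative and the algorithm array contents are elided:

        static struct shash_alg sha_algs[] = {
            { /* first shash_alg definition */ },
            { /* second shash_alg definition */ },
        };

        static int __init sha_mod_init(void)
        {
            /* one call instead of a register/unwind sequence per algorithm */
            return crypto_register_shashes(sha_algs, ARRAY_SIZE(sha_algs));
        }

        static void __exit sha_mod_fini(void)
        {
            crypto_unregister_shashes(sha_algs, ARRAY_SIZE(sha_algs));
        }

        module_init(sha_mod_init);
        module_exit(sha_mod_fini);

    The crypto_[un]register_algs() conversions below follow the same pattern for
    non-shash drivers.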
     
  • Combine all crypto_alg to be registered and use new crypto_[un]register_algs
    functions. This simplifies init/exit code.

    Cc: Neil Horman
    Signed-off-by: Jussi Kivilinna
    Signed-off-by: Herbert Xu

    Jussi Kivilinna
     
  • Combine all crypto_alg to be registered and use new crypto_[un]register_algs
    functions. This simplifies init/exit code.

    Signed-off-by: Jussi Kivilinna
    Signed-off-by: Herbert Xu

    Jussi Kivilinna
     
  • Combine all crypto_alg to be registered and use new crypto_[un]register_algs
    functions. This simplifies init/exit code.

    Signed-off-by: Jussi Kivilinna
    Signed-off-by: Herbert Xu

    Jussi Kivilinna
     
  • Combine all crypto_alg to be registered and use new crypto_[un]register_algs
    functions. This simplifies init/exit code.

    Signed-off-by: Jussi Kivilinna
    Signed-off-by: Herbert Xu

    Jussi Kivilinna
     
  • Combine all crypto_alg to be registered and use new crypto_[un]register_algs
    functions. This simplifies init/exit code.

    Signed-off-by: Jussi Kivilinna
    Signed-off-by: Herbert Xu

    Jussi Kivilinna
     
  • Pull irqdomain changes from Grant Likely:
    "Round of refactoring and enhancements to irq_domain infrastructure.
    This series starts the process of simplifying irqdomain. The ultimate
    goal is to merge LEGACY, LINEAR and TREE mappings into a single
    system, but had to back off from that after some last minute bugs.
    Instead it mainly reorganizes the code and ensures that the reverse
    map gets populated when the irq is mapped instead of the first time it
    is looked up.

    Merging of the irq_domain types is deferred to v3.7

    In other news, this series adds helpers for creating static mappings
    on a linear or tree mapping."

    * tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6:
    irqdomain: Improve diagnostics when a domain mapping fails
    irqdomain: eliminate slow-path revmap lookups
    irqdomain: Fix irq_create_direct_mapping() to test irq_domain type.
    irqdomain: Eliminate dedicated radix lookup functions
    irqdomain: Support for static IRQ mapping and association.
    irqdomain: Always update revmap when setting up a virq
    irqdomain: Split disassociating code into separate function
    irq_domain: correct a minor wrong comment for linear revmap
    irq_domain: Standardise legacy/linear domain selection
    irqdomain: Make ops->map hook optional
    irqdomain: Remove unnecessary test for IRQ_DOMAIN_MAP_LEGACY
    irqdomain: Simple NUMA awareness.
    devicetree: add helper inline for retrieving a node's full name

    Linus Torvalds
     
  • Merge Andrew's second set of patches:
    - MM
    - a few random fixes
    - a couple of RTC leftovers

    * emailed patches from Andrew Morton : (120 commits)
    rtc/rtc-88pm80x: remove unneed devm_kfree
    rtc/rtc-88pm80x: assign ret only when rtc_register_driver fails
    mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables
    tmpfs: distribute interleave better across nodes
    mm: remove redundant initialization
    mm: warn if pg_data_t isn't initialized with zero
    mips: zero out pg_data_t when it's allocated
    memcg: fix memory accounting scalability in shrink_page_list
    mm/sparse: remove index_init_lock
    mm/sparse: more checks on mem_section number
    mm/sparse: optimize sparse_index_alloc
    memcg: add mem_cgroup_from_css() helper
    memcg: further prevent OOM with too many dirty pages
    memcg: prevent OOM with too many dirty pages
    mm: mmu_notifier: fix freed page still mapped in secondary MMU
    mm: memcg: only check anon swapin page charges for swap cache
    mm: memcg: only check swap cache pages for repeated charging
    mm: memcg: split swapin charge function into private and public part
    mm: memcg: remove needless !mm fixup to init_mm when charging
    mm: memcg: remove unneeded shmem charge type
    ...

    Linus Torvalds
     
  • Pull VFIO core from Alex Williamson:
    "This series includes the VFIO userspace driver interface for the 3.6
    kernel merge window. This driver is intended to provide a secure
    interface for device access using IOMMU protection for applications
    like assignment of physical devices to virtual machines.

    Qemu will be the first user of this interface, enabling assignment of
    PCI devices to Qemu guests. This interface is intended to eventually
    replace the x86-specific assignment mechanism currently available in
    KVM.

    This interface has the advantage of being more secure, by working with
    IOMMU groups to ensure device isolation and providing its own
    filtered resource access mechanism, and also more flexible, in not
    being x86 or KVM specific (extensions to enable POWER are already
    working).

    This driver is originally the work of Tom Lyon, but has since been
    handed over to me and gone through a complete overhaul thanks to the
    input from David Gibson, Ben Herrenschmidt, Chris Wright, Joerg
    Roedel, and others. This driver has been available in linux-next for
    the last month."

    Paul Mackerras says:
    "I would be glad to see it go in since we want to use it with KVM on
    PowerPC. If possible we'd like the PowerPC bits for it to go in as
    well."

    * tag 'vfio-for-v3.6' of git://github.com/awilliam/linux-vfio:
    vfio: Add PCI device driver
    vfio: Type1 IOMMU implementation
    vfio: Add documentation
    vfio: VFIO core

    Linus Torvalds
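
    A condensed userspace sketch of the container/group/device flow the new
    interface provides; the group number and device address are placeholders:

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/ioctl.h>
        #include <linux/vfio.h>

        int main(void)
        {
            struct vfio_group_status status = { .argsz = sizeof(status) };
            int container, group, device;

            container = open("/dev/vfio/vfio", O_RDWR);
            if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION)
                return 1;                      /* unknown API version */
            if (!ioctl(container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU))
                return 1;                      /* no Type1 IOMMU support */

            group = open("/dev/vfio/26", O_RDWR);  /* IOMMU group of the device */
            ioctl(group, VFIO_GROUP_GET_STATUS, &status);
            if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE))
                return 1;                      /* group not fully bound to vfio */

            ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
            ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

            /* device fd exposes the device's regions and interrupts */
            device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
            printf("device fd: %d\n", device);
            return 0;
        }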
     
  • Pull random subsystem patches from Ted Ts'o:
    "This patch series contains a major revamp of how we collect entropy
    from interrupts for /dev/random and /dev/urandom.

    "The goal is to address weaknesses discussed in the paper "Mining
    your Ps and Qs: Detection of Widespread Weak Keys in Network Devices",
    by Nadia Heninger, Zakir Durumeric, Eric Wustrow, J. Alex Halderman,
    which will be published in the Proceedings of the 21st Usenix Security
    Symposium, August 2012. (See https://factorable.net for more
    information and an extended version of the paper.)"

    Fix up trivial conflicts due to nearby changes in
    drivers/{mfd/ab3100-core.c, usb/gadget/omap_udc.c}

    * tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random: (33 commits)
    random: mix in architectural randomness in extract_buf()
    dmi: Feed DMI table to /dev/random driver
    random: Add comment to random_initialize()
    random: final removal of IRQF_SAMPLE_RANDOM
    um: remove IRQF_SAMPLE_RANDOM which is now a no-op
    sparc/ldc: remove IRQF_SAMPLE_RANDOM which is now a no-op
    [ARM] pxa: remove IRQF_SAMPLE_RANDOM which is now a no-op
    board-palmz71: remove IRQF_SAMPLE_RANDOM which is now a no-op
    isp1301_omap: remove IRQF_SAMPLE_RANDOM which is now a no-op
    pxa25x_udc: remove IRQF_SAMPLE_RANDOM which is now a no-op
    omap_udc: remove IRQF_SAMPLE_RANDOM which is now a no-op
    goku_udc: remove IRQF_SAMPLE_RANDOM which was commented out
    uartlite: remove IRQF_SAMPLE_RANDOM which is now a no-op
    drivers: hv: remove IRQF_SAMPLE_RANDOM which is now a no-op
    xen-blkfront: remove IRQF_SAMPLE_RANDOM which is now a no-op
    n2_crypto: remove IRQF_SAMPLE_RANDOM which is now a no-op
    pda_power: remove IRQF_SAMPLE_RANDOM which is now a no-op
    i2c-pmcmsp: remove IRQF_SAMPLE_RANDOM which is now a no-op
    input/serio/hp_sdc.c: remove IRQF_SAMPLE_RANDOM which is now a no-op
    mfd: remove IRQF_SAMPLE_RANDOM which is now a no-op
    ...

    Linus Torvalds
     
  • Pull final RDMA changes from Roland Dreier:
    - Fix IPoIB to stop using unsafe linkage between networking neighbour
    layer and private path database.
    - Small fixes for bugs found by Fengguang Wu's automated builds.

    * tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
    IPoIB: Use a private hash table for path lookup in xmit path
    IB/qib: Fix size of cc_supported_table_entries
    RDMA/ucma: Convert open-coded equivalent to memdup_user()
    RDMA/ocrdma: Fix check of GSI CQs
    RDMA/cma: Use PTR_RET rather than if (IS_ERR(...)) + PTR_ERR

    Linus Torvalds
     
  • Pull second set of media updates from Mauro Carvalho Chehab:

    - radio API: add support to work with radio frequency bands

    - new AM/FM radio drivers: radio-shark, radio-shark2

    - new Remote Controller USB driver: iguanair

    - conversion of several drivers to the v4l2 core control framework

    - new board additions at existing drivers

    - the remaining (and vast majority of the patches) are due to
    drivers/DocBook fixes/cleanups.

    * 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (154 commits)
    [media] radio-tea5777: use library for 64bits div
    [media] tlg2300: Declare MODULE_FIRMWARE usage
    [media] lgs8gxx: Declare MODULE_FIRMWARE usage
    [media] xc5000: Add MODULE_FIRMWARE statements
    [media] s2255drv: Add MODULE_FIRMWARE statement
    [media] dib8000: move dereference after check for NULL
    [media] Documentation: Update cardlists
    [media] bttv: add support for Aposonic W-DVR
    [media] cx25821: Remove bad strcpy to read-only char*
    [media] pms.c: remove duplicated include
    [media] smiapp-core.c: remove duplicated include
    [media] via-camera: pass correct format settings to sensor
    [media] rtl2832.c: minor cleanup
    [media] Add support for the IguanaWorks USB IR Transceiver
    [media] Minor cleanups for MCE USB
    [media] drivers/media/dvb/siano/smscoreapi.c: use list_for_each_entry
    [media] Use a named union in struct v4l2_ioctl_info
    [media] mceusb: Add Twisted Melon USB IDs
    [media] staging/media/solo6x10: use module_pci_driver macro
    [media] staging/media/dt3155v4l: use module_pci_driver macro
    ...

    Conflicts:
    Documentation/feature-removal-schedule.txt

    Linus Torvalds
     
  • Pull second wave of NFS client updates from Trond Myklebust:

    - Patches from Bryan to allow splitting of the NFSv2/v3/v4 code into
    separate modules.

    - Fix Oopses in the NFSv4 idmapper

    - Fix a deadlock whereby rpciod tries to allocate a new socket and ends
    up recursing into the NFS code due to memory reclaim.

    - Increase the number of permitted callback connections.

    * tag 'nfs-for-3.6-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
    nfs: explicitly reject LOCK_MAND flock() requests
    nfs: increase number of permitted callback connections.
    SUNRPC: return negative value in case rpcbind client creation error
    NFS: Convert v4 into a module
    NFS: Convert v3 into a module
    NFS: Convert v2 into a module
    NFS: Keep module parameters in the generic NFS client
    NFS: Split out remaining NFS v4 inode functions
    NFS: Pass super operations and xattr handlers in the nfs_subversion
    NFS: Only initialize the ACL client in the v3 case
    NFS: Create a try_mount rpc op
    NFS: Remove the NFS v4 xdev mount function
    NFS: Add version registering framework
    NFS: Fix a number of bugs in the idmapper
    nfs: skip commit in releasepage if we're freeing memory for fs-related reasons
    sunrpc: clarify comments on rpc_make_runnable
    pnfsblock: bail out partial page IO

    Linus Torvalds
     
  • Pull networking update from David S. Miller:
    "I think Eric Dumazet and I have dealt with all of the known routing
    cache removal fallout. Some other minor fixes all around.

    1) Fix RCU of cached routes, particular of output routes which require
    liberation via call_rcu() instead of call_rcu_bh(). From Eric
    Dumazet.

    2) Make sure we purge net device references in cached routes properly.

    3) TG3 driver bug fixes from Michael Chan.

    4) Fix reported 'expires' value in ipv6 routes, from Li Wei.

    5) TUN driver ioctl leaks kernel bytes to userspace, from Mathias
    Krause."

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (22 commits)
    ipv4: Properly purge netdev references on uncached routes.
    ipv4: Cache routes in nexthop exception entries.
    ipv4: percpu nh_rth_output cache
    ipv4: Restore old dst_free() behavior.
    bridge: make port attributes const
    ipv4: remove rt_cache_rebuild_count
    net: ipv4: fix RCU races on dst refcounts
    net: TCP early demux cleanup
    tun: Fix formatting.
    net/tun: fix ioctl() based info leaks
    tg3: Update version to 3.124
    tg3: Fix race condition in tg3_get_stats64()
    tg3: Add New 5719 Read DMA workaround
    tg3: Fix Read DMA workaround for 5719 A0.
    tg3: Request APE_LOCK_PHY before PHY access
    ipv6: fix incorrect route 'expires' value passed to userspace
    mISDN: Bugfix only few bytes are transfered on a connection
    seeq: use PTR_RET at init_module of driver
    bnx2x: remove cast around the kmalloc in bnx2x_prev_mark_path
    ipv4: clean up put_child
    ...

    Linus Torvalds
     
  • devm_kzalloc() doesn't need a matching devm_kfree(); the freeing mechanism
    will trigger when the driver unloads.

    Signed-off-by: Devendra Naga
    Cc: Alessandro Zummo
    Cc: Ashish Jangam
    Cc: David Dajun Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Devendra Naga
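
    A small sketch of the point being made (driver and struct names made up):
    memory obtained with devm_kzalloc() is released by the devres core when the
    device is unbound, so remove() has nothing to free.

        static int example_probe(struct platform_device *pdev)
        {
            struct example_priv *priv;

            priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL);
            if (!priv)
                return -ENOMEM;

            platform_set_drvdata(pdev, priv);
            return 0;
        }

        static int example_remove(struct platform_device *pdev)
        {
            /* no devm_kfree() needed: devres frees priv automatically */
            return 0;
        }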
     
  • At probe time, ret was being assigned the return value of PTR_ERR() right
    after the rtc_register_driver() call; assign it inside the if (IS_ERR(ptr))
    check instead, since that branch is only taken when the function has actually
    failed.

    Signed-off-by: Devendra Naga
    Cc: Alessandro Zummo
    Cc: Ashish Jangam
    Cc: David Dajun Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Devendra Naga
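
    The resulting error-path pattern, sketched with illustrative names; ret
    picks up PTR_ERR() only inside the IS_ERR() branch, i.e. only on failure:

        rtc = rtc_device_register("example-rtc", &pdev->dev, &example_rtc_ops,
                                  THIS_MODULE);
        if (IS_ERR(rtc)) {
            ret = PTR_ERR(rtc);            /* assigned only on failure */
            dev_err(&pdev->dev, "failed to register RTC device: %d\n", ret);
            goto out;
        }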
     
  • If a process creates a large hugetlbfs mapping that is eligible for page
    table sharing and forks heavily with children some of whom fault and
    others which destroy the mapping then it is possible for page tables to
    get corrupted. Some teardowns of the mapping encounter a "bad pmd" and
    output a message to the kernel log. The final teardown will trigger a
    BUG_ON in mm/filemap.c.

    This was reproduced in 3.4 but is known to have existed for a long time
    and goes back at least as far as 2.6.37. It was probably introduced in 2.6.20
    by [39dde65c: shared page table for hugetlb page]. The messages look like
    this:

    [ ..........] Lots of bad pmd messages followed by this
    [ 127.164256] mm/memory.c:391: bad pmd ffff880412e04fe8(80000003de4000e7).
    [ 127.164257] mm/memory.c:391: bad pmd ffff880412e04ff0(80000003de6000e7).
    [ 127.164258] mm/memory.c:391: bad pmd ffff880412e04ff8(80000003de0000e7).
    [ 127.186778] ------------[ cut here ]------------
    [ 127.186781] kernel BUG at mm/filemap.c:134!
    [ 127.186782] invalid opcode: 0000 [#1] SMP
    [ 127.186783] CPU 7
    [ 127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod
    [ 127.186801]
    [ 127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc. OptiPlex 990/06D7TR
    [ 127.186804] RIP: 0010:[] [] __delete_from_page_cache+0x15e/0x160
    [ 127.186809] RSP: 0000:ffff8804144b5c08 EFLAGS: 00010002
    [ 127.186810] RAX: 0000000000000001 RBX: ffffea000a5c9000 RCX: 00000000ffffffc0
    [ 127.186811] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff88042dfdad00
    [ 127.186812] RBP: ffff8804144b5c18 R08: 0000000000000009 R09: 0000000000000003
    [ 127.186813] R10: 0000000000000000 R11: 000000000000002d R12: ffff880412ff83d8
    [ 127.186814] R13: ffff880412ff83d8 R14: 0000000000000000 R15: ffff880412ff83d8
    [ 127.186815] FS: 00007fe18ed2c700(0000) GS:ffff88042dce0000(0000) knlGS:0000000000000000
    [ 127.186816] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [ 127.186817] CR2: 00007fe340000503 CR3: 0000000417a14000 CR4: 00000000000407e0
    [ 127.186818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 127.186819] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    [ 127.186820] Process hugetlbfs-test (pid: 9017, threadinfo ffff8804144b4000, task ffff880417f803c0)
    [ 127.186821] Stack:
    [ 127.186822] ffffea000a5c9000 0000000000000000 ffff8804144b5c48 ffffffff810ed83b
    [ 127.186824] ffff8804144b5c48 000000000000138a 0000000000001387 ffff8804144b5c98
    [ 127.186825] ffff8804144b5d48 ffffffff811bc925 ffff8804144b5cb8 0000000000000000
    [ 127.186827] Call Trace:
    [ 127.186829] [] delete_from_page_cache+0x3b/0x80
    [ 127.186832] [] truncate_hugepages+0x115/0x220
    [ 127.186834] [] hugetlbfs_evict_inode+0x13/0x30
    [ 127.186837] [] evict+0xa7/0x1b0
    [ 127.186839] [] iput_final+0xd3/0x1f0
    [ 127.186840] [] iput+0x39/0x50
    [ 127.186842] [] d_kill+0xf8/0x130
    [ 127.186843] [] dput+0xd2/0x1a0
    [ 127.186845] [] __fput+0x170/0x230
    [ 127.186848] [] ? rb_erase+0xce/0x150
    [ 127.186849] [] fput+0x1d/0x30
    [ 127.186851] [] remove_vma+0x37/0x80
    [ 127.186853] [] do_munmap+0x2d2/0x360
    [ 127.186855] [] sys_shmdt+0xc9/0x170
    [ 127.186857] [] system_call_fastpath+0x16/0x1b
    [ 127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0
    [ 127.186868] RIP [] __delete_from_page_cache+0x15e/0x160
    [ 127.186870] RSP
    [ 127.186871] ---[ end trace 7cbac5d1db69f426 ]---

    The bug is a race and not always easy to reproduce. To reproduce it I was
    doing the following on a single-socket i7-based machine with 16G of RAM.

    $ hugeadm --pool-pages-max DEFAULT:13G
    $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax
    $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall
    $ for i in `seq 1 9000`; do ./hugetlbfs-test; done

    On my particular machine, it usually triggers within 10 minutes but
    enabling debug options can change the timing such that it never hits.
    Once the bug is triggered, the machine is in trouble and needs to be
    rebooted. The machine will respond but processes accessing proc like "ps
    aux" will hang due to the BUG_ON. shutdown will also hang and needs a
    hard reset or a sysrq-b.

    The basic problem is a race between page table sharing and teardown. For
    the most part page table sharing depends on i_mmap_mutex. In some cases,
    it is also taking the mm->page_table_lock for the PTE updates but with
    shared page tables, it is the i_mmap_mutex that is more important.

    Unfortunately it also appears to be insufficient. Consider the following
    situation:

    Process A                               Process B
    ---------                               ---------
    hugetlb_fault                           shmdt
                                            LockWrite(mmap_sem)
                                              do_munmap
                                                unmap_region
                                                  unmap_vmas
                                                    unmap_single_vma
                                                      unmap_hugepage_range
                                                        Lock(i_mmap_mutex)
                                                        Lock(mm->page_table_lock)
                                                        huge_pmd_unshare/unmap tables
                                                        Unlock(mm->page_table_lock)
                                                        Unlock(i_mmap_mutex)
    huge_pte_alloc                          ...
      Lock(i_mmap_mutex)                    ...
      vma_prio_walk, find svma, spte        ...
      Lock(mm->page_table_lock)             ...
      share spte                            ...
      Unlock(mm->page_table_lock)           ...
      Unlock(i_mmap_mutex)                  ...
    hugetlb_no_page

        for (...; a < end; a += 4096)
            *a = 0;
    }

    int main(int argc, char **argv)
    {
        key_t key = IPC_PRIVATE;
        size_t sizeA = nr_huge_page_A * huge_page_size;
        size_t sizeB = nr_huge_page_B * huge_page_size;
        int shmidA, shmidB;
        void *addrA = NULL, *addrB = NULL;
        int nr_children = 300, n = 0;

        if ((shmidA = shmget(key, sizeA, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
            perror("shmget:");
            return 1;
        }

        if ((addrA = shmat(shmidA, addrA, SHM_R|SHM_W)) == (void *)-1UL) {
            perror("shmat");
            return 1;
        }

        if ((shmidB = shmget(key, sizeB, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
            perror("shmget:");
            return 1;
        }

        if ((addrB = shmat(shmidB, addrB, SHM_R|SHM_W)) == (void *)-1UL) {
            perror("shmat");
            return 1;
        }

    fork_child:
        switch (fork()) {
        case 0:
            switch (n%3) {
            case 0:
                play(addrA, sizeA);
                break;
            case 1:
                play(addrB, sizeB);
                break;
            case 2:
                break;
            }
            break;
        case -1:
            perror("fork:");
            break;
        default:
            if (++n < nr_children)
                goto fork_child;
            play(addrA, sizeA);
            break;
        }

        shmdt(addrA);
        shmdt(addrB);

        do {
            wait(NULL);
        } while (--n > 0);

        shmctl(shmidA, IPC_RMID, NULL);
        shmctl(shmidB, IPC_RMID, NULL);
        return 0;
    }

    [akpm@linux-foundation.org: name the declaration's args, fix CONFIG_HUGETLBFS=n build]
    Signed-off-by: Hugh Dickins
    Reviewed-by: Michal Hocko
    Signed-off-by: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When tmpfs has the interleave memory policy, it always starts allocating
    for each file from node 0 at offset 0. When there are many small files,
    the lower nodes fill up disproportionately.

    This patch spreads out node usage by starting files at nodes other than 0,
    by using the inode number to bias the starting node for interleave.

    Signed-off-by: Nathan Zimmer
    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: Nick Piggin
    Cc: Lee Schermerhorn
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nathan Zimmer
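
    Illustrative arithmetic only, not the actual shmem code: offsetting the
    interleave index by the inode number makes each file start its round-robin
    at a different node instead of all files piling onto node 0.

        /* hypothetical helper: bias the interleave position per file */
        static unsigned long shmem_interleave_index(struct inode *inode,
                                                    pgoff_t index)
        {
            return (unsigned long)index + inode->i_ino;
        }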
     
  • pg_data_t is zeroed before reaching free_area_init_core(), so remove the
    now unnecessary initializations.

    Signed-off-by: Minchan Kim
    Cc: Tejun Heo
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Warn if memory-hotplug/boot code doesn't initialize pg_data_t with zero
    when it is allocated. Arch code and memory hotplug already initialize
    pg_data_t, so this warning should never happen. I select fields randomly
    near the beginning, middle and end of pg_data_t for checking.

    This patch isn't for performance but for removing initialization code which
    would otherwise have to be added whenever we add a new field to pg_data_t or
    zone.

    Andrew originally suggested clearing out pg_data_t in the MM core, but Tejun
    doesn't like that because in the future some archs may initialize some fields
    in arch code and pass them into the generic MM code, so blindly clearing it
    out in the MM core would be very annoying.

    Signed-off-by: Minchan Kim
    Cc: Tejun Heo
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • This patch is preparation for the next patch which removes the zeroing of
    the pg_data_t in core MM. All archs except MIPS already do this.

    Signed-off-by: Minchan Kim
    Cc: Ralf Baechle
    Cc: Tejun Heo

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • In a multi-process parallel file-reading benchmark I ran on an 8-socket
    machine, I noticed throughput slowed down by a factor of 8 when I ran the
    benchmark within a cgroup container. I traced the problem to the following
    code path (see below), hit when we are trying to reclaim memory from the
    file cache. The res_counter_uncharge function is called on every page that's
    reclaimed and creates heavy lock contention. The patch below allows the
    reclaimed pages to be uncharged from the resource counter in batch, which
    recovers the regression.

    Tim

    40.67% usemem [kernel.kallsyms] [k] _raw_spin_lock
    |
    --- _raw_spin_lock
    |
    |--92.61%-- res_counter_uncharge
    | |
    | |--100.00%-- __mem_cgroup_uncharge_common
    | | |
    | | |--100.00%-- mem_cgroup_uncharge_cache_page
    | | | __remove_mapping
    | | | shrink_page_list
    | | | shrink_inactive_list
    | | | shrink_mem_cgroup_zone
    | | | shrink_zone
    | | | do_try_to_free_pages
    | | | try_to_free_pages
    | | | __alloc_pages_nodemask
    | | | alloc_pages_current

    Signed-off-by: Tim Chen
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Johannes Weiner
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tim Chen
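
    A sketch of the batching idea only, not the actual shrink_page_list() change,
    and the helper name is made up: bracketing the per-page uncharges with the
    existing mem_cgroup_uncharge_start()/end() helpers lets memcg coalesce the
    res_counter updates instead of taking the counter lock once per page.

        static void uncharge_page_list(struct list_head *pages)
        {
            struct page *page, *next;

            mem_cgroup_uncharge_start();           /* begin batching */
            list_for_each_entry_safe(page, next, pages, lru)
                mem_cgroup_uncharge_cache_page(page);
            mem_cgroup_uncharge_end();             /* flush batched uncharges */
        }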
     
  • sparse_index_init() uses the index_init_lock spinlock to protect root
    mem_section assignment. The lock is not necessary anymore because the
    function is called only during boot (during paging init which is executed
    only from a single CPU) and from the hotplug code (by add_memory() via
    arch_add_memory()) which uses mem_hotplug_mutex.

    The lock was introduced by 28ae55c9 ("sparsemem extreme: hotplug
    preparation") and sparse_index_init() was used only during boot at that
    time.

    Later when the hotplug code (and add_memory()) was introduced there was no
    synchronization so it was possible to online more sections from the same
    root probably (though I am not 100% sure about that). The first
    synchronization has been added by 6ad696d2 ("mm: allow memory hotplug and
    hibernation in the same kernel") which was later replaced by the
    mem_hotplug_mutex - 20d6c96b ("mem-hotplug: introduce
    {un}lock_memory_hotplug()").

    Let's remove the lock as it is not needed and it makes the code more
    confusing.

    [mhocko@suse.cz: changelog]
    Signed-off-by: Gavin Shan
    Reviewed-by: Michal Hocko
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Shan
     
  • __section_nr() was implemented to retrieve the corresponding memory
    section number according to its descriptor. It's possible that the
    specified memory section descriptor doesn't exist in the global array. So
    add more checking on that and report an error for a wrong case.

    Signed-off-by: Gavin Shan
    Acked-by: David Rientjes
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Shan
     
  • With CONFIG_SPARSEMEM_EXTREME, the two levels of memory section
    descriptors are allocated from slab or bootmem. When allocating from
    slab, let slab/bootmem allocator clear the memory chunk. We needn't clear
    it explicitly.

    Signed-off-by: Gavin Shan
    Reviewed-by: Michal Hocko
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Shan
     
  • Add a mem_cgroup_from_css() helper to replace open-coded invocations of
    container_of(), to clarify the code and add a little more type safety.

    [akpm@linux-foundation.org: fix extensive breakage]
    Signed-off-by: Wanpeng Li
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Gavin Shan
    Cc: Wanpeng Li
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
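
    The helper is presumably just a typed container_of() wrapper along these
    lines:

        static inline struct mem_cgroup *
        mem_cgroup_from_css(struct cgroup_subsys_state *s)
        {
            return container_of(s, struct mem_cgroup, css);
        }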
     
  • The may_enter_fs test turns out to be too restrictive: though I saw no
    problem with it when testing on 3.5-rc6, it very soon OOMed when I tested
    on 3.5-rc6-mm1. I don't know what the difference there is, perhaps I just
    slightly changed the way I started off the testing: dd if=/dev/zero
    of=/mnt/temp bs=1M count=1024; rm -f /mnt/temp; sync repeatedly, in 20M
    memory.limit_in_bytes cgroup to ext4 on USB stick.

    ext4 (and gfs2 and xfs) turn out to allocate new pages for writing with
    AOP_FLAG_NOFS: that seems a little worrying, and it's unclear to me why
    the transaction needs to be started even before allocating pagecache
    memory. But it may not be worth worrying about these days: if direct
    reclaim avoids FS writeback, does __GFP_FS now mean anything?

    Anyway, we insisted on the may_enter_fs test to avoid hangs with the loop
    device; but since that also masks off __GFP_IO, we can test for __GFP_IO
    directly, ignoring may_enter_fs and __GFP_FS.

    But even so, the test still OOMs sometimes: when originally testing on
    3.5-rc6, it OOMed about one time in five or ten; when testing just now on
    3.5-rc6-mm1, it OOMed on the first iteration.

    This residual problem comes from an accumulation of pages under ordinary
    writeback, not marked PageReclaim, so rightly not causing the memcg check
    to wait on their writeback: these too can prevent shrink_page_list() from
    freeing any pages, so many times that memcg reclaim fails and OOMs.

    Deal with these in the same way as direct reclaim now deals with dirty FS
    pages: mark them PageReclaim. It is appropriate to rotate these to tail
    of list when writepage completes, but more importantly, the PageReclaim
    flag makes memcg reclaim wait on them if encountered again. Increment
    NR_VMSCAN_IMMEDIATE? That's arguable: I chose not.

    Setting PageReclaim here may occasionally race with end_page_writeback()
    clearing it: lru_deactivate_fn() already faced the same race, and
    correctly concluded that the window is small and the issue non-critical.

    With these changes, the test runs indefinitely without OOMing on ext4,
    ext3 and ext2: I'll move on to test with other filesystems later.

    Trivia: invert conditions for a clearer block without an else, and goto
    keep_locked to do the unlock_page.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Fengguang Wu
    Acked-by: Michal Hocko
    Cc: Dave Chinner
    Cc: Theodore Ts'o
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The current implementation of dirty pages throttling is not memcg aware
    which makes it easy to have memcg LRUs full of dirty pages. Without
    throttling, these LRUs can be scanned faster than the rate of writeback,
    leading to memcg OOM conditions when the hard limit is small.

    This patch fixes the problem by throttling the allocating process
    (possibly a writer) during the hard limit reclaim by waiting on
    PageReclaim pages. We are waiting only for PageReclaim pages because
    those are the pages that made one full round over LRU and that means that
    the writeback is much slower than scanning.

    The solution is far from being ideal - long term solution is memcg aware
    dirty throttling - but it is meant to be a band aid until we have a real
    fix. We are seeing this happening during nightly backups which are placed
    into containers to prevent from eviction of the real working set.

    The change affects only memcg reclaim, and only when we encounter PageReclaim
    pages, which is a signal that reclaim is not keeping up with the writers and
    somebody should be throttled. This could potentially be unfair because
    somebody else from the group might get throttled on behalf of the writer, but
    as writers need to allocate as well, and allocate at a higher rate, the
    probability that only innocent processes would be penalized is not that high.

    I have tested this change by a simple dd copying /dev/zero to tmpfs or
    ext3 running under small memcg (1G copy under 5M, 60M, 300M and 2G
    containers) and dd got killed by OOM killer every time. With the patch I
    could run the dd with the same size under 5M controller without any OOM.
    The issue is more visible with slower devices for output.

    * With the patch
    ================
    * tmpfs size=2G
    ---------------
    $ vim cgroup_cache_oom_test.sh
    $ ./cgroup_cache_oom_test.sh 5M
    using Limit 5M for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 30.4049 s, 34.5 MB/s
    $ ./cgroup_cache_oom_test.sh 60M
    using Limit 60M for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 31.4561 s, 33.3 MB/s
    $ ./cgroup_cache_oom_test.sh 300M
    using Limit 300M for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 20.4618 s, 51.2 MB/s
    $ ./cgroup_cache_oom_test.sh 2G
    using Limit 2G for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 1.42172 s, 738 MB/s

    * ext3
    ------
    $ ./cgroup_cache_oom_test.sh 5M
    using Limit 5M for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 27.9547 s, 37.5 MB/s
    $ ./cgroup_cache_oom_test.sh 60M
    using Limit 60M for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 30.3221 s, 34.6 MB/s
    $ ./cgroup_cache_oom_test.sh 300M
    using Limit 300M for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 24.5764 s, 42.7 MB/s
    $ ./cgroup_cache_oom_test.sh 2G
    using Limit 2G for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 3.35828 s, 312 MB/s

    * Without the patch
    ===================
    * tmpfs size=2G
    ---------------
    $ ./cgroup_cache_oom_test.sh 5M
    using Limit 5M for group
    ./cgroup_cache_oom_test.sh: line 46: 4668 Killed dd if=/dev/zero of=$OUT/zero bs=1M count=$count
    $ ./cgroup_cache_oom_test.sh 60M
    using Limit 60M for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 25.4989 s, 41.1 MB/s
    $ ./cgroup_cache_oom_test.sh 300M
    using Limit 300M for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 24.3928 s, 43.0 MB/s
    $ ./cgroup_cache_oom_test.sh 2G
    using Limit 2G for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 1.49797 s, 700 MB/s

    * ext3
    ------
    $ ./cgroup_cache_oom_test.sh 5M
    using Limit 5M for group
    ./cgroup_cache_oom_test.sh: line 46: 4689 Killed dd if=/dev/zero of=$OUT/zero bs=1M count=$count
    $ ./cgroup_cache_oom_test.sh 60M
    using Limit 60M for group
    ./cgroup_cache_oom_test.sh: line 46: 4692 Killed dd if=/dev/zero of=$OUT/zero bs=1M count=$count
    $ ./cgroup_cache_oom_test.sh 300M
    using Limit 300M for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 20.248 s, 51.8 MB/s
    $ ./cgroup_cache_oom_test.sh 2G
    using Limit 2G for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 2.85201 s, 368 MB/s

    [akpm@linux-foundation.org: tweak changelog, reordered the test to optimize for CONFIG_CGROUP_MEM_RES_CTLR=n]
    [hughd@google.com: fix deadlock with loop driver]
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Reviewed-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Fengguang Wu
    Signed-off-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
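
    A sketch of the throttling rule from the two memcg writeback entries above;
    the helper and its placement are made up and this is not the exact
    shrink_page_list() code. During memcg reclaim, a page found under writeback
    that is already PageReclaim has made a full LRU round, so wait for its
    writeback if the caller may perform IO; otherwise tag it so it is waited on
    the next time around.

        static void maybe_throttle_on_writeback(struct page *page,
                                                struct scan_control *sc)
        {
            if (!PageWriteback(page))
                return;

            if (!global_reclaim(sc) && PageReclaim(page) &&
                (sc->gfp_mask & __GFP_IO)) {
                /* second encounter: writeback is the bottleneck, wait for it */
                wait_on_page_writeback(page);
            } else {
                /* first encounter: mark it and let it rotate on completion */
                SetPageReclaim(page);
            }
        }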